algorithms, data science, not-hadoop, twitter, Uncategorized

AI Needs a Better Acronym*

A team at METR Research recently did a study of the effectiveness of AI in software development and got some very surprising results. At the risk of oversimplifying, they report that the developers they studied were significantly slower when using AI, and moreover, tended not to recognize that fact.

METR’s summary of the paper runs to multiple pages–the full paper is longer–so a couple of lines here can’t really do it justice, but the developers on average had expected AI to speed them up by about 24%, and estimated after doing the tasks that using AI had sped them up by about 20%. In fact, on average, it had taken them about 19% longer to solve problems using AI than it did using just their unadorned skills.

My statistics days are behind me, but the work looked to be of high quality, and the people at METR seem to know how to do a study. Surprising though the result may be, the numbers are the numbers.

Yikes! How Can That Be?

Superficially, METR’s results seem wildly at odds with my experience using AI for coding. I used it for a month, and felt that it vastly increased my output, not by a double digit percentage, but by a large factor. Four times? Five times? More? It’s impossible to know for sure, but a lot.

Yet, after reading the study more carefully, not only did it begin to seem like a dog-bites-man story, I came to feel that their results strongly validate my own experience and understanding of AI. If you take the message to be, “AI doesn’t make developers faster”, you are probably missing the point.

(Don’t take my one-paragraph summary too seriously. METR’s summary can be seen here. The full study can also be reached from within that page.)

Continue reading →

algorithms, not-hadoop

Super Fast Estimates of Levenshtein Distance

Levenshtein Distance is an elegant measure of the dissimilarity of two strings. Given a pair of strings, say, “hat” and “cat”, the LD is the number of single-character edits that are required to turn one into the other. The LD of cat and hat is one. The LD of hats and cat is two.

LD is precise and has an intuitive meaning, but in practice it is used mostly for short strings because the run-time of the algorithm to compute LD is quadratic, i.e., proportional to the product of the lengths of the two strings. On a reasonable computer you can compare strings as long as a line in this article in a few microseconds, but comparing a full page of this article to another full page would take a good chunk of a second. Comparing two books with LD is serious computing: many minutes, if you have a computer with sufficient memory, which you almost certainly do not.
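To make the quadratic cost concrete, here is a minimal sketch of the textbook dynamic-programming computation of LD. It is only an illustration of the exact calculation, not the estimation heuristic discussed in this article, and the function name levenshtein is my own choice. This variant keeps just two rows of the table, so memory grows with the shorter string, but the nested loops still visit every pair of characters, which is where the product-of-lengths run-time comes from.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance between a and b (classic two-row dynamic program)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory

    # previous[j] holds the distance from the empty prefix of a to b[:j]
    previous = list(range(len(b) + 1))

    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current

    return previous[-1]


print(levenshtein("hat", "cat"))   # 1
print(levenshtein("hats", "cat"))  # 2
```

The two printed values match the examples given earlier: one edit for cat and hat, two for hats and cat.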

That’s why, as elegant as LD is for describing how different two chunks of text are, people rarely consider using it as a way to compare documents such as Web pages, articles, or books.

The heuristic described here is a way around that problem. It turns out that you can compute a decent estimate of the LD of two large strings many thousands of times faster than you could compute the true LD. The other way to say this is that whatever amount of time you deem tolerable for a comparison computation, using estimates increases the size of strings you can compare within that time limit by a factor of as much as a few hundred. The practical size-range for estimated LD is in the megabyte range (and much larger for binary data).

Equally importantly, the LD estimates are made from pre-computed signatures, not from the original documents. This means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed, which is a tremendous advantage when you need to compare documents across a network.

Unlike an actual LD computation, estimating from signatures provides insight into approximately where and how two sequences differ. This allows finer distinctions to be made about near-duplication: for instance, is one document embedded in the other, or are the two documents different versions with many small differences sprinkled throughout?

Continue reading →

algorithms, not-hadoop, twitter, Uncategorized

Z-Filters: Listening to a Million Voices

Z-Filters is a technique for listening to what millions of people are talking about.

If a hundred million people were to talk about a hundred million different things, making sense of it would be hopeless, but that’s not the way society operates. The number of things that a million people are talking about in a public forum at any given moment is much, much smaller—thousands, not millions—and by limiting the results to subjects that are newly emerging or newly active, a comprehensive view of what’s new can be seen at approximately the data-flow of the Times Square news ticker.

Continue reading →

Hadoop, Hadoop Hive, Ingestion, not-hadoop, Unicode

No Fluff Unicode Summary for Hadoop

Developers might not want to read all the background on Unicode included in this earlier blog entry. Here is a quick distillation of how Unicode and the UTF encodings are relevant to a Hadoop user—just the facts and the warnings.

Continue reading →

not-hadoop

Not Hadoop: All about Unicode

Unicode is a subject that trips up even experienced programmers. It’s one of those places where computer science and engineering bump hard into human diversity.

Continue reading →

hadoopoopadoop

Big Data with Hortonworks Hadoop

Category Archives: not-hadoop
