algorithms, data science, data science career, Hadoop, machine learning, Uncategorized

A Pilgrim’s Progress #2: The Data Science Tool Kit

This is the second post about becoming a data scientist after a career in software engineering. The first part may be found here.

Only a student would think that software developers mostly write computer programs. Coding is a blast–it’s why you get into the field–but the great majority of professional programming time isn’t spent coding. It goes into the processes and tools that allow humans to work together, such as version control and Agile procedures; maintenance; testing and bug fixing; requirements gathering; and documentation. It’s hard to say where writing interesting code falls on that list. Probably not third place. Fourth or fifth, perhaps?

[Figure: linear PCA vs. nonlinear principal manifolds. Image credit: Andrei Zinovyev]

Fred Brooks famously showed that the human time that goes into a line of code grows roughly quadratically with the size of the project (I’m paraphrasing outrageously). Coding gets the glory, but virtually all of the significant advances in software engineering since Brooks wrote in the mid-1970s have actually been in the technology and management techniques for orchestrating the efforts of dozens or even hundreds of people cooperatively writing a mass of code that might have the size and complexity of War and Peace. That is, if War and Peace were first released as a few chapters, and had to continue to make sense as the remaining 361 chapters came out over a period of many months or even years. Actually, War and Peace runs about half a million words, or 50,000 lines, which would make it quite a modest piece of software. In comparison, the latest Fedora Linux release has 206 million lines of code. A typical modern car might have 150 million. MacOS has 85 million. In the 1970s, four million lines was an immense program.

Continue reading
algorithms, not-hadoop

Super Fast Estimates of Levenshtein Distance

Levenshtein Distance is an elegant measure of the dissimilarity of two strings. Given a pair of strings, say, “hat” and “cat”, the LD is the number of single-character edits that are required to turn one into the other.  The LD of cat and hat is one. The LD of hats and cat is two.

LD is precise and has an intuitive meaning, but in practice it is used mostly for short strings because the run-time of the algorithm to compute LD is quadratic, i.e., proportional to the product of the lengths of the two strings. On a reasonable computer you can compare strings as long as a line in this article in a few microseconds, but comparing a full page of this article to another full page would take a good chunk of a second. Comparing two books with LD is serious computing–many minutes if you have a computer with sufficient memory, which you almost certainly do not.
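For orientation, here is a minimal Python sketch (not code from this post) of the textbook Wagner–Fischer dynamic program that computes exact LD; the nested loops over both strings are what make the run-time quadratic.

    def levenshtein(a: str, b: str) -> int:
        # Classic Wagner-Fischer dynamic program.
        # Runtime is O(len(a) * len(b)): fine for short strings,
        # painful for whole documents.
        # prev[j] holds the edit distance between the processed prefix
        # of a and b[:j]; only two rows are kept in memory.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i] + [0] * len(b)
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr[j] = min(prev[j] + 1,         # deletion
                              curr[j - 1] + 1,     # insertion
                              prev[j - 1] + cost)  # substitution
            prev = curr
        return prev[len(b)]

    print(levenshtein("hat", "cat"))   # 1
    print(levenshtein("hats", "cat"))  # 2

The quadratic cost is easy to feel in the cell counts: comparing two 80-character lines takes roughly 6,400 cell updates, while comparing two 500,000-character books would take on the order of 2.5 × 10^11.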

That’s why, as elegant as LD is for describing how different two chunks of text are, people rarely consider using it as a way to compare documents such as Web pages, articles, or books.

The heuristic described here is a way around that problem. It turns out that you can compute a decent estimate of the LD of two large strings many thousands of times faster than you could compute the true LD. The other way to say this is that whatever amount of time you deem tolerable for a comparison computation, using estimates increases the size of strings you can compare within that time limit by a factor of as much as a few hundred. The practical size range for estimated LD is in the megabytes (and much larger for binary data).

Equally importantly, the LD estimates are made from pre-computed signatures, not from the original documents. This means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed, which is a tremendous advantage when you need to compare documents across a network.

The signatures can also provide insight into approximately where and how two sequences differ. This allows finer distinctions to be made about near-duplication: for instance, is one document embedded in the other, or are many small differences sprinkled throughout?

Continue reading

algorithms, not-hadoop, twitter, Uncategorized

Z-Filters: How to Listen to a Hundred Million Voices

Z-Filters is a technique for listening to what millions of people are talking about. Twitter is the data source, but the idea isn’t Twitter-specific. The same problems arise no matter how you eavesdrop on a population.

If a hundred million people were to talk about a hundred million different things, making sense of it would be a hopeless task, but that’s not the way society operates. The number of things people are talking about at any given moment is much, much smaller: thousands, not millions.

Continue reading
