algorithms, not-hadoop

Super Fast Estimates of Levenshtein Distance

This simple heuristic estimates the Levenshtein Distance (LD) of large sequences as much as tens of thousands of times faster than computing the true LD. In exchange for a certain loss of precision, this extends the size range over which LD is practical by a factor of a few hundred X, making it useful for comparing large text documents such as articles, Web pages, and books.


Equally importantly, the estimates are computed from relatively small signatures, which means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed.

The signatures can also be used to estimate approximately where and how the sequences differ. This allows finer distinctions to be made about near duplication, for instance, is one document embedded in the other, or are many small difference sprinkled throughout?

Continue reading

algorithms, not-hadoop, twitter, Uncategorized

Z-Filters: How to Listen to a Hundred Million Voices

Z-Filters is a technique for listening to what millions of people are talking about.

If a hundred million people were to talk about a hundred million different things, making sense of it would be a hopeless task, but that’s not the way society operates. The number of things people are talking about at any given moment is much, much smaller: thousands, not millions.

bubble-view.png Continue reading