
Super Fast Estimates of Levenshtein Distance

This simple heuristic estimates the Levenshtein Distance (LD) of large sequences as much as tens of thousands of times faster than computing the true LD. In exchange for some loss of precision, it extends the size range over which LD is practical by a factor of a few hundred, making it useful for comparing large text documents such as articles, Web pages, and books.


Equally importantly, the estimates are computed from relatively small signatures, which means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed.

The signatures can also be used to estimate approximately where and how the sequences differ. This allows finer distinctions to be made about near duplication; for instance, is one document embedded in the other, or are many small differences sprinkled throughout?

If your application requires a precise LD value, this heuristic isn’t for you, but the estimates are typically within about 0.05 of the true distance, which is more than enough accuracy for such tasks as:

  • Confirming suspected near-duplication.
  • Estimating how much two documents vary.
  • Filtering through large numbers of documents to look for a near-match to some substantial block of text.

If anyone else implements this, I’d love to hear about your results.

Levenshtein Distance

If you’ve read this far, you probably already know what LD is, so I’ll only give a brief reminder here of what it does, and nothing about how it works. You can find all that in Wikipedia if you are interested, but you don’t need to understand how LD itself is computed to see how the heuristic works.

Quickly, LD is a measure of the similarity of two sequences. Sometimes it is called “edit distance” because the LD of two strings is the number of single-character edits that it would take to turn the first string into the second.  For example, the LD of CAT and HAT is one, because changing the C to an H turns CAT into HAT.  CATS and CAT also have an LD of one, because dropping the S does the trick. CATS and HAT have an LD of two.
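
For concreteness, here is a minimal sketch in Java of the standard dynamic-programming computation of LD. This is the textbook algorithm whose quadratic cost is the subject of the next section:

    // Classic dynamic-programming LD. Time and memory are proportional to
    // a.length() * b.length(), which is what makes large inputs impractical.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                            d[i][j - 1] + 1),   // insertion
                                   d[i - 1][j - 1] + cost);     // substitution
            }
        }
        return d[a.length()][b.length()];
    }
    // levenshtein("CAT", "HAT") == 1; levenshtein("CATS", "HAT") == 2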

The Problem

The algorithm for computing LD is fast enough for small strings, but its run-time grows in proportion to the product of the lengths of the two strings. This may not sound too bad, but the table below shows how quadratic growth works out for text documents of increasing size. These results are for single-threaded processing and do not include reading the documents from disk, which would overwhelm processing time for small strings. For scale, 10K characters could be a medium-sized Web page, 50K an eight-page article, and 1M characters a book. This article is about 17,500 characters.

Length   Size Increase   Time        Runtime Increase   Comparisons/Sec
100      1               100 μsec    1                  20,000
10K      100             0.5 sec     10,000             2
50K      500             12.5 sec    25,000             0.125
1M       10,000          16.6 min.   100,000,000        0.001

My machine actually won’t compute LD on pairs of 50K strings because it runs out of memory at somewhere above 30K or so. One MB files are 33X as large, putting them completely beyond the pale.

The Estimating Heuristic

The idea behind the estimating heuristic is simple. We approximate the LD of a pair of strings by compressing them, taking the LD of the compressed signatures, and multiplying the result by the compression rate.

The compression method used is extreme and not reversible: you’ll typically compress the inputs by a factor of 50x to 250x or more for text. Using a middle value of 150X, a 20KB document yields a signature of about 133 characters. LD runs faster on the signatures by a factor that is the square of the compression rate, so, in this case, it is about 150×150=22,500 times faster than computing the true value. The basic trade-off is that smaller signatures are faster but less accurate.
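
In code, the whole idea fits in a few lines. This sketch assumes the signature() compression function described under “How Does It Work?” below, plus the classic levenshtein() shown earlier; C and N are the compression parameters introduced later, and the names are illustrative:

    // Estimate the LD of two documents: compress both, take the LD of the
    // signatures, and scale back up by the approximate compression rate C.
    static double estimateLd(String docA, String docB, int c, int n) {
        String sigA = signature(docA, c, n);   // compression step, described below
        String sigB = signature(docB, c, n);
        return (double) levenshtein(sigA, sigB) * c;
    }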

What Might This Be Good For?

Here’s an example.  Web crawlers gather billions of pages from across the Web so that they can be indexed. The pages often exist in multiple nearly identical versions, so for both economy of processing and so that users don’t get back many links to the same document, the near-duplicates need to be identified and ignored.

This is surprisingly difficult to do. This paper  explains how Google does it, but what is of interest here is that the algorithm has a 25% false positive rate. This is bad because a match means a document has already been indexed, and can be ignored. As it would be prohibitively costly to pull all the apparent matches from across the Web to verify that they are near-duplicates, it is necessary to settle for either failing to index a lot of documents or compromising on how many redundant results are indexed. The technique described here could verify the matches at a cost that is negligible compared to the total effort per page crawled.

Other Uses

The usefulness of the signatures is not strictly limited to estimating LD. The differences in the signatures are located in approximately the same relative positions as the differences in the original documents. Either alone or in combination with the LD, this can tell you a lot about the relationship between two similar documents without the need to examine the documents themselves. For instance, if two signatures differ at the beginning and the end, it is probable that the same block of text is simply embedded in two different pages.

If the differences are sprinkled throughout, but the LD is too small to be consistent with random files, then they are probably different versions of the same document.  A good example of this showed up in tests when comparing a modern translation of Dante’s Inferno to several thousand random books from the Gutenberg book set. Mutilated versions of the original had been added to the test set. No spurious matches were identified, but legitimate matches included some surprises. The algorithm recognized the similarity of:

  • The original text.
  • The mutilated versions.
  • The three volume set including Paradiso and Purgatorio.
  • The versions of the poem with diacritics included (which results in numerous small changes on every line.)
  • A different translation from the Italian done a century earlier.

How Does It Work?

The trick is the compression algorithm, which must have two properties.

  • The signature of a string must be the same whether it is compressed by itself or compressed embedded in a larger document.
  • The compressed signature must be much smaller than the original.

The first of these is actually impossible, but you can relax it a little to say the signature of a string will be the same except for a limited number of characters at either end.

To derive the signatures, we choose two values, C and N.

  • C is the approximate compression rate. The larger this number, the smaller the signatures. The right choice depends on your goals. If C is too large, too much information will be lost. If it is too small, LD will execute slowly and the signatures will be bulky to store. Values anywhere from 25 to 250 are typical for text.
  • N is a neighborhood size. Compression will operate on size-N substrings of the input, beginning with each successive character in the string. The larger the N, the more sensitive the heuristic is to small differences.  An N of 6 to 20 usually gets good results for text.

We also choose any convenient alphabet of output characters for the signatures. The set of printable ASCII characters is a good choice because it allows you to inspect the signatures easily. The larger the cardinality of the output alphabet, the higher the entropy of the strings (which is good) but allowing non-alphanumerics can lead to surprises if they happen to be special characters in some processing context. For illustration purposes, we will use the alphanumeric ASCII chars and assume they are stored in a 62 element array of characters called OUTPUT_CHARS.

Signature generation is simple. I won’t include any optimizations here, just the basic idea; a sketch in code follows the steps.

  1. P, the current position in the input, is initially zero.
  2. Compute H(P), which is the hash of the N-character substring of the input that starts at position P.
  3. If H(P) modulo C is not equal to zero (or some other arbitrarily chosen value between 0 and C-1)
    • Generate no output.
    • If not at the end, increment P and return to step 2.
    • If you are at the end, quit.
  4. Otherwise, output a single pseudo-random character chosen from OUTPUT_CHARS in such a way that the choice is highly random, but a given H will always result in the same character. For concreteness, let’s say we emit the character in OUTPUT_CHARS that is at the index H(P) modulo OUTPUT_CHARS.length().
  5. If you are not at the end, increment P and go to step 2.
  6. Otherwise, quit.
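
Here is a minimal sketch of those steps in Java. The hash used here (Java’s String.hashCode() on each N-character window) is just a placeholder; any reasonably well-mixed hash will do, and a rolling hash would avoid re-hashing overlapping windows:

    // The 62-character output alphabet described above.
    static final char[] OUTPUT_CHARS =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789".toCharArray();

    // Basic signature generation, with no optimizations. C is the approximate
    // compression rate, N the neighborhood size.
    static String signature(String input, int c, int n) {
        StringBuilder sig = new StringBuilder();
        for (int p = 0; p + n <= input.length(); p++) {                  // each position P
            int h = input.substring(p, p + n).hashCode() & 0x7fffffff;   // H(P), non-negative
            if (h % c != 0) continue;                                    // step 3: emit nothing
            sig.append(OUTPUT_CHARS[h % OUTPUT_CHARS.length]);           // step 4: deterministic output char
        }
        return sig.toString();
    }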

That’s all there is to the generation of signatures. The value for N should be big enough that H(P) gives diverse values, but not so big that a tiny difference will bleed out over a large expanse of the text; sizes from 6 to 20 seem to work fine.

The effect of C is not only to shrink the strings by ignoring all but 1/C of the H values, but also to ensure that most minor differences are suppressed.

Some Real Examples

Below are the first 100 characters of signatures generated for 25 Gutenberg books. The first dozen bytes have recognizable similarities across all signatures but are exactly alike in only two. This is because every Gutenberg book starts with a big slab of semi-standard boilerplate disclaimers, copyright information etc. that is never exactly the same, and also contains a small amount of information about the specific book.

Notice also that there are several pairs of lines that are almost identical to each other (the last two, for instance). These are identical for the first 30 or so characters, then diverge slightly towards the end. That’s because they are the same book, with almost identical boilerplate, but one has the diacritical marks and the other doesn’t. This creates a few one-character differences per line, but the signature generation washes most of them out, resulting in fairly similar signatures.


For some applications, you need to know the expected LD of random strings of text of a given length. For instance, if you are trying to spot distantly related texts, a result that is only slightly lower than expected can be significant. The expected-LD function must be calibrated empirically because real text is far from random, and deviates from randomness in an application-dependent way.

Depending upon your goals, you can use a variety of criteria to interpret the LD value. In the Web crawler example, you might want to subtract the difference in length of the two signatures (adjusted by the expected-LD function) from the LD, then compute the ratio of the remaining quantity to the length of the smaller signature. Any ratio below some arbitrarily chosen constant would constitute confirmation of near duplication.
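
A sketch of that criterion on a pair of signatures; the threshold value and the omission of the expected-LD adjustment are illustrative simplifications:

    // Confirm near-duplication from signatures alone. The part of the distance
    // explained by the length difference is removed first; the remainder is
    // compared to the length of the smaller signature. The threshold is arbitrary.
    static boolean isNearDuplicate(String sigA, String sigB, double threshold) {
        int shorter = Math.min(sigA.length(), sigB.length());
        int lengthDiff = Math.abs(sigA.length() - sigB.length());
        double remainder = levenshtein(sigA, sigB) - lengthDiff;
        return remainder / shorter < threshold;   // small ratio => near duplicate
    }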

How Well Does It Work?

These results are from a brain-dead simple implementation along the lines described above.  The first table below shows the signature size and the processing rate for two different file sizes and a range of values of C. I ran them on a 2015 MacBook Pro.  Any of these compression factors could be reasonable depending on the application.

C     File Size   Signature Size   LD Estimates/Sec
25    4,865       181              8,000
50    4,865       98               30,300
100   4,865       59               90,909
200   4,865       28               250,000
25    27,478      940              280
50    27,478      514              1,048
100   27,478      305              2,932
200   27,478      195              11,627

The table below gives some example accuracy estimates with C set to 100. The texts include unrelated documents, identical documents, and originally identical documents subjected to a lot of changes. You can get a good idea of the actual similarity of the files by looking at the true LD values; comparing the estimated LD to the true value tells you how well the heuristic is working.

Text                   File Sizes      Sig Sizes   LD (Sigs)   LD (Actual)   Estimate   Err
Unrelated              27,478/23,932   366/419     366         20,706        24,002     0.1200
Identical              27,620/27,620   305/305     0           0             0          0.0
Mangled                4,865/2,404     34/61       27          2,461         2,153      0.0632
50/173 lines deleted   4,865/2,474     27/61       34          2,391         2,711      0.065
A few changes          4,865/4,801     59/61       3           64            239        0.036

Cross Computing

If you compare a file to a version of itself that has been cut into several big blocks and rearranged, the true edit distance will be misleadingly high (i.e. the files will seem dissimilar) because LD works at the single-character level.

This behavior can be improved upon by cutting each signature into blocks of a fixed size, and cross-comparing the blocks to find the best match for each block in one signature to some block in the other.  If the first signature has M blocks and the second has N, this is a total of N * M block comparisons.

That sounds like a lot, but the size of the blocks shrinks in inverse proportion, resulting in a commensurate decrease in the cost of each comparison. Therefore, the total cost is about the same as  it would be for a straight-up LD computation on the intact signatures.

The caveat is that the accuracy goes down somewhat as the number of blocks increases.

This can be used in several ways, for example:

  • You can get a better idea of the total similarity of two files by summing the lowest LD obtained for each block in one of the signatures (sketched in code below).
  • The results can be focused by disregarding some number of blocks, e.g., consider only the best K blocks, where K is the number of blocks in the smaller signature, minus two.
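
A sketch of the block-wise comparison and of the first use above; the block size and names are illustrative:

    import java.util.ArrayList;
    import java.util.List;

    // Cut each signature into fixed-size blocks and, for each block of the first
    // signature, find the closest block of the second. Summing the best distances
    // gives a score that tolerates large rearranged blocks.
    static int bestMatchSum(String sigA, String sigB, int blockSize) {
        int total = 0;
        for (String a : split(sigA, blockSize)) {
            int best = Integer.MAX_VALUE;
            for (String b : split(sigB, blockSize)) best = Math.min(best, levenshtein(a, b));
            total += best;
        }
        return total;
    }

    // Cut a signature into blocks of at most blockSize characters.
    static List<String> split(String s, int blockSize) {
        List<String> blocks = new ArrayList<>();
        for (int i = 0; i < s.length(); i += blockSize)
            blocks.add(s.substring(i, Math.min(s.length(), i + blockSize)));
        return blocks;
    }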

Scanning at Multiple Compression Rates

Large compression rates result in small signatures and a much faster LD estimate, but the smaller the signature, the more information is lost, and the less accurate the result.

You can have your cake and eat it too by computing more than one signature for a given document.  Small signatures can be compared first to separate the sheep from the goats quickly, and the larger signatures compared only for pairs for which the first test indicates a reasonably high degree of similarity.
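
A sketch of the two-pass idea; the C values, thresholds, and the Doc holder class are illustrative:

    // Precompute a coarse (high-C) and a fine (low-C) signature for each document.
    // The cheap coarse comparison weeds out clearly unrelated pairs; the more
    // accurate fine comparison runs only on the survivors.
    static class Doc {
        final String coarseSig, fineSig;
        Doc(String text) {
            coarseSig = signature(text, 300, 10);   // tiny signature, rough answer
            fineSig   = signature(text, 50, 10);    // larger signature, better accuracy
        }
    }

    static boolean similar(Doc a, Doc b) {
        double coarse = (double) levenshtein(a.coarseSig, b.coarseSig)
                        / Math.min(a.coarseSig.length(), b.coarseSig.length());
        if (coarse > 0.6) return false;             // clearly unrelated; stop here
        double fine = (double) levenshtein(a.fineSig, b.fineSig)
                      / Math.min(a.fineSig.length(), b.fineSig.length());
        return fine < 0.3;                          // confirm with the larger signature
    }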


Entropy of the Signatures

The per-character entropy of the 32K Gutenberg documents averages 4.45637 bits, but with some variance. The majority of the documents score fairly close to 4.5, but a few have entropy greater than 5.1 or less than 4.0.

The average entropy of the signatures at any level of compression tested is only slightly higher than the entropy of the raw text. This tells you that the compression is tracking the raw text pretty closely in terms of predictability of characters. That’s a good thing.

The average entropy goes down slightly as the compression increases, ranging from 4.93989 for 50X compression to 4.87675 for 350X. This means that the signatures are getting slightly less meaningful as the compression rate gets high, but not by a great amount.

Interestingly, however, within a given compression rate, the entropy of a text rarely diverges from the average by more than a few hundredths of a bit. In contrast, the entropy of the raw documents varies by as much as a bit per character. This indicates that the compression is pretty good at washing out the non-textual anomalies that cause files to deviate from the entropy typical of text. This makes sense because the anomalies that reduce entropy are typically repetitions, only a few of which will happen to result in output characters.
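
For reference, per-character entropy here means ordinary Shannon entropy over the character frequencies of a file or signature; a straightforward sketch:

    import java.util.HashMap;
    import java.util.Map;

    // Shannon entropy in bits per character: -sum(p * log2(p)) over the
    // frequencies of the characters actually observed in the string.
    static double entropyPerChar(String s) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char ch : s.toCharArray()) counts.merge(ch, 1, Integer::sum);
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / s.length();
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }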

The following table gives the average per-character entropy for the raw files and for signatures computed for 32,000 documents using compression factors of 50, 100, 150, 200, 250, 300, and 350. The raw files are mostly between 50K and 400K characters long, but can be as small as 10K and as large as 1,100K.

Raw       c=50      c=100     c=150     c=200     c=250     c=300     c=350
4.45637   4.93989   4.92783   4.92138   4.91678   4.89277   4.89741   4.87674


What About Ukkonen’s Algorithm?

It is worth noting that there are other ways to make computing LD faster, notably the enhancement by Ukkonen, which is linear time with respect to the input sizes for similar document pairs and increases in runtime non-linearly with the LD of the pair. In other words, if the documents are almost the same, the comparison time is close to linear but gets slower the more they differ. (The details are a little more complex, but that’s the idea.)

This is useful, but it’s important to be clear about what it means. The estimating heuristic has a one-time signature preparation phase that, like Ukkonen, must read all of both documents, so if you’re only going to compare once and the documents are similar, Ukkonen and the estimating heuristic are of similar time complexity (assuming that the compression rate is high enough that the time to compare the two signatures is negligible). And if the pairs are quite different, the run-time of Ukkonen is no longer linear in the lengths of the documents.

If you can amortize the cost of generating the signatures over many comparisons, the time advantage of the estimating heuristic grows in proportion for similar documents.

Therefore, the best use of the Ukkonen algorithm might be to apply it to the signatures rather than to the original documents. The performance of Ukkonen on the signatures would be no worse than simple LD for the case of very different documents, and radically better for the case where the documents are very similar (which is likely to be the most common case).

A Hidden Advantage

One interesting application of the LD estimate heuristic relies on the fact that very little useful information about the contents of documents can be divined from the signatures. The signatures are not quite cryptographically secure, because some information can be wrung out of them, but the extreme compression means that there can’t be much information left. For instance, a signature compressed with C=150 is only about 0.0067 times as long as the input. Since English text, for example, has a redundancy of only about 0.5, there aren’t enough bits left in the output to contain much of the original content, and what little is left is thoroughly scrambled.

This lets you do some interesting things. For instance, you can have a service that allows an outside party to check whether partial matches to a document or other files are present, without the service being able to read either the contents or the sample. The client submits the signature of what is sought and the server checks a set of similarly compressed signatures for the content. The client does not have to show her cards, and the service only has access to the signatures, not the originals.
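
A sketch of such a service’s matching step; the names, the stored-signature map, and the threshold are illustrative. Both sides handle only signatures, never the underlying documents:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Server side: given a client's query signature, report which stored document
    // signatures are close to it. Only document ids are returned, never content.
    static List<String> findNearMatches(String querySig,
                                        Map<String, String> storedSigs,   // doc id -> signature
                                        double threshold) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : storedSigs.entrySet()) {
            String sig = e.getValue();
            double ratio = (double) levenshtein(querySig, sig)
                           / Math.min(querySig.length(), sig.length());
            if (ratio < threshold) hits.add(e.getKey());
        }
        return hits;
    }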

One use of such a service might be forensic searches of computers. While a judge might be reluctant to allow a warrant that lets the police poke around in a target’s computers at will, he or she might be willing to let them look for specific documents or images among a set of signatures generated from the contents of the subject computers.










