This simple heuristic estimates the Levenshtein Distance (LD) of large sequences as much as tens of thousands of times faster than computing the true LD. In exchange for a certain loss of precision, this extends the size range over which LD is practical by a factor of a few hundred X, making it useful for comparing large text documents such as articles, Web pages, and books.
Equally importantly, the estimates are computed from relatively small signatures, which means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed.
The signatures can also be used to estimate approximately where and how the sequences differ. This allows finer distinctions to be made about near duplication, for instance, is one document embedded in the other, or are many small difference sprinkled throughout?
If your application requires a precise LD value, this heuristic isn’t for you, but the estimates are typically within about 0.05 of the true distance, which is more than enough accuracy for such tasks as:
- Confirming suspected near-duplication.
- Estimating how much two document vary.
- Filtering through large numbers of documents to look for a near-match to some substantial block of text.
If anyone else implements this, I’d love to hear about your results.
If you’ve read this far, you probably already know what LD is, so I’ll only give a brief reminder here of what it does, and nothing about how it works. You can find all that in Wikipedia if you are interested, but you don’t need to understand how LD itself is computed to see how the heuristic works.
Quickly, LD is a measure of the similarity of two sequences. Sometimes it is called “edit distance” because the LD of two strings is the number of single-character edits that it would take to turn the first string into the second. For example, the LD of CAT and HAT is one, because changing the C to an H turns CAT into HAT. CATS and CAT also have an LD of one, because dropping the S does the trick. CATS and HAT have an LD of two.
The algorithm for computing LD is fast enough for small strings, but its run-time grows in proportion to the product of the lengths of the two strings. This may not sound too bad, but this table shows how quadratic growth works out for text documents of increasing size. These results are for single threaded processing and do not include reading the documents from disk, which would overwhelm processing time for small strings. For scale, 10K characters could be a medium sized Web page, 50K an eight-page article and 1M characters, a book. This article is about 17,500 characters.
|Length||Size Increase||Time||Runtime Increase||Per Sec|
My machine actually won’t compute LD on pairs of 50K strings because it runs out of memory at somewhere above 30K or so. One MB files are 33X as large, putting them completely beyond the pale.
The Estimating Heuristic
The idea behind the estimating heuristic is simple. We approximate the LD of a pair of strings by compressing them, taking the LD of the compressed signatures, and multiplying the result by the compression rate.
The compression method used is extreme and not reversible—you’ll typically compress the inputs by a factor of between 50x to 250x or more for text. Using a middle value of 150X, a 20KB document yields a signature of about 133 characters. LD runs faster on the signatures by a factor that is the square of the compression rate, so, in this case, it is about 150×150=22,500 times faster than computing the true value. The basic trade-off is that smaller signatures are faster but less accurate.
What Might This Be Good For?
Here’s an example. Web crawlers gather billions of pages from across the Web so that they can be indexed. The pages often exist in multiple nearly identical versions, so for both economy of processing and so that users don’t get back many links to the same document, the near-duplicates need to be identified and ignored.
This is surprisingly difficult to do. This paper explains how Google does it, but what is of interest here is that the algorithm has a 25% false positive rate. This is bad because a match means a document has already been indexed, and can be ignored. As it would be prohibitively costly to pull all the apparent matches from across the Web to verify that they are near-duplicates, it is necessary to settle for either failing to index a lot of documents or compromising on how many redundant results are indexed. The technique described here could verify the matches at a cost that is negligible compared to the total effort per page crawled.
The usefulness of the signatures is not strictly limited to estimating LD. The differences in the signatures are located in approximately the same relative positions as the differences in the original documents. Either alone or in combination with the LD, this can tell you a lot about the relationship between two similar documents without the need to examine the documents themselves. For instance, If two signatures differ at the beginning and the end, it is probable that the same block of text is simply embedded in two different pages.
If the differences are sprinkled throughout, but the LD is too small to be consistent with random files, then they are probably different versions of the same document. A good example of this showed up in tests when comparing a modern translation of Dante’s Inferno to several thousand random books from the Gutenberg book set. Mutilated versions of the original had been added to the test set. No spurious matches were identified, but legitimate matches included some surprises. The algorithm recognized the similarity of:
- The original text.
- The mutilated versions.
- The three volume set including Paradiso and Purgatorio.
- The versions of the poem with diacritics included (which results in numerous small changes on every line.)
- A different translation from the Italian done a century earlier.
How Does It Work?
The trick is the compression algorithm, which requires two properties.
- The signature of a string must be the same whether it is compressed by itself or compressed embedded in a larger document.
- The compressed signature must be much smaller than the original.
The first of these is actually impossible, but you can relax it a little to say the signature of a string will be the same except for a limited number of characters at either end.
To derive the signatures, we choose a two values, C and N.
- C is the approximate compression rate. The larger this number, the smaller the signatures. The right choice depends on your goals. If C is too large, too much information will be lost. If it is too small, LD will execute slowly and the signatures will be bulky to store. Values anywhere from 25 to 250 are typical for text.
- N is a neighborhood size. Compression will operate on size-N substrings of the input, beginning with each successive character in the string. The larger the N, the more sensitive the heuristic is to small differences. An N of 6 to 20 usually gets good results for text.
We also choose any convenient alphabet of output characters for the signatures. The set of printable ASCII characters is a good choice because it allows you to inspect the signatures easily. The larger the cardinality of the output alphabet, the higher the entropy of the strings (which is good) but allowing non-alphanumerics can lead to surprises if they happen to be special characters in some processing context. For illustration purposes, we will use the alphanumeric ASCII chars and assume they are stored in a 62 element array of characters called OUTPUT_CHARS.
Signature generation is simple. I won’t include any optimizations here, just the basic idea.
- P, the current position in the input, is initially zero.
- Compute H(P), which is the hash of the N-character substring of the input that starts at position P.
- If H(P) modulo C is not equal to zero (or some other arbitrarily chosen value between 0 and C-1)
- Generate no output.
- If not at the end, increment P and return to step 1.
- If you are at the end, quit.
- Otherwise, output a single pseudo-random character chosen from OUTPUT_CHARS in such a way that the choice is highly random, but a given H will always result in the same character. For concreteness, let’s say we emit the character in OUTPUT_CHARS that is at the index H(P) modulo OUTPUT_CHARS.length().
- If you are not at the end, increment P and go to step 1.
- Otherwise, quit.
That’s all there is to the generation of signatures. The value for N should be big enough that H(p) gives diverse values, but not so big that a tiny difference will bleed out over a large expanse of the text; sizes from six to 20 seem to work fine.
The effect of C is not only to shrink the strings by ignoring all but 1/c of the H values but to ensure that most minor differences are suppressed.
Some Real Examples
Below are the first 100 characters of signatures generated for 25 Gutenberg books. The first dozen bytes have recognizable similarities across all signatures but are exactly alike in only two. This is because every Gutenberg book starts with a big slab of semi-standard boilerplate disclaimers, copyright information etc. that is never exactly the same, and also contains a small amount of information about the specific book.
Notice also that there are several pairs of lines that are almost identical to each other—the last two, for instance. These are identical for the first 30 or so characters, then get slightly more different towards the end. That’s because they are the same book, with almost identical boilerplate, but one has the diacritical marks and the other doesn’t. This creates a few once character differences per line, but the signature generation washes most of them out, resulting in fairly similar signatures.
RQNXXjCAovG2QEv29jR35fsCX5uZuaAfQIAouECef7jauXC2VGNTQ2NeKAZfAuA5o72GC9VR9xsXZ9Ijb5vXI7fmZy7N7Tm7e7I5 3eRAx9RQNVeoC9Ze9o37RKE5e9eGqVjGC9oERe9b9KZmuuev3NqaabCaV9oIuG7G2yX22C9K2fbNXajGIVx9GesoNmERTE5CbXuX 9RQNeQm9jTK3E7C3Zm93Zbb7TVTjjyQxfbjeqRbamEmvCqCsjGmso9oC9sTGAqZ7RNTsXfN7oQRV7NXX7GXEj2yVqbmj7EGQomvK x9RQNVsfEjbX7aKyoVaNI7yojuQu5TuK3fNKZXaQXy9fEbxajeVC9xuC3uNxy7RT3xIubjXvf7baXVeX2QAqvQbKZQ2ZNRs5yjGy x9RQNV9sEX7x9jEIfXKIjbKC2CassqasouV5RT3Kv3aVsEmGoo9uxeGCvI279fTA2INmGmseuAe9uRqITmQjbyZ3uaQV35AuKou9 Ax9RQNEAR5x9GI2Aa95ZRex9QIQEu9XQQ7eG7TuqNqjj5KK5A93VxbGoayaayCEqEXGAaICfx9bVGIaaCK9aCqyCfNmaCCqQuNVu Ax9RQNuefX9jxeXq2jAe53TIomKZsjZuEZxmKxsau2Z7yCmouVAufjoeK2GeN7Ix7Tb9oxye3XTToqK2ZC3aj3K2Vf35XjQ9y9VX x9RQNXxEu9KGj99TqCCVA2CGExNRTe9NZAIfZTuajG3vjIa2A25qAb7qmVq2abeNANVGuIaEKEC7IGG5feEoeTy3ZAAQKVqNxvCR RQNaTNs9j3aoGa9fG2bNVZyE3CTaZfaRo7TCajyIvxeGZ35eqNIsyT3mCG5oeQ7ZfXxCCT9v7AAo23veEombVaTjjbG3bXousN5Q x9RQNV59KGjT9AfCvssfyKEs5fEfQEIabGvv3o9V9syeQa7o7veKaTxa3Q3V9smKVoquCNvff7aaRNCEGbXjxTIe7EEZNEsb9Zu3 RQNVm39jVm5mu9eRsbQ3Vqf5K29aKCyayxsjqGZKfbX9xy5x9aauCR9TuKE2a3RuCNNmQmmTjmGVuT3GTeEuCm3eVvvmyRKRafyx x9RQNZa9KGjZjKK7AQ53CGTGEyERyATVbRRyGNNaETGZVbuNVXC73bTAmvVoxV3qaaKaE7yomTN9ZVNVCsfGjNyIyAV2ICxVsGZV RQ5ym939KGjTVsfvNCsGxX2vKNVN9yqZE95EfxvGq3E5AVvjoITI59CVve7ZZv22sNXsI7A5fabEKase9ysQEN579NAXAbuINejZ yKAx9RQZ9ZIxQGI5q2ZsjmIRXEfsZINx95soxXAuXyCQCoQN7oTojNVyK3fo3qbusKmqxmyyo7VA2TusEE5f9Z9sTEXxNAfmjyqv 9RQRqQZ9KGjqyqbbqIfRAKumeseoEoqTEfCyE3XXXIR3G5TA7AImmyqVV597v9uZoZvsK5ZCQ9Zj7q975CAausjGvKq99bsfGTCZ 9RQ29juK2X7a7bxvmRyGVZQ75sTaZqCIbxfRX5ReGbeRvjmRsCKTbReuGR5Rq7sZV3VAxaxZXxQ5ybqjqm75q3VaRRmXumIXCsju 9RQRX9jZZIXZKKevNfvmIVRoVuxqxVaCQaoEoTNvv2X9AvqqofQe739N7sV5X5XRIo2TRffqNGIyK9jXa9RaE7jRq5T3jQbfxvC2 RQTeEI9j7AsCQVVC5fZAyAyAxxoG9Xa9mGIR3b2ZAyQRNqof9IvxbVZ339oGVAyaCXbKmo5yf3Z7RZas55VxTsI22efjbufasNGQ Ax9RQ5bvqj9jf5Io5NG377o73RKm5A7yGxb7Ne5Qv5vv9muqQTx2uR7E3bjbGyyQ7RvoQa2omyj2joKGo9uxjeAbGufRCGuyQuG2 9RQ3aQKK9jyQjIC73m5RX75AoauXeKx2mQIT5aqxxxuVsTfuAGq72sACG3asGQRNN5GsN2NEjbu3yNvKsoXuj7mqVKXKqIfvKbXq 9RQ3aK9jyQjIC73m5RX75AoauXeKx2mQIT5aqxxxuVsTfuAGq72sACG3asGQRNN5GsN2NEjbu3yNvKsoT3v7mZEaNQX2a7NZxs7j j72Ax9RQEj3jKu9yZ9sEVf2qZEx7E9yCfxj5vfs5ZRIIfxI2fXNAVqXmvKuRse37VQvbyVfqbQe332jqeINfNNmNXGCqNeCevbym Ax9RQE3jKu9myZ9EV2qKEsEC7f5X5Txv5KZsfmCT23fNQGmGKuQe7VQvyV7fqb32jqeIIfNNmTGCNNe3vbqQ3bqZ5RfC3byKyKjR 9RQba9jybjjsVjQqqIVba5yCjEANaIATIQma5N92AmAK5XjZ953XfvsQN7GEZ5uqay55EZ2R3NCA2NsXERsEaoVZG75xKq5s5Ts3 9RQba9jybjjsVjQqqIVba5yCjEANaIATIQm592AmAK5XjZ953XfvsQN7GEZ5uqay55EZ2R3NCA2NsXERsEaoVZG75xKq5s5Ts3je
For some applications, you need to know the expected LD of random strings of text of a given length. For instance, if you are trying spot distantly related texts, a result that is only sligthly lower than expected can be significant. The expected-LD function must be calibrated empirically because real text is far from random, and deviates from randomness in an application-dependent way.
Depending upon your goals, you can use a variety of criteria to interpret the LD value. In the Web crawler example, you might want to subtract the difference in length of the two signatures (adjusted by the expected-LD function) from the LD, then compute the ratio of the remaining quantity and the length of the smaller string. Any ratio above some arbitrarily chosen constant would constitute confirmation of near duplication.
How well does it work?
These results are from a brain-dead simple implementation along the lines described above. The first table below shows the signature size and the processing rate for two different file sizes and a range of values of C. I ran them on a 2015 MacBook Pro. Any of these compression factors could be reasonable depending on the application.
|C||File Size||Signature Size||LD estimates/sec|
The table below gives some example accuracy estimates with C set to 100. Texts include unrelated documents, identical documents, and originally identical documents subjected to a lot of changes. You can get a good idea of the actual similarity of the file by looking at the original LD values. Comparing the estimate of LD to the true value tells you how well it is working.
|Text||File Sizes||Sig Sizes||LD sig||LD actual||Est||Err|
|50/173 lines deleted||4,865/2,474||27/61||34||2,391||2,711||0.065|
|A few changes||4,865/4,801||59/61||3||64||239||0.036|
If you compare a file to a version of itself that has been cut into several big blocks and rearranged, the true edit distance will be misleadingly high (i.e. the files will seem dissimilar) because LD works at the single-character level.
This behavior can be improved upon by cutting each signature into blocks of a fixed size, and cross-comparing the blocks to find the best match for each block in one signature to some block in the other. If the first signature has M blocks and the second has N, this is a total of N * M block comparisons.
That sounds like a lot, but the size of the blocks shrinks in inverse proportion, resulting in a commensurate decrease in the cost of each comparison. Therefore, the total cost is about the same as it would be for a straight-up LD computation on the intact signatures.
The caveat is that the accuracy goes down somewhat as the number of blocks increases.
This can be used in several ways, for example:
- You can get a better idea of the total similarity of two files by summing the lowest LD obtained for each block in one of the signatures.
- The results can be focussed by allowing disregarding of some number of blocks, e.g., consider only the best K blocks, where K is the number of blocks in the smaller signature, minus two.
Scanning at Multiple Compression Rates
Large compression rates result in small signatures and a much faster LD estimate, but the smaller the signature, the more information is lost, and the less accurate is the result.
You can have your cake and eat it too by computing more than one signature for a given document. Small signatures can be compared first to separate the sheep from the goats quickly, and the larger signatures compared only for pairs for which the first test indicates a reasonably high degree of similarity.
The per-character entropy of the 32K Gutenberg documents averages 4.45637 bits, but with some variance. The majority of the documents score fairly close to 4.5, but a few have entropy greater than 5.1 or less than 4.0.
The average entropy of the signatures at any level of compression tested is only slightly higher than the entropy of the raw text. This tells you that the compression is tracking the raw text pretty closely in terms of predictability of characters. That’s a good thing.
The average entropy goes down slightly as the compression increases, ranging from 4.93989 for 50X compression to 4.87675 for 350X. This means that the signatures are getting slightly less meaningful as the compression rate gets high, but not by a great amount.
Interestingly, however, within a given compression rate, the entropy of a text rarely diverges from the average by more than a few 100’th of a bit. In contrast, the entropy of the raw documents varies by as much as a bit per character. This indicates that the compression is pretty good at washing out the non-textual anomalies that cause files to deviate from the entropy typical of text. This makes sense because the anomalies that reduce entropy are typically repetitions, only a few of which will happen to result in output characters.
The following table gives the average per-character entropy for the raw files and for signatures computed for 32,000 documents using compression factors of 50, 100, 150, 200, 250, 300, and 350. The raw files are mostly between 50K and 400K characters long, but can be as small as 10K and as large as 1100K
It is worth noting that there are other ways to make computing LD faster, notably the enhancement by Ukkonen, which is linear time with respect to the input sizes for similar document pairs and increases in runtime non-linearly with the LD of the pair. In other words, if the documents are almost the same, the comparison time is close to linear but gets slower the more they differ. (The details are a little more complex, but that’s the idea.)
This is useful, but it’s important to be clear about what it means. The estimate heuristic has a one-time signature preparation phase that, like Ukkonen, must read all of both documents, so if you’re only going to compare once and the documents are similar, Ukkonen and the estimate heuristic are of similar time complexity (assuming that the compression rate is high enough that the time to compare the two signatures is negligible.) And if the pairs are quite different, the run-time of Ukkonen is no longer linear in the lengths of the documents.
If you can amortize the cost of generating the signatures over many comparisons, the time advantage of the estimating heuristic grows in proportion for similar documents.
Therefore, the best use of the Ukkonen algorithm might be applying it to the signatures rather than on the original documents. The performance of Ukkonen on the signatures would be no worse than simple LD for the case of very different documents, and radically better for the case where the documents are very similar (which is likely to be the most common case.)
A Hidden Advantage
One interesting application of the LD estimate heuristic relies on the fact that very little useful information about the contents of documents can be divined from the signatures. The signatures are not quite cryptographically secure, because some information can be wrung out of them, but the extreme compression means that there can’t be much information left in the signatures. For instance, a signature compressed with c=150 is only about 0.0067 as long as the input. As English, for example, has only about a 0.5 redundancy of information, there aren’t enough bits left in the output to contain much of the original content, and what little is left is quite scrambled.
This lets you do some interesting things. For instance, you can have a service that allows an outside party check whether partial matches to a document or other files are present without the service being able to read either the contents or the sample. The client submits the signature of what is sought and the server checks a set of similarly compressed signatures for the content. The client does not have to show her cards and the service only has access to the signatures, not the originals.
One use of such a service might be forensic searches of computers. While a judge might be reluctant to allow a warrant that allows the police to poke around in a target’s computers at will, it might be willing to allow them to look for specific documents or images among a set of signatures generated for the contents of the subject computers.