Levenshtein Distance is an elegant measure of the dissimilarity of two strings. Given a pair of strings, say, “hat” and “cat”, the LD is the number of single-character edits that are required to turn one into the other. The LD of cat and hat is one. The LD of hats and cat is two.
LD is precise and has an intuitive meaning but in practice it is used mostly for short strings because the run-time of the algorithm to compute LD is quadratic, i.e, proportional to the product of the lengths of the two strings. On a reasonable computer you can compare strings as long as a line in this article in a few microseconds, but comparing a full page of this article to another full page would take a good chunk of a second. Comparing two books with LD is serious computing–many of minutes if you have a computer with sufficient memory, which you almost certainly do not.
That’s why, as elegant as LD is for describing how different two chunks of text are, people rarely consider using it as a way to compare documents such as Web pages, articles, or books.
This heuristic described here is a way around that problem. It turns out that you can compute a decent estimate of the LD of two large strings many thousands of times faster than you could compute the true LD. The other way to say this is that whatever amount of time you deem tolerable for a comparison computation, using estimates increases the size of strings you can compare within that time limit by a factor of as much as a few hundred. The practical size-range for estimated LD is in the megabyte range (and much larger for binary data.)
Equally importantly, the LD estimates are made from pre-computed signatures, not from the the original documents. This means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed, which is a tremendous advantage when you need to compare documents across a network.
The signatures can also provide insight into approximately where and how two sequences differ. This allows finer distinctions to be made about near-duplication, for instance, is one document embedded in the other, or are two documents different versions with many small difference sprinkled throughout?
If your application requires a precise LD value, this heuristic isn’t for you, but the estimates are typically within about 0.05 of the true distance, which is more than enough accuracy for such tasks as:
- Confirming suspected near-duplication.
- Estimating how much two document vary.
- Filtering through large numbers of documents to look for a near-match to some substantial block of text.
- Guessing whether two documents are related in more complex ways. (More about this below.)
The Estimating Heuristic
As you’ll see below, you don’t actually have to know how LD is computed to understand how the estimating heuristic works. The idea is simple. We approximate the LD of a pair of strings by first compressing them, then taking the real LD of the compressed signatures, and finally multiplying the LD of the compressed strings by the compression rate.
Ordinary compression algorithms won’t work for this purpose. For one thing, they don’t compress typical text nearly enough. Almost all text has a lot of air in it, but the information content is as real as concrete. You can squeeze the air out, but the actual information compresses but so far. The amount of information in typical English text is about 1.3 bits per 8-bit character, which makes it about 0.1625 information and 0.8375 air. In other words, 6:1 is as much as you can compress ordinary English text if you want to be able to inflate it again.
An interesting corollary to this is that some common types of data can’t be compressed at all, and in fact actually expand when normally useful compression algorithms are applied. Previously compressed data usually has this property, for instance, as does any kind of pseudo-random data. This is part of a fascinating branch of math/computer science called “information theory” if you want to get deeper into it.
A second problem is that for estimating LD, we need the compressed version to be sort of like a thumbnail version of an image. We want to shrink it drastically while preserving some kind of recognizable similarity.
Ordinary lossless compression algorithms tend to eliminate the thumbnail property. Think of it this way: while the output of lossless compression algorithms obviously isn’t truly random (because you can reconstitute the original non-random data from it) it looks random in a mathematical sense. It has what computer scientists call “high entropy.” In a mathematical sense, texts compressed with good lossless compression algorithms tend to all look alike because if a text didn’t look random, you could usually compress it a bit more.
The secret sauce with estimating LD is that we don’t care about decompressing. Therefore, we can use a compression algorithm that is lossy in the extreme. For estimating LD you’ll typically compress the inputs by a factor of between 50x to 500x for text and by factors in the thousands for binary data.
Also, unlike most other compression algorithms, ours looks only at each small neighborhood of the string to be compressed. Usually these are overlapping blocks of text about a dozen characters wide.
Using a middle value of 150X for compression, a 20KB document yields a signature of only about 133 characters, which is in the range where computing the true LD takes only a small fraction of a millisecond. Now we have the quadratic nature of the LD computation working for us instead of against us.- You shrink by a factor hundreds, but it increases the comparison speed by many thousands.
What Might This Be Good For?
Here’s an example. Web crawlers gather billions of pages from across the Web so that they can be indexed. The pages often exist in multiple nearly identical versions, so for both economy of processing and so that users don’t get back a ridiculous number of links to the same document, the near-duplicates need to be identified and ignored.
This is surprisingly difficult to do. This paper explains how Google does it, but what is of interest here is that the algorithm has a 25% false positive rate. This is bad because a match means a document has already been indexed and could be ignored. It would be prohibitively costly to pull all the apparent matches from across the Web to verify that they are in fact near-duplicates, so it is necessary to settle for either failing to index a lot of documents or compromising on how many redundant results are indexed.
The technique described here could verify the matches at a cost that is small compared to the total effort per page crawled. Further, being able to eliminate the false positives almost infallibly lets you optimize the matching computation for false negatives, reducing errors of both kinds.
Doubtless Google has some clever way of doing something like this, but that’s a kind of problem this heuristic applies to.
The usefulness of the signatures is not limited to estimating LD. The differences in the signatures are located in approximately the same relative positions as the differences in the original documents. Either alone or in combination with the estimated LD, this can tell you a lot about the relationship between two similar documents without the need to examine the documents themselves. For instance, If two signatures differ only at the beginning and the end, you can infer that the same block of text is simply embedded in two different pages.
If the differences are sprinkled throughout, but the LD is too small to be consistent with a pair of random files, then they are probably different versions of the same document. A good example of this showed up in tests when comparing a modern translation of Dante’s Inferno to several thousand random books from the Gutenberg book set. The unchanged original was used as the subject and mutilated versions of the same file had been added to the test set. The test consisted of comparing the subject’s signature to the pre-computed signatures of all the samples.
No spurious matches were identified, but legitimate matches included some surprises. The algorithm recognized the similarity of the subject to:
- The original text.
- The mutilated versions.
- The Inferno to the three-volume set including Paradiso and Purgatorio.
- The non-ASCII versions of the same books with diacritics included (which results in numerous small changes on every line.)
- A different English translation from the Italian done a century earlier.
The last is astonishing–the LD’s of the two translations deviate sufficiently from the expected value for randomly chosen text that you can recognize kinship. An obvious guess about how to account for this is that matching proper nouns probably show up again and again in the documents in approximately the same places because poetry is highly structured. This was a crude comparison based only on the raw distance. Much more information is undoubtedly present in how the signatures line up at a low level.
How Does It Work?
The trick is the compression algorithm, which requires two properties.
- The signature of a string must be the same whether it is compressed by itself or compressed embedded in a larger document.
- The compressed signature must be much smaller than the original.
The first of these is actually impossible, but you can relax it a little to say the signature of a string will be the same whether computed by itself or as a substring of another string except for a limited number of characters at either end.
To derive the signatures, we choose a two values, C and N.
- C is the approximate compression rate. The larger this number, the smaller the signatures. The right choice depends on your goals. If C is too large, too much information will be lost. If it is too small, LD will execute slowly and the signatures will be bulky to store.
- N is a neighborhood size. Compression will operate on size-N substrings of the input, beginning with each successive character in the string. The larger the N, the more sensitive the heuristic is to small differences. An N of 6 to 20 usually gets good results for text.
We also choose any convenient alphabet of output characters for the signatures. The set of printable ASCII characters is a good choice because it allows you to inspect the signatures easily. The larger the cardinality of the output alphabet, the higher the entropy of the strings (which is good) but allowing non-alphanumerics can lead to surprises if they happen to be special characters in some processing context. For illustration purposes, we will use the alphanumeric ASCII chars and assume they are stored in a 62 element array of characters called OUTPUT_CHARS.
Signature generation is simple. I won’t include any optimizations here, just the basic idea.
- P, the current position in the input, is initially zero.
- Compute H(P), which is the hash of the N-character substring of the input that starts at position P.
- If H(P) modulo C is not equal to zero (or some other arbitrarily chosen value between 0 and C-1)
- Generate no output.
- If not at the end, increment P and return to step 1.
- If you are at the end, quit.
- Otherwise, output a single pseudo-random character chosen from OUTPUT_CHARS in such a way that the choice is highly random, but a given H will always result in the same character. For concreteness, let’s say we emit the character in OUTPUT_CHARS that is at the index H(P) modulo OUTPUT_CHARS.length().
- If you are not at the end, increment P and go to step 1.
- Otherwise, quit.
That’s all there is to the generation of signatures. The value for N should be big enough that H(p) gives diverse values, but not so big that a tiny difference will bleed out over a large expanse of the text; sizes from six to 20 seem to work fine.
The effect of C is not only to shrink the strings by ignoring all but 1/c of the H values but to ensure that most minor differences are suppressed.
Some Real Examples
Below are the first 100 characters of signatures generated for each of 25 Gutenberg books. The first dozen bytes have recognizable similarities across all signatures but are exactly alike in only two. This is because every Gutenberg book starts with a big slab of semi-standard boilerplate disclaimers, copyright information etc. that is never exactly the same, and also contains a small amount of information about the specific book.
Notice also that there are several pairs of lines that are almost identical to each other—the last two, for instance. These are identical for the first 30 or so characters, then get slightly more different towards the end. That’s because they are the same book, with almost identical boilerplate, but one has the diacritical marks and the other doesn’t. This creates a few one-character differences per line, but the signature generation washes most of them out, resulting in fairly similar signatures.
RQNXXjCAovG2QEv29jR35fsCX5uZuaAfQIAouECef7jauXC2VGNTQ2NeKAZfAuA5o72GC9VR9xsXZ9Ijb5vXI7fmZy7N7Tm7e7I5 3eRAx9RQNVeoC9Ze9o37RKE5e9eGqVjGC9oERe9b9KZmuuev3NqaabCaV9oIuG7G2yX22C9K2fbNXajGIVx9GesoNmERTE5CbXuX 9RQNeQm9jTK3E7C3Zm93Zbb7TVTjjyQxfbjeqRbamEmvCqCsjGmso9oC9sTGAqZ7RNTsXfN7oQRV7NXX7GXEj2yVqbmj7EGQomvK x9RQNVsfEjbX7aKyoVaNI7yojuQu5TuK3fNKZXaQXy9fEbxajeVC9xuC3uNxy7RT3xIubjXvf7baXVeX2QAqvQbKZQ2ZNRs5yjGy x9RQNV9sEX7x9jEIfXKIjbKC2CassqasouV5RT3Kv3aVsEmGoo9uxeGCvI279fTA2INmGmseuAe9uRqITmQjbyZ3uaQV35AuKou9 Ax9RQNEAR5x9GI2Aa95ZRex9QIQEu9XQQ7eG7TuqNqjj5KK5A93VxbGoayaayCEqEXGAaICfx9bVGIaaCK9aCqyCfNmaCCqQuNVu Ax9RQNuefX9jxeXq2jAe53TIomKZsjZuEZxmKxsau2Z7yCmouVAufjoeK2GeN7Ix7Tb9oxye3XTToqK2ZC3aj3K2Vf35XjQ9y9VX x9RQNXxEu9KGj99TqCCVA2CGExNRTe9NZAIfZTuajG3vjIa2A25qAb7qmVq2abeNANVGuIaEKEC7IGG5feEoeTy3ZAAQKVqNxvCR RQNaTNs9j3aoGa9fG2bNVZyE3CTaZfaRo7TCajyIvxeGZ35eqNIsyT3mCG5oeQ7ZfXxCCT9v7AAo23veEombVaTjjbG3bXousN5Q x9RQNV59KGjT9AfCvssfyKEs5fEfQEIabGvv3o9V9syeQa7o7veKaTxa3Q3V9smKVoquCNvff7aaRNCEGbXjxTIe7EEZNEsb9Zu3 RQNVm39jVm5mu9eRsbQ3Vqf5K29aKCyayxsjqGZKfbX9xy5x9aauCR9TuKE2a3RuCNNmQmmTjmGVuT3GTeEuCm3eVvvmyRKRafyx x9RQNZa9KGjZjKK7AQ53CGTGEyERyATVbRRyGNNaETGZVbuNVXC73bTAmvVoxV3qaaKaE7yomTN9ZVNVCsfGjNyIyAV2ICxVsGZV RQ5ym939KGjTVsfvNCsGxX2vKNVN9yqZE95EfxvGq3E5AVvjoITI59CVve7ZZv22sNXsI7A5fabEKase9ysQEN579NAXAbuINejZ yKAx9RQZ9ZIxQGI5q2ZsjmIRXEfsZINx95soxXAuXyCQCoQN7oTojNVyK3fo3qbusKmqxmyyo7VA2TusEE5f9Z9sTEXxNAfmjyqv 9RQRqQZ9KGjqyqbbqIfRAKumeseoEoqTEfCyE3XXXIR3G5TA7AImmyqVV597v9uZoZvsK5ZCQ9Zj7q975CAausjGvKq99bsfGTCZ 9RQ29juK2X7a7bxvmRyGVZQ75sTaZqCIbxfRX5ReGbeRvjmRsCKTbReuGR5Rq7sZV3VAxaxZXxQ5ybqjqm75q3VaRRmXumIXCsju 9RQRX9jZZIXZKKevNfvmIVRoVuxqxVaCQaoEoTNvv2X9AvqqofQe739N7sV5X5XRIo2TRffqNGIyK9jXa9RaE7jRq5T3jQbfxvC2 RQTeEI9j7AsCQVVC5fZAyAyAxxoG9Xa9mGIR3b2ZAyQRNqof9IvxbVZ339oGVAyaCXbKmo5yf3Z7RZas55VxTsI22efjbufasNGQ Ax9RQ5bvqj9jf5Io5NG377o73RKm5A7yGxb7Ne5Qv5vv9muqQTx2uR7E3bjbGyyQ7RvoQa2omyj2joKGo9uxjeAbGufRCGuyQuG2 9RQ3aQKK9jyQjIC73m5RX75AoauXeKx2mQIT5aqxxxuVsTfuAGq72sACG3asGQRNN5GsN2NEjbu3yNvKsoXuj7mqVKXKqIfvKbXq 9RQ3aK9jyQjIC73m5RX75AoauXeKx2mQIT5aqxxxuVsTfuAGq72sACG3asGQRNN5GsN2NEjbu3yNvKsoT3v7mZEaNQX2a7NZxs7j j72Ax9RQEj3jKu9yZ9sEVf2qZEx7E9yCfxj5vfs5ZRIIfxI2fXNAVqXmvKuRse37VQvbyVfqbQe332jqeINfNNmNXGCqNeCevbym Ax9RQE3jKu9myZ9EV2qKEsEC7f5X5Txv5KZsfmCT23fNQGmGKuQe7VQvyV7fqb32jqeIIfNNmTGCNNe3vbqQ3bqZ5RfC3byKyKjR 9RQba9jybjjsVjQqqIVba5yCjEANaIATIQma5N92AmAK5XjZ953XfvsQN7GEZ5uqay55EZ2R3NCA2NsXERsEaoVZG75xKq5s5Ts3 9RQba9jybjjsVjQqqIVba5yCjEANaIATIQm592AmAK5XjZ953XfvsQN7GEZ5uqay55EZ2R3NCA2NsXERsEaoVZG75xKq5s5Ts3je
For some applications, you need to know the expected LD of random strings of text of a given length. For instance, if you are trying to spot distantly related texts, a result that is only slightly lower than expected can be significant. The expected LD function must be calibrated empirically because real text is far from random, and deviates from randomness in an application-dependent way.
Depending upon your goals, you can use a variety of criteria to interpret the LD value. In the Web crawler example, you might want to subtract the difference in length of the two signatures (adjusted by the expected-LD function) from the LD, then compute the ratio of the remaining quantity and the length of the smaller string. Any ratio above some arbitrarily chosen constant would constitute confirmation of near duplication.
How well does it work?
These results are from a brain-dead simple implementation along the lines described above. The first table below shows the signature size and the processing rate for two different file sizes and a range of values of C. I ran them on a 2015 MacBook Pro. Any of these compression factors could be reasonable depending on the application.
|C||File Size||Signature Size||LD estimates/sec|
The table below gives some example accuracy estimates with C set to 100. Texts include unrelated documents, identical documents, and originally identical documents subjected to a lot of changes. You can get a good idea of the actual similarity of the file by looking at the original LD values. Comparing the estimate of LD to the true value tells you how well it is working.
|Text||File Sizes||Sig Sizes||LD sig||LD actual||Est||Err|
|50/173 lines deleted||4,865/2,474||27/61||34||2,391||2,711||0.065|
|A few changes||4,865/4,801||59/61||3||64||239||0.036|
If you compare a file to a version of itself that has been cut into several big blocks and rearranged, the true edit distance will be misleadingly high (i.e. the files will seem dissimilar) because LD works at the single-character level.
This behavior can be improved upon by cutting each signature into blocks of a fixed size, and cross-comparing the blocks to find the best match for each block in one signature to some block in the other. If the first signature has M blocks and the second has N, this is a total of N * M block comparisons.
That sounds like a lot of computing because the number of comparisons is proportional to the product of the number of block-divisions but the size of the blocks shrinks in inverse proportion, resulting in a commensurate compensating quadratic decrease in the cost of each comparison. Therefore, the total cost is about the same as it would be for a straight-up LD computation on the intact signatures.
The caveat is that the accuracy goes down somewhat as the number of blocks increases.
This can be used in several ways, for example:
- You can get a better idea of the total similarity of two files by summing the lowest LD obtained for each block in one of the signatures.
- The results can be focused by allowing disregarding of some number of blocks, e.g., consider only the best K blocks, where K is the number of blocks in the smaller signature, minus two.
Scanning at Multiple Compression Rates
Large compression rates result in small signatures and a much faster LD estimate, but the smaller the signature, the more information is lost, and the less accurate is the result.
You can have your cake and eat it too by computing more than one signature for a given document. Small signatures can be compared first to separate the sheep from the goats quickly, and the larger signatures compared only for pairs for which the first test indicates a reasonably high degree of similarity.
The per-character entropy of the 32K Gutenberg documents averages 4.45637 bits, but with some variance. The majority of the documents score fairly close to 4.5, but a few have entropy greater than 5.1 or less than 4.0.
The average entropy of the signatures at any level of compression tested is only slightly higher than the entropy of the raw text. This tells you that the compression is tracking the raw text pretty closely in terms of predictability of characters. That’s a good thing.
The average entropy goes down slightly as the compression increases, ranging from 4.93989 for 50X compression to 4.87675 for 350X. This means that the signatures are getting slightly less meaningful as the compression rate gets high, but not by a great amount.
Interestingly, however, within a given compression rate, the entropy of a text rarely diverges from the average by more than a few 100’th of a bit. In contrast, the entropy of the raw documents varies by as much as a bit per character. This indicates that the compression is pretty good at washing out the non-textual anomalies that cause files to deviate from the entropy typical of text. This makes sense because the anomalies that reduce entropy are typically repetitions, only a few of which will happen to result in output characters.
The following table gives the average per-character entropy for the raw files and for signatures computed for 32,000 documents using compression factors of 50, 100, 150, 200, 250, 300, and 350. The raw files are mostly between 50K and 400K characters long, but can be as small as 10K and as large as 1100K
It is worth noting that there are other ways to make computing LD faster, notably the enhancement by Ukkonen, which is linear time with respect to the input sizes for similar document pairs and increases in runtime non-linearly with the LD of the pair. In other words, if the documents are almost the same, the comparison time is close to linear but gets slower the more they differ. (The details are a little more complex, but that’s the idea.)
This is useful, but it’s important to be clear about what it means. The estimate heuristic has a one-time signature preparation phase that, like Ukkonen, must read all of both documents, so if you’re only going to compare once and the documents are similar, Ukkonen and the estimate heuristic are of similar time complexity (assuming that the compression rate is high enough that the time to compare the two signatures is negligible.) And if the pairs are quite different, the run-time of Ukkonen is no longer linear in the lengths of the documents.
If you can amortize the cost of generating the signatures over many comparisons, the time advantage of the estimating heuristic grows in proportion for similar documents.
Therefore, the best use of the Ukkonen algorithm might be applying it to the signatures rather than on the original documents. The performance of Ukkonen on the signatures would be no worse than simple LD for the case of very different documents, and radically better for the case where the documents are very similar (which is likely to be the most common case.)
Double Blind Searches
One interesting application of the LD estimate heuristic relies on the fact that very little useful information about the contents of documents can be divined from the signatures. The signatures cannot be said to be cryptographically secure, because some information can be wrung out of them, but the extreme compression means that the original text can’t be recovered. Of course there is some information in the signatures. You can find out that certain snips are consistent with text you have in your possession if you have an original text to compress to compare it to, but that, after all is the point. In terms of cryptanalysis, the the possibilities are mostly limited to confirming that snips of text in possession of the analyst are present.
The fact that the data is semi-encrypted lets you do some interesting things. For instance, you can have a service that allows an outside party to check whether partial matches to a document or other files are present without the service having access to the library to be searched or to the text to be searched for. The client does not have to show her cards and the service only has access to the signatures, not the originals.
One use of such a service might be forensic searches of computers. While a judge might be reluctant to allow a warrant that allows the police to poke around in a target’s computers at will, it might be willing to allow them to look for specific documents or images among a set of signatures generated for the contents of the subject computers.
The same technique would allow an organization concerned with security to filter all data going out, say by email, for known sensitive documents without ever revealing to the service the contents of documents it considers sensitive.