Levenshtein Distance is an elegant measure of the dissimilarity of two strings. Given a pair of strings, say, “hat” and “cat”, the LD is the number of single-character edits that are required to turn one into the other. The LD of cat and hat is one. The LD of hats and cat is two.
LD is precise and has an intuitive meaning, but in practice it is mostly used for short strings because the run-time of the algorithm to compute LD is quadratic, i.e., proportional to the product of the lengths of the two strings. On a reasonable computer you can compare strings as long as a line in this article in a few microseconds, but comparing a full page of this article to another full page would take a good chunk of a second. Comparing two books with LD is serious computing: many minutes, and only if you have a computer with sufficient memory, which you almost certainly do not.
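For reference, the textbook dynamic-programming computation looks like this (a minimal Python sketch; it keeps only two rows of the table, but the run-time is still the product of the two lengths):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = cur
    return prev[-1]

print(levenshtein("cat", "hat"))   # -> 1
print(levenshtein("hats", "cat"))  # -> 2
```

The nested loop is the whole story: every character of one string is compared against every character of the other, which is where the quadratic cost comes from.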
That’s why, as elegant as LD is for describing how different two chunks of text are, people rarely consider using it as a way to compare documents such as Web pages, articles, or books.
This heuristic described here is a way around that problem. It turns out that you can compute a decent estimate of the LD of two large strings many thousands of times faster than you could compute the true LD. The other way to say this is that whatever amount of time you deem tolerable for a comparison computation, using estimates increases the size of strings you can compare within that time limit by a factor of as much as a few hundred. The practical size-range for estimated LD is in the megabyte range (and much larger for binary data.)
Equally importantly, the LD estimates are made from pre-computed signatures, not from the original documents. This means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed, which is a tremendous advantage when you need to compare documents across a network.
The signatures can also provide insight into approximately where and how two sequences differ. This allows finer distinctions to be made about near-duplication, for instance, is one document embedded in the other, or are many small differences sprinkled throughout?
If your application requires a precise LD value, this heuristic isn’t for you, but the estimates are typically within about 0.05 of the true distance, which is more than enough accuracy for such tasks as:
- Confirming suspected near-duplication.
- Estimating how much two documents vary.
- Filtering through large numbers of documents to look for a near-match to some substantial block of text.
- Guessing whether two documents are related in more complex ways.
If anyone else implements this, I’d love to hear about your results.
The Estimating Heuristic
As you’ll see below, you don’t actually have to know how LD is computed to understand how the estimating heuristic works. The idea is simple. We approximate the LD of a pair of strings by first compressing them, then taking the real LD of the compressed signatures, and finally multiplying the result by the compression rate.
Ordinary compression algorithms won’t work for this purpose. For one thing, they don’t compress typical text nearly enough. Almost all text has a lot of air in it, but the information content is as real as concrete. You can squeeze the air out, but you can’t compress the pure information any further. The amount of information in typical English text is about 1.3 bits per 8-bit character, which makes it about 0.1625 information and 0.8375 air. In other words, about 6:1 compression is as much as you can compress text. An interesting corollary to this is that some common data types can’t be compressed at all, and actually expand when normally useful compression algorithms are applied. Previously compressed data usually has this property, for instance, as does any kind of pseudo-random data. This is part of a fascinating branch of math/computer science called “information theory” if you want to get deeper into it.
A second problem is that for estimating LD we need the compressed version to be sort of like a thumbnail version of an image. We want to shrink it while preserving some kind of recognizable similarity.
Ordinary lossless compression algorithms tend to eliminate the thumbnail property. Think of it this way: the output of lossless compression algorithms obviously isn’t random because you can reconstitute the original non-random data from it but it looks random in a mathematical sense. It has what computer scientists call “high entropy.” In a mathematical sense, text compressed with good lossless compression algorithms tends to all look alike. If it didn’t you could usually compress it a bit more.
The secret sauce with estimating LD is that we don’t care about uncompressing. Therefore, we can use a compression algorithm that is lossy in the extreme—you’ll typically compress the inputs by a factor of between 50x to 250x or more for text and in the thousands for binary data. Our method will also have the thumbnail property mentioned above.
Using a middle value of 150x for compression, a 20KB document yields a signature of only about 133 characters, which is in the range where computing the true LD takes only a small fraction of a millisecond. Now the quadratic nature of the LD computation is working for us: shrinking the inputs by a factor in the hundreds speeds up the comparison by the square of that factor, which is in the tens of thousands.
A Word About Theta: Θ
Computer scientists usually care about the big-Θ complexity of an algorithm, which tells you what function of the size of the input (n) will inevitably dominate the cost of computation as n increases. As we said above, the Θ of computing LD is n-squared which is what makes it useless for comparing big texts. Estimated LD applies LD to inputs with a length that is a linear function of n, so the Θ for estimating LD is no better than Θ for computing it accurately.
The escape hatch is that whatever the Θ of an algorithm is, Θ is only about the function that dominates runtime as n gets big enough. Infinity is a big place and we’re only talking about the little bit of it that is in the kilobytes to megabytes range. For small inputs you want to look at the constant and linear terms too.
To arrive at Θ you ignore terms like the compression rate (a constant) and ignore the time it takes to fetch data (a constant for the disk-seek time or HTTP latency, etc., plus a linear part for however long it takes to stream the data in.) In the case of estimating LD you also ignore the time it takes to compress (linear.) It’s all about the quadratic part.
Theta matters, but to see how the lower order functions affect things in the limited range we’re dealing with, consider the seek time for a fast disk, which is about four milliseconds. That happens to be about how long real LD takes to compute for two one-kilobyte strings. Therefore, if you have to look up the strings from disk, the constant lookup time dominates until you’re well above one kilobyte. (I say constant because the streaming time is negligible compared to the seek time for small amounts of data.)
Now consider estimating LD. The compression factor is another constant. Say we choose C=250. If we’re estimating LD, the time it takes to compute the estimated LD from pre-computed signatures is less than the time it takes to read in the signatures until you’re estimating for originals of maybe half a megabyte.
For computing the signatures of those half-megabyte strings, the constant seek time is no longer the biggest part of reading in the data. Say you can do sustained reads from disk at 50MB/second, and that computing a signature takes as long as reading the data in. That’s 10 milliseconds to read each string and 10 more to compress it, or 40 milliseconds for both strings, plus the four milliseconds to do the LD on the signatures: a total of 44ms. In fact, few computers have enough memory to compute LD on half-megabyte strings, but if they did it would take many minutes. Therefore, even if you computed the signatures for a single use, the total cost of computing the LD estimate would still be dominated by the linear part, reading the data in. The part governed by the quadratic performance would be minor: four milliseconds out of 44.
What Might This Be Good For?
Here’s an example. Web crawlers gather billions of pages from across the Web so that they can be indexed. The pages often exist in multiple nearly identical versions, so for both economy of processing and so that users don’t get back many links to the same document, the near-duplicates need to be identified and ignored.
This is surprisingly difficult to do. This paper explains how Google does it, but what is of interest here is that the algorithm has a 25% false positive rate. This is bad because a match means a document has already been indexed and can be ignored. As it would be prohibitively costly to pull all the apparent matches from across the Web to verify that they are near-duplicates, it is necessary to settle for either failing to index a lot of documents or compromising on how many redundant results are indexed. The technique described here could verify the matches at a cost that is small compared to the total effort per page crawled. Further, being able to eliminate the false positives almost infallibly lets you optimize your computation for false negatives, reducing errors of both kinds.
The usefulness of the signatures is not limited to estimating LD. The differences in the signatures are located in approximately the same relative positions as the differences in the original documents. Either alone or in combination with the LD, this can tell you a lot about the relationship between two similar documents without the need to examine the documents themselves. For instance, if two signatures differ at the beginning and the end, you can infer that the same block of text is simply embedded in two different pages.
If the differences are sprinkled throughout, but the LD is too small to be consistent with random files, then they are probably different versions of the same document. A good example of this showed up in tests when comparing a modern translation of Dante’s Inferno to several thousand random books from the Gutenberg book set. Mutilated versions of the original had been added to the test set. In this case we’re not looking for duplication so much as for pairs with an LD that is too small to be a coincidence.
No spurious matches were identified, but legitimate matches included some surprises. The algorithm recognized the similarity of:
- The original text.
- The mutilated versions.
- The Inferno to the three-volume set including Paradiso and Purgatorio.
- The non-ASCII versions of the same books with diacritics included (which results in numerous small changes on every line.)
- A different English translation from the Italian done a century earlier.
The last is astonishing: the LDs of the two translations deviate sufficiently from the expected value for randomly chosen text that you can recognize a relationship. An obvious guess about how to account for this is that matching proper nouns show up again and again in both documents in approximately the same places, because poetry is highly structured. This was a crude comparison based only on the raw distance. Much more information is undoubtedly present in how the signatures line up at a low level.
How Does It Work?
The trick is the compression algorithm, which requires two properties.
- The signature of a string must be the same whether it is compressed by itself or compressed embedded in a larger document.
- The compressed signature must be much smaller than the original.
The first of these is actually impossible, but you can relax it a little to say the signature of a string will be the same whether computed by itself or as a substring of another string except for a limited number of characters at either end.
To derive the signatures, we choose two values, C and N.
- C is the approximate compression rate. The larger this number, the smaller the signatures. The right choice depends on your goals. If C is too large, too much information will be lost. If it is too small, LD will execute slowly and the signatures will be bulky to store.
- N is a neighborhood size. Compression will operate on size-N substrings of the input, beginning with each successive character in the string. The larger the N, the more sensitive the heuristic is to small differences. An N of 6 to 20 usually gets good results for text.
We also choose any convenient alphabet of output characters for the signatures. The set of printable ASCII characters is a good choice because it allows you to inspect the signatures easily. The larger the cardinality of the output alphabet, the higher the entropy of the strings (which is good), but allowing non-alphanumerics can lead to surprises if they happen to be special characters in some processing context. For illustration purposes, we will use the alphanumeric ASCII characters and assume they are stored in a 62-element array of characters called OUTPUT_CHARS.
Signature generation is simple. I won’t include any optimizations here, just the basic idea.
1. Set P, the current position in the input, to zero.
2. Compute H(P), which is the hash of the N-character substring of the input that starts at position P.
3. If H(P) modulo C is not equal to zero (or some other arbitrarily chosen value between 0 and C-1), generate no output.
4. Otherwise, output a single character chosen from OUTPUT_CHARS in such a way that the choice is highly random, but a given H will always result in the same character. For concreteness, let’s say we emit the character in OUTPUT_CHARS that is at index H(P) modulo OUTPUT_CHARS.length().
5. If you are not at the end of the input, increment P and go to step 2; otherwise, quit.
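The steps above can be sketched in a few lines of Python (a minimal illustration, not an optimized implementation; the MD5 hash and the default parameter values are arbitrary choices, and any hash that mixes well would do):

```python
import hashlib

OUTPUT_CHARS = ("0123456789"
                "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz")  # 62 alphanumeric ASCII chars

def signature(text: str, c: int = 100, n: int = 10) -> str:
    """Lossy compression: slide an n-character window over the text; the
    roughly 1-in-c windows whose hash is 0 mod c each emit one character."""
    out = []
    for p in range(len(text) - n + 1):
        window = text[p:p + n]
        h = int.from_bytes(hashlib.md5(window.encode("utf-8")).digest()[:8], "big")
        if h % c == 0:
            out.append(OUTPUT_CHARS[h % len(OUTPUT_CHARS)])
    return "".join(out)
```

Because each output character depends only on its own n-character window, the signature of a string embedded inside a larger document matches the stand-alone signature except near the splice points, which is the first property required above.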
That’s all there is to the generation of signatures. The value for N should be big enough that H(P) gives diverse values, but not so big that a tiny difference will bleed out over a large expanse of the text; sizes from six to 20 seem to work fine.
The effect of C is not only to shrink the strings by ignoring all but 1/C of the H values, but also to ensure that most minor differences are suppressed.
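Putting the pieces together, the estimate itself is just the true LD of the two tiny signatures, scaled back up by the compression factor. A self-contained sketch, combining the windowed-hash compressor with the textbook edit distance (all names and parameter values are illustrative):

```python
import hashlib

OUTPUT_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def signature(text: str, c: int = 100, n: int = 10) -> str:
    """Windowed-hash compressor: roughly one output char per c input chars."""
    out = []
    for p in range(len(text) - n + 1):
        h = int.from_bytes(hashlib.md5(text[p:p + n].encode("utf-8")).digest()[:8], "big")
        if h % c == 0:
            out.append(OUTPUT_CHARS[h % len(OUTPUT_CHARS)])
    return "".join(out)

def levenshtein(a: str, b: str) -> int:
    """Textbook O(len(a) * len(b)) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def estimate_ld(text_a: str, text_b: str, c: int = 100, n: int = 10) -> int:
    """Approximate LD(text_a, text_b): the LD of the signatures, scaled by c."""
    return levenshtein(signature(text_a, c, n), signature(text_b, c, n)) * c
```

In a real deployment you would compute and store the signatures once, then call only the cheap `levenshtein`-on-signatures step per comparison.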
Some Real Examples
Below are the first 100 characters of signatures generated for each of 25 Gutenberg books. The first dozen bytes have recognizable similarities across all signatures but are exactly alike in only two. This is because every Gutenberg book starts with a big slab of semi-standard boilerplate disclaimers, copyright information etc. that is never exactly the same, and also contains a small amount of information about the specific book.
Notice also that there are several pairs of lines that are almost identical to each other—the last two, for instance. These are identical for the first 30 or so characters, then diverge slightly towards the end. That’s because they are the same book, with almost identical boilerplate, but one has the diacritical marks and the other doesn’t. This creates a few one-character differences per line, but the signature generation washes most of them out, resulting in fairly similar signatures.
RQNXXjCAovG2QEv29jR35fsCX5uZuaAfQIAouECef7jauXC2VGNTQ2NeKAZfAuA5o72GC9VR9xsXZ9Ijb5vXI7fmZy7N7Tm7e7I5 3eRAx9RQNVeoC9Ze9o37RKE5e9eGqVjGC9oERe9b9KZmuuev3NqaabCaV9oIuG7G2yX22C9K2fbNXajGIVx9GesoNmERTE5CbXuX 9RQNeQm9jTK3E7C3Zm93Zbb7TVTjjyQxfbjeqRbamEmvCqCsjGmso9oC9sTGAqZ7RNTsXfN7oQRV7NXX7GXEj2yVqbmj7EGQomvK x9RQNVsfEjbX7aKyoVaNI7yojuQu5TuK3fNKZXaQXy9fEbxajeVC9xuC3uNxy7RT3xIubjXvf7baXVeX2QAqvQbKZQ2ZNRs5yjGy x9RQNV9sEX7x9jEIfXKIjbKC2CassqasouV5RT3Kv3aVsEmGoo9uxeGCvI279fTA2INmGmseuAe9uRqITmQjbyZ3uaQV35AuKou9 Ax9RQNEAR5x9GI2Aa95ZRex9QIQEu9XQQ7eG7TuqNqjj5KK5A93VxbGoayaayCEqEXGAaICfx9bVGIaaCK9aCqyCfNmaCCqQuNVu Ax9RQNuefX9jxeXq2jAe53TIomKZsjZuEZxmKxsau2Z7yCmouVAufjoeK2GeN7Ix7Tb9oxye3XTToqK2ZC3aj3K2Vf35XjQ9y9VX x9RQNXxEu9KGj99TqCCVA2CGExNRTe9NZAIfZTuajG3vjIa2A25qAb7qmVq2abeNANVGuIaEKEC7IGG5feEoeTy3ZAAQKVqNxvCR RQNaTNs9j3aoGa9fG2bNVZyE3CTaZfaRo7TCajyIvxeGZ35eqNIsyT3mCG5oeQ7ZfXxCCT9v7AAo23veEombVaTjjbG3bXousN5Q x9RQNV59KGjT9AfCvssfyKEs5fEfQEIabGvv3o9V9syeQa7o7veKaTxa3Q3V9smKVoquCNvff7aaRNCEGbXjxTIe7EEZNEsb9Zu3 RQNVm39jVm5mu9eRsbQ3Vqf5K29aKCyayxsjqGZKfbX9xy5x9aauCR9TuKE2a3RuCNNmQmmTjmGVuT3GTeEuCm3eVvvmyRKRafyx x9RQNZa9KGjZjKK7AQ53CGTGEyERyATVbRRyGNNaETGZVbuNVXC73bTAmvVoxV3qaaKaE7yomTN9ZVNVCsfGjNyIyAV2ICxVsGZV RQ5ym939KGjTVsfvNCsGxX2vKNVN9yqZE95EfxvGq3E5AVvjoITI59CVve7ZZv22sNXsI7A5fabEKase9ysQEN579NAXAbuINejZ yKAx9RQZ9ZIxQGI5q2ZsjmIRXEfsZINx95soxXAuXyCQCoQN7oTojNVyK3fo3qbusKmqxmyyo7VA2TusEE5f9Z9sTEXxNAfmjyqv 9RQRqQZ9KGjqyqbbqIfRAKumeseoEoqTEfCyE3XXXIR3G5TA7AImmyqVV597v9uZoZvsK5ZCQ9Zj7q975CAausjGvKq99bsfGTCZ 9RQ29juK2X7a7bxvmRyGVZQ75sTaZqCIbxfRX5ReGbeRvjmRsCKTbReuGR5Rq7sZV3VAxaxZXxQ5ybqjqm75q3VaRRmXumIXCsju 9RQRX9jZZIXZKKevNfvmIVRoVuxqxVaCQaoEoTNvv2X9AvqqofQe739N7sV5X5XRIo2TRffqNGIyK9jXa9RaE7jRq5T3jQbfxvC2 RQTeEI9j7AsCQVVC5fZAyAyAxxoG9Xa9mGIR3b2ZAyQRNqof9IvxbVZ339oGVAyaCXbKmo5yf3Z7RZas55VxTsI22efjbufasNGQ Ax9RQ5bvqj9jf5Io5NG377o73RKm5A7yGxb7Ne5Qv5vv9muqQTx2uR7E3bjbGyyQ7RvoQa2omyj2joKGo9uxjeAbGufRCGuyQuG2 
9RQ3aQKK9jyQjIC73m5RX75AoauXeKx2mQIT5aqxxxuVsTfuAGq72sACG3asGQRNN5GsN2NEjbu3yNvKsoXuj7mqVKXKqIfvKbXq 9RQ3aK9jyQjIC73m5RX75AoauXeKx2mQIT5aqxxxuVsTfuAGq72sACG3asGQRNN5GsN2NEjbu3yNvKsoT3v7mZEaNQX2a7NZxs7j j72Ax9RQEj3jKu9yZ9sEVf2qZEx7E9yCfxj5vfs5ZRIIfxI2fXNAVqXmvKuRse37VQvbyVfqbQe332jqeINfNNmNXGCqNeCevbym Ax9RQE3jKu9myZ9EV2qKEsEC7f5X5Txv5KZsfmCT23fNQGmGKuQe7VQvyV7fqb32jqeIIfNNmTGCNNe3vbqQ3bqZ5RfC3byKyKjR 9RQba9jybjjsVjQqqIVba5yCjEANaIATIQma5N92AmAK5XjZ953XfvsQN7GEZ5uqay55EZ2R3NCA2NsXERsEaoVZG75xKq5s5Ts3 9RQba9jybjjsVjQqqIVba5yCjEANaIATIQm592AmAK5XjZ953XfvsQN7GEZ5uqay55EZ2R3NCA2NsXERsEaoVZG75xKq5s5Ts3je
For some applications, you need to know the expected LD of random strings of text of a given length. For instance, if you are trying to spot distantly related texts, a result that is only slightly lower than expected can be significant. The expected-LD function must be calibrated empirically because real text is far from random, and deviates from randomness in an application-dependent way.
Depending upon your goals, you can use a variety of criteria to interpret the LD value. In the Web crawler example, you might want to subtract the difference in length of the two signatures (adjusted by the expected-LD function) from the LD, then compute the ratio of the remaining quantity to the length of the smaller signature. Any ratio below some arbitrarily chosen constant would constitute confirmation of near duplication.
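One reading of that criterion as code (a sketch with an arbitrary threshold you would calibrate for your own corpus; the expected-LD adjustment is omitted for simplicity). A small ratio means the signatures, and hence the documents, are nearly the same:

```python
def near_duplicate(sig_a: str, sig_b: str, threshold: float = 0.3) -> bool:
    """Treat a pair as near-duplicates when the LD of their signatures,
    less the unavoidable length difference, is small relative to the
    shorter signature."""
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    residual = levenshtein(sig_a, sig_b) - abs(len(sig_a) - len(sig_b))
    return residual / min(len(sig_a), len(sig_b)) <= threshold
```

The subtraction of the length difference keeps a long document from being penalized merely for containing extra material beyond the shared block.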
How Well Does It Work?
These results are from a brain-dead simple implementation along the lines described above. The first table below shows the signature size and the processing rate for two different file sizes and a range of values of C. I ran them on a 2015 MacBook Pro. Any of these compression factors could be reasonable depending on the application.
| C | File Size | Signature Size | LD estimates/sec |
|---|---|---|---|
The table below gives some example accuracy estimates with C set to 100. Texts include unrelated documents, identical documents, and originally identical documents subjected to a lot of changes. You can get a good idea of the actual similarity of the files by looking at the original LD values. Comparing the estimate of LD to the true value tells you how well it is working.
| Text | File Sizes | Sig Sizes | LD sig | LD actual | Est | Err |
|---|---|---|---|---|---|---|
| 50/173 lines deleted | 4,865/2,474 | 27/61 | 34 | 2,391 | 2,711 | 0.065 |
| A few changes | 4,865/4,801 | 59/61 | 3 | 64 | 239 | 0.036 |
If you compare a file to a version of itself that has been cut into several big blocks and rearranged, the true edit distance will be misleadingly high (i.e. the files will seem dissimilar) because LD works at the single-character level.
This behavior can be improved upon by cutting each signature into blocks of a fixed size, and cross-comparing the blocks to find the best match for each block in one signature to some block in the other. If the first signature has M blocks and the second has N, this is a total of N * M block comparisons.
That sounds like a lot, but the size of the blocks shrinks in inverse proportion, resulting in a commensurate decrease in the cost of each comparison. Therefore, the total cost is about the same as it would be for a straight-up LD computation on the intact signatures.
The caveat is that the accuracy goes down somewhat as the number of blocks increases.
This can be used in several ways, for example:
- You can get a better idea of the total similarity of two files by summing the lowest LD obtained for each block in one of the signatures.
- The results can be focused by disregarding some number of blocks, e.g., considering only the best K blocks, where K is the number of blocks in the smaller signature, minus two.
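A sketch of the block-wise comparison described above (the block count is illustrative, and the `levenshtein` helper is the textbook version):

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook O(len(a) * len(b)) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def block_distance(sig_a: str, sig_b: str, blocks: int = 8) -> int:
    """Cut each signature into `blocks` chunks; for each chunk of the first,
    find its best-matching chunk of the second and sum those distances.
    Rearranged blocks of text then no longer inflate the distance."""
    def split(s):
        size = max(1, -(-len(s) // blocks))  # ceiling division
        return [s[i:i + size] for i in range(0, len(s), size)]
    chunks_b = split(sig_b)
    return sum(min(levenshtein(ca, cb) for cb in chunks_b) for ca in split(sig_a))
```

With M and N blocks this is M * N small comparisons, but since each block is 1/M or 1/N of the original length, the total work is about the same as one straight LD over the intact signatures.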
Scanning at Multiple Compression Rates
Large compression rates result in small signatures and a much faster LD estimate, but the smaller the signature, the more information is lost, and the less accurate is the result.
You can have your cake and eat it too by computing more than one signature for a given document. Small signatures can be compared first to separate the sheep from the goats quickly, and the larger signatures compared only for pairs for which the first test indicates a reasonably high degree of similarity.
The per-character entropy of the 32K Gutenberg documents averages 4.45637 bits, but with some variance. The majority of the documents score fairly close to 4.5, but a few have entropy greater than 5.1 or less than 4.0.
The average entropy of the signatures at any level of compression tested is only slightly higher than the entropy of the raw text. This tells you that the compression is tracking the raw text pretty closely in terms of predictability of characters. That’s a good thing.
The average entropy goes down slightly as the compression increases, ranging from 4.93989 for 50X compression to 4.87675 for 350X. This means that the signatures are getting slightly less meaningful as the compression rate gets high, but not by a great amount.
Interestingly, however, within a given compression rate, the entropy of a text rarely diverges from the average by more than a few hundredths of a bit. In contrast, the entropy of the raw documents varies by as much as a bit per character. This indicates that the compression is pretty good at washing out the non-textual anomalies that cause files to deviate from the entropy typical of text. This makes sense because the anomalies that reduce entropy are typically repetitions, only a few of which will happen to result in output characters.
These entropy figures were computed for the raw files and for signatures of 32,000 documents using compression factors of 50, 100, 150, 200, 250, 300, and 350. The raw files are mostly between 50K and 400K characters long, but range from as small as 10K to as large as 1,100K.
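The per-character entropy figures quoted here can be reproduced with a few lines. This is the standard Shannon formula applied to single-character frequencies:

```python
import math
from collections import Counter

def entropy_per_char(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((k / total) * math.log2(k / total) for k in counts.values())

print(entropy_per_char("abababab"))  # -> 1.0 (two equally likely characters)
```

For a 62-character signature alphabet the ceiling is log2(62), about 5.95 bits per character, so the observed averages near 4.9 mean the signatures are using their alphabet fairly efficiently.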
It is worth noting that there are other ways to make computing LD faster, notably the enhancement by Ukkonen, which is linear time with respect to the input sizes for similar document pairs and increases in runtime non-linearly with the LD of the pair. In other words, if the documents are almost the same, the comparison time is close to linear but gets slower the more they differ. (The details are a little more complex, but that’s the idea.)
This is useful, but it’s important to be clear about what it means. The estimate heuristic has a one-time signature preparation phase that, like Ukkonen, must read all of both documents, so if you’re only going to compare once and the documents are similar, Ukkonen and the estimate heuristic are of similar time complexity (assuming that the compression rate is high enough that the time to compare the two signatures is negligible.) And if the pairs are quite different, the run-time of Ukkonen is no longer linear in the lengths of the documents.
If you can amortize the cost of generating the signatures over many comparisons, the time advantage of the estimating heuristic grows in proportion for similar documents.
Therefore, the best use of the Ukkonen algorithm might be applying it to the signatures rather than on the original documents. The performance of Ukkonen on the signatures would be no worse than simple LD for the case of very different documents, and radically better for the case where the documents are very similar (which is likely to be the most common case.)
A Hidden Advantage
One interesting application of the LD estimate heuristic relies on the fact that very little useful information about the contents of documents can be divined from the signatures. The signatures are not quite cryptographically secure, because some information can be wrung out of them, but the extreme compression means that there can’t be much information left in them. For instance, a signature computed with C=150 is only about 0.0067 as long as the input. As English, for example, has only about a 0.5 redundancy of information, there aren’t enough bits left in the output to contain much of the original content, and what little is left is quite scrambled.
This lets you do some interesting things. For instance, you can have a service that allows an outside party to check whether partial matches to a document or other files are present without the service being able to read either the contents or the sample. The client submits the signature of what is sought and the server checks a set of similarly compressed signatures for the content. The client does not have to show her cards and the service only has access to the signatures, not the originals.
One use of such a service might be forensic searches of computers. While a judge might be reluctant to grant a warrant that allows the police to poke around in a target’s computers at will, they might be willing to allow the police to look for specific documents or images among a set of signatures generated for the contents of the subject computers.