Way to implement "Get all strings with Levenshtein distance less than X" - data-structures

I'm wondering whether there's an efficient data structure for the query "retrieve all strings with Levenshtein distance less than X".
A few things I'm interested in:
Explanation of the algorithm.
Is there an existing implementation in an existing database / programming language?
Paper / article that I can refer to?

This is a nearest-neighbor search in a metric space, with Levenshtein distance as the metric (or distance) function.
A VP-tree (vantage-point tree) is one of the ways of solving that problem.
This Python VP-tree implementation is a working demo that shows how a VP-tree works. Run it on, say, a word list: it provides an interactive shell where you type a word and it returns the words in that list that are no more than X distance from the word you typed.
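To make the idea concrete, here is a minimal VP-tree sketch. It is not the implementation linked above; the names `build_vp_tree` and `within` are made up for illustration, and the vantage point is simply picked at random. Each node stores one word and the median distance to the remaining words; a query uses the triangle inequality to skip subtrees that cannot contain a match.

```python
import random


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of words with distance <= radius
        self.outside = outside  # subtree of words with distance > radius


def build_vp_tree(words, dist=levenshtein, rng=None):
    """Hypothetical helper: build a VP-tree from an iterable of words."""
    rng = rng or random.Random(0)
    words = list(words)
    if not words:
        return None
    point = words.pop(rng.randrange(len(words)))
    if not words:
        return VPNode(point, 0, None, None)
    dists = [dist(point, w) for w in words]
    radius = sorted(dists)[len(dists) // 2]
    inside = [w for w, d in zip(words, dists) if d <= radius]
    outside = [w for w, d in zip(words, dists) if d > radius]
    return VPNode(point, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))


def within(node, query, max_dist, dist=levenshtein):
    """Return every stored word whose distance to `query` is <= max_dist."""
    if node is None:
        return []
    d = dist(query, node.point)
    found = [node.point] if d <= max_dist else []
    # A match p in `inside` satisfies dist(point, p) <= radius, so d - max_dist <= radius.
    if d - max_dist <= node.radius:
        found += within(node.inside, query, max_dist, dist)
    # A match p in `outside` satisfies dist(point, p) > radius, so d + max_dist > radius.
    if d + max_dist > node.radius:
        found += within(node.outside, query, max_dist, dist)
    return found
```

For the strict "less than X" query in the question, call `within(tree, word, X - 1)`, since Levenshtein distances are integers. The pruning is only valid because Levenshtein distance satisfies the triangle inequality.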

Sounds like a simple breadth-first search with each generation being just one 'edit' away from the previous - and with checks in place to ensure that a string appears in one-and-only-one level.
This would be easily implemented using a couple of hashsets / hashtables in a pair of loops.

Related

Word distance algorithm for OCR

I am working with an OCR output and I'm searching for special words inside it.
As the output is not clean, I look for elements that match my inputs according to a word distance lower than a specific threshold.
However, I feel that neither the Levenshtein distance nor the Hamming distance is the best fit, as the OCR always seems to make the same mistakes: I for 1, 0 for O, Q for O... and these "classic" mistakes should count for less than "A for K", for instance. As it stands, these distances ignore how visually similar or different the confused characters are.
Is there any word distance algorithm that was made specifically for OCR that I can use that would better fit my case? Or should I implement my custom word distance empirically according to the visual differences of characters?
The Levenshtein distance allows you to specify different costs for every substitution pair (http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications, fifth item). So you can tune it to your needs by giving more or less emphasis to the common mistakes.
If you want a custom cost function for letter mismatches, you could look at the Needleman–Wunsch algorithm (NW).
Wikipedia http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
OCR paper related to the NW-algorithm http://oro.open.ac.uk/20855/1/paper-15.pdf
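As a rough sketch of the custom substitution costs suggested in the answers above, the following is a Levenshtein variant that takes a cost function. The confusable pairs and the 0.2 cost are assumptions taken from the examples in the question, not measured OCR error rates; `weighted_levenshtein` and `ocr_sub_cost` are hypothetical names.

```python
def weighted_levenshtein(a, b, sub_cost, indel_cost=1.0):
    """Edit distance where substituting x by y costs sub_cost(x, y)."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + indel_cost,             # delete ca
                           cur[j - 1] + indel_cost,          # insert cb
                           prev[j - 1] + sub_cost(ca, cb)))  # substitute or match
        prev = cur
    return prev[-1]


# Assumed visually confusable pairs (illustrative only).
CONFUSABLE = {frozenset("I1"), frozenset("0O"), frozenset("QO")}


def ocr_sub_cost(x, y):
    """Cheap substitution for look-alike characters, full cost otherwise."""
    if x == y:
        return 0.0
    if frozenset((x, y)) in CONFUSABLE:
        return 0.2
    return 1.0
```

With these assumed costs, "B0B" ends up much closer to "BOB" than "BAB" is to "BKB". In practice the cost table would have to be tuned empirically on real OCR output.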

Convert one string into another

This is an interview question. I need to convert the string a to b such that only one letter is changed at a time and, after each change, the transformed string is in the dictionary. This must be done in the minimum number of transformations. For example, the transformation cat-->boy can be done as follows:
cat-->bat-->bot-->boy (if dictionary has bat and bot)
I can think of creating a prefix tree (trie), for this question, but am not sure how to proceed once I have a trie. Can someone suggest a possible approach? I am trying to avoid using brute force approach.
If you want to calculate the minimum number of single-character edits, have a look at Levenshtein distance. However, this assumes that only insertion, deletion, and substitution are allowed.
For your example, changing cat -> boy has a Levenshtein distance of 3, with three substitutions (c->b, a->o, t->y).
If transposition is also allowed, then you should consider Damerau–Levenshtein distance.
For example, cat -> cta has a Levenshtein distance of 2 and a Damerau–Levenshtein distance of 1.
You've already broken the problem into a prefix trie.
There are a few more steps to take to arrive at a solution:
Write a function that takes an input string and looks up possible transformations by querying the trie-dictionary.
Come up with an admissible heuristic that you can use to choose between the results.
Use a well known shortest path algorithm like the A* search algorithm.
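The steps above could be sketched roughly as follows. This sketch swaps the trie for a plain Python set (an assumption made to keep it short; a trie lookup would slot into the same place) and uses the Hamming distance to the target as the heuristic. That heuristic is admissible here because each move changes exactly one letter. The name `word_ladder` is hypothetical.

```python
import heapq
import string


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def word_ladder(start, goal, dictionary):
    """A* search over one-letter substitutions; returns a path or None."""
    if len(start) != len(goal):
        return None
    # Heap entries: (cost so far + heuristic, cost so far, word, path).
    heap = [(hamming(start, goal), 0, start, [start])]
    best_cost = {start: 0}
    while heap:
        _, cost, word, path = heapq.heappop(heap)
        if word == goal:
            return path
        if cost > best_cost.get(word, float("inf")):
            continue  # stale entry; a cheaper route to this word was found
        for i in range(len(word)):
            for c in string.ascii_lowercase:
                nxt = word[:i] + c + word[i + 1:]
                if nxt == word or nxt not in dictionary:
                    continue
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    heapq.heappush(heap, (new_cost + hamming(nxt, goal),
                                          new_cost, nxt, path + [nxt]))
    return None


print(word_ladder("cat", "boy", {"cat", "bat", "bot", "boy"}))
# -> ['cat', 'bat', 'bot', 'boy']
```

The sketch assumes lowercase ASCII words; a different alphabet would need a different candidate generator.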

Shortest path from one word to another via valid words (no graph)

I came across this variation of edit-distance problem:
Find the shortest path from one word to another, for example storm->power, validating each intermediate word by using an isValidWord() function. There is no other access to the dictionary of words and therefore a graph cannot be constructed.
I am trying to figure this out but it doesn't seem to be a distance related problem, per se. Use simple recursion maybe? But then how do you know that you're going the right direction?
Anyone else find this interesting? Looking forward to some help, thanks!
This is a puzzle from Lewis Carroll known as Word Ladders. Donald Knuth covers this in The Stanford GraphBase. This also
You can view it as a breadth-first search. You will need access to a dictionary of words; otherwise the space you will have to search will be huge. If you only have access to isValidWord(), you can generate all the single-edit variants of the current word and then use isValidWord() to filter them down (Norvig's "How to Write a Spelling Corrector" is a great explanation of generating the edits).
You can guide the search by trying to minimize the edit distance between where you currently are and where you want to be. For example, generate the space of all nodes to search, and sort by minimum edit distance. Follow first the links that are closest to the target (i.e. that minimize the edit distance). In the example, follow the nodes that are closest to "power".
I found this interesting as well, so there's a Haskell implementation here which works reasonably well. There's a link in the comments to a Clojure version which has some really nice visualizations.
You can search from two sides at the same time. I.e. change a letter in storm and run it through isValidWord(), and change a letter in power and run it through isValidWord(). If those two words are the same, you have found a path.
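A minimal sketch of the two-sided search, assuming equal-length lowercase words and only letter substitutions. The dictionary is reached solely through the `is_valid_word` callback, mirroring isValidWord() in the question; `bidirectional_ladder` and the helper names are hypothetical.

```python
import string


def one_letter_changes(word, is_valid_word):
    """Yield valid words that differ from `word` in exactly one position."""
    for i, old in enumerate(word):
        for c in string.ascii_lowercase:
            if c != old:
                candidate = word[:i] + c + word[i + 1:]
                if is_valid_word(candidate):
                    yield candidate


def trace(parents, word):
    """Follow parent links from `word` back to the root of one search."""
    path = []
    while word is not None:
        path.append(word)
        word = parents[word]
    return path


def bidirectional_ladder(start, goal, is_valid_word):
    """Breadth-first search from both ends; returns a path or None."""
    if len(start) != len(goal):
        return None
    if start == goal:
        return [start]
    parents = [{start: None}, {goal: None}]  # index 0: from start, 1: from goal
    frontiers = [[start], [goal]]
    while frontiers[0] and frontiers[1]:
        # Expand the smaller frontier by one full level.
        side = 0 if len(frontiers[0]) <= len(frontiers[1]) else 1
        mine, theirs = parents[side], parents[1 - side]
        next_frontier = []
        for word in frontiers[side]:
            for nb in one_letter_changes(word, is_valid_word):
                if nb in mine:
                    continue
                mine[nb] = word
                if nb in theirs:  # the two searches have met at nb
                    head = trace(parents[0], nb)[::-1]  # start ... nb
                    tail = trace(parents[1], nb)[1:]    # word after nb ... goal
                    return head + tail
                next_frontier.append(nb)
        frontiers[side] = next_frontier
    return None
```

With a toy validator such as `{"cold", "cord", "card", "ward", "warm", "word", "worm"}.__contains__`, a call like `bidirectional_ladder("cold", "warm", ...)` returns a five-word ladder. Each expansion still costs one isValidWord() call per candidate, so the benefit comes from the two frontiers staying smaller than one large one.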

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether there is a technical term for it, or what that term would be. Just knowing if there is a technical term for the process would be very helpful.
The Wikipedia article on edit distance gives some good background on the concept.
By taking "chunk transposition" into account, I mean that
Turing, Alan.
should match
Alan Turing
more closely than it matches
Turing Machine
I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).
In the case of your application you should probably think about adapting some algorithms from bioinformatics.
For example, you could first normalize your strings by making sure that all separators are spaces (or anything else you like), so that you would compare "Alan Turing" with "Turing Alan". Then split one of the strings and run an exact string matching algorithm (like the Horspool algorithm) with the pieces against the other string, counting the number of matching substrings.
If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.
Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.
Well, that was a rather abstract answer. I hope it points you in the right direction, though sadly it doesn't provide you with a simple formula to solve your problem.
Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words, the intersection over the union). Strictly speaking, this ratio is the Jaccard similarity; the corresponding distance is one minus it. For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF", the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW: n-grams are often used as well.
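The calculation described above fits in a few lines. This is only a sketch; the function name and the `n` parameter (token length, 1 for single characters) are illustrative choices.

```python
def jaccard_similarity(a, b, n=1):
    """Intersection over union of the sets of length-n substrings."""
    tokens_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    tokens_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not tokens_a and not tokens_b:
        return 1.0  # two empty token sets are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF"))  # 0.875
```

Because sets discard order and repetition, single-character tokens cannot tell anagrams apart; larger `n` trades some of that tolerance for precision.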
One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language, such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) means "the compressed size of A concatenated with B". Next, compute the fraction
(C(AB) - min(C(A), C(B))) / max(C(A), C(B))
This value is called NCD(A,B). It measures similarity much as edit distance does, but supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style similarity that you are describing. If two strings are similar, the compressed size of the concatenation will be near the size of each alone, so the numerator will be near 0 and the result will be near 0. If two strings are very dissimilar, the compressed size together will be roughly the sum of the individual compressed sizes, and so the result will be near 1.
This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression program like zlib. That is because most of the "hard" work, such as heuristics and optimization, has already been done in the data compression part, and this formula simply extracts the amount of similar patterns it found, using generic information theory that is agnostic to language. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few-hundred-byte size range you describe. For more information on this and a sample implementation, just search for Normalized Compression Distance (NCD) or have a look at the following paper and GitHub project:
http://arxiv.org/abs/cs/0312044 "Clustering by Compression"
https://github.com/rudi-cilibrasi/libcomplearn C language implementation
There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.
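For a quick experiment, the NCD formula above can be written directly against Python's standard zlib module. This is a sketch, not the libcomplearn implementation; encoding the strings as UTF-8 and using compression level 9 are assumptions.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """C(x): length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized Compression Distance between two strings."""
    xa, xb = a.encode("utf-8"), b.encode("utf-8")
    ca, cb = compressed_size(xa), compressed_size(xb)
    cab = compressed_size(xa + xb)
    return (cab - min(ca, cb)) / max(ca, cb)
```

On very short inputs the fixed zlib header and checksum overhead make up a large share of every compressed size, so the scores are coarse; it is worth checking the values on real author-name data before relying on a threshold.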
I think you're looking for Jaro-Winkler distance which is precisely for name matching.
You might find compression distance useful for this. See an answer I gave for a very similar question.
Or you could use a k-tuple based counting system:
Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list. (O(k n log n) time.)
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.
With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
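A short sketch of the sorted-list version of this k-tuple count. The function names are hypothetical, and repeated k-tuples are counted with multiplicity, which is one possible reading of "shared".

```python
def sorted_ktuples(s, k):
    """All length-k substrings of s, in sorted order."""
    return sorted(s[i:i + k] for i in range(len(s) - k + 1))


def shared_ktuples(a, b, k=4):
    """Count k-tuples common to both strings with a sorted-list merge."""
    xs, ys = sorted_ktuples(a, k), sorted_ktuples(b, k)
    i = j = shared = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            shared += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return shared


print(shared_ktuples("Alan Turing", "Turing Alan"))     # 4
print(shared_ktuples("Alan Turing", "Turing Machine"))  # 3
```

The raw count grows with string length, so dividing by the number of k-tuples in the shorter string may be a useful normalization when the strings differ a lot in size.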
I'm not sure whether what you really want is edit distance, which works simply on strings of characters, or semantic distance, which chooses the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to distinguish which is the most appropriate matching term/phrase given a specific term or phrase. In a sense what you're doing is comparing very short documents rather than strings of characters.

Finding how similar two strings are

I'm looking for an algorithm that takes 2 strings and will give me back a "factor of similarity".
Basically, I will have an input that may be misspelled, have letters transposed, etc, and I have to find the closest match(es) in a list of possible values that I have.
This is not for searching in a database. I'll have an in-memory list of 500 or so strings to match against, all under 30 chars, so it can be relatively slow.
I know this exists, I've seen it before, but I can't remember its name.
Edit: Thanks for pointing out Levenshtein and Hamming.
Now, which one should I implement? They basically measure different things, both of which can be used for what I want, but I'm not sure which one is more appropriate.
I've read up on the algorithms, and Hamming seems obviously faster. Since neither will detect two characters being transposed (e.g. Jordan and Jodran), which I believe will be a common mistake, which will be more accurate for what I want?
Can someone tell me a bit about the trade-offs?
Ok, so the standard algorithms are:
1) Hamming distance
Only good for strings of the same length, but very efficient. Basically it simply counts the number of positions at which the characters differ. Not useful for fuzzy searching of natural language text.
2) Levenshtein distance
The Levenshtein distance measures distance in terms of the number of "operations" required to transform one string into another. These operations include insertion, deletion and substitution. The standard approach to calculating the Levenshtein distance is to use dynamic programming.
3) Generalized Levenshtein (Damerau–Levenshtein) distance
This distance also takes into consideration transpositions of characters in a word, and is probably the edit distance most suited for fuzzy matching of manually-entered text. The algorithm to compute the distance is a bit more involved than the Levenshtein distance (detecting transpositions is not easy). Most common implementations are a modification of the bitap algorithm (as used by agrep).
In general you would probably want to consider an implementation of the third option, wrapped in some sort of nearest neighbour search based on a k-d tree.
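To compare the three distances listed above on the question's own example, here is a compact sketch. The transposition-aware variant shown is the restricted "optimal string alignment" form of Damerau–Levenshtein, chosen because it is the shortest to write; it is plain dynamic programming rather than bitap, and a linear scan stands in for any tree-based search, which is reasonable for a list of about 500 strings. All function names are illustrative.

```python
def hamming(a, b):
    """Number of differing positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; with transpositions=True, the optimal
    string alignment variant of Damerau-Levenshtein."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            best = min(d[i - 1][j] + 1,         # deletion
                       d[i][j - 1] + 1,         # insertion
                       d[i - 1][j - 1] + cost)  # substitution or match
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, d[i - 2][j - 2] + 1)  # adjacent swap
            d[i][j] = best
    return d[-1][-1]


def closest(query, candidates, limit=3):
    """Linear scan over an in-memory list, nearest candidates first."""
    return sorted(candidates,
                  key=lambda c: edit_distance(query, c, True))[:limit]


print(hamming("Jordan", "Jodran"))              # 2
print(edit_distance("Jordan", "Jodran"))        # 2
print(edit_distance("Jordan", "Jodran", True))  # 1
```

So for the transposition case in the question, only the third option scores the swap as a single mistake.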
Levenshtein distance
Hamming distance
soundex
metaphone
The Damerau-Levenshtein distance is similar to the Levenshtein distance, but also includes two-character transposition. The Wikipedia page (linked) includes pseudocode that should be fairly trivial to implement.
You're looking for the Levenshtein distance
