Word distance algorithm for OCR - algorithm

I am working with an OCR output and I'm searching for special words inside it.
As the output is not clean, I look for elements that match my inputs according to a word distance lower than a specific threshold.
However, I feel that the Levenshtein distance or the Hamming distance are not the best way, as the OCR always seem to make the same mistakes: I for 1, 0 for O, Q for O... and these "classic" mistakes seem to be less important than "A for K" for instance. As a result, these distances do not care of the amount of differences of the appearances of the characters (low / high).
Is there any word distance algorithm that was made specifically for OCR that I can use that would better fit my case? Or should I implement my custom word distance empirically according to the visual differences of characters?

The Levenshtein distance allows you to specify different costs for every substitution pair (http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications, fifth item). So you can tune it to your needs by giving more or less emphasis to the common mistakes.

I you want a custom cost function for letter mismatch you could look at the Needleman–Wunsch algorithm (NW)
Wikipedia http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
OCR paper related to the NW-algorithm http://oro.open.ac.uk/20855/1/paper-15.pdf

Related

Choosing Levenshtein vs Jaro Winkler?

I'm doing an application that computers a large list of brands/domains and detects variations from pre-determined keywords.
Examples:
facebook vs facebo0k.com
linkedIn vs linkedln.com
stackoverflow vs stckoverflow
I'm wondering if for the simply purpose of comparing two strings and detect subtle variations, both algorithms meet the purpose so there is not added value of choosing one over another unless it's for performance improvement?
I would use Damerau–Levenshtein with the added twist that the cost of substitution for common misspellings ('I' vs 'l', '0' vs 'O') or mistypings ('Q' vs 'W' etc.) would be lower.
The Smith-Waterman algorithm is probably going to be more adapted to your task, since it allows you to define a score function that will reflect what you consider to be a 'similarity' between characters (for instance O is quite similar to 0 etc).
I think that it has the advantage of allowing you to define your own score function, which is not necessarily the case with the Vanilla version of the other algorithms you present.
This algorithm is widely used in bioinformatics, where biologists try to detect DNA sequences that may be different, but have the same, or very similar, functionalities (for instance, that AGC codes the same protein than GTA).
The algorithm runs in quadratic time using dynamic programming, and is fairly easy to implement.
If you are only considering either Levenshtein or Jaro-Winkler distances then you will probably want to go with Jaro-Winkler since it takes into account only matching characters and any required transpositions (swapping of characters) and is a value between zero and one and will be equal to 1 (no similarity) if there are no closely matching characters (making it easy to filter out any obvious non-matches).
A Levenshtein distance will give a value for any arbitrarily distant pair of strings no matter how different they are, requiring you to choose a cutoff threshold of what to consider.
However, Jaro-Winkler gives extra weight to prefix similarity (matching characters near the beginning of the strings). If this isn't desired, than regular Jaro distance might be what you want.

Better distance metrics besides Levenshtein for ordered word sets and subsequent clustering

I am trying to solve a problem that involves comparing large numbers of word sets , each of which contains a large, ordered number of words from a set of words (totaling around 600+, very high dimensionality!) for similarity and then clustering them into distinct groupings. The solution needs to be as unsupervised as possible.
The data looks like
[Apple, Banana, Orange...]
[Apple, Banana, Grape...]
[Jelly, Anise, Orange...]
[Strawberry, Banana, Orange...]
...etc
The order of the words in each set matters ([Apple, Banana, Orange] is distinct from [Apple, Orange, Banana]
The approach I have been using so far has been to use Levenshtein distance (limited by a distance threshold) as a metric calculated in a Python script with each word being the unique identifier, generate a similarity matrix from the distances, and throwing that matrix into k-Mediods in KNIME for the groupings.
My questions are:
Is Levenshtein the most appropriate distance metric to use for this problem?
Is mean/medoid prototype clustering the best way to go about the groupings?
I haven't yet put much thought into validating the choice for 'k' in the clustering. Would evaluating an SSE curve of the clustering be the best way to go about this?
Are there any flaws in my methodology?
As an extension to the solution in the future, given training data, would anyone happen to have any ideas for going about assigning probabilities to cluster assignments? For example, set 1 has a 80% chance of being in cluster 1, etc.
I hope my questions don't seem too silly or the answers painfully obvious, I'm relatively new to data mining.
Thanks!
Yes, Levenshtein is a very suitable way to do this. But if the sequences vary in size much, you might be better off normalising these distances by dividing by the sum of the sequence lengths -- otherwise you will find that observed distances tend to increase for pairs of long sequences whose "average distance" (in the sense of the average distance between corresponding k-length substrings, for some small k) is constant.
Example: The pair ([Apple, Banana], [Carrot, Banana]) could be said to have the same "average" distance as ([Apple, Banana, Widget, Xylophone], [Carrot, Banana, Yam, Xylophone]) since every 2nd item matches in both, but the latter pair's raw Levenshtein distance will be twice as great.
Also bear in mind that Levenshtein does not make special allowances for "block moves": if you take a string, and move one of its substrings sufficiently far away, then the resulting pair (of original and modified strings) will have the same Levenshtein score as if the 2nd string had completely different elements at the position where the substring was moved to. If you want to take this into account, consider using a compression-based distance instead. (Although I say there that it's useful for computing distances without respect to order, it does of course favour ordered similarity to disordered similarity.)
check out SimMetrics on sourceforge for a platform supporting a variety of metrics able to use as a means to evaluate the best for a task.
for a commercially valid version check out K-Similarity from K-Now.co.uk.

Algorithm for measuring distance between disordered sequences

The Levenshtein distance gives us a way to calculate the distance between two similar strings in terms of disordered individual characters:
quick brown fox
quikc brown fax
The Levenshtein distance = 3.
What is a similar algorithm for the distance between two strings with similar subsequences?
For example, in
quickbrownfox
brownquickfox
the Levenshtein distance is 10, but this takes no account of the fact that the strings have two similar subsequences, which makes them more "similar" than completely disordered words like
quickbrownfox
qburiocwknfox
and yet this completely disordered version has a Levenshtein distance of eight.
What distance measures exist which take the length of subsequences into account, without assuming that the subsequences can be easily broken into distinct words?
I think that you can try shingles or some combinations of them with Levenshtein distance.
One simple metric would be to take all n*(n-1)/2 substrings in each string, and see how many overlap. There are some simple variations to this approach where you only look at substrings up to a certain length.
This would be similar to the BLEU score commonly used to evaluate machine translations. In the case of BLEU, they are comparing two sentences: they take all the unigrams, bigrams, trigrams, and 4-grams of words from each sentence. They calculate a version of precision and recall for each, and essentially use an average of those scores.
Initial stab: use a diff algorithm and the count of the number of differences as your distance
I have an impression that it's NP-complete problem.
At least, I cannot see how can we avoid an exhaustive search. Moreover, I cannot even see how can we verify given solution in polynomial time.
well the problem you're referring to falls under context sensitive grammar.
You basically define a grammar, the english grammar in this case and then find the distance between a grammar and a mismatch. You'll need to parse your input first.

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.
The Wikipedia article on edit distance gives some good background on the concept.
By taking "chunk transposition" into account, I mean that
Turing, Alan.
should match
Alan Turing
more closely than it matches
Turing Machine
I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).
In the case of your application you should probably think about adapting some algorithms from bioinformatics.
For example you could firstly unify your strings by making sure, that all separators are spaces or anything else you like, such that you would compare "Alan Turing" with "Turing Alan". And then split one of the strings and do an exact string matching algorithm ( like the Horspool-Algorithm ) with the pieces against the other string, counting the number of matching substrings.
If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.
Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.
Well, that was a rather abstract answer, but I hope it points you in the right direction, but sadly it doesn't provide you with a simple formula to solve your problem.
Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF," the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW--n-grams are often used as well.
One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) means "The compressed size of "A concatenated with B". Next, compute the fraction (C(AB) - min(C(A),C(B))) / max(C(A), C(B)) This value is called NCD(A,B) and measures similarity similar to edit distance but supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style similarity that you are describing. If two strings are similar the compressed size of the concatenation will be near the size of each alone so the numerator will be near 0 and the result will be near 0. If two strings are very dissimilar the compressed size together will be roughly the sum of the compressed sizes added and so the result will be near 1. This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression program like zlib. It is because most of the "hard" work such as heuristics and optimization has already been done in the data compression part and this formula simply extracts the amount of similar patterns it found using generic information theory that is agnostic to language. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few hundred byte size range you describe. For more information on this and a sample implementation just search Normalized Compression Distance (NCD) or have a look at the following paper and github project:
http://arxiv.org/abs/cs/0312044 "Clustering by Compression"
https://github.com/rudi-cilibrasi/libcomplearn C language implementation
There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.
I think you're looking for Jaro-Winkler distance which is precisely for name matching.
You might find compression distance useful for this. See an answer I gave for a very similar question.
Or you could use a k-tuple based counting system:
Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list. (O(knlog(n) time.)
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.
With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
I'm not sure that what you really want is edit distance -- which works simply on strings of characters -- or semantic distance -- choosing the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to distinguish which is the most appropriate matching term/phrase given a specific term or phrase. In a sense what you're doing is comparing very short documents rather than strings of characters.

Finding how similar two strings are

I'm looking for an algorithm that takes 2 strings and will give me back a "factor of similarity".
Basically, I will have an input that may be misspelled, have letters transposed, etc, and I have to find the closest match(es) in a list of possible values that I have.
This is not for searching in a database. I'll have an in-memory list of 500 or so strings to match against, all under 30 chars, so it can be relatively slow.
I know this exists, i've seen it before, but I can't remember its name.
Edit: Thanks for pointing out Levenshtein and Hamming.
Now, which one should I implement? They basically measure different things, both of which can be used for what I want, but I'm not sure which one is more appropriate.
I've read up on the algorithms, Hamming seems obviously faster. Since neither will detect two characters being transposed (ie. Jordan and Jodran), which I believe will be a common mistake, which will be more accurate for what I want?
Can someone tell me a bit about the trade-offs?
Ok, so the standard algorithms are:
1) Hamming distance
Only good for strings of the same length, but very efficient. Basically it simply counts the number of distinct characters. Not useful for fuzzy searching of natural language text.
2) Levenstein distance.
The Levenstein distance measures distance in terms of the number of "operations" required to transform one string to another. These operations include insertion, deletion and substition. The standard approach of calculating the Levenstein distance is to use dynamic programming.
3) Generalized Levenstein/(Damerau–Levenshtein distance)
This distance also takes into consideration transpositions of characters in a word, and is probably the edit distance most suited for fuzzy matching of manually-entered text. The algorithm to compute the distance is a bit more involved than the Levenstein distance (detecting transpositions is not easy). Most common implementations are a modification of the bitap algorithm (like grep).
In general you would probably want to consider an implementation of the third option implemented in some sort of nearest neighbour search based on a k-d tree
Levenstein distance
Hamming distance
soundex
metaphone
the Damerau-Levenshtein distance is similar to the Levenshtein distance, but also includes two-character transposition. the wikipedia page (linked) includes pseudocode that should be fairly trivial to implement.
You're looking for the Levenshtein distance

Resources