Algorithm for measuring distance between disordered sequences - algorithm

The Levenshtein distance gives us a way to calculate the distance between two similar strings in terms of disordered individual characters:
quick brown fox
quikc brown fax
The Levenshtein distance = 3.
What is a similar algorithm for the distance between two strings with similar subsequences?
For example, in
quickbrownfox
brownquickfox
the Levenshtein distance is 10, but this takes no account of the fact that the strings have two similar subsequences, which makes them more "similar" than completely disordered words like
quickbrownfox
qburiocwknfox
and yet this completely disordered version has a Levenshtein distance of eight.
What distance measures exist which take the length of subsequences into account, without assuming that the subsequences can be easily broken into distinct words?

I think that you can try shingles or some combinations of them with Levenshtein distance.

One simple metric would be to take all n*(n-1)/2 substrings in each string, and see how many overlap. There are some simple variations to this approach where you only look at substrings up to a certain length.
This would be similar to the BLEU score commonly used to evaluate machine translations. In the case of BLEU, they are comparing two sentences: they take all the unigrams, bigrams, trigrams, and 4-grams of words from each sentence. They calculate a version of precision and recall for each, and essentially use an average of those scores.

Initial stab: use a diff algorithm and the count of the number of differences as your distance

I have an impression that it's NP-complete problem.
At least, I cannot see how can we avoid an exhaustive search. Moreover, I cannot even see how can we verify given solution in polynomial time.

well the problem you're referring to falls under context sensitive grammar.
You basically define a grammar, the english grammar in this case and then find the distance between a grammar and a mismatch. You'll need to parse your input first.

Related

Levenshtein-distance is there another way rather than compare the missspelled word with all the dictionary word

i was searching for AI algorithm for spelling correction and i found Levenshtein distance algorithm that compare the similarity between two string so my question should i implement this similarity between the wrong word with the all the words that's in my dictionary? because if yes the time run will be slow. and my second question can this algorithm implement on two string that don't have the same length
thanks in advance
If you are working with Java or JavaScript, I have a library that can find all spelling candidates in your dictionary in linear time on the length of the query term:
https://github.com/universal-automata/liblevenshtein-java
The trick is that it intersects a Levenshtein automaton with your dictionary automaton, and only follows those paths which lead to terms having a Levenshtein distance no greater than what you specify from the query term.
I've set up a demo that you can play with:
http://universal-automata.github.io/liblevenshtein/

Word distance algorithm for OCR

I am working with an OCR output and I'm searching for special words inside it.
As the output is not clean, I look for elements that match my inputs according to a word distance lower than a specific threshold.
However, I feel that the Levenshtein distance or the Hamming distance are not the best way, as the OCR always seem to make the same mistakes: I for 1, 0 for O, Q for O... and these "classic" mistakes seem to be less important than "A for K" for instance. As a result, these distances do not care of the amount of differences of the appearances of the characters (low / high).
Is there any word distance algorithm that was made specifically for OCR that I can use that would better fit my case? Or should I implement my custom word distance empirically according to the visual differences of characters?
The Levenshtein distance allows you to specify different costs for every substitution pair (http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications, fifth item). So you can tune it to your needs by giving more or less emphasis to the common mistakes.
I you want a custom cost function for letter mismatch you could look at the Needleman–Wunsch algorithm (NW)
Wikipedia http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
OCR paper related to the NW-algorithm http://oro.open.ac.uk/20855/1/paper-15.pdf

Levenstein Transpose Distance

How can I implement the transpose/swap/twiddle/exchange distance alone using dynamic programming. I must stress that I do not want to check for the other operations (ie copy, delete, insert, kill etc) just transpose/swap.
I wish to apply Levenstein's algorithm just for swap distance. How would the code look like?
I'm not sure that Levenstein's algorithm can be used in this case. Without insert or delete operation, distance is good defined only between strings with same length and same characters. Examples of strings that isn't possible to transform to same string with only transpositions:
AB, ABC
AAB, ABB
With that, algorithm can be to find all possible permutations of positions of characters not on same places in both strings and look for one that can be represent with minimum number of transpositions or swaps.
An efficient application of dynamic programming usually requires that the task decompose into several instances of the same task for a shorter input. In case of the Levenstein distance, this boils down to prefixes of the two strings and the number of edits required to get from one to the other. I don't see how such a decomposition can be achieved in your case. At least I don't see one that would result in a polynomial time algorithm.
Also, it is not quite clear what operations you are talking about. Depending on the context, a swap or exchange can mean either the same thing as transposition or a replacement of a letter with an arbitrary other letter, e.g. test->text. If by "transpose/swap/twiddle/exchange" you try to say just "transpose", than you should have a look at Counting the adjacent swaps required to convert one permutation into another. If not, please clarify the question.

Better distance metrics besides Levenshtein for ordered word sets and subsequent clustering

I am trying to solve a problem that involves comparing large numbers of word sets , each of which contains a large, ordered number of words from a set of words (totaling around 600+, very high dimensionality!) for similarity and then clustering them into distinct groupings. The solution needs to be as unsupervised as possible.
The data looks like
[Apple, Banana, Orange...]
[Apple, Banana, Grape...]
[Jelly, Anise, Orange...]
[Strawberry, Banana, Orange...]
...etc
The order of the words in each set matters ([Apple, Banana, Orange] is distinct from [Apple, Orange, Banana]
The approach I have been using so far has been to use Levenshtein distance (limited by a distance threshold) as a metric calculated in a Python script with each word being the unique identifier, generate a similarity matrix from the distances, and throwing that matrix into k-Mediods in KNIME for the groupings.
My questions are:
Is Levenshtein the most appropriate distance metric to use for this problem?
Is mean/medoid prototype clustering the best way to go about the groupings?
I haven't yet put much thought into validating the choice for 'k' in the clustering. Would evaluating an SSE curve of the clustering be the best way to go about this?
Are there any flaws in my methodology?
As an extension to the solution in the future, given training data, would anyone happen to have any ideas for going about assigning probabilities to cluster assignments? For example, set 1 has a 80% chance of being in cluster 1, etc.
I hope my questions don't seem too silly or the answers painfully obvious, I'm relatively new to data mining.
Thanks!
Yes, Levenshtein is a very suitable way to do this. But if the sequences vary in size much, you might be better off normalising these distances by dividing by the sum of the sequence lengths -- otherwise you will find that observed distances tend to increase for pairs of long sequences whose "average distance" (in the sense of the average distance between corresponding k-length substrings, for some small k) is constant.
Example: The pair ([Apple, Banana], [Carrot, Banana]) could be said to have the same "average" distance as ([Apple, Banana, Widget, Xylophone], [Carrot, Banana, Yam, Xylophone]) since every 2nd item matches in both, but the latter pair's raw Levenshtein distance will be twice as great.
Also bear in mind that Levenshtein does not make special allowances for "block moves": if you take a string, and move one of its substrings sufficiently far away, then the resulting pair (of original and modified strings) will have the same Levenshtein score as if the 2nd string had completely different elements at the position where the substring was moved to. If you want to take this into account, consider using a compression-based distance instead. (Although I say there that it's useful for computing distances without respect to order, it does of course favour ordered similarity to disordered similarity.)
check out SimMetrics on sourceforge for a platform supporting a variety of metrics able to use as a means to evaluate the best for a task.
for a commercially valid version check out K-Similarity from K-Now.co.uk.

Finding how similar two strings are

I'm looking for an algorithm that takes 2 strings and will give me back a "factor of similarity".
Basically, I will have an input that may be misspelled, have letters transposed, etc, and I have to find the closest match(es) in a list of possible values that I have.
This is not for searching in a database. I'll have an in-memory list of 500 or so strings to match against, all under 30 chars, so it can be relatively slow.
I know this exists, i've seen it before, but I can't remember its name.
Edit: Thanks for pointing out Levenshtein and Hamming.
Now, which one should I implement? They basically measure different things, both of which can be used for what I want, but I'm not sure which one is more appropriate.
I've read up on the algorithms, Hamming seems obviously faster. Since neither will detect two characters being transposed (ie. Jordan and Jodran), which I believe will be a common mistake, which will be more accurate for what I want?
Can someone tell me a bit about the trade-offs?
Ok, so the standard algorithms are:
1) Hamming distance
Only good for strings of the same length, but very efficient. Basically it simply counts the number of distinct characters. Not useful for fuzzy searching of natural language text.
2) Levenstein distance.
The Levenstein distance measures distance in terms of the number of "operations" required to transform one string to another. These operations include insertion, deletion and substition. The standard approach of calculating the Levenstein distance is to use dynamic programming.
3) Generalized Levenstein/(Damerau–Levenshtein distance)
This distance also takes into consideration transpositions of characters in a word, and is probably the edit distance most suited for fuzzy matching of manually-entered text. The algorithm to compute the distance is a bit more involved than the Levenstein distance (detecting transpositions is not easy). Most common implementations are a modification of the bitap algorithm (like grep).
In general you would probably want to consider an implementation of the third option implemented in some sort of nearest neighbour search based on a k-d tree
Levenstein distance
Hamming distance
soundex
metaphone
the Damerau-Levenshtein distance is similar to the Levenshtein distance, but also includes two-character transposition. the wikipedia page (linked) includes pseudocode that should be fairly trivial to implement.
You're looking for the Levenshtein distance

Resources