I'm doing an application that computers a large list of brands/domains and detects variations from pre-determined keywords.
Examples:
facebook vs facebo0k.com
linkedIn vs linkedln.com
stackoverflow vs stckoverflow
I'm wondering if for the simply purpose of comparing two strings and detect subtle variations, both algorithms meet the purpose so there is not added value of choosing one over another unless it's for performance improvement?
I would use Damerau–Levenshtein with the added twist that the cost of substitution for common misspellings ('I' vs 'l', '0' vs 'O') or mistypings ('Q' vs 'W' etc.) would be lower.
The Smith-Waterman algorithm is probably going to be more adapted to your task, since it allows you to define a score function that will reflect what you consider to be a 'similarity' between characters (for instance O is quite similar to 0 etc).
I think that it has the advantage of allowing you to define your own score function, which is not necessarily the case with the Vanilla version of the other algorithms you present.
This algorithm is widely used in bioinformatics, where biologists try to detect DNA sequences that may be different, but have the same, or very similar, functionalities (for instance, that AGC codes the same protein than GTA).
The algorithm runs in quadratic time using dynamic programming, and is fairly easy to implement.
If you are only considering either Levenshtein or Jaro-Winkler distances then you will probably want to go with Jaro-Winkler since it takes into account only matching characters and any required transpositions (swapping of characters) and is a value between zero and one and will be equal to 1 (no similarity) if there are no closely matching characters (making it easy to filter out any obvious non-matches).
A Levenshtein distance will give a value for any arbitrarily distant pair of strings no matter how different they are, requiring you to choose a cutoff threshold of what to consider.
However, Jaro-Winkler gives extra weight to prefix similarity (matching characters near the beginning of the strings). If this isn't desired, than regular Jaro distance might be what you want.
Related
I'm trying to write a function that detects how accurate the user typed a particular phrase/sentence/word/words. My objective is to build an app to train the user's typing accuracy of certain phrases.
My initial instinct is to use the basic levenshtein distance algorithm (mostly because that's the only algo I knew off the top of my head).
But after a bit more research, I saw that Jaro-Winkler is a slightly more interesting algorithm because of its consideration for transpositions.
I even found a link that talks about the differences between these algorithms:
Difference between Jaro-Winkler and Levenshtein distance?
Having read all that, in addition to the respective Wikipedia posts, I am still a little clueless as to which algorithm fits my objective the best.
Since you are grading the quality of typing, and you want to train the student to make zero mistakes, you should use Levenshtein distance, because it is less forgiving.
Additionally, Levenshtein score is more intuitive to understand, and easier to represent graphically, than the Jaro-Winkler results. You can modify Levenshtein algorithm to report insertions, deletions, and mistypes separately, and show end-users a list of corrections. Jaro-Winkler, on the other hand, gives you a score that is hard to show to end-user, because penalties for misspelling in the middle are lower than penalties at the end.
Slightly tongue-in-cheek, but only slightly: build a generative model for typing that gives high (prior) probability to hitting the right letter, and apportion out some probabilities for hitting two neighboring keys at once, two keys from different hands in the wrong order, two keys from the same hand in the wrong order, a key near the correct one, a key far from the correct one, etc. Or perhaps less ad-hoc: give your model a probability for a given sequence of keypresses given the current pair of keys needed to continue the passage. You could do a lot of things with such a model; for example, you could get a "distance"-like metric by giving a likelihood score for the learner's actual performance. But even better would be to give them a report summarizing which kinds of errors they make the most -- after all, why boil their performance down to a single number when many numbers would do? Bonus points if you learn the probabilities for the different kinds of errors from a large corpus of real typists' work.
I mostly agree with the answer given by dasblinkenlight, however, would suggest to use the Damerau-Levenshtein distance instead of only Levenshtein, that is, including transpositions. Transpositions are fairly frequent and easy to make while typing, and there is no good reason why they should incur a double distance penalty with respect to the other possible errors (insertions, deletions, and substitutions).
I am working with an OCR output and I'm searching for special words inside it.
As the output is not clean, I look for elements that match my inputs according to a word distance lower than a specific threshold.
However, I feel that the Levenshtein distance or the Hamming distance are not the best way, as the OCR always seem to make the same mistakes: I for 1, 0 for O, Q for O... and these "classic" mistakes seem to be less important than "A for K" for instance. As a result, these distances do not care of the amount of differences of the appearances of the characters (low / high).
Is there any word distance algorithm that was made specifically for OCR that I can use that would better fit my case? Or should I implement my custom word distance empirically according to the visual differences of characters?
The Levenshtein distance allows you to specify different costs for every substitution pair (http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications, fifth item). So you can tune it to your needs by giving more or less emphasis to the common mistakes.
I you want a custom cost function for letter mismatch you could look at the Needleman–Wunsch algorithm (NW)
Wikipedia http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
OCR paper related to the NW-algorithm http://oro.open.ac.uk/20855/1/paper-15.pdf
I have a big city database which was compiled from many different sources. I am trying to find a way to easily spot duplicates based on city name. The naive answer would be to use the levenshtein distance. However, the problem with cities is that they often have prefixes and suffixes which are common to the country they are in.
For example:
Boulleville vs. Boscherville
These are almost certainly different cities. However, because they both end with "ville" (and both begin with "Bo") they have a rather small Levenstein distance.
*I am looking for a string distance algorithm that takes into account the position of the character to minimize the effect of prefixes and suffixes by weighting letters in the middle of the word higher than letters at the ends of the word. *
I could probably write something myself but I would find it hard to believe that no one has yet published a suitable algorithm.
This is similar to stemming in Natural Language Programming.
In that field, the stem of a word is found before performing further analysis, e.g.
run => run
running => run
runs => run
(of course things like ran do not stem to run. For that one can use a lemmatizer. But I digress...). Even though stemming is far from perfect in NLP, it works remarkably well.
In your case, it may work well to stem the city using rules specific to city names before applying Levenstein. I'm not aware of a stemmer implementation for cities, but the rules seem on the surface to be fairly simple.
You might start with a list of prefixes and a list of suffixes (including any common variant / typo spellings) and simply remove such a prefix / suffix before checking the Levenstein distance.
On a side note, if you have additional address information (such as a street address or zip/postal code), there exists address normalization software for many countries that will find the best match based on address-specific algorithms.
A pretty simple way to do it would be to just remove the common prefix and suffix before doing the distance calculation. The absolute distance between the resulting strings will be the same as with the full strings, but when the shorter length is taken into account the distance looks much greater.
Also keep in mind that in general even grevious misspellings get the first letter right. It's highly likely, then, that Cowville and Bowville are different cities, even though their L. distance is only 1.
You can make your job a lot easier by, at least at first, not doing the distance calculation if two words start with the different letters. They're likely to be different. Concentrate first on removing duplicates of words that start with the same letters. If, after that, you still have a large number of potential duplicates, you can refine your distance threshold to more closely examine words that start with different letters.
I want to write a genetic algorithm that decodes a string encoded with a substitution cipher. The input will be a string of lowercase characters from a to z and space characters, which do not get encoded. For example,
uyd zjglk brsmh osc tjewn spdr uyd xqia fsv
is a valid encoding of
the quick brown fox jumps over the lazy dog
Notice that the space character does not get encoded.
Genes will be one-to-one, random character mappings.
To determine a gene's (or mapping's) fitness, the string to be decoded is applied this map, and the number of recognized English words in the result is counted.
The algorithm terminates when all the words in the input string are valid English words.
I do not want to use other techniques, such as frequency analysis.
Will this work? What can be said about performance?
Counting the number of valid words gives a fitness landscape that is very "plateau-y".
In your example string, every individual will be assigned an integral fitness value between 0 and 9, inclusive, with the vast majority being at the low end of that range. This means if you generate an initial population, it's likely that all of them will have a fitness of zero. This means you can't have meaningful selection pressure, and the whole thing looks quite a lot like a random walk. You'll occasionally stumble upon something that gets a word right, and at that point, the population will shift towards that individual.
Given enough time, (and assuming your words are short enough to have some hope of randomly finding one every once in a while), you will eventually find the string. Genetic algorithms with sensible (i.e., ergodic) operators will always find the optimal solution if you let them run far enough into the land of super-exponential time. However, it's very unlikely that a GA would be a very good way of solving the problem.
A genetic algorithm often has "recombination" as well as "mutation" to create a new generation from the previous one. You may want to consider this -- if you have two particular substitution ciphers in your generation and when you look at the parts of them that create english words, it may be possible to combine the non-conflicting parts of the two ciphers that create english words, and make a cipher that creates even more english words than either of the two original ciphers that you "mated." If you don't do this, then the genetic algorithm may take longer.
Also, you may want to alter your choice of "fitness" function to something more complex than simply how many english words the cipher makes. Intuitively, if there is an encrypted word that is fairly long (say 5 or more letters) and has some repeated letter(s), then if you succeed in translating this to an english word, it's probably typically much better evidence that this part of the cipher is correct, as opposed to if you have two or three different 2-letter words that translate to english.
As for the "will it work / what about performance", I agree with the general consensus that your genetic algorithm is basically a structured way to do random guessing, and initially it will probably often be hard to ensure your population of fit individuals have some individuals that are making good progress toward the correct solution, simply because there can be many ciphers that give incorrect english words, e.g. if you have a lot of 3-letter words with 3 distinct letters. So you will either need a huge population size (at least in the beginning), or you'll have to restart the algorithm if you determine that your population is not getting any fitter (because they are all stuck near local optima that give a moderate number of english words, but they're totally off-track from the correct solution).
For genetic algorithm you need a way to get next generation. Either you invent some way to cross two permutations into a third one or you just make random modifications of most successful permutations. The latter gives you essentially local search algorithm based on random walk, which is not too efficient in terms of time, but may converge.
The former won't do any good at all. For different permutations you may get non-zero word count even if they don't share a single correct letter pair. In short, substitution cypher is too nonlinear, so that your algorithm becomes a series of random guesses, something like bogosort. You may evaluate not a number of words, but something like "likelihood" of letter chains, but it will be pretty much a kind of frequency analysis.
I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.
The Wikipedia article on edit distance gives some good background on the concept.
By taking "chunk transposition" into account, I mean that
Turing, Alan.
should match
Alan Turing
more closely than it matches
Turing Machine
I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).
In the case of your application you should probably think about adapting some algorithms from bioinformatics.
For example you could firstly unify your strings by making sure, that all separators are spaces or anything else you like, such that you would compare "Alan Turing" with "Turing Alan". And then split one of the strings and do an exact string matching algorithm ( like the Horspool-Algorithm ) with the pieces against the other string, counting the number of matching substrings.
If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.
Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.
Well, that was a rather abstract answer, but I hope it points you in the right direction, but sadly it doesn't provide you with a simple formula to solve your problem.
Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF," the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW--n-grams are often used as well.
One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) means "The compressed size of "A concatenated with B". Next, compute the fraction (C(AB) - min(C(A),C(B))) / max(C(A), C(B)) This value is called NCD(A,B) and measures similarity similar to edit distance but supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style similarity that you are describing. If two strings are similar the compressed size of the concatenation will be near the size of each alone so the numerator will be near 0 and the result will be near 0. If two strings are very dissimilar the compressed size together will be roughly the sum of the compressed sizes added and so the result will be near 1. This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression program like zlib. It is because most of the "hard" work such as heuristics and optimization has already been done in the data compression part and this formula simply extracts the amount of similar patterns it found using generic information theory that is agnostic to language. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few hundred byte size range you describe. For more information on this and a sample implementation just search Normalized Compression Distance (NCD) or have a look at the following paper and github project:
http://arxiv.org/abs/cs/0312044 "Clustering by Compression"
https://github.com/rudi-cilibrasi/libcomplearn C language implementation
There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.
I think you're looking for Jaro-Winkler distance which is precisely for name matching.
You might find compression distance useful for this. See an answer I gave for a very similar question.
Or you could use a k-tuple based counting system:
Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list. (O(knlog(n) time.)
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.
With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
I'm not sure that what you really want is edit distance -- which works simply on strings of characters -- or semantic distance -- choosing the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to distinguish which is the most appropriate matching term/phrase given a specific term or phrase. In a sense what you're doing is comparing very short documents rather than strings of characters.