Alternative to Levenshtein distance for prefixes / suffixes - algorithm

I have a big city database which was compiled from many different sources. I am trying to find a way to easily spot duplicates based on city name. The naive answer would be to use the Levenshtein distance. However, the problem with cities is that they often have prefixes and suffixes which are common to the country they are in.
For example:
Boulleville vs. Boscherville
These are almost certainly different cities. However, because they both end with "ville" (and both begin with "Bo") they have a rather small Levenshtein distance.
I am looking for a string distance algorithm that takes into account the position of the character to minimize the effect of prefixes and suffixes, by weighting letters in the middle of the word higher than letters at the ends of the word.
I could probably write something myself but I would find it hard to believe that no one has yet published a suitable algorithm.

This is similar to stemming in Natural Language Processing.
In that field, the stem of a word is found before performing further analysis, e.g.
run => run
running => run
runs => run
(of course things like ran do not stem to run. For that one can use a lemmatizer. But I digress...). Even though stemming is far from perfect in NLP, it works remarkably well.
In your case, it may work well to stem the city name using rules specific to city names before applying Levenshtein. I'm not aware of a stemmer implementation for cities, but the rules seem, on the surface, to be fairly simple.
You might start with a list of prefixes and a list of suffixes (including any common variant / typo spellings) and simply remove any such prefix / suffix before checking the Levenshtein distance.
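To make the idea concrete, here is a minimal sketch in Python. The affix lists are invented for illustration only; a real implementation would use per-country lists compiled from your data.

```python
# Hypothetical, illustrative affix lists -- a real list would be per-country
# and include common variant/typo spellings.
CITY_PREFIXES = ("saint ", "st. ", "st ")
CITY_SUFFIXES = ("ville", "bourg", "burg", "berg")

def stem_city(name):
    """Strip one known prefix and one known suffix, if present."""
    name = name.lower().strip()
    for p in CITY_PREFIXES:
        if name.startswith(p):
            name = name[len(p):]
            break
    for s in CITY_SUFFIXES:
        if name.endswith(s):
            name = name[:-len(s)]
            break
    return name

def levenshtein(a, b):
    """Plain two-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def city_distance(a, b):
    return levenshtein(stem_city(a), stem_city(b))

# "Boulleville" vs "Boscherville" is now compared as "boulle" vs "boscher",
# which keeps the two clearly apart.
```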
On a side note, if you have additional address information (such as a street address or zip/postal code), there exists address normalization software for many countries that will find the best match based on address-specific algorithms.

A pretty simple way to do it would be to just remove the common prefix and suffix before doing the distance calculation. The absolute distance between the resulting strings will be the same as with the full strings, but when the shorter length is taken into account the distance looks much greater.
Also keep in mind that, in general, even grievous misspellings get the first letter right. It's highly likely, then, that Cowville and Bowville are different cities, even though their Levenshtein distance is only 1.
You can make your job a lot easier by, at least at first, not doing the distance calculation if two words start with different letters. They're likely to be different. Concentrate first on removing duplicates among words that start with the same letters. If, after that, you still have a large number of potential duplicates, you can refine your distance threshold to more closely examine words that start with different letters.
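A hedged sketch of the strip-then-compare approach, reusing the levenshtein() helper from the earlier sketch; dividing by the shorter stripped length is one reasonable way to make the distance "look greater", not the only one.

```python
import os

def strip_common_affixes(a, b):
    """Remove the prefix and the suffix that the two strings share."""
    p = len(os.path.commonprefix([a, b]))
    a, b = a[p:], b[p:]
    s = len(os.path.commonprefix([a[::-1], b[::-1]]))
    if s:
        a, b = a[:-s], b[:-s]
    return a, b

def relative_distance(a, b):
    """Edit distance of the stripped cores, relative to the shorter core."""
    a2, b2 = strip_common_affixes(a, b)
    return levenshtein(a2, b2) / (min(len(a2), len(b2)) or 1)
```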

Related

Choosing Levenshtein vs Jaro Winkler?

I'm building an application that processes a large list of brands/domains and detects variations from pre-determined keywords.
Examples:
facebook vs facebo0k.com
linkedIn vs linkedln.com
stackoverflow vs stckoverflow
I'm wondering whether, for the simple purpose of comparing two strings and detecting subtle variations, both algorithms serve equally well, so that there is no added value in choosing one over the other except perhaps for performance.
I would use Damerau–Levenshtein with the added twist that the cost of substitution for common misspellings ('I' vs 'l', '0' vs 'O') or mistypings ('Q' vs 'W' etc.) would be lower.
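For illustration, a sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance with a pluggable substitution-cost hook; the look-alike pairs and the 0.2 cost are made-up examples.

```python
# Made-up look-alike pairs; extend with mistypings such as ('Q', 'W').
LOOKALIKE = {frozenset(("I", "l")): 0.2,
             frozenset(("0", "O")): 0.2,
             frozenset(("0", "o")): 0.2}

def sub_cost(x, y):
    """Cheap substitutions for characters that are easy to confuse visually."""
    return LOOKALIKE.get(frozenset((x, y)), 1.0)

def weighted_damerau_levenshtein(a, b, cost=sub_cost):
    """Optimal-string-alignment distance with weighted substitutions."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = float(i)
    for j in range(m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else cost(a[i - 1], b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # (weighted) substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[n][m]

weighted_damerau_levenshtein("facebook", "facebo0k")  # the 'o' -> '0' substitution costs 0.2
```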
The Smith-Waterman algorithm is probably going to be more adapted to your task, since it allows you to define a score function that will reflect what you consider to be a 'similarity' between characters (for instance O is quite similar to 0 etc).
I think that it has the advantage of allowing you to define your own score function, which is not necessarily the case with the Vanilla version of the other algorithms you present.
This algorithm is widely used in bioinformatics, where biologists try to detect DNA sequences that may be different but have the same, or very similar, functionality (for instance, because different codons can code for the same amino acid).
The algorithm runs in quadratic time using dynamic programming, and is fairly easy to implement.
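A minimal Smith-Waterman sketch with a pluggable score function; the +2/+1/-1 values and the NEAR pairs are illustrative assumptions, not canonical parameters.

```python
NEAR = {frozenset(("0", "O")), frozenset(("I", "l")), frozenset(("1", "l"))}

def char_score(x, y):
    """Illustrative scoring: reward exact matches, half-reward look-alikes."""
    if x == y:
        return 2.0
    if frozenset((x, y)) in NEAR:
        return 1.0
    return -1.0

def smith_waterman(a, b, score=char_score, gap=-1.0):
    """Best local-alignment score between a and b (quadratic time and space)."""
    n, m = len(a), len(b)
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(0.0,
                          h[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # (mis)match
                          h[i - 1][j] + gap,                            # gap in b
                          h[i][j - 1] + gap)                            # gap in a
            best = max(best, h[i][j])
    return best
```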
If you are only considering either Levenshtein or Jaro-Winkler distances, then you will probably want to go with Jaro-Winkler, since it takes into account only matching characters and any required transpositions (swaps of characters). It yields a value between zero and one, and, viewed as a distance, it is exactly 1 (no similarity) when there are no closely matching characters, which makes it easy to filter out obvious non-matches.
A Levenshtein distance will give a value for any arbitrarily distant pair of strings no matter how different they are, requiring you to choose a cutoff threshold of what to consider.
However, Jaro-Winkler gives extra weight to prefix similarity (matching characters near the beginning of the strings). If this isn't desired, then the regular Jaro distance might be what you want.
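For reference, a self-contained sketch of the Jaro and Jaro-Winkler calculations. Note that it returns a similarity (1.0 for identical strings); subtract from 1.0 if you prefer to think in terms of distance as above.

```python
def jaro(s1, s2):
    """Jaro similarity: matching characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(i + window + 1, len(s2))
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions among the matched characters
    t, k = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro similarity by the length of the common prefix (up to 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```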

Genetic Algorithm and Substitution Cipher

I want to write a genetic algorithm that decodes a string encoded with a substitution cipher. The input will be a string of lowercase characters from a to z and space characters, which do not get encoded. For example,
uyd zjglk brsmh osc tjewn spdr uyd xqia fsv
is a valid encoding of
the quick brown fox jumps over the lazy dog
Notice that the space character does not get encoded.
Genes will be one-to-one, random character mappings.
To determine a gene's (or mapping's) fitness, the mapping is applied to the string to be decoded, and the number of recognized English words in the result is counted.
The algorithm terminates when all the words in the input string are valid English words.
I do not want to use other techniques, such as frequency analysis.
Will this work? What can be said about performance?
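For concreteness, a minimal sketch of the fitness function described above; ENGLISH_WORDS is a placeholder for a real word list loaded from disk.

```python
# Placeholder word list -- in practice load /usr/share/dict/words or similar.
ENGLISH_WORDS = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def decode(ciphertext, mapping):
    """Apply a one-to-one character mapping; spaces pass through unchanged."""
    return "".join(mapping.get(c, c) for c in ciphertext)

def fitness(ciphertext, mapping):
    """Number of decoded tokens that are recognized English words."""
    return sum(w in ENGLISH_WORDS for w in decode(ciphertext, mapping).split())
```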
Counting the number of valid words gives a fitness landscape that is very "plateau-y".
In your example string, every individual will be assigned an integral fitness value between 0 and 9, inclusive, with the vast majority being at the low end of that range. This means if you generate an initial population, it's likely that all of them will have a fitness of zero. This means you can't have meaningful selection pressure, and the whole thing looks quite a lot like a random walk. You'll occasionally stumble upon something that gets a word right, and at that point, the population will shift towards that individual.
Given enough time, (and assuming your words are short enough to have some hope of randomly finding one every once in a while), you will eventually find the string. Genetic algorithms with sensible (i.e., ergodic) operators will always find the optimal solution if you let them run far enough into the land of super-exponential time. However, it's very unlikely that a GA would be a very good way of solving the problem.
A genetic algorithm often has "recombination" as well as "mutation" to create a new generation from the previous one. You may want to consider this -- if you look at the parts of two particular substitution ciphers in your generation that already create English words, it may be possible to combine the non-conflicting parts of the two and make a cipher that creates even more English words than either of the two originals you "mated." If you don't do this, the genetic algorithm may take longer.
Also, you may want to alter your choice of "fitness" function to something more complex than simply how many English words the cipher produces. Intuitively, if a fairly long encrypted word (say 5 or more letters) with some repeated letter(s) translates to an English word, that is typically much stronger evidence that this part of the cipher is correct than two or three different 2-letter words translating to English.
As for the "will it work / what about performance" question: I agree with the general consensus that your genetic algorithm is basically a structured way to do random guessing. Initially it will often be hard to ensure your population contains individuals making real progress toward the correct solution, simply because there can be many ciphers that produce incorrect English words, e.g. if you have a lot of 3-letter words with 3 distinct letters. So you will either need a huge population size (at least in the beginning), or you'll have to restart the algorithm whenever you determine that your population is not getting any fitter (because the individuals are all stuck near local optima that give a moderate number of English words but are totally off-track from the correct solution).
For a genetic algorithm you need a way to produce the next generation. Either you invent some way to cross two permutations into a third one, or you just make random modifications of the most successful permutations. The latter essentially gives you a local search based on a random walk, which is not very efficient in terms of time, but may converge.
The former won't do any good at all. For different permutations you may get a non-zero word count even if they don't share a single correct letter pair. In short, a substitution cipher is too nonlinear, so the algorithm becomes a series of random guesses, something like bogosort. You could evaluate not the number of words but something like the "likelihood" of letter chains, but that would be pretty much a kind of frequency analysis.
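To illustrate what the "random modifications of the most successful permutations" route looks like, here is a hedged hill-climbing sketch; it reuses the fitness() helper sketched earlier and accepts ties so the search can drift across the plateaus mentioned above.

```python
import random
import string

def random_mapping():
    """A random one-to-one mapping over the lowercase alphabet."""
    src = list(string.ascii_lowercase)
    dst = src[:]
    random.shuffle(dst)
    return dict(zip(src, dst))

def mutate(mapping):
    """Swap the images of two letters; the result is still a permutation."""
    new = dict(mapping)
    a, b = random.sample(list(new), 2)
    new[a], new[b] = new[b], new[a]
    return new

def local_search(ciphertext, steps=200_000):
    current = random_mapping()
    best = fitness(ciphertext, current)
    for _ in range(steps):
        candidate = mutate(current)
        score = fitness(ciphertext, candidate)
        if score >= best:          # accept ties to keep moving on flat fitness
            current, best = candidate, score
    return current, best
```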

Fuzzy Matching Numbers

I've been working with Double Metaphone and Caverphone2 for string comparisons, and they work well on things like names, addresses, etc. (Caverphone2 is working best for me). However, they produce way too many false positives when you get to numeric values, such as phone numbers, IP addresses, credit card numbers, etc.
So I've looked at the Luhn and Verhoeff algorithms, and they describe essentially what I want, but not quite. They seem good for validation, but do not appear to be built for fuzzy matching. Is there anything that behaves like Luhn and Verhoeff, detecting single-digit errors and transposition errors involving two adjacent digits, but built for encoding and comparison purposes similar to the fuzzy string algorithms?
I'd like to encode a number, then compare it to 100,000 other numbers to find closely identical matches. So something like 7041234 would match against 7041324 as a possible transcription error, but something like 4213704 would not.
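One way to get exactly those two error classes is a direct check, sketched below; for matching one number against 100,000 others you would still want to bucket candidates first (e.g. by length), which is left out here.

```python
def is_close_number(a: str, b: str) -> bool:
    """True if b differs from a by one substituted digit or by one swap of
    two adjacent digits -- the error classes Luhn/Verhoeff are built around."""
    if len(a) != len(b):
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diffs) == 1:
        return True                     # single-digit error
    if len(diffs) == 2:
        i, j = diffs
        return j == i + 1 and a[i] == b[j] and a[j] == b[i]  # adjacent transposition
    return False

is_close_number("7041234", "7041324")   # True  (adjacent transposition)
is_close_number("7041234", "4213704")   # False
```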
Levenshtein and friends may be good for finding the distance between two specific strings or numbers. However, if you want to build a spelling corrector you don't want to run through your entire word database at every query.
Peter Norvig wrote a very nice article on a simple "fuzzy matching" spelling corrector based on some of the technology behind Google's spelling suggestions.
If your dictionary has N entries and the average word has length L, the "brute force Levenshtein" approach takes O(N*L^2) time: an O(L^2) dynamic program against each of the N entries. Peter Norvig's approach instead generates all words within a certain edit distance of the input and looks them up in the dictionary, so the work is on the order of L^k candidate lookups (ignoring alphabet-size factors), where k is the largest edit distance considered.
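The core of that approach is a candidate generator; the sketch below mirrors the edits1() idea from Norvig's article (the function and variable names here are my own).

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """Every string one edit away: deletes, adjacent transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, dictionary):
    """Dictionary entries (a set) within one edit of `word` -- no full scan required."""
    return edits1(word) & dictionary
```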

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.
The Wikipedia article on edit distance gives some good background on the concept.
By taking "chunk transposition" into account, I mean that
Turing, Alan.
should match
Alan Turing
more closely than it matches
Turing Machine
I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).
In the case of your application you should probably think about adapting some algorithms from bioinformatics.
For example, you could first normalize your strings by making sure that all separators are spaces (or whatever else you prefer), so that you would compare "Alan Turing" with "Turing Alan". Then split one of the strings and run an exact string-matching algorithm (such as the Horspool algorithm) with the pieces against the other string, counting the number of matching substrings.
If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable, since it provides a score that describes the similarity; however, the Smith-Waterman algorithm referenced in another answer is probably a bit overkill for your application, and not even the best local alignment algorithm available.
Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.
Well, that was a rather abstract answer, but I hope it points you in the right direction; sadly it doesn't provide you with a simple formula to solve your problem.
Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF," the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW--n-grams are often used as well.
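In code, the character-level Jaccard calculation is a two-liner (this computes the similarity; the distance is 1 minus it):

```python
def jaccard_similarity(a, b):
    """|unique characters in common| / |unique characters overall|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF")   # 0.875, as in the example above
```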
One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language, such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) is the compressed size of A concatenated with B. Next, compute the fraction

    NCD(A,B) = (C(AB) - min(C(A), C(B))) / max(C(A), C(B))

This value measures similarity much like edit distance does, but it supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style of similarity that you are describing. If two strings are similar, the compressed size of the concatenation will be near the size of each alone, so the numerator will be near 0 and the result will be near 0. If two strings are very dissimilar, the compressed size of the concatenation will be roughly the sum of the individual compressed sizes, and so the result will be near 1.

This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression library like zlib, because most of the "hard" work -- the heuristics and optimization -- has already been done in the compressor, and the formula simply extracts the amount of shared patterns it found, using generic information theory that is agnostic to language. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few-hundred-byte size range you describe. For more information and a sample implementation, search for Normalized Compression Distance (NCD) or have a look at the following paper and GitHub project:
http://arxiv.org/abs/cs/0312044 "Clustering by Compression"
https://github.com/rudi-cilibrasi/libcomplearn C language implementation
There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.
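A minimal NCD sketch using zlib, taken directly from the formula above; as noted, it is most meaningful for inputs in the size range you describe, since compressor overhead dominates on very short strings.

```python
import zlib

def ncd(a: bytes, b: bytes) -> float:
    """Normalized Compression Distance: near 0 for similar inputs, near 1 for dissimilar."""
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

d = ncd(b"Turing, Alan.", b"Alan Turing")   # smaller values mean more shared structure
```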
I think you're looking for Jaro-Winkler distance which is precisely for name matching.
You might find compression distance useful for this. See an answer I gave for a very similar question.
Or you could use a k-tuple based counting system (a short sketch follows this list):
Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list (O(kn log n) time).
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.
With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
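A compact version of the k-tuple count; a Counter intersection plays the role of the sort-and-merge step described above.

```python
from collections import Counter

def ktuple_similarity(a, b, k=4):
    """Number of length-k substrings the two strings share (multiset intersection)."""
    grams_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    grams_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum((grams_a & grams_b).values())
```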
I'm not sure whether what you really want is edit distance -- which works purely on strings of characters -- or semantic distance -- choosing the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to distinguish the most appropriate matching term/phrase for a given term or phrase. In a sense, what you're doing is comparing very short documents rather than strings of characters.

Efficient word scramble algorithm

I'm looking for an efficient algorithm for scrambling a set of letters into a permutation containing the maximum number of words.
For example, say I am given the list of letters: {e, e, h, r, s, t}. I need to order them in such a way as to contain the maximum number of words. If I order those letters into "theres", it contains the words "the", "there", "her", "here", and "ere". So that example could have a score of 5, since it contains 5 words. I want to order the letters in such a way as to have the highest score (contain the most words).
A naive algorithm would be to try and score every permutation. I believe this is O(n!), so 720 different permutations would be tried for just the 6 letters above (including some duplicates, since the example has e twice). For more letters, the naive solution quickly becomes impossible, of course.
The algorithm doesn't have to actually produce the very best solution, but it should find a good solution in a reasonable amount of time. For my application, simply guessing (Monte Carlo) at a few million permutations works quite poorly, so that's currently the mark to beat.
I am currently using the Aho-Corasick algorithm to score permutations. It searches for each word in the dictionary in just one pass through the text, so I believe it's quite efficient. This also means I have all the words stored in a trie, but if another algorithm requires different storage that's fine too. I am not worried about setting up the dictionary, just the run time of the actual ordering and searching. Even a fuzzy dictionary could be used if needed, like a Bloom Filter.
For my application, the list of letters given is about 100, and the dictionary contains over 100,000 entries. The dictionary never changes, but several different lists of letters need to be ordered.
I am considering trying a path-finding algorithm. I believe I could use a random letter from the list as a starting point, and each remaining letter would then be used to extend a "path." I think this would work well with the Aho-Corasick scoring algorithm, since scores could be built up one letter at a time. I haven't tried path finding yet though; maybe it's not even a good idea? I don't know which path-finding algorithm might be best.
Another algorithm I thought of also starts with a random letter. Then the dictionary trie would be searched for "rich" branches containing the remaining letters, and dictionary branches containing unavailable letters would be pruned. I'm a bit foggy on the details of how this would work exactly, but it could completely eliminate scoring permutations.
Here's an idea, inspired by Markov Chains:
Precompute the letter transition probabilities in your dictionary. Create a table with the probability that some letter X is followed by another letter Y, for all letter pairs, based on the words in the dictionary.
Generate permutations by randomly choosing each next letter from the remaining pool of letters, based on the previous letter and the probability table, until all letters are used up. Run this many times.
You can experiment by increasing the "memory" of your transition table - don't look only one letter back, but say 2 or 3. This enlarges the probability table, but gives you a better chance of creating a valid word.
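A rough sketch of that idea with a one-letter memory; the add-one smoothing is an assumption so that letter pairs never seen in the dictionary still have a nonzero chance of being chosen.

```python
import random
from collections import defaultdict

def build_transitions(words):
    """Count how often each letter is followed by each other letter in the dictionary."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        for a, b in zip(word, word[1:]):
            counts[a][b] += 1
    return counts

def sample_permutation(letters, counts):
    """Order the letters by repeatedly sampling the next one, weighted by how
    often it follows the previous letter (add-one smoothing for unseen pairs)."""
    pool = list(letters)
    random.shuffle(pool)
    out = [pool.pop()]                     # random starting letter
    while pool:
        weights = [counts[out[-1]][c] + 1 for c in pool]
        nxt = random.choices(pool, weights=weights, k=1)[0]
        pool.remove(nxt)
        out.append(nxt)
    return "".join(out)
```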
You might try simulated annealing, which has been used successfully for complex optimization problems in a number of domains. Basically you do randomized hill-climbing while gradually reducing the randomness. Since you already have the Aho-Corasick scoring you've done most of the work already. All you need is a way to generate neighbor permutations; for that something simple like swapping a pair of letters should work fine.
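A bare-bones annealing loop for this problem might look as follows; the `score` callable is assumed to be your Aho-Corasick word counter, and the cooling schedule and step count are arbitrary starting points to tune.

```python
import math
import random

def anneal(letters, score, steps=100_000, t_start=2.0, t_end=0.01):
    """Randomized hill-climbing over orderings, with gradually reduced randomness."""
    current = list(letters)
    random.shuffle(current)
    cur = best = score("".join(current))
    best_order = current[:]
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)   # geometric cooling
        i, j = random.sample(range(len(current)), 2)        # neighbor = swap two letters
        current[i], current[j] = current[j], current[i]
        new = score("".join(current))
        # always accept improvements; accept worse moves with Boltzmann probability
        if new >= cur or random.random() < math.exp((new - cur) / t):
            cur = new
            if new > best:
                best, best_order = new, current[:]
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return "".join(best_order), best
```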
Have you thought about using a genetic algorithm? You have the beginnings of your fitness function already. You could experiment with the mutation and crossover (thanks Nathan) algorithms to see which do the best job.
Another option would be for your algorithm to build the smallest possible word from the input set, and then add one letter at a time so that the new string is, or contains, another word. Start with a few different starting words for each input set and see where it leads.
Just a few idle thoughts.
It might be useful to check how others solved this:
http://sourceforge.net/search/?type_of_search=soft&words=anagram
On this page you can generate anagrams online. I've played around with it for a while and it's great fun. It doesn't explain in detail how it does its job, but the parameters give some insight.
http://wordsmith.org/anagram/advanced.html
Using JavaScript and Node.js, I implemented a jumble solver that uses a dictionary to build a tree and then traverses the tree to collect all possible words. I explained the algorithm in detail in this article and put the source code on GitHub:
Scramble or Jumble Word Solver with Express and Node.js
