Finding nearest string to a pair in a lexicon - algorithm

I am currently trying to come up with an efficient solution to the problem with the following formulation:
Given an input string s and a fixed lexicon, find a string w1||w2 (|| denotes concatenation; w1 and w2 are words in the lexicon) with the lowest Levenshtein distance to s.
The obvious naive solution is:
for word1 in lexicon:
    for word2 in lexicon:
        if lev_dist(s, word1 + word2) < lev_dist(s, lowest):
            lowest = word1 + word2
I'm sure there must be better solutions to the problem. Can anyone offer any insight?

You may be able to do a bit better by putting lower bounds on the cost of individual strings.
Looking at the algorithm in http://en.wikipedia.org/wiki/Levenshtein_distance: when you compute d[i, j] for the distance, you are adding in a contribution that depends on s[i] and t[j], where s and t are the strings being compared. This means you can make the costs of change/delete/insert depend on the position of the operation within the two strings.
This means that you can compute the distance between abcXXX and abcdef using a cost function in which operations on the characters marked XXX are free. This allows you to compute the cost of transforming abcXXX to abcdef if the string XXX is in fact the most favourable string possible.
So for each word w1 in the lexicon compute the distance between w1XXX and the target string and XXXw1 and the target string. Produce two copies of the lexicon, sorted in order of w1XXX distance and XXXw1 distance. Now try all pairs in order of the sum of left hand and right hand costs, which is a lower bound on the cost of that pair. Keep track of the best answer so far. When the best answer is at least as good as the next lower bound cost you encounter, you know that nothing you can try can improve on this best answer, so you can stop.
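A rough Python sketch of this scheme (all names here are mine): prefix_bound and suffix_bound compute the w1XXX and XXXw1 costs as the cheapest edit of the word into some prefix or suffix of the target, and a heap enumerates pairs in order of their lower-bound sum, stopping as soon as the next bound cannot beat the best pair found. It assumes a non-empty lexicon.

```python
import heapq

def levenshtein(a, b):
    # standard rolling-row edit-distance DP
    prev = list(range(len(b) + 1))
    for i, ac in enumerate(a, 1):
        cur = [i]
        for j, bc in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ac != bc)))
        prev = cur
    return prev[-1]

def prefix_bound(w, s):
    # cost of turning wXXX into s when operations on XXX are free:
    # the cheapest way to edit w into some prefix of s (min of last DP row)
    prev = list(range(len(s) + 1))
    for i, wc in enumerate(w, 1):
        cur = [i]
        for j, sc in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (wc != sc)))
        prev = cur
    return min(prev)

def suffix_bound(w, s):
    # cost of turning XXXw into s: mirror image of prefix_bound
    return prefix_bound(w[::-1], s[::-1])

def best_pair(lexicon, s):
    left = sorted(lexicon, key=lambda w: prefix_bound(w, s))
    right = sorted(lexicon, key=lambda w: suffix_bound(w, s))
    lb = [prefix_bound(w, s) for w in left]
    rb = [suffix_bound(w, s) for w in right]
    best, best_d = None, float("inf")
    # enumerate pairs in order of lower-bound sum using a heap
    heap, seen = [(lb[0] + rb[0], 0, 0)], {(0, 0)}
    while heap:
        bound, i, j = heapq.heappop(heap)
        if bound >= best_d:
            break  # no remaining pair can beat the current best
        d = levenshtein(left[i] + right[j], s)
        if d < best_d:
            best, best_d = left[i] + right[j], d
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (lb[ni] + rb[nj], ni, nj))
    return best, best_d
```

The bound is valid because in any optimal alignment of w1w2 to s, w1 aligns to some prefix of s and w2 to the remaining suffix, each costing at least its respective bound.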

I assume you want to do this many times for the same lexicon. You've got a misspelled word and suspect it's caused by the lack of a space between two words, for example.
The first thing you'll surely need is a way to estimate string "closeness". I'm fond of normalization techniques. For example, replace each letter by a representative from an equivalence class. (Perhaps M and N both go to M because they sound similar. Perhaps PH --> F for a similar reason.)
Now, you'll want your normalized lexicon entered both frontwards and backwards into a trie or some similar structure.
Now, search for your needle both forwards and backwards, but keep track of intermediate results for both directions. In other words, at each position in the search string, keep track of the list of candidate trie nodes which have been selected at that position.
Now, compare the forwards- and backwards-looking arrays of intermediate results, looking for places that look like a good join point between words. You might also check for join points off-by-one from each other. (In other words, you've found the end of the first word and the beginning of the second.)
If you do, then you've found your word pair.
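A minimal Python sketch of the forward/backward trie idea (names are mine; it omits the normalization step and the off-by-one join checks, and only handles exact matches): walk the forward trie along s to find where a first word could end, walk the reversed-word trie along reversed s to find where a second word could start, and intersect the two position sets.

```python
def build_trie(words):
    # nested-dict trie; "$" marks the end of a word
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def word_prefix_lengths(s, root):
    # lengths i such that s[:i] is a word in the trie
    lengths, node = [], root
    for i, ch in enumerate(s):
        node = node.get(ch)
        if node is None:
            break
        if "$" in node:
            lengths.append(i + 1)
    return lengths

def split_candidates(s, lexicon):
    fwd = build_trie(lexicon)                   # words frontwards
    bwd = build_trie(w[::-1] for w in lexicon)  # words backwards
    ends = set(word_prefix_lengths(s, fwd))     # first word could end here
    starts = {len(s) - j                        # second word could start here
              for j in word_prefix_lengths(s[::-1], bwd)}
    return sorted(ends & starts)
```

Each returned position is a join point where the end of one dictionary word meets the start of another.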

If you are running lots of queries on the same lexicon and want to improve the query time, but can afford some time for preprocessing, you can create a trie containing all possible words in the form w1 || w2. Then you can use the algorithm described here: Fast and Easy Levenshtein distance using a Trie to find the answer for any word you need.
The algorithm basically walks the nodes of the trie, keeping track of the current minimum. If you end up in some node where the Levenshtein distance (between the word spelled from the root to the current node and the input string s) is already larger than the minimum achieved so far, you can prune the entire subtree rooted at that node, because it cannot yield an answer.
In my testing with a dictionary of English words and random query words, this is anywhere between 30 and 300 times faster than the normal approach of testing every word in the dictionary, depending on the type of queries you run on it.
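A Python sketch of the trie walk with pruning (my own naming; shown over plain words for brevity, but a trie of w1||w2 concatenations is searched the same way). Each trie node carries one DP row; a subtree is abandoned when the row minimum already matches or exceeds the best distance found, since every word below it costs at least that much.

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w  # store the full word at its end node
    return root

def closest(root, s):
    best = [None, float("inf")]  # [word, distance]

    def recurse(node, row):
        # row[j] = edit distance between the current prefix and s[:j]
        if "$" in node and row[-1] < best[1]:
            best[0], best[1] = node["$"], row[-1]
        if min(row) >= best[1]:
            return  # prune: every word below this node costs >= min(row)
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(s) + 1):
                new_row.append(min(new_row[j - 1] + 1, row[j] + 1,
                                   row[j - 1] + (s[j - 1] != ch)))
            recurse(child, new_row)

    recurse(root, list(range(len(s) + 1)))
    return best[0], best[1]
```

Shared prefixes are scored once for the whole subtree, which is where the speedup over testing each dictionary word separately comes from.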

Related

How to do text corrections with a suffix array?

We used a suffix array to implement search by keywords. For example, consider the phrase:
white bathroom tile
we insert suffixes:
1) white bathroom tile
2) bathroom tile
3) tile
and now the phrase "white bathroom tile" can be found if a user types in words: "white", "bathroom" or "tile".
However, now there's a problem, a person can type in "tyle" and nothing will be found.
So, I wanted to ask how to implement some sort of fast fuzzy search for this. Basically I want this algorithm to correct the user and still find "tile".
I considered applying Levenshtein distance, but my attempt failed. The idea was that we could find the group of words that start with "t", compute the Levenshtein distance for each one of them, and then return the results where the distance was minimal.
This failed because the user can type in "iile" instead of "tile", and then nothing works: my algorithm applies the Levenshtein distance to the words in the "i" group.
What is a good way to solve this?
You can use an edit distance algorithm to find the list of words that have the minimum edit distance to the searched word.
For example, for both tyle and ile, the edit distance to the searched word tile is 1. For the word iile, the edit distance between tile and iile is 1 as well.
Update
If traversing all words in the suffix array and calculating the edit distance is slow (it is; computing the edit distance is quadratic, O(n*m), in the lengths of the two strings), I would suggest building a prefix tree (trie) with all the suffixes of the sentence. Then during lookup, for example for the word tyle, traverse the prefix tree in this way:
If there is a node in the prefix tree for the current character, traverse into that node.
If there is no node for the current character, recursively traverse all child nodes and skip this character.
During lookup, count the number of characters you skipped. The fewer characters you skip, the better a candidate the word is.
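A small Python sketch of this skip-based lookup (names and details are my own; the mismatch rule is simplified to "either skip the query character or skip a trie character", so a substitution costs two skips rather than one):

```python
def build_trie(words):
    # nested-dict trie; "$" stores the full word at its end node
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w
    return root

def fuzzy_lookup(node, word, skips=0, best=None):
    # returns {matched word: fewest characters skipped to reach it}
    if best is None:
        best = {}
    if "$" in node and not word:
        w = node["$"]
        best[w] = min(best.get(w, skips), skips)
    if word:
        ch = word[0]
        if ch in node:
            fuzzy_lookup(node[ch], word[1:], skips, best)   # follow the match
        else:
            fuzzy_lookup(node, word[1:], skips + 1, best)   # skip query char
            for other, child in node.items():               # skip trie char
                if other != "$":
                    fuzzy_lookup(child, word, skips + 1, best)
    return best
```

For the example above, looking up "tyle" or "iile" against a trie containing "tile" reaches "tile" with the fewest skips, so it ranks first among the candidates.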
Found this interesting article about a data structure called BK-tree and related algorithms. So, I'm considering using a BK-tree.
Also this article talks about even more powerful methods.
Levenshtein distance is better for words. In addition, you could use cosine similarity, a measure of similarity between two non-zero vectors of an inner product space, given by the cosine of the angle between them.
For similarity between sentences or paragraphs, you could use the TF-IDF measure.

Levenshtein Transpose Distance

How can I implement the transpose/swap/twiddle/exchange distance alone using dynamic programming? I must stress that I do not want to check for the other operations (i.e. copy, delete, insert, kill, etc.), just transpose/swap.
I wish to apply Levenshtein's algorithm just for swap distance. What would the code look like?
I'm not sure that Levenshtein's algorithm can be used in this case. Without insert or delete operations, the distance is well defined only between strings of the same length with the same characters. Examples of string pairs that cannot be transformed into one another using only transpositions:
AB, ABC
AAB, ABB
With that, the algorithm could be to find all possible permutations of the positions of characters that are not in the same places in both strings, and to look for the one that can be achieved with the minimum number of transpositions or swaps.
An efficient application of dynamic programming usually requires that the task decompose into several instances of the same task for a shorter input. In the case of the Levenshtein distance, this boils down to prefixes of the two strings and the number of edits required to get from one to the other. I don't see how such a decomposition can be achieved in your case. At least I don't see one that would result in a polynomial-time algorithm.
Also, it is not quite clear what operations you are talking about. Depending on the context, a swap or exchange can mean either the same thing as transposition or a replacement of a letter with an arbitrary other letter, e.g. test->text. If by "transpose/swap/twiddle/exchange" you mean just "transpose", then you should have a look at Counting the adjacent swaps required to convert one permutation into another. If not, please clarify the question.
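In case the intended operation really is adjacent transposition, here is a sketch of the inversion-counting approach referenced above (my own code; only defined when the two strings are anagrams, per the caveat earlier in this answer). Assigning duplicate characters to target positions left-to-right is the standard trick for minimizing the count; the O(n^2) inversion count could be replaced by a merge-sort or BIT variant for O(n log n).

```python
from collections import defaultdict, deque

def adjacent_swaps(a, b):
    # minimum adjacent transpositions to turn a into b;
    # assumes a and b contain exactly the same multiset of characters
    pos = defaultdict(deque)
    for i, ch in enumerate(b):
        pos[ch].append(i)
    # map each character of a to its target index in b; taking the
    # leftmost unused target for duplicates minimizes inversions
    perm = [pos[ch].popleft() for ch in a]
    # the swap count equals the number of inversions in perm
    n = len(perm)
    return sum(perm[i] > perm[j] for i in range(n) for j in range(i + 1, n))
```

For example, "abc" -> "cab" takes two swaps (abc -> acb -> cab).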

How to find the closest pairs (Hamming Distance) of a string of binary bins in Ruby without O(n^2) issues?

I've got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256 bit bin of 1s and 0s, like:
0110101010101010110101010101
Ideally, I'd like to query for near binary matches, i.e. documents whose strings differ in only a few bit positions. Yes, this is Hamming distance.
This is NOT currently supported in Mongo. So, I'm forced to do it in the application layer.
So, given this, I am trying to find a way to avoid doing individual Hamming distance comparisons between all the documents, which would make the time to do this basically impossible.
I have a LOT of RAM. And, in Ruby, there seems to be a great gem (algorithms) that can create a number of trees that would reduce the number of queries I'd need to make, though none of which I can seem to make work (yet).
Ideally, I'd like to make 1 million queries, find the near duplicate strings, and be able to update them to reflect that.
Anyone's thoughts would be appreciated.
I ended up retrieving all the documents into memory (a subset with just the id and the string).
Then, I used a BK Tree to compare the strings.
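For reference, a minimal BK-tree sketch (in Python rather than the asker's Ruby; the structure is language-independent, and this hamming assumes equal-length strings). Each child edge is labeled with the distance from its parent, and the triangle inequality lets a query at radius r descend only into edges within r of the query's distance to the node.

```python
def hamming(a, b):
    # number of differing positions; assumes len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

class BKTree:
    def __init__(self, dist):
        self.dist = dist
        self.root = None  # node = (item, {edge_distance: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.dist(item, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (item, {})
                return

    def query(self, item, radius):
        out, stack = [], [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.dist(item, word)
            if d <= radius:
                out.append((d, word))
            # triangle inequality: only edges in [d - radius, d + radius]
            # can lead to matches
            stack.extend(c for e, c in children.items()
                         if d - radius <= e <= d + radius)
        return out
```

Each query then touches only a fraction of the million strings instead of all of them.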
The Hamming distance defines a metric space, so you could use the O(n log n) algorithm to find the closest pair of points, which is of the typical divide-and-conquer nature.
You can then apply this repeatedly until you have "enough" pairs.
Edit: I see now that Wikipedia doesn't actually give the algorithm, so here is one description.
Edit 2: The algorithm can be modified to give up if there are no pairs at distance less than n. For the case of the Hamming distance: simply count the level of recursion you are in. If you haven't found something at level n in any branch, then give up (in other words, never enter n + 1). If you are using a metric where splitting on one dimension doesn't always yield a distance of 1, you need to adjust the level of recursion where you give up.
As far as I could understand, you have an input string X and you want to query the database for a document containing string field b such that Hamming distance between X and document.b is less than some small number d.
You can do this in linear time, just by scanning all of your N=1M documents and calculating the distance (which takes small fixed time per document). Since you only want documents with distance smaller than d, you can give up comparison after d unmatched characters; you only need to compare all 256 characters if most of them match.
You can try to scan fewer than N documents, that is, to get better than linear time.
Let ones(s) be the number of 1s in string s. For each document, store ones(document.b) as a new indexed field ones_count. Then you can query only those documents whose number of ones is close enough to ones(X): specifically, ones(X) - d <= document.ones_count <= ones(X) + d. The Mongo index should kick in here.
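A sketch of both ideas in Python (names mine; in Mongo the ones_count check would be the indexed query, while here it's shown as an in-process filter). The prefilter is sound because each bit flip changes the ones count by exactly 1, so a Hamming distance of at most d implies the counts differ by at most d.

```python
def within_distance(a, b, d):
    # early-exit comparison: stop as soon as d mismatches are exceeded
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > d:
                return False
    return True

def close_matches(strings, x, d):
    # prefilter: |ones(b) - ones(x)| <= d is necessary for
    # the Hamming distance to be <= d
    ox = x.count("1")
    return [b for b in strings
            if abs(b.count("1") - ox) <= d and within_distance(b, x, d)]
```

Only strings passing the cheap count test pay for the character-by-character comparison, and even that bails out after d + 1 mismatches.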
If you want to find all close-enough pairs in the set, see @Philippe's answer.
This sounds like an algorithmic problem of some sort. You could try comparing those with a similar number of 1 or 0 bits first, then work down through the list from there. Those that are identical will, of course, come out on top. I don't think having tons of RAM will help here.
You could also try and work with smaller chunks. Instead of dealing with 256 bit sequences, could you treat that as 32 8-bit sequences? 16 16-bit sequences? At that point you can compute differences in a lookup table and use that as a sort of index.
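The chunked lookup-table idea sketched above might look like this in Python (my own illustration): precompute the popcount of every possible byte, XOR the two values so differing bits become 1s, and sum the table entries 8 bits at a time.

```python
# popcount for every possible byte value, computed once
POPCOUNT = [bin(i).count("1") for i in range(256)]

def hamming_int(a, b):
    # a, b: bit strings held as Python ints; XOR marks differing bits,
    # then the table sums them one byte at a time
    x = a ^ b
    total = 0
    while x:
        total += POPCOUNT[x & 0xFF]
        x >>= 8
    return total
```

A 256-bit value takes at most 32 table lookups per comparison, versus 256 character comparisons on the string form.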
Depending on how "different" you care to match on, you could just permute changes on the source binary value and do a keyed search to find the others that match.

Number of simple mutations to change one string to another?

I'm sure you've all heard of the "Word game", where you try to change one word to another by changing one letter at a time, and only going through valid English words. I'm trying to implement an A* Algorithm to solve it (just to flesh out my understanding of A*) and one of the things that is needed is a minimum-distance heuristic.
That is, the minimum number of one of these three mutations that can turn an arbitrary string a into another string b:
1) Change one letter for another
2) Add one letter at a spot before or after any letter
3) Remove any letter
Examples
aabca => abaca:
aabca
abca
abaca
= 2
abcdebf => bgabf:
abcdebf
bcdebf
bcdbf
bgdbf
bgabf
= 4
I've tried many algorithms out; I can't seem to find one that gives the actual answer every time. In fact, sometimes I'm not sure if even my human reasoning is finding the best answer.
Does anyone know any algorithm for such purpose? Or maybe can help me find one?
(Just to clarify, I'm asking for an algorithm that can turn any arbitrary string into any other, disregarding whether they are valid English words.)
You want the minimum edit distance (or Levenshtein distance):
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
And one algorithm to determine the editing sequence is on the same page here.
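The DP table plus a backtrace recovers both the distance and one editing sequence. A sketch (the op-tuple encoding is my own):

```python
def edit_ops(a, b):
    # full Levenshtein DP table, then a backtrace recovering one
    # optimal sequence of (operation, position, [char]) tuples
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,        # delete a[i-1]
                          d[i][j - 1] + 1,        # insert b[j-1]
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    ops, i, j = [], m, n
    while i or j:
        if i and j and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            if a[i - 1] != b[j - 1]:
                ops.append(("substitute", i - 1, b[j - 1]))
            i, j = i - 1, j - 1
        elif i and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", i - 1))
            i -= 1
        else:
            ops.append(("insert", i, b[j - 1]))
            j -= 1
    return d[m][n], list(reversed(ops))
```

Run on the examples from the question, it confirms aabca => abaca takes 2 edits and abcdebf => bgabf takes 4.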
An excellent reference on "Edit distance" is section 6.3 of the Algorithms textbook by S. Dasgupta, C. H. Papadimitriou, and U. V. Vazirani, a draft of which is available freely here.
If you have a reasonably sized (small) dictionary, a breadth first tree search might work.
So start with all words your word can mutate into, then all those can mutate into (except the original), then go down to the third level... Until you find the word you are looking for.
You could eliminate divergent words (ones further away from the target), but doing so might cause you to fail in a case where you must go through some divergent state to reach the shortest path.
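A sketch of that breadth-first search in Python (assuming the question's three mutations: substitute, insert, delete). BFS guarantees the first path found to the goal is a shortest one, with no heuristic needed, at the price of generating every neighbor.

```python
from collections import deque

def neighbors(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # all strings one mutation away: deletions, substitutions, insertions
    out = set()
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])            # delete
        for c in alphabet:
            out.add(word[:i] + c + word[i + 1:])    # substitute
    for i in range(len(word) + 1):
        for c in alphabet:
            out.add(word[:i] + c + word[i:])        # insert
    out.discard(word)
    return out

def shortest_mutation_path(start, goal, vocab):
    # BFS restricted to valid dictionary words
    seen, queue = {start}, deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt in vocab and nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable through valid words
```

With a dictionary of 100,000 words this stays practical because each word is expanded at most once.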

Efficient word scramble algorithm

I'm looking for an efficient algorithm for scrambling a set of letters into a permutation containing the maximum number of words.
For example, say I am given the list of letters: {e, e, h, r, s, t}. I need to order them in such a way as to contain the maximum number of words. If I order those letters as "theres", it contains the words "the", "there", "her", "here", and "ere". So that example would have a score of 5, since it contains 5 words. I want to order the letters in such a way as to have the highest score (contain the most words).
A naive algorithm would be to try and score every permutation. I believe this is O(n!), so 720 different permutations would be tried for just the 6 letters above (including some duplicates, since the example has e twice). For more letters, the naive solution quickly becomes impossible, of course.
The algorithm doesn't have to actually produce the very best solution, but it should find a good solution in a reasonable amount of time. For my application, simply guessing (Monte Carlo) at a few million permutations works quite poorly, so that's currently the mark to beat.
I am currently using the Aho-Corasick algorithm to score permutations. It searches for each word in the dictionary in just one pass through the text, so I believe it's quite efficient. This also means I have all the words stored in a trie, but if another algorithm requires different storage that's fine too. I am not worried about setting up the dictionary, just the run time of the actual ordering and searching. Even a fuzzy dictionary could be used if needed, like a Bloom Filter.
For my application, the list of letters given is about 100, and the dictionary contains over 100,000 entries. The dictionary never changes, but several different lists of letters need to be ordered.
I am considering trying a path finding algorithm. I believe I could start with a random letter from the list as a starting point. Then each remaining letter would be used to create a "path." I think this would work well with the Aho-Corasick scoring algorithm, since scores could be built up one letter at a time. I haven't tried path finding yet though; maybe it's not even a good idea? I don't know which path-finding algorithm might be best.
Another algorithm I thought of also starts with a random letter. Then the dictionary trie would be searched for "rich" branches containing the remaining letters. Dictionary branches containing unavailable letters would be pruned. I'm a bit foggy on the details of how this would work exactly, but it could completely eliminate scoring permutations.
Here's an idea, inspired by Markov Chains:
Precompute the letter transition probabilities in your dictionary. Create a table with the probability that some letter X is followed by another letter Y, for all letter pairs, based on the words in the dictionary.
Generate permutations by randomly choosing each next letter from the remaining pool of letters, based on the previous letter and the probability table, until all letters are used up. Run this many times.
You can experiment by increasing the "memory" of your transition table: don't look only one letter back, but say 2 or 3. This increases the size of the probability table, but gives you a better chance of creating valid words.
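A minimal Python sketch of the one-letter-memory version (function names are mine; the +1 smoothing is an assumption to keep unseen pairs selectable):

```python
import random
from collections import Counter, defaultdict

def transition_table(words):
    # counts[x][y] = how often letter y follows letter x in the dictionary
    counts = defaultdict(Counter)
    for w in words:
        for a, b in zip(w, w[1:]):
            counts[a][b] += 1
    return counts

def generate(letters, table, rng=random):
    # build one permutation, weighting each next letter by how often
    # it follows the previous one in the dictionary (+1 smoothing so
    # unseen pairs still have nonzero probability)
    pool = list(letters)
    rng.shuffle(pool)
    out = [pool.pop()]
    while pool:
        weights = [table[out[-1]][c] + 1 for c in pool]
        nxt = rng.choices(pool, weights=weights)[0]
        pool.remove(nxt)
        out.append(nxt)
    return "".join(out)
```

Calling generate many times and keeping the best-scoring permutation gives the biased Monte Carlo search described above.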
You might try simulated annealing, which has been used successfully for complex optimization problems in a number of domains. Basically you do randomized hill-climbing while gradually reducing the randomness. Since you already have the Aho-Corasick scoring you've done most of the work already. All you need is a way to generate neighbor permutations; for that something simple like swapping a pair of letters should work fine.
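A sketch of that annealing loop in Python (my own code; substring_score is a naive stand-in for the asker's Aho-Corasick scorer, and the temperature schedule is an arbitrary choice to tune):

```python
import math
import random

def substring_score(letters, words):
    # stand-in scorer: counts dictionary words appearing as substrings
    # (an Aho-Corasick automaton would do this in one pass)
    text = "".join(letters)
    return sum(1 for w in words if w in text)

def anneal(letters, score, steps=20000, temp=2.0, cooling=0.9995,
           rng=random):
    cur = list(letters)
    rng.shuffle(cur)
    cur_s = score(cur)
    best, best_s = cur[:], cur_s
    for _ in range(steps):
        i, j = rng.randrange(len(cur)), rng.randrange(len(cur))
        cur[i], cur[j] = cur[j], cur[i]           # neighbor: swap two letters
        s = score(cur)
        if s >= cur_s or rng.random() < math.exp((s - cur_s) / temp):
            cur_s = s                             # accept (possibly downhill)
            if s > best_s:
                best, best_s = cur[:], s
        else:
            cur[i], cur[j] = cur[j], cur[i]       # reject: undo the swap
        temp *= cooling
    return "".join(best), best_s
```

Early on, the high temperature lets the search escape local optima by occasionally accepting worse permutations; as it cools, the loop degenerates into plain hill-climbing.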
Have you thought about using a genetic algorithm? You have the beginnings of your fitness function already. You could experiment with the mutation and crossover (thanks Nathan) algorithms to see which do the best job.
Another option would be for your algorithm to build the smallest possible word from the input set, and then add one letter at a time so that the result is, or contains, a new word. Start with a few different starting words for each input set and see where it leads.
Just a few idle thoughts.
It might be useful to check how others solved this:
http://sourceforge.net/search/?type_of_search=soft&words=anagram
On this page you can generate anagrams online. I've played around with it for a while and it's great fun. It doesn't explain in detail how it does its job, but the parameters give some insight.
http://wordsmith.org/anagram/advanced.html
With JavaScript and Node.js I implemented a jumble solver that uses a dictionary to build a tree, then traverses the tree to get all possible words. I explained the algorithm in detail in this article and put the source code on GitHub:
Scramble or Jumble Word Solver with Express and Node.js
