How to do text corrections with a suffix array? - algorithm

We used suffix array to implement search by keywords, for example consider a phrase:
white bathroom tile
we insert suffixes:
1) white bathroom tile
2) bathroom tile
3) tile
and now the phrase "white bathroom tile" can be found if a user types in words: "white", "bathroom" or "tile".
However, now there's a problem, a person can type in "tyle" and nothing will be found.
So, I wanted to ask how to implement some sort of fast fuzzy search for this. Basically I want this algorithm to correct the user and still find "tile".
I considered applying levenstein distance, but my attempt failed. The idea was that we could find the group of words that start with "t" and compute levenstein distance for each one of them, and then return results where the levenstein distance was minimal.
This failed, because the user can type is "iile" instead of "tile", and now nothing words, my algorithm applies levenstein distance to the words in the "i" group.
What is a good way to solve this?

You can use Edit distance algorithm algorithm find the list of words which have the minimum edit distance with the searched word.
For example, With the word tyle and ile the edit distance of the searched word tile will be 1. For the word iile, the edit distance between tile and iile will be 1 as well.
Update
If traversing all words on the suffix array and calculating edit distance is slow (it is, the edit distance is O(^2) in time complexity), I would suggest to build a prefix tree (trie) with all the suffixes of the sentence. And then during lookup, for example, for word tyle, try to traverse the prefix tree in this way:
If there is a node in the prefix tree for the current character, traverse the node
If there is no node for current character, recursively traverse all nodes and skip this character.
During lookup, calculate the number of characters you skipped. The fewer number of characters you skips, the better candidate the word is.

Found this interesting article about a data structure called BK-tree and related algorithms. So, I'm considering using a BK-tree.
Also this article talks about even more powerful methods.

Levenshtein distance is better for words, in additional you could use Cosine_similarity a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them
and for similarity sentence or paragraph, you could use TF-IDF measure

Related

Algorithm Problem: Word transformation from a given word to another using only the words in a given dict

The detailed description of the problem is as follows:
Given two words (beginWord and endWord), and a dictionary's word list, find if there's a transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
I know this word can be solved using breadth-first-search. After I proposed the normal BFS solution, the interviewer asked me if I can make it faster. I didn't figure out a way to speed up. And the interviewer told me I should use a PriorityQueue instead to do a "Best-First-Search". And the priority is given by the hamming distance between the current word and target.
I don't quite understand why this can speed up the search. I feel by using priorityQueue we try to search the path that makes progress (i.e. reducing hamming distance).
This seems to be a greedy method. My questions is:
Why this solution is faster than the breadth-first-search solution? I feel the actual path can be like this: at first not making any progress, or even increasing the hamming distance, but after reaching a word the hamming distance goes down gradually. In this scenario, I think the priority queue solution will be slower.
Any suggestions will be appreciated! Thanks
First, I'd recommend to do some thorough reading on graph searching algorithms, that will explain the question to any detail you want (and far beyond).
TL;DR:
Your interviewer effectively recommended something close to the A* algorithm.
It differs from BFS in one aspect: which node to expand first. It uses a notion of distance score, composed from two elements:
At a node X, we already "traveled" a distance given by the number of transformations so far.
To reach the target from X, we still need to travel some more, at least N steps, where N is the number of characters different between node and target.
If we are to follow the path through X, the total number of steps from start to target can't be less than this score. It can be more if the real rest distance turns out to be longer (some words necessary for the direct path don't exist in the dictionary).
A* tells us: of all open (unexpanded) nodes, try the one first that potentially gives the shortest overall solution path, i.e. the one with the lowest score. And to implement that, a priority queue is a good fit.
In many cases, A* can dramatically reduce the search space (compared to BFS), and it still guarantees to find the best solution.
A* is NOT a greedy algorithm. It will eventually explore the whole search space, only in a much better ordering than a blind BFS.

Word distance algorithm for OCR

I am working with an OCR output and I'm searching for special words inside it.
As the output is not clean, I look for elements that match my inputs according to a word distance lower than a specific threshold.
However, I feel that the Levenshtein distance or the Hamming distance are not the best way, as the OCR always seem to make the same mistakes: I for 1, 0 for O, Q for O... and these "classic" mistakes seem to be less important than "A for K" for instance. As a result, these distances do not care of the amount of differences of the appearances of the characters (low / high).
Is there any word distance algorithm that was made specifically for OCR that I can use that would better fit my case? Or should I implement my custom word distance empirically according to the visual differences of characters?
The Levenshtein distance allows you to specify different costs for every substitution pair (http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications, fifth item). So you can tune it to your needs by giving more or less emphasis to the common mistakes.
I you want a custom cost function for letter mismatch you could look at the Needleman–Wunsch algorithm (NW)
Wikipedia http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
OCR paper related to the NW-algorithm http://oro.open.ac.uk/20855/1/paper-15.pdf

Finding nearest string to a pair in a lexicon

I am currently trying to come up with an efficient solution to the problem with the following formulation:
Given an input string s and a fixed lexicon find a string w1||w2 (|| denotes concatenation, w1 and w2 are words in the lexicon) with the lowest levenshtein distance to s.
The obvious naive solution is:
for word1 in lexicon:
for word2 in lexicon:
if lev_dist(word1 + word2) < lev_dist(lowest):
lowest = word1 + word2
I'm sure there must be better solutions to the problem. Can anyone offer any insight?
You may be able to do a bit better by putting lower bounds on the cost of individual strings.
Looking at the algorithm in http://en.wikipedia.org/wiki/Levenshtein_distance, at the time you care computing d[i, j] for the distance you know you are adding in a contribution that depends on s[i] and t[j], where s and t are the strings being compared, so you can make the costs of change/delete/insert depend on the position of the operation within the two strings.
This means that you can compute the distance between abcXXX and abcdef using a cost function in which operations on the characters marked XXX are free. This allows you to compute the cost of transforming abcXXX to abcdef if the string XXX is in fact the most favourable string possible.
So for each word w1 in the lexicon compute the distance between w1XXX and the target string and XXXw1 and the target string. Produce two copies of the lexicon, sorted in order of w1XXX distance and XXXw1 distance. Now try all pairs in order of the sum of left hand and right hand costs, which is a lower bound on the cost of that pair. Keep track of the best answer so far. When the best answer is at least as good as the next lower bound cost you encounter, you know that nothing you can try can improve on this best answer, so you can stop.
I assume you want to do this many times for the same lexicon. You've a misspelled word and suspect it's caused by the lack of a space between them, for example.
The first thing you'll surely need is a way to estimate string "closeness". I'm fond of normalization techniques. For example, replace each letter by a representative from an equivalence class. (Perhaps M and N both go to M because they sound similar. Perhaps PH --> F for a similar reason.)
Now, you'll want your normalized lexicon entered both frontwards and backwards into a trie or some similar structure.
Now, search for your needle both forwards and backwards, but keep track of intermediate results for both directions. In other words, at each position in the search string, keep track of the list of candidate trie nodes which have been selected at that position.
Now, compare the forwards- and backwards-looking arrays of intermediate results, looking for places that look like a good join point between words. You might also check for join points off-by-one from each other. (In other words, you've found the end of the first word and the beginning of the second.)
If you do, then you've found your word pair.
If you are running lots of queries on the same lexicon and want to improve the query time, but can afford some time for preprocessing, you can create a trie containing all possible words in the form w1 || w2. Then you can use the algorithm described here: Fast and Easy Levenshtein distance using a Trie to find the answer for any word you need.
What the algorithm does is basically walking the nodes of the trie and keeping track of the current minimum. Then if you end up in some node and the Levenshtein distance (of the word from the root to the current node and the input string s) is already larger than the minimum achieved so far, you can prune the entire subtree rooted at this node, because it cannot yield an answer.
In my testing with a dictionary of English words and random query words, this is anywhere between 30 and 300 times faster than the normal approach of testing every word in the dictionary, depending on the type of queries you run on it.

Number of simple mutations to change one string to another?

I'm sure you've all heard of the "Word game", where you try to change one word to another by changing one letter at a time, and only going through valid English words. I'm trying to implement an A* Algorithm to solve it (just to flesh out my understanding of A*) and one of the things that is needed is a minimum-distance heuristic.
That is, the minimum number of one of these three mutations that can turn an arbitrary string a into another string b:
1) Change one letter for another
2) Add one letter at a spot before or after any letter
3) Remove any letter
Examples
aabca => abaca:
aabca
abca
abaca
= 2
abcdebf => bgabf:
abcdebf
bcdebf
bcdbf
bgdbf
bgabf
= 4
I've tried many algorithms out; I can't seem to find one that gives the actual answer every time. In fact, sometimes I'm not sure if even my human reasoning is finding the best answer.
Does anyone know any algorithm for such purpose? Or maybe can help me find one?
(Just to clarify, I'm asking for an algorithm that can turn any arbitrary string to any other, disregarding their English validity-ness.)
You want the minimum edit distance (or Levenshtein distance):
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
And one algorithm to determine the editing sequence is on the same page here.
An excellent reference on "Edit distance" is section 6.3 of the Algorithms textbook by S. Dasgupta, C. H. Papadimitriou, and U. V. Vazirani, a draft of which is available freely here.
If you have a reasonably sized (small) dictionary, a breadth first tree search might work.
So start with all words your word can mutate into, then all those can mutate into (except the original), then go down to the third level... Until you find the word you are looking for.
You could eliminate divergent words (ones further away from the target), but doing so might cause you to fail in a case where you must go through some divergent state to reach the shortest path.

Finding how similar two strings are

I'm looking for an algorithm that takes 2 strings and will give me back a "factor of similarity".
Basically, I will have an input that may be misspelled, have letters transposed, etc, and I have to find the closest match(es) in a list of possible values that I have.
This is not for searching in a database. I'll have an in-memory list of 500 or so strings to match against, all under 30 chars, so it can be relatively slow.
I know this exists, i've seen it before, but I can't remember its name.
Edit: Thanks for pointing out Levenshtein and Hamming.
Now, which one should I implement? They basically measure different things, both of which can be used for what I want, but I'm not sure which one is more appropriate.
I've read up on the algorithms, Hamming seems obviously faster. Since neither will detect two characters being transposed (ie. Jordan and Jodran), which I believe will be a common mistake, which will be more accurate for what I want?
Can someone tell me a bit about the trade-offs?
Ok, so the standard algorithms are:
1) Hamming distance
Only good for strings of the same length, but very efficient. Basically it simply counts the number of distinct characters. Not useful for fuzzy searching of natural language text.
2) Levenstein distance.
The Levenstein distance measures distance in terms of the number of "operations" required to transform one string to another. These operations include insertion, deletion and substition. The standard approach of calculating the Levenstein distance is to use dynamic programming.
3) Generalized Levenstein/(Damerau–Levenshtein distance)
This distance also takes into consideration transpositions of characters in a word, and is probably the edit distance most suited for fuzzy matching of manually-entered text. The algorithm to compute the distance is a bit more involved than the Levenstein distance (detecting transpositions is not easy). Most common implementations are a modification of the bitap algorithm (like grep).
In general you would probably want to consider an implementation of the third option implemented in some sort of nearest neighbour search based on a k-d tree
Levenstein distance
Hamming distance
soundex
metaphone
the Damerau-Levenshtein distance is similar to the Levenshtein distance, but also includes two-character transposition. the wikipedia page (linked) includes pseudocode that should be fairly trivial to implement.
You're looking for the Levenshtein distance

Resources