Approximate string matching using backtracking - algorithm

I would like to use backtracking to search for all substrings in a long string allowing for variable length matches - that is matches allowing for a maximum given number of mismatches, insertions, and deletions. I have not been able to locate any useful examples. The closest I have found is this paper here, but that is terribly complex. Anyone?
Cheers,
Martin

Algorithm
The driver function ff() below uses recursion (i.e. backtracking), via its helper f(), to solve your problem. The basic idea is that at the start of any call to f(), we are trying to match a suffix t of the original "needle" string to a suffix s of the "haystack" string, while allowing only a certain number of each type of edit operation.
#include <stdio.h>

// ss is the start of the haystack, used only for reporting the match endpoints.
void f(char* ss, char* s, char* t, int mm, int ins, int del) {
    while (*s && *s == *t) ++s, ++t;        // OK to always match longest segment
    if (!*t) printf("%d\n", (int)(s - ss)); // Matched; print endpoint of match
    if (mm && *s && *t) f(ss, s + 1, t + 1, mm - 1, ins, del); // mismatch (substitution)
    if (ins && *s) f(ss, s + 1, t, mm, ins - 1, del);          // insertion (extra haystack character)
    if (del && *t) f(ss, s, t + 1, mm, ins, del - 1);          // deletion (skip a needle character)
}
// Find all occurrences of t starting at any position in s, with at most
// mm mismatches, ins insertions and del deletions.
void ff(char* s, char* t, int mm, int ins, int del) {
    for (char* ss = s; *s; ++s) {
        // printf("Starting from offset %d...\n", (int)(s - ss));
        f(ss, s, t, mm, ins, del);
    }
}
Example call:
ff("xxabcydef", "abcdefg", 1, 1, 1);
This outputs:
9
9
because there are two ways to find "abcdefg" in "xxabcydef" with at most 1 of each kind of change, and both of these ways end at position 9:
Haystack: xxabcydef-
Needle:     abc-defg
which has 1 insertion (of y) and 1 deletion (of g), and
Haystack: xxabcyde-f
Needle:     abc-defg
which has 1 insertion (of y), 1 deletion (of f), and 1 substitution of g to f.
Dominance Relation
It may not be obvious why it's actually safe to use the while loop at the start of f() to greedily match as many characters as possible at the start of the two strings. In fact this may reduce the number of times that a particular end position is reported as a match, but it will never cause an end position to be missed entirely. Since we're usually interested only in whether there is a match ending at a given position of the haystack, and since without this while loop the algorithm would always take time exponential in the needle size, this is a win-win.
It is guaranteed to work because of a dominance relation. To see this, suppose the opposite -- that it is in fact unsafe (i.e. misses some matches). Then there would be some match in which an initial segment of equal characters from the two strings is not aligned, for example:
Haystack: abbbbc
Needle:   a-b-bc
However, any such match can be transformed into another match having the same number of operations of each type, and ending at the same position, by shunting the leftmost character following a gap to the left of the gap:
Haystack: abbbbc
Needle:   ab--bc
If you do this repeatedly until it's no longer possible to shunt characters without requiring a substitution, you will get a match in which the largest initial segment of equal characters from both strings is aligned:
Haystack: abbbbc
Needle:   abb--c
My algorithm will find all such matches, so it follows that no match position will be overlooked by it.
Exponential Time
Like any backtracking program, this function will exhibit exponential slowdowns on certain inputs. Of course, it may be that on the inputs you happen to use, this doesn't occur, and it works out faster than particular implementations of DP algorithms.

I would start with the Levenshtein distance algorithm, which is the standard approach for checking string similarity via mismatches, insertions, and deletions.
Since the algorithm uses bottom-up dynamic programming, you should be able to find matches ending at every position of the long string without having to run the algorithm separately for each potential substring.
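If it helps, here is a minimal sketch of that idea (essentially the Sellers variant of the Levenshtein DP): the first row of the table is pinned to zero so a match may begin anywhere in the long string, and every end position whose final-row entry is within the allowed distance is reported. The function name, output format and the k parameter are my own illustrative choices, not something from the question.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Report every end position in `text` where some substring matches `pattern`
// within edit distance k (mismatch, insertion and deletion each cost 1).
void approxFind(const std::string& text, const std::string& pattern, int k) {
    const int m = (int)pattern.size(), n = (int)text.size();
    std::vector<int> prev(m + 1), cur(m + 1);
    for (int i = 0; i <= m; ++i) prev[i] = i;    // column for the empty text prefix
    for (int j = 1; j <= n; ++j) {
        cur[0] = 0;                              // a match may start at any text position
        for (int i = 1; i <= m; ++i) {
            int sub = prev[i - 1] + (pattern[i - 1] != text[j - 1]);
            cur[i] = std::min({sub, prev[i] + 1, cur[i - 1] + 1});
        }
        if (cur[m] <= k)
            std::cout << "match ending at position " << j
                      << " with distance " << cur[m] << "\n";
        std::swap(prev, cur);
    }
}

int main() {
    approxFind("xxabcydef", "abcdefg", 3);       // the example strings from the question
}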

The nicest algorithm I'm aware of for this is A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming by Gene Myers. Given a text to search of length n, a pattern string to search for of length m and a maximum number of mismatches/insertions/deletions k, this algorithm takes time O(mn/w), where w is your computer's word size (32 or 64). If you know much about algorithms on strings, it's actually pretty incredible that an algorithm exists that takes time independent of k -- for a long time, this seemed an impossible goal.
I'm not aware of an existing implementation of the above algorithm. If you want a tool, agrep may be just what you need. It uses an earlier algorithm that takes time O(mnk/w), but it's fast enough for low k -- miles ahead of a backtracking search in the worst case.
agrep is based on the Shift-Or (or "Bitap") algorithm, which is a very clever dynamic programming algorithm that manages to represent its state as bits in an integer and get binary addition to do most of the work of updating the state, which is what speeds up the algorithm by a factor of 32 or 64 over a more typical implementation. :) Myers's algorithm also uses this idea to get its 1/w speed factor.
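To make the bit-parallel idea concrete, here is a minimal sketch of the exact-matching core (the Shift-And formulation) that Bitap and Myers's algorithm extend with error counting. It only illustrates the bit trick, it is not either paper's k-error algorithm, and it assumes a non-empty pattern of at most 64 characters; the names are my own.

#include <cstdint>
#include <iostream>
#include <string>

// Bit i of the state D is set iff pattern[0..i] matches the text ending at the
// current position. One shift, one OR and one AND per text character update
// all prefix states at once.
void shiftAndSearch(const std::string& text, const std::string& pattern) {
    if (pattern.empty() || pattern.size() > 64) return;   // single-word sketch only
    uint64_t B[256] = {0};
    for (size_t i = 0; i < pattern.size(); ++i)
        B[(unsigned char)pattern[i]] |= 1ULL << i;
    const uint64_t accept = 1ULL << (pattern.size() - 1);
    uint64_t D = 0;
    for (size_t j = 0; j < text.size(); ++j) {
        D = ((D << 1) | 1ULL) & B[(unsigned char)text[j]];
        if (D & accept)
            std::cout << "exact match ending at position " << j + 1 << "\n";
    }
}

int main() {
    shiftAndSearch("xxabcdefgyy", "abcdefg");   // prints: exact match ending at position 9
}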

Related

Why do we have double hashing function as [(hash1(key) + i * hash2(key)) % TABLE_SIZE] but not simply as [(i * hash2(key)) % TABLE_SIZE]?

I learned the notation of double hashing [(hash1(key) + i * hash2(key)) % TABLE_SIZE] a couple of days ago. There is a part I couldn't understand after thinking about it and searching for an answer for days.
Why don't we discard the [hash1(key)] part from the double hashing function and simply make it as [(i * hash2(key)) % TABLE_SIZE]?
I couldn't find any downside of doing this, except all the hashcodes would start from 0 (when i = 0).
The main purpose of using double hashing, avoiding clusters, can still be achieved.
I will be super thankful if anyone can help :D
A quick summary of this answer:
There is a practical performance hit to your modified version, though it's not very large.
I think this is due to there not being as many different probe sequences as in regular double hashing, leading to some extra collisions due to the birthday paradox.
Now, the actual answer. :-)
Let's start off with some empirical analysis. What happens if you switch from the "standard" version of double hashing to the variant of double hashing that you've proposed?
I wrote a C++ program that generates uniformly-random choices of h1 and h2 values for each of the elements. It then inserts them into two different double-hashing tables, one using the normal approach and one using the variant. It repeats this process multiple times and reports the average number of probes required across each insertion. Here's what I found:
#include <iostream>
#include <vector>
#include <random>
#include <utility>
using namespace std;

/* Table size is picked to be a prime number. */
const size_t kTableSize = 5003;

/* Load factor for the hash table. */
const double kLoadFactor = 0.9;

/* Number of rounds to use. */
const size_t kNumRounds = 100000;

/* Random generator. */
static mt19937 generator;

/* Creates and returns an empty double hashing table. */
auto emptyTable(const size_t numSlots) {
    return vector<bool>(numSlots, false);
}

/* Simulation of double hashing. Each vector entry represents an item to store.
 * The first element of the pair is the value of h1(x), and the second element
 * of the pair is the value of h2(x).
 */
auto hashCodes(const size_t numItems) {
    vector<pair<size_t, size_t>> result;
    uniform_int_distribution<size_t> first(0, kTableSize - 1), second(1, kTableSize - 1);
    for (size_t i = 0; i < numItems; i++) {
        result.push_back({ first(generator), second(generator) });
    }
    return result;
}

/* Returns the probe location to use given a number of steps taken so far.
 * If modified is true, we ignore h1.
 */
size_t locationOf(size_t tableSize, size_t numProbes, size_t h1, size_t h2, bool modified) {
    size_t result = (numProbes == 0 || !modified)? h1 : 0;
    result += h2 * numProbes;
    return result % tableSize;
}

/* Performs a double-hashing insert, returning the number of probes required to
 * settle the element into its place.
 */
size_t insert(vector<bool>& table, size_t h1, size_t h2, bool modified) {
    size_t numProbes = 0;
    while (table[locationOf(table.size(), numProbes, h1, h2, modified)]) {
        numProbes++;
    }
    table[locationOf(table.size(), numProbes, h1, h2, modified)] = true;
    return numProbes + 1; // Count the original location as one probe
}

int main() {
    size_t normalProbes = 0, variantProbes = 0;
    for (size_t round = 0; round < kNumRounds; round++) {
        auto normalTable  = emptyTable(kTableSize);
        auto variantTable = emptyTable(kTableSize);

        /* Insert a collection of items into the table. */
        for (auto [h1, h2]: hashCodes(kTableSize * kLoadFactor)) {
            normalProbes  += insert(normalTable,  h1, h2, false);
            variantProbes += insert(variantTable, h1, h2, true);
        }
    }
    cout << "Normal probes:  " << normalProbes  << endl;
    cout << "Variant probes: " << variantProbes << endl;
}
Output:
Normal probes: 1150241942
Variant probes: 1214644088
So, empirically, it looks like the modified approach leads to about 5% more probes being needed to place all the elements. The question, then, is why this is.
I do not have a fully-developed theoretical explanation as to why the modified version is slower, but I do have a reasonable guess as to what's going on. Intuitively, double hashing works by assigning each element that's inserted a random probe sequence, which is some permutation of the table slots. It's not a truly-random permutation, since not all permutations can be achieved, but it's random enough for some definition of "random enough" (see, for example, Guibas and Szemerédi's "The Analysis of Double Hashing").
Let's think about what happens when we do an insertion. How many times, on expectation, will we need to look at the probe sequence beyond just h1? The first item has 0 probability of needing to look at h2. The second item has 1/T probability, since it hits the first element with probability 1/T. The third item has 2/T probability, since it has a 2/T chance of hitting the first two items. More generally, using linearity of expectation, we can show that the expected number of times an item will be in a spot that's already taken is given by
1/T + 2/T + 3/T + 4/T + ... + (n-1)/T
= (1 + 2 + 3 + ... + (n-1)) / T
= n(n-1) / 2T
Now, let's imagine that the load factor on our hash table is α, meaning that αT = n. Then the expected number of collisions works out to
αT(αT - 1) / 2T
≈ α²T / 2.
In other words, we should expect to see a pretty decent number of times where we need to inspect h2 when using double hashing.
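(A quick sanity check against the simulation above: with T = 5003 and α = 0.9, that is roughly 0.81 · 5003 / 2 ≈ 2026 insertions per round that need to look beyond h1.)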
Now, what happens when we look at the probe sequence? The number of different probe sequences using traditional double hashing is T(T-1), where T is the number of slots in the table. This is because there are T possible choices of h1(x) and T-1 choices for h2(x).
The math behind the birthday paradox says that once approximately √(2T(T-1)) ≈ T√2 items have been inserted into the table, we have a 50% chance that two of them will end up having the same probe sequence assigned. The good news here is that it's not possible to insert T√2 items into a T-slot hash table - that's more elements than slots! - and so there's a fairly low probability that we see elements that get assigned the same probe sequences. That means that the conflicts we get in the table are mostly due to collisions between elements that have different probe sequences, but coincidentally end up landing smack on top of one another.
On the other hand, let's think about your variation. Technically speaking, there are still T(T-1) distinct probe sequences. However, I'm going to argue that there are "effectively" more like only T-1 distinct probe sequences. The reason for this is that
probe sequences don't really matter unless you have a collision when you do an insertion, and
once there's a collision after an insertion, the probe sequence for a given element is determined purely by its h2 value.
This is not a rigorous argument - it's more of an intuition - but it shows that we have less variation in how our permutations get chosen.
Because there are only T-1 different probe sequences to pick from once we've had a collision, the birthday paradox says that we need to see about √(2T) collisions before we find two items with identical values of h2. And indeed, given that we see, on expectation, α²T/2 items that need to have h2 inspected, this means that we have a very good chance of finding items whose positions will be determined by the exact same sequence of h2 values. This means that we have a new source of collisions compared with "classical" double hashing: collisions from h2 values overlapping one another.
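(Plugging in the simulation's numbers again: √(2 · 5003) ≈ 100, while roughly 2026 insertions per round need to inspect h2, so repeated h2 probe sequences are all but guaranteed.)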
Now, even if we do have collisions with h2 values, it's not a huge deal. After all, it'll only take a few extra probes to skip past items placed with the same h2 values before we get to new slots. But I think that this might be the source of the extra probes seen during the modified version.
Hope this helps!
The proof of double hashing goes through under some weak assumptions about h1 and h2, namely, they're drawn from universal families. The result you get is that every operation is expected constant time (for every access that doesn't depend on the choice of h1 and h2).
With h2 only, you need to either strengthen the condition on h2 or give up the time bound as stated. Pick a prime p congruent to 1 mod 4, let P* = {1, …, p−1} be the set of units mod p, and consider the following mediocre but universal family of hash functions on {L, R} × P*. Draw a random scalar c ← P* and a random function f ← (P* → P*) and define
h2((L, x)) = cx mod p
h2((R, x)) = f(x).
This family is universal because the (L, x) keys never collide with each other, and every other pair collides with probability exactly 1/|P*|. It's a bad choice for the double-hashing-with-only-h2 algorithm because it's linear on half of its range, and linearity preserves arithmetic sequences.
Consider the following sequence of operations. Fill half of the hash table at random by inserting (R, 1), …, (R, (p−1)/2), then insert half again as many elements (L, (p−1)/4), …, (L, 1). The table load is at most 3/4, so everything should run in expected constant time. Consider what happens, however, when we insert (L, 1). With probability 1/2, the location h2((L, 1)) is occupied by one of the R keys. The ith probe, i·h2((L, 1)), hits the same location as h2((L, i)), which for i ≤ (p−1)/4 is guaranteed to be full by earlier operations. Therefore the expected cost of this operation is linear in p even though the sequence of keys didn't depend on the hash function, which is unacceptable.
Putting h1 back in the mix smashes this structure.
(Ugh this didn't quite work because the proof of expected constant time assumes strong universality, not universality as stated in the abstract.)
Taking another bite at the apple, this time with strong universality. Leaving my other answer up because this one uses a result by Patrascu and Thorup as a black box (and your choice of some deep number theory or some handwaving), which is perhaps less satisfying. The result is about linear probing, namely, for every table size m that's a power of 4, there exists a 2-universal (i.e., strongly universal) hash family and a sequence of operations such that, in expectation over the random hash function, one operation (referred to as The Query) probes Θ(√m) cells.
In order to use this result, we'd really like a table of size p−1 where p is a prime, so fixing m and the bad-for-linear-probing hash family Hm (whose functions have codomain {0, …, m-1}), choose p to be the least prime greater than m. (Alternatively, the requirement that m be a power of 4 is basically for convenience writing up the proof; it seems tedious but possible to generalize Patrascu and Thorup's result to other table sizes.) Define the distribution Hp by drawing a function h'' ← Hm and then define each value of h' independently according to the distribution
h'(x) = | h''(x) + 1   with probability m/(p-1)
        | m            with probability 1/m
          ...
        | p-1          with probability 1/m.
Letting K be the field mod p, the functions h' have codomain K* = {1, …, p-1}, the units of K. Unless I botched the definition, it's straightforward to verify that Hp is strongly universal. We need to pull in some heavy-duty number theory to show that p - m is O(m^(2/3)) (this follows from the existence of primes between sufficiently large successive cubes), which means that our linear probe sequence of length O(√m) for The Query remains intact with constant probability for Ω(m^(1/3)) steps, well more than constant.
Now, in order to change this family from a linear probing wrecker to a "double" hash wrecker, we need to give the key involved in The Query a name, let's say q. (We know for sure which one it is because the operation sequence doesn't depend on the hash function.) We define a new distribution of hash functions h by drawing h' as before, drawing c ← K*, and then defining
h(x) = | c h'(x)   if x ≠ q
       | c          if x = q.
Let's verify that this is still 2-universal. Given keys x and y both not equal to q, it's clear that (h(x), h(y)) = (c h'(x), c h'(y)) has uniform distribution over K* × K*. Given q and some other key x, we examine the distribution of (h(q), h(x)) = (c, c h'(x)), which is as needed because c is uniform, and h'(x) is uniform and independent of c, hence so is c h'(x).
OK, the point of this exercise at last. The probe sequence for The Query will be c, 2c, 3c, etc. Which keys hash to (e.g.) 2c? They are the x's that satisfy the equation
h(x) = c h'(x) = 2c
from which we derive
h'(x) = 2,
i.e., the keys whose preferred slot is right after The Query's in linear probe order. Generalizing from 2 to i, we conclude that the bad linear probe sequence for The Query for h' becomes a bad "double" hashing probe sequence for The Query for h, QED.

Approximate substring matching using a Suffix Tree

This article discusses approximate substring matching techniques that utilize a suffix tree to improve matching time. Each answer addresses a different algorithm.
Approximate substring matching attempts to find a substring (pattern) P in a string T allowing up to k mismatches.
To learn how to create a suffix tree, click here. However, some algorithms require additional preprocessing.
I invite people to add new algorithms (even if it's incomplete) and improve answers.
This was the original question that started this thread.
Professor Esko Ukkonen published a paper: Approximate string-matching over suffix trees. He discusses 3 different algorithms that have different matching times:
Algorithm A: O(mq + n)
Algorithm B: O(mq log(q) + size of the output)
Algorithm C: O(m^2q + size of the output)
Where m is the length of the substring, n is the length of the search string, and q is the edit distance.
I've been trying to understand algorithm B but I'm having trouble following the steps. Does anyone have experience with this algorithm? An example or pseudo algorithm would be greatly appreciated.
In particular:
What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
Here's what I believe (I stand to be corrected):
On page seven, we're introduced to suffix tree concepts; a state is effectively a node in the suffix tree: let root denote the initial state.
g(a, c) = b where a and b are nodes in the tree and c is a character or substring in the tree. So this represents a transition; from a, following the edges represented by c, we move to node b. This is referred to as the go-to transition. So for the suffix tree below, g(root, 'ccb') = red node
Key(a) = edge sequence where a represents a node in the tree. For example, Key(red node) = 'ccb'. So g(root, Key(red node)) = red node.
Keys(Subset of node S) = { Key(node) | node ∈ S}
There is a suffix function for nodes a and b, f(a) = b: for all (or perhaps there may exist) a ≠ root, there exists a character c, a substring x, and a node b such that g(root, cx) = a and g(root, x) = b. I think that this means, for the suffix tree example above, that f(pink node) = green node where c = 'a' and x = 'bccb'.
There is a mapping H that contains a node from the suffix tree and a value pair. The value is given by loc(w); I'm still uncertain how to evaluate the function. This dictionary contains nodes that have not been eliminated.
extract-min(H) refers to attaining the entry with the smallest value in the pair (node, loc(w)) from H.
The crux of the algorithm seems to be related to how loc(w) is evaluated. I've constructed my suffix tree using the combined answer here; however, the algorithms work on a suffix trie (uncompressed suffix tree). Therefore concepts like the depth need to be maintained and processed differently. In the suffix trie the depth would represent the suffix length; in a suffix tree, the depth would simply represent the node depth in the tree.
You are doing well. I don't have familiarity with the algorithm, but have read the paper today. Everything you wrote is correct as far as it goes. You are right that some parts of the explanation assume a lot.
Your Questions
1.What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
The output consists of the maximal k-distance matches of P in T. In particular you'll get the final index and length for each. So clearly this is also O(n) (remember big-O is an upper bound), but may be smaller. This is a nod to the fact that it's impossible to generate p matches in less than O(p) time. The rest of the time bound concerns only the pattern length and the number of viable prefixes, both of which can be arbitrarily small, so the output size can dominate. Consider k=0 and the input is 'a' repeated n times with the pattern 'a'.
2.Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
You're right. It's an error. The loop index should be i. What about j? This is the index of the column corresponding to the input character being processed in the dynamic program. It should really be an input parameter.
Let's take a step back. The Example table on page 6 is computed left-to-right, column-by-column using equations (1-4) given earlier. These show that only the previous columns of D and L are needed to get the next. Function dp is just an implementation of this idea of computing column j from column j-1. Columns j of D and L are called d and l respectively. Columns j-1 of D and L are d' and l', the function's input parameters.
I recommend you work through the dynamic program until you understand it well. The algorithm is all about avoiding duplicate column computations. Here "duplicate" means "having the same values in the essential part", because that's all that matters. The inessential parts can't affect the answer.
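As a concrete, simplified illustration of what dp computes, here is a sketch of producing column j of D from column j-1. The paper's dp also carries along the L column (needed later to recover |Q_j|), which I omit here; the function name and signature are my own, and this is a compile-only component sketch rather than the paper's code.

#include <algorithm>
#include <string>
#include <vector>

// Compute column j of the edit-distance table D from column j-1 (dPrev), given
// the pattern and the text character t_j. Row 0 stays 0 because an approximate
// match may begin at any position of the text.
std::vector<int> dpColumn(const std::vector<int>& dPrev,
                          const std::string& pattern, char tj) {
    std::vector<int> d(pattern.size() + 1);
    d[0] = 0;
    for (size_t i = 1; i <= pattern.size(); ++i) {
        int sub = dPrev[i - 1] + (pattern[i - 1] != tj);  // match or substitution
        int del = dPrev[i] + 1;                           // skip a text character
        int ins = d[i - 1] + 1;                           // skip a pattern character
        d[i] = std::min({sub, del, ins});
    }
    return d;
}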
The uncompressed trie is just the compressed one expanded in the obvious way to have one edge per character. Except for the idea of "depth", this is unimportant. In the compressed tree, depth(s) is just the length of the string - which he calls Key(s) - needed to get from the root to node s.
Algorithm A
Algorithm A is just a clever caching scheme.
All his theorems and lemmas show that 1) we only need to worry about the essential parts of columns and 2) the essential part of a column j is completely determined by the viable prefix Q_j. This is the longest suffix of the input ending at j that matches a prefix of the pattern (within edit distance k). In other words, Q_j is the maximal start of a k-edit match at the end of the input considered so far.
With this, here's pseudo-code for Algorithm A.
Let r = root of (uncompressed) suffix trie
Set r's cached d,l with formulas at end page 7 (0'th dp table columns)
// Invariant: r contains cached d,l
for each character t_j from input text T in sequence
    Let s = g(r, t_j)                 // make the go-to transition from r on t_j
    if visited(s)
        r = s
        while no cached d,l on node r
            r = f(r)                  // traverse suffix edge
        end while
    else
        Use cached d',l' on r to find new columns (d,l) = dp(d',l')
        Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
        r = s
        while depth(r) != |Q_j|
            mark r visited
            r = f(r)                  // traverse suffix edge
        end while
        mark r visited
        set cached d,l on node r
    end if
end for
I've left out the output step for simplicity.
What is traversing suffix edges about? When we do this from a node r where Key(r) = aX (leading a followed by some string X), we are going to the node with Key X. The consequence: we are storing each column corresponding to a viable prefix Q_h at the trie node for the suffix of the input with prefix Q_h. The function f(s) = r is the suffix transition function.
For what it's worth, the Wikipedia picture of a suffix tree shows this pretty well. For example, if from the node for "NA" we follow the suffix edge, we get to the node for "A" and from there to "". We are always cutting off the leading character. So if we label state s with Key(s), we have f("NA") = "A" and f("A") = "". (I don't know why he doesn't label states like this in the paper. It would simplify many explanations.)
Now this is very cool because we are computing only one column per viable prefix. But it's still expensive because we are inspecting each character and potentially traversing suffix edges for each one.
Algorithm B
Algorithm B's intent is to go faster by skipping through the input, touching only those characters likely to produce a new column, i.e. those that are the ends of input that match a previously unseen viable prefix of the pattern.
As you'd suspect, the key to the algorithm is the loc function. Roughly speaking, this will tell where the next "likely" input character is. The algorithm is quite a bit like A* search. We maintain a min heap (which must have a delete operation) corresponding to the set S_i in the paper. (He calls it a dictionary, but this is not a very conventional use of the term.) The min heap contains potential "next states" keyed on the position of the next "likely character" as described above. Processing one character produces new entries. We keep going until the heap is empty.
You're absolutely right that here he gets sketchy. The theorems and lemmas are not tied together to make an argument on correctness. He assumes you will redo his work. I'm not entirely convinced by this hand-waving. But there does seem to be enough there to "decode" the algorithm he has in mind.
Another core concept is the set S_i and in particular the subset that remains not eliminated. We'll keep these un-eliminated states in the min-heap H.
You're right to say that the notation obscures the intent of S_i. As we process the input left-to-right and reach position i, we have amassed a set of viable prefixes seen so far. Each time a new one is found, a fresh dp column is computed. In the author's notation these prefixes would be Q_h for all h<=i or more formally { Q_h | h <= i }. Each of these has a path from the root to a unique node. The set S_i consists of all the states we get by taking one more step from all these nodes along go-to edges in the trie. This produces the same result as going through the whole text looking for each occurrence of Q_h and the next character a, then adding the state corresponding to Q_h a into S_i, but it's faster. The Keys for the S_i states are exactly the right candidates for the next viable prefix Q_{i+1}.
How do we choose the right candidate? Pick the one that occurs next after position i in the input. This is where loc(s) comes in. The loc value for a state s is just what I just said above: the position in the input starting at i where the viable prefix associated with that state occurs next.
The important point is that we don't want to just assign the newly found (by pulling the min loc value from H) "next" viable prefix as Q_{i+1} (the viable prefix for dp column i+1) and go on to the next character (i+2). This is where we must set the stage to skip ahead as far as possible to the last character k (with dp column k) such that Q_k = Q_{i+1}. We skip ahead by following suffix edges as in Algorithm A. Only this time we record our steps for future use by altering H: removing elements, which is the same as eliminating elements from S_i, and modifying loc values.
The definition of function loc(s) is bare, and he never says how to compute it. Also unmentioned is that loc(s) is also a function of i, the current input position being processed (that he jumps from j in earlier parts of the paper to i here for the current input position is unhelpful.) The impact is that loc(s) changes as input processing proceeds.
It turns out that the part of the definition that applies to eliminated states "just happens" because states are marked eliminated upon removal from H. So for this case we need only check for a mark.
The other case - un-eliminated states - requires that we search forward in the input looking for the next occurrence in the text that is not covered by some other string. This notion of covering is to ensure we are always dealing with only "longest possible" viable prefixes. Shorter ones must be ignored to avoid outputting anything other than maximal matches. Now, searching forward sounds expensive, but happily we have a suffix trie already constructed, which allows us to do it in O(|Key(s)|) time. The trie will have to be carefully annotated to return the relevant input position and to avoid covered occurrences of Key(s), but it wouldn't be too hard. He never mentions what to do when the search fails. Here loc(s) = ∞, i.e. it's eliminated and should be deleted from H.
Perhaps the hairiest part of the algorithm is updating H to deal with cases where we add a new state s for a viable prefix that covers Key(w) for some w that was already in H. This means we have to surgically update the (loc(w) => w) element in H. It turns out the suffix trie yet again supports this efficiently with its suffix edges.
With all this in our heads, let's try for pseudocode.
H = { (0 => root) }                   // we use (loc => state) for min heap elements
until H is empty
    (j => s_j) = H.delete_min         // remove the min loc mapping from H
    (d, l) = dp(d', l', j) where (d',l') are cached at parent(s_j)
    Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
    r = s_j
    while depth(r) > |Q_j|
        mark r eliminated
        H.delete (_ => r)             // loc value doesn't matter
    end while
    set cached d,l on node r
    // Add all the "next states" reachable from r by go-tos
    for all s = g(r, a) for some character a
        unless s.eliminated?
            H.insert (loc(s) => s)    // here is where we use the trie to find loc
            // Update H elements that might be newly covered
            w = f(s)                  // suffix transition
            while w != null
                unless w.eliminated?
                    H.increase_key(loc(w) => w)   // using explanation in Lemma 9.
                    w = f(w)          // suffix transition
                end unless
            end while
        end unless
    end for
end until
Again I've omitted the output for simplicity. I will not say this is correct, but it's in the ballpark. One thing is that he mentions we should only process Q_j for nodes not previously "visited", but I don't understand what "visited" means in this context. I think visited states by Algorithm A's definition won't occur because they've been removed from H. It's a puzzle...
The increase_key operation in Lemma 9 is hastily described with no proof. His claim that the min operation is possible in O(log |alphabet|) time is leaving a lot to the imagination.
The number of quirks leads me to wonder if this is not the final draft of the paper. It is also a Springer publication, and this copy on-line would probably violate copyright restrictions if it were precisely the same. It might be worth looking in a library or paying for the final version to see if some of the rough edges were knocked off during final review.
This is as far as I can get. If you have specific questions, I'll try to clarify.

Shortest path to transform one word into another

For a Data Structures project, I must find the shortest path between two words (like "cat" and "dog"), changing only one letter at a time. We are given a Scrabble word list to use in finding our path. For example:
cat -> bat -> bet -> bot -> bog -> dog
I've solved the problem using a breadth first search, but am seeking something better (I represented the dictionary with a trie).
Please give me some ideas for a more efficient method (in terms of speed and memory). Something ridiculous and/or challenging is preferred.
I asked one of my friends (he's a junior) and he said that there is no efficient solution to this problem. He said I would learn why when I took the algorithms course. Any comments on that?
We must move from word to word. We cannot go cat -> dat -> dag -> dog. We also have to print out the traversal.
NEW ANSWER
Given the recent update, you could try A* with the Hamming distance as a heuristic. It's an admissible heuristic since it will never overestimate the distance.
OLD ANSWER
You can modify the dynamic-program used to compute the Levenshtein distance to obtain the sequence of operations.
EDIT: If there are a constant number of strings, the problem is solvable in polynomial time. Else, it's NP-hard (it's all there in wikipedia) .. assuming your friend is talking about the problem being NP-hard.
EDIT: If your strings are of equal length, you can use Hamming distance.
With a dictionary, BFS is optimal, but the running time needed is proportional to its size (V+E). With n letters, the dictionary might have ~a^n entries, where a is the alphabet size. If the dictionary contains every word except the one that should be at the end of the chain, you'll traverse all reachable words but won't find anything. This is graph traversal, but the size might be exponentially large.
You may wonder if it is possible to do it faster - to browse the structure "intelligently" and do it in polynomial time. The answer is, I think, no.
The problem:
You're given a fast (linear-time) way to check whether a word is in the dictionary, and two words u, v; you must decide whether there is a sequence u -> a1 -> a2 -> ... -> an -> v.
This problem is NP-hard.
Proof: Take some 3SAT instance, like
(p or q or not r) and (p or not q or r)
You'll start with 0 000 00 and are to check if it is possible to go to 2 222 22.
The first character will be "are we finished", three next bits will control p,q,r and two next will control clauses.
Allowed words are:
Anything that starts with 0 and contains only 0's and 1's
Anything that starts with 2 and is legal. This means that the rest consists of 0's and 1's, the clause bits are set correctly according to the variable bits, and all of them are 1 (so the word witnesses that the formula is satisfiable).
Anything that starts with at least two 2's and is then composed of 0's and 1's (regular expression: 222*(0+1)*), like 22221101 but not 2212001.
To produce 2 222 22 from 0 000 00, you have to do it in this way:
(1) Flip appropriate bits - e.g. to 0 100 11 in three steps. This requires finding a 3SAT solution.
(2) Change the first character to 2: 2 100 11. Here it is verified that this is indeed a 3SAT solution.
(3) Change 2 100 11 -> 2 200 11 -> 2 220 11 -> 2 222 11 -> 2 222 21 -> 2 222 22.
These rules enforce that you can't cheat (check). Reaching 2 222 22 is possible only if the formula is satisfiable, and checking that is NP-hard. I feel it might be even harder (#P or FNP probably) but NP-hardness is enough for this purpose I think.
Edit: You might be interested in the disjoint set data structure. This will take your dictionary and group words that can be reached from each other. You can also store a path from every vertex to the root or some other vertex. This will give you a path, not necessarily the shortest one.
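As a sketch of that idea (my own illustration, not taken from the answer above): union every pair of dictionary words that differ in exactly one letter, found cheaply by bucketing words under wildcard patterns, and then "can one word reach another" becomes a find() comparison. The helper names are invented for the example.

#include <iostream>
#include <numeric>
#include <string>
#include <unordered_map>
#include <vector>

struct DSU {
    std::vector<int> parent;
    explicit DSU(size_t n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Words that share a wildcard pattern (e.g. "c*t") differ in at most one letter,
// so unioning within each bucket groups exactly the mutually reachable words.
DSU groupWords(const std::vector<std::string>& words) {
    DSU dsu(words.size());
    std::unordered_map<std::string, int> bucket;   // wildcard pattern -> first word index seen
    for (int i = 0; i < (int)words.size(); ++i) {
        for (size_t j = 0; j < words[i].size(); ++j) {
            std::string pattern = words[i];
            pattern[j] = '*';
            auto it = bucket.find(pattern);
            if (it == bucket.end()) bucket[pattern] = i;
            else dsu.unite(i, it->second);
        }
    }
    return dsu;
}

int main() {
    std::vector<std::string> words = {"cat", "bat", "bet", "bot", "bog", "dog", "zzz"};
    DSU dsu = groupWords(words);
    std::cout << std::boolalpha
              << (dsu.find(0) == dsu.find(5)) << "\n"    // cat and dog: true
              << (dsu.find(0) == dsu.find(6)) << "\n";   // cat and zzz: false
}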
There are methods of varying efficiency for finding links - you can construct a complete graph for each word length, or you can construct a BK-Tree, for example, but your friend is right - BFS is the most efficient algorithm.
There is, however, a way to significantly improve your runtime: Instead of doing a single BFS from the source node, do two breadth first searches, starting at either end of the graph, and terminating when you find a common node in their frontier sets. The amount of work you have to do is roughly half what is required if you search from only one end.
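Here is a hedged sketch of that meet-in-the-middle idea: expand BFS layers alternately from whichever frontier is currently smaller and stop as soon as a generated word appears in the opposite frontier. It returns only the chain length; recording parent pointers so the traversal can be printed is a straightforward extension. The function, the lowercase-alphabet assumption and the toy dictionary are my own choices.

#include <iostream>
#include <string>
#include <unordered_set>
#include <utility>

// Returns the number of words in the shortest chain from start to target
// (0 if no chain exists). Every intermediate word must be in dict.
int shortestChain(const std::string& start, const std::string& target,
                  std::unordered_set<std::string> dict) {
    if (start == target) return 1;
    dict.insert(target);                      // defensive: target may not be listed
    std::unordered_set<std::string> front{start}, back{target}, visited{start, target};
    int length = 1;                           // words in the chain found so far
    while (!front.empty() && !back.empty()) {
        if (front.size() > back.size()) std::swap(front, back);   // expand the smaller side
        std::unordered_set<std::string> next;
        for (std::string word : front) {      // mutate a copy of each frontier word
            for (size_t i = 0; i < word.size(); ++i) {
                char saved = word[i];
                for (char c = 'a'; c <= 'z'; ++c) {   // assumes lowercase words
                    if (c == saved) continue;
                    word[i] = c;
                    if (back.count(word)) return length + 1;      // frontiers meet
                    if (dict.count(word) && !visited.count(word)) {
                        visited.insert(word);
                        next.insert(word);
                    }
                }
                word[i] = saved;
            }
        }
        front = std::move(next);
        ++length;
    }
    return 0;
}

int main() {
    std::unordered_set<std::string> dict = {"bat", "bet", "bot", "bog", "dog"};
    std::cout << shortestChain("cat", "dog", dict) << "\n";   // prints 5 (cat bat bot bog dog)
}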
You can make it a little quicker by removing the words that are not the right length, first. More of the limited dictionary will fit into the CPU's cache. Probably all of it.
Also, all of the strncmp comparisons (assuming you made everything lowercase) can be memcmp comparisons, or even unrolled comparisons, which can be a speedup.
You could use some preprocessor magic and hard-compile the task for that word-length, or roll a few optimized variations of the task for common word lengths. All of those extra comparisons can 'go away' for pure unrolled fun.
This is a typical dynamic programming problem. Check for the Edit Distance problem.
What you are looking for is called the Edit Distance. There are many different types.
From (http://en.wikipedia.org/wiki/Edit_distance): "In information theory and computer science, the edit distance between two strings of characters is the number of operations required to transform one of them into the other."
This article about Jazzy (the java spell check API) has a nice overview of these sorts of comparisons (it's a similar problem - providing suggested corrections) http://www.ibm.com/developerworks/java/library/j-jazzy/
You could find the longest common subsequence, and thereby find the letters that must be changed.
My gut feeling is that your friend is correct, in that there isn't a more efficient solution, but that is assuming you are reloading the dictionary every time. If you were to keep a running database of common transitions, then surely there would be a more efficient method for finding a solution, but you would need to generate the transitions beforehand, and discovering which transitions would be useful (since you can't generate them all!) is probably an art of its own.
#include <iostream>
#include <queue>
#include <set>
#include <string>
using namespace std;

// Returns true if a and b differ in exactly one character.
bool isadjacent(string& a, string& b)
{
    if (a.length() != b.length())
        return false;              // words of different lengths are never adjacent

    int count = 0;                 // to store count of differences
    int n = a.length();

    // Iterate through all characters and return false
    // if there are more than one mismatching characters
    for (int i = 0; i < n; i++)
    {
        if (a[i] != b[i]) count++;
        if (count > 1) return false;
    }
    return count == 1;
}

// A queue item to store word and minimum chain length
// to reach the word.
struct QItem
{
    string word;
    int len;
};

// Returns length of shortest chain to reach 'target' from 'start'
// using minimum number of adjacent moves. D is dictionary
int shortestChainLen(string& start, string& target, set<string> &D)
{
    // Create a queue for BFS and insert 'start' as source vertex
    queue<QItem> Q;
    QItem item = {start, 1};       // Chain length for start word is 1
    Q.push(item);

    // While queue is not empty
    while (!Q.empty())
    {
        // Take the front word
        QItem curr = Q.front();
        Q.pop();

        // Go through all words of dictionary
        for (set<string>::iterator it = D.begin(); it != D.end(); )
        {
            // Process a dictionary word if it is adjacent to current
            // word (or vertex) of BFS
            string temp = *it;
            if (isadjacent(curr.word, temp))
            {
                // Add the dictionary word to Q
                item.word = temp;
                item.len = curr.len + 1;

                // If we reached target
                if (temp == target)
                    return item.len;

                Q.push(item);

                // Remove from dictionary so that this word is not
                // processed again. This is like marking visited.
                // erase() returns the next valid iterator, so we don't
                // advance an invalidated one.
                it = D.erase(it);
            }
            else
            {
                ++it;
            }
        }
    }
    return 0;
}

// Driver program
int main()
{
    // make dictionary
    set<string> D;
    D.insert("poon");
    D.insert("plee");
    D.insert("same");
    D.insert("poie");
    D.insert("plie");
    D.insert("poin");
    D.insert("plea");

    string start = "toon";
    string target = "plea";
    cout << "Length of shortest chain is: "
         << shortestChainLen(start, target, D);
    return 0;
}
Copied from: https://www.geeksforgeeks.org/word-ladder-length-of-shortest-chain-to-reach-a-target-word/

Fastest way to find minimal Hamming distance to any substring?

Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example,
minHamming(ABCDEFGHIJ, CDEFGG) == 1.
minHamming(ABCDEFGHIJ, BCDGHI) == 3.
Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing to L once is acceptable.
Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).
Perhaps surprisingly, this exact problem can be solved in just O(|A| n log n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.
Here is a freely available PDF of a paper by Donald Benson describing how it works:
Fourier methods for biosequence analysis (Donald Benson, Nucleic Acids Research 1990 vol. 18, pp. 3001-3006)
Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT, each conversion takes just O(n log n) time, so the overall time complexity is O(|A| n log n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.
Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.
UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.
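To make the indicator-vector idea concrete, here is a small sketch that computes the per-alignment match counts directly, i.e. as a plain cross-correlation. Benson's algorithm gets its O(|A| n log n) bound by doing this correlation with FFT-based multiplication instead, which this illustration deliberately does not attempt; the function name and the alphabet argument are my own choices.

#include <iostream>
#include <string>
#include <vector>

// counts[off] = number of matching positions when S is aligned at offset off of L,
// accumulated per alphabet character via indicator terms [L[off+i]==a]*[S[i]==a].
// The Hamming distance at offset off is then S.size() - counts[off].
std::vector<int> matchCounts(const std::string& L, const std::string& S,
                             const std::string& alphabet) {
    if (S.size() > L.size()) return {};
    std::vector<int> counts(L.size() - S.size() + 1, 0);
    for (char a : alphabet)                       // one indicator pair per character
        for (size_t off = 0; off < counts.size(); ++off)
            for (size_t i = 0; i < S.size(); ++i)
                counts[off] += (L[off + i] == a) && (S[i] == a);
    return counts;
}

int main() {
    std::string L = "ABCDEFGHIJ", S = "CDEFGG";   // example from the question
    auto counts = matchCounts(L, S, "ABCDEFGHIJ");
    for (size_t off = 0; off < counts.size(); ++off)
        std::cout << "offset " << off << ": Hamming distance "
                  << S.size() - counts[off] << "\n";
}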
Modified Boyer-Moore
I've just dug up some old Python implementation of Boyer-Moore I had lying around and modified the matching loop (where the text is compared to the pattern). Instead of breaking out as soon as the first mismatch is found between the two strings, simply count up the number of mismatches, but remember the first mismatch:
current_dist = 0
while pattern_pos >= 0:
    if pattern[pattern_pos] != text[text_pos]:
        if first_mismatch == -1:
            first_mismatch = pattern_pos
            tp = text_pos
        current_dist += 1
        if current_dist == smallest_dist:
            break
    pattern_pos -= 1
    text_pos -= 1
smallest_dist = min(current_dist, smallest_dist)
# if the distance is 0, we've had a match and can quit
if current_dist == 0:
    return 0
else:  # shift
    pattern_pos = first_mismatch
    text_pos = tp
...
If the string did not match completely at this point, go back to the point of the first mismatch by restoring the values. This makes sure that the smallest distance is actually found.
The whole implementation is rather long (~150LOC), but I can post it on request. The core idea is outlined above, everything else is standard Boyer-Moore.
Preprocessing on the Text
Another way to speed things up is preprocessing the text to have an index on character positions. You only want to start comparing at positions where at least a single match between the two strings occurs, otherwise the Hamming distance is |S| trivially.
import sys
from collections import defaultdict
import bisect

def char_positions(t):
    pos = defaultdict(list)
    for idx, c in enumerate(t):
        pos[c].append(idx)
    return dict(pos)
This method simply creates a dictionary which maps each character in the text to the sorted list of its occurrences.
The comparison loop is more or less unchanged from the naive O(mn) approach, apart from the fact that we do not advance the starting position of the comparison by 1 each time, but based on the character positions:
def min_hamming(text, pattern):
    best = len(pattern)
    pos = char_positions(text)
    i = find_next_pos(pattern, pos, 0)
    while i < len(text) - len(pattern):
        dist = 0
        for c in range(len(pattern)):
            if text[i+c] != pattern[c]:
                dist += 1
                if dist == best:
                    break
            c += 1
        else:
            if dist == 0:
                return 0
            best = min(dist, best)
        i = find_next_pos(pattern, pos, i + 1)
    return best
The actual improvement is in find_next_pos:
def find_next_pos(pattern, pos, i):
    smallest = sys.maxint
    for idx, c in enumerate(pattern):
        if c in pos:
            x = bisect.bisect_left(pos[c], i + idx)
            if x < len(pos[c]):
                smallest = min(smallest, pos[c][x] - idx)
    return smallest
For each new position, we find the lowest index at which a character from S occurs in L. If there is no such index any more, the algorithm will terminate.
find_next_pos is certainly complex, and one could try to improve it by only using the first several characters of the pattern S, or use a set to make sure characters from the pattern are not checked twice.
Discussion
Which method is faster largely depends on your dataset. The more diverse your alphabet is, the larger will be the jumps. If you have a very long L, the second method with preprocessing might be faster. For very, very short strings (like in your question), the naive approach will certainly be the fastest.
DNA
If you have a very small alphabet, you could try to get the character positions for character bigrams (or larger) rather than unigrams.
You're stuck as far as big-O is concerned. At a fundamental level, you're going to need to test whether every letter in the target matches each eligible letter in the substring.
Luckily, this is easily parallelized.
One optimization you can apply is to keep a running count of mismatches for the current position. If it's greater than the lowest hamming distance so far, then obviously you can skip to the next possibility.

Fastest way to find most similar string to an input?

Given a query string Q of length N, and a list L of M sequences of length exactly N, what is the most efficient algorithm to find the string in L with the fewest mismatch positions to Q? For example:
Q = "ABCDEFG";
L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"];
answer = L.query(Q); # Returns "ABCCEFG"
answer2 = L.query("AAAATAA"); #Returns "AAAAAAA".
The obvious way is to scan every sequence in L, making the search take O(M * N). Is there any way to do this in sublinear time? I don't care if there's a large upfront cost to organizing L into some data structure because it will be queried a lot of times. Also, handling tied scores arbitrarily is fine.
Edit: To clarify, I am looking for the Hamming distance.
All the answers except the one that mentions the best first algorithm are very much off.
Locality-sensitive hashing is basically dreaming. This is the first time I've seen answers so far off on Stack Overflow.
First, this is a hard, but standard problem that has been solved many years ago
in different ways.
One approach uses a trie such as the one presented by Sedgewick here:
http://www.cs.princeton.edu/~rs/strings/
Sedgewick also has sample C code.
I quote from the paper titled "Fast Algorithms for Sorting and Searching Strings" by Bentley and Sedgewick:
"‘‘Near neighbor’’ queries locate all words within a given Hamming distance
of a query word (for instance, code is distance 2 from soda). We give a new algorithm for near neighbor searching in strings, present a simple C implementation, and describe experiments on its efficiency."
A second approach is to use indexing. Split the strings into character n-grams and index them with an inverted index (google for the Lucene spell checker to see how it's done).
Use the index to pull potential candidates and then run Hamming distance or edit distance on the candidates. This is the approach guaranteed to work best (and it's relatively simple).
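A minimal sketch of that indexing approach, under the question's assumption that all strings have the same length: index every sequence under its character trigrams, pull the candidates that share at least one trigram with the query, and score only those candidates by Hamming distance. The structure and names here are illustrative; a production version (e.g. the Lucene spell checker) adds weighting and a fallback for queries that share no n-gram with any candidate.

#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Inverted index over character trigrams.
struct NGramIndex {
    size_t n = 3;
    std::vector<std::string> seqs;
    std::unordered_map<std::string, std::vector<size_t>> postings;

    explicit NGramIndex(std::vector<std::string> sequences) : seqs(std::move(sequences)) {
        for (size_t id = 0; id < seqs.size(); ++id)
            for (size_t i = 0; i + n <= seqs[id].size(); ++i)
                postings[seqs[id].substr(i, n)].push_back(id);
    }

    // Returns the indexed sequence with the fewest mismatches among candidates
    // sharing a trigram with q (empty string if there is no such candidate).
    std::string query(const std::string& q) const {
        std::unordered_set<size_t> candidates;
        for (size_t i = 0; i + n <= q.size(); ++i) {
            auto it = postings.find(q.substr(i, n));
            if (it != postings.end())
                candidates.insert(it->second.begin(), it->second.end());
        }
        size_t bestId = 0, bestDist = q.size() + 1;
        for (size_t id : candidates) {
            size_t d = 0;
            for (size_t i = 0; i < q.size(); ++i)
                d += (seqs[id][i] != q[i]);
            if (d < bestDist) { bestDist = d; bestId = id; }
        }
        return candidates.empty() ? std::string() : seqs[bestId];
    }
};

int main() {
    NGramIndex index({"ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"});
    std::cout << index.query("ABCDEFG") << "\n";   // ABCCEFG
    std::cout << index.query("AAAATAA") << "\n";   // AAAAAAA
}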
A third appears in the area of speech recognition. There the query is a wav signal, and the database is a set of strings. There is a "table" that matches pieces of the signal to pieces of words. The goal is to find the best match of words to signal. This problem is known as word alignment.
In the problem posted, there is an implicit cost of matching query parts to database parts.
For example one may have different costs for deletion/insertion/substitution and even
different costs for mismatching say "ph" with "f".
The standard solution in speech recognition uses a dynamic programming approach which is made efficient via heuristics that direct pruning. In this way, only the best, say 50 candidates are kept. Thus, the name best-first search. In theory, you may not get the best match, but usually one gets a good match.
Here is a reference to the latter approach:
http://amta2010.amtaweb.org/AMTA/papers/2-02-KoehnSenellart.pdf
Fast Approximate String Matching with Suffix Arrays and A* Parsing.
This approach applies not only to words but to sentences.
Locality sensitive hashing underlies what seems to be the asymptotically best method known, as I understand it from this review article in CACM. Said article is pretty hairy and I didn't read it all. See also nearest neighbor search.
To relate these references to your problem: they all deal with a set of points in a metric space, such as an n-dimensional vector space. In your problem, n is the length of each string, and the values on each coordinate are the characters that can appear at each position in a string.
The "best" method will vary significantly depending on your input set and query set. Having a fixed message length will let you treat this problem in a classification context.
An information theoretic decision tree algorithm (like C4.5, for example) will provide the best overall guarantee on performance. In order to get optimal performance out of this method, you must first cluster the string indices into features based on mutual information. Note that you will need to modify the classifier to return all leaf nodes at the last branch, then compute a partial edit distance for each of them. The edit distance only needs to be calculated for the feature set represented by the last split of the tree.
Using this technique, querying should be ~ O(k log n), k << m, where k is the expectation of the feature size, m is the length of the string, and n is the number of comparison sequences.
The initial setup on this is guaranteed to be less than O(m^2 + n*t^2), t < m, t * k ~ m, where t is the feature count for an item. This is very reasonable and should not require any serious hardware.
These very nice performance numbers are possible because of the fixed m constraint. Enjoy!
I think you are looking for the Levenshtein edit distance.
There are a few questions here on SO about this already, I suppose you can find some good answers.
You could treat each sequence as an N-dimensional coordinate, chunk the resulting space into blocks that know what sequences occur in them, then on a lookup first search the search sequence's block and all contiguous blocks, then expand outward as necessary. (Maintaining several scopes of chunking is probably more desirable than getting into searching really large groups of blocks.)
Are you looking for the Hamming distance between the strings (i.e. the number of different characters at equivalent locations)?
Or does the distance "between" characters (e.g. difference between ASCII values of English letters) matter to you as well?
Some variety of best-first search on the target sequences will do much better than O(M * N). The basic idea of this is that you'd compare the first character in your candidate sequence with the first character of the target sequences, then in your second iteration only do the next-character comparison with the sequences that have the least number of mismatches, and so on. In your first example, you'd wind up comparing against ABCCEFG and AAAAAAA the second time, ABCCEFG only the third and fourth times, all the sequences the fifth time, and only ABCCEFG thereafter. When you get to the end of your candidate sequence, the set of target sequences with the lowest mismatch count is your match set.
(Note: at each step you're comparing against the next character for that branch of the search. None of the progressive comparisons skip characters.)
I can't think of a general, exact algorithm which will be less than O(N * M), but if you have a small enough M and N you can make an algorithm which performs as (N + M) using bit-parallel operations.
For example, if N and M are both less than 16, you could use an N * M lookup table of 64-bit ints (16 * log2(16) = 64), and perform all operations in one pass through the string, where each group of 4 bits in the counter counts 0-15 for one of the strings being matched. Obviously you need M log2(N+1) bits to store the counters, so you might need to update multiple values for each character, but often a single-pass lookup can be faster than other approaches. So it's actually O(N * M log(N)), just with a lower constant factor - using 64-bit ints introduces a 1/64 into it, so it should be better if log2(N) < 64. If M log2(N+1) < 64, it works out as O(N + M) operations. But that's still linear, rather than sub-linear.
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>

size_t match ( const char* string, uint64_t table[][128] ) ;

int main ()
{
    const char* data[] = { "ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT" };
    const size_t N = 7;
    const size_t M = 4;

    // prepare a table
    uint64_t table[7][128] = { 0 };

    for ( size_t i = 0; i < M; ++i )
        for ( size_t j = 0; j < N; ++j )
            table[j][ (size_t)data[i][j] ] |= 1 << (i * 4);

    const char* examples[] = { "ABCDEFG", "AAAATAA", "TTAGQQT", "ZAAGVUT" };

    for ( size_t i = 0; i < 4; ++i ) {
        const char* q = examples[i];
        size_t result = match ( q, table );
        printf("Q(%s) -> %zd %s\n", q, result, data[result]);
    }
}

size_t match ( const char* string, uint64_t table[][128] )
{
    uint64_t count = 0;

    // scan through string once, updating all counters at once
    for ( size_t i = 0; string[i]; ++i )
        count += table[i][ (size_t) string[i] ];

    // find greatest sub-count within count
    size_t best = 0;
    size_t best_sub_count = count & 0xf;

    for ( size_t i = 1; i < 4; ++i ) {
        size_t sub_count = ( count >>= 4 ) & 0xf;
        if ( sub_count > best_sub_count ) {
            best_sub_count = sub_count;
            best = i;
        }
    }
    return best;
}
Sorry for bumping this old thread
To search elementwise would mean a complexity of O(M*N*N): O(M) for searching and O(N*N) for calculating the Levenshtein distance.
The OP is looking for an efficient way to find the smallest Hamming distance (c), not the string itself. If you have an upper bound on c (say X), you can find the smallest c in O(log(X)*M*N).
As Stefan pointed out, you can quickly find strings within a given Hamming distance. This page http://blog.faroo.com/2015/03/24/fast-approximate-string-matching-with-large-edit-distances/ talks about one such way using tries. Modify this to just test whether there is such a string, and binary search on c from 0 to X.
If up-front costs don't matter, you could calculate the best match for every possible input and put the results in a hash map.
Of course this won't work if N isn't extremely small.

Resources