Palindromes in a tree - algorithm

I am looking at this challenge:
Given a tree with N nodes and N-1 edges. Each edge on the tree is labelled by a string of lowercase letters from the Latin alphabet. Given Q queries, consisting of two nodes u and v, check if it is possible to make a palindrome string which uses all the characters that belong to the string labelled on the edges in the path from node u to node v.
Characters can be used in any order.
N is of the order of 105 and Q is of the order of 106
Input:
N=3
u=1 v=3 weight=bc
u=1 v=2 weight=aba
Q=4
u=1 v=2
u=2 v=3
u=3 v=1
u=3 v=3
Output:
YES
YES
NO
NO
What I thought was to compute the LCA between 2 nodes by precomputation in O(1) using sparse table and Range minimum query on Euler tower and then see the path from LCA to node u and LCA to node v and store all the characters frequency. If the sum of frequency of all the characters is odd, we check if the frequency of each character except one is odd. If the sum of frequency of all the characters is even, we check if the frequency of each character is even. But this process will surely time out because Q can be upto 106.
Is there anyone with a better algorithm?

Preparation Step
Prepare your data structure as follows:
For each node get the path to the root, get all letters on the path, and only retain a letter when it occurs an odd number of times on that path. Finally encode that string with unique letters as a bit pattern, where bit 0 is set when there is an "a", bit 1 is set when there is a "b", ... bit 25 is set when there is a "z". Store this pattern with the node.
This preprocessing can be done with a depth-first recursive procedure, where the current node's pattern is passed down to the children, which can apply the edge's information to that pattern to create their own pattern. So this preprocessing can run in linear time in terms of the total number of characters in the tree, or more precisely O(N+S), where S represents that total number of characters.
Query Step
When a query is done perform the bitwise XOR on the two involved patterns. If the result is 0 or it has only one bit set, return "YES", else return "NO". So the query will not visit any other nodes than just the two ones that are given, look up the two patterns and perform their XOR and make the bit test. All this happens in constant time for one query.
The last query given in the question shows that the result should be "NO" when the two nodes are the same node. This is a boundary case, as it is debatable whether an empty string is a palindrome or not. The above XOR algorithm would return "YES", so you would need a specific test for this boundary case, and return "NO" instead.
Explanation
This works because if we look at the paths both nodes have to the root, they may share a part of their path. The characters on that common path should not be considered, and the XOR will make sure they aren't. Where the paths differ, we actually have the edges on the path from the one node to the other. There we see the characters that should contribute to a palindrome.
If a character appears an even number of times in those edges, it poses no problem for creating a palindrome. The XOR makes sure those characters "disappear".
If a character appears an odd number of times, all but one can mirror each other like in the even case. The remaining one can only be used in an odd-length palindrome, and only in the centre position of it. So there can only be one such character. This translates to the test that the XOR result is allowed to have 1 bit set (but not more).
Implementation
Here is an implementation in JavaScript. The example run uses the input as provided in the question. I did not bother to turn the query results from boolean to NO/YES:
function prepare(edges) {
// edges: array of [u, v, weight] triplets
// Build adjacency list from the list of edges
let adjacency = {};
for (let [u, v, weight] of edges) {
// convert weight to pattern, as we don't really need to
// store the strings
let pattern = 0;
for (let i = 0; i < weight.length; i++) {
let ascii = weight.charCodeAt(i) - 97;
pattern ^= 1 << ascii; // toggle bit that corresponds to letter
}
if (v in adjacency && u in adjacency) throw "Cycle detected!";
if (!(v in adjacency)) adjacency[v] = {};
if (!(u in adjacency)) adjacency[u] = {};
adjacency[u][v] = pattern;
adjacency[v][u] = pattern;
}
// Prepare the consolidated path-pattern for each node
let patterns = {}; // This is the information to return
function dfs(u, parent, pathPattern) {
patterns[u] = pathPattern;
for (let v in adjacency[u]) {
// recurse into the "children" (the parent is not revisited)
if (v !== parent) dfs(v, u, adjacency[u][v] ^ pathPattern);
}
}
// Start a DFS from an arbitrary node as root
dfs(edges[0][0], null, 0);
return patterns;
}
function query(nodePatterns, u, v) {
if (u === v) return false; // Boundary case.
let pattern = nodePatterns[u] ^ nodePatterns[v];
// "smart" test to verify that at most 1 bit is set
return pattern === (pattern & -pattern);
}
// Example:
let edges = [[1, 3, "bc"], [1, 2, "aba"]];
let queries = [[1, 2], [2, 3], [3, 1], [3, 3]];
let nodePatterns = prepare(edges);
for (let [u, v] of queries) {
console.log(u, v, query(nodePatterns, u, v));
}

First of all, let's choose a root. Now imagine that each edge points to a node which is deeper in the tree. Instead of having strings on edges, put them on vertices that those edges point to. Now there is no string only at your root. Now for each vertex calculate and store amount of each letter in it's string.
Since now we'll be doing stuff for each letter seperately.
Using DFS, calculate for each node v number of letters on vertices on a path from v to root. You'll also need LCA, so you may precompute RMQ or find LCA in O(logn) if you like. Let Letters[v][c] be number of letters c on path from v to root. Then, to find number of letter c from u to v just use Letters[v][c] + Letters[u][c] - 2 * Letters[LCA(v, u)][c]. You can check amount of single letter in O(1) (or O(logn) if you're not using RMQ). So in 26* O(1) you can check every single possible letter.

Related

Implementing cartesian product, such that it can skip iterations

I want to implement a function which will return cartesian product of set, repeated given number. For example
input: {a, b}, 2
output:
aa
ab
bb
ba
input: {a, b}, 3
aaa
aab
aba
baa
bab
bba
bbb
However the only way I can implement it is firstly doing cartesion product for 2 sets("ab", "ab), then from the output of the set, add the same set. Here is pseudo-code:
function product(A, B):
result = []
for i in A:
for j in B:
result.append([i,j])
return result
function product1(chars, count):
result = product(chars, chars)
for i in range(2, count):
result = product(result, chars)
return result
What I want is to start computing directly the last set, without computing all of the sets before it. Is this possible, also a solution which will give me similar result, but it isn't cartesian product is acceptable.
I don't have problem reading most of the general purpose programming languages, so if you need to post code you can do it in any language you fell comfortable with.
Here's a recursive algorithm that builds S^n without building S^(n-1) "first". Imagine an infinite k-ary tree where |S| = k. Label with the elements of S each of the edges connecting any parent to its k children. An element of S^m can be thought of as any path of length m from the root. The set S^m, in that way of thinking, is the set of all such paths. Now the problem of finding S^n is a problem of enumerating all paths of length n - and we can name a path by considering the sequence of edge labels from beginning to end. We want to directly generate S^n without first enumerating all of S^(n-1), so a depth-first search modified to find all nodes at depth n seems appropriate. This is essentially how the below algorithm works:
// collection to hold generated output
members = []
// recursive function to explore product space
Products(set[1...n], length, current[1...m])
// if the product we're working on is of the
// desired length then record it and return
if m = length then
members.append(current)
return
// otherwise we add each possible value to the end
// and generate all products of the desired length
// with the new vector as a prefix
for i = 1 to n do
current.addLast(set[i])
Products(set, length, current)
currents.removeLast()
// reset the result collection and request the set be generated
members = []
Products([a, b], 3, [])
Now, a breadth-first approach is no less efficient than a depth-first one, and if you think about it would be no different from exactly what you're already doing. Indeed, and approach that generates S^n must necessarily generate S^(n-1) at least once, since that can be found in a solution to S^n.

Approximate substring matching using a Suffix Tree

This article discusses approximate substring matching techniques that utilize a suffix tree to improve matching time. Each answer addresses a different algorithm.
Approximate substring matching attempts to find a substring (pattern) P in a string T allowing up to k mismatches.
To learn how to create a suffix tree, click here. However, some algorithms require additional preprocessing.
I invite people to add new algorithms (even if it's incomplete) and improve answers.
This was the original question that started this thread.
Professor Esko Ukkonen published a paper: Approximate string-matching over suffix trees. He discusses 3 different algorithms that have different matching times:
Algorithm A: O(mq + n)
Algorithm B: O(mq log(q) + size of the output)
Algorithm C: O(m^2q + size of the output)
Where m is the length of the substring, n is the length of the search string, and q is the edit distance.
I've been trying to understand algorithm B but I'm having trouble following the steps. Does anyone have experience with this algorithm? An example or pseudo algorithm would be greatly appreciated.
In particular:
What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
Here's what I believe (I stand to be corrected):
On page seven, we're introduced to suffix tree concepts; a state is effectively a node in the suffix tree: let root denote the initial state.
g(a, c) = b where a and b are nodes in the tree and c is a character or substring in the tree. So this represents a transition; from a, following the edges represented by c, we move to node b. This is referred to as the go-to transition. So for the suffix tree below, g(root, 'ccb') = red node
Key(a) = edge sequence where a represents a node in the tree. For example, Key(red node) = 'ccb'. So g(root, Key(red node)) = red node.
Keys(Subset of node S) = { Key(node) | node ∈ S}
There is a suffix function for nodes a and b, f(a) = b: for all (or perhaps there may exist) a ≠ root, there exists a character c, a substring x, and a node b such that g(root, cx) = a and g(root, x) = b. I think that this means, for the suffix tree example above, that f(pink node) = green node where c = 'a' and x = 'bccb'.
There is a mapping H that contains a node from the suffix tree and a value pair. The value is given by loc(w); I'm still uncertain how to evaluate the function. This dictionary contains nodes that have not been eliminated.
extract-min(H) refers to attaining the entry with the smallest value in the pair (node, loc(w)) from H.
The crux of the algorithm seems to be related to how loc(w) is evaluated. I've constructed my suffix tree using the combined answer here; however, the algorithms work on a suffix trie (uncompressed suffix tree). Therefore concepts like the depth need to be maintained and processed differently. In the suffix trie the depth would represent the suffix length; in a suffix tree, the depth would simply represent the node depth in the tree.
You are doing well. I don't have familiarity with the algorithm, but have read the paper today. Everything you wrote is correct as far as it goes. You are right that some parts of the explanation assume a lot.
Your Questions
1.What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
The output consists of the maximal k-distance matches of P in T. In particular you'll get the final index and length for each. So clearly this is also O(n) (remember big-O is an upper bound), but may be smaller. This is a nod to the fact that it's impossible to generate p matches in less than O(p) time. The rest of the time bound concerns only the pattern length and the number of viable prefixes, both of which can be arbitrarily small, so the output size can dominate. Consider k=0 and the input is 'a' repeated n times with the pattern 'a'.
2.Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
You're right. It's an error. The loop index should be i. What about j? This is the index of the column corresponding to the input character being processed in the dynamic program. It should really be an input parameter.
Let's take a step back. The Example table on page 6 is computed left-to-right, column-by-column using equations (1-4) given earlier. These show that only the previous columns of D and L are needed to get the next. Function dp is just an implementation of this idea of computing column j from j-1. Column j of D and L are called d and l respectively. Column j-1 D and L are d' and l', the function input parameters.
I recommend you work through the dynamic program until you understand it well. The algorithm is all about avoiding duplicate column computations. Here "duplicate" means "having the same values in the essential part", because that's all that matters. The inessential parts can't affect the answer.
The uncompressed trie is just the compressed one expanded in the obvious way to have one edge per character. Except for the idea of "depth", this is unimportant. In the compressed tree, depth(s) is just the length of the string - which he calls Key(s) - needed to get from root node s.
Algorithm A
Algorithm A is just a clever caching scheme.
All his theorems and lemmas show that 1) we only need to worry about the essential parts of columns and 2) the essential part of a column j is completely determined by the viable prefix Q_j. This is the longest suffix of the input ending at j that matches a prefix of the pattern (within edit distance k). In other words, Q_j is the maximal start of a k-edit match at the end of the input considered so far.
With this, here's pseudo-code for Algorithm A.
Let r = root of (uncompressed) suffix trie
Set r's cached d,l with formulas at end page 7 (0'th dp table columns)
// Invariant: r contains cached d,l
for each character t_j from input text T in sequence
Let s = g(r, t_j) // make the go-to transition from r on t_j
if visited(s)
r = s
while no cached d,l on node r
r = f(r) // traverse suffix edge
end while
else
Use cached d',l' on r to find new columns (d,l) = dp(d',l')
Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
r = s
while depth(r) != |Q_j|
mark r visited
r = f(r) // traverse suffix edge
end while
mark r visited
set cached d,l on node r
end if
end for
I've left out the output step for simplicity.
What is traversing suffix edges about? When we do this from a node r where Key(r) = aX (leading a followed by some string X), we are going to the node with Key X. The consequence: we are storing each column corresponding to a viable prefix Q_h at the trie node for the suffix of the input with prefix Q_h. The function f(s) = r is the suffix transition function.
For what it's worth, the Wikipedia picture of a suffix tree shows this pretty well. For example, if from the node for "NA" we follow the suffix edge, we get to the node for "A" and from there to "". We are always cutting off the leading character. So if we label state s with Key(s), we have f("NA") = "A" and f("A") = "". (I don't know why he doesn't label states like this in the paper. It would simplify many explanations.)
Now this is very cool because we are computing only one column per viable prefix. But it's still expensive because we are inspecting each character and potentially traversing suffix edges for each one.
Algorithm B
Algorithm B's intent is to go faster by skipping through the input, touching only those characters likely to produce a new column, i.e. those that are the ends of input that match a previously unseen viable prefix of the pattern.
As you'd suspect, the key to the algorithm is the loc function. Roughly speaking, this will tell where the next "likely" input character is. The algorithm is quite a bit like A* search. We maintain a min heap (which must have a delete operation) corresponding to the set S_i in the paper. (He calls it a dictionary, but this is not a very conventional use of the term.) The min heap contains potential "next states" keyed on the position of the next "likely character" as described above. Processing one character produces new entries. We keep going until the heap is empty.
You're absolutely right that here he gets sketchy. The theorems and lemmas are not tied together to make an argument on correctness. He assumes you will redo his work. I'm not entirely convinced by this hand-waving. But there does seem to be enough there to "decode" the algorithm he has in mind.
Another core concept is the set S_i and in particular the subset that remains not eliminated. We'll keep these un-eliminated states in the min-heap H.
You're right to say that the notation obscures the intent of S_i. As we process the input left-to-right and reach position i, we have amassed a set of viable prefixes seen so far. Each time a new one is found, a fresh dp column is computed. In the author's notation these prefixes would be Q_h for all h<=i or more formally { Q_h | h <= i }. Each of these has a path from the root to a unique node. The set S_i consists of all the states we get by taking one more step from all these nodes along go-to edges in the trie. This produces the same result as going through the whole text looking for each occurrence of Q_h and the next character a, then adding the state corresponding to Q_h a into S_i, but it's faster. The Keys for the S_i states are exactly the right candidates for the next viable prefix Q_{i+1}.
How do we choose the right candidate? Pick the one that occurs next after position i in the input. This is where loc(s) comes in. The loc value for a state s is just what I just said above: the position in the input starting at i where the viable prefix associated with that state occurs next.
The important point is that we don't want to just assign the newly found (by pulling the min loc value from H) "next" viable prefix as Q_{i+1} (the viable prefix for dp column i+1) and go on to the next character (i+2). This is where we must set the stage to skip ahead as far as possible to the last character k (with dp column k) such Q_k = Q_{i+1}. We skip ahead by following suffix edges as in Algorithm A. Only this time we record our steps for future use by altering H: removing elements, which is the same as eliminating elements from S_i, and modifying loc values.
The definition of function loc(s) is bare, and he never says how to compute it. Also unmentioned is that loc(s) is also a function of i, the current input position being processed (that he jumps from j in earlier parts of the paper to i here for the current input position is unhelpful.) The impact is that loc(s) changes as input processing proceeds.
It turns out that the part of the definition that applies to eliminated states "just happens" because states are marked eliminated upon removal form H. So for this case we need only check for a mark.
The other case - un-eliminated states - requires that we search forward in the input looking for the next occurrence in the text that is not covered by some other string. This notion of covering is to ensure we are always dealing with only "longest possible" viable prefixes. Shorter ones must be ignored to avoid outputting other than maximal matches. Now, searching forward sounds expensive, but happily we have a suffix trie already constructed, which allows us to do it in O(|Key(s)|) time. The trie will have to be carefully annotated to return the relevant input position and to avoid covered occurrences of Key(s), but it wouldn't be too hard. He never mentions what to do when the search fails. Here loc(s) = \infty, i.e. it's eliminated and should be deleted from H.
Perhaps the hairiest part of the algorithm is updating H to deal with cases where we add a new state s for a viable prefix that covers Key(w) for some w that was already in H. This means we have to surgically update the (loc(w) => w) element in H. It turns out the suffix trie yet again supports this efficiently with its suffix edges.
With all this in our heads, let's try for pseudocode.
H = { (0 => root) } // we use (loc => state) for min heap elements
until H is empty
(j => s_j) = H.delete_min // remove the min loc mapping from
(d, l) = dp(d', l', j) where (d',l') are cached at paraent(s_j)
Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
r = s_j
while depth(r) > |Q_j|
mark r eliminated
H.delete (_ => r) // loc value doesn't matter
end while
set cached d,l on node r
// Add all the "next states" reachable from r by go-tos
for all s = g(r, a) for some character a
unless s.eliminated?
H.insert (loc(s) => s) // here is where we use the trie to find loc
// Update H elements that might be newly covered
w = f(s) // suffix transition
while w != null
unless w.eliminated?
H.increase_key(loc(w) => w) // using explanation in Lemma 9.
w = f(w) // suffix transition
end unless
end while
end unless
end for
end until
Again I've omitted the output for simplicity. I will not say this is correct, but it's in the ballpark. One thing is that he mentions we should only process Q_j for nodes not before "visited," but I don't understand what "visited" means in this context. I think visited states by Algorithm A's definition won't occur because they've been removed from H. It's a puzzle...
The increase_key operation in Lemma 9 is hastily described with no proof. His claim that the min operation is possible in O(log |alphabet|) time is leaving a lot to the imagination.
The number of quirks leads me to wonder if this is not the final draft of the paper. It is also a Springer publication, and this copy on-line would probably violate copyright restrictions if it were precisely the same. It might be worth looking in a library or paying for the final version to see if some of the rough edges were knocked off during final review.
This is as far as I can get. If you have specific questions, I'll try to clarify.

substring finding from a string

Input: string S = AAGATATGATAGGAT.
Output: Maximal repeats such as GATA (as in positions 3 and 8), GAT (as in position 3, 8 and 13) and so on...
A maximal repeat is a substring t occurs k>1 times in S, and if t is extended to left or right, it will occur less than k times.
An internal node’s leaf descendants are suffixes, each of which has a left character.
If the left characters of all leaf descendants are not all identical, it’s called a “left-diverse” node.
Maximal repeats is left-diverse internal nodes.
Overall idea:
Build a suffix tree and then do a DFS (depth first search) on the tree
For each leaf, label it with its left character
For each internal node:
If at least one child is labelled with *, then label it with *
Else if its children’s labels are diverse, label with *.
Else then all children have same label, copy it to current node
Is the above idea is correct? How does the pseudo-code to be? Then I can try to write programming myself.
Your idea is good, but with a suffix tree you can do something even easier.
Let T be the suffix tree of your sequence .
Let x be a node in T, T_x is the subtree of T with root x.
Let N_x be the number of leaf in T_x
Let word(x) be the word created by traversing T from root to node x
Now using the definition of a suffix tree we get :
Number of repeats of word(x) = N_x and the position of this words are the label of each leaf
The algorithm for this would be a basic tree traversal, for each node in the tree calculate N_x, if N_x > 2 add this to your result (if you want the position too you can add the label of each leaf)
Pseudo code :
input :
mySequence
output:
Result (list of word that repeat with count and position)
Algorithm :
T = suffixTree(mySequence)
For each internal node X in T:
T_X = subTree(T)
N_X = Number of lead (T_X)
if N_X >=2 :
Result .add ( [word(X), N_X , list(label of leafs)] )
return Result
Example :
let's take the wikipedia example for suffix trees : "BANANA" :
we get :
N_A = 3 so "A" repeats 3 times in position {1,3,5}
N_N=2 so "N" repeats 2 times in position {2,4}
N_NA=2 so "NA" repeats 2 times in position {2,4}
I found this paper that seems to treat your problem the same way you're doing, so yes I think your method is write :
Spelling approximate repeated or common motifs using a suffix tree
Extract
We present in this paper two algorithms. The first one extracts
repeated motifs from a sequence defined over an alphabet Sigma. For
instance, Sigma may be equal to {A, C, G, T} and the sequence
represent an encoding of a DNA macromolecule. The motifs searched
correspond to words over the same alphabet which occur a minimum
number q of times in the sequence with at most e mismatches each time
(q is called the quorum constraint).[...]
You can download it and have a look at it , the author gives pseudo code for your algorithm.
Hope this helps

For every vertex in a graph, find all vertices within a distance d

In my particular case, the graph is represented as an adjacency list and is undirected and sparse, n can be in the millions, and d is 3. Calculating A^d (where A is the adjacency matrix) and picking out the non-zero entries works, but I'd like something that doesn't involve matrix multiplication. A breadth-first search on every vertex is also an option, but it is slow.
def find_d(graph, start, st, d=0):
if d == 0:
st.add(start)
else:
st.add(start)
for edge in graph[start]:
find_d(graph, edge, st, d-1)
return st
graph = { 1 : [2, 3],
2 : [1, 4, 5, 6],
3 : [1, 4],
4 : [2, 3, 5],
5 : [2, 4, 6],
6 : [2, 5]
}
print find_d(graph, 1, set(), 2)
Let's say that we have a function verticesWithin(d,x) that finds all vertices within distance d of vertex x.
One good strategy for a problem such as this, to expose caching/memoisation opportunities, is to ask the question: How are the subproblems of this problem related to each other?
In this case, we can see that verticesWithin(d,x) if d >= 1 is the union of vertices(d-1,y[i]) for all i within range, where y=verticesWithin(1,x). If d == 0 then it's simply {x}. (I'm assuming that a vertex is deemed to be of distance 0 from itself.)
In practice you'll want to look at the adjacency list for the case d == 1, rather than using that relation, to avoid an infinite loop. You'll also want to avoid the redundancy of considering x itself as a member of y.
Also, if the return type of verticesWithin(d,x) is changed from a simple list or set, to a list of d sets representing increasing distance from x, then
verticesWithin(d,x) = init(verticesWithin(d+1,x))
where init is the function that yields all elements of a list except the last one. Obviously this would be a non-terminating recursive relation if transcribed literally into code, so you have to be a little bit clever about how you implement it.
Equipped with these relations between the subproblems, we can now cache the results of verticesWithin, and use these cached results to avoid performing redundant traversals (albeit at the cost of performing some set operations - I'm not entirely sure that this is a win). I'll leave it as an exercise to fill in the implementation details.
You already mention the option of calculating A^d, but this is much, much more than you need (as you already remark).
There is, however, a much cheaper way of using this idea. Suppose you have a (column) vector v of zeros and ones, representing a set of vertices. The vector w := A v now has a one at every node that can be reached from the starting node in exactly one step. Iterating, u := A w has a one for every node you can reach from the starting node in exactly two steps, etc.
For d=3, you could do the following (MATLAB pseudo-code):
v = j'th unit vector
w = v
for i = (1:d)
v = A*v
w = w + v
end
the vector w now has a positive entry for each node that can be accessed from the jth node in at most d steps.
Breadth first search starting with the given vertex is an optimal solution in this case. You will find all the vertices that within the distance d, and you will never even visit any vertices with distance >= d + 2.
Here is recursive code, although recursion can be easily done away with if so desired by using a queue.
// Returns a Set
Set<Node> getNodesWithinDist(Node x, int d)
{
Set<Node> s = new HashSet<Node>(); // our return value
if (d == 0) {
s.add(x);
} else {
for (Node y: adjList(x)) {
s.addAll(getNodesWithinDist(y,d-1);
}
}
return s;
}

Shortest path to transform one word into another

For a Data Structures project, I must find the shortest path between two words (like "cat" and "dog"), changing only one letter at a time. We are given a Scrabble word list to use in finding our path. For example:
cat -> bat -> bet -> bot -> bog -> dog
I've solved the problem using a breadth first search, but am seeking something better (I represented the dictionary with a trie).
Please give me some ideas for a more efficient method (in terms of speed and memory). Something ridiculous and/or challenging is preferred.
I asked one of my friends (he's a junior) and he said that there is no efficient solution to this problem. He said I would learn why when I took the algorithms course. Any comments on that?
We must move from word to word. We cannot go cat -> dat -> dag -> dog. We also have to print out the traversal.
NEW ANSWER
Given the recent update, you could try A* with the Hamming distance as a heuristic. It's an admissible heuristic since it's not going to overestimate the distance
OLD ANSWER
You can modify the dynamic-program used to compute the Levenshtein distance to obtain the sequence of operations.
EDIT: If there are a constant number of strings, the problem is solvable in polynomial time. Else, it's NP-hard (it's all there in wikipedia) .. assuming your friend is talking about the problem being NP-hard.
EDIT: If your strings are of equal length, you can use Hamming distance.
With a dictionary, BFS is optimal, but the running time needed is proportional to its size (V+E). With n letters, the dictionary might have ~a^n entires, where a is alphabet size. If the dictionary contains all words but the one that should be on the end of chain, then you'll traverse all possible words but won't find anything. This is graph traversal, but the size might be exponentially large.
You may wonder if it is possible to do it faster - to browse the structure "intelligently" and do it in polynomial time. The answer is, I think, no.
The problem:
You're given a fast (linear) way to check if a word is in dictionary, two words u, v and are to check if there's a sequence u -> a1 -> a2 -> ... -> an -> v.
is NP-hard.
Proof: Take some 3SAT instance, like
(p or q or not r) and (p or not q or r)
You'll start with 0 000 00 and are to check if it is possible to go to 2 222 22.
The first character will be "are we finished", three next bits will control p,q,r and two next will control clauses.
Allowed words are:
Anything that starts with 0 and contains only 0's and 1's
Anything that starts with 2 and is legal. This means that it consists of 0's and 1's (except that the first character is 2, all clauses bits are rightfully set according to variables bits, and they're set to 1 (so this shows that the formula is satisfable).
Anything that starts with at least two 2's and then is composed of 0's and 1's (regular expression: 222* (0+1)*, like 22221101 but not 2212001
To produce 2 222 22 from 0 000 00, you have to do it in this way:
(1) Flip appropriate bits - e.g. 0 100 111 in four steps. This requires finding a 3SAT solution.
(2) Change the first bit to 2: 2 100 111. Here you'll be verified this is indeed a 3SAT solution.
(3) Change 2 100 111 -> 2 200 111 -> 2 220 111 -> 2 222 111 -> 2 222 211 -> 2 222 221 -> 2 222 222.
These rules enforce that you can't cheat (check). Going to 2 222 22 is possible only if the formula is satisfable, and checking that is NP-hard. I feel it might be even harder (#P or FNP probably) but NP-hardness is enough for that purpose I think.
Edit: You might be interested in disjoint set data structure. This will take your dictionary and group words that can be reached from each other. You can also store a path from every vertex to root or some other vertex. This will give you a path, not neccessarily the shortest one.
There are methods of varying efficiency for finding links - you can construct a complete graph for each word length, or you can construct a BK-Tree, for example, but your friend is right - BFS is the most efficient algorithm.
There is, however, a way to significantly improve your runtime: Instead of doing a single BFS from the source node, do two breadth first searches, starting at either end of the graph, and terminating when you find a common node in their frontier sets. The amount of work you have to do is roughly half what is required if you search from only one end.
You can make it a little quicker by removing the words that are not the right length, first. More of the limited dictionary will fit into the CPU's cache. Probably all of it.
Also, all of the strncmp comparisons (assuming you made everything lowercase) can be memcmp comparisons, or even unrolled comparisons, which can be a speedup.
You could use some preprocessor magic and hard-compile the task for that word-length, or roll a few optimized variations of the task for common word lengths. All of those extra comparisons can 'go away' for pure unrolled fun.
This is a typical dynamic programming problem. Check for the Edit Distance problem.
What you are looking for is called the Edit Distance. There are many different types.
From (http://en.wikipedia.org/wiki/Edit_distance): "In information theory and computer science, the edit distance between two strings of characters is the number of operations required to transform one of them into the other."
This article about Jazzy (the java spell check API) has a nice overview of these sorts of comparisons (it's a similar problem - providing suggested corrections) http://www.ibm.com/developerworks/java/library/j-jazzy/
You could find the longest common subsequence, and therefore finding the letters that must be changed.
My gut feeling is that your friend is correct, in that there isn't a more efficient solution, but that is assumming you are reloading the dictionary every time. If you were to keep a running database of common transitions, then surely there would be a more efficient method for finding a solution, but you would need to generate the transitions beforehand, and discovering which transitions would be useful (since you can't generate them all!) is probably an art of its own.
bool isadjacent(string& a, string& b)
{
int count = 0; // to store count of differences
int n = a.length();
// Iterate through all characters and return false
// if there are more than one mismatching characters
for (int i = 0; i < n; i++)
{
if (a[i] != b[i]) count++;
if (count > 1) return false;
}
return count == 1 ? true : false;
}
// A queue item to store word and minimum chain length
// to reach the word.
struct QItem
{
string word;
int len;
};
// Returns length of shortest chain to reach 'target' from 'start'
// using minimum number of adjacent moves. D is dictionary
int shortestChainLen(string& start, string& target, set<string> &D)
{
// Create a queue for BFS and insert 'start' as source vertex
queue<QItem> Q;
QItem item = {start, 1}; // Chain length for start word is 1
Q.push(item);
// While queue is not empty
while (!Q.empty())
{
// Take the front word
QItem curr = Q.front();
Q.pop();
// Go through all words of dictionary
for (set<string>::iterator it = D.begin(); it != D.end(); it++)
{
// Process a dictionary word if it is adjacent to current
// word (or vertex) of BFS
string temp = *it;
if (isadjacent(curr.word, temp))
{
// Add the dictionary word to Q
item.word = temp;
item.len = curr.len + 1;
Q.push(item);
// Remove from dictionary so that this word is not
// processed again. This is like marking visited
D.erase(temp);
// If we reached target
if (temp == target)
return item.len;
}
}
}
return 0;
}
// Driver program
int main()
{
// make dictionary
set<string> D;
D.insert("poon");
D.insert("plee");
D.insert("same");
D.insert("poie");
D.insert("plie");
D.insert("poin");
D.insert("plea");
string start = "toon";
string target = "plea";
cout << "Length of shortest chain is: "
<< shortestChainLen(start, target, D);
return 0;
}
Copied from: https://www.geeksforgeeks.org/word-ladder-length-of-shortest-chain-to-reach-a-target-word/

Resources