Ukkonen's suffix tree algorithm: procedure 'test and split' unclear

Ukkonen's on-line construction algorithm
I have a problem understanding the 'test and split' procedure, which begins as follows:
procedure test–and–split(s, (k, p), t):
1. if k ≤ p then
2. let g′(s,(k′,p′)) = s′ be the tk–transition from s
3. if t = t(k′+p−k+1) then return (true, s)
My problem is with the 2nd line: what exactly does it mean? How can g′(s,(k′,p′)) still be a tk–transition if it starts from s and is followed by t(k′) instead of t(k)?

Probably you already figured it out and you don't need an answer anymore, but since I had the same problem in trying to understand it, and maybe it'll be useful for someone else in the future, the answer I think is the following one.
In Ukkonen's on line construction algorithm, on page 7 you can read that:
...
The string w spelled out by the transition path in STrie(T) between two explicit states s and r is represented in STree(T) as generalized transition g′(s,w) = r. To save space the string w is actually represented as a pair (k,p) of pointers (the left pointer k and the right pointer p) to T such that tk . . . tp = w. In this way the generalized transition gets form g′(s, (k, p)) = r.
Such pointers exist because there must be a suffix Ti such that the transition path for Ti in STrie(T) goes through s and r. We could select the smallest such i, and let k and p point to the substring of this Ti that is spelled out by the transition path from s to r. A transition g′(s, (k, p)) = r is called an a–transition if tk = a. Each s can have at most one a–transition for each a ∈ Σ.
...
This means that we look for the smallest indexes k and p such that tk . . . tp = w in T: if w occurs more than once in T, k and p always reference the first occurrence.
Now, procedure test–and–split(s,(k,p),t) tests whether or not a state with canonical reference pair (s,(k,p)) is the endpoint, that is, a state that in STrie(T i−1) would have a ti –transition. Symbol ti is given as input parameter t.
The first lines of the algorithm are the following:
procedure test–and–split(s,(k,p),t):
1. if k ≤ p then
2. let g′(s,(k′,p′)) = s′ be the t(k)–transition from s;
3. if t = t(k′+p−k+1) then return(true,s)
4. else ...
On line 1 we check if the state is implicit (that is when k <= p).
If so, then on line 2 we look up the transition from s that starts with the character at position k of T (that is, tk). Note that tk must be equal to tk′, but the indexes k and k′ can be different, because the stored pointers always refer to the first occurrence of the string w in T. Remember also that a state has at most one transition that starts with the character tk, so this transition is the correct and only candidate.
Then on line 3 we check whether the state referenced by the canonical reference pair (s,(k,p)) is the endpoint, that is, whether it has a ti–transition. The state (s,(k,p)) is the one (implicit or not) that we reach from state s by following the tk′–transition (which is the tk–transition because tk′ = tk) across the p − k + 1 characters tk . . . tp. On that transition these characters sit at positions k′ . . . k′+p−k, so the next character is t(k′+p−k+1); this explains the index on line 3, where the +1 selects the character we compare with t (where t = ti). If they are equal, we have reached the endpoint and we return true.
Else, starting from line 4, we split the transition g′(s,(k′,p′)) = s′ to make explicit the state (s,(k,p)) and return the new explicit state.

Related

What is non-crossing matching between two strings in diff3 algorithm?

I'm reading through this article https://www.cis.upenn.edu/~bcpierce/papers/diff3-short.pdf and can't understand what this paragraph means:
Below is the text copied from the paper:
The first step of diff3 is to call a two-way comparison subroutine on (O, A) and (O, B) to compute a non-crossing matching M_a between the indices of O and A - that is, a boolean function on pairs of indices from O and A such that if M_a[i, j] = true then (a) O[i] = A[j], (b) M_a[i′, j] = false and M_a[i, j′] = false whenever i′ != i and j′ != j, and (c) M_a[i′, j′] = false whenever either i′ < i and j′ > j or i′ > i and j′ < j - and a non-crossing matching M_b between the indices of O and B. We treat this algorithm as a black box, simply assuming (a) that it is deterministic, and (b) that it always yields maximum matchings. For the counterexamples in the next section, we have verified that the matchings we use correspond to the ones actually chosen by GNU diff3.
Here O is the list of characters of the base string, A is the list of characters of string a, and B is the list of characters of string b (if I understood the beginning of the article correctly).
Q: What is non-crossing matching between two strings? Can you show a visual example of this?
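One way to read the three conditions in the quoted definition is as a checker over a set of matched index pairs. The sketch below is only an interpretation under stated assumptions: the function name is made up, the matching is given as a list of (i, j) pairs rather than a boolean function, and condition (b) is read as "each index is matched at most once".

```python
def is_noncrossing_matching(O, A, pairs):
    """Check conditions (a)-(c) for a matching given as (i, j) index pairs."""
    pairs = sorted(set(pairs))
    # (a) matched positions hold equal elements
    if any(O[i] != A[j] for i, j in pairs):
        return False
    # (b) every index of O and every index of A is matched at most once
    if len({i for i, _ in pairs}) != len(pairs):
        return False
    if len({j for _, j in pairs}) != len(pairs):
        return False
    # (c) no crossing: with pairs sorted by i, the j values must increase too
    return all(j1 < j2 for (_, j1), (_, j2) in zip(pairs, pairs[1:]))


O, A = "abc", "bac"
parallel = is_noncrossing_matching(O, A, [(0, 1), (2, 2)])  # a-a and c-c
crossing = is_noncrossing_matching(O, A, [(0, 1), (1, 0)])  # a-a and b-b cross
```

If you draw O above A and connect matched positions with lines, the first matching has lines that never intersect, while in the second the a-a line and the b-b line cross.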

Finding start of cycle in linked list

I have a doubt related to a leetcode question (Linked List Cycle II), whose solution is listed here. Specifically I want to ask about the following code block in the solution:
node1, node2 = head, hare
while node1 != node2:
    node1 = node1.next
    node2 = node2.next
return node1
After the tortoise and the hare meet, we need to find where the cycle starts in the linked list, so two pointers are started: one from the head and the other from the hare's position. My question is: why does this code block always work? Why can't node2 end up always one step behind node1?
Two steps here. First we show what the solution implies algebraically, and then we show that the solution exists. This is "going" backwards and then forward again - I assume that the solution above is true, check what are the implications, and prove that they can occur.
I'm not sure there is an easy intuition arising from the proof below. I for one can't see something that would be trivial to deduce.
Step 1
Denote our nodes 0...L, the cycle start point as C, and the first meeting point of the hare and the tortoise (can we just say turtle?), if it exists, as M.
Lemma 1 M = (L-C)J where J is some Integer.
This comes from looking at what the hare passed:
The total distance is just 2M, since the tortoise walked M nodes (this is where setting the starting point to 0 starts to pay off, otherwise we would need -1s everywhere).
On the other hand, the hare arrived at M, and then kept going through L-C length cycles. If it bothers you it might "miss" M in a few runs, remember it doesn't matter - in the end it gets to M, and you can go backwards by single steps, unwinding an integer amount of cycles, then going back from M to 0.
So:
2M = M+(L-C)J => M = (L-C)J
and we're done.
Lemma 2 If M exists, C = (L-M) + (L-C)I where I is some integer.
This is easier. Again we look at what the two nodes have to pass. The head has to pass precisely C (LHS), while the node at the meeting point has to get to L from M, and then one more to get to C. Since we are 0 counting, this ends up as L-M. Now it has to go through L-C an integer amount of cycles, proving the above.
Step 2
Now we show that the solution exists.
Lemma 3 J from Lemma 1. exists such that L >= M >= C.
If there exists a J such that (L-C)J = C we are done. Otherwise, take the smallest K such that
(L-C)K > C
and assume, for contradiction, that
(L-C)K > L => (L-C)K - (L-C) > L - (L-C) => (L-C)(K-1) > C
contradicting the assumption K was minimal. Thus, J=K solves our problem.
Lemma 4 I from Lemma 2 exists.
To see this we merely need to check that there is a solution to C = (L-M) + (L-C)I where I and J are positive integers. We substitute M and have:
C = (L-M) + (L-C)I = L-(L-C)J+(L-C)I = (1-J+I)L + (J-I)C => (1-J+I)L=(1-J+I)C
So if there is to be an integer solution, either L=C, which is uninteresting, or
I=J-1
Q.E.D
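For readers who prefer to check the claim empirically, here is a minimal, self-contained sketch of the whole procedure around the code block from the question. The Node class and the build_list helper are assumptions made for the test harness, not part of the original solution.

```python
class Node:
    def __init__(self, val):
        self.val = val
        self.next = None


def detect_cycle_start(head):
    """Return the node where the cycle begins, or None if there is no cycle."""
    tortoise = hare = head
    while hare is not None and hare.next is not None:
        tortoise = tortoise.next
        hare = hare.next.next
        if tortoise is hare:
            break
    else:
        return None  # the hare ran off the end: no cycle
    # Second phase: the block from the question.
    node1, node2 = head, hare
    while node1 is not node2:
        node1 = node1.next
        node2 = node2.next
    return node1


def build_list(n, cycle_start):
    """Hypothetical helper: nodes 0..n-1, last node links back to cycle_start."""
    nodes = [Node(i) for i in range(n)]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    if cycle_start is not None:
        nodes[-1].next = nodes[cycle_start]
    return nodes
```

Running it over every list length and every possible cycle start for small n is a quick sanity check that node2 never ends up permanently one step behind node1.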

Palindromes in a tree

I am looking at this challenge:
Given a tree with N nodes and N-1 edges. Each edge on the tree is labelled by a string of lowercase letters from the Latin alphabet. Given Q queries, consisting of two nodes u and v, check if it is possible to make a palindrome string which uses all the characters that belong to the string labelled on the edges in the path from node u to node v.
Characters can be used in any order.
N is of the order of 10^5 and Q is of the order of 10^6
Input:
N=3
u=1 v=3 weight=bc
u=1 v=2 weight=aba
Q=4
u=1 v=2
u=2 v=3
u=3 v=1
u=3 v=3
Output:
YES
YES
NO
NO
What I thought of was to compute the LCA of the two nodes in O(1) after precomputation, using a sparse table and range minimum query on the Euler tour, and then walk the path from the LCA to node u and from the LCA to node v and store the frequency of every character. If the sum of the frequencies of all characters is odd, we check that the frequency of each character except one is even. If the sum is even, we check that the frequency of each character is even. But this process will surely time out because Q can be up to 10^6.
Is there anyone with a better algorithm?
Preparation Step
Prepare your data structure as follows:
For each node get the path to the root, get all letters on the path, and only retain a letter when it occurs an odd number of times on that path. Finally encode that string with unique letters as a bit pattern, where bit 0 is set when there is an "a", bit 1 is set when there is a "b", ... bit 25 is set when there is a "z". Store this pattern with the node.
This preprocessing can be done with a depth-first recursive procedure, where the current node's pattern is passed down to the children, which can apply the edge's information to that pattern to create their own pattern. So this preprocessing can run in linear time in terms of the total number of characters in the tree, or more precisely O(N+S), where S represents that total number of characters.
Query Step
When a query is done perform the bitwise XOR on the two involved patterns. If the result is 0 or it has only one bit set, return "YES", else return "NO". So the query will not visit any other nodes than just the two ones that are given, look up the two patterns and perform their XOR and make the bit test. All this happens in constant time for one query.
The last query given in the question shows that the result should be "NO" when the two nodes are the same node. This is a boundary case, as it is debatable whether an empty string is a palindrome or not. The above XOR algorithm would return "YES", so you would need a specific test for this boundary case, and return "NO" instead.
Explanation
This works because if we look at the paths both nodes have to the root, they may share a part of their path. The characters on that common path should not be considered, and the XOR will make sure they aren't. Where the paths differ, we actually have the edges on the path from the one node to the other. There we see the characters that should contribute to a palindrome.
If a character appears an even number of times in those edges, it poses no problem for creating a palindrome. The XOR makes sure those characters "disappear".
If a character appears an odd number of times, all but one can mirror each other like in the even case. The remaining one can only be used in an odd-length palindrome, and only in the centre position of it. So there can only be one such character. This translates to the test that the XOR result is allowed to have 1 bit set (but not more).
Implementation
Here is an implementation in JavaScript. The example run uses the input as provided in the question. I did not bother to turn the query results from boolean to NO/YES:
function prepare(edges) {
    // edges: array of [u, v, weight] triplets
    // Build adjacency list from the list of edges
    let adjacency = {};
    for (let [u, v, weight] of edges) {
        // convert weight to pattern, as we don't really need to
        // store the strings
        let pattern = 0;
        for (let i = 0; i < weight.length; i++) {
            let ascii = weight.charCodeAt(i) - 97;
            pattern ^= 1 << ascii; // toggle bit that corresponds to letter
        }
        // Heuristic check only: it assumes every edge introduces a new node,
        // so it can misfire if edges arrive in an order that joins two
        // already-seen parts of the tree.
        if (v in adjacency && u in adjacency) throw "Cycle detected!";
        if (!(v in adjacency)) adjacency[v] = {};
        if (!(u in adjacency)) adjacency[u] = {};
        adjacency[u][v] = pattern;
        adjacency[v][u] = pattern;
    }
    // Prepare the consolidated path-pattern for each node
    let patterns = {}; // This is the information to return
    function dfs(u, parent, pathPattern) {
        patterns[u] = pathPattern;
        for (let v in adjacency[u]) {
            // recurse into the "children" (the parent is not revisited)
            if (v !== parent) dfs(v, u, adjacency[u][v] ^ pathPattern);
        }
    }
    // Start a DFS from an arbitrary node as root. Object keys are strings,
    // so the root is passed as a string to keep `v !== parent` consistent.
    dfs(String(edges[0][0]), null, 0);
    return patterns;
}
function query(nodePatterns, u, v) {
    if (u === v) return false; // Boundary case.
    let pattern = nodePatterns[u] ^ nodePatterns[v];
    // "smart" test to verify that at most 1 bit is set
    return pattern === (pattern & -pattern);
}
// Example:
let edges = [[1, 3, "bc"], [1, 2, "aba"]];
let queries = [[1, 2], [2, 3], [3, 1], [3, 3]];
let nodePatterns = prepare(edges);
for (let [u, v] of queries) {
    console.log(u, v, query(nodePatterns, u, v));
}
First of all, let's choose a root. Now imagine that each edge points to the node that is deeper in the tree. Instead of keeping strings on edges, put each string on the vertex its edge points to. Now only the root has no string. Then, for each vertex, calculate and store the count of each letter in its string.
From now on we'll handle each letter separately.
Using DFS, calculate for each node v the number of letters on the vertices on the path from v to the root. You'll also need LCA, so you may precompute RMQ or find the LCA in O(log n) if you like. Let Letters[v][c] be the number of occurrences of letter c on the path from v to the root. Then the number of occurrences of c on the path from u to v is Letters[v][c] + Letters[u][c] - 2 * Letters[LCA(v, u)][c]. You can get the count of a single letter in O(1) (or O(log n) if you're not using RMQ). So in 26 * O(1) you can check every single possible letter.
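The counting approach can be sketched in Python as follows. This is a rough illustration under assumptions: the function names are invented, the LCA is found by naively climbing parent pointers (a stand-in for the RMQ or O(log n) method mentioned above), and the u = v boundary case from the question is not treated specially.

```python
from collections import defaultdict


def preprocess(edges, root=1):
    """Compute parent, depth and root-path letter counts for every node."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    parent, depth, letters = {root: None}, {root: 0}, {root: [0] * 26}
    stack = [root]
    while stack:  # iterative DFS
        u = stack.pop()
        for v, w in adj[u]:
            if v in parent:
                continue
            parent[v], depth[v] = u, depth[u] + 1
            counts = letters[u][:]  # Letters[v] = Letters[parent] + edge string
            for ch in w:
                counts[ord(ch) - ord("a")] += 1
            letters[v] = counts
            stack.append(v)
    return parent, depth, letters


def lca(parent, depth, u, v):
    """Naive LCA by climbing; replace with RMQ for O(1) queries."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u


def can_form_palindrome(parent, depth, letters, u, v):
    a = lca(parent, depth, u, v)
    odd = sum((letters[u][c] + letters[v][c] - 2 * letters[a][c]) % 2
              for c in range(26))
    return odd <= 1  # at most one letter may occur an odd number of times


parent, depth, letters = preprocess([(1, 3, "bc"), (1, 2, "aba")])
answers = [can_form_palindrome(parent, depth, letters, u, v)
           for u, v in [(1, 2), (2, 3), (3, 1)]]
```

Copying a 26-entry list per node keeps the sketch short; it costs O(26 N) memory, which is the price of this approach compared with the single bit pattern per node in the XOR answer.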

Implementing Parallel Algorithm for Longest Common Subsequence

I am trying to implement the Parallel Algorithm for Longest Common Subsequence Problem described in http://www.iaeng.org/publication/WCE2010/WCE2010_pp499-504.pdf
But I am having a problem with the variable C in Equation 6 on page 4.
At the end of page 3 the paper introduces C as:
Let C[1 : l] be the finite alphabet
I am not sure what is meant by this. I guess that for the two strings ABCDEF and ABQXYEF it would be ABCDEFQXY. But what if my two strings are lists of objects (where my match test is, for example, obj1.Name = obj2.Name)? What would my C be then? Just a union of the two arrays?
Having read and studied the paper, I can say that C is supposed to be an array holding the alphabet of your strings, where the alphabet size (and, thus, the size of C) is l.
By the looks of your question, however, I feel the need to go deeper on this, because it looks like you didn't get the whole picture yet. What is P[i,j], and why do you need it? The answer is that you don't really need it, but it's an elegant optimization. In page 3, a little bit before Theorem 1, it is said that:
[...] This process ends when j-k = 0 at the k-th step, or a(i) = b(j-k) at the k-th step. Assume that the process stops at the k-th step, and k must be the minimum number that makes a(i) = b(j-k) or j-k = 0. [...]
The recurrence relation in (3) is equivalent to (2), but the fundamental difference is that (2) expands recursively, whereas with (3) you never have recursive calls, provided that you know k. In other words, the magic behind (3) not expanding recursively is that you somehow know the spot where the recursion on (2) would stop, so you look at that cell immediately, rather than recursively approaching it.
Ok then, but how do you find out the value for k? Since k is the spot where (2) reaches a base case, it can be seen that k is the amount of columns that you have to "go back" on B until you are either off the limits (i.e., the first column that is filled with 0's) OR you find a match between a character in B and a character in A (which corresponds to the base case conditions in (2)). Remember that you will be matching the character a(i-1), where i is the current row.
So, what you really want is to find the last position in B before j where the character a(i-1) appears. If no such character ever appears in B before j, then that would be equivalent to reaching the case i = 0 or j-1 = 0 in (2); otherwise, it's the same as reaching a(i) = b(j-1) in (2).
Let's look at an example:
Consider that the algorithm is working on computing the values for i = 2 and j = 3 (the row and column are highlighted in gray). Imagine that the algorithm is working on the cell highlighted in black and is applying (2) to determine the value of S[2,2] (the position to the left of the black one). By applying (2), it would then start by looking at a(2) and b(2). a(2) is C, b(2) is G, so there's no match (this is the same procedure as the original, well-known algorithm). The algorithm now wants to find the value of S[2,2], because it is needed to compute S[2,3] (where we are). S[2,2] is not known yet, but the paper shows that it is possible to determine that value without referring to the row with i = 2. In (2), the 3rd case is chosen: S[2,2] = max(S[1, 2], S[2, 1]). Notice, if you will, that all this formula is doing is looking at the positions that would have been used to calculate S[2,2]. So, to rephrase that: we're computing S[2,3], we need S[2,2] for that, we don't know it yet, so we're going back on the table to see what the value of S[2,2] is, in pretty much the same way we did in the original, non-parallel algorithm.
When will this stop? In this example, it will stop when we find the letter C (this is our a(i)) in TGTTCGACA before the second T (the letter on the current column) OR when we reach column 0. Because there is no C before T, we reach column 0. Another example:
Here, (2) would stop with j-1 = 5, because that is the last position in TGTTCGACA where C shows up. Thus, the recursion reaches the base case a(i) = b(j-1) when j-1 = 5.
With this in mind, we can see a shortcut here: if you could somehow know the amount k such that j-1-k is a base case in (2), then you wouldn't have to go through the score table to find the base case.
That's the whole idea behind P[i,j]. P is a table where you lay down the whole alphabet vertically (on the left side); the string B is, once again, placed horizontally in the upper side. This table is computed as part of a preprocessing step, and it will tell you exactly what you will need to know ahead of time: for each position j in B, it says, for each character C[i] in C (the alphabet), what is the last position in B before j where C[i] is found (note that i is used to index C, the alphabet, and not the string A. Maybe the authors should have used another index variable to avoid confusion).
So, you can think of the semantics for an entry P[i,j] as something along the lines of: The last position in B where I saw C[i] before position j. For example, if your alphabet is sigma = {A, E, I, O, U}, and B = "AOOIUEI", then P is:
Take the time to understand this table. Note the row for O. Remember: this row lists, for every position in B, where is the last known "O". Only when j = 3 will we have a value that is not zero (it's 2), because that's the position after the first O in AOOIUEI. This entry says that the last position in B where O was seen before is position 2 (and, indeed, B[2] is an O, the one that follows A). Notice, in that same row, that for j = 4, we have the value 3, because now the last position for O is the one that corresponds to the second O in B (and since no more O's exist, the rest of the row will be 3).
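If it helps, the table for this example can be regenerated with a few lines of Python. This is a sketch based on the semantics described here (last 1-indexed position strictly before j, or 0 if there is none), not a transcription of the paper's equation; the function name and the dictionary-of-rows layout are my own choices.

```python
def build_last_seen_table(alphabet, B):
    """P[ch][j] = last 1-indexed position in B strictly before j holding ch."""
    n = len(B)
    P = {}
    for ch in alphabet:
        row = [0] * (n + 1)  # columns j = 0..n; 0 means "never seen"
        for j in range(1, n + 1):
            # Position j-1 (1-indexed) is stored at B[j-2] in 0-indexed Python.
            if j >= 2 and B[j - 2] == ch:
                row[j] = j - 1
            else:
                row[j] = row[j - 1]
        P[ch] = row
    return P


P = build_last_seen_table("AEIOU", "AOOIUEI")
for ch, row in P.items():
    print(ch, row)
```

Each row is filled in a single left-to-right pass, so the whole table costs O(l * |B|) for an alphabet of size l.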
Recall that building P is a preprocessing step necessary if you want to easily find the value of k that makes the recursion from equation (2) stop. It should make sense by now that P[i,j] is the k you're looking for in (3). With P, you can determine that value in O(1) time.
Thus, the C[i] in (6) is a letter of the alphabet - the letter that we are currently considering. In the example above, C = [A,E,I,O,U], and C[1] = A, C[2] = E, etc. In equation (7), c is the position in C where a(i) (the current letter of string A being considered) lives. It makes sense: after all, when building the score table position S[i,j], we want to use P to find the value of k - we want to know where was the last time we saw an a(i) in B before j. We do that by reading P[index_of(a(i)), j].
Ok, now that you understand the use of P, let's see what's happening with your implementation.
About your specific case
In the paper, P is shown as a table that lists the whole alphabet. It is a good idea to iterate through the alphabet because the typical uses of this algorithm are in bioinformatics, where the alphabet is much, much smaller than the string A, making the iteration through the alphabet cheaper.
Because your strings are sequences of objects, your C would be the set of all possible objects, so you'd have to build a table P with the set of all possible object instances (nonsense, of course). This is definitely a case where the alphabet size is huge when compared to your string size. However, note that you will only be indexing P in those rows that correspond to letters from A: any row in P for a letter C[i] that is not in A is useless and will never be used. This makes your life easier, because it means you can build P with the string A instead of using the alphabet of every possible object.
Again, an example: if your alphabet is AEIOU, A is EEI and B is AOOIUEI, you will only be indexing P in the rows for E and I, so that's all you need in P:
This works and suffices, because in (7), P[c,j] is the entry in P for the character c, and c is the index of a(i). In other words: C[c] always belongs to A, so it makes perfect sense to build P for the characters of A instead of using the whole alphabet for the cases where the size of A is considerably smaller than the size of C.
All you have to do now is to apply the same principle to whatever your objects are.
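As a rough illustration of that principle, the sketch below builds rows only for the distinct symbols of A and compares objects through a key function. The Item class, the key function and the function name are all hypothetical; the row semantics are the same "last position strictly before j" used above.

```python
from dataclasses import dataclass


@dataclass
class Item:  # hypothetical object type; equality is decided by `name`
    name: str


def build_table_for_rows_of_a(A, B, key):
    """One row per distinct key in A; row[j] = last position before j, else 0."""
    rows = {}
    for a in A:
        k = key(a)
        if k in rows:
            continue  # this symbol already has a row
        row = [0] * (len(B) + 1)
        for j in range(2, len(B) + 1):
            row[j] = j - 1 if key(B[j - 2]) == k else row[j - 1]
        rows[k] = row
    return rows


A = [Item(c) for c in "EEI"]
B = [Item(c) for c in "AOOIUEI"]
rows = build_table_for_rows_of_a(A, B, key=lambda obj: obj.name)
```

Looking up the row for the current a(i) is then `rows[key(a_i)]`, which plays the role of P[c, j] in (7) without ever enumerating the full alphabet.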
I really don't know how to explain it any better. This may be a little dense at first. Make sure to re-read it until you really get it - and I mean every little detail. You have to master this before thinking about implementing it.
NOTE: You said you were looking for a credible and / or official source. I'm just another CS student, so I'm not an official source, but I think I can be considered "credible". I've studied this before and I know the subject. Happy coding!

Approximate substring matching using a Suffix Tree

This article discusses approximate substring matching techniques that utilize a suffix tree to improve matching time. Each answer addresses a different algorithm.
Approximate substring matching attempts to find a substring (pattern) P in a string T allowing up to k mismatches.
To learn how to create a suffix tree, click here. However, some algorithms require additional preprocessing.
I invite people to add new algorithms (even if it's incomplete) and improve answers.
This was the original question that started this thread.
Professor Esko Ukkonen published a paper: Approximate string-matching over suffix trees. He discusses 3 different algorithms that have different matching times:
Algorithm A: O(mq + n)
Algorithm B: O(mq log(q) + size of the output)
Algorithm C: O(m^2q + size of the output)
Where m is the length of the substring, n is the length of the search string, and q is the edit distance.
I've been trying to understand algorithm B but I'm having trouble following the steps. Does anyone have experience with this algorithm? An example or pseudo algorithm would be greatly appreciated.
In particular:
What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
Here's what I believe (I stand to be corrected):
On page seven, we're introduced to suffix tree concepts; a state is effectively a node in the suffix tree: let root denote the initial state.
g(a, c) = b where a and b are nodes in the tree and c is a character or substring in the tree. So this represents a transition; from a, following the edges represented by c, we move to node b. This is referred to as the go-to transition. So for the suffix tree below, g(root, 'ccb') = red node
Key(a) = edge sequence where a represents a node in the tree. For example, Key(red node) = 'ccb'. So g(root, Key(red node)) = red node.
Keys(Subset of node S) = { Key(node) | node ∈ S}
There is a suffix function for nodes a and b, f(a) = b: for all (or perhaps there may exist) a ≠ root, there exists a character c, a substring x, and a node b such that g(root, cx) = a and g(root, x) = b. I think that this means, for the suffix tree example above, that f(pink node) = green node where c = 'a' and x = 'bccb'.
There is a mapping H that contains a node from the suffix tree and a value pair. The value is given by loc(w); I'm still uncertain how to evaluate the function. This dictionary contains nodes that have not been eliminated.
extract-min(H) refers to attaining the entry with the smallest value in the pair (node, loc(w)) from H.
The crux of the algorithm seems to be related to how loc(w) is evaluated. I've constructed my suffix tree using the combined answer here; however, the algorithms work on a suffix trie (uncompressed suffix tree). Therefore concepts like the depth need to be maintained and processed differently. In the suffix trie the depth would represent the suffix length; in a suffix tree, the depth would simply represent the node depth in the tree.
You are doing well. I don't have familiarity with the algorithm, but have read the paper today. Everything you wrote is correct as far as it goes. You are right that some parts of the explanation assume a lot.
Your Questions
1. What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
The output consists of the maximal k-distance matches of P in T. In particular you'll get the final index and length for each. So clearly this is also O(n) (remember big-O is an upper bound), but may be smaller. This is a nod to the fact that it's impossible to generate p matches in less than O(p) time. The rest of the time bound concerns only the pattern length and the number of viable prefixes, both of which can be arbitrarily small, so the output size can dominate. Consider k=0 and the input is 'a' repeated n times with the pattern 'a'.
2. Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
You're right. It's an error. The loop index should be i. What about j? This is the index of the column corresponding to the input character being processed in the dynamic program. It should really be an input parameter.
Let's take a step back. The Example table on page 6 is computed left-to-right, column-by-column using equations (1-4) given earlier. These show that only the previous columns of D and L are needed to get the next. Function dp is just an implementation of this idea of computing column j from j-1. Column j of D and L are called d and l respectively. Column j-1 D and L are d' and l', the function input parameters.
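To get a feel for the column-by-column idea, here is a small sketch that computes only the edit-distance column d from the previous column d'. It uses the standard approximate-matching recurrence (row 0 is always 0, so a match may start anywhere), which I am assuming corresponds to the D table; the companion length column l from the paper is left out, and the function names are made up.

```python
def next_column(prev, pattern, ch):
    """Compute column j of D from column j-1 for the text character ch."""
    d = [0] * (len(pattern) + 1)  # d[0] = 0: a match may start at any position
    for i in range(1, len(pattern) + 1):
        d[i] = min(
            prev[i - 1] + (pattern[i - 1] != ch),  # match or substitution
            prev[i] + 1,                           # extra text character
            d[i - 1] + 1,                          # missing pattern character
        )
    return d


def match_ends(pattern, text, k):
    """1-indexed text positions where a match with at most k edits ends."""
    col = list(range(len(pattern) + 1))  # column 0: D(i, 0) = i
    ends = []
    for j, ch in enumerate(text, start=1):
        col = next_column(col, pattern, ch)
        if col[-1] <= k:
            ends.append(j)
    return ends


exact = match_ends("abc", "xabcx", 0)
one_edit = match_ends("abc", "xabcx", 1)
```

Only the previous column is ever kept, which is the property the algorithms exploit when they cache columns on trie nodes.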
I recommend you work through the dynamic program until you understand it well. The algorithm is all about avoiding duplicate column computations. Here "duplicate" means "having the same values in the essential part", because that's all that matters. The inessential parts can't affect the answer.
The uncompressed trie is just the compressed one expanded in the obvious way to have one edge per character. Except for the idea of "depth", this is unimportant. In the compressed tree, depth(s) is just the length of the string - which he calls Key(s) - needed to get from the root to node s.
Algorithm A
Algorithm A is just a clever caching scheme.
All his theorems and lemmas show that 1) we only need to worry about the essential parts of columns and 2) the essential part of a column j is completely determined by the viable prefix Q_j. This is the longest suffix of the input ending at j that matches a prefix of the pattern (within edit distance k). In other words, Q_j is the maximal start of a k-edit match at the end of the input considered so far.
With this, here's pseudo-code for Algorithm A.
Let r = root of (uncompressed) suffix trie
Set r's cached d,l with formulas at end page 7 (0'th dp table columns)
// Invariant: r contains cached d,l
for each character t_j from input text T in sequence
    Let s = g(r, t_j) // make the go-to transition from r on t_j
    if visited(s)
        r = s
        while no cached d,l on node r
            r = f(r) // traverse suffix edge
        end while
    else
        Use cached d',l' on r to find new columns (d,l) = dp(d',l')
        Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
        r = s
        while depth(r) != |Q_j|
            mark r visited
            r = f(r) // traverse suffix edge
        end while
        mark r visited
        set cached d,l on node r
    end if
end for
I've left out the output step for simplicity.
What is traversing suffix edges about? When we do this from a node r where Key(r) = aX (leading a followed by some string X), we are going to the node with Key X. The consequence: we are storing each column corresponding to a viable prefix Q_h at the trie node for the suffix of the input with prefix Q_h. The function f(s) = r is the suffix transition function.
For what it's worth, the Wikipedia picture of a suffix tree shows this pretty well. For example, if from the node for "NA" we follow the suffix edge, we get to the node for "A" and from there to "". We are always cutting off the leading character. So if we label state s with Key(s), we have f("NA") = "A" and f("A") = "". (I don't know why he doesn't label states like this in the paper. It would simplify many explanations.)
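Labelling each state with its Key makes both functions almost trivial to write down. The toy sketch below represents the uncompressed suffix trie of a text simply as the set of its substrings (quadratic in size, so for illustration only); the function names mirror g and f, but the representation is my own assumption, not the paper's.

```python
def build_suffix_trie(text):
    """States of the uncompressed suffix trie, each identified by its Key."""
    states = {text[i:j] for i in range(len(text))
              for j in range(i, len(text) + 1)}
    states.add("")  # the root
    return states


def g(states, key, c):
    """Go-to transition: extend Key by one character, or None if absent."""
    nxt = key + c
    return nxt if nxt in states else None


def f(key):
    """Suffix transition: cut off the leading character."""
    if key == "":
        raise ValueError("the root has no suffix transition")
    return key[1:]


states = build_suffix_trie("BANANA")
chain = ["NA"]
while chain[-1] != "":
    chain.append(f(chain[-1]))  # "NA" -> "A" -> ""
```

Every f(key) is again a state because any suffix of a substring is itself a substring of the text.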
Now this is very cool because we are computing only one column per viable prefix. But it's still expensive because we are inspecting each character and potentially traversing suffix edges for each one.
Algorithm B
Algorithm B's intent is to go faster by skipping through the input, touching only those characters likely to produce a new column, i.e. those that are the ends of input that match a previously unseen viable prefix of the pattern.
As you'd suspect, the key to the algorithm is the loc function. Roughly speaking, this will tell where the next "likely" input character is. The algorithm is quite a bit like A* search. We maintain a min heap (which must have a delete operation) corresponding to the set S_i in the paper. (He calls it a dictionary, but this is not a very conventional use of the term.) The min heap contains potential "next states" keyed on the position of the next "likely character" as described above. Processing one character produces new entries. We keep going until the heap is empty.
You're absolutely right that here he gets sketchy. The theorems and lemmas are not tied together to make an argument on correctness. He assumes you will redo his work. I'm not entirely convinced by this hand-waving. But there does seem to be enough there to "decode" the algorithm he has in mind.
Another core concept is the set S_i and in particular the subset that remains not eliminated. We'll keep these un-eliminated states in the min-heap H.
You're right to say that the notation obscures the intent of S_i. As we process the input left-to-right and reach position i, we have amassed a set of viable prefixes seen so far. Each time a new one is found, a fresh dp column is computed. In the author's notation these prefixes would be Q_h for all h<=i or more formally { Q_h | h <= i }. Each of these has a path from the root to a unique node. The set S_i consists of all the states we get by taking one more step from all these nodes along go-to edges in the trie. This produces the same result as going through the whole text looking for each occurrence of Q_h and the next character a, then adding the state corresponding to Q_h a into S_i, but it's faster. The Keys for the S_i states are exactly the right candidates for the next viable prefix Q_{i+1}.
How do we choose the right candidate? Pick the one that occurs next after position i in the input. This is where loc(s) comes in. The loc value for a state s is exactly that: the position in the input, searching forward from i, where the viable prefix associated with that state next occurs.
The important point is that we don't want to just assign the newly found "next" viable prefix (obtained by pulling the min loc value from H) as Q_{i+1} (the viable prefix for dp column i+1) and go on to the next character (i+2). Instead, this is where we must set the stage to skip ahead as far as possible, to the last character k (with dp column k) such that Q_k = Q_{i+1}. We skip ahead by following suffix edges as in Algorithm A. Only this time we record our steps for future use by altering H: removing elements, which is the same as eliminating elements from S_i, and modifying loc values.
The definition of function loc(s) is bare, and he never says how to compute it. Also unmentioned is that loc(s) is a function of i, the current input position being processed, as well as of s. (It is unhelpful that he switches from j in earlier parts of the paper to i here for the current input position.) The impact is that loc(s) changes as input processing proceeds.
It turns out that the part of the definition that applies to eliminated states "just happens" because states are marked eliminated upon removal from H. So for this case we need only check for a mark.
The other case - un-eliminated states - requires that we search forward in the input looking for the next occurrence in the text that is not covered by some other string. This notion of covering is to ensure we are always dealing with only "longest possible" viable prefixes. Shorter ones must be ignored to avoid outputting other than maximal matches. Now, searching forward sounds expensive, but happily we have a suffix trie already constructed, which allows us to do it in O(|Key(s)|) time. The trie will have to be carefully annotated to return the relevant input position and to avoid covered occurrences of Key(s), but it wouldn't be too hard. He never mentions what to do when the search fails. In that case loc(s) = infinity, i.e. s is effectively eliminated and should be deleted from H.
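To pin down what loc is supposed to return, here is a deliberately naive Python sketch. It is only a reference for the semantics as I read them, with several assumptions of mine: the function name `naive_loc` is hypothetical, it does a linear scan with `str.find` instead of using the annotated trie, it reports the index of the last character of the occurrence (the paper may count positions differently), it expects a non-empty key, and it ignores the covering condition entirely.

```python
import math


def naive_loc(text, key, i):
    """End index of the first occurrence of key that starts at or after i.

    Returns math.inf when there is no such occurrence, which corresponds to
    the "search fails" case. Covering is NOT taken into account here.
    """
    start = text.find(key, i)
    if start == -1:
        return math.inf
    return start + len(key) - 1


if __name__ == "__main__":
    text = "abcabc"
    for i in (0, 2, 5):
        print(i, naive_loc(text, "bc", i))
```

The point of the sketch is the dependence on i: the same key gives a different loc value as i advances, and eventually the value becomes infinite.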
Perhaps the hairiest part of the algorithm is updating H to deal with cases where we add a new state s for a viable prefix that covers Key(w) for some w that was already in H. This means we have to surgically update the (loc(w) => w) element in H. It turns out the suffix trie yet again supports this efficiently with its suffix edges.
With all this in our heads, let's try for pseudocode.
H = { (0 => root) } // we use (loc => state) for min heap elements
until H is empty
    (j => s_j) = H.delete_min // remove the min loc mapping from H
    (d, l) = dp(d', l', j) where (d',l') are cached at parent(s_j)
    Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
    r = s_j
    while depth(r) > |Q_j|
        mark r eliminated
        H.delete (_ => r) // loc value doesn't matter
        r = f(r) // suffix transition, as in Algorithm A
    end while
    set cached d,l on node r
    // Add all the "next states" reachable from r by go-tos
    for all s = g(r, a) for some character a
        unless s.eliminated?
            H.insert (loc(s) => s) // here is where we use the trie to find loc
            // Update H elements that might be newly covered
            w = f(s) // suffix transition
            while w != null
                unless w.eliminated?
                    H.increase_key(loc(w) => w) // using explanation in Lemma 9.
                end unless
                w = f(w) // suffix transition
            end while
        end unless
    end for
end until
Again I've omitted the output for simplicity. I will not say this is correct, but it's in the ballpark. One thing is that he mentions we should only process Q_j for nodes not previously "visited," but I don't understand what "visited" means in this context. I think visited states by Algorithm A's definition won't occur because they've been removed from H. It's a puzzle...
The increase_key operation in Lemma 9 is hastily described with no proof. His claim that the min operation is possible in O(log |alphabet|) time is leaving a lot to the imagination.
The number of quirks leads me to wonder if this is not the final draft of the paper. It is also a Springer publication, and this copy on-line would probably violate copyright restrictions if it were precisely the same. It might be worth looking in a library or paying for the final version to see if some of the rough edges were knocked off during final review.
This is as far as I can get. If you have specific questions, I'll try to clarify.

Resources