In looking through the dynamic programming algorithm for computing the minimum edit distance between two strings, I am having a hard time grasping one thing. To me it seems like, given two strings s and t, inserting a character into s would be the same as deleting a character from t. Why then do we need to consider these operations separately when computing the edit distance? I always have a hard time computing the indices in the recurrence relation because I can't intuitively understand this part.
I've read through Skiena and some other sources, but none of them explain this part well. This SO link explains the insert and delete operations better than elsewhere in terms of understanding which string is being inserted into or deleted from, but I still can't figure out why they aren't one and the same.
Edit: Ok, I didn't do a very good job of detailing the source of my confusion.
The way Skiena explains computing the minimum edit distance m(i,j) of the first i characters of a string s and the first j characters of a string t based on already having computed solutions to the subproblems is as follows. m(i,j) will be the minimum of the following 3 possibilities:
opt[MATCH] = m[i-1][j-1].cost + match(s[i],t[j]);
opt[INSERT] = m[i][j-1].cost + indel(t[j]);
opt[DELETE] = m[i-1][j].cost + indel(s[i]);
The way I understand it the 3 operations are all operations on the string s. An INSERT means you have to insert a character at the end of string s to get the minimum edit distance. A DELETE means you have to delete the character at the end of string s to get the minimum edit distance.
Given s = "SU" and t = "SATU" INSERT and DELETE would be as follows:
Insert:
SU_
SATU
Delete:
SU
SATU_
My confusion was that an INSERT into s is the same as a DELETION from t. I'm probably confused on something basic but it's not intuitive to me yet.
Edit 2: I think this link kind of clarifies my confusion but I'd love an explanation given my specific questions above.
They aren't the same thing any more than < and > are the same thing. There is of course a sort of duality, and you are correct to point it out. a < b if and only if b > a, so if you have a good algorithm to test for b > a then it makes sense to use it when you need to test whether a < b.
It is much easier to directly test if s can be obtained from t by deletion rather than to directly test if t can be obtained from s by insertion. It would be silly to randomly insert letters to s and see if you get t. I can't imagine that any implementation of edit-distance actually does that. Still, it doesn't mean that you can't distinguish between insertion and deletion.
More abstractly: there is a relation R on any set of strings, defined by
s R t <=> t can be obtained from s by insertion
Deletion is the inverse relation. Closely related, but not the same.
The problem of edit distance can be restated as a problem of converting the source string into target string with minimum number of operations (including insertion, deletion and replacement of a single character).
Thus, in the process of converting a source string into a target string, if inserting a character from target string or deleting a character from the source string or replacing a character in the source string with a character from the target string yields the same (minimum) edit distance, then, well, all the operations can be said to be equivalent. In other words, it does not matter how you arrive at the target string as long as you have done minimum number of edits.
This is realized by looking at how the cost matrix is calculated. Consider a simpler problem where source = AT (represented vertically) and target = TA (represented horizontally). The matrix is then constructed as (coming from west, northwest, north in that order):
|   | ε | T                | A                |
|---|---|------------------|------------------|
| ε | 0 | 1                | 2                |
| A | 1 | min(2, 1, 2) = 1 | min(2, 1, 3) = 1 |
| T | 2 | min(3, 1, 2) = 1 | min(2, 2, 2) = 2 |
The idea of filling this matrix is:
If we moved east, we insert the current target string character.
If we moved south, we delete the current source string character.
If we moved southeast, we replace the current source character with current target character.
If all or any two of these impart the same cost in terms of editing, then they can be said to be equivalent and you can break the ties arbitrarily.
One of the first experiences with this comes when we compute c(2, 2) in the cost matrix (the first row, c(0, 0) through c(0, 2) -- the minimum costs of converting an empty string to "", "T", "TA" -- and the first column, c(0, 0) through c(2, 0) -- the costs of converting "", "A", "AT" to an empty string -- are clear).
The value of c(2, 2) can be realized by:
inserting the current character in target, 'A' (we move east from c(2,1)) -- cost is 1 + 1 = 2, or
replacing the current character 'T' in source by the current character in target 'A' (we move southeast from c(1,1)) -- cost is 1 + 1 = 2, or
deleting the current character in source, 'T' (we move south from c(1,2)) -- cost is 1 + 1 = 2
Since all values are the same, which one are you going to choose?
If you choose to move from west, your alignment could be:
A T -
- T A
(one deletion, one 0-cost replacement, one insertion)
If you choose to move from north, your alignment could be:
- A T
T A -
(one insertion, one 0-cost replacement, one deletion)
If you choose to move from northwest, your alignment could be:
A T
T A
(Two 1-cost replacements).
All these edit graphs are equivalent in terms of the resulting edit distance (under the given cost function).
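To make this concrete, here is a minimal Python sketch (mine, not part of the original answer) that fills the matrix with the west/northwest/north rule described above:

def edit_distance(source, target):
    n, m = len(source), len(target)
    # cost[i][j] = minimum edits to turn source[:i] into target[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i                          # i deletions
    for j in range(m + 1):
        cost[0][j] = j                          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            cost[i][j] = min(cost[i][j - 1] + 1,        # insert (move east)
                             cost[i - 1][j - 1] + sub,  # replace/match (move southeast)
                             cost[i - 1][j] + 1)        # delete (move south)
    return cost[n][m]

print(edit_distance("AT", "TA"))  # 2, the bottom-right cell of the matrix above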
Edit distance is only interested in the minimum number of operations required to transform one sequence into another; it is not interested in the uniqueness of the transformation. In practice, there are often multiple ways to transform one string into another, all with the minimum number of operations.
Related
According to wikipedia, the definition of the recursive formula which calculates the Levenshtein distance between two strings a and b is the following:
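Roughly, it is the standard unit-cost recurrence:

    lev(i, j) = max(i, j)                               if min(i, j) = 0
    lev(i, j) = min( lev(i-1, j) + 1,                   (deletion)
                     lev(i, j-1) + 1,                   (insertion)
                     lev(i-1, j-1) + [a_i != b_j] )     otherwise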
I don't understand why we don't take into consideration the cases in which we delete a[j], or we insert b[i]. Also, correct me if I am wrong, isn't the case of insertion the same as the case of the deletion? I mean, instead of deleting a character from one string, we could insert the same character in the second string, and the opposite. So why not merge the insert/delete operations into one operation with cost equal to min{cost_insert, cost_delete}?
This is not done, because you are not allowed to edit both strings. The definition of the edit distance (from wikipedia) is:
the minimum-weight series of edit operations that transforms a into b.
So you are specifically looking for (the weight of) a sequence of operations to execute on the string a to transform it into string b.
Also, the edit distance is not necessarily symmetric; only if your costs for insertions and deletions are identical is the distance symmetric, d(a,b) = d(b,a).
Consider the Wikipedia example but with different costs:
costs for insertions: w_ins = 1
costs for deletions: w_del = 2
costs for substitutions: w_sub = 1
The distance of kitten and sitting still is 3,
kitten -> sitten (substitution k->s, cost 1)
sitten -> sittin (substitution e->i, cost 1)
sittin -> sitting (insertion of g, cost 1)
=> d(kitten, sitting) = 3
but the distance of sitting and kitten is not:
sitting -> kitting (substitution s->k, cost 1)
kitting -> kitteng (substitution i->e, cost 1)
kitteng -> kitten (deletion of g, cost 2)
=> d(sitting, kitten) = 4
You see that d(kitten, sitting) != d(sitting, kitten).
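For concreteness, here is a small Python sketch (mine, not part of the original answer) that reproduces this asymmetric example with w_ins = 1, w_del = 2, w_sub = 1:

def weighted_edit_distance(a, b, w_ins=1, w_del=2, w_sub=1):
    # d[i][j] = cheapest way to turn a[:i] into b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * w_del
    for j in range(1, len(b) + 1):
        d[0][j] = j * w_ins
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,      # delete a[i-1]
                          d[i][j - 1] + w_ins,      # insert b[j-1]
                          d[i - 1][j - 1] + sub)    # substitute
    return d[-1][-1]

print(weighted_edit_distance("kitten", "sitting"))  # 3
print(weighted_edit_distance("sitting", "kitten"))  # 4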
On the other hand, if you do use symmetric costs, as the Levenshtein distance (which is an edit distance with unit costs) does, you can assume that d(a,b) = d(b,a) holds. Then you do not win anything by also considering the inverse cases. What you lose is the information about which character has been replaced in which string, which makes it harder to extract the sequence of operations afterwards.
The Wagner-Fischer algorithm which you are showing in your question can extract this from the DP matrix by backtracking the path with minimal costs. Consider these two edit matrices between to and foo with unit costs:
   t o         f o o
f  1 2      t  1 2 3
o  2 1      o  2 1 2
o  3 2
Note that if you transpose the matrix for d(to, foo) you get the matrix for d(foo, to). Note that by this, an insertion in the first matrix becomes a deletion in the second matrix and vice versa. So this is where this symmetry you are looking for is coming up again.
I hope this helps :)
If the costs of insertions and deletions differ, inserting in one string isn't the same as deleting from the other. Even the cost of a substitution could differ from the cost of insertion+deletion and it is wise to keep them separate.
The problem is usually asymmetric: you have a list of valid strings, and you want to match to another that has errors in it.
I am trying to implement the Parallel Algorithm for Longest Common Subsequence Problem described in http://www.iaeng.org/publication/WCE2010/WCE2010_pp499-504.pdf
But I am having a problem with the variable C in Equation 6 on page 4.
The paper refers to C at the end of page 3 as:
Let C[1 : l] be the finite alphabet
I am not sure what is meant by this. I guess that with the two strings ABCDEF and ABQXYEF it would be ABCDEFQXY. But what if my two strings are lists of objects (where my match test is, for example, obj1.Name = obj2.Name)? What would my C be here? Just a union of the two arrays?
Having read and studied the paper, I can say that C is supposed to be an array holding the alphabet of your strings, where the alphabet size (and, thus, the size of C) is l.
By the looks of your question, however, I feel the need to go deeper on this, because it looks like you didn't get the whole picture yet. What is P[i,j], and why do you need it? The answer is that you don't really need it, but it's an elegant optimization. In page 3, a little bit before Theorem 1, it is said that:
[...] This process ends when j-k = 0 at the k-th step, or a(i) = b(j-k) at the k-th step. Assume that the process stops at the k-th step, and k must be the minimum number that makes a(i) = b(j-k) or j-k = 0. [...]
The recurrence relation in (3) is equivalent to (2), but the fundamental difference is that (2) expands recursively, whereas with (3) you never have recursive calls, provided that you know k. In other words, the magic behind (3) not expanding recursively is that you somehow know the spot where the recursion on (2) would stop, so you look at that cell immediately, rather than recursively approaching it.
Ok then, but how do you find out the value for k? Since k is the spot where (2) reaches a base case, it can be seen that k is the amount of columns that you have to "go back" on B until you are either off the limits (i.e., the first column that is filled with 0's) OR you find a match between a character in B and a character in A (which corresponds to the base case conditions in (2)). Remember that you will be matching the character a(i-1), where i is the current row.
So, what you really want is to find the last position in B before j where the character a(i-1) appears. If no such character ever appears in B before j, then that would be equivalent to reaching the case i = 0 or j-1 = 0 in (2); otherwise, it's the same as reaching a(i) = b(j-1) in (2).
Let's look at an example:
Consider that the algorithm is working on computing the values for i = 2 and j = 3 (the row and column are highlighted in gray). Imagine that the algorithm is working on the cell highlighted in black and is applying (2) to determine the value of S[2,2] (the position to the left of the black one). By applying (2), it would then start by looking at a(2) and b(2). a(2) is C, b(2) is G, so there's no match (this is the same procedure as the original, well-known algorithm). The algorithm now wants to find the value of S[2,2], because it is needed to compute S[2,3] (where we are). S[2,2] is not known yet, but the paper shows that it is possible to determine that value without referring to the row with i = 2. In (2), the 3rd case is chosen: S[2,2] = max(S[1, 2], S[2, 1]). Notice, if you will, that all this formula is doing is looking at the positions that would have been used to calculate S[2,2]. So, to rephrase that: we're computing S[2,3], we need S[2,2] for that, we don't know it yet, so we're going back on the table to see what's the value of S[2,2] in pretty much the same way we did in the original, non-parallel algorithm.
When will this stop? In this example, it will stop when we find the letter C (this is our a(i)) in TGTTCGACA before the second T (the letter on the current column) OR when we reach column 0. Because there is no C before T, we reach column 0. Another example:
Here, (2) would stop with j-1 = 5, because that is the last position in TGTTCGACA where C shows up. Thus, the recursion reaches the base case a(i) = b(j-1) when j-1 = 5.
With this in mind, we can see a shortcut here: if you could somehow know the amount k such that j-1-k is a base case in (2), then you wouldn't have to go through the score table to find the base case.
That's the whole idea behind P[i,j]. P is a table where you lay down the whole alphabet vertically (on the left side); the string B is, once again, placed horizontally in the upper side. This table is computed as part of a preprocessing step, and it will tell you exactly what you will need to know ahead of time: for each position j in B, it says, for each character C[i] in C (the alphabet), what is the last position in B before j where C[i] is found (note that i is used to index C, the alphabet, and not the string A. Maybe the authors should have used another index variable to avoid confusion).
So, you can think of the semantics for an entry P[i,j] as something along the lines of: The last position in B where I saw C[i] before position j. For example, if your alphabet is sigma = {A, E, I, O, U}, and B = "AOOIUEI", then P is:
Take the time to understand this table. Note the row for O. Remember: this row lists, for every position in B, where is the last known "O". Only when j = 3 will we have a value that is not zero (it's 2), because that's the position after the first O in AOOIUEI. This entry says that the last position in B where O was seen before is position 2 (and, indeed, B[2] is an O, the one that follows A). Notice, in that same row, that for j = 4, we have the value 3, because now the last position for O is the one that corresponds to the second O in B (and since no more O's exist, the rest of the row will be 3).
Recall that building P is a preprocessing step necessary if you want to easily find the value of k that makes the recursion from equation (2) stop. It should make sense by now that P[i,j] is the k you're looking for in (3). With P, you can determine that value in O(1) time.
Thus, the C[i] in (6) is a letter of the alphabet - the letter that we are currently considering. In the example above, C = [A,E,I,O,U], and C[1] = A, C[2] = E, etc. In equation (7), c is the position in C where a(i) (the current letter of string A being considered) lives. It makes sense: after all, when building the score table position S[i,j], we want to use P to find the value of k - we want to know where was the last time we saw an a(i) in B before j. We do that by reading P[index_of(a(i)), j].
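As a rough sketch (mine, not from the paper), here is how such a P table could be built for the example above, with C = "AEIOU" and B = "AOOIUEI" (positions in B are 1-based, as in the paper; rows of P are 0-based here):

def build_P(C, B):
    # P[i][j] = last position in B (1-based) before position j where C[i]
    # occurs, or 0 if C[i] does not occur before j.
    l, n = len(C), len(B)
    P = [[0] * (n + 1) for _ in range(l)]
    for i, ch in enumerate(C):
        last = 0
        for j in range(1, n + 1):
            P[i][j] = last
            if B[j - 1] == ch:
                last = j
    return P

P = build_P("AEIOU", "AOOIUEI")
print(P[3])   # row for 'O': [0, 0, 0, 2, 3, 3, 3, 3], matching the description above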
Ok, now that you understand the use of P, let's see what's happening with your implementation.
About your specific case
In the paper, P is shown as a table that lists the whole alphabet. It is a good idea to iterate through the alphabet because the typical uses of this algorithm are in bioinformatics, where the alphabet is much, much smaller than the string A, making the iteration through the alphabet cheaper.
Because your strings are sequences of objects, your C would be the set of all possible objects, so you'd have to build a table P with the set of all possible object instances (nonsense, of course). This is definitely a case where the alphabet size is huge when compared to your string size. However, note that you will only be indexing P in those rows that correspond to letters from A: any row in P for a letter C[i] that is not in A is useless and will never be used. This makes your life easier, because it means you can build P with the string A instead of using the alphabet of every possible object.
Again, an example: if your alphabet is AEIOU, A is EEI and B is AOOIUEI, you will only be indexing P in the rows for E and I, so that's all you need in P:
This works and suffices, because in (7), P[c,j] is the entry in P for the character c, and c is the index of a(i). In other words: C[c] always belongs to A, so it makes perfect sense to build P for the characters of A instead of using the whole alphabet for the cases where the size of A is considerably smaller than the size of C.
All you have to do now is to apply the same principle to whatever your objects are.
I really don't know how to explain it any better. This may be a little dense at first. Make sure to re-read it until you really get it - and I mean every little detail. You have to master this before thinking about implementing it.
NOTE: You said you were looking for a credible and / or official source. I'm just another CS student, so I'm not an official source, but I think I can be considered "credible". I've studied this before and I know the subject. Happy coding!
I would like to know the different algorithms to find the biggest square in a two-dimensional map dotted with obstacles.
An example, where o would be obstacles:
...........................
....o......................
............o..............
...........................
....o......................
...............o...........
...........................
......o..............o.....
..o.......o................
The biggest square would be (if we choose the first one):
.....xxxxxxx...............
....oxxxxxxx...............
.....xxxxxxxo..............
.....xxxxxxx...............
....oxxxxxxx...............
.....xxxxxxx...o...........
.....xxxxxxx...............
......o..............o.....
..o.......o................
What would be the fastest algorithm to find it? The one with the smallest complexity?
EDIT: I know that people are interested in the algorithm explained in the accepted answer, so I made a document that explains it a bit more; you can find it here:
https://docs.google.com/document/d/19pHCD433tYsvAor0WObxa2qusAjKdx96kaf3z5I8XT8/edit?usp=sharing
Here is how to do this in the optimal amount of time, O(nm). This is built on top of @dukeling's insight that you never need to check a solution of size less than your current known best solution.
The key is to be able to build a data structure that can answer this query in O(1) time.
Is there an obstacle in the square whose top left corner is at r, c and has size k?
To solve that problem, we'll support answering a slightly harder question, also in O(1).
What is the count of items in the rectangle from r1, c1 to r2, c2?
It's easy to answer the square existence question with an answer from the rectangle count question.
To answer the rectangle count question, note that if you had pre-computed the answer for every rectangle that starts in the top left, then you could answer the general question from r1, c1 to r2, c2 by a kind of clever inclusion/exclusion tactic using only rectangles that start in the top left:
              c1   c2
-----------------------
|             |    |  |
|      A      |  B |  |
|_____________|____|  |  r1
|             |    |  |
|      C      |  D |  |
|_____________|____|  |  r2
|_____________________|
We want the count of stuff inside D, in terms of our pre-computed counts from the top left:
Count(D) = Count(A ∪ B ∪ C ∪ D) - Count(A ∪ C) - Count(A ∪ B) + Count(A)
You can pre-compute all the top left rectangles in O(nm) by doing some clever row/column partial sums, but I'll leave that to you.
Then answering the problem you want just involves checking possible solutions, starting with solutions that are at least as good as your known best. Your known best will only get better up to min(n, m) times total, so the best_possible increment will happen very rarely and almost all squares will be rejected in O(1) time.
best_possible = 0
for r in range(n):
    for c in range(m):
        while True:
            # this looks O(min(n, m)), but it's amortized O(1) since
            # best_possible rarely increases.
            if possible(r, c, best_possible + 1):
                best_possible += 1
            else:
                break
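For completeness, here is a rough sketch (mine, not from the original answer) of the O(nm) prefix-sum pre-computation and the possible check that the loop above assumes; in the loop, the extra parameters (pre, n, m) would simply be captured from the enclosing scope:

def build_prefix(grid):
    # grid[r][c] is truthy for an obstacle; pre[r][c] counts the obstacles
    # in the rectangle with corners (0, 0) and (r-1, c-1).
    n, m = len(grid), len(grid[0])
    pre = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n):
        for c in range(m):
            pre[r + 1][c + 1] = (int(bool(grid[r][c])) + pre[r][c + 1]
                                 + pre[r + 1][c] - pre[r][c])
    return pre

def possible(r, c, k, pre, n, m):
    # True if the k x k square with top-left corner (r, c) fits in the grid
    # and contains no obstacle (inclusion-exclusion on the prefix sums).
    if r + k > n or c + k > m:
        return False
    obstacles = (pre[r + k][c + k] - pre[r][c + k]
                 - pre[r + k][c] + pre[r][c])
    return obstacles == 0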
One idea, making use of binary search.
The basic idea:
Start off in the top-left corner. See if a 1x1 square would work.
If it will work, increase the sides lengths of the square by 1 and repeat.
If it won't work, move right and repeat. If you've reached the right-most position, move to the next line.
The naive approach:
We can simply check every possible cell of every square at each step, but this is fairly inefficient.
The optimized approach:
When increasing the square size, we can just do a binary search over the next row and column to see if that row / column contains an obstacle at any of those positions.
When moving to the right, we can do a binary search for each next column to determine if that column contains an obstacle at any of those positions.
When moving down, we can do a similar binary search on each of the columns in the target position.
Implementation note:
To start off, we'd need to go through all the rows and columns and set up arrays containing the positions of the obstacles for each of them, which we can use for the binary searches.
Running time:
We do 2 binary searches to increase the square size, and the square size is maximum the size of the grid, so that is fairly small (O(min(m,n) log max(m,n))) and gets dominated by the below.
Beyond that, for each position, we do a single binary search on a column.
So, for a grid with m columns and n rows, the overall complexity is O(mn log m).
But note how little we're actually searching below when the grid is sparse.
Example:
For your example:
012345678901234567890123456
0...........................
1....o......................
2............o..............
3...........................
4....o......................
5...............o...........
6...........................
7......o..............o.....
8..o.......o................
We'd first try a 1x1 square in the top-left corner, which works.
Then a 2x2 square. For this, we do a binary search for the range [0,1] on the row 1, which can be represented simply by {4} - an array of a single position corresponding to where the obstacle is. And we also do a binary search for the range [0,1] on the column 1, which contains no obstacles, thus an empty array - {}.
Then a 3x3 square. For this, we do a binary search for [0,2] on the row 2, which contains 1 obstacle at position 12, thus {12}. And we also do a binary search for [0,2] on the column 2, which contains an obstacle at position 8, thus {8}.
Then a 4x4 square. For this, we do a binary search for [0,3] on the row 3 - {}. And for [0,3] on column 3 - {}.
Then a 5x5 square. For this, we do a binary search for [0,4] on the row 4 - {4}. And for [0,4] on column 4 - {1,4}.
Here is the first one we actually find. In the range [0,4], we find 4 in both the row and the column (we only really need to find one of them). So this indicates a fail.
From here we do a binary search on column 4 (again - not really necessary) for [0,4]. Then binary search columns 5-8 for [0,4], none of them found, so a square starting at position 5,0 is the next possible candidate.
So from here we try to increase the square size to 5x5, which works, then 6x6 and 7x7, which works.
Then we try 8x8, which doesn't work.
And so on.
I know binary search, but how does yours work?
So we're basically doing a range search within a set of values. This is fairly easy to do. First search for the starting value of the range, then the end value. If we get to the same point, there are no values in the range.
We don't really care what values exist in the range, just whether or not there are any.
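As a minimal sketch (mine), with the obstacle positions of each row / column kept in sorted lists, that range check is just:

from bisect import bisect_left

def obstacle_in_range(sorted_positions, lo, hi):
    # sorted_positions: obstacle positions in one row or column, sorted ascending.
    # True if any obstacle lies in the inclusive range [lo, hi].
    i = bisect_left(sorted_positions, lo)
    return i < len(sorted_positions) and sorted_positions[i] <= hi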
So here's one rough approach.
Store the x-y positions of all the obstacles.
For each obstacle O
    find obstacle C that is nearest to it column-wise.
    find obstacle R-top that is nearest to it row-wise from the top.
    find obstacle R-bottom that is nearest to it row-wise from the bottom.
    if (|R-top.y - R-bottom.y| != |O.x - C.x|) continue
    Size of the square = Abs((R-top.y - R-bottom.y) * (O.x - C.x))
    Keep track of the sizes and positions to find the largest square
Complexity is roughly O(k^2) where k is the number of obstacles. You could reduce it to O(k * log k) if you use binary search.
The following SO articles are identical/similar to the problem you're trying to solve. You may want to look over those answers as well as the responses to your question.
Dynamic programming - Largest square block
dynamic programming: finding largest non-overlapping squares
Dynamic programming: Find largest diamond (rhombus)
Here's the baseline case I'd use, written in simplified Python/pseudocode.
# obstacleMap is a list of lists of MapElements, stored in row-major order
max([find_largest_rect(obstacleMap, element) for row in obstacleMap for element in row])

def find_largest_rect(obstacleMap, upper_left_elem):
    size = 0
    while not has_obstacles(obstacleMap, upper_left_elem, size + 1):
        size += 1
    return size

def has_obstacles(obstacleMap, upper_left_elem, size):
    # determines if there are obstacles on the outside layer of the
    # size x size square whose upper-left corner is upper_left_elem.
    # for example, if U is the upper left element and size=3, then
    # has_obstacles checks the elements marked p:
    # .....
    # ..U.p
    # ....p
    # ..ppp
    row, col = upper_left_elem.row, upper_left_elem.col
    if row + size > len(obstacleMap) or col + size > len(obstacleMap[0]):
        return True  # treat the edge of the map as an obstacle
    periphery_row = obstacleMap[row + size - 1][col:col + size]
    periphery_col = [r[col + size - 1] for r in obstacleMap[row:row + size - 1]]
    return any(is_obstacle(elem) for elem in periphery_row + periphery_col)

def is_obstacle(elem):
    return elem.value == 'o'

class MapElement(object):
    def __init__(self, row, col, value):
        self.row = row
        self.col = col
        self.value = value
Here is an approach using a recurrence relation:
isSquare(R,C1,C2) = noObstacle(R,C1,R,C2) && noObstacle(R,C2,R-(C2-C1),C2) && isSquare(R-1,C1,C2-1)
isSquare(R,C1,C2) = square that has bottom side (R,C1) to (R,C2)
noObstacle(R1,C1,R2,C2) = checks whether there is no obstacle in line segment (R1,C1) to (R2,C2)
Find the maximum value of (C2-C1+1) for which isSquare(R,C1,C2) = true.
You can use dynamic programming to solve this problem in polynomial time. Use a suitable data structure for locating obstacles.
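For reference, here is a minimal sketch (mine, not from this answer) of the standard largest-square dynamic program from the linked "Largest square block" question, assuming grid[r][c] is truthy exactly where there is an obstacle:

def largest_square(grid):
    # side[r][c] = side of the largest obstacle-free square whose
    # bottom-right corner is at (r, c).
    n, m = len(grid), len(grid[0])
    side = [[0] * m for _ in range(n)]
    best = 0
    for r in range(n):
        for c in range(m):
            if grid[r][c]:                 # obstacle
                side[r][c] = 0
            elif r == 0 or c == 0:
                side[r][c] = 1
            else:
                side[r][c] = 1 + min(side[r - 1][c],
                                     side[r][c - 1],
                                     side[r - 1][c - 1])
            best = max(best, side[r][c])
    return best

This runs in O(nm) time and space.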
This article discusses approximate substring matching techniques that utilize a suffix tree to improve matching time. Each answer addresses a different algorithm.
Approximate substring matching attempts to find a substring (pattern) P in a string T allowing up to k mismatches.
To learn how to create a suffix tree, click here. However, some algorithms require additional preprocessing.
I invite people to add new algorithms (even if it's incomplete) and improve answers.
This was the original question that started this thread.
Professor Esko Ukkonen published a paper: Approximate string-matching over suffix trees. He discusses 3 different algorithms that have different matching times:
Algorithm A: O(mq + n)
Algorithm B: O(mq log(q) + size of the output)
Algorithm C: O(m^2q + size of the output)
Where m is the length of the substring, n is the length of the search string, and q is the edit distance.
I've been trying to understand algorithm B but I'm having trouble following the steps. Does anyone have experience with this algorithm? An example or pseudo algorithm would be greatly appreciated.
In particular:
What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
Here's what I believe (I stand to be corrected):
On page seven, we're introduced to suffix tree concepts; a state is effectively a node in the suffix tree: let root denote the initial state.
g(a, c) = b where a and b are nodes in the tree and c is a character or substring in the tree. So this represents a transition; from a, following the edges represented by c, we move to node b. This is referred to as the go-to transition. So for the suffix tree below, g(root, 'ccb') = red node
Key(a) = edge sequence where a represents a node in the tree. For example, Key(red node) = 'ccb'. So g(root, Key(red node)) = red node.
Keys(Subset of node S) = { Key(node) | node ∈ S}
There is a suffix function for nodes a and b, f(a) = b: for all (or perhaps there may exist) a ≠ root, there exists a character c, a substring x, and a node b such that g(root, cx) = a and g(root, x) = b. I think that this means, for the suffix tree example above, that f(pink node) = green node where c = 'a' and x = 'bccb'.
There is a mapping H that contains a node from the suffix tree and a value pair. The value is given by loc(w); I'm still uncertain how to evaluate the function. This dictionary contains nodes that have not been eliminated.
extract-min(H) refers to attaining the entry with the smallest value in the pair (node, loc(w)) from H.
The crux of the algorithm seems to be related to how loc(w) is evaluated. I've constructed my suffix tree using the combined answer here; however, the algorithms work on a suffix trie (uncompressed suffix tree). Therefore concepts like the depth need to be maintained and processed differently. In the suffix trie the depth would represent the suffix length; in a suffix tree, the depth would simply represent the node depth in the tree.
You are doing well. I don't have familiarity with the algorithm, but have read the paper today. Everything you wrote is correct as far as it goes. You are right that some parts of the explanation assume a lot.
Your Questions
1. What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
The output consists of the maximal k-distance matches of P in T. In particular you'll get the final index and length for each. So clearly this is also O(n) (remember big-O is an upper bound), but may be smaller. This is a nod to the fact that it's impossible to generate p matches in less than O(p) time. The rest of the time bound concerns only the pattern length and the number of viable prefixes, both of which can be arbitrarily small, so the output size can dominate. Consider k=0 and the input is 'a' repeated n times with the pattern 'a'.
2. Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
You're right. It's an error. The loop index should be i. What about j? This is the index of the column corresponding to the input character being processed in the dynamic program. It should really be an input parameter.
Let's take a step back. The Example table on page 6 is computed left-to-right, column-by-column using equations (1-4) given earlier. These show that only the previous columns of D and L are needed to get the next. Function dp is just an implementation of this idea of computing column j from j-1. Columns j of D and L are called d and l respectively; columns j-1 of D and L are d' and l', the function's input parameters.
I recommend you work through the dynamic program until you understand it well. The algorithm is all about avoiding duplicate column computations. Here "duplicate" means "having the same values in the essential part", because that's all that matters. The inessential parts can't affect the answer.
The uncompressed trie is just the compressed one expanded in the obvious way to have one edge per character. Except for the idea of "depth", this is unimportant. In the compressed tree, depth(s) is just the length of the string - which he calls Key(s) - needed to get from the root to node s.
Algorithm A
Algorithm A is just a clever caching scheme.
All his theorems and lemmas show that 1) we only need to worry about the essential parts of columns and 2) the essential part of a column j is completely determined by the viable prefix Q_j. This is the longest suffix of the input ending at j that matches a prefix of the pattern (within edit distance k). In other words, Q_j is the maximal start of a k-edit match at the end of the input considered so far.
With this, here's pseudo-code for Algorithm A.
Let r = root of (uncompressed) suffix trie
Set r's cached d,l with formulas at end page 7 (0'th dp table columns)

// Invariant: r contains cached d,l
for each character t_j from input text T in sequence
    Let s = g(r, t_j)          // make the go-to transition from r on t_j
    if visited(s)
        r = s
        while no cached d,l on node r
            r = f(r)           // traverse suffix edge
        end while
    else
        Use cached d',l' on r to find new columns (d,l) = dp(d',l')
        Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
        r = s
        while depth(r) != |Q_j|
            mark r visited
            r = f(r)           // traverse suffix edge
        end while
        mark r visited
        set cached d,l on node r
    end if
end for
I've left out the output step for simplicity.
What is traversing suffix edges about? When we do this from a node r where Key(r) = aX (leading a followed by some string X), we are going to the node with Key X. The consequence: we are storing each column corresponding to a viable prefix Q_h at the trie node for the suffix of the input with prefix Q_h. The function f(s) = r is the suffix transition function.
For what it's worth, the Wikipedia picture of a suffix tree shows this pretty well. For example, if from the node for "NA" we follow the suffix edge, we get to the node for "A" and from there to "". We are always cutting off the leading character. So if we label state s with Key(s), we have f("NA") = "A" and f("A") = "". (I don't know why he doesn't label states like this in the paper. It would simplify many explanations.)
Now this is very cool because we are computing only one column per viable prefix. But it's still expensive because we are inspecting each character and potentially traversing suffix edges for each one.
Algorithm B
Algorithm B's intent is to go faster by skipping through the input, touching only those characters likely to produce a new column, i.e. those that are the ends of input that match a previously unseen viable prefix of the pattern.
As you'd suspect, the key to the algorithm is the loc function. Roughly speaking, this will tell where the next "likely" input character is. The algorithm is quite a bit like A* search. We maintain a min heap (which must have a delete operation) corresponding to the set S_i in the paper. (He calls it a dictionary, but this is not a very conventional use of the term.) The min heap contains potential "next states" keyed on the position of the next "likely character" as described above. Processing one character produces new entries. We keep going until the heap is empty.
You're absolutely right that here he gets sketchy. The theorems and lemmas are not tied together to make an argument on correctness. He assumes you will redo his work. I'm not entirely convinced by this hand-waving. But there does seem to be enough there to "decode" the algorithm he has in mind.
Another core concept is the set S_i and in particular the subset that remains not eliminated. We'll keep these un-eliminated states in the min-heap H.
You're right to say that the notation obscures the intent of S_i. As we process the input left-to-right and reach position i, we have amassed a set of viable prefixes seen so far. Each time a new one is found, a fresh dp column is computed. In the author's notation these prefixes would be Q_h for all h<=i or more formally { Q_h | h <= i }. Each of these has a path from the root to a unique node. The set S_i consists of all the states we get by taking one more step from all these nodes along go-to edges in the trie. This produces the same result as going through the whole text looking for each occurrence of Q_h and the next character a, then adding the state corresponding to Q_h a into S_i, but it's faster. The Keys for the S_i states are exactly the right candidates for the next viable prefix Q_{i+1}.
How do we choose the right candidate? Pick the one that occurs next after position i in the input. This is where loc(s) comes in. The loc value for a state s is just what I just said above: the position in the input starting at i where the viable prefix associated with that state occurs next.
The important point is that we don't want to just assign the newly found (by pulling the min loc value from H) "next" viable prefix as Q_{i+1} (the viable prefix for dp column i+1) and go on to the next character (i+2). This is where we must set the stage to skip ahead as far as possible to the last character k (with dp column k) such that Q_k = Q_{i+1}. We skip ahead by following suffix edges as in Algorithm A. Only this time we record our steps for future use by altering H: removing elements, which is the same as eliminating elements from S_i, and modifying loc values.
The definition of function loc(s) is bare, and he never says how to compute it. Also unmentioned is that loc(s) is also a function of i, the current input position being processed (that he jumps from j in earlier parts of the paper to i here for the current input position is unhelpful.) The impact is that loc(s) changes as input processing proceeds.
It turns out that the part of the definition that applies to eliminated states "just happens" because states are marked eliminated upon removal from H. So for this case we need only check for a mark.
The other case - un-eliminated states - requires that we search forward in the input looking for the next occurrence in the text that is not covered by some other string. This notion of covering is to ensure we are always dealing with only "longest possible" viable prefixes. Shorter ones must be ignored to avoid outputting other than maximal matches. Now, searching forward sounds expensive, but happily we have a suffix trie already constructed, which allows us to do it in O(|Key(s)|) time. The trie will have to be carefully annotated to return the relevant input position and to avoid covered occurrences of Key(s), but it wouldn't be too hard. He never mentions what to do when the search fails. Here loc(s) = \infty, i.e. it's eliminated and should be deleted from H.
Perhaps the hairiest part of the algorithm is updating H to deal with cases where we add a new state s for a viable prefix that covers Key(w) for some w that was already in H. This means we have to surgically update the (loc(w) => w) element in H. It turns out the suffix trie yet again supports this efficiently with its suffix edges.
With all this in our heads, let's try for pseudocode.
H = { (0 => root) }   // we use (loc => state) for min heap elements
until H is empty
    (j => s_j) = H.delete_min   // remove the min loc mapping from H
    (d, l) = dp(d', l', j) where (d', l') are cached at parent(s_j)
    Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
    r = s_j
    while depth(r) > |Q_j|
        mark r eliminated
        H.delete (_ => r)       // loc value doesn't matter
        r = f(r)                // traverse suffix edge
    end while
    set cached d,l on node r
    // Add all the "next states" reachable from r by go-tos
    for all s = g(r, a) for some character a
        unless s.eliminated?
            H.insert (loc(s) => s)   // here is where we use the trie to find loc
            // Update H elements that might be newly covered
            w = f(s)                 // suffix transition
            while w != null
                unless w.eliminated?
                    H.increase_key(loc(w) => w)   // using explanation in Lemma 9
                end unless
                w = f(w)             // suffix transition
            end while
        end unless
    end for
end until
Again I've omitted the output for simplicity. I will not say this is correct, but it's in the ballpark. One thing is that he mentions we should only process Q_j for nodes not previously "visited," but I don't understand what "visited" means in this context. I think visited states by Algorithm A's definition won't occur because they've been removed from H. It's a puzzle...
The increase_key operation in Lemma 9 is hastily described with no proof. His claim that the min operation is possible in O(log |alphabet|) time is leaving a lot to the imagination.
The number of quirks leads me to wonder if this is not the final draft of the paper. It is also a Springer publication, and this copy on-line would probably violate copyright restrictions if it were precisely the same. It might be worth looking in a library or paying for the final version to see if some of the rough edges were knocked off during final review.
This is as far as I can get. If you have specific questions, I'll try to clarify.
I apologize for not having the math background to put this question in a more formal way.
I'm looking to create a string of 796 letters (or integers) with certain properties.
Basically, the string is a variation on a De Bruijn sequence B(12,4), except order and repetition within each n-length subsequence are disregarded.
i.e. ABBB BABA BBBA are each equivalent to {AB}.
In other words, the main property of the string involves looking at consecutive groups of 4 letters within the larger string
(i.e. the 1st through 4th letters, the 2nd through 5th letters, the 3rd through 6th letters, etc)
And then producing the set of letters that comprise each group (repetitions and order disregarded)
For example, in the string of 9 letters:
A B B A C E B C D
the first 4-letter groups is: ABBA, which is comprised of the set {AB}
the second group is: BBAC, which is comprised of the set {ABC}
the third group is: BACE, which is comprised of the set {ABCE}
etc.
The goal is for every combination of 1-4 letters from a set of N letters to be represented by the 1-4-letter resultant sets of the 4-element groups once and only once in the original string.
For example, if there is a set of 5 letters {A, B, C, D, E} being used
Then the possible 1-4 letter combinations are:
A, B, C, D, E,
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE,
ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE,
ABCD, ABCE, ABDE, ACDE, BCDE
Here is a working example that uses a set of 5 letters {A, B, C, D, E}.
D D D D E C B B B B A E C C C C D A E E E E B D A A A A C B D D B
The 1st through 4th elements form the set: D
The 2nd through 5th elements form the set: DE
The 3rd through 6th elements form the set: CDE
The 4th through 7th elements form the set: BCDE
The 5th through 8th elements form the set: BCE
The 6th through 9th elements form the set: BC
The 7th through 10th elements form the set: B
etc.
* I am hoping to find a working example of a string that uses 12 different letters (a total of 793 4-letter groups within a 796-letter string) starting (and if possible ending) with 4 of the same letter. *
Here is a working solution for 7 letters:
AAAABCDBEAAACDECFAAADBFBACEAGAADEFBAGACDFBGCCCCDGEAFAGCBEEECGFFBFEGGGGFDEEEEFCBBBBGDCFFFFDAGBEGDDDDBE
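Whichever construction you try, it is easy to verify a candidate mechanically. Here is a small checker sketch (mine, not part of the answer above):

from itertools import combinations

def check(s, alphabet, window=4):
    # The reduced set of every group of `window` consecutive letters.
    seen = [frozenset(s[i:i + window]) for i in range(len(s) - window + 1)]
    # Every combination of 1..window letters must be represented exactly once.
    wanted = set(frozenset(c) for r in range(1, window + 1)
                 for c in combinations(alphabet, r))
    return len(seen) == len(wanted) and set(seen) == wanted

# e.g. check(solution_string, "ABCDEFG") for the 7-letter solution above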
Beware that in order to attempt an exhaustive search (the answer in VB is trying a naive version of that) you'll first have to solve the problem of generating all possible expansions while maintaining lexicographical order. Just ABC expands to all perms of AABC, plus all perms of ABBC, plus all perms of ABCC, which is 3*4! instead of just AABC. If you just concatenate AABC and AABD it would cover just 4 out of 4! perms of AABC, and even that by accident. Just this expansion will bring you exponential complexity - end of game. Plus you'll need to maintain the association between all expansions and the set (the set becomes a label).
Your best bet is to use one of the known efficient De Bruijn constructors and try to see if you can put your set-equivalence in there. Check out
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.674&rep=rep1&type=pdf
and
http://www.dim.uchile.cl/~emoreno/publicaciones/FINALES/copyrighted/IPL05-De_Bruijn_sequences_and_De_Bruijn_graphs_for_a_general_language.pdf
for a start.
If you know graphs, another viable option is to start with De Bruijn graph and formulate your set-equivalence as a graph rewriting. 2nd paper does De Bruijn graph partitioning.
BTW, try the VB answer just for A, B, AB (at least the expansion is small) - it will make AABBAB and construct ABBA or ABBAB (or throw, in a decent language), both of which are wrong. You can even prove that it will always miss with 1st lexical expansions (that's what AAB, AAAB etc. are) just by examining the first 2 passes (it will always miss the 2nd A for NxA because (N-1)xA+B is in the string - the 1st expansion of {AB}).
Oh, and if we could establish how many of each letter an optimal solution should have (don't look at B(5,2), it's too easy and regular :-) a random search would be feasible - you generate candidates with provable traits (like AAAA, BBBB, ... are present and not touching, and it has n1 A-s, n2 B-s, ...) and a random arrangement, and then test whether they are solutions (checking is much faster than exhaustive search in this case).
Cool problem. Just a draft/pseudo algo:
dim STR-A as string = getall(ABCDEFGHIJKL)
//custom function to generate concat list of all 793 4-char combos.
//should be listed side-by-side to form 3172 character-long string.
//different ordering may ultimately produce different results.
//brute-forcing all orders of combos is too much work (793! is a big #).
//need to determine how to find optimal ordering, for this particular
//approach below.

dim STR-B as string = ""        //to hold the string you're searching for
dim STR-C as string = ""        //to hold the sub-string you are searching in
dim STR-A-NEW as string = ""    //variable to hold your new string
dim MATCH as boolean = false    //variable to hold matching status

while len(STR-A) > 0
    //check each character in STR-A, which will be shortened by 1 char on
    //each pass.
    MATCH = false
    STR-B = left(STR-A, 4)
    STR-B = reduce(STR-B)
    //reduce(str) is a custom re-usable function to sort & remove duplicates

    for i as integer = 1 to len(STR-A) - 1
        STR-C = substr(STR-A, i, 4)
        //gives you the 4-character sequence beginning at position i
        STR-C = reduce(STR-C)
        IF STR-B = STR-C Then
            MATCH = true
            exit for
            //as long as there is even one match, you can throw away the
            //first letter
        END IF
    next

    IF MATCH = false Then
        //if you didn't find a match, then the first letter should be saved
        STR-A-NEW += LEFT(STR-B, 1)
    END IF

    MATCH = false                          //re-init MATCH
    STR-A = RIGHT(STR-A, LEN(STR-A) - 1)   //re-init STR-A
wend
Anyway -- there could be problems with this, and you'd need to write another function to parse your result string (STR-A-NEW) to prove that it's a viable answer...
I've been thinking about this one and I'm sketching out a solution.
Let's call a string of four symbols a word and we'll write S(w) to denote the set of symbols in word w.
Each word abcd has "follow-on" words bcde where a,...,e are all symbols.
Let succ(w) be the set of follow-on words v for w such that S(w) != S(v). succ(w) is the set of successor words that can follow on from the first symbol in w if w is in a solution.
For each non-empty set of symbols s of cardinality at most four, let words(s) be the set of words w such that S(w) = s. Any solution must contain exactly one word in words(s) for each such set s.
Now we can do a reasonable search. The basic idea is this: say we are exploring a search path ending with word w. The follow-on word must be a non-excluded word in succ(w). A word v is excluded if the search path contains some word u such that v is in words(S(u)).
You can be slightly more cunning: if we track the possible "predecessor" words to a set s (i.e., words w with a successor v such that v in words(s)) and reach a point where every predecessor of s is excluded, then we know we have reached a dead end, since we'll never be able to obtain s from any extension of the current search path.
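A rough Python sketch (mine) of those definitions, assuming a 5-symbol alphabet just for illustration:

from itertools import product

SYMBOLS = "ABCDE"   # hypothetical alphabet, for illustration only

def S(w):
    # The set of symbols in word w (order and repetition disregarded).
    return frozenset(w)

def succ(w):
    # Follow-on words of w whose symbol set differs from S(w).
    return [w[1:] + c for c in SYMBOLS if S(w[1:] + c) != S(w)]

def words(s):
    # All 4-symbol words whose symbol set is exactly s.
    out = []
    for p in product(SYMBOLS, repeat=4):
        v = ''.join(p)
        if S(v) == s:
            out.append(v)
    return out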
Code to follow after the weekend, with a bit of luck...
Here is my proposal. I'll admit upfront this is a performance and memory hog.
This may be overkill, but have a class; we'll call it UniqueCombination. It will contain a unique 1-4 char reduced combination of the input set (i.e. A, AB, ABC, ...). It will also contain a list of the possible combinations (AB: {AABB, ABAB, BBAA, ...}). It will need a method that determines if any possible combination overlaps any possible combination of another UniqueCombination by three characters, plus an overload that takes a string as well.
Then we start with the string "AAAA" and find all of the UniqueCombinations that overlap this string. Then we find how many UniqueCombinations those possible matches overlap with (we could be smart at this point and store this number). Then we pick the one with the least number of overlaps greater than 0. Use up the ones with the least possible matches first.
Then we find a specific combination for the chosen UniqueCombination and add it to the final string. Remove this UniqueCombination from the list, then find overlaps for the current string; rinse and repeat. (We could be smart and, on subsequent runs while searching for overlaps, remove any of the unreduced combinations that are contained in the final string.)
Well that's my plan I will work on the code this weekend. Granted this does not guarantee that the final 4 characters will be 4 of the same letter (it might actually be trying to avoid that but I will look into that as well.)
If there is a non-exponential solution at all, it may need to be formulated in terms of a recursive "growth" from a problem of a smaller size, i.e. to construct B(N,k) from B(N-1,k-1) or from B(N-1,k) or from B(N,k-1).
Systematic construction for B(5,2) - one step at a time :-) It's bound to get more complex later. [card stands for cardinality; {AB} has card=2, and I'll also call them 2-s, 3-s etc.] Note: the 2-s and 3-s will be k-1 and k later (I hope).

Initial. Start with the k-1 result and inject symbols for the singletons
(unique expansion, empty intersection):
    ABCDE -> AABBCCDDEE
    Mark used card=2 sets: AB, BC, CD, DE

Rewriting. Form card=3 sets to inject symbols into the marked card=2 sets.
    The 1st feasible lexicographic expansion fires (may have to backtrack for k>2).
    It's OK to use already marked 2-s since they'll all get replaced,
    but you may have to do a verification pass for higher k.
        AB->ACB, BC->BCD, CD->CED, DE->DAE ==> AACBBDCCEDDAEEB
    Mark/verify used 2-s.
    Normally keep marking/unmarking during the construction, but also keep the old
    mark list; marking/unmarking can get expensive if there's backtracking in #3.
    Unused: AB, BE
    For higher k you may need several recursive rewriting passes,
    possibly partitioning new sets into classes.

Finalize: unused 2-s should overlap around the edge (that's why it's cyclic).
    ABE - B can go to the beginning or end: AACBBDCCEDDAEEB
Note: a step from B(N-1,k) to B(N,k) may need injection of pseudo-singletons, like doubling or tripling A.

B(5,2) -> B(5,3) -> B(5,4)

Initial. Same: ABCDE -> AAACBBBDCCCEDDDAEEEB
    No use marking 3-sets since they are all going to be changed.

Rewriting.
    Choose systematic insertion positions:
        AAA_CBBB_DCCC_EDDD_AEEE_B
    Mark all 2-s released by this: AC, AD, BD, BE, CE
    Use the marked 2-s to decide the inserted symbols - notice the total regularity:
        AxCB D -> ADCB
        BxDC E -> BEDC
        CxED A -> CAED
        DxAE B -> DBAE
        ExBA C -> ECBA
    Verify that the 3-s are all used (inserted symbols marked just for fun):
        AAA[D]CBBB[E]DCCC[A]EDDD[B]AEEE[C]B

Note: The systematic choice of insertion point deterministically dictated the insertions (only AD can fit 1st; AC would create a duplicate 2-set (AAC, ACC)).

Note: It's not going to be so nice for B(6,2) and B(6,3) since the number of 2-s will exceed 2x the number of 1-s. This is important since 2-s sit naturally on the sides of 1-s, like CBBBE, and the issue is how to place them when you run out of 1-s.

B(5,3) is so symmetrical that just repeating #1 produces B(5,4):
    AAAADCBBBBEDCCCCAEDDDDBAEEEECB