On counting pairs of words that differ by one letter - algorithm

Let us consider n words, each of length k. The words consist of letters from an alphabet of cardinality n with a defined order. The task is to derive an O(nk) algorithm to count the number of pairs of words that differ at exactly one position (no matter which one exactly, as long as it's only a single position).
For instance, in the following set of words (n = 5, k = 4):
abcd, abdd, adcb, adcd, aecd
there are 5 such pairs: (abcd, abdd), (abcd, adcd), (abcd, aecd), (adcb, adcd), (adcd, aecd).
So far I've managed to find an algorithm that solves a slightly easier problem: counting the number of pairs of words that differ at one GIVEN position (the i-th). To do this I swap the letter at the i-th position with the last letter of each word, perform a radix sort that ignores the last position (formerly the i-th), linearly detect groups of words whose letters at positions 1 to k-1 are all equal, and finally count the occurrences of each letter at the last (originally i-th) position within each such group and calculate the desired number of pairs from those counts (the last part is simple).
However, the algorithm above doesn't seem to be applicable to the main problem (under the O(nk) constraint) - at least not without some modifications. Any idea how to solve this?

Assuming n and k aren't too large, so that this will fit into memory:
Have a map keyed by the word with its first letter removed, another with the second letter removed, another with the third letter removed, and so on. Technically each of these has to be a map from strings to counts.
Run through the list and add the current element to each of the maps (after removing the applicable letter). If the key already exists, add its count to totalPairs and then increment the count by one.
Then totalPairs is the desired value.
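A minimal Python sketch of this idea (my own, not the answerer's code; it assumes the input words are distinct, as in the example):

def count_pairs(words, k):
    total_pairs = 0
    maps = [{} for _ in range(k)]            # one map per removed position
    for word in words:
        for i in range(k):
            key = word[:i] + word[i+1:]      # the word with its i-th letter removed
            seen = maps[i].get(key, 0)
            total_pairs += seen              # pairs with every earlier word in this bucket
            maps[i][key] = seen + 1
    return total_pairs

# count_pairs(["abcd", "abdd", "adcb", "adcd", "aecd"], 4) returns 5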
EDIT:
Complexity:
This should be O(nk log n).
You can use a map that uses hashing (e.g. HashMap in Java) instead of a sorted map for a theoretical complexity of O(nk) (though I've generally found a hash map to be slower than a sorted tree-based map).
Improvement:
A small alteration on this is to have a map keyed by the first 2 letters removed whose value is 2 maps, one with only the first letter removed and one with only the second letter removed, and to do the same for the 3rd and 4th letters, and so on.
Then put these into maps keyed by 4 letters removed, those into maps keyed by 8 letters removed, and so on, up to half the letters removed.
The complexity of this is:
You do 2 lookups into 2 sorted sets containing at most k elements (one for each half).
For each of these you do 2 lookups into 2 sorted sets again (one for each quarter).
So the number of lookups is 2 + 4 + 8 + ... + k/2 + k, which I believe is O(k).
I may be wrong here, but, in the worst case, the number of elements in any given map is n, and that forces all the other maps to have only 1 element, so each lookup is still O(log n), but paid per n (not per n·k).
So I think that's O(n(log n + k)).
EDIT 2:
Example of my maps (without the improvement):
(x-1) means x maps to 1.
Let's say we have abcd, abdd, adcb, adcd, aecd.
The first map would be (bcd-1), (bdd-1), (dcb-1), (dcd-1), (ecd-1).
The second map would be (acd-3), (add-1), (acb-1) (for the 4th and 5th words the key already existed, so its count was incremented).
The third map: (abd-2), (adb-1), (add-1), (aed-1) (the 2nd word's key already existed).
The fourth map: (abc-1), (abd-1), (adc-2), (aec-1) (the 4th word's key already existed).
totalPairs = 0
For the second map, key acd: for the 4th word we add 1, for the 5th word we add 2.
totalPairs = 3
For the third map, key abd: for the 2nd word we add 1.
totalPairs = 4
For the fourth map, key adc: for the 4th word we add 1.
totalPairs = 5.
Partial example of improved maps:
Same input as above.
Map of first 2 letters removed to maps of 1st and 2nd letter removed:
(cd-{ {(bcd-1)}, {(acd-1)} }),
(dd-{ {(bdd-1)}, {(add-1)} }),
(cb-{ {(dcb-1)}, {(acb-1)} }),
(cd-{ {(dcd-1)}, {(acd-1)} }),
(cd-{ {(ecd-1)}, {(acd-1)} })
The above is a map in which the key cd maps to 2 maps, one containing the single element (bcd-1) and the other containing (acd-1).
But for the 4th and 5th words the key cd already existed, so, rather than generating the above, they are added to that existing entry instead, as follows:
(cd-{ {(bcd-1, dcd-1, ecd-1)}, {(acd-3)} }),
(dd-{ {(bdd-1)}, {(add-1)} }),
(cb-{ {(dcb-1)}, {(acb-1)} })

You can put each word into an array, pop out elements from that array one by one, and then compare the resulting arrays. Finally, you add back the popped element to get back the original arrays.
The popped elements from the two arrays must not be the same.
Count the number of cases where this occurs and finally divide it by 2 to get the exact solution.
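For illustration, here is one literal (and quadratic, O(n^2 k)) Python reading of this suggestion; each qualifying ordered pair is counted once, at its single differing position, hence the final division by 2:

def count_pairs_bruteforce(words):
    count = 0
    for w1 in words:
        for w2 in words:
            if w1 == w2:
                continue
            for i in range(len(w1)):
                # arrays with position i popped are equal, and the popped letters differ
                if w1[:i] + w1[i+1:] == w2[:i] + w2[i+1:] and w1[i] != w2[i]:
                    count += 1
    return count // 2    # each unordered pair was counted twice (once per order)

# count_pairs_bruteforce(["abcd", "abdd", "adcb", "adcd", "aecd"]) returns 5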

Think about how you would enumerate the language - you would likely use a recursive algorithm. Recursive algorithms map onto tree structures. If you construct such a tree, each divergence represents a difference of one letter, and each leaf will represent a word in the language.

It's been two months since I submitted the problem here. I have discussed it with my peers in the meantime and would like to share the outcome.
The main idea is similar to the one presented by Dukeling. For each word A and for each position i within that word we consider the tuple (prefix, suffix, letter at the i-th position), i.e. (A[1..i-1], A[i+1..k], A[i]). If i is either 1 or k, then the applicable substring is considered empty (these are simple boundary cases).
Having these tuples in hand, we should be able to apply the reasoning I provided in my first post to count the number of pairs of different words. All we have to do is sort the tuples by the prefix and suffix values (separately for each i) - then, words whose letters are equal at all but the i-th position will be adjacent to each other.
Here though is the technical part I am lacking. To make the sorting procedure (radix sort appears to be the way to go) meet the O(nk) constraint, we might want to assign labels to our prefixes and suffixes (we only need n labels for each i). I am not quite sure how to go about the labelling. (Sure, we might do some hashing instead, but I am pretty confident the former solution is viable.)
While this is not an entirely complete solution, I believe it casts some light on the possible way to tackle this problem and that is why I posted it here. If anyone comes up with an idea of how to do the labelling part, I will implement it in this post.
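As a stopgap illustration (my own sketch, not the labelling-based O(nk) solution), hashing with Python dictionaries can stand in for the labels: for each i, group the tuples by (prefix, suffix) and count, within each group, the pairs whose letters at position i differ.

from collections import Counter, defaultdict

def count_pairs_by_position(words, k):
    total = 0
    for i in range(k):
        groups = defaultdict(Counter)                 # (prefix, suffix) -> letter counts
        for w in words:
            groups[(w[:i], w[i+1:])][w[i]] += 1
        for letters in groups.values():
            m = sum(letters.values())
            same = sum(c * (c - 1) // 2 for c in letters.values())
            total += m * (m - 1) // 2 - same          # pairs differing at position i
    return total

# count_pairs_by_position(["abcd", "abdd", "adcb", "adcd", "aecd"], 4) returns 5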

How's the following Python solution?
import string

# `words` is assumed to be the set of input words; here, the example set from the question.
words = {"abcd", "abdd", "adcb", "adcd", "aecd"}

def one_apart(words, word):
    """Return all words in `words` that differ from `word` at exactly one position."""
    res = set()
    for i, _ in enumerate(word):
        for c in string.ascii_lowercase:
            w = word[:i] + c + word[i+1:]
            if w != word and w in words:
                res.add(w)
    return res

pairs = set()
for w in words:
    for other in one_apart(words, w):
        pairs.add(frozenset((w, other)))

for pair in pairs:
    print(pair)
Output:
frozenset({'abcd', 'adcd'})
frozenset({'aecd', 'adcd'})
frozenset({'adcb', 'adcd'})
frozenset({'abcd', 'aecd'})
frozenset({'abcd', 'abdd'})

Related

Generating a perfect hash function given known list of strings?

Suppose I have a list of N strings, known at compile-time.
I want to generate (at compile-time) a function that will map each string to a distinct integer between 1 and N inclusive. The function should take very little time or space to execute.
For example, suppose my strings are:
{"apple", "orange", "banana"}
Such a function may return:
f("apple") -> 2
f("orange") -> 1
f("banana") -> 3
What's a strategy to generate this function?
I was thinking of analyzing the strings at compile time and looking for a couple of constants I could mod or add by, or something along those lines?
The compile-time generation time/space can be quite expensive (but obviously not ridiculously so).
Say you have m distinct strings, and let a_{i,j} be the j-th character of the i-th string. In the following, I'll assume that they all have the same length. This can be easily translated into any reasonable programming language by treating a_{i,j} as the null character if j ≥ |a_i|.
The idea I suggest is composed of two parts:
Find (at most) m - 1 positions differentiating the strings, and store these positions.
Create a perfect hash function by considering the strings as length-m vectors, and storing the parameters of the perfect hash function.
Obviously, in general, the hash function must check at least m - 1 positions. It's easy to see this by induction. For 2 strings, at least 1 character must be checked. Assume it's true for i strings: i - 1 positions must be checked. Create a new set of strings by appending 0 to the end of each of the i strings, and add a new string that is identical to one of the strings, except it has a 1 at the end.
Conversely, it's obvious that it's possible to find at most m - 1 positions sufficient for differentiating the strings (for some sets the number of course might be lower, as low as log to the base of the alphabet size of m). Again, it's easy to see so by induction. Two distinct strings must differ at some position. Placing the strings in a matrix with m rows, there must be some column where not all characters are the same. Partitioning the matrix into two or more parts, and applying the argument recursively to each part with more than 2 rows, shows this.
Say the m - 1 positions are p_1, ..., p_{m-1}. In the following, recall the meaning above of a_{i,p_j} when p_j ≥ |a_i|: it is the null character.
Let us define h(a_i) = (∑_{j=1}^{m-1} q_j · a_{i,p_j}) mod n, for random q_j and some n. Then h is known to be a universal hash function: the probability of a pair collision, P(x ≠ y ∧ h(x) = h(y)), is at most 1/n.
Given a universal hash function, there are known constructions for creating a perfect hash function from it. Perhaps the simplest is creating a vector of size m² and successively trying the above h with n = m² and randomized coefficients, until there are no collisions. The expected number of attempts needed until this is achieved is 2, and the probability that more attempts are needed decreases exponentially.
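A rough Python sketch of this two-part construction (the function names are mine; it greedily picks at most m - 1 differentiating positions, pads shorter strings with the null character as described above, and retries random coefficients with n = m² until the hash is collision-free; it runs at run time rather than generating code at compile time, but the construction is the same):

import random
from collections import defaultdict

def distinguishing_positions(strings):
    # Greedily pick positions until all strings are separated (at most m - 1 picks).
    positions, groups = [], [list(strings)]
    while any(len(g) > 1 for g in groups):
        g = next(grp for grp in groups if len(grp) > 1)
        # any position where this group is not constant will split it further
        p = next(j for j in range(len(g[0])) if len({s[j] for s in g}) > 1)
        positions.append(p)
        new_groups = []
        for grp in groups:
            buckets = defaultdict(list)
            for s in grp:
                buckets[s[p]].append(s)
            new_groups.extend(buckets.values())
        groups = new_groups
    return positions

def make_perfect_hash(strings):
    L = max(len(s) for s in strings)
    padded = [s.ljust(L, "\0") for s in strings]       # out-of-range characters = null
    m = len(padded)
    positions = distinguishing_positions(padded)
    n = m * m                                          # table of size m^2, as above
    while True:
        q = [random.randrange(1, n) for _ in positions]
        h = lambda s: sum(qj * ord(s.ljust(L, "\0")[p]) for qj, p in zip(q, positions)) % n
        if len({h(s) for s in padded}) == m:           # no collisions: h is perfect
            return h

f = make_perfect_hash(["apple", "orange", "banana"])
# f maps each of the three strings to a distinct value in the range [0, m^2)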
It is simple. Make a dictionary and assign 1 to the first word, 2 to the second, ... No need to make things complicated, just number your words.
To make the lookup effective, use a trie or binary search or whatever tool your language provides.

Implementing Parallel Algorithm for Longest Common Subsequence

I am trying to implement the Parallel Algorithm for Longest Common Subsequence Problem described in http://www.iaeng.org/publication/WCE2010/WCE2010_pp499-504.pdf
But I am having a problem with the variable C in Equation 6 on page 4.
The paper refers to C at the end of page 3 as: "Let C[1 : l] be the finite alphabet."
I am not sure what is meant by this; I guess that with the 2 strings ABCDEF and ABQXYEF it would be ABCDEFQXY. But what if my 2 strings are lists of objects (where my match test, for example, is obj1.Name = obj2.Name)? What would my C be here? Just a union of the 2 arrays?
Having read and studied the paper, I can say that C is supposed to be an array holding the alphabet of your strings, where the alphabet size (and, thus, the size of C) is l.
By the looks of your question, however, I feel the need to go deeper on this, because it looks like you didn't get the whole picture yet. What is P[i,j], and why do you need it? The answer is that you don't really need it, but it's an elegant optimization. In page 3, a little bit before Theorem 1, it is said that:
[...] This process ends when j-k = 0 at the k-th step, or a(i) = b(j-k) at the k-th step. Assume that the process stops at the k-th step, and k must be the minimum number that makes a(i) = b(j-k) or j-k = 0. [...]
The recurrence relation in (3) is equivalent to (2), but the fundamental difference is that (2) expands recursively, whereas with (3) you never have recursive calls, provided that you know k. In other words, the magic behind (3) not expanding recursively is that you somehow know the spot where the recursion on (2) would stop, so you look at that cell immediately, rather than recursively approaching it.
Ok then, but how do you find out the value of k? Since k is the spot where (2) reaches a base case, it can be seen that k is the number of columns that you have to "go back" on B until you are either off the limits (i.e., the first column, which is filled with 0's) OR you find a match between a character in B and a character in A (which corresponds to the base case conditions in (2)). Remember that you will be matching the character a(i-1), where i is the current row.
So, what you really want is to find the last position in B before j where the character a(i-1) appears. If no such character ever appears in B before j, then that would be equivalent to reaching the case i = 0 or j-1 = 0 in (2); otherwise, it's the same as reaching a(i) = b(j-1) in (2).
Let's look at an example:
Consider that the algorithm is working on computing the values for i = 2 and j = 3 (the row and column are highlighted in gray). Imagine that the algorithm is working on the cell highlighted in black and is applying (2) to determine the value of S[2,2] (the position to the left of the black one). By applying (2), it would then start by looking at a(2) and b(2). a(2) is C, b(2) is G, so there's no match (this is the same procedure as in the original, well-known algorithm). The algorithm now wants to find the value of S[2,2], because it is needed to compute S[2,3] (where we are). S[2,2] is not known yet, but the paper shows that it is possible to determine that value without referring to the row with i = 2. In (2), the 3rd case is chosen: S[2,2] = max(S[1, 2], S[2, 1]). Notice, if you will, that all this formula is doing is looking at the positions that would have been used to calculate S[2,2]. So, to rephrase that: we're computing S[2,3], we need S[2,2] for that, we don't know it yet, so we're going back in the table to see what the value of S[2,2] is, in pretty much the same way we did in the original, non-parallel algorithm.
When will this stop? In this example, it will stop when we find the letter C (this is our a(i)) in TGTTCGACA before the second T (the letter on the current column) OR when we reach column 0. Because there is no C before T, we reach column 0. Another example:
Here, (2) would stop with j-1 = 5, because that is the last position in TGTTCGACA where C shows up. Thus, the recursion reaches the base case a(i) = b(j-1) when j-1 = 5.
With this in mind, we can see a shortcut here: if you could somehow know the amount k such that j-1-k is a base case in (2), then you wouldn't have to go through the score table to find the base case.
That's the whole idea behind P[i,j]. P is a table where you lay down the whole alphabet vertically (on the left side); the string B is, once again, placed horizontally in the upper side. This table is computed as part of a preprocessing step, and it will tell you exactly what you will need to know ahead of time: for each position j in B, it says, for each character C[i] in C (the alphabet), what is the last position in B before j where C[i] is found (note that i is used to index C, the alphabet, and not the string A. Maybe the authors should have used another index variable to avoid confusion).
So, you can think of the semantics of an entry P[i,j] as something along the lines of: the last position in B where I saw C[i] before position j. For example, if your alphabet is sigma = {A, E, I, O, U} and B = "AOOIUEI", then P is:
Take the time to understand this table. Note the row for O. Remember: this row lists, for every position in B, where the last known "O" is. Only when j = 3 will we have a value that is not zero (it's 2), because that's the position after the first O in AOOIUEI. This entry says that the last position in B where O was seen before is position 2 (and, indeed, B[2] is an O, the one that follows A). Notice, in that same row, that for j = 4, we have the value 3, because now the last position for O is the one that corresponds to the second O in B (and since no more O's exist, the rest of the row will be 3).
Recall that building P is a preprocessing step necessary if you want to easily find the value of k that makes the recursion from equation (2) stop. It should make sense by now that P[i,j] is the k you're looking for in (3). With P, you can determine that value in O(1) time.
Thus, the C[i] in (6) is a letter of the alphabet - the letter that we are currently considering. In the example above, C = [A,E,I,O,U], and C[1] = A, C[2] = E, etc. In equation (7), c is the position in C where a(i) (the current letter of string A being considered) lives. It makes sense: after all, when building the score table position S[i,j], we want to use P to find the value of k - we want to know where we last saw an a(i) in B before j. We do that by reading P[index_of(a(i)), j].
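To make the table concrete, here is a small Python sketch (my own, not the paper's pseudocode) that builds such a P for the example above, using 1-based positions to match the paper's notation:

def build_P(B, alphabet):
    n = len(B)
    P = {c: [0] * (n + 1) for c in alphabet}    # columns 0..n; column 0 is unused
    last = {c: 0 for c in alphabet}             # last position where c was seen so far
    for j in range(1, n + 1):
        for c in alphabet:
            P[c][j] = last[c]                   # last occurrence of c strictly before j
        ch = B[j - 1]                           # the j-th character of B (1-based)
        if ch in last:                          # characters outside `alphabet` are ignored
            last[ch] = j
    return P

P = build_P("AOOIUEI", "AEIOU")
# P['O'][1:] == [0, 0, 2, 3, 3, 3, 3]  -- the row for O discussed above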
Ok, now that you understand the use of P, let's see what's happening with your implementation.
About your specific case
In the paper, P is shown as a table that lists the whole alphabet. It is a good idea to iterate through the alphabet because the typical uses of this algorithm are in bioinformatics, where the alphabet is much, much smaller than the string A, making the iteration through the alphabet cheaper.
Because your strings are sequences of objects, your C would be the set of all possible objects, so you'd have to build a table P with the set of all possible object instances (nonsense, of course). This is definitely a case where the alphabet size is huge when compared to your string size. However, note that you will only be indexing P in those rows that correspond to letters from A: any row in P for a letter C[i] that is not in A is useless and will never be used. This makes your life easier, because it means you can build P with the string A instead of using the alphabet of every possible object.
Again, an example: if your alphabet is AEIOU, A is EEI and B is AOOIUEI, you will only be indexing P in the rows for E and I, so that's all you need in P:
This works and suffices, because in (7), P[c,j] is the entry in P for the character c, and c is the index of a(i). In other words: C[c] always belongs to A, so it makes perfect sense to build P for the characters of A instead of using the whole alphabet for the cases where the size of A is considerably smaller than the size of C.
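Using the same build_P sketch from above, restricting the rows to the characters that actually occur in A is just a matter of passing a smaller alphabet:

P_small = build_P("AOOIUEI", set("EEI"))   # builds only the rows for E and I
# P_small.keys() == {'E', 'I'}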
All you have to do now is to apply the same principle to whatever your objects are.
I really don't know how to explain it any better. This may be a little dense at first. Make sure to re-read it until you really get it - and I mean every little detail. You have to master this before thinking about implementing it.
NOTE: You said you were looking for a credible and / or official source. I'm just another CS student, so I'm not an official source, but I think I can be considered "credible". I've studied this before and I know the subject. Happy coding!

Approximate substring matching using a Suffix Tree

This article discusses approximate substring matching techniques that utilize a suffix tree to improve matching time. Each answer addresses a different algorithm.
Approximate substring matching attempts to find a substring (pattern) P in a string T allowing up to k mismatches.
To learn how to create a suffix tree, click here. However, some algorithms require additional preprocessing.
I invite people to add new algorithms (even if it's incomplete) and improve answers.
This was the original question that started this thread.
Professor Esko Ukkonen published a paper: Approximate string-matching over suffix trees. He discusses 3 different algorithms that have different matching times:
Algorithm A: O(mq + n)
Algorithm B: O(mq log(q) + size of the output)
Algorithm C: O(m^2q + size of the output)
Where m is the length of the substring, n is the length of the search string, and q is the edit distance.
I've been trying to understand algorithm B but I'm having trouble following the steps. Does anyone have experience with this algorithm? An example or pseudo algorithm would be greatly appreciated.
In particular:
What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
Here's what I believe (I stand to be corrected):
On page seven, we're introduced to suffix tree concepts; a state is effectively a node in the suffix tree: let root denote the initial state.
g(a, c) = b where a and b are nodes in the tree and c is a character or substring in the tree. So this represents a transition; from a, following the edges represented by c, we move to node b. This is referred to as the go-to transition. So for the suffix tree below, g(root, 'ccb') = red node
Key(a) = edge sequence where a represents a node in the tree. For example, Key(red node) = 'ccb'. So g(root, Key(red node)) = red node.
Keys(Subset of node S) = { Key(node) | node ∈ S}
There is a suffix function for nodes a and b, f(a) = b: for all (or perhaps there may exist) a ≠ root, there exists a character c, a substring x, and a node b such that g(root, cx) = a and g(root, x) = b. I think that this means, for the suffix tree example above, that f(pink node) = green node where c = 'a' and x = 'bccb'.
There is a mapping H that contains a node from the suffix tree and a value pair. The value is given by loc(w); I'm still uncertain how to evaluate the function. This dictionary contains nodes that have not been eliminated.
extract-min(H) refers to attaining the entry with the smallest value in the pair (node, loc(w)) from H.
The crux of the algorithm seems to be related to how loc(w) is evaluated. I've constructed my suffix tree using the combined answer here; however, the algorithms work on a suffix trie (uncompressed suffix tree). Therefore concepts like the depth need to be maintained and processed differently. In the suffix trie the depth would represent the suffix length; in a suffix tree, the depth would simply represent the node depth in the tree.
You are doing well. I don't have familiarity with the algorithm, but have read the paper today. Everything you wrote is correct as far as it goes. You are right that some parts of the explanation assume a lot.
Your Questions
1. What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
The output consists of the maximal k-distance matches of P in T. In particular you'll get the final index and length for each. So clearly this is also O(n) (remember big-O is an upper bound), but may be smaller. This is a nod to the fact that it's impossible to generate p matches in less than O(p) time. The rest of the time bound concerns only the pattern length and the number of viable prefixes, both of which can be arbitrarily small, so the output size can dominate. Consider k=0 and the input is 'a' repeated n times with the pattern 'a'.
2. Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
You're right. It's an error. The loop index should be i. What about j? This is the index of the column corresponding to the input character being processed in the dynamic program. It should really be an input parameter.
Let's take a step back. The Example table on page 6 is computed left-to-right, column-by-column using equations (1-4) given earlier. These show that only the previous columns of D and L are needed to get the next. Function dp is just an implementation of this idea of computing column j from j-1. Column j of D and L are called d and l respectively. Column j-1 of D and L are d' and l', the function's input parameters.
I recommend you work through the dynamic program until you understand it well. The algorithm is all about avoiding duplicate column computations. Here "duplicate" means "having the same values in the essential part", because that's all that matters. The inessential parts can't affect the answer.
The uncompressed trie is just the compressed one expanded in the obvious way to have one edge per character. Except for the idea of "depth", this is unimportant. In the compressed tree, depth(s) is just the length of the string - which he calls Key(s) - needed to get from the root to node s.
Algorithm A
Algorithm A is just a clever caching scheme.
All his theorems and lemmas show that 1) we only need to worry about the essential parts of columns and 2) the essential part of a column j is completely determined by the viable prefix Q_j. This is the longest suffix of the input ending at j that matches a prefix of the pattern (within edit distance k). In other words, Q_j is the maximal start of a k-edit match at the end of the input considered so far.
With this, here's pseudo-code for Algorithm A.
Let r = root of (uncompressed) suffix trie
Set r's cached d,l with formulas at end page 7 (0'th dp table columns)
// Invariant: r contains cached d,l
for each character t_j from input text T in sequence
  Let s = g(r, t_j) // make the go-to transition from r on t_j
  if visited(s)
    r = s
    while no cached d,l on node r
      r = f(r) // traverse suffix edge
    end while
  else
    Use cached d',l' on r to find new columns (d,l) = dp(d',l')
    Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
    r = s
    while depth(r) != |Q_j|
      mark r visited
      r = f(r) // traverse suffix edge
    end while
    mark r visited
    set cached d,l on node r
  end if
end for
I've left out the output step for simplicity.
What is traversing suffix edges about? When we do this from a node r where Key(r) = aX (leading a followed by some string X), we are going to the node with Key X. The consequence: we are storing each column corresponding to a viable prefix Q_h at the trie node for the suffix of the input with prefix Q_h. The function f(s) = r is the suffix transition function.
For what it's worth, the Wikipedia picture of a suffix tree shows this pretty well. For example, if from the node for "NA" we follow the suffix edge, we get to the node for "A" and from there to "". We are always cutting off the leading character. So if we label state s with Key(s), we have f("NA") = "A" and f("A") = "". (I don't know why he doesn't label states like this in the paper. It would simplify many explanations.)
Now this is very cool because we are computing only one column per viable prefix. But it's still expensive because we are inspecting each character and potentially traversing suffix edges for each one.
Algorithm B
Algorithm B's intent is to go faster by skipping through the input, touching only those characters likely to produce a new column, i.e. those that are the ends of input that match a previously unseen viable prefix of the pattern.
As you'd suspect, the key to the algorithm is the loc function. Roughly speaking, this will tell where the next "likely" input character is. The algorithm is quite a bit like A* search. We maintain a min heap (which must have a delete operation) corresponding to the set S_i in the paper. (He calls it a dictionary, but this is not a very conventional use of the term.) The min heap contains potential "next states" keyed on the position of the next "likely character" as described above. Processing one character produces new entries. We keep going until the heap is empty.
You're absolutely right that here he gets sketchy. The theorems and lemmas are not tied together to make an argument on correctness. He assumes you will redo his work. I'm not entirely convinced by this hand-waving. But there does seem to be enough there to "decode" the algorithm he has in mind.
Another core concept is the set S_i and in particular the subset that remains not eliminated. We'll keep these un-eliminated states in the min-heap H.
You're right to say that the notation obscures the intent of S_i. As we process the input left-to-right and reach position i, we have amassed a set of viable prefixes seen so far. Each time a new one is found, a fresh dp column is computed. In the author's notation these prefixes would be Q_h for all h<=i or more formally { Q_h | h <= i }. Each of these has a path from the root to a unique node. The set S_i consists of all the states we get by taking one more step from all these nodes along go-to edges in the trie. This produces the same result as going through the whole text looking for each occurrence of Q_h and the next character a, then adding the state corresponding to Q_h a into S_i, but it's faster. The Keys for the S_i states are exactly the right candidates for the next viable prefix Q_{i+1}.
How do we choose the right candidate? Pick the one that occurs next after position i in the input. This is where loc(s) comes in. The loc value for a state s is just what I just said above: the position in the input starting at i where the viable prefix associated with that state occurs next.
The important point is that we don't want to just assign the newly found (by pulling the min loc value from H) "next" viable prefix as Q_{i+1} (the viable prefix for dp column i+1) and go on to the next character (i+2). This is where we must set the stage to skip ahead as far as possible to the last character k (with dp column k) such that Q_k = Q_{i+1}. We skip ahead by following suffix edges as in Algorithm A. Only this time we record our steps for future use by altering H: removing elements, which is the same as eliminating elements from S_i, and modifying loc values.
The definition of function loc(s) is bare, and he never says how to compute it. Also unmentioned is that loc(s) is also a function of i, the current input position being processed (that he jumps from j in earlier parts of the paper to i here for the current input position is unhelpful.) The impact is that loc(s) changes as input processing proceeds.
It turns out that the part of the definition that applies to eliminated states "just happens" because states are marked eliminated upon removal from H. So for this case we need only check for a mark.
The other case - un-eliminated states - requires that we search forward in the input looking for the next occurrence in the text that is not covered by some other string. This notion of covering is to ensure we are always dealing with only "longest possible" viable prefixes. Shorter ones must be ignored to avoid outputting anything other than maximal matches. Now, searching forward sounds expensive, but happily we have a suffix trie already constructed, which allows us to do it in O(|Key(s)|) time. The trie will have to be carefully annotated to return the relevant input position and to avoid covered occurrences of Key(s), but it wouldn't be too hard. He never mentions what to do when the search fails. Here loc(s) = infinity, i.e. the state is eliminated and should be deleted from H.
Perhaps the hairiest part of the algorithm is updating H to deal with cases where we add a new state s for a viable prefix that covers Key(w) for some w that was already in H. This means we have to surgically update the (loc(w) => w) element in H. It turns out the suffix trie yet again supports this efficiently with its suffix edges.
With all this in our heads, let's try for pseudocode.
H = { (0 => root) } // we use (loc => state) for min heap elements
until H is empty
  (j => s_j) = H.delete_min // remove the min loc mapping from H
  (d, l) = dp(d', l', j) where (d',l') are cached at parent(s_j)
  Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
  r = s_j
  while depth(r) > |Q_j|
    mark r eliminated
    H.delete (_ => r) // loc value doesn't matter
    r = f(r) // traverse suffix edge
  end while
  set cached d,l on node r
  // Add all the "next states" reachable from r by go-tos
  for all s = g(r, a) for some character a
    unless s.eliminated?
      H.insert (loc(s) => s) // here is where we use the trie to find loc
      // Update H elements that might be newly covered
      w = f(s) // suffix transition
      while w != null
        unless w.eliminated?
          H.increase_key(loc(w) => w) // using explanation in Lemma 9.
        end unless
        w = f(w) // suffix transition
      end while
    end unless
  end for
end until
Again I've omitted the output for simplicity. I will not say this is correct, but it's in the ballpark. One thing is that he mentions we should only process Q_j for nodes not "visited" before, but I don't understand what "visited" means in this context. I think states visited by Algorithm A's definition won't occur, because they've been removed from H. It's a puzzle...
The increase_key operation in Lemma 9 is hastily described with no proof. His claim that the min operation is possible in O(log |alphabet|) time is leaving a lot to the imagination.
The number of quirks leads me to wonder if this is not the final draft of the paper. It is also a Springer publication, and this copy on-line would probably violate copyright restrictions if it were precisely the same. It might be worth looking in a library or paying for the final version to see if some of the rough edges were knocked off during final review.
This is as far as I can get. If you have specific questions, I'll try to clarify.

How to "sort" elements of 2 possible values in place in linear time? [duplicate]

Suppose I have a function f and array of elements.
The function returns A or B for any element; you could visualize the elements this way ABBAABABAA.
I need to sort the elements according to the function, so the result is: AAAAAABBBB
The number of A values doesn't have to equal the number of B values. The total number of elements can be arbitrary (not fixed). Note that you don't sort chars, you sort objects that have a single char representation.
A few more things:
the sort should take linear time - O(n),
it should be performed in place,
it should be a stable sort.
Any ideas?
Note: if the above is not possible, do you have ideas for algorithms sacrificing one of the above requirements?
If it has to be linear and in-place, you could do a semi-stable version. By semi-stable I mean that A or B could be stable, but not both. Similar to Dukeling's answer, but you move both iterators from the same side:
a = first A
b = first B
loop while next A exists
  if b < a
    swap a,b elements
    b = next B
    a = next A
  else
    a = next A
With the sample string ABBAABABAA, you get:
ABBAABABAA
AABBABABAA
AAABBBABAA
AAAABBBBAA
AAAAABBBBA
AAAAAABBBB
on each turn, if you make a swap you move both, if not you just move a. This will keep A stable, but B will lose its ordering. To keep B stable instead, start from the end and work your way left.
It may be possible to do it with full stability, but I don't see how.
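Here is a rough Python rendering of that semi-stable pass (keeping the A's stable), with is_a standing in for the classifying function f from the question:

def semi_stable_partition(items, is_a):
    def next_index(start, want_a):
        i = start
        while i < len(items) and is_a(items[i]) != want_a:
            i += 1
        return i                                  # len(items) means "not found"

    a = next_index(0, True)                       # first A
    b = next_index(0, False)                      # first B
    while a < len(items) and b < len(items):      # loop while a next A (and a B) exist
        if b < a:
            items[a], items[b] = items[b], items[a]
            b = next_index(b + 1, False)          # next B
        a = next_index(a + 1, True)               # next A
    return items

# semi_stable_partition(list("ABBAABABAA"), lambda x: x == "A")
#   -> ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']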
A stable sort might not be possible with the other given constraints, so here's an unstable sort that's similar to the partition step of quick-sort.
1. Have 2 iterators, one starting on the left, one starting on the right.
2. While there's a B at the right iterator, decrement the iterator.
3. While there's an A at the left iterator, increment the iterator.
4. If the iterators haven't crossed each other, swap their elements and repeat from 2.
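A quick Python sketch of this partition step (unstable, in place, single pass):

def unstable_partition(items, is_a):
    left, right = 0, len(items) - 1
    while True:
        while right >= 0 and not is_a(items[right]):           # step 2: B at the right
            right -= 1
        while left < len(items) and is_a(items[left]):         # step 3: A at the left
            left += 1
        if left >= right:                                      # iterators crossed: done
            return items
        items[left], items[right] = items[right], items[left]  # step 4: swap and repeat

# unstable_partition(list("ABBAABABAA"), lambda x: x == "A")
#   -> ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']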
Let's say:
Object_Array[1...N]
Type_A objs are A1, A2, ..., Ai
Type_B objs are B1, B2, ..., Bj
i + j = N

FOR i = 1:N
  if Object_Array[i] is of Type_A
    obj_A_count = obj_A_count + 1
  else
    obj_B_count = obj_B_count + 1
LOOP
Then fill the resultant array with the Type_A objects followed by the Type_B objects, using their respective counts.
The following should work in linear time for a doubly-linked list. Because up to N insertions/deletions are involved, it may take quadratic time for arrays, though.
1. Find the location where the first B should be after "sorting". This can be done in linear time by counting As.
2. Start with 3 iterators: iterA starts from the beginning of the container, iterB starts from the above location where As and Bs should meet, and iterMiddle starts one element prior to iterB.
3. With iterA skip over As, find the 1st B, and move the object from iterA to iterB->previous position. Now iterA points to the next element after where the moved element used to be, and the moved element is now just before iterB.
4. Continue with step 3 until you reach iterMiddle. After that all elements between first() and iterB-1 are As.
5. Now set iterA to iterB-1.
6. Skip over Bs with iterB. When an A is found, move it to just after iterA and increment iterA.
7. Continue step 6 until iterB reaches end().
This would work as a stable sort for any container. The algorithm includes O(N) insertions/deletions, which is linear time for containers with O(1) insertion/deletion, but, alas, O(N^2) for arrays. Applicability in your case depends on whether the container is an array rather than a list.
If your data structure is a linked list instead of an array, you should be able to meet all three of your constraints. You just skim through the list, and accumulating and moving the "B"s becomes trivial pointer changes. Pseudo code below:
sort(list) {
  node = list.head, blast = null, bhead = null
  while(node != null) {
    nextnode = node.next
    if(node.val == "a") {
      if(blast != null) {
        // move the 'a' to the front of the 'B' list
        bhead.prev.next = node, node.prev = bhead.prev
        blast.next = node.next, node.next.prev = blast
        node.next = bhead, bhead.prev = node
      }
    }
    else if(node.val == "b") {
      if(blast == null)
        bhead = blast = node
      else // accumulate the "b"s..
        blast = node
    }
    node = nextnode
  }
}
So, you can do this in an array, but the memcopies that emulate the list swap will make it quite slow for large arrays.
Firstly, assuming the array of A's and B's is either generated or read in, I wonder why not avoid this question entirely by simply applying f as the list is being accumulated into memory, building two lists that would subsequently be merged.
Otherwise, we can posit an alternative solution in O(n) time and O(1) space that may be sufficient depending on Sir Bohumil's ultimate needs:
Traverse the list and sort each segment of 1,000,000 elements in place using the permutation cycles of the segment (once this step is done, the list could technically be sorted in place by recursively swapping the inner blocks, e.g., ABB AAB -> AAABBB, but that may be too time-consuming without extra space). Traverse the list again and use the same constant space to store, in two interval trees, the pointers to each block of A's and B's. For example, with segments of 4:
ABBAABABAA => AABB AABB AA + pointers to blocks of A's and B's
Sequential access to A's or B's would be immediately available, and random access would come from using the interval tree to locate a specific A or B. One option could be to have the intervals number the A's and B's; e.g., to find the 4th A, look for the interval containing 4.
For sorting, an array of 1,000,000 four-byte elements (3.8MB) would suffice to store the indexes, using one bit in each element for recording visited indexes during the swaps; and two temporary variables the size of the largest A or B. For a list of one billion elements, the maximum combined interval trees would number 4000 intervals. Using 128 bits per interval, we can easily store numbered intervals for the A's and B's, and we can use the unused bits as pointers to the block index (10 bits) and offset in the case of B (20 bits). 4000*16 bytes = 62.5KB. We can store an additional array with only the B blocks' offsets in 4KB. Total space under 5MB for a list of one billion elements. (Space is in fact dependent on n but because it is extremely small in relation to n, for all practical purposes, we may consider it O(1).)
Time for sorting the million-element segments would be - one pass to count and index (here we can also accumulate the intervals and B offsets) and one pass to sort. Constructing the interval tree is O(nlogn) but n here is only 4000 (0.00005 of the one-billion list count). Total time O(2n) = O(n)
This should be possible with a bit of dynamic programming.
It works a bit like counting sort, but with a key difference. Make arrays of size n for both A and B, count_a[n] and count_b[n], and fill them with how many A's or B's there have been before index i.
After just one loop, we can use these arrays to look up the correct index for any element in O(1). Like this:
int final_index(char id, int pos){
  if(id == 'A')
    return count_a[pos];
  else
    return count_a[n-1] + count_b[pos];
}
Finally, to meet the total O(n) requirement, the swapping needs to be done in a smart order. One simple option is to have a recursive swapping procedure that doesn't actually perform any swapping until both elements would be placed in their correct final positions. EDIT: This is actually not true. Even naive swapping will need only O(n) swaps. But this recursive strategy will give you the absolute minimum number of required swaps.
Note that in the general case this would be a very bad sorting algorithm, since it has a memory requirement of O(n · element value range).

Algorithm/Data Structure for finding combinations of minimum values easily

I have a symmetric matrix like the one shown in the image attached below.
I've made up the notation A.B which represents the value at grid point (A, B). Furthermore, writing A.B.C gives me the minimum grid point value like so: MIN((A,B), (A,C), (B,C)).
As another example A.B.D gives me MIN((A,B), (A,D), (B,D)).
My goal is to find the minimum values for ALL combinations of letters (not repeating) for one row at a time, e.g. for this example I need to find the min values with respect to row A, which are given by the calculations:
A.B = 6
A.C = 8
A.D = 4
A.B.C = MIN(6,8,6) = 6
A.B.D = MIN(6, 4, 4) = 4
A.C.D = MIN(8, 4, 2) = 2
A.B.C.D = MIN(6, 8, 4, 6, 4, 2) = 2
I realize that certain calculations can be reused which becomes increasingly important as the matrix size increases, but the problem is finding the most efficient way to implement this reuse.
Can someone point me in the right direction to an efficient algorithm/data structure I can use for this problem?
You'll want to think about the lattice of subsets of the letters, ordered by inclusion. Essentially, you have a value f(S) given for every subset S of size 2 (that is, every off-diagonal element of the matrix - the diagonal elements don't seem to occur in your problem), and the problem is to find, for each subset T of size greater than two, the minimum f(S) over all S of size 2 contained in T. (And then you're interested only in sets T that contain a certain element "A" - but we'll disregard that for the moment.)
First of all, note that if you have n letters, this amounts to asking Omega(2^n) questions, roughly one for each subset. (Excluding the zero- and one-element subsets and those that don't include "A" saves you n + 1 sets and a factor of two, respectively, which is allowed for big Omega.) So if you want to store all these answers for even moderately large n, you'll need a lot of memory. If n is large in your applications, it might be best to store some collection of pre-computed data and do some computation whenever you need a particular data point; I haven't thought about what would work best, but for example computing data only for a binary tree contained in the lattice would not necessarily gain you anything over precomputing nothing at all.
With these things out of the way, let's assume you actually want all the answers computed and stored in memory. You'll want to compute these "layer by layer", that is, starting with the three-element subsets (since the two-element subsets are already given by your matrix), then four-element, then five-element, etc. This way, for a given subset S, when we're computing f(S) we will already have computed all f(T) for T strictly contained in S. There are several ways that you can make use of this, but I think the easiest might be to use two such strictly contained subsets, as follows: let t1 and t2 be two different elements of T that you may select however you like; let S be the subset of T that you get when you remove t1 and t2. Write S1 for S plus t1 and write S2 for S plus t2. Now every pair of letters contained in T is either fully contained in S1, or it is fully contained in S2, or it is {t1, t2}. Look up f(S1) and f(S2) in your previously computed values, then look up f({t1, t2}) directly in the matrix, and store f(T) = the minimum of these 3 numbers.
If you never select "A" for t1 or t2, then indeed you can compute everything you're interested in while not computing f for any sets T that don't contain "A". (This is possible because the steps outlined above are only interesting whenever T contains at least three elements.) Good! This leaves just one question - how to store the computed values f(T). What I would do is use a 2^(n-1)-sized array; represent each subset-of-your-alphabet-that-includes-"A" by the (n-1)-bit number where the i-th bit is 1 whenever the (i+1)-th letter is in that set (so 0010110, which has bits 1, 2, and 4 set, represents the subset {"A", "C", "D", "F"} out of the alphabet "A" .. "H" - note I'm counting bits starting at 0 from the right, and letters starting at "A" = 0). This way, you can actually iterate through the sets in numerical order and don't need to think about how to iterate through all k-element subsets of an n-element set. (You do need to include a special case for when the set under consideration has 0 or 1 element, in which case you'll want to do nothing, or 2 elements, in which case you just copy the value from the matrix.)
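For concreteness, here is a small Python sketch of this layer-by-layer computation (the names pair_min and all_subset_minima are mine; letters are indexed 0..n-1 with "A" = 0, and only subsets containing "A" are represented, exactly as in the bit encoding above):

def all_subset_minima(pair_min, n):
    # f[mask] = min over all pairs of letters inside the subset encoded by mask.
    # Bit i of mask (i = 0..n-2) says whether letter i+1 is in the set; "A" is always in.
    size = 1 << (n - 1)
    f = [float("inf")] * size
    for mask in range(size):
        members = [0] + [i + 1 for i in range(n - 1) if mask >> i & 1]
        if len(members) < 2:
            continue                                    # 0- or 1-element sets: nothing to do
        if len(members) == 2:
            f[mask] = pair_min[members[0]][members[1]]  # copy straight from the matrix
            continue
        t1, t2 = members[-1], members[-2]               # two elements of T, never "A"
        m1 = mask & ~(1 << (t1 - 1))                    # T with t1 removed
        m2 = mask & ~(1 << (t2 - 1))                    # T with t2 removed
        f[mask] = min(f[m1], f[m2], pair_min[t1][t2])
    return f

# With the matrix from the question (A, B, C, D = 0..3):
M = [[0, 6, 8, 4],
     [6, 0, 6, 4],
     [8, 6, 0, 2],
     [4, 4, 2, 0]]
f = all_subset_minima(M, 4)
# f[0b111] == 2, matching A.B.C.D = 2 from the question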
Well, it looks simple to me, but perhaps I misunderstand the problem. I would do it like this:
let P be a pattern string in your notation X1.X2. ... .Xn, where Xi is a column in your matrix
first compute the array CS = [ (X1, X2), (X1, X3), ... (X1, Xn) ], which contains all combinations of X1 with every other element in the pattern; CS has n-1 elements, and you can easily build it in O(n)
now you must compute min (CS), i.e. finding the minimum value of the matrix elements corresponding to the combinations in CS; again you can easily find the minimum value in O(n)
done.
Note: since your matrix is symmetric, given P you just need to compute CS by combining the first element of P with all other elements: (X1, Xi) is equal to (Xi, X1)
If your matrix is very large, and you want to do some optimization, you may consider prefixes of P: let me explain with an example
when you have solved the problem for P = X1.X2.X3, store the result in an associative map, where X1.X2.X3 is the key
later on, when you solve a problem P' = X1.X2.X3.X7.X9.X10.X11 you search for the longest prefix of P' in your map: you can do this by starting with P' and removing one component (Xi) at a time from the end until you find a match in your map or you end up with an empty string
if you find a prefix of P' in you map then you already know the solution for that problem, so you just have to find the solution for the problem resulting from combining the first element of the prefix with the suffix, and then compare the two results: in our example the prefix is X1.X2.X3, and so you just have to solve the problem for
X1.X7.X9.X10.X11, and then compare the two values and choose the min (don't forget to update your map with the new pattern P')
if you don't find any prefix, then you must solve the entire problem for P' (and again don't forget to update the map with the result, so that you can reuse it in the future)
This technique is essentially a form of memoization.
