I am trying to implement the Parallel Algorithm for Longest Common Subsequence Problem described in http://www.iaeng.org/publication/WCE2010/WCE2010_pp499-504.pdf
But i am having a problem with the variable C in Equation 6 on page 4
The paper refered to C on at the end of page 3 as
C as Let C[1 : l] bethe finite alphabet
I am not sure what is ment by this, as i guess it would it with the 2 strings ABCDEF and ABQXYEF be ABCDEFQXY. But what if my 2 stings is a list of objects (Where my match test for an example is obj1.Name = obj2.Name), what would my C be here? just a union on the 2 arrays?
Having read and studied the paper, I can say that C is supposed to be an array holding the alphabet of your strings, where the alphabet size (and, thus, the size of C) is l.
By the looks of your question, however, I feel the need to go deeper on this, because it looks like you didn't get the whole picture yet. What is P[i,j], and why do you need it? The answer is that you don't really need it, but it's an elegant optimization. In page 3, a little bit before Theorem 1, it is said that:
[...] This process ends when j-k = 0 at the k-th step, or a(i) =
b(j-k) at the k-th step. Assume that the process stops at the k-th
step, and k must be the minimum number that makes a(i) = b(j-k) or j-k
= 0. [...]
The recurrence relation in (3) is equivalent to (2), but the fundamental difference is that (2) expands recursively, whereas with (3) you never have recursive calls, provided that you know k. In other words, the magic behind (3) not expanding recursively is that you somehow know the spot where the recursion on (2) would stop, so you look at that cell immediately, rather than recursively approaching it.
Ok then, but how do you find out the value for k? Since k is the spot where (2) reaches a base case, it can be seen that k is the amount of columns that you have to "go back" on B until you are either off the limits (i.e., the first column that is filled with 0's) OR you find a match between a character in B and a character in A (which corresponds to the base case conditions in (2)). Remember that you will be matching the character a(i-1), where i is the current row.
So, what you really want is to find the last position in B before j where the character a(i-1) appears. If no such character ever appears in B before j, then that would be equivalent to reaching the case i = 0 or j-1 = 0 in (2); otherwise, it's the same as reaching a(i) = b(j-1) in (2).
Let's look at an example:
Consider that the algorithm is working on computing the values for i = 2 and j = 3 (the row and column are highlighted in gray). Imagine that the algorithm is working on the cell highlighted in black and is applying (2) to determine the value of S[2,2] (the position to the left of the black one). By applying (2), it would then start by looking at a(2) and b(2). a(2) is C, b(2) is G, to there's no match (this is the same procedure as the original, well-known algorithm). The algorithm now wants to find the value of S[2,2], because it is needed to compute S[2,3] (where we are). S[2,2] is not known yet, but the paper shows that it is possible to determine that value without refering to the row with i = 2. In (2), the 3rd case is chosen: S[2,2] = max(S[1, 2], S[2, 1]). Notice, if you will, that all this formula is doing is looking at the positions that would have been used to calculate S[2,2]. So, to rephrase that: we're computing S[2,3], we need S[2,2] for that, we don't know it yet, so we're going back on the table to see what's the value of S[2,2] in pretty much the same way we did in the original, non-parallel algorithm.
When will this stop? In this example, it will stop when we find the letter C (this is our a(i)) in TGTTCGACA before the second T (the letter on the current column) OR when we reach column 0. Because there is no C before T, we reach column 0. Another example:
Here, (2) would stop with j-1 = 5, because that is the last position in TGTTCGACA where C shows up. Thus, the recursion reaches the base case a(i) = b(j-1) when j-1 = 5.
With this in mind, we can see a shortcut here: if you could somehow know the amount k such that j-1-k is a base case in (2), then you wouldn't have to go through the score table to find the base case.
That's the whole idea behind P[i,j]. P is a table where you lay down the whole alphabet vertically (on the left side); the string B is, once again, placed horizontally in the upper side. This table is computed as part of a preprocessing step, and it will tell you exactly what you will need to know ahead of time: for each position j in B, it says, for each character C[i] in C (the alphabet), what is the last position in B before j where C[i] is found (note that i is used to index C, the alphabet, and not the string A. Maybe the authors should have used another index variable to avoid confusion).
So, you can think of the semantics for an entry P[i,j] as something along the lines of: The last position in B where I saw C[i] before position j. For example, if you alphabet is sigma = {A, E, I, O, U}, and B = "AOOIUEI", thenP` is:
Take the time to understand this table. Note the row for O. Remember: this row lists, for every position in B, where is the last known "O". Only when j = 3 will we have a value that is not zero (it's 2), because that's the position after the first O in AOOIUEI. This entry says that the last position in B where O was seen before is position 2 (and, indeed, B[2] is an O, the one that follows A). Notice, in that same row, that for j = 4, we have the value 3, because now the last position for O is the one that correspnds to the second O in B (and since no more O's exist, the rest of the row will be 3).
Recall that building P is a preprocessing step necessary if you want to easily find the value of k that makes the recursion from equation (2) stop. It should make sense by now that P[i,j] is the k you're looking for in (3). With P, you can determine that value in O(1) time.
Thus, the C[i] in (6) is a letter of the alphabet - the letter that we are currently considering. In the example above, C = [A,E,I,O,U], and C[1] = A, C[2] = E, etc. In equaton (7), c is the position in C where a(i) (the current letter of string A being considered) lives. It makes sense: after all, when building the score table position S[i,j], we want to use P to find the value of k - we want to know where was the last time we saw an a(i) in B before j. We do that by reading P[index_of(a(i)), j].
Ok, now that you understand the use of P, let's see what's happening with your implementation.
About your specific case
In the paper, P is shown as a table that lists the whole alphabet. It is a good idea to iterate through the alphabet because the typical uses of this algorithm are in bioinformatics, where the alphabet is much, much smaller than the string A, making the iteration through the alphabet cheaper.
Because your strings are sequences of objects, your C would be the set of all possible objects, so you'd have to build a table P with the set of all possible object instance (nonsense, of course). This is definitely a case where the alphabet size is huge when compared to your string size. However, note that you will only be indexing P in those rows that correspond to letters from A: any row in P for a letter C[i] that is not in A is useless and will never be used. This makes your life easier, because it means you can build P with the string A instead of using the alphabet of every possible object.
Again, an example: if your alphabet is AEIOU, A is EEI and B is AOOIUEI, you will only be indexing P in the rows for E and I, so that's all you need in P:
This works and suffices, because in (7), P[c,j] is the entry in P for the character c, and c is the index of a(i). In other words: C[c] always belongs to A, so it makes perfect sense to build P for the characters of A instead of using the whole alphabet for the cases where the size of A is considerably smaller than the size of C.
All you have to do now is to apply the same principle to whatever your objects are.
I really don't know how to explain it any better. This may be a little dense at first. Make sure to re-read it until you really get it - and I mean every little detail. You have to master this before thinking about implementing it.
NOTE: You said you were looking for a credible and / or official source. I'm just another CS student, so I'm not an official source, but I think I can be considered "credible". I've studied this before and I know the subject. Happy coding!
Related
Longest Odd Pallindromes
Problem Description
Given a string S(consisting of only lower case characters) and Q queries.
In each query you will given an integer i and your task is to find the length of longest odd palindromic substring whose middle index is i. Note:
1.) Assume 1 based indexing.
2.) Longest odd palindrome: A palindrome substring whose length is odd.
Problem Constraints
1<=|s|,Q<=1e5
1<=i<=|s|
Input Format
First argument A is string S.
Second argument B is an array of integers where B[i] denotes the query index of ith query.
Output Format
Return an array of integers where ith integer denotes the answer of ith query.
Is there any better way to solve this question other than brute force, that is, when we generate all the palindromic substrings and check
There is Manacher's algorithm that calculates number of palindromes centered at i-th index in linear time.
After precalculation stage you can answer query in O(1). I changed result array to contain lengths of the longest palindromes centered at every position.
Python code (link contains C++ one)
def manacher_odd(s):
n = len(s)
odds = []
l, r = 0, -1
for i in range(n):
k = min(odds[l+r-i], r-i+1) if i<=r else 1
while (i+k < n) and (i-k >= 0) and (s[i+k]==s[i-k]):
k += 1
odds.append(k)
if (i+k-1 > r):
l, r = i-k+1, i+k-1
for i in range(n):
odds[i] = 2 * odds[i] - 1
return odds
print(manacher_odd("abaaaba"))
[1, 3, 1, 7, 1, 3, 1]
There are 2 possible optimizations they might be looking for.
First, you can do an initial run over S first, cleverly building a lookup table, and then your query will just use that, which I think would be faster if B is long.
Alternatively, if not doing a look up, then while you're searching at index i, you'll potentially search neighboring indexes at the same time. As you check i, you can also be checking i+1, i-1, i+2, i-2, etc... as you go, and save that answer for later. This seems the less promising route to me, so I want to dive into the first idea more.
Finally, if B is quite short, then the best answer might be brute force, actually. It's good to know when to keep it simple.
Initial run method
One optimization that comes to mind for a pre-process run is as follows:
Search the next unknown index, brute force it by looking forwards and back, while recording the frequency of each letter (including the middle one.) If a palindrome of 1 or 3 was found, move to the next index and repeat.
If a palindrome or 5 or longer was found, calculate mid points of any letters that showed up more than twice which are to the right of the current index.
Any point between the current index and the index of the last palindrome letter that isn't in the mid-points list is a 1 for length.
This means you'll search all the midpoints found in (2). After that, you'll continue searching from the index of the last letter of the palindrome found in (2).
An example
Let's say S starts with: ```a, b, c, d, a, f, g, f, g, f, a, d, c, b, a, ...``` and you have checked from ```i = 2``` up to ```i = 7``` but found nothing except a run of 3 at ```i = 7```. Now, you check index ```i = 8```. You will find a palindrome extending out 7 letters in each direction, for a total of 15, but as you check, note any letters that show up more than twice. In this case, there are 3 ```f```s and 4 ```a```s. Find any mid points these pairs have that are right of the current index (8). In this case, 2 ```f```s have a mid point of i=9, the 2 right-most ```a```s have a midpoint of i=13. Once you're done looking at i=8, then you can skip any index not on your list, all the way up to the last letter you found in i=8. For example, we only have to check i=9 and i=13, and then start from i=15, checking every step. We've been able to skip checking i=10, 11, 12, and 14.
This article discusses approximate substring matching techniques that utilize a suffix tree to improve matching time. Each answer addresses a different algorithm.
Approximate substring matching attempts to find a substring (pattern) P in a string T allowing up to k mismatches.
To learn how to create a suffix tree, click here. However, some algorithms require additional preprocessing.
I invite people to add new algorithms (even if it's incomplete) and improve answers.
This was the original question that started this thread.
Professor Esko Ukkonen published a paper: Approximate string-matching over suffix trees. He discusses 3 different algorithms that have different matching times:
Algorithm A: O(mq + n)
Algorithm B: O(mq log(q) + size of the output)
Algorithm C: O(m^2q + size of the output)
Where m is the length of the substring, n is the length of the search string, and q is the edit distance.
I've been trying to understand algorithm B but I'm having trouble following the steps. Does anyone have experience with this algorithm? An example or pseudo algorithm would be greatly appreciated.
In particular:
What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
Here's what I believe (I stand to be corrected):
On page seven, we're introduced to suffix tree concepts; a state is effectively a node in the suffix tree: let root denote the initial state.
g(a, c) = b where a and b are nodes in the tree and c is a character or substring in the tree. So this represents a transition; from a, following the edges represented by c, we move to node b. This is referred to as the go-to transition. So for the suffix tree below, g(root, 'ccb') = red node
Key(a) = edge sequence where a represents a node in the tree. For example, Key(red node) = 'ccb'. So g(root, Key(red node)) = red node.
Keys(Subset of node S) = { Key(node) | node ∈ S}
There is a suffix function for nodes a and b, f(a) = b: for all (or perhaps there may exist) a ≠ root, there exists a character c, a substring x, and a node b such that g(root, cx) = a and g(root, x) = b. I think that this means, for the suffix tree example above, that f(pink node) = green node where c = 'a' and x = 'bccb'.
There is a mapping H that contains a node from the suffix tree and a value pair. The value is given by loc(w); I'm still uncertain how to evaluate the function. This dictionary contains nodes that have not been eliminated.
extract-min(H) refers to attaining the entry with the smallest value in the pair (node, loc(w)) from H.
The crux of the algorithm seems to be related to how loc(w) is evaluated. I've constructed my suffix tree using the combined answer here; however, the algorithms work on a suffix trie (uncompressed suffix tree). Therefore concepts like the depth need to be maintained and processed differently. In the suffix trie the depth would represent the suffix length; in a suffix tree, the depth would simply represent the node depth in the tree.
You are doing well. I don't have familiarity with the algorithm, but have read the paper today. Everything you wrote is correct as far as it goes. You are right that some parts of the explanation assume a lot.
Your Questions
1.What does size of the output refer to in terms of the suffix tree or input strings? The final output phase lists all occurrences of Key(r) in T, for all states r marked for output.
The output consists of the maximal k-distance matches of P in T. In particular you'll get the final index and length for each. So clearly this is also O(n) (remember big-O is an upper bound), but may be smaller. This is a nod to the fact that it's impossible to generate p matches in less than O(p) time. The rest of the time bound concerns only the pattern length and the number of viable prefixes, both of which can be arbitrarily small, so the output size can dominate. Consider k=0 and the input is 'a' repeated n times with the pattern 'a'.
2.Looking at Algorithm C, the function dp is defined (page four); I don't understand what index i represents. It isn't initialized and doesn't appear to increment.
You're right. It's an error. The loop index should be i. What about j? This is the index of the column corresponding to the input character being processed in the dynamic program. It should really be an input parameter.
Let's take a step back. The Example table on page 6 is computed left-to-right, column-by-column using equations (1-4) given earlier. These show that only the previous columns of D and L are needed to get the next. Function dp is just an implementation of this idea of computing column j from j-1. Column j of D and L are called d and l respectively. Column j-1 D and L are d' and l', the function input parameters.
I recommend you work through the dynamic program until you understand it well. The algorithm is all about avoiding duplicate column computations. Here "duplicate" means "having the same values in the essential part", because that's all that matters. The inessential parts can't affect the answer.
The uncompressed trie is just the compressed one expanded in the obvious way to have one edge per character. Except for the idea of "depth", this is unimportant. In the compressed tree, depth(s) is just the length of the string - which he calls Key(s) - needed to get from root node s.
Algorithm A
Algorithm A is just a clever caching scheme.
All his theorems and lemmas show that 1) we only need to worry about the essential parts of columns and 2) the essential part of a column j is completely determined by the viable prefix Q_j. This is the longest suffix of the input ending at j that matches a prefix of the pattern (within edit distance k). In other words, Q_j is the maximal start of a k-edit match at the end of the input considered so far.
With this, here's pseudo-code for Algorithm A.
Let r = root of (uncompressed) suffix trie
Set r's cached d,l with formulas at end page 7 (0'th dp table columns)
// Invariant: r contains cached d,l
for each character t_j from input text T in sequence
Let s = g(r, t_j) // make the go-to transition from r on t_j
if visited(s)
r = s
while no cached d,l on node r
r = f(r) // traverse suffix edge
end while
else
Use cached d',l' on r to find new columns (d,l) = dp(d',l')
Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
r = s
while depth(r) != |Q_j|
mark r visited
r = f(r) // traverse suffix edge
end while
mark r visited
set cached d,l on node r
end if
end for
I've left out the output step for simplicity.
What is traversing suffix edges about? When we do this from a node r where Key(r) = aX (leading a followed by some string X), we are going to the node with Key X. The consequence: we are storing each column corresponding to a viable prefix Q_h at the trie node for the suffix of the input with prefix Q_h. The function f(s) = r is the suffix transition function.
For what it's worth, the Wikipedia picture of a suffix tree shows this pretty well. For example, if from the node for "NA" we follow the suffix edge, we get to the node for "A" and from there to "". We are always cutting off the leading character. So if we label state s with Key(s), we have f("NA") = "A" and f("A") = "". (I don't know why he doesn't label states like this in the paper. It would simplify many explanations.)
Now this is very cool because we are computing only one column per viable prefix. But it's still expensive because we are inspecting each character and potentially traversing suffix edges for each one.
Algorithm B
Algorithm B's intent is to go faster by skipping through the input, touching only those characters likely to produce a new column, i.e. those that are the ends of input that match a previously unseen viable prefix of the pattern.
As you'd suspect, the key to the algorithm is the loc function. Roughly speaking, this will tell where the next "likely" input character is. The algorithm is quite a bit like A* search. We maintain a min heap (which must have a delete operation) corresponding to the set S_i in the paper. (He calls it a dictionary, but this is not a very conventional use of the term.) The min heap contains potential "next states" keyed on the position of the next "likely character" as described above. Processing one character produces new entries. We keep going until the heap is empty.
You're absolutely right that here he gets sketchy. The theorems and lemmas are not tied together to make an argument on correctness. He assumes you will redo his work. I'm not entirely convinced by this hand-waving. But there does seem to be enough there to "decode" the algorithm he has in mind.
Another core concept is the set S_i and in particular the subset that remains not eliminated. We'll keep these un-eliminated states in the min-heap H.
You're right to say that the notation obscures the intent of S_i. As we process the input left-to-right and reach position i, we have amassed a set of viable prefixes seen so far. Each time a new one is found, a fresh dp column is computed. In the author's notation these prefixes would be Q_h for all h<=i or more formally { Q_h | h <= i }. Each of these has a path from the root to a unique node. The set S_i consists of all the states we get by taking one more step from all these nodes along go-to edges in the trie. This produces the same result as going through the whole text looking for each occurrence of Q_h and the next character a, then adding the state corresponding to Q_h a into S_i, but it's faster. The Keys for the S_i states are exactly the right candidates for the next viable prefix Q_{i+1}.
How do we choose the right candidate? Pick the one that occurs next after position i in the input. This is where loc(s) comes in. The loc value for a state s is just what I just said above: the position in the input starting at i where the viable prefix associated with that state occurs next.
The important point is that we don't want to just assign the newly found (by pulling the min loc value from H) "next" viable prefix as Q_{i+1} (the viable prefix for dp column i+1) and go on to the next character (i+2). This is where we must set the stage to skip ahead as far as possible to the last character k (with dp column k) such Q_k = Q_{i+1}. We skip ahead by following suffix edges as in Algorithm A. Only this time we record our steps for future use by altering H: removing elements, which is the same as eliminating elements from S_i, and modifying loc values.
The definition of function loc(s) is bare, and he never says how to compute it. Also unmentioned is that loc(s) is also a function of i, the current input position being processed (that he jumps from j in earlier parts of the paper to i here for the current input position is unhelpful.) The impact is that loc(s) changes as input processing proceeds.
It turns out that the part of the definition that applies to eliminated states "just happens" because states are marked eliminated upon removal form H. So for this case we need only check for a mark.
The other case - un-eliminated states - requires that we search forward in the input looking for the next occurrence in the text that is not covered by some other string. This notion of covering is to ensure we are always dealing with only "longest possible" viable prefixes. Shorter ones must be ignored to avoid outputting other than maximal matches. Now, searching forward sounds expensive, but happily we have a suffix trie already constructed, which allows us to do it in O(|Key(s)|) time. The trie will have to be carefully annotated to return the relevant input position and to avoid covered occurrences of Key(s), but it wouldn't be too hard. He never mentions what to do when the search fails. Here loc(s) = \infty, i.e. it's eliminated and should be deleted from H.
Perhaps the hairiest part of the algorithm is updating H to deal with cases where we add a new state s for a viable prefix that covers Key(w) for some w that was already in H. This means we have to surgically update the (loc(w) => w) element in H. It turns out the suffix trie yet again supports this efficiently with its suffix edges.
With all this in our heads, let's try for pseudocode.
H = { (0 => root) } // we use (loc => state) for min heap elements
until H is empty
(j => s_j) = H.delete_min // remove the min loc mapping from
(d, l) = dp(d', l', j) where (d',l') are cached at paraent(s_j)
Compute |Q_j| = l[h], h = argmax(i).d[i]<=k as in the paper
r = s_j
while depth(r) > |Q_j|
mark r eliminated
H.delete (_ => r) // loc value doesn't matter
end while
set cached d,l on node r
// Add all the "next states" reachable from r by go-tos
for all s = g(r, a) for some character a
unless s.eliminated?
H.insert (loc(s) => s) // here is where we use the trie to find loc
// Update H elements that might be newly covered
w = f(s) // suffix transition
while w != null
unless w.eliminated?
H.increase_key(loc(w) => w) // using explanation in Lemma 9.
w = f(w) // suffix transition
end unless
end while
end unless
end for
end until
Again I've omitted the output for simplicity. I will not say this is correct, but it's in the ballpark. One thing is that he mentions we should only process Q_j for nodes not before "visited," but I don't understand what "visited" means in this context. I think visited states by Algorithm A's definition won't occur because they've been removed from H. It's a puzzle...
The increase_key operation in Lemma 9 is hastily described with no proof. His claim that the min operation is possible in O(log |alphabet|) time is leaving a lot to the imagination.
The number of quirks leads me to wonder if this is not the final draft of the paper. It is also a Springer publication, and this copy on-line would probably violate copyright restrictions if it were precisely the same. It might be worth looking in a library or paying for the final version to see if some of the rough edges were knocked off during final review.
This is as far as I can get. If you have specific questions, I'll try to clarify.
During my current preparation for interview, I encountered a question for which I am having some difficulty to get optimal solution,
We are given an array A and an integer Sum, we need to find all distinct sub sequences of A whose sum equals Sum.
For eg. A={1,2,3,5,6} Sum=6 then answer should be
{1,2,3}
{1,5}
{6}
Presently I can think of two ways of doing this,
Use Recursion ( which I suppose should be last thing to consider for an interview question)
Use Integer Partitioning to partition Sum and check whether the elements of partition are present in A
Please guide my thoughts.
I agree with Jason. This solution comes to mind:
(complexity is O(sum*|A|) if you represent the map as an array)
Call the input set A and the target sum sum
Have a map of elements B, with each element being x:y, where x (the map key) is the sum and y (the map value) is the number of ways to get to it.
Starting of, add 0:1 to the map - there is 1 way to get to 0 (obviously by using no elements)
For each element a in A, consider each element x:y in B.
If x+a > sum, don't do anything.
If an element with the key x+a already exists in B, say that element is x+a:z, modify it to x+a:y+z.
If an element with the key doesn't exist, simply add x+a:y to the set.
Look up the element with key sum, thus sum:x - x is our desired value.
If B is sorted (or an array), you can simply skip the rest of the elements in B during the "don't do anything" step.
Tracing it back:
The above just gives the count, this will modify it to give the actual subsequences.
At each element in B, instead of the sum, store all the source elements and the elements used to get there (so have a list of pairs at each element in B).
For 0:1 there is no source elements.
For x+a:y, the source element is x and the element to get there is a.
During the above process, if an element with the key already exists, enqueue the pair x/a to the element x+a (enqueue is an O(1) operation).
If an element with the key doesn't exist, simply create a list with one pair x/a at the element x+a.
To reconstruct, simply start at sum and recursively trace your way back.
We have to be careful of duplicate sequences (do we?) and sequences with duplicate elements here.
Example - not tracing it back:
A={1,2,3,5,6}
sum = 6
B = 0:1
Consider 1
Add 0+1
B = 0:1, 1:1
Consider 2
Add 0+2:1, 1+2:1
B = 0:1, 1:1, 2:1, 3:1
Consider 3
Add 0+3:1 (already exists -> add 1 to it), 1+3:1, 2+1:1, 3+1:1
B = 0:1, 1:1, 2:1, 3:2, 4:1, 5:1, 6:1
Consider 5
B = 0:1, 1:1, 2:1, 3:2, 4:1, 5:2, 6:2
Generated sums thrown away = 7:1, 8:2, 9:1, 10:1, 11:1
Consider 6
B = 0:1, 1:1, 2:1, 3:2, 4:1, 5:2, 6:3
Generated sums thrown away = 7:1, 8:1, 9:2, 10:1, 11:2, 12:2
Then, from 6:3, we know we have 3 ways to get to 6.
Example - tracing it back:
A={1,2,3,5,6}
sum = 6
B = 0:{}
Consider 1
B = 0:{}, 1:{0/1}
Consider 2
B = 0:{}, 1:{0/1}, 2:{0/2}, 3:{1/2}
Consider 3
B = 0:{}, 1:{0/1}, 2:{0/2}, 3:{1/2,0/3}, 4:{1/3}, 5:{2/3}, 6:{3/3}
Consider 5
B = 0:{}, 1:{0/1}, 2:{0/2}, 3:{1/2,0/3}, 4:{1/3}, 5:{2/3,0/5}, 6:{3/3,1/5}
Generated sums thrown away = 7, 8, 9, 10, 11
Consider 6
B = 0:{}, 1:{0/1}, 2:{0/2}, 3:{1/2,0/3}, 4:{1/3}, 5:{2/3,0/5}, 6:{3/3,1/5,0/6}
Generated sums thrown away = 7, 8, 9, 10, 11, 12
Then, tracing back from 6: (not in {} means an actual element, in {} means a map entry)
{6}
{3}+3
{1}+2+3
{0}+1+2+3
1+2+3
Output {1,2,3}
{0}+3+3
3+3
Invalid - 3 is duplicate
{1}+5
{0}+1+5
1+5
Output {1,5}
{0}+6
6
Output {6}
This is a variant of the subset-sum problem. The subset-sum problem asks if there is a subset that sums to given a value. You are asking for all of the subsets that sum to a given value.
The subset-sum problem is hard (more precisely, it's NP-Complete) which means that your variant is hard too (it's not NP-Complete, because it's not a decision problem, but it is NP-Hard).
The classic approach to the subset-sum problem is either recursion or dynamic programming. It's obvious how to modify the recursive solution to the subset-sum problem to answer your variant. I suggest that you also take a look at the dynamic programming solution to subset-sum and see if you can modify it for your variant (tbc: I do not know if this is actually possible). That would certainly be a very valuable learning exercise whether or not is possible as it would certainly enhance your understanding of dynamic programming either way.
It would surprise me though, if the expected answer to your question is anything but the recursive solution. It's easy to come up with, and an acceptable approach to the problem. Asking for the dynamic programming solution on-the-fly is a bit much to ask.
You did, however, neglect to mention a very naïve approach to this problem: generate all subsets, and for each subset check if it sums to the given value or not. Obviously that's exponential, but it does solve the problem.
I assumed that given array contains distinct numbers.
Let's define function f(i, s) - which means we used some numbers in range [1, i] and the sum of used numbers is s.
Let's store all values in 2 dimensional matrix i.e. in cell (i, j) we will have value for f(i, j). Now if have already calculated values for cells which are located upper or lefter the cell (i, s) we can calculate value for f(i, s) i.e. f(i, s) = f(i - 1, s);(not to take i indexed number) and if(s >= a[i]) f(i, s) += f(i - 1, s - a[i]). And we can use bottom-up approach to fill all the matrix, setting [f(0, 0) = 1; f(0, i) = 0; 1 <= i <= s], [f(i, 0) = 1;1<=i<=n;]. If we calculated all the matrix then we have answer in cell f(n,S); Thus we have total time complexity O(n*s) and memory complexity O(n*s);
We can improve memory complexity if we note that in every iteration we need only information from previous row, it means that we can store matrix of size 2xS not nxS. We reduced memory complexity up to linear to S. This problem is NP complete thus we don't have polynomial algorithm for this and this approach is the best thing.
I have a symmetric matrix like shown in the image attached below.
I've made up the notation A.B which represents the value at grid point (A, B). Furthermore, writing A.B.C gives me the minimum grid point value like so: MIN((A,B), (A,C), (B,C)).
As another example A.B.D gives me MIN((A,B), (A,D), (B,D)).
My goal is to find the minimum values for ALL combinations of letters (not repeating) for one row at a time e.g for this example I need to find min values with respect to row A which are given by the calculations:
A.B = 6
A.C = 8
A.D = 4
A.B.C = MIN(6,8,6) = 6
A.B.D = MIN(6, 4, 4) = 4
A.C.D = MIN(8, 4, 2) = 2
A.B.C.D = MIN(6, 8, 4, 6, 4, 2) = 2
I realize that certain calculations can be reused which becomes increasingly important as the matrix size increases, but the problem is finding the most efficient way to implement this reuse.
Can point me in the right direction to finding an efficient algorithm/data structure I can use for this problem?
You'll want to think about the lattice of subsets of the letters, ordered by inclusion. Essentially, you have a value f(S) given for every subset S of size 2 (that is, every off-diagonal element of the matrix - the diagonal elements don't seem to occur in your problem), and the problem is to find, for each subset T of size greater than two, the minimum f(S) over all S of size 2 contained in T. (And then you're interested only in sets T that contain a certain element "A" - but we'll disregard that for the moment.)
First of all, note that if you have n letters, that this amounts to asking Omega(2^n) questions, roughly one for each subset. (Excluding the zero- and one-element subsets and those that don't include "A" saves you n + 1 sets and a factor of two, respectively, which is allowed for big Omega.) So if you want to store all these answers for even moderately large n, you'll need a lot of memory. If n is large in your applications, it might be best to store some collection of pre-computed data and do some computation whenever you need a particular data point; I haven't thought about what would work best, but for example computing data only for a binary tree contained in the lattice would not necessarily help you anything beyond precomputing nothing at all.
With these things out of the way, let's assume you actually want all the answers computed and stored in memory. You'll want to compute these "layer by layer", that is, starting with the three-element subsets (since the two-element subsets are already given by your matrix), then four-element, then five-element, etc. This way, for a given subset S, when we're computing f(S) we will already have computed all f(T) for T strictly contained in S. There are several ways that you can make use of this, but I think the easiest might be to use two such subset S: let t1 and t2 be two different elements of T that you may select however you like; let S be the subset of T that you get when you remove t1 and t2. Write S1 for S plus t1 and write S2 for S plus t2. Now every pair of letters contained in T is either fully contained in S1, or it is fully contained in S2, or it is {t1, t2}. Look up f(S1) and f(S2) in your previously computed values, then look up f({t1, t2}) directly in the matrix, and store f(T) = the minimum of these 3 numbers.
If you never select "A" for t1 or t2, then indeed you can compute everything you're interested in while not computing f for any sets T that don't contain "A". (This is possible because the steps outlined above are only interesting whenever T contains at least three elements.) Good! This leaves just one question - how to store the computed values f(T). What I would do is use a 2^(n-1)-sized array; represent each subset-of-your-alphabet-that-includes-"A" by the (n-1) bit number where the ith bit is 1 whenever the (i+1)th letter is in that set (so 0010110, which has bits 2, 4, and 5 set, represents the subset {"A", "C", "D", "F"} out of the alphabet "A" .. "H" - note I'm counting bits starting at 0 from the right, and letters starting at "A" = 0). This way, you can actually iterate through the sets in numerical order and don't need to think about how to iterate through all k-element subsets of an n-element set. (You do need to include a special case for when the set under consideration has 0 or 1 element, in which case you'll want to do nothing, or 2 elements, in which case you just copy the value from the matrix.)
Well, it looks simple to me, but perhaps I misunderstand the problem. I would do it like this:
let P be a pattern string in your notation X1.X2. ... .Xn, where Xi is a column in your matrix
first compute the array CS = [ (X1, X2), (X1, X3), ... (X1, Xn) ], which contains all combinations of X1 with every other element in the pattern; CS has n-1 elements, and you can easily build it in O(n)
now you must compute min (CS), i.e. finding the minimum value of the matrix elements corresponding to the combinations in CS; again you can easily find the minimum value in O(n)
done.
Note: since your matrix is symmetric, given P you just need to compute CS by combining the first element of P with all other elements: (X1, Xi) is equal to (Xi, X1)
If your matrix is very large, and you want to do some optimization, you may consider prefixes of P: let me explain with an example
when you have solved the problem for P = X1.X2.X3, store the result in an associative map, where X1.X2.X3 is the key
later on, when you solve a problem P' = X1.X2.X3.X7.X9.X10.X11 you search for the longest prefix of P' in your map: you can do this by starting with P' and removing one component (Xi) at a time from the end until you find a match in your map or you end up with an empty string
if you find a prefix of P' in you map then you already know the solution for that problem, so you just have to find the solution for the problem resulting from combining the first element of the prefix with the suffix, and then compare the two results: in our example the prefix is X1.X2.X3, and so you just have to solve the problem for
X1.X7.X9.X10.X11, and then compare the two values and choose the min (don't forget to update your map with the new pattern P')
if you don't find any prefix, then you must solve the entire problem for P' (and again don't forget to update the map with the result, so that you can reuse it in the future)
This technique is essentially a form of memoization.
I apologize for not have the math background to put this question in a more formal way.
I'm looking to create a string of 796 letters (or integers) with certain properties.
Basically, the string is a variation on a De Bruijn sequence B(12,4), except order and repetition within each n-length subsequence are disregarded.
i.e. ABBB BABA BBBA are each equivalent to {AB}.
In other words, the main property of the string involves looking at consecutive groups of 4 letters within the larger string
(i.e. the 1st through 4th letters, the 2nd through 5th letters, the 3rd through 6th letters, etc)
And then producing the set of letters that comprise each group (repetitions and order disregarded)
For example, in the string of 9 letters:
A B B A C E B C D
the first 4-letter groups is: ABBA, which is comprised of the set {AB}
the second group is: BBAC, which is comprised of the set {ABC}
the third group is: BACE, which is comprised of the set {ABCE}
etc.
The goal is for every combination of 1-4 letters from a set of N letters to be represented by the 1-4-letter resultant sets of the 4-element groups once and only once in the original string.
For example, if there is a set of 5 letters {A, B, C, D, E} being used
Then the possible 1-4 letter combinations are:
A, B, C, D, E,
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE,
ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE,
ABCD, ABCE, ABDE, ACDE, BCDE
Here is a working example that uses a set of 5 letters {A, B, C, D, E}.
D D D D E C B B B B A E C C C C D A E E E E B D A A A A C B D D B
The 1st through 4th elements form the set: D
The 2nd through 5th elements form the set: DE
The 3rd through 6th elements form the set: CDE
The 4th through 7th elements form the set: BCDE
The 5th through 8th elements form the set: BCE
The 6th through 9th elements form the set: BC
The 7th through 10th elements form the set: B
etc.
* I am hoping to find a working example of a string that uses 12 different letters (a total of 793 4-letter groups within a 796-letter string) starting (and if possible ending) with 4 of the same letter. *
Here is a working solution for 7 letters:
AAAABCDBEAAACDECFAAADBFBACEAGAADEFBAGACDFBGCCCCDGEAFAGCBEEECGFFBFEGGGGFDEEEEFCBBBBGDCFFFFDAGBEGDDDDBE
Beware that in order to attempt exhaustive search (answer in VB is trying a naive version of that) you'll first have to solve the problem of generating all possible expansions while maintaining lexicographical order. Just ABC, expands to all perms of AABC, plus all perms of ABBC, plus all perms of ABCC which is 3*4! instead of just AABC. If you just concatenate AABC and AABD it would cover just 4 out of 4! perms of AABC and even that by accident. Just this expansion will bring you exponential complexity - end of game. Plus you'll need to maintain association between all explansions and the set (the set becomes a label).
Your best bet is to use one of known efficient De Bruijn constuctors and try to see if you can put your set-equivalence in there. Check out
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.674&rep=rep1&type=pdf
and
http://www.dim.uchile.cl/~emoreno/publicaciones/FINALES/copyrighted/IPL05-De_Bruijn_sequences_and_De_Bruijn_graphs_for_a_general_language.pdf
for a start.
If you know graphs, another viable option is to start with De Bruijn graph and formulate your set-equivalence as a graph rewriting. 2nd paper does De Bruijn graph partitioning.
BTW, try VB answer just for A,B,AB (at least expansion is small) - it will make AABBAB and construct ABBA or ABBAB (or throw in a decent language) both of which are wrong. You can even prove that it will always miss with 1st lexical expansions (that's what AAB, AAAB etc. are) just by examining first 2 passes (it will always miss 2nd A for NxA because (N-1)xA+B is in the string (1st expansion of {AB}).
Oh and if we could establish how many of each letters an optimal soluton should have (don't look at B(5,2) it's too easy and regular :-) a random serch would be feasible - you generate candidates with provable traits (like AAAA, BBBB ... are present and not touching and is has n1 A-s, n2 B-s ...) and random arrangement and then test whether they are solutions (checking is much faster than exhaustive search in this case).
Cool problem. Just a draft/psuedo algo:
dim STR-A as string = getall(ABCDEFGHIJKL)
//custom function to generate concat list of all 793 4-char combos.
//should be listed side-by-side to form 3172 character-long string.
//different ordering may ultimately produce different results.
//brute-forcing all orders of combos is too much work (793! is a big #).
//need to determine how to find optimal ordering, for this particular
//approach below.
dim STR-B as string = "" // to hold the string you're searching for
dim STR-C as string = "" // to hold the sub-string you are searching in
dim STR-A-NEW as string = "" //variable to hold your new string
dim MATCH as boolean = false //variable to hold matching status
while len(STR-A) > 0
//check each character in STR-A, which will be shorted by 1 char on each
//pass.
MATCH = false
STR-B = left(STR-A, 4)
STR-B = reduce(STR-B)
//reduce(str) is a custom re-usable function to sort & remove duplicates
for i as integer = 1 to len((STR-A) - 1)
STR-C = substr(STR-A, i, 4)
//gives you the 4-character sequence beginning at position i
STR-C = reduce(STR-C)
IF STR-B = STR-C Then
MATCH = true
exit for
//as long as there is even one match, you can throw-away the first
//letter
END IF
i = i+1
next
IF match = false then
//if you didn't find a match, then the first letter should be saved
STR-A-NEW += LEFT(STR-B, 1)
END IF
MATCH = false //re-init MATCH
STR-A = RIGHT(STR-A, LEN(STR-A) - 1) //re-init STR_A
wend
Anyway -- there could be problems at this, and you'd need to write another function to parse your result string (STR-A-NEW) to prove that it's a viable answer...
I've been thinking about this one and I'm sketching out a solution.
Let's call a string of four symbols a word and we'll write S(w) to denote the set of symbols in word w.
Each word abcd has "follow-on" words bcde where a,...,e are all symbols.
Let succ(w) be the set of follow-on words v for w such that S(w) != S(v). succ(w) is the set of successor words that can follow on from the first symbol in w if w is in a solution.
For each non-empty set of symbols s of cardinality at most four, let words(s) be the set of words w such that S(w) = s. Any solution must contain exactly one word in words(s) for each such set s.
Now we can do a reasonable search. The basic idea is this: say we are exploring a search path ending with word w. The follow-on word must be a non-excluded word in succ(w). A word v is excluded if the search path contains some word w such that v in words(S(w)).
You can be slightly more cunning: if we track the possible "predecessor" words to a set s (i.e., words w with a successor v such that v in words(s)) and reach a point where every predecessor of s is excluded, then we know we have reached a dead end, since we'll never be able to obtain s from any extension of the current search path.
Code to follow after the weekend, with a bit of luck...
Here is my proposal. I'll admit upfront this is a performance and memory hog.
This may be overkill, but have a class We'll call it UniqueCombination This will contain a unique 1-4 char reduced combination of the input set (i.e. A,AB,ABC,...) This will also contain a list of possible combination (AB {AABB,ABAB,BBAA,...}) this will need a method that determines if any possible combination overlaps any possible combination of another UniqueCombination by three characters. Also need a override that takes a string as well.
Then we start with the string "AAAA" then we find all of the UniqueCombinations that overlap this string. Then we find how many uniqueCombinations those possible matches overlap with. (we could be smart at this point an store this number.) Then we pick the one with the least number of overlaps greater than 0. Use up the ones with the least possible matches first.
Then we find a specific combination for the chosen UniqueCombination and add it to the final string. Remove this UniqueCombination from the list, then as we find overlaps for current string. rinse and repeat. (we could be smart and on subsequent runs while searching for overlaps we could remove any of the unreduced combination that are contained in the final string.)
Well that's my plan I will work on the code this weekend. Granted this does not guarantee that the final 4 characters will be 4 of the same letter (it might actually be trying to avoid that but I will look into that as well.)
If there is a non-exponential solution at all it may need to be formulated in terms of a recursive "growth" from a problem with a smaller size i.e to contruct B(N,k) from B(N-1,k-1) or from B(N-1,k) or from B(N,k-1).
Systematic construction for B(5,2) - one step at the time :-) It's bound to get more complex latter [card stands for cardinality, {AB} has card=2, I'll also call them 2-s, 3-s etc.] Note, 2-s and 3-s will be k-1 and k latter (I hope).
Initial. Start with k-1 result and inject symbols for singletons
(unique expansion empty intersection):
ABCDE -> AABBCCDDEE
mark used card=2 sets: AB,BC,CD,DE
Rewriting. Form card=3 sets to inject symbols into marked card=2.
1st feasible lexicographic expansion fires (may have to backtrack for k>2)
it's OK to use already marked 2-s since they'll all get replaced
but may have to do a verification pass for higher k
AB->ACB, BC->BCD, CD->CED, DE->DAE ==> AACBBDCCEDDAEEB
mark/verify used 2s
normally keep marking/unmarking during the construction but also keep keep old
mark list
marking/unmarking can get expensive if there's backtracking in #3
Unused: AB, BE
For higher k may need several recursive rewriting passes
possibly partitioning new sets into classes
Finalize: unused 2-s should overlap around the edge (that's why it's cyclic)
ABE - B can go to the begining or and: AACBBDCCEDDAEEB
Note: a step from B(N-1,k) to B(N,k) may need injection of pseudo-signletons, like doubling or trippling A
B(5,2) -> B(5,3) - B(5,4)
Initial. same: - ABCDE -> AAACBBBDCCCEDDDAEEEB
no use of marking 3-sets since they are all going to be chenged
Rewriting.
choose systematic insertion positions
AAA_CBBB_DCCC_EDDD_AEEE_B
mark all 2-s released by this: AC,AD,BD,BE,CE
use marked 2-s to decide inserted symbols - totice total regularity:
AxCB D -> ADCB
BxDC E -> BEDC
CxED A -> CAED
DxAE B => DBAE
ExBA C -> ECBA
Verify that 3-s are all used (marked inserted symbols just for fun)
AAA[D]CBBB[E]DCCC[A]EDDD[B]AEEE[C]B
Note: Systematic choice if insertion point deterministically dictated insertions (only AD can fit 1st, AC would create duplicate 2-set (AAC, ACC))
Note: It's not going to be so nice for B(6,2) and B(6,3) since number of 2-s will exceede 2x the no of 1-s. This is important since 2-s sit naturally on the sides of 1-s like CBBBE and the issue is how to place them when you run out of 1-s.
B(5,3) is so symetrical that just repeating #1 produces B(5.4):
AAAADCBBBBEDCCCCAEDDDDBAEEEECB