How to find the Longest Common Subsequence in Exponential time? - algorithm

I can do this the proper way using dynamic programming but I can't figure out how to do it in exponential time.
I'm looking to find the largest common sub-sequence between two strings.
Note: I mean subsequences, not substrings; the symbols that make up a subsequence need not be consecutive.

Just replace the lookups in the table in your dynamic programming code with recursive calls. In other words, just implement the recursive formulation of the LCS problem:
EDIT
In pseudocode (almost-python, actually):
def lcs(s1, s2):
    if len(s1)==0 or len(s2)==0: return 0
    if s1[0] == s2[0]: return 1 + lcs(s1[1:], s2[1:])
    return max(lcs(s1, s2[1:]), lcs(s1[1:], s2))

Let's say you have two strings a and b of length n. The longest common subsequence is going to be the longest subsequence in string a that is also present in string b.
Thus we can iterate through all possible subsequences of a and check whether each one is also a subsequence of b.
A high-level pseudocode for this would be:
for i = n down to 0
    for all length-i subsequences s of a
        if s is a subsequence of b
            return s
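The pseudocode above can be sketched in Python roughly as follows. This is only an illustration: the function names `is_subsequence` and `lcs_brute_force` are made up here, and `itertools.combinations` is used to enumerate the index sets of each length.

```python
from itertools import combinations

def is_subsequence(s, b):
    """Return True if s can be read left to right inside b, skipping characters."""
    it = iter(b)
    # "ch in it" advances the iterator, so matches must appear in order.
    return all(ch in it for ch in s)

def lcs_brute_force(a, b):
    """Try subsequences of a from longest to shortest; return the first one found in b."""
    for length in range(len(a), -1, -1):
        for idxs in combinations(range(len(a)), length):
            candidate = "".join(a[i] for i in idxs)
            if is_subsequence(candidate, b):
                return candidate
    return ""
```

Since a has 2^n subsequences and each check costs O(n), this runs in exponential time in the worst case, which is what the question asks for.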

Take two strings, A and B. Here is a recursive algorithm; it may be naive, but it is simple:
Look at the first letter of A. This will either be in a common sequence or not. When considering the 'not' option, we trim off the first letter and call recursively. When considering the 'is in a common sequence' option we also trim it off and we also trim off from the start of B up to, and including, the same letter in B. Some pseudocode:
def common_subsequences(A, B, len_subsequence_so_far = 0):
    if len(A) == 0 or len(B) == 0:
        return
    first_of_A = A[0]  # the first letter in A
    A1 = A[1:]         # A, but with the first letter removed
    common_subsequences(A1, B, len_subsequence_so_far)  # the first recursive call
    if the_first_letter_of_A_is_also_in_B:
        Bn = ... delete from the start of B up to, and including,
             ... the first letter which equals first_of_A
        common_subsequences(A1, Bn, 1 + len_subsequence_so_far)
You could start with that and then optimize by remembering the longest subsequence found so far, and then returning early when the current function cannot beat that (i.e. when min(len(A), len(B)) + len_subsequence_so_far is smaller than the longest length found so far).
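Here is one way the recursion and the early-return idea might look as runnable Python. It is a sketch, not the answer's exact code: the name `lcs_length_pruned` is invented, it returns only the length, and it prunes with `<=` (cannot strictly beat the best so far) rather than "smaller than".

```python
def lcs_length_pruned(a, b):
    """Length of the LCS of a and b via exhaustive recursion with pruning."""
    best = 0

    def explore(a, b, so_far):
        nonlocal best
        best = max(best, so_far)
        # Prune: even if every remaining letter matched, we could not beat best.
        # This also covers the base case where a or b is empty.
        if min(len(a), len(b)) + so_far <= best:
            return
        first, rest = a[0], a[1:]
        # Option 1: the first letter of a is not part of the subsequence.
        explore(rest, b, so_far)
        # Option 2: it is; drop b up to and including its first occurrence.
        pos = b.find(first)
        if pos != -1:
            explore(rest, b[pos + 1:], so_far + 1)

    explore(a, b, 0)
    return best
```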

Essentially, if you don't use the dynamic programming paradigm, you end up with exponential time. This is because, by not storing your partial results, you recompute them multiple times.
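To see the recomputation concretely, here is a small experiment (my own sketch; `lcs_naive`, `lcs_memo` and the call counters are illustrative names). It counts how many times the recursive function body runs with and without caching, on two strings that share no letters.

```python
from functools import lru_cache

def lcs_naive(s1, s2, counter):
    """Plain recursion; counter[0] records how many calls were made."""
    counter[0] += 1
    if not s1 or not s2:
        return 0
    if s1[0] == s2[0]:
        return 1 + lcs_naive(s1[1:], s2[1:], counter)
    return max(lcs_naive(s1, s2[1:], counter), lcs_naive(s1[1:], s2, counter))

def lcs_memo(s1, s2, counter):
    """Same recursion on indices, but each (i, j) pair is computed only once."""
    @lru_cache(maxsize=None)
    def go(i, j):
        counter[0] += 1  # only incremented on a cache miss
        if i == len(s1) or j == len(s2):
            return 0
        if s1[i] == s2[j]:
            return 1 + go(i + 1, j + 1)
        return max(go(i, j + 1), go(i + 1, j))
    return go(0, 0)

naive_calls, memo_calls = [0], [0]
a, b = "abcdefgh", "ijklmnop"  # no common letters: the worst case
assert lcs_naive(a, b, naive_calls) == lcs_memo(a, b, memo_calls) == 0
print(naive_calls[0], memo_calls[0])
```

The memoized version can never make more than (len(s1)+1) * (len(s2)+1) real calls, while the naive count grows exponentially with the string length.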

To achieve exponential time it's enough to generate all subsequences of both arrays and compare each one with every other. If two of them are identical, check whether their length is larger than the current maximum. The pseudocode would be:
Generate all subsequences of `array1` and `array2`.
for each subsequence of `array1` as s1
    for each subsequence of `array2` as s2
        if s1 == s2 // check char by char
            if len(s1) > currentMax
                currentMax = len(s1)
for i = 0; i < 2^2; i++;
It's absolutely not optimal. It doesn't even try. However the question is about the very inefficient algorithm so I've provided one.

int lcs(char[] x, int i, char[] y, int j) {
    if (i == 0 || j == 0) return 0;
    if (x[i - 1] == y[j - 1]) return lcs(x, i - 1, y, j - 1) + 1;
    return Math.max(lcs(x, i, y, j - 1), lcs(x, i - 1, y, j));
}
print(lcs(x, x.length, y, y.length));
Following is a partial recursion tree:
                    lcs("ABCD", "AFDX")
                   /                   \
      lcs("ABC", "AFDX")           lcs("ABCD", "AFD")
        /            \                     |
lcs("AB", "AFDX")  lcs("ABC", "AFD")   1 + lcs("ABC", "AF")
The worst case is when the length of the LCS is 0, which means there is no common subsequence. In that case all of the possible subsequences are examined, and there are O(2^n) subsequences.

Related

Find longest positive substrings in binary string

Let's assume I have a string like 100110001010001. I'd like to find substrings that:
are as long as possible
have a positive total sum (> 0)
That is, the longest substrings that have more 1s than 0s.
For example, for the string above, 100110001010001, it would be: [10011]000[101]000[1]
Actually, it would be enough to find the total length of those; in this case, 9.
Unfortunately, I have no clue how this can be done other than by brute force. Any ideas, please?
As posted now, your question seems a bit unclear. The total length of valid substrings that are "as long as possible" could mean different things: for example, among other options, it could be (1) a list of the longest valid extension to the left of each index (which would allow overlaps in the list), (2) the longest combination of non-overlapping such longest left-extensions, (3) the longest combination of non-overlapping, valid substrings (where each substring is not necessarily the longest possible).
I will outline a method for (3) since it easily transforms to (1) or (2). Finding the longest left-extension from each index with more ones than zeros can be done in O(n log n) time and O(n) additional space (for just the longest valid substring in O(n) time, see here: Finding the longest non-negative sub array). With that preprocessing, finding the longest combination of valid, non-overlapping substrings can be done with dynamic programming in somewhat optimized O(n^2) time and O(n) additional space.
We start by traversing the string, storing sums representing the partial sum up to and including s[i], counting zeros as -1. We insert each partial sum in a binary tree where each node also stores an array of indexes where the value occurs, and the leftmost index of a value less than the node's value. (A substring from s[a] to s[b] has more ones than zeros if the prefix sum up to b is greater than the prefix sum up to a.) If a value is already in the tree, we add the index to the node's index array.
Since we are traversing from left to right, only when a new lowest value is inserted into the tree is the leftmost-index-of-lower-value updated — and it's updated only for the node with the previous lowest value. This is because any nodes with a lower value would not need updating; and if any nodes with lower values were already in the tree, any nodes with higher values would already have stored the index of the earliest one inserted.
The longest valid substring to the left of each index extends to the leftmost index with a lower prefix sum, which can be easily looked up in the tree.
To get the longest combination, let f(i) represent the longest combination up to index i. Then f(i) is the maximum, over every valid substring that ends at i and starts at some index j, of that substring's length plus f(j-1); if no valid substring ending at i is used, f(i) is simply f(i-1).
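As a rough illustration of the recurrence for f, here is a plain O(n^2) Python sketch. It deliberately skips the binary-tree preprocessing described above and just compares prefix sums directly, so it is a simplification rather than the optimized method; the name `longest_positive_cover` is made up.

```python
def longest_positive_cover(s):
    """Max total length of non-overlapping substrings with more 1s than 0s."""
    n = len(s)
    # prefix[i] = (#ones - #zeros) in s[:i]
    prefix = [0] * (n + 1)
    for i, ch in enumerate(s):
        prefix[i + 1] = prefix[i] + (1 if ch == "1" else -1)
    # f[i] = best total length using only s[:i]
    f = [0] * (n + 1)
    for i in range(1, n + 1):
        f[i] = f[i - 1]  # position i-1 is left uncovered
        for j in range(i):
            # s[j:i] has more ones than zeros iff prefix[i] > prefix[j]
            if prefix[i] > prefix[j]:
                f[i] = max(f[i], f[j] + (i - j))
    return f[n]
```

For the string from the question, `longest_positive_cover("100110001010001")` gives 9, matching [10011]000[101]000[1].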
Dynamic programming.
We have a string. If it is positive, that's our answer. Otherwise we need to trim each end until it goes positive, and find each pattern of trims. So for each length (N-1, N-2, N-3, etc.), we have N - length possible paths (trim from a, trim from b), each of which gives us a state. When a state goes positive, we've found our substring.
So we keep two lists of integers, representing what happens if we trim entirely from a or entirely from b. Then backtrack. If we trim 1 from a, we must trim all the rest from b; if we trim two from a, we must trim one fewer from b. Is there an answer that allows us to go positive?
We can quickly eliminate because the answer must be at a maximum, either max trimming from a or max trimming from b. If the other trim allows us to go positive, that's the result.
pseudocode:
N = length(string);
Nones = countones(string);
Nzeros = N - Nones;
if(Nones > Nzeros)
    return string
vector<int> cuta;
vector<int> cutb;
int besta = Nones - Nzeros;
int bestb = Nones - Nzeros;
cuta.push_back(besta);
cutb.push_back(bestb);
bestia = 0;
bestib = 0;
for(i=0;i<N;i++)
{
    cuta.push_back( string[i] == 1 ? cuta.back() - 1 : cuta.back() + 1);
    cutb.push_back( string[N-i-1] == 1 ? cutb.back() - 1 : cutb.back() + 1);
    if(cuta.back() > besta)
    {
        besta = cuta.back();
        bestia = i;
    }
    if(cutb.back() > bestb)
    {
        bestb = cutb.back();
        bestib = i;
    }
    // checks: is a cut wholly from a or b going to send us positive?
    if(besta == 1)
        answer = substring(string, bestia, N);
    if(bestb == 1)
        answer = substring(string, 0, N - bestib);
    // if not, is a combined cut from current position to
    // the peak in the other distribution going to send us positive?
    if(Nones - Nzeros + besta + cutb.back() == 1)
    {
        answer = substring(string, bestia, N - i);
    }
    if(Nones - Nzeros + cuta.back() + bestb == 1)
    {
        answer = substring(string, i, N - bestib);
    }
}
/* if we get here the string was all zeros and there is no positive substring */
This is untested and the final checks are a bit fiddly and I might have
made an error somewhere, but the algorithm should work more or less
as described.

Minimal number of swaps?

There are N characters of types A and B in an array (the same number of each type). What is the minimal number of swaps needed to make sure that no two adjacent characters are the same, if we can only swap two adjacent characters?
For example, input is:
AAAABBBB
The minimal number of swaps is 6 to make the array ABABABAB. But how would you solve it for any kind of input? I can only think of an O(N^2) solution. Maybe some kind of sort?
If we need just to count swaps, then we can do it with O(N).
Let's assume for simplicity that array X of N elements should become ABAB... .
GetCount()
    swaps = 0, i = -1, j = -1
    for(k = 0; k < N; k++)
        if(k % 2 == 0)
            i = FindIndexOf(A, max(k, i))
            X[k] <-> X[i]
            swaps += i - k
        else
            j = FindIndexOf(B, max(k, j))
            X[k] <-> X[j]
            swaps += j - k
    return swaps
FindIndexOf(element, index)
    while(index < N)
        if(X[index] == element) return index
        index++
    return -1; // should never happen if count of As == count of Bs
Basically, we run from left to right, and if a misplaced element is found, it gets exchanged with the correct element (e.g. abBbbbA** --> abAbbbB**) in O(1). At the same time swaps are counted as if the sequence of adjacent elements would be swapped instead. Variables i and j are used to cache indices of next A and B respectively, to make sure that all calls together of FindIndexOf are done in O(N).
If we need to sort by swaps then we cannot do better than O(N^2).
The rough idea is the following. Let's consider your sample: AAAABBBB. One of Bs needs O(N) swaps to get to the A B ... position, another B needs O(N) to get to A B A B ... position, etc. So we get O(N^2) at the end.
Observe that if any solution would swap two instances of the same letter, then we can find a better solution by dropping that swap, which necessarily has no effect. An optimal solution therefore only swaps differing letters.
Let's view the string of letters as an array of indices of one kind of letter (arbitrarily chosen, say A) into the string. So AAAABBBB would be represented as [0, 1, 2, 3] while ABABABAB would be [0, 2, 4, 6].
We know two instances of the same letter will never swap in an optimal solution. This lets us always safely identify the first (left-most) instance of A with the first element of our index array, the second instance with the second element, etc. It also tells us our array is always in sorted order at each step of an optimal solution.
Since each step of an optimal solution swaps differing letters, we know our index array evolves at each step only by incrementing or decrementing a single element at a time.
An initial string of length n = 2k will have an array representation A of length k. An optimal solution will transform this array to either
ODDS = [1, 3, 5, ... 2k - 1]
or
EVENS = [0, 2, 4, ... 2k - 2]
Since we know in an optimal solution instances of a letter do not pass each other, we can conclude an optimal solution must spend min(abs(ODDS[0] - A[0]), abs(EVENS[0] - A[0])) swaps to put the first instance in correct position.
By realizing the EVENS or ODDS choice is made only once (not once per letter instance), and summing across the array, we can count the minimum number of needed swaps as
define count_swaps(length, initial, goal)
    total = 0
    for i from 0 to length - 1
        total += abs(goal[i] - initial[i])
    end
    return total
end
define count_minimum_needed_swaps(k, A)
    return min(count_swaps(k, A, EVENS), count_swaps(k, A, ODDS))
end
Notice the number of loop iterations implied by count_minimum_needed_swaps is 2 * k = n; it runs in O(n) time.
By noting which term is smaller in count_minimum_needed_swaps, we can also tell which of the two goal states is optimal.
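A compact Python rendering of the same counting idea might look like this. It is a sketch under the question's assumption that the string has equally many As and Bs; the function name `min_adjacent_swaps` is my own.

```python
def min_adjacent_swaps(s):
    """Minimum adjacent swaps to make s alternate, assuming equal counts of A and B."""
    # Index array of the As, e.g. "AAAABBBB" -> [0, 1, 2, 3]
    positions = [i for i, ch in enumerate(s) if ch == "A"]
    k = len(positions)
    evens = range(0, 2 * k, 2)  # As end up at 0, 2, 4, ...
    odds = range(1, 2 * k, 2)   # As end up at 1, 3, 5, ...

    def cost(goal):
        # The t-th A moves to the t-th goal slot; As never pass each other.
        return sum(abs(g - p) for g, p in zip(goal, positions))

    return min(cost(evens), cost(odds))
```

For example, `min_adjacent_swaps("AAAABBBB")` returns 6, the value given in the question.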
Since you know N, you can simply write a loop that generates the values with no swaps needed.
#define N 4
char array[N + N];
for (size_t z = 0; z < N + N; z++)
{
    array[z] = 'B' - ((z & 1) == 0);
}
return 0; // The number of swaps
@Nemo and @AlexD are right. The algorithm is of order n^2. @Nemo misunderstood that we are looking for a reordering in which no two adjacent characters are the same, so we cannot use the rule that an A after a B is out of order.
Let's look at the minimum number of swaps.
We don't care whether our first character is A or B, because we can apply the same algorithm with A and B exchanged everywhere. So let's assume that the word WORD_N has length 2N, with N As and N Bs, and starts with an A. (I am using length 2N to simplify the calculations.)
What we will do is move the next B right next to this A, without caring about the positions of the other characters, because then we will have reduced the problem to reordering a new word WORD_{N-1}. Let's also assume that, if the word has more than 2 characters, the next B is not immediately after the A; otherwise the first step is already done and we reduce the problem to the next set of characters, WORD_{N-1}.
In the worst case the next B is as far away as possible, so it lies after half of the word, and we need N-1 swaps (maybe fewer) to put this B after the A. Then our word can be reduced to WORD_N = [A B WORD_{N-1}].
We see that we have to perform this algorithm at most N-1 times, because the last word (WORD_1) will already be ordered. Performing the algorithm N-1 times, we have to make
N_swaps = (N-1)*N/2
swaps, where N is half of the length of the initial word.
Let's see why we can apply the same algorithm to WORD_{N-1}, also assuming that its first character is A. Here it matters that the first character is the same as in the already ordered pair. We can be sure that the first character in WORD_{N-1} is A because it was the character right next to the first character of our initial word; and if it was B, the first step needs only one swap between these two characters, or none, and we will already have WORD_{N-1} starting with the same character as WORD_N, with the first two characters of WORD_N different, at the cost of at most 1 swap.
I think this answer is similar to the answer by phs, just in Haskell. The idea is that the resultant-indices for A's (or B's) are known so all we need to do is calculate how far each starting index has to move and sum the total.
Haskell code:
Prelude Data.List> let is = elemIndices 'B' "AAAABBBB"
                   in minimum
                      $ map (sum . zipWith ((abs .) . (-)) is) [[1,3..],[0,2..]]
6 --output

How to avoid generating all subsequences [duplicate]

I have been trying to solve the "Square Subsequences" problem on interviewstreet.com:
A string is called a square string if it can be obtained by concatenating two copies of the same string. For example, "abab", "aa" are square strings, while "aaa", "abba" are not.
Given a string, how many subsequences of the string are square strings?
I tried working out a DP solution, but this constraint seems impossible to circumvent: S will have at most 200 lowercase characters (a-z).
From what I know, finding all subsequences of a list of length n is O(2^n), which stops being feasible as soon as n is larger than, say, 30.
Is it really possible to systematically check all solutions if n is 200? How do I approach it?
First, for every letter a..z you get a list of their indices in S:
`p[x] = {i : S[i] = x}`, where `x = 'a',..,'z'`.
Then we start DP:
S: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
           ^         ^        ^
           r1        l2       r2
Let f(r1,l2,r2) be the number of square subsequences (subsequences that are square strings) of any length L such that
SS[L-1] = r1
SS[L] = l2
SS[2L-1] = r2
i.e. the first half ends exactly at r1, the second half starts exactly at l2 and ends at r2.
The algorithm is then (n is the length of S):
Let f[r1,l2,l2] = 1 if S[r1] = S[l2], else 0.
for (l2 in 1..n-1)
    for (r1 in 0..l2-1)
        for (r2 in l2..n-1)
            if (f[r1, l2, r2] != 0)
                for (x in 'a'..'z')
                    for (i,j: r1 < i < l2, r2 < j, S[i] = S[j] = x) // these i,j are found using p[x] quickly
                        f[i, l2, j] += f[r1, l2, r2]
In the end, the answer is the sum of all the values in the f[.,.,.] array.
So basically, we divide S using l2 into two parts and then count the common subsequences.
It's hard for me to provide an exact time complexity estimate right now, but it's surely below n^4, and n^4 is acceptable for n = 200.
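The "split at l2 and count common subsequences" idea can also be written without the three-index table. The sketch below is a different, simpler formulation than the f[r1,l2,r2] DP above, not a transcription of it: for each split point k it counts matching index pairs between S[:k] and S[k:], then subtracts those that do not use S[k], so every square subsequence is counted at exactly one split. The names `count_common` and `count_square_subsequences` are hypothetical, and it runs in O(n^3).

```python
def count_common(a, b):
    """Number of pairs of index tuples (I in a, J in b), possibly empty, spelling the same string."""
    rows, cols = len(a) + 1, len(b) + 1
    # c[i][j] counts pairs using only a[:i] and b[:j]; the empty pair always counts.
    c = [[1] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            # Inclusion-exclusion over "last char of a unused" / "last char of b unused".
            c[i][j] = c[i - 1][j] + c[i][j - 1] - c[i - 1][j - 1]
            if a[i - 1] == b[j - 1]:
                # Pairs that end by matching a[i-1] with b[j-1].
                c[i][j] += c[i - 1][j - 1]
    return c[len(a)][len(b)]

def count_square_subsequences(s):
    """Count square subsequences of s, where the second half starts exactly at index k."""
    total = 0
    for k in range(1, len(s)):
        left, right = s[:k], s[k:]
        # Pairs that use right[0] = all pairs minus pairs that avoid right[0].
        total += count_common(left, right) - count_common(left, right[1:])
    return total
```

Subsequences are counted by position here, so "aaa" yields 3 (three ways to pick "aa").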
There are many algorithms (e.g. the Z-algorithm) which can generate an array of prefix lengths in linear time. That is, for every position i it tells you the longest prefix of the string that can be read starting from position i (of course, for i = 0 the longest prefix is n).
Now notice that if you have a square string starting at the beginning, then there is a position k in this prefix length array such that the longest length is >= k. So you can count the number of those in linear time again.
Then remove the first letter of your string and do the same thing.
The total complexity of this would be O(n^2).

How to generate a permutation?

My question is: given a list L of length n, and an integer i such that 0 <= i < n!, how can you write a function perm(L, i) to produce the ith permutation of L in O(n) time? What I mean by ith permutation is just the ith permutation in some implementation-defined ordering that must have the properties:
For any i and any 2 lists A and B, perm(A, i) and perm(B, i) must both map the jth element of A and B to an element in the same position for both A and B.
For any inputs (A, i), (A, j) perm(A, i)==perm(A, j) if and only if i==j.
NOTE: this is not homework. In fact, I solved this 2 years ago, but I've completely forgotten how, and it's killing me. Also, here is a broken attempt I made at a solution:
def perm(s, i):
    n = len(s)
    perm = [0]*n
    itCount = 0
    for elem in s:
        perm[i%n + itCount] = elem
        i = i / n
        n -= 1
        itCount+=1
    return perm
ALSO NOTE: the O(n) requirement is very important. Otherwise you could just generate the n! sized list of all permutations and just return its ith element.
def perm(sequence, index):
    sequence = list(sequence)
    result = []
    for x in range(len(sequence)):
        idx = index % len(sequence)
        index //= len(sequence)
        result.append(sequence[idx])
        # constant time non-order preserving removal
        sequence[idx] = sequence[-1]
        del sequence[-1]
    return result
Based on the algorithm for shuffling, but we take the least significant part of the number each time to decide which element to take instead of a random number. Alternatively, consider it like the problem of converting to some arbitrary base, except that the base shrinks for each additional digit.
Could you use factoradics? You can find an illustration via this MSDN article.
Update: I wrote an extension of the MSDN algorithm that finds i'th permutation of n things taken r at a time, even if n != r.
A computational minimalistic approach (written in C-style pseudocode):
function perm(list,i){
    for(a=list.length;a;a--){
        list.switch(a-1,i mod a);
        i=i/a;
    }
    return list;
}
Note that implementations relying on removing elements from the original list tend to run in O(n^2) time, at best O(n*log(n)) given a special tree style list implementation designed for quickly inserting and removing list elements.
Rather than shrinking the original list and keeping it in order, the above code just moves an element from the end to the vacant location. It still makes a perfect 1:1 mapping between index and permutation, just a slightly more scrambled one, but in pure O(n) time.
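For reference, the swap-based pseudocode could be written in Python along these lines. This is a sketch: it copies the input instead of modifying it in place, and uses integer division for `i = i / a`.

```python
def perm(items, i):
    """Return the i-th permutation of items (0 <= i < n!) using n swaps."""
    result = list(items)
    for a in range(len(result), 0, -1):
        j = i % a  # next mixed-radix digit of i, in base a
        result[a - 1], result[j] = result[j], result[a - 1]
        i //= a
    return result
```

A quick sanity check is that `{tuple(perm("abcd", i)) for i in range(24)}` contains 24 distinct tuples, so every index maps to a different permutation.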
So, I think I finally solved it. Before I read any answers, I'll post my own here.
def perm(L, i):
    n = len(L)
    if (n == 1):
        return L
    else:
        split = i%n
        return [L[split]] + perm(L[:split] + L[split+1:], i//n)
There are n! permutations. The first character can be chosen from L in n ways. Each of those choices leave (n-1)! permutations among them. So this idea is enough for establishing an order. In general, you will figure out what part you are in, pick the appropriate element and then recurse / loop on the smaller L.
The argument that this works correctly is by induction on the length of the sequence. (sketch) For a length of 1, it is trivial. For a length of n, you use the above observation to split the problem into n parts, each with a question on an L' with length (n-1). By induction, all the L's are constructed correctly (and in linear time). Then it is clear we can use the IH to construct a solution for length n.

An interview question - Split text into sub-strings according to rules

Split text into sub-strings according to below rules:
a) The length of each sub-string should be less than or equal to M
b) The length of a sub-string should be less than or equal to N (N < M) if the sub-string contains any numeric char
c) The total number of sub-strings should be as small as possible
I have no clue how to solve this question, I guess it is related to "dynamic programming".
Can anybody help me implement it using C# or Java? Thanks a lot.
Idea
A greedy approach is the way to go:
If the current text is empty, you're done.
Take the first N characters. If any of them is a digit then this is a new substring. Chop it off and go to beginning.
Otherwise, extend the digitless segment to at most M characters. This is a new substring. Chop it off and go to beginning.
Proof
Here's a reductio-ad-absurdum proof that the above yields an optimal solution.
Assume there is a better split than the greedy split. Let's skip to the point where the two splits start to differ and remove everything before this point.
Case 1) A digit among the first N characters.
Assume that there is an input for which chopping off the first N characters cannot yield an optimal solution.
Greedy split:   |--N--|...
A better split: |---|--...
                     ^
                     +---- this segment can be shortened from the left side
However, the second segment of the putative better solution can be always shortened from the left side, and the first one extended to N characters, without altering the number of segments. Therefore, a contradiction: this split is not better than the greedy split.
Case 2) No digit among the first K (N < K <= M) characters.
Assume that there is an input for which chopping off the first K characters cannot yield an optimal solution.
Greedy split:   |--K--|...
A better split: |---|--...
                     ^
                     +---- this segment can be shortened from the left side
Again, the "better" split can be transformed, without altering the number of segments, to the greedy split, which contradicts the initial assumption that there is a better split than the greedy split.
Therefore, the greedy split is optimal. Q.E.D.
Implementation (Python)
import sys
m, n, text = int(sys.argv[1]), int(sys.argv[2]), sys.argv[3]
textLen, isDigit = len(text), [c in '0123456789' for c in text]
chunks, i, j = [], 0, 0
while j < textLen:
    i, j = j, min(textLen, j + n)
    if not any(isDigit[i:j]):
        while j < textLen and j - i < m and not isDigit[j]:
            j += 1
    chunks += [text[i:j]]
print(chunks)
Implementation (Java)
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.junit.Assert;
import org.junit.Test;

public class SO {
    public List<String> go(int m, int n, String text) {
        if (text == null)
            return Collections.emptyList();
        List<String> chunks = new ArrayList<String>();
        int i = 0;
        int j = 0;
        while (j < text.length()) {
            i = j;
            // Start with the short limit n, which is valid with or without digits.
            j = Math.min(text.length(), j + n);
            boolean ok = true;
            for (int k = i; k < j; k++)
                if (Character.isDigit(text.charAt(k))) {
                    ok = false;
                    break;
                }
            // No digit so far: extend up to m characters, stopping before a digit.
            if (ok)
                while (j < text.length() && j - i < m && !Character.isDigit(text.charAt(j)))
                    j++;
            chunks.add(text.substring(i, j));
        }
        return chunks;
    }

    @Test
    public void testIt() {
        Assert.assertEquals(
            Arrays.asList("asdas", "d332", "4asd", "fsdxf", "23"),
            go(5, 4, "asdasd3324asdfsdxf23"));
    }
}
Bolo has provided a greedy algorithm in his answer and asked for a counter-example. Well, there's no counter-example, because that's a perfectly correct approach. Here's the proof. Although it's a bit wordy, it often happens that the proof is longer than the algorithm itself :)
Let's imagine we have an input of length L and have constructed an answer A with our algorithm. Now, suppose there's a better answer B, i.e., B has fewer segments than A does.
Let's say the first segment in A has length la and the first segment in B has length lb. Then la >= lb, because we've chosen the first segment in A to have the maximum possible length. And if lb < la, we can increase the length of the first segment in B without increasing the overall number of segments in B. That gives us some other optimal solution B', having the same first segment as A.
Now, remove that first segment from A and B' and repeat the operation for length L' < L. Do it until there are no segments left. It means answer A is equal to some optimal solution.
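If you would rather check this empirically than trust the proof, a small comparison against an exhaustive DP works. The sketch below is my own (the names `is_valid`, `greedy_split` and `min_chunks` are invented) and assumes 1 <= N < M, so a single character is always a valid segment.

```python
def is_valid(chunk, m, n):
    """A chunk may be up to n long if it contains a digit, otherwise up to m."""
    limit = n if any(c.isdigit() for c in chunk) else m
    return 0 < len(chunk) <= limit

def greedy_split(text, m, n):
    """Repeatedly cut off the longest valid prefix."""
    chunks, i = [], 0
    while i < len(text):
        j = i + 1
        # Every prefix of a valid chunk is valid, so growing one char at a time is enough.
        while j < len(text) and is_valid(text[i:j + 1], m, n):
            j += 1
        chunks.append(text[i:j])
        i = j
    return chunks

def min_chunks(text, m, n):
    """Minimum number of valid chunks, found by trying every cut position."""
    best = [0] + [float("inf")] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - m), end):
            if is_valid(text[start:end], m, n):
                best[end] = min(best[end], best[start] + 1)
    return best[len(text)]
```

For any input you try, `len(greedy_split(text, m, n))` should equal `min_chunks(text, m, n)`.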
The result of your computation will be a partitioning of the given text into short sub-strings containing numerics and long substrings not containing numerics. (This much you knew already).
You will essentially be partitioning off short subs around the numerics and then breaking everything else down into long subs as often as needed to fulfill the length criteria.
Your freedom, i.e. what you can manipulate to improve your result, is to select which characters to include with a numeric. If N = 3, then for every numeric you get the choice of XXN, XNX or NXX. If M is 5 and you have 6 characters before your first numeric, you'll want to include at least one of those characters in your short sub so you won't end up with two "long" strings to the left of your "short" one when you could have just one instead.
As a first approximation, I'd go with extending your "short" strings leftwise far enough to avoid redundant "long" strings. This is a typical "greedy" approach, and greedy approaches often yield optimal or almost-optimal results. To do even better than that would not be easy, and I'm not going to try to figure out how to go about that.
