Given two strings s and t, determine the length of the shortest string z such that z is a subsequence of s and not a subsequence of t.
example:
s = babab, t = babba
sol: 3 ("aab")
I'm not looking for copy-pastable code; I'd just appreciate help with the intuition for solving this.
Thanks a lot!
You can use dynamic programming to solve this in quadratic time, just like the longest common subsequence. I'll show the formula and how you would come up with it.
First, some definitions. Let m be the length of S, and let n be the length of T. Define DP(i, j) as the length of the shortest subsequence of S[:i] that is not a subsequence of T[:j], or 'INF' if none exists. Here, the expression S[:i] is slice notation meaning 'the first i characters of S', so S[:0] is empty and S[:m] is all of S. We want DP(m, n).
There are two easy base cases: since T[:0] is empty, any character in S will work, so DP(i, 0) = 1 if i > 0. Similarly, DP(0, j) = 'INF' for all j, since the only subsequence of the empty string S[:0] is the empty string, which is a subsequence of everything.
Now, we just have to write a general formula for DP(i, j) which only depends on the value of DP() on indices smaller than (i, j). The last character of S[:i] is just some character S[i-1]. Either our optimal subsequence for S[:i], T[:j] ends with S[i-1] or it doesn't.
If our optimal subsequence doesn't end with S[i-1], then we can delete S[i-1] from consideration, and our answer is DP(i, j) = DP(i-1, j).
If our optimal subsequence does end with S[i-1], then we need to know the rightmost occurrence of S[i-1] in T[:j].
If S[i-1] does not occur in T[:j] at all, then S[i-1] by itself is a shortest subsequence, so DP(i, j) = 1.
Otherwise, let Rightmost(c, j) be the rightmost index of T[:j] equal to some character c. Since we are using S[i-1] to end our optimal subsequence, we can ignore all the characters in T[:j] after the rightmost occurrence of S[i-1]: they can no longer affect whether a string is a subsequence of T[:j]. So then DP(i, j) = DP(i-1, Rightmost(S[i-1], j)) + 1, where the +1 comes from the fact that we did choose to use S[i-1].
Putting those together, the general formula for i, j > 0 becomes:
DP(i, j) = 1                                       if S[i-1] not in T[:j], or
         = min(DP(i-1, j),
               DP(i-1, Rightmost(S[i-1], j)) + 1)  otherwise.
Since Rightmost(c, j) is always less than j by definition, we've achieved a formula using only indices smaller (lexicographically) than (i, j), and we can use that formula directly for a recursive algorithm.
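To make this concrete, here is a minimal Python sketch of the recurrence (bottom-up, with Rightmost precomputed per prefix). The function and variable names are mine, not from any library:

INF = float('inf')

def shortest_uncommon_subsequence(S, T):
    m, n = len(S), len(T)
    # right[j] maps a character c to Rightmost(c, j): the rightmost index r < j
    # with T[r] == c; c is absent from the dict if it does not occur in T[:j]
    right = [dict()]
    for j in range(1, n + 1):
        d = dict(right[-1])
        d[T[j - 1]] = j - 1
        right.append(d)
    # dp[i][j] = DP(i, j) as defined above
    dp = [[INF] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = 1                   # T[:0] is empty, so any one character works
    for i in range(1, m + 1):
        c = S[i - 1]
        for j in range(1, n + 1):
            r = right[j].get(c)
            if r is None:
                dp[i][j] = 1           # c alone is not a subsequence of T[:j]
            else:
                dp[i][j] = min(dp[i - 1][j],       # optimal answer skips S[i-1]
                               dp[i - 1][r] + 1)   # ends with S[i-1]; truncate T at r
    return dp[m][n]                    # INF means no such z exists

print(shortest_uncommon_subsequence("babab", "babba"))  # 3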
Given an array arr of n integers, what is the highest score that a player can reach, playing the following game?
Choose an index 0 < i < n-1 in the array
Add arr[i-1] * arr[i+1] points to the score (initially the score is 0)
Shrink the array by removing element i (for all j >= i: arr[j] = arr[j+1]; then n = n - 1).
Repeat steps 1-3 until n == 2; the two remaining elements are the first and the last, because you can't remove them.
What is the highest score you can get?
Example
arr = [1 2 3 4]
Choose i=2, get: 2*4 = 8 points, remove 3
Remaining: arr = [1 2 4]
Choose i=1, get 1*4 = 4 points, remove 2
Remaining: arr = [1 4].
The sum of points is 8 + 4 = 12, which is the highest possible score on this example.
I think it is related to Dynamic programming but I'm not sure how to solve it.
This problem has a dynamic programming approach similar to the Matrix-chain multiplication problem. You can find further explanation in the book "Introduction to Algorithms", 3rd edition (Cormen et al., page 370).
Let's find the optimal substructure property and then use it to construct an optimal solution to the problem from optimal solutions to subproblems.
Notation: C[i..j], where i ≤ j, stands for elements C[i], C[i+1], ..., C[j].
Definition: A removal sequence for C[i..j] is a permutation of i+1, i+2, ..., j-1.
A removal sequence for C[i..j] is optimal if the score achieved by removing the elements of C[i..j] in that order is maximum among all possible removal sequences for C[i..j].
1. Characterize the structure of an optimal solution
If the problem is nontrivial, i.e. i + 1 < j, then any solution has a last removed element, whose index k lies in the range i < k < j. Such a k splits the problem into C[i..k] and C[k..j]. That is, for some value k, we first remove the non-extremal elements of C[i..k] and C[k..j], and then we remove element k. Since removing non-extremal elements of C[i..k] doesn't affect the score obtained by removing non-extremal elements of C[k..j], and vice versa, the two subproblems are independent. Then, for a given removal sequence in which the k-th element is last, the score of C[i..j] equals the sum of the scores of C[i..k] and C[k..j], plus the score of removing the k-th element (C[i] * C[j]).
The optimal substructure of this problem is as follows. Suppose there is an optimal removal sequence O for C[i..j] that ends at the k-th element; then the ordering of the elements removed from C[i..k] must be optimal too. We can prove this by contradiction: if there were a removal sequence for C[i..k] that scored higher than the removal subsequence extracted from O for C[i..k], then we could produce a removal sequence for C[i..j] with a higher score than the optimal one (contradiction). A similar observation holds for the ordering of the elements removed from C[k..j] in the optimal removal sequence for C[i..j]: it must be optimal too.
We can build an optimal solution for nontrivial instances of the problem by splitting the problem into two subproblems, finding optimal solutions to the subproblem instances, and then combining these optimal subproblem solutions.
2. Recursively define the value of an optimal solution.
For this problem our subproblems are the maximum score obtained in C[i..j] for 1 ≤ i ≤ j ≤ N. Let S[i, j] be the maximum score obtained in C[i..j]; for the full problem, the highest score when evaluating the given rules is S[1, N].
We can define S[i, j] recursively as follows:
If j ≤ i + 1 then S[i, j] = 0
If i + 1 < j then S[i, j] = max over i < k < j of { S[i, k] + S[k, j] + C[i] * C[j] }
We ensure that we search for the correct place to split because we consider all possible places, so that we are sure of having examined the optimal one.
3. Compute the value of an optimal solution
You can use your favorite method to compute S:
top-down approach (recursive)
bottom-up approach (iterative)
I would use bottom-up for computing the solution since it would be < 5 lines long in almost any programming language.
Example in C++11:
for (int l = 2; l < N; ++l)                   // increasing interval lengths (l = j - i)
    for (int i = 1, j = i + l; j <= N; ++i, ++j)
        for (int k = i + 1; k < j; ++k)
            S[i][j] = max(S[i][j], S[i][k] + S[k][j] + C[i] * C[j]);
4. Time Complexity and Space Complexity
There are C(n, 2) + n = Θ(n^2) subproblems, and each subproblem does work whose running time is Θ(l), where l is the length of the subproblem's interval, so the math yields a running time of Θ(n^3) for the algorithm (it's easy to spot the O(n^3) part :-)). Also, the algorithm requires Θ(n^2) space to store the S table.
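For reference, here is a compact runnable sketch of the same bottom-up computation in Python (0-indexed, so the overall answer is S[0][N-1]; the function name is mine):

def max_removal_score(C):
    N = len(C)
    # S[i][j] = best score for removing all interior elements of C[i..j]
    S = [[0] * N for _ in range(N)]
    for l in range(2, N):              # interval length l = j - i, increasing
        for i in range(N - l):
            j = i + l
            S[i][j] = max(S[i][k] + S[k][j] + C[i] * C[j]
                          for k in range(i + 1, j))
    return S[0][N - 1]

print(max_removal_score([1, 2, 3, 4]))  # 12, matching the example above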
Problem: given an n-digit number, which k (k < n) digits should be deleted from it so that the remaining number is the smallest among all cases (the relative order of the remaining digits must not change)? E.g. deleting 2 digits from '24635', the smallest remaining number is '235'.
A solution: delete the first digit (from left to right) which is larger than or equal to its right neighbor, or the last digit if no such digit exists. Repeat this procedure k times. (See codecareer for reference. There are other solutions such as geeksforgeeks, stackoverflow, but I thought the one described here is more intuitive, so I prefer this one.)
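In case it helps, here is a short Python sketch of that rule as literally described (a repeated left-to-right scan; a single stack-based pass would be O(n), but this version matches the wording above):

def smallest_after_deleting(num, k):
    digits = list(num)
    for _ in range(k):
        # delete the first digit that is >= its right neighbour,
        # or the last digit if the digits are strictly increasing
        for i in range(len(digits)):
            if i + 1 == len(digits) or digits[i] >= digits[i + 1]:
                del digits[i]
                break
    return ''.join(digits)

print(smallest_after_deleting("24635", 2))  # "235"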
The problem now is how to prove that the solution above is correct, i.e. how does making the number smallest after deleting a single digit at each step guarantee that the final number is smallest?
Suppose k = 1.
Let m = Σ_{i=0..n} a_i·b^i be the (n+1)-digit number a_n a_{n-1} ... a_1 a_0 in base b, i.e. 0 ≤ a_i < b for all 0 ≤ i ≤ n (e.g. b = 10).
Proof
Suppose ∃ j > 0 with a_j > a_{j-1}, and let j be maximal.
This means a_j is the last digit of a (not necessarily strictly) increasing run of consecutive digits, read from the left.
Then the digit a_j is removed from the number, and the resulting number m' has the value
m' = Σ_{i=0..j-1} a_i·b^i + Σ_{i=j+1..n} a_i·b^{i-1}
The aim of this step is to maximize the difference m - m'. So let's take a look:
m - m' = Σ_{i=0..n} a_i·b^i - (Σ_{i=0..j-1} a_i·b^i + Σ_{i=j+1..n} a_i·b^{i-1})
       = a_j·b^j + Σ_{i=j+1..n} (a_i·b^i - a_i·b^{i-1})
       = a_n·b^n + Σ_{i=j..n-1} (a_i - a_{i+1})·b^i
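As a quick sanity check with the question's example: m = 24635, so b = 10, n = 4 and (a_4, a_3, a_2, a_1, a_0) = (2, 4, 6, 3, 5). The maximal j with a_j > a_{j-1} is j = 2 (the digit 6), so m' = 2435 and m - m' = 22200. The formula agrees:

a_4·10^4 + (a_3 - a_4)·10^3 + (a_2 - a_3)·10^2 = 20000 + 2000 + 200 = 22200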
Can there be a better choice of j to get a bigger difference?
Since a_n ... a_j is an increasing subsequence (reading left to right), a_i - a_{i+1} ≥ 0 holds for j ≤ i ≤ n-1. So choosing some j' > j instead of j just drops the terms (a_i - a_{i+1})·b^i for j ≤ i < j': the difference does not get bigger, and it gets strictly smaller if one of the dropped terms is positive, i.e. if there is such an i with a_{i+1} < a_i (strictly smaller).
Now suppose we choose j' < j instead. Since j is maximal, a_{j-1} - a_j < 0, so the term at i = j-1 contributes at most -b^{j-1} to the difference. We know
b^{j-1} > Σ_{i=0..j-2} (b-1)·b^i = b^{j-1} - 1,
so this single negative term outweighs the largest possible positive contribution of all terms below it. This means that choosing j' < j adds a negative net amount to the difference, so it also does not get bigger.
If ∄ j > 0 with a_j > a_{j-1}, the above proof works for j = 0 (the last digit is deleted).
What is left to do?
This is only a proof that your algorithm works for k = 1.
It is possible to extend the above proof to multiple runs of (not necessarily strictly) increasing digits. It's exactly the same proof, but much less readable due to the number of indices you need.
Maybe you can also use induction, since there are no interactions between the digits (no deletion blocks a later choice or anything similar).
Here is a simple argument that your algorithm works for any k. Suppose there is a digit in the m-th place that is less than or equal to its right neighbor, the (m+1)-th digit, and you delete the m-th digit but not the (m+1)-th. Then you can delete the (m+1)-th digit instead of the m-th, and you will get an answer less than or equal to your original answer.
Notice: this proof is for building the maximum number after removing k digits, but the thinking is similar.
key lemma: the maximum (m+1)-digit number contains the maximum m-digit number, for all m = 0, 1, ..., n-1
proof:
There is a greedy solution for deleting one digit from a number to get the maximum result: just delete the first digit whose next digit is greater than it, or the last digit if the digits are in non-ascending order. This is very easy to prove.
We use contradiction to prove the lemma.
Suppose the first time the lemma breaks is at m = k, so S(k) ⊄ S(k+1). Notice that S(k) ⊂ S(n), since every S(m) is a subsequence of the initial number, so there must exist an x with S(k) ⊂ S(x) and S(k) ⊄ S(x-1), where k+2 <= x <= n.
We use the greedy solution above to delete exactly one digit S[X][y] from S(x) to get S(x-1); so S[X][y] ∈ S(x), S[X][y] ∉ S(x-1), and S(k) must contain it. We now use contradiction to prove that S(k) does not actually need to contain this digit.
According to our greedy solution, all digits from the beginning up to S[X][y] are in non-ascending order.
If S[X][y] is at the tail, then S(k) can be the first k digits of S(x) → contradiction!
Otherwise, we first show that all digits of S[X][1..y] are in S(k): if some S[X][z] were not in S(k), 1 <= z <= y-1, then we could shift the digits of S(k) that lie in the range S[X][z+1..y] one position to the left and get a greater-or-equal S(k). Therefore, since x >= k+2, there are at least 2 digits after S[X][y] that are not in S(k). Then we can follow the prefix of S(k) up to S[X][y], but instead of using S[X][y], start from S[X][y+1]. As S[X][y+1] > S[X][y], we can build a greater S(k) → contradiction!
So we have proved the lemma: if we have S(m+1), and we know S(m+1) contains S(m), then S(m) must be the maximum number obtained by removing one digit from S(m+1).
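For concreteness, here is the one-digit greedy step used in this proof, sketched in Python (maximum variant; the function name is mine):

def delete_one_for_max(digits):
    # delete the first digit whose next digit is greater than it,
    # or the last digit if the digits are in non-ascending order
    for i in range(len(digits) - 1):
        if digits[i] < digits[i + 1]:
            return digits[:i] + digits[i + 1:]
    return digits[:-1]

# S(m) is obtained by applying this step to S(m + 1):
print(delete_one_for_max("24635"))  # "4635"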
The question is like this--
For every string given as input, you need to tell the number of subsequences of it that are palindromes (need not necessarily be distinct). Note that the empty string is not a palindrome.
For example, the palindromic subsequences of "aab" are:
"a", "a", "b", "aa", and the method returns 4.
I had the dynamic programming solution for finding the Longest Palindromic Subsequence in mind and tried to take ideas from it, but couldn't really get to a solution. Maybe dynamic programming is not even required. Suggestions please.
And there is one more catch. When the condition "need not necessarily be distinct" is removed, can we still count without actually generating all the palindromic subsequences?
[EDIT 19/10/2015: An anonymous reviewer pointed out a problem with the formula, which prompted me to notice another, even bigger mistake... Now fixed.]
I now see how to drop the solution time down to O(n^2). I'll leave my other answer up in case it's interesting as a stepping-stone to this one. Note: This is (also) only a solution to the first part of the problem; I see no way to efficiently count only distinct palindromic subsequences (PS).
Instead of counting the number of PS that begin and end at exactly the positions i and j, let's count how many begin at or after i and end at or before j. Call this g(i, j).
We can try to write g(i, j) = g(i, j-1) + g(i+1, j) + (x[i] == x[j])*g(i+1, j-1) for the case when j > i. But this doesn't quite work, because the first two terms will double-count any PS that begin after i and end before j.
The key insight is to notice that we can easily calculate the number of PS that begin or end at some exact position by subtracting off other values of g(), and perhaps adding yet more values of g() back on to compensate for double-counting. For example, the number of PS that begin at exactly i and end at exactly j is g(i, j) - g(i+1, j) - g(i, j-1) + g(i+1, j-1): the last term corrects for the fact that both the second and third terms count all g(i+1, j-1) PS that begin after i and end before j.
Every PS that begins at or after i and ends at or before j is in exactly 1 of 4 categories:
It begins after i, and ends before j.
It begins at i, and ends before j.
It begins after i, and ends at j.
It begins at i, and ends at j.
g(i+1, j) counts all PS in category 1 or 3, and g(i, j-1) counts all PS in category 1 or 2, so their sum g(i+1, j) + g(i, j-1) counts all PS in category 2 or 3 once each, and all PS in category 1 twice. Since g(i+1, j-1) counts all PS in category 1 only, subtracting this off to get g(i+1, j) + g(i, j-1) - g(i+1, j-1) gives the total number of PS in category 1, 2 and 3. The remaining PS are those in category 4. If x[i] != x[j] then there are no PS in this category; otherwise, there are exactly as many as there are PS that begin at or after i+1 and end at or before j-1, namely g(i+1, j-1), plus one extra for the 2-character sequence x[i]x[j]. [EDIT: Thanks to commenter Tuxdude for 2 fixes here!]
With this in hand, we can express g() in a way that changes the quadratic case from f() to constant time:
g(i, i) = 1 (i.e. when j = i)
g(i, i+1) = 2 + (x[i] == x[i+1]) (i.e. 3 iff adjacent chars are identical, otherwise 2)
g(i, j) = 0 when j < i (this new boundary case is needed)
g(i, j) = g(i+1, j) + g(i, j-1) - g(i+1, j-1) + (x[i] == x[j])*(g(i+1, j-1)+1) when j >= i+2
The final answer is now simply g(1, n).
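A direct translation of those four cases into Python (0-indexed, so the final answer is g(0, n-1); the function name is mine):

def count_palindromic_subsequences(x):
    n = len(x)
    if n == 0:
        return 0
    # g[i][j] = number of PS that begin at or after i and end at or before j
    g = [[0] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 1
    for i in range(n - 1):
        g[i][i + 1] = 3 if x[i] == x[i + 1] else 2
    for length in range(3, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            g[i][j] = g[i + 1][j] + g[i][j - 1] - g[i + 1][j - 1]
            if x[i] == x[j]:
                g[i][j] += g[i + 1][j - 1] + 1   # category 4, incl. x[i]x[j] itself
    return g[0][n - 1]

print(count_palindromic_subsequences("aab"))  # 4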
Here's a horrible O(n^4) solution:
Every palindromic subsequence begins at some position i and ends at some position j >= i such that x[i] = x[j], and its "interior" (all characters except the first and last) is either empty or a palindromic subsequence of x[i+1 .. j-1].
So we can define f(i, j) to be the number of palindromic subsequences beginning at i and ending at j >= i. Then
f(i, j) = 0 if x[i] != x[j]
f(i, i) = 1 (i.e. when j = i)
f(i, j) = 1 + the sum of f(i', j') over all i < i' <= j' < j otherwise
[EDIT: Fixed to count palindromic subsequences of length <= 2 too!]
Then the final answer is the sum of f(i, j) over all 1 <= i <= j <= n.
The DP for this is O(n^4) because there are n^2 table entries, and computing each one takes O(n^2) time. (It's probably possible to speed this up to at least O(n^3) by making use of the fact that x[i] != x[j] implies f(i, j) = 0.)
Intuitive O(n^3) solution using DP:
Let each state dp(i,j) represent the number of palindromic subsequences in string[i...j].
Then a simple recursive formula is:
dp(i, j) = dp(i, j-1) + 1      # subsequences that skip A[j], plus "A[j]" alone
for k in range(i, j):
    if A[k] == A[j]:           # A[k] pairs with A[j] to close new palindromes
        dp(i, j) += dp(k+1, j-1) + 1
The idea is very simple: when adding a new character, check whether it ends a subsequence or not. If the same character exists in the previously computed smaller subproblem, then it adds the number of subsequences contained in the range (k+1, j-1).
Just take care of the corner cases:
Add one, as the newly added character is a single-character subsequence too.
Even if there are no subsequences in the range (k+1, j-1), you still get 1 new subsequence of length 2 (like "aa").
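Putting the recurrence and both corner cases together, here is a runnable sketch (my own translation of the idea; it counts not-necessarily-distinct palindromic subsequences, i.e. position sets):

def count_ps(A):
    n = len(A)
    if n == 0:
        return 0
    # dp[i][j] = number of palindromic subsequences of A[i..j]
    dp = [[0] * n for _ in range(n)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            total = (dp[i][j - 1] if j > i else 0) + 1  # skip A[j], plus A[j] alone
            for k in range(i, j):
                if A[k] == A[j]:                        # palindrome from k to j
                    inner = dp[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    total += inner + 1                  # +1 for the empty interior
            dp[i][j] = total
    return dp[0][n - 1]

print(count_ps("aab"))  # 4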
I know the LCS problem needs time ~O(mn), where m and n are the lengths of the two sequences X and Y respectively. But my problem is a bit easier, so I expect a faster algorithm than ~O(mn).
Here is my problem:
Input:
a positive integer Q and two sequences X = x_1, x_2, ..., x_n and Y = y_1, y_2, ..., y_n, both of length n.
Output:
True, if the length of the LCS of X and Y is at least n - Q;
False, otherwise.
The well-known algorithm costs O(n^2) here, but actually we can do better than that, because whenever we have eliminated as many as Q elements of either sequence without finding a common element, the result must be False. Someone said there should be an algorithm as good as O(Q*n), but I cannot figure it out.
UPDATE:
Already found an answer!
I was told I can just calculate the diagonal band of the table c[i,j], because |i-j| > Q means there are already more than Q unmatched elements in the two sequences. So we only need to calculate c[i,j] for |i-j| <= Q.
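Here is a sketch of that banded computation in Python, assuming only the True/False answer is wanted (names are mine; cells outside the band are marked unreachable with -1):

def lcs_at_least_banded(X, Y, Q):
    n = len(X)
    NEG = -1                              # out-of-band / unreachable marker
    c = [[NEG] * (n + 1) for _ in range(n + 1)]
    c[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - Q), min(n, i + Q) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0 and X[i - 1] == Y[j - 1] and c[i - 1][j - 1] >= 0:
                best = c[i - 1][j - 1] + 1
            if i > 0 and c[i - 1][j] > best:
                best = c[i - 1][j]
            if j > 0 and c[i][j - 1] > best:
                best = c[i][j - 1]
            c[i][j] = best
    return c[n][n] >= n - Q               # only O(n * Q) cells are filled

print(lcs_at_least_banded("babab", "babba", 2))  # True (LCS length 4 >= 5 - 2)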
Here is one possible way to do it:
1. Let's define f(prefix_len, deleted_cnt) to be the leftmost position in Y such that prefix_len elements of X have been processed and exactly deleted_cnt of them were deleted. Obviously, there are only O(N * Q) states, because deleted_cnt cannot exceed Q.
2. The base case is f(0, 0) = 0 (nothing was processed, thus nothing was deleted).
3. Transitions:
a) Remove the current element: f(i + 1, j + 1) = min(f(i + 1, j + 1), f(i, j)).
b) Match the current element with the leftmost possible element from Y that is equal to it and located after f(i, j) (let's assume that it has index pos): f(i + 1, j) = min(f(i + 1, j), pos).
4. So the only question remaining is how to get the leftmost matching element to the right of a given position. Let's precompute the following pairs: (position in Y, element of X) -> the leftmost occurrence in Y of an element equal to this element of X, to the right of this position in Y, and put them into a hash table. This looks like O(n^2), but it is not: for a fixed position in Y, we never need to look further to the right than Q + 1 positions. Why? If we go further, we skip more than Q elements! So we can use this fact to examine only O(N * Q) pairs and get the desired time complexity. Once we have this hash table, finding pos during step 3 is just one hash table lookup. Here is pseudocode for this step:
map = EmptyHashMap()
for i = 0 ... n - 1:
    for j = i + 1 ... min(n - 1, i + q + 1):
        map[(i, Y[j])] = min(map[(i, Y[j])], j)
Unfortunately, this solution uses hash tables, so its O(N * Q) time complexity holds on average, not in the worst case, but it should be feasible.
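Here is a Python sketch of the whole approach. One cosmetic difference from the description above: my f values count consumed elements of Y, so a match at index pos sets the new value to pos + 1. All names are mine:

def lcs_at_least(X, Y, Q):
    n = len(X)
    INF = n + 2                            # marks unreachable states
    # nxt[(p, c)] = leftmost index j with Y[j] == c and p <= j <= p + Q
    # (looking further right would already skip more than Q elements of Y)
    nxt = {}
    for p in range(n):
        for j in range(p, min(n, p + Q + 1)):
            nxt.setdefault((p, Y[j]), j)
    # f[j] = minimum consumed prefix of Y after processing X[:i] with j deletions
    f = [0] + [INF] * Q
    for i in range(n):
        g = [INF] * (Q + 1)
        for j in range(Q + 1):
            if f[j] > n:
                continue
            if j + 1 <= Q:                 # (a) delete X[i]
                g[j + 1] = min(g[j + 1], f[j])
            pos = nxt.get((f[j], X[i])) if f[j] < n else None
            if pos is not None:            # (b) match X[i] at position pos
                g[j] = min(g[j], pos + 1)
        f = g
    return any(v <= n for v in f)          # some j <= Q deletions suffice

print(lcs_at_least("babab", "babba", 2))   # True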
You can also phrase this via the edit distance problem: the cost of the process that makes the two strings equal must not be greater than Q; if it is greater than Q, then the answer must be False.
Suppose the size of string x is m and the size of string y is n; then we create a two-dimensional array d[0..m][0..n], where d[i][j] denotes the edit distance between the i-length prefix of x and the j-length prefix of y.
The computation of array d is done using dynamic programming, which uses the following recurrence:
d[i][0] = i , for i <= m
d[0][j] = j , for j <= n
d[i][j] = d[i - 1][j - 1], if x[i] == y[j],
d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + 1), otherwise.
The answer for the LCS, if m > n, is m - d[m][m-n].
I can do this the proper way using dynamic programming, but I can't figure out how to do it in exponential time.
I'm looking to find the longest common subsequence between two strings.
Note: I mean subsequences and not substrings; the symbols that make up a subsequence need not be consecutive.
Just replace the lookups in the table in your dynamic programming code with recursive calls. In other words, just implement the recursive formulation of the LCS problem:
EDIT
In pseudocode (almost-python, actually):
def lcs(s1, s2):
    if len(s1) == 0 or len(s2) == 0: return 0
    if s1[0] == s2[0]: return 1 + lcs(s1[1:], s2[1:])
    return max(lcs(s1, s2[1:]), lcs(s1[1:], s2))
Let's say you have two strings a and b, each of length n. The longest common subsequence is going to be the longest subsequence of string a that is also present in string b.
Thus we can iterate through all possible subsequences of a, from longest to shortest, and check whether each one is a subsequence of b.
A high-level pseudocode for this would be:
for i = n to 0
    for all length-i subsequences s of a
        if s is a subsequence of b
            return s
Strings A and B. A recursive algorithm; maybe it's naive, but it is simple:
Look at the first letter of A. It will either be in a common subsequence or not. When considering the 'not' option, we trim off the first letter and call recursively. When considering the 'is in a common subsequence' option, we also trim it off, and we also trim off the start of B up to, and including, the same letter in B. Some pseudocode:
def common_subsequences(A, B, len_subsequence_so_far = 0):
    if len(A) == 0 or len(B) == 0:
        return
    first_of_A = A[0]  # the first letter in A
    A1 = A[1:]         # A, but with the first letter removed
    common_subsequences(A1, B, len_subsequence_so_far)  # the first recursive call
    if the_first_letter_of_A_is_also_in_B:
        Bn = ...       # delete from the start of B up to, and including,
                       # the first letter which equals first_of_A
        common_subsequences(A1, Bn, 1 + len_subsequence_so_far)
You could start with that and then optimize by remembering the longest subsequence found so far, returning early when the current call cannot beat it (i.e. when min(len(A), len(B)) + len_subsequence_so_far is smaller than the longest length found so far).
Essentially, if you don't use the dynamic programming paradigm, you reach exponential time. This is because, by not storing your partial values, you are recomputing them multiple times.
To achieve exponential time it's enough to generate all subsequences of both arrays and compare each one with each other. If you match two that are identical, check whether their length is larger than the current maximum. The pseudocode would be:
Generate all subsequences of `array1` and `array2`.
for each subsequence s1 of `array1`
    for each subsequence s2 of `array2`
        if s1 == s2            // check char by char
            if len(s1) > currentMax
                currentMax = len(s1)
(All subsequences of an n-element array can be enumerated with a bitmask loop: for i = 0; i < 2^n; i++.)
It's absolutely not optimal. It doesn't even try. However, the question is about a very inefficient algorithm, so I've provided one.
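For completeness, here is a runnable version of this brute force in Python. itertools.combinations enumerates the index sets; the set over array2's subsequences only avoids the literal pairwise comparison, and the overall running time is still exponential:

from itertools import combinations

def all_subsequences(s):
    # each subset of positions, taken in increasing order, is one subsequence
    for r in range(len(s) + 1):
        for idx in combinations(range(len(s)), r):
            yield ''.join(s[i] for i in idx)

def lcs_bruteforce(a, b):
    current_max = 0
    subs_of_b = set(all_subsequences(b))   # 2^len(b) entries
    for s1 in all_subsequences(a):
        if s1 in subs_of_b and len(s1) > current_max:
            current_max = len(s1)
    return current_max

print(lcs_bruteforce("ABCD", "AFDX"))  # 2 ("AD")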
int lcs(char[] x, int i, char[] y, int j) {
    if (i == 0 || j == 0) return 0;
    if (x[i - 1] == y[j - 1]) return lcs(x, i - 1, y, j - 1) + 1;
    return Math.max(lcs(x, i, y, j - 1), lcs(x, i - 1, y, j));
}

System.out.println(lcs(x, x.length, y, y.length));
Following is a partial recursion tree:
lcs("ABCD", "AFDX")
/ \
lcs("ABC", "AFDX") lcs("ABCD", "AFD")
/ \ / \
lcs("AB", "AFDX") lcs("AXY", "AFD") lcs("ABC", "AFD") lcs("ABCD", "AF")
The worst case is when the length of the LCS is 0, meaning there is no common subsequence. In that case all of the possible subsequences are examined, and there are O(2^n) of them.