Longest common sub-sequence with a certain property? - algorithm

We say that a sequence of numbers x(1), x(2), ..., x(k) is zigzag if no three of its consecutive elements form a nonincreasing or nondecreasing sequence. More precisely, for all i = 2, 3, ..., k-1 either
x(i) > max( x(i-1), x(i+1) )
or
x(i) < min( x(i-1), x(i+1) )
I have two sequences of numbers a(1), a(2), ..., a(n) and b(1), b(2), ..., b(m). The problem is to compute the length of their longest common zigzag subsequence. In other words, you delete elements from the two sequences so that what remains is equal in both and forms a zigzag sequence; if the longest such common zigzag subsequence has length k, then the minimum number of deletions is m + n - 2k.
Note: sequences of length one or two are trivially zigzag.
I tried writing a memoized recursive solution using the following state variables:
i= current position of sequence 1.
j= current position of sequence 2.
last= last taken number in the zigzag sequence currently being considered.
direction = the current requirement for the next number, i.e. should it be greater than the previous one, less than it, or the same;
I call the function below with
magic(0, 0, Integer.MIN_VALUE, 0);
Here Integer.MIN_VALUE is used as a sentinel value denoting that no numbers have been taken yet in the sequence.
The function is given below:
static int magic(int i, int j, int last, int direction) {
    if (hm.containsKey(i + " " + j + " " + last + " " + direction))
        return hm.get(i + " " + j + " " + last + " " + direction);
    if (i == seq1.length || j == seq2.length) {
        return 0;
    }
    int take_both = 0, leave_both = 0, leave1 = 0, leave2 = 0;
    if (seq1[i] == seq2[j] && last == Integer.MIN_VALUE)
        take_both = 1 + magic(i + 1, j + 1, seq1[i], direction); // this is the first digit hence direction is 0.
    else if (seq1[i] == seq2[j] && (direction == 0 || direction == 1 && seq1[i] > last || direction == -1 && seq1[i] < last))
        take_both = 1 + magic(i + 1, j + 1, seq1[i], last != seq1[i] ? (last > seq1[i] ? 1 : -1) : 2);
    leave_both = magic(i + 1, j + 1, last, direction);
    leave1 = magic(i + 1, j, last, direction);
    leave2 = magic(i, j + 1, last, direction);
    int ans;
    ans = Math.max(Math.max(Math.max(take_both, leave_both), leave1), leave2);
    hm.put(i + " " + j + " " + last + " " + direction, ans);
    return ans;
}
Now the above code works for as many test cases as I could make, but the complexity is high.
How do I reduce the time complexity? Can I eliminate some state variables here? Is there an efficient way to do this?

First let's reduce the number of states: Let f(i, j, d) be the length of the longest common zig-zag sequence starting at position i in the first string and position j in the second string and starting with direction d (up or down).
We have the recurrence
f(i, j, up) >= max{ f(i', j', up) : i' > i, j' > j }
if s1[i] = s2[j]:
f(i, j, up) >= 1 + max{ f(i', j', down) : i' > i, j' > j, s1[i'] > s1[i] }
and similarly for the down direction. Solving this in a straightforward way
will lead to a runtime of something like O(n^4 · W), where W is the range of integers in the array. W is not polynomially bounded, so we definitely want to get rid of this factor, and ideally a couple of n factors along the way.
To solve the first part, you have to find the maximum f(i', j', up) with
i' > i and j' > j. This is a standard 2-d orthogonal range maximum query.
For the second case, you need to find the maximum f(i', j', down) with i' > i, j' > j and s1[i'] > s1[i]. That is a range maximum query over the box (i, ∞) x (j, ∞) x (s1[i], ∞).
Now having 3 dimensions here looks scary. However, if we process the states in say, decreasing order of i, then we can get rid of one dimension.
We thus reduced the problem to a range query in the rectangle (j, ∞) x (s1[i], ∞). Coordinate compression gets the dimension of values down to O(n).
You can use a 2-d data structure such as a range tree or a 2-d binary-indexed tree to answer both kinds of range queries in O(log^2 n). The total runtime will be O(n^2 · log^2 n).
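For illustration, here is a minimal sketch in Python (the names are illustrative, not taken from the answer above) of a 2-d binary-indexed tree supporting "raise the value stored at point (x, y) to at least val" and "maximum over the prefix rectangle [1..x] x [1..y]". Suffix rectangles such as (j, ∞) x (s1[i], ∞) can be handled by flipping each coordinate (x -> n + 1 - x) before inserting and querying; max-updates are safe here because the DP only ever inserts values and never removes them.

class BIT2D:
    def __init__(self, n, m):
        self.n, self.m = n, m
        self.t = [[0] * (m + 1) for _ in range(n + 1)]

    def update(self, x, y, val):          # 1-based coordinates
        i = x
        while i <= self.n:
            j = y
            while j <= self.m:
                self.t[i][j] = max(self.t[i][j], val)
                j += j & -j
            i += i & -i

    def query(self, x, y):                # max over [1..x] x [1..y]
        best = 0
        i = x
        while i > 0:
            j = y
            while j > 0:
                best = max(best, self.t[i][j])
                j -= j & -j
            i -= i & -i
        return best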
You can get rid of one log factor using fractional cascading, but that is associated with a high constant factor. The runtime is then only one log-factor short of that for finding the longest common subsequence, which seems like a lower-bound for our problem.

Counting inversions in an array of 2D pair

Problem Description:
Let there be an array of 2D pairs ((x1, y1), ..., (xn, yn)). With a fixed constant y', a pair (i, j) is called half-inverted if i < j, xi > xj, and yi ≥ y' > yj. Devise an algorithm that counts the number of half-inverted pairs. You will get full marks if your algorithm is correct and of complexity no more than O(n log n).
My idea is to treat this using a method similar to counting inversions in a normal array, but my problem is: how do we maintain the order during the Merge-and-Count step?
A simple modification of the familiar merge-sort inversion-counting algorithm can be used to solve this problem, so make sure you fully understand that algorithm as a prerequisite.
If we examine the merge step of this algorithm, we have two sorted halves and two pointers, each pointing to an element of one half. Let our left pointer be i and our right pointer be j. Using the traditional definition of an inversion, if our i pointer points to a value larger than the value pointed to by j then, because the halves are sorted and all the elements on the left come before those on the right in the real array, we know all the elements from i to the end of the left half meet our definition of an inversion with the value at j, so we increase our count by mid - i, where mid is the end of the left half.
Switching back to your problem, we are dealing with pairs (x, y). If we keep our x values sorted then, using the approach described above, we can simply count the number of inversions considering only the x values. Looking at your definition of half-inversions, we would surely be over-counting if we only counted xi > xj. We are missing the additional constraint yi >= y' > yj, which must be filtered out of our counting.
So, looking back at the traditional algorithm, when our i pointer points to a value greater than the value at j, we also need to make sure that the y value at j is less than y'. If this is not true, then none of the x's from i to mid can match our definition of a half-inversion, so we cannot count them. Now let's assume j's y is smaller than y'; if we simply counted all the pairs from i to mid, we would still be over-counting the pairs which have yi < y'.
One way to fix this is to keep track of the y values in the left half from i to mid which are >= y', and add that number to our count. We can keep track of how many y >= y' we have seen in the merge step up to any i, and subtract that from the total number of y's in the left half which are >= y'. To keep track of that total, we can return it from our recursive function (total = left + right) and use only the number that came from the left half when merging. We also need to modify our base case, which is straightforward.
def count_half_inversions(l, y):
    return count_rec(l, 0, len(l), l.copy(), y)[0]

def count_rec(l, begin, end, copy, y):
    if end - begin <= 1:
        # we have only 1 pair
        return (0, 1 if l[begin][1] >= y else 0)
    mid = begin + ((end - begin) // 2)
    left = count_rec(copy, begin, mid, l, y)
    right = count_rec(copy, mid, end, l, y)
    between = merge_count(l, begin, mid, end, copy, left[1], y)
    # return (inversion count, number of pairs (i, j) with j >= y)
    return (left[0] + right[0] + between, left[1] + right[1])

def merge_count(l, begin, mid, end, copy, left_y_count, y):
    result = 0
    i, j = begin, mid
    k = begin
    while i < mid and j < end:
        if copy[i][0] > copy[j][0]:
            if y > copy[j][1]:
                result += left_y_count
            smaller = copy[j]
            j += 1
        else:
            if copy[i][1] >= y:
                left_y_count -= 1
            smaller = copy[i]
            i += 1
        l[k] = smaller
        k += 1
    while i < mid:
        l[k] = copy[i]
        i += 1
        k += 1
    while j < end:
        l[k] = copy[j]
        j += 1
        k += 1
    return result

test_case = [(1,1), (6,4), (6,3), (1,2), (1,2), (3,3), (6,2), (0,1)]
fixed_y = 2
print(count_half_inversions(test_case, fixed_y))

Dynamic Programming: Longest Common Subsequence

I'm going over notes that discuss dynamic programming in the context of finding the longest common subsequence of two equal-length strings. The algorithm in question outputs the length (not the substring).
So I have two strings, say:
S = ABAZDC, T = BACBAD
The longest common subsequence is ABAD (the letters of a subsequence don't have to be adjacent).
Algorithm is as follows, where LCS[i, j] denotes longest common subsequence of S[1..i] and T[1..j]:
if S[i] != T[j] then
LCS[i, j] = max(LCS[i - 1, j], LCS[i, j - 1])
else
LCS[i, j] = 1 + LCS[i - 1, j - 1]
My notes claim you can fill out a table where each string is written along an axis. Something like:
B A C B A D
A 0 1 1 1 1 1
B 1 1 1 2 2 2
A ...
Z
D
C
Two questions:
1) How do we actually start filling this table out? The algorithm is recursive but doesn't seem to provide a base case (otherwise I'd just call LCS[5, 5]). My notes claim you can do two simple loops with i and j and fill out each spot in constant time...
2) How could we modify the algorithm so the longest common subsequence would be of adjacent letters? My thought is that I'd have to reset the length of the current subsequence to 0 once I find that the next letter in S doesn't match the next letter in T. But it's tricky because I want to keep track of the longest seen thus far (It's possible the first subsequence I see is the longest one). So maybe I'd have an extra argument, longestThusFar, that is 0 when we call our algorithm initially and changes in subsequent calls.
Can someone make this a bit more rigorous?
Thanks!
Firstly, the algorithm is defined recursively, but the implementation is iterative. In other words, we do not explicitly call the same function from within itself (recursion).
We use the table entries already filled in place of the recursive calls.
Say you have two strings of length M.
Then a table of dimensions (M+1) x (M+1) is defined.
for(i = 0 to M)
{
    LCS[0][i] = 0;
}
for(i = 1 to M)
{
    LCS[i][0] = 0;
}
And you get a table like
      B  A  C  B  A  D
   0  0  0  0  0  0  0
A  0
B  0
A  0
Z  0
D  0
C  0
Each zero in the 0th column means that if no character of the string BACBAD is considered, then the length of the LCS is 0.
Each zero in the 0th row means that if no character of the string ABAZDC is considered, then the length of the LCS is 0.
The rest of the entries are filled using the rules you mentioned.
for(i = 1 to M)
{
    for(j = 1 to M)
    {
        if S[i-1] != T[j-1] then
            LCS[i, j] = max(LCS[i - 1, j], LCS[i, j - 1])
        else
            LCS[i, j] = 1 + LCS[i - 1, j - 1]
    }
}
Notice that it is S[i-1] != T[j-1] and not S[i] != T[j], because when you fill LCS[i, j] you are always comparing the (i-1)th character of S and the (j-1)th character of T (the strings are 0-indexed while the table is 1-indexed).
The length of LCS is given by LCS[M,M].
The best way to understand this is try it by hand.
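For concreteness, here is a minimal runnable sketch of exactly this bottom-up fill (written in Python purely for illustration), applied to the strings from the question:

def lcs_length(S, T):
    M, N = len(S), len(T)
    LCS = [[0] * (N + 1) for _ in range(M + 1)]  # row 0 and column 0 stay 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            if S[i - 1] != T[j - 1]:
                LCS[i][j] = max(LCS[i - 1][j], LCS[i][j - 1])
            else:
                LCS[i][j] = 1 + LCS[i - 1][j - 1]
    return LCS[M][N]

print(lcs_length("ABAZDC", "BACBAD"))  # prints 4, e.g. "ABAD"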
In answer to your second question, you do not need to modify the algorithm much.
The solution lies in the table that is used to retrieve the LCS.
In order to retrieve the LCS, we make an extra table of characters, D, of dimensions M x M (called D here so it does not clash with the string T), and we modify the algorithm as follows.
for(i = 1 to M)
{
    for(j = 1 to M)
    {
        if S[i-1] != T[j-1] then
        {
            LCS[i, j] = max(LCS[i - 1, j], LCS[i, j - 1])
            if(LCS[i - 1, j] >= LCS[i, j - 1])
                D[i-1][j-1] = 'u' // meaning up
            else
                D[i-1][j-1] = 'l' // meaning left
        }
        else
        {
            LCS[i, j] = 1 + LCS[i - 1, j - 1]
            D[i-1][j-1] = 'd' // meaning diagonal
        }
    }
}
Now, in order to know the longest substring (of adjacent letters) common to both, traverse D diagonally.
The length = the largest number of consecutive d's found in a diagonal.
The diagonal traversal of any square N x N matrix is done as follows.
Lower triangle, including the main diagonal:
j = N-1;
while(j >= 0)
{
    i = j; k = 0;
    while(i <= N-1)
    {
        entry D[i][k];
        ++i; ++k;
    }
    --j;
}
Upper triangle:
j = 1;
while(j <= N-1)
{
    i = j; k = 0;
    while(i <= N-1)
    {
        entry D[k][i];
        ++k; ++i;
    }
    ++j;
}
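As a cross-check, here is a hedged sketch of my own (in Python rather than the pseudocode above) that computes the same quantity directly: it tracks the length of the run of consecutive matches ending at each cell, which is exactly the "longest run of d's along a diagonal" idea.

def longest_common_substring_len(S, T):
    # run[i][j] = length of the run of matches ending at S[i-1], T[j-1]
    run = [[0] * (len(T) + 1) for _ in range(len(S) + 1)]
    best = 0
    for i in range(1, len(S) + 1):
        for j in range(1, len(T) + 1):
            if S[i - 1] == T[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
                best = max(best, run[i][j])
    return best

print(longest_common_substring_len("ABAZDC", "BACBAD"))  # prints 2 ("BA")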

Find minimum sum that cannot be formed

Given positive integers from 1 to N, where N can go up to 10^9. Some K of these integers are missing; K can be at most 10^5. I need to find the minimum sum that can't be formed from the remaining N-K elements in an efficient way.
Example: say we have N=5, which means we have {1,2,3,4,5}, and let K=2 with missing elements {3,5}. The remaining array is now {1,2,4}, and the minimum sum that can't be formed from these remaining elements is 8, because:
1=1
2=2
3=1+2
4=4
5=1+4
6=2+4
7=1+2+4
So how do I find this un-summable minimum?
I know how to find it if I can store all the remaining elements, using this approach:
We can use something similar to Sieve of Eratosthenes, used to find primes. Same idea, but with different rules for a different purpose.
Store the numbers from 0 to the sum of all the numbers, and cross off 0.
Then take numbers, one at a time, without replacement.
When we take the number Y, then cross off every number that is Y plus some previously-crossed off number.
When we have done this for every number that is remaining, the smallest un-crossed-off number is our answer.
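For concreteness, here is a minimal sketch of that sieve-like idea (workable only when the total sum is small enough to store, which is exactly the limitation raised below):

def min_unmakeable_sieve(nums):
    total = sum(nums)
    can = [False] * (total + 1)
    can[0] = True                          # cross off 0
    for v in nums:                         # take numbers one at a time, without replacement
        for s in range(total, v - 1, -1):
            if can[s - v]:                 # cross off v plus a previously crossed-off number
                can[s] = True
    s = 1
    while s <= total and can[s]:
        s += 1
    return s                               # smallest un-crossed-off number

print(min_unmakeable_sieve([1, 2, 4]))     # 8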
However, its space requirement is high. Can there be a better and faster way to do this?
Here's an O(sort(K))-time algorithm.
Let 1 ≤ x1 ≤ x2 ≤ … ≤ xm be the integers not missing from the set. For all i from 0 to m, let yi = x1 + x2 + … + xi be the partial sum of the first i terms. If it exists, let j be the least index such that yj + 1 < x(j+1); otherwise, let j = m. It is possible to show via induction that the minimum sum that cannot be made is yj + 1 (the hypothesis is that, for all i from 0 to j, the numbers x1, x2, …, xi can make all of the sums from 0 to yi and no others).
To handle the fact that the missing numbers are specified, there is an optimization that handles several consecutive numbers in constant time. I'll leave it as an exercise.
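A minimal sketch of the lemma itself, for an explicitly given sorted list of present numbers (the constant-time handling of whole runs of consecutive present numbers is the exercise mentioned above):

def min_unreachable(xs):              # xs: the present numbers, sorted ascending
    reach = 0                         # invariant: every sum in 0..reach is makeable
    for x in xs:
        if x > reach + 1:
            break                     # reach + 1 can never be made
        reach += x
    return reach + 1

print(min_unreachable([1, 2, 4]))     # 8, matching the example in the question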
Let X be a bitvector initialized to zero, where bit v-1 stands for the value v. For each remaining number Ni you set X = (X | (X << Ni)) | (1 << (Ni - 1)) (i.e. you can make Ni itself, and you can increase any value you could make previously by Ni).
This will set a '1' for every value you can make.
Running time is proportional to the number of remaining elements times the length of the bitvector (the total sum), and bitvector operations are fast.
process 1: X = 00000001
process 2: X = (00000001 | 00000001 << 2) | (00000010) = 00000111
process 4: X = (00000111 | 00000111 << 4) | (00001000) = 01111111
First number you can't make is 8.
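A hedged sketch of the same trick, using Python's arbitrary-precision integers as the bitvector (bit v-1 set means the sum v can be formed):

def first_unmakeable(nums):
    X = 0
    for v in nums:
        X = (X | (X << v)) | (1 << (v - 1))
    s = 1
    while X & (1 << (s - 1)):
        s += 1
    return s

print(first_unmakeable([1, 2, 4]))    # 8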
Here is my O(K lg K) approach. I didn't test it very much because of lazy-overflow, sorry about that. If it works for you, I can explain the idea:
#include <bits/stdc++.h>
using namespace std;

const int MAXK = 100003;
int n, k;
int a[MAXK];

long long sum(long long a, long long b) { // sum of elements from a to b
    return max(0ll, b * (b + 1) / 2 - a * (a - 1) / 2);
}

void answer(long long ans) {
    cout << ans << endl;
    exit(0);
}

int main()
{
    cin >> n >> k;
    for (int i = 1; i <= k; ++i) {
        cin >> a[i];
    }
    a[0] = 0;
    a[k+1] = n+1;
    sort(a, a+k+2);
    long long ans = 0;
    for (int i = 1; i <= k+1; ++i) {
        // interval of existing numbers [lo, hi]
        int lo = a[i-1] + 1;
        int hi = a[i] - 1;
        if (lo <= hi && lo > ans + 1)
            break;
        ans += sum(lo, hi);
    }
    answer(ans + 1);
}
EDIT: well, thank God @DavidEisenstat wrote the description of the approach I used in his answer, so I don't have to write it. Basically, what he mentions as an exercise is adding the "existing numbers" not one by one, but a whole interval at a time. Before doing that, you just need to check whether any of them breaks the invariant, which can be done using binary search. Hope it helped.
EDIT2: as @DavidEisenstat pointed out in the comments, the binary search is not needed, since only the first number in every interval of existing numbers can break the invariant. Modified the code accordingly.

Counting number of points in lower left quadrant?

I am having trouble understanding a solution to an algorithmic problem.
In particular, I don't understand how or why this part of the code
s += a[i];
total += query(s);
update(s);
allows you to compute the total number of points in the lower left quadrant of each point.
Could someone please elaborate?
As an analogue for the plane problem, consider this:
1. For a point (a, b) to lie in the lower left quadrant of (x, y), we need a < x and b < y; thus, points of the form (i, P[i]) lie in the lower left quadrant of (j, P[j]) iff i < j and P[i] < P[j].
2. When iterating in ascending order, all points that were considered earlier lie to the left of the current point* (i, P[i]).
3. So one only has to locate all the P[j]s less than the current P[i] that have been considered until now.
*current point refers to the point being considered in the current iteration of the for loop that you quoted, i.e. (i, P[i])
Let's define another array, C[s]:
C[s] = number of prefix sums of the array A[1..(i - 1)] that amount to s
So the count in #3 becomes the sum ... + C[-2] + C[-1] + C[0] + C[1] + C[2] + ... + C[P[i] - 1], i.e. a prefix sum of C.
Use the BIT to store the prefix sum of C, thus defining query(s) as:
query(s) = Number of Prefix Sums of array A[1..(i - 1)] that amount to a value < s
Using these definitions, s in the given code gives you the prefix sum up to the current index i (P[i]). total builds the answer, and update simply adds P[i] to the BIT.
We have to repeat this method for all i, hence the for loop.
PS: It uses a data structure called a Binary Indexed Tree (http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=binaryIndexedTrees) for operations. If you aren't acquainted with it, I'd recommend that you check the link.
EDIT:
You are given an array S and a value X. You can split S into two disjoint groups: L, which has all elements of S less than X, and H, which has those that are greater than or equal to X.
A: All elements of L are less than all elements of H.
Any subsequence T of S will have some elements of L and some elements of H. Let's say it has p elements of L and q of H. When T is sorted to give T', all p elements of L appear before the q elements of H because of A.
The median, being the central value, is the value at location m = (p + q)/2.
It is intuitive to think that having q >= p implies that the median lies in H; as a proof:
Values in locations [1..p] in T' belong to L. Therefore, for the median to be in H, its position m should be greater than p:
m > p
(p + q)/2 > p
p + q > 2p
q > p
B: q - p > 0
To compute q - p, I replace all elements in T' with -1 if they belong to L ( < X ) and +1 if they belong to H ( >= X ).
T' then looks something like {-1, -1, -1, ..., 1, 1, 1}
It has p times -1 and q times 1. Sum of T' will now give me:
Sum = p * (-1) + q * (1)
C: Sum = q - p
I can use this information to find the value in B.
All the subsequences considered here are contiguous, of the form {A[i + 1], A[i + 2], ..., A[j]}. With prefix sums P[i] = A[1] + A[2] + ... + A[i] (and P[0] = 0), the sum of such a subsequence can be computed as
P[j] - P[i] (with j greater than i)
With C and B in mind, we conclude:
Sum = P[j] - P[i] = q - p (q - p > 0)
P[j] - P[i] > 0
P[j] > P[i]
j > i and P[j] > P[i] for each solution that gives you a median >= X
In summary:
Replace every A[i] with -1 if it is less than X and with +1 otherwise.
Compute the prefix sums of A[i].
For each pair (i, P[i]), count the pairs which lie in its lower left quadrant (a sketch of this counting step follows below).
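Here is a hedged, self-contained sketch of that counting step (the function names are illustrative, not from the quoted solution): given the +1/-1 array, it counts pairs i < j with P[i] < P[j] using a binary indexed tree over the shifted prefix-sum values, which is exactly the role query(s) and update(s) play in the quoted loop. It also inserts P[0] = 0 up front so that subarrays starting at the first element are counted.

def count_pairs(signs):                   # signs: list of +1 / -1 values
    n = len(signs)
    size = 2 * n + 2                      # prefix sums live in [-n, n]; shift by n + 1
    tree = [0] * (size + 1)

    def update(pos):                      # insert one prefix-sum value
        while pos <= size:
            tree[pos] += 1
            pos += pos & -pos

    def query(pos):                       # how many inserted values are <= pos
        cnt = 0
        while pos > 0:
            cnt += tree[pos]
            pos -= pos & -pos
        return cnt

    total, s = 0, 0
    update(0 + n + 1)                     # P[0] = 0
    for x in signs:
        s += x                            # current prefix sum P[i]
        total += query(s + n + 1 - 1)     # earlier prefix sums strictly less than s
        update(s + n + 1)
    return total

print(count_pairs([1, -1, 1, 1]))         # 7 pairs for this small example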

Max suffix of a list

This problem asks for the lexicographically max suffix of a given list.
Suppose we have an array/list [e1;e2;e3;e4;e5].
Then all suffixes of [e1;e2;e3;e4;e5] are:
[e1;e2;e3;e4;e5]
[e2;e3;e4;e5]
[e3;e4;e5]
[e4;e5]
[e5]
Then our goal is to find the lexicographical max one among the above 5 lists.
for example, all suffixes of [1;2;3;1;0] are
[1;2;3;1;0]
[2;3;1;0]
[3;1;0]
[1;0]
[0].
The lexicographically max suffix in the above example is [3;1;0].
The straightforward algorithm is just to compare all suffixes one by one and always record the max. The time complexity is O(n^2), as comparing two lists needs O(n).
However, the desired time complexity is O(n), and no suffix tree (no suffix array either) should be used.
Please note that elements in the list may not be distinct.
int max_suffix(const vector<int> &a)
{
    int n = a.size(),
        i = 0,
        j = 1,
        k;
    while (j < n)
    {
        for (k = 0; j + k < n && a[i + k] == a[j + k]; ++k);
        if (j + k == n) break;
        (a[i + k] < a[j + k] ? i : j) += k + 1;
        if (i == j)
            ++j;
        else if (i > j)
            swap(i, j);
    }
    return i;
}
My solution is a little modification of the solution to the problem Minimum Rotations.
In the above code, each time it enters the loop it is maintained that i < j and that no a[p...n] (0 <= p < j && p != i) is the max suffix. Then, in order to decide which of a[i...n] and a[j...n] is lexicographically smaller, the for-loop finds the least k that makes a[i+k] != a[j+k], and i and j are updated according to k.
We can skip k elements for i or j and still keep it true that no a[p...n] (0 <= p < j && p != i) is the max suffix. For example, if a[i+k] < a[j+k], then a[i+p...n] (0 <= p <= k) is not the max suffix, since a[j+p...n] is lexicographically greater than it.
Imagine a two-player game in which two opponents A and B work against each other to find the max suffix of a given string s. Whoever first finds the max suffix wins the game. In the first round, A picks suffix s[i..], and B picks suffix s[j..].
i: _____X
j: _____Y
Matched length = k
A judge compares the two suffixes and finds a mismatch after k comparisons, as shown in the figure above.
Without loss of generality, assume X > Y; then B loses this round. So he has to pick a different suffix in order to (possibly) beat A in the next round. If B is smart, he will not pick any suffix starting at position j, j + 1, ..., j + k, because s[j..] is already beaten by s[i..] and he knows s[j+1..] will be beaten by s[i+1..], s[j+2..] will be beaten by s[i+2..], and so on. So B should pick suffix s[j + k + 1..] for the next round. One extra observation is that B should not pick the same suffix as A either, because the first person who finds the max suffix wins the game. If j + k + 1 happens to be equal to i, B should skip to the next position.
Finally, after many rounds, either A or B will run out of choices and lose the game, because the number of choices is limited for both A and B, and some choices are eliminated after each round.
When this happens, the current suffix that the winner holds is the max suffix. (Remember the loser runs out of all his choices. A choice is given up either because it cannot possibly be the max suffix, or because it is currently held by the other player. So the only way the loser could give up the actual max suffix in some round is if his opponent is holding it; and once a player holds the max suffix, he will never lose it or give it up.)
The program below in C++ is almost literal translation of this game.
int maxSuffix(const std::string& s) {
    std::size_t i = 0, j = 1, k;
    while (i < s.size() && j < s.size()) {
        for (k = 0; i + k < s.size() && j + k < s.size() && s[i + k] == s[j + k]; ++k) { } // judge
        if (j + k >= s.size()) return i; // B is finally lost
        if (i + k >= s.size()) return j; // A is finally lost
        if (s[i + k] > s[j + k]) { // B is lost in this round so he needs a new choice
            j = j + k + 1;
            if (j == i) ++j;
        } else { // A is lost in this round so he needs a new choice
            i = i + k + 1;
            if (i == j) ++i;
        }
    }
    return j >= s.size() ? i : j;
}
Running time analysis: Initially each player has n choices. After each round, the judge makes k comparisons, and at least k possible choices are eliminated from either A or B. So the total number of comparisons is bounded by 2n when the game is over.
The discussion above is in the context of string, but it should work with minor modification on any container that supports sequential access only.
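For instance, here is a hedged Python translation of the same game (a minimal sketch, not the author's code), applied to the list example from the question; it returns the starting index of the max suffix:

def max_suffix_start(a):
    n = len(a)
    i, j = 0, 1
    while i < n and j < n:
        k = 0
        while i + k < n and j + k < n and a[i + k] == a[j + k]:
            k += 1                        # the judge compares the two suffixes
        if j + k >= n:
            return i                      # B is finally lost
        if i + k >= n:
            return j                      # A is finally lost
        if a[i + k] > a[j + k]:           # B loses this round and picks a new suffix
            j = j + k + 1
            if j == i:
                j += 1
        else:                             # A loses this round and picks a new suffix
            i = i + k + 1
            if i == j:
                i += 1
    return i if j >= n else j

print(max_suffix_start([1, 2, 3, 1, 0]))  # 2, i.e. the suffix [3, 1, 0]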

Resources