Dynamic programming--Longest Common Substring: Understanding the space optimization - algorithm

I am working through a ver typical question which is the Longest Common Substring of two strings.
The problem statement is quite clear:
for two string s1 and s2, find the length of their longest common substring.
I can understand the definition of the state represented by the dp array. It is a two-dimension array where two dimension just represents the index of the characters in each string(but just 1 based not 0 based).
The original solution code is like below which appears clear to me :
public int findLCSLength(String s1, String s2) {
int[][] dp = new int[s1.length()+1][s2.length()+1];
int maxLength = 0;
for(int i=1; i <= s1.length(); i++) {
for(int j=1; j <= s2.length(); j++) {
if(s1.charAt(i-1) == s2.charAt(j-1)) {
dp[i][j] = 1 + dp[i-1][j-1];
maxLength = Math.max(maxLength, dp[i][j]);
}
}
}
return maxLength;
}
This solution obviously can be optimized since the state of dp[i][j] just depends on the previous row which means two row will be sufficent for the dp array.
So I made the dp array a two-dimension one and use the mod operation to map the indexes in the range of 2.
static int findLCSLength(String s1, String s2) {
int[][] dp = new int[2][s2.length()+1];
int maxLength = 0;
for(int i=1; i <= s1.length(); i++) {
for(int j=1; j <= s2.length(); j++) {
if(s1.charAt(i-1) == s2.charAt(j-1)) {
dp[i%2][j] = 1 + dp[(i-1)%2][j-1];
maxLength = Math.max(maxLength, dp[i%2][j]);
}
}
}
return maxLength;
}
However my code didn't produce the correct answer for all test cases. I found one code snippet which gives correct answer on all test cases which has only one extra operation as I missed.
static int findLCSLength(String s1, String s2) {
int[][] dp = new int[2][s2.length()+1];
int maxLength = 0;
for(int i=1; i <= s1.length(); i++) {
for(int j=1; j <= s2.length(); j++) {
//This is the only extra line I missed
dp[i%2][j] = 0;
if(s1.charAt(i-1) == s2.charAt(j-1)) {
dp[i%2][j] = 1 + dp[(i-1)%2][j-1];
maxLength = Math.max(maxLength, dp[i%2][j]);
}
}
}
return maxLength;
}
One of the cases that my code fails is "passport" and "ppsspt", where my code produced 4 but the correct answer is obviously 3.
I am confused but this line , what does this line do and why it is necessary?
Hope anyone can help on that.

It resets the current count.
Your code sets this variable when:
if(s1.charAt(i-1) == s2.charAt(j-1)) {
But there's no else to set it to 0, which is effectively what that code does.
So consider when:
s1.charAt(i-1) != s2.charAt(j-1)
The previous value that you had in this array location will carry over to the next sub-string comparison when it shouldn't.

Related

The fastest algorithm for returning max length of consecutive same value fields of matrix?

Here is the given example:
We have the function which takes one matrix and it's number of columns and it's number of rows and returns int (this is gonna be length). For example:
int function (int** matrix, int n, int m)
The question is what's the fastest algorithm for implementing this function so it returns the maximum length of consecutive fields with the same value (doesn't matter if those same values are in one column or in one row, in this example on picture it's the 5 fields of one column with value 8)?
Values can be from 0-255 (grayscale for example).
So in the given example function should return 5.
If this is a bottleneck and the matrix is large, the first optimization to try is to make one pass over the matrix in sequential memory order (row-by-row in C or C++) rather than two. This is because it's very expensive to traverse a 2d array in the other direction. Cache and paging behavior are the worst possible.
For this you will need a row-sized array to track the number of consecutive values in the current run within each column.
int function (int a[][], int m, int n) {
if (n <= 0 || m <= 0) return 0;
int longest_run_len = 1; // Accumulator for the return value.
int current_col_run_len[n]; // Accumulators for each column
int current_row_run_len = 1; // Accumulator for the current row.
// Initialize the column accumulators and check the first row.
current_col_run_len[0] = 1;
for (int j = 1; j < n; j++) {
current_col_run_len[j] = 1;
if (a[0][j] == a[0][j-1]) {
if (++current_row_run_len > longest_run_len)
longest_run_len = current_row_run_len;
} else current_row_run_len = 1;
}
// Now the rest of the rows...
for (int i = 1; i < m; i++) {
// First column:
if (a[i][0] == a[i-1][0]) {
if (++current_col_run_len[0] > longest_run_len)
longest_run_len = current_col_run_len[0];
} else current_col_run_len[0] = 1;
// Other columns.
current_row_run_len = 1;
for (int j = 1; j < n; j++) {
if (a[i][j] == a[i][j-1]) {
if (++current_row_run_len > longest_run_len)
longest_run_len = current_row_run_len;
} else current_row_run_len = 1;
if (a[i][j] == a[i-1][j]) {
if (++current_col_run_len[j] > longest_run_len)
longest_run_len = current_col_run_len[j];
} else current_col_run_len[j] = 1;
}
}
return longest_run_len;
}
You need to pass over each entry of the matrix at least once, so you can't possible do better than O(m*n).
The most straightforward way is to pass over each row and each column once. This will be two passes over the matrix, but the algorithm is still O(m*n).
Any attempt to do it in one pass will probably be a lot more complex.
int function (int** matrix, int n, int m) {
int best=1;
for (int i=0; i<m; ++i) {
int k=1;
int last=-1;
for (int j=0; j<n; ++j) {
if (matrix[i][j] == last) {
k++;
if (k > best) {
best=k;
}
}
else {
k=1;
}
last = matrix[i][j];
}
}
for (int j=0; j<n; ++j) {
int k=1;
int last=-1;
for (int i=0; i<m; ++i) {
if (matrix[i][j] == last) {
k++;
if (k > best) {
best=k;
}
}
else {
k=1;
}
last = matrix[i][j];
}
}
return best;
}

Distinct Subsequences DP explanation

From LeetCode
Given a string S and a string T, count the number of distinct
subsequences of T in S.
A subsequence of a string is a new string which is formed from the
original string by deleting some (can be none) of the characters
without disturbing the relative positions of the remaining characters.
(ie, "ACE" is a subsequence of "ABCDE" while "AEC" is not).
Here is an example: S = "rabbbit", T = "rabbit"
Return 3.
I see a very good DP solution, however, I have hard time to understand it, anybody can explain how this dp works?
int numDistinct(string S, string T) {
vector<int> f(T.size()+1);
//set the last size to 1.
f[T.size()]=1;
for(int i=S.size()-1; i>=0; --i){
for(int j=0; j<T.size(); ++j){
f[j]+=(S[i]==T[j])*f[j+1];
printf("%d\t", f[j] );
}
cout<<"\n";
}
return f[0];
}
First, try to solve the problem yourself to come up with a naive implementation:
Let's say that S.length = m and T.length = n. Let's write S{i} for the substring of S starting at i. For example, if S = "abcde", S{0} = "abcde", S{4} = "e", and S{5} = "". We use a similar definition for T.
Let N[i][j] be the distinct subsequences for S{i} and T{j}. We are interested in N[0][0] (because those are both full strings).
There are two easy cases: N[i][n] for any i and N[m][j] for j<n. How many subsequences are there for "" in some string S? Exactly 1. How many for some T in ""? Only 0.
Now, given some arbitrary i and j, we need to find a recursive formula. There are two cases.
If S[i] != T[j], we know that N[i][j] = N[i+1][j] (I hope you can verify this for yourself, I aim to explain the cryptic algorithm above in detail, not this naive version).
If S[i] = T[j], we have a choice. We can either 'match' these characters and go on with the next characters of both S and T, or we can ignore the match (as in the case that S[i] != T[j]). Since we have both choices, we need to add the counts there: N[i][j] = N[i+1][j] + N[i+1][j+1].
In order to find N[0][0] using dynamic programming, we need to fill the N table. We first need to set the boundary of the table:
N[m][j] = 0, for 0 <= j < n
N[i][n] = 1, for 0 <= i <= m
Because of the dependencies in the recursive relation, we can fill the rest of the table looping i backwards and j forwards:
for (int i = m-1; i >= 0; i--) {
for (int j = 0; j < n; j++) {
if (S[i] == T[j]) {
N[i][j] = N[i+1][j] + N[i+1][j+1];
} else {
N[i][j] = N[i+1][j];
}
}
}
We can now use the most important trick of the algorithm: we can use a 1-dimensional array f, with the invariant in the outer loop: f = N[i+1]; This is possible because of the way the table is filled. If we apply this to my algorithm, this gives:
f[j] = 0, for 0 <= j < n
f[n] = 1
for (int i = m-1; i >= 0; i--) {
for (int j = 0; j < n; j++) {
if (S[i] == T[j]) {
f[j] = f[j] + f[j+1];
} else {
f[j] = f[j];
}
}
}
We're almost at the algorithm you gave. First of all, we don't need to initialize f[j] = 0. Second, we don't need assignments of the type f[j] = f[j].
Since this is C++ code, we can rewrite the snippet
if (S[i] == T[j]) {
f[j] += f[j+1];
}
to
f[j] += (S[i] == T[j]) * f[j+1];
and that's all. This yields the algorithm:
f[n] = 1
for (int i = m-1; i >= 0; i--) {
for (int j = 0; j < n; j++) {
f[j] += (S[i] == T[j]) * f[j+1];
}
}
I think the answer is wonderful, but something may be not correct.
I think we should iterate backwards over i and j. Then we change to array N to array f, we looping j forwards for not overlapping the result last got.
for (int i = m-1; i >= 0; i--) {
for (int j = 0; j < n; j++) {
if (S[i] == T[j]) {
N[i][j] = N[i+1][j] + N[i+1][j+1];
} else {
N[i][j] = N[i+1][j];
}
}
}

Interview - Find magnitude pole in an array

Magnitude Pole: An element in an array whose left hand side elements are lesser than or equal to it and whose right hand side element are greater than or equal to it.
example input
3,1,4,5,9,7,6,11
desired output
4,5,11
I was asked this question in an interview and I have to return the index of the element and only return the first element that met the condition.
My logic
Take two MultiSet (So that we can consider duplicate as well), one for right hand side of the element and one for left hand side of the
element(the pole).
Start with 0th element and put rest all elements in the "right set".
Base condition if this 0th element is lesser or equal to all element on "right set" then return its index.
Else put this into "left set" and start with element at index 1.
Traverse the Array and each time pick the maximum value from "left set" and minimum value from "right set" and compare.
At any instant of time for any element all the value to its left are in the "left set" and value to its right are in the "right set"
Code
int magnitudePole (const vector<int> &A) {
multiset<int> left, right;
int left_max, right_min;
int size = A.size();
for (int i = 1; i < size; ++i)
right.insert(A[i]);
right_min = *(right.begin());
if(A[0] <= right_min)
return 0;
left.insert(A[0]);
for (int i = 1; i < size; ++i) {
right.erase(right.find(A[i]));
left_max = *(--left.end());
if (right.size() > 0)
right_min = *(right.begin());
if (A[i] > left_max && A[i] <= right_min)
return i;
else
left.insert(A[i]);
}
return -1;
}
My questions
I was told that my logic is incorrect, I am not able to understand why this logic is incorrect (though I have checked for some cases and
it is returning right index)
For my own curiosity how to do this without using any set/multiset in O(n) time.
For an O(n) algorithm:
Count the largest element from n[0] to n[k] for all k in [0, length(n)), save the answer in an array maxOnTheLeft. This costs O(n);
Count the smallest element from n[k] to n[length(n)-1] for all k in [0, length(n)), save the answer in an array minOnTheRight. This costs O(n);
Loop through the whole thing and find any n[k] with maxOnTheLeft <= n[k] <= minOnTheRight. This costs O(n).
And you code is (at least) wrong here:
if (A[i] > left_max && A[i] <= right_min) // <-- should be >= and <=
Create two bool[N] called NorthPole and SouthPole (just to be humorous.
step forward through A[]tracking maximum element found so far, and set SouthPole[i] true if A[i] > Max(A[0..i-1])
step backward through A[] and set NorthPole[i] true if A[i] < Min(A[i+1..N-1)
step forward through NorthPole and SouthPole to find first element with both set true.
O(N) in each step above, as visiting each node once, so O(N) overall.
Java implementation:
Collection<Integer> magnitudes(int[] A) {
int length = A.length;
// what's the maximum number from the beginning of the array till the current position
int[] maxes = new int[A.length];
// what's the minimum number from the current position till the end of the array
int[] mins = new int[A.length];
// build mins
int min = mins[length - 1] = A[length - 1];
for (int i = length - 2; i >= 0; i--) {
if (A[i] < min) {
min = A[i];
}
mins[i] = min;
}
// build maxes
int max = maxes[0] = A[0];
for (int i = 1; i < length; i++) {
if (A[i] > max) {
max = A[i];
}
maxes[i] = max;
}
Collection<Integer> result = new ArrayList<>();
// use them to find the magnitudes if any exists
for (int i = 0; i < length; i++) {
if (A[i] >= maxes[i] && A[i] <= mins[i]) {
// return here if first one only is needed
result.add(A[i]);
}
}
return result;
}
Your logic seems perfectly correct (didn't check the implementation, though) and can be implemented to give an O(n) time algorithm! Nice job thinking in terms of sets.
Your right set can be implemented as a stack which supports a min, and the left set can be implemented as a stack which supports a max and this gives an O(n) time algorithm.
Having a stack which supports max/min is a well known interview question and can be done so each operation (push/pop/min/max is O(1)).
To use this for your logic, the pseudo code will look something like this
foreach elem in a[n-1 to 0]
right_set.push(elem)
while (right_set.has_elements()) {
candidate = right_set.pop();
if (left_set.has_elements() && left_set.max() <= candidate <= right_set.min()) {
break;
} else if (!left.has_elements() && candidate <= right_set.min() {
break;
}
left_set.push(candidate);
}
return candidate
I saw this problem on Codility, solved it with Perl:
sub solution {
my (#A) = #_;
my ($max, $min) = ($A[0], $A[-1]);
my %candidates;
for my $i (0..$#A) {
if ($A[$i] >= $max) {
$max = $A[$i];
$candidates{$i}++;
}
}
for my $i (reverse 0..$#A) {
if ($A[$i] <= $min) {
$min = $A[$i];
return $i if $candidates{$i};
}
}
return -1;
}
How about the following code? I think its efficiency is not good in the worst case, but it's expected efficiency would be good.
int getFirstPole(int* a, int n)
{
int leftPole = a[0];
for(int i = 1; i < n; i++)
{
if(a[j] >= leftPole)
{
int j = i;
for(; j < n; j++)
{
if(a[j] < a[i])
{
i = j+1; //jump the elements between i and j
break;
}
else if (a[j] > a[i])
leftPole = a[j];
}
if(j == n) // if no one is less than a[i] then return i
return i;
}
}
return 0;
}
Create array of ints called mags, and int variable called maxMag.
For each element in source array check if element is greater or equal to maxMag.
If is: add element to mags array and set maxMag = element.
If isn't: loop through mags array and remove all elements lesser.
Result: array of magnitude poles
Interesting question, I am having my own solution in C# which I have given below, read the comments to understand my approach.
public int MagnitudePoleFinder(int[] A)
{
//Create a variable to store Maximum Valued Item i.e. maxOfUp
int maxOfUp = A[0];
//if list has only one value return this value
if (A.Length <= 1) return A[0];
//create a collection for all candidates for magnitude pole that will be found in the iteration
var magnitudeCandidates = new List<KeyValuePair<int, int>>();
//add the first element as first candidate
var a = A[0];
magnitudeCandidates.Add(new KeyValuePair<int, int>(0, a));
//lets iterate
for (int i = 1; i < A.Length; i++)
{
a = A[i];
//if this item is maximum or equal to all above items ( maxofUp will hold max value of all the above items)
if (a >= maxOfUp)
{
//add it to candidate list
magnitudeCandidates.Add(new KeyValuePair<int, int>(i, a));
maxOfUp = a;
}
else
{
//remote all the candidates having greater values to this item
magnitudeCandidates = magnitudeCandidates.Except(magnitudeCandidates.Where(c => c.Value > a)).ToList();
}
}
//if no candidate return -1
if (magnitudeCandidates.Count == 0) return -1;
else
//return value of first candidate
return magnitudeCandidates.First().Key;
}

Old Top Coder riddle: Making a number by inserting +

I am thinking about this topcoder problem.
Given a string of digits, find the minimum number of additions required for the string to equal some target number. Each addition is the equivalent of inserting a plus sign somewhere into the string of digits. After all plus signs are inserted, evaluate the sum as usual.
For example, consider "303" and a target sum of 6. The best strategy is "3+03".
I would solve it with brute force as follows:
for each i in 0 to 9 // i -- number of plus signs to insert
for each combination c of i from 10
for each pos in c // we can just split the string w/o inserting plus signs
insert plus sign in position pos
evaluate the expression
if the expression value == given sum
return i
Does it make sense? Is it optimal from the performance point of view?
...
Well, now I see that a dynamic programming solution will be more efficient. However it is interesting if the presented solution makes sense anyway.
It's certainly not optimal. If, for example, you are given the string "1234567890" and the target is a three-digit number, you know that you have to split the string into at least four parts, so you need not check 0, 1, or 2 inserts. Also, the target limits the range of admissible insertion positions. Both points have small impact for short strings, but can make a huge difference for longer ones. However, I suspect there's a vastly better method, smells a bit of DP.
I haven't given it much thought yet, but if you scroll down you can see a link to the contest it was from, and from there you can see the solvers' solutions. Here's one in C#.
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections;
public class QuickSums {
public int minSums(string numbers, int sum) {
int[] arr = new int[numbers.Length];
for (int i = 0 ; i < arr.Length; i++)
arr[i] = 0;
int min = 15;
while (arr[arr.Length - 1] != 2)
{
arr[0]++;
for (int i = 0; i < arr.Length - 1; i++)
if (arr[i] == 2)
{
arr[i] = 0;
arr[i + 1]++;
}
String newString = "";
for (int i = 0; i < numbers.Length; i++)
{
newString+=numbers[i];
if (arr[i] == 1)
newString+="+";
}
String[] nums = newString.Split('+');
int sum1 = 0;
for (int i = 0; i < nums.Length; i++)
try
{
sum1 += Int32.Parse(nums[i]);
}
catch
{
}
if (sum == sum1 && nums.Length - 1 < min)
min = nums.Length - 1;
}
if (min == 15)
return -1;
return min;
}
}
Because input length is small (10) all possible ways (which can be found by a simple binary counter of length 10) is small (2^10 = 1024), so your algorithm is fast enough and returns valid result, and IMO there is no need to improve it.
In all until your solution works fine in time and memory and other given constrains, there is no need to do micro optimization. e.g this case as akappa offered can be solved with DP like DP in two-Partition problem, but when your algorithm is fast there is no need to do this and may be adding some big constant or making code unreadable.
I just offer parse digits of string one time (in array of length 10) to prevent from too many string parsing, and just use a*10^k + ... (Also you can calculate 10^k for k=0..9 in startup and save its value).
I think the problem is similar to Matrix Chain Multiplication problem where we have to put braces for least multiplication. Here braces represent '+'. So I think it could be solved by similar dp approach.. Will try to implement it.
dynamic programming :
public class QuickSums {
public static int req(int n, int[] digits, int sum) {
if (n == 0) {
if (sum == 0)
return 0;
else
return -1;
} else if (n == 1) {
if (sum == digits[0]) {
return 0;
} else {
return -1;
}
}
int deg = 1;
int red = 0;
int opt = 100000;
int split = -1;
for (int i=0; i<n;i++) {
red += digits[n-i-1] * deg;
int t = req(n-i-1,digits,sum - red);
if (t != -1 && t <= opt) {
opt = t;
split = i;
}
deg = deg*10;
}
if (opt == 100000)
return -1;
if (split == n-1)
return opt;
else
return opt + 1;
}
public static int solve (String digits,int sum) {
int [] dig = new int[digits.length()];
for (int i=0;i<digits.length();i++) {
dig[i] = digits.charAt(i) - 48;
}
return req(digits.length(), dig, sum);
}
public static void doit() {
String digits = "9230560001";
int sum = 71;
int result = solve(digits, sum);
System.out.println(result);
}
Seems to be too late .. but just read some comments and answers here which say no to dp approach . But it is a very straightforward dp similar to rod-cutting problem:
To get the essence:
int val[N][N];
int dp[N][T];
val[i][j]: numerical value of s[i..j] including both i and j
val[i][j] can be easily computed using dynamic programming approach in O(N^2) time
dp[i][j] : Minimum no of '+' symbols to be inserted in s[0..i] to get the required sum j
dp[i][j] = min( 1+dp[k][j-val[k+1][j]] ) over all k such that 0<=k<=i and val[k][j]>0
In simple terms , to compute dp[i][j] you assume the position k of last '+' symbol and then recur for s[0..k]

Longest Common Subsequence among 3 Strings

I've implemented the dynamic programming solution to find the longest common subsequence among 2 strings. There is apparently a way to generalize this algorithm to find the LCS among 3 strings, but in my research I have not found any information on how to go about this. Any help would be appreciated.
To find the Longest Common Subsequence (LCS) of 2 strings A and B, you can traverse a 2-dimensional array diagonally like shown in the Link you posted. Every element in the array corresponds to the problem of finding the LCS of the substrings A' and B' (A cut by its row number, B cut by its column number). This problem can be solved by calculating the value of all elements in the array. You must be certain that when you calculate the value of an array element, all sub-problems required to calculate that given value has already been solved. That is why you traverse the 2-dimensional array diagonally.
This solution can be scaled to finding the longest common subsequence between N strings, but this requires a general way to iterate an array of N dimensions such that any element is reached only when all sub-problems the element requires a solution to has been solved.
Instead of iterating the N-dimensional array in a special order, you can also solve the problem recursively. With recursion it is important to save the intermediate solutions, since many branches will require the same intermediate solutions. I have written a small example in C# that does this:
string lcs(string[] strings)
{
if (strings.Length == 0)
return "";
if (strings.Length == 1)
return strings[0];
int max = -1;
int cacheSize = 1;
for (int i = 0; i < strings.Length; i++)
{
cacheSize *= strings[i].Length;
if (strings[i].Length > max)
max = strings[i].Length;
}
string[] cache = new string[cacheSize];
int[] indexes = new int[strings.Length];
for (int i = 0; i < indexes.Length; i++)
indexes[i] = strings[i].Length - 1;
return lcsBack(strings, indexes, cache);
}
string lcsBack(string[] strings, int[] indexes, string[] cache)
{
for (int i = 0; i < indexes.Length; i++ )
if (indexes[i] == -1)
return "";
bool match = true;
for (int i = 1; i < indexes.Length; i++)
{
if (strings[0][indexes[0]] != strings[i][indexes[i]])
{
match = false;
break;
}
}
if (match)
{
int[] newIndexes = new int[indexes.Length];
for (int i = 0; i < indexes.Length; i++)
newIndexes[i] = indexes[i] - 1;
string result = lcsBack(strings, newIndexes, cache) + strings[0][indexes[0]];
cache[calcCachePos(indexes, strings)] = result;
return result;
}
else
{
string[] subStrings = new string[strings.Length];
for (int i = 0; i < strings.Length; i++)
{
if (indexes[i] <= 0)
subStrings[i] = "";
else
{
int[] newIndexes = new int[indexes.Length];
for (int j = 0; j < indexes.Length; j++)
newIndexes[j] = indexes[j];
newIndexes[i]--;
int cachePos = calcCachePos(newIndexes, strings);
if (cache[cachePos] == null)
subStrings[i] = lcsBack(strings, newIndexes, cache);
else
subStrings[i] = cache[cachePos];
}
}
string longestString = "";
int longestLength = 0;
for (int i = 0; i < subStrings.Length; i++)
{
if (subStrings[i].Length > longestLength)
{
longestString = subStrings[i];
longestLength = longestString.Length;
}
}
cache[calcCachePos(indexes, strings)] = longestString;
return longestString;
}
}
int calcCachePos(int[] indexes, string[] strings)
{
int factor = 1;
int pos = 0;
for (int i = 0; i < indexes.Length; i++)
{
pos += indexes[i] * factor;
factor *= strings[i].Length;
}
return pos;
}
My code example can be optimized further. Many of the strings being cached are duplicates, and some are duplicates with just one additional character added. This uses more space than necessary when the input strings become large.
On input: "666222054263314443712", "5432127413542377777", "6664664565464057425"
The LCS returned is "54442"

Resources