finding longest similar subsequence in a string


Suppose I want to find the longest subsequence such that the first half of the subsequence is the same as its second half.
For example: in the string abkcjadfbck, the result is abcabc, as abc is repeated in its first and second half. In the string aaa, the result is aa.

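This task may be treated as a combination of two well-known problems.
If you know in advance a split point between the two halves of the subsequence, you just need to find the best match between the two resulting strings. This is the pairwise alignment problem (here, the longest common subsequence). Various dynamic programming methods solve it in O(N^2) time.
To find the point where the string should be split optimally, you can use golden section search or Fibonacci search. These take O(log N) steps, but they are only guaranteed to find the optimum if the match length is unimodal in the split point, which does not hold for every input (for aaXbb with X unique, the match lengths over the split points are 1, 0, 0, 1). Trying every split point is always correct and costs O(N^3) in total.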
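A minimal sketch of the split-and-align idea, in Python. It deliberately tries every split point instead of using a golden section search, so it does not rely on unimodality; the function names `lcs` and `longest_doubled_subsequence` are made up for this illustration. When several answers of the same length exist, it returns just one of them.

```python
def lcs(a: str, b: str) -> str:
    """One longest common subsequence of a and b, via the classic O(len(a)*len(b)) table."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the top-left corner to recover one optimal subsequence.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def longest_doubled_subsequence(s: str) -> str:
    """Try every split point; the best half is the LCS of the prefix and the suffix."""
    best = ""
    for split in range(1, len(s)):
        half = lcs(s[:split], s[split:])
        if len(half) > len(best):
            best = half
    return best + best
```

For abkcjadfbck this finds a result of length 6 (abcabc and abkabk are both valid), and for aaa it returns aa.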

In a first pass over inputString, we can count how often each character occurs, and remove those that occur only once.
input: inputString
data structures:
Set<Triple<char[], Integer, Integer>> potentialSecondWords;
Map<Character, List<Integer>> lettersList;
for the characters c with increasing index h in inputString do {
    if (!lettersList.get(c).isEmpty()) {
        for ((secondWord, currentIndex, maxIndex) in potentialSecondWords) {
            if (there exists a j in lettersList.get(c) between currentIndex and maxIndex) {
                update (secondWord, currentIndex, maxIndex) by adding c to secondWord and replacing currentIndex with j;
            }
        }
        if potentialSecondWords contains a triple whose char[] is equal to c, remove it;
        put new Triple with value (c, lettersList.get(c).get(0), h-1) into potentialSecondWords;
    }
    lettersList.get(c).add(h);
}
find the largest secondWord in potentialSecondWords and output secondWord twice;
So this algorithm passes once over the array, creating for each index, where it makes sense, a Triple representing the potential second word starting at the current index, and updates all potential second words.
With a suitable list implementation and n being the size of inputString, this algorithm has worst case runtime O(n²), e.g. for a^n.

Related

Reconstruct the whitespace-omitted string in O(n^2)

I want to check if my algorithm is correct.
Given a string of n characters with all white-space omitted,
Ex: "itwasthebestoftimes"
Give a dynamic programming algorithm which determines if the string can be broken into a valid sequence of words, and reconstruct a valid string with whitespace, in O(n^2).
My idea:
First find all substrings of the string (O(n^2)), and for each substring that is a word, map its position and length to an interval.
Ex: "it was the best"
     [] [-] [-] [--]
      [---]  []
         []
(Spaces added to make it easier to view).
In the above example, "it" is valid and gets an interval value of 2, "was" gets 3, etc. The string "twas" is also valid, and gets a value of 4.
This is then reduced to a mini-max problem: find the maximum total length of non-overlapping intervals in the set. Since the valid string must contain all letters, the maximum-length set of non-overlapping intervals will be the answer, and finding this takes Theta(n*log(n)).
Therefore the solution will take O(n^2 + n*log(n)) = O(n^2).
Is my thinking correct?

Your thinking is fine (assuming you know an O(n log n) solution to the problem of finding a maximum set of non-overlapping intervals), and that you know a way to find the word intervals in O(n^2) time. However, I think the problem is easier than you're making it.
Create an array W[0...n]. W[i] will be 0 if there's no way to cut up the string from i onwards into words; otherwise it stores the length of a word that starts a valid cutting up of the string from i.
Then, writing s for the input string:
W[i] = min(j such that s[i:i+j] is a word, and i+j = n or W[i+j] > 0)
or 0 if there's no such j.
If you keep your dictionary in a trie, you can compute W[i] in O(n-i) time assuming you've already computed W[i+1] to W[n-1]. That means you can compute all of W in O(n^2) time. Or if the maximum length of the word in your dictionary is k, you can do it in O(nk) time.
Once you've computed all of W, the whole string can be cut up into words if and only if W[0] is not 0.
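The recurrence above can be sketched in Python as follows. This is an illustration under a few assumptions: the dictionary is a plain `set` of words rather than a trie, lookups use string slices, and the function name `split_words` is invented here. The inner loop is capped at the longest dictionary word, matching the O(nk) remark.

```python
from typing import List, Optional, Set


def split_words(s: str, words: Set[str]) -> Optional[List[str]]:
    """Return one valid split of s into dictionary words, or None if there is none."""
    n = len(s)
    max_len = max((len(w) for w in words), default=0)
    # W[i] = length of a word that starts a valid split of s[i:], or 0 if none exists.
    W = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        for j in range(1, min(max_len, n - i) + 1):
            if s[i:i + j] in words and (i + j == n or W[i + j] > 0):
                W[i] = j  # smallest such j, as in the recurrence
                break
    if n > 0 and W[0] == 0:
        return None
    # Follow the stored word lengths from the start to rebuild the sentence.
    out, i = [], 0
    while i < n:
        out.append(s[i:i + W[i]])
        i += W[i]
    return out
```

With the dictionary {it, was, the, best, of, times, twas}, the input itwasthebestoftimes comes back as it / was / the / best / of / times.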

Find longest positive substrings in binary string

Let's assume I have a string like 100110001010001. I'd like to find substrings that:
are as long as possible
have a total positive sum (> 0)
That is, the longest substrings that have more 1s than 0s.
For example, for the string above, 100110001010001, it would be: [10011]000[101]000[1]
Actually, it would be enough to find the total length of those, in this case 9.
Unfortunately, I have no clue how this can be done without brute force. Any ideas, please?

As posted now, your question seems a bit unclear. The total length of valid substrings that are "as long as possible" could mean different things: for example, among other options, it could be (1) a list of the longest valid extension to the left of each index (which would allow overlaps in the list), (2) the longest combination of non-overlapping such longest left-extensions, (3) the longest combination of non-overlapping, valid substrings (where each substring is not necessarily the longest possible).
I will outline a method for (3) since it easily transforms to (1) or (2). Finding the longest left-extension from each index with more ones than zeros can be done in O(n log n) time and O(n) additional space (for just the longest valid substring in O(n) time, see here: Finding the longest non-negative sub array). With that preprocessing, finding the longest combination of valid, non-overlapping substrings can be done with dynamic programming in somewhat optimized O(n^2) time and O(n) additional space.
We start by traversing the string, storing sums representing the partial sum up to and including s[i], counting zeros as -1. We insert each partial sum in a binary tree where each node also stores an array of indexes where the value occurs, and the leftmost index of a value less than the node's value. (A substring from s[a+1] to s[b] has more ones than zeros if the prefix sum up to b is greater than the prefix sum up to a.) If a value is already in the tree, we add the index to the node's index array.
Since we are traversing from left to right, only when a new lowest value is inserted into the tree is the leftmost-index-of-lower-value updated — and it's updated only for the node with the previous lowest value. This is because any nodes with a lower value would not need updating; and if any nodes with lower values were already in the tree, any nodes with higher values would already have stored the index of the earliest one inserted.
The longest valid substring to the left of each index extends to the leftmost index with a lower prefix sum, which can be easily looked up in the tree.
To get the longest combination, let f(i) represent the length of the longest combination up to index i. Then f(i) is the maximum of f(i-1) (index i left uncovered) and, over every valid left extension from i back to some index j, the length of that extension plus f(j-1).
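Here is a hedged Python sketch of interpretation (3) only. It is the plain O(n^2) dynamic program over prefix sums and skips the binary-tree preprocessing described above; the name `longest_positive_cover` is made up for this example.

```python
def longest_positive_cover(bits: str) -> int:
    """Max total length of non-overlapping substrings that each have more 1s than 0s."""
    n = len(bits)
    # prefix[k] = (#ones - #zeros) in bits[:k]
    prefix = [0] * (n + 1)
    for k, ch in enumerate(bits):
        prefix[k + 1] = prefix[k] + (1 if ch == "1" else -1)
    # f[i] = best total length using only the first i characters
    f = [0] * (n + 1)
    for i in range(1, n + 1):
        f[i] = f[i - 1]  # option: leave character i-1 uncovered
        for j in range(i):
            if prefix[i] > prefix[j]:  # bits[j:i] has more ones than zeros
                f[i] = max(f[i], f[j] + (i - j))
    return f[n]
```

For 100110001010001 this returns 9, matching the [10011]000[101]000[1] decomposition from the question.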

Dynamic programming.
We have a string. If it is positive, that's our answer. Otherwise we need to trim each end until it goes positive, and find each pattern of trims. So for each length (N-1, N-2, N-3, etc.), we've got N - length possible paths (trim from a, trim from b), each of which gives us a state. When the state goes positive, we've found our substring.
So we keep two lists of integers, representing what happens if we trim entirely from a or entirely from b. Then backtrack. If we trim one from a, we must trim all the rest from b; if we trim two from a, we must trim one fewer from b. Is there an answer that allows us to go positive?
We can eliminate quickly because the answer must be at a maximum, either max trimming from a or max trimming from b. If the other trim allows us to go positive, that's the result.
pseudocode:
N = length(string);
Nones = countones(string);
Nzeros = N - Nones;
if(Nones > Nzeros)
    return string;
vector<int> cuta; // balance (ones - zeros) left after trimming from the front (a)
vector<int> cutb; // balance left after trimming from the back (b)
int besta = Nones - Nzeros;
int bestb = Nones - Nzeros;
cuta.push_back(besta);
cutb.push_back(bestb);
bestia = 0;
bestib = 0;
for(i=0;i<N;i++)
{
    // trimming a '1' lowers the balance, trimming a '0' raises it
    cuta.push_back( string[i] == '1' ? cuta.back() - 1 : cuta.back() + 1);
    cutb.push_back( string[N-i-1] == '1' ? cutb.back() - 1 : cutb.back() + 1);
    if(cuta.back() > besta)
    {
        besta = cuta.back();
        bestia = i;
    }
    if(cutb.back() > bestb)
    {
        bestb = cutb.back();
        bestib = i;
    }
    // checks: is a cut wholly from a or wholly from b going to send us positive?
    if(besta == 1)
        answer = substring(string, bestia, N);
    if(bestb == 1)
        answer = substring(string, 0, N - bestib);
    // if not, is a combined cut from the current position to
    // the peak in the other distribution going to send us positive?
    if(Nones - Nzeros + besta + cutb.back() == 1)
    {
        answer = substring(string, bestia, N - i);
    }
    if(Nones - Nzeros + cuta.back() + bestb == 1)
    {
        answer = substring(string, i, N - bestib);
    }
}
/* if we get here the string was all zeros and has no positive substring */
This is untested and the final checks are a bit fiddly and I might have
made an error somewhere, but the algorithm should work more or less
as described.

Optimizing construction of a trie over all substrings

I am solving a trie related problem. There is a set of strings S. I have to create a trie over all substrings for each string in S. I am using the following routine:
String strings[] = { ... }; // array containing all strings
for (int i = 0; i < strings.length; i++) {
    String w = strings[i];
    for (int j = 0; j < w.length(); j++) {
        for (int k = j + 1; k <= w.length(); k++) {
            trie.insert(w.substring(j, k));
        }
    }
}
I am using the trie implementation provided here. However, I am wondering if there are certain optimizations which can be done in order to reduce the complexity of creating trie over all substrings?
Why do I need this? Because I am trying to solve this problem.

If we have N words, each with maximum length L, your algorithm will take O(N*L^3) (supposing that adding to the trie is linear in the length of the added word). However, the size of the resulting trie (number of nodes) is at most O(N*L^2), so it seems you are wasting time and could do better.
And indeed you can, but you have to pull a few tricks from your sleeve. Also, you will no longer need the trie.
1. .substring() in constant time
In older Java versions (before Java 7 update 6), each String had a backing char[] array as well as a starting position and length. This allowed the .substring() method to run in constant time, since String is an immutable class: a new String object with the same backing char[] array was created, only with a different start position and length.
You will need to extend this a bit, to support adding at the end of the string, by increasing the length. Always create a new string object, but leave the backing array the same.
2. Recompute hash in constant time after appending a single character
Again, let me use Java's hashCode() function for String:
int hash = 0;
for (int i = 0; i < data.length; i++) {
    hash = 31 * hash + data[i];
} // data is the backing array
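Now, how will the hash change after adding a single character at the end of the word? Easy: multiply the old hash by 31 and add the new character's value (its character code), which is exactly one more iteration of the loop above. (A table of powers of 31 is only needed for changes at the front of the string, where the first character is weighted by 31^(length-1).) Other multipliers can be used as well.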
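To illustrate the constant-time update, here is a small Python sketch that mimics the loop above. Two things are assumptions of this example rather than part of the answer: values are masked to unsigned 32 bits (Java's int is signed, so negative Java hashes show up here as large positive numbers), and the helper names `java_hash` and `substring_hashes` are invented.

```python
MASK = 0xFFFFFFFF  # keep values in 32 bits, like Java's int arithmetic (but unsigned)


def java_hash(s: str) -> int:
    """Hash of s computed from scratch with the 31 * hash + char recurrence."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & MASK
    return h


def substring_hashes(w: str) -> dict:
    """Map (start, end) -> hash of w[start:end], extending one character at a time."""
    out = {}
    for start in range(len(w)):
        h = 0
        for end in range(start + 1, len(w) + 1):
            # Appending w[end - 1] costs O(1): one multiply and one add.
            h = (31 * h + ord(w[end - 1])) & MASK
            out[(start, end)] = h
    return out
```

Every entry produced by `substring_hashes` equals `java_hash` of the corresponding slice, so all O(L^2) substring hashes of a word are obtained in O(L^2) total time.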
3. Store all substrings in a single HashMap
Using tricks 1 and 2, you can generate all substrings in time O(N*L^2), which is the total number of substrings. Just always start with a string of length one and add one character at a time. Put all your strings into a single HashMap to eliminate duplicates.
(You can skip 2 and 3 and discard duplicates when/after sorting; perhaps it will be even faster.)
4. Sort your substrings and you are good to go.
Well, when I got to point 4, I realized my plan wouldn't work, because in sorting you need to compare strings, and that can take O(L) time. I came up with several attempts to solve it, among them bucket sorting, but none would be faster than the original O(N*L^3).
I will just leave this answer here in case it inspires someone.
In case you don't know the Aho-Corasick algorithm, take a look into that; it could have some use for your problem.

What you need may be a suffix automaton. It costs only O(n) time to build and can recognize all substrings.
A suffix array can also solve this problem.
These two algorithms can solve most string problems, and they are really hard to learn. After you learn those you will solve it.

You may consider the following optimization:
Maintain list of processed substrings. While inserting a substring, check if the processed set contains that particular substring and if yes, skip inserting that substring in the trie.
However, the worst case complexity for insertion of all substrings in trie will be of the order of n^2 where n is the size of strings array. From the problem page, this works out to be of the order of 10^8 insertion operations in trie. Therefore, even if each insertion takes 10 operations on an average, you will have 10^9 operations in total which sets you up to exceed the time limit.
The problem page refers to the LCP array as a related topic for the problem. You should consider a change in approach.

First, notice that it is enough to add only suffixes to the trie, and nodes for every substring will be added along the way.
Second, you have to compress the trie, otherwise it will not fit into memory limit imposed by HackerRank. Also this will make your solution faster.
I just submitted my solution implementing these suggestions, and it was accepted. (the max execution time was 0.08 seconds.)
But you can make your solution even faster by implementing a linear time algorithm to construct the suffix tree. You can read about linear time suffix tree construction algorithms here and here. There is also an explanation of the Ukkonen's algorithm on StackOverflow here.
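As an illustration of the first point (inserting only suffixes), here is a minimal Python sketch using nested dictionaries as an uncompressed trie. It does not implement the compression or the linear-time construction mentioned above, and the name `count_distinct_substrings` is hypothetical. Every trie node other than the root corresponds to exactly one distinct substring.

```python
def count_distinct_substrings(strings) -> int:
    """Insert every suffix of every string into one trie and count the nodes created."""
    root = {}
    nodes = 0
    for w in strings:
        for start in range(len(w)):
            node = root
            # Walking down the suffix w[start:] passes through all of its prefixes,
            # i.e. through every substring of w that begins at `start`.
            for ch in w[start:]:
                if ch not in node:
                    node[ch] = {}
                    nodes += 1
                node = node[ch]
    return nodes
```

This costs O(L^2) per word of length L instead of the O(L^3) of inserting each substring separately.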

Finding a set of repeated, non-overlapping substrings of two input strings using suffix arrays

Input: two strings A and B.
Output: a set of repeated, non overlapping substrings
I have to find all the repeated strings, each of which has to occur in both(!) strings at least once. So for instance, let
A = "xyabcxeeeyabczeee" and B = "yxabcxabee".
Then a valid output would be {"abcx","ab","ee"} but not "eee", since it occurs only in string A.
I think this problem is very related to the "supermaximal repeat" problem. Here is a definition:
Maximal repeated pair :
A pair of identical substrings alpha and beta in S such that extending alpha and beta
in either direction would destroy the equality of the two strings
It is represented as a triplet (position1,position2, length)
Maximal repeat :
“A substring of S that occurs in a maximal pair in S”.
Example: abc in S = xabcyiiizabcqabcyrxar.
Note: There can be numerous maximal repeated pairs, but there can be only a limited
number of maximal repeats.
Supermaximal repeat
“A maximal repeat that never occurs as a substring of any other maximal repeat”
Example: abcy in S = xabcyiiizabcqabcyrxar.
An algorithm for finding all supermaximal repeats is described in "Algorithms on strings, trees and sequences", but only for suffix trees.
It works by:
1.) finding all left-diverse nodes using DFS
For each position i in S, S(i-1) is called the left character of i.
Left character of a leaf in T(S) is the left character of the suffix position
represented by that leaf.
An internal node v in T(S) is called left-diverse if at least two leaves in v’s
subtree have different left characters.
2.) applying theorem 7.12.4 on those nodes:
A left diverse internal node v represents a supermaximal repeat a if and only if
all of v's children are leaves, and each has a distinct left character
Both strings A and B probably have to be concatenated, and when we check v's leaves
in step two we also have to impose an additional constraint: there has to be
at least one distinct left character from each of the strings A and B. This can be done by comparing
their position against the length of A. If position(left character) > length(A), then the left character is in B, else in A.
Can you help me solve this problem with suffix + lcp arrays?

It sounds like you are looking for the set intersection of all substrings of your two input strings. In that case, single-letter substrings should also be returned. Let s1 and s2 be your strings, s1 the shorter of the two. After thinking about this for a while, I don't think you can get much better than the intuitive O(n^3m) or O(n^3) algorithm, where n is the length of s1 and m is the length of s2. I don't think suffix trees can help you here.
for(int i=0 to n-1){
    for(int j=1 to n-i){
        if(contains(s2,substring(s1,i,j))) emit the substring
    }
}
The runtime comes from the (n^2)/2 loop iterations, each doing a worst-case O(nm) contains operation (possibly O(n) depending on implementation). But it's not really quite this bad, since there will be a constant much smaller than one out front, because the length of the substring will actually range between 1 and n.
If you don't want single-character matches, you could just initialize j as 2 or something higher.
BTW: Don't actually create new strings with substring; find or create a contains function that will take indices and the original string and just look at the characters between, inclusive of i, exclusive of j.
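A runnable Python version of the loop above might look like this. It is a sketch with two liberties that are not in the answer: it uses slices and the `in` operator for brevity (so it does create temporary strings, contrary to the advice above), and it stops extending from a start position as soon as a substring is missing from s2, since no longer substring from that position can match either. The name `common_substrings` and the `min_len` parameter are made up here.

```python
def common_substrings(s1: str, s2: str, min_len: int = 1) -> set:
    """All distinct substrings of length >= min_len that occur in both strings."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1  # enumerate substrings of the shorter string
    found = set()
    n = len(s1)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            if s1[i:j] not in s2:
                break  # every longer substring starting at i is absent too
            found.add(s1[i:j])
    return found
```

For A = "xyabcxeeeyabczeee" and B = "yxabcxabee" the result contains abcx and ee but not eee.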

Optimal solution for non-overlapping maximum scoring sequences

While developing part of a simulator I came across the following problem. Consider a string of length N, and M substrings of this string with a non-negative score assigned to each of them. Of particular interest are the sets of substrings that meet the following requirements:
They do not overlap.
Their total score (by sum, for simplicity) is maximum.
They span the entire string.
I understand the naive brute-force solution is of O(M*N^2) complexity. While the implementation of this algorithm would probably not impose a lot on the performance of the whole project (nowhere near the critical path, can be precomputed, etc.), it really doesn't sit well with me.
I'd like to know if there are any better solutions to this problem and, if so, what they are. Pointers to relevant code are always appreciated, but just an algorithm description will do too.

This can be thought of as finding the longest path through a DAG. Each position in the string is a node and each substring match is an edge. You can trivially prove through induction that for any node on the optimal path the concatenation of the optimal path from the beginning to that node and from that node to the end is the same as the optimal path. Thanks to that you can just keep track of the optimal paths for each node and make sure you have visited all edges that end in a node before you start to consider paths containing it.
Then you just have the issue to find all edges that start from a node, or all substring that match at a given position. If you already know where the substring matches are, then it's as trivial as building a hash table. If you don't you can still build a hashtable if you use Rabin-Karp.
Note that with this you'll still visit all the edges in the DAG, for O(e) complexity. In other words, you'll have to consider once each substring match that's possible in a sequence of connected substrings from start to end. You could do better than this by preprocessing the substrings to find ways to rule out some matches. I have my doubts whether any general-case complexity improvement is possible here, and any practical improvements depend heavily on your data distribution.

It is not clear whether the M substrings are given as sequences of characters or as indices in the input string, but the problem doesn't change much because of that.
Let us have input string S of length N, and M input strings Tj. Let Lj be the length of Tj, and Pj the score given for string Tj. We say that string
This is called Dynamic Programming, or DP. You keep an array res of length N+1, where the i-th element represents the best score one can get using only the suffix starting at the i-th character (for example, if the input is "abcd", then res[2] will represent the best score you can get from "cd"). res[N], the score of the empty suffix, is 0, and a suffix that cannot be fully covered gets no valid score.
Then, you iterate through this array from the end to the beginning, and check whether you can start string Tj at the i-th character. If you can, and res[i + Lj] is valid, then a result of (res[i + Lj] + Pj) is clearly achievable. Iterating over all Tj, res[i] = max(res[i + Lj] + Pj) for all Tj which can be applied at the i-th character.
res[0] will be your final answer.
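A short Python sketch of this backward DP, assuming the substrings are given as (text, score) pairs and that "cannot be covered" is represented by negative infinity. Both choices, and the name `best_score`, are assumptions of this example.

```python
def best_score(S: str, pieces) -> float:
    """pieces: iterable of (T_j, P_j). Best total score of pieces that tile S exactly."""
    pieces = list(pieces)
    n = len(S)
    NEG = float("-inf")  # marks a suffix that cannot be covered
    res = [NEG] * (n + 1)
    res[n] = 0  # the empty suffix is covered with score 0
    for i in range(n - 1, -1, -1):
        for t, p in pieces:
            # T_j must be non-empty, match S at position i, and leave a coverable rest.
            if t and S.startswith(t, i) and res[i + len(t)] != NEG:
                res[i] = max(res[i], res[i + len(t)] + p)
    return res[0]
```

The running time is O(N * M * L) for maximum piece length L because of the direct `startswith` comparison; with precomputed match positions the comparison cost disappears.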

inputs:
N, the number of chars in a string
e[0..N-1]: (b,c) an element of set e[a] means [a,b) is a substring with score c.
(If all substrings are possible, then you could just have c(a,b).)
By e.g. [1,2) we mean the substring covering the 2nd letter of the string (half open interval).
(empty substrings are not allowed; if they were, then you could handle them properly only if you allow them to be "taken" at most k times)
Outputs:
s[i] is the score of the best substring covering of [0,i)
a[i]: [a[i],i) is the last substring used to cover [0,i); else NULL
Algorithm - O(N^2) if the intervals e are not sparse; O(N+E) where E is the total number of allowed intervals. This is effectively finding the best path through an acyclic graph:
for i = 0 to N:
    a[i] <- NULL
    s[i] <- 0
a[0] <- 0
for i = 0 to N-1
    if a[i] != NULL
        for (b,c) in e[i]:
            sib <- s[i]+c
            if a[b] == NULL or sib > s[b]:
                a[b] <- i
                s[b] <- sib
To yield the best covering triples (a,b,c) where cost of [a,b) is c:
i <- N
if (a[i]==NULL):
    error "no covering"
while (i != 0):
    from <- a[i]
    yield (from, i, s[i]-s[from])
    i <- from
Of course, you could store the pair (sib,c) in s[b] and save the subtraction.

O(N+M) solution:
Set f[1..N]=-1
Set f[0]=0
for a = 0 to N-1
    if f[a] >= 0
        For each substring beginning at a
            Let b be the last index of the substring, and c its score
            If f[a]+c > f[b+1]
                Set f[b+1] = f[a]+c
                Set g[b+1] = [substring number]
Now f[N] contains the answer, or -1 if no set of substrings spans the string.
To get the substrings:
b = N
while b > 0
    Get substring number from g[b]
    Output substring number
    b = b - (length of substring)