How to check if all substrings can be found in a dictionary - algorithm

I have a problem that I want to solve as efficiently as possible. As an example,
I am given a string of words: A B C D and I have a 'dictionary' with 5 entries:
A
B C
B D
D
E
The dictionary tells me which substrings can be in my input string. And I want to check as efficiently as possible if the whole input string can be split into substrings so that all of them are found in the dictionary.
In the example, the input string can be covered by splitting it into A, B C and D.
I'd like to know if there's a better way than just brute-forcing through all possible substrings. The fewer times I have to check whether a substring is in the dictionary, the better.
It's not necessary to know which substrings couldn't be found in case there are no possible solutions.
Thank you.

I would use a tree instead of a dictionary. This will improve the search speed and will eliminate sub-trees for searching.

If you can use the same substring several times, there is a natural dynamic programming solution to this.
Let n be the size of your string. Let v be a vector of size n+1 such that v[i] = true if and only if the substring made of the last (n-i) characters of your original string can be broken down with your dictionary (v[n], the empty suffix, is true). Then you can fill the vector v backwards, starting at the last index and decreasing i at each step.
In pseudo-code:
Let D be your dictionary
Let s be your string
Let n be the size of s
(Let s[i:j] denote the substring of s made by characters between i and j (inclusive))
Let v be a vector of size n+1 filled with zeros
Let v[n] = 1
For int i = n-1; i>=0; i--
    For int j = i; j <=n-1; j++
        If (s[i:j] is in D) and (v[j+1] is equal to 1)
            v[i] = 1
            Exit the inner for loop
Return v[0]
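The pseudo-code above can be sketched in Python as follows. This is only an illustration under a few assumptions that are not in the answer: the input is already split into word tokens, dictionary entries are space-joined token runs (as in the question's "B C"), and the name can_split is made up for the example.

```python
def can_split(tokens, dictionary):
    """Return True if tokens can be cut into consecutive runs that are all in dictionary."""
    n = len(tokens)
    # v[i] is True when the suffix tokens[i:] can be fully split.
    v = [False] * (n + 1)
    v[n] = True  # the empty suffix is trivially splittable
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            # tokens[i..j] is the candidate piece; v[j + 1] covers the rest.
            if v[j + 1] and " ".join(tokens[i:j + 1]) in dictionary:
                v[i] = True
                break
    return v[0]


dictionary = {"A", "B C", "B D", "D", "E"}
print(can_split("A B C D".split(), dictionary))  # True: A | B C | D
```

Checking v[j + 1] before the dictionary lookup skips lookups that cannot help, which is in the spirit of the question's wish to query the dictionary as rarely as possible.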

You can make it run in O(N^2) with the following method.
First, store all your dictionary strings in a trie.
Second, use a dynamic programming approach to solve your problem. For each position i we will be calculating whether the substring of the first i symbols can be split into words from a dictionary (a trie). We will use, for simplicity, a forward-looking approach of dynamic programming.
So first we set that the substring of the first 0 symbols can be split. Then we iterate from 0 to N-1. When we come to position i, we assume that we know the answer for this position already. If the split is possible, then we can go from this position and see which strings starting from this position are in the trie. For every such string, mark its end position as possible. By using the trie, we can do this in O(N) per one external loop iteration.
t = trie of given words
ans = {false}
ans[0] = true
for i=0..N-1
    if ans[i] // if substring s[0]..s[i-1] can be split to given words
        cur = t.root
        for j=i to N-1 // go along all strings starting from position i
            cur=cur.child(s[j]) // move to the child in trie
            if cur is null // no dictionary word continues this way
                break
            // therefore, the position cur corresponds to string
            // s[i]...s[j]
            if cur.isWordEnd // if there is a word in trie that ends in cur
                ans[j+1] = true // the string s[0]..s[j] can be split
your answer is in ans[N]
Total time is O(N^2).
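A minimal Python sketch of the trie-based approach above, assuming the input is a plain character string and the trie is a nested dict. The names build_trie, can_split_trie and the END marker are hypothetical, chosen only for this example.

```python
END = None  # key marking "a word ends here"; never collides with a character


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def can_split_trie(s, words):
    """Forward DP: ans[i] is True when s[:i] can be split into dictionary words."""
    root = build_trie(words)
    n = len(s)
    ans = [False] * (n + 1)
    ans[0] = True
    for i in range(n):
        if not ans[i]:
            continue
        node = root
        for j in range(i, n):
            node = node.get(s[j])
            if node is None:  # no dictionary word continues with s[j]
                break
            if END in node:  # s[i..j] is a dictionary word
                ans[j + 1] = True
    return ans[n]


print(can_split_trie("abcd", ["a", "bc", "bd", "d", "e"]))  # True: a | bc | d
```

The early break when the trie has no matching child is what prunes the search: the inner loop stops as soon as no dictionary word can start with s[i..j].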

Related

Data structure to check if a static array does not contain an element of a given range

I'm stuck for hours on the following homework question for data-structures class:
You are given a static set S (i.e., S never changes) of n integers from {1, . . . , u}.
Describe a data structure of size O(n log u) that can answer the following queries in O(1) time:
Empty(i, j) - returns TRUE if and only if there is no element in S that is between i and j (where i and j are integers in {1, . . . , u}).
At first I thought of using a y-fast-trie.
Using y-fast-trie we can achieve O(n) space and O(loglogu) query (by finding the successor of i and check if it's bigger than j).
But O(loglogu) is not O(1)...
Then I thought maybe we could sort the array and create a second array of size n+1 holding the ranges that are not in the array, and then in the query check whether [i, j] is a sub-range of one of those ranges, but I couldn't think of any way to do it that uses O(n log u) space and answers the query in O(1).
I have no idea how to solve this and I feel like I'm not even close to the solution, any help would be nice.
We can create an x-fast-trie of S (it takes O(n log u) space) and save in each node the maximum and minimum value of the leaves in its subtree. Now we can use that to answer the Empty query in O(1), like this:
Empty(i, j)
We first calculate xor(i,j). The number of leading zeros in that number is the number of leading bits that i and j have in common; let's call this number k. Now we take the first k bits of i (or j, because they're equal) and check in the x-fast-trie hash table whether there's a node that equals those bits. If there isn't, we return TRUE: any number between i and j would also have the same k leading bits, and since no number in S has those leading bits, there is no number between i and j. If there is, let's call that node X.
If X->right->minimum > j and X->left->maximum < i, we return TRUE; otherwise we return FALSE (a missing child counts as satisfying its condition). This works because i lies in X's left subtree and j in its right subtree: if the condition is false, there is a number between i and j, and if it's true, all the numbers in the subtree that are smaller than j are also smaller than i and all the numbers that are bigger than i are also bigger than j.
Sorry for bad English
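The idea above can be sketched without a full x-fast-trie by keeping only its hash table of prefixes, each with the minimum and maximum of the values below it. This is a simplified illustration, not the answer's exact structure: the class name PrefixTable, the (depth, prefix) key layout and the handling of i == j are assumptions made for the example.

```python
class PrefixTable:
    """Hash table of bit prefixes of S, each storing (min, max) of its subtree."""

    def __init__(self, values, u):
        self.w = max(1, u.bit_length())  # bits per key, enough for 1..u
        self.nodes = {}
        for x in values:
            for depth in range(self.w + 1):
                key = (depth, x >> (self.w - depth))
                lo, hi = self.nodes.get(key, (x, x))
                self.nodes[key] = (min(lo, x), max(hi, x))

    def empty(self, i, j):
        """True iff no stored value v satisfies i <= v <= j."""
        if i > j:
            return True
        if i == j:
            return (self.w, i) not in self.nodes
        k = self.w - (i ^ j).bit_length()  # number of shared leading bits
        prefix = i >> (self.w - k)
        if (k, prefix) not in self.nodes:
            return True
        left = self.nodes.get((k + 1, prefix << 1))  # subtree containing i
        right = self.nodes.get((k + 1, (prefix << 1) | 1))  # subtree containing j
        left_ok = left is None or left[1] < i
        right_ok = right is None or right[0] > j
        return left_ok and right_ok


table = PrefixTable([3, 7, 12], u=15)
print(table.empty(4, 6), table.empty(4, 7))  # True False
```

Each value contributes one entry per prefix length, so the table holds O(n log u) entries, and a query does a constant number of hash lookups once the shared-prefix length is known.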
You haven't clarified whether the given numbers are sorted or not. If not, sort them, which will take O(n log n).
Find the upper bound of i, say x. Find the lower bound of j, say y.
Now just check 4 numbers: the numbers at index x, x+1, y-1 and y. If any of these numbers is between i and j, the range is not empty, so Empty(i, j) returns false. Otherwise it returns true.
If the given set/array is not sorted, this approach requires an additional O(n log n) to sort it. Memory required is O(n). Each query costs O(1) on top of the two bound lookups, which take O(log n) with binary search.
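A short sketch of the sorted-array idea, simplified to a single lookup: find the first element that is at least i and compare it with j. The helper name make_empty_query is hypothetical. Note that the binary search makes each query O(log n), so this does not meet the O(1) bound asked for in the question.

```python
from bisect import bisect_left


def make_empty_query(values):
    """Return an empty(i, j) function backed by a sorted copy of values."""
    arr = sorted(values)

    def empty(i, j):
        pos = bisect_left(arr, i)  # index of the first element >= i
        return pos == len(arr) or arr[pos] > j

    return empty


empty = make_empty_query([12, 3, 7])
print(empty(4, 6), empty(4, 7))  # True False
```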
Consider a data structure consisting of
an array A[1,...,u] of size u such that A[i]=1 if i is present in S, and A[i]=0 otherwise. This array can be constructed from set S in O(n).
an array B[1,...,u] of size u which stores cumulative sum of A i.e. B[i] = A[1]+...+A[i]. This array can be constructed in O(u) from A using the relation B[i] = B[i-1] + A[i] for all i>1.
a function empty(i,j) which returns the desired Boolean query. If i==1, then define count = B[j], otherwise take count = B[j]-B[i-1]. Note that count gives the number of distinct elements in S lying in range [i,j]. Once we have count, simply return count==0. Clearly, each query takes O(1).
Edit: As pointed out in the comments, the size of this data structure is O(u), which doesn't match the constraints. But I hope it gives others an approximate target to shoot at.
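For reference, the cumulative-sum structure described above might look like this in Python. The function names are made up for the example, and a zero entry at index 0 is assumed so that the i == 1 case needs no special handling. As the edit notes, it uses O(u) space.

```python
def build_prefix_counts(values, u):
    """counts[x] = number of elements of S in 1..x (counts[0] = 0)."""
    present = [0] * (u + 1)
    for x in values:
        present[x] = 1
    counts = [0] * (u + 1)
    for x in range(1, u + 1):
        counts[x] = counts[x - 1] + present[x]
    return counts


def empty(counts, i, j):
    """True iff no element of S lies in [i, j]; a single subtraction."""
    return counts[j] - counts[i - 1] == 0


counts = build_prefix_counts([3, 7, 12], u=15)
print(empty(counts, 4, 6), empty(counts, 1, 3))  # True False
```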
This isn't a solution, but it is too long to write in a comment. It is an idea for solving a more specific task, which may help with the general task from the question.
The specific task is the same except for the following point: u = 1024. Also, it isn't a final solution, only a rough sketch (for the specific task).
Data structure creation:
Create a bitmask M = 0000.....100001 for U = { 1, ..., u }, where Mᵥ = 1 when Uᵥ ∊ S, otherwise 0.
Save the bitmask M as an array G of 32 unsigned 32-bit integers. Each item of G holds 32 bits of M.
Build an integer H as a bitmask where Hᵣ = 0 when Gᵣ = 0, otherwise 1.
Convert G to a HashMap from r to Gᵣ that contains records only for Gᵣ != 0.
Images in the following pseudocode use 8 bits instead of 32, just for simplicity.
Empty(i, j) {
    I = i / 32
    J = j / 32
    if I != J {
        if P == 0: return true
        if P(I) == 0: return true
        if P(J) == 0: return true
    } else {
        if P(J=I) == 0: return true
    }
    return false
}

Find longest positive substrings in binary string

Let's assume I have a string like 100110001010001. I'd like to find substrings that:
are as long as possible
have a total positive sum (>0)
So, the longest substrings that have more 1s than 0s.
For example, for the string above, 100110001010001, it would be: [10011]000[101]000[1]
Actually, it would be enough to find the total length of those, in this case: 9.
Unfortunately I have no clue how this can be done without brute force. Any ideas, please?
As posted now, your question seems a bit unclear. The total length of valid substrings that are "as long as possible" could mean different things: for example, among other options, it could be (1) a list of the longest valid extension to the left of each index (which would allow overlaps in the list), (2) the longest combination of non-overlapping such longest left-extensions, (3) the longest combination of non-overlapping, valid substrings (where each substring is not necessarily the longest possible).
I will outline a method for (3) since it easily transforms to (1) or (2). Finding the longest left-extension from each index with more ones than zeros can be done in O(n log n) time and O(n) additional space (for just the longest valid substring in O(n) time, see here: Finding the longest non-negative sub array). With that preprocessing, finding the longest combination of valid, non-overlapping substrings can be done with dynamic programming in somewhat optimized O(n^2) time and O(n) additional space.
We start by traversing the string, storing sums representing the partial sum up to and including s[i], counting zeros as -1. We insert each partial sum in a binary tree where each node also stores an array of indexes where the value occurs, and the leftmost index of a value less than the node's value. (A substring from s[a] to s[b] has more ones than zeros if the prefix sum up to b is greater than the prefix sum up to a.) If a value is already in the tree, we add the index to the node's index array.
Since we are traversing from left to right, only when a new lowest value is inserted into the tree is the leftmost-index-of-lower-value updated — and it's updated only for the node with the previous lowest value. This is because any nodes with a lower value would not need updating; and if any nodes with lower values were already in the tree, any nodes with higher values would already have stored the index of the earliest one inserted.
The longest valid substring to the left of each index extends to the leftmost index with a lower prefix sum, which can be easily looked up in the tree.
To get the longest combination, let f(i) represent the longest combination up to index i. Then f(i) equals the maximum of the length of each valid left extension possible to index j added to f(j-1).
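A plain O(n^2) sketch of the recurrence for interpretation (3), without the tree-based preprocessing described above. It assumes zeros count as -1 and ones as +1, and it lets f(i) also fall back to f(i-1) when position i is left uncovered. The function name is hypothetical.

```python
def longest_positive_cover(bits):
    """Max total length of non-overlapping substrings with more 1s than 0s."""
    n = len(bits)
    # prefix[i] = (#ones - #zeros) in bits[:i]
    prefix = [0] * (n + 1)
    for idx, ch in enumerate(bits):
        prefix[idx + 1] = prefix[idx] + (1 if ch == "1" else -1)
    # f[i] = best total length using only bits[:i]
    f = [0] * (n + 1)
    for i in range(1, n + 1):
        best = f[i - 1]  # leave position i-1 uncovered
        for j in range(i):
            if prefix[i] > prefix[j]:  # bits[j:i] has more 1s than 0s
                best = max(best, f[j] + (i - j))
        f[i] = best
    return f[n]


print(longest_positive_cover("100110001010001"))  # 9
```

On the question's example this yields 9, matching [10011]000[101]000[1].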
Dynamic programming.
We have a string. If it is positive, that's our answer. Otherwise we need to trim each end until it goes positive, and find each pattern of trims. So for each length (N-1, N-2, N-3, etc.), we've got N - length possible paths (trim from a, trim from b), each of which gives us a state. When a state goes positive, we've found our substring.
So keep two lists of integers, representing what happens if we trim entirely from a or entirely from b. Then backtrack. If we trim 1 from a, we must trim all the rest from b; if we trim two from a, we must trim one fewer from b. Is there an answer that allows us to go positive?
We can quickly eliminate because the answer must be at a maximum, either max trimming from a or max trimming from b. If the other trim allows us to go positive, that's the result.
pseudocode:
N = length(string);
Nones = countones(string);
Nzeros = N - Nones;
if (Nones > Nzeros)
    return string;
vector<int> cuta;
vector<int> cutb;
int besta = Nones - Nzeros;
int bestb = Nones - Nzeros;
cuta.push_back(besta);
cutb.push_back(bestb);
bestia = 0;
bestib = 0;
for (i = 0; i < N; i++)
{
    cuta.push_back(string[i] == '1' ? cuta.back() - 1 : cuta.back() + 1);
    cutb.push_back(string[N-i-1] == '1' ? cutb.back() - 1 : cutb.back() + 1);
    if (cuta.back() > besta)
    {
        besta = cuta.back();
        bestia = i;
    }
    if (cutb.back() > bestb)
    {
        bestb = cutb.back();
        bestib = i;
    }
    // checks: is a cut wholly from a or b going to send us positive?
    if (besta == 1)
        answer = substring(string, bestia, N);
    if (bestb == 1)
        answer = substring(string, 0, N - bestib);
    // if not, is a combined cut from current position to the
    // the peak in the other distribution going to send us positive?
    if (Nones - Nzeros + besta + cutb.back() == 1)
    {
        answer = substring(string, bestia, N - i);
    }
    if (Nones - Nzeros + cuta.back() + bestb == 1)
    {
        answer = substring(string, i, N - bestib);
    }
}
/* if we get here the string was all zeros and there is no positive substring */
This is untested and the final checks are a bit fiddly and I might have
made an error somewhere, but the algorithm should work more or less
as described.

Counting Binary Strings

This is in reference to this problem. We are required to calculate f(n , k), which is the number of binary strings of length n that have the length of the longest substring of ones as k. I am having trouble coming up with a recursion.
The case when the ith digit is a 0, I think I can handle.
Specifically, I am unable to extend the solution to a sub-problem f(i-1, j) when I consider the ith digit to be a 1. How do I stitch the two together?
Sorry if I am a bit unclear. Any pointers would be a great help. Thanks.
I think you could build up a table using a variation of dynamic programming, if you expand the state space. Suppose that you calculate f(n,k,e) defined as the number of different binary strings of length n with the longest substring of 1s length at most k and ending with e 1s in a row. If you have calculated f(n,k,e) for all possible values of k and e associated with a given n, then, because you have the values split up by e, you can calculate f(n+1,k,e) for all possible values of k and e - what happens to an n-long string when you extend it with 0 or 1 depends on how many 1s it ends with at the moment, and you know that because of e.
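A small sketch of the expanded-state table described above, keeping one row indexed by e (the number of trailing 1s) for a fixed k. Getting the "exactly k" count by subtracting the "at most k-1" count is an assumption added here to connect it back to f(n, k) from the question; the function names are made up.

```python
def count_at_most(n, k):
    """Number of binary strings of length n whose runs of 1s all have length <= k."""
    # dp[e] = number of valid strings of the current length ending in exactly e ones
    dp = [0] * (k + 1)
    dp[0] = 1  # the empty string
    for _ in range(n):
        new = [0] * (k + 1)
        new[0] = sum(dp)  # appending a 0 resets the trailing run
        for e in range(k):
            new[e + 1] = dp[e]  # appending a 1 extends the trailing run
        dp = new
    return sum(dp)


def count_exact(n, k):
    """Number of binary strings of length n whose longest run of 1s is exactly k."""
    if k == 0:
        return count_at_most(n, 0)
    return count_at_most(n, k) - count_at_most(n, k - 1)


print([count_exact(3, k) for k in range(4)])  # [1, 4, 2, 1]
```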
Let s be the start index of the length k pattern. Then s is in: 1 to n-k.
For each s, we divide the string S into three strings:
PRE(s,k,n) = S[1:s-1]
POST(s,k,n)=S[s+k-1:n]
ONE(s,k,n) which has all 1s from S[s] to S[s+k-1]
The longest sub-string of 1s for PRE and POST should be less than k.
Let
x = s-1
y = n-(s+k)-1
Let NS(p,k) be the total number of ways you can have a longest sub-string of size greater than or equal to k.
NS(p,k) = sum{f(p,k), f(p,k+1),... f(p,p)}
Terminating condition:
NS(p,k) = 1 if p==k, 0 if k>p
f(n,k) = 1 if n==k, 0, if k > n.
For a string of length n, the number of permutations such that the longest substring of 1s is of size less than k = 2^n - NS(n,k).
f(n,k) = Sum over all s=1 to n-k
{2^x - NS(x,k)}*{2^y - NS(y,k)}
i.e. product of the number of permutations of each of the pre and post substrings where the longest sub-string is less than size k.
So we have a repeating sub-problem, and a whole bunch of reuse which can be DPed
Added Later:
Based on the comment below, I guess we really do not need to go into NS.
We can define S(p,k) as
S(p,k) = sum{f(p,1), f(p,2),... f(p,k-1)}
and
f(n,k) = Sum over all s=1 to n-k
S(x,k)*S(y,k)
I know this is quite an old question; if anyone wants, I can clarify my short answer.
Here is my code
#include<bits/stdc++.h>
using namespace std;
long long DP[64][64];
int main()
{
    ios::sync_with_stdio(0);
    cin.tie(0);
    int i,j,k;
    DP[1][0]=1;
    DP[1][1]=1;
    DP[0][0]=1;
    cout<<"1 1\n";
    for(i=2;i<=63;i++,cout<<"\n")
    {
        DP[i][0]=1;
        DP[i][i]=1;
        cout<<"1 ";
        for(j=1;j<i;j++)
        {
            for(k=0;k<=j;k++)
                DP[i][j]+=DP[i-k-1][j]+DP[i-j-1][k];
            DP[i][j]-=DP[i-j-1][j];
            cout<<DP[i][j]<<" ";
        }
        cout<<"1 ";
    }
    return 0;
}
DP[i][j] represents F(i,j).
Transitions/Recurrence (hard to think of):
Considering F(i,j):
1) I can put k 1s on the right and separate them using a 0, i.e.
String + 0 + k times '1'.
F(i-k-1,j)
Note: k=0 signifies I am only keeping a 0 at the right!
2) I am missing out the ways in which the right j+1 positions are filled with a 0 and j '1's and everything to the left does not form any run of consecutive 1s of length j!
F(i-j-1,k) (Note: I have used k to signify both just because I have done so in my code; you can define other variables too!)

An interview question - Split text into sub-strings according to rules

Split text into sub-strings according to below rules:
a) The length of each sub-string should be less than or equal to M
b) The length of a sub-string should be less than or equal to N (N < M) if the sub-string contains any numeric char
c) The total number of sub-strings should be as small as possible
I have no clue how to solve this question, I guess it is related to "dynamic programming".
Can anybody help me implement it using C# or Java? Thanks a lot.
Idea
A greedy approach is the way to go:
If the current text is empty, you're done.
Take the first N characters. If any of them is a digit then this is a new substring. Chop it off and go to beginning.
Otherwise, extend the digitless segment to at most M characters. This is a new substring. Chop it off and go to beginning.
Proof
Here's a reductio-ad-absurdum proof that the above yields an optimal solution.
Assume there is a better split than the greedy split. Let's skip to the point where the two splits start to differ and remove everything before this point.
Case 1) A digit among the first N characters.
Assume that there is an input for which chopping off the first N characters cannot yield an optimal solution.
Greedy split:   |--N--|...
A better split: |---|--...
                     ^
                     +---- this segment can be shortened from the left side
However, the second segment of the putative better solution can be always shortened from the left side, and the first one extended to N characters, without altering the number of segments. Therefore, a contradiction: this split is not better than the greedy split.
Case 2) No digit among the first K (N < K <= M) characters.
Assume that there is an input for which chopping off the first K characters cannot yield an optimal solution.
Greedy split:   |--K--|...
A better split: |---|--...
                     ^
                     +---- this segment can be shortened from the left side
Again, the "better" split can be transformed, without altering the number of segments, to the greedy split, which contradicts the initial assumption that there is a better split than the greedy split.
Therefore, the greedy split is optimal. Q.E.D.
Implementation (Python)
import sys
m, n, text = int(sys.argv[1]), int(sys.argv[2]), sys.argv[3]
textLen, isDigit = len(text), [c in '0123456789' for c in text]
chunks, i, j = [], 0, 0
while j < textLen:
    # tentatively take the next n characters
    i, j = j, min(textLen, j + n)
    if not any(isDigit[i:j]):
        # no digit so far: extend the chunk up to m characters, stopping before a digit
        while j < textLen and j - i < m and not isDigit[j]:
            j += 1
    chunks += [text[i:j]]
print(chunks)
Implementation (Java)
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.junit.Assert;
import org.junit.Test;

public class SO {
    public List<String> go(int m, int n, String text) {
        if (text == null)
            return Collections.emptyList();
        List<String> chunks = new ArrayList<String>();
        int i = 0;
        int j = 0;
        while (j < text.length()) {
            // tentatively take the next n characters
            i = j;
            j = Math.min(text.length(), j + n);
            boolean ok = true;
            for (int k = i; k < j; k++)
                if (Character.isDigit(text.charAt(k))) {
                    ok = false;
                    break;
                }
            // no digit so far: extend up to m characters, stopping before a digit
            if (ok)
                while (j < text.length() && j - i < m && !Character.isDigit(text.charAt(j)))
                    j++;
            chunks.add(text.substring(i, j));
        }
        return chunks;
    }

    @Test
    public void testIt() {
        Assert.assertEquals(
            Arrays.asList("asdas", "d332", "4asd", "fsdxf", "23"),
            go(5, 4, "asdasd3324asdfsdxf23"));
    }
}
Bolo has provided a greedy algorithm in his answer and asked for a counter-example. Well, there's no counter-example because that's a perfectly correct approach. Here's the proof. Although it's a bit wordy, it often happens that the proof is longer than the algorithm itself :)
Let's imagine we have input of length L and constructed an answer A with our algorithm. Now, suppose there's a better answer B, i.e., B has fewer segments than A does.
Let's say the first segment in A has length la and in B has length lb. la >= lb because we've chosen the first segment in A to have the maximum possible length. And if lb < la, we can increase the length of the first segment in B without increasing the overall number of segments in B. It would give us some other optimal solution B', having the same first segment as A.
Now, remove that first segment from A and B' and repeat the operation for length L' < L. Do it until there are no segments left. It means answer A is equal to some optimal solution.
The result of your computation will be a partitioning of the given text into short sub-strings containing numerics and long substrings not containing numerics. (This much you knew already).
You will essentially be partitioning off short subs around the numerics and then breaking everything else down into long subs as often as needed to fulfill the length criteria.
Your freedom, i.e. what you can manipulate to improve your result, is to select which characters to include with a numeric. If N = 3, then for every numeric you get the choice of XXN, XNX or NXX. If M is 5 and you have 6 characters before your first numeric, you'll want to include at least one of those characters in your short sub so you won't end up with two "long" strings to the left of your "short" one when you could have just one instead.
As a first approximation, I'd go with extending your "short" strings leftwise far enough to avoid redundant "long" strings. This is a typical "greedy" approach, and greedy approaches often yield optimal or almost-optimal results. To do even better than that would not be easy, and I'm not going to try to figure out how to go about that.

Optimal solution for non-overlapping maximum scoring sequences

While developing part of a simulator I came across the following problem. Consider a string of length N, and M substrings of this string with a non-negative score assigned to each of them. Of particular interest are the sets of substrings that meet the following requirements:
They do not overlap.
Their total score (by sum, for simplicity) is maximum.
They span the entire string.
I understand the naive brute-force solution is of O(M*N^2) complexity. While the implementation of this algorithm would probably not impose a lot on the performance of the whole project (nowhere near the critical path, can be precomputed, etc.), it really doesn't sit well with me.
I'd like to know if there are any better solutions to this problem and if so, which are they? Pointers to relevant code are always appreciated, but just algorithm description will do too.
This can be thought of as finding the longest path through a DAG. Each position in the string is a node and each substring match is an edge. You can trivially prove through induction that for any node on the optimal path the concatenation of the optimal path from the beginning to that node and from that node to the end is the same as the optimal path. Thanks to that you can just keep track of the optimal paths for each node and make sure you have visited all edges that end in a node before you start to consider paths containing it.
Then you just have the issue of finding all edges that start from a node, or all substrings that match at a given position. If you already know where the substring matches are, then it's as trivial as building a hash table. If you don't, you can still build a hash table if you use Rabin-Karp.
Note that with this you'll still visit all the edges in the DAG, for O(e) complexity. In other words, you'll have to consider once each substring match that's possible in a sequence of connected substrings from start to end. You could get better than this by preprocessing the substrings to find ways to rule out some matches. I have my doubts whether any general-case complexity improvements can come from this, and any practical improvements depend heavily on your data distribution.
It is not clear whether the M substrings are given as sequences of characters or as indices in the input string, but the problem doesn't change much because of that.
Let us have input string S of length N, and M input strings Tj. Let Lj be the length of Tj, and Pj the score given for string Tj. We say that string
This is called Dynamic Programming, or DP. You keep an array res of ints of length N+1 (with res[N] = 0 for the empty suffix), where the i-th element represents the score one can get if one has only the substring starting from the i-th element (for example, if the input is "abcd", then res[2] will represent the best score you can get from "cd").
Then, you iterate through this array from the end to the beginning, and check whether you can start string Tj from the i-th character. If you can, then the result (res[i + Lj] + Pj) is clearly achievable. Iterating over all Tj, res[i] = max(res[i + Lj] + Pj) over all Tj which can be applied at the i-th character.
res[0] will be your final answer.
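The backward DP described above could be sketched like this, assuming the substrings are given as character sequences with scores in a dict. Marking suffixes that cannot be fully covered with negative infinity, and returning None in that case, are choices made for this example rather than part of the answer.

```python
def best_cover_score(s, scored_words):
    """Best total score of words that tile s exactly, or None if s cannot be tiled."""
    n = len(s)
    unreachable = float("-inf")
    # res[i] = best score for covering the suffix s[i:]
    res = [unreachable] * (n + 1)
    res[n] = 0  # the empty suffix costs nothing
    for i in range(n - 1, -1, -1):
        for word, score in scored_words.items():
            # the word must be non-empty, match at position i, and leave a coverable rest
            if word and s.startswith(word, i) and res[i + len(word)] != unreachable:
                res[i] = max(res[i], res[i + len(word)] + score)
    return None if res[0] == unreachable else res[0]


words = {"ab": 3, "cd": 4, "abc": 10, "d": 1, "a": 1}
print(best_cover_score("abcd", words))  # 11, from "abc" + "d"
```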
inputs:
N, the number of chars in a string
e[0..N-1]: (b,c) an element of set e[a] means [a,b) is a substring with score c.
(If all substrings are possible, then you could just have c(a,b).)
By e.g. [1,2) we mean the substring covering the 2nd letter of the string (half open interval).
(empty substrings are not allowed; if they were, then you could handle them properly only if you allow them to be "taken" at most k times)
Outputs:
s[i] is the score of the best substring covering of [0,i)
a[i]: [a[i],i) is the last substring used to cover [0,i); else NULL
Algorithm - O(N^2) if the intervals e are not sparse; O(N+E) where E is the total number of allowed intervals. This is effectively finding the best path through an acyclic graph:
for i = 0 to N:
    a[i] <- NULL
    s[i] <- 0
a[0] <- 0
for i = 0 to N-1
    if a[i] != NULL
        for (b,c) in e[i]:
            sib <- s[i]+c
            if a[b] == NULL or sib > s[b]:
                a[b] <- i
                s[b] <- sib
To yield the best covering triples (a,b,c) where cost of [a,b) is c:
i <- N
if (a[i]==NULL):
    error "no covering"
while (i != 0):
    from <- a[i]
    yield (from,i,s[i]-s[from])
    i <- from
Of course, you could store the pair (sib,c) in s[b] and save the subtraction.
O(N+M) solution:
Set f[1..N]=-1
Set f[0]=0
for a = 0 to N-1
    if f[a] >= 0
        For each substring beginning at a
            Let b be the last index of the substring, and c its score
            If f[a]+c > f[b+1]
                Set f[b+1] = f[a]+c
                Set g[b+1] = [substring number]
Now f[N] contains the answer, or -1 if no set of substrings spans the string.
To get the substrings:
b = N
while b > 0
    Get substring number from g[b]
    Output substring number
    b = b - (length of substring)
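A runnable sketch of the forward pass and the backtracking step above, assuming the substrings are given as (start, end, score) triples with inclusive end indices and non-negative scores. The function name, the bucketing by start index and the return shape are illustrative choices only.

```python
def best_cover(n, substrings):
    """Return (best score, list of chosen substring numbers), or None if no cover exists."""
    # Bucket the substrings by start index so each one is visited once.
    by_start = [[] for _ in range(n)]
    for number, (start, end, score) in enumerate(substrings):
        by_start[start].append((end, score, number))
    f = [-1] * (n + 1)  # f[i] = best score covering positions 0..i-1, -1 if impossible
    g = [None] * (n + 1)  # g[i] = number of the last substring in that best cover
    f[0] = 0
    for a in range(n):
        if f[a] < 0:
            continue
        for end, score, number in by_start[a]:
            if f[a] + score > f[end + 1]:
                f[end + 1] = f[a] + score
                g[end + 1] = number
    if f[n] < 0:
        return None
    chosen = []
    b = n
    while b > 0:  # walk back from the end, one substring at a time
        number = g[b]
        chosen.append(number)
        start, end, _ = substrings[number]
        b -= end - start + 1
    chosen.reverse()
    return f[n], chosen


subs = [(0, 1, 3), (2, 3, 4), (0, 2, 10), (3, 3, 1), (0, 0, 1)]
print(best_cover(4, subs))  # (11, [2, 3])
```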

Resources