Reconstruct the whitespace-omitted string in O(n^2) - algorithm

I want to check if my algorithm is correct.
Given a string of n characters with all white-space omitted,
Ex: "itwasthebestoftimes"
Give a dynamic programming algorithm which determines if the string can be broken into a valid sequence of words, and reconstruct a valid string with whitespaces, in O(n2).
My idea:
First find all substrings of a string (O(n2)), and for each substring map its position in space and length as an interval.
Ex: "it was the best"
[] [-] [-] [--]
[---] []
[]
(Spaces added to make it easier to view).
In the above example, "it" is valid and gets an interval value of 2, "was" gets 3, etc. The string "twas" is also valid, and gets a value of 4.
This is then reduced to a mini-max problem to find the max non-overlaping length in the set of intervals. Since the valid string must contain all letters, the max length non-overlapping interval will be the answer, and finding this takes Theta(n*log(n)).
Therefore the solution will take O(n2 + n*log(n)) = O(n2)
Is my thinking correct?

Your thinking is fine (assuming you know an O(n log n) solution to the problem of finding a maximum set of non-overlapping intervals), and that you know a way to find the word intervals in O(n^2) time. However, I think the problem is easier than you're making it.
Create an array W[0...n]. W[i] will be 0 if there's no way to cut up the string from i onwards into words, and otherwise it'll store the length a word that starts a valid cutting up of strings.
Then:
W[i] = min(j such that W[i:j] is a word, and i+j = n or W[i+j]>0)
or 0 if there's no such j.
If you keep your dictionary in a trie, you can compute W[i] in O(n-i) time assuming you've already computed W[i+1] to W[n-1]. That means you can compute all of W in O(n^2) time. Or if the maximum length of the word in your dictionary is k, you can do it in O(nk) time.
Once you've computed all of W, the whole string can be cut up into words if and only if W[0] is not 0.

Related

Complexity analysis of a solution to minimizing concat cost

This is about analyzing the complexity of a solution to a popular interview problem.
Problem
There is a function concat(str1, str2) that concatenates two strings. The cost of the function is measured by the lengths of the two input strings len(str1) + len(str2). Implement concat_all(strs) that concatenates a list of strings using only the concat(str1, str2) function. The goal is to minimize the total concat cost.
Warnings
Usually in practice, you would be very cautious about concatenating pairs of strings in a loop. Some good explanations can be found here and here. In reality, I have witnessed a severity-1 accident caused by such code. Warnings aside, let's say this is an interview problem. What's really interesting to me is the complexity analysis around the various solutions.
You can pause here if you would like to think about the problem. I am gonna reveal some solutions below.
Solutions
Naive solution. Loop through the list and concatenate
def concat_all(strs):
result = ''
for str in strs:
result = concat(result, str)
return result
Min-heap solution. The idea is to concatenate shorter strings first. Maintain a min-heap of the strings based on the length of the strings. Each concatenation concatenates 2 strings off the min-heap and the result is pushed back the min-heap. Until only one string is left on the heap.
def concat_all(strs):
heap = MinHeap(strs, key_func=len)
while len(heap) > 1:
str1 = heap.pop()
str2 = heap.pop()
heap.push(concat(str1, str2))
return heap.pop()
Binary concat. May not be intuitively clear. But another good solution is to recursively split the list by half and concatenate.
def concat_all(strs):
if len(strs) == 1:
return strs[0]
if len(strs) == 2:
return concat(strs[0], strs[1])
mid = len(strs) // 2
str1 = concat_all(strs[:mid])
str2 = concat_all(strs[mid:])
return concat(str1, str2)
Complexity
What I am really struggling and asking here is the complexity of the 2nd approach above that uses a min-heap.
Let's say the number of strings in the list is n and the total number of characters in all the strings is m. The upper bound for the naive solution is O(mn). The binary-concat has an exact bound of theta(mlog(n)). It is the min-heap approach that is elusive to me.
I am kind of guessing it has an upper bound of O(mlog(n) + nlog(n)). The second term, nlog(n) is associated with maintaining the heap; there are n concats and each concat updates the heap in log(n). If we only focus on the cost of concatenations and ignore the cost of maintaining the min-heap, the overall complexity of the min-heap approach can be reduced to O(mlog(n)). Then min-heap is a more optimal approach than binary-concat cause for the former mlog(n) is the upper bound while for the latter it is the exact bound.
But I can't seem to prove it, or even find a good intuition to support that guessed upper bound. Can the upper bound be even lower than O(mlog(n))?
Let us call the length of strings 1 to n and m be the sum of all these values.
For the naive solution, clearly the worst appears if
m1
is almost equal to m, and you obtain a O(nm) complexity, as you pointed.
For the min-heap, the worst-case is a bit different, it consists in having the same length for any string. In that case, it's going to work exactly as your case 3. of binary concat, but you'll also have to maintain the min-heap structure. So yes, it will be a bit more costly than case 3 in real-life. Nevertheless, from a complexity point of view, both will be in O(m log n) since we have m > n and O(m log n + n log n)can be reduced to O(m log n).
To prove the min-heap complexity more rigorously, we can prove that when we take the two smallest strings in a set of k strings, and denote by S the sum of the two smallest strings, then we have: (m-S)/(k-1) >= S/2 (it simply means that the mean of the two smallest strings is less than the mean of the k-2 other strings). Reformulating leads to S <= 2m/(k+1). Let us apply it to the min-heap algorithm:
at first step, we can show that the 2 strings we take are of total length at most 2m/(n+1)
at first step, we can show that the 2 strings we take are of total length at most 2m/(n)
...
at last step, we can show that the 2 strings we take are of total length at most 2m/(1)
Hence the computation time of min-heap is 2m*[1/(n+1) + 1/n + ... + 1/2 + 1] which is in O(m log n)

Given a permutation's lexicographic number, is it possible to get any item in it in O(1)

I want to know whether the task explained below is even theoretically possible, and if so how I could do it.
You are given a space of N elements (i.e. all numbers between 0 and N-1.) Let's look at the space of all permutations on that space, and call it S. The ith member of S, which can be marked S[i], is the permutation with the lexicographic number i.
For example, if N is 3, then S is this list of permutations:
S[0]: 0, 1, 2
S[1]: 0, 2, 1
S[2]: 1, 0, 2
S[3]: 1, 2, 0
S[4]: 2, 0, 1
S[5]: 2, 1, 0
(Of course, when looking at a big N, this space becomes very large, N! to be exact.)
Now, I already know how to get the permutation by its index number i, and I already know how to do the reverse (get the lexicographic number of a given permutation.) But I want something better.
Some permutations can be huge by themselves. For example, if you're looking at N=10^20. (The size of S would be (10^20)! which I believe is the biggest number I ever mentioned in a Stack Overflow question :)
If you're looking at just a random permutation on that space, it would be so big that you wouldn't be able to store the whole thing on your harddrive, let alone calculate each one of the items by lexicographic number. What I want is to be able to do item access on that permutation, and also get the index of each item. That is, given N and i to specify a permutation, have one function that takes an index number and find the number that resides in that index, and another function that takes a number and finds in which index it resides. I want to do that in O(1), so I don't need to store or iterate over each member in the permutation.
Crazy, you say? Impossible? That may be. But consider this: A block cipher, like AES, is essentially a permutation, and it almost accomplishes the tasks I outlined above. AES has a block size of 16 bytes, meaning that N is 256^16 which is around 10^38. (The size of S, not that it matters, is a staggering (256^16)!, or around 10^85070591730234615865843651857942052838, which beats my recent record for "biggest number mentioned on Stack Overflow" :)
Each AES encryption key specifies a single permutation on N=256^16. That permutation couldn't be stored whole on your computer, because it has more members than there are atoms in the solar system. But, it allows you item access. By encrypting data using AES, you're looking at the data block by block, and for each block (member of range(N)) you output the encrypted block, which the member of range(N) that is in the index number of the original block in the permutation. And when you're decrypting, you're doing the reverse (Finding the index number of a block.) I believe this is done in O(1), I'm not sure but in any case it's very fast.
The problem with using AES or any other block cipher is that it limits you to very specific N, and it probably only captures a tiny fraction of the possible permutations, while I want to be able to use any N I like, and do item access on any permutation S[i] that I like.
Is it possible to get O(1) item access on a permutation, given size N and permutation number i? If so, how?
(If I'm lucky enough to get code answers here, I'd appreciate if they'll be in Python.)
UPDATE:
Some people pointed out the sad fact that the permutation number itself would be so huge, that just reading the number would make the task non-feasible. Then, I'd like to revise my question: Given access to the factoradic representation of a permutation's lexicographic number, is it possible to get any item in the permutation in O(as small as possible)?
The secret to doing this is to "count in base factorial".
In the same way that 134 = 1*10^2+3*10 + 4, 134 = 5! + 2 * 3! + 2! => 10210 in factorial notation (include 1!, exclude 0!). If you want to represent N!, you will then need N^2 base ten digits. (For each factorial digit N, the maximum number it can hold is N). Up to a bit of confusion about what you call 0, this factorial representation is exactly the lexicographic number of a permutation.
You can use this insight to solve Euler Problem 24 by hand. So I will do that here, and you will see how to solve your problem. We want the millionth permutation of 0-9. In factorial representation we take 1000000 => 26625122. Now to convert that to the permutation, I take my digits 0,1,2,3,4,5,6,7,8,9, and The first number is 2, which is the third (it could be 0), so I select 2 as the first digit, then I have a new list 0,1,3,4,5,6,7,8,9 and I take the seventh number which is 8 etc, and I get 2783915604.
However, this assumes that you start your lexicographic ordering at 0, if you actually start it at one, you have to subtract 1 from it, which gives 2783915460. Which is indeed the millionth permutation of the numbers 0-9.
You can obviously reverse this procedure, and hence convert backwards and forwards easily between the lexiographic number and the permutation that it represents.
I am not entirely clear what it is that you want to do here, but understanding the above procedure should help. For example, its clear that the lexiographic number represents an ordering which could be used as the key in a hashtable. And you can order numbers by comparing digits left to right so once you have inserted a number you never have to work outs it factorial.
Your question is a bit moot, because your input size for an arbitrary permutation index has size log(N!) (assuming you want to represent all possible permutations) which is Theta(N log N), so if N is really large then just reading the input of the permutation index would take too long, certainly much longer than O(1). It may be possible to store the permutation index in such a way that if you already had it stored, then you could access elements in O(1) time. But probably any such method would be equivalent to just storing the permutation in contiguous memory (which also has Theta(N log N) size), and if you store the permutation directly in memory then the question becomes trivial assuming you can do O(1) memory access. (However you still need to account for the size of the bit encoding of the element, which is O(log N)).
In the spirit of your encryption analogy, perhaps you should specify a small SUBSET of permutations according to some property, and ask if O(1) or O(log N) element access is possible for that small subset.
Edit:
I misunderstood the question, but it was not in waste. My algorithms let me understand: the factoradic representation of a permutation's lexicographic number is almost the same as the permutation itself. In fact the first digit of the factoradic representation is the same as the first element of the corresponding permutation (assuming your space consists of numbers from 0 to N-1). Knowing this there is not really a point in storing the index rather than the permutation itself . To see how to convert the lexicographic number into a permutation, read below.
See also this wikipedia link about Lehmer code.
Original post:
In the S space there are N elements that can fill the first slot, meaning that there are (N-1)! elements that start with 0. So i/(N-1)! is the first element (lets call it 'a'). The subset of S that starts with 0 consists of (N-1)! elements. These are the possible permutations of the set N{a}. Now you can get the second element: its the i(%((N-1)!)/(N-2)!). Repeat the process and you got the permutation.
Reverse is just as simple. Start with i=0. Get the 2nd last element of the permutation. Make a set of the last two elements, and find the element's position in it (its either the 0th element or the 1st), lets call this position j. Then i+=j*2!. Repeat the process (you can start with the last element too, but it will always be the 0th element of the possibilities).
Java-ish pesudo code:
find_by_index(List N, int i){
String str = "";
for(int l = N.length-1; i >= 0; i--){
int pos = i/fact(l);
str += N.get(pos);
N.remove(pos);
i %= fact(l);
}
return str;
}
find_index(String str){
OrderedList N;
int i = 0;
for(int l = str.length-1; l >= 0; l--){
String item = str.charAt(l);
int pos = N.add(item);
i += pos*fact(str.length-l)
}
return i;
}
find_by_index should run in O(n) assuming that N is pre ordered, while find_index is O(n*log(n)) (where n is the size of the N space)
After some research in Wikipedia, I desgined this algorithm:
def getPick(fact_num_list):
"""fact_num_list should be a list with the factorial number representation,
getPick will return a tuple"""
result = [] #Desired pick
#This will hold all the numbers pickable; not actually a set, but a list
#instead
inputset = range(len(fact_num_list))
for fnl in fact_num_list:
result.append(inputset[fnl])
del inputset[fnl] #Make sure we can't pick the number again
return tuple(result)
Obviously, this won't reach O(1) due the factor we need to "pick" every number. Due we do a for loop and thus, assuming all operations are O(1), getPick will run in O(n).
If we need to convert from base 10 to factorial base, this is an aux function:
import math
def base10_baseFactorial(number):
"""Converts a base10 number into a factorial base number. Output is a list
for better handle of units over 36! (after using all 0-9 and A-Z)"""
loop = 1
#Make sure n! <= number
while math.factorial(loop) <= number:
loop += 1
result = []
if not math.factorial(loop) == number:
loop -= 1 #Prevent dividing over a smaller number than denominator
while loop > 0:
denominator = math.factorial(loop)
number, rem = divmod(number, denominator)
result.append(rem)
loop -= 1
result.append(0) #Don't forget to divide to 0! as well!
return result
Again, this will run in O(n) due to the whiles.
Summing all, the best time we can find is O(n).
PS: I'm not a native English speaker, so spelling and phrasing errors may appear. Apologies in advance, and let me know if you can't get around something.
All correct algorithms for accessing the kth item of a permutation stored in factoradic form must read the first k digits. This is because, regardless of the values of the other digits among the first k, it makes a difference whether an unread digit is a 0 or takes on its maximum value. That this is the case can be seen by tracing the canonical correct decoding program in two parallel executions.
For example, if we want to decode the third digit of the permutation 1?0, then for 100, that digit is 0, and for 110, that digit is 2.

Finding a set of repeated, non-overlapping substrings of two input strings using suffix arrays

Input: two strings A and B.
Output: a set of repeated, non overlapping substrings
I have to find all the repeated strings, each of which has to occur in both(!) strings at least once. So for instance, let
A = "xyabcxeeeyabczeee" and B = "yxabcxabee".
Then a valid output would be {"abcx","ab","ee"} but not "eee", since it occurs only in string A.
I think this problem is very related to the "supermaximal repeat" problem. Here is a definition:
Maximal repeated pair :
A pair of identical substrings alpha and beta in S such that extending alpha and beta
in either direction would destroy the equality of the two strings
It is represented as a triplet (position1,position2, length)
Maximal repeat :
“A substring of S that occurs in a maximal pair in S”.
Example: abc in S = xabcyiiizabcqabcyrxar.
Note: There can be numerous maximal repeated pairs, but there can be only a limited
number of maximal repeats.
Supermaximal repeat
“A maximal repeat that never occurs as a substring of any other maximal repeat”
Example: abcy in S = xabcyiiizabcqabcyrxar.
An algorithm for finding all supermaximal repeats is described in "Algorithms on strings, trees and sequences", but only for suffix trees.
It works by:
1.) finding all left-diverse nodes using DFS
For each position i in S, S(i-1) is called the left character i.
Left character of a leaf in T(S) is the left character of the suffix position
represented by that leaf.
An internal node v in T(S) is called left-diverse if at least two leaves in v’s
subtree have different left characters.
2.) applying theorem 7.12.4 on those nodes:
A left diverse internal node v represents a supermaximal repeat a if and only if
all of v's children are leaves, and each has a distinct left character
Both strings A and B probably have to be concatenated and when we check v's leaves
in step two we also have to impose an additional constraint, that there has to be
at least one distinct left character from strings A and B. This can be done by comparing
their position against the length of A. If position(left character) > length(A), then left character is in A, else in B.
Can you help me solve this problem with suffix + lcp arrays?
It sounds like you are looking for the set intersection of all substrings of your two input strings. In that case, single letter substrings should also be returned. Let s1 and s2 be your strings, s1 the shorter of the two. After doing some thinking for a while about this, I don't think you can get much better than the intuitive O(n^3m) or O(n^3) algorithm, where n is the length of s1 and m is the length of s2. I don't think suffix trees can help you here.
for(int i=0 to n-1){
for(int j=1 to n-i){
if(contains(s2,substring(s1,i,j))) emit the substring
}
}
The runtime comes from the (n^2)/2 loop iterations, each doing a worst-case O(nm) contains operation (possibly O(n) depending on implementation). But its not really quite this bad since there will be a constant much smaller than one out front, since the length of the substring will actually range between 1 and n.
If you don't want single character matches, you could just initialize j as 2 or something higher.
BTW: Don't actually create new strings with substring, find/create a contains function that will take indicies and the original string and just look at the characters between, inclusive of i, exclusive of j.

finding longest similar subsequence in a string

Suppose I want to find the longest subsequence such that first half of subsequence is same as second half of it.
For example: In a string abkcjadfbck , result is abcabc as abc is repeated in first and second half of it. In a stirng aaa, result is aa.
This task may be treated as a combination of two well known problems.
If you know in advance some point between two halves of the subsequence, you just need to find the best match for two strings. This is Pairwise alignment problem. Various dynamic programming methods solve it in O(N2) time.
To find a point where the string should be split optimally, you can use Golden section search or Fibonacci search. These algorithms have O(log N) time complexity.
In a first pass over inputString, we can count how often each character occurs, and remove those with occurrence one.
input: inputString
data strucutres:
Set<Triple<char[], Integer, Integer>> potentialSecondWords;
Map<Char, List<Integer>> lettersList;
for the characters c with increasing index h in inputString do
if (!lettersList.get(c).isEmpty()) {
for ((secondWord, currentIndex, maxIndex) in potentialSecondWords) {
if (there exists a j in lettersList.get(c) between currentIndex and maxIndex) {
update (secondWord, currentIndex, maxIndex) by adding c to secondWord and replacing currentIndex with j;
}
}
if potentialSecondWords contains a triple whose char[] is equal to c, remove it;
put new Triple with value (c,lettersList.get(c).get(0), h-1) into potentialSecondWords;
}
lettersList.get(c).add(h);
}
find the largest secondWord in potentialSecondWords and output secondWord twice;
So this algorithm passes once over the array, creating for each index, where it makes sense, a Triple representing the potential second word starting at the current index, and updates all potential second words.
With a suitable list implementation and n being the size of inputString, this algorithm has worst case runtime O(n²), e.g. for a^n.

Optimal solution for non-overlapping maximum scoring sequences

While developing part of a simulator I came across the following problem. Consider a string of length N, and M substrings of this string with a non-negative score assigned to each of them. Of particular interest are the sets of substrings that meet the following requirements:
They do not overlap.
Their total score (by sum, for simplicity) is maximum.
They span the entire string.
I understand the naive brute-force solution is of O(M*N^2) complexity. While the implementation of this algorithm would probably not impose a lot on the performance of the whole project (nowhere near the critical path, can be precomputed, etc.), it really doesn't sit well with me.
I'd like to know if there are any better solutions to this problem and if so, which are they? Pointers to relevant code are always appreciated, but just algorithm description will do too.
This can be thought of as finding the longest path through a DAG. Each position in the string is a node and each substring match is an edge. You can trivially prove through induction that for any node on the optimal path the concatenation of the optimal path from the beginning to that node and from that node to the end is the same as the optimal path. Thanks to that you can just keep track of the optimal paths for each node and make sure you have visited all edges that end in a node before you start to consider paths containing it.
Then you just have the issue to find all edges that start from a node, or all substring that match at a given position. If you already know where the substring matches are, then it's as trivial as building a hash table. If you don't you can still build a hashtable if you use Rabin-Karp.
Note that with this you'll still visit all the edges in the DAG for O(e) complexity. Or in other words, you'll have to consider once each substring match that's possible in a sequence of connected substrings from start to the end. You could get better than this by doing preprocessing the substrings to find ways to rule out some matches. I have my doubts if any general case complexity improvements can come for this and any practical improvements depend heavily on your data distribution.
It is not clear whether M substrings are given as sequences of characters or indeces in the input string, but the problem doesn't change much because of that.
Let us have input string S of length N, and M input strings Tj. Let Lj be the length of Tj, and Pj - score given for string Sj. We say that string
This is called Dynamic Programming, or DP. You keep an array res of ints of length N, where the i-th element represents the score one can get if he has only the substring starting from the i-th element (for example, if input is "abcd", then res[2] will represent the best score you can get of "cd").
Then, you iterate through this array from end to the beginning, and check whether you can start string Sj from the i-th character. If you can, then result of (res[i + Lj] + Pj) is clearly achievable. Iterating over all Sj, res[i] = max(res[i + Lj] + Pj) for all Sj which can be applied to the i-th character.
res[0] will be your final asnwer.
inputs:
N, the number of chars in a string
e[0..N-1]: (b,c) an element of set e[a] means [a,b) is a substring with score c.
(If all substrings are possible, then you could just have c(a,b).)
By e.g. [1,2) we mean the substring covering the 2nd letter of the string (half open interval).
(empty substrings are not allowed; if they were, then you could handle them properly only if you allow them to be "taken" at most k times)
Outputs:
s[i] is the score of the best substring covering of [0,i)
a[i]: [a[i],i) is the last substring used to cover [0,i); else NULL
Algorithm - O(N^2) if the intervals e are not sparse; O(N+E) where e is the total number of allowed intervals. This is effectively finding the best path through an acyclic graph:
for i = 0 to N:
a[i] <- NULL
s[i] <- 0
a[0] <- 0
for i = 0 to N-1
if a[i] != NULL
for (b,c) in e[i]:
sib <- s[i]+c
if sib>s[b]:
a[b] <- i
s[b] <- sib
To yield the best covering triples (a,b,c) where cost of [a,b) is c:
i <- N
if (a[i]==NULL):
error "no covering"
while (a[i]!=0):
from <- a[i]
yield (from,i,s[i]-s[from]
i <- from
Of course, you could store the pair (sib,c) in s[b] and save the subtraction.
O(N+M) solution:
Set f[1..N]=-1
Set f[0]=0
for a = 0 to N-1
if f[a] >= 0
For each substring beginning at a
Let b be the last index of the substring, and c its score
If f[a]+c > f[b+1]
Set f[b+1] = f[a]+c
Set g[b+1] = [substring number]
Now f[N] contains the answer, or -1 if no set of substrings spans the string.
To get the substrings:
b = N
while b > 0
Get substring number from g[N]
Output substring number
b = b - (length of substring)

Resources