Assume we have a string array str = ["foo", "bar", "zebras", "topper"] which I need to sort to get ["bar", "foo", "topper", "zebras"]. The sort will do O(n log n) comparisons, where n is the length of the array. But each comparison is a string compare, which should be O(m), where m is the length of the longest string (such as "zebras" or "topper"). So the final complexity should be O(m * n log n). Please correct me if I am wrong.
This question differs from the one here because here I am comparing all strings with all other strings, rather than comparing each string against just one fixed string.
This is the complexity you would get if you performed the sort using only normal string comparison.
You can get better performance (though I don't know of any implementations that do this) by exploiting the fact that your inputs are strings. For example, you can sort the strings by each character in turn, which gives roughly O(m*(n + |alphabet|)); a rough sketch of this idea is below.
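A possible shape of that sketch in Python (the name msd_string_sort is mine, and this is illustrative rather than tuned):

    def msd_string_sort(strs, pos=0):
        # Bucket the strings on the character at position `pos`, then recurse
        # into each bucket on the next position (MSD radix / bucket sort).
        if len(strs) <= 1:
            return strs
        finished = [s for s in strs if len(s) <= pos]   # already-ended strings sort first
        buckets = {}
        for s in strs:
            if len(s) > pos:
                buckets.setdefault(s[pos], []).append(s)
        result = finished
        for ch in sorted(buckets):                      # at most |alphabet| keys per level
            result += msd_string_sort(buckets[ch], pos + 1)
        return result

    print(msd_string_sort(["foo", "bar", "zebras", "topper"]))
    # ['bar', 'foo', 'topper', 'zebras']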
Related
Question:
We have an array of m strings composed of only lower case
characters such that the total number of characters in all the strings
combined is n.
Show how to sort the strings (in lexicographic order) in O(n) time using only character comparisons. Justify your answer.
What I have:
This really seems like it should be radix sort. Radix sort has a time complexity of O(k*(m+d)), where k is the maximum number of letters in any string in the array and d is the number of "buckets" (assuming you are using radix sort with bucket sort). In this case we know we will have 26 "buckets" (one for each letter of the alphabet), therefore we can simplify the time complexity to O(k*m).
Assuming I am correct and the best way of doing this is radix sort, what I am struggling to prove is that O(k*m) = O(n).
Am I right that this is radix sort?
How can I prove that O(k*m) = O(n)?
O(k*(m+d)) ~ O(n+kd) in your case.
For example, let's say you have to sort ["ABCD", "ABDC", "AB"]. When you sort on the first and second characters, you go through all 3 elements. But when you sort on the third and fourth characters, you don't have to touch the string "AB", since it doesn't have a third or fourth letter. So the number of characters you actually visit is 2*3 + 2*2 = 10, which is the sum of the lengths of all the strings (plus the kd term for storing and retrieving letters).
You'll just have to tweak radix sort by adding a few validation checks for strings that have already terminated, and it comes to O(n + kd). A rough sketch of such a tweak is below.
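Here is a minimal illustration (my own sketch, assuming non-empty lowercase strings; grouping the strings by length up front is what lets already-terminated strings be skipped in later passes):

    from collections import defaultdict

    def lsd_string_sort(strs):
        # LSD radix sort for variable-length lowercase strings. A string that has
        # no character at the current position simply does not take part in that
        # pass, so the total character work is O(n) and the bucket overhead O(k*d).
        if not strs:
            return strs
        k = max(len(s) for s in strs)
        by_length = defaultdict(list)
        for s in strs:
            by_length[len(s)].append(s)       # group once by length

        ordered = []   # invariant: strings longer than `pos`, sorted by their suffix s[pos:]
        for pos in range(k - 1, -1, -1):
            # Strings of length exactly pos+1 end here; being prefixes, they must
            # come before longer strings that share the same character at `pos`.
            ordered = by_length[pos + 1] + ordered
            buckets = [[] for _ in range(26)]
            for s in ordered:                 # every string here has a character at pos
                buckets[ord(s[pos]) - ord('a')].append(s)
            ordered = [s for b in buckets for s in b]
        return by_length[0] + ordered         # empty strings (if any) sort first

    print(lsd_string_sort(["banana", "band", "ban", "apple", "apt"]))
    # ['apple', 'apt', 'ban', 'banana', 'band']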
Let A = ["stack","overflow","algorithm"] ,
B = ["gor","tac","flo"].
A and B are arrays of strings, where B contains the substrings.
It is guaranteed that every string in B will be a substring of only one string in A, and every string in A has only one match in B. Also consider that the number of strings in A and B are equal.
Output B reordered such that B[i] is a substring of A[i].
The output for the above example is:
B = ["tac","flo","gor"].
I can only think of the naive approach. Do we have a better solution to the above problem?
Concatenate all the strings of A into a superstring s of length L = sum(len(i)), and store the indexes where each original string begins.
Build a suffix array for the superstring (O(L log L)).
Search for every substring in that suffix array (O(N log L)).
From the found suffix index, get the original string it falls in.
If the substring cannot fit between the found position and the index of the next string beginning (a situation like fax/emotion/axel when searching for axe), use another matching suffix.
A rough sketch of this approach is below.
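The sketch (my own illustration; it builds the suffix array naively by sorting the suffixes, which is not the O(L log L) construction, and uses a '\x00' separator between words instead of the explicit boundary check in the last step):

    import bisect

    def match_substrings(A, B):
        # Superstring over A, with the index of the word owning each position.
        superstring = ""
        owner = []                        # owner[p] = index in A of the word covering position p
        for i, word in enumerate(A):
            owner.extend([i] * (len(word) + 1))
            superstring += word + "\x00"  # separator also prevents cross-boundary matches
        # Naive suffix array: all suffix start positions, sorted lexicographically.
        sa = sorted(range(len(superstring)), key=lambda p: superstring[p:])
        suffixes = [superstring[p:] for p in sa]

        result = [None] * len(A)
        for sub in B:
            # Binary search for the first suffix that starts with `sub`.
            lo = bisect.bisect_left(suffixes, sub)
            assert suffixes[lo].startswith(sub)   # guaranteed by the problem statement
            result[owner[sa[lo]]] = sub
        return result

    A = ["stack", "overflow", "algorithm"]
    B = ["gor", "tac", "flo"]
    print(match_substrings(A, B))   # ['tac', 'flo', 'gor']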
I assume that by the naive approach you mean: for each substring in B, scan through all words in A to find a match. This approach has a complexity of O(n^2).
However, you can also index the words in A using a Substring Index. Building this index usually requires O(n log n) construction time.
Afterwards you can use this index to efficiently (usually in O(log n)) find the words that contain a given substring. Doing this for every substring in B is thus O(n log n) in total.
This is about analyzing the complexity of a solution to a popular interview problem.
Problem
There is a function concat(str1, str2) that concatenates two strings. The cost of one call is the sum of the lengths of the two input strings, len(str1) + len(str2). Implement concat_all(strs) that concatenates a list of strings using only the concat(str1, str2) function. The goal is to minimize the total concat cost.
Warnings
Usually in practice, you would be very cautious about concatenating pairs of strings in a loop. Some good explanations can be found here and here. In reality, I have witnessed a severity-1 accident caused by such code. Warnings aside, let's say this is an interview problem. What's really interesting to me is the complexity analysis around the various solutions.
You can pause here if you would like to think about the problem. I am gonna reveal some solutions below.
Solutions
Naive solution. Loop through the list and concatenate
def concat_all(strs):
    result = ''
    for s in strs:                      # `s`, not `str`, to avoid shadowing the built-in
        result = concat(result, s)
    return result
Min-heap solution. The idea is to concatenate shorter strings first. Maintain a min-heap of the strings keyed on string length. Each step pops the 2 shortest strings off the min-heap, concatenates them, and pushes the result back onto the min-heap, until only one string is left on the heap.
import heapq

def concat_all(strs):
    # Min-heap keyed on string length; the string itself breaks ties.
    heap = [(len(s), s) for s in strs]
    heapq.heapify(heap)
    while len(heap) > 1:
        _, str1 = heapq.heappop(heap)
        _, str2 = heapq.heappop(heap)
        merged = concat(str1, str2)
        heapq.heappush(heap, (len(merged), merged))
    return heap[0][1]
Binary concat. This one may not be intuitively obvious, but another good solution is to recursively split the list in half and concatenate.
def concat_all(strs):
    if len(strs) == 1:
        return strs[0]
    if len(strs) == 2:
        return concat(strs[0], strs[1])
    mid = len(strs) // 2
    str1 = concat_all(strs[:mid])
    str2 = concat_all(strs[mid:])
    return concat(str1, str2)
Complexity
What I am really struggling with, and asking about here, is the complexity of the 2nd approach above, the one that uses a min-heap.
Let's say the number of strings in the list is n and the total number of characters in all the strings is m. The upper bound for the naive solution is O(mn). Binary concat has an exact bound of Θ(m log n). It is the min-heap approach that is elusive to me.
I am kind of guessing it has an upper bound of O(m log n + n log n). The second term, n log n, is associated with maintaining the heap; there are n concats and each one updates the heap in O(log n). If we only focus on the cost of the concatenations and ignore the cost of maintaining the min-heap, the overall complexity of the min-heap approach can be reduced to O(m log n). Then the min-heap is a more optimal approach than binary concat, because for the former O(m log n) is an upper bound while for the latter it is the exact bound.
But I can't seem to prove it, or even find a good intuition to support that guessed upper bound. Can the upper bound be even lower than O(mlog(n))?
Let us call m_1, ..., m_n the lengths of strings 1 to n, and m the sum of all these values.
For the naive solution, the worst case clearly appears when m_1 is almost equal to m, and you obtain O(nm) complexity, as you pointed out.
For the min-heap, the worst case is a bit different: it consists of all strings having the same length. In that case, it is going to work exactly like your case 3, the binary concat, but you also have to maintain the min-heap structure. So yes, it will be a bit more costly than case 3 in real life. Nevertheless, from a complexity point of view, both will be in O(m log n), since we have m > n and O(m log n + n log n) can be reduced to O(m log n).
To prove the min-heap complexity more rigorously, observe that when we take the two smallest strings out of a set of k strings whose lengths sum to m, their combined length S satisfies S/2 <= m/k (the mean of the two smallest strings is at most the mean of all k strings), i.e. S <= 2m/k. Note also that the total length of the strings on the heap stays equal to m after every concatenation. Let us apply this to the min-heap algorithm:
at the first step (k = n strings on the heap), the 2 strings we take have total length at most 2m/n
at the second step (k = n - 1), at most 2m/(n-1)
...
at the last step (k = 2), at most 2m/2 = m
Hence the total concatenation cost of min-heap is at most 2m*[1/n + 1/(n-1) + ... + 1/2], which is in O(m log n).
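As a quick sanity check of this bound, here is a small throwaway simulation (names like minheap_cost are mine, and the random lengths are purely illustrative): it merges the two shortest lengths repeatedly and compares the total cost against 2m*(1/2 + ... + 1/n).

    import heapq, random

    def minheap_cost(lengths):
        # Total concat cost (sum of the two operands' lengths at each merge)
        # when always merging the two currently shortest strings.
        heap = list(lengths)
        heapq.heapify(heap)
        cost = 0
        while len(heap) > 1:
            a, b = heapq.heappop(heap), heapq.heappop(heap)
            cost += a + b
            heapq.heappush(heap, a + b)
        return cost

    random.seed(0)
    lengths = [random.randint(1, 50) for _ in range(1000)]
    m, n = sum(lengths), len(lengths)
    bound = 2 * m * sum(1 / k for k in range(2, n + 1))   # 2m*(1/2 + ... + 1/n)
    print(minheap_cost(lengths), "<=", round(bound))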
I want to check if my algorithm is correct.
Given a string of n characters with all white-space omitted,
Ex: "itwasthebestoftimes"
Give a dynamic programming algorithm which determines if the string can be broken into a valid sequence of words, and reconstructs a valid string with whitespace, in O(n^2).
My idea:
First find all substrings of the string (O(n^2)), and for each valid word substring record its position and length as an interval.
Ex: "it was the best"
[] [-] [-] [--]
[---] []
[]
(Spaces added to make it easier to view).
In the above example, "it" is valid and gets an interval value of 2, "was" gets 3, etc. The string "twas" is also valid, and gets a value of 4.
This then reduces to the problem of finding a set of non-overlapping intervals of maximum total length. Since a valid segmentation must cover all letters, the maximum-length non-overlapping set of intervals will be the answer, and finding it takes Θ(n log n).
Therefore the solution will take O(n^2 + n log n) = O(n^2).
Is my thinking correct?
Your thinking is fine (assuming you know an O(n log n) solution to the problem of finding a maximum set of non-overlapping intervals, and a way to find the word intervals in O(n^2) time). However, I think the problem is easier than you're making it.
Create an array W[0...n]. W[i] will be 0 if there's no way to cut up the string from position i onwards into words; otherwise it'll store the length of a word that starts a valid cutting-up of the rest of the string.
Then:
W[i] = min(j such that the substring of length j starting at position i is a word, and i+j = n or W[i+j] > 0)
or 0 if there's no such j.
If you keep your dictionary in a trie, you can compute W[i] in O(n-i) time assuming you've already computed W[i+1] to W[n-1]. That means you can compute all of W in O(n^2) time. Or if the maximum length of the word in your dictionary is k, you can do it in O(nk) time.
Once you've computed all of W, the whole string can be cut up into words if and only if W[0] is not 0.
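Here is a small Python sketch of this DP with a dict-based trie (the word list and the helper names build_trie/segment are mine, purely for illustration):

    def build_trie(words):
        # Nested-dict trie; the key '$' marks the end of a word.
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node['$'] = True
        return root

    def segment(s, words):
        trie = build_trie(words)
        n = len(s)
        W = [0] * (n + 1)          # W[i] = length of a word starting a valid split of s[i:]
        W[n] = -1                  # sentinel: the empty suffix is always "splittable"
        for i in range(n - 1, -1, -1):
            node = trie
            for j in range(i, n):                  # walk the trie, O(n - i) per position
                if s[j] not in node:
                    break
                node = node[s[j]]
                if '$' in node and W[j + 1] != 0:  # word s[i:j+1] plus a splittable rest
                    W[i] = j + 1 - i
                    break                          # any valid word length will do
        if W[0] == 0:
            return None
        # Reconstruct the split by following the stored word lengths.
        out, i = [], 0
        while i < n:
            out.append(s[i:i + W[i]])
            i += W[i]
        return ' '.join(out)

    print(segment("itwasthebestoftimes",
                  ["it", "was", "the", "best", "of", "times", "twas"]))
    # it was the best of times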
I am reading suffix array construction tutorials from CodeChef and from Stack Overflow as well. One point I could understand is that they say:
It works by first sorting the 2-grams(*), then the 4-grams, then the 8-grams, and so forth, of the original string S, so in the i-th iteration, we sort the 2^i-grams.
And so forth. Each iteration i has two steps:
Sorting by 2^i-grams, using the lexicographic names from the previous iteration to enable comparisons in 2 steps (i.e. O(1) time) each
Creating new lexicographic names
MY DOUBT IS:
How can I use the indexes computed for the 2-grams when sorting the 4-grams?
Suppose two of my 2-grams are 'ab' and 'ac'; how can you compare them in O(1) time and give them indexes?
I really tried but got stuck there. Please provide an example, that would help. Thanks in advance.
Let's assume that all substrings with length 2^k are sorted and now we want to sort all substrings with length 2^(k + 1). The key observation here is that any substring with length 2^(k + 1) is a concatenation of two substrings with length 2^k.
For example, in a string abacaba a substring caba is a concatenation of ca and ba.
But all substrings with length 2^k are sorted, so we may assume that each of them is assigned an integer from the range [0 ... n - 1] (I will call it its class) based on its position in the sorted array of all substrings of this length (equal strings should be assigned equal numbers, and this array is not maintained explicitly, of course). In this case, each substring with length 2^(k + 1) can be represented as a pair of two numbers (p1, p2): the classes of the first and the second substring, respectively. So all we need to do is sort an array of pairs of integers from the range [0 ... n - 1]. One can use radix sort to do this in linear time. After sorting these pairs, we can find the classes for all substrings with length 2^(k + 1) using a single pass over the sorted array.
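For concreteness, here is a compact Python sketch of the doubling procedure (it uses the built-in sort on the (class, class) pairs, so each round is O(n log n) rather than the linear-time radix sort described above; the class bookkeeping is the point):

    def suffix_array(s):
        n = len(s)
        # Round 0: classes based on single characters (substrings of length 1).
        rank = [ord(c) for c in s]
        sa = sorted(range(n), key=lambda i: rank[i])
        k = 1
        while k < n:
            # A substring of length 2k is the pair (class of s[i:i+k], class of s[i+k:i+2k]);
            # -1 acts as a class smaller than all others for parts that run off the end.
            key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
            sa = sorted(range(n), key=key)
            # Re-number the classes for length-2k substrings from the sorted order.
            new_rank = [0] * n
            for idx in range(1, n):
                prev, cur = sa[idx - 1], sa[idx]
                new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
            rank = new_rank
            k *= 2
        return sa

    print(suffix_array("abacaba"))   # [6, 4, 0, 2, 5, 1, 3]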