This is about analyzing the complexity of a solution to a popular interview problem.
Problem
There is a function concat(str1, str2) that concatenates two strings. The cost of the function is measured by the lengths of the two input strings len(str1) + len(str2). Implement concat_all(strs) that concatenates a list of strings using only the concat(str1, str2) function. The goal is to minimize the total concat cost.
Warnings
Usually in practice, you would be very cautious about concatenating pairs of strings in a loop. Some good explanations can be found here and here. In reality, I have witnessed a severity-1 accident caused by such code. Warnings aside, let's say this is an interview problem. What's really interesting to me is the complexity analysis around the various solutions.
You can pause here if you would like to think about the problem. I am gonna reveal some solutions below.
Solutions
Naive solution. Loop through the list and concatenate
def concat_all(strs):
result = ''
for str in strs:
result = concat(result, str)
return result
Min-heap solution. The idea is to concatenate shorter strings first. Maintain a min-heap of the strings based on the length of the strings. Each concatenation concatenates 2 strings off the min-heap and the result is pushed back the min-heap. Until only one string is left on the heap.
def concat_all(strs):
heap = MinHeap(strs, key_func=len)
while len(heap) > 1:
str1 = heap.pop()
str2 = heap.pop()
heap.push(concat(str1, str2))
return heap.pop()
Binary concat. May not be intuitively clear. But another good solution is to recursively split the list by half and concatenate.
def concat_all(strs):
if len(strs) == 1:
return strs[0]
if len(strs) == 2:
return concat(strs[0], strs[1])
mid = len(strs) // 2
str1 = concat_all(strs[:mid])
str2 = concat_all(strs[mid:])
return concat(str1, str2)
Complexity
What I am really struggling and asking here is the complexity of the 2nd approach above that uses a min-heap.
Let's say the number of strings in the list is n and the total number of characters in all the strings is m. The upper bound for the naive solution is O(mn). The binary-concat has an exact bound of theta(mlog(n)). It is the min-heap approach that is elusive to me.
I am kind of guessing it has an upper bound of O(mlog(n) + nlog(n)). The second term, nlog(n) is associated with maintaining the heap; there are n concats and each concat updates the heap in log(n). If we only focus on the cost of concatenations and ignore the cost of maintaining the min-heap, the overall complexity of the min-heap approach can be reduced to O(mlog(n)). Then min-heap is a more optimal approach than binary-concat cause for the former mlog(n) is the upper bound while for the latter it is the exact bound.
But I can't seem to prove it, or even find a good intuition to support that guessed upper bound. Can the upper bound be even lower than O(mlog(n))?
Let us call the length of strings 1 to n and m be the sum of all these values.
For the naive solution, clearly the worst appears if
m1
is almost equal to m, and you obtain a O(nm) complexity, as you pointed.
For the min-heap, the worst-case is a bit different, it consists in having the same length for any string. In that case, it's going to work exactly as your case 3. of binary concat, but you'll also have to maintain the min-heap structure. So yes, it will be a bit more costly than case 3 in real-life. Nevertheless, from a complexity point of view, both will be in O(m log n) since we have m > n and O(m log n + n log n)can be reduced to O(m log n).
To prove the min-heap complexity more rigorously, we can prove that when we take the two smallest strings in a set of k strings, and denote by S the sum of the two smallest strings, then we have: (m-S)/(k-1) >= S/2 (it simply means that the mean of the two smallest strings is less than the mean of the k-2 other strings). Reformulating leads to S <= 2m/(k+1). Let us apply it to the min-heap algorithm:
at first step, we can show that the 2 strings we take are of total length at most 2m/(n+1)
at first step, we can show that the 2 strings we take are of total length at most 2m/(n)
...
at last step, we can show that the 2 strings we take are of total length at most 2m/(1)
Hence the computation time of min-heap is 2m*[1/(n+1) + 1/n + ... + 1/2 + 1] which is in O(m log n)
I have a method called binary sum
Algorithm BinarySum(A, i, n):
Input: An array A and integers i and n
Output: The sum of the n integers in A starting at index i
if n = 1 then
return A[i]
return BinarySum(A, i, n/ 2) + BinarySum(A, i + n/ 2, n/ 2)
Ignoring the fact of making a simple problem complicated I have been asked to find the Big O. Here is my thought process. For an array of size N I will be making 1 + 2 + 4 .. + N recursive calls. This is close to half the sum from 1 to N so I will say it is about N(N + 1)/4. After making this many calls now I need to add them together. So once again I need to perform N(N+1)/4 additions. Adding them together we are left with N^2 as the dominate term.
So would the big O of this algorithm be O(N^2)? Or am I doing something wrong. It feels strange to have binary recursion and not have a 2^n or log n in the final answer
There are in-fact 2^n and log n terms in the final result... sort of.
For each call to a sub-array of length n, two recursive calls are made to both halves of this array, plus a constant amount of work (if-statement, addition, pushing onto the call stack etc). Thus the recurrence relation is given by:
At this point we could just use the Master theorem to directly arrive at the final result - O(n). But let's instead derive it by repeated expansion:
The stopping condition n = 1 gives the maximum value of m (ignoring rounding):
In step (*) we used the standard formula for geometric series. So as you can see the answer does involve log n and 2^n terms in a sense, but they "cancel" out to give a simple linear term, which is the same as for a simple loop.
I was given a tricky question.
Given:
A = [a1,a2,...an] (list of positive integers with length "n")
r (positive integer)
Find a list of { *, + } operators
O = [o1,o2,...on-1]
so that if we placed those operators between the elements of "A", the resulting expression would evaluate to "r". Only one solution is required.
So for example if
A = [1,2,3,4]
r = 14
then
O = [*, +, *]
I've implemented a simple recursive solution with some optimisation, but of course it's exponential O(2^n) time, so for an input with length 40, it works for ages.
I wanted to ask if any of you know a sub-exponential solution for this?
Update
Elements of A are between 0-10000,
r can be arbitrarily big
Let A and B be positive integers. Then A + B ≤ A × B + 1.
This little fact can be used to construct a very efficient algorithm.
Let's define a graph. The graph nodes correspond to operations lists, for example, [+, ×, +, +, ×]. There is an edge from graph node X to graph node Y if the Y can be obtained by changing a single + to a × in X. The graph has a source at the node corresponding to [+, +, ..., +].
Now perform a breadth-first search from the source node, constructing the graph as you go. When expanding a node [+, ×, +, +, ×], for example, you (optionally construct then) connect to the nodes [×, ×, +, +, ×], [+, ×, ×, +, ×], and [+, ×, +, ×, ×]. Do not expand to a node if the result of evaluating it is greater than r + k(O), where k(O) is the number of +'s in the operation list O. This is because of the "+ 1" in the fact at the beginning of the answer - consider the case of a = [1, 1, 1, 1, 1], r = 1.
This approach uses O(n 2n) time and O(2n) space (where both are potentially very-loose worst case bounds). This is still an exponential algorithm, however I think you will find it performs very reasonably for non-sinister inputs. (I suspect this problem is NP-complete, which is why I am happy with this "non-sinister inputs" escape clause.)
Here's an O(rn^2)-time, O(rn)-space DP approach. If r << 2^n then this will have better worst-case behaviour than exponential-time branch-and-bound approaches, though even then the latter may still be faster on many instances. This is pseudo-polynomial time, because it takes time proportional to the value of part of its input (r), not its size (which would be log2(r)). Specifically it needs rn bits of memory, so it should give answers in a few seconds for up to around rn < 1,000,000,000 and n < 1000 (e.g. n = 100, r = 10,000,000).
The key observation is that any formula involving all n numbers has a final term that consists of some number i of factors, where 1 <= i <= n. That is, any formula must be in one of the following n cases:
(a formula on the first n-1 terms) + a[n]
(a formula on the first n-2 terms) + a[n-1] * a[n]
(a formula on the first n-3 terms) + a[n-2] * a[n-1] * a[n]
...
a[1] * a[2] * ... * a[n]
Let's call the "prefix" of a[] consisting of the first i numbers P[i]. If we record, for each 0 <= i <= n-1, the complete set of values <= r that can be reached by some formula on P[i], then based on the above, we can quite easily compute the complete set of values <= r that can be reached by P[n]. Specifically, let X[i][j] be a true or false value that indicates whether the prefix P[i] can achieve the value j. (X[][] could be stored as an array of n size-(r+1) bitmaps.) Then what we want to do is compute X[n][r], which will be true if r can be reached by some formula on a[], and false otherwise. (X[n][r] isn't quite the full answer yet, but it can be used to get the answer.)
X[1][a[1]] = true. X[1][j] = false for all other j. For any 2 <= i <= n and 0 <= j <= r, we can compute X[i][j] using
X[i][j] = X[i - 1][j - a[i]] ||
X[i - 2][j - a[i-1]*a[i]] ||
X[i - 3][j - a[i-2]*a[i-1]*a[i]] ||
... ||
X[1][j - a[2]*a[3]*...*a[i]] ||
(a[1]*a[2]*...*a[i] == j)
Note that the last line is an equality test that compares the product of all i numbers in P[i] to j, and returns true or false. There are i <= n "terms" (rows) in the expression for X[i][j], each of which can be computed in constant time (note in particular that the multiplications can be built up in constant time per row), so computing a single value X[i][j] can be done in O(n) time. To find X[n][r], we need to calculate X[i][j] for every 1 <= i <= n and every 0 <= j <= r, so there is O(rn^2) overall work to do. (Strictly speaking we may not need to compute all of these table entries if we use memoization instead of a bottom-up approach, but many inputs will require us to compute a large fraction of them anyway, so it's likely that the latter is faster by a small constant factor. Also a memoization approach requires keeping an "already processed" flag for each DP cell -- which doubles the memory usage when each cell is just 1 bit!)
Reconstructing a solution
If X[n][r] is true, then the problem has a solution (satisfying formula), and we can reconstruct one in O(n^2) time by tracing back through the DP table, starting from X[n][r], at each location looking for any term that enabled the current location to assume the value "true" -- that is, any true term. (We could do this reconstruction step faster by storing more than a single bit per (i, j) combination -- but since r is allowed to be "arbitrarily big", and this faster reconstruction won't improve the overall time complexity, it probably makes more sense to go with the approach that uses the fewest bits per DP table entry.) All satisfying solutions can be reconstructed this way, by backtracking through all true terms instead of just picking any one -- but there may be an exponential number of them.
Speedups
There are two ways that calculation of an individual X[i][j] value can be sped up. First, because all the terms are combined with ||, we can stop as soon as the result becomes true, since no later term can make it false again. Second, if there is no zero anywhere to the left of i, we can stop as soon as the product of the final numbers becomes larger than r, since there's no way for that product to be decreased again.
When there are no zeroes in a[], that second optimisation is likely to be very important in practice: it has the potential to make the inner loop much smaller than the full i-1 iterations. In fact if a[] contains no zeroes, and its average value is v, then after k terms have been computed for a particular X[i][j] value the product will be around v^k -- so on average, the number of inner loop iterations (terms) needed drops from n to log_v(r) = log(r)/log(v). That might be much smaller than n, in which case the average time complexity for this model drops to O(rn*log(r)/log(v)).
[EDIT: We actually can save multiplications with the following optimisation :)]
8/32/64 X[i][j]s at a time: X[i][j] is independent of X[i][k] for k != j, so if we are using bitsets to store these values, we can calculate 8, 32 or 64 of them (or maybe more, with SSE2 etc.) in parallel using simple bitwise OR operations. That is, we can calculate the first term of X[i][j], X[i][j+1], ..., X[i][j+31] in parallel, OR them into the results, then calculate their second terms in parallel and OR them in, etc. We still need to perform the same number of subtractions this way, but the products are all the same, so we can reduce the number of multiplications by a factor of 8/32/64 -- as well as, of course, the number of memory accesses. OTOH, this makes the first optimisation from the previous paragraph harder to accomplish -- you have to wait until an entire block of 8/32/64 bits have become true before you can stop iterating.
Zeroes: Zeroes in a[] may allow us to stop early. Specifically, if we have just computed X[i][r] for some i < n and found it to be true, and there is a zero anywhere to the right of position i in a[], then we can stop: we already have a formula on the first i numbers that evaluates to r, and we can use that zero to "kill off" all numbers to the right of position i by creating one big product term that includes all of them.
Ones: An interesting property of any a[] entry containing the value 1 is that it can be moved to any other position in a[] without affecting whether or not there is a solution. This is because every satisfying formula either has a * on at least one side of this 1, in which case it multiplies some other term and has no effect there, and would likewise have no effect anywhere else; or it has a + on both sides (imagine extra + signs before the first position and after the last), in which case it might as well be added in anywhere.
So, we can safely shunt all 1 values to the end of a[] before doing anything else. The point of doing this is that now we don't have to evaluate these rows of X[][] at all, because they only influence the outcome in a very simple way. Suppose there are m < n ones in a[], which we have moved to the end. Then after computing the m+1 values X[n-m][r-m], X[n-m][r-m+1], X[n-m][r-m+2], ..., X[n-m][r], we already know what X[n][r] must be: if any of them are true, then X[n][r] must be true, otherwise (if they are all false) it must be false. This is because the final m ones can add anywhere from 0 up to m to a formula on the first n-m values. (But if a[] consists entirely of 1s, then at least 1 must be "added" -- they can't all multiply some other term.)
Here is another approach that might be helpful. It is sometimes known as a "meet-in-the-middle" algorithm and runs in O(n * 2^(n/2)). The basic idea is this. Suppose n = 40 and you know that the middle slot is a +. Then, you can brute force all N := 2^20 possibilities for each side. Let A be a length N array storing the possible values of the left side, and similarly let B be a length N array storing the values for the right side.
Then, after sorting A and B, it is not hard to efficiently check for whether any two of them sum to r (e.g. for each value in A, do a binary search on B, or you can even do it in linear time if both arrays are sorted). This part takes O(N * log N) = O(n * 2^(n/2)) time.
Now, this was all assuming the middle slot is a +. If not, then it has to be a *, and you can combine the middle two elements into one (their product), reducing the problem to n = 39. Then you try the same thing, and so on. If you analyze it carefully, you should get O(n * 2^(n/2)) as the asymptotic complexity, since actually the largest term dominates.
You need to do some bookkeeping to actually recover the +'s and *'s, which I have left out to simplify the explanation.
I've been doing some reading on time complexity, and I've got the basics covered. To reinforce the concept, I took a look at an answer I recently gave here on SO. The question is now closed, for reason, but that's not the point. I can't figure out what the complexity of my answer is, and the question was closed before I could get any useful feedback.
The task was to find the first unique character in a string. My answer was something simple:
public String firstLonelyChar(String input)
{
while(input.length() > 0)
{
int curLength = input.length();
String first = String.valueOf(input.charAt(0));
input = input.replaceAll(first, "");
if(input.length() == curLength - 1)
return first;
}
return null;
}
Link to an ideone example
My first thought was that since it looks at each character, then looks at each again during replaceAll(), it would be O(n^2).
However, it got me to thinking about the way it actually works. For each character examined, it then removes all instances of that character in the string. So n is constantly shrinking. How does that factor into it? Does that make it O(log n), or is there something I'm not seeing?
What I'm asking:
What is the time complexity of the algorithm as written, and why?
What I'm not asking:
I'm not looking for suggestions to improve it. I get that there are probably better ways to do this. I'm trying to understand the concept of time complexity better, not find the best solution.
The worst time complexity you will have is for the string aabb... and so on each character repeated exactly twice. Now this depends on the size of your alphabet let's say that is S. Let's also annotate the length of your initial string with L. So for each letter you will have to iterate over the whole String. However first time you do that the String will be of size L, second time L-2 and so on. Overall you will have to perform in the order of L + (L-2) + ... + (L- S*2) operations and that is L*S - 2*S*(S+1), assuming L is more than 2*S.
By the way if the size of your alphabet is constant, and I suppose it is, the complexity of your code is O(L)(though with a big constant).
The worst case is O(n^2) where n is the length of the input string. Imagine the case where every character is doubled except the last one, like "aabbccddeeffg". Then there are n/2 loop iterations, and each call to replaceAll has to scan the entire remaining string, which is also proportional to n.
Edit: As Ivaylo points out, if the size of your alphabet is constant, it's technically O(n) since you never consider any character more than once.
Let's mark:
m = number of unique letters in the word
n = input length
This is the complexity calculation:
The main loop goes at most m times, because there are m different letters,
the .Replaceall checks at most O(n) comparisons in each cycle.
the total is: O(m*n)
an example for O(m*n) cycle is: input = aabbccdd,
m=4, n=8
the algorithm stages:
1. input = aabbccdd, complex - 8
2. input = bbccdd, complex = 6
3. input = ccdd, complex = 4
4. input = dd, complex = 2
total = 8+6+4+2 = 20 = O(m*n)
Let m be the size of your alphabet, and let n be the length of your string. The worse case would be to uniformly distribute your string's characters between the alphabet letters, meaning you'll have n / m characters for each letter in your alphabet, let's mark this quantity with q. For example, the string aabbccddeeffgghh is the uniformly distribution of 16 characters between the letters a-h, so here n=16 and m=8 and you have q=2 characters for each letter.
Now, your algorithm is actually going over the letters of the alphabet (it just uses the order which they appear in the string), and for each iteration it has to go over the length of the string (n) and shrink it by q (n -= q). So over all the number of operation you do in the worst case are:
s = n + n-(1*q) + ... + n-((m-1)*q)
You can see that s is the sum of the first m elements of the arithmetic series:
s = (n + n-((m-1)*q) * m / 2 =
(n + n-((m-1)*(n/m)) * m / 2 ~ n * m / 2
Given is a set S = {s1, ..., sm} where each si is a string of length k over the alphabet {0,1,?}.
I am looking for an efficient algorithm solving the following decision problem:
Is it true that for each 1 ≤ a < b ≤ k there is a string si in S s.t. si(a) = 0 and si(b) = 1 or si(a) = 1 and si(b) = 0, where si(a) denotes the a-th character in string si.
I am looking for a sublinear time algorithm in m, so something like O(\sqrt(m)f(k)) would be the goal.
Can't be done in less than linear time, unless prior offline processing is allowed.
Basically the first string that satisfies your criterion will be the last one you consider. And you might have to consider all m-1 others first.
Easy. Bitfield your character space (0 == 0x1, 1 == 0x2,...), then take the first N characters of each string in S and add their transformed representations, XOR the total against 0x3,..., and see if it evaluates zero.
For faster traversal than O(n), use this to form a binary search tree or heap (O(lg n) fetch), or a hash (O(1) fetch). If S is guaranteed to be sorted, this becomes even easier.
At least, if I understand your question properly. For better, more theoretic results with bounded mathematical complexity and proofs, Math.SE or CSTheory.SE are the go-to locations.