I've been doing some reading on time complexity, and I've got the basics covered. To reinforce the concept, I took a look at an answer I recently gave here on SO. The question is now closed, for reason, but that's not the point. I can't figure out what the complexity of my answer is, and the question was closed before I could get any useful feedback.
The task was to find the first unique character in a string. My answer was something simple:
public String firstLonelyChar(String input)
{
while(input.length() > 0)
{
int curLength = input.length();
String first = String.valueOf(input.charAt(0));
input = input.replaceAll(first, "");
if(input.length() == curLength - 1)
return first;
}
return null;
}
Link to an ideone example
My first thought was that since it looks at each character, then looks at each again during replaceAll(), it would be O(n^2).
However, it got me to thinking about the way it actually works. For each character examined, it then removes all instances of that character in the string. So n is constantly shrinking. How does that factor into it? Does that make it O(log n), or is there something I'm not seeing?
What I'm asking:
What is the time complexity of the algorithm as written, and why?
What I'm not asking:
I'm not looking for suggestions to improve it. I get that there are probably better ways to do this. I'm trying to understand the concept of time complexity better, not find the best solution.
The worst time complexity you will have is for the string aabb... and so on each character repeated exactly twice. Now this depends on the size of your alphabet let's say that is S. Let's also annotate the length of your initial string with L. So for each letter you will have to iterate over the whole String. However first time you do that the String will be of size L, second time L-2 and so on. Overall you will have to perform in the order of L + (L-2) + ... + (L- S*2) operations and that is L*S - 2*S*(S+1), assuming L is more than 2*S.
By the way if the size of your alphabet is constant, and I suppose it is, the complexity of your code is O(L)(though with a big constant).
The worst case is O(n^2) where n is the length of the input string. Imagine the case where every character is doubled except the last one, like "aabbccddeeffg". Then there are n/2 loop iterations, and each call to replaceAll has to scan the entire remaining string, which is also proportional to n.
Edit: As Ivaylo points out, if the size of your alphabet is constant, it's technically O(n) since you never consider any character more than once.
Let's mark:
m = number of unique letters in the word
n = input length
This is the complexity calculation:
The main loop goes at most m times, because there are m different letters,
the .Replaceall checks at most O(n) comparisons in each cycle.
the total is: O(m*n)
an example for O(m*n) cycle is: input = aabbccdd,
m=4, n=8
the algorithm stages:
1. input = aabbccdd, complex - 8
2. input = bbccdd, complex = 6
3. input = ccdd, complex = 4
4. input = dd, complex = 2
total = 8+6+4+2 = 20 = O(m*n)
Let m be the size of your alphabet, and let n be the length of your string. The worse case would be to uniformly distribute your string's characters between the alphabet letters, meaning you'll have n / m characters for each letter in your alphabet, let's mark this quantity with q. For example, the string aabbccddeeffgghh is the uniformly distribution of 16 characters between the letters a-h, so here n=16 and m=8 and you have q=2 characters for each letter.
Now, your algorithm is actually going over the letters of the alphabet (it just uses the order which they appear in the string), and for each iteration it has to go over the length of the string (n) and shrink it by q (n -= q). So over all the number of operation you do in the worst case are:
s = n + n-(1*q) + ... + n-((m-1)*q)
You can see that s is the sum of the first m elements of the arithmetic series:
s = (n + n-((m-1)*q) * m / 2 =
(n + n-((m-1)*(n/m)) * m / 2 ~ n * m / 2
Related
This algorithm essentially finds the star (*) inside a given binary string, and replaces it with 0 and also 1 to output all the different combinations of the binary string.
I originally thought this algorithm is O(2^n), however, it seems to me that that only takes into account the number of stars (*) inside the string. What about the length of the string? Since if there are no stars in the given string, it should technically still be linear, because the amount of recursive calls depends on string length, but my original O(2^n) does not seem to take that into account as it would become O(1) if n = 0.
How should I go about finding out its time and space complexity? Thanks.
Code:
static void RevealStr(StringBuilder str, int i) {
//base case: prints each possibility when reached
if(str.length() == i) {
System.out.println(str);
return;
}
//recursive step if wild card (*) found
if(str.charAt(i) == '*') {
//exploring permutations for 0 and 1 in the stack frame
for(char ch = '0'; ch <= '1'; ch++) {
//making the change to our string
str.setCharAt(i, ch);
//recur to reach permutations
RevealStr(str, i+1);
//undo changes to backtrack
str.setCharAt(i, '*');
}
return;
}
else
//if no wild card found, recur next char
RevealStr(str, i+1);
}
Edit: I am currently thinking of something like, O(2^s + l) where s is the number of stars and l the length of the string.
The idea of Big-O notation is to give an estimate of an upperbound, i.e. if the order of an algorithm is O(N^4) runtime it simply means that algorithm can't do any worse than that.
Lets say, there maybe an algorithm of order O(N) runtime but we can still say it is O(N^2). Since O(N) never does any worse than O(N^2). But then in computational sense we want the estimate to be as close and tight as it will give us a better idea of how well an algorithm actually performs.
In your current example, both O(2^N) and O(2^L), N is length of string and L is number of *, are valid upperbounds. But since O(2^L) gives a better idea about algorithm and its dependence on the presence of * characters, O(2^L) is better and tighter estimate (as L<=N) of the algorithm.
Update: The space complexity is implementation dependant. In your current implementation, assuming StringBuilder is passed by reference and there are no copies of strings made in each recursive call, the space complexity is indeed O(N), i.e. the size of recursive call stack. If it is passed by value and it is copied to stack every time before making call, the overall complexity would then be O(N * N), i.e. (O(max_number_of_recursive_calls * size_of_string)), since copy operation cost is O(size_of_string).
To resolve this we can do a manual run:
Base: n=1
RevealStr("*", 1)
It meets the criteria for the first if, we only ran this once for output *
Next: n=2
RevealStr("**", 1)
RevealStr("0*", 2)
RevealStr("00", 2)
RevealStr("01", 2)
RevealStr("1*", 2)
RevealStr("10", 2)
RevealStr("11", 2)
Next: n=3
RevealStr("***", 1)
RevealStr("0**", 2)
RevealStr("00*", 2)
RevealStr("000", 3)
RevealStr("001", 3)
RevealStr("01*", 2)
RevealStr("010", 3)
RevealStr("011", 3)
RevealStr("1**", 2)
RevealStr("10*", 2)
RevealStr("100", 3)
RevealStr("101", 3)
RevealStr("11*", 2)
RevealStr("110", 3)
RevealStr("111", 3)
You can see that with n=2, RevealStr was called 7 times, while with n=3 it was called 15. This follows the function F(n)=2^(n+1)-1
For the worst case scenario, the complexity seems to be O(2^n) being n the number of stars
I was going through this text from Cracking the Coding Interview and something doesn't look clear to me:
Arrays and Strings
String joinWords(String[] words) {
String sentence = "";
for (String w : words) {
sentence = sentence + w;
}
return sentence;
}
On each concatenation, a new copy of the string is created, and the two strings are copied over, character by character. The first iteration requires us to copy 𝑥 characters. The second iteration requires copying 2𝑥 characters. The third iteration requires 3𝑥, and so on. The total time therefore is 𝒪(𝑥 + 2𝑥 + ... + 𝑛𝑥). This reduces to 𝒪(𝑥𝑛²).
Why is it 𝒪(𝑥𝑛²)? Because 1 + 2 + ... + n equals 𝑛(𝑛+1)/2, or 𝒪(𝑛²).
How does 𝒪(𝑥 + 2𝑥 + 𝑛𝑥) reduce to 𝒪(𝑥𝑛²)?
My analogy, assuming 𝑥 is a constant 1, then 𝑛 is 2(𝑥 + 2𝑥) == 3
From the book (𝑥2²) == 4 assuming 𝑥 is a constant 1
Is the algorithm analysis in the above code correct?
In the above calculation O(x + 2x + ... + nx)
x + 2x + ... + nx is expanded as x(n(n+1)/2)
which is x((n^2+n/2)) since we neglect constants and in time complexity calculation and take the value with the largest power value it is taken as 𝑂(𝑛2).
It is similar to taking 𝑂(3𝑛) as 𝑂(𝑛).
To rationalize how asymptotic notations ignore constant factors, I usually think of it like this: asymptotic complexity isn't for comparing the performance of different algorithms, it's for understanding how the performance of individual algorithms scales with respect to the input size.
For instance, we say that a function that takes 3𝑛 steps is 𝑂(𝑛), because, roughly speaking, for large enough inputs, doubling the input size will no more than double the number of steps taken. Similarly, 𝑂(𝑛2) means that doubling the input size will at most quadruple the number of steps, and 𝑂(log𝑛) means that doubling the input size will increase the number of steps by at most some constant.
It's a tool for saying which algorithms scale better, not which ones are absolutely faster.
For more info : https://www.quora.com/Why-do-we-leave-the-constants-while-calculating-time-complexity-for-algorithms
Yes, the book's analysis is roughly correct, although it silently assumes all words have the same length x.
You are right that 2(𝑥+2𝑥) == 3, while the book's formula gives 𝑥2² == 4, but in big-𝒪 notation we don't look at the exact value, but the order of magnitude.
There are ½𝑥𝑛(𝑛+1) characters copied. This is because the expression 1+2+...+𝑛 is a triangle number, equal to ½𝑛(𝑛+1). Only remains to multiply with 𝑥.
For 𝑥 == 1 and 𝑛 == 2 it gives your result, i.e. 3
We can write this as ½𝑥𝑛² + ½𝑥𝑛. Now, when we go to big-𝒪 notation, only the most significant term needs to be retained, and any constant coefficient can be dropped, and so:
𝒪[½𝑥𝑛² + ½𝑥𝑛] == 𝒪[½𝑥𝑛²] == 𝒪(𝑛²)
Obviously this means that the expression will give a different value for a given 𝑥 and 𝑛, but in big-𝒪 notation that is not what is the point. It gives an order of magnitude.
This is about analyzing the complexity of a solution to a popular interview problem.
Problem
There is a function concat(str1, str2) that concatenates two strings. The cost of the function is measured by the lengths of the two input strings len(str1) + len(str2). Implement concat_all(strs) that concatenates a list of strings using only the concat(str1, str2) function. The goal is to minimize the total concat cost.
Warnings
Usually in practice, you would be very cautious about concatenating pairs of strings in a loop. Some good explanations can be found here and here. In reality, I have witnessed a severity-1 accident caused by such code. Warnings aside, let's say this is an interview problem. What's really interesting to me is the complexity analysis around the various solutions.
You can pause here if you would like to think about the problem. I am gonna reveal some solutions below.
Solutions
Naive solution. Loop through the list and concatenate
def concat_all(strs):
result = ''
for str in strs:
result = concat(result, str)
return result
Min-heap solution. The idea is to concatenate shorter strings first. Maintain a min-heap of the strings based on the length of the strings. Each concatenation concatenates 2 strings off the min-heap and the result is pushed back the min-heap. Until only one string is left on the heap.
def concat_all(strs):
heap = MinHeap(strs, key_func=len)
while len(heap) > 1:
str1 = heap.pop()
str2 = heap.pop()
heap.push(concat(str1, str2))
return heap.pop()
Binary concat. May not be intuitively clear. But another good solution is to recursively split the list by half and concatenate.
def concat_all(strs):
if len(strs) == 1:
return strs[0]
if len(strs) == 2:
return concat(strs[0], strs[1])
mid = len(strs) // 2
str1 = concat_all(strs[:mid])
str2 = concat_all(strs[mid:])
return concat(str1, str2)
Complexity
What I am really struggling and asking here is the complexity of the 2nd approach above that uses a min-heap.
Let's say the number of strings in the list is n and the total number of characters in all the strings is m. The upper bound for the naive solution is O(mn). The binary-concat has an exact bound of theta(mlog(n)). It is the min-heap approach that is elusive to me.
I am kind of guessing it has an upper bound of O(mlog(n) + nlog(n)). The second term, nlog(n) is associated with maintaining the heap; there are n concats and each concat updates the heap in log(n). If we only focus on the cost of concatenations and ignore the cost of maintaining the min-heap, the overall complexity of the min-heap approach can be reduced to O(mlog(n)). Then min-heap is a more optimal approach than binary-concat cause for the former mlog(n) is the upper bound while for the latter it is the exact bound.
But I can't seem to prove it, or even find a good intuition to support that guessed upper bound. Can the upper bound be even lower than O(mlog(n))?
Let us call the length of strings 1 to n and m be the sum of all these values.
For the naive solution, clearly the worst appears if
m1
is almost equal to m, and you obtain a O(nm) complexity, as you pointed.
For the min-heap, the worst-case is a bit different, it consists in having the same length for any string. In that case, it's going to work exactly as your case 3. of binary concat, but you'll also have to maintain the min-heap structure. So yes, it will be a bit more costly than case 3 in real-life. Nevertheless, from a complexity point of view, both will be in O(m log n) since we have m > n and O(m log n + n log n)can be reduced to O(m log n).
To prove the min-heap complexity more rigorously, we can prove that when we take the two smallest strings in a set of k strings, and denote by S the sum of the two smallest strings, then we have: (m-S)/(k-1) >= S/2 (it simply means that the mean of the two smallest strings is less than the mean of the k-2 other strings). Reformulating leads to S <= 2m/(k+1). Let us apply it to the min-heap algorithm:
at first step, we can show that the 2 strings we take are of total length at most 2m/(n+1)
at first step, we can show that the 2 strings we take are of total length at most 2m/(n)
...
at last step, we can show that the 2 strings we take are of total length at most 2m/(1)
Hence the computation time of min-heap is 2m*[1/(n+1) + 1/n + ... + 1/2 + 1] which is in O(m log n)
When we have a recursive function to generate parentheses with N valid parentheses, the time complexity is that of the Catalan number. This doesn't make sense to me.
My analysis of the time complexity is that, we have two operations at every node of the recursion tree. We can either add a close bracket or an opening bracket. So we make two recursive calls.
T(n) = 2 * T(N - 1) = O(2^N)
I get O(2^N) as my time complexity -- not the Catalan number. The Catalan number is so arbitrary to me -- it doesn't make sense. Could anyone explain it a bit further?
In your assumption, you explore all cases that can be formed by the characters '(' and ')'. However, it is possible to eliminate some of those cases, isn't it? For instance, we know that for an input N = 4, "))((" is not a valid/balanced string. In fact, we know this to be true from the moment we put the first character of that string. Here's a recursive implementation in Python, just so that we can observe it through an example.
def generate(index, N, s, depth):
if index == N:
print s
if depth > 0:
generate(index + 1, N, s + ')', depth - 1)
if depth < N:
generate(index + 1, N, s + '(', depth + 1)
Essentially, in a recursive implementation, you keep a score of the current depth. Whenever that score is less than 0, you know that your string becomes unbalanced, thus there is no point in exploring further. So, contrary to what you assumed, you do not explore both the subproblems.
If you think about it, the problem is simply finding the number of valid permutations of N = 2 * K different characters. At the first(leftmost) position, you can place K characters. (i.e. all the '(') In the second position, you can either place one of the ')' characters, or you can place one of the remaining K-1 '(' characters. With this approach, using permutation with repetition, you can find that the complexity of the problem you mentioned is, indeed, equivalent to the Kth Catalan number.
Basically, for a string of length 2N, you have two different characters of which you have N, each. Using permutation with repetition, all the possible permutations for this string would be (2N)! / (N! N!). Well, the formula for the Nth Catalan number is just that value, divided by an additional (N+1), as you can see in the relevant Wikipedia article. If you consider the cases where you do not handle the unbalanced strings I mentioned above, you can see that (N+1) factor is due to the cases where you don't compute both the subproblems.
I read that the total number of substrings that can be formed from a given string is n^2 but I don't understand how to count this.
By substrings, I mean, given a string CAT, the substrings would be:
C
CA
CAT
A
AT
T
The total number of (nonempty) substrings is n + C(n,2). The leading n counts the number of substrings of length 1 and C(n,2) counts the number of substrings of length > 1 and is equal to the number of ways to choose 2 indices from the set of n. The standard formula for binomial coefficients yields C(n,2) = n*(n-1)/2. Combining these two terms and simplifying gives that the total number is (n^2 + n)/2. #rici in the comments notes that this is the same as C(n+1,2) which makes sense if you e.g. think in terms of Python string slicing where substrings of s can always be written in the form s[i:j] where 0 <= i < j <= n (with j being 1 more than the final index). For n = 3 this works out to (9 + 3)/2 = 6.
In the sense of complexity theory the number of substrings is O(n^2), which might be what you read somewhere.
You have a starting point and and end point - if each could point to anywhere along the word, each would have n possible values, and therefor an overall of n^2, so that's an upper limit.
However, we need a constraint saying that the substring cannot end before it started, so end - start >=0. This cuts the possible count in about half, but on asymptotic terms it's still O(n^2)
Substring calculation is logically
selecting 2 blank spaces atleast one letter apart.
a| b c | d = substring bc
| a b c |d = substring abc.
Now how many ways can you chose these 2 blankspace. For n letter word there are n+1.
Then first select one = n+1 ways
Select another (not the same)= n
So total n(n+1). But you have calculated everything twice. So n*(n+1)/2.
Programmatically, without applying any special algorithms(like Z algo etc) you can use a map to calculate no of distinct substrings.(O(n^3)).
You can use suffix tree to get O(n^2) substring calculaton.
To get a substring of a given string s, you just need to select two different points in the string. Let s contain n characters,
|s[0]|s[1]|...|s[n-1]|
You want to choose two vertical bars to get a substring. How many vertical bars do you have? Exactly n+1. So the number of sustrings is C(n+1,2) = n(n+1)/2, which is to choose 2 items from n+1. Of course, it could be denoted as O(n^2).