Calculating big-O runtime with 2D values, where one dimension has unknown length

I was working on the water-collection-between-towers problem, and trying to calculate the big-O of my solution for practice.
At one point, I build a 2D array of 'towers' from the user's input array of heights. This step uses a nested for loop, where the inner loop runs height many times.
Is my big-O for this step then n * maxHeight?
I've never seen any sort of big-O that used a variable like this, but then again I'm pretty new, so that could be an issue of experience.
I don't feel like the height can be written off as a constant, because there's no reason the height of the towers wouldn't exceed the number of towers on a regular basis.
// convert towerArray (the user-input array of tower heights)
// into a 2D array representing the towers
var multiTowerArray = [];
for (var i = 0; i < towerArray.length; i++) {
  multiTowerArray.push([]);
  // the inner loop runs towerArray[i] (the tower's height) times
  for (var j = 0; j < towerArray[i]; j++) {
    multiTowerArray[i].push(1);
  }
}

For starters, it's totally reasonable - and not that uncommon - to give the big-O runtime of a piece of code both in terms of the number of elements in the input as well as the size of the elements in the input. For example, counting sort runs in time O(n + U), where n is the number of elements in the input array and U is the maximum value in the array. So yes, you absolutely could say that the runtime of your code is O(nU), where n is the number of elements and U is the maximum value anywhere in the array.
Another option would be to say that the runtime of your code is O(n + S), where S is the sum of all the elements in the array, since the aggregate number of times that the inner loop runs is equal to the sum of the array elements.
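To make that concrete, here is a small Python sketch of my own (the tower heights are just an example) that counts how many times the inner loop body runs; the count is exactly the sum of the heights, which is where the O(n + S) view comes from:
def count_inner_iterations(tower_array):
    # tower_array plays the role of towerArray in the JavaScript above
    iterations = 0
    for height in tower_array:    # outer loop: n iterations
        for _ in range(height):   # inner loop: height iterations
            iterations += 1       # stands in for multiTowerArray[i].push(1)
    return iterations

tower_array = [3, 0, 2, 5]                   # n = 4 towers, S = 10 total height
print(count_inner_iterations(tower_array))   # -> 10 == sum(tower_array)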
Generally speaking, you can express the runtime of an algorithm in terms of whatever quantities you'd like. Many graph algorithms have a runtime that depends on both the number of nodes (often denoted n) and the number of edges (often denoted m), such as Dijkstra's algorithm, which can be made to run in time O(m + n log n) using a Fibonacci heap. Some algorithms have a runtime that depends on the size of the output (for example, the Aho-Corasick string matching algorithm runs in time O(m + n + z), where m and n are the lengths of the input strings and z is the number of matches). Some algorithms depend on a number of other parameters - as an example, the count-min sketch performs updates in time O((1/ε) log(1/δ)), where ε and δ are parameters specified when the algorithm starts.

Related

Binary vs Linear searches for unsorted N elements

I am trying to understand a formula for deciding when we should use quicksort. For instance, we have an array with N = 1,000,000 elements. If we will search only once, we should use a simple linear search, but if we will do it 10 times, we should sort the array in O(n log n) first. How can I detect the threshold - for which number of searches and which size of input array should I sort first and then use binary search?
You want to solve an inequality that roughly might be described as
t * n > C * n * log(n) + t * log(n)
where t is the number of searches and C is some constant for the sort implementation (it should be determined experimentally). Once you have estimated this constant, you can solve the inequality numerically (with some uncertainty, of course).
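For example, a minimal Python sketch of that numeric approach; the value of C used below is only a placeholder and would have to be measured for your actual sort implementation:
import math

def threshold_searches(n, C):
    # smallest number of searches t for which t*n > C*n*log(n) + t*log(n)
    log_n = math.log2(n)
    t = 1
    while t * n <= C * n * log_n + t * log_n:
        t += 1
    return t

print(threshold_searches(1_000_000, C=5))  # roughly 100 searches for this assumed C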
Like you already pointed out, it depends on the number of searches you want to do. A good threshold can come out of the following statement:
n*log[b](n) + x*log[2](n) <= x*n/2
where x is the number of searches, n the input size, and b the base of the logarithm for the sort, depending on the partitioning you use.
When this statement evaluates to true, you should switch methods from linear search to sort and search.
Generally speaking, a linear search through an unordered array will take n/2 steps on average, though this average will only play a big role once x approaches n. If you want to stick with big Omicron or big Theta notation then you can omit the /2 in the above.
Assuming n elements and m searches, with crude approximations
the cost of the sort will be C0.n.log n,
the cost of the m binary searches C1.m.log n,
the cost of the m linear searches C2.m.n,
with C2 ~ C1 < C0.
Now you compare
C0.n.log n + C1.m.log n vs. C2.m.n
or
C0.n.log n / (C2.n - C1.log n) vs. m
For reasonably large n, the breakeven point is about C0.log n / C2.
For instance, taking C0 / C2 = 5, n = 1000000 gives m = 100.
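As a quick check of that worked example (assuming base-2 logarithms and, as an extra assumption, C1 = C2 = 1 so that C0 = 5):
import math

C0, C1, C2 = 5.0, 1.0, 1.0
n = 1_000_000
log_n = math.log2(n)

breakeven_exact = C0 * n * log_n / (C2 * n - C1 * log_n)  # from the comparison above
breakeven_approx = C0 * log_n / C2                        # large-n approximation

print(breakeven_exact, breakeven_approx)  # both come out near m = 100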
You should plot the complexities of both operations.
Linear search: O(n)
Sort and binary search: O(n log n + log n)
In the plot, you will see for which values of n it makes sense to choose the one approach over the other.
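For example, a quick matplotlib sketch of those two cost curves; the number of searches x = 10 is my own assumption, with total costs x*n for repeated linear search and n log n + x log n for sort-plus-binary-search:
import numpy as np
import matplotlib.pyplot as plt

x = 10                                  # assumed number of searches
n = np.linspace(100, 1_000_000, 500)

linear_total = x * n                                  # x linear searches
sort_then_binary = n * np.log2(n) + x * np.log2(n)    # one sort, then x binary searches

plt.plot(n, linear_total, label="x linear searches: x*n")
plt.plot(n, sort_then_binary, label="sort + x binary searches: n*log n + x*log n")
plt.xlabel("n (array size)")
plt.ylabel("estimated operations")
plt.legend()
plt.show()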
This actually turned into an interesting question for me as I looked into the expected runtime of a quicksort-like algorithm when the expected split at each level is not 50/50.
The first question I wanted to answer was: for random data, what is the average split at each level? It surely must be greater than 50% (for the larger subdivision). Well, given an array of size N of random values, the smallest value gives a subdivision of (1, N-1), the second smallest value gives a subdivision of (2, N-2), and so on. I put this in a quick script:
split = 0.0
for x in range(10000):
    split += float(max(x, 10000 - x)) / 10000
split /= 10000
print(split)
And got exactly 0.75 as an answer. I'm sure I could show that this is always the exact answer, but I wanted to move on to the harder part.
Now, let's assume that even 25/75 split follows an nlogn progression for some unknown logarithm base. That means that num_comparisons(n) = n * log_b(n) and the question is to find b via statistical means (since I don't expect that model to be exact at every step). We can do this with a clever application of least-squares fitting after we use a logarithm identity to get:
C(n) = n * log(n) / log(b)
where now the logarithm can have any base, as long as log(n) and log(b) use the same base. This is a linear equation just waiting for some data! So I wrote another script that filled an array of xs with n*log(n) and an array of ys with the measured C(n), and used numpy to tell me the slope of that least-squares fit, which I expect to equal 1 / log(b). I ran the script and got b inside of [2.16, 2.3] depending on how high I set n (I varied n from 100 to 100,000,000). The fact that b seems to vary depending on n shows that my model isn't exact, but I think that's okay for this example.
To actually answer your question now, with these assumptions, we can solve for the cutoff point of when: N * n/2 = n*log_2.3(n) + N * log_2.3(n). I'm just assuming that the binary search will have the same logarithm base as the sorting method for a 25/75 split. Isolating N you get:
N = n*log_2.3(n) / (n/2 - log_2.3(n))
If your number of searches N exceeds the quantity on the RHS (where n is the size of the array in question) then it will be more efficient to sort once and use binary searches on that.
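A tiny sketch of that final formula (b = 2.3 is the fitted base from above; the function name is mine):
import math

def searches_cutoff(n, b=2.3):
    # N = n*log_b(n) / (n/2 - log_b(n)), from the formula above
    log_b_n = math.log(n, b)
    return n * log_b_n / (n / 2 - log_b_n)

print(searches_cutoff(1_000_000))  # about 33 searches for n = 1,000,000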

Complexity analysis of a solution to minimizing concat cost

This is about analyzing the complexity of a solution to a popular interview problem.
Problem
There is a function concat(str1, str2) that concatenates two strings. The cost of the function is measured by the lengths of the two input strings len(str1) + len(str2). Implement concat_all(strs) that concatenates a list of strings using only the concat(str1, str2) function. The goal is to minimize the total concat cost.
Warnings
Usually in practice, you would be very cautious about concatenating pairs of strings in a loop. Some good explanations can be found here and here. In reality, I have witnessed a severity-1 accident caused by such code. Warnings aside, let's say this is an interview problem. What's really interesting to me is the complexity analysis around the various solutions.
You can pause here if you would like to think about the problem. I am gonna reveal some solutions below.
Solutions
Naive solution. Loop through the list and concatenate
def concat_all(strs):
    result = ''
    for s in strs:
        result = concat(result, s)
    return result
Min-heap solution. The idea is to concatenate the shorter strings first. Maintain a min-heap of the strings keyed by string length. Each step pops the 2 shortest strings off the min-heap, concatenates them, and pushes the result back onto the min-heap, until only one string is left on the heap.
def concat_all(strs):
    heap = MinHeap(strs, key_func=len)  # min-heap keyed by string length
    while len(heap) > 1:
        str1 = heap.pop()
        str2 = heap.pop()
        heap.push(concat(str1, str2))
    return heap.pop()
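As an aside, the MinHeap class above is pseudocode; a runnable sketch of the same idea using Python's heapq might look like the following (it assumes the concat function from the problem statement is available, and uses an index in the tuple purely as a tie-breaker between equal lengths):
import heapq

def concat_all(strs):
    if not strs:
        return ''
    heap = [(len(s), i, s) for i, s in enumerate(strs)]  # keyed by string length
    heapq.heapify(heap)
    counter = len(strs)  # tie-breaking counter for newly created strings
    while len(heap) > 1:
        _, _, str1 = heapq.heappop(heap)
        _, _, str2 = heapq.heappop(heap)
        merged = concat(str1, str2)
        heapq.heappush(heap, (len(merged), counter, merged))
        counter += 1
    return heap[0][2]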
Binary concat. This may not be intuitively clear, but another good solution is to recursively split the list in half and concatenate.
def concat_all(strs):
    if len(strs) == 1:
        return strs[0]
    if len(strs) == 2:
        return concat(strs[0], strs[1])
    mid = len(strs) // 2
    str1 = concat_all(strs[:mid])
    str2 = concat_all(strs[mid:])
    return concat(str1, str2)
Complexity
What I am really struggling with, and asking about here, is the complexity of the 2nd approach above that uses a min-heap.
Let's say the number of strings in the list is n and the total number of characters in all the strings is m. The upper bound for the naive solution is O(mn). The binary-concat solution has an exact bound of Theta(m log(n)). It is the min-heap approach that is elusive to me.
I am guessing it has an upper bound of O(m log(n) + n log(n)). The second term, n log(n), is associated with maintaining the heap: there are n - 1 concats, and each one updates the heap in O(log(n)). If we only focus on the cost of the concatenations and ignore the cost of maintaining the min-heap, the overall complexity of the min-heap approach can be reduced to O(m log(n)). Then the min-heap would be a better approach than binary-concat, because for the former m log(n) is an upper bound while for the latter it is the exact bound.
But I can't seem to prove it, or even find a good intuition to support that guessed upper bound. Can the upper bound be even lower than O(m log(n))?
Let us call m_1, ..., m_n the lengths of strings 1 to n, and let m be the sum of all these values.
For the naive solution, the worst case clearly appears when m_1 is almost equal to m, and you obtain the O(nm) complexity, as you pointed out.
For the min-heap, the worst case is a bit different: it consists of all the strings having the same length. In that case, it is going to work exactly like your case 3, binary concat, but you will also have to maintain the min-heap structure, so yes, it will be a bit more costly than case 3 in real life. Nevertheless, from a complexity point of view, both will be in O(m log n), since we have m > n and O(m log n + n log n) can be reduced to O(m log n).
To prove the min-heap complexity more rigorously, we can observe that when we take the two smallest strings in a set of k strings, and denote by S the sum of their lengths, each of the k-2 remaining strings has length at least S/2, so (m-S)/(k-2) >= S/2 (it simply means that the mean of the two smallest strings is at most the mean of the k-2 other strings). Reformulating leads to S <= 2m/k. Let us apply it to the min-heap algorithm:
at the first step, the 2 strings we take have total length at most 2m/n
at the second step, the 2 strings we take have total length at most 2m/(n-1)
...
at the last step, the 2 strings we take have total length at most 2m/2
Hence the total computation time of the min-heap approach is at most 2m*[1/n + 1/(n-1) + ... + 1/2], which is in O(m log n).
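If it helps to sanity-check these bounds, here is a small Python sketch (my own, not from the original posts) that compares the total concat cost of the three strategies by tracking string lengths only; the random lengths are just an example.
import heapq
import random

def naive_cost(lengths):
    # cost of the naive left-to-right solution: concat('', s1), then concat(result, s2), ...
    total, acc = 0, 0
    for length in lengths:
        total += acc + length
        acc += length
    return total

def heap_cost(lengths):
    # cost of always concatenating the two shortest strings first
    heap = list(lengths)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

def binary_cost(lengths):
    # cost of the recursive binary-concat solution
    if len(lengths) <= 1:
        return 0
    mid = len(lengths) // 2
    return binary_cost(lengths[:mid]) + binary_cost(lengths[mid:]) + sum(lengths)

lengths = [random.randint(1, 50) for _ in range(1000)]
print(naive_cost(lengths), heap_cost(lengths), binary_cost(lengths))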

How to calculate runtime complexity of the nested loops with variable length

Suppose I have a task to write an algorithm that runs through an array of strings and checks whether each value in the array contains a certain character ('c' in the pseudocode below). The algorithm will have two nested loops; here is the pseudo code:
for (let i = 0; i < a.length; i++)
    for (let j = 0; j < a[i].length; j++)
        if (a[i][j] === 'c')
            do something
Now, the task is to identify the runtime complexity of the algorithm. Here is my reasoning:
let the number of elements in the array be n, and the maximum length of the string values be m. So the general formula for the complexity is
n x m
Now the possible cases.
If the maximum length of string values is equal to the number of elements, I get the complexity:
n^2
If the maximum length of elements is less than the number of elements by some number a, the complexity is
n x (n - a) = n^2 - na
If the maximum length of elements is more than the number of elements by some number a, the complexity is
n x (n + a) = n^2 + na
Since we discard lower growth functions, it seems that the complexity of the algorithm is n^2. Is my reasoning correct?
Your time complexity is just the total number of characters. Which of your analyses is applicable depends entirely on which of your assumptions about the relationship between the length of the words and the number of words holds true. Note in particular that your statement that the time complexity is N x M, where M is the longest string in the array, is not correct (it is correct in the sense that it places an upper bound, but that upper bound is not tight, so it is not very interesting; it is correct in the same sense that N^2 x M^2 is correct).
I think that, certainly in many real cases of interest, your analysis is incorrect. The total number of characters is equal to the number of strings times the average number of characters per string, i.e. word length (note: average, not maximum!). As the number of strings becomes large, the average sample word length will approach the mean of whatever distribution you are sampling from. So at least for any well-behaved distribution where the sampling is i.i.d., the time complexity is simply N.
A good practical example is a database that stores names. It depends of course which people happen to be in the database, but if you are storing names of say American citizens, then as N becomes large, the number of inner operations will approach N times the average number of characters in a name, in the US. The latter quantity just doesn't depend on N at all, so it's linear in N.
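To make that concrete, here is a tiny Python sketch (the example strings are made up) showing that the nested loops perform exactly "total number of characters" inner steps, no matter how those characters are split across strings:
def count_inner_steps(strings, target='c'):
    steps = 0
    for s in strings:      # outer loop: one pass per string
        for ch in s:       # inner loop: one step per character
            steps += 1
            if ch == target:
                pass       # "do something"
    return steps

few_long = ['ab' * 500] * 10     # 10 strings of 1,000 characters each
many_short = ['ab' * 5] * 1000   # 1,000 strings of 10 characters each
print(count_inner_steps(few_long), count_inner_steps(many_short))  # 10000 10000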

Big-O complexity of calculation: Drawing a non-colliding subset of k elements from n total with dumb algorithm

I'm trying to understand the computational complexity of this pseudocode:
values is a set of n unique elements
subset is an empty set
for 0 ... k
    X: randomly select a value from values
    if value is in subset
        goto X
    else
        insert value into subset
This is of course a (poor) algorithm for selecting a unique random subset of k elements from n, and I'm aware of the better choices, but I wanted to understand the computational complexity of this.
I can easily see that this is O(k) when duplicates are allowed, because the conditional test is eliminated from the pseudocode and exactly k choices are made.
When you have to account for duplicates, there is a probability that a re-test will be required, and that probability increases with each iteration. Depending on the values of n and k this is non-negligible, but I'm not certain how it affects the big-O complexity in a generalized way. Could someone explain this to me?
The probability of inserting a value into subset when subset already holds i elements is (n-i)/n.
The expected number of attempts for that insertion is the inverse of that probability, n/(n-i).
Therefore the expected cost of adding the i-th new element is O(n/(n-i) + 1), where the 1 accounts for 'insert value into subset'.
Summing over all k elements gives an expected total of the sum of (n/(n-i) + 1) for i from 0 through k-1, which is O(k + n*(H_n - H_{n-k})) = O(k + n*log(n/(n-k))), where H_j is the j-th harmonic number.
Disclaimer - this is assuming big-O (in expectation) is applicable to functions that use random number generating algorithms (since X is random).
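If it helps, here is a small simulation (my own sketch; the values of n, k, and trials are arbitrary) that compares the measured number of random draws against that summed expectation:
import random

def draws_needed(n, k):
    # simulate the pseudocode: keep drawing until the subset holds k distinct values
    values = list(range(n))
    subset = set()
    draws = 0
    while len(subset) < k:
        draws += 1
        subset.add(random.choice(values))  # a duplicate draw is a retry of step X
    return draws

n, k, trials = 1000, 900, 200
average = sum(draws_needed(n, k) for _ in range(trials)) / trials
expected = sum(n / (n - i) for i in range(k))  # expected attempts summed over the k insertions
print(average, expected)  # the two should be close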

How do you find multiple k_i-th smallest elements in an array?

I am struggling with my homework and need a little push - the question is to design an algorithm that will, in O(n log m) time, find the k_1-th, k_2-th, ..., k_m-th smallest elements, where 1 < k_1 < k_2 < ... < k_m (so you are given m values of k). I know that a simple selection algorithm takes O(n) time to find the k-th element, but how do you bring the m into the recurrence? I thought of handling both k_1 and k_m in each run, but that only takes out 2 of them, not m/2.
Would appreciate some directions.
Thanks
If I understand the question correctly, you have a vector K containing m indices, and you want to find the k'th ranked element of A for each k in K. If K contains the smallest m indices (i.e. k=1,2,...,m) then this can be done easily in linear time T=O(n) by using quickselect to find the element k_m (since all the smaller elements will be on the left at the end of quickselect). So I'm assuming that K can contain any set of m indices.
One way to accomplish this is by running quickselect on all of K at the same time. Here is the algorithm
QuickselectMulti(A, K)
    If K is empty, then return an empty result set
    Pick a pivot p from A at random
    Partition A into sets A0 < p and A1 > p
    i = A0.size + 1
    If K contains i, then remove i from K and add (i => p) to the result set
    Partition K into sets K0 < i and K1 > i
    Add QuickselectMulti(A0, K0) to the result set
    Subtract i from each k in K1
    Call QuickselectMulti(A1, K1), add i to each index of the output, and add this to the result set
    Return the result set
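Here is a rough Python translation of that pseudocode (my own sketch; it assumes the elements of A are distinct and that every index in K is between 1 and len(A)), returning a dict that maps each requested rank to its value:
import random

def quickselect_multi(A, K):
    result = {}
    if not K:
        return result
    p = random.choice(A)                 # pick a pivot p from A at random
    A0 = [x for x in A if x < p]         # partition A around p
    A1 = [x for x in A if x > p]
    i = len(A0) + 1                      # rank of the pivot within A
    if i in K:
        result[i] = p
    K0 = [k for k in K if k < i]         # partition K around i
    K1 = [k - i for k in K if k > i]     # shift ranks for the right half
    result.update(quickselect_multi(A0, K0))
    right = quickselect_multi(A1, K1)
    result.update({k + i: v for k, v in right.items()})  # shift ranks back
    return result

print(quickselect_multi([5, 3, 8, 1, 9, 2], [2, 5]))  # ranks 2 and 5 -> values 2 and 8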
If K contains just one element, this is the same as randomized quickselect. To see why the running time is O(n log m) on average, first consider what happens when each pivot exactly splits both A and K in half. In this case, you get two recursive calls, so you have
T = n + 2T(n/2,m/2)
= n + n + 4T(n/4,m/4)
= n + n + n + 8T(n/8,m/8)
Since m drops in half each time, then n will show up log m times in this summation. To actually derive the expected running time requires a little more work, because you can't assume that the pivot will split both arrays in half, but if you work through the calculations, you will see that the running time is in fact O(n log m) on average.
On edit: The analysis of this algorithm can be made simpler by choosing the pivot via p = Quickselect(A, k_i), where k_i is the middle element of K, rather than choosing p at random. This guarantees that K gets split in half each time, so the number of recursive calls will be exactly log m, and since quickselect runs in linear time, the result will still be O(n log m).

Resources