Extrapolating the runtime of mergesort?

I'm trying to solve the following problem, which asks me to extrapolate the runtime of mergesort to larger inputs. Here's the problem:
A user runs the code:
int *a = (int *) malloc(N * sizeof(int));
for (i = 0; i < N; i++)
{
    a[i] = rand();
}
mergesort(a, N);
This code generates and then sorts random numbers. For N = 10,000,000 it needs 5.3 sec.
Assuming that we have enough memory, which of the following is closest to the runtime for N = 1,000,000,000?
53, 340, 530, 680, 1060, 5300 (seconds)
I thought that, as it is a divide-and-conquer method, there are about log n levels of splitting, which is roughly 30 for N = 1,000,000,000. I know that mergesort's runtime satisfies the recurrence T(n) = 2T(n / 2) + n, but I don't see how to use that to extrapolate the runtime.
How should I go about solving this problem?

The runtime of mergesort is Θ(n log n). For sufficiently large n (like the numbers you have here), it's not unreasonable to model the runtime as some function of the form cn log n.
One way to approach this problem would be to think about the ratio of the runtime for n = 10^9 to the runtime for n = 10^7. That gives you
c * 10^9 * log(10^9) / (c * 10^7 * log(10^7))
= 10^2 * log(10^9) / log(10^7)
= 10^2 * (9 / 7)
≈ 128.6
Therefore, you'd expect the runtime for n = 10^9 to be about 128.6 times the runtime for n = 10^7. Since the runtime for n = 10^7 is 5.3 s, you'd expect the runtime for n = 10^9 to be roughly 681.6 s. Therefore, the best answer from the list would be 680 s.
This sort of approach - looking at ratios of runtimes - is a pretty good way to approximate runtimes. We could also have solved this by solving directly for c, given that the runtime has the form cn log n and we know its value for one particular n. The reason I chose the ratio approach is that it's often helpful for "eyeballing" the runtime: since the runtime is Θ(n log n) and you increased the input size by a factor of 100, it's not unreasonable to guess that the runtime goes up by at least a factor of 100 from the n term, plus a somewhat smaller extra factor from the log n term. That alone could lead you to guess that the runtime would be about 680 s.
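If it helps to see the arithmetic spelled out, here is a small Python sketch of both approaches (the 5.3 s measurement at n = 10^7 comes from the question; everything else is just illustration):
import math

n_small, t_small = 10**7, 5.3    # measured: 5.3 s at n = 10^7 (from the question)
n_big = 10**9                    # target input size

# Approach 1: ratio of the two c*n*log(n) terms -- the constant c cancels.
ratio = (n_big * math.log(n_big)) / (n_small * math.log(n_small))
print(ratio)              # ~128.6
print(t_small * ratio)    # ~681 s, so 680 s is the closest answer

# Approach 2: solve for c from the measured point, then evaluate the model.
c = t_small / (n_small * math.log(n_small))
print(c * n_big * math.log(n_big))   # the same ~681 s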
Hope this helps!

Related

Binary vs Linear searches for unsorted N elements

I'm trying to understand a formula for when we should sort first (e.g. with quicksort). For instance, we have an array with N = 1_000_000 elements. If we will search only once, we should use a simple linear search, but if we will search many times, we should first sort the array in O(n log n). How can I find the threshold - for which number of searches and which input size - at which it becomes worth sorting the array and then using binary search?
You want to solve an inequality that roughly might be described as
t * n > C * n * log(n) + t * log(n)
where t is the number of searches and C is some constant for the sort implementation (it should be determined experimentally). Once you have estimated this constant, you can solve the inequality numerically (with some uncertainty, of course).
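As a rough sketch of that last step, rearranging gives t > C * n * log(n) / (n - log(n)); the C = 2.0 below is a made-up placeholder, not a measured constant, and the choice of log base mostly just rescales C:
import math

def breakeven_searches(n, C):
    """Breakeven number of searches t: sorting first wins once t exceeds this,
    under the crude model t*n > C*n*log(n) + t*log(n)."""
    log_n = math.log2(n)
    return C * n * log_n / (n - log_n)

# Illustrative only: C = 2.0 is a placeholder, not a measured constant.
print(breakeven_searches(1_000_000, C=2.0))   # ~40 searches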
Like you already pointed out, it depends on the number of searches you want to do. A good threshold can come out of the following statement:
n*log[b](n) + x*log[2](n) <= x*n/2
where x is the number of searches; n the input size; b the base of the logarithm for the sort, depending on the partitioning you use.
When this statement evaluates to true, you should switch methods from linear search to sort and search.
Generally speaking, a linear search through an unordered array will take n/2 steps on average, though this average will only play a big role once x approaches n. If you want to stick with big Omicron or big Theta notation then you can omit the /2 in the above.
Assuming n elements and m searches, with crude approximations
the cost of the sort will be C0.n.log n,
the cost of the m binary searches C1.m.log n,
the cost of the m linear searches C2.m.n,
with C2 ~ C1 < C0.
Now you compare
C0.n.log n + C1.m.log n vs. C2.m.n
or
C0.n.log n / (C2.n - C1.log n) vs. m
For reasonably large n, the breakeven point is about C0.log n / C2.
For instance, taking C0 / C2 = 5, n = 1000000 gives m = 100.
You should plot the complexities of both operations.
m linear searches: O(m*n)
Sort plus m binary searches: O(n*log n + m*log n)
In the plot, you will see for which values of n (and number of searches m) it makes sense to choose one approach over the other.
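For example, a quick matplotlib sketch of the two curves (with m = 10 searches picked arbitrarily) might look like this:
import numpy as np
import matplotlib.pyplot as plt

n = np.arange(2, 10_000)
m = 10                                        # number of searches (arbitrary example)

linear_cost = m * n                           # m linear searches: m*n
sort_cost = n * np.log2(n) + m * np.log2(n)   # one sort + m binary searches

plt.plot(n, linear_cost, label='m linear searches')
plt.plot(n, sort_cost, label='sort + m binary searches')
plt.xlabel('n')
plt.ylabel('steps (constants ignored)')
plt.legend()
plt.show()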
This actually turned into an interesting question for me as I looked into the expected runtime of a quicksort-like algorithm when the expected split at each level is not 50/50.
The first question I wanted to answer was: for random data, what is the average split at each level? It surely must be greater than 50% (for the larger subdivision). Given an array of size N of random values, the smallest value gives a subdivision of (1, N-1), the second smallest a subdivision of (2, N-2), and so on. I put this in a quick script:
split = 0
for x in range(10000):
    split += float(max(x, 10000 - x)) / 10000
split /= 10000
print(split)
And got exactly 0.75 as an answer. I'm sure I could show that this is always the exact answer, but I wanted to move on to the harder part.
Now, let's assume that even a 25/75 split follows an n*log(n) progression for some unknown logarithm base. That means that num_comparisons(n) = n * log_b(n), and the question is to find b by statistical means (since I don't expect that model to be exact at every step). We can do this with a clever application of least-squares fitting after we use a logarithm identity to get:
C(n) = n * log(n) / log(b)
where now the logarithm can have any base, as long as log(n) and log(b) use the same base. This is a linear equation just waiting for some data! So I wrote another script that filled an array xs with n*log(n) values and an array ys with the measured C(n) values, and used numpy to tell me the slope of the least-squares fit, which I expect to equal 1 / log(b). I ran the script and got b inside [2.16, 2.3], depending on how high I set n (I varied n from 100 to 100'000'000). The fact that b seems to vary with n shows that my model isn't exact, but I think that's okay for this example.
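For illustration, a fit along these lines might look like the following sketch. It uses a generic first-element-pivot quicksort and counts its comparisons, so it won't reproduce the exact [2.16, 2.3] range above; it just shows the least-squares mechanics:
import math, random
import numpy as np

def quicksort_comparisons(a):
    """Count the comparisons made by a plain first-element-pivot quicksort."""
    if len(a) <= 1:
        return 0
    pivot = a[0]
    left = [x for x in a[1:] if x < pivot]
    right = [x for x in a[1:] if x >= pivot]
    return (len(a) - 1) + quicksort_comparisons(left) + quicksort_comparisons(right)

ns = [1000, 2000, 5000, 10_000, 20_000, 50_000]
xs = np.array([n * math.log(n) for n in ns])                     # n*log(n)
ys = np.array([quicksort_comparisons([random.random() for _ in range(n)])
               for n in ns])                                     # measured C(n)

# Fit ys ≈ slope * xs; the model C(n) = n*log(n)/log(b) says slope = 1/log(b).
slope = np.linalg.lstsq(xs[:, None], ys, rcond=None)[0][0]
print("estimated effective base b =", math.exp(1 / slope))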
To actually answer your question now, with these assumptions, we can solve for the cutoff point of when: N * n/2 = n*log_2.3(n) + N * log_2.3(n). I'm just assuming that the binary search will have the same logarithm base as the sorting method for a 25/75 split. Isolating N you get:
N = n*log_2.3(n) / (n/2 - log_2.3(n))
If your number of searches N exceeds the quantity on the RHS (where n is the size of the array in question) then it will be more efficient to sort once and use binary searches on that.
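A tiny helper for evaluating that cutoff, keeping the empirically fitted base of 2.3 from above:
import math

def cutoff_searches(n, b=2.3):
    """N from the formula above: sorting once then binary searching wins
    once you search more than this many times (under these assumptions)."""
    log_b_n = math.log(n, b)
    return n * log_b_n / (n / 2 - log_b_n)

print(cutoff_searches(1_000_000))   # roughly 33 searches for a million elements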

How to simplify O(2^(logN)) to O(N)

In Cracking the Coding Interview there's an example where the runtime for a recursive algorithm that counts the nodes in a binary search tree is O(2^(logN)). The book explains how we simplify to get O(N) like so...
2^P = Q
log Q = P
Let P = 2^(log N).
but I am lost at the step where they say "Let P = 2^(logN)". I don't understand how we know to set those two equal to one another, and I also don't understand the next step (although they tell me it follows from the definition of log base 2):
logP = logN
P = N
2^(logN) = N
Therefore the runtime of the code is O(N)
Assuming logN is log2N
This line:
Let P = 2^(logN).
Just assumes that P equals 2^(logN). You do not know N yet; you are just defining how P and N relate to each other.
Later, you can apply the log function to both sides of the equation. And since log(2^(logN)) is logN, the next step is:
logP = logN
And, obviously, when logP = logN, then:
P = N
And previously you assumed that P = 2^(logN), then:
2^(logN) = N
Moreover, all of this could be simplified to 2^logN = N by definition of the log function.
The short answer is that the original question probably implicitly assumed that the logarithm was supposed to be in base 2, so that 2^(log_2(N)) is just N, by definition of log_2(x) as the inverse function of 2^y.
However, it's interesting to examine this a bit more carefully if the logarithm is to a different base. Standard results allow us to write the logarithm to base b as follows:
log_b(x) = ln(x) / ln(b)
where ln(x) is the natural logarithm (using base e). Similarly, one can rewrite 2^x as follows:
2^x = e^(x * ln(2))
We can then rewrite the original order-expression as follows:
2^(log_b(N)) = e^(ln(2) * ln(N) / ln(b))
which can be reduced to:
N^(ln(2) / ln(b))
So, if the base b of our logarithm is 2, then this is clearly just N. However, if the base is different, then we get N raised to a power. For example, if b=10 we get N raised to the power 0.301, which is definitely a more slowly increasing function than O(N).
We can check this directly with the following Python script:
import numpy
import matplotlib.pyplot as plt
N = numpy.arange(1, 100)
plt.figure()
plt.plot(N, 2**(numpy.log2(N)))
plt.xlabel('N')
plt.ylabel(r'$2^{\log_2 N}$')
plt.figure()
plt.plot(N, 2**(numpy.log10(N)))
plt.xlabel('N')
plt.ylabel(r'$2^{\log_{10} N}$')
plt.show()
The graph this produces when we assume that the logarithm is to base two (a straight line, equal to N itself) is very different from the graph when the logarithm is taken to base ten (the much more slowly growing N^0.301).
The definition of logarithm is “to what power does the base need to be raised to get this value” so if the base of the logarithm is 2, then raising 2 to that power brings us to the original value.
Example: N is 256. If we take the base 2 log of it we get 8. If we raise 2 to the power of 8 we get 256. So the expression is just N, and the complexity is linear.
If the log were in a different base, for example 10, changing the base just divides the exponent by a constant: N = 2^(log10 N / log10 2), and equivalently 2^(log10 N) = N^(log10 2) ≈ N^0.301.
You can also test this by hand. Log2 of 256 is 8 and log2 of 128 is 7; 8/7 is about 1.14. Log10 of 256 is 2.4 and log10 of 128 is 2.1; 2.4/2.1 is also about 1.14. Changing the base rescales every logarithm by the same constant factor, which is why the base never matters inside a Big-O of a logarithm. Inside an exponent, however, it does: 2^(log2 N) is exactly N, while 2^(log10 N) is only about N^0.301, as the previous answer shows.

Outlier in linear algorithm for nth Fibonacci

So I was taught that, using the recurrence relation of Fibonacci numbers, I could get an O(n) algorithm. But due to the large size of Fibonacci numbers for large n, the additions take proportionally longer, meaning that the time complexity is no longer linear.
That's all well and fine, but in this graph (source), why is there a number near 1800000 which takes significantly longer to compute than its neighbours?
EDIT: As mentioned in the answers, the outlier is around 180000, not 1800000
The outlier occurred at 180,000, not 1,800,000. I don't know exactly how big integers are stored in Python, but fib(180000) has about 125,000 bits, so it takes roughly 15,000 bytes. I suspect an issue with the testing for why 180,000 would take significantly longer than 181,000 or 179,000.
#cdlane mentioned the time it would take for fib(170000) to fib(200000), but the increment is 1000, so that would be 30 test cases, run 10 times each, which would take less than 20 minutes.
The article linked to mentions a matrix variation for calculating Fibonacci numbers, which is an O(log2(n)) process. This can be further optimized using a Lucas sequence, which uses similar logic (repeated squaring to raise a matrix to a power). Example C code for 64-bit unsigned integers:
#include <stdint.h>

/* Fibonacci by repeated squaring of a Lucas-sequence style transform: O(log n) steps. */
uint64_t fibl(uint64_t n) {
    uint64_t a, b, p, q, qq, aq;
    a = q = 1;
    b = p = 0;
    while (1) {
        if (n & 1) {            /* bit set: apply the current (p, q) transform to (a, b) */
            aq = a * q;
            a = b * q + aq + a * p;
            b = b * p + aq;
        }
        n >>= 1;
        if (n == 0)
            break;
        qq = q * q;             /* square the (p, q) transform for the next bit */
        q = 2 * p * q + qq;
        p = p * p + qq;
    }
    return b;
}
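For comparison, a minimal arbitrary-precision sketch in Python (a fast-doubling formulation of the same O(log n) idea, not a direct port of the C routine above):
def fib(n):
    """Fast doubling: F(2k) = F(k)*(2*F(k+1) - F(k)), F(2k+1) = F(k)^2 + F(k+1)^2."""
    def fib_pair(k):
        # Returns (F(k), F(k+1)).
        if k == 0:
            return (0, 1)
        a, b = fib_pair(k // 2)
        c = a * (2 * b - a)      # F(2m) where m = k // 2
        d = a * a + b * b        # F(2m + 1)
        return (d, c + d) if k & 1 else (c, d)
    return fib_pair(n)[0]

print(fib(10))                   # 55
print(fib(180_000) % 10**9)      # last 9 digits; far faster than the O(n) additions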
Probably the measurement of time was done on a system where other tasks were performed simultaneously, which was causing a single computation to take longer than expected.
Part of the graph that the OP didn't copy says:
Note: Each data point was averaged over 10 calculcations[sic].
So that seems to negate that explanation unless it was a huge difference and we're only seeing it averaged with normal times.
(My 2012 vintage Mac mini calculates the 200,000th number in 1/2 a second using the author's code. The machine that produced this data took over 4 seconds to calculate the 200,000th number. I pulled out a 2009 vintage MacBook to confirm that this is believable.)
I did my own crude graph with a modified version of the code that didn't restart at the beginning each time but kept a running total of the time, and charted the results. The times are a bit longer than expected, as the algorithm I used adds some printing overhead to each calculation, but there is no big anomaly between 180,000 and 195,000.

Why does this loop take O(2^n) time complexity?

There is a loop which performs a brute-force algorithm to calculate 5 * 3 without using the multiplication operator.
I just need to add five 3 times, so it takes O(3), which is O(y) if the inputs are x * y.
However, in a book, it says it takes O(2^n), where n is the number of bits in the input. I don't understand why it uses O(2^n) instead of O(y). Is that a better way to express the time complexity? Could you please explain?
I'm not asking for another algorithm to calculate this.
int result = 0;
for (int i = 0; i < 3; i++) {
    result += 5;
}
You’re claiming that the time complexity is O(y) on the input, and the book is claiming that the time complexity is O(2^n) on the number of bits in the input. Good news: you’re both right! If a number y can be represented by n bits, y is at most 2^n - 1.
I think that you're misreading the passage from the book.
When the book is talking about the algorithm for computing the product of two numbers, it uses the example of multiplying 3 × 5 as a concrete instance of the more general idea of computing x × y by adding y + y + ... + y, x total times. It's not claiming that the specific algorithm "add 5 + 5 + 5" runs in time O(2^n). Instead, think about this algorithm:
int total = 0;
for (int i = 0; i < x; i++) {
    total += y;
}
The runtime of this algorithm is O(x). If you measure the runtime as a function of the number of bits n in the number x - as is suggested by the book - then the runtime is O(2^n), since a number x that fits in n bits can be as large as 2^n - 1 (equivalently, representing x takes only about log x bits). This is the distinction between polynomial time and pseudopolynomial time, and the reason the book then goes on to describe a better algorithm for solving this problem is so that the runtime ends up being a polynomial in the number of bits used to represent the input rather than in the numeric value of the numbers. The exposition about grade-school multiplication and addition is there to help you get a better sense for the difference between these two quantities.
Do not think in terms of 3 and 5. Think about how to calculate 2 billion × 2 billion (roughly 2^31 multiplied by 2^31).
Your inputs are 31 bits each (n = 31), and your loop will be executed about 2 billion times, i.e. 2^n times.
So the book is correct. For the 5 × 3 case, 3 fits in 2 bits, so the complexity is O(2^2). Again correct.
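A small Python sketch of the same point: the loop count equals the value x, and measured against the bit length n the worst case is 2^n - 1 iterations.
def multiply_by_repeated_addition(x, y):
    """Compute x*y by adding y to itself x times; returns (product, iterations)."""
    total, iterations = 0, 0
    for _ in range(x):
        total += y
        iterations += 1
    return total, iterations

for bits in range(1, 11):
    x = 2**bits - 1                 # the largest value that fits in `bits` bits
    _, iters = multiply_by_repeated_addition(x, 5)
    print(bits, iters)              # iterations == 2^bits - 1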

Algorithm - comparing performance

Suppose I have 3 algorithms A, B and C to process n records.
algorithm A takes 80n + 40 steps
algorithm B takes n^2 + 30n steps
algorithm C takes 2^n steps
Decide which algorithm is most efficient when
i) 10 < n < 50
The way I would solve this problem is by assuming n equals some value, for example:
for i) Assume that n = 20
so
algo A - 80(20) + 40 = 1640 steps
algo B - 20^2 + 30(20) = 1000 steps
algo C - 2^20 = 1048576 steps
therefore algo B is most efficient.
I am not really sure whether I have evaluated the 3 algorithms' performance correctly, because I am just substituting a value for n instead of using Big-O notation.
Please advise. Thanks!
Big-O notation deals with arbitrarily large n, i.e. in order to evaluate the O(...) behaviour the expression should be considered as n --> infinity. In your case the range of n is given, so the overall running time can be calculated precisely, exactly the way you did it (ideally checking the whole range 10 < n < 50 rather than a single sample point).
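As a quick check (in Python), you can evaluate all three step counts across the whole range rather than at one value of n; this confirms the conclusion above:
def steps_A(n): return 80 * n + 40
def steps_B(n): return n * n + 30 * n
def steps_C(n): return 2**n

for n in range(11, 50):
    counts = {'A': steps_A(n), 'B': steps_B(n), 'C': steps_C(n)}
    best = min(counts, key=counts.get)
    assert best == 'B'              # B has the fewest steps at every n in the range
print("B is the most efficient for all 10 < n < 50")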
