Using a decision tree and your answer to part (a), show that any algorithm that correctly merges two sorted lists must perform at least 2n − o(n) comparisons.
answer from part (a): 2n over n ways to divide 2n numbers into two sorted lists, each with n numbers
(2n over n) <= 2^h
h >= lg((2n)! / (n!)^2)
= lg((2n)!) - 2lg(n!)
= Θ(2nlg(2n)) - 2Θ(nlg(n)) <----
= Θ(2n) <----
I don't understand the last step. How can it be Θ(2n)?
You can write the logarithm of a product as the sum of the logarithms of its factors:
2*n*lg(2*n) = 2*n*(lg(2) + lg(n)) = 2*n*(1 + lg(n))
So
2*n*(1 + lg(n)) - 2*n*lg(n) =
(2*n + 2*n*lg(n)) - 2*n*lg(n) = 2*n
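As a numerical sanity check on the bound itself, the short sketch below evaluates lg C(2n, n) exactly and prints how far it falls below 2n. The function name `merge_lower_bound` is made up for this illustration; the only facts it relies on are the count from part (a) and Python's `math.comb`.

```python
import math

def merge_lower_bound(n: int) -> float:
    """Decision-tree lower bound lg C(2n, n) on the comparisons needed
    to merge two sorted lists of n elements each."""
    return math.log2(math.comb(2 * n, n))

for n in (10, 100, 1000, 10000):
    bound = merge_lower_bound(n)
    # The gap 2n - bound is the o(n) term; it grows only logarithmically.
    print(n, round(bound, 2), round(2 * n - bound, 2))
```

The gap stays tiny compared with n, which is what "2n − o(n)" means.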
Suppose I have an unsorted array A of n integers and an integer b. I want to write an algorithm to compute the frequency of b in A (i.e., count the number of times b appears in A) using divide and conquer.
Here is a recursive divide-and-conquer algorithm to count the frequency of b in the array A:
Divide the array A into two sub-arrays: left half and right half.
Recursively count the frequency of b in left half of A and in right half of A.
Combine the results from step 2: the frequency of b in A is equal to the sum of the frequency of b in left half and the frequency of b in right half.
If the length of the array A is 1, return 1 if A[0] equals b, otherwise return 0.
The recurrence relation of the algorithm is T(n) = 2T(n/2) + O(1), where O(1) is the time to divide the array and combine the results. The solution of the recurrence is T(n) = O(n), so the time complexity of the algorithm is O(n).
This is because each recursive call divides the array into two sub-arrays of equal size, and each element is visited once at the bottom level of the recursion. Therefore, the algorithm visits each element of the array once, leading to a linear time complexity.
Correct me if I'm wrong.
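The four steps above can be sketched in Python as follows. The index parameters `lo` and `hi` and the handling of an empty array are additions for this sketch (they avoid copying sub-arrays and make the function safe on empty input); they are not part of the description above.

```python
def count_frequency(a, b, lo=0, hi=None):
    """Count how often b occurs in a[lo:hi] by divide and conquer."""
    if hi is None:
        hi = len(a)
    if hi - lo == 0:          # empty range (assumption: not covered above)
        return 0
    if hi - lo == 1:          # base case: a single element
        return 1 if a[lo] == b else 0
    mid = (lo + hi) // 2      # divide into left and right halves
    # conquer both halves and combine by adding the two counts
    return count_frequency(a, b, lo, mid) + count_frequency(a, b, mid, hi)

print(count_frequency([3, 1, 3, 2, 3], 3))
```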
Let's just use concrete elements and do the math. The constant step during the splitting is O(1) so let's call it c and the constant step at the very end (Returning 1 or 0 for length 1 array) is just one step.
So then:
T(n) = 2T(n/2) + c
T(1) = 1
We make the educated guess (or ansatz, if you want to use fancy language) that T(n) = a*n + b, i.e., a linear function. Let's plug that into the relations:
T(n) = a*n + b = 2 * (a*n/2 + b) + c
from which it follows after a bit of basic math that b = -c.
Next, we plug the ansatz into the base case:
T(1) = a*1 + b = a + b = 1
from which we can deduce that a = 1 - b = 1 + c.
So there! We solved for a and b without making a mess and indeed we have
T(n) = (1 + c) * n - c
which is indeed O(n).
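If you don't trust the algebra, a quick check of the closed form against the recurrence is easy to write. This is only a sketch for n a power of two; the names `t_recursive` and `t_closed` are made up here.

```python
def t_recursive(n, c):
    """Evaluate T(n) = 2*T(n/2) + c with T(1) = 1, for n a power of two."""
    if n == 1:
        return 1
    return 2 * t_recursive(n // 2, c) + c

def t_closed(n, c):
    """Closed form derived above: T(n) = (1 + c)*n - c."""
    return (1 + c) * n - c

for n in (1, 2, 4, 8, 1024):
    print(n, t_recursive(n, 3), t_closed(n, 3))
```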
Note that this is the "pedestrian" way. If we're not interested in the actual coefficients a and b but really just in the complexity, we can be more efficient like so:
T(n) = 2 T(n/2) + O(1) = 4 T(n/4) + 2 * O(1) + O(1) = ...
= 2^k T(1) + 2^(k-1) O(1) + 2^(k-2) O(1) ... + O(1)
= 2^k O(1) + 2^(k-1) O(1) + ...
where k = log_2(n).
Summing all those coefficients (a geometric series: 2^k + 2^(k-1) + ... + 1 = 2^(k+1) - 1), we get roughly
T(n) = 2^(k+1) O(1) = 2 * n * O(1) = O(n)
I had a lecture on Big Oh for Merge Sort and I'm confused.
What was shown is:
0 Merges [<----- n -------->] = n
1 Merge [<--n/2--][-n/2--->] = (n/2 + n/2) = n
2 Merges [n/4][n/4][n/4][n/4] = 2(n/4 + n/4) = n
....
log(n) merges = n
Total = (n + n + n + ... + n) = lg n
= O(n log n)
I don't understand why (n + n + ... + n) can also be expressed as log base 2 of n and how they got for 2 merges = 2(n/4 + n/4)
In the case of 1 merge, you have two sub arrays to be sorted where each sub-array will take time proportional to n/2 to be sorted. In that sense, to sort those two sub-arrays you need a time proportional to n.
Similarly, when you are doing 2 merges, there are 4 sub arrays to be sorted where each will be taking a time proportional to n/4 which will again sum up to n.
Similarly, at any level of the recursion, it takes time proportional to n in total to handle all the sub-arrays at that level. In that sense, we can write the time taken by merge sort as follows.
T(n) = 2 * T(n/2) + n
The recursion can go down to a depth h at which n/(2^h) = 1. Taking the log of both sides gives h = log(n); that is where log(n) comes from. Here the log is base 2.
Since you have log(n) steps where each step takes a time proportional to n, total time taken can be expressed as,
n * log(n)
In big-O notation, we write this upper bound as O(n log(n)). Hope you got the idea.
Following image of the recursion tree will enlighten you further.
The last line of the following part written in your question,
0 Merges [<----- n -------->] = n
1 Merge [<--n/2--][-n/2--->] = (n/2 + n/2) = n
2 Merges [n/4][n/4][n/4][n/4] = 2(n/4 + n/4) = n
....
n merges = n --This line is incorrect!
is wrong. You will not have a total of n merges of size n, but log n merges of size n.
At every level, you divide the problem into 2 problems of half the size. As you continue dividing, the total number of divisions you can do is log n. (How? Say the total number of divisions possible is x. Then n = 2^x, or x = log_2(n).)
Since at each level you do a total work of O(n), therefore for Log n levels, the sum total of all work done will be O(n Log n).
You've got a depth of log(n) and a width of n for your tree. :)
The log portion is the result of "how many times can I split my data in two before I have only one element left?" This is the depth of your recursion tree. The multiple of n comes from the fact that for each of those levels in the tree you'll look at every element in your data set once after all merge steps at that level.
recurse downwards:
n unsorted elements
[n/2][n/2] split until singletons...
...
merge n elements at each step when recursing back up
[][][]...[][][]
[ ] ... [ ]
...
[n/2][n/2]
n sorted elements
It's very simple. Each merge takes O(n) as you demonstrated. The number of merges you need to do is log n (base 2), because each merge doubles the size of the sorted sections.
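To see the "n work per level, log n levels" picture concretely, here is a small instrumented merge sort. It is a sketch, not any particular lecture's code: the `work` dictionary and the `depth` parameter are additions that record how many elements get merged at each recursion depth.

```python
from collections import defaultdict

def merge_sort(items, depth=0, work=None):
    """Return (sorted copy, work), where work[d] is the number of
    elements written by merges at recursion depth d."""
    if work is None:
        work = defaultdict(int)
    if len(items) <= 1:
        return list(items), work
    mid = len(items) // 2
    left, _ = merge_sort(items[:mid], depth + 1, work)
    right, _ = merge_sort(items[mid:], depth + 1, work)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):   # standard two-way merge
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    work[depth] += len(merged)
    return merged, work

result, work = merge_sort([5, 2, 7, 1, 8, 3, 6, 4])
print(result)                        # [1, 2, 3, 4, 5, 6, 7, 8]
print(dict(sorted(work.items())))    # {0: 8, 1: 8, 2: 8}
```

For n = 8 there are log2(8) = 3 merging levels, and each level touches all 8 elements, so the total is 8 * 3.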
The question is:
UNBALANCED MERGE SORT is a sorting algorithm that is a modified version of the standard MERGE SORT algorithm. The only difference is that instead of dividing the input into 2 equal parts in each stage, we divide it into two unequal parts: the first 2/5 of the input, and the other 3/5.
a. Write the recurrence relation for the worst-case time complexity of the UNBALANCED MERGE SORT algorithm.
b. What is the worst-case time complexity of the UNBALANCED MERGE SORT algorithm? Solve the recurrence relation from the previous section.
So I'm thinking the recurrence relation is: T(n) <= T(2n/5) + T(3n/5) + dn.
Not sure how to solve it.
Thanks in advance.
I like to look at it as "runs", where the ith "run" is ALL the recursive steps with depth exactly i.
In each such run, at most n elements are processed (we will prove this shortly), so the total complexity is bounded by O(n*MAX_DEPTH). MAX_DEPTH is logarithmic: in each step the bigger array has size 3n/5, so at step i the biggest array has size (3/5)^i * n.
Solve the equation:
(3/5)^i * n = 1
and you will find that i = log_a(n), with base a = 5/3.
So, let's be more formal:
Claim:
Each element is processed by at most one recursive call at depth i, for all values of i.
Proof:
By induction, at depth 0, all elements are processed exactly once by the first call.
Take some element x and look at it at depth i+1. By the induction hypothesis, x was processed at most once at depth i, by some recursive call. That call then invoked recursive calls of depth i+1 (or none, which is why we claim "at most once") and sent x either to the left part or to the right part, never to both. So at depth i+1, x is processed at most once.
Conclusion:
Since at each depth i of the recursion, each element is processed at most once, and the maximal depth of the recursion is logarithmic, we get an upper bound of O(nlogn).
We can similarly prove a lower bound of Omega(nlogn), but that is not needed, since comparison-based sorting is already an Omega(nlogn) problem - so we can conclude the modified algorithm is still Theta(nlogn).
If you want to prove it with basic arithmetic, that can also be done, by induction.
Claim: T(n) = T(3n/5) + T(2n/5) + n <= 5nlog(n) + n
I simplified by using +n instead of +dn; the same proof idea works with T(n) <= 5dnlogn + dn.
Proof:
Base: T(1) = 1 <= 5*1*log(1) + 1 = 1
T(n) = T(3n/5) + T(2n/5) + n
<= 5* (3n/5) log(3n/5) +3n/5 + 5*(2n/5)log(2n/5) +2n/5 + n
< 5* (3n/5) log(3n/5) + 5*(2n/5)log(3n/5) + 2n
= 5*nlog(3n/5) + 2n
= 5*nlog(n) + 5*nlog(3/5) + 2n
(**)< 5*nlog(n) - n + 2n
= 5nlog(n) + n
(**) holds because log(3/5) ~= -0.22 (base 10), so 5nlog(3/5) < -n, and 5nlog(3/5) + 2n < n
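A small experiment can back up the claimed bound. The sketch below sorts with a 2/5 : 3/5 split and counts work exactly as in the recurrence T(n) = T(left) + T(right) + n with T(1) = 1. Several details are assumptions of this sketch rather than part of the question: the split point is rounded down with `(2 * n) // 5` and forced to be at least 1 so the recursion always shrinks, and the bound is checked with base-2 logarithms, which is looser than the base-10 value used in the proof.

```python
import math
from heapq import merge

def unbalanced_merge_sort(items):
    """Return (sorted copy, work) for a 2/5 : 3/5 split merge sort.
    work follows T(n) = T(k) + T(n - k) + n with T(1) = 1."""
    n = len(items)
    if n <= 1:
        return list(items), n
    k = max(1, (2 * n) // 5)   # assumption: floor, but never an empty part
    left, work_left = unbalanced_merge_sort(items[:k])
    right, work_right = unbalanced_merge_sort(items[k:])
    merged = list(merge(left, right))   # linear-time merge of two sorted lists
    return merged, work_left + work_right + n

for n in (10, 100, 1000):
    _, work = unbalanced_merge_sort(list(range(n, 0, -1)))
    print(n, work, round(5 * n * math.log2(n) + n))
```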
I came across this question in a set of Stanford slides: what would be the effect on the complexity of merge sort if we split the array into 4 or 8 parts instead of 2?
It would be the same: O(n log n). You will have a shorter tree and the base of the logarithm will change, but that doesn't matter for big-oh, because a logarithm in base a differs from a logarithm in base b by a constant factor:
log_a(x) = log_b(x) / log_b(a)
1 / log_b(a) = constant
And big-oh ignores constants.
You will still have to do O(n) work per tree level in order to merge the 4 or 8 or however many parts, which, combined with more recursive calls, might just make the whole thing even slower in practice.
In general, you can split your array into any number of equal-size subarrays, sort the subarrays recursively, and then use a min-heap to repeatedly extract the next smallest element from the collection of sorted subarrays. If the number of subarrays is constant, then each min-heap operation takes constant time, so you arrive at the same O(n log n) time.
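A minimal sketch of that idea, using the standard library's `heapq.merge` for the heap-based merge of the sorted parts. The function name `kway_merge_sort` and the way the chunk size is rounded up are choices made for this example.

```python
import heapq

def kway_merge_sort(items, k=4):
    """Sort by splitting into (up to) k parts, sorting each recursively,
    and merging the sorted parts with a min-heap."""
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(items)
    if n <= 1:
        return list(items)
    size = -(-n // k)   # ceil(n / k), so there are at most k chunks
    chunks = [items[i:i + size] for i in range(0, n, size)]
    sorted_chunks = [kway_merge_sort(chunk, k) for chunk in chunks]
    # heapq.merge keeps one entry per chunk in a heap: O(log k) per element
    return list(heapq.merge(*sorted_chunks))

print(kway_merge_sort([9, 4, 7, 1, 8, 2, 6, 3, 5], k=4))
```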
Intuitively it would be the same, as there is not much difference between splitting the array into two parts and then splitting each part again, and splitting it into 4 parts from the beginning.
A more formal proof by induction (I'll assume that the array is split into k parts):
Definitions:
Let T(N) be the number of array stores needed to mergesort an input of size N.
Then the mergesort recurrence is T(N) = k*T(N/k) + N (for N > 1, with T(1) = 0)
Claim:
If T(N) satisfies the recurrence above then T(N) = Nlg(N)
Note - all the logarithms below are on base k
Proof:
Base case: N=1, where T(1) = 0 = 1*lg(1)
Inductive hypothesis: T(N) = NlgN
Goal: show that T(kN) = kN(lg(kN))
T(kN) = kT(N) + kN [mergesort recurrence]
= kNlgN + kN [inductive hypothesis]
= kNlg(kN/k) + kN [algebra]
= kN(lg(kN) - lgk) + kN [algebra]
= kN(lg(kN) - 1) + kN [algebra - for base k, lg(k) = 1]
= kNlg(kN) [QED]
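The claim is easy to check numerically for N a power of k. In this sketch `array_stores` is a made-up name for the recurrence T(N) = k*T(N/k) + N with T(1) = 0, and for N = k^m the claimed value N * log_k(N) is simply N * m.

```python
def array_stores(n, k):
    """Evaluate T(N) = k*T(N/k) + N with T(1) = 0, for N a power of k."""
    if n == 1:
        return 0
    return k * array_stores(n // k, k) + n

for k in (2, 4, 8):
    for m in range(0, 5):
        n = k ** m
        # log base k of n is exactly m, so the claim reads T(n) = n * m
        print(k, n, array_stores(n, k), n * m)
```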
Wikipedia states that the average runtime of quickselect algorithm (Link) is O(n). However, I could not clearly understand how this is so. Could anyone explain to me (via recurrence relation + master method usage) as to how the average runtime is O(n)?
Because we already know which partition our desired element lies in.
We do not need to sort (by partitioning) all the elements; we only operate on the partition we need.
As in quicksort, we partition into halves*, and then into halves of a half, but this time we only need to do the next round of partitioning in the single partition (half) where the element is expected to lie.
It is like (not very accurate)
n + 1/2 n + 1/4 n + 1/8 n + ..... < 2 n
So it is O(n).
* "Half" is used for convenience; the actual partition is not exactly 50%.
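For reference, here is a minimal quickselect sketch showing the "recurse into one side only" behaviour. It uses a random pivot and list comprehensions instead of in-place partitioning, which is a simplification chosen for readability; `k` is 0-based here.

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (0-based) of items."""
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    candidates = list(items)
    while True:
        pivot = random.choice(candidates)
        smaller = [x for x in candidates if x < pivot]
        equal = [x for x in candidates if x == pivot]
        larger = [x for x in candidates if x > pivot]
        if k < len(smaller):
            candidates = smaller              # answer is left of the pivot
        elif k < len(smaller) + len(equal):
            return pivot                      # the pivot is the answer
        else:
            k -= len(smaller) + len(equal)    # answer is right of the pivot
            candidates = larger

print(quickselect([7, 2, 9, 4, 1, 8], 2))
```

Only one of `smaller` or `larger` is kept after each partition, which is where the n + n/2 + n/4 + ... picture comes from.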
To do an average-case analysis of quickselect, one has to consider, for every pair of elements, how likely it is that the two are compared during the algorithm, assuming random pivoting. From this we can derive the average number of comparisons. Unfortunately the analysis I will show requires some longer calculations, but it is a clean average-case analysis as opposed to the current answers.
Let's assume the field we want to select the k-th smallest element from is a random permutation of [1,...,n]. The pivot elements we choose during the course of the algorithm can also be seen as a given random permutation. During the algorithm we always pick the next feasible pivot from this permutation, so the pivots are chosen uniformly at random: every element has the same probability of occurring as the next feasible element in the random permutation.
There is one simple, yet very important, observation: we compare two elements i and j (with i<j) if and only if one of them is chosen as the first pivot element from the range [min(k,i), max(k,j)]. If another element from this range is chosen first, they will never be compared, because we continue searching in a sub-field that does not contain at least one of the elements i, j.
Because of the above observation and the fact that the pivots are chosen uniformly at random, the probability of a comparison between i and j is:
2/(max(k,j) - min(k,i) + 1)
(Two events out of max(k,j) - min(k,i) + 1 possibilities.)
We split the analysis in three parts:
max(k,j) = k, therefore i < j <= k
min(k,i) = k, therefore k <= i < j
min(k,i) = i and max(k,j) = j, therefore i < k < j
In the third case the less-equal signs are omitted because we already consider those cases in the first two cases.
Now let's get our hands a little dirty on calculations. We just sum up all the probabilities as this gives the expected number of comparisons.
Case 1
Case 2
Similar to case 1 so this remains as an exercise. ;)
Case 3
We use H_r for the r-th harmonic number which grows approximately like ln(r).
Conclusion
All three cases need a linear number of expected comparisons. This shows that quickselect indeed has an expected runtime in O(n). Note that - as already mentioned - the worst case is in O(n^2).
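The linear growth can also be checked by brute force: summing the comparison probability 2/(max(k,j) - min(k,i) + 1) over all pairs i < j gives the expected number of comparisons directly. The sketch below does exactly that; the function name is made up, and the comparison against 4n is only a loose reference line chosen for this illustration, not a constant taken from the analysis above.

```python
def expected_comparisons(n, k):
    """Sum the pairwise comparison probabilities for selecting the
    k-th smallest (1-based) of n elements with random pivots."""
    total = 0.0
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            total += 2 / (max(k, j) - min(k, i) + 1)
    return total

for n in (10, 100, 1000):
    k = (n + 1) // 2   # selecting the median
    print(n, round(expected_comparisons(n, k), 1), 4 * n)
```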
Note: The idea of this proof is not mine. I think that's roughly the standard average case analysis of quickselect.
If there are any errors please let me know.
In quickselect, as specified, we apply recursion on only one half of the partition.
Average Case Analysis:
First Step: T(n) = cn + T(n/2)
where cn is the time to perform the partition (c is any constant; its value doesn't matter) and T(n/2) is the recursion on one half of the partition. Since this is the average case, we assume that the pivot was the median.
As we keep on doing recursion, we get the following set of equation:
T(n/2) = cn/2 + T(n/4)
T(n/4) = cn/4 + T(n/8)
...
T(2) = c.2 + T(1)
T(1) = c.1
Summing the equations and cross-cancelling like values produces a linear result.
c(n + n/2 + n/4 + ... + 2 + 1) = c(2n - 1) < 2cn //sum of a GP
Hence, it's O(n)
I also felt very conflicted at first when I read that the average time complexity of quickselect is O(n) even though we break the list in half each time (like binary search or quicksort). It turns out that breaking the search space in half each time doesn't by itself give an O(log n) or O(n log n) runtime. What makes quicksort O(n log n) and quickselect O(n) is that we always need to explore all branches of the recursion tree for quicksort and only a single branch for quickselect. Let's compare the time complexity recurrence relations of quicksort and quickselect (assuming, for simplicity, that every partition splits evenly) to prove my point.
Quicksort:
T(n) = n + 2T(n/2)
= n + 2(n/2 + 2T(n/4))
= n + 2(n/2) + 4T(n/4)
= n + 2(n/2) + 4(n/4) + ... + n(n/n)
= 2^0(n/2^0) + 2^1(n/2^1) + ... + 2^log2(n)(n/2^log2(n))
= n (log2(n) + 1) (since we are adding n to itself log2(n) + 1 times)
Quickselect:
T(n) = n + T(n/2)
= n + n/2 + T(n/4)
= n + n/2 + n/4 + ... + n/n
= n(1 + 1/2 + 1/4 + ... + 1/2^log2(n))
< n (1/(1-(1/2))) = 2n (by geometric series)
I hope this convinces you why the average runtime of quickselect is O(n).
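The two recurrences can be evaluated side by side. This is a sketch for n a power of two with T(1) = 1 and perfectly even splits, the same simplifying assumptions as in the expansions above; the function names are made up.

```python
def quicksort_cost(n):
    """T(n) = n + 2*T(n/2), T(1) = 1: both branches are explored."""
    if n == 1:
        return 1
    return n + 2 * quicksort_cost(n // 2)

def quickselect_cost(n):
    """T(n) = n + T(n/2), T(1) = 1: only one branch is explored."""
    if n == 1:
        return 1
    return n + quickselect_cost(n // 2)

for m in (1, 4, 10, 20):
    n = 2 ** m
    # quicksort grows like n*(log2(n) + 1); quickselect stays below 2n
    print(n, quicksort_cost(n), quickselect_cost(n))
```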