Randomized Quick Sort Pivot selection with 25%-75% split - algorithm

I came to know that in the case of randomized quicksort, if we choose the pivot in such a way that it gives at least a 25%-75% split, then the running time is O(n log n).
I also came to know that we can prove this with the Master Theorem.
But my problem is: if we split the array 25%-75% at each step, how do I define T(n), and how can I prove that the running time is O(n log n)?

You can use the Master Theorem to find the complexity of this kind of algorithm. In this particular case, assume that when you divide the array into two parts, each of these parts is no greater than 3/4 of the initial array. Then T(n) < 2 * T(3/4 * n) + O(n), or T(n) = 2 * T(3/4 * n) + O(n) if you are looking for an upper bound. The Master Theorem gives you the solution to this recurrence.
Update: though the Master Theorem can solve such recurrences, in this case it gives a result worse than the expected O(n log n), because bounding both parts by 3n/4 over-counts (together the two parts contain only n elements). Nevertheless, it can be solved another way. If we assume that the pivot always splits the array so that the smaller part has size >= n/4, then we can bound the recursion depth by log_{4/3} n (because on each level the size of the subarray decreases by a factor of at least 4/3). The total time spent on each recursion level is O(n), thus we have O(n) * log_{4/3} n = O(n log n) overall complexity.
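To make the depth argument concrete, here is a rough Python sketch (mine, not from the original answer) of a randomized quicksort that keeps re-drawing the pivot until it produces at least a 25%-75% split; under that guarantee each recursive call handles at most 3/4 of the elements, so the recursion depth is at most log_{4/3} n:

import random

def quicksort_25_75(a):
    # Sketch only: re-draw the pivot until both sides hold at most 3/4 of
    # the items, so every recursive call shrinks the problem by at least 1/4.
    if len(a) <= 1:
        return a
    while True:
        pivot = random.choice(a)
        smaller = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        larger = [x for x in a if x > pivot]
        # Roughly half of the possible pivots pass this check (any pivot whose
        # rank lies in the middle 50%), so only a constant number of re-draws
        # is expected.
        if len(smaller) <= 3 * len(a) // 4 and len(larger) <= 3 * len(a) // 4:
            return quicksort_25_75(smaller) + equal + quicksort_25_75(larger)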
Furthermore, if you want a more rigorous analysis, you may consult the Wikipedia article; it contains some good proofs.

Related

Measuring the Time Complexity of Insertion And Merge Sort Hybrid

I have a very basic hybrid of merge sort and insertion sort that involves a threshold below which insertion sort is used on the sub-arrays of the problem of size n, where merge and insertion_sort are the most basic and widely available versions:
def hybrid_sort(array: list, threshold: int = 10):
    if len(array) > 1:
        mid = len(array) // 2
        left = array[:mid]
        right = array[mid:]
        if len(array) > threshold:
            hybrid_sort(left)
            hybrid_sort(right)
            merge(array, left, right)
        else:
            insertion_sort(array)
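For reference, the snippet assumes merge and insertion_sort helpers roughly like the following (a minimal sketch of the "most basic" versions, not the asker's actual code):

def insertion_sort(array: list) -> None:
    # Sort array in place by inserting each item into the sorted prefix.
    for i in range(1, len(array)):
        key = array[i]
        j = i - 1
        while j >= 0 and array[j] > key:
            array[j + 1] = array[j]
            j -= 1
        array[j + 1] = key

def merge(array: list, left: list, right: list) -> None:
    # Merge the two sorted halves left and right back into array in place.
    i = j = k = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            array[k] = left[i]
            i += 1
        else:
            array[k] = right[j]
            j += 1
        k += 1
    while i < len(left):
        array[k] = left[i]
        i, k = i + 1, k + 1
    while j < len(right):
        array[k] = right[j]
        j, k = j + 1, k + 1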
Unless I am completely misunderstanding, this would mean that we have a recurrence relation for this particular piece of code, generalized as:
T(n) = 2T(n/2) + O(n^2)
The first part comes from merge sort, and the second from the insertion sort operations.
By the master theorem, n raised to log_b(a) would equal n in this case, because you'd have n raised to the log_2(2) which is 1, so n^1 = n.
Then, our f(n) = n^2, which is 'larger' than n, so by case 3 of the master theorem my algorithm above would be f(n), or O(n^2), because f(n) is bounded from below by n.
This doesn't seem right to me considering we know merge sort is O(nlog(n)), and I'm having a hard time wrapping my head around this. I think it's because I've not yet analyzed such an algorithm that has a conditional 'if' check.
Can anyone illuminate this for me?
Unless the threshold itself depends on n, the insertion sort part does not matter at all. This has the same complexity as a normal merge sort.
Keep in mind that the time complexity of an algorithm that takes an input of size n is a function of n that is generally difficult to compute exactly, and so we focus on the asymptotic behavior of that function instead. This is where the big O notation comes into play.
In your case, as long as threshold is a constant, this means that as n grows, threshold becomes insignificant: there are Θ(n / threshold) sub-arrays that get insertion-sorted, each in O(threshold^2) = O(1) time, so all the insertion sorts together contribute only O(n) work, while the merging levels above them still cost O(n log n). So it simplifies to O(n log n), the complexity of merge sort.
Here's a different perspective that might help give some visibility into what's happening.
Let's suppose that once the array size reaches k, you switch from merge sort to insertion sort. We want to work out the time complexity of this new approach. To do so, we'll imagine the "difference" between the old algorithm and the new algorithm. Specifically, if we didn't make any changes to the algorithm, merge sort would take time Θ(n log n) to complete. However, once we get to arrays of size k, we stop running mergesort and instead use insertion sort. Therefore, we'll make some observations:
There are Θ(n / k) subarrays of the original array of size k.
We are skipping calling mergesort on all these arrays. Therefore, we're avoiding doing Θ(k log k) work for each of Θ(n / k) subarrays, so we're avoiding doing Θ(n log k) work.
Instead, we're insertion-sorting each of those subarrays. Insertion sort, in the worst case, takes time O(k^2) when run on an array of size k. There are Θ(n / k) of those arrays, so we're adding in a factor of O(nk) total work.
Overall, this means that the work we're doing in this new variant is O(n log n) - O(n log k) + O(nk). Dialing k up or down will change the total amount of work done. If k is a fixed constant (that is, k = O(1)), this simplifies to
O(n log n) - O(n log k) + O(nk)
= O(n log n) - O(n) + O(n)
= O(n log n)
and the asymptotic runtime is the same as that of regular merge sort.
It's worth noting that as k gets larger, the O(nk) term will eventually dominate the O(n log k) term, so there's some crossover point past which increasing k starts to increase the runtime rather than decrease it. You'd have to do some experimentation to fine-tune when to make the switch. But empirically, setting k to some modest value will indeed give you a big performance boost.
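If you do want to experiment, one crude way (my sketch, not part of the answer) to look for that crossover is to time the hybrid sort for several thresholds on random data. Note that the hybrid_sort in the question does not forward threshold to its recursive calls, so you would want to change those calls to hybrid_sort(left, threshold) and hybrid_sort(right, threshold) first:

import random
import timeit

def time_threshold(k: int, n: int = 100_000, repeats: int = 3) -> float:
    # Time hybrid_sort(threshold=k) on fresh copies of the same random data;
    # the O(n) cost of copying is negligible next to the sort itself.
    data = [random.random() for _ in range(n)]
    return min(timeit.repeat(lambda: hybrid_sort(data[:], threshold=k),
                             number=1, repeat=repeats))

for k in (1, 8, 16, 32, 64, 128):
    print(k, time_threshold(k))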

algorithm with O(logn) and θ(logn) time-complexity

If we have 2 algorithms, one of them has O(f(x)) time complexity and the other one has θ(f(x)) time complexity, which one do we prefer to solve our problem, and why?
There is insufficient information given to decide which algorithm is preferable. It's possible that the first algorithm is preferable, it's possible that both are equally preferable, and it's even possible the second is preferable if they are asymptotically equal but the second has a lower constant factor.
Consider the fact that binary search is O(n) because big-O only gives an upper bound, whereas linear search is Θ(n). Binary search is preferable, because it is asymptotically more efficient.
Consider linear search, which is O(n), and... linear search, which is Θ(n). Both are equally preferable because they are literally the same.
Consider bubble sort, which is O(n^2), and insertion sort, which is Θ(n^2). Insertion sort does on average ~ n^2/4 comparisons, whereas bubble sort does on average ~ n^2/2 comparisons, which is twice as many; so insertion sort is preferable.
So as you can see, it's not possible to say without more information.
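As a rough illustration of that bubble sort vs. insertion sort example (my sketch, not part of the answer), you can count the comparisons directly and see a ratio of roughly 2:1 on random input:

import random

def bubble_comparisons(a: list) -> int:
    # Plain bubble sort on a copy, counting comparisons (~ n^2/2 of them).
    a, count = a[:], 0
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            count += 1
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return count

def insertion_comparisons(a: list) -> int:
    # Insertion sort on a copy, counting comparisons (~ n^2/4 on average).
    a, count = a[:], 0
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0:
            count += 1
            if a[j] <= key:
                break
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return count

data = [random.random() for _ in range(2000)]
print(bubble_comparisons(data), insertion_comparisons(data))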
Let's try to compare the algorithms:
The first algorithm has O(n log n) time complexity, which means that its execution time t1 satisfies
t1 <= k1 * n * log(n) + o(n * log(n))
The second algorithm is θ(n log n), so
t2 = k2 * n * log(n) + o(n * log(n))
Assuming that n is large enough that we can neglect the o(n * log(n)) term, we still have two possibilities here.
t1 = o(n * log(n)), i.e. t1 grows strictly more slowly than n * log(n)
t1 = k1 * n * log(n) + o(n * log(n)) (the bound is tight, at least for some worst case)
In the first case we should prefer algorithm 1 for large n, since it has a shorter execution time when n is large enough.
In the second case we have to compare the unknown constants k1 and k2, so we do not have enough information to choose between the 1st and 2nd algorithms.

What would be the running time of an algorithm that combines mergeSort and heapsort?

I have been given this problem that asks to compute the worst case running time of an algorithm that's exactly like mergeSort, but one of the two recursive calls is substituted by Heapsort.
So, I know that dividing in mergesort takes constant time and that merging is O(n). Heapsort takes O(nlogn).
This is what I came up with: T(n) = 2T(n/2) + O((n/2) log n) + O(n).
I have some doubts about the O((n/2)logn) part. Is it n or n/2? I wrote n/2 because I'm doing heapsort only on half of the array, but I'm not sure that's correct
The question asks about running time, but should it be asking about time complexity?
Since recursion is mentioned, this is a question about top down merge sort (as opposed to bottom up merge sort).
With the code written as described, since heap sort is not recursive, recursion only occurs on one of each of the split sub-arrays. Heap sort will be called to sort sub-arrays of size n/2, n/4, n/8, n/16, ... , and no merging takes place until two sub-arrays of size 1 are the result of the recursive splitting. In the simple case where array size is a power of 2, then "merge sort" is only used for a single element, the rest of the sub-arrays of size {1, 2, 4, 8, ..., n/8, n/4, n/2} are sorted by heap sort and then merged.
Since heap sort is slower than merge sort, then running time will be longer, but time complexity remains at O(n log(n)) since constant or lower term factors are ignored for time complexity.
Let’s work out what the recurrence relation should be in this case. Here, we’re
splitting the array in half,
recursively sorting one half (T(n / 2)),
heapsorting one half (O(n log n)), and then
merging the two halves together (O(n)).
That gives us this recurrence relation:
T(n) = T(n / 2) + O(n log n).
Why is this O(n log n) and not, say, O((n / 2) log (n / 2))? The reason is that big-O notation munches up constant factors, so O(n log n) expresses the same asymptotic growth rate as O((n / 2) log (n / 2)). And why isn’t there a coefficient of 2 on the T(n / 2)? It’s because we’re only making one recursive call; remember that the other call was replaced by heapsort.
All that’s left to do now is to solve this recurrence. It does indeed work out to O(n log n), and I’ll leave it to you to decide how you want to show this. The iteration method is a great option here.
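If it helps to see the structure in code, here is a rough Python sketch (mine, using heapq for the heapsort half, since the question doesn't show code) of the variant being analyzed: recursively sort one half, heapsort the other, then merge:

import heapq

def heapsort(a: list) -> list:
    # O(n log n): build a binary heap, then pop the minimum repeatedly.
    heap = list(a)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(a))]

def merge_heap_sort(a: list) -> list:
    # Merge sort in which one of the two recursive calls is replaced by heapsort.
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left = merge_heap_sort(a[:mid])   # the T(n/2) recursive call
    right = heapsort(a[mid:])         # the O(n log n) heapsort call
    # Standard O(n) merge of the two sorted halves.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]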

Divide and conquer - why does it work?

I know that algorithms like mergesort and quicksort use the divide-and-conquer paradigm, but I'm wondering why it works to lower the time complexity...
Why does a "divide and conquer" algorithm usually work better than a non-divide-and-conquer one?
Divide and conquer algorithms work faster because they end up doing less work.
Consider the classic divide-and-conquer algorithm of binary search: rather than looking at N items to find an answer, binary search ends up checking only log2(N) of them. Naturally, when you do less work, you can finish faster; that's precisely what's going on with the divide-and-conquer algorithms.
Of course the results depend a lot on how well your strategy does at dividing the work: if the division is more or less fair at every step (i.e. you divide the work in half) you get the perfect log2(N) speed. If, however, the dividing is not perfect (e.g. the worst case of quicksort, when it spends O(n^2) time sorting the array because it eliminates only a single element at each iteration), then the divide-and-conquer strategy is not helpful, as your algorithm does not reduce the amount of work.
Divide and conquer works, because the mathematics supports it!
Consider a few divide and conquer algorithms:
1) Binary search: This algorithm reduces your input space to half each time. It is intuitively clear that this is better than a linear search, as we would avoid looking at a lot of elements.
But how much better? We get the recurrence (note: this is the recurrence for the worst-case analysis):
T(n) = T(n/2) + O(1)
Mathematics implies that T(n) = Theta(log n). Thus this is exponentially better than a linear search.
2) Merge Sort: Here we divide into two (almost) equal halves, sort the halves and then merge them. Why should this be better than quadratic? This is the recurrence:
T(n) = 2T(n/2) + O(n)
It can be mathematically shown (say using Master theorem) that T(n) = Theta(n log n). Thus T(n) is asymptotically better than quadratic.
Observe that the naive quicksort ends up giving us the recurrence for worst case as
T(n) = T(n-1) + O(n)
which mathematically, comes out to be quadratic, and in the worst case, isn't better than bubble sort (asymptotically speaking). But, we can show that in the average case, quicksort is O(n log n).
3) Selection Algorithm: This is a divide and conquer algorithm to find the k-th largest element. It is not at all obvious whether this algorithm is better than sorting (or even that it is not quadratic).
But mathematically, its recurrence (again, worst case) comes out to be
T(n) = T(n/5) + T(7n/10 + 6) + O(n)
It can be shown mathematically that T(n) = O(n) and thus it is better than sorting.
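For concreteness, here is a rough sketch (mine, not part of the answer) of the median-of-medians selection algorithm behind that recurrence; it returns the k-th smallest element, and the same idea works for the k-th largest:

def select(a: list, k: int):
    # Return the k-th smallest element of a (k is 1-based).
    if len(a) <= 5:
        return sorted(a)[k - 1]
    # Split into groups of 5 and take the median of each group: O(n) work.
    groups = [sorted(a[i:i + 5]) for i in range(0, len(a), 5)]
    medians = [g[len(g) // 2] for g in groups]
    # Recursively find the median of the medians: the T(n/5) term.
    pivot = select(medians, (len(medians) + 1) // 2)
    smaller = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    larger = [x for x in a if x > pivot]
    # This pivot guarantees the recursive side holds at most ~7n/10 + 6 elements.
    if k <= len(smaller):
        return select(smaller, k)
    elif k <= len(smaller) + len(equal):
        return pivot
    else:
        return select(larger, k - len(smaller) - len(equal))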
Perhaps a common way to look at them:
You can look at these algorithms as a tree where each sub-problem becomes a sub-tree of the current one; each node can be tagged with the amount of work done there, and the total work can then be added up over all nodes.
For binary search, the work at each node is O(1) (just a comparison), and one of the two sub-trees always has zero work, so the total amount of work is O(log n) (essentially a path, just like in binary search trees).
For merge sort, at a node covering k elements the work is O(k) (the merge step). The work done at each level is O(n) (n, then n/2 + n/2, then n/4 + n/4 + n/4 + n/4, etc.), and there are O(log n) levels, so merge sort is O(n log n).
For quicksort, in the worst case the binary tree is actually a linked list, so the work done is n + (n-1) + ... + 1 = Omega(n^2).
For the selection algorithm, I have no clue how to visualize it, but I believe looking at it as a tree with 3 children (n/5, 7n/10 and the remainder) might still help.
Divide and conquer algorithms don't "usually work better". They just work, like other non-divide-and-conquer algorithms do. They don't lower sorting complexity; they do as well as other algorithms.

What's wrong with this inductive proof that mergesort is O(n)?

Comparison-based sorting is big-Omega of n log(n), so we know that mergesort can't be O(n). Nevertheless, I can't find the problem with the following proof:
Proposition P(n): For a list of length n, mergesort takes O(n) time.
P(0): merge sort on the empty list just returns the empty list.
Strong induction: Assume P(1), ..., P(n-1) and try to prove P(n). We know that at each step in a recursive mergesort, two approximately "half-lists" are mergesorted and then "zipped up". The mergesorting of each half list takes, by induction, O(n/2) time. The zipping up takes O(n) time. So the algorithm has a recurrence relation of M(n) = 2M(n/2) + O(n) which is 2O(n/2) + O(n) which is O(n).
Compare the "proof" that linear search is O(1).
Linear search on an empty array is O(1).
Linear search on a nonempty array compares the first element (O(1)) and then searches the rest of the array (O(1)). O(1) + O(1) = O(1).
The problem here is that, for the induction to work, there must be one big-O constant that works both for the hypothesis and the conclusion. That's impossible here and impossible for your proof.
The "proof" only covers a single pass, it doesn't cover the log n number of passes.
The recurrence only shows the cost of a pass as compared to the cost of the previous pass. To be correct, the recurrence relation should have the cumulative cost rather than the incremental cost.
You can see where the proof falls down by viewing the sample merge sort at http://en.wikipedia.org/wiki/Merge_sort
Here is the crux: all induction steps which refer to particular values of n must refer to a particular function T(n), not to O() notation!
O(M(n)) notation is a statement about the behavior of the whole function from problem size to performance guarantee (asymptotically, as n increases without limit). The goal of your induction is to determine a performance bound T(n), which can then be simplified (by dropping constant and lower-order factors) to O(M(n)).
In particular, one problem with your proof is that you can't get from your statement purely about O() back to a statement about T(n) for a given n. O() notation allows you to ignore a constant factor for an entire function; it doesn't allow you to ignore a constant factor over and over again while constructing the same function recursively...
You can still use O() notation to simplify your proof, by demonstrating:
T(n) = F(n) + O(something less significant than F(n))
and propagating this predicate in the usual inductive way. But you need to preserve the constant factor of F(): this constant factor has direct bearing on the solution of your divide-and-conquer recurrence!
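One concrete way to see why no single constant can work (my illustration, not part of the answer): solve the mergesort recurrence exactly and watch the would-be constant M(n)/n keep growing:

def M(n: int) -> int:
    # Exact cost from M(n) = M(floor(n/2)) + M(ceil(n/2)) + n, with M(1) = 0
    # (a concrete instance of M(n) = 2M(n/2) + O(n)).
    if n <= 1:
        return 0
    return M(n // 2) + M(n - n // 2) + n

for n in (2, 16, 256, 4096, 65536):
    # If mergesort were O(n), M(n) / n would be bounded by some constant c;
    # instead it grows like log2(n): 1, 4, 8, 12, 16, ...
    print(n, M(n) / n)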
