Quick sort, is there an optimal pivot? - algorithm

Given the list of numbers:
2 5 1 8 4 10 6 3 7 9 0
I understand the actual implementation of quick sort, but a question on my homework that I didn't understand was:
What is the optimal choice of pivot, why?
I had assumed when reading this that the obvious choice for a pivot would be the 5 or 6, since it's in the middle of the list. I figured quick sort would work either way, though, since we choose a new pivot every time. That makes the follow-up question make a little more sense, but does anyone have a formal definition?
Why is an optimal pivot not practical?

The optimal pivot is the median of the set you're currently working on, because it will split the set into two equal-sized subsets, which guarantees O(n log n) performance. The reason it's not practical is the cost of finding the actual median. You essentially have to sort the data to find the median, so it's like the book Catch-22 - "How do I sort the data?" "Find the median" "How do I find a median?" "Sort the data".
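To make the catch-22 concrete, here is a minimal Python sketch (my own illustration, not part of the original answer) where the pivot really is the exact median, found the naive way: by sorting the sublist first, which is exactly the work quicksort is supposed to be doing in the first place.

    # Toy quicksort whose pivot is the true median, found by sorting: the
    # circular dependency described above. For illustration only, not efficient.
    def quicksort_true_median(a):
        if len(a) <= 1:
            return a
        pivot = sorted(a)[len(a) // 2]   # "to find the median ... sort the data"
        left = [x for x in a if x < pivot]
        mid = [x for x in a if x == pivot]
        right = [x for x in a if x > pivot]
        return quicksort_true_median(left) + mid + quicksort_true_median(right)

    print(quicksort_true_median([2, 5, 1, 8, 4, 10, 6, 3, 7, 9, 0]))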

The optimal pivot is the one in the middle, because when you move it to the left or to the right (or take the biggest or smallest item), you increase the depth of recursion. In the worst case you will get O(n^2) instead of the O(n log2(n)) you get when taking the middle.

The optimal pivot must be the median of the numbers, because then the subproblem sizes are exactly half of the original. The time complexity is then defined by:
T(N) = 2T(N/2) + O(N)
which evaluates to
T(N) = O(N log N)
Whereas if the pivot always ends up being the smallest element, i.e. the first element of the array after partitioning, then:
T(N) = T(N-1) + O(N)
T(N) = O(N^2)
which is as bad as bubble sort.
The reason that always using the median as the pivot is not practical is that the algorithms that find it in O(N) are very complex, and you can always find it in O(N log N), but that is sorting again, which is the very problem we are solving. Here is an example of an algorithm that finds the median in O(N):
Median of Medians
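For reference, here is a rough Python sketch of the median-of-medians idea linked above (a simplified version of my own; the function name is made up). It returns the k-th smallest element in worst-case linear time, so asking for k = N/2 yields the median.

    def median_of_medians_select(a, k):
        # Returns the k-th smallest element (0-based) of a in worst-case linear time.
        if len(a) <= 5:
            return sorted(a)[k]
        # Chop the list into groups of 5 and take each group's median by brute force.
        groups = [a[i:i + 5] for i in range(0, len(a), 5)]
        medians = [sorted(g)[len(g) // 2] for g in groups]
        # Recursively pick the median of those medians and use it as the pivot.
        pivot = median_of_medians_select(medians, len(medians) // 2)
        lows = [x for x in a if x < pivot]
        pivots = [x for x in a if x == pivot]
        highs = [x for x in a if x > pivot]
        if k < len(lows):
            return median_of_medians_select(lows, k)
        if k < len(lows) + len(pivots):
            return pivot
        return median_of_medians_select(highs, k - len(lows) - len(pivots))

    print(median_of_medians_select([2, 5, 1, 8, 4, 10, 6, 3, 7, 9, 0], 5))  # 5, the median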

Related

Randomized quicksort where pivot is chosen again after partition

I would like to come up with a recurrence for this given problem:
Consider a variation of the randomized quicksort algorithm where the pivot is picked randomly until the array is partitioned in such a way that both the lower subarray L and the greater subarray G
contain at most 3/4 of the elements of the array. For instance, if the randomly chosen pivot
partitions the array in such a way that L contains 1/10 of the elements, then another
pivot is randomly chosen. Analyze the expected running time of this algorithm.
At first I treated this question as if it's just a regular quicksort question and came up with this recurrence where:
T(n) = T(3n/4) + T(n/4) + Θ(n) (where the Θ(n) comes from the partition)
It would make sense if we had an algorithm where the split is always 1/4 : 3/4. But we are using random pivoting here, and the pivot changes every time the condition for partitioning is not satisfied. I know that the worst-case running time for randomized quicksort is still O(n^2), but I think under these circumstances the worst case is different now (something worse than O(n^2)). Am I on the right track so far?
The time complexity of quick sort will never go beyond O(n^2) unless you choose some pivot-selection logic that itself takes O(n) time.
The best ways to choose the pivot are a random element, the last element, or the first element.
There are n/2 bad pivots. Assuming you never select the same pivot twice (if you do, the worst case is always selecting a bad pivot, i.e. infinite time), in the worst case you'd repeat the partitioning n/2 times, which leads to Θ(n^2) complexity for the partitioning phase. The recurrence becomes
T(n) = T(n/4) + T(3n/4) + Θ(n^2)
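For what it's worth, here is a rough Python sketch (all names are mine) of the variant the question describes: keep drawing a random pivot and re-partitioning until neither side holds more than 3/4 of the elements, then recurse on both sides.

    import random

    def quicksort_retry_pivot(a):
        if len(a) <= 1:
            return a
        while True:
            pivot = random.choice(a)
            lower = [x for x in a if x < pivot]
            equal = [x for x in a if x == pivot]
            greater = [x for x in a if x > pivot]
            # Accept the pivot only if the split is at worst roughly 1/4 : 3/4.
            if len(lower) <= 3 * len(a) // 4 and len(greater) <= 3 * len(a) // 4:
                break
        return quicksort_retry_pivot(lower) + equal + quicksort_retry_pivot(greater)

    print(quicksort_retry_pivot([2, 5, 1, 8, 4, 10, 6, 3, 7, 9, 0]))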

What would be the recurrence relation for this algorithm?

I have been given this algorithm that computes the median of an array and partitions the other items around it.
It puts all the elements smaller than the median in a set A1, all those equal to it in A2, and all those bigger in A3. If A1 has more than one element, it recurses into it, and the same happens for A3. It terminates after copying the concatenation of A1, A2 and A3 back into A.
I know it’s very similar to Quickselect, but I don’t know how to proceed in order to figure out the time complexity in the worst case.
What I know is that in Quicksort, the time complexity is T(n) = n - 1 + T(a) + T(n-a-1), where n - 1 is for the partition, T(a) is the recursive call on the first part, and T(n-a-1) is the recursive call on the last part. In that case the worst scenario happens when the pivot is always the biggest or the smallest item in the array.
But now, since we have the median as the pivot, what could the worst case be?
You can use the Big 5 algorithm (median of medians with groups of 5), which will give you an approximate median. If you use this as your pivot in quicksort, the worst-case complexity would be O(n log n) instead of O(n^2), since we are making roughly equal divisions each time, instead of the worst case where we divide unequally, with one bucket having one element and the other having n - 1 elements.
That worst case is very unlikely, on the other hand. There is a decent amount of overhead attached to finding the pivot with the Big 5 median algorithm, so in practice it is outperformed by choosing random pivots. But if you wanted to find the (approximate) median every time, the worst case would be O(n log n).
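As a rough sketch of that trade-off (the code and names are mine, not from the answer), the same quicksort can take either pivot rule as a parameter; the random rule has almost no selection overhead, while the median-style rule guarantees balanced splits at the cost of extra work per call.

    import random

    def quicksort(a, choose_pivot):
        if len(a) <= 1:
            return a
        pivot = choose_pivot(a)
        left = [x for x in a if x < pivot]
        mid = [x for x in a if x == pivot]
        right = [x for x in a if x > pivot]
        return quicksort(left, choose_pivot) + mid + quicksort(right, choose_pivot)

    def random_pivot(a):
        # Expected O(n log n) overall, negligible pivot-selection overhead.
        return random.choice(a)

    def median_pivot(a):
        # Stand-in for an approximate-median routine such as median of medians;
        # here it just sorts, which is simpler to show but adds real overhead.
        return sorted(a)[len(a) // 2]

    print(quicksort([2, 5, 1, 8, 4, 10, 6, 3, 7, 9, 0], random_pivot))
    print(quicksort([2, 5, 1, 8, 4, 10, 6, 3, 7, 9, 0], median_pivot))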

Number of comparisons in quick sort variation

Will the number of comparisons differ when we take the last element as the pivot element in quick sort versus when we take the first element as the pivot element?
No, it will not. In quick sort, we choose a pivot element (say x), then divide the list into two parts: elements larger than x and elements less than x.
The number of comparisons therefore changes roughly in proportion to the recursion depth. That is, the deeper the recursion goes, the more comparisons are made to divide the list into two parts.
The recursion depth is what differs: the better the value of x divides the list into parts of similar length, the smaller the recursion depth.
The conclusion is that it doesn't matter whether you choose the first or the last element as the pivot; what matters is whether that value can divide the list into two lists of similar length.
Edit
The closer the pivot is to the median, the lower the complexity (approaching O(n log n)). The closer the pivot is to the max or min of the list, the higher the complexity (up to O(n^2)).
When the first or last element is chosen as the pivot, the number of comparisons remains the same, but it becomes the worst case when the array is already sorted or reverse sorted.
In every step, the numbers are divided as per the following recurrence:
T(n) = T(n-1) + O(n), and solving this relation gives a complexity of Θ(n^2),
whereas choosing the median element as the pivot gives a recurrence of
T(n) = 2T(n/2) + Θ(n), which is the best case, giving a complexity of O(n log n).
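A small experiment (my own sketch, not from either answer) that backs this up: count one comparison per element checked against the pivot, using the first element versus the last element as the pivot, on a shuffled list and on an already sorted one. On the shuffled input the two counts differ a little but both stay well below quadratic; on the sorted input both give exactly the quadratic count (55 comparisons for these 11 numbers).

    def quicksort_count(a, use_last):
        # Returns the number of element-vs-pivot comparisons made while sorting a.
        if len(a) <= 1:
            return 0
        pivot = a[-1] if use_last else a[0]
        rest = a[:-1] if use_last else a[1:]
        comparisons = len(rest)  # each remaining element is checked against the pivot once
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return comparisons + quicksort_count(left, use_last) + quicksort_count(right, use_last)

    shuffled = [2, 5, 1, 8, 4, 10, 6, 3, 7, 9, 0]
    for data in (shuffled, sorted(shuffled)):
        print(quicksort_count(data, use_last=False), quicksort_count(data, use_last=True))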

How does the partitioning step act as a conquering step in quick sort?

The time complexity for divide-and-conquer recurrence relations is given by:
T(n) = aT(n/b) + f(n)
Here f(n) is the cost of conquering the sub-problems, i.e. the cost of merging all the sub-problems in order to solve the original problem. But in the case of partitioning we are dividing the array around a particular pivot point, so when calculating the time complexity of quick sort, why do we take O(n) time for f(n)?
How is this acting as a conquering step?
I don't understand what you mean by a conquering step.
f(n) is in fact the cost of anything done in your recursive function that happens before, after, or between your recursive calls.
In the case of quick sort, the cost of merging the solutions of the partitions is 0, as you don't need to do anything after the left and right sides of the pivot are sorted. The whole cost is in producing the partitions, and to do that, you need to position your selected pivot. This is why quick sort is classified as a Hard Split Easy Join kind of Divide and Conquer.
The cost of positioning the pivot is O(n), as you have to move from left to right and from right to left, finding items on the wrong side of the pivot and swapping them, until both searches (from left to right and from right to left) cross each other.
Hope this helped in your understanding, and sorry if I misunderstood completely your question.
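To make the positioning step concrete, here is a minimal sketch of that left/right scanning pass (a Hoare-style partition; the code is my own illustration, not taken from the question).

    def partition(a, lo, hi):
        # Position a[lo..hi] around the pivot a[lo] by scanning from both ends and
        # swapping misplaced items until the two scans cross. Runs in O(n).
        pivot = a[lo]
        i, j = lo - 1, hi + 1
        while True:
            i += 1
            while a[i] < pivot:      # find an item on the left that belongs on the right
                i += 1
            j -= 1
            while a[j] > pivot:      # find an item on the right that belongs on the left
                j -= 1
            if i >= j:               # the scans have crossed: partitioning is done
                return j
            a[i], a[j] = a[j], a[i]  # swap the two misplaced items

    a = [2, 5, 1, 8, 4, 10, 6, 3, 7, 9, 0]
    split = partition(a, 0, len(a) - 1)
    print(split, a)  # everything up to index split is <= everything after it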

Median of medians algorithm: why divide the array into blocks of size 5

In the median-of-medians algorithm, we need to divide the array into chunks of size 5. I am wondering how the inventors of the algorithm came up with the magic number '5' and not, maybe, 7, or 9, or something else?
The number has to be larger than 3 (and an odd number, obviously) for the algorithm. 5 is the smallest odd number larger than 3. So 5 was chosen.
I think it will help if you check the "Proof of O(n) running time" section of the wiki page for the median-of-medians algorithm:
The median-calculating recursive call does not exceed worst-case linear behavior because the list of medians is 20% of the size of the list, while the other recursive call recurses on at most 70% of the list, making the running time T(n) <= T(n/5) + T(7n/10) + cn.
The O(n) term cn is for the partitioning work (we visited each element a constant number of times, in order to form them into n/5 groups and take each median in O(1) time).
From this, using induction, one can easily show that T(n) <= 10cn, i.e. T(n) = O(n).
That should help you to understand why.
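To spell out where that 10cn bound comes from, here is the standard substitution argument written out (my own wording of the step, using only the recurrence quoted above). Assume T(m) <= 10cm for every m < n; then

    T(n) <= T(n/5) + T(7n/10) + cn
         <= 10c(n/5) + 10c(7n/10) + cn
         = 2cn + 7cn + cn = 10cn

so T(n) <= 10cn, which is O(n).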
You can also use blocks of size 3 or 4, as shown in the paper Select with groups of 3 or 4 by K. Chen and A. Dumitrescu (2015). The idea is to use the "median of medians" algorithm twice and partition only after that. This lowers the quality of the pivot but is faster.
So instead of:
T(n) <= T(n/3) + T(2n/3) + O(n)
T(n) = O(n log n)
one gets:
T(n) <= T(n/9) + T(7n/9) + O(n)
T(n) = Θ(n)
See this explanation on Brilliant.org. Basically, five is the smallest possible sublist size we can use to maintain linear time. It is also easy to sort an n=5 sized array quickly. Apologies for the LaTeX:
Why 5?

The median-of-medians algorithm divides a list into sublists of length five to get an optimal running time. Remember, finding the median of small lists by brute force (sorting) takes a small amount of time, so the length of the sublists must be fairly small. However, adjusting the sublist size to three, for example, does change the running time for the worse.

If the algorithm divided the list into sublists of length three, the pivot p would be greater than approximately n/3 of the elements and smaller than approximately n/3 of the elements. This would cause a worst case of 2n/3 elements in a recursion, yielding the recurrence T(n) = T(n/3) + T(2n/3) + O(n), which by the master theorem is O(n log n), slower than linear time.

In fact, for any recurrence of the form T(n) <= T(an) + T(bn) + cn, if a + b < 1, the recurrence will solve to O(n), and if a + b > 1, the recurrence is usually equal to Ω(n log n). [3]

The median-of-medians algorithm could use a sublist size greater than 5 (for example, 7) and maintain a linear running time. However, we need to keep the sublist size as small as we can so that sorting the sublists can be done in what is effectively constant time.
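As a quick sanity check of that a + b criterion against the two recurrences quoted in this thread (the arithmetic here is mine): for sublists of length 3, T(n) = T(n/3) + T(2n/3) + O(n) gives a + b = 1/3 + 2/3 = 1, which is not below 1, hence O(n log n); for sublists of length 5, T(n) <= T(n/5) + T(7n/10) + cn gives a + b = 1/5 + 7/10 = 9/10 < 1, hence O(n).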

Resources