Quick select with random pick index or with median of medians?

To avoid the O(n^2) worst case scenario for quickselect, I am aware of 2 options:
1. Randomly choose a pivot index
2. Use median of medians (MoM) to select an approximate median and pivot around that
When using MoM with quickselect, we can guarantee a worst case of O(n). When using (1), we can't guarantee worst case O(n), but the probability of the algorithm degrading to O(n^2) is extremely small. The overhead cost of (2) is much higher than (1), since the latter adds little to no additional complexity.
So when should we use one over the other?

As you've noted, the median-of-medians approach is slower than quickselect, but has a better worst-case runtime. Assuming quickselect truly chooses a pivot at random at each step, you can prove not only that the expected runtime is O(n), but that the probability of the runtime exceeding Θ(n log n) is very, very small (at most 1 / n^k for any choice of constant k). So in that sense, if you have the ability to select pivots at random, quickselect will likely be faster.
However, not all implementations of quickselect use true randomness for the pivots; some use deterministic pivot selection algorithms. This, unfortunately, can lead to pathological inputs that trigger the Θ(n^2) worst-case runtime, which is a problem if your inputs are adversarially chosen.
One nice compromise between the two is introselect. The basic idea behind introselect is to run quickselect with a cheap pivot selection strategy (random, or a simple deterministic rule). As the algorithm runs, it keeps track of how many times it has picked a pivot without throwing away at least 30% of the input array. If that number exceeds some threshold, it stops using the cheap pivot choice and switches to the median-of-medians approach to select a good pivot, forcing a 30% size reduction. This approach means that in the common case, when quickselect rapidly reduces the input size, introselect is basically identical to quickselect with a tiny bookkeeping overhead. However, in cases where quickselect would degrade to quadratic, introselect stops and switches to the worst-case efficient median-of-medians approach, ensuring the worst-case runtime is O(n). This gives you, essentially, the best of both worlds: it's fast on average, and its worst case is never worse than O(n). A minimal sketch of this idea appears below.
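Here is that sketch in Python. The "budget" of bad partitions and the 30% shrink test are illustrative choices, not what any particular library implements:

```python
import random

def introselect(arr, k):
    """Return the k-th smallest element of arr (k is 0-indexed).

    Sketch only: the budget of tolerated "bad" partitions and the
    fallback trigger are invented for illustration.
    """
    arr = list(arr)   # work on a copy
    budget = 3        # hypothetical: how many bad partitions we tolerate
    while len(arr) > 1:
        if budget > 0:
            pivot = random.choice(arr)      # cheap pivot (fast path)
        else:
            pivot = median_of_medians(arr)  # guaranteed-good pivot
        lo = [x for x in arr if x < pivot]
        eq = [x for x in arr if x == pivot]
        hi = [x for x in arr if x > pivot]
        if k < len(lo):
            nxt = lo
        elif k < len(lo) + len(eq):
            return pivot
        else:
            k -= len(lo) + len(eq)
            nxt = hi
        if len(nxt) > 0.7 * len(arr):       # failed to discard 30%
            budget -= 1
        arr = nxt
    return arr[0]

def median_of_medians(arr):
    """Pivot guaranteed to sit between the 30th and 70th percentiles."""
    if len(arr) <= 5:
        return sorted(arr)[len(arr) // 2]
    chunks = [arr[i:i + 5] for i in range(0, len(arr), 5)]
    medians = [sorted(c)[len(c) // 2] for c in chunks]
    return introselect(medians, len(medians) // 2)
```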

Related

Runtime of R-Select vs Select in average case?

In the worst case, R-select is O(n^2) whereas select is O(n). Can someone explain and contrast their behavior in the average case?
P.S. - I am not sure if it's a duplicate question. I can delete it if that's the case! Thanks!!
By R-select, I'm assuming you're talking about the randomized selection algorithm that works by choosing a pivot, partitioning on that pivot, and recursively proceeding from there. If that's not the case, let me know!
You're correct that the R-select algorithm's worst case is Θ(n^2), but that's extremely unlikely to arise in practice. It requires you to very frequently pick a pivot within a constant number of elements of the min or max value, and the likelihood of that is exponentially low. The average-case runtime of O(n) is actually quite likely to occur; you can prove, for example, that for any constant k, the probability that the runtime is O(n log n) is at least 1 - 1/n^k.
The constant term hidden in the big-O notation of R-select is actually very low, so low in fact that R-select is typically much, much faster than the median-of-medians selection algorithm. In fact, the two are sometimes combined. The introselect algorithm works by running R-select and monitoring its progress, switching to the median-of-medians selection algorithm if the runtime starts to look bad. The overall runtime is then worst-case O(n) and, on average, comparable to R-select. A minimal sketch of R-select itself appears below.
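For concreteness, here is a minimal sketch of R-select, assuming a 0-indexed k and a valid input:

```python
import random

def r_select(arr, k):
    """Randomized selection (R-select): return the k-th smallest element
    of arr, k 0-indexed. Expected O(n) time; the O(n^2) worst case needs
    persistently terrible random pivots, which is exponentially unlikely."""
    pivot = random.choice(arr)
    lo = [x for x in arr if x < pivot]
    eq = [x for x in arr if x == pivot]
    if k < len(lo):
        return r_select(lo, k)               # k-th smallest is below pivot
    if k < len(lo) + len(eq):
        return pivot                         # pivot is the answer
    return r_select([x for x in arr if x > pivot],
                    k - len(lo) - len(eq))   # recurse into the upper part
```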

Big O efficiency not always foolproof?

I have been learning Big O efficiency at school as the "go to" method for describing algorithm runtimes as better or worse than others. What I want to know is: will the algorithm with the better efficiency always outperform the worst of the lot, like bubble sort, in every single situation? Are there any situations where a bubble sort or an O(n^2) algorithm will be better for a task than another algorithm with a lower O() runtime?
Generally, O() notation gives the asymptotic growth of a particular algorithm. That is, the category an algorithm falls into in terms of asymptotic growth indicates how its runtime scales as n grows (for some number of items n).
For example, we say that if a given algorithm is O(n), then it "grows linearly": as n increases, its runtime grows in proportion to n, just like any other O(n) algorithm.
That doesn't mean it takes exactly as long as any other algorithm that grows as O(n), because we disregard constant factors. For example, if the runtime of one algorithm is exactly 12n+65 ms and another's is 8n+44 ms, then for n=1000, algorithm 1 takes 12065 ms and algorithm 2 takes 8044 ms. Clearly algorithm 2 requires less time to run, but both are O(n).
There are also situations where, for small values of n, an algorithm that is O(n^2) might outperform another that is O(n), due to constants in the runtime that the analysis disregards. The toy cost models below show where the crossover happens.
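As a toy illustration, with cost constants invented purely for the example, a quadratic algorithm with a tiny constant can beat a linear algorithm with heavy per-step overhead until n passes a crossover point:

```python
# Hypothetical cost models; the constants are made up for illustration.
def cost_quadratic(n):  # an O(n^2) algorithm with a small constant
    return 2 * n * n

def cost_linear(n):     # an O(n) algorithm with heavy per-step overhead
    return 40 * n + 500

for n in (5, 10, 20, 50, 100):
    q, l = cost_quadratic(n), cost_linear(n)
    winner = "quadratic wins" if q < l else "linear wins"
    print(f"n={n:4d}  quadratic={q:6d}  linear={l:6d}  -> {winner}")
# The quadratic algorithm wins up to around n = 20, then loses badly.
```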
Basically, Big-O notation gives you an estimate of the complexity of the algorithm, and can be used to compare different algorithms. In terms of application, though, you may need to dig deeper to find out which algorithm is best suited for a given project/program.
Big O gives you the worst-case scenario: it assumes the input is in the worst possible arrangement, and it also ignores the coefficient. If you use insertion sort on an array that is reverse-sorted, it will run in n^2 time. If you use insertion sort on an already-sorted array, it will run in n time. Therefore insertion sort will run faster than many other sorting algorithms on an already-sorted list, and slower than most (reasonable) algorithms on a reverse-sorted list. (Selection sort, by contrast, is always n^2.)

Quicksort vs Median asymptotic behavior

Quicksort and Median use the same method (divide and conquer); why is it then that they have different asymptotic behavior?
Is it that quicksort may not use the proper pivot?
When you use the partition method from Quicksort to find the median, it returns the index of an element that is now in its correct sorted position. Based on this index, you only need to check the one side of the array that contains the median.
For example, suppose the array length is 5, so the median is the element at position 3. If the partition method returns 2, you only need to check the upper part of the array, from 2 to 5, not both parts as Quicksort would; see the sketch below.
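A minimal Python sketch of this idea, assuming a Lomuto-style partition and a 0-indexed k:

```python
def partition(a, lo, hi):
    """Lomuto partition: place a[hi] in its final sorted position
    within a[lo..hi] and return that index."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def select(a, k):
    """k-th smallest (0-indexed): after each partition, recurse into
    only the side that contains index k, unlike Quicksort."""
    lo, hi = 0, len(a) - 1
    while True:
        p = partition(a, lo, hi)
        if p == k:
            return a[p]
        elif p < k:
            lo = p + 1   # target lies in the upper part
        else:
            hi = p - 1   # target lies in the lower part
```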
If you use Hoare's original select algorithm, you can get the same sort of poor worst case performance that you can from Quicksort.
If you use the median of medians, then you limit the worst case, at the expense of being slower in most typical cases.
You could use the median of medians to find a pivot for Quicksort, which would have roughly the same effect--limit the worst case, at the expense of being slower in most cases.
Of course, for the sort (in general) each level of partitioning costs O(N), and you expect about log(N) levels, so you get approximately O(N log N) overall complexity.
With median finding, you also expect about O(log N) steps, but at each step you only consider the one partition from the previous step that can contain the median (or quartile, etc., that you care about). You expect the sizes of those partitions to halve (approximately) at every step, rather than always partitioning the entire input, so the total work is roughly N + N/2 + N/4 + ... ≤ 2N, giving approximately O(N) overall instead of O(N log N).
[Note that throughout this, I'm sort of abusing big-O notation to represent expected complexity whereas big-O is really supposed to represent the upper-bound (i.e., worst-case) complexity.]

Selection Algorithm Runtime

I am trying to figure out the most optimal way to compute a top-k query on some aggregation of data, let's say an array. I used to think the best way was to run through the array while maintaining a heap or balanced binary tree of size k, leveraging that to compute the top-k value. Now I have run across the selection algorithm, which supposedly runs even faster. I understand how the selection algorithm works and how to implement it; I am just a little confused as to how it runs in O(n). I feel like in order for it to run in O(n) you would have to be extremely lucky: if you keep picking a random pivot point and partitioning around it, it could very well be the case that you end up basically sorting almost the entire array before stumbling upon your kth index. Are there any optimizations, such as not picking a random pivot? Or is my heap/tree method good enough for most cases?
What you're talking about there is quickselect, also known as Hoare's selection algorithm.
It does have O(n) average-case performance, but its worst-case performance is O(n^2).
Like quicksort, the quickselect has good average performance, but is sensitive to the pivot that is chosen. If good pivots are chosen, meaning ones that consistently decrease the search set by a given fraction, then the search set decreases in size exponentially and by induction (or summing the geometric series) one sees that performance is linear, as each step is linear and the overall time is a constant times this (depending on how quickly the search set reduces). However, if bad pivots are consistently chosen, such as decreasing by only a single element each time, then worst-case performance is quadratic: O(n^2).
In terms of choosing pivots:
The easiest solution is to choose a random pivot, which yields almost certain linear time. Deterministically, one can use median-of-3 pivot strategy (as in quicksort), which yields linear performance on partially sorted data, as is common in the real world. However, contrived sequences can still cause worst-case complexity; David Musser describes a "median-of-3 killer" sequence that allows an attack against that strategy, which was one motivation for his introselect algorithm.
One can assure linear performance even in the worst case by using a more sophisticated pivot strategy; this is done in the median of medians algorithm. However, the overhead of computing the pivot is high, and thus this is generally not used in practice. One can combine basic quickselect with median of medians as fallback to get both fast average case performance and linear worst-case performance; this is done in introselect.
(quotes from Wikipedia)
So you're fairly likely to get O(n) performance with random pivots, but, if k is small and n is large, or if you're just unlucky, the O(n log k) solution using a size-k heap or BST could outperform this.
We can't tell you with certainty which one will be faster when - it depends on (1) the exact implementations, (2) the machine it's run on, (3) the exact sizes of n and k, and finally (4) the actual data. The O(n log k) solution (sketched below) should be sufficient for most purposes.
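For reference, a minimal sketch of the heap approach, assuming you want the k largest elements; Python's heapq.nlargest does the same job in one call:

```python
import heapq

def top_k(arr, k):
    """O(n log k) top-k via a size-k min-heap of the k largest elements.

    heap[0] is always the smallest of the current top k, so each new
    element needs only one comparison plus (at most) one heap operation.
    """
    heap = []
    for x in arr:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # evict the smallest of the top k
    return sorted(heap, reverse=True)
```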

Using median selection in quicksort?

I have a slight question about Quicksort. In the case where the minimum or maximum value of the array is selected as the pivot, the partition is very inefficient, as the array size decreases by only one.
However, if I add code to select the median of the array and pivot on that, I think it will be more efficient. Since the partition algorithm is already O(N), this would give an O(N log N) algorithm.
Can this be done?
You absolutely can use a linear-time median selection algorithm to compute the pivot in quicksort. This gives you a worst-case O(n log n) sorting algorithm.
However, the constant factor on linear-time selection tends to be so high that the resulting algorithm will, in practice, be much, much slower than a quicksort that just randomly chooses the pivot on each iteration. Therefore, it's not common to see such an implementation.
A completely different approach to avoiding the O(n^2) worst case is to use an approach like the one in introsort. This algorithm monitors the recursion depth of the quicksort. If it appears that the algorithm is starting to degenerate, it switches to a different sorting algorithm (usually heapsort) with a guaranteed worst case of O(n log n). This makes the overall algorithm O(n log n) without noticeably decreasing performance. A sketch of this idea follows.
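A minimal introsort sketch; the depth cap of 2*log2(n) is a common choice but an assumption here, and the three-way partition is written with list copies for readability rather than in place:

```python
import heapq
import math
import random

def introsort(a):
    """Quicksort with random pivots and a recursion-depth cap; falls back
    to heapsort when the cap is hit, guaranteeing O(n log n) overall."""
    max_depth = 2 * max(1, math.floor(math.log2(max(len(a), 2))))
    _sort(a, 0, len(a), max_depth)

def _sort(a, lo, hi, depth):
    if hi - lo <= 1:
        return
    if depth == 0:
        a[lo:hi] = heapsort(a[lo:hi])  # degenerating: guaranteed fallback
        return
    pivot = a[random.randrange(lo, hi)]
    # three-way partition of the slice around the pivot
    lt = [x for x in a[lo:hi] if x < pivot]
    eq = [x for x in a[lo:hi] if x == pivot]
    gt = [x for x in a[lo:hi] if x > pivot]
    a[lo:hi] = lt + eq + gt
    _sort(a, lo, lo + len(lt), depth - 1)
    _sort(a, lo + len(lt) + len(eq), hi, depth - 1)

def heapsort(xs):
    """Sort a list by heapifying it and popping elements in order."""
    heapq.heapify(xs)
    return [heapq.heappop(xs) for _ in range(len(xs))]
```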
Hope this helps!
