How to modify Lomuto partition scheme?

Lomuto partition is a simple partition algorithm used in quicksort. The Lomuto algorithm partitions the subarray A[left] ... A[right] and assumes A[left] to be the pivot. How can this algorithm be modified to partition A[left] ... A[right] around a given pivot P (which differs from A[left])?

Lomuto's partitioning algorithm depends on the pivot being the leftmost element of the subarray being partitioned. It can also be modified to use the rightmost element of the subarray as the pivot instead; for instance, see Chapter 7 of CLRS.
Using an arbitrary value for the pivot (say, something not in the subarray) would screw things up in a quicksort implementation because there would be no guarantee that your partition made the problem any smaller. Say you had zero as the value you pivoted on but all N array entries were positive. Then your partition would give a zero-length array of elements <= 0 and an array of length N containing the elements >= 0 (which is all of them). You'd get an infinite loop trying to do quicksort in that case. The same goes if you were trying to find the median of the array using that modified form of Lomuto's partition. The partition depends critically on choosing an element from the array to pivot on. You'd basically lose the postcondition that an element (the pivot) is fixed in place for good after the partition, which Lomuto's partition guarantees.
Lomuto's algorithm also depends critically on pivoting on an element that is either in the first or last position of the array being partitioned. If you pivot on an element not located at the very front or very end of the array, maintaining the loop invariant that is the core of why Lomuto's partition works would be a nightmare.
You can pivot on a different element of the array by swapping it with the first (or last, if you implement it that way) element as the first step; a sketch of this appears below. Check MIT's video lecture on Quicksort for course 6.046J, where they discuss Lomuto's partitioning algorithm in depth (though they just call it Partition) and a vanilla implementation of quicksort based on it, not to mention some great probability in discussing the expected runtime of a randomized form of quicksort:
http://www.youtube.com/watch?v=vK_q-C-kXhs
CLRS and Programming Pearls both have great sections on quicksort if perhaps you're stuck using an inferior book for an algorithms class or something.
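For example, here is a minimal C sketch of that swap-first step (the function names are mine), using a CLRS-style Lomuto partition that pivots on the last element:

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* CLRS-style Lomuto partition: pivots on a[hi] and returns the pivot's
   final index, which is fixed in place for good. */
static int lomuto(int a[], int lo, int hi)
{
    int pivot = a[hi];
    int i = lo;
    for (int j = lo; j < hi; j++)
        if (a[j] <= pivot)
            swap(&a[j], &a[i++]);
    swap(&a[i], &a[hi]);
    return i;
}

/* Pivot on an arbitrary index p: swap a[p] into the pivot position first. */
static int lomuto_at(int a[], int lo, int hi, int p)
{
    swap(&a[p], &a[hi]);
    return lomuto(a, lo, hi);
}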

It depends on how you define P: is P an index or a particular value?
If it is an index, then it is easy: you modify your two scans to compare against the pivot's value (saved up front, since swaps can move the element at index p) and loop until the scans cross:
...
pivot = a[p]
i = left
j = right
while (i <= j)
    while (a[i] < pivot) i++
    while (a[j] > pivot) j--
    if (i <= j)
        swap(a, i, j)
        i++
        j--
qsort(a, left, j)
qsort(a, i, right)
...
If P is not an index but a particular value, then you would need to search for it first, and only then do the above with the resulting index. Because the array is not sorted yet, you can only search linearly. You could also come up with a cleverer scheme (a hash table) for finding your pivot P, but I don't see why you would need such a thing.
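In C, that search step might look like this minimal sketch (the function name is mine), after which the returned index can be fed to the index-based partition above:

/* Sketch: locate the pivot value P by linear scan; the subarray is not
   sorted yet, so nothing faster than O(n) is possible here. */
int find_pivot_index(const int a[], int left, int right, int P)
{
    for (int p = left; p <= right; p++)
        if (a[p] == P)
            return p;
    return -1;    /* P is absent: the caller must handle this case */
}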

Related

Why does quicksort exclude the middle element but mergesort includes it?

I was going over the implementation of quicksort (from CLRS 3rd Edition). I found that the recursive divide of the array goes from the low index to middle-1 and then again from middle+1 to high.
QUICKSORT(A, p, r)
    if p < r
        q = PARTITION(A, p, r)
        QUICKSORT(A, p, q - 1)
        QUICKSORT(A, q + 1, r)
And the implementation of the merge sort is given as follows:
MERGE-SORT(A, p, r)
    if p < r
        q = ⌊(p + r) / 2⌋
        MERGE-SORT(A, p, q)
        MERGE-SORT(A, q + 1, r)
        MERGE(A, p, q, r)
Since both use the same divide strategy, why does quicksort exclude the middle element (the recursive calls go from p to q-1 and from q+1 to r, so q is not included), while merge sort includes it?
Quicksort puts all the elements smaller than the pivot on one side and all elements bigger on the other side. After this step we know the final position of the pivot will be between those two, and that's where we put it, so we don't need to look at it again. For example, partitioning [3, 8, 2, 5, 1, 4, 7, 6] around the pivot 6 might yield [3, 2, 5, 1, 4, 6, 7, 8]: the 6 is now in its final sorted position, and only [3, 2, 5, 1, 4] and [7, 8] remain to be sorted.
Thus we can exclude the pivot element from the recursive calls.
Mergesort just picks the middle position and doesn't do anything with that element until later. There's no guarantee that the element in that position will already be in the right place, thus we need to look at that element again later on.
Thus we must include the middle element in the recursive calls.
Both methods exploit the divide-and-conquer strategy, but in different ways.
Mergesort (the most common implementation) divides the array recursively into parts of equal (where possible) size; the middle indexes are fixed positions for a given array length. The recursive calls then process the left part and the right part of the array completely.
Quicksort's partition subroutine places the pivot element in its required (final) position, which in most cases is not the middle. There is no need to process this element further, and the recursive calls handle the pieces before and after it.

Cormen quick sort modify partition function

I am learning quicksort from Introduction to Algorithms. I got stuck on question 7.1-2 of Chapter 7, Quicksort:
"What value of q does PARTITION return when all elements in the array A[p…r] have the same value? Modify PARTITION so that q = ⌊(p+r)/2⌋ when all elements in the array A[p…r] have the same value."
The first part is easy and the answer is definitely r. But I can't even figure out what the second part is asking. I mean, what is the reason for setting the pivot to ⌊(p+r)/2⌋? Furthermore, I can't understand the solutions I found when searching on Google.
Please help me understand what the advantage of this modification is when all elements are equal, and if possible please provide the algorithm to do so.
By setting the pivot to the middle of p and r, we divide the array of size n into two sub-problems of equal size n/2. If you draw the recursion tree for the following recurrence, you will see that its height is O(lg n):
T(n) = 2T(n/2) + O(n)
Imagine if the position of the pivot returned from the partition were always the last element in the array. Then the recurrence for the running time would be
T(n) = T(n-1) + O(n)
Do you see now why it is inefficient when the recursion tree is like a linked list? Try drawing the tree and adding up the costs at each node in both cases.
Also, modifying the partition method to return ⌊(p+r)/2⌋ is easy.
Hint: one easy way is to split the <= in the if condition of PARTITION into separate < and == cases, and count an element equal to the pivot as "low" only every other time; a sketch follows.
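For instance, here is a minimal C sketch of that idea (the helper names are mine, not CLRS's). Elements strictly less than the pivot always go to the low side; elements equal to the pivot go there only every other time:

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* CLRS PARTITION, modified so that an all-equal subarray A[p..r]
   yields q = floor((p + r) / 2). */
int partition_mid(int A[], int p, int r)
{
    int x = A[r];        /* pivot */
    int i = p - 1;
    int take_equal = 0;  /* alternates for elements equal to the pivot */
    for (int j = p; j < r; j++) {
        int low = A[j] < x;
        if (A[j] == x) {             /* the equal case, handled separately */
            low = take_equal;
            take_equal = !take_equal;
        }
        if (low)
            swap(&A[++i], &A[j]);
    }
    swap(&A[i + 1], &A[r]);
    return i + 1;
}

On an all-equal A[p..r], exactly ⌊(r-p)/2⌋ of the r-p compared elements count as low, so the returned q is p + ⌊(r-p)/2⌋ = ⌊(p+r)/2⌋, and the recurrence becomes T(n) = 2T(n/2) + O(n) instead of T(n) = T(n-1) + O(n).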

Big Shot IT company interview puzzle

This past week I attended a couple of interviews at a few big IT companies. One question left me a bit puzzled. Below is an exact description of the problem (from one of the interview questions websites).
Given the data set,
A,B,A,C,A,B,A,D,A,B,A,C,A,B,A,E,A,B,A,C,A,B,A,D,A,B,A,C,A,B,A,F
which can be reduced to
(A, 16), (B, 8), (C, 4), (D, 2), (E, 1), (F, 1)
using the (value, frequency) format.
for a total of m of these tuples, stored in no specific order. Devise an O(m) algorithm that returns the kth order statistic of the data set. Here m is the number of tuples, as opposed to n, which is the total number of elements in the data set.
You can use Quick-Select to solve this problem.
Naively:
Pick an element (called the pivot) from the array
Put things less than or equal to the pivot on the left of the array, those greater on the right.
If the pivot is in position k, then you're done. If it's greater than k, then repeat the algorithm on the left side of the array. If it's less than k, then repeat the algorithm on the right side of the array.
There are a couple of details:
You need to either pick the pivot randomly (if you're happy with expected O(m) as the cost), or use a deterministic median algorithm.
You need to be careful to not take O(m^2) time if there's lots of values equal to the pivot. One simple way to do this is to do a second pass to split the array into 3 parts rather than 2: those less than the pivot, those equal to the pivot, and those greater than the pivot.
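Putting this together, here is a minimal C sketch (the type and function names are mine) that combines the random pivot with the three-part split; k is a 1-based rank in the expanded data set:

#include <stdlib.h>

typedef struct { int value; long freq; } Tuple;

static void swapt(Tuple *a, Tuple *b) { Tuple t = *a; *a = *b; *b = t; }

/* Sketch: expected-O(m) quickselect over (value, frequency) tuples. */
int kth_of_tuples(Tuple t[], int m, long k)
{
    int lo = 0, hi = m - 1;
    for (;;) {
        /* random pivot guards against adversarial inputs */
        int pv = t[lo + rand() % (hi - lo + 1)].value;
        /* 3-way partition of t[lo..hi]: < pv | == pv | > pv */
        int lt = lo, i = lo, gt = hi;
        while (i <= gt) {
            if (t[i].value < pv)      swapt(&t[i++], &t[lt++]);
            else if (t[i].value > pv) swapt(&t[i], &t[gt--]);
            else                      i++;
        }
        /* total frequency of the "less" and "equal" regions */
        long less = 0, equal = 0;
        for (int j = lo; j < lt; j++)  less  += t[j].freq;
        for (int j = lt; j <= gt; j++) equal += t[j].freq;
        if (k <= less)              hi = lt - 1;   /* kth lies in the left part */
        else if (k <= less + equal) return pv;     /* the pivot value is the answer */
        else { k -= less + equal;   lo = gt + 1; } /* kth lies in the right part */
    }
}

Each pass costs O(hi - lo) and the surviving range shrinks geometrically in expectation, giving expected O(m) overall; the whole equal region is discarded in one step, which is what avoids the O(m^2) trap mentioned above.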

Clarify the swap criteria in a Quicksort?

In a quicksort, the idea is that you keep selecting a pivot. Then you swap a value you find on the left that is greater than the pivot with a value you find on the right that is less than the pivot. See: ref
Just want to be 100% sure what happens in the following cases:
No value on left greater than pivot, value on right less than pivot
Value on left greater than pivot, no value on right less than pivot
No value on left greater than pivot, no value on right less than pivot
While the choice of the pivot value is important for performance, it is unimportant for the correctness of the sort.
Once you've chosen some value as the pivot, you move all values smaller than or equal to the pivot to its left, and all values greater than the pivot end up to its right.
After all these moves, the pivot value is in its final position.
Then you recursively repeat the above procedure for the sub-array to the left of the pivot value and also for the sub-array to the right of it. Of course, if the sub-arrays have 0 or 1 elements in them, there's nothing to do with them, nothing to sort.
So in this way you end up choosing a bunch of pivot values which get into their final positions after all the moves. Between those pivot values are empty or single-element sub-arrays that don't need sorting as described previously.
The swap criteria depend on the implementation. What happens in the three cases you mention depends on the partitioning scheme. There are many implementations of Quicksort, but the main two best known ones (in my opinion) are:
Hoare's partition: the first element is the pivot, and two index variables (i and j) walk the array a[] towards the center while the elements they encounter are less than / greater than the pivot. Then a[j] and a[i] are swapped. Note that in this implementation the swap happens even for elements that are equal to the pivot; this is believed to be important when your array contains many identical entries. After i and j cross, a[0] is swapped with a[j], so the pivot goes between the smaller-or-equal-to partition and the larger-or-equal-to partition (a sketch follows after this list).
Lomuto's partition: this is the one implemented in pseudocode in the current Wikipedia quicksort entry under "In-place version". Here the pivot can be anything (say, a median, or a median of three) and is swapped with the last element of a. Then only i "walks" toward the end of the array: whenever a[i] >= pivot, it is swapped with a[j] and j is decremented. At the end, the pivot is swapped with a[i+1].
(See here for instance for an illustration).
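For concreteness, here is a minimal C sketch of Hoare's scheme as described above (my reading of it; published variants differ in small details):

/* Sketch of Hoare's partition: pivots on a[lo]; i and j scan inward,
   swapping out-of-place elements (including those equal to the pivot).
   Finally the pivot is placed between the two partitions at index j. */
int hoare_partition(int a[], int lo, int hi)
{
    int pivot = a[lo];
    int i = lo + 1, j = hi;
    while (i <= j) {
        while (i <= hi && a[i] < pivot) i++;   /* walk i right */
        while (a[j] > pivot) j--;              /* walk j left  */
        if (i >= j)
            break;                             /* i and j have crossed */
        int t = a[i]; a[i] = a[j]; a[j] = t;   /* swap, even if equal to pivot */
        i++; j--;
    }
    int t = a[lo]; a[lo] = a[j]; a[j] = t;     /* fix the pivot in place */
    return j;
}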
Robert Sedgewick champions a three-way partitioning scheme, where the array is divided into three partitions: less than, equal to, and greater than the pivot. The claim is that it has better performance on arrays with lots of duplicate (identical) values. It is implemented differently again (see the link above).
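A minimal C sketch of one such scheme (Dijkstra-style three-way partitioning, not Sedgewick's exact code):

/* Sketch: rearranges a[lo..hi] into  < pivot | == pivot | > pivot  and
   reports the bounds of the equal region through *lt_out and *gt_out. */
void partition3(int a[], int lo, int hi, int *lt_out, int *gt_out)
{
    int pivot = a[lo];
    int lt = lo;       /* a[lo..lt-1]  < pivot */
    int gt = hi;       /* a[gt+1..hi]  > pivot */
    int i = lo + 1;    /* a[lt..i-1]  == pivot */
    while (i <= gt) {
        if (a[i] < pivot) {
            int t = a[i]; a[i] = a[lt]; a[lt] = t;
            lt++; i++;
        } else if (a[i] > pivot) {
            int t = a[i]; a[i] = a[gt]; a[gt] = t;
            gt--;      /* the swapped-in element is unexamined, so keep i */
        } else {
            i++;
        }
    }
    *lt_out = lt;      /* a[lt..gt] now all equal the pivot */
    *gt_out = gt;
}

Quicksort then recurses only on a[lo..lt-1] and a[gt+1..hi], so long runs of duplicates cost nothing extra.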

Quicksort: Choosing the pivot

When implementing Quicksort, one of the things you have to do is choose a pivot. But when I look at pseudocode like that below, it is not clear how I should choose the pivot. The first element of the list? Something else?
function quicksort(array)
    var list less, greater
    if length(array) ≤ 1
        return array
    select and remove a pivot value pivot from array
    for each x in array
        if x ≤ pivot then append x to less
        else append x to greater
    return concatenate(quicksort(less), pivot, quicksort(greater))
Can someone help me grasp the concept of choosing a pivot, and whether different scenarios call for different strategies?
Choosing a random pivot minimizes the chance that you will encounter worst-case O(n^2) performance (always choosing the first or last would cause worst-case performance for nearly-sorted or nearly-reverse-sorted data). Choosing the middle element would also be acceptable in the majority of cases.
Also, if you are implementing this yourself, there are versions of the algorithm that work in-place (i.e. without creating two new lists and then concatenating them).
It depends on your requirements. Choosing a pivot at random makes it harder to create a data set that generates O(n^2) performance. 'Median-of-three' (first, last, middle) is also a way of avoiding problems. Beware of the relative performance of comparisons, though; if your comparisons are costly, then Mo3 does more comparisons than choosing a single pivot value at random. Database records can be costly to compare.
Update: Pulling comments into answer.
mdkess asserted:
'Median of 3' is NOT first last middle. Choose three random indexes, and take the middle value of this. The whole point is to make sure that your choice of pivots is not deterministic - if it is, worst case data can be quite easily generated.
To which I responded:
Analysis Of Hoare's Find Algorithm With Median-Of-Three Partition (1997)
by P Kirschenhofer, H Prodinger, C Martínez supports your contention (that 'median-of-three' is three random items).
There's an article described at portal.acm.org that is about 'The Worst Case Permutation for Median-of-Three Quicksort' by Hannu Erkiö, published in The Computer Journal, Vol 27, No 3, 1984. [Update 2012-02-26: Got the text for the article. Section 2 'The Algorithm' begins: 'By using the median of the first, middle and last elements of A[L:R], efficient partitions into parts of fairly equal sizes can be achieved in most practical situations.' Thus, it is discussing the first-middle-last Mo3 approach.]
Another short article that is interesting is by M. D. McIlroy, "A Killer Adversary for Quicksort", published in Software-Practice and Experience, Vol. 29(0), 1-4 (1999). It explains how to make almost any Quicksort behave quadratically.
AT&T Bell Labs Tech Journal, Oct 1984 "Theory and Practice in the Construction of a Working Sort Routine" states "Hoare suggested partitioning around the median of several randomly selected lines. Sedgewick [...] recommended choosing the median of the first [...] last [...] and middle". This indicates that both techniques for 'median-of-three' are known in the literature. (Update 2014-11-23: The article appears to be available at IEEE Xplore or from Wiley — if you have membership or are prepared to pay a fee.)
'Engineering a Sort Function' by J L Bentley and M D McIlroy, published in Software Practice and Experience, Vol 23(11), November 1993, goes into an extensive discussion of the issues, and they chose an adaptive partitioning algorithm based in part on the size of the data set. There is a lot of discussion of trade-offs for various approaches.
A Google search for 'median-of-three' works pretty well for further tracking.
Thanks for the information; I had only encountered the deterministic 'median-of-three' before.
Heh, I just taught this class.
There are several options.
Simple: Pick the first or last element of the range. (bad on partially sorted input)
Better: Pick the item in the middle of the range. (better on partially sorted input)
However, picking any arbitrary element runs the risk of poorly partitioning the array of size n into two arrays of size 1 and n-1. If you do that often enough, your quicksort runs the risk of becoming O(n^2).
One improvement I've seen is to pick median(first, last, mid).
In the worst case, it can still go to O(n^2), but probabilistically, this is a rare case.
For most data, picking the first or last is sufficient. But if you find that you're running into worst-case scenarios often (partially sorted input), the first option would be to pick the central value (which is a statistically good pivot for partially sorted data).
If you're still running into problems, then go the median route.
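A minimal C sketch of that median(first, last, mid) pick, returning an index so the caller can swap it into the pivot position (the function names are mine):

/* Sketch: index of the median of a[i], a[j], a[k] (at most 3 comparisons) */
int med3(const int a[], int i, int j, int k)
{
    if (a[i] < a[j])
        return a[j] < a[k] ? j : (a[i] < a[k] ? k : i);
    else
        return a[i] < a[k] ? i : (a[j] < a[k] ? k : j);
}

/* Typical use: pick median(first, mid, last) as the pivot index */
int pick_pivot(const int a[], int lo, int hi)
{
    return med3(a, lo, lo + (hi - lo) / 2, hi);
}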
Never ever choose a fixed pivot - this can be attacked to exploit your algorithm's worst-case O(n^2) runtime, which is just asking for trouble. Quicksort's worst-case runtime occurs when partitioning results in one array of 1 element and one array of n-1 elements. Suppose you choose the first element as your partition. If someone feeds an array to your algorithm that is in decreasing order, your first pivot will be the biggest, so everything else in the array will move to the left of it. Then when you recurse, the first element will be the biggest again, so once more you put everything to the left of it, and so on.
A better technique is the median-of-3 method, where you pick three elements at random and choose the middle. You know that the element you choose won't be the first or the last, but also, by the central limit theorem, the distribution of the middle element will be normal, which means that you will tend towards the middle (and hence, n log(n) time).
If you absolutely want to guarantee O(n log(n)) runtime for the algorithm, the median-of-medians (groups of 5) method for finding the median of an array runs in O(n) time, which means that the recurrence equation for quicksort in the worst case will be:
T(n) = O(n) (find the median) + O(n) (partition) + 2T(n/2) (recurse left and right)
By the Master Theorem, this is O(n log(n)). However, the constant factor will be huge, and if worst-case performance is your primary concern, use a merge sort instead, which is only a little bit slower than quicksort on average and guarantees O(n log(n)) time (and will be much faster than this lame median quicksort).
Explanation of the Median of Medians Algorithm
Don't try to get too clever and combine pivoting strategies. If you combine median-of-3 with a random pivot by picking the median of the first, last, and a random index in the middle, then you'll still be vulnerable to many of the distributions which send median-of-3 quadratic (so it's actually worse than a plain random pivot).
E.g., on a pipe-organ distribution (1, 2, 3, ..., N/2, ..., 3, 2, 1), the first and last will both be 1 and the random index will be some number greater than 1; taking the median gives 1 (either first or last) and you get an extremely unbalanced partitioning.
It is easier to understand quicksort if you break it into three sections:
Exchange or swap data element function
The partition function
Processing the partitions
It is only slightly less efficient than one long function but is a lot easier to understand.
Code follows:
#include <stdlib.h>   /* for rand() */

/* This selects what the data type in the array to be sorted is */
#define DATATYPE long

/* This is the swap function .. your job is to swap the data pointed to by
   x & y .. how depends on the data type .. the example works for normal
   numerical data types .. like the long I chose above */
void swap(DATATYPE *x, DATATYPE *y)
{
    DATATYPE temp;
    temp = *x;  // Hold current x value
    *x = *y;    // Transfer y to x
    *y = temp;  // Set y to the held old x value
}

/* This is the partition code */
int partition(DATATYPE list[], int l, int h)
{
    int i;
    int p;          // pivot element index
    int firsthigh;  // divider position for pivot element

    /* Random pivot shown; for the middle element, p = (l + h) / 2 would be used */
    p = l + rand() % (h - l + 1);  // Random partition point
    swap(&list[p], &list[h]);      // Move the pivot to the end
    firsthigh = l;                 // Divider starts at the left
    for (i = l; i < h; i++)
        if (list[i] < list[h]) {               // Value at i is less than the pivot
            swap(&list[i], &list[firsthigh]);  // So swap it into the low section
            firsthigh++;                       // Increment firsthigh
        }
    swap(&list[h], &list[firsthigh]);  // Put the pivot between the two sections
    return firsthigh;                  // Return the pivot's final position
}

/* Finally the body sort */
void quicksort(DATATYPE list[], int l, int h)
{
    int p;  // index of partition
    if ((h - l) > 0) {
        p = partition(list, l, h);  // Partition list
        quicksort(list, l, p - 1);  // Sort lower partition
        quicksort(list, p + 1, h);  // Sort upper partition
    }
}
It is entirely dependent on how your data is sorted to begin with. If you think it will be pseudo-random, then your best bet is either to pick a random selection or to choose the middle.
If you are sorting a random-access collection (like an array), it's generally best to pick the physical middle item. With this, if the array is already sorted (or nearly sorted), the two partitions will be close to even, and you'll get the best speed.
If you are sorting something with only linear access (like a linked list), then it's best to choose the first item, because it's the fastest item to access. Here, however, if the list is already sorted, you're screwed -- one partition will always be null and the other will have everything, producing the worst time.
However, for a linked list, picking anything besides the first will just make matters worse. To pick the middle item in a linked list, you'd have to step through it on each partition step -- adding an O(N/2) operation which is done log N times, making the total time O(1.5 N log N). And that's if we know how long the list is before we start -- usually we don't, so we'd have to step all the way through to count them, then step half-way through to find the middle, then step through a third time to do the actual partition: O(2.5 N log N).
Ideally the pivot should be the median value of the entire array.
This will reduce the chances of getting worst-case performance.
In a truly optimized implementation, the method for choosing pivot should depend on the array size - for a large array, it pays off to spend more time choosing a good pivot. Without doing a full analysis, I would guess "middle of O(log(n)) elements" is a good start, and this has the added bonus of not requiring any extra memory: Using tail-call on the larger partition and in-place partitioning, we use the same O(log(n)) extra memory at almost every stage of the algorithm.
Quicksort's complexity varies greatly with the selection of the pivot value. For example, if you always choose the first element as the pivot, the algorithm's complexity becomes as bad as O(n^2). Here is a smart method to choose the pivot element:
1. Choose the first, middle, and last elements of the array.
2. Compare these three numbers and find the one that is greater than one and smaller than the other, i.e., the median.
3. Make this element the pivot.
Choosing the pivot by this method splits the array into nearly equal halves, and hence the complexity reduces to O(n log(n)).
On average, median-of-3 is good for small n. Median-of-5 is a bit better for larger n. The ninther, which is the "median of three medians of three", is even better for very large n.
The higher you go with sampling, the better you get as n increases, but the improvement slows down dramatically as you increase the number of samples. And you incur the overhead of gathering and sorting the samples.
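As an illustration, a minimal C sketch of the ninther, reusing the med3 helper sketched earlier (the sample spacing is one plausible choice; it assumes at least nine elements):

/* Sketch: the ninther = median of the medians of three triples sampled
   across a[lo..hi]; assumes hi - lo + 1 >= 9. */
int ninther(const int a[], int lo, int hi)
{
    int s = (hi - lo + 1) / 8;   /* spread the nine samples across the range */
    int m1 = med3(a, lo,         lo + s,     lo + 2 * s);
    int m2 = med3(a, lo + 3 * s, lo + 4 * s, lo + 5 * s);
    int m3 = med3(a, lo + 6 * s, lo + 7 * s, hi);
    return med3(a, m1, m2, m3);
}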
I recommend using the middle index, as it can be calculated easily.
You can calculate it by rounding (array.length / 2).
If you choose the first or the last element in the array, then there is a high chance that the pivot is the smallest or the largest element of the array, and that is bad.
Why?
Because in that case the number of elements smaller / larger than the pivot is 0, and this repeats as follows.
Consider the size of the array to be n. Then,
(n) + (n - 1) + (n - 2) + ... + 1 = O(n^2)
Hence, the time complexity increases to O(n^2) from O(n log n). So I highly recommend using the median or a random element of the array as the pivot.
