Does the non-parallel sample sort have the same complexity as quick sort?

According to Wikipedia and other resources, quick sort happens to be a special case of sample sort: we always choose one partitioning item, put it in its place, and continue the sort, so quick sort is sample sort where m (the number of partitioning items at each step) is 1. So my question is: for 1 < m < n, does it have the same complexity as quick sort when it's not parallel?
The following is the algorithm for sample sort as described on Wikipedia.
1) Find splitters, values that break up the data into buckets, by sampling the data.
2) Use the sorted splitters to define buckets and place data in appropriate buckets.
3) Sort each of the buckets.
I am not exactly sure I understand this algorithm correctly, but I think we first find the partitioning item, put it in its place, then look to the left and to the right to find more partitioning items there, and then recursively call the same function to partition each one of those m samples into m samples again. Am I right? Because if so, it seems that sample sort performs the same as quick sort, since it simply does the same thing, except that half of it is done iteratively (when looking for splitters) and half of it recursively.

They will have different complexity. When m > 1, the running time will be approximately C·N·log(N), with the logarithm taken to base m + 1. The constant C will be large enough to make it slower than ordinary QuickSort, because there is no known algorithm that partitions a list into m + 1 buckets as efficiently as partitioning a list into two buckets.
For example, normal QuickSort takes O(N) to partition the list into two subarrays. Assume that in the best case QuickSort perfectly chooses a value that splits the list into two buckets of the same size. Then:
C(n) = 2·C(n/2) + n, which solves to C(n) = n·log₂(n)
Now let's assume that m = 2, which means we need to partition the list into three subarrays. Assume again that in the best case we can perfectly choose values that split the list into three buckets of the same size, but that the cost of the partition step is on the order of 3N. Then:
C(n) = 3·C(n/3) + 3n, which solves to C(n) = 3·n·log₃(n)
As you can see, 3·n·log₃(n) > n·log₂(n).
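To see how much larger the three-way cost is, apply a change of base (log₃(n) = log₂(n) / log₂(3)):
3·n·log₃(n) = (3 / log₂(3)) · n·log₂(n) ≈ 1.89 · n·log₂(n)
So even with a perfect three-way split, this model does roughly 1.9 times the work of the two-way version, before accounting for the larger constant C.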

Related

Time complexity for array management (Algorithm)

I'm working on a program that takes in a bunch (y) of integers and then needs to return the x highest integers in order. This code needs to be as fast as possible, but at the moment I don't think I have the best algorithm.
My approach so far is to create a sorted list (high to low) of the integers that have already been input and then handle each item as it comes in. For the first x items, I maintain a sorted array of integers, and when each new item comes in, I figure out where it should be placed using a binary search. (I'm also considering just taking in the first x items and then quicksorting them, but I don't know if this is faster.) After the first x items have been sorted, I then consider each of the remaining items by first seeing if it qualifies to enter the already sorted list of highest integers (by checking whether the new integer is greater than the integer at the end of the list); if it does, I add it to the sorted list via a binary search and remove the integer at the end of the list.
I was wondering if anyone had any advice on how I can make this faster, or perhaps an entirely new approach that is faster than this. Thanks.
This is a partial sort:
The fastest implementation is Quicksort where you only recurse on ranges containing the bottom/top k elements.
In C++ you can just use std::partial_sort
If you use a heap-ordered tree data structure to store the integers, inserting a new integer takes no more than lg N comparisons and removing the maximum takes no more than 2 lg N comparisons. Thus, to insert y items would require no more than y lg N comparisons and to remove the top x items would require no more than 2x lg N comparisons. The Wikipedia entry has references to a range of implementations.
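To make the heap idea concrete, here is a minimal C sketch. It is not from the answer above: instead of inserting all y items and then removing the top x, it keeps a bounded min-heap of at most x elements, so the heap always holds the x largest values seen so far; the names (topx_offer, X) are made up for this example.

#include <stdio.h>

#define X 5  /* number of top elements to keep (illustrative) */

/* Min-heap stored in heap[0..size-1]; heap[0] is the smallest kept value. */
static int heap[X];
static int size = 0;

static void sift_down(int i) {
    for (;;) {
        int left = 2 * i + 1, right = 2 * i + 2, smallest = i;
        if (left  < size && heap[left]  < heap[smallest]) smallest = left;
        if (right < size && heap[right] < heap[smallest]) smallest = right;
        if (smallest == i) return;
        int tmp = heap[i]; heap[i] = heap[smallest]; heap[smallest] = tmp;
        i = smallest;
    }
}

static void sift_up(int i) {
    while (i > 0 && heap[(i - 1) / 2] > heap[i]) {
        int p = (i - 1) / 2;
        int tmp = heap[i]; heap[i] = heap[p]; heap[p] = tmp;
        i = p;
    }
}

/* Offer one incoming value; O(lg x) per item, so O(y lg x) for y items. */
static void topx_offer(int value) {
    if (size < X) {                 /* heap not yet full: just insert    */
        heap[size++] = value;
        sift_up(size - 1);
    } else if (value > heap[0]) {   /* beats the smallest kept value     */
        heap[0] = value;            /* replace the root ...              */
        sift_down(0);               /* ... and restore the heap order    */
    }
}

int main(void) {
    int input[] = {7, 42, 3, 19, 23, 8, 99, 1, 56, 4};
    for (int i = 0; i < 10; i++)
        topx_offer(input[i]);
    /* heap[] now holds the X largest inputs, in heap (not sorted) order */
    for (int i = 0; i < size; i++)
        printf("%d ", heap[i]);
    printf("\n");
    return 0;
}

A final sort of the x kept values (trivial for small x) puts them in order.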
This is called a top-N sort. Here is a very simple and efficient scheme. No fancy data structures needed.
1. Keep a list of the highest x elements (it starts out empty).
2. Split your input into chunks of x * 10 items.
3. For each chunk, add the remembered list of the x highest items so far to it and sort it (e.g. quick sort).
4. Keep the x highest items. They form the new remembered list.
5. Go to step 3 until all chunks are processed.
6. The remembered list is now your final result.
This is O(N) in the number of items and only requires a normal quick sort as a primitive.
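A minimal C sketch of this scheme, assuming plain ints: the C library's qsort stands in for the normal quick sort primitive, the 10 * x chunk size follows the steps above, and the names (top_x, best, buf) are mine.

#include <stdlib.h>
#include <string.h>

/* Sort descending, so the x highest items end up at the front. */
static int cmp_desc(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x < y) - (x > y);
}

/*
 * Fill best[] with the x largest values of input[0..n-1] (fewer if n < x).
 * Processes the input in chunks of 10*x items, always carrying the current
 * best x along into the next sort.
 */
void top_x(const int *input, size_t n, int *best, size_t x) {
    size_t chunk = 10 * x;
    int *buf = malloc((chunk + x) * sizeof *buf);   /* chunk plus carried list     */
    size_t kept = 0;                                /* how many best values so far */
    size_t pos = 0;

    while (pos < n) {
        size_t take = (n - pos < chunk) ? n - pos : chunk;
        memcpy(buf, best, kept * sizeof *buf);               /* carried list */
        memcpy(buf + kept, input + pos, take * sizeof *buf); /* new chunk    */
        qsort(buf, kept + take, sizeof *buf, cmp_desc);
        kept = (kept + take < x) ? kept + take : x;          /* keep top x   */
        memcpy(best, buf, kept * sizeof *buf);
        pos += take;
    }
    free(buf);
}

Each input element takes part in only one sort of size about 11x, so for fixed x the total work is linear in N.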
You don't seem to need the top N items in sorted order. Because of this, you can solve this in linear time.
Find the Nth largest array element using linear-time selection. Return it and all array elements larger than it.
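For reference, one way the linear-time selection step might look in C: a quickselect built on a Lomuto-style partition with a random pivot, arranged so the k largest elements end up in the first k slots. This is an expected (not worst-case) linear-time sketch, and the names select_top_k and partition_rand are mine.

#include <stdlib.h>

/* Partition a[lo..hi] descending around a random pivot (Lomuto scheme);
   returns the pivot's final index. */
static int partition_rand(int *a, int lo, int hi) {
    int p = lo + rand() % (hi - lo + 1);
    int tmp = a[p]; a[p] = a[hi]; a[hi] = tmp;   /* move the pivot to the end    */
    int store = lo;
    for (int i = lo; i < hi; i++) {
        if (a[i] > a[hi]) {                      /* ">" puts larger values first */
            tmp = a[i]; a[i] = a[store]; a[store] = tmp;
            store++;
        }
    }
    tmp = a[hi]; a[hi] = a[store]; a[store] = tmp;
    return store;
}

/* Rearrange the array so a[0..k-1] are the k largest elements, in no
   particular order. Call as select_top_k(a, 0, n - 1, k) with 1 <= k <= n. */
void select_top_k(int *a, int lo, int hi, int k) {
    while (lo < hi) {
        int p = partition_rand(a, lo, hi);
        if (p == k - 1)
            return;           /* first k slots now hold the k largest */
        else if (p < k - 1)
            lo = p + 1;       /* still missing some of the large side */
        else
            hi = p - 1;       /* pivot landed too far right: go left  */
    }
}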

Algorithm for finding mutual name in lists

I've been reading up on algorithms from the book Algorithms by Robert Sedgewick and I've been stuck on an exercise problem for a while. Here is the question:
Given 3 lists of N names each, find an algorithm to determine if there is any name common to all three lists. The algorithm must have O(N log N) complexity. You're only allowed to use sorting algorithms, and the only data structures you can use are stacks and queues.
I figured I could solve this problem using a HashMap, but the question restricts us from doing so. Even then, that still wouldn't have a complexity of O(N log N).
If you sort each of the lists, then you can check whether all three lists share any one name in O(n) time: pick the first name of list A and compare it to the first name in list B; if the element in B is less than that of list A, pop the list B element and repeat until the front of list B is >= the front of list A. If you find a match, repeat the process on C. If you find a match in C as well, return true; otherwise move on to the next element in A.
Now you have to sort all of the lists in O(n log n) time, which you can do with your favorite sorting algorithm, though you will have to be a little creative using just stacks and queues. I would probably recommend merge sort.
The pseudocode below is a little messed up because I am modifying the lists I am iterating over.
pseudocode:
assume listA, listB and listC are sorted queues where the smallest name is at the front of the queue

eltB = listB.pop()
eltC = listC.pop()
for eltA in listA:
    while (eltB <= eltA):
        if eltB == eltA:
            while (eltC <= eltB):
                if eltB == eltC:
                    return true
                if eltC < eltB:
                    eltC = listC.pop()
        eltB = listB.pop()
Steps:
1. Sort the three lists using an O(N lg N) sorting algorithm.
2. Pop one item from each list.
3. If any of the lists from which you tried to pop is empty, then you are done, i.e. no common element exists.
4. Else, compare the three elements.
5. If the elements are equal, you are done - you have found the common element.
6. Else, keep the maximum of the three elements (constant time) and replenish from the same lists from which the smaller elements were discarded.
7. Go to step 3.
Step 1 takes O(N lgN) and the rest of the steps take O(N), so the overall complexity is O(N lgN).
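Here is a minimal C sketch of steps 2-7, assuming step 1 has already produced three lists sorted ascending; plain arrays with head indices stand in for the queues, and the function name common_name is made up for this example.

#include <stddef.h>
#include <string.h>

/*
 * a, b, c are sorted ascending (step 1 already done, e.g. with merge sort).
 * Returns a pointer to a name common to all three lists, or NULL if none.
 */
const char *common_name(const char **a, size_t na,
                        const char **b, size_t nb,
                        const char **c, size_t nc) {
    size_t i = 0, j = 0, k = 0;              /* step 2: "pop" one from each       */
    while (i < na && j < nb && k < nc) {     /* step 3: stop when a list is empty */
        if (strcmp(a[i], b[j]) == 0 && strcmp(b[j], c[k]) == 0)
            return a[i];                     /* step 5: all three are equal       */

        /* step 6: find the maximum and advance every list holding something smaller */
        const char *max = a[i];
        if (strcmp(b[j], max) > 0) max = b[j];
        if (strcmp(c[k], max) > 0) max = c[k];
        if (strcmp(a[i], max) < 0) i++;
        if (strcmp(b[j], max) < 0) j++;
        if (strcmp(c[k], max) < 0) k++;      /* step 7: loop back to step 3       */
    }
    return NULL;                             /* no common element                 */
}

Each iteration advances at least one index, so the scan after sorting is O(N).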

Limited Sort/Filter Algorithm

I have a rather large list of elements (100s of thousands).
I have a filter that can either accept or not accept elements.
I want the top 100 elements that satisfy the filter.
So far, I have sorted the results first and then taken the top 100 that satisfy the filter. The rationale behind this is that the filter is not entirely fast.
But right now, the sorting step is taking way longer than the filtering step, so I would like to combine them in some way.
Is there an algorithm to combine the concerns of sorting/filtering to get the top 100 results satisfying the filter without incurring the cost of sorting all of the elements?
My instinct is to select the top 100 elements from the list (much cheaper than a sort; use your favorite variant of QuickSelect). Run those through the filter, yielding n successes and 100 - n failures. If n < 100, then repeat by selecting 100 - n elements from the top of the remainder of the list:
k = 100
while (k > 0):
    select top k from list and remove them
    filter them, yielding n successes
    k = k - n
All being well this runs in time proportional to the length of the list, since each selection step runs in that time, and the number of selection steps required depends on the success rate of the filter, but not directly on the size of the list.
I expect this has some bad cases, though. If almost all elements fail the filter then it's considerably slower than just sorting everything, since you'll end up selecting thousands of times. So you might want some criteria to bail out if it's looking bad, and fall back to sorting the whole list.
It also has the problem that it will likely do a largish number of small selects towards the end, since we expect k to decay exponentially if the filter criteria are unrelated to the sort criteria. So you could probably improve it by selecting somewhat more than k elements at each step: say, k divided by the expected success rate of the filter, plus a small constant. The expected rate can be based on past performance if there's no domain knowledge you can use to predict it, and the small constant chosen experimentally to avoid an annoyingly large number of steps to find the last few elements. If at any step you end up with more items that have passed the filter than the number you're still looking for (i.e., n > k), then select the top k from the current batch of successes and you're done.
Since QuickSelect gives you the top k without sorting those k, you'll need to do a final sort of 100 elements if you need the top 100 in order.
I've solved this exact problem by using a binary tree for sorting and by keeping count of the elements to the left of the current node during insertion. See http://pub.uni-bielefeld.de/publication/2305936 (Figure 4.4 et al) for details.
If I understand right, you have two choices:
Selecting 100 elements - N operations of the filter check, then 100·lg(100) for the sort.
Sorting then selecting 100 elements - at least N·lg(N) for the sort, then the select.
The first sounds faster than sorting and then selecting.
I'd probably filter first, then insert the result of that into a priority queue. Keep track of the number of items in the PQ, and after you do the insert, if it's larger than the number you want to keep (100 in your case), pop off the smallest item and discard it.
Steve's suggestion to use Quicksort is a good one.
1 Read in the first 1000 or so elements.
2 Sort them and pick the 100th largest element.
3 Run one pass of Quicksort on the whole file with the element from step 2 as the pivot.
4 Select the upper half of the result of the Quicksort pass for further processing.
You are guaranteed at least 100 elements in the upper half of the single pass of Quicksort. Assuming the first 1000 are reasonably representative of the whole file then you should end up with about one tenth of the original elements at step 4.

Is there a way to skip empty buckets during bucket sort?

Counting sort is kind of a bucket sort. Let's assume we're using it like this:
Let A be the array to sort
Let k be the max element
Let bucket[] be an array of buckets
Let each bucket be a linked list (with a start and end pointer)
Then in pseudocode, counting sort looks like this:
Counting-Sort (A[], bucket[], k)
1.  Init bucket[]
2.  for i -> 1 to n
3.      add A[i] to bucket[A[i].key].end
4.  for i -> 1 to k
5.      concatenate bucket[i].start to bucket[0].end
6.      bucket[0].end = bucket[i].end
7.  copy bucket[0] to A
Time Complexity by lines:
1) I know there is a way (not simple but a way) to init array in O(1)
2,3) O(n)
4,5,6) O(k)
7) O(n)
This gives us a net runtime of O(k + n), which for k >> n is dominated by k, and that is bad for us. But what if we could change lines 4,5 to somehow skip the empty buckets? That way we would end up with O(n) no matter what k is.
Does anyone know how to do this? Or is it impossible?
One option would be to hold an auxiliary BST containing which buckets are actually being used. Whenever you add something to a bucket, if it's the first entry to be placed there, you would also add that bucket's value to the BST.
When you want to then go concatenate everything, you could then just iterate over the BST in sorted order, concatenating just the buckets you find.
If there are z buckets that actually get used, this takes O(n + z log z). If the number of buckets is large compared to the number actually used, this could be much faster.
More generally - if you have a way of sorting the z different buckets being used in O(f(z)) time, you can do a bucket sort in O(n + f(z)) time. Maintain a second array of the buckets you actually use, adding a bucket to the array when it's used for the first time. Before iterating over the buckets, sort the indices of the buckets in use in O(f(z)) time, then iterate across that array to determine which buckets to visit. For example, if you used y-fast trees, you could sort in O(n + z log log z) time.
Hope this helps!
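As a small illustration of the "second array of used buckets" idea, here is a counting-sort style sketch for plain integer keys (no satellite data, unlike the linked-list buckets in the question). The used[] array and the function name are made up for this example, and qsort plays the role of the O(f(z)) sort, so f(z) = z log z here.

#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort a[0..n-1], whose keys lie in 0..k-1, visiting only the buckets in use. */
void bucket_sort_skip_empty(int *a, size_t n, int k) {
    int *count = calloc((size_t)k, sizeof *count);  /* bucket sizes           */
    int *used  = malloc(n * sizeof *used);          /* distinct keys seen     */
    size_t z = 0;                                   /* number of used buckets */

    for (size_t i = 0; i < n; i++)
        if (count[a[i]]++ == 0)       /* first element landing in this bucket? */
            used[z++] = a[i];         /* remember the bucket index             */

    qsort(used, z, sizeof *used, cmp_int);  /* O(z log z): order the used buckets */

    size_t out = 0;
    for (size_t j = 0; j < z; j++)          /* visit only the non-empty buckets */
        for (int c = 0; c < count[used[j]]; c++)
            a[out++] = used[j];

    free(used);
    free(count);
}

Note that calloc still zeroes all k counters, so this keeps the O(k) initialization cost unless it is combined with the O(1)-init trick mentioned in the question.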
You can turn the bucket array into an associative array, which yields O(n log n), and I don't believe you can do better than that for sorting (on average).
O(n) is impossible in the general case.

Quicksort: Choosing the pivot

When implementing Quicksort, one of the things you have to do is to choose a pivot. But when I look at pseudocode like the one below, it is not clear how I should choose the pivot. First element of list? Something else?
function quicksort(array)
    var list less, greater
    if length(array) ≤ 1
        return array
    select and remove a pivot value pivot from array
    for each x in array
        if x ≤ pivot then append x to less
        else append x to greater
    return concatenate(quicksort(less), pivot, quicksort(greater))
Can someone help me grasp the concept of choosing a pivot and whether or not different scenarios call for different strategies.
Choosing a random pivot minimizes the chance that you will encounter worst-case O(n^2) performance (always choosing first or last would cause worst-case performance for nearly-sorted or nearly-reverse-sorted data). Choosing the middle element would also be acceptable in the majority of cases.
Also, if you are implementing this yourself, there are versions of the algorithm that work in-place (i.e. without creating two new lists and then concatenating them).
It depends on your requirements. Choosing a pivot at random makes it harder to create a data set that generates O(N^2) performance. 'Median-of-three' (first, last, middle) is also a way of avoiding problems. Beware of relative performance of comparisons, though; if your comparisons are costly, then Mo3 does more comparisons than choosing (a single pivot value) at random. Database records can be costly to compare.
Update: Pulling comments into answer.
mdkess asserted:
'Median of 3' is NOT first last middle. Choose three random indexes, and take the middle value of this. The whole point is to make sure that your choice of pivots is not deterministic - if it is, worst case data can be quite easily generated.
To which I responded:
Analysis Of Hoare's Find Algorithm With Median-Of-Three Partition (1997)
by P Kirschenhofer, H Prodinger, C Martínez supports your contention (that 'median-of-three' is three random items).
There's an article described at portal.acm.org that is about 'The Worst Case Permutation for Median-of-Three Quicksort' by Hannu Erkiö, published in The Computer Journal, Vol 27, No 3, 1984. [Update 2012-02-26: Got the text for the article. Section 2 'The Algorithm' begins: 'By using the median of the first, middle and last elements of A[L:R], efficient partitions into parts of fairly equal sizes can be achieved in most practical situations.' Thus, it is discussing the first-middle-last Mo3 approach.]
Another short article that is interesting is by M. D. McIlroy, "A Killer Adversary for Quicksort", published in Software-Practice and Experience, Vol. 29(0), 1–4 (0 1999). It explains how to make almost any Quicksort behave quadratically.
AT&T Bell Labs Tech Journal, Oct 1984 "Theory and Practice in the Construction of a Working Sort Routine" states "Hoare suggested partitioning around the median of several randomly selected lines. Sedgewick [...] recommended choosing the median of the first [...] last [...] and middle". This indicates that both techniques for 'median-of-three' are known in the literature. (Update 2014-11-23: The article appears to be available at IEEE Xplore or from Wiley — if you have membership or are prepared to pay a fee.)
'Engineering a Sort Function' by J L Bentley and M D McIlroy, published in Software Practice and Experience, Vol 23(11), November 1993, goes into an extensive discussion of the issues, and they chose an adaptive partitioning algorithm based in part on the size of the data set. There is a lot of discussion of trade-offs for various approaches.
A Google search for 'median-of-three' works pretty well for further tracking.
Thanks for the information; I had only encountered the deterministic 'median-of-three' before.
Heh, I just taught this class.
There are several options.
Simple: Pick the first or last element of the range. (bad on partially sorted input)
Better: Pick the item in the middle of the range. (better on partially sorted input)
However, picking any arbitrary element runs the risk of poorly partitioning the array of size n into two arrays of size 1 and n-1. If you do that often enough, your quicksort runs the risk of becoming O(n^2).
One improvement I've seen is pick median(first, last, mid);
In the worst case, it can still go to O(n^2), but probabilistically, this is a rare case.
For most data, picking the first or last is sufficient. But if you find that you're running into worst-case scenarios often (partially sorted input), the first option would be to pick the central value (which is a statistically good pivot for partially sorted data).
If you're still running into problems, then go the median route.
Never ever choose a fixed pivot - this can be attacked to exploit your algorithm's worst-case O(n^2) runtime, which is just asking for trouble. Quicksort's worst-case runtime occurs when partitioning results in one array of 1 element and one array of n-1 elements. Suppose you choose the first element as your pivot. If someone feeds an array to your algorithm that is in decreasing order, your first pivot will be the biggest, so everything else in the array will move to the left of it. Then when you recurse, the first element will be the biggest again, so once more you put everything to the left of it, and so on.
A better technique is the median-of-3 method, where you pick three elements at random and choose the middle one. You know that the element you choose won't be the first or the last, but also, by the central limit theorem, the distribution of the middle element will be normal, which means that you will tend towards the middle (and hence, n log(n) time).
If you absolutely want to guarantee O(nlog(n)) runtime for the algorithm, the columns-of-5 method for finding the median of an array runs in O(n) time, which means that the recurrence equation for quicksort in the worst case will be:
T(n) = O(n) (find the median) + O(n) (partition) + 2T(n/2) (recurse left and right)
By the Master Theorem, this is O(nlog(n)). However, the constant factor will be huge, and if worst case performance is your primary concern, use a merge sort instead, which is only a little bit slower than quicksort on average, and guarantees O(nlog(n)) time (and will be much faster than this lame median quicksort).
Explanation of the Median of Medians Algorithm
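For reference, the Master Theorem step being invoked for that recurrence (folding the two O(n) terms into one): T(n) = 2·T(n/2) + Θ(n), with a = 2 and b = 2, so n^(log_b a) = n; the per-level work matches that, which is the "case 2" of the theorem and gives T(n) = Θ(n·log n).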
Don't try to get too clever and combine pivoting strategies. If you combine median-of-3 with a random pivot by picking the median of the first, the last, and a random index in the middle, then you'll still be vulnerable to many of the distributions which send median-of-3 quadratic (so it's actually worse than a plain random pivot).
E.g. for a pipe-organ distribution (1, 2, 3, ..., N/2, ..., 3, 2, 1), the first and last elements will both be 1 and the random index will be some number greater than 1; taking the median gives 1 (either first or last) and you get an extremely unbalanced partitioning.
It is easier to break the quicksort into three sections when doing this:
Exchange or swap data element function
The partition function
Processing the partitions
It is only slightly less efficient than one long function but is a lot easier to understand.
Code follows:
#include <stdlib.h>   /* for rand() */

/* This selects what the data type in the array to be sorted is */
#define DATATYPE long

/* This is the swap function .. your job is to swap the data in x & y .. how depends
   on the data type .. the example works for normal numerical data types .. like the
   long I chose above */
void swap (DATATYPE *x, DATATYPE *y){
    DATATYPE Temp;
    Temp = *x;                                // Hold current x value
    *x = *y;                                  // Transfer y to x
    *y = Temp;                                // Set y to the held old x value
}

/* This is the partition code */
int partition (DATATYPE list[], int l, int h){
    int i;
    int p;                                    // pivot element index
    int firsthigh;                            // divider position for pivot element

    // Random pivot shown; for the middle element, p = (l + h) / 2 would be used
    p = l + rand() % (h - l + 1);             // Random partition point
    swap(&list[p], &list[h]);                 // Move the pivot to the end
    firsthigh = l;                            // Divider starts at the left end
    for (i = l; i < h; i++)
        if (list[i] < list[h]) {              // Value at i is less than the pivot
            swap(&list[i], &list[firsthigh]); // So swap it into the low side
            firsthigh++;                      // Increment first high
        }
    swap(&list[h], &list[firsthigh]);         // Put the pivot at the divider
    return firsthigh;                         // Return the pivot's final index
}

/* Finally the body sort */
void quicksort(DATATYPE list[], int l, int h){
    int p;                                    // index of partition
    if ((h - l) > 0) {
        p = partition(list, l, h);            // Partition list
        quicksort(list, l, p - 1);            // Sort lower partition
        quicksort(list, p + 1, h);            // Sort upper partition
    }
}
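A tiny usage sketch (not part of the original answer), just to show how quicksort is called on a whole array:

#include <stdio.h>

int main(void) {
    DATATYPE data[] = {42, 7, 19, 3, 88, 23, 5};
    int n = sizeof data / sizeof data[0];

    quicksort(data, 0, n - 1);          /* sort indices 0 .. n-1 inclusive */

    for (int i = 0; i < n; i++)
        printf("%ld ", data[i]);        /* %ld matches DATATYPE == long    */
    printf("\n");
    return 0;
}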
It is entirely dependent on how your data is sorted to begin with. If you think it will be pseudo-random then your best bet is to either pick a random selection or choose the middle.
If you are sorting a random-access collection (like an array), it's generally best to pick the physical middle item. With this, if the array is already sorted (or nearly sorted), the two partitions will be close to even, and you'll get the best speed.
If you are sorting something with only linear access (like a linked list), then it's best to choose the first item, because it's the fastest item to access. Here, however, if the list is already sorted, you're screwed -- one partition will always be empty and the other will have everything, producing the worst-case time.
However, for a linked list, picking anything besides the first will just make matters worse. To pick the middle item in a linked list, you'd have to step through it on each partition step -- adding an O(N/2) operation which is done log N times, making the total time O(1.5·N·log N). And that's if we know how long the list is before we start -- usually we don't, so we'd have to step all the way through to count the items, then step halfway through to find the middle, then step through a third time to do the actual partition: O(2.5·N·log N).
Ideally the pivot should be the middle value in the entire array.
This will reduce the chances of getting worst case performance.
In a truly optimized implementation, the method for choosing pivot should depend on the array size - for a large array, it pays off to spend more time choosing a good pivot. Without doing a full analysis, I would guess "middle of O(log(n)) elements" is a good start, and this has the added bonus of not requiring any extra memory: Using tail-call on the larger partition and in-place partitioning, we use the same O(log(n)) extra memory at almost every stage of the algorithm.
Quicksort's complexity varies greatly with the selection of the pivot value. For example, if you always choose the first element as the pivot, the algorithm's complexity can become as bad as O(n^2). Here is a smart method to choose the pivot element:
1. Choose the first, middle, and last elements of the array.
2. Compare these three numbers and find the one that is greater than one and smaller than the other, i.e. the median.
3. Make this element the pivot element.
Choosing the pivot by this method splits the array into two nearly equal halves, and hence the complexity reduces to O(n log(n)).
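Here is a short C sketch of that first/middle/last median-of-three choice, written so it could replace the random pivot line in the partition function from the earlier answer (the function name median_of_three is mine):

/* Return the index of the median of list[l], list[mid] and list[h],
   for use as the pivot index in a partition step. */
int median_of_three(DATATYPE list[], int l, int h) {
    int mid = l + (h - l) / 2;
    int t;

    if (list[l] > list[mid])  { t = l;   l = mid; mid = t; }  /* order l, mid    */
    if (list[mid] > list[h])  { t = mid; mid = h; h = t;   }  /* order mid, h    */
    if (list[l] > list[mid])  { t = l;   l = mid; mid = t; }  /* re-order l, mid */
    return mid;               /* mid now indexes the median of the three values  */
}

In that partition function, the random choice would then become p = median_of_three(list, l, h);.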
On the average, Median of 3 is good for small n. Median of 5 is a bit better for larger n. The ninther, which is the "median of three medians of three" is even better for very large n.
The higher you go with sampling the better you get as n increases, but the improvement dramatically slows down as you increase the samples. And you incur the overhead of sampling and sorting samples.
I recommend using the middle index, as it can be calculated easily.
You can calculate it by rounding (array.length / 2).
If you choose the first or the last element in the array, then there is a high chance that the pivot is the smallest or the largest element of the array, and that is bad.
Why?
Because in that case the number of elements smaller/larger than the pivot element is 0, and this will repeat as follows.
Consider an array of size n. Then the work done is
n + (n - 1) + (n - 2) + ... + 1 = n(n + 1)/2 = O(n^2)
Hence, the time complexity increases to O(n^2) from O(n log n). So I highly recommend using the median or a random element of the array as the pivot.
