I came across the following algorithmic puzzle:
Given an array of events in the form of (name, start time, end time)
e.g.
(a, 1, 6)
(b, 2, 4)
(c, 7, 8)
...
The events are sorted by their start time. I was asked to transform the events into another form (name, time),
e.g.
(a, 1)
(b, 2)
(b, 4)
(a, 6)
(c, 7)
(c, 8)
Notice that each event is now broken into two events, and they are required to be sorted by time.
The most naive way is O(n log n), and I thought of several other ways, but none of them is faster than O(n log n).
Does anybody know the most time and space efficient way of solving this?
Sweep time from beginning to end, maintaining a priority queue of the end times of active events, whose top element is compared repeatedly to the begin time of the next event. This is O(n log k), where k is the maximum number of simultaneous events, with extra space usage of O(k) on top of the input and output. I implemented something similar in C++ for this answer: https://stackoverflow.com/a/25694591/2144669 .
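For illustration, here is a minimal Python sketch of that sweep, using heapq as the priority queue (the names, and the tie-breaking convention of emitting an end time before a coincident start time, are my own choices):

import heapq

def flatten(events):
    """events: list of (name, start, end), sorted by start time.
    Yields (name, time) records sorted by time, in O(n log k)."""
    heap = []  # min-heap of (end_time, name) for events still active
    for name, start, end in events:
        # Emit every pending end time that is not after this start time.
        while heap and heap[0][0] <= start:
            t, n = heapq.heappop(heap)
            yield (n, t)
        yield (name, start)
        heapq.heappush(heap, (end, name))
    while heap:  # drain the remaining end times
        t, n = heapq.heappop(heap)
        yield (n, t)

print(list(flatten([("a", 1, 6), ("b", 2, 4), ("c", 7, 8)])))
# [('a', 1), ('b', 2), ('b', 4), ('a', 6), ('c', 7), ('c', 8)]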
This can be proven to be just as time consuming as regular sorting.
For example, suppose I want to sort N positive numbers. I could convert that to this problem by sorting N tuples of the form (a1, 0, num1), (a2, 0, num2), ..., (aN, 0, numN). This would yield the sorted result (a1, 0), (a2, 0), ..., (aN, 0), (aSorted1, numSorted1), ..., (aSortedN, numSortedN), from which we can read off numSorted1, ..., numSortedN. Comparison sorting is known to require Ω(N log N), so you can't do any better than that in the general case.
However, if you say that the start times are unique, there may be some other optimizations to the problem.
EDIT: The reduction uses O(N) additional space, so an argument could be made about that case; the answer is not fully rigorous on that point.
The most efficient implementation depends on what constraints apply to your data.
But, in general, the following method can sort your list in O(N) time and 2N memory:
1. Create a struct like:
struct data {
    int timestamp;   // Event timestamp (start or end time)
    int orig_index;  // Index of the original event in the input array
};
2. Copy the data from the input array into the array of structs from step 1.
Each original event becomes two "data" records (one for its start time, one for its end time). This is O(N).
3. Sort the resulting array with radix sort, by the field "timestamp".
This is again O(N) for fixed-width integer timestamps: http://en.wikipedia.org/wiki/Radix_sort
4. Create the output array, and restore names from the original array via
"orig_index", if needed. Again O(N). A rough sketch of steps 2-4 follows.
No algorithm is more time efficient, since the sorting problem, which has a tight lower bound of Ω(n log n), is reducible to this problem (to sort an array of numbers, use them as the start times, for instance).
As for space complexity, heapsort, which uses O(1) auxiliary space and O(n log n) time, is probably the best worst-case algorithm.
I was going through a hypothetical scenario in my head and couldn't quite think of a data structure to use in this situation. Assume all we need is the ability to add() & remove() keys and then count(key1, key2) the number of keys between the two specified (inclusive). We are assuming of course that these keys overload the comparison operators so they can definitively be less than, greater than, or equal to each other. So for instance, if we inserted 1, 5, 3, 4, 7 and then ran count(1, 4), we would get a resulting output of 3, since we could count the keys 1, 3, and 4.
Now we could do this with a binary search tree using recursion in O(n) time, but what if we needed count() to run in O(log(n)) time? Is there a data structure out there that you could potentially modify to perform this?
At first I thought maybe we could use a heap or BST and keep track of the number of children on each side. But then I just got really lost trying to trace it out on paper.
An order statistic tree is a modification of a BST that allows you to query, for any value, how many elements of the tree are smaller than that value. You can then count how many items are in the range [a, b] by asking how many items are less than or equal to b and subtracting how many items are less than a.
Each operation on an order statistic tree (add, remove, lookup, and count) takes time O(log n), so this lets you solve your particular problem in time O(log n) as well.
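As a sketch of that counting interface: if the keys can be mapped to integers in a known range, a Fenwick (binary indexed) tree gives add, remove, and count in O(log n); a real order statistic tree (a balanced BST augmented with subtree sizes) removes that assumption. The class below is illustrative only:

class FenwickCounter:
    """Sketch of the add/remove/count(a, b) interface using a Fenwick tree.
    Assumes keys are integers in [1, max_key]."""

    def __init__(self, max_key):
        self.tree = [0] * (max_key + 1)

    def _update(self, i, delta):
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def _prefix(self, i):          # number of stored keys <= i
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def add(self, key):
        self._update(key, 1)

    def remove(self, key):
        self._update(key, -1)

    def count(self, lo, hi):       # inclusive range count
        return self._prefix(hi) - self._prefix(lo - 1)

f = FenwickCounter(10)
for k in (1, 5, 3, 4, 7):
    f.add(k)
print(f.count(1, 4))   # 3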
This is not my school homework. This is my own homework; I am self-learning algorithms.
In the Algorithm Design Manual, there is this exercise:
4-25 Assume that the array A[1..n] only has numbers from {1, . . . , n^2} but that at most log log n of these numbers ever appear. Devise an algorithm that sorts A in substantially less than O(n log n).
I have two approaches:
The first approach:
Basically I want to do counting sort for this problem. I can first scan the whole array (O(N)) and put all distinct numbers into a loglogN size array (int[] K).
Then apply counting sort. However, when setting up the counting array (int[] C), I don't need to set its size as N^2, instead, I set the size as loglogN too.
But in this way, when counting the frequencies of each distinct number, I have to scan array K to get that element's index (O(N log log N)) and then update array C.
The second approach:
Again, I have to scan the whole array to get a distinct number array K with size loglogN.
Then I just do something quicksort-like, but the partition is based on the median of the K array (i.e., each time the pivot is an element of K), recursively.
I think this approach will be the best, with O(N log log log N).
Am I right? or there are better solutions?
Similar exercises exist in the Algorithm Design Manual, such as
4-22 Show that n positive integers in the range 1 to k can be sorted in O(n log k) time. The interesting case is when k << n.
4-23 We seek to sort a sequence S of n integers with many duplications, such that the number of distinct integers in S is O(log n). Give an O(n log log n) worst-case time algorithm to sort such sequences.
But basically, for all these exercises, my intuition was always to think of counting sort, since we know the range of the elements and the range is short enough compared to the length of the whole array. But after thinking about it more deeply, I guess what the exercises are looking for is the 2nd approach, right?
Thanks
We can just create a hash map storing each element as a key and its frequency as the value.
Sort the k distinct keys of this map in O(k log k) time using any sorting algorithm; here k is at most log log n.
Now scan the sorted keys and append each element to the new array frequency-many times, as sketched below.
Total time = 2n + O(log log n * log(log log n)) = O(n).
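A minimal Python sketch of that approach (Counter plays the role of the hash map; the exercise guarantees only a handful of distinct values, so sorting the keys is cheap):

from collections import Counter

def sort_few_distinct(a):
    freq = Counter(a)                 # O(n): count each distinct value
    out = []
    for value in sorted(freq):        # O(k log k), k = number of distinct values
        out.extend([value] * freq[value])
    return out

print(sort_few_distinct([2, 8, 1, 6, 7, 1, 6]))   # [1, 1, 2, 6, 6, 7, 8]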
Counting sort is one possible way:
I will demonstrate this solution on the example 2, 8, 1, 6, 7, 1, 6, where all numbers are <= 3^2 = 9 (so N = 3 here). I use more elements than N to make the idea clearer.
First, for each number A[i] compute A[i] / N (integer division). Let's call this number first_part_of_number.
Sort this array using counting sort by first_part_of_number.
The results are in the form (first_part_of_number, original number); for N = 3:
(0, 2)
(0, 1)
(0, 1)
(2, 8)
(2, 6)
(2, 7)
(2, 6)
Divide them into groups by first_part_of_number.
In this example you will have groups
(0, 2)
(0, 1)
(0, 1)
and
(2, 8)
(2, 6)
(2, 7)
(2, 6)
For each number compute A[i] modulo N. Let's call it second_part_of_number, and append it to each element:
(0, 2, 2)
(0, 1, 1)
(0, 1, 1)
and
(2, 8, 2)
(2, 6, 0)
(2, 7, 1)
(2, 6, 0)
Sort each group using counting sort by second_part_of_number
(0, 1, 1)
(0, 1, 1)
(0, 2, 2)
and
(2, 6, 0)
(2, 6, 0)
(2, 7, 1)
(2, 8, 2)
Now combine all groups and you have result 1, 1, 2, 6, 6, 7, 8.
Complexity:
We only ever ran counting sort on keys <= N.
Each element took part in exactly 2 "sorts". So overall complexity is O(N).
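For reference, here is a compact Python sketch of the same idea, written as the equivalent LSD formulation (stable counting sort by the low base-N digit, then by the high digit); the base and function names are illustrative:

def counting_sort(arr, key, base):
    """Stable counting sort of arr by key(x), where key(x) is in [0, base)."""
    buckets = [[] for _ in range(base)]
    for x in arr:
        buckets[key(x)].append(x)
    return [x for b in buckets for x in b]

def two_pass_sort(a, base):
    """Sort values in [0, base^2) with two stable counting passes."""
    a = counting_sort(a, key=lambda x: x % base, base=base)    # low digit
    a = counting_sort(a, key=lambda x: x // base, base=base)   # high digit
    return a

print(two_pass_sort([2, 8, 1, 6, 7, 1, 6], base=3))   # [1, 1, 2, 6, 6, 7, 8]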
I'm going to betray my limited knowledge of algorithmic complexity here, but:
Wouldn't it make sense to scan the array once and build something like a self-balancing tree? Since we know the number of nodes in the tree will only grow to log log n, it is relatively cheap (?) to find a number each time. If a repeated number is found (likely), a counter in that node is incremented.
Then to construct the sorted array, read the tree in order.
Maybe someone can comment on the complexity of this and any flaws.
Update: After I wrote the answer below, @Nabb showed me why it was incorrect. For more information, see Wikipedia's brief entry on Õ, and the links therefrom. At least because it is still needed to lend context to @Nabb's and @Blueshift's comments, and because the whole discussion remains interesting, my original answer is retained, as follows.
ORIGINAL ANSWER (INCORRECT)
Let me offer an unconventional answer: though there is indeed a difference between O(n*n) and O(n), there is no difference between O(n) and O(n*log(n)).
Now, of course, we all know that what I just said is wrong, don't we? After all, various authors concur that O(n) and O(n*log(n)) differ.
Except that they don't differ.
So radical-seeming a position naturally demands justification, so consider the following, then make up your own mind.
Mathematically, essentially, the order m of a function f(z) is such that f(z)/(z^(m+epsilon)) converges while f(z)/(z^(m-epsilon)) diverges for z of large magnitude and real, positive epsilon of arbitrarily small magnitude. The z can be real or complex, though as we said epsilon must be real. With this understanding, apply L'Hospital's rule to a function of O(n*log(n)) to see that it does not differ in order from a function of O(n).
I would contend that the accepted computer-science literature at the present time is slightly mistaken on this point. This literature will eventually refine its position in the matter, but it hasn't done so yet.
Now, I do not expect you to agree with me today. This, after all, is merely an answer on Stack Overflow -- and what is that compared to an edited, formally peer-reviewed, published computer-science book -- not to mention a shelf full of such books? You should not agree with me today, only take what I have written under advisement, mull it over in your mind these coming weeks, consult one or two of the aforementioned computer-science books that take the other position, and make up your own mind.
Incidentally, a counterintuitive implication of this answer's position is that one can access a balanced binary tree in O(1) time. Again, we all know that that's false, right? It's supposed to be O(log(n)). But remember: the O() notation was never meant to give a precise measure of computational demands. Unless n is very large, other factors can be more important than a function's order. But, even for n = 1 million, log(n) is only 20, compared, say, to sqrt(n), which is 1000. And I could go on in this vein.
Anyway, give it some thought. Even if, eventually, you decide that you disagree with me, you may find the position interesting nonetheless. For my part, I am not sure how useful the O() notation really is when it comes to O(log something).
#Blueshift asks some interesting questions and raises some valid points in the comments below. I recommend that you read his words. I don't really have a lot to add to what he has to say, except to observe that, because few programmers have (or need) a solid grounding in the mathematical theory of the complex variable, the O(log(n)) notation has misled probably, literally hundreds of thousands of programmers to believe that they were achieving mostly illusory gains in computational efficiency. Seldom in practice does reducing O(n*log(n)) to O(n) really buy you what you might think that it buys you, unless you have a clear mental image of how incredibly slow a function the logarithm truly is -- whereas reducing O(n) even to O(sqrt(n)) can buy you a lot. A mathematician would have told the computer scientist this decades ago, but the computer scientist wasn't listening, was in a hurry, or didn't understand the point. And that's all right. I don't mind. There are lots and lots of points on other subjects I don't understand, even when the points are carefully explained to me. But this is a point I believe that I do happen to understand. Fundamentally, it is a mathematical point not a computer point, and it is a point on which I happen to side with Lebedev and the mathematicians rather than with Knuth and the computer scientists. This is all.
MergeSort is a divide-and-conquer algorithm that divides the input into several parts and solves the parts recursively.
...There are several approaches for the split function. One way is to split down the middle. That approach has some nice properties, however, we'll focus on a method that's a little bit faster: even-odd split. The idea is to put every even-position element in one list, and every odd-position in another.
This is straight from my lecture notes. Why exactly is it the case that the even-odd split is faster than down the middle of the array?
I'm speculating it has something to do with the list being passed into MergeSort already being sorted, but I'm not entirely sure.
Could anyone shed some light on this?
Edit: I tried running the following in Python...
from timeit import Timer

K = []
for i in range(1, 100000):
    K.append(i)

def testMergeSort():
    """
    testMergeSort shows the proper functionality for the
    Merge Sort Algorithm implemented above.
    """
    t = Timer("mergeSort(K)", "from __main__ import *")
    print(t.timeit(1000000))
    p = Timer("mergeSort2(K)", "from __main__ import *")
    print(p.timeit(1000000))
(mergeSort is the even-odd merge sort, mergeSort2 divides down the center)
And the result was:
0.771506746608
0.843161219237
I can see that it could be possible that it is better because splitting it with alternating elements means you don't have to know how long the input is to start with - you just take elements and put them in alternating lists until you run out.
Also, you could potentially start splitting the resulting lists before you have finished iterating through the first list, if you are careful, allowing for better parallel processing.
I should add that I'm no expert on these matters, they are just things that came to mind...
The closer the input list is to already being sorted, the lower the runtime (this is because the merge procedure doesn't have to move any of the values if everything is already in the correct order; it just performs O(n) comparisons). Since MergeSort recursively calls itself on each half of the split, one wants to choose a split function that increases the likelihood that the resulting halves of the list are in sorted order. If the list is mostly sorted, even-odd split will do a better job of this than splitting down the middle. For example,
MergeSort([2, 1, 4, 3, 5, 7])
would result in
Merge(MergeSort([2, 1, 4]), MergeSort([3, 5, 7]))
if we split down the middle (note that the left sub-list is out of order), whereas if we did even-odd split we would get
Merge(MergeSort([2, 4, 5]), MergeSort([1, 3, 7]))
which results in two already-sorted lists (and best-case performance for the subsequent calls to MergeSort). Without knowing anything about the input lists, though, the choice of splitting function shouldn't affect runtime asymptotically.
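For reference, the two split functions under discussion might look something like this (hypothetical helper names):

def split_middle(xs):
    mid = len(xs) // 2
    return xs[:mid], xs[mid:]

def split_even_odd(xs):
    # Elements at even positions, then elements at odd positions.
    return xs[0::2], xs[1::2]

print(split_middle([2, 1, 4, 3, 5, 7]))     # ([2, 1, 4], [3, 5, 7])
print(split_even_odd([2, 1, 4, 3, 5, 7]))   # ([2, 4, 5], [1, 3, 7])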
I suspect there is noise in your experiment. :) Some of it may come from compare-and-swap not actually moving any elements in the list which avoids cache invalidation, etc.
Regardless, there is a chat about this here: https://cstheory.stackexchange.com/questions/6732/why-is-an-even-odd-split-faster-for-mergesort/6764#6764 (and yes, I did post a similar answer there (full disclosure))
The related Wikipedia articles point out that mergesort is O( n log(n) ) while Odd-Even Merge Sort is O( n log(n)^2 ). Odd-Even is certainly "slower", but the sorting network is static so you always know what operations you are going to perform and (looking at the graphic in the Wikipedia entry) notice how the algorithm stays parallel until the end.
Whereas merge sort finally merges two lists together, the last comparisons of the 8-element sorting network for odd-even merge sort are still independent.
I have a set of A's and a set of B's, each with an associated numerical priority, where each A may match some or all B's and vice versa, and my main loop basically consists of:
Take the best A and B in priority order, and do stuff with A and B.
The most obvious way to do this is with a single priority queue of (A,B) pairs, but if there are 100,000 A's and 100,000 B's then the set of O(N^2) pairs won't fit in memory (and disk is too slow).
Another possibility is for each A, loop through every B. However this means that global priority ordering is by A only, and I really need to take priority of both components into account.
(The application is theorem proving, where the above options are called the pair algorithm and the given clause algorithm respectively; the shortcomings of each are known, but I haven't found any reference to a good solution.)
Some kind of two layer priority queue would seem indicated, but it's not clear how to do this without using either O(N^2) memory or O(N^2) time in the worst case.
Is there a known method of doing this?
Clarification: each A must be processed with all corresponding B's, not just one.
Maybe there's something I'm not understanding, but:
Why not keep the A's and B's in separate heaps, get_Max on each of the heaps, do your work, remove each max from its associated heap, and continue?
You could handle the best pairs first, and if nothing good comes up mop up the rest with the given clause algorithm for completeness' sake. This may lead to some double work, but I'd bet that this is insignificant.
Have you considered ordered paramodulation or superposition?
It appears that the items in A have an individual priority, the items in B have an individual priority, and the (A,B) pairs have a combined priority. Only the combined priority matters, but hopefully we can use the individual priorities along the way. However, there is also a matching relation between items in A and items in B that is independent of priority.
I assume that, for all a in A, b1 and b2 in B, such that Match(a,b1) and Match(a,b2), then Priority(b1) >= Priority(b2) implies CombinedPriority(a,b1) >= CombinedPriority(a,b2).
Now, begin by sorting B in decreasing order of priority. Let B(j) indicate the jth element in this sorted order. Also, let A(i) indicate the ith element of A (which may or may not be in sorted order).
Let nextb(i,j) be a function that finds the smallest j' >= j such that Match(A(i),B(j')). If no such j' exists, the function returns null (or some other suitable error value). Searching for j' may just involve looping upward from j, or we may be able to do something faster if we know more about the structure of the Match relation.
Create a priority queue Q containing (i,nextb(i,0)) for all indices i in A such that nextb(i,0) != null. The pairs (i,j) in Q are ordered by CombinedPriority(A(i),B(j)).
Now just loop until Q is empty. Pull out the highest-priority pair (i,j) and process (A(i),B(j)) appropriately. Then re-insert (i,nextb(i,j+1)) into Q (unless nextb(i,j+1) is null).
Altogether, this takes O(N^2 log N) time in the worst case where all pairs match. In general, it takes O(N^2 + M log N), where M is the number of matches. The N^2 component can be reduced if there is a faster way of calculating nextb(i,j) than just looping upward, but that depends on knowledge of the Match relation.
(In the above analysis, I assumed both A and B were of size N. The formulas could easily be modified if they are different sizes.)
You seemed to want something better than O(N^2) time in the worst case, but if you need to process every match, then you have a lower bound of M, which can be N^2 itself. I don't think you're going to be able to do better than O(N^2 log N) time unless there is some special structure to the combined priority that lets you use a better-than-log-N priority queue.
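A rough Python sketch of that scheme; priority, match, combined_priority, and handle stand in for whatever the application supplies, and heapq is a min-heap, so combined priorities are negated:

import heapq

def process_pairs(A, B, priority, match, combined_priority, handle):
    """Lazily enumerate matching (a, b) pairs in decreasing combined priority,
    keeping at most one queue entry per element of A."""
    B = sorted(B, key=priority, reverse=True)       # decreasing priority

    def nextb(i, j):
        """Smallest j' >= j with match(A[i], B[j']), or None."""
        while j < len(B):
            if match(A[i], B[j]):
                return j
            j += 1
        return None

    Q = []                                          # heap of (-combined_priority, i, j)
    for i in range(len(A)):
        j = nextb(i, 0)
        if j is not None:
            heapq.heappush(Q, (-combined_priority(A[i], B[j]), i, j))

    while Q:
        _, i, j = heapq.heappop(Q)
        handle(A[i], B[j])                          # process the best remaining pair
        j = nextb(i, j + 1)
        if j is not None:
            heapq.heappush(Q, (-combined_priority(A[i], B[j]), i, j))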
So you have a set of A's, and a set of B's, and you need to pick an (A, B) pair from these sets such that some f(a, b) is higher than for any other (A, B) pair.
This means you can either store all possible (A, B) pairs and order them, and just pick the highest each time through the loop (O(1) per iteration but O(N*M) memory).
Or you could loop through all possible pairs and keep track of the current maximum and use that (O(N*M) per iteration, but only O(N+M) memory).
If I am understanding you correctly this is what you are asking.
I think it very much depends on f() to determine if there is a better way to do it.
If f(a, b) = a + b, then it is obviously very simple, the highest A, and the highest B are what you want.
I think your original idea will work, you just need to keep your As and Bs in separate collections and just stick references to them in your priority queue. If each reference takes 16 bytes (just to pick a number), then 10,000,000 A/B references will only take ~300M. Assuming your As and Bs themselves aren't too big, it should be workable.
When implementing Quicksort, one of the things you have to do is to choose a pivot. But when I look at pseudocode like the one below, it is not clear how I should choose the pivot. First element of list? Something else?
function quicksort(array)
    var list less, greater
    if length(array) ≤ 1
        return array
    select and remove a pivot value pivot from array
    for each x in array
        if x ≤ pivot then append x to less
        else append x to greater
    return concatenate(quicksort(less), pivot, quicksort(greater))
Can someone help me grasp the concept of choosing a pivot and whether or not different scenarios call for different strategies.
Choosing a random pivot minimizes the chance that you will encounter worst-case O(n^2) performance (always choosing the first or last would cause worst-case performance for nearly-sorted or nearly-reverse-sorted data). Choosing the middle element would also be acceptable in the majority of cases.
Also, if you are implementing this yourself, there are versions of the algorithm that work in-place (i.e. without creating two new lists and then concatenating them).
It depends on your requirements. Choosing a pivot at random makes it harder to create a data set that generates O(N^2) performance. 'Median-of-three' (first, last, middle) is also a way of avoiding problems. Beware of relative performance of comparisons, though; if your comparisons are costly, then Mo3 does more comparisons than choosing (a single pivot value) at random. Database records can be costly to compare.
Update: Pulling comments into answer.
mdkess asserted:
'Median of 3' is NOT first last middle. Choose three random indexes, and take the middle value of this. The whole point is to make sure that your choice of pivots is not deterministic - if it is, worst case data can be quite easily generated.
To which I responded:
Analysis Of Hoare's Find Algorithm With Median-Of-Three Partition (1997)
by P Kirschenhofer, H Prodinger, C Martínez supports your contention (that 'median-of-three' is three random items).
There's an article described at portal.acm.org that is about 'The Worst Case Permutation for Median-of-Three Quicksort' by Hannu Erkiö, published in The Computer Journal, Vol 27, No 3, 1984. [Update 2012-02-26: Got the text for the article. Section 2 'The Algorithm' begins: 'By using the median of the first, middle and last elements of A[L:R], efficient partitions into parts of fairly equal sizes can be achieved in most practical situations.' Thus, it is discussing the first-middle-last Mo3 approach.]
Another short article that is interesting is by M. D. McIlroy, "A Killer Adversary for Quicksort", published in Software-Practice and Experience, Vol. 29(0), 1–4 (0 1999). It explains how to make almost any Quicksort behave quadratically.
AT&T Bell Labs Tech Journal, Oct 1984 "Theory and Practice in the Construction of a Working Sort Routine" states "Hoare suggested partitioning around the median of several randomly selected lines. Sedgewick [...] recommended choosing the median of the first [...] last [...] and middle". This indicates that both techniques for 'median-of-three' are known in the literature. (Update 2014-11-23: The article appears to be available at IEEE Xplore or from Wiley — if you have membership or are prepared to pay a fee.)
'Engineering a Sort Function' by J L Bentley and M D McIlroy, published in Software Practice and Experience, Vol 23(11), November 1993, goes into an extensive discussion of the issues, and they chose an adaptive partitioning algorithm based in part on the size of the data set. There is a lot of discussion of trade-offs for various approaches.
A Google search for 'median-of-three' works pretty well for further tracking.
Thanks for the information; I had only encountered the deterministic 'median-of-three' before.
Heh, I just taught this class.
There are several options.
Simple: Pick the first or last element of the range. (bad on partially sorted input)
Better: Pick the item in the middle of the range. (better on partially sorted input)
However, picking any arbitrary element runs the risk of poorly partitioning the array of size n into two arrays of size 1 and n-1. If you do that often enough, your quicksort runs the risk of becoming O(n^2).
One improvement I've seen is pick median(first, last, mid);
In the worst case, it can still go to O(n^2), but probabilistically, this is a rare case.
For most data, picking the first or last is sufficient. But if you find that you're running into worst-case scenarios often (partially sorted input), the first option would be to pick the central value (which is a statistically good pivot for partially sorted data).
If you're still running into problems, then go the median route.
Never ever choose a fixed pivot - this can be attacked to exploit your algorithm's worst-case O(n^2) runtime, which is just asking for trouble. Quicksort's worst-case runtime occurs when partitioning results in one array of 1 element and one array of n-1 elements. Suppose you choose the first element as your partition. If someone feeds an array to your algorithm that is in decreasing order, your first pivot will be the biggest, so everything else in the array will move to the left of it. Then when you recurse, the first element will be the biggest again, so once more you put everything to the left of it, and so on.
A better technique is the median-of-3 method, where you pick three elements at random and choose the middle one. You know that the element that you choose won't be the first or the last, but also, by the central limit theorem, the distribution of the middle element will be normal, which means that you will tend towards the middle (and hence, n log(n) time).
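A small sketch of that randomized median-of-three choice (the function name and interface are just illustrative):

import random

def median_of_three_pivot(a, lo, hi):
    """Pick three random indices in [lo, hi] and return the index holding
    the median of the three sampled values."""
    i, j, k = (random.randint(lo, hi) for _ in range(3))
    # Sort the three (value, index) pairs and take the middle one's index.
    return sorted([(a[i], i), (a[j], j), (a[k], k)])[1][1]

# e.g. pivot_index = median_of_three_pivot(arr, 0, len(arr) - 1)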
If you absolutely want to guarantee O(n log(n)) runtime for the algorithm, the groups-of-5 (median-of-medians) method for finding the median of an array runs in O(n) time, which means that the recurrence equation for quicksort in the worst case will be:
T(n) = O(n) (find the median) + O(n) (partition) + 2T(n/2) (recurse left and right)
By the Master Theorem, this is O(nlog(n)). However, the constant factor will be huge, and if worst case performance is your primary concern, use a merge sort instead, which is only a little bit slower than quicksort on average, and guarantees O(nlog(n)) time (and will be much faster than this lame median quicksort).
Explanation of the Median of Medians Algorithm
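For reference, here is a sketch of that linear-time, groups-of-five selection, only to show the shape of the recursion; as noted above, the constant factors make it unattractive in a practical quicksort:

def select(a, k):
    """Return the k-th smallest element of a (0-based) in worst-case O(n),
    using median-of-medians ("groups of five") pivot selection."""
    if len(a) <= 5:
        return sorted(a)[k]
    medians = []
    for i in range(0, len(a), 5):              # median of each group of five
        group = sorted(a[i:i + 5])
        medians.append(group[len(group) // 2])
    pivot = select(medians, len(medians) // 2) # median of the medians
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    if k < len(less):
        return select(less, k)
    if k < len(less) + len(equal):
        return pivot
    return select(greater, k - len(less) - len(equal))

def exact_median(a):
    return select(a, (len(a) - 1) // 2)

print(exact_median([3, 1, 4, 1, 5, 9, 2, 6]))   # 3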
Don't try to get too clever and combine pivoting strategies. If you combine median of 3 with a random pivot by picking the median of the first, last and a random index in the middle, then you'll still be vulnerable to many of the distributions which send median of 3 quadratic (so it's actually worse than a plain random pivot).
E.g. for a pipe-organ distribution (1, 2, 3, ..., N/2, ..., 3, 2, 1), the first and last elements will both be 1 and the random index will be some number greater than 1; taking the median gives 1 (either first or last) and you get an extremely unbalanced partitioning.
It is easier to break the quicksort into three sections when doing this:
Exchange or swap data element function
The partition function
Processing the partitions
It is only slightly less efficient than one long function but is a lot easier to understand.
Code follows:
#include <stdlib.h>   /* for rand() */

/* This selects what the data type in the array to be sorted is */
#define DATATYPE long

/* This is the swap function .. your job is to swap data in x & y .. how depends on
   the data type .. the example works for normal numerical data types .. like the long
   I chose above */
void swap(DATATYPE *x, DATATYPE *y) {
    DATATYPE Temp;
    Temp = *x;  // Hold current x value
    *x = *y;    // Transfer y to x
    *y = Temp;  // Set y to the held old x value
}

/* This is the partition code */
int partition(DATATYPE list[], int l, int h) {
    int i;
    int p;          // pivot element index
    int firsthigh;  // divider position for the pivot element

    // Random pivot shown; for a middle pivot, p = (l + h) / 2 would be used
    p = l + rand() % (h - l + 1);              // Random partition point
    swap(&list[p], &list[h]);                  // Move the pivot value to the end
    firsthigh = l;                             // Position of the first high value
    for (i = l; i < h; i++)
        if (list[i] < list[h]) {               // Value at i is less than the pivot
            swap(&list[i], &list[firsthigh]);  // So swap it into the low section
            firsthigh++;                       // Increment first high
        }
    swap(&list[h], &list[firsthigh]);          // Put the pivot between the two sections
    return firsthigh;                          // Return the pivot's final position
}

/* Finally the body sort */
void quicksort(DATATYPE list[], int l, int h) {
    int p;  // index of partition
    if ((h - l) > 0) {
        p = partition(list, l, h);  // Partition list
        quicksort(list, l, p - 1);  // Sort lower partition
        quicksort(list, p + 1, h);  // Sort upper partition
    }
}
It is entirely dependent on how your data is sorted to begin with. If you think it will be pseudo-random then your best bet is to either pick a random selection or choose the middle.
If you are sorting a random-access collection (like an array), it's generally best to pick the physical middle item. With this, if the array is already sorted (or nearly sorted), the two partitions will be close to even, and you'll get the best speed.
If you are sorting something with only linear access (like a linked list), then it's best to choose the first item, because it's the fastest item to access. Here, however, if the list is already sorted, you're screwed -- one partition will always be empty, and the other will have everything, producing the worst time.
However, for a linked list, picking anything besides the first will just make matters worse. To pick the middle item in a linked list, you'd have to step through it on each partition step -- adding an O(N/2) operation which is done log N times, making the total time O(1.5 N log N), and that's if we know how long the list is before we start -- usually we don't, so we'd have to step all the way through to count the elements, then step halfway through to find the middle, then step through a third time to do the actual partition: O(2.5 N log N).
Ideally, the pivot should be the median value of the entire array.
This will reduce the chances of getting worst-case performance.
In a truly optimized implementation, the method for choosing pivot should depend on the array size - for a large array, it pays off to spend more time choosing a good pivot. Without doing a full analysis, I would guess "middle of O(log(n)) elements" is a good start, and this has the added bonus of not requiring any extra memory: Using tail-call on the larger partition and in-place partitioning, we use the same O(log(n)) extra memory at almost every stage of the algorithm.
Quicksort's complexity varies greatly with the selection of the pivot value. For example, if you always choose the first element as the pivot, the algorithm's complexity can become as bad as O(n^2). Here is a smart method to choose the pivot element:
1. Choose the first, middle, and last elements of the array.
2. Compare these three numbers and find the one that is greater than one and smaller than the other, i.e. the median.
3. Make this element the pivot element.
Choosing the pivot by this method splits the array into nearly two halves, and hence the complexity
reduces to O(n log(n)).
On the average, Median of 3 is good for small n. Median of 5 is a bit better for larger n. The ninther, which is the "median of three medians of three" is even better for very large n.
The higher you go with sampling the better you get as n increases, but the improvement dramatically slows down as you increase the samples. And you incur the overhead of sampling and sorting samples.
I recommend using the middle index, as it can be calculated easily.
You can calculate it by flooring (array.length / 2).
If you choose the first or the last element of the array, then there is a high chance that the pivot is the smallest or the largest element of the array, and that is bad.
Why?
Because in that case the number of elements smaller / larger than the pivot element is 0, and this will repeat as follows:
Consider the size of the array to be n. Then
(n) + (n - 1) + (n - 2) + ... + 1 = O(n^2)
Hence, the time complexity increases to O(n^2) from O(n log n). So I highly recommend using the median or a random element of the array as the pivot.