Is Quicksort "adaptive" and "online"? - algorithm

That is to say, does Quicksort perform BETTER when given an already sorted list? I don't see why this would be the case, but perhaps I don't understand exactly the algorithm.
Also, can quicksort "keep going" whilst we add new data to the list WHILE SORTING? Seems to me the algorithm needs the full set of all data at the beginning to "work".

does Quicksort perform BETTER when given an already sorted list?
No, in fact the way it's usually taught (use the first element as the pivot) an already sorted (or nearly sorted) list is the worst-case. Using the middle or a random element as pivot can mitigate this, however.
can quicksort "keep going" whilst we add new data to the list WHILE SORTING?
No, your intuition is correct, you need the entire data set from the start.
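To illustrate the first point, here is a minimal sketch (my own illustrative C++, not taken from the answers above) of a quicksort that always picks the first element as the pivot; on an already sorted array every partition puts zero elements on one side, so the work degrades to O(n^2).

#include <utility>
#include <vector>

// Bare-bones quicksort using the first element as the pivot (Lomuto-style
// partition). On an already sorted array no element is ever smaller than the
// pivot, so each partition is maximally unbalanced and the total work is O(n^2).
void quicksortFirstPivot(std::vector<int>& a, int lo, int hi)   // sorts a[lo..hi]
{
    if (lo >= hi) return;
    int pivot = a[lo];
    int split = lo;                        // a[lo+1..split] holds elements < pivot
    for (int i = lo + 1; i <= hi; ++i)
        if (a[i] < pivot) std::swap(a[++split], a[i]);
    std::swap(a[lo], a[split]);            // drop the pivot between the two halves
    quicksortFirstPivot(a, lo, split - 1);
    quicksortFirstPivot(a, split + 1, hi);
}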

Does Quicksort perform BETTER when given an already sorted list
I think the performance of quicksort depends largely on the choice of the pivot element at every step. Performance is worst when the selected pivot is likely to be either the smallest or the largest element in the list.
quicksort "keep going" whilst we add new data to the list WHILE
SORTING?
No, quicksort cannot keep going while new data is added; it is not an online algorithm. That is simply a property of quicksort.

Quicksort, when its choice of pivots is random, has a runtime of O(n lg n) where n is the size of the array. If its choice of pivots is in sorted order its runtime degrades to O(n^2). Whether you choose the pivot from the left side, right side, middle or randomly, it doesn't matter since it is possible, if not likely, to select pivots in sorted order.
The only way to avoid this is to guarantee the pivots aren't in order by using a technique such as the "Median of Three."
According to Robert Sedgewick, Algorithms, Addison-Wesley Publishing Company, 1988, page 124, if you use the Median of Three technique to choose the pivot and stop the recursion for small partitions (anywhere from 5 to 25 in size; this leaves the array unsorted but you can finish it up quickly with an insertion sort) then quicksort will always be O(n lg n) and, furthermore, run 20% faster than ordinary quicksort.
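A rough sketch of those two refinements, median-of-three pivot selection plus a small-partition cutoff finished by a single insertion-sort pass. The cutoff value and partition details are illustrative choices, not Sedgewick's exact code.

#include <utility>
#include <vector>

// Median-of-three: sort a[lo], a[mid], a[hi] in place and return the median,
// which also guards the Hoare partition below against running off the ends.
static int medianOfThree(std::vector<int>& a, int lo, int hi)
{
    int mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo])  std::swap(a[mid], a[lo]);
    if (a[hi]  < a[lo])  std::swap(a[hi],  a[lo]);
    if (a[hi]  < a[mid]) std::swap(a[hi],  a[mid]);
    return a[mid];
}

static void quicksortCutoff(std::vector<int>& a, int lo, int hi)   // works on a[lo..hi]
{
    const int kCutoff = 10;                // leave small partitions unsorted
    if (hi - lo < kCutoff) return;
    int pivot = medianOfThree(a, lo, hi);
    int i = lo, j = hi;
    while (i <= j) {                       // Hoare-style partition around the pivot value
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    quicksortCutoff(a, lo, j);
    quicksortCutoff(a, i, hi);
}

void sortWithCutoff(std::vector<int>& a)
{
    if (a.empty()) return;
    quicksortCutoff(a, 0, static_cast<int>(a.size()) - 1);
    for (int k = 1; k < static_cast<int>(a.size()); ++k)   // one insertion-sort pass finishes the job
        for (int m = k; m > 0 && a[m] < a[m - 1]; --m)
            std::swap(a[m], a[m - 1]);
}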

Related

O(nlogn) in-place sorting algorithm

This question was in the preparation exam for my midterm in introduction to computer science.
There exists an algorithm which can find the kth element in a list in
O(n) time, and suppose that it is in place. Using this algorithm,
write an in place sorting algorithm that runs in worst case time
O(n*log(n)), and prove that it does. Given that this algorithm exists,
why is mergesort still used?
I assume I must write some alternate form of the quicksort algorithm, which has a worst case of O(n^2), since merge sort is not an in-place algorithm. What confuses me is the given algorithm to find the kth element in a list. Isn't a simple loop iteration through the elements of an array already an O(n) algorithm?
How can the provided algorithm make any difference in the running time of the sorting algorithm if it does not change anything in the execution time? I don't see how, used with either quicksort, insertion sort or selection sort, it could lower the worst case to O(n log n). Any input is appreciated!
Check wiki, namely the "Selection by sorting" section:
Similarly, given a median-selection algorithm or general selection algorithm applied to find the median, one can use it as a pivot strategy in Quicksort, obtaining a sorting algorithm. If the selection algorithm is optimal, meaning O(n), then the resulting sorting algorithm is optimal, meaning O(n log n). The median is the best pivot for sorting, as it evenly divides the data, and thus guarantees optimal sorting, assuming the selection algorithm is optimal. A sorting analog to median of medians exists, using the pivot strategy (approximate median) in Quicksort, and similarly yields an optimal Quicksort.
The short answer to why mergesort is preferred over quicksort in some cases is that it is stable (while quicksort is not).
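As a sketch of the quoted idea (using a selection routine to pick the true median as the pivot), here is an illustrative C++ version. std::nth_element stands in for the selection step; it is linear only on average, so a worst-case O(n) selection such as median-of-medians would be needed for the guaranteed O(n log n) bound the exercise asks for.

#include <algorithm>
#include <vector>

// Quicksort where every pivot is the median of the current range. After
// std::nth_element, the median sits at position mid, everything before it is
// <= the median, and everything after is >= it, so we simply recurse on the two halves.
void quicksortMedianPivot(std::vector<int>& a, int lo, int hi)   // sorts the range [lo, hi)
{
    if (hi - lo <= 1) return;
    int mid = lo + (hi - lo) / 2;
    std::nth_element(a.begin() + lo, a.begin() + mid, a.begin() + hi);
    quicksortMedianPivot(a, lo, mid);        // left half: elements <= the median
    quicksortMedianPivot(a, mid + 1, hi);    // right half: elements >= the median
}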
Reasons for merge sort. Merge Sort is stable. Merge sort does more moves but fewer compares than quick sort. If the compare overhead is greater than move overhead, then merge sort is faster. One situation where compare overhead may be greater is sorting an array of indices or pointers to objects, like strings.
If sorting a linked list, then merge sort using an array of pointers to the first nodes of working lists is the fastest method I'm aware of. This is how HP / Microsoft std::list::sort() is implemented. In the array of pointers, array[i] is either NULL or points to a list of length pow(2,i) (except the last pointer points to a list of unlimited length).
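A hedged sketch of that "array of sorted runs" idea, using a minimal hand-rolled node type; it only illustrates the technique and is not the actual std::list::sort source. Here bin[i] is either null or a sorted run of 2^i nodes, except the last bin, which may grow without limit.

struct Node { int value; Node* next; };      // minimal singly linked node

// Merge two sorted chains into one sorted chain (stable: ties keep 'a' first).
static Node* merge(Node* a, Node* b)
{
    Node head{0, nullptr};
    Node* tail = &head;
    while (a && b) {
        if (b->value < a->value) { tail->next = b; tail = b; b = b->next; }
        else                     { tail->next = a; tail = a; a = a->next; }
    }
    tail->next = a ? a : b;
    return head.next;
}

Node* mergeSortList(Node* head)
{
    const int kBins = 32;
    Node* bin[kBins] = {};                   // bin[i]: sorted run of 2^i nodes, or null
    while (head) {
        Node* run = head;                    // peel off one node as a run of length 1
        head = head->next;
        run->next = nullptr;
        int i = 0;
        for (; i < kBins && bin[i]; ++i) {   // carry-propagate, like binary addition
            run = merge(bin[i], run);
            bin[i] = nullptr;
        }
        if (i == kBins) --i;                 // the last bin holds a run of unlimited length
        bin[i] = run;
    }
    Node* result = nullptr;
    for (int i = 0; i < kBins; ++i)          // fold the leftover runs together (older runs first)
        if (bin[i]) result = merge(bin[i], result);
    return result;
}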
I found the solution:

if (start > stop)                       2 ops
pivot <- partition(A, start, stop)      2 ops + n
quickSort(A, start, pivot-1)            2 ops + T(n/2)
quickSort(A, pivot+1, stop)             2 ops + T(n/2)

T(n) = 8 + 2T(n/2) + n                            k = 1
     = 8 + 2(8 + 2T(n/4) + n/2) + n
     = 24 + 4T(n/4) + 2n                          k = 2
     ...
     = (2^k - 1)*8 + 2^k * T(n/2^k) + k*n

The recursion bottoms out when n = 2^k, i.e. k = log2(n). With T(1) = 2:

T(n) = (2^log2(n) - 1)*8 + 2^log2(n) * 2 + log2(n)*n
     = 8n - 8 + 2n + n*log2(n)
     = 10n + n*log2(n) - 8
     = n*(10 + log2(n)) - 8

which is O(n log n).
Quick sort has a worst case of O(n^2), but that only occurs if you have bad luck when choosing the pivot. If you can select the kth element in O(n), that means you can choose a good pivot by doing O(n) extra steps. That yields a worst-case O(n log n) algorithm. There are a couple of reasons why mergesort is still used. First, this selection algorithm is more or less cumbersome to implement in place, and it also adds several extra operations to the regular quicksort, so it is not as much faster than merge sort as one might expect.
Nevertheless, MergeSort is not kept around because of its worst-case time complexity; in fact, HeapSort achieves the same worst-case bounds and is also in place, yet it didn't replace MergeSort (though it has other disadvantages against quicksort too). The main reason why MergeSort survives is that it is the fastest stable sort algorithm known so far. There are several applications in which it is paramount to have a stable sorting algorithm, and that is the strength of MergeSort.
A stable sort is one in which equal items preserve their original relative order. For example, this is very useful when you have two keys, and you want to sort by the first key first and then by the second key, preserving the first key order.
The problem with HeapSort compared to quicksort is that it is cache inefficient: you swap/compare elements that are far from each other in the array, while quicksort compares adjacent elements, which are more likely to be in the cache at the same time.

Choice of pivot in Quick Sort Algorithm

I am learning Quick Sort. I know that Quick Sort performs badly when the pivot value produces an unbalanced partition, so the first or last element is not a good choice, because if the list is almost sorted the partition would be unbalanced.
As I searched, I found 2 options:
One was to choose a pivot randomly between low (lowest index) and up (highest index). It seems a safe option, but random number generators are time consuming.
The second would be to take the median of all the elements. This option is costly, so the median of the first, last and middle elements can be used as the pivot element instead.
Which method proves to be the most efficient for Quick Sort? Is there any other method available for making the choice of pivot element?
Yes, if you're worried about the array being sorted or nearly sorted, you can apply successively more effort to choosing a good pivot, as you suggest, but at the cost of slowing the algorithm down if your data is unsorted. Skiena, in The Algorithm Design Manual, has a good discussion of pivot selection and he suggests you could go as far as to randomize the array before applying quicksort, but my guess is another sorting algorithm would perform better if you're that worried.
Which method proves out to be the most efficient for Quick Sort?
The key point here is to run performance measurements on your data.
There is no single “most efficient” choice for quicksort. Either you slow down your sort for some (many?) cases by spending extra time selecting each pivot, or you have pathological (O(N^2)) behavior for some inputs. Spending more time selecting the pivot slows down sorting for some inputs while speeding up other cases. It's always a trade-off. You choose a trade-off that improves your speed for the kind of inputs you expect.
In the real world, we can prevent the pathological cases fairly cheaply using introsort. One characteristic of a pathological case is deep recursion, so introsort detects deep recursion and switches to a different (but guaranteed O(N log N)) algorithm.
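For illustration, here is a hedged sketch of the introsort idea in C++: depth-limited quicksort that falls back to heapsort via std::make_heap/std::sort_heap. The cutoff, pivot choice and depth limit are illustrative values, not the ones your standard library actually uses.

#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Quicksort with a recursion-depth budget. When the budget runs out (a sign of
// a pathological input), the current range is finished with heapsort instead,
// which keeps the whole sort within O(N log N).
void introsortSketch(std::vector<int>& a, int lo, int hi, int depthLimit)   // sorts [lo, hi)
{
    if (hi - lo <= 16) {                             // small range: insertion sort
        for (int k = lo + 1; k < hi; ++k)
            for (int m = k; m > lo && a[m] < a[m - 1]; --m)
                std::swap(a[m], a[m - 1]);
        return;
    }
    if (depthLimit == 0) {                           // too deep: fall back to heapsort
        std::make_heap(a.begin() + lo, a.begin() + hi);
        std::sort_heap(a.begin() + lo, a.begin() + hi);
        return;
    }
    int pivot = a[lo + (hi - lo) / 2];               // middle element as pivot (illustrative)
    int i = lo, j = hi - 1;
    while (i <= j) {                                 // Hoare-style partition
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    introsortSketch(a, lo, j + 1, depthLimit - 1);
    introsortSketch(a, i, hi, depthLimit - 1);
}

// Typical call: introsortSketch(v, 0, (int)v.size(), 2 * (int)std::log2(v.size() + 1));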
If you are really worried about the worst-case scenario, randomize the subarray in each recursive call and this should protect you against the worst case.

sorting algorithm suitable for a sorted list

I have a sorted list at hand. Now I add a new element to the end of the list. Which sorting algorithm is suitable for such a scenario?
Quick sort has a worst-case time complexity of O(n^2) when the list is already sorted. Does this mean the time complexity of quick sort, if used in the above case, will be close to O(n^2)?
If you are adding just one element, find the position where it should be inserted and put it there. For an array, you can do binary search for O(logN) time and insert in O(N). For a linked list, you'll have to do a linear search which will take O(N) time but then insertion is O(1).
As for your question on quicksort: If you choose the first value as your pivot, then yes, it will be O(N^2) in your case. Choose a random pivot and your case will still be O(N log N) on average. However, the method I suggest above is both easier to implement and faster in your specific case.
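For example, with a std::vector the whole "find the spot, then insert" operation looks like this (a minimal sketch; the binary search is O(log N), while the insert itself shifts the tail and is O(N)):

#include <algorithm>
#include <vector>

// Insert one new value into an already sorted vector, keeping it sorted.
void insertSorted(std::vector<int>& sorted, int value)
{
    auto pos = std::lower_bound(sorted.begin(), sorted.end(), value);   // O(log N) search
    sorted.insert(pos, value);                                          // O(N) shift of the tail
}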
It depends on the implementation of the underlying list.
It seems to me that insertion sort will fit your needs, except in the case when the list is implemented as an array list; in that case too many moves will be required.
Rather than appending to the end of the list, you should do an insert operation.
That is, when adding 5 to [1,2,3,4,7,8,9] you'd want to "insert" it by putting it where it belongs in the sorted list, instead of at the end and then re-sorting the whole list.
You can quickly find the position to insert the item by using a binary search.
This is basically how insertion sort works, except it operates on the entire list. This method will have better performance than even the best sorting algorithm, for a single item. It may also be faster than appending at the end of the list, depending on your implementation.
I'm assuming you're using an array, since you talk about quicksort, so just adding an element would involve finding the place to insert it (O(log n)) and then actually inserting it (O(n)) for a total cost of O(n). Just appending it to the end and then resorting the entire list is definitely the wrong way to go.
However, if this is to be a frequent operation (i.e. if you have to keep adding elements while maintaining the sorted property) you'll incur an O(n^2) cost of adding another n elements to the list. If you change your representation to a balanced binary tree, that drops to O(n log n) for another n inserts, but finding an element by index will become O(n). If you never need to do this, but just iterate over the elements in order, the tree is definitely the way to go.
Of possible interest is the indexable skiplist which, for a slight storage cost, has O(log n) inserts, deletes, searches and lookups-by-index. Give it a look, it might be just what you're looking for here.
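As a minimal illustration of the balanced-tree option in C++ (assuming duplicates are allowed), a std::multiset keeps the elements sorted with O(log n) inserts, at the cost of losing O(1) access by index:

#include <set>

std::multiset<int> items;        // a balanced tree (red-black in typical implementations)

void addItem(int x)
{
    items.insert(x);             // O(log n); in-order iteration stays sorted
}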
What exactly do you mean by "list" ? Do you mean specifically a linked list, or just some linear (sequential) data structure like an array?
If it's linked list, you'll need a linear search for the correct position. The insertion itself can be done in constant time.
If it's something like an array, you can add to the end and sort, as you mentioned. A sorted collection is only bad for Quicksort if the Quicksort is really badly implemented. If you select your pivot with the typical median-of-3 algorithm, a sorted list will give optimal performance.

Worst case for QuickSort - when can it occur?

When analyzing QS, every one always refers to the "almost sorted" worst case. When can such a scenario occur with natural input?
The only example I came up with is re-indexing.
I think people are confusing Quicksort the partition-based sorting algorithm, and "qsort" the various library implementations.
I prefer to see Quicksort the algorithm as having a pluggable pivot selection algorithm, which is quite essential in analyzing its behavior.
If the first element is always chosen as the pivot, then an already sorted list is the worst-case. Often there's a high probability that the array is already/nearly sorted, so this implementation is rather poor.
Analogously, selecting the last element as the pivot is bad for the same reason.
Some implementations try to avoid this problem by choosing the middle element as the pivot. This would not perform as badly on already/nearly sorted arrays, but one could still construct an input that would exploit this predictable pivot selection and make it run in quadratic time.
Thus, you get randomized pivot selection algorithms, but even this doesn't guarantee O(N log N).
So other algorithms were developed that would use some information from the sequence before picking a pivot. You can of course scan the whole sequence and find the median, and use that as the pivot. This guarantees O(N log N), but it is of course slower in practice.
So some corners are cut, and people devised the median-of-3 algorithm. Of course, later even this was exploitable by the so-called median-of-3 "killer".
So more attempts were made at coming up with more "intelligent" pivot selection algorithms that guarantee O(N log N) asymptotic behavior while still being fast enough to be practical, with varying degrees of success.
So really, unless one specifies a particular implementation of Quicksort, the question of when the worst case scenario occurs is ill-defined. If you use the so-called median-of-medians pivot selection algorithm, there is no quadratic worst-case scenario.
Most library implementations, however, are likely to forfeit O(N log N) guarantee for much faster sorting in the average case. Some of the really old implementations use the first element as the pivot, which is now well-understood as poor and is no longer a practice widely followed.
I believe that the worst case for quicksort depends on the choice of the pivot element at every step. Quicksort has its worst performance if the pivot is likely to be either the smallest or the largest element in the list (e.g. the first or last element of an already sorted list).
If, e.g. you choose the middle element of the list, an already sorted list does not have the worst case runtime.
So, if you suspect your scenario is likely to be a bad case for quicksort, you can simply change your choice of pivot element to make quicksort perform better.
Note: I know that this did not give more examples of real-world occasions for quicksort worst cases. Examples of this depend on the implementation you are working with.
The actual question was: "When can such a scenario (almost sorted) occur with natural input?".
Although all the answers are dealing with "what causes worst case performance", none have covered "what causes data that meets the worst case performance scenario".
So, to answer the actual question:
Programmer error: Basically you end up sorting a list twice. Typically this happens because a list is sorted in one place in the code, and later, in another piece of code, you know you need the list to be sorted, so you sort it again.
Using almost-chronological data: You have data that is generally received in chronological order, but occasionally some elements are out of position. (Consider a multi-threaded environment adding time-stamped elements to a list. Race conditions can cause elements to be added in a different order to which they were time-stamped.) In this situation, if you need sorted data, you must re-sort, because the order of the data is not guaranteed.
Adding items to a list: If you have a sorted list and simply append some items (i.e. without using binary insertion), you then need to re-sort an almost-sorted list.
Data from an external source: If you receive data from an external source, there may be no guarantee that it's sorted. So you sort it yourself. However, if the external source is sorted, you will be re-sorting the data.
Natural ordering: This is similar to the chronological data. Basically, the natural order of the data you receive may be sorted. Consider an insurance company adding car registrations. If the authority assigning car registrations does so in a predictable order, newer cars are likely, but not guaranteed, to have higher registration numbers. Since you're not guaranteed it's sorted, you have to re-sort.
Interleaved data: If you receive data from multiple sorted sources with overlapping keys, you could get keys resembling the following: 1 3 2 5 4 7 6 9 8 11 10 13 12 15 14 17 16 19 18. Even though half the elements are out of sequence with their neighbours, the list is "almost sorted". A QuickSort that pivots on the first element would certainly exhibit O(n^2) performance on it.
Conclusion
So, given all the above scenarios, it's actually quite easy to end up sorting almost-sorted data. And this is exactly why a QuickSort that pivots on the first element is actually best avoided. polygene has provided some interesting information on alternate pivoting considerations.
As a side note: one of the usually worst-performing sorting algorithms actually does quite well with "almost-sorted" data. On the interleaved data above, bubble sort requires only 9 swap operations, and its performance would actually be O(n).
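For reference, the variant of bubble sort that behaves this way is the one with an early exit: a pass that performs no swaps ends the sort, so nearly sorted input costs only a couple of passes. A minimal sketch:

#include <cstddef>
#include <utility>
#include <vector>

// Bubble sort with an early exit. On almost-sorted input, the first pass fixes
// the few local inversions and the second pass detects no swaps and stops,
// giving roughly O(n) behaviour for that case (O(n^2) in general).
void bubbleSortEarlyExit(std::vector<int>& a)
{
    bool swapped = true;
    for (std::size_t end = a.size(); swapped && end > 1; --end) {
        swapped = false;
        for (std::size_t i = 1; i < end; ++i) {
            if (a[i] < a[i - 1]) {
                std::swap(a[i], a[i - 1]);
                swapped = true;
            }
        }
    }
}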
From the Wikipedia article on Quicksort: for quicksort, the "worst case" corresponds to an already sorted list. A list with all items equal to the same number is already sorted.
Worst cases in quick sort:
All elements of the array are the same
The array is already sorted in the same order
The array is already sorted in reverse order
Quicksort's worst case depends on the choice of pivot element, so the problem occurs only when:
1) The array is already sorted in the same order.
2) The array is already sorted in reverse order.
3) All elements are the same (a special case of cases 1 and 2).

Is it faster to sort a list after inserting items or adding them to a sorted list

If I have a sorted list (sorted with, say, quicksort) and I have a lot of values to add, is it better to suspend sorting and add them to the end, then sort, or to use a binary chop to place the items correctly while adding them? Does it make a difference if the items are random or already more or less in order?
If you add enough items that you're effectively building the list from scratch, you should be able to get better performance by sorting the list afterwards.
If items are mostly in order, you can tweak both incremental update and regular sorting to take advantage of that, but frankly, it usually isn't worth the trouble. (You also need to be careful of things like making sure some unexpected ordering can't make your algorithm take much longer, q.v. naive quicksort)
Both incremental update and regular list sort are O(N log N), but you can get a better constant factor sorting everything afterward (I'm assuming here that you've got some auxiliary data structure so your incremental update can access list items faster than O(N)...). Generally speaking, sorting all at once has a lot more design freedom than maintaining the ordering incrementally, since incremental update has to maintain a complete order at all times, but an all-at-once bulk sort does not.
If nothing else, remember that there are lots of highly-optimized bulk sorts available.
Usually it's far better to use a heap. In short, it splits the cost of maintaining order between the pusher and the picker. Both operations are O(log n), instead of the O(n log n) of most other solutions.
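In C++ that suggestion maps naturally onto std::priority_queue (here configured as a min-heap). This is just an illustrative sketch of the heap approach, not a full replacement for a sorted list, since you can only pop items in order rather than index into them.

#include <functional>
#include <queue>
#include <vector>

// Min-heap: push() and popSmallest() are both O(log n), so the cost of keeping
// things ordered is split between the producer and the consumer.
std::priority_queue<int, std::vector<int>, std::greater<int>> pending;

void push(int x) { pending.push(x); }

int popSmallest()
{
    int v = pending.top();
    pending.pop();
    return v;
}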
If you're adding in bunches, you can use a merge sort. Sort the list of items to be added, then copy from both lists, comparing items to determine which one gets copied next. You could even copy in place if you resize your destination array and work from the end backwards.
The efficiency of this solution is O(n+m) + O(m log m) where n is the size of the original list, and m is the number of items being inserted.
Edit: Since this answer isn't getting any love, I thought I'd flesh it out with some C++ sample code. I assume that the sorted list is kept in a linked list rather than an array. This changes the algorithm to look more like an insertion than a merge, but the principle is the same.
#include <algorithm>
#include <list>
#include <vector>

// Note that itemstoadd is modified (sorted) as a side effect of this function.
template<typename T>
void AddToSortedList(std::list<T>& sortedlist, std::vector<T>& itemstoadd)
{
    // Sort the new items first: O(m log m).
    std::sort(itemstoadd.begin(), itemstoadd.end());
    typename std::list<T>::iterator listposition = sortedlist.begin();
    typename std::vector<T>::iterator nextnewitem = itemstoadd.begin();
    // Walk the sorted list once, inserting each new item in front of the first
    // existing element that is not smaller: O(n + m) overall.
    while (nextnewitem != itemstoadd.end())
    {
        if ((listposition == sortedlist.end()) || (*nextnewitem < *listposition))
            sortedlist.insert(listposition, *nextnewitem++);
        else
            ++listposition;
    }
}
I'd say, let's test it! :)
I tried it with quicksort, but sorting an almost-sorted array with quicksort is... well, not really a good idea. I tried a modified one, cutting off at 7 elements and using insertion sort for those. Still, horrible performance. I switched to merge sort. It may need quite a lot of memory for sorting (it's not in-place), but the performance is much better on sorted arrays and almost identical on random ones (the initial sort took almost the same time for both; quicksort was only slightly faster).
This already shows one thing: the answer to your questions depends strongly on the sorting algorithm you use. If it has poor performance on almost-sorted lists, inserting at the right position will be much faster than adding at the end and then re-sorting; and merge sort might not be an option for you, as it might need far too much external memory if the list is huge. BTW, I used a custom merge sort implementation that only uses half the external storage of the naive implementation (which needs as much external storage as the array size itself).
If merge sort is not an option and quicksort is certainly not, the best alternative is probably heap sort.
My results: adding the new elements simply at the end and then re-sorting the array was several orders of magnitude faster than inserting them in the right positions. However, my initial array had 10 million elements (sorted) and I was adding another million (unsorted). If you instead add 10 elements to an array of 10 million, inserting them correctly is much faster than re-sorting everything. So the answer to your question also depends on how big the initial (sorted) array is and how many new elements you want to add to it.
In principle, it's faster to create a tree than to sort a list. Tree inserts are O(log(n)) each, leading to O(n log(n)) overall. Sorting is also O(n log(n)).
That's why Java has TreeMap (in addition to TreeSet, TreeList, ArrayList and LinkedList implementations of a List).
A TreeSet keeps things in object comparison order. The key is defined by the Comparable interface.
A LinkedList keeps things in the insertion order.
An ArrayList uses more memory but is faster for some operations.
A TreeMap, similarly, removes the need to sort by a key. The map is built in key order during the inserts and maintained in sorted order at all times.
However, for some reason, the Java implementation of TreeSet is quite a bit slower than using an ArrayList and a sort.
[It's hard to speculate as to why it would be dramatically slower, but it is. It should be slightly faster by one pass through the data. This kind of thing is often the cost of memory management trumping the algorithmic analysis.]
It's about the same. Inserting an item into a sorted list is O(log N), and doing this for every element in the list, N (thus building the list), would be O(N log N), which is the speed of quicksort (or merge sort, which is closer to this approach).
If you instead inserted them at the front it would be O(1) per insert, but doing a quicksort afterwards would still be O(N log N).
I would go with the first approach, because it has the potential to be slightly faster. If the initial size of your list, N, is much greater than the number of elements to insert, X, then the insert approach is O(X log N). Sorting after inserting at the head of the list is O(N log N). If N=0 (i.e. your list is initially empty), the speed of inserting in sorted order and sorting afterwards is the same.
Inserting an item into a sorted list takes O(n) time, not O(log n) time. You have to find the place to put it, taking O(log n) time. But then you have to shift over all the elements after it, taking O(n) time. So inserting while maintaining sortedness is O(n^2), whereas inserting them all and then sorting is O(n log n).
Depending on your sort implementation, you can get even better than O(n log n) if the number of inserts is much smaller than the list size. But if that is the case, it doesn't matter either way.
So do the insert all and sort solution if the number of inserts is large, otherwise it probably won't matter.
If the list is a) already sorted, and b) dynamic in nature, then inserting into a sorted list should always be faster (find the right place (O(n)) and insert (O(1))).
However, if the list is static, then a shuffle of the remainder of the list has to occur (O(n) to find the right place and O(n) to slide things down).
Either way, inserting into a sorted list (or something like a Binary Search Tree) should be faster.
O(n) + O(n) should always be faster than O(n log n).
At a high level, it's a pretty simple problem, because you can think of sorting as just iterated searching. When you want to insert an element into an ordered array, list, or tree, you have to search for the point at which to insert it. Then you put it in, at hopefully low cost. So you could think of a sort algorithm as just taking a bunch of things and, one by one, searching for the proper position and inserting them. Thus, an insertion sort (O(n*n)) is an iterated linear search (O(n)). Tree, heap, merge, radix, and quick sort (O(n*log(n))) can be thought of as iterated binary search (O(log(n))). It is possible to have an O(n) sort, if the underlying search is O(1) as in an ordered hash table. (An example of this is sorting 52 cards by flinging them into 52 bins.)
So the answer to your question is, inserting things one at a time, versus saving them up and then sorting them should not make much difference, in a big-O sense. You could of course have constant factors to deal with, and those might be significant.
Of course, if n is small, like 10, the whole discussion is silly.
You should add them first and then use a radix sort; this should be optimal.
http://en.wikipedia.org/wiki/Radix_sort#Efficiency
(If the list you're talking about is like a C# List<T>.) Adding a few values into their correct positions in a sorted list with many values is going to require fewer operations; but if the number of values being added becomes large, it will require more.
I would suggest using not a list but some more suitable data structure in your case, like a binary tree, for example: a sorted data structure with minimal insertion time.
If this is .NET and the items are integers, it's quicker to add them to a Dictionary (or, if you're on .NET 3.0 or above, use the HashSet if you don't mind losing duplicates). This gives you automagic sorting.
I think that strings would work the same way as well. The beauty is you get O(1) insertion and sorting this way.
Inserting an item into a sorted list is O(log n), while sorting a list is O(n log n).
Which would suggest that it's always better to sort first and then insert.
But remember that big 'O' only concerns how the speed scales with the number of items; it might be that for your application an insert in the middle is expensive (e.g. if it is a vector), and so appending and sorting afterwards might be better.
