What sorting algorithm fits this 'stream-like' condition?

I have a buffer receiving data, so the data arrives like a stream and there is I/O latency. The way I do it now: when the buffer is full, I use qsort to sort the buffer and write the result to disk. But there is noticeable latency during the qsort call, so I am looking for other sorting algorithms that can start sorting while the data is still being added to the buffer, in order to reduce the time consumed overall.
I don't know if I have made myself clear; leave a comment if anything needs clarifying. Thanks.

Heap sort keeps the data permanently in a partially sorted condition and so is comparable to insertion sort. But it is substantially quicker and has a worst case of O(n log n) compared with O(n²) for insertion sort.
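To make the incremental idea concrete, here is a minimal C sketch (my own illustration, not from the answer above) of a binary min-heap used as a priority queue: elements are pushed as they arrive, so most of the comparison work overlaps with the I/O waits, and draining the heap at the end emits the data in sorted order.

#include <stdio.h>
#include <stdlib.h>

/* Minimal binary min-heap: heap_push is called as each datum arrives,
 * so most comparisons happen during the I/O waits; when the buffer is
 * complete, repeatedly calling heap_pop yields the data in order. */
typedef struct { int *a; size_t len, cap; } heap_t;

static void heap_push(heap_t *h, int v) {
    if (h->len == h->cap) {
        h->cap = h->cap ? h->cap * 2 : 64;
        h->a = realloc(h->a, h->cap * sizeof *h->a);
    }
    size_t i = h->len++;
    h->a[i] = v;
    while (i > 0 && h->a[(i - 1) / 2] > h->a[i]) {   /* sift up */
        int t = h->a[i]; h->a[i] = h->a[(i - 1) / 2]; h->a[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

static int heap_pop(heap_t *h) {                      /* remove and return the minimum */
    int min = h->a[0];
    h->a[0] = h->a[--h->len];
    size_t i = 0;
    for (;;) {                                        /* sift down */
        size_t l = 2 * i + 1, r = l + 1, s = i;
        if (l < h->len && h->a[l] < h->a[s]) s = l;
        if (r < h->len && h->a[r] < h->a[s]) s = r;
        if (s == i) break;
        int t = h->a[i]; h->a[i] = h->a[s]; h->a[s] = t;
        i = s;
    }
    return min;
}

int main(void) {
    heap_t h = {0};
    int sample[] = {42, 7, 19, 3, 88, 11};
    for (size_t i = 0; i < 6; i++) heap_push(&h, sample[i]);  /* "as data arrives" */
    while (h.len) printf("%d ", heap_pop(&h));                /* drain in sorted order */
    printf("\n");
    free(h.a);
    return 0;
}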
How is this going to work? Presumably at some point you have to stop reading from the stream, store what you have sorted, and start reading a new set of data?

I think merge sort or tree sort can be of great help. See Wikipedia for why.
When you can cut the huge input into reasonably large blocks, merge sort is more appropriate.
When you insert small pieces at a time, tree sort is more appropriate.
You want to implement an online sorting algorithm, i.e. an algorithm which runs while receiving the data as a stream. Search for online algorithms on the web and you may find other nice algorithms.
In your case I would use tree sort. It doesn't have better complexity than quicksort (both are O(n log n) most of the time and O(n²) in a few bad cases), but it amortizes the cost over each insertion. That means the delay after the last datum is added is of order O(log n) for that final insertion (plus an in-order traversal to write the result out), not O(n log n) for a full sort.
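Here is a minimal, hedged sketch of tree sort in C (an unbalanced binary search tree for brevity; a real implementation would use a balanced tree such as a red-black or AVL tree to avoid the O(n²) bad cases mentioned above). Values are inserted as they arrive, and an in-order walk at the end emits the sorted result.

#include <stdio.h>
#include <stdlib.h>

/* Tree sort sketch: insert each incoming value into a BST as it arrives;
 * an in-order traversal at the end is the only remaining work. */
typedef struct node { int key; struct node *left, *right; } node;

static node *insert(node *root, int key) {
    if (!root) {
        node *n = malloc(sizeof *n);
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key < root->key) root->left = insert(root->left, key);
    else                 root->right = insert(root->right, key);
    return root;
}

static void inorder(const node *root) {        /* emit keys in ascending order */
    if (!root) return;
    inorder(root->left);
    printf("%d ", root->key);
    inorder(root->right);
}

int main(void) {
    int stream[] = {5, 1, 9, 3, 7};
    node *root = NULL;
    for (int i = 0; i < 5; i++) root = insert(root, stream[i]);  /* insert as data arrives */
    inorder(root);                                               /* final pass is the only delay */
    printf("\n");
    return 0;                                                    /* tree left for the OS in this sketch */
}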

You can try to use my Link Array structure. It should be OK for sequential adding of random data while keeping it sorted (look at the numbers in the table). It is a variation of the skip list approach, but with simpler implementation and logic (although a skip list's performance should be better).

Related

Sorting an array

I want to know which approach is better for sorting an array of elements.
For performance, is it better to sort the array at the end, once I have finished filling it, or to sort it each time I add an element?
Ignoring the application-specific information for a moment, consider that sorted insertion requires, worst case, O(n) operations per element. For n elements, this of course gives us O(n²). If this sounds familiar, it's because what you're doing is (as another commenter pointed out) an insertion sort. In contrast, performing one quicksort on the entire list will take, worst case, O(n log n) time.
So, does this mean you should definitely wait until the end to sort? No. The important thing to remember is that O(n log n) is the best we can do for sorting in the general case. Your application-specific knowledge of the data can influence the best algorithm for the job. For example, if you know your data is already almost sorted, then insertion sort will give you linear time complexity.
Your mileage may vary, but I think the algorithmic viewpoint is useful when examining the general problem of "When should I sort?"
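To make the per-insert cost concrete, here is a minimal sketch (my own illustration, not from the answer above) of sorted insertion into a plain C array; the memmove is the O(n) part that makes n insertions O(n²) overall.

#include <string.h>

/* Insert `value` into the already-sorted prefix a[0..*len-1], keeping it sorted.
 * The scan plus the shift cost O(n) per call, hence O(n^2) for n insertions,
 * whereas one sort at the end would be O(n log n). */
static void sorted_insert(int *a, int *len, int value) {
    int i = 0;
    while (i < *len && a[i] < value) i++;               /* find the insertion point */
    memmove(a + i + 1, a + i, (*len - i) * sizeof *a);  /* shift the tail right: O(n) */
    a[i] = value;
    (*len)++;
}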
It depends on what is critical to you. Do you need to be able to insert very fast (many inserts but few queries), or do you need to be able to query very fast and insert slowly (many queries but not many inserts)?
This is basically your problem to solve. When you know this you can select an appropriate sorting algorithm and apply it.
Edit: This is assuming that either choice actually matters. This depends a lot on your activity (inserts vs queries) and the amount of data that you need to sort.

Comparison between timsort and quicksort

Why is it that I mostly hear about Quicksort being the fastest overall sorting algorithm when Timsort (according to wikipedia) seems to perform much better? Google didn't seem to turn up any kind of comparison.
TimSort is a highly optimized mergesort; it is stable and faster than a plain mergesort.
Compared with quicksort, it has two advantages:
It is unbelievably fast for nearly sorted data sequences (including reverse-sorted data);
The worst case is still O(n log n).
To be honest, I don't think #1 is an advantage, but it did impress me.
Here are QuickSort's advantages:
QuickSort is very, very simple; even a highly tuned implementation can be written down as pseudocode within 20 lines;
QuickSort is fastest in most cases;
The memory consumption is O(log n).
Currently, the Java 7 SDK implements TimSort and a new quicksort variant: Dual-Pivot QuickSort.
If you need a stable sort, try TimSort; otherwise start with quicksort.
More or less, it has to do with the fact that Timsort is a hybrid sorting algorithm. This means that while the two underlying sorts it uses (Mergesort and Insertion sort) are both worse than Quicksort for many kinds of data, Timsort only uses them when it is advantageous to do so.
On a slightly deeper level, as Patrick87 states, quicksort is a worst-case O(n²) algorithm. Choosing a good pivot isn't hard, but guaranteeing an O(n log n) quicksort comes at the cost of generally slower sorting on average.
For more detail on Timsort, see this answer, and the linked blog post. It basically assumes that most data is already partially sorted, and constructs "runs" of sorted data that allow for efficient merges using mergesort.
Generally speaking, quicksort is the best algorithm for primitive arrays. This is due to memory locality and caching.
JDK 7 uses TimSort for Object arrays. An Object array only holds object references; the objects themselves are stored on the heap. Comparing two objects means reading them from the heap, which amounts to reading from one part of the heap for one object, then randomly reading an object from another part of the heap. There will be a lot of cache misses. I guess for this reason memory locality is not important any more, and this may be why the JDK uses TimSort only for Object arrays instead of primitive arrays.
This is only my guess.
Here are benchmark numbers from my machine (i7-6700 CPU, 3.4GHz, Ubuntu 16.04, gcc 5.4.0, parameters: SIZE=100000 and RUNS=3):
$ ./demo
Running tests
stdlib qsort time: 12246.33 us per iteration
##quick sort time: 5822.00 us per iteration
merge sort time: 8244.33 us per iteration
...
##tim sort time: 7695.33 us per iteration
in-place merge sort time: 6788.00 us per iteration
sqrt sort time: 7289.33 us per iteration
...
grail sort dyn buffer sort time: 7856.67 us per iteration
The benchmark comes from Swenson's sort project, in which he has implemented several sorting algorithms in C. Presumably, his implementations are good enough to be representative, but I haven't investigated them.
So you really can't tell. Benchmark numbers only stay relevant for at most two years and then you have to repeat them. Possibly, timsort beat qsort waaay back in 2011 when the question was asked, but the times have changed. Or qsort was always the fastest, but timsort beat it on non-random data. Or Swenson's code isn't so good and a better programmer would turn the tide in timsort's favor. Or perhaps I suck and didn't use the right CFLAGS when compiling the code. Or... You get the point.
Tim Sort is great if you need an order-preserving sort, or if you are sorting a complex array (comparing heap-based objects) rather than a primitive array. As mentioned by others, quicksort benefits significantly from the locality of data and processor caching for primitive arrays.
The fact that the worst case of quicksort is O(n²) was raised. Fortunately, you can achieve O(n log n) worst-case time with quicksort. The quicksort worst case occurs when the pivot is either the smallest or largest value, such as when the pivot is the first or last element of an already sorted array.
We can achieve O(n log n) worst-case quicksort by using the median as the pivot, since the median can be found in linear time, O(n). That O(n) selection at each level only adds a constant factor to the O(n) partitioning already done there, so the worst-case time complexity becomes O(n log n).
In practice, however, most implementations find that a random pivot is sufficient, so they do not search for the median value.
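For illustration, here is a minimal sketch of quicksort with a random pivot, the practical approach described above; a guaranteed O(n log n) version would instead pick the true median with a linear-time selection algorithm (median of medians), which is omitted here.

#include <stdlib.h>

/* Randomized quicksort sketch: a random element is swapped to the end and
 * used as the pivot, then the array is partitioned with the Lomuto scheme. */
static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static void quicksort(int *a, int lo, int hi) {
    if (lo >= hi) return;
    swap(&a[hi], &a[lo + rand() % (hi - lo + 1)]);   /* random pivot moved to the end */
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++)                    /* Lomuto partition */
        if (a[j] < pivot) swap(&a[i++], &a[j]);
    swap(&a[i], &a[hi]);                             /* pivot lands at index i */
    quicksort(a, lo, i - 1);                         /* recurse on both partitions */
    quicksort(a, i + 1, hi);
}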
Timsort is a popular hybrid sorting algorithm designed in 2002 by Tim Peters. It is a combination of insertion sort and merge sort, developed to perform well on many kinds of real-world data sets. It is a fast, stable and adaptive sorting technique with average and worst-case performance of O(n log n).
How Timsort works
First of all, the input array is split into sub-arrays/blocks known as runs.
A simple insertion sort is used to sort each run.
The sorted runs are then merged into a single array, merge-sort style (a simplified sketch follows this list).
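The following is a much-simplified sketch of that run-then-merge idea, not the real Timsort (which detects natural runs, uses galloping merges, and picks run lengths adaptively): fixed-size blocks are sorted with insertion sort and then merged pairwise. RUN is an arbitrary block size chosen for illustration.

#include <stdlib.h>
#include <string.h>

#define RUN 32                                               /* illustrative run length */

static void insertion_sort(int *a, int lo, int hi) {         /* sort a[lo..hi] in place */
    for (int i = lo + 1; i <= hi; i++) {
        int key = a[i], j = i - 1;
        while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

static void merge(int *a, int lo, int mid, int hi) {         /* merge two sorted halves */
    int n1 = mid - lo + 1, n2 = hi - mid;
    int *left = malloc(n1 * sizeof(int));
    int *right = malloc(n2 * sizeof(int));
    memcpy(left, a + lo, n1 * sizeof(int));
    memcpy(right, a + mid + 1, n2 * sizeof(int));
    int i = 0, j = 0, k = lo;
    while (i < n1 && j < n2) a[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
    while (i < n1) a[k++] = left[i++];
    while (j < n2) a[k++] = right[j++];
    free(left);
    free(right);
}

void timsort_sketch(int *a, int n) {
    for (int lo = 0; lo < n; lo += RUN)                       /* steps 1+2: sort each run */
        insertion_sort(a, lo, (lo + RUN - 1 < n - 1) ? lo + RUN - 1 : n - 1);
    for (int size = RUN; size < n; size *= 2)                 /* step 3: merge runs pairwise */
        for (int lo = 0; lo < n - size; lo += 2 * size) {
            int mid = lo + size - 1;
            int hi = (lo + 2 * size - 1 < n - 1) ? lo + 2 * size - 1 : n - 1;
            merge(a, lo, mid, hi);
        }
}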
Advantages of Timsort
It performs better on nearly ordered data.
It is well-suited to dealing with real-world data.
Quicksort is a highly useful and efficient sorting algorithm that divides a large array of data into smaller ones; it is based on the divide-and-conquer principle. Tony Hoare designed this sorting algorithm in 1959, and it has average performance of O(n log n).
How Quicksort works
Pick any element as the pivot.
Partition the array around the pivot.
Recursively apply quick sort to the left partition.
Recursively apply quick sort to the right partition.
Advantages of Quicksort
It performs better on random data as compared to Timsort.
It is useful when there is limited space availability.
It is better suited for large data sets.

Reading streamed data into a sorted list

We know that, in general, the "smarter" comparison sorts on arbitrary data run in worst case complexity O(N * log(N)).
My question is what happens if we are asked not to sort a collection, but a stream of data. That is, values are given to us one by one with no indicator of what comes next (other than that the data is valid/in range). Intuitively, one might think that it is superior then to sort data as it comes in (like picking up a poker hand one by one) rather than gathering all of it and sorting later (sorting a poker hand after it's dealt). Is this actually the case?
Gathering and sorting would be O(N + N * log(N)) = O(N * log(N)). However if we sort it as it comes in, it is O(N * K), where K = time to find the proper index + time to insert the element. This complicates things, since the value of K now depends on our choice of data structure. An array is superior in finding the index but wastes time inserting the element. A linked list can insert more easily but cannot binary search to find the index.
Is there a complete discussion on this issue? When should we use one method or another? Might there be a desirable in-between strategy of sorting every once in a while?
Balanced tree sort has O(N log N) complexity and maintains the list in sorted order while elements are added.
Absolutely not!
Firstly, if I can sort in-streaming data, I can just accept all my data in O(N) and then stream it to myself and sort it using the quicker method. I.e. you can perform a reduction from all-data to stream, which means it cannot be faster.
Secondly, you're describing an insertion sort, which actually runs in O(N²) time (i.e. your description of O(NK) was right, but K is not constant; it is a function of N), since it might take O(N) time to find the appropriate index. You could add binary search to make it a binary insertion sort, which cuts the comparisons to O(N log N), but in an array each insert still shifts O(N) elements, and a plain linked list cannot be binary-searched at all, so you haven't really saved anything; to get O(N log N) overall you need something like a balanced tree.
It is probably also worth mentioning the general principle: as long as you're in the comparison model (i.e. you don't have any non-trivial and helpful information about the data you're sorting, which is the general case), any sorting algorithm needs Ω(N log N) comparisons in the worst case. That's not a hypothesis but a theorem, so it is impossible to find anything faster under the same assumptions.
OK, if the stream is relatively slow, you will have a completely sorted list (minus the last element) when your last element arrives. Then all that remains is a single binary-search insertion, O(log n), not a complete sort, O(n log n). Potentially, there is a perceived performance gain, since you are getting a head start on the other sort algorithms.
Managing, queuing, and extracting data from a stream is a completely different issue and might be counter-productive to your intentions. I would not recommend this unless you can sort the complete data set in about the same time it takes to stream one or maybe two elements (and you feel good about coding the streaming portion).
Use heap sort in cases where tree sort will behave badly, i.e. large data sets, since tree sort needs additional space to store the tree structure.

What sort algorithm provides the best worst-case performance?

What is the fastest known sort algorithm for absolute worst case? I don't care about best case and am assuming a gigantic data set if that even matters.
Make sure you have seen this:
visualizing sort algorithms - it helped me decide which sorting algorithm to use.
It depends on the data. For example, for integers (or anything that can be expressed as an integer), the fastest is radix sort, which for fixed-length values has a worst-case complexity of O(n). The best general comparison-sort algorithms have complexity O(n log n).
If you are using binary comparisons, the best possible sort algorithm takes O(N log N) comparisons to complete. If you're looking for something with good worst case performance, I'd look at MergeSort and HeapSort since they are O(N log N) algorithms in all cases.
HeapSort is nice if all your data fits in memory, while MergeSort allows you to do on-disk sorts better (but takes more space overall).
There are other less-well-known algorithms mentioned on the Wikipedia sorting algorithm page that all have O(n log n) worst case performance. (based on comment from mmyers)
For the man with limitless budget
Facetious but correct:
Sorting networks trade space (in real hardware terms) for better than O(n log n) sorting!
Without resorting to such hardware (which is unlikely to be available), the best comparison sorts have a lower bound of Ω(n log n).
O(n log n) worst case performance (no particular order)
Binary Tree Sort
Merge Sort
Heap Sort
Smooth Sort
Intro Sort
Beating the n log n
If your data is amenable to it, you can beat the n log n restriction, but then the running time depends on the number of bits in the input data as well.
Radix sort and bucket sort are probably the best-known examples of this. Without more information about your particular requirements, it is not fruitful to consider these in more depth.
Quicksort is usually the fastest, but if you want good worst-case time, try Heapsort or Mergesort. These both have O(n log n) worst time performance.
If you have a gigantic data set (ie much larger than available memory) you likely have your data on disk/tape/something-with-expensive-random-access, so you need an external sort.
Merge sort works well in that case; unlike most other sorts it doesn't involve random reads/writes.
It is largely related to the size of your dataset and whether or not the set is already ordered (or what order it is currently in).
Entire books are written on search/sort algorithms. You aren't going to find an "absolute fastest" assuming a worst case scenario because different sorts have different worst-case situations.
If you have a sufficiently huge data set, you're probably looking at sorting individual bins of data, then using merge-sort to merge those bins. But at this point, we're talking data sets huge enough to be VASTLY larger than main memory.
I guess the most correct answer would be "it depends".
It depends both on the type of data and the type of resources. For example, there are parallel algorithms that beat quicksort, but given how you asked the question it's unlikely you have access to them. There are times when the "worst case" for one algorithm is the "best case" for another (nearly sorted data is problematic for a naive quicksort and gains nothing with merge sort, but is fast with much simpler techniques such as insertion sort).
It depends on the size, in terms of Big-O notation.
Here is a list of sorting algorithms' best and worst cases for you to compare.
My preference is the two-way merge sort.
Assuming randomly sorted data, quicksort.
O(n log n) average case, O(n²) in the worst case, but that requires highly non-random data.
You might want to describe your data set characteristics.
See Quick Sort Vs Merge Sort for a comparison of Quicksort and Mergesort, which are two of the better algorithms in most cases.
It all depends on the data you're trying to sort. Different algorithms have different speeds for different data. an O(n) algorithm may be slower than an O(n^2) algorithm, depending on what kind of data you're working with.
I've always preferred merge sort, as it's stable (meaning that if two elements are equal from a sorting perspective, then their relative order is explicitly preserved), but quicksort is good as well.
Among comparison sorts on a sequential machine, merge sort achieves the lowest worst-case upper bound, O(n log n), though quicksort might be better on some datasets.
You can't go lower than O(n log n) unless you're using special hardware (e.g. hardware supported bead sort, other non-comparison sorts).
On the importance of specifying your problem: radix sort might be the fastest, but it's only usable when your data has fixed-length keys that can be broken down into independent small pieces. That limits its usefulness in the general case, and explains why more people haven't heard of it.
http://en.wikipedia.org/wiki/Radix_sort
P.S. This is an O(k*n) algorithm, where k is the size of the key.
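As an illustration of that O(k*n) behaviour, here is a minimal LSD radix sort sketch for 32-bit unsigned keys, processing one byte per pass (so k = 4 passes); as noted above, it applies only to fixed-width integer keys.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort on 32-bit unsigned keys: four stable counting-sort passes,
 * one per byte, each pass being O(n), giving O(k*n) with k = 4. */
void radix_sort_u32(uint32_t *a, size_t n) {
    uint32_t *tmp = malloc(n * sizeof *tmp);
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++)                 /* histogram of this byte */
            count[(a[i] >> shift) & 0xFF]++;
        size_t pos = 0;
        for (int b = 0; b < 256; b++) {                /* prefix sums -> start offsets */
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }
        for (size_t i = 0; i < n; i++)                 /* stable scatter by byte value */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}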

Is there ever a good reason to use Insertion Sort?

For general-purpose sorting, the answer appears to be no, as quick sort, merge sort and heap sort tend to perform better in the average- and worst-case scenarios. However, insertion sort appears to excel at incremental sorting, that is, adding elements to a list one at a time over an extended period of time while keeping the list sorted, especially if the insertion sort is implemented as a linked list (O(log n) average case vs. O(n)). However, a heap seems to be able to perform just (or nearly) as well for incremental sorting (adding or removing a single element from a heap has a worst-case scenario of O(log n)). So what exactly does insertion sort have to offer over other comparison-based sorting algorithms or heaps?
From http://www.sorting-algorithms.com/insertion-sort:
Although it is one of the elementary sorting algorithms with O(n²) worst-case time, insertion sort is the algorithm of choice either when the data is nearly sorted (because it is adaptive) or when the problem size is small (because it has low overhead).
For these reasons, and because it is also stable, insertion sort is often used as the recursive base case (when the problem size is small) for higher-overhead divide-and-conquer sorting algorithms, such as merge sort or quick sort.
An important concept in the analysis of algorithms is asymptotic analysis. In the case of two algorithms with different asymptotic running times, such as one O(n²) and one O(n log n), as is the case with insertion sort and quicksort respectively, it is not definite that one is faster than the other.
The important distinction with this sort of analysis is that for sufficiently large N, one algorithm will be faster than the other. When reducing an algorithm's cost to a term like O(n log n), you drop constants. When realistically analyzing the running time of an algorithm, those constants matter, but only for small n.
So what does this mean? It means that for certain small n, some algorithms are faster. This article from EmbeddedGurus.net includes an interesting perspective on choosing different sorting algorithms for a system with limited space (16K) and limited memory. Of course, the article references only sorting a list of 20 integers, so larger values of n are irrelevant there. Shorter code and less memory consumption (as well as avoiding recursion) were ultimately the more important considerations.
Insertion sort has low overhead, it can be written fairly succinctly, and it has two key benefits: it is stable, and it is very fast when the input is nearly sorted.
Yes, there is a reason to use either an insertion sort or one of its variants.
The sorting alternatives (quick sort, etc.) of the other answers here make the assumption that the data is already in memory and ready to go.
But if you are attempting to read in a large amount of data from a slower external source (say a hard drive), there is a large amount of time wasted as the bottleneck is clearly the data channel or the drive itself. It just cannot keep up with the CPU. A natural series of waits occur during any read. These waits are wasted CPU cycles unless you use them to sort as you go.
For instance, if you were to make your solution to this be the following:
Read a ton of data in a dedicated loop into memory
Sort that data
You would very likely take longer than if you did the following in two threads.
Thread A:
Read a datum
Place datum into FIFO queue
(Repeat until data exhausted from drive)
Thread B:
Get a datum from the FIFO queue
Insert it into the proper place in your sorted list
(repeat until queue empty AND Thread A says "done").
...the above will allow you to use the otherwise wasted time. Note: Thread B does not impede Thread A's progress.
By the time the data is fully read, it will have been sorted and ready for use.
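A hedged sketch of that two-thread scheme in C with POSIX threads (compile with -pthread). The names (read_datum, the queue size, the item count) are illustrative stand-ins, and the "sorted list" is kept as a plain array for brevity; the point is only that the insertion work happens while the producer would otherwise be blocked on I/O.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define QCAP 256                              /* FIFO capacity (illustrative) */
#define NITEMS 10000                          /* total items to "read" (illustrative) */

static int queue[QCAP];
static int qhead = 0, qtail = 0, qcount = 0;
static bool producer_done = false;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

static int sorted[NITEMS];
static int sorted_len = 0;

static int read_datum(void) {                 /* stand-in for the slow I/O source */
    return rand();
}

static void *producer(void *arg) {            /* Thread A: read a datum, queue it */
    (void)arg;
    for (int i = 0; i < NITEMS; i++) {
        int x = read_datum();
        pthread_mutex_lock(&lock);
        while (qcount == QCAP) pthread_cond_wait(&not_full, &lock);
        queue[qtail] = x;
        qtail = (qtail + 1) % QCAP;
        qcount++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    producer_done = true;                     /* "Thread A says done" */
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg) {            /* Thread B: pop and insert in order */
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (qcount == 0 && !producer_done) pthread_cond_wait(&not_empty, &lock);
        if (qcount == 0 && producer_done) { pthread_mutex_unlock(&lock); break; }
        int x = queue[qhead];
        qhead = (qhead + 1) % QCAP;
        qcount--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);

        int i = sorted_len;                   /* shift larger items right, insert x */
        while (i > 0 && sorted[i - 1] > x) { sorted[i] = sorted[i - 1]; i--; }
        sorted[i] = x;
        sorted_len++;
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, producer, NULL);
    pthread_create(&b, NULL, consumer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("sorted %d items; first=%d last=%d\n", sorted_len, sorted[0], sorted[sorted_len - 1]);
    return 0;
}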
Most sorting procedures will use quicksort and then insertion sort for very small data sets.
If you're talking about maintaining a sorted list, there is no advantage over some kind of tree; it's just slower.
Well, maybe it consumes less memory or is a simpler implementation.
Inserting into a sorted list involves a scan, which means that each insert is O(n); therefore sorting n items becomes O(n²).
Inserting into a container such as a balanced tree is typically O(log n), so the sort is O(n log n), which is of course better.
But for small lists it hardly makes any difference. You might use an insertion sort if you have to write it yourself without any libraries, the lists are small and/or you don't care about performance.
YES,
Insertion sort is better than Quick Sort on short lists.
In fact, an optimized quicksort stops recursing once partitions fall below a size threshold, and the entire array is then finished with a single insertion-sort pass, which is fast because every element is already within the threshold of its final position.
Also...
For maintaining a scoreboard, Binary Insertion Sort may be as good as it gets.
See this page.
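A small sketch of the scoreboard idea with a binary search for the insertion point (the capacity of 10 and the descending order are my assumptions, not from the linked page): the search costs O(log n) comparisons, while the memmove keeps each insert at O(n) moves.

#include <string.h>

/* Keep the top scores in a descending, sorted array; binary-search for the
 * slot a new score belongs in, shift the tail, and drop the lowest score
 * when the board is already full. */
#define BOARD_CAP 10

static void scoreboard_insert(int *board, int *len, int score) {
    int lo = 0, hi = *len;
    while (lo < hi) {                          /* binary search for insertion point */
        int mid = (lo + hi) / 2;
        if (board[mid] >= score) lo = mid + 1;
        else hi = mid;
    }
    if (lo >= BOARD_CAP) return;               /* score doesn't make the board */
    int keep = (*len < BOARD_CAP ? *len : BOARD_CAP - 1) - lo;
    if (keep > 0) memmove(board + lo + 1, board + lo, keep * sizeof *board);
    board[lo] = score;
    if (*len < BOARD_CAP) (*len)++;
}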
For small array sizes, insertion sort outperforms quicksort.
Java 7 and Java 8 use dual-pivot quicksort to sort primitive data types.
Dual-pivot quicksort outperforms the typical single-pivot quicksort. According to the dual-pivot quicksort algorithm:
For small arrays (length < 27), use the Insertion sort algorithm.
Choose two pivot...........
Definitely, insertion sort outperforms quicksort for smaller array sizes, and that is why the algorithm switches to insertion sort for arrays of length less than 27. The reason could be that there is no recursion in insertion sort.
Source: http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
