Sorting method for not-so-scrambled data

I want to sort a million numbers. I already have them in memory (let's assume they fit), and I know for a fact that it is very, very likely that any given number is originally in a position quite close to its final position after sorting (i.e. the 1000th number in the original data will very likely end up between positions 900 and 1100 after sorting).
Which sorting method would perform best in this case? And most importantly, why would it perform better than the others? Assuming memory is big enough for any common method.

Plain old insertion sort runs in time O(nk) if you run it on n elements that are all at most k spots from their final positions. Many sorting algorithms, notably introsort, use this fact by having the sorting algorithm stop sorting once the elements are close enough and then switching to insertion sort. Since insertion sort has very low constant factors hidden in the big-O, I'd suspect it would work quite well here.
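To make that concrete, here is a plain insertion sort in Python (my own sketch, not code from the original answer). When every element starts at most k positions from its sorted position, the inner loop shifts at most k times per element, which is where the O(nk) bound comes from.

    def insertion_sort(a):
        """Sort list a in place; O(n*k) when every element starts
        at most k positions away from its final sorted position."""
        for i in range(1, len(a)):
            x = a[i]
            j = i - 1
            # Each shift moves x one position closer to where it belongs,
            # so at most k shifts happen per element on nearly-sorted data.
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a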

Related

Insertion Sort is a good choice for "small" data sets. What is "small"?

There are many places I have seen where it talks about how Insertion Sort is good for small data sets. I can't find a number for what "small" is though. My guess is that there is no absolute answer and that it depends on the type of machine the code is being run on.
However, what factors go into deciding what is the threshold for when Insertion Sort is a good idea? And what are some ballpark figures for "small"? 5? 10? 50? 100?
Thanks!
Site saying Insertion Sort is good for small data sets:
https://www.toptal.com/developers/sorting-algorithms/insertion-sort
Yes, your guess is right: there is no absolute answer; you have to measure where the threshold between insertion sort and other methods lies.
For example, typical thresholds for switching to insertion sort (and actually gaining something from it) on small pieces inside a combined merge sort or quicksort are about 32-100, but they can vary depending on the data and on implementation details.
An attempt at an answer, provided we're talking about the general sorting problem: insertion sort is on average O(n^2), while efficient sorting algorithms are on average O(n log n). So, vaguely speaking, if something takes K steps to sort efficiently, it will take around (roughly) K^2 steps with insertion sort.
So if arrays of size n > K are too slow for your liking with an efficient sort, arrays of size n > K^0.5 will (roughly) be too slow for you with insertion sort.
Practically speaking, let's say you're happy to sort arrays of size 10^8 with something efficient; then you might be happy to sort arrays of size 10^4 with insertion sort.
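If you want an actual number for your own machine, a rough timing experiment is the honest answer. A minimal sketch (the sizes and the choice of Python's built-in sort as the "efficient" baseline are my own assumptions):

    import random
    import timeit

    def insertion_sort(a):
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x

    # Time both sorts on random data of growing size to see where they cross over.
    for n in (8, 16, 32, 64, 128, 256):
        data = [random.random() for _ in range(n)]
        t_ins = timeit.timeit(lambda: insertion_sort(data[:]), number=200)
        t_std = timeit.timeit(lambda: sorted(data), number=200)
        print(f"n={n:4d}  insertion={t_ins:.4f}s  builtin={t_std:.4f}s")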

What sorting techniques can I use when comparing elements is expensive?

Problem
I have an application where I want to sort an array a of elements a0, a1, ..., an-1. I have a comparison function cmp(i,j) that compares elements ai and aj, and a swap function swap(i,j) that swaps elements ai and aj of the array. In the application, execution of the cmp(i,j) function might be extremely expensive, to the point where one execution of cmp(i,j) takes longer than all the other steps in the sort (except for other cmp(i,j) calls, of course) put together. You may think of cmp(i,j) as a rather lengthy IO operation.
Please assume for the sake of this question that there is no way to make cmp(i,j) faster. Assume all optimizations that could possibly make cmp(i,j) faster have already been done.
Questions
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
It is possible in my application to write a predicate expensive(i,j) that is true iff a call to cmp(i,j) would take a long time. expensive(i,j) is cheap and expensive(i,j) ∧ expensive(j,k) → expensive(i,k) mostly holds in my current application. This is not guaranteed though.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparison operations? If yes, can you point me to such an algorithm?
I'd like pointers to further material on this topic.
Example
This is an example that is not entirely unlike the application I have.
Consider a set of possibly large files. In this application the goal is to find duplicate files among them. This essentially boils down to sorting the files by some arbitrary criterion and then traversing them in order, outputting sequences of equal files that were encountered.
Of course, reading in large amounts of data is expensive; therefore one can, for instance, only read the first megabyte of each file and calculate a hash function on this data. If the files compare equal, so do the hashes, but the reverse may not hold: two large files could differ in only one byte near the end.
The implementation of expensive(i,j) in this case is simply a check of whether the hashes are equal. If they are, an expensive deep comparison is necessary.
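A minimal sketch of what that cheap predicate could look like for the file example (first_mb_hash and the one-megabyte cutoff are illustrative choices, not part of the original question):

    import hashlib

    def first_mb_hash(path):
        """Hash of the first megabyte of a file (cheap proxy for its content)."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read(1024 * 1024)).digest()

    def expensive(hash_i, hash_j):
        """Comparing two files is only expensive when their cheap hashes collide,
        because then a byte-by-byte deep comparison would be needed."""
        return hash_i == hash_j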
I'll try to answer each question as best as I can.
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Traditional sorting methods may have some variation, but in general there is a mathematical limit to the minimum number of comparisons necessary to sort a list, and most algorithms take advantage of that, since comparisons are rarely cheap. You could try sorting by something else, or try using a shortcut that approximates the real solution and may be faster.
Would the existance of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I don't think you can get around the necessity of doing at least the minimum number of comparisons, but you may be able to change what you compare. If you can compare hashes or subsets of the data instead of the whole thing, that could certainly be helpful. Anything you can do to simplify the comparison operation will make a big difference, but without knowing specific details of the data, it's hard to suggest specific solutions.
I'd like pointers to further material on this topic.
Check these out:
Apparently Donald Knuth's The Art of Computer Programming, Volume 3 has a section on this topic, but I don't have a copy handy.
Wikipedia of course has some insight into the matter.
Sorting an array with minimal number of comparisons
How do I figure out the minimum number of swaps to sort a list in-place?
Limitations of comparison based sorting techniques
The theoretical minimum number of comparisons needed to sort an array of n elements on average is lg(n!), which is about n lg n - 1.44n. There's no way to do better than this on average if you're using comparisons to order the elements.
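As a quick numerical sanity check of that bound (my own sketch, not part of the original answer), lg(n!) can be evaluated with math.lgamma and compared against the approximation:

    import math

    def lg_factorial(n):
        """lg(n!) computed via the log-gamma function."""
        return math.lgamma(n + 1) / math.log(2)

    n = 1_000_000
    exact = lg_factorial(n)                  # information-theoretic minimum
    approx = n * math.log2(n) - 1.44 * n     # n lg n - 1.44 n
    print(f"lg(n!) ~= {exact:,.0f} comparisons, approximation {approx:,.0f}")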
Of the standard O(n log n) comparison-based sorting algorithms, mergesort makes the lowest number of comparisons (just about n lg n, compared with about 1.44 n lg n for quicksort and about n lg n + 2n for heapsort), so it might be a good algorithm to use as a starting point. Typically mergesort is slower than heapsort and quicksort, but that's usually under the assumption that comparisons are fast.
If you do use mergesort, I'd recommend using an adaptive variant of mergesort like natural mergesort so that if the data is mostly sorted, the number of comparisons is closer to linear.
There are a few other options available. If you know for a fact that the data is already mostly sorted, you could use insertion sort or a standard variation of heapsort to try to speed up the sorting. Alternatively, you could use mergesort but use an optimal sorting network as a base case when n is small. This might shave off enough comparisons to give you a noticeable performance boost.
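For illustration, here is a rough sketch of the natural-mergesort idea mentioned above (my own code, assuming plain Python lists): it merges the ascending runs that already exist in the input, so mostly-sorted data needs few merge passes and a near-linear number of comparisons.

    def merge(left, right):
        """Standard two-way merge; at most len(left) + len(right) - 1 comparisons."""
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        out.extend(left[i:])
        out.extend(right[j:])
        return out

    def natural_mergesort(a):
        """Split the input into the ascending runs it already contains,
        then merge adjacent runs until one remains."""
        runs, i = [], 0
        while i < len(a):
            j = i + 1
            while j < len(a) and a[j - 1] <= a[j]:
                j += 1
            runs.append(a[i:j])
            i = j
        while len(runs) > 1:
            paired = [merge(runs[k], runs[k + 1]) for k in range(0, len(runs) - 1, 2)]
            if len(runs) % 2:
                paired.append(runs[-1])
            runs = paired
        return runs[0] if runs else []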
Hope this helps!
A technique called the Schwartzian transform can be used to reduce any sorting problem to one of sorting precomputed keys. It requires you to apply a function f to each of your input items, where f(x) < f(y) if and only if x < y.
(Python-oriented answer, when I thought the question was tagged [python])
If you can define a function f such that f(x) < f(y) if and only if x < y, then you can sort using
sorted(L, key=f)   # or, in place: L.sort(key=f)
Python guarantees that key is called at most once for each element of the iterable you are sorting. This provides support for the Schwartzian transform.
Python 3 does not support specifying a cmp function, only the key parameter. This page provides a way of easily converting any cmp function to a key function.
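A short sketch of both approaches (the cmp function here is a trivial stand-in for the expensive comparison from the question):

    from functools import cmp_to_key

    # Key-based sorting: the key function is evaluated once per element, then
    # cheap comparisons of the precomputed keys are used during the sort itself.
    data = ["banana", "apple", "cherry"]
    print(sorted(data, key=len))

    # If all you have is a cmp-style function, cmp_to_key wraps it into a key.
    def cmp(x, y):
        # Imagine this being the expensive comparison from the question.
        return (x > y) - (x < y)

    print(sorted(data, key=cmp_to_key(cmp)))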
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Edit: Ah, sorry. There are algorithms that minimize the number of comparisons (below), but not that I know of for specific elements.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
Not that I know of, but perhaps you'll find it in these papers below.
I'd like pointers to further material on this topic.
On Optimal and Efficient in Place Merging
Stable Minimum Storage Merging by Symmetric Comparisons
Optimal Stable Merging (this one seems to be O(n log² n), though)
Practical In-Place Mergesort
If you implement any of them, posting them here might be useful for others too! :)
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
The merge insertion algorithm, described in D. Knuth's "The Art of Computer Programming", Vol. 3, chapter 5.3.1, uses fewer comparisons than other comparison-based algorithms. But it still needs O(N log N) comparisons.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I think some existing sorting algorithms can be modified to take the expensive(i,j) predicate into account. Let's take the simplest of them, insertion sort. One of its variants, named binary insertion sort in Wikipedia, uses only O(N log N) comparisons.
It employs a binary search to determine the correct location to insert new elements. We could apply the expensive(i,j) predicate after each binary search step to determine whether it is cheap to compare the inserted element with the "middle" element found in that step. If it is expensive, we could try the "middle" element's neighbors, then their neighbors, and so on. If no cheap comparison can be found, we just return to the "middle" element and perform the expensive comparison.
There are several possible optimizations. If the predicate and/or the cheap comparisons are not so cheap, we could roll back to the "middle" element earlier, before all other possibilities are tried. Also, if move operations cannot be considered very cheap, we could use some order-statistics data structure (like an indexable skiplist) to reduce insertion cost to O(N log N).
This modified insertion sort needs O(N log N) time for data movement, O(N²) predicate computations and cheap comparisons, and O(N log N) expensive comparisons in the worst case. But more likely there would be only O(N log N) predicates and cheap comparisons and O(1) expensive comparisons.
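For reference, plain binary insertion sort is easy to express with the standard bisect module; the expensive(i,j)-aware neighbor probing described above would replace the plain bisect step and is left out of this sketch:

    import bisect

    def binary_insertion_sort(a):
        """O(N log N) comparisons via binary search for the insertion point;
        data movement is still O(N^2) in the worst case."""
        out = []
        for x in a:
            pos = bisect.bisect_right(out, x)  # ~log(len(out)) comparisons, stable
            out.insert(pos, x)                 # O(len(out)) element moves
        return out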
Consider a set of possibly large files. In this application the goal is to find duplicate files among them.
If the only goal is to find duplicates, I think sorting (at least comparison sorting) is not necessary. You could just distribute the files between buckets depending on the hash value computed for the first megabyte of data from each file. If there is more than one file in some bucket, hash the next 10, 100, 1000, ... megabytes. If there is still more than one file in some bucket, compare them byte-by-byte. Actually this procedure is similar to radix sort.
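A rough sketch of that bucketing procedure (chunk sizes, helper names and the use of SHA-256 are my own choices; a final byte-by-byte confirmation of each group is omitted):

    import hashlib
    import os
    from collections import defaultdict

    def hash_prefix(path, nbytes):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read(nbytes)).digest()

    def duplicate_groups(paths, chunk=1024 * 1024):
        """Refine groups of candidate duplicates by hashing growing prefixes
        (1 MB, then 10 MB, then 100 MB, ...), radix-sort style."""
        pending = [(list(paths), chunk)]
        confirmed = []
        while pending:
            group, nbytes = pending.pop()
            buckets = defaultdict(list)
            for p in group:
                buckets[hash_prefix(p, nbytes)].append(p)
            for bucket in buckets.values():
                if len(bucket) < 2:
                    continue                 # unique prefix -> not a duplicate
                if all(os.path.getsize(p) <= nbytes for p in bucket):
                    confirmed.append(bucket) # whole files hashed and equal
                else:
                    pending.append((bucket, nbytes * 10))
        return confirmed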
Most sorting algorithms out there try to minimize the number of comparisons during sorting.
My advice:
Pick quicksort as a base algorithm and memoize the results of comparisons in case you happen to compare the same elements again. This should help you in the O(N^2) worst case of quicksort. Bear in mind that this will make you use O(N^2) memory.
Now if you are really adventurous you could try the Dual-Pivot quick-sort.
Something to keep in mind is that if you are continuously sorting the list with new additions, and the comparison between two elements is guaranteed to never change, you can memoize the comparison operation which will lead to a performance increase. In most cases this won't be applicable, unfortunately.
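A minimal sketch of memoizing the comparison (slow_cmp is a stand-in for the genuinely expensive operation; note that only pairs compared in the same order hit this simple cache, so a symmetric wrapper could help further):

    from functools import cmp_to_key, lru_cache

    def slow_cmp(x, y):
        # Stand-in for the genuinely expensive comparison.
        return (x > y) - (x < y)

    @lru_cache(maxsize=None)
    def cached_cmp(x, y):
        # Repeated comparisons of the same (x, y) pair are answered from the cache.
        return slow_cmp(x, y)

    data = [5, 3, 5, 1, 3, 5]
    print(sorted(data, key=cmp_to_key(cached_cmp)))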
We can look at your problem from another direction: it seems your problem is IO-related, so you can take advantage of parallel sorting algorithms. In fact, you can run many threads to perform comparisons on files, then sort them with one of the best-known parallel algorithms, such as sample sort.
Quicksort and mergesort are the fastest possible sorting algorithms, unless you have some additional information about the elements you want to sort. They will need O(n log(n)) comparisons, where n is the size of your array.
It is mathematically proven that no generic comparison-based sorting algorithm can be more efficient than that.
If you want to make the procedure faster, you might consider adding some metadata to accelerate the computation (can't be more precise unless you are, too).
If you know something stronger, such as the existence of a maximum and a minimum, you can use faster sorting algorithms, such as radix sort or bucket sort.
You can look for all the mentioned algorithms on wikipedia.
As far as I know, you can't benefit from the expensive relationship. Even if you know that a comparison is expensive, you still need to perform it. As I said, you'd better try to cache some results.
EDIT I took some time to think about it, and I came up with a slightly customized solution that I think will make the minimum possible number of expensive comparisons, but totally disregards the overall number of comparisons. It will make at most (n-m)*log(k) expensive comparisons, where
n is the size of the input vector
m is the number of distinct component which are easy to compare between each other
k is the maximum number of elements which are hard to compare and have consecutive ranks.
Here is the description of the algorithm. It is worth noting that it will perform much worse than a simple merge sort unless m is large and k is small. The total running time is O[n^4 + E(n-m)log(k)], where E is the cost of an expensive comparison (I assumed E >> n, to prevent it from being wiped out of the asymptotic notation). That n^4 can probably be further reduced, at least in the mean case.
EDIT The file I posted contained some errors; while trying it out, I also fixed them (I had overlooked the pseudocode for the insert_sorted function, but the idea was correct). I made a Java program that sorts a vector of integers, with delays added as you described. Even though I was skeptical, it actually does better than mergesort if the delay is significant (I used a 1 s delay against integer comparisons, which usually take nanoseconds to execute).

In what situations do I use these sorting algorithms?

I know the implementation for most of these algorithms, but I don't know what sizes of data sets to use them on (and what kind of data):
Merge Sort
Bubble Sort (I know, not very often)
Quick Sort
Insertion Sort
Selection Sort
Radix Sort
First of all, you take all the sorting algorithms that have O(n²) complexity and throw them away.
Then, you have to study several properties of your sorting algorithms and decide whether each one of them will be better suited for the problem you want to solve. The most important are:
Is the algorithm in-place? This means that the sorting algorithm doesn't use any extra memory (O(1) extra, actually). This property is very important when you are running memory-critical applications.
Bubble-sort, Insertion-sort and Selection-sort use constant memory.
There is an in-place variant for Merge-sort too.
Is the algorithm stable? This means that if two elements x and y are equal given your comparison method, and in the input x is found before y, then in the output x will be found before y.
Merge-sort, Bubble-sort and Insertion-sort are stable.
Can the algorithm be parallelized? If the application you are building can make use of parallel computation, you might want to choose parallelizable sorting algorithms.
More info here.
Use Bubble Sort only when the data to be sorted is stored on rotating drum memory. It's optimal for that purpose, but not for random-access memory. These days, that amounts to "don't use Bubble Sort".
Use Insertion Sort or Selection Sort up to some size that you determine by testing it against the other sorts you have available. This usually works out to be around 20-30 items, but YMMV. In particular, when implementing divide-and-conquer sorts like Merge Sort and Quick Sort, you should "break out" to an Insertion sort or a Selection sort when your current block of data is small enough.
Also use Insertion Sort on nearly-sorted data, for example if you somehow know that your data used to be sorted, and hasn't changed very much since.
Use Merge Sort when you need a stable sort (it's also good when sorting linked lists), beware that for arrays it uses significant additional memory.
Generally you don't use "plain" Quick Sort at all, because even with intelligent choice of pivots it still has Omega(n^2) worst case but unlike Insertion Sort it doesn't have any useful best cases. The "killer" cases can be constructed systematically, so if you're sorting "untrusted" data then some user could deliberately kill your performance, and anyway there might be some domain-specific reason why your data approximates to killer cases. If you choose random pivots then the probability of killer cases is negligible, so that's an option, but the usual approach is "IntroSort" - a QuickSort that detects bad cases and switches to HeapSort.
Radix Sort is a bit of an oddball. It's difficult to find common problems for which it is best, but it has good asymptotic limit for fixed-width data (O(n), where comparison sorts are Omega(n log n)). If your data is fixed-width, and the input is larger than the number of possible values (for example, more than 4 billion 32-bit integers) then there starts to be a chance that some variety of radix sort will perform well.
Merge Sort: when using extra space equal to the size of the array is not an issue.
Bubble Sort: only on very small data sets.
Quick Sort: when you want an in-place sort and a stable sort is not required.
Insertion Sort: only on very small data sets, or if the array has a high probability of already being sorted.
Selection Sort: only on very small data sets.
Radix Sort: when the ratio of the range of values to the number of items is small (experimentation suggested).
Note that usually Merge or Quick sort implementations use Insertion sort for parts of the subroutine where the sub-array is very small.
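A toy illustration of that hybrid pattern (the cutoff of 32 is an arbitrary placeholder and this version is not in-place; real library implementations tune the threshold and partition in place):

    import random

    CUTOFF = 32  # placeholder threshold; tune by measurement on your machine

    def insertion_sort(a):
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a

    def hybrid_quicksort(a):
        if len(a) <= CUTOFF:
            return insertion_sort(list(a))       # small sub-array: switch strategy
        pivot = random.choice(a)                 # random pivot to dodge killer inputs
        less = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        greater = [x for x in a if x > pivot]
        return hybrid_quicksort(less) + equal + hybrid_quicksort(greater)

    print(hybrid_quicksort([5, 2, 9, 1, 5, 6] * 20))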

About bubble sort vs merge sort

This is an interview question that I recently found on Internet:
If you are going to implement a function which takes an integer array as input and returns the maximum, would you use bubble sort or merge sort to implement this function? What if the array size is less than 1000? What if it is greater than 1000?
This is how I think about it:
First, it is really weird to use sorting to implement the above function. You can just go through the array once and find the max one.
Second, if I had to make a choice between the two, then bubble sort is better: you don't have to implement the whole bubble sort procedure, but only need to do the first pass. It is better than merge sort both in time and space.
Are there any mistakes in my answer? Did I miss anything?
It's a trick question. If you just want the maximum, (or indeed, the kth value for any k, which includes finding the median), there's a perfectly good O(n) algorithm. Sorting is a waste of time. That's what they want to hear.
As you say, the algorithm for the maximum is really trivial. To ace a question like this, you should have the quickselect algorithm ready, and also be able to suggest a heap data structure in case you need to be able to mutate the list of values and still produce the maximum rapidly.
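A compact quickselect sketch (my own illustration of the "perfectly good O(n) algorithm"; expected linear time, 0-based k):

    import random

    def quickselect(a, k):
        """Return the k-th smallest element (k is 0-based), expected O(n)."""
        a = list(a)
        while True:
            pivot = random.choice(a)
            less = [x for x in a if x < pivot]
            equal = [x for x in a if x == pivot]
            if k < len(less):
                a = less                          # answer is among the smaller items
            elif k < len(less) + len(equal):
                return pivot                      # answer is the pivot itself
            else:
                k -= len(less) + len(equal)
                a = [x for x in a if x > pivot]   # answer is among the larger items

    nums = [7, 1, 9, 4, 4, 8]
    print(quickselect(nums, len(nums) - 1))   # maximum (same as max(nums))
    print(quickselect(nums, len(nums) // 2))  # a median-ish element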
I just looked up the algorithms. Bubble sort wins in both situations because of the large benefit of only having to run through the array once: after a single pass, the maximum has bubbled to the last position. Merge sort cannot take any shortcuts when only the largest number is needed; it still splits the list in the middle, recursively sorts both halves, and merges them, so every element takes part in roughly the same number of comparisons regardless of what you are looking for. Bubble sort would dominate here.
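To make the "one pass is enough" point concrete, a tiny sketch of my own:

    def max_by_one_bubble_pass(a):
        """One bubble-sort pass: after n-1 adjacent compare-and-swaps,
        the largest element has bubbled to the last position."""
        a = list(a)
        for i in range(len(a) - 1):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
        return a[-1]

    print(max_by_one_bubble_pass([3, 9, 2, 7]))  # 9, same as max([3, 9, 2, 7])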
Firstly, I agree with everything you have said, but perhaps the question is asking about knowing the time complexities of the algorithms and how the input size is a big factor in which will be fastest.
Bubble sort is O(n²) and merge sort is O(n log n). So on a small set it won't be that different, but on a lot of data bubble sort will be much slower.
Barring the maximum part, bubble sort is slower asymptotically, but it has a big advantage for small n in that it doesn't require the merging/creation of new arrays. In some implementations, this might make it faster in real time.
Only one pass is needed; even in the worst case, to find the maximum you just have to traverse the whole array once, so bubble sort would be better.
Merge sort is easy for a computer to execute and it takes less time to sort than bubble sort in general. The best case for merge sort is n·log₂ n and the worst case is also n·log₂ n. With bubble sort the best case is O(n) and the worst case is O(n²).

Insertion sort better than Bubble sort?

I am doing my revision for the exam.
I would like to know under what conditions insertion sort performs better than bubble sort, given that both have the same average-case complexity of O(N²).
I did find some related articles, but I can't understand them.
Would anyone mind explaining it in a simple way?
The advantage of bubblesort is in the speed of detecting an already sorted list:
BubbleSort Best Case Scenario: O(n)
However, even in this case insertion sort gets better or equal performance.
Bubble sort is, more or less, only good for understanding and/or teaching the mechanism of sorting algorithms, and won't find proper usage in programming these days, because its complexity
O(n²)
means that its efficiency decreases dramatically on lists of more than a small number of elements.
The following things came to my mind:
Bubble sort always takes one more pass over the array to determine whether it is sorted. Insertion sort, on the other hand, does not need this: once the last element is inserted, the algorithm guarantees that the array is sorted.
Bubble sort does n comparisons on every pass. Insertion sort does fewer than n comparisons: once the algorithm finds the position where the current element should be inserted, it stops making comparisons and takes the next element. (A small counting sketch follows the quoted passage below.)
Finally, a quote from the Wikipedia article:
Bubble sort also interacts poorly with modern CPU hardware. It requires at least twice as many writes as insertion sort, twice as many cache misses, and asymptotically more branch mispredictions. Experiments by Astrachan sorting strings in Java show bubble sort to be roughly 5 times slower than insertion sort and 40% slower than selection sort.
You can find a link to the original research paper there.
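A small counting experiment along these lines (my own sketch; it counts comparisons and element writes for both algorithms on the same random input):

    import random

    def bubble_stats(a):
        a = list(a); comps = writes = 0
        n = len(a)
        while True:
            swapped = False
            for i in range(n - 1):
                comps += 1
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
                    writes += 2           # a swap is two element writes
                    swapped = True
            if not swapped:
                break
        return comps, writes

    def insertion_stats(a):
        a = list(a); comps = writes = 0
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0:
                comps += 1
                if a[j] > x:
                    a[j + 1] = a[j]; writes += 1; j -= 1
                else:
                    break
            a[j + 1] = x; writes += 1     # writing x into place counts as one write
        return comps, writes

    data = [random.randint(0, 999) for _ in range(1000)]
    print("bubble    (comparisons, writes):", bubble_stats(data))
    print("insertion (comparisons, writes):", insertion_stats(data))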
I guess the answer you're looking for is here:
Bubble sort may also be efficiently used on a list that is already sorted except for a very small number of elements. For example, if only one element is not in order, bubble sort will take only 2n time. If two elements are not in order, bubble sort will take only at most 3n time...
and
Insertion sort is a simple sorting algorithm that is relatively efficient for small lists and mostly sorted lists, and often is used as part of more sophisticated algorithms
Could you provide links to the related articles you don't understand? I'm not sure what aspects they might be addressing. Other than that, there is a theoretical difference which might be that bubble sort is more suited for collections represented as arrays (than it is for those represented as linked lists), while insertion sort is suited for linked lists.
The reasoning would be that bubble sort always swaps two items at a time, which is trivial on both an array and a linked list (more efficient on arrays), while insertion sort inserts at a place in a given list, which is trivial for linked lists but involves moving all subsequent elements in an array to the right.
That being said, take it with a grain of salt. First of all, sorting arrays is, in practice, almost always faster than sorting linked lists. Simply due to the fact that scanning the list once has an enormous difference already. Apart from that, moving n elements of an array to the right, is much faster than performing n (or even n/2) swaps. This is why other answers correctly claim insertion sort to be superior in general, and why I really wonder about the articles you read, because I fail to think of a simple way of saying this is better in cases A, and that is better in cases B.
In the worst case both tend to perform at O(n^2)
In the best case scenario, i.e., when the array is already sorted, Bubble sort can perform at O(n).
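For completeness, that O(n) best case relies on the early-exit variant of bubble sort; a short sketch of my own:

    def bubble_sort(a):
        """Early-exit bubble sort: one swap-free pass over already-sorted
        input proves it is sorted, giving the O(n) best case."""
        n = len(a)
        for end in range(n - 1, 0, -1):
            swapped = False
            for i in range(end):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
                    swapped = True
            if not swapped:
                break
        return a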

Resources