I was curious how to select a sorting algorithm based on the input, so that I get the best efficiency.
Should it be based on the size of the input, how the input is arranged (ascending/descending), the data structure used, etc.?
When judging algorithms in general, and sorting algorithms in particular, the priorities are as follows:
(*) Correctness - This is the most important thing. An algorithm that is super fast and efficient but wrong is worth nothing. In sorting, even if you have two candidates that both sort correctly, but you need a stable sort - you will choose the stable algorithm even if it is less efficient, because it is correct for your purpose and the other is not.
Next come the trade-offs between running time, space needed, and implementation time (if you would need to implement something from scratch rather than use a library, for a minor performance gain, it's probably not worth it).
Some things to take into consideration when thinking about the trade off mentioned above:
Size of the input (for example: for small inputs, insertion sort is empirically faster than more advanced algorithms, even though it is O(n^2)).
Location of the input (sorting algorithms on disk are different from algorithms on RAM, because disk reads are much less efficient when not sequential. The algorithm which is usually used to sort on disk is a variation of merge-sort).
How is the data distributed? If the data is likely to be "almost sorted", a usually terrible bubble sort might finish in just 2-3 passes and be super fast compared to other algorithms.
What libraries do you already have implemented? How much work will it take to implement something new? Will it be worth it?
Type (and range) of the input - for enumerable data (integers, for example), an integer-oriented algorithm (like radix sort) might be more efficient than a general-purpose algorithm.
Latency requirements - if you are designing software for a missile, and the result must be returned within a specific amount of time, quicksort, which can decay to quadratic running time in the worst case, might not be a good choice; you might want a different algorithm with a strict O(n log n) worst case instead.
Your hardware - if, for example, you have a huge cluster and a huge amount of data, a distributed sorting algorithm will probably be better than trying to do all the work on one machine.
It should be based on all those things.
You need to take into account the size of your data, as insertion sort can be faster than quicksort for small data sets, etc.
You need to know the arrangement of your data, because of the differing worst/average/best-case asymptotic runtimes of each algorithm (some have identical worst and average cases, whereas others have a worst case significantly worse than their average).
And you obviously need to know the data structure used, as there are some very specialized sorting algorithms that apply if your data is already in a special format - or if you can efficiently move it into a new data structure that automatically does the sorting for you (à la a BST or a heap).
The 2 main things that determine your choice of a sorting algorithm are time complexity and space complexity. Depending on your scenario, and the resources (time and memory) available to you, you might need to choose between sorting algorithms, based on what each sorting algorithm has to offer.
The actual performance of a sorting algorithm depends on the input data too, and it helps if we know certain characteristics of the input data beforehand, such as the size of the input and how sorted the array already is.
For example,
If you know beforehand that the input data has only 1000 non-negative integers, you can very well use counting sort to sort such an array in linear time.
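A minimal counting-sort sketch of that example (assuming non-negative integers with a known upper bound; the `max_value` parameter is an assumption for the sketch):

```python
def counting_sort(values, max_value):
    """Sort non-negative integers in O(n + k) time, where k = max_value + 1."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1          # tally each value
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)  # emit values in order
    return result

print(counting_sort([3, 1, 4, 1, 5, 9, 2, 6], max_value=9))
# [1, 1, 2, 3, 4, 5, 6, 9]
```

Note that this only pays off when the value range k is comparable to (or smaller than) the number of elements n.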
The choice of a sorting algorithm depends on the constraints of space and time, and also the size/characteristics of the input data.
At a very high level, you need to consider the ratio of moves (insertions) to comparisons for each algorithm.
For integers in a file this isn't hugely relevant, but if, say, you're sorting files based on their contents, you'll naturally want to do as few comparisons as possible.
Apart from the obvious "it's faster when there are many elements": when is it more appropriate to use a simple sorting algorithm (O(N^2)) rather than an advanced one (O(N log N))?
I've read quite a bit about, for example, insertion sort being preferred when you've got a small array that's nearly sorted, because you get the best case N. Why is it not good to use, say, quicksort when you've got 20 elements? Not just insertion vs. quick, but rather: when and why is a simpler algorithm more useful than an advanced one?
EDIT: If we're working with, for example, an array, does it matter which input data we have, such as objects or primitive types (Integer)?
The big-oh notation captures the runtime cost of the algorithm for large values of N. It is less effective at measuring the runtime of the algorithm for small values.
The actual transition from one algorithm to another is not a trivial thing. For large N, the effects of N really dominate. For small numbers, more complex effects become very important. For example, some algorithms have better cache coherency. Others are best when you know something about the data (like your example of insertion sort when the data is nearly sorted).
The balance also changes over time. In the past, CPU speeds and memory speeds were closer together, and cache coherency issues were less of an issue. In modern times, CPU speeds have generally left memory buses behind, so cache coherency is more important.
So there's no one clear cut and dry answer to when you should use one algorithm over another. The only reliable answer is to profile your code and see.
For amusement: I was looking at the dynamic disjoint forest problem a few years back. I came across a state-of-the-art paper that permitted some operations to be done in something silly like O(log log N / log^4N). They did some truly brilliant math to get there, but there was a catch. The operations were so expensive that, for my graphs of 50-100 nodes, it was far slower than the O(n log n) solution that I eventually used. The paper's solution was far more important for people operating on graphs of 500,000+ nodes.
When choosing a sorting algorithm, you have to weigh the work of implementing the algorithm against its actual speed. For large N, the time spent implementing an advanced algorithm is outweighed by the decreased time taken to sort. For small N, such as 20-100 items, the difference is minimal, so taking the simpler route is much better.
First of all, O-notation gives you a sense of the worst-case scenario. So if the array is nearly sorted, the execution time of insertion sort can be near linear, making it better than, say, quicksort.
When n is small enough, we take other aspects into consideration. Algorithms such as quicksort can be slower because of all the recursive calls; at that point the overhead of those function calls can end up costing more than the simple arithmetic operations required by insertion sort. And that's not to mention the additional stack space required by recursive algorithms.
More than 99% of the time, you should not be implementing a sorting algorithm at all.
Instead use a standard sorting algorithm from your language's standard library. In one line of code you get to use a tested and optimized implementation which is O(n log(n)). It likely implements tricks you wouldn't have thought of.
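For example, in Python that one line looks like this (a trivial sketch using the built-in Timsort):

```python
# Python's built-in sort is Timsort: stable, O(n log n), and tuned for
# real-world data such as partially ordered runs.
data = [5, 2, 9, 1]
data.sort()                      # in-place
print(data)                      # [1, 2, 5, 9]
print(sorted("banana"))          # ['a', 'a', 'a', 'b', 'n', 'n']
```

The `key` parameter of `sort`/`sorted` also lets you sort arbitrary records without writing any comparison loop yourself.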
For external sorts, I've used the Unix sort utility from time to time. Aside from the non-intuitive LC_ALL=C environment variable that I need to get it to behave, it is very useful.
In any other case where you actually need to implement your own sorting algorithm, what you implement will be driven by your precise needs. I've had to deal with this exactly once for production code in two decades of programming. (That was because, for a complex series of reasons, I needed to sort compressed data on a machine which literally did not have enough disk space to store said data uncompressed. I used a merge sort.)
There are the bubble, insertion, selection, and quick sorting algorithms.
Which one is the 'fastest' algorithm?
Code size is not important.
Bubble sort
insertion sort
quick sort
I tried to check speeds. When the data is already sorted, bubble and insertion sort run in O(n), but both algorithms are too slow on large lists.
Is it good to use only one algorithm?
Or is it faster to use a mix of different ones?
Quicksort is generally very good, only really falling down when the data is close to being ordered already, or when the data has a lot of similarity (lots of key repeats), in which case it is slower.
If you don't know anything about your data and you don't mind risking the slow case of quicksort (and if you think about it, you can probably determine for your case whether already-ordered data is ever likely), then quicksort is never going to be a BAD choice.
If you decide your data is, or will sometimes (or often enough to be a problem) be, already sorted (or significantly partially sorted), or if one way or another you decide you can't risk the worst case of quicksort, then consider timsort.
As noted by the comments on your question though, if it's really important to have the ultimate performance, you should consider implementing several algorithms and trying them on good representative sample data.
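A minimal timing harness along those lines (the nearly-sorted sample data here is just an assumption; substitute data shaped like your real workload, and add your own candidate sorts alongside the built-in):

```python
import random
import timeit

# Build a "representative sample": a mostly sorted list with ~1% of
# positions perturbed by random swaps.
data = list(range(10_000))
for _ in range(100):
    i, j = random.randrange(10_000), random.randrange(10_000)
    data[i], data[j] = data[j], data[i]

# Time one candidate; repeat this line for each algorithm you implement.
t = timeit.timeit(lambda: sorted(data), number=50)
print(f"built-in sorted on nearly-sorted input: {t:.3f}s for 50 runs")
```

The absolute numbers are machine-dependent; what matters is the relative ranking of the candidates on your data.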
HP / Microsoft's std::sort is an introsort (quicksort that switches to heapsort if the recursion depth reaches some limit), and std::stable_sort is a variation of bottom-up merge sort.
For sorting an array or vector of mostly random integers, counting / radix sort would normally be fastest.
Most external sorts are some variation of a k-way bottom up merge sort (the initial internal sort phase could use any of the algorithms mentioned above).
For sorting a small, fixed number of elements (16 or fewer), a sorting network could be used. This seems to be one of the lesser-known algorithms. It is mostly useful when repeatedly sorting small sets of elements, and is sometimes implemented in hardware.
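For example, a four-element sorting network is just a fixed sequence of five compare-exchange steps (a minimal sketch using the standard optimal 4-input comparator list; real uses would unroll this or put it in hardware):

```python
def sort4(a):
    """Sort a 4-element list with a fixed, data-independent sequence
    of compare-exchange operations (a sorting network)."""
    for i, j in [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]:
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]  # compare-exchange
    return a

print(sort4([3, 1, 4, 2]))  # [1, 2, 3, 4]
```

Because the comparison sequence never depends on the data, the same five steps sort every possible 4-element input, which is exactly what makes networks amenable to hardware and branch-free implementations.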
Most of the time, we use built-in libraries for sorting, which are generic. But most of the time, too, we are sorting based on numeric indexes, or other values that can be translated into indices. If I'm not mistaken, sorting numbers is O(n). So why don't we ever use numeric sorting algorithms?
Is there really a need?
I'm not really sure that (single) integers (or floating-point numbers for that matter, though most numeric sorts require, or are only efficient for, integers) are what is being sorted 'most of the time', so an algorithm that only works on integers doesn't seem particularly useful. I say 'single' integers as opposed to strings or objects (or equivalent) that contain multiple integers, numbers, strings, or whatever else.
Not to mention that (I believe) in most real-world programs whose primary purpose is more than just sorting data, the bottleneck should not be an O(n log n) sort of 'single' numbers. You're probably far better off changing the way your data is represented to remove the need for the sort, rather than cutting down on the log n factor.
Numeric sorts
It's a common misperception, but no sorting algorithm (numeric or otherwise) is actually worst-case O(n); there's always some additional parameter in play. For radix sort, the length of the numbers is the determining factor. For long numbers in short arrays, this length can easily exceed log n, resulting in worse performance than an O(n log n) sort (see the test below).
Numeric sorts are useful, and far better than any comparison-based sorting algorithm, provided your data conforms to specific constraints most (but not all) of the time. By looking at the complexity given in any decent reference, you can easily see what determines whether one will be good - e.g. O(kN) implies long numbers might make it take a bit longer; things like dealing well with duplicates are a bit more subtle.
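To make the O(kN) concrete, here is a minimal LSD radix sort on non-negative integers, doing one stable pass per byte (a sketch, not a tuned implementation; k is the byte length of the largest value):

```python
def radix_sort(nums):
    """LSD radix sort for non-negative ints: one stable bucket pass
    per byte of the largest value, so O(k * n) overall."""
    if not nums:
        return nums
    shift = 0
    max_val = max(nums)
    while max_val >> shift:                 # one pass per byte of max_val
        buckets = [[] for _ in range(256)]
        for n in nums:
            buckets[(n >> shift) & 0xFF].append(n)  # stable within a pass
        nums = [n for bucket in buckets for n in bucket]
        shift += 8
    return nums

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```

You can see the k factor directly: sorting 64-bit values costs up to eight full passes over the data regardless of n, which is exactly why short arrays of long numbers favour a comparison sort.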
So, why aren't they used?
Without extensive real-world experience / theoretical knowledge, you're unlikely to pick the most efficient algorithm. It's entirely possible that you'll find yourself with a problem where the chosen algorithm, which should be awesome in theory, severely under-performs a standard algorithm on your data because of some subtle factor.
So standard libraries don't put you in the position of picking an incorrect sort and possibly getting terrible performance because your data doesn't conform to some constraints. Library sorts tend to be decent all-round, but aren't specialized for specific data sets. I'm sure there are also libraries that focus on sorting and let you pick from an extensive range of algorithms, but your average Joe programmer probably doesn't want to be (and shouldn't be) exposed to that choice.
Also note that, while they aren't commonly included in libraries, it should be easy enough to find or write an implementation of whichever (popular) sort you wish to use ... which you should then benchmark against the library sort on a sufficient sample of your data before committing to it.
A somewhat random test
This is by no means intended to be a conclusive, 100% correct test with the best implementations of radix sort and quick sort to ever see the light of day. It's more to show that what the data looks like plays a large role in the performance of any given algorithm.
This is the only decent benchmark including radix-sort I could find in a few minutes of searching.
I ran the code with numbers in the range 0-2147483646 and got the following (the time unit is nanosecond-related and doesn't translate directly to seconds):
ArraySize      Radix     Quick
       10       1889       126
      100       2871      2702
     1000      18227     38075
    10000     360623    484128
   100000    2306284   6029230
Quick-sort is faster for a large range of numbers on arrays of size 100 or less (exactly what I was saying above). Interesting, but nothing really amazing about it - I mean, who cares about the performance of sorting fewer than 100 numbers?
However, look what happened when I changed the number range to 0-99:
ArraySize      Radix     Quick
       10       1937       121
      100       8932      2022
     1000      29513     14824
    10000     236669    125926
   100000    2393641   1225715
Quick-sort is consistently around 2x faster than Radix-sort for reasonably-sized arrays (1000-100000 elements).
You must be thinking: "What in the world? I thought radix sort was supposed to be good at these - I mean, there are only 2 digits. And why is quick-sort so much faster than in the above case?" Exactly. That's where "extensive real-world experience / theoretical knowledge" comes in. I suspect it relates to how well each algorithm / implementation deals with duplicates, though half of it could be that I didn't optimize the radix sort implementation for the smaller range. (Didn't know you do that? Well, that's another reason against trying to have a generic radix sort in a library.)
Now 0-99 is probably not your typical data set either, and, overall, radix-sort is probably still better, but what you need to take away from all of this:
There are about a gazillion sorting algorithms, and they vary greatly in what they're good at. Don't expect a standard library to give you a function for each. Comparison-based sorts can sort any comparable data type (and are fast enough for most practical applications), whereas numeric sorts can only sort numbers. Thus having a single (or two, as Java has) comparison-based sort in your (as in you, the person who wrote it) library is preferred.
Basically, we use comparison-based sorting algorithms because it's easier. Being able to supply a comparison function and get your data sorted is a huge win from an engineering perspective, even if you pay for it with a speed hit.
Keep in mind that the O(n log n) comparison-based sorting bound counts comparisons, not total runtime. If you're sorting strings, for instance, comparison can take time linear in the lengths of the strings being compared.
A common misconception (that I see echoed in the other answer) is that comparison-based sorting winds up having faster asymptotic complexity when you're sorting a moderate number of long numbers; say they're k bytes each. This simply isn't true; you do about n log(n) number comparisons, each of which takes O(k) time, for an overall complexity of O(k n log n). This is worse than O(k n).
Engineering a fast radix sort is a little harder than the theory says. While theory dictates that you should choose as large a radix as possible, there is a tradeoff between the radix you choose and the locality you achieve when partitioning the input stream. A bigger radix means fewer passes but also less local use of memory.
What is meant by to "sort in place"?
The idea of an in-place algorithm isn't unique to sorting, but sorting is probably the most important case, or at least the most well-known. The idea is about space efficiency - using the minimum amount of RAM, hard disk or other storage that you can get away with. This was especially relevant going back a few decades, when hardware was much more limited.
The idea is to produce an output in the same memory space that contains the input by successively transforming that data until the output is produced. This avoids the need to use twice the storage - one area for the input and an equal-sized area for the output.
Sorting is a fairly obvious case for this, because sorting can be done by repeatedly exchanging items - sorting only re-arranges items. Exchanges aren't the only approach, though - insertion sort, for example, uses a slightly different technique that is equivalent to doing a run of exchanges, but faster.
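That "run of exchanges, but faster" idea can be sketched in a few lines (a minimal Python version: larger elements are shifted right as a block, then the saved element drops into the gap, so each slot costs one move instead of a full swap):

```python
def insertion_sort(a):
    """In-place insertion sort using shifts rather than pairwise swaps."""
    for i in range(1, len(a)):
        current = a[i]                     # element to insert
        j = i
        while j > 0 and a[j - 1] > current:
            a[j] = a[j - 1]                # shift the larger element right
            j -= 1
        a[j] = current                     # drop into the opened gap
    return a

print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]
```

The only extra storage is the single `current` variable, which is exactly the O(1) working space an in-place algorithm is allowed.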
Another example is matrix transposition - again, this can be implemented by exchanging items. Adding two very large numbers can also be done in-place (the result replacing one of the inputs) by starting at the least significant digit and propagating carries upwards.
Getting back to sorting, the advantages to re-arranging "in place" get even more obvious when you think of stacks of punched cards - it's preferable to avoid copying punched cards just to sort them.
Some algorithms for sorting allow this style of in-place operation whereas others don't.
However, all algorithms require some additional storage for working variables. If the goal is simply to produce the output by successively modifying the input, it's fairly easy to define algorithms that do that by reserving a huge chunk of memory, using that to produce some auxiliary data structure, then using that to guide those modifications. You're still producing the output by transforming the input "in place", but you're defeating the whole point of the exercise - you're not being space-efficient.
For that reason, the normal definition of an in-place algorithm requires that you achieve some standard of space efficiency. It's absolutely not acceptable to use extra space proportional to the input (that is, O(n) extra space) and still call your algorithm "in-place".
The Wikipedia page on in-place algorithms currently claims that an in-place algorithm can only use a constant amount - O(1) - of extra space.
In computer science, an in-place algorithm (or in Latin in situ) is an algorithm which transforms input using a data structure with a small, constant amount of extra storage space.
There are some technicalities specified in the In Computational Complexity section, but the conclusion is still that e.g. Quicksort requires O(log n) space (true) and therefore is not in-place (which I believe is false).
O(log n) is much smaller than O(n) - for example the base 2 log of 16,777,216 is 24.
Quicksort and heapsort are both normally considered in-place, and heapsort can be implemented with O(1) extra space (I was mistaken about this earlier). Mergesort is more difficult to implement in-place, but the out-of-place version is very cache-friendly - I suspect real-world implementations accept the O(n) space overhead - RAM is cheap but memory bandwidth is a major bottleneck, so trading memory for cache-efficiency and speed is often a good deal.
[EDIT When I wrote the above, I assume I was thinking of in-place merge-sorting of an array. In-place merge-sorting of a linked list is very simple. The key difference is in the merge algorithm - doing a merge of two linked lists with no copying or reallocation is easy, doing the same with two sub-arrays of a larger array (and without O(n) auxiliary storage) AFAIK isn't.]
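To illustrate the linked-list case: merging two sorted lists in place is just pointer re-linking, with no copying and no O(n) auxiliary buffer (a minimal Python sketch; the `Node` class is an assumption for the example):

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def merge(a, b):
    """Merge two sorted linked lists by re-linking their nodes.
    Only O(1) extra storage (the dummy head and a tail pointer)."""
    dummy = tail = Node(None)
    while a and b:
        if a.value <= b.value:
            tail.next, a = a, a.next   # take the head of list a
        else:
            tail.next, b = b, b.next   # take the head of list b
        tail = tail.next
    tail.next = a or b                 # append whichever list remains
    return dummy.next
```

Doing the same merge on two adjacent sub-arrays without an O(n) buffer is the part that is genuinely hard, which is why array merge sorts usually accept the extra space.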
Quicksort is also cache-efficient, even in-place, but can be disqualified as an in-place algorithm by appealing to its worst-case behaviour. There is a degenerate case (in a non-randomized version, typically when the input is already sorted) where the run-time is O(n^2) rather than the expected O(n log n). In this case the extra space requirement is also increased to O(n). However, for large datasets and with some basic precautions (mainly randomized pivot selection) this worst-case behaviour becomes absurdly unlikely.
My personal view is that O(log n) extra space is acceptable for in-place algorithms - it's not cheating as it doesn't defeat the original point of working in-place.
However, my opinion is of course just my opinion.
One extra note - sometimes, people will call a function in-place simply because it has a single parameter for both the input and the output. It doesn't necessarily follow that the function was space efficient, that the result was produced by transforming the input, or even that the parameter still references the same area of memory. This usage isn't correct (or so the prescriptivists will claim), though it's common enough that it's best to be aware but not get stressed about it.
In-place sorting means sorting without any extra space requirement. According to Wikipedia:
an in-place algorithm is an algorithm which transforms input using a data structure with a small, constant amount of extra storage space.
Quicksort is one example of In-Place Sorting.
I don't think these terms are closely related:
Sort in place means to sort an existing list by modifying the element order directly within the list. The opposite is leaving the original list as is and create a new list with the elements in order.
Natural ordering is a term that describes how complete objects can somehow be ordered. You can, for instance, say that 0 is lower than 1 (natural ordering for integers) or that A is before B in alphabetical order (natural ordering for strings). You can hardly say, though, that Bob is greater or lower than Alice in general, as it heavily depends on specific attributes (alphabetically by name, by age, by income, ...). Therefore there is no natural ordering for people.
I'm not sure these concepts are similar enough to compare as suggested. Yes, they both involve sorting, but one is about a sort ordering that is human understandable (natural) and the other defines an algorithm for efficient sorting in terms of memory by overwriting into the existing structure instead of using an additional data structure (like a bubble sort)
It can be done by using a swap function: instead of building a whole new structure, we re-arrange items within the existing one. Many of us implement that algorithm without even knowing its name. :D
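A minimal sketch of that swap-based, in-place style (selection sort here, chosen only because it sorts purely by exchanges; any swap-based sort illustrates the same point):

```python
def selection_sort(a):
    """Sort in place purely by exchanging items: no new structure,
    just O(1) working variables."""
    for i in range(len(a) - 1):
        # find the index of the smallest remaining element
        smallest = min(range(i, len(a)), key=a.__getitem__)
        if smallest != i:
            a[i], a[smallest] = a[smallest], a[i]  # the swap
    return a

print(selection_sort([4, 2, 7, 1]))  # [1, 2, 4, 7]
```

The input list itself becomes the output; no second array is ever allocated.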
I know the implementations of most of these algorithms, but I don't know for what sizes of data sets (and what kinds of data) to use each of them:
Merge Sort
Bubble Sort (I know, not very often)
Quick Sort
Insertion Sort
Selection Sort
Radix Sort
First of all, you take all the sorting algorithms that have O(n^2) complexity and throw them away.
Then, you have to study several properties of your sorting algorithms and decide whether each one of them is well suited to the problem you want to solve. The most important are:
Is the algorithm in-place? This means that the sorting algorithm uses no extra memory (O(1), actually). This property is very important when you are running memory-critical applications.
Bubble-sort, Insertion-sort and Selection-sort use constant memory.
There is an in-place variant for Merge-sort too.
Is the algorithm stable? This means that if two elements x and y are equal given your comparison method, and in the input x is found before y, then in the output x will be found before y.
Merge-sort, Bubble-sort and Insertion-sort are stable.
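The stability property is what makes multi-key sorting work: sort by the secondary key first, then by the primary key, and ties keep the secondary order (a quick demonstration with Python's stable built-in sort; the sample records are made up):

```python
# Sort students by grade, and alphabetically within each grade, using
# two passes of a stable sort.
students = [("dave", "B"), ("alice", "A"), ("carol", "A"), ("bob", "B")]

by_name = sorted(students)                       # secondary key: name
by_grade = sorted(by_name, key=lambda s: s[1])   # primary key: grade (stable)
print(by_grade)
# [('alice', 'A'), ('carol', 'A'), ('bob', 'B'), ('dave', 'B')]
```

With an unstable sort, the second pass could scramble the alphabetical order within each grade, and this two-pass trick would not work.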
Can the algorithm be parallelized? If the application you are building can make use of parallel computation, you might want to choose parallelizable sorting algorithms.
More info here.
Use Bubble Sort only when the data to be sorted is stored on rotating drum memory. It's optimal for that purpose, but not for random-access memory. These days, that amounts to "don't use Bubble Sort".
Use Insertion Sort or Selection Sort up to some size that you determine by testing it against the other sorts you have available. This usually works out to be around 20-30 items, but YMMV. In particular, when implementing divide-and-conquer sorts like Merge Sort and Quick Sort, you should "break out" to an Insertion sort or a Selection sort when your current block of data is small enough.
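That "break out" advice can be sketched as a quicksort with an insertion-sort cutoff (a minimal Python sketch; the cutoff of 24 and the middle-element pivot are assumptions you should tune by testing, as described above):

```python
CUTOFF = 24  # tune by benchmarking against your other sorts

def hybrid_quicksort(a, lo=0, hi=None):
    """Quicksort that 'breaks out' to insertion sort on small blocks."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        # insertion-sort the small slice a[lo..hi]
        for i in range(lo + 1, hi + 1):
            current, j = a[i], i
            while j > lo and a[j - 1] > current:
                a[j] = a[j - 1]
                j -= 1
            a[j] = current
        return a
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:                      # Hoare-style partition
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    hybrid_quicksort(a, lo, j)
    hybrid_quicksort(a, i, hi)
    return a
```

Production versions (e.g. introsort) add depth tracking and a heapsort fallback on top of this same skeleton.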
Also use Insertion Sort on nearly-sorted data, for example if you somehow know that your data used to be sorted, and hasn't changed very much since.
Use Merge Sort when you need a stable sort (it's also good for sorting linked lists), but beware that for arrays it uses significant additional memory.
Generally you don't use "plain" Quick Sort at all, because even with intelligent choice of pivots it still has Omega(n^2) worst case but unlike Insertion Sort it doesn't have any useful best cases. The "killer" cases can be constructed systematically, so if you're sorting "untrusted" data then some user could deliberately kill your performance, and anyway there might be some domain-specific reason why your data approximates to killer cases. If you choose random pivots then the probability of killer cases is negligible, so that's an option, but the usual approach is "IntroSort" - a QuickSort that detects bad cases and switches to HeapSort.
Radix Sort is a bit of an oddball. It's difficult to find common problems for which it is best, but it has good asymptotic limit for fixed-width data (O(n), where comparison sorts are Omega(n log n)). If your data is fixed-width, and the input is larger than the number of possible values (for example, more than 4 billion 32-bit integers) then there starts to be a chance that some variety of radix sort will perform well.
Merge Sort: when using extra space equal to the size of the array is not an issue
Bubble Sort: only on very small data sets
Quick Sort: when you want an in-place sort and a stable sort is not required
Insertion Sort: only on very small data sets, or if the array has a high probability of already being sorted
Selection Sort: only on very small data sets
Radix Sort: when the ratio of the range of values to the number of items is small (experimentation suggested)
Note that Merge and Quick sort implementations usually switch to Insertion sort in the parts of the subroutine where the sub-array is very small.