Which sorting algorithm to use where? [closed]

There are many sorting algorithms available. A sorting algorithm with O(n^2) time complexity may be preferable to an O(n log n) one because it is in-place or because it is stable. For example:
Insertion sort is good for somewhat sorted data.
Applying quicksort to a nearly sorted array is foolishness.
Heap sort is O(n log n) but not stable.
Merge sort can't be used in embedded systems because in the worst case it requires O(n) extra space.
I want to know which sorting algorithm is suitable under what conditions.
Which sorting algo is best for sorting names in alphabetical order?
Which sorting algo is best for sorting a few integers?
Which sorting algo is best for sorting a few integers that may span a large range (98767 – 6734784)?
Which sorting algo is best for sorting billions of integers?
Which sorting algo is best for sorting in embedded systems or real time systems where space and time both are constraints?
Please suggest other such situations, and books or websites for these kinds of comparisons.

Well, there is no silver bullet - but here are some rules of thumb:
Radix sort / counting sort is usually good when the range of the elements (call it U) is relatively small compared to the number of elements (U << n). (Might fit your cases 2 and 4.)
Insertion sort is good for small (say n < 30) lists, and is empirically even faster than O(n log n) algorithms there. In fact, you can optimize an O(n log n) top-down algorithm by switching to insertion sort when n < 30 (see the sketch below).
A variation of radix sort might also be a good choice for sorting strings alphabetically, since it is O(|S|*n), while a normal comparison-based algorithm is O(|S|*n*log n) [where |S| is the length of your string]. (Fits your case 1.)
Where the input is very large, way too large to fit in memory, the way to do it is with an external sort, which is a variation of merge sort. It minimizes the number of disk reads/writes and makes sure these are done sequentially, because that improves performance drastically. (Might fit case 4.)
For general-purpose sorting, quicksort and Timsort (used by Java) give good performance.
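To illustrate the small-n cutoff mentioned above, here is a minimal, untuned sketch of a quicksort that hands small sub-ranges to insertion sort; the cutoff of 24 and the middle-element pivot are arbitrary illustrative choices, not the values a real library would use:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Quicksort on v[lo, hi) that switches to insertion sort for small ranges.
void hybrid_quicksort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    const std::size_t kCutoff = 24;                  // arbitrary small-n threshold
    while (hi - lo > kCutoff) {
        int pivot = v[lo + (hi - lo) / 2];           // middle element as pivot
        auto first = v.begin() + lo, last = v.begin() + hi;
        // Three-way partition: [< pivot][== pivot][> pivot].
        auto mid1 = std::partition(first, last, [&](int x) { return x < pivot; });
        auto mid2 = std::partition(mid1, last, [&](int x) { return !(pivot < x); });
        hybrid_quicksort(v, lo, mid1 - v.begin());   // recurse on the left part
        lo = mid2 - v.begin();                       // loop on the right part
    }
    // The remaining small range is finished with insertion sort.
    for (std::size_t k = lo + 1; k < hi; ++k)
        for (std::size_t m = k; m > lo && v[m] < v[m - 1]; --m)
            std::swap(v[m], v[m - 1]);
}
```

Call it as hybrid_quicksort(v, 0, v.size()); the right cutoff is something to measure, not guess.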

Merge sort can't be used in embedded systems because in the worst case it requires O(n) extra space.
You may be interested in the stable_sort function from C++. It tries to allocate the extra space for a regular merge sort, but if that fails it does an in-place stable merge sort with inferior time complexity (n * ((log n)^2) instead of n * (log n)). If you can read C++ you can look at the implementation in your favourite standard library, otherwise I expect you can find the details explained somewhere in language-agnostic terms.
There's a body of academic literature about in-place stable sorting (and in particular in-place merging).
So in C++ the rule of thumb is easy: "use std::stable_sort if you need a stable sort, otherwise use std::sort". Python makes it even easier: the rule of thumb is "use sorted".
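For instance, with a made-up record type (a minimal illustration of that rule of thumb, not of any library's internals):

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Person {               // hypothetical record type, just for illustration
    std::string name;
    int age;
};

int main() {
    std::vector<Person> people{{"Ann", 30}, {"Bob", 25}, {"Cid", 30}};

    // Keys are unique here, so stability doesn't matter: plain std::sort.
    std::sort(people.begin(), people.end(),
              [](const Person& a, const Person& b) { return a.name < b.name; });

    // Re-sorting by age while keeping people of equal age in name order
    // requires a stable sort.
    std::stable_sort(people.begin(), people.end(),
                     [](const Person& a, const Person& b) { return a.age < b.age; });
}
```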
In general, you will find that a lot of languages have fairly clever built-in sort algorithms, and you can use them most of the time. It's rare that you'll need to implement your own to beat the standard library. If you do need to implement your own, there isn't really any substitute for pulling out the textbooks, implementing a few algorithms with as many tricks as you can find, and testing them against each other for the specific case you're worried about for which you need to beat the library function.
Most of the "obvious" advice that you might be hoping for in response to this question is already incorporated into the built-in sort functions of one or more common programming languages. But to answer your specific questions:
Which sorting algo is best for sorting names in alphabetical order?
A radix sort might edge out standard comparison sorts like C++ sort, but that might not be possible if you're using "proper" collation rules for names. For example, "McAlister" used to be alphabetized the same as "MacAlister", and "St. John" as "Saint John". But then programmers came along and wanted to just sort by ASCII value rather than code a lot of special rules, so most computer systems don't use those rules any more. I find Friday afternoon is a good time for this kind of feature ;-) You can still use a radix sort if you do it on the letters of the "canonicalized" name rather than the actual name.
"Proper" collation rules in languages other than English are also entertaining. For example in German "Grüber" sorts like "Grueber", and therefore comes after "Gruber" but before "Gruhn". In English the name "Llewellyn" comes after "Lewis", but I believe in Welsh (using the exact same alphabet but different traditional collation rules) it comes before.
For that reason, it's easier to talk about optimizing string sorts than it is to actually do it. Sorting strings "properly" requires being able to plug in locale-specific collation rules, and if you move away from a comparison sort then you might have to re-write all your collation code.
Which sorting algo is best for sorting a few integers?
For a small number of small values, maybe a counting sort; but Introsort with a switch to insertion sort when the data gets small enough (20-30 elements) is pretty good. Timsort is especially good when the data isn't random.
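A counting sort for that case might look roughly like this; it is a sketch assuming non-negative keys with a known, small maximum value:

```cpp
#include <vector>

// Counting sort: O(n + k), where k is the size of the key range.
// Assumes every value is in [0, max_value]; only worthwhile when k is small.
std::vector<int> counting_sort(const std::vector<int>& in, int max_value) {
    std::vector<int> count(max_value + 1, 0);
    for (int x : in) ++count[x];              // histogram of the values
    std::vector<int> out;
    out.reserve(in.size());
    for (int v = 0; v <= max_value; ++v)      // emit each value count[v] times
        out.insert(out.end(), count[v], v);
    return out;
}
```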
Which sorting algo is best for sorting a few integers that may span a large range (98767 – 6734784)?
The large range rules out counting sort, so for a small number of widely-ranged integers, Introsort/Timsort.
Which sorting algo is best for sorting billions of integers?
If by "billions" you mean "too many to fit in memory" then that changes the game a bit. Probably you want to divide the data into chunks that do fit in memory, Intro/Tim sort each one, then do an external merge. Of if you're on a 64 bit machine sorting 32 bit integers, you could consider counting sort.
Which sorting algo is best for sorting in embedded systems or real time systems where space and time both are constraints?
Probably Introsort.
Insertion sort is good for somewhat sorted data.
True, and Timsort takes advantage of the same situation.
Applying quicksort to a nearly sorted array is foolishness.
False. Nobody uses the plain quicksort originally published by Hoare; you can make better choices of pivot that make the killer cases much less obvious than "sorted data". To deal with the bad cases thoroughly, there is Introsort.
Heap sort is O(n log n) but not stable.
True, but Introsort is better (and also not stable).
Merge sort can't be used in embedded systems because in the worst case it requires O(n) extra space.
Handle this by allowing for somewhat slower in-place merging like std::stable_sort does.
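A sketch of that approach in C++: std::inplace_merge typically uses a temporary buffer when it can get one and falls back to a slower in-place merge when it can't, so a merge sort built on it degrades gracefully when memory is tight:

```cpp
#include <algorithm>
#include <vector>

// Top-down merge sort built on std::inplace_merge, which usually merges with
// an extra buffer but falls back to a slower in-place merge if allocation fails.
void merge_sort(std::vector<int>::iterator first, std::vector<int>::iterator last) {
    auto n = last - first;
    if (n <= 1) return;
    auto mid = first + n / 2;
    merge_sort(first, mid);                 // sort the left half
    merge_sort(mid, last);                  // sort the right half
    std::inplace_merge(first, mid, last);   // stable merge of the two halves
}
```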

Related

Why do we need so many sorting techniques?

There is a plethora of sorting techniques in data structures, for example:
Selection Sort
Bubble Sort
Recursive Bubble Sort
Insertion Sort
Recursive Insertion Sort
Merge Sort
Iterative Merge Sort
Quick Sort
Iterative Quick Sort
Heap Sort
Counting Sort
Radix Sort
Bucket Sort
Shell Sort
Tim Sort
Comb Sort
Pigeonhole Sort
Cycle Sort
Cocktail Sort
Strand Sort
and many more.
Do we need all of them?
There’s no single reason why so many different sorting algorithms exist. Here’s a sampler of sorting algorithms and where they came from, to give a better sense of their origins:
Radix sort was invented in the late 1800s for physically sorting punched cards for the US census. It’s still used today in software because it’s very fast on numeric and string data.
Merge sort appears to have been invented by John von Neumann to validate his stored-program computer model (the von Neumann architecture). It works well as a sorting algorithm for low-memory computers processing data that’s streamed through the machine, hence its popularity in the 1960s and 1970s. And it’s a great testbed for divide-and-conquer techniques, making it popular in algorithms classes.
Insertion sort seems to have been around forever. Even though it’s slow in the worst case, it’s fantastic on small inputs and mostly-sorted data and is used as a building block in other fast sorting algorithms.
Quicksort was invented in 1961. It plays excellently with processor caches, hence its continued popularity.
Sorting networks were studied extensively many years back. They’re still useful as building blocks in theoretical proof-of-concept algorithms like signature sort.
Timsort was invented for Python and was designed to sort practical, real-world sequences faster than other sorts by taking advantage of common distributions and patterns.
Introsort was invented as a practical way to harness the speed of quicksort without its worst-case behavior.
Shellsort was invented over fifty years ago and was practical on the computers of its age. Probing its theoretical limits was a difficult mathematical problem for folks who studied it back then.
Han and Thorup's O(n sqrt(log log n))-time integer sorting algorithm was designed to probe the theoretical limits of efficient algorithms using word-level parallelism.
Cycle sort derives from the study of permutations in group theory and is designed to minimize the number of memory writes made when sorting the list.
Heapsort is noteworthy for being in-place and yet fast in practice. It’s based on the idea of implicitly representing a nontrivial data structure.
This isn’t even close to an exhaustive list of sorting algorithms, but hopefully gives you a sense of what’s out there and why. :-)
The main reason sorting algorithms are discussed and studied in early computer science classes is that they provide very good study material. The problem of sorting is simple, and it is a good excuse to present several algorithmic strategies and data structures, show how to implement them, discuss time and space complexity, and examine the different properties algorithms can have even when they apparently solve the same problem.
In practice, standard libraries for programming languages usually include a default sort function, such as std::sort in C++ or list.sort in Python; and in almost every situation, you should trust that function and the algorithm it uses.
But everything you've learned about sorting algorithms is valuable and can be applied to other problems. Here is a non-exhaustive list of things that can be learned by studying sorting algorithms:
divide and conquer;
heaps;
binary search trees, including different types of self-balancing binary search trees;
the importance of choosing an appropriate data-structure;
difference between in-place and not-in-place;
difference between stable and non-stable sort;
recursive approach and iterative approach;
how to calculate the time complexity, and how to compare the efficiency of two algorithms;
Besides educational reasons, we need multiple sorting algorithms because each works best in particular situations, and none of them rules them all.
For example, although the average time complexity of quicksort is impressive, its performance on a nearly sorted array is horrible.

What sorting algorithm is used by programmers? [closed]

I am currently studying algorithms at college and I am curious about what a seasoned developer uses in their code when they need to sort something.
C++ uses IntroSort which has an average of Θ(n log(n)) and worst of Θ(n^2).
C# uses QuickSort which has an average of Θ(n log(n)) and worst of Θ(n^2).
Java uses MergeSort which has an average of Θ(n log(n)) and worst of Θ(n log(n)).
JavaScript seems to be doing Θ(n log(n)), and the algorithm depends on the browser.
And from a quick read, the majority of languages have a sorting method that has a time complexity of Θ(n log(n)).
Do programmers use the default sorting methods or do they implement their own?
When do they use the default one and when do they implement their own?
Is Θ(n log(n)) the best time a sorting algorithm can get?
There are a ton of sorting algorithms, as I am currently finding out at uni.
I am currently studying algorithms at college and I am curious about what a seasoned developer uses in their code when they need to sort something.
Different sorting algorithms have different applications. You choose the best algorithm for the problem you're facing. For example, if you have a list of items in-memory then you can sort them in-place with QuickSort - if you want to sort items that are streamed-in (i.e. an online sort) then QuickSort wouldn't be appropriate.
C++ uses IntroSort which has an average of Θ(n log(n)) and worst of Θ(n^2).
I think you mean that C++'s std::sort defaults to Introsort in most implementations (including the original SGI STL and GNU's), but I don't believe the C++ specification requires sort to use Introsort - it only specifies complexity requirements (O(n log n) comparisons), and it does not even require the sort to be stable (that's what std::stable_sort is for). C++ is just a language and does not have a sorting algorithm built into the language itself. Anyway, it's a library feature, not a language feature.
C# uses QuickSort which has an average of Θ(n log(n)) and worst of Θ(n^2).
Again, C# (the language) does not have any built-in sorting functionality. It's a .NET BCL (Base Class Library) feature that exposes methods that perform the sorting (such as Array.Sort, List<T>.Sort, Enumerable.OrderBy<T>, and so on). Unlike the C++ specification, the C# official documentation does state that the algorithm used by List<T>.Sort is Quicksort, but other methods like Enumerable.OrderBy<T> leave the actual sorting algorithm used to the backend provider (e.g. in Linq-to-SQL and Linq-to-Entities the sorting is performed by the remote database engine).
Do programmers use the default sorting methods or do they implement their own?
Generally speaking, we use the defaults because they're good enough for 95%+ of all workloads and scenarios - or because the specification allows the toolchain and library we're using to pick the best algorithm for the runtime platform (e.g. C++'s sort could hypothetically make use of hardware sorting, which allows sorting of constrained values of n in O(1) to O(n) worst-case time instead of QuickSort's O(n^2), which is a problem when processing unsanitized user input).
But also, generally speaking, programmers should never reimplement their own sorting algorithms. Modern languages with support for templates and generics mean that an algorithm can be written in a general form for us, so we just need to provide the data to be sorted and either a comparator function or a sort-key selector, which avoids the common human errors made when reimplementing a stock algorithm.
As for the possibility of programmers inventing their own novel sorting algorithms... with few exceptions that really doesn't happen. As with cryptography, if you find yourself "inventing" a new sorting algorithm I guarantee that not only are you not inventing a new algorithm, but that your algorithm will be flawed in some way or another. In short: don't - at least not until you've run your idea past your nearest computer science academic.
When do they use the default one and when do they implement their own?
See above. As for when to use a non-default algorithm: as the other answers have said, it depends on the application, i.e. the problem you're trying to solve.
Is Θ(n log(n)) the best time a sorting algorithm can get?
You need to understand the difference between Best-case, Average-case, and Worst-case time complexities. Just read the Wikipedia article section with the big table that shows you the different runtime complexities: https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_sorts - for example, insertion sort has a best-case time complexity of O(n) which is much better than O(n log n), which directly contradicts your supposition.
There are a ton of sorting algorithms, as I am currently finding out at uni.
I think you would be better served by bringing your questions to your class TA/prof/reader as they know the course material you're using and know the context in which you're asking.
In practice, a sort is chosen based upon what it is sorting and where it is sorting it:
whether the data needs to be sorted at all (can all or a subset of the data be inserted in order?)
how sorted the data is already (does it come in sorted chunks?)
whether the data needs to be sorted now or how unsorted it can be before it should be sorted (when? cache compaction during off-peak hours?)
time complexity
space requirements for the sort
Distributed environments are also extremely relevant in modern software and cause situations where not all of the nodes may be available.
This greatly changes how, or even whether, things are fully sorted (for example, data may be sliced up across different nodes, partially sorted, and then referenced by some sort of cuckoo hash).
The standard list sort in Haskell uses merge sort.
Divide the list into "runs": sections where the input is already in ascending order, or in descending order. The minimum run length is 2, and for random data the average is 3 (I think). Runs in descending order are just reversed, so now you have a list of sorted lists.
Merge each pair of lists. Repeat until you have only one list left.
This is O(n log n) in the worst case and O(n) if the input is already sorted. It is also stable.
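Translated out of Haskell into an imperative sketch (vectors instead of linked lists, and my own rough rendering of the steps described above, not GHC's actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Natural merge sort: split the input into already-ordered "runs"
// (descending runs are reversed), then merge runs pairwise until one is left.
std::vector<int> natural_merge_sort(const std::vector<int>& in) {
    // Phase 1: collect runs.
    std::vector<std::vector<int>> runs;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t j = i + 1;
        bool descending = (j < in.size() && in[j] < in[i]);
        if (descending)
            while (j < in.size() && in[j] < in[j - 1]) ++j;
        else
            while (j < in.size() && in[j] >= in[j - 1]) ++j;
        std::vector<int> run(in.begin() + i, in.begin() + j);
        if (descending) std::reverse(run.begin(), run.end());
        runs.push_back(std::move(run));
        i = j;
    }
    if (runs.empty()) return {};

    // Phase 2: merge adjacent runs pairwise until a single run remains.
    while (runs.size() > 1) {
        std::vector<std::vector<int>> next;
        for (std::size_t k = 0; k + 1 < runs.size(); k += 2) {
            std::vector<int> merged;
            std::merge(runs[k].begin(), runs[k].end(),
                       runs[k + 1].begin(), runs[k + 1].end(),
                       std::back_inserter(merged));
            next.push_back(std::move(merged));
        }
        if (runs.size() % 2 == 1) next.push_back(std::move(runs.back()));
        runs = std::move(next);
    }
    return runs.front();
}
```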
I think I qualify as a seasoned developer. If I just want to sort whatever, I will almost always call the library function. A lot of effort has been put into optimizing it.
Some of the situations in which I will write my own sort include:
When I need to do it incrementally. Insertion-sort each item as it comes in to maintain a sorted sequence, maybe, or use a heap as a priority queue (see the sketch after this list).
When a counting sort or bucket sort will do. In this case it's easy to implement and has lower complexity.
When the keys are integers and speed is very important, I sometimes implement a radix sort.
When the stuff I need to sort doesn't fit in memory (external sorting).
When I need to build a suffix array or otherwise take advantage of special relationships between the keys.
When comparisons are extremely expensive, I will sometimes implement a merge sort to put a good upper bound on how many I have to do.
In a real-time context that is memory constrained, I will sometimes write a heap sort to get in-place sorting with a good upper bound on worst-case execution time.
If I can produce the required ordering as a side-effect of something else that is going on (and it makes design sense), then I might take advantage of that instead of doing a separate sort.
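For the incremental case in the first bullet above, the "heap as a priority queue" idea can be as small as this sketch:

```cpp
#include <functional>
#include <initializer_list>
#include <queue>
#include <vector>

// Incremental "sorting": push items as they arrive, pop them smallest-first.
// Useful when data streams in and you only ever need the current minimum,
// rather than a fully sorted array at one moment in time.
int main() {
    std::priority_queue<int, std::vector<int>, std::greater<int>> pending;

    for (int x : {42, 7, 19, 3, 25})    // items arriving one at a time
        pending.push(x);                // O(log n) per insertion

    while (!pending.empty()) {          // consume in ascending order
        int smallest = pending.top();
        pending.pop();
        (void)smallest;                 // ... process `smallest` here ...
    }
}
```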
In the overwhelming majority of cases, only the default sorts of a language are used.
When that is not the case, it is mostly because the data has some special properties that can be used to reduce the sort time, and even then it is usually only the ordering lambda that is changed.
In some cases, where you know there are only a few distinct values, simple O(N) sorting algorithms can be used,
in decreasing order of generality and increasing order of simplicity:
radix sort
bucket sorts
counting sort

What is the best sorting algorithm for speed?

There are the bubble, insertion, selection, and quick sort algorithms.
Which one is the 'fastest' algorithm?
Code size is not important.
Bubble sort
Insertion sort
Quick sort
I tried to check their speed. When the data is already sorted, bubble sort and insertion sort are O(n), but these algorithms are too slow on large lists.
Is it good to use only one algorithm?
Or faster to use a different mix?
Quicksort is generally very good, only really falling down when the data is close to being ordered already, or when the data has a lot of similarity (lots of key repeats), in which case it is slower.
If you don't know anything about your data and you don't mind risking the slow case of quicksort (and if you think about it, you can probably determine whether you're ever likely to hit that case, e.g. from already-ordered data), then quicksort is never going to be a BAD choice.
If you decide your data is, or will sometimes (or often enough to be a problem) be, already sorted or significantly partially sorted, or you otherwise decide you can't risk the worst case of quicksort, then consider Timsort.
As noted by the comments on your question though, if it's really important to have the ultimate performance, you should consider implementing several algorithms and trying them on good representative sample data.
The HP / Microsoft std::sort is introsort (quicksort switching to heap sort if recursion nesting reaches some limit), and std::stable_sort is a variation of bottom-up merge sort.
For sorting an array or vector of mostly random integers, counting / radix sort would normally be fastest.
Most external sorts are some variation of a k-way bottom up merge sort (the initial internal sort phase could use any of the algorithms mentioned above).
For sorting a small (16 or fewer) fixed number of elements, a sorting network could be used. This seems to be one of the lesser-known algorithms. It would mostly be useful if you have to repeatedly sort small sets of elements, perhaps implemented in hardware.
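As a concrete example, a 4-input sorting network is just a fixed sequence of compare-exchange operations, independent of the data, which is why it maps so well onto hardware or branch-friendly code; a minimal sketch:

```cpp
#include <algorithm>
#include <array>

// Compare-exchange: after the call, a <= b.
inline void cex(int& a, int& b) { if (b < a) std::swap(a, b); }

// Optimal 5-comparator sorting network for 4 elements. The comparison
// sequence is fixed regardless of the input; the (0,1)/(2,3) and
// (0,2)/(1,3) pairs could even run in parallel in hardware.
void sort4(std::array<int, 4>& v) {
    cex(v[0], v[1]);
    cex(v[2], v[3]);
    cex(v[0], v[2]);
    cex(v[1], v[3]);
    cex(v[1], v[2]);
}
```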

In what situations do I use these sorting algorithms?

I know the implementations of most of these algorithms, but I don't know what sizes and kinds of data sets to use them for:
Merge Sort
Bubble Sort (I know, not very often)
Quick Sort
Insertion Sort
Selection Sort
Radix Sort
First of all, you take all the sorting algorithms that have O(n^2) complexity and throw them away.
Then, you have to study several properties of your sorting algorithms and decide whether each one of them is better suited for the problem you want to solve. The most important are:
Is the algorithm in-place? This means that the sorting algorithm doesn't use extra memory (O(1) extra, to be precise). This property is very important when you are running memory-critical applications.
Bubble-sort, Insertion-sort and Selection-sort use constant memory.
There is an in-place variant for Merge-sort too.
Is the algorithm stable? This means that if two elements x and y are equal given your comparison method, and in the input x is found before y, then in the output x will be found before y.
Merge-sort, Bubble-sort and Insertion-sort are stable.
Can the algorithm be parallelized? If the application you are building can make use of parallel computation, you might want to choose parallelizable sorting algorithms.
Use Bubble Sort only when the data to be sorted is stored on rotating drum memory. It's optimal for that purpose, but not for random-access memory. These days, that amounts to "don't use Bubble Sort".
Use Insertion Sort or Selection Sort up to some size that you determine by testing it against the other sorts you have available. This usually works out to be around 20-30 items, but YMMV. In particular, when implementing divide-and-conquer sorts like Merge Sort and Quick Sort, you should "break out" to an Insertion sort or a Selection sort when your current block of data is small enough.
Also use Insertion Sort on nearly-sorted data, for example if you somehow know that your data used to be sorted, and hasn't changed very much since.
Use Merge Sort when you need a stable sort (it's also good when sorting linked lists), but beware that for arrays it uses significant additional memory.
Generally you don't use "plain" Quick Sort at all, because even with intelligent choice of pivots it still has Omega(n^2) worst case but unlike Insertion Sort it doesn't have any useful best cases. The "killer" cases can be constructed systematically, so if you're sorting "untrusted" data then some user could deliberately kill your performance, and anyway there might be some domain-specific reason why your data approximates to killer cases. If you choose random pivots then the probability of killer cases is negligible, so that's an option, but the usual approach is "IntroSort" - a QuickSort that detects bad cases and switches to HeapSort.
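A stripped-down sketch of that structure: depth-limited quicksort that falls back to heap sort when the recursion gets too deep, with small ranges left for one final insertion-sort pass. The thresholds are illustrative, not the tuned values real library implementations use:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Iter = std::vector<int>::iterator;

// Depth-limited quicksort: when recursion gets too deep (a sign of bad
// pivots), heap-sort that sub-range instead. Ranges of 16 or fewer elements
// are left for the final insertion-sort pass below.
void intro_loop(Iter first, Iter last, int depth_limit) {
    while (last - first > 16) {
        if (depth_limit-- == 0) {                 // too deep: switch to heap sort
            std::make_heap(first, last);
            std::sort_heap(first, last);
            return;
        }
        int pivot = *(first + (last - first) / 2);
        Iter mid1 = std::partition(first, last, [&](int x) { return x < pivot; });
        Iter mid2 = std::partition(mid1, last, [&](int x) { return !(pivot < x); });
        intro_loop(first, mid1, depth_limit);     // recurse on the left part
        first = mid2;                             // iterate on the right part
    }
}

void introsort(std::vector<int>& v) {
    if (v.size() < 2) return;
    int depth_limit = 2 * static_cast<int>(std::log2(v.size()));
    intro_loop(v.begin(), v.end(), depth_limit);
    // Every element is now within a small, already-partitioned range of its
    // final position, so one insertion-sort pass finishes the job cheaply.
    for (std::size_t k = 1; k < v.size(); ++k)
        for (std::size_t m = k; m > 0 && v[m] < v[m - 1]; --m)
            std::swap(v[m], v[m - 1]);
}
```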
Radix Sort is a bit of an oddball. It's difficult to find common problems for which it is best, but it has a good asymptotic limit for fixed-width data (O(n), where comparison sorts are Omega(n log n)). If your data is fixed-width, and the input is larger than the number of possible values (for example, more than 4 billion 32-bit integers), then there starts to be a chance that some variety of radix sort will perform well.
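For fixed-width keys, a least-significant-digit radix sort working one byte at a time is the usual shape. A sketch for unsigned 32-bit values (four stable counting passes):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort on 32-bit unsigned keys, one 8-bit digit per pass.
// Each pass is a stable counting sort, so four O(n) passes sort the array.
void radix_sort_u32(std::vector<std::uint32_t>& v) {
    std::vector<std::uint32_t> buffer(v.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 256> count{};          // histogram of this digit
        for (std::uint32_t x : v) ++count[(x >> shift) & 0xFF];

        std::size_t total = 0;                         // exclusive prefix sums:
        for (std::size_t& c : count) {                 // digit -> start offset
            std::size_t old = c;
            c = total;
            total += old;
        }
        for (std::uint32_t x : v)                      // stable scatter by digit
            buffer[count[(x >> shift) & 0xFF]++] = x;
        v.swap(buffer);                                // even number of passes,
    }                                                  // so the result ends in v
}
```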
Merge Sort: when using extra space equal to the size of the array is not an issue.
Bubble Sort: only on very small data sets.
Quick Sort: when you want an in-place sort and a stable sort is not required.
Insertion Sort: only on very small data sets, or if the array has a high probability of already being sorted.
Selection Sort: only on very small data sets.
Radix Sort: when the range of values to number of items ratio is small (experimentation suggested).
Note that Merge Sort and Quick Sort implementations usually switch to Insertion Sort for the parts of the routine where the sub-array is very small.

Real world examples to decide which sorting algorithm works best

I am risking this question being closed before I get an answer, but I really do want to know the answer. So here goes.
I am currently trying to learn algorithms, and I am beginning to understand them, but I cannot yet relate them to the real world.
I understand time complexity and space complexity. I also understand some sorting algorithms based on their pseudocode, such as:
Bubble Sort
Insertion Sort
Selection Sort
Quicksort
Mergesort
Heapsort (Some what)
I am also aware of best-case and worst-case scenarios (average case not so much).
I have also found some online references that show these algorithms graphically, which gave me a good understanding as well.
BUT my question is: can someone give me REAL WORLD EXAMPLES of where these sorting algorithms are used?
As the number of elements increases, you will use more sophisticated sorting algorithms. The more advanced sorting techniques have a higher initial overhead, so you need a lot of elements to sort to justify that cost. If you only have 10 elements, a bubble or insertion sort will be much faster than a merge sort or heapsort.
Space complexity is important to consider for smaller embedded devices like a TV remote or a cell phone. You don't have enough space to do something like a merge sort on those devices.
Databases use an external merge sort to sort sets of data that are too large to be loaded entirely into memory. The driving factor in this sort is the reduction in the number of disk I/Os.
There are many other factors to consider that contribute to time and space complexity.
Sorting-Algorithms.com
One example is C++ STL sort. As the Wikipedia page says:
The GNU Standard C++ library, for example, uses a hybrid sorting algorithm: introsort is performed first, to a maximum depth given by 2×log2 n, where n is the number of elements, followed by an insertion sort on the result. Whatever the implementation, the complexity should be O(n log n) comparisons on the average.

Resources