Sorting in place - algorithm

What is meant by to "sort in place"?

The idea of an in-place algorithm isn't unique to sorting, but sorting is probably the most important case, or at least the most well-known. The idea is about space efficiency - using the minimum amount of RAM, hard disk or other storage that you can get away with. This was especially relevant going back a few decades, when hardware was much more limited.
The idea is to produce an output in the same memory space that contains the input by successively transforming that data until the output is produced. This avoids the need to use twice the storage - one area for the input and an equal-sized area for the output.
Sorting is a fairly obvious case for this because sorting can be done by repeatedly exchanging items - sorting only re-arranges items. Exchanges aren't the only approach - the Insertion Sort, for example, uses a slightly different approach which is equivalent to doing a run of exchanges but faster.
Another example is matrix transposition - again, this can be implemented by exchanging items. Adding two very large numbers can also be done in-place (the result replacing one of the inputs) by starting at the least significant digit and propogating carries upwards.
Getting back to sorting, the advantages to re-arranging "in place" get even more obvious when you think of stacks of punched cards - it's preferable to avoid copying punched cards just to sort them.
Some algorithms for sorting allow this style of in-place operation whereas others don't.
However, all algorithms require some additional storage for working variables. If the goal is simply to produce the output by successively modifying the input, it's fairly easy to define algorithms that do that by reserving a huge chunk of memory, using that to produce some auxiliary data structure, then using that to guide those modifications. You're still producing the output by transforming the input "in place", but you're defeating the whole point of the exercise - you're not being space-efficient.
For that reason, the normal definition of an in-place definition requires that you achieve some standard of space efficiency. It's absolutely not acceptable to use extra space proportional to the input (that is, O(n) extra space) and still call your algorithm "in-place".
The Wikipedia page on in-place algorithms currently claims that an in-place algorithm can only use a constant amount - O(1) - of extra space.
In computer science, an in-place algorithm (or in Latin in situ) is an algorithm which transforms input using a data structure with a small, constant amount of extra storage space.
There are some technicalities specified in the In Computational Complexity section, but the conclusion is still that e.g. Quicksort requires O(log n) space (true) and therefore is not in-place (which I believe is false).
O(log n) is much smaller than O(n) - for example the base 2 log of 16,777,216 is 24.
Quicksort and heapsort are both normally considered in-place, and heapsort can be implemented with O(1) extra space (I was mistaken about this earlier). Mergesort is more difficult to implement in-place, but the out-of-place version is very cache-friendly - I suspect real-world implementations accept the O(n) space overhead - RAM is cheap but memory bandwidth is a major bottleneck, so trading memory for cache-efficiency and speed is often a good deal.
[EDIT When I wrote the above, I assume I was thinking of in-place merge-sorting of an array. In-place merge-sorting of a linked list is very simple. The key difference is in the merge algorithm - doing a merge of two linked lists with no copying or reallocation is easy, doing the same with two sub-arrays of a larger array (and without O(n) auxiliary storage) AFAIK isn't.]
Quicksort is also cache-efficient, even in-place, but can be disqualified as an in-place algorithm by appealing to its worst-case behaviour. There is a degenerate case (in a non-randomized version, typically when the input is already sorted) where the run-time is O(n^2) rather than the expected O(n log n). In this case the extra space requirement is also increased to O(n). However, for large datasets and with some basic precautions (mainly randomized pivot selection) this worst-case behaviour becomes absurdly unlikely.
My personal view is that O(log n) extra space is acceptable for in-place algorithms - it's not cheating as it doesn't defeat the original point of working in-place.
However, my opinion is of course just my opinion.
One extra note - sometimes, people will call a function in-place simply because it has a single parameter for both the input and the output. It doesn't necessarily follow that the function was space efficient, that the result was produced by transforming the input, or even that the parameter still references the same area of memory. This usage isn't correct (or so the prescriptivists will claim), though it's common enough that it's best to be aware but not get stressed about it.

In-place sorting means sorting without any extra space requirement. According to wiki , it says
an in-place algorithm is an algorithm which transforms input using a data structure with a small, constant amount of extra storage space.
Quicksort is one example of In-Place Sorting.

I don't think these terms are closely related:
Sort in place means to sort an existing list by modifying the element order directly within the list. The opposite is leaving the original list as is and create a new list with the elements in order.
Natural ordering is a term that describes how complete objects can somehow be ordered. You can for instance say that 0 is lower that 1 (natural ordering for integers) or that A is before B in alphabetical order (natural ordering for strings). You can hardly say though that Bob is greater or lower than Alice in general as it heavily depends on specific attributes (alphabetically by name, by age, by income, ...). Therefore there is no natural ordering for people.

I'm not sure these concepts are similar enough to compare as suggested. Yes, they both involve sorting, but one is about a sort ordering that is human understandable (natural) and the other defines an algorithm for efficient sorting in terms of memory by overwriting into the existing structure instead of using an additional data structure (like a bubble sort)

it can be done by using swap function , instead of making a whole new structure , we implement that algorithm without even knowing it's name :D

Related

what is the best algorithm of sorting in speed

There's bubble, insert, selection, quick sorting algorithm.
Which one is the 'fastest' algorithm?
code size is not important.
Bubble sort
insertion sort
quick sort
I tried to check speed. when data is already sorted, bubble, insertion's Big-O is n but the algorithm is too slow on large lists.
Is it good to use only one algorithm?
Or faster to use a different mix?
Quicksort is generally very good, only really falling down when the data is close to being ordered already, or when the data has a lot of similarity (lots of key repeats), in which case it is slower.
If you don't know anything about your data and you don't mind risking the slow case of quick sort (if you think about it you can probably make a determination for your case if it's ever likely you'll get this (from already ordered data)) then quicksort is never going to be a BAD choice.
If you decide your data is or will sometimes (or often enough to be a problem) be sorted (or significantly partially sorted) already, or one way and another you decide you can't risk the worst case of quicksort, then consider timsort.
As noted by the comments on your question though, if it's really important to have the ultimate performance, you should consider implementing several algorithms and trying them on good representative sample data.
HP / Microsoft std::sort is introsort (quick sort switching to heap sort if nesting reaches some limit), and std::stable_sort is a variation of bottom up mergesort.
For sorting an array or vector of mostly random integers, counting / radix sort would normally be fastest.
Most external sorts are some variation of a k-way bottom up merge sort (the initial internal sort phase could use any of the algorithms mentioned above).
For sorting a small (16 or less) fixed number of elements, a sorting network could be used. This seems to be one of the lesser known algorithms. It would mostly be useful if having to repeatedly sort small sets of elements, perhaps implemented in hardware.

In what situations do I use these sorting algorithms?

I know the implementation for most of these algorithms, but I don't know for what sized data sets to use them for (and the data included):
Merge Sort
Bubble Sort (I know, not very often)
Quick Sort
Insertion Sort
Selection Sort
Radix Sort
First of all, you take all the sorting algorithms that have a O(n2) complexity and throw them away.
Then, you have to study several proprieties of your sorting algorithms and decide whether each one of them will be better suited for the problem you want to solve. The most important are:
Is the algorithm in-place? This means that the sorting algorithm doesn't use any (O(1) actually) extra memory. This propriety is very important when you are running memory-critical applications.
Bubble-sort, Insertion-sort and Selection-sort use constant memory.
There is an in-place variant for Merge-sort too.
Is the algorithm stable? This means that if two elements x and y are equal given your comparison method, and in the input x is found before y, then in the output x will be found before y.
Merge-sort, Bubble-sort and Insertion-sort are stable.
Can the algorithm be parallelized? If the application you are building can make use of parallel computation, you might want to choose parallelizable sorting algorithms.
More info here.
Use Bubble Sort only when the data to be sorted is stored on rotating drum memory. It's optimal for that purpose, but not for random-access memory. These days, that amounts to "don't use Bubble Sort".
Use Insertion Sort or Selection Sort up to some size that you determine by testing it against the other sorts you have available. This usually works out to be around 20-30 items, but YMMV. In particular, when implementing divide-and-conquer sorts like Merge Sort and Quick Sort, you should "break out" to an Insertion sort or a Selection sort when your current block of data is small enough.
Also use Insertion Sort on nearly-sorted data, for example if you somehow know that your data used to be sorted, and hasn't changed very much since.
Use Merge Sort when you need a stable sort (it's also good when sorting linked lists), beware that for arrays it uses significant additional memory.
Generally you don't use "plain" Quick Sort at all, because even with intelligent choice of pivots it still has Omega(n^2) worst case but unlike Insertion Sort it doesn't have any useful best cases. The "killer" cases can be constructed systematically, so if you're sorting "untrusted" data then some user could deliberately kill your performance, and anyway there might be some domain-specific reason why your data approximates to killer cases. If you choose random pivots then the probability of killer cases is negligible, so that's an option, but the usual approach is "IntroSort" - a QuickSort that detects bad cases and switches to HeapSort.
Radix Sort is a bit of an oddball. It's difficult to find common problems for which it is best, but it has good asymptotic limit for fixed-width data (O(n), where comparison sorts are Omega(n log n)). If your data is fixed-width, and the input is larger than the number of possible values (for example, more than 4 billion 32-bit integers) then there starts to be a chance that some variety of radix sort will perform well.
When using extra space equal to the size of the array is not an issue
Only on very small data sets
When you want an in-place sort and a stable sort is not required
Only on very small data sets, or if the array has a high probability to already be sorted
Only on very small data sets
When the range of values to number of items ratio is small (experimentation suggested)
Note that usually Merge or Quick sort implementations use Insertion sort for parts of the subroutine where the sub-array is very small.

Criteria for selecting a sorting algorithm

I was curious to know how to select a sorting algorithm based on the input, so that I can get the best efficiency.
Should it be on the size of the input or how the input is arranged(Asc/Desc) or the data structure used etc ... ?
The importance of algorithms generally, and in sorting algorithms as well is as following:
(*) Correctness - This is the most important thing. It worth nothing if your algorithm is super fast and efficient, but is wrong. In sorting, even if you have 2 candidates that are sorting correctly, but you need a stable sort - you will chose the stable sort algorithm, even if it is less efficient - because it is correct for your purpose, and the other is not.
Next are basically trade offs between running time, needed space and implementation time (If you will need to implement something from scratch rather then use a library, for a minor performance enhancement - it probably doesn't worth it)
Some things to take into consideration when thinking about the trade off mentioned above:
Size of the input (for example: for small inputs, insertion sort is empirically faster then more advanced algorithms, though it takes O(n^2)).
Location of the input (sorting algorithms on disk are different from algorithms on RAM, because disk reads are much less efficient when not sequential. The algorithm which is usually used to sort on disk is a variation of merge-sort).
How is the data distributed? If the data is likely to be "almost sorted" - maybe a usually terrible bubble-sort can sort it in just 2-3 iterations and be super fast comparing to other algorithms.
What libraries do you have already implemented? How much work will it take to implement something new? Will it worth it?
Type (and range) of the input - for enumerable data (integers for example) - an integer designed algorithm (like radix sort) might be more efficient then a general case algorithm.
Latency requirement - if you are designing a missile head, and the result must return within specific amount of time, quicksort which might decay to quadric running time on worst case - might not be a good choice, and you might want to use a different algorithm which have a strict O(nlogn) worst case instead.
Your hardware - if for example you are using a huge cluster and a huge data - a distributed sorting algorithm will probably be better then trying to do all the work on one machine.
It should be based on all those things.
You need to take into account size of your data as Insertion sort can be faster than quicksort for small data sets, etc
you need to know the arrangement of your data due to differing worst/average/best case asymptotic runtimes for each of the algorithm (and some whose worst/avg cases are the same whereas the other may have significantly worse worst case vs avg)
and you obviously need to know the data structure used as there are some very specialized sorting algorithms if your data is already in a special format or even if you can put it into a new data structure efficiently that will automatically do your sorting for you (a la BST or heaps)
The 2 main things that determine your choice of a sorting algorithm are time complexity and space complexity. Depending on your scenario, and the resources (time and memory) available to you, you might need to choose between sorting algorithms, based on what each sorting algorithm has to offer.
The actual performance of a sorting algorithm depends on the input data too, and it helps if we know certain characteristics of the input data beforehand, like the size of input, how sorted the array already is.
For example,
If you know beforehand that the input data has only 1000 non-negative integers, you can very well use counting sort to sort such an array in linear time.
The choice of a sorting algorithm depends on the constraints of space and time, and also the size/characteristics of the input data.
At a very high level you need to consider the ratio of insertions vs compares with each algorithm.
For integers in a file, this isn't going to be hugely relevant but if say you're sorting files based on contents, you'll naturally want to do as few comparisons as possible.

Is there ever a good reason to use Insertion Sort?

For general-purpose sorting, the answer appears to be no, as quick sort, merge sort and heap sort tend to perform better in the average- and worst-case scenarios. However, insertion sort appears to excel at incremental sorting, that is, adding elements to a list one at a time over an extended period of time while keeping the list sorted, especially if the insertion sort is implemented as a linked list (O(log n) average case vs. O(n)). However, a heap seems to be able to perform just (or nearly) as well for incremental sorting (adding or removing a single element from a heap has a worst-case scenario of O(log n)). So what exactly does insertion sort have to offer over other comparison-based sorting algorithms or heaps?
From http://www.sorting-algorithms.com/insertion-sort:
Although it is one of the elementary sorting algorithms with
O(n2) worst-case time, insertion sort
is the algorithm of choice either when
the data is nearly sorted (because it
is adaptive) or when the problem size
is small (because it has low
overhead).
For these reasons, and because it is also stable, insertion sort is
often used as the recursive base case
(when the problem size is small) for
higher overhead divide-and-conquer
sorting algorithms, such as merge sort
or quick sort.
An important concept in analysis of algorithms is asymptotic analysis. In the case of two algorithms with different asymptotic running times, such as one O(n^2) and one O(nlogn) as is the case with insertion sort and quicksort respectively, it is not definite that one is faster than the other.
The important distinction with this sort of analysis is that for sufficiently large N, one algorithm will be faster than another. When analyzing an algorithm down to a term like O(nlogn), you drop constants. When realistically analyzing the running of an algorithm, those constants will be important only for situations of small n.
So what does this mean? That means for certain small n, some algorithms are faster. This article from EmbeddedGurus.net includes an interesting perspective on choosing different sorting algorithms in the case of a limited space (16k) and limited memory system. Of course, the article references only sorting a list of 20 integers, so larger orders of n is irrelevant. Shorter code and less memory consumption (as well as avoiding recursion) were ultimately more important decisions.
Insertion sort has low overhead, it can be written fairly succinctly, and it has several two key benefits: it is stable, and it has a fairly fast running case when the input is nearly sorted.
Yes, there is a reason to use either an insertion sort or one of its variants.
The sorting alternatives (quick sort, etc.) of the other answers here make the assumption that the data is already in memory and ready to go.
But if you are attempting to read in a large amount of data from a slower external source (say a hard drive), there is a large amount of time wasted as the bottleneck is clearly the data channel or the drive itself. It just cannot keep up with the CPU. A natural series of waits occur during any read. These waits are wasted CPU cycles unless you use them to sort as you go.
For instance, if you were to make your solution to this be the following:
Read a ton of data in a dedicated loop into memory
Sort that data
You would very likely take longer than if you did the following in two threads.
Thread A:
Read a datum
Place datum into FIFO queue
(Repeat until data exhausted from drive)
Thread B:
Get a datum from the FIFO queue
Insert it into the proper place in your sorted list
(repeat until queue empty AND Thread A says "done").
...the above will allow you to use the otherwise wasted time. Note: Thread B does not impede Thread A's progress.
By the time the data is fully read, it will have been sorted and ready for use.
Most sorting procedures will use quicksort and then insertion sort for very small data sets.
If you're talking about maintaining a sorted list, there is no advantage over some kind of tree, it's just slower.
Well, maybe it consumes less memory or is a simpler implementation.
Inserting into a sorted list will involve a scan, which means that each insert is O(n), therefore sorting n items becomes O(n^2)
Inserting into a container such as a balanced tree, is typically log(n), therefore the sort is O(n log(n)) which is of course better.
But for small lists it hardly makes any difference. You might use an insert sort if you have to write it yourself without any libraries, the lists are small and/or you don't care about performance.
YES,
Insertion sort is better than Quick Sort on short lists.
In fact an optimal Quick Sort has a size threshold that it stops at, and then the entire array is sorted by insertion sort over the threshold limits.
Also...
For maintaining a scoreboard, Binary Insertion Sort may be as good as it gets.
See this page.
For small array size insertion sort outperforms quicksort.
Java 7 and Java 8 uses dual pivot quicksort to sort primitive data types.
Dual pivot quicksort out performs typical single pivot quicksort. According to algorithm of dual pivot quicksort :
For small arrays (length < 27), use the Insertion sort algorithm.
Choose two pivot...........
Definitely, insertion sort out performs quicksort for smaller array size and that is why you switch to insertion sort for arrays of length less than 27. The reason could be: there are no recursions in insertion sort.
Source: http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf

Stable, efficient sort?

I'm trying to create an unusual associative array implementation that is very space-efficient, and I need a sorting algorithm that meets all of the following:
Stable (Does not change the relative ordering of elements with equal keys.)
In-place or almost in-place (O(log n) stack is fine, but no O(n) space usage or heap allocations.
O(n log n) time complexity.
Also note that the data structure to be sorted is an array.
It's easy to see that there's a basic algorithm that matches any 2 of these three (insertion sort matches 1 and 2, merge sort matches 1 and 3, heap sort matches 2 and 3), but I cannot for the life of me find anything that matches all three of these criteria.
Merge sort can be written to be in-place I believe. That may be the best route.
Note: standard quicksort is not O(n log n) ! In the worst case, it can take up to O(n^2) time. The problem is that you might pivot on an element which is far from the median, so that your recursive calls are highly unbalanced.
There is a way to combat this, which is to carefully pick a median which is guaranteed, or at least very likely, to be close to the median. It is surprising that you can actually find the exact median in linear time, although in your case it sounds like you care about speed so I would not suggest this.
I think the most practical approach is to implement a stable quicksort (it's easy to keep stable) but use the median of 5 random values as the pivot at each step. This makes it highly unlikely that you'll have a slow sort, and is stable.
By the way, merge sort can be done in-place, although it's tricky to do both in-place and stable.
What about quicksort?
Exchange can do that too, might be more "stable" by your terms, but quicksort is faster.
There's a list of sort algorithms on Wikipedia. It includes categorization by execution time, stability, and allocation.
Your best bet is probably going to be modifying an efficient unstable sort to be stable, thereby making it less efficient.
There is a class of stable in-place merge algorithms, although they are complicated and linear with a rather high constant hidden in the O(n). To learn more, have a look at this article, and its bibliography.
Edit: the merge phase is linear, thus the mergesort is nlog_n.
Because your elements are in an array (rather than, say, a linked list) you have some information about their original order available to you in the array indices themselves. You can take advantage of this by writing your sort and comparison functions to be aware of the indices:
function cmp( ar, idx1, idx2 )
{
// first compare elements as usual
rc = (ar[idx1]<ar[idx2]) ? -1 : ( (ar[idx1]>ar[idx2]) ? 1 : 0 );
// if the elements are identical, then compare their positions
if( rc != 0 )
rc = (idx1<idx2) ? -1 : ((idx1>idx2) ? 1 : 0);
return rc;
}
This technique can be used to make any sort stable, as long as the sort ONLY performs element swaps. The indices of elements will change, but the relative order of identical elements will stay the same, so the sort remains robust. It won't work out of the box for a sort like heapsort because the original heapification "throws away" the relative ordering, though you might be able to adapt the idea to other sorts.
Quicksort can be made stable reasonably easy simply by having an sequence field added to each record, initializing it to the index before sorting and using it as the least significant part of the sort key.
This has a slightly adverse effect on the time taken but it doesn't affect the time complexity of the algorithm. It also has a minimal storage cost overhead for each record, but that rarely matters until you get very large numbers of records (and is mimimized with larger record sizes).
I've used this method with C's qsort() function to avoid writing my own. Each record has a 32-bit integer added and populated with the starting sequence number before calling qsort().
Then the comparison function checked the keys and the sequence (this guarantees there are no duplicate keys), turning the quicksort into a stable one. I recall that it still outperformed the inherently stable mergesort for the data sets I was using.
Your mileage may vary, so always remember: Measure, don't guess!
Quicksort can be made stable by doing it on a linked list. This costs n to pick random or median of 3 pivots but with a very small constant (list traversal).
By splitting the list and ensuring that the left list is sorted so same values go left and the right list is sorted so the same values go right, the sort will be implicity stable for no real extra cost. Also, since this deals with assignment rather than swapping, I think the speed might actually be slightly better than a quick sort on an array since there's only a single write.
So in conclusion, list up all your items and run quicksort on a list
Don't worry too much about O(n log n) until you can demonstrate that it matters. If you can find an O(n^2) algorithm with a drastically lower constant, go for it!
The general worst-case scenario is not relevant if your data is highly constrained.
In short: Run some test.
There's a nice list of sorting functions on wikipedia that can help you find whatever type of sorting function you're after.
For example, to address your specific question, it looks like an in-place merge sort is what you want.
However, you might also want to take a look at strand sort, it's got some very interesting properties.
I have implemented a stable in-place quicksort and a stable in-place merge sort. The merge sort is a bit faster, and guaranteed to work in O(n*log(n)^2), but not the quicksort. Both use O(log(n)) space.
Perhaps shell sort? If I recall my data structures course correctly, it tended to be stable, but it's worse case time is O(n log^2 n), although it performs O(n) on almost sorted data. It's based on insertion sort, so it sorts in place.
Maybe I'm in a bit of a rut, but I like hand-coded merge sort. It's simple, stable, and well-behaved. The additional temporary storage it needs is only N*sizeof(int), which isn't too bad.

Resources