Why make selection sort stable? [duplicate] - algorithm

I'm curious: why is stability important (or not important) in sorting algorithms?

A sorting algorithm is said to be stable if two objects with equal keys appear in the same order in sorted output as they appear in the input array to be sorted. Some sorting algorithms are stable by nature like Insertion sort, Merge Sort, Bubble Sort, etc. And some sorting algorithms are not, like Heap Sort, Quick Sort, etc.
Background: a "stable" sorting algorithm keeps the items with the same sorting key in order. Suppose we have a list of 5-letter words:
peach
straw
apple
spork
If we sort the list by just the first letter of each word then a stable-sort would produce:
apple
peach
straw
spork
In an unstable sort algorithm, straw and spork may be interchanged, but in a stable one, they stay in the same relative positions (that is, since straw appears before spork in the input, it also appears before spork in the output).
We could sort the list of words using this algorithm: stable sorting by column 5, then 4, then 3, then 2, then 1.
In the end, it will be correctly sorted. Convince yourself of that. (by the way, that algorithm is called radix sort)
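The column-by-column procedure can be sketched in Python, whose built-in sorted() is documented as stable. This is only an illustration: the function name lsd_sort_words is made up here, and the sketch assumes every word has the same length.

```python
def lsd_sort_words(words):
    """Sort equal-length words by stable-sorting on each column, last column first."""
    if not words:
        return []
    width = len(words[0])
    result = list(words)
    for col in range(width - 1, -1, -1):
        # sorted() is stable, so ties keep the order produced by the
        # earlier (less significant) passes.
        result = sorted(result, key=lambda w: w[col])
    return result


words = ["peach", "straw", "apple", "spork"]

# Sorting by the first letter only: straw stays ahead of spork.
print(sorted(words, key=lambda w: w[0]))

# Stable passes over columns 5, 4, 3, 2, 1 give the fully sorted list.
print(lsd_sort_words(words))
```

Replacing sorted() with an unstable sort would break this, because each pass relies on the previous passes' order surviving among equal letters.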
Now to answer your question, suppose we have a list of first and last names. We are asked to sort "by last name, then by first". We could first sort (stable or unstable) by the first name, then stable sort by the last name. After these sorts, the list is primarily sorted by the last name. However, where last names are the same, the first names are sorted.
You can't stack unstable sorts in the same fashion.
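A small Python sketch of the two-pass name sort, relying on the documented stability of sorted(). The sample names are invented for the illustration.

```python
people = [
    ("Mary", "Smith"),
    ("John", "Smith"),
    ("Alice", "Jones"),
    ("Bob", "Adams"),
]

# Pass 1: sort by first name (this pass may be stable or unstable).
by_first = sorted(people, key=lambda p: p[0])

# Pass 2: stable sort by last name. People who share a last name
# keep the first-name order established by pass 1.
by_last_then_first = sorted(by_first, key=lambda p: p[1])

for first, last in by_last_then_first:
    print(last, first)
```

The two Smiths come out as John, then Mary, because the second pass never reorders records whose last names compare equal.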

A stable sorting algorithm is one that keeps identical elements in the same order as they appear in the input, whilst an unstable sort may not. - I thank my algorithms lecturer Didem Gozupek for providing insight into algorithms.
I needed to edit the question again because of feedback that some people did not follow the logic of the presentation. It illustrates sorting w.r.t. first elements. Alternatively, you can consider the illustration as consisting of key-value pairs.
Stable Sorting Algorithms:
Insertion Sort
Merge Sort
Bubble Sort
Tim Sort
Counting Sort
Block Sort
Quadsort
Library Sort
Cocktail shaker Sort
Gnome Sort
Odd–even Sort
Unstable Sorting Algorithms:
Heap sort
Selection sort
Shell sort
Quick sort
Introsort (subject to Quicksort)
Tree sort
Cycle sort
Smoothsort
Tournament sort (subject to Heapsort)

Sorting stability means that records with the same key retain their relative order before and after the sort.
So stability matters if, and only if, the problem you're solving requires retention of that relative order.
If you don't need stability, you can use a fast, memory-sipping algorithm from a library, like heapsort or quicksort, and forget about it.
If you need stability, it's more complicated. Stable algorithms generally have higher big-O CPU and/or memory usage than unstable algorithms (merge sort, for example, needs O(n) extra memory). So when you have a large data set, you have to pick between beating up the CPU or the memory. If you're constrained on both CPU and memory, you have a problem. A good compromise stable algorithm is a binary tree sort; the Wikipedia article has a pathetically easy C++ implementation based on the STL.
You can make an unstable algorithm into a stable one by adding the original record number as the last-place key for each record.
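As a rough sketch of that trick in Python: a plain swap-based selection sort is unstable, and wrapping it so that each record is compared by (key, original index) makes the result stable. Both function names are made up for this illustration.

```python
def selection_sort(items, key):
    """Plain selection sort with swaps. Not stable."""
    a = list(items)
    for i in range(len(a)):
        smallest = i
        for j in range(i + 1, len(a)):
            if key(a[j]) < key(a[smallest]):
                smallest = j
        # The long-distance swap is what can reorder equal keys.
        a[i], a[smallest] = a[smallest], a[i]
    return a


def stable_selection_sort(items, key):
    """Same algorithm, with the original position as the last-place key."""
    decorated = [(key(x), index, x) for index, x in enumerate(items)]
    ordered = selection_sort(decorated, key=lambda t: (t[0], t[1]))
    return [x for _, _, x in ordered]


records = [(2, "a"), (2, "b"), (1, "c")]
print(selection_sort(records, key=lambda r: r[0]))
print(stable_selection_sort(records, key=lambda r: r[0]))
```

The cost is one extra integer per record and a slightly more expensive comparison.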

It depends on what you do.
Imagine you've got some people records with a first and a last name field. First you sort the list by first name. If you then sort the list by last name with a stable algorithm, you'll have a list sorted by last name, with equal last names ordered by first name.

There are a few reasons why stability can be important. One is that swapping two records that don't need to be swapped causes a memory update: a page is marked dirty and needs to be re-written to disk (or another slow medium).

A sorting algorithm is said to be stable if two objects with equal keys appear in the same order in sorted output as they appear in the input unsorted array. Some sorting algorithms are stable by nature like Insertion sort, Merge Sort, Bubble Sort, etc. And some sorting algorithms are not, like Heap Sort, Quick Sort, etc.
However, any given sorting algo which is not stable can be modified to be stable. There can be sorting algo specific ways to make it stable, but in general, any comparison based sorting algorithm which is not stable by nature can be modified to be stable by changing the key comparison operation so that the comparison of two keys considers position as a factor for objects with equal keys.
References:
http://www.math.uic.edu/~leon/cs-mcs401-s08/handouts/stability.pdf
http://en.wikipedia.org/wiki/Sorting_algorithm#Stability

I know there are many answers for this, but to me, this answer, by Robert Harvey, summarized it much more clearly:
A stable sort is one which preserves the original order of the input set, where the [unstable] algorithm does not distinguish between two or more items.
Source

Some more examples of the reason for wanting stable sorts. Databases are a common example. Take the case of a transaction database that includes last|first name, date|time of purchase, item number, price. Say the database is normally sorted by date|time. Then a query is made to make a sorted copy of the database by last|first name. Since a stable sort preserves the original order, even though the query's comparison only involves last|first name, the transactions for each last|first name will be in date|time order.
A similar example is classic Excel, which limited sorts to 3 columns at a time. To sort 6 columns, a sort is done with the least significant 3 columns, followed by a sort with the most significant 3 columns.
A classic example of a stable radix sort is a card sorter, used to sort by a field of base 10 numeric columns. The cards are sorted from least significant digit to most significant digit. On each pass, a deck of cards is read and separated into 10 different bins according to the digit in that column. Then the 10 bins of cards are put back into the input hopper in order ("0" cards first, "9" cards last). Then another pass is done by the next column, until all columns are sorted. Actual card sorters have more than 10 bins since there are 12 zones on a card, a column can be blank, and there is a mis-read bin. To sort letters, 2 passes per column are needed, 1st pass for digit, 2nd pass for the 12 11 zone.
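The basic ten-bin procedure can be sketched in Python as follows. This is a simplified model with a made-up function name: it handles non-negative integers only and ignores the extra zone, blank, and mis-read bins of a real card sorter.

```python
def card_sorter(numbers, columns):
    """LSD radix sort of non-negative integers with at most `columns` decimal digits."""
    deck = list(numbers)
    for col in range(columns):  # least significant digit first
        bins = [[] for _ in range(10)]
        for card in deck:
            digit = (card // 10 ** col) % 10
            # Appending keeps cards in arrival order within a bin,
            # which is what makes each pass stable.
            bins[digit].append(card)
        # Restack the bins: "0" cards first, "9" cards last.
        deck = [card for one_bin in bins for card in one_bin]
    return deck


print(card_sorter([170, 45, 75, 90, 802, 24, 2, 66], columns=3))
```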
Later (1937) there were card collating (merging) machines that could merge two decks of cards by comparing fields. The input was two already sorted decks of cards, a master deck and an update deck. The collator merged the two decks into a new master bin and an archive bin, which was optionally used for master duplicates so that the new master bin would only have update cards in case of duplicates. This was probably the basis for the idea behind the original (bottom up) merge sort.

If you assume that what you are sorting are just numbers, and only their values identify/distinguish them (i.e. elements with the same value are identical), then the question of sorting stability is meaningless.
However, objects with the same priority in sorting may be distinct, and sometimes their relative order is meaningful information. In this case, an unstable sort causes problems.
For example, suppose you have a list containing the time [T] each player took to clear a maze of level [L] in a game.
Suppose we need to rank the players by how fast they cleared the maze. However, an additional rule applies: players who cleared a higher-level maze always rank higher, no matter how long they took.
Of course, you might try to map the pair [T,L] to a real number [R] with some function that respects the rules, and then rank all players by [R].
However, if a stable sort is available, you can simply sort the entire list by [T] (faster players first) and then by [L]. The relative order of players by time is then preserved within each group of players who cleared the same maze level.
PS: of course sorting twice is not the best solution to this particular problem, but it should be enough to answer the poster's question.
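The two-pass ranking can be sketched in Python like this. The player data is invented for the illustration; the sketch relies on sorted() being stable, and on its documented behaviour that reverse=True still keeps equal elements in their original order.

```python
# (name, time T, level L): sample data made up for this sketch
players = [
    ("ann", 30, 1),
    ("bob", 25, 2),
    ("cat", 40, 2),
    ("dan", 20, 1),
]

# Pass 1: faster players first.
by_time = sorted(players, key=lambda p: p[1])

# Pass 2: higher level first. Players on the same level keep
# the time order from pass 1 because the sort is stable.
ranking = sorted(by_time, key=lambda p: p[2], reverse=True)

print([name for name, _, _ in ranking])
```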

A stable sort always returns the same solution (permutation) for the same input.
For instance, a stable sort orders [2,1,2] with the permutation [2,1,3] (the sorted output takes index 2 first, then index 1, then index 3), so the output is always rearranged the same way. The other permutation, [2,3,1], is not stable but still yields a correctly sorted result.
Quick sort is not a stable sort, and how it permutes equal elements depends on the pivot-selection algorithm. Some implementations pick the pivot at random, so quick sort can yield different permutations of the same input using the same algorithm.
A stable sort algorithm is necessarily deterministic in its output.
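The permutation for [2,1,2] can be checked with a few lines of Python, using the stable built-in sorted() to sort the indices by value (converted to 1-based indices to match the notation above):

```python
values = [2, 1, 2]

# Sort the positions by the value stored there; ties keep input order.
order = sorted(range(len(values)), key=lambda i: values[i])
one_based = [i + 1 for i in order]

print(one_based)
```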

Related

Sorting based on fuzzy criteria OR Create an acceptable order with only n comparisons

I'm looking for an algorithm to sort a large number of items using the fewest comparisons. My specific case makes it unclear which of the obvious approaches is appropriate: the comparison function is slow and non-deterministic so it can make errors, because it's a human brain.
In other words, I want to sort arbitrary items on my computer into a list from "best" to "worst" by comparing them two at a time. They could be images, strings, songs, anything. My program would display two things for me to compare. The program doesn't know anything about what is being compared, its job is just to decide which pairs to compare. So that gives the following criteria
It's a comparison sort - The only time the user sees items is when comparing two of them.
It's an out-of-place sort - I don't want to move the actual files, so items can have placeholder values or metadata files
Comparisons are slow - at least compared to a computer. Data locality won't have an effect, but comparing obvious disparities will be quick, similar items will be slow.
Comparison is subjective - comparison results could vary slightly at different times.
Items don't have a total order - the desired outcome is an order that is "good enough" at runtime, which will vary depending on context.
Items will rarely be almost sorted - in fact, the goal is to get random data to an almost-sorted state.
Sets usually will contain runs - If every song on an album is a banger, it might be faster because of (2) to compare them to the next album rather than each other. Imagine a set {10.0, 10.2, 10.9, 5.0, 4.2, 6.9} where integer comparisons are fast but float comparisons are very slow.
There are many different ways to approach this problem. In addition to sorting algorithms, it's similar to creating tournament brackets, and voting systems. As that table illustrates, there are countless ways to define and solve the problem based on various criteria. For this question I'm only interested in treating it as a sorting problem where the user is comparing two items at a time and choosing a preference. So what approach makes sense for either of the two following versions of the question?
How to choose pairs to get the best result in O(n) or fewer operations? (for example compare random pairs of items with n/2 operations, then use n/2 operations to spot check or fine-tune)
How to create the best order with additional operations but no additional comparisons (e.g. similar items are sorted into buckets or losers are removed, anything that doesn't increase the number of comparisons)
The representation of comparison results can be anything that makes the solution convenient - it can be dictionary keys corresponding to the final order, a "score" based on number of comparisons, a database, etc.
Edit: The comments have helped clarify the question in that the goal is similar to something like bucket sort, samplesort or the partitioning phase of quicksort. So the question could be rephrased as how to choose good partitions based on comparisons, but I'm also interested in any other ways of using the comparison results that wouldn't be applicable in a standard in-place comparison sort like keeping a score for each item.

Sort in ascending or descending order (chosen arbitrarily; Prefer whichever is cheaper)

I have an array of elements. This array could be:
Randomly shuffled (about 20% of the time)
Nearly sorted* in ascending order (about 40% of the time)
Nearly sorted in descending order (about 40% of the time)
But I do not know (in advance) which of these cases applies. I would prefer to sort the array into the order which it is already close to.
It does not matter whether the output is ascending or descending, but it must be one or the other (so I can perform a binary search on it.)
The sort need not be stable.
Some background info: The process goes roughly like this:
Populate the array
Sort on some attribute A
Do some processing (compute quantiles, and some other minor stuff)
Sort on some other attribute B
Do more processing
Sort on attribute C
Do more processing
A and B are often correlated with each other (but may be positively or negatively.) Same applies to B and C. Occasionally A == C.
* "nearly sorted" here means most elements are close to their final positions. But rarely exactly at their final positions (there is a lot of additive noise, and not many long sorted subsequences.) Still, there are usually a few "outliers" at the start and end of the array which are poor predictors of the order for the next sort. 
Is there an algorithm that can take advantage of the fact that I have no preference for ascending vs. descending, to sort more cheaply (compared to the TimSort I am currently using)?
I'd continue using Timsort (however, a good alternative is Smoothsort*), but first probe the array to decide whether to sort in ascending or descending order. Look at the first and last elements and sort accordingly. If the array is unsorted, the choice is immaterial; if it is (partially) sorted, probing at a wide interval is more likely to correctly detect which way.
*Smoothsort has the same best, average, and worst case time as Timsort, and better space complexity. Like Timsort, it was specifically designed to take advantage of partially sorted data.
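A minimal Python sketch of the probing idea, using list.sort (which is Timsort in CPython). The function name is made up, and the probe only looks at the first and last elements as suggested; if the ends are known to be outliers, the probe positions would need adjusting.

```python
def sort_toward_nearest_order(a):
    """Sort the list in place, ascending or descending, based on a first/last probe.

    Returns True if the list was sorted in descending order.
    """
    descending = len(a) > 1 and a[0] > a[-1]
    a.sort(reverse=descending)
    return descending


nearly_descending = [9, 7, 8, 3, 1]
print(sort_toward_nearest_order(nearly_descending), nearly_descending)

nearly_ascending = [1, 3, 2, 5]
print(sort_toward_nearest_order(nearly_ascending), nearly_ascending)
```

The caller keeps the returned flag so that a later binary search knows which direction the array is ordered in.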
Another possibility to consider:
Start doing a (hand-rolled) insertion sort
As you go, count the number of inversions you perform
After you have done some small fixed number of insertions, compare the number of inversions that you have counted, to the maximum number of inversions that would have occurred by that point if the data were reverse-sorted to begin with:
If the proportion is close to 0, then (probably) the data is nearly-sorted. Complete the insertion sort, which performs very well on nearly-sorted data. If you don't like the sound of "probably" then continue counting inversions as you go and be ready to fall back to Timsort if the proportion rises above a threshold.
If the proportion is close to 1, then (probably) the data is nearly-reverse-sorted, and you have a small number of sorted elements at the start. Move them to the end, reverse them, and complete an insertion sort with reversed comparator.
Otherwise the data is random, use your favourite sorting algorithm. I'd say Timsort, but since that does well on nearly-sorted data there must be some other algorithm that does at least a tiny bit better than Timsort does on uniformly-shuffled data. Probably plain merge sort without the Tim.
The "small fixed number" can be a number for which insertion sort is fairly fast even in bad cases. I would guess 10-20 or so. It's possible to work out the probability of a false positive in uniformly shuffled data for any given number of insertions and any given threshold of "close to 0/1", but I'm too lazy.
You say the first and last few array elements typically buck the trend, in which case you could exclude them from the initial test insertion sort.
Obviously this approach is somewhat inspired by Timsort. But Timsort is fiendishly optimized for data that contains runs -- I have tried to fiendishly optimize only for data that's close to one big run (in either direction). Another feature of Timsort is that it's well tested, I don't claim to share that.
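The classification step of this approach might look like the following Python sketch. It is a simplification: it insertion-sorts a copy of the prefix rather than the array itself, and the function name, the probe size of 16, and the 0.2/0.8 thresholds are all assumptions chosen for the illustration, not tuned values.

```python
def probe_direction(a, probe_size=16, low=0.2, high=0.8):
    """Classify a list as 'ascending', 'descending' or 'random'.

    Insertion-sorts a copy of the first `probe_size` items and counts
    the shifts performed, which equals the number of inversions.
    """
    prefix = list(a[:probe_size])
    n = len(prefix)
    if n < 2:
        return "ascending"
    inversions = 0
    for i in range(1, n):
        x = prefix[i]
        j = i - 1
        while j >= 0 and prefix[j] > x:
            prefix[j + 1] = prefix[j]  # shift right: one inversion removed
            j -= 1
            inversions += 1
        prefix[j + 1] = x
    # A reverse-sorted prefix has the maximum of n*(n-1)/2 inversions.
    proportion = inversions / (n * (n - 1) / 2)
    if proportion <= low:
        return "ascending"
    if proportion >= high:
        return "descending"
    return "random"


print(probe_direction(list(range(100))))
print(probe_direction(list(range(100, 0, -1))))
```

To skip outliers at the start of the array, the slice could begin a few elements in instead of at index 0.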

Why is an "unstable sort" considered bad

Just wondering if someone could explain why an "unstable sort" is considered bad? Basically I don't see any situations where it would really matter. Could anyone care to provide one?
If you have a GUI that allows people to sort on individual columns by clicking on that column, and you use a stable sort, then people who know can get a multi-column sort on columns A,B,C by clicking on columns C,B,A in that order. Because the sort is stable, when you click on B any records with equal keys under B will still be sorted by C so after clicking on B the records are sorted by B, C. Similarly, after you click on A, the records are sorted by A, B, C.
(Unfortunately, last time I tried this on some Microsoft product or other, it looked like it didn't use a stable sort, so it's not surprising this trick is not better known).
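The click sequence can be simulated in Python, where sorted() is stable. The rows are invented sample data with columns A, B, C; the final check compares the result against a single sort on the composite key (A, B, C).

```python
# Each row is (A, B, C); the values are made up for this sketch.
rows = [
    (1, "x", 3),
    (0, "y", 2),
    (1, "x", 1),
    (0, "x", 2),
    (1, "y", 0),
]

# "Click" column C, then B, then A, each time with a stable sort.
clicked = sorted(rows, key=lambda r: r[2])
clicked = sorted(clicked, key=lambda r: r[1])
clicked = sorted(clicked, key=lambda r: r[0])

# Same result as one sort on the composite key (A, B, C).
print(clicked == sorted(rows, key=lambda r: (r[0], r[1], r[2])))
```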
Imagine that you wanted to organize a deck of cards. You could sort first by suit, then by numeric value. If you used a stable sort, you'd be done. If you used an unstable sort, then they'd be in numeric order, but the suits would be all messed up again. There are lots of equivalent situations that come up in real development problems.
There are just a few cases where you need a sort algorithm that's stable. An example of this is if you're implementing something like a radix sort, which depends on the sorting algorithm used as its building block being stable. (Radix sort can operate in linear time, but its inputs are more restricted than those of comparison sorts, which require O(n lg n) time.)
It's not necessarily that a sort that is unstable is "bad"; it's more that a sort that is stable is "desirable in some cases". That's why programming languages, e.g. C++'s Standard Template Library, provide both -- e.g. std::sort and std::stable_sort -- which allow you to specify when you need stability, and when you don't.
Because they can do better than I could do...from Developer Fusion:
There are two kinds of sort algorithms: "stable sorts" and "unstable sorts". These terms refer to the action that is taken when two values compare as equal. If you have an array T[0..size] with two elements T[n] and T[k] for n < k, and these two elements compare equal, in a stable sort they will appear in the sorted output with the value that was in T[n] preceding T[k]. The output order preserves the original input order. An unstable sort, by contrast, there is no guarantee of the order of these two elements in the output.
Note that sorting algorithms like quick sort are not inherently stable or unstable. The implementation will determine which it is.
In any case, stable is not necessarily better or worse than unstable - it's just that sometimes you need a guarantee about the order of two equal elements. When you do need that guarantee, an unstable sort is not suitable.

Most efficient sorting algorithm for a large set of numbers

I'm working on a large project, I won't bother to summarize it here, but this section of the project is to take a very large document of text (minimum of around 50,000 words (not unique)), and output each unique word in order of most used to least used (probably top three will be "a" "an" and "the").
My question is of course, what would be the best sorting algorithm to use? I was reading of counting sort, and I like it, but my concern is that the range of values will be too large compared to the number of unique words.
Any suggestions?
First, you will need a map of word -> count.
50,000 words is not much - it will easily fit in memory, so there's nothing to worry about. In C++ you can use the standard STL std::map.
Then, once you have the map, you can copy all the map keys to a vector.
Then, sort this vector using a custom comparison operator: instead of comparing the words, compare the counts from the map. (Don't worry about the specific sorting algorithm - your array is not that large, so any standard library sort will work for you.)
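The same steps can be sketched in Python, with collections.Counter standing in for the word -> count map. The function name is made up, and splitting on whitespace is an assumption; real text would need proper tokenizing.

```python
from collections import Counter


def unique_words_by_count(text):
    """Return the unique words of `text`, most used first."""
    counts = Counter(text.split())  # map: word -> count
    words = list(counts)            # copy the map keys into a list
    # Compare the counts from the map instead of the words themselves.
    words.sort(key=lambda w: counts[w], reverse=True)
    return words


print(unique_words_by_count("the cat and the dog and the bird"))
```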
I'd start with a quicksort and go from there.
Check out the wiki page on sorting algorithms, though, to learn the differences.
You should try an MSD radix sort. It will sort your entries in lexicographical order. Here is a google code project you might be interested in.
Have a look at the link. It gives a pictorial representation of how different algorithms work. This will give you a hint!
Sorting Algorithms
You can get better performance than quicksort with this particular problem assuming that if two words occur the same number of times, then it doesn't matter in which order you output them.
First step: Create a hash map with the words as key values and frequency as the associated values. You will fill this hash map in as you parse the file. While you are doing this, make sure to keep track of the highest frequency encountered. This step is O(n) complexity.
Second step: Create a list whose number of entries is one more than the highest frequency from the first step, so that each slot's index corresponds to a frequency. Each slot will hold a list of the words whose frequency count equals the index. So words that occur 3 times in the document will go in list[3] for example. Iterate through the hash map and insert the words into the appropriate spots in the list. This step is O(n) complexity.
Third step: Iterate through the list in reverse and output all the words. This step is O(n) complexity.
Overall this algorithm will accomplish your task in O(n) time rather than the O(n log n) required by quicksort.
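The three steps can be sketched in Python as follows. The function name is made up, and the input is assumed to be an already tokenized list of words.

```python
def words_by_frequency(words):
    """Return unique words ordered from most used to least used."""
    # Step 1: count the words and track the highest frequency seen.
    counts = {}
    highest = 0
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        highest = max(highest, counts[word])

    # Step 2: buckets[f] holds the words that occur exactly f times.
    # One extra slot so that the index can equal the frequency.
    buckets = [[] for _ in range(highest + 1)]
    for word, frequency in counts.items():
        buckets[frequency].append(word)

    # Step 3: walk the buckets from the highest frequency down.
    result = []
    for frequency in range(highest, 0, -1):
        result.extend(buckets[frequency])
    return result


print(words_by_frequency("the cat and the dog and the bird".split()))
```

Words with the same count come out in the order they were first seen, since Python dicts preserve insertion order.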
In almost every case I've ever tested, Quicksort worked the best for me. However, I did have two cases where Combsort was the best. Could have been that combsort was better in those cases because the code was so small, or due to some quirk in how ordered the data was.
Any time sorting shows up in my profile, I try the major sorts. I've never had anything that topped both Quicksort and Combsort.
I think you want to do something as explained in the below post:
http://karephul.blogspot.com/2008/12/groovy-closures.html
Languages which support closures make the solution much easier, like LINQ as Eric mentioned.
For large sets you can use what is known as the "sort based indexing" in information retrieval, but for 50,000 words you can use the following:
read the entire file into a buffer.
parse the buffer and build a token vector with
struct token { char *term; int termlen; };
term is a pointer to the word in the buffer.
sort the table by term (lexicographical order).
set entrynum = 0, iterate through the term vector,
when term is new, store it in a vector :
struct { char *term; int frequency; } at index entrynum, set frequency to 1 and increment the entry number, otherwise increment frequency.
sort the vector by frequency in descending order.
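A Python sketch of the same sort-then-count procedure. It uses strings and lists rather than pointers into a buffer, the function name is made up, and whitespace splitting stands in for the parsing step.

```python
def count_by_sorting(text):
    """Return (term, frequency) pairs, most frequent first."""
    # Sort the tokens so equal terms sit next to each other.
    tokens = sorted(text.split())

    # Walk the sorted tokens: start a new entry when the term
    # changes, otherwise bump the frequency of the current entry.
    entries = []
    for term in tokens:
        if entries and entries[-1][0] == term:
            entries[-1][1] += 1
        else:
            entries.append([term, 1])

    # Sort the entries by frequency in descending order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return [(term, frequency) for term, frequency in entries]


print(count_by_sorting("b a b c a b"))
```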
You can also try implementing a digital tree, also known as a trie. Here is the link
