Is F# ResizeArray sortBy stable?

Does the function ResizeArray.sortBy do a stable sort, in the sense that it does not change the order of elements which have the same value for the key function?
And if not, how to write a stable sort in F#?

The answer to your question is: no, it is unstable.
First, ResizeArray.sortBy is implemented as:
module ResizeArray =
    let sortBy f (arr: ResizeArray<'T>) =
        arr.Sort (System.Comparison(fun x y -> compare (f x) (f y)))
And ResizeArray is an alias for the .NET List<'T> collection:
type ResizeArray<'T> = System.Collections.Generic.List<'T> // alias
Now let's look at the List documentation:
This method uses Array.Sort, which uses the QuickSort algorithm. This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved. In contrast, a stable sort preserves the order of elements that are equal.
So: unstable. If you want a stable sort, you can implement a merge sort, or a carefully written quicksort; note, however, that stable variants of quicksort are less efficient.
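A portable alternative is the decorate-with-index trick: tag each element with its original position and use the position as a tie-breaker, which makes any comparison sort behave stably. A minimal sketch in Haskell (the language most of this page uses; the same idea applies directly to ResizeArray.sortBy in F#):

import Data.List (sortBy)
import Data.Ord (comparing)

-- Decorate with the original index, sort on (key, index), then undecorate.
-- The index tie-break forces stability even if the underlying sort isn't.
stableSortOn :: Ord k => (a -> k) -> [a] -> [a]
stableSortOn key xs =
  map snd (sortBy (comparing (\(i, x) -> (key x, i))) (zip [0 :: Int ..] xs))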

Seq.sort is stable.


Why does Haskell use mergesort instead of quicksort?

In Wikibooks' Haskell, there is the following claim:
Data.List offers a sort function for sorting lists. It does not use quicksort; rather, it uses an efficient implementation of an algorithm called mergesort.
What is the underlying reason in Haskell to use mergesort over quicksort? Quicksort usually has better practical performance, but maybe not in this case. I gather that the in-place benefits of quicksort are hard (impossible?) to do with Haskell lists.
There was a related question on softwareengineering.SE, but it wasn't really about why mergesort is used.
I implemented the two sorts myself for profiling. Mergesort was superior (around twice as fast for a list of 2^20 elements), but I'm not sure that my implementation of quicksort was optimal.
Edit: Here are my implementations of mergesort and quicksort:
mergesort :: Ord a => [a] -> [a]
mergesort [] = []
mergesort [x] = [x]
mergesort l = merge (mergesort left) (mergesort right)
  where size = div (length l) 2
        (left, right) = splitAt size l

merge :: Ord a => [a] -> [a] -> [a]
merge ls [] = ls
merge [] vs = vs
merge first@(l:ls) second@(v:vs)
  | l < v = l : merge ls second
  | otherwise = v : merge first vs

quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort [x] = [x]
quicksort l = quicksort less ++ pivot : quicksort greater
  where pivotIndex = div (length l) 2
        pivot = l !! pivotIndex
        [less, greater] = foldl addElem [[], []] $ enumerate l
        addElem [less, greater] (index, elem)
          | index == pivotIndex = [less, greater]
          | elem < pivot = [elem:less, greater]
          | otherwise = [less, elem:greater]

enumerate :: [a] -> [(Int, a)]
enumerate = zip [0..]
Edit 2/3: I was asked to provide timings for my implementations versus the sort in Data.List. Following @Will Ness' suggestions, I compiled this gist with the -O2 flag, changing the supplied sort in main each time, and executed it with +RTS -s. The list being sorted was a cheaply created, pseudorandom [Int] list with 2^20 elements. The results were as follows:
Data.List.sort: 0.171s
mergesort: 1.092s (~6x slower than Data.List.sort)
quicksort: 1.152s (~7x slower than Data.List.sort)
In imperative languages, Quicksort is performed in-place by mutating an array. As you demonstrate in your code sample, you can adapt Quicksort to a pure functional language like Haskell by building singly-linked lists instead, but this is not as fast.
On the other hand, Mergesort is not an in-place algorithm: a straightforward imperative implementation copies the merged data to a different allocation. This is a better fit for Haskell, which by its nature must copy the data anyway.
Let's step back a bit: Quicksort's performance edge is "lore" -- a reputation built up decades ago on machines much different from the ones we use today. Even if you use the same language, this kind of lore needs rechecking from time to time, as the facts on the ground can change. The last benchmarking paper I read on this topic had Quicksort still on top, but its lead over Mergesort was slim, even in C/C++.
Mergesort has other advantages: it doesn't need to be tweaked to avoid Quicksort's O(n^2) worst case, and it is naturally stable. So, if you lose the narrow performance difference due to other factors, Mergesort is an obvious choice.
I think @comingstorm's answer is pretty much on the nose, but here's some more info on the history of GHC's sort function.
In the source code for Data.OldList, you can find the implementation of sort and verify for yourself that it's a merge sort. Just below the definition in that file is the following comment:
Quicksort replaced by mergesort, 14/5/2002.
From: Ian Lynagh <igloo@earth.li>
I am curious as to why the List.sort implementation in GHC is a quicksort algorithm rather than an algorithm that guarantees n log n time in the worst case? I have attached a mergesort implementation along with a few scripts to time its performance...
So, originally a functional quicksort was used (and the function qsort is still there, but commented out). Ian's benchmarks showed that his mergesort was competitive with quicksort in the "random list" case and massively outperformed it in the case of already sorted data. Later, Ian's version was replaced by another implementation that was about twice as fast, according to additional comments in that file.
The main issue with the original qsort was that it didn't use a random pivot. Instead it pivoted on the first value in the list. This is obviously pretty bad because it implies performance will be worst case (or close) for sorted (or nearly sorted) input. Unfortunately, there are a couple of challenges in switching from "pivot on first" to an alternative (either random, or -- as in your implementation -- somewhere in "the middle"). In a functional language without side effects, managing a pseudorandom input is a bit of a problem, but let's say you solve that (maybe by building a random number generator into your sort function). You still have the problem that, when sorting an immutable linked list, locating an arbitrary pivot and then partitioning based on it will involve multiple list traversals and sublist copies.
I think the only way to realize the supposed benefits of quicksort would be to write the list out to a vector, sort it in place (and sacrifice sort stability), and write it back out to a list. I don't see that that could ever be an overall win. On the other hand, if you already have data in a vector, then an in-place quicksort would definitely be a reasonable option.
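For what it's worth, that list-to-vector round trip is easy to express with the vector and vector-algorithms packages (a sketch, not part of the original answer; Intro.sort is an in-place introsort, i.e. a quicksort variant with a heapsort fallback):

import qualified Data.Vector as V
import qualified Data.Vector.Algorithms.Intro as Intro

-- Copy the list into a vector, sort the vector in place, copy it back out.
-- Stability is sacrificed, exactly as the answer notes.
sortViaVector :: Ord a => [a] -> [a]
sortViaVector = V.toList . V.modify Intro.sort . V.fromList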
On a singly-linked list, mergesort can be done in place. What's more, naive implementations scan over half the list to find the start of the second sublist, but the start of the second sublist falls out as a side effect of sorting the first sublist and does not need an extra scan. The one thing quicksort has over mergesort is cache coherency: quicksort works with elements close to each other in memory. As soon as a layer of indirection enters into it, as when you are sorting arrays of pointers instead of the data itself, that advantage diminishes.
Mergesort has hard guarantees for worst-case behavior, and it's easy to do stable sorting with it.
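A sketch of the no-extra-scan midpoint trick mentioned above (my own illustration, not from the answer): sorting the first half returns the unsorted remainder of the list as a by-product, so the second sublist never has to be located by a separate traversal.

msortNoScan :: Ord a => [a] -> [a]
msortNoScan xs = fst (go (length xs) xs)
  where
    -- go n ys sorts the first n elements of ys and also returns the
    -- untouched suffix, which becomes the input for the second half.
    go 0 rest = ([], rest)
    go 1 (y:rest) = ([y], rest)
    go n ys = (merge l r, rest')
      where
        half = n `div` 2
        (l, rest)  = go half ys
        (r, rest') = go (n - half) rest
    -- Standard stable merge: take from the left list on ties.
    merge as@(a:as') bs@(b:bs')
      | b < a     = b : merge as bs'
      | otherwise = a : merge as' bs
    merge [] bs = bs
    merge as [] = as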
Short answer:
Quicksort is advantageous for arrays (in-place, fast, but not worst-case optimal). Mergesort for linked lists (fast, worst-case optimal, stable, simple).
Quicksort is slow for lists, Mergesort is not in-place for arrays.
Many arguments on why Quicksort is not used in Haskell seem plausible. However, Quicksort is not slower than Mergesort, at least for the random case. Based on the implementation given in Richard Bird's book, Thinking Functionally with Haskell, I made a 3-way Quicksort:
tqsort [] = []
tqsort (x:xs) = sortp xs [] [x] []
  where
    sortp [] us ws vs = tqsort us ++ ws ++ tqsort vs
    sortp (y:ys) us ws vs =
      case compare y x of
        LT -> sortp ys (y:us) ws vs
        GT -> sortp ys us ws (y:vs)
        _  -> sortp ys us (y:ws) vs
I benchmarked a few cases, e.g., lists of size 10^4 containing Ints between 0 and 10^3 or 10^4, and so on. The result is that the 3-way Quicksort (and even Bird's version) beats GHC's Mergesort, running roughly 1.x to 3.x faster depending on the type of data (many repetitions? very sparse?). The following stats were generated by criterion:
benchmarking Data.List.sort/Diverse/10^5
time 223.0 ms (217.0 ms .. 228.8 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 226.4 ms (224.5 ms .. 228.3 ms)
std dev 2.591 ms (1.824 ms .. 3.354 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking 3-way Quicksort/Diverse/10^5
time 91.45 ms (86.13 ms .. 98.14 ms)
0.996 R² (0.993 R² .. 0.999 R²)
mean 96.65 ms (94.48 ms .. 98.91 ms)
std dev 3.665 ms (2.775 ms .. 4.554 ms)
However, there is another requirement on sort stated in Haskell 98/2010: it needs to be stable. The typical Quicksort implementation using Data.List.partition is stable, but the one above isn't.
Later addition: A stable 3-way Quicksort mentioned in the comments seems to be as fast as tqsort here.
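For reference, the stable partition-based variant mentioned above could look like this (a sketch; Data.List.partition preserves relative order, which is what makes the result stable):

import Data.List (partition)

qsortStable :: Ord a => [a] -> [a]
qsortStable [] = []
qsortStable (x:xs) = qsortStable less ++ (x : equal) ++ qsortStable greater
  where
    -- partition is stable, so each group keeps the input order; elements
    -- equal to the pivot keep x (the earliest occurrence) in front.
    (less, notLess)  = partition (< x) xs
    (equal, greater) = partition (== x) notLess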
I am not sure, but looking at the code I don't think Data.List.sort is Mergesort as we know it. It makes a single pass, starting with the sequences function, in a beautiful mutually recursive fashion with the ascending and descending functions, producing a list of already ascending- or descending-ordered chunks in the required order. Only then does it start merging.
It's a manifestation of poetry in coding. Unlike Quicksort, its worst case (totally random input) has O(n log n) time complexity, and its best case (input already sorted ascending or descending) is O(n).
I don't think any other sorting algorithm can beat it.
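To make that description concrete, here is a condensed sketch of the runs-then-merge strategy, modeled on the implementation in the GHC source (simplified, so treat it as illustration rather than the exact library code):

sortRuns :: Ord a => [a] -> [a]
sortRuns = mergeAll . sequences
  where
    -- Cut the input into maximal ordered runs; descending runs are
    -- reversed on the fly, so sorted input (either way) is one run.
    sequences (a:b:xs)
      | a > b     = descending b [a] xs
      | otherwise = ascending b (a :) xs
    sequences xs = [xs]

    descending a as (b:bs)
      | a > b = descending b (a:as) bs
    descending a as bs = (a:as) : sequences bs

    ascending a as (b:bs)
      | a <= b = ascending b (\ys -> as (a:ys)) bs
    ascending a as bs = as [a] : sequences bs

    -- Merge runs pairwise until a single run remains.
    mergeAll [x] = x
    mergeAll xs  = mergeAll (mergePairs xs)

    mergePairs (a:b:xs) = merge a b : mergePairs xs
    mergePairs xs       = xs

    merge as@(a:as') bs@(b:bs')
      | a > b     = b : merge as bs'
      | otherwise = a : merge as' bs
    merge [] bs = bs
    merge as [] = as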

thenComparing vs sort

Are there any differences (e.g. performance, ordering) between the two versions?
version 1:
mylist.sort(myComparator.sort_item);
mylist.sort(myComparator.sort_post);
version 2:
// java 8
mylist.sort(myComparator.sort_item.thenComparing(myComparator.sort_post));
Version 1: You sort by item, then sort the result again by post. Because List.sort is stable, the item order survives only as a tie-breaker among equal posts, so you effectively end up sorted by post first, which is the opposite priority of version 2.
Version 2: You are sorting first by item, and in the event of a tie, breaking that tie using post.
From the Java 8 API documentation:
[thenComparing] Returns a lexicographic-order comparator with another comparator. If this Comparator considers two elements equal, i.e. compare(a, b) == 0, other is used to determine the order.
That means the second comparator is only used if the first one returns 0 (the elements are equal). So in practice it should be faster in most cases than calling sort twice.
In theory it makes no asymptotic difference: if one sort costs C, calling it twice still costs O(C) (constant factors don't change the complexity class), so both approaches have the same time complexity.
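As a side note in Haskell (the language most of this page uses), the same "second key only on ties" semantics falls out of the Semigroup instance for Ordering, since (<>) keeps the first non-EQ result; the record and field names here are made up for illustration:

import Data.List (sortBy)
import Data.Ord (comparing)

-- Hypothetical stand-in for the question's list elements.
data Entry = Entry { item :: Int, post :: Int } deriving Show

-- The second comparator is consulted only when the first returns EQ,
-- just like Java's thenComparing.
sortItemThenPost :: [Entry] -> [Entry]
sortItemThenPost = sortBy (comparing item <> comparing post)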

Possible to do quicksort without splitting into separate lists?

In many quicksort algorithms, the implementation involves placing the elements from each array into three groups: (less, pivot, more), and sometimes placing the groups back together. What if I do not want to do this? Is there a simpler approach to sorting a list with quicksort manually?
Basically, I plan to keep the array as one and swap elements around a partition (for example, given a list x and pivot position r, we could use the reference sublists x[0:r] and x[r:len(x)]). However, as the sorting continues, how do I keep referencing each smaller "subarray"?
So this is my code, but I'm not sure how to continue from here:
x = [4,7,4,2,4,6,5]
# r is the pivot POSITION
r = len(x)-1
i = -1
for a in range(0,r+1):
    if x[a] <= x[r]:
        i += 1
        x[i], x[a] = x[a], x[i]
You can implement quicksort purely by swapping the locations of items in a list, rather than actually creating new lists.
But unless this is some sort of homework assignment, the best option is generally to use Python's built-in sort() method. (Strictly speaking, it uses Timsort, a stable hybrid of mergesort and insertion sort, rather than quicksort.)
There's something not right here. You need two definitions: one for the partition step and one for the quicksort process itself. The quicksort then needs some form of recursion (or an explicit loop), so that it keeps applying the partition to subarrays of the array. Go and check the Wikipedia article to understand how this works.
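To sketch that structure concretely (in Haskell on a mutable vector, since that is the language most of this page uses; in Python you would likewise pass lo and hi to a recursive function instead of slicing out new lists):

import Control.Monad.ST (ST)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as M

-- In-place quicksort: every recursive call sees only an index range
-- (lo, hi), so each smaller "subarray" is just a pair of indices.
quicksortVec :: Ord a => V.Vector a -> V.Vector a
quicksortVec = V.modify (\v -> go v 0 (M.length v - 1))
  where
    go v lo hi
      | lo >= hi  = pure ()
      | otherwise = do
          p <- lomuto v lo hi
          go v lo (p - 1)
          go v (p + 1) hi

    -- Lomuto partition with the last element as pivot, mirroring the
    -- question's single left-to-right pass; returns the pivot's index.
    lomuto :: Ord a => M.MVector s a -> Int -> Int -> ST s Int
    lomuto v lo hi = do
      pivot <- M.read v hi
      let loop i j
            | j >= hi   = M.swap v i hi >> pure i
            | otherwise = do
                xj <- M.read v j
                if xj <= pivot
                  then M.swap v i j >> loop (i + 1) (j + 1)
                  else loop i (j + 1)
      loop lo lo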

Why doesn't Haskell provide folds for one-dimensional Arrays?

Data.Array doesn't provide folds for the Array type.
In Real World Haskell (ch. 12), the reason is said to be that Arrays could be folded in different ways based on the programmer's needs:
First of all, there are several kinds of folds that make sense. We might still want to fold over single elements, but we now have the possibility of folding over rows or columns, too. On top of this, for element-at-a-time folding, there are no longer just two sequences for traversal.
Isn't this exactly true of Lists? It's very common to represent e.g. a matrix with a multidimensional List, but there are still folds defined for one-dimensional Lists.
What's the subtlety I'm missing? Is it that a multidimensional Array is sufficiently different from an Array of Arrays?
Edit: Hm, even multidimensional arrays do have folds defined, in the form of instances of Data.Foldable.[0] So how does that fit in with the Real World Haskell quote?
[0] http://hackage.haskell.org/packages/archive/base/4.6.0.0/doc/html/Data-Foldable.html
Since you mention the difference between a "multidimensional" Array and an Array of Arrays, that will illustrate the point nicely, alongside a comparison with lists.
A fold (in the Foldable class sense) is an inherently linear operation, just as lists are an inherently linear structure; a right fold fully characterizes a list by matching its constructors one-for-one with the arguments to foldr. While you can define functions like foldl as well, there's a clear choice of a standard, canonical fold.
Array has no such transparent structure that can be matched one-for-one in a fold. It's an abstract type, with access to individual elements provided by index values, which can be of any type that has an Ix instance. So not only is there no single obvious choice for implementing a fold, there is also no intrinsic linear structure. It so happens that Ix lets you enumerate a range of indices, but this is more an implementation detail than anything else.
What about multidimensional Arrays? They don't really exist, as such. Ix defines instances for tuples of types that are also instances, and if you want to think of such tuples as an index type for a "multidimensional" Array, go ahead! But they're still just tuples. Obviously, Ix puts some linear order on those tuples, but what is it? Can you find anything in the documentation that tells you?
So, I think we can safely say that folding a multidimensional Array using the order defined by Ix is unwise unless you don't really care what order you get the elements in.
For an Array of Arrays, on the other hand, there's only one sensible way to combine them, much like nested lists: fold each inner Array separately according to their own order of elements, then fold the result of each according to the outer Array's order of elements.
Now, you might reasonably object that since there's no type distinction between one-dimensional and multidimensional Arrays, and the former can be assumed to have a sensible fold ordering based on the Ix instance, why not just use that ordering by default? There's already a function that returns the elements of an Array in a list, after all.
As it turns out, the library itself would agree with you, because that's exactly what the Foldable instance does.
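A quick sanity check of that claim (a small example of my own; Array comes from the array package):

import Data.Array (Array, elems, listArray)

-- A "two-dimensional" array, i.e. an array indexed by pairs.
grid :: Array (Int, Int) Char
grid = listArray ((0, 0), (1, 2)) "abcdef"

main :: IO ()
main = do
  print (elems grid)        -- "abcdef": the elements in Ix order
  print (foldr (:) [] grid) -- "abcdef": the Foldable instance agrees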
There is one natural way to fold lists, which is foldr. Note the types of the list constructors:
(:) :: a -> [a] -> [a]
[] :: [a]
Replacing the occurrences of [a] with b, we get these types:
f :: a -> b -> b
z :: b
And now, of course, the type of foldr is based on this principle:
foldr :: (a -> b -> b) -> b -> [a] -> b
So given the construction/observation semantics of lists, foldr is the one that's most natural. You can read the type as "tell me what to do with a (:) and what to do with a [], and I'll get rid of a list for you."
Array doesn't have this property; you build an array from an association list (Ix i => [(i,a)]), and the type doesn't really expose any recursive structure: one array is not built from other arrays through a recursive constructor as a list or tree would be.
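A tiny illustration of that construction (the names are mine): the array is specified by bounds plus an association list, and nothing in the type exposes a head/tail-style constructor for a fold to peel off.

import Data.Array (Array, array)

squares :: Array Int Int
squares = array (1, 5) [(i, i * i) | i <- [1 .. 5]]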

Union of inverted lists

Given k sorted inverted lists, I want an efficient algorithm to get the union of these k lists.
Each inverted list is a read-only array in memory; each list contains integers in sorted order.
The result will be saved in a predefined array that is large enough. Is there any algorithm better than a k-way merge?
K-way merge is optimal. It takes O(n log k) operations [where n is the number of elements in all lists combined].
It is easy to see that it cannot be done better: as @jpalecek mentioned, otherwise you could sort any array faster than O(n log n) by splitting it into chunks [inverted indexes] of size 1.
Note: This answer assumes it is important that inverted indexes [the resulting array] will be sorted. This assumption is true for most applications that use inverted indexes, especially in the Information-Retrieval area. This feature [sorted indexes] allows elegant and quick intersection of indexes.
Note that the standard k-way merge allows duplicates; you will have to make sure that if an element appears in two lists, it is added only once [easy to do by simply checking the last element in the target array before adding].
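For concreteness, here is a duplicate-free k-way merge sketched in Haskell (my own sketch, not from the answer; merging the lists pairwise in rounds also yields the O(n log k) bound, since there are log k rounds of O(n) work):

-- Merge two ascending lists, emitting each shared element only once.
-- Assumes each input list is itself duplicate-free, as inverted lists
-- usually are.
mergeUnique :: Ord a => [a] -> [a] -> [a]
mergeUnique xs [] = xs
mergeUnique [] ys = ys
mergeUnique xs@(x:xs') ys@(y:ys')
  | x < y     = x : mergeUnique xs' ys
  | x > y     = y : mergeUnique xs  ys'
  | otherwise = x : mergeUnique xs' ys'

-- Union of k sorted lists: merge pairs until one list remains.
unionAll :: Ord a => [[a]] -> [a]
unionAll []  = []
unionAll [l] = l
unionAll ls  = unionAll (pairUp ls)
  where
    pairUp (a:b:rest) = mergeUnique a b : pairUp rest
    pairUp rest       = rest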
If you don't need the resulting array to be sorted, the best approach would be using a hash table to mark which of the elements you have seen. This way, you can get O(n) (n being the total number of elements) time complexity.
Something along the lines of (Perl):
my %seen;
@merged = grep { exists $seen{$_} ? 0 : ($seen{$_} = 1) } (map { (@$_) } @inputs);
