Sort an array of tuples by all elements - algorithm

I would like to sort an array of tuples by all elements (as if they were stored in a trie). If the input is (1,2,5), (1,2,3), (1,1,4), (2,8,9), the corresponding output would be (1,1,4), (1,2,3), (1,2,5), (2,8,9). The corresponding trie would be:
       root
      /    \
     1      2
    / \     |
   1   2    8
   |  / \   |
   4  3  5  9
I was thinking about using a search tree for each position in the tuples. There is also the obvious naive way (sort by first position, then sort by second position, etc.). Does anybody see a better way?

The trie-based approach that you have outlined above is extremely similar to doing a most-significant digit radix sort on the tuples. You essentially are distributing them into buckets based on their first digit, then recursively subdividing the buckets into smaller groups based on the remaining digits. You might want to consider explicitly performing the MSD radix sort rather than building the trie, since tries can be memory-inefficient when the data are sparse, while MSD radix sort has reasonably good memory usage (especially if you implement everything implicitly).
In the example you gave above, all of the numbers in the tuples were single digits. If this is the case, you can have at most 10 × 10 × 10 = 1000 possible distinct tuples, which isn't very large. In that case, you might want to consider just using a standard sorting algorithm with a custom comparator, since the benefits of a more optimized sort probably won't be all that apparent at that scale. On the other hand, if your tuples have many more entries in them, then it might be worth investing in a more clever sort, like MSD radix sort.
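For illustration, here is a minimal sketch of that MSD-radix-sort idea on triples of small non-negative integers (the RADIX of 10, the fixed tuple length of 3, and the helper name are just assumptions for the example, not a reference implementation):
#include <array>
#include <cstddef>
#include <iostream>
#include <vector>

using Tuple = std::array<int, 3>;
constexpr int RADIX = 10;  // assumed: every tuple element is in [0, RADIX)

// Sort tuples[lo, hi) lexicographically by bucketing on position pos,
// then recursing on each bucket with the next position.
void msd_radix_sort(std::vector<Tuple>& tuples, std::size_t lo, std::size_t hi, std::size_t pos) {
    if (hi - lo <= 1 || pos >= 3) return;
    std::vector<std::vector<Tuple>> buckets(RADIX);
    for (std::size_t i = lo; i < hi; ++i)
        buckets[tuples[i][pos]].push_back(tuples[i]);
    std::size_t out = lo;
    for (const auto& bucket : buckets) {
        std::size_t start = out;
        for (const auto& t : bucket) tuples[out++] = t;
        msd_radix_sort(tuples, start, out, pos + 1);
    }
}

int main() {
    std::vector<Tuple> v = {{1,2,5}, {1,2,3}, {1,1,4}, {2,8,9}};
    msd_radix_sort(v, 0, v.size(), 0);
    for (const auto& t : v) std::cout << t[0] << ',' << t[1] << ',' << t[2] << '\n';
    // prints 1,1,4  then 1,2,3  then 1,2,5  then 2,8,9
}
For the single-digit case, note that plain std::sort(v.begin(), v.end()) on std::array values already compares lexicographically, which is the "standard sort with a comparator" route mentioned above.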
Hope this helps!

Radix sort is to a trie as merge sort is to a binary tree.

How about keeping things simple and treating each tuple as a single number whose digits are the tuple's elements in some base,
say 10.
Then we have (1,2,5) as 125,
and so on, and you can just sort those numbers with any simple comparison sort, like heap sort.
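A small sketch of that idea in C++ (note it matches lexicographic order only when every element is smaller than the chosen base, and the packed key has to fit in the integer type):
#include <algorithm>
#include <array>
#include <iostream>
#include <vector>

int main() {
    std::vector<std::array<int, 3>> tuples = {{1,2,5}, {1,2,3}, {1,1,4}, {2,8,9}};
    const long long base = 10;  // assumed: every element is in [0, base)

    // Compare by the positional encoding: (1,2,5) -> 1*100 + 2*10 + 5 = 125.
    std::sort(tuples.begin(), tuples.end(),
              [base](const std::array<int, 3>& a, const std::array<int, 3>& b) {
                  long long ka = 0, kb = 0;
                  for (int i = 0; i < 3; ++i) { ka = ka * base + a[i]; kb = kb * base + b[i]; }
                  return ka < kb;
              });

    for (const auto& t : tuples) std::cout << t[0] << ',' << t[1] << ',' << t[2] << '\n';
}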

Related

What are the stability "factors" of sorting?

Lately, I have been learning about various methods of sorting and a lot of them are unstable, e.g. selection sort, quick sort, heap sort.
My question is: What are the general factors that make sorting unstable?
Most of the efficient sorting algorithms are efficient because they move data over long distances, i.e. each move takes an element much closer to its final position. This efficiency is what typically costs them stability.
For example, when you do a simple sort like bubble sort, you compare and swap neighboring elements. In this case, it is easy to not move the elements if they are already in the correct order. But in the case of quicksort, the partitioning process might choose to move elements so that the number of swaps is minimal. For example, if you partition the list below around the value 2, the most efficient way is to swap the 1st element with the 4th element and the 2nd element with the 5th element:
2 3 1 1 1 4
1 1 1 2 3 4
If you notice, we have now changed the relative order of the 1's in the list, which makes the sort unstable.
So to sum it up, some algorithms are very suitable for stable sorting (like bubble-sort), whereas some others like quick sort can be made stable by carefully selecting a partitioning algorithm, albeit at the cost of efficiency or complexity or both.
We usually classify the algorithm to be stable or not based on the most "natural" implementation of it.
A sorting algorithm is stable when it uses the original order of elements to break ties in the new ordering. For example, let's say you have records of (name, age) and you want to sort them by age.
If you use a stable sort on (Matt, 50), (Bob, 20), (Alice, 50), then you will get (Bob, 20), (Matt, 50), (Alice, 50). The Matt and Alice records have equal ages, so they are equal according to the sorting criteria. The stable sort preserves their original relative order -- Matt came before Alice in the original list, so Matt comes before Alice in the output.
If you use an unstable sort on the same list, you might get (Bob, 20), (Matt, 50), (Alice, 50) or you might get (Bob, 20), (Alice, 50), (Matt, 50). Elements that compare equal will be grouped together but can come out in any order.
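One way to see the difference in C++, using the example records above (std::stable_sort guarantees the stable behaviour, plain std::sort does not):
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Person { std::string name; int age; };

int main() {
    std::vector<Person> people = {{"Matt", 50}, {"Bob", 20}, {"Alice", 50}};
    auto by_age = [](const Person& a, const Person& b) { return a.age < b.age; };

    std::vector<Person> stable = people;
    std::stable_sort(stable.begin(), stable.end(), by_age);  // ties keep input order: Bob, Matt, Alice

    std::vector<Person> unstable = people;
    std::sort(unstable.begin(), unstable.end(), by_age);     // ties may come out in either order

    for (const auto& p : stable) std::cout << p.name << ' ' << p.age << '\n';
}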
It's often handy to have a stable sort, but a stable sort implementation has to remember information about the original order of the elements while it's reordering them.
In-place array sorting algorithms are designed not to use any extra space to store this kind of information, and they destroy the original ordering while they work. The fast ones like quicksort aren't usually stable, because reordering the array in ways that preserve the original order to break ties is slow. Slow array sorting algorithms like insertion sort or selection sort can usually be written to be stable without difficulty.
Sorting algorithms that copy data from one place to another, or work with other data structures like linked lists, can be both fast and stable. Merge sort is the most common.
If you have an example input of
1 5 3 7 1
For the sort to be stable, you want the last 1 to never end up before the first 1.
Generally, elements with the same value in the input array should not have changed their relative positions once sorted.
Then sorted would look like:
1(f) 3 5 7 1(l)
f: first, l: last (or second if there are more than 2).
For example, quicksort uses swaps, and because the comparisons are done with greater-than-or-equal (>=) or less-than, equally valued elements can be swapped past each other while sorting, and so end up reordered in the output.

sorting component-wise multi value (SIMD) array

I'm trying to find an O(n∙log(n)) sorting method to sort several arrays simultaneously: each element of a multi-value array packs one element from each of 4 different single-value arrays, and the sorting method should sort these multi-value elements component-wise.
For example:
For 4 given single-value arrays A, B, C and D, each of length n,
I'd set up a new array Q
so that Qᵢ = [ Aᵢ, Bᵢ, Cᵢ, Dᵢ ].
Qᵢ may be changed during the process so that Qᵢ = [ A[aᵢ], B[bᵢ], C[cᵢ], D[dᵢ] ],
where aᵢ, bᵢ, cᵢ and dᵢ are index lists,
and of course Qᵢ ≤ Qᵢ₊₁ = [ A[aᵢ₊₁], B[bᵢ₊₁], C[cᵢ₊₁], D[dᵢ₊₁] ] component-wise, so that A[aᵢ] ≤ A[aᵢ₊₁], B[bᵢ] ≤ B[bᵢ₊₁] and so on.
The motivation, of course, is to use SIMD instructions to exploit this structure and sort the 4 arrays separately.
I tried to use a SIMD comparer (_mm_cmplt_ps for example) and a masked swap (_mm_blendv_ps for example)
to make a modified version of traditional sorting algorithms (quick sort, heap sort, merge sort etc)
but I always encounter the problem that these algorithms take O(n∙log(n)) steps through a data-dependent decision tree.
And so a decision, whether to set a pivot (quick sort) or whether to exchange a parent with one of its children (heap sort),
is not correct for all 4 components at the same time (and thus the next step - go right or left - is wrong for some of them).
For now I only have O(n²) methods working.
Any ideas?
It sounds as though a sorting network is the answer to the question that you asked, since the position of the comparators is not data dependent. Batcher's bitonic mergesort is O(n log² n).
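That data-independence is the useful property here: every comparator of the network can be applied lane-wise to the packed Q elements, so the 4 component arrays are compare-exchanged independently with no per-lane branching. A minimal sketch of one such compare-exchange with SSE intrinsics (the helper name is made up; the surrounding network, e.g. Batcher's bitonic merge, decides which pairs of Q elements to feed into it):
#include <xmmintrin.h>  // SSE: _mm_min_ps, _mm_max_ps

// One comparator of a sorting network, applied lane-wise: afterwards each of the
// 4 lanes of lo holds the smaller value and the matching lane of hi the larger one,
// each lane independently of the others.
static inline void compare_exchange(__m128& lo, __m128& hi) {
    __m128 smaller = _mm_min_ps(lo, hi);
    __m128 larger  = _mm_max_ps(lo, hi);
    lo = smaller;
    hi = larger;
}

// The same step can be written with the intrinsics from the question
// (SSE4.1 is needed for _mm_blendv_ps):
//   __m128 mask   = _mm_cmplt_ps(hi, lo);
//   __m128 new_lo = _mm_blendv_ps(lo, hi, mask);
//   __m128 new_hi = _mm_blendv_ps(hi, lo, mask);
Driving this step with a fixed comparator schedule such as Batcher's bitonic mergesort sorts all 4 component arrays at once in O(n log² n) compare-exchanges.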

Measuring how "out-of-order" an array is

Given an array of values, I want to find the total "score", where the score of each element is the number of elements with a smaller value that occur before it in the array.
e.g.
values: 4 1 3 2 5
scores: 0 0 1 1 4
total score: 6
An O(n^2) algorithm is trivial, but I suspect it may be possible to do it in O(nlgn), by sorting the array. Does anyone have any ideas how to do that, or if it's not possible?
Looks like what you are doing is essentially counting the number of pairs of elements that are in the incorrect relative order (i.e. number of inversions). This can be done in O(n*log(n)) by using the same idea as merge sort. As you merge, you just count the number of elements that are in the left list but should have been on the right list (and vice versa).
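For reference, a minimal sketch of that merge-and-count idea (note that the total you defined counts the pairs in the correct order, which for distinct values is n·(n-1)/2 minus the number of inversions, so either count immediately gives the other):
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Counts pairs (i < j) with a[i] > a[j] (inversions) while merge-sorting a[lo, hi).
long long sort_and_count(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo <= 1) return 0;
    std::size_t mid = lo + (hi - lo) / 2;
    long long count = sort_and_count(a, lo, mid) + sort_and_count(a, mid, hi);

    std::vector<int> merged;
    merged.reserve(hi - lo);
    std::size_t i = lo, j = mid;
    while (i < mid && j < hi) {
        if (a[j] < a[i]) {
            count += mid - i;            // a[j] jumps ahead of everything left in a[i, mid)
            merged.push_back(a[j++]);
        } else {
            merged.push_back(a[i++]);
        }
    }
    while (i < mid) merged.push_back(a[i++]);
    while (j < hi)  merged.push_back(a[j++]);
    std::copy(merged.begin(), merged.end(), a.begin() + lo);
    return count;
}

int main() {
    std::vector<int> v = {4, 1, 3, 2, 5};
    long long n = static_cast<long long>(v.size());
    long long inversions = sort_and_count(v, 0, v.size());
    std::cout << n * (n - 1) / 2 - inversions << '\n';  // prints 6, the total score from the question
}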
If the range of your numbers is small enough, the fastest algorithm I can think of is one that uses Fenwick trees. Essentially, just iterate through the list, query the Fenwick tree for how many smaller elements have already been inserted (that element's score), then insert the number into the tree. This will answer your question in O(n log m), where n is the size of your list and m is your largest integer.
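A compact sketch of that approach, assuming the values are positive integers bounded by some known m:
#include <iostream>
#include <vector>

// 1-based Fenwick (binary indexed) tree over the values 1..m, storing counts.
struct Fenwick {
    std::vector<long long> bit;
    explicit Fenwick(int m) : bit(m + 1, 0) {}
    void add(int i) {                      // record one occurrence of value i
        for (; i < static_cast<int>(bit.size()); i += i & -i) ++bit[i];
    }
    long long count_leq(int i) const {     // how many recorded values are <= i
        long long s = 0;
        for (; i > 0; i -= i & -i) s += bit[i];
        return s;
    }
};

int main() {
    std::vector<int> values = {4, 1, 3, 2, 5};
    const int m = 5;                       // assumed upper bound on the values
    Fenwick tree(m);
    long long total = 0;
    for (int v : values) {
        total += tree.count_leq(v - 1);    // smaller elements seen so far = this element's score
        tree.add(v);
    }
    std::cout << total << '\n';            // prints 6
}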
If you don't have a reasonable range on your integers (or you want to conserve space) MAK's solution is pretty damn elegant, so use that :)

Most efficient way to count occurrences?

I'm looking to calculate entropy and mutual information a huge number of times in performance-critical code. As an intermediate step, I need to count the number of occurrences of each value. For example:
uint[] myArray = [1,1,2,1,4,5,2];
uint[] occurrences = countOccurrences(myArray);
// Occurrences == [3, 2, 1, 1] or some permutation of that.
// 3 occurrences of 1, 2 occurrences of 2, one each of 4 and 5.
Of course the obvious ways to do this are either using an associative array or by sorting the input array using a "standard" sorting algorithm like quick sort. For small integers, like bytes, the code is currently specialized to use a plain old array.
Is there any clever algorithm to do this more efficiently than a hash table or a "standard" sorting algorithm will offer, such as an associative array implementation that heavily favors updates over insertions or a sorting algorithm that shines when your data has a lot of ties?
Note: Non-sparse integers are just one example of a possible data type. I'm looking to implement a reasonably generic solution here, though since integers and structs containing only integers are common cases, I'd be interested in solutions specific to these if they are extremely efficient.
Hashing is generally more scalable, as another answer indicates. However, for many possible distributions (and many real-life cases, where subarrays just happen to be often sorted, depending on how the overall array was put together), timsort is often "preternaturally good" (closer to O(N) than to O(N log N)) -- I hear it's probably going to become the standard/default sorting algorithm in Java at some reasonably close future date (it's been the standard sorting algorithm in Python for years now).
There's no really good way to address such problems except to benchmark on a selection of cases that are representative of the real-life workload you expect to be experiencing (with the obvious risk that you may choose a sample that actually happened to be biased/non-representative -- that's not a small risk if you're trying to build a library that will be used by many external users outside of your control).
Please tell more about your data.
How many items are there?
What is the expected ratio of unique items to total items?
What is the distribution of actual values of your integers? Are they usually small enough to use a simple counting array? Or are they clustered into reasonably narrow groups? Etc.
In any case, I suggest the following idea: a mergesort modified to count duplicates.
That is, you work in terms of not numbers but pairs (number, frequency) (you might use some clever memory-efficient representation for that, for example two arrays instead of an array of pairs etc.).
You start with [(x1,1), (x2,1), ...] and do a mergesort as usual, but when you merge two lists that start with the same value, you put that value into the output list once, with the sum of their occurrence counts. On your example:
[1:1,1:1,2:1,1:1,4:1,5:1,2:1]
Split into [1:1, 1:1, 2:1] and [1:1, 4:1, 5:1, 2:1]
Recursively process them; you get [1:2, 2:1] and [1:1, 2:1, 4:1, 5:1]
Merge them: (first / second / output)
[1:2, 2:1] / [1:1, 2:1, 4:1, 5:1] / [] - we add up 1:2 and 1:1 and get 1:3
[2:1] / [2:1, 4:1, 5:1] / [1:3] - we add up 2:1 and 2:1 and get 2:2
[] / [4:1, 5:1] / [1:3, 2:2]
[1:3, 2:2, 4:1, 5:1]
This might be improved greatly by using some clever tricks to do an initial reduction of the array (obtain an array of value:occurrence pairs that is much smaller than the original, but where the sum of 'occurrence' for each 'value' is equal to the number of occurrences of 'value' in the original array). For example, split the array into contiguous blocks where values differ by no more than 256 or 65536 and use a small array to count occurrences inside each block. Actually this trick can be applied at later merging phases, too.
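A rough sketch of this counting merge sort, with the working lists represented as (value, count) pairs and without the block-wise pre-reduction trick:
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

using Pair = std::pair<uint32_t, std::size_t>;  // (value, occurrence count)

// Merge two sorted (value, count) lists, summing the counts of equal values.
std::vector<Pair> merge_counting(const std::vector<Pair>& a, const std::vector<Pair>& b) {
    std::vector<Pair> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first < b[j].first)      out.push_back(a[i++]);
        else if (b[j].first < a[i].first) out.push_back(b[j++]);
        else {                            // same value: combine the counts
            out.push_back({a[i].first, a[i].second + b[j].second});
            ++i; ++j;
        }
    }
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}

// Counts occurrences in xs[lo, hi), which must be non-empty.
std::vector<Pair> count_occurrences(const std::vector<uint32_t>& xs, std::size_t lo, std::size_t hi) {
    if (hi - lo == 1) return {{xs[lo], 1}};
    std::size_t mid = lo + (hi - lo) / 2;
    return merge_counting(count_occurrences(xs, lo, mid), count_occurrences(xs, mid, hi));
}

int main() {
    std::vector<uint32_t> myArray = {1, 1, 2, 1, 4, 5, 2};
    for (const auto& [value, count] : count_occurrences(myArray, 0, myArray.size()))
        std::cout << value << " occurs " << count << " times\n";  // 1:3, 2:2, 4:1, 5:1
}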
With an array of integers like in the example, the most efficient way would be to have an array of ints and index it using your values (as you appear to be doing already).
If you can't do that, I can't think of a better alternative than a hashmap. You just need to have a fast hashing algorithm. You can't get better than O(n) performance if you want to use all your data. Is it an option to use only a portion of the data you have?
(Note that sorting and counting is asymptotically slower (O(n*log(n))) than using a hashmap based solution (O(n)).)
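For completeness, both options in a few lines of C++ (a std::vector stands in for the array from the question, and the bound of 6 on the values is an assumption of the example):
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<uint32_t> myArray = {1, 1, 2, 1, 4, 5, 2};

    // Small, dense values: index a plain counting array by the value itself.
    std::vector<std::size_t> counts(6, 0);   // assumes every value is < 6
    for (uint32_t v : myArray) ++counts[v];

    // General case: hash map from value to count, O(n) on average.
    std::unordered_map<uint32_t, std::size_t> occurrences;
    for (uint32_t v : myArray) ++occurrences[v];

    for (const auto& [value, count] : occurrences)
        std::cout << value << " -> " << count << '\n';
}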

which sorting algorithms give near / approximate sort sooner?

Which sorting algorithms produce intermediate orderings which are good approximations?
By "good approximation" I mean according to metrics such as Kendall's tau and Spearman's footrule for determining how "far" an ordered list is from another (in this case, the exact sort)
The particular application I have in mind is where humans are doing subjective pairwise comparison and may not be able to do all n log n comparisons required by, say, heapsort or best-case quicksort.
Which algorithms are better than others at getting the list to a near / approximate sort sooner?
You may want to check out the shell sort algorithm.
AFAIK it is the only algorithm you can use with a subjective comparison (meaning that you won't have any hint about median values or so) that will get nearer to the correct sort at every pass.
Here is some more information http://en.wikipedia.org/wiki/Shell_sort
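For reference, a plain shell sort (the simple halving gap sequence is just the easiest choice; better sequences exist):
#include <cstddef>
#include <iostream>
#include <vector>

// Shell sort: each pass is an insertion sort over elements gap apart,
// so every completed pass leaves the array closer to fully sorted.
void shell_sort(std::vector<int>& a) {
    for (std::size_t gap = a.size() / 2; gap > 0; gap /= 2) {
        for (std::size_t i = gap; i < a.size(); ++i) {
            int value = a[i];
            std::size_t j = i;
            while (j >= gap && a[j - gap] > value) {
                a[j] = a[j - gap];
                j -= gap;
            }
            a[j] = value;
        }
    }
}

int main() {
    std::vector<int> v = {5, 8, 4, 13, 1};
    shell_sort(v);
    for (int x : v) std::cout << x << ' ';  // 1 4 5 8 13
    std::cout << '\n';
}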
I would suggest some version of quicksort. If you know in what range the data you want to sort is then you can select pivot elements cleverly and possibly divide the problem into more than two parts at a time.
How about a left-to-right (MSD) radix sort, stopping a bit prematurely (no pun intended)?
This will run in N·b time, where b is the number of bits you decide to examine. The more bits you examine, the more sorted it will be.
unsorted:
5 0101
8 1000
4 0100
13 1101
1 0001
after 1 bit (N):
5 0101
1 0001
4 0100
13 1101
8 1000
after 2 bits (2N):
1 0001
5 0101
4 0100
8 1000
13 1101
and so on...
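Here is a sketch of that: an in-place binary MSD radix sort that simply stops recursing after a chosen number of bits (the partition below is not stable, so the order inside each group may differ from the hand-worked passes above):
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Partition a[lo, hi) on bit number `bit` (zeros first, ones last), then recurse on
// both halves with the next lower bit, stopping once bits_left bits have been examined.
void partial_msd_radix_sort(std::vector<unsigned>& a, std::size_t lo, std::size_t hi,
                            int bit, int bits_left) {
    if (bits_left == 0 || bit < 0 || hi - lo <= 1) return;
    std::size_t i = lo, j = hi;
    while (i < j) {
        if ((a[i] >> bit) & 1u) std::swap(a[i], a[--j]);  // 1-bit: move towards the back
        else ++i;                                         // 0-bit: leave in the front
    }
    partial_msd_radix_sort(a, lo, j,  bit - 1, bits_left - 1);
    partial_msd_radix_sort(a, j,  hi, bit - 1, bits_left - 1);
}

int main() {
    std::vector<unsigned> v = {5, 8, 4, 13, 1};
    partial_msd_radix_sort(v, 0, v.size(), 3, 1);  // look at only the top bit of 4
    for (unsigned x : v) std::cout << x << ' ';    // 5 1 4 13 8: grouped by the top bit
    std::cout << '\n';
}
Raising the last argument examines more bits and leaves the array progressively closer to sorted; examining all 4 bits sorts this example completely.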
My completely un-scientific and visual survey of the sorts on this page indicates that "Comb Sort" looks good. It seems to be getting to a better approximation with every pass.
I devised an NlgN sorting algorithm I called "tournament sort" some time ago, which finds output items in order (i.e. it starts by finding the first item, then it finds the second item, etc.). I'm not sure it's entirely practical, since the book-keeping overhead exceeds that of quicksort, merge sort, etc. but in benchmarking with 1,000,000 random items the comparison count actually came in below a standard library quicksort implementation (though I'm not sure how it would do against newer ones).
For purposes of my algorithm, the "score" of each item is the number of items that it's known to be better than. Initially each item has a score of 0. When two items are compared, the better item adds to its score the score of the other item, plus one. To run the algorithm, declare all items to be "eligible", and as long as more than one eligible item remains, compare the two eligible items with the lowest score, and make the loser "ineligible". Once all but one item has been declared ineligible, output the one remaining item and then declare eligible all the items which that item had "beaten".
The priority queue required to compare the two lowest-scoring items poses a somewhat annoying level of bookkeeping overhead, but if one were sorting things that were expensive to compare, the algorithm may be interesting.
Bubble sort I reckon. Advantage is that you can gradually improve the ordering with additional sweeps of the data.
