Sorting algorithm for list of integers

Sorting algorithm for list of integers - sorting

I have a list of about 200 integers whose values are between 1 and 5.
I want to get into learning about sorting algorithms and knowing where to apply each because at the moment I use bubble-sort for everything which I've been told is a terrible way to do things.
What would be the fastest sorting algorithm for this integer sorting?
EDIT: It turns out that because I know the numbers are 1 to 5 then I can use a bucket sort (?) algorithm which if I'm not mistaken - and I definitely could be - means that for each integer of value 1, I put it in the 1 group, value 2 I put it in the 2 group etc, then concatenate the groups at the end. This seems like a simple and efficient way to do it.
However since this is (currently) a learning excercise for me I am going to remove the 1 - 5 limitation and try to implement bubble-sort and merge-sort then compare the two to see which is faster.
Thanks for your help!

... which I've been told is a terrible way to do things.
First off, don't accept as gospel anything you hear from random bods on the internet (even me).
Bubble sort is fine under certain conditions, such as when the data is already mostly sorted, or the item count is relatively small (such as 200) (a), or you have no sort functionality built into the language and you're on a tight deadline where lack of performance will annoy the customer but lack of functionality will get you fired :-)
This bias against bubble sort is similar to the "only one exit point from a function" and "no goto" rules. You should understand the reasoning behind them so that you know when the rules can be ignored safely.
Anyway, on to the question proper. An efficient way for your specific case is to just count the items then output them, something like:
dim count[1..5] = {0, 0, 0, 0, 0};
for each item in list:
count[item] = count[item] + 1
for val in 1..5:
for quant in 1..count[val]:
output val
That's an O(n) time and O(1) space solution and you won't find a more efficient big-O for a generalised sort routine - it's only possible in this case because of the extra information you have about the data (limited to the values 1 through 5).
If you wanted to examine all the different sort algorithms, the Wikipedia Sorting Algorithm page is a useful starting point, including the major algorithms and their properties.
(a) As an aside, the following code (using worst case data for bubble sort), when run under CygWin on a not-very-powerful IBM T60 (2GHz dual core) laptop, completes in, on average, 0.157 seconds (5 samples: 0.150, 0.125, 0.192, 0.199, 0.115).
I wouldn't use it for sorting a million items (everyone knows bubble sort scales poorly) but 200 should be fine in most cases:
#include <stdio.h>
#define COUNT 200
int main (void) {
int i, swapped, tmp, item[COUNT];
// Set up worst case (reverse order) data.
for (i = 0; i < COUNT; i++)
item[i] = 200 - i;
// Slightly optimised bubble sort.
swapped = 1;
while (swapped) {
swapped = 0;
for (i = 1; i < COUNT; i++) {
if (item[i-1] > item[i]) {
tmp = item[i-1];
item[i-1] = item[i];
item[i] = tmp;
swapped = 1;
}
}
}
// for (i = 0; i < COUNT; i++)
// printf ("%d ", item[i]);
// putchar ('\n');
return 0;
}

You may not need sorting here, since you only have 5 possible values.
You could use 5 containers (or buckets) and as you scan your list of integers you place the values in the right bucket.
At the end, join the buckets together, in order.

Merge sort is an O(n log n) I think its way better than QuickSort
You can find some C# code here.

Related

abstract inplace mergesort for effective merge sort

I am reading about merge sort in Algorithms in C++ by Robert Sedgewick and have following questions.
static void mergeAB(ITEM[] c, int cl, ITEM[] a, int al, int ar, ITEM[] b, int bl, int br )
{
int i = al, j = bl;
for (int k = cl; k < cl+ar-al+br-bl+1; k++)
{
if (i > ar) { c[k] = b[j++]; continue; }
if (j > br) { c[k] = a[i++]; continue; }
c[k] = less(a[i], b[j]) ? a[i++] : b[j++];
}
}
The characteristic of the basic merge that is worthy of note is that
the inner loop includes two tests to determine whether the ends of the
two input arrays have been reached. Of course, these two tests usually
fail, and the situation thus cries out for the use of sentinel keys to
allow the tests to be removed. That is, if elements with a key value
larger than those of all the other keys are added to the ends of the a
and aux arrays, the tests can be removed, because when the a (b) array
is exhausted, the sentinel causes the next elements for the c array to
be taken from the b (a) array until the merge is complete.
However, it is not always easy to use sentinels, either because it
might not be easy to know the largest key value or because space might
not be available conveniently.
For merging, there is a simple remedy. The method is based on the
following idea: Given that we are resigned to copying the arrays to
implement the in-place abstraction, we simply put the second array in
reverse order when it is copied (at no extra cost), so that its
associated index moves from right to left. This arrangement leads to
the largest element—in whichever array it is—serving as sentinel for
the other array.
My questions on above text
What does statement "when the a (b) array is exhausted"? what is 'a (b)' here?
Why is the author mentioning that it is not easy to determine the largest key and how is the space related in determining largest key?
What does author mean by "Given that we are resigned to copying the arrays"? What is resigned in this context?
Request with simple example in understanding idea which is mentioned as simple remedy?

"When the a (b) array is exhausted" is a shorthand for "When either the a array or the b array is exhausted".
The interface is dealing with sub-arrays of a bigger array, so you can't simply go writing beyond the ends of the arrays.
The code copies the data from two arrays into one other array. Since this copy is inevitable, we are 'resigned to copying the arrays' means we reluctantly accept that it is inevitable that the arrays must be copied.
Tricky...that's going to take some time to work out what is meant.
Tangentially: That's probably not the way I'd write the loop. I'd be inclined to use:
int i = al, j = bl;
for (int k = cl; i <= ar && j <= br; k++)
{
if (a[i] < b[j])
c[k] = a[i++];
else
c[k] = b[j++];
}
while (i <= ar)
c[k++] = a[i++];
while (j <= br)
c[k++] = b[j++];
One of the two trailing loops does nothing. The revised main merge loop has 3 tests per iteration versus 4 tests per iteration for the one original algorithm. I've not formally measured it, but the simpler merge loop is likely to be quicker than the original single-loop algorithm.
The first three questions are almost best suited for English Language Learners.

a(b) and b(a)
Sometimes parenthesis are used to tell one or more similar phrases at once:
when a (b) is exhausted we copy elements from b (a)
means:
when a is exhausted we copy elements from b,
when b is exhausted we copy elements from a
What is difficult about sentinels
Two annoying things about sentinels are
sometimes your array data may potentially contain every possible value, so there is no value you can use as sentinel that is guaranteed to be bigger that all the values in the array
to use a sentinel instead of checking the index to see if you are done with an array requires that you have room for one extra space in the array to store the sentinel
Resigning
We programmers are never happy to copy (or move) things around and leaving them where they already are is, if possible, better (because we are lazy).
In this version of the merge sort we already gave up about trying to not copy things around... we resigned to it.
Given that we must copy, we can copy things in the opposite order if we like (and of course use the copy in opposite order) because that is free(*).
(*) is free at this level of abstraction, the cost on some real CPU may be high. As almost always in the performance area YMMV.

Fast algorithm for finding prime numbers? [duplicate]

This question already has answers here:
Which is the fastest algorithm to find prime numbers?
(20 answers)
Closed 9 years ago.
First of all - I checked a lot in this forum and I haven't found something fast enough.
I try to make a function that returns me the prime numbers in a specified range.
For example I did this function (in C#) using the sieve of Eratosthenes. I tried also Atkin's sieve but the Eratosthenes one runs faster (in my implementation):
public static void SetPrimesSieve(int Range)
{
Primes = new List<uint>();
Primes.Add(2);
int Half = (Range - 1) >> 1;
BitArray Nums = new BitArray(Half, false);
int Sqrt = (int)Math.Sqrt(Range);
for (int i = 3, j; i <= Sqrt; )
{
for (j = ((i * i) >> 1) - 1; j < Half; j += i)
Nums[j] = true;
do
i += 2;
while (i <= Sqrt && Nums[(i >> 1) - 1]);
}
for (int i = 0; i < Half; ++i)
if (!Nums[i])
Primes.Add((uint)(i << 1) + 3);
}
It runs about twice faster than codes & algorithms I found...
There's should be a faster way to find prime numbers, could you help me?

When searching around for algorithms on this topic (for project Euler) I don't remember finding anything faster. If speed is really the concern, have you thought about just storing the primes so you simply need to look it up?
EDIT: quick google search found this, confirming that the fastest method would be just to page the results and look them up as needed.
One more edit - you may find more information here, essentially a duplicate of this topic. Top post there states that atkin's sieve was faster than eras' as far as generating on the fly.

The fastest algorithm in my experience so far is the Sieve of Erathostenes with wheel factorization for 2, 3 and 5, where the primes among the remaining numbers are represented as bits in a byte array. In Java on one core of my 3 year old Laptop it takes 23 seconds to compute the primes up to 1 billion.
With wheel factorization the Sieve of Atkin was about a factor of two slower, while with an ordinary BitSet it was about 30% faster.
See also this answer.

I did an algorithm that can find prime numbers from range 2-90 000 000 for 0.65 sec on I 350M-notebook, written in C .... you have to use bitwise operations and have "code" for recalculating index of your array to index of concrete bit you want. for example If you want folds of number 2, concrete bits will be for example ....10101000 ... so if you read from left ... you get index 4,6,8 ... thats it

Several comments.
For speed, precompute then load from disk. It's super fast. I did it in Java long ago.
Don't store as an array, store as a bitsequence for odd numbers. Way more efficient on memory
If your speed question is that you want this particular computation to run fast (you need to justify why you can't precompute and load it from disk) you need to code a better Atkin's sieve. It is faster. But only slightly.
You haven't indicated the end use for these primes. We may be missing something completely because you've not told us the application. Tell us a sketch of the application and the answers will be targetted better for your context.
Why on earth do you think something faster exists? You haven't justified your hunch. This is a very hard problem. (that is to find something faster)

You can do better than that using the Sieve of Atkin, but it is quite tricky to implement it fast and correctly. A simple translation of the Wikipedia pseudo-code is probably not good enough.

random permutation

I would like to genrate a random permutation as fast as possible.
The problem: The knuth shuffle which is O(n) involves generating n random numbers.
Since generating random numbers is quite expensive.
I would like to find an O(n) function involving a fixed O(1) amount of random numbers.
I realize that this question has been asked before, but I did not see any relevant answers.
Just to stress a point: I am not looking for anything less than O(n), just an algorithm involving less generation of random numbers.
Thanks

Create a 1-1 mapping of each permutation to a number from 1 to n! (n factorial). Generate a random number in 1 to n!, use the mapping, get the permutation.
For the mapping, perhaps this will be useful: http://en.wikipedia.org/wiki/Permutation#Numbering_permutations
Of course, this would get out of hand quickly, as n! can become really large soon.

Generating a random number takes long time you say? The implementation of Javas Random.nextInt is roughly
oldseed = seed;
nextseed = (oldseed * multiplier + addend) & mask;
return (int)(nextseed >>> (48 - bits));
Is that too much work to do for each element?

See https://doi.org/10.1145/3009909 for a careful analysis of the number of random bits required to generate a random permutation. (It's open-access, but it's not easy reading! Bottom line: if carefully implemented, all of the usual methods for generating random permutations are efficient in their use of random bits.)
And... if your goal is to generate a random permutation rapidly for large N, I'd suggest you try the MergeShuffle algorithm. An article published in 2015 claimed a factor-of-two speedup over Fisher-Yates in both parallel and sequential implementations, and a significant speedup in sequential computations over the other standard algorithm they tested (Rao-Sandelius).
An implementation of MergeShuffle (and of the usual Fisher-Yates and Rao-Sandelius algorithms) is available at https://github.com/axel-bacher/mergeshuffle. But caveat emptor! The authors are theoreticians, not software engineers. They have published their experimental code to github but aren't maintaining it. Someday, I imagine someone (perhaps you!) will add MergeShuffle to GSL. At present gsl_ran_shuffle() is an implementation of Fisher-Yates, see https://www.gnu.org/software/gsl/doc/html/randist.html?highlight=gsl_ran_shuffle.

Not what you asked exactly, but if provided random number generator doesn't satisfy you, may be you should try something different. Generally, pseudorandom number generation can be very simple.
Probably, best-known algorithm
http://en.wikipedia.org/wiki/Linear_congruential_generator
More
http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators

As other answers suggest, you can make a random integer in the range 0 to N! and use it to produce a shuffle. Although theoretically correct, this won't be faster in general since N! grows fast and you'll spend all your time doing bigint arithmetic.
If you want speed and you don't mind trading off some randomness, you will be much better off using a less good random number generator. A linear congruential generator (see http://en.wikipedia.org/wiki/Linear_congruential_generator) will give you a random number in a few cycles.

Usually there is no need in full-range of next random value, so to use exactly the same amount of randomness you can use next approach (which is almost like random(0,N!), I guess):
// ...
m = 1; // range of random buffer (single variant)
r = 0; // random buffer (number zero)
// ...
for(/* ... */) {
while (m < n) { // range of our buffer is too narrow for "n"
r = r*RAND_MAX + random(); // add another random to our random-buffer
m *= RAND_MAX; // update range of random-buffer
}
x = r % n; // pull-out next random with range "n"
r /= n; // remove it from random-buffer
m /= n; // fix range of random-buffer
// ...
}
P.S. of course there will be some errors related with division by value different from 2^n, but they will be distributed among resulted samples.

Generate N numbers (N < of the number of random number you need) before to do the computation, or store them in an array as data, with your slow but good random generator; then pick up a number simply incrementing an index into the array inside your computing loop; if you need different seeds, create multiple tables.

Are you sure that your mathematical and algorithmical approach to the problem is correct?
I hit exactly same problem where Fisher–Yates shuffle will be bottleneck in corner cases. But for me the real problem is brute force algorithm that doesn't scale well to all problems. Following story explains the problem and optimizations that I have come up with so far.
Dealing cards for 4 players
Number of possible deals is 96 bit number. That puts quite a stress for random number generator to avoid statical anomalies when selecting play plan from generated sample set of deals. I choose to use 2xmt19937_64 seeded from /dev/random because of the long period and heavy advertisement in web that it is good for scientific simulations.
Simple approach is to use Fisher–Yates shuffle to generate deals and filter out deals that don't match already collected information. Knuth shuffle takes ~1400 CPU cycles per deal mostly because I have to generate 51 random numbers and swap 51 times entries in the table.
That doesn't matter for normal cases where I would only need to generate 10000-100000 deals in 7 minutes. But there is extreme cases when filters may select only very small subset of hands requiring huge number of deals to be generated.
Using single number for multiple cards
When profiling with callgrind (valgrind) I noticed that main slow down was C++ random number generator (after switching away from std::uniform_int_distribution that was first bottleneck).
Then I came up with idea that I can use single random number for multiple cards. The idea is to use least significant information from the number first and then erase that information.
int number = uniform_rng(0, 52*51*50*49);
int card1 = number % 52;
number /= 52;
int cards2 = number % 51;
number /= 51;
......
Of course that is only minor optimization because generation is still O(N).
Generation using bit permutations
Next idea was exactly solution asked in here but I ended up still with O(N) but with larger cost than original shuffle. But lets look into solution and why it fails so miserably.
I decided to use idea Dealing All the Deals by John Christman
void Deal::generate()
{
// 52:26 split, 52!/(26!)**2 = 495,918,532,948,1041
max = 495918532948104LU;
partner = uniform_rng(eng1, max);
// 2x 26:13 splits, (26!)**2/(13!)**2 = 10,400,600**2
max = 10400600LU*10400600LU;
hands = uniform_rng(eng2, max);
// Create 104 bit presentation of deal (2 bits per card)
select_deal(id, partner, hands);
}
So far good and pretty good looking but select_deal implementation is PITA.
void select_deal(Id &new_id, uint64_t partner, uint64_t hands)
{
unsigned idx;
unsigned e, n, ns = 26;
e = n = 13;
// Figure out partnership who owns which card
for (idx = CARDS_IN_SUIT*NUM_SUITS; idx > 0; ) {
uint64_t cut = ncr(idx - 1, ns);
if (partner >= cut) {
partner -= cut;
// Figure out if N or S holds the card
ns--;
cut = ncr(ns, n) * 10400600LU;
if (hands > cut) {
hands -= cut;
n--;
} else
new_id[idx%NUM_SUITS] |= 1 << (idx/NUM_SUITS);
} else
new_id[idx%NUM_SUITS + NUM_SUITS] |= 1 << (idx/NUM_SUITS);
idx--;
}
unsigned ew = 26;
// Figure out if E or W holds a card
for (idx = CARDS_IN_SUIT*NUM_SUITS; idx-- > 0; ) {
if (new_id[idx%NUM_SUITS + NUM_SUITS] & (1 << (idx/NUM_SUITS))) {
uint64_t cut = ncr(--ew, e);
if (hands >= cut) {
hands -= cut;
e--;
} else
new_id[idx%NUM_SUITS] |= 1 << (idx/NUM_SUITS);
}
}
}
Now that I had the O(N) permutation solution done to prove algorithm could work I started searching for O(1) mapping from random number to bit permutation. Too bad it looks like only solution would be using huge lookup tables that would kill CPU caches. That doesn't sound good idea for AI that will be using very large amount of caches for double dummy analyzer.
Mathematical solution
After all hard work to figure out how to generate random bit permutations I decided go back to maths. It is entirely possible to apply filters before dealing cards. That requires splitting deals to manageable number of layered sets and selecting between sets based on their relative probabilities after filtering out impossible sets.
I don't yet have code ready for that to tests how much cycles I'm wasting in common case where filter is selecting major part of deal. But I believe this approach gives the most stable generation performance keeping the cost less than 0.1%.

Generate a 32 bit integer. For each index i (maybe only up to half the number of elements in the array), if bit i % 32 is 1, swap i with n - i - 1.
Of course, this might not be random enough for your purposes. You could probably improve this by not swapping with n - i - 1, but rather by another function applied to n and i that gives better distribution. You could even use two functions: one for when the bit is 0 and another for when it's 1.

Fastest sorting algorithm for a specific situation

What is the fastest sorting algorithm for a large number (tens of thousands) of groups of 9 positive double precision values, where each group must be sorted individually? So it's got to sort fast a small number of possibly repeated double precision values, many times in a row.
The values are in the [0..1] interval. I don't care about space complexity or stability, just about speed.

Sorting each group individually, merge sort would probably be easiest to implement with good results.
A sorting network would probably be the fastest solution:
http://en.wikipedia.org/wiki/Sorting_network

Good question because this comes down to "Fastest way to sort an array of 9 elements", and most comparisons between and analysis of sorting methods are about large N. I assume the 'groups' are clearly defined and don't play a real role here.
You will probably have to benchmark a few candidates because a lot of factors (locality) come into play here.
In any case, making it parallel sounds like a good idea. Use Parallel.For() if you can use ,NET4.

I think you will need to try out a few examples to see what works best, as you have an unusual set of conditions. My guess that the best will be one of
sorting network
insertion sort
quick sort (one level -- insertion sort below)
merge sort
Given that double precision number are relatively long I suspect you will not do better with a radix sort, but feel free to add it in.
For what its worth, Java uses quick sort on doubles until the number of items to be sorted drops below 7, at which is uses insertion sort. The third option mimics that solution.
Also your overall problem is embarrassingly parallel so you want to make use of parallelism when possible. The problem looks too small for a distributed solution (more time would be lost in networking than saved), but if set up right, your problem can make use of multiple cores very effectively.

It looks like you want the most cycle-stingy way to sort 9 values. Since the number of values is limited, I would (as Kathy suggested) first do an unrolled insertion sort on the first 4 elements and the second 5 elements. Then I would merge those two groups.
Here's an unrolled insertion sort of 4 elements:
if (u[1] < u[0]) swap(u[0], u[1]);
if (u[2] < u[0]) swap(u[0], u[2]);
if (u[3] < u[0]) swap(u[0], u[3]);
if (u[2] < u[1]) swap(u[1], u[2]);
if (u[3] < u[1]) swap(u[1], u[3]);
if (u[3] < u[2]) swap(u[2], u[3]);
Here's a merge loop. The first set of 4 elements is in u, and the second set of 5 elements in in v. The result is in r.
i = j = k = 0;
while(i < 4 && j < 5){
if (u[i] < v[j]) r[k++] = u[i++];
else if (v[j] < u[i]) r[k++] = v[j++];
else {
r[k++] = u[i++];
r[k++] = v[j++];
}
}
while (i < 4) r[k++] = u[i++];
while (j < 5) r[k++] = v[j++];

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?
Preliminary
I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.
At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.
Details
I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).
The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.
Use Cases
Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.
Research
Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.
In particular, there are the following things.
First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).
Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).
Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.
To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?

Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.
void radixSort(string[] seqs, size_t base = 0) {
if(seqs.length == 0)
return;
size_t TPos = seqs.length, APos = 0;
size_t i = 0;
while(i < TPos) {
if(seqs[i][base] == 'A') {
swap(seqs[i], seqs[APos++]);
i++;
}
else if(seqs[i][base] == 'T') {
swap(seqs[i], seqs[--TPos]);
} else i++;
}
i = APos;
size_t CPos = APos;
while(i < TPos) {
if(seqs[i][base] == 'C') {
swap(seqs[i], seqs[CPos++]);
}
i++;
}
if(base < seqs[0].length - 1) {
radixSort(seqs[0..APos], base + 1);
radixSort(seqs[APos..CPos], base + 1);
radixSort(seqs[CPos..TPos], base + 1);
radixSort(seqs[TPos..seqs.length], base + 1);
}
}
Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.
Edit:
I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.

I've never seen an in-place radix sort, and from the nature of the radix-sort I doubt that it is much faster than a out of place sort as long as the temporary array fits into memory.
Reason:
The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. If it's in place or not will not change this effect.
I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).
That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.
I have no idea if it works out in practice though.
Btw: If you're dealing with DNA strings only: You can compress a char into two bits and pack your data quite a lot. This will cut down the memory requirement by factor four over a naiive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache-misses anyway.

You can certainly drop the memory requirements by encoding the sequence in bits.
You are looking at permutations so, for length 2, with "ACGT" that's 16 states, or 4 bits.
For length 3, that's 64 states, which can be encoded in 6 bits. So it looks like 2 bits for each letter in the sequence, or about 32 bits for 16 characters like you said.
If there is a way to reduce the number of valid 'words', further compression may be possible.
So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64.
Initialize them to zero.
Iterate through your very very large list of 3 char sequences, and encode them as above.
Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.
Next, regenerate your list.
Iterate through the 64 buckets in order, for the count found in that bucket, generate that many instances of the sequence represented by that bucket.
when all of the buckets have been iterated, you have your sorted array.
A sequence of 4, adds 2 bits, so there would be 256 buckets.
A sequence of 5, adds 2 bits, so there would be 1024 buckets.
At some point the number of buckets will approach your limits.
If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.
I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.
Here is a hack that shows the technique
#include <iostream>
#include <iomanip>
#include <math.h>
using namespace std;
const int width = 3;
const int bucketCount = exp(width * log(4)) + 1;
int *bucket = NULL;
const char charMap[4] = {'A', 'C', 'G', 'T'};
void setup
(
void
)
{
bucket = new int[bucketCount];
memset(bucket, '\0', bucketCount * sizeof(bucket[0]));
}
void teardown
(
void
)
{
delete[] bucket;
}
void show
(
int encoded
)
{
int z;
int y;
int j;
for (z = width - 1; z >= 0; z--)
{
int n = 1;
for (y = 0; y < z; y++)
n *= 4;
j = encoded % n;
encoded -= j;
encoded /= n;
cout << charMap[encoded];
encoded = j;
}
cout << endl;
}
int main(void)
{
// Sort this sequence
const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";
size_t testSequenceLength = strlen(testSequence);
setup();
// load the sequences into the buckets
size_t z;
for (z = 0; z < testSequenceLength; z += width)
{
int encoding = 0;
size_t y;
for (y = 0; y < width; y++)
{
encoding *= 4;
switch (*(testSequence + z + y))
{
case 'A' : encoding += 0; break;
case 'C' : encoding += 1; break;
case 'G' : encoding += 2; break;
case 'T' : encoding += 3; break;
default : abort();
};
}
bucket[encoding]++;
}
/* show the sorted sequences */
for (z = 0; z < bucketCount; z++)
{
while (bucket[z] > 0)
{
show(z);
bucket[z]--;
}
}
teardown();
return 0;
}

If your data set is so big, then I would think that a disk-based buffer approach would be best:
sort(List<string> elements, int prefix)
if (elements.Count < THRESHOLD)
return InMemoryRadixSort(elements, prefix)
else
return DiskBackedRadixSort(elements, prefix)
DiskBackedRadixSort(elements, prefix)
DiskBackedBuffer<string>[] buckets
foreach (element in elements)
buckets[element.MSB(prefix)].Add(element);
List<string> ret
foreach (bucket in buckets)
ret.Add(sort(bucket, prefix + 1))
return ret
I would also experiment grouping into a larger number of buckets, for instance, if your string was:
GATTACA
the first MSB call would return the bucket for GATT (256 total buckets), that way you make fewer branches of the disk based buffer. This may or may not improve performance, so experiment with it.

I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:
You control the reading of the data
You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.
The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.
Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), and then you can build a chunk of the heap while you are waiting for the next chunk of data to come in - even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.
Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunks and chunks as you go.
That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(
See the Wikipedia articles:
Heapsort
Binary heap

"Radix sorting with no extra space" is a paper addressing your problem.

Performance-wise you might want to look at a more general string-comparison sorting algorithms.
Currently you wind up touching every element of every string, but you can do better!
In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.
A decent general purpose implementation of burstsort is available on source forge at http://sourceforge.net/projects/burstsort/ - but it is not in-place.
For comparison purposes, The C-burstsort implementation covered at http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.

You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.
Strings comprised of the four nucleotide letters A, C, G, and T can be specially encoded into Integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.

You might try using a trie. Sorting the data is simply iterating through the dataset and inserting it; the structure is naturally sorted, and you can think of it as similar to a B-Tree (except instead of making comparisons, you always use pointer indirections).
Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.

I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.

It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort". It's described here: Engineering Radix Sort. The general idea is to do 2 passes on each character - first count how many of each you have, so you can subdivide the input array into bins. Then go through again, swapping each element into the correct bin. Now recursively sort each bin on the next character position.

Radix-Sort is not cache conscious and is not the fastest sort algorithm for large sets.
You can look at:
ti7qsort. ti7qsort is the fastest sort for integers (can be used for small-fixed size strings).
Inline QSORT
String sorting
You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.

dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.
I suggest a very simple approach:
Empirically estimate the largest size m for which a radix sort is efficient.
Read blocks of m elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.
Mergesort the resulting sorted blocks.
Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require 2n space to sort n items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.
Please note finally that mergesort can be implemented without recursion, and in fact doing it this way makes clear the true linear memory access pattern.

First, think about the coding of your problem. Get rid of the strings, replace them by a binary representation. Use the first byte to indicate length+encoding. Alternatively, use a fixed length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to not have exception handling at the hot spot of the inner loop.
OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.
Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:
Encoding with 7 length bits of variable-length strings. As they fill up, you replace them by:
Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
Bitmap encoding of the last three characters of a string.
For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.
This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.
For even more performance, you might want to add different block types and a larger size of block. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off
You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.
Are you interested in handling doubles? With short sequences, there are going to be. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.

While the accepted answer perfectly answers the description of the problem, I've reached this place looking in vain for an algorithm to partition inline an array into N parts. I've written one myself, so here it is.
Warning: this is not a stable partitioning algorithm, so for multilevel partitioning, one must repartition each resulting partition instead of the whole array. The advantage is that it is inline.
The way it helps with the question posed is that you can repeatedly partition inline based on a letter of the string, then sort the partitions when they are small enough with the algorithm of your choice.
function partitionInPlace(input, partitionFunction, numPartitions, startIndex=0, endIndex=-1) {
if (endIndex===-1) endIndex=input.length;
const starts = Array.from({ length: numPartitions + 1 }, () => 0);
for (let i = startIndex; i < endIndex; i++) {
const val = input[i];
const partByte = partitionFunction(val);
starts[partByte]++;
}
let prev = startIndex;
for (let i = 0; i < numPartitions; i++) {
const p = prev;
prev += starts[i];
starts[i] = p;
}
const indexes = [...starts];
starts[numPartitions] = prev;
let bucket = 0;
while (bucket < numPartitions) {
const start = starts[bucket];
const end = starts[bucket + 1];
if (end - start < 1) {
bucket++;
continue;
}
let index = indexes[bucket];
if (index === end) {
bucket++;
continue;
}
let val = input[index];
let destBucket = partitionFunction(val);
if (destBucket === bucket) {
indexes[bucket] = index + 1;
continue;
}
let dest;
do {
dest = indexes[destBucket] - 1;
let destVal;
let destValBucket = destBucket;
while (destValBucket === destBucket) {
dest++;
destVal = input[dest];
destValBucket = partitionFunction(destVal);
}
input[dest] = val;
indexes[destBucket] = dest + 1;
val = destVal;
destBucket = destValBucket;
} while (dest !== index)
}
return starts;
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio