Fast bit permutation

Fast bit permutation - algorithm

I need to store and apply permutations to 16-bit integers. The best solution I came up with is to store permutation as 64-bit integer where each 4 bits correspond to the new position of i-th bit, the application would look like:
int16 permute(int16 bits, int64 perm)
{
int16 result = 0;
for(int i = 0; i < 16; ++i)
result |= ((bits >> i) & 1) * (1 << int( (perm >> (i*4))&0xf ));
return result;
}
is there a faster way to do this? Thank you.

There are alternatives.
Any permutation can be handled by a Beneš network, and encoded as the masks that are the inputs to the multiplexers to apply the shuffle. This can be done reasonably efficiently in software too (not great but OK), it's just a bunch of butterfly permutations. The masks are a bit tricky to compute, but probably faster to apply than moving every bit on its own, though that depends on how many bits you're dealing with and 16 is not a lot.
Some smaller categories of shuffles can be handled by simpler (faster) networks, which you can also find on that page.
Finally in practice, on modern x86 hardware, there is the highly versatile pshufb function which can apply a permutation (but may include dupes and zeroes) to 16 bytes in (typically) a single cycle. It is slightly awkward to distribute the bits over the bytes, but once you're there it only takes a pshufb to permute and a pmovmskb to compress it back down to 16 bits.

Related

Generate random numbers without repetition (or vanishing probability of repetition) without storing full list of past generated numbers?

I need to generate random numbers in a very large range, 128 bits integers, and I will generate a many many of them. I'll generate so many of them, that I cannot fit into memory a list of the numbers generated.
I also have the requirement that the generated numbers do not repeat, or at least that the probability of repetition is vanishingly small.
Is there an algorithm that does this?

Build a 128 bit linear congruential generator or linear feedback shift register generator. With properly chosen coefficients either of those will achieve full cycle, meaning no repeats until you've exhausted all outcomes.

Any full-period PRNG with a 128-bit state will do what you need in principle. Unfortunately many of these generators tend to produce only 32 or 64 bits per iteration while the rest of the state goes through a predictable permutation (LFSRs being the worst case, producing only 1 bit per iteration). Each 128-bit state is unique, but many of its bits would show a trivial relation to the previous state.
This can be overcome with tempering -- taking your questionable-quality PRNG state with a known-good period, and permuting it through a 1:1 transform to hide the not-so-random factors.
For example, borrowing from the example xorshift+ shown on Wikipedia:
static uint64_t s[2] = { 1, 0 };
void random128(uint64_t result[]) {
uint64_t x = s[0];
uint64_t y = s[1];
x ^= x << 23;
x ^= y ^ (x >> 17) ^ (y >> 26);
s[0] = y;
s[1] = x;
At this point we know that s[0] is just the old value of s[1], which would be a terrible PRNG if all 128 bits were exposed (normally only s[1] is exposed). To overcome this we permute the result to disguise that relationship (following the same principle as a feistel network to ensure that the transform is 1:1).
y += x * 1630144151483159999;
x ^= y >> 3;
result[0] = x;
result[1] = y;
}
This seems to be sufficient to pass diehard. So long as the original generator has full(ish) period, the whole generator should be full period too.
The logical conclusion to tempering a low-quality generator is to use AES-128 in counter mode. Simply run a counter from 0 to 2**128-1 (an extremely low-quality generator), and encrypt each value using AES-128 and a consistent key (an ideal temper) for your final output.
If you do this, don't get distracted by full cryptographic RNG requirements. Those involve re-seeding and consequently can produce the same number more than once (which is more random, but it's what you want to avoid).

Efficiently Get Random Numbers in Range on GPU

Given a uniformly distributed random number generator in the range [0, 2^64), is there any efficient way (on a GPU) to build a random number generator for the range [0, k) for some k < 2^64?
Some solutions that don't work:
// not uniformly distributed in [0, k)
myRand(rng, k) = rng() % k;
// way too much branching to run efficiently on a gpu
myRand(rng, k) =
uint64_t ret;
while((ret = rng() & (nextPow2(k)-1)) >= k);
return ret;
// only 53 bits of random data, not 64. Also I
// have no idea how to reason about how "uniform"
// this distribution is.
myRand(doubleRng, k) =
double r = doubleRng(); // generates a random number in [0, 1)
return (uint64_t)floor(r*k);
I'd be willing to compromise non-uniformity if the difference is sufficiently small (say, within 1/2^64).

There are only two options: do the modulus (or the floating point) and settle for non-uniformity, or do rejection sampling with a loop. There really isn't a third option. Which one is better depends on your application.
If your k is typically very small (say, you're shuffling cards so k is on the order of 100), then the non-uniformity is so small that it's probably OK, even at 32 bits. At 64 bits, a k on the order of millions is still going to give you a non-uniformity vanishingly small. No, it won't be on the order of 1/2^64, but I can't imagine a real-world application where a non-uniformity on the order of 1/2^20 is noticeable. When I wrote the test suite for my RNG library, I deliberately ran it against a known bad mod implementation and it had a really hard time detecting the error even at 32 bits.
If you really have to be perfectly uniform, then you're just going to have to sample and reject. This can be done pretty fast, and you can even get rid of the division (calculate that nextPow2() outside the rejection loop--that's how I do it in ojrandlib). FYI, the fastest way to do the next-power-of-two mask is this:
mask = k - 1;
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask |= mask >> 32;

If you have a function that returns 53 bits of random data, but you need 64, call it twice, use the bottom 32 bits of the first call for the top 32 bits of your result, and the bottom 32 bits of the second call for the bottom 32 bits of your result. If your original function was uniform, this one is too.

compression algorithm for sorted integers

I have a large sequence of random integers sorted from the lowest to the highest. The numbers start from 1 bit and end near 45 bits. In the beginning of the list I have numbers very close to each other: 4, 20, 23, 40, 66. But when the numbers start to get higher the distance between them is a bit higher too (actually the distance between them is aleatory). There are no duplicated numbers.
I'm using bit packing to save some space. Nonetheless, this file can get really big.
I would like to know what kind of compression algorithm can be used in this situation, or any other technique to save as much space as possible.
Thank you.

You can compress optimally if you know the true distribution of the data. If you can provide a probability distribution for each integer you can use arithmetic coding or other entropy coding techniques to compress to theoretical minimal size.
The trick is in predicting accurately.
First, you should probably compress the distances between the numbers because that allows you to make statistical statements. If you were to compress the numbers directly you'd have a hard time modelling them because they occur only once.
Next, you could try to build a very simple model to predict the next distance. Keep a histogram of all previously seen distances and calculate the probabilities from the frequencies.
You probably need to account for missing values (you clearly can't assign them 0 probability because that is not expressible) but you can use heuristics for that, like encoding the next distance bit-by-bit and predicting each bit individually. You will pay almost nothing for the high-order bits because they are almost always 0 and entropy encoding optimizes them away.
All of this is much simpler if you know the distribution. Example: You you are compressing a list of all prime numbers you know the theoretical distribution of distances because there are formulae for that. So you already have a perfect model.

There's a very simple and fairly effective compression technique which can be used for sorted integers in a known range. Like most compression schemes, it is optimized for serial access, although you can build an index to speed up random access if needed.
It's a type of delta encoding (i.e. each number is represented by the distance from the previous one), consisting of a vector of codes which are either
a single 1-bit, representing a delta of 2k which is added to the delta in the following code, or
a 0-bit followed by a k-bit delta, indicating that the next number is the specified delta from the previous one.
For example, if k is 4, the sequence:
00011 1 1 00000 1 00001
codes three numbers. The first four-bit encoding (3) is the first delta, taken from an initial value of 0, so the first number is 3. The next two solitary 1's accumulate to a delta of 2&centerdot;24, or 32, which is added to the following delta of 0000, for a total of 32. So the second number is 3+32=35. Finally, the last delta is a single 24 plus 1, total 17, and the third number is 35+17=52.
The 1-bit indicates that the next delta should be incremented by 2k (or, more generally, each delta is incremented by 2k times the number of immediately preceding 1-bits.)
Another, possibly better, way of thinking of this is that each delta is coded as a variable length bit sequence: 1i0(1|0)k, representing a delta of i&centerdot;2k+[the k-bit suffix]. But the first presentation aligns better with the optimality proof.
Since each "1" code represents an increment of 2k, there cannot be more than m/2k of them, where m is the largest number in the set to be compressed. The remaining codes all correspond to numbers, and have a total length of n&centerdot;(k + 1) where n is the size of the set. The optimal value of k is roughly log2 m/n, which in your case would be 7 or 8.
I did a quick proof of concept of the algorithm, without worrying about optimizations. It's still plenty fast; sorting the random sample takes a lot longer than compressing/decompressing it. I tried it with a few different seeds and vector sizes from 16,400,000 to 31,000,000 with a value range of [0, 4,000,000,000). The bits used per data value ranged from 8.59 (n=31000000) to 9.45 (n=16400000). All of the tests were done with 7-bit suffixes; log2 m/n varies from 7.01 (n=31000000) to 7.93 (n=16400000). I tried with 6-bit and 8-bit suffixes; except in the case of n=31000000 where the 6-bit suffixes were slightly smaller, the 7-bit suffix was always the best. So I guess that the optimal k is not exactly floor(log2 m/n) but it's not far off.
Compression code:
void Compress(std::ostream& os,
const std::vector<unsigned long>& v,
unsigned long k = 0) {
BitOut out(os);
out.put(v.size(), 64);
if (v.size()) {
unsigned long twok;
if (k == 0) {
unsigned long ratio = v.back() / v.size();
for (twok = 1; twok <= ratio / 2; ++k, twok *= 2) { }
} else {
twok = 1 << k;
}
out.put(k, 32);
unsigned long prev = 0;
for (unsigned long val : v) {
while (val - prev >= twok) { out.put(1); prev += twok; }
out.put(0);
out.put(val - prev, k);
prev = val;
}
}
out.flush(1);
}
Decompression:
std::vector<unsigned long> Decompress(std::istream& is) {
BitIn in(is);
unsigned long size = in.get(64);
if (size) {
unsigned long k = in.get(32);
unsigned long twok = 1 << k;
std::vector<unsigned long> v;
v.reserve(size);
unsigned long prev = 0;
for (; size; --size) {
while (in.get()) prev += twok;
prev += in.get(k);
v.push_back(prev);
}
}
return v;
}
It can be a bit awkward to use variable-length encodings; an alternative is to store the first bit of each code (1 or 0) in a bit vector, and the k-bit suffixes in a separate vector. This would be particularly convenient if k is 8.
A variant, which results in slight longer files but is a bit easier to build indexes for, is to only use the 1-bits as deltas. Then the deltas are always a&centerdot;2k for some a, possibly 0, where a is the number of consecutive 1 bits preceding the suffix code. The index then consists of the locations of every Nth 1-bit in the bit vector, and the corresponding index into the suffix vector (i.e. the index of the suffix corresponding with the next 0 in the bit vector).

One option that worked well for me in the past was to store a list of 64-bit integers as 8 different lists of 8-bit values. You store the high 8 bits of the numbers, then the next 8 bits, etc. For example, say you have the following 32-bit numbers:
0x12345678
0x12349785
0x13111111
0x13444444
The data stored would be (in hex):
12,12,13,13
34,34,11,44
56,97,11,44
78,85,11,44
I then ran that through the deflate compressor.
I don't recall what compression ratios I was able to achieve with this, but it was significantly better than compressing the numbers themselves.

I want to add another answer with the simplest possible solution:
Convert the numbers to deltas as discussed previously
Run it through the 7-zip LZMA2 algorithm. It is even multi-core ready
I think this will give almost perfect results in your case because the distances have a simple distribution. 7-zip will be able to pick it up.

You can use Delta Encoding and Protocol Buffers simply.
Like your example: 4, 20, 23, 40, 66.
Delta Encoding compressed: 4, 16, 3, 17, 26.
Then you store all numbers as varint in Protocol Buffers directly. Only need 1 byte for number between 0-127. And 2 bytes for number between 128-16384... This is enough for most scenes.
Further more you can use entropy coding(huffman) to achieve more effective compression rate than varint. Even less than 8bits per number.
Divide a number to 2 part. Like 17=...0001 0001(binary)=(5)0001. The first part (5) is valid bit count. The suffix part (0001) is without the leading 1.
Like the example: 4, 16, 3, 17, 26 = (3)00 (5)0000 (2)1 (5)0001 (5)1010
The first part will be between 0-45 even there are a lot of numbers. So they can be compressed by entropy coding like huffman effectively.

If your sequence is made up of pseudo-random numbers, such as might be generated by a typical digital computer, then I don't think that any compression scheme will beat, for brevity of representation, simply storing the code for the generator and whatever parameters you need to define its initial state.
If your sequence is made up of truly random numbers generated in some non-deterministic way then the other answers already posted offer a variety of good advice.

random permutation

I would like to genrate a random permutation as fast as possible.
The problem: The knuth shuffle which is O(n) involves generating n random numbers.
Since generating random numbers is quite expensive.
I would like to find an O(n) function involving a fixed O(1) amount of random numbers.
I realize that this question has been asked before, but I did not see any relevant answers.
Just to stress a point: I am not looking for anything less than O(n), just an algorithm involving less generation of random numbers.
Thanks

Create a 1-1 mapping of each permutation to a number from 1 to n! (n factorial). Generate a random number in 1 to n!, use the mapping, get the permutation.
For the mapping, perhaps this will be useful: http://en.wikipedia.org/wiki/Permutation#Numbering_permutations
Of course, this would get out of hand quickly, as n! can become really large soon.

Generating a random number takes long time you say? The implementation of Javas Random.nextInt is roughly
oldseed = seed;
nextseed = (oldseed * multiplier + addend) & mask;
return (int)(nextseed >>> (48 - bits));
Is that too much work to do for each element?

See https://doi.org/10.1145/3009909 for a careful analysis of the number of random bits required to generate a random permutation. (It's open-access, but it's not easy reading! Bottom line: if carefully implemented, all of the usual methods for generating random permutations are efficient in their use of random bits.)
And... if your goal is to generate a random permutation rapidly for large N, I'd suggest you try the MergeShuffle algorithm. An article published in 2015 claimed a factor-of-two speedup over Fisher-Yates in both parallel and sequential implementations, and a significant speedup in sequential computations over the other standard algorithm they tested (Rao-Sandelius).
An implementation of MergeShuffle (and of the usual Fisher-Yates and Rao-Sandelius algorithms) is available at https://github.com/axel-bacher/mergeshuffle. But caveat emptor! The authors are theoreticians, not software engineers. They have published their experimental code to github but aren't maintaining it. Someday, I imagine someone (perhaps you!) will add MergeShuffle to GSL. At present gsl_ran_shuffle() is an implementation of Fisher-Yates, see https://www.gnu.org/software/gsl/doc/html/randist.html?highlight=gsl_ran_shuffle.

Not what you asked exactly, but if provided random number generator doesn't satisfy you, may be you should try something different. Generally, pseudorandom number generation can be very simple.
Probably, best-known algorithm
http://en.wikipedia.org/wiki/Linear_congruential_generator
More
http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators

As other answers suggest, you can make a random integer in the range 0 to N! and use it to produce a shuffle. Although theoretically correct, this won't be faster in general since N! grows fast and you'll spend all your time doing bigint arithmetic.
If you want speed and you don't mind trading off some randomness, you will be much better off using a less good random number generator. A linear congruential generator (see http://en.wikipedia.org/wiki/Linear_congruential_generator) will give you a random number in a few cycles.

Usually there is no need in full-range of next random value, so to use exactly the same amount of randomness you can use next approach (which is almost like random(0,N!), I guess):
// ...
m = 1; // range of random buffer (single variant)
r = 0; // random buffer (number zero)
// ...
for(/* ... */) {
while (m < n) { // range of our buffer is too narrow for "n"
r = r*RAND_MAX + random(); // add another random to our random-buffer
m *= RAND_MAX; // update range of random-buffer
}
x = r % n; // pull-out next random with range "n"
r /= n; // remove it from random-buffer
m /= n; // fix range of random-buffer
// ...
}
P.S. of course there will be some errors related with division by value different from 2^n, but they will be distributed among resulted samples.

Generate N numbers (N < of the number of random number you need) before to do the computation, or store them in an array as data, with your slow but good random generator; then pick up a number simply incrementing an index into the array inside your computing loop; if you need different seeds, create multiple tables.

Are you sure that your mathematical and algorithmical approach to the problem is correct?
I hit exactly same problem where Fisher–Yates shuffle will be bottleneck in corner cases. But for me the real problem is brute force algorithm that doesn't scale well to all problems. Following story explains the problem and optimizations that I have come up with so far.
Dealing cards for 4 players
Number of possible deals is 96 bit number. That puts quite a stress for random number generator to avoid statical anomalies when selecting play plan from generated sample set of deals. I choose to use 2xmt19937_64 seeded from /dev/random because of the long period and heavy advertisement in web that it is good for scientific simulations.
Simple approach is to use Fisher–Yates shuffle to generate deals and filter out deals that don't match already collected information. Knuth shuffle takes ~1400 CPU cycles per deal mostly because I have to generate 51 random numbers and swap 51 times entries in the table.
That doesn't matter for normal cases where I would only need to generate 10000-100000 deals in 7 minutes. But there is extreme cases when filters may select only very small subset of hands requiring huge number of deals to be generated.
Using single number for multiple cards
When profiling with callgrind (valgrind) I noticed that main slow down was C++ random number generator (after switching away from std::uniform_int_distribution that was first bottleneck).
Then I came up with idea that I can use single random number for multiple cards. The idea is to use least significant information from the number first and then erase that information.
int number = uniform_rng(0, 52*51*50*49);
int card1 = number % 52;
number /= 52;
int cards2 = number % 51;
number /= 51;
......
Of course that is only minor optimization because generation is still O(N).
Generation using bit permutations
Next idea was exactly solution asked in here but I ended up still with O(N) but with larger cost than original shuffle. But lets look into solution and why it fails so miserably.
I decided to use idea Dealing All the Deals by John Christman
void Deal::generate()
{
// 52:26 split, 52!/(26!)**2 = 495,918,532,948,1041
max = 495918532948104LU;
partner = uniform_rng(eng1, max);
// 2x 26:13 splits, (26!)**2/(13!)**2 = 10,400,600**2
max = 10400600LU*10400600LU;
hands = uniform_rng(eng2, max);
// Create 104 bit presentation of deal (2 bits per card)
select_deal(id, partner, hands);
}
So far good and pretty good looking but select_deal implementation is PITA.
void select_deal(Id &new_id, uint64_t partner, uint64_t hands)
{
unsigned idx;
unsigned e, n, ns = 26;
e = n = 13;
// Figure out partnership who owns which card
for (idx = CARDS_IN_SUIT*NUM_SUITS; idx > 0; ) {
uint64_t cut = ncr(idx - 1, ns);
if (partner >= cut) {
partner -= cut;
// Figure out if N or S holds the card
ns--;
cut = ncr(ns, n) * 10400600LU;
if (hands > cut) {
hands -= cut;
n--;
} else
new_id[idx%NUM_SUITS] |= 1 << (idx/NUM_SUITS);
} else
new_id[idx%NUM_SUITS + NUM_SUITS] |= 1 << (idx/NUM_SUITS);
idx--;
}
unsigned ew = 26;
// Figure out if E or W holds a card
for (idx = CARDS_IN_SUIT*NUM_SUITS; idx-- > 0; ) {
if (new_id[idx%NUM_SUITS + NUM_SUITS] & (1 << (idx/NUM_SUITS))) {
uint64_t cut = ncr(--ew, e);
if (hands >= cut) {
hands -= cut;
e--;
} else
new_id[idx%NUM_SUITS] |= 1 << (idx/NUM_SUITS);
}
}
}
Now that I had the O(N) permutation solution done to prove algorithm could work I started searching for O(1) mapping from random number to bit permutation. Too bad it looks like only solution would be using huge lookup tables that would kill CPU caches. That doesn't sound good idea for AI that will be using very large amount of caches for double dummy analyzer.
Mathematical solution
After all hard work to figure out how to generate random bit permutations I decided go back to maths. It is entirely possible to apply filters before dealing cards. That requires splitting deals to manageable number of layered sets and selecting between sets based on their relative probabilities after filtering out impossible sets.
I don't yet have code ready for that to tests how much cycles I'm wasting in common case where filter is selecting major part of deal. But I believe this approach gives the most stable generation performance keeping the cost less than 0.1%.

Generate a 32 bit integer. For each index i (maybe only up to half the number of elements in the array), if bit i % 32 is 1, swap i with n - i - 1.
Of course, this might not be random enough for your purposes. You could probably improve this by not swapping with n - i - 1, but rather by another function applied to n and i that gives better distribution. You could even use two functions: one for when the bit is 0 and another for when it's 1.

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?
Preliminary
I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.
At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.
Details
I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).
The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.
Use Cases
Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.
Research
Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.
In particular, there are the following things.
First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).
Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).
Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.
To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?

Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.
void radixSort(string[] seqs, size_t base = 0) {
if(seqs.length == 0)
return;
size_t TPos = seqs.length, APos = 0;
size_t i = 0;
while(i < TPos) {
if(seqs[i][base] == 'A') {
swap(seqs[i], seqs[APos++]);
i++;
}
else if(seqs[i][base] == 'T') {
swap(seqs[i], seqs[--TPos]);
} else i++;
}
i = APos;
size_t CPos = APos;
while(i < TPos) {
if(seqs[i][base] == 'C') {
swap(seqs[i], seqs[CPos++]);
}
i++;
}
if(base < seqs[0].length - 1) {
radixSort(seqs[0..APos], base + 1);
radixSort(seqs[APos..CPos], base + 1);
radixSort(seqs[CPos..TPos], base + 1);
radixSort(seqs[TPos..seqs.length], base + 1);
}
}
Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.
Edit:
I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.

I've never seen an in-place radix sort, and from the nature of the radix-sort I doubt that it is much faster than a out of place sort as long as the temporary array fits into memory.
Reason:
The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. If it's in place or not will not change this effect.
I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).
That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.
I have no idea if it works out in practice though.
Btw: If you're dealing with DNA strings only: You can compress a char into two bits and pack your data quite a lot. This will cut down the memory requirement by factor four over a naiive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache-misses anyway.

You can certainly drop the memory requirements by encoding the sequence in bits.
You are looking at permutations so, for length 2, with "ACGT" that's 16 states, or 4 bits.
For length 3, that's 64 states, which can be encoded in 6 bits. So it looks like 2 bits for each letter in the sequence, or about 32 bits for 16 characters like you said.
If there is a way to reduce the number of valid 'words', further compression may be possible.
So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64.
Initialize them to zero.
Iterate through your very very large list of 3 char sequences, and encode them as above.
Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.
Next, regenerate your list.
Iterate through the 64 buckets in order, for the count found in that bucket, generate that many instances of the sequence represented by that bucket.
when all of the buckets have been iterated, you have your sorted array.
A sequence of 4, adds 2 bits, so there would be 256 buckets.
A sequence of 5, adds 2 bits, so there would be 1024 buckets.
At some point the number of buckets will approach your limits.
If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.
I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.
Here is a hack that shows the technique
#include <iostream>
#include <iomanip>
#include <math.h>
using namespace std;
const int width = 3;
const int bucketCount = exp(width * log(4)) + 1;
int *bucket = NULL;
const char charMap[4] = {'A', 'C', 'G', 'T'};
void setup
(
void
)
{
bucket = new int[bucketCount];
memset(bucket, '\0', bucketCount * sizeof(bucket[0]));
}
void teardown
(
void
)
{
delete[] bucket;
}
void show
(
int encoded
)
{
int z;
int y;
int j;
for (z = width - 1; z >= 0; z--)
{
int n = 1;
for (y = 0; y < z; y++)
n *= 4;
j = encoded % n;
encoded -= j;
encoded /= n;
cout << charMap[encoded];
encoded = j;
}
cout << endl;
}
int main(void)
{
// Sort this sequence
const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";
size_t testSequenceLength = strlen(testSequence);
setup();
// load the sequences into the buckets
size_t z;
for (z = 0; z < testSequenceLength; z += width)
{
int encoding = 0;
size_t y;
for (y = 0; y < width; y++)
{
encoding *= 4;
switch (*(testSequence + z + y))
{
case 'A' : encoding += 0; break;
case 'C' : encoding += 1; break;
case 'G' : encoding += 2; break;
case 'T' : encoding += 3; break;
default : abort();
};
}
bucket[encoding]++;
}
/* show the sorted sequences */
for (z = 0; z < bucketCount; z++)
{
while (bucket[z] > 0)
{
show(z);
bucket[z]--;
}
}
teardown();
return 0;
}

If your data set is so big, then I would think that a disk-based buffer approach would be best:
sort(List<string> elements, int prefix)
if (elements.Count < THRESHOLD)
return InMemoryRadixSort(elements, prefix)
else
return DiskBackedRadixSort(elements, prefix)
DiskBackedRadixSort(elements, prefix)
DiskBackedBuffer<string>[] buckets
foreach (element in elements)
buckets[element.MSB(prefix)].Add(element);
List<string> ret
foreach (bucket in buckets)
ret.Add(sort(bucket, prefix + 1))
return ret
I would also experiment grouping into a larger number of buckets, for instance, if your string was:
GATTACA
the first MSB call would return the bucket for GATT (256 total buckets), that way you make fewer branches of the disk based buffer. This may or may not improve performance, so experiment with it.

I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:
You control the reading of the data
You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.
The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.
Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), and then you can build a chunk of the heap while you are waiting for the next chunk of data to come in - even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.
Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunks and chunks as you go.
That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(
See the Wikipedia articles:
Heapsort
Binary heap

"Radix sorting with no extra space" is a paper addressing your problem.

Performance-wise you might want to look at a more general string-comparison sorting algorithms.
Currently you wind up touching every element of every string, but you can do better!
In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.
A decent general purpose implementation of burstsort is available on source forge at http://sourceforge.net/projects/burstsort/ - but it is not in-place.
For comparison purposes, The C-burstsort implementation covered at http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.

You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.
Strings comprised of the four nucleotide letters A, C, G, and T can be specially encoded into Integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.

You might try using a trie. Sorting the data is simply iterating through the dataset and inserting it; the structure is naturally sorted, and you can think of it as similar to a B-Tree (except instead of making comparisons, you always use pointer indirections).
Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.

I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.

It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort". It's described here: Engineering Radix Sort. The general idea is to do 2 passes on each character - first count how many of each you have, so you can subdivide the input array into bins. Then go through again, swapping each element into the correct bin. Now recursively sort each bin on the next character position.

Radix-Sort is not cache conscious and is not the fastest sort algorithm for large sets.
You can look at:
ti7qsort. ti7qsort is the fastest sort for integers (can be used for small-fixed size strings).
Inline QSORT
String sorting
You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.

dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.
I suggest a very simple approach:
Empirically estimate the largest size m for which a radix sort is efficient.
Read blocks of m elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.
Mergesort the resulting sorted blocks.
Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require 2n space to sort n items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.
Please note finally that mergesort can be implemented without recursion, and in fact doing it this way makes clear the true linear memory access pattern.

First, think about the coding of your problem. Get rid of the strings, replace them by a binary representation. Use the first byte to indicate length+encoding. Alternatively, use a fixed length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to not have exception handling at the hot spot of the inner loop.
OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.
Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:
Encoding with 7 length bits of variable-length strings. As they fill up, you replace them by:
Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
Bitmap encoding of the last three characters of a string.
For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.
This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.
For even more performance, you might want to add different block types and a larger size of block. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off
You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.
Are you interested in handling doubles? With short sequences, there are going to be. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.

While the accepted answer perfectly answers the description of the problem, I've reached this place looking in vain for an algorithm to partition inline an array into N parts. I've written one myself, so here it is.
Warning: this is not a stable partitioning algorithm, so for multilevel partitioning, one must repartition each resulting partition instead of the whole array. The advantage is that it is inline.
The way it helps with the question posed is that you can repeatedly partition inline based on a letter of the string, then sort the partitions when they are small enough with the algorithm of your choice.
function partitionInPlace(input, partitionFunction, numPartitions, startIndex=0, endIndex=-1) {
if (endIndex===-1) endIndex=input.length;
const starts = Array.from({ length: numPartitions + 1 }, () => 0);
for (let i = startIndex; i < endIndex; i++) {
const val = input[i];
const partByte = partitionFunction(val);
starts[partByte]++;
}
let prev = startIndex;
for (let i = 0; i < numPartitions; i++) {
const p = prev;
prev += starts[i];
starts[i] = p;
}
const indexes = [...starts];
starts[numPartitions] = prev;
let bucket = 0;
while (bucket < numPartitions) {
const start = starts[bucket];
const end = starts[bucket + 1];
if (end - start < 1) {
bucket++;
continue;
}
let index = indexes[bucket];
if (index === end) {
bucket++;
continue;
}
let val = input[index];
let destBucket = partitionFunction(val);
if (destBucket === bucket) {
indexes[bucket] = index + 1;
continue;
}
let dest;
do {
dest = indexes[destBucket] - 1;
let destVal;
let destValBucket = destBucket;
while (destValBucket === destBucket) {
dest++;
destVal = input[dest];
destValBucket = partitionFunction(destVal);
}
input[dest] = val;
indexes[destBucket] = dest + 1;
val = destVal;
destBucket = destValBucket;
} while (dest !== index)
}
return starts;
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio