SSE2: How To Load Data From Non-Contiguous Memory Locations?

SSE2: How To Load Data From Non-Contiguous Memory Locations? - performance

I'm trying to vectorize some extremely performance critical code. At a high level, each loop iteration reads six floats from non-contiguous positions in a small array, then converts these values to double precision and adds them to six different double precision accumulators. These accumulators are the same across iterations, so they can live in registers. Due to the nature of the algorithm, it's not feasible to make the memory access pattern contiguous. The array is small enough to fit in L1 cache, though, so memory latency/bandwidth isn't a bottleneck.
I'm willing to use assembly language or SSE2 intrinsics to parallelize this. I know I need to load two floats at a time into the two lower dwords of an XMM register, convert them to two doubles using cvtps2pd, then add them to two accumulators at a time using addpd.
My question is, how do I get the two floats into the two lower dwords of a single XMM register if they aren't adjacent to each other in memory? Obviously any technique that's so slow that it defeats the purpose of parallelization isn't useful. An answer in either ASM or Intel/GCC intrinsics would be appreciated.
EDIT:
The size of the float array is, strictly speaking, not known at compile time but it's almost always 256, so this can be special cased.
The element of the float array that should be read is determined by loading a value from a byte array. There are six byte arrays, one for each accumulator. The reads from the byte array are sequential, one from each array for each loop iteration, so there shouldn't be many cache misses there.
The access pattern of the float array is for all practical purposes random.

For this specific case, take a look at the unpack-and-interleave instructions in your instruction reference manual. It would be something like
movss xmm0, <addr1>
movss xmm1, <addr2>
unpcklps xmm0, xmm1
Also take a look at shufps, which is handy whenever you have the data you want in the wrong order.

I think it would be interesting to see how this performs using the lookup functions from Agner Fog's Vector Class Library. It's not a library you need compile and link in. It's just a collection of header files. If you drop the header files into your source code directory then the following code should compile. The code below loads 16 bytes at a time from each of the six byte arrays, extends them to 32-bit integers (because the lookup function requires that), and then gathers floats for each of the six accumulators. You could probably extend this to AVX as well. I don't know if it this will be any better in performance (it could be worse). My guess is that if there was a regular pattern it could help (in that case the gather function would be better) but in any case it's worth a try.
#include "vectorclass.h"
int main() {
const int n = 16*10;
float x[256];
char b[6][n];
Vec4f sum[6];
for(int i=0; i<6; i++) sum[i] = 0;
for(int i=0; i<n; i+=16) {
Vec4i in[6][4];
for(int j=0; j<6; j++) {
Vec16c b16 = Vec16uc().load(&b[j][i]);
Vec8s low,high;
low = extend_low(b16);
high = extend_high(b16);
in[j][0] = extend_low(low);
in[j][1] = extend_high(low);
in[j][2] = extend_low(high);
in[j][3] = extend_high(high);
}
for(int j=0; j<4; j++) {
sum[0] += lookup<256>(in[0][j], x);
sum[1] += lookup<256>(in[1][j], x);
sum[2] += lookup<256>(in[2][j], x);
sum[3] += lookup<256>(in[3][j], x);
sum[4] += lookup<256>(in[4][j], x);
sum[5] += lookup<256>(in[5][j], x);
}
}
}

Related

GLSL: Why is a random write in local array significantly slower than a looped write?

Lets look at a simplified example function in GLSL:
void foo() {
vec2 localData[16];
// ...
int i = ... // somehow dependent on dynamic data (not known at compile time)
localData[i] = x; // THE IMPORTANT LINE
}
It writes some value x to a dynamic determined index in a local array.
Now, replacing the line localData[i] = x; with
for( int j = 0; j < 16; ++j )
if( i == j )
localData[j] = x;
makes the code significantly faster. In several tested examples (different shaders) the execution time almost halved and there were much more things going on than this write.
For example: in an order-independent transparency shader which, among other things, fetches 16 texels the timings are 39ms with the direct write and 23ms with the looped write. Nothing else changed!
The test hardware is an GTX1080. The assembly returned by glGetProgramBinary is still too high-level. It contains one line in the first case and a loop+if surrounding an identical line in the second.
Why does this performance issue happen?
Is this true for all vendors?
Guess: localData is stored in 8 vec4 registers (the assembly does not say anything about that). Further I assume, that registers cannot be addressed with an index. If both are true, than the final binary must use some branch construct. The loop variant might be unrolled and result in a switch-like pattern which is faster. But is that common for all vendors? Why can't the compiler use whatever results from the for loop as the default for such writes?

Further experiments have shown that the reason is the use of a different memory type for the array. The (unrolled) looped variant uses registers, while the random access variant switches to local memory.
Local memory is usual placed in the global one, but private to each thread. It is likely that accesses to this local array are going to be cached (L2?).
The experiments to verify this reasoning were the following:
Manual versions of unrolled loops (measured in an insertion sort with 16 elements over 1M pixels):
Base line: localData[i] = x 33ms
For loop: for j + if i=j 16.8ms
Switch: switch(i) { case 0: localData[0] ...: 16.92ms
If else tree (splitting in halves): 16.92ms
If list (plain manual unrolled): 16.8ms
=> All kinds of branch constructs result in more or less the same timings. So it is not a bad branching behavior as initially guessed.
Multiple vs. one vs no random access (32 element insertion sort)
2x localData[i] = x 47ms
1x localData[i] = x 45ms
0x localData[i] = x 16ms
=> As long as there is at least one random access the performance will be bad. This means there is a global decision changing the behavior of localData -- most likely the use of a different memory. Using more than one random access does not make things worse much, because of caching.

Fast bit permutation

I need to store and apply permutations to 16-bit integers. The best solution I came up with is to store permutation as 64-bit integer where each 4 bits correspond to the new position of i-th bit, the application would look like:
int16 permute(int16 bits, int64 perm)
{
int16 result = 0;
for(int i = 0; i < 16; ++i)
result |= ((bits >> i) & 1) * (1 << int( (perm >> (i*4))&0xf ));
return result;
}
is there a faster way to do this? Thank you.

There are alternatives.
Any permutation can be handled by a Beneš network, and encoded as the masks that are the inputs to the multiplexers to apply the shuffle. This can be done reasonably efficiently in software too (not great but OK), it's just a bunch of butterfly permutations. The masks are a bit tricky to compute, but probably faster to apply than moving every bit on its own, though that depends on how many bits you're dealing with and 16 is not a lot.
Some smaller categories of shuffles can be handled by simpler (faster) networks, which you can also find on that page.
Finally in practice, on modern x86 hardware, there is the highly versatile pshufb function which can apply a permutation (but may include dupes and zeroes) to 16 bytes in (typically) a single cycle. It is slightly awkward to distribute the bits over the bytes, but once you're there it only takes a pshufb to permute and a pmovmskb to compress it back down to 16 bits.

How to parallelise a nested loop with cross element dependencies in cuda?

I'm a beginner at cuda and am having some difficulties with it
If I have an input vector A and a result vector B both with size N, and B[i] depends on all elements of A except A[i], how can I code this without having to call a kernel multiple times inside a serial for loop? I can't think of a way to paralelise both the outer and inner loop simultaneously.
edit: Have a device with cc 2.0
example:
// a = some stuff
int i;
int j;
double result = 0;
for(i=0; i<1000; i++) {
double ai = a[i];
for(j=0; j<1000; j++) {
double aj = a[j];
if (i == j)
continue;
result += ai - aj;
}
}
I have this at the moment:
//in host
int i;
for(i=0; i<1000; i++) {
kernelFunc <<<2, 500>>> (i, d_a)
}
Is there a way to eliminate the serial loop?

Something like this should work, I think:
__global__ void my_diffs(const double *a, double *b, const length){
unsigned idx = threadIdx.x + blockDim.x*blockIdx.x;
if (idx < length){
double my_a = a[idx];
double result = 0.0;
for (int j=0; j<length; j++)
result += my_a - a[j];
b[idx] = result;
}
}
(written in browser, not tested)
This can possibly be further optimized in a couple ways, however for cc 2.0 and newer devices that have L1 cache, the benefits of these optimizations might be small:
use shared memory - we can reduce the number of global loads to one per element per block. However, the initial loads will be cached in L1, and your data set is quite small (1000 double elements ?) so the benefits might be limited
create an offset indexing scheme, so each thread is using a different element from the cacheline to create coalesced access (i.e. modify j index for each thread). Again, for cc 2.0 and newer devices, this may not help much, due to L1 cache as well as the ability to broadcast warp global reads.
If you must use a cc 1.x device, then you'll get significant mileage out of one or more optimizations -- the code I've shown here will run noticeably slower in that case.
Note that I've chosen not to bother with the special case where we are subtracting a[i] from itself, as that should be approximately zero anyway, and should not disturb your results. If you're concerned about that, you can special-case it out, easily enough.
You'll also get more performance if you increase the blocks and reduce the threads per block, perhaps something like this:
my_diffs<<<8,128>>>(d_a, d_b, len);
The reason for this is that many GPUs have more than 1 or 2 SMs. To maximize perf on these GPUs with such a small data set, we want to try and get at least one block launched on each SM. Having more blocks in the grid makes this more likely.
If you want to fully parallelize the computation, the approach would be to create a 2D matrix (let's call it c[...]) in GPU memory, of square dimensions equal to the length of your vector. I would then create a 2D grid of threads, and have each thread perform the subtraction (a[row] - a[col]) and store it's result in c[row*len+col]. I would then launch a second (1D) kernel to sum the columns of c (each thread has a loop to sum a column) to create the result vector b. However I'm not sure this would be any faster than the approach I've outlined. Such a "more fully parallelized" approach also wouldn't lend itself as easily to the optimizations I discussed.

set RNG state with openMP and Rcpp

I have a clarification question.
It is my understanding, that sourceCpp automatically passes on the RNG state, so that set.seed(123) gives me reproducible random numbers when calling Rcpp code. When compiling a package, I have to add a set RNG statement.
Now how does this all work with openMP either in sourceCpp or within a package?
Consider the following Rcpp code
#include <Rcpp.h>
#include <omp.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp1(int n, double mu, double sigma ){
Rcpp::NumericVector out(n);
for (int i=0; i < n; i++) {
out(i) =R::rnorm(mu,sigma);
}
return(out);
}
// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp2(int n, double mu, double sigma, int cores=1 ){
omp_set_num_threads(cores);
Rcpp::NumericVector out(n);
#pragma omp parallel for schedule(dynamic)
for (int i=0; i < n; i++) {
out(i) =R::rnorm(mu,sigma);
}
return(out);
}
And then run
set.seed(123)
a1=rnormrcpp1(100,2,3,2)
set.seed(123)
a2=rnormrcpp1(100,2,3,2)
set.seed(123)
a3=rnormrcpp2(100,2,3,2)
set.seed(123)
a4=rnormrcpp2(100,2,3,2)
all.equal(a1,a2)
all.equal(a3,a4)
While a1 and a2 are identical, a3 and a4 are not. How can I adjust the RNG state with the openMP loop? Can I?

To expand on what Dirk Eddelbuettel has already said, it is next to impossible to both generate the same PRN sequence in parallel and have the desired speed-up. The root of this is that generation of PRN sequences is essentially a sequential process where each state depends on the previous one and this creates a backward dependence chain that reaches back as far as the initial seeding state.
There are two basic solutions to this problem. One of them requires a lot of memory and the other one requires a lot of CPU time and both are actually more like workarounds than true solutions:
pregenerated PRN sequence: One thread generates sequentially a huge array of PRNs and then all threads access this array in a manner that would be consistent with the sequential case. This method requires lots of memory in order to store the sequence. Another option would be to have the sequence stored into a disk file that is later memory-mapped. The latter method has the advantage that it saves some compute time, but generally I/O operations are slow, so it only makes sense on machines with limited processing power or with small amounts of RAM.
prewound PRNGs: This one works well in cases when work is being statically distributed among the threads, e.g. with schedule(static). Each thread has its own PRNG and all PRNGs are seeded with the same initial seed. Then each thread draws as many dummy PRNs as its starting iteration, essentially prewinding its PRNG to the correct position. For example:
thread 0: draws 0 dummy PRNs, then draws 100 PRNs and fills out(0:99)
thread 1: draws 100 dummy PRNs, then draws 100 PRNs and fills out(100:199)
thread 2: draws 200 dummy PRNs, then draws 100 PRNs and fills out(200:299)
and so on. This method works well when each thread does a lot of computations besides drawing the PRNs since the time to prewind the PRNG could be substantial in some cases (e.g. with many iterations).
A third option exists for the case when there is a lot of data processing besides drawing a PRN. This one uses OpenMP ordered loops (note that the iteration chunk size is set to 1):
#pragma omp parallel for ordered schedule(static,1)
for (int i=0; i < n; i++) {
#pragma omp ordered
{
rnum = R::rnorm(mu,sigma);
}
out(i) = lots of processing on rnum
}
Although loop ordering essentially serialises the computation, it still allows for lots of processing on rnum to execute in parallel and hence parallel speed-up would be observed. See this answer for a better explanation as to why so.

Yes, sourceCpp() etc and an instantiation of RNGScope so the RNGs are left in a proper state.
And yes one can do OpenMP. But inside of OpenMP segment you cannot control in which order the threads are executed -- so you longer the same sequence. I have the same problem with a package under development where I would like to have reproducible draws yet use OpenMP. But it seems you can't.

In-Place Radix Sort

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?
Preliminary
I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.
At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.
Details
I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).
The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.
Use Cases
Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.
Research
Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.
In particular, there are the following things.
First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).
Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).
Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.
To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?

Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.
void radixSort(string[] seqs, size_t base = 0) {
if(seqs.length == 0)
return;
size_t TPos = seqs.length, APos = 0;
size_t i = 0;
while(i < TPos) {
if(seqs[i][base] == 'A') {
swap(seqs[i], seqs[APos++]);
i++;
}
else if(seqs[i][base] == 'T') {
swap(seqs[i], seqs[--TPos]);
} else i++;
}
i = APos;
size_t CPos = APos;
while(i < TPos) {
if(seqs[i][base] == 'C') {
swap(seqs[i], seqs[CPos++]);
}
i++;
}
if(base < seqs[0].length - 1) {
radixSort(seqs[0..APos], base + 1);
radixSort(seqs[APos..CPos], base + 1);
radixSort(seqs[CPos..TPos], base + 1);
radixSort(seqs[TPos..seqs.length], base + 1);
}
}
Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.
Edit:
I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.

I've never seen an in-place radix sort, and from the nature of the radix-sort I doubt that it is much faster than a out of place sort as long as the temporary array fits into memory.
Reason:
The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. If it's in place or not will not change this effect.
I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).
That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.
I have no idea if it works out in practice though.
Btw: If you're dealing with DNA strings only: You can compress a char into two bits and pack your data quite a lot. This will cut down the memory requirement by factor four over a naiive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache-misses anyway.

You can certainly drop the memory requirements by encoding the sequence in bits.
You are looking at permutations so, for length 2, with "ACGT" that's 16 states, or 4 bits.
For length 3, that's 64 states, which can be encoded in 6 bits. So it looks like 2 bits for each letter in the sequence, or about 32 bits for 16 characters like you said.
If there is a way to reduce the number of valid 'words', further compression may be possible.
So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64.
Initialize them to zero.
Iterate through your very very large list of 3 char sequences, and encode them as above.
Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.
Next, regenerate your list.
Iterate through the 64 buckets in order, for the count found in that bucket, generate that many instances of the sequence represented by that bucket.
when all of the buckets have been iterated, you have your sorted array.
A sequence of 4, adds 2 bits, so there would be 256 buckets.
A sequence of 5, adds 2 bits, so there would be 1024 buckets.
At some point the number of buckets will approach your limits.
If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.
I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.
Here is a hack that shows the technique
#include <iostream>
#include <iomanip>
#include <math.h>
using namespace std;
const int width = 3;
const int bucketCount = exp(width * log(4)) + 1;
int *bucket = NULL;
const char charMap[4] = {'A', 'C', 'G', 'T'};
void setup
(
void
)
{
bucket = new int[bucketCount];
memset(bucket, '\0', bucketCount * sizeof(bucket[0]));
}
void teardown
(
void
)
{
delete[] bucket;
}
void show
(
int encoded
)
{
int z;
int y;
int j;
for (z = width - 1; z >= 0; z--)
{
int n = 1;
for (y = 0; y < z; y++)
n *= 4;
j = encoded % n;
encoded -= j;
encoded /= n;
cout << charMap[encoded];
encoded = j;
}
cout << endl;
}
int main(void)
{
// Sort this sequence
const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";
size_t testSequenceLength = strlen(testSequence);
setup();
// load the sequences into the buckets
size_t z;
for (z = 0; z < testSequenceLength; z += width)
{
int encoding = 0;
size_t y;
for (y = 0; y < width; y++)
{
encoding *= 4;
switch (*(testSequence + z + y))
{
case 'A' : encoding += 0; break;
case 'C' : encoding += 1; break;
case 'G' : encoding += 2; break;
case 'T' : encoding += 3; break;
default : abort();
};
}
bucket[encoding]++;
}
/* show the sorted sequences */
for (z = 0; z < bucketCount; z++)
{
while (bucket[z] > 0)
{
show(z);
bucket[z]--;
}
}
teardown();
return 0;
}

If your data set is so big, then I would think that a disk-based buffer approach would be best:
sort(List<string> elements, int prefix)
if (elements.Count < THRESHOLD)
return InMemoryRadixSort(elements, prefix)
else
return DiskBackedRadixSort(elements, prefix)
DiskBackedRadixSort(elements, prefix)
DiskBackedBuffer<string>[] buckets
foreach (element in elements)
buckets[element.MSB(prefix)].Add(element);
List<string> ret
foreach (bucket in buckets)
ret.Add(sort(bucket, prefix + 1))
return ret
I would also experiment grouping into a larger number of buckets, for instance, if your string was:
GATTACA
the first MSB call would return the bucket for GATT (256 total buckets), that way you make fewer branches of the disk based buffer. This may or may not improve performance, so experiment with it.

I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:
You control the reading of the data
You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.
The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.
Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), and then you can build a chunk of the heap while you are waiting for the next chunk of data to come in - even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.
Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunks and chunks as you go.
That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(
See the Wikipedia articles:
Heapsort
Binary heap

"Radix sorting with no extra space" is a paper addressing your problem.

Performance-wise you might want to look at a more general string-comparison sorting algorithms.
Currently you wind up touching every element of every string, but you can do better!
In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.
A decent general purpose implementation of burstsort is available on source forge at http://sourceforge.net/projects/burstsort/ - but it is not in-place.
For comparison purposes, The C-burstsort implementation covered at http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.

You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.
Strings comprised of the four nucleotide letters A, C, G, and T can be specially encoded into Integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.

You might try using a trie. Sorting the data is simply iterating through the dataset and inserting it; the structure is naturally sorted, and you can think of it as similar to a B-Tree (except instead of making comparisons, you always use pointer indirections).
Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.

I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.

It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort". It's described here: Engineering Radix Sort. The general idea is to do 2 passes on each character - first count how many of each you have, so you can subdivide the input array into bins. Then go through again, swapping each element into the correct bin. Now recursively sort each bin on the next character position.

Radix-Sort is not cache conscious and is not the fastest sort algorithm for large sets.
You can look at:
ti7qsort. ti7qsort is the fastest sort for integers (can be used for small-fixed size strings).
Inline QSORT
String sorting
You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.

dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.
I suggest a very simple approach:
Empirically estimate the largest size m for which a radix sort is efficient.
Read blocks of m elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.
Mergesort the resulting sorted blocks.
Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require 2n space to sort n items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.
Please note finally that mergesort can be implemented without recursion, and in fact doing it this way makes clear the true linear memory access pattern.

First, think about the coding of your problem. Get rid of the strings, replace them by a binary representation. Use the first byte to indicate length+encoding. Alternatively, use a fixed length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to not have exception handling at the hot spot of the inner loop.
OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.
Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:
Encoding with 7 length bits of variable-length strings. As they fill up, you replace them by:
Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
Bitmap encoding of the last three characters of a string.
For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.
This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.
For even more performance, you might want to add different block types and a larger size of block. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off
You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.
Are you interested in handling doubles? With short sequences, there are going to be. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.

While the accepted answer perfectly answers the description of the problem, I've reached this place looking in vain for an algorithm to partition inline an array into N parts. I've written one myself, so here it is.
Warning: this is not a stable partitioning algorithm, so for multilevel partitioning, one must repartition each resulting partition instead of the whole array. The advantage is that it is inline.
The way it helps with the question posed is that you can repeatedly partition inline based on a letter of the string, then sort the partitions when they are small enough with the algorithm of your choice.
function partitionInPlace(input, partitionFunction, numPartitions, startIndex=0, endIndex=-1) {
if (endIndex===-1) endIndex=input.length;
const starts = Array.from({ length: numPartitions + 1 }, () => 0);
for (let i = startIndex; i < endIndex; i++) {
const val = input[i];
const partByte = partitionFunction(val);
starts[partByte]++;
}
let prev = startIndex;
for (let i = 0; i < numPartitions; i++) {
const p = prev;
prev += starts[i];
starts[i] = p;
}
const indexes = [...starts];
starts[numPartitions] = prev;
let bucket = 0;
while (bucket < numPartitions) {
const start = starts[bucket];
const end = starts[bucket + 1];
if (end - start < 1) {
bucket++;
continue;
}
let index = indexes[bucket];
if (index === end) {
bucket++;
continue;
}
let val = input[index];
let destBucket = partitionFunction(val);
if (destBucket === bucket) {
indexes[bucket] = index + 1;
continue;
}
let dest;
do {
dest = indexes[destBucket] - 1;
let destVal;
let destValBucket = destBucket;
while (destValBucket === destBucket) {
dest++;
destVal = input[dest];
destValBucket = partitionFunction(destVal);
}
input[dest] = val;
indexes[destBucket] = dest + 1;
val = destVal;
destBucket = destValBucket;
} while (dest !== index)
}
return starts;
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio