Cache friendly offline random read


Consider this function in C++:
void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[*b1++];
    }
}
Its purpose should be clear enough. Unfortunately, b1 contains random data and trashes the cache, making foo the bottleneck of my program. Is there any way I can optimize it?
This is an SSCCE that should resemble my actual code:
#include <cstdint>
#include <iostream>
#include <chrono>
#include <algorithm>
#include <numeric>
namespace {
void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[*b1++];
    }
}
constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;
uint32_t data[max_n], index[max_q], result[max_q];
}
int main() {
    uint32_t seed = 0;
    auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
    std::generate_n(data, max_n, rng);
    std::generate_n(index, max_q, [rng]() { return rng() % max_n; });
    auto t1 = std::chrono::high_resolution_clock::now();
    foo(data, data + max_n, index, index + max_q, result);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
    uint32_t hash = 0;
    for (unsigned i = 0; i < max_q; i++)
        hash += result[i] ^ (i << 8) ^ i;
    std::cout << hash << std::endl;
}
This is not a duplicate of "Cache-friendly copying of an array with readjustment by known index, gather, scatter", which asks about random writes and assumes b is a permutation.

First, let's take a look at the actual performance of the code above:
$ sudo perf stat ./offline-read
0.123023
1451229184
Performance counter stats for './offline-read':
184.661547 task-clock (msec) # 0.997 CPUs utilized
3 context-switches # 0.016 K/sec
0 cpu-migrations # 0.000 K/sec
717 page-faults # 0.004 M/sec
623,638,834 cycles # 3.377 GHz
419,309,952 instructions # 0.67 insn per cycle
70,803,672 branches # 383.424 M/sec
16,895 branch-misses # 0.02% of all branches
0.185129552 seconds time elapsed
We are getting a low IPC of 0.67, probably caused almost entirely by load misses to DRAM [5]. Let's confirm:
sudo ../pmu-tools/ocperf.py stat -e cycles,LLC-load-misses,cycle_activity.stalls_l3_miss ./offline-read
perf stat -e cycles,LLC-load-misses,cpu/event=0xa3,umask=0x6,cmask=6,name=cycle_activity_stalls_l3_miss/ ./offline-read
0.123979
1451229184
Performance counter stats for './offline-read':
622,661,371 cycles
16,114,063 LLC-load-misses
368,395,404 cycle_activity_stalls_l3_miss
0.184045411 seconds time elapsed
So ~370 million cycles out of ~620 million are straight-up stalled on outstanding misses. In fact, the portion of cycles stalled this way in foo() is much higher, close to 90%, since perf is also measuring the init and accumulate code, which takes about a third of the runtime (but doesn't have significant L3 misses).
This is nothing unexpected, since we knew the random-read pattern a1[*b1++] was going to have essentially zero locality. In fact, the number of LLC-load-misses is 16 million [1], corresponding almost exactly to the 16 million random reads of a1 [2].
If we just assume 100% of the time in foo() is spent waiting on memory access, we can get an idea of the total cost of each miss: 0.123 sec / 16,114,063 misses == 7.63 ns/miss. On my box, the memory latency is around 60 ns in the best case, so less than 8 ns per miss means we are already extracting a lot of memory-level parallelism (MLP): about 8 misses would have to be overlapped and in-flight on average to achieve this (even totally ignoring the additional traffic from the streaming load of b1 and streaming write of o).
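As a back-of-the-envelope check, the estimate above can be reproduced in a few lines. This is only a sketch: the helper names are made up for illustration, and the 60 ns latency is simply the figure quoted above, not a measured constant.

```cpp
#include <cassert>
#include <cstdint>

// Average wall-clock cost of one miss in nanoseconds, assuming the whole
// timed region is spent waiting on memory (so this is an upper bound).
double ns_per_miss(double seconds, std::uint64_t misses) {
    assert(misses > 0);
    return seconds * 1e9 / static_cast<double>(misses);
}

// How many misses must overlap on average for a miss with the given
// latency to cost only `observed_ns_per_miss` of wall-clock time.
double implied_mlp(double latency_ns, double observed_ns_per_miss) {
    assert(observed_ns_per_miss > 0.0);
    return latency_ns / observed_ns_per_miss;
}
```

Calling ns_per_miss(0.123, 16114063) gives roughly 7.6 ns, and implied_mlp(60.0, 7.63) gives a little under 8 overlapped misses.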
So I don't think there are many tweaks you can apply to the simple loop to do much better. Still, two possibilities are:
Non-temporal stores for the writes to o, if your platform supports them. This would cut out the reads implied by RFO for normal stores. It should be a straight win since o is never read again (inside the timed portion!).
Software prefetching. Carefully tuned prefetching of a1 or b1 could potentially help a bit. The impact is going to be fairly limited, however, since we are already approaching the limits of MLP as described above. Also, we expect the linear reads of b1 to be almost perfectly prefetched by the hardware prefetchers. The random reads of a1 seem like they could be amenable to prefetching, but in practice the ILP in the loop already leads to enough MLP through out-of-order processing (at least on big OoO processors like recent x86).
In the comments user harold already mentioned that he tried prefetching with only a small effect.
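For reference, a software-prefetch variant of the loop might look like the sketch below. It assumes GCC or Clang (for __builtin_prefetch); the function name foo_prefetch and the default distance of 16 are illustrative guesses, not tuned values, and the non-temporal store idea is left out because it needs platform-specific intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same gather as foo(), but requests the element that will be needed
// `dist` iterations from now. `dist` is a tuning knob to experiment with.
void foo_prefetch(const std::uint32_t *a, const std::uint32_t *b, std::size_t q,
                  std::uint32_t *o, std::size_t dist = 16) {
    assert(dist > 0);
    for (std::size_t k = 0; k < q; ++k) {
#if defined(__GNUC__) || defined(__clang__)
        // Read-only prefetch with low temporal locality; a no-op on other compilers.
        if (k + dist < q)
            __builtin_prefetch(&a[b[k + dist]], 0, 0);
#endif
        o[k] = a[b[k]];
    }
}
```

The result is identical to foo(); only the timing can differ, so any benefit has to be measured on the target machine.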
So since the simple tweaks aren't likely to bear much fruit, you are left with transforming the loop. One "obvious" transformation is to sort the indexes b1 (along with the index element's original position) and then do the reads from a1 in sorted order. This transforms the reads of a1 from completely random to almost linear [3], but now the writes are all random, which is no better.
Sort and then unsort
The key problem is that the reads of a1 under control of b1 are random, and since a1 is large you get a miss-to-DRAM for essentially every read. We can fix that by sorting b1, and then reading a1 in order to get a permuted result. Now you need to "un-permute" that result to get the output in the final order, which is simply another sort, this time on the "output index".
Here's a worked example with the given input array a, index array b and output array o, and i which is the (implicit) position of each element:
i = 0 1 2 3
a = [00, 10, 20, 30]
b = [ 3, 1, 0, 1]
o = [30, 10, 00, 10] (desired result)
First, sort array b, with the original array position i as secondary data (alternately you may see this as sorting tuples (b[0], 0), (b[1], 1), ...), this gives you the sorted b array b' and the sorted index list i' as shown:
i' = [ 2, 1, 3, 0]
b' = [ 0, 1, 1, 3]
Now you can read the permuted result array o' from a under the control of b'. This read is non-decreasing in order, and should be able to operate at close to memcpy speeds. In fact you may be able to take advantage of wide contiguous SIMD reads and some shuffles to do several reads at once and move the 4-byte elements into the right place (duplicating some elements and skipping others):
a = [00, 10, 20, 30]
b' = [ 0, 1, 1, 3]
o' = [00, 10, 10, 30]
Finally, you de-permute o' to get o, conceptually simply by sorting o' on the permuted indexes i':
i' = [ 2, 1, 3, 0]
o' = [00, 10, 10, 30]
i = [ 0, 1, 2, 3]
o = [30, 10, 00, 10]
Finished!
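The three steps of the worked example can be sketched directly with std::sort. This is a deliberately simple illustration (the function name gather_sorted is made up, and std::sort stands in for the cache-aware radix sort a real implementation would want), so it shows the data flow rather than the performance.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::uint32_t> gather_sorted(const std::vector<std::uint32_t> &a,
                                         const std::vector<std::uint32_t> &b) {
    // Pair each index with its original position: (b[i], i).
    std::vector<std::pair<std::uint32_t, std::uint32_t>> keyed(b.size());
    for (std::size_t i = 0; i < b.size(); ++i) {
        assert(b[i] < a.size());
        keyed[i] = {b[i], static_cast<std::uint32_t>(i)};
    }
    // Step 1: sort by index, giving b' (first) and i' (second).
    std::sort(keyed.begin(), keyed.end());
    // Step 2: read a in non-decreasing index order. `first` is reused to
    // hold o', so each pair becomes (o'[k], i'[k]).
    for (auto &p : keyed) p.first = a[p.first];
    // Step 3: de-permute by sorting on the original position i'.
    std::sort(keyed.begin(), keyed.end(),
              [](const auto &x, const auto &y) { return x.second < y.second; });
    std::vector<std::uint32_t> o(b.size());
    for (std::size_t k = 0; k < keyed.size(); ++k) o[k] = keyed[k].first;
    return o;
}
```

With a = [0, 10, 20, 30] and b = [3, 1, 0, 1] this returns [30, 10, 0, 10], matching the example above.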
Now this is the simplest idea of the technique and isn't particularly cache-friendly (each pass conceptually iterates over one or more 2^26-byte arrays), but it at least fully uses every cache line it reads (unlike the original loop which only reads a single element from a cache line, which is why you have 16 million misses even though the data only occupies 1 million cache lines!). All of the reads are more or less linear, so hardware prefetching will help a lot.
How much speedup you get probably depends largely on how well you implement the sorts: they need to be fast and cache sensitive. Almost certainly some type of cache-aware radix sort will work best.
Here are some notes on ways to improve this further:
Optimize the amount of sorting
You don't actually need to fully sort b. You just want to sort it "enough" such that the subsequent reads of a under the control of b' are more or less linear. For example, 16 elements fit in a cache line, so you don't need to sort based on the last 4 bits at all: the same linear sequence of cache lines will be read anyways. You could also sort on even fewer bits: e.g., if you ignored the 5 least-significant bits, you'd read cache lines in an "almost linear" way, sometimes swapping two cache lines from the perfectly linear pattern like: 0, 1, 3, 2, 5, 4, 6, 7. Here, you'll still get the full benefit of the L1 cache (subsequent reads to a cache line will always hit), and I suspect such a pattern would still be prefetched well and if not you can always help it with software prefetching.
You can test on your system what the optimal number of ignored bits is. Ignoring bits has two benefits:
Less work to do in the radix sort, either from fewer passes needed or needing fewer buckets in one or more passes (which helps caching).
Potentially less work to do to "undo" the permutation in the last step: if the undo is done by examining the original index array b, ignoring bits means that you get the same savings when undoing the sort.
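To make the "ignore the low bits" idea concrete, here is a sketch of a single stable counting-sort pass keyed on b[i] >> ignored_bits. The names Keyed, partial_sort_indices, key_bits and ignored_bits are all hypothetical, and a real implementation would split this into several narrower radix passes rather than use one pass with many buckets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Keyed {
    std::uint32_t index;     // value taken from b
    std::uint32_t position;  // where that value sat in b
};

// Stable counting sort of b on (b[i] >> ignored_bits) only.
// key_bits is the number of bits needed for any index (24 for 1 << 24 elements).
std::vector<Keyed> partial_sort_indices(const std::vector<std::uint32_t> &b,
                                        unsigned key_bits, unsigned ignored_bits) {
    assert(ignored_bits < key_bits && key_bits <= 32);
    const std::size_t buckets = std::size_t{1} << (key_bits - ignored_bits);
    std::vector<std::size_t> start(buckets + 1, 0);
    for (std::uint32_t v : b) {                       // histogram
        assert((std::uint64_t{v} >> key_bits) == 0);
        ++start[(v >> ignored_bits) + 1];
    }
    for (std::size_t k = 0; k < buckets; ++k)         // prefix sums
        start[k + 1] += start[k];
    std::vector<Keyed> out(b.size());
    for (std::size_t i = 0; i < b.size(); ++i)        // scatter, position written on the fly
        out[start[b[i] >> ignored_bits]++] = {b[i], static_cast<std::uint32_t>(i)};
    return out;
}
```

With ignored_bits = 4, entries that fall in the same 64-byte line of a stay in their original relative order, which is all the subsequent read of a needs.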
Cache block the work
The above description lays out everything in several sequential, disjoint passes that each work on the entire data set. In practice, you'd probably want to interleave them to get better caching behavior. For example, assuming you use an MSD radix-256 sort, you might do the first pass, sorting the data into 256 buckets of approximately 64K elements each (2^24 / 256).
Then rather than doing the full second pass, you might finish sorting only the first (or first few) buckets, and proceed to do the read of a based on the resulting block of b'. You are guaranteed that this block is contiguous (i.e., a prefix of the final sorted sequence) so you don't give up any locality in the read, and your reads will generally be cached. You may also do the first pass of de-permuting o' since the block of o' is also hot in the cache (and perhaps you can combine the latter two phases into one loop).
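A rough sketch of this bucket-at-a-time structure follows. It is simplified on purpose: the names Entry, gather_blocked and shift are invented for illustration, std::sort stands in for the remaining radix passes, and the result is written straight to o[position], so the writes to o are still scattered. A fuller version would replace that last step with the first pass of the de-permuting sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Entry {
    std::uint32_t index;     // value taken from b
    std::uint32_t position;  // original position in b
};

// One MSD pass on the top bits (bucket = index >> shift), then each bucket
// is finished and consumed while it is still hot in the cache.
void gather_blocked(const std::vector<std::uint32_t> &a,
                    const std::vector<std::uint32_t> &b,
                    std::vector<std::uint32_t> &o, unsigned shift) {
    assert(o.size() == b.size() && !a.empty() && shift < 32);
    const std::size_t buckets = ((a.size() - 1) >> shift) + 1;
    std::vector<std::size_t> start(buckets + 1, 0);
    for (std::uint32_t v : b) {
        assert(v < a.size());
        ++start[(v >> shift) + 1];
    }
    for (std::size_t k = 0; k < buckets; ++k) start[k + 1] += start[k];
    std::vector<std::size_t> next(start.begin(), start.end() - 1);
    std::vector<Entry> part(b.size());
    for (std::size_t i = 0; i < b.size(); ++i)
        part[next[b[i] >> shift]++] = {b[i], static_cast<std::uint32_t>(i)};

    for (std::size_t k = 0; k < buckets; ++k) {
        Entry *first = part.data() + start[k];
        Entry *last = part.data() + start[k + 1];
        // Finish sorting only this bucket.
        std::sort(first, last,
                  [](const Entry &x, const Entry &y) { return x.index < y.index; });
        // Read the matching block of a and store each value at its final place.
        for (Entry *e = first; e != last; ++e) o[e->position] = a[e->index];
    }
}
```

The value of shift controls the block size: a larger shift means fewer, larger buckets, and the right setting depends on the cache sizes of the machine.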
Smart De-permutation
One area for optimization is how exactly the de-permutation of o' is implemented. In the description above, we assume some index array i initially with values [0, 1, 2, ..., max_q - 1] which is sorted along with b. That's conceptually how it works, but you may not need to actually materialize i right away and sort it as auxiliary data. In the first pass of the radix sort, for example, the value of i is implicitly known (since you are iterating through the data), so it could be calculated for free [4] and written out during the first pass without ever having appeared in sorted order.
There may also be more efficient ways to do the "unsort" operation than maintaining the full index. For example, the original unsorted b array conceptually has all the information needed to do the unsort, but it is not clear to me how to use it to unsort efficiently.
Will it be faster?
So will this actually be faster than the naive approach? It depends largely on implementation details, especially the efficiency of the implemented sort. On my hardware, the naive approach is processing about 140 million elements per second. Online descriptions of cache-aware radix sorts seem to vary from perhaps 200 to 600 million elements/s, and since you need two of those, the opportunity for a big speedup would seem limited if you believe those numbers. On the other hand, those numbers are from older hardware, and for slightly more general sorts (e.g., on all 32 bits of the key, while we may be able to use as few as 16 bits).
Only a careful implementation will determine if it is feasible, and feasibility also depends on the hardware. For example, on hardware that can't sustain as much MLP, the sorting-unsorting approach becomes relatively more favorable.
The best approach also depends on the relative values of max_n and max_q. For example, if max_n >> max_q, then the reads will be "sparse" even with optimal sorting, so the naive approach would be better. On the other hand if max_n << max_q, then the same index will usually be read many times, so the sorting approach will have good read locality, the sorting steps will themselves have better locality, and further optimizations which handle duplicate reads explicitly may be possible.
Multiple Cores
It isn't clear from the question whether you are interested in parallelizing this. The naive solution for foo() already does admit a "straightforward" parallelization where you simply partition the b and o arrays into equal-sized chunks, one for each thread, which would seem to provide a perfect speedup. Unfortunately, you'll probably find that you get much worse than linear scaling, because you'll be running into resource contention in the memory controller and associated uncore/offcore resources which are shared between all cores on a socket. So it isn't clear how much more throughput you'll get for a purely parallel random read load to memory as you add more cores [6].
For the radix-sort version, most of the bottlenecks (store throughput, total instruction throughput) are in the core, so I expect it to scale reasonably with additional cores. As Peter mentioned in the comments, if you are using hyperthreading, the sort may have the additional benefit of good locality in the core-local L1 and L2 caches, effectively letting each sibling thread use the entire cache, rather than cutting the effective capacity in half. Of course, that involves carefully managing your thread affinity so that sibling threads actually use nearby data, and not just letting the scheduler do whatever it does.
1 You might ask why the LLC-load-misses isn't say 32 or 48 million, given that we also have to read all 16 million elements of b1 and then the accumulate() call reads all of result. The answer is that LLC-load-misses only counts demand misses that actually miss in the L3. The other mentioned read patterns are totally linear, so the prefetchers will always be bringing the line into the L3 before it is needed. These don't count as "LLC misses" by the definition perf uses.
2 You might want to know how I know that the load misses all come from the reads of a1 in foo: I simply used perf record and perf mem to confirm that the misses were coming from the expected assembly instruction.
3 Almost linear because b1 is not a permutation of all indexes, so in principle there can be skipped and duplicate indexes. At the cache-line level, however, it is highly likely that every cache line will be read in order, since each element has a ~63% chance of being included and a cache line has 16 4-byte elements, so there's only about a 1 in 10 million chance that any given cache line has zero elements. So prefetching, which works at the cache-line level, will work fine.
4 Here I mean that the calculation of the value comes for free or nearly so, but of course the write still costs. This is still much better than the "up-front materialization" approach, however, which first creates the i array [0, 1, 2, ...] needing max_q writes and then again needs another max_q writes to sort it in the first radix sort pass. The implicit materialization only incurs the second write.
5 In fact, the IPC of the actual timed section foo() is much lower: about 0.15 based on my calculations. The reported IPC of the entire process is an average of the IPC of the timed section and the initialization and accumulation code before and after which has a much higher IPC.
6 Notably, this is different from how a dependent-load, latency-bound workload scales: a workload that does random reads but can only have one load in progress, because each load depends on the result of the last, scales very well to multiple cores because the serial nature of the loads doesn't use many downstream resources (but such loads can conceptually also be sped up even on a single core by changing the core loop to handle more than one dependent load stream in parallel).

You can partition indices into buckets where higher bits of indices are the same. Beware that if indices are not random the buckets will overflow.
#include <cstdint>
#include <iostream>
#include <chrono>
#include <cassert>
#include <algorithm>
#include <numeric>
#include <utility>
#include <vector>
namespace {
constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;
void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[*b1++];
    }
}
uint32_t* foo_fx(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, const uint32_t b_offset, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[b_offset+(*b1++)];
    }
    return o;
}
uint32_t data[max_n], index[max_q], result[max_q];
// 16 buckets, each holding a count (first) and the low 20 bits of its indices (second)
std::pair<uint32_t, uint32_t[max_q / 8]> index_fx[16];
}
int main() {
    uint32_t seed = 0;
    auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
    std::generate_n(data, max_n, rng);
    //std::generate_n(index, max_q, [rng]() { return rng() % max_n; });
    for (size_t i = 0; i < max_q; ++i) {
        const uint32_t idx = rng() % max_n;
        const uint32_t bucket = idx >> 20;
        assert(bucket < 16);
        index_fx[bucket].second[index_fx[bucket].first] = idx % (1 << 20);
        index_fx[bucket].first++;
        assert((1 << 20)*bucket + index_fx[bucket].second[index_fx[bucket].first - 1] == idx);
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    //foo(data, data + max_n, index, index + max_q, result);
    uint32_t* result_begin = result;
    for (int i = 0; i < 16; ++i) {
        result_begin = foo_fx(data, data + max_n, index_fx[i].second, index_fx[i].second + index_fx[i].first, (1<<20)*i, result_begin);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
    std::cout << std::accumulate(result, result + max_q, 0ull) << std::endl;
}

Related

CUDA Parallel Cross Product

Disclaimer: I am fairly new to CUDA and parallel programming - so if you're not going to bother to answer my question, just ignore this, or at least point me to the right resources so I can find the answer myself.
Here's the particular problem I'm looking to solve using parallel programming. I have some 1D arrays that store 3D vectors in this format -> [v0x, v0y, v0z, ... vnx, vny, vnz], where n is the vector, and x, y, z are the respective components.
Suppose I want to find the cross product between vectors [v0, v1, ... vn] in one array and their corresponding vectors [v0, v1, ... vn] in another array.
The calculation is pretty straightforward without parallelization:
result[x] = vec1[y]*vec2[z] - vec1[z]*vec2[y];
result[y] = vec1[z]*vec2[x] - vec1[x]*vec2[z];
result[z] = vec1[x]*vec2[y] - vec1[y]*vec2[x];
The problem I'm having is understanding how to implement CUDA parallelization for the arrays I currently have. Since each value in the result vector is a separate calculation, I can effectively run the above calculation for each vector in parallel. Since each component of the resulting cross product is a separate calculation, those too could run in parallel. How should I think about setting up the blocks and threads for such a problem?

The top 2 optimization priorities for any CUDA programmer are to use memory efficiently, and expose enough parallelism to hide latency. We'll use those to guide our algorithmic choices.
A very simple thread strategy (the thread strategy answers the question, "what will each thread do or be responsible for?") in any transformation (as opposed to reduction) type problem is to have each thread be responsible for 1 output value. Your problem fits the description of transformation - the output data set size is on the order of the input data set size(s).
I'll assume that you intended to have two equal length vectors containing your 3D vectors, and that you want to take the cross product of the first 3D vectors in each and the 2nd 3D vectors in each, and so on.
If we choose a thread strategy of 1 output point per thread (i.e. result[x] or result[y] or result[z], all together would be 3 output points), then we will need 3 threads to compute the output of each vector cross product. If we have enough vectors to multiply, then we will have enough threads to keep our machine "busy" and do a good job of hiding latency. As a rule of thumb, your problem will start to become interesting on GPUs if the number of threads is 10000 or more, so this means we would want your 1D vectors to consist of about 3000 3D vectors or more. Let's assume that is the case.
In order to tackle the memory efficiency objective, our first task is to load your vector data from global memory. We will want this ideally to be coalesced, which roughly means adjacent threads access adjacent elements in memory. We'll want the output stores to be coalesced also, and our thread strategy of choosing one output point/one vector component per thread will work nicely to support that.
For efficient memory usage, we'd like to ideally load each item from global memory only once. Your algorithm naturally involves a small amount of data reuse. The data reuse is evident since the computation of result[y] depends on vec2[z] and the computation of result[x] also depends on vec2[z] to pick just one example. Therefore a typical strategy when there is data reuse is to load the data first into CUDA shared memory, and then allow the threads to perform their computations based on the data in shared memory. As we will see, this makes it fairly easy/convenient for us to arrange for coalesced loads from global memory, since the global data load arrangement is no longer tightly coupled to the threads or the usage of the data for computation.
The last challenge is to figure out an indexing pattern so that each thread will select the proper elements from shared memory to multiply together. If we look at your calculation pattern that you have depicted in your question, we see that the first load from vec1 follows an offset pattern of +1(modulo 3) from the index that the result is being computed for. So x->y, y->z, and z -> x. Likewise we see a +2(modulo 3) for the next load from vec2, another +2(modulo 3) pattern for the next load from vec1 and another +1(modulo 3) pattern for the final load from vec2.
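The offset pattern is easy to check on the host first. The following plain C++ reference is a sketch only (the name cross_flat is made up): each loop iteration plays the role of one thread and computes one output component using the +1 and +2 (modulo 3) offsets described above, with the flattened [v0x, v0y, v0z, ...] layout from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference: iteration k computes output component k of the flattened result.
std::vector<float> cross_flat(const std::vector<float> &vec1,
                              const std::vector<float> &vec2) {
    assert(vec1.size() == vec2.size() && vec1.size() % 3 == 0);
    std::vector<float> res(vec1.size());
    for (std::size_t k = 0; k < res.size(); ++k) {
        const std::size_t c = k % 3;           // component: 0 = x, 1 = y, 2 = z
        const std::size_t base = k - c;        // first element of this 3D vector
        const std::size_t off1 = (c + 1) % 3;  // x->y, y->z, z->x
        const std::size_t off2 = (c + 2) % 3;  // x->z, y->x, z->y
        res[k] = vec1[base + off1] * vec2[base + off2]
               - vec1[base + off2] * vec2[base + off1];
    }
    return res;
}
```

A reference like this is also handy for verifying a GPU kernel against inputs that are less trivial than unit vectors.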
If we combine all these ideas, we can then write a kernel that should have generally efficient characteristics:
$ cat t1003.cu
#include <stdio.h>
#define TV1 1
#define TV2 2
const size_t N = 4096; // number of 3D vectors
const int blksize = 192; // choose as multiple of 3 and 32, and less than 1024
typedef float mytype;
//pairwise vector cross product
template <typename T>
__global__ void vcp(const T * __restrict__ vec1, const T * __restrict__ vec2, T * __restrict__ res, const size_t n){
__shared__ T sv1[blksize];
__shared__ T sv2[blksize];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
while (idx < 3*n){ // grid-stride loop
// load shared memory using coalesced pattern to global memory
sv1[threadIdx.x] = vec1[idx];
sv2[threadIdx.x] = vec2[idx];
// compute modulo/offset indexing for thread loads of shared data from vec1, vec2
int my_mod = threadIdx.x%3; // costly, but possibly hidden by global load latency
int off1 = my_mod+1;
if (off1 > 2) off1 -= 3;
int off2 = my_mod+2;
if (off2 > 2) off2 -= 3;
__syncthreads();
// each thread loads its computation elements from shared memory
T t1 = sv1[threadIdx.x-my_mod+off1];
T t2 = sv2[threadIdx.x-my_mod+off2];
T t3 = sv1[threadIdx.x-my_mod+off2];
T t4 = sv2[threadIdx.x-my_mod+off1];
// compute result, and store using coalesced pattern, to global memory
res[idx] = t1*t2-t3*t4;
idx += gridDim.x*blockDim.x;} // for grid-stride loop
}
int main(){
mytype *h_v1, *h_v2, *d_v1, *d_v2, *h_res, *d_res;
h_v1 = (mytype *)malloc(N*3*sizeof(mytype));
h_v2 = (mytype *)malloc(N*3*sizeof(mytype));
h_res = (mytype *)malloc(N*3*sizeof(mytype));
cudaMalloc(&d_v1, N*3*sizeof(mytype));
cudaMalloc(&d_v2, N*3*sizeof(mytype));
cudaMalloc(&d_res, N*3*sizeof(mytype));
for (int i = 0; i<N; i++){
h_v1[3*i] = TV1;
h_v1[3*i+1] = 0;
h_v1[3*i+2] = 0;
h_v2[3*i] = 0;
h_v2[3*i+1] = TV2;
h_v2[3*i+2] = 0;
h_res[3*i] = 0;
h_res[3*i+1] = 0;
h_res[3*i+2] = 0;}
cudaMemcpy(d_v1, h_v1, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_v2, h_v2, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N);
cudaMemcpy(h_res, d_res, N*3*sizeof(mytype), cudaMemcpyDeviceToHost);
// verification
for (int i = 0; i < N; i++) if ((h_res[3*i] != 0) || (h_res[3*i+1] != 0) || (h_res[3*i+2] != TV1*TV2)) { printf("mismatch at %d, was: %f, %f, %f, should be: %f, %f, %f\n", i, h_res[3*i], h_res[3*i+1], h_res[3*i+2], (float)0, (float)0, (float)(TV1*TV2)); return -1;}
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
return 0;
}
$ nvcc t1003.cu -o t1003
$ cuda-memcheck ./t1003
========= CUDA-MEMCHECK
no error
========= ERROR SUMMARY: 0 errors
$
Note that I've chosen to write the kernel using a grid-stride loop. This isn't terribly important to this discussion, and not that relevant for this problem, because I've chosen a grid size equal to the problem size (4096*3). However for much larger problem sizes, you might choose a smaller grid size than the overall problem size, for some possible small efficiency gain.
For such a simple problem as this, it's fairly easy to define "optimality". The optimal scenario would be however long it takes to load the input data (just once) and write the output data. If we consider a larger version of the test code above, changing N to 40960 (and making no other changes), then the total data read and written would be 40960*3*4*3 bytes. If we profile that code and then compare to bandwidthTest as a proxy for peak achievable memory bandwidth, we observe:
$ CUDA_VISIBLE_DEVICES="1" nvprof ./t1003
==27861== NVPROF is profiling process 27861, command: ./t1003
no error
==27861== Profiling application: ./t1003
==27861== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 65.97% 162.22us 2 81.109us 77.733us 84.485us [CUDA memcpy HtoD]
30.04% 73.860us 1 73.860us 73.860us 73.860us [CUDA memcpy DtoH]
4.00% 9.8240us 1 9.8240us 9.8240us 9.8240us void vcp<float>(float const *, float const *, float*, unsigned long)
API calls: 99.10% 249.79ms 3 83.263ms 6.8890us 249.52ms cudaMalloc
0.46% 1.1518ms 96 11.998us 374ns 454.09us cuDeviceGetAttribute
0.25% 640.18us 3 213.39us 186.99us 229.86us cudaMemcpy
0.10% 255.00us 1 255.00us 255.00us 255.00us cuDeviceTotalMem
0.05% 133.16us 1 133.16us 133.16us 133.16us cuDeviceGetName
0.03% 71.903us 1 71.903us 71.903us 71.903us cudaLaunchKernel
0.01% 15.156us 1 15.156us 15.156us 15.156us cuDeviceGetPCIBusId
0.00% 7.0920us 3 2.3640us 711ns 4.6520us cuDeviceGetCount
0.00% 2.7780us 2 1.3890us 612ns 2.1660us cuDeviceGet
0.00% 1.9670us 1 1.9670us 1.9670us 1.9670us cudaGetLastError
0.00% 361ns 1 361ns 361ns 361ns cudaGetErrorString
$ CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K20Xm
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6375.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6554.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 171220.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$
The kernel takes 9.8240us to execute, and in that time loads or stores a total of 40960*3*4*3 bytes of data. Therefore the achieved memory bandwidth by the kernel is 40960*3*4*3/0.000009824 or 150 GB/s. The proxy measurement for peak achievable on this GPU is 171 GB/s, so this kernel achieves 88% of the optimal throughput. With more careful benchmarking to run the kernel twice in a row, the 2nd execution requires only 8.99us to execute. This brings the achieved bandwidth in this case up to 96% of peak achievable throughput.
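The bandwidth arithmetic above can be written as two small helpers. This is just a sketch with made-up function names; the timings and the 171 GB/s proxy figure are the ones reported above, not new measurements.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by the kernel: two input arrays read and one output array
// written, each holding 3 components per vector.
double bytes_moved(std::size_t n_vectors, std::size_t bytes_per_component) {
    return 3.0 * 3.0 * static_cast<double>(n_vectors)
                     * static_cast<double>(bytes_per_component);
}

// Achieved bandwidth in GB/s (10^9 bytes per second).
double achieved_gbps(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

For example, achieved_gbps(bytes_moved(40960, 4), 9.824e-6) is about 150, and dividing by the 171.2 GB/s device-to-device figure gives the 88% quoted above.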

sort huge array with small number of repeating keys

I want to sort a huge array, say 10^8 entries of type X with at most N different keys, where N is ~10^2. Because I don't know the range or spacing of the elements, count sort is not an option. So my best guess so far is to use a hash map for the counts like so
std::unordered_map< X, unsigned > counts;
for (auto x : input)
counts[x]++;
This works ok-ish and is ~4 times faster than 3-way quicksort, but I'm a nervous person and it's still not fast enough.
I wonder: am I missing something? Can I make better use of the fact that N is known in advance? Or is it possible to tune the hash map to my needs?
EDIT An additional pre-condition is that the input sequence is badly sorted and the frequency of the keys is about the same.

STL implementations are often not perfect in terms of performance (no holy wars, please).
If you know a guaranteed and sensible upper bound on the number of unique elements (N), then you can trivially implement your own hash table of size 2^s >> N. Here is how I usually do it myself:
int size = 1;
while (size < 3 * N) size <<= 1;
//Note: at least 3X size factor, size = power of two
//count = -1 means empty entry
std::vector<std::pair<X, int>> table(size, std::make_pair(X(), -1));
auto GetHash = [size](X val) -> int { return std::hash<X>()(val) & (size-1); };
for (auto x : input) {
    int cell = GetHash(x);
    bool ok = false;
    for (; table[cell].second >= 0; cell = (cell + 1) & (size-1)) {
        if (table[cell].first == x) { //match found -> stop
            ok = true;
            break;
        }
    }
    if (!ok) { //match not found -> add entry on free place
        table[cell].first = x;
        table[cell].second = 0;
    }
    table[cell].second++; //increment counter
}
On MSVC2013, it improves time from 0.62 secs to 0.52 secs compared to your code, given that int is used as type X.
Also, we can choose a faster hash function. Note however, that the choice of hash function depends heavily on the properties of the input. Let's take Knuth's multiplicative hash:
auto GetHash = [size](X val) -> int { return (val*2654435761) & (size-1); };
It further improves time to 0.34 secs.
As a conclusion: do you really want to reimplement standard data structures to achieve a 2X speed boost?
Notes: Speedup may be entirely different on another compiler/machine. You may have to do some hacks if your type X is not POD.

Counting sort really would be best, but isn't applicable due to the unknown range or spacing.
Seems to be easily parallelized with fork-join, e.g. boost::thread.
You could also try a more efficient, hand-rolled hash map. std::unordered_map typically uses linked lists to counter potentially bad hash functions. The memory overhead of linked lists may hurt performance if the hash table doesn't fit into L1 cache. Closed hashing may use less memory. Some hints for optimizing:
Closed hashing with linear probing and without support for removal
Power-of-two sized hash table, so a bit mask can be used instead of modulo (division requires multiple cycles and there is only one hardware divider per core)
Low load factor (entries divided by size) to minimize collisions. That's a tradeoff between memory usage and number of collisions. A load factor over 0.5 should be avoided. A hash table size of 256 seems suitable for 100 entries.
Cheap hash function. You haven't shown the type of X, so perhaps a cheaper hash function could outweigh more collisions.
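Putting these hints together, a counting pass followed by a rewrite of the array might look like the sketch below. It makes several assumptions that are not in the question: X is taken to be uint32_t, equal keys are treated as identical elements, and the function name sort_few_keys and the multiplicative hash constant are illustrative choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sorts `input` in place, assuming at most `max_keys` distinct values.
void sort_few_keys(std::vector<std::uint32_t> &input, std::size_t max_keys) {
    std::size_t size = 1;
    while (size < 3 * max_keys + 1) size <<= 1;    // power of two, low load factor
    const std::size_t mask = size - 1;
    std::vector<std::uint32_t> keys(size);
    std::vector<std::size_t> counts(size, 0);      // a count of 0 marks an empty slot
    std::size_t distinct = 0;
    for (std::uint32_t x : input) {
        std::size_t cell = (x * 2654435761u) & mask;       // cheap multiplicative hash
        while (counts[cell] != 0 && keys[cell] != x)
            cell = (cell + 1) & mask;                      // linear probing
        if (counts[cell] == 0) {
            keys[cell] = x;
            ++distinct;
            assert(distinct <= max_keys);                  // precondition on the input
        }
        ++counts[cell];
    }
    // Collect the few (key, count) pairs, sort them, and rewrite the array.
    std::vector<std::pair<std::uint32_t, std::size_t>> used;
    for (std::size_t c = 0; c < size; ++c)
        if (counts[c] != 0) used.emplace_back(keys[c], counts[c]);
    std::sort(used.begin(), used.end());
    auto out = input.begin();
    for (const auto &kc : used) out = std::fill_n(out, kc.second, kc.first);
}
```

Only the roughly N distinct keys are ever sorted; the 10^8 elements are touched once to count and once to write back.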

I would look to store the items in a sorted vector: with about 100 keys, an insertion into the vector would only occur about once per 10^6 entries. Lookup would be a processor-efficient binary search in the vector.
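A minimal sketch of that idea, assuming X is uint32_t (the question does not say) and using a made-up function name, could be:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Counts occurrences using a small vector of (key, count) pairs kept sorted by key.
std::vector<std::pair<std::uint32_t, std::size_t>>
count_sorted_vector(const std::vector<std::uint32_t> &input) {
    using Slot = std::pair<std::uint32_t, std::size_t>;
    std::vector<Slot> counts;
    for (std::uint32_t x : input) {
        // Binary search for the first slot whose key is not less than x.
        auto it = std::lower_bound(counts.begin(), counts.end(), x,
            [](const Slot &s, std::uint32_t key) { return s.first < key; });
        if (it == counts.end() || it->first != x)
            it = counts.insert(it, std::make_pair(x, std::size_t{0}));  // rare: once per distinct key
        assert(it->first == x);
        ++it->second;
    }
    return counts;
}
```

Because the vector is already sorted by key when the loop finishes, the sorted output can be produced directly by writing each key count times.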

GPU sorting vs CPU sorting

I made a very naive implementation of the mergesort algorithm, which I ported to CUDA with very minimal implementation changes. The algorithm code follows:
//Merge for mergesort
__device__ void merge(int* aux,int* data,int l,int m,int r)
{
int i,j,k;
for(i=m+1;i>l;i--){
aux[i-1]=data[i-1];
}
//Copy in reverse order the second subarray
for(j=m;j<r;j++){
aux[r+m-j]=data[j+1];
}
//Merge
for(k=l;k<=r;k++){
if(aux[j]<aux[i] || i==(m+1))
data[k]=aux[j--];
else
data[k]=aux[i++];
}
}
//What this code do is performing a local merge
//of the array
__global__
void basic_merge(int* aux, int* data,int n)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
int tn = n / (blockDim.x*gridDim.x);
int l = i * tn;
int r = l + tn;
//printf("Thread %d: %d,%d: \n",i,l,r);
for(int i{1};i<=(tn/2)+1;i*=2)
for(int j{l+i};j<(r+1);j+=2*i)
{
merge(aux,data,j-i,j-1,j+i-1);
}
__syncthreads();
if(i==0){
//Complete the merge
do{
for(int i{tn};i<(n+1);i+=2*tn)
merge(aux,data,i-tn,i-1,i+tn-1);
tn*=2;
}while(tn<(n/2)+1);
}
}
The problem is that no matter how many threads I launch on my GTX 760, the sorting performance is always much, much worse than the same code on the CPU running on 8 threads (my CPU has hardware support for up to 8 concurrent threads).
For example, sorting 150 million elements on the CPU takes a few hundred milliseconds; on the GPU it takes up to 10 minutes (even with 1024 threads per block)! Clearly I'm missing some important point here; can you please provide me with some comments? I strongly suspect that the problem is in the final merge operation performed by the first thread: at that point we have a certain number of subarrays (the exact number depends on the number of threads) which are sorted and need to be merged, and this is completed by just one thread (one tiny GPU thread).
I think I should use some kind of reduction here, so that each thread performs further merges in parallel, and the "Complete the merge" step just merges the last two sorted subarrays.
I'm very new to CUDA.
EDIT (ADDENDUM):
Thanks for the link. I must admit I still need some time to learn CUDA better before taking full advantage of that material. Anyway, I was able to rewrite the sorting function to take advantage of multiple threads for as long as possible; my first implementation had a bottleneck in the last phase of the merge procedure, which was performed by only one multiprocessor.
Now, after the first merge, I use up to (1/2)*(n/b) threads each time, where n is the amount of data to sort and b is the size of the chunk of data sorted by each thread.
The improvement in performance is surprising: using only 1024 threads, it takes about 10 seconds to sort 30 million elements. Well, this is still a poor result, unfortunately! The problem is in the thread synchronization, but first things first, let's see the code:
__global__
void basic_merge(int* aux, int* data,int n)
{
int k = blockIdx.x*blockDim.x + threadIdx.x;
int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1;
b = pow( (float)2, b);
int l=k*b;
int r=min(l+b-1,n-1);
__syncthreads();
for(int m{1};m<=(r-l);m=2*m)
{
for(int i{l};i<=r;i+=2*m)
{
merge(aux,data,i,min(r,i+m-1),min(r,i+2*m-1));
}
}
__syncthreads();
do{
if(k<=(n/b)*.5)
{
l=2*k*b;
r=min(l+2*b-1,n-1);
merge(aux,data,l,min(r,l+b-1),r);
}else break;
__syncthreads();
b*=2;
}while((r+1)<n);
}
The function 'merge' is the same as before. Now the problem is that I'm using only 1024 threads instead of the 65000 and more I can run on my CUDA device, because __syncthreads does not work as a synchronization primitive at grid level, only at block level!
So I can synchronize up to 1024 threads, that is, the number of threads supported per block. Without proper synchronization each thread messes up the data of the others, and the merging procedure does not work.
In order to boost the performance I need some kind of synchronization between all the threads in the grid. It seems that no API exists for this purpose, and I read about a solution which involves multiple kernel launches from the host code, using the host as a barrier for all the threads.
I have a plan for how to implement this technique in my mergesort function, and I will provide you with the code in the near future. Do you have any suggestions of your own?
Thanks

It looks like all the work is being done in global memory. Each write takes a long time and each read takes a long time, making the function slow. I think it would help to first copy your data to __shared__ memory, do the work there, and then, when the sorting is completed (for that block), copy the results back to global memory.
Global memory takes about 400 clock cycles (or about 100 if the data happens to be in L2 cache). Shared memory, on the other hand, only takes 1-3 clock cycles to write and read.
The above would help with performance a lot. Some other super minor things you can try are:
(1) Remove the first __syncthreads(). It is not really doing anything because no data is being passed between warps at that point.
(2) Move the "int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1; b = pow( (float)2, b);" outside the kernel and just pass in b instead. This is being calculated over and over when it really only needs to be calculated once.
I tried to follow along on your algorithm but was not able to. The variable names were hard to follow...or... your code is above my head and I cannot follow. =) Hope the above helps.

How would you implement this function in CUDA? (offsets in sorted integer vector)

I have a sorted integer array on the device, e.g.:
[0,0,0,1,1,2,2]
And I want the offsets to each element in another array:
[0,3,5]
(since the first 0 is at position 0, the first 1 at position 3 and so on)
I know how many different elements there will be beforehand. How would you implement this efficiently in CUDA? I'm not asking for code, but for a high-level description of the algorithm you would implement to compute this transformation. I already had a look at the various functions in the thrust namespace, but could not think of any combination of thrust functions to achieve this. Also, does this transformation have a widely accepted name?

You can solve this in Thrust using thrust::unique_by_key_copy with thrust::counting_iterator. The idea is to treat your integer array as the keys argument to unique_by_key_copy and to use a sequence of ascending integers (i.e., counting_iterator) as the values. unique_by_key_copy will compact the values array into the indices of each unique key:
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/unique.h>
#include <thrust/copy.h>
#include <iterator>
#include <iostream>
int main()
{
thrust::device_vector<int> keys(7);
keys[0] = 0; keys[1] = 0; keys[2] = 0;
keys[3] = 1; keys[4] = 1; keys[5] = 2; keys[6] = 2;
std::cout << "keys before unique_by_key_copy: [ ";
thrust::copy(keys.begin(), keys.end(), std::ostream_iterator<int>(std::cout," "));
std::cout << "]" << std::endl;
thrust::device_vector<int> offsets(3);
thrust::unique_by_key_copy(keys.begin(), keys.end(), // keys
thrust::make_counting_iterator(0), // [0, 1, 2, 3, ...] are the values
thrust::make_discard_iterator(), // discard the compacted keys
offsets.begin()); // the offsets are the values
std::cout << "offsets after unique_by_key_copy: [ ";
thrust::copy(offsets.begin(), offsets.end(), std::ostream_iterator<int>(std::cout," "));
std::cout << "]" << std::endl;
return 0;
}
Here's the output:
$ nvcc test.cu -run
keys before unique_by_key_copy: [ 0 0 0 1 1 2 2 ]
offsets after unique_by_key_copy: [ 0 3 5 ]

Although I've never used thrust library, what about this possible approach (simple but maybe effective):
int input[N]; // your sorted array
int offset[N]; // the offset of the first values of each elements. Initialized with -1
// each thread will check an index position
if (id == 0 || input[id] > input[id-1]) // bingo! here begins a new value (id == 0 has no predecessor)
{
int oid = input[id]; // use the integer value as index
offset[oid] = id; // mark the offset with the beginning of the new value
}
In your example the output will be:
[0,3,5]
But if the input array is:
[0,0,0,2,2,4,4]
Then the output will be:
[0,-1, 3, -1, 5]
Now, if Thrust can do it for you, apply remove_if( offset[i] == -1 ) and compact the array.
This approach will waste a lot of memory for the offset array, but as you don't know how many offsets you are going to find, the worst case will use as much memory as the input array.
On the other hand, the few instructions per thread compared to the global memory loads mean this implementation will be limited by memory bandwidth. There are some optimizations for this case, such as processing several values per thread.
My 2 cents!

Scan is the algorithm you're looking for. If you don't have an implementation lying around, the Thrust library would be a good resource. (Look for thrust::inclusive_scan and thrust::exclusive_scan.)
Scan (or "parallel prefix sum") takes an input array and generates an output where each element is the sum of the inputs up to that point: [1 5 3 7] => [1 6 9 16]
If you scan predicates (0 or 1 depending on an evaluated condition), where the predicate checks whether a given element differs from the preceding element, then you compute the output index of the element in question. Your example array:
[0 0 0 1 1 2 2]
[0 0 0 1 0 1 0] <= predicates
[0 0 0 1 1 2 2] <= scanned predicates
Now you can use the scanned predicates as indices to write your output.
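To make the three steps concrete, here is a sequential C++ sketch of the same predicate / scan / scatter pipeline. It is not CUDA code: std::partial_sum stands in for the parallel inclusive scan, and the function name first_offsets is made up for illustration. On the GPU each of the three loops would become a data-parallel step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the index of the first occurrence of each distinct value in a sorted array.
inline std::vector<std::size_t> first_offsets(const std::vector<int>& sorted) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    const std::size_t n = sorted.size();
    if (n == 0) return {};

    // Step 1: predicate is 1 where an element differs from its predecessor.
    std::vector<std::size_t> flags(n, 0);
    for (std::size_t i = 1; i < n; ++i)
        flags[i] = (sorted[i] != sorted[i - 1]) ? 1 : 0;

    // Step 2: inclusive scan of the predicates gives each element's output slot.
    std::vector<std::size_t> slot(n);
    std::partial_sum(flags.begin(), flags.end(), slot.begin());

    // Step 3: scatter. Only elements that start a new run write their index.
    std::vector<std::size_t> offsets(slot.back() + 1);
    offsets[0] = 0;  // the first element always starts a run
    for (std::size_t i = 1; i < n; ++i)
        if (flags[i]) offsets[slot[i]] = i;
    return offsets;
}
```

For the example array [0 0 0 1 1 2 2] this yields [0 3 5], and the result is compact even when values are skipped, e.g. [0 0 0 2 2 4 4].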

Good question and the answer depends on what you need to do with it after. Let me explain.
Since this problem can be solved in O(n) on the CPU (where n is the input length), you will suffer from the cost of memory allocation and copying (Host -> Device for the input and Device -> Host for the result). This will lead to performance degradation compared with a simple CPU solution.
Even if your array is already in device memory, each computation block needs to read it into local memory or registers (or at least access device memory), and that can't be done significantly faster than on the CPU.
In general, CUDA accelerates performance well if:
The asymptotic complexity of the computation is high compared to the input data length. For example, the input data length is n and the complexity is O(n^2) or O(n^3).
There is a way to split the task into independent or weakly dependent subtasks.
So if I were you, I would not try to do computations of this kind on CUDA if it can be avoided. And if it must be some standalone function or an output format conversion for some other function, I would do it on the CPU.
If it's part of some more complex algorithm, the answer is more complicated. If I were in your place, I would try to somehow change the [0,3,5] format, because it limits how well you can utilize CUDA's computation power. You can't effectively split your task into independent blocks. For example, if I process 10 integers in one computation thread and the next 10 integers in another, the second one doesn't know where to place its outputs until the first one has finished. Maybe I would split the array into subarrays and store the answer for each subarray separately. It highly depends on what computations you are doing.

In-Place Radix Sort

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?
Preliminary
I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.
At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.
Details
I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).
The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.
Use Cases
Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.
Research
Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.
In particular, there are the following things.
First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).
Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).
Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.
To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?

Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires two passes through the array per character position (2 * sequence length passes in total).
void radixSort(string[] seqs, size_t base = 0) {
if(seqs.length == 0)
return;
size_t TPos = seqs.length, APos = 0;
size_t i = 0;
while(i < TPos) {
if(seqs[i][base] == 'A') {
swap(seqs[i], seqs[APos++]);
i++;
}
else if(seqs[i][base] == 'T') {
swap(seqs[i], seqs[--TPos]);
} else i++;
}
i = APos;
size_t CPos = APos;
while(i < TPos) {
if(seqs[i][base] == 'C') {
swap(seqs[i], seqs[CPos++]);
}
i++;
}
if(base < seqs[0].length - 1) {
radixSort(seqs[0..APos], base + 1);
radixSort(seqs[APos..CPos], base + 1);
radixSort(seqs[CPos..TPos], base + 1);
radixSort(seqs[TPos..seqs.length], base + 1);
}
}
Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.
Edit:
I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.

I've never seen an in-place radix sort, and from the nature of radix sort I doubt that it would be much faster than an out-of-place sort, as long as the temporary array fits into memory.
Reason:
The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. If it's in place or not will not change this effect.
I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).
That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.
I have no idea if it works out in practice though.
Btw: If you're dealing with DNA strings only, you can compress a char into two bits and pack your data quite a lot. This will cut down the memory requirement by a factor of four compared with a naive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache misses anyway.
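As a rough sketch of the two-bits-per-base packing, assuming plain A/C/G/T input and sequences of at most 32 bases so that one sequence fits a 64-bit word (the names base_code, pack and unpack are made up here):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Maps a base to 2 bits. A < C < G < T, so for equal-length sequences
// the packed integers compare in the same order as the strings.
inline std::uint64_t base_code(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "only A, C, G and T are handled in this sketch");
    return 0;
}

// Packs up to 32 bases into one word; the first base ends up in the highest used bits.
inline std::uint64_t pack(const std::string& seq) {
    assert(seq.size() <= 32);
    std::uint64_t word = 0;
    for (char c : seq) word = (word << 2) | base_code(c);
    return word;
}

// Inverse of pack for a known sequence length.
inline std::string unpack(std::uint64_t word, std::size_t length) {
    std::string seq(length, 'A');
    for (std::size_t i = length; i-- > 0;) {
        seq[i] = "ACGT"[word & 3];
        word >>= 2;
    }
    return seq;
}
```

Longer sequences would need several words per sequence, and DNA5 or IUPAC codes would need 3 or 4 bits per symbol instead of 2.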

You can certainly drop the memory requirements by encoding the sequence in bits.
You are looking at permutations so, for length 2, with "ACGT" that's 16 states, or 4 bits.
For length 3, that's 64 states, which can be encoded in 6 bits. So it looks like 2 bits for each letter in the sequence, or about 32 bits for 16 characters like you said.
If there is a way to reduce the number of valid 'words', further compression may be possible.
So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64.
Initialize them to zero.
Iterate through your very very large list of 3 char sequences, and encode them as above.
Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.
Next, regenerate your list.
Iterate through the 64 buckets in order, for the count found in that bucket, generate that many instances of the sequence represented by that bucket.
when all of the buckets have been iterated, you have your sorted array.
A sequence of 4, adds 2 bits, so there would be 256 buckets.
A sequence of 5, adds 2 bits, so there would be 1024 buckets.
At some point the number of buckets will approach your limits.
If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.
I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.
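Here is a hack that shows the technique:
#include <iostream>
#include <iomanip>
#include <cstdlib> // abort
#include <cstring> // memset, strlen
using namespace std;
const int width = 3;
// 4^width buckets, computed with integers to avoid floating-point rounding
const int bucketCount = 1 << (2 * width);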
int *bucket = NULL;
const char charMap[4] = {'A', 'C', 'G', 'T'};
void setup
(
void
)
{
bucket = new int[bucketCount];
memset(bucket, '\0', bucketCount * sizeof(bucket[0]));
}
void teardown
(
void
)
{
delete[] bucket;
}
void show
(
int encoded
)
{
int z;
int y;
int j;
for (z = width - 1; z >= 0; z--)
{
int n = 1;
for (y = 0; y < z; y++)
n *= 4;
j = encoded % n;
encoded -= j;
encoded /= n;
cout << charMap[encoded];
encoded = j;
}
cout << endl;
}
int main(void)
{
// Sort this sequence
const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";
size_t testSequenceLength = strlen(testSequence);
setup();
// load the sequences into the buckets
size_t z;
for (z = 0; z < testSequenceLength; z += width)
{
int encoding = 0;
size_t y;
for (y = 0; y < width; y++)
{
encoding *= 4;
switch (*(testSequence + z + y))
{
case 'A' : encoding += 0; break;
case 'C' : encoding += 1; break;
case 'G' : encoding += 2; break;
case 'T' : encoding += 3; break;
default : abort();
};
}
bucket[encoding]++;
}
/* show the sorted sequences */
for (z = 0; z < bucketCount; z++)
{
while (bucket[z] > 0)
{
show(z);
bucket[z]--;
}
}
teardown();
return 0;
}

If your data set is so big, then I would think that a disk-based buffer approach would be best:
sort(List<string> elements, int prefix)
if (elements.Count < THRESHOLD)
return InMemoryRadixSort(elements, prefix)
else
return DiskBackedRadixSort(elements, prefix)
DiskBackedRadixSort(elements, prefix)
DiskBackedBuffer<string>[] buckets
foreach (element in elements)
buckets[element.MSB(prefix)].Add(element);
List<string> ret
foreach (bucket in buckets)
ret.Add(sort(bucket, prefix + 1))
return ret
I would also experiment grouping into a larger number of buckets, for instance, if your string was:
GATTACA
the first MSB call would return the bucket for GATT (256 total buckets), that way you make fewer branches of the disk based buffer. This may or may not improve performance, so experiment with it.

I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:
You control the reading of the data
You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.
The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.
Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), then you can build a chunk of the heap while you are waiting for the next chunk of data to come in, even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.
Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunks and chunks as you go.
That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(
See the Wikipedia articles:
Heapsort
Binary heap
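A small sketch of the heap idea using std::priority_queue as a min-heap. The chunked input is simulated by a hypothetical add_chunk helper; a real reader would call it from whatever asynchronous notification it has. drain hands out sorted results as soon as they are requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Min-heap of strings: top() is always the smallest element seen so far.
using MinHeap = std::priority_queue<std::string,
                                    std::vector<std::string>,
                                    std::greater<std::string>>;

// Called once per chunk of input while the next chunk is still being read.
inline void add_chunk(MinHeap& heap, const std::vector<std::string>& chunk) {
    for (const std::string& s : chunk) heap.push(s);
}

// Removes and returns up to `count` elements in ascending order.
inline std::vector<std::string> drain(MinHeap& heap, std::size_t count) {
    std::vector<std::string> out;
    while (count-- > 0 && !heap.empty()) {
        out.push_back(heap.top());
        heap.pop();
    }
    assert(std::is_sorted(out.begin(), out.end()));
    return out;
}
```

Draining only yields globally sorted output once all input has been added; before that, each call returns the smallest elements of what has arrived so far.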

"Radix sorting with no extra space" is a paper addressing your problem.

Performance-wise, you might want to look at more general string-sorting algorithms.
Currently you wind up touching every element of every string, but you can do better!
In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.
A decent general purpose implementation of burstsort is available on source forge at http://sourceforge.net/projects/burstsort/ - but it is not in-place.
For comparison purposes, The C-burstsort implementation covered at http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.

You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.
Strings comprised of the four nucleotide letters A, C, G, and T can be specially encoded into Integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.

You might try using a trie. Sorting the data is simply iterating through the dataset and inserting it; the structure is naturally sorted, and you can think of it as similar to a B-Tree (except instead of making comparisons, you always use pointer indirections).
Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.
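To illustrate, here is a minimal 4-way trie sort for A/C/G/T strings. It uses plain heap-allocated nodes rather than the cache-line-sized, contiguously allocated nodes suggested above, so treat it as a sketch of the idea only; all names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 4> child;  // one branch per base: A, C, G, T
    std::size_t count = 0;                           // number of inserted strings ending here
};

inline int base_index(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "only A, C, G and T are handled in this sketch");
    return 0;
}

// O(k) insertion for a string of length k: one pointer indirection per character.
inline void trie_insert(TrieNode& root, const std::string& s) {
    TrieNode* node = &root;
    for (char c : s) {
        std::unique_ptr<TrieNode>& next = node->child[base_index(c)];
        if (!next) next.reset(new TrieNode());
        node = next.get();
    }
    ++node->count;
}

// Pre-order traversal visits the strings in lexicographic order.
inline void trie_collect(const TrieNode& node, std::string& prefix,
                         std::vector<std::string>& out) {
    for (std::size_t k = 0; k < node.count; ++k) out.push_back(prefix);
    for (int b = 0; b < 4; ++b) {
        if (node.child[b]) {
            prefix.push_back("ACGT"[b]);
            trie_collect(*node.child[b], prefix, out);
            prefix.pop_back();
        }
    }
}

inline std::vector<std::string> trie_sort(const std::vector<std::string>& input) {
    TrieNode root;
    for (const std::string& s : input) trie_insert(root, s);
    std::vector<std::string> out;
    std::string prefix;
    trie_collect(root, prefix, out);
    return out;
}
```

Duplicates cost no extra nodes since they only bump a counter, which matters for short sequences where repeats are common.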

I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.

It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort". It's described here: Engineering Radix Sort. The general idea is to do 2 passes on each character - first count how many of each you have, so you can subdivide the input array into bins. Then go through again, swapping each element into the correct bin. Now recursively sort each bin on the next character position.
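The count / swap / recurse steps described above might look like this in C++ for equal-length A/C/G/T strings. This is my own sketch of the idea, not code from the paper, and the names bin_of and flag_sort are made up.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

inline int bin_of(const std::string& s, std::size_t pos) {
    switch (s[pos]) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "only A, C, G and T are handled in this sketch");
    return 0;
}

// Sorts seqs[lo, hi) in place, looking at character positions pos, pos + 1, ...
// Assumes all strings have the same length.
inline void flag_sort(std::vector<std::string>& seqs,
                      std::size_t lo, std::size_t hi, std::size_t pos) {
    if (hi - lo < 2 || pos >= seqs[lo].size()) return;

    // Pass 1: count how many strings fall into each bin.
    std::array<std::size_t, 4> count{};
    for (std::size_t i = lo; i < hi; ++i) ++count[bin_of(seqs[i], pos)];

    // Turn the counts into bin boundaries: bin b occupies [start[b], start[b + 1]).
    std::array<std::size_t, 5> start{};
    start[0] = lo;
    for (int b = 0; b < 4; ++b) start[b + 1] = start[b] + count[b];

    // Pass 2: swap every element into its bin. next[b] is the first slot of
    // bin b that is not yet known to hold a bin-b element.
    std::array<std::size_t, 4> next{};
    for (int b = 0; b < 4; ++b) next[b] = start[b];
    for (int b = 0; b < 4; ++b) {
        while (next[b] < start[b + 1]) {
            const int d = bin_of(seqs[next[b]], pos);
            if (d == b) ++next[b];
            else std::swap(seqs[next[b]], seqs[next[d]++]);
        }
    }

    // Recurse into each bin on the next character position.
    for (int b = 0; b < 4; ++b) flag_sort(seqs, start[b], start[b + 1], pos + 1);
}

inline void flag_sort(std::vector<std::string>& seqs) {
    flag_sort(seqs, 0, seqs.size(), 0);
}
```

The only extra space is the two small arrays per recursion level, and the recursion depth is bounded by the string length.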

Radix-Sort is not cache conscious and is not the fastest sort algorithm for large sets.
You can look at:
ti7qsort. ti7qsort is the fastest sort for integers (can be used for small-fixed size strings).
Inline QSORT
String sorting
You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.

dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.
I suggest a very simple approach:
Empirically estimate the largest size m for which a radix sort is efficient.
Read blocks of m elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.
Mergesort the resulting sorted blocks.
Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require 2n space to sort n items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.
Please note finally that mergesort can be implemented without recursion, and in fact doing it this way makes clear the true linear memory access pattern.
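For reference, a non-recursive (bottom-up) mergesort can be sketched like this, here on ints for brevity. Each pass reads two sorted runs front to back and writes one output run front to back, which is the linear access pattern mentioned above. The function name is mine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

inline void bottom_up_mergesort(std::vector<int>& data) {
    const std::size_t n = data.size();
    std::vector<int> buffer(n);  // the extra n items of space mentioned above

    // Runs of length `width` are already sorted; merge neighbouring pairs.
    for (std::size_t width = 1; width < n; width *= 2) {
        for (std::size_t lo = 0; lo < n; lo += 2 * width) {
            const std::size_t mid = std::min(lo + width, n);
            const std::size_t hi = std::min(lo + 2 * width, n);
            std::merge(data.begin() + lo, data.begin() + mid,
                       data.begin() + mid, data.begin() + hi,
                       buffer.begin() + lo);
        }
        data.swap(buffer);  // the merged runs become the input of the next pass
    }
    assert(std::is_sorted(data.begin(), data.end()));
}
```

In the scheme above, the initial runs would be the radix-sorted blocks of m elements rather than runs of length 1, so the outer loop would start at width = m.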

First, think about the coding of your problem. Get rid of the strings, replace them by a binary representation. Use the first byte to indicate length+encoding. Alternatively, use a fixed length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to not have exception handling at the hot spot of the inner loop.
OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.
Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:
Encoding with 7 length bits of variable-length strings. As they fill up, you replace them by:
Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
Bitmap encoding of the last three characters of a string.
For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.
This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.
For even more performance, you might want to add different block types and a larger size of block. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off
You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.
Are you interested in handling duplicates? With short sequences, there are going to be some. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.

While the accepted answer perfectly answers the description of the problem, I reached this place looking in vain for an algorithm to partition an array in place into N parts. I've written one myself, so here it is.
Warning: this is not a stable partitioning algorithm, so for multilevel partitioning, one must repartition each resulting partition instead of the whole array. The advantage is that it works in place.
The way it helps with the question posed is that you can repeatedly partition in place based on a letter of the string, then sort the partitions when they are small enough with the algorithm of your choice.
function partitionInPlace(input, partitionFunction, numPartitions, startIndex=0, endIndex=-1) {
if (endIndex===-1) endIndex=input.length;
const starts = Array.from({ length: numPartitions + 1 }, () => 0);
for (let i = startIndex; i < endIndex; i++) {
const val = input[i];
const partByte = partitionFunction(val);
starts[partByte]++;
}
let prev = startIndex;
for (let i = 0; i < numPartitions; i++) {
const p = prev;
prev += starts[i];
starts[i] = p;
}
const indexes = [...starts];
starts[numPartitions] = prev;
let bucket = 0;
while (bucket < numPartitions) {
const start = starts[bucket];
const end = starts[bucket + 1];
if (end - start < 1) {
bucket++;
continue;
}
let index = indexes[bucket];
if (index === end) {
bucket++;
continue;
}
let val = input[index];
let destBucket = partitionFunction(val);
if (destBucket === bucket) {
indexes[bucket] = index + 1;
continue;
}
let dest;
do {
dest = indexes[destBucket] - 1;
let destVal;
let destValBucket = destBucket;
while (destValBucket === destBucket) {
dest++;
destVal = input[dest];
destValBucket = partitionFunction(destVal);
}
input[dest] = val;
indexes[destBucket] = dest + 1;
val = destVal;
destBucket = destValBucket;
} while (dest !== index)
}
return starts;
}
