Using a LinkedList or ArrayList for iteration


If I am adding an unknown number of elements to a List, and that list is only going to be iterated through, would a LinkedList be better than an ArrayList in this particular instance? (I'm using Java, if that has any relevance.)

The performance trade-offs between ArrayList and LinkedList have been discussed before, but in short: ArrayList tends to be faster for most real-life usage scenarios. ArrayList will cause less memory fragmentation and will play nicer with the garbage collector; it will use less memory and allow for faster iteration, and it will be faster for insertions that occur at the end of the list.
So, as long as the insertions in the list always occur at the last position, there's no reason to pick LinkedList - ArrayList is the clear winner.

Okay, it's already been answered, but I will still try to make my point.
ArrayList is faster to iterate than LinkedList. The reason is the same as above: ArrayList is backed by an array. Let's try to understand why array iteration is faster than linked-list iteration.
There are 2 factors that work in its favor:
An array is stored in contiguous memory locations (you may ask: so what?)
The CPU cache is much faster than main memory
But you may ask how the cache fits in here. The CPU tries to take advantage of its caches by keeping recently used data in them, and it relies on locality of reference to do so. There are 2 kinds of locality of reference:
Temporal locality
If at one point a particular memory location is referenced, then it is likely that the same location will be referenced again in the
near future. There is a temporal proximity between the adjacent
references to the same memory location. In this case it is common to
make efforts to store a copy of the referenced data in special memory
storage, which can be accessed faster. Temporal locality is a special
case of spatial locality, namely when the prospective location is
identical to the present location.
Spatial locality
If a particular storage location is referenced at a particular time, then it is likely that nearby memory locations will be
referenced in the near future. In this case it is common to attempt to
guess the size and shape of the area around the current reference for
which it is worthwhile to prepare faster access.
So if one array location is accessed, the adjacent memory locations are loaded into the cache too. But wait, not all of them are loaded. How many depends on the cache line size, which defines how many bytes are loaded into the cache at a time.
So before diving further, let's recall what we know:
An array occupies contiguous memory locations
When one memory location of the array is accessed, the adjacent ones are also loaded into the cache
How many array memory locations are loaded into the cache is defined by the cache line capacity
So whenever the CPU tries to access a memory location, it checks whether that memory is already in the cache. If it is present, it's a cache hit; otherwise it's a cache miss.
So from what we know, an array will incur fewer cache misses than random memory locations, as in a linked list. So it makes sense.
And finally, the Array_data_structure article on Wikipedia says:
In an array with element size k and on a machine with a cache line
size of B bytes, iterating through an array of n elements requires the
minimum of ceiling(nk/B) cache misses, because its elements occupy
contiguous memory locations. This is roughly a factor of B/k better
than the number of cache misses needed to access n elements at random
memory locations. As a consequence, sequential iteration over an array
is noticeably faster in practice than iteration over many other data
structures, a property called locality of reference.
I guess that answers your question.
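To put numbers on the quoted formula, here is a minimal Java sketch of the ceiling(nk/B) estimate. The class and method names are made up for illustration, and the 64-byte cache line and 4-byte element size are assumptions, not measurements. Note also that a Java ArrayList stores references in its backing array, so the estimate applies to the reference array; the boxed objects themselves may live elsewhere on the heap.

```java
final class CacheMissEstimate {
    // Minimum number of cache misses for a sequential scan over n elements
    // of k bytes each with cache lines of b bytes: ceiling(n * k / b).
    static long sequentialMisses(long n, long k, long b) {
        return (n * k + b - 1) / b;
    }

    // Rough worst case for nodes scattered across memory: one miss per element.
    static long scatteredMisses(long n) {
        return n;
    }

    public static void main(String[] args) {
        long n = 1_000_000; // number of elements
        long k = 4;         // assumed element size in bytes
        long b = 64;        // assumed cache line size in bytes
        System.out.println(sequentialMisses(n, k, b)); // prints 62500
        System.out.println(scatteredMisses(n));        // prints 1000000
    }
}
```

With these assumed sizes the ratio is B/k = 16, which is the factor the quote refers to.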

Both have the same O(n) complexity for iteration. ArrayList will take less memory, BTW.

public List<Integer> generateArrayList(int n) {
long start = System.nanoTime();
List<Integer> result = new ArrayList<>();
for (int i = 0; i < n; i++) {
result.add(i);
}
System.out.println("generateArrayList time: " + (System.nanoTime() - start));
return result;
}
public List<Integer> generateLinkedList(int n) {
long start = System.nanoTime();
List<Integer> result = new LinkedList<>();
for (int i = 0; i < n; i++) {
result.add(i);
}
System.out.println("generateLinkedList time: " + (System.nanoTime() - start));
return result;
}
public void iteratorAndRemove(List<Integer> list) {
String type = list instanceof ArrayList ? "ArrayList" : "LinkedList";
long start = System.nanoTime();
Iterator<Integer> ite = list.iterator();
while (ite.hasNext()) {
int getDataToDo = ite.next();
ite.remove();
}
System.out.println("iteratorAndRemove with " + type + " time: " + (System.nanoTime() - start));
}
@org.junit.Test
public void benchMark() {
final int n = 500_000;
List<Integer> arr = generateArrayList(n);
List<Integer> linked = generateLinkedList(n);
iteratorAndRemove(linked);
iteratorAndRemove(arr);
}
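ArrayList is useful for getting the value at a random position; LinkedList is useful for insert and remove operations. The code above shows LinkedList being much faster than ArrayList when every element is removed through the iterator: in that remove benchmark LinkedList is about 1000 times faster than ArrayList, OMG!!! Note that this measures removing from the front of the list while iterating (each removal shifts the rest of the ArrayList), not iteration alone.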
generateArrayList time: 15997000
generateLinkedList time: 15912000
iteratorAndRemove with LinkedList time: 14188500
iteratorAndRemove with ArrayList time: 13558249400
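Since the original question is about iteration only, a variant of the benchmark without the `ite.remove()` call may be closer to what was asked. The following is a rough sketch (the class name is hypothetical, and a single `System.nanoTime()` measurement like this is not a reliable benchmark; a harness such as JMH would be needed for trustworthy numbers):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class IterationOnlyDemo {
    // Appends 0 .. n-1 to the given list and returns it.
    static List<Integer> fill(List<Integer> list, int n) {
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }

    // Walks the list once with its iterator, without modifying it.
    static long sum(List<Integer> list) {
        long total = 0;
        for (int value : list) {
            total += value;
        }
        return total;
    }

    static void time(String name, List<Integer> list) {
        long start = System.nanoTime();
        long total = sum(list);
        long elapsed = System.nanoTime() - start;
        System.out.println(name + ": sum=" + total + ", ns=" + elapsed);
    }

    public static void main(String[] args) {
        int n = 500_000;
        time("ArrayList", fill(new ArrayList<>(), n));
        time("LinkedList", fill(new LinkedList<>(), n));
    }
}
```

Both lists are only read here, so neither pays for element shifting or node unlinking.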

Related

Efficiently select a available number within a range

I need to use and recycle IDs (ints) within a range, say from 1 to 20 million.
What is the most efficient way to do this?
Some things I tried:
Generate a running sequence of numbers from 1 to k and store them in a
map. After k numbers, if, say, ID 2 becomes free, we delete it
from the map and continue with the next ID from k+1. (It would be good if
I could choose the ID that was freed earlier (2) instead of
k+1. How can I do this?)
Generate random numbers in the range 1 to 20 million and check
whether each is already used with a map lookup; if yes, choose another random
number or do number+1 until the map lookup fails.
Store all numbers from 1 to 20 million in a set, take them one by one for use, and add them back when they are freed (this will have a bigger
memory footprint and I don't want to do this).
What is the most efficient way to solve this problem if, say, around
50% of the IDs are used at any point in time?

A space-efficient solution is to use a bit-mask to keep track of free entries. 20M bits is only 2.5MB.
If about half of them will be free, then when you need to allocate a new ID, you can just start at a random spot and walk forward until you find an entry with a free bit.
If you need a guaranteed time bound, then you can use an array of 64-bit words for your bit mask, and a bit mask of 1/64 the size to keep track of which words have free entries. Recurse until you get to one or two words.
If space isn't a problem, then the simplest fast way is to keep free IDs in a free list. That requires an array of up to 20M integers. You remember the last entry freed, and for every free node x, array[x] is the index of the preceding freed node, or -1.
If your IDs actually point to something, then often you can use the very same array for the free list and the pointers, so the free list takes no extra memory at all.
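The first suggestion in this answer (a bit mask plus a random starting point) could look roughly like the following sketch. It uses `java.util.BitSet` as the bit mask; the class name is made up, and IDs run from 0 to range-1 here rather than from 1, which is an assumption made to keep the indexing simple.

```java
import java.util.BitSet;
import java.util.concurrent.ThreadLocalRandom;

final class BitmaskIdPool {
    private final BitSet used; // bit i is set while ID i is in use
    private final int range;   // valid IDs are 0 .. range-1 (assumption)

    BitmaskIdPool(int range) {
        this.range = range;
        this.used = new BitSet(range);
    }

    // Starts at a random spot and walks forward to the next free bit,
    // wrapping around to the beginning once if necessary.
    int allocate() {
        int start = ThreadLocalRandom.current().nextInt(range);
        int id = used.nextClearBit(start);
        if (id >= range) {
            id = used.nextClearBit(0);
            if (id >= range) {
                throw new IllegalStateException("no free IDs");
            }
        }
        used.set(id);
        return id;
    }

    void free(int id) {
        used.clear(id);
    }
}
```

When about half the IDs are free, the walk from a random start is expected to be short, but there is no hard bound on it; the hierarchical bit mask described above is what would give a guaranteed bound.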

20M integers take about 80 MB of RAM. If we are talking about Java, then according to this article, HashSet<Integer> can take up to 22 times more space, so it's about 1.7 GB, wow.
You can implement your own bitset that supports fast selection of the next free ID. The bitset should take only about 2.4 MB of RAM, and we can find the next free ID in O(1). I haven't checked the code; it's mostly an idea:
int range = 20_000_000;
long[] bitset = new long[range / 64 + 1]; // About 2.4 Mb of RAM, array length is 312501
Stack<Integer> hasFreeIds = new Stack<Integer>(); // Slots in bitset with free IDs
for (int i = 0; i < bitset.length; ++i) { // All slots have free IDs in the beginning
hasFreeIds.push(i);
}
// Now `hasFreeIds` is about (8 + 4) * 312_000 bytes = ~4Mb of RAM
// Our structure should be ~6.4 Mb of RAM in total
// Complexity is O(1), so should be fast
int getNextFreeId() {
// Select the first slot with free IDs
int freeSlotPos = hasFreeIds.pop();
long slot = bitset[freeSlotPos];
// Find the first free ID
long lowestZeroBit = Long.lowestOneBit(~slot);
int lowestZeroBitPosition = Long.numberOfTrailingZeros(lowestZeroBit);
int freeId = 64 * freeSlotPos + lowestZeroBitPosition;
// Update the slot, flip the bit to mark it as used
slot |= lowestZeroBit;
bitset[freeSlotPos] = slot;
// If the slot still has free IDs, then push it back to our stack
if (~slot != 0) {
hasFreeIds.push(freeSlotPos);
}
return freeId;
}
// Complexity is also O(1)
void returnId(int id) {
// Find slot that contains this id
long slot = bitset[id / 64];
boolean slotIsFull = (~slot == 0L); // True if the slot does not have free IDs
// Flip the bit in the slot to mark it as free
int bitPosition = id % 64;
slot &= ~(1L << bitPosition); // 1L: an int shift would wrap around for bit positions >= 32
bitset[id / 64] = slot;
// If this slot was full before, we need to push it to the stack
if (slotIsFull) {
hasFreeIds.push(id / 64);
}
}

Theoretically speaking, the fastest would be storing all free IDs in a linked list.
That is, push 20M sequential numbers into a linked list. To allocate an ID, pop it from the front. And when an ID is freed, push it at either the front or the back depending on your preferred strategy (i.e. would you reuse freed IDs first, or only after each preallocated one was used).
This way both allocating an ID and freeing it is O(1).
Now, as an optimization you don't really need to preallocate all your IDs. You should only store the highest ID allocated. When you need to allocate an ID and the list of free IDs is empty - just increase the highest ID variable and return it.
This way your list will never reach big numbers, unless they were really allocated and returned.
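A minimal sketch of this optimized variant, with a free list plus a "highest ID allocated" counter. The class name is invented, and `ArrayDeque` stands in for the linked list (it offers the same O(1) push and pop at both ends):

```java
import java.util.ArrayDeque;

final class LazyIdAllocator {
    private final ArrayDeque<Integer> freeIds = new ArrayDeque<>();
    private final int maxId;          // upper end of the range, e.g. 20_000_000
    private int highestAllocated = 0; // IDs start at 1, so 0 means "none yet"

    LazyIdAllocator(int maxId) {
        this.maxId = maxId;
    }

    int allocate() {
        Integer recycled = freeIds.pollFirst();
        if (recycled != null) {
            return recycled; // reuse a freed ID first
        }
        if (highestAllocated == maxId) {
            throw new IllegalStateException("range exhausted");
        }
        return ++highestAllocated; // no freed IDs: hand out a fresh one
    }

    void release(int id) {
        // addFirst reuses the most recently freed ID first;
        // addLast would reuse the oldest freed ID first.
        freeIds.addFirst(id);
    }
}
```

The deque only ever holds IDs that were actually allocated and returned, so its size is bounded by the peak number of IDs in circulation rather than by the full range.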

sort huge array with small number of repeating keys

I want to sort a huge array, say 10^8 entries of type X with at most N different keys, where N is ~10^2. Because I don't know the range or spacing of the elements, count sort is not an option. So my best guess so far is to use a hash map for the counts like so
std::unordered_map< X, unsigned > counts;
for (auto x : input)
counts[x]++;
This works ok-ish and is ~4 times faster than 3-way quicksort, but I'm a nervous person and it's still not fast enough.
I wonder: am I missing something? Can I make better use of the fact that N is known in advance? Or is it possible to tune the hash map to my needs?
EDIT An additional pre-condition is that the input sequence is badly sorted and the frequency of the keys is about the same.

STL implementations are often not perfect in terms of performance (no holy wars, please).
If you know a guaranteed and sensible upper bound on the number of unique elements (N), then you can trivially implement your own hash table of size 2^s >> N. Here is how I usually do it myself:
int size = 1;
while (size < 3 * N) size <<= 1;
//Note: at least 3X size factor, size = power of two
//count = -1 means empty entry
std::vector<std::pair<X, int>> table(size, std::make_pair(X(), -1));
auto GetHash = [size](X val) -> int { return std::hash<X>()(val) & (size-1); };
for (auto x : input) {
int cell = GetHash(x);
bool ok = false;
for (; table[cell].second >= 0; cell = (cell + 1) & (size-1)) {
if (table[cell].first == x) { //match found -> stop
ok = true;
break;
}
}
if (!ok) { //match not found -> add entry on free place
table[cell].first = x;
table[cell].second = 0;
}
table[cell].second++; //increment counter
}
On MSVC2013, it improves time from 0.62 secs to 0.52 secs compared to your code, given that int is used as type X.
Also, we can choose a faster hash function. Note however, that the choice of hash function depends heavily on the properties of the input. Let's take Knuth's multiplicative hash:
auto GetHash = [size](X val) -> int { return (val*2654435761) & (size-1); };
It further improves time to 0.34 secs.
As a conclusion: do you really want to reimplement standard data structures to achieve a 2X speed boost?
Notes: Speedup may be entirely different on another compiler/machine. You may have to do some hacks if your type X is not POD.

Counting sort really would be best, but it isn't applicable due to the unknown range or spacing.
The counting seems to be easily parallelized with fork-join, e.g. boost::thread.
You could also try a more efficient, hand-rolled hash map. unordered_map typically uses linked lists to counter potentially bad hash functions. The memory overhead of linked lists may hurt performance if the hash table doesn't fit into L1 cache. Closed hashing may use less memory. Some hints for optimizing:
Closed hashing with linear probing and without support for removal
Power-of-two sized hash table, so that bit masking can replace modulo (division requires multiple cycles and there is only one hardware divider per core)
Low load factor (entries divided by size) to minimize collisions. That's a trade-off between memory usage and number of collisions. A load factor over 0.5 should be avoided. A hash table size of 256 seems suitable for 100 entries.
Cheap hash function. You haven't shown the type of X, so perhaps a cheaper hash function could outweigh more collisions.
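The hints above could be combined as in the following sketch, written in Java for illustration. It assumes the keys are plain ints (the question does not show the type X), and the class name and the multiplicative hash constant are choices made for this example, not something prescribed by the answer. The table never resizes, so it relies on the caller's bound on the number of distinct keys being correct.

```java
import java.util.Arrays;

final class ProbingCounter {
    private final int[] keys;
    private final int[] counts; // a count of 0 marks an empty slot
    private final int mask;

    ProbingCounter(int maxDistinctKeys) {
        int size = 1;
        while (size < 3 * maxDistinctKeys) size <<= 1; // power of two, load factor <= 1/3
        keys = new int[size];
        counts = new int[size];
        mask = size - 1;
    }

    // Linear probing: returns the slot holding the key, or the empty slot where it belongs.
    private int slot(int key) {
        int cell = (key * 0x9E3779B1) & mask; // cheap multiplicative hash (assumption)
        while (counts[cell] != 0 && keys[cell] != key) cell = (cell + 1) & mask;
        return cell;
    }

    void add(int key) {
        int cell = slot(key);
        keys[cell] = key;
        counts[cell]++;
    }

    int count(int key) {
        return counts[slot(key)];
    }

    // Counts every input value, sorts the few distinct keys, then expands the counts.
    static int[] sort(int[] input, int maxDistinctKeys) {
        ProbingCounter counter = new ProbingCounter(maxDistinctKeys);
        for (int x : input) counter.add(x);
        int distinct = 0;
        for (int c : counter.counts) if (c != 0) distinct++;
        int[] sortedKeys = new int[distinct];
        int next = 0;
        for (int i = 0; i < counter.keys.length; i++) {
            if (counter.counts[i] != 0) sortedKeys[next++] = counter.keys[i];
        }
        Arrays.sort(sortedKeys);
        int[] out = new int[input.length];
        int pos = 0;
        for (int key : sortedKeys) {
            int n = counter.count(key);
            Arrays.fill(out, pos, pos + n, key);
            pos += n;
        }
        return out;
    }
}
```

For example, `ProbingCounter.sort(new int[] {5, -3, 5, 100, -3, 5}, 3)` yields the array -3, -3, 5, 5, 5, 100.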

I would look to store the items in a sorted vector; with about 100 keys, an insertion into the vector would occur only about once per 10^6 entries. Lookup would be a processor-efficient binary search in the vector.

Why do the following two duplicate finder algorithms have different time complexity?

I was reading this question. The selected answer contains the following two algorithms. I couldn't understand why the first one's time complexity is O(ln(n)). In the worst case, if the array doesn't contain any duplicates, it will loop n times, and so will the second one. Am I wrong or am I missing something? Thank you
1) A faster (in the limit) way
Here's a hash based approach. You gotta pay for the autoboxing, but it's O(ln(n)) instead of O(n^2). An enterprising soul would go find a primitive int-based hash set (Apache or Google Collections has such a thing, methinks.)
boolean duplicates(final int[] zipcodelist)
{
Set<Integer> lump = new HashSet<Integer>();
for (int i : zipcodelist)
{
if (lump.contains(i)) return true;
lump.add(i);
}
return false;
}
2)Bow to HuyLe
See HuyLe's answer for a more or less O(n) solution, which I think needs a couple of add'l steps:
static boolean duplicates(final int[] zipcodelist) {
final int MAXZIP = 99999;
boolean[] bitmap = new boolean[MAXZIP+1];
java.util.Arrays.fill(bitmap, false);
for (int item : zipcodelist) {
if (!bitmap[item]) bitmap[item] = true;
else return true;
}
return false;
}

The first solution should have expected complexity of O(n), since the whole zip code list must be traversed, and processing each zip code is O(1) expected time complexity.
Even taking into consideration that insertion into a HashMap (which backs HashSet) may trigger a re-hash, the complexity is still amortized O(1). This is a bit of a non sequitur, since there may be no relation between Java HashMap and the assumption in the link, but it is there to show that it is possible.
From HashSet documentation:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
It's the same for the second solution, which is correctly analyzed: O(n).
(Just an off-topic note: a BitSet uses less memory than the boolean array seen in the original post, since it packs 8 flags into 1 byte, and that can make it faster.)
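For reference, the BitSet variant mentioned in the note could be sketched as follows. The class and method names are made up, and the 99999 upper bound is taken from the second algorithm in the question. Like that algorithm, it does O(1) work per zip code and therefore O(n) overall.

```java
import java.util.BitSet;

final class ZipDuplicates {
    static final int MAX_ZIP = 99_999; // same bound as MAXZIP in the question

    // Returns true as soon as a zip code is seen for the second time.
    static boolean hasDuplicates(int[] zipCodes) {
        BitSet seen = new BitSet(MAX_ZIP + 1);
        for (int zip : zipCodes) {
            if (seen.get(zip)) {
                return true;
            }
            seen.set(zip);
        }
        return false;
    }
}
```

Whether this is actually faster than the boolean array depends on the machine and the input; it would need to be measured.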

Efficient datastructure for pooling integers

I'm looking for a data structure to help me manage a pool of integers. It's a pool in that I remove integers from the pool for a short while then put them back with the expectation that they will be used again. It has some other odd constraints however, so a regular pool doesn't work well.
Hard requirements:
constant time access to what the largest in use integer is.
the sparseness of the integers needs to be bounded (even if only in principle).
I want the integers to be close to each other so I can quickly iterate over them with minimal unused integers in the range.
Use these if they help with selecting a data structure, otherwise ignore them:
Integers in the pool are 0 based and contiguous.
The pool can be constant sized.
Integers from the pool are only used for short periods with a high churn rate.
I have a working solution but it feels inelegant.
My (sub-optimal) Solution
Constant sized pool.
Put all available integers into a sorted set (free_set).
When a new integer is requested retrieve the smallest from the free_set.
Put all in-use integers into another sorted set (used_set).
When the largest is requested, retrieve the largest from the used_set.
There are a few optimizations that may help with my particular solution (priority queue, memoization, etc). But my whole approach seems wasteful.
I'm hoping there is some esoteric data structure that fits my problem perfectly. Or at least a better pooling algorithm.

pseudo class:
class IntegerPool {
int size = 0;
Set<int> free_set = new Set<int>();
public int Acquire() {
if(!free_set.IsEmpty()) {
return free_set.RemoveSmallest();
} else {
return size++;
}
}
public void Release(int i) {
if(i == size - 1) {
size--;
} else {
free_set.Add(i);
}
}
public int GetLargestUsedInteger() {
return size;
}
}
Edit
RemoveSmallest isn't useful at all. RemoveWhatever is good enough. So Set<int> can be replaced by LinkedList<int> as a faster alternative (or even Stack<int>).

Why not use a balanced binary search tree? You can store a pointer/iterator to the min element and access it for free, and updating it after an insert/delete is an O(1) operation. If you use a self balancing tree, insert/delete is O(log(n)). To elaborate:
insert : Just compare new element to previous min; if it is better make the iterator point to the new min.
delete : If min was deleted, then before removing find the successor (which you can do by just walking the iterator forward 1 step), and then take that guy to be the new min.
While it is theoretically possible to do slightly better using some kind of sophisticated uber-heap data structure (ie Fibonacci heaps), in practice I don't think you would want to deal with implementing something like that just to save a small log factor. Also, as a bonus you get fast in-order traversal for free -- not to mention that most programming languages these days^ come with fast implementations of self-balancing binary search trees out of the box (like red-black trees/avl etc.).
^ with the exception of javascript :P
EDIT: Thought of an even better answer.
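As a rough illustration of the balanced-tree idea applied to the two-set solution from the question, here is a sketch using `java.util.TreeSet` (a red-black tree). The class and method names are invented for this example. One caveat: `TreeSet.last()` and `pollFirst()` walk the tree, so they are O(log n) rather than the constant time asked for; getting O(1) would require caching the extreme element as described above.

```java
import java.util.TreeSet;

final class SortedIntegerPool {
    private final TreeSet<Integer> free = new TreeSet<>();
    private final TreeSet<Integer> used = new TreeSet<>();

    // Constant-sized pool holding the integers 0 .. size-1.
    SortedIntegerPool(int size) {
        for (int i = 0; i < size; i++) {
            free.add(i);
        }
    }

    // Hands out the smallest free integer to keep the used range dense.
    int acquire() {
        Integer smallest = free.pollFirst();
        if (smallest == null) {
            throw new IllegalStateException("pool exhausted");
        }
        used.add(smallest);
        return smallest;
    }

    // Returns an integer to the pool; integers that are not in use are ignored.
    void release(int i) {
        if (used.remove(i)) {
            free.add(i);
        }
    }

    // Largest integer currently in use, or -1 if none is.
    int largestInUse() {
        return used.isEmpty() ? -1 : used.last();
    }
}
```

Always acquiring the smallest free integer is what keeps the in-use integers close together, which was one of the stated requirements.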

In-Place Radix Sort

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?
Preliminary
I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.
At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.
Details
I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).
The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.
Use Cases
Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.
Research
Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.
In particular, there are the following things.
First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).
Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).
Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.
To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?

Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.
void radixSort(string[] seqs, size_t base = 0) {
if(seqs.length == 0)
return;
size_t TPos = seqs.length, APos = 0;
size_t i = 0;
while(i < TPos) {
if(seqs[i][base] == 'A') {
swap(seqs[i], seqs[APos++]);
i++;
}
else if(seqs[i][base] == 'T') {
swap(seqs[i], seqs[--TPos]);
} else i++;
}
i = APos;
size_t CPos = APos;
while(i < TPos) {
if(seqs[i][base] == 'C') {
swap(seqs[i], seqs[CPos++]);
}
i++;
}
if(base < seqs[0].length - 1) {
radixSort(seqs[0..APos], base + 1);
radixSort(seqs[APos..CPos], base + 1);
radixSort(seqs[CPos..TPos], base + 1);
radixSort(seqs[TPos..seqs.length], base + 1);
}
}
Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.
Edit:
I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.

I've never seen an in-place radix sort, and from the nature of the radix sort I doubt that it is much faster than an out-of-place sort as long as the temporary array fits into memory.
Reason:
The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. If it's in place or not will not change this effect.
I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).
That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.
I have no idea if it works out in practice though.
Btw: If you're dealing with DNA strings only: You can compress a char into two bits and pack your data quite a lot. This will cut down the memory requirement by a factor of four over a naive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache misses anyway.
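The two-bits-per-base packing could be sketched like this in Java. The names are hypothetical, the A=0, C=1, G=2, T=3 code assignment is an assumption chosen so that numeric order matches alphabetical order, and a single long only has room for 32 bases, so longer strings would need an array of longs.

```java
final class DnaPacking {
    // Maps a base to its 2-bit code; alphabetical order is preserved.
    static int code(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a DNA base: " + base);
        }
    }

    // Packs up to 32 bases into one long, first base in the highest used bits.
    static long pack(String seq) {
        if (seq.length() > 32) {
            throw new IllegalArgumentException("at most 32 bases fit in a long");
        }
        long packed = 0;
        for (int i = 0; i < seq.length(); i++) {
            packed = (packed << 2) | code(seq.charAt(i));
        }
        return packed;
    }

    // Reverses pack(); the length must be supplied because leading 'A's pack to zero bits.
    static String unpack(long packed, int length) {
        char[] out = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            out[i] = "ACGT".charAt((int) (packed & 3));
            packed >>>= 2;
        }
        return new String(out);
    }
}
```

For strings of equal length up to 31 bases, comparing the packed longs gives the same order as comparing the strings; at exactly 32 bases the top bit is used, so `Long.compareUnsigned` would be needed.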

You can certainly drop the memory requirements by encoding the sequence in bits.
You are looking at permutations so, for length 2, with "ACGT" that's 16 states, or 4 bits.
For length 3, that's 64 states, which can be encoded in 6 bits. So it looks like 2 bits for each letter in the sequence, or about 32 bits for 16 characters like you said.
If there is a way to reduce the number of valid 'words', further compression may be possible.
So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64.
Initialize them to zero.
Iterate through your very very large list of 3 char sequences, and encode them as above.
Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.
Next, regenerate your list.
Iterate through the 64 buckets in order, for the count found in that bucket, generate that many instances of the sequence represented by that bucket.
when all of the buckets have been iterated, you have your sorted array.
A sequence of 4, adds 2 bits, so there would be 256 buckets.
A sequence of 5, adds 2 bits, so there would be 1024 buckets.
At some point the number of buckets will approach your limits.
If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.
I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.
Here is a hack that shows the technique
#include <iostream>
#include <iomanip>
#include <math.h>
#include <cstdlib> // abort
#include <cstring> // memset, strlen
using namespace std;
const int width = 3;
const int bucketCount = 1 << (2 * width); // 4^width buckets, computed exactly instead of via floating point
int *bucket = NULL;
const char charMap[4] = {'A', 'C', 'G', 'T'};
void setup
(
void
)
{
bucket = new int[bucketCount];
memset(bucket, '\0', bucketCount * sizeof(bucket[0]));
}
void teardown
(
void
)
{
delete[] bucket;
}
void show
(
int encoded
)
{
int z;
int y;
int j;
for (z = width - 1; z >= 0; z--)
{
int n = 1;
for (y = 0; y < z; y++)
n *= 4;
j = encoded % n;
encoded -= j;
encoded /= n;
cout << charMap[encoded];
encoded = j;
}
cout << endl;
}
int main(void)
{
// Sort this sequence
const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";
size_t testSequenceLength = strlen(testSequence);
setup();
// load the sequences into the buckets
size_t z;
for (z = 0; z < testSequenceLength; z += width)
{
int encoding = 0;
size_t y;
for (y = 0; y < width; y++)
{
encoding *= 4;
switch (*(testSequence + z + y))
{
case 'A' : encoding += 0; break;
case 'C' : encoding += 1; break;
case 'G' : encoding += 2; break;
case 'T' : encoding += 3; break;
default : abort();
};
}
bucket[encoding]++;
}
/* show the sorted sequences */
for (z = 0; z < bucketCount; z++)
{
while (bucket[z] > 0)
{
show(z);
bucket[z]--;
}
}
teardown();
return 0;
}

If your data set is so big, then I would think that a disk-based buffer approach would be best:
sort(List<string> elements, int prefix)
if (elements.Count < THRESHOLD)
return InMemoryRadixSort(elements, prefix)
else
return DiskBackedRadixSort(elements, prefix)
DiskBackedRadixSort(elements, prefix)
DiskBackedBuffer<string>[] buckets
foreach (element in elements)
buckets[element.MSB(prefix)].Add(element);
List<string> ret
foreach (bucket in buckets)
ret.Add(sort(bucket, prefix + 1))
return ret
I would also experiment grouping into a larger number of buckets, for instance, if your string was:
GATTACA
the first MSB call would return the bucket for GATT (256 total buckets), that way you make fewer branches of the disk based buffer. This may or may not improve performance, so experiment with it.

I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:
You control the reading of the data
You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.
The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.
Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), and then you can build a chunk of the heap while you are waiting for the next chunk of data to come in - even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.
Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunks and chunks as you go.
That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(
See the Wikipedia articles:
Heapsort
Binary heap
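A small sketch of the idea in Java, using `java.util.PriorityQueue` (a binary heap) to absorb chunks as they arrive and to hand out elements in sorted order once reading is done. The class and method names are invented; how the chunks are actually read (asynchronously or not) is left out.

```java
import java.util.List;
import java.util.PriorityQueue;

final class StreamingHeapSort {
    private final PriorityQueue<String> heap = new PriorityQueue<>();

    // Called whenever another chunk of input has been read; the heap grows incrementally.
    void acceptChunk(List<String> chunk) {
        heap.addAll(chunk);
    }

    boolean hasNext() {
        return !heap.isEmpty();
    }

    // Removes and returns the smallest remaining element in O(log n).
    String next() {
        return heap.poll();
    }
}
```

Once the last chunk is in, the consumer can start calling `next()` immediately and process the smallest elements while the rest are still being extracted.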

"Radix sorting with no extra space" is a paper addressing your problem.

Performance-wise you might want to look at more general string-comparison sorting algorithms.
Currently you wind up touching every element of every string, but you can do better!
In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.
A decent general purpose implementation of burstsort is available on source forge at http://sourceforge.net/projects/burstsort/ - but it is not in-place.
For comparison purposes, The C-burstsort implementation covered at http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.

You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.
Strings comprised of the four nucleotide letters A, C, G, and T can be specially encoded into Integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.

You might try using a trie. Sorting the data is simply iterating through the dataset and inserting it; the structure is naturally sorted, and you can think of it as similar to a B-Tree (except instead of making comparisons, you always use pointer indirections).
Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.
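A plain 4-ary trie without any of the cache-layout tuning described above might look like this sketch. The names are hypothetical and only the letters A, C, G and T are accepted. Each node keeps a count so that duplicate strings survive, and a pre-order walk of the children in alphabet order emits the strings sorted.

```java
import java.util.ArrayList;
import java.util.List;

final class DnaTrieSort {
    private static final String ALPHABET = "ACGT";

    private static final class Node {
        final Node[] child = new Node[4];
        int count; // number of inserted strings that end at this node
    }

    static List<String> sort(List<String> input) {
        Node root = new Node();
        for (String s : input) {
            Node node = root;
            for (int i = 0; i < s.length(); i++) {
                int c = ALPHABET.indexOf(s.charAt(i));
                if (c < 0) {
                    throw new IllegalArgumentException("not a DNA base: " + s.charAt(i));
                }
                if (node.child[c] == null) {
                    node.child[c] = new Node();
                }
                node = node.child[c];
            }
            node.count++;
        }
        List<String> out = new ArrayList<>(input.size());
        collect(root, new StringBuilder(), out);
        return out;
    }

    // Pre-order traversal: a string is emitted before any longer string it prefixes.
    private static void collect(Node node, StringBuilder prefix, List<String> out) {
        for (int i = 0; i < node.count; i++) {
            out.add(prefix.toString());
        }
        for (int c = 0; c < 4; c++) {
            if (node.child[c] != null) {
                prefix.append(ALPHABET.charAt(c));
                collect(node.child[c], prefix, out);
                prefix.setLength(prefix.length() - 1);
            }
        }
    }
}
```

This version allocates one object per trie node, so its memory use is far from the packed layout suggested above; it only shows the insert-then-traverse structure.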

I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.

It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort". It's described here: Engineering Radix Sort. The general idea is to do 2 passes on each character - first count how many of each you have, so you can subdivide the input array into bins. Then go through again, swapping each element into the correct bin. Now recursively sort each bin on the next character position.
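The two passes described here (count, then swap each element into its bin, then recurse) can be sketched as follows for the four-letter DNA alphabet. This is an illustrative version written for this thread, not the code from the paper, and it assumes all strings have the same length, as in the question.

```java
final class AmericanFlagSortDna {
    static void sort(String[] a) {
        sort(a, 0, a.length, 0);
    }

    // Sorts a[lo..hi) in place, looking at character position pos and beyond.
    private static void sort(String[] a, int lo, int hi, int pos) {
        if (hi - lo < 2 || pos >= a[lo].length()) {
            return;
        }
        // Pass 1: count how many strings fall into each of the 4 bins.
        int[] end = new int[4];
        for (int i = lo; i < hi; i++) {
            end[code(a[i].charAt(pos))]++;
        }
        // Turn the counts into bin boundaries [start[b], end[b]).
        int[] start = new int[4];
        int offset = lo;
        for (int b = 0; b < 4; b++) {
            start[b] = offset;
            offset += end[b];
            end[b] = offset;
        }
        // Pass 2: swap every element into the next open slot of its own bin.
        int[] next = start.clone();
        for (int b = 0; b < 4; b++) {
            while (next[b] < end[b]) {
                int c = code(a[next[b]].charAt(pos));
                if (c == b) {
                    next[b]++;
                } else {
                    String tmp = a[next[b]];
                    a[next[b]] = a[next[c]];
                    a[next[c]] = tmp;
                    next[c]++;
                }
            }
        }
        // Recurse into each bin on the next character position.
        for (int b = 0; b < 4; b++) {
            sort(a, start[b], end[b], pos + 1);
        }
    }

    private static int code(char base) {
        int c = "ACGT".indexOf(base);
        if (c < 0) {
            throw new IllegalArgumentException("not a DNA base: " + base);
        }
        return c;
    }
}
```

The only extra memory is a few small int arrays per recursion level, and the recursion depth is bounded by the string length.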

Radix-Sort is not cache conscious and is not the fastest sort algorithm for large sets.
You can look at:
ti7qsort: claimed to be the fastest sort for integers (it can also be used for small fixed-size strings).
Inline QSORT
String sorting
You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.

dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.
I suggest a very simple approach:
1. Empirically estimate the largest size m for which a radix sort is efficient.
2. Read blocks of m elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.
3. Mergesort the resulting sorted blocks.
Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require 2n space to sort n items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.
Please note finally that mergesort can be implemented without recursion, and in fact doing it this way makes clear the true linear memory access pattern.
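The approach above might look roughly like this for keys that fit in unsigned 32-bit integers (for example, 2-bit packed sequences of up to 16 letters). This is a minimal in-memory sketch under that assumption: `radixSortBlock`, `merge` and `blockSort` are names invented for the example, the block size m is passed in rather than estimated, and the blocks are merged pairwise by a non-recursive, bottom-up mergesort.

```javascript
// LSD radix sort of one block of unsigned 32-bit keys, 8 bits per pass.
function radixSortBlock(keys) {
  let src = Uint32Array.from(keys);
  let dst = new Uint32Array(src.length);
  for (let shift = 0; shift < 32; shift += 8) {
    const count = new Uint32Array(257);
    for (const k of src) count[((k >>> shift) & 0xff) + 1]++;
    for (let d = 0; d < 256; d++) count[d + 1] += count[d]; // start offsets
    for (const k of src) dst[count[(k >>> shift) & 0xff]++] = k;
    [src, dst] = [dst, src];
  }
  return src; // after four passes the sorted data is back in src
}

// Merge the sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi).
// Both inputs and the output are accessed strictly left to right.
function merge(src, dst, lo, mid, hi) {
  let i = lo;
  let j = mid;
  for (let k = lo; k < hi; k++) {
    if (i < mid && (j >= hi || src[i] <= src[j])) dst[k] = src[i++];
    else dst[k] = src[j++];
  }
}

function blockSort(keys, m) {
  // Step 2: radix sort blocks of m elements each.
  let src = new Uint32Array(keys.length);
  for (let lo = 0; lo < keys.length; lo += m) {
    const hi = Math.min(lo + m, keys.length);
    src.set(radixSortBlock(keys.slice(lo, hi)), lo);
  }
  // Step 3: bottom-up mergesort of the sorted blocks, no recursion.
  let dst = new Uint32Array(keys.length);
  for (let width = m; width < keys.length; width *= 2) {
    for (let lo = 0; lo < keys.length; lo += 2 * width) {
      const mid = Math.min(lo + width, keys.length);
      const hi = Math.min(lo + 2 * width, keys.length);
      merge(src, dst, lo, mid, hi);
    }
    [src, dst] = [dst, src];
  }
  return src;
}

// Usage:
// blockSort([5, 3, 4000000000, 0, 17, 3, 99, 256, 1], 4)
// → Uint32Array [0, 1, 3, 3, 5, 17, 99, 256, 4000000000]
```

The two buffers `src` and `dst` are the 2n space mentioned above. When the data does not fit in memory, the same merge loop would read from and write to files instead.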

First, think about the coding of your problem. Get rid of the strings, replace them by a binary representation. Use the first byte to indicate length+encoding. Alternatively, use a fixed length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to not have exception handling at the hot spot of the inner loop.
OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.
Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:
1. Encoding with 7 length bits of variable-length strings. As they fill up, you replace them by:
2. Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
3. Bitmap encoding of the last three characters of a string.
For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by fewer as you get deeper into the structure.
This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.
For even more performance, you might want to add different block types and a larger size of block. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off.
You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.
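To show where the number 256 comes from: four characters at 2 bits each give an 8-bit index, so the first level can be a plain 256-entry table. The sketch below is only an illustration of that first level. The code table and the names `topLevelBucket` and `distribute` are assumptions for the example, and it assumes every string has at least four characters.

```javascript
// Assumed 2-bit code table, chosen so bucket order equals alphabetical order.
const CODE = { A: 0, C: 1, G: 2, T: 3 };

// Map the first four characters to an index in 0..255.
function topLevelBucket(seq) {
  let index = 0;
  for (let i = 0; i < 4; i++) {
    const code = CODE[seq[i]];
    if (code === undefined) throw new Error(`need four letters from ACGT, got: ${seq}`);
    index = index * 4 + code;
  }
  return index;
}

// Direct radix step: one pass, no comparisons, 256 buckets.
// Visiting the buckets in index order visits the strings in sorted order
// of their four-letter prefix; each bucket still has to be sorted on the
// remaining characters by the deeper structure.
function distribute(seqs) {
  const buckets = Array.from({ length: 256 }, () => []);
  for (const s of seqs) buckets[topLevelBucket(s)].push(s);
  return buckets;
}

// Usage:
// topLevelBucket('AAAAGG') → 0
// topLevelBucket('ACGTAC') → 27
// topLevelBucket('TTTTAA') → 255
```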
Are you interested in handling duplicates? With short sequences, there are going to be some. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.

While the accepted answer perfectly answers the problem as described, I reached this page looking in vain for an algorithm to partition an array in place into N parts. I've written one myself, so here it is.
Warning: this is not a stable partitioning algorithm, so for multilevel partitioning, one must repartition each resulting partition instead of the whole array. The advantage is that it works in place.
The way it helps with the question posed is that you can repeatedly partition in place based on a letter of the string, then sort the partitions with the algorithm of your choice once they are small enough.
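// Partition input[startIndex..endIndex) in place into numPartitions groups.
// partitionFunction(value) must return an integer in 0..numPartitions-1.
// Returns an array of numPartitions + 1 boundaries: partition i occupies
// input[starts[i]..starts[i + 1]).
function partitionInPlace(input, partitionFunction, numPartitions, startIndex = 0, endIndex = -1) {
  if (endIndex === -1) endIndex = input.length;

  // First pass: count the elements of each partition.
  const starts = Array.from({ length: numPartitions + 1 }, () => 0);
  for (let i = startIndex; i < endIndex; i++) {
    const val = input[i];
    const partByte = partitionFunction(val);
    starts[partByte]++;
  }

  // Turn the counts into start offsets (exclusive prefix sum).
  let prev = startIndex;
  for (let i = 0; i < numPartitions; i++) {
    const p = prev;
    prev += starts[i];
    starts[i] = p;
  }
  // indexes[b] is the next position of partition b that still needs checking.
  const indexes = [...starts];
  starts[numPartitions] = prev;

  // Second pass: move every element into its partition.
  let bucket = 0;
  while (bucket < numPartitions) {
    const start = starts[bucket];
    const end = starts[bucket + 1];
    if (end - start < 1) {
      // Empty partition.
      bucket++;
      continue;
    }
    let index = indexes[bucket];
    if (index === end) {
      // Partition completely filled.
      bucket++;
      continue;
    }
    let val = input[index];
    let destBucket = partitionFunction(val);
    if (destBucket === bucket) {
      // Element is already in the right partition.
      indexes[bucket] = index + 1;
      continue;
    }
    // Follow the cycle of displaced elements until one lands back at index.
    let dest;
    do {
      dest = indexes[destBucket] - 1;
      let destVal;
      let destValBucket = destBucket;
      // Skip elements that already sit in their own partition.
      while (destValBucket === destBucket) {
        dest++;
        destVal = input[dest];
        destValBucket = partitionFunction(destVal);
      }
      input[dest] = val;
      indexes[destBucket] = dest + 1;
      val = destVal;
      destBucket = destValBucket;
    } while (dest !== index);
  }
  return starts;
}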
