Adaptive IO Optimization Problem - algorithm

Here is an interesting optimization problem that I have been thinking about for some days now:
In a system I read data from a slow IO device. I don't know beforehand how much data I need. The exact length is only known once I have read an entire package (think of it as having some kind of end symbol). Reading more data than required is not a problem except that it wastes time in IO.
Two constraints also come into play: Reads are very slow. Each byte I read costs. Also, each read request has a constant setup cost regardless of the number of bytes I read. This makes reading byte by byte costly. As a rule of thumb, the setup cost is roughly as expensive as reading 5 bytes.
The packages I read are usually between 9 and 64 bytes, but there are rare occurrences of larger or smaller packages. The entire range is 1 to 120 bytes.
Of course I know a little bit of my data: Packages come in sequences of identical sizes. I can classify three patterns here:
Sequences of reads with identical sizes:
A A A A A ...
Alternating sequences:
A B A B A B A B ...
And sequences of triples:
A B C A B C A B C ...
The special case of degenerate triples exists as well:
A A B A A B A A B ...
(A, B and C denote some package size between 1 and 120 here).
Based on the size of the previous packages, how do I predict the size of the next read request? I need something that adapts fast, uses little storage (let's say below 500 bytes) and is fast from a computational point of view as well.
Oh, and pre-generating some tables won't work because the statistics of read sizes can vary a lot between the devices I read from.
Any ideas?

You need to read at least 3 packages and at most 4 packages to identify the pattern.
Read 3 packages. If they are all same size, then the pattern is AAAAAA...
If they are not all the same size, read the 4th package. If 1=3 & 2=4, the pattern is ABAB. Otherwise, the pattern is ABCABC...
With that outline, it is probably a good idea to do a speculative read of 3 package sizes (something like 3*64 bytes at a single go).
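The pattern check above can be sketched roughly as follows. This is a minimal Python sketch and not the answerer's code: the name `predict_next` and the rule "use the shortest period of 1, 2 or 3 that fits the recent history" are assumptions made for illustration. Treating the degenerate A A B case as an ordinary period of 3 means it needs no special handling.

```python
def predict_next(history, max_period=3):
    """Guess the next package size from the sizes seen so far.

    Tries periods 1, 2, 3 (A A A, A B A B, A B C A B C) and returns the
    value one period back for the shortest period that fits.
    Returns None when there is too little history or nothing fits.
    """
    # Only the most recent samples are checked, so the guess adapts quickly.
    recent = history[-2 * max_period:]
    for period in range(1, max_period + 1):
        # More than one full period is needed to have anything to compare.
        if len(history) <= period:
            continue
        if all(recent[i] == recent[i - period] for i in range(period, len(recent))):
            return history[-period]
    return None
```

When the function returns None (for example after a pattern change), falling back to a generous fixed read size is one option; the history window of six sizes stays far below the 500-byte storage budget.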

I don't see a problem here.. But first, several questions:
1) Can you read the input asynchronously (e.g. separate thread, interrupt routine, etc.)?
2) Do you have some free memory for a buffer?
3) If you've commanded a longer read, are you able to obtain first byte(s) before the whole packet is read?
If so (and I think in most cases it can be implemented), then you can just have a separate thread that reads them at the highest possible speed and stores them in a buffer, stalling when the buffer gets full, so that your normal process can use a synchronous getc() on that buffer.
EDIT: I see.. it's because of CRC or encryption? Well, then you could use some ideas from data compression:
Consider a simple adaptive algorithm of order N for M possible symbols:
int freqs[M][M][M]; // [a][b][c]: occurrences of outcome "c" when the previous values were "a" and "b"
int prev[2];        // some history: the two most recent values
int predict(void){
    int prediction = 0;
    for (int i = 1; i < M; i++)
        if (freqs[prev[0]][prev[1]][i] > freqs[prev[0]][prev[1]][prediction])
            prediction = i;
    return prediction;
}
void add_outcome(int val){
    if (freqs[prev[0]][prev[1]][val]++ > DECAY_LIMIT){
        // halve every count in this context so that old statistics fade out
        for (int i = 0; i < M; i++)
            freqs[prev[0]][prev[1]][i] >>= 1;
    }
    // shift the history
    prev[0] = prev[1];
    prev[1] = val;
}
freqs has to be an array of order N+1, and you have to remember the N previous values (the code above is the N = 2 case). N and DECAY_LIMIT have to be adjusted according to the statistics of the input. However, even they can be made adaptive (for example, if it produces too many misses, then the decay limit can be lowered).
The last problem would be the alphabet. Depending on the context, if there are only several distinct sizes, you can create a one-to-one mapping to your symbols. If there are more, then you can use quantization to limit the number of symbols. The whole algorithm can be written with pointer arithmetic, so that N and M won't be hardcoded.

Since reading is so slow, I suppose you can throw some CPU power at it so you can try to make an educated guess of how much to read.
That would be basically a predictor, that would have a model based on probabilities. It would generate a sample of predictions of the upcoming message size, and the cost of each. Then pick the message size that has the best expected cost.
Then when you find out the actual message size, use Bayes' rule to update the model probabilities, and do it again.
Maybe this sounds complicated, but if the probabilities are stored as fixed-point fractions you won't have to deal with floating point, so it may not be much code. I would use something like a Metropolis-Hastings algorithm as my basic simulator and Bayesian update framework. (This is just an initial stab at thinking about it.)
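To make the "pick the size with the best expected cost" step concrete, here is a small Python sketch. It covers only that step, not the Bayesian update. The cost model is an assumption: every request costs the setup cost from the question (5 byte-units) plus the bytes read, and a read that turns out too short is followed by one more read of exactly the missing bytes (in the real problem the remainder is not known either). The names `expected_cost` and `best_read_len` are made up for this example.

```python
SETUP_COST = 5  # question's rule of thumb: one request costs about 5 bytes of reading

def expected_cost(read_len, size_probs):
    """Expected cost of issuing a first read of read_len bytes.

    size_probs maps package size -> probability.
    Assumption: a too-short read is followed by one read of the exact remainder.
    """
    total = 0.0
    for size, prob in size_probs.items():
        cost = SETUP_COST + read_len
        if size > read_len:
            cost += SETUP_COST + (size - read_len)
        total += prob * cost
    return total

def best_read_len(size_probs):
    """Return the candidate read length with the lowest expected cost."""
    # Under this cost model only lengths equal to a possible size can be optimal,
    # so those are the only candidates tried.
    return min(sorted(size_probs), key=lambda r: expected_cost(r, size_probs))
```

With sizes 10 (90%) and 64 (10%) this picks 10: over-reading to 64 every time costs more than the occasional second request. With sizes 10 and 12 at 50% each it picks 12, because two wasted bytes are cheaper than a second setup.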


A memory efficient way for a randomized single pass over a set of indices

I have a big file (about 1GB) which I am using as a basis to do some data integrity testing. I'm using Python 2.7 for this because I don't care so much about how fast the writes happen, my window for data corruption should be big enough (and it's easier to submit a Python script to the machine I'm using for testing)
To do this I'm writing a sequence of 32 bit integers to memory as a background process while other code is running, like the following:
from struct import pack
with open('./FILE', 'rb+', buffering=0) as f:
    counter = 1
    while counter < SIZE+1:
        f.write(pack('>i', counter))
        counter += 1
Then after I do some other stuff it's very easy to see if we missed a write, since there will be a gap instead of a sequentially increasing sequence. This works well enough. My problem is that some data corruption cases might only be caught with random I/O (not sequential like this), based on how we track changes to files.
So what I need is a method for performing a single pass of random I/O over my 1GB file, but I can't really store the index order in memory since 1GB ~= 250 million 4-byte integers. I considered chunking up the file into smaller pieces and indexing those, maybe 500 KB or something, but if there is a way to write a generator that can do the same job that would be awesome. Like this:
from struct import pack
def rand_index_generator():
    counter = 0
    while counter < MAX:
        yield generator.next_index()  # placeholder: this is the part I'm looking for
        counter += 1
with open('./FILE', 'rb+', buffering=0) as f:
    counter = 1
    for index in rand_index_generator():
        f.seek(4*index)  # each value occupies 4 bytes
        f.write(pack('>i', counter))
        counter += 1
I need it:
Not to run out of memory (so no pouring the random sequence into a list)
To be reproducible so I can verify these values in the same order later
Is there a way to do this in Python 2.7?
Just to provide an answer for anyone who has the same problem, the approach that I settled on was this, which worked well enough if you don't need something all that random:
def rand_index_generator(a,b):
    ctr = 0
    while True:
        yield (ctr%b)
        ctr += a
Then, initialize it with your index size, b and a value a which is coprime to b. This is easy to choose if b is a power of two, since a just needs to be an odd number to make sure it isn't divisible by 2. It's a hard requirement for the two values to be coprime, so you might have to do more work if your index size b is not such an easily factored number as a power of 2.
index_gen = rand_index_generator(1934919251, 2**28)
Then each time you want a new index you call next(index_gen), and this is guaranteed to iterate over the numbers in [0, 2^28-1] in a semi-randomish manner depending on your choice of 'a'.
There's really no point in picking an a value larger than your index size, since the mod gets rid of the remainder anyways. This isn't a very good approach in terms of randomness, but it's very efficient in terms of memory and speed which is what I care about for simulating this write workload.
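For an index size that is not a power of two, the coprime requirement can be checked with a gcd. The following is a minimal Python 3 sketch, not part of the original solution; `pick_stride` and `stride_indices` are hypothetical names, and on Python 2.7 `fractions.gcd` would replace `math.gcd`.

```python
from math import gcd

def pick_stride(candidate, size):
    """Return the first stride >= candidate that is coprime to size."""
    stride = candidate
    while gcd(stride, size) != 1:
        stride += 1
    return stride

def stride_indices(stride, size):
    """Yield every index in range(size) exactly once, stepping by stride modulo size."""
    index = 0
    for _ in range(size):
        yield index
        index = (index + stride) % size
```

For example, with size 12 a requested stride of 4 is bumped to 5, and the visit order starts 0, 5, 10, 3. Because the stride is coprime to the size, the sequence only returns to 0 after all indices have been visited, which is the property the approach above relies on.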

Is there any probabilistic data structure that reduces the space complexity of a large number of counters?

Basically I need to keep track of a large number of counters. I can increment or decrement each counter by name. The simplest way to do so is to use a hash table, using counter_name as key and its corresponding count as the value for that key.
The counters don't need to be 100% accurate, approximate values for count are fine. So I'm wondering if there is any probabilistic data structure that can reduce the space complexity of N counters to lower than O(N), kinda similar to how HyperLogLog reduces the memory requirement of counting N items by giving only an approximate result. Any ideas?
In my opinion, the thing you are looking for is Count-min sketch.
Reading a stream of elements a1, a2, a3, ..., an, where there can be a lot of repeated elements, at any time it will give you the answer to the following question: how many times have you seen element ai so far?
Basically, your unique elements can be mapped one-to-one onto your counters. A Count-min sketch lets you adjust its parameters to trade memory for accuracy.
P.S. I described some other popular probabilistic data structures here.
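As an illustration of the idea, here is a minimal Count-min sketch in Python. It is a sketch under assumptions rather than a tuned implementation: the class name, the default `width` and `depth`, and the use of salted SHA-256 as the row hashes are all choices made for this example.

```python
import hashlib

class CountMinSketch:
    """depth rows of width counters; memory is width * depth regardless of key count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, name):
        # One differently salted hash per row, so rows collide on different keys.
        for row in range(self.depth):
            digest = hashlib.sha256(("%d:%s" % (row, name)).encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, name, delta=1):
        """Increment (or, with a negative delta, decrement) the counter for name."""
        for row, col in self._columns(name):
            self.rows[row][col] += delta

    def estimate(self, name):
        # Collisions only ever add other keys' counts, so the smallest cell is the best guess.
        return min(self.rows[row][col] for row, col in self._columns(name))
```

The estimate never undercounts as long as no true count goes negative; a larger `width` reduces the overcount and a larger `depth` reduces the chance of a bad estimate.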
Stefan Haustein's correct that the names are likely to take more space than the counters, and you may be able to prioritise certain names as he suggests, but failing that you can consider how best to store the names. If they're fairly short (e.g. 8 characters or less), you might consider using a closed hashing table that stores them directly in the buckets. If they're long, you could store them contiguously (NUL terminated) in a block of memory, and in the hash table store the offset into that block of their first character.
For the counter itself, you can save space by using a probabilistic approach as follows:
#include <cstdlib>
template <typename T, typename Q = unsigned>
class Approx_Counter
{
  public:
    Approx_Counter() : n_(0) { }
    Approx_Counter& operator++()
    {
        if (n_ < 2 || rand() % (operator Q()) == 0)  // 1 in 2^n_ chance once n_ >= 2
            ++n_;
        return *this;
    }
    operator Q() const { return n_ < 2 ? n_ : Q(1) << n_; }
  private:
    T n_;  // stored exponent
};
Then you can use e.g. Approx_Counter<unsigned char, unsigned long>. Swap out rand() for a C++11 generator if you care.
The idea's simple:
when n_ is 0, ++ has definitely not been invoked
when n_ is 1, ++ has definitely been invoked exactly once
when n_ >= 2, it indicates ++ has probably been invoked about 2^n_ times
To keep that last implication in line with the number of ++ invocations actually made, each invocation has a 1 in 2^n_ chance of actually incrementing n_ again.
Just make sure your rand() or substitute returns values much larger than the largest counter value you want to track, otherwise you'll get rand() % (operator Q()) == 0 too often and increment inappropriately.
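For experimenting with how well such a counter tracks the real count, the same scheme can be written in a few lines of Python. This is only a sketch mirroring the rules described above; the class and method names are made up, and the injectable `rng` is an assumption added so runs can be made repeatable.

```python
import random

class ApproxCounter:
    """Stores only the exponent n; the represented count is about 2**n."""

    def __init__(self, rng=None):
        self.n = 0
        self.rng = rng if rng is not None else random.Random()

    def value(self):
        # Exact for 0 and 1, a power of two afterwards.
        return self.n if self.n < 2 else 1 << self.n

    def increment(self):
        # The first two increments are always stored; later ones
        # succeed with probability 1 / 2**n.
        if self.n < 2 or self.rng.randrange(self.value()) == 0:
            self.n += 1
```

After a few thousand calls to `increment()` the stored exponent stays small enough for a single byte, while `value()` gives the order of magnitude of the true count.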
That said, having a smaller counter doesn't help much if you have pointers or offsets to it, so you'll want to squeeze the counter into the bucket too, another reason to prefer your own closed hashing implementation if you genuinely need to tighten up memory usage but want to stick with a hash table (a trie is another possibility).
The above is still O(N) in counter space, just with a smaller constant. For genuinely < O(N) options, you need to consider whether/how keys are related, such that incrementing a counter might reasonably impact multiple keys. You've given us no insight into that in your question to date.
The names probably take up more space than the counters.
How about having a fixed number of counters and only keep the ones with the highest counts, plus some kind of LRU mechanism to allow new counters to rise to the top? I guess it really depends on your use case...

Sum reduction of binary sequence

Consider a binary sequence, for example:
1 1 0 0 0 1 1 1
I have to find the sum of this series (actually in parallel):
Sum = 1+1+0+0+0+1+1+1 = 5
This is a waste of resources: why invest time in adding 0s?
Is there any clever way to sum this sequence so I can avoid unnecessary additions?
Operate at the byte level rather than the bit level. Use a small LUT to convert a byte to a population count. That way you're only doing one lookup and one add per 8 bits. Unless your data is likely to be very sparse this should be quite efficient.
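A minimal Python sketch of the lookup-table idea, assuming the bits are packed into a bytes-like object; the names `POPCOUNT` and `count_ones` are made up for the example.

```python
# 256-entry table: POPCOUNT[b] is the number of 1 bits in the byte value b.
POPCOUNT = [bin(b).count("1") for b in range(256)]

def count_ones(data):
    """Sum the set bits in a bytes-like object with one table lookup per byte."""
    return sum(POPCOUNT[byte] for byte in data)
```

For the sequence 1 1 0 0 0 1 1 1 packed into one byte, `count_ones(bytes([0b11000111]))` gives 5 with a single lookup instead of eight additions.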
Well it depends on how you store your bitset.
If it's an array, then you can't do more than a plain for. If you want to do this in parallel, just split the array in chunks and process them concurrently.
If we are talking about a bitset (storing the bits in a native (32/64-bit) integer type), then the simplest way to count bits would be this one:
int bitset;
int s = 0;
for (; bitset; s++)
    bitset &= bitset-1;
This clears the lowest set bit at every step, so it takes O(s) steps, where s is the number of set bits.
Of course, you can combine these two methods if you need more than 32/64 bits.
I dunno why people are answering without even looking at the link in the first comment on the question. You can easily make it under O(size_of_bitset), at least when it comes to the constant factor.
You could use this method (found in link by J.F. Sebastian):
inline int count_bits(unsigned int num){
    int sum = 0;
    for (; num; sum++) num &= num-1;  // clear the lowest set bit
    return sum;
}
int main (void){
    int array[N];  // fill with the data to count
    int total_sum = 0;
    #pragma omp parallel for reduction(+:total_sum)
    for (size_t i = 0; i < N; i++){
        total_sum += count_bits(array[i]);
    }
    return 0;
}
This will count the number of set bits in the memory range of the array in parallel. The inline is important to avoid unnecessary call overhead in the loop, and the compiler should also be able to optimize it much better.
You can swap the count_bits with anything better that counts bits in an integer to get faster if you find anything. This version has complexity of O(bits_set) (not size of the bit set!).
Invoking the parallel construct introduces quite a lot of overhead compared to a single summation, so the array needs to be quite large to compensate.
The parallelism is done via OpenMP. The partial sum of each thread is summed at the end of the parallel loop and stored in total_sum. Note that, due to the reduction clause, total_sum is private to each thread inside the loop.
You could alter the code to make it count bits set in arbitrary memory region but it is quite important for it to be memory aligned when you perform operations on such low level.
As far as I can see, it would be wasteful to try to handle the zeros specially. As @bdares said, addition is really cheap. At a minimum, you'll need to execute N instructions to sum up an N-bit sequence; that would be if you unconditionally sum every bit. If you add a test to see whether the bit is a 0 or 1, that's another instruction that needs to be executed for each bit. Even if there's no branch penalty, you're executing a minimum of 1 instruction for every bit (the conditional test), and then you're also executing the original instruction (the add) for any bits that are equal to 1. So even without branch penalty, this takes more time to execute.
@bdares mentions that the compiler will optimize out the branches, but that's only if the value of each bit is known at compile time, and if you know the values of the bits at compile time, you should just add them up yourself in advance.
There might be some cute things you can do with bit twiddling. For instance, if you take the bits two at a time you're adding up values of 0, 1, 2, or 3, and only have half as many additions to do. There may be something you can then do with the result to convert it into the value you want, but I haven't actually thought about how to do that.
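One well-known way to take this "several bits at a time" idea to its conclusion is to add neighbouring bit fields in parallel inside a single word. The Python sketch below shows that technique for 32-bit values; it is an illustration of the general trick, not something the answer above spelled out, and the name `popcount32` is made up.

```python
def popcount32(x):
    """Count set bits in a 32-bit value by summing neighbouring fields in parallel."""
    x &= 0xFFFFFFFF
    x = (x & 0x55555555) + ((x >> 1) & 0x55555555)   # 16 two-bit sums, each 0..2
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 8 four-bit sums, each 0..4
    x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F)   # 4 byte sums, each 0..8
    x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF)   # 2 half-word sums, each 0..16
    return (x & 0x0000FFFF) + (x >> 16)              # final sum, 0..32
```

This takes five add steps for 32 bits, no matter how many of them are set, and involves no branches.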

Range extremes don't seem to get drawn by random()

For several valid reasons I have to use BSD's random() to generate awfully large amounts of random numbers, and since its cycle is quite short (~2^69, if I'm not mistaken) the quality of such numbers degrades pretty quickly for my use case. I could use the rng board I have access to, but it's painfully slow, so I thought I could do this trick: take one number from the board, use it to seed random(), use random() to draw numbers and reseed it when the board says a new number is available. The board generates about 100 numbers per second, so my guess is that random() hardly gets to cycle over and the generation rate easily keeps up with my requirement of several million numbers per second.
Anyway, the problem is that random() claims to uniformly draw numbers between 0 and (2^31)-1, but I've been drawing an uncountable amount of numbers and I've never ever seen a 0 nor a (2^31)-1 so far. Maybe some 1 and (2^31)-2, but I've never seen the extremes. Now, I know the problem with random numbers is that you can never be sure (see Dilbert, Debian), but this seems extremely odd nonetheless. Moreover, I tried analysing the generated datasets with Octave using the histc() function, and the lowest and the highest bins contain between half and three quarters the amount of numbers of the middle bins (which in turn are uniformly filled, so I guess in some sense the distribution is "uniform").
Can anybody explain this?
EDIT Some code
The board outputs this structure with the three components, and then I do some mumbo-jumbo combining them to produce the seed. I have no specs about this board; it's an ancient piece of hardware thrown together by a previous student some years ago, there's little documentation, and the formula I'm using is one of those suggested in the docs. The STEP parameter tells me how many numbers I can draw using one seed, so I can optimise performance and throttle down CPU usage at the same time.
float n = fabsf(fmod(sqrt(a.s1*a.s1 + a.s2*a.s2 + a.s3*a.s3), 1.0));
unsigned int seed = n * UINT32_MAX;
srandom(seed);  // assumed: reseeding step implied by the description, missing from the listing
for(int i = 0; i < STEP; i++) {
    long r = random();
    n = (float)r / (UINT32_MAX >> 1);
    [_numbers addObject:[NSNumber numberWithFloat:n]];
}
Are you certain that
#include <stdlib.h>
int main(void) {
    while (random() != 0L);
}
hangs indefinitely? On my Linux machine (the GNU C library uses the same linear feedback shift register as BSD, albeit with a different seeding procedure) it doesn't.
According to this reference the algorithm produces 'runs' of consecutive zeroes or ones up to length n-1 where n is the size of the shift register. When this has a size of 31 integers (the default case) we can even be certain that, eventually, random() will return 0 a whopping 30 (but never 31) times in a row! Of course, we may have to wait a few centuries to see it happening...
To extend the cycle length, one method is to run two RNGs, with different periods, and XOR their output. See L'Ecuyer 1988 for some examples.
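The effect on the cycle length can be seen with a toy example. The Python sketch below is only an illustration: the two tiny multiplicative generators (3x mod 7 and 2x mod 11) are assumptions picked so the periods can be checked by hand, they are not the parameters from L'Ecuyer's paper and are useless as real RNGs.

```python
from math import gcd

def lcg(multiplier, modulus, state=1):
    """Tiny multiplicative generator: state -> multiplier * state mod modulus."""
    while True:
        state = (multiplier * state) % modulus
        yield state

def cycle_length(step, start):
    """Number of steps until the state returns to start."""
    state, count = step(start), 1
    while state != start:
        state, count = step(state), count + 1
    return count

def combined(gen_a, gen_b):
    """XOR the outputs of two generators, value by value."""
    for a, b in zip(gen_a, gen_b):
        yield a ^ b

# Toy parameters (assumptions, far too small for real use).
period_a = cycle_length(lambda s: (3 * s) % 7, 1)
period_b = cycle_length(lambda s: (2 * s) % 11, 1)
# The pair of internal states only repeats after lcm(period_a, period_b) steps.
joint_period = period_a * period_b // gcd(period_a, period_b)
```

Here the individual periods are 6 and 10 and the joint state repeats every 30 steps. The XORed output cannot repeat less often than a divisor of that joint period, so choosing periods with a large least common multiple is what buys the longer cycle.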

How would you sort 1 million 32-bit integers in 2MB of RAM?

Please, provide code examples in a language of your choice.
No constraints set on external storage.
Example: Integers are received/sent via network. There is sufficient space on local disk for intermediate results.
Split the problem into pieces small enough to fit into available memory, then use merge sort to combine them.
Sorting a million 32-bit integers in 2MB of RAM using Python by Guido van Rossum
1 million 32-bit integers = 4 MB of memory.
You should sort them using some algorithm that uses external storage. Mergesort, for example.
You need to provide more information. What extra storage is available? Where are you supposed to store the result?
Otherwise, the most general answer:
1. load the first half of data into memory (2MB), sort it by any method, output it to file.
2. load the second half of data into memory (2MB), sort it by any method, keep it in memory.
3. use merge algorithm to merge the two sorted halves and output the complete sorted data set to a file.
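The steps above generalise to any number of chunks. Here is a minimal Python 3 sketch of that structure, written for this answer as an illustration: the function names are made up, values are assumed to be unsigned 32-bit integers, and since a Python list needs far more than 4 bytes per element, `chunk_size` would have to be chosen well below 500,000 to really stay within 2MB.

```python
import heapq
import struct
import tempfile

def write_run(sorted_chunk):
    """Write one sorted chunk to a temporary file as little-endian uint32 values."""
    run = tempfile.TemporaryFile()
    run.write(struct.pack("<%dI" % len(sorted_chunk), *sorted_chunk))
    run.seek(0)
    return run

def read_run(run):
    """Stream a run back one value at a time."""
    while True:
        raw = run.read(4)
        if len(raw) < 4:
            return
        yield struct.unpack("<I", raw)[0]

def external_sort(values, chunk_size):
    """Yield values in sorted order, holding only chunk_size of them in memory at once."""
    runs, chunk = [], []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            runs.append(write_run(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(write_run(sorted(chunk)))
    # heapq.merge pulls lazily from each run, so only one value per run is resident.
    for value in heapq.merge(*(read_run(run) for run in runs)):
        yield value
    for run in runs:
        run.close()
```

Usage would look like `for n in external_sort(incoming_numbers, 100000): send(n)`, where `incoming_numbers` and `send` stand in for whatever the network side provides.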
This Wikipedia article on External Sorting has some useful information.
Dual tournament sort with polyphased merge
#!/usr/bin/env python
import random
from sort import Pickle, Polyphase
nrecords = 1000000
available_memory = 2000000 # number of bytes
#NOTE: it doesn't count memory required by Python interpreter
record_size = 24 # (20 + 4) number of bytes per element in a Python list
heap_size = available_memory / record_size
p = Polyphase(compare=lambda x,y: cmp(y, x), # descending order
              max_files=4 * (nrecords / heap_size + 1))
# put records
maxel = 1000000000
for _ in xrange(nrecords):
    p.put(random.randrange(maxel)) # assumed method name; this line is missing in the listing
# get sorted records
last = maxel
for n, el in enumerate(p.get_all()):
    if el > last: # elements must be in descending order
        print "not sorted %d: %d %d" % (n, el ,last)
    last = el
assert nrecords == (n + 1) # check all records read
Um, store them all in a file.
Memory map the file (you said there was only 2M of RAM; let's assume the address space is large enough to memory map a file).
Sort them using the file backing store as if it were real memory now!
Here's a valid and fun solution.
Load half the numbers into memory. Heap sort them in place and write the output to a file. Repeat for the other half. Use external sort (basically a merge sort that takes file i/o into account) to merge the two files.
Make heap sort faster in the face of slow external storage:
Start constructing the heap before all the integers are in memory.
Start putting the integers back into the output file while heap sort is still extracting elements
As people above mention, 1 million 32-bit ints take 4 MB.
You can fit as many numbers as possible into as little space as possible by using the types int, short and char in C++. You could be slick (but have odd, dirty code) by doing several kinds of casting to stuff things everywhere.
Here it is, off the top of my head:
anything that is less than 2^8 (0 - 255) gets stored as a char (1-byte data type)
anything that is at least 2^8 and less than 2^16 (256 - 65535) gets stored as a short (2-byte data type)
The rest of the values would be put into int (4-byte data type).
You would want to specify where the char section starts and ends, where the short section starts and ends, and where the int section starts and ends.
No example, but Bucket Sort has relatively low complexity and is easy enough to implement
