A memory efficient way for a randomized single pass over a set of indices - algorithm

I have a big file (about 1GB) which I am using as a basis to do some data integrity testing. I'm using Python 2.7 for this because I don't care so much about how fast the writes happen, my window for data corruption should be big enough (and it's easier to submit a Python script to the machine I'm using for testing)
To do this I'm writing a sequence of 32 bit integers to memory as a background process while other code is running, like the following:
from struct import pack
with open('./FILE', 'rb+', buffering=0) as f:
f.seek(0)
counter = 1
while counter < SIZE+1:
f.write(pack('>i', counter))
counter+=1
Then after I do some other stuff it's very easy to see if we missed a write since there will be a gap instead of the sequential increasing sequence. This works well enough. My problem is some data corruption cases might only be caught with random I/O (not sequential like this) based on how we track changes to files
So what I need is a method for performing a single pass of random I/O over my 1GB file, but I can't really store this in memory since 1GB ~= 250 million 4-byte integers. Considered chunking up the file into smaller pieces and indexing those, maybe 500 KB or something, but if there is a way to write a generator that can do the same job that would be awesome. Like this:
from struct import pack
def rand_index_generator:
generator = RAND_INDEX(1, MAX+1, NO REPLACEMENT)
counter = 0
while counter < MAX:
counter+=1
yield generator.next_index()
with open('./FILE', 'rb+', buffering=0) as f:
counter = 1
for index in rand_index_generator:
f.seek(4*index)
f.write(pack('>i', counter))
counter+=1
I need it:
Not to run out of memory (so no pouring the random sequence into a list)
To be reproducible so I can verify these values in the same order later
Is there a way to do this in Python 2.7?

Just to provide an answer for anyone who has the same problem, the approach that I settled on was this, which worked well enough if you don't need something all that random:
def rand_index_generator(a,b):
ctr=0
while True:
yield (ctr%b)
ctr+=a
Then, initialize it with your index size, b and a value a which is coprime to b. This is easy to choose if b is a power of two, since a just needs to be an odd number to make sure it isn't divisible by 2. It's a hard requirement for the two values to be coprime, so you might have to do more work if your index size b is not such an easily factored number as a power of 2.
index_gen = rand_index_generator(1934919251, 2**28)
Then each time you want the new index you use index_gen.next() and this is guaranteed to iterate over numbers between [0,2^28-1] in a semi-randomish manner depending on your choice of 'a'
There's really no point in picking an a value larger than your index size, since the mod gets rid of the remainder anyways. This isn't a very good approach in terms of randomness, but it's very efficient in terms of memory and speed which is what I care about for simulating this write workload.

Related

Is there difference between Cache index address calculation vs Division hash function?

Upon studying hash data structure and cache memory from computer architecture, I noticed that they're very similar.
Division hash function calculates index by hash(k) = k Mod (table size M) but my DS book says M should be a prime number or at least an odd number, because if M is an even number, the result is always even when k is even, odd when k is odd, so even M should be avoided since you often use memory addresses which are always even.
And yet, my CA book says for direct-mapped cache you use (Block address) Mod (Number of blocks in the cache) and the result indices look uniform. Why is this? It's all very confusing because MIPS uses 32 bit address every 4 bytes which is even number. But I think it's because they threw out the last 2 bits since they're byte offsets?
And, since it uses (Block address) Mod (Number of blocks in the cache), it makes the cache size power of 2 so that you can just use the lower x bits of the block address.
But this method looks exactly the same as division hash function, except you make the hash table power of 2, which is even (data structure book said use prime or odd) and use the lower bits of the block address.
Are these 2 different methods? If so, what's the cache one called? I would really appreciate a reply please. Thank you.
The reason for not using an even number for hash table is described here.
And how caches use addresses to calculate line numbers are described here. And its ok for caches to map more than one entry to the same line. Just because an address is mapped to a cacheline which has data, we don't blindly use the data in that cacheline. We also do a tag comparison to make sure that the content is the cacheline is what exactly we are looking for.
The reason for using a prime to take the modulo by is to get "mixing" of the bits, which is helpful if the integers that you're hashing have a poor structure. That isn't the only way to deal with it though, and for example the Java standard library doesn't use that, it uses a separate "mixing" function (that XORs the input with right-shifted versions of itself) and then uses a power-of-two sizes table. Either way it's protection against badly distributed input, which isn't necessary in and of itself - if the input was always nicely distributed you wouldn't need it.
Memory addresses are usually fairly nicely distributed, because it's typically used in sequential pieces. The obvious exception is that you will see highly aligned big objects, which would conflict with each other in the cache if nothing was done about it. Of course you will probably use a set-associative cache rather than direct mapped, since it is far more robust against degradation, and that would take care of a lot of that. But nothing is ever immune to bad patterns (that also goes for hash-mod-prime, which you can easily defeat if you know the prime), but a fairly simple improvement (which is also used in practice, or at least was, more advanced techniques exist now - combined with adaptive replacement strategies that mitigate bad access patterns) is to XOR some of the higher address bits into the index. This is hash-strengthening, the same technique used in the Java standard library, but a much simpler version of it.
Computing a remainder by a prime number (or really anything that isn't a power of two) is not something you'd want to do in this case, it's a slow computation by itself, and it leaves you with an awkwardly sized cache that doesn't fully use the power of its decoders, which adds to the slowness (or reduces cache size for a given latency, depending on how you look at it). The difference between that and XORing some of the high bits into the low bits is much bigger in hardware than it is in software, since XOR is really a trivial operation in hardware, much faster as a circuit operation than as an instruction.

Random number generation from 1 to 7

I was going through Google Interview Questions. to implement the random number generation from 1 to 7.
I did write a simple code, I would like to understand if in the interview this question asked to me and if I write the below code is it Acceptable or not?
import time
def generate_rand():
ret = str(time.time()) # time in second like, 12345.1234
ret = int(ret[-1])
if ret == 0 or ret == 1:
return 1
elif ret > 7:
ret = ret - 7
return ret
return ret
while 1:
print(generate_rand())
time.sleep(1) # Just to see the output in the STDOUT
(Since the question seems to ask for analysis of issues in the code and not a solution, I am not providing one. )
The answer is unacceptable because:
You need to wait for a second for each random number. Many applications need a few hundred at a time. (If the sleep is just for convenience, note that even a microsecond granularity will not yield true random numbers as the last microsecond will be monotonically increasing until 10us are reached. You may get more than a few calls done in a span of 10us and there will be a set of monotonically increasing pseudo-random numbers).
Random numbers have uniform distribution. Each element should have the same probability in theory. In this case, you skew 1 more (twice the probability for 0, 1) and 7 more (thrice the probability for 7, 8, 9) compared to the others in the range 2-6.
Typically answers to this sort of a question will try to get a large range of numbers and distribute the ranges evenly from 1-7. For example, the above method would have worked fine if u had wanted randomness from 1-5 as 10 is evenly divisible by 5. Note that this will only solve (2) above.
For (1), there are other sources of randomness, such as /dev/random on a Linux OS.
You haven't really specified the constraints of the problem you're trying to solve, but if it's from a collection of interview questions it seems likely that it might be something like this.
In any case, the answer shown would not be acceptable for the following reasons:
The distribution of the results is not uniform, even if the samples you read from time.time() are uniform.
The results from time.time() will probably not be uniform. The result depends on the time at which you make the call, and if your calls are not uniformly distributed in time then the results will probably not be uniformly distributed either. In the worst case, if you're trying to randomise an array on a very fast processor then you might complete the entire operation before the time changes, so the whole array would be filled with the same value. Or at least large chunks of it would be.
The changes to the random value are highly predictable and can be inferred from the speed at which your program runs. In the very-fast-computer case you'll get a bunch of x followed by a bunch of x+1, but even if the computer is much slower or the clock is more precise, you're likely to get aliasing patterns which behave in a similarly predictable way.
Since you take the time value in decimal, it's likely that the least significant digit doesn't visit all possible values uniformly. It's most likely a conversion from binary to some arbitrary number of decimal digits, and the distribution of the least significant digit can be quite uneven when that happens.
The code should be much simpler. It's a complicated solution with many special cases, which reflects a piecemeal approach to the problem rather than an understanding of the relevant principles. An ideal solution would make the behaviour self-evident without having to consider each case individually.
The last one would probably end the interview, I'm afraid. Perhaps not if you could tell a good story about how you got there.
You need to understand the pigeonhole principle to begin to develop a solution. It looks like you're reducing the time to its least significant decimal digit for possible values 0 to 9. Legal results are 1 to 7. If you have seven pigeonholes and ten pigeons then you can start by putting your first seven pigeons into one hole each, but then you have three pigeons left. There's nowhere that you can put the remaining three pigeons (provided you only use whole pigeons) such that every hole has the same number of pigeons.
The problem is that if you pick a pigeon at random and ask what hole it's in, the answer is more likely to be a hole with two pigeons than a hole with one. This is what's called "non-uniform", and it causes all sorts of problems, depending on what you need your random numbers for.
You would either need to figure out how to ensure that all holes are filled equally, or you would have to come up with an explanation for why it doesn't matter.
Typically the "doesn't matter" answer is that each hole has either a million or a million and one pigeons in it, and for the scale of problem you're working with the bias would be undetectable.
Using the same general architecture you've created, I would do something like this:
import time
def generate_rand():
ret = str(time.time()) # time in second like, 12345.1234
ret = ret % 8 # will return pseudorandom numbers 0-7
if ret == 0:
return 1 # or you could also return the result of another call to generate_rand()
return ret
while 1:
print(generate_rand())
time.sleep(1)

Why are my nested for loops taking so long to compute?

I have a code that generates all of the possible combinations of 4 integers between 0 and 36.
This will be 37^4 numbers = 1874161.
My code is written in MATLAB:
i=0;
for a = 0:36
for b= 0:36
for c = 0:36
for d = 0:36
i=i+1;
combination(i,:) = [a,b,c,d];
end
end
end
end
I've tested this with using the number 3 instead of the number 36 and it worked fine.
If there are 1874161 combinations, and with An overly cautions guess of 100 clock cycles to do the additions and write the values, then if I have a 2.3GHz PC, this is:
1874161 * (1/2300000000) * 100 = 0.08148526086
A fraction of a second. But It has been running for about half an hour so far.
I did receive a warning that combination changes size every loop iteration, consider predefining its size for speed, but this can't effect it that much can it?
As #horchler suggested you need to preallocate the target array
This is because your program is not O(N^4) without preallocation. Each time you add new line to array it need to be resized, so new bigger array is created (as matlab do not know how big array it will be it probably increase only by 1 item) and then old array is copied into it and lastly old array is deleted. So when you have 10 items in array and adding 11th, then a copying of 10 items is added to iteration ... if I am not mistaken that leads to something like O(N^12) which is massively more huge
estimated as (N^4)*(1+2+3+...+N^4)=((N^4)^3)/2
Also the reallocation process is increasing in size breaching CACHE barriers slowing down even more with increasing i above each CACHE size barrier.
The only solution to this without preallocation is to store the result in linked list
Not sure Matlab has this option but that will need one/two pointer per item (32/64 bit value) which renders your array 2+ times bigger.
If you need even more speed then there are ways (probably not for Matlab):
use multi-threading for array filling is fully parallelisable
use memory block copy (rep movsd) or DMA the data is periodically repeating
You can also consider to compute the value from i on the run instead of remember the whole array, depending on the usage it can be faster in some cases...

how to deal with a big text file(about 300M)

There's a text file(about 300M) and I need to count the ten most offen occurred words(some stop words are exclued). Test machine has 8 cores and Linux system, any programming language is welcome and can use open-source framework only(hadoop is not an option), I don't have any mutithread programming experince, where can I start from and how to give a solution cost as little time as possible?
300M is not a big deal, a matter of seconds for your task, even for single core processing in a high-level interpreted language like python if you do it right. Python has an advantage that it will make your word-counting programming very easy to code and debug, compared to many lower-level languages. If you still want to parallelize (even though it will only take a matter of seconds to run single-core in python), I'm sure somebody can post a quick-and-easy way to do it.
How to solve this problem with a good scalability:
The problem can be solved by 2 map-reduce steps:
Step 1:
map(word):
emit(word,1)
Combine + Reduce(word,list<k>):
emit(word,sum(list))
After this step you have a list of (word,#occurances)
Step 2:
map(word,k):
emit(word,k):
Combine + Reduce(word,k): //not a list, because each word has only 1 entry.
find top 10 and yield (word,k) for the top 10. //see appendix1 for details
In step 2 you must use a single reducer, The problem is still scalable, because it (the single reducer) has only 10*#mappers entries as input.
Solution for 300 MB file:
Practically, 300MB is not such a large file, so you can just create a histogram (on memory, with a tree/hash based map), and then output the top k values from it.
Using a map that supports concurrency, you can split the file into parts, and let each thread modify the when it needs. Note that if it cab actually be splitted efficiently is FS dependent, and sometimes a linear scan by one thread is mandatory.
Appendix1:
How to get top k:
Use a min heap and iterate the elements, the min heap will contain the highest K elements at all times.
Fill the heap with first k elements.
For each element e:
If e > min.heap():
remove the smallest element from the heap, and add e instead.
Also, more details in this thread
Assuming that you have 1 word per line, you can do the following in python
from collections import Counter
FILE = 'test.txt'
count = Counter()
with open(FILE) as f:
for w in f.readlines():
count[w.rstrip()] += 1
print count.most_common()[0:10]
Read the file and create a map [Word, count] of all occurring word as keys and the value are the number of occurrences of the words while you read it.
Any language should do the job.
After reading the File once, you have the map.
Then iterate through the map and remember the ten word with the highest count value

Adaptive IO Optimization Problem

Here is an interesting optimization problem that I think about for some days now:
In a system I read data from a slow IO device. I don't know beforehand how much data I need. The exact length is only known once I have read an entire package (think of it as it has some kind of end-symbol). Reading more data than required is not a problem except that it wastes time in IO.
Two constrains also come into play: Reads are very slow. Each byte I read costs. Also each read-request has a constant setup cost regardless of the number of bytes I read. This makes reading byte by byte costly. As a rule of thumb: the setup costs are roughly as expensive as a read of 5 bytes.
The packages I read are usually between 9 and 64 bytes, but there are rare occurrences larger or smaller packages. The entire range will be between 1 to 120 bytes.
Of course I know a little bit of my data: Packages come in sequences of identical sizes. I can classify three patterns here:
Sequences of reads with identical sizes:
A A A A A ...
Alternating sequences:
A B A B A B A B ...
And sequences of triples:
A B C A B C A B C ...
The special case of degenerated triples exist as well:
A A B A A B A A B ...
(A, B and C denote some package size between 1 and 120 here).
Question:
Based on the size of the previous packages, how do I predict the size of the next read request? I need something that adapts fast, uses little storage (lets say below 500 bytes) and is fast from a computational point of view as well.
Oh - and pre-generating some tables won't work because the statistic of read sizes can vary a lot with different devices I read from.
Any ideas?
You need to read at least 3 packages and at most 4 packages to identify the pattern.
Read 3 packages. If they are all same size, then the pattern is AAAAAA...
If they are all not the same size, read the 4th package. If 1=3 & 2=4, pattern is ABAB. Otherwise, pattern is ABCABC...
With that outline, it is probably a good idea to do a speculative read of 3 package sizes (something like 3*64 bytes at a single go).
I don't see a problem here.. But first, several questions:
1) Can you read the input asyncronously (e.g. separate thread, interrupt routine, etc)?
2) Do you have some free memory for a buffer?
3) If you've commanded a longer read, are you able to obtain first byte(s) before the whole packet is read?
If so (and I think in most cases it can be implemented), then you can just have a separate thread that reads them at highest possible speed and stores them in a buffer, with stalling when the buffer gets full, so that you normal process can use a synchronous getc() on that buffer.
EDIT: I see.. it's because of CRC or encryption? Well, then you could use some ideas from data compression:
Consider a simple adaptive algorithm of order N for M possible symbols:
int freqs[M][M][M]; // [a][b][c] : occurences of outcome "c" when prev vals were "a" and "b"
int prev[2]; // some history
int predict(){
int prediction = 0;
for (i = 1; i < M; i++)
if (freqs[prev[0]][prev[1]][i] > freqs[prev[0]][prev[1]][prediction])
prediction = i;
return prediction;
};
void add_outcome(int val){
if (freqs[prev[0]][prev[1]][val]++ > DECAY_LIMIT){
for (i = 0; i < M; i++)
freqs[prev[0]][prev[1]][i] >>= 1;
};
pred[0] = pred[1];
pred[1] = val;
};
freqs has to be an array of order N+1, and you have to remember N previsous values. N and DECAY_LIMIT have to be adjusted according to the statistics of the input. However, even they can be made adaptive (for example, if it producess too many misses, then the decay limit can be shortened).
The last problem would be the alphabet. Depending on the context, if there are several distinct sizes, you can create a one-to-one mapping to your symbols. If more, then you can use quantitization to limit the number of symbols. The whole algorithm can be written with pointer arithmetics, so that N and M won't be hardcoded.
Since reading is so slow, I suppose you can throw some CPU power at it so you can try to make an educated guess of how much to read.
That would be basically a predictor, that would have a model based on probabilities. It would generate a sample of predictions of the upcoming message size, and the cost of each. Then pick the message size that has the best expected cost.
Then when you find out the actual message size, use Bayes rule to update the model probabilities, and do it again.
Maybe this sounds complicated, but if the probabilities are stored as fixed-point fractions you won't have to deal with floating-point, so it may be not much code. I would use something like a Metropolis-Hastings algorithm as my basic simulator and bayesian update framework. (This is just an initial stab at thinking about it.)

Resources