Why are my nested for loops taking so long to compute? - algorithm

I have a code that generates all of the possible combinations of 4 integers between 0 and 36.
This will be 37^4 numbers = 1874161.
My code is written in MATLAB:
i = 0;
for a = 0:36
    for b = 0:36
        for c = 0:36
            for d = 0:36
                i = i + 1;
                combination(i,:) = [a,b,c,d];
            end
        end
    end
end
I've tested this using the number 3 instead of the number 36 and it worked fine.
If there are 1874161 combinations, and with an overly cautious guess of 100 clock cycles to do the additions and write the values, then if I have a 2.3GHz PC, this is:
1874161 * (1/2300000000) * 100 = 0.08148526086
A fraction of a second. But it has been running for about half an hour so far.
I did receive a warning that combination changes size every loop iteration, consider predefining its size for speed, but this can't affect it that much, can it?

As @horchler suggested, you need to preallocate the target array.
This is because your program is not O(N^4) without preallocation. Each time you add a new row to the array it needs to be resized, so a new, bigger array is created (as MATLAB does not know how big the array will be, it probably grows by only 1 item), then the old array is copied into it, and lastly the old array is deleted. So when you have 10 items in the array and add the 11th, a copy of 10 items is added to that iteration. If I am not mistaken, with N^4 rows in total that leads to something like O(N^8), which is massively larger,
estimated as 1+2+3+...+N^4 = ((N^4)^2)/2
Also, the reallocated array keeps growing, so it outgrows each level of CPU cache in turn, which slows things down even more once i passes each cache size.
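To get a feel for the copying cost, here is a minimal sketch in Python (not MATLAB) that counts row copies under the assumption stated above, namely that every append reallocates and copies all existing rows. The function name is hypothetical, and real MATLAB versions may grow arrays more cleverly than this.

```python
def copies_when_growing_by_one(n_rows):
    """Count row copies if every append reallocates and copies all existing rows."""
    copies = 0
    size = 0
    for _ in range(n_rows):
        copies += size  # existing rows are copied into the new, larger array
        size += 1       # then the new row is appended
    return copies
```

The count equals n_rows * (n_rows - 1) / 2, so for 37^4 rows it is on the order of 10^12 row copies, compared with no copies at all when the array is preallocated.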
The only solution to this without preallocation is to store the result in a linked list.
I am not sure MATLAB has this option, and it needs one or two pointers per item (32/64-bit values), which makes your array 2+ times bigger.
If you need even more speed then there are ways (probably not for MATLAB):
use multi-threading; the array filling is fully parallelisable
use memory block copy (rep movsd) or DMA, since the data is periodically repeating
You can also consider computing the value from i on the fly instead of storing the whole array; depending on the usage it can be faster in some cases...
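The last idea can be sketched as follows, in Python rather than MATLAB. The name combination_at is hypothetical and the index is 0-based here, unlike the 1-based i in the question. Each row is just the index written in base 37, with d as the fastest-changing digit.

```python
def combination_at(index, base=37, width=4):
    """Return the row (a, b, c, d) that the nested loops would produce at a 0-based index."""
    digits = []
    for _ in range(width):
        index, digit = divmod(index, base)  # peel off the fastest-varying digit first
        digits.append(digit)
    return tuple(reversed(digits))
```

For example, combination_at(37) gives (0, 0, 1, 0), the row right after d has run through 0..36 once.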

Related

A memory efficient way for a randomized single pass over a set of indices

I have a big file (about 1GB) which I am using as a basis to do some data integrity testing. I'm using Python 2.7 for this because I don't care so much about how fast the writes happen, my window for data corruption should be big enough (and it's easier to submit a Python script to the machine I'm using for testing)
To do this I'm writing a sequence of 32 bit integers to memory as a background process while other code is running, like the following:
from struct import pack
with open('./FILE', 'rb+', buffering=0) as f:
    f.seek(0)
    counter = 1
    while counter < SIZE+1:
        f.write(pack('>i', counter))
        counter+=1
Then after I do some other stuff it's very easy to see if we missed a write since there will be a gap instead of the sequential increasing sequence. This works well enough. My problem is some data corruption cases might only be caught with random I/O (not sequential like this) based on how we track changes to files
So what I need is a method for performing a single pass of random I/O over my 1GB file, but I can't really store this in memory since 1GB ~= 250 million 4-byte integers. Considered chunking up the file into smaller pieces and indexing those, maybe 500 KB or something, but if there is a way to write a generator that can do the same job that would be awesome. Like this:
from struct import pack
def rand_index_generator():
    # pseudocode: RAND_INDEX stands for the sampler without replacement I am looking for
    generator = RAND_INDEX(1, MAX+1, NO REPLACEMENT)
    counter = 0
    while counter < MAX:
        counter+=1
        yield generator.next_index()
with open('./FILE', 'rb+', buffering=0) as f:
    counter = 1
    for index in rand_index_generator():
        f.seek(4*index)
        f.write(pack('>i', counter))
        counter+=1
I need it:
Not to run out of memory (so no pouring the random sequence into a list)
To be reproducible so I can verify these values in the same order later
Is there a way to do this in Python 2.7?
Just to provide an answer for anyone who has the same problem, the approach that I settled on was this, which worked well enough if you don't need something all that random:
def rand_index_generator(a,b):
    ctr=0
    while True:
        yield (ctr%b)
        ctr+=a
Then, initialize it with your index size, b and a value a which is coprime to b. This is easy to choose if b is a power of two, since a just needs to be an odd number to make sure it isn't divisible by 2. It's a hard requirement for the two values to be coprime, so you might have to do more work if your index size b is not such an easily factored number as a power of 2.
index_gen = rand_index_generator(1934919251, 2**28)
Then each time you want the new index you use index_gen.next() and this is guaranteed to iterate over numbers between [0,2^28-1] in a semi-randomish manner depending on your choice of 'a'
There's really no point in picking an a value larger than your index size, since only a mod b affects the sequence anyway. This isn't a very good approach in terms of randomness, but it's very efficient in terms of memory and speed, which is what I care about for simulating this write workload.
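For reference, here is a minimal Python 3 sketch of the same stride idea with the coprimality requirement checked up front. The name stride_indices is hypothetical, and unlike the generator above it stops after one full pass and keeps the running value reduced modulo b.

```python
from math import gcd

def stride_indices(a, b):
    """Yield each index in range(b) exactly once, stepping by a modulo b."""
    if gcd(a, b) != 1:
        raise ValueError("a and b must be coprime to visit every index exactly once")
    index = 0
    for _ in range(b):
        yield index
        index = (index + a) % b  # keep the running value bounded
```

Because the sequence depends only on a and b, running it again with the same arguments reproduces the same order for verification.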

Never ending 'for' loop prevents my RStudio notebook from being rendered into a .md file

I'm trying to calculate the Kolmogorov-Smirnov statistic in R. I have the following sample, which clearly comes from a random variable that follows a long-tailed distribution.
Download link
https://drive.google.com/file/d/1hIgqikX7p343zdyc-Goq34THUpsZA63n/view?usp=sharing
As you may know, the Kolmogorov-Smirnov statistic requires the calculation of the empirical cumulative distribution function and the presumed cumulative distribution function. For both calculations I take the following approach: first, I create a vector with the same length as the length of the sample, and then I modify each of the components of the vector so as for it to contain the empirical cdf (or presumed cdf) of the corresponding observation of the sample.
For the sake of illustration, I'll show you the code I wrote in order to calculate the empirical cdf.
I'm assuming that the data has been read and stored in a dataframe called data.
ecdf = vector("numeric", length(data$logueos))
for (i in 1:length(data$logueos)) {
    ecdf[i] = sum (data$logueos <= data$logueos[i])/length(data$logueos)
}
The code I wrote for the calculation of the presumed cdf is analogous to the preceding one; the only difference is that I set each component of the pcdf vector equal to the formula $P(X<=t)$ —where t is the corresponding observation of the sample— according to the distribution that I'm assuming.
The problem is that this 'for' loop never ends. If I force it to end by clicking RStudio's stop button it works: it makes the vector store what I want it to store. But, if I press Ctrl+Shift+k in order to render my notebook and preview it, the load gets stuck when trying to execute the first chunk encountered that contains one of those loops.
First of all, your loop is not endless. It will finish, eventually.
You start by initializing a vector with as many elements as the number of observations (1,245,888, which is a lot of iterations). This vector is FULL OF ZEROS.
What your loop does is iterate, replacing each zero with the result of sum (data$logueos <= data$logueos[i])/length(data$logueos). Each iteration compares the whole sample against one observation, so the total work grows with the square of the sample size. When you stop the execution, you can check that the first values of your vector are between 0 and 1 while the last values are still 0s (because the loop hasn't reached them yet).
So, you will have to wait longer.
To make the execution faster, you could consider parallelizing the loop (standard loops run sequentially, one iteration at a time; if that is too slow, parallelization runs several iterations at once, for example 4 at a time, depending on your computer's capabilities). Here you'll find some information about it: https://nceas.github.io/oss-lessons/parallel-computing-in-r/parallel-computing-in-r.html
So, my proposal to you:
if(!require(doParallel)){install.packages("doParallel")}; require(doParallel)
registerDoParallel(detectCores() - 1)
n = length(data$logueos)
# each worker returns one value; .combine = c gathers them into a vector
ecdf = foreach (i=1:n, .combine = c) %dopar% {
    sum (data$logueos <= data$logueos[i])/n
}
The first line will download and load the doParallel library (which also loads foreach), which you need for parallelization.
detectCores() - 1 is going to use all the processors that your computer has except one (to avoid freezing your machine) for computing this loop. You'll see that it is going to be faster!
The registerDoParallel function is what tells foreach how many cores to use.
Note that %dopar% (rather than %do%) is what actually runs the iterations in parallel, and that the workers cannot modify ecdf directly, which is why the results are collected with .combine = c.
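Independently of parallelization, the quadratic cost of the loop can be avoided by sorting the sample once and looking each observation up with a binary search. The following is a minimal sketch in Python (not R); ecdf_values is a hypothetical name, and the sample is assumed to be a plain list of numbers.

```python
from bisect import bisect_right

def ecdf_values(sample):
    """Empirical CDF at each observation: the fraction of the sample that is <= it."""
    ordered = sorted(sample)  # O(n log n), done once
    n = len(ordered)
    # bisect_right counts how many sorted values are <= x in O(log n)
    return [bisect_right(ordered, x) / n for x in sample]
```

For example, ecdf_values([3, 1, 2, 2]) gives [1.0, 0.25, 0.75, 0.75], the same values the original loop computes.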

Adaptive IO Optimization Problem

Here is an interesting optimization problem that I think about for some days now:
In a system I read data from a slow IO device. I don't know beforehand how much data I need. The exact length is only known once I have read an entire package (think of it as it has some kind of end-symbol). Reading more data than required is not a problem except that it wastes time in IO.
Two constraints also come into play: Reads are very slow. Each byte I read costs. Also, each read request has a constant setup cost regardless of the number of bytes I read. This makes reading byte by byte costly. As a rule of thumb: the setup cost is roughly as expensive as a read of 5 bytes.
The packages I read are usually between 9 and 64 bytes, but there are rare occurrences of larger or smaller packages. The entire range is 1 to 120 bytes.
Of course I know a little bit of my data: Packages come in sequences of identical sizes. I can classify three patterns here:
Sequences of reads with identical sizes:
A A A A A ...
Alternating sequences:
A B A B A B A B ...
And sequences of triples:
A B C A B C A B C ...
The special case of degenerated triples exist as well:
A A B A A B A A B ...
(A, B and C denote some package size between 1 and 120 here).
Question:
Based on the size of the previous packages, how do I predict the size of the next read request? I need something that adapts fast, uses little storage (let's say below 500 bytes) and is fast from a computational point of view as well.
Oh - and pre-generating some tables won't work because the statistic of read sizes can vary a lot with different devices I read from.
Any ideas?
You need to read at least 3 packages and at most 4 packages to identify the pattern.
Read 3 packages. If they are all same size, then the pattern is AAAAAA...
If they are all not the same size, read the 4th package. If 1=3 & 2=4, pattern is ABAB. Otherwise, pattern is ABCABC...
With that outline, it is probably a good idea to do a speculative read of 3 package sizes (something like 3*64 bytes at a single go).
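The pattern check described above can be sketched like this in Python. It is only an illustration: predict_next_size and the default of 64 bytes are assumptions, and the history is assumed to hold the last few observed package sizes.

```python
def predict_next_size(history, max_period=3, default=64):
    """Predict the next package size, assuming the sizes repeat with period 1 to max_period."""
    for period in range(1, max_period + 1):
        if len(history) <= period:
            break  # not enough observations to confirm this period
        consistent = all(history[i] == history[i - period]
                         for i in range(period, len(history)))
        if consistent:
            return history[-period]  # the size seen one period ago comes next
    return default  # no pattern confirmed yet: fall back to a fixed guess
```

Keeping only the last six or so sizes is enough to confirm a period of up to 3, which stays well within the storage budget from the question.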
I don't see a problem here.. But first, several questions:
1) Can you read the input asynchronously (e.g. separate thread, interrupt routine, etc)?
2) Do you have some free memory for a buffer?
3) If you've commanded a longer read, are you able to obtain first byte(s) before the whole packet is read?
If so (and I think in most cases it can be implemented), then you can just have a separate thread that reads them at the highest possible speed and stores them in a buffer, stalling when the buffer gets full, so that your normal process can use a synchronous getc() on that buffer.
EDIT: I see.. it's because of CRC or encryption? Well, then you could use some ideas from data compression:
Consider a simple adaptive algorithm of order N for M possible symbols:
int freqs[M][M][M]; // [a][b][c] : occurrences of outcome "c" when prev vals were "a" and "b"
int prev[2]; // some history: the two most recent values
int predict(void){
    int i, prediction = 0;
    // pick the most frequent outcome for the current history
    for (i = 1; i < M; i++)
        if (freqs[prev[0]][prev[1]][i] > freqs[prev[0]][prev[1]][prediction])
            prediction = i;
    return prediction;
}
void add_outcome(int val){
    int i;
    // count the outcome; past DECAY_LIMIT, halve all counts for this history
    if (freqs[prev[0]][prev[1]][val]++ > DECAY_LIMIT){
        for (i = 0; i < M; i++)
            freqs[prev[0]][prev[1]][i] >>= 1;
    }
    // shift the history
    prev[0] = prev[1];
    prev[1] = val;
}
freqs has to be an array of order N+1, and you have to remember the N previous values. N and DECAY_LIMIT have to be adjusted according to the statistics of the input. However, even they can be made adaptive (for example, if it produces too many misses, then the decay limit can be lowered).
The last problem would be the alphabet. Depending on the context, if there are only several distinct sizes, you can create a one-to-one mapping to your symbols. If there are more, then you can use quantization to limit the number of symbols. The whole algorithm can be written with pointer arithmetic, so that N and M won't be hardcoded.
Since reading is so slow, I suppose you can throw some CPU power at it so you can try to make an educated guess of how much to read.
That would be basically a predictor, that would have a model based on probabilities. It would generate a sample of predictions of the upcoming message size, and the cost of each. Then pick the message size that has the best expected cost.
Then when you find out the actual message size, use Bayes' rule to update the model probabilities, and do it again.
Maybe this sounds complicated, but if the probabilities are stored as fixed-point fractions you won't have to deal with floating-point, so it may not be much code. I would use something like a Metropolis-Hastings algorithm as my basic simulator and Bayesian update framework. (This is just an initial stab at thinking about it.)
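As a rough illustration of picking the read size with the best expected cost, here is a small Python sketch. It is not the Metropolis-Hastings approach; it simply uses observed size counts as the probability model. The cost model is an assumption built from the question: a request costs 5 byte-reads of setup plus one unit per byte, and if the first read is too short, a second request reads the rest up to the 120-byte maximum. All names are hypothetical.

```python
SETUP_COST = 5   # assumption from the question: setup costs about as much as reading 5 bytes
MAX_SIZE = 120   # assumption from the question: largest possible package

def expected_cost(read_size, size_counts):
    """Average cost of first requesting read_size bytes, given observed size counts."""
    total = sum(size_counts.values())
    weighted = 0
    for size, count in size_counts.items():
        cost = SETUP_COST + read_size
        if size > read_size:
            # too short: pay for a second request that reads up to the maximum
            cost += SETUP_COST + (MAX_SIZE - read_size)
        weighted += cost * count
    return weighted / total

def best_read_size(size_counts):
    """Choose, among the sizes seen so far, the first-read size with the lowest expected cost."""
    return min(sorted(size_counts), key=lambda r: expected_cost(r, size_counts))
```

With counts {10: 9, 64: 1} the sketch prefers a 10-byte first read, and with {10: 1, 64: 9} it prefers 64 bytes.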

Which index varies fastest in a VB array?

When using a Visual Basic two-dimensional array, which index varies fastest? In other words, when filling in an array, should I write...
For i = 1 To 30
    For j = 1 To 30
        myarray(i, j) = something
    Next
Next
or
For i = 1 To 30
    For j = 1 To 30
        myarray(j, i) = something
    Next
Next
(or alternatively does it make very much difference)?
Column major. VB6 uses COM SAFEARRAYs and lays them out in column-major order. The fastest access is like this (although it won't matter if you only have 30x30 elements).
For i = 1 To 30
    For j = 1 To 30
        myarray(j, i) = something
    Next
Next
If you really want to speed up your array processing, consider the tips in Advanced Visual Basic by Matt Curland, which shows you how to poke around inside the underlying SAFEARRAY structures.
For instance accessing a 2D SAFEARRAY is considerably slower than accessing a 1D SAFEARRAY, so in order to set all array entries to the same value it is quicker to bypass VB6's SAFEARRAY descriptor and temporarily make one of your own. Page 33.
You should also consider turning on "Remove array bounds checks" in the project properties compile options.
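To see why the first index should vary fastest, here is a small Python sketch of how a column-major layout maps a pair of indices to a linear offset. The function name is hypothetical and 1-based indices are assumed, as in the loops above.

```python
def column_major_offset(row, col, n_rows, base=1):
    """Linear offset of element (row, col) in a column-major array with n_rows rows."""
    return (row - base) + (col - base) * n_rows
```

Stepping the first index by one moves the offset by 1, while stepping the second index moves it by n_rows, so an inner loop over the first index walks through memory contiguously.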
I don't know if (or where) this is specified. It might be left as 'implementation defined'.
But I would expect the first index to be the 'lower' dimension, ie the big chunks, and the following index positions to be ever more fine-grained.
Edit: Seems I was wrong. VB6 uses a Column-first approach.
Does it make much difference?
You would have to measure, but using the slowest-varying index for the outer loop would allow the compiler to generate faster code and could make better use of the processor cache (locality). But with a size of 30 I wouldn't expect much difference.

How would you sort 1 million 32-bit integers in 2MB of RAM?

Please, provide code examples in a language of your choice.
Update:
No constraints set on external storage.
Example: Integers are received/sent via network. There is a sufficient space on local disk for intermediate results.
Split the problem into pieces small enough to fit into available memory, then use merge sort to combine them.
Sorting a million 32-bit integers in 2MB of RAM using Python by Guido van Rossum
1 million 32-bit integers = 4 MB of memory.
You should sort them using some algorithm that uses external storage. Mergesort, for example.
You need to provide more information. What extra storage is available? Where are you supposed to store the result?
Otherwise, the most general answer:
1. load the first half of data into memory (2MB), sort it by any method, output it to file.
2. load the second half of data into memory (2MB), sort it by any method, keep it in memory.
3. use merge algorithm to merge the two sorted halves and output the complete sorted data set to a file.
This Wikipedia article on External Sorting has some useful information.
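The split-then-merge outline can be sketched in Python 3 as follows. This is a minimal illustration, not a tuned solution: external_sort, chunk_size and the helper names are hypothetical, runs are stored as little-endian 32-bit integers in temporary files, and Python's per-object overhead means the real memory use is higher than 4 bytes per number.

```python
import heapq
import struct
import tempfile

def _write_run(sorted_chunk):
    """Write one sorted chunk to a temporary file as packed 32-bit integers."""
    run = tempfile.TemporaryFile()
    run.write(struct.pack("<%di" % len(sorted_chunk), *sorted_chunk))
    run.seek(0)
    return run

def _read_run(run):
    """Lazily yield the integers stored in a run file, one at a time."""
    while True:
        data = run.read(4)
        if len(data) < 4:
            return
        yield struct.unpack("<i", data)[0]

def external_sort(values, chunk_size):
    """Sort an iterable of 32-bit ints via sorted runs on disk plus a k-way merge."""
    runs, chunk = [], []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:  # chunk is as big as memory allows
            runs.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_write_run(sorted(chunk)))
    # heapq.merge only holds one pending value per run in memory
    return heapq.merge(*(_read_run(run) for run in runs))
```

The result is a lazy iterator, so the sorted values can be streamed to a file or socket without ever being held in memory all at once.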
Dual tournament sort with polyphased merge
#!/usr/bin/env python
import random
from sort import Pickle, Polyphase
nrecords = 1000000
available_memory = 2000000 # number of bytes
#NOTE: it doesn't count memory required by Python interpreter
record_size = 24 # (20 + 4) number of bytes per element in a Python list
heap_size = available_memory / record_size
p = Polyphase(compare=lambda x,y: cmp(y, x), # descending order
              file_maker=Pickle,
              verbose=True,
              heap_size=heap_size,
              max_files=4 * (nrecords / heap_size + 1))
# put records
maxel = 1000000000
for _ in xrange(nrecords):
    p.put(random.randrange(maxel))
# get sorted records
last = maxel
for n, el in enumerate(p.get_all()):
    if el > last: # elements must be in descending order
        print "not sorted %d: %d %d" % (n, el ,last)
        break
    last = el
assert nrecords == (n + 1) # check all records read
Um, store them all in a file.
Memory map the file (you said there was only 2M of RAM; let's assume the address space is large enough to memory map a file).
Sort them using the file backing store as if it were real memory now!
Here's a valid and fun solution.
Load half the numbers into memory. Heap sort them in place and write the output to a file. Repeat for the other half. Use external sort (basically a merge sort that takes file i/o into account) to merge the two files.
Aside:
Make heap sort faster in the face of slow external storage:
Start constructing the heap before all the integers are in memory.
Start putting the integers back into the output file while heap sort is still extracting elements
As people above mention, 1 million 32-bit ints take 4 MB.
To fit as many numbers as possible into as little space as possible, use the types int, short and char in C++. You could be slick (but have odd, dirty code) by doing several kinds of casting to stuff things everywhere.
Here it is off the top of my head.
anything less than 2^8 (0 - 255) gets stored as a char (1-byte data type)
anything less than 2^16 and >= 2^8 (256 - 65535) gets stored as a short (2-byte data type)
The rest of the values would be put into int (4-byte data type).
You would want to specify where the char section starts and ends, where the short section starts and ends, and where the int section starts and ends.
No example, but Bucket Sort has relatively low complexity and is easy enough to implement

Resources