How come matlab run slower and slower when running a program that takes long time to execute? - performance

There is a program that my matlab runs, since there are two gigantic nested for-loop, we expect this program to run more than 10 hours. We ask matlab to print out the loop number every time when it is looping.
Initially (the first 1 hour), wee see the loop number increment very fast in our screen; as time goes by, it goes slower and slower..... now (more than 20 consecutive hours of executing the same ".m" file and it still haven't finished yet), it is almost 20 times slower than it had been initially.
The ram usage initially was about 30%, right now after 20 hours of executing time, it is as shown below:
My computer spec is below.
What can I do to let matlab maintain its initially speed?

I can only guess, but my bet is that you have some array variables that have not been preallocated, and thus their size increases at each iteration of the for loop. As a result of this, Matlab has to reallocate memory in each iteration. Reallocating slows things down, and more so the larger those variables are, because Matlab needs to find an ever larger chunk of contiguous memory. This would explain why the program seems to run slower as time passes.
If this is indeed the cause, the solution would be to preallocate those variables. If their size is not known beforehand, you can make a guess and preallocate to an approximate size, thus avoiding at least some of the reallocating. Or, if your program is not memory-limited, maybe you can use an upper-bound on variable size when preallocating; then, after the loop, trim the arrays by removing unused entries.

Some general hints, if they don't help I suggest to add the code to the question.
Don't print to console, this output slows down execution and the output is kept in memory. Write a logfile instead if you need it. For a simple status, use waitbar
Verify you are preallocating all variables
Check which function calls depend on the loop index, for example increasing size of variables.
If any mex-functions are used, double check them for memory leaks. The standard procedure to do this: Call the function with random example data, don't store the output. If the memory usage increases there is a memory leak in the function.
Use the profiler. Profile your code for the first n iterations (where n matches about 10 minutes), generate a HTML report. Then let it run for about 2 hours and generate a report for n iterations again. Now compare both reports and see where the time is lost.

I'd like to point everybody to the following page in the MATLAB documentation: Strategies for Efficient Use of Memory. This page contains a collection of techniques and good practices that MATLAB users should be aware of.
The OP reminded me of it when saying that memory usage tends to increase over time. Indeed there's an issue with long running instances of MATLAB on win32 systems, where a memory leak exists and is exacerbated over time (this is described in the link too).
I'd like to add to Luis' answer the following piece of advice that a friend of mine once received during a correspondence with Yair Altman:
Allocating largest vars first helps by assigning the largest free contiguous blocks to the highest vars, but it is easily shown that in some cases this can actually be harmful, so it is not a general solution.
The only sure way to solve memory fragmentation is by restarting Matlab and even better to restart Windows.
More information on memory allocation performance can be found in the following undocumentedMatlab posts:
Preallocation performance
Allocation performance take 2

Related

Octave parallel function worsens the running time on a single machine

I tried to create a test code by octave to evaluate the time efficiency on my Windows machine with an 8-core processor (parallelization on a single machine); starting with the simple code example provided in the documentation, as follows:
pkg load parallel
fun = #(x) x^2;
vector_x = 1:20000;
# Serial Format of the Program
tic()
for i=1:20000
vector_y(1,i)=vector_x(1,i)^2;
endfor
toc()
# Parallel Format of the Program
tic()
vector_y1 = pararrayfun(nproc, fun, vector_x);
toc()
To my surprise, the time required for a serial code is much faster than using the parallel function. The serial case ran in 0.0758219 s, the parallel one in 3.79864 s.
Would someone explain me if it is a parallel overhead or I should set up something in my Octave setting, or in which cases is the parallization really helpful?
TL;DR: open up your pool outside the timer and choose a more difficult operation.
There's two main issues. One is what Ander mentioned in his comment, starting up the parallel pool takes a second or two. You can open up it beforehand (in MATLAB you can do this via parpool) to speed that up. Alternatively, run a single parallel operation, thus opening the pool, and then redo the timing.
The second issue is the simplicity of your operation. Just squaring a number cannot go much faster than it already goes in serial. There's no point in passing data back-and-forth between workers for such a simple operation. Redo your test with a more expensive function, e.g. eig() as MATLAB does in their examples.
Parallellisation is thus useful if the runtime of your operations greatly outweighs the overhead of passing data to and from workers. Basically this means you either have a very large data set, which you need to perform the same operation on each item (e.g. taking the mean of every 1000 rows or so), or you have a few heavy, but independent, tasks to perform.
For a more in-depth explanation I can recommend this answer of mine and references therein.
Just as a sidenote, I'm surprised your serial for is that fast, given that you do not initialise your output vector. Preallocation is very important, as "growing" arrays in loops requires the creation of a new array and copying all previous content to it every iteration.
You also might want to consider not using i or j as variable names, as they denote the imaginary unit. It won't affect runtime much, but can result in very hard to debug errors. Simply use idx, ii, or a more descriptive variable name.

Parallel code slower than serial code (value function iteration example)

I'm trying to make the code faster in Julia using parallelization. My code has nested serial for-loops and performs value function iteration. (as decribed in http://www.parallelecon.com/vfi/)
The following link shows the serial and parallelized version of the code I wrote:
https://github.com/minsuc/MyProject/blob/master/VFI_parallel.ipynb (You can find the functions defined in DefinitionPara.jl in the github page too.) Serial code is defined as main() and parallel code is defined as main_paral().
The third for-loop in main() is the step where I find the maximizer given (nCapital, nProductivity). As suggested in the official parallel documentation, I distribute the work over nCapital grid, which consists of many points.
When I do #time for the serial and the parallel code, I get
Serial: 0.001041 seconds
Parallel: 0.004515 seconds
My questions are as follows:
1) I added two workers and each of them works for 0.000714 seconds and 0.000640 seconds as you can see in the ipython notebook. The reason why parallel code is slower is due to the cost of overhead?
2) I increased the number of grid points by changing
vGridCapital = collect(0.5*capitalSteadyState:0.000001:1.5*capitalSteadyState)
Even though each worker does significant amount of work, serial code is way faster than the parallel code. When I added more workers, serial code is still faster. I think something is wrong but I haven't been able to figure out... Could it be related to the fact that I pass too many arguments in the parallelized function
final_shared(mValueFunctionNew, mPolicyFunction, pparams, vGridCapital, mOutput, expectedValueFunction)?
I will really appreciate your comments and suggestions!
If the amount of work is really small between synchronizations, the task sync overhead may be too long. Remember that a common OS timeslicing quantum is 10ms, and you are measuring in the 1ms range, so with a bit of load, 4ms latency for getting all work threads synced is perfectly reasonable.
In the case of all tasks accessing the same shared data structure, access locking overhead may well be the culprit, if the shared data structure is thread safe, even with longer parallel tasks.
In some cases, it may be possible to use non-thread-safe shared arrays for both input and output, but then it must be ensured that the workers don't clobber each other's results.
Depending on what exactly the work threads are doing, for example if they are outputting to the same array elements, it might be necessary to give each worker its own output array, and merge them together in the end, but that doesn't seem to be the case with your task.

Computing time in relation to number of operations

is it possible to calculate the computing time of a process based on the number of operations that it performs and the speed of the CPU in GHz?
For example, I have a for loop that performs a total number of 5*10^14 cycles. If it runs on a 2.4 GHz processor, will the computing time in seconds be: 5*10^14/2.4*10^9 = 208333 s?
If the process runs on 4 cores in parallel, will the time be reduced by four?
Thanks for your help.
No, it is not possible to calculate the computing time based just on the number of operations. First of all, based on your question, it sounds like you are talking about the number of lines of code in some higher-level programming language since you mention a for loop. So depending on the optimization level of your compiler, you could see varying results in computation time depending on what kinds of optimizations are done.
But even if you are talking about assembly language operations, it is still not possible to calculate the computation time based on the number of instructions and CPU speed alone. Some instructions might take multiple CPU cycles. If you have a lot of memory access, you will likely have cache misses and have to load data from disk, which is unpredictable.
Also, if the time that you are concerned about is the actual amount of time that passes between the moment the program begins executing and the time it finishes, you have the additional confounding variable of other processes running on the computer and taking up CPU time. The operating system should be pretty good about context switching during disk reads and other slow operations so that the program isn't stopped in the middle of computation, but you can't count on never losing some computation time because of this.
As far as running on four cores in parallel, a program can't just do that by itself. You need to actually write the program as a parallel program. A for loop is a sequential operation on its own. In order to run four processes on four separate cores, you will need to use the fork system call and have some way of dividing up the work between the four processes. If you divide the work into four processes, the maximum speedup you can have is 4x, but in most cases it is impossible to achieve the theoretical maximum. How close you get depends on how well you are able to balance the work between the four processes and how much overhead is necessary to make sure the parallel processes successfully work together to generate a correct result.

Is it possible to find hotspots in a parallel application using a sampling profiler?

As far as I understand a sampling profiler works as follows: it interupts the program execution in regular intervals and reads out the call stack. It notes which part of the program is currently executing and increments a counter that represents this part of the program. In a post processing step: For each function of the program the ratio of the whole execution time is computed, for which the function is responsible for. This is done by looking at the counter C for this specific function and the total number of samples N:
ratio of the function = C / N
Finding the hotspots then is easy, as this are the parts of the program with a high ratio.
But how can this be done for a parallel program running on parallel hardware. As far as I know, when the program execution is interupted the executing parts of the program on ALL processors are determined. Due to that a function which is executed in parallel gets counted multiple times. Thus the number of samples C of this function can not be used for computing its share of the whole execution time anymore.
Is my thinking correct? Are there other ways how the hotspots of a parallel program can be identified - or is this just not possible using sampling?
You're on the right track.
Whether you need to sample all the threads depends on whether they are doing the same thing or different things.
It is not essential to sample them all at the same time.
You need to look at the threads that are actually working, not just idling.
Some points:
Sampling should be on wall-clock time, not CPU time, unless you want to be blind to needless I/O and other blocking calls.
You're not just interested in which functions are on the stack, but which lines of code, because they convey the purpose of the time being spent. It is more useful to look for a "hot purpose" than a "hot spot".
The cost of a function or line of code is just the fraction of samples it appears on. To appreciate that, suppose samples are taken every 10ms for a total of N samples. If the function or line of code could be made to disappear, then all the samples in which it is on the stack would also disappear, reducing N by that fraction. That's what speedup is.
In spite of the last point, in sampling, quality beats quantity. When the goal is to understand what opportunities you have for speedup, you get farther faster by manually scrutinizing 10-20 samples to understand the full reason why each moment in time is being spent. That's why I take samples manually. Knowing the amount of time with statistical precision is really far less important.
I can't emphasize enough the importance of finding and fixing more than one problem. Speed problems come in severals, and each one you fix has a multiplier effect on those done already. The ones you don't find end up being the limiting factor.
Programs that involve a lot of asynchronous inter-thread message-passing are more difficult, because it becomes harder to discern the full reason why a moment in time is being spent.
More on that.

CUDA: reduction or atomic operations?

I'm writing a CUDA kernel which involves calculating the maximum value on a given matrix and I'm evaluating possibilities. The best way I could find is:
Forcing every thread to store a value in the shared memory and using a reduction algorithm after that to determine the maximum (pro: minimum divergence cons: shared memory is limited to 48Kb on 2.0 devices)
I couldn't use atomic operations because there are both a reading and a writing operation, so threads could not be synchronized by synchthreads.
Any other idea come into your mind?
You may also want to use the reduction routines that comes w/ CUDA Thrust which is a part of CUDA 4.0 or available here.
The library is written by a pair of nVidia engineers and compares favorably with heavily hand optimized code. I believe there is also some auto-tuning of grid/block size going on.
You can interface with your own kernel easily by wrapping your raw device pointers.
This is strictly from a rapid integration point of view. For the theory, see tkerwin's answer.
This is the usual way to perform reductions in CUDA
Within each block,
1) Keep a running reduced value in shared memory for each thread. Hence each thread will read n (I personally favor between 16 and 32), values from global memory and updates the reduced value from these
2) Perform the reduction algorithm within the block to get one final reduced value per block.
This way you will not need more shared memory than (number of threads) * sizeof (datatye) bytes.
Since each block a reduced value, you will need to perform a second reduction pass to get the final value.
For example, if you are launching 256 threads per block, and are reading 16 values per thread, you will be able to reduce (256 * 16 = 4096) elements per block.
So given 1 million elements, you will need to launch around 250 blocks in the first pass, and just one block in the second.
You will probably need a third pass for cases when the number of elements > (4096)^2 for this configuration.
You will have to take care that the global memory reads are coalesced. You can not coalesce global memory writes, but that is one performance hit you need to take.
NVIDIA has a CUDA demo that does reduction: here. There's a whitepaper that goes along with it that explains some motivations behind the design.
I found this document very useful for learning the basics of parallel reduction with CUDA. It's kind of old, so there must be additional tricks to boost performance further.
Actually, the problem you described is not really about matrices. The two-dimensional view of the input data is not significant (assuming the matrix data is layed out contiguously in memory). It's just a reduction over a sequence of values, being all matrix elements in whatever order they appear in memory.
Assuming the matrix representation is contiguous in memory, you just want to perform a simple reduction. And the best available implementation these days - as far as I can tell - is the excellent libcub by nVIDIA's Duane Merill. Here is the documentation on its device-wide Maximum-calculating function.
Note, though, that unless the matrix is small, for most of the computation it will simply be threads reading data and updating their own thread-specific maximum. Only when a thread has finished reading through a large swatch of the matrix (or rather, a large strided swath) will it write its local maximum anywhere - typically into shared memory for a block-level reduction. And as for atomics, you will probably be making an atomicMax() call once every obscenely large number of matrix element reads - tens of thousands if not more.
The atomicAdd function could also be used, but it is much less efficient than the approaches mentioned above. http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/
If you have K20 or Titan, I suggest dynamic parallelism: lunching a single thread kernel, which lunches #items worker kernel threads to produce data, then lunches #items/first-round-reduction-factor threads for first round reduction, and keep lunching till result coming out.

Resources