Parallel code slower than serial code (value function iteration example) - performance

I'm trying to speed up my Julia code using parallelization. The code has nested serial for-loops and performs value function iteration (as described in http://www.parallelecon.com/vfi/).
The following link shows the serial and parallelized versions of the code I wrote:
https://github.com/minsuc/MyProject/blob/master/VFI_parallel.ipynb (you can also find the functions defined in DefinitionPara.jl on the GitHub page). The serial code is defined as main() and the parallel code as main_paral().
The third for-loop in main() is the step where I find the maximizer given (nCapital, nProductivity). As suggested in the official parallel documentation, I distribute the work over the nCapital grid, which consists of many points.
When I run @time on the serial and the parallel code, I get:
Serial: 0.001041 seconds
Parallel: 0.004515 seconds
My questions are as follows:
1) I added two workers, and they work for 0.000714 seconds and 0.000640 seconds respectively, as you can see in the IPython notebook. Is the parallel code slower because of the overhead cost?
2) I increased the number of grid points by changing
vGridCapital = collect(0.5*capitalSteadyState:0.000001:1.5*capitalSteadyState)
Even though each worker now does a significant amount of work, the serial code is still much faster than the parallel code. Adding more workers doesn't help either. I think something is wrong, but I haven't been able to figure out what. Could it be related to the fact that I pass too many arguments to the parallelized function
final_shared(mValueFunctionNew, mPolicyFunction, pparams, vGridCapital, mOutput, expectedValueFunction)?
I would really appreciate your comments and suggestions!

If the amount of work between synchronizations is really small, the task-sync overhead may dominate. Remember that a common OS time-slicing quantum is 10 ms and you are measuring in the 1 ms range, so with a bit of load, 4 ms of latency for getting all worker threads synced is perfectly reasonable.
If all tasks access the same shared data structure and that structure is thread-safe, locking overhead may well be the culprit, even with longer parallel tasks.
In some cases it may be possible to use non-thread-safe shared arrays for both input and output, but then you must ensure that the workers don't clobber each other's results.
Depending on what exactly the worker threads are doing (for example, if they write to the same array elements), it might be necessary to give each worker its own output array and merge them at the end, but that doesn't seem to be the case with your task.
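The effect is easy to reproduce outside Julia. Below is a minimal Python sketch (standard-library process pool, not the OP's Julia workers; work() and the array sizes are made up for illustration) showing that a pool loses badly when each task is tiny and only starts to pay off once the work is batched into larger chunks.

# Minimal sketch: per-task work size decides whether a process pool helps.
# work() is a trivial stand-in for the per-grid-point maximization step.
import time
from concurrent.futures import ProcessPoolExecutor

def work(x):
    return x * x

if __name__ == "__main__":
    xs = list(range(200_000))

    t0 = time.perf_counter()
    [work(x) for x in xs]                              # serial baseline
    print("serial:                   ", time.perf_counter() - t0)

    with ProcessPoolExecutor(max_workers=2) as pool:
        t0 = time.perf_counter()
        list(pool.map(work, xs))                       # one tiny task per element
        print("parallel, chunksize=1:    ", time.perf_counter() - t0)

        t0 = time.perf_counter()
        list(pool.map(work, xs, chunksize=10_000))     # batch the work per message
        print("parallel, chunksize=10000:", time.perf_counter() - t0)

With work() this cheap, the serial loop usually wins outright; the pool only becomes competitive once each dispatched chunk carries enough computation to amortize the pickling and synchronization cost, which is the same trade-off the Julia version faces.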

Related

Octave parallel function worsens the running time on a single machine

I wrote a test script in Octave to evaluate time efficiency on my Windows machine with an 8-core processor (parallelization on a single machine), starting with the simple code example provided in the documentation:
pkg load parallel
fun = @(x) x^2;
vector_x = 1:20000;
# Serial Format of the Program
tic()
for i=1:20000
vector_y(1,i)=vector_x(1,i)^2;
endfor
toc()
# Parallel Format of the Program
tic()
vector_y1 = pararrayfun(nproc, fun, vector_x);
toc()
To my surprise, the serial code is much faster than the parallel version: the serial case ran in 0.0758219 s, the parallel one in 3.79864 s.
Could someone explain to me whether this is parallel overhead, whether I need to set something up in my Octave configuration, and in which cases parallelization is really helpful?
TL;DR: open up your pool outside the timer and choose a more difficult operation.
There are two main issues. One is what Ander mentioned in his comment: starting up the parallel pool takes a second or two. You can open it up beforehand (in MATLAB you can do this via parpool) to speed that up. Alternatively, run a single parallel operation, thus opening the pool, and then redo the timing.
The second issue is the simplicity of your operation. Just squaring a number cannot go much faster than it already goes in serial. There's no point in passing data back-and-forth between workers for such a simple operation. Redo your test with a more expensive function, e.g. eig() as MATLAB does in their examples.
Parallelisation is thus useful if the runtime of your operations greatly outweighs the overhead of passing data to and from workers. Basically this means you either have a very large data set on which you need to perform the same operation for each item (e.g. taking the mean of every 1000 rows or so), or you have a few heavy, but independent, tasks to perform.
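Both fixes look roughly like this in Python (a sketch with the standard multiprocessing pool rather than Octave's pararrayfun; heavy(), the matrix size, and the task count are illustrative): the pool is created outside the timed region, and the timed operation is an eigendecomposition rather than a single squaring.

# Sketch: warm the pool up outside the timer and give each task real work
# (an eigendecomposition, in the spirit of the eig() suggestion above).
import time
import numpy as np
from multiprocessing import Pool

def heavy(seed):
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((200, 200))
    return np.linalg.eigvals(a).real.sum()

if __name__ == "__main__":
    with Pool() as pool:                       # pool startup is NOT timed
        t0 = time.perf_counter()
        [heavy(s) for s in range(64)]          # serial reference
        print("serial:  ", time.perf_counter() - t0)

        t0 = time.perf_counter()
        pool.map(heavy, range(64))             # only the parallel map is timed
        print("parallel:", time.perf_counter() - t0)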
For a more in-depth explanation I can recommend this answer of mine and references therein.
Just as a side note, I'm surprised your serial for loop is that fast, given that you do not initialise your output vector. Preallocation is very important, as "growing" an array in a loop requires creating a new array and copying all previous content to it every iteration.
You also might want to consider not using i or j as variable names, as they denote the imaginary unit. It won't affect runtime much, but can result in very hard to debug errors. Simply use idx, ii, or a more descriptive variable name.

Avoiding cPickle in IPython's parallel

I have some code that I have parallelized successfully, in the sense that it gets an answer, but it is still kind of slow. Using cProfile.run(), I found that 121 seconds (57% of total time) were spent in cPickle.dumps, despite a per-call time of 0.003 s. I don't use this function anywhere else, so it must be occurring because of IPython's parallel machinery.
The way my code works is it does some serial stuff, then runs many simulations in parallel. Then some serial stuff, then a simulation in parallel. It has to repeat this many, many times. Each simulation requires a very large dictionary that I pull in from a module I wrote. I believe this is what is getting pickled many times and slowing the program down.
Is there a way to push a large dictionary to the engines in such a way that it stays there permanently? I think it's getting physically pushed every time I call the parallel function.
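One way this is often handled, sketched below with IPython's DirectView interface (load_big_dict() is a hypothetical stand-in for pulling the dictionary in from your module): push the dictionary into each engine's namespace once, so later parallel calls refer to it by name on the engine instead of re-pickling it with every task.

# Sketch: send the large dictionary to every engine once, then reference it
# remotely by name. load_big_dict() is a hypothetical stand-in.
from ipyparallel import Client   # on older IPython: from IPython.parallel import Client

rc = Client()
dview = rc[:]                    # DirectView over all engines
dview.block = True

big_dict = load_big_dict()       # the large dictionary from your module

# Pickled and transferred exactly once per engine; it then stays in the
# engine's namespace until you overwrite or delete it.
dview['big_dict'] = big_dict

# Work executed on the engines sees the engine-local copy by name,
# so the dictionary is not re-pickled on every call.
dview.execute('n_keys = len(big_dict)')
print(dview['n_keys'])

Whether this removes the cPickle.dumps hotspot depends on the rest of the code, but it does avoid re-sending the dictionary for every simulation.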

How come MATLAB runs slower and slower when executing a program that takes a long time to run?

There is a program that my MATLAB runs; since it contains two gigantic nested for-loops, we expect it to run for more than 10 hours. We ask MATLAB to print out the loop number on every iteration.
Initially (the first hour), we see the loop number incrementing very fast on the screen; as time goes by, it gets slower and slower. Now (more than 20 consecutive hours of executing the same ".m" file, and it still hasn't finished), it is almost 20 times slower than it was initially.
The RAM usage was initially about 30%; right now, after 20 hours of execution, it is as shown below:
My computer specs are below.
What can I do to make MATLAB maintain its initial speed?
I can only guess, but my bet is that you have some array variables that have not been preallocated, and thus their size increases at each iteration of the for loop. As a result of this, Matlab has to reallocate memory in each iteration. Reallocating slows things down, and more so the larger those variables are, because Matlab needs to find an ever larger chunk of contiguous memory. This would explain why the program seems to run slower as time passes.
If this is indeed the cause, the solution would be to preallocate those variables. If their size is not known beforehand, you can make a guess and preallocate to an approximate size, thus avoiding at least some of the reallocating. Or, if your program is not memory-limited, maybe you can use an upper-bound on variable size when preallocating; then, after the loop, trim the arrays by removing unused entries.
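In NumPy terms (a Python sketch of the same preallocate-to-an-upper-bound-then-trim advice, not the OP's MATLAB code; n_max and the per-iteration work are made up), the pattern looks like this:

# Sketch: allocate once to a known upper bound, fill inside the loop,
# trim the unused tail afterwards. n_max and the work are illustrative.
import numpy as np

n_max = 1_000_000                 # upper bound on the number of results
out = np.empty(n_max)             # allocated once, before the loop
count = 0

for k in range(n_max):
    value = np.sin(0.001 * k)     # stand-in for the real per-iteration work
    if value > 0:                 # only some iterations keep a result
        out[count] = value
        count += 1

out = out[:count].copy()          # drop the unused entries after the loop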
Some general hints; if they don't help, I suggest adding the code to the question.
Don't print to the console: the output slows down execution and is kept in memory. Write a log file instead if you need it. For a simple status indicator, use waitbar.
Verify that you are preallocating all variables.
Check which operations depend on the loop index, for example variables whose size grows with it.
If any mex functions are used, double-check them for memory leaks. The standard procedure: call the function repeatedly with random example data and don't store the output; if memory usage increases, there is a memory leak in the function.
Use the profiler. Profile your code for the first n iterations (where n corresponds to about 10 minutes) and generate an HTML report. Then let it run for about 2 hours and generate a report for n iterations again. Compare both reports and see where the time is lost.
I'd like to point everybody to the following page in the MATLAB documentation: Strategies for Efficient Use of Memory. This page contains a collection of techniques and good practices that MATLAB users should be aware of.
The OP reminded me of it when saying that memory usage tends to increase over time. Indeed there's an issue with long running instances of MATLAB on win32 systems, where a memory leak exists and is exacerbated over time (this is described in the link too).
I'd like to add to Luis' answer the following piece of advice that a friend of mine once received during a correspondence with Yair Altman:
Allocating largest vars first helps by assigning the largest free contiguous blocks to the highest vars, but it is easily shown that in some cases this can actually be harmful, so it is not a general solution.
The only sure way to solve memory fragmentation is by restarting Matlab and even better to restart Windows.
More information on memory allocation performance can be found in the following Undocumented Matlab posts:
Preallocation performance
Allocation performance take 2

How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition occurs here because all the threads are trying to access the same v(i,j) values all the time, slowing down the code.
Would the best fix here be to create as many copies of the v() array as there are threads, and have each thread access a different one?
I'm using the Intel compiler on 16 cores, and the code runs just slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
Do 500, k=1,n
Do 10, i=-(b-1),b-1
Do 20, j=-(b-1),b-1
if (abs(i).le.l.and.abs(j).eq.d) then
cycle
endif
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
if (k.eq.n-1) then
vtest(i,j,1)=v(i,j)
endif
if (k.eq.n) then
vtest(i,j,2)=v(i,j)
endif
20 continue
10 continue
500 continue
!$OMP END PARALLEL DO
You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values for i and j is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v your program is, in practice, not deterministic as there is no way to know the order in which updates to v are made.
You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on cache at some level in your system's memory hierarchy. A nice, cache-friendly, run over every element of an array in memory order for a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory requiring access to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this?
Well, the usual pattern for OpenMP across an array is to parallelise over one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
....
end do
ensure that each thread gets a different set of values for i which means that they write to different elements of v, removing (almost) the data race. As you've written the program each thread gets a different set of values of k but that's not used (much) in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration looks like you are tying an anchor to your program; why not just
do k=1,n-2
and deal with the updates to vtest at the end of the loop.
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region, but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (e.g. private or shared) of each variable, but this answer is getting a bit too long and I won't go into more detail on these, or on using a schedule clause.
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.

How do I write tasks? (parallel code)

I am impressed with Intel Threading Building Blocks. I like how I write tasks and not thread code, and I like how it works under the hood, with my limited understanding (tasks live in a pool, there won't be 100 threads on 4 cores, a task is not guaranteed to run because it isn't on its own thread and may sit deep in the pool; but it may be run alongside another related task, so you can't do bad things like typical thread-unsafe code).
I wanted to know more about writing tasks. I like the 'Task-based Multithreading - How to Program for 100 cores' video here: http://www.gdcvault.com/sponsor.php?sponsor_id=1 (currently the second-to-last link. WARNING: it isn't 'great'). My favourite part was 'solving the maze is better done in parallel', which is around the 48-minute mark (you can click the link on the left side; that part is really all you need to watch, if any).
However, I'd like to see more code examples and some API showing how to write tasks. Does anyone have a good resource? I have no idea how a class or a piece of code might look after pushing it onto a pool, or how weird code might look when you need to make a copy of everything, and how much of everything gets pushed onto a pool.
Java has a parallel task framework similar to Threading Building Blocks - it's called the Fork-Join framework. It's available for use with the current Java SE 6 and is to be included in the upcoming Java SE 7.
There are resources available for getting started with the framework, in addition to the javadoc class documentation. The jsr166 page mentions that
"There is also a wiki containing additional documentation, notes, advice, examples, and so on for these classes."
The fork-join examples, such as matrix multiplication are a good place to start.
I used the fork-join framework in solving some of Intel's 2009 threading challenges. The framework is lightweight and low-overhead - mine was the only Java entry for the Knight's Tour problem and it outperformed other entries in the competition. The Java sources and writeup are available from the challenge site for download.
EDIT:
I have no idea how a class or pieces of code may look after pushing it onto a pool [...]
You can make your own task by subclassing one of the ForkJoinTask subclasses, such as RecursiveTask. Here's how to compute the Fibonacci sequence in parallel. (Taken from the RecursiveTask javadocs - comments are mine.)
// declare a new task, that itself spawns subtasks.
// The task returns an Integer result.
class Fibonacci extends RecursiveTask<Integer> {
    final int n; // the n'th number in the Fibonacci sequence to compute
    Fibonacci(int n) { this.n = n; } // constructor
    protected Integer compute() { // this method is the main work of the task
        if (n <= 1) // 1 or 0, base case to end recursion
            return n;
        Fibonacci f1 = new Fibonacci(n - 1); // create a new task to compute n-1
        f1.fork(); // schedule to run asynchronously
        Fibonacci f2 = new Fibonacci(n - 2); // create a new task to compute n-2
        return f2.invoke() + f1.join(); // wait for both tasks to compute.
        // f2 is run as part of this task, f1 runs asynchronously.
        // (You could create two separate tasks and wait for them both, but running
        // f2 as part of this task is a little more efficient.)
    }
}
You then run this task and get the result
// default parallelism is the number of cores
ForkJoinPool pool = new ForkJoinPool();
Fibonacci f = new Fibonacci(30); // keep n modest: this naive recursion does exponential work, and large n overflows int
int result = pool.invoke(f);
This is a trivial example to keep things simple. In practice, performance would not be good, since the work executed by each task is trivial compared to the overhead of the task framework. As a rule of thumb, a task should perform some significant computation - enough to make the framework overhead insignificant, yet not so much that you end up with one core running one large task at the end of the problem. Splitting large tasks into smaller ones ensures that one core isn't left doing lots of work while the other cores sit idle; smaller tasks keep more cores busy, but tasks shouldn't be so small that they do no real work.
[...] or how weird code may look when you need to make a copy of everything and how much of everything is pushed onto a pool.
Only the tasks themselves are pushed into a pool. Ideally you don't want to be copying anything: to avoid interference and the need for locking, which would slow down your program, your tasks should ideally work with independent data. Read-only data can be shared amongst all tasks and doesn't need to be copied. If threads need to co-operate in building some large data structure, it's best they build the pieces separately and then combine them at the end. The combining can be done as a separate task, or each task can add its piece of the puzzle to the overall solution. This often does require some form of locking, but it's not a significant performance issue if the work of the task is much greater than the work of updating the solution. My Knight's Tour solution takes this approach to update a common repository of tours on the board.
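The build-the-pieces-separately-then-combine pattern looks roughly like this in Python (a sketch with the standard multiprocessing pool rather than the Fork-Join API; count_words and the chunking scheme are illustrative):

# Sketch: each worker builds its own private partial result (no locking while
# computing), and the pieces are merged once at the end.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk_of_lines):
    partial = Counter()
    for line in chunk_of_lines:
        partial.update(line.split())
    return partial                               # independent piece of the answer

if __name__ == "__main__":
    lines = ["the quick brown fox", "jumps over the lazy dog"] * 10_000
    chunks = [lines[i::4] for i in range(4)]     # four independent slices of the input

    with Pool(4) as pool:
        partials = pool.map(count_words, chunks) # no shared state between workers

    total = Counter()
    for piece in partials:                       # combine the pieces at the end
        total.update(piece)
    print(total.most_common(3))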
Working with tasks and concurrency is quite a paradigm shift from regular single-threaded programming. There are often several designs possible for solving a given problem, but only some of these will be suitable for a threaded solution. It can take a few attempts to get a feel for how to recast familiar problems in a multi-threaded way. The best way to learn is to look at the examples and then try it for yourself. Always profile, and measure the effects of varying the number of threads. You can explicitly set the number of threads (cores) the pool uses via the pool constructor. When tasks are broken up linearly, you can expect near-linear speedup as the number of threads increases.
Playing with "frameworks" which claim to solve the unsolvable (optimal task scheduling is NP-hard) is not going to help you at all; reading books and then articles on concurrent algorithms will. So-called "tasks" are nothing more than a fancy name for defining the separability of the problem (parts that can be computed independently of each other). The class of separable problems is very small, and it is already covered in old books.
For problems which are not separable you have to plan phases, and data barriers between phases to exchange data. Optimal orchestration of data barriers for simultaneous data exchange is not just NP-hard but impossible to solve in a general way in principle: you'd need to examine the history of all possible interleavings, which is like taking the power set of an already exponential set (like going from N to R in math). The reason I mention this is to make it clear that no software can ever do it for you, that how to do it depends intrinsically on the actual algorithm, and that it makes or breaks whether parallelization is feasible at all (even if it's theoretically possible).
When you enter high parallelism you can't even maintain a queue; you don't even have a memory bus anymore. Imagine 100 CPUs trying to sync up on just a single shared int, or trying to do memory-bus arbitration. You have to pre-plan and pre-configure everything that's going to run and essentially prove correctness on a whiteboard. Intel's Threading Building Blocks are a small kid in that world; they are for a small number of cores which can still share a memory bus. Running separable problems is a no-brainer which you can do without any "framework".
So you are back to having to read about as many different parallel algorithms as you can. It normally takes 1-3 years to research an approximately optimal data-barrier layout for one problem. It becomes a layout problem when you go for, say, 16+ cores on a single chip, since only first neighbors can exchange data efficiently (during one data-barrier cycle). So you'll actually learn much more by looking at CUDA, and at the papers and results from IBM's experimental 30-core CPU, than from Intel's sales pitch or some Java toy.
Beware of demo problems for which the amount of resources wasted (number of cores and memory) is much bigger than the speedup they achieve. If it takes 4 cores and 4x the RAM to solve something 2x faster, the solution is not scalable for parallelization.

Resources