Matlab parallel computing running slower than expected - performance

I am trying to use Matlab parallel computing toolbox. My PC has 6 'workers' or cores. Thus, I would expect my code to run roughly 6x as fast (ie ~600% increase in speed). However, when I actually time the opertations, I find I am only getting roughly a 40% increase in speed.
Is this normal, or am I doing something wrong?
Here is my code:
N=5000;
%%Parrelel
pp=parpool(6)
ts=tic;
parfor i=1:12
q=eye(N); q^-1;
end
disp(['Time for Parrelel Computation: ' num2str(toc(ts),3) 's']);
delete(pp);
%Serial
ts=tic;
for i=1:12
q=eye(N); q^-1;
end
disp(['Time for Serial Computation: ' num2str(toc(ts),3) 's']);
The readout is:
Time for Parrelel Computation: 24.6s
Time for Serial Computation: 35.9s
I wouldve expected the parrelel computation to be roughly 35/6~=6s, not 24s
Any advice?
Thanks
Roman

Related

Parallel computing and Julia

I have some performance problems with parallel computing in Julia. I am new in both, Julia and parallel calculations.
In order to learn, I parallelized a code that should benefits from parallelization, but it does not.
The program estimates the mean of the mean of the components of arrays whose elements were chosen randomly with an uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
iter = 100000*2
p = 5000
vec_mean = zeros(iter)
for i = 1:iter
vec_mean[i] = mean( rand(p) )
end
return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
#everywhere function mean_estimate(N::Int)
iter = 100000
p = 5000
vec_mean = zeros(iter)
for i = 1:iter
vec_mean[i] = mean( rand(p) )
end
return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in the fourth line of the serial code is because I tried the code in a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me#pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me#pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, that there is large difference in the total execution time.
Questions
Why is the parallelized version slower? (what I am doing wrong?)
Which is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured times as #HighPerformanceMark suggested. The tic()/toc() times are the following. The iteration number is 2E6 for every case.
Array Size Single thread Parallel Ratio
5000 2.69 2.89 1.07
100 000 488.77 346.00 0.71
1000 000 4776.58 4438.09 0.93
I am puzzled about why there is not clear trend with array size.
You should pay prime attention to suggestions in the comments.
As #ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, then make sure that your functions are type-stable. Consider annotating the return type of the function pmap and/or vcat, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler. You can use pmap source code as a springboard (see code here).
Furthermore, as #AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally performance of a function f() is measured in Julia by running it twice and measuring only the second run. In the first run, the JIT compiler compiles f() before running it, while the second run uses the compiled function. Compilation incurs a (unwanted) performance cost, so timing the second run avoid measuring the compilation.
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))
function g(iter::Int, p::Int)
vec_mean = zeros(iter)
for i in eachindex(vec_mean)
vec_mean[i] = f(p)
end
return mean(vec_mean)
end
# run twice, time on second run to get compute time
g(200000, 5000)
#time g(200000, 5000)
### output on my machine
# 2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917
The #time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
#time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.

A fast dilation image function in MATLAB

I am going to work on the dilated versions of image superpixels, however bwmorph and imdilate are very slow for my application. For example the following code snippet takes more than 1 second for N=200 (parfor on 4 threads):
parfor i=1:N
idx = superpixels==i;
bwF = bwmorph(idx,'dilate',10);
end
Does anyone know about any other MATLAB code that speeds up this process?
Thanks!
Matlab's Image Processing Toolbox includes Mathematical Morphology.
The dilation function is called imdilate. The toolbox uses the GPU for high speed.
If you are searching for high performance image processing you should change to c++ and employ the GPU (CUDA for example). It's faster than using parallel cores of the cpu.

What is the performance of 10 processors capable of 200 MFLOPs running code which is 10% sequential and 90% parallelelizable?

simple problem from Wilkinson and Allen's Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Working through the exercises at the end of the first chapter and want to make sure that I'm on the right track. The full question is:
1-11 A multiprocessor consists of 10 processors, each capable of a peak execution rate of 200 MFLOPs (millions of floating point operations per second). What is the performance of the system as measured in MFLOPs when 10% of the code is sequential and 90% is parallelizable?
I assume the question wants me to find the number of operations per second of a serial processor which would take the same amount of time to run the program as the multiprocessor.
I think I'm right in thinking that 10% of the program is run at 200 MFLOPs, and 90% is run at 2,000 MFLOPs, and that I can average these speeds to find the performance of the multiprocessor in MFLOPs:
1/10 * 200 + 9/10 * 2000 = 1820 MFLOPs
So when running a program which is 10% serial and 90% parallelizable the performance of the multiprocessor is 1820 MFLOPs.
Is my approach correct?
ps: I understand that this isn't exactly how this would work in reality because it's far more complex, but I would like to know if I'm grasping the concepts.
Your calculation would be fine if 90% of the time, all 10 processors were fully utilized, and 10% of the time, just 1 processor was in use. However, I don't think that is a reasonable interpretation of the problem. I think it is more reasonable to assume that if a single processor were used, 10% of its computations would be on the sequential part, and 90% of its computations would be on the parallelizable part.
One possibility is that the sequential part and parallelizable parts can be run in parallel. Then one processor could run the sequential part, and the other 9 processors could do the parallelizable part. All processors would be fully used, and the result would be 2000 MFLOPS.
Another possibility is that the sequential part needs to be run first, and then the parallelizable part. If a single processor needed 1 hour to do the first part, and 9 hours to do the second, then it would take 10 processors 1 + 0.9 = 1.9 hours total, for an average of about (1*200 + 0.9*2000)/1.9 ~ 1053 MFLOPS.

Parallel code is taking longer time than sequential code using Parallel computing toolbox in MATLAB. Why?

I am working with MATLAB. I am just new with parallel computing toolbox in MATLAB. I have core i3 processor, MATLAB R2011a, 2 GB of RAM, 320 Hard disk.
To calculate speed up, I just wrote following code and found that parallel code is taking longer time than a sequential code.
1st code is taking 0.039763 seconds
2nd code is taking 0.379056 seconds.
1st code:
tic
MM = magic(5);
MN = magic(5);
ML = magic(5);
MP = magic(5);
MK = magic(5);
MM
MN
ML
MP
MK
toc
2nd Code:
matlabpool open local 4
tic
spmd % Uses all 3 workers
MM = magic(5); % MM is a variable on each lab
end
MM{1}
MM{2}
MM{3}
MM{4}
toc
matlabpool close
I want to learn parallel computing toolbox.
As mentioned by Dan in the comments, the problem is clearly too small for parallelization to be beneficial. Increasing for example the size of the magic matrices you create from 5 to 5000, already shows a clear improvement. That is, with the larger size the overhead of parallelization becomes (almost) negligible compared to the computation time for one matrix.

Matlab + CUDA slow in solving matrix-vector equation A*x=B

I am calculating an equation A*x=B, where A is a matrix and B is a vector, x is answer (unknown) vector.
Hardware specs:
Intel i7 3630QM (4 cores),
nVidia GeForce GT 640M (384 CUDA cores)
Here's an example:
>> A=rand(5000);
>> B=rand(5000,1);
>> Agpu=gpuArray(A);
>> Bgpu=gpuArray(B);
>> tic;A\B;toc;
Elapsed time is 1.382281 seconds.
>> tic;Agpu\Bgpu;toc;
Elapsed time is 4.775395 seconds.
Somehow GPU is much slower... Why? It is also slower in FFT, INV, LU calculations, which should be related with matrix division.
However, GPU is much faster in matrix multiplication (the same data):
>> tic;A*B;toc;
Elapsed time is 0.014700 seconds.
>> tic;Agpu*Bgpu;toc;
Elapsed time is 0.000505 seconds.
The main question is why GPU A\B (mldivide) is so slow comparing to CPU?
UPDATED
Here are some more results when A, B (on CPU), AA, BB (on GPU) are rand(5000):
>> tic;fft(A);toc;
Elapsed time is *0.117189 *seconds.
>> tic;fft(AA);toc;
Elapsed time is 1.062969 seconds.
>> tic;fft(AA);toc;
Elapsed time is 0.542242 seconds.
>> tic;fft(AA);toc;
Elapsed time is *0.229773* seconds.
>> tic;fft(AA);toc;
Bold times are stable times. However GPU is almost twice slower. By the way, why GPU is even more slower on first two attempts? Is it compiled twice firstly?
In addition:
>> tic;sin(A);toc;
Elapsed time is *0.121008* seconds.
>> tic;sin(AA);toc;
Elapsed time is 0.020448 seconds.
>> tic;sin(AA);toc;
Elapsed time is 0.157209 seconds.
>> tic;sin(AA);toc;
Elapsed time is *0.000419 *seconds
After two calculations GPU is incredibly faster in sin calculations.
So, still, why GPU is so slow in matrix division, fft and similar calculations, though it is so fast in matrix multiplication and trigonometry? The question actually should not be like that... GPU should be faster in all these calculations because Matlab has released overlapped functions (mldivide, fft) for GPU.
Could somebody help me solve these issues, please? :)
Please read how Matlab calculates the solutions. It will help you understand why GPU is slower.
I'll try say it in few words.
A*x=b becomes L*(U*x=y)=b, L*U=A
So Matlab makes A to L*U (This process cannot be done fully parallel
as far as I know instead some steps can be done parallel, due to
their nature)
Then Matlab solves L*y=B and finds y. (This process cannot be done
parallel as each step requires data from previous)
Then Matlab solves U*x=y and finds x. (This process cannot be done
parallel as each step requires data from previous)
So it GPU clock is slower than the CPU, and since processes cannot be done parallel, CPU is faster. And no, unless you come up with a better method (good luck!) then GPU will be always slower except in some very specific cases.
Part 1 of the explanation is in the answer from user2230360, but your question is twofold, so I'll add a bit about the multiplication.
As noted already, the LU factorization is not very easily parallelized even if some steps can be. Matrix multiplication, however, is very much parallelizable. If you're working with these things you should be able to do matrix multiplication by hand, and then you will know that calculating the elements of the matrix C in A*B=C can be done in any order you want - hence the possibility for parallel computation. That is probably why you're seeing so lightning fast multiplication, but slow solving of linear systems. One can't be parallelized "as much as the other".

Resources