pmap slow for toy example - parallel-processing

I'm testing out parallelism in Julia to see if there's a speedup on my machine (I'm picking a language to implement new algorithms with). I didn't want to dedicate a ton of time writing a huge example so I did the following test on release version Julia 0.4.5 (Mac OS X and dual core):
$ julia -p2
julia> @everywhere f(x) = x^2 + 10
julia> @time map(f, 1:10000000)
julia> @time pmap(f, 1:10000000)
pmap is significantly slower than map (>20x) and allocates well over 10x the memory. What am I doing wrong?
Thanks.

That's because pmap is intended for heavy computations per core, not many tiny ones. If your function is something this simple, the overhead of moving the data between processes outweighs the benefit. Instead, test this code (I ran it with 4 cores on an i7):
function fast(x::Float64)
    return x^2 + 1.0
end

function slow(x::Float64)
    a = 1.0
    for i in 1:1000
        for j in 1:5000
            a += asinh(i + j)
        end
    end
    return a
end
info("Precompilation")
map(fast,linspace(1,1000,1000))
pmap(fast,linspace(1,1000,1000))
map(slow,linspace(1,1000,10))
pmap(slow,linspace(1,1000,10))
info("Testing slow function")
#time map(slow,linspace(1,1000,10)) #3.69 s
#time pmap(slow,linspace(1,1000,10)) #0.003 s
info("Testing fast function")
#time map(fast,linspace(1,1000,1000)) #52 μs
#time pmap(fast,linspace(1,1000,1000)) #775 s
For parallelizing many very small iterations, you can use @parallel; search for it in the documentation.
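For instance, a minimal sketch along those lines, reusing the question's f (this is the Julia 0.4-era @parallel reduction syntax; on Julia 1.x the rough equivalent is @distributed from the Distributed standard library):

@everywhere f(x) = x^2 + 10
# parallel reduction: each worker processes a chunk of the range and the partial sums are combined
s = @parallel (+) for i in 1:10000000
    f(i)
end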

Related

Julia many allocations using Distributed and SharedArrays with @sync/@async

I am trying to understand how to use the package Distributed together with SharedArrays to perform parallel operations with Julia. Just as an example, I am taking a simple Monte Carlo averaging method:
using Distributed
using SharedArrays
using Statistics
const NWorkers = 2
const Ns = Int(1e6)
function parallelRun()
    addprocs(NWorkers)
    procsID = workers()
    A = SharedArray{Float64,1}(Ns)
    println("starting loop")
    for i = 1:2:Ns
        # parallel block
        @sync for p = 1:NWorkers
            @async A[i+p-1] = remotecall_fetch(rand, procsID[p]);
        end
    end
    println(mean(A))
end
function singleRun()
    A = zeros(Ns)
    for i = 1:Ns
        A[i] = rand()
    end
    println(mean(A))
end
However, if I @time both functions I get
julia> @time singleRun()
0.49965531193003165
0.009762 seconds (17 allocations: 7.630 MiB)
julia> @time parallelRun()
0.4994892300029917
46.319737 seconds (66.99 M allocations: 2.665 GiB, 1.01% gc time)
In particular there are many more allocations in the parallel version, which makes the code much slower.
Am I missing something?
By the way, the reason I am using @sync and @async (even though they are not needed here, since every sample can be computed in any order) is that I would like to apply the same strategy to solve a parabolic PDE numerically, with something along the lines of
for t = 1:time_steps
    # parallel block
    @sync for p = 1:NWorkers
        @async remotecall(make_step_PDE, procsID[p], p);
    end
end
where each worker indexed by p should work on a disjoint set of indices of my equation.
Thanks in advance
There are the following problems in your code:
You are spawning a remote task for each value of i, which is expensive and in the end takes a long time. The rule of thumb is to use the @distributed macro for load balancing across workers; it will evenly share the work.
Never put addprocs inside your work function, because every time you run it you add new processes; spawning a new Julia process takes a lot of time, and that time was included in your measurements. In practice this means you should run addprocs in the part of the script that performs the initialization, or perhaps add the processes by starting Julia with the -p or --machine-file parameter.
Finally, always run @time twice: in the first measurement, @time also includes compilation time, and compilation in a distributed environment takes much longer than in a single process.
Your function should look more or less like this:
using Distributed, SharedArrays, Statistics  # Statistics provides mean
addprocs(4)
@everywhere using Distributed, SharedArrays

function parallelRun(Ns)
    A = SharedArray{Float64,1}(Ns)
    @sync @distributed for i = 1:Ns
        A[i] = rand();
    end
    println(mean(A))
end
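In line with the last point above (run @time twice), call it once to compile and time only the second run, for example (a sketch; 10^6 is simply the Ns value from the question):

parallelRun(10^6)        # first call, includes compilation
@time parallelRun(10^6)  # second call, measures only the computation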
You might also consider completely splitting the data between workers. In some scenarios this is less bug-prone and it allows you to distribute the work over many nodes:
using Distributed, DistributedArrays, Statistics  # Statistics provides mean
addprocs(4)
@everywhere using Distributed, DistributedArrays

function parallelRun2(Ns)
    d = dzeros(Ns)  # creates an array distributed evenly across all workers
    @sync @distributed for i in 1:Ns
        p = localpart(d)
        # map the global index i to an index into this worker's local chunk
        # (works because @distributed splits 1:Ns into equal contiguous ranges)
        p[(i-1) % Int(Ns/nworkers()) + 1] = rand()
    end
    println(mean(d))
end

Parallel computing and Julia

I have some performance problems with parallel computing in Julia. I am new to both Julia and parallel calculations.
In order to learn, I parallelized a piece of code that should benefit from parallelization, but it does not.
The program estimates the mean of the means of arrays whose elements are chosen randomly from a uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
    iter = 100000*2
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
@everywhere function mean_estimate(N::Int)
    iter = 100000
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in iter = 100000*2 in the serial code is there because I tried the code on a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me@pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me@pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, there is a large difference in the total execution time.
Questions
Why is the parallelized version slower? (What am I doing wrong?)
What is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured times as @HighPerformanceMark suggested. The tic()/toc() times are the following. The iteration number is 2E6 for every case.
Array size    Single thread (s)    Parallel (s)    Ratio
5,000         2.69                 2.89            1.07
100,000       488.77               346.00          0.71
1,000,000     4776.58              4438.09         0.93
I am puzzled about why there is no clear trend with array size.
You should pay prime attention to suggestions in the comments.
As @ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, then make sure that your functions are type-stable. Consider annotating the return type of the function that wraps pmap and/or vcat, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler. You can use the pmap source code as a springboard (see code here).
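Spelled out as a sketch (the name mean_of_means is hypothetical; the ::Float64 annotation is what guards against pmap's weakly typed output):

mean_of_means(pids::Vector{Int})::Float64 = mean(vcat(pmap(mean_estimate, pids)...))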
Furthermore, as @AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally the performance of a function f() is measured in Julia by running it twice and timing only the second run. In the first run, the JIT compiler compiles f() before running it, while the second run uses the already compiled function. Compilation incurs an (unwanted) performance cost, so timing the second run avoids measuring the compilation.
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))

function g(iter::Int, p::Int)
    vec_mean = zeros(iter)
    for i in eachindex(vec_mean)
        vec_mean[i] = f(p)
    end
    return mean(vec_mean)
end
# run twice, time on second run to get compute time
g(200000, 5000)
@time g(200000, 5000)

### output on my machine
# 2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917
The @time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
@time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.
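For example, a minimal multithreaded sketch of the same computation (assuming a recent Julia started with several threads, e.g. julia -t 4 or JULIA_NUM_THREADS=4; the function name is hypothetical):

using Statistics
function threaded_mean_estimate(iter::Int, p::Int)
    vec_mean = zeros(iter)
    Threads.@threads for i in 1:iter
        vec_mean[i] = mean(rand(p))  # each thread fills its own share of vec_mean
    end
    return mean(vec_mean)
end

threaded_mean_estimate(200000, 5000)        # warm-up run (compilation)
@time threaded_mean_estimate(200000, 5000)  # timed run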

IPython parallel and map performance

I have used parallel computing before through MPI (and Fortran :)). I would now like to use the parallel capabilities of IPython.
My question is related to the poor performance of the following code, inspired by http://ipython.org/ipython-doc/dev/parallel/asyncresult.html:
from IPython.parallel import Client
import numpy as np
_procs = Client()
print 'engines #', len(_procs)
dv = _procs.direct_view()
X = np.linspace(0,100)
add = lambda a,b: a+b
sq = lambda x: x*x
%timeit reduce(add, map(sq, X))
%timeit reduce(add, dv.map(sq, X))
The results for one processor are:
10000 loops, best of 3: 43 µs per loop
100 loops, best of 3: 4.77 ms per loop
Could you tell me if the results seem normal to you and, if so, why there is such a huge difference in computational time?
Best regards,
Flavien.
Parallel processing doesn't come for free. There is a cost, called overhead, associated with sending job items to the engines and receiving the results afterwards. Your original work takes 43 µs, and that is just too short. You need much larger work items before parallel processing becomes beneficial. A simple rule of thumb is that each worker should spend at least about 10 times the overhead processing its work items. Try a vector of 1 million elements or even larger instead.
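To put rough numbers on that (assuming the default 50 samples that np.linspace(0,100) produces): the serial map spends about 43 µs / 50 ≈ 0.9 µs of actual work per element, while the parallel run's 4.77 ms works out to roughly 95 µs per element, almost all of it communication overhead. By the 10x rule of thumb above, each sq call would need to take on the order of a millisecond, or the vector would need to be orders of magnitude longer, before dv.map could pay off.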

Measuring elapsed CPU time in Julia

Many scientific computing languages make a distinction between absolute time (wall clock) and CPU time (processor cycles). For example, in Matlab we have:
>> tic; pause(1); toc
Elapsed time is 1.009068 seconds.
>> start = cputime; pause(1); elapsed = cputime - start
elapsed =
0
and in Mathematica we have:
In[1]:= AbsoluteTiming[Pause[1]]
Out[1]= {1.0010572, Null}
In[2]:= Timing[Pause[1]]
Out[2]= {0., Null}
This distinction is useful when benchmarking code run on computation servers, where there may be high variance in the absolute timing results depending on what other processes are running concurrently.
The Julia standard library provides support for timing of expressions through tic(), toc(), @time and a few other functions/macros, all based on time_ns(), a function that measures absolute time.
julia> @time sleep(1)
elapsed time: 1.017056895 seconds (135788 bytes allocated)
My question: Is there a simple way to get the elapsed CPU time for an expression evaluation in Julia?
(Side note: doing some digging, it appears that Julia timing is based on the uv_hrtime() function from libuv. It seems to me that using uv_getrusage from the same library might give a way to access elapsed CPU time in Julia, but I'm no expert. Has anybody tried using anything like this?)
I couldn't find any existing solutions, so I've put together a package with some simple CPU timing functionality here: https://github.com/schmrlng/CPUTime.jl. The package is completely untested on parallel code and may have other bugs, but if anybody else would like to try it out calling
Pkg.clone("https://github.com/schmrlng/CPUTime.jl.git")
from the julia> prompt should install the package.
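A hypothetical usage sketch, assuming the package exports a @CPUtime macro analogous to Base's @time (check the package README for the actual API):

julia> using CPUTime
julia> @CPUtime sleep(1)   # CPU time should be near zero, since sleep barely uses the processor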
Julia does have the commands tic() and toc() which work just like tic and toc in Matlab:
julia> tic(); 7^1000000000; toc()
elapsed time: 0.046563597 seconds
0.046563597

Performance penalty of persistent variables in MATLAB

Recently I profiled some MATLAB code and I was shocked to see the following in a heavily used function:
5.76 198694 58 persistent CONSTANTS;
3.44 198694 59 if isempty(CONSTANTS) % initialize CONSTANTS
In other words, MATLAB spent about 9 seconds, over 198694 function calls, declaring the persistent CONSTANTS and checking if it has been initialized. That represents 13% of the total time spent in that function.
Do persistent variables really carry that much of a performance penalty in MATLAB? Or are we doing something terribly wrong here?
UPDATE
@Andrew I tried your sample script and I am very, very perplexed by the output:
time calls line
6 function has_persistent
6.48 200000 7 persistent CONSTANTS
1.91 200000 8 if isempty(CONSTANTS)
9 CONSTANTS = 42;
10 end
I tried the bench() command and it showed my machine in the middle range of the sample machines. I am running 64-bit Ubuntu on an Intel(R) Core(TM) i7 CPU with 4 GB of RAM.
That's the standard way of using persistent variables in Matlab. You're doing what you're supposed to. There will be noticeable overhead for it, but your timings do seem surprisingly high.
Here's a similar test I ran in 32-bit Matlab R2009b on a 3.0 GHz Intel Core 2 QX9650 machine under Windows XP x64. Similar results on other machines and versions. About 5x faster than your timings.
Test:
function call_has_persistent
for i = 1:200000
    has_persistent();
end

function has_persistent
persistent CONSTANTS
if isempty(CONSTANTS)
    CONSTANTS = 42;
end
Results:
0.89 200000 7 persistent CONSTANTS
0.25 200000 8 if isempty(CONSTANTS)
What Matlab version, OS, and CPU are you running on? What does CONSTANTS get initialized with? Does Matlab's bench() output seem reasonable for your machine?
Your timings do seem high. There may be a bug or config issue there to fix. But if you really want to get Matlab code fast, the standard advice is to "vectorize" it: restructure the code so that it makes fewer function calls on larger input arrays, and makes use of Matlab's built-in vectorized functions instead of loops or control structures, to avoid having 200,000 calls to the function in the first place, if possible. Matlab has relatively high overhead per function or method call (see Is MATLAB OOP slow or am I doing something wrong? for some numbers), so you can often get more mileage by refactoring to eliminate function calls instead of making the individual function calls faster.
It may be worth benchmarking some other basic Matlab operations on your machine, to see if it's just "persistent" that seems slow. Also try profiling just this little call_has_persistent test script in isolation to see if the context of your function makes a difference.
