Julia: many allocations using Distributed and SharedArrays with @sync/@async - parallel-processing

I am trying to understand how to use the package Distributed together with SharedArrays to perform parallel operations in Julia. As an example, I am taking a simple Monte Carlo averaging method:
using Distributed
using SharedArrays
using Statistics

const NWorkers = 2
const Ns = Int(1e6)

function parallelRun()
    addprocs(NWorkers)
    procsID = workers()
    A = SharedArray{Float64,1}(Ns)
    println("starting loop")
    for i = 1:2:Ns
        # parallel block
        @sync for p = 1:NWorkers
            @async A[i+p-1] = remotecall_fetch(rand, procsID[p])
        end
    end
    println(mean(A))
end

function singleRun()
    A = zeros(Ns)
    for i = 1:Ns
        A[i] = rand()
    end
    println(mean(A))
end
However, if I @time both functions I get:

julia> @time singleRun()
0.49965531193003165
0.009762 seconds (17 allocations: 7.630 MiB)

julia> @time parallelRun()
0.4994892300029917
46.319737 seconds (66.99 M allocations: 2.665 GiB, 1.01% gc time)
In particular there are many more allocations in the parallel version, which makes the code much slower.
Am I missing something?
By the way, the reason I am using @sync and @async (even though they are not needed here, since every sample can be computed in any order) is that I would like to apply the same strategy to solve a parabolic PDE numerically, with something along the lines of
for t = 1:time_steps
    # parallel block
    @sync for p = 1:NWorkers
        @async remotecall(make_step_PDE, procsID[p], p)
    end
end
where each worker indexed by p should work on a disjoint set of indices of my equation.
Thanks in advance

There are the following problems in your code:
You are spawning a remote task for each value of i, which is expensive and ends up taking a long time. The rule of thumb is to use the @distributed macro for load balancing across workers; it will simply share the work evenly.
Never put addprocs inside your work function, because every time you run it you add new processes: spawning a new Julia process also takes a lot of time, and that was included in your measurements. In practice, run addprocs in the part of the script that performs the initialization, or add the processes by starting the julia process with the -p or --machine-file parameter.
Finally, always run @time twice: the first measurement also includes compilation time, and compilation in a distributed environment takes much longer than in a single process.
Your function should look more or less like this
using Distributed, SharedArrays, Statistics  # Statistics provides mean
addprocs(4)
@everywhere using Distributed, SharedArrays

function parallelRun(Ns)
    A = SharedArray{Float64,1}(Ns)
    @sync @distributed for i = 1:Ns
        A[i] = rand()
    end
    println(mean(A))
end
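Following point 3 above, call it once to compile and then time the second call, e.g. (sticking with the 10^6 samples from the question):

parallelRun(10^6)        # first call includes compilation time
@time parallelRun(10^6)  # second call measures only the computation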
You might also consider completely splitting the data between workers. In some scenarios this is less bug-prone and allows you to distribute the work over many nodes:
using Distributed, DistributedArrays, Statistics  # Statistics provides mean
addprocs(4)
@everywhere using Distributed, DistributedArrays

function parallelRun2(Ns)
    d = dzeros(Ns)  # creates an array distributed evenly across all workers
    @sync @distributed for i in 1:Ns
        p = localpart(d)
        p[(i-1) % Int(Ns/nworkers()) + 1] = rand()
    end
    println(mean(d))
end

Related

Parallel computing in Julia - running a simple for-loop on multiple cores

For starters, I have to say I'm completely new to parallel computing (and know close to nothing about computer science), so my understanding of what things like "workers" or "processes" actually are is very limited. I do however have a question about running a simple for-loop that presumably has no dependencies between the iterations in parallel.
Let's say I wanted to do the following:
for N in 1:5:20
    println("The N of this iteration is $N")
end
If I simply wanted these messages to appear on screen and the order of appearance didn't matter, how could one achieve this in Julia 0.6, and for future reference in Julia 0.7 (and therefore 1.0)?
Just to add an example to Chris's answer: since the release of Julia 1.3 you can do this easily with Threads.@threads
Threads.@threads for N in 1:5:20
    println("The number of this iteration is $N")
end
Here you are running only one Julia session with multiple threads, instead of using Distributed, where you run multiple Julia sessions.
See, e.g., the multithreading blog post for more information.
Distributed Processing
Start Julia with e.g. julia -p 4 if you want to use 4 CPUs (or use the function addprocs(4)). In Julia 1.x, you write a parallel loop as follows:
using Distributed
@distributed for N in 1:5:20
    println("The N of this iteration is $N")
end
Note that every process has its own variables by default.
For any serious work, have a look at the manual https://docs.julialang.org/en/v1.4/manual/parallel-computing/, in particular the section about SharedArrays.
Other options for distributed computing are the function pmap or the package MPI.jl.
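For instance, a minimal pmap sketch (f is just a placeholder function; it assumes workers have already been added as above):

using Distributed
@everywhere f(x) = x^2   # make the function available on every worker
results = pmap(f, 1:20)  # the 20 calls are distributed across the workers
println(results)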
Threads
Since Julia 1.3, you can also use Threads as noted by wueli.
Start Julia with e.g. julia -t 4 to use 4 threads. Alternatively, you can set the environment variable JULIA_NUM_THREADS before starting Julia.
For example, on Linux/macOS:
export JULIA_NUM_THREADS=4
On Windows, you can use set JULIA_NUM_THREADS=4 in the cmd prompt.
Then in Julia:
Threads.@threads for N = 1:20
    println("N = $N (thread $(Threads.threadid()) out of $(Threads.nthreads()))")
end
All CPUs are assumed to have access to shared memory in the examples above (e.g. "OpenMP style" parallelism) which is the common case for multi-core CPUs.

Parallel computing and Julia

I have some performance problems with parallel computing in Julia. I am new to both Julia and parallel computing.
In order to learn, I parallelized a code that should benefit from parallelization, but it does not.
The program estimates the mean of the means of the components of arrays whose elements are chosen randomly from a uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
    iter = 100000*2
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
@everywhere function mean_estimate(N::Int)
    iter = 100000
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in the fourth line of the serial code is there because I tried the code on a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me@pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me@pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, there is a large difference in the total execution time.
Questions
Why is the parallelized version slower? (What am I doing wrong?)
What is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured times as @HighPerformanceMark suggested. The tic()/toc() times are as follows. The iteration number is 2E6 for every case.
Array size    Single thread    Parallel    Ratio
5 000                  2.69        2.89     1.07
100 000              488.77      346.00     0.71
1 000 000           4776.58     4438.09     0.93
I am puzzled about why there is no clear trend with array size.
You should pay close attention to the suggestions in the comments.
As @ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, then make sure that your functions are type-stable. Consider annotating the return type of the function that wraps pmap and/or vcat, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler. You can use the pmap source code as a springboard (see code here).
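If you want to check whether a method is type-stable, @code_warntype is handy. A minimal sketch with a deliberately type-unstable toy function (unstable and stable are made-up names, not part of the question's code):

# returns an Int for positive x and a Float64 otherwise, so the
# inferred return type is Union{Float64, Int64} (type-unstable)
unstable(x) = x > 0 ? 1 : 0.0
@code_warntype unstable(2)

# the return-type annotation converts the result to Float64,
# so the wrapper is type-stable
stable(x)::Float64 = unstable(x)
@code_warntype stable(2)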
Furthermore, as @AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally the performance of a function f() is measured in Julia by running it twice and measuring only the second run. In the first run, the JIT compiler compiles f() before running it, while the second run uses the already-compiled function. Compilation incurs an (unwanted) performance cost, so timing the second run avoids measuring the compilation.
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))

function g(iter::Int, p::Int)
    vec_mean = zeros(iter)
    for i in eachindex(vec_mean)
        vec_mean[i] = f(p)
    end
    return mean(vec_mean)
end

# run twice, time on second run to get compute time
g(200000, 5000)
@time g(200000, 5000)

### output on my machine
# 2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917
The @time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
@time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.
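As a rough illustration of the multithreading route in current Julia (a sketch only; it assumes Julia was started with several threads, e.g. julia -t 4, and mean_estimate_threads is a made-up name):

using Statistics

function mean_estimate_threads(iter::Int, p::Int)
    vec_mean = zeros(iter)
    Threads.@threads for i in 1:iter
        vec_mean[i] = mean(rand(p))  # each thread writes to its own indices
    end
    return mean(vec_mean)
end

mean_estimate_threads(200_000, 5_000)        # warm-up run (compilation)
@time mean_estimate_threads(200_000, 5_000)  # timed run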

OpenMP Fortran Particle Method Speed Decrease

In trying to optimise some code I find that using OpenMP linearly increases the time it takes to run. The representative section of code that I am trying to speed up is as follows:
CALL system_clock(count_rate=cr)
CALL system_clock(count_max=cm)
rate = REAL(cr)
CALL SYSTEM_CLOCK(c1)

DO k = 1, ntotal
    CALL OMP_INIT_LOCK(locks(k))
END DO

!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,k)
DO k = 1, niac
    i = pair_i(k)
    j = pair_j(k)
    dvx(:,k) = vx(:,i) - vx(:,j)

    CALL omp_set_lock(locks(i))
    CALL DGER(dim, dim, -1.d0, (disp_nmh(:,j)-disp_nmh(:,i)), 1, &
              (dwdx_nor(dim+1:2*dim,k)*V_0(j)), 1, particle_data(i)%def_grad, dim)
    CALL DGER(dim, dim, -1.d0, (-dvx(:,k)), 1, &
              (dwdx_nor(dim+1:2*dim,k)*V_0(j)), 1, particle_data(i)%vel_grad(1:dim,1:dim), dim)
    CALL omp_unset_lock(locks(i))

    CALL omp_set_lock(locks(j))
    CALL DGER(dim, dim, -1.d0, (dvx(:,k)), 1, &
              (dwdx_nor(3*dim+1:4*dim,k)*V_0(i)), 1, particle_data(j)%vel_grad(1:dim,1:dim), dim)
    CALL DGER(dim, dim, -1.d0, (disp_nmh(:,i)-disp_nmh(:,j)), 1, &
              (dwdx_nor(3*dim+1:4*dim,k)*V_0(i)), 1, particle_data(j)%def_grad, dim)
    CALL omp_unset_lock(locks(j))
END DO
!$OMP END PARALLEL DO

CALL SYSTEM_CLOCK(c2)
t_el = t_el + (c2-c1)/rate
WRITE(*,*) "Wall time elapsed: ", t_el
Note that for the simulation I am testing, k=14000, which I thought was a reasonable candidate for running in parallel. As far as I know, I have to use the locks to ensure that threads given the same value of "i" (but a different value of "j") cannot write to the same index of the arrays at the same time. I cannot figure out whether the version of BLAS I use (sudo apt-get install libblas-dev liblapack-dev) is thread-safe. I ran a simulation with 8 cores and got the same result as without OpenMP, so I am guessing it could be. BLAS is used, in this case, to calculate and sum the outer products of many 3x3 matrices.
Is the implementation of OpenMP above the best way to speed up this code? I know very little about OpenMP but my guesses are that:
the memory being all over the place ("i" is sequential but "j" is not)
the overhead in starting and closing down all the threads
the constant locking and unlocking
and maybe the small loop size (although I thought 14000 would be sufficient)
are significantly outweighing the performance benefits. Is this correct? Or can the code above be modified to get some performance gain?
EDIT
I should probably add that the code above is part of a time integration loop. Hopefully this explains why the elapsed time is summed.

pmap slow for toy example

I'm testing out parallelism in Julia to see if there's a speedup on my machine (I'm picking a language to implement new algorithms with). I didn't want to dedicate a ton of time writing a huge example so I did the following test on release version Julia 0.4.5 (Mac OS X and dual core):
$ julia -p2
julia> @everywhere f(x) = x^2 + 10
julia> @time map(f, 1:10000000)
julia> @time pmap(f, 1:10000000)
pmap is significantly slower than map (>20x) and allocates well over 10x the memory. What am I doing wrong?
Thanks.
That's because pmap is intended for heavy computations per core, not many simple ones. If your function is as simple as this, the overhead of moving information between processors is bigger than the benefit. Instead, test this code (I ran it with 4 cores on an i7):
@everywhere function fast(x::Float64)
    return x^2 + 1.0
end

@everywhere function slow(x::Float64)
    a = 1.0
    for i in 1:1000
        for j in 1:5000
            a += asinh(i + j)
        end
    end
    return a
end

info("Precompilation")
map(fast, linspace(1,1000,1000))
pmap(fast, linspace(1,1000,1000))
map(slow, linspace(1,1000,10))
pmap(slow, linspace(1,1000,10))

info("Testing slow function")
@time map(slow, linspace(1,1000,10))   # 3.69 s
@time pmap(slow, linspace(1,1000,10))  # 0.003 s

info("Testing fast function")
@time map(fast, linspace(1,1000,1000))   # 52 μs
@time pmap(fast, linspace(1,1000,1000))  # 775 s
For parallelization of a lot of very small iterations, you can use @parallel; search for it in the documentation.
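For example, a minimal sketch of the reduction form of @parallel from that era (in Julia 1.x the equivalent macro is Distributed.@distributed):

# sum many tiny per-iteration results across all workers
s = @parallel (+) for i in 1:10_000_000
    rand()
end
println(s / 10_000_000)  # should be close to 0.5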

IPython parallel and map performance

I have used parallel computing before through MPI (and Fortran :)). I would now like to use the parallel capabilities of IPython.
My question is related to the poor performance of the following code, inspired by http://ipython.org/ipython-doc/dev/parallel/asyncresult.html:
from IPython.parallel import Client
import numpy as np
_procs = Client()
print 'engines #', len(_procs)
dv = _procs.direct_view()
X = np.linspace(0,100)
add = lambda a,b: a+b
sq = lambda x: x*x
%timeit reduce(add, map(sq, X))
%timeit reduce(add, dv.map(sq, X))
The results for one processor are:
10000 loops, best of 3: 43 µs per loop
100 loops, best of 3: 4.77 ms per loop
Could you tell me if the results seem normal to you and, if so, why there is such a huge difference in computational time?
Best regards,
Flavien.
Parallel processing doesn't come for free. There is a cost, called overhead, associated with sending job items to the clients and receiving the results afterwards. Your original work takes 43 µs and that is just too short. You need much larger work items before parallel processing becomes beneficial. A simple rule of thumb is that processing its work items should take each worker at least about 10 times the overhead. Try a vector of 1 million elements or even larger instead.
