Parallel computing in Julia - running a simple for-loop on multiple cores

For starters, I have to say I'm completely new to parallel computing (and know close to nothing about computer science), so my understanding of what things like "workers" or "processes" actually are is very limited. I do, however, have a question about running, in parallel, a simple for-loop that presumably has no dependencies between iterations.
Let's say I wanted to do the following:
for N in 1:5:20
    println("The N of this iteration is $N")
end
If I simply wanted these messages to appear on screen and the order of appearance didn't matter, how could one achieve this in Julia 0.6, and for future reference in Julia 0.7 (and therefore 1.0)?

Just to add an example to Chris's answer: since the release of Julia 1.3, you can do this easily with Threads.@threads.
Threads.@threads for N in 1:5:20
    println("The number of this iteration is $N")
end
Here you are running only one Julia session with multiple threads, instead of using Distributed, where you run multiple Julia sessions.
See, e.g., the multithreading blog post for more information.

Distributed Processing
Start julia with e.g. julia -p 4 if you want to use 4 CPUs (or call addprocs(4) from within Julia). In Julia 1.x, you write a parallel loop as follows:
using Distributed
@distributed for N in 1:5:20
    println("The N of this iteration is $N")
end
Note that every process has its own variables by default.
For any serious work, have a look at the manual https://docs.julialang.org/en/v1.4/manual/parallel-computing/, in particular the section about SharedArrays.
Other options for distributed computing are the function pmap or the package MPI.jl.
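For the print loop from the question, a minimal pmap sketch might look like the following (assuming Julia was started with worker processes, e.g. julia -p 4 or addprocs(4); the anonymous function is just for illustration):
using Distributed

# pmap hands each value of N to an available worker; output from the workers
# is forwarded back to the master's stdout.
pmap(N -> println("The N of this iteration is $N (worker $(myid()))"), 1:5:20)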
Threads
Since Julia 1.3, you can also use Threads as noted by wueli.
Start julia with e.g. julia -t 4 to use 4 threads (the -t flag is available since Julia 1.5). Alternatively, you can set the environment variable JULIA_NUM_THREADS before starting julia.
For example, on Linux/macOS:
export JULIA_NUM_THREADS=4
On Windows, you can use set JULIA_NUM_THREADS=4 in the cmd prompt.
Then in julia:
Threads.@threads for N = 1:20
    println("N = $N (thread $(Threads.threadid()) out of $(Threads.nthreads()))")
end
All CPUs are assumed to have access to shared memory in the examples above (i.e. "OpenMP style" parallelism), which is the common case for multi-core CPUs.
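If the loop body produces values rather than just printing, a common shared-memory pattern is to let each iteration write to its own slot of a preallocated array, so no two threads touch the same element. A minimal sketch (the function name and sizes are purely illustrative):
using Base.Threads

function threaded_squares(n)
    out = zeros(Int, n)
    @threads for i in 1:n
        out[i] = i^2        # each iteration owns its own slot, so no data race
    end
    return sum(out)
end

threaded_squares(20)  # == 2870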

Related

Parallel computing and Julia

I have some performance problems with parallel computing in Julia. I am new to both Julia and parallel computing.
In order to learn, I parallelized code that should benefit from parallelization, but it does not.
The program estimates the mean of the means of the components of arrays whose elements are drawn randomly from a uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
    iter = 100000*2
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
@everywhere function mean_estimate(N::Int)
    iter = 100000
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in the definition of iter in the serial code is because I tried the code on a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me@pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me@pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, there is a large difference in the total execution times.
Questions
Why is the parallelized version slower? (what I am doing wrong?)
Which is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured times as @HighPerformanceMark suggested. The tic()/toc() times are as follows. The iteration count is 2E6 in every case.
Array size    Single thread    Parallel    Ratio
5 000              2.69           2.89     1.07
100 000          488.77         346.00     0.71
1 000 000       4776.58        4438.09     0.93
I am puzzled about why there is no clear trend with array size.
You should pay close attention to the suggestions in the comments.
As @ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, make sure your functions are type-stable. Consider annotating the return type of the pmap and/or vcat call, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler; you can use the pmap source code as a springboard (see code here).
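As a rough illustration of the "roll your own scheduler" idea, one could launch one remote call per worker and average the fetched results. my_parallel_mean below is a hypothetical helper, and it assumes mean_estimate has already been defined with @everywhere as in the question:
using Distributed

function my_parallel_mean()
    # one remote call per worker process
    futures = [remotecall(mean_estimate, w, 0) for w in workers()]
    # fetch into a concretely typed vector so downstream code sees Float64
    results = Float64[fetch(f) for f in futures]
    return sum(results) / length(results)
end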
Furthermore, as @AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally the performance of a function f() is measured in Julia by running it twice and timing only the second run. On the first run, the JIT compiler compiles f() before running it, while the second run uses the already-compiled function. Compilation incurs an (unwanted) performance cost, so timing the second run avoids measuring the compilation.
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))
function g(iter::Int, p::Int)
    vec_mean = zeros(iter)
    for i in eachindex(vec_mean)
        vec_mean[i] = f(p)
    end
    return mean(vec_mean)
end
# run twice, time on second run to get compute time
g(200000, 5000)
@time g(200000, 5000)
### output on my machine
#  2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
#  0.4999951853035917
The @time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
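One way to reduce those allocations is to reuse a single buffer for the random numbers instead of allocating a fresh rand(p) every iteration. A rough sketch in current Julia syntax (g_inplace is a hypothetical name, not from the original code):
using Random

function g_inplace(iter::Int, p::Int)
    buf = Vector{Float64}(undef, p)
    total = 0.0
    for i in 1:iter
        rand!(buf)             # overwrite the same buffer each iteration
        total += sum(buf) / p  # mean of this sample without extra allocation
    end
    return total / iter
end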
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
@time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.
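For instance, a minimal SharedArrays sketch of the same computation in Julia 1.x syntax (not the original code; the sizes are shortened for illustration) could look like this:
using Distributed, SharedArrays
addprocs(2)

iter, p = 10_000, 5_000
vec_mean = SharedArray{Float64}(iter)   # one block of memory visible to all local workers

@sync @distributed for i in 1:iter
    vec_mean[i] = sum(rand(p)) / p      # each iteration writes only its own slot
end

println(sum(vec_mean) / iter)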

Python Multiprocessing slightly slower than Multithreading on Windows

I have been experimenting with code that sends "parallel" commands to multiple serial COM ports.
My multi-threading code consists of:
global q
q = Queue()
devices = [0, 1, 2, 3]
for i in devices:
    q.put(i)
cpus = cpu_count()  # detect number of cores
logging.debug("Creating %d threads" % cpus)
for i in range(cpus):
    t = Thread(name='DeviceThread_'+str(i), target=testFunc1)
    t.daemon = True
    t.start()
and multi-processing code consists of:
devices = [0, 1, 2, 3]
cpus = cpu_count()  # detect number of cores
pool = Pool(cpus)
results = pool.map(multi_run_wrapper, devices)
I observe that the task of sending serial commands to 4 COM ports in "parallel" takes about 6 seconds, and multiprocessing always takes 0.5 to 1 second of additional total run time.
Any input on why there is this discrepancy on a Windows machine?
Well, for one, you're not comparing apples to apples. If you want equivalent code, use multiprocessing.dummy.Pool in your threaded case (which is the same as multiprocessing.Pool but implemented in terms of threads, not processes), so you're at least using the same basic parallelization model with different internal implementations, not changing everything all at once.
Beyond that, launching the workers and communicating data to them has some overhead, more on Windows than on other systems since Windows can't fork to spawn new processes cheaply; it has to spawn a new Python instance and then copy over state via IPC to approximate forking.
Aside from that, you haven't provided enough information: your process- and thread-based worker functions aren't shown and could cause significant differences in behavior, nor have you provided information on how you're performing the timing. Similarly, if each worker process needs to reinitialize the COM port communication library, that could involve non-trivial overhead.

OpenMP Fortran Particle Method Speed Decrease

In trying to optimise some code, I find that using OpenMP linearly increases the time it takes to run. The representative section of code that I am trying to speed up is as follows:
CALL system_clock(count_rate=cr)
CALL system_clock(count_max=cm)
rate = REAL(cr)
CALL SYSTEM_CLOCK(c1)
DO k=1,ntotal
   CALL OMP_INIT_LOCK(locks(k))
END DO
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,k)
DO k=1,niac
   i = pair_i(k)
   j = pair_j(k)
   dvx(:,k) = vx(:,i)-vx(:,j)
   CALL omp_set_lock(locks(i))
   CALL DGER(dim,dim,-1.d0, (disp_nmh(:,j)-disp_nmh(:,i)),1, &
        (dwdx_nor(dim+1:2*dim,k)*V_0(j)),1, particle_data(i)%def_grad,dim)
   CALL DGER(dim,dim,-1.d0, (-dvx(:,k)),1, &
        (dwdx_nor(dim+1:2*dim,k)*V_0(j)) ,1, particle_data(i)%vel_grad(1:dim,1:dim),dim)
   CALL omp_unset_lock(locks(i))
   CALL omp_set_lock(locks(j))
   CALL DGER(dim,dim,-1.d0, (dvx(:,k)),1, &
        (dwdx_nor(3*dim+1:4*dim,k)*V_0(i)) ,1, particle_data(j)%vel_grad(1:dim,1:dim),dim)
   CALL DGER(dim,dim,-1.d0, (disp_nmh(:,i)-disp_nmh(:,j)),1, &
        (dwdx_nor(3*dim+1:4*dim,k)*V_0(i)),1, particle_data(j)%def_grad,dim)
   CALL omp_unset_lock(locks(j))
END DO
!$OMP END PARALLEL DO
CALL SYSTEM_CLOCK(c2)
t_el = t_el + (c2-c1)/rate
WRITE(*,*) "Wall time elapsed: ", t_el
Note that for the simulation I am testing, k=14000, which I thought was a reasonable candidate for running in parallel. So far as I know, I have to use the locks to ensure that threads which are given the same value of "i" (but a different value of "j") cannot write to the same index of the arrays at the same time. I cannot figure out whether the version of BLAS I use (sudo apt-get install libblas-dev liblapack-dev) is thread safe. I ran a simulation with 8 cores and got the same result as without OpenMP, so I am guessing that it could be. BLAS is used, in this case, to calculate and sum the outer products of many 3x3 matrices.
Is the implementation of OpenMP above the best way to speed up this code? I know very little about OpenMP but my guesses are that:
the memory being all over the place ("i" is sequential but "j" is not)
the overhead in starting and closing down all the threads
the constant locking and unlocking
and maybe the small loop size (although I thought 14000 would be sufficient)
are significantly outweighing the performance benefits. Is this correct? Or can the code above be modified to get some performance gain?
EDIT
I should probably add that the code above is part of a time integration loop. Hopefully this explains why the elapsed time is summed.

Obtain the number of CPU cores in Julia

I want to obtain the number of cores available in Julia. Currently I am doing the following:
using PyCall
@pyimport psutil
nCores = psutil.cpu_count()
This calls a Python function. I would like, however, to use some Julia procedure. How can it be done?
Sys.CPU_CORES is not defined in Julia v1.1.0. However, the following does the job.
length(Sys.cpu_info())
I'm not 100% certain about this, but CPU_CORES returns the number of (hyper-threading) cores on my machine (OS X 10.9.5 and Julia 0.3.5), even when I start Julia in serial mode. I've been checking the number of available cores using nworkers() and nprocs(). Starting Julia without the -p flag, both return 1.
When I start julia as julia -p 4
julia> nprocs()
5
julia> nworkers()
4
In both cases CPU_CORES returns 8.
In recent versions of Julia, you can use Sys.CPU_CORES (and not Base.CPU_CORES as some answers mentioned). Tested on 0.6.
According to the documentation, the "number of cores available" can be limited by the JULIA_NUM_THREADS environment variable.
To see the number of threads available to Julia, use
Threads.nthreads()
Sys.CPU_CORES is undefined in Julia 1.0.0 (at least when running on a MacBook, though I don't imagine that would make a difference). Instead, use Sys.CPU_THREADS.
I don't know Julia, but psutil.cpu_count(logical=False) in Python gives you the number of physical CPUs (hyper-threaded ones are not counted).
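If you specifically want the physical core count in Julia (the rough equivalent of psutil.cpu_count(logical=False)), the Hwloc.jl package is one option; it is an extra dependency and the exact function name may vary between package versions, so treat this as a sketch:
using Hwloc

physical = Hwloc.num_physical_cores()   # physical cores, hyper-threads excluded
logical  = Sys.CPU_THREADS              # logical CPUs reported by Julia itself
println("physical cores: $physical, logical CPUs: $logical")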

MPI: shared variable value for all processors

Here's a question about MPI. I need two processes that keep modifying one variable, and I want both processes to have access to the variable with the most up-to-date value.
from mpi4py import MPI
from time import sleep
comm = MPI.COMM_WORLD
rank = comm.rank
assert comm.size == 2
msg = 0
sec = 10
if comm.rank == 0:
    for i in range(sec):
        print msg
        sleep(1)
        msg = comm.bcast(msg, root=1)
else:
    for i in range(sec*2):
        msg += 1
        sleep(0.5)
        comm.bcast(msg, root=1)
So I'm expecting the program to print something like: 0 2 4 ...
But the program turns out to print: 0 1 2 3 4 5 6 7 8 9
I'm curious if there's a mechanism in mpi4py such that the variable msg is shared by both processes? That is, whenever msg is modified by process 1, the new value becomes immediately available to process 0. In other words, I want process 0 to access the most up-to-date value of msg instead of waiting for every change made to msg by process 1.
I think you're getting confused about how distributed memory programming works. In MPI, each process (or rank) has its own memory, and therefore when it changes values via load/store operations (like what you're doing with msg += 1), it will not affect the value of the variable on another process. The only way to update remote values is by sending messages, which you are doing with the comm.bcast() call. This sends the local value of msg from rank 1 to all other ranks. Until this point, there's no way for rank 0 to know what's been happening on rank 1.
If you want to have shared values between processes, then you probably need to take a look at something else, perhaps threads. You'll lose the distributed abilities of MPI if you switch to OpenMP, but that might not be what you needed MPI for in the first place. There are ways of doing this with distributed memory models (such as PGAS languages like Unified Parallel C, Global Arrays, etc.), but you will always run into the issue of latency which means that there will be some time that the values on ranks 0 and 1 are not synchronized unless you have some sort of protection to enforce it.
As mentioned by Wesley Bland, this isn't really possible in a pure distributed memory environment, as memory isn't shared.
However, MPI has for some time (since 1997) allowed something like this in the MPI-2 standard, as one-sided communications; these were updated significantly in MPI-3 (2012). This approach can have real advantages, but one has to be a little careful: since memory isn't really shared, every update requires expensive communication, and it's easy to accidentally introduce significant scalability/performance bottlenecks into your code by over-relying on shared state.
The Using MPI-2 book has an example of implementing a counter using MPI-2 one-sided communications; a simple version of that counter is described and implemented in this answer in C. In the mpi4py distribution, under 'demos', there are implementations of these same counters in the 'nxtval' demo: the same simple counter as nxtval-onesided.py, and a more complicated but more scalable implementation, also described in the Using MPI-2 book, as nxtval-scalable.py. You should be able to use either of those implementations more or less as-is in the above code.
