Julia Distributed slow down to half the single core performance when adding process - parallel-processing

I've got a function func that may cost ~50s when running on a single core. Now I want to run it on a server which has got 192-core CPUs for many times. But when I add worker processes to say, 180, the performance of each core slows down. The worst CPU takes ~100s to calculate func.
Can someone help me, please?
Here is the pseudo code
using Distributed
addprocs(180)
#everywhere include("func.jl") # defines func in every process
First try using only 10 workers
#sync #distributed for i in 1:10
func()
end
#sync #distributed for i in 1:10
#time func()
end
From worker #: 43.537886 seconds (243.58 M allocations: 30.004 GiB, 8.16% gc time)
From worker #: 44.242588 seconds (247.59 M allocations: 30.531 GiB, 7.90% gc time)
From worker #: 44.571170 seconds (246.26 M allocations: 30.338 GiB, 8.81% gc time)
...
From worker #: 45.259822 seconds (252.19 M allocations: 31.108 GiB, 8.25% gc time)
From worker #: 46.746692 seconds (246.36 M allocations: 30.346 GiB, 11.21% gc time)
From worker #: 47.451914 seconds (248.94 M allocations: 30.692 GiB, 8.96% gc time)
Seems not bad when using 10 workers
Now we use 180 workers
#sync #distributed for i in 1:180
func()
end
#sync #distributed for i in 1:180
#time func()
end
From worker #: 55.752026 seconds (245.20 M allocations: 30.207 GiB, 9.33% gc time)
From worker #: 57.031739 seconds (245.00 M allocations: 30.176 GiB, 7.70% gc time)
From worker #: 57.552505 seconds (247.76 M allocations: 30.543 GiB, 7.34% gc time)
...
From worker #: 96.850839 seconds (247.33 M allocations: 30.470 GiB, 7.95% gc time)
From worker #: 97.468060 seconds (250.04 M allocations: 30.827 GiB, 6.96% gc time)
From worker #: 98.078816 seconds (250.55 M allocations: 30.883 GiB, 10.87% gc time)
The time increases almost linearly from 55s to 100s.
I've checked by top command that CPU usage may not the bottleneck ("id" keeps >2%). The RAM usage, too (used ~20%).
Other version information:
Julia Version 1.5.3
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Platinum 9242 CPU # 2.30GHz
Update:
substitute func to minimal example (simple for loop) does not change the slowdown.
reducing the process number to 192/2 alleviates slowdown
The new pseudo code is
addprocs(96)
#everywhere function ss()
sum=0
for i in 1:1000000000
sum+=sin(i)
end
end
#sync #distributed for i in 1:10
ss()
end
#sync #distributed for i in 1:10
#time ss()
end
From worker #: 32.8 seconds ..(8 others).. 34.0 seconds
...
#sync #distributed for i in 1:96
#time ss()
end
From worker #: 38.1 seconds ..(94 others).. 45.4 seconds

You are measuring the time it takes each worker to perform func() and observe performance decrease for a single process when going from 10 processes to 180 parallel processes.
This looks quite normal to me:
Intel cores use hyper-threading so you actually have 96 cores (in more detail - a hyper-threaded core adds only 20-30% performance). It means that 168 of your processes need to share 84 hyper-threaded cores and 12 processes get full 12 cores.
The CPU speed is determined by throttle temperature (https://en.wikipedia.org/wiki/Thermal_design_power) and of course there is so much more space when running 10 processes vs 180 processes
Your tasks are obviously competing for memory. They make a total of over 5TB of memory allocations and you machine has much less than that. The last mile in garbage collecting is always the most expensive one - so if your garbage collectors are squeezed and competing for memory the performance is uneven with surprisingly longer garbage collection times.
Looking at this data I would recommend you to try:
addprocs(192 ÷ 2)
and see how the performance is then going to change.

Related

Parallel computing and Julia

I have some performance problems with parallel computing in Julia. I am new in both, Julia and parallel calculations.
In order to learn, I parallelized a code that should benefits from parallelization, but it does not.
The program estimates the mean of the mean of the components of arrays whose elements were chosen randomly with an uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
iter = 100000*2
p = 5000
vec_mean = zeros(iter)
for i = 1:iter
vec_mean[i] = mean( rand(p) )
end
return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
#everywhere function mean_estimate(N::Int)
iter = 100000
p = 5000
vec_mean = zeros(iter)
for i = 1:iter
vec_mean[i] = mean( rand(p) )
end
return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in the fourth line of the serial code is because I tried the code in a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me#pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me#pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, that there is large difference in the total execution time.
Questions
Why is the parallelized version slower? (what I am doing wrong?)
Which is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured times as #HighPerformanceMark suggested. The tic()/toc() times are the following. The iteration number is 2E6 for every case.
Array Size Single thread Parallel Ratio
5000 2.69 2.89 1.07
100 000 488.77 346.00 0.71
1000 000 4776.58 4438.09 0.93
I am puzzled about why there is not clear trend with array size.
You should pay prime attention to suggestions in the comments.
As #ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, then make sure that your functions are type-stable. Consider annotating the return type of the function pmap and/or vcat, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler. You can use pmap source code as a springboard (see code here).
Furthermore, as #AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally performance of a function f() is measured in Julia by running it twice and measuring only the second run. In the first run, the JIT compiler compiles f() before running it, while the second run uses the compiled function. Compilation incurs a (unwanted) performance cost, so timing the second run avoid measuring the compilation.
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))
function g(iter::Int, p::Int)
vec_mean = zeros(iter)
for i in eachindex(vec_mean)
vec_mean[i] = f(p)
end
return mean(vec_mean)
end
# run twice, time on second run to get compute time
g(200000, 5000)
#time g(200000, 5000)
### output on my machine
# 2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917
The #time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
#time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.

Reading Go gctrace output

I have gctrace output that looks like this:
gc 6 #48.155s 15%: 0.093+12360+0.32 ms clock, 0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P
I am not sure how to read the CPU times in particular. I understand that it is broken down into three phases (STW sweep termination, concurrent mark/scan, and STW mark termination), but I'm not sure what the + signs mean (i.e. 0.18+7720 and 3615+0.65). What do these + signs signify?
In your case, they look like assist and termination times;
// CPU time
0.18 : **STW** Sweep termination.
7720ms : Mark/Scan - Assist Time (GC performed in line with allocation).
21356ms : Mark/Scan - Background GC time.
3615ms : Mark/Scan - Idle GC time.
0.65ms : **STW** Mark termination.
I think it changes (or it may) over various Go versions and you can find more detailed info at runtime package docs.
Currently, it is:
gc # ##s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P
where the fields are as follows:
gc # the GC number, incremented at each GC
##s time in seconds since program start
#% percentage of time spent in GC since program start
#+...+# wall-clock/CPU times for the phases of the GC
#->#-># MB heap size at GC start, at GC end, and live heap
# MB goal goal heap size
# P number of processors used
Example here
See also Interpreting GC trace output
gc 6 #48.155s 15%: 0.093+12360+0.32 ms clock,
0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P
gc 6
#48.155s since program start
15%: of time spent in GC since program start
0.093+12360+0.32 ms clock stop-the-world (STW) sweep termination + concurrent
mark and scan + and STW mark termination
0.18+7720/21356/3615+0.65 ms cpu (GC performed in
line with allocation), background GC time, and idle GC time
11039->13278->6876 MB heap size at GC start, at GC end, and live heap
8 P number of processors used

pmap slow for toy example

I'm testing out parallelism in Julia to see if there's a speedup on my machine (I'm picking a language to implement new algorithms with). I didn't want to dedicate a ton of time writing a huge example so I did the following test on release version Julia 0.4.5 (Mac OS X and dual core):
$ julia -p2
julia> #everywhere f(x) = x^2 + 10
julia> #time map(f, 1:10000000)
julia> #time pmap(f, 1:10000000)
pmap is significantly slower than map (>20x) and allocates well over 10x the memory. What am I doing wrong?
Thanks.
That's because pmap is intended to do heavy computations per core, not many simple ones. If you use for something simple as your function, the overhead of moving along the information along proccesors is bigger that the benefits. Instead, test this code (I run it with 4 cores in a i7):
function fast(x::Float64)
return x^2+1.0
end
function slow(x::Float64)
a = 1.0
for i in 1:1000
for j in 1:5000
a+=asinh(i+j)
end
end
return a
end
info("Precompilation")
map(fast,linspace(1,1000,1000))
pmap(fast,linspace(1,1000,1000))
map(slow,linspace(1,1000,10))
pmap(slow,linspace(1,1000,10))
info("Testing slow function")
#time map(slow,linspace(1,1000,10)) #3.69 s
#time pmap(slow,linspace(1,1000,10)) #0.003 s
info("Testing fast function")
#time map(fast,linspace(1,1000,1000)) #52 μs
#time pmap(fast,linspace(1,1000,1000)) #775 s
For parallelization of a lot of very small iterations, you can use #parallel, search for it in the documentation.

How can i calculate for the estimated completion time of both process

A certain computer system runs in a multi-programming environment using a non-preemptive
algorithm. In this system, two processes A and B are stored in the process queue,
and A has a higher priority than B. The table below shows estimated execution time for each
process; for example, process A uses CPU, I/O, and then CPU sequentially for 30, 60, and 30
milliseconds respectively. Which of the following is the estimated time in milliseconds
to complete both A and B? Here, the multi-processing overhead of OS is negligibly
small. In addition, both CPU and I/O operations can be executed concurrently, but I/O
operations for A and B cannot be performed in parallel.
UNIT : millisecond
CPU I/O CPU
A_______________30___________________60_________________30
B_______________45___________________45__________________--
Please help me.. i need to explain this in front of the class tomorrow but i cant seem get the idea of it...
A has the highest priority, but since the system is non-preemptive, this is only a tiebreaker when both processes need a resource at the same time.
At t=0, A gets the CPU for 30 ms, B waits as it needs the CPU.
At t=30, A releases the CPU, B gets the CPU for 45 ms, while A gets the I/O for 60 ms.
At t=75, the CPU sits idle as B is waiting for A to finish I/O, and A is not ready to use the CPU.
At t=90, A releases I/O and gets the CPU for another 30 ms, while B gets the I/O for 45 ms.
At t=120, A releases the CPU and is finished.
At t=135, B releases I/O and is finished.
It takes the longest path:
Non-preemptive multitasking or cooperative multitasking means that the process is kind of sharing a.e. the CPU time. In the worst case they use the worst time to achieve theire task.
CPU:
B = 45 is longer than A=30
45 +
I/O
A = 60 and B = 45
45 + 60
CPU again:
A = 30
45 + 60 + 30 = 135
i will explain in brief and please elaborate for your classroom discussion:
For your answer :135
when Process A waits for the I/O task,the CPU time will be given to Process B. so the complete time for process A and B would be
Process A (CPU )+ Process A I/O and Process B CPU + Process B I/O
30+60+45 = 135 ms

Matlab matrix multiplication speed

I was wondering how can matlab multiply two matrices so fast. When multiplying two NxN matrices, N^3 multiplications are performed. Even with the Strassen Algorithm it takes N^2.8 multiplications, which is still a large number. I was running the following test program:
a = rand(2160);
b = rand(2160);
tic;a*b;toc
2160 was used because 2160^3=~10^10 ( a*b should be about 10^10 multiplications)
I got:
Elapsed time is 1.164289 seconds.
(I'm running on 2.4Ghz notebook and no threading occurs)
which mean my computer made ~10^10 operation in a little more than 1 second.
How this could be??
It's a combination of several things:
Matlab does indeed multi-thread.
The core is heavily optimized with vector instructions.
Here's the numbers on my machine: Core i7 920 # 3.5 GHz (4 cores)
>> a = rand(10000);
>> b = rand(10000);
>> tic;a*b;toc
Elapsed time is 52.624931 seconds.
Task Manager shows 4 cores of CPU usage.
Now for some math:
Number of multiplies = 10000^3 = 1,000,000,000,000 = 10^12
Max multiplies in 53 secs =
(3.5 GHz) * (4 cores) * (2 mul/cycle via SSE) * (52.6 secs) = 1.47 * 10^12
So Matlab is achieving about 1 / 1.47 = 68% efficiency of the maximum possible CPU throughput.
I see nothing out of the ordinary.
To check whether you do or not use multi-threading in MATLAB use this command
maxNumCompThreads(n)
This sets the number of cores to use to n. Now I have a Core i7-2620M, which has a maximum frequency of 2.7GHz, but it also has a turbo mode with 3.4GHz. The CPU has two cores. Let's see:
A = rand(5000);
B = rand(5000);
maxNumCompThreads(1);
tic; C=A*B; toc
Elapsed time is 10.167093 seconds.
maxNumCompThreads(2);
tic; C=A*B; toc
Elapsed time is 5.864663 seconds.
So there is multi-threading.
Let's look at the single CPU results. A*B executes approximately 5000^3 multiplications and additions. So the performance of single-threaded code is
5000^3*2/10.8 = 23 GFLOP/s
Now the CPU. 3.4 GHz, and Sandy Bridge can do maximum 8 FLOPs per cycle with AVX:
3.4 [Ginstructions/second] * 8 [FLOPs/instruction] = 27.2 GFLOP/s peak performance
So single core performance is around 85% peak, which is to be expected for this problem.
You really need to look deeply into the capabilities of your CPU to get accurate performannce estimates.

Resources