I have some performance problems with parallel computing in Julia. I am new to both Julia and parallel computing.
In order to learn, I parallelized code that should benefit from parallelization, but it does not.
The program estimates the mean of the means of arrays whose elements are drawn from a uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
    iter = 100000*2
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
@everywhere function mean_estimate(N::Int)
    iter = 100000
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in iter = 100000*2 in the serial code is there because I tried the code on a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me@pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me@pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, there is a large difference in the total execution times.
Questions
Why is the parallelized version slower? (What am I doing wrong?)
What is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured times as @HighPerformanceMark suggested. The tic()/toc() times are the following. The iteration number is 2E6 for every case.
Array size    Single thread    Parallel    Ratio
5000                  2.69        2.89      1.07
100000              488.77      346.00      0.71
1000000            4776.58     4438.09      0.93
I am puzzled about why there is no clear trend with array size.
You should pay close attention to the suggestions in the comments.
As @ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, make sure that your functions are type-stable. Consider annotating the return type of your pmap and/or vcat call, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler. You can use the pmap source code as a springboard (see code here).
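A minimal sketch of that idea (estimate_all is an illustrative name, not from your code), assuming the same mean_estimate as above:
# pmap returns a Vector{Any}; converting it before reducing keeps the result type-stable
estimate_all(pids::Vector{Int}) = mean(convert(Vector{Float64}, pmap(mean_estimate, pids)))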
Furthermore, as @AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally the performance of a function f() is measured in Julia by running it twice and timing only the second run. On the first run, the JIT compiler compiles f() before running it, while the second run uses the already-compiled function. Compilation incurs an (unwanted) performance cost, so timing the second run avoids measuring the compilation.
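For example, a warm-up call followed by a timed call (a small sketch against your serial code above):
mean_estimate(0)        # first call triggers JIT compilation
@time mean_estimate(0)  # second call measures only the compute time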
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))
function g(iter::Int, p::Int)
    vec_mean = zeros(iter)
    for i in eachindex(vec_mean)
        vec_mean[i] = f(p)
    end
    return mean(vec_mean)
end

# run twice, time on second run to get compute time
g(200000, 5000)
@time g(200000, 5000)
### output on my machine
# 2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917
The @time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
@time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.
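For reference, a rough sketch of the SharedArray route (untested, and written against the Julia 0.4/0.5-era API that the rest of this post uses):
addprocs(CPU_CORES - 1)

iter = 200000
p = 5000
vec_mean = SharedArray(Float64, iter)   # one array visible to all workers

@sync @parallel for i = 1:iter          # iterations are split across the workers
    vec_mean[i] = mean(rand(p))
end

println("The mean is: ", mean(vec_mean))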
I am trying to understand the following slide
The definition is kind of unclear to me. Sources like Wikipedia say that Amdahl's law measures the speedup in latency of the execution of a task, at a fixed workload, that can be expected of a system whose resources are improved. To me, speedup is basically how much faster one task runs compared to another. Speedup is used in a different way here. Can you clarify, in an easier way, what Amdahl's law measures and what speedup really is?
The definition of speedup here is:
Speedup = Baseline Running Time / New Running Time
This means that if the running time is BRT and the parallelizable portion is P, then:
BRT = (1 - P) * BRT + P * BRT
Now if a speedup of S was obtained on the P portion of the running time, then the new improved running time (IRT) is:
IRT = (1 - P) * BRT + P * (BRT / S)
= (1 - P) * BRT + (P / S) * BRT
= ((1 - P) + (P / S)) * BRT
Therefore:
BRT / IRT = 1 / ((1 - P) + (P / S))
This is the overall speedup. This is Amdahl's law.
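As a quick numeric check, here is a small sketch (in Julia, just to make the formula concrete):
# overall speedup from Amdahl's law: 1 / ((1 - P) + P / S)
amdahl(P, S) = 1 / ((1 - P) + P / S)

amdahl(0.9, 10)    # ≈ 5.26: a 10x speedup on 90% of the work yields only ~5.3x overall
amdahl(0.9, Inf)   # = 10.0: the 10% serial fraction caps the overall speedup at 10x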
To me, speedup is basically how much faster one task runs compared to another.
Yes, speedup can be defined in different ways. This can be a little confusing.
Amdahl's Law measures the theoretical maximum speedup, which is almost never achieved. The formula is easy to understand once you know what the different parts mean.
Okay, so the formula is Speedup = 1 / ((1 - f) + f / p), where
1 means the whole program,
1 - f means the fraction of serial code (can't be parallelized),
f means the fraction of code that can be parallelized,
p means the number of processors.
So, if we say there are 10 processors and 40% of the code can be parallelized, then:
Speedup = 1 / ((1 - 0.4) + 0.4 / 10) = 1 / 0.64 ≈ 1.56
I'm not a professional and you might want to check this, but if I remember correctly this is how it should work :)
No system is truly parallel. A workflow might start in parallel, then execute serially, then go parallel again. In general, we have to take locks, coordinate threads, and synchronize code, so there will be serial portions within a parallel process. During these serial portions, the multiple threads/processes that are executing end up queueing. Amdahl's law tells how much the serial portion affects the performance (throughput) graph. As you can see in the image:
If it were a perfectly parallel system, the throughput curve would be linear. If there is a serial portion within a process, it does not matter whether it is 5 percent or 10 percent: the curve flattens out after a given point. Amdahl's law calculates how soon the curve is going to flatten out. Once it has flattened, adding more resources no longer increases throughput.
The formula on the slide is saying that the amount of speedup a program will see by using more parallel cores is based on how much of the program is serial.
Using the following simple benchmark in Racket 6.6:
#lang racket
(require data/gvector)
(define (run)
;; this should have to periodically resize in order to incorporate new data
;; and thus should be slower
(time (define v (make-gvector)) (for ((i (range 1000000))) (gvector-add! v i)) )
(collect-garbage 'major)
;; this should never have to resize and thus should be faster
;; ... but consistently benchmarks slower?!
(time (define v (make-gvector #:capacity 1000000)) (for ((i (range 1000000))) (gvector-add! v i)) )
)
(run)
The version that properly reserves capacity does worse consistently. Why? This is certainly not the result that I would expect, and is inconsistent with what you would see in C++ (std::vector) or Java (ArrayList). Am I somehow benchmarking incorrectly?
Example output:
cpu time: 232 real time: 230 gc time: 104
cpu time: 228 real time: 230 gc time: 120
One benchmarking comment: use in-range instead of range in your microbenchmarks; otherwise you're including the cost of constructing a million-element list in your measurements.
I added some extra loops to your microbenchmark to make it do more work (and I fixed the range issue). Here are some of the results:
Using #:capacity for large capacities is slower.
== 5 iterations of 1e7 sized gvector, measured 3 times each way
with #:capacity
cpu time: 9174 real time: 9169 gc time: 4769
cpu time: 9109 real time: 9108 gc time: 4683
cpu time: 9094 real time: 9091 gc time: 4670
without
cpu time: 7917 real time: 7912 gc time: 3243
cpu time: 7703 real time: 7697 gc time: 3107
cpu time: 7732 real time: 7727 gc time: 3115
Using #:capacity for small capacities is faster.
== 20 iterations of 1e6 sized gvector, measured three times each way
with #:capacity
cpu time: 2167 real time: 2168 gc time: 408
cpu time: 2152 real time: 2152 gc time: 385
cpu time: 2112 real time: 2111 gc time: 373
without
cpu time: 2310 real time: 2308 gc time: 473
cpu time: 2316 real time: 2315 gc time: 480
cpu time: 2335 real time: 2334 gc time: 488
My hypothesis: it's GC overhead. When the backing vector is mutated, Racket's generational GC remembers the vector so it can scan it in the next minor collection. When the backing vector is very big, scanning the whole vector on every minor GC outweighs the cost of reallocation and copying. The overhead wouldn't occur with a GC with a finer remembered-set granularity (but... tradeoffs).
BTW, looking over the gvector code I found a couple opportunities for improvement. They don't change the big picture, though.
Increasing the vector size by a factor of 10, I get the following in DrRacket
(with all debugging turned off):
cpu time: 5245 real time: 5605 gc time: 3607
cpu time: 4851 real time: 5136 gc time: 3231
Note: If there is garbage left over from the first benchmark it can affect the next one. Therefore use collect-garbage (three times) before using time again.
Also... don't run benchmarks in DrRacket as I did - use the command line.
Converting a non-negative Integer to its list of digits is commonly done like this:
import Data.Char
digits :: Integer -> [Int]
digits = (map digitToInt) . show
I was trying to find a more direct way to perform the task, without involving a string conversion, but I'm unable to come up with something faster.
Things I've been trying so far:
The baseline:
digits :: Int -> [Int]
digits = (map digitToInt) . show
Got this one from another question on StackOverflow:
digits2 :: Int -> [Int]
digits2 = map (`mod` 10) . reverse . takeWhile (> 0) . iterate (`div` 10)
Trying to roll my own:
digits3 :: Int -> [Int]
digits3 = reverse . revDigits3
revDigits3 :: Int -> [Int]
revDigits3 n = case divMod n 10 of
    (0, digit)    -> [digit]
    (rest, digit) -> digit : revDigits3 rest
This one was inspired by showInt in Numeric:
digits4 n0 = go n0 [] where
    go n cs
      | n < 10    = n : cs
      | otherwise = go q (r : cs)
      where
        (q, r) = n `quotRem` 10
Now the benchmark. Note: I'm forcing the evaluation using filter.
λ>:set +s
λ>length $ filter (>5) $ concat $ map (digits) [1..1000000]
2400000
(1.58 secs, 771212628 bytes)
This is the reference. Now for digits2:
λ>length $ filter (>5) $ concat $ map (digits2) [1..1000000]
2400000
(5.47 secs, 1256170448 bytes)
That's 3.46 times longer.
λ>length $ filter (>5) $ concat $ map (digits3) [1..1000000]
2400000
(7.74 secs, 1365486528 bytes)
digits3 is 4.89 times slower. Just for fun, I tried using only revDigits3 to avoid the reverse.
λ>length $ filter (>5) $ concat $ map (revDigits3) [1..1000000]
2400000
(8.28 secs, 1277538760 bytes)
Strangely, this is even slower, 5.24 times slower.
And the last one:
λ>length $ filter (>5) $ concat $ map (digits4) [1..1000000]
2400000
(16.48 secs, 1779445968 bytes)
This is 10.43 times slower.
I was under the impression that using only arithmetic and cons would outperform anything involving a string conversion. Apparently, there's something I don't grasp.
So what's the trick? Why is digits so fast?
I'm using GHC 6.12.3.
Seeing as I can't add comments yet, I'll do a little bit more work and just analyze all of them. I'm putting the analysis at the top; however, the relevant data is below. (Note: all of this is done in 6.12.3 as well - no GHC 7 magic yet.)
Analysis:
Version 1: show is pretty good for ints, especially those as short as we have. Making strings actually tends to be decent in GHC; however reading to strings and writing large strings to files (or stdout, although you wouldn't want to do that) are where your code can absolutely crawl. I would suspect that a lot of the details behind why this is so fast are due to clever optimizations within show for Ints.
Version 2: This one was the slowest of the bunch when compiled. Some problems: reverse is strict in its argument. What this means is that you don't benefit from being able to perform computations on the first part of the list while you're computing the next elements; you have to compute them all, flip them, and then do your computations (namely (`mod` 10) ) on the elements of the list. While this may seem small, it can lead to greater memory usage (note the 5GB of heap memory allocated here as well) and slower computations. (Long story short: don't use reverse.)
Version 3: Remember how I just said don't use reverse? Turns out, if you take it out, this one drops to 1.79s total execution time - barely slower than the baseline. The only problem here is that as you go deeper into the number, you're building up the spine of the list in the wrong direction (essentially, you're consing "into" the list with recursion, as opposed to consing "onto" the list).
Version 4: This is a very clever implementation. You benefit from several nice things: for one, quotRem should use the Euclidean algorithm, which is logarithmic in its larger argument. (Maybe it's faster, but I don't believe there's anything that's more than a constant factor faster than Euclid.) Furthermore, you cons onto the list as discussed last time, so that you don't have to resolve any list thunks as you go - the list is already entirely constructed when you come back around to parse it. As you can see, the performance benefits from this.
This code was probably the slowest in GHCi because a lot of the optimizations performed with the -O3 flag in GHC deal with making lists faster, whereas GHCi wouldn't do any of that.
Lessons: cons the right way onto a list, watch for intermediate strictness that can slow down computations, and do some legwork in looking at the fine-grained statistics of your code's performance. Also compile with the -O3 flags: whenever you don't, all those people who put a lot of hours into making GHC super-fast get big ol' puppy eyes at you.
Data:
I just took all four functions, stuck them into one .hs file, and then changed as necessary to reflect the function in use. Also, I bumped your limit up to 5e6, because in some cases compiled code would run in less than half a second on 1e6, and this can start to cause granularity problems with the measurements we're making.
Compiler options: use ghc --make -O3 [filename].hs to have GHC do some optimization. We'll dump statistics to standard error using digits +RTS -sstderr.
Dumping to -sstderr gives us output that looks like this, in the case of digits1:
digits1 +RTS -sstderr
12000000
2,885,827,628 bytes allocated in the heap
446,080 bytes copied during GC
3,224 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 5504 collections, 0 parallel, 0.06s, 0.03s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.61s ( 1.66s elapsed)
GC time 0.06s ( 0.03s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.67s ( 1.69s elapsed)
%GC time 3.7% (1.5% elapsed)
Alloc rate 1,795,998,050 bytes per MUT second
Productivity 96.3% of total user, 95.2% of total elapsed
There are three key statistics here:
Total memory in use: only 1MB means this version is very space-efficient.
Total time: 1.61s means nothing now, but we'll see how it looks against the other implementations.
Productivity: This is just 100% minus garbage collecting; since we're at 96.3%, this means that we're not creating a lot of objects that we leave lying around in memory.
Alright, let's move on to version 2.
digits2 +RTS -sstderr
12000000
5,512,869,824 bytes allocated in the heap
1,312,416 bytes copied during GC
3,336 bytes maximum residency (1 sample(s))
13,048 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 10515 collections, 0 parallel, 0.06s, 0.04s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.20s ( 3.25s elapsed)
GC time 0.06s ( 0.04s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.26s ( 3.29s elapsed)
%GC time 1.9% (1.2% elapsed)
Alloc rate 1,723,838,984 bytes per MUT second
Productivity 98.1% of total user, 97.1% of total elapsed
Alright, so we're seeing an interesting pattern.
Same amount of memory used. This means that this is a pretty good implementation, although it could mean that we need to test on higher sample inputs to see if we can find a difference.
It takes twice as long. We'll come back to some speculation as to why this is later.
It's actually slightly more productive, but given that GC is not a huge portion of either program, this doesn't tell us anything significant.
Version 3:
digits3 +RTS -sstderr
12000000
3,231,154,752 bytes allocated in the heap
832,724 bytes copied during GC
3,292 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 6163 collections, 0 parallel, 0.02s, 0.02s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.09s ( 2.08s elapsed)
GC time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.11s ( 2.10s elapsed)
%GC time 0.7% (1.0% elapsed)
Alloc rate 1,545,701,615 bytes per MUT second
Productivity 99.3% of total user, 99.3% of total elapsed
Alright, so we're seeing some strange patterns.
We're still at 1MB total memory in use. So we haven't hit anything memory-inefficient, which is good.
We're not quite at digits1, but we've got digits2 beat pretty easily.
Very little GC. (Keep in mind that anything over 95% productivity is very good, so we're not really dealing with anything too significant here.)
And finally, version 4:
digits4 +RTS -sstderr
12000000
1,347,856,636 bytes allocated in the heap
270,692 bytes copied during GC
3,180 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2570 collections, 0 parallel, 0.00s, 0.01s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.09s ( 1.08s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.09s ( 1.09s elapsed)
%GC time 0.0% (0.8% elapsed)
Alloc rate 1,234,293,036 bytes per MUT second
Productivity 100.0% of total user, 100.5% of total elapsed
Wowza! Let's break it down:
We're still at 1MB total. This is almost certainly a feature of these implementations, as they remain at 1MB on inputs of 5e5 and 5e7. A testament to laziness, if you will.
We cut off about 32% of our original time, which is pretty impressive.
I suspect that the percentages here reflect the granularity in -sstderr's monitoring rather than any computation on superluminal particles.
Answering the question "why rem instead of mod?" in the comments. When dealing with positive values rem x y === mod x y so the only consideration of note is performance:
> import Test.QuickCheck
> quickCheck (\x y -> x > 0 && y > 0 ==> x `rem` y == x `mod` y)
So what is the performance? Unless you have a good reason not to (and being lazy isn't a good reason, neither is not knowing Criterion), use a good benchmarking tool. I used Criterion:
$ cat useRem.hs
import Criterion
import Criterion.Main
list :: [Integer]
list = [1..10000]
main = defaultMain
    [ bench "mod" (nf (map (`mod` 7)) list)
    , bench "rem" (nf (map (`rem` 7)) list)
    ]
Running this shows rem is measurably better than mod (compiled with -O2):
$ ./useRem
...
benchmarking mod
...
mean: 590.4692 us, lb 589.2473 us, ub 592.1766 us, ci 0.950
benchmarking rem
...
mean: 394.1580 us, lb 393.2415 us, ub 395.4184 us, ci 0.950
I want to see how long a function takes to run. What's the easiest way to do this in PLT-Scheme? Ideally I'd want to be able to do something like this:
> (define (loopy times)
    (if (zero? times)
        0
        (loopy (sub1 times))))
> (loopy 5000000)
0 ;(after about a second)
> (timed (loopy 5000000))
Took: 0.93 seconds
0
>
It doesn't matter if I'd have to use some other syntax like (timed loopy 5000000) or (timed '(loopy 5000000)), or if it returns the time taken in a cons or something.
The standard name for timing the execution of expressions in most Scheme implementations is "time". Here is an example from within DrRacket.
(define (loopy times)
  (if (zero? times)
      0
      (loopy (sub1 times))))

(time (loopy 5000000))
cpu time: 1526 real time: 1657 gc time: 0
0
If you use time to benchmark different implementations against each other,
remember to use racket from the command line rather than benchmarking directly
in DrRacket (DrRacket inserts debug code in order to give better error messages).
Found it...
From the online documentation:
(time-apply proc arg-list) invokes the procedure proc with the arguments in arg-list. Four values are returned: a list containing the result(s) of applying proc, the number of milliseconds of CPU time required to obtain this result, the number of "real" milliseconds required for the result, and the number of milliseconds of CPU time (included in the first result) spent on garbage collection.
Example usage:
> (time-apply loopy '(5000000))
(0)
621
887
0