Julia parallelism: @distributed (+) slower than serial?

After seeing a couple tutorials on the internet on Julia parallelism I decided to implement a small parallel snippet for computing the harmonic series.
The serial code is:
harmonic = function (n::Int64)
    x = 0
    for i in n:-1:1 # summing backwards to avoid rounding errors
        x += 1/i
    end
    x
end
And I made 2 parallel versions, one using the @distributed macro and another using the @everywhere macro (julia -p 2 btw):
@everywhere harmonic_ever = function (n::Int64)
    x = 0
    for i in n:-1:1
        x += 1/i
    end
    x
end
harmonic_distr = function (n::Int64)
    x = @distributed (+) for i in n:-1:1
        x = 1/i
    end
    x
end
However, when I run the above code and @time it, I don't get any speedup - in fact, the @distributed version runs significantly slower!
@time harmonic(10^10)
>>> 53.960678 seconds (29.10 k allocations: 1.553 MiB) 23.60306659488827
job = @spawn harmonic_ever(10^10)
@time fetch(job)
>>> 46.729251 seconds (309.01 k allocations: 15.737 MiB) 23.60306659488827
@time harmonic_distr(10^10)
>>> 143.105701 seconds (1.25 M allocations: 63.564 MiB, 0.04% gc time) 23.603066594889185
What completely and absolutely baffles me is the "0.04% gc time". I'm clearly missing something, and the examples I saw weren't written for version 1.0.1 (one, for example, used @parallel).

Your distributed version should be
function harmonic_distr2(n::Int64)
    x = @distributed (+) for i in n:-1:1
        1/i # no x assignment here
    end
    x
end
The @distributed loop will accumulate values of 1/i on every worker and then finally on the master process.
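To make the accumulation concrete, here is a minimal single-process sketch of what such a (+) reduction computes: the range is split into chunks, each chunk is summed on its own, and the partial sums are combined at the end. The name chunked_harmonic and the nchunks parameter are made up for this illustration; the real @distributed macro additionally ships each chunk to a worker process.

```julia
# Hypothetical illustration: mimics the chunk-then-combine structure of a
# (+) reduction on a single process (no workers involved).
function chunked_harmonic(n::Int, nchunks::Int)
    # Split 1:n into consecutive sub-ranges of (at most) equal length.
    chunks = Iterators.partition(1:n, cld(n, nchunks))
    # Each chunk produces its own partial sum ...
    partials = [sum(1 / i for i in chunk) for chunk in chunks]
    # ... and the partial sums are combined with the same (+) operator.
    return sum(partials)
end

serial = sum(1 / i for i in 1:1000)
chunked = chunked_harmonic(1000, 4)
```

Because floating-point addition is not associative, the chunked result may differ from the serial one in the last digits, which is also why the timings in the question show slightly different sums.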
Note that it is also generally better to use BenchmarkTools's @btime macro instead of @time for benchmarking:
julia> using Distributed; addprocs(4);
julia> @btime harmonic(1_000_000_000); # serial
1.601 s (1 allocation: 16 bytes)
julia> @btime harmonic_distr2(1_000_000_000); # parallel
754.058 ms (399 allocations: 36.63 KiB)
julia> @btime harmonic_distr(1_000_000_000); # your old parallel version
4.289 s (411 allocations: 37.13 KiB)
The parallel version is, of course, slower if run only on one process:
julia> rmprocs(workers())
Task (done) @0x0000000006fb73d0
julia> nprocs()
1
julia> @btime harmonic_distr2(1_000_000_000); # (not really) parallel
1.879 s (34 allocations: 2.00 KiB)

Related

Difference between the built-in #time macros and the ones from the benchmark module

In the Julia package BenchmarkTools, there are macros like @btime and @belapsed that seem redundant to me, since Julia has the built-in @time and @elapsed macros. It seems to me that these macros serve the same purpose. So what's the difference between @time and @btime, and between @elapsed and @belapsed?
TLDR ;)
@time and @elapsed just run the code once and measure the time. This measurement may or may not include compile time (depending on whether the code is run for the first time or a subsequent time) and includes the time to resolve global variables.
On the other hand, @btime and @belapsed perform a warm-up, so compile time and global variable resolution time (if $ is used) do not affect the measurement.
Details
To further understand how this works, let's use @macroexpand (I am also stripping comment lines for readability):
julia> using MacroTools, BenchmarkTools
julia> MacroTools.striplines(@macroexpand1 @elapsed sin(x))
quote
Experimental.@force_compile
local var"#28#t0" = Base.time_ns()
sin(x)
(Base.time_ns() - var"#28#t0") / 1.0e9
end
Compilation of sin is not forced ahead of time, so you get different results when running for the first time and on subsequent runs. For example:
julia> @time cos(x);
0.110512 seconds (261.97 k allocations: 12.991 MiB, 99.95% compilation time)
julia> @time cos(x);
0.000008 seconds (1 allocation: 16 bytes)
julia> @time cos(x);
0.000006 seconds (1 allocation: 16 bytes)
The situation is different with @belapsed:
julia> MacroTools.striplines(@macroexpand @belapsed sin($x))
quote
(BenchmarkTools).time((BenchmarkTools).minimum(begin
local var"##314" = begin
BenchmarkTools.generate_benchmark_definition(Main, Symbol[], Any[], [Symbol("##x#315")], (x,), $(Expr(:copyast, :($(QuoteNode(:(sin(var"##x#315"))))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), BenchmarkTools.Parameters())
end
(BenchmarkTools).warmup(var"##314")
(BenchmarkTools).tune!(var"##314")
(BenchmarkTools).run(var"##314")
end)) / 1.0e9
end
You can see that a minimum value is taken (the code is run several times).
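The warm-up-then-minimum idea can be sketched by hand with nothing but Base. This is a rough illustration, not how BenchmarkTools is implemented (it also tunes the number of evaluations per sample); min_elapsed and its samples keyword are hypothetical names.

```julia
# Hypothetical helper, not part of BenchmarkTools: warm up once, then report
# the smallest of several timings, in seconds.
function min_elapsed(f, args...; samples::Int = 20)
    f(args...)                      # warm-up call, so compilation is not timed
    best = Inf
    for _ in 1:samples
        t0 = time_ns()
        f(args...)
        best = min(best, (time_ns() - t0) / 1e9)
    end
    return best
end

t = min_elapsed(sin, 1.0)
```

Taking the minimum rather than the mean filters out noise from the operating system and the garbage collector, which can only make a run slower, never faster.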
In most cases you should use BenchmarkTools for measuring run times when designing your application.
Last but not least, try @benchmark:
julia> @benchmark sin($x)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 13.714 ns … 51.151 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.814 ns ┊ GC (median): 0.00%
Time (mean ± σ): 14.089 ns ± 1.121 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇ ▂▄ ▁▂ ▃ ▁ ▂
██▆▅██▇▅▄██▃▁▃█▄▃▁▅█▆▁▄▃▅█▅▃▁▄▇▆▁▁▁▁▁▆▄▄▁▁▃▄▇▃▁▃▁▁▁▆▅▁▁▁▆▅▅ █
13.7 ns Histogram: log(frequency) by time 20 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

High GC time for simple mapreduce problem

I have a simulation program written in Julia that does something equivalent to this as a part of its main loop:
# Some fake data
M = [randn(100,100) for m=1:100, n=1:100]
W = randn(100,100)
work = zip(W,M)
result = mapreduce(x -> x[1]*x[2], +,work)
In other words, a simple sum of weighted matrices. Timing the above code yields
0.691084 seconds (79.03 k allocations: 1.493 GiB, 70.59% gc time, 2.79% compilation time)
I am surprised about the large number of memory allocations, as this problem should be possible to do in-place. To see if it was my use of mapreduce that was wrong I also tested the following equivalent implementation:
@time begin
    res = zeros(100,100)
    for m=1:100
        for n=1:100
            res += W[m,n] * M[m,n]
        end
    end
end
which gave
0.442521 seconds (50.00 k allocations: 1.491 GiB, 70.81% gc time)
So, if I wrote this in C++ or Fortran it would be simple to do all of this in-place. Is this impossible in Julia? Or am I missing something here...?
It is possible to do it in place like this:
function ws(W, M)
    res = zeros(100,100)
    for m=1:100
        for n=1:100
            @. res += W[m,n] * M[m, n]
        end
    end
    return res
end
and the timing is:
julia> @time ws(W, M);
0.100328 seconds (2 allocations: 78.172 KiB)
Note that in order to perform this operation in-place I used broadcasting (I could also use loops, but it would be the same).
The problem with your code is that in the line:
res += W[m,n] * M[m,n]
you get two allocations:
When you do the multiplication W[m,n] * M[m,n], a new matrix is allocated.
When you do the addition res += ..., a matrix is allocated again.
By using broadcasting with @. you perform an in-place operation, see https://docs.julialang.org/en/v1/manual/mathematical-operations/#man-dot-operators for more explanations.
Additionally, note that I have wrapped the code inside a function. If you do not do this, access to both W and M is type-unstable, which also causes allocations, see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.
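The difference between the two forms can be checked directly by comparing object identity. The following is a small sketch with made-up function names (add_rebinding, add_inplace!), kept tiny so the effect is easy to see:

```julia
# `res += A` is shorthand for `res = res + A`: a new array is created and the
# local name `res` is rebound to it.
function add_rebinding(res, A)
    res = res + A
    return res
end

# `res .+= A` broadcasts the addition into the existing array.
function add_inplace!(res, A)
    res .+= A
    return res
end

r = zeros(2, 2)
A = ones(2, 2)
out1 = add_rebinding(r, A)      # a fresh array; r itself is left untouched
r_untouched = all(iszero, r)
out2 = add_inplace!(r, A)       # the very same array object as r, now updated
```

Here out1 !== r while out2 === r, which is exactly the allocation that the broadcast version avoids on every loop iteration.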
I'd like to add something to Bogumił's answer. The missing broadcast is the main problem, but in addition, the loop and the mapreduce variant differ in a fundamental semantic way.
The purpose of mapreduce is to reduce by an associative operation with identity element init in an unspecified order. This in particular also includes the (theoretical) option of running parts in parallel and doesn't really play well with mutation. From the docs:
The associativity of the reduction is implementation-dependent. Additionally, some implementations may reuse the return value of f for elements that appear multiple times in itr. Use mapfoldl or mapfoldr instead for guaranteed left or right associativity and invocation of f for every value.
and
It is unspecified whether init is used for non-empty collections.
What the loop variant really corresponds to is a fold, which has a well-defined order and initial (not necessarily identity) element and can thus use an in-place reduction operator:
Like reduce, but with guaranteed left associativity. If provided, the keyword argument init will be used exactly once.
julia> @benchmark foldl((acc, (m, w)) -> (@. acc += m * w), $work; init=$(zero(W)))
BenchmarkTools.Trial: 45 samples with 1 evaluation.
Range (min … max): 109.967 ms … 118.251 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 112.639 ms ┊ GC (median): 0.00%
Time (mean ± σ): 112.862 ms ± 1.154 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▃█ ▁▄▃
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███▆███▄▁▄▁▁▄▁▁▄▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
110 ms Histogram: frequency by time 118 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mapreduce(Base.splat(*), +, $work)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 403.100 ms … 458.882 ms ┊ GC (min … max): 4.53% … 3.89%
Time (median): 445.058 ms ┊ GC (median): 4.04%
Time (mean ± σ): 440.042 ms ± 16.792 ms ┊ GC (mean ± σ): 4.21% ± 0.92%
▁ ▁ ▁ ▁ ▁ ▁ ▁▁▁ █ ▁
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁█▁█▁▁▁▁███▁▁▁▁▁█▁▁▁█ ▁
403 ms Histogram: frequency by time 459 ms <
Memory estimate: 1.49 GiB, allocs estimate: 39998.
Think of it this way: if you wrote the function as a parallel for loop with (+) reduction, the iteration would also have an unspecified order, and you'd have memory overhead for the necessary copying of the individual results to the accumulating thread.
Thus, there is a trade-off. In your example, allocation/copying dominates. In other cases, the mapped operation might dominate, and a parallel reduction (with unspecified order, but copying overhead) may be worth it.
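As a small sanity check of the fold approach, the sketch below uses tiny made-up data (2 x 2 weights and matrices, not the 100 x 100 case from the question) to confirm that the mutating fold reuses the init array and yields the same sum as an allocating reference computation:

```julia
W = [1.0 2.0; 3.0 4.0]                                       # weights
M = [fill(Float64(10i + j), 2, 2) for i in 1:2, j in 1:2]    # one matrix per weight

# Fold with a mutating step: the array passed as `init` is the accumulator
# for the whole reduction and is returned at the end.
acc0 = zeros(2, 2)
folded = foldl((acc, (w, m)) -> (acc .+= w .* m), zip(W, M); init = acc0)

# Allocating reference result for comparison.
reference = sum(W[i] * M[i] for i in eachindex(W))
```

This only works because foldl guarantees that init is used exactly once and that the steps run strictly left to right; the same mutating operator would not be safe to hand to reduce or mapreduce.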

Julia - How to efficiently turn to zero the diagonal of a matrix?

In Julia, what would be an efficient way of turning the diagonal of a matrix to zero?
Assuming m is your matrix of size N x N this could be done as:
setindex!.(Ref(m), 0.0, 1:N, 1:N)
Another option:
using LinearAlgebra
m[diagind(m)] .= 0.0
And some performance tests:
julia> using LinearAlgebra, BenchmarkTools
julia> m=rand(20,20);
julia> @btime setindex!.(Ref($m), 0.0, 1:20, 1:20);
55.533 ns (1 allocation: 240 bytes)
julia> @btime $m[diagind($m)] .= 0.0;
75.386 ns (2 allocations: 80 bytes)
Performance-wise, a simple loop is faster (and more explicit, though that is a matter of taste):
julia> @btime foreach(i -> $m[i, i] = 0, 1:20)
11.881 ns (0 allocations: 0 bytes)
julia> @btime setindex!.(Ref($m), 0.0, 1:20, 1:20);
50.923 ns (1 allocation: 240 bytes)
And it is faster than the diagind version, but not by much:
julia> m = rand(1000, 1000);
julia> @btime foreach(i -> $m[i, i] = 0.0, 1:1000)
1.456 μs (0 allocations: 0 bytes)
julia> @btime foreach(i -> @inbounds($m[i, i] = 0.0), 1:1000)
1.338 μs (0 allocations: 0 bytes)
julia> @btime $m[diagind($m)] .= 0.0;
1.495 μs (2 allocations: 80 bytes)
Przemyslaw Szufel's solutions, benchmarked for a 1_000 x 1_000 matrix, show that diagind performs best:
julia> @btime setindex!.(Ref($m), 0.0, 1:1_000, 1:1_000);
2.222 μs (1 allocation: 7.94 KiB)
julia> @btime $m[diagind($m)] .= 0.0;
1.280 μs (2 allocations: 80 bytes)
Here is a more general way to use setindex! performantly on an Array by accessing custom indices:
Using an array for in array indexing
Linear indexing performs best, which is why diagind runs faster than Cartesian indices.
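To see what diagind actually returns, here is a short sketch on a 4 x 4 matrix. For a column-major N x N matrix the diagonal lives at linear indices with a constant stride of N + 1. The helper zero_diagonal! is a made-up name for the plain-loop variant discussed above.

```julia
using LinearAlgebra

m = reshape(collect(1.0:16.0), 4, 4)

# Linear indices of the diagonal: 1, 6, 11, 16 for a 4 x 4 matrix,
# i.e. a range with step N + 1 = 5.
idx = collect(diagind(m))

# Plain loop over the diagonal (hypothetical helper name).
function zero_diagonal!(A::AbstractMatrix)
    for i in 1:min(size(A)...)
        A[i, i] = zero(eltype(A))
    end
    return A
end

zero_diagonal!(m)
```

Both approaches touch the same N memory locations; they differ only in how the indices are generated.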

Extract lower triangle portion of a matrix

I was wondering if there is a command or a package in Julia that permits us to extract directly the lower triangle portion of a matrix, excluding the diagonal. I can call R commands for that (like lowerTriangle of the gdata package), obviously, but I'd like to know if Julia has something similar. For example, imagine I have the matrix
1.0 0.751 0.734
0.751 1.0 0.948
0.734 0.948 1.0
I don't want to create a lower triangular matrix like
NA NA NA
0.751 NA NA
0.734 0.948 NA
but extract the lower portion of the matrix as an array: 0.751 0.734 0.948
If you're OK with creating a lower triangular matrix as an intermediate step, you can use logical indexing and tril! with an extra argument to get what you need.
julia> M = [1.0 0.751 0.734
0.751 1.0 0.948
0.734 0.948 1.0];
julia> v = M[tril!(trues(size(M)), -1)]
3-element Array{Float64, 1}:
0.751
0.734
0.948
The trues call returns an array of M's shape filled with boolean true values. tril! then prunes this down to just the part of the matrix that we want. The second argument to tril! tells it which diagonal to start from; passing -1 selects the first subdiagonal, which excludes the values on the main diagonal.
We use the result of that for indexing into M, and that returns an array with the required values.
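To show what the mask looks like, the sketch below builds an equivalent boolean matrix by hand (true wherever the row index exceeds the column index) instead of calling tril!, and then uses it for logical indexing:

```julia
M = [1.0   0.751 0.734
     0.751 1.0   0.948
     0.734 0.948 1.0]

# Hand-built boolean mask: true strictly below the main diagonal (row > column).
mask = [i > j for i in axes(M, 1), j in axes(M, 2)]

# Logical indexing walks the mask in column-major order and collects the
# entries of M where the mask is true.
v = M[mask]
```

The elements come out column by column, so for a 3 x 3 matrix the order is (2,1), (3,1), (3,2).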
Using comprehensions:
julia> [M[m, n] for m in 2:size(M, 1) for n in 1:m-1]
3-element Array{Float64,1}:
0.751
0.734
0.948
But it is much slower than the sundar/Matt B. solution:
lower_triangular_1(M) = [M[m, n] for m in 2:size(M, 1) for n in 1:m-1]
lower_triangular_2(M) = [M[m, n] for n in 1:size(M, 2) for m in n+1:size(M, 1)]
lower_triangular_3(M) = M[tril!(trues(size(M)), -1)]
using BenchmarkTools
using LinearAlgebra # avoid warning in 0.7
M=rand(100, 100)
Testing with Julia Version 0.7.0-alpha.0:
julia> @btime lower_triangular_1(M);
73.179 μs (10115 allocations: 444.34 KiB)
julia> @btime lower_triangular_2(M);
71.157 μs (10117 allocations: 444.41 KiB)
julia> @btime lower_triangular_3(M);
16.325 μs (6 allocations: 40.19 KiB)
Not elegant, but faster (with @views):
function lower_triangular_4(M)
    # works only for square matrices
    res = similar(M, ((size(M, 1)-1) * size(M, 2)) ÷ 2)
    start_idx = 1
    for n = 1:size(M, 2)-1
        @views column = M[n+1:end, n]
        last_idx = start_idx - 1 + length(column)
        @views res[start_idx:last_idx] = column[:]
        start_idx = last_idx + 1
    end
    return res # without this, the function returns nothing (the value of the for loop)
end
julia> @btime lower_triangular_4(M);
4.272 μs (101 allocations: 44.95 KiB)

Julia: best way to sample from successively shrinking range?

I would like to sample k numbers where the first number is sampled from 1:n and the second from 1:n-1 and the third from 1:n-2 and so on.
I have the below implementation
function shrinksample(n,k)
[rand(1:m) for m in n:-1:n-k+1]
end
Are there faster solutions in Julia?
The following takes ideas from the implementation of randperm and since n and k are of the same order, this is appropriate as the same type of randomness is needed (both have output space of size n factorial):
function fastshrinksample(r::AbstractRNG,n,k)
    a = Vector{typeof(n)}(k)
    @assert n <= Int64(2)^52
    k == 0 && return a
    mask = (1<<(64-leading_zeros(n)))-1
    nextmask = mask>>1
    nn = n
    for i=1:k
        a[i] = 1+Base.Random.rand_lt(r, nn, mask)
        nn -= 1
        if nn == nextmask
            mask, nextmask = nextmask, nextmask>>1
        end
    end
    return a
end
fastshrinksample(n,k) = fastshrinksample(Base.Random.GLOBAL_RNG, n, k)
Benchmarking gives a 3x improvement for one tested instance:
julia> using BenchmarkTools
julia> @btime shrinksample(10000,10000);
310.277 μs (2 allocations: 78.20 KiB)
julia> @btime fastshrinksample(10000,10000);
91.815 μs (2 allocations: 78.20 KiB)
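The trick is mainly to use the internal Base.Random.rand_lt instead of the regular rand(1:n). Note that this relies on Julia 0.6 internals (Base.Random and the Vector{T}(k) constructor), so it would need adapting for Julia 0.7 and later, where random number generation lives in the Random standard library.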
If this is not very sensitive to randomness (you're not doing cryptography), the following should be amazingly fast and very simple:
blazingshrinksample(n,k) = (Int)[trunc(Int,(n-m)rand()+1) for m in 0:k-1]
Testing this along with your implementation and with Dan's, I got this:
using BenchmarkTools
@btime shrinksample(10000,10000);
259.414 μs (2 allocations: 78.20 KiB)
@btime fastshrinksample(10000,10000);
66.713 μs (2 allocations: 78.20 KiB)
@btime blazingshrinksample(10000,10000);
33.614 μs (2 allocations: 78.20 KiB)
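For reference, here is a sketch of the two simple variants written against the Random standard library, with the generator passed explicitly so the run is reproducible. The name floatsample and the explicit rng argument are additions for this illustration. Since rand(rng) lies in [0, 1), the scaled value (n - m) * rand(rng) + 1 truncates to an integer in 1:(n - m), up to floating-point rounding, so it is only approximately uniform.

```julia
using Random

# Integer-range sampling, as in the question (rng passed explicitly here).
shrinksample(rng, n, k) = [rand(rng, 1:m) for m in n:-1:n-k+1]

# Float-scaling variant, in the spirit of blazingshrinksample.
floatsample(rng, n, k) = [trunc(Int, (n - m) * rand(rng) + 1) for m in 0:k-1]

rng = MersenneTwister(1234)
n, k = 100, 100
a = shrinksample(rng, n, k)
b = floatsample(rng, n, k)

# The i-th sample must lie in 1:(n - i + 1).
in_range(s) = all(1 <= x <= u for (x, u) in zip(s, n:-1:n-k+1))
```

Both in_range(a) and in_range(b) should hold; the speed difference comes from the float variant skipping the rejection step that exact integer-range sampling needs.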
