Consider the basic iteration to generate N random numbers and save them in an array (assume we are not interested in array comprehensions, and that we don't know about calling rand(N)):
function random_numbers(N::Int)
    array = zeros(N)
    for i in 1:N
        array[i] = rand()
    end
    array
end
I am interested in a similar function that takes advantage of the cores of my laptop to generate the same array. I have checked this nice blog where the macros @everywhere, @spawn and @parallel are introduced, but there the calculation is carried out "on the fly" and an array is not needed to save the data.
I have the impression that this is very basic and can perhaps be done easily using the function pmap, but I am unfamiliar with parallel computing.
My aim is to apply this method to a function that I have built to generate random numbers drawn from an unusual distribution.
I would recommend doing a more careful initialization of random number generators in parallel processes, e.g.:
# choose the seed you want
@everywhere srand(1)
# replace 10 below by the maximum process id in your case
@everywhere const LOCAL_R = randjump(Base.GLOBAL_RNG, 10)[myid()]
# here is an example usage
@everywhere f() = rand(LOCAL_R)
In this way you:
make sure that your results are reproducible;
guarantee that there is no overlap between the random sequences generated by different processes.
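Tying this back to the question, here is a minimal sketch (assuming the setup above has already run and the worker processes exist) that fills the requested array in parallel. One remote call per element would be wasteful, so each worker generates a whole block; rand_block and random_numbers_parallel are names invented for this example:
@everywhere rand_block(n) = rand(LOCAL_R, n)  # each worker draws from its own LOCAL_R
function random_numbers_parallel(N::Int)
    nb = nworkers()
    sizes = fill(div(N, nb), nb)
    sizes[end] += N - sum(sizes)      # absorb the remainder in the last block
    vcat(pmap(rand_block, sizes)...)  # a Vector{Float64} of length N
end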
As suggested in the comment, more clarification in the question is always welcome. However, it seems pmap will do what is required. The relevant documentation is here.
The following is an example. Note that the time spent in the pmap method is roughly half that of the regular map (the first pmap call below is slower because it also compiles the code on the workers). With 16 cores, the situation might be substantially better:
julia> addprocs(2)
2-element Array{Int64,1}:
2
3
julia> @everywhere long_rand() = foldl(+,0,(randn() for i=1:10_000_000))
julia> long_rand()
-1165.9596619177153
julia> @time map(x->long_rand(), zeros(10,10))
8.455930 seconds (204.89 k allocations: 11.069 MiB)
10×10 Array{Float64,2}:
⋮
⋮
julia> @time pmap(x->long_rand(), zeros(10,10));
6.125479 seconds (773.08 k allocations: 42.242 MiB, 0.25% gc time)
julia> @time pmap(x->long_rand(), zeros(10,10))
4.609745 seconds (20.99 k allocations: 954.991 KiB)
10×10 Array{Float64,2}:
⋮
⋮
In Julia, I would like to concatenate several arrays (and also multiply them). Within my program, I have written it as follows:
[Uᵣ Qₐ]*Uₖ
[Vᵣ Qᵦ]*Vₖ
However, this array concatenation is very expensive compared to the rest of the program I have written. Is there any way in Julia to concatenate arrays more cheaply/efficiently than what I have done (or than just using the hcat and vcat functions)?
The problem is that whenever you combine matrices, all the data gets copied. This happens because matrices cannot grow the way vectors can.
However, if your matrices are big enough, you can avoid copying data by using BlockArrays. The non-materializing function for combining matrices is called mortar.
Have a look at this code:
using BlockArrays, BenchmarkTools
a = rand(1000,100)
b = rand(1000,120)
z = rand(220,7)
Now let's run benchmarks:
julia> @btime [$a $b]*$z;
1.234 ms (4 allocations: 1.73 MiB)
julia> @btime mortar(($a, $b)) * $z;
573.100 μs (11 allocations: 55.33 KiB)
julia> all([a b]*z .≈ mortar((a, b)) * z)
true
You can see that the speedup is 2x and the difference in memory allocation is 30x. However, the results will vary depending on the size and shape of the matrices, so you should run your own benchmark.
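If adding a dependency is not an option, the same product can also be computed without materializing the concatenation at all, because [A B]*Z equals A*Z₁ + B*Z₂ when Z is split row-wise to match the blocks. A minimal sketch (blockmul is a made-up name), reusing the a, b, z from above:
# [a b] * z == a * z[1:k, :] + b * z[k+1:end, :] where k = size(a, 2)
function blockmul(a, b, z)
    k = size(a, 2)
    a * z[1:k, :] + b * z[k+1:end, :]
end
all(blockmul(a, b, z) .≈ [a b] * z)  # should be true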
I'm still learning Julia's multiple dispatch and value-as-type approach.
Instantiating Val{c}() seems about 50 times slower than dictionary lookup.
After that, dispatch seems 6 times faster than dictionary lookup.
Are these durations expected? Is it possible to speed up the instantiation of Val{c}()?
using BenchmarkTools
rand_n = rand([4,11], 1_000_000)
simple_dict = Dict(4 => 11, 11 => 4)
call_dict(num) = simple_dict[num]
@benchmark call_dict.($rand_n) # 42.113 ms
val_type(::Val{4}) = 11
val_type(::Val{11}) = 4
@benchmark Val.($rand_n) # 2.4 s
partial_result = Val.(rand_n)
@benchmark val_type.($partial_result) # 7 ms
Tricks like these can be great but they can also take you into dangerous territory. You get a boost when you have only 2 val_type methods; to reprise your results,
julia> rand_n = [4, 11, 4]
3-element Vector{Int64}:
4
11
4
julia> vrand_n = Val.(rand_n)
3-element Vector{Val}:
Val{4}()
Val{11}()
Val{4}()
julia> val_type(::Val{4}) = 11
val_type (generic function with 1 method)
julia> val_type(::Val{11}) = 4
val_type (generic function with 2 methods)
julia> using BenchmarkTools
julia> @btime val_type.($vrand_n);
28.421 ns (1 allocation: 112 bytes)
But look what happens when you have 5:
julia> val_type(::Val{2}) = 0
val_type (generic function with 3 methods)
julia> val_type(::Val{3}) = 0
val_type (generic function with 4 methods)
julia> val_type(::Val{7}) = 0
val_type (generic function with 5 methods)
julia> @btime val_type.($vrand_n);
95.008 ns (1 allocation: 112 bytes)
Importantly, I didn't even have to create any such objects to observe the slowdown. Moreover, this is much worse than a fixed version of your Dict-based method:
julia> const simple_dict = Dict(4 => 11, 11 => 4)
Dict{Int64, Int64} with 2 entries:
4 => 11
11 => 4
julia> call_dict(num) = simple_dict[num]
call_dict (generic function with 1 method)
julia> @btime call_dict.($rand_n);
39.674 ns (1 allocation: 112 bytes)
(That const is crucial, see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.)
Why? The key is to look at the type of object you're working with:
julia> eltype(vrand_n)
Val
julia> isconcretetype(eltype(vrand_n))
false
This explains why it can be slow: when your iteration extracts the next element, Julia can't predict the concrete type of the object. So it has to use runtime dispatch, which is essentially a glorified dictionary lookup. Unfortunately, it's one where the comparison of keys is much more complicated than just looking up an Int. So you lose quite a lot of performance.
Why is it so much faster when there are only two methods? Because Julia tries to be really smart, and it checks to see how many there are; if there are 3 or fewer methods, it will generate some optimized code that checks the type in a simple if branch rather than invoking the full machinery of type intersection. You can read more details here.
Newcomers to Julia--once they learn the wonders of specialization and the big runtime performance improvements it delivers--often get excited to try to use the type system for everything, and Val-based dispatch is often a tool they reach for. But inferrability is a key component to the speed advantage of multiple dispatch, so when you use a design that breaks inferrability, you lose the advantage to such an extent that it can be worse than less "fancy" methods.
The bottom line: for the demo you were trying, you'll be much better off if you stick to Dict. There are cases where Val-based dispatch is useful: generally, when a single runtime dispatch sets you up for a whole sequence of subsequently inferrable calls, that can be a win. But you should use it judiciously, and (as you have done) always profile your results.
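To illustrate that last point, here is a hedged sketch of the pattern where Val-based dispatch can pay off (the process name and its methods are invented for this example): one runtime dispatch at the boundary buys fully inferrable code inside each specialized method.
# One unavoidable runtime dispatch here (dims is a runtime value)...
process(A, dims::Int) = process(A, Val(dims))
# ...but everything inside each specialized method is inferrable.
process(A, ::Val{1}) = vec(sum(A, dims=1))  # column sums
process(A, ::Val{2}) = vec(sum(A, dims=2))  # row sums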
I have a background in MATLAB so I have the tendency to vectorize everything. However, in Julia, I tested these two functions:
function testVec(n)
    t = [0 0 0 0]
    for i = 1:n
        for j = 1:4
            t[j] = i
        end
    end
end
function testVec2(n)
    t = [0 0 0 0]
    for i = 1:n
        t .= [i i i i]
    end
end
@time testVec(10^4)
0.000029 seconds (6 allocations: 288 bytes)
@time testVec2(10^4)
0.000844 seconds (47.96 k allocations: 1.648 MiB)
I have two questions:
Why are loops faster?
If loops are indeed faster, are there "smart" vectorization techniques that mimic loops? The syntax for loops is ugly and long.
It's all loops under the hood. The vectorized expressions get translated to loops, both in Julia and in MATLAB; in the end it's all loops. In your particular example, it is as @sam says: you're allocating a bunch of extra arrays that you can avoid if you loop explicitly. The reason you still vectorize in MATLAB is that everything then gets shuffled into functions written in a high-performance language (C or Fortran, probably), so it's worth it even when you do extra allocations.
Indeed there are, as @sam showed. Here's a blog post that tells you all you need to know about broadcasting and loop fusion.
In the testVec2 method, the code will allocate a temporary vector holding [i i i i] on every iteration of your loop. This allocation is not free. You can see evidence of this in the number of allocations printed in your timing results. You could try the following:
function testVec3(n)
    t = [0 0 0 0]
    for i = 1:n
        t .= i
    end
end
I'm doing MC simulations, and I need to generate random integers within a range between 1 and a variable upper limit n_mol.
The specific Julia function for doing this is rand(1:n_mol), where n_mol is an integer that changes with every MC iteration. The problem is that doing it this way is slow (possibly an issue to open for the Julia developers). So, instead of using that particular call, I thought about generating a random float in [0,1), multiplying it by n_mol, and then taking the integer part of the result: int(rand()*n_mol). The problem now is that int() rounds to the nearest integer, so I could end up with numbers between 0 and n_mol, and I can't accept 0. The solution I'm using for the moment is ifloor plus 1, ifloor(rand()*n_mol)+1, which is considerably faster than the first version but slower than the second.
function t1(N, n_mol)
    for i = 1:N
        rand(1:n_mol)
    end
end

function t2(N, n_mol)
    for i = 1:N
        int(rand()*n_mol)
    end
end

function t3(N, n_mol)
    for i = 1:N
        ifloor(rand()*n_mol)+1
    end
end
@time t1(1e8, 123456789)
@time t2(1e8, 123456789)
@time t3(1e8, 123456789)
elapsed time: 3.256220849 seconds (176 bytes allocated)
elapsed time: 0.482307467 seconds (176 bytes allocated)
elapsed time: 0.975422095 seconds (176 bytes allocated)
So, is there any way of doing this faster, with speeds near those of the second test?
It's important because the MC simulation goes for more than 1e10 iterations.
The result has to be an integer because it will be used as an index of an array.
The rand(r::Range) code is quite fast, given the following two considerations. First, Julia calls a 52-bit RNG twice to obtain a random integer but only once to obtain a random float; with some bookkeeping, that accounts for a factor of about 2.5. Second,
(rand(Uint) % k)
is evenly distributed between 0 and k-1 only if k is a power of 2. This is taken care of with rejection sampling, which explains more or less the remaining additional cost.
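For intuition, here is a minimal sketch of the rejection idea (not Julia's actual implementation; in current Julia, Uint is spelled UInt): samples above the largest multiple of k are rejected, so the modulo never biases the result.
# Unbiased draws in 0:k-1 via rejection sampling (sketch only).
function rand_mod(k::Uint)
    limit = typemax(Uint) - rem(typemax(Uint), k)  # a multiple of k
    while true
        x = rand(Uint)
        x < limit && return rem(x, k)  # reject the biased "overhang" above limit
    end
end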
If speed is extremely important, you can use a simpler random number generator than Julia's and ignore those issues. For example, with a linear congruential generator and no rejection sampling:
function lcg(old)
    a = unsigned(2862933555777941757)
    b = unsigned(3037000493)
    a*old + b
end

function randfast(k, x::Uint)
    x = lcg(x)
    1 + rem(x, k) % Int, x
end

function t4(N, R)
    state = rand(Uint)
    for i = 1:N
        x, state = randfast(R, state)
    end
end
But be careful if the range is (really) big:
m = div(typemax(Uint),3)*2
julia> mean([rand(1:m)*1.0 for i in 1:10^7])
6.148922790091841e18
julia> m/2
6.148914691236517e18
but (!)
julia> mean([(rand(Uint) % m)*1.0 for i in 1:10^7])
5.123459611164573e18
julia> 5/12 * m
5.124095576030431e18
(Since typemax(Uint) is about 1.5m, residues below m/2 are produced twice as often as those above, which pulls the mean from m/2 down to 5m/12.)
Note that in 0.4, int() is deprecated, and you're asked to use round() instead.
function t2(N, n_mol)
    for i = 1:N
        round(rand()*n_mol)
    end
end
gives 0.27 seconds on my machine (using Julia 0.4).
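Similarly, the deprecated ifloor from the t3 variant becomes floor(Int, x) in 0.4, which returns an Int directly; a sketch (t3b is a made-up name):
function t3b(N, n_mol)
    for i = 1:N
        floor(Int, rand()*n_mol) + 1   # 0.4 spelling of ifloor(rand()*n_mol)+1
    end
end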
I have a question about the use of pmap. I think it's a simple/obvious answer but still can't figure it out! I am currently running a loop where each of 50 iterations is separate and so running it in parallel should be possible and should improve speed. It uses a function that has multiple inputs and outputs, which are both a mixture of vectors and scalars. I need to save the outputs of the function for each of the 50 iterations for later use. Here are the basics of the code when not in parallel.
A = Array(Float64, 500, 50)
b = Array(Float64, 50)
for i in 1:50
    A[:,i], b[i] = func(i, x, y, z)
end
Any advice for how to implement this in parallel? I'm using Julia v0.3.
Thanks in advance.
David
This worked for me.
@everywhere x, y, z = 1, 2, 3

@everywhere function f(i, x, y, z)
    sleep(1)
    return (ones(500)*i, i + x + y + z)
end

naive = @time map(i -> f(i,x,y,z), 1:50)
parallel = @time pmap(i -> f(i,x,y,z), 1:50)

A = [x[1] for x in parallel]
b = [x[2] for x in parallel]
Let me know if anyone can suggest a more elegant way to get A and b out of the array of tuples that is produced by pmap.
The timings (when run on 8 processes) are as we would expect:
elapsed time: 5.063214725 seconds (94436 bytes allocated)
elapsed time: 0.815228485 seconds (288864 bytes allocated)
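As for a more elegant way to unpack the tuples: on later Julia versions (the dotted first./last. broadcast syntax did not exist in v0.3), one option is a sketch like:
A = reduce(hcat, first.(parallel))  # 500×50 matrix from the vector parts
b = last.(parallel)                 # length-50 vector from the scalar parts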