I'm still learning Julia's multiple dispatch and value-as-type approach.
Instantiating Val{c}() seems about 50 times slower than a dictionary lookup.
After that, dispatch seems about 6 times faster than the dictionary lookup.
Are these durations expected? Is it possible to speed up the instantiation of Val{c}()?
using BenchmarkTools
rand_n = rand([4,11], 1_000_000)
simple_dict = Dict(4 => 11, 11 => 4)
call_dict(num) = simple_dict[num]
@benchmark call_dict.($rand_n) # 42.113 ms
val_type(::Val{4}) = 11
val_type(::Val{11}) = 4
@benchmark Val.($rand_n) # 2.4 s
partial_result = Val.(rand_n)
@benchmark val_type.($partial_result) # 7 ms
Tricks like these can be great, but they can also take you into dangerous territory. You get a boost when you have only two val_type methods; to reproduce your results:
julia> rand_n = [4, 11, 4]
3-element Vector{Int64}:
4
11
4
julia> vrand_n = Val.(rand_n)
3-element Vector{Val}:
Val{4}()
Val{11}()
Val{4}()
julia> val_type(::Val{4}) = 11
val_type (generic function with 1 method)
julia> val_type(::Val{11}) = 4
val_type (generic function with 2 methods)
julia> using BenchmarkTools
julia> @btime val_type.($vrand_n);
28.421 ns (1 allocation: 112 bytes)
But look what happens when you have 5:
julia> val_type(::Val{2}) = 0
val_type (generic function with 3 methods)
julia> val_type(::Val{3}) = 0
val_type (generic function with 4 methods)
julia> val_type(::Val{7}) = 0
val_type (generic function with 5 methods)
julia> @btime val_type.($vrand_n);
95.008 ns (1 allocation: 112 bytes)
Importantly, I didn't even have to create any such objects to observe the slowdown. Moreover, this is much worse than a fixed version of your Dict-based method:
julia> const simple_dict = Dict(4 => 11, 11 => 4)
Dict{Int64, Int64} with 2 entries:
4 => 11
11 => 4
julia> call_dict(num) = simple_dict[num]
call_dict (generic function with 1 method)
julia> @btime call_dict.($rand_n);
39.674 ns (1 allocation: 112 bytes)
(That const is crucial, see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.)
Why? The key is to look at the type of object you're working with:
julia> eltype(vrand_n)
Val
julia> isconcretetype(eltype(vrand_n))
false
This explains why it can be slow: when your iteration extracts the next element, Julia can't predict the concrete type of the object. So it has to use runtime dispatch, which is essentially a glorified dictionary lookup. Unfortunately, it's one where the comparison of keys is much more complicated than just looking up an Int. So you lose quite a lot of performance.
Why is it so much faster when there are only two methods? Because Julia tries to be really smart: it checks how many methods there are, and if there are three or fewer, it generates optimized code that checks the type with a simple if branch rather than invoking the full machinery of type intersection. You can read more details here.
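To make that idea a bit more concrete, here is a rough hand-written sketch of the kind of branchy code this optimization amounts to; this is not the actual generated code, and val_type_split is a name I'm inventing purely for illustration:

function val_type_split(v::Val)
    if v isa Val{4}
        return val_type(v)   # inside this branch the call is statically resolved
    elseif v isa Val{11}
        return val_type(v)
    else
        return val_type(v)   # fallback: full dynamic dispatch
    end
end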
Newcomers to Julia, once they learn the wonders of specialization and the big runtime performance improvements it delivers, often get excited to use the type system for everything, and Val-based dispatch is a tool they often reach for. But inferrability is a key component of the speed advantage of multiple dispatch, so when you use a design that breaks inferrability, you lose that advantage, to the point that it can be worse than less "fancy" methods.
The bottom line: for the demo you were trying, you'll be much better off if you stick to Dict. There are cases where Val-based dispatch is useful: generally, when a single runtime dispatch sets you up for a whole sequence of subsequently inferrable calls, that can be a win. But you should use it judiciously, and (as you have done) always profile your results.
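For completeness, here is a minimal sketch of the kind of pattern where Val can pay off: a single runtime dispatch at a function barrier, after which everything inside the barrier is inferrable. The names process and process_fixed are made up for illustration, and this assumes x has at least N elements:

# One dynamic dispatch happens here at the barrier...
process(x, n::Int) = process_fixed(x, Val(n))

# ...then N is a compile-time constant inside the inferrable inner function.
function process_fixed(x, ::Val{N}) where {N}
    s = 0.0
    for i in 1:N        # loop bound known to the compiler
        s += x[i]
    end
    return s
end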
Related
In Julia, I would like to concatenate several arrays (and also multiply them). Within my program, I have written it as follows:
[Uᵣ Qₐ]*Uₖ
[Vᵣ Qᵦ]*Vₖ
However, this array concatenation is very expensive compared to the rest of the program I have written. Is there any way in Julia to cheaply/efficiently concatenate arrays other than what I have done (or just using hcat, vcat functions)?
The problem is that whenever you concatenate matrices, all of the data gets copied. This happens because matrices cannot grow in place the way vectors can.
However, if your matrices are big enough, you can avoid copying the data by using BlockArrays. The non-materializing function that combines matrices into a block array is called mortar.
Have a look at this code:
using BlockArrays, BenchmarkTools
a = rand(1000,100)
b = rand(1000,120)
z = rand(220,7)
Now let's run benchmarks:
julia> @btime [$a $b]*$z;
1.234 ms (4 allocations: 1.73 MiB)
julia> @btime mortar(($a, $b)) * $z;
573.100 μs (11 allocations: 55.33 KiB)
julia> all([a b]*z .≈ mortar((a, b)) * z)
true
You can see a roughly 2x speedup and about a 30x reduction in memory allocation. However, the results will vary depending on the size and shape of your matrices, so you should run your own benchmarks.
I'm trying to make my function call allocate as little memory as possible so that it runs faster. The problem is that it seems that when I access a struct passed as an argument, there are many allocations.
function mCondition(y,t,integrator)
xp, yp, zp, vx, vy, vz = y
mu = integrator.p[1]
cond = (xp - 1.0 + mu)*vx + yp*vy + zp*vz
return cond
end
struct myStr
p
u
end
y = rand(6)
t = 0.0
A = myStr([0.01215],rand(6))
#test call
mCondition(y,t,A)
using BenchmarkTools
@btime mCondition(y,t,A)
The output is:
julia> @btime mCondition(y,t,A)
102.757 ns (9 allocations: 144 bytes)
-0.07935578340713843
I think that the problem is with the struct because when I delete that part of the code,
function mCondition(y,t,integrator)
xp, yp, zp, vx, vy, vz = y
cond = (xp - 1.0)*vx + yp*vy + zp*vz
return cond
end
this is the result of the benchmark:
julia> @btime mCondition(y,t,A)
18.294 ns (1 allocation: 16 bytes)
-0.08427348469961408
which is closer to what I would expect given what's going on inside the function (but I still wonder whether that allocation is even necessary). If you could help me understand what's going on, or even fix it, that would be nice.
Thanks in advance :)
You need to annotate the types of the fields in your struct so that the compiler can generate high-performance code. Without type annotations on the fields, the compiler cannot infer their types at compile time, which pushes those decisions to run time; this hurts performance and causes otherwise unnecessary allocations.
The solution is then,
struct myStr
p::Vector{Float64}
u::Vector{Float64}
end
You can also make your struct parametric, for example, if you want it to work with Vectors of other types. See the Types section of documentation for more information.
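As a sketch of the parametric variant (the name myStrP and the T<:Real bound are just illustrative assumptions), one possibility is:

# The parameter T is inferred from the arguments, so the fields stay
# concretely typed while still accepting vectors of other element types.
struct myStrP{T<:Real}
    p::Vector{T}
    u::Vector{T}
end

A = myStrP([0.01215], rand(6))   # a myStrP{Float64}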
I would also suggest that you read the Performance Tips section of documentation to learn more about how to write high-performance code in Julia.
I've noticed some strange behavior of Julia during a matrix copy.
Consider the following three functions:
function priv_memcopyBtoA!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
A[1:n,1:n] = B[1:n,1:n]
return nothing
end
function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
ii = 1; jj = 1;
while ii <= n
jj = 1 #(*)
while jj <= n
A[jj,ii] = B[jj,ii]
jj += 1
end
ii += 1
end
return nothing
end
function priv_memcopyBtoA3!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
A[1:n,1:n] = view(B, 1:n, 1:n)
return nothing
end
Edit: 1) I was testing whether the code would throw a BoundsError, so the line marked with jj = 1 #(*) was missing in the initial code. The timing results were already from the fixed version, so they remain unchanged. 2) I've added the view variant; thanks to @Colin T Bowers for addressing both issues.
It seems like these functions should lead to more or less the same code. Yet I get, for
A = fill!(Matrix{Int}(2^12,2^12),2); B = Int.(eye(2^12));
the results
@timev priv_memcopyBtoA!(A,B, 2000)
0.178327 seconds (10 allocations: 15.259 MiB, 85.52% gc time)
elapsed time (ns): 178326537
gc time (ns): 152511699
bytes allocated: 16000304
pool allocs: 9
malloc() calls: 1
GC pauses: 1
and
@timev priv_memcopyBtoA2!(A,B, 2000)
0.015760 seconds (4 allocations: 160 bytes)
elapsed time (ns): 15759742
bytes allocated: 160
pool allocs: 4
and
@timev priv_memcopyBtoA3!(A,B, 2000)
0.043771 seconds (7 allocations: 224 bytes)
elapsed time (ns): 43770978
bytes allocated: 224
pool allocs: 7
That's a drastic difference, and it's also surprising. I expected the first version to behave like memcpy, which is hard to beat for a large memory block.
The second version has overhead from the pointer arithmetic (getindex), the branch condition (<=) and the bounds check in each assignment. Yet each assignment takes just ~3 ns.
Also, the time the garbage collector consumes varies a lot for the first function. If no garbage collection is performed, the large difference shrinks, but it remains: still a factor of ~2.5 between versions 3 and 2.
So why is the "memcopy" version not as efficient as the "assignment" version?
Firstly, your code contains a bug. Run this:
A = [1 2 ; 3 4]
B = [5 6 ; 7 8]
priv_memcopyBtoA2!(A, B, 2)
then:
julia> A
2×2 Array{Int64,2}:
5 2
7 4
You need to re-assign jj back to 1 at the end of each inner while loop, ie:
function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
ii = 1
while ii <= n
jj = 1
while jj <= n
A[jj,ii] = B[jj,ii]
jj += 1
end
ii += 1
end
return nothing
end
Even with the bug fix, you'll still note that the while loop solution is faster. This is because array slices in Julia create temporary arrays. So in this line:
A[1:n,1:n] = B[1:n,1:n]
the right-hand side creates a temporary n×n array, which is then assigned to the left-hand side.
If you wanted to avoid the temporary array allocation, you would instead write:
A[1:n,1:n] = view(B, 1:n, 1:n)
and you'll notice that the timings of the two methods are now pretty close, although the while loop is still slightly faster. As a general rule, loops in Julia are fast (as in C fast), and explicitly writing out the loop will usually get you the most optimized compiled code, so I would still expect the explicit loop to be faster than the view method.
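As an aside, a sketch of another way to avoid the temporary, assuming a Julia version that has the @views macro and dot-broadcast assignment, is an in-place broadcast into the destination slice:

# Copies element-wise into A without materializing a temporary n-by-n
# array for the right-hand side.
@views A[1:n, 1:n] .= B[1:n, 1:n]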
As for the garbage collection stuff, that is just a result of your method of timing. It is much better to use @btime from the BenchmarkTools package, which uses various tricks to avoid traps like including garbage collection in the timing.
Why is A[1:n,1:n] = view(B, 1:n, 1:n), or variants of it, slower than a set of while loops? Let's look at what A[1:n,1:n] = view(B, 1:n, 1:n) does.
view returns a SubArray which holds a reference to the parent B and information on how to compute the indices that should be copied. A[1:n,1:n] = ... is lowered to a setindex! call, which ends up in _setindex!(...). A few calls down the call chain, the main work is done by:
# abstractarray.jl:883
# In general, we simply re-index the parent indices by the provided ones
function getindex(V::SlowSubArray{T,N}, I::Vararg{Int,N}) where {T,N}
    @_inline_meta
    @boundscheck checkbounds(V, I...)
    @inbounds r = V.parent[reindex(V, V.indexes, I)...]
    r
end

# multidimensional.jl:212
@inline function next(iter::CartesianRange{I}, state) where I<:CartesianIndex
    state, I(inc(state.I, iter.start.I, iter.stop.I))
end
@inline inc(::Tuple{}, ::Tuple{}, ::Tuple{}) = ()
@inline inc(state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int}) = (state[1]+1,)
@inline function inc(state, start, stop)
    if state[1] < stop[1]
        return (state[1]+1,tail(state)...)
    end
    newtail = inc(tail(state), tail(start), tail(stop))
    (start[1], newtail...)
end
getindex takes a view V and an index I. We get the view from B and the index I from A. In each step, reindex uses the view V and the index I to compute the indices of an element in B. That element is called r and is returned. Finally, r is written to A.
After each copy, inc increments the index I to the next element in A and tests whether we are done. Note that the code is from v0.6.3, but on master it's more or less the same.
In principle the code could be reduced to a set of while loops, yet it is more general: it works for arbitrary views of B, arbitrary slices of the form a:b:c, and an arbitrary number of array dimensions. In our case N is 2.
Since these functions are more complex, the compiler doesn't optimize them as well. For example, @inline only recommends that the compiler inline them, and here it doesn't, which shows that these functions are nontrivial.
For a set of loops, the compiler reduces the innermost loop to three additions (one each for the pointers into A and B, and one for the loop index) and a single copy instruction.
tl;dr: The internal call chain of A[1:n,1:n] = view(B, 1:n, 1:n), coupled with multiple dispatch, is nontrivial and handles the general case, which induces overhead. A set of while loops is already optimized for the special case.
Note that the performance depends on the compiler. In the one-dimensional case, A[1:n] = view(B, 1:n) is faster than a while loop because the compiler vectorizes the code. Yet for higher dimensions N > 2 the difference grows.
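For reference, here is a sketch of the hand-written special case the loop version amounts to; the name priv_memcopyBtoA4! is just illustrative, and @inbounds removes the per-assignment bounds checks, assuming you have already checked that n is within the bounds of both matrices:

function priv_memcopyBtoA4!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    @inbounds for ii in 1:n     # column-major: the inner loop runs over rows
        for jj in 1:n
            A[jj, ii] = B[jj, ii]
        end
    end
    return nothing
end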
Consider the basic loop to generate N random numbers and save them in an array (assume we are not interested in array comprehensions, and also that we don't know about calling rand(N)):
function random_numbers(N::Int)
array = zeros(N)
for i in 1:N
array[i] = rand()
end
array
end
I am interested in a similar function that takes advantage of the cores of my laptop to generate the same array. I have checked this nice blog post where the macros @everywhere, @spawn and @parallel are introduced, but there the calculation is carried out "on-the-fly" and no array is needed to save the data.
I have the impression that this is very basic and can perhaps be done easily using the function pmap, but I am unfamiliar with parallel computing.
My aim is to apply this method to a function that I have built to generate random numbers drawn from an unusual distribution.
I would recommend a more careful initialization of the random number generators in the parallel processes, e.g.:
# choose the seed you want
@everywhere srand(1)
# replace 10 below by maximum process id in your case
@everywhere const LOCAL_R = randjump(Base.GLOBAL_RNG, 10)[myid()]
# here is an example usage
@everywhere f() = rand(LOCAL_R)
In this way you:
make sure that your results are reproducible;
ensure that there is no overlap between the random sequences generated by different processes.
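Building on that setup, here is a minimal sketch of filling an array in parallel; the names parallel_chunk and parallel_random_numbers are made up for illustration, and I assume N is divisible by nchunks:

# Each worker fills one chunk using its process-local RNG defined above.
@everywhere parallel_chunk(len) = rand(LOCAL_R, len)

function parallel_random_numbers(N::Int, nchunks::Int)
    chunks = pmap(i -> parallel_chunk(N ÷ nchunks), 1:nchunks)
    return vcat(chunks...)   # concatenate the per-worker chunks into one array
end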
As suggested in the comments, more clarification in the question is always welcome. However, it seems pmap will do what is required. The relevant documentation is here.
The following is an example. Note that the time spent in the pmap method is about half that of the regular map. With 16 cores, the situation might be substantially better:
julia> addprocs(2)
2-element Array{Int64,1}:
2
3
julia> @everywhere long_rand() = foldl(+,0,(randn() for i=1:10_000_000))
julia> long_rand()
-1165.9596619177153
julia> @time map(x->long_rand(), zeros(10,10))
8.455930 seconds (204.89 k allocations: 11.069 MiB)
10×10 Array{Float64,2}:
⋮
⋮
julia> @time pmap(x->long_rand(), zeros(10,10));
6.125479 seconds (773.08 k allocations: 42.242 MiB, 0.25% gc time)
julia> @time pmap(x->long_rand(), zeros(10,10))
4.609745 seconds (20.99 k allocations: 954.991 KiB)
10×10 Array{Float64,2}:
⋮
⋮
If I want to find the union of two unordered sets, represented as 1D vectors, e.g.:
a = [2 4 6 8 1]
b = [1 2 5 7 9]
I can use the union function:
c = union(a,b)
which gives the answer:
c = [1 2 4 5 6 7 8 9]
However, this appears to be quite slow (relatively speaking). If I run a tic-toc test on it, I get:
>> for test = 1
tic
c = union(a,b);
toc
end
Elapsed time is 0.000906 seconds.
Whereas, if I use this much more convoluted method, I get a much quicker result:
>> for test = 1
tic
a_1 = zeros(1,9);
b_1 = zeros(1,9);
a_1(a) = 1;
b_1(b) = 1;
c_1 = or(a_1,b_1);
c = find(c_1);
toc
end
Elapsed time is 0.000100 seconds.
This still gives the same answer for c, but is about 9 times as fast (with this small example; I'm not sure how well it scales).
What is the advantage of ever using union? And can anyone suggest a more compact way of expressing the second method that I used?
Thanks
Several comments:
Your code isn't as general as union. You're assuming set members are strictly positive integers.
Your code will blow up if a set member has a large value, e.g. 10^20, as it tries to allocate an absurd quantity of memory.
A bunch of Matlab functions that don't use BLAS/LAPACK or aren't built-in are actually quite slow. They're there for convenience, but don't be shocked when you can roll something faster yourself, especially if you can specialize for your particular problem.
Representing sets using logical arrays (which is what you do) can be highly efficient for problems where the set of possible set members is quite small.