Efficient/cheap way to concatenate arrays in Julia?

In Julia, I would like to concatenate several arrays (and also multiply them). Within my program, I have written it as follows:
[Uᵣ Qₐ]*Uₖ
[Vᵣ Qᵦ]*Vₖ
However, this array concatenation is very expensive compared to the rest of my program. Is there any way in Julia to concatenate arrays more cheaply/efficiently than what I have done (or than using the hcat/vcat functions)?

The problem is that whenever you concatenate matrices, all of the data gets copied. This happens because matrices cannot grow in place the way vectors can.
However, if your matrices are big enough, you can avoid copying the data by using BlockArrays.jl. Its mortar function combines matrices into a block matrix without materializing it.
Have a look at this code:
using BlockArrays, BenchmarkTools
a = rand(1000,100)
b = rand(1000,120)
z = rand(220,7)
Now let's run benchmarks:
julia> @btime [$a $b]*$z;
1.234 ms (4 allocations: 1.73 MiB)
julia> @btime mortar(($a, $b)) * $z;
573.100 μs (11 allocations: 55.33 KiB)
julia> all([a b]*z .≈ mortar((a, b)) * z)
true
You can see that the speedup is about 2x and the difference in memory allocation is about 30x. However, the results will vary depending on the size and shape of the matrices, so you should run your own benchmarks.
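If you would rather avoid the extra dependency, the same saving is available by hand, because a block product is just a sum of smaller products: [A B]*z == A*z₁ + B*z₂, where z₁ and z₂ are the top and bottom blocks of z. A minimal sketch with the sizes above (the 1:100 and 101:220 ranges match the column counts of a and b):
R = a * view(z, 1:100, :) + b * view(z, 101:220, :)   # views avoid copying the blocks of z
all(R .≈ [a b] * z)   # should be true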

Related

Factorize a Matrix without Any Memory Allocations in Julia

Julia supports in-place factorization of matrices (for some factorizations).
I wonder if one could also eliminate any allocation of memory inside the function.
For instance, is there a way to apply a Cholesky factorization on a matrix with no hidden memory allocation?
Non-allocating LAPACK functions have bindings in Julia. They are documented in the Julia documentation under Linear Algebra - LAPACK Functions.
The Cholesky factorization cholesky!(A) overwrites A and allocates only a fixed, small amount of memory, whereas cholesky(A) allocates much more: its allocated bytes grow quadratically with the size of A.
using LinearAlgebra
let n = 1000; M = rand(n,n); B = transpose(M)*M
    cholesky(B)
    @time cholesky(B)
    # 0.023478 seconds (5 allocations: 7.630 MiB)
end
vs
let n = 1000; M = rand(n,n); B = transpose(M)*M
    cholesky!(copy(B))
    @time cholesky!(B)
    # 0.021360 seconds (3 allocations: 80 bytes)
end
Performance differences are small, as pointed out by Oscar Smith.
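If you want to eliminate even that fixed 80-byte overhead (it comes from the returned Cholesky wrapper object), one option is to call the LAPACK binding directly. A minimal sketch; note that potrf! works on the raw matrix and returns (A, info) rather than a Cholesky object, so you are responsible for checking info yourself:
using LinearAlgebra
n = 1000
M = rand(n, n)
B = transpose(M) * M
A = copy(B)
LAPACK.potrf!('U', A)          # warm-up: factorize in place
copyto!(A, B)                  # restore A before refactorizing
@time LAPACK.potrf!('U', A)    # in-place factorization, no wrapper allocated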

Julia Val{c}() seems slow compared to dictionary lookup

I'm still learning Julia's multiple dispatch and value-as-type approach.
Instantiating Val{c}() seems about 50 times slower than a dictionary lookup.
After that, dispatch seems 6 times faster than a dictionary lookup.
Are these durations expected? Is it possible to speed up the instantiation of Val{c}()?
using BenchmarkTools
rand_n = rand([4,11], 1_000_000)
simple_dict = Dict(4 => 11, 11 => 4)
call_dict(num) = simple_dict[num]
@benchmark call_dict.($rand_n) # 42.113ms
val_type(::Val{4}) = 11
val_type(::Val{11}) = 4
@benchmark Val.($rand_n) # 2.4s
partial_result = Val.(rand_n)
@benchmark val_type.($partial_result) # 7ms
Tricks like these can be great, but they can also take you into dangerous territory. You get a boost when you have only 2 val_type methods; to reprise your results:
julia> rand_n = [4, 11, 4]
3-element Vector{Int64}:
4
11
4
julia> vrand_n = Val.(rand_n)
3-element Vector{Val}:
Val{4}()
Val{11}()
Val{4}()
julia> val_type(::Val{4}) = 11
val_type (generic function with 1 method)
julia> val_type(::Val{11}) = 4
val_type (generic function with 2 methods)
julia> using BenchmarkTools
julia> @btime val_type.($vrand_n);
28.421 ns (1 allocation: 112 bytes)
But look what happens when you have 5:
julia> val_type(::Val{2}) = 0
val_type (generic function with 3 methods)
julia> val_type(::Val{3}) = 0
val_type (generic function with 4 methods)
julia> val_type(::Val{7}) = 0
val_type (generic function with 5 methods)
julia> @btime val_type.($vrand_n);
95.008 ns (1 allocation: 112 bytes)
Importantly, I didn't even have to create any such objects to observe the slowdown. Moreover, this is much worse than a fixed version of your Dict-based method:
julia> const simple_dict = Dict(4 => 11, 11 => 4)
Dict{Int64, Int64} with 2 entries:
4 => 11
11 => 4
julia> call_dict(num) = simple_dict[num]
call_dict (generic function with 1 method)
julia> @btime call_dict.($rand_n);
39.674 ns (1 allocation: 112 bytes)
(That const is crucial; see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.)
Why? The key is to look at the type of object you're working with:
julia> eltype(vrand_n)
Val
julia> isconcretetype(eltype(vrand_n))
false
This explains why it can be slow: when your iteration extracts the next element, Julia can't predict the concrete type of the object. So it has to use runtime dispatch, which is essentially a glorified dictionary lookup. Unfortunately, it's one where the comparison of keys is much more complicated than just looking up an Int. So you lose quite a lot of performance.
Why is it so much faster when there are only two methods? Because Julia tries to be really smart: it checks how many methods there are, and if there are 3 or fewer, it generates optimized code that checks the type in a simple if branch rather than invoking the full machinery of type intersection. You can read more details here.
Newcomers to Julia, once they learn the wonders of specialization and the big runtime performance improvements it delivers, often get excited to try to use the type system for everything, and Val-based dispatch is often a tool they reach for. But inferrability is a key component of the speed advantage of multiple dispatch, so when you use a design that breaks inferrability, you lose the advantage to such an extent that it can be worse than less "fancy" methods.
The bottom line: for the demo you were trying, you'll be much better off if you stick to Dict. There are cases where Val-based dispatch is useful: generally, when a single runtime dispatch sets you up for a whole sequence of subsequently inferrable calls, that can be a win. But you should use it judiciously, and (as you have done) always profile your results.
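For completeness: if you really do want the Val-based version to be fast in this demo, you can restore inferrability by giving the vector a small concrete Union element type, which lets the compiler union-split the dispatch into simple branches. A sketch (not a recommendation to prefer this over the Dict):
julia> vrand_u = Union{Val{4}, Val{11}}[Val(n) for n in rand_n];
julia> @btime val_type.($vrand_u);
Because the element type is now a Union of two concrete types, broadcasting val_type over it compiles to a cheap branch per element instead of a runtime dispatch.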

Why does simple matrix multiplication occupy so much garbage collector time in Julia?

I have two large-ish matrices D (4096 x 40) and W (40 x 2800).
When I use @time R = D*W, I get the following stats:
38.449856 seconds (1.40 G allocations: 20.932 GiB, 55.88% gc time)
The 55.88% gc time was shocking to me. There must be a better way of doing this simple matrix calculation. Any ideas for this Julia novice?
You need to provide more information on how you generate D and W (or at least what their types are).
This is what I get:
julia> D = rand(4096, 40); W = rand(40, 2800);
julia> @time R = D * W;
0.081237 seconds (7 allocations: 87.500 MiB, 6.74% gc time)
The problem is that you're measuring compilation as well. If you run any other matrix multiplication first, you'll see a result 10x faster with very little allocation.
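For example, a minimal sketch of separating compilation from measurement (BenchmarkTools' @btime also handles this for you):
D * W              # first call compiles the specialized method
@time R = D * W    # second call times only the multiplication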
It did not occur to me this was a type issue. In my code the D matrix is Array{Float64,2} and W is Array{Real,2}.
Converting from Real to Float64 yields the following:
0.024569 seconds (6 allocations: 87.500 MiB, 36.84% gc time)
Much improved!
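For reference, a minimal sketch of that conversion (the name Wf is made up; any concrete element type works):
Wf = convert(Matrix{Float64}, W)   # or equivalently Float64.(W)
@time R = D * Wf                   # concrete eltype dispatches to the fast BLAS path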

Parallel random numbers julia

Consider the basic iteration to generate N random numbers and save them in an array (assume that we are not interested in array comprehensions and that we don't know about calling rand(N)):
function random_numbers(N::Int)
    array = zeros(N)
    for i in 1:N
        array[i] = rand()
    end
    array
end
I am interested in a similar function that takes advantage of the cores of my laptop to generate the same array. I have checked this nice blog where the macros @everywhere, @spawn and @parallel are introduced, but there the calculation is carried out "on-the-fly" and an array is not needed to save the data.
I have the impression that this is very basic and can be done easily, perhaps using the function pmap, but I am unfamiliar with parallel computing.
My aim is to apply this method to a function that I have built to generate random numbers drawn from an unusual distribution.
I would recommend doing a more careful initialization of the random number generators in the parallel processes, e.g.:
# choose the seed you want
@everywhere srand(1)
# replace 10 below by the maximum process id in your case
@everywhere const LOCAL_R = randjump(Base.GLOBAL_RNG, 10)[myid()]
# here is an example usage
@everywhere f() = rand(LOCAL_R)
In this way you:
make sure that your results are reproducible;
guarantee that there is no overlap between the random sequences generated by different processes.
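On current Julia versions this API has moved: srand is now Random.seed!, and randjump lives in the Future standard library. A minimal sketch along the same lines (the seed 1 and the jump stride are arbitrary choices; the stride must be a multiple of big(10)^20, and computing the polynomial for a new stride can take a moment):
using Distributed
addprocs(2)
@everywhere using Random, Future
# each process jumps a distinct multiple of 10^20 steps from the same seed,
# so the streams are reproducible and do not overlap
@everywhere const LOCAL_R = Future.randjump(MersenneTwister(1), big(10)^20 * myid())
@everywhere f() = rand(LOCAL_R)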
As suggested in the comments, more clarification in the question is always welcome. However, it seems pmap will do what is required. The relevant documentation is here.
The following is an example. Note that the time spent in the pmap method is about half that of the regular map. With 16 cores, the situation might be substantially better:
julia> addprocs(2)
2-element Array{Int64,1}:
2
3
julia> @everywhere long_rand() = foldl(+,0,(randn() for i=1:10_000_000))
julia> long_rand()
-1165.9596619177153
julia> @time map(x->long_rand(), zeros(10,10))
8.455930 seconds (204.89 k allocations: 11.069 MiB)
10×10 Array{Float64,2}:
⋮
⋮
julia> @time pmap(x->long_rand(), zeros(10,10));
6.125479 seconds (773.08 k allocations: 42.242 MiB, 0.25% gc time)
julia> @time pmap(x->long_rand(), zeros(10,10))
4.609745 seconds (20.99 k allocations: 954.991 KiB)
10×10 Array{Float64,2}:
⋮
⋮

Julia - why are loops faster

I have a background in MATLAB so I have the tendency to vectorize everything. However, in Julia, I tested these two functions:
function testVec(n)
    t = [0 0 0 0]
    for i = 1:n
        for j = 1:4
            t[j] = i
        end
    end
end
function testVec2(n)
    t = [0 0 0 0]
    for i = 1:n
        t .= [i i i i]
    end
end
@time testVec(10^4)
0.000029 seconds (6 allocations: 288 bytes)
@time testVec2(10^4)
0.000844 seconds (47.96 k allocations: 1.648 MiB)
I have two questions:
Why are loops faster?
If loops are indeed faster, are there "smart" vectorization techniques that mimic loops? The syntax for loops is ugly and long.
It's all loops under the hood. Vectorized expressions get translated to loops, both in Julia and in MATLAB; in the end it's all loops. In your particular example, it is as @sam says: you're allocating a bunch of extra arrays that you can avoid by looping explicitly. The reason you still vectorize in MATLAB is that everything then gets shuffled into functions written in a high-performance language (C or Fortran, probably), so it's worth it even when you do extra allocations.
Indeed there are, as @sam showed. Here's a blog post that tells you all you need to know about broadcasting and loop fusion.
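For instance, a minimal fused-broadcast sketch (x and y are made-up names):
x = rand(1000)
y = similar(x)
y .= 2 .* x .+ sin.(x)   # all the dots fuse into a single loop that writes into y, with no temporaries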
In the testVec2 method, the code allocates a temporary vector to hold [i i i i] for every value of i in your loop. This allocation is not free. You can see evidence of this in the number of allocations printed in your timing results. You could try the following:
function testVec3(n)
    t = [0 0 0 0]
    for i = 1:n
        t .= i
    end
end
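Timing this version yourself should show the per-iteration temporary gone, since t .= i broadcasts the scalar into the existing array:
@time testVec3(10^4)   # allocations should be close to testVec's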
