Julia supports in-place factorization of matrices (for some factorizations).
I wonder if one could also eliminate any memory allocation inside the function.
For instance, is there a way to apply a Cholesky factorization to a matrix with no hidden memory allocation?
Non-allocating LAPACK functions have bindings in Julia. They are documented in the Julia documentation under Linear Algebra - LAPACK Functions.
The in-place Cholesky factorization cholesky!(A) overwrites A and allocates only a fixed, small amount of memory, whereas cholesky(A) works on a copy of A, so its allocations (in bytes) grow quadratically with the size of A.
let n = 1000; M = rand(n,n); B = transpose(M)*M
    cholesky(B)            # warm-up call to exclude compilation from the timing
    @time cholesky(B)
    # 0.023478 seconds (5 allocations: 7.630 MiB)
end
vs
let n = 1000; M = rand(n,n); B = transpose(M)*M
    cholesky!(copy(B))     # warm-up call to exclude compilation from the timing
    @time cholesky!(B)
    # 0.021360 seconds (3 allocations: 80 bytes)
end
Performance differences are small as pointed out by Oscar Smith.
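If you want to go all the way down to zero hidden allocations, you can call the LAPACK binding mentioned above directly. A minimal sketch, assuming A is a dense, symmetric positive definite Matrix{Float64} (the Cholesky wrapper at the end is optional and does not copy):

using LinearAlgebra

A = [4.0 2.0; 2.0 3.0]                         # symmetric positive definite
_, info = LinearAlgebra.LAPACK.potrf!('U', A)  # in-place Cholesky, no copy of A
info == 0 || error("matrix is not positive definite")
# the upper triangle of A now holds U with A ≈ U'U; wrap it if needed:
C = Cholesky(A, 'U', info)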
Related
I have 5000 3D points in a matrix A and another 5000 3D points in a matrix B.
For each point in A I want to find the smallest distance to a point in B. These distances should be stored in an array with 5000 entries.
So far I have the solution below, running in about 0.145342 seconds (23 allocations: 191.079 MiB). How can I improve this further?
using Distances
A = rand(5000, 3)
B = rand(5000, 3)
mis = @time minimum(Distances.pairwise(SqEuclidean(), A, B, dims=1), dims=2)
This is a standard way to do it as it will have a better time complexity (especially for larger data):
using NearestNeighbors
nn(KDTree(B'; leafsize = 10), A')[2] .^ 2
Two comments:
by default the Euclidean distance is computed (so I square the result)
by default NearestNeighbors.jl assumes observations are stored in columns, so I need B' and A' in the solution; if your original data were transposed this would not be needed. It is designed this way because Julia uses column-major matrix storage.
Generating a big distance matrix with Distances.pairwise(SqEuclidean(), A, B, dims=1) is not efficient, because main memory is slow nowadays compared to CPU caches and the computing power of modern CPUs, and this is not going to get better any time soon (see "memory wall"). It is faster to compute the minimum on the fly using two basic nested for loops. Additionally, one can use multiple cores to speed this up with threads.
function computeMinDist(A, B)
    n, m = size(A, 1), size(B, 1)
    result = zeros(n)
    Threads.@threads for i = 1:n
        minSqDist = Inf
        @inbounds for j = 1:m
            dx = A[i,1] - B[j,1]
            dy = A[i,2] - B[j,2]
            dz = A[i,3] - B[j,3]
            sqDist = dx*dx + dy*dy + dz*dz
            if sqDist < minSqDist
                minSqDist = sqDist
            end
        end
        result[i] = minSqDist
    end
    return result
end

mis = @time computeMinDist(A, B)
Note that Julia uses 1 thread by default, but this can be tuned with the environment variable JULIA_NUM_THREADS=auto or by running Julia with the flag --threads=auto. See the multi-threading documentation for more information.
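For example, checking the thread count from a session started with the flag (on my 6-core machine, auto reports 6):

julia> Threads.nthreads()   # started as: julia --threads=auto
6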
Performance results
Here are performance results on my i5-9600KF machine with 6 cores (with two 5000x3 matrices):
Initial implementation: 93.4 ms
This implementation: 4.4 ms
This implementation is thus 21 times faster.
Results are the same to within a few ULPs.
Note that the code can certainly be optimized further using loop tiling, and possibly by transposing A and B so the JIT can generate a more efficient implementation using SIMD instructions.
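For instance, here is a sketch of the transposed variant (a hypothetical computeMinDistT taking 3×n matrices, so that each point is contiguous in Julia's column-major storage):

function computeMinDistT(At, Bt)             # At is 3×n, Bt is 3×m
    n, m = size(At, 2), size(Bt, 2)
    result = zeros(n)
    Threads.@threads for i = 1:n
        minSqDist = Inf
        @inbounds for j = 1:m
            dx = At[1,i] - Bt[1,j]
            dy = At[2,i] - Bt[2,j]
            dz = At[3,i] - Bt[3,j]
            sqDist = dx*dx + dy*dy + dz*dz
            # branchless min helps the compiler vectorize the inner loop
            minSqDist = ifelse(sqDist < minSqDist, sqDist, minSqDist)
        end
        result[i] = minSqDist
    end
    return result
end

mis = computeMinDistT(Matrix(A'), Matrix(B'))   # same result as computeMinDist(A, B)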
In Julia, I would like to concatenate several arrays (and also multiply them). Within my program, I have written it as follows:
[Uᵣ Qₐ]*Uₖ
[Vᵣ Qᵦ]*Vₖ
However, this array concatenation is very expensive compared to the rest of my program. Is there any way in Julia to concatenate arrays more cheaply/efficiently than what I have done (or than just using the hcat and vcat functions)?
The problem is that whenever you combine matrices, all the data gets copied. This happens because matrices cannot grow the way vectors do.
However, if your matrices are big enough, you can avoid copying the data by using BlockArrays.jl. The non-materializing function that combines matrices is called mortar.
Have a look at this code:
using BlockArrays, BenchmarkTools
a = rand(1000,100)
b = rand(1000,120)
z = rand(220,7)
Now let's run benchmarks:
julia> @btime [$a $b]*$z;
  1.234 ms (4 allocations: 1.73 MiB)

julia> @btime mortar(($a, $b)) * $z;
  573.100 μs (11 allocations: 55.33 KiB)

julia> all([a b]*z .≈ mortar((a, b)) * z)
true
You can see that the speedup is 2x and the difference in memory allocation is 30x. However, the results will vary with the size and shape of the matrices, so you should run your own benchmarks.
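The saving comes from ordinary block-matrix algebra: [a b]*z equals a*z₁ + b*z₂ when z is split row-wise to match the column counts of a and b. A minimal sketch of doing the same by hand with views and the five-argument mul!, so that no concatenated matrix is ever materialized:

using LinearAlgebra

za = @view z[1:size(a, 2), :]        # rows of z matching the columns of a
zb = @view z[size(a, 2)+1:end, :]    # rows of z matching the columns of b
r = a * za                           # r = a*za
mul!(r, b, zb, true, true)           # r .+= b*zb, computed in place
all(r .≈ [a b] * z)                  # true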
I have two large-ish matrices D (4096 x 40) and W (40 x 2800).
When I use @time R = D*W this gives the following stats:
38.449856 seconds (1.40 G allocations: 20.932 GiB, 55.88% gc time)
The 55.88% gc time was shocking to me. There must be a better way of doing this simple matrix calculation. Any ideas for this Julia novice?
You need to provide more information on how you generate D and W (or at least what their types are).
This is what I get:
julia> D = rand(4096, 40); W = rand(40, 2800);
julia> @time R = D * W;
0.081237 seconds (7 allocations: 87.500 MiB, 6.74% gc time)
The problem is that you're measuring compilation as well. If you run any other matrix multiplication first, you'll see a result 10x faster, with very little allocation.
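For instance, a hypothetical warm-up along the lines this comment suggests, so that compilation is excluded from the measurement:

rand(2, 2) * rand(2, 2)   # trigger compilation of * for dense Float64 matrices
@time R = D * W           # now times only the multiplication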
It did not occur to me that this was a type issue. In my code the D matrix is an Array{Float64,2} and W is an Array{Real,2}.
Converting W from Real to Float64 yields the following:
0.024569 seconds (6 allocations: 87.500 MiB, 36.84% gc time)
Much improved!
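For reference, the conversion is a one-liner. A Matrix{Real} has an abstract element type, so every element is boxed and BLAS cannot be used; converting to a concrete element type fixes that:

W64 = convert(Matrix{Float64}, W)   # or equivalently: Float64.(W)
@time R = D * W64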
Consider the basic iteration to generate N random numbers and save them in an array (assume that we are not interested in array comprehensions, and also that we don't know about calling rand(N)):
function random_numbers(N::Int)
    array = zeros(N)
    for i in 1:N
        array[i] = rand()
    end
    array
end
I am interested in a similar function that takes advantage of the cores of my laptop to generate the same array. I have checked this nice blog where the macros @everywhere, @spawn and @parallel are introduced, but there the calculation is carried out on the fly and an array is not needed to save the data.
I have the impression that this is very basic and can be done easily, perhaps using the function pmap, but I am unfamiliar with parallel computing.
My aim is to apply this method to a function that I have built to generate random numbers drawn from an unusual distribution.
I would recommend a more careful initialization of the random number generators in the parallel processes, e.g.:
# choose the seed you want
@everywhere srand(1)

# replace 10 below by the maximum process id in your case
@everywhere const LOCAL_R = randjump(Base.GLOBAL_RNG, 10)[myid()]

# here is an example usage
@everywhere f() = rand(LOCAL_R)
In this way you:
make sure that your results are reproducible;
ensure that there is no overlap between the random sequences generated by different processes.
As suggested in the comments, more clarification in the question is always welcome. However, it seems pmap will do what is required. The relevant documentation is here.
The following is an example. Note that the time spent in the pmap version is about half that of the regular map. With 16 cores, the situation might be substantially better:
julia> addprocs(2)
2-element Array{Int64,1}:
 2
 3

julia> @everywhere long_rand() = foldl(+, 0, (randn() for i = 1:10_000_000))

julia> long_rand()
-1165.9596619177153

julia> @time map(x -> long_rand(), zeros(10,10))
  8.455930 seconds (204.89 k allocations: 11.069 MiB)
10×10 Array{Float64,2}:
 ⋮

julia> @time pmap(x -> long_rand(), zeros(10,10));
  6.125479 seconds (773.08 k allocations: 42.242 MiB, 0.25% gc time)

julia> @time pmap(x -> long_rand(), zeros(10,10))
  4.609745 seconds (20.99 k allocations: 954.991 KiB)
10×10 Array{Float64,2}:
 ⋮
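Applied to the original goal, here is a sketch that fills the array in chunks (chunk_of_randoms is a hypothetical helper; it assumes nworkers() divides N, and rand() stands in for your custom sampler):

# each worker fills one chunk; the master process concatenates them
@everywhere function chunk_of_randoms(n::Int)
    array = zeros(n)
    for i in 1:n
        array[i] = rand()   # replace with your custom distribution
    end
    array
end

function random_numbers_parallel(N::Int)
    p = nworkers()
    vcat(pmap(chunk_of_randoms, fill(N ÷ p, p))...)
end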
I'm doing MC simulations and I need to generate random integers between 1 and a variable upper limit n_mol.
The specific Julia function for this is rand(1:n_mol), where n_mol is an integer that changes with every MC iteration. The problem is that doing it this way is slow (possibly an issue to open for the Julia developers). So, instead of using that particular call, I thought about generating a random float in [0,1), multiplying it by n_mol, and then taking the integer part of the result: int(rand()*n_mol). The problem now is that int() rounds to the nearest integer, so I could end up with numbers between 0 and n_mol, and I can't accept 0. The solution I'm using for the moment is ifloor(rand()*n_mol)+1, which is considerably faster than the first version, but slower than the second.
function t1(N, n_mol)
    for i = 1:N
        rand(1:n_mol)
    end
end

function t2(N, n_mol)
    for i = 1:N
        int(rand()*n_mol)
    end
end

function t3(N, n_mol)
    for i = 1:N
        ifloor(rand()*n_mol)+1
    end
end
@time t1(1e8, 123456789)
@time t2(1e8, 123456789)
@time t3(1e8, 123456789)

elapsed time: 3.256220849 seconds (176 bytes allocated)
elapsed time: 0.482307467 seconds (176 bytes allocated)
elapsed time: 0.975422095 seconds (176 bytes allocated)
So, is there any way of doing this faster, with speeds near the second test?
It's important because the MC simulation runs for more than 1e10 iterations.
The result has to be an integer because it will be used as an index into an array.
The rand(r::Range) code is quite fast, given the following two considerations. First, Julia calls a 52-bit RNG twice to obtain a random integer but only once to obtain a random float; with some bookkeeping that accounts for a factor of about 2.5. Second,

(rand(Uint) % k)

is only evenly distributed between 0 and k-1 if k is a power of 2. This is taken care of with rejection sampling, which explains more or less the remaining additional cost.
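To make the rejection step concrete, here is a sketch of the idea (unbiased_rand is a hypothetical helper, not Julia's actual implementation): discard draws from the partial block at the top of the generator's range, so every residue is equally likely.

function unbiased_rand(k::Uint)
    rem64 = (typemax(Uint) % k + 1) % k   # 2^64 mod k, computed without overflow
    x = rand(Uint)
    while x > typemax(Uint) - rem64       # reject the top partial block
        x = rand(Uint)
    end
    x % k                                 # now exactly uniform over 0:k-1
end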
If speed is extremely important, you can use a simpler random number generator than Julia's and ignore those issues, for example a linear congruential generator without rejection sampling:
function lcg(old)
    a = unsigned(2862933555777941757)
    b = unsigned(3037000493)
    a*old + b
end

function randfast(k, x::Uint)
    x = lcg(x)
    1 + rem(x, k) % Int, x
end

function t4(N, R)
    state = rand(Uint)
    for i = 1:N
        x, state = randfast(R, state)
    end
end
But be careful if the range is (really) big:

julia> m = div(typemax(Uint), 3) * 2;

julia> mean([rand(1:m)*1.0 for i in 1:10^7])
6.148922790091841e18

julia> m/2
6.148914691236517e18

but (!)

julia> mean([(rand(Uint) % m)*1.0 for i in 1:10^7])
5.123459611164573e18

julia> 5/12*m
5.124095576030431e18

(Residues below m/2 occur twice as often as the rest here, which shifts the mean of the biased sampler from m/2 down to 5m/12.)
Note that in 0.4, int() is deprecated, and you're asked to use round(Int, x) instead.

function t2(N, n_mol)
    for i = 1:N
        round(Int, rand()*n_mol)
    end
end
gives 0.27 seconds on my machine (using Julia 0.4).