I'm making my first effort to move from Matlab to Julia and have found my code to improve by ~3x, but I still think there is more to come. I'm not using any global variables in the function and have preallocated all the arrays used (I think?). Any thoughts on how it could be sped up even further would be greatly appreciated; I'll fully convert even at the current improvement, I think!
function word_sim(tau::Int, omega::Int, mu::Float64)
    # inserts a word in position (tau+1), at each point creates a new word with prob mu
    # otherwise randomly chooses a previously used. Runs the program until time omega
    words = zeros(Int32, 1, omega) # to store the words
    tests = rand(1, omega) # will compare mu to these
    words[1] = 1; # initialize the words
    next_word = 2 # will be the next word used
    words[tau+1] = omega + 1; # max possible word so insert that at time tau
    innovates = mu .> tests; # when we'll make a new word
    for i = 2:tau # simulate the process
        if innovates[i] == 1 # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    # force the word we're interested in
    for i = (tau+2):omega
        if innovates[i] == 1 # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    result = sum(words .== (omega + 1)); # count how many times our word occurred
    return result
end
and when I run it with these values it takes ~0.26 seconds on my PC:
using Statistics
@time begin
    nsim = 10^3;
    omega = 100;
    seed = [0:1:(omega-1);];
    mu = 0.01;
    results = zeros(Float64, 1, length(seed));
    pops = zeros(Int64, 1, nsim);
    for tau in seed
        for jj = 1:nsim
            pops[jj] = word_sim(tau, omega, mu);
        end
        results[tau+1] = mean(pops);
    end
end
Or perhaps I'd be better writing the code in C++? Julia was my first reaction as I've heard rave reviews about its syntax, which to be honest is fantastic!
Any comments greatly appreciated.
A 3x speedup is a nice start, but it turns out there are a few more things you can do to improve performance significantly!
As a starting point, using your example posted above in Julia 1.6.1, I get
0.301665 seconds (798.10 k allocations: 164.778 MiB, 12.70% gc time)
That's a lot of allocations, and a fair amount of garbage collector ("gc") time, so it seems we're producing a fair amount of garbage here. Some of the culprits are lines like
tests = rand(1,omega) # will compare mu to these
or
innovates = mu .> tests; # when we'll make a new word
In languages like Matlab or Python, pre-calculating these things whole-vector-at-a-time can be good for performance, but in Julia it's generally not really necessary, and can even hurt because each of these lines is causing a brand new array to be allocated. If we remove these and just generate our tests on the fly, we can avoid these allocations. One other line that allocates in here is
result = sum(words .== (omega + 1))
where you first build a whole new array before taking the sum of it. You could avoid this by writing it as a for loop (even though this may feel wrong coming from Matlab, it's quite fast in Julia). Or, to keep it as a one-liner, use either count or sum with a function that does the comparison as the first argument
result = count(x->(x == omega+1), words)
(in this example, just using an anonymous function x->(x == omega+1)).
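For completeness, the explicit loop version of that count could look like this (a sketch, not from the original code):
result = 0
for w in words
    if w == omega + 1 # the word we forced at time tau+1
        result += 1
    end
end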
Adding up these changes so far, then:
function word_sim(tau::Int, omega::Int, mu::Float64)
    # inserts a word in position (tau+1), at each point creates a new word with prob mu
    # otherwise randomly chooses a previously used. Runs the program until time omega
    words = zeros(Int32, 1, omega) # to store the words
    words[1] = 1; # initialize the words
    next_word = 2 # will be the next word used
    words[tau+1] = omega + 1; # max possible word so insert that at time tau
    for i = 2:tau # simulate the process
        if mu > rand() # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    # force the word we're interested in
    for i = (tau+2):omega
        if mu > rand() # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    result = count(x->(x == omega+1), words) # count how many times our word occurred
    return result
end
Using the same timing code, this now brings us down to
0.177766 seconds (298.10 k allocations: 51.863 MiB, 13.01% gc time)
So about half the time and half the allocations. There's still more though!
First, let's move the allocation of the words array outside of the word_sim function and instead make an in-place version of that function. We can also speed things up by adding @inbounds to the tight for loops.
function word_sim!(words::AbstractArray, tau::Int, omega::Int, mu::Float64)
    # inserts a word in position (tau+1), at each point creates a new word with prob mu
    # otherwise randomly chooses a previously used. Runs the program until time omega
    fill!(words, 0) # Probably not necessary actually, but I haven't spent enough time looking at the code to be sure
    words[1] = 1; # initialize the words
    next_word = 2 # will be the next word used
    words[tau+1] = omega + 1; # max possible word so insert that at time tau
    @inbounds for i = 2:tau # simulate the process
        if mu > rand() # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    # force the word we're interested in
    @inbounds for i = (tau+2):omega
        if mu > rand() # innovate
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    result = count(x->(x == omega+1), words) # count how many times our word occurred
    return result
end
In-place functions that modify one of their input arguments are usually denoted by a ! at the end of their name by convention in Julia, hence the new function name.
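Base follows the same convention; for example:
v = [3, 1, 2]
w = sort(v)  # allocates and returns a new sorted vector; v is unchanged
sort!(v)     # sorts v in place, with no new array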
Since we have to modify the timing code a bit to pre-allocate words now, let's also take the opportunity to put that timing code into a function to avoid any globals in the timing.
function run_word_sim()
    nsim = 10^3
    omega = 100
    seed = [0:1:(omega-1);]
    mu = 0.01
    results = zeros(Float64, 1, length(seed))
    pops = zeros(Int64, 1, nsim)
    words = zeros(Int32, 1, omega) # to store the words
    for tau in seed
        for jj = 1:nsim
            pops[jj] = word_sim!(words, tau, omega, mu)
        end
        results[tau+1] = mean(pops)
    end
    return results
end
Then, to get the most accurate timing results (and optionally some useful plots and statistics), we can use the BenchmarkTools package and its @btime or @benchmark macros:
julia> using BenchmarkTools

julia> @btime run_word_sim()
  124.178 ms (4 allocations: 10.17 KiB)
So, almost another 3x speedup, and reduced allocations and memory usage (by four or five orders of magnitude) down to only the four arrays used in the timing code (seed, results, pops and words).
For the absolute maximum performance, you could possibly go even further with LoopVectorization.jl and its @turbo macro, though it would likely require a change in algorithm since these loops depend on previous state, so they don't appear to be compatible with loop re-ordering. You could turn the count into a for loop and @turbo that for a slight additional speedup, though.
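For instance, a counting loop along these lines might work (an untested sketch; it assumes LoopVectorization.jl is installed and that @turbo accepts this Bool-to-Int reduction):
using LoopVectorization

function count_turbo(words, target)
    c = 0
    @turbo for i in eachindex(words)
        c += words[i] == target # the comparison yields a Bool, added as 0 or 1
    end
    return c
end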
There are also other options for potentially faster random number generation, such as VectorizedRNG.jl, as discussed in the discourse thread linked in the comments. While allocating a new vector of random numbers on each call of word_sim is likely not optimal, RNG is generally faster when you can generate a lot of random numbers at once. So passing a pre-allocated buffer of random numbers to word_sim! and filling it in-place with rand! (as provided by either the Random stdlib or VectorizedRNG) could yield a significant additional speedup.
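The pattern would be something like this (a sketch; the pre-allocated buffer and the way it is consumed are assumptions, not part of the code above):
using Random

buf = Vector{Float64}(undef, omega) # allocated once, outside the hot loops
rand!(buf)                          # refilled in place on each call; no allocation
# inside word_sim!, compare mu > buf[i] instead of calling mu > rand() each iteration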
Some of the tricks and rules of thumb used in this answer are discussed more generally in https://github.com/brenhinkeller/JuliaAdviceForMatlabProgrammers, along with a few other general Matlab -> Julia tips.
Given a nucleotide sequence, I'm writing some Julia code to generate a sparse vector of (masked) kmer counts, and I would like it to run as fast as possible.
Here is my current implementation:
using Distributions
using SparseArrays

function kmer_profile(seq, k, mask)
    basis = [4^i for i in (k - 1):-1:0]
    d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)
    kmer_dict = Dict{Int, Int32}(4^k=>0)
    for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        j = 1
        for i in 1:length(mask)
            if mask[i]
                kmer_hash += d[seq[n+i-1]] * basis[j]
                j += 1
            end
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
    end
    return sparsevec(kmer_dict)
end
seq = join(sample(['A','C','G','T'], 1000000))
mask_str = "111111011111001111111111111110"
mask = BitArray([parse(Bool, string(m)) for m in split(mask_str, "")])
k = sum(mask)

@time kmer_profile(seq, k, mask)
This code runs in about 0.3 seconds on my M1 MacBook Pro; is there any way to make it run significantly faster?
The function kmer_profile uses a sliding window of size length(mask) to count the number of times each masked kmer appears in the nucleotide sequence. A mask is a binary sequence, and a masked kmer is a kmer with nucleotides dropped at positions at which the mask is zero. E.g. the kmer ACGT and mask 1001 will produce the masked kmer AT.
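For instance, with a Bool mask:
julia> mask = Bool[1, 0, 0, 1];

julia> String(collect("ACGT")[mask])
"AT"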
To produce the kmer hash, the function treats each kmer as a base 4 number and then converts it to a (base 10) 64-bit integer, for indexing into the kmer vector.
The size of k is equal to the number of ones in the mask string, and is implicitly limited to 31 so that kmer hashes can fit into a 64-bit integer type.
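As a small worked example (using the same d and basis as in the code above, with k = 4 and a mask of all ones), the kmer ACGT hashes to:
julia> d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3);

julia> basis = [4^i for i in 3:-1:0];

julia> 1 + sum(d[c] * b for (c, b) in zip("ACGT", basis)) # 1 + 0*64 + 1*16 + 2*4 + 3*1
28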
There are several possible optimizations to make this code faster.
First of all, one can convert the Dict to an array, since array-based indexing is faster than dictionary-based indexing, and this is possible here since the key is an ASCII character.
Moreover, the extraction of the sequence codes can be done once instead of length(mask) times, by pre-computing the codes and putting the result in a temporary array (seq_codes in the code below).
Additionally, the mask-based conditional and the loop-carried dependency make things slow. Indeed, the condition cannot be (easily) predicted by the processor, causing it to stall for several cycles. The loop-carried dependency makes things even worse, since the processor can hardly execute other instructions during this stall. This problem can be solved by pre-computing the factors based on both mask and basis. The result is a faster branch-less loop.
Once the above optimizations are done, the biggest bottleneck is sparsevec. In fact, it was also taking nearly half the time of the initial implementation! Optimizing this step is difficult, but not impossible. It is slow because of random accesses in the Julia implementation. One can speed this up by sorting the key-value pairs first. This is faster due to a more cache-friendly execution, and it can also help the prediction unit of the processor. This is a complex topic; for more details about how this works, please read Why is processing a sorted array faster than processing an unsorted array?.
Here is the final optimized code:
function kmer_profile_opt(seq, k, mask)
    basis = [4^i for i in (k - 1):-1:0]
    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3
    seq_codes = [d[Int8(e)] for e in seq]
    j = 1
    premult = zeros(Int64, length(mask))
    for i in 1:length(mask)
        if mask[i]
            premult[i] = basis[j]
            j += 1
        end
    end
    kmer_dict = Dict{Int, Int32}(4^k=>0)
    for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        for i in 1:length(mask)
            kmer_hash += seq_codes[n+i-1] * premult[i]
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
    end
    sorted_kmer_pairs = sort(collect(kmer_dict))
    sorted_kmer_keys = [e[1] for e in sorted_kmer_pairs]
    sorted_kmer_values = [e[2] for e in sorted_kmer_pairs]
    return sparsevec(sorted_kmer_keys, sorted_kmer_values)
end
This code is a bit more than twice as fast as the initial implementation on my machine. A significant fraction of the time is still spent in the sorting algorithm.
The code can still be optimized further. One way is to use a parallel sort algorithm. Another way is to replace the premult[i] multiplication by a shift, which is faster, assuming premult[i] is modified so as to contain exponents. I expect the resulting code to be about 4 times faster than the original code. The main bottleneck should then be the big dictionary creation. Further improving the performance of this is very hard (though it is still possible).
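The shift idea would look roughly like this (a sketch; offsets and shifts are illustrative names, and iterating only over the masked positions sidesteps the fact that a shift by zero, unlike a multiplication by zero, would not discard the unmasked entries):
offsets = findall(mask)              # the k positions that contribute to the hash
shifts  = [2 * (k - j) for j in 1:k] # since 4^(k-j) == 1 << (2*(k-j))
# the inner loop then becomes:
# for (i, s) in zip(offsets, shifts)
#     kmer_hash += seq_codes[n+i-1] << s
# end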
Inspired by Jérôme's answer, and squeezing some more by avoiding Dicts altogether:
function kmer_profile_opt3a(seq, k, mask)
    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3
    seq_codes = [d[Int8(e)] for e in seq]
    basis = [4^i for i in (k-1):-1:0]
    j = 1
    premult = zeros(Int64, length(mask))
    for i in 1:length(mask)
        if mask[i]
            premult[i] = basis[j]
            j += 1
        end
    end
    kmer_vec = Vector{Int}(undef, length(seq)-length(mask)+1)
    @inbounds for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        for i in 1:length(mask)
            kmer_hash += seq_codes[n+i-1] * premult[i]
        end
        kmer_vec[n] = kmer_hash
    end
    sort!(kmer_vec)
    return sparsevec(kmer_vec, ones(length(kmer_vec)), 4^k, +)
end
This achieved another 2x over Jérôme's answer on my machine.
The auto-combining feature of sparsevec makes the code a bit more compact.
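That is, when sparsevec is given repeated indices together with a combine function, it merges the duplicates; for example:
using SparseArrays
v = sparsevec([2, 2, 5], [1.0, 1.0, 1.0], 6, +)
# index 2 appears twice and is combined with +, so v[2] == 2.0 and v[5] == 1.0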
Trying to slim the code further, and avoid unnecessary allocations in sparse vector creation, the following can be used:
using SparseArrays, LinearAlgebra

function specialsparsevec(nzs, n)
    vals = Vector{Int}(undef, length(nzs))
    j, k, count, last = (1, 1, 0, nzs[1])
    while k <= length(nzs)
        if nzs[k] == last
            count += 1
        else
            vals[j], nzs[j] = (count, last)
            count, last = (1, nzs[k])
            j += 1
        end
        k += 1
    end
    vals[j], nzs[j] = (count, last)
    resize!(nzs, j)
    resize!(vals, j)
    return SparseVector(n, nzs, vals)
end
function kmer_profile_opt3(seq, k, mask)
    d = zeros(Int8, 128)
    foreach(((i,c),) -> d[Int(c)]=i-1, enumerate(collect("ACGT")))
    seq_codes = getindex.(Ref(d), Int8.(collect(seq)))
    premult = foldr(
        (i,(p,j))->(mask[i] && (p[i]=j ; j<<=2) ; (p,j)),
        1:length(mask); init=(zeros(Int64,length(mask)),1)) |> first
    kmer_vec = sort(
        [ dot(@view(seq_codes[n:n+length(mask)-1]),premult) + 1 for
          n in 1:(length(seq)-length(mask)+1)
        ])
    return specialsparsevec(kmer_vec, 4^k)
end
This last version gets another 10% speedup (but is a little cryptic):
julia> @btime kmer_profile_opt($seq, $k, $mask);
  367.584 ms (81 allocations: 134.71 MiB)  # other answer

julia> @btime kmer_profile_opt3a($seq, $k, $mask);
  140.882 ms (22 allocations: 54.36 MiB)   # 1st this answer

julia> @btime kmer_profile_opt3($seq, $k, $mask);
  127.016 ms (14 allocations: 27.66 MiB)   # 2nd this answer
I am working on a scientific code that is experiencing issues with parallelization.
The parallel version is slower than the serial one and I am not sure if the right approaches are used for this application.
How can I improve the performance of the parallel calculation?
Is the right approach being used or should other packages / functions be considered for parallelization?
I have already tried a larger workload; however, this makes no difference.
I suspect the problem is somehow due to data movement between workers, but I don't know how to check or improve this.
Parallel programming with Julia is still relatively new for me, so I am very grateful for any help!
The simulation code is something of a benchmark for the Julia programming language, as our team is considering using Julia for all future projects if strong performance advantages over the current workflow can be demonstrated.
Because of this, I would like to maximize performance, especially since calculations with very large models, as well as possible use on a cluster, are planned.
Minimum Working Example
The critical parts of the code can be broken down into the following example.
I start the process as follows:
using Distributed
addprocs();
@everywhere using SharedArrays, LinearAlgebra, Test
First I define the simulation model, containing all data used for the calculations.
Is it actually okay to store SharedArrays with other data in a struct, or should a different approach be used?
@everywhere struct Model
    idx::Vector{Tuple{Int,Int}} # indices
    A::SharedMatrix{Float64}    # results, will be constantly updated
    B::Vector{Float64}          # part of pre-processing, will only be read
end
See the non-parallel version of the function used for the update of the model below.
function update(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    for (i,j) in m.idx
        cnt += 1
        L = norm(m.A[:,i] - m.A[:,j])
        k = factor * m.B[cnt]
        m.A[:,i] .+= k*L
        m.A[:,j] .-= k*L
    end
end
For parallelization, I simply tried the following. Would an approach with pmap perhaps be better in this case?
@everywhere function parallel_update(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    @sync @distributed for (i,j) in m.idx
        cnt += 1
        L = norm(m.A[:,i] - m.A[:,j])
        k = factor * m.B[cnt]
        m.A[:,i] .+= k*L
        m.A[:,j] .-= k*L
    end
end
To test the results I use the following function:
@everywhere function test_my_code()
    # provide some data
    n = 10000000
    idx = [(rand(1:n),rand(1:n)) for k=1:n]
    A = SharedArray(hcat(([rand(0.:1000.);rand(0.:1000.);rand(0.:1000.)] for k=1:n)...))
    B = [rand(0.:1000.) for k=1:n]
    # define models
    model1 = Model(idx,A,B)
    model2 = Model(idx,A,B)
    # test and compare results
    @time update(model1,2.)
    @time parallel_update(model2,2.)
    @test model1 == model2
end
julia> test_my_code() # first run
6.350694 seconds (50.00 M allocations: 5.215 GiB, 13.66% gc time)
11.422999 seconds (6.69 k allocations: 446.156 KiB)
Test Passed
julia> test_my_code() # second run
6.286828 seconds (50.00 M allocations: 5.215 GiB, 18.35% gc time)
6.297144 seconds (2.92 k allocations: 143.516 KiB)
Test Passed
Note: significant performance improvements for the serial code
I was already able to significantly improve the performance of the serial function and reduce the number of allocations to zero.
Since this seems to make no difference to the parallelization problem, I used the shorter, easier-to-read version for the previous example.
See the serial code below.
using LinearAlgebra, Test

struct Model
    idx::Vector{Tuple{Int,Int}}
    A::Matrix{Float64}
    B::Vector{Float64}
end

function update(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    for (i,j) in m.idx
        cnt += 1
        L = norm(m.A[:,i] - m.A[:,j])
        k = factor * m.B[cnt]
        m.A[:,i] .+= k*L
        m.A[:,j] .-= k*L
    end
end

function update_fast(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    for (i,j) in m.idx
        cnt += 1
        L = sqrt((m.A[1,i]-m.A[1,j])^2 +
                 (m.A[2,i]-m.A[2,j])^2 +
                 (m.A[3,i]-m.A[3,j])^2)
        k = factor * m.B[cnt]
        m.A[1,i] += k*L
        m.A[2,i] += k*L
        m.A[3,i] += k*L
        m.A[1,j] -= k*L
        m.A[2,j] -= k*L
        m.A[3,j] -= k*L
    end
end

function test_serial_speedup()
    n = 10000000
    idx = [(rand(1:n),rand(1:n)) for k=1:n]
    A = hcat(([rand(0.:1000.);rand(0.:1000.);rand(0.:1000.)] for k=1:n)...)
    B = [rand(0.:1000.) for k=1:n]
    model1 = Model(idx,A,B)
    model2 = Model(idx,A,B)
    @time update(model1,2.)
    @time update_fast(model2,2.)
    @test model1 == model2
end
julia> test_serial_speedup()
5.008049 seconds (50.00 M allocations: 5.215 GiB, 18.14% gc time)
0.464986 seconds
Test Passed
Hi community, I'm new here, and also new to Julia 1.0.3. These days I'm studying Storkey learning rules in different number systems. In my first attempt at coding these ideas, I tried this naive code:
function storkey_learning_first(U)
    # The memories are given by the columns
    row, col = size(U)
    # First W matrix
    W_new = zeros(row, row)
    for mu = 1:col
        W_old = copy(W_new)
        for i = 1:row
            for j = i:row
                s = 0.0
                # Putting this value in the new matrix
                s += U[i,mu]*U[j,mu]
                s -= local_field_opt(W_old, U[:,mu], i, j, row)*U[j,mu]
                s -= local_field_opt(W_old, U[:,mu], j, i, row)*U[i,mu]
                s *= 1/row
                W_new[i,j] += s
                W_new[j,i] = W_new[i,j]
            end
        end
    end
    return W_new
end
which is the main function, with the "local field" given by
function local_field_opt(W_old, U, i, j, row)
    hij = 0.0
    for k = 1:row
        if k != i && k != j
            hij += W_old[i,k]*U[k]
        end
    end
    return hij
end
Given an n-dimensional real-valued vector, this code creates a matrix of dimension (n x n). It works for lower-dimensional vectors, but it is really slow for higher-dimensional arrays; in fact, I want to store vectors of dimension n = 8192. I would also like to work with lower-dimensional complex-valued vectors, or quaternions, but I can't do better even in the real case. In a second attempt I separated the complete structure into two functions; in particular, I separated the two inner loops, avoiding calling the same elements repeatedly:
function inner(U, W_old, W_mu, mu, row)
    # Calling one time the column
    U_mu = U[:,mu]
    for j = 1:row
        U_j_mu = U[j,mu]
        for i = j:row
            U_i_mu = U[i,mu]
            s = 0.0
            s += U_i_mu*U_j_mu
            s -= U_i_mu*local_field_opt(W_old, U_mu, j, i, row)
            s -= local_field_opt(W_old, U_mu, i, j, row)*U_j_mu
            s *= 1/row
            W_mu[i,j] += s
            W_mu[j,i] = W_mu[i,j]
        end
    end
    return W_mu
end
With this I gained a few seconds. How can I improve my code in this particular case? And the use of complex or quaternion numbers: should that be a considerable additional burden? Finally, this is the timing I'm currently getting for vectors of dimension n = 1352:
@time W = RealStorkey.storkey_learning(U, RealStorkey.first)
349.680284 seconds (268 allocations: 1.376 GiB, 0.09% gc time)
My problem is roughly as follows: given a numerical matrix X, where each row is an item, I want to find each row's nearest neighbor in terms of L2 distance among all rows except itself. I tried reading the official documentation but was still a little confused about how to achieve this. Could someone give me a hint?
My code is as follows:
function l2_dist(v1, v2)
    return sqrt(sum((v1 - v2) .^ 2))
end

function main(Mat, dist_fun)
    n = size(Mat, 1)
    Dist = SharedArray{Float64}(n) #[Inf for i in 1:n]
    Id = SharedArray{Int64}(n) #[-1 for i in 1:n]
    @parallel for i = 1:n
        Dist[i] = Inf
        Id[i] = 0
    end
    Threads.@threads for i in 1:n
        for j in 1:n
            if i != j
                println(i, j)
                dist_temp = dist_fun(Mat[i, :], Mat[j, :])
                if dist_temp < Dist[i]
                    println("Dist updated!")
                    Dist[i] = dist_temp
                    Id[i] = j
                end
            end
        end
    end
    return Dict("Dist" => Dist, "Id" => Id)
end

n = 4000
p = 30
X = [rand() for i in 1:n, j in 1:p];
main(X[1:30, :], l2_dist)
@time N = main(X, l2_dist)
I'm trying to distribute all the i's (i.e. calculating each row's minimum) over different cores, but the version above apparently isn't working correctly. It is even slower than the sequential version. Can someone point me in the right direction? Thanks.
Maybe you're doing something in addition to what you have written down, but from what I can see, you aren't actually doing any computations in parallel. Julia requires you to tell it how many processors (or threads) you would like it to have access to. You can do this through either:
Starting Julia with multiple processors: julia -p # (where # is the number of processors you want Julia to have access to), or
Calling the addprocs function once you have started a Julia "session" to add additional processors (see the short example after this list).
To have more than 1 thread, you need to run the command export JULIA_NUM_THREADS=# before starting Julia. I don't know very much about threading, so I will be sticking with the @parallel macro. I suggest reading the documentation for more details on threading -- maybe @Chris Rackauckas could expand a little more on the difference.
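For example (an illustrative sketch; the worker count is arbitrary):
using Distributed # needed on Julia 0.7+; on 0.6, addprocs is available without the import
addprocs(4)       # add 4 worker processes
nprocs()          # returns 5: the master process plus the 4 workers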
A few comments below about my code and on your code:
I'm on version 0.6.1-pre.0. I don't think I'm doing anything 0.6 specific, but this is a heads up just in case.
I'm going to use the Distances.jl package when computing the distances between vectors. I think it is a good habit to farm out as many of my computations to well-written and well-maintained packages as possible.
Rather than compute the distance between rows, I'm going to compute the distance between columns. This is because Julia is a column-major language, so this will increase the number of cache hits and give a little extra speed (a short illustration follows these comments). You can obviously get the row-wise results you want by just transposing the input.
Unless you expect the output itself to be large, that many allocations are a sign that something in your code is inefficient. It is often a type-stability problem. I don't know if that was the case in your code before, but that doesn't seem to be an issue in the current version (it wasn't immediately clear to me why you were having so many allocations).
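Here is the column-major illustration mentioned above (hypothetical sizes, just to show the memory layout):
X = rand(30, 4000)  # 30 features per item, one item per column
col = view(X, :, 1) # a column: contiguous in memory, cache-friendly
row = view(X, 1, :) # a row: strided by 30 elements, so more cache misses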
Code is below
# Make sure all processors have access to Distances package
@everywhere using Distances

# Create a random matrix
nrow = 30
ncol = 4000

# Seed creation of random matrix so it is always same matrix
srand(42)
X = rand(nrow, ncol)

function main(X::AbstractMatrix{Float64}, M::Distances.Metric)
    # Get size of the matrix
    nrow, ncol = size(X)

    # Create `SharedArray`s to store output
    ind_vec = SharedArray{Int}(ncol)
    dist_vec = SharedArray{Float64}(ncol)

    # Compute the distance between columns
    @sync @parallel for i in 1:ncol
        # Initialize various temporary variables
        min_dist_i = Inf
        min_ind_i = -1
        X_i = view(X, :, i)

        # Check distance against all other columns
        for j in 1:ncol
            # Skip comparison with itself
            if i == j
                continue
            end

            # Tell us who is doing the work
            # (can uncomment if you want to verify stuff)
            # println("Column $i compared with Column $j by worker $(myid())")

            # Evaluate the new distance...
            # If it is less then replace it, otherwise proceed
            dist_temp = evaluate(M, X_i, view(X, :, j))
            if dist_temp < min_dist_i
                min_dist_i = dist_temp
                min_ind_i = j
            end
        end

        # Which column is minimum distance from column i
        dist_vec[i] = min_dist_i
        ind_vec[i] = min_ind_i
    end

    return dist_vec, ind_vec
end

# Using Euclidean metric
metric = Euclidean()
dist, inds = main(X, metric) # main returns (dist_vec, ind_vec) in that order
@time main(X, metric);
@show dist[[1, 5, 25]], inds[[1, 5, 25]]
You can run the code with
1 processor julia testfile.jl
% julia testfile.jl
0.640365 seconds (16.00 M allocations: 732.495 MiB, 3.70% gc time)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([1.40892, 1.38206, 1.32184], [2541, 2459, 1602])
n processors (in this case 4) julia -p n testfile.jl
% julia -p 4 testfile.jl
0.201523 seconds (2.10 k allocations: 99.107 KiB)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([1.40892, 1.38206, 1.32184], [2541, 2459, 1602])
I'm trying to oversimplify this as much as possible.
Functions f1 and f2 implement a very simplified version of roulette wheel selection over a Vector R. The only difference between them is that f1 uses a for loop and f2 a while loop. Both functions return the index of the array where the condition was met.
R = rand(100)

function f1(X::Vector)
    l = length(X)
    r = rand()*X[l]
    for i = 1:l
        if r <= X[i]
            return i
        end
    end
end

function f2(X::Vector)
    l = length(X)
    r = rand()*X[l]
    i = 1
    while true
        if r <= X[i]
            return i
        end
        i += 1
    end
end
Now I created a couple of test functions...
M is the number of times we repeat the function execution.
Now this is critical... I want to store the values I get from the functions because I need them later... To oversimplify the code I just created a new variable r where I sum up the returns from the functions.
function test01(M,R)
    cumR = cumsum(R)
    r = 0
    for i = 1:M
        a = f1(cumR)
        r += a
    end
    return r
end

function test02(M,R)
    cumR = cumsum(R)
    r = 0
    for i = 1:M
        a = f2(cumR)
        r += a
    end
    return r
end
So, next I get:
@time test01(1e7,R)
elapsed time: 1.263974802 seconds (320000832 bytes allocated, 15.06% gc time)

@time test02(1e7,R)
elapsed time: 0.57086421 seconds (1088 bytes allocated)
So, for some reason I can't figure out, f1 allocates a lot of memory, and the amount gets even larger as M grows.
I said the line r += a was critical because, if I remove it from both test functions, I get the same result with both tests, so no problems! So I thought there was a problem with the type of a being returned by the functions (because f1 returns the iterator of the for loop, and f2 uses its own variable i, "manually declared" inside the function).
But...
aa = f1(cumsum(R))
bb = f2(cumsum(R))
typeof(aa) == typeof(bb)
true
So... what the hell is going on???
I apologize if this is some sort of basic question, but I've been going over this for over 3 hours now and couldn't find an answer... Even though the functions are fixed by using a while loop, I hate not knowing what's going on.
Thanks.
When you see lots of surprising allocations like that, a good first thing to check is type-stability. The @code_warntype macro is very helpful here:
julia> @code_warntype f1(R)
# … lots of annotated code, but the important part is this last line:
end::Union{Int64,Void}
Compare that to f2:
julia> @code_warntype f2(R)
# ...
end::Int64
So, why are the two different? Julia thinks that f1 might sometimes return nothing (which is of type Void)! Look again at your f1 function: what would happen if the last element of X is NaN? It'll just fall off the end of the function with no explicit return statement. In f2, however, you'll end up indexing beyond the bounds of X and get an error instead. Fix this type-instability by deciding what to do if the loop completes without finding the answer, and you'll see more similar timings.
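One possible fix looks like the following sketch (the choice of fallback value is an assumption; any type-stable choice, or an explicit error, works):
function f1_fixed(X::Vector)
    l = length(X)
    r = rand()*X[l]
    for i = 1:l
        if r <= X[i]
            return i
        end
    end
    return l # fallback so the function always returns an Int
end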
As I stated in the comment, your functions f1 and f2 both generate random numbers inside them, and you use those random numbers as the stopping criterion. Thus, there is no deterministic way to measure which of the functions is faster (the outcome doesn't depend only on the implementation).
You can change the f1 and f2 functions to accept r as a parameter:
function f1(X::Vector, r)
    for i = 1:length(X)
        if r <= X[i]
            return i
        end
    end
end

function f2(X::Vector, r)
    i = 1
    while i <= length(X)
        if r <= X[i]
            return i
        end
        i += 1
    end
end
And then measure the time properly with the same R and r for both functions:
julia> R = cumsum(rand(100))

julia> r = rand(1_000_000) * R[end] # generate 1_000_000 random thresholds

julia> @time for i=1:length(r); f1(R, r[i]); end;
0.177048 seconds (4.00 M allocations: 76.278 MB, 2.70% gc time)

julia> @time for i=1:length(r); f2(R, r[i]); end;
0.173244 seconds (4.00 M allocations: 76.278 MB, 2.76% gc time)
As you can see, the timings are now nearly identical. Any difference is caused by external factors (warm-up, or the processor being busy with other tasks).