Julia: random multinomial numbers allocation - performance

I have to generate a vector of random number drawn from a Multinomial distribution. Since I need to do it inside a loop I would like to reduce the number of allocations.
This is my code:
pop = zeros(100)
p = rand(100)
p /= sum(p)
#imagine the next line inside a for loop
#time pop = rand(Multinomial(100, p)) #test1
#time pop .= rand(Multinomial(100, p)) #test2
Why test1 is 2 allocations and test2 is 4? Is there a way of doing it with 0 allocations?
Thanks in advance.

Each sample of a multinomial distribution Multinomial(n, p) is a n-dimensional integer vector that sums to n. So this vector is allocated inside of rand.
I believe you want using rand! which works in-place:
julia> m = Multinomial(100, p);
julia> #time rand!(m, pop);
0.000010 seconds
Note that I create m object first, as creation of it allocates, so it should be done once.

Related

IDL: How do I loop through all possible values of shift without using a loop?

I have a 3 dimensional array which I am seeking to shift, multiply (element wise) with the un-shifted array, and sum all the products. For a given set of the three integers defining the shift (i,j,k) the sum corresponding to this triple will be stored in another 3 dimensional array at said location (i,j,k).
I am attempting to do this for all possible values of i,j,k. Is there a way to do this without explicitly using for/while loops? I am highly interested in minimizing the computational time but IDL doesnt understand the current instructions I am providing:
FUNCTION ccs, vecx, vecy
TIC
l = 512
km = l/2
corvec = fltarr(km,km,km)
N = float(l)^3
corvec[*,*,*] = (total(vecx*shift(vecx,*,*,*))/N)+ (total(vecy*shift(vecy,*,*,*))/N)
return, corvec
TOC
end
IDL appears to only accept scalars or 1 element arrays as possible shifting arguments.
Is the below what you are trying to do? It's not fast, but it will give something to test faster solutions against.
function ccs, vecx, vecy
compile_opt strictarr
tic
l = 512
km = l/2
corvec = fltarr(km, km, km)
n = float(l)^3
for i = 0L, km - 1L do begin
for j = 0L, km - 1L do begin
for k = 0L, km - 1L do begin
corvec[i, j, k] = total(vecx * shift(vecx, i, j, k)) / n + total(vecy * shift(vecy, i, j, k)) / n
endfor
endfor
endfor
toc
return, corvec
end
If this is what you are trying to do, I think it will be tough in IDL. Generally, vectorized solutions without loops for these types of problems need to create intermediate arrays — in this case, of all possible shifts of vecx and vecy, which would be a very large intermediate array. My suggestion would be to implement in C and call from IDL, but maybe I am missing something.

reduction parallel loop in julia

We can use
c = #parallel (vcat) for i=1:10
(i,i+1)
end
But when I'm trying to use push!() instead of vcat() I'm getting some error. How can I use push!() in this parallel loop?
c = #parallel (push!) for i=1:10
(c, (i,i+1))
end
The #parallel is somewhat similar to foldl(op, itr) in that it uses the first value of itr as an initial first parameter for op. push! lacks the required symmetry between the operands. Perhaps what you are looking for is:
julia> c = #parallel (append!) for i=1:10
[(i,i+1)]
end
Elaborating a bit on Dan's point; to see how the parallel macro works, see the difference between the following two invocations:
julia> #parallel print for i in 1:10
(i,i+1)
end
(1, 2)(2, 3)nothing(3, 4)nothing(4, 5)nothing(5, 6)nothing(6, 7)nothing(7, 8)nothing(8, 9)nothing(9, 10)nothing(10, 11)
julia> #parallel string for i in 1:10
(i,i+1)
end
"(1, 2)(2, 3)(3, 4)(4, 5)(5, 6)(6, 7)(7, 8)(8, 9)(9, 10)(10, 11)"
From the top one it should be clear what's going on. Each iteration produces an output. When it comes to using the specified function on those outputs, this is done in output pairs. Two first pair of outputs is fed to print, and the result of the print operation then becomes the first item in the next pair to be processed. Since the output is nothing, print prints nothing then (3,4). The result of this print statement is nothing, therefore the next pair to be printed is nothing and (4,5), and so on until all elements are consumed. I.e. in terms of pseudocode, this is what's happening:
Step 1: state = print((1,2), (2,3)); # state becomes nothing
Step 2: state = print(state, (3,4)); # state becomes nothing again
Step 3: state = print(state, (4,5)); # and so forth
The reason string works as expected is because what's happening is the following steps:
Step 1: state = string((1,2),(2,3));
Step 2: state = string(state, (3,4));
Step 3: state = string(state, (4,5);
etc
In general, the function you pass to the parallel macro should be something that takes two inputs of the same type, and outputs an object of the same type.
Therefore you cannot use push!, because this always uses two inputs of different types (one array, and one plain element), and outputs an array. Therefore you need to use append! instead, which fits the specification.
Also note that the order of outputs is not guaranteed. (here it happens to be in order because I only used 1 worker). If you want something where the order of operations matters, then you shouldn't use this construct. E.g., obviously in something like addition it doesn't matter, because addition is a completely associative operation; but if I used string, if outputs are processed in different order, then obviously you could end up with a different string than what you'd expect.
EDIT - addressing benchmark between vcat / append! / indexed assignment
I think the most efficient way to do this is in fact via normal indexing onto a preallocated array. But between append! and vcat, append will most certainly be faster as vcat always makes a copy (as I understand it).
Benchmarks:
function parallelWithVcat!( A::Array{Tuple{Int64, Int64}, 1} )
A = #parallel vcat for i = 1:10000
(i, i+1)
end
end;
function parallelWithFunction!( A::Array{Tuple{Int64, Int64}, 1} )
A = #parallel append! for i in 1:10000
[(i, i+1)];
end
end;
function parallelWithPreallocation!( A::Array{Tuple{Int64, Int64}, 1} )
#parallel for i in 1:10000
A[i] = (i, i+1);
end
end;
A = Array{Tuple{Int64, Int64}, 1}(10000);
### first runs omitted, all benchmarks here are from 2nd runs ###
# first on a single worker:
#time for n in 1:100; parallelWithVcat!(A); end
#> 8.050429 seconds (24.65 M allocations: 75.341 GiB, 15.42% gc time)
#time for n in 1:100; parallelWithFunction!(A); end
#> 0.072325 seconds (1.01 M allocations: 141.846 MiB, 52.69% gc time)
#time for n in 1:100; parallelWithPreallocation!(A); end
#> 0.000387 seconds (4.21 k allocations: 234.750 KiB)
# now with true parallelism:
addprocs(10);
#time for n in 1:100; parallelWithVcat!(A); end
#> 1.177645 seconds (160.02 k allocations: 109.618 MiB, 0.75% gc time)
#time for n in 1:100; parallelWithFunction!(A); end
#> 0.060813 seconds (111.87 k allocations: 70.585 MiB, 3.91% gc time)
#time for n in 1:100; parallelWithPreallocation!(A); end
#> 0.058134 seconds (116.16 k allocations: 4.174 MiB)
If someone can suggest an even more efficient way, please do so!
Note in particular that the indexed assignment is much faster than the rest, such that it appears (for this example at least) that most of its computation in the parallel case appears to be lost on the parallelisation itself.
Disclaimer: I make no claim that the above are correct summonings of the #parallel spell. I have not delved into the inner workings of the macro in detail to be able to claim otherwise. In particular, I am not aware which parts the macro causes to be processed remotely vs local (e.g. the assignment part). Caution is advised, ymmv, etc.

How to parallelize computation of pairwise distance matrix?

My problem is roughly as follows. Given a numerical matrix X, where each row is an item. I want to find each row's nearest neighbor in terms of L2 distance in all rows except itself. I tried reading the official documentation but was still a little confused about how to achieve this. Could someone give me some hint?
My code is as follows
function l2_dist(v1, v2)
return sqrt(sum((v1 - v2) .^ 2))
end
function main(Mat, dist_fun)
n = size(Mat, 1)
Dist = SharedArray{Float64}(n) #[Inf for i in 1:n]
Id = SharedArray{Int64}(n) #[-1 for i in 1:n]
#parallel for i = 1:n
Dist[i] = Inf
Id[i] = 0
end
Threads.#threads for i in 1:n
for j in 1:n
if i != j
println(i, j)
dist_temp = dist_fun(Mat[i, :], Mat[j, :])
if dist_temp < Dist[i]
println("Dist updated!")
Dist[i] = dist_temp
Id[i] = j
end
end
end
end
return Dict("Dist" => Dist, "Id" => Id)
end
n = 4000
p = 30
X = [rand() for i in 1:n, j in 1:p];
main(X[1:30, :], l2_dist)
#time N = main(X, l2_dist)
I'm trying to distributed all the i's (i.e. calculating each row minimum) over different cores. But the version above apparently isn't working correctly. It is even slower than the sequential version. Can someone point me to the right direction? Thanks.
Maybe you're doing something in addition to what you have written down, but, at this point from what I can see, you aren't actually doing any computations in parallel. Julia requires you to tell it how many processors (or threads) you would like it to have access to. You can do this through either
Starting Julia with multiple processors julia -p # (where # is the number of processors you want Julia to have access to)
Once you have started a Julia "session" you can call the addprocs function to add additional processors.
To have more than 1 thread, you need to run command export JULIA_NUM_THREADS = #. I don't know very much about threading, so I will be sticking with the #parallel macro. I suggest reading documentation for more details on threading -- Maybe #Chris Rackauckas could expand a little more on the difference.
A few comments below about my code and on your code:
I'm on version 0.6.1-pre.0. I don't think I'm doing anything 0.6 specific, but this is a heads up just in case.
I'm going to use the Distances.jl package when computing the distances between vectors. I think it is a good habit to farm out as many of my computations to well-written and well-maintained packages as possible.
Rather than compute the distance between rows, I'm going to compute the distance between columns. This is because Julia is a column-major language, so this will increase the number of cache hits and give a little extra speed. You can obviously get the row-wise results you want by just transposing the input.
Unless you expect to have that many memory allocations then that many allocations are a sign that something in your code is inefficient. It is often a type stability problem. I don't know if that was the case in your code before, but that doesn't seem to be an issue in the current version (it wasn't immediately clear to me why you were having so many allocations).
Code is below
# Make sure all processors have access to Distances package
#everywhere using Distances
# Create a random matrix
nrow = 30
ncol = 4000
# Seed creation of random matrix so it is always same matrix
srand(42)
X = rand(nrow, ncol)
function main(X::AbstractMatrix{Float64}, M::Distances.Metric)
# Get size of the matrix
nrow, ncol = size(X)
# Create `SharedArray` to store output
ind_vec = SharedArray{Int}(ncol)
dist_vec = SharedArray{Float64}(ncol)
# Compute the distance between columns
#sync #parallel for i in 1:ncol
# Initialize various temporary variables
min_dist_i = Inf
min_ind_i = -1
X_i = view(X, :, i)
# Check distance against all other columns
for j in 1:ncol
# Skip comparison with itself
if i==j
continue
end
# Tell us who is doing the work
# (can uncomment if you want to verify stuff)
# println("Column $i compared with Column $j by worker $(myid())")
# Evaluate the new distance...
# If it is less then replace it, otherwise proceed
dist_temp = evaluate(M, X_i, view(X, :, j))
if dist_temp < min_dist_i
min_dist_i = dist_temp
min_ind_i = j
end
end
# Which column is minimum distance from column i
dist_vec[i] = min_dist_i
ind_vec[i] = min_ind_i
end
return dist_vec, ind_vec
end
# Using Euclidean metric
metric = Euclidean()
inds, dist = main(X, metric)
#time main(X, metric);
#show dist[[1, 5, 25]], inds[[1, 5, 25]]
You can run the code with
1 processor julia testfile.jl
% julia testfile.jl
0.640365 seconds (16.00 M allocations: 732.495 MiB, 3.70% gc time)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])
n processors (in this case 4) julia -p n testfile.jl
% julia -p 4 testfile.jl
0.201523 seconds (2.10 k allocations: 99.107 KiB)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])

Joint Entropy Performance in Julia

I wrote a function to calculate the joint entropy of each column pair in a matrix. But I would like to increase the performance regarding time and memory.
The function looks like this:
function jointentropy(aln)
mat = Array(Float64,size(aln,2),size(aln,2))
for i in combinations(1:size(aln,2),2)
a = i[1]
b = i[2]
mina, maxa = extrema(aln[:,a])
minb, maxb = extrema(aln[:,b])
h = Array(Float64,(maxa-mina+1,maxb-minb+1))
h = hist2d([aln[:,a] aln[:,b]],mina-1:1:maxa,minb-1:1:maxb)[3]
h = h/size(aln[:,1],1)
I,J,V = findnz(h)
l = sparse(I,J,log2(V),maxa-mina+1,maxb-minb+1)
mat[b,a] = - sum(l.*h)
end
return mat
end
Matrices that go into this function look like this:
rand(45:122,rand(1:2000),rand(1:2000))
An example with a 500x500 matrix resulted in the following #time output:
elapsed time: 33.692081413 seconds (33938843192 bytes allocated, 36.42% gc time)
...which seems to be a whole lot of memory...
Any suggestions on how to speed up this function and reduce memory allocation?
Thanks in advance for any help!
Here are a few ideas to speed up your function.
If the range of all the columns is roughly the same, you can move the extrema computations outside the loop and reuse the same h array.
hist2d creates a new array: you can use hist2d! to reuse the previous one.
The assignment h = h/size(aln[:,1],1) creates a new array.
The division in h = h/size(aln[:,1],1) is done for all the elements of the array, including the zeroes.
You can use a loop instead of findnz and a sparse matrix (findnz already contains a loop).
.
function jointentropy2(aln)
n1 = size(aln,1)
n2 = size(aln,2)
mat = Array(Float64,n2,n2)
lower, upper = extrema(aln)
m = upper-lower+1
h = Array(Float64,(m,m))
for a in 1:n2
for b in (a+1):n2
Base.hist2d!(h,[aln[:,a] aln[:,b]],lower-1:1:upper,lower-1:1:upper)[3]
s = 0
for i in 1:m
for j in 1:m
if h[i,j] != 0
p = h[i,j] / n1
s += p * log2(p)
end
end
end
mat[b,a] = - s
end
end
return mat
end
This is twice as fast as the initial function,
and the memory allocations were divided by 4.
aln = rand(45:122,500,400)
#time x = jointentropy(aln)
# elapsed time: 26.946314168 seconds (21697858752 bytes allocated, 29.97% gc time)
#time y = jointentropy2(aln)
# elapsed time: 13.626282821 seconds (5087119968 bytes allocated, 16.21% gc time)
x - y # approximately zero (at least below the diagonal --
# the matrix was not initialized above it)
The next candidate for optimization is hist2d (here, you could use a loop and a sparse matrix).
#profile jointentropy2(aln)
Profile.print()

Efficient partial permutation sort in Julia

I am dealing with a problem that requires a partial permutation sort by magnitude in Julia. If x is a vector of dimension p, then what I need are the first k indices corresponding to the k components of x that would appear first in a partial sort by absolute value of x.
Refer to Julia's sorting functions here. Basically, I want a cross between sortperm and select!. When Julia 0.4 is released, I will be able to obtain the same answer by applying sortperm! (this function) to the vector of indices and choosing the first k of them. However, using sortperm! is not ideal here because it will sort the remaining p-k indices of x, which I do not need.
What would be the most memory-efficient way to do the partial permutation sort? I hacked a solution by looking at the sortperm source code. However, since I am not versed in the ordering modules that Julia uses there, I am not sure if my approach is intelligent.
One important detail: I can ignore repeats or ambiguities here. In other words, I do not care about the ordering by abs() of indices for two components 2 and -2. My actual code uses floating point values, so exact equality never occurs for practical purposes.
# initialize a vector for testing
x = [-3,-2,4,1,0,-1]
x2 = copy(x)
k = 3 # num components desired in partial sort
p = 6 # num components in x, x2
# what are the indices that sort x by magnitude?
indices = sortperm(x, by = abs, rev = true)
# now perform partial sort on x2
select!(x2, k, by = abs, rev = true)
# check if first k components are sorted here
# should evaluate to "true"
isequal(x2[1:k], x[indices[1:k]])
# now try my partial permutation sort
# I only need indices2[1:k] at end of day!
indices2 = [1:p]
select!(indices2, 1:k, 1, p, Base.Perm(Base.ord(isless, abs, true, Base.Forward), x))
# same result? should evaluate to "true"
isequal(indices2[1:k], indices[1:k])
EDIT: With the suggested code, we can briefly compare performance on much larger vectors:
p = 10000; k = 100; # asking for largest 1% of components
x = randn(p); x2 = copy(x);
# run following code twice for proper timing results
#time {indices = sortperm(x, by = abs, rev = true); indices[1:k]};
#time {indices2 = [1:p]; select!(indices2, 1:k, 1, p, Base.Perm(Base.ord(isless, abs, true, Base.Forward), x))};
#time selectperm(x,k);
My output:
elapsed time: 0.048876901 seconds (19792096 bytes allocated)
elapsed time: 0.007016534 seconds (2203688 bytes allocated)
elapsed time: 0.004471847 seconds (1657808 bytes allocated)
The following version appears to be relatively space-efficient because it uses only an integer array of the same length as the input array:
function selectperm (x,k)
if k > 1 then
kk = 1:k
else
kk = 1
end
z = collect(1:length(x))
return select!(z,1:k,by = (i)->abs(x[i]), rev = true)
end
x = [-3,-2,4,1,0,-1]
k = 3 # num components desired in partial sort
print (selectperm(x,k))
The output is:
[3,1,2]
... as expected.
I'm not sure if it uses less memory than the originally-proposed solution (though I suspect the memory usage is similar) but the code may be clearer and it does produce only the first k indices whereas the original solution produced all p indices.
(Edit)
selectperm() has been edited to deal with the BoundsError that occurs if k=1 in the call to select!().

Resources