We can use
c = @parallel (vcat) for i=1:10
    (i,i+1)
end
But when I try to use push!() instead of vcat(), I get an error. How can I use push!() in this parallel loop?
c = @parallel (push!) for i=1:10
    (c, (i,i+1))
end
The @parallel macro is somewhat similar to foldl(op, itr) in that it uses the first value of itr as the initial first parameter of op. push! lacks the required symmetry between the operands. Perhaps what you are looking for is:
julia> c = @parallel (append!) for i=1:10
           [(i,i+1)]
       end
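To make the foldl analogy concrete, here is a serial sketch (mine, not from the original answer) that mirrors the reduction shape of the @parallel (append!) loop above:

# Serial equivalent of the reduction: each iteration contributes a
# one-element Vector, and append! folds them into a single accumulator.
c = foldl(append!, [[(i, i+1)] for i in 1:10])
# c == [(1,2), (2,3), ..., (10,11)]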
Elaborating a bit on Dan's point: to see how the parallel macro works, note the difference between the following two invocations:
julia> @parallel print for i in 1:10
           (i,i+1)
       end
(1, 2)(2, 3)nothing(3, 4)nothing(4, 5)nothing(5, 6)nothing(6, 7)nothing(7, 8)nothing(8, 9)nothing(9, 10)nothing(10, 11)
julia> @parallel string for i in 1:10
           (i,i+1)
       end
"(1, 2)(2, 3)(3, 4)(4, 5)(5, 6)(6, 7)(7, 8)(8, 9)(9, 10)(10, 11)"
From the top one it should be clear what's going on. Each iteration produces an output. When it comes to applying the specified function to those outputs, this is done in pairs: the first pair of outputs is fed to print, and the result of that print call then becomes the first item in the next pair to be processed. Since that result is nothing, print next prints nothing followed by (3,4). The result of this print statement is again nothing, therefore the next pair to be printed is nothing and (4,5), and so on until all elements are consumed. I.e. in terms of pseudocode, this is what's happening:
Step 1: state = print((1,2), (2,3)); # state becomes nothing
Step 2: state = print(state, (3,4)); # state becomes nothing again
Step 3: state = print(state, (4,5)); # and so forth
The reason string works as expected is that the following steps happen instead:
Step 1: state = string((1,2),(2,3));
Step 2: state = string(state, (3,4));
Step 3: state = string(state, (4,5));
etc
In general, the function you pass to the parallel macro should be something that takes two inputs of the same type, and outputs an object of the same type.
You therefore cannot use push!, because it always takes two inputs of different types (an array and a plain element) and outputs an array. Use append! instead, which fits the specification.
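As a quick illustration (my sketch; the name combine is made up), any valid reducer here must be "type-symmetric", which is exactly what wrapping each element in a one-element array buys you:

# A reducer must map (T, T) -> T; append! does so for Vector × Vector.
# (With multiple workers, define combine with @everywhere.)
combine(a::Vector, b::Vector) = append!(a, b)

c = @parallel (combine) for i = 1:10
    [(i, i+1)]   # each iteration yields a Vector, matching the reducer's type
end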
Also note that the order of outputs is not guaranteed (here it happens to be in order because I only used one worker). If you want something where the order of operations matters, then you shouldn't use this construct. For something like addition it obviously doesn't matter, because addition is fully associative and commutative; but with string, if the outputs are processed in a different order, you could end up with a different string than you'd expect.
EDIT - addressing the benchmark comparing vcat / append! / indexed assignment
I think the most efficient way to do this is in fact via normal indexing into a preallocated array. But between append! and vcat, append! will almost certainly be faster, as vcat always makes a copy (as I understand it).
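A quick illustration of that copy-versus-mutate difference (my sketch, not part of the original benchmark):

a = [(1, 2)]; b = [(3, 4)]
c = vcat(a, b)       # allocates a brand-new array on every call
d = append!(a, b)    # grows `a` in place and returns it
d === a              # true: the accumulator is reused rather than copied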
Benchmarks:
function parallelWithVcat!( A::Array{Tuple{Int64, Int64}, 1} )
    A = @parallel vcat for i = 1:10000
        (i, i+1)
    end
end;

function parallelWithFunction!( A::Array{Tuple{Int64, Int64}, 1} )
    A = @parallel append! for i in 1:10000
        [(i, i+1)];
    end
end;

function parallelWithPreallocation!( A::Array{Tuple{Int64, Int64}, 1} )
    @parallel for i in 1:10000
        A[i] = (i, i+1);
    end
end;
A = Array{Tuple{Int64, Int64}, 1}(10000);
### first runs omitted, all benchmarks here are from 2nd runs ###
# first on a single worker:
@time for n in 1:100; parallelWithVcat!(A); end
#> 8.050429 seconds (24.65 M allocations: 75.341 GiB, 15.42% gc time)
@time for n in 1:100; parallelWithFunction!(A); end
#> 0.072325 seconds (1.01 M allocations: 141.846 MiB, 52.69% gc time)
@time for n in 1:100; parallelWithPreallocation!(A); end
#> 0.000387 seconds (4.21 k allocations: 234.750 KiB)
# now with true parallelism:
addprocs(10);
@time for n in 1:100; parallelWithVcat!(A); end
#> 1.177645 seconds (160.02 k allocations: 109.618 MiB, 0.75% gc time)
@time for n in 1:100; parallelWithFunction!(A); end
#> 0.060813 seconds (111.87 k allocations: 70.585 MiB, 3.91% gc time)
@time for n in 1:100; parallelWithPreallocation!(A); end
#> 0.058134 seconds (116.16 k allocations: 4.174 MiB)
If someone can suggest an even more efficient way, please do so!
Note in particular that the indexed assignment is much faster than the rest; so much so that, in the parallel case (for this example at least), most of the time appears to be spent on the parallelisation overhead itself.
Disclaimer: I make no claim that the above are correct summonings of the @parallel spell. I have not delved into the inner workings of the macro in detail to be able to claim otherwise. In particular, I am not aware which parts the macro causes to be processed remotely vs locally (e.g. the assignment part). Caution is advised, ymmv, etc.
Related
I am working on a scientific code that is experiencing issues with parallelization.
The parallel version is slower than the serial one, and I am not sure whether the right approaches are being used for this application.
How can I improve the performance of the parallel calculation?
Is the right approach being used, or should other packages / functions be considered for parallelization?
I have already tried a larger workload; however, this makes no difference.
I suspect the problem is somehow due to data movement between workers, but I don't know how to check or improve it.
Parallel programming with Julia is still relatively new for me, so I am very grateful for any help!
The simulation code is something of a benchmark for the Julia programming language, as our team is considering using Julia for all future projects if strong performance advantages over the current workflow can be demonstrated.
Because of this, I would like to maximize performance, especially since calculations with very large models as well as possible use on a cluster are planned.
Minimum Working Example
The critical parts of the code can be broken down to the following example.
I start the process as follows:
using Distributed
addprocs();
@everywhere using SharedArrays, LinearAlgebra, Test
First I define the simulation model, containing all data used for the calculations.
Is it actually okay to store SharedArrays with other data in a struct or should a different approach be used?
@everywhere struct Model
    idx::Vector{Tuple{Int,Int}} # indices
    A::SharedMatrix{Float64}    # results, will be constantly updated
    B::Vector{Float64}          # part of pre-processing, will only be read
end
See the non-parallel version of the function used for the update of the model below.
function update(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    for (i,j) in m.idx
        cnt += 1
        L = norm(m.A[:,i] - m.A[:,j])
        k = factor * m.B[cnt]
        m.A[:,i] .+= k*L
        m.A[:,j] .-= k*L
    end
end
For parallelization, I simply tried the following. Would an approach with pmap perhaps be better in this case?
@everywhere function parallel_update(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    @sync @distributed for (i,j) in m.idx
        cnt += 1
        L = norm(m.A[:,i] - m.A[:,j])
        k = factor * m.B[cnt]
        m.A[:,i] .+= k*L
        m.A[:,j] .-= k*L
    end
end
To test the results I use the following function:
@everywhere function test_my_code()
    # provide some data
    n = 10000000
    idx = [(rand(1:n),rand(1:n)) for k=1:n]
    A = SharedArray(hcat(([rand(0.:1000.);rand(0.:1000.);rand(0.:1000.)] for k=1:n)...))
    B = [rand(0.:1000.) for k=1:n]
    # define models
    model1 = Model(idx,A,B)
    model2 = Model(idx,A,B)
    # test and compare results
    @time update(model1,2.)
    @time parallel_update(model2,2.)
    @test model1 == model2
end
julia> test_my_code() # first run
6.350694 seconds (50.00 M allocations: 5.215 GiB, 13.66% gc time)
11.422999 seconds (6.69 k allocations: 446.156 KiB)
Test Passed
julia> test_my_code() # second run
6.286828 seconds (50.00 M allocations: 5.215 GiB, 18.35% gc time)
6.297144 seconds (2.92 k allocations: 143.516 KiB)
Test Passed
Note: significant performance improvements for the serial code
I was already able to significantly improve the performance of the serial function and reduce the number of allocations to zero.
Since this seems to make no difference to the parallelization problem, I used the shorter, easier-to-read version for the previous example.
See the serial code below.
using LinearAlgebra, Test
struct Model
    idx::Vector{Tuple{Int,Int}}
    A::Matrix{Float64}
    B::Vector{Float64}
end
function update(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    for (i,j) in m.idx
        cnt += 1
        L = norm(m.A[:,i] - m.A[:,j])
        k = factor * m.B[cnt]
        m.A[:,i] .+= k*L
        m.A[:,j] .-= k*L
    end
end
function update_fast(m::Model, factor::Float64)
    L::Float64 = 0.
    k::Float64 = 0.
    cnt::Int = 0
    for (i,j) in m.idx
        cnt += 1
        L = sqrt((m.A[1,i]-m.A[1,j])^2 +
                 (m.A[2,i]-m.A[2,j])^2 +
                 (m.A[3,i]-m.A[3,j])^2)
        k = factor * m.B[cnt]
        m.A[1,i] += k*L
        m.A[2,i] += k*L
        m.A[3,i] += k*L
        m.A[1,j] -= k*L
        m.A[2,j] -= k*L
        m.A[3,j] -= k*L
    end
end
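As an aside (my addition, assuming Julia 1.x syntax and untested against the original benchmark), views can achieve a similar allocation reduction without writing out each component:

function update_views!(m::Model, factor::Float64)
    @inbounds for (n, (i, j)) in enumerate(m.idx)
        ai = view(m.A, :, i)
        aj = view(m.A, :, j)
        # same 3-component distance as update_fast, computed via the views
        L = sqrt((ai[1]-aj[1])^2 + (ai[2]-aj[2])^2 + (ai[3]-aj[3])^2)
        k = factor * m.B[n]
        ai .+= k*L   # in-place broadcast: no temporary column copies
        aj .-= k*L
    end
end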
function test_serial_speedup()
    n = 10000000
    idx = [(rand(1:n),rand(1:n)) for k=1:n]
    A = hcat(([rand(0.:1000.);rand(0.:1000.);rand(0.:1000.)] for k=1:n)...)
    B = [rand(0.:1000.) for k=1:n]
    model1 = Model(idx,A,B)
    model2 = Model(idx,A,B)
    @time update(model1,2.)
    @time update_fast(model2,2.)
    @test model1 == model2
end
julia> test_serial_speedup()
5.008049 seconds (50.00 M allocations: 5.215 GiB, 18.14% gc time)
0.464986 seconds
Test Passed
My problem is roughly as follows: given a numerical matrix X, where each row is an item, I want to find each row's nearest neighbor in terms of L2 distance among all rows except itself. I tried reading the official documentation but was still a little confused about how to achieve this. Could someone give me a hint?
My code is as follows:
function l2_dist(v1, v2)
    return sqrt(sum((v1 - v2) .^ 2))
end
function main(Mat, dist_fun)
    n = size(Mat, 1)
    Dist = SharedArray{Float64}(n) # [Inf for i in 1:n]
    Id = SharedArray{Int64}(n)     # [-1 for i in 1:n]
    @parallel for i = 1:n
        Dist[i] = Inf
        Id[i] = 0
    end
    Threads.@threads for i in 1:n
        for j in 1:n
            if i != j
                println(i, j)
                dist_temp = dist_fun(Mat[i, :], Mat[j, :])
                if dist_temp < Dist[i]
                    println("Dist updated!")
                    Dist[i] = dist_temp
                    Id[i] = j
                end
            end
        end
    end
    return Dict("Dist" => Dist, "Id" => Id)
end
n = 4000
p = 30
X = [rand() for i in 1:n, j in 1:p];
main(X[1:30, :], l2_dist)
@time N = main(X, l2_dist)
I'm trying to distribute all the i's (i.e. the calculation of each row's minimum) over different cores, but the version above apparently isn't working correctly; it is even slower than the sequential version. Can someone point me in the right direction? Thanks.
Maybe you're doing something in addition to what you have written down, but from what I can see, you aren't actually doing any computations in parallel at this point. Julia requires you to tell it how many processors (or threads) you would like it to have access to. You can do this through either:
Starting Julia with multiple processors: julia -p # (where # is the number of processors you want Julia to have access to), or
calling the addprocs function once you have started a Julia session, to add additional processors.
To have more than one thread, you need to set the environment variable JULIA_NUM_THREADS (e.g. export JULIA_NUM_THREADS=4) before starting Julia. I don't know very much about threading, so I will stick with the @parallel macro. I suggest reading the documentation for more details on threading -- maybe @Chris Rackauckas could expand a little more on the difference.
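For concreteness, a minimal sketch of both approaches (the worker count of 4 is just an illustration):

# Option 1: start Julia with workers from the shell:
#   $ julia -p 4
# Option 2: add workers from inside a running session:
addprocs(4)
nprocs()    # total process count, including the master
nworkers()  # number of worker processes
# For threads instead, set the environment variable before launching Julia:
#   $ export JULIA_NUM_THREADS=4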
A few comments below, both about my code and about yours:
I'm on version 0.6.1-pre.0. I don't think I'm doing anything 0.6 specific, but this is a heads up just in case.
I'm going to use the Distances.jl package when computing the distances between vectors. I think it is a good habit to farm out as many of my computations to well-written and well-maintained packages as possible.
Rather than compute the distance between rows, I'm going to compute the distance between columns. This is because Julia is a column-major language, so this will increase the number of cache hits and give a little extra speed. You can obviously get the row-wise results you want by just transposing the input.
Unless you expect them, that many memory allocations are a sign that something in your code is inefficient; it is often a type-stability problem. I don't know if that was the case in your code before, but it doesn't seem to be an issue in the current version (it wasn't immediately clear to me why you were having so many allocations).
Code is below
# Make sure all processors have access to the Distances package
@everywhere using Distances
# Create a random matrix
nrow = 30
ncol = 4000
# Seed creation of random matrix so it is always same matrix
srand(42)
X = rand(nrow, ncol)
function main(X::AbstractMatrix{Float64}, M::Distances.Metric)
    # Get size of the matrix
    nrow, ncol = size(X)
    # Create `SharedArray`s to store output
    ind_vec = SharedArray{Int}(ncol)
    dist_vec = SharedArray{Float64}(ncol)
    # Compute the distance between columns
    @sync @parallel for i in 1:ncol
        # Initialize various temporary variables
        min_dist_i = Inf
        min_ind_i = -1
        X_i = view(X, :, i)
        # Check distance against all other columns
        for j in 1:ncol
            # Skip comparison with itself
            if i == j
                continue
            end
            # Tell us who is doing the work
            # (can uncomment if you want to verify stuff)
            # println("Column $i compared with Column $j by worker $(myid())")
            # Evaluate the new distance...
            # If it is less then replace it, otherwise proceed
            dist_temp = evaluate(M, X_i, view(X, :, j))
            if dist_temp < min_dist_i
                min_dist_i = dist_temp
                min_ind_i = j
            end
        end
        # Which column is minimum distance from column i
        dist_vec[i] = min_dist_i
        ind_vec[i] = min_ind_i
    end
    return dist_vec, ind_vec
end
# Using the Euclidean metric
metric = Euclidean()
inds, dist = main(X, metric)
@time main(X, metric);
@show dist[[1, 5, 25]], inds[[1, 5, 25]]
You can run the code with
1 processor julia testfile.jl
% julia testfile.jl
0.640365 seconds (16.00 M allocations: 732.495 MiB, 3.70% gc time)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])
n processors (in this case 4) julia -p n testfile.jl
% julia -p 4 testfile.jl
0.201523 seconds (2.10 k allocations: 99.107 KiB)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])
I want to alter a shared array owned by only some of my processes:
julia> addprocs(4)
4-element Array{Int64,1}:
2
3
4
5
julia> s = SharedArray(Int, (100,), pids=[2,3]);
julia> for i in procs() println(remotecall_fetch(localindexes, i, s)) end
1:0
1:50
51:100
1:0
1:0
This works, but I want to be able to parallelize the loop:
julia> for i=1:100 s[i] = i end
This results in processes 4 and 5 terminating with a segfault:
julia> @parallel for i=1:100 s[i] = i end
Question: Why does this terminate the processes rather than throw an exception or split the loop only among the processes that share the array?
I expected this to work instead, but it does not fill the entire array:
julia> @parallel for i in localindexes(s) s[i] = i end
Each process updates the part of the array that is local to it, and since the array is shared, changes made by one process should be visible to all processes. Why is some of the array still unchanged?
Replacing @parallel with @everywhere gives the error that s is not defined on process 2. How can a process which owns part of the array be unaware of it?
I am so confused. What is the best way to parallelize this loop?
Will this do the trick?
@sync begin
    for i in procs(s) # <-- note loop over process IDs of SharedArray s!
        @async @spawnat i setindex!(s, i, localindexes(s))
    end
end
You may run into issues if the master process is included here; in that case, you can try building your own function modeled on the pmap example.
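If the goal is specifically s[i] = i, rather than marking each chunk with the worker id as the snippet above does, a variant of the same pattern (my sketch, with the same caveats) fills each local chunk elementwise:

@sync for p in procs(s)
    @async @spawnat p begin
        # each owning worker writes only its own local index range
        for i in localindexes(s)
            s[i] = i
        end
    end
end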
I'm trying to oversimplify this as much as possible.
Functions f1 and f2 implement a very simplified version of roulette wheel selection over a vector R. The only difference between them is that f1 uses a for loop and f2 a while loop. Both functions return the index of the array where the condition was met.
R = rand(100)

function f1(X::Vector)
    l = length(X)
    r = rand()*X[l]
    for i = 1:l
        if r <= X[i]
            return i
        end
    end
end
function f2(X::Vector)
    l = length(X)
    r = rand()*X[l]
    i = 1
    while true
        if r <= X[i]
            return i
        end
        i += 1
    end
end
Now I created a couple of test functions.
M is the number of times we repeat the function execution.
Now this is critical: I want to store the values I get from the functions because I need them later. To oversimplify the code, I just created a new variable r in which I sum up the returns from the functions.
function test01(M,R)
    cumR = cumsum(R)
    r = 0
    for i = 1:M
        a = f1(cumR)
        r += a
    end
    return r
end

function test02(M,R)
    cumR = cumsum(R)
    r = 0
    for i = 1:M
        a = f2(cumR)
        r += a
    end
    return r
end
So, next I get:
@time test01(1e7,R)
elapsed time: 1.263974802 seconds (320000832 bytes allocated, 15.06% gc time)
@time test02(1e7,R)
elapsed time: 0.57086421 seconds (1088 bytes allocated)
So, for some reason I can't figure out, f1 allocates a lot of memory, and the amount grows as M gets larger.
I said the line r += a was critical because, if I remove it from both test functions, I get the same result with both tests, so no problems! So I thought there was a problem with the type of a being returned by the functions (because f1 returns the iterator of the for loop, and f2 uses its own variable i "manually declared" inside the function).
But...
aa = f1(cumsum(R))
bb = f2(cumsum(R))
typeof(aa) == typeof(bb)
true
So... what the hell is going on???
I apologize if this is some sort of basic question, but I've been going over this for over 3 hours now and couldn't find an answer... Even though the functions are fixed by using a while loop, I hate not knowing what's going on.
Thanks.
When you see lots of surprising allocations like that, a good first thing to check is type-stability. The @code_warntype macro is very helpful here:
julia> @code_warntype f1(R)
# … lots of annotated code, but the important part is this last line:
end::Union{Int64,Void}
Compare that to f2:
julia> @code_warntype f2(R)
# ...
end::Int64
So, why are the two different? Julia thinks that f1 might sometimes return nothing (which is of type Void)! Look again at your f1 function: what would happen if the last element of X is NaN? It'll just fall off the end of the function with no explicit return statement. In f2, however, you'll end up indexing beyond the bounds of X and get an error instead. Fix this type-instability by deciding what to do if the loop completes without finding the answer, and you'll see more similar timings.
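For example, a sketch of one possible fix (returning the last index as the fallback is an arbitrary choice of mine):

function f1(X::Vector)
    l = length(X)
    r = rand()*X[l]
    for i = 1:l
        if r <= X[i]
            return i
        end
    end
    return l   # explicit fallback: the inferred return type is now always Int
end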
As I stated in the comment, your functions f1 and f2 both generate random numbers internally and use them as the stopping criterion. Thus, there is no deterministic way to measure which of the functions is faster (the timing doesn't depend only on the implementation).
You can replace f1 and f2 functions to accept r as a parameter:
function f1(X::Vector, r)
    for i = 1:length(X)
        if r <= X[i]
            return i
        end
    end
end

function f2(X::Vector, r)
    i = 1
    while i <= length(X)
        if r <= X[i]
            return i
        end
        i += 1
    end
end
And then measure the time properly with the same R and r for both functions:
>>> R = cumsum(rand(100))
>>> r = rand(1_000_000) * R[end] # generate 1_000_000 random thresholds
>>> @time for i=1:length(r); f1(R, r[i]); end;
0.177048 seconds (4.00 M allocations: 76.278 MB, 2.70% gc time)
>>> @time for i=1:length(r); f2(R, r[i]); end;
0.173244 seconds (4.00 M allocations: 76.278 MB, 2.76% gc time)
As you can see, the timings are now nearly identical. Any remaining difference is caused by external factors (warm-up, or the processor being busy with other tasks).
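As a further note (my addition, assuming the BenchmarkTools.jl package is available), @btime gives more reliable microbenchmark timings than a bare @time loop:

using BenchmarkTools
R = cumsum(rand(100))
r = rand() * R[end]
@btime f1($R, $r)   # $-interpolation avoids timing global-variable access
@btime f2($R, $r)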
I am dealing with a problem that requires a partial permutation sort by magnitude in Julia. If x is a vector of dimension p, then what I need are the first k indices corresponding to the k components of x that would appear first in a partial sort by absolute value of x.
Refer to Julia's sorting functions here. Basically, I want a cross between sortperm and select!. When Julia 0.4 is released, I will be able to obtain the same answer by applying sortperm! (this function) to the vector of indices and choosing the first k of them. However, using sortperm! is not ideal here because it will sort the remaining p-k indices of x, which I do not need.
What would be the most memory-efficient way to do the partial permutation sort? I hacked a solution by looking at the sortperm source code. However, since I am not versed in the ordering modules that Julia uses there, I am not sure if my approach is intelligent.
One important detail: I can ignore repeats or ambiguities here. In other words, I do not care about the ordering by abs() of indices for two components 2 and -2. My actual code uses floating point values, so exact equality never occurs for practical purposes.
# initialize a vector for testing
x = [-3,-2,4,1,0,-1]
x2 = copy(x)
k = 3 # num components desired in partial sort
p = 6 # num components in x, x2
# what are the indices that sort x by magnitude?
indices = sortperm(x, by = abs, rev = true)
# now perform partial sort on x2
select!(x2, k, by = abs, rev = true)
# check if first k components are sorted here
# should evaluate to "true"
isequal(x2[1:k], x[indices[1:k]])
# now try my partial permutation sort
# I only need indices2[1:k] at end of day!
indices2 = [1:p]
select!(indices2, 1:k, 1, p, Base.Perm(Base.ord(isless, abs, true, Base.Forward), x))
# same result? should evaluate to "true"
isequal(indices2[1:k], indices[1:k])
EDIT: With the suggested code, we can briefly compare performance on much larger vectors:
p = 10000; k = 100; # asking for largest 1% of components
x = randn(p); x2 = copy(x);
# run following code twice for proper timing results
@time {indices = sortperm(x, by = abs, rev = true); indices[1:k]};
@time {indices2 = [1:p]; select!(indices2, 1:k, 1, p, Base.Perm(Base.ord(isless, abs, true, Base.Forward), x))};
@time selectperm(x,k);
My output:
elapsed time: 0.048876901 seconds (19792096 bytes allocated)
elapsed time: 0.007016534 seconds (2203688 bytes allocated)
elapsed time: 0.004471847 seconds (1657808 bytes allocated)
The following version appears to be relatively space-efficient because it uses only an integer array of the same length as the input array:
function selectperm(x, k)
    # pass a plain integer rather than the range 1:1 when k == 1,
    # to avoid the BoundsError in select! (see the edit note below)
    if k > 1
        kk = 1:k
    else
        kk = 1
    end
    z = collect(1:length(x))
    return select!(z, kk, by = (i) -> abs(x[i]), rev = true)
end

x = [-3,-2,4,1,0,-1]
k = 3 # num components desired in partial sort
print(selectperm(x, k))
The output is:
[3,1,2]
... as expected.
I'm not sure if it uses less memory than the originally proposed solution (though I suspect the memory usage is similar), but the code may be clearer, and it does produce only the first k indices, whereas the original solution produced all p indices.
(Edit)
selectperm() has been edited to deal with the BoundsError that occurs if k=1 in the call to select!().
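For readers on newer Julia versions (an addition of mine, not part of the original answer): this operation later landed in Base, as selectperm on 0.4-0.6 and partialsortperm on 0.7 and later:

# Julia 1.x sketch of the same task using the built-in function:
x = [-3, -2, 4, 1, 0, -1]
k = 3
partialsortperm(x, 1:k, by = abs, rev = true)   # -> [3, 1, 2]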