Statistics of Julia Array along dimension in parallel - parallel-processing

What are the best practices in Julia for computing statistics of an array along a given dimension in parallel? I have many large arrays and am looking for something like mean(array, 1), but in parallel (and returning a quantile). I cannot simply process whole arrays in parallel because I don't have enough RAM.
I coded up a crude benchmark that also illustrates the approaches I've tried so far: mapslices and @parallel loops over SharedArrays and DArrays (see below). The parallelization does not seem to speed things up much: adding 7 workers and using SharedArrays yields a 1.8x speedup, and using DArrays yields a 2.3x speedup. I'm pretty new to Julia. Is this to be expected? Am I doing something wrong?
Thanks for your help. Below is the output of my script followed by the script itself.
Script output:
WARNING: replacing module DistributedArrays
WARNING: replacing module DistributedArrays
WARNING: replacing module DistributedArrays
WARNING: replacing module DistributedArrays
WARNING: replacing module DistributedArrays
WARNING: replacing module DistributedArrays
WARNING: replacing module DistributedArrays
mapslices on Array
38.152894 seconds (218.71 M allocations: 14.435 GB, 3.33% gc time)
37.985577 seconds (218.10 M allocations: 14.406 GB, 3.23% gc time)
loop over Array using CartesianRange
9.161392 seconds (25.27 M allocations: 9.005 GB, 4.41% gc time)
9.118627 seconds (25.17 M allocations: 9.000 GB, 4.40% gc time)
@parallel loop over SharedArray
9.092477 seconds (322.23 k allocations: 14.190 MB, 0.05% gc time)
4.945648 seconds (18.90 k allocations: 1.405 MB)
@parallel loop over DArray
5.615429 seconds (496.26 k allocations: 21.535 MB, 0.08% gc time)
3.932704 seconds (15.63 k allocations: 1.178 MB)
Script:
procs_added = addprocs(CPU_CORES - 1)
@everywhere using DistributedArrays
function benchmark_array(dtype, dims)
    data = rand(dtype, dims...)
    println("mapslices on Array")
    @time out = mapslices(f->quantile(f, 0.2), data, 1)
    @time out = mapslices(f->quantile(f, 0.2), data, 1)
    println("loop over Array using CartesianRange")
    out = Array(Float32, size(data)[2:end])
    @time loop_over_array!(out, data)
    @time loop_over_array!(out, data)
end
function loop_over_array!(out::Array, data::Array)
    for I in CartesianRange(size(out))
        # explicit indexing, since [:, I...] didn't work
        out[I] = quantile(data[:, I[1], I[2], I[3]], 0.2)
    end
end
function benchmark_shared_array(dtype, dims)
    data = SharedArray(dtype, (dims...), pids=workers())
    println("@parallel loop over SharedArray")
    out = SharedArray(Float32, size(data)[2:end], pids=workers())
    @time parallel_loop_over_shared_array!(out, data)
    @time parallel_loop_over_shared_array!(out, data)
end
function parallel_loop_over_shared_array!(out::SharedArray, data::SharedArray)
    # @parallel for I in CartesianRange(size(out)) does not seem to work
    @sync @parallel for i in 1:size(out)[end]
        for I in CartesianRange(size(out)[1:end-1])
            out[I[1], I[2], i] = quantile(data[:, I[1], I[2], i], 0.2)
        end
    end
end
function benchmark_distributed_array(dtype, dims)
    data = drand(dtype, (dims...), workers(),
                 [i == length(dims) ? nworkers() : 1 for i in 1:length(dims)])
    println("@parallel loop over DArray")
    out = dzeros(Float32, size(data)[2:end], workers(),
                 [i == ndims(data) ? nworkers() : 1 for i in 2:ndims(data)])
    @time parallel_loop_over_distributed_array!(out, data)
    @time parallel_loop_over_distributed_array!(out, data)
end
function parallel_loop_over_distributed_array!(out::DArray, data::DArray)
    @sync for pid in workers()
        @spawnat pid begin
            inchunk = localpart(data)
            outchunk = localpart(out)
            for I in CartesianRange(size(outchunk))
                outchunk[I] = quantile(inchunk[:, I[1], I[2], I[3]], 0.2)
            end
        end
    end
end
function benchmark_all(dtype, dims)
    benchmark_array(dtype, dims)
    benchmark_shared_array(dtype, dims)
    benchmark_distributed_array(dtype, dims)
end

const dtype = Int
const dims = [128,256,256,64]
benchmark_all(dtype, dims)
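For reference, a sketch of one way to avoid the per-slice allocations (the data[:, I...] copies) that dominate these timings. It is written against a recent Julia (Statistics stdlib, Threads.@threads) rather than the 0.4-era @parallel API used above, and quantile_dim1 is an illustrative name, not something from the thread: each task copies its fiber into a reusable buffer and calls the in-place quantile!, so the repeated slice allocations go away.

using Statistics

# Quantile along dimension 1, threaded over the remaining dimensions.
# Each slice is copied into a reusable per-thread buffer instead of
# allocating a fresh vector, and quantile! sorts the buffer in place.
function quantile_dim1(data::AbstractArray{T,4}, p) where {T}
    out = Array{Float64}(undef, size(data)[2:end])
    bufs = [Vector{T}(undef, size(data, 1)) for _ in 1:Threads.nthreads()]
    Threads.@threads :static for I in CartesianIndices(size(out))
        buf = bufs[Threads.threadid()]   # safe with the :static schedule
        copyto!(buf, view(data, :, I))
        out[I] = quantile!(buf, p)
    end
    return out
end

# small usage example (the arrays in the question are much larger)
out = quantile_dim1(rand(Float32, 32, 64, 64, 8), 0.2)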

Related

Performance issues with evaluation of custom tree data structure in Julia

I am implementing a binary tree in Julia. The tree has nodes and leaves; each node points to left and right children, which are themselves node/leaf objects. The following code exemplifies the data structure:
using TimerOutputs

mutable struct NodeLeaf
    isleaf::Bool
    value::Union{Nothing,Float64}
    split::Union{Nothing,Float64}
    column::Union{Nothing,Int64}
    left::Union{Nothing,NodeLeaf}
    right::Union{Nothing,NodeLeaf}
end

function evaluate(node::NodeLeaf, x)::Float64
    while !node.isleaf
        if x[node.column] < node.split
            node = node.left
        else
            node = node.right
        end
    end
    return node.value
end

function build_random_tree(max_depth)
    if max_depth == 0
        return NodeLeaf(true, randn(), randn(), rand(1:10), nothing, nothing)
    else
        return NodeLeaf(false, randn(), randn(), rand(1:10), build_random_tree(max_depth - 1), build_random_tree(max_depth - 1))
    end
end

function main()
    my_random_tree = build_random_tree(4)
    @timeit to "evaluation" for i in 1:1000000
        evaluate(my_random_tree, randn(10))
    end
end

const to = TimerOutput()
main()
show(to)
I notice that a lot of allocations occur in the evaluate function, but I don't see the reason why this is the case:
julia mytree.jl
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 476ms / 21.6% 219MiB / 62.7%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
evaluation 1 103ms 100.0% 103ms 137MiB 100.0% 137MiB
───────────────────────────────────────────────────────────────────────
As I increase the evaluation loop, the allocation continues to increase without bound. Can anybody explain why allocation grows so much and please suggest how to avoid this issue? Thanks.
EDIT
I simplified the code too much for the example. The actual code accesses DataFrames, so main looks like this:
using DataFrames
function main()
    my_random_tree = build_random_tree(7)
    df = DataFrame(A=1:1000000)
    for i in 1:9
        df[!, string(i)] = collect(1:1000000)
    end
    @timeit to "evaluation" for i in 1:size(df, 1)
        evaluate(my_random_tree, @view df[i, :])
    end
end
I expect this to yield 0 allocations, but that isn't true:
julia mytree.jl
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 551ms / 20.5% 305MiB / 45.0%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
evaluation 1 113ms 100.0% 113ms 137MiB 100.0% 137MiB
───────────────────────────────────────────────────────────────────────%
On the other hand, if I use a plain array I don't get allocations:
function main()
    my_random_tree = build_random_tree(7)
    df = randn(1000000, 10)
    @timeit to "evaluation" for i in 1:size(df, 1)
        evaluate(my_random_tree, @view df[i, :])
    end
end
julia mytree.jl
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 465ms / 5.7% 171MiB / 0.0%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
evaluation 1 26.4ms 100.0% 26.4ms 0.00B - % 0.00B
───────────────────────────────────────────────────────────────────────%
The thing that allocates is randn, not evaluation. Switch to randn!:
julia> using Random
julia> function main()
           my_random_tree = build_random_tree(4)
           x = randn(10)
           @allocated for i in 1:1000000
               evaluate(my_random_tree, randn!(x))
           end
       end
main (generic function with 1 method)
julia> main()
0
EDIT
Solution with DataFrames.jl:
# Function barrier: `nti` is an iterator of concrete NamedTuples, so
# `evaluate` compiles a specialized, allocation-free method for its rows.
function bar(mrt, nti)
    @timeit to "evaluation" for nt in nti
        evaluate(mrt, nt)
    end
end

function main()
    my_random_tree = build_random_tree(7)
    df = DataFrame(A=1:1000000)
    for i in 1:9
        df[!, string(i)] = collect(1:1000000)
    end
    # Iterate rows as NamedTuples instead of type-unstable DataFrameRows.
    bar(my_random_tree, Tables.namedtupleiterator(df))
end

Julia: why doesn't shared memory multi-threading give me a speedup?

I want to use shared-memory multi-threading in Julia. As done by the Threads.@threads macro, I can use ccall(:jl_threading_run ...) to do this. And whilst my code now runs in parallel, I don't get the speedup I expected.
The following code is intended as a minimal example of the approach I'm taking and the performance problem I'm having: [EDIT: See later for even more minimal example]
nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")
# Something to be processed:
objects = rand(test_size)
# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)
# A function to do some work.
function worker_fn()
    work_idx = 1
    my_result = results[Threads.threadid()]
    while work_idx > 0
        my_result += objects[work_idx]
        work_idx += nthreads
        if work_idx > test_size
            break
        end
        counts[Threads.threadid()] += 1
    end
end
# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)
# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))
On an i7-7700, a typical single threaded result is:
STARTED with 1 thread(s) and test size of 1000000.
0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)
COUNTS:
Per thread: [999999.0]
Sum: 999999.0
And with 4 threads:
STARTED with 4 thread(s) and test size of 1000000.
0.140378 seconds (1.81 M allocations: 25.661 MiB)
COUNTS:
Per thread: [249999.0, 249999.0, 249999.0, 249999.0]
Sum: 999996.0
Multi-threading slows things down! Why?
EDIT: A better minimal example can be created using the @threads macro itself.
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
    a[Threads.threadid()] += b[i]
    calls[Threads.threadid()] += 1
end
I falsely assumed that the @threads macro's inclusion in Julia would mean that there was a benefit to be had.
The problem you have is most probably false sharing.
You can solve it by spacing out the areas you write to far enough, like this (here is a "quick and dirty" implementation to show the essence of the change):
julia> using BenchmarkTools  # for @btime

julia> function f(spacing)
           test_size = 1000000
           a = zeros(Threads.nthreads()*spacing)
           b = rand(test_size)
           calls = zeros(Threads.nthreads()*spacing)
           Threads.@threads for i = 1 : test_size
               @inbounds begin
                   a[Threads.threadid()*spacing] += b[i]
                   calls[Threads.threadid()*spacing] += 1
               end
           end
           a, calls
       end
f (generic function with 1 method)

julia> @btime f(1);
  41.525 ms (35 allocations: 7.63 MiB)

julia> @btime f(8);
  2.189 ms (35 allocations: 7.63 MiB)
or doing per-thread accumulation on a local variable like this (this is a preferred approach as it should be uniformly faster):
function getrange(n)
    tid = Threads.threadid()
    nt = Threads.nthreads()
    d, r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    from:to
end

function f()
    test_size = 10^8
    a = zeros(Threads.nthreads())
    b = rand(test_size)
    calls = zeros(Threads.nthreads())
    Threads.@threads for k = 1 : Threads.nthreads()
        local_a = 0.0
        local_c = 0.0
        for i in getrange(test_size)
            for j in 1:10
                local_a += b[i]
                local_c += 1
            end
        end
        a[Threads.threadid()] = local_a
        calls[Threads.threadid()] = local_c
    end
    a, calls
end
Also note that you are probably using 4 threads on a machine with 2 physical cores (and only 4 virtual cores), so the gains from threading will not be linear.

julia-lang Cache data in a parallel thread using @async

Suppose we have a slow function to produce data and another slow function to process data, as follows:
# some slow function
function prime(i)
    sleep(2)
    println("processed $i")
    i
end

function slow_process(x)
    sleep(2)
    println("slow processed $x")
end

function each(rng)
    function _iter()
        for i ∈ rng
            @time d = prime(i)
            produce(d)
        end
    end
    return Task(_iter)
end

@time for x ∈ each(1000:1002)
    slow_process(x)
end
Output:
% julia test-task.jl
processed 1000
2.063938 seconds (37.84 k allocations: 1.605 MB)
slow processed 1000
processed 1001
2.003115 seconds (17 allocations: 800 bytes)
slow processed 1001
processed 1002
2.001798 seconds (17 allocations: 800 bytes)
slow processed 1002
12.166475 seconds (88.08 k allocations: 3.640 MB)
Is there some way to get and cache data in a parallel thread using @async and feed it to the slow_process function?
Edit: I updated the example to clarify the problem. Ideally, the example should take 2+6 seconds instead of 12 seconds.
Edit 2: This is my attempt at using @sync and @async, but I got the error ERROR (unhandled task failure): no process with id 2 exists
macro swap(x,y)
    quote
        local tmp = $(esc(x))
        $(esc(x)) = $(esc(y))
        $(esc(y)) = tmp
    end
end

# some slow function
function prime(i)
    sleep(2)
    println("processed $i")
    i
end

function slow_process(x)
    sleep(2)
    println("slow processed $x")
end

function each(rng)
    @assert length(rng) > 1
    rng = collect(rng)
    a = b = nothing
    function _iter()
        for i ∈ 1:length(rng)
            if a == nothing
                a = @async remotecall_fetch(prime, 2, rng[i])
                b = @async remotecall_fetch(prime, 2, rng[i+1])
            else
                if i < length(rng)
                    a = @async remotecall_fetch(prime, 2, rng[i+1])
                end
                @swap(a,b)
            end
            @sync d = a
            produce(d)
        end
    end
    return Task(_iter)
end

@time for x ∈ each(1000:1002)
    slow_process(x)
end
OK, I have the working solution below:
macro swap(x,y)
    quote
        local tmp = $(esc(x))
        $(esc(x)) = $(esc(y))
        $(esc(y)) = tmp
    end
end

# some slow function
@everywhere function prime(i)
    sleep(2)
    println("prime $i")
    i
end

function slow_process(x)
    sleep(2)
    println("slow_process $x")
end

function each(rng)
    @assert length(rng) > 1
    rng = collect(rng)
    a = b = nothing
    function _iter()
        for i ∈ 1:length(rng)
            if a == nothing
                a = remotecall(prime, 2, rng[i])
                b = remotecall(prime, 2, rng[i+1])
            else
                if i < length(rng)
                    a = remotecall(prime, 2, rng[i+1])
                end
                @swap(a,b)
            end
            d = fetch(a)
            produce(d)
        end
    end
    return Task(_iter)
end

@time for x ∈ each(1000:1002)
    slow_process(x)
end
And
% julia -p 2 test-task.jl
8.354102 seconds (148.00 k allocations: 6.204 MB)
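As an aside (not part of the original answer): in current Julia the produce/Task iteration used above is gone, and the same prefetching idea can be written with a Channel, where a producer task stays one item ahead of the consumer. A rough sketch, assuming the prime and slow_process definitions from the question and running the producer as a task in the same process rather than on a worker:

# The producer fills the channel asynchronously; the buffer size (1)
# bounds how far ahead `prime` is allowed to run.
function each(rng)
    Channel{Int}(1) do ch
        for i in rng
            put!(ch, prime(i))
        end
    end
end

@time for x in each(1000:1002)
    slow_process(x)
end
# roughly 2 + 3*2 = 8 seconds instead of 12, because prime(i+1)
# overlaps with slow_process of the previous item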

JuMP doesn't release memory

Why, in this simple case, can't the garbage collector release all the memory allocated by JuMP? @time also reports only 71M.
using JuMP, GLPKMathProgInterface

function memuse()
    pid = parse(Int,readall(pipeline(`ps axc`,`awk "{if (\$5==\"julia\") print \$1}"`) ))
    return string(round(Int,parse(Int,readall(`ps -p $pid -o rss=`))/1024),"M")
end

function optimize()
    m = Model(solver=GLPKSolverLP())
    @variable(m, x[1:10] >= 0)
    @constraint(m, con[i = 1:10000000], x⋅rand(10) >= 0)
    solve(m)
    return getobjectivevalue(m)
end

println("Before $(memuse())")
@time optimize()
println("Created $(memuse())")
gc()
println("After gc() $(memuse())")
Output
Before 139M
11.683382 seconds (71.59 M allocations: 3.635 GB, 44.04% gc time)
Created 1471M
After gc() 924M
groups.google.com/forum/#!topic/julia-opt/zZr5dnQIJno

Memory allocation in a fixed point algorithm

I need to find the fixed point of a function f. The algorithm is very simple:
1. Given X, compute f(X).
2. If ||X - f(X)|| is below a certain tolerance, exit and return X; otherwise set X equal to f(X) and go back to 1.
I'd like to be sure I'm not allocating memory for a new object at every iteration.
For now, the algorithm looks like this:
iter1 = function(x::Vector{Float64})
    for iter in 1:max_it
        oldx = copy(x)
        g1(x)
        delta = vnormdiff(x, oldx, 2)
        if delta < tolerance
            break
        end
    end
end
Here g1(x) is a function that sets x to f(x)
But it seems this loop allocates a new vector at every iteration (see below).
Another way to write the algorithm is the following:
iter2 = function(x::Vector{Float64})
    oldx = similar(x)
    for iter in 1:max_it
        (oldx, x) = (x, oldx)
        g2(x, oldx)
        delta = vnormdiff(oldx, x, 2)
        if delta < tolerance
            break
        end
    end
end
where g2(x1, x2) is a function that sets x1 to f(x2).
Is this the most efficient and natural way to write this kind of iteration problem?
Edit1: timing shows that the second code is faster:
using NumericExtensions

max_it = 1000
tolerance = 1e-8
max_it = 100

g1 = function(x::Vector{Float64})
    for i in 1:length(x)
        x[i] = x[i]/2
    end
end

g2 = function(newx::Vector{Float64}, x::Vector{Float64})
    for i in 1:length(x)
        newx[i] = x[i]/2
    end
end

x = fill(1e7, int(1e7))
@time iter1(x)
# elapsed time: 4.688103075 seconds (4960117840 bytes allocated, 29.72% gc time)

x = fill(1e7, int(1e7))
@time iter2(x)
# elapsed time: 2.187916177 seconds (80199676 bytes allocated, 0.74% gc time)
Edit2: using copy!
iter3 = function(x::Vector{Float64})
    oldx = similar(x)
    for iter in 1:max_it
        copy!(oldx, x)
        g1(x)
        delta = vnormdiff(x, oldx, 2)
        if delta < tolerance
            break
        end
    end
end

x = fill(1e7, int(1e7))
@time iter3(x)
# elapsed time: 2.745350176 seconds (80008088 bytes allocated, 1.11% gc time)
I think replacing the following lines in the first code
for iter = 1:max_it
    oldx = copy( x )
    ...

by

oldx = zeros( N )
for iter = 1:max_it
    oldx[:] = x # or copy!( oldx, x )
    ...
will be more efficient, because no new array is allocated inside the loop. Also, the code can be made more efficient by writing for-loops explicitly. This can be seen, for example, from the following comparison:
function test()
    N = 1000000
    a = zeros( N )
    b = zeros( N )
    @time c = copy( a )
    @time b[:] = a
    @time copy!( b, a )
    @time for i = 1:length(a)
        b[i] = a[i]
    end
    @time for i in eachindex(a)
        b[i] = a[i]
    end
end
test()
The result obtained with Julia 0.4.0 on Linux (x86_64) is:
elapsed time: 0.003955609 seconds (7 MB allocated)
elapsed time: 0.001279142 seconds (0 bytes allocated)
elapsed time: 0.000836167 seconds (0 bytes allocated)
elapsed time: 1.19e-7 seconds (0 bytes allocated)
elapsed time: 1.28e-7 seconds (0 bytes allocated)
It seems that copy!() is faster than using [:] on the left-hand side,
though the difference becomes marginal in repeated calculations (there seems to be
some overhead for the first [:] calculation). Btw, the last example using eachindex() is very convenient for looping over multi-dimensional arrays.
Similar comparison can be made for vnormdiff(), where use of norm( x - oldx ) etc is slower than an explicit loop for vector norm, because the former allocates one temporary array for x - oldx.
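To make that last point concrete, here is a small sketch (the names and the loop are mine, not from the post, and it is written for a current Julia where norm lives in LinearAlgebra): the first version allocates a temporary array for x - oldx before taking the norm, while the second computes the same 2-norm of the difference with no temporaries.

using LinearAlgebra

# Allocates a temporary array for the difference x - oldx.
normdiff_alloc(x, oldx) = norm(x - oldx, 2)

# Same result, computed element by element with no temporary array.
function normdiff_loop(x, oldx)
    s = 0.0
    @inbounds for i in eachindex(x, oldx)
        d = x[i] - oldx[i]
        s += d * d
    end
    return sqrt(s)
end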
