Iterating a custom function efficiently in Julia - performance

I have an operator T_ implemented quite efficiently in Julia, and I want to iterate it using a while loop. My operator is given by:
# parameters
β = 0.987
δ = 0.012;
# grids
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500);
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0 , 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
# auxiliary functions to build my operator
F_(z,k) = exp(z) * (k^(1/3));
u_(c) = (c^(1-2) - 1)/(1-2)
# T_operator
function T_(V, P, kgrid, zgrid, β, δ)
    E = V * P'
    T1 = similar(V)
    for i in axes(T1, 2)
        for j in axes(T1, 1)
            temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
            aux = -Inf
            for l in eachindex(kgrid)
                c = max(0.0, temp - kgrid[l])
                aux = max(aux, u_(c) + β * E[l, i])
            end
            T1[j,i] = aux
        end
    end
    return T1
end
Explaining briefly, this operator takes as input:
V, a 500x7 matrix, and P, a 7x7 transition matrix (i.e. each row sums to one)
kgrid, a grid of length 500, and zgrid, a grid of length 7
β and δ, particular parameters
T_ returns a 500x7 matrix T1. More details about this operator and the correct way to run it can be found in this other question that I asked: Tricks to improve the performance of a custom function in Julia
Running this operator only once takes very little time, almost instant. However, I need to iterate it until the error falls below an acceptable tolerance, and my implementation results in an inefficient process that takes a long time:
max_it = 1000
it = 1
tol = 1e-3
dist = tol +1
V0 = repeat(sqrt.(a_grid), outer = [1,7]);
while it < max_it && dist > tol
    TV = T_(V0, P, kgrid, zgrid, β, δ)
    dist = maximum(abs.(TV - V0)) # Computing distance or error
    V0 = TV                       # update
    it = it + 1                   # Updating iterations
    # Some information about the state of the iteration
    if rem(it, 100) == 0
        println("Current iteration:")
        println(it)
        println("Current norm:")
        println(dist)
    end
end
I think a more efficient solution would be to incorporate the while loop directly into the implementation of the T_ operator, but I spent the whole day trying this and couldn't get it to work. Help.
UPDATE
This is the MATLAB version. It is more efficient:
V0 = repmat(sqrt(kgrid), 1, 7); % Concave and increasing guess
max_it = 1000;
tol = 1e-3;
%% Iteration
tic
norm = tol + 1;
it = 1;
tic;
[K, Z, new_K] = meshgrid(kgrid, zgrid, kgrid);
K = permute(K, [2, 1, 3]);
Z = permute(Z, [2, 1, 3]);
new_K = permute(new_K, [2, 1, 3]);
% Computing consumption on each possible state and choice
C = max(f(Z,K) + (1-delta)*K - new_K,0);
% All possible utilities
U = u(C);
disp('Starting value function iteration through the good and old brute force...')
while it < max_it & norm > tol
EV = V0 * P';
EV = permute(repmat(EV, 1, 1, nk), [3, 2, 1]);
H = U + beta*EV;
[TV, index] = max(H, [], 3);
it = it + 1; % Updating iterations
norm = max(max(abs(TV - V0))); % Computing error
V0 = TV;
if rem(it, 100) == 0
disp('Current iteration:')
disp(it)
disp('Current norm:')
disp(norm)
end
end
V = TV;
toc;

Just to get an idea of where we're starting from, let's wrap your initial implementation in a function:
function iterate_T_firstattempt(; max_it=1000, it=1, tol=1e-3, dist=tol+1)
    V0 = repeat(sqrt.(kgrid), outer = [1,7]) # Assuming the `a_grid` was a typo from your comments
    while it < max_it && dist > tol
        TV = T_(V0, P, kgrid, zgrid, β, δ)
        dist = maximum(abs.(TV - V0)) # Computing distance or error
        V0 = TV # update
        it += 1 # Updating iterations
        # Some information about the state of the iteration
        if rem(it, 100) == 0
            println("Current iteration:")
            println(it)
            println("Current norm:")
            println(dist)
        end
    end
end
and benchmark it with BenchmarkTools.jl:
julia> @benchmark iterate_T_firstattempt()
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 7.056 s (0.00% GC) to evaluate,
 with a memory estimate of 52.33 MiB, over 5875 allocations.
Oof, that's a lot of allocations. Some of these are coming from the use of global variables, others from type instability, yet others from the design of your functions. A few specific points:
The compiler's probably already making the right call, but we might as well add an @inline to your definitions of u_(c) and F_(z,k) to make sure they get inlined. And why not on T_ itself too while we're at it.
You're doing a lot of indexing in the nested for loops, so we might as well throw an @inbounds on there, given that there should be no way of indexing out of bounds.
One better: the loops in T_ look to be safely reorderable, so we can go ahead and upgrade that @inbounds to a @turbo or @tturbo from LoopVectorization.jl for an even bigger speedup using your CPU's SIMD instructions / Advanced Vector Extensions.
The calculation of dist = maximum(abs.(TV - V0)) involves at least two large allocations; we can avoid those with a simple mapreduce, or, to use those SIMD instructions again, vmapreduce from LoopVectorization.jl (see the sketch after this list).
The line TV = T_(V0, P, kgrid, zgrid, β, δ) is also allocating; let's switch that out for an in-place version, T_!.
As mentioned above, global variables are bad news. We can just move them into the function signature of iterate_T easily enough though, which should fix that problem.
While we're at it, let's also break out three-arg mul! from the LinearAlgebra stdlib for a non-allocating calculation of E = V * P'. And to get rid of one last sneaky source of type-instability (which was causing a final ~2k allocations), we should change that outer=[1,7] to outer=(1,7) -- a nice stable tuple instead of an array.
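To make the distance point concrete, here is a minimal sketch with hypothetical stand-in matrices, comparing the allocating one-liner to the vmapreduce call used in the full version below:
using LoopVectorization
A, B = rand(500, 7), rand(500, 7)                 # hypothetical stand-ins for TV and V0
d1 = maximum(abs.(A - B))                         # allocates two temporary 500x7 arrays
d2 = vmapreduce((a, b) -> abs(a - b), max, A, B)  # same result, computed without temporaries
d1 ≈ d2                                           # should be true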
Putting it all together:
using LinearAlgebra, LoopVectorization
# parameters
β = 0.987
δ = 0.012
# grids
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500)
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0 , 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
P = rand(7,7)
P ./= sum(P,dims=2) # Rows sum to one
# auxiliary functions to build operator
@inline F_(z,k) = exp(z) * (k^(1/3))
@inline u_(c) = (c^(1-2) - 1)/(1-2)
# T_operator, in-place version
@inline function T_!(TV, E, V, P, kgrid, zgrid, β, δ)
    mul!(E, V, P')
    @tturbo for i in axes(TV, 2)
        for j in axes(TV, 1)
            temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
            aux = -Inf
            for l in eachindex(kgrid)
                c = max(0.0, temp - kgrid[l])
                aux = max(aux, u_(c) + β * E[l, i])
            end
            TV[j,i] = aux
        end
    end
    return TV
end
function iterate_T(P, kgrid, zgrid, β, δ; max_it=1000, it=1, tol=1e-3, dist=tol+1)
    V0 = repeat(sqrt.(kgrid), outer=(1,7))
    # Preallocate temporary arrays
    TV = similar(V0)
    E = similar(V0)
    # Iterate
    for it = 1:max_it
        # Non-allocating in-place T_!
        TV = T_!(TV, E, V0, P, kgrid, zgrid, β, δ)
        # Compute distance or error
        dist = vmapreduce((a,b)->abs(a-b), max, TV, V0)
        copyto!(V0, TV) # update
        # # Some information about the state of the iteration
        # if rem(it, 100) == 0
        #     println("Current iteration:")
        #     println(it)
        #     println("Current norm:")
        #     println(dist)
        # end
        (dist < tol) && break
    end
    return V0
end
we get
julia> @benchmark iterate_T($P, $kgrid, $zgrid, $β, $δ)
BenchmarkTools.Trial: 11 samples with 1 evaluation.
Range (min … max): 460.246 ms … 599.820 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 474.826 ms ┊ GC (median): 0.00%
Time (mean ± σ): 486.661 ms ± 40.359 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▇▁▇█▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
460 ms Histogram: frequency by time 600 ms <
Memory estimate: 86.42 KiB, allocs estimate: 9.
That's a bit more like it!
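For completeness, a call to the optimized routine with the globals defined above should look something like this:
julia> V = iterate_T(P, kgrid, zgrid, β, δ);  # returns the converged 500x7 value function

julia> size(V)
(500, 7)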

Related

Julia @spawn and pmap() on an embarrassingly parallel problem that requires JuMP and Ipopt

I'd really appreciate some help on parallelizing the following pseudo code in Julia (and I do apologize in advance for the long post):
P, Q # both K by N matrix, K = num features and N = num samples
X, Y # K*4 by N and K*2 by N matrices
tempX, tempY # column vectors of size K*4 and K*2
ndata # a dict from parsing a .m file to be used by a solver with JuMP and Ipopt
# serial version
for i = 1:N
ndata[P] = P[:, i] # technically requires a for loop from 1 to K since the dict has to be indexed element-wise
ndata[Q] = Q[:, i]
ndata_A = run_solver_A(ndata) # with a third-party package and JuMP, Ipopt
ndata_B = run_solver_B(ndata)
kX = 1, kY = 1
for j = 1:K
tempX[kX:kX+3] = [ndata_A[j][a], ndata_A[j][b], P[j, i], Q[j, i]]
tempY[kY:kY+1] = [ndata_B[j][a], ndata_B[j][b]]
kX += 4
kY += 2
end
X[:, i] = deepcopy(tempX)
Y[:, i] = deepcopy(tempY)
end
So obviously, this for loop can be executed independently as long as no column of P and Q is accessed twice and the same column i of P and Q is accessed at a time. The only thing I need to be careful about is that column i of X and Y are the correct pair of tempX and tempY; I don't care as much about whether the i = 1, ..., N order is maintained (hopefully that makes sense!).
I read both the official documentation and some online tutorials, and wrote the following with @spawn and fetch that works for the insertion part, replacing the ndata[j][a] etc. with placeholder numbers 1.0 and 180:
using Distributed
addprocs(2)
num_proc = nprocs()
@everywhere function insertPQ(P, Q)
println(myid())
data = zeros(4*length(P))
k = 1
for i = 1:length(P)
data[k:k+3] = [1.0, 180., P[i], Q[i]]
k += 4
end
return data
end
P = [0.99, 0.99, 0.99, 0.99]
Q = [-0.01, -0.01, -0.01, -0.01]
for i = 1:5 # should be 4 x 32
global P = hcat(P, (P .- 0.01))
global Q = hcat(Q, (Q .- 0.01))
end
datas = zeros(16, 32) # serial result
datap = zeros(16, 0) # parallel result
@time for i = 1:32
s = fetch(@spawn insertPQ(P[:, i], Q[:, i]))
global datap = hcat(datap, s)
end
@time for i = 1:32
k = 1
for j = 1:4
datas[k:k+3, i] = [1.0, 180., P[j, i], Q[j, i]]
k += 4
end
end
println(datap == datas)
The above code is fine, but I did notice the output was consistently worker 2->3->4->5->2... and it was much slower than the serial case (I'm testing this on my laptop with only 4 cores, but eventually I'll run it on a cluster). It took forever to run when I added run_solver_A/B into insertPQ(), so I had to stop it.
As for pmap(), I couldn't figure out how to pass an entire vector to the function. I probably misunderstood the documentation but "Transform collection c by applying f to each element using available workers and tasks" sounds like I can only do this element-wise? That can't be it. I went to a Julia intro session last week and asked the lecturer about this. He said I should use pmap and I've been trying to make it work since.
So, how can I parallelize my original pseudo code? Any help or suggestion is greatly appreciated!
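For what it's worth, one way to hand pmap an entire column at a time is to map over column indices, so that each call receives whole columns of P and Q. A minimal sketch reusing the placeholder insertPQ and the P, Q, datas from above (not tested against the real JuMP/Ipopt solvers):
# pmap applies the function to each element of the collection; mapping over
# 1:size(P, 2) means each worker call gets the full columns P[:, i] and Q[:, i].
cols = pmap(i -> insertPQ(P[:, i], Q[:, i]), 1:size(P, 2))
datap2 = reduce(hcat, cols)   # pmap returns results in input order, so hcat reassembles them
println(datap2 == datas)      # should print true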

Julia: Avoid memory allocation due to nested function calls in for loop

I've seen multiple questions addressing memory allocation in Julia in general, however none of these examples helped me.
I provide a minimal example that shall illustrate my problem. I implemented a finite volume solver that computes the solution of an advection equation. Long story short, here is the (self-contained) code:
function dummyexample()
nx = 100
Δx = 1.0/nx
x = range(Δx/2.0, length=nx, step=Δx)
ρ = sin.(2π*x)
for i=1:floor(1.0/Δx / 0.5)
shu_osher_step!(ρ) # This part is executed several times
end
println(sum(Δx*abs.(ρ .- sin.(2π*x))))
end
function shu_osher_step!(ρ::AbstractArray)
ρ₁ = euler_step(ρ) # array allocation
ρ₂ = 3.0/4.0*ρ .+ 1.0/4.0*euler_step(ρ₁) # array allocation
ρ .= 1.0/3.0*ρ .+ 2.0/3.0*euler_step(ρ₂) # array allocation
end
function euler_step(ρ::AbstractArray)
return ρ .+ 0.5*rhs(ρ)
end
function rhs(ρ::AbstractArray)
ρₗ = circshift(ρ,+1) # array allocation
ρᵣ = circshift(ρ,-1) # array allocation
Δρₗ = ρ.-ρₗ # array allocation
Δρᵣ = ρᵣ .-ρ # array allocation
vᵣ = ρ .+ 1.0/2.0 .* H(Δρₗ,Δρᵣ) # array allocation
return -(vᵣ .- circshift(vᵣ,+1)) # array allocation
end
function H(Δρₗ::AbstractArray,Δρᵣ::AbstractArray)
σ = Δρₗ ./ Δρᵣ
σ̃ = max.(abs.(σ),1e-12) .* (2.0 .* (σ .>= 0.0) .- 1.0)
for i=1:100
if isnan(σ̃[i])
σ̃[i] = 1e-12
end
end
return Δρₗ .* (2.0/3.0*(1.0 ./ σ̃) .+ 1.0/3.0)
end
My problem is that, deep down in the call tree, the function rhs allocates several arrays in every iteration of the outermost time loop. These arrays are temporary and I do not like the fact that they have to be reallocated every iteration. Here is the output from @time:
julia> include("dummyexample.jl");
julia> @time dummyexample()
8.780349744014917e-5 # <- just to check that the error is almost zero
0.362833 seconds (627.38 k allocations: 39.275 MiB, 1.95% gc time)
Now, in the real code there is actually a struct p passed down the whole call tree that contains attributes which I hardcoded here (basically every one of the explicitly stated numbers would be referenced by p.n, etc.).
I could probably also pass down preallocated arrays, but that seems to get messy and I would have to change it every time I want to do extra computations.
Global arrays are discouraged in the Julia documentation but wouldn't that do the trick here? Are there any other obvious things I am missing? I am considering Julia 1.0.
Passing down preallocated arrays, as you say in the last paragraph, is exactly the right thing in this kind of situation. In addition to that, I would devectorize the code into a manual loop containing a stencil and more indexing math instead of circshift.
Applying both ideas results in the following:
function dummyexample()
nx = 100
Δx = 1.0 / nx
steps = 2 ÷ Δx
x = range(Δx ÷ 2, length = nx, step = Δx)
ρ = sin.(2π .* x)
run!(ρ, steps)
println(sum(@. Δx * abs(ρ - sin(2π * x))))
end
function run!(ρ, steps)
ρ₁, ρ₂, v = similar(ρ), similar(ρ), similar(ρ)
for i = 1:steps
shu_osher_step!(ρ₁, ρ₂, v, ρ)
end
return ρ
end
function shu_osher_step!(ρ₁, ρ₂, v, ρ)
euler_step!(ρ₁, v, ρ)
ρ₂ .= 3.0/4.0 .* ρ .+ 1.0/4.0 .* euler_step!(ρ₂, v, ρ₁)
ρ .= 1.0/3.0 .* ρ .+ 2.0/3.0 .* euler_step!(ρ, v, ρ₂)
end
function euler_step!(ρₒ, v, ρ)
cycle(i) = mod(i - 1, length(ρ)) + 1
# two steps of calculating v fused into one -- could be replaced by
# an extra loop for v.
for I in 1:2:size(ρ, 1)
v[I] = rhs(ρ[cycle(I-1)], ρ[I], ρ[cycle(I+1)])
v[cycle(I+1)] = rhs(ρ[cycle(I)], ρ[I+1], ρ[cycle(I+2)])
ρₒ[I] += 0.5 * (v[cycle(I+1)] - v[I])
end
return ρₒ
end
function rhs(ρₗ, ρᵢ, ρᵣ)
Δρₗ = ρᵢ - ρₗ
Δρᵣ = ρᵣ - ρᵢ
return ρᵢ + 1/2 * H(Δρₗ, Δρᵣ)
end
function H(Δρₗ, Δρᵣ)
σ = Δρₗ / Δρᵣ
σ̃ = max(abs(σ), 1e-12) * (2.0 * (σ >= 0.0) - 1.0)
isnan(σ̃) && (σ̃ = 1e-12)
return Δρₗ * (2.0 / 3.0 * (1.0 / σ̃) + 1.0 / 3.0)
end
The above might still contain some logic errors due to my lack of domain knowledge (dummyexample() prints 0.02984422033942575), but you see the pattern. And it benchmarks well:
julia> @benchmark run!($ρ, $steps)
BenchmarkTools.Trial:
memory estimate: 699.13 KiB
allocs estimate: 799
--------------
minimum time: 3.024 ms (0.00% GC)
median time: 3.164 ms (0.00% GC)
mean time: 3.760 ms (1.69% GC)
maximum time: 57.105 ms (94.41% GC)
--------------
samples: 1327
evals/sample: 1

Why is Julia allocating so much memory?

I am trying to write a fast coordinate descent algorithm for solving ordinary least squares regression. The following Julia code works, but I don't understand why it's allocating so much memory
function OLS_cd{T<:Float64}(A::Array{T,2}, b::Array{T,1}, tolerance::T=1e-12)
N,P = size(A)
x = zeros(P)
r = copy(b)
d = ones(P)
while sum(d.*d) > tolerance
@inbounds for j = 1:P
d[j] = sum(A[:,j].*r)
x[j] += d[j]
r -= d[j]*A[:,j]
end
end
return(x)
end
On the data I generate with
n = 100
p = 75
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Using @benchmark OLS_cd(X, y) I get
BenchmarkTools.Trial:
memory estimate: 65.94 mb
allocs estimate: 151359
--------------
minimum time: 19.316 ms (16.49% GC)
median time: 20.545 ms (16.60% GC)
mean time: 22.164 ms (16.24% GC)
maximum time: 42.114 ms (10.82% GC)
--------------
samples: 226
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
The OLS problem gets harder as p gets bigger, and I've noticed that as I make p bigger and the algorithm needs to run longer, Julia allocates more memory.
Why would each pass through the while loop allocate more memory? To my eye, it seems like all of my operations are in place, and the types are clearly specified.
Nothing popped out to me while profiling, but I could post that output as well if it's useful.
Update:
As pointed out below, temporary arrays caused by using vectorized operations were the culprit. The following eliminated extraneous allocations and runs pretty quickly:
function OLS_cd_unrolled{T<:Float64}(A::Array{T,2}, b::Array{T,1}, tolerance::T=1e-12)
N,P = size(A)
x = zeros(P)
r = copy(b)
d = ones(P)
while norm(d,Inf) > tolerance
@inbounds for j = 1:P
d[j] = 0.0; @inbounds for i = 1:N d[j] += A[i,j]*r[i] end
@inbounds for i = 1:N r[i] -= d[j]*A[i,j] end
x[j] += d[j]
end
end
return(x)
end
A[:,j] creates a copy, not a view. You want to use @view A[:,j] or view(A,:,j).
You can devectorize r -= d[j]*A[:,j] with r .= r .- d[j]*A[:,j] to get rid of some more temporaries. As @LutfullahTomak said, sum(A[:,j].*r) should devectorize as dot(view(A,:,j),r) to get rid of all of the temporaries in there. To use an infix operator, you can use \cdot, as in view(A,:,j)⋅r.
You should read up on copies vs views and how vectorization causes temporary arrays. The gist of it is that when vectorized operations occur, they have to create a new vector as output. Instead, you want to write to an existing vector. r = ... for an array changes the reference, so r = ex for some expression which makes an array will make a new array and then point r to that array. r .= ex will replace the values of the array r with the values from the expression. The former allocates a temporary, the latter does not. Repeated application of this idea is where all of the temporaries come from.
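As a tiny illustration of that last point, with some hypothetical throwaway arrays:
r = rand(5); A = rand(5, 5); d = rand(5); j = 1
r = r .- d[j] .* A[:, j]          # builds a brand-new array on the right, then rebinds r to it
r .= r .- d[j] .* view(A, :, j)   # fused broadcast writes into the existing r, no temporary array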
Actually, sum(d.*d), sum(A[:,j].*r) and so on are not in place and make temporary arrays. First, sum(d.*d) == dot(d,d), I think, and sum(A[:,j].*r) makes 2 temporary arrays; I'd do dot(view(A,:,j),r) for the latter. The current stable version of Julia (0.5) doesn't have a short version for r -= d[j]*A[:,j], so you need to devectorize it and write a loop.

BLAS v. parallel updates for Julia SharedArray objects

I am interested in using Julia SharedArrays for a scientific computing project. My current implementation appeals to BLAS for all matrix-vector operations, but I thought that perhaps a SharedArray would offer some speedup on multicore machines. My idea is to simply update an output vector index-by-index, farming the index updates to worker processes.
Previous discussions here about SharedArrays and here about shared memory objects did not offer clear guidance on this issue. It seems intuitively simple enough, but after testing, I'm somewhat confused as to why this approach works so poorly (see code below). For starters, it seems like @parallel for allocates a lot of memory. And if I prefix the loop with @sync, which seems like a smart thing to do if the whole output vector is required later, then the parallel loop is substantially slower (though without @sync, the loop is mighty quick).
Have I incorrectly interpreted the proper use of the SharedArray object? Or perhaps did I inefficiently assign the calculations?
### test for speed gain w/ SharedArray vs. Array ###
# problem dimensions
n = 10000; p = 25000
# set BLAS threads; 64 seems reasonable in testing
blas_set_num_threads(64)
# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)
# make SharedArrays
X = convert(SharedArray{Float64,2}, x)
Y = convert(SharedArray{Float64,1}, y)
Z = convert(SharedArray{Float64,1}, z)
# run BLAS.gemv! on Arrays twice, time second case
BLAS.gemv!('N', 1.0, x, y, 0.0, z)
@time BLAS.gemv!('N', 1.0, x, y, 0.0, z)
# does BLAS work equally well for SharedArrays?
# check timing result and ensure same answer
BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
@time BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
println("$(isequal(z,Z))") # should be true
# SharedArrays can be updated in parallel
# code a loop to farm updates to worker nodes
# use transposed X to place rows of X in columnar format
# should (hopefully) help with performance issues from stride
Xt = X'
@parallel for i = 1:n
Z[i] = dot(Y, Xt[:,i])
end
# now time the synchronized copy of this
@time @sync @parallel for i = 1:n
Z[i] = dot(Y, Xt[:,i])
end
# still get same result?
println("$(isequal(z,Z))") # should be true
Output from test with 4 workers + 1 master node:
elapsed time: 0.109010169 seconds (80 bytes allocated)
elapsed time: 0.110858551 seconds (80 bytes allocated)
true
elapsed time: 1.726231048 seconds (119936 bytes allocated)
true
You're running into several issues, of which the most important is that Xt[:,i] creates a new array (allocating memory). Here's a demonstration that gets you closer to what you want:
n = 10000; p = 25000
# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)
# make SharedArrays
X = convert(SharedArray, x)
Y = convert(SharedArray, y)
Z = convert(SharedArray, z)
Xt = X'
@everywhere function dotcol(a, B, j)
length(a) == size(B,1) || throw(DimensionMismatch("a and B must have the same number of rows"))
s = 0.0
@inbounds @simd for i = 1:length(a)
s += a[i]*B[i,j]
end
s
end
function run1!(Z, Y, Xt)
for j = 1:size(Xt, 2)
Z[j] = dotcol(Y, Xt, j)
end
Z
end
function runp!(Z, Y, Xt)
@sync @parallel for j = 1:size(Xt, 2)
Z[j] = dotcol(Y, Xt, j)
end
Z
end
run1!(Z, Y, Xt)
runp!(Z, Y, Xt)
@time run1!(Z, Y, Xt)
zc = copy(sdata(Z))
fill!(Z, -1)
@time runp!(Z, Y, Xt)
@show sdata(Z) == zc
Results (when starting julia -p 8):
julia> include("/tmp/paralleldot.jl")
elapsed time: 0.465755791 seconds (80 bytes allocated)
elapsed time: 0.076751406 seconds (282 kB allocated)
sdata(Z) == zc = true
For comparison, when running on this same machine:
julia> blas_set_num_threads(8)
julia> @time A_mul_B!(Z, X, Y);
elapsed time: 0.067611858 seconds (80 bytes allocated)
So the raw Julia implementation is at least competitive with BLAS.

Vectorization: friend or foe? bsxfun/arrayfun to avoid loops, repmat, permute, squeeze, etc

This question is related to this question and probably to this other one as well.
Suppose you have two matrices A and B. A is M-by-N and B is N-by-K. I want to obtain an M-by-K matrix C such that C(i, j) = 1 - prod(1 - A(i, :)' .* B(:, j)). I have tried some solutions in MATLAB; here I am comparing their computational performance.
% Size of matrices:
M = 4e3;
N = 5e2;
K = 5e1;
GG = 50; % GG instances
rntm1 = zeros(GG, 1); % running time of first algorithm
rntm2 = zeros(GG, 1); % running time of second algorithm
rntm3 = zeros(GG, 1); % running time of third algorithm
rntm4 = zeros(GG, 1); % running time of fourth algorithm
rntm5 = zeros(GG, 1); % running time of fifth algorithm
for gg = 1:GG
A = rand(M, N); % M-by-N matrix of random numbers
A = A ./ repmat(sum(A, 2), 1, N); % M-by-N matrix of probabilities (?)
B = rand(N, K); % N-by-K matrix of random numbers
B = B ./ repmat(sum(B), N, 1); % N-by-K matrix of probabilities (?)
%% First solution
% One-liner solution:
tic
C = squeeze(1 - prod(1 - repmat(A, [1 1 K]) .* permute(repmat(B, [1 1 M]), [3 1 2]), 2));
rntm1(gg) = toc;
%% Second solution
% Full vectorization, using meshgrid, arrayfun and reshape (from Luis Mendo, second link above)
tic
[ii jj] = meshgrid(1:size(A, 1), 1:size(B, 2));
D = arrayfun(@(n) 1 - prod(1 - A(ii(n), :)' .* B(:, jj(n))), 1:numel(ii));
D = reshape(D, size(B, 2), size(A, 1)).';
rntm2(gg) = toc;
clear ii jj
%% Third solution
% Partial vectorization 1
tic
E = zeros(M, K);
for hh = 1:M
tmp = repmat(A(hh, :)', 1, K);
E(hh, :) = 1 - prod((1 - tmp .* B), 1);
end
rntm3(gg) = toc;
clear tmp hh
%% Fourth solution
% Partial vectorization 2
tic
F = zeros(M, K);
for hh = 1:M
for ii = 1:K
F(hh, ii) = 1 - prod(1 - A(hh, :)' .* B(:, ii));
end
end
rntm4(gg) = toc;
clear hh ii
%% Fifth solution
% No vectorization at all
tic
G = ones(M, K);
for hh = 1:M
for ii = 1:K
for jj = 1:N
G(hh, ii) = G(hh, ii) * prod(1 - A(hh, jj) .* B(jj, ii));
end
G(hh, ii) = 1 - G(hh, ii);
end
end
rntm5(gg) = toc;
clear hh ii jj C D E F G
end
prctile([rntm1 rntm2 rntm3 rntm4 rntm5], [2.5 25 50 75 97.5])
% 3.6519 3.5261 0.5912 1.9508 2.7576
% 5.3449 6.8688 1.1973 3.3744 3.9940
% 8.1094 8.7016 1.4116 4.9678 7.0312
% 8.8124 10.5170 1.9874 6.1656 8.8227
% 9.5881 12.0150 2.1529 6.6445 9.5115
mean([rntm1 rntm2 rntm3 rntm4 rntm5])
% 7.2420 8.3068 1.4522 4.5865 6.4423
std([rntm1 rntm2 rntm3 rntm4 rntm5])
% 2.1070 2.5868 0.5261 1.6122 2.4900
The solutions are equivalent but the algorithms with partial vectorization are way more efficient in terms of memory and execution time. Even the triple loop seems to perform better than arrayfun! Is there any approach that is actually better than the third, only partially vectorized solution?
EDIT: Dan's solutions are the best so far. Let rntm6, rntm7 and rntm8 be the runtime of his first, second and third solution. Then:
prctile(rntm6, [2.5 25 50 75 97.5])
% 0.6337 0.6377 0.6480 0.7110 1.2932
mean(rntm6)
% 0.7440
std(rntm6)
% 0.1970
prctile(rntm7, [2.5 25 50 75 97.5])
% 0.6898 0.7130 0.9050 1.1505 1.4041
mean(rntm7)
% 0.9313
std(rntm7)
% 0.2276
prctile(rntm8, [2.5 25 50 75 97.5])
% 0.5949 0.6005 0.6036 0.6370 1.3529
mean(rntm8)
% 0.6753
std(rntm8)
% 0.1890
You can get a minor performance gain with bsxfun:
E = zeros(M, K);
for hh = 1:M
E(hh, :) = 1 - prod((1 - bsxfun(@times, A(hh,:)', B)), 1);
end
And you could squeeze (pun intended) a tiny bit more performance with this:
E = squeeze(1 - prod((1-bsxfun(@times, permute(B, [3 1 2]), A)),2));
Or you could try pre-compute the transpose for my first suggestion:
E = zeros(M, K);
At = A';
for hh = 1:M
E(hh, :) = 1 - prod((1 - bsxfun(@times, At(:,hh), B)), 1);
end
One situation where you would absolutely benefit from using arrayfun or bsxfun is where you have Parallel Computing Toolbox available and a compatible NVIDIA GPU. In that case, the performance of those two functions is blazingly fast since the body can be sent to the GPU for execution there. See for example: http://www.mathworks.co.uk/help/distcomp/examples/improve-performance-of-element-wise-matlab-functions-on-the-gpu-using-arrayfun.html
