BLAS v. parallel updates for Julia SharedArray objects - parallel-processing

I am interested in using Julia SharedArrays for a scientific computing project. My current implementation calls BLAS for all matrix-vector operations, but I thought that a SharedArray might offer some speedup on multicore machines. My idea is simply to update an output vector index-by-index, farming the index updates out to worker processes.
Previous discussions here about SharedArrays and here about shared memory objects did not offer clear guidance on this issue. It seems intuitively simple enough, but after testing, I'm somewhat confused as to why this approach works so poorly (see code below). For starters, @parallel for seems to allocate a lot of memory. And if I prefix the loop with @sync, which seems like a smart thing to do if the whole output vector is required later, then the parallel loop is substantially slower (though without @sync the loop is mighty quick).
Have I misinterpreted the proper use of the SharedArray object, or have I assigned the calculations inefficiently?
### test for speed gain w/ SharedArray vs. Array ###
# problem dimensions
n = 10000; p = 25000
# set BLAS threads; 64 seems reasonable in testing
blas_set_num_threads(64)
# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)
# make SharedArrays
X = convert(SharedArray{Float64,2}, x)
Y = convert(SharedArray{Float64,1}, y)
Z = convert(SharedArray{Float64,1}, z)
# run BLAS.gemv! on Arrays twice, time second case
BLAS.gemv!('N', 1.0, x, y, 0.0, z)
@time BLAS.gemv!('N', 1.0, x, y, 0.0, z)
# does BLAS work equally well for SharedArrays?
# check timing result and ensure same answer
BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
@time BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
println("$(isequal(z,Z))") # should be true
# SharedArrays can be updated in parallel
# code a loop to farm updates to worker nodes
# use transposed X to place rows of X in columnar format
# should (hopefully) help with performance issues from stride
Xt = X'
@parallel for i = 1:n
    Z[i] = dot(Y, Xt[:,i])
end
# now time the synchronized copy of this
@time @sync @parallel for i = 1:n
    Z[i] = dot(Y, Xt[:,i])
end
# still get same result?
println("$(isequal(z,Z))") # should be true
Output from test with 4 workers + 1 master node:
elapsed time: 0.109010169 seconds (80 bytes allocated)
elapsed time: 0.110858551 seconds (80 bytes allocated)
true
elapsed time: 1.726231048 seconds (119936 bytes allocated)
true

You're running into several issues, of which the most important is that Xt[:,i] creates a new array (allocating memory). Here's a demonstration that gets you closer to what you want:
n = 10000; p = 25000
# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)
# make SharedArrays
X = convert(SharedArray, x)
Y = convert(SharedArray, y)
Z = convert(SharedArray, z)
Xt = X'
@everywhere function dotcol(a, B, j)
    length(a) == size(B,1) || throw(DimensionMismatch("a and B must have the same number of rows"))
    s = 0.0
    @inbounds @simd for i = 1:length(a)
        s += a[i]*B[i,j]
    end
    s
end
function run1!(Z, Y, Xt)
    for j = 1:size(Xt, 2)
        Z[j] = dotcol(Y, Xt, j)
    end
    Z
end
function runp!(Z, Y, Xt)
    @sync @parallel for j = 1:size(Xt, 2)
        Z[j] = dotcol(Y, Xt, j)
    end
    Z
end
run1!(Z, Y, Xt)
runp!(Z, Y, Xt)
@time run1!(Z, Y, Xt)
zc = copy(sdata(Z))
fill!(Z, -1)
@time runp!(Z, Y, Xt)
@show sdata(Z) == zc
Results (when starting julia -p 8):
julia> include("/tmp/paralleldot.jl")
elapsed time: 0.465755791 seconds (80 bytes allocated)
elapsed time: 0.076751406 seconds (282 kB allocated)
sdata(Z) == zc = true
For comparison, when running on this same machine:
julia> blas_set_num_threads(8)
julia> @time A_mul_B!(Z, X, Y);
elapsed time: 0.067611858 seconds (80 bytes allocated)
So the raw Julia implementation is at least competitive with BLAS.
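As a footnote, the original loop from the question can also be rescued without the helper function, by replacing the allocating slice Xt[:,i] with a view. Here is a minimal sketch (runp_view! is just an illustrative name; it assumes a Julia version where view is available -- on the older versions used above, sub plays the same role -- and it may not vectorize as well as the explicit dotcol loop):
function runp_view!(Z, Y, Xt)
    @sync @parallel for j = 1:size(Xt, 2)
        # view(Xt, :, j) shares memory with Xt, whereas Xt[:, j] copies the column
        Z[j] = dot(Y, view(Xt, :, j))
    end
    Z
end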

Related

Iterating a custom function efficiently in Julia

I have an operator T_ implemented quite efficiently in Julia, and I want to iterate it using a while loop. My operator is given by:
# parameters
β = 0.987
δ = 0.012;
# grids
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500);
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0, 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
# auxiliary functions to build my operator
F_(z,k) = exp(z) * (k^(1/3));
u_(c) = (c^(1-2) - 1)/(1-2)
# T_operator
function T_(V, P, kgrid, zgrid, β, δ)
    E = V * P'
    T1 = similar(V)
    for i in axes(T1, 2)
        for j in axes(T1, 1)
            temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
            aux = -Inf
            for l in eachindex(kgrid)
                c = max(0.0, temp - kgrid[l])
                aux = max(aux, u_(c) + β * E[l, i])
            end
            T1[j,i] = aux
        end
    end
    return T1
end
Explaining briefly, this operator takes as input:
V, a 500x7 matrix, and P, a 7x7 transition matrix (i.e. each row sums to one)
kgrid, a grid of length 500, and zgrid, a grid of length 7
β and δ, particular parameters
T_ returns a 500x7 matrix T1. More details about this operator and the correct way to run it can be found in this other question that I asked: Tricks to improve the performance of a custom function in Julia
Running this operator only once takes very little time, almost none at all. However, I need to iterate it until I reach an acceptable tolerance, and my implementation of that iteration is inefficient and takes a long time:
max_it = 1000
it = 1
tol = 1e-3
dist = tol + 1
V0 = repeat(sqrt.(a_grid), outer = [1,7]);
while it < max_it && dist > tol
    TV = T_(V0, P, kgrid, zgrid, β, δ)
    dist = maximum(abs.(TV - V0)) # Computing distance or error
    V0 = TV                       # update
    it = it + 1                   # Updating iterations
    # Some information about the state of the iteration
    if rem(it, 100) == 0
        println("Current iteration:")
        println(it)
        println("Current norm:")
        println(dist)
    end
end
I think a more efficient solution would be to incorporate the while loop directly into the implementation of the T_ operator, but I spent the whole day trying this and couldn't get it to work. Any help is appreciated.
UPDATE
This is the MATLAB version. It is more efficient:
V0 = repmat(sqrt(kgrid), 1, 7); % Concave and increasing guess
max_it = 1000;
tol = 1e-3;
%% Iteration
tic
norm = tol + 1;
it = 1;
tic;
[K, Z, new_K] = meshgrid(kgrid, zgrid, kgrid);
K = permute(K, [2, 1, 3]);
Z = permute(Z, [2, 1, 3]);
new_K = permute(new_K, [2, 1, 3]);
% Computing consumption on each possible state and choice
C = max(f(Z,K) + (1-delta)*K - new_K,0);
% All possible utilities
U = u(C);
disp('Starting value function iteration through the good and old brute force...')
while it < max_it & norm > tol
    EV = V0 * P';
    EV = permute(repmat(EV, 1, 1, nk), [3, 2, 1]);
    H = U + beta*EV;
    [TV, index] = max(H, [], 3);
    it = it + 1; % Updating iterations
    norm = max(max(abs(TV - V0))); % Computing error
    V0 = TV;
    if rem(it, 100) == 0
        disp('Current iteration:')
        disp(it)
        disp('Current norm:')
        disp(norm)
    end
end
V = TV;
toc;
Just to get an idea of where we're starting from, let's wrap your initial implementation in a function
function iterate_T_firstattempt(; max_it=1000, it=1, tol=1e-3, dist=tol+1)
    V0 = repeat(sqrt.(kgrid), outer = [1,7]) # Assuming the `a_grid` was a typo from your comments
    while it < max_it && dist > tol
        TV = T_(V0, P, kgrid, zgrid, β, δ)
        dist = maximum(abs.(TV - V0)) # Computing distance or error
        V0 = TV # update
        it += 1 # Updating iterations
        # Some information about the state of the iteration
        if rem(it, 100) == 0
            println("Current iteration:")
            println(it)
            println("Current norm:")
            println(dist)
        end
    end
end
and benchmark it with BenchmarkTools.jl
julia> @benchmark iterate_T_firstattempt()
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 7.056 s (0.00% GC) to evaluate,
 with a memory estimate of 52.33 MiB, over 5875 allocations.
Oof, that's a lot of allocations. Some of these come from the use of global variables, others from type instability, and yet others from the design of your functions. A few specific points:
The compiler is probably already making the right call, but we might as well add an @inline to your definitions of u_(c) and F_(z,k) to make sure they get inlined. And why not on T_ itself too, while we're at it.
You're doing a lot of indexing in the nested for loops, so we might as well throw an @inbounds on there, given that there should be no way of getting out-of-bounds indexing.
One better: the loops in T_ look to be safely reorderable, so we can upgrade that @inbounds to a @turbo or @tturbo from LoopVectorization.jl for an even bigger speedup from your CPU's SIMD instructions / Advanced Vector Extensions.
The calculation of dist = maximum(abs.(TV - V0)) involves at least two large allocations; we can avoid those with a simple mapreduce, or, to use those SIMD instructions again, vmapreduce from LoopVectorization.jl.
The line TV = T_(V0, P, kgrid, zgrid, β, δ) is also allocating; let's switch that out for an in-place version, T_!.
As mentioned above, global variables are bad news. We can move them into the function signature of iterate_T easily enough, which should fix that problem.
While we're at it, let's also break out three-argument mul! from the LinearAlgebra stdlib for a non-allocating calculation of E = V * P'. And to get rid of one last sneaky source of type instability (which was causing a final ~2k allocations), we should change that outer=[1,7] to outer=(1,7) -- a nice stable tuple instead of an array.
Putting it all together:
using LinearAlgebra, LoopVectorization
# parameters
β = 0.987
δ = 0.012
# grids
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500)
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0, 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
P = rand(7,7)
P ./= sum(P,dims=2) # Rows sum to one
# auxiliary functions to build operator
@inline F_(z,k) = exp(z) * (k^(1/3))
@inline u_(c) = (c^(1-2) - 1)/(1-2)
# T_operator, in-place version
@inline function T_!(TV, E, V, P, kgrid, zgrid, β, δ)
    mul!(E, V, P')
    @tturbo for i in axes(TV, 2)
        for j in axes(TV, 1)
            temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
            aux = -Inf
            for l in eachindex(kgrid)
                c = max(0.0, temp - kgrid[l])
                aux = max(aux, u_(c) + β * E[l, i])
            end
            TV[j,i] = aux
        end
    end
    return TV
end
function iterate_T(P, kgrid, zgrid, β, δ; max_it=1000, it=1, tol=1e-3, dist=tol+1)
    V0 = repeat(sqrt.(kgrid), outer=(1,7))
    # Preallocate temporary arrays
    TV = similar(V0)
    E = similar(V0)
    # Iterate
    for it = 1:max_it
        # Non-allocating in-place T_!
        TV = T_!(TV, E, V0, P, kgrid, zgrid, β, δ)
        # Compute distance or error
        dist = vmapreduce((a,b)->abs(a-b), max, TV, V0)
        copyto!(V0, TV) # update
        # # Some information about the state of the iteration
        # if rem(it, 100) == 0
        #     println("Current iteration:")
        #     println(it)
        #     println("Current norm:")
        #     println(dist)
        # end
        (dist < tol) && break
    end
    return V0
end
we get
julia> @benchmark iterate_T($P, $kgrid, $zgrid, $β, $δ)
BenchmarkTools.Trial: 11 samples with 1 evaluation.
Range (min … max): 460.246 ms … 599.820 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 474.826 ms ┊ GC (median): 0.00%
Time (mean ± σ): 486.661 ms ± 40.359 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▇▁▇█▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
460 ms Histogram: frequency by time 600 ms <
Memory estimate: 86.42 KiB, allocs estimate: 9.
That's a bit more like it!

Why is Julia allocating so much memory?

I am trying to write a fast coordinate descent algorithm for solving ordinary least squares regression. The following Julia code works, but I don't understand why it allocates so much memory:
function OLS_cd{T<:Float64}(A::Array{T,2}, b::Array{T,1}, tolerance::T=1e-12)
    N,P = size(A)
    x = zeros(P)
    r = copy(b)
    d = ones(P)
    while sum(d.*d) > tolerance
        @inbounds for j = 1:P
            d[j] = sum(A[:,j].*r)
            x[j] += d[j]
            r -= d[j]*A[:,j]
        end
    end
    return(x)
end
On the data I generate with
n = 100
p = 75
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Using @benchmark OLS_cd(X, y) I get
BenchmarkTools.Trial:
memory estimate: 65.94 mb
allocs estimate: 151359
--------------
minimum time: 19.316 ms (16.49% GC)
median time: 20.545 ms (16.60% GC)
mean time: 22.164 ms (16.24% GC)
maximum time: 42.114 ms (10.82% GC)
--------------
samples: 226
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
The OLS problem gets harder as p gets bigger, and I've noticed that as I make p bigger and need to run longer, Julia allocates more and more memory.
Why would each pass through the while loop allocate more memory? To my eye, it seems like all of my operations are in place, and the types are clearly specified.
Nothing popped out to me while profiling, but I could post that output as well if it's useful.
Update:
As pointed out below, temporary arrays caused by using vectorized operations were the culprit. The following eliminated extraneous allocations and runs pretty quickly:
function OLS_cd_unrolled{T<:Float64}(A::Array{T,2}, b::Array{T,1}, tolerance::T=1e-12)
    N,P = size(A)
    x = zeros(P)
    r = copy(b)
    d = ones(P)
    while norm(d,Inf) > tolerance
        @inbounds for j = 1:P
            d[j] = 0.0; @inbounds for i = 1:N d[j] += A[i,j]*r[i] end
            @inbounds for i = 1:N r[i] -= d[j]*A[i,j] end
            x[j] += d[j]
        end
    end
    return(x)
end
A[:,j] creates a copy, not a view. You want to use @view A[:,j] or view(A,:,j).
You can devectorize r -= d[j]*A[:,j] with r .= -.(r, d[j]*A[:,j]) to get rid of some more temporaries. As @LutfullahTomak said, sum(A[:,j].*r) should devectorize as dot(view(A,:,j),r) to get rid of all of the temporaries in there. To use an infix operator, you can use \cdot, as in view(A,:,j)⋅r.
You should read up on copies vs. views and how vectorization causes temporary arrays. The gist of it is that a vectorized operation has to create a new vector as output, whereas you want to write into an existing vector. For an array, r = ... changes the reference: r = ex for some expression ex that makes an array will allocate a new array and then point r at it, while r .= ex replaces the values of the existing array r with the values from the expression. The former allocates a temporary, the latter does not. Repeated application of this idea is where all of the temporaries come from.
Actually, sum(d.*d), sum(A[:,j].*r), and so on are not in-place and make temporary arrays. First, sum(d.*d) == dot(d,d), I think, and sum(A[:,j].*r) makes two temporary arrays; I'd do dot(view(A,:,j),r) for the latter. The current stable version of Julia (0.5) doesn't have a short fused form of r -= d[j]*A[:,j], so you need to devectorize it into a loop, as sketched below.
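To make that last point concrete, here is a minimal, version-agnostic sketch of the devectorized update (axpy_col! is a hypothetical helper name, not part of the original code):
function axpy_col!(r, A, d, j)
    @inbounds for i = 1:length(r)
        # scalar operations only: no temporary arrays are allocated
        r[i] -= d[j]*A[i,j]
    end
    r
end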

Matlab allocating A=zeros(N) vs. A(N,N)=0 yields strange performance results [duplicate]

I've stumbled upon the weird way (in my view) that Matlab is dealing with empty matrices. For example, if two empty matrices are multiplied the result is:
zeros(3,0)*zeros(0,3)
ans =
0 0 0
0 0 0
0 0 0
Now, this already took me by surprise; however, a quick search got me to the link above, and I got an explanation of the somewhat twisted logic behind why this happens.
However, nothing prepared me for the following observation. I asked myself: how efficient is this type of multiplication vs. just using the zeros(n) function, say, for the purpose of initialization? I used timeit to answer this:
f=@() zeros(1000)
timeit(f)
ans =
0.0033
vs:
g=@() zeros(1000,0)*zeros(0,1000)
timeit(g)
ans =
9.2048e-06
Both produce the same outcome, a 1000x1000 matrix of zeros of class double, but the empty matrix multiplication one is ~350 times faster! (A similar result happens using tic and toc in a loop.)
How can this be? Are timeit or tic/toc bluffing, or have I found a faster way to initialize matrices?
(This was done with MATLAB 2012a, on a Win7-64 machine, Intel i5-650 3.2 GHz...)
EDIT:
After reading your feedback, I have looked more carefully into this peculiarity and tested, on 2 different computers (same MATLAB version, 2012a), code that examines the run time vs. the size of the matrix n. This is what I get:
The code to generate this used timeit as before, but a loop with tic and toc would look the same. So, for small sizes, zeros(n) is comparable. However, around n=400 there is a jump in performance for the empty matrix multiplication. The code I used to generate that plot was:
n=unique(round(logspace(0,4,200)));
for k=1:length(n)
    f=@() zeros(n(k));
    t1(k)=timeit(f);
    g=@() zeros(n(k),0)*zeros(0,n(k));
    t2(k)=timeit(g);
end
loglog(n,t1,'b',n,t2,'r');
legend('zeros(n)','zeros(n,0)*zeros(0,n)',2);
xlabel('matrix size (n)'); ylabel('time [sec]');
Do any of you experience this too?
EDIT #2:
Incidentally, empty matrix multiplication is not needed to get this effect. One can simply do:
z(n,n)=0;
where n > some threshold matrix size seen in the previous graph, and get the exact same efficiency profile as with empty matrix multiplication (again using timeit).
Here's an example where it improves the efficiency of code:
n = 1e4;
clear z1
tic
z1 = zeros( n );
for cc = 1 : n
    z1(:,cc)=cc;
end
toc % Elapsed time is 0.445780 seconds.
%%
clear z0
tic
z0 = zeros(n,0)*zeros(0,n);
for cc = 1 : n
    z0(:,cc)=cc;
end
toc % Elapsed time is 0.297953 seconds.
However, using z(n,n)=0; instead yields similar results to the zeros(n) case.
This is strange; I am seeing f being faster and g being slower than what you are seeing, but both of them are essentially identical for me. Perhaps a different version of MATLAB?
>> g = @() zeros(1000, 0) * zeros(0, 1000);
>> f = @() zeros(1000)
f =
@()zeros(1000)
>> timeit(f)
ans =
8.5019e-04
>> timeit(f)
ans =
8.4627e-04
>> timeit(g)
ans =
8.4627e-04
EDIT: can you add + 1 at the end of f and g, and see what times you are getting?
EDIT Jan 6, 2013 7:42 EST
I am using a machine remotely, so sorry about the low quality graphs (had to generate them blind).
Machine config:
i7 920. 2.653 GHz. Linux. 12 GB RAM. 8MB cache.
It looks like even the machine I have access to shows this behavior, except at a larger size (somewhere between 1979 and 2073). There is no reason I can think of right now for the empty matrix multiplication to be faster at larger sizes.
I will be investigating a little bit more before coming back.
EDIT Jan 11, 2013
After @EitanT's post, I wanted to do a little more digging. I wrote some C code to see how MATLAB may be creating a zeros matrix. Here is the C++ code that I used.
int main(int argc, char **argv)
{
    for (int i = 1975; i <= 2100; i += 25) {
        timer::start();
        double *foo = (double *)malloc(i * i * sizeof(double));
        for (int k = 0; k < i * i; k++) foo[k] = 0;
        double mftime = timer::stop();
        free(foo);

        timer::start();
        double *bar = (double *)malloc(i * i * sizeof(double));
        memset(bar, 0, i * i * sizeof(double));
        double mmtime = timer::stop();
        free(bar);

        timer::start();
        double *baz = (double *)calloc(i * i, sizeof(double));
        double catime = timer::stop();
        free(baz);

        printf("%d, %lf, %lf, %lf\n", i, mftime, mmtime, catime);
    }
}
Here are the results.
$ ./test
1975, 0.013812, 0.013578, 0.003321
2000, 0.014144, 0.013879, 0.003408
2025, 0.014396, 0.014219, 0.003490
2050, 0.014732, 0.013784, 0.000043
2075, 0.015022, 0.014122, 0.000045
2100, 0.014606, 0.014480, 0.000045
As you can see, calloc (4th column) seems to be the fastest method. It also gets significantly faster between 2025 and 2050 (I'd assume the threshold is around 2048?).
Now I went back to matlab to check for the same. Here are the results.
>> test
1975, 0.003296, 0.003297
2000, 0.003377, 0.003385
2025, 0.003465, 0.003464
2050, 0.015987, 0.000019
2075, 0.016373, 0.000019
2100, 0.016762, 0.000020
It looks like both f() and g() use calloc at smaller sizes (< 2048?). At larger sizes, f() (zeros(m, n)) starts to use malloc + memset, while g() (zeros(m, 0) * zeros(0, n)) keeps using calloc.
So the divergence is explained by the following:
zeros(..) begins to use a different (slower ?) scheme at larger sizes.
calloc also behaves somewhat unexpectedly, leading to an improvement in performance.
This is the behavior on Linux. Can someone do the same experiment on a different machine (and perhaps a different OS) and see if the experiment holds ?
The results might be a bit misleading. When you multiply two empty matrices, the resulting matrix is not immediately "allocated" and "initialized", rather this is postponed until you first use it (sort of like a lazy evaluation).
The same applies when indexing out of bounds to grow a variable, which in the case of numeric arrays fills out any missing entries with zeros (I discuss the non-numeric case afterwards). Of course, growing the matrix this way does not overwrite existing elements.
So while it may seem faster, you are just delaying the allocation time until you actually first use the matrix. In the end you'll have similar timings as if you did the allocation from the start.
Example to show this behavior, compared to a few other alternatives:
N = 1000;
clear z
tic, z = zeros(N,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = zeros(N,0)*zeros(0,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z(N,N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = full(spalloc(N,N,0)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z(1:N,1:N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
val = 0;
tic, z = val(ones(N)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = repmat(0, [N N]); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
The result shows that if you sum the elapsed time for both instructions in each case, you end up with similar total timings:
// zeros(N,N)
Elapsed time is 0.004525 seconds.
Elapsed time is 0.000792 seconds.
// zeros(N,0)*zeros(0,N)
Elapsed time is 0.000052 seconds.
Elapsed time is 0.004365 seconds.
// z(N,N) = 0
Elapsed time is 0.000053 seconds.
Elapsed time is 0.004119 seconds.
The other timings were:
// full(spalloc(N,N,0))
Elapsed time is 0.001463 seconds.
Elapsed time is 0.003751 seconds.
// z(1:N,1:N) = 0
Elapsed time is 0.006820 seconds.
Elapsed time is 0.000647 seconds.
// val(ones(N))
Elapsed time is 0.034880 seconds.
Elapsed time is 0.000911 seconds.
// repmat(0, [N N])
Elapsed time is 0.001320 seconds.
Elapsed time is 0.003749 seconds.
These measurements are down in the milliseconds and might not be very accurate, so you might want to run these commands in a loop a few thousand times and take the average. Also, sometimes running saved M-functions is faster than running scripts or commands at the prompt, as certain optimizations only happen that way...
Either way, allocation is usually done once, so who cares if it takes an extra 30 ms :)
A similar behavior can be seen with cell arrays or arrays of structures. Consider the following example:
N = 1000;
tic, a = cell(N,N); toc
tic, b = repmat({[]}, [N,N]); toc
tic, c{N,N} = []; toc
which gives:
Elapsed time is 0.001245 seconds.
Elapsed time is 0.040698 seconds.
Elapsed time is 0.004846 seconds.
Note that even though they are all equal, they occupy different amounts of memory:
>> assert(isequal(a,b,c))
>> whos a b c
Name Size Bytes Class Attributes
a 1000x1000 8000000 cell
b 1000x1000 112000000 cell
c 1000x1000 8000104 cell
In fact the situation is a bit more complicated here, since MATLAB is probably sharing the same empty matrix for all the cells, rather than creating multiple copies.
The cell array a is in fact an array of uninitialized cells (an array of NULL pointers), while b is a cell array where each cell is an empty array [] (internally and because of data sharing, only the first cell b{1} points to [] while all the rest have a reference to the first cell). The final array c is similar to a (uninitialized cells), but with the last one containing an empty numeric matrix [].
I looked at the list of C functions exported from libmx.dll (using the Dependency Walker tool), and I found a few interesting things.
There are undocumented functions for creating uninitialized arrays: mxCreateUninitDoubleMatrix, mxCreateUninitNumericArray, and mxCreateUninitNumericMatrix. In fact, there is a submission on the File Exchange that makes use of these functions to provide a faster alternative to the zeros function.
There exists an undocumented function called mxFastZeros. Googling online, I can see you cross-posted this question on MATLAB Answers as well, with some excellent answers over there. James Tursa (same author of UNINIT from before) gave an example of how to use this undocumented function.
libmx.dll is linked against the tbbmalloc.dll shared library, Intel's TBB scalable memory allocator. This library provides equivalent memory allocation functions (malloc, calloc, free) optimized for parallel applications. Remember that many MATLAB functions are automatically multithreaded, so I wouldn't be surprised if zeros(..) is multithreaded and uses Intel's memory allocator once the matrix size is large enough (here is a recent comment by Loren Shure that confirms this fact).
Regarding the last point about the memory allocator, you could write a benchmark in C/C++ similar to what @PavanYalamanchili did, and compare the various allocators available. Something like this. Remember that MEX-files have a slightly higher memory management overhead, since MATLAB automatically frees any memory that was allocated in MEX-files using the mxCalloc, mxMalloc, or mxRealloc functions. For what it's worth, it used to be possible to change the internal memory manager in older versions.
EDIT:
Here is a more thorough benchmark to compare the discussed alternatives. It specifically shows that once you stress the use of the entire allocated matrix, all three methods are on equal footing, and the difference is negligible.
function compare_zeros_init()
    iter = 100;
    for N = 512.*(1:8)
        % ZEROS(N,N)
        t = zeros(iter,3);
        for i=1:iter
            clear z
            tic, z = zeros(N,N); t(i,1) = toc;
            tic, z(:) = 9; t(i,2) = toc;
            tic, z = z + 1; t(i,3) = toc;
        end
        fprintf('N = %4d, ZEROS = %.9f\n', N, mean(sum(t,2)))

        % z(N,N)=0
        t = zeros(iter,3);
        for i=1:iter
            clear z
            tic, z(N,N) = 0; t(i,1) = toc;
            tic, z(:) = 9; t(i,2) = toc;
            tic, z = z + 1; t(i,3) = toc;
        end
        fprintf('N = %4d, GROW = %.9f\n', N, mean(sum(t,2)))

        % ZEROS(N,0)*ZEROS(0,N)
        t = zeros(iter,3);
        for i=1:iter
            clear z
            tic, z = zeros(N,0)*zeros(0,N); t(i,1) = toc;
            tic, z(:) = 9; t(i,2) = toc;
            tic, z = z + 1; t(i,3) = toc;
        end
        fprintf('N = %4d, MULT = %.9f\n\n', N, mean(sum(t,2)))
    end
end
Below are the timings, averaged over 100 iterations, for increasing matrix sizes. I performed the tests in R2013a.
>> compare_zeros_init
N = 512, ZEROS = 0.001560168
N = 512, GROW = 0.001479991
N = 512, MULT = 0.001457031
N = 1024, ZEROS = 0.005744873
N = 1024, GROW = 0.005352638
N = 1024, MULT = 0.005359236
N = 1536, ZEROS = 0.011950846
N = 1536, GROW = 0.009051589
N = 1536, MULT = 0.008418878
N = 2048, ZEROS = 0.012154002
N = 2048, GROW = 0.010996315
N = 2048, MULT = 0.011002169
N = 2560, ZEROS = 0.017940950
N = 2560, GROW = 0.017641046
N = 2560, MULT = 0.017640323
N = 3072, ZEROS = 0.025657999
N = 3072, GROW = 0.025836506
N = 3072, MULT = 0.051533432
N = 3584, ZEROS = 0.074739924
N = 3584, GROW = 0.070486857
N = 3584, MULT = 0.072822335
N = 4096, ZEROS = 0.098791732
N = 4096, GROW = 0.095849788
N = 4096, MULT = 0.102148452
After doing some research, I found this article on "Undocumented Matlab", in which Mr. Yair Altman had already come to the conclusion that MathWorks' way of preallocating matrices using zeros(M, N) is indeed not the most efficient one.
He timed x = zeros(M,N) vs. clear x, x(M,N) = 0 and found that the latter is ~500 times faster. According to his explanation, the second method simply creates an M-by-N matrix whose elements are automatically initialized to 0. The first method, however, creates x (with x having automatic zero elements) and then assigns a zero to every element of x again, a redundant operation that takes more time.
In the case of empty matrix multiplication, such as what you've shown in your question, MATLAB expects the product to be an M×N matrix and therefore allocates an M×N matrix. Consequently, the output matrix is automatically initialized to zeros. Since the original matrices are empty, no further calculations are performed, and hence the elements in the output matrix remain unchanged and equal to zero.
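For readers coming from the Julia side of this page, the analogous distinction -- allocation alone vs. allocation plus an explicit zero-fill -- can be sketched as follows (a hedged aside in the 0.4-era syntax used by the Julia answers above, where, to my understanding, zeros(T, dims) was defined as fill!(Array(T, dims), zero(T))):
# allocation only: contents are uninitialized garbage
@time x = Array(Float64, 5000, 5000)
# allocation plus a write of 0.0 to every element
@time z = zeros(5000, 5000)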
Interesting question; apparently there are several ways to 'beat' the built-in zeros function. My only guess as to why this happens is that it could be more memory efficient (after all, zeros(LargeNumber) will cause MATLAB to hit the memory limit sooner than it will create a devastating speed bottleneck in most code), or more robust somehow.
Here is another fast allocation method using a sparse matrix; I have added the regular zeros function as a benchmark:
tic; x=zeros(1000,1000); toc
Elapsed time is 0.002863 seconds.
tic; clear x; x(1000,1000)=0; toc
Elapsed time is 0.000282 seconds.
tic; x=full(spalloc(1000,1000,0)); toc
Elapsed time is 0.000273 seconds.
tic; x=spalloc(1000,1000,1000000); toc %Is this the same for practical purposes?
Elapsed time is 0.000281 seconds.

Joint Entropy Performance in Julia

I wrote a function to calculate the joint entropy of each column pair in a matrix, but I would like to improve its performance in both time and memory.
The function looks like this:
function jointentropy(aln)
    mat = Array(Float64,size(aln,2),size(aln,2))
    for i in combinations(1:size(aln,2),2)
        a = i[1]
        b = i[2]
        mina, maxa = extrema(aln[:,a])
        minb, maxb = extrema(aln[:,b])
        h = Array(Float64,(maxa-mina+1,maxb-minb+1))
        h = hist2d([aln[:,a] aln[:,b]],mina-1:1:maxa,minb-1:1:maxb)[3]
        h = h/size(aln[:,1],1)
        I,J,V = findnz(h)
        l = sparse(I,J,log2(V),maxa-mina+1,maxb-minb+1)
        mat[b,a] = - sum(l.*h)
    end
    return mat
end
Matrices that go into this function look like this:
rand(45:122,rand(1:2000),rand(1:2000))
An example with a 500x500 matrix resulted in the following @time output:
elapsed time: 33.692081413 seconds (33938843192 bytes allocated, 36.42% gc time)
...which seems to be a whole lot of memory...
Any suggestions on how to speed up this function and reduce memory allocation?
Thanks in advance for any help!
Here are a few ideas to speed up your function.
If the range of all the columns is roughly the same, you can move the extrema computations outside the loop and reuse the same h array.
hist2d creates a new array: you can use hist2d! to reuse the previous one.
The assignment h = h/size(aln[:,1],1) creates a new array.
The division in h = h/size(aln[:,1],1) is done for all the elements of the array, including the zeroes.
You can use a loop instead of findnz and a sparse matrix (findnz already contains a loop).
function jointentropy2(aln)
    n1 = size(aln,1)
    n2 = size(aln,2)
    mat = Array(Float64,n2,n2)
    lower, upper = extrema(aln)
    m = upper-lower+1
    h = Array(Float64,(m,m))
    for a in 1:n2
        for b in (a+1):n2
            Base.hist2d!(h,[aln[:,a] aln[:,b]],lower-1:1:upper,lower-1:1:upper)
            s = 0
            for i in 1:m
                for j in 1:m
                    if h[i,j] != 0
                        p = h[i,j] / n1
                        s += p * log2(p)
                    end
                end
            end
            mat[b,a] = - s
        end
    end
    return mat
end
This is twice as fast as the initial function, and the memory allocations were divided by 4.
aln = rand(45:122,500,400)
@time x = jointentropy(aln)
# elapsed time: 26.946314168 seconds (21697858752 bytes allocated, 29.97% gc time)
@time y = jointentropy2(aln)
# elapsed time: 13.626282821 seconds (5087119968 bytes allocated, 16.21% gc time)
x - y # approximately zero (at least below the diagonal --
# the matrix was not initialized above it)
The next candidate for optimization is hist2d (here, you could use a loop and a sparse matrix).
@profile jointentropy2(aln)
Profile.print()
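As suggested above, the hist2d! call is the next thing to replace. Here is a hedged sketch of a manual counting loop (counts2d! is a hypothetical helper name; it assumes the same integer-valued aln, the lower bound from extrema, and the preallocated h as in jointentropy2), which also avoids building the temporary [aln[:,a] aln[:,b]] matrix on every pair:
function counts2d!(h, aln, a, b, lower)
    fill!(h, 0.0)
    # one pass over the rows; each (a-value, b-value) pair increments one bin
    @inbounds for k = 1:size(aln, 1)
        h[aln[k,a]-lower+1, aln[k,b]-lower+1] += 1.0
    end
    h
end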
