I am new to using distributed and codistributed arrays in MATLAB. The parallel code I have produced works, but it is much slower than the serial version and I have no idea why. The code examples below compute the eigenvalues of Hessian matrices from volumetric data.
Serial version:
S = size(D);
[fx, fy, fz] = gradient(Dsmt);
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
d = zeros([3 S(1) S(2) S(3)]);
for i = 1 : S(1)
    fprintf('Slice %d out of %d\n', i, S(1));
    for ii = 1 : S(2)
        for iii = 1 : S(3)
            d(:,i,ii,iii) = eig(squeeze(DHess(:,:,i,ii,iii)));
        end
    end
end
Parallel version:
S = size(D);
[fx, fy, fz] = gradient(Dsmt);
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
CDHess = distributed(DHess);
spmd
    d = zeros([3 S(1) S(2) S(3)], codistributor('1d',4));
    for i = 1 : S(1)
        fprintf('Slice %d out of %d\n', i, S(1));
        for ii = 1 : S(2)
            for iii = drange(1 : S(3))
                d(:,i,ii,iii) = eig(squeeze(CDHess(:,:,i,ii,iii)));
            end
        end
    end
end
If someone could shed some light on the issue I would be very grateful

Here is a rewritten version of your code. I have split the work over the outermost loop rather than the innermost loop as in your code. I have also explicitly allocated the local parts of the result array d and of the Hessian matrix.
In your code you rely on drange to split the work, and you access the distributed arrays directly to avoid extracting the local part. Admittedly, it should not result in such a great slowdown if MATLAB did everything correctly. The bottom line is, I don't know why your code is so slow - most likely because MATLAB does some remote data accessing despite the fact that you distributed your matrices.
Anyway, the below code runs and gives pretty good speedup on my computer using 4 labs. I have generated synthetic random input data to have something to work on. Have a look at the comments. If something is unclear, I can elaborate later.
clear all;
D = rand(512, 512, 3);
S = size(D);
[fx, fy, fz] = gradient(D);
% this part could also be parallelized - at least a bit.
tic
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
toc
% your sequential implementation
d = zeros([3, S(1) S(2) S(3)]);
tic
for i = 1 : S(1)
    for ii = 1 : S(2)
        for iii = 1 : S(3)
            d(:,i,ii,iii) = eig(squeeze(DHess(:,:,i,ii,iii)));
        end
    end
end
toc
% my parallel implementation
tic
spmd
    % just for information
    disp(['lab ' num2str(labindex)]);
    % distribute the input data along the third dimension
    % This is the dimension of the outer-most loop, hence this is where we
    % want to parallelize!
    DHess_dist = codistributed(DHess, codistributor1d(3));
    DHess_local = getLocalPart(DHess_dist);
    % create an output data distribution -
    % note that this time we split along the second dimension
    codist = codistributor1d(2, codistributor1d.unsetPartition, [3, S(1) S(2) S(3)]);
    localSize = [3 codist.Partition(labindex) S(2) S(3)];
    % allocate local part of the output array d
    d_local = zeros(localSize);
    % your ordinary loop, BUT! the outermost loop is split amongst the
    % labs explicitly, using local indexing. In the loop only local parts
    % of matrix d and DHess are accessed
    for i = 1:size(d_local,2)
        for ii = 1 : S(2)
            for iii = 1 : S(3)
                d_local(:,i,ii,iii) = eig(squeeze(DHess_local(:,:,i,ii,iii)));
            end
        end
    end
    % assemble local results to a codistributed matrix
    d_dist = codistributed.build(d_local, codist);
end
toc
isequal(d, d_dist)
And the output
Elapsed time is 0.364255 seconds.
Elapsed time is 33.498985 seconds.
Lab 1:
lab 1
Lab 2:
lab 2
Lab 3:
lab 3
Lab 4:
lab 4
Elapsed time is 9.445856 seconds.
ans =
Edit: I have checked the performance on a reshaped matrix DHess=[3x3xN]. It is only about 10% better, so the reshaping does not make a substantial difference. But maybe you can implement the eig a bit differently? After all, those are 3x3 matrices you are dealing with.
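As a rough illustration of that last suggestion: if the Hessians can be treated as symmetric (an assumption; with finite-difference gradients the mixed partials may differ slightly, in which case each matrix could first be averaged with its transpose), the eigenvalues of a 3x3 matrix have a closed form. The sketch below uses Python only as neutral pseudocode, and eig_sym3 is a hypothetical helper name, not a MATLAB or toolbox function. It follows the standard trigonometric formula for real symmetric 3x3 matrices; whether it beats eig inside a MATLAB loop would have to be measured.

```python
import math


def eig_sym3(a):
    """Eigenvalues (ascending) of a real symmetric 3x3 matrix given as nested lists."""
    a11, a12, a13 = a[0]
    a22, a23 = a[1][1], a[1][2]
    a33 = a[2][2]
    # Sum of squared off-diagonal entries; zero means the matrix is already diagonal.
    p1 = a12 * a12 + a13 * a13 + a23 * a23
    if p1 == 0.0:
        return sorted([a11, a22, a33])
    q = (a11 + a22 + a33) / 3.0  # mean of the eigenvalues (trace / 3)
    p2 = (a11 - q) ** 2 + (a22 - q) ** 2 + (a33 - q) ** 2 + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    # B = (A - q*I) / p is symmetric with eigenvalues of the form 2*cos(angle).
    b11, b22, b33 = (a11 - q) / p, (a22 - q) / p, (a33 - q) / p
    b12, b13, b23 = a12 / p, a13 / p, a23 / p
    det_b = (b11 * (b22 * b33 - b23 * b23)
             - b12 * (b12 * b33 - b23 * b13)
             + b13 * (b12 * b23 - b22 * b13))
    # Clamp against round-off so acos stays inside its domain.
    r = max(-1.0, min(1.0, det_b / 2.0))
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    middle = 3.0 * q - largest - smallest  # the three eigenvalues sum to the trace
    return [smallest, middle, largest]


print(eig_sym3([[2.0, 0.0, 0.0], [0.0, 3.0, 4.0], [0.0, 4.0, 9.0]]))
```

The same arithmetic could be written with element-wise operations over whole arrays of Hessian entries, which is where any real gain over a per-voxel eig call would come from.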

You didn't specify where you've opened your matlabpool, and that will be the main factor determining what speedup you get.
If you are using the 'local' scheduler, then there is often no benefit to using distributed arrays. In particular, if the time-consuming operations are multithreaded in MATLAB already, then they will almost certainly slow down when using the local scheduler since the matlabpool workers run in single-threaded mode.
If you are using some other scheduler with the workers on a separate machine then you might be able to get speedup, but that depends on what you're doing. There's an example here which shows some benchmarks of MATLAB's \ operator.
Finally, it's worth noting that indexing distributed arrays is unfortunately rather slow, especially compared to MATLAB's built-in indexing. If you can extract the 'local part' of your codistributed arrays inside the spmd block and work exclusively with those, that might also help.


Optimizing matrix multiplication with varying sizes

Suppose I have the following data generating process
using Random
using StatsBase
m_1 = [1.0 2.0]
m_2 = [1.0 2.0; 3.0 4.0]
DD = []
y = zeros(2,200)
for i in 1:100
    push!(DD, m_1)
    push!(DD, m_2)
end
idxs = sample(1:200,10)
for i in idxs
    DD[i] = DD[1]
end
and suppose given the data, I have the following function
function test(y, DD, n)
    v_1 = [1 2]
    v_2 = [3 4]
    for j in 1:n
        for i in 1:size(DD,1)
            if size(DD[i],1) == 1
                y[1:size(DD[i],1),i] .= (v_1 * DD[i]')[1]
            else
                y[1:size(DD[i],1),i] = (v_2 * DD[i]')'
            end
        end
    end
end
I'm struggling to optimize the speed of test. In particular, memory allocation increases as I increase n. However, I'm not really allocating anything new.
The data generating process captures the fact that I don't know for sure the size of DD[i] beforehand. That is, the first time I call test, DD[1] could be a 2x2 matrix. The second time I call test, DD[1] could be a 1x2 matrix. I think this could be part of the issue with memory allocation: Julia doesn't know the sizes beforehand.
I'm completely stuck. I've tried @inbounds but that didn't help. Is there a way to improve this?
One important thing to check for performance is that Julia can infer the types. You can check this by running @code_warntype test(y, DD, 1); the output will make it clear that DD is a Vector{Any} (since you declared it as DD = []). Working with Any can incur quite a performance penalty, so declaring DD = Matrix{Float64}[] cuts the time to a third in my testing.
I'm not sure how close this example is to the actual code you want to write but in this particular case the size(DD[i],1) == 1 branch can be replaced by a call to
y[1:size(DD[i],1),i] .= dot(v_1, DD[i])
this cuts the time by another 50% for me. Finally you can squeeze out just a tiny bit more by using mul! to perform the other multiplication in place:
mul!(view(y, 1:size(DD[i],1),i:i), DD[i], v_2')
Full example:
using Random
using LinearAlgebra
DD = [rand(i,2) for _ in 1:100 for i in 1:2]
y = zeros(2,200)
function test(y, DD, n)
    v_1 = [1 2]
    v_2 = [3 4]'
    for j in 1:n
        for i in 1:size(DD,1)
            if size(DD[i],1) == 1
                y[1:size(DD[i],1),i] .= dot(v_1, DD[i])
            else
                mul!(view(y, 1:size(DD[i],1),i:i), DD[i], v_2)
            end
        end
    end
end

inverse of symmetric matrix is not symmetric in Julia

I am using Julia version 0.6.2 and I am facing this problem.
mat = zeros(6, 6)
for i = 1 : 6
    for j = 1 : 6
        mat[i, j] = exp(-(i - j)^2)
    end
end
And the output is
Main> issymmetric(mat)
true
Main> issymmetric(inv(mat))
false
I also tried the following Matlab code
mat = zeros(6, 6);
for i = 1 : 6
    for j = 1 : 6
        mat(i, j) = exp(-(i - j)^2);
    end
end
And the output is
logical 1
logical 1
Apart from manually making the matrix symmetric as you propose, e.g. taking the average of the matrix and its transpose like
A = inv(mat)
A = (A + A') / 2
probably a cleaner way is
smat = Symmetric(mat)
B = inv(smat)
now B (as well as smat) passes issymmetric. Moreover, the fact that it is symmetric is ensured on type level (Symmetric) - some functions might take advantage of this additional information. This is exactly what inv does for smat.
EDIT: the question was also posted on Discourse, where you can find additional discussion about the performance of Symmetric.
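To make the "average with the transpose" option concrete, here is a minimal sketch in plain Python (used only as neutral pseudocode; symmetrize and is_symmetric are hypothetical helper names, and the 2x2 matrix is made-up data standing in for an inverse with a small round-off asymmetry). Because floating-point addition is commutative, entries (i, j) and (j, i) of the averaged matrix are computed from the same sum and come out bit-for-bit equal.

```python
def symmetrize(a):
    """Return (A + A^T) / 2 for a square matrix given as nested lists."""
    n = len(a)
    return [[(a[i][j] + a[j][i]) / 2.0 for j in range(n)] for i in range(n)]


def is_symmetric(a):
    """Exact (bitwise) symmetry check on a square nested-list matrix."""
    n = len(a)
    return all(a[i][j] == a[j][i] for i in range(n) for j in range(n))


# Made-up matrix with a tiny asymmetry in its off-diagonal pair
a = [[2.0, 0.5], [0.5 + 1e-12, 1.0]]
print(is_symmetric(a), is_symmetric(symmetrize(a)))
```

This only repairs the symmetry after the fact; unlike the Symmetric wrapper it carries no type-level guarantee.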

How to parallelize computation of pairwise distance matrix?

My problem is roughly as follows. I am given a numerical matrix X, where each row is an item, and I want to find each row's nearest neighbor in terms of L2 distance among all the other rows. I tried reading the official documentation but was still a little confused about how to achieve this. Could someone give me a hint?
My code is as follows
function l2_dist(v1, v2)
    return sqrt(sum((v1 - v2) .^ 2))
end
function main(Mat, dist_fun)
    n = size(Mat, 1)
    Dist = SharedArray{Float64}(n) #[Inf for i in 1:n]
    Id = SharedArray{Int64}(n) #[-1 for i in 1:n]
    @parallel for i = 1:n
        Dist[i] = Inf
        Id[i] = 0
    end
    Threads.@threads for i in 1:n
        for j in 1:n
            if i != j
                println(i, j)
                dist_temp = dist_fun(Mat[i, :], Mat[j, :])
                if dist_temp < Dist[i]
                    println("Dist updated!")
                    Dist[i] = dist_temp
                    Id[i] = j
                end
            end
        end
    end
    return Dict("Dist" => Dist, "Id" => Id)
end
n = 4000
p = 30
X = [rand() for i in 1:n, j in 1:p];
main(X[1:30, :], l2_dist)
@time N = main(X, l2_dist)
I'm trying to distribute all the i's (i.e. calculating each row minimum) over different cores. But the version above apparently isn't working correctly. It is even slower than the sequential version. Can someone point me in the right direction? Thanks.
Maybe you're doing something in addition to what you have written down, but from what I can see you aren't actually doing any computations in parallel. Julia requires you to tell it how many processors (or threads) you would like it to have access to. You can do this in either of two ways:
Start Julia with multiple processors: julia -p # (where # is the number of processors you want Julia to have access to)
Once you have started a Julia "session", call the addprocs function to add additional processors.
To have more than 1 thread, you need to run export JULIA_NUM_THREADS=# (with no spaces around the =) before starting Julia. I don't know very much about threading, so I will be sticking with the @parallel macro. I suggest reading the documentation for more details on threading. Maybe @Chris Rackauckas could expand a little more on the difference.
A few comments below about my code and on your code:
I'm on version 0.6.1-pre.0. I don't think I'm doing anything 0.6 specific, but this is a heads up just in case.
I'm going to use the Distances.jl package when computing the distances between vectors. I think it is a good habit to farm out as many of my computations to well-written and well-maintained packages as possible.
Rather than compute the distance between rows, I'm going to compute the distance between columns. This is because Julia is a column-major language, so this will increase the number of cache hits and give a little extra speed. You can obviously get the row-wise results you want by just transposing the input.
Unless you expect your code to allocate that much memory, a large number of allocations is a sign that something in it is inefficient. It is often a type stability problem. I don't know if that was the case in your code before, but it doesn't seem to be an issue in the current version (it wasn't immediately clear to me why you were having so many allocations).
Code is below
# Make sure all processors have access to Distances package
@everywhere using Distances
# Create a random matrix
nrow = 30
ncol = 4000
# Seed creation of random matrix so it is always same matrix
X = rand(nrow, ncol)
function main(X::AbstractMatrix{Float64}, M::Distances.Metric)
    # Get size of the matrix
    nrow, ncol = size(X)
    # Create `SharedArray` to store output
    ind_vec = SharedArray{Int}(ncol)
    dist_vec = SharedArray{Float64}(ncol)
    # Compute the distance between columns
    @sync @parallel for i in 1:ncol
        # Initialize various temporary variables
        min_dist_i = Inf
        min_ind_i = -1
        X_i = view(X, :, i)
        # Check distance against all other columns
        for j in 1:ncol
            # Skip comparison with itself
            if i==j
                continue
            end
            # Tell us who is doing the work
            # (can uncomment if you want to verify stuff)
            # println("Column $i compared with Column $j by worker $(myid())")
            # Evaluate the new distance...
            # If it is less then replace it, otherwise proceed
            dist_temp = evaluate(M, X_i, view(X, :, j))
            if dist_temp < min_dist_i
                min_dist_i = dist_temp
                min_ind_i = j
            end
        end
        # Which column is minimum distance from column i
        dist_vec[i] = min_dist_i
        ind_vec[i] = min_ind_i
    end
    return dist_vec, ind_vec
end
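# Using Euclidean metric
metric = Euclidean()
# Note: main returns (dist_vec, ind_vec), so with this assignment `inds` holds
# the distances and `dist` holds the indices, as the printed output shows.
inds, dist = main(X, metric)
@time main(X, metric);
@show dist[[1, 5, 25]], inds[[1, 5, 25]]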
You can run the code with
1 processor: julia testfile.jl
% julia testfile.jl
0.640365 seconds (16.00 M allocations: 732.495 MiB, 3.70% gc time)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])
n processors (in this case 4): julia -p n testfile.jl
% julia -p 4 testfile.jl
0.201523 seconds (2.10 k allocations: 99.107 KiB)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])
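The structure of the loop above (keep the running minimum in local variables, skip the item itself, and write each output slot exactly once) can be sketched in plain Python. Python is used here only as neutral pseudocode; nearest_for_column and nearest_neighbors are hypothetical names, and indices are 0-based, unlike Julia. ThreadPoolExecutor is there only to show that each index is independent of the others: in CPython the global interpreter lock means threads will not speed up a CPU-bound loop like this, so the sketch illustrates the decomposition, not the speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def nearest_for_column(cols, i):
    """Return (min_dist, min_index) for item i against every other item."""
    min_dist, min_ind = math.inf, -1
    for j, other in enumerate(cols):
        if j == i:
            continue  # skip comparison with itself
        d = math.dist(cols[i], other)  # Euclidean (L2) distance
        if d < min_dist:
            min_dist, min_ind = d, j
    return min_dist, min_ind


def nearest_neighbors(cols, workers=4):
    """Hand the independent outer-loop iterations to a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda i: nearest_for_column(cols, i),
                                range(len(cols))))
    dists = [r[0] for r in results]
    inds = [r[1] for r in results]
    return dists, inds


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
print(nearest_neighbors(points))
```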

Efficiency of diag() - MATLAB

In writing out a matrix operation that was to be performed over tens of thousands of vectors I kept coming across the warning:
Requested 200000x200000 (298.0GB) array exceeds maximum array size
preference. Creation of arrays greater than this limit may take a long
time and cause MATLAB to become unresponsive. See array size limit or
preference panel for more information.
The reason for this was my use of diag() to get the values down the diagonal of a matrix product. Because MATLAB is generally optimized for vector/matrix operations, when I first write code, I usually go for the vectorized form. In this case, however, MATLAB has to build the entire matrix in order to get the diagonal, which causes the memory and speed issues.
I decided to test the use of diag() vs a for loop to see if at any point it was more efficient to use diag():
num = 200000; % Matrix dimension
x = ones(num, 1);
y = 2 * ones(num, 1);
% z = diag(x*y'); % Expression to solve
% Loop approach
tic
z = zeros(num,1);
for i = 1 : num
    z(i) = x(i)*y(i);
end
loopTime = toc;
% Dividing the too-large matrix into process-able chunks
fraction = [10, 20, 50, 100, 500, 1000, 5000, 10000, 20000];
chunkTime = zeros(size(fraction));
for k = 1 : length(fraction)
    f = fraction(k);
    % Operation to time
    tic
    z = zeros(num,1);
    for i = 1 : f % loop over all f chunks of size num/f
        first = (i-1) * (num / f);
        last = first + (num / f);
        z(first + 1 : last) = diag(x(first + 1: last) * y(first + 1 : last)');
    end
    chunkTime(k) = toc;
end
% Plot results
hold on
plot(log10(fraction), log10(chunkTime));
plot(log10(fraction), repmat(log10(loopTime), 1, length(fraction)));
plot(log10(fraction), log10(chunkTime), 'g*'); % Plot points along time
legend('Partitioned Running Time', 'Loop Running Time');
xlabel('Log_{10}(Fractional Size)'), ylabel('Log_{10}(Running Time)'), title('Running Time Comparison');
This is the result of the test:
(NOTE: The red line represents the loop time as a threshold--it's not to say that the total loop time is constant regardless of the number of loops)
From the graph it is clear that it takes breaking the operations down into roughly 200x200 square matrices to be faster to use diag than to perform the same operation using loops.
Can someone explain why I'm seeing these results? Also, I would think that with MATLAB's ever-more optimized design, there would be built-in handling of these massive matrices within a diag() function call. For example, it could just perform the i = j indexed operations. Is there a particular reason why this might be prohibitive?
I also haven't really thought of memory implications for diag using the partition method, although it's clear that as the partition size decreases, memory requirements drop.
Test of speed of diag vs. a loop.
n = 10000;
M = randn(n, n); % create a random matrix.
Test speed of diag:
tic
d = diag(M);
toc
Test speed of loop:
tic
d = zeros(n, 1);
for i=1:n
    d(i) = M(i,i);
end
toc
This would test diag. Your code is not a clean test of diag...
Comment on where there might be confusion
diag only extracts the diagonal of a matrix. If x and y are vectors and you do d = diag(x * y'), MATLAB first constructs the n-by-n matrix x*y' and then calls diag on it. This is why you get the warning about the 200000x200000 (298.0GB) array. The MATLAB interpreter does not optimize this away: it does not realize that you only want the diagonal and construct just a vector instead of the full matrix x*y'.
Not sure if you're asking this, but the fastest way to calculate d = diag(x*y') where x and y are n by 1 vectors would simply be: d = x.*y
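The identity behind that last line is that entry (i, i) of x*y' is x(i)*y(i), so the diagonal never needs the other n^2 - n entries. A minimal sketch in plain Python (used only as neutral pseudocode; diag_of_outer and elementwise_product are hypothetical names standing in for diag(x*y') and x.*y):

```python
def diag_of_outer(x, y):
    """Mimics diag(x*y'): builds the full n-by-n outer product, then reads its diagonal."""
    n = len(x)
    outer = [[x[i] * y[j] for j in range(n)] for i in range(n)]  # O(n^2) memory
    return [outer[i][i] for i in range(n)]


def elementwise_product(x, y):
    """Mimics x.*y: O(n) work and memory, never forms the matrix."""
    return [xi * yi for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print(diag_of_outer(x, y) == elementwise_product(x, y))
```

Both functions return the same vector; only the first one pays for the full matrix, which is the cost the array-size warning in the question is about.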

Histogram intersection kernel optimization in MATLAB

I want to try an SVM classifier using a histogram intersection kernel on a dataset of 153 images, but it takes a long time. This is my code:
a = load('...'); %vectors
b = load('...'); %labels
g = dataset(a,b);
error = crossval(g,libsvc([],proxm([],'ih'),100),10,10);
error1 = crossval(g,libsvc([],proxm([],'ih'),10),10,10);
error2 = crossval(g,libsvc([],proxm([],'ih'),1),10,10);
My implementation of the kernel within the proxm function is:
case {'dist_histint','ih'}
    if (d ~= d1)
        error('column length of A (%d) != column length of B (%d)\n',d,d1);
    end
    % With the MATLAB JIT compiler the trivial implementation turns out
    % to be the fastest, especially for large matrices.
    D = zeros(m,n);
    for i=1:m % m is number of samples of A
        if (0==mod(i,1000)) fprintf('.'); end
        for j=1:n % n is number of samples of B
            D(i,j) = sum(min([A(i,:);B(j,:)]));%./max(A(:,i),B(:,j)));
        end
    end
I need some matlab optimization for this code!
You can get rid of that kernel loop to calculate D with this bsxfun based vectorized approach -
D = squeeze(sum(bsxfun(@min,A,permute(B,[3 2 1])),2))
Or avoid squeeze with this modification -
D = sum(bsxfun(@min,permute(A,[1 3 2]),permute(B,[3 1 2])),3)
If the calculations of D involve max instead of min, just replace @min with @max there.
Explanation: The way bsxfun works is that it does expansion on singleton dimensions and performs the operation given by the function handle (the name after @) inside its call. Now, this expansion is basically how one achieves vectorized solutions that replace for-loops. By singleton dimensions in arrays, we mean dimensions of 1 in them.
In many cases, singleton dimensions aren't already present and for vectorization with bsxfun, we need to create singleton dimensions. One of the tools to do so is with permute. That's basically all about the way vectorized approach stated earlier would work.
Thus, your kernel code -
case {'dist_histint','ih'}
    if (d ~= d1)
        error('column length of A (%d) != column length of B (%d)\n',d,d1);
    end
    % With the MATLAB JIT compiler the trivial implementation turns out
    % to be the fastest, especially for large matrices.
    D = zeros(m,n);
    for i=1:m % m is number of samples of A
        if (0==mod(i,1000)) fprintf('.'); end
        for j=1:n % n is number of samples of B
            D(i,j) = sum(min([A(i,:);B(j,:)]));%./max(A(:,i),B(:,j)));
        end
    end
reduces to -
case {'dist_histint','ih'}
    if (d ~= d1)
        error('column length of A (%d) != column length of B (%d)\n',d,d1);
    end
    D = squeeze(sum(bsxfun(@min,A,permute(B,[3 2 1])),2))
    %// OR D = sum(bsxfun(@min,permute(A,[1 3 2]),permute(B,[3 1 2])),3)
I am assuming the line: if (0==mod(i,1000)) fprintf('.'); end isn't important to the calculations as it does printing of some message.
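For reference, both the loop and the bsxfun versions compute D(i,j) = sum over k of min(A(i,k), B(j,k)), with A of size m-by-d and B of size n-by-d. The sketch below spells that out in plain Python (used only as neutral pseudocode; hist_intersection is a hypothetical name, and the small matrices are made-up data), which can be handy as a reference when checking a vectorized version on a tiny input.

```python
def hist_intersection(A, B):
    """Return D (m-by-n nested lists) with D[i][j] = sum_k min(A[i][k], B[j][k])."""
    if A and B and len(A[0]) != len(B[0]):
        raise ValueError("A and B must have the same number of columns")
    # The pairing of every row of A with every row of B is what the singleton
    # expansion in bsxfun does implicitly; the inner sum runs over the shared
    # feature dimension k.
    return [[sum(min(a, b) for a, b in zip(row_a, row_b)) for row_b in B]
            for row_a in A]


A = [[1, 2, 3], [0, 5, 1]]          # m = 2 samples, d = 3 bins
B = [[2, 2, 2], [1, 0, 4], [0, 6, 0]]  # n = 3 samples, d = 3 bins
print(hist_intersection(A, B))
```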
