Optimizing matrix multiplication with varying sizes - performance

Suppose I have the following data generating process
using Random
using StatsBase
m_1 = [1.0 2.0]
m_2 = [1.0 2.0; 3.0 4.0]
DD = []
y = zeros(2,200)
for i in 1:100
rand!(m_1)
rand!(m_2)
push!(DD, m_1)
push!(DD, m_2)
end
idxs = sample(1:200,10)
for i in idxs
DD[i] = DD[1]
end
and suppose given the data, I have the following function
function test(y, DD, n)
v_1 = [1 2]
v_2 = [3 4]
for j in 1:n
for i in 1:size(DD,1)
if size(DD[i],1) == 1
y[1:size(DD[i],1),i] .= (v_1 * DD[i]')[1]
else
y[1:size(DD[i],1),i] = (v_2 * DD[i]')'
end
end
end
end
I'm struggling to optimize the speed of test. In particular, memory allocation increases as I increase n. However, I'm not really allocating anything new.
The data generating process captures the fact that I don't know for sure the size of DD[i] beforehand. That is, the first time I call test, DD[1] could be a 2x2 matrix. The second time I call test, DD[1] could be a 1x2 matrix. I think this could be part of the issue with memory allocation: Julia doesn't know the sizes beforehand.
I'm completely stuck. I've tried #inbounds but that didn't help. Is there a way to improve this?

One important thing to check for performance is that Julia can understand the types. You can check this by running #code_warntype test(y, DD, 1), the output will make it clear that DD is of type Any[] (since you declared it that way). Working with Any can incur quite a performance penalty so declaring DD = Matrix{Float64}[] cuts the time to a third in my testing.
I'm not sure how close this example is to the actual code you want to write but in this particular case the size(DD[i],1) == 1 branch can be replaced by a call to LinearAlgebra.dot:
y[1:size(DD[i],1),i] .= dot(v_1, DD[i])
this cuts the time by another 50% for me. Finally you can squeeze out just a tiny bit more by using mul! to perform the other multiplication in place:
mul!(view(y, 1:size(DD[i],1),i:i), DD[i], v_2')
Full example:
using Random
using LinearAlgebra
DD = [rand(i,2) for _ in 1:100 for i in 1:2]
y = zeros(2,200)
shuffle!(DD)
function test(y, DD, n)
v_1 = [1 2]
v_2 = [3 4]'
for j in 1:n
for i in 1:size(DD,1)
if size(DD[i],1) == 1
y[1:size(DD[i],1),i] .= dot(v_1, DD[i])
else
mul!(view(y, 1:size(DD[i],1),i:i), DD[i], v_2)
end
end
end
end

Related

My loops are slow. Is that because of if statements?

I read this post and realized that loops are faster in Julia. Thus, I decided to change my vectorized code into loops. However, I had to use a few if statements in my loop but my loops slowed down after I added more such if statements.
Consider this excerpt, which I directly copied from the post:
function devectorized()
a = [1.0, 1.0]
b = [2.0, 2.0]
x = [NaN, NaN]
for i in 1:1000000
for index in 1:2
x[index] = a[index] + b[index]
end
end
return
end
function time(N)
timings = Array(Float64, N)
# Force compilation
devectorized()
for itr in 1:N
timings[itr] = #elapsed devectorized()
end
return timings
end
I then added a few if statements to test the speed:
function devectorized2()
a = [1.0, 1.0]
b = [2.0, 2.0]
x = [NaN, NaN]
for i in 1:1000000
for index in 1:2
####repeat this 6 times
if index * i < 20
x[index] = a[index] - b[index]
else
x[index] = a[index] + b[index]
end
####
end
end
return
end
I repeated this block six times:
if index * i < 20
x[index] = a[index] - b[index]
else
x[index] = a[index] + b[index]
end
For the sake of conciseness, I'm not repeating this block in my sample code. After repeating the if statements 6 times, devectorized2() took 3 times as long.
I have two questions:
Are there better ways to implement if statements?
Why are if statements so slow? I know that Julia is trying to do loops in a way that matches C. Is Julia providing better "translation" between Julia and C and these if statements just made the translation process more difficult?
Firstly, I don't think the performance here is very odd, since you're adding a lot of work to your function.
Secondly, you should actually return x here, otherwise the compiler might decide that you're not using x, and just skip the whole computation, which would thoroughly confuse the timings.
Thirdly, to answer your question 1: You can implement it like this:
x[index] = a[index] + ifelse(index * i < 20, -1, 1) * b[index]
This can be faster in some cases, but not necessarily in your case, where the branch is very easy to predict. Sometimes you can also get speedups by using Bools, for example like this:
x[index] = a[index] + (2*(index * i >= 20)-1) * b[index]
Again, in your example this doesn't help much, but there are cases when this approach can give you a decent speedup.
BTW: It isn't necessarily always true that loops are preferable to vectorized code any longer. The post you linked to is quite old. Take a look at this blog post, which shows how vectorized code can achieve similar performance to loopy code. In many cases, though, a loop is the clearest, easiest and fastest way to accomplish your goal.

How to parallelize computation of pairwise distance matrix?

My problem is roughly as follows. Given a numerical matrix X, where each row is an item. I want to find each row's nearest neighbor in terms of L2 distance in all rows except itself. I tried reading the official documentation but was still a little confused about how to achieve this. Could someone give me some hint?
My code is as follows
function l2_dist(v1, v2)
return sqrt(sum((v1 - v2) .^ 2))
end
function main(Mat, dist_fun)
n = size(Mat, 1)
Dist = SharedArray{Float64}(n) #[Inf for i in 1:n]
Id = SharedArray{Int64}(n) #[-1 for i in 1:n]
#parallel for i = 1:n
Dist[i] = Inf
Id[i] = 0
end
Threads.#threads for i in 1:n
for j in 1:n
if i != j
println(i, j)
dist_temp = dist_fun(Mat[i, :], Mat[j, :])
if dist_temp < Dist[i]
println("Dist updated!")
Dist[i] = dist_temp
Id[i] = j
end
end
end
end
return Dict("Dist" => Dist, "Id" => Id)
end
n = 4000
p = 30
X = [rand() for i in 1:n, j in 1:p];
main(X[1:30, :], l2_dist)
#time N = main(X, l2_dist)
I'm trying to distributed all the i's (i.e. calculating each row minimum) over different cores. But the version above apparently isn't working correctly. It is even slower than the sequential version. Can someone point me to the right direction? Thanks.
Maybe you're doing something in addition to what you have written down, but, at this point from what I can see, you aren't actually doing any computations in parallel. Julia requires you to tell it how many processors (or threads) you would like it to have access to. You can do this through either
Starting Julia with multiple processors julia -p # (where # is the number of processors you want Julia to have access to)
Once you have started a Julia "session" you can call the addprocs function to add additional processors.
To have more than 1 thread, you need to run command export JULIA_NUM_THREADS = #. I don't know very much about threading, so I will be sticking with the #parallel macro. I suggest reading documentation for more details on threading -- Maybe #Chris Rackauckas could expand a little more on the difference.
A few comments below about my code and on your code:
I'm on version 0.6.1-pre.0. I don't think I'm doing anything 0.6 specific, but this is a heads up just in case.
I'm going to use the Distances.jl package when computing the distances between vectors. I think it is a good habit to farm out as many of my computations to well-written and well-maintained packages as possible.
Rather than compute the distance between rows, I'm going to compute the distance between columns. This is because Julia is a column-major language, so this will increase the number of cache hits and give a little extra speed. You can obviously get the row-wise results you want by just transposing the input.
Unless you expect to have that many memory allocations then that many allocations are a sign that something in your code is inefficient. It is often a type stability problem. I don't know if that was the case in your code before, but that doesn't seem to be an issue in the current version (it wasn't immediately clear to me why you were having so many allocations).
Code is below
# Make sure all processors have access to Distances package
#everywhere using Distances
# Create a random matrix
nrow = 30
ncol = 4000
# Seed creation of random matrix so it is always same matrix
srand(42)
X = rand(nrow, ncol)
function main(X::AbstractMatrix{Float64}, M::Distances.Metric)
# Get size of the matrix
nrow, ncol = size(X)
# Create `SharedArray` to store output
ind_vec = SharedArray{Int}(ncol)
dist_vec = SharedArray{Float64}(ncol)
# Compute the distance between columns
#sync #parallel for i in 1:ncol
# Initialize various temporary variables
min_dist_i = Inf
min_ind_i = -1
X_i = view(X, :, i)
# Check distance against all other columns
for j in 1:ncol
# Skip comparison with itself
if i==j
continue
end
# Tell us who is doing the work
# (can uncomment if you want to verify stuff)
# println("Column $i compared with Column $j by worker $(myid())")
# Evaluate the new distance...
# If it is less then replace it, otherwise proceed
dist_temp = evaluate(M, X_i, view(X, :, j))
if dist_temp < min_dist_i
min_dist_i = dist_temp
min_ind_i = j
end
end
# Which column is minimum distance from column i
dist_vec[i] = min_dist_i
ind_vec[i] = min_ind_i
end
return dist_vec, ind_vec
end
# Using Euclidean metric
metric = Euclidean()
inds, dist = main(X, metric)
#time main(X, metric);
#show dist[[1, 5, 25]], inds[[1, 5, 25]]
You can run the code with
1 processor julia testfile.jl
% julia testfile.jl
0.640365 seconds (16.00 M allocations: 732.495 MiB, 3.70% gc time)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])
n processors (in this case 4) julia -p n testfile.jl
% julia -p 4 testfile.jl
0.201523 seconds (2.10 k allocations: 99.107 KiB)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])

Efficiency of diag() - MATLAB

Motivation:
In writing out a matrix operation that was to be performed over tens of thousands of vectors I kept coming across the warning:
Requested 200000x200000 (298.0GB) array exceeds maximum array size
preference. Creation of arrays greater than this limit may take a long
time and cause MATLAB to become unresponsive. See array size limit or
preference panel for more information.
The reason for this was my use of diag() to get the values down the diagonal of an matrix inner product. Because MATLAB is generally optimized for vector/matrix operations, when I first write code, I usually go for the vectorized form. In this case, however, MATLAB has to build the entire matrix in order to get the diagonal which causes the memory and speed issues.
Experiment:
I decided to test the use of diag() vs a for loop to see if at any point it was more efficient to use diag():
num = 200000; % Matrix dimension
x = ones(num, 1);
y = 2 * ones(num, 1);
% z = diag(x*y'); % Expression to solve
% Loop approach
tic
z = zeros(num,1);
for i = 1 : num
z(i) = x(i)*y(i);
end
toc
% Dividing the too-large matrix into process-able chunks
fraction = [10, 20, 50, 100, 500, 1000, 5000, 10000, 20000];
time = zeros(size(fraction));
for k = 1 : length(fraction)
f = fraction(k);
% Operation to time
tic
z = zeros(num,1);
for i = 1 : k
first = (i-1) * (num / f);
last = first + (num / f);
z(first + 1 : last) = diag(x(first + 1: last) * y(first + 1 : last)');
end
time(k) = toc;
end
% Plot results
figure;
hold on
plot(log10(fraction), log10(chunkTime));
plot(log10(fraction), repmat(log10(loopTime), 1, length(fraction)));
plot(log10(fraction), log10(chunkTime), 'g*'); % Plot points along time
legend('Partioned Running Time', 'Loop Running Time');
xlabel('Log_{10}(Fractional Size)'), ylabel('Log_{10}(Running Time)'), title('Running Time Comparison');
This is the result of the test:
(NOTE: The red line represents the loop time as a threshold--it's not to say that the total loop time is constant regardless of the number of loops)
From the graph it is clear that it takes breaking the operations down into roughly 200x200 square matrices to be faster to use diag than to perform the same operation using loops.
Question:
Can someone explain why I'm seeing these results? Also, I would think that with MATLAB's ever-more optimized design, there would be built-in handling of these massive matrices within a diag() function call. For example, it could just perform the i = j indexed operations. Is there a particular reason why this might be prohibitive?
I also haven't really thought of memory implications for diag using the partition method, although it's clear that as the partition size decreases, memory requirements drop.
Test of speed of diag vs. a loop.
Initialization:
n = 10000;
M = randn(n, n); %create a random matrix.
Test speed of diag:
tic;
d = diag(M);
toc;
Test speed of loop:
tic;
d = zeros(n, 1);
for i=1:n
d(i) = M(i,i);
end;
toc;
This would test diag. Your code is not a clean test of diag...
Comment on where there might be confusion
Diag only extracts the diagonal of a matrix. If x and y are vectors, and you do d = diag(x * y'), MATLAB first constructs the n by n matrix x*y' and calls diag on that. This is why, you get the error, "cannot construct 290GB matrix..." Matlab interpreter does not optimize in a crazy way, realize you only want the diagonal and construct just a vector (rather than full matrix with x*y', that does not happen.
Not sure if you're asking this, but the fastest way to calculate d = diag(x*y') where x and y are n by 1 vectors would simply be: d = x.*y

Vectorizing three for loops

I'm quite new to Matlab and I need help in speeding up some part of my code. I am writing a Matlab application that performs 3D matrix convolution but unlike in standard convolution, the kernel is not constant, it needs to be calculated for each pixel of an image.
So far, I have ended up with a working code, but incredibly slow:
function result = calculateFilteredImages(images, T)
% images - matrix [480,360,10] of 10 grayscale images of height=480 and width=360
% reprezented as a value in a range [0..1]
% i.e. images(10,20,5) = 0.1231;
% T - some matrix [480,360,10, 3,3] of double values, calculated earlier
kerN = 5; %kernel size
mid=floor(kerN/2); %half the kernel size
offset=mid+1; %kernel offset
[h,w,n] = size(images);
%add padding so as not to get IndexOutOfBoundsEx during summation:
%[i.e. changes [1 2 3...10] to [0 0 1 2 ... 10 0 0]]
images = padarray(images,[mid, mid, mid]);
result(h,w,n)=0; %preallocate, faster than zeros(h,w,n)
kernel(kerN,kerN,kerN)=0; %preallocate
% the three parameters below are not important in this problem
% (are used to calculate sigma in x,y,z direction inside the loop)
sigMin=0.5;
sigMax=3;
d = 3;
for a=1:n;
tic;
for b=1:w;
for c=1:h;
M(:,:)=T(c,b,a,:,:); % M is now a 3x3 matrix
[R D] = eig(M); %get eigenvectors and eigenvalues - R and D are now 3x3 matrices
% eigenvalues
l1 = D(1,1);
l2 = D(2,2);
l3 = D(3,3);
sig1=sig( l1 , sigMin, sigMax, d);
sig2=sig( l2 , sigMin, sigMax, d);
sig3=sig( l3 , sigMin, sigMax, d);
% calculate kernel
for i=-mid:mid
for j=-mid:mid
for k=-mid:mid
x_new = [i,j,k] * R; %calculate new [i,j,k]
kernel(offset+i, offset+j, offset+k) = exp(- (((x_new(1))^2 )/(sig1^2) + ((x_new(2))^2)/(sig2^2) + ((x_new(3))^2)/(sig3^2)) /2);
end
end
end
% normalize
kernel=kernel/sum(kernel(:));
%perform summation
xm_sum=0;
for i=-mid:mid
for j=-mid:mid
for k=-mid:mid
xm_sum = xm_sum + kernel(offset+i, offset+j, offset+k) * images(c+mid+i, b+mid+j, a+mid+k);
end
end
end
result(c,b,a)=xm_sum;
end
end
toc;
end
end
I tried replacing the "calculating kernel" part with
sigma=[sig1 sig2 sig3]
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
k2 = arrayfun(#(x, y, z) exp(-(norm([x,y,z]*R./sigma)^2)/2), x,y,z);
but it turned out to be even slower than the loop. I went through several articles and tutorials on vectorization but I'm quite stuck with this one.
Can it be vectorized or somehow speeded up using something else?
I'm new to Matlab, maybe there are some build-in functions that could help in this case?
Update
The profiling result:
Sample data which was used during profiling:
T.mat
grayImages.mat
As Dennis noted, this is a lot of code, cutting it down to the minimum that's slow given by the profiler will help. I'm not sure if my code is equivalent to yours, can you try it and profile it? The 'trick' to Matlab vectorization is using .* and .^, which operate element-by-element instead of having to use loops. http://www.mathworks.com/help/matlab/ref/power.html
Take your rewritten part:
sigma=[sig1 sig2 sig3]
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
k2 = arrayfun(#(x, y, z) exp(-(norm([x,y,z]*R./sigma)^2)/2), x,y,z);
And just pick one sigma for now. Looping over 3 different sigmas isn't a performance problem if you can vectorize the underlying k2 formula.
EDIT: Changed the matrix_to_norm code to be x(:), and no commas. See Generate all possible combinations of the elements of some vectors (Cartesian product)
Then try:
% R & mid my test variables
R = [1 2 3; 4 5 6; 7 8 9];
mid = 5;
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
% meshgrid is also a possibility, check that you are getting the order you want
% Going to break the equation apart for now for clarity
% Matrix operation, should already be fast.
matrix_to_norm = [x(:) y(:) z(:)]*R/sig1
% Ditto
matrix_normed = norm(matrix_to_norm)
% Note the .^ - I believe you want element-by-element exponentiation, this will
% vectorize it.
k2 = exp(-0.5*(matrix_normed.^2))

Matlab slow parallel processing with distributed arrays

I am new to using distributed and codistributed arrays in matlab. The parallel code I have produced works, but is much slower than the serial version and I have no idea why. The code examples below compute the eigenvalues of hessian matrices from volumetic data.
Serial version:
S = size(D);
Dsmt=imgaussian(D,2,20);
[fx, fy, fz] = gradient(Dsmt);
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
d = zeros([3 S(1) S(2) S(3)]);
for i = 1 : S(1)
fprintf('Slice %d out of %d\n', i, S(1));
for ii = 1 : S(2)
for iii = 1 : S(3)
d(:,i,ii,iii) = eig(squeeze(DHess(:,:,i,ii,iii)));
end
end
end
Parallel version:
S = size(D);
Dsmt=imgaussian(D,2,20);
[fx, fy, fz] = gradient(Dsmt);
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
CDHess = distributed(DHess);
spmd
d = zeros([3 S(1) S(2) S(3)], codistributor('1d',4));
for i = 1 : S(1)
fprintf('Slice %d out of %d\n', i, S(1));
for ii = 1 : S(2)
for iii = drange(1 : S(3))
d(:,i,ii,iii) = eig(squeeze(CDHess(:,:,i,ii,iii)));
end
end
end
end
If someone could shed some light on the issue I would be very grateful
Here is a re-written version of your code. I have split the work over the outer-most loop, not as in your case - the inner-most loop. I have also explicitly allocated local parts of the d result vector, and the local part of the Hessian matrix.
In your code you rely on drange to split the work, and you access the distributed arrays directly to avoid extracting the local part. Admittedly, it should not result in such a great slowdown if MATLAB did everything correctly. The bottom line is, I don't know why your code is so slow - most likely because MATLAB does some remote data accessing despite the fact that you distributed your matrices.
Anyway, the below code runs and gives pretty good speedup on my computer using 4 labs. I have generated synthetic random input data to have something to work on. Have a look at the comments. If something is unclear, I can elaborate later.
clear all;
D = rand(512, 512, 3);
S = size(D);
[fx, fy, fz] = gradient(D);
% this part could also be parallelized - at least a bit.
tic;
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
toc
% your sequential implementation
d = zeros([3, S(1) S(2) S(3)]);
disp('sequential')
tic
for i = 1 : S(1)
for ii = 1 : S(2)
for iii = 1 : S(3)
d(:,i,ii,iii) = eig(squeeze(DHess(:,:,i,ii,iii)));
end
end
end
toc
% my parallel implementation
disp('parallel')
tic
spmd
% just for information
disp(['lab ' num2str(labindex)]);
% distribute the input data along the third dimension
% This is the dimension of the outer-most loop, hence this is where we
% want to parallelize!
DHess_dist = codistributed(DHess, codistributor1d(3));
DHess_local = getLocalPart(DHess_dist);
% create an output data distribution -
% note that this time we split along the second dimension
codist = codistributor1d(2, codistributor1d.unsetPartition, [3, S(1) S(2) S(3)]);
localSize = [3 codist.Partition(labindex) S(2) S(3)];
% allocate local part of the output array d
d_local = zeros(localSize);
% your ordinary loop, BUT! the outermost loop is split amongst the
% threads explicitly, using local indexing. In the loop only local parts
% of matrix d and DHess are accessed
for i = 1:size(d_local,2)
for ii = 1 : S(2)
for iii = 1 : S(3)
d_local(:,i,ii,iii) = eig(squeeze(DHess_local(:,:,i,ii,iii)));
end
end
end
% assemble local results to a codistributed matrix
d_dist = codistributed.build(d_local, codist);
end
toc
isequal(d, d_dist)
And the output
Elapsed time is 0.364255 seconds.
sequential
Elapsed time is 33.498985 seconds.
parallel
Lab 1:
lab 1
Lab 2:
lab 2
Lab 3:
lab 3
Lab 4:
lab 4
Elapsed time is 9.445856 seconds.
ans =
1
Edit I have checked the performance on a reshaped matrix DHess=[3x3xN]. The performance is not much better (10%), so it is not substantial. But maybe you can implement the eig a bit differently? After all, those are 3x3 matrices you are dealing with.
You didn't specify where you've opened your matlabpool, and that will be the main factor determining what speedup you get.
If you are using the 'local' scheduler, then there is often no benefit to using distributed arrays. In particular, if the time-consuming operations are multithreaded in MATLAB already, then they will almost certainly slow down when using the local scheduler since the matlabpool workers run in single-threaded mode.
If you are using some other scheduler with the workers on a separate machine then you might be able to get speedup, but that depends on what you're doing. There's an example here http://www.mathworks.com/products/parallel-computing/examples.html?file=/products/demos/shipping/distcomp/paralleldemo_backslash_bench.html which shows some benchmarks of MATLAB's \ operator.
Finally, it's worth noting that indexing distributed arrays is unfortunately rather slow, especially compared to MATLAB's built-in indexing. If you can extract the 'local part' of your codistributed arrays inside the spmd block and work exclusively with those, that might also help.

Resources