accelerate and memory optimize for loop - performance

I have the following data types:
K1,K2,K3,K4 = ... % predefined constants which are not subject to change.
time = 1e6; % time steps length.
z = 1e3; % location vector length.
A_Sink = rand(1,time);
B_Sink = rand(1,time);
laserA = zeros(1,z);
laserB = zeros(1,z);
laserA_Old = zeros(1,z);
laserB_Old = zeros(1,z);
Now here is the code:
for ii = 1:time
laserA_Old = laserA;
laserB_Old = laserB;
laserA(2:end) = laserA_Old(2:end) + K1*diff(laserA_Old) + K2*laserB_Old(2:end);
laserA(1) = A_Sink(ii);
laserB(1:end-1) = laserB_Old(1:end-1) + K3*diff(laserB_Old) + K4*laserA_Old(1:end-1);
laserB(end) = B_Sink(ii);
end
My main concern is what I see in the profiler: the lines in the loop that index with (2:end) or (1:end-1) allocate memory without freeing it, which makes MATLAB much slower.
Is there a way to speed this up and reduce its memory use?
Many thanks.
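One possible restructuring, offered as a sketch rather than a measured fix: compute both increments from the current state before writing to either array, which removes the need for the two `_Old` copies. The constants and the small sizes below are placeholder values chosen only for illustration. The indexed expressions still create temporaries, so any gain should be checked in the profiler.

```matlab
% Placeholder constants and small sizes (assumptions for illustration only).
K1 = -0.1; K2 = 0.01; K3 = 0.1; K4 = -0.01;
time = 200;               % number of time steps
z = 50;                   % location vector length
A_Sink = rand(1,time);
B_Sink = rand(1,time);
laserA = zeros(1,z);
laserB = zeros(1,z);
for ii = 1:time
    % Both increments are evaluated from the current (old) state first,
    % so no explicit copies of laserA and laserB are needed.
    dA = K1*diff(laserA) + K2*laserB(2:end);
    dB = K3*diff(laserB) + K4*laserA(1:end-1);
    laserA(2:end) = laserA(2:end) + dA;
    laserA(1) = A_Sink(ii);
    laserB(1:end-1) = laserB(1:end-1) + dB;
    laserB(end) = B_Sink(ii);
end
```

The result should agree with the original loop up to floating-point rounding, since only the order of the additions changes.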

Related

Performance of updating/inserting into a sparse matrix in Matlab?

I have written a fairly large class for the calculation of measurement uncertainties, but it is painfully slow. Profiling the code shows that the slowest operation, by far, is inserting the computation results into a large sparse matrix. About 97% of all time is spent on that operation. The matrix keeps the uncertainties of all measurement data, and I cannot change the data structures without breaking a lot of other code. So my only option is to optimize the data insertion step. This is done about 5700 times in my benchmark, and every time the amount of data increases.
First solution, extremely slow:
% this automatically sums up duplicate yInd entries
[zInd_grid, yInd_grid] = ndgrid(1:numel(z), yInd(:));
Uzw = sparse(zInd_grid(:), yInd_grid(:), Uzy(:), numel(z), numel(obj.w));
% this automatically sums up duplicate yInd entries
dz_dw = sparse(zInd_grid(:), yInd_grid(:), dz_dy(:), numel(z), numel(obj.w));
obj.w = [obj.w; z(:)]; % insert new measurement results into the column vector obj.w
obj.Uww = [obj.Uww, transpose(Uzw); Uzw, Uzz]; % insert new uncertainties of the measurement results
obj.dw_dw = [obj.dw_dw, transpose(dz_dw); dz_dw, dz_dz]; % insert "dependencies" of new measurements on old results
The line obj.Uww = [obj.Uww, transpose(Uzw); Uzw, Uzz]; is the slowest, by far. Perhaps it is slow because Matlab needs to allocate a new, larger buffer for obj.Uww and copy everything over. Thus I changed the code to the following:
% Preallocation in the class constructor
obj.w = spalloc(nnz_w, 1, nnz_w);
obj.Uww = spalloc(nnz_w, nnz_w, nnz_Uww);
obj.dw_dw = spalloc(nnz_w, nnz_w, nnz_dw_dw);
obj.num_w = 0; % to manually keep track of the "real" size of obj.w, obj.Uww and obj.dw_dw
The class constructor is called with the sizes the three properties w, Uww and dw_dw will have at the end of the computation (nnz_w is approximately 0.1 million, nnz_Uww is about 9 million and nnz_dw_dw is about 1.6 million). Thus, no new allocation of memory should be needed. This is the inserting step now:
% this automatically sums up duplicate yInd entries
[zInd_grid, yInd_grid] = ndgrid(1:numel(z), yInd(:));
Uzw = sparse(zInd_grid(:), yInd_grid(:), Uzy(:), numel(z), obj.num_w);
% this automatically sums up duplicate yInd entries
dz_dw = sparse(zInd_grid(:), yInd_grid(:), dz_dy(:), numel(z), obj.num_w);
wInd = 1:obj.num_w;
obj.w(zInd, 1) = z(:); % insert new measurement results
obj.num_w = zInd(end); % new "real" size of w, Uww and dw_dw
obj.Uww(zInd, wInd) = Uzw; % about 51% of all computation time
obj.Uww(wInd, zInd) = transpose(Uzw); % about 15% of all computation time
obj.Uww(zInd, zInd) = Uzz; % about 14.4% of all computation time
obj.dw_dw(zInd, wInd) = dz_dw; % about 13% of all computation time
obj.dw_dw(wInd, zInd) = transpose(dz_dw); % about 3.5% of all computation time
obj.dw_dw(zInd, zInd) = dz_dz; % less than 3.5% of all computation time
Still, these lines account for 97% of all computation time, and there is no speed improvement. Thus I tried version three:
obj.w = [obj.w; z(:)];
[zInd_zy, yInd_zy] = ndgrid(zInd, yInd(:));
[zzInd_i, zzInd_j] = ndgrid(zInd, zInd);
[Uww_i, Uww_j, Uww_v] = find(obj.Uww); % 14% of all computation time
Uww_new = sparse( ... % this statement takes 66% of all computation time
[Uww_i; zInd_zy(:); yInd_zy(:); zzInd_i(:)], ...
[Uww_j; yInd_zy(:); zInd_zy(:); zzInd_j(:)], ...
[Uww_v; Uzy(:); Uzy(:); Uzz(:)], ...
numel(obj.w), numel(obj.w));
[dw_dw_i, dw_dw_j, dw_dw_v] = find(obj.dw_dw);
dw_dw_new = sparse( ... % 14% of all computation time
[dw_dw_i; zInd_zy(:); yInd_zy(:); zzInd_i(:)], ...
[dw_dw_j; yInd_zy(:); zInd_zy(:); zzInd_j(:)], ...
[dw_dw_v; dz_dy(:); dz_dy(:); dz_dz(:)], ...
numel(obj.w), numel(obj.w));
obj.Uww = Uww_new;
obj.dw_dw = dw_dw_new;
which is even slower than the two other versions. Why is inserting into an already preallocated array so slow? And how can I speed it up?
(All the matrices are symmetric, but I did not try to exploit that yet.)
I don't understand the details of your update pattern, but keep in mind that Matlab stores sparse matrices internally in compressed-sparse column format. So adding entries in sequence column-by-column is significantly faster than other orders. E.g., on my old version of Matlab (R2006a), this:
n=10000;
nz=400000;
v=floor(n*rand(nz,3))+1;
fprintf('Random\n');
A=sparse(n, n);
tic
for k=1:nz
A(v(k,1), v(k,2))=v(k,3);
end
toc
fprintf('Row-wise\n');
v=sortrows(v);
A=sparse(n, n);
tic
for k=1:nz
A(v(k,1), v(k,2))=v(k,3);
end
toc
fprintf('Column-wise\n');
v=sortrows(v, [2 1]);
A=sparse(n, n);
tic
for k=1:nz
A(v(k,1), v(k,2))=v(k,3);
end
toc
gives this:
>> sparsetest
Random
Elapsed time is 19.276089 seconds.
Row-wise
Elapsed time is 20.714324 seconds.
Column-wise
Elapsed time is 1.498150 seconds.
Likely best of all would be if you can somehow just collect the nonzeros in a form suitable for spconvert or sparse and then make the whole sparse matrix at the end, but I gather that you might not be able to do that.
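A minimal sketch of that last suggestion, assuming the entries can be buffered as (row, column, value) triplets: the buffers are preallocated, filled in any order, and `sparse` is called once at the end. The sizes and the random data are placeholders, not the asker's data.

```matlab
% Sketch: accumulate triplets and build the sparse matrix in one call.
n  = 1000;           % matrix dimension (assumed)
nz = 5000;           % number of entries to insert (assumed)
rows = zeros(nz,1);  % preallocated triplet buffers
cols = zeros(nz,1);
vals = zeros(nz,1);
for k = 1:nz
    rows(k) = randi(n);
    cols(k) = randi(n);
    vals(k) = rand();
end
% sparse sums the values of duplicate (row, column) pairs.
A = sparse(rows, cols, vals, n, n);
```

Note the difference from the indexed loops above: `A(i,j) = v` overwrites an existing entry, whereas `sparse` adds up duplicates.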
@bg2b pointed out how adding data column-wise is much faster.
It turns out adding rows to a sparse matrix or sparse vector is painfully slow (more precisely, to the lower triangular part). Thus, I now store only the upper triangular part of the sparse matrix, because that is fast. When I need data from the matrix, I recreate it from the upper triangular sparse matrix. See the end of the benchmarking script for this.
This is my benchmarking script. It nicely shows the exponential increase in computation time for adding data.
% Benchmark the extension of sparse matrices of the form
% Uww_new = [ Uww, Uwz; ...
% Uzw, Uzz];
% where Uzw = transpose(Uwz). Uww and Uzz are always square and symmetric.
close all;
clearvars;
rng(70101557, 'twister'); % the seed is the number of the stack overflow question
density = 0.25;
nZ = 10;
iterations = 5e2;
n = nZ * iterations;
nonzeros = n*n*density;
all_Uzz = cell(iterations, 1);
all_Uwz = cell(iterations, 1);
for k = 1:iterations
% Uzz must be symmetric!
Uzz_nonsymmetric = sprand(nZ, nZ, density);
all_Uzz{k} = (Uzz_nonsymmetric + transpose(Uzz_nonsymmetric))./2;
all_Uwz{k} = sprand((k-1)*nZ, nZ, density);
end
f = figure();
ax = axes(f);
hold(ax, 'on');
grid(ax, 'on');
xlabel(ax, 'Iteration');
ylabel(ax, 'Elapsed time in seconds.');
h = gobjects(1, 0);
name = 'Optimised.';
fprintf('%s\n', name);
Uww_optimised = spalloc(0, 0, nonzeros);
t1 = tic();
elapsedTimes = NaN(iterations, 1);
for k = 1:iterations
Uzz = all_Uzz{k};
Uwz = all_Uwz{k};
zInd = size(Uww_optimised, 1) + (1:size(Uzz, 1));
wInd = 1:size(Uww_optimised, 1);
Uzz_triu = triu(Uzz);
Uww_optimised(wInd, zInd) = Uwz; % add columns
Uww_optimised(zInd, zInd) = Uzz_triu; % add rows and columns
elapsedTimes(k, 1) = toc(t1);
end
toc(t1)
h = [h, plot(ax, 1:iterations, seconds(elapsedTimes), 'DisplayName', name)];
name = 'Only Uwz and Uzz.';
fprintf('%s\n', name);
Uww_wz_zz = spalloc(0, 0, nonzeros);
t1 = tic();
elapsedTimes = NaN(iterations, 1);
for k = 1:iterations
Uzz = all_Uzz{k};
Uwz = all_Uwz{k};
zInd = size(Uww_wz_zz, 1) + (1:size(Uzz, 1));
wInd = 1:size(Uww_wz_zz, 1);
Uzw = transpose(Uwz);
Uww_wz_zz(wInd, zInd) = Uwz; % add columns
% Uww_wz_zz(zInd, wInd) = Uzw; % add rows
Uww_wz_zz(zInd, zInd) = Uzz; % add rows and columns
elapsedTimes(k, 1) = toc(t1);
end
toc(t1)
h = [h, plot(ax, 1:iterations, seconds(elapsedTimes), 'DisplayName', name)];
name = 'Only Uzw and Uzz.';
fprintf('%s\n', name);
Uww_zw_zz = spalloc(0, 0, nonzeros);
t1 = tic();
elapsedTimes = NaN(iterations, 1);
for k = 1:iterations
Uzz = all_Uzz{k};
Uwz = all_Uwz{k};
zInd = size(Uww_zw_zz, 1) + (1:size(Uzz, 1));
wInd = 1:size(Uww_zw_zz, 1);
Uzw = transpose(Uwz);
% Uww_zw_zz(wInd, zInd) = Uwz;
Uww_zw_zz(zInd, wInd) = Uzw;
Uww_zw_zz(zInd, zInd) = Uzz;
elapsedTimes(k, 1) = toc(t1);
end
toc(t1)
h = [h, plot(ax, 1:iterations, seconds(elapsedTimes), 'DisplayName', name)];
name = 'Uzw, Uwz and Uzz.';
fprintf('%s\n', name);
Uww = spalloc(0, 0, nonzeros);
t1 = tic();
elapsedTimes = NaN(iterations, 1);
for k = 1:iterations
Uzz = all_Uzz{k};
Uwz = all_Uwz{k};
zInd = size(Uww, 1) + (1:size(Uzz, 1));
wInd = 1:size(Uww, 1);
Uzw = transpose(Uwz);
Uww(wInd, zInd) = Uwz;
Uww(zInd, wInd) = Uzw;
Uww(zInd, zInd) = Uzz;
elapsedTimes(k, 1) = toc(t1);
end
toc(t1)
h = [h, plot(ax, 1:iterations, seconds(elapsedTimes), 'DisplayName', name)];
leg = legend(ax, h, 'Location', 'northwest');
assert(issymmetric(Uww));
assert(istriu(Uww_optimised));
assert(isequal(Uww, Uww_optimised + transpose(triu(Uww_optimised, 1))));
%% Get Uyy from Uww_optimised. Uyy is a symmetric subset of Uww
yInd = randi(size(Uww_optimised, 1), 1, nZ); % indices to extract
[yIndRowInds, yIndColInds] = ndgrid(yInd, yInd);
indsToFlip = yIndRowInds > yIndColInds;
temp = yIndColInds(indsToFlip);
yIndColInds(indsToFlip) = yIndRowInds(indsToFlip);
yIndRowInds(indsToFlip) = temp;
linInd = sub2ind(size(Uww_optimised), yIndRowInds, yIndColInds);
assert(issymmetric(linInd));
Uyy = Uww_optimised(linInd);
assert(issymmetric(Uyy));

efficient matlab implementation for Lucas-Kanade step

I got an assignment in a video processing course: to implement the Lucas-Kanade algorithm. Since we have to do it in the pyramidal model, I first build a pyramid for each of the 2 input images, and then for each level I perform a number of LK iterations. In each step (iteration), the following code runs (note: the images are zero-padded so I can handle the image edges easily):
function [du,dv]= LucasKanadeStep(I1,I2,WindowSize)
It = I2-I1;
[Ix, Iy] = imgradientxy(I2);
Ixx = imfilter(Ix.*Ix, ones(5));
Iyy = imfilter(Iy.*Iy, ones(5));
Ixy = imfilter(Ix.*Iy, ones(5));
Ixt = imfilter(Ix.*It, ones(5));
Iyt = imfilter(Iy.*It, ones(5));
half_win = floor(WindowSize/2);
du = zeros(size(It));
dv = zeros(size(It));
A = zeros(2);
b = zeros(2,1);
%iterate only on the relevant parts of the images
for i = 1+half_win : size(It,1)-half_win
for j = 1+half_win : size(It,2)-half_win
A(1,1) = Ixx(i,j);
A(2,2) = Iyy(i,j);
A(1,2) = Ixy(i,j);
A(2,1) = Ixy(i,j);
b(1,1) = -Ixt(i,j);
b(2,1) = -Iyt(i,j);
U = pinv(A)*b;
du(i,j) = U(1);
dv(i,j) = U(2);
end
end
end
Mathematically, what I'm doing is calculating the following optical flow for every pixel (i,j):
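Written out from the way `A` and `b` are assembled in the code above (a reconstruction from that code, not a copy of the original formula), the per-pixel system reads as follows, where each sum runs over the window around pixel (i,j) and the code uses the pseudo-inverse in place of the inverse:

```latex
\begin{bmatrix} du \\ dv \end{bmatrix}
= -
\begin{bmatrix}
\sum I_x^2   & \sum I_x I_y \\
\sum I_x I_y & \sum I_y^2
\end{bmatrix}^{-1}
\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}
```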
As you can see, the code calculates this for each pixel, which takes quite a long time: the whole processing for 2 images (including building 3-level pyramids and running 3 LK steps like the one above on each level) takes about 25 seconds on a remote connection to my university servers.
My question: is there a way to calculate this single LK step without the nested for loops? It must be more efficient, because the next step of the assignment is to stabilize a short video using this algorithm. Thanks.
I ran your code on my system and did profiling. Here is what I got.
As you can see, inverting the matrix (pinv) takes most of the time. You could try to vectorise your code, but I am not sure how to do it. I do know a trick to improve the compute time, though: exploit the smallest eigenvalue of the matrix A. That is, solve the system only if the smallest eigenvalue of A is greater than some threshold. This improves the speed because you won't be inverting the matrix for every pixel.
You do this by modifying your code to the one shown below.
function [du,dv]= LucasKanadeStep(I1,I2,WindowSize)
It = double(I2-I1);
[Ix, Iy] = imgradientxy(I2);
Ixx = imfilter(Ix.*Ix, ones(5));
Iyy = imfilter(Iy.*Iy, ones(5));
Ixy = imfilter(Ix.*Iy, ones(5));
Ixt = imfilter(Ix.*It, ones(5));
Iyt = imfilter(Iy.*It, ones(5));
half_win = floor(WindowSize/2);
du = zeros(size(It));
dv = zeros(size(It));
A = zeros(2);
B = zeros(2,1);
%iterate only on the relevant parts of the images
for i = 1+half_win : size(It,1)-half_win
for j = 1+half_win : size(It,2)-half_win
A(1,1) = Ixx(i,j);
A(2,2) = Iyy(i,j);
A(1,2) = Ixy(i,j);
A(2,1) = Ixy(i,j);
B(1,1) = -Ixt(i,j);
B(2,1) = -Iyt(i,j);
% +++++++++++++++++++++++++++++++++++++++++++++++++++
% Code I added; the threshold would be better defined outside the loop.
lambda = eig(A);
threshold = 0.2; % semicolon suppresses printing on every iteration
if (min(lambda)> threshold)
U = A\B;
du(i,j) = U(1);
dv(i,j) = U(2);
end
% end of addendum
% +++++++++++++++++++++++++++++++++++++++++++++++++++
% U = pinv(A)*B;
% du(i,j) = U(1);
% dv(i,j) = U(2);
end
end
end
I have set the threshold to 0.2; you can experiment with it. Using the eigenvalue trick, I was able to bring the compute time down from 37 seconds to 10 seconds (shown below). With the eigenvalue check, the matrix solve hardly takes up any time compared with before.
Hope this helped. Good luck :)
Eventually I was able to find a much more efficient solution to this problem.
It is based on the formula shown in the question. The last 3 lines are what make the difference: we get loop-free code that runs much faster. There were negligible differences from the looped version (~10^-18 or less in terms of absolute difference between the result matrices, ignoring the padding zone).
Here is the code:
function [du,dv]= LucasKanadeStep(I1,I2,WindowSize)
half_win = floor(WindowSize/2);
% pad frames with mirror reflections of itself
I1 = padarray(I1, [half_win half_win], 'symmetric');
I2 = padarray(I2, [half_win half_win], 'symmetric');
% create derivatives (time and space)
It = I2-I1;
[Ix, Iy] = imgradientxy(I2, 'prewitt');
% calculate dP = (du, dv) according to the formula
Ixx = imfilter(Ix.*Ix, ones(WindowSize));
Iyy = imfilter(Iy.*Iy, ones(WindowSize));
Ixy = imfilter(Ix.*Iy, ones(WindowSize));
Ixt = imfilter(Ix.*It, ones(WindowSize));
Iyt = imfilter(Iy.*It, ones(WindowSize));
% calculate the whole du,dv matrices AT ONCE!
invdet = (Ixx.*Iyy - Ixy.*Ixy).^-1;
du = invdet.*(-Iyy.*Ixt + Ixy.*Iyt);
dv = invdet.*(Ixy.*Ixt - Ixx.*Iyt);
end
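A possible refinement, sketched under the assumption that ill-conditioned pixels should simply get zero flow (as in the eigenvalue-threshold answer above): the smaller eigenvalue of the symmetric 2x2 matrix has a closed form, so it can be evaluated for all pixels at once and used as a mask before dividing by the determinant. The small arrays below are placeholder window sums, not real image data.

```matlab
% Placeholder window sums for a 2x2 "image" (assumed values, illustration only).
Ixx = [4 1; 0 2];   Iyy = [3 1; 0 2];   Ixy = [1 1; 0 0];
Ixt = [1 2; 3 -2];  Iyt = [-1 2; 1 4];
threshold = 0.2;    % same value as in the eigenvalue answer; tune as needed

% Smaller eigenvalue of [Ixx Ixy; Ixy Iyy], evaluated for every pixel.
lambdaMin = (Ixx + Iyy)/2 - sqrt(((Ixx - Iyy)/2).^2 + Ixy.^2);
valid = lambdaMin > threshold;

% Closed-form 2x2 solve, applied only where the matrix is well conditioned.
detA = Ixx.*Iyy - Ixy.*Ixy;
du = zeros(size(Ixx));
dv = zeros(size(Ixx));
du(valid) = (-Iyy(valid).*Ixt(valid) + Ixy(valid).*Iyt(valid)) ./ detA(valid);
dv(valid) = ( Ixy(valid).*Ixt(valid) - Ixx(valid).*Iyt(valid)) ./ detA(valid);
```

Pixels whose matrix is singular or nearly singular keep du = dv = 0 instead of producing Inf or NaN.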

Reduce the calculation time for the matlab code

To calculate an enhancement function for an input image I have written the following piece of code:
Ig = rgb2gray(imread('test.png'));
N = numel(Ig);
meanTotal = mean2(Ig);
[row,cal] = size(Ig);
IgTransformed = Ig;
n = 3;
a = 1;
b = 1;
c = 1;
k = 1;
for ii=2:row-1
for jj=2:cal-1
window = Ig(ii-1:ii+1,jj-1:jj+1);
IgTransformed(ii,jj) = ((k*meanTotal)/(std2(window) + b))*abs(Ig(ii,jj)-c*mean2(window)) + mean2(window).^a;
end
end
How can I reduce the calculation time?
Obviously, one of the factors is the small window (3x3) that should be made in the loop each time.
Here you go -
Igd = double(Ig);
std2v = colfilt(Igd, [3 3], 'sliding', @std);
mean2v = conv2(Igd,ones(3),'same')/9;
Ig_out = uint8((k*meanTotal)./(std2v + b).*abs(Igd-c*mean2v) + mean2v.^a);
This will change the boundary elements too. If that is not desired, they can be set back to the original values with a few additional steps, like so -
Ig_out(:,[1 end]) = Ig(:,[1 end]);
Ig_out([1 end],:) = Ig([1 end],:);
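If the Image Processing Toolbox is not available, the sliding mean and standard deviation can also be obtained with `conv2` alone. This is a sketch on placeholder data (`magic(6)` stands in for the image); it uses the identity that the sample variance of 9 values equals (sum of squares - (sum)^2/9)/8.

```matlab
% Sketch: sliding 3x3 mean and sample standard deviation using only conv2.
Igd = magic(6);                         % placeholder data (assumption)
box = ones(3);
S1 = conv2(Igd,    box, 'same');        % window sums (zero-padded at the border)
S2 = conv2(Igd.^2, box, 'same');        % window sums of squares
mean2v = S1/9;
std2v  = sqrt(max(S2 - S1.^2/9, 0)/8);  % N-1 normalisation, as in std2
```

For interior pixels this should match `std2` and `mean2` of the 3x3 window up to rounding. The `max(...,0)` guards against tiny negative values caused by cancellation.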

Faster concatenation of cell arrays of different sizes

I have a cell array of size m x 1 and each cell is again s x t cell array (size varies). I would like to concatenate vertically. The code is as follows:
function cell_out = vert_cat(cell_in)
[row,col] = cellfun(@size,cell_in,'Uni',0);
fcn_vert = @(x)([x,repmat({''},size(x,1),max(cell2mat(col))-size(x,2))]);
cell_out = cellfun(fcn_vert,cell_in,'Uni',0); % Taking up lot of time
cell_out = vertcat(cell_out{:});
end
Step 3 takes a lot of time. Is it the right way to do or is there any another faster way to achieve this?
cellfun has been found to be slower than loops (kind of old, but agrees with what I have seen).
In addition, repmat has also been a performance hit in the past (though that may be different now).
Try this two-loop code that aims to accomplish your task:
function cellOut = vert_cat(c)
nElem = length(c);
colPad = zeros(nElem,1);
nRow = zeros(nElem,1);
for k = 1:nElem
[nRow(k),colPad(k)] = size(c{k});
end
colMax = max(colPad);
colPad = colMax - colPad;
cellOut = cell(sum(nRow),colMax);
bottom = cumsum(nRow) - nRow + 1;
top = bottom + nRow - 1;
for k = 1:nElem
cellOut(bottom(k):top(k),:) = [c{k},cell(nRow(k),colPad(k))];
end
end
My test for this code was
A = rand(20,20);
A = mat2cell(A,ones(20,1),ones(20,1));
C = arrayfun(@(c) A(1:c,1:c),randi([1,15],1,5),'UniformOutput',false);
ccat = vert_cat(C);
I used this piece of code to generate data:
%generating some dummy data
m=1000;
s=100;
t=100;
cell_in=cell(m,1);
for idx=1:m
cell_in{idx}=cell(randi(s),randi(t));
end
Applying some minor modifications, I was able to speed up the code by a factor of 5
%Minor modifications of the original code
%use arrays instead of cells for row and col
[row,col] = cellfun(@size,cell_in);
%calculate max(col) once
tcol=max(col);
%use cell instead of repmat to generate an empty cell
fcn_vert = @(x)([x,cell(size(x,1),tcol-size(x,2))]);
cell_out = cellfun(fcn_vert,cell_in,'Uni',0); % Taking up lot of time
cell_out = vertcat(cell_out{:});
Using simply a for loop is even faster, because the data is only moved once
%new approach. Basic idea: move every data only once
[row,col] = cellfun(@size,cell_in);
trow=sum(row);
tcol=max(col);
r=1;
cell_out2 = cell(trow,tcol);
for idx=1:numel(cell_in)
cell_out2(r:r+row(idx)-1,1:col(idx))=cell_in{idx};
r=r+row(idx);
end

replace repmat with bsxfun in MATLAB

I want to change the following function to make it faster. By itself it is fast, but I have to call it many times in a for loop, so it takes a long time. I think replacing repmat with bsxfun will make it faster, but I am not sure. How can I do these replacements?
function out = lagcal(y1,y1k,source)
kn1 = y1(:);
kt1 = y1k(:);
kt1x = repmat(kt1,1,length(kt1));
eq11 = 1./(prod(kt1x-kt1x'+eye(length(kt1))));
eq1 = eq11'*eq11;
dist = repmat(kn1,1,length(kt1))-repmat(kt1',length(kn1),1);
[fixi,fixj] = find(dist==0); dist(fixi,fixj)=eps;
mult = 1./(dist);
eq2 = prod(dist,2);
eq22 = repmat(eq2,1,length(kt1));
eq222 = eq22 .* mult;
out = eq1 .* (eq222'*source*eq222);
end
Does it really speed up my function?
Introduction and code changes
All the repmat usages in the function expand the inputs to matching sizes so that the later mathematical operations on them can be performed. This is a tailor-made situation for bsxfun. Sadly, though, the real bottleneck of the function seems to be something else. Stay on as we discuss all the performance-related aspects of the code.
Code with repmat replaced by bsxfun is presented next and the replaced codes
are kept as comments for comparison -
function out = lagcal(y1,y1k,source)
kn1 = y1(:);
kt1 = y1k(:);
%//kt1x = repmat(kt1,1,length(kt1));
%//eq11 = 1./(prod(kt1x-kt1x'+eye(length(kt1)))) %//'
eq11 = 1./prod(bsxfun(@minus,kt1,kt1.') + eye(numel(kt1))); %//'
eq1 = eq11'*eq11; %//'
%//dist = repmat(kn1,1,length(kt1))-repmat(kt1',length(kn1),1) %//'
dist = bsxfun(@minus,kn1,kt1.'); %//'
[fixi,fixj] = find(dist==0);
dist(fixi,fixj)=eps;
mult = 1./(dist);
eq2 = prod(dist,2);
%//eq22 = repmat(eq2,1,length(kt1));
%//eq222 = eq22 .* mult
eq222 = bsxfun(@times,eq2,mult);
out = eq1 .* (eq222'*source*eq222); %//'
return; %// Better this way to end a function
One more modification could be added here. In the last line, we could do
something like as shown below, but the timing results don't show a huge benefit
with it -
out = bsxfun(@times,eq11.',bsxfun(@times,eq11,eq222'*source*eq222));
This would avoid the calculation of eq1 done earlier in the original code, so you would save little more time that way.
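On MATLAB R2016b or later, the same expansions can be written with implicit expansion instead of explicit bsxfun calls. The following is a sketch on small placeholder inputs; whether it is faster than bsxfun on a given release is something to measure rather than assume.

```matlab
% Requires R2016b or later (implicit expansion); inputs are placeholders.
N = 5;
kn1 = rand(N,1);
kt1 = rand(N,1);
% Column minus row expands to an N-by-N matrix, as bsxfun with minus would.
dist = kn1 - kt1.';
eq11 = 1./prod((kt1 - kt1.') + eye(numel(kt1)));
eq2  = prod(dist,2);
% Column times matrix scales each row, as bsxfun with times would.
eq222 = eq2 .* (1./dist);
```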
Benchmarking
Benchmarking on the bsxfun modified portions of the code versus the original
repmat based codes is discussed next.
Benchmarking Code
N_arr = [50 100 200 500 1000 2000 3000]; %// array elements for N (datasize)
blocks = 3;
timeall = zeros(2,numel(N_arr),blocks);
for k1 = 1:numel(N_arr)
N = N_arr(k1);
y1 = rand(N,1);
y1k = rand(N,1);
source = rand(N);
kn1 = y1(:);
kt1 = y1k(:);
%% Block 1 ----------------
block = 1;
f = @() block1_org(kt1);
timeall(1,k1,block) = timeit(f);
clear f
f = @() block1_mod(kt1);
timeall(2,k1,block) = timeit(f);
eq11 = feval(f);
clear f
%% Block 1 ----------------
eq1 = eq11'*eq11; %//'
%% Block 2 ----------------
block = 2;
f = @() block2_org(kn1,kt1);
timeall(1,k1,block) = timeit(f);
clear f
f = @() block2_mod(kn1,kt1);
timeall(2,k1,block) = timeit(f);
dist = feval(f);
clear f
%% Block 2 ----------------
[fixi,fixj] = find(dist==0);
dist(fixi,fixj)=eps;
mult = 1./(dist);
eq2 = prod(dist,2);
%% Block 3 ----------------
block = 3;
f = @() block3_org(eq2,mult,length(kt1));
timeall(1,k1,block) = timeit(f);
clear f
f = @() block3_mod(eq2,mult);
timeall(2,k1,block) = timeit(f);
clear f
%% Block 3 ----------------
end
%// Display benchmark results
figure,
for k2 = 1:blocks
subplot(blocks,1,k2),
title(strcat('Block',num2str(k2),' results :'),'fontweight','bold'),hold on
plot(N_arr,timeall(1,:,k2),'-ro')
plot(N_arr,timeall(2,:,k2),'-kx')
legend('REPMAT Method','BSXFUN Method')
xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')
end
Associated functions
function out = block1_org(kt1)
kt1x = repmat(kt1,1,length(kt1));
out = 1./(prod(kt1x-kt1x'+eye(length(kt1))));
return;
function out = block1_mod(kt1)
out = 1./prod(bsxfun(@minus,kt1,kt1.') + eye(numel(kt1)));
return;
function out = block2_org(kn1,kt1)
out = repmat(kn1,1,length(kt1))-repmat(kt1',length(kn1),1);
return;
function out = block2_mod(kn1,kt1)
out = bsxfun(@minus,kn1,kt1.');
return;
function out = block3_org(eq2,mult,length_kt1)
eq22 = repmat(eq2,1,length_kt1);
out = eq22 .* mult;
return;
function out = block3_mod(eq2,mult)
out = bsxfun(@times,eq2,mult);
return;
Results
Conclusions
The bsxfun-based code shows around 2x speedups over the repmat-based code, which is encouraging. But profiling the original code across varying data sizes shows that the matrix multiplications in the final line, which are supposedly very efficient within MATLAB, occupy most of the function's runtime. Unless you have some way to avoid those multiplications by using some other mathematical technique, they look like the bottleneck.
