In the following function I want to make some changes to speed it up. By itself it is fast, but I have to call it many times in a for loop, so it takes long. I think replacing the repmat calls with bsxfun would make it faster, but I am not sure. How can I do these replacements?
function out = lagcal(y1,y1k,source)
kn1 = y1(:);
kt1 = y1k(:);
kt1x = repmat(kt1,1,length(kt1));
eq11 = 1./(prod(kt1x-kt1x'+eye(length(kt1))));
eq1 = eq11'*eq11;
dist = repmat(kn1,1,length(kt1))-repmat(kt1',length(kn1),1);
[fixi,fixj] = find(dist==0); dist(fixi,fixj)=eps;
mult = 1./(dist);
eq2 = prod(dist,2);
eq22 = repmat(eq2,1,length(kt1));
eq222 = eq22 .* mult;
out = eq1 .* (eq222'*source*eq222);
end
Does it really speed up my function?
Introduction and code changes
All the repmat calls in the function exist to expand inputs to compatible sizes so that the subsequent mathematical operations can be performed elementwise. This is a tailor-made situation for bsxfun. Sadly, though, the real bottleneck of the function seems to be something else. Stay on as we discuss all the performance-related aspects of the code.
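To make the pattern concrete, here is a minimal, self-contained illustration (with made-up small vectors) of what both approaches compute -
a = (1:4).'; %// 4x1 column vector
b = 1:3; %// 1x3 row vector
d1 = repmat(a,1,3) - repmat(b,4,1); %// explicit expansion to 4x3
d2 = bsxfun(@minus,a,b); %// virtual expansion, no temporary copies
isequal(d1,d2) %// returns true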
Code with repmat replaced by bsxfun is presented next, with the replaced lines kept as comments for comparison -
function out = lagcal(y1,y1k,source)
kn1 = y1(:);
kt1 = y1k(:);
%// kt1x = repmat(kt1,1,length(kt1));
%// eq11 = 1./(prod(kt1x-kt1x'+eye(length(kt1))));
eq11 = 1./prod(bsxfun(@minus,kt1,kt1.') + eye(numel(kt1)));
eq1 = eq11'*eq11;
%// dist = repmat(kn1,1,length(kt1))-repmat(kt1',length(kn1),1);
dist = bsxfun(@minus,kn1,kt1.');
[fixi,fixj] = find(dist==0);
dist(fixi,fixj)=eps;
mult = 1./(dist);
eq2 = prod(dist,2);
%// eq22 = repmat(eq2,1,length(kt1));
%// eq222 = eq22 .* mult;
eq222 = bsxfun(@times,eq2,mult);
out = eq1 .* (eq222'*source*eq222);
return; %// Better this way to end a function
One more modification could be added here: in the last line, we could do something like the following, although the timing results don't show a huge benefit from it -
out = bsxfun(@times,eq11.',bsxfun(@times,eq11,eq222'*source*eq222));
This would avoid the calculation of eq1 done earlier in the original code, so you would save a little more time that way.
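The equivalence holds because eq1 = eq11'*eq11 has entries eq11(i)*eq11(j), so multiplying a matrix M elementwise by eq1 is the same as scaling the rows and columns of M by eq11:

$$(\mathrm{eq11}^{\top}\,\mathrm{eq11}) \odot M \;=\; \operatorname{diag}(\mathrm{eq11})\, M \,\operatorname{diag}(\mathrm{eq11}),$$

which is exactly what the two nested bsxfun calls compute, one dimension at a time.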
Benchmarking
Benchmarking of the bsxfun-modified portions of the code against the original repmat-based code is discussed next.
Benchmarking Code
N_arr = [50 100 200 500 1000 2000 3000]; %// array elements for N (datasize)
blocks = 3;
timeall = zeros(2,numel(N_arr),blocks);
for k1 = 1:numel(N_arr)
N = N_arr(k1);
y1 = rand(N,1);
y1k = rand(N,1);
source = rand(N);
kn1 = y1(:);
kt1 = y1k(:);
%% Block 1 ----------------
block = 1;
f = @() block1_org(kt1);
timeall(1,k1,block) = timeit(f);
clear f
f = @() block1_mod(kt1);
timeall(2,k1,block) = timeit(f);
eq11 = feval(f);
clear f
%% Block 1 ----------------
eq1 = eq11'*eq11;
%% Block 2 ----------------
block = 2;
f = @() block2_org(kn1,kt1);
timeall(1,k1,block) = timeit(f);
clear f
f = @() block2_mod(kn1,kt1);
timeall(2,k1,block) = timeit(f);
dist = feval(f);
clear f
%% Block 2 ----------------
[fixi,fixj] = find(dist==0);
dist(fixi,fixj)=eps;
mult = 1./(dist);
eq2 = prod(dist,2);
%% Block 3 ----------------
block = 3;
f = @() block3_org(eq2,mult,length(kt1));
timeall(1,k1,block) = timeit(f);
clear f
f = @() block3_mod(eq2,mult);
timeall(2,k1,block) = timeit(f);
clear f
%% Block 3 ----------------
end
%// Display benchmark results
figure,
for k2 = 1:blocks
subplot(blocks,1,k2),
title(strcat('Block',num2str(k2),' results :'),'fontweight','bold'),hold on
plot(N_arr,timeall(1,:,k2),'-ro')
plot(N_arr,timeall(2,:,k2),'-kx')
legend('REPMAT Method','BSXFUN Method')
xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')
end
Associated functions
function out = block1_org(kt1)
kt1x = repmat(kt1,1,length(kt1));
out = 1./(prod(kt1x-kt1x'+eye(length(kt1))));
return;
function out = block1_mod(kt1)
out = 1./prod(bsxfun(@minus,kt1,kt1.') + eye(numel(kt1)));
return;
function out = block2_org(kn1,kt1)
out = repmat(kn1,1,length(kt1))-repmat(kt1',length(kn1),1);
return;
function out = block2_mod(kn1,kt1)
out = bsxfun(@minus,kn1,kt1.');
return;
function out = block3_org(eq2,mult,length_kt1)
eq22 = repmat(eq2,1,length_kt1);
out = eq22 .* mult;
return;
function out = block3_mod(eq2,mult)
out = bsxfun(@times,eq2,mult);
return;
Results
[Benchmark plots for Blocks 1-3: runtime (sec) against datasize N, REPMAT method vs. BSXFUN method.]
Conclusions
The bsxfun-based code shows around 2x speedup over the repmat-based code, which is encouraging. But profiling the original code across varying datasizes shows that the matrix multiplications in the final line occupy most of the function's runtime, and those are already very efficient within MATLAB. Unless you can avoid those multiplications through some other mathematical technique, they look like the bottleneck.
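To verify this yourself, here is a minimal sketch (with a hypothetical datasize N) that times just the final line in isolation, for comparison against the block timings above -
N = 2000; %// hypothetical datasize
eq1 = rand(N); eq222 = rand(N); source = rand(N);
f = @() eq1 .* (eq222'*source*eq222); %// the final line of lagcal
timeit(f)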
Related
I got an assignment in a video processing course - to implement the Lucas-Kanade algorithm. Since we have to do it in the pyramidal model, I first build a pyramid for each of the 2 input images, and then for each level I perform a number of LK iterations. In each step (iteration), the following code runs (note: the images are zero-padded so I can handle the image edges easily):
function [du,dv]= LucasKanadeStep(I1,I2,WindowSize)
It = I2-I1;
[Ix, Iy] = imgradientxy(I2);
Ixx = imfilter(Ix.*Ix, ones(5));
Iyy = imfilter(Iy.*Iy, ones(5));
Ixy = imfilter(Ix.*Iy, ones(5));
Ixt = imfilter(Ix.*It, ones(5));
Iyt = imfilter(Iy.*It, ones(5));
half_win = floor(WindowSize/2);
du = zeros(size(It));
dv = zeros(size(It));
A = zeros(2);
b = zeros(2,1);
%iterate only on the relevant parts of the images
for i = 1+half_win : size(It,1)-half_win
for j = 1+half_win : size(It,2)-half_win
A(1,1) = Ixx(i,j);
A(2,2) = Iyy(i,j);
A(1,2) = Ixy(i,j);
A(2,1) = Ixy(i,j);
b(1,1) = -Ixt(i,j);
b(2,1) = -Iyt(i,j);
U = pinv(A)*b;
du(i,j) = U(1);
dv(i,j) = U(2);
end
end
end
Mathematically, what I'm doing is calculating for every pixel (i,j) the following optical flow step (with the sums taken over the window around the pixel):

$$\begin{bmatrix} du \\ dv \end{bmatrix} = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}^{-1} \begin{bmatrix} -\sum I_x I_t \\ -\sum I_y I_t \end{bmatrix}$$
As you can see, in the code I am calculating this for each pixel, which takes quite a long time (the whole processing for 2 images - including building 3-level pyramids and 3 LK steps like the one above on each level - takes about 25 seconds (!) over a remote connection to my university servers).
My question: is there a way to calculate this single LK step without the nested for loops? It needs to be more efficient, because the next step of the assignment is to stabilize a short video using this algorithm. Thanks.
I ran your code on my system and profiled it. Inverting the matrix (pinv) takes most of the time. You could try to vectorise the code, I guess, but I am not sure how to do that. I do, however, know a trick to improve the compute time: exploit the smallest eigenvalue of the matrix A. That is, compute the flow only where the smallest eigenvalue of A is greater than some threshold. This improves the speed because you no longer invert the matrix for every pixel.
You do this by modifying your code to the one shown below.
function [du,dv]= LucasKanadeStep(I1,I2,WindowSize)
It = double(I2-I1);
[Ix, Iy] = imgradientxy(I2);
Ixx = imfilter(Ix.*Ix, ones(5));
Iyy = imfilter(Iy.*Iy, ones(5));
Ixy = imfilter(Ix.*Iy, ones(5));
Ixt = imfilter(Ix.*It, ones(5));
Iyt = imfilter(Iy.*It, ones(5));
half_win = floor(WindowSize/2);
du = zeros(size(It));
dv = zeros(size(It));
A = zeros(2);
B = zeros(2,1);
%iterate only on the relevant parts of the images
for i = 1+half_win : size(It,1)-half_win
for j = 1+half_win : size(It,2)-half_win
A(1,1) = Ixx(i,j);
A(2,2) = Iyy(i,j);
A(1,2) = Ixy(i,j);
A(2,1) = Ixy(i,j);
B(1,1) = -Ixt(i,j);
B(2,1) = -Iyt(i,j);
% +++++++++++++++++++++++++++++++++++++++++++++++++++
% Code I added; the threshold is better set outside the loop.
lambda = eig(A);
threshold = 0.2;
if (min(lambda) > threshold)
U = A\B;
du(i,j) = U(1);
dv(i,j) = U(2);
end
% end of addendum
% +++++++++++++++++++++++++++++++++++++++++++++++++++
% U = pinv(A)*B;
% du(i,j) = U(1);
% dv(i,j) = U(2);
end
end
end
I have set the threshold to 0.2; you can experiment with it. By using the eigenvalue trick I was able to bring the compute time down from 37 seconds to 10 seconds. With the eigenvalue check in place, pinv hardly takes up any time like it did before.
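One further thought, as a sketch only: for a symmetric 2x2 matrix the eigenvalues have a closed form, so the per-pixel eig call could itself be avoided -
% Closed-form eigenvalues of the symmetric 2x2 matrix A = [a c; c b]
tr = A(1,1) + A(2,2); % trace
dt = A(1,1)*A(2,2) - A(1,2)^2; % determinant
lambda_min = (tr - sqrt(tr^2 - 4*dt))/2; % smallest eigenvalue, no eig needed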
Hope this helped. Good luck :)
Eventually I was able to find a much more efficient solution to this problem.
It is based on the formula shown in the question. The last 3 lines are what makes the difference - we get loop-free code that runs way faster. The differences from the looped version were negligible (~1e-18 or less in absolute difference between the result matrices, ignoring the padding zone).
Here is the code:
function [du,dv]= LucasKanadeStep(I1,I2,WindowSize)
half_win = floor(WindowSize/2);
% pad frames with mirror reflections of itself
I1 = padarray(I1, [half_win half_win], 'symmetric');
I2 = padarray(I2, [half_win half_win], 'symmetric');
% create derivatives (time and space)
It = I2-I1;
[Ix, Iy] = imgradientxy(I2, 'prewitt');
% calculate dP = (du, dv) according to the formula
Ixx = imfilter(Ix.*Ix, ones(WindowSize));
Iyy = imfilter(Iy.*Iy, ones(WindowSize));
Ixy = imfilter(Ix.*Iy, ones(WindowSize));
Ixt = imfilter(Ix.*It, ones(WindowSize));
Iyt = imfilter(Iy.*It, ones(WindowSize));
% calculate the whole du,dv matrices AT ONCE!
invdet = (Ixx.*Iyy - Ixy.*Ixy).^-1;
du = invdet.*(-Iyy.*Ixt + Ixy.*Iyt);
dv = invdet.*(Ixy.*Ixt - Ixx.*Iyt);
end
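The last three lines work because the 2x2 system from the question can be inverted in closed form. With the windowed sums Ixx, Iyy, Ixy, Ixt, Iyt as computed above,

$$du = \frac{-I_{yy}\,I_{xt} + I_{xy}\,I_{yt}}{I_{xx}I_{yy} - I_{xy}^{2}}, \qquad dv = \frac{I_{xy}\,I_{xt} - I_{xx}\,I_{yt}}{I_{xx}I_{yy} - I_{xy}^{2}},$$

which is exactly what invdet, du and dv compute for all pixels at once.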
I have an equation used to compute sigma, in which i is an index from 1 to N, * denotes the convolution operation, and Omega is the image domain.
I want to implement it in MATLAB. Currently I have three candidate implementations of the equation. Could you look at them and tell me which one is correct? I have spent a lot of time trying to see what differs among the methods, but I could not find it. Thanks in advance.
The difference between Method 1 and Method 2 is that Method 1 computes sigma after the loop, whereas Method 2 computes it inside the loop:
sigma(1:row,1:col,1:dim) = nu/d;
Do they give the same result?
=========== MATLAB code ==============
Method 1:
nu = 0;
d = 0;
I2 = I.^2;
[row,col] = size(I);
for i = 1:N
KuI2 = conv2(u(:,:,i).*I2,k,'same');
bc = b.*(c(:,:,i));
bcKuI = -2*bc.*conv2(u(:,:,i).*I,k,'same');
bc2Ku = bc.^2.*conv2(u(:,:,i),k,'same');
nu = nu + sum(sum(KuI2+bcKuI+bc2Ku));
ku = conv2(u(:,:,i),k,'same');
d = d + sum(sum(ku));
end
d = d + (d==0)*eps;
sigma(1:row,1:col,1:dim) = nu/d;
Method 2:
I2 = I.^2;
[row,col] = size(I);
for i = 1:dim
KuI2 = conv2(u(:,:,i).*I2,k,'same');
bc = b.*(c(:,:,i));
bcKuI = -2*bc.*conv2(u(:,:,i).*I,k,'same');
bc2Ku = bc.^2.*conv2(u(:,:,i),k,'same');
nu = sum(sum(KuI2+bcKuI+bc2Ku));
ku = conv2(u(:,:,i),k,'same');
d = sum(sum(ku));
d = d + (d==0)*eps;
sigma(1:row,1:col,i) = nu/d;
end
Method 3:
I2 = I.^2;
[row,col] = size(I);
for i = 1:dim
KuI2 = conv2(u(:,:,i).*I2,k,'same');
bc = b.*(c(:,:,i));
bcKuI = -2*bc.*conv2(u(:,:,i).*I,k,'same');
bc2Ku = bc.^2.*conv2(u(:,:,i),k,'same');
ku = conv2(u(:,:,i),k,'same');
d = ku + (ku==0)*eps;
sigma(:,:,i) = (KuI2+bcKuI+bc2Ku)./d;
end
sigma = sigma + (sigma==0).*eps;
I think that Method 1 assumes sigma1 = sigma2 = ... = sigmaN, because sigma is computed outside the loop as
sigma(1:row,1:col,1:dim) = nu/d;
where nu and d are cumulative sums over all iterations.
Method 2, on the other hand, gives sigma1 != sigma2 != ... != sigmaN, because each sigma is calculated inside the loop.
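A tiny numeric illustration (with made-up numbers) of why the two disagree - a ratio of sums is not the same as per-index ratios:
nu = [3 5]; d = [2 4]; % per-iteration numerators and denominators
sum(nu)/sum(d) % Method 1 style: one shared value, 8/6 = 1.3333
nu./d % Method 2 style: per-index values, [1.5 1.25]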
Hope it helps.
I'm trying to code a Least Squares algorithm and I've come up with this:
function [y] = ex1_Least_Squares(xValues,yValues,x) % a + b*x + c*x^2 = y
points = size(xValues,1);
A = ones(points,3);
b = zeros(points,1);
for i=1:points
A(i,1) = 1;
A(i,2) = xValues(i);
A(i,3) = xValues(i)^2;
b(i) = yValues(i);
end
constants = (A'*A)\(A'*b);
y = constants(1) + constants(2)*x + constants(3)*x^2;
When I use this MATLAB script for linear functions, it works fine, I think. However, when I pass it 12 points of the sin(x) function, I get really bad results.
These are the points I pass to the function:
xValues = [ -180; -144; -108; -72; -36; 0; 36; 72; 108; 144; 160; 180];
yValues = [sind(-180); sind(-144); sind(-108); sind(-72); sind(-36); sind(0); sind(36); sind(72); sind(108); sind(144); sind(160); sind(180) ];
And the result is sin(165°) = 0.559935259380508, when it should be sin(165°) = 0.258819
There is no reason why fitting a parabola to a full period of a sinusoid should give good results; these two curves are unrelated.
MATLAB already contains a least-squares polynomial fitting function, polyfit, and a complementary evaluation function, polyval. Although you are probably supposed to write your own, trying out something like the following will be educational:
xValues = [ -180; -144; -108; -72; -36; 0; 36; 72; 108; 144; 160; 180];
% you may want to experiment with different ranges of xValues
yValues = sind(xValues);
n = 3; % polynomial degree; try different values, say 2, 3, and 4
p = polyfit(xValues,yValues,n);
x = -180:36:180;
y = polyval(p,x);
plot(xValues,yValues);
hold on
plot(x,y,'r');
Also, more generically, you should avoid using loops as you have in your code. This should be equivalent:
points = size(xValues,1);
A = ones(points,3);
A(:,2) = xValues;
A(:,3) = xValues.^2; % .^ and ^ are different
The part of the loop involving b is equivalent to doing b = yValues; either name the incoming variable b or just use yValues directly - there's no need to make a copy of it.
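Putting it all together, a loop-free version of the whole function could look like this (a sketch; the elementwise .^ also lets x be a vector of query points):
function y = ex1_Least_Squares(xValues,yValues,x) % a + b*x + c*x^2 = y
A = [ones(numel(xValues),1), xValues(:), xValues(:).^2]; % design matrix, built without a loop
constants = (A'*A)\(A'*yValues(:)); % normal equations, as in the original
y = constants(1) + constants(2).*x + constants(3).*x.^2;
end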
After a few days of optimization, this is my code for an enumeration process that consists in finding the best combination for every row of W. The algorithm splits the matrix W into one matrix whose elements are greater than LimiteInferiore (called W_legali) and one that has only elements below the limit (called W_nlegali).
Using some parameters like Media (i.e. the mean) and rho_b_legale, the algorithm minimizes the total cost function. In the last part, I find where the combination with the lowest objective-function value is, and save it in W_ottimo.
As you can see, the algorithm is not so "clean", and with a very large matrix (142506x3000) it is painfully slow... So, can somebody help me speed it up a little bit?
for i=1:3000
W = PesoIncertezza * MatriceCombinazioni';
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
W_legali = W;
W_legali(W<LimiteInferiore) = nan;
if i==1
Media = W_legali;
rho_b_legale = ones(size (W_legali,1),size(MatriceCombinazioni,1));
else
Media = (repmat(sum(W_tot_migl,2),1,size(MatriceCombinazioni,1))+W_legali)/(size(W_tot_migl,2)+1);
rho_b_legale = repmat(((n_b+1)/i),1,size(MatriceCombinazioni,1));
end
[W_legali_migl,comb] = min(C_u .* Media .* (1./rho_b_legale) + (1./rho_b_legale) .* c_0 + (c_1./(i * rho_b_legale)),[],2);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
MatriceCombinazioni_2 = MatriceCombinazioni;
MatriceCombinazioni_2(sum(MatriceCombinazioni_2,2)<2,:)=[];
W_nlegali = PesoIncertezza * MatriceCombinazioni_2';
W_nlegali(W_nlegali>=LimiteInferiore) = nan;
if i==1
Media = W_nlegali;
rho_b_nlegale = zeros(size (W_nlegali,1),size(MatriceCombinazioni_2,1));
else
Media = (repmat(sum(W_tot_migl,2),1,size(MatriceCombinazioni_2,1))+W_nlegali)/(size(W_tot_migl,2)+1);
rho_b_nlegale = repmat(((n_b)/i),1,size(MatriceCombinazioni_2,1));
end
[W_nlegali_migliori,comb2] = min(C_u .* Media .* (1./rho_b_nlegale) + (1./rho_b_nlegale) .* c_0 + (c_1./(i * rho_b_nlegale)),[],2);
z = [W_legali_migl, W_nlegali_migliori];
[z_ott,comb3] = min(z,[],2);
%Increasing n_b
if i==1
n_b = zeros(size(W,1),1);
end
index = find(comb3==1);
increment = ones(size(index,1),1);
B = accumarray(index,increment);
nzIndex = (B ~= 0);
n_b(nzIndex) = n_b(nzIndex) + B(nzIndex);
%Using comb3 to find where is the best configuration, is in
%W_legali or in W_nLegali?
combinazione = comb.*logical(comb3==1) + comb2.*logical(comb3==2);
W_ottimo = W(sub2ind(size(W),[1:size(W,1)],combinazione'))';
W_tot_migl(:,i) = W_ottimo;
FunzObb(:,i) = z_ott;
[PesoCestelli] = Simulazione_GenerazioneNumeriCasuali (PianoSperimentale,NumeroCestelli,NumeroEsperimenti,Alfa);
[PesoIncertezza_2] = Simulazione_GenerazioneIncertezza (NumeroCestelli,NumeroEsperimenti,IncertezzaCella,PesoCestelli);
PesoIncertezza(MatriceCombinazioni(combinazione,:)~=0) = PesoIncertezza_2(MatriceCombinazioni(combinazione,:)~=0); %updating just the hoppers that has been discharged
end
When you see repmat you should think bsxfun. For example, replace:
Media = (repmat(sum(W_tot_migl,2),1,size(MatriceCombinazioni,1))+W_legali) / ...
(size(W_tot_migl,2)+1);
with
Media = bsxfun(@plus,sum(W_tot_migl,2),W_legali) / ...
(size(W_tot_migl,2)+1);
The purpose of bsxfun is to do a virtual "singleton expansion" like repmat, without actually replicating the array into a matrix of the same size as W_legali.
Also note that in the above code, sum(W_tot_migl,2) is computed twice. There are other small optimizations, but changing to bsxfun should give you a good improvement.
The values of 1./rho_b_legale are effectively computed three times. Store this quotient matrix.
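For instance, a sketch against the variable names in the question's loop:
invRho = 1./rho_b_legale; % compute the reciprocal once...
[W_legali_migl,comb] = min(C_u .* Media .* invRho + invRho .* c_0 + (c_1/i) .* invRho,[],2); % ...and reuse it three times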
I'm working on a function with three nested for loops that is way too slow for its intended use. The bottleneck is clearly the looping part - almost 100% of the execution time is spent in the innermost loop.
The function takes a 2d matrix called rM as input and returns a 3d matrix called ec:
rows = size(rM, 1);
cols = size(rM, 2);
%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
for risk = minRisk:stepRisk:maxRisk;
for c = 1:cols,
for r = 2:rows+1,
ec(r, c, risk) = ec(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
Any help on speeding up the for loops would be appreciated...
The problem is that the inner loop is the slowest, while it is also near-impossible to vectorize, as every iteration directly depends on the previous one.
The outer two are possible:
clc;
rM = rand(50);
rows = size(rM, 1);
cols = size(rM, 2);
minRisk = 1;
stepRisk = 1;
maxRisk = 100;
numRiskLevels = maxRisk/stepRisk;
%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
riskArray = (minRisk:stepRisk:maxRisk)';
tic
for r = 2:rows+1
tmp = riskArray * rM(r-1, :);
tmp = permute(tmp, [3 2 1]);
ec(r, :, :) = ec(r-1, :, :) .* (1 + tmp);
end
toc
%preallocate.
ec2 = zeros(rows+1, cols, numRiskLevels);
ec2(1, :, :) = 100;
tic
for risk = minRisk:stepRisk:maxRisk;
for c = 1:cols
for r = 2:rows+1
ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
toc
all(all(all(ec == ec2)))
But to my surprise, the vectorized code is indeed slower. (Maybe someone can improve on it, so I figured I would leave it here for you.)
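In that spirit, one further idea, as an untimed sketch: the recurrence ec(r,c,k) = ec(r-1,c,k) * (1 + risk*rM(r-1,c)) unrolls into a cumulative product, so the loop over r can be removed entirely -
growth = 1 + bsxfun(@times, permute(riskArray,[3 2 1]), rM); % rows x cols x risk levels
ec3 = 100 * cat(1, ones(1,cols,numRiskLevels), cumprod(growth,1)); % prepend the row of 100s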
I have just tried to vectorize the outer loop, and actually noticed a significant speed increase. Of course it is hard to judge the speed of a script without knowing (the size of) the inputs, but I would say this is a good starting point:
% Here you can change the input parameters
riskVec = 1:3:120;
rM = rand(50);
%preallocate and calculate non vectorized solution
ec2 = zeros(size(rM,1)+1, size(rM,2), max(riskVec));
ec2(1, :, :) = 100;
tic
for risk = riskVec
for c = 1:size(rM,2)
for r = 2:size(rM,1)+1
ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
t1=toc;
%preallocate and calculate vectorized solution
ec = zeros(size(rM,1)+1, size(rM,2), max(riskVec));
ec(1, :, :) = 100;
tic
for c = 1:size(rM,2)
for r = 2:size(rM,1)+1
ec(r, c, riskVec) = ec(r-1, c, riskVec) .* reshape(1 + riskVec * rM(r-1, c),[1 1 length(riskVec)]);
end
end
t2=toc;
% Check whether the vectorization is done correctly and show the timing results
if isequal(ec, ec2)
t1
t2
end
The given output is:
t1 =
0.1288
t2 =
0.0408
So for this riskVec and rM it is about 3 times as fast as the non-vectorized solution.