Optimizing a program by vectorized notation - performance

Hi all I am working on Image processing and have written a short piece of code in MATLAB. The code is quite slow.
I am giving my code snippet here
for i=1:10
//find c1,c2,c3
//c1 c2 and c3 change at each iteration
u = (1./((abs(P-c1))^m) + 1./((abs(P-c2))^m) + 1./((abs(P-c3))^m));
u1 = 1./((abs(P-c1))^m)./u;
u2 = 1./((abs(P-c2))^m)./u;
u3 = 1./((abs(P-c3))^m)./u;
end
Let me explain the variables here:
P,u,u1,u2 and u3 are all matrices of size 512x512
c1,c2 and c3 are constants of dimension 1x1
m is a constant with value = 2
I want to repeat this operations in a loop (say 10 times). However my code is quite slow.
The results of the profiler are given below :
The total running time of the program was 4.6 secs. However the four steps listed above itself takes abour 80% of the time.
So I wanted to make my code run faster.
MY FIRST EDIT
My changed code snippet
for i=1:10
//find c1 and c2
//c1 and c2 changes at each iteration
a=((abs(P-c1))^m);
b=((abs(P-c2))^m);
c=((abs(P-c3))^m);
x=1./a; y=1./b; z=1./c;
u = (x + y + z);
u1 = x./u;
u2 = y./u;
u3 = z./u;
end
Now the program computes in 2.47 seconds computation time for the above steps are given below:
So this is way much more faster than my first method.
2nd edit
for i=1:10
//find c1,c2,c3
//c1 c2 and c3 change at each iteration
a=(P-c1).*(P-c1);
b=(P-c2).*(P-c2);
c=(P-c3).*(P-c3);
x=1./a; y=1./b; z=1./c;
u = (x + y + z);
u1 = x./u;
u2 = y./u;
u3 = z./u;
end
Now the program computes in 0.808 seconds.
The four steps described above computes above very quickly.
I am sure it can be made even faster. Can you guys please help me to further optimize my code.
It would be extremely helpful for matrices larger size than 512 such as 1024 , 2048 or likewise.
Thanks in advance.

Your current code is:
a=((abs(P-c1))^m);
b=((abs(P-c2))^m);
c=((abs(P-c3))^m);
x=1./a; y=1./b; z=1./c;
u = (x + y + z);
u1 = x./u;
u2 = y./u;
u3 = z./u;
Firstly, realize that the absolute value function is multiplicative. So |AB| = |A|x|B|. Now, abs(P-C1)^m is equivalent to abs( (P-C1)^m ).
Just a preliminary glance at it suggests that some of the computation in the bottleneck can be reused. Specifically, since c1,c2 and c3 are constants, the computation can be sped up a little bit if you try to reuse them (at the expense of additional memory).
temp_P2 = P*P;
temp_PCA = P*ones(size(P));
temp_PCB = ones(size(P))*P;
a = abs(temp_P2 - c1*temp_PCA - c1*temp_PCB + c1^2 * length(P))
The computation of temp_PCA and temp_PCB can also be avoided since multiplication by a constant matrix always amounts to the construction of a rank 1 matrix with either constant rows or columns.
I don't claim that any of these modifications will speed up your code but they are definitely worth trying.

The first suggestion is:
if m = 2 and it is not changing, why you don't try these alternatives:
A*A
and if m = 2 then do you really need abs ?
this part that you are doing
1./a
is faster than
a.^(-1)
so I don't see any better option in this part.
Another thing you can try is this. instead of:
x=1./a; y=1./b; z=1./c;
u = (x + y + z);
u1 = x./u;
u2 = y./u;
u3 = z./u;
You can have this:
u = (x + y + z);
u1 = 1./(a.*u);
u2 = 1./(b.*u);
u3 = 1./(c.*u);
this way I guess it is a little bit faster by removing 3 variables. but the code becomes less readable.

Related

Writing a vector sum in MATLAB

Suppose I have a function phi(x1,x2)=k1*x1+k2*x2 which I have evaluated over a grid where the grid is a square having boundaries at -100 and 100 in both x1 and x2 axis with some step size say h=0.1. Now I want to calculate this sum over the grid with which I'm struggling:
What I was trying :
clear all
close all
clc
D=1; h=0.1;
D1 = -100;
D2 = 100;
X = D1 : h : D2;
Y = D1 : h : D2;
[x1, x2] = meshgrid(X, Y);
k1=2;k2=2;
phi = k1.*x1 + k2.*x2;
figure(1)
surf(X,Y,phi)
m1=-500:500;
m2=-500:500;
[M1,M2,X1,X2]=ndgrid(m1,m2,X,Y)
sys=#(m1,m2,X,Y) (k1*h*m1+k2*h*m2).*exp((-([X Y]-h*[m1 m2]).^2)./(h^2*D))
sum1=sum(sys(M1,M2,X1,X2))
Matlab says error in ndgrid, any idea how I should code this?
MATLAB shows:
Error using repmat
Requested 10001x1001x2001x2001 (298649.5GB) array exceeds maximum array size preference. Creation of arrays greater
than this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference
panel for more information.
Error in ndgrid (line 72)
varargout{i} = repmat(x,s);
Error in new_try1 (line 16)
[M1,M2,X1,X2]=ndgrid(m1,m2,X,Y)
Judging by your comments and your code, it appears as though you don't fully understand what the equation is asking you to compute.
To obtain the value M(x1,x2) at some given (x1,x2), you have to compute that sum over Z2. Of course, using a numerical toolbox such as MATLAB, you could only ever hope to compute over some finite range of Z2. In this case, since (x1,x2) covers the range [-100,100] x [-100,100], and h=0.1, it follows that mh covers the range [-1000, 1000] x [-1000, 1000]. Example: m = (-1000, -1000) gives you mh = (-100, -100), which is the bottom-left corner of your domain. So really, phi(mh) is just phi(x1,x2) evaluated on all of your discretised points.
As an aside, since you need to compute |x-hm|^2, you can treat x = x1 + i x2 as a complex number to make use of MATLAB's abs function. If you were strictly working with vectors, you would have to use norm, which is OK too, but a bit more verbose. Thus, for some given x=(x10, x20), you would compute x-hm over the entire discretised plane as (x10 - x1) + i (x20 - x2).
Finally, you can compute 1 term of M at a time:
D=1; h=0.1;
D1 = -100;
D2 = 100;
X = (D1 : h : D2); % X is in rows (dim 2)
Y = (D1 : h : D2)'; % Y is in columns (dim 1)
k1=2;k2=2;
phi = k1*X + k2*Y;
M = zeros(length(Y), length(X));
for j = 1:length(X)
for i = 1:length(Y)
% treat (x - hm) as a complex number
x_hm = (X(j)-X) + 1i*(Y(i)-Y); % this computes x-hm for all m
M(i,j) = 1/(pi*D) * sum(sum(phi .* exp(-abs(x_hm).^2/(h^2*D)), 1), 2);
end
end
By the way, this computation takes quite a long time. You can consider either increasing h, reducing D1 and D2, or changing all three of them.

Efficiently Calculate Frequency Averaged Periodogram Using GPU

In Matlab I am looking for a way to most efficiently calculate a frequency averaged periodogram on a GPU.
I understand that the most important thing is to minimise for loops and use the already built in GPU functions. However my code still feels relatively unoptimised and I was wondering what changes I can make to it to gain a better speed up.
r = 5; % Dimension
n = 100; % Time points
m = 20; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
Pdgm = zeros(r, r, n/2 + 1);
for j = 1:n/2 + 1
Pdgm(:,:,j) = FT(:,j)*FT(:,j)';
end
% Finally smooth with our weights
SmPdgm = zeros(r, r, n/2 + 1);
% Take advantage of the GPU filter function
% Create new Periodogram WrapPdgm with m/2 values wrapped around in front and
% behind it (it seems like there is redundancy here)
WrapPdgm = zeros(r,r,n/2 + 1 + m);
WrapPdgm(:,:,m/2+1:n/2+m/2+1) = Pdgm;
WrapPdgm(:,:,1:m/2) = flip(Pdgm(:,:,2:m/2+1),3);
WrapPdgm(:,:,n/2+m/2+2:end) = flip(Pdgm(:,:,n/2-m/2+1:end-1),3);
% Perform filtering
for i = 1:r
for j = 1:r
temp = filter(w, [1], WrapPdgm(i,j,:));
SmPdgm(i,j,:) = temp(:,:,m+1:end);
end
end
In particular, I couldn't see a way to optimise out the for loop when calculating the initial Pdgm from the Fourier transformed data and I feel the trick I play with the WrapPdgm in order to take advantage of filter() on the GPU feels unnecessary if there were a smooth function instead.
Solution Code
This seems to be pretty efficient as benchmark runtimes in the next section might convince us -
%// Select the portion of FT to be processed and
%// send copy to GPU for calculating everything
gFT = gpuArray(FT(:,1:n/2 + 1));
%// Perform non-smoothed Periodogram, thus removing the first loop
Pdgm1 = bsxfun(#times,permute(gFT,[1 3 2]),permute(conj(gFT),[3 1 2]));
%// Generate WrapPdgm right on GPU
WrapPdgm1 = zeros(r,r,n/2 + 1 + m,'gpuArray');
WrapPdgm1(:,:,m/2+1:n/2+m/2+1) = Pdgm1;
WrapPdgm1(:,:,1:m/2) = Pdgm1(:,:,m/2+1:-1:2);
WrapPdgm1(:,:,n/2+m/2+2:end) = Pdgm1(:,:,end-1:-1:n/2-m/2+1);
%// Perform filtering on GPU and get the final output, SmPdgm1
filt_data = filter(w,1,reshape(WrapPdgm1,r*r,[]),[],2);
SmPdgm1 = gather(reshape(filt_data(:,m+1:end),r,r,[]));
Benchmarking
Benchmarking Code
%// Input parameters
r = 50; % Dimension
n = 1000; % Time points
m = 200; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
tic, %// ... Code from original approach, toc
tic %// ... Code from proposed approach, toc
Runtime results thus obtained on GPU, GTX 750 Ti against CPU, I-7 4790K -
------------------------------ With Original Approach on CPU
Elapsed time is 0.279816 seconds.
------------------------------ With Proposed Approach on GPU
Elapsed time is 0.169969 seconds.
To get rid of the first loop you can do the following:
Pdgm_cell = cellfun(#(x) x * x', mat2cell(FT(:, 1 : 51), [5], ones(51, 1)), 'UniformOutput', false);
Pdgm = reshape(cell2mat(Pdgm_cell),5,5,[]);
Then in your filter you can do the following:
temp = filter(w, 1, WrapPdgm, [], 3);
SmPdgm = temp(:, :, m + 1 : end);
The 3 lets the filter know to operate along the 3rd dimension of your data.
You can use pagefun on the GPU for the first loop. (Note that the implementation of cellfun is basically a hidden loop, whereas pagefun runs natively on the GPU using a batched GEMM operation). Here's how:
n = 16;
r = 8;
X = gpuArray.rand(r, n);
R = gpuArray.zeros(r, r, n/2 + 1);
for jj = 1:(n/2+1)
R(:,:,jj) = X(:,jj) * X(:,jj)';
end
X2 = X(:,1:(n/2+1));
R2 = pagefun(#mtimes, reshape(X2, r, 1, []), reshape(X2, 1, r, []));
R - R2

Matlab parfor, cannot run "due to the way P is used"

I have a quite time consuming task that I perform in a for loop. Each iteration is completely independent from the others so I figured out to use the parfor loop and benefit from the i7 core of my machine.
The serial loop is:
for i=1 : size(datacoord,1)
%P matrix: person_number x z or
P(i,1) = datacoord(i,1); %pn
P(i,4) = datacoord(i,5); %or
P(i,3) = predict(Barea2, datacoord(i,4)); %distance (z)
dist = round(P(i,3)); %round the distance to get how many cells
x = ceil(datacoord(i,2) / (im_w / ncell(1,dist)));
P(i,2) = pos(dist, x); %x
end
Reading around about the parfor, the only doubt it had is that i use dist and x as indexes which are calculated inside the loop, i heard that this could be a problem.
The error I get from matlab is about the way P matrix is used though. How is it? If i remember correcly from my parallel computing courses and I interpret correcly the parfor documentation, this should work by just switching the for with the parfor.
Any input would be greatly appreciated, thanks!
Unfortunately, in a PARFOR loop, 'sliced' variables such as you'd like P to be cannot be indexed in multiple different ways. The simplest solution is to build up a single row, and then make a single assignment into P, like this:
parfor i=1 : size(datacoord,1)
%P matrix: person_number x z or
P_tmp = NaN(1, 4);
P_tmp(1) = datacoord(i,1); %pn
P_tmp(4) = datacoord(i,5); %or
P_tmp(3) = predict(Barea2, datacoord(i,4)); %distance (z)
dist = round(P_tmp(3)); %round the distance to get how many cells
x = ceil(datacoord(i,2) / (im_w / ncell(1,dist)));
P_tmp(2) = pos(dist, x); %x
P(i, :) = P_tmp;
end

MATLAB Speed Optimisation

Can anyone help? I am a fairly experienced Matlab user but am having trouble speeding up the code below.
The fastest time I have been able to achieve for one run through all three loops, using 12 cores, is ~200s. The actual function will be called ~720 times and at this rate will take over 40hrs to execute. According to the Matlab profiler, the majority of cpu time is spent in the exponential function call. I've managed to speed this up quite substantially using a gpuArray and then running the exp call on a Quadro 4000 graphics card however this then prevents the parfor loop from being used, since the workstation has only one graphics card, which obliterates any gains. Can anyone help, or is this code close to the optimum that can be achieved using Matlab? I have written a very crude c++ implementation with openMP but achieved little gain.
Many thanks in advance
function SPEEDtest_CPU
% Variable setup:
% - For testing I'll use random variables. These will actually be fed into
% the function for the real version of this code.
sy = 320;
sx = 100;
sz = 32;
A = complex(rand(sy,sx,sz),rand(sy,sx,sz));
B = complex(rand(sy,sx,sz),rand(sy,sx,sz));
C = rand(sy,sx);
D = rand(sy*sx,1);
F = zeros(sy,sx,sz);
x = rand(sy*sx,1);
y = rand(sy*sx,1);
x_ind = (1:sx) - (sx / 2) - 1;
y_ind = (1:sy) - (sy / 2) - 1;
% MAIN LOOPS
% - In the real code this set of three loops will be called ~720 times!
% - Using 12 cores, the fastest I have managed is ~200 seconds for one
% call of this function.
tic
for z = 1 : sz
A_slice = A(:,:,z);
A_slice = A_slice(:);
parfor cx = 1 : sx
for cy = 1 : sy
E = ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
F(cy,cx,z) = (B(cy,cx,z) .* exp(-1i .* E))' * A_slice;
end
end
end
toc
end
Some things to think about:
Have you considered using singles?
Can you vectorize the cx, cy portion so that they represent array operations?
Consider changing the floating point rounding or signalling modes.
If your data are real (not complex), as in your example, you can save time replacing
(B(cy,cx,z) .* exp(-1i .* E))'
by
(B(cy,cx,z) .* (cos(E)+1i*sin(E))).'
Specifically, on my machine (cos(x)+1i*sin(x)).' takes 19% less time than exp(-1i .* x)'.
If A and B are complex: E is still real, so you can precompute Bconj = conj(B) outside the loops (this takes about 10 ms with your data size, and it's done only once) and then replace
(B(cy,cx,z) .* exp(-1i .* E))'
by
(Bconj(cy,cx,z) .* (cos(E)+1i*sin(E))).'
to obtain a similar gain.
There are two main ways of speeding up MATLAB code; preallocation and vectorisation.
You have preallocated well but there is no vectorisation. In order to best learn how to do this you need to have a good grasp of linear algebra and the use of repmat to expand vectors into multiple dimensions.
Vectorisation can result in multiple orders of magnitude speedup and will use the cores optimally (provided the flag is up).
What is the mathematical expression you are calculating and I may be able to lend a hand?
You can move x .* x_ind(cx) out of the innermost loop. I don't have a GPU handy to test the timings, but you could split the code into three sections to allow you to use the GPU and parfor
for z = 1 : sz
E = zeros(sy*sx,sx,sy);
A_slice = A(:,:,z);
A_slice = A_slice(:);
parfor cx = 1 : sx
temp = ( x .* x_ind(cx) );
for cy = 1 : sy
E(:, cx, cy) = temp + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
end
end
temp = zeros(zeros(sy*sx,sx,sy));
for cx = 1 : sx
for cy = 1 : sy
% Ideally use your GPU magic here
temp(:, cx, cy) = exp(-1i .* E(:, cx, cy)));
end
end
parfor cx = 1 : sx
for cy = 1 : sy
F(cy,cx,z) = (B(cy,cx,z) .* temp(:, cx, cy)' * A_slice;
end
end
end
To allow for proper paralellization you need to make sure the loops are completely independant, hence check whether not assigning to E in each run helps.
Furthermore try to vectorize as much as possible, one simple example could be: y.*y_ind(cy)
If you just create the proper index for all values at once, you can take this out of the lowest loop.
Not sure if it helps much with speed - but as E is basically a sum maybe you can use that exp (i cx(A+1)x) = exp(i cx(A) x) * exp(i x) and exp(i x) can be calculated beforehand.
That way you wouldn't have to evaluate exp each iteration - but just have to multiplicate, which should be faster.
In addition to the other good advise given here by others, the multiplication by A_slice is independent of the cx,cy loops and can be taken outside them, multiplying F once both loops have finished.
Similarly, the conjugation of B*exp(...) can also be done en-bulk outside the cx,cy loop, before multiplication by A_slice.
This line: ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
is some type of convolution, is it not? Circular convolution is much faster in the frequency domain, and the conversion to/from frequency domain is optimized using the FTT.

Search for Optimal Point search algorithms

I have an objective function B(s,r,l), I calculated results for s=1,...,10, r=1,...,10 and l=0.1:0.1:10. for each s, I generated 10 by sizeof(l) matrices. I want to write a search code code such that it will return me minimum B values. In a more clear form
Minimize B(s1,r1,l1)+B(s2,r2,l2)+B(s3,r3,l3)
s.t s1+s2+s3 = 10
r1 +r2 +r3 = 15
l1+l2+l3 = 30
s1,s2,s3,r1,r2,r3,l1,l2,l3 >=0
s and r are integer.
What is the best search algorithm for the above problem?
I would suggest to make l3,r3,s3 dependent on the choice of another variables. For example, if l1 = 1 and l2 = 2 it implies that l3 = 30 - 1 - 2. So you have only 6 parameters left to search for.
Then you should use some kind of non-linear optimization method, like fminsearh. Define you functional as a function of these 6 parameters.
If your function is smooth, the integer solution should be near the real solution.
In order to treat the non-zero condition, you can simply give a huge error to any input that gives negative output.
So, your functional should be something like:
function d = f(l1,l2,s1,s2,r1,r2)
l3 = 30 - l1 - l2;
r3 = 15 - r1 - r2;
s3 = 10 - s1 - s2;
z = B(s1,r1,l1)+B(s2,r2,l2)+B(s3,r3,l3);
if z<0
d = 10^20;
else
d = z;
end
end
In the end, try to check all of the integer solutions - try to round each value to floor or ceil. There will be 2^6 possibilities.

Resources