MATLAB Speed Optimisation - performance

Can anyone help? I am a fairly experienced Matlab user but am having trouble speeding up the code below.
The fastest time I have been able to achieve for one run through all three loops, using 12 cores, is ~200s. The actual function will be called ~720 times and at this rate will take over 40 hours to execute. According to the Matlab profiler, the majority of CPU time is spent in the exponential function call. I've managed to speed this up quite substantially using a gpuArray and then running the exp call on a Quadro 4000 graphics card, however this then prevents the parfor loop from being used, since the workstation has only one graphics card, which obliterates any gains. Can anyone help, or is this code close to the optimum that can be achieved using Matlab? I have written a very crude C++ implementation with OpenMP but achieved little gain.
Many thanks in advance
function SPEEDtest_CPU
% Variable setup:
% - For testing I'll use random variables. These will actually be fed into
% the function for the real version of this code.
sy = 320;
sx = 100;
sz = 32;
A = complex(rand(sy,sx,sz),rand(sy,sx,sz));
B = complex(rand(sy,sx,sz),rand(sy,sx,sz));
C = rand(sy,sx);
D = rand(sy*sx,1);
F = zeros(sy,sx,sz);
x = rand(sy*sx,1);
y = rand(sy*sx,1);
x_ind = (1:sx) - (sx / 2) - 1;
y_ind = (1:sy) - (sy / 2) - 1;
% MAIN LOOPS
% - In the real code this set of three loops will be called ~720 times!
% - Using 12 cores, the fastest I have managed is ~200 seconds for one
% call of this function.
tic
for z = 1 : sz
    A_slice = A(:,:,z);
    A_slice = A_slice(:);
    parfor cx = 1 : sx
        for cy = 1 : sy
            E = ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
            F(cy,cx,z) = (B(cy,cx,z) .* exp(-1i .* E))' * A_slice;
        end
    end
end
toc
end

Some things to think about:
Have you considered using singles? (A sketch follows after this list.)
Can you vectorize the cx, cy portion so that they represent array operations?
Consider changing the floating point rounding or signalling modes.
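On the first point, here is a minimal sketch of what switching to single precision might look like for the test setup from the question. It assumes the real inputs can tolerate the reduced precision; the variable names are the question's own.
A = complex(single(rand(sy,sx,sz)), single(rand(sy,sx,sz)));
B = complex(single(rand(sy,sx,sz)), single(rand(sy,sx,sz)));
C = single(rand(sy,sx));
D = single(rand(sy*sx,1));
x = single(rand(sy*sx,1));
y = single(rand(sy*sx,1));
F = zeros(sy, sx, sz, 'single');  % keep F single too, so nothing is silently upcast to double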

If your data are real (not complex), as in your example, you can save time replacing
(B(cy,cx,z) .* exp(-1i .* E))'
by
(B(cy,cx,z) .* (cos(E)+1i*sin(E))).'
Specifically, on my machine (cos(x)+1i*sin(x)).' takes 19% less time than exp(-1i .* x)'.
If A and B are complex: E is still real, so you can precompute Bconj = conj(B) outside the loops (this takes about 10 ms with your data size, and it's done only once) and then replace
(B(cy,cx,z) .* exp(-1i .* E))'
by
(Bconj(cy,cx,z) .* (cos(E)+1i*sin(E))).'
to obtain a similar gain.
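As a rough, hedged micro-benchmark of the exp versus cos/sin replacement (timeit is a standard MATLAB function; the relative gain will vary by machine and MATLAB version):
E = rand(320*100, 1);                      % a real phase vector of the same length as in the question
t_exp = timeit(@() exp(-1i .* E)')         % complex exponential, conjugate-transposed
t_cos = timeit(@() (cos(E)+1i*sin(E)).')   % equivalent for real E, plain transpose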

There are two main ways of speeding up MATLAB code: preallocation and vectorisation.
You have preallocated well, but there is no vectorisation. In order to learn how to do this you need a good grasp of linear algebra and of using repmat to expand vectors into multiple dimensions.
Vectorisation can result in multiple orders of magnitude speedup and will use the cores optimally (provided multithreading is enabled).
What is the mathematical expression you are calculating? I may be able to lend a hand.

You can move x .* x_ind(cx) out of the innermost loop. I don't have a GPU handy to test the timings, but you could split the code into three sections to allow you to use the GPU and parfor
for z = 1 : sz
    E = zeros(sy*sx, sx, sy);
    A_slice = A(:,:,z);
    A_slice = A_slice(:);
    parfor cx = 1 : sx
        temp = x .* x_ind(cx);
        for cy = 1 : sy
            E(:, cx, cy) = temp + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
        end
    end
    temp = zeros(sy*sx, sx, sy);
    for cx = 1 : sx
        for cy = 1 : sy
            % Ideally use your GPU magic here
            temp(:, cx, cy) = exp(-1i .* E(:, cx, cy));
        end
    end
    parfor cx = 1 : sx
        for cy = 1 : sy
            F(cy,cx,z) = (B(cy,cx,z) .* temp(:, cx, cy))' * A_slice;
        end
    end
end

To allow for proper parallelization you need to make sure the loops are completely independent, so check whether not assigning to E in each run helps.
Furthermore, try to vectorize as much as possible; one simple example is y .* y_ind(cy).
If you just create the proper values for all cy at once, you can take this out of the lowest loop (see the sketch below).
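A small sketch of that suggestion, in the question's own variables (Y_all and X_all are hypothetical names): the products for every cy and cx can be formed once as outer products, and the inner loop then only indexes columns.
Y_all = y * y_ind;   % (sy*sx) x sy, column cy equals y .* y_ind(cy)
X_all = x * x_ind;   % (sy*sx) x sx, column cx equals x .* x_ind(cx)
% inside the loops:  E = X_all(:,cx) + Y_all(:,cy) + C(cy,cx) .* D;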

Not sure if it helps much with speed, but since E is basically a sum, you can use the fact that x_ind increases by 1 between consecutive cx, so exp(-1i .* x .* x_ind(cx+1)) = exp(-1i .* x .* x_ind(cx)) .* exp(-1i .* x), and exp(-1i .* x) can be calculated beforehand.
That way you wouldn't have to evaluate exp in each iteration, but would only have to multiply, which should be faster.
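A sketch of that recurrence using the question's variables (phase and phase_step are hypothetical names; it assumes x_ind steps by 1, as in the question's setup):
phase_step = exp(-1i .* x);            % constant ratio between consecutive cx values
phase = exp(-1i .* x .* x_ind(1));     % phase for cx = 1
for cx = 1 : sx
    % ... use 'phase' here wherever exp(-1i .* x .* x_ind(cx)) is needed ...
    phase = phase .* phase_step;       % advance to cx + 1
end
Note that this introduces a dependency between consecutive cx iterations, so it fits a plain for loop rather than parfor.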

In addition to the other good advice given here by others, the multiplication by A_slice is independent of the cx,cy loops and can be taken outside them, multiplying by A_slice once both loops have finished.
Similarly, the conjugation of B*exp(...) can also be done in bulk outside the cx,cy loops, before multiplication by A_slice (a sketch follows below).
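Putting those two ideas together, a hedged sketch (it assumes implicit expansion, i.e. R2016b or later, and that a (sy*sx)-by-sy complex temporary of roughly 160 MB per worker is acceptable at the stated sizes; Bconj, YY, E_mat and Fz are hypothetical names):
Bconj = conj(B);                         % conjugate B once, outside all loops
YY = y * y_ind;                          % (sy*sx) x sy, column cy holds y .* y_ind(cy)
for z = 1 : sz
    A_slice = reshape(A(:,:,z), [], 1);
    Fz = complex(zeros(sy, sx));
    parfor cx = 1 : sx
        % E_mat(:,cy) = x*x_ind(cx) + y*y_ind(cy) + C(cy,cx)*D, built by implicit expansion
        E_mat = x .* x_ind(cx) + YY + D * C(:,cx).';          % (sy*sx) x sy, real
        % one matrix-vector product replaces the whole cy loop
        Fz(:,cx) = Bconj(:,cx,z) .* (exp(1i .* E_mat).' * A_slice);
    end
    F(:,:,z) = Fz;
end
Here exp(1i .* E_mat) is the conjugate of exp(-1i .* E), which together with Bconj reproduces the original conjugate transpose, so the result should match the question's F up to floating-point rounding.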

This line: ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
is some type of convolution, is it not? Circular convolution is much faster in the frequency domain, and the conversion to/from the frequency domain is optimized using the FFT.

Related

Efficiency of diag() - MATLAB

Motivation:
In writing out a matrix operation that was to be performed over tens of thousands of vectors I kept coming across the warning:
Requested 200000x200000 (298.0GB) array exceeds maximum array size
preference. Creation of arrays greater than this limit may take a long
time and cause MATLAB to become unresponsive. See array size limit or
preference panel for more information.
The reason for this was my use of diag() to get the values down the diagonal of a matrix product. Because MATLAB is generally optimized for vector/matrix operations, when I first write code I usually go for the vectorized form. In this case, however, MATLAB has to build the entire matrix in order to get the diagonal, which causes the memory and speed issues.
Experiment:
I decided to test the use of diag() vs a for loop to see if at any point it was more efficient to use diag():
num = 200000; % Matrix dimension
x = ones(num, 1);
y = 2 * ones(num, 1);
% z = diag(x*y'); % Expression to solve
% Loop approach
tic
z = zeros(num,1);
for i = 1 : num
    z(i) = x(i)*y(i);
end
loopTime = toc;
% Dividing the too-large matrix into process-able chunks
fraction = [10, 20, 50, 100, 500, 1000, 5000, 10000, 20000];
chunkTime = zeros(size(fraction));
for k = 1 : length(fraction)
    f = fraction(k);
    % Operation to time
    tic
    z = zeros(num,1);
    for i = 1 : f
        first = (i-1) * (num / f);
        last = first + (num / f);
        z(first + 1 : last) = diag(x(first + 1 : last) * y(first + 1 : last)');
    end
    chunkTime(k) = toc;
end
% Plot results
figure;
hold on
plot(log10(fraction), log10(chunkTime));
plot(log10(fraction), repmat(log10(loopTime), 1, length(fraction)));
plot(log10(fraction), log10(chunkTime), 'g*'); % Plot points along time
legend('Partitioned Running Time', 'Loop Running Time');
xlabel('Log_{10}(Fractional Size)'), ylabel('Log_{10}(Running Time)'), title('Running Time Comparison');
This is the result of the test:
(NOTE: The red line represents the loop time as a threshold--it's not to say that the total loop time is constant regardless of the number of loops)
From the graph it is clear that the operation has to be broken down into chunks of roughly 200x200 before diag becomes faster than performing the same operation with a loop.
Question:
Can someone explain why I'm seeing these results? Also, I would think that with MATLAB's ever-more optimized design, there would be built-in handling of these massive matrices within a diag() function call. For example, it could just perform the i = j indexed operations. Is there a particular reason why this might be prohibitive?
I also haven't really thought of memory implications for diag using the partition method, although it's clear that as the partition size decreases, memory requirements drop.
Test of speed of diag vs. a loop.
Initialization:
n = 10000;
M = randn(n, n); %create a random matrix.
Test speed of diag:
tic;
d = diag(M);
toc;
Test speed of loop:
tic;
d = zeros(n, 1);
for i = 1:n
    d(i) = M(i,i);
end
toc;
This would test diag. Your code is not a clean test of diag...
Comment on where there might be confusion:
diag only extracts the diagonal of a matrix. If x and y are vectors and you do d = diag(x * y'), MATLAB first constructs the n-by-n matrix x*y' and then calls diag on that. This is why you get the "array exceeds maximum array size" error: the MATLAB interpreter does not optimize across the expression, realize that you only want the diagonal, and construct just the vector rather than the full x*y' matrix.
Not sure if you're asking this, but the fastest way to calculate d = diag(x*y') where x and y are n by 1 vectors would simply be: d = x.*y
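A hedged sketch comparing the two at a size where the full outer product still fits in memory (timeit is a standard MATLAB function; times will vary by machine):
n = 5000;                        % small enough that the n-by-n intermediate fits in memory
x = ones(n,1);  y = 2*ones(n,1);
t_diag = timeit(@() diag(x*y'))  % builds the full n-by-n matrix first, then discards most of it
t_vec  = timeit(@() x.*y)        % element-wise product, no intermediate matrix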

MATLAB: readable code vs optimized code

So, I want to know if making the code easier to read slows performance in Matlab.
function V = example(t, I)
a = 10;
b = 20;
c = 0.5;
V = zeros(1, length(t));
V(1) = 0;
delta_t = t(2) - t(1);
for i=1:length(t)-1
    V(i+1) = V(i) + delta_t*feval(@V_prime, a, b, c, t(i));
end
So, this function is just an example of Euler's method. The idea is that I name the constants a, b, c and define a separate function for the derivative. This basically makes the code easier to read. What I want to know is whether declaring a, b, c slows down my code. Also, for performance, would it be better to put the equation of the derivative (V_prime) directly into the loop instead of calling a function?
Following this mindset the code would look something like this.
function V = example(t, I)
V = zeros(1, length(t));
V(1) = 0;
delta_t = t(2) - t(1);
for i=1:length(t)-1
    V(i+1) = V(i) + delta_t*(((10 + t(i)*3)/20)+0.5);
end
Also, from what I've read, Matlab performs better when the code is vectorized; would that be the case in my code?
EDIT:
So, here is my actual code that I am working on:
function [V, u] = Izhikevich_CA1_Imp(t, I_amp, t_inj)
vr = -61.8; % resting potential (mV)
vt = -57.0; % threshold potential (mV)
c = -65.8; % reset membrane potential (mV)
vpeak = 22.6; % membrane voltage cutoff
khigh = 3.3; % nS/mV
klow = 0.1; % nS/mV
C = 115; % Membrane capacitance (pF)
a = 0.0012; % 1/ms
b = 3; % nS
d = 10; % pA
V = zeros(1, length(t));
V(1) = vr; u = 0; % initial values
span = length(t)-1;
delta_t = t(2) - t(1);
for i=1:span
    if (V(i) <= vt)
        k = klow;
    else
        k = khigh;
    end
    if ((t(i) >= t_inj(1)) && (t(i) <= t_inj(2)))
        I_inj = I_amp;
    else
        I_inj = 0;
    end
    V(i+1) = V(i) + delta_t*((k*(V(i)-vr)*(V(i)-vt)-u(i)+I_inj)/C);
    u(i+1) = u(i) + delta_t*(a*(b*(V(i)-vr)-u(i)));
    if (V(i+1) >= vpeak)
        V(i+1) = c;
        V(i) = vpeak;
        u(i+1) = u(i+1) + d;
    end
end
plot(t,V);
Since I didn't have any training in Matlab (I learned by trying and failing), I have a C mindset of programming, and from what I understand, Matlab code should be vectorized.
Eventually I will start working with bigger functions, so performance will be a concern. Now my goal is to vectorize this code.
Usually it is faster.
Especially if you replace looped function calls (like plot()), you will see a significant increase in performance.
In one of my past projects I had to optimize a program that was written using regular procedural constructs (for, while, etc.). Using vectorization, I reached a 10 times increase in performance, which is quite notable.
I would suggest using vectorisation instead of loops most of the time.
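For the first (toy) example in the question, the derivative does not depend on V, so the Euler update can be vectorised away entirely with cumsum. A sketch, assuming t is a row vector of uniformly spaced times, as in the question:
delta_t = t(2) - t(1);
f = @(tt) ((10 + tt*3)/20) + 0.5;           % the inlined derivative from the question
V = [0, cumsum(delta_t * f(t(1:end-1)))];   % V(i+1) = V(i) + delta_t*f(t(i)) as a running sum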
In Matlab you should basically forget the mindset coming from low-level C programming.
In my experience, the first rule for achieving performance in Matlab is to avoid loops and use built-in vectorized functions as much as possible. In general, you should try to avoid direct access to individual array elements like array(i).
Implementing your own ODE solver inevitably leads to very slow execution, because in that case there is really no way to avoid the things mentioned above, even if your implementation is fine per se (as yours is). I strongly advise relying on Matlab's ODE solvers, which are highly optimized blocks of compiled code and much faster than any interpreted Matlab code you can write (a sketch follows below).
In my opinion this goes along with readability of the code as well, at least for the trivial reason that you get shorter code... but I guess it is also a matter of personal taste.
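A minimal sketch of that advice applied to the question's first example (the derivative is the question's inlined version; the time span and initial value are assumptions for illustration). Note that the actual Izhikevich model has spike resets, so it would need an events function or a fixed-step scheme rather than a plain ode45 call.
V_prime = @(t, V) ((10 + t*3)/20) + 0.5;    % derivative from the toy example (independent of V)
tspan = [0 10];                             % hypothetical time span
[t_out, V_out] = ode45(V_prime, tspan, 0);  % 0 is the initial value V(1) from the question
plot(t_out, V_out);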

Matlab parfor, cannot run "due to the way P is used"

I have a quite time-consuming task that I perform in a for loop. Each iteration is completely independent of the others, so I figured I would use a parfor loop and benefit from the cores of my i7 machine.
The serial loop is:
for i=1 : size(datacoord,1)
    %P matrix: person_number x z or
    P(i,1) = datacoord(i,1); %pn
    P(i,4) = datacoord(i,5); %or
    P(i,3) = predict(Barea2, datacoord(i,4)); %distance (z)
    dist = round(P(i,3)); %round the distance to get how many cells
    x = ceil(datacoord(i,2) / (im_w / ncell(1,dist)));
    P(i,2) = pos(dist, x); %x
end
Reading around about parfor, the only doubt I had is that I use dist and x as indices which are calculated inside the loop; I heard that this could be a problem.
The error I get from Matlab is about the way the P matrix is used, though. How is that? If I remember correctly from my parallel computing courses, and I interpret the parfor documentation correctly, this should work by just switching the for with the parfor.
Any input would be greatly appreciated, thanks!
Unfortunately, in a PARFOR loop, 'sliced' variables such as you'd like P to be cannot be indexed in multiple different ways. The simplest solution is to build up a single row, and then make a single assignment into P, like this:
parfor i=1 : size(datacoord,1)
    %P matrix: person_number x z or
    P_tmp = NaN(1, 4);
    P_tmp(1) = datacoord(i,1); %pn
    P_tmp(4) = datacoord(i,5); %or
    P_tmp(3) = predict(Barea2, datacoord(i,4)); %distance (z)
    dist = round(P_tmp(3)); %round the distance to get how many cells
    x = ceil(datacoord(i,2) / (im_w / ncell(1,dist)));
    P_tmp(2) = pos(dist, x); %x
    P(i, :) = P_tmp;
end

Optimizing a program by vectorized notation

Hi all, I am working on image processing and have written a short piece of code in MATLAB. The code is quite slow.
I am giving my code snippet here:
for i=1:10
    % find c1, c2, c3
    % c1, c2 and c3 change at each iteration
    u = (1./((abs(P-c1))^m) + 1./((abs(P-c2))^m) + 1./((abs(P-c3))^m));
    u1 = 1./((abs(P-c1))^m)./u;
    u2 = 1./((abs(P-c2))^m)./u;
    u3 = 1./((abs(P-c3))^m)./u;
end
Let me explain the variables here:
P,u,u1,u2 and u3 are all matrices of size 512x512
c1,c2 and c3 are constants of dimension 1x1
m is a constant with value = 2
I want to repeat these operations in a loop (say 10 times). However, my code is quite slow.
The results of the profiler are given below :
The total running time of the program was 4.6 secs. However, the four steps listed above themselves take about 80% of the time.
So I wanted to make my code run faster.
MY FIRST EDIT
My changed code snippet
for i=1:10
    % find c1 and c2
    % c1 and c2 change at each iteration
    a=((abs(P-c1))^m);
    b=((abs(P-c2))^m);
    c=((abs(P-c3))^m);
    x=1./a; y=1./b; z=1./c;
    u = (x + y + z);
    u1 = x./u;
    u2 = y./u;
    u3 = z./u;
end
Now the program computes in 2.47 seconds computation time for the above steps are given below:
So this is way much more faster than my first method.
2nd edit
for i=1:10
    % find c1, c2, c3
    % c1, c2 and c3 change at each iteration
    a=(P-c1).*(P-c1);
    b=(P-c2).*(P-c2);
    c=(P-c3).*(P-c3);
    x=1./a; y=1./b; z=1./c;
    u = (x + y + z);
    u1 = x./u;
    u2 = y./u;
    u3 = z./u;
end
Now the program computes in 0.808 seconds.
The four steps described above now compute very quickly.
I am sure it can be made even faster. Can you guys please help me to further optimize my code.
It would be extremely helpful for matrices larger than 512, such as 1024, 2048 or beyond.
Thanks in advance.
Your current code is:
a=((abs(P-c1))^m);
b=((abs(P-c2))^m);
c=((abs(P-c3))^m);
x=1./a; y=1./b; z=1./c;
u = (x + y + z);
u1 = x./u;
u2 = y./u;
u3 = z./u;
Firstly, realize that the absolute value function is multiplicative. So |AB| = |A|x|B|. Now, abs(P-C1)^m is equivalent to abs( (P-C1)^m ).
Just a preliminary glance at it suggests that some of the computation in the bottleneck can be reused. Specifically, since c1,c2 and c3 are constants, the computation can be sped up a little bit if you try to reuse them (at the expense of additional memory).
temp_P2 = P*P;
temp_PCA = P*ones(size(P));
temp_PCB = ones(size(P))*P;
a = abs(temp_P2 - c1*temp_PCA - c1*temp_PCB + c1^2 * length(P))
The computation of temp_PCA and temp_PCB can also be avoided, since multiplication by an all-ones matrix always amounts to the construction of a rank-1 matrix with either constant rows or columns (see the sketch below).
I don't claim that any of these modifications will speed up your code but they are definitely worth trying.
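A small sketch of that last remark, under the same caveat that it may or may not help in practice: multiplying by an all-ones matrix just replicates row or column sums, so temp_PCA and temp_PCB can be formed without a full matrix multiplication.
temp_PCA = repmat(sum(P, 2), 1, size(P, 2));   % same result as P * ones(size(P)): every column equals the row sums
temp_PCB = repmat(sum(P, 1), size(P, 1), 1);   % same result as ones(size(P)) * P: every row equals the column sums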
The first suggestion is:
if m = 2 and it is not changing, why don't you try these alternatives:
A*A
and if m = 2, do you really need abs at all?
As for the part where you do
1./a
it is faster than
a.^(-1)
so I don't see any better option for that part.
Another thing you can try is this. Instead of:
x=1./a; y=1./b; z=1./c;
u = (x + y + z);
u1 = x./u;
u2 = y./u;
u3 = z./u;
You can have this:
u = (x + y + z);
u1 = 1./(a.*u);
u2 = 1./(b.*u);
u3 = 1./(c.*u);
This way I guess it is a little bit faster by removing 3 variables, but the code becomes less readable.
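A hedged micro-benchmark sketch of the two variants above, using dummy data of the stated size (timeit is a standard MATLAB function; which form wins may depend on machine and MATLAB version):
P = rand(512); c1 = 0.2; c2 = 0.5; c3 = 0.8;               % dummy values, just for timing
a = (P-c1).*(P-c1); b = (P-c2).*(P-c2); c = (P-c3).*(P-c3);
x = 1./a; y = 1./b; z = 1./c;
u = x + y + z;
t_orig    = timeit(@() {x./u, y./u, z./u})                  % original form: three divisions
t_variant = timeit(@() {1./(a.*u), 1./(b.*u), 1./(c.*u)})   % variant without x, y, z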

Vectorizing three for loops

I'm quite new to Matlab and I need help in speeding up some part of my code. I am writing a Matlab application that performs 3D matrix convolution but unlike in standard convolution, the kernel is not constant, it needs to be calculated for each pixel of an image.
So far, I have ended up with a working code, but incredibly slow:
function result = calculateFilteredImages(images, T)
% images - matrix [480,360,10] of 10 grayscale images of height=480 and width=360
% represented as a value in a range [0..1]
% i.e. images(10,20,5) = 0.1231;
% T - some matrix [480,360,10, 3,3] of double values, calculated earlier
kerN = 5; %kernel size
mid=floor(kerN/2); %half the kernel size
offset=mid+1; %kernel offset
[h,w,n] = size(images);
%add padding so as not to get IndexOutOfBoundsEx during summation:
%[i.e. changes [1 2 3...10] to [0 0 1 2 ... 10 0 0]]
images = padarray(images,[mid, mid, mid]);
result(h,w,n)=0; %preallocate, faster than zeros(h,w,n)
kernel(kerN,kerN,kerN)=0; %preallocate
% the three parameters below are not important in this problem
% (are used to calculate sigma in x,y,z direction inside the loop)
sigMin=0.5;
sigMax=3;
d = 3;
for a=1:n
    tic;
    for b=1:w
        for c=1:h
            M(:,:) = T(c,b,a,:,:); % M is now a 3x3 matrix
            [R, D] = eig(M); % get eigenvectors and eigenvalues - R and D are now 3x3 matrices
            % eigenvalues
            l1 = D(1,1);
            l2 = D(2,2);
            l3 = D(3,3);
            sig1 = sig( l1 , sigMin, sigMax, d);
            sig2 = sig( l2 , sigMin, sigMax, d);
            sig3 = sig( l3 , sigMin, sigMax, d);
            % calculate kernel
            for i=-mid:mid
                for j=-mid:mid
                    for k=-mid:mid
                        x_new = [i,j,k] * R; % calculate new [i,j,k]
                        kernel(offset+i, offset+j, offset+k) = exp(- (((x_new(1))^2 )/(sig1^2) + ((x_new(2))^2)/(sig2^2) + ((x_new(3))^2)/(sig3^2)) /2);
                    end
                end
            end
            % normalize
            kernel = kernel/sum(kernel(:));
            % perform summation
            xm_sum = 0;
            for i=-mid:mid
                for j=-mid:mid
                    for k=-mid:mid
                        xm_sum = xm_sum + kernel(offset+i, offset+j, offset+k) * images(c+mid+i, b+mid+j, a+mid+k);
                    end
                end
            end
            result(c,b,a) = xm_sum;
        end
    end
    toc;
end
end
I tried replacing the "calculating kernel" part with
sigma=[sig1 sig2 sig3]
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
k2 = arrayfun(@(x, y, z) exp(-(norm([x,y,z]*R./sigma)^2)/2), x,y,z);
but it turned out to be even slower than the loop. I went through several articles and tutorials on vectorization but I'm quite stuck with this one.
Can it be vectorized or somehow sped up using something else?
I'm new to Matlab; maybe there are some built-in functions that could help in this case?
Update
The profiling result:
Sample data which was used during profiling:
T.mat
grayImages.mat
As Dennis noted, this is a lot of code; cutting it down to the minimal piece that the profiler shows is slow will help. I'm not sure if my code is equivalent to yours, can you try it and profile it? The 'trick' to Matlab vectorization is using .* and .^, which operate element-by-element instead of requiring loops. http://www.mathworks.com/help/matlab/ref/power.html
Take your rewritten part:
sigma=[sig1 sig2 sig3]
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
k2 = arrayfun(@(x, y, z) exp(-(norm([x,y,z]*R./sigma)^2)/2), x,y,z);
And just pick one sigma for now. Looping over 3 different sigmas isn't a performance problem if you can vectorize the underlying k2 formula.
EDIT: Changed the matrix_to_norm code to be x(:), and no commas. See Generate all possible combinations of the elements of some vectors (Cartesian product)
Then try:
% R & mid my test variables
R = [1 2 3; 4 5 6; 7 8 9];
mid = 5;
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
% meshgrid is also a possibility, check that you are getting the order you want
% Going to break the equation apart for now for clarity
% Matrix operation, should already be fast.
matrix_to_norm = [x(:) y(:) z(:)]*R/sig1
% Row-wise norms: plain norm() here would return the matrix 2-norm, a single scalar
matrix_normed = sqrt(sum(matrix_to_norm.^2, 2))
% Note the .^ - I believe you want element-by-element exponentiation, this will
% vectorize it.
k2 = exp(-0.5*(matrix_normed.^2))
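Building on that, here is a hedged sketch of the fully vectorised kernel for one voxel, using all three sigmas from the question's inner loop (R, sig1..sig3, mid and kernel are the question's variables; the result should match the triple i,j,k loop up to floating-point rounding):
[ii, jj, kk] = ndgrid(-mid:mid, -mid:mid, -mid:mid);
X_new = [ii(:) jj(:) kk(:)] * R;                          % each row is one rotated [i j k]
expo  = X_new(:,1).^2 / sig1^2 + X_new(:,2).^2 / sig2^2 + X_new(:,3).^2 / sig3^2;
kernel = reshape(exp(-expo/2), size(ii));                 % back to kerN x kerN x kerN
kernel = kernel / sum(kernel(:));                         % normalise, as in the question
The second triple loop (the summation over the padded image patch) can then be written as a single dot product, e.g. kernel(:).' reshaped against the corresponding patch of images, instead of accumulating xm_sum element by element.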
