Matlab parfor, cannot run "due to the way P is used" - performance

I have a fairly time-consuming task that I perform in a for loop. Each iteration is completely independent of the others, so I figured I could use a parfor loop and benefit from the multiple cores of my i7 machine.
The serial loop is:
for i=1 : size(datacoord,1)
%P matrix: person_number x z or
P(i,1) = datacoord(i,1); %pn
P(i,4) = datacoord(i,5); %or
P(i,3) = predict(Barea2, datacoord(i,4)); %distance (z)
dist = round(P(i,3)); %round the distance to get how many cells
x = ceil(datacoord(i,2) / (im_w / ncell(1,dist)));
P(i,2) = pos(dist, x); %x
end
From what I have read about parfor, my only doubt was that I use dist and x as indexes, and both are calculated inside the loop; I heard that this could be a problem.
The error I get from MATLAB, though, is about the way the P matrix is used. Why is that? If I remember correctly from my parallel computing courses and I am interpreting the parfor documentation correctly, this should work by just replacing for with parfor.
Any input would be greatly appreciated, thanks!

Unfortunately, in a PARFOR loop, 'sliced' variables such as you'd like P to be cannot be indexed in multiple different ways. The simplest solution is to build up a single row, and then make a single assignment into P, like this:
parfor i=1 : size(datacoord,1)
%P matrix: person_number x z or
P_tmp = NaN(1, 4);
P_tmp(1) = datacoord(i,1); %pn
P_tmp(4) = datacoord(i,5); %or
P_tmp(3) = predict(Barea2, datacoord(i,4)); %distance (z)
dist = round(P_tmp(3)); %round the distance to get how many cells
x = ceil(datacoord(i,2) / (im_w / ncell(1,dist)));
P_tmp(2) = pos(dist, x); %x
P(i, :) = P_tmp;
end


Matlab vectorization of for loops

Is there any way to vectorize such a for loop in MATLAB? It's taking a lot of time to execute.
for i = 1:numberOfFrames-1
frameDifferencesEroded(:,:,i+1) = imabsdiff(frameDifferencesErodedTemp(:,:,i+1),frameDifferencesErodedTemp(:,:,1));
for k=1:numel(frameDifferences(1,:,i))
for m=1:numel(frameDifferences(:,1,i))
if frameDifferencesEroded(m,k,i+1) > thresold
frameDifferences(m,k,i+1) = 255;
else
frameDifferences(m,k,i+1) = 0;
end
end
end
end
Assuming you want frameDifferencesEroded(:,:,1) and frameDifferences(:,:,1) to be all zeros, as you are not inputting values into those with your code, this might work for you -
%// Replace imabsdiff with abs(bsxfun(@minus..)), which might be faster
frameDifferencesEroded = abs(bsxfun(@minus,frameDifferencesErodedTemp, frameDifferencesErodedTemp(:,:,1)))
%// Get the thresholding done next
frameDifferences = (frameDifferencesEroded>thresold).*255
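As a rough check of the vectorized form above, the following sketch compares it with an explicit loop on small made-up data. The array size, the random frames and the threshold value are assumptions for illustration only. One caveat: if the frames are stored as uint8, a plain subtraction saturates at 0, so abs(a - b) no longer matches imabsdiff; converting to double first, as done here, avoids that.

```matlab
% Hypothetical test data: 3 small frames with values in [0, 255]
frameDifferencesErodedTemp = double(round(rand(4, 5, 3) * 255));
thresold = 40;  % identifier spelled as in the question; the value is an assumption

[h, w, nFrames] = size(frameDifferencesErodedTemp);

% Loop version: difference to the first frame, then threshold, frame by frame
loopEroded = zeros(h, w, nFrames);
loopDiff = zeros(h, w, nFrames);
for i = 2:nFrames
    loopEroded(:, :, i) = abs(frameDifferencesErodedTemp(:, :, i) - frameDifferencesErodedTemp(:, :, 1));
    loopDiff(:, :, i) = (loopEroded(:, :, i) > thresold) .* 255;
end

% Vectorized version: subtract the first frame from every frame at once
vecEroded = abs(bsxfun(@minus, frameDifferencesErodedTemp, frameDifferencesErodedTemp(:, :, 1)));
vecDiff = (vecEroded > thresold) .* 255;

sameEroded = isequal(loopEroded, vecEroded);
sameDiff = isequal(loopDiff, vecDiff);
```

The first slice comes out as all zeros in both versions, which matches the assumption stated in the answer.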
You could try something like this:
[M, N, P] = size(frameDifferences);
for i = 2:P
frameDifferencesEroded(:,:,i) = imabsdiff(frameDifferencesErodedTemp(:,:,i),frameDifferencesErodedTemp(:,:,1));
frameDifferences(:, :, i) = (frameDifferencesEroded(:, :, i) > thresold) .* 255;
end
Do you need to keep frameDifferencesEroded? If not you can make it a temporary 2-D matrix inside this loop.
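One note on memory layout: MATLAB stores arrays in column-major order, so the elements of each frame m(:,:,i) are stored consecutively in memory, whereas those of m(i,:,:) are not. Keeping the frame index as the last dimension, as you do here, is therefore already the faster arrangement; swapping the 1st and 3rd dimensions would likely make it slower.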
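The memory layout is easy to check directly. In this small sketch (the array A is made up for illustration), every element holds its own linear index, so the values in a slice show which memory positions that slice touches.

```matlab
% Each element of A equals its linear (memory) index
A = reshape(1:24, [2 3 4]);

% A page along the 3rd dimension covers a contiguous run of indices
pageSlice = A(:, :, 2);
pageIdx = pageSlice(:).';

% A slice along the 1st dimension jumps through memory with a stride of 2
rowSlice = A(1, :, :);
rowIdx = rowSlice(:).';

disp(pageIdx);
disp(rowIdx);
```

Here pageIdx is the contiguous run 7:12, while rowIdx is the strided sequence 1:2:23.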

Vectorizing three for loops

I'm quite new to Matlab and I need help in speeding up some part of my code. I am writing a Matlab application that performs 3D matrix convolution but unlike in standard convolution, the kernel is not constant, it needs to be calculated for each pixel of an image.
So far, I have ended up with a working code, but incredibly slow:
function result = calculateFilteredImages(images, T)
% images - matrix [480,360,10] of 10 grayscale images of height=480 and width=360
% represented as a value in a range [0..1]
% i.e. images(10,20,5) = 0.1231;
% T - some matrix [480,360,10, 3,3] of double values, calculated earlier
kerN = 5; %kernel size
mid=floor(kerN/2); %half the kernel size
offset=mid+1; %kernel offset
[h,w,n] = size(images);
%add padding so as not to get IndexOutOfBoundsEx during summation:
%[i.e. changes [1 2 3...10] to [0 0 1 2 ... 10 0 0]]
images = padarray(images,[mid, mid, mid]);
result(h,w,n)=0; %preallocate, faster than zeros(h,w,n)
kernel(kerN,kerN,kerN)=0; %preallocate
% the three parameters below are not important in this problem
% (are used to calculate sigma in x,y,z direction inside the loop)
d = 3;
for a=1:n;
for b=1:w;
for c=1:h;
M(:,:)=T(c,b,a,:,:); % M is now a 3x3 matrix
[R D] = eig(M); %get eigenvectors and eigenvalues - R and D are now 3x3 matrices
% eigenvalues
l1 = D(1,1);
l2 = D(2,2);
l3 = D(3,3);
sig1=sig( l1 , sigMin, sigMax, d);
sig2=sig( l2 , sigMin, sigMax, d);
sig3=sig( l3 , sigMin, sigMax, d);
% calculate kernel
for i=-mid:mid
for j=-mid:mid
for k=-mid:mid
x_new = [i,j,k] * R; %calculate new [i,j,k]
kernel(offset+i, offset+j, offset+k) = exp(- (((x_new(1))^2 )/(sig1^2) + ((x_new(2))^2)/(sig2^2) + ((x_new(3))^2)/(sig3^2)) /2);
end
end
end
% normalize
%perform summation
xm_sum = 0;
for i=-mid:mid
for j=-mid:mid
for k=-mid:mid
xm_sum = xm_sum + kernel(offset+i, offset+j, offset+k) * images(c+mid+i, b+mid+j, a+mid+k);
end
end
end
result(c,b,a) = xm_sum;
end
end
end
I tried replacing the "calculating kernel" part with
sigma=[sig1 sig2 sig3]
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
k2 = arrayfun(@(x, y, z) exp(-(norm([x,y,z]*R./sigma)^2)/2), x,y,z);
but it turned out to be even slower than the loop. I went through several articles and tutorials on vectorization but I'm quite stuck with this one.
Can it be vectorized or otherwise sped up?
I'm new to MATLAB; maybe there are some built-in functions that could help in this case?
As Dennis noted, this is a lot of code; cutting it down to the minimum that the profiler shows to be slow will help. I'm not sure if my code is equivalent to yours, so can you try it and profile it? The 'trick' to MATLAB vectorization is using .* and .^, which operate element-by-element instead of requiring loops.
Take your rewritten part:
sigma=[sig1 sig2 sig3]
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
k2 = arrayfun(@(x, y, z) exp(-(norm([x,y,z]*R./sigma)^2)/2), x,y,z);
And just pick one sigma for now. Looping over 3 different sigmas isn't a performance problem if you can vectorize the underlying k2 formula.
EDIT: Changed the matrix_to_norm code to be x(:), and no commas. See Generate all possible combinations of the elements of some vectors (Cartesian product)
Then try:
% R & mid my test variables
R = [1 2 3; 4 5 6; 7 8 9];
mid = 5;
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
% meshgrid is also a possibility, check that you are getting the order you want
% Going to break the equation apart for now for clarity
% Matrix operation, should already be fast.
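% (sig1 is taken from the question's code). One row per grid point.
matrix_to_norm = [x(:) y(:) z(:)]*R/sig1;
% norm() applied to a matrix returns a single matrix norm, so compute the
% Euclidean norm of each row explicitly instead.
matrix_normed = sqrt(sum(matrix_to_norm.^2, 2));
% Note the .^ - I believe you want element-by-element exponentiation, this will
% vectorize it. Reshape back to the grid shape at the end.
k2 = reshape(exp(-0.5*(matrix_normed.^2)), size(x));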
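To go one step further and use all three sigmas at once, here is a minimal sketch that builds the whole kernel without loops and compares it with the triple loop from the question. The rotation matrix R and the sigma values below are made-up test inputs, not values from the original post, and the kernel is left unnormalized.

```matlab
% Hypothetical test inputs
kerN = 5;
mid = floor(kerN / 2);
offset = mid + 1;
[R, ~] = eig([2 1 0; 1 3 1; 0 1 4]);  % eigenvectors of a symmetric test matrix
sigma = [1.0 1.5 2.0];                % stands in for [sig1 sig2 sig3]

% Reference: the triple loop from the question
kernelLoop = zeros(kerN, kerN, kerN);
for i = -mid:mid
    for j = -mid:mid
        for k = -mid:mid
            x_new = [i, j, k] * R;
            kernelLoop(offset+i, offset+j, offset+k) = exp(-sum((x_new ./ sigma).^2) / 2);
        end
    end
end

% Vectorized: rotate all grid points with one matrix product
[gi, gj, gk] = ndgrid(-mid:mid, -mid:mid, -mid:mid);
coords = [gi(:) gj(:) gk(:)] * R;          % one rotated point per row
scaled = bsxfun(@rdivide, coords, sigma);  % divide each column by its sigma
kernelVec = reshape(exp(-sum(scaled.^2, 2) / 2), size(gi));

maxErr = max(abs(kernelLoop(:) - kernelVec(:)));
```

ndgrid is used rather than meshgrid so that the first output varies along the first dimension, which matches the kernel(offset+i, offset+j, offset+k) indexing in the loop.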

MATLAB Speed Optimisation

Can anyone help? I am a fairly experienced Matlab user but am having trouble speeding up the code below.
The fastest time I have been able to achieve for one run through all three loops, using 12 cores, is ~200 s. The actual function will be called ~720 times and at this rate will take over 40 hours to execute. According to the MATLAB profiler, the majority of CPU time is spent in the exponential function call. I've managed to speed this up quite substantially by using a gpuArray and running the exp call on a Quadro 4000 graphics card. However, this prevents the parfor loop from being used, since the workstation has only one graphics card, which obliterates any gains. Can anyone help, or is this code close to the optimum that can be achieved using MATLAB? I have written a very crude C++ implementation with OpenMP but achieved little gain.
Many thanks in advance
function SPEEDtest_CPU
% Variable setup:
% - For testing I'll use random variables. These will actually be fed into
% the function for the real version of this code.
sy = 320;
sx = 100;
sz = 32;
A = complex(rand(sy,sx,sz),rand(sy,sx,sz));
B = complex(rand(sy,sx,sz),rand(sy,sx,sz));
C = rand(sy,sx);
D = rand(sy*sx,1);
F = zeros(sy,sx,sz);
x = rand(sy*sx,1);
y = rand(sy*sx,1);
x_ind = (1:sx) - (sx / 2) - 1;
y_ind = (1:sy) - (sy / 2) - 1;
% - In the real code this set of three loops will be called ~720 times!
% - Using 12 cores, the fastest I have managed is ~200 seconds for one
% call of this function.
for z = 1 : sz
A_slice = A(:,:,z);
A_slice = A_slice(:);
parfor cx = 1 : sx
for cy = 1 : sy
E = ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
F(cy,cx,z) = (B(cy,cx,z) .* exp(-1i .* E))' * A_slice;
end
end
end
Some things to think about:
Have you considered using singles?
Can you vectorize the cx, cy portion so that they represent array operations?
Consider changing the floating point rounding or signalling modes.
If your data are real (not complex), as in your example, you can save time by replacing
(B(cy,cx,z) .* exp(-1i .* E))'
with
(B(cy,cx,z) .* (cos(E)+1i*sin(E))).'
Specifically, on my machine (cos(x)+1i*sin(x)).' takes 19% less time than exp(-1i .* x)'.
If A and B are complex: E is still real, so you can precompute Bconj = conj(B) outside the loops (this takes about 10 ms with your data size, and it's done only once) and then replace
(B(cy,cx,z) .* exp(-1i .* E))'
with
(Bconj(cy,cx,z) .* (cos(E)+1i*sin(E))).'
to obtain a similar gain.
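The replacement relies on the identity conj(b*exp(-1i*E)) = conj(b)*(cos(E) + 1i*sin(E)) for real E. A quick numerical check, using made-up values for E and a single complex entry b standing in for B(cy,cx,z):

```matlab
% Hypothetical inputs: a real phase vector and one complex coefficient
E = rand(1000, 1);
b = complex(0.3, -0.7);

% Original form: conjugate transpose of b .* exp(-1i*E)
lhs = (b .* exp(-1i .* E))';

% Replacement: conjugate b up front, then use a plain (non-conjugate) transpose
rhs = (conj(b) .* (cos(E) + 1i * sin(E))).';

maxErr = max(abs(lhs - rhs));
```

Whether the cos/sin form is actually faster depends on the machine and MATLAB version, so it is worth timing both on the real data.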
There are two main ways of speeding up MATLAB code: preallocation and vectorisation.
You have preallocated well but there is no vectorisation. In order to best learn how to do this you need to have a good grasp of linear algebra and the use of repmat to expand vectors into multiple dimensions.
Vectorisation can result in multiple orders of magnitude speedup and will use the cores optimally (provided the flag is up).
What is the mathematical expression you are calculating? I may be able to lend a hand.
You can move x .* x_ind(cx) out of the innermost loop. I don't have a GPU handy to test the timings, but you could split the code into three sections to allow you to use the GPU and parfor
for z = 1 : sz
E = zeros(sy*sx,sx,sy);
A_slice = A(:,:,z);
A_slice = A_slice(:);
parfor cx = 1 : sx
temp = ( x .* x_ind(cx) );
for cy = 1 : sy
E(:, cx, cy) = temp + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
end
end
temp = zeros(sy*sx,sx,sy);
for cx = 1 : sx
for cy = 1 : sy
% Ideally use your GPU magic here
temp(:, cx, cy) = exp(-1i .* E(:, cx, cy));
end
end
parfor cx = 1 : sx
for cy = 1 : sy
F(cy,cx,z) = (B(cy,cx,z) .* temp(:, cx, cy))' * A_slice;
end
end
end
To allow for proper parallelization you need to make sure the loops are completely independent, so check whether not assigning to E in each run helps.
Furthermore try to vectorize as much as possible, one simple example could be: y.*y_ind(cy)
If you just create the proper index for all values at once, you can take this out of the lowest loop.
Not sure if it helps much with speed, but since E is basically a sum and x_ind increases by 1 from one cx to the next, you may be able to use exp(i*x_ind(cx+1)*x) = exp(i*x_ind(cx)*x) .* exp(i*x), where exp(i*x) can be calculated beforehand.
That way you wouldn't have to evaluate exp in each iteration, only multiply, which should be faster.
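A small sketch of that recurrence for the x term only, with made-up sizes. It uses the sign convention exp(-1i*...) from the question and compares the running product against a direct exp call at every step.

```matlab
% Hypothetical small inputs; x_ind steps by 1, as in the question
x = rand(8, 1);
sx = 6;
x_ind = (1:sx) - (sx / 2) - 1;

stepFactor = exp(-1i .* x);            % computed once, outside the loop
phase = exp(-1i .* x .* x_ind(1));     % starting value for cx = 1

maxErr = 0;
for cx = 1:sx
    direct = exp(-1i .* x .* x_ind(cx));            % what the loop computes today
    maxErr = max(maxErr, max(abs(phase - direct)));
    phase = phase .* stepFactor;                    % advance to cx + 1 without calling exp
end
```

Rounding error accumulates slowly with each multiplication, so for long runs it may be worth resetting phase with a direct exp call every so often.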
In addition to the other good advice given here, the multiplication by A_slice is independent of the cx,cy loops and can be taken outside them and applied as a single multiplication once both loops have finished.
Similarly, the conjugation of B*exp(...) can also be done in bulk outside the cx,cy loop, before multiplication by A_slice.
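One way to read these suggestions is to remove the inner cy loop entirely: build E for all cy values as a matrix, then apply A_slice with one matrix-vector product and the conjugated B with one element-wise product. The sketch below uses the question's variable names but much smaller, made-up sizes, and checks the result against the original double loop. It has not been timed on the real problem size.

```matlab
% Hypothetical small problem with the same structure as the question
sy = 6; sx = 4; sz = 2; N = sy * sx;
A = complex(rand(sy, sx, sz), rand(sy, sx, sz));
B = complex(rand(sy, sx, sz), rand(sy, sx, sz));
C = rand(sy, sx);
D = rand(N, 1);
x = rand(N, 1);
y = rand(N, 1);
x_ind = (1:sx) - (sx / 2) - 1;
y_ind = (1:sy) - (sy / 2) - 1;

Floop = zeros(sy, sx, sz);
Fvec = zeros(sy, sx, sz);
for z = 1:sz
    A_slice = reshape(A(:, :, z), [], 1);
    for cx = 1:sx
        % Reference: original inner loop over cy
        for cy = 1:sy
            E = (x .* x_ind(cx)) + (y .* y_ind(cy)) + (C(cy, cx) .* D);
            Floop(cy, cx, z) = (B(cy, cx, z) .* exp(-1i .* E))' * A_slice;
        end
        % Vectorized over cy: column cy of Emat is the vector E for that cy
        Emat = bsxfun(@plus, x .* x_ind(cx), y * y_ind + D * C(:, cx).');
        Fvec(:, cx, z) = conj(B(:, cx, z)) .* (exp(1i .* Emat).' * A_slice);
    end
end

maxErr = max(abs(Floop(:) - Fvec(:)));
```

Emat does not depend on z, so in a real implementation exp(1i .* Emat) could be computed once per cx and reused for every z, at the cost of holding an N-by-sy matrix in memory.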
This line: ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
is some type of convolution, is it not? Circular convolution is much faster in the frequency domain, and the conversion to/from the frequency domain is optimized using the FFT.

MatLab - Newton's method algorithm

I have written the following algorithm in order to evaluate a function in MatLab using Newton's method (we set r = -7 in my solution):
function newton(r);
syms x;
y = exp(x) - 1.5 - atan(x);
yprime = diff(y,x);
f = matlabFunction(y);
fprime = matlabFunction(yprime);
x = r;
xvals = x
for i=1:8
u = x;
x = u - f(r)/fprime(r);
xvals = x
end
The algorithm works in that it runs without any errors, but the numbers keep decreasing at every iteration, even though, according to my textbook, the iteration should converge to roughly -14 for x. My algorithm is correct for the first two iterations, but then it goes beyond -14 and finally ends up at roughly -36.4 after all iterations have completed.
If anyone can give me some help as to why the algorithm does not work properly, I would greatly appreciate it!
I think
x = u - f(r)/fprime(r);
should be
x = u - f(u)/fprime(u);
If you always use r, you're always decrementing x by the same value.
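For reference, a minimal sketch of the corrected iteration. It swaps the symbolic setup for hand-written anonymous functions (an assumption made to avoid needing the Symbolic Math Toolbox; the derivative is written out by hand) and keeps the starting value r = -7 and the 8 iterations from the question.

```matlab
% Function and its derivative, written by hand instead of via syms/diff
f = @(x) exp(x) - 1.5 - atan(x);
fprime = @(x) exp(x) - 1 ./ (1 + x.^2);

x = -7;                 % starting value r from the question
xvals = zeros(9, 1);    % iteration history, including the start
xvals(1) = x;
for i = 1:8
    x = x - f(x) / fprime(x);   % evaluate at the current iterate, not at r
    xvals(i + 1) = x;
end
disp(xvals);
```

With the update evaluated at the current iterate, the sequence settles near -14, in line with the textbook value mentioned in the question, instead of drifting by the same step each time.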
syms x
y = exp(x) - 1.5 - atan(x); % your function is converted in for loop
for i=2:n
v=[v ;x(i)]; % you will get solution vector for each i value

How to speed this kind of for-loop?

I would like to compute the maximum of translated images along the direction of a given axis. I know about ordfilt2, however I would like to avoid using the Image Processing Toolbox.
So here is the code I have so far:
imInput = imread('tire.tif');
n = 10;
imMax = imInput(:, n:end);
for i = 1:(n-1)
imMax = max(imMax, imInput(:, i:end-(n-i)));
end
Is it possible to avoid using a for-loop in order to speed the computation up, and, if so, how?
First edit: Using Octave's code for im2col is actually 50% slower.
Second edit: Pre-allocating did not appear to improve the result enough.
sz = [size(imInput,1), size(imInput,2)-n+1];
range_j = 1:size(imInput, 2)-sz(2)+1;
range_i = 1:size(imInput, 1)-sz(1)+1;
B = zeros(prod(sz), length(range_j)*length(range_i));
counter = 0;
for j = range_j % left to right
for i = range_i % up to bottom
counter = counter + 1;
v = imInput(i:i+sz(1)-1, j:j+sz(2)-1);
B(:, counter) = v(:);
end
end
imMax = reshape(max(B, [], 2), sz);
Third edit: I shall show the timings.
For what it's worth, here's a vectorized solution using IM2COL function from the Image Processing Toolbox:
imInput = imread('tire.tif');
n = 10;
sz = [size(imInput,1) size(imInput,2)-n+1];
imMax = reshape(max(im2col(imInput, sz, 'sliding'),[],2), sz);
You could perhaps write your own version of IM2COL as it simply consists of well crafted indexing, or even look at how Octave implements it.
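As a toolbox-free illustration of that kind of indexing, the sketch below stacks the n shifted copies of the image along a third dimension and takes the maximum across it. The test image is a small made-up matrix rather than tire.tif, and the result is compared with the loop from the question. This is only one possible indexing scheme, not how IM2COL itself is implemented.

```matlab
% Hypothetical test image and window length
imInput = magic(12);
n = 4;

% Reference: loop from the question
imMaxLoop = imInput(:, n:end);
for i = 1:(n-1)
    imMaxLoop = max(imMaxLoop, imInput(:, i:end-(n-i)));
end

% Index-based version: column s of colIdx holds the columns of shift s-1
[h, w] = size(imInput);
outW = w - n + 1;
colIdx = bsxfun(@plus, (1:outW).', 0:(n-1));   % outW-by-n matrix of column indices
stack = reshape(imInput(:, colIdx(:)), h, outW, n);
imMaxVec = max(stack, [], 3);

sameResult = isequal(imMaxLoop, imMaxVec);
```

The stacked array needs n times the memory of the output, so for large n the original loop may still be the better choice; timing both on real data is the only way to tell.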
Check out the answer to this question about doing a rolling median in C. I've successfully made it into a MEX function and it is way faster than even ordfilt2. It will take some work to adapt it to compute a max, but I'm sure it's possible.
