Communication overhead (parfor) and preallocating for speed (for) in Multidimensional Arrays
I am getting two warnings in the following script at the places indicated by **'s
Variable is indexed but not sliced... (the array A shown by ** in the second parfor loop) - What is causing this and how can it be avoided?
The variable appears to change size on every loop... (the array Sol shown by ** in the for loop) - Maybe I am not doing it right, but preallocating memory hasn't worked.
Edit: My initial idea was to preallocate the arrays (as done in the first parfor loop) so that it will execute the rest of the script faster (the full version of the script repeats various array operations similar to the second parfor and for loops).
Any suggestions? :)
N = 1000;
parfor i=1:N
A(:,:,i) = rand(2);
X(:,:,i) = rand(2,1);
Sol1(1,1,i) = zeros();
Sol2(1,1,i) = zeros();
Sol(2,1,i) = zeros();
end
t0 = tic;
parfor i=1:N
Sol1(1,:,i) = A(1,:,i)*X(:,1,i);
Sol2(1,:,i) = **A**(2,:,i)*X(:,1,i);
end
for i=1:N
**Sol**(:,1,i) = [Sol1(1,:,i);Sol2(1,:,i)];
end
toc(t0);
Your pre-allocation is not right - you need to do each in a single call.
A = rand(2, 2, N);
X = rand(2, 1, N);
Sol1 = zeros(1, 1, N);
Sol2 = zeros(1, 1, N);
Sol = zeros(2, 1, N); % not really needed actually.
In your PARFOR loop, you can avoid 'broadcasting' A by using a syntax that MATLAB understands as slicing
parfor i = 1:N
tmp = A(:, :, i);
Sol1(1, :, i) = tmp(1,:) * X(:, 1, i);
Sol2(1, :, i) = tmp(2,:) * X(:, 1, i);
end
Finally, I think you can do this as a vectorised concatenation like so:
Sol = [Sol1; Sol2];
EDIT
On the GPU, you can use pagefun to get the whole job done in a single call, like so:
Ag = gpuArray.rand(2,2,N);
Xg = gpuArray.rand(2,1,N);
Sol = pagefun(#mtimes, Ag, Xg);
Related
I am working in MATLAB to process two 512x512 images, the domain image and the range image. What I am trying to accomplish is the following:
Divide both domain and range images into 8x8 pixel blocks
For each 8x8 block in the domain image, I have to apply a linear transformations to it and compare each of the 4096 transformed blocks with each of the 4096 range blocks.
Compute error in each case between the transformed block and the range image block and find the minimum error.
Finally I'll have for each 8x8 range block, the id of the 8x8 domain block for which the error was minimum (error between the range block and the transformed domain block)
To achieve this, I have written the following code:
RangeImagecolor = imread('input.png'); %input is 512x512
DomainImagecolor = imread('input.png'); %Range and Domain images are identical
RangeImagetemp = rgb2gray(RangeImagecolor);
DomainImagetemp = rgb2gray(DomainImagecolor);
RangeImage = im2double(RangeImagetemp);
DomainImage = im2double(DomainImagetemp);
%For the (k,l)th 8x8 range image block
for k = 1:64
for l = 1:64
minerror = 9999;
min_i = 0;
min_j = 0;
for i = 1:64
for j = 1:64
%here I compute for the (i,j)th domain block, the transformed domain block stored in D_trans
error = 0;
D_trans = zeros(8,8);
R = zeros(8,8); %Contains the pixel values of the (k,l)th range block
for m = 1:8
for n = 1:8
R(m,n) = RangeImage(8*k-8+m,8*l-8+n);
%ApplyTransformation can depend on (k,l) so I can't compute the transformation outside the k,l loop.
[m_dash,n_dash] = ApplyTransformation(8*i-8+m,8*j-8+n);
D_trans(m,n) = DomainImage(m_dash,n_dash);
error = error + (R(m,n)-D_trans(m,n))^2;
end
end
if(error < minerror)
minerror = error;
min_i = i;
min_j = j;
end
end
end
end
end
As an example ApplyTransformation, one can use the identity transformation:
function [x_dash,y_dash] = Iden(x,y)
x_dash = x;
y_dash = y;
end
Now the problem I am facing is the high computation time. The order of computation in the above code is 64^5, which is of the order 10^9. This computation should take at the worst minutes or an hour. It takes about 40 minutes to compute just 50 iterations. I don't know why the code is running so slow.
Thanks for reading my question.
You can use im2col* to convert the image to column format so each block forms a column of a [64 * 4096] matrix. Then apply transformation to each column and use bsxfun to vectorize computation of error.
DomainImage=rand(512);
RangeImage=rand(512);
DomainImage_col = im2col(DomainImage,[8 8],'distinct');
R = im2col(RangeImage,[8 8],'distinct');
[x y]=ndgrid(1:8);
function [x_dash, y_dash] = ApplyTransformation(x,y)
x_dash = x;
y_dash = y;
end
[x_dash, y_dash] = ApplyTransformation(x,y);
idx = sub2ind([8 8],x_dash, y_dash);
D_trans = DomainImage_col(idx,:); %transformation is reduced to matrix indexing
Error = 0;
for mn = 1:64
Error = Error + bsxfun(#minus,R(mn,:),D_trans(mn,:).').^2;
end
[minerror ,min_ij]= min(Error,[],2); % linear index of minimum of each block;
[min_i min_j]=ind2sub([64 64],min_ij); % convert linear index to subscript
Explanation:
Our goal is to reduce number of loops as much as possible. For it we should avoid matrix indexing and instead we should use vectorization. Nested loops should be converted to one loop. As the first step we can create a more optimized loop as here:
min_ij = zeros(4096,1);
for kl = 1:4096 %%% => 1:size(D_trans,2)
minerror = 9999;
min_ij(kl) = 0;
for ij = 1:4096 %%% => 1:size(R,2)
Error = 0;
for mn = 1:64
Error = Error + (R(mn,kl) - D_trans(mn,ij)).^2;
end
if(Error < minerror)
minerror = Error;
min_ij(kl) = ij;
end
end
end
We can re-arrange the loops and we can make the most inner loop as the outer loop and separate computation of the minimum from the computation of the error.
% Computation of the error
Error = zeros(4096,4096);
for mn = 1:64
for kl = 1:4096
for ij = 1:4096
Error(kl,ij) = Error(kl,ij) + (R(mn,kl) - D_trans(mn,ij)).^2;
end
end
end
% Computation of the min
min_ij = zeros(4096,1);
for kl = 1:4096
minerror = 9999;
min_ij(kl) = 0;
for ij = 1:4096
if(Error(kl,ij) < minerror)
minerror = Error(kl,ij);
min_ij(kl) = ij;
end
end
end
Now the code is arranged in a way that can best be vectorized:
Error = 0;
for mn = 1:64
Error = Error + bsxfun(#minus,R(mn,:),D_trans(mn,:).').^2;
end
[minerror ,min_ij] = min(Error, [], 2);
[min_i ,min_j] = ind2sub([64 64], min_ij);
*If you don't have the Image Processing Toolbox a more efficient implementation of im2col can be found here.
*The whole computation takes less than a minute.
First things first - your code doesn't do anything. But you likely do something with this minimum error stuff and only forgot to paste this here, or still need to code that bit. Never mind for now.
One big issue with your code is that you calculate transformation for 64x64 blocks of resulting image AND source image. 64^5 iterations of a complex operation are bound to be slow. Rather, you should calculate all transformations at once and save them.
allTransMats = cell(64);
for i = 1 : 64
for j = 1 : 64
allTransMats{i,j} = getTransformation(DomainImage, i, j)
end
end
function D_trans = getTransformation(DomainImage, i,j)
D_trans = zeros(8);
for m = 1 : 8
for n = 1 : 8
[m_dash,n_dash] = ApplyTransformation(8*i-8+m,8*j-8+n);
D_trans(m,n) = DomainImage(m_dash,n_dash);
end
end
end
This serves to get allTransMat and is OUTSIDE the k, l loop. Preferably as a simple function.
Now, you make your big k, l, i, j loop, where you compare all the elements as needed. Comparison could be also done block-wise instead of filling a small 8x8 matrix, yet doing it per element for some reason.
m = 1 : 8;
n = m;
for ...
R = RangeImage(...); % This will give 8x8 output as n and m are vectors.
D = allTransMats{i,j};
difference = sum(sum((R-D).^2));
if (difference < minDifference) ...
end
Even though this is a simple no transformations case, this speeds up code a lot.
Finally, are you sure you need to compare each block of transformed output with each block in the source? Typically you compare block1(a,b) with block2(a,b) - blocks (or pixels) on the same position.
EDIT: allTransMats requires k and l too. Ouch. There is NO WAY to make this fast for a single iteration, as you require 64^5 calls to ApplyTransformation (or a vectorization of that function, but even then it might not be fast - we would have to see the function to help here).
Therefore, I will re-iterate my advice to generate all transformations and then perform lookup: this upper part of the answer with allTransMats generation should be changed to have all 4 loops and generate allTransMats{i,j,k,l};. It WILL be slow, there is no way around that as I mentioned in the upper part of edit. But, it is a cost you pay once, as after saving the allTransMats, all further image analyses will be able to simply load it instead of generating it again.
But ... what do you even do? Transformation that depends on source and destination block indices plus pixel indices (= 6 values total) sounds like a mistake somewhere, or a prime candidate to optimize instead of all the rest.
Here is my little script for simulating Levy motion:
clear all;
clc; close all;
t = 0; T = 1000; I = T-t;
dT = T/I; t = 0:dT:T; tau = T/I;
alpha = 1.5;
sigma = dT^(1/alpha);
mu = 0; beta = 0;
N = 1000;
X = zeros(N, length(I));
for k=1:N
L = zeros(1,I);
for i = 1:I-1
L( (i + 1) * tau ) = L(i*tau) + stable2( alpha, beta, sigma, mu, 1);
end
X(k,1:length(L)) = L;
end
q = 0.1:0.1:0.9;
quant = qlines2(X, q, t(1:length(X)), tau);
hold all
for i = 1:length(quant)
plot( t, quant(i) * t.^(1/alpha), ':k' );
end
Where stable2 returns a stable random variable with given parameters (you may replace it with normrnd(mu, sigma) for this case, it's not crucial); qlines2 returns quantiles needed for plotting.
But I don't want to talk about math here. My problem is that this implementation is pretty slow, and I would like to speed it up. Unfortunately, computer science is not my main field - I heard something about methods like memoization, vectorization and that there is a lot of other techniques, but I don't know how to use them.
For example, I'm pretty sure I should replace this filthy double for-loop somehow, but I'm not sure what to do instead.
EDIT: Maybe I should use (and learn...) another language (Python, C, any functional one)? I always though that Matlab/OCTAVE is designed for numerical computation, but if change, then for which one?
The crucial bit is, as you said, the for loops, Matlab does not like those, so vectorization is indeed the keyword. (Together with preallocating the space.
I just altered you for loop section somewhat so that you do not have to reset L over and over again, instead we save all Ls in a bigger matrix (also I elimiated the length(L) command).
L = zeros(N,I);
for k=1:N
for i = 1:I-1
L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
end
X(k,1:I) = L(k,1:I);
end
Now you can already see that X(k,1:I) = L(k,1:I); in the loop is obsolete and that also means that we can switch the order of the loops. This is crucial, because the i-steps are recursive (depend on the previous step) that means we cannot vectorize this loop, we can only vectorize the k-loop.
Now your original code needed 9.3 seconds on my machine, the new code still needs about the same time)
L = zeros(N,I);
for i = 1:I-1
for k=1:N
L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
end
end
X = L;
But now we can apply the vectorization, instead of looping throu all rows (the loop over k) we can instead eliminate this loop, and doing all rows at "once".
L = zeros(N,I);
for i = 1:I-1
L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma); %<- this is not yet what you want, see comment below
end
X = L;
This code need only 0.045 seconds on my machine. I hope you still get the same output, because I have no idea what you are calculating, but I also hope you could see how you go about vectorizing code.
PS: I just noticed that we now use the same random number in the last example for the whole column, this is obviously not what you want. Instad you should generate a whole vector of random numbers, e.g:
L = zeros(N,I);
for i = 1:I-1
L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma,N,1);
end
X = L;
PPS: Great question!
I have a cell array of size m x 1 and each cell is again s x t cell array (size varies). I would like to concatenate vertically. The code is as follows:
function(cell_out) = vert_cat(cell_in)
[row,col] = cellfun(#size,cell_in,'Uni',0);
fcn_vert = #(x)([x,repmat({''},size(x,1),max(cell2mat(col))-size(x,2))]);
cell_out = cellfun(fcn_vert,cell_in,'Uni',0); % Taking up lot of time
cell_out = vertcat(cell_out{:});
end
Step 3 takes a lot of time. Is it the right way to do or is there any another faster way to achieve this?
cellfun has been found to be slower than loops (kind of old, but agrees with what I have seen).
In addition, repmat has also been a performance hit in the past (though that may be different now).
Try this two-loop code that aims to accomplish your task:
function cellOut = vert_cat(c)
nElem = length(c);
colPad = zeros(nElem,1);
nRow = zeros(nElem,1);
for k = 1:nElem
[nRow(k),colPad(k)] = size(c{k});
end
colMax = max(colPad);
colPad = colMax - colPad;
cellOut = cell(sum(nRow),colMax);
bottom = cumsum(nRow) - nRow + 1;
top = bottom + nRow - 1;
for k = 1:nElem
cellOut(bottom(k):top(k),:) = [c{k},cell(nRow(k),colPad(k))];
end
end
My test for this code was
A = rand(20,20);
A = mat2cell(A,ones(20,1),ones(20,1));
C = arrayfun(#(c) A(1:c,1:c),randi([1,15],1,5),'UniformOutput',false);
ccat = vert_cat(c);
I used this pice of code to generate data:
%generating some dummy data
m=1000;
s=100;
t=100;
cell_in=cell(m,1);
for idx=1:m
cell_in{idx}=cell(randi(s),randi(t));
end
Applying some minor modifications, I was able to speed up the code by a factor of 5
%Minor modifications of the original code
%use arrays instead of cells for row and col
[row,col] = cellfun(#size,cell_in);
%claculate max(col) once
tcol=max(col);
%use cell instead of repmat to generate an empty cell
fcn_vert = #(x)([x,cell(size(x,1),tcol-size(x,2))]);
cell_out = cellfun(fcn_vert,cell_in,'Uni',0); % Taking up lot of time
cell_out = vertcat(cell_out{:});
Using simply a for loop is even faster, because the data is only moved once
%new approac. Basic idea: move every data only once
[row,col] = cellfun(#size,cell_in);
trow=sum(row);
tcol=max(col);
r=1;
cell_out2 = cell(trow,tcol);
for idx=1:numel(cell_in)
cell_out2(r:r+row(idx)-1,1:col(idx))=cell_in{idx};
r=r+row(idx);
end
I have a piece of code here I need to streamline as it is greatly increasing the runtime of my script:
size=300;
resultLength = (size+1)^3;
freqResult=zeros(1, resultLength);
inc=1;
for i=0:size,
for j=0:size,
for k=0:size,
freqResult(inc)=(c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
inc=inc+1;
end
end
end
c, L, W, and H are all constants. As the size input gets over about 400, the runtime is too long to wait for, and I can watch my disk space draining by the gigabyte. Any advice?
Thanks!
What about this:
[kT, jT, iT] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
for indx = 1:numel(iT)
i = iT(indx) - 1;
j = jT(indx) - 1;
k = kT(indx) - 1;
freqResult1(indx) = (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
end
On my PC, for size = 400, version with 3 loops takes 136s and this one takes 19s.
For more "matlaby" way u could also even do as follows:
[kT, jT, iT] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
func = #(i, j, k) (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
freqResult2 = arrayfun(func, iT-1, jT-1, kT-1);
But for some reason, this is slower then the above version.
A faster solution can be (based on Marcin's answer):
[k, j, i] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
freqResult = (c/2)*sqrt(((i-1)/L).^2+((j-1)/W).^2+((k-1)/H).^2);
It takes about 5 seconds to run on my PC for size = 300
The following is even faster (but it doesn't look very good):
k = repmat(0:size,[1 (size+1)^2]);
j = repmat(kron(0:size, ones(1,size+1)),[1 (size+1)]);
i = kron(0:size, ones(1,(size+1)^2));
freqResult = (c/2)*sqrt((i/L).^2+(j/W).^2+(k/H).^2);
which takes ~3.5s for size = 300
I'm working on a function with three nested for loops that is way too slow for its intended use. The bottleneck is clearly the looping part - almost 100 % of the execution time is spent in the innermost loop.
The function takes a 2d matrix called rM as input and returns a 3d matrix called ec:
rows = size(rM, 1);
cols = size(rM, 2);
%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
for risk = minRisk:stepRisk:maxRisk;
for c = 1:cols,
for r = 2:rows+1,
ec(r, c, risk) = ec(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
Any help on speeding up the for loops would be appreciated...
The problem is, that the inner loop is slowest, while it is also near-impossible to vectorize. As every iteration directly depends on the previous one.
The outer two are possible:
clc;
rM = rand(50);
rows = size(rM, 1);
cols = size(rM, 2);
minRisk = 1;
stepRisk = 1;
maxRisk = 100;
numRiskLevels = maxRisk/stepRisk;
%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
riskArray = (minRisk:stepRisk:maxRisk)';
tic
for r = 2:rows+1
tmp = riskArray * rM(r-1, :);
tmp = permute(tmp, [3 2 1]);
ec(r, :, :) = ec(r-1, :, :) .* (1 + tmp);
end
toc
%preallocate.
ec2 = zeros(rows+1, cols, numRiskLevels);
ec2(1, :, :) = 100;
tic
for risk = minRisk:stepRisk:maxRisk;
for c = 1:cols
for r = 2:rows+1
ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
toc
all(all(all(ec == ec2)))
But to my surprise, the vectorized code is indeed slower. (But maybe someone can improve the code, so I figured I leave it her for you.)
I have just tried to vectorize the outer loop, and actually noticed a significant speed increase. Of course it is hard to judge the speed of a script without knowing (the size of) the inputs but I would say this is a good starting point:
% Here you can change the input parameters
riskVec = 1:3:120;
rM = rand(50);
%preallocate and calculate non vectorized solution
ec2 = zeros(size(rM,2)+1, size(rM,1), max(riskVec));
ec2(1, :, :) = 100;
tic
for risk = riskVec
for c = 1:size(rM,2)
for r = 2:size(rM,1)+1
ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
t1=toc;
%preallocate and calculate vectorized solution
ec = zeros(size(rM,2)+1, size(rM,1), max(riskVec));
ec(1, :, :) = 100;
tic
for c = 1:size(rM,2)
for r = 2:size(rM,1)+1
ec(r, c, riskVec) = ec(r-1, c, riskVec) .* reshape(1 + riskVec * rM(r-1, c),[1 1 length(riskVec)]);
end
end
t2=toc;
% Check whether the vectorization is done correctly and show the timing results
if ec(:) == ec2(:)
t1
t2
end
The given output is:
t1 =
0.1288
t2 =
0.0408
So for this riskVec and rM it is about 3 times as fast as the non-vectorized solution.