I'm trying to use the parfor loop from the MATLAB Parallel Computing Toolbox.
I'm having a similar problem to this question: MATLAB parfor slicing issue?. The output matrix doesn't seem to be recognized as a sliced variable. In my specific case, I'm nesting other for loops inside the parfor, and I have trouble applying the solution proposed in the other thread to my problem. Here is a dummy example of what I'm trying to do:
n=175;
matlabpool;
Matred=zeros(n,n);
Matx2Cell = cell(n);
parfor i=1:n
for j=1:n
for k=1:n
Matred(j,k)=exp((j+i+k)/500);
end;
end;
Matx2Cell{i}=Matred;
end;
matlabpool close;
P.S. I know it would work to put the parfor on the k-loop instead of the i-loop... But I'd still like to put it on the i-loop (I believe it would be more time-efficient in my real program).
Thanks a lot
Frédéric Godin
You can put Matred = zeros(n); into the parfor body, but this is very slow. Instead define a function with Matred = zeros(n); in it: effectively the same thing, but much faster:
function Matred = calcMatred(i,n)
Matred=zeros(n);
for j=1:n
for k=1:n
Matred(j,k)=exp((j+i+k)/500);
end
end
Here is a time comparison:
matlabpool
n = 175;
Matx2Cell = cell(n,1);
tic
parfor i=1:n
Matred=zeros(n);
for j=1:n
for k=1:n
Matred(j,k)=exp((j+i+k)/500);
end
end
Matx2Cell{i}=Matred;
end
toc
tic
parfor i=1:n
Matx2Cell{i}=calcMatred(i,n);
end
toc
matlabpool close
On my machine, it takes 7 seconds for the first one and 0.3 seconds for the second.
Also note that I've changed the declaration of Matx2Cell to cell(n,1) since cell(n) makes an n x n cell array.
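This pattern is not MATLAB-specific: give each parallel iteration its own freshly built output and collect exactly one result per iteration. Below is a minimal Python sketch of the same structure, with a pool of workers standing in for parfor; the helper name calc_matred, the thread pool, and the small n are illustrative assumptions, not part of the original code.

```python
import math
from multiprocessing.dummy import Pool  # thread pool; a process pool works the same way

def calc_matred(args):
    # Each task builds its own fresh matrix -- the analogue of putting
    # Matred = zeros(n) inside the parfor body (or inside a helper function).
    i, n = args
    return [[math.exp((j + i + k) / 500) for k in range(1, n + 1)]
            for j in range(1, n + 1)]

n = 8  # small illustrative size
with Pool(2) as pool:
    # One independent result per i, gathered into a list (the analogue of Matx2Cell).
    matx2cell = pool.map(calc_matred, [(i, n) for i in range(1, n + 1)])
```

Because no two iterations touch the same output, the runtime never has to ship a shared matrix between workers, which is exactly what the sliced-variable rules in parfor enforce.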
You need to move Matred into the parfor loop body. This needs to be done because each iteration of the parfor needs a new copy of Matred.
n=175;
matlabpool;
Matx2Cell = cell(n);
parfor i=1:n
Matred=zeros(n,n);
for j=1:n
for k=1:n
Matred(j,k)=exp((j+i+k)/500);
end;
end;
Matx2Cell{i}=Matred;
end;
matlabpool close;
Suppose I have the following data generating process
using Random
using StatsBase
m_1 = [1.0 2.0]
m_2 = [1.0 2.0; 3.0 4.0]
DD = []
y = zeros(2,200)
for i in 1:100
rand!(m_1)
rand!(m_2)
push!(DD, m_1)
push!(DD, m_2)
end
idxs = sample(1:200,10)
for i in idxs
DD[i] = DD[1]
end
and suppose given the data, I have the following function
function test(y, DD, n)
v_1 = [1 2]
v_2 = [3 4]
for j in 1:n
for i in 1:size(DD,1)
if size(DD[i],1) == 1
y[1:size(DD[i],1),i] .= (v_1 * DD[i]')[1]
else
y[1:size(DD[i],1),i] = (v_2 * DD[i]')'
end
end
end
end
I'm struggling to optimize the speed of test. In particular, memory allocation increases as I increase n. However, I'm not really allocating anything new.
The data generating process captures the fact that I don't know for sure the size of DD[i] beforehand. That is, the first time I call test, DD[1] could be a 2x2 matrix. The second time I call test, DD[1] could be a 1x2 matrix. I think this could be part of the issue with memory allocation: Julia doesn't know the sizes beforehand.
I'm completely stuck. I've tried @inbounds, but that didn't help. Is there a way to improve this?
One important thing to check for performance is that Julia can understand the types. You can check this by running @code_warntype test(y, DD, 1); the output will make it clear that DD is of type Any[] (since you declared it that way). Working with Any can incur quite a performance penalty, so declaring DD = Matrix{Float64}[] cuts the time to a third in my testing.
I'm not sure how close this example is to the actual code you want to write but in this particular case the size(DD[i],1) == 1 branch can be replaced by a call to LinearAlgebra.dot:
y[1:size(DD[i],1),i] .= dot(v_1, DD[i])
this cuts the time by another 50% for me. Finally you can squeeze out just a tiny bit more by using mul! to perform the other multiplication in place:
mul!(view(y, 1:size(DD[i],1),i:i), DD[i], v_2')
Full example:
using Random
using LinearAlgebra
DD = [rand(i,2) for _ in 1:100 for i in 1:2]
y = zeros(2,200)
shuffle!(DD)
function test(y, DD, n)
v_1 = [1 2]
v_2 = [3 4]'
for j in 1:n
for i in 1:size(DD,1)
if size(DD[i],1) == 1
y[1:size(DD[i],1),i] .= dot(v_1, DD[i])
else
mul!(view(y, 1:size(DD[i],1),i:i), DD[i], v_2)
end
end
end
end
I read this post and realized that loops are faster in Julia. Thus, I decided to change my vectorized code into loops. However, I had to use a few if statements in my loop but my loops slowed down after I added more such if statements.
Consider this excerpt, which I directly copied from the post:
function devectorized()
a = [1.0, 1.0]
b = [2.0, 2.0]
x = [NaN, NaN]
for i in 1:1000000
for index in 1:2
x[index] = a[index] + b[index]
end
end
return
end
function time(N)
timings = Array{Float64}(undef, N)
# Force compilation
devectorized()
for itr in 1:N
timings[itr] = @elapsed devectorized()
end
return timings
end
I then added a few if statements to test the speed:
function devectorized2()
a = [1.0, 1.0]
b = [2.0, 2.0]
x = [NaN, NaN]
for i in 1:1000000
for index in 1:2
####repeat this 6 times
if index * i < 20
x[index] = a[index] - b[index]
else
x[index] = a[index] + b[index]
end
####
end
end
return
end
I repeated this block six times:
if index * i < 20
x[index] = a[index] - b[index]
else
x[index] = a[index] + b[index]
end
For the sake of conciseness, I'm not repeating this block in my sample code. After repeating the if statements 6 times, devectorized2() took 3 times as long.
I have two questions:
Are there better ways to implement if statements?
Why are if statements so slow? I know that Julia tries to compile loops down to C-like code; did these if statements just make that translation more difficult?
Firstly, I don't think the performance here is very odd, since you're adding a lot of work to your function.
Secondly, you should actually return x here, otherwise the compiler might decide that you're not using x, and just skip the whole computation, which would thoroughly confuse the timings.
Thirdly, to answer your question 1: You can implement it like this:
x[index] = a[index] + ifelse(index * i < 20, -1, 1) * b[index]
This can be faster in some cases, but not necessarily in your case, where the branch is very easy to predict. Sometimes you can also get speedups by using Bools, for example like this:
x[index] = a[index] + (2*(index * i >= 20)-1) * b[index]
Again, in your example this doesn't help much, but there are cases when this approach can give you a decent speedup.
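The arithmetic trick in that last snippet generalizes to any language: encode the Boolean as a ±1 factor so the select happens without a branch. A small Python sketch of the same transformation (the function name and values are mine, chosen only for illustration):

```python
def select_branchless(subtract: bool, a: float, b: float) -> float:
    # Map True -> -1 and False -> +1, mirroring the 2*(cond) - 1 idiom
    # in the Julia version, so a - b or a + b is computed without a branch.
    sign = 1 - 2 * subtract
    return a + sign * b

print(select_branchless(True, 1.0, 2.0))   # -> -1.0  (a - b)
print(select_branchless(False, 1.0, 2.0))  # -> 3.0   (a + b)
```

Whether this beats an ordinary if depends on how predictable the branch is; as noted above, an easily predicted branch is already cheap.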
BTW: It isn't necessarily always true that loops are preferable to vectorized code any longer. The post you linked to is quite old. Take a look at this blog post, which shows how vectorized code can achieve similar performance to loopy code. In many cases, though, a loop is the clearest, easiest and fastest way to accomplish your goal.
I was writing a simple autocorrelation function for a matrix where each row is a separate time series that we autocorrelate with itself over a certain set of lags. I have found that one of the most counterintuitive approaches works best.
% 17 second benchmark
A = cell2mat(arrayfun(@(i) AutoCorrelation(array(i,:),lags)',1:size(array,1),'UniformOutput',false))';
where the function itself is the following
function ans = AutoCorrelation(array,lags)
arrayfun(@(x) dot(array(lags(x)+1:end),array(1:end-lags(x)))/(length(array)-lags(x)),1:length(lags));
end
Other things I have tried:
A = zeros(size(array,1),length(lags));
T = size(array,2);
% 97 seconds benchmark
A = cell2mat(arrayfun(@(i) arrayfun(@(x) dot(array(i,lags(x)+1:end),array(i,1:end-lags(x)))/(size(array,2)-lags(x)),1:length(lags))',1:size(array,1),'UniformOutput',false))';
% 100 seconds benchmark
A = arrayfun(@(i,x) dot(array(i,lags(x)+1:end),array(i,1:end-lags(x)))/(size(array,2)-lags(x)),repmat((1:size(array,1))',1,length(lags)),repmat(1:length(lags),size(array,1),1));
% 27 second benchmark
for i = 1:length(lags)
A(:,i) = dot(array(:,lags(i)+1:end),array(:,1:end-lags(i)),2)/(T-lags(i));
end
% 95 second benchmark
for i = 1:length(lags)
for j = 1:size(array,1)
A(j,i) = dot(array(j,lags(i)+1:end),array(j,1:end-lags(i)),2)/(T-lags(i));
end
end
It is more of a curiosity question. If you had asked me, I would have bet that a direct dot-product method would work best. Also, if arrayfun works that well, why does the double-argument arrayfun not perform well?
My array was 512*100000 array of doubles.
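For readers unfamiliar with the computation itself: each output cell above is a lagged dot product of a row with itself, normalised by the number of overlapping samples. A pure-Python sketch of one row (the toy data is made up; the MATLAB dot calls in the question compute the same quantity):

```python
def autocorrelation(row, lags):
    # For each lag L, dot the series with itself shifted by L,
    # then divide by the number of overlapping samples.
    n = len(row)
    out = []
    for lag in lags:
        s = sum(row[lag + t] * row[t] for t in range(n - lag))
        out.append(s / (n - lag))
    return out

print(autocorrelation([1.0, 2.0, 3.0, 4.0], [0, 1]))
```

All of the variants in the question are different ways of sweeping this inner product over rows and lags; the performance differences come from how the slicing and function-call overhead is arranged, not from the arithmetic.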
Just to check how parallel processing works in MATLAB, I tried the pieces of code below and measured the execution time. But I found that the parallel-processing code takes more time than the normal code, which is unexpected. Am I doing something wrong?
Code with parallel processing
function t = parl()
matlabpool('open',2);
tic;
A = 5:10000000;
parfor i = 1:length(A)
A(i) = 3*A(i) + (A(i)/5);
A(i) = 0.456*A(i) + (A(i)/45);
end
tic;
matlabpool('close');
t = toc;
end
The result for parallel processing
>> parl Starting matlabpool using the 'local' profile ... connected to 2 workers. Sending a stop signal to all the workers ... stopped.
ans =
3.3332
function t = parl()
tic;
A = 5:10000000;
for i = 1:length(A)
A(i) = 3*A(i) + (A(i)/5);
A(i) = 0.456*A(i) + (A(i)/45);
end
tic;
t = toc;
end
Result without parallel processing
>> parl
ans =
2.8737e-05
Look at the time to (apparently) execute the serial version of the code: it is effectively 0. That's suspicious, so look at the code ...
tic;
t = toc;
Hmmm, this starts a stopwatch and immediately stops it. Yep, that should take about 0s. Have a look at the parallel code ...
tic;
matlabpool('close');
t = toc;
Ahh, in this case the code times the closing of the pool of workers. That requires a fair bit of work, and the time it takes, the 3.33 s, is part of the overhead of using parallel computation in MATLAB.
Yes, I do believe that you are doing something wrong, you are not measuring what you (probably) think you are measuring. tic starts a stopwatch and toc reads it. Your code starts a stopwatch twice and reads it once, it should probably start timing only once.
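In other words, start the clock once before the work and read it once after it. The same discipline in a Python sketch using time.perf_counter (the arithmetic workload here is just a stand-in for the loop body in the question):

```python
import time

start = time.perf_counter()  # "tic": start the stopwatch once, before the work
# the work being timed (stand-in for the loop over A)
total = sum(3 * x + x / 5 for x in range(5, 10_001))
elapsed = time.perf_counter() - start  # "toc": read the stopwatch once, after the work
```

With the stopwatch bracketing only the loop, the serial and parallel versions become directly comparable, and the pool's open/close overhead can be timed separately if it matters.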
Is there any way to vectorize such a for loop in MATLAB? It's taking a lot of time to execute.
for i = 1:numberOfFrames-1
frameDifferencesEroded(:,:,i+1) = imabsdiff(frameDifferencesErodedTemp(:,:,i+1),frameDifferencesErodedTemp(:,:,1));
for k=1:numel(frameDifferences(1,:,i))
for m=1:numel(frameDifferences(:,1,i))
if(frameDifferencesEroded(m,k,i+1)>thresold)
frameDifferences(m,k,i+1) = 255;
else
frameDifferences(m,k,i+1) = 0;
end
end
end
end
Assuming you want frameDifferencesEroded(:,:,1) and frameDifferences(:,:,1) to be all zeros (your code never assigns values to them), this might work for you -
%// Replace imabsdiff with abs(bsxfun(@minus..)), which might be faster
frameDifferencesEroded = abs(bsxfun(@minus,frameDifferencesErodedTemp, frameDifferencesErodedTemp(:,:,1)))
%// Get the thresholding done next
frameDifferences = (frameDifferencesEroded>thresold).*255
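The two vectorized statements above amount to: take the elementwise absolute difference of every frame against the first frame, then map each pixel to 255 or 0 by comparing with the threshold. A list-based Python sketch of those two steps (the tiny frames and the threshold value are made-up data):

```python
def threshold_diffs(frames, threshold):
    # frames: list of 2-D lists; compare every frame against the first one.
    ref = frames[0]
    out = []
    for frame in frames:
        # Elementwise absolute difference against the reference frame.
        eroded = [[abs(v - r) for v, r in zip(frow, rrow)]
                  for frow, rrow in zip(frame, ref)]
        # Elementwise threshold: 255 where the difference exceeds it, else 0.
        out.append([[255 if d > threshold else 0 for d in row] for row in eroded])
    return out

frames = [[[10, 20], [30, 40]], [[12, 90], [30, 0]]]
print(threshold_diffs(frames, 5))
# -> [[[0, 0], [0, 0]], [[0, 255], [0, 255]]]
```

Because both steps are elementwise, no inner pixel loops are needed; that is exactly what the bsxfun and comparison expressions exploit.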
You could try something like this:
[M, N, P] = size(frameDifferences);
for i = 2:P
frameDifferencesEroded(:,:,i) = imabsdiff(frameDifferencesErodedTemp(:,:,i),frameDifferencesErodedTemp(:,:,1));
frameDifferences(:, :, i) = (frameDifferencesEroded(:, :, i) > thresold) .* 255;
end
Do you need to keep frameDifferencesEroded? If not you can make it a temporary 2-D matrix inside this loop.
One more note on memory layout: MATLAB stores arrays in column-major order, so the elements of a slice m(:,:,i) are already consecutive in memory, whereas m(i,:,:) is not. Keeping the frame index in the third dimension, as you already do, is the faster arrangement.