Matlab - Speeding up Nested For-Loops - performance

I'm working on a function with three nested for loops that is way too slow for its intended use. The bottleneck is clearly the looping part - almost 100 % of the execution time is spent in the innermost loop.
The function takes a 2d matrix called rM as input and returns a 3d matrix called ec:
rows = size(rM, 1);
cols = size(rM, 2);
%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
for risk = minRisk:stepRisk:maxRisk;
for c = 1:cols,
for r = 2:rows+1,
ec(r, c, risk) = ec(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
Any help on speeding up the for loops would be appreciated...

The problem is, that the inner loop is slowest, while it is also near-impossible to vectorize. As every iteration directly depends on the previous one.
The outer two are possible:
clc;
rM = rand(50);
rows = size(rM, 1);
cols = size(rM, 2);
minRisk = 1;
stepRisk = 1;
maxRisk = 100;
numRiskLevels = maxRisk/stepRisk;
%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
riskArray = (minRisk:stepRisk:maxRisk)';
tic
for r = 2:rows+1
tmp = riskArray * rM(r-1, :);
tmp = permute(tmp, [3 2 1]);
ec(r, :, :) = ec(r-1, :, :) .* (1 + tmp);
end
toc
%preallocate.
ec2 = zeros(rows+1, cols, numRiskLevels);
ec2(1, :, :) = 100;
tic
for risk = minRisk:stepRisk:maxRisk;
for c = 1:cols
for r = 2:rows+1
ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
toc
all(all(all(ec == ec2)))
But to my surprise, the vectorized code is indeed slower. (But maybe someone can improve the code, so I figured I leave it her for you.)

I have just tried to vectorize the outer loop, and actually noticed a significant speed increase. Of course it is hard to judge the speed of a script without knowing (the size of) the inputs but I would say this is a good starting point:
% Here you can change the input parameters
riskVec = 1:3:120;
rM = rand(50);
%preallocate and calculate non vectorized solution
ec2 = zeros(size(rM,2)+1, size(rM,1), max(riskVec));
ec2(1, :, :) = 100;
tic
for risk = riskVec
for c = 1:size(rM,2)
for r = 2:size(rM,1)+1
ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
t1=toc;
%preallocate and calculate vectorized solution
ec = zeros(size(rM,2)+1, size(rM,1), max(riskVec));
ec(1, :, :) = 100;
tic
for c = 1:size(rM,2)
for r = 2:size(rM,1)+1
ec(r, c, riskVec) = ec(r-1, c, riskVec) .* reshape(1 + riskVec * rM(r-1, c),[1 1 length(riskVec)]);
end
end
t2=toc;
% Check whether the vectorization is done correctly and show the timing results
if ec(:) == ec2(:)
t1
t2
end
The given output is:
t1 =
0.1288
t2 =
0.0408
So for this riskVec and rM it is about 3 times as fast as the non-vectorized solution.

Related

Tricks to improve the performance of a cunstom function in Julia

I am replicating using Julia a sequence of steps originally made in Matlab. In Octave, this procedure takes 1.4582 seconds and in Julia (using Jupyter) it takes approximately 10 seconds. I'll try to be brief in the scripts. My goal is to achieve or improve Octave's performance. First of all, I will describe my variables and some function:
zgrid (double 1x7 size)
kgrid (double 500x1 size)
V0 (double 500x7 size)
P (double 7x7 size) a transition matrix
delta and beta are fixed parameters.
F(z,k) and u(c) are particular functions and are specified in the Julia script.
% Octave script
% V0 is given
[K, Z, K2] = meshgrid(kgrid, zgrid, kgrid);
K = permute(K, [2, 1, 3]);
Z = permute(Z, [2, 1, 3]);
K2 = permute(K2, [2, 1, 3]);
C = max(f(Z,K) + (1-delta)*K - K2,0);
U = u(C);
EV = V0*P';% EV is a 500x7 matrix size
EV = permute(repmat(EV, 1, 1, 500), [3, 2, 1]);
H = U + beta*EV;
[TV, index] = max(H, [], 3);
In Julia, I created a function that replicates this procedure. I used loops, but it has a performance 9 times longer.
% Julia script
% V0 is the input of my T operator function
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
F = (z,k) -> exp(z)*(k^α);
u = (c) -> (c^(1-μ) - 1)/(1-μ)
% parameters
α = 1/3
β = 0.987
δ = 0.012;
μ = 2
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500);
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0 , 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
function T(V)
E=V*P'
T1 = zeros(Float64, 500, 7 )
aux = zeros(Float64, 500)
for i = 1:7
for j = 1:500
for l = 1:500
c= maximum( (F(zrid[i],kgrid[j]) +(1-δ)*kgrid[j] - kgrid[l],0))
aux[l] = u(c) + β*E[l,i]
end
T1[j,i] = maximum(aux)
end
end
return T1
end
I would very much like to improve my performance in Julia. I believe there is a way to improve, but I am new in Julia programming.
This code runs for me in 5ms. Note that I have made F and u into proper (not anonymous) functions, F_ and u_, but you could get a similar effect by making the anonymous functions const.
Your main problem is that you have a lot of non-const global variables, and also that your main function is doing unnecessary work multiple times, and creating an unnecessary array, aux.
The performance tips section in the manual is essential reading: https://docs.julialang.org/en/v1/manual/performance-tips/
F_(z,k) = exp(z) * (k^(1/3)); # you can still use α, but it must be const
u_(c) = (c^(1-2) - 1)/(1-2)
function T_(V, P, kgrid, zgrid, β, δ)
E = V * P'
T1 = similar(V)
for i in axes(T1, 2)
for j in axes(T1, 1)
temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
aux = -Inf
for l in eachindex(kgrid)
c = max(0.0, temp - kgrid[l])
aux = max(aux, u_(c) + β * E[l, i])
end
T1[j,i] = aux
end
end
return T1
end
Benchmark:
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
zgrid = sort!(rand(1, 7); dims=2)
kgrid = sort!(rand(500, 1); dims=1)
P = rand(length(zgrid), length(zgrid))
#btime T_($V0, $P, $kgrid, $zgrid, $β, $δ);
# output: 5.126 ms (4 allocations: 54.91 KiB)
The following should perform much better. The most noticeable differences are that it calculates F 500x less, and doesn't rely on global variables.
function T(V,kgrid,zgrid,β,δ)
E=V*P'
T1 = zeros(Float64, 500, 7)
for j = 1:500
for i = 1:7
x = F(zrid[i],kgrid[j]) +(1-δ)*kgrid[j]
T1[j,i] = maximum(u(max(x - kgrid[l], 0)) + β*E[l,i] for l in 1:500)
end
end
return T1
end

Can anyone explain how different is this hybrid PSOGA from normal GA?

Does this code have mutation, selection, and crossover, just like the original genetic algorithm.
Since this, a hybrid algorithm (i.e PSO with GA) does it use all steps of original GA or skips some
of them.Please do tell me.
I am just new to this and still trying to understand. Thank you.
%%% Hybrid GA and PSO code
function [gbest, gBestScore, all_scores] = QAP_PSO_GA(CreatePopFcn, FitnessFcn, UpdatePosition, ...
nCity, nPlant, nPopSize, nIters)
% Set algorithm parameters
constant = 0.95;
c1 = 1.5; %1.4944; %2;
c2 = 1.5; %1.4944; %2;
w = 0.792 * constant;
% Allocate memory and initialize
gBestScore = inf;
all_scores = inf * ones(nPopSize, nIters);
x = CreatePopFcn(nPopSize, nCity);
v = zeros(nPopSize, nCity);
pbest = x;
% update lbest
cost_p = inf * ones(1, nPopSize); %feval(FUN, pbest');
for i=1:nPopSize
cost_p(i) = FitnessFcn(pbest(i, 1:nPlant));
end
lbest = update_lbest(cost_p, pbest, nPopSize);
for iter = 1 : nIters
if mod(iter,1000) == 0
parents = randperm(nPopSize);
for i = 1:nPopSize
x(i,:) = (pbest(i,:) + pbest(parents(i),:))/2;
% v(i,:) = pbest(parents(i),:) - x(i,:);
% v(i,:) = (v(i,:) + v(parents(i),:))/2;
end
else
% Update velocity
v = w*v + c1*rand(nPopSize,nCity).*(pbest-x) + c2*rand(nPopSize,nCity).*(lbest-x);
% Update position
x = x + v;
x = UpdatePosition(x);
end
% Update pbest
cost_x = inf * ones(1, nPopSize);
for i=1:nPopSize
cost_x(i) = FitnessFcn(x(i, 1:nPlant));
end
s = cost_x<cost_p;
cost_p = (1-s).*cost_p + s.*cost_x;
s = repmat(s',1,nCity);
pbest = (1-s).*pbest + s.*x;
% update lbest
lbest = update_lbest(cost_p, pbest, nPopSize);
% update global best
all_scores(:, iter) = cost_x;
[cost,index] = min(cost_p);
if (cost < gBestScore)
gbest = pbest(index, :);
gBestScore = cost;
end
% draw current fitness
figure(1);
plot(iter,min(cost_x),'cp','MarkerEdgeColor','k','MarkerFaceColor','g','MarkerSize',8)
hold on
str=strcat('Best fitness: ', num2str(min(cost_x)));
disp(str);
end
end
% Function to update lbest
function lbest = update_lbest(cost_p, x, nPopSize)
sm(1, 1)= cost_p(1, nPopSize);
sm(1, 2:3)= cost_p(1, 1:2);
[cost, index] = min(sm);
if index==1
lbest(1, :) = x(nPopSize, :);
else
lbest(1, :) = x(index-1, :);
end
for i = 2:nPopSize-1
sm(1, 1:3)= cost_p(1, i-1:i+1);
[cost, index] = min(sm);
lbest(i, :) = x(i+index-2, :);
end
sm(1, 1:2)= cost_p(1, nPopSize-1:nPopSize);
sm(1, 3)= cost_p(1, 1);
[cost, index] = min(sm);
if index==3
lbest(nPopSize, :) = x(1, :);
else
lbest(nPopSize, :) = x(nPopSize-2+index, :);
end
end
If you are new to Optimization, I recommend you first to study each algorithm separately, then you may study how GA and PSO maybe combined, Although you must have basic mathematical skills in order to understand the operators of the two algorithms and in order to test the efficiency of these algorithm (this is what really matter).
This code chunk is responsible for parent selection and crossover:
parents = randperm(nPopSize);
for i = 1:nPopSize
x(i,:) = (pbest(i,:) + pbest(parents(i),:))/2;
% v(i,:) = pbest(parents(i),:) - x(i,:);
% v(i,:) = (v(i,:) + v(parents(i),:))/2;
end
Is not really obvious how selection randperm is done (I have no experience about Matlab).
And this is the code that is responsible for updating the velocity and position of each particle:
% Update velocity
v = w*v + c1*rand(nPopSize,nCity).*(pbest-x) + c2*rand(nPopSize,nCity).*(lbest-x);
% Update position
x = x + v;
x = UpdatePosition(x);
This version of velocity updating strategy is utilizing what is called Interia-Weight W, which basically mean we are preserving the velocity history of each particle (not completely recomputing it).
It worth mentioning that velocity updating is done more often than crossover (each 1000 iteration).

Speed up matrix calculation

I am working on linear model predictive control and I need to calculate some matrices for the controller, only.. it takes a lot of time to calculate one of them and I would like to ask if there is a better way to code this calculation. I am using MATLAB, but I do understand FORTRAN also.
Well, I want to calculate a matrix (Φ) but the way I am doing it takes to much time to calculate it. Φ matrix is of the form (the right one):
MPC_matrices.
Here is the book where I found this image in case you need to refer to (especially page 8).
Now the code that I wrote in MATLAB goes like this:(moved it after EDIT)
Considering that I will have quite big NS, Np and Nc variables it will take a whole lot of time to do this calculation. Is there an optimal way (or at least better than mine) to speed up this calculation?
EDIT
After considering #Daniel's & user2682877's comments I tested this
clear;clc
Np = 80;
Nc = Np / 2;
m = 3;
q = 1;
Niter = 30;
MAT = zeros(Niter,5);
for I=1:Niter
NS = 10 * I;
A = rand(NS,NS);
B = rand(NS,m);
C = rand(1,NS);
tic
Phi1 = zeros(Np*q,Nc*m);
CB = C * B;
for i=1:Np
for j=1:Nc
if j<i
Phi1( (q*i-(q-1)):(q*i) , (m*j-(m-1)):(m*j) ) = C * A^(i-1-(j-1)) * B;
elseif j==i
Phi1( (q*i-(q-1)):(q*i) , (m*j-(m-1)):(m*j) ) = CB;
end
end
end
t1 = toc;
% Daniel's suggestion
tic
Phi2=zeros(Np*q,Nc*m);
CB = C * B;
for diffij=0:Np-1
if diffij>0
F=C * A^diffij * B;
else
F=CB;
end
for i=max(1,diffij+1):min(Np,Nc+diffij)
j=i-diffij;
Phi2( (q*i-(q-1)):(q*i) , (m*j-(m-1)):(m*j) ) = F;
end
end
t2 = toc;
% user2682877 suggestion
tic
Phi3=zeros(Np*q,Nc*m);
temp = B;
% 1st column
Phi3( (q*1-(q-1)):(q*1) , (m*1-(m-1)):(m*1) ) = C * B;
for i=2:Np
% reusing temp
temp = A * temp;
Phi3( (q*i-(q-1)):(q*i) , (m*1-(m-1)):(m*1) ) = C * temp;
end
% remaining columns
for j=2:Nc
for i=j:Nc
Phi3( (q*i-(q-1)):(q*i) , (m*j-(m-1)):(m*j) ) =...
Phi3( (q*(i-j+1)-(q-1)):(q*(i-j+1)) , (m*1-(m-1)):(m*1) );
end
end
t3 = toc;
MAT(I,:) = [I, NS, t1, t2 ,t3];
fprintf('I iteration = %g \n', I);
end
figure(1)
clf(1)
hold on
plot(MAT(:,2),MAT(:,3),'b')
plot(MAT(:,2),MAT(:,4),'r')
plot(MAT(:,2),MAT(:,5),'g')
hold off
legend('My <Unfortunate> Idea','Daniel`s suggestion','user2682877 suggestion')
xlabel('NS variable')
ylabel('Time, s')
And here is the resulting figure:
Keep in mind that now NS = 300 ,BUT as I upgrate my model (I intent include more and more equations and variables in the state space model) this variables (mostly NS and Np) will be bigger and bigger.
#Daniel's 2nd comment, I know that I perform more calculations than those I should, but my lack of experience limits my ideas on upgrating this one.
#durasm's comment, I am not quite familiar with parfor, but I will test it.
Referring to the answers: I will test your suggestions as soon as I understand them (...) and get back to you.
Results
It is obvious that my initial thought is only a bit worse than what suggested here.. Thank you guys! You were extremely helpful!
There is only a limited set of results from your calculation C * A^(i-1-(j-1)) * B which only depends on the difference between i and j. Not to repeatedly calculate it, my solution iterates this difference and i, then calculates j depending on these two variables.
Phi=zeros(Np*q,Nc*m);
CB = C * B;
for diffij=0:Np-1
if diffij>0
F=C * A^diffij * B;
else
F=CB;
end
for i=max(1,diffij+1):min(Np,Nc+diffij)
j=i-diffij;
Phi( (q*i-(q-1)):(q*i) , (m*j-(m-1)):(m*j) ) = F;
end
end
Performance comparison:
Maybe you can try this:
Phi=zeros(Np*q,Nc*m);
temp = B;
% 1st column
Phi( (q*1-(q-1)):(q*1) , (m*1-(m-1)):(m*1) ) = C * B;
for i=2:Np
% reusing temp
temp = A * temp;
Phi( (q*i-(q-1)):(q*i) , (m*1-(m-1)):(m*1) ) = C * temp;
end
% remaining columns
for j=2:Nc
for i=j:Np
Phi( (q*i-(q-1)):(q*i) , (m*j-(m-1)):(m*j) ) = Phi( (q*(i-j+1)-(q-1)):(q*(i-j+1)) , (m*1-(m-1)):(m*1) );
end
end

Speeding up simulation of the Levy motion algorithm

Here is my little script for simulating Levy motion:
clear all;
clc; close all;
t = 0; T = 1000; I = T-t;
dT = T/I; t = 0:dT:T; tau = T/I;
alpha = 1.5;
sigma = dT^(1/alpha);
mu = 0; beta = 0;
N = 1000;
X = zeros(N, length(I));
for k=1:N
L = zeros(1,I);
for i = 1:I-1
L( (i + 1) * tau ) = L(i*tau) + stable2( alpha, beta, sigma, mu, 1);
end
X(k,1:length(L)) = L;
end
q = 0.1:0.1:0.9;
quant = qlines2(X, q, t(1:length(X)), tau);
hold all
for i = 1:length(quant)
plot( t, quant(i) * t.^(1/alpha), ':k' );
end
Where stable2 returns a stable random variable with given parameters (you may replace it with normrnd(mu, sigma) for this case, it's not crucial); qlines2 returns quantiles needed for plotting.
But I don't want to talk about math here. My problem is that this implementation is pretty slow, and I would like to speed it up. Unfortunately, computer science is not my main field - I heard something about methods like memoization, vectorization and that there is a lot of other techniques, but I don't know how to use them.
For example, I'm pretty sure I should replace this filthy double for-loop somehow, but I'm not sure what to do instead.
EDIT: Maybe I should use (and learn...) another language (Python, C, any functional one)? I always though that Matlab/OCTAVE is designed for numerical computation, but if change, then for which one?
The crucial bit is, as you said, the for loops, Matlab does not like those, so vectorization is indeed the keyword. (Together with preallocating the space.
I just altered you for loop section somewhat so that you do not have to reset L over and over again, instead we save all Ls in a bigger matrix (also I elimiated the length(L) command).
L = zeros(N,I);
for k=1:N
for i = 1:I-1
L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
end
X(k,1:I) = L(k,1:I);
end
Now you can already see that X(k,1:I) = L(k,1:I); in the loop is obsolete and that also means that we can switch the order of the loops. This is crucial, because the i-steps are recursive (depend on the previous step) that means we cannot vectorize this loop, we can only vectorize the k-loop.
Now your original code needed 9.3 seconds on my machine, the new code still needs about the same time)
L = zeros(N,I);
for i = 1:I-1
for k=1:N
L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
end
end
X = L;
But now we can apply the vectorization, instead of looping throu all rows (the loop over k) we can instead eliminate this loop, and doing all rows at "once".
L = zeros(N,I);
for i = 1:I-1
L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma); %<- this is not yet what you want, see comment below
end
X = L;
This code need only 0.045 seconds on my machine. I hope you still get the same output, because I have no idea what you are calculating, but I also hope you could see how you go about vectorizing code.
PS: I just noticed that we now use the same random number in the last example for the whole column, this is obviously not what you want. Instad you should generate a whole vector of random numbers, e.g:
L = zeros(N,I);
for i = 1:I-1
L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma,N,1);
end
X = L;
PPS: Great question!

Optimising multidimensional array performance- MATLAB

Communication overhead (parfor) and preallocating for speed (for) in Multidimensional Arrays
I am getting two warnings in the following script at the places indicated by **'s
Variable is indexed but not sliced... (the array A shown by ** in the second parfor loop) - What is causing this and how can it be avoided?
The variable appears to change size on every loop... (the array Sol shown by ** in the for loop) - Maybe I am not doing it right, but preallocating memory hasn't worked.
Edit: My initial idea was to preallocate the arrays (as done in the first parfor loop) so that it will execute the rest of the script faster (the full version of the script repeats various array operations similar to the second parfor and for loops).
Any suggestions? :)
N = 1000;
parfor i=1:N
A(:,:,i) = rand(2);
X(:,:,i) = rand(2,1);
Sol1(1,1,i) = zeros();
Sol2(1,1,i) = zeros();
Sol(2,1,i) = zeros();
end
t0 = tic;
parfor i=1:N
Sol1(1,:,i) = A(1,:,i)*X(:,1,i);
Sol2(1,:,i) = **A**(2,:,i)*X(:,1,i);
end
for i=1:N
**Sol**(:,1,i) = [Sol1(1,:,i);Sol2(1,:,i)];
end
toc(t0);
Your pre-allocation is not right - you need to do each in a single call.
A = rand(2, 2, N);
X = rand(2, 1, N);
Sol1 = zeros(1, 1, N);
Sol2 = zeros(1, 1, N);
Sol = zeros(2, 1, N); % not really needed actually.
In your PARFOR loop, you can avoid 'broadcasting' A by using a syntax that MATLAB understands as slicing
parfor i = 1:N
tmp = A(:, :, i);
Sol1(1, :, i) = tmp(1,:) * X(:, 1, i);
Sol2(1, :, i) = tmp(2,:) * X(:, 1, i);
end
Finally, I think you can do this as a vectorised concatenation like so:
Sol = [Sol1; Sol2];
EDIT
On the GPU, you can use pagefun to get the whole job done in a single call, like so:
Ag = gpuArray.rand(2,2,N);
Xg = gpuArray.rand(2,1,N);
Sol = pagefun(#mtimes, Ag, Xg);

Resources