Speed up matlab code with backward multiplication using vectorization - performance

I need to decrease the runtime of the following piece of code written in Matlab:
dt = 0.001; dt05 = dt^0.5; length_t = 1.0e6;
%a: array containing length_t elements
y0 = [1.5 2.0 1.0];y = zeros(length_t,3);y(1,:) = y0;
for i = 1:length_t-1
    dy = f(y(i,:)); % call to some function
    y(i+1,1) = y(i,1) + dt*dy(1);
    y(i+1,2) = y(1,2) + a(1:i)*(y(i:-1:1,2)-y(1,2)) + dt05*dy(2);
    y(i+1,3) = y(1,3) + a(1:i)*(y(i:-1:1,3)-y(1,3)) + dt05*dy(3);
end
The slowest steps are the calculations of y(i+1,2) and y(i+1,3) (because they require all the previous y(:,2:3) values). How can I speed up this code by vectorization and/or using a GPU?
EDIT: a is given by
a(1) = 0.5; a(2:length_t) = cumprod(1-((1+a(1))./(2:length_t)))*a(1);
and f is some function like:
function dy = f(y)
k12 = 1.0; k02 = 2.0;
dy(1) = - k12*y(1)*y(2);
dy(2) = k12*y(1) - k02*y(2);
dy(3) = (k12+k02)*(y(1)+y(2)+y(3));
dy = [dy(1) dy(2) dy(3)];

Note that I do NOT have DSP knowledge. I hope someone can write a better answer or correct mine.
If you can tolerate some approximations:
You can see that the ratio a(i+1)/a(i) tends towards 1. This means that you can calculate a*y exactly for the first N elements (N depending on your desired accuracy), then add the (N+1)-th element to a running variable AY and scale AY by a "magic factor" that depends on i. That way you save a lot of multiplications, at the cost of AY being a somewhat inaccurate estimate of the actual product.
Your y(i,2) would then look roughly like this (with csa = cumsum(a);):
y(i,2) = a(1:N) * y(i:-1:i-N) + AY + dt05_thingy + (1-csa(i))*y(1,2);
y(i,3) = ...
AY = AY*MF(i,N) + a(N)*y(i-N);
The magic factor would depend on N and perhaps also on i. Precalculate R=a(2:end)./a(1:end-1); and use MF(N, i>N) = R(N+(i-N)/2), i.e. take the middle ratio of the elements you are approximating.
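The claim that a(i+1)/a(i) tends towards 1 can be checked directly from the definition of a given in the question. A minimal sketch, assuming a shortened length_t only to keep the check fast (R follows the notation above):

```matlab
% Build a as in the question (shorter length_t is an assumption for this check)
length_t = 1.0e4;
a = zeros(1, length_t);
a(1) = 0.5;
a(2:length_t) = cumprod(1 - (1 + a(1)) ./ (2:length_t)) * a(1);

% Ratio of consecutive coefficients: R(k) = a(k+1)/a(k) = 1 - (1 + a(1))/(k+1)
R = a(2:end) ./ a(1:end-1);

fprintf('R(1) = %.4f, R(end) = %.6f\n', R(1), R(end));
```

Because R(k) = 1 - 1.5/(k+1) for a(1) = 0.5, the ratio approaches 1 only slowly, which is why N has to be chosen according to the accuracy you need.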


Can anyone explain how different is this hybrid PSOGA from normal GA?

Does this code have mutation, selection, and crossover, just like the original genetic algorithm?
Since this is a hybrid algorithm (i.e. PSO with GA), does it use all the steps of the original GA, or does it skip some of them?
I am new to this and still trying to understand. Thank you.
%%% Hybrid GA and PSO code
function [gbest, gBestScore, all_scores] = QAP_PSO_GA(CreatePopFcn, FitnessFcn, UpdatePosition, ...
    nCity, nPlant, nPopSize, nIters)
% Set algorithm parameters
constant = 0.95;
c1 = 1.5; %1.4944; %2;
c2 = 1.5; %1.4944; %2;
w = 0.792 * constant;
% Allocate memory and initialize
gBestScore = inf;
all_scores = inf * ones(nPopSize, nIters);
x = CreatePopFcn(nPopSize, nCity);
v = zeros(nPopSize, nCity);
pbest = x;
% update lbest
cost_p = inf * ones(1, nPopSize); %feval(FUN, pbest');
for i=1:nPopSize
    cost_p(i) = FitnessFcn(pbest(i, 1:nPlant));
end
lbest = update_lbest(cost_p, pbest, nPopSize);
for iter = 1 : nIters
    % GA step: every 1000 iterations, average each pbest with a random partner
    if mod(iter,1000) == 0
        parents = randperm(nPopSize);
        for i = 1:nPopSize
            x(i,:) = (pbest(i,:) + pbest(parents(i),:))/2;
            % v(i,:) = pbest(parents(i),:) - x(i,:);
            % v(i,:) = (v(i,:) + v(parents(i),:))/2;
        end
    end
    % Update velocity
    v = w*v + c1*rand(nPopSize,nCity).*(pbest-x) + c2*rand(nPopSize,nCity).*(lbest-x);
    % Update position
    x = x + v;
    x = UpdatePosition(x);
    % Update pbest
    cost_x = inf * ones(1, nPopSize);
    for i=1:nPopSize
        cost_x(i) = FitnessFcn(x(i, 1:nPlant));
    end
    s = cost_x<cost_p;
    cost_p = (1-s).*cost_p + s.*cost_x;
    s = repmat(s',1,nCity);
    pbest = (1-s).*pbest + s.*x;
    % update lbest
    lbest = update_lbest(cost_p, pbest, nPopSize);
    % update global best
    all_scores(:, iter) = cost_x;
    [cost,index] = min(cost_p);
    if (cost < gBestScore)
        gbest = pbest(index, :);
        gBestScore = cost;
    end
    % draw current fitness
    hold on
    str=strcat('Best fitness: ', num2str(min(cost_x)));
end
% Function to update lbest
% (ring neighbourhood: each particle compares itself with its two neighbours)
function lbest = update_lbest(cost_p, x, nPopSize)
sm(1, 1)= cost_p(1, nPopSize);
sm(1, 2:3)= cost_p(1, 1:2);
[cost, index] = min(sm);
if index==1
    lbest(1, :) = x(nPopSize, :);
else
    lbest(1, :) = x(index-1, :);
end
for i = 2:nPopSize-1
    sm(1, 1:3)= cost_p(1, i-1:i+1);
    [cost, index] = min(sm);
    lbest(i, :) = x(i+index-2, :);
end
sm(1, 1:2)= cost_p(1, nPopSize-1:nPopSize);
sm(1, 3)= cost_p(1, 1);
[cost, index] = min(sm);
if index==3
    lbest(nPopSize, :) = x(1, :);
else
    lbest(nPopSize, :) = x(nPopSize-2+index, :);
end
If you are new to optimization, I recommend that you first study each algorithm separately; then you can study how GA and PSO may be combined. You will need basic mathematical skills to understand the operators of the two algorithms and to test the efficiency of these algorithms (which is what really matters).
This code chunk is responsible for parent selection and crossover:
parents = randperm(nPopSize);
for i = 1:nPopSize
    x(i,:) = (pbest(i,:) + pbest(parents(i),:))/2;
    % v(i,:) = pbest(parents(i),:) - x(i,:);
    % v(i,:) = (v(i,:) + v(parents(i),:))/2;
end
It is not really obvious to me how the selection with randperm is done (I have no experience with Matlab).
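For reference, randperm(n) returns a random permutation of the integers 1 to n, so parents pairs every particle i with exactly one partner parents(i) (possibly itself), and fitness plays no role in the choice. The new position is then the element-wise mean of the two personal bests, i.e. an arithmetic crossover. A small sketch with made-up sizes and values (nPopSize, nCity and pbest here are illustrative, not taken from an actual run):

```matlab
nPopSize = 4; nCity = 3;                 % hypothetical sizes
pbest = magic(4);
pbest = pbest(:, 1:nCity);               % arbitrary example "personal bests"
parents = randperm(nPopSize);            % random permutation of 1..nPopSize

x = zeros(nPopSize, nCity);
for i = 1:nPopSize
    x(i,:) = (pbest(i,:) + pbest(parents(i),:)) / 2;
end

% Same result without the loop, by indexing whole rows at once
x_vec = (pbest + pbest(parents, :)) / 2;
```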
And this is the code that is responsible for updating the velocity and position of each particle:
% Update velocity
v = w*v + c1*rand(nPopSize,nCity).*(pbest-x) + c2*rand(nPopSize,nCity).*(lbest-x);
% Update position
x = x + v;
x = UpdatePosition(x);
This version of the velocity update uses what is called the inertia weight w, which basically means that we preserve part of the velocity history of each particle instead of recomputing the velocity from scratch.
It is worth mentioning that the velocity update is done in every iteration, whereas crossover is done only once every 1000 iterations.
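To see what the inertia weight does in isolation, here is a sketch in which a particle already sits at both its personal and its neighbourhood best, so the two attraction terms vanish and only w*v remains. The position, the starting velocity and the fact that the position is held fixed are assumptions made purely for illustration:

```matlab
constant = 0.95;
w = 0.792 * constant;             % same inertia weight as in the question's code
c1 = 1.5; c2 = 1.5;

x = [0.2 0.4 0.6];                % hypothetical particle position (held fixed here)
pbest = x; lbest = x;             % both attraction terms become zero
v0 = [1 -2 0.5];                  % hypothetical initial velocity
v = v0;

nSteps = 5;
for k = 1:nSteps
    v = w*v + c1*rand(1,3).*(pbest - x) + c2*rand(1,3).*(lbest - x);
end
% v now equals w^nSteps * v0: the old velocity is damped, not discarded
```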

Speeding up simulation of the Levy motion algorithm

Here is my little script for simulating Levy motion:
clear all;
clc; close all;
t = 0; T = 1000; I = T-t;
dT = T/I; t = 0:dT:T; tau = T/I;
alpha = 1.5;
sigma = dT^(1/alpha);
mu = 0; beta = 0;
N = 1000;
X = zeros(N, length(I));
for k=1:N
    L = zeros(1,I);
    for i = 1:I-1
        L( (i + 1) * tau ) = L(i*tau) + stable2( alpha, beta, sigma, mu, 1);
    end
    X(k,1:length(L)) = L;
end
q = 0.1:0.1:0.9;
quant = qlines2(X, q, t(1:length(X)), tau);
hold all
for i = 1:length(quant)
    plot( t, quant(i) * t.^(1/alpha), ':k' );
end
Where stable2 returns a stable random variable with given parameters (you may replace it with normrnd(mu, sigma) for this case, it's not crucial); qlines2 returns quantiles needed for plotting.
But I don't want to talk about math here. My problem is that this implementation is pretty slow, and I would like to speed it up. Unfortunately, computer science is not my main field. I have heard of methods like memoization and vectorization, and that there are a lot of other techniques, but I don't know how to use them.
For example, I'm pretty sure I should replace this filthy double for-loop somehow, but I'm not sure what to do instead.
EDIT: Maybe I should use (and learn...) another language (Python, C, any functional one)? I always thought that Matlab/Octave was designed for numerical computation, but if I should change, then to which one?
The crucial bit is, as you said, the for loops. Matlab does not like those, so vectorization is indeed the keyword (together with preallocating the space).
I just altered your for-loop section somewhat so that you do not have to reset L over and over again; instead we save all Ls in a bigger matrix (I also eliminated the length(L) command).
L = zeros(N,I);
for k=1:N
    for i = 1:I-1
        L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
    end
    X(k,1:I) = L(k,1:I);
end
Now you can already see that X(k,1:I) = L(k,1:I); in the loop is obsolete, and that also means that we can switch the order of the loops. This is crucial: because the i-steps are recursive (each depends on the previous step), we cannot vectorize this loop in the same way; we can only vectorize the k-loop.
Your original code needed 9.3 seconds on my machine; the new code still needs about the same time.
L = zeros(N,I);
for i = 1:I-1
    for k=1:N
        L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
    end
end
X = L;
But now we can apply the vectorization: instead of looping through all rows (the loop over k), we can eliminate this loop and do all rows at "once".
L = zeros(N,I);
for i = 1:I-1
    L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma); %<- this is not yet what you want, see comment below
end
X = L;
This code needs only 0.045 seconds on my machine. I hope you still get the same output, because I have no idea what you are calculating, but I also hope you could see how to go about vectorizing code.
PS: I just noticed that in the last example we now use the same random number for the whole column, which is obviously not what you want. Instead you should generate a whole vector of random numbers, e.g.:
L = zeros(N,I);
for i = 1:I-1
    L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma,N,1);
end
X = L;
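Going one step further than the loops above: since each step only adds an independent increment to the previous value and tau equals 1 here, the remaining loop over i is a running sum and could be written with cumsum. The sketch below is an assumption-laden variant, not the original method: it draws Gaussian increments with randn instead of normrnd or stable2 (sigma*randn has the same distribution as normrnd(0, sigma)); for stable increments you would need a generator that can return a whole matrix at once.

```matlab
N = 1000; I = 1000;                 % as in the question
alpha = 1.5; dT = 1;
sigma = dT^(1/alpha);

% One independent increment per path (row) and per time step (column)
increments = sigma * randn(N, I-1);

% L(:,1) = 0 and L(:,i+1) = L(:,i) + increments(:,i)
L = [zeros(N,1), cumsum(increments, 2)];
X = L;
```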
PPS: Great question!

Implement a fast optimization algorithm using fixed point method in matlab

I am implementing a fast optimization algorithm using the fixed-point method in Matlab. The goal of the method is to find the optimal value of u, where u={u_i, i=1..2}. The optimal value of u can be obtained with the following steps:
Sorry about the image; I cannot type mathematical equations here.
I tried to compute u by following the steps above. However, I don't know how to implement the term \sum_{j!=i} (u_j-1) in equation 25. This is my code. Could you please look at it and give me comments or suggestions to correct my implementation? Currently, the code runs but gives an incorrect answer.
function u = compute_u_TV(Im0, N_class)
% Initialization
N_class=2; % only have u1 and u2
% Iterative segmentation process
for i=1:N_class
v(:,:,i) = Im0/max(Im0(:)); % u between 0 and 1.
qxv(:,:,i) = zeros(size(Im0));
qyv(:,:,i) = zeros(size(Im0));
u(:,:,i) = v(:,:,i);
for iteration=1:10000
% Update v
Divqi = ( BackwardX(qxv(:,:,i)) + BackwardY(qyv(:,:,i)) );
Term = Divqi - u(:,:,i)/ (theta*gamma);
TermX = ForwardX(Term);
TermY = ForwardY(Term);
Norm = sqrt(TermX.^2 + TermY.^2);
Denom = 1 + tau*Norm;
%Equation 24
qxv(:,:,i) = (qxv(:,:,i) + tau*TermX)./Denom;
qyv(:,:,i) = (qyv(:,:,i) + tau*TermY)./Denom;
v(:,:,i) = u(:,:,i) - theta*gamma* Divqi; %Equation 23
% Update u
u(:,:,i) = (v(:,:,i) - theta* gamma* Divqi -theta*gamma*sigma*(sum(u(:))-u(:,:,i)-1))./(1+theta* gamma*sigma);
u(:,:,i) = max(u(:,:,i),0);
u(:,:,i) = min(u(:,:,i),1);
% Sub-functions- X.Berson
function [dx]=BackwardX(u);
[Ny,Nx] = size(u);
dx = u;
dx(2:Ny-1,2:Nx-1)=( u(2:Ny-1,2:Nx-1) - u(2:Ny-1,1:Nx-2) );
dx(:,Nx) = -u(:,Nx-1);
function [dy]=BackwardY(u);
[Ny,Nx] = size(u);
dy = u;
dy(2:Ny-1,2:Nx-1)=( u(2:Ny-1,2:Nx-1) - u(1:Ny-2,2:Nx-1) );
dy(Ny,:) = -u(Ny-1,:);
function [dx]=ForwardX(u);
[Ny,Nx] = size(u);
dx = zeros(Ny,Nx);
dx(1:Ny-1,1:Nx-1)=( u(1:Ny-1,2:Nx) - u(1:Ny-1,1:Nx-1) );
function [dy]=ForwardY(u);
[Ny,Nx] = size(u);
dy = zeros(Ny,Nx);
dy(1:Ny-1,1:Nx-1)=( u(2:Ny,1:Nx-1) - u(1:Ny-1,1:Nx-1) );
% End of sub-function
You should do
u(:,:,i) = (v(:,:,i) - theta* gamma* Divqi -theta*gamma*sigma* ...
(sum(u(:,:,1:size(u,3) ~= i),3) -1))./(1+theta* gamma*sigma);
The part you were searching for is
sum(u(:,:,1:size(u,3) ~= i),3)
Let's decompose this:
1:size(u,3) ~= i
is a logical vector with one entry per index from 1 to the size of u along the third dimension; it is true everywhere except at position i.
u(:,:,1:size(u,3) ~= i)
is the set of all matrices along the third dimension of u except the one for j = i, and
sum(u(:,:,1:size(u,3) ~= i),3)
is the sum of all those matrices along the third dimension.
Let me know if it helps!
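The decomposition can be tried out on a tiny array. In this sketch the 2x2x3 array u and the choice i = 2 are made up for illustration; the last line shows an equivalent formulation (total over all slices minus the i-th slice) as a cross-check:

```matlab
u = cat(3, [1 2; 3 4], [10 20; 30 40], [100 200; 300 400]);  % made-up 2x2x3 example
i = 2;

mask = (1:size(u,3)) ~= i;          % logical vector: true for every slice except i
sum_others = sum(u(:,:,mask), 3);   % u(:,:,1) + u(:,:,3)

% Equivalent alternative: total over all slices minus the i-th slice
sum_others_alt = sum(u, 3) - u(:,:,i);
```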

Speed up an Enumeration process

After a few days of optimization, this is my code for an enumeration process that consists of finding the best combination for every row of W. The algorithm separates the matrix W into one matrix where the elements of W are greater than LimiteInferiore (called W_legali) and one that has only the elements below the limit (called W_nlegali).
Using some parameters like Media (aka mean) and rho_b_legale, the algorithm minimizes the total cost function. In the last part, I find the combination with the lowest value of the objective function and save it in W_ottimo.
As you can see, the algorithm is not so "clean", and with a very large matrix (142506x3000) it is damn slow... So, can somebody help me speed it up a little bit?
for i=1:3000
    W = PesoIncertezza * MatriceCombinazioni';
    W_legali = W;
    W_legali(W<LimiteInferiore) = nan;
    if i==1
        Media = W_legali;
        rho_b_legale = ones(size (W_legali,1),size(MatriceCombinazioni,1));
    else
        Media = (repmat(sum(W_tot_migl,2),1,size(MatriceCombinazioni,1))+W_legali)/(size(W_tot_migl,2)+1);
        rho_b_legale = repmat(((n_b+1)/i),1,size(MatriceCombinazioni,1));
    end
    [W_legali_migl,comb] = min(C_u .* Media .* (1./rho_b_legale) + (1./rho_b_legale) .* c_0 + (c_1./(i * rho_b_legale)),[],2);
    MatriceCombinazioni_2 = MatriceCombinazioni;
    W_nlegali = PesoIncertezza * MatriceCombinazioni_2';
    W_nlegali(W_nlegali>=LimiteInferiore) = nan;
    if i==1
        Media = W_nlegali;
        rho_b_nlegale = zeros(size (W_nlegali,1),size(MatriceCombinazioni_2,1));
    else
        Media = (repmat(sum(W_tot_migl,2),1,size(MatriceCombinazioni_2,1))+W_nlegali)/(size(W_tot_migl,2)+1);
        rho_b_nlegale = repmat(((n_b)/i),1,size(MatriceCombinazioni_2,1));
    end
    [W_nlegali_migliori,comb2] = min(C_u .* Media .* (1./rho_b_nlegale) + (1./rho_b_nlegale) .* c_0 + (c_1./(i * rho_b_nlegale)),[],2);
    z = [W_legali_migl, W_nlegali_migliori];
    [z_ott,comb3] = min(z,[],2);
    %Increasing n_b
    if i==1
        n_b = zeros(size(W,1),1);
    end
    index = find(comb3==1);
    increment = ones(size(index,1),1);
    B = accumarray(index,increment);
    nzIndex = (B ~= 0);
    n_b(nzIndex) = n_b(nzIndex) + B(nzIndex);
    %Using comb3 to find where is the best configuration, is in
    %W_legali or in W_nLegali?
    combinazione = comb.*logical(comb3==1) + comb2.*logical(comb3==2);
    W_ottimo = W(sub2ind(size(W),[1:size(W,1)],combinazione'))';
    W_tot_migl(:,i) = W_ottimo;
    FunzObb(:,i) = z_ott;
    [PesoCestelli] = Simulazione_GenerazioneNumeriCasuali (PianoSperimentale,NumeroCestelli,NumeroEsperimenti,Alfa);
    [PesoIncertezza_2] = Simulazione_GenerazioneIncertezza (NumeroCestelli,NumeroEsperimenti,IncertezzaCella,PesoCestelli);
    PesoIncertezza(MatriceCombinazioni(combinazione,:)~=0) = PesoIncertezza_2(MatriceCombinazioni(combinazione,:)~=0); %updating just the hoppers that has been discharged
end
When you see repmat you should think bsxfun. For example, replace:
Media = (repmat(sum(W_tot_migl,2),1,size(MatriceCombinazioni,1))+W_legali) / ...
with
Media = bsxfun(@plus,sum(W_tot_migl,2),W_legali) / ...
The purpose of bsxfun is to do a virtual "singleton expansion" like repmat, without actually replicating the array into a matrix of the same size as W_legali.
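As a quick check that the two forms agree, here is a sketch on small made-up matrices (the values and the helper names rowSum and nComb are illustrative only):

```matlab
W_tot_migl = [1 2; 3 4; 5 6];             % made-up 3x2 history of best values
W_legali   = [1 0 2 1; 0 1 1 3; 2 2 0 1]; % made-up 3x4 candidate matrix
nComb = size(W_legali, 2);

rowSum = sum(W_tot_migl, 2);              % column vector, computed once

% repmat explicitly copies rowSum into a 3x4 matrix before adding
Media_repmat = (repmat(rowSum, 1, nComb) + W_legali) / (size(W_tot_migl,2) + 1);

% bsxfun expands the singleton dimension virtually, without the copy
Media_bsxfun = bsxfun(@plus, rowSum, W_legali) / (size(W_tot_migl,2) + 1);
```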
Also note that in the above code, sum(W_tot_migl,2) is computed twice. There are other small optimizations, but changing to bsxfun should give you a good improvement.
The values of 1./rho_b_legale are effectively computed three times. Store this quotient matrix.
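Storing the quotient also lets you factor it out of the cost expression. The sketch below assumes that C_u, c_0 and c_1 are scalars and uses made-up values for Media and rho_b_legale; inv_rho is a hypothetical helper name:

```matlab
C_u = 2; c_0 = 0.5; c_1 = 3; i = 4;       % hypothetical scalar parameters
Media = [1 2 3; 4 5 6];                   % made-up values
rho_b_legale = [0.5 0.25 1; 2 0.5 0.25];  % made-up values

% Original form: three separate divisions by rho_b_legale
cost_orig = C_u .* Media .* (1./rho_b_legale) + (1./rho_b_legale) .* c_0 + (c_1./(i * rho_b_legale));

% Reciprocal computed once and factored out of all three terms
inv_rho = 1 ./ rho_b_legale;
cost_fast = inv_rho .* (C_u .* Media + c_0 + c_1 / i);

[W_legali_migl, comb] = min(cost_fast, [], 2);
```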

Gradient descent returns incorrect prediction for linear function

I've implemented the following batch gradient descent algorithm, based on various sources I was able to find around the web and in lecture notes.
This implementation isn't ideal in terms of stopping criteria, but for my sample it should work.
x = [1,1;1,2;1,3;1,4;1,5];
y = [1;2;3;4;5];
theta = [0;0];
tempTheta = [0;0];
for c = 1:10000,
    for j = 1:2,
        sum = 0;
        for i = 1:5,
            sum = sum + ((dot(theta', x(i, :)) - y(j)) * x(i,j));
        end
        sum = (sum / 5) * 0.01;
        tempTheta(j) = theta(j) - sum;
    end
    theta = tempTheta;
end
The expected result is theta = [0;1], but my implementation always returns theta = [-3.5, 1.5].
I've tried various combinations of alpha and starting point, but without luck. Where am I making a mistake?
In this line
sum = sum + ((dot(theta', x(i, :)) - y(j)) * x(i,j));
you are using the wrong index for y: it should be y(i), because j iterates over the dimensions, not over the samples.
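With the index corrected, the same batch update can also be written without the two inner loops. This is a sketch of an equivalent vectorized form, not the original code; alpha and m are names introduced here, and it avoids calling a variable sum, which shadows the built-in function:

```matlab
x = [1,1;1,2;1,3;1,4;1,5];
y = [1;2;3;4;5];
theta = [0;0];
alpha = 0.01;                 % learning rate, same value as the 0.01 factor in the question
m = size(x, 1);               % number of samples

for c = 1:10000
    residual = x * theta - y;           % residual(i) = dot(theta', x(i,:)) - y(i)
    grad = (x' * residual) / m;         % grad(j) = mean over i of residual(i) * x(i,j)
    theta = theta - alpha * grad;       % simultaneous update of both components
end
disp(theta)
```

On this data set the iteration approaches the expected theta = [0;1].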
After the change
theta =
