Speeding up loop-rich Matlab function to calculate temperature distribution - performance

I would like to speed up this function as much as possible in Matlab.
This is part of a bigger simulation project, and as it is one of the most called functions within the simulation, this is crucial.
For now, I tried generating a MEX file, but the speed was not better.
Vectorizing seems difficult (but would be beneficial due to the nested loops), given the non-linear operations.
function y = mixing(T,dis,rr,n)
%% ===================================================================
% input: temperature of cells array T, distance array dis, number of cells
% n, mixing ratio r
%
% output: new temperature array
%
% purpose: calculates the temperature array of next timestep
% ===================================================================
for j = 1:n
i = 1;
r = rr;
while i < dis(j)+1 && j+i <= n
if (dis(j) < i)
r = r*(dis(j)-floor(dis(j)));
end
d = T(j+i-1);
T(j+i-1) = r*T(j+i) + (1-r)*T(j+i-1);
T(j+i) = r*d + (1-r)*T(j+i);
i = i + 1;
end
end
y = T;
end
Any ideas on how to speed-up this Matlab function?
Inputs: T is a 10x1 double, dis is a 10x1 double, rr is a 1x1 double, and n is a 1x1 integer value.
Example values: T = random('unif',55,65,10,1); dis = repmat(0.1,10,1); rr = rand; n = 10;
What I'm trying to compute with this is the degree of temperature mixing between water layers, given by the equations for T(j+i-1) and T(j+i).
This must be calculated for all , this is for all the layers, at all timesteps (note that are the total number of water layers).

Related

Exponential fractional values for power function

I am using below Matlab code to calculate power function [ without using built-in function] to calculate power = b^e.
At the moment , I am unable to get power function going that support fractional exponential values b^(1/2) = sqrt(b) or 3.4 ^ (1/4) to calculate power due inefficient approach , because it loops e times. I need help in efficient logic for fractional exponent.
Thank you
b = [-32:32]; % example input values
e = [-3:3]; % example input values but doesn't support fraction's
power_function(b,e)
p = 1;
if e < 0
e = abs(e);
multiplier = 1/b;
else
multiplier = b;
end
for k = 1:e
p(:) = p * multiplier; % n-th root for any given number
end

Efficiently Calculate Frequency Averaged Periodogram Using GPU

In Matlab I am looking for a way to most efficiently calculate a frequency averaged periodogram on a GPU.
I understand that the most important thing is to minimise for loops and use the already built in GPU functions. However my code still feels relatively unoptimised and I was wondering what changes I can make to it to gain a better speed up.
r = 5; % Dimension
n = 100; % Time points
m = 20; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
Pdgm = zeros(r, r, n/2 + 1);
for j = 1:n/2 + 1
Pdgm(:,:,j) = FT(:,j)*FT(:,j)';
end
% Finally smooth with our weights
SmPdgm = zeros(r, r, n/2 + 1);
% Take advantage of the GPU filter function
% Create new Periodogram WrapPdgm with m/2 values wrapped around in front and
% behind it (it seems like there is redundancy here)
WrapPdgm = zeros(r,r,n/2 + 1 + m);
WrapPdgm(:,:,m/2+1:n/2+m/2+1) = Pdgm;
WrapPdgm(:,:,1:m/2) = flip(Pdgm(:,:,2:m/2+1),3);
WrapPdgm(:,:,n/2+m/2+2:end) = flip(Pdgm(:,:,n/2-m/2+1:end-1),3);
% Perform filtering
for i = 1:r
for j = 1:r
temp = filter(w, [1], WrapPdgm(i,j,:));
SmPdgm(i,j,:) = temp(:,:,m+1:end);
end
end
In particular, I couldn't see a way to optimise out the for loop when calculating the initial Pdgm from the Fourier transformed data and I feel the trick I play with the WrapPdgm in order to take advantage of filter() on the GPU feels unnecessary if there were a smooth function instead.
Solution Code
This seems to be pretty efficient as benchmark runtimes in the next section might convince us -
%// Select the portion of FT to be processed and
%// send copy to GPU for calculating everything
gFT = gpuArray(FT(:,1:n/2 + 1));
%// Perform non-smoothed Periodogram, thus removing the first loop
Pdgm1 = bsxfun(#times,permute(gFT,[1 3 2]),permute(conj(gFT),[3 1 2]));
%// Generate WrapPdgm right on GPU
WrapPdgm1 = zeros(r,r,n/2 + 1 + m,'gpuArray');
WrapPdgm1(:,:,m/2+1:n/2+m/2+1) = Pdgm1;
WrapPdgm1(:,:,1:m/2) = Pdgm1(:,:,m/2+1:-1:2);
WrapPdgm1(:,:,n/2+m/2+2:end) = Pdgm1(:,:,end-1:-1:n/2-m/2+1);
%// Perform filtering on GPU and get the final output, SmPdgm1
filt_data = filter(w,1,reshape(WrapPdgm1,r*r,[]),[],2);
SmPdgm1 = gather(reshape(filt_data(:,m+1:end),r,r,[]));
Benchmarking
Benchmarking Code
%// Input parameters
r = 50; % Dimension
n = 1000; % Time points
m = 200; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
tic, %// ... Code from original approach, toc
tic %// ... Code from proposed approach, toc
Runtime results thus obtained on GPU, GTX 750 Ti against CPU, I-7 4790K -
------------------------------ With Original Approach on CPU
Elapsed time is 0.279816 seconds.
------------------------------ With Proposed Approach on GPU
Elapsed time is 0.169969 seconds.
To get rid of the first loop you can do the following:
Pdgm_cell = cellfun(#(x) x * x', mat2cell(FT(:, 1 : 51), [5], ones(51, 1)), 'UniformOutput', false);
Pdgm = reshape(cell2mat(Pdgm_cell),5,5,[]);
Then in your filter you can do the following:
temp = filter(w, 1, WrapPdgm, [], 3);
SmPdgm = temp(:, :, m + 1 : end);
The 3 lets the filter know to operate along the 3rd dimension of your data.
You can use pagefun on the GPU for the first loop. (Note that the implementation of cellfun is basically a hidden loop, whereas pagefun runs natively on the GPU using a batched GEMM operation). Here's how:
n = 16;
r = 8;
X = gpuArray.rand(r, n);
R = gpuArray.zeros(r, r, n/2 + 1);
for jj = 1:(n/2+1)
R(:,:,jj) = X(:,jj) * X(:,jj)';
end
X2 = X(:,1:(n/2+1));
R2 = pagefun(#mtimes, reshape(X2, r, 1, []), reshape(X2, 1, r, []));
R - R2

Compute double sum in matlab efficiently?

I am looking for an optimal way to program this summation ratio. As input I have two vectors v_mn and x_mn with (M*N)x1 elements each.
The ratio is of the form:
The vector x_mn is 0-1 vector so when x_mn=1, the ration is r given above and when x_mn=0 the ratio is 0.
The vector v_mn is a vector which contain real numbers.
I did the denominator like this but it takes a lot of times.
function r_ij = denominator(v_mn, M, N, i, j)
%here x_ij=1, to get r_ij.
S = [];
for m = 1:M
for n = 1:N
if (m ~= i)
if (n ~= j)
S = [S v_mn(i, n)];
else
S = [S 0];
end
else
S = [S 0];
end
end
end
r_ij = 1+S;
end
Can you give a good way to do it in matlab. You can ignore the ratio and give me the denominator which is more complicated.
EDIT: I am sorry I did not write it very good. The i and j are some numbers between 1..M and 1..N respectively. As you can see, the ratio r is many values (M*N values). So I calculated only the value i and j. More precisely, I supposed x_ij=1. Also, I convert the vectors v_mn into a matrix that's why I use double index.
If you reshape your data, your summation is just a repeated matrix/vector multiplication.
Here's an implementation for a single m and n, along with a simple speed/equality test:
clc
%# some arbitrary test parameters
M = 250;
N = 1000;
v = rand(M,N); %# (you call it v_mn)
x = rand(M,N); %# (you call it x_mn)
m0 = randi(M,1); %# m of interest
n0 = randi(N,1); %# n of interest
%# "Naive" version
tic
S1 = 0;
for mm = 1:M %# (you call this m')
if mm == m0, continue; end
for nn = 1:N %# (you call this n')
if nn == n0, continue; end
S1 = S1 + v(m0,nn) * x(mm,nn);
end
end
r1 = v(m0,n0)*x(m0,n0) / (1+S1);
toc
%# MATLAB version: use matrix multiplication!
tic
ninds = [1:m0-1 m0+1:M];
minds = [1:n0-1 n0+1:N];
S2 = sum( x(minds, ninds) * v(m0, ninds).' );
r2 = v(m0,n0)*x(m0,n0) / (1+S2);
toc
%# Test if values are equal
abs(r1-r2) < 1e-12
Outputs on my machine:
Elapsed time is 0.327004 seconds. %# loop-version
Elapsed time is 0.002455 seconds. %# version with matrix multiplication
ans =
1 %# and yes, both are equal
So the speedup is ~133×
Now that's for a single value of m and n. To do this for all values of m and n, you can use an (optimized) double loop around it:
r = zeros(M,N);
for m0 = 1:M
xx = x([1:m0-1 m0+1:M], :);
vv = v(m0,:).';
for n0 = 1:N
ninds = [1:n0-1 n0+1:N];
denom = 1 + sum( xx(:,ninds) * vv(ninds) );
r(m0,n0) = v(m0,n0)*x(m0,n0)/denom;
end
end
which completes in ~15 seconds on my PC for M = 250, N= 1000 (R2010a).
EDIT: actually, with a little more thought, I was able to reduce it all down to this:
denom = zeros(M,N);
for mm = 1:M
xx = x([1:mm-1 mm+1:M],:);
denom(mm,:) = sum( xx*v(mm,:).' ) - sum( bsxfun(#times, xx, v(mm,:)) );
end
denom = denom + 1;
r_mn = x.*v./denom;
which completes in less than 1 second for N = 250 and M = 1000 :)
For a start you need to pre-alocate your S matrix. It changes size every loop so put
S = zeros(m*n, 1)
at the start of your function. This will also allow you to do away with your else conditional statements, ie they will reduce to this:
if (m ~= i)
if (n ~= j)
S(m*M + n) = v_mn(i, n);
Otherwise since you have to visit every element im afraid it may not be able to get much faster.
If you desperately need more speed you can look into doing some mex coding which is code in c/c++ but run in matlab.
http://www.mathworks.com.au/help/matlab/matlab_external/introducing-mex-files.html
Rather than first jumping into vectorization of the double loop, you may want modify the above to make sure that it does what you want. In this code, there is no summing of the data, instead a vector S is being resized at each iteration. As well, the signature could include the matrices V and X so that the multiplication occurs as in the formula (rather than just relying on the value of X to be zero or one, let us pass that matrix in).
The function could look more like the following (I've replaced the i,j inputs with m,n to be more like the equation):
function result = denominator(V,X,m,n)
% use the size of V to determine M and N
[M,N] = size(V);
% initialize the summed value to one (to account for one at the end)
result = 1;
% outer loop
for i=1:M
% ignore the case where m==i
if i~=m
for j=1:N
% ignore the case where n==j
if j~=n
result = result + V(m,j)*X(i,j);
end
end
end
end
Note how the first if is outside of the inner for loop since it does not depend on j. Try the above and see what happens!
You can vectorize from within Matlab to speed up your calculations. Every time you use an operation like ".^" or ".*" or any matrix operation for that matter, Matlab will do them in parallel, which is much, much faster than iterating over each item.
In this case, look at what you are doing in terms of matrices. First, in your loop you are only dealing with the mth row of $V_{nm}$, which we can use as a vector for itself.
If you look at your formula carefully, you can figure out that you almost get there if you just write this row vector as a column vector and multiply the matrix $X_{nm}$ to it from the left, using standard matrix multiplication. The resulting vector contains the sums over all n. To get the final result, just sum up this vector.
function result = denominator_vectorized(V,X,m,n)
% get the part of V with the first index m
Vm = V(m,:)';
% remove the parts of X you don't want to iterate over. Note that, since I
% am inside the function, I am only editing the value of X within the scope
% of this function.
X(m,:) = 0;
X(:,n) = 0;
%do the matrix multiplication and the summation at once
result = 1-sum(X*Vm);
To show you how this optimizes your operation, I will compare it to the code proposed by another commenter:
function result = denominator(V,X,m,n)
% use the size of V to determine M and N
[M,N] = size(V);
% initialize the summed value to one (to account for one at the end)
result = 1;
% outer loop
for i=1:M
% ignore the case where m==i
if i~=m
for j=1:N
% ignore the case where n==j
if j~=n
result = result + V(m,j)*X(i,j);
end
end
end
end
The test:
V=rand(10000,10000);
X=rand(10000,10000);
disp('looped version')
tic
denominator(V,X,1,1)
toc
disp('matrix operation')
tic
denominator_vectorized(V,X,1,1)
toc
The result:
looped version
ans =
2.5197e+07
Elapsed time is 4.648021 seconds.
matrix operation
ans =
2.5197e+07
Elapsed time is 0.563072 seconds.
That is almost ten times the speed of the loop iteration. So, always look out for possible matrix operations in your code. If you have the Parallel Computing Toolbox installed and a CUDA-enabled graphics card installed, Matlab will even perform these operations on your graphics card without any further effort on your part!
EDIT: That last bit is not entirely true. You still need to take a few steps to do operations on CUDA hardware, but they aren't a lot. See Matlab documentation.

Improving performance of interpolation (Barycentric formula)

I have been given an assignment in which I am supposed to write an algorithm which performs polynomial interpolation by the barycentric formula. The formulas states that:
p(x) = (SIGMA_(j=0 to n) w(j)*f(j)/(x - x(j)))/(SIGMA_(j=0 to n) w(j)/(x - x(j)))
I have written an algorithm which works just fine, and I get the polynomial output I desire. However, this requires the use of some quite long loops, and for a large grid number, lots of nastly loop operations will have to be done. Thus, I would appreciate it greatly if anyone has any hints as to how I may improve this, so that I will avoid all these loops.
In the algorithm, x and f stand for the given points we are supposed to interpolate. w stands for the barycentric weights, which have been calculated before running the algorithm. And grid is the linspace over which the interpolation should take place:
function p = barycentric_formula(x,f,w,grid)
%Assert x-vectors and f-vectors have same length.
if length(x) ~= length(f)
sprintf('Not equal amounts of x- and y-values. Function is terminated.')
return;
end
n = length(x);
m = length(grid);
p = zeros(1,m);
% Loops for finding polynomial values at grid points. All values are
% calculated by the barycentric formula.
for i = 1:m
var = 0;
sum1 = 0;
sum2 = 0;
for j = 1:n
if grid(i) == x(j)
p(i) = f(j);
var = 1;
else
sum1 = sum1 + (w(j)*f(j))/(grid(i) - x(j));
sum2 = sum2 + (w(j)/(grid(i) - x(j)));
end
end
if var == 0
p(i) = sum1/sum2;
end
end
This is a classical case for matlab 'vectorization'. I would say - just remove the loops. It is almost that simple. First, have a look at this code:
function p = bf2(x, f, w, grid)
m = length(grid);
p = zeros(1,m);
for i = 1:m
var = grid(i)==x;
if any(var)
p(i) = f(var);
else
sum1 = sum((w.*f)./(grid(i) - x));
sum2 = sum(w./(grid(i) - x));
p(i) = sum1/sum2;
end
end
end
I have removed the inner loop over j. All I did here was in fact removing the (j) indexing and changing the arithmetic operators from / to ./ and from * to .* - the same, but with a dot in front to signify that the operation is performed on element by element basis. This is called array operators in contrast to ordinary matrix operators. Also note that treating the special case where the grid points fall onto x is very similar to what you had in the original implementation, only using a vector var such that x(var)==grid(i).
Now, you can also remove the outermost loop. This is a bit more tricky and there are two major approaches how you can do that in MATLAB. I will do it the simpler way, which can be less efficient, but more clear to read - using repmat:
function p = bf3(x, f, w, grid)
% Find grid points that coincide with x.
% The below compares all grid values with all x values
% and returns a matrix of 0/1. 1 is in the (row,col)
% for which grid(row)==x(col)
var = bsxfun(#eq, grid', x);
% find the logical indexes of those x entries
varx = sum(var, 1)~=0;
% and of those grid entries
varp = sum(var, 2)~=0;
% Outer-most loop removal - use repmat to
% replicate the vectors into matrices.
% Thus, instead of having a loop over j
% you have matrices of values that would be
% referenced in the loop
ww = repmat(w, numel(grid), 1);
ff = repmat(f, numel(grid), 1);
xx = repmat(x, numel(grid), 1);
gg = repmat(grid', 1, numel(x));
% perform the calculations element-wise on the matrices
sum1 = sum((ww.*ff)./(gg - xx),2);
sum2 = sum(ww./(gg - xx),2);
p = sum1./sum2;
% fix the case where grid==x and return
p(varp) = f(varx);
end
The fully vectorized version can be implemented with bsxfun rather than repmat. This can potentially be a bit faster, since the matrices are not explicitly formed. However, the speed difference may not be large for small system sizes.
Also, the first solution with one loop is also not too bad performance-wise. I suggest you test those and see, what is better. Maybe it is not worth it to fully vectorize? The first code looks a bit more readable..

Algorithm to express elements of a matrix as a vector

Statement of Problem:
I have an array M with m rows and n columns. The array M is filled with non-zero elements.
I also have a vector t with n elements, and a vector omega
with m elements.
The elements of t correspond to the columns of matrix M.
The elements of omega correspond to the rows of matrix M.
Goal of Algorithm:
Define chi as the multiplication of vector t and omega. I need to obtain a 1D vector a, where each element of a is a function of chi.
Each element of chi is unique (i.e. every element is different).
Using mathematics notation, this can be expressed as a(chi)
Each element of vector a corresponds to an element or elements of M.
Matlab code:
Here is a code snippet showing how the vectors t and omega are generated. The matrix M is pre-existing.
[m,n] = size(M);
t = linspace(0,5,n);
omega = linspace(0,628,m);
Conceptual Diagram:
This appears to be a type of integration (if this is the right word for it) along constant chi.
Reference:
Link to reference
The algorithm is not explicitly stated in the reference. I only wish that this algorithm was described in a manner reminiscent of computer science textbooks!
Looking at Figure 11.5, the matrix M is Figure 11.5(a). The goal is to find an algorithm to convert Figure 11.5(a) into 11.5(b).
It appears that the algorithm is a type of integration (averaging, perhaps?) along constant chi.
It appears to me that reshape is the matlab function you need to use. As noted in the link:
B = reshape(A,siz) returns an n-dimensional array with the same elements as A, but reshaped to siz, a vector representing the dimensions of the reshaped array.
That is, create a vector siz with the number m*n in it, and say A = reshape(P,siz), where P is the product of vectors t and ω; or perhaps say something like A = reshape(t*ω,[m*n]). (I don't have matlab here, or would run a test to see if I have the product the right way around.) Note, the link does not show an example with one number (instead of several) after the matrix parameter to reshape, but I would expect from the description that A = reshape(t*ω,m*n) might also work.
You should add a pseudocode or a link to the algorithm you want to implement. From what I could understood I have developed the following code anyway:
M = [1 2 3 4; 5 6 7 8; 9 10 11 12]' % easy test M matrix
a = reshape(M, prod(size(M)), 1) % convert M to vector 'a' with reshape command
[m,n] = size(M); % Your sample code
t = linspace(0,5,n); % Your sample code
omega = linspace(0,628,m); % Your sample code
for i=1:length(t)
for j=1:length(omega) % Acces a(chi) in the desired order
chi = length(omega)*(i-1)+j;
t(i) % related t value
omega(j) % related omega value
a(chi) % related a(chi) value
end
end
As you can see, I also think that the reshape() function is the solution to your problems. I hope that this code helps,
The basic idea is to use two separate loops. The outer loop is over the chi variable values, whereas the inner loop is over the i variable values. Referring to the above diagram in the original question, the i variable corresponds to the x-axis (time), and the j variable corresponds to the y-axis (frequency). Assuming that the chi, i, and j variables can take on any real number, bilinear interpolation is then used to find an amplitude corresponding to an element in matrix M. The integration is just an averaging over elements of M.
The following code snippet provides an overview of the basic algorithm to express elements of a matrix as a vector using the spectral collapsing from 2D to 1D. I can't find any reference for this, but it is a solution that works for me.
% Amp = amplitude vector corresponding to Figure 11.5(b) in book reference
% M = matrix corresponding to the absolute value of the complex Gabor transform
% matrix in Figure 11.5(a) in book reference
% Nchi = number of chi in chi vector
% prod = product of timestep and frequency step
% dt = time step
% domega = frequency step
% omega_max = maximum angular frequency
% i = time array element along x-axis
% j = frequency array element along y-axis
% current_i = current time array element in loop
% current_j = current frequency array element in loop
% Nchi = number of chi
% Nivar = number of i variables
% ivar = i variable vector
% calculate for chi = 0, which only occurs when
% t = 0 and omega = 0, at i = 1
av0 = mean( M(1,:) );
av1 = mean( M(2:end,1) );
av2 = mean( [av0 av1] );
Amp(1) = av2;
% av_val holds the sum of all values that have been averaged
av_val_sum = 0;
% loop for rest of chi
for ccnt = 2:Nchi % 2:Nchi
av_val_sum = 0; % reset av_val_sum
current_chi = chi( ccnt ); % current value of chi
% loop over i vector
for icnt = 1:Nivar % 1:Nivar
current_i = ivar( icnt );
current_j = (current_chi / (prod * (current_i - 1))) + 1;
current_t = dt * (current_i - 1);
current_omega = domega * (current_j - 1);
% values out of range
if(current_omega > omega_max)
continue;
end
% use bilinear interpolation to find an amplitude
% at current_t and current_omega from matrix M
% f_x_y is the bilinear interpolated amplitude
% Insert bilinear interpolation code here
% add to running sum
av_val_sum = av_val_sum + f_x_y;
end % icnt loop
% compute the average over all i
av = av_val_sum / Nivar;
% assign the average to Amp
Amp(ccnt) = av;
end % ccnt loop

Resources