Increasing computation speed in Fortran - performance

I have some basic pseudocode as follows:
PROGRAM PSUEDOEXAMPLE
IMPLICIT NONE
!Define all types
!Load some data into arrays: Array_I, Array_J
!Do Loops
do i = 1,10
xi = Array_I(i,1)
yi = Array_I(i,2)
zi = Array_I(i,3)
do j =1,10
xj = Array_J(j,1)
yj = Array_J(j,2)
zj = Array_J(j,3)
separation = ((xi - xj)**2 + (yi-yj)**2 +(zi-zj)**2)**0.5
enddo
enddo
END PROGRAM PSUEDOEXAMPLE
I have timed a single i-step at ~0.3 seconds. What are the best ways to reduce this time? I can see that removing the square root would potentially be effective. I am using gfortran as my compiler.

Related

ODE with time dependent input, How to speed Up without using interpolation?

I am trying to solve a system of ODEs and my input excitation is a function of time.
I have been using interp1 inside the integration function, but this doesn't seem like a very efficient way to do this. I know it is not, because once I change the input excitation to a sin function, which does not require an interp1 call inside the function, I get much faster results. Doing the interpolation at every step takes about 10–20 times longer to converge. So, is there a better way of solving ODEs for arbitrary time-dependent excitation that avoids interpolation, or some other trick to speed this up?
I am just copying a modified version of a simple example from The MathWorks here:
Input Excitation is a gradually increasing sin function, but after some time later it becomes a constant amplitude sin function.
Dt = 0.01; % sampling time step
Amp0 = 2; % Final Amplitude of signal
Dur_G = 10; % Duration of gradually increasing part of signal
Dur_tot = 25; % Duration of total signal
t_G = 0 : Dt : Dur_G; % time of gradual part
A = linspace(0, Amp0, length(t_G));
carrier_1 = sin(5*t_G); % Unit Normal Signal
carrier_A0 = Amp0*sin(5*t_G);
out_G = A.*carrier_1; % Gradually Increasing Signal
% Total Signal with Gradual Constant Amplitude Parts
t_C = Dur_G+Dt:Dt:Dur_tot; % time of constant part
out_C = Amp0*sin(5*t_C); % Signal of constant part
ft = [t_G t_C]; % total time
f = [out_G out_C]; % total signal
figure; plot(ft, f, '-b'); % input excitation
function dydt = myode(t,y,ft,f)
f = interp1(ft,f,t); % Interpolate the data set (ft,f) at time t
g = 2; % a constant
dydt = -f.*y + g; % Evaluate ODE at time t
tspan = [1 5]; ic = 1;
opts = odeset('RelTol',1e-2,'AbsTol',1e-4);
[t,y] = ode45(@(t,y) myode(t,y,ft,f), tspan, ic, opts);
figure;
plot(t,y);
Note that I explained only first part of my problem above, which is solving system for a gradually increasing sin function.
In the second part, I need to solve it for an arbitrary input excitation (e.g., a ground acceleration input).
For this example, you could use the griddedInterpolant class to get a bit of a speed-up:
ft = linspace(0,5,25);
f = ft.^2 - ft - 3;
Fp = griddedInterpolant(ft,f);
gt = linspace(1,6,25);
g = 3*sin(gt-0.25);
Gp = griddedInterpolant(gt,g);
tspan = [1 5];
ic = 1;
opts = odeset('RelTol',1e-2,'AbsTol',1e-4);
[t,y] = ode45(@(t,y)myode(t,y,Fp,Gp),tspan,ic,opts);
figure;
plot(t,y);
The ODE function is then:
function dydt = myode(t,y,Fp,Gp)
f = Fp(t); % Interpolate the data set (ft,f) at time t
g = Gp(t); % Interpolate the data set (gt,g) at time t
dydt = -f.*y + g; % Evaluate ODE at time t
On my system with R2015b, the call to ode45 is about three times faster (0.011 sec vs. 0.035 sec) for your example. You could get a bit more speed by switching to ode23. You can read more about the griddedInterpolant class here.
If your actual system discretely switches between inputs at particular points in time, then you should probably solve the problem piecewise by integrating each case separately. See this question and this question. If the system switches based on the value of the state variable(s), then you should use event location (see this question). However, if "solving ODEs for random time dependent excitation" means that you're adding random noise to the system, then you have an SDE rather than an ODE, which is a completely different beast.
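For the gradually increasing sine in the question, the piecewise approach might look like the sketch below. It assumes both parts of the input can be written in closed form (the handles f_ramp and f_const are illustrative names, not taken from the question), so no interpolation is needed inside the ODE function. The second integration restarts from the final state of the first.

```matlab
% Parameters taken from the question; g is the constant from myode.
Amp0 = 2; Dur_G = 10; Dur_tot = 25; g = 2;

% Assumed closed forms of the two input segments (hypothetical names):
f_ramp  = @(t) (Amp0/Dur_G) .* t .* sin(5*t);  % amplitude grows linearly to Amp0
f_const = @(t) Amp0 .* sin(5*t);               % constant amplitude afterwards

opts = odeset('RelTol',1e-2,'AbsTol',1e-4);

% Integrate each segment separately; the switch time is hit exactly.
[t1,y1] = ode45(@(t,y) -f_ramp(t).*y  + g, [0 Dur_G],       1,       opts);
[t2,y2] = ode45(@(t,y) -f_const(t).*y + g, [Dur_G Dur_tot], y1(end), opts);

% Concatenate, dropping the duplicated point at t = Dur_G.
t = [t1; t2(2:end)];
y = [y1; y2(2:end)];
```

For measured data such as a ground-acceleration record there is no closed form, so this only helps where the input switches between known expressions.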

MATLAB: readable code vs optimized code

So, I want to know if making the code easier to read slows performance in Matlab.
function V = example(t, I)
a = 10;
b = 20;
c = 0.5;
V = zeros(1, length(t));
V(1) = 0;
delta_t = t(2) - t(1);
for i=1:length(t)-1
V(i+1) = V(i) + delta_t*feval(@V_prime,a,b,c,t(i));
end;
So, this function is just an example of an Euler method. The idea is that I name the constants a, b, c and define a function for the derivative. This basically makes the code easier to read. What I want to know is whether declaring a, b, c slows down my code. Also, for performance, would it be better to put the equation of the derivative (V_prime) directly into the update instead of calling it?
Following this mindset the code would look something like this.
function V = example(t, I)
V = zeros(1, length(t));
V(1) = 0;
delta_t = t(2) - t(1);
for i=1:length(t)-1
V(i+1) = V(i) + delta_t*(((10 + t(i)*3)/20)+0.5);
end
Also from what I've read, Matlab performs better when the code is vectorized, would that be the case in my code?
EDIT:
So, here is my actual code that I am working on:
function [V, u] = Izhikevich_CA1_Imp(t, I_amp, t_inj)
vr = -61.8; % resting potential (mV)
vt = -57.0; % threshold potential (mV)
c = -65.8; % reset membrane potential (mV)
vpeak = 22.6; % membrane voltage cutoff
khigh = 3.3; % nS/mV
klow = 0.1; % nS/mV
C = 115; % Membrane capacitance (pF)
a = 0.0012; % 1/ms
b = 3; % nS
d = 10; % pA
V = zeros(1, length(t));
V(1) = vr; u = 0; % initial values
span = length(t)-1;
delta_t = t(2) - t(1);
for i=1:span
if (V(i) <= vt)
k = klow;
else
k = khigh;
end;
if ((t(i) >= t_inj(1)) && (t(i) <= t_inj(2)))
I_inj = I_amp;
else I_inj = 0;
end;
V(i+1) = V(i) + delta_t*((k*(V(i)-vr)*(V(i)-vt)-u(i)+I_inj)/C);
u(i+1) = u(i) + delta_t*(a*(b*(V(i)-vr)-u(i)));
if (V(i+1) >= vpeak)
V(i+1) = c;
V(i) = vpeak;
u(i+1) = u(i+1) + d;
end;
end;
plot(t,V);
Since I didn't have any training in Matlab (learned by trying and failing), I have my C mindset of programming, and from what I understand, Matlab code should be vectorized.
Eventually I will start working with bigger functions, so performance will be a concern. Now my goal is to vectorize this code.
Usually it is faster.
Especially if you replace looped function calls (like plot()), you will see a significant increase in performance.
In one of my past projects, I had to optimize a program that was written with ordinary loop constructs (for, while, etc.). Using vectorization, I reached a tenfold increase in performance, which is quite notable.
I would suggest using vectorisation instead of loops most of the time.
In MATLAB you should basically forget the mindset coming from low-level C programming.
In my experience the first rule for achieving performance in matlab is to avoid loops and use built-in vectorized functions as much as possible. In general, you should try to avoid direct access to array elements like array(i).
Implementing your own ODE solver inevitably leads to very slow execution because in this case there is really no way to avoid the aforementioned things, even if your implementation is per se fine (like in your case). I strongly advise relying on MATLAB's ODE solvers, which are highly optimized blocks of compiled code and much faster than any interpreted MATLAB code you can write.
In my opinion this goes along with readability of the code as well, at least for the trivial reason that you get a shorter code... but I guess it is also a matter of personal taste.
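As a concrete case, the first example in the question can be vectorized because its derivative depends only on t and not on V, so the Euler recursion collapses into a cumulative sum. This is a minimal sketch, assuming V_prime(a,b,c,t) = (a + 3*t)/b + c as in the inlined version of the question; the time grid is made up for illustration.

```matlab
a = 10; b = 20; c = 0.5;
t = 0:0.01:1;                          % illustrative time grid (assumption)
delta_t = t(2) - t(1);
V_prime = @(a,b,c,t) (a + 3*t)/b + c;  % assumed form of the derivative

% Loop version, as in the question.
V_loop = zeros(1, length(t));
for i = 1:length(t)-1
    V_loop(i+1) = V_loop(i) + delta_t*V_prime(a,b,c,t(i));
end

% Vectorized version: each step adds delta_t*V_prime(t(i)), so V is a
% running sum of those increments, starting from V(1) = 0.
V_vec = [0, cumsum(delta_t * V_prime(a,b,c,t(1:end-1)))];

disp(max(abs(V_loop - V_vec)));        % difference is at round-off level
```

The Izhikevich code from the edit cannot be rewritten this way, because every step depends on the previous values of V and u and on the reset condition.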

Compute double sum in matlab efficiently?

I am looking for an optimal way to program this summation ratio. As input I have two vectors v_mn and x_mn with (M*N)x1 elements each.
The ratio is of the form:
$$r_{mn} = \frac{v_{mn}\, x_{mn}}{1 + \sum_{m' \neq m} \sum_{n' \neq n} v_{mn'}\, x_{m'n'}}$$
The vector x_mn is a 0-1 vector, so when x_mn=1 the ratio is r given above, and when x_mn=0 the ratio is 0.
The vector v_mn is a vector which contains real numbers.
I did the denominator like this, but it takes a lot of time.
function r_ij = denominator(v_mn, M, N, i, j)
%here x_ij=1, to get r_ij.
S = [];
for m = 1:M
for n = 1:N
if (m ~= i)
if (n ~= j)
S = [S v_mn(i, n)];
else
S = [S 0];
end
else
S = [S 0];
end
end
end
r_ij = 1+S;
end
Can you give a good way to do it in Matlab? You can ignore the ratio and give me just the denominator, which is the more complicated part.
EDIT: I am sorry I did not write it very good. The i and j are some numbers between 1..M and 1..N respectively. As you can see, the ratio r is many values (M*N values). So I calculated only the value i and j. More precisely, I supposed x_ij=1. Also, I convert the vectors v_mn into a matrix that's why I use double index.
If you reshape your data, your summation is just a repeated matrix/vector multiplication.
Here's an implementation for a single m and n, along with a simple speed/equality test:
clc
%# some arbitrary test parameters
M = 250;
N = 1000;
v = rand(M,N); %# (you call it v_mn)
x = rand(M,N); %# (you call it x_mn)
m0 = randi(M,1); %# m of interest
n0 = randi(N,1); %# n of interest
%# "Naive" version
tic
S1 = 0;
for mm = 1:M %# (you call this m')
if mm == m0, continue; end
for nn = 1:N %# (you call this n')
if nn == n0, continue; end
S1 = S1 + v(m0,nn) * x(mm,nn);
end
end
r1 = v(m0,n0)*x(m0,n0) / (1+S1);
toc
%# MATLAB version: use matrix multiplication!
tic
minds = [1:m0-1 m0+1:M]; %# all rows except m0
ninds = [1:n0-1 n0+1:N]; %# all columns except n0
S2 = sum( x(minds, ninds) * v(m0, ninds).' );
r2 = v(m0,n0)*x(m0,n0) / (1+S2);
toc
%# Test if values are equal
abs(r1-r2) < 1e-12
Outputs on my machine:
Elapsed time is 0.327004 seconds. %# loop-version
Elapsed time is 0.002455 seconds. %# version with matrix multiplication
ans =
1 %# and yes, both are equal
So the speedup is ~133×
Now that's for a single value of m and n. To do this for all values of m and n, you can use an (optimized) double loop around it:
r = zeros(M,N);
for m0 = 1:M
xx = x([1:m0-1 m0+1:M], :);
vv = v(m0,:).';
for n0 = 1:N
ninds = [1:n0-1 n0+1:N];
denom = 1 + sum( xx(:,ninds) * vv(ninds) );
r(m0,n0) = v(m0,n0)*x(m0,n0)/denom;
end
end
which completes in ~15 seconds on my PC for M = 250, N= 1000 (R2010a).
EDIT: actually, with a little more thought, I was able to reduce it all down to this:
denom = zeros(M,N);
for mm = 1:M
xx = x([1:mm-1 mm+1:M],:);
denom(mm,:) = sum( xx*v(mm,:).' ) - sum( bsxfun(@times, xx, v(mm,:)) );
end
denom = denom + 1;
r_mn = x.*v./denom;
which completes in less than 1 second for M = 250 and N = 1000 :)
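The remaining loop over rows can apparently be removed as well. The sketch below assumes the denominator is 1 plus the sum over m' ~= m and n' ~= n of v(m,n')*x(m',n'), as in the loop versions above. The names colsum and A are illustrative. The idea is that the sum over m' ~= m is the column sum of x minus the entry in row m, and the sum over n' ~= n is the row sum minus the term at n.

```matlab
M = 5; N = 7;                              % small sizes for a quick check
v = rand(M,N);
x = double(rand(M,N) > 0.5);               % 0-1 matrix

colsum = sum(x, 1);                        % 1-by-N column sums of x
A = v .* bsxfun(@minus, colsum, x);        % A(m,n') = v(m,n') * sum_{m'~=m} x(m',n')
denom = 1 + bsxfun(@minus, sum(A,2), A);   % row sum of A without the n' == n term
r = x .* v ./ denom;

% Check one entry against the plain double loop.
m0 = 2; n0 = 3; S = 0;
for mm = 1:M
    if mm == m0, continue; end
    for nn = 1:N
        if nn == n0, continue; end
        S = S + v(m0,nn) * x(mm,nn);
    end
end
disp(abs(denom(m0,n0) - (1 + S)));         % difference is at round-off level
```

This needs only a handful of M-by-N array operations, so the cost grows linearly with the number of elements.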
For a start you need to preallocate your S matrix. It changes size on every iteration, so put
S = zeros(M*N, 1);
at the start of your function. This will also allow you to do away with your else branches, i.e. they will reduce to this:
if (m ~= i)
if (n ~= j)
S((m-1)*N + n) = v_mn(i, n);
Otherwise, since you have to visit every element, I'm afraid it may not be possible to make it much faster.
If you desperately need more speed you can look into doing some mex coding which is code in c/c++ but run in matlab.
http://www.mathworks.com.au/help/matlab/matlab_external/introducing-mex-files.html
Rather than first jumping into vectorization of the double loop, you may want to modify the above to make sure that it does what you want. In this code, there is no summing of the data; instead, a vector S is being resized at each iteration. As well, the signature could include the matrices V and X so that the multiplication occurs as in the formula (rather than just relying on the value of X to be zero or one, let us pass that matrix in).
The function could look more like the following (I've replaced the i,j inputs with m,n to be more like the equation):
function result = denominator(V,X,m,n)
% use the size of V to determine M and N
[M,N] = size(V);
% initialize the summed value to one (to account for one at the end)
result = 1;
% outer loop
for i=1:M
% ignore the case where m==i
if i~=m
for j=1:N
% ignore the case where n==j
if j~=n
result = result + V(m,j)*X(i,j);
end
end
end
end
Note how the first if is outside of the inner for loop since it does not depend on j. Try the above and see what happens!
You can vectorize from within Matlab to speed up your calculations. Every time you use an operation like ".^" or ".*" or any matrix operation for that matter, Matlab will do them in parallel, which is much, much faster than iterating over each item.
In this case, look at what you are doing in terms of matrices. First, in your loop you are only dealing with the mth row of $V_{nm}$, which we can use as a vector for itself.
If you look at your formula carefully, you can figure out that you almost get there if you just write this row vector as a column vector and multiply the matrix $X_{nm}$ to it from the left, using standard matrix multiplication. The resulting vector contains the sums over all n. To get the final result, just sum up this vector.
function result = denominator_vectorized(V,X,m,n)
% get the part of V with the first index m
Vm = V(m,:)';
% remove the parts of X you don't want to iterate over. Note that, since I
% am inside the function, I am only editing the value of X within the scope
% of this function.
X(m,:) = 0;
X(:,n) = 0;
%do the matrix multiplication and the summation at once
result = 1+sum(X*Vm);
To show you how this optimizes your operation, I will compare it to the code proposed by another commenter:
function result = denominator(V,X,m,n)
% use the size of V to determine M and N
[M,N] = size(V);
% initialize the summed value to one (to account for one at the end)
result = 1;
% outer loop
for i=1:M
% ignore the case where m==i
if i~=m
for j=1:N
% ignore the case where n==j
if j~=n
result = result + V(m,j)*X(i,j);
end
end
end
end
The test:
V=rand(10000,10000);
X=rand(10000,10000);
disp('looped version')
tic
denominator(V,X,1,1)
toc
disp('matrix operation')
tic
denominator_vectorized(V,X,1,1)
toc
The result:
looped version
ans =
2.5197e+07
Elapsed time is 4.648021 seconds.
matrix operation
ans =
2.5197e+07
Elapsed time is 0.563072 seconds.
That is roughly eight times the speed of the loop iteration. So, always look out for possible matrix operations in your code. If you have the Parallel Computing Toolbox installed and a CUDA-enabled graphics card installed, Matlab will even perform these operations on your graphics card without any further effort on your part!
EDIT: That last bit is not entirely true. You still need to take a few steps to do operations on CUDA hardware, but they aren't a lot. See Matlab documentation.

Improving performance of interpolation (Barycentric formula)

I have been given an assignment in which I am supposed to write an algorithm which performs polynomial interpolation by the barycentric formula. The formula states that:
$$p(x) = \frac{\sum_{j=0}^{n} \dfrac{w_j f_j}{x - x_j}}{\sum_{j=0}^{n} \dfrac{w_j}{x - x_j}}$$
I have written an algorithm which works just fine, and I get the polynomial output I desire. However, this requires the use of some quite long loops, and for a large number of grid points, lots of nasty loop operations will have to be done. Thus, I would appreciate it greatly if anyone has any hints as to how I may improve this, so that I can avoid all these loops.
In the algorithm, x and f stand for the given points we are supposed to interpolate. w stands for the barycentric weights, which have been calculated before running the algorithm. And grid is the linspace over which the interpolation should take place:
function p = barycentric_formula(x,f,w,grid)
%Assert x-vectors and f-vectors have same length.
if length(x) ~= length(f)
sprintf('Not equal amounts of x- and y-values. Function is terminated.')
return;
end
n = length(x);
m = length(grid);
p = zeros(1,m);
% Loops for finding polynomial values at grid points. All values are
% calculated by the barycentric formula.
for i = 1:m
var = 0;
sum1 = 0;
sum2 = 0;
for j = 1:n
if grid(i) == x(j)
p(i) = f(j);
var = 1;
else
sum1 = sum1 + (w(j)*f(j))/(grid(i) - x(j));
sum2 = sum2 + (w(j)/(grid(i) - x(j)));
end
end
if var == 0
p(i) = sum1/sum2;
end
end
This is a classical case for matlab 'vectorization'. I would say - just remove the loops. It is almost that simple. First, have a look at this code:
function p = bf2(x, f, w, grid)
m = length(grid);
p = zeros(1,m);
for i = 1:m
var = grid(i)==x;
if any(var)
p(i) = f(var);
else
sum1 = sum((w.*f)./(grid(i) - x));
sum2 = sum(w./(grid(i) - x));
p(i) = sum1/sum2;
end
end
end
I have removed the inner loop over j. All I did here was in fact remove the (j) indexing and change the arithmetic operators from / to ./ and from * to .* (the same, but with a dot in front to signify that the operation is performed element by element). These are called array operators, in contrast to ordinary matrix operators. Also note that treating the special case where the grid points fall onto x is very similar to what you had in the original implementation, only using a logical vector var such that x(var)==grid(i).
Now, you can also remove the outermost loop. This is a bit more tricky and there are two major approaches how you can do that in MATLAB. I will do it the simpler way, which can be less efficient, but more clear to read - using repmat:
function p = bf3(x, f, w, grid)
% Find grid points that coincide with x.
% The below compares all grid values with all x values
% and returns a matrix of 0/1. 1 is in the (row,col)
% for which grid(row)==x(col)
var = bsxfun(@eq, grid', x);
% find the logical indexes of those x entries
varx = sum(var, 1)~=0;
% and of those grid entries
varp = sum(var, 2)~=0;
% Outer-most loop removal - use repmat to
% replicate the vectors into matrices.
% Thus, instead of having a loop over j
% you have matrices of values that would be
% referenced in the loop
ww = repmat(w, numel(grid), 1);
ff = repmat(f, numel(grid), 1);
xx = repmat(x, numel(grid), 1);
gg = repmat(grid', 1, numel(x));
% perform the calculations element-wise on the matrices
sum1 = sum((ww.*ff)./(gg - xx),2);
sum2 = sum(ww./(gg - xx),2);
p = sum1./sum2;
% fix the case where grid==x and return
p(varp) = f(varx);
end
The fully vectorized version can be implemented with bsxfun rather than repmat. This can potentially be a bit faster, since the matrices are not explicitly formed. However, the speed difference may not be large for small system sizes.
Also, the first solution with one loop is not too bad performance-wise. I suggest you test both and see which is better. Maybe it is not worth it to fully vectorize? The first code looks a bit more readable.
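The bsxfun variant mentioned above could be sketched as follows. This is an illustration under assumptions rather than a tuned implementation: x, f and w are taken to be row vectors, the test data (four nodes sampling x.^2) is made up, and the weights are computed from the standard definition w(j) = 1/prod(x(j) - x(k)) over k ~= j.

```matlab
% Made-up test data: a quadratic sampled at four nodes.
x = [0 1 2 3];
f = x.^2;
n = numel(x);
w = zeros(1, n);
for j = 1:n
    w(j) = 1 / prod(x(j) - x([1:j-1, j+1:n]));   % barycentric weights
end
grid = linspace(0, 3, 7);              % evaluation points (some coincide with x)

d = bsxfun(@minus, grid(:), x);        % d(i,j) = grid(i) - x(j), m-by-n
T = bsxfun(@rdivide, w, d);            % T(i,j) = w(j) / (grid(i) - x(j))
p = (T * f(:)) ./ sum(T, 2);           % numerator and denominator sums per row

% Rows where a grid point hits a node are Inf/NaN; overwrite them with f.
[row, col] = find(d == 0);
p(row) = f(col);
p = p.';                               % back to a row vector

disp(max(abs(p - grid.^2)));           % interpolant reproduces the quadratic
```

Here the numerator sum is written as a matrix-vector product, so only the m-by-n matrices d and T are ever formed.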

Speeding up MATLAB code for FDR estimation

I have 2 input variables:
a vector of p-values (p) with N elements (unsorted)
and N x M matrix with p-values obtained by random permutations (pr) with M iterations. N is quite large, 10K to 100K or more. M let's say 100.
I'm estimating the False Discovery Rate (FDR) for each element of p representing how many p-values from random permutations will pass if the current p-value (from p) will be the threshold.
I wrote the function with ARRAYFUN, but it takes a lot of time for large N (2 min for N=20K), comparable to a for-loop.
function pfdr = fdr_from_random_permutations(p, pr)
%# ... skipping arguments checks
pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
Any ideas how to make it faster?
Comments about statistical issues here are also welcome.
The test data can be generated as p = rand(N,1); pr = rand(N,M);.
Well, the trick was indeed sorting the vectors. I give credit to @EgonGeerardyn for that. Also, there is no need to use mean. You can just divide everything afterwards by M. When p is sorted, the number of values that are less than or equal to the current x is just a running index. pr is a more interesting case: I used a running index called place to find how many elements are less than or equal to x.
Edit(2): Here is the fastest version I came up with:
function Speedup2()
N = 10000/4 ;
M = 100/4 ;
p = rand(N,1); pr = rand(N,M);
tic
pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
toc
tic
out = zeros(numel(p),1);
[p,sortIndex] = sort(p);
pr = sort(pr(:));
pr(end+1) = Inf;
place = 1;
N = numel(pr);
for i=1:numel(p)
x = p(i);
while pr(place)<=x
place = place+1;
end
exp1a = place-1;
exp2 = i;
out(i) = exp1a/exp2;
end
out(sortIndex) = out/ M;
toc
disp(max(abs(pfdr-out)));
end
And the benchmark results for N = 10000/4 ; M = 100/4 :
Elapsed time is 0.898689 seconds.
Elapsed time is 0.007697 seconds.
2.220446049250313e-016
and for N = 10000 ; M = 100 ;
Elapsed time is 39.730695 seconds.
Elapsed time is 0.088870 seconds.
2.220446049250313e-016
First of all, try to analyze this using the profiler. Profiling should ALWAYS be the first step when trying to improve performance. We can all guess at what is causing your performance drop, but the only way to be sure and focus on the right part is to inspect the profiler report.
I didn't run the profiler on your code, as I don't want to generate test data to do so; but I have some ideas about what work is being carried out in vain. In your function mean(sum(pr<=x))./sum(p<=x), you are repeatedly summing over p<=x. All in all, one call includes N comparisons and N-1 summations. So for both, you have behavior that is quadratic in N when all N values of p are calculated.
If you step through a sorted version of p, you need fewer calculations and comparisons, as you can keep track of a running sum (i.e. behavior that is linear in N). I guess a similar method could be applied to the other part of the calculation.
edit:
The implementation of my idea as expressed above:
function pfdr = fdr(p,pr)
[N, M] = size(pr);
[p, idxP] = sort(p);
[pr] = sort(pr(:));
pfdr = NaN(N,1);
parfor iP = 1:N
x = p(iP);
m = sum(pr<=x)/M;
pfdr(iP) = m/iP;
end
pfdr(idxP) = pfdr;
If you have access to the Parallel Computing Toolbox, the parfor loop will allow you to gain some performance. I used two basic ideas. First, mean(sum(pr<=x)) is actually equal to sum(pr(:)<=x)/M. Second, since p is sorted, you can just take the index as the number of elements (on the assumption that every element is unique; otherwise you'll have to work with unique to do the full rigorous analysis).
As you should already know very well by running the profiler yourself, the line m = sum(pr<=x)/M; is the main resource hog. This can be tackled similarly to p by making use of the sorted nature of pr.
I tested my code (both for identical results and for time consumption) against yours. For N=20e3; M=100, I get about 63 seconds to run your code and 43 seconds to run mine on my main computer (MATLAB 2011a on 64 bit Arch Linux, 8 GiB RAM, Core i7 860). For smaller values of M the gain is larger. But this gain is in part due to parallelization.
edit2: Apparently, I arrived at much the same idea as Andrey; my result would have been very similar had I pursued the same approach.
However, I realised that there are some built-in functions that do more or less what you need, i.e. something quite similar to determining the empirical cumulative distribution function. This can be done by constructing the histogram:
function pfdr = fdr(p,pr)
[N, M] = size(pr);
[p, idxP] = sort(p);
count = histc(pr(:), [0; p]);
count = cumsum(count(1:N));
pfdr = count./(1:N).';
pfdr(idxP) = pfdr/M;
For the same M and N as above, this code takes 228 milliseconds on my computer. It takes 104 milliseconds for Andrey's parameters, so on my computer it turns out a bit slower, but I think this code is far more readable than intricate for loops (as was the case in both our examples).
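On newer MATLAB releases, histcounts is the documented replacement for histc, and the same idea could be sketched with it as below. This is an assumption-laden illustration: it relies on all p-values being distinct and lying strictly between 0 and 1 (as with the rand test data), and the sizes are kept small so that the arrayfun reference runs quickly.

```matlab
N = 200; M = 10;                       % small made-up sizes for a quick check
p  = rand(N,1);
pr = rand(N,M);

% Reference: the original arrayfun formulation.
ref = arrayfun(@(x) mean(sum(pr<=x))./sum(p<=x), p);

% Histogram version: bin k counts permutation p-values in [edge(k), edge(k+1)).
[ps, idxP] = sort(p);
count = histcounts(pr(:), [0; ps; 1]); % N+1 bins; the last one is not needed
count = cumsum(count(1:N));            % count(i) = number of pr values below ps(i)

pfdr = zeros(N,1);
pfdr(idxP) = count(:) ./ (1:N).' / M;  % undo the sort, divide by M for the mean

disp(max(abs(pfdr - ref)));            % difference is at round-off level
```

With tied p-values the bin edges would no longer be strictly increasing, so that case would need separate handling, for example via unique.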
Following the discussion between me and Andrey in this question, this very late answer is just to prove to Andrey that vectorized solutions are still faster than JIT'ed loops, they sometimes just aren't as easy to find.
I am more than willing to remove this answer if it is deemed inappropriate by the OP.
Now, on to business, here's the original arrayfun, looped version by Andrey, and vectorized version by Egon:
function test
clc
N = 10000/4 ;
M = 100/4 ;
p = rand(N,1);
pr = rand(N,M);
%% first option
tic
pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
toc
%% second option
tic
out = zeros(numel(p),1);
[p2,sortIndex] = sort(p);
pr2 = sort(pr(:));
pr2(end+1) = Inf;
place = 1;
for i=1:numel(p2)
x = p2(i);
while pr2(place)<=x
place = place+1;
end
exp1a = place-1;
exp2 = i;
out(i) = exp1a/exp2;
end
out(sortIndex) = out/ M;
toc
%% third option
tic
[p2,sortIndex] = sort(p);
count = histc(pr2(:), [0; p2]);
count = cumsum(count(1:N));
out = count./(1:N).';
out(sortIndex) = out/M;
toc
end
Results on my laptop:
Elapsed time is 0.916196 seconds.
Elapsed time is 0.011429 seconds.
Elapsed time is 0.007328 seconds.
and for N = 10000; M = 100:
Elapsed time is 38.082718 seconds.
Elapsed time is 0.127052 seconds.
Elapsed time is 0.042686 seconds.
So: vectorized is about 1.5-3 times faster than the JIT'ed loop.

Resources