Speeding up a nested for loop - performance

I've been working on speeding up the following function, but with no results:
function beta = beta_c(k,c,gamma)
beta = zeros(size(k));
E = #(x) (1.453*x.^4)./((1 + x.^2).^(17/6));
for ii = 1:size(k,1)
for jj = 1:size(k,2)
E_int = integral(E,k(ii,jj),10000);
beta(ii,jj) = c*gamma/(k(ii,jj)*sqrt(E_int));
end
end
end
Up to now, I solved it this way:
function beta = beta_calc(k,c,gamma)
k_1d = reshape(k,[1,numel(k)]);
E_1d =#(k) 1.453.*k.^4./((1 + k.^2).^(17/6));
E_int = zeros(1,numel(k_1d));
parfor ii = 1:numel(k_1d)
E_int(ii) = quad(E_1d,k_1d(ii),10000);
end
beta_1d = c*gamma./(k_1d.*sqrt(E_int));
beta = reshape(beta_1d,[size(k,1),size(k,2)]);
end
Seems to me, it didn't really enhance performances. What do you think about this?
Would you mind to shed a light?
I thank you in advance.
EDIT
I am gonna introduce some theoretical background involving my question.
Generally, beta is to be calculated as follows
Therefore, in the reduced case of unidimensional k array, E_int may be calculated as
E = 1.453.*k.^4./((1 + k.^2).^(17/6));
E_int = 1.5 - cumtrapz(k,E);
or, alternatively as
E_int(1) = 1.5;
for jj = 2:numel(k)
E =#(k) 1.453.*k.^4./((1 + k.^2).^(17/6));
E_int(jj) = E_int(jj - 1) - integral(E,k(jj-1),k(jj));
end
Nonetheless, k is currently a matrix k(size1,size2).

Here's another approach, parallelize, because it's easy using spmd or parfor. Instead of integral consider quad, see this link for examples...

I like this question.
The problem: the function integral takes as integration limits only scalars. Hence, it is difficult to vectorize the computation of of E_int.
A clue: there seems to be lot of redundancy in integrating the same function over and over from k(ii,jj) to infinity...
Proposed solution: How about sorting the values of k from smallest to largest and integrating E_sort_int(si) = integral( E, sortedK(si), sortedK(si+1) ); with sortedK( numel(k) + 1 ) = 10000;. Then the full value of E_int = cumsum( E_sort_int ); (you only need to "undo" the sorting and reshape it back to the size of k).

Related

How to multiply tensors in MATLAB without looping?

Suppose I have:
A = rand(1,10,3);
B = rand(10,16);
And I want to get:
C(:,1) = A(:,:,1)*B;
C(:,2) = A(:,:,2)*B;
C(:,3) = A(:,:,3)*B;
Can I somehow multiply this in a single line so that it is faster?
What if I create new tensor b like this
for i = 1:3
b(:,:,i) = B;
end
Can I multiply A and b to get the same C but faster? Time taken in creation of b by the loop above doesn't matter since I will be needing C for many different A-s while B stays the same.
Permute the dimensions of A and B and then apply matrix multiplication:
C = B.'*permute(A, [2 3 1]);
If A is a true 3D array, something like A = rand(4,10,3) and assuming that B stays as a 2D array, then each A(:,:,1)*B would yield a 2D array.
So, assuming that you want to store those 2D arrays as slices in the third dimension of output array, C like so -
C(:,:,1) = A(:,:,1)*B;
C(:,:,2) = A(:,:,2)*B;
C(:,:,3) = A(:,:,3)*B; and so on.
To solve this in a vectorized manner, one of the approaches would be to use reshape A into a 2D array merging the first and third dimensions and then performing matrix-muliplication. Finally, to bring the output size same as the earlier listed C, we need a final step of reshaping.
The implementation would look something like this -
%// Get size and then the final output C
[m,n,r] = size(A);
out = permute(reshape(reshape(permute(A,[1 3 2]),[],n)*B,m,r,[]),[1 3 2]);
Sample run -
>> A = rand(4,10,3);
B = rand(10,16);
C(:,:,1) = A(:,:,1)*B;
C(:,:,2) = A(:,:,2)*B;
C(:,:,3) = A(:,:,3)*B;
>> [m,n,r] = size(A);
out = permute(reshape(reshape(permute(A,[1 3 2]),[],n)*B,m,r,[]),[1 3 2]);
>> all(C(:)==out(:)) %// Verify results
ans =
1
As per the comments, if A is a 3D array with always a singleton dimension at the start, you can just use squeeze and then matrix-multiplication like so -
C = B.'*squeeze(A)
EDIT: #LuisMendo points out that this is indeed possible for this specific use case. However, it is not (in general) possible if the first dimension of A is not 1.
I've grappled with this for a while now, and I've never been able to come up with a solution. Performing element-wise calculations is made nice by bsxfun, but tensor multiplication is something which is woefully unsupported. Sorry, and good luck!
You can check out this mathworks file exchange file, which will make it easier for you and supports the behavior you're looking for, but I believe that it relies on loops as well. Edit: it relies on MEX/C++, so it isn't a pure MATLAB solution if that's what you're looking for.
I have to agree with #GJSein, the for loop is really fast
time
0.7050 0.3145
Here's the timer function
function time
n = 1E7;
A = rand(1,n,3);
B = rand(n,16);
t = [];
C = {};
tic
C{length(C)+1} = squeeze(cell2mat(cellfun(#(x) x*B,num2cell(A,[1 2]),'UniformOutput',false)));
t(length(t)+1) = toc;
tic
for i = 1:size(A,3)
C{length(C)+1}(:,i) = A(:,:,i)*B;
end
t(length(t)+1) = toc;
disp(t)
end

matlab code nested loop performance improvement

I would be very interested to receive suggestions on how to improve performance of the following nested for loop:
I = (U > q); % matrix of indicator variables, I(i,j) is 1 if U(i,j) > q
for i = 2:K
for j = 1:(i-1)
mTau(i,j) = sum(I(:,i) .* I(:,j));
mTau(j,i) = mTau(i,j);
end
end
The code evaluates if for pairs of variables both variables are below a certain threshold, thereby filling a matrix. I appreciate your help!
You can use matrix multiplication:
I = double(U>q);
mTau = I.'*I;
This will have none-zero values on diagonal so you can set them to zero by
mTau = mTau - diag(diag(mTau));
One approach with bsxfun -
out = squeeze(sum(bsxfun(#and,I,permute(I,[1 3 2])),1));
out(1:size(out,1)+1:end)=0;

MATLAB: readable code vs optimized code

So, I want to know if making the code more easy to read slows performance in Matlab.
function V = example(t, I)
a = 10;
b = 20;
c = 0.5;
V = zeros(1, length(t));
V(1) = 0;
delta_t = t(2) - t(1);
for i=1:length(t)-1
V(i+1) = V(i) + delta_t*feval(#V_prime,a,b,c,t(i));
end;
So, this function is just an example of a Euler method. The idea is that I name constant variables, a, b, c and define a function of the derivative. This basically makes the code easier to read. What I want to know is if declaring a,b,c slows down my code. Also, for performance improvement, would be better to put the equation of the derivative (V_prime) directly on the equation instead of calling it?
Following this mindset the code would look something like this.
function V = example(t, I)
V = zeros(1, length(t));
V(1) = 0;
delta_t = t(2) - t(1);
for i=1:length(t)-1
V(i+1) = V(i) + delta_t*(((10 + t(i)*3)/20)+0.5);
Also from what I've read, Matlab performs better when the code is vectorized, would that be the case in my code?
EDIT:
So, here is my actual code that I am working on:
function [V, u] = Izhikevich_CA1_Imp(t, I_amp, t_inj)
vr = -61.8; % resting potential (mV)
vt = -57.0; % threshold potential (mV)
c = -65.8; % reset membrane potential (mV)
vpeak = 22.6; % membrane voltage cutoff
khigh = 3.3; % nS/mV
klow = 0.1; % nS/mV
C = 115; % Membrane capacitance (pA)
a = 0.0012; % 1/ms
b = 3; % nS
d = 10; % pA
V = zeros(1, length(t));
V(1) = vr; u = 0; % initial values
span = length(t)-1;
delta_t = t(2) - t(1);
for i=1:span
if (V(i) <= vt)
k = klow;
else
k = khigh;
end;
if ((t(i) >= t_inj(1)) && (t(i) <= t_inj(2)))
I_inj = I_amp;
else I_inj = 0;
end;
V(i+1) = V(i) + delta_t*((k*(V(i)-vr)*(V(i)-vt)-u(i)+I_inj)/C);
u(i+1) = u(i) + delta_t*(a*(b*(V(i)-vr)-u(i)));
if (V(i+1) >= vpeak)
V(i+1) = c;
V(i) = vpeak;
u(i+1) = u(i+1) + d;
end;
end;
plot(t,V);
Since I didn't have any training in Matlab (learned by trying and failing), I have my C mindset of programming, and for what I understand, Matlab code should be vectorized.
Eventually I will start working with bigger functions, so performance will be a concern. Now my goal is to vectorize this code.
Usually it is faster.
Especially if you replace looped function calls (like plot()), you will see a significant increase in performance.
In one of my past projects, I had to optimize a program. This one was made using regular program rules (for, while, etc.). Using vectorization, I reached a 10 times increase in performance, which is quite notable..
I would suggest using vectorisation instead of loops most of the time.
On matlab you should basically forget the mindset coming from low-level C programming.
In my experience the first rule for achieving performance in matlab is to avoid loops and use built-in vectorized functions as much as possible. In general, you should try to avoid direct access to array elements like array(i).
Implementing your own ODE solver inevitably leads to very slow execution because in this case there is really no way to avoid the aforementioned things, even if your implementation is per se fine (like in your case). I strongly advise to rely on matlab's ode solvers which are highly optimized blocks of compiled code and much faster than any interpreted matlab code you can write.
In my opinion this goes along with readability of the code as well, at least for the trivial reason that you get a shorter code... but I guess it is also a matter of personal taste.

Vectorization of nested loops and if statements in MATLAB

I am fairly new to the concept of vectorization in MATLAB so please excuse my naivety in this regard. I was trying to vectorize the following MATLAB code which includes if statements within nested for loops:
h = zeros(dimV);
for a = 1 : dimV
for b = 1 : dimV
if a ~= b && C(a,b) == 1
h(a,b) = C(a,b)*exp(-1i*A(a,b)*L(a,b))*(sin(k*L(a,b)))^-1;
else if a == b
for m = 1 : dimV
if m ~= a && C(a,m) == 1
h(a,b) = h(a,b) - C(a,m)*cot(k*L(a,m));
end
end
end
end
end
end
Here the variable dimV specifies the size of the matrix h and is fairly large, of the order of 100, and C is a symmetric square matrix (previously defined) of size dimV all of whose off-diagonal elements are either 0 or 1 and the diagonal elements are necesarrily 0. The elements of matrix L are also zero in the same positions as the zeros of the matrix C. Following the vectorization techniques that I found on this website and here, I was able to vectorize the code, albeit partially, and my MWE is as follows:
h = zeros(dimV);
idx = (C == 1);
h(idx) = C(idx).*exp(-1i*A(idx).*L(idx)).*(sin(k*L(idx))).^-1;
for a = 1:dimV;
m = 1 : dimV;
m = m(C(a,:) == 1);
h(a,a) = - sum (C(a,m).*cot(k*L(a,m)));
end
My main problem is in converting the for loop in the variable a to a vector as I need the individual values of a to address the diagonal elements of h. I compared the evaluation time of both the code blocks using the MATLAB profiler and the latter version is only marginally faster and the improvement in efficiency is really insignificant. In fact the profiler showed that the line allocating the values to h(a,a) takes up nearly 50% of the execution time in the second case. So I was wondering if there is a more elegant way to rewrite the above code using the appropriate vectorization schemes, which would help improve its efficiency. I am really in a bit of bother about this and any I would be greatly appreciative of any help in this regard. Thank you so much.
Vectorized Code -
diag_ind = 1:dimV+1:numel(C);
C_neq1 = C~=1;
parte1_2 = (sin(k.*L)).^-1;
parte1_2(C_neq1 & (L==0))=0;
parte1 = exp(-1i.*A.*L).*parte1_2;
parte1(C_neq1)=0;
h = parte1;
parte2 = cot(k*L);
parte2(C_neq1)=0;
parte2(diag_ind)=0;
h(diag_ind) = - sum(parte2,2);

How to speed this kind of for-loop?

I would like to compute the maximum of translated images along the direction of a given axis. I know about ordfilt2, however I would like to avoid using the Image Processing Toolbox.
So here is the code I have so far:
imInput = imread('tire.tif');
n = 10;
imMax = imInput(:, n:end);
for i = 1:(n-1)
imMax = max(imMax, imInput(:, i:end-(n-i)));
end
Is it possible to avoid using a for-loop in order to speed the computation up, and, if so, how?
First edit: Using Octave's code for im2col is actually 50% slower.
Second edit: Pre-allocating did not appear to improve the result enough.
sz = [size(imInput,1), size(imInput,2)-n+1];
range_j = 1:size(imInput, 2)-sz(2)+1;
range_i = 1:size(imInput, 1)-sz(1)+1;
B = zeros(prod(sz), length(range_j)*length(range_i));
counter = 0;
for j = range_j % left to right
for i = range_i % up to bottom
counter = counter + 1;
v = imInput(i:i+sz(1)-1, j:j+sz(2)-1);
B(:, counter) = v(:);
end
end
imMax = reshape(max(B, [], 2), sz);
Third edit: I shall show the timings.
For what it's worth, here's a vectorized solution using IM2COL function from the Image Processing Toolbox:
imInput = imread('tire.tif');
n = 10;
sz = [size(imInput,1) size(imInput,2)-n+1];
imMax = reshape(max(im2col(imInput, sz, 'sliding'),[],2), sz);
imshow(imMax)
You could perhaps write your own version of IM2COL as it simply consists of well crafted indexing, or even look at how Octave implements it.
Check out the answer to this question about doing a rolling median in c. I've successfully made it into a mex function and it is way faster than even ordfilt2. It will take some work to do a max, but I'm sure it's possible.
Rolling median in C - Turlach implementation

Resources