When I run the code shown below, the tic/toc pair inside the function reports that it takes well under a second to go through all the lines, yet it actually takes around 2.3 seconds to get the outputs. I use a tic/toc pair to measure the time:
tic
rnn.v = 11;
rnn.h = 101;
rnn.o = 7;
rnn.h_init = randn(1,rnn.h,'gpuArray');
rnn.W_vh = randn(rnn.v,rnn.h,'gpuArray');
rnn.W_hh = randn(rnn.h,rnn.h,'gpuArray');
rnn.W_ho = randn(rnn.h,rnn.o,'gpuArray');
inData.V = randn(10000,11,100,'gpuArray');
inData.TimeSteps = 100;
inData.BatchSize = 10000;
[H,OX] = forward_pass(rnn, inData)
toc
All the matrices in rnn and inData are gpuArrays, so all the calculations are carried out on the GPU. The outputs are also gpuArrays.
function [H,OX] = forward_pass(rnn, inData)
tic;
%initial hidden state values
H_init = gpuArray(repmat(rnn.h_init,[inData.BatchSize,1]));
%initialize state H
H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
%initialize OX (which is H * Who)
OX = zeros(inData.BatchSize, rnn.o, inData.TimeSteps,'gpuArray');
for t = 1 : inData.TimeSteps
if t == 1
HX_t = H_init * rnn.W_hh...
+ inData.V(:,:,t) * rnn.W_vh;
else
HX_t = H(:,:,(t-1)) * rnn.W_hh...
+ inData.V(:,:,t) * rnn.W_vh;
end
H(:,:,t) = tanh(HX_t);
OX(:,:,t) = H(:,:,t) * rnn.W_ho;
end
toc;
end
Normally, if you use the gather() function, it will be slow. I didn't use gather() to transfer the outputs to the workspace, so I don't know why it is still so slow. It looks like the last line, "end", takes more than 2 seconds.
Does anyone know how to accelerate the function call?
First off, for proper benchmarking you do need to use gather, either inside the function call or afterwards. In the former case you would have a non-GPU output from the function call; in the latter case a GPU-based datatype would be the output (a minimal timing sketch follows the code below). Now, back to your problem: you are using very few TimeSteps, so any optimization you try won't show up in a big way. Here's an optimized version that will show increasing gains as you increase TimeSteps -
function [H,OX] = forward_pass(rnn, inData)
H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
T = reshape(permute(inData.V,[1 3 2]),[],size(inData.V,2))*rnn.W_vh;
H(:,:,1) = tanh(bsxfun(@plus,rnn.h_init * rnn.W_hh,T(1:size(inData.V,1),:)));
for t = 2 : inData.TimeSteps
H(:,:,t) = tanh( H(:,:,(t-1))*rnn.W_hh + ...
T((t-1)*size(inData.V,1)+1: t*size(inData.V,1),:));
end
A = reshape(permute(H,[1 3 2]),[],size(H,2))*rnn.W_ho;
OX = permute(reshape(A,size(H,1),size(A,1)/size(H,1),[]),[1 3 2]);
return;
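As mentioned above, for timing you need to force the GPU work to finish before reading toc. Here is a minimal sketch, assuming rnn and inData are set up as in the question; gather copies the result back to the host, while wait(gpuDevice) only synchronizes:
tic
[H,OX] = forward_pass(rnn, inData);
OX = gather(OX); % copying back to the host forces all queued GPU work to complete
toc
% Alternative: keep the outputs on the GPU and just synchronize before stopping the timer
tic
[H,OX] = forward_pass(rnn, inData);
wait(gpuDevice); % block until the GPU has finished all pending operations
toc
gputimeit(@() forward_pass(rnn, inData)) is another option that takes care of this synchronization for you.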
Benchmarking
Test Case #1
Parameters
rnn.v = 11;
rnn.h = 5;
rnn.o = 7;
inData.TimeSteps = 10000;
inData.BatchSize = 10;
Results
---- Original Code :
Elapsed time is 5.678876 seconds.
---- Modified Code :
Elapsed time is 3.821059 seconds.
Test Case #2
Parameters
inData.TimeSteps = 50000; (rest are same as in Test Case #1)
Results
---- Original Code :
Elapsed time is 28.392290 seconds.
---- Modified Code :
Elapsed time is 19.031776 seconds.
Please note that these were tested on a GTX 750 Ti.
Related
Here is my little script for simulating Levy motion:
clear all;
clc; close all;
t = 0; T = 1000; I = T-t;
dT = T/I; t = 0:dT:T; tau = T/I;
alpha = 1.5;
sigma = dT^(1/alpha);
mu = 0; beta = 0;
N = 1000;
X = zeros(N, length(I));
for k=1:N
L = zeros(1,I);
for i = 1:I-1
L( (i + 1) * tau ) = L(i*tau) + stable2( alpha, beta, sigma, mu, 1);
end
X(k,1:length(L)) = L;
end
q = 0.1:0.1:0.9;
quant = qlines2(X, q, t(1:length(X)), tau);
hold all
for i = 1:length(quant)
plot( t, quant(i) * t.^(1/alpha), ':k' );
end
Where stable2 returns a stable random variable with given parameters (you may replace it with normrnd(mu, sigma) for this case, it's not crucial); qlines2 returns quantiles needed for plotting.
But I don't want to talk about math here. My problem is that this implementation is pretty slow, and I would like to speed it up. Unfortunately, computer science is not my main field - I have heard of methods like memoization and vectorization, and that there are a lot of other techniques, but I don't know how to use them.
For example, I'm pretty sure I should replace this filthy double for-loop somehow, but I'm not sure what to do instead.
EDIT: Maybe I should use (and learn...) another language (Python, C, any functional one)? I always thought that MATLAB/Octave is designed for numerical computation, but if I should change, then to which one?
The crucial bit is, as you said, the for loops; MATLAB does not like those, so vectorization is indeed the keyword (together with preallocating the space).
I just altered your for-loop section somewhat so that you do not have to reset L over and over again; instead we save all the Ls in a bigger matrix (I also eliminated the length(L) command).
L = zeros(N,I);
for k=1:N
for i = 1:I-1
L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
end
X(k,1:I) = L(k,1:I);
end
Now you can already see that X(k,1:I) = L(k,1:I); in the loop is obsolete, which also means that we can switch the order of the loops. This is crucial, because the i-steps are recursive (each depends on the previous step), so we cannot vectorize that loop; we can only vectorize the k-loop.
Your original code needed 9.3 seconds on my machine; the new code still needs about the same time:
L = zeros(N,I);
for i = 1:I-1
for k=1:N
L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
end
end
X = L;
But now we can apply the vectorization: instead of looping through all rows (the loop over k), we eliminate that loop and handle all rows at once.
L = zeros(N,I);
for i = 1:I-1
L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma); %<- this is not yet what you want, see comment below
end
X = L;
This code needs only 0.045 seconds on my machine. I hope you still get the same output, because I have no idea what you are calculating, but I also hope you can see how you go about vectorizing code.
PS: I just noticed that in the last example we now use the same random number for the whole column, which is obviously not what you want. Instead you should generate a whole vector of random numbers, e.g.:
L = zeros(N,I);
for i = 1:I-1
L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma,N,1);
end
X = L;
PPS: Great question!
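For completeness: with the parameters in the question, tau = T/I = 1, so each row of L is just a running sum of independent increments, and the remaining loop over i can be replaced by cumsum as well. This is only a sketch under that assumption (and with the normrnd stand-in for stable2):
increments = normrnd(mu, sigma, N, I-1); % one increment per path and per step
L = [zeros(N,1), cumsum(increments, 2)]; % L(:,1) = 0, L(:,i+1) = L(:,i) + increment
X = L;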
This is a follow-up to this question.
The following code takes an enormous amount of time to loop through. Do you have any recommendations for speeding up the process? The variable z has a size of 479x1672 and others will be around 479x12000.
z = HongKongPrices;
zmat = false(size(z));
r = size(z,1);
c = size(z,2);
for k = 1:c
for i = 5:r
if z(i,k) == z(i-4,k) && z(i,k) == z(i-3,k) && z(i,k) == z(end,k)
zmat(i-3:i,k) = 1;
end
end
end
z(zmat) = NaN;
I am currently running this with MATLAB R2014b on an iMac with a 3.2 GHz Intel i5 and 16 GB of DDR3.
You can use logical indexing here to your advantage to replace the IF-conditional statement and keep only a small loop (a quick sanity check follows the code) -
%// Get size parameters
[r,c] = size(z);
%// Get logical mask with ones for each column at places that satisfy the condition
%// mentioned as the IF conditional statement in the problem code
mask = z(1:r-4,:) == z(5:r,:) & z(2:r-3,:) == z(5:r,:) & ...
bsxfun(@eq,z(end,:),z(5:r,:));
%// Use logical indexing to map entire z array and set mask elements as NaNs
for k = 1:4
z([false(k,c) ; mask ; false(4-k,c)]) = NaN;
end
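If you want to convince yourself that the two approaches agree, here is a quick sanity check on random data (a sketch; the sizes and the maximum value are arbitrary, and isequaln treats NaN == NaN as true):
z_org = randi(10,479,1672); % random test data
%// Original double loop
z1 = z_org; zmat = false(size(z1)); [r,c] = size(z1);
for k = 1:c
for i = 5:r
if z1(i,k) == z1(i-4,k) && z1(i,k) == z1(i-3,k) && z1(i,k) == z1(end,k)
zmat(i-3:i,k) = true;
end
end
end
z1(zmat) = NaN;
%// Proposed masked approach
z2 = z_org;
mask = z2(1:r-4,:) == z2(5:r,:) & z2(2:r-3,:) == z2(5:r,:) & ...
bsxfun(@eq,z2(end,:),z2(5:r,:));
for k = 1:4
z2([false(k,c) ; mask ; false(4-k,c)]) = NaN;
end
assert(isequaln(z1,z2)) %// both methods should produce identical output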
Benchmarking
%// Size parameters
nrows = 479;
ncols = 12000;
max_num = 10;
num_iter = 10; %// number of iterations to run each approach,
%// so that runtimes are over 1 sec mark
z_org = randi(max_num,nrows,ncols); %// random input data of specified size
disp('--------------------------------- With proposed approach')
tic
for iter = 1:num_iter
z = z_org;
[..... code from the proposed approach ...]
end
toc, clear z k mask r c
disp('--------------------------------- With original approach')
tic
for iter = 1:num_iter
z = z_org;
[..... code from the problem ...]
end
toc
Results
Case # 1: z as 479 x 1672 (num_iter = 50)
--------------------------------- With proposed approach
Elapsed time is 1.285337 seconds.
--------------------------------- With original approach
Elapsed time is 2.008256 seconds.
Case # 2: z as 479 x 12000 (num_iter = 10)
--------------------------------- With proposed approach
Elapsed time is 1.941858 seconds.
--------------------------------- With original approach
Elapsed time is 2.897006 seconds.
In the following function I want to make some changes to make it faster. By itself it is fast, but I have to use it many times in a for loop, so it takes long. I think replacing the repmat calls with bsxfun will make it faster, but I am not sure. How can I make these replacements?
function out = lagcal(y1,y1k,source)
kn1 = y1(:);
kt1 = y1k(:);
kt1x = repmat(kt1,1,length(kt1));
eq11 = 1./(prod(kt1x-kt1x'+eye(length(kt1))));
eq1 = eq11'*eq11;
dist = repmat(kn1,1,length(kt1))-repmat(kt1',length(kn1),1);
[fixi,fixj] = find(dist==0); dist(fixi,fixj)=eps;
mult = 1./(dist);
eq2 = prod(dist,2);
eq22 = repmat(eq2,1,length(kt1));
eq222 = eq22 .* mult;
out = eq1 .* (eq222'*source*eq222);
end
Does it really speed up my function?
Introduction and code changes
All the repmat usages in the function code expand the inputs to sizes at which the subsequent mathematical operations can be performed. This is a tailor-made situation for bsxfun. Sadly, though, the real bottleneck of the function code seems to be something else. Stay on as we discuss all the performance-related aspects of the code.
The code with repmat replaced by bsxfun is presented next; the replaced code is kept in comments for comparison -
function out = lagcal(y1,y1k,source)
kn1 = y1(:);
kt1 = y1k(:);
%//kt1x = repmat(kt1,1,length(kt1));
%//eq11 = 1./(prod(kt1x-kt1x'+eye(length(kt1))));
eq11 = 1./prod(bsxfun(@minus,kt1,kt1.') + eye(numel(kt1)));
eq1 = eq11'*eq11;
%//dist = repmat(kn1,1,length(kt1))-repmat(kt1',length(kn1),1);
dist = bsxfun(@minus,kn1,kt1.');
[fixi,fixj] = find(dist==0);
dist(fixi,fixj)=eps;
mult = 1./(dist);
eq2 = prod(dist,2);
%//eq22 = repmat(eq2,1,length(kt1));
%//eq222 = eq22 .* mult;
eq222 = bsxfun(@times,eq2,mult);
out = eq1 .* (eq222'*source*eq222);
return; %// Better this way to end a function
One more modification could be added here. In the last line, we could do something like what is shown below, but the timing results don't show a huge benefit from it -
out = bsxfun(@times,eq11.',bsxfun(@times,eq11,eq222'*source*eq222));
This would avoid the calculation of eq1 done earlier in the original code, so you would save a little more time that way.
Benchmarking
Benchmarking of the bsxfun-modified portions of the code versus the original repmat-based code is discussed next.
Benchmarking Code
N_arr = [50 100 200 500 1000 2000 3000]; %// array elements for N (datasize)
blocks = 3;
timeall = zeros(2,numel(N_arr),blocks);
for k1 = 1:numel(N_arr)
N = N_arr(k1);
y1 = rand(N,1);
y1k = rand(N,1);
source = rand(N);
kn1 = y1(:);
kt1 = y1k(:);
%% Block 1 ----------------
block = 1;
f = @() block1_org(kt1);
timeall(1,k1,block) = timeit(f);
clear f
f = @() block1_mod(kt1);
timeall(2,k1,block) = timeit(f);
eq11 = feval(f);
clear f
%% Block 1 ----------------
eq1 = eq11'*eq11;
%% Block 2 ----------------
block = 2;
f = @() block2_org(kn1,kt1);
timeall(1,k1,block) = timeit(f);
clear f
f = @() block2_mod(kn1,kt1);
timeall(2,k1,block) = timeit(f);
dist = feval(f);
clear f
%% Block 2 ----------------
[fixi,fixj] = find(dist==0);
dist(fixi,fixj)=eps;
mult = 1./(dist);
eq2 = prod(dist,2);
%% Block 3 ----------------
block = 3;
f = @() block3_org(eq2,mult,length(kt1));
timeall(1,k1,block) = timeit(f);
clear f
f = @() block3_mod(eq2,mult);
timeall(2,k1,block) = timeit(f);
clear f
%% Block 3 ----------------
end
%// Display benchmark results
figure,
for k2 = 1:blocks
subplot(blocks,1,k2),
title(strcat('Block',num2str(k2),' results :'),'fontweight','bold'),hold on
plot(N_arr,timeall(1,:,k2),'-ro')
plot(N_arr,timeall(2,:,k2),'-kx')
legend('REPMAT Method','BSXFUN Method')
xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')
end
Associated functions
function out = block1_org(kt1)
kt1x = repmat(kt1,1,length(kt1));
out = 1./(prod(kt1x-kt1x'+eye(length(kt1))));
return;
function out = block1_mod(kt1)
out = 1./prod(bsxfun(@minus,kt1,kt1.') + eye(numel(kt1)));
return;
function out = block2_org(kn1,kt1)
out = repmat(kn1,1,length(kt1))-repmat(kt1',length(kn1),1);
return;
function out = block2_mod(kn1,kt1)
out = bsxfun(@minus,kn1,kt1.');
return;
function out = block3_org(eq2,mult,length_kt1)
eq22 = repmat(eq2,1,length_kt1);
out = eq22 .* mult;
return;
function out = block3_mod(eq2,mult)
out = bsxfun(@times,eq2,mult);
return;
Results
(Figure: runtime versus datasize N for each of the three blocks, REPMAT method versus BSXFUN method.)
Conclusions
The bsxfun-based code shows around 2x speedups over the repmat-based code, which is encouraging. But profiling the original code across varying datasizes shows that the multiple matrix multiplications in the final line, which are supposedly very efficient within MATLAB, occupy most of the function's runtime. Unless you have some way to avoid those multiplications through some other mathematical technique, they look like the bottleneck.
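One way to check that suspicion is to time the whole function against only its final products on matrices of the same sizes. This is just a sketch; the datasize and the use of timeit are my assumptions, not part of the original post:
N = 2000; % arbitrary datasize
y1 = rand(N,1); y1k = rand(N,1); source = rand(N);
t_total = timeit(@() lagcal(y1,y1k,source));
eq1 = rand(N); eq222 = rand(N); % stand-ins with the sizes the final line sees
t_final = timeit(@() eq1 .* (eq222'*source*eq222));
fprintf('total: %.3f s, final line alone: %.3f s\n', t_total, t_final);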
Basically I am trying to solve a 2nd-order differential equation with the forward Euler method. I have some for loops inside my code which take considerable time to run, and I would like to speed things up a bit. Does anyone have any suggestions on how I could do this?
Also, when looking at the time it takes, I notice that the end at line 14 takes 45% of my total time. What is end actually doing, and why is it taking so much time?
Here is my simplified code:
t = 0:0.01:100;
dt = t(2)-t(1);
B = 3.5 * t;
F0 = 2 * t;
BB=zeros(1,length(t)); % Preallocation
x = 2; % Initial value
u = 0; % Initial value
for ii = 1:length(t)
for kk = 1:ii
BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
end % This end takes 45% of the other time
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
Running the code takes me 8.552 seconds.
You can remove the inner loop, I think:
for ii = 1:length(t)
for kk = 1:ii
BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
end % This end takes 45% of the other time
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
So BB(ii) = BB(ii) (zero at initialisation) + the sum over kk = 1:ii of B(kk) * u(ii-kk+1) * dt.
But kk = 1:ii, so for a given ii, ii-kk+1 → ii-(1:ii)+1 → ii:-1:1.
So I think this is equivalent to:
for ii = 1:length(t)
BB(ii) = sum(B(1:ii).*u(ii:-1:1)*dt);
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
It doesn't take as long as 8 seconds for me using either method, but the version with only one loop is about 2x as fast (the output of BB appears to be the same).
Is the sum loop of B(kk) * u(ii-kk+1) just conv(B(1:ii),u(1:ii),'same')?
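For what it's worth: the loop sum equals element ii of the full convolution of B(1:ii) and u(1:ii); with the 'same' option the returned window is shifted, so it is not a drop-in replacement. A small check with made-up B, u and dt:
B = rand(1,20); u = rand(1,20); dt = 0.01; ii = 20; % made-up data
loopSum = 0;
for kk = 1:ii
loopSum = loopSum + B(kk) * u(ii-kk+1) * dt;
end
c = conv(B(1:ii), u(1:ii)); % full convolution
convSum = c(ii) * dt; % element ii of the full convolution matches the loop sum
fprintf('loop: %.6f, conv: %.6f\n', loopSum, convSum);
Both printed values agree.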
The best way to speed up loops in MATLAB is to try to avoid them. See whether you can perform a matrix operation instead of the inner loop. For example, try to break the calculation you do there into small parts, then decide whether there are parts you can perform in advance without knowing the results of the next iteration of the loop.
As to the second part of your question, my guess: the end contains the check of whether the loop should run for another round. This check is not long by itself, but it is called 50,015,001 times (with length(t) = 10001, the inner loop body runs 10001*10002/2 = 50,015,001 times)!
I have the following data types:
K1,K2,K3,K4 = ... % predefined constants which are not subject to change.
time = 1e6; % time steps length.
z = 1e3; % location vector length.
A_Sink = rand(1,time);
B_Sink = rand(1,time);
laserA = zeros(1,z);
laserB = zeros(1,z);
laserA_Old = zeros(1,z);
laserB_Old = zeros(1,z);
Now here is the code:
for ii = 1:time
laserA_Old = laserA;
laserB_Old = laserB;
laserA(2:end) = laserA_Old(2:end) + K1*diff(laserA_Old) + K2*laserB_Old(2:end);
laserA(1) = A_Sink(ii);
laserB(1:end-1) = laserB_Old(1:end-1) + K3*diff(laserB_Old) + K4*laserA_Old(1:end-1);
laserB(end) = B_Sink(ii);
end
Now my main gripe is that I can see in the profiler that the lines in the loop using (2:end) or (1:end-1) are allocating memory without freeing it. This makes MATLAB much slower.
Is there a way to accelerate this and make it more memory-efficient?
Many thanks.