More on using i and j as variables in Matlab: speed - performance

The Matlab documentation says that
For speed and improved robustness, you can replace complex i and j by 1i. For example, instead of using
a = i;
use
a = 1i;
The robustness part is clear, as there might be variables called i or j. However, as for speed, I have made a simple test in Matlab 2010b and I obtain results which seem to contradict the claim:
>> clear all
>> a=0; tic, for n=1:1e8, a=i; end, toc
Elapsed time is 3.056741 seconds.
>> a=0; tic, for n=1:1e8, a=1i; end, toc
Elapsed time is 3.205472 seconds.
Any ideas? Could it be a version-related issue?
After comments by @TryHard and @horchler, I have tried assigning other values to the variable a, with these results:
Increasing order of elapsed time:
"i" < "1i" < "1*i" (trend "A")
"2i" < "2*1i" < "2*i" (trend "B")
"1+1i" < "1+i" < "1+1*i" (trend "A")
"2+2i" < "2+2*1i" < "2+2*i" (trend "B")

I think you are looking at a pathological example. Try something more complex (results shown for R2012b on OSX):
(repeated addition)
>> clear all
>> a=0; tic, for n=1:1e8, a = a + i; end, toc
Elapsed time is 2.217482 seconds. % <-- slower
>> clear all
>> a=0; tic, for n=1:1e8, a = a + 1i; end, toc
Elapsed time is 1.962985 seconds. % <-- faster
(repeated multiplication)
>> clear all
>> a=0; tic, for n=1:1e8, a = a * i; end, toc
Elapsed time is 2.239134 seconds. % <-- slower
>> clear all
>> a=0; tic, for n=1:1e8, a = a * 1i; end, toc
Elapsed time is 1.998718 seconds. % <-- faster

One thing to remember is that optimizations are applied differently depending on whether you are running from the command line or from a saved M-function.
Here is a test of my own:
function testComplex()
tic, test1(); toc
tic, test2(); toc
tic, test3(); toc
tic, test4(); toc
tic, test5(); toc
tic, test6(); toc
end
function a = test1
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2i;
end
end
function a = test2
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2j;
end
end
function a = test3
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2*i;
end
end
function a = test4
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2*j;
end
end
function a = test5
a = zeros(1e7,1);
for n=1:1e7
a(n) = complex(2,2);
end
end
function a = test6
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2*sqrt(-1);
end
end
The results on my Windows machine running R2013a:
>> testComplex
Elapsed time is 0.946414 seconds. %// 2 + 2i
Elapsed time is 0.947957 seconds. %// 2 + 2j
Elapsed time is 0.811044 seconds. %// 2 + 2*i
Elapsed time is 0.685793 seconds. %// 2 + 2*j
Elapsed time is 0.767683 seconds. %// complex(2,2)
Elapsed time is 8.193529 seconds. %// 2 + 2*sqrt(-1)
Note that the results fluctuate a little bit with different runs where the order of calls is shuffled. So take the timings with a grain of salt.
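Since single tic/toc runs fluctuate, one option is to let timeit repeat the measurement. The following is only a sketch, assuming a release that ships timeit (R2013b or later, as far as I know); the helper names timeImagLiterals, loopLiteral and loopProduct are made up for this illustration and are not part of the test above.

```matlab
function t = timeImagLiterals(numIter)
% Hypothetical helper: returns [t_literal, t_product] in seconds,
% each measured by timeit, which repeats the call several times.
if nargin < 1
    numIter = 1e6;   % assumed loop length, adjust as needed
end
t = [timeit(@() loopLiteral(numIter)), timeit(@() loopProduct(numIter))];
fprintf('2 + 2i   : %.4f s\n', t(1));
fprintf('2 + 2*1i : %.4f s\n', t(2));
end

function a = loopLiteral(numIter)
% Assigns the complex literal 2 + 2i on every iteration.
a = 0;
for n = 1:numIter
    a = 2 + 2i;
end
end

function a = loopProduct(numIter)
% Same value, written as a product with the literal 1i.
a = 0;
for n = 1:numIter
    a = 2 + 2*1i;
end
end
```

Calling timeImagLiterals() a few times should show whether any difference between the two forms is larger than the run-to-run noise.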
My conclusion: in terms of speed, it doesn't matter whether you use 1i or 1*i.
One interesting difference: if a function uses i as the imaginary unit and also assigns to a variable named i in the same scope, MATLAB throws an error:
Error: File: testComplex.m Line: 38 Column: 5
"i" previously appeared to be used as a function or command, conflicting with its
use here as the name of a variable.
A possible cause of this error is that you forgot to initialize the variable, or you
have initialized it implicitly using load or eval.
To see the error, change the above test3 function into:
function a = test3
a = zeros(1e7,1);
for n=1:1e7
a(n) = 2 + 2*i;
end
i = rand(); %// added this line!
end
i.e., the name i was used as both a function and a variable in the same local scope.
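A minimal sketch of the robustness argument from the documentation quote: the literal forms 1i, 2i and so on keep working even when i is an ordinary variable, here a loop index. The variable name z is illustrative only.

```matlab
z = zeros(1, 4);
for i = 1:4              % i is now a plain loop variable
    z(i) = i + 2i;       % 2i is still parsed as the imaginary literal
end
disp(z)                  % real parts 1..4, imaginary part 2 everywhere
```

Writing 2*i inside this loop would instead multiply by the loop index and give a real result.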


Memory allocation in a fixed point algorithm

I need to find the fixed point of a function f. The algorithm is very simple:
1. Given X, compute f(X).
2. If ||X-f(X)|| is lower than a certain tolerance, exit and return X;
3. otherwise set X equal to f(X) and go back to 1.
I'd like to be sure I'm not allocating memory for a new object at every iteration.
For now, the algorithm looks like this:
iter1 = function(x::Vector{Float64})
for iter in 1:max_it
oldx = copy(x)
g1(x)
delta = vnormdiff(x, oldx, 2)
if delta < tolerance
break
end
end
end
Here g1(x) is a function that sets x to f(x)
But it seems this loop allocates a new vector at every iteration (see below).
Another way to write the algorithm is the following:
iter2 = function(x::Vector{Float64})
oldx = similar(x)
for iter in 1:max_it
(oldx, x) = (x, oldx)
g2(x, oldx)
delta = vnormdiff(oldx, x, 2)
if delta < tolerance
break
end
end
end
where g2(x1, x2) is a function that sets x1 to f(x2).
Is this the most efficient and natural way to write this kind of iteration problem?
Edit1: timing shows that the second code is faster:
using NumericExtensions
max_it = 1000
tolerance = 1e-8
max_it = 100
g1 = function(x::Vector{Float64})
for i in 1:length(x)
x[i] = x[i]/2
end
end
g2 = function(newx::Vector{Float64}, x::Vector{Float64})
for i in 1:length(x)
newx[i] = x[i]/2
end
end
x = fill(1e7, int(1e7))
@time iter1(x)
# elapsed time: 4.688103075 seconds (4960117840 bytes allocated, 29.72% gc time)
x = fill(1e7, int(1e7))
@time iter2(x)
# elapsed time: 2.187916177 seconds (80199676 bytes allocated, 0.74% gc time)
Edit2: using copy!
iter3 = function(x::Vector{Float64})
oldx = similar(x)
for iter in 1:max_it
copy!(oldx, x)
g1(x)
delta = vnormdiff(x, oldx, 2)
if delta < tolerance
break
end
end
end
x = fill(1e7, int(1e7))
@time iter3(x)
# elapsed time: 2.745350176 seconds (80008088 bytes allocated, 1.11% gc time)
I think replacing the following lines in the first code
for iter = 1:max_it
oldx = copy( x )
...
by
oldx = zeros( N )
for iter = 1:max_it
oldx[:] = x # or copy!( oldx, x )
...
will be more efficient because no array is allocated. Also, the code can be made more efficient by writing for-loops explicitly. This can be seen, for example, from the following comparison
function test()
N = 1000000
a = zeros( N )
b = zeros( N )
@time c = copy( a )
@time b[:] = a
@time copy!( b, a )
@time for i = 1:length(a)
b[i] = a[i]
end
@time for i in eachindex(a)
b[i] = a[i]
end
end
test()
The result obtained with Julia 0.4.0 on Linux (x86_64) is
elapsed time: 0.003955609 seconds (7 MB allocated)
elapsed time: 0.001279142 seconds (0 bytes allocated)
elapsed time: 0.000836167 seconds (0 bytes allocated)
elapsed time: 1.19e-7 seconds (0 bytes allocated)
elapsed time: 1.28e-7 seconds (0 bytes allocated)
It seems that copy!() is faster than using [:] in the left-hand side,
though the difference becomes marginal in repeated calculations (there seems to be
some overhead for the first [:] calculation). Btw, the last example using eachindex() is very convenient for looping over multi-dimensional arrays.
A similar comparison can be made for vnormdiff(), where using norm( x - oldx ) etc. is slower than an explicit loop for the vector norm, because the former allocates one temporary array for x - oldx.

How to speed up a double loop in matlab

This is a follow-up question of this question.
The following code takes an enormous amount of time to loop through. Do you have any recommendations for speeding up the process? The variable z has a size of 479x1672 and others will be around 479x12000.
z = HongKongPrices;
zmat = false(size(z));
r = size(z,1);
c = size(z,2);
for k = 1:c
for i = 5:r
if z(i,k) == z(i-4,k) && z(i,k) == z(i-3,k) && z(i,k) == z(end,k)
zmat(i-3:i,k) = 1
end
end
end
z(zmat) = NaN
I am currently running this with MATLAB R2014b on an iMac with a 3.2 GHz Intel i5 and 16 GB of DDR3 RAM.
You can use logical indexing here to your advantage to replace the IF-conditional statement and have a small-loop -
%// Get size parameters
[r,c] = size(z);
%// Get logical mask with ones for each column at places that satisfy the condition
%// mentioned as the IF conditional statement in the problem code
mask = z(1:r-4,:) == z(5:r,:) & z(2:r-3,:) == z(5:r,:) & ...
bsxfun(@eq,z(end,:),z(5:r,:));
%// Use logical indexing to map entire z array and set mask elements as NaNs
for k = 1:4
z([false(k,c) ; mask ; false(4-k,c)]) = NaN;
end
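Before benchmarking, it may be worth confirming that the masked version marks exactly the same elements as the double loop. The sketch below does that on a small random matrix; the size, the value range and the names z0, zLoop and zVec are assumptions made for this check only.

```matlab
rng(0);                          % reproducible test data
z0 = randi(3, 40, 25);           % small value range so matches actually occur
[r, c] = size(z0);

% Reference result: the double loop from the question
zLoop = z0;
zmat = false(r, c);
for k = 1:c
    for i = 5:r
        if z0(i,k) == z0(i-4,k) && z0(i,k) == z0(i-3,k) && z0(i,k) == z0(end,k)
            zmat(i-3:i,k) = true;
        end
    end
end
zLoop(zmat) = NaN;

% Vectorized result: the mask built once, then shifted four times
zVec = z0;
mask = z0(1:r-4,:) == z0(5:r,:) & z0(2:r-3,:) == z0(5:r,:) & ...
    bsxfun(@eq, z0(end,:), z0(5:r,:));
for k = 1:4
    zVec([false(k,c) ; mask ; false(4-k,c)]) = NaN;
end

% isequaln treats NaNs in matching positions as equal
fprintf('Results identical: %d\n', isequaln(zLoop, zVec));
```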
Benchmarking
%// Size parameters
nrows = 479;
ncols = 12000;
max_num = 10;
num_iter = 10; %// number of iterations to run each approach,
%// so that runtimes are over 1 sec mark
z_org = randi(max_num,nrows,ncols); %// random input data of specified size
disp('--------------------------------- With proposed approach')
tic
for iter = 1:num_iter
z = z_org;
[..... code from the proposed approach ...]
end
toc, clear z k mask r c
disp('--------------------------------- With original approach')
tic
for iter = 1:num_iter
z = z_org;
[..... code from the problem ...]
end
toc
Results
Case # 1: z as 479 x 1672 (num_iter = 50)
--------------------------------- With proposed approach
Elapsed time is 1.285337 seconds.
--------------------------------- With original approach
Elapsed time is 2.008256 seconds.
Case # 2: z as 479 x 12000 (num_iter = 10)
--------------------------------- With proposed approach
Elapsed time is 1.941858 seconds.
--------------------------------- With original approach
Elapsed time is 2.897006 seconds.

Matlab is slow when using user defined function with calculation in GPU

When I run the code shown below, the tic/toc pair inside the function shows it takes very short time (<< 1sec) to go through all the lines. However, it actually takes around 2.3secs to get the outputs!!! I use the tic/toc pair to measure the time.
tic
rnn.v = 11;
rnn.h = 101;
rnn.o = 7;
rnn.h_init = randn(1,rnn.h,'gpuArray');
rnn.W_vh = randn(rnn.v,rnn.h,'gpuArray');
rnn.W_hh = randn(rnn.h,rnn.h,'gpuArray');
rnn.W_ho = randn(rnn.h,rnn.o,'gpuArray');
inData.V = randn(10000,11,100,'gpuArray');
inData.TimeSteps =100;
inData.BatchSize = 10000;
[H,OX] = forward_pass(rnn, inData)
toc
All the matrices in rnn and inData are gpuArrays, so all the calculations are carried out on the GPU. The outputs are also gpuArrays.
function [H,OX] = forward_pass(rnn, inData)
tic;
%initial hidden state values
H_init = gpuArray(repmat(rnn.h_init,[inData.BatchSize,1]));
%initialize state H
H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
%initialize OX (which is H * Who)
OX = zeros(inData.BatchSize, rnn.o, inData.TimeSteps,'gpuArray');
for t = 1 : inData.TimeSteps
if t == 1
HX_t = H_init * rnn.W_hh...
+ inData.V(:,:,t) * rnn.W_vh;
else
HX_t = H(:,:,(t-1)) * rnn.W_hh...
+ inData.V(:,:,t) * rnn.W_vh;
end
H(:,:,t) = tanh(HX_t);
OX(:,:,t) = H(:,:,t) * rnn.W_ho;
end
toc;
end
Normally, using the gather() function is slow. I didn't use gather() to transfer the outputs to the workspace, so I don't know why it is still so slow. It looks like the last line, "end", takes more than 2 seconds.
Does anyone know how to accelerate the function call?
First off, for proper benchmarking you do need to use gather, either inside the function call or afterwards. In the former case the function returns a non-GPU output; in the latter case the output is a GPU-based datatype. Now, back to your problem: you are using very few TimeSteps, so any optimization you try won't make a large difference. Here's an optimized version that shows increasing gains as you increase TimeSteps -
function [H,OX] = forward_pass(rnn, inData)
H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
T = reshape(permute(inData.V,[1 3 2]),[],size(inData.V,2))*rnn.W_vh;
H(:,:,1) = tanh(bsxfun(@plus,rnn.h_init * rnn.W_hh,T(1:size(inData.V,1),:)));
for t = 2 : inData.TimeSteps
H(:,:,t) = tanh( H(:,:,(t-1))*rnn.W_hh + ...
T((t-1)*size(inData.V,1)+1: t*size(inData.V,1),:));
end
A = reshape(permute(H,[1 3 2]),[],size(H,2))*rnn.W_ho;
OX = permute(reshape(A,size(H,1),size(A,1)/size(H,1),[]),[1 3 2]);
return;
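The key step in the version above is computing every inData.V(:,:,t) * rnn.W_vh product in one matrix multiplication through permute and reshape. The following CPU-only sketch checks that the row blocks of T match the per-time-step products; the small sizes and the names B, v, h and nT are assumptions chosen for the check.

```matlab
rng(1);
B = 4; v = 3; h = 5; nT = 6;     % batch size, input size, hidden size, time steps
V = randn(B, v, nT);
W_vh = randn(v, h);

% One large product: rows are ordered batch-first within each time step
T = reshape(permute(V, [1 3 2]), [], v) * W_vh;

% Compare each block of B rows against the per-time-step product
maxErr = 0;
for t = 1:nT
    rows = (t-1)*B + 1 : t*B;
    blockErr = abs(T(rows,:) - V(:,:,t) * W_vh);
    maxErr = max(maxErr, max(blockErr(:)));
end
fprintf('Largest deviation: %g\n', maxErr);
```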
Benchmarking
Test Case #1
Parameters
rnn.v = 11;
rnn.h = 5;
rnn.o = 7;
inData.TimeSteps = 10000;
inData.BatchSize = 10;
Results
---- Original Code :
Elapsed time is 5.678876 seconds.
---- Modified Code :
Elapsed time is 3.821059 seconds.
Test Case #2
Parameters
inData.TimeSteps = 50000; (rest are same as in Test Case #1)
Results
---- Original Code :
Elapsed time is 28.392290 seconds.
---- Modified Code :
Elapsed time is 19.031776 seconds.
Please note that these are tested on GTX 750 Ti.
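On the timing puzzle in the question: GPU operations in MATLAB can return before the device has finished, so a tic/toc pair inside the function may stop the clock early and the remaining time shows up wherever the result is first needed. A hedged sketch of two ways to time GPU work, assuming the Parallel Computing Toolbox and a supported GPU are available; the matrix size is arbitrary.

```matlab
A = randn(2000, 'gpuArray');
dev = gpuDevice;

% Option 1: block until the GPU has finished before reading the clock
tic
C = A * A;
wait(dev);
tSync = toc;

% Option 2: gputimeit repeats the call and synchronizes internally
tGpu = gputimeit(@() A * A);

fprintf('tic/toc with wait: %.4f s, gputimeit: %.4f s\n', tSync, tGpu);
```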

speeding up some for loops in matlab

Basically I am trying to solve a 2nd order differential equation with the forward euler method. I have some for loops inside my code, which take considerable time to solve and I would like to speed things up a bit. Does anyone have any suggestions how could I do this?
And also when looking at the time it takes, I notice that my end at line 14 takes 45 % of my total time. What is end actually doing and why is it taking so much time?
Here is my simplified code:
t = 0:0.01:100;
dt = t(2)-t(1);
B = 3.5 * t;
F0 = 2 * t;
BB=zeros(1,length(t)); % Preallocation
x = 2; % Initial value
u = 0; % Initial value
for ii = 1:length(t)
for kk = 1:ii
BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
end % This end takes 45% of the other time
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
Running the code it takes me 8.552 sec.
You can remove the inner loop, I think:
for ii = 1:length(t)
for kk = 1:ii
BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
end % This end takes 45% of the other time
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
So BB(ii) = BB(ii) (zero at initialisation) + the sum for kk = 1 to ii of B(kk)*u(ii-kk+1)*dt
but kk = 1:ii, so for a given ii, ii-kk+1 → ii-(1:ii) + 1 → ii:-1:1
So I think this is equivalent to:
for ii = 1:length(t)
BB(ii) = sum(B(1:ii).*u(ii:-1:1)*dt);
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
It doesn't take as long as 8 seconds for me using either method, but the version with only one loop is about 2x as fast (the output of BB appears to be the same).
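To check that the two versions agree, both can be run side by side on a shorter time grid. This is a sketch only: the grid is shortened to 0:0.01:5, u is preallocated, x is left out because it does not affect BB, and the names BB1, BB2, u1 and u2 are made up for the comparison.

```matlab
t = 0:0.01:5;                    % shorter grid than in the question
dt = t(2) - t(1);
B = 3.5 * t;
F0 = 2 * t;
n = numel(t);

% Version 1: explicit inner loop
BB1 = zeros(1, n);
u1 = zeros(1, n + 1);
for ii = 1:n
    for kk = 1:ii
        BB1(ii) = BB1(ii) + B(kk) * u1(ii-kk+1) * dt;
    end
    u1(ii+1) = u1(ii) + dt * (F0(ii) - BB1(ii));
end

% Version 2: inner loop replaced by a vectorized sum
BB2 = zeros(1, n);
u2 = zeros(1, n + 1);
for ii = 1:n
    BB2(ii) = sum(B(1:ii) .* u2(ii:-1:1)) * dt;
    u2(ii+1) = u2(ii) + dt * (F0(ii) - BB2(ii));
end

% The summation order differs, so compare with a relative tolerance
relErr = max(abs(BB1 - BB2)) / max(1, max(abs(BB1)));
fprintf('Relative difference: %g\n', relErr);
```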
Is the sum loop of B(kk) * u(ii-kk+1) just conv(B(1:ii),u(1:ii),'same')?
The best way to speed up loops in MATLAB is to try to avoid them. See whether you can perform a matrix operation instead of the inner loop. For example, break the calculation into small parts, then decide whether some parts can be computed in advance without knowing the results of the next iteration of the loop.
As for the second part of your question, my guess: the end contains the check for whether the loop runs another round. This check by itself is not slow, but it is called 50,015,001 times!
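The count of inner-loop passes follows from the loop bounds: for each ii the inner loop runs ii times, so the total is 1 + 2 + ... + n = n(n+1)/2 with n = length(t). A short arithmetic check (the variable names are illustrative):

```matlab
t = 0:0.01:100;
n = numel(t);                    % number of time steps
innerPasses = n * (n + 1) / 2;   % sum of 1..n
fprintf('%d time steps, %d inner-loop passes\n', n, innerPasses);
```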

Purposefully Slow MATLAB Function?

I want to write a really, really, slow program for MATLAB. I'm talking like, O(2^n) or worse. It has to finish, and it has to be deterministically slow, so no "if rand() = 123,123, exit!" This sounds crazy, but it's actually for a distributed systems test. I need to create a .m file, compile it (with MCC), and then run it on my distributed system to perform some debugging operations.
The program must constantly be doing work, so sleep() is not a valid option.
I tried making a random large matrix and finding its inverse, but this was completing too quickly. Any ideas?
This naive implementation of the Discrete Fourier Transform takes ~ 9 seconds for a 2048 long input vector x on my 1.86 GHz single core machine. Going to 4096 inputs extends the time to ~ 35 seconds, close to the 4x I would expect for O(N^2). I don't have the patience to try longer inputs :)
function y = SlowDFT(x)
t = cputime;
y = zeros(size(x));
for c1=1:length(x)
for c2=1:length(x)
y(c1) = y(c1) + x(c2)*(cos((c1-1)*(c2-1)*2*pi/length(x)) - ...
1j*sin((c1-1)*(c2-1)*2*pi/length(x)));
end
end
disp(cputime-t);
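If the slow program should also produce a verifiable result, the naive double loop can be compared with the built-in fft on a short input. A minimal sketch, assuming a 64-sample random vector; it uses exp(-1j*...) in place of the cos and sin pair, which is the same quantity.

```matlab
x = randn(1, 64);
N = numel(x);
y = zeros(1, N);
for c1 = 1:N
    for c2 = 1:N
        % exp(-1j*theta) equals cos(theta) - 1j*sin(theta)
        y(c1) = y(c1) + x(c2) * exp(-2j*pi*(c1-1)*(c2-1)/N);
    end
end
maxDiff = max(abs(y - fft(x)));
fprintf('Largest difference from fft: %g\n', maxDiff);
```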
EDIT: Or if you're looking to stress memory more than CPU:
function y = SlowDFT_MemLookup(x)
t = cputime;
y = zeros(size(x));
cosbuf = cos((0:1:(length(x)-1))*2*pi/length(x));
for c1=1:length(x)
cosctr = 1;
sinctr = round(3*length(x)/4)+1;
for c2=1:length(x)
y(c1) = y(c1) + x(c2)*(cosbuf(cosctr) ...
-1j*cosbuf(sinctr));
cosctr = cosctr + (c1-1);
if cosctr > length(x), cosctr = cosctr - length(x); end
sinctr = sinctr + (c1-1);
if sinctr > length(x), sinctr = sinctr - length(x); end
end
end
disp(cputime-t);
This is faster than calculating sin and cos on each iteration. A 2048 long input took ~ 3 seconds, and a 16384 long input took ~ 180 seconds.
Count to 2^n. Optionally, make a slow function call in each iteration.
If you want real work that's easy to set up and stresses CPU way over memory:
Large dense matrix inversion (not slow enough? make it bigger.)
Factor an RSA number
How about using inv? It has been reported to be quite slow.
Do some work in a loop. You can tune the time it takes to complete using the number of loop iterations.
I don't speak MATLAB but something equivalent to the following might work.
loops = 0
counter = 0
while (loops < MAX_INT) {
counter = counter + 1;
if (counter == MAX_INT) {
loops = loops + 1;
counter = 0;
}
}
This will iterate MAX_INT*MAX_INT times. You can put some computationally heavy thing in the loop for it to take longer if this is not enough.
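A possible MATLAB rendering of the pseudocode above, offered as a sketch. MATLAB has no MAX_INT constant, so maxCount is a stand-in chosen small enough to finish quickly, and total is an extra counter added only to confirm the iteration count.

```matlab
maxCount = 300;      % hypothetical stand-in for MAX_INT
loops = 0;
counter = 0;
total = 0;           % counts every pass through the while loop
while loops < maxCount
    counter = counter + 1;
    total = total + 1;
    if counter == maxCount
        loops = loops + 1;
        counter = 0;
    end
end
fprintf('Iterations: %d (maxCount^2 = %d)\n', total, maxCount^2);
```

Raising maxCount scales the running time quadratically.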
Easy! Go back to your Turing machine roots and think of processes that are O(2^n) or worse.
Here's a fairly simple one (warning, untested but you get the point)
N = 12; radix = 10;
odometer = zeros(N, 1);
done = false;
while (~done)
done = true;
for i = 1:N
odometer(i) = odometer(i) + 1;
if (odometer(i) >= radix)
odometer(i) = 0;
else
done = false;
break;
end
end
end
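Since the odometer code above is marked as untested, here is a small instrumented sketch of it. It uses N = 3 and radix = 4 so that it finishes instantly, renames the loop index to k, and adds a numSteps counter (both are changes made for this illustration). The counter should end at radix^N.

```matlab
N = 3; radix = 4;                % small illustrative values
odometer = zeros(N, 1);
numSteps = 0;
done = false;
while ~done
    done = true;
    numSteps = numSteps + 1;
    for k = 1:N
        odometer(k) = odometer(k) + 1;
        if odometer(k) >= radix
            odometer(k) = 0;     % digit wraps, carry into the next digit
        else
            done = false;        % no carry, so the count continues
            break;
        end
    end
end
fprintf('Steps taken: %d (radix^N = %d)\n', numSteps, radix^N);
```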
Even better, how about calculating Fibonacci numbers recursively? Runtime is O(2^N), since fib(N) has to make two function calls fib(N-1) and fib(N-2), but stack depth is O(N), since only one of those function calls happens at a time.
function y = fib(n)
if (n <= 1)
y = 1;
else
y = fib(n-1) + fib(n-2);
end
end
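To see the exponential growth directly, the recursion can return the number of calls it made alongside the value. This is a sketch; fibCount is a hypothetical variant of the function above, not part of the original answer.

```matlab
function [y, calls] = fibCount(n)
% Same recursion as fib, but also returns how many calls were made
% in total, including this one.
if n <= 1
    y = 1;
    calls = 1;
else
    [y1, c1] = fibCount(n - 1);
    [y2, c2] = fibCount(n - 2);
    y = y1 + y2;
    calls = c1 + c2 + 1;
end
end
```

With this base case the call count works out to 2*fib(n) - 1, so it grows at the same rate as the Fibonacci numbers themselves.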
You could ask it to factor(X) for a suitably large X
You could also test if a given input is prime by just dividing it by all smaller numbers. This would give you O(n^2).
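A sketch of that idea follows. The function name isPrimeTrial is made up, and the loop deliberately does not exit early, so the amount of work depends only on n. One call performs about n divisions; applying it to every number up to n is what gives roughly n^2 divisions in total.

```matlab
function tf = isPrimeTrial(n)
% Deliberately slow primality test: tries every divisor from 2 to n-1
% and never stops early, so the running time is deterministic.
tf = n > 1;
for d = 2:n-1
    if mod(n, d) == 0
        tf = false;
    end
end
end
```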
Try this one:
tic
isprime( primes(99999999) );
toc
EDIT:
For a more extensive set of tests, use these benchmarks (perhaps for multiple repetitions even):
disp(repmat('-',1,85))
disp(['MATLAB Version ' version])
disp(['Operating System: ' system_dependent('getos')])
disp(['Java VM Version: ' version('-java')]);
disp(['Date: ' date])
disp(repmat('-',1,85))
N = 3000; % matrix size
A = rand(N,N);
A = A*A;
tic; A*A; t=toc;
fprintf('A*A \t\t\t%f sec\n', t)
tic; [L,U,P] = lu(A); t=toc; clear L U P
fprintf('LU(A)\t\t\t%f sec\n', t)
tic; inv(A); t=toc;
fprintf('INV(A)\t\t\t%f sec\n', t)
tic; [U,S,V] = svd(A); t=toc; clear U S V
fprintf('SVD(A)\t\t\t%f sec\n', t)
tic; [Q,R,P] = qr(A); t=toc; clear Q R P
fprintf('QR(A)\t\t\t%f sec\n', t)
tic; [V,D] = eig(A); t=toc; clear V D
fprintf('EIG(A)\t\t\t%f sec\n', t)
tic; det(A); t=toc;
fprintf('DET(A)\t\t\t%f sec\n', t)
tic; rank(A); t=toc;
fprintf('RANK(A)\t\t\t%f sec\n', t)
tic; cond(A); t=toc;
fprintf('COND(A)\t\t\t%f sec\n', t)
tic; sqrtm(A); t=toc;
fprintf('SQRTM(A)\t\t%f sec\n', t)
tic; fft(A(:)); t=toc;
fprintf('FFT\t\t\t%f sec\n', t)
tic; isprime(primes(10^7)); t=toc;
fprintf('Primes\t\t\t%f sec\n', t)
The following are the results on my machine using N=1000 for one iteration only (note primes is using as upper bound 10^7 NOT 10^8 [takes way more time!])
A*A 0.178329 sec
LU(A) 0.118864 sec
INV(A) 0.319275 sec
SVD(A) 15.236875 sec
QR(A) 0.841982 sec
EIG(A) 3.967812 sec
DET(A) 0.121882 sec
RANK(A) 1.813042 sec
COND(A) 1.809365 sec
SQRTM(A) 22.750331 sec
FFT 0.113233 sec
Primes 27.080918 sec
This will run at 100% CPU for WANTED_TIME seconds:
WANTED_TIME = 2^n; % seconds
t0=cputime;
t=cputime;
while (t-t0 < WANTED_TIME)
t=cputime;
end;
