Time complexity of element accessing in MATLAB - performance

The following two code snippets perform the same task (generating M samples uniformly from an N-dim sphere). I was wondering why the latter consumes much more time than the former.
%% MATLAB R2014a
M = 30;
N = 10000;
#1
tic
S = zeros(M, N);
for k = 1:M
    P = ones(1, N);
    for i = 1:N - 1
        t = rand*2*pi;
        P(1:i) = P(1:i)*sin(t);
        P(i+1) = P(i+1)*cos(t);
    end
    S(k,:) = P;
end
toc
#2
tic
S = ones(M, N);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(k, 1:i) = S(k, 1:i)*sin(t);
        S(k, i+1) = S(k, i+1)*cos(t);
    end
end
toc
The output is:
Elapsed time is 15.007667 seconds.
Elapsed time is 59.745311 seconds.
And I also tried M = 1,
Elapsed time is 0.463370 seconds.
Elapsed time is 1.566913 seconds.
#2 is nearly 4 times slower than #1. Is frequent 2d element accessing in #2 making it time-consuming?

The time difference is due to memory access patterns and how well they map onto the cache, and possibly also to how well MATLAB can exploit your hardware vector unit (SSE/AVX). MATLAB stores matrices "column-major", meaning S(2,1) is next to S(1,1) in memory.
In #1, you process each sample using the vector P, which lives in contiguous memory. These 80,000 bytes fit easily in L2 cache for the fast repeated access you need to perform. They're also neighbors, and trivially vectorized (I'm not certain if MATLAB performs this optimization, but I'd hope so...)
In #2, you access a row of S at a time, which is not contiguous, but rather is interleaved by M values. So each row is spread across 30*80,000 bytes, which does not fit in L2 cache. It'll have to be read back in for each repeated access, even though you're ignoring 29/30 values in that data.
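The stride described above can be made concrete with sub2ind, which returns the linear (memory-order) index of an element. This is only an illustrative sketch using the sizes from the question; the names rowIdx, colIdx and strideBytes are made up for the example.

```matlab
M = 30;
N = 10000;

% Linear indices of the first four elements of row 1 in an M-by-N array
% (the layout used by #2): consecutive row elements are M entries apart.
rowIdx = sub2ind([M N], ones(1, 4), 1:4);

% Linear indices of the first four elements of column 1 in an N-by-M array
% (the transposed layout): consecutive column elements are adjacent.
colIdx = sub2ind([N M], 1:4, ones(1, 4));

% Distance in bytes between neighbouring row elements, assuming 8-byte doubles.
strideBytes = (rowIdx(2) - rowIdx(1)) * 8;

disp(rowIdx);       % 1 31 61 91
disp(colIdx);       % 1 2 3 4
disp(strideBytes);  % 240
```

So walking along a row of the M-by-N matrix touches one double out of every 240 bytes, while walking down a column touches every byte in sequence.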
Here's the test. All I'm doing is transposing S so that you can process a column at a time instead, then putting it back at the end just to get the same result:
#3
tic
S = ones(N, M);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(1:i, k) = S(1:i, k)*sin(t);
        S(i+1, k) = S(i+1, k)*cos(t);
    end
end
S = S.';
toc
Results:
Elapsed time is 11.254212 seconds.
Elapsed time is 45.847750 seconds.
Elapsed time is 11.501580 seconds.
Yep, transposing S gets us the same contiguous access and performance as the separate vector approach. By the way, L3 vs. L2 is about 4x more clock cycles...
Let's see if we can find any breakpoints related to cache size. Here's N = 1000, where everything should fit in L2:
Elapsed time is 0.240184 seconds.
Elapsed time is 0.373448 seconds.
Elapsed time is 0.258566 seconds.
Much lower difference, though now we're probably into L1 effects.
Finally, here's a completely different way to solve your problem. It relies on the fact that multivariate normal RV's have the correct symmetry.
#4
tic
S = randn(M, N);
S = bsxfun(@rdivide, S, sqrt(sum(S.*S, 2)));
toc
Elapsed time is 10.714104 seconds.
Elapsed time is 45.351277 seconds.
Elapsed time is 11.031061 seconds.
Elapsed time is 0.015068 seconds.
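A quick sanity check on #4, assuming the goal is points on the unit sphere: after normalization every row should have Euclidean norm 1. This is a sketch rather than part of the original answer; the variable rowNorms and the 1e-12 tolerance are arbitrary choices for the illustration.

```matlab
M = 30;
N = 10000;

% Draw independent standard normal coordinates, one sample per row.
S = randn(M, N);

% Divide each row by its own Euclidean norm (same expression as #4).
S = bsxfun(@rdivide, S, sqrt(sum(S.*S, 2)));

% Every row should now lie on the unit sphere, up to rounding error.
rowNorms = sqrt(sum(S.*S, 2));
disp(max(abs(rowNorms - 1)));
```

The check only confirms the normalization; the uniformity on the sphere comes from the rotational symmetry of the normal distribution mentioned above.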

I suspect the advantage comes from using a hard-coded 1 in the access of the array. If you try M=1 you will still see a significant speedup for the sin(t) line. My guess is that the assembly under the hood can use immediate instructions as opposed to reloading the variable k into a register.

Related

How to benchmark a method in Octave?

MATLAB has a timeit function, which is helpful for comparing the performance of one implementation with another. I couldn't find something similar in Octave. I wrote this benchmark function, which runs a function f N times and then returns the total time taken. Is this a reasonable way to compare different implementations, or am I missing something critical like "warmup"?
function elapsed_time_in_seconds = benchmark(f, N)
% benchmark runs the function 'f' N times and returns the elapsed time in seconds.
timeid = tic;
for i=1:N
    output = f();
end
elapsed_time_in_seconds = toc(timeid);
end
MATLAB's timeit does the following (you can read the whole function, it's an M-file):
Obtain a rough estimate t_rough of the time for calling the function f.
Use the estimate to determine N such that N*t_rough is about 0.001 s.
Determine M such that M*N*t_rough is no more than 15 s, but M must be between 3 and 11.
Loop M times:
   Call f() N times and record the total time.
Determine the median of the M times, divided by N.
The purpose of the two loops, M and N, is as follows. Calling f() N times ensures that the time measured by tic/toc is large enough to be reliable; this loop avoids attempting to time something so short that it cannot be timed. Repeating the measurement M times and keeping the median makes the measurement robust against delays caused by other activity on your system, which can artificially inflate the recorded time.
The function subtracts the overhead of calling a function through its handle (determined experimentally by timing the call of an empty function), as well as the tic/toc call time (also determined experimentally). It does not subtract the cost of the inner loop, presumably because in MATLAB it is optimized by the JIT and its cost is negligible.
There are some further refinements. The function that determines t_rough first warms up tic and toc by calling each one twice, then uses a while loop to ensure it calls f() for at least 0.001 s. But in this loop, if the first iteration takes at least 3 s, it just takes that time as the rough estimate. If the first iteration takes less time, the first time count is discarded (warmup), and the median of all the subsequent calls is used as the rough estimate of the time.
There's also a lot of effort put into calling the function f() with the right number of output arguments.
The code has a lot of comments explaining the reason behind all these steps; it is worth reading.
As a minimum, I would augment your benchmark function as follows:
function elapsed_time_in_seconds = benchmark(f, N, M)
% benchmark runs the function 'f' in M batches of N calls and returns the
% median time per call in seconds.
tic; [~] = toc; tic; [~] = toc; % warm up tic/toc
output = f(); % warm up f
t = zeros(M, 1);
for k=1:M
    timeid = tic;
    for i=1:N
        output = f();
    end
    t(k) = toc(timeid) / N; % mean time per call in this batch
end
elapsed_time_in_seconds = median(t);
end
If you use the function to directly compare various alternatives, keeping N and M constant, then the overheads of tic, toc, function calls and loops are irrelevant.
This function does assume that f has one output argument, which is not necessarily the case. You could just call f() instead of output = f(), which will work for functions with or without output arguments. But if the function needs to have a certain number of outputs to work correctly, or to trigger computations that you want to time, then you'd have to adjust the function to call it with the right number of output arguments.
You could come up with some heuristic to determine M from N, which would make it a little easier to use this function.
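One possible heuristic, loosely modeled on the timeit steps listed earlier, is sketched below. The function name benchmark_auto, the 1 ms batch target and the 15 s budget are assumptions made for this illustration; this is not the actual timeit implementation, and it does not subtract any overheads.

```matlab
function t_per_call = benchmark_auto(f)
% benchmark_auto  Hypothetical helper that chooses N and M on its own and
% returns the median time per call of 'f' in seconds.
    min_batch = 0.001;  % assumed target: each timed batch lasts about 1 ms or more
    max_total = 15;     % assumed budget for the whole measurement, in seconds

    f();                % warmup call, result discarded

    % Rough estimate of the time for one call (guarded against a zero reading).
    id = tic;
    f();
    t_rough = max(toc(id), 1e-9);

    % N calls per batch so that a batch is long enough to time reliably.
    N = max(1, ceil(min_batch / t_rough));

    % M batches, clamped to the range 3..11, staying within the time budget.
    M = min(11, max(3, floor(max_total / (N * t_rough))));

    t = zeros(M, 1);
    for k = 1:M
        id = tic;
        for i = 1:N
            f();
        end
        t(k) = toc(id) / N;  % mean time per call in this batch
    end
    t_per_call = median(t);
end
```

It could be called as t = benchmark_auto(@() sum(rand(100, 1))). Because f is called without requesting outputs, the caveat about output arguments above applies here too.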

Efficient way of generating random integers within a range in Julia

I'm doing MC simulations and I need to generate random integers within a range between 1 and a variable upper limit n_mol.
The specific Julia function for doing this is rand(1:n_mol), where n_mol is an integer that changes with every MC iteration. The problem is that doing it this way is slow (possibly an issue to open for the Julia developers). So, instead of using that function call, I thought about generating a random float in [0,1), multiplying it by n_mol, and taking the integer part of the result: int(rand()*n_mol). The problem now is that int() rounds, so I could end up with numbers between 0 and n_mol, and I can't have 0. The solution I'm using for the moment is ifloor plus 1, ifloor(rand()*n_mol)+1, which is considerably faster than the first but slower than the second.
function t1(N,n_mol)
    for i = 1:N
        rand(1:n_mol)
    end
end
function t2(N,n_mol)
    for i = 1:N
        int(rand()*n_mol)
    end
end
function t3(N,n_mol)
    for i = 1:N
        ifloor(rand()*n_mol)+1
    end
end
@time t1(1e8,123456789)
@time t2(1e8,123456789)
@time t3(1e8,123456789)
elapsed time: 3.256220849 seconds (176 bytes allocated)
elapsed time: 0.482307467 seconds (176 bytes allocated)
elapsed time: 0.975422095 seconds (176 bytes allocated)
So, is there any way of doing this faster with speeds near the second test?
It's important because the MC simulation goes for more than 1e10 iterations.
The result has to be an integer because it will be used as an index of an array.
The rand(r::Range) code is quite fast, given the following two considerations. First, Julia calls a 52-bit RNG twice to obtain a random integer but only once to obtain a random float; with some bookkeeping, that gives a factor of 2.5. A second thing is that
(rand(Uint) % k)
is only evenly distributed between 0 and k-1 if k is a power of 2. This is taken care of with rejection sampling, which explains more or less the remaining additional cost.
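The modulo bias and its fix by rejection can be seen on a tiny example. The sketch below is written in MATLAB, to match the other examples on this page, and uses a pretend 3-bit generator with outputs 0..7 and k = 6; these numbers are chosen purely for illustration and the code is not Julia's implementation.

```matlab
x = 0:7;   % every possible output of a pretend 3-bit generator
k = 6;     % size of the target range 0..k-1, not a power of 2

% How often each residue 0..k-1 occurs when all outputs are reduced mod k.
biased = accumarray(mod(x, k).' + 1, 1).';

% Rejection sampling: only accept outputs below the largest multiple of k.
limit = floor(numel(x) / k) * k;   % 6 in this example
kept = x(x < limit);
unbiased = accumarray(mod(kept, k).' + 1, 1).';

disp(biased);    % 2 2 1 1 1 1
disp(unbiased);  % 1 1 1 1 1 1
```

Residues 0 and 1 are twice as likely without rejection; discarding the outputs 6 and 7 restores a uniform distribution at the cost of occasionally drawing again.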
If speed is extremely important, you can use a simpler random number generator than Julia's and ignore those issues. For example, with a linear congruential generator without rejection sampling:
function lcg(old)
    a = unsigned(2862933555777941757)
    b = unsigned(3037000493)
    a*old + b
end
function randfast(k, x::Uint)
    x = lcg(x)
    1 + rem(x, k) % Int, x
end
function t4(N, R)
    state = rand(Uint)
    for i = 1:N
        x, state = randfast(R, state)
    end
end
But be careful, if the range is (really) big.
m = div(typemax(Uint),3)*2
julia> mean([rand(1:m)*1.0 for i in 1:10^7])
6.148922790091841e18
julia> m/2
6.148914691236517e18
but (!)
julia> mean([(rand(Uint) % m)*1.0 for i in 1:10^7])
5.123459611164573e18
julia> 5//12*m
5.124095576030431e18
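The factor 5/12 can be derived by hand. The derivation below assumes that rand(Uint) is uniform on [0, T) and treats T as exactly 3m/2, which is how m was chosen above (m is two thirds of typemax(Uint), up to rounding).

```latex
% r is uniform on [0, T) with T = 3m/2, and x = r mod m.
% For r in [0, m):     x = r
% For r in [m, 3m/2):  x = r - m, which lands in [0, m/2)
\begin{align*}
E[x] &= \frac{1}{T}\left(\int_0^{m} r\,dr + \int_0^{m/2} r\,dr\right)
      = \frac{1}{T}\left(\frac{m^2}{2} + \frac{m^2}{8}\right) \\
     &= \frac{2}{3m}\cdot\frac{5m^2}{8} = \frac{5}{12}\,m
\end{align*}
```

An unbiased generator would give m/2 instead, so the lower half of the range is visibly over-represented.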
Note that in 0.4, int() is deprecated, and you're asked to use round() instead.
function t2(N,n_mol)
    for i = 1:N
        round(rand()*n_mol)
    end
end
gives 0.27 seconds on my machine (using Julia 0.4).
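Keep in mind the range problem raised in the question: rounding a scaled uniform number can produce 0 as well as n_mol, whereas floor plus 1 always stays within 1..n_mol. The sketch below shows this in MATLAB, to match the other examples on this page, with a few hand-picked values of u standing in for rand(); the values and names are illustrative only.

```matlab
n = 5;                           % stands in for n_mol
u = [0 0.05 0.45 0.95 0.999];    % hand-picked stand-ins for rand() in [0, 1)

byRound = round(u * n);          % rounding to nearest: can hit 0 and n
byFloor = floor(u * n) + 1;      % floor plus one: always within 1..n

disp(byRound);   % 0 0 2 5 5
disp(byFloor);   % 1 1 3 5 5
```

With rounding, the two end values 0 and n also each receive only half the probability of the interior values, so the result is neither a valid index range nor uniform.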

How can I fill a matrix with all the N-Ary numbers fast?

This feels like it should be simple: all I am trying to do is build a 7-column matrix consisting of all the mod 7 numbers, but it takes a huge amount of time to generate such a matrix with the following code:
to = 7^k;
msgValue = zeros(to,k);
for l=0:to
    for kCounter=0:(k-1)
        msgValue(l+1,kCounter+1)=mod((l/7^kCounter),7);
    end
end
msgValue = floor(msgValue);
How can I do this faster?
Or another vectorized approach (direct matrix multiplication):
msgValue = floor( mod( (0:7^k).' * (1./(7.^(0:k-1))),7 ) ) ;
a wee bit faster than the famous bsxfun ;-)
%// For 10000 iterations, k=3
Elapsed time is 2.280774 seconds. %// double loop
Elapsed time is 1.329179 seconds. %// bsxfun
Elapsed time is 0.958945 seconds. %// matrix multiplication
You can use a vectorized approach with bsxfun -
msgValue = floor(mod(bsxfun(@rdivide,[0:to]',7.^(0:(k-1))),7));
Quick runtime tests for k = 7:
-------------------- With Original Approach
Elapsed time is 1.519023 seconds.
-------------------- With Proposed Approach
Elapsed time is 0.279547 seconds.
I used a submission from MATLAB Central called rude, which I tend to use from time to time, and was able to eliminate one for loop and vectorize the code to some extent.
tic
k=7;
modval = 7;
to=modval^k;
mods = mod(0:(modval-1),modval);
msgValue=zeros(to,k);
for kCounter=1:k
    aux = rude(modval^(kCounter-1)*ones(1,modval),mods)';
    msgValue(:,kCounter) = repmat(aux,to/(7^kCounter),1);
end
toc
The idea behind the code is to make at the beginning of each iteration the building block of the column vector using the rude function. Rude, in turn, uses mods = [0 1 2 3 4 5 6] as a starting point for the manipulation. The real work is done through vectorization.
You did not mention how long your code takes to run, so I timed it just once to give you a rough idea. It ran in 0.43 seconds on my machine, a Windows 7 Ultimate, 2.4 GHz, 4GB RAM, dual CPU.
Also, the way you defined your loop adds a repetition in your msgValue matrix: the first row consists of zeros in all columns, and so does the last row, which I also fixed. For a toy example with k=3, your code returns a 344x3 matrix, while you explicitly initialize it as a 7³x3 (343x3) matrix.
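To illustrate the off-by-one point, here is a sketch of the vectorized bsxfun approach with the range limited to 0:to-1, followed by a check that the digits reconstruct the original numbers. The names base, n, w, digits and back are made up for this example.

```matlab
k = 3;
base = 7;
to = base^k;

n = (0:to-1).';        % exactly base^k values, so no extra all-zero row
w = base.^(0:k-1);     % place values 1, 7, 49

% Digit j of n is floor(mod(n / base^j, base)), computed for all n and j at once.
digits = floor(mod(bsxfun(@rdivide, n, w), base));

% Sanity check: weighting the digits by their place values gives n back.
back = digits * w.';

disp(size(digits));        % 343 3
disp(isequal(back, n));    % 1
```

The least significant digit is in the first column, matching the column order of the original loop.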

Matlab bsxfun - Why does bsxfun fail to work in this case?

I have a binary function roughly looks like
func=@(i,j)exp(-32*(i-j)^2);
with mesh as follows
[X Y]=meshgrid(-10:.1:10);
Strangely, arrayfun produces the right result while bsxfun would produce entries that are Inf.
an1=arrayfun(func,X,Y);
an2=bsxfun(func,X,Y);
>> max(max(abs(an1-an2)))
ans =
Inf
Why?
EDIT: Now that the question is resolved, I am including some benchmark data to facilitate the discussion on efficiency with bsxfun.
Assuming the grid is already produced with
[X Y]=meshgrid(Grid.partition);
func=@(i,j)exp(-32*(i-j).^2);
(I intend to re-use the grid many times in various places.)
Timing the nested named functions approach.
>> tic;for i=1:1000;temp3=exp(-32*bsxfun(@minus,Grid.partition.',Grid.partition).^2);end;toc,clear temp
Elapsed time is 1.473543 seconds.
>> tic;for i=1:1000;temp3=exp(-32*bsxfun(@minus,Grid.partition.',Grid.partition).^2);end;toc,clear temp
Elapsed time is 1.497116 seconds.
>> tic;for i=1:1000;temp3=exp(-32*bsxfun(@minus,Grid.partition.',Grid.partition).^2);end;toc,clear temp
Elapsed time is 1.816970 seconds.
Timing the anonymous function approach
>> tic;for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear temp
Elapsed time is 1.134980 seconds.
>> tic;for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear temp
Elapsed time is 1.171421 seconds.
>> tic;for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear temp
Elapsed time is 1.180998 seconds.
One can see that the anonymous function approach is faster than the nested function approach (excluding the time on meshgrid).
If the time on meshgrid is included,
>> tic;[X Y]=meshgrid(Grid.partition);for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear X Y temp
Elapsed time is 1.965701 seconds.
>> tic;[X Y]=meshgrid(Grid.partition);for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear X Y temp
Elapsed time is 1.249637 seconds.
>> tic;[X Y]=meshgrid(Grid.partition);for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear X Y temp
Elapsed time is 1.208296 seconds.
Hard to say...
According to the documentation, when you call bsxfun with an arbitrary function func,
func must be able to accept as input either two column vectors of the same size, or one column vector and one scalar, and return as output a column vector of the same size as the input(s).
Your function does not fulfill that. To correct it, replace ^ by .^:
func=@(i,j)exp(-32*(i-j).^2);
Anyway, instead of your function you could use one of bsxfun's built-in functions (see @Divakar's answer). That way you avoid meshgrid, and the code will probably be faster.
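To see why ^ breaks as soon as the function receives more than a scalar, compare the two operators on a small matrix, and then check that the element-wise version agrees across arrayfun, bsxfun with the anonymous function, and bsxfun with @minus. The small grid g and the variable names are chosen only for this sketch.

```matlab
A = [1 2; 3 4];
matPow  = A^2;     % matrix product A*A
elemPow = A.^2;    % each element squared

disp(matPow);      % [7 10; 15 22]
disp(elemPow);     % [1 4; 9 16]

% With .^ the three approaches give the same result on a small grid.
g = -1:0.5:1;
[X, Y] = meshgrid(g);
func = @(i,j) exp(-32*(i-j).^2);

an1 = arrayfun(func, X, Y);                  % scalar calls, one per element
an2 = bsxfun(func, X, Y);                    % anonymous function on arrays
an3 = exp(-32*bsxfun(@minus, g.', g).^2);    % built-in @minus, no meshgrid

disp(max(abs(an1(:) - an2(:))));
disp(max(abs(an1(:) - an3(:))));
```

arrayfun hides the problem because it only ever passes scalars, for which ^ and .^ coincide.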
Instead of that way of using Anonymous Functions with bsxfun, you could do something like this for a more efficient usage of bsxfun -
arr1 = -10:.1:10
an2 = exp(-32*bsxfun(@minus,arr1.',arr1).^2)
Benchmarking
To clarify OP's runtime comments, here is some benchmarking that compares bsxfun with an anonymous function against bsxfun with the built-in @minus.
Benchmarking code
func=@(i,j)exp(-32.*(i-j).^2);
num_iter = 1000;
%// Warm up tic/toc.
for k = 1:100000
    tic(); elapsed = toc();
end
disp('---------------------------- Using Anonymous Functions with bsxfun')
tic
for iter = 1:num_iter
    [X Y]=meshgrid(-10:.1:10);
    an2=bsxfun(func,X,Y);
end
toc, clear X Y an2
disp('---------------------------- Using bsxfuns built-in "@minus"')
tic
for iter = 1:num_iter
    arr1 = -10:.1:10;
    an2 = exp(-32*bsxfun(@minus,arr1',arr1).^2);
end
toc
Runtimes
---------------------------- Using Anonymous Functions with bsxfun
Elapsed time is 0.241312 seconds.
---------------------------- Using bsxfuns built-in "@minus"
Elapsed time is 0.221555 seconds.

Efficient replacement for ppval

I have a loop in which I use ppval to evaluate a set of values from a piecewise polynomial spline. The interpolation is easily the most time-consuming part of the loop and I am looking for a way to improve the function's efficiency.
More specifically, I'm using a finite difference scheme to calculate transient temperature distributions in friction welds. To do this I need to recalculate the material properties (as a function of temperature and position) at each time step. The rate limiting factor is the interpolation of these values. I could use an alternate finite difference scheme (less restrictive in the time domain) but would rather stick with what I have if at all possible.
I've included a MWE below:
x=0:.1:10;
y=sin(x);
pp=spline(x,y);
tic
for n=1:10000
    x_int=10*rand(1000,1);
    y_int=ppval(pp,x_int);
end
toc
plot(x,y,x_int,y_int,'*') % plot for sanity of data
Elapsed time is 1.265442 seconds.
Edit - I should probably mention that I would be more than happy with a simple linear interpolation between values, but the interp1 function is slower than ppval:
x=0:.1:10;
y=sin(x);
tic
for n=1:10000
    x_int=10*rand(1000,1);
    y_int=interp1(x,y,x_int,'linear');
end
toc
plot(x,y,x_int,y_int,'*') % plot for sanity of data
Elapsed time is 1.957256 seconds.
This is slow, because you're running into the single most annoying limitation of JIT. It's the cause of many many many oh so many questions in the MATLAB tag here on SO:
MATLAB's JIT accelerator cannot accelerate loops that call non-builtin functions.
Both ppval and interp1 are not built in (check with type ppval or edit interp1). Their implementation is not particularly slow, they just aren't fast when placed in a loop.
Now I have the impression it's getting better in more recent versions of MATLAB, but there are still quite massive differences between "inlined" and "non-inlined" loops. Why their JIT doesn't automate this task by simply recursing into non-builtins, I really have no idea.
Anyway, to fix this, you should copy-paste the essence of what happens in ppval into the loop body:
% Example data
x = 0:.1:10;
y = sin(x);
pp = spline(x,y);
% Your original version
tic
for n = 1:10000
    x_int = 10*rand(1000,1);
    y_int = ppval(pp, x_int);
end
toc
% "inlined" version
tic
br = pp.breaks.';
cf = pp.coefs;
for n = 1:10000
    x_int = 10*rand(1000,1);
    [~, inds] = histc(x_int, [-inf; br(2:end-1); +inf]);
    x_shf = x_int - br(inds);
    zero = ones(size(x_shf));
    one = x_shf;
    two = one .* x_shf;
    three = two .* x_shf;
    y_int = sum( [three two one zero] .* cf(inds,:), 2);
end
toc
Results on my crappy machine:
Elapsed time is 2.764317 seconds. % ppval
Elapsed time is 1.695324 seconds. % "inlined" version
The difference is actually less than what I expected, but I think that's mostly due to the sum() -- for this ppval case, I usually only need to evaluate a single site per iteration, which you can do without histc (but with simple vectorized code) and matrix/vector multiplication x*y (BLAS) instead of sum(x.*y) (fast, but not BLAS-fast).
Oh well, a ~40% reduction in run time (about 1.6x faster) is not bad :)
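When inlining like this, it is worth checking once that the hand-written evaluation agrees with ppval. The sketch below reuses the histc-based interval lookup from the answer, but swaps in Horner's rule for the explicit powers and sum(); that substitution, the variable names d and y_inl, and the tolerance are choices made for this illustration.

```matlab
x = 0:.1:10;
y = sin(x);
pp = spline(x, y);

br = pp.breaks.';    % break points as a column vector
cf = pp.coefs;       % one row of cubic coefficients per interval, highest power first

x_int = 10*rand(1000, 1);

% Find the interval each query point falls in (same lookup as the inlined version).
[~, inds] = histc(x_int, [-inf; br(2:end-1); +inf]);

% Offset from the left break of that interval.
d = x_int - br(inds);

% Horner's rule: ((a*d + b)*d + c)*d + e, evaluated element-wise.
y_inl = ((cf(inds,1).*d + cf(inds,2)).*d + cf(inds,3)).*d + cf(inds,4);

% Largest deviation from the reference implementation.
disp(max(abs(y_inl - ppval(pp, x_int))));
```

Horner's rule also avoids building the temporary 1000-by-4 matrix, though whether that helps the timing would have to be measured.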
It is a bit surprising that interp1 is slower than ppval, but having a quick look at its source code, it seems that it has to check for many special cases and has to loop over all the points since it cannot be sure if the step size is constant.
I didn't check the timing, but I guess you can speed up the linear interpolation by a lot if you can guarantee that the steps in x of your table are constant, and that the values to be interpolated are strictly within the given range, so that you do not have to do any checking. In that case, linear interpolation can be converted to a simple lookup problem like so:
%data to be interpolated, on grid with constant step
x = 0:0.5:10;
y = sin(x);
x_int = 0:0.1:9.9;
%make sure it is interpolation, not extrapolation
assert(all(x(1) <= x_int & x_int < x(end)));
% compute mapping, this can be precomputed for constant grid
slope = (length(x) - 1) / (x(end) - x(1));
offset = 1 - slope*x(1);
%map x_int to the index range 1..length(x)
xmapped = offset + slope * x_int;
ind = floor(xmapped);
frac = xmapped - ind;
%interpolate by taking weighted sum of neighbouring points
y_int = y(ind) .* (1 - frac) + y(ind+1) .* frac;
% make plot to check correctness
plot(x, y, 'o-', x_int, y_int, '.')
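Besides the plot, the lookup can be checked numerically against interp1 on the same data. This is a sketch; the names y_fast and y_ref and the tolerance are assumptions for the illustration.

```matlab
% Table on a grid with constant step, and query points strictly inside it.
x = 0:0.5:10;
y = sin(x);
x_int = 0:0.1:9.9;

% Map each query point to a fractional index into the table.
slope = (length(x) - 1) / (x(end) - x(1));
offset = 1 - slope*x(1);
xmapped = offset + slope * x_int;
ind = floor(xmapped);      % index of the left neighbour
frac = xmapped - ind;      % position between left and right neighbour, in [0, 1)

% Weighted sum of the two neighbouring table values.
y_fast = y(ind) .* (1 - frac) + y(ind+1) .* frac;

% Reference result from the built-in linear interpolation.
y_ref = interp1(x, y, x_int, 'linear');

disp(max(abs(y_fast - y_ref)));
```

The check relies on x_int staying below x(end); a query exactly at the last grid point would make ind+1 run past the end of y, which is what the assert in the code above guards against.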
