Performance of the first vs second call of a function - compilation

Consider the following piece of code (Julia)
bar(x) = for i = 1:9999 x+x*x-x+x end # Define the "bar" function
print("First try: "); #time bar(0.5)
print("Second try: "); #time bar(0.5)
bar(x) = for i = 1:9999 x+x*x-x+x end # Redefine the same "bar" function
print("Third try: "); #time bar(0.5)
print("Fourth try: "); #time bar(0.6)
The output is
First try: elapsed time: 0.002738996 seconds (88152 bytes allocated)
Second try: elapsed time: 3.827e-6 seconds (80 bytes allocated)
Third try: elapsed time: 0.002907554 seconds (88152 bytes allocated)
Fourth try: elapsed time: 2.395e-6 seconds (80 bytes allocated)
Why is the second (and fourth) try so much faster (and take up less memory) than the first (and third) try?

Julia has, as I understand it, a just-in-time compiler. So the first (and third) run compiles the code (with the allocations needed for that), while the second (and fourth) runs just execute the previously compiled code.

Just to expand on Paul's answer: a big part of the speedup comes from Julia's type inference and multiple dispatch. Say the first time you evaluate the function it is with a float: the JIT (just-in-time compiler) figures out the type of the argument and generates appropriate LLVM code. If you then evaluate the same function with an integer, different LLVM code is compiled for that case. On subsequent calls, the function dispatches to the LLVM code matching the type of the argument. This is why it wouldn't make sense, in general, to compile when you define the function.
You can read more about this in the Julia documentation, for example (there are tons of references to multiple dispatch there!).
Consider, for example:
bar(x) = for i = 1:9999 x+x*x-x+x end
print("First try with floats: "); #time bar(0.5)
print("Second try with floats: "); #time bar(0.5)
print("First try with integers: "); #time bar(1)
print("Second try with integers: "); #time bar(1)
which gives:
First try with floats: elapsed time: 0.005570773 seconds (102440 bytes allocated)
Second try with floats: elapsed time: 5.762e-6 seconds (80 bytes allocated)
First try with integers: elapsed time: 0.003584026 seconds (86896 bytes allocated)
Second try with integers: elapsed time: 6.402e-6 seconds (80 bytes allocated)
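If you want to see this per-type specialization directly, here is a minimal sketch (my addition, not from the original answers; it assumes a reasonably recent Julia where @code_llvm lives in InteractiveUtils):
using InteractiveUtils   # provides @code_llvm (loaded automatically in the REPL)
bar(x) = for i = 1:9999 x + x*x - x + x end
# Julia compiles one specialization of bar per concrete argument type,
# so the two calls below print different generated code.
@code_llvm bar(0.5)   # the Float64 specialization
@code_llvm bar(1)     # the Int specialization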

Related

Is the timing of MATLAB reliable? If yes, can we reproduce the performance with julia, fortran, etc.?

Originally this is a problem that came up on mathematica.SE, but since multiple programming languages are involved in the discussion, I think it's better to rephrase it a bit and post it here.
In short, michalkvasnicka found that in the following MATLAB sample
s = 15000;
tic
% for-loop version
H = zeros(s,s);
for c = 1:s
    for r = 1:s
        H(r,c) = 1/(r+c-1);
    end
end
toc
%Elapsed time is 1.359625 seconds.... For-loop
tic;
% vectorized version
c = 1:s;
r = c';
HH=1./(r+c-1);
toc
%Elapsed time is 0.047916 seconds.... Vectorized
isequal(H,HH)
the vectorized code piece is more than 25 times faster than the pure for-loop code piece. Though I don't have access to MATLAB and cannot test the sample myself, the timing of 1.359625 seconds seems to suggest it was tested on an average PC, much like mine.
But I cannot reproduce the timing with other languages like Fortran or Julia! (We know both of them are famous for their numeric performance. Well, I admit I'm by no means an expert in Fortran or Julia.)
The following are the samples I used for testing. I'm using a laptop with an i7-8565U CPU, running Windows 10.
fortran
The Fortran code is compiled with gfortran (TDM-GCC-10.3.0-2, with the compile option -Ofast).
program tst
    use, intrinsic :: iso_fortran_env
    implicit none
    integer, parameter :: s = 15000
    integer :: r, c
    real(real64) :: hmn(s,s)
    do r = 1, s
        do c = 1, s
            hmn(r,c) = 1._real64/(r + c - 1)
        end do
    end do
    print *, hmn(s,s)
end program
compilation timing: 0.2057823 seconds
execution timing: 0.7179657 seconds
julia
Version of julia is 1.6.3.
@time (s=15000; Hmm=[1. /(r+c-1) for r=1:s,c=1:s];)
Timing: 0.7945998 seconds
Here comes the question:
Is the timing of MATLAB reliable?
If the answer to the 1st question is yes, then how can we reproduce the performance (for a 2 GHz CPU, the timing should be around 0.05 seconds) with Julia, Fortran, or any other programming language?
Just to add on the Julia side - make sure you use BenchmarkTools to benchmark, wrap the code you want to benchmark in functions so as not to benchmark in global scope, and interpolate any variables you pass to @btime.
Here's how I would do it:
julia> s = 15_000;
julia> function f_loop!(H)
           for c ∈ 1:size(H, 1)
               for r ∈ 1:size(H, 1)
                   H[r, c] = 1 / (r + c - 1)
               end
           end
       end
f_loop! (generic function with 1 method)
julia> function f_vec!(H)
           c = 1:size(H, 1)
           r = c'
           H .= 1 ./ (r .+ c .- 1)
       end
f_vec! (generic function with 1 method)
julia> H = zeros(s, s);
julia> using BenchmarkTools
julia> @btime f_loop!($H);
625.891 ms (0 allocations: 0 bytes)
julia> H = zeros(s, s);
julia> @btime f_vec!($H);
625.248 ms (0 allocations: 0 bytes)
So both versions come in at the same time, which is what I'd expect for such a straightforward operation where a properly type-inferred code should compile down to roughly the same machine code.
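If MATLAB's advantage really does come from implicit multi-threading (as suggested by the answers further down), a hedged sketch of a threaded variant would look like the following (f_threaded! is my own name, not from the answer; it assumes Julia is started with several threads, e.g. julia -t 8, and timings will vary by machine):
function f_threaded!(H)
    # Columns are independent, so distribute them across threads.
    Threads.@threads for c in 1:size(H, 2)
        for r in 1:size(H, 1)
            @inbounds H[r, c] = 1 / (r + c - 1)
        end
    end
end
# Usage, analogous to the benchmarks above:
# H = zeros(s, s); @btime f_threaded!($H);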
tic/toc should be fine, but it looks like the timing is being skewed by memory pre-allocation.
I can reproduce similar timings to your MATLAB example, however:
On first run (clear workspace)
Loop approach takes 2.08 sec
Vectorised approach takes 1.04 sec
Vectorisation saves 50% execution time
On second run (workspace not cleared)
Loop approach takes 2.55 sec
Vectorised approach takes 0.065 sec
Vectorisation "saves" 97.5% execution time
My guess would be that since the loop approach explicitly creates a new matrix via zeros, the memory is reallocated from scratch on every run and you don't see the speed improvement on subsequent runs.
However, when HH remains in memory and the HH=___ line outputs a matrix of the same size, I suspect MATLAB is doing some clever memory allocation to speed up the operation.
We can prove this theory with the following test:
Test Num | Workspace cleared | s | Loop (sec) | Vectorised (sec)
1 | Yes | 15000 | 2.10 | 1.41
2 | No | 15000 | 2.73 | 0.07
3 | No | 15000 | 2.50 | 0.07
4 | No | 15001 | 2.74 | 1.73
See the variation between tests 2 and 3; this is why timeit would have been helpful for an average runtime (see footnote). The difference in output sizes between tests 3 and 4 is pretty small, but the execution time returns to a similar magnitude to that in test 1 for the vectorised approach, suggesting that the re-allocation to create HH costs most of the time.
Footnote: tic/toc timings in MATLAB can be improved by using the in-built timeit function, which essentially takes an average over several runs. One interesting thing to observe from the workings of timeit though is that it explicitly "warms up" (quoting a comment) the tic/toc function by calling it a couple of times. You can see when running tic/toc a few times from a clear workspace (with no intermediate code) that the first call takes longer than subsequent calls, as there must be some overhead for getting the timer initialised.
I hope that the following modified benchmark could bring some new light to the problem:
s = 15000;
tic
% for-loop version
H = zeros(s,s);
for i = 1:10
    for c = 1:s
        for r = 1:s
            H(r,c) = H(r,c) + 1/(r+c-1+i);
        end
    end
end
toc
tic;
% vectorized version
HH = zeros(s,s);
c = 1:s;
r = c';
for i = 1:10
    HH = HH + 1./(r+c-1+i);
end
toc
isequal(H,HH)
In this case any kind of "caching" is avoided by changing the matrix H (HH) at each iteration of the for-loop over i.
In this case we get:
Elapsed time is 3.737275 seconds. (for-loop)
Elapsed time is 1.143387 seconds. (vectorized)
So, there is still a performance improvement (~3x) due to the vectorization, which is probably due to the implicit multi-threading of vectorized MATLAB commands.
Yes, tic/toc vs timeit is not strictly consistent, but the overall timing functionality is very similar.
To add to this, here is a simple python script which does the vectorized operation with numpy:
from timeit import default_timer
import numpy as np
s = 15000
start = default_timer()
# for-loop
H = np.zeros([s, s])
for c in range(1, s + 1):
    for r in range(1, s + 1):
        H[r - 1, c - 1] = 1 / (r + c - 1)
end = default_timer()
print(end - start)
start = default_timer()
# vectorized
c = np.arange(1, s + 1).reshape([1, -1])
r = c.T
HH = 1 / (c + r - 1)
end = default_timer()
print(end - start)
for-loop: 32.94566780002788 seconds
vectorized: 0.494859800033737 seconds
While the for-loop version is terribly slow, the vectorized version is faster than the posted fortran/julia times. Numpy internally tries to use special SIMD hardware instructions to speed up arithmetic on vectors, which can make a significant difference. It's possible that the fortran/julia compilers weren't able to generate those instructions from the provided code, but numpy/matlab were able to. However, Matlab is still about 10x faster than the numpy code, which I don't think would be explained by better use of SIMD instructions. Instead, they may also be using multiple threads to parallelize the computation, since the matrix is fairly large.
Ultimately, I think the matlab numbers are plausible, but I'm not sure exactly how they're getting their speedup.
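To probe the SIMD hypothesis on the Julia side, one possible experiment (my addition, not from the answers; f_simd! is my own name) is to annotate the inner loop with @inbounds and @simd and compare it against the plain loop from the earlier answer. Whether the compiler actually emits vector division instructions depends on the CPU and the Julia version, so this is a sketch to measure, not a guaranteed speedup:
using BenchmarkTools
function f_simd!(H)
    for c in 1:size(H, 2)
        # @inbounds removes bounds checks; @simd allows the compiler to vectorize.
        @inbounds @simd for r in 1:size(H, 1)
            H[r, c] = 1 / (r + c - 1)
        end
    end
end
H = zeros(15_000, 15_000);
@btime f_simd!($H);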

How to benchmark a method in Octave?

MATLAB has a timeit method which is helpful to compare the performance of one implementation with another. I couldn't find something similar in Octave. I wrote this benchmark method which runs a function f N times and then returns the total time taken. Is this a reasonable way to compare different implementations, or am I missing something critical like "warmup"?
function elapsed_time_in_seconds = benchmark(f, N)
    % benchmark runs the function 'f' N times and returns the elapsed time in seconds.
    timeid = tic;
    for i=1:N
        output = f();
    end
    elapsed_time_in_seconds = toc(timeid);
end
MATLAB's timeit does the following (you can read the whole function, it's an M-file):
Obtain a rough estimate t_rough of the time for calling the function f.
Use the estimate to determine N such that N*t_rough is about 0.001 s.
Determine M such that M*N*t_rough is no more than 15 s, but M must be between 3 and 11.
Loop M times:
   Call f() N times and record the total time.
Determine the median of the M times, divided by N.
The purpose of the two loops, M and N, is as follows: calling f() N times ensures that the time measured by tic/toc is sufficiently large to be reliable; this loop avoids attempting to time something so short that it cannot be timed. Repeating the measurement M times and keeping the median makes the measurement robust against delays caused by other things happening on your system, which can artificially inflate the recorded time.
The function subtracts the overhead of calling a function through its handle (determined experimentally by timing the call of an empty function), as well as the tic/toc call time (also determined experimentally). It does not subtract the cost of the inner loop, presumably because in MATLAB it is optimized by the JIT and its cost is negligible.
There are some further refinements. The function that determines t_rough first warms up tic and toc by calling each one twice, then it uses a while loop to ensure it calls f() for at least 0.001 s. But in this loop, if the first iteration takes at least 3 s, it just takes that time as the rough estimate. If the first iteration takes less time, the first time count is discarded (warmup), and then uses the median of all the subsequent calls as the rough estimate of the time.
There's also a lot of effort put into calling the function f() with the right number of output arguments.
The code has a lot of comments explaining the reason behind all these steps, it is worth reading.
As a minimum, I would augment your benchmark function as follows:
function elapsed_time_in_seconds = benchmark(f, N, M)
    % benchmark runs the function 'f' N*M times and returns the elapsed time in seconds.
    tic; [~] = toc; tic; [~] = toc; % warmup
    output = f(); % warmup
    t = zeros(M, 1);
    for k=1:M
        timeid = tic;
        for i=1:N
            output = f();
        end
        t(k) = toc(timeid) / N;
    end
    elapsed_time_in_seconds = median(t);
end
If you use the function to directly compare various alternatives, keeping N and M constant, then the overheads of tic, toc, function calls and loops are irrelevant.
This function does assume that f has one output argument, which is not necessarily the case. You could just call f() instead of output = f(), which will work for functions with or without output arguments. But if the function needs to have a certain number of outputs to work correctly, or to trigger computations that you want to time, then you'd have to adjust the function to call it with the right number of output arguments.
You could come up with some heuristic to determine M from N, which would make it a little easier to use this function.

Efficient way of generating random integers within a range in Julia

I'm doing MC simulations and I need to generate random integers within a range between 1 and a variable upper limit n_mol.
The specific Julia function for doing this is rand(1:n_mol), where n_mol is an integer that changes with every MC iteration. The problem is that doing it this way is slow... (possibly an issue to open with the Julia developers). So, instead of using that particular function call, I thought about generating a random float in [0,1), multiplying it by n_mol, and then taking the integer part of the result: int(rand()*n_mol). The problem now is that int() rounds to the nearest integer, so I could end up with numbers between 0 and n_mol, and 0 is not a valid result... so the solution I'm using for the moment is ifloor plus 1, ifloor(rand()*n_mol)+1, which is considerably faster than the first but slower than the second.
function t1(N,n_mol)
    for i = 1:N
        rand(1:n_mol)
    end
end
function t2(N,n_mol)
    for i = 1:N
        int(rand()*n_mol)
    end
end
function t3(N,n_mol)
    for i = 1:N
        ifloor(rand()*n_mol)+1
    end
end
@time t1(1e8,123456789)
@time t2(1e8,123456789)
@time t3(1e8,123456789)
elapsed time: 3.256220849 seconds (176 bytes allocated)
elapsed time: 0.482307467 seconds (176 bytes allocated)
elapsed time: 0.975422095 seconds (176 bytes allocated)
So, is there any way of doing this faster with speeds near the second test?
It's important because the MC simulation goes for more than 1e10 iterations.
The result has to be an integer because it will be used as an index of an array.
The rand(r::Range) code is quite fast, given the following two considerations. First, Julia calls a 52-bit RNG twice to obtain a random integer, but only once to obtain a random float; with some bookkeeping, that accounts for a factor of about 2.5. Second,
(rand(Uint) % k)
is only evenly distributed between 0 and k-1 if k is a power of 2. This is taken care of with rejection sampling, which explains more or less the remaining additional cost.
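The rejection-sampling idea can be sketched as follows (my illustration, written with the modern UInt64 spelling; it is not Julia's actual implementation):
function rand_below(k::UInt64)
    # Largest multiple of k that is <= typemax(UInt64); accepting only draws
    # below it makes every residue 0:k-1 equally likely.
    limit = typemax(UInt64) - typemax(UInt64) % k
    while true
        x = rand(UInt64)
        x < limit && return x % k
    end
end
# A uniform draw from 1:k would then be rand_below(k) + 1.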
If speed is extremely important, you can use a simpler random number generator than Julia's and ignore those issues. For example, a linear congruential generator without rejection sampling:
function lcg(old)
    a = unsigned(2862933555777941757)
    b = unsigned(3037000493)
    a*old + b
end
function randfast(k, x::Uint)
    x = lcg(x)
    1 + rem(x, k) % Int, x
end
function t4(N, R)
    state = rand(Uint)
    for i = 1:N
        x, state = randfast(R, state)
    end
end
But be careful if the range is (really) big.
m = div(typemax(Uint),3)*2
julia> mean([rand(1:m)*1.0 for i in 1:10^7])
6.148922790091841e18
julia> m/2
6.148914691236517e18
but (!)
julia> mean([(rand(Uint) % m)*1.0 for i in 1:10^7])
5.123459611164573e18
julia> 5/12*m
5.124095576030431e18
Note that in 0.4, int() is deprecated, and you're asked to use round() instead.
function t2(N,n_mol)
    for i = 1:N
        round(rand()*n_mol)
    end
end
gives 0.27 seconds on my machine (using Julia 0.4).
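For completeness, int() and ifloor() no longer exist in current Julia (1.x). A hedged modern equivalent of t3 would be the following (t3_modern is my own name; also note that rand(1:n_mol) itself has been optimized considerably since 0.3, so it is worth re-benchmarking before replacing it):
function t3_modern(N, n_mol)
    for i = 1:N
        floor(Int, rand() * n_mol) + 1   # floor(Int, x) replaces the removed ifloor(x)
    end
end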

Julia: use of pmap with matrices

I have a question about the use of pmap. I think it's a simple/obvious answer but still can't figure it out! I am currently running a loop where each of 50 iterations is separate and so running it in parallel should be possible and should improve speed. It uses a function that has multiple inputs and outputs, which are both a mixture of vectors and scalars. I need to save the outputs of the function for each of the 50 iterations for later use. Here are the basics of the code when not in parallel.
A=Array(Float64, 500,50)
b=Array(Float64,50)
for i in 1:50
    A[:,i],b[i] = func(i,x,y,z)
end
Any advice for how to implement this in parallel? I'm using Julia v0.3.
Thanks in advance.
David
This worked for me.
@everywhere x,y,z = 1,2,3
@everywhere function f(i,x,y,z)
    sleep(1)
    return(ones(500)*i, i+x+y+z)
end
naive = @time map(i -> f(i,x,y,z), 1:50)
parallel = @time pmap(i -> f(i,x,y,z), 1:50)
A = [x[1] for x in parallel]
b = [x[2] for x in parallel]
Let me know if anyone can suggest a more elegant way to get A and b out of the array of tuples that is produced by pmap.
The timings (when run on 8 processes) are as we would expect
elapsed time: 5.063214725 seconds (94436 bytes allocated)
elapsed time: 0.815228485 seconds (288864 bytes allocated)
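As for the request above for a more elegant way to split the array of tuples, a hedged sketch in modern Julia (1.x, not the v0.3 used in the question) could broadcast the tuple accessors:
A = reduce(hcat, first.(parallel))   # 500x50 matrix from the first elements of each tuple
b = last.(parallel)                  # vector of the second elements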

Matlab bsxfun - Why does bsxfun fail to work in this case?

I have a binary function roughly looks like
func=@(i,j)exp(-32*(i-j)^2);
with mesh as follows
[X Y]=meshgrid(-10:.1:10);
Strangely, arrayfun produces the right result while bsxfun would produce entries that are Inf.
an1=arrayfun(func,X,Y);
an2=bsxfun(func,X,Y);
>> max(max(abs(an1-an2)))
ans =
Inf
Why?
EDIT: Now that the question is resolved, I am including some benchmark data to facilitate the discussion on efficiency with bsxfun.
Assuming the grid is already produced with
[X Y]=meshgrid(Grid.partition);
func=@(i,j)exp(-32*(i-j).^2);
(I intend to re-use the grid many times in various places.)
Timing the nested named functions approach.
>> tic;for i=1:1000;temp3=exp(-32*bsxfun(@minus,Grid.partition.',Grid.partition).^2);end;toc,clear temp
Elapsed time is 1.473543 seconds.
>> tic;for i=1:1000;temp3=exp(-32*bsxfun(@minus,Grid.partition.',Grid.partition).^2);end;toc,clear temp
Elapsed time is 1.497116 seconds.
>> tic;for i=1:1000;temp3=exp(-32*bsxfun(@minus,Grid.partition.',Grid.partition).^2);end;toc,clear temp
Elapsed time is 1.816970 seconds.
Timing the anonymous function approach
>> tic;for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear temp
Elapsed time is 1.134980 seconds.
>> tic;for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear temp
Elapsed time is 1.171421 seconds.
>> tic;for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear temp
Elapsed time is 1.180998 seconds.
One can see that the anonymous function approach is faster than the nested function approach (excluding the time on meshgrid).
If the time on meshgrid is included,
>> tic;[X Y]=meshgrid(Grid.partition);for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear X Y temp
Elapsed time is 1.965701 seconds.
>> tic;[X Y]=meshgrid(Grid.partition);for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear X Y temp
Elapsed time is 1.249637 seconds.
>> tic;[X Y]=meshgrid(Grid.partition);for i=1:1000;temp=bsxfun(func,X,Y);end;toc,clear X Y temp
Elapsed time is 1.208296 seconds.
Hard to say...
According to the documentation, when you call bsxfun with an arbitrary function func,
func must be able to accept as input either two column vectors of the same size, or one column vector and one scalar, and return as output a column vector of the same size as the input(s).
Your function does not fulfill that. To correct it, replace ^ by .^:
func=#(i,j)exp(-32*(i-j).^2);
Anyway, instead of your function you could use one of bsxfun's built-in functions (see @Divakar's answer). That way you avoid meshgrid, and the code will probably be faster.
Instead of that way of using Anonymous Functions with bsxfun, you could do something like this for a more efficient usage of bsxfun -
arr1 = -10:.1:10
an2 = exp(-32*bsxfun(@minus,arr1.',arr1).^2)
Benchmarking
Trying to clarify the OP's runtime comments here by comparing bsxfun's anonymous-function capability against the built-in @minus with some benchmarking.
Benchmarking code
func=@(i,j)exp(-32.*(i-j).^2);
num_iter = 1000;
%// Warm up tic/toc.
for k = 1:100000
    tic(); elapsed = toc();
end
disp('---------------------------- Using Anonymous Functions with bsxfun')
tic
for iter = 1:num_iter
    [X Y]=meshgrid(-10:.1:10);
    an2=bsxfun(func,X,Y);
end
toc, clear X Y an2
disp('---------------------------- Using bsxfuns built-in "@minus"')
tic
for iter = 1:num_iter
    arr1 = -10:.1:10;
    an2 = exp(-32*bsxfun(@minus,arr1',arr1).^2);
end
toc
Runtimes
---------------------------- Using Anonymous Functions with bsxfun
Elapsed time is 0.241312 seconds.
---------------------------- Using bsxfuns built-in "@minus"
Elapsed time is 0.221555 seconds.
