I'm doing MC simulations and I need to generate random integers within a range between 1 and a variable upper limit n_mol.
The specific Julia function for doing this is rand(1:n_mol), where n_mol is an integer that changes with every MC iteration. The problem is that doing it this way is slow (possibly an issue to open for the Julia developers). So, instead of using that particular function call, I thought about generating a random float in [0,1), multiplying it by n_mol and then taking the integer part of the result: int(rand()*n_mol). The problem now is that int() rounds to the nearest integer, so I could end up with numbers between 0 and n_mol, and 0 is not acceptable. So the solution I'm using for the moment is ifloor plus 1, ifloor(rand()*n_mol)+1, which is considerably faster than the first version, but slower than the second.
function t1(N,n_mol)
    for i = 1:N
        rand(1:n_mol)
    end
end
function t2(N,n_mol)
    for i = 1:N
        int(rand()*n_mol)
    end
end
function t3(N,n_mol)
    for i = 1:N
        ifloor(rand()*n_mol)+1
    end
end
@time t1(1e8,123456789)
@time t2(1e8,123456789)
@time t3(1e8,123456789)
elapsed time: 3.256220849 seconds (176 bytes allocated)
elapsed time: 0.482307467 seconds (176 bytes allocated)
elapsed time: 0.975422095 seconds (176 bytes allocated)
So, is there any way of doing this faster with speeds near the second test?
It's important because the MC simulation goes for more than 1e10 iterations.
The result has to be an integer because it will be used as an index of an array.
The rand(r::Range) code is quite fast, given the following two considerations. First, Julia calls a 52-bit RNG twice to obtain a random integer but only once to obtain a random float; with some bookkeeping that gives a factor of about 2.5. Second,
(rand(Uint) % k)
is evenly distributed between 0 and k-1 only if k is a power of 2. This is taken care of with rejection sampling, which explains more or less the remaining additional cost.
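To illustrate the fix, here is a minimal sketch of the rejection-sampling idea (not Julia's actual Base implementation; written with the modern UInt spelling): raw draws at or above the largest representable multiple of k are discarded, so the surviving values are exactly uniform on 0:k-1.
function unbiased_mod(k::UInt)
    # lim is the largest multiple of k not exceeding typemax(UInt);
    # draws at or above it would over-weight the small residues, so reject them
    lim = typemax(UInt) - rem(typemax(UInt), k)
    while true
        x = rand(UInt)
        x < lim && return x % k
    end
end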
If speed is extremely important you can use a simpler random number generator than Julia's and ignore those issues, for example a linear congruential generator without rejection sampling:
function lcg(old)
    a = unsigned(2862933555777941757)  # multiplier
    b = unsigned(3037000493)           # increment
    a*old + b
end
function randfast(k, x::Uint)  # returns a value in 1:k and the new state
    x = lcg(x)
    1 + rem(x, k) % Int, x
end
function t4(N, R)
    state = rand(Uint)
    for i = 1:N
        x, state = randfast(R, state)
    end
end
But be careful if the range is (really) big:
m = div(typemax(Uint),3)*2
julia> mean([rand(1:m)*1.0 for i in 1:10^7])
6.148922790091841e18
julia> m/2
6.148914691236517e18
but (!)
julia> mean([(rand(Uint) % m)*1.0 for i in 1:10^7])
5.123459611164573e18
julia> 5/12*m
5.124095576030431e18
Without rejection sampling, values in [0, m/2) are produced twice as often as values in [m/2, m), which pulls the mean from m/2 down to 5m/12.
Note that in 0.4, int() is deprecated, and you're asked to use round() instead.
function t2(N,n_mol)
    for i = 1:N
        round(rand()*n_mol)
    end
end
gives 0.27 seconds on my machine (using Julia 0.4).
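Also note that round(rand()*n_mol) returns a Float64 and ranges over 0:n_mol, with the two endpoints getting half weight; if you need the original uniform 1:n_mol integer, the ifloor variant translates to 0.4's API as follows (a sketch, timings not re-measured):
function t3b(N, n_mol)
    for i = 1:N
        floor(Int, rand()*n_mol) + 1   # uniform on 1:n_mol, stays an Int
    end
end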
MATLAB has a timeit method which is helpful for comparing the performance of one implementation with another. I couldn't find something similar in Octave, so I wrote this benchmark method which runs a function f N times and then returns the total time taken. Is this a reasonable way to compare different implementations, or am I missing something critical like "warmup"?
function elapsed_time_in_seconds = benchmark(f, N)
    % benchmark runs the function 'f' N times and returns the elapsed time in seconds.
    timeid = tic;
    for i=1:N
        output = f();
    end
    elapsed_time_in_seconds = toc(timeid);
end
MATLAB's timeit does the following (you can read the whole function, it's an M-file):
Obtain a rough estimate t_rough of the time for calling the function f.
Use the estimate to determine N such that N*t_rough is about 0.001 s.
Determine M such that M*N*t_rough is no more than 15 s, but M must be between 3 and 11.
Loop M times:
Call f() N times and record the total time.
Determine the median of the M times, divided by N.
The purpose of the two loops, M and N, is as follows: calling f() N times ensures that the time measured by tic/toc is sufficiently large to be reliable; this loop avoids attempting to time something so short that it cannot be timed accurately. Repeating the measurement M times and keeping the median makes the measurement robust against delays caused by other things happening on your system, which can artificially inflate the recorded time.
The function subtracts the overhead of calling a function through its handle (determined experimentally by timing the call of an empty function), as well as the tic/toc call time (also determined experimentally). It does not subtract the cost of the inner loop, presumably because in MATLAB it is optimized by the JIT and its cost is negligible.
There are some further refinements. The function that determines t_rough first warms up tic and toc by calling each one twice, then uses a while loop to ensure it calls f() for at least 0.001 s. In this loop, if the first iteration takes at least 3 s, that time is simply taken as the rough estimate. Otherwise the first time count is discarded (warmup) and the median of all subsequent calls is used as the rough estimate.
There's also a lot of effort put into calling the function f() with the right number of output arguments.
The code has a lot of comments explaining the reason behind all these steps, it is worth reading.
As a minimum, I would augment your benchmark function as follows:
function elapsed_time_in_seconds = benchmark(f, N, M)
    % benchmark runs the function 'f' N*M times and returns the median
    % time per call, in seconds.
    tic; [~] = toc; tic; [~] = toc; % warm up tic/toc
    output = f(); % warm up f
    t = zeros(M, 1);
    for k=1:M
        timeid = tic;
        for i=1:N
            output = f();
        end
        t(k) = toc(timeid) / N;
    end
    elapsed_time_in_seconds = median(t);
end
If you use the function to directly compare various alternatives, keeping N and M constant, then the overheads of tic, toc, function calls and loops are irrelevant.
This function does assume that f has one output argument, which is not necessarily the case. You could just call f() instead of output = f(), which will work for functions with or without output arguments. But if the function needs to have a certain number of outputs to work correctly, or to trigger computations that you want to time, then you'd have to adjust the function to call it with the right number of output arguments.
You could come up with some heuristic to determine M from N, which would make it a little easier to use this function.
I'm trying to construct and compare, in the fastest possible way, two random 0-1 vectors of the same length using Julia, each vector with the same number of zeros and ones.
This is all for a Monte Carlo simulation of the following probabilistic question:
We have two independent urns, each one with n white balls and n black balls. We draw a pair of balls, one from each urn, repeatedly until the urns are empty. What is the probability that every pair has the same color?
What I did is the following:
using Random
# Auxiliary function that compares the parity, element by element, of two
# random vectors of length 2n
function comp(n::Int64)
    sum((shuffle!(Vector(1:2*n)) .+ shuffle!(Vector(1:2*n))) .% 2)
end
The above generates two random permutations of the vector from 1 to 2n, adds them element by element, applies modulo 2 to each element, and then sums all the values of the resulting vector. I'm using the parity of each number to model its color: odd is black and even is white.
If the final sum is zero, then the two random vectors had the same colors, element by element. Any other result means that the two vectors didn't have matching colors.
Then I set up the following function, which is just the Monte Carlo simulation of the desired probability:
# Here m is an optional argument that controls the number of random
# experiments in the simulation
function sim(n::Int64, m::Int64=24)
    # A counter for the valid cases
    x = 0
    for i in 1:2^m
        # A random pair of vectors is a valid case if they have
        # the same parity element by element, so
        if comp(n) == 0
            x += 1
        end
    end
    # The estimated value
    x/2^m
end
Now I want to know if there is a faster way to compare such vectors. I tried the following alternative construction and comparison for the random vectors
shuffle!( repeat([0,1],n)) == shuffle!( repeat([0,1],n))
Then I changed the code of comp(n) accordingly.
With these changes the code runs slightly slower, which I tested with the macro @time. Another change I tried was swapping the for statement for a while statement, but the computation time remained the same.
Because I'm not a programmer (indeed, just yesterday I learned a bit of the Julia language and installed the Juno front-end), there is probably a faster way to make the same computations. Any tip will be appreciated, because the effectiveness of a Monte Carlo simulation depends on the number of random experiments, so the faster the computation, the larger the values we can test.
The key cost in this problem is shuffle!, so in order to maximize the simulation speed you can use the following (I add it as an answer as it is too long for a comment):
function test(n,m)
    ref = [isodd(i) for i in 1:2n]
    sum(all(view(shuffle!(ref), 1:n)) for i in 1:m) / m
end
What are the differences from the code proposed in the other answer:
You do not have to shuffle! both vectors; it is enough to shuffle! one of them, as the result of the comparison is invariant under applying the same permutation to both vectors after they are shuffled independently. We can therefore assume that one vector, after its random permutation, is reshuffled back into sorted order, with trues in the first n entries and falses in the last n entries.
I do shuffle! in-place (i.e. the ref vector is allocated only once).
I use the all function on the first half of the vector; this way the check stops as soon as I hit the first false, and if all of the first n entries are true I do not have to check the last n entries, as I know they must all be false.
To get something cleaner, you could directly generate vectors of 0/1 values, and then just let Julia check for vector equality, e.g.
function rndvec(n::Int64)
    shuffle!(vcat(zeros(Bool,n), ones(Bool,n)))
end

function sim0(n::Int64, m::Int64=24)
    sum(rndvec(n) == rndvec(n) for i in 1:2^m) / 2^m
end
Avoiding allocation makes the code faster, as explained by Bogumił Kamiński (and letting Julia make the comparison is faster than his code).
function sim1(n::Int64, m::Int64=24)
    vref = vcat(zeros(Bool,n), ones(Bool,n))
    vshuffled = vref[:]
    sum(shuffle!(vshuffled) == vref for i in 1:2^m) / 2^m
end
To go even faster use lazy evaluation and fast exit: if the first element is different, you don't even need to generate the rest of the vectors.
This would make the code much trickier though.
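For instance, here is a hypothetical sketch of that idea (the function name and structure are mine, not from the answer): draw the pairs one at a time without replacement and bail out at the first mismatch, so failing trials never build full vectors.
function sim_lazy(n::Int, m::Int=24)
    hits = 0
    for _ in 1:2^m
        w1, w2 = n, n              # white balls remaining in each urn
        t = 2n                     # balls remaining in each urn
        ok = true
        while t > 0
            c1 = rand() < w1 / t   # true = a white ball from urn 1
            c2 = rand() < w2 / t   # true = a white ball from urn 2
            if c1 != c2            # first color mismatch: stop generating
                ok = false
                break
            end
            w1 -= c1; w2 -= c2     # Bools subtract as 0/1
            t -= 1
        end
        hits += ok
    end
    hits / 2^m
end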
I find it a bit against the spirit of the question, but you could also do some more math. There are binomial(2*n, n) equally likely vectors that can be generated, and exactly one of them matches a fixed reference, so you could just compute:
function sim2(n::Int64, m::Int64=24)
    nvec = binomial(2*n, n)
    sum(rand(1:nvec) == 1 for i in 1:2^m) / 2^m
end
Here are some timings I obtain:
@time show(("sim0", sim0(6, 21)))
@time show(("sim1", sim1(6, 21)))
@time show(("sim2", sim2(6, 21)))
@time show(("test", test(6, 2^21)))
("sim0", 0.0010724067687988281) 4.112159 seconds (12.68 M allocations: 1.131 GiB, 11.47% gc time)
("sim1", 0.0010781288146972656) 0.916075 seconds (19.87 k allocations: 1.092 MiB)
("sim2", 0.0010628700256347656) 0.249432 seconds (23.12 k allocations: 1.258 MiB)
("test", 0.0010166168212890625) 1.180781 seconds (2.14 M allocations: 98.634 MiB, 2.22% gc time)
I've noticed a strange behavior of julia during a matrix copy.
Consider the following three functions:
function priv_memcopyBtoA!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    A[1:n,1:n] = B[1:n,1:n]
    return nothing
end

function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    ii = 1; jj = 1;
    while ii <= n
        jj = 1 #(*)
        while jj <= n
            A[jj,ii] = B[jj,ii]
            jj += 1
        end
        ii += 1
    end
    return nothing
end

function priv_memcopyBtoA3!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    A[1:n,1:n] = view(B, 1:n, 1:n)
    return nothing
end
Edit: 1) I tested whether the code would throw a BoundsError, which is why the line marked with jj = 1 #(*) was missing in the initial code. The timing results are already from the fixed version, so they remain unchanged. 2) I've added the view variant, thanks to @Colin T Bowers for addressing both issues.
It seems like both functions should lead to more or less the same code. Yet I get for
A = fill!(Matrix{Int}(2^12,2^12), 2); B = Int.(eye(2^12));
the results
@timev priv_memcopyBtoA!(A,B, 2000)
0.178327 seconds (10 allocations: 15.259 MiB, 85.52% gc time)
elapsed time (ns): 178326537
gc time (ns): 152511699
bytes allocated: 16000304
pool allocs: 9
malloc() calls: 1
GC pauses: 1
and
@timev priv_memcopyBtoA2!(A,B, 2000)
0.015760 seconds (4 allocations: 160 bytes)
elapsed time (ns): 15759742
bytes allocated: 160
pool allocs: 4
and
@timev priv_memcopyBtoA3!(A,B, 2000)
0.043771 seconds (7 allocations: 224 bytes)
elapsed time (ns): 43770978
bytes allocated: 224
pool allocs: 7
That's a drastic difference. It's also surprising: I expected the first version to behave like memcopy, which is hard to beat for a large memory block.
The second version has overhead from the pointer arithmetic (getindex), the branch condition (<=) and the bounds check in each assignment. Yet each assignment takes just ~3 ns.
Also, the time the garbage collector consumes varies a lot for the first function. If no garbage collection is performed, the large difference becomes small, but it remains: still a factor of ~2.5 between versions 3 and 2.
So why is the "memcopy" version not as efficient as the "assignment" version?
Firstly, your code contains a bug. Run this:
A = [1 2 ; 3 4]
B = [5 6 ; 7 8]
priv_memcopyBtoA2!(A, B, 2)
then:
julia> A
2×2 Array{Int64,2}:
5 2
7 4
You need to re-assign jj back to 1 at the start of each outer while loop iteration, i.e.:
function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    ii = 1
    while ii <= n
        jj = 1
        while jj <= n
            A[jj,ii] = B[jj,ii]
            jj += 1
        end
        ii += 1
    end
    return nothing
end
Even with the bug fix, you'll still note that the while loop solution is faster. This is because array slices in julia create temporary arrays. So in this line:
A[1:n,1:n] = B[1:n,1:n]
the right-hand side operation creates a temporary nxn array, and then assigns the temporary array to the left-hand side.
If you wanted to avoid the temporary array allocation, you would instead write:
A[1:n,1:n] = view(B, 1:n, 1:n)
and you'll notice that the timings of the two methods is now pretty close, although the while loop is still slightly faster. As a general rule, loops in Julia are fast (as in C fast), and explicitly writing out the loop will usually get you the most optimized compiled code. I would still expect the explicit loop to be faster than the view method.
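For example, here is a sketch of the explicit loop with bounds checks disabled via the standard @inbounds macro (the function name is mine; this is safe only under the assumption that the caller guarantees n fits both arrays):
function priv_memcopyBtoA4!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    # jj (rows) innermost matches Julia's column-major memory layout
    @inbounds for ii in 1:n, jj in 1:n
        A[jj,ii] = B[jj,ii]
    end
    return nothing
end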
As for the garbage collection stuff, that is just a result of your method of timing. Much better to use @btime from the package BenchmarkTools, which uses various tricks to avoid traps like timing garbage collection etc.
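A minimal usage sketch (globals are interpolated with $ so their lookup isn't timed; numbers will differ per machine):
using BenchmarkTools
A = zeros(Int, 2^12, 2^12); B = copy(A)
@btime priv_memcopyBtoA2!($A, $B, 2000)
@btime priv_memcopyBtoA3!($A, $B, 2000)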
Why is A[1:n,1:n] = view(B, 1:n, 1:n) or variants of it, slower than a set of while loops? Let's look at what A[1:n,1:n] = view(B, 1:n, 1:n) does.
view returns an iterator which contains a pointer to the parent B and information on how to compute the indices that should be copied. A[1:n,1:n] = ... is parsed to a call to _setindex!(...). After that, and a few calls down the call chain, the main work is done by:
# .\abstractarray.jl:883
# In general, we simply re-index the parent indices by the provided ones
function getindex(V::SlowSubArray{T,N}, I::Vararg{Int,N}) where {T,N}
    @_inline_meta
    @boundscheck checkbounds(V, I...)
    @inbounds r = V.parent[reindex(V, V.indexes, I)...]
    r
end
# .\multidimensional.jl:212
@inline function next(iter::CartesianRange{I}, state) where I<:CartesianIndex
    state, I(inc(state.I, iter.start.I, iter.stop.I))
end
@inline inc(::Tuple{}, ::Tuple{}, ::Tuple{}) = ()
@inline inc(state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int}) = (state[1]+1,)
@inline function inc(state, start, stop)
    if state[1] < stop[1]
        return (state[1]+1, tail(state)...)
    end
    newtail = inc(tail(state), tail(start), tail(stop))
    (start[1], newtail...)
end
getindex takes a view V and an index I. We get the view from B and the index I from A. In each step, reindex computes, from the view V and the index I, the indices needed to get an element in B. The element is called r and we return it. Finally r is written to A.
After each copy, inc increments the index I to the next element in A and tests whether we are done. Note that the code is from v0.6.3, but on master it's more or less the same.
In principle the code could be reduced to a set of while loops, yet it is more general. It works for arbitrary views of B, for arbitrary slices of the form a:b:c, and for an arbitrary number of matrix dimensions; the big N is 2 in our case.
Since the functions are more complex, the compiler doesn't optimize them as well. That is, there is a recommendation that the compiler should inline them, but it doesn't do that, which shows that these functions are non-trivial.
For a set of loops, the compiler reduces the innermost loop to three additions (one each for the pointers into A and B, and one for the loop index) and a single copy instruction.
tl;dr The internal call chain of A[1:n,1:n] = view(B, 1:n, 1:n), coupled with multiple dispatch, is non-trivial and handles the general case. This induces overhead. A set of while loops is already optimized for the special case.
Note that the performance depends on the compiler. If one looks at the one-dimensional case A[1:n] = view(B, 1:n), it's faster than a while loop because the compiler vectorizes the code. Yet for higher dimensions N > 2 the difference grows.
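A quick way to check the one-dimensional claim yourself (a sketch; BenchmarkTools assumed, sizes arbitrary, results machine-dependent):
using BenchmarkTools
a = zeros(Int, 10^6); b = ones(Int, 10^6)
function copy_loop!(a, b, n)
    @inbounds for i in 1:n   # explicit 1-D loop for comparison
        a[i] = b[i]
    end
end
@btime $a[1:10^5] = view($b, 1:10^5)   # 1-D view assignment, vectorized
@btime copy_loop!($a, $b, 10^5)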
Consider the basic iteration to generate N random numbers and save them in an array (assume that we are not interested in array comprehensions and that we don't know about calling rand(N)):
function random_numbers(N::Int)
    array = zeros(N)
    for i in 1:N
        array[i] = rand()
    end
    array
end
I am interested in a similar function that takes advantage of the cores of my laptop to generate the same array. I have checked this nice blog where the macros @everywhere, @spawn and @parallel are introduced, but there the calculation is carried out "on-the-fly" and an array is not needed to save the data.
I have the impression that this is very basic and can be done easily using perhaps the function pmap but I am unfamiliar with parallel computing.
My aim is to apply this method to a function that I have built to generate random numbers drawn from an unusual distribution.
I would recommend doing a more careful initialization of random number generators in parallel processes, e.g.:
# choose the seed you want
@everywhere srand(1)
# replace 10 below by maximum process id in your case
@everywhere const LOCAL_R = randjump(Base.GLOBAL_RNG, 10)[myid()]
# here is an example usage
@everywhere f() = rand(LOCAL_R)
In this way you:
make sure that your results are reproducible;
have control that there is no overlap between random sequences generated by different processes.
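A minimal usage sketch under the same 0.6-era API (the sample count is arbitrary): each call executes on whichever worker is free and draws from that worker's own pre-jumped stream.
samples = pmap(i -> f(), 1:1000)   # 1000 draws spread over the workers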
As suggested in the comment, more clarification in the question is always welcome. However, it seems pmap will do what is required. The relevant documentation is here.
The following is an example. Note that the time spent in the pmap method is about half that of the regular map. With 16 cores, the situation might be substantially better:
julia> addprocs(2)
2-element Array{Int64,1}:
2
3
julia> @everywhere long_rand() = foldl(+,0,(randn() for i=1:10_000_000))
julia> long_rand()
-1165.9596619177153
julia> @time map(x->long_rand(), zeros(10,10))
8.455930 seconds (204.89 k allocations: 11.069 MiB)
10×10 Array{Float64,2}:
⋮
⋮
julia> @time pmap(x->long_rand(), zeros(10,10));
6.125479 seconds (773.08 k allocations: 42.242 MiB, 0.25% gc time)
julia> @time pmap(x->long_rand(), zeros(10,10))
4.609745 seconds (20.99 k allocations: 954.991 KiB)
10×10 Array{Float64,2}:
⋮
⋮
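Building on that, a sketch that returns the array of samples the question asked for (same 0.6-era API; random_numbers_parallel, draw_block and my_rand are hypothetical names, with my_rand standing in for the custom distribution), letting each worker fill a whole block instead of paying one remote call per sample:
@everywhere my_rand() = rand()                 # replace with the unusual distribution
@everywhere draw_block(n) = [my_rand() for i in 1:n]
function random_numbers_parallel(N::Int)
    nb = nworkers()
    sizes = fill(div(N, nb), nb)
    sizes[1] += N - sum(sizes)                 # give the remainder to the first block
    vcat(pmap(draw_block, sizes)...)           # one block per worker, concatenated
end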
The following two code snippets perform the same task (generating M samples uniformly from an N-dim sphere). I was wondering why the latter one consumes much more time than the previous one.
%% MATLAB R2014a
M = 30;
N = 10000;
% #1
tic
S = zeros(M, N);
for k = 1:M
    P = ones(1, N);
    for i = 1:N - 1
        t = rand*2*pi;
        P(1:i) = P(1:i)*sin(t);
        P(i+1) = P(i+1)*cos(t);
    end
    S(k,:) = P;
end
toc
% #2
tic
S = ones(M, N);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(k, 1:i) = S(k, 1:i)*sin(t);
        S(k, i+1) = S(k, i+1)*cos(t);
    end
end
toc
The output is:
Elapsed time is 15.007667 seconds.
Elapsed time is 59.745311 seconds.
And I also tried M = 1,
Elapsed time is 0.463370 seconds.
Elapsed time is 1.566913 seconds.
#2 is nearly 4 times slower than #1. Is frequent 2d element accessing in #2 making it time-consuming?
The time difference is due to memory access patterns, and how well they map onto the cache. And also possibly to MATLAB's exploitation of your hardware vector unit (SSE/AVX). MATLAB stores matrices "column-major", meaning S(2,1) is next to S(1,1).
In #1, you process each sample using the vector P, which lives in contiguous memory. These 80,000 bytes fit easily in L2 cache, giving the fast repeated access you need. The elements are also neighbors, so the operations are trivially vectorizable (I'm not certain MATLAB performs this optimization, but I'd hope so...).
In #2, you access a row of S at a time, which is not contiguous: consecutive row elements are interleaved by M values. So each row is spread across 30*80,000 bytes, which does not fit in L2 cache. It has to be read back in for each repeated access, even though you're ignoring 29/30 of the values in that data.
Here's the test. All I'm doing is transposing S so that you can process a column at a time instead, then transposing it back at the end just to get the same result:
% #3
tic
S = ones(N, M);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(1:i, k) = S(1:i, k)*sin(t);
        S(i+1, k) = S(i+1, k)*cos(t);
    end
end
S = S.';
toc
Results:
Elapsed time is 11.254212 seconds.
Elapsed time is 45.847750 seconds.
Elapsed time is 11.501580 seconds.
Yep, transposing S gets us the same contiguous access and performance as the separate-vector approach. By the way, an L3 access costs about 4x more clock cycles than L2.
Let's see if we can find any breakpoints related to cache size. Here's N = 1000, where everything should fit in L2:
Elapsed time is 0.240184 seconds.
Elapsed time is 0.373448 seconds.
Elapsed time is 0.258566 seconds.
Much lower difference, though now we're probably into L1 effects.
Finally, here's a completely different way to solve your problem. It relies on the fact that multivariate normal RV's have the correct symmetry.
% #4
tic
S = randn(M, N);
S = bsxfun(@rdivide, S, sqrt(sum(S.*S, 2)));
toc
Elapsed time is 10.714104 seconds.
Elapsed time is 45.351277 seconds.
Elapsed time is 11.031061 seconds.
Elapsed time is 0.015068 seconds.
I suspect the advantage comes from using a hard-coded 1 in the array access. If you try M = 1 you will still see a significant speed-up for the sin(t) line. My guess is that the assembly under the hood can use some immediate-operand instructions, as opposed to reloading the variable k into a register.