I would like to sample k numbers where the first number is sampled from 1:n and the second from 1:n-1 and the third from 1:n-2 and so on.
I have the following implementation:
function shrinksample(n,k)
    [rand(1:m) for m in n:-1:n-k+1]
end
Are there faster solutions in Julia?
The following borrows ideas from the implementation of randperm. Since n and k are of the same order, this is appropriate: the same kind of randomness is needed (for k = n, both have an output space of size n factorial):
function fastshrinksample(r::AbstractRNG,n,k)
    a = Vector{typeof(n)}(k)
    @assert n <= Int64(2)^52
    k == 0 && return a
    mask = (1<<(64-leading_zeros(n)))-1
    nextmask = mask>>1
    nn = n
    for i=1:k
        a[i] = 1+Base.Random.rand_lt(r, nn, mask)
        nn -= 1
        if nn == nextmask
            mask, nextmask = nextmask, nextmask>>1
        end
    end
    return a
end
fastshrinksample(n,k) = fastshrinksample(Base.Random.GLOBAL_RNG, n, k)
Benchmarking gives a 3x improvement for one tested instance:
julia> using BenchmarkTools
julia> @btime shrinksample(10000,10000);
310.277 μs (2 allocations: 78.20 KiB)
julia> @btime fastshrinksample(10000,10000);
91.815 μs (2 allocations: 78.20 KiB)
The trick is mainly to use the internal Base.Random.rand_lt instead of the regular rand(1:n). Because rand_lt is internal and unexported, it may not be available in other Julia versions.
If this is not very sensitive to randomness (you're not doing cryptography), the following should be amazingly fast and very simple:
blazingshrinksample(n,k) = (Int)[trunc(Int,(n-m)rand()+1) for m in 0:k-1]
Testing this along with your implementation and with Dan's, I got this:
using BenchmarkTools
@btime shrinksample(10000,10000);
259.414 μs (2 allocations: 78.20 KiB)
@btime fastshrinksample(10000,10000);
66.713 μs (2 allocations: 78.20 KiB)
@btime blazingshrinksample(10000,10000);
33.614 μs (2 allocations: 78.20 KiB)
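As a quick sanity check on the float-based version, here is a minimal sketch that verifies every draw stays inside its shrinking range. The helper name in_shrinking_ranges is made up for this illustration, and the one-liner is rewritten with an explicit `*`; it checks only the range of the output, not the quality of the randomness.

```julia
# Float-based sampler from the answer above, written with an explicit `*`.
blazingshrinksample(n, k) = Int[trunc(Int, (n - m) * rand() + 1) for m in 0:k-1]

# Hypothetical helper: true if the i-th draw lies in 1:(n-i+1) for every i.
in_shrinking_ranges(a, n) = all(1 <= a[i] <= n - i + 1 for i in eachindex(a))

a = blazingshrinksample(10_000, 10_000)
println(in_shrinking_ranges(a, 10_000))
```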
I would like to know how I can measure the memory used by a small part of my code. Let's say I have 50 lines of code, from which I pick only three lines (arbitrarily) and want to find the memory used by them.
In Python, one can use syntax like this to measure the usage:
**code**
(psutil.virtual_memory().total - psutil.virtual_memory().available)/1024/1024/1024
**code**
(psutil.virtual_memory().total - psutil.virtual_memory().available)/1024/1024/1024
**code**
I have tried using a begin ... end block, but first, I am not sure whether it is a good approach, and second, I would like to know how to extract just the memory usage with the BenchmarkTools package.
Julia:
using BenchmarkTools
**code**
@btime begin
** code **
end
**code**
How may I extract the information in such a manner?
Look forward to the suggestions!
Thanks!!
I guess one workaround would be to put the code you want to benchmark into a function and benchmark that function:
using BenchmarkTools
# code before
f() = # code to benchmark
@btime f();
# code after
To save your benchmarks you probably need to use @benchmark instead of @btime, as in, e.g.:
julia> t = @benchmark x = [sin(3.0)]
BenchmarkTools.Trial:
memory estimate: 96 bytes
allocs estimate: 1
--------------
minimum time: 26.594 ns (0.00% GC)
median time: 29.141 ns (0.00% GC)
mean time: 33.709 ns (5.34% GC)
maximum time: 1.709 μs (97.96% GC)
--------------
samples: 10000
evals/sample: 992
julia> t.allocs
1
julia> t.memory
96
julia> t.times
10000-element Vector{Float64}:
26.59375
26.616935483870968
26.617943548387096
26.66532258064516
26.691532258064516
⋮
1032.6875
1043.6219758064517
1242.3336693548388
1708.797379032258
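If only the number of allocated bytes is of interest, Base also provides the @allocated macro, which needs no package. A minimal sketch, assuming the lines of interest can be wrapped in a function; build_vector is a made-up stand-in for them:

```julia
# Hypothetical stand-in for the few lines whose allocations are of interest.
function build_vector(n)
    return [sin(i) for i in 1:n]
end

build_vector(10)                         # warm-up call so compilation is not measured
bytes = @allocated build_vector(1_000)   # bytes allocated while evaluating the call
println("allocated bytes: ", bytes)
```

Note that this reports bytes allocated by the expression, which is not the same thing as the process-wide memory figure that psutil returns.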
In Julia, what would be an efficient way of turning the diagonal of a matrix to zero?
Assuming m is your matrix of size N x N this could be done as:
setindex!.(Ref(m), 0.0, 1:N, 1:N)
Another option:
using LinearAlgebra
m[diagind(m)] .= 0.0
And some performance tests:
julia> using LinearAlgebra, BenchmarkTools
julia> m=rand(20,20);
julia> @btime setindex!.(Ref($m), 0.0, 1:20, 1:20);
55.533 ns (1 allocation: 240 bytes)
julia> @btime $m[diagind($m)] .= 0.0;
75.386 ns (2 allocations: 80 bytes)
Performance-wise, a simple loop is faster (and more explicit, though that is a matter of taste):
julia> @btime foreach(i -> $m[i, i] = 0, 1:20)
11.881 ns (0 allocations: 0 bytes)
julia> @btime setindex!.(Ref($m), 0.0, 1:20, 1:20);
50.923 ns (1 allocation: 240 bytes)
It is also faster than the diagind version, but not by much:
julia> m = rand(1000, 1000);
julia> @btime foreach(i -> $m[i, i] = 0.0, 1:1000)
1.456 μs (0 allocations: 0 bytes)
julia> @btime foreach(i -> @inbounds($m[i, i] = 0.0), 1:1000)
1.338 μs (0 allocations: 0 bytes)
julia> @btime $m[diagind($m)] .= 0.0;
1.495 μs (2 allocations: 80 bytes)
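The loop variant can also be wrapped in a small reusable function. This is only a sketch: zero_diagonal! is a made-up name, not something LinearAlgebra provides, and no timing is claimed for it.

```julia
using LinearAlgebra

# Hypothetical helper: zero the main diagonal in place with an explicit loop.
function zero_diagonal!(m::AbstractMatrix)
    for i in 1:min(size(m)...)
        m[i, i] = zero(eltype(m))
    end
    return m
end

m = rand(4, 4)
zero_diagonal!(m)
println(all(iszero, diag(m)))
```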
Przemyslaw Szufel's solutions, benchmarked for a 1_000 x 1_000 matrix, show that diagind performs best:
julia> @btime setindex!.(Ref($m), 0.0, 1:1_000, 1:1_000);
2.222 μs (1 allocation: 7.94 KiB)
julia> @btime $m[diagind($m)] .= 0.0;
1.280 μs (2 allocations: 80 bytes)
Here is a more general way to use setindex! efficiently on an Array with custom indices:
"Using an array for in array indexing"
Linear indexing performs best, which is why diagind runs faster than the Cartesian-index versions.
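To make the linear-indexing point concrete: diagind returns the column-major linear indices of the diagonal, so assigning through it touches the underlying memory directly. A small sketch on a 3 x 3 matrix (the matrix is just an example):

```julia
using LinearAlgebra

m = reshape(collect(1.0:9.0), 3, 3)
idx = diagind(m)          # linear indices of the main diagonal
println(collect(idx))     # [1, 5, 9]
m[idx] .= 0.0             # zero the diagonal through linear indexing
println(diag(m))          # [0.0, 0.0, 0.0]
```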
I am not sure why this is happening.
using Formatting: printfmt
function within(x,y)
    if x * x + y * y <= 1
        true
    else
        false
    end
end
function π_estimation_error(estimated_value)
    pi_known = 3.1415926535897932384626433
    return abs((pi_known - estimated_value) / 100)
end
function estimate_π_1(n)
    count = 0
    for i = 1:n
        if within(rand(), rand())
            count = count + 1
        end
    end
    pi_est = count/n*4
    printfmt("n: {} π estimated {:.8f}, error {:.10f}", n, pi_est, π_estimation_error(pi_est))
end
function estimate_π_2(n)
    rand_coords = rand(n, 2) .^ 2
    count = sum(rand_coords[:,1] + rand_coords[:,2] .<= 1)
    pi_est = count/n*4
    printfmt("n: {} π estimated {:.8f}, error {:.10f}", n, pi_est, π_estimation_error(pi_est))
end
number_of_experiments = 20000000
for i = 1:10
    print("1 :: ")
    @time estimate_π_1(number_of_experiments)
    print("2 :: ")
    @time estimate_π_2(number_of_experiments)
end
What is the proper way to get consistent results? Not sure why this is happening. The allocation numbers seem way off.
1 :: n: 20000000 π estimated 3.14188540, error 0.0000029275 0.507643 seconds (1.15 M allocations: 56.432 MiB, 8.75% gc time)
2 :: n: 20000000 π estimated 3.14141280, error 0.0000017985 0.786538 seconds (1.13 M allocations: 1.100 GiB, 13.17% gc time)
1 :: n: 20000000 π estimated 3.14118120, error 0.0000041145 0.054791 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14207560, error 0.0000048295 0.536932 seconds (196 allocations: 1.045 GiB, 14.11% gc time)
1 :: n: 20000000 π estimated 3.14119660, error 0.0000039605 0.054647 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14154040, error 0.0000005225 0.529361 seconds (196 allocations: 1.045 GiB, 14.04% gc time)
1 :: n: 20000000 π estimated 3.14188640, error 0.0000029375 0.054321 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14177120, error 0.0000017855 0.532848 seconds (196 allocations: 1.045 GiB, 14.01% gc time)
1 :: n: 20000000 π estimated 3.14191880, error 0.0000032615 0.055158 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14213220, error 0.0000053955 0.524499 seconds (196 allocations: 1.045 GiB, 14.02% gc time)
1 :: n: 20000000 π estimated 3.14161380, error 0.0000002115 0.054355 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14174220, error 0.0000014955 0.529431 seconds (196 allocations: 1.045 GiB, 14.17% gc time)
1 :: n: 20000000 π estimated 3.14178600, error 0.0000019335 0.054558 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14152500, error 0.0000006765 0.537786 seconds (196 allocations: 1.045 GiB, 13.89% gc time)
1 :: n: 20000000 π estimated 3.14163340, error 0.0000004075 0.055921 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14220380, error 0.0000061115 0.521758 seconds (196 allocations: 1.045 GiB, 14.19% gc time)
1 :: n: 20000000 π estimated 3.14092000, error 0.0000067265 0.054592 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14177460, error 0.0000018195 0.527376 seconds (196 allocations: 1.045 GiB, 14.10% gc time)
1 :: n: 20000000 π estimated 3.14171780, error 0.0000012515 0.054904 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14136040, error 0.0000023225 0.528569 seconds (196 allocations: 1.045 GiB, 14.04% gc time)
Is this happening because some optimization kicks in?
If I understand correctly, you are asking why the first run of a function is always much slower and allocates more memory than subsequent runs?
The reason is that Julia is a just-in-time compiled language: the first time you call a function with a given combination of argument types, Julia compiles it to native code specialized for those types, and the time and allocations of that compilation are included in what @time reports. For any later call with the same argument types, Julia sees that the native code already exists and simply reuses it.
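The effect is easy to reproduce with a minimal sketch; sum_of_squares is a made-up function used only for illustration, and the actual timings depend on the machine and Julia version:

```julia
# Hypothetical function used only to illustrate the first-call overhead.
sum_of_squares(n) = sum(i^2 for i in 1:n)

first_call  = @elapsed sum_of_squares(1_000)   # includes compilation
second_call = @elapsed sum_of_squares(1_000)   # reuses the compiled code
println("first: ", first_call, " s, second: ", second_call, " s")
```

For consistent measurements, either discard the first run or use BenchmarkTools, which runs the code many times.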
After seeing a couple tutorials on the internet on Julia parallelism I decided to implement a small parallel snippet for computing the harmonic series.
The serial code is:
harmonic = function (n::Int64)
    x = 0
    for i in n:-1:1 # summing backwards to avoid rounding errors
        x +=1/i
    end
    x
end
And I made 2 parallel versions, one using the @distributed macro and another using the @everywhere macro (julia -p 2 btw):
@everywhere harmonic_ever = function (n::Int64)
    x = 0
    for i in n:-1:1
        x +=1/i
    end
    x
end
harmonic_distr = function (n::Int64)
    x = @distributed (+) for i in n:-1:1
        x = 1/i
    end
    x
end
However, when I run the above code and @time it, I don't get any speedup - in fact, the @distributed version runs significantly slower!
@time harmonic(10^10)
>>> 53.960678 seconds (29.10 k allocations: 1.553 MiB) 23.60306659488827
job = @spawn harmonic_ever(10^10)
@time fetch(job)
>>> 46.729251 seconds (309.01 k allocations: 15.737 MiB) 23.60306659488827
@time harmonic_distr(10^10)
>>> 143.105701 seconds (1.25 M allocations: 63.564 MiB, 0.04% gc time) 23.603066594889185
What completely and absolutely baffles me is the "0.04% gc time". I'm clearly missing something and also the examples I saw weren't for 1.0.1 version (one for example used @parallel).
Your distributed version should be
function harmonic_distr2(n::Int64)
    x = @distributed (+) for i in n:-1:1
        1/i # no x assignment here
    end
    x
end
The @distributed loop will accumulate the values of 1/i on every worker and then finally on the master process.
Note that it is also generally better to use BenchmarkTools's @btime macro instead of @time for benchmarking:
julia> using Distributed; addprocs(4);
julia> @btime harmonic(1_000_000_000); # serial
1.601 s (1 allocation: 16 bytes)
julia> @btime harmonic_distr2(1_000_000_000); # parallel
754.058 ms (399 allocations: 36.63 KiB)
julia> @btime harmonic_distr(1_000_000_000); # your old parallel version
4.289 s (411 allocations: 37.13 KiB)
The parallel version is, of course, slower if run only on one process:
julia> rmprocs(workers())
Task (done) @0x0000000006fb73d0
julia> nprocs()
1
julia> @btime harmonic_distr2(1_000_000_000); # (not really) parallel
1.879 s (34 allocations: 2.00 KiB)
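To see the reduction semantics on a small input: with a reducer, the value of the last expression of each iteration is what gets combined. A minimal sketch that runs on the current process without adding workers; harmonic_reduce is a made-up name for this illustration:

```julia
using Distributed

# The last expression of the loop body (1 / i) is handed to the reducer (+).
function harmonic_reduce(n::Int)
    return @distributed (+) for i in n:-1:1
        1 / i
    end
end

println(harmonic_reduce(4))   # 1/4 + 1/3 + 1/2 + 1, about 2.0833
```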
I was wondering if there is a command or a package in Julia that permits us to extract directly the lower triangle portion of a matrix, excluding the diagonal. I can call R commands for that (like lowerTriangle of the gdata package), obviously, but I'd like to know if Julia has something similar. For example, imagine I have the matrix
1.0 0.751 0.734
0.751 1.0 0.948
0.734 0.948 1.0
I don't want to create a lower triangular matrix like
NA NA NA
0.751 NA NA
0.734 0.948 NA
but extract the lower portion of the matrix as an array: 0.751 0.734 0.948
If you're OK with creating a lower triangular matrix as an intermediate step, you can use logical indexing and tril! with an extra argument to get what you need.
julia> M = [1.0 0.751 0.734
0.751 1.0 0.948
0.734 0.948 1.0];
julia> v = M[tril!(trues(size(M)), -1)]
3-element Array{Float64, 1}:
0.751
0.734
0.948
The trues call returns a Boolean array of M's shape filled with true values. tril! then sets everything to false except the part of the matrix that we want. The second argument to tril! tells it which diagonal to start from; -1 selects the first subdiagonal, which excludes the values on the main diagonal.
We use the result of that for indexing into M, and that returns an array with the required values.
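Printing the intermediate mask for the 3 x 3 example makes the two steps visible. This is just the same call split in two, with mask as an illustrative variable name:

```julia
using LinearAlgebra

M = [1.0 0.751 0.734; 0.751 1.0 0.948; 0.734 0.948 1.0]
mask = tril!(trues(size(M)), -1)   # true strictly below the main diagonal
display(mask)
println(M[mask])                   # values picked in column-major order
```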
Using comprehensions:
julia> [M[m, n] for m in 2:size(M, 1) for n in 1:m-1]
3-element Array{Float64,1}:
0.751
0.734
0.948
But it is much slower than the sundar/Matt B. solution:
lower_triangular_1(M) = [M[m, n] for m in 2:size(M, 1) for n in 1:m-1]
lower_triangular_2(M) = [M[m, n] for n in 1:size(M, 2) for m in n+1:size(M, 1)]
lower_triangular_3(M) = M[tril!(trues(size(M)), -1)]
using BenchmarkTools
using LinearAlgebra # avoid warning in 0.7
M=rand(100, 100)
Testing with Julia Version 0.7.0-alpha.0:
julia> @btime lower_triangular_1(M);
73.179 μs (10115 allocations: 444.34 KiB)
julia> @btime lower_triangular_2(M);
71.157 μs (10117 allocations: 444.41 KiB)
julia> @btime lower_triangular_3(M);
16.325 μs (6 allocations: 40.19 KiB)
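One caveat when comparing these variants: the two comprehensions return the same elements but in different orders once the matrix is larger than 3 x 3. The first walks row by row, the second column by column (the same order as the logical-indexing version). A small sketch on a 4 x 4 example matrix:

```julia
lower_triangular_1(M) = [M[m, n] for m in 2:size(M, 1) for n in 1:m-1]
lower_triangular_2(M) = [M[m, n] for n in 1:size(M, 2) for m in n+1:size(M, 1)]

M = reshape(1:16, 4, 4)          # M[i, j] == i + 4 * (j - 1)
println(lower_triangular_1(M))   # row by row:       [2, 3, 7, 4, 8, 12]
println(lower_triangular_2(M))   # column by column: [2, 3, 4, 7, 8, 12]
```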
Not elegant, but faster (with @views):
function lower_triangular_4(M)
    # works only for square matrices
    res = similar(M, ((size(M, 1)-1) * size(M, 2)) ÷ 2)
    start_idx = 1
    for n = 1:size(M, 2)-1
        @views column = M[n+1:end, n]
        last_idx = start_idx -1 + length(column)
        @views res[start_idx:last_idx] = column[:]
        start_idx = last_idx + 1
    end
    return res # without this the function returns nothing
end
julia> @btime lower_triangular_4(M);
4.272 μs (101 allocations: 44.95 KiB)