Using @time in Julia gives surprising results - performance

I am not sure why this is happening.
using Formatting: printfmt
function within(x,y)
    if x * x + y * y <= 1
        true
    else
        false
    end
end
function π_estimation_error(estimated_value)
    pi_known = 3.1415926535897932384626433
    return abs((pi_known - estimated_value) / 100)
end
function estimate_π_1(n)
    count = 0
    for i = 1:n
        if within(rand(), rand())
            count = count + 1
        end
    end
    pi_est = count/n*4
    printfmt("n: {} π estimated {:.8f}, error {:.10f}", n, pi_est, π_estimation_error(pi_est))
end
function estimate_π_2(n)
    rand_coords = rand(n, 2) .^ 2
    count = sum(rand_coords[:,1] + rand_coords[:,2] .<= 1)
    pi_est = count/n*4
    printfmt("n: {} π estimated {:.8f}, error {:.10f}", n, pi_est, π_estimation_error(pi_est))
end
number_of_experiments = 20000000
for i = 1:10
    print("1 :: ")
    @time estimate_π_1(number_of_experiments)
    print("2 :: ")
    @time estimate_π_2(number_of_experiments)
end
What is the proper way to get consistent results? Not sure why this is happening. The allocation numbers seem way off.
1 :: n: 20000000 π estimated 3.14188540, error 0.0000029275 0.507643 seconds (1.15 M allocations: 56.432 MiB, 8.75% gc time)
2 :: n: 20000000 π estimated 3.14141280, error 0.0000017985 0.786538 seconds (1.13 M allocations: 1.100 GiB, 13.17% gc time)
1 :: n: 20000000 π estimated 3.14118120, error 0.0000041145 0.054791 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14207560, error 0.0000048295 0.536932 seconds (196 allocations: 1.045 GiB, 14.11% gc time)
1 :: n: 20000000 π estimated 3.14119660, error 0.0000039605 0.054647 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14154040, error 0.0000005225 0.529361 seconds (196 allocations: 1.045 GiB, 14.04% gc time)
1 :: n: 20000000 π estimated 3.14188640, error 0.0000029375 0.054321 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14177120, error 0.0000017855 0.532848 seconds (196 allocations: 1.045 GiB, 14.01% gc time)
1 :: n: 20000000 π estimated 3.14191880, error 0.0000032615 0.055158 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14213220, error 0.0000053955 0.524499 seconds (196 allocations: 1.045 GiB, 14.02% gc time)
1 :: n: 20000000 π estimated 3.14161380, error 0.0000002115 0.054355 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14174220, error 0.0000014955 0.529431 seconds (196 allocations: 1.045 GiB, 14.17% gc time)
1 :: n: 20000000 π estimated 3.14178600, error 0.0000019335 0.054558 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14152500, error 0.0000006765 0.537786 seconds (196 allocations: 1.045 GiB, 13.89% gc time)
1 :: n: 20000000 π estimated 3.14163340, error 0.0000004075 0.055921 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14220380, error 0.0000061115 0.521758 seconds (196 allocations: 1.045 GiB, 14.19% gc time)
1 :: n: 20000000 π estimated 3.14092000, error 0.0000067265 0.054592 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14177460, error 0.0000018195 0.527376 seconds (196 allocations: 1.045 GiB, 14.10% gc time)
1 :: n: 20000000 π estimated 3.14171780, error 0.0000012515 0.054904 seconds (181 allocations: 6.711 KiB)
2 :: n: 20000000 π estimated 3.14136040, error 0.0000023225 0.528569 seconds (196 allocations: 1.045 GiB, 14.04% gc time)
Is this happening because some optimization kicks in?

If I understand correctly, you are asking why the first run of a function is much slower and allocates more memory than subsequent runs.
The reason is that Julia is a just-in-time (JIT) compiled language: the first time you call a function with a given combination of argument types, Julia compiles native code specialized for those types. The time and allocations of that compilation are included in the first @time measurement. For any later call with the same argument types, Julia sees that the native code already exists and simply reuses it.
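A minimal sketch of the effect, using only Base. The helper `sumsq` and the variable `n` are hypothetical names chosen for illustration and are not part of the question's code. The first timing typically includes compilation, while the second measures only the already compiled code; the exact numbers depend on the machine and Julia version.

```julia
# Hypothetical helper, not part of the question's code.
sumsq(n) = sum(i -> i * i, 1:n)

n = 1_000                         # non-constant global, as in a REPL session

t_first  = @elapsed sumsq(n)      # first call: timing usually includes JIT compilation
t_second = @elapsed sumsq(n)      # later call: reuses the compiled code

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

In practice this means calling the function once before timing it, or using a benchmarking package that does the warm-up for you.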

Related

Difference between the built-in @time macros and the ones from the benchmark module

In the Julia package BenchmarkTools, there are macros like @btime and @belapsed that seem redundant to me, since Julia has the built-in @time and @elapsed macros. It seems to me that these macros serve the same purpose. So what's the difference between @time and @btime, and between @elapsed and @belapsed?
TLDR ;)
@time and @elapsed just run the code once and measure the time. This measurement may or may not include compile time (depending on whether the code is being run for the first time or a later time) and includes the time to resolve global variables.
On the other hand, @btime and @belapsed perform a warm-up, so compile time and global variable resolution time (if $ is used) do not affect the measurement.
Details
To further understand how this works, let's use @macroexpand (I am also stripping comment lines for readability):
julia> using MacroTools, BenchmarkTools
julia> MacroTools.striplines(@macroexpand1 @elapsed sin(x))
quote
Experimental.@force_compile
local var"#28#t0" = Base.time_ns()
sin(x)
(Base.time_ns() - var"#28#t0") / 1.0e9
end
Compilation of sin is not forced, so you get different results when running for the first time and subsequent times. For example:
julia> @time cos(x);
0.110512 seconds (261.97 k allocations: 12.991 MiB, 99.95% compilation time)
julia> @time cos(x);
0.000008 seconds (1 allocation: 16 bytes)
julia> @time cos(x);
0.000006 seconds (1 allocation: 16 bytes)
The situation is different with @belapsed:
julia> MacroTools.striplines(@macroexpand @belapsed sin($x))
quote
(BenchmarkTools).time((BenchmarkTools).minimum(begin
local var"##314" = begin
BenchmarkTools.generate_benchmark_definition(Main, Symbol[], Any[], [Symbol("##x#315")], (x,), $(Expr(:copyast, :($(QuoteNode(:(sin(var"##x#315"))))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), BenchmarkTools.Parameters())
end
(BenchmarkTools).warmup(var"##314")
(BenchmarkTools).tune!(var"##314")
(BenchmarkTools).run(var"##314")
end)) / 1.0e9
end
You can see that a minimum value is taken (the code is run several times).
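The warm-up-then-minimum idea visible in the expansion above can be imitated with Base alone. This is only a simplified sketch: `min_elapsed` and its `samples` keyword are hypothetical names, and the real BenchmarkTools machinery also tunes the number of evaluations per sample, which is not modelled here.

```julia
# Simplified, hypothetical stand-in for the warm-up + minimum strategy.
# This is NOT how BenchmarkTools is implemented internally.
function min_elapsed(f; samples = 100)
    f()                          # warm-up call, so compilation happens before timing
    best = Inf
    for _ in 1:samples
        t = @elapsed f()         # time a single run
        best = min(best, t)      # keep the fastest run seen so far
    end
    return best
end

x = 0.5
t = min_elapsed(() -> sin(x))
println("fastest of 100 runs: ", t, " s")
```

Taking the minimum rather than the mean reduces the influence of noise such as other processes or garbage collection pauses, which can only make a run slower, never faster.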
In most cases you should use BenchmarkTools to measure times when designing your application.
Last but not least, try @benchmark:
julia> @benchmark sin($x)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 13.714 ns … 51.151 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.814 ns ┊ GC (median): 0.00%
Time (mean ± σ): 14.089 ns ± 1.121 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇ ▂▄ ▁▂ ▃ ▁ ▂
██▆▅██▇▅▄██▃▁▃█▄▃▁▅█▆▁▄▃▅█▅▃▁▄▇▆▁▁▁▁▁▆▄▄▁▁▃▄▇▃▁▃▁▁▁▆▅▁▁▁▆▅▅ █
13.7 ns Histogram: log(frequency) by time 20 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

Find index of maximum element satisfying condition (Julia)

In Julia I can use argmax(X) to find the index of the max element. If I want to find all elements satisfying condition C, I can use findall(C,X). But how can I combine the two? What's the most efficient/idiomatic/concise way to find the index of the maximum element satisfying some condition in Julia?
If you'd like to avoid allocations, filtering the array lazily would work:
idx_filtered = (i for (i, el) in pairs(X) if C(el))
argmax(i -> X[i], idx_filtered)
Unfortunately, this is about twice as slow as a hand-written version. (edit: in my benchmarks, it's 2x slower on Intel Xeon Platinum but nearly equal on Apple M1)
function byhand(C, X)
    start = findfirst(C, X)
    isnothing(start) && return nothing
    imax, max = start, X[start]
    for i = start:lastindex(X)
        if C(X[i]) && X[i] > max
            imax, max = i, X[i]
        end
    end
    imax, max
end
You can store the index returned by findall and subset it with the result of argmax of the vector fulfilling the condition.
X = [5, 4, -3, -5]
C = <(0)
i = findall(C, X);
i[argmax(X[i])]
#3
Or combine both:
argmax(i -> X[i], findall(C, X))
#3
This assumes that findall does not return an empty result. Otherwise, that case needs to be checked first, e.g. with isempty.
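A minimal sketch of such an isempty guard, assuming Julia 1.7 or later for the two-argument argmax. The name `argmax_where` is hypothetical, and returning nothing for the empty case is just one possible convention.

```julia
# Hypothetical wrapper: index of the largest element of X that satisfies C,
# or `nothing` when no element satisfies C.
function argmax_where(C, X)
    idx = findall(C, X)              # indices of all elements satisfying C
    isempty(idx) && return nothing   # guard: argmax would throw on an empty collection
    return argmax(i -> X[i], idx)    # the index in idx whose element is largest
end

println(argmax_where(<(0), [5, 4, -3, -5]))   # prints 3
println(argmax_where(<(0), [1, 2]))           # prints nothing
```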
Benchmark
#Functions
function August(C, X)
    idx_filtered = (i for (i, el) in pairs(X) if C(el))
    argmax(i -> X[i], idx_filtered)
end
function byhand(C, X)
    start = findfirst(C, X)
    isnothing(start) && return nothing
    imax, max = start, X[start]
    for i = start:lastindex(X)
        if C(X[i]) && X[i] > max
            imax, max = i, X[i]
        end
    end
    imax, max
end
function GKi1(C, X)
    i = findall(C, X);
    i[argmax(X[i])]
end
GKi2(C, X) = argmax(i -> X[i], findall(C, X))
#Data
using Random
Random.seed!(42)
n = 100000
X = randn(n)
C = <(0)
#Benchmark
using BenchmarkTools
suite = BenchmarkGroup()
suite["August"] = @benchmarkable August(C, $X)
suite["byhand"] = @benchmarkable byhand(C, $X)
suite["GKi1"] = @benchmarkable GKi1(C, $X)
suite["GKi2"] = @benchmarkable GKi2(C, $X)
tune!(suite);
results = run(suite)
#Results
results
#4-element BenchmarkTools.BenchmarkGroup:
# tags: []
# "August" => Trial(641.061 μs)
# "byhand" => Trial(261.135 μs)
# "GKi2" => Trial(259.260 μs)
# "GKi1" => Trial(339.570 μs)
results.data["August"]
#BenchmarkTools.Trial: 7622 samples with 1 evaluation.
# Range (min … max): 641.061 μs … 861.379 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 643.640 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 653.027 μs ± 18.123 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
#
# ▄█▅▄▃ ▂▂▃▁ ▁▃▃▂▂ ▁▃ ▁▁ ▁
# ██████▇████████████▇▆▆▇████▇▆██▇▇▇▆▆▆▅▇▆▅▅▅▅▆██▅▆▆▆▇▆▇▇▆▇▆▆▆▅ █
# 641 μs Histogram: log(frequency) by time 718 μs <
#
# Memory estimate: 16 bytes, allocs estimate: 1.
results.data["byhand"]
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 261.135 μs … 621.141 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 261.356 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 264.382 μs ± 11.638 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
#
# █ ▁▁▁▁ ▂ ▁▁ ▂ ▁ ▁ ▁
# █▅▂▂▅████▅▄▃▄▆█▇▇▆▄▅███▇▄▄▅▆▆█▄▇█▅▄▅▅▆▇▇▅▄▅▄▄▄▃▄▃▃▃▄▅▆▅▄▇█▆▅▄ █
# 261 μs Histogram: log(frequency) by time 292 μs <
#
# Memory estimate: 32 bytes, allocs estimate: 1.
results.data["GKi1"]
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 339.570 μs … 1.447 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 342.579 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 355.167 μs ± 52.935 μs ┊ GC (mean ± σ): 1.90% ± 6.85%
#
# █▆▄▅▃▂▁▁ ▁ ▁
# ████████▇▆▆▅▅▅▆▄▄▄▄▁▃▁▁▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
# 340 μs Histogram: log(frequency) by time 722 μs <
#
# Memory estimate: 800.39 KiB, allocs estimate: 11.
results.data["GKi2"]
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 259.260 μs … 752.773 μs ┊ GC (min … max): 0.00% … 54.40%
# Time (median): 260.692 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 270.300 μs ± 40.094 μs ┊ GC (mean ± σ): 1.31% ± 5.60%
#
# █▁▁▅▄▂▂▄▃▂▁▁▁ ▁ ▁
# █████████████████▇██▆▆▇▆▅▄▆▆▆▄▅▄▆▅▇▇▆▆▅▅▄▅▃▃▅▃▄▁▁▁▃▁▃▃▃▄▃▃▁▃▃ █
# 259 μs Histogram: log(frequency) by time 390 μs <
#
# Memory estimate: 408.53 KiB, allocs estimate: 9.
versioninfo()
#Julia Version 1.8.0
#Commit 5544a0fab7 (2022-08-17 13:38 UTC)
#Platform Info:
# OS: Linux (x86_64-linux-gnu)
# CPU: 8 × Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
# WORD_SIZE: 64
# LIBM: libopenlibm
# LLVM: libLLVM-13.0.1 (ORCJIT, sandybridge)
# Threads: 1 on 8 virtual cores
In this example, argmax(i -> X[i], findall(C, X)) is close in performance to @August's hand-written function, but it uses more memory. It can perform better when the data is sorted:
sort!(X)
results = run(suite)
#4-element BenchmarkTools.BenchmarkGroup:
# tags: []
# "August" => Trial(297.519 μs)
# "byhand" => Trial(270.486 μs)
# "GKi2" => Trial(242.320 μs)
# "GKi1" => Trial(319.732 μs)
From what I understand from your question you can use findmax() (requires Julia >= v1.7) to find the maximum index on the result of findall():
julia> v = [10, 20, 30, 40, 50]
5-element Vector{Int64}:
10
20
30
40
50
julia> findmax(findall(x -> x > 30, v))[1]
5
Performance of the above function:
julia> v = collect(10:1:10_000_000);
julia> @btime findmax(findall(x -> x > 30, v))[1]
33.471 ms (10 allocations: 77.49 MiB)
9999991
Update: the solutions suggested by @dan-getz, using last() and findlast(), both perform better than findmax(), and findlast() is the clear winner:
julia> @btime last(findall(x -> x > 30, v))
19.961 ms (9 allocations: 77.49 MiB)
9999991
julia> @btime findlast(x -> x > 30, v)
81.422 ns (2 allocations: 32 bytes)
Update 2: Looks like the OP wanted to find the max element and not only the index. In that case, the solution would be:
julia> v[findmax(findall(x -> x > 30, v))[1]]
50

Julia parallelism: @distributed (+) slower than serial?

After seeing a couple tutorials on the internet on Julia parallelism I decided to implement a small parallel snippet for computing the harmonic series.
The serial code is:
harmonic = function (n::Int64)
    x = 0
    for i in n:-1:1 # summing backwards to avoid rounding errors
        x +=1/i
    end
    x
end
And I made 2 parallel versions, one using the @distributed macro and another using the @everywhere macro (julia -p 2 btw):
@everywhere harmonic_ever = function (n::Int64)
    x = 0
    for i in n:-1:1
        x +=1/i
    end
    x
end
harmonic_distr = function (n::Int64)
    x = @distributed (+) for i in n:-1:1
        x = 1/i
    end
    x
end
However, when I run the above code and @time it, I don't get any speedup - in fact, the @distributed version runs significantly slower!
@time harmonic(10^10)
>>> 53.960678 seconds (29.10 k allocations: 1.553 MiB) 23.60306659488827
job = @spawn harmonic_ever(10^10)
@time fetch(job)
>>> 46.729251 seconds (309.01 k allocations: 15.737 MiB) 23.60306659488827
@time harmonic_distr(10^10)
>>> 143.105701 seconds (1.25 M allocations: 63.564 MiB, 0.04% gc time) 23.603066594889185
What completely and absolutely baffles me is the "0.04% gc time". I'm clearly missing something, and the examples I saw weren't written for version 1.0.1 (one, for example, used @parallel).
Your distributed version should be
function harmonic_distr2(n::Int64)
    x = @distributed (+) for i in n:-1:1
        1/i # no x assignment here
    end
    x
end
The @distributed loop will accumulate the values of 1/i on every worker and then finally on the master process.
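As a small sanity check of that reduction, the sketch below compares a @distributed (+) loop against a plain serial sum for a tiny n. The name `harmonic_check` is hypothetical. No workers are added here, so the loop runs on the single local process; adding workers with addprocs beforehand is assumed to be what gives actual parallelism.

```julia
using Distributed   # standard library; call addprocs(k) first for real parallelism

# Same shape as the corrected loop: the value of the loop body's last
# expression is what the (+) reducer accumulates.
function harmonic_check(n::Int)
    @distributed (+) for i in n:-1:1
        1 / i
    end
end

serial = sum(1 / i for i in 4:-1:1)
println(harmonic_check(4) ≈ serial)
```

The comparison uses ≈ rather than == because, with several workers, the partial sums may be combined in a different order than in the serial loop.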
Note that it is also generally better to use BenchmarkTools's @btime macro instead of @time for benchmarking:
julia> using Distributed; addprocs(4);
julia> @btime harmonic(1_000_000_000); # serial
1.601 s (1 allocation: 16 bytes)
julia> @btime harmonic_distr2(1_000_000_000); # parallel
754.058 ms (399 allocations: 36.63 KiB)
julia> @btime harmonic_distr(1_000_000_000); # your old parallel version
4.289 s (411 allocations: 37.13 KiB)
The parallel version is, of course, slower if run only on one process:
julia> rmprocs(workers())
Task (done) @0x0000000006fb73d0
julia> nprocs()
1
julia> @btime harmonic_distr2(1_000_000_000); # (not really) parallel
1.879 s (34 allocations: 2.00 KiB)

Julia: best way to sample from successively shrinking range?

I would like to sample k numbers where the first number is sampled from 1:n and the second from 1:n-1 and the third from 1:n-2 and so on.
I have the below implementation
function shrinksample(n,k)
    [rand(1:m) for m in n:-1:n-k+1]
end
Are there faster solutions in Julia?
The following takes ideas from the implementation of randperm and since n and k are of the same order, this is appropriate as the same type of randomness is needed (both have output space of size n factorial):
function fastshrinksample(r::AbstractRNG,n,k)
    a = Vector{typeof(n)}(k)
    @assert n <= Int64(2)^52
    k == 0 && return a
    mask = (1<<(64-leading_zeros(n)))-1
    nextmask = mask>>1
    nn = n
    for i=1:k
        a[i] = 1+Base.Random.rand_lt(r, nn, mask)
        nn -= 1
        if nn == nextmask
            mask, nextmask = nextmask, nextmask>>1
        end
    end
    return a
end
fastshrinksample(n,k) = fastshrinksample(Base.Random.GLOBAL_RNG, n, k)
Benchmarking gives a 3x improvement for one tested instance:
julia> using BenchmarkTools
julia> @btime shrinksample(10000,10000);
310.277 μs (2 allocations: 78.20 KiB)
julia> @btime fastshrinksample(10000,10000);
91.815 μs (2 allocations: 78.20 KiB)
The trick is mainly to use the internal Base.Random.rand_lt instead of the regular rand(1:n).
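The code above uses pre-1.0 spellings: Base.Random later became the Random standard library, and Vector{T}(k) became Vector{T}(undef, k). The sketch below shows the same preallocate-and-fill structure in current syntax. The name `shrinksample_loop` is hypothetical, and because it relies only on the public rand(rng, range) API, it should not be assumed to reproduce the speed-up attributed to the internal function.

```julia
using Random

# Hypothetical modern-syntax variant using only the public rand(rng, range) API.
# It does NOT use the internal rand_lt trick described above.
function shrinksample_loop(rng::AbstractRNG, n::Integer, k::Integer)
    a = Vector{typeof(n)}(undef, k)       # uninitialized vector of length k
    for i in 1:k
        a[i] = rand(rng, 1:(n - i + 1))   # i-th draw comes from 1:n-i+1
    end
    return a
end
shrinksample_loop(n, k) = shrinksample_loop(Random.default_rng(), n, k)

s = shrinksample_loop(10, 10)
println(s)
```

When k equals n, the last draw comes from the range 1:1, so the final element is always 1.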
If this is not very sensitive to randomness (you're not doing cryptography), the following should be amazingly fast and very simple:
blazingshrinksample(n,k) = (Int)[trunc(Int,(n-m)rand()+1) for m in 0:k-1]
Testing this along with your implementation and with Dan's, I got this:
using BenchmarkTools
@btime shrinksample(10000,10000);
259.414 μs (2 allocations: 78.20 KiB)
@btime fastshrinksample(10000,10000);
66.713 μs (2 allocations: 78.20 KiB)
@btime blazingshrinksample(10000,10000);
33.614 μs (2 allocations: 78.20 KiB)

ArrayFire.jl performance slowly degrades over time

So I'm trying to use ArrayFire in Julia, and I find that the performance bizarrely degrades over time:
using ArrayFire
srand(1)
function f()
    r = AFArray(zeros(Float32, 100, 100000))
    a = AFArray(rand(Float32, 100, 100000))
    for d in 1:100:90000
        r[:,d:d+99] = a[:,d:d+99] .* a[:,d:d+99]
    end
    nothing
end
function g()
    r = zeros(Float32, 100, 100000)
    a = ones(Float32, 100, 100000)
    for d in 1:100:90000
        r[:,d:d+99] = a[:,d:d+99] .* a[:,d:d+99]
    end
    nothing
end
for _ in 1:15
    @time f()
end
If you run this code you'll see every iteration gets slower and slower. I tried calling finalize on r and a inside f() to try to throw these arrays out of GPU memory, in case that was the problem, but it didn't do anything.
Here is the output:
0.810842 seconds (114.91 k allocations: 80.216 MB, 0.71% gc time)
0.283941 seconds (79.22 k allocations: 78.561 MB, 3.22% gc time)
0.267405 seconds (79.22 k allocations: 78.561 MB, 2.31% gc time)
0.332186 seconds (79.22 k allocations: 78.561 MB, 1.76% gc time)
0.405174 seconds (79.22 k allocations: 78.561 MB, 1.50% gc time)
0.433224 seconds (79.22 k allocations: 78.561 MB, 2.11% gc time)
0.501358 seconds (79.22 k allocations: 78.561 MB, 1.18% gc time)
0.572704 seconds (79.22 k allocations: 78.561 MB, 1.07% gc time)
0.650663 seconds (79.22 k allocations: 78.561 MB, 1.10% gc time)
0.794873 seconds (79.22 k allocations: 78.561 MB, 1.16% gc time)
0.838882 seconds (79.22 k allocations: 78.561 MB, 1.04% gc time)
1.281940 seconds (79.22 k allocations: 78.561 MB, 0.61% gc time)
1.200713 seconds (79.22 k allocations: 78.561 MB, 0.37% gc time)
1.268786 seconds (79.22 k allocations: 78.561 MB, 0.78% gc time)
1.396851 seconds (79.22 k allocations: 78.561 MB, 0.66% gc time)
