High GC time for simple mapreduce problem - performance

I have simulation program written in Julia that does something equivalent to this as a part of its main loop:
# Some fake data
M = [randn(100,100) for m=1:100, n=1:100]
W = randn(100,100)
work = zip(W,M)
result = mapreduce(x -> x[1]*x[2], +,work)
In other words, a simple sum of weighted matrices. Timing the above code yields
0.691084 seconds (79.03 k allocations: 1.493 GiB, 70.59% gc time, 2.79% compilation time)
I am surprised about the large number of memory allocations, as this problem should be possible to do in-place. To see if it was my use of mapreduce that was wrong I also tested the following equivalent implementation:
#time begin
res = zeros(100,100)
for m=1:100
for n=1:100
res += W[m,n] * M[m,n]
end
end
end
which gave
0.442521 seconds (50.00 k allocations: 1.491 GiB, 70.81% gc time)
So, if I wrote this in C++ or Fortran it would be simple to do all of this in-place. Is this impossible in Julia? Or am I missing something here...?

It is possible to do it in place like this:
function ws(W, M)
res = zeros(100,100)
for m=1:100
for n=1:100
#. res += W[m,n] * M[m, n]
end
end
return res
end
and the timing is:
julia> #time ws(W, M);
0.100328 seconds (2 allocations: 78.172 KiB)
Note that in order to perform this operation in-place I used broadcasting (I could also use loops, but it would be the same).
The problem with your code is that in line:
res += W[m,n] * M[m,n]
You get two allocations:
When you do multiplication W[m,n] * M[m,n] a new matrix is allocated.
When you do addition res += ... again a matrix is allocated
By using broadcasting with #. you perform an in-place operation, see https://docs.julialang.org/en/v1/manual/mathematical-operations/#man-dot-operators for more explanations.
Additionally note that I have wrapped the code inside a function. If you do not do it then access both W and M is type unstable which also causes allocations, see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.

I'd like to add something to Bogumił's answer. The missing broadcast is the main problem, but in addition, the loop and the mapreduce variant differ in a fundamental semantic way.
The purpose of mapreduce is to reduce by an associative operation with identity element init in an unspecified order. This in particular also includes the (theoretical) option of running parts in parallel and doesn't really play well with mutation. From the docs:
The associativity of the reduction is implementation-dependent. Additionally, some implementations may reuse the return value of f for elements that appear multiple times in itr. Use mapfoldl or
mapfoldr instead for guaranteed left or right associativity and invocation of f for every value.
and
It is unspecified whether init is used for non-empty collections.
What the loop variant really corresponds to is a fold, which has a well-defined order and initial (not necessarily identity) element and can thus use an in-place reduction operator:
Like reduce, but with guaranteed left associativity. If provided, the keyword argument init will be used exactly once.
julia> #benchmark foldl((acc, (m, w)) -> (#. acc += m * w), $work; init=$(zero(W)))
BenchmarkTools.Trial: 45 samples with 1 evaluation.
Range (min … max): 109.967 ms … 118.251 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 112.639 ms ┊ GC (median): 0.00%
Time (mean ± σ): 112.862 ms ± 1.154 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▃█ ▁▄▃
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███▆███▄▁▄▁▁▄▁▁▄▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
110 ms Histogram: frequency by time 118 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> #benchmark mapreduce(Base.splat(*), +, $work)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 403.100 ms … 458.882 ms ┊ GC (min … max): 4.53% … 3.89%
Time (median): 445.058 ms ┊ GC (median): 4.04%
Time (mean ± σ): 440.042 ms ± 16.792 ms ┊ GC (mean ± σ): 4.21% ± 0.92%
▁ ▁ ▁ ▁ ▁ ▁ ▁▁▁ █ ▁
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁█▁█▁▁▁▁███▁▁▁▁▁█▁▁▁█ ▁
403 ms Histogram: frequency by time 459 ms <
Memory estimate: 1.49 GiB, allocs estimate: 39998.
Think of it that way: if you would write the function as a parallel for loop with (+) reduction, iteration also would have an unspecified order, and you'd have memory overhead for the necessary copying of the individual results to the accumulating thread.
Thus, there is a trade-off. In your example, allocation/copying dominates. In other cases, the the mapped operation might dominate, and parallel reduction (with unspecified order, but copying overhead) be worth it.

Related

Why is operating on Float64 faster than Float16?

I wonder why operating on Float64 values is faster than operating on Float16:
julia> rnd64 = rand(Float64, 1000);
julia> rnd16 = rand(Float16, 1000);
julia> #benchmark rnd64.^2
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.800 μs … 662.140 μs ┊ GC (min … max): 0.00% … 99.37%
Time (median): 2.180 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.457 μs ± 13.176 μs ┊ GC (mean ± σ): 12.34% ± 3.89%
▁██▄▂▂▆▆▄▂▁ ▂▆▄▁ ▂▂▂▁ ▂
████████████████▇▇▆▆▇▆▅▇██▆▆▅▅▆▄▄▁▁▃▃▁▁▄▁▃▄▁▃▁▄▃▁▁▆▇██████▇ █
1.8 μs Histogram: log(frequency) by time 10.6 μs <
Memory estimate: 8.02 KiB, allocs estimate: 5.
julia> #benchmark rnd16.^2
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max): 5.117 μs … 587.133 μs ┊ GC (min … max): 0.00% … 98.61%
Time (median): 5.383 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.716 μs ± 9.987 μs ┊ GC (mean ± σ): 3.01% ± 1.71%
▃▅█▇▅▄▄▆▇▅▄▁ ▁ ▂
▄██████████████▇▆▇▆▆▇▆▇▅█▇████▇█▇▇▆▅▆▄▇▇▆█▇██▇█▇▇▇▆▇▇▆▆▆▆▄▄ █
5.12 μs Histogram: log(frequency) by time 7.48 μs <
Memory estimate: 2.14 KiB, allocs estimate: 5.
Maybe you ask why I expect the opposite: Because Float16 values have less floating point precision:
julia> rnd16[1]
Float16(0.627)
julia> rnd64[1]
0.4375452455597999
Shouldn't calculations with fewer precisions take place faster? Then I wonder why someone should use Float16? They can do it even with Float128!
As you can see, the effect you are expecting is present for Float32:
julia> rnd64 = rand(Float64, 1000);
julia> rnd32 = rand(Float32, 1000);
julia> rnd16 = rand(Float16, 1000);
julia> #btime $rnd64.^2;
616.495 ns (1 allocation: 7.94 KiB)
julia> #btime $rnd32.^2;
330.769 ns (1 allocation: 4.06 KiB) # faster!!
julia> #btime $rnd16.^2;
2.067 μs (1 allocation: 2.06 KiB) # slower!!
Float64 and Float32 have hardware support on most platforms, but Float16 does not, and must therefore be implemented in software.
Note also that you should use variable interpolation ($) when micro-benchmarking. The difference is significant here, not least in terms of allocations:
julia> #btime $rnd32.^2;
336.187 ns (1 allocation: 4.06 KiB)
julia> #btime rnd32.^2;
930.000 ns (5 allocations: 4.14 KiB)
The short answer is that you probably shouldn't use Float16 unless you are using a GPU or an Apple CPU because (as of 2022) other processors don't have hardware support for Float16.

Difference between the built-in #time macros and the ones from the benchmark module

In the Julia package BenchmarkTools, there are macros like #btime, #belapse that seem redundant to me since Julia has built-in #time, #elapse macros. And it seems to me that these macros serve the same purpose. So what's the difference between #time and #btime, and #elapse and #belapsed?
TLDR ;)
#time and #elapsed just run the code once and measure the time. This measurement may or may not include the compile time (depending whether #time is run for the first or second time) and includes time to resolve global variables.
On the the other hand #btime and #belapsed perform warm up so you know that compile time and global variable resolve time (if $ is used) do not affect the time measurement.
Details
For further understand how this works lets used the #macroexpand (I am also stripping comment lines for readability):
julia> using MacroTools, BenchmarkTools
julia> MacroTools.striplines(#macroexpand1 #elapsed sin(x))
quote
Experimental.#force_compile
local var"#28#t0" = Base.time_ns()
sin(x)
(Base.time_ns() - var"#28#t0") / 1.0e9
end
Compilation if sin is not forced and you get different results when running for the first time and subsequent times. For an example:
julia> #time cos(x);
0.110512 seconds (261.97 k allocations: 12.991 MiB, 99.95% compilation time)
julia> #time cos(x);
0.000008 seconds (1 allocation: 16 bytes)
julia> #time cos(x);
0.000006 seconds (1 allocation: : 16 bytes)
The situation is different with #belapsed:
julia> MacroTools.striplines(#macroexpand #belapsed sin($x))
quote
(BenchmarkTools).time((BenchmarkTools).minimum(begin
local var"##314" = begin
BenchmarkTools.generate_benchmark_definition(Main, Symbol[], Any[], [Symbol("##x#315")], (x,), $(Expr(:copyast, :($(QuoteNode(:(sin(var"##x#315"))))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), BenchmarkTools.Parameters())
end
(BenchmarkTools).warmup(var"##314")
(BenchmarkTools).tune!(var"##314")
(BenchmarkTools).run(var"##314")
end)) / 1.0e9
end
You can see that a minimum value is taken (the code is run several times).
Basically most time you should use BenchmarkTools for measuring times when designing your application.
Last but not least try #benchamrk:
julia> #benchmark sin($x)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 13.714 ns … 51.151 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.814 ns ┊ GC (median): 0.00%
Time (mean ± σ): 14.089 ns ± 1.121 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇ ▂▄ ▁▂ ▃ ▁ ▂
██▆▅██▇▅▄██▃▁▃█▄▃▁▅█▆▁▄▃▅█▅▃▁▄▇▆▁▁▁▁▁▆▄▄▁▁▃▄▇▃▁▃▁▁▁▆▅▁▁▁▆▅▅ █
13.7 ns Histogram: log(frequency) by time 20 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

Find index of maximum element satisfying condition (Julia)

In Julia I can use argmax(X) to find max element. If I want to find all element satisfying condition C I can use findall(C,X). But how can I combine the two? What's the most efficient/idiomatic/concise way to find maximum element index satisfying some condition in Julia?
If you'd like to avoid allocations, filtering the array lazily would work:
idx_filtered = (i for (i, el) in pairs(X) if C(el))
argmax(i -> X[i], idx_filtered)
Unfortunately, this is about twice as slow as a hand-written version. (edit: in my benchmarks, it's 2x slower on Intel Xeon Platinum but nearly equal on Apple M1)
function byhand(C, X)
start = findfirst(C, X)
isnothing(start) && return nothing
imax, max = start, X[start]
for i = start:lastindex(X)
if C(X[i]) && X[i] > max
imax, max = i, X[i]
end
end
imax, max
end
You can store the index returned by findall and subset it with the result of argmax of the vector fulfilling the condition.
X = [5, 4, -3, -5]
C = <(0)
i = findall(C, X);
i[argmax(X[i])]
#3
Or combine both:
argmax(i -> X[i], findall(C, X))
#3
Assuming that findall is not empty. Otherwise it need to be tested e.g. with isempty.
Benchmark
#Functions
function August(C, X)
idx_filtered = (i for (i, el) in pairs(X) if C(el))
argmax(i -> X[i], idx_filtered)
end
function byhand(C, X)
start = findfirst(C, X)
isnothing(start) && return nothing
imax, max = start, X[start]
for i = start:lastindex(X)
if C(X[i]) && X[i] > max
imax, max = i, X[i]
end
end
imax, max
end
function GKi1(C, X)
i = findall(C, X);
i[argmax(X[i])]
end
GKi2(C, X) = argmax(i -> X[i], findall(C, X))
#Data
using Random
Random.seed!(42)
n = 100000
X = randn(n)
C = <(0)
#Benchmark
using BenchmarkTools
suite = BenchmarkGroup()
suite["August"] = #benchmarkable August(C, $X)
suite["byhand"] = #benchmarkable byhand(C, $X)
suite["GKi1"] = #benchmarkable GKi1(C, $X)
suite["GKi2"] = #benchmarkable GKi2(C, $X)
tune!(suite);
results = run(suite)
#Results
results
#4-element BenchmarkTools.BenchmarkGroup:
# tags: []
# "August" => Trial(641.061 μs)
# "byhand" => Trial(261.135 μs)
# "GKi2" => Trial(259.260 μs)
# "GKi1" => Trial(339.570 μs)
results.data["August"]
#BenchmarkTools.Trial: 7622 samples with 1 evaluation.
# Range (min … max): 641.061 μs … 861.379 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 643.640 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 653.027 μs ± 18.123 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
#
# ▄█▅▄▃ ▂▂▃▁ ▁▃▃▂▂ ▁▃ ▁▁ ▁
# ██████▇████████████▇▆▆▇████▇▆██▇▇▇▆▆▆▅▇▆▅▅▅▅▆██▅▆▆▆▇▆▇▇▆▇▆▆▆▅ █
# 641 μs Histogram: log(frequency) by time 718 μs <
#
# Memory estimate: 16 bytes, allocs estimate: 1.
results.data["byhand"]
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 261.135 μs … 621.141 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 261.356 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 264.382 μs ± 11.638 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
#
# █ ▁▁▁▁ ▂ ▁▁ ▂ ▁ ▁ ▁
# █▅▂▂▅████▅▄▃▄▆█▇▇▆▄▅███▇▄▄▅▆▆█▄▇█▅▄▅▅▆▇▇▅▄▅▄▄▄▃▄▃▃▃▄▅▆▅▄▇█▆▅▄ █
# 261 μs Histogram: log(frequency) by time 292 μs <
#
# Memory estimate: 32 bytes, allocs estimate: 1.
results.data["GKi1"]
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 339.570 μs … 1.447 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 342.579 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 355.167 μs ± 52.935 μs ┊ GC (mean ± σ): 1.90% ± 6.85%
#
# █▆▄▅▃▂▁▁ ▁ ▁
# ████████▇▆▆▅▅▅▆▄▄▄▄▁▃▁▁▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
# 340 μs Histogram: log(frequency) by time 722 μs <
#
# Memory estimate: 800.39 KiB, allocs estimate: 11.
results.data["GKi2"]
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 259.260 μs … 752.773 μs ┊ GC (min … max): 0.00% … 54.40%
# Time (median): 260.692 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 270.300 μs ± 40.094 μs ┊ GC (mean ± σ): 1.31% ± 5.60%
#
# █▁▁▅▄▂▂▄▃▂▁▁▁ ▁ ▁
# █████████████████▇██▆▆▇▆▅▄▆▆▆▄▅▄▆▅▇▇▆▆▅▅▄▅▃▃▅▃▄▁▁▁▃▁▃▃▃▄▃▃▁▃▃ █
# 259 μs Histogram: log(frequency) by time 390 μs <
#
# Memory estimate: 408.53 KiB, allocs estimate: 9.
versioninfo()
#Julia Version 1.8.0
#Commit 5544a0fab7 (2022-08-17 13:38 UTC)
#Platform Info:
# OS: Linux (x86_64-linux-gnu)
# CPU: 8 × Intel(R) Core(TM) i7-2600K CPU # 3.40GHz
# WORD_SIZE: 64
# LIBM: libopenlibm
# LLVM: libLLVM-13.0.1 (ORCJIT, sandybridge)
# Threads: 1 on 8 virtual cores
In this example argmax(i -> X[i], findall(C, X)) is close to the performance of the hand written function of #August but uses more memory, but can show better performance in case the data is sorted:
sort!(X)
results = run(suite)
#4-element BenchmarkTools.BenchmarkGroup:
# tags: []
# "August" => Trial(297.519 μs)
# "byhand" => Trial(270.486 μs)
# "GKi2" => Trial(242.320 μs)
# "GKi1" => Trial(319.732 μs)
From what I understand from your question you can use findmax() (requires Julia >= v1.7) to find the maximum index on the result of findall():
julia> v = [10, 20, 30, 40, 50]
5-element Vector{Int64}:
10
20
30
40
50
julia> findmax(findall(x -> x > 30, v))[1]
5
Performance of the above function:
julia> v = collect(10:1:10_000_000);
julia> #btime findmax(findall(x -> x > 30, v))[1]
33.471 ms (10 allocations: 77.49 MiB)
9999991
Update: solution suggested by #dan-getz of using last() and findlast() perform better than findmax() but findlast() is the winner:
julia> #btime last(findall(x -> x > 30, v))
19.961 ms (9 allocations: 77.49 MiB)
9999991
julia> #btime findlast(x -> x > 30, v)
81.422 ns (2 allocations: 32 bytes)
Update 2: Looks like the OP wanted to find the max element and not only the index. In that case, the solution would be:
julia> v[findmax(findall(x -> x > 30, v))[1]]
50

Mergesort implementation in Julia

I'm trying to implement the merge sort algorithm in Julia, but I cannot seem to understand the recursion step needed for the algorithm. My code is the following:
mₐ = [1, 10, 7, 4, 3, 6, 8, 2, 9]
b₁(t, z, half₁, half₂)= ((t<=length(half₁)) && (z<=length(half₂))) && (half₁[t]<half₂[z])
b₂(t, z, half₁, half₂)= ((z<=length(half₂)) && (t<=length(half₁)) ) && (half₁[t]>half₂[z])
function Merge(m₁, m₂)
N = length(m₁) + length(m₂)
B = zeros(N)
i = 1
j = 1
for k in 1:N
if b₁(i, j, m₁, m₂)
B[k] = m₁[i]
i += 1
elseif b₂(i, j, m₁, m₂)
B[k] = m₂[j]
j += 1
elseif j >= length(m₂)
B[k] = m₁[i]
i += 1
elseif i >= length(m₁)
B[k] = m₂[j]
j += 1
end
end
return B
end
function MergeSort(M)
if length(M) == 1
return M
elseif length(M) == 0
return nothing
end
n = length(M)
i₁ = n ÷ 2
i₂ = n - i₁
h₁ = M[1:i₁]
h₂ = M[i₂:end]
C = MergeSort(h₁)
D = MergeSort(h₂)
return Merge(C, D)
end
MergeSort(mₐ)
It always gets stuck when C becomes a single element because it returns it and then splits it again, the only solution is to make it a loop once it is a single element. However, this would not be a recursive approach.
Solution
Taking #Sundar R answer and suggestions. This is a working implementation
#implementation of MergeSort in julia
# merge function, it joins two ordered arrays and returning one single ordered array
function merge(m₁, m₂)
N = length(m₁) + length(m₂)
# create a zeros array of the same input type (int64)
B = zeros(eltype(m₁), N)
i = 1
j = 1
for k in 1:N
if !checkbounds(Bool, m₁, i)
B[k] = m₂[j]
j += 1
elseif !checkbounds(Bool, m₂, j)
B[k] = m₁[i]
i += 1
elseif m₁[i]<m₂[j]
B[k] = m₁[i]
i += 1
else
B[k] = m₂[j]
j += 1
end
end
return B
end
# merge mergesort, this function recursively orders m/2 sub array given an array M
function mergeSort(M)
# base cases
if length(M) == 1
return M
elseif length(M) == 0
return nothing
end
# dividing array in two
n = length(M)
i₁ = n ÷ 2
# be careful with the indexes, thank you #Sundar R
i₂ = i₁ + 1
h₁ = M[1:i₁]
h₂ = M[i₂:end]
# recursively sorting the array
C = mergeSort(h₁)
D = mergeSort(h₂)
return merge(C, D)
end
#test the function
mₐ = [1, 10, 7, 4, 3, 6, 8, 2, 9]
b = mergeSort(mₐ)
println(b)
The issue is with the indices used for splitting, specifically i₂. n - i₁ is the number of elements in the second half of the array, but not necessarily the index where the second half starts - for that you just want i₂ = i₁ + 1.
With i₂ = n - i₁, when n is 2 i.e. when you come down to [1, 10] as the array to sort, i₁ = n ÷ 2 is 1, and i₂ is (2 - 1) = 1 also. So instead of splitting it into [1], [10], you end up "splitting" it into [1], and [1, 10], hence the infinite looping.
Once you fix that, there's a BoundsError from Merge because of a minor mistake: the elseif conditions should check for >, not >= (since Julia uses 1-based indexing, j is still a valid index when j == length(m₂)).
Some other suggestions:
zeros(N) returns a Float64 array, so the result here will always be a float array. I'd suggest zeros(eltype(m₁), N) instead.
It feels like b₁ and b₂ only complicate the code and make it less clear, I'd suggest a simple nested if there, an outer one to check the indices - look up checkbounds, for eg. checkbounds(Bool, m₁, i) - and an inner one to see which is greater.
Julia convention is to use lowercase for functions, so merge and mergesort instead of Merge and MergeSort
To add to the previous answers, which deal with some of the problems in your existing code, here is for reference a relatively efficient and straightforward Julia implementation of mergesort:
# Top-level function will allocate temporary arrays for convenience
function mergesort(A)
S = similar(A)
return mergesort!(copy(A), S)
end
# Efficient in-place version
# S is a temporary working (scratch) array
function mergesort!(A, S, n=length(A))
width = 1
swapcount = 0
while width < n
# A is currently full of sorted runs of length `width` (starting with width=1)
for i = 1:2*width:n
# Merge two sorted lists, left and right:
# left = A[i:i+width-1], right = A[i+width:i+2*width-1]
merge!(A, i, min(i+width, n+1), min(i+2*width, n+1), S)
end
# Swap the pointers of `A` and `S` such that `A` now contains merged
# runs of length 2*width.
S,A = A,S
swapcount += 1
# Double the width and continue
width *= 2
end
# Optional, if it is important that `A` be sorted in-place:
if isodd(swapcount)
# If we've swapped A and S an odd number of times, copy `A` back to `S`
# since `S` will by now refer to the memory initially provided as input
# array `A`, which the user will expect to have been sorted in-place
copyto!(S,A)
end
return A
end
# Merge two sorted subarrays, left and right:
# left = A[iₗ:iᵣ-1], right = A[iᵣ:iₑ-1]
#inline function merge!(A, iₗ, iᵣ, iₑ, S)
left, right = iₗ, iᵣ
#inbounds for n = iₗ:(iₑ-1)
if (left < iᵣ) && (right >= iₑ || A[left] <= A[right])
S[n] = A[left]
left += 1
else
S[n] = A[right]
right += 1
end
end
end
This is enough to get us in the same ballpark as Base's implementation of the same algorithm
julia> using BenchmarkTools
julia> #benchmark mergesort!(A,B) setup = (A = rand(50); B = similar(A))
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
Range (min … max): 497.062 ns … 1.294 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 501.438 ns ┊ GC (median): 0.00%
Time (mean ± σ): 526.171 ns ± 49.011 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅ ▁ ▁ ▃▇▄ ▁ ▂
█████▇▇▆▇█▇████▇▅▆▅▅▅▆█▆██▄▅▅▄▆██▆▆▄▄▆██▅▃▄██▄▅▅▃▃▃▃▄▅▁▄▄▃▁█ █
497 ns Histogram: log(frequency) by time 718 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> issorted(mergesort(rand(50)))
true
julia> issorted(mergesort(rand(10_000)))
true
julia> #benchmark Base.sort!(A, alg=MergeSort) setup=(A = rand(50))
BenchmarkTools.Trial: 10000 samples with 216 evaluations.
Range (min … max): 344.690 ns … 11.294 μs ┊ GC (min … max): 0.00% … 95.73%
Time (median): 352.917 ns ┊ GC (median): 0.00%
Time (mean ± σ): 401.700 ns ± 378.399 ns ┊ GC (mean ± σ): 3.57% ± 3.76%
█▇▄▄▄▂▁▂▁▂▃▁▁ ▃▂ ▁ ▁▁ ▁
████████████████▇██████▆▆▆▅▆▆▆▆▅▃▅▅▄▅▃▅▅▄▆▅▄▅▄▅▃▄▄██▇▅▆▆▇▆▄▅▅ █
345 ns Histogram: log(frequency) by time 741 ns <
Memory estimate: 336 bytes, allocs estimate: 3.
though both cost a good bit more in terms of both time and memory (the latter due to the need for the working array) in most numeric cases than a similarly efficient pure-Julia implementation of quicksort!:
julia> #benchmark VectorizedStatistics.quicksort!(A) setup = (A = rand(50))
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
Range (min … max): 28.854 ns … 175.821 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 35.268 ns ┊ GC (median): 0.00%
Time (mean ± σ): 38.703 ns ± 7.478 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▃█▁ ▃▃ ▃▆▂ ▂ ▃ ▂ ▁ ▂ ▂
█▆▃▅▁▁▄▅███▆███▆▆███▁▇█▇▅▇█▆▇█▁▆▅▃▅▄▄██▅▆▅▇▅▄▃▁▄▃▁▄▁▃▃▃▁▄▄▇█ █
28.9 ns Histogram: log(frequency) by time 68.7 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

Profiling a Haskell program

I have a piece of code that repeatedly samples from a probability distribution using sequence. Morally, it does something like this:
sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
xs <- sequence (replicate n dist)
return (sum xs)
Except that it's a bit more complicated. The actual code I'm interested in is the function likelihoodWeighting at this Github repo.
I noticed that the running time scales nonlinearly with n. In particular, once n exceeds a certain value it hits the memory limit, and the running time explodes. I'm not certain, but I think this is because sequence is building up a long list of thunks which aren't getting evaluated until the call to sum.
Once I get past about 100,000 samples, the program slows to a crawl. I'd like to optimize this (my feeling is that 10 million samples shouldn't be a problem) so I decided to profile it - but I'm having a little trouble understanding the output of the profiler.
Profiling
I created a short executable in a file main.hs that runs my function with 100,000 samples. Here's the output from doing
$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s
First things I notice - it allocates nearly 1.5 GB of heap, and spends 60% of its time on garbage collection. Is this generally indicative of too much laziness?
1,377,538,232 bytes allocated in the heap
1,195,050,032 bytes copied during GC
169,411,368 bytes maximum residency (12 sample(s))
7,360,232 bytes maximum slop
423 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2574 collections, 0 parallel, 2.40s, 2.43s elapsed
Generation 1: 12 collections, 0 parallel, 1.07s, 1.28s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.92s ( 1.94s elapsed)
GC time 3.47s ( 3.70s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.23s ( 0.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.63s ( 5.87s elapsed)
%GC time 61.8% (63.1% elapsed)
Alloc rate 716,368,278 bytes per MUT second
Productivity 34.2% of total user, 32.7% of total elapsed
Here are the results from
$ ./main +RTS -p
The first time I ran this, it turned out that there was one function being called repeatedly, and it turned out I could memoize it, which sped things up by a factor of 2. It didn't solve the space leak, however.
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 434 4 0.0 0.0 100.0 100.0
likelihoodWeighting AI.Probability.Bayes 445 1 0.0 0.3 100.0 100.0
distributionLW AI.Probability.Bayes 448 1 0.0 2.6 0.0 2.6
getSampleLW AI.Probability.Bayes 446 100000 20.0 50.4 100.0 97.1
bnProb AI.Probability.Bayes 458 400000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 457 400000 6.7 0.8 6.7 0.8
bnVals AI.Probability.Bayes 455 400000 20.0 6.3 26.7 7.1
bnParents AI.Probability.Bayes 456 400000 6.7 0.8 6.7 0.8
bnSubRef AI.Probability.Bayes 454 800000 13.3 13.5 13.3 13.5
weightedSample AI.Probability.Bayes 447 100000 26.7 23.9 33.3 25.3
bnProb AI.Probability.Bayes 453 100000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 452 100000 0.0 0.2 0.0 0.2
bnVals AI.Probability.Bayes 450 100000 0.0 0.3 6.7 0.5
bnParents AI.Probability.Bayes 451 100000 6.7 0.2 6.7 0.2
bnSubRef AI.Probability.Bayes 449 200000 0.0 0.7 0.0 0.7
Here's a heap profile. I don't know why it claims the runtime is 1.8 seconds - this run took about 6 seconds.
Can anyone help me to interpret the output of the profiler - i.e. to identify where the bottleneck is, and provide suggestions for how to speed things up?
A huge improvement has already been achieved by incorporating JohnL's suggestion of using foldM in likelihoodWeighting. That reduced memory usage about tenfold here, and brought down the GC times significantly to almost or actually negligible.
A profiling run with the current source yields
probabilityIO AI.Util.Util 26.1 42.4 413 290400000
weightedSample.go AI.Probability.Bayes 16.1 19.1 255 131200080
bnParents AI.Probability.Bayes 10.8 1.2 171 8000384
bnVals AI.Probability.Bayes 10.4 7.8 164 53603072
bnCond AI.Probability.Bayes 7.9 1.2 125 8000384
ndSubRef AI.Util.Array 4.8 9.2 76 63204112
bnSubRef AI.Probability.Bayes 4.7 8.1 75 55203072
likelihoodWeighting.func AI.Probability.Bayes 3.3 2.8 53 19195128
%! AI.Util.Util 3.3 0.5 53 3200000
bnProb AI.Probability.Bayes 2.5 0.0 40 16
bnProb.p AI.Probability.Bayes 2.5 3.5 40 24001152
likelihoodWeighting AI.Probability.Bayes 2.5 2.9 39 20000264
likelihoodWeighting.func.x AI.Probability.Bayes 2.3 0.2 37 1600000
and 13MB memory usage reported by -s, ~5MB maximum residency. That's not too bad already.
Still, there remain some points we can improve. First, a relatively minor thing, in the grand scheme, AI.UTIl.Array.ndSubRef:
ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])
Reversing the list, and mapping (2^) over another list is inefficient, better is
ndSubRef = L.foldl' (\a d -> 2*a + d) 0
which doesn't need to keep the entire list in memory (probably not a big deal, since the lists will be short) as reversing it does, and doesn't need to allocate a second list. The reduction in allocation is noticeable, about 10%, and that part runs measurably faster,
ndSubRef AI.Util.Array 1.7 1.3 24 8000384
in the profile of the modified run, but since it takes only a small part of the overall time, the overall impact is small. There are potentially bigger fish to fry in weightedSample and likelihoodWeighting.
Let's add a bit of strictness in weightedSample to see how that changes things:
weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
go 1.0 (M.fromList fixed) (bnVars bn)
where
go w assignment [] = return (assignment, w)
go w assignment (v:vs) = if v `elem` vars
then
let w' = w * bnProb bn assignment (v, fixed %! v)
in go w' assignment vs
else do
let p = bnProb bn assignment (v,True)
x <- probabilityIO p
go w (M.insert v x assignment) vs
vars = map fst fixed
The weight parameter of go is never forced, nor is the assignment parameter, thus they can build up thunks. Let's enable {-# LANGUAGE BangPatterns #-} and force updates to take effect immediately, also evaluate p before passing it to probabilityIO:
go w assignment (v:vs) = if v `elem` vars
then
let !w' = w * bnProb bn assignment (v, fixed %! v)
in go w' assignment vs
else do
let !p = bnProb bn assignment (v,True)
x <- probabilityIO p
let !assignment' = M.insert v x assignment
go w assignment' vs
That brings a further reduction in allocation (~9%) and a small speedup (~%13%), but the total memory usage and maximum residence haven't changed much.
I see nothing else obvious to change there, so let's look at likelihoodWeighting:
func m _ = do
(a, w) <- weightedSample bn fixed
let x = a ! e
return $! x `seq` w `seq` M.adjust (+w) x m
In the last line, first, w is already evaluated in weightedSample now, so we don't need to seq it here, the key x is required to evaluate the updated map, so seqing that isn't necessary either. The bad thing on that line is M.adjust. adjust has no way of forcing the result of the updated function, so that builds thunks in the map's values. You can force evaluation of the thunks by looking up the modified value and forcing that, but Data.Map provides a much more convenient way here, since the key at which the map is updated is guaranteed to be present, insertWith':
func !m _ = do
(a, w) <- weightedSample bn fixed
let x = a ! e
return (M.insertWith' (+) x w m)
(Note: GHC optimises better with a bang-pattern on m than with return $! ... here). That slightly reduces the total allocation and doesn't measurably change the running time, but has a great impact on total memory used and maximum residency:
934,566,488 bytes allocated in the heap
1,441,744 bytes copied during GC
68,112 bytes maximum residency (1 sample(s))
23,272 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
The biggest improvement in running time to be had would be by avoiding randomIO, the used StdGen is very slow.
I am surprised how much time the bn* functions take, but don't see any obvious inefficiency in those.
I have trouble digesting these profiles, but I have gotten my ass kicked before because the MonadRandom on Hackage is strict. Creating a lazy version of MonadRandom made my memory problems go away.
My colleague has not yet gotten permission to release the code, but I've put Control.Monad.LazyRandom online at pastebin. Or if you want to see some excerpts that explain a fully lazy random search, including infinite lists of random computations, check out Experience Report: Haskell in Computational Biology.
I put together a very elementary example, posted here: http://hpaste.org/71919. I'm not sure if it's anything like your example.. just a very minimal thing that seemed to work.
Compiling with -prof and -fprof-auto and running with 100000 iterations yielded the following head of the profiling output (pardon my line numbers):
8 COST CENTRE MODULE %time %alloc
9
10 sample AI.Util.ProbDist 31.5 36.6
11 bnParents AI.Probability.Bayes 23.2 0.0
12 bnRank AI.Probability.Bayes 10.7 23.7
13 weightedSample.go AI.Probability.Bayes 9.6 13.4
14 bnVars AI.Probability.Bayes 8.6 16.2
15 likelihoodWeighting AI.Probability.Bayes 3.8 4.2
16 likelihoodWeighting.getSample AI.Probability.Bayes 2.1 0.7
17 sample.cumulative AI.Util.ProbDist 1.7 2.1
18 bnCond AI.Probability.Bayes 1.6 0.0
19 bnRank.ps AI.Probability.Bayes 1.1 0.0
And here are the summary statistics:
1,433,944,752 bytes allocated in the heap
1,016,435,800 bytes copied during GC
176,719,648 bytes maximum residency (11 sample(s))
1,900,232 bytes maximum slop
400 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.40s ( 1.41s elapsed)
GC time 1.08s ( 1.24s elapsed)
Total time 2.47s ( 2.65s elapsed)
%GC time 43.6% (46.8% elapsed)
Alloc rate 1,026,674,336 bytes per MUT second
Productivity 56.4% of total user, 52.6% of total elapsed
Notice that the profiler pointed its finger at sample. I forced the return in that function by using $!, and here are some summary statistics afterwards:
1,776,908,816 bytes allocated in the heap
165,232,656 bytes copied during GC
34,963,136 bytes maximum residency (7 sample(s))
483,192 bytes maximum slop
68 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.42s ( 2.44s elapsed)
GC time 0.21s ( 0.23s elapsed)
Total time 2.63s ( 2.68s elapsed)
%GC time 7.9% (8.8% elapsed)
Alloc rate 733,248,745 bytes per MUT second
Productivity 92.1% of total user, 90.4% of total elapsed
Much more productive in terms of GC, but not much changed on the time. You might be able to keep iterating in this profile/tweak fashion to target your bottlenecks and eke out some better performance.
I think your initial diagnosis is correct, and I've never seen a profiling report that's useful once memory effects kick in.
The problem is that you're traversing the list twice, once for sequence and again for sum. In Haskell, multiple list traversals of large lists are really, really bad for performance. The solution is generally to use some type of fold, such as foldM. Your sampleMean function can be written as
{-# LANGUAGE BangPatterns #-}
sampleMean2 :: MonadRandom m => Int -> m Float -> m Float
sampleMean2 n dist = foldM (\(!a) mb -> liftM (+a) mb) 0 $ replicate n dist
for example, traversing the list only once.
You can do the same sort of thing with likelihoodWeighting as well. In order to prevent thunks, it's important to make sure that the accumulator in your fold function has appropriate strictness.

Resources