Scala 2.11.8
I'm measuring iteration through a flattened and a non-flattened iterator. I wrote the following benchmark:
import java.util.concurrent.TimeUnit

import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole

@State(Scope.Benchmark)
class SerializeBenchmark {

  var list = List(
    List("test", 12, 34, 56),
    List("test-test-test", 123, 444, 0),
    List("test-test-test-tes", 145, 443, 4333),
    List("testdsfg-test-test-tes", 3145, 435, 333),
    List("test-tessdfgsdt-tessdfgt-tes", 1455, 43, 333),
    List("tesewrt-test-tessdgdsft-tes", 13345, 4533, 3222333),
    List("ewrtes6yhgfrtyt-test-test-tes", 122245, 433444, 322233),
    List("tserfest-test-testtryfgd-tes", 143345, 43, 3122233),
    List("test-reteytest-test-tes", 1121145, 4343, 3331212),
    List("test-test-ertyeu6test-tes", 14115, 4343, 33433),
    List("test-lknlkkn;lkntest-ertyeu6test-tes", 98141115, 4343, 33433),
    List("tkknknest-test-ertyeu6test-tes", 914111215, 488343, 33433),
    List("test-test-ertyeu6test-tes", 1411125, 437743, 93433),
    List("test-test-ertyeu6testo;kn;lkn;lk-tes", 14111215, 5409343, 39823),
    List("telnlkkn;lnih98st-test-ertyeu6test-tes", 1557215, 498343, 3377433)
  )

  @Benchmark
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  @BenchmarkMode(Array(Mode.AverageTime))
  def flattenerd(bh: Blackhole): Any = {
    list.iterator.flatten.foreach(bh.consume)
  }

  @Benchmark
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  @BenchmarkMode(Array(Mode.AverageTime))
  def raw(bh: Blackhole): Any = {
    list.iterator.foreach(_.foreach(bh.consume))
  }
}
After running these benchmarks several times I got the following results:
Benchmark Mode Cnt Score Error Units
SerializeBenchmark.flattenerd avgt 5 10311,373 ± 1189,448 ns/op
SerializeBenchmark.raw avgt 5 3463,902 ± 141,145 ns/op
That's almost a 3x difference in performance, and the larger I make the source list, the bigger the difference gets. Why?
I expected some performance difference, but not 3x.
I re-ran your test with a few more iterations, running under the hs_gc profiler.
These are the results:
[info] Benchmark                                                    Mode  Cnt  Score   Error  Units
[info] IteratorFlatten.flattenerd                                   avgt   50  0.708 ± 0.120  us/op
[info] IteratorFlatten.flattenerd:·sun.gc.collector.0.invocations   avgt   50  8.840 ± 2.259      ?
[info] IteratorFlatten.raw                                          avgt   50  0.367 ± 0.014  us/op
[info] IteratorFlatten.raw:·sun.gc.collector.0.invocations          avgt   50      0              ?
IteratorFlatten.flattenerd had an average of 8 GC cycles during the test runs, whereas raw had 0. This means that the allocation noise generated by FlattenOps (the wrapper class and its methods, particularly hasNext, which allocates an iterator per list), which is what is needed in order to provide the flatten method on Iterator, costs us in running time.
If I re-run the test and give it a minimum heap size of 2G, the results get closer:
[info] Benchmark                   Mode  Cnt  Score   Error  Units
[info] IteratorFlatten.flattenerd  avgt   50  0.615 ± 0.041  us/op
[info] IteratorFlatten.raw         avgt   50  0.434 ± 0.064  us/op
The gist of it is: the more you allocate, the more work the GC has to do, the more pauses you get, and the slower the execution.
Note that these kinds of microbenchmarks are very fragile and may yield different results. Make sure you measure enough allocations for the stats to become significant.
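For reference, a run like the one above can be reproduced with an sbt-jmh invocation along these lines (the iteration counts and benchmark name pattern here are illustrative):

jmh:run -i 50 -wi 10 -f 1 -prof hs_gc .*IteratorFlatten.*

The 2G-minimum-heap variant can be obtained by adding -jvmArgs -Xms2G to the same command.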
I have a simulation program written in Julia that does something equivalent to this as part of its main loop:
# Some fake data
M = [randn(100,100) for m=1:100, n=1:100]
W = randn(100,100)
work = zip(W,M)
result = mapreduce(x -> x[1] * x[2], +, work)
In other words, a simple sum of weighted matrices. Timing the above code yields
0.691084 seconds (79.03 k allocations: 1.493 GiB, 70.59% gc time, 2.79% compilation time)
I am surprised by the large number of memory allocations, as this problem should be possible to do in-place. To see whether it was my use of mapreduce that was wrong, I also tested the following equivalent implementation:
@time begin
    res = zeros(100,100)
    for m=1:100
        for n=1:100
            res += W[m,n] * M[m,n]
        end
    end
end
which gave
0.442521 seconds (50.00 k allocations: 1.491 GiB, 70.81% gc time)
So, if I wrote this in C++ or Fortran it would be simple to do all of this in-place. Is this impossible in Julia? Or am I missing something here...?
It is possible to do it in place like this:
function ws(W, M)
    res = zeros(100,100)
    for m=1:100
        for n=1:100
            @. res += W[m,n] * M[m,n]
        end
    end
    return res
end
and the timing is:
julia> @time ws(W, M);
0.100328 seconds (2 allocations: 78.172 KiB)
Note that in order to perform this operation in-place I used broadcasting (I could also use loops, but it would be the same).
The problem with your code is that in the line:
res += W[m,n] * M[m,n]
you get two allocations:
1. the multiplication W[m,n] * M[m,n] allocates a new matrix;
2. the addition res += ... allocates another matrix.
By using broadcasting with @. you perform the operation in place; see https://docs.julialang.org/en/v1/manual/mathematical-operations/#man-dot-operators for more explanation.
Additionally, note that I have wrapped the code inside a function. If you do not, then access to both W and M (which are then globals) is type-unstable, which also causes allocations; see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.
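As a rough illustration of those two allocations (a minimal sketch of mine; the setup mirrors the question and the helper names are made up), Base's @allocated macro shows the difference between the two update styles:

W = randn(100, 100)
M = [randn(100, 100) for m = 1:100, n = 1:100]
res = zeros(100, 100)

# out-of-place update: the product allocates one matrix, the sum another
step_copy(res, W, M, m, n) = res + W[m, n] * M[m, n]
# fused in-place update: @. turns it into a single allocation-free loop
step_inplace!(res, W, M, m, n) = (@. res += W[m, n] * M[m, n])

step_copy(res, W, M, 1, 1); step_inplace!(res, W, M, 1, 1)  # compile first

@allocated step_copy(res, W, M, 1, 1)      # ~160 KiB: two temporary 100x100 matrices
@allocated step_inplace!(res, W, M, 1, 1)  # 0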
I'd like to add something to Bogumił's answer. The missing broadcast is the main problem, but in addition, the loop and the mapreduce variant differ in a fundamental semantic way.
The purpose of mapreduce is to reduce by an associative operation with identity element init, in an unspecified order. In particular, this includes the (theoretical) option of running parts in parallel, and it doesn't really play well with mutation. From the docs:
The associativity of the reduction is implementation-dependent. Additionally, some implementations may reuse the return value of f for elements that appear multiple times in itr. Use mapfoldl or mapfoldr instead for guaranteed left or right associativity and invocation of f for every value.
and
It is unspecified whether init is used for non-empty collections.
What the loop variant really corresponds to is a fold, which has a well-defined order and initial (not necessarily identity) element and can thus use an in-place reduction operator:
Like reduce, but with guaranteed left associativity. If provided, the keyword argument init will be used exactly once.
julia> @benchmark foldl((acc, (m, w)) -> (@. acc += m * w), $work; init=$(zero(W)))
BenchmarkTools.Trial: 45 samples with 1 evaluation.
Range (min … max): 109.967 ms … 118.251 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 112.639 ms ┊ GC (median): 0.00%
Time (mean ± σ): 112.862 ms ± 1.154 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▃█ ▁▄▃
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███▆███▄▁▄▁▁▄▁▁▄▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
110 ms Histogram: frequency by time 118 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mapreduce(Base.splat(*), +, $work)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 403.100 ms … 458.882 ms ┊ GC (min … max): 4.53% … 3.89%
Time (median): 445.058 ms ┊ GC (median): 4.04%
Time (mean ± σ): 440.042 ms ± 16.792 ms ┊ GC (mean ± σ): 4.21% ± 0.92%
▁ ▁ ▁ ▁ ▁ ▁ ▁▁▁ █ ▁
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁█▁█▁▁▁▁███▁▁▁▁▁█▁▁▁█ ▁
403 ms Histogram: frequency by time 459 ms <
Memory estimate: 1.49 GiB, allocs estimate: 39998.
Think of it this way: if you wrote the function as a parallel for loop with (+) reduction, iteration would also have an unspecified order, and you'd have memory overhead for the necessary copying of the individual results to the accumulating thread.
Thus, there is a trade-off. In your example, allocation/copying dominates. In other cases, the mapped operation might dominate, and parallel reduction (with unspecified order, but copying overhead) might be worth it.
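To make that copying overhead concrete, here is a minimal sketch (my own illustration, not how mapreduce is actually implemented) of such a parallel (+)-reduction over work = zip(W, M), with one private accumulator per task:

using Base.Threads

function parsum(work)
    items = vec(collect(work))  # flatten to a Vector of (weight, matrix) tuples
    chunks = collect(Iterators.partition(items, cld(length(items), nthreads())))
    partials = Vector{Matrix{Float64}}(undef, length(chunks))
    @threads for i in eachindex(chunks)
        acc = zeros(size(first(items)[2]))  # one accumulator per task
        for (w, m) in chunks[i]
            @. acc += w * m                 # in-place within the task
        end
        partials[i] = acc
    end
    return reduce(+, partials)              # combining the per-task copies
end

Each task mutates only its own accumulator, so mutation stays safe, but the partial results are exactly the extra copies referred to above.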
I would like to know how I can measure the memory usage of a small part of my code. Let's say I have 50 lines of code, and I take only three lines (at random) and want to find the memory used by them.
In Python, one can use something like this to measure the usage:
**code**
(psutil.virtual_memory().total - psutil.virtual_memory().available) / 1024 / 1024 / 1024
**code**
(psutil.virtual_memory().total - psutil.virtual_memory().available) / 1024 / 1024 / 1024
**code**
I have tried using a begin ... end block but, firstly, I am not sure whether that is a good approach, and secondly, how can I extract just the memory usage using the BenchmarkTools package?
Julia:
using BenchmarkTools
**code**
@btime begin
** code **
end
**code**
How may I extract the information in such a manner?
Look forward to the suggestions!
Thanks!!
I guess one workaround would be to put the code you want to benchmark into a function and benchmark that function:
using BenchmarkTools
# code before
f() = # code to benchmark
@btime f();
# code after
To save your benchmarks you probably need to use @benchmark instead of @btime, as in, e.g.:
julia> t = @benchmark x = [sin(3.0)]
BenchmarkTools.Trial:
memory estimate: 96 bytes
allocs estimate: 1
--------------
minimum time: 26.594 ns (0.00% GC)
median time: 29.141 ns (0.00% GC)
mean time: 33.709 ns (5.34% GC)
maximum time: 1.709 μs (97.96% GC)
--------------
samples: 10000
evals/sample: 992
julia> t.allocs
1
julia> t.memory
96
julia> t.times
10000-element Vector{Float64}:
26.59375
26.616935483870968
26.617943548387096
26.66532258064516
26.691532258064516
⋮
1032.6875
1043.6219758064517
1242.3336693548388
1708.797379032258
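As a side note, if all you want is the allocated bytes of an expression (rather than full timing statistics), Base also has an @allocated macro, which returns how many bytes the expression allocated:

julia> f() = [sin(3.0)]
f (generic function with 1 method)

julia> f();  # run once so compilation is not measured

julia> @allocated f()
96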
I have the following dummy dataframe:
import pandas
tab_size = 300000
tab_de_merde = [[i*1, 0, i*3, [i*7%3, i*11%3] ] for i in range(tab_size)]
colnames = ['Id', "Date", "Account","Value"]
indexnames = ['Id']
df = pandas.DataFrame(tab_de_merde, columns = colnames ).set_index(indexnames)
And I want to check if the column "Value" contains a 0.
I've tried 3 different solutions, and I was wondering if the third one (Python vectorization) was correctly implemented, since it doesn't seem to speed the code up:
%timeit df[[(0 in x) for x in df['Value'].values]]
#108 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[df['Value'].apply(lambda x: 0 in x)]
#86.2 ms ± 649 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def list_contains_element(np_array): return [(0 in x) for x in np_array]
%timeit df[list_contains_element(df['Value'].values)]
#106 ms ± 807 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I would be very glad if someone could help me understand how to make this kind of vector manipulation faster.
Indexing large matrices seems to take FAR longer in 0.5 and 0.6 than in 0.4.7.
For instance:
x = rand(10,10,100,4,4,1000) #Dummy array
tic()
r = squeeze(mean(x[:,:,1:80,:,:,56:800],(1,2,3,4,5)),(1,2,3,4,5))
toc()
Julia 0.5.0 -> elapsed time: 176.357068283 seconds
Julia 0.4.7 -> elapsed time: 1.19991952 seconds
Edit: as per requested, I've updated the benchmark to use BenchmarkTools.jl and wrap the code in a function:
using BenchmarkTools
function testf(x)
    r = squeeze(mean(x[:,:,1:80,:,:,56:800],(1,2,3,4,5)),(1,2,3,4,5));
end
x = rand(10,10,100,4,4,1000) #Dummy array
@benchmark testf(x)
In 0.5.0 I get the following (with huge memory usage):
BenchmarkTools.Trial:
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 23.36 gb
allocs estimate: 1043200022
minimum time: 177.94 s (1.34% GC)
median time: 177.94 s (1.34% GC)
mean time: 177.94 s (1.34% GC)
maximum time: 177.94 s (1.34% GC)
In 0.4.7 I get:
BenchmarkTools.Trial:
samples: 11
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 727.55 mb
allocs estimate: 79
minimum time: 425.82 ms (0.06% GC)
median time: 485.95 ms (11.31% GC)
mean time: 482.67 ms (10.37% GC)
maximum time: 503.27 ms (11.22% GC)
Edit: Updated to use sub in 0.4.7 and view in 0.5.0
using BenchmarkTools
function testf(x)
    r = mean(sub(x, :, :, 1:80, :, :, 56:800));
end
x = rand(10,10,100,4,4,1000) #Dummy array
@benchmark testf(x)
In 0.5.0 it ran for >20 mins and gave:
BenchmarkTools.Trial:
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 53.75 gb
allocs estimate: 2271872022
minimum time: 407.64 s (1.32% GC)
median time: 407.64 s (1.32% GC)
mean time: 407.64 s (1.32% GC)
maximum time: 407.64 s (1.32% GC)
In 0.4.7 I get:
BenchmarkTools.Trial:
samples: 5
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.28 kb
allocs estimate: 34
minimum time: 1.15 s (0.00% GC)
median time: 1.16 s (0.00% GC)
mean time: 1.16 s (0.00% GC)
maximum time: 1.18 s (0.00% GC)
This seems repeatable on other machines, so an issue has been opened: https://github.com/JuliaLang/julia/issues/19174
EDIT 17 March 2017: This regression is fixed in Julia v0.6.0. The discussion below still applies to older versions of Julia.
Try running this crude script in both Julia v0.4.7 and v0.5.0 (change sub to view):
using BenchmarkTools
function testf()
    # set seed
    srand(2016)
    # test array
    x = rand(10,10,100,4,4,1000)
    # extract array view
    y = sub(x, :, :, 1:80, :, :, 56:800) # julia v0.4
    #y = view(x, :, :, 1:80, :, :, 56:800) # julia v0.5
    # wrap mean(y) into a function
    z() = mean(y)
    # benchmark array mean
    @time z()
    @time z()
end
testf()
My machine:
julia> versioninfo()
Julia Version 0.4.7
Commit ae26b25 (2016-09-18 16:17 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
My output, Julia v0.4.7:
1.314966 seconds (246.43 k allocations: 11.589 MB)
1.017073 seconds (1 allocation: 16 bytes)
My output, Julia v0.5.0:
417.608056 seconds (2.27 G allocations: 53.749 GB, 0.75% gc time)
410.918933 seconds (2.27 G allocations: 53.747 GB, 0.72% gc time)
It would seem that you may have discovered a performance regression. Consider filing an issue.
I am using http://openjdk.java.net/projects/code-tools/jmh/ for benchmarking, and I get results like:
Benchmark Mode Samples Score Score error Units
o.a.f.c.j.b.TestClass.test1 avgt 5 2372870,600 210897,743 us/op
o.a.f.c.j.b.TestClass.test2 avgt 5 2079931,850 394727,671 us/op
o.a.f.c.j.b.TestClass.test3 avgt 5 26585,818 21105,739 us/op
o.a.f.c.j.b.TestClass.test4 avgt 5 19113,230 8012,852 us/op
o.a.f.c.j.b.TestClass.test5 avgt 5 2586,413 1949,487 us/op
o.a.f.c.j.b.TestClass.test6 avgt 5 1942,963 1619,967 us/op
o.a.f.c.j.b.TestClass.test7 avgt 5 233,902 73,861 us/op
o.a.f.c.j.b.TestClass.test8 avgt 5 191,970 126,682 us/op
What does the column "Score error" exactly mean and how to interpret it?
This is the margin of error for the score. In most cases, it is half of the confidence interval. Think of it as if there were a "±" sign between "Score" and "Score error". In fact, the human-readable log shows exactly that:
Result: 1.986 ±(99.9%) 0.009 ops/ns [Average]
Statistics: (min, avg, max) = (1.984, 1.986, 1.990), stdev = 0.002
Confidence interval (99.9%): [1.977, 1.995]
# Run complete. Total time: 00:00:12
Benchmark Mode Samples Score Score error Units
o.o.j.s.HelloWorld.hello thrpt 5 1.986 0.009 ops/ns
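For intuition, assuming the interval is a plain two-sided Student-t interval over the samples (my reconstruction, not JMH's actual code, though it matches the numbers in the log above), the error can be recomputed from the statistics line, sketched here in Julia:

using Distributions

n    = 5      # samples
s    = 0.002  # stdev from the statistics line
conf = 0.999  # the 99.9% confidence level from the log

t = quantile(TDist(n - 1), 1 - (1 - conf) / 2)  # two-sided t quantile
margin = t * s / sqrt(n)  # ≈ 0.008, close to the reported 0.009 (stdev is rounded in the log)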