inv() versus '\' in Julia for matrix inversion

I am working on a program that requires me to assign and invert a large 100 x 100 matrix M at each step of a loop that repeats roughly 1000 times.
I was originally using the inv() function, but since it is taking a lot of time, I want to optimize my program to make it run faster. Hence I wrote some dummy code as a test for what could be slowing things down:
using LinearAlgebra   # Diagonal is provided by the LinearAlgebra standard library

function test1()
    for i in 1:100
        𝐌 = Diagonal(rand(100))
        inverse_𝐌 = inv(B)
    end
end

using BenchmarkTools
@benchmark test1()
---------------------------------
BenchmarkTools.Trial:
memory estimate: 178.13 KiB
allocs estimate: 400
--------------
minimum time: 67.991 μs (0.00% GC)
median time: 71.032 μs (0.00% GC)
mean time: 89.125 μs (19.43% GC)
maximum time: 2.490 ms (96.64% GC)
--------------
samples: 10000
evals/sample: 1
When I use the '\' operator to compute the inverse:
function test2()
    for i in 1:100
        𝐌 = Diagonal(rand(100))
        inverse_𝐌 = 𝐌 \ Diagonal(ones(100))
    end
end

using BenchmarkTools
@benchmark test2()
-----------------------------------
BenchmarkTools.Trial:
memory estimate: 267.19 KiB
allocs estimate: 600
--------------
minimum time: 53.728 μs (0.00% GC)
median time: 56.955 μs (0.00% GC)
mean time: 84.430 μs (30.96% GC)
maximum time: 2.474 ms (96.95% GC)
--------------
samples: 10000
evals/sample: 1
I can see that inv() allocates less memory than the '\' operator, although '\' ends up being faster.
Is this because I am using an extra identity matrix, Diagonal(ones(100)), in test2()? Does it mean that every time the loop runs, a new portion of memory is allocated to store the identity matrix?
My original matrix 𝐌 is a tridiagonal matrix. Does inverting a matrix with such a large number of zeros cost more memory allocation? For such a matrix, what is better to use: inv(), the '\' operator, or is there some other better strategy?
P.S.: How does inverting matrices in Julia compare to other languages like C and Python? When I ran the same algorithm in my older program written in C, it took considerably less time, so I was wondering if the inv() function was the culprit here.
EDIT:
As pointed out, I had made a typo while typing in the test1() function. It is actually:
function test1()
    for i in 1:100
        𝐌 = Diagonal(rand(100))
        inverse_𝐌 = inv(𝐌)
    end
end
However, my problem stays the same: test1() allocates less memory but takes more time:
using BenchmarkTools
@benchmark test1()
BenchmarkTools.Trial:
memory estimate: 178.13 KiB
allocs estimate: 400
--------------
minimum time: 68.640 μs (0.00% GC)
median time: 71.240 μs (0.00% GC)
mean time: 90.468 μs (20.23% GC)
maximum time: 3.455 ms (97.41% GC)
samples: 10000
evals/sample: 1
using BenchmarkTools
@benchmark test2()
BenchmarkTools.Trial:
memory estimate: 267.19 KiB
allocs estimate: 600
--------------
minimum time: 54.368 μs (0.00% GC)
median time: 57.162 μs (0.00% GC)
mean time: 86.380 μs (31.68% GC)
maximum time: 3.021 ms (97.52% GC)
--------------
samples: 10000
evals/sample: 1
I also tested some other variants of the test2() function:
function test3()
    for i in 1:100
        𝐌 = Diagonal(rand(100))
        𝐈 = Diagonal(ones(100))
        inverse_𝐌 = 𝐌 \ 𝐈
    end
end

function test4(𝐈)
    for i in 1:100
        𝐌 = Diagonal(rand(100))
        inverse_𝐌 = 𝐌 \ 𝐈
    end
end

using BenchmarkTools
@benchmark test3()
BenchmarkTools.Trial:
memory estimate: 267.19 KiB
allocs estimate: 600
--------------
minimum time: 54.248 μs (0.00% GC)
median time: 57.120 μs (0.00% GC)
mean time: 86.628 μs (32.01% GC)
maximum time: 3.151 ms (97.23% GC)
--------------
samples: 10000
evals/sample: 1
using BenchmarkTools
@benchmark test4(Diagonal(ones(100)))
BenchmarkTools.Trial:
memory estimate: 179.02 KiB
allocs estimate: 402
--------------
minimum time: 48.556 μs (0.00% GC)
median time: 52.731 μs (0.00% GC)
mean time: 72.193 μs (25.48% GC)
maximum time: 3.015 ms (97.32% GC)
--------------
samples: 10000
evals/sample: 1
test2() and test3() are equivalent. I realised I could avoid the extra memory allocation in test2() by passing the identity matrix in as an argument instead, as done in test4(). It also speeds up the function.

The question you ask is tricky and the answer depends on the context. I can answer within the context you give, but if you post your real problem the answer might change.
First, the two codes are not equivalent, because in the first you use some matrix B in inv(B) that is undefined inside the function (and is probably a global, type-unstable variable). If you change B to 𝐌, the first code is actually a bit faster:
julia> function test1()
           for i in 1:100
               𝐌 = Diagonal(rand(100))
               inverse_𝐌 = inv(𝐌)
           end
       end
test1 (generic function with 1 method)

julia> function test2()
           for i in 1:100
               𝐌 = Diagonal(rand(100))
               inverse_𝐌 = 𝐌 \ Diagonal(ones(100))
           end
       end
test2 (generic function with 1 method)

julia> using BenchmarkTools

julia> @benchmark test1()
BenchmarkTools.Trial:
memory estimate: 178.13 KiB
allocs estimate: 400
--------------
minimum time: 28.273 μs (0.00% GC)
median time: 32.900 μs (0.00% GC)
mean time: 43.447 μs (14.28% GC)
maximum time: 34.779 ms (99.70% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark test2()
BenchmarkTools.Trial:
memory estimate: 267.19 KiB
allocs estimate: 600
--------------
minimum time: 28.273 μs (0.00% GC)
median time: 33.928 μs (0.00% GC)
mean time: 45.907 μs (15.25% GC)
maximum time: 34.718 ms (99.74% GC)
--------------
samples: 10000
evals/sample: 1
Now, the second thing is that your code uses diagonal matrices, and Julia is smart enough to have specialized methods for inv and \ for this kind of matrix. Their definitions are as follows:
(\)(Da::Diagonal, Db::Diagonal) = Diagonal(Da.diag .\ Db.diag)
function inv(D::Diagonal{T}) where T
    Di = similar(D.diag, typeof(inv(zero(T))))
    for i = 1:length(D.diag)
        if D.diag[i] == zero(T)
            throw(SingularException(i))
        end
        Di[i] = inv(D.diag[i])
    end
    Diagonal(Di)
end
And you can see that such an example is not fully representative of the general case (if the matrices were not diagonal, other methods would be used). You can check which methods are used like this:
julia> @which 𝐌\Diagonal(ones(100))
\(Da::Diagonal, Db::Diagonal) in LinearAlgebra at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.1\LinearAlgebra\src\diagonal.jl:493

julia> @which inv(𝐌)
inv(D::Diagonal{T,V} where V<:AbstractArray{T,1}) where T in LinearAlgebra at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.1\LinearAlgebra\src\diagonal.jl:496
and look up the code yourself.
I assume that in your real exercise you do not have diagonal matrices. In particular if you have block matrices you might have a look at https://github.com/JuliaArrays/BlockArrays.jl package, as it might have optimized methods for your use case. You might also have a look at https://github.com/JuliaMatrices/BandedMatrices.jl.
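Since the question mentions that the real matrix is tridiagonal, here is a minimal sketch (not from the original answer) of how the dedicated Tridiagonal type from the LinearAlgebra standard library could be used; the diagonal contents below are made up, and in practice it is usually better to solve 𝐌 \ b for the right-hand sides you actually need rather than forming inv(𝐌), because the inverse of a tridiagonal matrix is generally dense:
using LinearAlgebra

n = 100
# build a tridiagonal matrix from its three diagonals (illustrative values)
𝐌 = Tridiagonal(rand(n - 1), rand(n) .+ 2, rand(n - 1))

b = rand(n)
x = 𝐌 \ b    # specialized O(n) tridiagonal solve, no dense inverse formed

F = lu(𝐌)    # factorize once if 𝐌 is reused for many right-hand sides
x2 = F \ b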
In summary: you can expect Julia to heavily optimize the code for the specific use case, so the detailed specification of your problem matters for getting a definitive answer. If you share it, a more specific answer can be given.

Related

Why is operating on Float64 faster than Float16?

I wonder why operating on Float64 values is faster than operating on Float16:
julia> rnd64 = rand(Float64, 1000);
julia> rnd16 = rand(Float16, 1000);
julia> @benchmark rnd64.^2
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.800 μs … 662.140 μs ┊ GC (min … max): 0.00% … 99.37%
Time (median): 2.180 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.457 μs ± 13.176 μs ┊ GC (mean ± σ): 12.34% ± 3.89%
▁██▄▂▂▆▆▄▂▁ ▂▆▄▁ ▂▂▂▁ ▂
████████████████▇▇▆▆▇▆▅▇██▆▆▅▅▆▄▄▁▁▃▃▁▁▄▁▃▄▁▃▁▄▃▁▁▆▇██████▇ █
1.8 μs Histogram: log(frequency) by time 10.6 μs <
Memory estimate: 8.02 KiB, allocs estimate: 5.
julia> @benchmark rnd16.^2
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max): 5.117 μs … 587.133 μs ┊ GC (min … max): 0.00% … 98.61%
Time (median): 5.383 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.716 μs ± 9.987 μs ┊ GC (mean ± σ): 3.01% ± 1.71%
▃▅█▇▅▄▄▆▇▅▄▁ ▁ ▂
▄██████████████▇▆▇▆▆▇▆▇▅█▇████▇█▇▇▆▅▆▄▇▇▆█▇██▇█▇▇▇▆▇▇▆▆▆▆▄▄ █
5.12 μs Histogram: log(frequency) by time 7.48 μs <
Memory estimate: 2.14 KiB, allocs estimate: 5.
Maybe you ask why I expect the opposite: because Float16 values have less floating-point precision:
julia> rnd16[1]
Float16(0.627)
julia> rnd64[1]
0.4375452455597999
Shouldn't calculations with less precision be faster? Then why would anyone use Float16? They could even go for Float128!
As you can see, the effect you are expecting is present for Float32:
julia> rnd64 = rand(Float64, 1000);
julia> rnd32 = rand(Float32, 1000);
julia> rnd16 = rand(Float16, 1000);
julia> @btime $rnd64.^2;
  616.495 ns (1 allocation: 7.94 KiB)
julia> @btime $rnd32.^2;
  330.769 ns (1 allocation: 4.06 KiB) # faster!!
julia> @btime $rnd16.^2;
  2.067 μs (1 allocation: 2.06 KiB) # slower!!
Float64 and Float32 have hardware support on most platforms, but Float16 does not, and must therefore be implemented in software.
Note also that you should use variable interpolation ($) when micro-benchmarking. The difference is significant here, not least in terms of allocations:
julia> @btime $rnd32.^2;
  336.187 ns (1 allocation: 4.06 KiB)
julia> @btime rnd32.^2;
  930.000 ns (5 allocations: 4.14 KiB)
The short answer is that you probably shouldn't use Float16 unless you are using a GPU or an Apple CPU because (as of 2022) other processors don't have hardware support for Float16.
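One way to see the missing hardware support (a sketch; the exact output depends on your CPU and Julia version) is to compare the native code generated for a Float16 and a Float64 operation:
julia> x16, y16 = Float16(1.5), Float16(2.5);

julia> @code_native x16 * y16   # on CPUs without native FP16 this typically converts to Float32 (or calls runtime routines) before multiplying

julia> @code_native 1.5 * 2.5   # a single hardware multiply instruction for Float64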

Difference between the built-in #time macros and the ones from the benchmark module

In the Julia package BenchmarkTools, there are macros like @btime and @belapsed that seem redundant to me, since Julia has the built-in @time and @elapsed macros, and these macros appear to serve the same purpose. So what's the difference between @time and @btime, and between @elapsed and @belapsed?
TLDR ;)
@time and @elapsed just run the code once and measure the time. This measurement may or may not include compile time (depending on whether @time is run for the first or a subsequent time) and includes the time needed to resolve global variables.
On the other hand, @btime and @belapsed perform a warm-up, so compile time and global-variable resolution time (if $ is used) do not affect the time measurement.
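As a quick illustration of the difference (a sketch; g and v are made-up names and the actual numbers will vary):
julia> using BenchmarkTools

julia> g(v) = sum(sin, v);

julia> v = rand(1000);

julia> @elapsed g(v)    # first call: includes compilation of g

julia> @elapsed g(v)    # subsequent calls: run time only

julia> @belapsed g($v)  # warmed up, minimum over many samples; $ avoids global-variable overhead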
Details
To understand how this works, let's use @macroexpand (I am also stripping comment lines for readability):
julia> using MacroTools, BenchmarkTools
julia> MacroTools.striplines(@macroexpand1 @elapsed sin(x))
quote
    Experimental.@force_compile
    local var"#28#t0" = Base.time_ns()
    sin(x)
    (Base.time_ns() - var"#28#t0") / 1.0e9
end
Compilation of the measured call is not excluded from the timing, so you get different results when running it for the first time and on subsequent runs. For example:
julia> @time cos(x);
  0.110512 seconds (261.97 k allocations: 12.991 MiB, 99.95% compilation time)
julia> @time cos(x);
  0.000008 seconds (1 allocation: 16 bytes)
julia> @time cos(x);
  0.000006 seconds (1 allocation: 16 bytes)
The situation is different with @belapsed:
julia> MacroTools.striplines(@macroexpand @belapsed sin($x))
quote
(BenchmarkTools).time((BenchmarkTools).minimum(begin
local var"##314" = begin
BenchmarkTools.generate_benchmark_definition(Main, Symbol[], Any[], [Symbol("##x#315")], (x,), $(Expr(:copyast, :($(QuoteNode(:(sin(var"##x#315"))))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), BenchmarkTools.Parameters())
end
(BenchmarkTools).warmup(var"##314")
(BenchmarkTools).tune!(var"##314")
(BenchmarkTools).run(var"##314")
end)) / 1.0e9
end
You can see that a minimum value is taken (the code is run several times).
Basically, most of the time you should use BenchmarkTools for measuring run times when designing your application.
Last but not least, try @benchmark:
julia> @benchmark sin($x)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 13.714 ns … 51.151 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.814 ns ┊ GC (median): 0.00%
Time (mean ± σ): 14.089 ns ± 1.121 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇ ▂▄ ▁▂ ▃ ▁ ▂
██▆▅██▇▅▄██▃▁▃█▄▃▁▅█▆▁▄▃▅█▅▃▁▄▇▆▁▁▁▁▁▆▄▄▁▁▃▄▇▃▁▃▁▁▁▆▅▁▁▁▆▅▅ █
13.7 ns Histogram: log(frequency) by time 20 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

High GC time for simple mapreduce problem

I have a simulation program written in Julia that does something equivalent to this as part of its main loop:
# Some fake data
M = [randn(100,100) for m=1:100, n=1:100]
W = randn(100,100)
work = zip(W,M)
result = mapreduce(x -> x[1]*x[2], +, work)
In other words, a simple sum of weighted matrices. Timing the above code yields
0.691084 seconds (79.03 k allocations: 1.493 GiB, 70.59% gc time, 2.79% compilation time)
I am surprised about the large number of memory allocations, as this problem should be possible to do in-place. To see if it was my use of mapreduce that was wrong I also tested the following equivalent implementation:
@time begin
    res = zeros(100,100)
    for m=1:100
        for n=1:100
            res += W[m,n] * M[m,n]
        end
    end
end
which gave
0.442521 seconds (50.00 k allocations: 1.491 GiB, 70.81% gc time)
So, if I wrote this in C++ or Fortran it would be simple to do all of this in-place. Is this impossible in Julia? Or am I missing something here...?
It is possible to do it in place like this:
function ws(W, M)
    res = zeros(100,100)
    for m=1:100
        for n=1:100
            @. res += W[m,n] * M[m, n]
        end
    end
    return res
end
and the timing is:
julia> @time ws(W, M);
0.100328 seconds (2 allocations: 78.172 KiB)
Note that in order to perform this operation in-place I used broadcasting (I could also use loops, but it would be the same).
The problem with your code is that in the line
res += W[m,n] * M[m,n]
you get two allocations:
When you do the multiplication W[m,n] * M[m,n], a new matrix is allocated.
When you do the addition res += ..., again a matrix is allocated.
By using broadcasting with @. you perform an in-place operation; see https://docs.julialang.org/en/v1/manual/mathematical-operations/#man-dot-operators for more explanations.
Additionally, note that I have wrapped the code inside a function. If you do not, then access to both W and M is type-unstable, which also causes allocations; see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.
I'd like to add something to Bogumił's answer. The missing broadcast is the main problem, but in addition, the loop and the mapreduce variant differ in a fundamental semantic way.
The purpose of mapreduce is to reduce by an associative operation with identity element init in an unspecified order. This in particular also includes the (theoretical) option of running parts in parallel and doesn't really play well with mutation. From the docs:
The associativity of the reduction is implementation-dependent. Additionally, some implementations may reuse the return value of f for elements that appear multiple times in itr. Use mapfoldl or
mapfoldr instead for guaranteed left or right associativity and invocation of f for every value.
and
It is unspecified whether init is used for non-empty collections.
What the loop variant really corresponds to is a fold, which has a well-defined order and initial (not necessarily identity) element and can thus use an in-place reduction operator:
Like reduce, but with guaranteed left associativity. If provided, the keyword argument init will be used exactly once.
julia> @benchmark foldl((acc, (m, w)) -> (@. acc += m * w), $work; init=$(zero(W)))
BenchmarkTools.Trial: 45 samples with 1 evaluation.
Range (min … max): 109.967 ms … 118.251 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 112.639 ms ┊ GC (median): 0.00%
Time (mean ± σ): 112.862 ms ± 1.154 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▃█ ▁▄▃
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███▆███▄▁▄▁▁▄▁▁▄▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
110 ms Histogram: frequency by time 118 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mapreduce(Base.splat(*), +, $work)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 403.100 ms … 458.882 ms ┊ GC (min … max): 4.53% … 3.89%
Time (median): 445.058 ms ┊ GC (median): 4.04%
Time (mean ± σ): 440.042 ms ± 16.792 ms ┊ GC (mean ± σ): 4.21% ± 0.92%
▁ ▁ ▁ ▁ ▁ ▁ ▁▁▁ █ ▁
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁█▁█▁▁▁▁███▁▁▁▁▁█▁▁▁█ ▁
403 ms Histogram: frequency by time 459 ms <
Memory estimate: 1.49 GiB, allocs estimate: 39998.
Think of it this way: if you wrote the function as a parallel for loop with a (+) reduction, iteration would also have an unspecified order, and you would have memory overhead for the necessary copying of the individual results to the accumulating thread.
Thus, there is a trade-off. In your example, allocation/copying dominates. In other cases, the mapped operation might dominate, and a parallel reduction (with unspecified order, but copying overhead) would be worth it.
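For illustration, a minimal sketch of such a parallel reduction with one accumulator per chunk of work (ws_threaded is a made-up name, and this assumes Julia was started with several threads; the per-chunk accumulators are exactly the copying overhead mentioned above):
using Base.Threads

function ws_threaded(W, M)
    # split the columns into one contiguous chunk per available thread
    chunks = collect(Iterators.partition(axes(W, 2), cld(size(W, 2), nthreads())))
    # one accumulator per chunk; this is the extra memory a parallel reduction needs
    acc = [zeros(size(first(M))) for _ in chunks]
    @threads for c in eachindex(chunks)
        for n in chunks[c], m in axes(W, 1)
            @. acc[c] += W[m, n] * M[m, n]
        end
    end
    return reduce(+, acc)   # combine the per-chunk partial sums
end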

How to measure the memory usage for a small part of your code in Julia?

I would like to know how I can measure the memory usage of a small part of my code. Let's say I have 50 lines of code and I pick just three of them (at random) to find the memory being used by them.
In Python, one can use syntax like this to measure the usage:
**code**
(psutil.virtual_memory().total - psutil.virtual_memory().available)/1048/1048/1048
**code**
(psutil.virtual_memory().total - psutil.virtual_memory().available)/1048/1048/1048
**code**
I have tried using a begin-end block, but firstly I am not sure whether that is a good approach, and secondly, how can I extract just the memory usage using the BenchmarkTools package?
Julia:
using BenchmarkTools
**code**
@btime begin
    ** code **
end
**code**
How may I extract the information in such a manner?
Look forward to the suggestions!
Thanks!!
I guess one workaround would be to put the code you want to benchmark into a function and benchmark that function:
using BenchmarkTools
# code before
f() = # code to benchmark
@btime f();
# code after
To save your benchmarks you probably need to use @benchmark instead of @btime, as in, e.g.:
julia> t = @benchmark x = [sin(3.0)]
BenchmarkTools.Trial:
memory estimate: 96 bytes
allocs estimate: 1
--------------
minimum time: 26.594 ns (0.00% GC)
median time: 29.141 ns (0.00% GC)
mean time: 33.709 ns (5.34% GC)
maximum time: 1.709 μs (97.96% GC)
--------------
samples: 10000
evals/sample: 992
julia> t.allocs
1
julia> t.memory
96
julia> t.times
10000-element Vector{Float64}:
26.59375
26.616935483870968
26.617943548387096
26.66532258064516
26.691532258064516
⋮
1032.6875
1043.6219758064517
1242.3336693548388
1708.797379032258
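If you only care about the number of bytes allocated rather than full timing statistics, Base also provides the @allocated macro, which you can point at a small helper function (a sketch; h is a made-up name, and as with @benchmark it is best to wrap the lines in a function so global-variable access does not inflate the count):
julia> h() = begin
           x = [sin(3.0)]
           sum(x)
       end;

julia> h();            # warm up so compilation is not counted

julia> @allocated h()  # bytes allocated by this call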

Why is indexing a large matrix 170x slower in Julia 0.5.0 than 0.4.7?

Indexing large matrices seems to be taking FAR longer in 0.5 and 0.6 than in 0.4.7.
For instance:
x = rand(10,10,100,4,4,1000) #Dummy array
tic()
r = squeeze(mean(x[:,:,1:80,:,:,56:800],(1,2,3,4,5)),(1,2,3,4,5))
toc()
Julia 0.5.0 -> elapsed time: 176.357068283 seconds
Julia 0.4.7 -> elapsed time: 1.19991952 seconds
Edit: as requested, I've updated the benchmark to use BenchmarkTools.jl and wrap the code in a function:
using BenchmarkTools

function testf(x)
    r = squeeze(mean(x[:,:,1:80,:,:,56:800], (1,2,3,4,5)), (1,2,3,4,5));
end

x = rand(10,10,100,4,4,1000) # Dummy array
@benchmark testf(x)
In 0.5.0 I get the following (with huge memory usage):
BenchmarkTools.Trial:
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 23.36 gb
allocs estimate: 1043200022
minimum time: 177.94 s (1.34% GC)
median time: 177.94 s (1.34% GC)
mean time: 177.94 s (1.34% GC)
maximum time: 177.94 s (1.34% GC)
In 0.4.7 I get:
BenchmarkTools.Trial:
samples: 11
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 727.55 mb
allocs estimate: 79
minimum time: 425.82 ms (0.06% GC)
median time: 485.95 ms (11.31% GC)
mean time: 482.67 ms (10.37% GC)
maximum time: 503.27 ms (11.22% GC)
Edit: Updated to use sub in 0.4.7 and view in 0.5.0
using BenchmarkTools

function testf(x)
    r = mean(sub(x, :, :, 1:80, :, :, 56:800));
end

x = rand(10,10,100,4,4,1000) # Dummy array
@benchmark testf(x)
In 0.5.0 it ran for >20 mins and gave:
BenchmarkTools.Trial:
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 53.75 gb
allocs estimate: 2271872022
minimum time: 407.64 s (1.32% GC)
median time: 407.64 s (1.32% GC)
mean time: 407.64 s (1.32% GC)
maximum time: 407.64 s (1.32% GC)
In 0.4.7 I get:
BenchmarkTools.Trial:
samples: 5
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.28 kb
allocs estimate: 34
minimum time: 1.15 s (0.00% GC)
median time: 1.16 s (0.00% GC)
mean time: 1.16 s (0.00% GC)
maximum time: 1.18 s (0.00% GC)
This seems repeatable on other machines, so an issue has been opened: https://github.com/JuliaLang/julia/issues/19174
EDIT 17 March 2017 This regression is fixed in Julia v0.6.0. The discussion still applies if using older versions of Julia.
Try running this crude script in both Julia v0.4.7 and v0.5.0 (change sub to view):
using BenchmarkTools

function testf()
    # set seed
    srand(2016)
    # test array
    x = rand(10,10,100,4,4,1000)
    # extract array view
    y = sub(x, :, :, 1:80, :, :, 56:800)   # julia v0.4
    #y = view(x, :, :, 1:80, :, :, 56:800) # julia v0.5
    # wrap mean(y) into a function
    z() = mean(y)
    # benchmark array mean
    @time z()
    @time z()
end

testf()
My machine:
julia> versioninfo()
Julia Version 0.4.7
Commit ae26b25 (2016-09-18 16:17 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
My output, Julia v0.4.7:
1.314966 seconds (246.43 k allocations: 11.589 MB)
1.017073 seconds (1 allocation: 16 bytes)
My output, Julia v0.5.0:
417.608056 seconds (2.27 G allocations: 53.749 GB, 0.75% gc time)
410.918933 seconds (2.27 G allocations: 53.747 GB, 0.72% gc time)
It would seem that you may have discovered a performance regression. Consider filing an issue.
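For readers on current Julia (1.x), the computation from the question would be written roughly as follows (a sketch: squeeze and the tuple argument to mean were replaced by dropdims and the dims keyword, mean now lives in the Statistics standard library, and sub/view became the @view macro; testf_modern is a made-up name):
using Statistics

function testf_modern(x)
    y = @view x[:, :, 1:80, :, :, 56:800]
    dropdims(mean(y; dims=(1, 2, 3, 4, 5)); dims=(1, 2, 3, 4, 5))
end

x = rand(10, 10, 100, 4, 4, 1000)
testf_modern(x)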
