I created a simple plotting example with Julia:
using Gadfly
draw(SVG("example.svg", 10cm, 10cm),
plot(x=rand(10), y=rand(10))
)
and ran it with time julia example.jl. It took 27 seconds to finish. Is this normal behaviour? Is it possible to speed it up?
I'm using the latest Julia 0.5.2 and up-to-date packages.
I'm not an expert, so take this with a pinch of salt, but your draw and SVG functions are compiled the first time they're run; that's why the running time is so long.
If you call the functions again, they take a lot less time. You pay a penalty to compile the function calls the first time, but all later executions are quite quick.
I amended your script to measure the time spent in the different calls:
@time using Gadfly
@time draw(SVG("example.svg", 10cm, 10cm),
    plot(x=rand(10), y=rand(10))
)
@time draw(SVG("example2.svg", 10cm, 10cm),
    plot(x=rand(10), y=rand(10))
)
Running this from the console with julia example.jl gives me the following:
$ julia example.jl
2.728577 seconds (3.32 M allocations: 141.186 MB, 10.29% gc time)
20.434172 seconds (27.48 M allocations: 1.109 GB, 1.95% gc time)
0.023084 seconds (32.59 k allocations: 1.444 MB)
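One practical consequence (a sketch, not something I've timed here): if you work from an open REPL instead of launching a fresh julia example.jl each time, you only pay the compilation cost once per session:
@time include("example.jl")   # first include pays the compilation cost
@time include("example.jl")   # second include is mostly just the plotting work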
I tried the same example with GR.jl, as suggested by @daycaster, and got 3.3 seconds on a laptop running 64-bit Windows 10:
PS C:\Users\dell\plot_example> cat plot.jl
using GR
plot(rand(10), rand(10), size = (500, 500))
savefig("plot.svg")
PS C:\Users\dell\plot_example> Measure-Command {julia plot.jl}
Days : 0
Hours : 0
Minutes : 0
Seconds : 3
Milliseconds : 382
Ticks : 33822083
TotalDays : 3.91459293981481E-05
TotalHours : 0.000939502305555556
TotalMinutes : 0.0563701383333333
TotalSeconds : 3.3822083
TotalMilliseconds : 3382.2083
Version and CPU:
PS C:\Users\dell\plot_example> julia -q
julia> VERSION
v"0.5.1"
julia> Sys.cpu_info()[]
Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz:
speed user nice sys idle irq ticks
2304 MHz 18360406 0 10161406 218911218 2123421 ticks
Example plot:
Related
In the Julia package BenchmarkTools, there are macros like @btime and @belapsed that seem redundant to me, since Julia has the built-in @time and @elapsed macros, and these macros appear to serve the same purpose. So what's the difference between @time and @btime, and between @elapsed and @belapsed?
TLDR ;)
@time and @elapsed just run the code once and measure the time. This measurement may or may not include compile time (depending on whether @time is run for the first or the second time) and includes the time needed to resolve global variables.
On the other hand, @btime and @belapsed perform a warm-up, so compile time and global-variable resolution time (if $ is used) do not affect the measurement.
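For instance, a minimal sketch of the difference (the numbers you see will differ; the point is only which costs each macro includes):
using BenchmarkTools
x = 0.5
@time sin(x)    # first call: may include compilation time
@time sin(x)    # later calls: run time only, plus global-variable overhead
@btime sin($x)  # warmed up and interpolated: reports the minimum over many runs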
Details
To understand further how this works, let's use @macroexpand (I am also stripping comment lines for readability):
julia> using MacroTools, BenchmarkTools
julia> MacroTools.striplines(@macroexpand1 @elapsed sin(x))
quote
    Experimental.@force_compile
    local var"#28#t0" = Base.time_ns()
    sin(x)
    (Base.time_ns() - var"#28#t0") / 1.0e9
end
Compilation of sin is not forced, so you get different results when running for the first time and on subsequent runs. For example:
julia> @time cos(x);
0.110512 seconds (261.97 k allocations: 12.991 MiB, 99.95% compilation time)
julia> @time cos(x);
0.000008 seconds (1 allocation: 16 bytes)
julia> @time cos(x);
0.000006 seconds (1 allocation: 16 bytes)
The situation is different with @belapsed:
julia> MacroTools.striplines(@macroexpand @belapsed sin($x))
quote
(BenchmarkTools).time((BenchmarkTools).minimum(begin
local var"##314" = begin
BenchmarkTools.generate_benchmark_definition(Main, Symbol[], Any[], [Symbol("##x#315")], (x,), $(Expr(:copyast, :($(QuoteNode(:(sin(var"##x#315"))))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), BenchmarkTools.Parameters())
end
(BenchmarkTools).warmup(var"##314")
(BenchmarkTools).tune!(var"##314")
(BenchmarkTools).run(var"##314")
end)) / 1.0e9
end
You can see that a minimum value is taken (the code is run several times).
Basically, most of the time you should use BenchmarkTools for measuring times when designing your application.
Last but not least, try @benchmark:
julia> @benchmark sin($x)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 13.714 ns … 51.151 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.814 ns ┊ GC (median): 0.00%
Time (mean ± σ): 14.089 ns ± 1.121 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇ ▂▄ ▁▂ ▃ ▁ ▂
██▆▅██▇▅▄██▃▁▃█▄▃▁▅█▆▁▄▃▅█▅▃▁▄▇▆▁▁▁▁▁▆▄▄▁▁▃▄▇▃▁▃▁▁▁▆▅▁▁▁▆▅▅ █
13.7 ns Histogram: log(frequency) by time 20 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
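As a side note on the $x interpolation used above, here is a minimal sketch of the difference it makes (assuming x is a non-constant global; timings are illustrative, not measured here):
using BenchmarkTools
x = 0.5
@btime sin(x)   # x is looked up as an untyped global on every evaluation
@btime sin($x)  # $ interpolates the value, so only sin itself is timed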
I would like to know how I can measure the memory usage of a small part of my code. Let's say I have 50 lines of code, where I take only three lines (at random) and want to find the memory being used by them.
In Python, one can use syntax like this to measure the usage:
**code**
(psutil.virtual_memory().total - psutil.virtual_memory().available)/1048/1048/1048
**code**
(psutil.virtual_memory().total - psutil.virtual_memory().available)/1048/1048/1048
**code**
I have tried using a begin ... end block, but firstly, I am not sure whether that is a good approach, and secondly, how can I extract just the memory usage using the BenchmarkTools package?
Julia:
using BenchmarkTools
**code**
@btime begin
** code **
end
**code**
How may I extract the information in such a manner?
Look forward to the suggestions!
Thanks!!
I guess one workaround would be to put the code you want to benchmark into a function and benchmark that function:
using BenchmarkTools
# code before
f() = # code to benchmark
@btime f();
# code after
To save your benchmarks you probably need to use @benchmark instead of @btime, as in, e.g.:
julia> t = @benchmark x = [sin(3.0)]
BenchmarkTools.Trial:
memory estimate: 96 bytes
allocs estimate: 1
--------------
minimum time: 26.594 ns (0.00% GC)
median time: 29.141 ns (0.00% GC)
mean time: 33.709 ns (5.34% GC)
maximum time: 1.709 μs (97.96% GC)
--------------
samples: 10000
evals/sample: 992
julia> t.allocs
1
julia> t.memory
96
julia> t.times
10000-element Vector{Float64}:
26.59375
26.616935483870968
26.617943548387096
26.66532258064516
26.691532258064516
⋮
1032.6875
1043.6219758064517
1242.3336693548388
1708.797379032258
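If you only want summary numbers rather than the raw vectors, you can reduce the fields directly; for example, with the t from above:
julia> minimum(t.times)       # fastest sample, in nanoseconds
26.59375
julia> (t.memory, t.allocs)   # bytes allocated and allocation count per evaluation
(96, 1)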
I'm trying to use TF to do some filtering. I have 60 images of size 1740 x 2340 and a Gaussian filter of size 16 x 16. I ran conv2d as
strides = [1,1,1,1]
data_ph = tf.constant(data,tf.float32)
filt_ph = tf.constant(filt,tf.float32)
data_format = 'NCHW'
conv = tf.nn.conv2d(data_ph,filt_ph,strides,'SAME',data_format=data_format)
where
data_ph = <tf.Tensor 'Const:0' shape=(60, 1, 1740, 2340) dtype=float32>
filt_ph = <tf.Tensor 'Const_1:0' shape=(16, 16, 1, 1) dtype=float32>
I tried to use placeholders instead of constants, and I also tried to use readers such as tf.FixedLengthRecordReader. I repeated the experiments a few times per run. The first run takes 12 s and subsequent runs take 4 s using constants, or 5 s using placeholders. The same experiment always takes 1.6 s in mxnet and 1.5 s in MATLAB. In all cases I'm placing the computation on the GPU, an 8 GB Quadro K5200. Is this expected (some posts mention TF being slower than other frameworks), or am I doing something wrong?
Indexing large arrays seems to be taking FAR longer in 0.5 and 0.6 than in 0.4.7.
For instance:
x = rand(10,10,100,4,4,1000) #Dummy array
tic()
r = squeeze(mean(x[:,:,1:80,:,:,56:800],(1,2,3,4,5)),(1,2,3,4,5))
toc()
Julia 0.5.0 -> elapsed time: 176.357068283 seconds
Julia 0.4.7 -> elapsed time: 1.19991952 seconds
Edit: as requested, I've updated the benchmark to use BenchmarkTools.jl and wrapped the code in a function:
using BenchmarkTools
function testf(x)
r = squeeze(mean(x[:,:,1:80,:,:,56:800],(1,2,3,4,5)),(1,2,3,4,5));
end
x = rand(10,10,100,4,4,1000) #Dummy array
@benchmark testf(x)
In 0.5.0 I get the following (with huge memory usage):
BenchmarkTools.Trial:
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 23.36 gb
allocs estimate: 1043200022
minimum time: 177.94 s (1.34% GC)
median time: 177.94 s (1.34% GC)
mean time: 177.94 s (1.34% GC)
maximum time: 177.94 s (1.34% GC)
In 0.4.7 I get:
BenchmarkTools.Trial:
samples: 11
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 727.55 mb
allocs estimate: 79
minimum time: 425.82 ms (0.06% GC)
median time: 485.95 ms (11.31% GC)
mean time: 482.67 ms (10.37% GC)
maximum time: 503.27 ms (11.22% GC)
Edit: Updated to use sub in 0.4.7 and view in 0.5.0
using BenchmarkTools
function testf(x)
r = mean(sub(x, :, :, 1:80, :, :, 56:800));
end
x = rand(10,10,100,4,4,1000) #Dummy array
@benchmark testf(x)
In 0.5.0 it ran for >20 mins and gave:
BenchmarkTools.Trial:
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 53.75 gb
allocs estimate: 2271872022
minimum time: 407.64 s (1.32% GC)
median time: 407.64 s (1.32% GC)
mean time: 407.64 s (1.32% GC)
maximum time: 407.64 s (1.32% GC)
In 0.4.7 I get:
BenchmarkTools.Trial:
samples: 5
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.28 kb
allocs estimate: 34
minimum time: 1.15 s (0.00% GC)
median time: 1.16 s (0.00% GC)
mean time: 1.16 s (0.00% GC)
maximum time: 1.18 s (0.00% GC)
This seems repeatable on other machines, so an issue has been opened: https://github.com/JuliaLang/julia/issues/19174
EDIT (17 March 2017): This regression is fixed in Julia v0.6.0. The discussion below still applies to older versions of Julia.
Try running this crude script in both Julia v0.4.7 and v0.5.0 (change sub to view):
using BenchmarkTools
function testf()
# set seed
srand(2016)
# test array
x = rand(10,10,100,4,4,1000)
# extract array view
y = sub(x, :, :, 1:80, :, :, 56:800) # julia v0.4
#y = view(x, :, :, 1:80, :, :, 56:800) # julia v0.5
# wrap mean(y) into a function
z() = mean(y)
# benchmark array mean
@time z()
@time z()
end
testf()
My machine:
julia> versioninfo()
Julia Version 0.4.7
Commit ae26b25 (2016-09-18 16:17 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
My output, Julia v0.4.7:
1.314966 seconds (246.43 k allocations: 11.589 MB)
1.017073 seconds (1 allocation: 16 bytes)
My output, Julia v0.5.0:
417.608056 seconds (2.27 G allocations: 53.749 GB, 0.75% gc time)
410.918933 seconds (2.27 G allocations: 53.747 GB, 0.72% gc time)
It would seem that you may have discovered a performance regression. Consider filing an issue.
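If you are stuck on an affected 0.5.x release in the meantime, one possible workaround (a rough, unbenchmarked sketch) is to skip the indexing/mean path entirely and accumulate over the sub-block with explicit loops:
# Rough sketch, Julia 0.5-era syntax; the ranges mirror x[:,:,1:80,:,:,56:800].
# Whether this actually dodges the regression on a given machine needs to be measured.
function block_mean(x)
    s = 0.0
    n = 0
    @inbounds for i6 in 56:800, i5 in 1:4, i4 in 1:4, i3 in 1:80, i2 in 1:10, i1 in 1:10
        s += x[i1, i2, i3, i4, i5, i6]
        n += 1
    end
    return s / n
end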
I have a piece of code that repeatedly samples from a probability distribution using sequence. Morally, it does something like this:
sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
    xs <- sequence (replicate n dist)
    return (sum xs)
Except that it's a bit more complicated. The actual code I'm interested in is the function likelihoodWeighting at this Github repo.
I noticed that the running time scales nonlinearly with n. In particular, once n exceeds a certain value it hits the memory limit, and the running time explodes. I'm not certain, but I think this is because sequence is building up a long list of thunks which aren't getting evaluated until the call to sum.
Once I get past about 100,000 samples, the program slows to a crawl. I'd like to optimize this (my feeling is that 10 million samples shouldn't be a problem) so I decided to profile it - but I'm having a little trouble understanding the output of the profiler.
Profiling
I created a short executable in a file main.hs that runs my function with 100,000 samples. Here's the output from doing
$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s
The first things I notice: it allocates nearly 1.5 GB on the heap, and spends 60% of its time on garbage collection. Is this generally indicative of too much laziness?
1,377,538,232 bytes allocated in the heap
1,195,050,032 bytes copied during GC
169,411,368 bytes maximum residency (12 sample(s))
7,360,232 bytes maximum slop
423 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2574 collections, 0 parallel, 2.40s, 2.43s elapsed
Generation 1: 12 collections, 0 parallel, 1.07s, 1.28s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.92s ( 1.94s elapsed)
GC time 3.47s ( 3.70s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.23s ( 0.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.63s ( 5.87s elapsed)
%GC time 61.8% (63.1% elapsed)
Alloc rate 716,368,278 bytes per MUT second
Productivity 34.2% of total user, 32.7% of total elapsed
Here are the results from
$ ./main +RTS -p
The first time I ran this, it turned out that there was one function being called repeatedly, and it turned out I could memoize it, which sped things up by a factor of 2. It didn't solve the space leak, however.
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 434 4 0.0 0.0 100.0 100.0
likelihoodWeighting AI.Probability.Bayes 445 1 0.0 0.3 100.0 100.0
distributionLW AI.Probability.Bayes 448 1 0.0 2.6 0.0 2.6
getSampleLW AI.Probability.Bayes 446 100000 20.0 50.4 100.0 97.1
bnProb AI.Probability.Bayes 458 400000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 457 400000 6.7 0.8 6.7 0.8
bnVals AI.Probability.Bayes 455 400000 20.0 6.3 26.7 7.1
bnParents AI.Probability.Bayes 456 400000 6.7 0.8 6.7 0.8
bnSubRef AI.Probability.Bayes 454 800000 13.3 13.5 13.3 13.5
weightedSample AI.Probability.Bayes 447 100000 26.7 23.9 33.3 25.3
bnProb AI.Probability.Bayes 453 100000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 452 100000 0.0 0.2 0.0 0.2
bnVals AI.Probability.Bayes 450 100000 0.0 0.3 6.7 0.5
bnParents AI.Probability.Bayes 451 100000 6.7 0.2 6.7 0.2
bnSubRef AI.Probability.Bayes 449 200000 0.0 0.7 0.0 0.7
Here's a heap profile. I don't know why it claims the runtime is 1.8 seconds - this run took about 6 seconds.
Can anyone help me to interpret the output of the profiler - i.e. to identify where the bottleneck is, and provide suggestions for how to speed things up?
A huge improvement has already been achieved by incorporating JohnL's suggestion of using foldM in likelihoodWeighting. That reduced memory usage about tenfold here, and brought the GC times down to almost (or actually) negligible.
A profiling run with the current source yields
probabilityIO AI.Util.Util 26.1 42.4 413 290400000
weightedSample.go AI.Probability.Bayes 16.1 19.1 255 131200080
bnParents AI.Probability.Bayes 10.8 1.2 171 8000384
bnVals AI.Probability.Bayes 10.4 7.8 164 53603072
bnCond AI.Probability.Bayes 7.9 1.2 125 8000384
ndSubRef AI.Util.Array 4.8 9.2 76 63204112
bnSubRef AI.Probability.Bayes 4.7 8.1 75 55203072
likelihoodWeighting.func AI.Probability.Bayes 3.3 2.8 53 19195128
%! AI.Util.Util 3.3 0.5 53 3200000
bnProb AI.Probability.Bayes 2.5 0.0 40 16
bnProb.p AI.Probability.Bayes 2.5 3.5 40 24001152
likelihoodWeighting AI.Probability.Bayes 2.5 2.9 39 20000264
likelihoodWeighting.func.x AI.Probability.Bayes 2.3 0.2 37 1600000
and 13MB memory usage reported by -s, ~5MB maximum residency. That's not too bad already.
Still, there remain some points we can improve. First, a relatively minor thing in the grand scheme of things: AI.Util.Array.ndSubRef:
ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])
Reversing the list and mapping (2^) over another list is inefficient; better is
ndSubRef = L.foldl' (\a d -> 2*a + d) 0
which doesn't need to keep the entire list in memory (probably not a big deal, since the lists will be short) as reversing it does, and doesn't need to allocate a second list. The reduction in allocation is noticeable, about 10%, and that part runs measurably faster,
ndSubRef AI.Util.Array 1.7 1.3 24 8000384
in the profile of the modified run, but since it takes only a small part of the overall time, the overall impact is small. There are potentially bigger fish to fry in weightedSample and likelihoodWeighting.
Let's add a bit of strictness in weightedSample to see how that changes things:
weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
    go 1.0 (M.fromList fixed) (bnVars bn)
  where
    go w assignment []     = return (assignment, w)
    go w assignment (v:vs) = if v `elem` vars
        then
            let w' = w * bnProb bn assignment (v, fixed %! v)
            in  go w' assignment vs
        else do
            let p = bnProb bn assignment (v,True)
            x <- probabilityIO p
            go w (M.insert v x assignment) vs
    vars = map fst fixed
The weight parameter of go is never forced, nor is the assignment parameter, so they can build up thunks. Let's enable {-# LANGUAGE BangPatterns #-} and force the updates to take effect immediately, and also evaluate p before passing it to probabilityIO:
go w assignment (v:vs) = if v `elem` vars
    then
        let !w' = w * bnProb bn assignment (v, fixed %! v)
        in  go w' assignment vs
    else do
        let !p = bnProb bn assignment (v,True)
        x <- probabilityIO p
        let !assignment' = M.insert v x assignment
        go w assignment' vs
That brings a further reduction in allocation (~9%) and a small speedup (~13%), but the total memory usage and maximum residency haven't changed much.
I see nothing else obvious to change there, so let's look at likelihoodWeighting:
func m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return $! x `seq` w `seq` M.adjust (+w) x m
In the last line: first, w is already evaluated in weightedSample now, so we don't need to seq it here, and the key x is required to evaluate the updated map, so seqing that isn't necessary either. The bad thing on that line is M.adjust. adjust has no way of forcing the result of the update function, so that builds thunks in the map's values. You can force evaluation of those thunks by looking up the modified value and forcing that, but Data.Map provides a much more convenient way here, since the key at which the map is updated is guaranteed to be present: insertWith':
func !m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return (M.insertWith' (+) x w m)
(Note: GHC optimises better with a bang-pattern on m than with return $! ... here). That slightly reduces the total allocation and doesn't measurably change the running time, but has a great impact on total memory used and maximum residency:
934,566,488 bytes allocated in the heap
1,441,744 bytes copied during GC
68,112 bytes maximum residency (1 sample(s))
23,272 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
The biggest improvement in running time would come from avoiding randomIO; the StdGen it uses is very slow.
I am surprised how much time the bn* functions take, but don't see any obvious inefficiency in those.
I have trouble digesting these profiles, but I have gotten my ass kicked before because the MonadRandom on Hackage is strict. Creating a lazy version of MonadRandom made my memory problems go away.
My colleague has not yet gotten permission to release the code, but I've put Control.Monad.LazyRandom online at pastebin. Or if you want to see some excerpts that explain a fully lazy random search, including infinite lists of random computations, check out Experience Report: Haskell in Computational Biology.
I put together a very elementary example, posted here: http://hpaste.org/71919. I'm not sure if it's anything like your example; it's just a very minimal thing that seemed to work.
Compiling with -prof and -fprof-auto and running with 100000 iterations yielded the following head of the profiling output (pardon my line numbers):
8 COST CENTRE MODULE %time %alloc
9
10 sample AI.Util.ProbDist 31.5 36.6
11 bnParents AI.Probability.Bayes 23.2 0.0
12 bnRank AI.Probability.Bayes 10.7 23.7
13 weightedSample.go AI.Probability.Bayes 9.6 13.4
14 bnVars AI.Probability.Bayes 8.6 16.2
15 likelihoodWeighting AI.Probability.Bayes 3.8 4.2
16 likelihoodWeighting.getSample AI.Probability.Bayes 2.1 0.7
17 sample.cumulative AI.Util.ProbDist 1.7 2.1
18 bnCond AI.Probability.Bayes 1.6 0.0
19 bnRank.ps AI.Probability.Bayes 1.1 0.0
And here are the summary statistics:
1,433,944,752 bytes allocated in the heap
1,016,435,800 bytes copied during GC
176,719,648 bytes maximum residency (11 sample(s))
1,900,232 bytes maximum slop
400 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.40s ( 1.41s elapsed)
GC time 1.08s ( 1.24s elapsed)
Total time 2.47s ( 2.65s elapsed)
%GC time 43.6% (46.8% elapsed)
Alloc rate 1,026,674,336 bytes per MUT second
Productivity 56.4% of total user, 52.6% of total elapsed
Notice that the profiler pointed its finger at sample. I forced the return in that function by using $!, and here are some summary statistics afterwards:
1,776,908,816 bytes allocated in the heap
165,232,656 bytes copied during GC
34,963,136 bytes maximum residency (7 sample(s))
483,192 bytes maximum slop
68 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.42s ( 2.44s elapsed)
GC time 0.21s ( 0.23s elapsed)
Total time 2.63s ( 2.68s elapsed)
%GC time 7.9% (8.8% elapsed)
Alloc rate 733,248,745 bytes per MUT second
Productivity 92.1% of total user, 90.4% of total elapsed
Much more productive in terms of GC, but not much change in the running time. You might be able to keep iterating in this profile/tweak fashion to target your bottlenecks and eke out some better performance.
I think your initial diagnosis is correct, and I've never seen a profiling report that's useful once memory effects kick in.
The problem is that you're traversing the list twice, once for sequence and again for sum. In Haskell, multiple list traversals of large lists are really, really bad for performance. The solution is generally to use some type of fold, such as foldM. Your sampleMean function can be written as
{-# LANGUAGE BangPatterns #-}
sampleMean2 :: MonadRandom m => Int -> m Float -> m Float
sampleMean2 n dist = foldM (\(!a) mb -> liftM (+a) mb) 0 $ replicate n dist
for example, traversing the list only once.
You can do the same sort of thing with likelihoodWeighting as well. In order to prevent thunks, it's important to make sure that the accumulator in your fold function has appropriate strictness.