I have a gem that does some tough numbercrunching: it crops "interesting" parts out of an image. For that, I set up several algorithms. Overall, it just performs bad; which, obviously, I want to improve :).
I want to test and measure three things:
memory usage
CPU-usage
overall time spent in a method/routine.
I want to investigate this and compare the values for various algorithms, parameters and set-ups.
Is there some Ruby functionality, a gem or anything like that, which will allow me to run my code, change a few parameters or a little bit of code, run it again and then compare the results?
I have test:unit and shoulda in place already, byt the way, so if there is something that uses these testing frameworks, that is fine.
You can use the 'profiler' library that is standard. It reports the time spent in each of your methods.
require 'profiler'
def functionToBeProfiled
a = 0
1000.times do |i|
a = a + i * rand
end
end
Profiler__::start_profile
functionToBeProfiled
Profiler__::stop_profile
Profiler__::print_profile($stdout)
This will produce the following output:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
16.58 0.03 0.03 1 31.00 93.00 Integer#times
16.58 0.06 0.03 1000 0.03 0.03 Kernel.rand
8.56 0.08 0.02 3 5.33 10.33 RubyLex#identify_identifier
8.56 0.09 0.02 10 1.60 6.20 IRB::SLex::Node#match_io
8.56 0.11 0.02 84 0.19 0.19 Kernel.===
8.56 0.13 0.02 999 0.02 0.02 Float#+
8.56 0.14 0.02 6 2.67 5.33 String#gsub!
Be careful however as this library will hinder your application's performance. The measurements you get from it are only useful when compared with other measurements you obtain with the same method. They can't be used to assess absolute measurements.
I've made good experiences with ruby-prof:
http://ruby-prof.rubyforge.org/
There's also a good presentation on various ways of profiling somewhere on the web, but I can't remember title and author and have a hard time finding it right now... :-(
I was pleasantly surprised by JRuby. If your code runs on that implementation without change and you're familiar with Java benchmarking software you should take a look (and let me know how you get on).
http://www.engineyard.com/blog/2010/monitoring-memory-with-jruby-part-1-jhat-and-visualvm/
http://danlucraft.com/blog/2011/03/built-in-profiler-in-jruby-1.6/
Having now read your question more carefully I realise this doesn't afford you the capability of automating the process.
Related
I am using LogisticRegression for a classification problem with a large number of sparse features (tfidf vectors for documents to be specific) as well as a large number of classes. I noticed recently that performance seems to have dramatically worsened when upgrading to newer versions of scikit-learn. While it's hard to trace the exact origin of the performance problem, I did notice when profiling that ravel is called twice, which is taking up a large amount of the time at inference. What's interesting though, is that if I change the coef_ matrix to column-major order with np.asfortranarray, I recover the performance I am expecting. I also noticed that the problem only occurs when the input is sparse, as it is in my case.
Is there a way to change inference so that it is fastest with row-major ordering? I suspect you couldn't do this without having to transpose the input matrix to predict_proba, which would be worse since now the time doing taken doing the raveling is unbounded. Or is there some flag to tell scikit to use column-major ordering in order to have to avoid these calls during inference?
Example code below:
import scipy
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.random.rand(10_000, 10_000)
y = np.random.randint(0, 500, size=10_000)
clf = LogisticRegression(max_iter=10).fit(X, y)
%timeit clf.predict_proba(scipy.sparse.rand(1, 10_000))
# 21.9 ms ± 973 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%prun
# ncalls tottime percall cumtime percall filename:lineno(function)
# 2 0.019 0.010 0.019 0.010 {method 'ravel' of 'numpy.ndarray' objects}
# 1 0.003 0.003 0.022 0.022 _compressed.py:493(_mul_multivector)
clf.coef_ = np.asfortranarray(clf.coef_)
%timeit clf.predict_proba(scipy.sparse.rand(1, 10_000))
# 467 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%prun clf.predict_proba(scipy.sparse.rand(1, 10_000))
# ncalls tottime percall cumtime percall filename:lineno(function)
# 1 0.000 0.000 0.000 0.000 {built-in method scipy.sparse._sparsetools.csr_matvecs}
# 1 0.000 0.000 0.000 0.000 {method 'choice' of 'numpy.random.mtrand.RandomState' objects}
As you can see, converting the matrix to column-major order reduced the runtime of the ravel calls by a large margin.
Sparse matmul is handled by scipy as a.dot(b), and it needs b to be in row-major order. In this case, when you call clf.predict_proba() you're calculating p # clf.coef_.T, and clf.coef_.T is done by switching between row-major and column-major order (cause doing it that way doesn't require a copy).
If clf.coef_ is row-major order (which it will be after the model is fit), clf.coef_.T is column-major order and calling clf.predict_proba() requires it to be fully copied in memory (in this case, by .ravel()) to return it to row-major order.
When you turn clf.coef_ to column-major order with clf.coef_ = np.asfortranarray(clf.coef_), you make it so clf.coef_.T is row-major order, and .ravel() is basically a noop as it makes a view into the existing C array which doesn't have to be copied.
You have already found the most efficient workaround for this, so I don't know that there's anything else to be done. You could also just make p dense with p.A; the scipy.sparse matmul isn't terribly efficient and doesn't handle edge conditions well. This isn't a new thing and I don't know why you'd have not seen it with older sklearn.
Are there any up-to-date Prolog implementation benchmarks (with results)?
I found this on the mercury web site. Surprisingly, it shows a 20-fold gap between swi-prolog and Aquarius. I suspect that these results are pretty old. Does this gap still hold? Personally, I'd also like to see some comparisons with the occurs check turned on, since it has a major impact on performance, and some compilers might be better than others at optimizing it away.
Of more recent comparisons, I found this claim that gnu-prolog is 2x faster than SWI, and YAP is 4x faster than SWI on one specific code base.
Edit:
a specific case where the occurs check is needed for a real world problem
Sure: type inference in Haskell, OCaml, Swift or theorem provers such as this one. I also think the burden is on the programmer to prove that his code doesn't need the occurs check. Tests can only prove that you do need it, not that you don't need it.
I have some benchmark results published at:
https://logtalk.org/performance.html
Be sure to read and understand the notes at the end of that page, however.
Regarding running benchmarks with GNU Prolog, note that you cannot use the top-level interpreter as code loaded from it is interpreted, not compiled (see GNU Prolog documentation on gplc). In general, is not uncommon to see people running benchmarks from the top-level interpreter, forgetting what the word interpreter means, and publishing bogus stats where compilation/term-expansion/... steps mistakenly end up mixed with what's supposed to be benchmarked.
There's also a classical set of Prolog benchmarks that can be used for comparing Prolog implementations. Some Prolog systems include them (e.g. SWI-Prolog). They are also included in the Logtalk distribution, which allows running them with the supported backends:
https://github.com/LogtalkDotOrg/logtalk3/tree/master/examples/bench
In the current Logtalk git version, you can start it with the backend you want to benchmark and use the queries:
?- {bench(loader)}.
...
?- run.
These will run each benchmark 1000 times are reported the total time. Use run/1 for a different number of repetitions. For example, in my macOS system using SWI-Prolog 8.3.15 I get:
?- run.
boyer: 20.897818 seconds
chat_parser: 7.962188999999999 seconds
crypt: 0.14653999999999812 seconds
derive: 0.004462999999997663 seconds
divide10: 0.002300000000001745 seconds
log10: 0.0011489999999980682 seconds
meta_qsort: 0.2729539999999986 seconds
mu: 0.04534600000000211 seconds
nreverse: 0.016964000000001533 seconds
ops8: 0.0016230000000021505 seconds
poly_10: 1.9540520000000008 seconds
prover: 0.05286200000000463 seconds
qsort: 0.030829000000004214 seconds
queens_8: 2.2245050000000077 seconds
query: 0.11675499999999772 seconds
reducer: 0.00044199999999960937 seconds
sendmore: 3.048624999999994 seconds
serialise: 0.0003770000000073992 seconds
simple_analyzer: 0.8428750000000065 seconds
tak: 5.495768999999996 seconds
times10: 0.0019139999999993051 seconds
unify: 0.11229400000000567 seconds
zebra: 1.595203000000005 seconds
browse: 31.000829000000003 seconds
fast_mu: 0.04102400000000728 seconds
flatten: 0.028527999999994336 seconds
nand: 0.9632950000000022 seconds
perfect: 0.36678499999999303 seconds
true.
For SICStus Prolog 4.6.0 I get:
| ?- run.
boyer: 3.638 seconds
chat_parser: 0.7650000000000006 seconds
crypt: 0.029000000000000803 seconds
derive: 0.0009999999999994458 seconds
divide10: 0.001000000000000334 seconds
log10: 0.0009999999999994458 seconds
meta_qsort: 0.025000000000000355 seconds
mu: 0.004999999999999893 seconds
nreverse: 0.0019999999999997797 seconds
ops8: 0.001000000000000334 seconds
poly_10: 0.20500000000000007 seconds
prover: 0.005999999999999339 seconds
qsort: 0.0030000000000001137 seconds
queens_8: 0.2549999999999999 seconds
query: 0.024999999999999467 seconds
reducer: 0.001000000000000334 seconds
sendmore: 0.6079999999999997 seconds
serialise: 0.0019999999999997797 seconds
simple_analyzer: 0.09299999999999997 seconds
tak: 0.5869999999999997 seconds
times10: 0.001000000000000334 seconds
unify: 0.013000000000000789 seconds
zebra: 0.33999999999999986 seconds
browse: 4.137 seconds
fast_mu: 0.0070000000000014495 seconds
nand: 0.1280000000000001 seconds
perfect: 0.07199999999999918 seconds
yes
For GNU Prolog 1.4.5, I use the sample embedding script in logtalk3/scripts/embedding/gprolog to create an executable that includes the bench example fully compiled:
| ?- run.
boyer: 9.3459999999999983 seconds
chat_parser: 1.9610000000000003 seconds
crypt: 0.048000000000000043 seconds
derive: 0.0020000000000006679 seconds
divide10: 0.00099999999999944578 seconds
log10: 0.00099999999999944578 seconds
meta_qsort: 0.099000000000000199 seconds
mu: 0.012999999999999901 seconds
nreverse: 0.0060000000000002274 seconds
ops8: 0.00099999999999944578 seconds
poly_10: 0.72000000000000064 seconds
prover: 0.016000000000000014 seconds
qsort: 0.0080000000000008953 seconds
queens_8: 0.68599999999999994 seconds
query: 0.041999999999999815 seconds
reducer: 0.0 seconds
sendmore: 1.1070000000000011 seconds
serialise: 0.0060000000000002274 seconds
simple_analyzer: 0.25 seconds
tak: 1.3899999999999988 seconds
times10: 0.0010000000000012221 seconds
unify: 0.089999999999999858 seconds
zebra: 0.63499999999999979 seconds
browse: 10.923999999999999 seconds
fast_mu: 0.015000000000000568 seconds
(27352 ms) yes
I suggest you try these benchmarks, running them on your computer, with the Prolog systems that you want to compare. In doing that, remember that this is a limited set of benchmarks, not necessarily reflecting the actual relative performance in non-trivial applications.
Ratios:
SICStus/SWI GNU/SWI
boyer 17.4% 44.7%
browse 13.3% 35.2%
chat_parser 9.6% 24.6%
crypt 19.8% 32.8%
derive 22.4% 44.8%
divide10 43.5% 43.5%
fast_mu 17.1% 36.6%
flatten - -
log10 87.0% 87.0%
meta_qsort 9.2% 36.3%
mu 11.0% 28.7%
nand 13.3% -
nreverse 11.8% 35.4%
ops8 61.6% 61.6%
perfect 19.6% -
poly_10 10.5% 36.8%
prover 11.4% 30.3%
qsort 9.7% 25.9%
queens_8 11.5% 30.8%
query 21.4% 36.0%
reducer 226.2% 0.0%
sendmore 19.9% 36.3%
serialise 530.5% 1591.5%
simple_analyzer 11.0% 29.7%
tak 10.7% 25.3%
times10 52.2% 52.2%
unify 11.6% 80.1%
zebra 21.3% 39.8%
P.S. Be sure to use Logtalk 3.43.0 or later as it includes portability fixes for the bench example, including for GNU Prolog, and a set of basic unit tests.
I stumbled upon this comparison from 2008 in the Internet archive:
https://web.archive.org/web/20100227050426/http://www.probp.com/performance.htm
I have just read an article by Rico Mariani that concerns with performance of memory access given different locality, architecture, alignment and density.
The author built an array of varying size containing a doubly linked list with an int payload, which was shuffled to a certain percentage. He experimented with this list and found some consistent results on his machine.
Quoting one of the result table:
Pointer implementation with no changes
sizeof(int*)=4 sizeof(T)=12
shuffle 0% 1% 10% 25% 50% 100%
1000 1.99 1.99 1.99 1.99 1.99 1.99
2000 1.99 1.85 1.99 1.99 1.99 1.99
4000 1.99 2.28 2.77 2.92 3.06 3.34
8000 1.96 2.03 2.49 3.27 4.05 4.59
16000 1.97 2.04 2.67 3.57 4.57 5.16
32000 1.97 2.18 3.74 5.93 8.76 10.64
64000 1.99 2.24 3.99 5.99 6.78 7.35
128000 2.01 2.13 3.64 4.44 4.72 4.80
256000 1.98 2.27 3.14 3.35 3.30 3.31
512000 2.06 2.21 2.93 2.74 2.90 2.99
1024000 2.27 3.02 2.92 2.97 2.95 3.02
2048000 2.45 2.91 3.00 3.10 3.09 3.10
4096000 2.56 2.84 2.83 2.83 2.84 2.85
8192000 2.54 2.68 2.69 2.69 2.69 2.68
16384000 2.55 2.62 2.63 2.61 2.62 2.62
32768000 2.54 2.58 2.58 2.58 2.59 2.60
65536000 2.55 2.56 2.58 2.57 2.56 2.56
The author explains:
This is the baseline measurement. You can see the structure is a nice round 12 bytes and it will align well on x86. Looking at the first column, with no shuffling, as expected things get worse and worse as the array gets bigger until finally the cache isn't helping much and you have about the worst you're going to get, which is about 2.55ns on average per item.
But something quite strange can be seen around 32k items:
The results for shuffling are not exactly what I expected. At small sizes, it makes no difference. I expected this because basically the entire table is staying hot in the cache and so locality isn't mattering. Then as the table grows you see that shuffling has a big impact at about 32000 elements. That's 384k of data. Likely because we've blown past a 256k limit.
Now the bizarre thing is this: after this the cost of shuffling actually goes down, to the point that later on it hardly matters at all. Now I can understand that at some point shuffled or not shuffled really should make no difference because the array is so huge that runtime is largely gated by memory bandwidth regardless of order. However... there are points in the middle where the cost of non-locality is actually much worse than it will be at the endgame.
What I expected to see was that shuffling caused us to reach maximum badness sooner and stay there. What actually happens is that at middle sizes non-locality seems to cause things to go very very bad... And I do not know why :)
So the question is: What might have caused this unexpected behavior?
I have thought about this for some time, but found no good explanation. The test code looks fine to me. I don't think CPU branch prediction is the culprit in this instance, as it should be observable far earlier than 32k items, and show a far slighter spike.
I have confirmed this behavior on my box, it looks pretty much exactly the same.
I figured it might be caused by forwarding of CPU state, so I changed the order of rows and/or column generation - almost no difference in output. To make sure, I generated data for a larger continuous sample. For easy of viewing, I put it into excel:
And another independent run for good measure, negligible difference
I put my best theory here: http://blogs.msdn.com/b/ricom/archive/2014/09/28/performance-quiz-14-memory-locality-alignment-and-density-suggestions.aspx#10561107 but it's just a guess, I haven't confirmed it.
Mystery solved! From my blog:
Ryan Mon, Sep 29 2014 9:35 AM #
Wait - are you concluding that completely randomized access is the same speed as sequential for very large cases? That would be very surprising!!
What's the range of rand()? If it's 32k that would mean you're just shuffling the first 32k items and doing basically sequential reads for most items in the large case, and the per-item avg would become very close to the sequential case. This matches your data very well.
Mon, Sep 29 2014 10:57 AM #
That's exactly it!
The rand function returns a pseudorandom integer in the range 0 to RAND_MAX (32767). Use the srand function to seed the pseudorandom-number generator before calling rand.
I need a different random number generator!
I'll redo it!
I've done quite a bit of Googling but can't find an answer to this question. When you run prove on your tests (http://perldoc.perl.org/prove.html), you get some statistics that look like:
Files=3, Tests=45, 2 wallclock secs ( 0.03 usr 0.00 sys + 0.50 cusr 0.12 csys = 0.65 CPU)
What do the numbers given for usr, sys, cusr, csys and CPU mean?
Wallclock seconds is the actual elapsed time, as if you looked at your watch to time it.
The usr seconds in the time actually spent on the CPU, in user space.
The sys seconds is the time actually spent on the CPU, in kernel space.
The CPU time is the total time spent on the CPU.
I don't know what cusr and csys represents, I guess they mean children_user and children_system?
I'm using the cache_method gem, and while profiling some critical process in my app I found the following result
6.11 0.01 0.00 6.10 413/413 ActiveSupport::Cache::Strategy::LocalCache#write_entry 364
4.70 0.01 0.00 4.69 388/388 ActiveSupport::Cache::Strategy::LocalCache#delete_entry
Is it possible that for 413 cache write and 388 cache delete it takes 10 seconds?
sound way too much. Any way to improve this with some configuration options?
It's perfectly possible that these operations take so long to achieve, the first symptom is indexing, while updating your cache you are certainly updating your indexes and this is the heaviest task in caching mechanism.
You can take a look in your index configuration, and depending in its implementation you can use lazy index refresh to avoid latency in Delete and Update operations.
Cheers