How do you write (and run) a correct micro-benchmark in Java?
I'm looking for some code samples and comments illustrating various things to think about.
Example: Should the benchmark measure time/iteration or iterations/time, and why?
Related: Is stopwatch benchmarking acceptable?

Tips about writing micro benchmarks from the creators of Java HotSpot:
Rule 0: Read a reputable paper on JVMs and micro-benchmarking. A good one is Brian Goetz, 2005. Do not expect too much from micro-benchmarks; they measure only a limited range of JVM performance characteristics.
Rule 1: Always include a warmup phase which runs your test kernel all the way through, enough to trigger all initializations and compilations before timing phase(s). (Fewer iterations is OK on the warmup phase. The rule of thumb is several tens of thousands of inner loop iterations.)
Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so you can verify that the compiler and other parts of the JVM are not doing unexpected work during your timing phase.
Rule 2.1: Print messages at the beginning and end of timing and warmup phases, so you can verify that there is no output from Rule 2 during the timing phase.
Rule 3: Be aware of the difference between -client and -server, and OSR and regular compilations. The -XX:+PrintCompilation flag reports OSR compilations with an at-sign to denote the non-initial entry point, for example: Trouble$1::run @ 2 (41 bytes). Prefer server to client, and regular to OSR, if you are after best performance.
Rule 4: Be aware of initialization effects. Do not print for the first time during your timing phase, since printing loads and initializes classes. Do not load new classes outside of the warmup phase (or final reporting phase), unless you are testing class loading specifically (and in that case load only the test classes). Rule 2 is your first line of defense against such effects.
Rule 5: Be aware of deoptimization and recompilation effects. Do not take any code path for the first time in the timing phase, because the compiler may junk and recompile the code, based on an earlier optimistic assumption that the path was not going to be used at all. Rule 2 is your first line of defense against such effects.
Rule 6: Use appropriate tools to read the compiler's mind, and expect to be surprised by the code it produces. Inspect the code yourself before forming theories about what makes something faster or slower.
Rule 7: Reduce noise in your measurements. Run your benchmark on a quiet machine, and run it several times, discarding outliers. Use -Xbatch to serialize the compiler with the application, and consider setting -XX:CICompilerCount=1 to prevent the compiler from running in parallel with itself. Try your best to reduce GC overhead: set -Xmx (large enough) equal to -Xms, and use -XX:+UseEpsilonGC if it is available.
Rule 8: Use a library for your benchmark, as it is probably more efficient and has already been debugged for this sole purpose. Examples are JMH, Caliper, and Bill and Paul's Excellent UCSD Benchmarks for Java.
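As a rough illustration of Rules 1, 2.1 and 4, a hand-rolled harness might look like the sketch below. The class name ManualBenchmark, the sumKernel workload and the iteration counts are assumptions made up for this example; a library such as JMH (Rule 8) handles these concerns far more robustly.

```java
import java.util.function.LongSupplier;

final class ManualBenchmark {
    // Results are published here so the JIT cannot treat the kernel as dead code.
    static volatile long sink;

    /** Runs the kernel the given number of times and returns the elapsed nanoseconds. */
    static long timeKernel(LongSupplier kernel, int iterations) {
        long acc = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            acc += kernel.getAsLong();
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return elapsed;
    }

    /** Hypothetical test kernel: sums the integers 0..999. */
    static long sumKernel() {
        long s = 0;
        for (int i = 0; i < 1_000; i++) {
            s += i;
        }
        return s;
    }

    public static void main(String[] args) {
        // Printing happens before the timing phase, so the printing classes
        // are already loaded and initialized when timing starts (Rule 4).
        System.out.println("=== warmup start ===");
        timeKernel(ManualBenchmark::sumKernel, 50_000); // Rule 1: full warmup pass
        System.out.println("=== warmup end ===");

        // Rule 2.1: markers around the timing phase, so any compiler or GC
        // output printed between them is easy to spot.
        System.out.println("=== timing start ===");
        int iterations = 200_000;
        long nanos = timeKernel(ManualBenchmark::sumKernel, iterations);
        System.out.println("=== timing end ===");
        System.out.println((double) nanos / iterations + " ns/iteration");
    }
}
```

After compiling, this could be run as java -XX:+PrintCompilation -verbose:gc ManualBenchmark, checking that no compilation or GC lines appear between the two timing markers.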

I know this question has been marked as answered, but I wanted to mention two libraries that help us write micro-benchmarks:
Caliper from Google
Getting started tutorials
JMH from OpenJDK
Getting started tutorials
Avoiding Benchmarking Pitfalls on the JVM
Using JMH for Java Microbenchmarking
Introduction to JMH

Important things for Java benchmarks are:
Warm up the JIT first by running the code several times before timing it
Make sure you run it for long enough to be able to measure the results in seconds or (better) tens of seconds
While you can't call System.gc() between iterations, it's a good idea to run it between tests, so that each test will hopefully get a "clean" memory space to work with. (Yes, gc() is more of a hint than a guarantee, but it's very likely that it really will garbage collect in my experience.)
I like to display iterations and time, and a score of time/iteration which can be scaled such that the "best" algorithm gets a score of 1.0 and others are scored in a relative fashion. This means you can run all algorithms for a longish time, varying both number of iterations and time, but still getting comparable results.
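The relative score described above can be computed roughly as follows. This is only a sketch: the class and method names are made up, and the timings in main are invented numbers, not measurements.

```java
final class RelativeScores {
    /** Time per iteration in nanoseconds. */
    static double nanosPerIteration(long totalNanos, long iterations) {
        return (double) totalNanos / iterations;
    }

    /** Scales each time/iteration so that the fastest entry scores 1.0. */
    static double[] relativeScores(double[] nanosPerIteration) {
        double best = Double.POSITIVE_INFINITY;
        for (double v : nanosPerIteration) {
            best = Math.min(best, v);
        }
        double[] scores = new double[nanosPerIteration.length];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = nanosPerIteration[i] / best;
        }
        return scores;
    }

    public static void main(String[] args) {
        // Invented data: the two algorithms ran for different durations and
        // different iteration counts, yet the scores remain comparable.
        double a = nanosPerIteration(2_000_000_000L, 1_000_000L);
        double b = nanosPerIteration(3_000_000_000L, 500_000L);
        double[] scores = relativeScores(new double[] {a, b});
        System.out.println("A: " + scores[0] + ", B: " + scores[1]);
    }
}
```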
I'm just in the process of blogging about the design of a benchmarking framework in .NET. I've got a couple of earlier posts which may be able to give you some ideas - not everything will be appropriate, of course, but some of it may be.

jmh is a recent addition to OpenJDK and has been written by some performance engineers from Oracle. Certainly worth a look.
jmh is a Java harness for building, running, and analysing nano/micro/macro benchmarks written in Java and other languages targeting the JVM.
There are very interesting pieces of information buried in the comments of the sample tests.
See also:
Avoiding Benchmarking Pitfalls on the JVM
Discussion on the main strengths of jmh.

Should the benchmark measure time/iteration or iterations/time, and why?
It depends on what you are trying to test.
If you are interested in latency, use time/iteration and if you are interested in throughput, use iterations/time.
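Both figures come from the same two raw numbers (elapsed time and iteration count), as the small sketch below shows. The names are illustrative only, and an average latency computed this way hides the distribution of individual iteration times.

```java
final class LatencyVsThroughput {
    /** Average latency: time per iteration, in nanoseconds. */
    static double nanosPerOp(long elapsedNanos, long ops) {
        return (double) elapsedNanos / ops;
    }

    /** Throughput: iterations per second. */
    static double opsPerSecond(long elapsedNanos, long ops) {
        return ops * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) {
        long elapsedNanos = 2_000_000_000L; // invented: 2 seconds of measurement
        long ops = 4_000_000L;              // invented: iterations completed
        System.out.println(nanosPerOp(elapsedNanos, ops) + " ns/op");
        System.out.println(opsPerSecond(elapsedNanos, ops) + " ops/s");
    }
}
```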

Make sure you somehow use results which are computed in benchmarked code. Otherwise your code can be optimized away.
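A minimal sketch of the difference, with made-up names: the first method discards every result, so the JIT is free to remove the work, while the second accumulates the results and publishes them to a volatile field. Whether the first loop is actually eliminated depends on the JVM and its settings.

```java
final class UseTheResult {
    // Writing the result to a volatile field makes it observable.
    static volatile double sink;

    /** Results are discarded: the JIT may remove the computation entirely. */
    static long timeDiscarding(int iterations) {
        long start = System.nanoTime();
        for (int i = 1; i <= iterations; i++) {
            Math.sqrt(i); // value never used
        }
        return System.nanoTime() - start;
    }

    /** Results are accumulated and published, so the work has to be done. */
    static long timeConsuming(int iterations) {
        double acc = 0;
        long start = System.nanoTime();
        for (int i = 1; i <= iterations; i++) {
            acc += Math.sqrt(i);
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println("discarding: " + timeDiscarding(10_000_000) + " ns");
        System.out.println("consuming:  " + timeConsuming(10_000_000) + " ns");
    }
}
```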

If you are trying to compare two algorithms, do at least two benchmarks for each, alternating the order (for example: algorithm 1, algorithm 2, algorithm 2, algorithm 1).
I have found some noticeable differences (sometimes 5-10%) in the runtime of the same algorithm in different passes.
Also, make sure that n (the number of iterations in each loop) is very large, so that the runtime of each loop is at the very least 10 seconds or so. The more iterations, the more significant figures in your benchmark time and the more reliable that data is.
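One way to alternate the order is sketched below. The A, B, B, A sequence, the class name and the two placeholder algorithms are assumptions for illustration; in a real run n would be far larger, as described above.

```java
import java.util.ArrayList;
import java.util.List;

final class AlternatingOrder {
    static volatile long sink;

    /** Runs the algorithm n times and returns the elapsed nanoseconds. */
    static long time(Runnable alg, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            alg.run();
        }
        return System.nanoTime() - start;
    }

    /** Times the passes in the order A, B, B, A and returns that order. */
    static List<String> runAlternating(Runnable a, Runnable b, int n) {
        Runnable[] algs = {a, b, b, a};
        String[] names = {"A", "B", "B", "A"};
        List<String> order = new ArrayList<>();
        for (int i = 0; i < algs.length; i++) {
            long nanos = time(algs[i], n);
            order.add(names[i]);
            System.out.println(names[i] + ": " + nanos + " ns");
        }
        return order;
    }

    public static void main(String[] args) {
        // Placeholder algorithms; replace with the code under comparison.
        Runnable alg1 = () -> sink = Long.numberOfTrailingZeros(sink + 1);
        Runnable alg2 = () -> sink = Long.bitCount(sink + 1);
        runAlternating(alg1, alg2, 1_000_000);
    }
}
```

Comparing the two A passes (and the two B passes) with each other gives a feel for how much of the difference between A and B is just pass-to-pass variation.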

There are many possible pitfalls for writing micro-benchmarks in Java.
First: You have to account for all sorts of events that take a more or less random amount of time: garbage collection, caching effects (of the OS for files and of the CPU for memory), I/O, etc.
Second: You cannot trust the accuracy of the measured times for very short intervals.
Third: The JVM optimizes your code while executing it, so different runs in the same JVM instance will become faster and faster.
My recommendations: Make your benchmark run for some seconds; that is more reliable than a runtime of a few milliseconds. Warm up the JVM (that is, run the benchmark at least once without measuring, so that the JVM can apply its optimizations). Run your benchmark multiple times (maybe 5 times) and take the median value. Run every micro-benchmark in a new JVM instance (launch a new java process for every benchmark); otherwise optimization effects of the JVM can influence tests that run later. Don't execute things in the timed phase that weren't executed in the warmup phase (as this could trigger class loading and recompilation).
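Taking the median of several runs could look like this small sketch. The class name and the five sample timings are invented for illustration; the method assumes a non-empty array.

```java
import java.util.Arrays;

final class MedianOfRuns {
    /** Median of the samples; for an even count, the mean of the two middle values. */
    static double median(long[] samples) {
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Invented timings (ms) of five runs; one run was disturbed.
        long[] millis = {1040, 1010, 1900, 1020, 1030};
        System.out.println("median = " + median(millis) + " ms");
    }
}
```

Unlike the mean, the median is barely affected by the single disturbed run in this data.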

It should also be noted that it is important to analyze the results of the micro-benchmark when comparing different implementations. Therefore a significance test should be made.
This is because implementation A might be faster than implementation B during most runs of the benchmark, but A might also have a higher spread, so the measured performance benefit of A may not be significant when compared with B.
So it is important not only to write and run a micro-benchmark correctly, but also to analyze it correctly.
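The answer does not name a particular test. As one possible choice, the sketch below computes the statistic of Welch's t-test, which compares two sample means while allowing different spreads. The names are made up, each sample needs at least two values, and turning t into a p-value additionally requires the t distribution with Welch-Satterthwaite degrees of freedom, which is not shown here. Benchmark timings are often not normally distributed, so treat the result as a rough indication.

```java
final class WelchT {
    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double sampleVariance(double[] x) {
        double m = mean(x);
        double ss = 0;
        for (double v : x) {
            ss += (v - m) * (v - m);
        }
        return ss / (x.length - 1);
    }

    /** Welch's t statistic: difference of means over its standard error. */
    static double tStatistic(double[] a, double[] b) {
        double standardError = Math.sqrt(
                sampleVariance(a) / a.length + sampleVariance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }

    public static void main(String[] args) {
        // Invented run times (ms): A is faster on average but more spread out.
        double[] a = {95, 120, 80, 130, 90};
        double[] b = {110, 112, 109, 111, 113};
        System.out.println("t = " + tStatistic(a, b));
    }
}
```

A t value close to zero suggests that the observed difference could easily be noise.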

To add to the other excellent advice, I'd also be mindful of the following:
For some CPUs (e.g. Intel Core i5 range with TurboBoost), the temperature (and number of cores currently being used, as well as their utilisation percent) affects the clock speed. Since CPUs are dynamically clocked, this can affect your results. For example, if you have a single-threaded application, the maximum clock speed (with TurboBoost) is higher than for an application using all cores. This can therefore interfere with comparisons of single and multi-threaded performance on some systems. Bear in mind that the temperature and voltages also affect how long Turbo frequency is maintained.
Perhaps a more fundamentally important aspect that you have direct control over: make sure you're measuring the right thing! For example, if you're using System.nanoTime() to benchmark a particular bit of code, put the calls to System.nanoTime() in places that make sense, to avoid measuring things which you aren't interested in. For example, don't do:
long startTime = System.nanoTime();
//code here...
System.out.println("Code took "+(System.nanoTime()-startTime)+"nano seconds");
The problem is that you're not getting the end time immediately when the code has finished. Instead, try the following:
final long endTime, startTime = System.nanoTime();
//code here...
endTime = System.nanoTime();
System.out.println("Code took "+(endTime-startTime)+"nano seconds");

http://opt.sourceforge.net/ Java Micro Benchmark - control tasks required to determine the comparative performance characteristics of the computer system on different platforms. Can be used to guide optimization decisions and to compare different Java implementations.


How many cores do I need to use when I want to benchmark the performance of my compiler?

I reordered the compiler optimizations,
and I want to compare the performance of the output with gcc -O3.
I have a test suite.
How many cores do I need to use for the benchmark?
I'm sure that the executables they produce are different.
When I use a single core to measure their run time, the times are nearly the same.
But when I don't limit the number of cores, the executable from my compiler is faster than the one from gcc -O3.
How can I determine which compiler is better?
How many cores do I need to use when I want to benchmark the performance of my compiler?
Well, the more the merrier. Single-core as you mentioned is definitely not recommended. Since you have mentioned gcc, you have to look into GCC benchmarks.
However, in the context of the aforementioned "the more the merrier", beware of the "law of diminishing returns", as rightly put by this answer below:
In the benchmark wars the individual manufacturers will throw as many cores/processors/CPUs at the problem as they can be effective with. But there's always (except in some very weird circumstances) a "law of diminishing return" -- the second core will only add 60-80%, the third core less than that, etc. (And this assumes a problem that is sufficiently multi-threaded to actually make use of the added cores.) So you can't look at a given benchmark and assume that twice as many cores will provide twice the performance. In fact, in some cases you could double the number of cores and actually reduce performance. Achieving good performance in a highly multi-threaded application is somewhere between an art and black magic.

Measuring execution time ECLiPSe CLP (or Prolog)

How do I measure the execution time of a method in ECLiPSe CLP? Currently, I have this:
statistics(runtime, _),
statistics(runtime,[_|T]), % T
I need to write the time it took to perform a method solve_traditional(...) and write it out to a text file. However, it is not precise enough. Sometimes time will print 0.015 or 0.016 seconds for the given method, but usually it prints 0.0 seconds.
Figuring the method completes too fast, I decided to make use of statistics(runtime, ...) to measure the time it takes between two runtime calls. I could then measure for example the time it takes to complete 20 method calls and divide the measured time T by 20.
Only problem is, with 20 calls T equals either 0, 16, 32 or 48 milliseconds. Apparently, it measures the time for each method call separately and finds the sum of the execution times (which is often just 0.0s). This beats the whole purpose of measuring the runtime for N method calls and dividing the time T by N.
In short: the current methods I'm using for execution time measurements are inadequate. Is there a way to make it more precise (9 decimals for example)?
Benchmarking is a tricky business in any programming language, and particularly so in CLP. Especially if you plan to publish your results, you should be extremely thorough and make absolutely sure you are measuring what you claim to measure.
Timers: Are you measuring real time, process cpu time, thread cpu time? Including time spent in system calls? Including or excluding garbage collection? ...
See the different timers offered by the statistics/2 primitive.
There is a real-time high-resolution timer that can be accessed via statistics(hr_time,T).
Timer resolution: In your example the timer resolution seems to be 1/60 sec. That means, to get 3 significant digits in your time measurement, you have to measure at least a runtime of 1000*1/60 = 16.7 seconds.
If your benchmark runtime is too short, you have to run it multiple times.
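The arithmetic above generalizes to any timer resolution. The sketch below is written in Java purely for illustration (the names are made up, and it is not ECLiPSe code); it computes the minimum measured runtime for a desired number of significant digits and the number of repetitions needed to reach it.

```java
final class TimerResolution {
    /** Minimum total runtime so that the timer yields the requested significant digits. */
    static double minRuntimeSeconds(double resolutionSeconds, int significantDigits) {
        return Math.pow(10, significantDigits) * resolutionSeconds;
    }

    /** How often a single run must be repeated to reach the minimum runtime. */
    static long repetitionsNeeded(double minRuntimeSeconds, double singleRunSeconds) {
        return (long) Math.ceil(minRuntimeSeconds / singleRunSeconds);
    }

    public static void main(String[] args) {
        // 1/60 s resolution and 3 significant digits, as in the text above.
        double min = minRuntimeSeconds(1.0 / 60, 3);
        System.out.println("minimum runtime: about " + min + " s");
        // Assumed single-run time of 15 ms, in line with the timings in the question.
        System.out.println("repetitions: " + repetitionsNeeded(min, 0.015));
    }
}
```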
Runtime variance: On modern machines it is increasingly difficult to get reproducible timings. This is due to effects that have nothing to do with the program you are measuring, such as cache behaviour, paging, context switches, power management hardware, memory alignment, etc.
Run enough repetitions, run on a quiet machine, make sure your results are reproducible.
Repeating benchmarks: In a system like ECLiPSe, running benchmarks repeatedly must be done carefully to ensure that the successive runs really do the same computation, and ideally have same or similar cache and garbage collection behaviour.
In your code, you run the benchmark successively in a conjunction. This is not recommended because variable instantiations, delayed goals or garbage can survive from previous runs and slow down or speed up subsequent runs. As suggested above, you could use the pattern
run_n_times(N,Goal) :- \+ ( between(1,N,1,_), \+ Goal ).
which is essentially a way of repeating N times the sequence
once(Goal), fail
The point of this is that the combination of once/1 and fail undoes all of Goal's computation, so that the next iteration starts as much as possible from a similar machine state. Unfortunately, this undo-process itself adds extra runtime, which distorts the measurement...
Test overheads: If you run your benchmark several times, you need a test framework that does that for you, and this contributes to the runtime you measure.
You either have to make sure that the overhead is negligible, or you have to measure the overhead (e.g. by running the test framework with a dummy benchmark) and subtract it, for example:
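benchmark(N, DummyGoal, Goal, Time) :-
    statistics(hr_time, T1),      % hr_time: the high-resolution timer mentioned above
    run_n_times(N, DummyGoal),    % framework overhead only
    statistics(hr_time, T2),
    run_n_times(N, Goal),         % overhead plus the actual benchmark
    statistics(hr_time, T3),
    Time is (T3-T2)-(T2-T1).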
CLP specifics: There are many other considerations specific to the kind of data-driven operations that occur in CLP solvers, and which make CLP runtimes very difficult to compare. These solvers have many internal degrees of freedom regarding scheduling of propagators, degrees of pruning, tie breaking rules in search control, etc.
A paper that discusses these things specifically is:
On Benchmarking Constraint Logic Programming Platforms, by Mark Wallace, Joachim Schimpf, Kish Shen and Warwick Harvey. In CONSTRAINTS Journal, ed. E.C. Freuder,9(1), pp 5-34, Kluwer, 2004.

Effective Code Instrumentation?

All too often I read statements about some new framework and their "benchmarks." My question is a general one but to the specific points of:
What approach should a developer take to effectively instrument code to measure performance?
When reading about benchmarks and performance testing, what are some red-flags to watch out for that might not represent real results?
There are two methods of measuring performance: using code instrumentation and using sampling.
The commercial profilers (Hi-Prof, Rational Quantify, AQTime) I used in the past used code instrumentation (some of them could also use sampling) and in my experience, this gives the best, most detailed result. Rational Quantify especially allows you to zoom in on results, focus on subtrees, remove complete call trees to simulate an improvement, ...
The downside of these instrumenting profilers is that they:
tend to be slow (your code runs about 10 times slower)
take quite some time to instrument your application
don't always correctly handle exceptions in the application (in C++)
can be hard to set up if you have to disable the instrumentation of DLL's (we had to disable instrumentation for Oracle DLL's)
The instrumentation also sometimes skews the times reported for low-level functions like memory allocations, critical sections, ...
The free profilers (Very Sleepy, Luke Stackwalker) that I use use sampling, which means that it is much easier to do a quick performance test and see where the problem lies. These free profilers don't have the full functionality of the commercial profilers (although I submitted the "focus on subtree" functionality for Very Sleepy myself), but since they are fast, they can be very useful.
At this time, my personal favorite is Very Sleepy, with Luke StackWalker coming second.
In both cases (instrumenting and sampling), my experience is that:
It is very difficult to compare the results of profilers over different releases of your application. If you have a performance problem in your release 2.0, profile your release 2.0 and try to improve it, rather than looking for the exact reason why 2.0 is slower than 1.0.
You must never compare the profiling results with the timing (real time, cpu time) results of an application that is run outside the profiler. If your application consumes 5 seconds CPU time outside the profiler, and when run in the profiler the profiler reports that it consumes 10 seconds, there's nothing wrong. Don't think that your application actually takes 10 seconds.
That's why you must consistently check results in the same environment. Consistently compare results of your application when run outside the profiler, or when run inside the profiler. Don't mix the results.
Also use a consistent environment and system. If you get a faster PC, your application could still run slower, e.g. because the screen is larger and more needs to be updated on screen. If moving to a new PC, retest the last (one or two) releases of your application on the new PC so you get an idea on how times scale to the new PC.
This also means: use fixed data sets and check your improvements on these datasets. It could be that an improvement in your application improves the performance of dataset X, but makes it slower with dataset Y. In some cases this may be acceptable.
Discuss with the testing team what results you want to obtain beforehand (see Oded's answer on my own question What's the best way to 'indicate/numerate' performance of an application?).
Realize that a faster application can still use more CPU time than a slower application, if the faster one uses multi-threading and the slower one doesn't. Discuss (as said before) with the testing team what needs to be measured and what doesn't (in the multi-threading case: real time instead of CPU time).
Realize that many small improvements may lead to one big improvement. If you find 10 parts in your application that each take 3% of the time and you can reduce it to 1%, your application will be 20% faster.
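The arithmetic behind that last point, as a small sketch with made-up names: ten parts going from 3% to 1% each remove 20% of the original runtime, so the application needs about 80% of the time it needed before, which corresponds to roughly 1.25 times the original speed.

```java
final class SmallImprovements {
    /** Fraction of the original runtime left after improving several equal parts. */
    static double remainingFraction(int parts, double shareBefore, double shareAfter) {
        return 1.0 - parts * (shareBefore - shareAfter);
    }

    /** Speedup factor that corresponds to a remaining runtime fraction. */
    static double speedup(double remainingFraction) {
        return 1.0 / remainingFraction;
    }

    public static void main(String[] args) {
        double remaining = remainingFraction(10, 0.03, 0.01);
        System.out.println("remaining runtime fraction: " + remaining);
        System.out.println("speedup factor: " + speedup(remaining));
    }
}
```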
It depends what you're trying to do.
1) If you want to maintain general timing information, so you can be alert to regressions, various instrumenting profilers are the way to go. Make sure they measure all kinds of time, not just CPU time.
2) If you want to find ways to make the software faster, that is a distinctly different problem.
You should put the emphasis on the find, not on the measure.
For this, you need something that samples the call stack, not just the program counter (over multiple threads, if necessary). That rules out profilers like gprof.
Importantly, it should sample on wall-clock time, not CPU time, because you are every bit as likely to lose time due to I/O as due to crunching. This rules out some profilers.
It should be able to take samples only when you care, such as not when waiting for user input. This also rules out some profilers.
Finally, and very important, is the summary you get.
It is essential to get per-line percent of time.
The percent of time used by a line is the percent of stack samples containing the line.
Don't settle for function-only timings, even with a call graph.
This rules out still more profilers.
(Forget about "self time", and forget about invocation counts. Those are seldom useful and often misleading.)
Accuracy of finding the problems is what you're after, not accuracy of measuring them. That is a very important point. (You don't need a large number of samples, though it does no harm. The harm is in your head, making you think about measuring, rather than what is it doing.)
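The per-line percentage described above can be computed from a handful of stack samples as sketched below. How the samples are collected is left open; the class name and the "function:line" strings in main are invented for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class StackSampleSummary {
    /** For each line, the percent of stack samples that contain it at least once. */
    static Map<String, Double> percentByLine(List<List<String>> samples) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> stack : samples) {
            // A set, so a line that recurs within one stack counts only once.
            for (String line : new HashSet<>(stack)) {
                counts.merge(line, 1, Integer::sum);
            }
        }
        Map<String, Double> percents = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            percents.put(e.getKey(), 100.0 * e.getValue() / samples.size());
        }
        return percents;
    }

    public static void main(String[] args) {
        // Four invented samples; each is the list of call-stack lines at that moment.
        List<List<String>> samples = List.of(
                List.of("main:10", "load:42"),
                List.of("main:10", "parse:7"),
                List.of("main:10", "load:42"),
                List.of("main:12"));
        percentByLine(samples).forEach(
                (line, pct) -> System.out.println(line + " -> " + pct + "%"));
    }
}
```

A line that appears on a large share of the samples is responsible for roughly that share of wall-clock time, regardless of how deep in the stack it sits.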
One good tool for this is RotateRight's Zoom profiler. Personally I rely on manual sampling.

How do I get repeatable CPU-bound benchmark runtimes on Windows?

We sometimes have to run some CPU-bound tests where we want to measure runtime. The tests last on the order of a minute. The problem is that from run to run the runtime varies by quite a lot (+/- 5%). We suspect that the variation is caused by activity from other applications/services on the system, e.g.:
Applications doing housekeeping in their idle time (e.g. Visual Studio updating IntelliSense)
Filesystem indexers
What tips are there to make our benchmark timings more stable?
Currently we minimize all other applications, run the tests at "Above Normal" priority, and do not touch the machine while it runs the test.
The usual approach is to perform lots of repetitions and then discard outliers. So, if distractions such as the disk indexer only crop up once every hour or so, and you do 5-minute runs repeated for 24 hours, you'll have plenty of results where nothing got in the way. It is a good idea to plot the probability density function to make sure you understand what is going on. Also, if you are not interested in startup effects such as getting everything into the processor caches, then make sure the experiment runs long enough to make them insignificant.
First of all, if it's just about benchmarking the application itself, you should use CPU time, not wallclock time as a measure. That's then (almost) free from influences of what the other processes or the system do. Secondly, as Dickon Reed pointed out, more repetitions increase confidence.
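As an illustration of the difference between the two measures, here is a sketch in Java (an assumption of this example; the question does not say which language the tests use). It relies on the standard java.lang.management.ThreadMXBean API and only covers the current thread; the class name and the workload are made up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class CpuVsWallClock {
    static volatile long sink;

    /**
     * Runs the task on the current thread and returns {wallNanos, cpuNanos}.
     * cpuNanos is -1 if the JVM cannot report CPU time for the current thread.
     */
    static long[] measure(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuAvailable = threads.isCurrentThreadCpuTimeSupported()
                && threads.isThreadCpuTimeEnabled();
        long cpuStart = cpuAvailable ? threads.getCurrentThreadCpuTime() : 0;
        long wallStart = System.nanoTime();
        task.run();
        long wall = System.nanoTime() - wallStart;
        long cpu = cpuAvailable ? threads.getCurrentThreadCpuTime() - cpuStart : -1;
        return new long[] {wall, cpu};
    }

    public static void main(String[] args) {
        long[] result = measure(() -> {
            long s = 0;
            for (int i = 0; i < 100_000_000; i++) {
                s += i;
            }
            sink = s; // publish the result so the loop is not dead code
        });
        System.out.println("wall clock: " + result[0] + " ns");
        System.out.println("thread CPU: " + result[1] + " ns");
    }
}
```

When other processes compete for the core, the wall-clock figure grows while the thread CPU figure should stay comparatively stable.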
Quote from VC++ team blog, how they do performance tests:
To reduce noise on the benchmarking machines, we take several steps:
Stop as many services and processes as possible.
Disable network driver: this will turn off the interrupts from NIC caused by broadcast packets.
Set the test’s processor affinity to run on one processor/core only.
Set the run to high priority which will decrease the number of context switches.
Run the test for several iterations.
I do the following:
Call the method x times and measure the time
Do this n times and calculate the mean and standard deviation of those measurements
Try to get x to a point where each measurement takes more than 1 second. This will reduce the noise a bit.
The mean will tell you the average performance of your test and the standard deviation the stability of your test/measurements.
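The procedure above might be sketched like this. It is written in Java for illustration, with made-up names and a placeholder workload; the standard deviation is the sample standard deviation and needs at least two measurements.

```java
final class RepeatedMeasurements {
    static volatile long sink;

    /** Times x calls of the task, n times over; returns seconds per measurement. */
    static double[] measure(Runnable task, int x, int n) {
        double[] seconds = new double[n];
        for (int run = 0; run < n; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < x; i++) {
                task.run();
            }
            seconds[run] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    static double mean(double[] xs) {
        double sum = 0;
        for (double v : xs) {
            sum += v;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double ss = 0;
        for (double v : xs) {
            ss += (v - m) * (v - m);
        }
        return Math.sqrt(ss / (xs.length - 1));
    }

    public static void main(String[] args) {
        // Placeholder workload; increase x until one measurement exceeds 1 second.
        double[] seconds = measure(() -> sink = Long.bitCount(sink + 1), 50_000_000, 10);
        System.out.println("mean = " + mean(seconds) + " s, stddev = " + sampleStdDev(seconds) + " s");
    }
}
```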
I also set my application to a very high priority, and when I test a single-threaded algorithm I bind it to one CPU core to make sure there is no scheduling overhead.
This code demonstrates how to do this in .NET:
Thread.CurrentThread.Priority = ThreadPriority.Highest;
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;
if (Environment.ProcessorCount > 1)
    Process.GetCurrentProcess().ProcessorAffinity =
        new IntPtr(1 << (Environment.ProcessorCount - 1));
