Java - Repeated function call reduces execution time - performance

I have this following code
public class BenchMark {
public static void main(String args[]) {
doLinear();
doLinear();
doLinear();
doLinear();
}
private static void doParallel() {
IntStream range = IntStream.range(1, 6).parallel();
long startTime = System.nanoTime();
int reduce = range
.reduce((a, item) -> a * item).getAsInt();
long endTime = System.nanoTime();
System.out.println("parallel: " +reduce + " -- Time: " + (endTime - startTime));
}
private static void doLinear() {
IntStream range = IntStream.range(1, 6);
long startTime = System.nanoTime();
int reduce = range
.reduce((a, item) -> a * item).getAsInt();
long endTime = System.nanoTime();
System.out.println("linear: " +reduce + " -- Time: " + (endTime - startTime));
}
}
I was trying to benchmark streams but came through this execution time steadily decreasing upon calling the same function again and again
Output:
linear: 120 -- Time: 57008226
linear: 120 -- Time: 23202
linear: 120 -- Time: 17192
linear: 120 -- Time: 17802
Process finished with exit code 0
There is a huge difference between first and second execution time.
I'm sure JVM might be doing some tricks behind the scenes but can anybody help me understand whats really going on there ?
Is there anyway to avoid this optimization so I can benchmark true execution time ?

I'm sure JVM might be doing some tricks behind the scenes but can anybody help me understand whats really going on there?
The massive latency of the first invocation is due to the initialization of the complete lambda runtime subsystem. You pay this only once for the whole application.
The first time your code reaches any given lambda expression, you pay for the linkage of that lambda (initialization of the invokedynamic call site).
After some iterations you'll see additional speedup due to the JIT compiler optimizing your reduction code.
Is there anyway to avoid this optimization so I can benchmark true execution time?
You are asking for a contradiction here: the "true" execution time is the one you get after warmup, when all optimizations have been applied. This is the runtime an actual application would experience. The latency of the first few runs is not relevant to the wider picture, unless you are interested in single-shot performance.
For the sake of exploration you can see how your code behaves with JIT compilation disabled: pass -Xint to the java command. There are many more flags which disable various aspects of optimization.

UPDATE: Refer #Marko's answer for an explanation of the initial latency due to lambda linkage.
The higher execution time for the first call is probably a result of the JIT effect. In short, the JIT compilation of the byte codes into native machine code occurs during the first time your method is called. The JVM then attempts further optimization by identifying frequently-called (hot) methods, and re-generate their codes for higher performance.
Is there anyway to avoid this optimization so I can benchmark true execution time ?
You can certainly account for the JVM initial warm-up by excluding the first few result. Then increase the number of repeated calls to your method in a loop of tens of thousands of iterations, and average the results.
There are a few more options that you might want to consider adding to your execution to help reduce noises as discussed in this post. There are also some good tips from this post too.

true execution time
There's no thing like "true execution time". If you need to solve this task only once, the true execution time would be the time of the first test (along with time to startup the JVM itself). In general the time spent for execution of given piece of code depends on many things:
Whether this piece of code is interpreted, JIT-compiled by C1 or C2 compiler. Note that there are not just three options. If you call one method from another, one of them might be interpreted and another might be C2-compiled.
For C2 compiler: how this code was executed previously, so what's in branch and type profile. The polluted type profile can drastically reduce the performance.
Garbage collector state: whether it interrupts the execution or not
Compilation queue: whether JIT-compiler compiles other code simultaneously (which may slow down the execution of current code)
The memory layout: how objects located in the memory, how many cache lines should be loaded to access all the necessary data.
CPU branch predictor state which depends on the previous code execution and may increase or decrease number of branch mispredictions.
And so on and so forth. So even if you measure something in the isolated benchmark, this does not mean that the speed of the same code in the production will be the same. It may differ in the order of magnitude. So before measuring something you should ask yourself why you want to measure this thing. Usually you don't care how long some part of your program is executed. What you usually care is the latency and the throughput of the whole program. So profile the whole program and optimize the slowest parts. Probably the thing you are measuring is not the slowest.

Java VM loads a class into memory first time the class is used.
So the difference between 1st and 2nd run may be caused by class loading.

Related

How to properly implement waiting of async computations?

i have some little trouble and i am asking for hint. I am on Windows platform, doing calculations in a following manner:
int input = 0;
int output; // junk bytes here
while(true) {
async_enqueue_upload(input); // completes instantly, but transfer will take 10us
async_enqueue_calculate(); // completes instantly, but computation will take 80us
async_enqueue_download(output); // completes instantly, but transfer will take 10us
sync_wait_finish(); // must wait while output is fully calculated, and there is no junk
input = process(output); // i cannot launch next step without doing it on the host.
}
I am asking about wait_finish() thing. I must wait all devices to finish, to combine all results and somehow process the data and upload a new portion, that is based on a previous computation step. I need to sync data in between each step, so i can't parallelize steps. I know, this is not quite performant case. So lets proceed to question.
I have 2 ways of checking completion, within wait_finish(). First is to put thread to sleep until it wakes up by completion event:
while( !is_completed() )
Sleep(1);
It has very low performance, because actual calculation, to say, takes 100us, and minimal Windows sheduler timestep is 1ms, so it gives unsuitable 10x lower performance.
Second way is to check completion in empty infinite loop:
while( !is_completed() )
{} // do_nothing();
It has 10x good computation performance. But it is also unsuitable solution, because it makes full cpu core utilisation usage, with absolutely useless work. How to make cpu "sleep" exactly time i needed? (Each step has equal amount of work)
How this case is usually solved, when amount of calculation time is too big for active spin-wait, but is too small compared to sheduler timestep? Also related subquestion - how to do that on linux?
Fortunately, i have succeeded in finding answer on my own. In short words - i should use linux for that.
And my investigation shows following. On windows there is hidden function in ntdll, NtDelayExecution(). It is not exposed through SDK, but can be loaded in a following manner:
static int(__stdcall *NtDelayExecution)(BOOL Alertable, PLARGE_INTEGER DelayInterval) = (int(__stdcall*)(BOOL, PLARGE_INTEGER)) GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtDelayExecution");
It allows to set sleep intervals in 100ns periods. However, even that not worked well, as shown in a following benchmark:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS); // requires Admin privellegies
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
uint64_t hpf = qpf(); // QueryPerformanceFrequency()
uint64_t s0 = qpc(); // QueryPerformanceCounter()
uint64_t n = 0;
while (1) {
sleep_precise(1); // NtDelayExecution(-1); waits one 100-nanosecond interval
auto s1 = qpc();
n++;
auto passed = s1 - s0;
if (passed >= hpf) {
std::cout << "freq=" << (n * hpf / passed) << " hz\n";
s0 = s1;
n = 0;
}
}
That yields something less than just 2000 hz loop rate, and result varies from string to string. That led me towards windows thread switching sheduler, which is totally not suited for real time tasks. And its minimum interval of 0.5ms (+overhead). Btw, does anyone knows on how to tune that value?
And next was linux question, and what does it can? So i've built custom tiny kernel 4.14 with means of buildroot, and tested that benchmark code there. I replaced qpc() to return clock_gettime() data, with CLOCK_MONOTONIC clock, and qpf() just returns number of nanoseconds in a second and sleep_precise() just called clock_nanosleep(). I was failed to find out what is the difference between CLOCK_MONOTONIC and CLOCK_REALTIME.
And i was quite surprised, getting whooping 18.4khz frequency just out of the box, and that was quite stable. While i tested several intervals, i found that i can set the loop to almost any frequency up to 18.4khz, but also that actual measured wait time results differs to 1.6 times of what i asked. For example if i ask to sleep 100 us it actually sleeps for ~160 us, giving ~6.25 khz frequency. Nothing else is going on the system, just kernel, busybox and this test. I am not an experience linux user, and i am still wondering how can i tune this to be more real-time and deterministic. Can i push that frequency maximum even more?

Testing Erlang function performance with timer

I'm testing the performance of a function in a tight loop (say 5000 iterations) using timer:tc/3:
{Duration_us, _Result} = timer:tc(M, F, [A])
This returns both the duration (in microseconds) and the result of the function. For argument's sake the duration is N microseconds.
I then perform a simple average calculation on the results of the iterations.
If I place a timer:sleep(1) function call before the timer:tc/3 call, the average duration for all the iterations is always > the average without the sleep:
timer:sleep(1),
timer:tc(M, F, [A]).
This doesn't make much sense to me as the timer:tc/3 function should be atomic and not care about anything that happened before it.
Can anyone explain this strange functionality? Is it somehow related to scheduling and reductions?
Do you mean like this:
4> foo:foo(10000).
Where:
-module(foo).
-export([foo/1, baz/1]).
foo(N) -> TL = bar(N), {TL,sum(TL)/N} .
bar(0) -> [];
bar(N) ->
timer:sleep(1),
{D,_} = timer:tc(?MODULE, baz, [1000]),
[D|bar(N-1)]
.
baz(0) -> ok;
baz(N) -> baz(N-1).
sum([]) -> 0;
sum([H|T]) -> H + sum(T).
I tried this, and it's interesting. With the sleep statement the mean time returned by timer:tc/3 is 19 to 22 microseconds, and with the sleep commented out, the average drops to 4 to 6 microseconds. Quite dramatic!
I notice there are artefacts in the timings, so events like this (these numbers being the individual microsecond timings returned by timer:tc/3) are not uncommon:
---- snip ----
5,5,5,6,5,5,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,5,4,5,5,5,5,6,5,5,
5,6,5,5,5,5,5,6,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,5,5,5,4,5,
5,5,5,6,5,5,5,6,5,5,7,8,7,8,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,
14,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,
5,5,4,5,4,5,5,4,4,5,5,4,5,5,4,4,4,4,4,5,4,5,5,4,5,5,5,4,5,5,
4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,4,5,4,5,5,4,4,4,4,5,4,
5,5,54,22,26,21,22,22,24,24,32,31,36,31,33,27,25,21,22,21,
24,21,22,22,24,21,22,21,24,21,22,22,24,21,22,21,24,21,22,21,
23,27,22,21,24,21,22,21,24,22,22,21,23,22,22,21,24,22,22,21,
24,21,22,22,24,22,22,21,24,22,22,22,24,22,22,22,24,22,22,22,
24,22,22,22,24,22,22,21,24,22,22,21,24,21,22,22,24,22,22,21,
24,21,23,21,24,22,23,21,24,21,22,22,24,21,22,22,24,21,22,22,
24,22,23,21,24,21,23,21,23,21,21,21,23,21,25,22,24,21,22,21,
24,21,22,21,24,22,21,24,22,22,21,24,22,23,21,23,21,22,21,23,
21,22,21,23,21,23,21,24,22,22,22,24,22,22,41,36,30,33,30,35,
21,23,21,25,21,23,21,24,22,22,21,23,21,22,21,24,22,22,22,24,
22,22,21,24,22,22,22,24,22,22,21,24,22,22,21,24,22,22,21,24,
22,22,21,24,21,22,22,27,22,23,21,23,21,21,21,23,21,21,21,24,
21,22,21,24,21,22,22,24,22,22,22,24,21,22,22,24,21,22,21,24,
21,23,21,23,21,22,21,23,21,23,22,24,22,22,21,24,21,22,22,24,
21,23,21,24,21,22,22,24,21,22,22,24,21,22,21,24,21,22,22,24,
22,22,22,24,22,22,21,24,22,21,21,24,21,22,22,24,21,22,22,24,
24,23,21,24,21,22,24,21,22,21,23,21,22,21,24,21,22,21,32,31,
32,21,25,21,22,22,24,46,5,5,5,5,5,4,5,5,5,5,6,5,5,5,5,5,5,4,
6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,
5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,6,4,6,5,5,5,5,5,5,4,6,5,5,5,
5,4,5,5,5,5,5,5,6,5,5,5,5,4,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,
5,5,5,4,5,5,6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,
6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,4,5,4,5,5,5,5,5,6,5,5,
5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,
---- snip ----
I assume this is the effect you are referring to, though when you say always > N, is it always, or just mostly? Not always for me anyway.
The above results extract was without the sleep. Typically when using sleep timer:tc/3 returns low times like 4 or 5 most of the time without the sleep, but sometimes big times like 22, and with the sleep in place it's usually big times like 22, with occasional batches of low times.
It's certainly not obvious why this would happen, since sleep really just means yield. I wonder if all this is not down to the CPU cache. After all, especially on a machine that's not busy, one might expect the case without the sleep to execute most of the code all in one go without it getting moved to another core, without doing so much else with the core, thus making the most out of the caches... but when you sleep, and thus yield, and come back later, the chances of cache hits might be considerably less.
Measuring performance is a complex task especially on new HW and in modern OS. There are many things which can fiddle with your result. First thing, you are not alone. It is when you measure on your desktop or notebook, there can be other processes which can interfere with your measurement including system ones. Second thing, there is HW itself. Moder CPUs have many cool features which control performance and power consumption. They can boost performance for a short time before overheat, they can boost performance when there is not work on other CPUs on the same chip or other hyper thread on the same CPU. On another hand, they can enter power saving mode when there is not enough work and CPU doesn't react fast enough to the sudden change. It is hard to tell if it is your case, but it is naive to thing previous work or lack of it can't affect your measurement. You should always take care to measure in steady state for long enough time (seconds at least) and remove as much as possible other things which could affect your measurement. (And do not forget GC in Erlang as well.)

Different running times with Python

I'm writing a very simple program to calculate the factorial of a number.
Here it is:
import time
def factorial1(n):
fattoriale = 1
while (n > 0):
fattoriale = fattoriale * n
n = n - 1
return fattoriale
start_time = time.clock()
factorial1(v)
print float(time.clock() - start_time), "seconds"
The strange point (for me) are the results in term of execution time (on a value):
1° run: 0.000301 seconds
2° run: 0.000430 seconds
3° run: 0.000278 seconds
Why do you think it's so variable?
Does it has something to do with the float type approximation?
Thanks, Gianluca
On Unix based systems time.clock returns the CPU time, not the wall-clock time.
Your program is deterministic (even the print is) and on an ideal system should always run in the same amount of time. I believe that in your tests your program was interrupted and some interrupt handler was executed or the scheduler paused your process and gave the CPU to some other process. When your process is allowed to run again the CPU cache might have been filled by the other process, so the processor needs to load your code from memory into the cache again. This takes a small amount of time - which you see in your test.
For a good quantization of how fast your program is you should consider not calling factorial1 only once but thousands of times (or with greater input values). When your program runs for multiple seconds, then scheduling effects have less (relative) impact than in your test where you only tested for less than a millisecond.
It probably has a lot to do with sharing of resources. If your program runs as a separate process, it might have to contend for other processes running on your computer at the same time which are using resources like CPU and RAM. These resources are used by other processes as well so 'acquire' them in terms of concurrent terms will take variable times especially if there are some high-priority processes running parallel to it and other things like interupts may have higher priority.
As for your idea, from what I know, the approximation process should not take variable times as it runs a deterministic algorithm. However the approximation process again may have to contend for the resources.

Process time changes for each debugging

For timing an algorithm (millisecond), I have the below code:
clock_t start = clock();
algorithm();
clock_t end = clock();
cout << float(end-start)/CLOCKS_PER_SEC*1000.0 << endl;
For each time I debug, the result changes. Could someone tell me why and how I can fix that result?
It is based on current system load. Typically, your OS will be busy with other things and this way, sometimes it will take more or less time.
Actually, execution also is dependent on a lot of other things, like how memory- cpu- and i/o-intensive it is, also again dependent on other things.
I suggest to call algorithm() in a loop, which really is a standard way of getting more repeatable results on a machine, either by a fixed count or actually using the passed time till a certain limit is reached, and then calculating the runtime as the average over the runs.
This will reduce noise and increase precision.

Measuring execution time of selected loops

I want to measure the running times of selected loops in a C program so as to see what percentage of the total time for executing the program (on linux) is spent in these loops. I should be able to specify the loops for which the performance should be measured. I have tried out several tools (vtune, hpctoolkit, oprofile) in the last few days and none of them seem to do this. They all find the performance bottlenecks and just show the time for those. Thats because these tools only store the time taken that is above a threshold (~1ms). So if one loop takes lesser time than that then its execution time won't be reported.
The basic block counting feature of gprof depends on a feature in older compilers thats not supported now.
I could manually write a simple timer using gettimeofday or something like that but for some cases it won't give accurate results. For ex:
for (i = 0; i < 1000; ++i)
{
for (j = 0; j < N; ++j)
{
//do some work here
}
}
Now here I want to measure the total time spent in the inner loop and I will have to put a call to gettimeofday inside the first loop. So gettimeofday itself will get called a 1000 times which introduces its own overhead and the result will be inaccurate.
Unless you have an in circuit emulator or break-out box around your CPU, there's no such thing as timing a single-loop or single-instruction. You need to bulk up your test runs to something that takes at least several seconds each in order to reduce error due to other things going on in the CPU, OS, etc.
If you're wanting to find out exactly how much time a particular loop takes to execute, and it takes less than, say, 1 second to execute, you're going to need to artificially increase the number of iterations in order to get a number that is above the "noise floor". You can then take that number and divide it by the number of artificially inflated iterations to get a figure that represents how long one pass through your target loop will take.
If you're wanting to compare the performance of different loop styles or techniques, the same thing holds: you're going to need to increase the number of iterations or passes through your test code in order to get a measurement in which what you're interested in dominates the time slice you're measuring.
This is true whether you're measuring performance using sub-millisecond high performance counters provided by the CPU, the system date time clock, or a wall clock to measure the elapsed time of your test.
Otherwise, you're just measuring white noise.
Typically if you want to measure the time spent in the inner loop, you'll put the time get routines outside of the outer loop and then divide by the (outer) loop count. If you expect the time of the inner loop to be relatively constant for any j, that is.
Any profiling instructions incur their own overhead, but presumably the overhead will be the same regardless of where it's inserted so "it all comes out in the wash." Presumably you're looking for spots where there are considerable differences between the runtimes of two compared processes, where a pair of function calls like this won't be an issue (since you need one at the "end" too, to get the time delta) since one routine will be 2x or more costly over the other.
Most platforms offer some sort of higher resolution timer, too, although the one we use here is hidden behind an API so that the "client" code is cross-platform. I'm sure with a little looking you can turn it up. Although even here, there's little likelihood that you'll get better than 1ms accuracy, so it's preferable to run the code several times in a row and time the whole run (then divide by the loop count, natch).
I'm glad you're looking for percentage, because that's easy to get. Just get it running. If it runs quickly, put an outer loop around it so it takes a good long time. That won't affect the percentages. While it's running, get stackshots. You can do this with Ctrl-Break in gdb, or you can use pstack or lsstack. Just look to see what percentage of stackshots display the code you care about.
Suppose the loops take some fraction of time, like 0.2 (20%) and you take N=20 samples. Then the number of samples that should show them will average 20 * 0.2 = 4, and the standard deviation of the number of samples will be sqrt(20 * 0.2 * 0.8) = sqrt(3.2) = 1.8, so if you want more precision, take more samples. (I personally think precision is overrated.)

Resources