Testing Erlang function performance with timer - performance

I'm testing the performance of a function in a tight loop (say 5000 iterations) using timer:tc/3:
{Duration_us, _Result} = timer:tc(M, F, [A])
This returns both the duration (in microseconds) and the result of the function. For argument's sake the duration is N microseconds.
I then perform a simple average calculation on the results of the iterations.
If I place a timer:sleep(1) function call before the timer:tc/3 call, the average duration for all the iterations is always > the average without the sleep:
timer:sleep(1),
timer:tc(M, F, [A]).
This doesn't make much sense to me as the timer:tc/3 function should be atomic and not care about anything that happened before it.
Can anyone explain this strange functionality? Is it somehow related to scheduling and reductions?

Do you mean like this:
4> foo:foo(10000).
Where:
-module(foo).
-export([foo/1, baz/1]).
foo(N) -> TL = bar(N), {TL,sum(TL)/N} .
bar(0) -> [];
bar(N) ->
timer:sleep(1),
{D,_} = timer:tc(?MODULE, baz, [1000]),
[D|bar(N-1)]
.
baz(0) -> ok;
baz(N) -> baz(N-1).
sum([]) -> 0;
sum([H|T]) -> H + sum(T).
I tried this, and it's interesting. With the sleep statement the mean time returned by timer:tc/3 is 19 to 22 microseconds, and with the sleep commented out, the average drops to 4 to 6 microseconds. Quite dramatic!
I notice there are artefacts in the timings, so events like this (these numbers being the individual microsecond timings returned by timer:tc/3) are not uncommon:
---- snip ----
5,5,5,6,5,5,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,5,4,5,5,5,5,6,5,5,
5,6,5,5,5,5,5,6,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,5,5,5,4,5,
5,5,5,6,5,5,5,6,5,5,7,8,7,8,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,
14,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,
5,5,4,5,4,5,5,4,4,5,5,4,5,5,4,4,4,4,4,5,4,5,5,4,5,5,5,4,5,5,
4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,4,5,4,5,5,4,4,4,4,5,4,
5,5,54,22,26,21,22,22,24,24,32,31,36,31,33,27,25,21,22,21,
24,21,22,22,24,21,22,21,24,21,22,22,24,21,22,21,24,21,22,21,
23,27,22,21,24,21,22,21,24,22,22,21,23,22,22,21,24,22,22,21,
24,21,22,22,24,22,22,21,24,22,22,22,24,22,22,22,24,22,22,22,
24,22,22,22,24,22,22,21,24,22,22,21,24,21,22,22,24,22,22,21,
24,21,23,21,24,22,23,21,24,21,22,22,24,21,22,22,24,21,22,22,
24,22,23,21,24,21,23,21,23,21,21,21,23,21,25,22,24,21,22,21,
24,21,22,21,24,22,21,24,22,22,21,24,22,23,21,23,21,22,21,23,
21,22,21,23,21,23,21,24,22,22,22,24,22,22,41,36,30,33,30,35,
21,23,21,25,21,23,21,24,22,22,21,23,21,22,21,24,22,22,22,24,
22,22,21,24,22,22,22,24,22,22,21,24,22,22,21,24,22,22,21,24,
22,22,21,24,21,22,22,27,22,23,21,23,21,21,21,23,21,21,21,24,
21,22,21,24,21,22,22,24,22,22,22,24,21,22,22,24,21,22,21,24,
21,23,21,23,21,22,21,23,21,23,22,24,22,22,21,24,21,22,22,24,
21,23,21,24,21,22,22,24,21,22,22,24,21,22,21,24,21,22,22,24,
22,22,22,24,22,22,21,24,22,21,21,24,21,22,22,24,21,22,22,24,
24,23,21,24,21,22,24,21,22,21,23,21,22,21,24,21,22,21,32,31,
32,21,25,21,22,22,24,46,5,5,5,5,5,4,5,5,5,5,6,5,5,5,5,5,5,4,
6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,
5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,6,4,6,5,5,5,5,5,5,4,6,5,5,5,
5,4,5,5,5,5,5,5,6,5,5,5,5,4,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,
5,5,5,4,5,5,6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,
6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,4,5,4,5,5,5,5,5,6,5,5,
5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,
---- snip ----
I assume this is the effect you are referring to, though when you say always > N, is it always, or just mostly? Not always for me anyway.
The above results extract was without the sleep. Typically when using sleep timer:tc/3 returns low times like 4 or 5 most of the time without the sleep, but sometimes big times like 22, and with the sleep in place it's usually big times like 22, with occasional batches of low times.
It's certainly not obvious why this would happen, since sleep really just means yield. I wonder if all this is not down to the CPU cache. After all, especially on a machine that's not busy, one might expect the case without the sleep to execute most of the code all in one go without it getting moved to another core, without doing so much else with the core, thus making the most out of the caches... but when you sleep, and thus yield, and come back later, the chances of cache hits might be considerably less.

Measuring performance is a complex task especially on new HW and in modern OS. There are many things which can fiddle with your result. First thing, you are not alone. It is when you measure on your desktop or notebook, there can be other processes which can interfere with your measurement including system ones. Second thing, there is HW itself. Moder CPUs have many cool features which control performance and power consumption. They can boost performance for a short time before overheat, they can boost performance when there is not work on other CPUs on the same chip or other hyper thread on the same CPU. On another hand, they can enter power saving mode when there is not enough work and CPU doesn't react fast enough to the sudden change. It is hard to tell if it is your case, but it is naive to thing previous work or lack of it can't affect your measurement. You should always take care to measure in steady state for long enough time (seconds at least) and remove as much as possible other things which could affect your measurement. (And do not forget GC in Erlang as well.)

Related

What could be the causes of this performance regression, and how to investigate it?

Context
I'm writing some high-performance code for ARM64 using NEON SIMD instructions, which I am trying to further optimize. I only use integer operations, no floating-point. This code is fully CPU- or memory-bound: it does not perform system calls or I/O of any kind (filesystem, networking, or anything else). The code is single-threaded by design -- any parallelism should be handled by calling the code from different CPUs with different arguments. The data working set should be small enough to fit in my CPU's L1 D-cache, and if it overflows a little, it will definitely fit in L2 with lots of space to spare.
My development environment is an Apple laptop with the M1 processor, running macOS; as such, the prime choice for a performance investigation tool is Apple's Instruments. I know VTune has some more advanced features such as top-down microarchitecture analysis, but evidently this isn't available for ARM.
The problem
I had an idea that, at a high level, works like this: a certain function f(x, y) can be broken down into two functions g() and h(). I can calculate x2 = g(x), y2 = g(y) and then h(x2, y2), obtaining the same result as f(x, y). However, it turns out that I compute f() many times with different combinations of the same input arguments. By applying all these inputs to g() and caching their outputs, I can directly call the output of h()with these cached values and save some time recomputing theg()-part of f()`.
Benchmarks
I confirmed the basic idea is sound by microbenchmarking with Google Benchmark. If f() takes 100 X (where X is some arbitrary unit of time), then each call to g() takes 14 X, and a call to h() takes 78 X. While it's longer to call g() twice then h() rather than f(), suppose I need to compute f(x, y) and f(x, z), which would ordinarily take 200 X. I can instead compute x2 = g(x), y2 = g(y) and z2 = g(z), taking 3*14 = 42 X, and then h(x2, y2) and h(x2, z2), taking 2*78 = 156 X. In total, I spend 156 + 42 = 198 X, which is already less than 200 X, and the savings would add up for larger examples, up to maximum of 22%, since this is how much less h() costs compared to f() (assuming I compute h() much more often than g()). This would represent a significant speedup for my application.
I proceeded to test this idea on a more realistic example: I have some code which does a bunch of things, plus 3 calls to f() which, among themselves, use combinations of the same 2 arguments. So, I replace 3 calls to f() by 2 calls to g() and 3 calls to h(). The benchmarks above indicate this should reduce execution time by 3*100 - 2*14 - 3*78 = 38 X. However, benchmarking the modified code shows that execution time increases by ~700 X!
I tried replacing each call to f() individually with 2 calls to g() for its arguments and a call to h(). This should increase execution time by 2*14 + 78 - 100 = 6 X, but instead, execution time increases by 230 X (not coincidentally, approximately 1/3 of 700 X).
Performance counter results using Apple Instruments
To bring some data to the discussion, I ran both codes under Apple Instruments using the CPU counters template, monitoring some performance counters I thought might be relevant.
For reference, the original code executes in 7.6 seconds (considering only number of iterations times execution time per iteration, i.e. disregarding Google Benchmark overhead), whereas the new code executes in 9.4 seconds; i.e. a difference of 1.8 seconds. Both versions use the exact same number of iterations and work on the same input, producing the same output. The code runs on the M1's performance core, which I assume is running at its maximum 3.2 GHz clock speed.
Parameter
Original code
New code
Total cycles
22,199,155,777
27,510,276,704
MAP_DISPATCH_BUBBLE
78,611,658
6,438,255,204
L1D_CACHE_MISS_LD
892,442
1,808,341
L1D_CACHE_MISS_ST
2,163,402
4,830,661
L1I_CACHE_MISS_DEMAND
2,620,793
7,698,674
INST_SIMD_ALU
79,448,291,331
78,253,076,740
INST_SIMD_LD
17,254,640,147
16,867,679,279
INST_SIMD_ST
14,169,912,790
14,029,275,120
INST_INT_ALU
4,512,600,211
4,379,585,445
INST_INT_LD
550,965,752
546,134,341
INST_INT_ST
455,541,070
455,298,056
INST_ALL
119,683,934,968
118,972,558,207
MAP_STALL_DISPATCH
6,307,551,337
5,470,291,508
SCHEDULE_UOP
116,252,941,232
113,882,670,763
MAP_REWIND
16,293,616
11,787,119
FLUSH_RESTART_OTHER_NONSPEC
58,616
90,955
FETCH_RESTART
27,417,457
28,119,690
BRANCH_MISPRED_NONSPEC
432,761
465,697
L1I_TLB_MISS_DEMAND
754,161
1,492,705
L2_TLB_MISS_INSTRUCTION
485,702
1,217,474
MMU_TABLE_WALK_INSTRUCTION
486,812
1,219,082
BRANCH_MISPRED_NONSPEC
377,750
440,382
INST_BRANCH
1,194,614,553
1,151,040,641
Instruments won't let me add all these counters to the same run, so some results are from different runs. However, since the code is fully deterministic and runs the same number of iterations, any differences between runs should be just random noise.
EDIT: playing around with Instruments, I found one performance counter that has wildly differing values between the original code and the new code, which is MAP_DISPATCH_BUBBLE. Still doing research on what it means, whether it might explain the issues I'm seeing, and how to work around this.
EDIT 2: I decided to test this code on other ARM processors I have access to (Cortex-X2 and Cortex-A72). On the Cortex-X2, both versions perform identically, and on the Cortex-A72, there was a small (~1.5%) increase in performance with the new code. So I'm more inclined than ever to believe that I hit an M1 front-end bottleneck.
Hypotheses and data analysis
Having faced previous performance problems with this code base before, some ideas sprung to mind:
Memory alignment: SIMD code is sometimes sensitive to memory alignment, particularly for memory-bound code, which I suspect my code may be. However, adding or removing __attribute__((aligned(64))) made no difference, so I don't think that's it.
D-cache misses: the new code allocates some new arrays to cache the output of g(), so it might lead to more cache misses. And indeed there are 3.6 million more L1 D-cache misses (load + store) than the original code. However, as I've mentioned at the beginning, the working set easily fits into L2. Assuming a 10-cycle L2 cache miss cost, that's only 36 million cycles. At 3.2 GHz, that's just 1.1 ms, i.e. < 0.1% of the observed performance difference.
I-cache misses: a similar situation: there's an extra 5.1 million L1 I-cache misses, but at a 10-cycle cost, we're looking at 1.6 ms, again < 0.1% of the observed performance difference.
Inlining/unrolling: I employ aggressive inlining and loop unrolling on my code, as well as LTO and unity builds, since performance is the #1 priority and code size is irrelevant (unless it affects performance via e.g. I-cache misses). I considered the possibility that the new code might be inlining/unrolling less aggressively due to the compiler hitting some kind of heuristic for maximum code size. This might result in more instructions being executed, such as compares/branches for loops, and CALL/RET and function prologues/epilogues for function call. However, the table shows that the new code executes a bit fewer instructions of each kind (as I would expect), and of course, in total (INST_ALL).
Somehow, the original code simply achieves a higher IPC, and I have no idea why. Also, to be clear: both codes perform the same operation using the same algorithm. What I did was to basically the code for f() (a bunch of function calls to other subroutines) between g() and h().
The question
This brings me to my question: what could possibly be making the new code run slower than the old code? What other performance counters could I look at in Instruments to give me insight into this issue?
Beyond answers to this specific question, I'm looking for general advice on how to approach similar problems like this in the future. I've found some books about debugging performance problems, but they generally fall into two camps. The first just describes the profiling process I'm familiar with: find out which functions take the longest to execute and optimize them. The second is represented by books like Systems Performance: Enterprise and the Cloud and The Every Computer Performance Book, and is closer to what I'm looking for. However, they look at system-level issues like I/O, kernel calls, etc.; the kind of code I write is CPU- and maybe memory-bound, with many opportunities to convert to SIMD, and no interaction with the outside world. Basically, I'd like to know how to design meaningful experiments using a profiler and CPU performance counters (cycle counters, cache misses, instructions executed by type such as ALU, memory, etc.) to solve these kinds of performance issues with my code when they arise.

Fluctuations in execution time, is this normal?

I'm trying to implement some sort of template matching which requires me calling a function more than 10 thousand times per frame. I've managed to reduce the execution time of my function to a few microseconds. However, about 1 in 5 executions will take quite longer to run. While the function usually runs in less than 20 microseconds these cases can even take 100 microseconds.
Trying to find the part of the function that has fluctuating execution time, I realized that big fluctuations appear in many parts, almost randomly. And this "ghost" time is added even in parts that take constant time. For example, iterating through a specific number of vectors and taking their dot product with a specific vector fluctuates from 3 microseconds to 20+.
All the tests I did seem to indicate that the fluctuation has nothing to do with the varying data but instead it's just random at some parts of the code. Of course I could be wrong and maybe all these parts that have fluctuations contain something that causes them. But my main question is specific and that's why I don't provide a snippet or runtime data:
Are fluctuations of execution time from 3 microseconds to 20+ microseconds for constant time functions with the same amount of data normal? Could the cpu occasionally be doing something else that is causing these ghost times?

Python 3 multiprocessing: optimal chunk size

How do I find the optimal chunk size for multiprocessing.Pool instances?
I used this before to create a generator of n sudoku objects:
processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
To measure the time, I use time.time() before the snippet above, then I initialize the pool as described, then I convert the generator into a list (list(sudokus)) to trigger generating the items (only for time measurement, I know this is nonsense in the final program), then I take the time using time.time() again and output the difference.
I observed that the chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded the first half of the process, in the end the usage goes down to 25% (on an i3 with 2 cores and hyper-threading).
If I use a smaller chunk size of int(l // (processes**2) + 1) instead, I get times of around 0.355 ms instead and the CPU load is much better distributed. It just has some small spikes down to ca. 75%, but stays high for much longer part of the process time before it goes down to 25%.
Is there an even better formula to calculate the chunk size or a otherwise better method to use the CPU most effective? Please help me to improve this multiprocessing pool's effectiveness.
This answer provides a high level overview.
Going into detais, each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as queue.Queue. Each IPC request requires a system call; due to the context switch it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So extremely pessimistically let's estimate the maximum possible cost of an IPC request at 100 μs.
You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making chunk processing time >10 ms if my numbers are right. So if each task takes say 1 μs to process, you'd want chunksize of at least 10000.
The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished -- obviously unnecessarily increasing time to completion. I suppose in most cases a delay of 10 ms is a not a big deal, so my recommendation of targeting 10 ms chunk processing time seems safe.
Another reason a large chunksize might cause problems is that preparing the input may take time, wasting workers capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again targeting the processing time of ~10 ms seems safe (assuming you don't mind startup delay of under 10 ms).
Note: the context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows - unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.
Nothing will replace the actual time measurements. I wouldn't bother with a formula and try a constant such as 1, 10, 100, 1000, 10000 instead and see what works best in your case.

Different running times with Python

I'm writing a very simple program to calculate the factorial of a number.
Here it is:
import time
def factorial1(n):
fattoriale = 1
while (n > 0):
fattoriale = fattoriale * n
n = n - 1
return fattoriale
start_time = time.clock()
factorial1(v)
print float(time.clock() - start_time), "seconds"
The strange point (for me) are the results in term of execution time (on a value):
1° run: 0.000301 seconds
2° run: 0.000430 seconds
3° run: 0.000278 seconds
Why do you think it's so variable?
Does it has something to do with the float type approximation?
Thanks, Gianluca
On Unix based systems time.clock returns the CPU time, not the wall-clock time.
Your program is deterministic (even the print is) and on an ideal system should always run in the same amount of time. I believe that in your tests your program was interrupted and some interrupt handler was executed or the scheduler paused your process and gave the CPU to some other process. When your process is allowed to run again the CPU cache might have been filled by the other process, so the processor needs to load your code from memory into the cache again. This takes a small amount of time - which you see in your test.
For a good quantization of how fast your program is you should consider not calling factorial1 only once but thousands of times (or with greater input values). When your program runs for multiple seconds, then scheduling effects have less (relative) impact than in your test where you only tested for less than a millisecond.
It probably has a lot to do with sharing of resources. If your program runs as a separate process, it might have to contend for other processes running on your computer at the same time which are using resources like CPU and RAM. These resources are used by other processes as well so 'acquire' them in terms of concurrent terms will take variable times especially if there are some high-priority processes running parallel to it and other things like interupts may have higher priority.
As for your idea, from what I know, the approximation process should not take variable times as it runs a deterministic algorithm. However the approximation process again may have to contend for the resources.

Measuring execution time of selected loops

I want to measure the running times of selected loops in a C program so as to see what percentage of the total time for executing the program (on linux) is spent in these loops. I should be able to specify the loops for which the performance should be measured. I have tried out several tools (vtune, hpctoolkit, oprofile) in the last few days and none of them seem to do this. They all find the performance bottlenecks and just show the time for those. Thats because these tools only store the time taken that is above a threshold (~1ms). So if one loop takes lesser time than that then its execution time won't be reported.
The basic block counting feature of gprof depends on a feature in older compilers thats not supported now.
I could manually write a simple timer using gettimeofday or something like that but for some cases it won't give accurate results. For ex:
for (i = 0; i < 1000; ++i)
{
for (j = 0; j < N; ++j)
{
//do some work here
}
}
Now here I want to measure the total time spent in the inner loop and I will have to put a call to gettimeofday inside the first loop. So gettimeofday itself will get called a 1000 times which introduces its own overhead and the result will be inaccurate.
Unless you have an in circuit emulator or break-out box around your CPU, there's no such thing as timing a single-loop or single-instruction. You need to bulk up your test runs to something that takes at least several seconds each in order to reduce error due to other things going on in the CPU, OS, etc.
If you're wanting to find out exactly how much time a particular loop takes to execute, and it takes less than, say, 1 second to execute, you're going to need to artificially increase the number of iterations in order to get a number that is above the "noise floor". You can then take that number and divide it by the number of artificially inflated iterations to get a figure that represents how long one pass through your target loop will take.
If you're wanting to compare the performance of different loop styles or techniques, the same thing holds: you're going to need to increase the number of iterations or passes through your test code in order to get a measurement in which what you're interested in dominates the time slice you're measuring.
This is true whether you're measuring performance using sub-millisecond high performance counters provided by the CPU, the system date time clock, or a wall clock to measure the elapsed time of your test.
Otherwise, you're just measuring white noise.
Typically if you want to measure the time spent in the inner loop, you'll put the time get routines outside of the outer loop and then divide by the (outer) loop count. If you expect the time of the inner loop to be relatively constant for any j, that is.
Any profiling instructions incur their own overhead, but presumably the overhead will be the same regardless of where it's inserted so "it all comes out in the wash." Presumably you're looking for spots where there are considerable differences between the runtimes of two compared processes, where a pair of function calls like this won't be an issue (since you need one at the "end" too, to get the time delta) since one routine will be 2x or more costly over the other.
Most platforms offer some sort of higher resolution timer, too, although the one we use here is hidden behind an API so that the "client" code is cross-platform. I'm sure with a little looking you can turn it up. Although even here, there's little likelihood that you'll get better than 1ms accuracy, so it's preferable to run the code several times in a row and time the whole run (then divide by the loop count, natch).
I'm glad you're looking for percentage, because that's easy to get. Just get it running. If it runs quickly, put an outer loop around it so it takes a good long time. That won't affect the percentages. While it's running, get stackshots. You can do this with Ctrl-Break in gdb, or you can use pstack or lsstack. Just look to see what percentage of stackshots display the code you care about.
Suppose the loops take some fraction of time, like 0.2 (20%) and you take N=20 samples. Then the number of samples that should show them will average 20 * 0.2 = 4, and the standard deviation of the number of samples will be sqrt(20 * 0.2 * 0.8) = sqrt(3.2) = 1.8, so if you want more precision, take more samples. (I personally think precision is overrated.)

Resources