Benchmarking instruction cache impact - caching

When a sub-routine needs to be benchmarked, it's relatively easy to create a workload for it, and then just loop over this workload to extract statistics. It will provide enough information to measure the sub-routine's performance.
At least in theory. However, in practice, I find the equivalence with real-world scenarios not that straightforward.
In a benchmark, the sub-routine is front and center: it runs over and over in order to be measured. During this period, it essentially has a monopoly over both the data and instruction caches. This can lead to design decisions which measurably improve benchmark figures, though at the expense of inflated instruction and data cache budgets.
Once integrated into a larger system, the target sub-routine is invoked only once in a while, sandwiched between multiple other sub-routines. It's no longer alone, meaning that the inflated budget now plays against it, evicting the cache lines used by other sub-routines.
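For illustration, this is the shape such a micro-benchmark typically takes (process() is just a stand-in for the sub-routine under test); nothing else runs between iterations, so the routine's code and data get exactly the cache monopoly described above:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in for the sub-routine under test.
static std::uint64_t process(std::uint64_t x) {
    return x * 2654435761u + 12345;
}

int main() {
    std::vector<std::uint64_t> workload(1 << 16, 42);
    std::uint64_t sink = 0;  // keep the result live so the loop isn't optimized away

    auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 1000; ++rep)        // the sub-routine has the caches to itself here
        for (std::uint64_t v : workload)
            sink += process(v);
    auto stop = std::chrono::steady_clock::now();

    std::printf("%lld ns (sink=%llu)\n",
                (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count(),
                (unsigned long long)sink);
}
```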
I'm fine with measuring and taming the impact of data cache. But for the instruction cache, it's an entirely different problem.
This impact seems impossible to measure in a synthetic benchmark, where the sub-routine is alone. Yet this difference leads to invalid conclusions, which is a big problem: it makes benchmarks essentially worthless or downright misleading.
I haven't found a solution to this problem yet, apart from measuring the performance of a whole system. But when the sub-routine is designed to be re-used, it's unclear whether measuring its contribution to a single use case is representative of any other use case.
Also, when the sub-routine is only a small contributor to the system, it's more difficult to extract its impact with enough accuracy (the system itself may have more noise than the sub-routine's contribution).
Is there any "good practice" that makes it possible to measure the performance of a sub-routine while taking into account its impact on the (shared) instruction cache?

Related

Most relevant performance indicators for C/C++

I am looking for relevant performance indicators to benchmark and optimize my C/C++ code. For example, virtual memory usage is a simple but efficient indicator, but I know some are more specialized and help in optimizing specific domains: cache hits/misses, context switches, and so on.
I believe here is a good place to have a list of performance indicators, what they measure, and how to measure them, in order to help people who want to start optimizing their programs know where to start.
Time is the most relevant indicator.
This is why most profilers default to measuring / sampling time or core clock cycles. Understanding where your code spends its time is an essential first step to looking for speedups. First find out what's slow, then find out why it's slow.
There are 2 fundamentally different kinds of speedups you can look for, and time will help you find both of them.
Algorithmic improvements: finding ways to do less work in the first place. This is often the most important kind, and the one Mike Dunlavey's answer focuses on. You should definitely not ignore this. Caching a result that's slow to recompute can be very worth it, especially if it's slow enough that loading from DRAM is still faster.
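As a minimal sketch of that kind of caching (memoization of a pure function; expensive() and its cost are invented for illustration, and the single-threaded cache ignores synchronization):

```cpp
#include <cstdint>
#include <unordered_map>

// Stand-in for a computation slow enough that a table lookup
// (even one that misses in cache and goes to DRAM) beats redoing it.
static std::uint64_t expensive(std::uint64_t n) {
    std::uint64_t acc = 1;
    for (std::uint64_t i = 0; i < 10'000'000; ++i)
        acc = acc * 6364136223846793005ULL + n;
    return acc;
}

std::uint64_t cached_expensive(std::uint64_t n) {
    static std::unordered_map<std::uint64_t, std::uint64_t> cache;  // single-threaded sketch
    auto it = cache.find(n);
    if (it != cache.end())
        return it->second;              // hit: skip the recomputation entirely
    std::uint64_t result = expensive(n);
    cache.emplace(n, result);
    return result;
}
```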
Using data structures / algorithms that can more efficiently solve your problem on real CPUs is somewhere between these two kinds of speedups. (e.g. linked lists are in practice often slower than arrays because pointer-chasing latency is a bottleneck, unless you end up copying large arrays too often...)
Applying brute force more efficiently to do the same work in fewer cycles. (And/or more friendly to the rest of the program with smaller cache footprint and/or less branching that takes up space in the branch predictors, or whatever.)
Often involves changing your data layout to be more cache friendly, and/or manually vectorizing with SIMD. Or doing so in a smarter way. Or writing a function that handles a common special case faster than your general-case function. Or even hand-holding the compiler into making better asm for your C source.
Consider summing an array of float on modern x86-64: Going from latency-bound scalar addition to AVX SIMD with multiple accumulators can give you a speedup of 8 (elements per vector) * 8 (latency / throughput on Skylake) = 64x for a medium-sized array (still on a single core/thread), in the theoretical best case where you don't run into another bottleneck (like memory bandwidth if your data isn't hot in L1d cache). Skylake vaddps / vaddss has 4 cycle latency, and 2-per-clock = 0.5c reciprocal throughput (https://agner.org/optimize/). See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about using multiple accumulators to hide FP latency. But this still loses hard vs. storing the total somewhere, and maybe even updating the total with a delta when you change an element. (FP rounding error can accumulate that way, though, unlike integers.)
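A rough sketch of what that looks like with AVX intrinsics (sum_avx, the choice of 4 accumulators, and the requirement that n be a multiple of 32 are all illustrative; the Skylake numbers above imply about 8 accumulators to fully hide the add latency):

```cpp
#include <immintrin.h>
#include <cstddef>

// Sum a float array with 4 independent vector accumulators to hide FP add latency.
// Assumes AVX is available and n is a multiple of 32.
float sum_avx(const float* a, std::size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (std::size_t i = 0; i < n; i += 32) {   // 4 vectors of 8 floats per iteration
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(a + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(a + i + 24));
    }
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    // Horizontal sum of the final vector.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```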
If you don't see an obvious algorithmic improvement, or want to know more before making changes, check whether the CPU is stalling on anything, or whether it's efficiently chewing through all the work the compiler is making it do.
Instructions per clock (IPC) tells you whether the CPU is close to its max instruction throughput or not. (Or more accurately, fused-domain uops issued per clock on x86, because for example one rep movsb instruction is a whole big memcpy and decodes to many many uops. And cmp/jcc fuses from 2 instructions to 1 uop, increasing IPC but the pipeline width is still fixed.)
Work done per instruction is a factor, too, but isn't something you can measure with a profiler: if you have the expertise, look at compiler-generated asm to see if the same work with fewer instructions is possible. If the compiler didn't auto-vectorize, or did so inefficiently, you can maybe get a lot more work done per instruction by manually vectorizing with SIMD intrinsics, depending on the problem. Or by hand-holding the compiler into emitting better asm by tweaking your C source to compute things in a way that is natural for asm, e.g. What is the efficient way to count set bits at a position or lower? See also C++ code for testing the Collatz conjecture faster than hand-written assembly - why?
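For instance, the count-set-bits-at-a-position-or-lower problem reduces to building a mask and letting a hardware popcount do the rest; a minimal sketch (assuming C++20 std::popcount and pos < 64):

```cpp
#include <bit>
#include <cstdint>

// Count the set bits of x at bit positions 0..pos inclusive.
// (2 << pos) - 1 builds a mask of the low pos+1 bits; the compiler can turn
// std::popcount into a single popcnt instruction on x86-64 when it's available.
int count_bits_at_or_below(std::uint64_t x, unsigned pos) {
    std::uint64_t mask = (std::uint64_t{2} << pos) - 1;
    return std::popcount(x & mask);
}
```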
If you find low IPC, figure out why by considering possibilities like cache misses or branch misses, or long dependency chains (often a cause of low IPC when not bottlenecked on the front-end or memory).
Or you might find that it's already close to optimally applying the available brute force of the CPU (unlikely but possible for some problems). In that case your only hope is algorithmic improvements to do less work.
(CPU frequency isn't fixed, but core clock cycles is a good proxy. If your program doesn't spend time waiting for I/O, then core clock cycles is maybe more useful to measure.)
A mostly-serial portion of a multi-threaded program can be hard to detect; most tools don't have an easy way to find threads using cycles when other threads are blocked.
Time spent in a function isn't the only indicator, though. A function can make the rest of the program slow by touching a lot of memory, resulting in the eviction of other useful data from cache. Or having a lot of branches somewhere can occupy some of the branch-prediction capacity of the CPU, resulting in more branch misses elsewhere.
But note that simply finding where the CPU is spending a lot of time executing is not the most useful, in a large codebase where functions containing hotspots can have multiple callers. e.g. lots of time spent in memcpy doesn't mean you need to speed up memcpy, it means you need to find which caller is calling memcpy a lot. And so on back up the call tree.
Use profilers that can record stack snapshots, or just hit control-C in a debugger and look at the call stack a few times. If a certain function usually appears in the call stack, it's making expensive calls.
Related: linux perf: how to interpret and find hotspots, especially Mike Dunlavey's answer there makes this point.
Algorithmic improvements to avoid doing work at all are often much more valuable than doing the same work more efficiently.
But if you find very low IPC for some work you haven't figured out how to avoid yet, then sure take a look at rearranging your data structures for better caching, or avoiding branch mispredicts.
Or if high IPC is still taking a long time, manually vectorizing a loop can help, doing 4x or more work per instruction.
@PeterCordes' answers are always good. I can only add my own perspective, coming from about 40 years of optimizing code:
If there is time to be saved (which there is), that time is spent doing something unnecessary, that you can get rid of if you know what it is.
So what is it? Since you don't know what it is, you also don't know how much time it takes, but it does take time. The more time it takes, the more worthwhile it is to find, and the easier it is to find it. Suppose it takes 30% of the time. That means a random-time snapshot has a 30% chance of showing you what it is.
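To put rough numbers on that (simple binomial arithmetic, assuming independent snapshots): if the unnecessary work takes 30% of the run time, each snapshot shows it with probability 0.3, so over 10 snapshots the chance of catching it at least once is 1 - 0.7^10 ≈ 97%, and the chance of seeing it on two or more snapshots is 1 - 0.7^10 - 10·0.3·0.7^9 ≈ 85%.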
I take 5-10 random snapshots of the call stack, using a debugger and the "pause" function.
If I see it doing something on more than one snapshot, and that thing can be done faster or not at all, I've got a substantial speedup, guaranteed.
Then the process can be repeated to find more speedups, until I hit diminishing returns.
The important thing about this method is that no "bottleneck" can hide from it. That sets it apart from profilers: because they summarize, speedups can hide from them.

Spinlock implementation reasoning

I want to improve the performance of a program by replacing some of the mutexes
with spinlocks. I have found a spinlock implementation in
http://www.boost.org/doc/libs/1_36_0/boost/detail/spinlock_sync.hpp
which I intend to reuse. I believe this implementation is safer than simpler implementations in which threads keep trying forever, like the one found here:
http://www.boost.org/doc/libs/1_54_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.example_spinlock.implementation
But I need to clarify some things about the yield function found here:
http://www.boost.org/doc/libs/1_36_0/boost/detail/yield_k.hpp
First of all, I assume that the numbers 4, 16, 32 are arbitrary; I actually tested some other values and found that I got the best performance in my case by using different ones.
But can someone explain the reasoning behind the yield code? Specifically, why do we need all three of:
BOOST_SMT_PAUSE
sched_yield and
nanosleep
Yes, this concept is known as "adaptive spinlock" - see e.g. https://lwn.net/Articles/271817/.
Usually the numbers are chosen for exponential back-off: https://geidav.wordpress.com/tag/exponential-back-off/
So, the numbers aren't arbitrary. However, which "numbers" work for your case depends on your application patterns, requirements and system resources.
The three methods to introduce "micro-delays" are designed explicitly to balance the cost and the potential gain:
spinning with zero delay has the lowest latency, but it results in high CPU usage, power consumption and wasted cycles
a small "cheap" delay might be able to avoid the cost of a context switch while reducing the CPU load relative to a busy-spin
a simple yield might allow the OS to avoid a full context switch, depending on other system load (e.g. if the number of threads < number of logical cores)
The trade-offs with these are important for low-latency applications where the effects of a context switch or cache misses are significant.
TL;DR
All trade-offs try to find a balance between wasting CPU cycles and losing cache/thread efficiency.
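For concreteness, here is a hedged sketch of that escalation pattern (not the Boost code itself): _mm_pause stands in for BOOST_SMT_PAUSE on x86, and the 4/16/32 thresholds are placeholders to tune for your workload.

```cpp
#include <atomic>
#include <sched.h>      // sched_yield (POSIX)
#include <time.h>       // nanosleep (POSIX)
#include <immintrin.h>  // _mm_pause (x86)

class spinlock {
    std::atomic<bool> locked_{false};

    // Escalating back-off: spin, then pause, then yield, then sleep.
    static void back_off(unsigned k) {
        if (k < 4) {
            // pure spin: cheapest if the lock is released almost immediately
        } else if (k < 16) {
            _mm_pause();                 // pause hint: less power, frees pipeline resources
        } else if (k < 32 || (k & 1)) {
            sched_yield();               // let the scheduler run another ready thread
        } else {
            timespec ts{0, 1000};        // ~1 microsecond: actually give up the CPU
            nanosleep(&ts, nullptr);
        }
    }

public:
    void lock() {
        for (unsigned k = 0; locked_.exchange(true, std::memory_order_acquire); ++k)
            back_off(k);
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};
```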

Are there any metrics for both performance and energy efficiency?

For many parallel programs, parallelization brings substantial cost, making the speedup sublinear. In this case, the parallel versions are less energy efficient than the sequential one.
However, people may care about both time performance and energy efficiency. Are there any specific metrics commonly used for this purpose?
More specifically, I am looking for a metric that can help determine the number of threads that best meets the energy and performance goals.
The most common metric is performance per watt. Take a look at the "Green500 List". Wikipedia also has an article on performance per watt. The metric is not as clear cut as it first appears because "performance" is not clear cut. FLOPS is very popular at the moment but it has a lot of deficiencies. I disagree that performance/watt can't be used to evaluate the performance of software. Depending upon your application, you may want to use performance/watt/sec.
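As a purely illustrative calculation with invented numbers: energy = average power × run time, so a sequential run taking 100 s at 20 W costs 2000 J, while a 4-thread run taking 30 s at 60 W costs 1800 J. The parallel version gets only a 3.3x (sublinear) speedup and triples the power draw, yet it still wins on both time and total energy. Performance per watt collapses that comparison onto a single axis; if latency also matters, the energy-delay product (energy × time) is a related metric sometimes used.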
I don’t know why you want to determine energy efficiency if parallelism is costing you. In fact, I don’t really understand how parallelism can be decreasing energy efficiency unless you are using a single core machine, doing pure computation, and are doing a lot of thrashing between threads. I’m guessing that this is not your own code.
Software power efficiency: The two most important factors are:
getting your computation done faster
making sure that periods between computation are truly idle
These factors break down into a whole host of other more concrete guidelines:
avoid timing interrupts and (shudder) polling
minimize synchronization constructs
exploit parallelism (thread and vectorization)
use a good optimizing compiler
use a thread pool if you are continuously creating and terminating a lot of threads
use efficient high performance libraries
avoid virtual machines (e.g. Java and Flash)
use a modern (tickless) OS
etc. etc. etc
Dividing your computation between parallel threads should decrease computation times, or else why add the complications? (Yes, I understand that some programming constructs, such as recursion, can result in simpler and cleaner code but worse performance, but these are exceptions.) Decreasing computation time should increase energy efficiency. If it doesn't, look at the algorithm and code practice.
If you can give me more detail about your app, I may be able to make more concrete suggestions.

measuring real running time of an algorithm

Approximately how many physical MIPS instructions does an abstract algorithm operation amortize to? By an abstract algorithm operation, I mean a basic operation, such as add, divide, etc.
I see this is not a strict measuring technique :-)
Kejia
There is a list of the basic MIPS instructions here. Most of the "basic operations" that you mentioned are a single MIPS instruction or perhaps two, which probably holds true on most current CPU families.
However this does not take into account at all the architecture and performance characteristics of any of the modern CPUs. Different instructions often have different completion times. Current CPUs usually implement branch prediction, instruction pipelines, memory caching, parallelisation and a whole list of other techniques to make code execution faster.
Therefore just having the assembly code implementation of an algorithm says nothing about its execution speed. You would have to measure and profile the code on the actual hardware to obtain comparable results. In fact, some algorithms may be far more effective on certain CPUs, even within the same CPU family.
A common and rather understandable example is the effect of the instruction cache. Unrolling a loop will eliminate a number of branch operations, which intuitively makes code faster. If you run that code on a CPU of the same family with very little instruction cache memory, though, the added accesses to the main memory can make it far slower than the simple branch-based loop.
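To make that concrete, here is a sketch of the two shapes being compared (the function names and the 4x factor are arbitrary); the unrolled version executes fewer branch instructions per element but occupies more instruction cache:

```cpp
#include <cstddef>

// Plain loop: one compare-and-branch per element, small code footprint.
long sum_simple(const int* a, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// 4x-unrolled loop: one branch per 4 elements, larger code footprint.
// Assumes n is a multiple of 4 for brevity.
long sum_unrolled4(const int* a, std::size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```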
Computers are complicated. If you want to get down to this level you need to start considering what kind of CPU you are using, how well your compiler can use this CPU's instruction set, what variables are being kept in what registers, what are their bit-level representations, etc. Even then, the number of instructions does not always map easily to the actual running time. Different instructions can take different amounts of clock cycles to execute, and this is not even thinking about OS threading and your program's cache miss rate.
In the end, there is a good reason we use big-O notation in the first place :)
BTW, most simple operations (add, subtract) on integers should map to a single machine instruction, in case you are worried.
It depends on the CPU architecture. Some processors require several cycles for a single instruction such as divide, while others manage to execute all machine code instructions in a single cycle each.
It is sometimes relevant to measure an algorithm in how many floating point operations it requires. However this does not take I/O (such as reading memory) into consideration.
The speed of a CPU is sometimes provided in FLOPS (Floating Point OPerations per Second) which could help to give you a time estimate. Again, not taking I/O into consideration - and not multi-threading issues (also a very important measuring factor).
Donald Knuth addressed this very problem in writing Volume 1 of "The Art of Computer Programming".
In the preface he gives a lengthy justification for presenting algorithms in the assembly code for an imaginary machine -
... To avoid this dilemma, I have attempted to design an "ideal" computer called "MIX," with very simple rules of operation ...
That way, one can talk sensibly about how many "cycles" an algorithm would take, without having to care about differences between machines, caching, latency, pipelines, or any of the other ways computers have been optimized to save time, at the expense of knowing how long they will take.

How can you ensure your code runs with no variability in execution time due to cache?

In an embedded application (written in C, on a 32-bit processor) with hard real-time constraints, the execution time of critical code (specially interrupts) needs to be constant.
How do you ensure that time variability is not introduced in the execution of the code, specifically due to the processor's caches (be it L1, L2 or L3)?
Note that we are concerned with cache behavior due to the huge effect it has on execution speed (sometimes more than 100:1 vs. accessing RAM). Variability introduced by other aspects of the processor architecture is nowhere near the magnitude of cache effects.
If you can get your hands on the hardware, or work with someone who can, you can turn off the cache. Some CPUs have a pin that, if wired to ground instead of power (or maybe the other way), will disable all internal caches. That will give predictability but not speed!
Failing that, maybe code could be written in certain places in the software to deliberately fill the cache with junk, so whatever happens next can be guaranteed to be a cache miss. Done right, that can give predictability, and perhaps could be done only in certain places so speed may be better than totally disabling caches.
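A hedged sketch of that idea for the data caches (the buffer and line sizes are assumptions to adjust to the target part; evicting the instruction cache would analogously require executing a large block of code):

```cpp
#include <cstddef>
#include <cstdint>

// Assumed sizes: tune to the target's actual cache capacity and line size.
constexpr std::size_t kFlushBytes = 8 * 1024 * 1024;   // >= total cache capacity
constexpr std::size_t kLineBytes  = 64;                 // typical cache line

static volatile std::uint8_t flush_buffer[kFlushBytes];

// Touch one byte in every cache line of a buffer larger than the cache,
// so whatever runs next starts from a (mostly) cold data cache.
void thrash_data_caches() {
    for (std::size_t i = 0; i < kFlushBytes; i += kLineBytes)
        flush_buffer[i] = static_cast<std::uint8_t>(i);
}
```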
Finally, if speed does matter - carefully design the software and data as if in the old days of programming for an ancient 8-bit CPU - keep it small enough for it all to fit in L1 cache. I'm always amazed at how on-board caches these days are bigger than all of the RAM on a minicomputer back in (mumble-decade). But this will be hard work and takes cleverness. Good luck!
Two possibilities:
Disable the cache entirely. The application will run slower, but without any variability.
Pre-load the code in the cache and "lock it in". Most processors provide a mechanism to do this.
It seems that you are referring to the x86 processor family, which is not built with real-time systems in mind, so there is no real guarantee of constant-time execution (the CPU may reorder micro-instructions; then there is branch prediction and the instruction prefetch queue, which is flushed each time the CPU mispredicts a conditional jump...)
This answer will sound snide, but it is intended to make you think:
Only run the code once.
The reason I say that is because so much will make it variable and you might not even have control over it. And what is your definition of time? Suppose the operating system decides to put your process in the wait queue.
Next you have unpredictability due to cache performance, memory latency, disk I/O, and so on. These all boil down to one thing; sometimes it takes time to get the information into the processor where your code can use it. Including the time it takes to fetch/decode your code itself.
Also, how much variance is acceptable to you? It could be that you're okay with 40 milliseconds, or you're okay with 10 nanoseconds.
Depending on the application domain you can even further just mask over or hide the variance. Computer graphics people have been rendering to off-screen buffers for years to hide variance in the time to render each frame.
The traditional solutions just remove as many known variable rate things as possible. Load files into RAM, warm up the cache and avoid IO.
Make all the function calls in the critical code 'inline', and minimize the number of variables you have so that you can give them the 'register' type.
This should improve the running time of your program. (You probably have to compile it in a special way, since compilers these days tend to disregard your 'register' hints.)
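A small sketch of that advice (always_inline is a GCC/Clang extension, shown here as an assumption about your toolchain; compilers largely ignore the register hint nowadays, and C++17 removed the keyword, so in practice you rely on the optimizer to keep the few hot locals in registers):

```cpp
// Sketch only: force-inline the helper on the critical path and keep the
// working set in a few locals.
#define FORCE_INLINE static inline __attribute__((always_inline))

FORCE_INLINE int scale(int x) { return x * 3; }

int critical_path(const int* samples, int n) {
    int acc = 0;                    // few live locals -> likely kept in registers
    for (int i = 0; i < n; ++i)
        acc += scale(samples[i]);   // no call overhead once inlined
    return acc;
}
```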
I'm assuming that you have enough memory not to cause page faults when you try to load something from memory. The page faults can take a lot of time.
You could also take a look at the generated assembly code, to see if there are lots of branches and memory instructions that could change your running time.
If an interrupt happens during your code's execution it WILL take longer. Do you have interrupts/exceptions enabled?
Understand your worst case runtime for complex operations and use timers.

Resources