Open source profiler for analyzing low-level architectural inefficiencies? - performance

Modern processors use all sorts of tricks to bridge the gap between the high speed of their processing elements and the slowness of external memory. In performance-critical applications the way you structure your code can often have a considerable influence on its efficiency. For instance, researchers using the SLO analyzer were able to fix cache locality problems and double the execution speed of several SPEC2000 benchmark programs. I'm looking for recommendations for an open source tool that uses a processor's performance monitoring support to locate and analyze architectural inefficiencies, such as cache misses, branch mispredicts, front end stalls, cache pollution through address aliasing, long latency instructions, and TLB misses. I'm aware of Intel's VTune (commercial), AMD's CodeAnalyst (free, but not open source), and Cachegrind (relies on simulation).

For Linux, oprofile works well. In fact, AMD's CodeAnalyst uses oprofile as its backend.
Oprofile uses the processor's internal performance monitoring mechanism to analyze architectural inefficiencies.

Related

How to use (read/write) CPU caches L1, L2, L3

I have a task that requires ultra performance
Of course I can optimize its algorithm, but I also want to optimize at the hardware level.
I can of course use CPU affinity to allocate a whole core to the thread that processes my task.
Another kind of optimization could be to put the data my task requires into the CPU caches (L1, L2, L3), in order to avoid the "RAM access" latency as far as possible.
What API can I use for such a development?
(In other words, my question could be: "how do I force the CPU to place a given data structure in the cache?")
Thank you for your help
Excellent comment by Peter C about prefetching. As a former optimizer, the first thing we'd do to improve code was to remove all the SW prefetching. Also, don't try to muck around with power states and such. They are so good nowadays that the effort isn't worth the gain in HPC. A possible exception is hyper-threading. The only time you'd want to go there would be for certain benchmarking where you need consistency as well as performance.
Take a look at the Intel optimization resources such as the optimization guide. Also get yourself a good profiler; Intel's VTune is truly one of the best. For info from Intel, use bing (or google) to find stuff. Intel's site is and always has been a glossy mess. VTune has Student and Educator licensing.
Here are the steps I used to take in optimizing apps for performance. First off, exhaust the higher-level software changes. Then get down into tweaking for hardware performance. Why? Two reasons: (1) code changes are generally architecture independent and have a better chance of surviving a move to a different HW platform and generation. (2) They are a heck of a lot simpler to do (though perhaps not as fun).
CODE CHANGES:
Remove all SW prefetching.
Replace any polling with periodic interrupts
Make sure any checking interrupts have appropriate intervals
Use Fortran. Really. There's a reason Fortran is still alive. Take a look at the Intel Fortran forums. The forum's all classical HPC. And Intel's Fortran compiler is one of the best.
Use a good optimizing compiler, and play with the compiler settings and pragmas/annotations (e.g. #pragma loop_count). Again, Intel's is one of the best. (I hate saying that, but it's true.)
Use a good SW profiler to find optimization opportunities (where most of your time is being spent). Make sure the profiler is able to dig into the source code to identify time spent in different functions. Optimize those functions first.
Find opportunities for thread parallelization (multi-threading) properly scoped to the number of cores
Find opportunities for vectorization
Convert from AoS (Array of Structs) to SoA (Struct of Arrays). Note that if you have to do the conversion on the fly, it may not be worth the performance cost.
Structure your loops such that they are more conducive to the compiler finding vectorization opportunities. See any good optimization book for how to do this.
HARDWARE HACKING/OPTIMIZATION (using a good HW-level performance analyzer)
Identify cache and TLB misses, and restructure code.
Identify branch mispredicts and restructure code.
Identify pipeline stalls and restructure code.
One last suggestion, though I'm sure you already know this. Remember, go after the hottest spots. Smaller opportunities are time consuming and performance improvements are not impactful to the overall application.
Best of luck. Optimization can be fun and rewarding (if you are slightly crazy).
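To make the AoS-to-SoA item above concrete, here is a minimal sketch. The particle layout, the field names and N_PARTICLES are all made up for illustration; the point is only that the SoA loop reads memory with unit stride, which is usually easier for a compiler to vectorize and wastes less of each cache line.

```c
#include <assert.h>
#include <stddef.h>

#define N_PARTICLES 1024   /* hypothetical problem size */

/* Array of Structs: all fields of one particle are adjacent in memory. */
struct particle_aos { float x, y, z, mass; };

/* Struct of Arrays: all x values are contiguous, then all y values, etc. */
struct particles_soa {
    float x[N_PARTICLES];
    float y[N_PARTICLES];
    float z[N_PARTICLES];
    float mass[N_PARTICLES];
};

/* Sums only x; every load also drags the unused y, z, mass into the cache. */
float sum_x_aos(const struct particle_aos *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;
    return sum;
}

/* Same reduction on the SoA layout: unit-stride loads over x only. */
float sum_x_soa(const struct particles_soa *p, size_t n)
{
    assert(n <= N_PARTICLES);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p->x[i];
    return sum;
}

/* On-the-fly conversion; this copy is the cost the text warns about. */
void aos_to_soa(const struct particle_aos *in, struct particles_soa *out, size_t n)
{
    assert(n <= N_PARTICLES);
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
        out->mass[i] = in[i].mass;
    }
}
```

Whether the SoA version is actually faster depends on the access pattern; if the hot loop uses all four fields of each particle, AoS may be just as good. Measure both with a profiler before committing to the conversion.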
You can't typically override the LRU replacement policies in CPU caches. x86 CPUs at least don't support any way to "pin" certain address ranges into any level of cache.
What you can do is "prefetch" ahead of use. "software prefetch" is only rarely helpful. Usually HW prefetching does a good job, and your data then stays in cache, as long as your cache footprint is small enough. Ulrich Drepper's What every programmer should know about memory covers this, and is still relevant. However, its emphasis on software prefetch (esp. a separate prefetch thread) was appropriate for P4, but not a good idea for other CPUs. Keep that in mind while reading.
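For reference, this is roughly what software prefetch looks like with the GCC/Clang builtin `__builtin_prefetch`. It is a sketch, not a recommendation: the function name and PREFETCH_DISTANCE are assumptions made up for the example, and the indirect access pattern was chosen because it is one of the few cases where the hardware prefetcher may not cope on its own.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 8   /* hypothetical value; tune by measurement */

/* Sum data[idx[i]] for i in [0, n). The indirection through idx[] makes
   the addresses hard for a hardware prefetcher to predict. */
long sum_indirect(const long *data, const size_t *idx, size_t n)
{
    assert(data != NULL && idx != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint that the element needed PREFETCH_DISTANCE iterations from
           now will be read (rw = 0) and reused soon (locality = 3). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 3);
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the result, so the simplest check is to time the loop with and without it. For a plain sequential scan it is likely to be useless or slightly harmful.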
Designing your data structures and access patterns to be cache-friendly is very important, too. Try googling "cache aware" algorithms, maybe (or just read Ulrich's paper). Or just tune as you go, using performance counters to see if you've accidentally done something that causes a lot of cache misses.
If you're running this on an Intel Haswell Xeon or newer (Exxx v3 or higher), you can partition the L3 cache so the core running your critical thread owns a chunk of L3, and it won't be evicted by other cores. This is called CAT (Cache Allocation Technology). See also this article by Dan Luu.
Well, you'll need to use a low level language (C would probably be the go-to in this case).
Then you have some reading to do: What every programmer should know about memory. Pay special attention to chapter 6, which contains very useful programming advice on how to optimize for specific usage patterns.

Is there any tool that can measure all cache levels in a C program?

I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create benchmarks that gradually increase the working set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 requests per cycle) remain constant, but the ratio of L2 misses to L2 requests increases.
To measure the cache hits/misses I tried to use Valgrind, but Cachegrind only models a two-level cache and my laptop has three levels.
Is there any tool that can measure all cache levels in a C program?
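As a starting point for the kind of benchmark described in the question, here is a sketch of a pointer-chasing kernel whose working set is controlled by a single parameter. The function names are made up for the example. It only varies the working-set size; keeping micro-ops per cycle and L2 requests per cycle constant, as the question requires, would need further tuning that is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill next[] with one random cycle over n slots (Sattolo's algorithm),
   so a chase starting at slot 0 visits every slot before repeating. */
void build_cycle(size_t *next, size_t n, unsigned seed)
{
    assert(next != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i), never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads and return the final slot.
   Each load depends on the previous one, so the access order is fixed
   and the working set is n * sizeof(size_t) bytes. */
size_t chase(const size_t *next, size_t steps)
{
    size_t pos = 0;
    for (size_t s = 0; s < steps; s++)
        pos = next[pos];
    return pos;
}
```

A sweep would call build_cycle with n growing from a size that fits in L1 to one that exceeds the last-level cache, run chase with a fixed number of steps for each n, and read the hardware counters around the chase call. Use the returned value (for example, print it) so the compiler cannot discard the loop.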
Modern CPUs have a PMU (performance monitoring unit) which can be used to accumulate L1/2/3/4 cache hits/misses/requests amongst a lot of things. There are a couple good libraries out there which implement PMU stuff.
I'm familiar with PAPI, perf and Intel's PMU library. I prefer Intel's implementation because it implements performance counters on QPI and other "uncore" stuff. I think most people use PAPI because it is frequently updated for new hardware and has high-level and low-level interfaces.
Implementing this stuff isn't trivial, but there is plenty of information out there about it. Typically you just have to specify your profiling regions in the code and then specify which counters you want to use. Note that you will only have a certain number of hardware counters at your disposal, depending on the PMU in your chip and what is already being used by your operating system.
Also, I don't believe Valgrind's cache analysis uses the PMU to get its data; it simulates the cache in software instead.
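To illustrate "specify your profiling regions, then pick a counter", here is a minimal Linux-only sketch using the perf_event_open system call directly rather than PAPI or Intel's library. The two function names are made up for the example. It counts the generic PERF_COUNT_HW_CACHE_MISSES event for the calling thread, and it assumes the kernel permits unprivileged counting (see /proc/sys/kernel/perf_event_paranoid); if not, the start call simply returns -1.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Start counting cache misses for the calling thread (user space only).
   Returns a file descriptor, or -1 if the counter is unavailable. */
int cache_miss_counter_start(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;        /* enabled explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: this thread on any CPU; no group, no flags. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    return fd;
}

/* Stop the counter, store the count in *misses and close the descriptor.
   Returns 0 on success, -1 on failure. */
int cache_miss_counter_stop(int fd, uint64_t *misses)
{
    assert(fd >= 0 && misses != NULL);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ssize_t got = read(fd, misses, sizeof(*misses));
    close(fd);
    return got == (ssize_t)sizeof(*misses) ? 0 : -1;
}
```

Wrap the region of interest between the start and stop calls. Which cache level the generic event maps to depends on the CPU; for per-level counts you would use PERF_TYPE_HW_CACHE or raw event codes from the vendor's manual instead.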

When are page frame specific cache management policies useful?

I'm reading the O'Reilly Linux Kernel book and one of the things that was pointed out during the chapter on paging is that the Pentium cache lets the operating system associate a different cache management policy with each page frame. So I get that there could be scenarios where a program has very little spatial/temporal locality and memory accesses are random/infrequent enough that the probability of cache hits is below some sort of threshold.
I was wondering whether this mechanism is actually used in practice today? Or is it more of a feature that was necessary back when caches were fairly small and not as efficient as they are now? I could see it being useful for an embedded system with little system-call overhead; are there other applications I am missing?
Having multiple cache management policies is widely used, whether by assigning whole regions using MTRRs (fixed/dynamic, as explained in Intel's PRM), MMIO regions, or through special instructions (e.g. streaming loads/stores, non-temporal prefetches, etc.). The use cases also vary a lot: you may be mapping an external I/O device into virtual memory (and don't want CPU caching to impact its coherence), you may want to define a write-through region for better integrity management of some database, or you may just want plain write-back to maximize the cache-hierarchy capacity and replacement efficiency (which means performance).
These usages often overlap (especially when multiple applications are running), so the flexibility is very much needed, as you said - you don't want data with little to no spatial/temporal locality to thrash out other lines you use all the time.
By the way, caches are never going to be big enough in the foreseeable future (with any known technology), since increasing them requires you to locate them further away from the core and pay in latency. So cache management is still, and is going to be for a long while, one of the most important things for performance-critical systems and applications.
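As one example of the "special instructions" route mentioned above, here is a sketch of a buffer fill using SSE2 non-temporal (streaming) stores, which ask the CPU to write the data without keeping it in the cache. The function name is made up for the example, and it assumes a 16-byte-aligned buffer whose size is a multiple of 16; on non-x86 targets it falls back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill `bytes` bytes at dst with `value`. dst must be 16-byte aligned and
   bytes must be a multiple of 16. */
void fill_streaming(void *dst, uint8_t value, size_t bytes)
{
    assert(dst != NULL);
    assert((uintptr_t)dst % 16 == 0 && bytes % 16 == 0);
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(&p[i], v);   /* non-temporal 16-byte store */
    _mm_sfence();                     /* order the streaming stores before later stores */
#else
    memset(dst, value, bytes);        /* portable fallback, normal cached stores */
#endif
}
```

This only pays off for data that will not be read again soon, such as a large output buffer written once. For small buffers that are reused immediately, ordinary stores are likely to be faster.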

Profiling CPU Cache/Memory from the OS/Application?

I wish to write software which could essentially profile the CPU cache (L2,L3, possibly L1) and the memory, to analyze performance.
Am I right in thinking this is un-doable because there is no access for the software to the cache content?
Another way of wording my Q: is there any way to know, from the OS/Application level, what data has been loaded into cache/memory?
EDIT: Operating System Windows or Linux and CPU Intel Desktop/Xeon
You might want to look at Intel's PMU i.e. Performance Monitoring Unit. Some processors have one. It is a bunch of special purpose registers (Intel calls them Model Specific Registers, or MSRs) which you can program to count events, like cache misses, using the RDMSR and WRMSR instructions.
Here is a document about Performance Analysis on i7 and Xeon 5500.
You might want to check out Intel's Performance Counter Monitor, which is basically some routines that abstract the PMU, which you can use in a C++ application to measure several performance metrics live, including cache misses. It also has some GUI/Commandline tools for standalone use.
Apparently, the Linux kernel has a facility for manipulating MSRs.
There are other utilities/APIs that also use the PMU: perf, PAPI.
Cache performance is generally measured in terms of hit rate and miss rate.
There are many tools to do this for you. Check how Valgrind does cache profiling.
Also, cache performance is generally measured on a per-program basis. Well-written programs result in fewer cache misses and better cache performance, and poorly written code does the opposite.
Measuring the actual cache speed is the hardware manufacturers' headache; you can refer to their manuals for this value.
Callgrind/Cachegrind combination can help you track cache hits/misses
This has some examples.
TAU, an open-source profiler which works using PAPI can also be used.
If however, you want to write a code to measure the cache statistics you can write a program using PAPI. PAPI allows the user to access the hardware counters without any need to know system architecture.
The PMU uses Model Specific Registers, hence you must have knowledge of the registers to be used.
perf can measure L1 and LLC events (LLC is the last-level cache, typically L3 on CPUs that have one). Cachegrind likewise lets the user measure L1 and LLC (which can be L2 or L3, whichever the highest level cache is). Use Cachegrind only if you can tolerate slow runs, because Cachegrind runs the program about 10X slower.

How to analyze the main memory and the cache access patterns?

I am looking for a way to analyze main memory access times. Such a method should give me a distribution of RAM and cache accesses so I can analyze CPU stalls over time. I wonder if it's possible entirely in software (a kernel module?), or whether a virtual machine could provide this feedback?
The performance counters in modern x86_64 CPUs are perfect for determining what code is executing when there are events like cache misses, branch mispredictions, instruction/data TLB misses, prefetches, etc.
On Linux, there are tools like perf and oprofile. AMD and Intel both offer commercial tools (for Linux and other platforms) to record and analyze these same performance counters.

Resources