How to analyze main memory and cache access patterns? - memory-management

I am looking for a way to analyze main memory access times. The method should give me a distribution of RAM and cache accesses so I can analyze CPU stalls over time. I wonder whether this is possible entirely in software (a kernel module?), or whether a virtual machine could provide such feedback.

The performance counters in modern x86_64 CPUs are perfect for determining what code is executing when there are events like cache misses, branch mispredictions, instruction/data TLB misses, prefetches, etc.
On Linux, there are tools like perf and OProfile. AMD and Intel both offer commercial tools (for Linux and other platforms) to record and analyze these same performance counters.
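As a rough illustration of what these tools sit on top of, here is a minimal sketch (assuming Linux and a CPU/VM that exposes the generic cache-miss event) that uses the perf_event_open syscall, the same interface perf itself is built on, to count last-level cache misses around a small workload:

```c
/* Minimal sketch: count hardware cache misses around a code region on Linux
 * via perf_event_open. Error handling is minimal; the generic event may be
 * unavailable on some CPUs or inside some virtual machines. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* last-level cache misses */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    /* Workload: walk a large array so some accesses miss in the caches. */
    enum { N = 1 << 24 };
    static volatile char buf[N];

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (size_t i = 0; i < N; i += 64)          /* stride of one cache line */
        buf[i]++;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %lld\n", misses);
    close(fd);
    return 0;
}
```

The same attribute structure can select branch mispredictions, TLB misses, or raw CPU-specific events instead of the generic cache-miss counter.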

Related

Is there a way to measure cache coherence misses?

Given a program running on multiple cores, if two or more cores are operating on the same cache line, is there a way to measure the number of cache-coherence invalidations/misses (i.e., when Core1 writes to the cache line, forcing Core2 to refresh its copy so that both cores stay consistent)?
Let me know if I'm using the wrong terminology for this concept.
Yes, hardware performance counters can be used to do so.
However, the way to fetch them tends to depend on the operating system and your processor. On Linux, the perf tool can be used to track performance counters (specifically perf stat -e COUNTER_NAME_1,COUNTER_NAME_2,etc.). Alternatively, on both Linux and Windows, Intel VTune can do this too.
The list of available hardware counters can be retrieved using perf list (or with PMU-Tools).
The kind of metric you want to measure looks like a Request For Ownership (RFO) in the MESI cache-coherence protocol. Fortunately, most modern (x86_64) processors include hardware events to measure RFOs. On Intel Skylake processors, there is a hardware event called l2_rqsts.all_rfo, and more precisely l2_rqsts.rfo_hit and l2_rqsts.rfo_miss, to do this at the L2-cache level. Alternatively, there are many more advanced RFO-related hardware events that can be used at the offcore level.
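To make this concrete, here is a minimal false-sharing sketch (the iteration count is illustrative, and the exact event names depend on your microarchitecture): two threads write to fields that share one cache line, so each write forces an RFO and the line bounces between cores. Running it under perf stat with the RFO events above should show far higher counts than a version where the two fields are padded onto separate cache lines.

```c
/* Sketch of a false-sharing workload: two threads repeatedly write to
 * different fields that share one cache line, forcing the line to bounce
 * between cores (each write needs a Request For Ownership).
 * Build with -pthread and run, e.g., under: perf stat -e l2_rqsts.all_rfo ./a.out
 * (event names vary by microarchitecture). */
#include <pthread.h>
#include <stdio.h>

struct shared {
    volatile long a;   /* written by thread 1 */
    volatile long b;   /* written by thread 2; same cache line as 'a' */
} line;

static void *writer_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) line.a++;
    return NULL;
}

static void *writer_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) line.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer_a, NULL);
    pthread_create(&t2, NULL, writer_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", line.a, line.b);
    return 0;
}
```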

Is there any tool that allows measuring all cache levels in a C program?

I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create a benchmark that gradually increases the working-set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 requests per cycle) remain constant, but the ratio of L2 misses to L2 requests increases.
In order to measure cache hits/misses I tried to use Valgrind, but its Cachegrind tool only simulates a two-level cache, and my laptop has three levels.
Is there any tool that allows measuring all cache levels in a C program?
Modern CPUs have a PMU (performance monitoring unit) which can be used to accumulate L1/2/3/4 cache hits/misses/requests, among many other things. There are a couple of good libraries out there which implement PMU support.
I'm familiar with PAPI, perf and Intel's PMU tooling. I prefer Intel's implementation because it exposes performance counters on QPI and other "uncore" parts. I think most people use PAPI because it is frequently updated for new hardware and has both high-level and low-level interfaces.
Implementing this isn't entirely trivial, but there is plenty of information out there about it. Typically you just specify your profiling regions in the code and then specify which counters you want to use (see the sketch below). Note that you only have a certain number of hardware counters at your disposal, depending on the PMU in your chip and what is already being used by your operating system.
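For instance, here is a minimal sketch of a profiling region using PAPI's low-level API (assuming libpapi is installed and the PAPI_L2_TCM preset is supported on your machine; check with papi_avail):

```c
/* Sketch: counting L2 total cache misses around a region with PAPI's
 * low-level API (link with -lpapi). The PAPI_L2_TCM preset may be
 * unavailable on some CPUs. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[1];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    if (PAPI_create_eventset(&eventset) != PAPI_OK) exit(1);
    if (PAPI_add_event(eventset, PAPI_L2_TCM) != PAPI_OK) exit(1);

    /* Profiling region: touch a buffer larger than a typical L2. */
    enum { N = 8 * 1024 * 1024 };
    volatile char *buf = malloc(N);

    PAPI_start(eventset);
    for (int i = 0; i < N; i += 64)
        buf[i]++;
    PAPI_stop(eventset, counts);

    printf("L2 total cache misses: %lld\n", counts[0]);
    return 0;
}
```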
Also, I don't believe Valgrind's cache analysis uses PMU instructions to get its data; it simulates the caches in software instead.
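As for the benchmark itself, a minimal sketch of the working-set sweep described in the question might look like the following (sizes, stride, and iteration count are illustrative). Wrapping the inner loop with PAPI or perf counters would give you the L2 requests and misses for each working-set size:

```c
/* Sketch of a working-set sweep: the same number of strided accesses is
 * repeated over growing buffer sizes, so the access rate stays roughly
 * constant while the L2 miss ratio climbs once the buffer no longer fits
 * in L2. Wrap the inner loop with PAPI/perf counters per size. */
#include <stdio.h>
#include <stdlib.h>

#define STRIDE 64            /* one cache line */
#define ACCESSES (1 << 24)   /* fixed number of accesses per working set */

int main(void)
{
    for (size_t ws = 32 * 1024; ws <= 32 * 1024 * 1024; ws *= 2) {
        volatile char *buf = malloc(ws);
        size_t idx = 0;

        /* Fixed amount of work, independent of the working-set size. */
        for (long i = 0; i < ACCESSES; i++) {
            buf[idx]++;
            idx = (idx + STRIDE) % ws;
        }

        printf("working set %zu KiB done\n", ws / 1024);
        free((void *)buf);
    }
    return 0;
}
```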

Profiling CPU Cache/Memory from the OS/Application?

I wish to write software which could essentially profile the CPU cache (L2, L3, possibly L1) and the memory, to analyze performance.
Am I right in thinking this is not doable because software has no access to the cache contents?
Another way of wording my question: is there any way to know, from the OS/application level, what data has been loaded into cache/memory?
EDIT: the operating system is Windows or Linux and the CPU is an Intel desktop/Xeon part.
You might want to look at Intel's PMU, i.e. Performance Monitoring Unit. Some processors have one. It is a set of special-purpose registers (Intel calls them Model-Specific Registers, or MSRs) which you can program to count events, like cache misses, using the RDMSR and WRMSR instructions.
Here is a document about Performance Analysis on i7 and Xeon 5500.
You might want to check out Intel's Performance Counter Monitor, which is basically a set of routines abstracting the PMU that you can use in a C++ application to measure several performance metrics live, including cache misses. It also has some GUI/command-line tools for standalone use.
Apparently, the Linux kernel has a facility for manipulating MSRs.
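For example, with the msr driver loaded (modprobe msr), reading a register boils down to a pread on /dev/cpu/<n>/msr at the MSR's address. The sketch below reads the time-stamp counter MSR purely as a harmless demonstration and normally requires root:

```c
/* Sketch: reading an MSR through the Linux msr driver (modprobe msr),
 * which exposes /dev/cpu/<n>/msr. Reads are 8 bytes at an offset equal to
 * the MSR address; root (or CAP_SYS_RAWIO) is normally required.
 * 0x10 is IA32_TIME_STAMP_COUNTER, used here only as an example. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    uint64_t value = 0;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd == -1) { perror("open /dev/cpu/0/msr"); return 1; }

    if (pread(fd, &value, sizeof(value), 0x10) != sizeof(value)) {
        perror("pread");
        return 1;
    }
    printf("MSR 0x10 (TSC) on CPU 0: %llu\n", (unsigned long long)value);
    close(fd);
    return 0;
}
```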
There are other utilities/APIs that also use the PMU: perf, PAPI.
Cache performance is generally measured in terms of hit rate and miss rate.
There are many tools to do this for you. Check how Valgrind does cache profiling.
Also, cache performance is generally measured on a per-program basis: well-written programs result in fewer cache misses and better cache performance, and vice versa for poorly written code.
Measuring the actual cache speed is the hardware manufacturers' concern; you can refer to their manuals for those values.
The Callgrind/Cachegrind combination can help you track cache hits/misses.
The Valgrind documentation has some examples.
TAU, an open-source profiler which works using PAPI, can also be used.
If, however, you want to write code to measure the cache statistics yourself, you can write a program using PAPI. PAPI allows the user to access the hardware counters without needing to know the system architecture.
The PMU uses Model-Specific Registers, hence you must know which registers to use.
Perf allows measurement of the L1 and LLC (last-level) caches; Cachegrind, on the other hand, lets the user model L1 and LLC (which can be L2 or L3, whichever the highest-level cache is). Use Cachegrind only if you do not need fast results, because Cachegrind runs the program about 10x slower.

What is the best way to detect CPU cache misses when running an algorithm?

We have an algorithm which is performing poorly and we believe it's because of CPU cache misses. Nevertheless, we can't prove it because we don't have any way of detecting them. Is there any way to tell how many CPU cache misses an algorithm produces? We can port it to any language which could allow us to detect them.
Thanks in advance.
The easiest way to find this kind of issue is to use profilers and collect cache-related performance counters.
I would recommend checking the following tools:
Intel® VTune™ Amplifier XE (supports: Linux and Windows; C/C++, Java, .NET) - http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/
OProfile - http://oprofile.sourceforge.net/
Is it possible to see the overall structure of your algorithm (if it is not too long)?
Intel CPUs keep performance counters that you can extract with some assembler instructions.
Could you (1) baseline cache misses on a quiescent system, (2) run the program and compare?
See Volume 3B of the Intel Instruction Set Reference, Section 18, page 15 (18-15), for the assembler you would have to write.
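As a rough sketch of what that assembler looks like, the RDPMC instruction reads a counter that has already been programmed. The example below reads the fixed "instructions retired" counter; it assumes user-space RDPMC is permitted by the OS (on Linux, see /sys/devices/cpu/rdpmc), otherwise the instruction faults:

```c
/* Sketch: reading a performance counter directly with RDPMC
 * (see Intel SDM Vol. 3B). The counter must already be programmed and
 * user-space RDPMC must be allowed. ECX bit 30 selects the fixed-function
 * counters; index 0 is "instructions retired" on recent Intel CPUs.
 * GCC/Clang inline asm, x86_64 only. */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    const uint32_t fixed0 = (1u << 30) | 0;  /* fixed counter 0 */
    uint64_t before = rdpmc(fixed0);

    volatile uint64_t sum = 0;
    for (int i = 0; i < 1000000; i++) sum += i;  /* some work to count */

    uint64_t after = rdpmc(fixed0);
    printf("instructions retired (approx): %llu\n",
           (unsigned long long)(after - before));
    return 0;
}
```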

Open source profiler for analyzing low-level architectural inefficiencies?

Modern processors use all sorts of tricks to bridge the gap between the speed of their processing elements and the tardiness of external memory. In performance-critical applications, the way you structure your code can often have a considerable influence on its efficiency. For instance, researchers using the SLO analyzer were able to fix cache-locality problems and double the execution speed of several SPEC2000 benchmark programs. I'm looking for recommendations for an open-source tool that uses a processor's performance-monitoring support to locate and analyze architectural inefficiencies, such as cache misses, branch mispredictions, front-end stalls, cache pollution through address aliasing, long-latency instructions, and TLB misses. I'm aware of Intel's VTune (commercial), AMD's CodeAnalyst (free, but not open source), and Cachegrind (relies on simulation).
For Linux, OProfile works well. In fact, AMD's CodeAnalyst uses OProfile as its backend.
OProfile uses the processor's internal performance-monitoring mechanism to analyze architectural inefficiencies.

Resources