Is there a way to measure cache coherence misses - performance

Given a program running on multiple cores, if two or more cores are operating on the same cache line, is there a way to measure the number of cache coherence invalidations/misses there are (i.e. when Core1 writes to the cache line, which then forces Core2 to refresh its copy of the cache line so that both cores are consistent)?
Let me know if I'm using the wrong terminology for this concept.

Yes, hardware performance counters can be used to do so.
However, the way to read them depends on the operating system and the processor. On Linux, the perf tool can be used to track performance counters (in particular perf stat -e COUNTER_NAME_1,COUNTER_NAME_2,etc.). Alternatively, on both Linux and Windows, Intel VTune can do this too.
The list of the hardware counters can be retrieved using perf list (or with PMU-Tools).
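As a rough sketch of the perf stat invocation described above, the following helper only assembles the command line; it does not run perf. The event names and the ./my_program binary are placeholders chosen for illustration, so check perf list for the events your CPU actually exposes.

```python
import shlex


def perf_stat_command(events, program_args):
    """Build the argv for: perf stat -e EV1,EV2,... -- PROGRAM ARGS."""
    if not events:
        raise ValueError("at least one event name is required")
    # perf accepts a comma-separated event list after -e;
    # "--" separates perf's own options from the profiled command.
    return ["perf", "stat", "-e", ",".join(events), "--", *program_args]


# Placeholder events and program, for illustration only.
cmd = perf_stat_command(
    ["cache-references", "cache-misses"],
    ["./my_program", "--threads", "2"],
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run on a machine where perf is installed and the counters are accessible.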
The metric you want looks like Requests For Ownership (RFOs) in the MESI cache-coherence protocol. Fortunately, most modern x86_64 processors include hardware events that count RFOs. On Intel Skylake processors, the event l2_rqsts.all_rfo, and more precisely l2_rqsts.rfo_hit and l2_rqsts.rfo_miss, count them at the L2-cache level. There are also many more advanced RFO-related hardware events at the offcore level.
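To make the RFO idea concrete, here is a toy model of MESI states for one shared cache line. It is a simplified illustration, not how any real CPU or hardware counter works: it ignores timing, write-backs and cache levels, and the function name and counters are made up for this sketch.

```python
def simulate(accesses, n_cores=2):
    """Replay (core, 'R' or 'W') accesses to ONE cache line under a toy MESI model.

    Returns (rfos, invalidations): how many writes needed a request for
    ownership, and how many copies held by other cores were invalidated.
    """
    state = ["I"] * n_cores  # every core starts without the line
    rfos = 0
    invalidations = 0
    for core, op in accesses:
        others = [c for c in range(n_cores) if c != core]
        if op == "W":
            if state[core] in ("S", "I"):
                # The writer does not own the line exclusively: issue an RFO.
                rfos += 1
                for c in others:
                    if state[c] != "I":
                        state[c] = "I"
                        invalidations += 1
            state[core] = "M"  # E -> M is a silent upgrade
        elif state[core] == "I":
            # Read miss: current holders (M/E/S) drop to Shared.
            sharers = [c for c in others if state[c] != "I"]
            for c in sharers:
                state[c] = "S"
            state[core] = "S" if sharers else "E"
    return rfos, invalidations


ping_pong = [(0, "W"), (1, "W")] * 2   # two cores fighting over the line
private = [(0, "W")] * 4               # one core writing alone
print(simulate(ping_pong))  # (4, 3)
print(simulate(private))    # (1, 0)
```

In the ping-pong case every write needs an RFO and almost every one invalidates the other core's copy, which is the traffic the RFO events are meant to expose.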

Related

Using perf to monitor memory access of every CPU

I'm trying to use the linux perf tool to sample the memory accesses in my program. Specifically, I'm using perf to monitor read/write access of every CPU in NUMA.
Now, I can monitor every single CPU's read and write memory access, but I also have to know whether the access is a local memory access or a remote memory access.
I have used perf list to go through the event list, but I only found events about per-socket memory access.
Questions
Is there any way to get every single CPU's remote memory access, when using perf ?
Is there a better option than perf ?
Yes, the PMU unit in your CPU can probably do what you want through the various uncore counters - in particular they can count the various offcore responses for non-local memory access. This blog post is a reasonable starting point.
The main problem is that the perf tool, which is tied to a specific kernel version, often lags behind in its support of modern processors [1], especially when it comes to uncore and NUMA-related events [2].
To work around that, you can use Andi Kleen's pmu-tools, which provides an ocperf wrapper script that uses whatever underlying perf you have on your system but with up-to-date event ids downloaded directly from Intel. That will usually give you access to the uncore events you need.
Of course, even when you get that working, these events are often very tough to interpret, especially because the mental model you have of demand-memory requests is complicated by a ton of factors such as prefetch behavior, request-for-ownership, accesses that "hit" in a line-buffer in the process of being filled, etc.
[1] Partly because adding new processors and events takes some time, but mostly because the tool is tied to the kernel, and you likely aren't on a bleeding-edge kernel, so even though mainline perf might have support, you are stuck with the perf version associated with your kernel.
[2] Probably because most kernel developers, like developers in general, aren't working on NUMA systems.

Is there any tool that can measure all cache levels in a C program?

I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create benchmarks that gradually increase the working set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 requests per cycle) remain constant, but the ratio of L2 misses to L2 requests increases.
In order to measure the cache hits/misses I tried to use valgrind but this tool only assumes a 2-level cache when using cachegrind and my laptop has three.
Is there any tool that can measure all cache levels in a C program?
Modern CPUs have a PMU (performance monitoring unit) which can be used to count L1/2/3/4 cache hits, misses and requests, among many other things. There are a couple of good libraries out there that wrap the PMU.
I'm familiar with PAPI, perf and Intel's PMU. I prefer Intel's implementation because it implements performance counters on QPI and other "uncore" stuff. I think most people use PAPI because it is frequently updated for new hardware and has high-level and low-level interfaces.
Using these libraries isn't trivial, but there is plenty of information out there about it. Typically you just have to mark your profiling regions in the code and then specify which counters you want to use. Note that you will only have a limited number of hardware counters at your disposal, depending on the PMU in your chip and on what is already being used by your operating system.
Also, I don't believe the Valgrind cache analysis uses the PMU to get its data; it simulates the cache in software instead.
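To illustrate what "simulating the cache in software" can look like, and how the miss ratio reacts to the working set size, here is a toy fully associative LRU cache. It is only a sketch under strong assumptions: real caches are set-associative, use other replacement policies and have prefetchers, and the function name and parameters are invented for this example.

```python
from collections import OrderedDict


def miss_ratio(working_set_lines, cache_lines, passes=4):
    """Sweep `working_set_lines` distinct lines `passes` times through a
    fully associative LRU cache holding `cache_lines` lines."""
    cache = OrderedDict()  # keys are line numbers, ordered oldest -> newest
    misses = 0
    accesses = 0
    for _ in range(passes):
        for line in range(working_set_lines):
            accesses += 1
            if line in cache:
                cache.move_to_end(line)        # hit: mark as most recently used
            else:
                misses += 1
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict the least recently used line
    return misses / accesses


for ws in (64, 128, 129, 256):
    print(ws, miss_ratio(ws, cache_lines=128))
```

With a cyclic sweep and LRU replacement the ratio jumps from cold misses only (0.25 with 4 passes) to 1.0 as soon as the working set exceeds the cache by a single line. A benchmark that needs a gradual increase would therefore need a different access pattern, for example random accesses.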

Difference between core and processor

What is the difference between a core and a processor?
I've already looked for it on Google, but I only get definitions for multi-core and multi-processor, which is not what I am looking for.
A core is usually the basic computation unit of the CPU - it can run a single program context (or multiple ones if it supports hardware threads such as hyperthreading on Intel CPUs), maintaining the correct program state, registers, and correct execution order, and performing the operations through ALUs. For optimization purposes, a core can also hold on-core caches with copies of frequently used memory chunks.
A CPU may have one or more cores to perform tasks at a given time. These tasks are usually software processes and threads that the OS schedules. Note that the OS may have many threads to run, but the CPU can only run X such tasks at a given time, where X = number of cores * number of hardware threads per core. The rest have to wait for the OS to schedule them, whether by preempting currently running tasks or by any other means.
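As a small worked example of the formula above (the 4-core, 2-thread figures are just an assumed configuration), together with the number of logical CPUs the OS reports on the current machine:

```python
import os


def max_concurrent_tasks(cores, hw_threads_per_core):
    """X = number of cores * number of hardware threads per core."""
    return cores * hw_threads_per_core


# Assumed example: 4 cores with 2 hardware threads each.
print(max_concurrent_tasks(4, 2))  # 8

# Logical CPUs seen by the OS (cores * hardware threads), or None if unknown.
print(os.cpu_count())
```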
In addition to the one or many cores, the CPU will include some interconnect that connects the cores to the outside world, and usually also a large "last-level" shared cache. There are multiple other key elements required to make a CPU work, but their exact locations may differ according to design. You'll need a memory controller to talk to the memory, and I/O controllers (display, PCIe, USB, etc.). In the past these elements were outside the CPU, in the complementary "chipset", but most modern designs have integrated them into the CPU.
In addition the CPU may have an integrated GPU, and pretty much everything else the designer wanted to keep close for performance, power and manufacturing considerations. CPU design is mostly trending toward what's called a system on chip (SoC).
This is a "classic" design, used by most modern general-purpose devices (client PCs, servers, and also tablets and smartphones). You can find more elaborate designs, usually in academia, where the computation is not done in basic "core-like" units.
An image may say more than a thousand words:
* Figure describing the complexity of a modern multi-processor, multi-core system.
Source:
https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
Let's first clarify what a CPU is and what a core is. A central processing unit (CPU) can have multiple cores. Each core is a processor in its own right, capable of executing a program, but it is contained on the same chip as the other cores.
In the past one CPU was distributed among quite a few chips, but as Moore's Law progressed it became possible to fit a complete CPU on one chip (die). Since the 2000s manufacturers have been fitting more cores on the same die, which is the concept of multi-core.
These days it is possible to have dozens or even hundreds of cores on the same chip, for example in GPUs or Intel Xeon processors. Another technique, developed in the 90s, is simultaneous multithreading: designers found that a single core could run another thread, since many of its resources, such as ALUs and registers, were already replicated.
So a CPU can have multiple cores, each capable of running one or more threads at the same time. We can expect more cores in the future, along with more difficulty in programming them efficiently.
CPU stands for central processing unit. Until around 2002 mainstream processors had only a single core, that is, they could run only a single task or program at a time.
To run multiple programs at the same time we had to use multiple processors, which required a different motherboard, and that was very expensive.
So Intel introduced hyper-threading, which makes a single CPU appear as two virtual CPUs, that is, two logical cores for our tasks. The CPU is still a single one; it only pretends to be a dual CPU so that it can work on multiple tasks. But having real multiple cores is better than that, so manufacturers developed multi-core processors, that is, multiple processors in a single package: several CPUs combined into one bigger CPU. Those are the multiple cores.
In the early days, before multi-core processors, a processor could not really multitask: a single processor could handle just a single task at a time. So when we said that the antivirus, Microsoft Word, VLC, etc. were all running at the same time, that wasn't actually true. The processor would work on a single task, then pause it, take another task, complete it if it was short or pause it again and put it back in the queue, then move on to the next one. Each task ran for such a short slice (on the order of milliseconds) before being paused that you did not notice the pauses. For example, while VLC plays music there are other apps running, but only one program runs at a time, so VLC is actually being paused repeatedly for intervals too short to hear.
But that was the situation with old processors.
Nowadays processors are multi-core. Each core can be compared to one of those older processors, embedded together with the others on a single chip to form a single processor. So cores are mini processors which combine to become a processor. Each core can handle a single process at a time, or several hardware threads if it is designed for that, and each core follows the same steps I described above for the single processor.
For example, a 6th-generation desktop Core i7 has 4 cores and 8 hardware threads, so it can run up to 8 threads at once. That does not make it 8 times faster than an old processor, but it is how real multitasking is done.
There can be dozens of cores in a single processor, for example in server-class CPUs.
I hope I explained this well.
I have read all the answers, but this link gave the clearest explanation for me of the difference between a CPU (processor) and a core, so I'm leaving some notes from it here.
The main difference between CPU and Core is that the CPU is an electronic circuit inside the computer that carries out instruction to perform arithmetic, logical, control and input/output operations while the core is an execution unit inside the CPU that receives and executes instructions.
Intel's picture is helpful, as shown by Tortuga's best answer. Here's a caption for it.
Processor: One semiconductor chip, the CPU (central processing unit) seated in one socket, circa 1970s-2010s. Over time, more functions have been packed onto the CPU chip. Prior to the 1970s releases of single-chip processors, one processor might have been spread across multiple chips. In the mid 2010s the system-on-a-chip chips made it slightly more sketchy to equate one processor to one chip, though that's generally what people mean by processor, as in "this computer has an i7 processor" or "this computer system has four processors."
Core: One block of a CPU, executing one instruction at a time. (You'll see people say one instruction per clock cycle, but some CPUs use multiple clock cycles for some instructions.)

Profiling CPU Cache/Memory from the OS/Application?

I wish to write software which could essentially profile the CPU cache (L2,L3, possibly L1) and the memory, to analyze performance.
Am I right in thinking this is un-doable because there is no access for the software to the cache content?
Another way of wording my Q: is there any way to know, from the OS/Application level, what data has been loaded into cache/memory?
EDIT: Operating System Windows or Linux and CPU Intel Desktop/Xeon
You might want to look at Intel's PMU i.e. Performance Monitoring Unit. Some processors have one. It is a bunch of special purpose registers (Intel calls them Model Specific Registers, or MSRs) which you can program to count events, like cache misses, using the RDMSR and WRMSR instructions.
Here is a document about Performance Analysis on i7 and Xeon 5500.
You might want to check out Intel's Performance Counter Monitor, which is basically some routines that abstract the PMU, which you can use in a C++ application to measure several performance metrics live, including cache misses. It also has some GUI/Commandline tools for standalone use.
Apparently, the Linux kernel has a facility for manipulating MSRs.
There are other utilities/APIs that also use the PMU: perf, PAPI.
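As a rough sketch of the Linux MSR facility mentioned above: with the msr kernel module loaded, each CPU exposes a device file /dev/cpu/N/msr, and reading 8 bytes at an offset equal to the MSR address returns that register's value. The helper name and its parameters are made up for this example, it normally needs root, and which MSR addresses are meaningful depends on the processor.

```python
import os
import struct


def read_msr(msr_address, cpu=0, device_template="/dev/cpu/{cpu}/msr"):
    """Read one 64-bit MSR through the Linux msr driver (needs root and `modprobe msr`)."""
    path = device_template.format(cpu=cpu)
    fd = os.open(path, os.O_RDONLY)
    try:
        # The file offset selects the MSR; each register is 8 bytes wide.
        raw = os.pread(fd, 8, msr_address)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {path}")
    return struct.unpack("<Q", raw)[0]  # x86 is little-endian

# Example call (not executed here): read_msr(0x10) reads MSR 0x10,
# the time-stamp counter on Intel CPUs, from CPU 0.
```

Programming the PMU this way means writing the event-select MSRs yourself, which is exactly the bookkeeping that perf, PAPI and Intel's Performance Counter Monitor hide.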
Cache performance is generally measured in terms of hit rate and miss rate.
There are many tools to do this for you. Check how Valgrind does cache profiling.
Also, cache performance is generally measured on a per-program basis. Well-written programs result in fewer cache misses and better cache performance, and poorly written code does the opposite.
Measuring the actual cache speed is the headache of the hardware manufacturers, and you can refer to their manuals to find this value.
Callgrind/Cachegrind combination can help you track cache hits/misses
This has some examples.
TAU, an open-source profiler which works using PAPI can also be used.
If, however, you want to write code to measure the cache statistics yourself, you can write a program using PAPI. PAPI allows the user to access the hardware counters without needing to know the system architecture.
The PMU uses Model Specific Registers, hence you must know which registers to use.
Perf allows for measurement of L1 and LLC (the last-level cache, which is typically L3 on current Intel desktop and server CPUs). Cachegrind, on the other hand, allows the user to measure L1 and LLC (which can be L2 or L3, whichever the highest-level cache is). Use Cachegrind only if you do not need fast results, because Cachegrind runs the program about 10X slower.

How to analyze the main memory and the cache access patterns?

I am looking for a way to analyze main memory access times. Such a method should give me a distribution of RAM and cache accesses so that I can analyze CPU stalls over time. I wonder if it's possible entirely in software (a kernel module?) or whether a virtual machine might provide this feedback?
The performance counters in modern x86_64 CPUs are perfect for determining what code is executing when events like cache misses, branch mispredictions, instruction/data TLB misses, prefetches, etc. occur.
On Linux, there are tools like perf and oprofile. AMD and Intel both offer commercial tools (for Linux and other platforms) to record and analyze these same performance counters.

Resources