Is there any tool that can measure all cache levels in a C program?

I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create benchmarks that gradually increase the working set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 requests per cycle) remain constant, while the ratio of L2 misses to L2 requests increases.
To measure the cache hits/misses I tried to use Valgrind, but its Cachegrind tool only simulates a 2-level cache, and my laptop has three levels.
Is there any tool that can measure all cache levels in a C program?

Modern CPUs have a PMU (performance monitoring unit) which can be used to count L1/L2/L3/L4 cache hits, misses and requests, among many other things. There are a couple of good libraries out there which expose the PMU.
I'm familiar with PAPI, perf and Intel's own PMU tools. I prefer Intel's implementation because it also provides performance counters for QPI and other "uncore" parts. I think most people use PAPI because it is frequently updated for new hardware and has both high-level and low-level interfaces.
Implementing this isn't trivial, but there is plenty of information out there about it. Typically you specify your profiling regions in the code and then choose which counters you want to use. Note that you only have a certain number of hardware counters at your disposal, depending on the PMU in your chip and on what is already being used by your operating system.
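For example, here is a minimal sketch using PAPI's low-level API to count L2 accesses and misses around a region of interest. The preset events PAPI_L2_TCA/PAPI_L2_TCM and the buffer-walking workload are only illustrative (what is actually available depends on your hardware), and most error handling is omitted:

    #include <papi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int evset = PAPI_NULL;
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI init failed\n");
            return 1;
        }
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_L2_TCA);   /* total L2 accesses (preset event) */
        PAPI_add_event(evset, PAPI_L2_TCM);   /* total L2 misses (preset event) */

        /* Illustrative working set: a buffer larger than L2, touched one line at a time. */
        size_t n = 8u * 1024 * 1024;
        volatile char *buf = malloc(n);

        PAPI_start(evset);
        for (size_t i = 0; i < n; i += 64)    /* one access per 64-byte cache line */
            buf[i]++;
        PAPI_stop(evset, counts);

        printf("L2 accesses: %lld, L2 misses: %lld, miss ratio: %.2f\n",
               counts[0], counts[1], (double)counts[1] / counts[0]);
        free((void *)buf);
        return 0;
    }

Link against PAPI (e.g. gcc -O2 foo.c -lpapi). Sweeping the buffer size from below the L1 size to well past L3 while keeping the access pattern fixed gives the kind of working-set sweep described in the question.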
Also, I don't believe Valgrind's cache analysis uses the PMU to get its data; it simulates the caches in software instead.

Related

Explanation for why effective DRAM bandwidth reduces upon adding CPUs

This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system
I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores @ 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using the Intel Advisor's roof-line plot, which depicts the bandwidths of each CPU data-path available. According to this, the bandwidth is 230 GB/s.
Strong scaling of bandwidth:
Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?
Overview
This answer provides probable explanations. Put shortly, no parallel workload scales infinitely. When many cores compete for the same shared resource (e.g. DRAM), using too many cores is often detrimental: beyond the point where the shared resource is saturated, adding cores only increases the overhead.
More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and non-temporal prefetching should improve the performance and scalability of your benchmark a bit. Still, there are other architectural hardware limitations that can keep the benchmark from scaling well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.
Under the hood
In your case, the layout of Intel Xeon Skylake SP processors looks like the following:
(Diagram omitted. Source: Intel.)
There are two sockets connected by a UPI interconnect, and each processor is connected to its own set of DRAM. There are 2 Integrated Memory Controllers (IMCs) per processor, each connected to 3 DDR4 DIMMs @ 2666 MHz. This means the theoretical bandwidth is 2 * 2 * 3 * 2666e6 * 8 bytes = 256 GB/s = 238 GiB/s.
Assuming your benchmark is well designed and each processor accesses only its own NUMA node, I would expect a very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters; Linux perf or VTune make this relatively easy.
The L3 cache is split into slices. All physical addresses are distributed across the cache slices using a hash function (see here for more information). This lets the processor balance the throughput across all the L3 slices. It also lets the processor balance the throughput between the two IMCs, so that in the end the processor looks like an SMP architecture rather than a NUMA one. The same approach was used in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).
Hashing does not guarantee perfect balancing, though (no hash function is perfect, especially those that must be fast to compute), but it is often quite good in practice, especially for contiguous accesses. Bad balancing decreases the memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
With a good hash function, the balancing should be independent of the number of cores used. If the hash function is not good enough, one IMC can be more saturated than the other, oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for the throughput of each IMC, but their granularity is quite coarse. On my Skylake machine the hardware counters are named uncore_imc/data_reads/ and uncore_imc/data_writes/, but on your platform you should have 4 such counters (one for each IMC).
Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like yours. The idea is to split the processor into two NUMA nodes, each with its own dedicated IMC. This solves the balancing issue caused by the hash function and thus results in faster memory operations, as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects. In the worst case, the pages of an application can all be mapped to the same NUMA node, leaving only half the bandwidth usable. Since your benchmark is supposed to be NUMA-friendly, SNC should be more efficient.
(Diagram omitted. Source: Intel.)
Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines, which then need to be fetched again later when a core actually needs them (with an additional DRAM latency to pay). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAM, hardware prefetch units have to prefetch data a long time in advance to reduce the impact of that latency, and they need to keep a lot of requests in flight concurrently. This is generally not a problem with sequential accesses, but more cores make the accesses look more random from the caches' and IMCs' point of view. The thing is, DRAM is designed so that contiguous accesses are faster than random ones (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the LLC-load-misses hardware counter to check whether more data is re-fetched with more threads (I see such an effect on my Skylake-based PC with only 6 cores, but it is not strong enough to have any visible impact on the final throughput). To mitigate this problem, you can use software non-temporal prefetches (prefetchnta) to ask the processor to load data directly into the line fill buffers instead of the L3 cache, resulting in less pollution (here is a related answer). This may be slower with few cores due to lower concurrency, but it should be a bit faster with many cores. Note that it does not solve the problem of fetched addresses looking more random from the IMCs' point of view, and there is not much to do about that.
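For illustration, here is a minimal sketch of the non-temporal software prefetching mentioned above, using the _mm_prefetch intrinsic with the NTA hint; the prefetch distance and the simple reduction loop are untuned guesses rather than measured optima:

    #include <immintrin.h>
    #include <stddef.h>

    /* PF_DIST is an untuned guess for the prefetch distance, in bytes. */
    #define PF_DIST 512

    double stream_sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* One prefetch per 64-byte line (8 doubles); prefetches never fault,
               so running past the end of the array is harmless. */
            if (i % 8 == 0)
                _mm_prefetch((const char *)&a[i] + PF_DIST, _MM_HINT_NTA);
            s += a[i];
        }
        return s;
    }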
The low-level architecture of DRAM and caches is very complex in practice. More information about memory can be found in the following links:
What Every Programmer Should Know About Memory
Introduction to High Performance Scientific Computing (Section 1.3)
Lecture: Main Memory and the DRAM System
Short lectures: Dynamic Random Access Memory (in 7 parts)
Intel® 64 and IA-32 Architectures Software Developer's Manual (Volume 3)

Is there a way to measure cache coherence misses

Given a program running on multiple cores, if two or more cores are operating on the same cache line, is there a way to measure the number of cache-coherence invalidations/misses that occur (i.e. when Core1 writes to the cache line, forcing Core2 to refresh its copy of the cache line so that both cores stay consistent)?
Let me know if I'm using the wrong terminology for this concept.
Yes, hardware performance counters can be used to do so.
However, the way to read them tends to depend on the operating system and your processor. On Linux, the perf tool can be used to track performance counters (specifically perf stat -e COUNTER_NAME_1,COUNTER_NAME_2,etc.). Alternatively, on both Linux and Windows, Intel VTune can do this too.
The list of available hardware counters can be retrieved using perf list (or with PMU-Tools).
The kind of metric you want to measure looks like a Request For Ownership (RFO) in the MESI cache-coherence protocol. Fortunately, most modern (x86-64) processors include hardware events to measure RFOs. On Intel Skylake processors, there is a hardware event called l2_rqsts.all_rfo, and more precisely l2_rqsts.rfo_hit and l2_rqsts.rfo_miss, to measure this at the L2-cache level. Alternatively, there are many more advanced RFO-related hardware events that can be used at the offcore level.
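To see these counters move, a minimal sketch of a workload that generates many RFOs is two threads repeatedly writing to the same cache line (the struct layout and iteration count below are only illustrative):

    #include <pthread.h>
    #include <stdio.h>

    /* Both counters share one 64-byte cache line, so each write by one thread
       forces an RFO that steals the line from the other core. The increments are
       deliberately non-atomic: the point is only to generate write traffic. */
    struct { volatile long a; volatile long b; } shared;

    #define ITERS 100000000L

    static void *bump_a(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) shared.a++; return NULL; }
    static void *bump_b(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) shared.b++; return NULL; }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", shared.a, shared.b);
        return 0;
    }

Build it with -pthread and run it under perf stat -e l2_rqsts.all_rfo; padding each counter out to its own 64-byte line should make the RFO count drop sharply.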

Hyper-threading and gaming (and other computing applications)?

I was wondering what the real-world performance effects of hyperthreading (multiple logical cores for each physical core) are in different situations. Intel advertises it as being effective when threads of execution are waiting for I/O; however, in memory-intensive applications it can be ineffective, because when a switch occurs between logical cores, locality is lost in the processor cache. The second application's data is loaded into cache, forcing the first application's data out. Upon returning to the first application, all of its references are cache misses and performance is lost. I know several supercomputer managers who claim that they turn off hyperthreading because doing so is more efficient in their cases. Are there "normal" user cases where disabling hyperthreading is more efficient? Gaming can be pretty memory intensive: would it be better without hyperthreading?
First, it should be recognized that hyperthreading is an Intel marketing term labelling Switch-on-Event MultiThreading (on Itanium) and Simultaneous MultiThreading (on x86). SoEMT is primarily beneficial in hiding high latency events such as last level cache misses, is easier to implement, and is friendlier to VLIW-like scheduling. SoEMT is also a better fit for a small L1 (given a somewhat fast L2) than SMT since cache contention is moved more to L2 or L3 (thousands of accesses between thread switches) which can better handle contention given their greater capacity and higher associativity. SMT can be useful in hiding smaller latencies like branch resolution delay or L2 cache hits and provides instruction level parallelism, but introduces more intense contention for resources.
(There is also a difference between disabling hyperthreading and not using hyperthreading. Disabling hyperthreading might provide a small performance benefit in that some shareable resources will be used even by an inactive but enabled thread and some partitioned resources may still use a small amount of power, but the primary benefit would be in preventing the OS from making disruptive scheduling decisions.)
For "normal" code, the available thread-level parallelism may well be lower than the number of cores available. In that case, a modern OS typically will not use the hardware multithreading since it recognizes that a full core has more performance than a core shared by more than one thread. (Sharing a core can theoretically improve performance in special cases where using L1 to communicate between threads is unusually helpful. In addition, waking an inactive thread on an active core is much faster and requires less energy than waking up a core, so using multithreading might be helpful for energy efficiency in some special cases.)
HPC codes tend to be the worst case for SMT. HPC code is more likely to be friendly to static scheduling. This means that the latency-hiding benefits of SMT tend to be minimized. (Similarly, HPC code tends to benefit less from out-of-order execution.) HPC code also tends to be constrained by memory bandwidth rather than memory latency. SMT can increase the bandwidth demand per unit of execution (by increasing cache misses) and reduce the actually achieved memory bandwidth through contention at the memory controller. (DRAM is not friendly to random access; such accesses cause excessive row activate and precharge cycles.) SMT may also cause the number of active data streams to exceed the hardware's support for prefetching. HPC code is also more likely to be blocked according to cache sizes assuming one thread per core; in such cases SMT will produce significant cache thrashing.
Disabling hyperthreading may also be friendlier to gang-scheduled operation, which is common in HPC. If only some of the cores are using multithreading, those cores might have higher performance per core yet would have lower performance per thread; that forces other cores to idly wait for the slowed threads to complete. (HPC systems may have dedicated OS cores and spare cores to avoid similar problems, where OS activity would slow down one core/thread and force hundreds of others to wait or where a failed core could cause, e.g., a 16-thread gang scheduled program to run 15 threads and then one thread, doubling execution time.)
(In theory, SMT could be used in HPC to reduce register pressure in some optimized loops since the effective latency of operations like FMADD in a dual threaded core may be viewed as roughly being halved. Since compilers generally use a fixed latency for scheduling [SMT is treated as a transparent feature], exploiting this feature is not generally practical even when it could be beneficial.)
Rather like out-of-order execution, SMT is most beneficial for irregular code. (OoO looks ahead in a single code stream for instruction level and memory level parallelism; SMT looks "sideways" across threads for such parallelism.) If branch mispredictions and cache misses are common, SMT can use existing thread-level parallelism to hide such latencies (the cost of a branch misprediction is largely in the latency of resolution).
The benefit from SMT varies by workload and by the specific hardware. A deeply pipelined in-order microarchitecture like the initial Intel Atom benefits more from SMT than a shallower pipelined OoO microarchitecture would (latencies, especially branch resolution latency, being generally higher with longer pipelines and OoO providing some parallelism that would otherwise be used by SMT's thread-level parallelism).
Enabled hyperthreading may also have the disadvantage of increasing the number of threads used by an application, where performance scaling with increased thread count is sufficiently sublinear that the lower per-thread performance with hyperthreading results in a net loss of performance. E.g., if two-thread-per-core hyperthreading provided a 30% increase in per-core performance and doubling the thread count increased application performance by only 50%, then total performance would decrease by 2.5% (1.5 × 1.3 / 2 = 0.975).
The standard advice of "when in doubt, measure" obviously applies.
Obviously some people don't understand some things. I have done some reading; here is what I copied from a site:
Depending on when you last bought a computer, you may remember Hyper-Threading as a feature that Intel introduced and then discontinued. This could understandably leave a sour taste in your mouth – why would Intel discontinue it if it wasn’t trouble?
The truth isn’t so grim. Hyper-Threading was for a time made available on certain Intel Pentium 4 and Intel Xeon processors. It was discontinued not because the feature itself was bad, but rather because the processor that used it turned out to be a bit of a misstep for other reasons. The Pentium 4 architecture was a minor disaster for Intel because it was incapable of going the direction Intel hoped (Intel wanted to have Pentium 4 processors with clock speeds of up to 10 GHz). As a result, Intel jumped back to designing processors based on the Pentium Pro family tree.
Hyper-Threading was gone, but not forgotten. Intel eventually found the time and resources to integrate it into another new processor architecture - Nehalem. This is the architecture that is the basis for all current Intel Core i3, i5 and i7 processors.
Source: http://www.makeuseof.com/tag/hyperthreading-technology-explained/

Profiling CPU Cache/Memory from the OS/Application?

I wish to write software which could essentially profile the CPU cache (L2,L3, possibly L1) and the memory, to analyze performance.
Am I right in thinking this is not doable, because software has no access to the cache contents?
Another way of wording my Q: is there any way to know, from the OS/Application level, what data has been loaded into cache/memory?
EDIT: The operating system is Windows or Linux, and the CPU is an Intel desktop or Xeon processor.
You might want to look at Intel's PMU i.e. Performance Monitoring Unit. Some processors have one. It is a bunch of special purpose registers (Intel calls them Model Specific Registers, or MSRs) which you can program to count events, like cache misses, using the RDMSR and WRMSR instructions.
Here is a document about Performance Analysis on i7 and Xeon 5500.
You might want to check out Intel's Performance Counter Monitor, which is basically a set of routines that abstract the PMU and can be used in a C++ application to measure several performance metrics live, including cache misses. It also has some GUI/command-line tools for standalone use.
Apparently, the Linux kernel has a facility for manipulating MSRs.
There are other utilities/APIs that also use the PMU: perf, PAPI.
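As an illustration of that MSR facility, here is a minimal sketch that reads a performance-counter MSR through /dev/cpu/N/msr. It assumes root privileges and a loaded msr kernel module, and it skips the event-select programming (via WRMSR) that would make the counter actually count cache events; see the Intel SDM Volume 3 for the encodings:

    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Read one MSR on one CPU via the msr kernel module (/dev/cpu/N/msr).
       The offset passed to pread() is the MSR address. */
    static int read_msr(int cpu, uint32_t addr, uint64_t *value)
    {
        char path[64];
        snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        ssize_t n = pread(fd, value, sizeof *value, addr);
        close(fd);
        return n == (ssize_t)sizeof *value ? 0 : -1;
    }

    int main(void)
    {
        uint64_t pmc0;
        /* IA32_PMC0 (0xC1) only holds meaningful data after IA32_PERFEVTSEL0 (0x186)
           has been programmed with an event such as an L2 miss; that step is omitted here. */
        if (read_msr(0, 0xC1, &pmc0) == 0)
            printf("IA32_PMC0 on CPU 0 = %" PRIu64 "\n", pmc0);
        else
            perror("read_msr");
        return 0;
    }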
Cache performance is generally measured in terms of hit rate and miss rate.
There are many tools to do this for you. Check how Valgrind does cache profiling.
Also, cache performance is generally measured on a per-program basis: well-written programs result in fewer cache misses and better cache performance, and vice versa for poorly written code.
Measuring the actual cache speed is the hardware manufacturer's concern; you can refer to their manuals for those figures.
The Callgrind/Cachegrind combination can help you track cache hits/misses.
This has some examples.
TAU, an open-source profiler built on PAPI, can also be used.
If, however, you want to write code to measure the cache statistics yourself, you can write a program using PAPI. PAPI lets the user access the hardware counters without needing to know the system architecture.
The PMU uses Model-Specific Registers, so using it directly requires knowing which registers to program.
Perf allows measurement of L1 and LLC (last-level cache) events; Cachegrind, on the other hand, lets the user measure L1 and LL (which can be L2 or L3, whichever is the highest cache level). Use Cachegrind only if you do not need fast results, because it runs the program about 10x slower.
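If you prefer to read the same counters programmatically rather than through the perf command-line tool, Linux exposes them via the perf_event_open system call. A minimal sketch counting misses in the last-level cache might look like this (the generic PERF_COUNT_HW_CACHE_MISSES event and the buffer walk are illustrative; the exact event mapping is CPU-dependent):

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof attr;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* generic "LLC misses" event */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* Count for this thread on any CPU. May require a relaxed
           /proc/sys/kernel/perf_event_paranoid setting. */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        size_t n = 64u * 1024 * 1024;               /* larger than a typical LLC */
        volatile char *buf = malloc(n);

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (size_t i = 0; i < n; i += 64)          /* region of interest */
            buf[i]++;
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        long long misses = 0;
        read(fd, &misses, sizeof misses);
        printf("LLC misses: %lld\n", misses);

        free((void *)buf);
        close(fd);
        return 0;
    }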

why are separate icache and dcache needed [duplicate]

This question already has an answer here:
What does a 'Split' cache means. And how is it useful(if it is)?
(1 answer)
Can someone please explain what we gain by having separate instruction and data caches?
Any pointers to a good link explaining this will also be appreciated.
The main reason is: performance. Another reason is power consumption.
Having a separate dCache and iCache makes it possible to fetch instructions and data in parallel.
Instructions and data have different access patterns.
Writes to the iCache are rare. CPU designers optimize the iCache and the CPU architecture based on the assumption that code changes are rare. For example, the AMD Software Optimization Guide for 10h and 12h Processors states that:
Predecoding begins as the L1 instruction cache is filled. Predecode information is generated and stored alongside the instruction cache.
The Intel Nehalem CPU features a loop buffer, and in addition the Sandy Bridge CPU features a µop cache (see The microarchitecture of Intel, AMD and VIA CPUs). Note that these are features related to code and have no direct counterpart in relation to data. They benefit performance, and since Intel "prohibits" CPU designers from introducing features which would excessively increase power consumption, they presumably also benefit total power consumption.
Most CPUs feature a data forwarding network (store-to-load forwarding). There is no store-to-load forwarding for code, simply because code is modified much less frequently than data.
Code exhibits different patterns than data.
That said, most CPUs nowadays have a unified L2 cache which holds both code and data. The reason for this is that having separate L2I and L2D caches would pointlessly consume the transistor budget while failing to deliver any measurable performance gains.
(Surely, the reason for having a separate iCache and dCache isn't reduced complexity, because if the reason were reduced complexity, then there wouldn't be any pipelining in any current CPU design. A CPU with pipelining is more complex than a CPU without pipelining. We want the increased complexity. The fact is: the next CPU design is (usually) more complex than the previous design.)
It has to do with which functional units of the CPU primarily access that cache. Since the ALU and FPU access the data cache while the decoder and scheduler access the instruction cache, and pipelining often allows the instruction-fetch part of the processor and the execution units to work simultaneously, using a single cache would cause contention between these two components. By separating them, we lose some flexibility but gain the ability for these two major parts of the processor to fetch data from cache simultaneously.
One reason is reduced complexity: you could instead implement a single shared cache that can retrieve multiple lines at once, or serve requests asynchronously (see Hit-Under-Miss), but doing so makes the cache controller far more complicated.
Another reason is execution stability - if you have a known amount of icache and dcache, caching of data cannot starve the cache system of instructions, which may occur in a simplistic shared cache.
And as Dan stated, having them separated makes pipelining easier, without adding to the controller complexity.
Since the processor's MEM and FETCH stages can access the L1 cache (assume it is unified) simultaneously, there can be a conflict over which one gets priority (this can become a performance bottleneck). One way to resolve this is to give the L1 cache two read ports, but increasing the number of ports increases the cache area quadratically and hence increases power consumption.
Also, if the L1 cache is unified, there is a chance that data blocks will replace blocks containing instructions that were important and about to be accessed. These evictions and the cache misses that follow can hurt the overall performance.
Also, most of the time the processor fetches instructions sequentially (with a few exceptions like taken branch targets and jumps), which gives the instruction cache good spatial locality and hence a good hit rate. And, as mentioned in other answers, there are hardly any writes to the iCache (self-modifying code, such as that produced by JIT compilers, is rare). So separate iCache and dCache designs can each be optimized for their own access patterns and for related components such as load/store queues and write buffers.
There are generally two kinds of architecture: 1. the von Neumann architecture and 2. the Harvard architecture. The Harvard architecture uses two separate memories. You can read more on this ARM page: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3839.html

Resources