Linux perf to measure memory bandwidth on AMD EPYC 2nd Gen - performance

How could I measure the memory bandwidth of an application using perf and mpirun? I would like to know if this application is memory bandwidth bound.

Related

Explanation for why effective DRAM bandwidth reduces upon adding CPUs

This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system
I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores @ 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using Intel Advisor's roofline plot, which depicts the bandwidths of each CPU data path available. According to it, the bandwidth is 230 GB/s.
Strong scaling of bandwidth:
Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?
Overview
This answer provides probable explanations. Put shortly, no parallel workload scales infinitely. When many cores compete for the same shared resource (e.g. DRAM), using too many cores is often detrimental: there is a point where enough cores saturate a given shared resource, and adding more cores only increases the overhead.
More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and non-temporal prefetching should improve the performance and scalability of your benchmark a bit. Still, there are other architectural hardware limitations that can cause the benchmark not to scale well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.
Under the hood
In your case, the layout of the Intel Xeon Skylake SP processors is as follows:
Source: Intel
There are two sockets connected by a UPI interconnect, and each processor is connected to its own set of DRAM. There are 2 Integrated Memory Controllers (IMCs) per processor, and each is connected to 3 DDR4 DRAMs @ 2666 MHz. This means the theoretical bandwidth is 2 sockets * 2 IMCs * 3 channels * 2666e6 transfers/s * 8 bytes = 256 GB/s = 238 GiB/s.
Assuming your benchmark is well designed and each processor accesses only its own NUMA node, I would expect a very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters; Linux perf or VTune let you do this relatively easily.
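A complementary, purely software-side check (my own sketch, not something prescribed by this answer) is to ask the kernel which NUMA node each page of the benchmark buffer currently resides on, using the move_pages(2) syscall from libnuma, and count the remote pages:

```cpp
// Sketch: report the NUMA node of each page of a buffer (link with -lnuma).
// Passing a NULL target-node array makes move_pages() only query placement.
#include <numaif.h>
#include <cstddef>
#include <cstdio>
#include <vector>

void report_page_nodes(void* buf, std::size_t bytes, std::size_t page_size = 4096) {
    std::size_t n_pages = (bytes + page_size - 1) / page_size;
    std::vector<void*> pages(n_pages);
    std::vector<int> status(n_pages);
    for (std::size_t i = 0; i < n_pages; ++i)
        pages[i] = static_cast<char*>(buf) + i * page_size;
    if (move_pages(0 /* this process */, n_pages, pages.data(), nullptr,
                   status.data(), 0) == 0)
        for (std::size_t i = 0; i < n_pages; ++i)
            std::printf("page %zu -> node %d\n", i, status[i]);
}
```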
The L3 cache is split into slices. All physical addresses are distributed across the cache slices using a hash function (see here for more information). This enables the processor to balance the throughput across all the L3 slices. It also enables the processor to balance the throughput between the two IMCs, so that in the end the processor looks like an SMP architecture rather than a NUMA one. The same approach was also used in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).
Hashing does not guarantee perfect balancing though (no hash function is perfect, especially the ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. Poor balancing decreases the memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
With a good hash function, the balancing should be independent of the number of cores used. If the hash function is not good enough, one IMC can be more saturated than the other, oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for each IMC's throughput, but their granularity is quite coarse. On my Skylake machine the hardware counters are named uncore_imc/data_reads/ and uncore_imc/data_writes/, but on your platform you certainly have 4 counters for that (one for each IMC).
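For illustration, here is a minimal streaming-read kernel (my own sketch, not the asker's benchmark) whose measured bandwidth could be cross-checked against those IMC counters; the perf invocation in the comment reuses the counter names quoted above, which differ between platforms:

```cpp
// Build e.g.: g++ -O2 -fopenmp bandwidth.cpp -o bandwidth
// Possible system-wide measurement (counter names vary by platform):
//   perf stat -a -e uncore_imc/data_reads/,uncore_imc/data_writes/ ./bandwidth
// Scale the core count via OMP_NUM_THREADS to reproduce the strong-scaling curve.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    constexpr std::size_t n = 1ull << 27;        // 1 GiB of doubles, far larger than the 33 MB L3
    std::vector<double> a(n, 1.0);

    double sum = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for reduction(+ : sum)  // streaming reads spread over the threads
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("sum=%g  read bandwidth ~ %.1f GB/s\n",
                sum, n * sizeof(double) / secs / 1e9);
}
```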
Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like yours. The idea is to split the processor into two NUMA nodes that each have their own dedicated IMC. This solves the balancing issue caused by the hash function and so results in faster memory operations, as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects. In the worst case, the pages of an application can all be mapped to the same NUMA node, resulting in only half the bandwidth being usable. Since your benchmark is supposed to be NUMA-friendly, SNC should be more efficient.
Source: Intel
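One standard way to keep a benchmark NUMA-friendly (and therefore SNC-friendly) is first-touch placement: each thread initializes the pages it will later read, so the OS maps them on that thread's local node. This is a general technique, not something the answer above prescribes; a hypothetical sketch:

```cpp
#include <cstddef>

// Allocate without touching the pages, then let each OpenMP thread touch the
// part it will later work on; under Linux's first-touch policy the pages are
// then mapped on the touching thread's local NUMA node (or SNC sub-node).
double* numa_friendly_alloc(std::size_t n) {
    double* a = new double[n];                  // default-init: pages not yet faulted in
    #pragma omp parallel for schedule(static)   // must match the schedule of the compute loop
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0.0;                             // first write decides each page's node
    return a;
}
```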
Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines, which then need to be fetched again later when the cores actually need them (with an additional DRAM latency to pay). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAM, hardware prefetching units have to prefetch data a long time in advance so as to reduce the impact of the latency. They also need to keep a lot of requests in flight concurrently. This is generally not a problem with sequential accesses, but more cores cause the accesses to look more random from the caches' and IMCs' point of view. The thing is that DRAM is designed so that contiguous accesses are faster than random ones (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the value of the LLC-load-misses hardware counter to check whether more data is re-fetched with more threads (I see such an effect on my Skylake-based PC with only 6 cores, but it is not strong enough to cause any visible impact on the final throughput). To mitigate this problem, you can use software non-temporal prefetching (prefetchnta) to request that the processor load data directly into the line fill buffer instead of the L3 cache, resulting in less pollution (here is a related answer). This may be slower with fewer cores due to lower concurrency, but it should be a bit faster with a lot of cores. Note that this does not solve the problem of fetched addresses looking more random from the IMCs' point of view, and there is not much to do about that.
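A hedged sketch of what a prefetchnta-based read loop could look like (the _mm_prefetch intrinsic and the _MM_HINT_NTA hint are standard; the prefetch distance is only a guess to be tuned):

```cpp
#include <immintrin.h>
#include <cstddef>

// Streaming read with software non-temporal prefetch: prefetched lines bypass
// most of the cache hierarchy, so they pollute the shared L3 less.
double sum_with_nta_prefetch(const double* a, std::size_t n) {
    constexpr std::size_t PF_DIST = 64;  // elements ahead = 8 cache lines; tune per machine
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PF_DIST < n)
            _mm_prefetch(reinterpret_cast<const char*>(&a[i + PF_DIST]), _MM_HINT_NTA);
        sum += a[i];
    }
    return sum;
}
```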
The low-level architecture of DRAM and caches is very complex in practice. More information about memory can be found in the following links:
What Every Programmer Should Know About Memory
Introduction to High Performance Scientific Computing (Section 1.3)
Lecture: Main Memory and the DRAM System
Short lectures: Dynamic Random Access Memory (in 7 parts)
Intel® 64 and IA-32 Architectures Software Developer's Manual (Volume 3)

Relation of CPU speed and NIC throughput

I'm testing the performance of a DPDK-based Open vSwitch implementation (github.com/01org/dpdk-ovs) on the following server:
Intel Xeon E3 CPU @ 3.30 GHz
Intel 1G NIC I210
8 GB RAM
Basically my setup includes two ports: traffic enters port0 and is forwarded by DPDK to port1. The performance is quite low, even though I isolated the DPDK processes on distinct cores of the machine. I didn't do IRQ affinitization because DPDK uses poll-mode user-space drivers.
Now I'm beginning to wonder whether the CPU speed of the server may have an impact on the overall performance. I mean, with regard to NIC speed and packet processing performance, is it normal to slow down the CPU, i.e. drop the frequency, in order to achieve better performance, or does that sound stupid?
Thanks.
It is very unlikely that lowering the CPU frequency would help packet processing keep up with the NIC. Ideally, when the NIC is working at full capacity, DMA'ing packet buffers from NIC memory into system memory does not require any CPU cycles. CPU cycles are only spent when the DPDK PMD's rte_eth_rx_burst()/rte_eth_tx_burst() calls are performed to receive or transmit, whether on two different physical CPUs (as in your case) or on hyper-threaded lcores. Hence, to process at full NIC capacity you might need additional cores, but slowing down the CPU will not help.
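For context, here is a minimal sketch of the poll-mode rx/tx burst loop the answer refers to (port/queue setup, mempool creation and error handling are deliberately omitted; the real reference is DPDK's l2fwd sample):

```cpp
#include <rte_ethdev.h>
#include <rte_mbuf.h>

// One lcore busy-polls rx_port and forwards bursts to tx_port. Because the
// driver is a poll-mode driver, the core spins and no interrupts are involved.
static void forward_loop(uint16_t rx_port, uint16_t tx_port) {
    struct rte_mbuf* bufs[32];
    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, 32);
        if (nb_rx == 0)
            continue;
        uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);
        // Free whatever the TX queue could not accept (those packets are dropped).
        for (uint16_t i = nb_tx; i < nb_rx; ++i)
            rte_pktmbuf_free(bufs[i]);
    }
}
```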

Approximate latency to access caches and main memory via QPI (dual socket/processor)

This thread has a good list of the times it takes to access various parts of the computer architecture in a uniprocessor environment. What about in a dual-processor environment, over Intel's QPI bus?
Let's assume a 64-byte packet of memory is allocated on the first CPU. The second CPU has to access it via an 8.0 GT/s QPI link, so I know the serialization latency alone is ~4 ns. What additional latency should I expect on the QPI bus?
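As a sanity check on that figure, the ~4 ns can be reproduced from the per-direction QPI data rate, assuming the commonly quoted 2 bytes of payload per transfer per direction:

```latex
t_{\text{serialize}} \approx \frac{64\ \text{B}}{8.0\ \text{GT/s} \times 2\ \text{B/transfer}}
                    = \frac{64\ \text{B}}{16\ \text{GB/s}} = 4\ \text{ns}
```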

Disparity between bus throughput and CPU throughput and their effect on sequential and parallel computing

What is the disparity between bus throughput and CPU throughput? How does this adversely impact sequential computing? How does this adversely impact parallel computing?
If your CPU can access its cache in 1 ns, but your memory takes 60 ns to deliver a random memory word, at some point your processor is going to read memory at a 60x slower rate than the cache. If you are processing a lot of data, you may see a tremendous slowdown, even for sequential programs.
If you have multiple CPUs, they will collectively place a higher bandwidth demand on the bus. Imagine a serial-access bus with 64 CPUs all trying to read from it: only one succeeds at any given moment. The consequence is that it is hard to get a parallel speedup of 64 in such a system, unless each processor stays entirely within its cache.
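To make the latency gap concrete, here is a hypothetical pointer-chasing sketch (sizes and step counts are arbitrary choices): every load depends on the previous one, so the time per step approximates the access latency of whichever level of the hierarchy the working set fits in.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Walk a random cyclic permutation; each load's address comes from the
// previous load, so the loads cannot overlap and the loop exposes latency.
static double chase_ns_per_step(std::size_t n, std::size_t steps) {
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {              // Sattolo's algorithm:
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);                   // one cycle over all elements
    }
    std::size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();
    volatile std::size_t sink = idx; (void)sink;             // keep the chase from being optimized out
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    // ~32 KiB fits in L1; ~1 GiB forces most loads out to DRAM.
    std::printf("L1-sized:   %.1f ns/step\n", chase_ns_per_step(4 * 1024, 20000000));
    std::printf("DRAM-sized: %.1f ns/step\n", chase_ns_per_step(128 * 1024 * 1024, 20000000));
}
```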

How can I read from the pinned (lock-page) RAM, and not from the CPU cache (use DMA zero-copy with GPU)?

If I use DMA for RAM <-> GPU transfers in CUDA C++, how can I be sure that the memory will be read from the pinned (page-locked) RAM, and not from the CPU cache?
After all, with DMA the CPU does not know anything about the fact that someone changed the memory, nor about the need to synchronize the CPU caches with RAM. And as far as I know, std::memory_barrier() from C++11 does not help with DMA and will not force a read from RAM; it only results in consistency between the L1/L2/L3 caches. Furthermore, in general there is no protocol to resolve conflicts between cache and RAM on the CPU, only protocols that synchronize the different levels of CPU cache (L1/L2/L3) and multiple CPUs in a NUMA system: MOESI/MESIF.
On x86, the CPU does snoop bus traffic, so this is not a concern. On Sandy Bridge class CPUs, the PCI Express bus controller is integrated into the CPU, so the CPU actually can service GPU reads from its L3 cache, or update its cache based on writes by the GPU.
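Complementing that answer, here is a hedged sketch (my own, using the standard CUDA runtime calls for page-locked, mapped memory) of how a zero-copy buffer is typically set up:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);                 // allow mapping pinned host memory

    void* h_raw = nullptr;
    cudaHostAlloc(&h_raw, 1 << 20, cudaHostAllocMapped);   // pinned (page-locked) + mapped
    float* h_buf = static_cast<float*>(h_raw);

    void* d_raw = nullptr;
    cudaHostGetDevicePointer(&d_raw, h_buf, 0);            // device-visible alias of the same pages
    float* d_view = static_cast<float*>(d_raw);

    // A kernel launched with d_view reads the host pages directly over PCIe (zero-copy);
    // on x86 the CPU snoops that DMA traffic, so no manual cache flush is needed.
    std::printf("host %p is visible to the device as %p\n", (void*)h_buf, (void*)d_view);

    cudaFreeHost(h_buf);
    return 0;
}
```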

Resources