How to calculate the theoretically maximal CPU-RAM bandwidth? - performance

CUDA (GPGPU) programmers are taught early on to measure their programs' bandwidth to see if they come close to the theoretical maximum (which can be looked up in the GPU spec or measured using standard utilities).
This is obviously useful for programs that are memory bandwidth-bound.
Is there a way to calculate the theoretically maximal CPU-RAM bandwidth instead (considering that typical setups are multi-DIMM and multi-core, and sometimes multi-socket)?

You can find the relevant information on ark.intel.com. For example, for the Core i7-6700K the site lists a maximum memory bandwidth of 34.1 GB/s, which is 8 bytes/transfer (64-bit bus per channel) * 2 (memory channels) * 2.133 GT/s (transfer rate of DDR4-2133) = 34.128 GB/s.
For more recent processors not all of this information is listed, but you can still compute it: the Core i9-9980XE has 8 * 4 * 2.666 = 85.312 GB/s peak bandwidth.
This assumes the optimal DDR4 memory modules are installed.
Piriform's Speccy tool can show you this hardware information in its RAM section:
In this case, two DDR4-2133 modules across two channels, i.e. 8 bytes * 2 channels * 2.133 GT/s = 34.1 GB/s.
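As a sanity check, here is a minimal C++ sketch of the same calculation; the constants are the i7-6700K example values from above and are meant to be replaced with your own platform's channel count and transfer rate.

    #include <cstdio>

    int main() {
        // Peak DRAM bandwidth = bytes per transfer * channels * transfer rate.
        // Example values: i7-6700K with dual-channel DDR4-2133 (replace with your own platform's).
        const double bytes_per_transfer = 8.0;   // 64-bit memory bus per channel
        const double channels           = 2.0;   // dual channel
        const double giga_transfers     = 2.133; // DDR4-2133 -> 2.133 GT/s
        const double peak_gb_per_s = bytes_per_transfer * channels * giga_transfers;
        std::printf("Theoretical peak bandwidth: %.3f GB/s\n", peak_gb_per_s); // ~34.128 GB/s
        return 0;
    }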

Related

Explanation for why effective DRAM bandwidth reduces upon adding CPUs

This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system
I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores @ 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using Intel Advisor's roofline plot, which depicts the bandwidth of each data path available. According to it, the DRAM bandwidth is 230 GB/s.
Strong scaling of bandwidth (plot of effective bandwidth vs. number of CPUs used):
Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?
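For reference, a minimal sketch of this kind of bandwidth micro-benchmark (not the actual benchmark used here) might look like the following; it assumes OpenMP, arrays much larger than the L3 cache, thread pinning (e.g. OMP_PROC_BIND=close), and first-touch initialisation so each thread's pages end up on its own NUMA node.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>

    int main() {
        const std::size_t n = 1ull << 27;   // 128M doubles = 1 GiB per array, far above L3

        // Allocated but not value-initialised on purpose: the parallel loop below
        // performs the first touch, so pages land on each thread's NUMA node.
        double* a = new double[n];
        double* b = new double[n];

        #pragma omp parallel for schedule(static)
        for (std::size_t i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 0.0; }

        const auto t0 = std::chrono::steady_clock::now();
        #pragma omp parallel for schedule(static)
        for (std::size_t i = 0; i < n; ++i) b[i] = a[i];   // copy kernel: 1 read + 1 write
        const auto t1 = std::chrono::steady_clock::now();

        const double secs  = std::chrono::duration<double>(t1 - t0).count();
        const double bytes = 2.0 * double(n) * sizeof(double); // "useful" bytes; write-allocate
                                                                // makes the real DRAM traffic higher
        std::printf("Effective bandwidth: %.1f GB/s\n", bytes / secs / 1e9);

        delete[] a;
        delete[] b;
        return 0;
    }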
Overview
This answer provides probable explanations. Put shortly, no parallel workload scales indefinitely. When many cores compete for the same shared resource (e.g. DRAM), using too many cores is often detrimental: there is a point where enough cores already saturate a given shared resource, and using more cores only increases the overheads.
More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and using non-temporal prefetches should improve the performance and the scalability of your benchmark a bit. Still, there are other architectural hardware limitations that can cause the benchmark not to scale well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.
Under the hood
The layout of Intel Xeon Skylake SP processors is like the following in your case:
Source: Intel
There are two sockets connected with a UPI interconnect and each processor is connected to its own set of DRAM. There are 2 integrated memory controllers (IMCs) per processor and each is connected to 3 DDR4 DIMMs @ 2666 MHz. This means the theoretical bandwidth is 2 (sockets) * 2 (IMCs) * 3 (channels) * 2666e6 (transfers/s) * 8 (bytes/transfer) = 256 GB/s = 238 GiB/s.
Assuming your benchmark is well designed and each processor accesses only its own NUMA node, I expect a very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters; Linux perf or VTune let you do so relatively easily.
The L3 cache is split into slices. All physical addresses are distributed across the cache slices using a hash function (see here for more information). This method lets the processor balance the throughput between all the L3 slices. It also lets the processor balance the throughput between the two IMCs so that, in the end, the processor looks like an SMP architecture rather than a NUMA one. The same approach was used in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).
Hashing does not guarantee perfect balancing though (no hash function is perfect, especially ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. Bad balancing decreases the memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
With a good hash function, the balancing should be independent of the number of cores used. If the hash function is not good enough, one IMC can end up more saturated than the other, oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for each IMC's throughput, but their granularity is quite coarse. On my Skylake machine the hardware counters are named uncore_imc/data_reads/ and uncore_imc/data_writes/, but on your platform you should have four such counters (one per IMC).
Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like yours. The idea is to split the processor into two NUMA nodes that each have their own dedicated IMC. This solves the balancing issue caused by the hash function and so results in faster memory operations as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects. In the worst case, the pages of an application can all be mapped to the same NUMA node, resulting in only half the bandwidth being usable. Since your benchmark is supposed to be NUMA-friendly, SNC should make it more efficient.
Source: Intel
Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines, which then need to be fetched again later when the cores actually need them (with an additional DRAM latency to pay). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAM, hardware prefetching units have to prefetch data a long time in advance to reduce the impact of that latency, and they also need to keep many requests in flight concurrently. This is generally not a problem with sequential accesses, but more cores make the accesses look more random from the point of view of the caches and IMCs. The thing is, DRAM is designed so that contiguous accesses are faster than random ones (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the value of the LLC-load-misses hardware counter to check whether more data is re-fetched with more threads (I see such an effect on my Skylake-based PC with only 6 cores, but it is not strong enough to cause any visible impact on the final throughput).

To mitigate this problem, you can use software non-temporal prefetches (prefetchnta) to request the processor to load data directly into the line fill buffers instead of the L3 cache, resulting in less pollution (here is a related answer). This may be slower with few cores due to lower concurrency, but it should be a bit faster with a lot of cores. Note that it does not solve the problem of the fetched addresses looking more random from the IMCs' point of view, and there is not much you can do about that.
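If it helps, here is a minimal sketch of what a software non-temporal prefetch looks like in code (assuming x86 and SSE intrinsics; the prefetch distance is an illustrative guess you would tune for your machine):

    #include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA
    #include <cstddef>

    double sum_with_nta_prefetch(const double* data, std::size_t n) {
        const std::size_t prefetch_distance = 64;   // elements ahead (8 cache lines); an assumption, tune it
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + prefetch_distance < n)
                // Ask the core to fetch this line while minimising outer-cache pollution.
                _mm_prefetch(reinterpret_cast<const char*>(data + i + prefetch_distance),
                             _MM_HINT_NTA);
            sum += data[i];
        }
        return sum;
    }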
The low-level architecture of DRAM and caches is very complex in practice. More information about memory can be found in the following links:
What Every Programmer Should Know About Memory
Introduction to High Performance Scientific Computing (Section 1.3)
Lecture: Main Memory and the DRAM System
Short lectures: Dynamic Random Access Memory (in 7 parts)
Intel® 64 and IA-32 Architectures Software Developer's Manual (Volume 3)

Why does the time needed for executing code on the "Hackerrank" site vary significantly?

When you try some coding challenges on www.hackerrank.com and also track the time (e.g. in C++ using the chrono library), you will see that every run of exactly the same code needs a different amount of time to execute. I estimate the variance at 10 to 30 percent. Why do execution times vary significantly for identical code? Which factors account for it?
It could be the server system; but are there even physical reasons (stochastic processes in electronic components)?
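For illustration, a minimal sketch of the kind of measurement described (timing the same work several times with chrono) might look like this:

    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v(10000000, 1);
        for (int run = 0; run < 5; ++run) {
            const auto t0 = std::chrono::steady_clock::now();
            const long long sum = std::accumulate(v.begin(), v.end(), 0LL); // identical work each run
            const auto t1 = std::chrono::steady_clock::now();
            const auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            // The reported times will still differ from run to run on a busy, shared machine.
            std::printf("run %d: %lld us (sum=%lld)\n", run, static_cast<long long>(us), sum);
        }
        return 0;
    }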
Almost certainly just a busy system where your process competes for CPU time against other processes. (And competes for memory bandwidth even when it has a CPU to itself).
Also, it's likely that the server runs on a CPU with SMT (simultaneous multithreading) that presents each physical core as two logical cores, so the "shared execution resources" that processes compete for include stuff inside a single CPU core, not just L3 cache and memory bandwidth.
Intel calls its SMT tech "Hyper-Threading"; most servers these days run on Intel Xeon CPUs. AMD Zen also uses SMT, so either way, unless the server admins disabled it, when the OS schedules tasks onto both logical cores of a single physical core they will slow each other down somewhat. How much depends mostly on their average IPC (instructions per cycle): two threads that have a lot of branch mispredicts will typically not slow each other down much (so combined throughput nearly doubles), but two threads that could each nearly saturate the FP multiplier ALUs on their own will each run at nearly half speed (near-zero gain in throughput).
OSes "know about" SMT / Hyperthreading and can detect which logical cores share a physical core. They try to avoid scheduling threads to the same physical core while there are some physical cores with both threads idle.
See also Modern Microprocessors: A 90-Minute Guide!, which covers SMT, with the background to understand what execution resources are being shared.
but are there even physical reasons (stochastic processes in electronic components)?
No, CPUs are deterministic (except for the rdrand instruction which uses true analog electrical noise as a randomness source).
CPUs do use dynamic voltage and frequency scaling to save power (idle vs. max turbo, or somewhere in between if thermal limits don't allow max turbo.) https://en.wikipedia.org/wiki/Intel_Turbo_Boost / https://en.wikipedia.org/wiki/Dynamic_frequency_scaling
See also Idiomatic way of performance evaluation? for more microbenchmarking pitfalls / warm-up effects.

Discrete GPU to reduce memory contention & improve CPU performance

I have long suspected that the shared RAM of integrated GPUs causes memory contention and significantly slows CPU performance, especially in the context of compiler and IDE performance.
Have you done any experiments or noticed a difference when adding or removing a discrete graphics card?
Are you aware of any studies on this subject? (I could not find any)
For video there are two uses of memory: reading the frame buffer's contents and sending them to the monitor every frame, and whatever the GPU happens to be doing.
For the GPU there's no way to guess.
For reading the frame buffer: for a video mode like 1920x1600 with 32 bits per pixel you're looking at 12.288 MB per frame, so at 60 frames per second that's about 0.737 GB/s. A single RAM module is typically capable of tens of GB per second (e.g. DDR4-3200 is 25.6 GB/s according to Wikipedia). From this you can assume reading from the framebuffer consumes less than 10% of one RAM module's bandwidth. Of course most systems have multiple RAM modules and multiple memory channels, so it's likely to be significantly less than 10% of the available RAM bandwidth.
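The same arithmetic as a tiny sketch (the resolution, refresh rate and module bandwidth are just the example numbers above):

    #include <cstdio>

    int main() {
        // Example numbers from the answer above; replace with your own mode / RAM.
        const double width = 1920, height = 1600, bytes_per_pixel = 4, fps = 60;
        const double bytes_per_frame = width * height * bytes_per_pixel;  // ~12.29 MB
        const double scanout_gb_s    = bytes_per_frame * fps / 1e9;       // ~0.737 GB/s
        const double module_gb_s     = 25.6;                              // e.g. one DDR4-3200 module
        std::printf("Scan-out: %.3f GB/s = %.1f%% of one %.1f GB/s module\n",
                    scanout_gb_s, 100.0 * scanout_gb_s / module_gb_s, module_gb_s);
        return 0;
    }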
Also note that CPUs typically satisfy most memory accesses from their caches and only need RAM bandwidth on cache misses (e.g. you could have 8 CPUs pounding their caches and still leave almost all of the usable RAM bandwidth idle); so devices of all types (e.g. disk controllers, network cards, USB controllers, sound cards, discrete and integrated video) using RAM bandwidth won't necessarily affect CPU performance.
There are also other (potentially more significant) factors for performance. For example, for modern integrated video the GPU is in the same package as the CPUs, so when the GPU is going berserk heating up the package, the CPUs may need to slow down to avoid melting everything. Discrete video cards don't have this problem (they have the "spend several hundred extra $$ to be deafened by excessive fan noise while you're sitting in a puddle of your own perspiration" problem instead ;) ).
Mostly, everything involved (which hardware, which software, which other devices) is too variable for a concrete measurement of one specific case to be meaningful, so I wouldn't expect to find any studies.

CPU usage is higher than expected to be possible (over 200% with a total of two cores)

Correct me if I'm wrong, but based on answers to this question and this question, I understand that the percentage of CPU used can easily go up to 100 * number of processors * number of cores per processor without affecting performance too badly; e.g. if I have one processor with two cores, my CPU usage should easily be able to go to 200%.
I just checked with top while training a small neural network in Python / TensorFlow, and Python is consistently using over 300% of my single, dual core processor.
I have not been noticing any poor performance in any other applications. How is this possible?
It could be that your particular CPU supports multiple logical threads per physical core (SMT, e.g. Intel Hyper-Threading), so more logical CPUs are visible than you have physical cores. This question suggests some ways you can check whether this is the case.
With the caveat that I'm kind of guessing, another reason you might see this effect is with Intel's Turbo Boost feature interacting with the way CPU counters and CPU usage are reported.
The 100% * nCores thing breaks down in the presence of things like hyper-threading, CPU frequency scaling, and the murky relationship between what CPU vendors market as "threads" vs. "cores".

Optimum performance of GPU

I have been asked to measure how "efficiently" my code uses the GPU, i.e. what % of peak performance the algorithms are achieving. I am not sure how to do this comparison. Until now I have basically put timers in my code and measured the execution time. How can I compare this to optimal performance and find what the bottlenecks might be? (I did hear about the Visual Profiler but couldn't get it to work; it keeps giving me a "cannot load output" error.)
Each card has a maximum memory bandwidth and processing speed. For example, the GTX 480 bandwidth is 177.4 GB/s. You will need to know the specs for your card.
The first thing to decide is whether your code is memory bound or computation bound. If it is clearly one or the other, that will help you focus on the correct "efficiency" to measure. If your program is memory bound, then you need to compare your bandwidth with the card's maximum bandwidth.
You can calculate memory bandwidth by computing the amount of memory you read/write and dividing it by the run time (I use CUDA events for timing). Here is a good example of calculating bandwidth efficiency (look at the whitepaper for the parallel reduction) and using it to help validate a kernel.
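For example, a minimal sketch of that effective-bandwidth calculation (the byte counts, elapsed time and timing source are placeholders; the 177.4 GB/s figure is the GTX 480 peak quoted above) could be:

    #include <cstdio>

    // Effective bandwidth = (bytes read + bytes written) / elapsed time.
    double effective_bandwidth_gb_s(double bytes_read, double bytes_written, double elapsed_ms) {
        return (bytes_read + bytes_written) / (elapsed_ms * 1e-3) / 1e9;
    }

    int main() {
        // Hypothetical kernel: copies 64M floats and was timed at 4.0 ms (made-up numbers).
        const double n = 64.0 * 1024 * 1024;
        const double bw = effective_bandwidth_gb_s(n * 4, n * 4, 4.0);
        std::printf("Effective: %.1f GB/s, i.e. %.1f%% of the GTX 480's 177.4 GB/s peak\n",
                    bw, 100.0 * bw / 177.4);
        return 0;
    }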
I don't know very much about determining the efficiency if instead you are ALU bound. You can probably count (or profile) the number of instructions, but what is the card's maximum?
I'm also not sure what to do in the likely case that your kernel is something in between memory bound and ALU bound.
Anyone...?
Generally, "efficiency" would probably be a measure of how much memory and how many GPU cycles (average, min, max) your program is using. The efficiency measure would then be avg(mem) / total memory for the time period, and likewise avg(GPU cycles) / max GPU cycles.
Then I'd compare these metrics to metrics from some GPU benchmark suites (which you can assume to be pretty efficient at using most of the GPU). Or you could measure against some random GPU-intensive programs of your choice. That's how I'd do it, but I've never thought to try, so good luck!
As for bottlenecks and "optimal" performance, these are probably NP-complete problems that no one can solve for you. Get out the old profiler and debugger and start working your way through your code.
I can't help with the profiler and micro-optimisation, but there is the CUDA Occupancy Calculator (http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls), which tries to estimate how your CUDA code uses the hardware resources, based on these values (a rough sketch of the calculation follows the list):
Threads Per Block
Registers Per Thread
Shared Memory Per Block (bytes)
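As a rough sketch of what the spreadsheet computes from those three inputs: it works out how many blocks can be resident per SM under each resource limit, takes the minimum, and reports the resulting warp occupancy. The per-SM limits below are illustrative assumptions (they differ by compute capability), and the real calculator also models allocation granularity:

    #include <algorithm>
    #include <cstdio>

    int main() {
        // Kernel launch parameters (example values).
        const int threads_per_block = 256;
        const int regs_per_thread   = 32;
        const int smem_per_block    = 4096;     // bytes

        // Assumed per-SM hardware limits (illustrative only; check your device's specs).
        const int max_threads_per_sm = 1536;
        const int max_blocks_per_sm  = 8;
        const int regs_per_sm        = 32768;
        const int smem_per_sm        = 49152;   // bytes
        const int warp_size          = 32;

        // How many blocks fit per SM under each constraint; the tightest one wins.
        const int by_threads = max_threads_per_sm / threads_per_block;
        const int by_regs    = regs_per_sm / (regs_per_thread * threads_per_block);
        const int by_smem    = smem_per_block ? smem_per_sm / smem_per_block : max_blocks_per_sm;
        const int blocks     = std::min({max_blocks_per_sm, by_threads, by_regs, by_smem});

        const double occupancy = double(blocks * threads_per_block / warp_size)
                               / (max_threads_per_sm / warp_size);
        std::printf("Resident blocks/SM: %d, occupancy: %.0f%%\n", blocks, occupancy * 100);
        return 0;
    }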
