Does the NT Kernel not utilise Multiple Memory Channel Architecture? - performance

I've been reading benchmarks that test the benefits of systems with Multiple Memory Channel Architectures. The general conclusion of most of these benchmarks is that the performance benefits of systems with greater numbers of memory channels over those systems with fewer channels are negligible.
However nowhere have I found an explanation of why this is the case, just benchmark results indicating that this is the real world performance attained.
The theory is that every doubling of the system's memory channels doubles the bandwidth of memory access, so in theory there should be a performance gain, however in real world applications the gains are negligible. Why?
My postulation is that when the NT Kernel allocates physical memory it is not disturbing the allocations evenly across the the memory channels. If all of a process's virtual memory is mapped to a single memory channel within a MMC system then the process will effectively only be able to attain the performance of having a single memory channel at its disposal. Is this the reason for negligible real world performance gains?
Naturally a process is allocated virtual memory and the kernel allocates physical memory pages, so is this negligible performance gain the fault of the NT Kernel not distributing allocations across the available channels?

related: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? two memory controllers is sufficient for single-threaded memory bandwidth. Only if you have multiple threads / processes that all miss in cache a lot do you start to benefit from the extra memory controllers in a big Xeon.
(e.g. your example from comments of running many independent image-processing tasks on different images in parallel might do it, depending on the task.)
Going from two down to one DDR4 channel could hurt even a single-threaded program on a quad-core if it was bottlenecked on DRAM bandwidth a lot of the time, but one important part of tuning for performance is to optimize for data reuse so you get at least L3 cache hits.
Matrix multiplication is a classic example: instead of looping over rows / columns of the whole matrix N^2 times (which is too big to fit in cache) (one row x column dot product for each output element), you break the work up into "tiles" and compute partial results, so you're looping repeatedly over a tile of the matrix that stays hot in L1d or L2 cache. (And you hopefully bottleneck on FP ALU throughput, running FMA instructions, not memory at all, because matmul takes O(N^3) multiply+add operations over N^2 elements for a square matrix.) These optimizations are called "loop tiling" or "cache blocking".
So well-optimized code that touches a lot of memory can often get enough work done as its looping that it doesn't actually bottleneck on DRAM bandwidth (L3 cache miss) most of the time.
If a single channel of DRAM is enough to keep up with hardware prefetch requests for how quickly/slowly the code is actually touching new memory, there won't be any measureable slowdown from memory bandwidth. (Of course that's not always possible, and sometimes you do loop over a big array doing not very much work or even just copying it, but if that only makes up a small fraction of the total run time then it's still not significant.)

The theory is that every doubling of the system's memory channels doubles the bandwidth of memory access, so in theory there should be a performance gain, however in real world applications the gains are negligible. Why?
Think of it as a hierarchy, like "CPU <-> L1 cache <-> L2 cache <-> L3 cache <-> RAM <-> swap space". RAM bandwidth only matters when L3 cache wasn't big enough (and swap space bandwidth only matters if RAM wasn't big enough, and ...).
For most (not all) real world applications, the cache is big enough, so RAM bandwidth isn't important and the gains (of multi-channel) are negligible.
My postulation is that when the NT Kernel allocates physical memory it is not disturbing the allocations evenly across the the memory channels.
It doesn't work like that. The CPU mostly only works with whole cache lines (e.g. 64 byte pieces); and with one channel the entire cache line comes from one channel; and with 2 channels half of a cache line comes from one channel and the other half comes from a different channel. There is almost nothing that any software can do that will make any difference. The NT kernel only works with whole pages (e.g. 4 KiB pieces), so whatever the kernel does is even less likely to matter (until you start thinking about NUMA optimizations, which is a completely different thing).

Related

Does anyone have an example where _mm256_stream_load_si256 (non-tempral load to bypasse cache) actually improves performance?

Consider massiveley SIMD-vectorized loops on very large amounts of floating point data (hundreds of GB) that, in theory, should benefit from non-temporal ("streaming" i.e. bypassing cache) loads/store.
Using non-temp store (_mm256_stream_ps) actually does significantly improve throughput by about ~25% over plain store (_mm256_store_ps)
However, I could not measure any difference when using _mm256_stream_load instead of _mm256_load_ps.
Does anyone have an example where _mm256_stream_load_si256 can be used to actually improves performance ?
(Instruction set & Hardware is AVX2 on AMD Zen2, 64 cores)
for(size_t i=0; i < 1000000000/*larger than L3 cache-size*/; i+=8 )
{
#ifdef USE_STREAM_LOAD
__m256 a = _mm256_castsi256_ps (_mm256_stream_load_si256((__m256i *)source+i));
#else
__m256 a = _mm256_load_ps( source+i );
#endif
a *= a;
#ifdef USE_STREAM_STORE
_mm256_stream_ps (destination+i, a);
#else
_mm256_store_ps (destination+i, a);
#endif
}
stream_load (vmovntdqa) is just a slower version of normal load (extra ALU uop) unless you use it on a WC memory region (uncacheable, write-combining).
The non-temporal hint is ignored by current CPUs, because unlike NT stores, the instruction doesn't override the memory ordering semantics. We know that's true on Intel CPUs, and your test results suggest the same is true on AMD.
Its purpose is for copying from video RAM back to main memory, as in an Intel whitepaper. It's useless unless you're copying from some kind of uncacheable device memory. (On current CPUs).
See also What is the difference between MOVDQA and MOVNTDQA, and VMOVDQA and VMOVNTDQ for WB/WC marked region? for more details. As my answer there points out, what can sometimes help if tuned carefully for your hardware and workload, is NT prefetch to reduce cache pollution. But tuning the prefetch distance is pretty brittle; too far and data will be fully evicted by the time you read it, instead of just missing L1 and hitting in L2.
There wouldn't be much if anything to gain in bandwidth anyway. Normal stores cost a read + an eventual write on eviction for each cache line. The Read For Ownership (RFO) is required for cache coherency, and because of how write-back caches work that only track dirty status on a whole-line basis. NT stores can increase bandwidth by avoiding those loads.
But plain loads aren't wasting anything, the only downside is evicting other data as you loop over huge arrays generating boatloads of cache misses, if you can't change your algorithm to have any locality.
If cache-blocking is possible for your algorithm, there's much more to gain from that, so you don't just bottleneck on DRAM bandwidth. e.g. do multiple steps over a subset of your data, then move on to the next.
See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - most of it; go read Ulrich Drepper's paper.
Anything you can do to increase computational intensity helps (ALU work per time the data is loaded into L1d cache, or into registers).
Even better, make a custom loop that combines multiple steps that you were going to do on each element. Avoid stuff like for(i) A[i] = sqrt(B[i]) if there is an earlier or later step that also does something simple to each element of the same array.
If you're using NumPy or something, and just gluing together optimized building blocks that operate on large arrays, it's kind of expected that you'll bottleneck on memory bandwidth for algorithms with low computational intensity (like STREAM add or triad type of things).
If you're using C with intrinsics, you should be aiming higher. You might still bottleneck on memory bandwidth, but your goal should be to saturate the ALUs, or at least bottleneck on L2 cache bandwidth.
Sometimes it's hard, or you haven't gotten around to all the optimizations on your TODO list that you can think of, so NT stores can be good for memory bandwidth if nothing is going to re-read this data any time soon. But consider that a sign of failure, not success. CPUs have large fast caches, use them.
Further reading:
Enhanced REP MOVSB for memcpy - RFO vs. no-RFO stores (including NT stores), and how per-core memory bandwidth can be limited to the latency-bandwidth product given latency of handing off cache lines to lower levels and the number of LFBs to track them. Especially on Intel server chips.
Non-temporal loads and the hardware prefetcher, do they work together? - no, NT loads are only useful on WC memory, where HW prefetch doesn't work. They kind of exist to fill that gap.

Explanation for why effective DRAM bandwidth reduces upon adding CPUs

This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system
I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores # 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using the Intel Advisor's roof-line plot, which depicts the bandwidths of each CPU data-path available. According to this, the bandwidth is 230 GB/s.
Strong scaling of bandwidth:
Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?
Overview
This answer provides probable explanations. Put it shortly, all parallel workload does not infinitely scale. When many cores compete for the same shared resource (eg. DRAM), using too many cores is often detrimental because there is a point where there are enough cores to saturate a given shared resource and using more core only increase the overheads.
More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and non-temporal prefetch should improve a bit the performances and the scalability of your benchmark. Still, there are other architectural hardware limitations that can cause the benchmark not to scale well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.
Under the hood
The layout of Intel Xeon Skylake SP processors is like the following in your case:
Source: Intel
There are two sockets connected with an UPI interconnect and each processor is connected to its own set of DRAM. There are 2 Integrated Memory Controller (IMC) per processor and each is connected to 3 DDR4 DRAM # 2666MHz. This means the theoretical bandwidth is 2*2*3*2666e6*8 = 256 GB/s = 238 GiB/s.
Assuming your benchmark is well designed and each processor access only to its NUMA node, I expect a very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters. Linux perf or VTune enable you to check this relatively easily.
The L3 cache is split in slices. All physical addresses are distributed across the cache slices using an hash function (see here for more informations). This method enable the processor to balance the throughput between all the L3 slices. This method also enable the processor to balance the throughput between the two IMCs so that in-fine the processor looks like a SMP architecture instead of a NUMA one. This was also use in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).
Hashing does not guarantee a perfect balancing though (no hash function is perfect, especially the ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. A bad balancing decreases the memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
With a good hash function, the balancing should be independent of the number of core used. If the hash function is not good enough, one IMC can be more saturated than the other one oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for the each IMC throughput but they have a limited granularity which is quite big. On my Skylake machine the name of the hardware counters are uncore_imc/data_reads/ and uncore_imc/data_writes/ but on your platform you certainly have 4 counters for that (one for each IMC).
Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like your. The idea is to split the processor in two NUMA nodes that have their own dedicated IMC. This solve the balancing issue due to the hash function and so result in faster memory operations as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects. In the worst case, the pages of an application can all be mapped to the same NUMA node resulting in only half the bandwidth being usable. Since your benchmark is supposed to be NUMA-friendly, SNC should be more efficient.
Source: Intel
Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines which need to be fetched again later when the core actual need them (with an additional DRAM latency time to pay). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAMs, hardware prefetching units have to prefetch data a long time in advance so to reduce the impact of the latency. They also need to perform a lot of requests concurrently. This is generally not a problem with sequential accesses, but more cores causes accesses to look more random from the caches and IMCs point-of-view. The thing is DRAM are designed so that contiguous accesses are faster than random one (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the value of the LLC-load-misses hardware counter to check if more data are re-fetched with more threads (I see such effect on my Skylake-based PC with only 6-cores but it is not strong enough to cause any visible impact on the final throughput). To mitigate this problem, you can use software non-temporal prefetch (prefetchnta) to request the processor to load data directly into the line fill buffer instead of the L3 cache resulting in a lower pollution (here is a related answer). This may be slower with fewer cores due to a lower concurrency, but it should be a bit faster with a lot of cores. Note that this does not solve the problem of having fetched address that looks more random from the IMCs point-of-view and there is not much to do about that.
The low-level architecture DRAM and caches is very complex in practice. More information about memory can be found in the following links:
What Every Programmer Should Know About Memory
Introduction to High Performance Scientific Computing (Section 1.3)
Lecture: Main Memory and the DRAM System
Short lectures: Dynamic Random Access Memory (in 7 parts)
Intel® 64 and IA-32 Architectures Software Developer's Manual (Volume 3)

Optimal buffer size to avoid cache misses for recent i7 / i9 CPUs

Let's assume an algorithm is repeatedly processing buffers of data, it may be accessing say 2 to 16 of these buffers, all having the same size. What would you expect to be the optimum size of these buffers, assuming the algorithm can process the full data in smaller blocks.
I expect the potential bottleneck of cache misses if the blocks are too big, but of course the bigger the blocks the better for vectorization.
Let's expect current i7/i9 CPUs (2018)
Any ideas?
Do you have multiple threads? Can you arrange things so the same thread uses the same buffer repeatedly? (i.e. keep buffers associated with threads when possible).
Modern Intel CPUs have 32k L1d, 256k L2 private per-core. (Or Skylake-AVX512 has 1MiB private L2 caches, with less shared L3). (Which cache mapping technique is used in intel core i7 processor?)
Aiming for L2 hits most of the time is good. L2 miss / L3 hit some of the time isn't always terrible, but off-core is significantly slower. Remember that L2 is a unified cache, so it covers code as well, and of course there's stack memory and random other demands for L2. So aiming for a total buffer size of around half L2 size usually gives a good hit-rate for cache-blocking.
Depending on how much bandwidth your algorithm can use, you might even aim for mostly L1d hits, but small buffers can mean more startup / cleanup overhead and spending more time outside of the main loop.
Also remember that with Hyperthreading, each logical core competes for cache on the physical core it's running on. So if two threads end up on the same physical core, but are touching totally different memory, your effective cache sizes are about half.
Probably you should make the buffer size a tunable parameter, and profile with a few different sizes.
Use perf counters to check if you're actually avoiding L1d or L2 misses or not, with different sizes, to help you understand whether your code is sensitive to different amounts of memory latency or not.

Operations that use only RAM

Can you please tell me some example code where we use ignorable amount of CPU and storage but heavy use of RAM? Like, if I run a loop and create objects, this will consume RAM but not CPU or storage. I mean tell me some memory expensive operations.
appzYourLife gave a good example, but I'd like to give a more conceptual answer.
Memory is slow. Like it's really slow, at least on the time scale that CPUs operate on. There is a concept called the memory hierarchy, which illustrates the trade off between cost/capacity and speed.
To prevent a fast CPU from wasting its time waiting on slow memory, we came up with CPU cache, which is a very small amount (it's expensive!) of very fast memory. The CPU never directly interacts with RAM, only the lowest level of CPU cache. Any time the CPU needs data that doesn't fall in the cache, it dispatches the memory controller to go fetch the desired data from RAM and put it in cache. The memory controller does this directly, without CPU involvement (so that the CPU can handle another process while wasting on this slow memory I/O).
The memory controller can be smart about how it does its memory fetching however. The principle of locality comes into play, which is the trend that CPUs tend to deal mostly with closely related (close in memory) data, such as arrays of data or long series of consecutive instructions. Knowing this, the memory controller can prefetch data from RAM that it predicts (according to various prediction algorithms, a key topic in CPU design) might be needed soon, and makes it available to the CPU before the CPU even knows it will need it. Think of it like a surgeon's assistant, who preempts what tools will be needed, and offers to hand them to the surgeon the moment they're needed, without the surgeon needing to request them, and without making the surgeon wait for the assistant to go get them and come back.
To maximize RAM usage, you'd need to minimize cache usage. This can be done by doing a lot of unexpected jumps between distant locations in memory. Typically, linked structures (such as linked lists) can cause this to happen. If a linked structure is composed of nodes that are scattered all throughout RAM, then there is no way for the memory controller to be able to predict all their locations and prefetch them. Traversing such a structure will cause many "cache misses" (a memory request for which the data isn't cached, and must be fetched from RAM), which are RAM intensive.
Ultimately, the CPU would usually be used heavily too, because it won't sit around waiting for the memory access, but will instead execute the instructions of the other processes running on the system, if there are any.
In Swift the Int64 type requires 64 bit of memory. So if you allocate space for 1000000 Int64 you will reserve memory for 8 MB.
UnsafeMutablePointer<Int64>.alloc(1000000)
The process should not consume much CPU since you are not initializing that memory, you are just allocating it.

cuda memory coalescing

I would like first to confirm the following:
The elementary global memory transaction to shared memory is either 32 bytes, 64 or 128 bytes, but only if the memory accesses can be coalesced. The latencies of the precedent transactions are all equal. Is that right?
Second question: If the memory reads can't be coalesced, each thread reads only 4 bytes (is that right?) will all threads memory accesses be made sequential?
It depends on the architecture you are working on. However, on Fermi and Kepler you have:
Memory transactions are always 32byte or 128byte called segments
32byte segments is used when only L2 cache is used, 128byte segments when L2+L1.
If two threads of the same warp fall into the same segment, data is delivered in a single transation
If on the other hand there is data in a segment you fetch that no thread requested - it is being read anyway and you (probably) waste bandwidth
Whole segments fall into L1 & L2 cache and may reduce your bandwidth pressure when your neighbouring warps need the same segment
L1 & L2 are fairly small compared to the number of threads they usually deliver for. That is why you should not expect a piece of data to stay in the cache for long (in contrary to CPU programming)
You can disable L1 caching which may help if you overfetch in random memory access patterns.
As you can see there are several variables which decide how much time your memory access is going to take. The general rule of thumb is: the more dense your access pattern - the better! Stride or misalignment are not as costly now as they were in the past, so don't worry too much about that, unless you are doing some late-stage optimizations.

Resources