How do I interpret this difference in matrix multiplication GFLOP/s? - performance

I'm trying some matrix multiplication optimizations from this wiki here. While measuring the GFLOP/s for the naive, triple-for-loop matmul, I expected to see a drop in the GFLOP/s after a particular size, which according to the wiki, represents the point where data stops fitting in the cache:
I ran the benchmark on 2 different PCs:
3rd gen Intel i5 (3210M): (L1=32KB per core, L2=256KB per core, L3=3MB shared).
I got the expected graph, with a sharp drop from ~2GFLOP/s to 0.5.
6th gen Intel i7 (6500U): (L1=32KB per core, L2=256KB per core, L3=4MB shared)
On this, I instead see a gradual decrease in GFLOP/s, even if I try for larger sizes. Looking at the Ubuntu system monitor, one of the CPU cores was always at 100% usage.
I'm trying to understand the following:
How do I interpret the change in GFLOP/s with matrix size? If the expected drop corresponds to the data no longer fitting in the cache, why do I not see such a drop even for much bigger sizes on the i7?
How does the 3rd gen i5 perform faster for smaller sizes?
How do I interpret the CPU occupancy? Would I see a reduction in CPU usage if more time was being spent in fetching data from cache/RAM?
Edit:
I switched to double from float and tried -O3 and -O0, here are the plots. I couldn't check frequencies on the odler i5 but the Skylake i7 goes to turbo freq almost instantaneously for most of the process' duration.
Code from here, used GCC 7.4.0 on i7, and clang(Apple LLVM 7) on i5.

Regarding question 2:
While both CPUs have the same base and turbo frequency, the Ivy Bridge has a TDP of 35W while the Skylake has 15W. Even with a much newer process it is possible that the Ivy Bridge is able to use its turbo for a bigger part of the calculation. (Peter Cordes already mentioned checking the actual turbo.).
Regarding question 3:
CPU utilization doesn't depend on what the CPU is doing, waiting for RAM still counts as utilized. There are performance counters you can query which would tell you if the Ivy Bridge is slower because it stalls for memory more often.

With efficient cache-blocking, dense matmul should bottleneck on ALU, not memory bandwidth. O(N^3) work over O(N^2) memory.
But you're measuring a naive matmul. That means it's always horrible if you're striding down columns of one input. This is the classic problem for cache-blocking / loop-tiling.
Your Skylake has significantly better bandwidth to L3 cache and DRAM, and less-associative L2 cache (4-way instead of 8-way). Still, I would have expected better performance when your working set fits in L2 than when it doesn't.
SKL probably also has better HW prefetching, and definitely a larger out-of-order window size, than IvyBridge.
IvyBridge (including your 3210M) was the generation that introduced next-page hardware prefetching, but I think the feature with that name is just TLB prefetching, not data. It probably isn't a factor, especially if transparent hugepages are avoiding any TLB misses.
But if not, TLB misses might be the real cause of the dropoff on IvB. Use performance counters to check. (e.g. perf stat)
Was your CPU frequency shooting up to max turbo right away and staying there for both CPUs? #idspispopd's answer also makes some good points about total power / cooling budget, but yeah check that your two systems are maintaining the same CPU frequencies for this. Or if not, record what they are.
You did compile with optimization enabled, right? If not, that could be enough overhead to hide a memory bottleneck. Did you use the same compiler/version/options on both systems? Did you use -march=native?

Related

Explanation for why effective DRAM bandwidth reduces upon adding CPUs

This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system
I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores # 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using the Intel Advisor's roof-line plot, which depicts the bandwidths of each CPU data-path available. According to this, the bandwidth is 230 GB/s.
Strong scaling of bandwidth:
Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?
Overview
This answer provides probable explanations. Put it shortly, all parallel workload does not infinitely scale. When many cores compete for the same shared resource (eg. DRAM), using too many cores is often detrimental because there is a point where there are enough cores to saturate a given shared resource and using more core only increase the overheads.
More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and non-temporal prefetch should improve a bit the performances and the scalability of your benchmark. Still, there are other architectural hardware limitations that can cause the benchmark not to scale well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.
Under the hood
The layout of Intel Xeon Skylake SP processors is like the following in your case:
Source: Intel
There are two sockets connected with an UPI interconnect and each processor is connected to its own set of DRAM. There are 2 Integrated Memory Controller (IMC) per processor and each is connected to 3 DDR4 DRAM # 2666MHz. This means the theoretical bandwidth is 2*2*3*2666e6*8 = 256 GB/s = 238 GiB/s.
Assuming your benchmark is well designed and each processor access only to its NUMA node, I expect a very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters. Linux perf or VTune enable you to check this relatively easily.
The L3 cache is split in slices. All physical addresses are distributed across the cache slices using an hash function (see here for more informations). This method enable the processor to balance the throughput between all the L3 slices. This method also enable the processor to balance the throughput between the two IMCs so that in-fine the processor looks like a SMP architecture instead of a NUMA one. This was also use in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).
Hashing does not guarantee a perfect balancing though (no hash function is perfect, especially the ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. A bad balancing decreases the memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
With a good hash function, the balancing should be independent of the number of core used. If the hash function is not good enough, one IMC can be more saturated than the other one oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for the each IMC throughput but they have a limited granularity which is quite big. On my Skylake machine the name of the hardware counters are uncore_imc/data_reads/ and uncore_imc/data_writes/ but on your platform you certainly have 4 counters for that (one for each IMC).
Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like your. The idea is to split the processor in two NUMA nodes that have their own dedicated IMC. This solve the balancing issue due to the hash function and so result in faster memory operations as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects. In the worst case, the pages of an application can all be mapped to the same NUMA node resulting in only half the bandwidth being usable. Since your benchmark is supposed to be NUMA-friendly, SNC should be more efficient.
Source: Intel
Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines which need to be fetched again later when the core actual need them (with an additional DRAM latency time to pay). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAMs, hardware prefetching units have to prefetch data a long time in advance so to reduce the impact of the latency. They also need to perform a lot of requests concurrently. This is generally not a problem with sequential accesses, but more cores causes accesses to look more random from the caches and IMCs point-of-view. The thing is DRAM are designed so that contiguous accesses are faster than random one (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the value of the LLC-load-misses hardware counter to check if more data are re-fetched with more threads (I see such effect on my Skylake-based PC with only 6-cores but it is not strong enough to cause any visible impact on the final throughput). To mitigate this problem, you can use software non-temporal prefetch (prefetchnta) to request the processor to load data directly into the line fill buffer instead of the L3 cache resulting in a lower pollution (here is a related answer). This may be slower with fewer cores due to a lower concurrency, but it should be a bit faster with a lot of cores. Note that this does not solve the problem of having fetched address that looks more random from the IMCs point-of-view and there is not much to do about that.
The low-level architecture DRAM and caches is very complex in practice. More information about memory can be found in the following links:
What Every Programmer Should Know About Memory
Introduction to High Performance Scientific Computing (Section 1.3)
Lecture: Main Memory and the DRAM System
Short lectures: Dynamic Random Access Memory (in 7 parts)
IntelĀ® 64 and IA-32 Architectures Software Developer's Manual (Volume 3)

Slowdown when accessing data at page boundaries?

(My question is related to computer architecture and performance understanding. Did not find a relevant forum, so post it here as a general question.)
I have a C program which accesses memory words that are located X bytes apart in virtual address space. For instance, for (int i=0;<some stop condition>;i+=X){array[i]=4;}.
I measure the execution time with a varying value of X. Interestingly, when X is the power of 2 and is about page size, e.g., X=1024,2048,4096,8192..., I get to huge performance slowdown. But on all other values of X, like 1023 and 1025, there is no slowdown. The performance results are attached in the figure below.
I test my program on several personal machines, all are running Linux with x86_64 on Intel CPU.
What could be the cause of this slowdown? We have tried row buffer in DRAM, L3 cache, etc. which do not seem to make sense...
Update (July 11)
We did a little test here by adding NOP instructions to the original code. And the slowdown is still there. This sorta veto the 4k alias. The cause by conflict cache misses is more likely the case here.
There's 2 things here:
Set-associative cache aliasing creating conflict misses if you only touch the multiple-of-4096 addresses. Inner fast caches (L1 and L2) are normally indexed by a small range of bits from the physical address. So striding by 4096 bytes means those address bits are the same for all accesses so you're only one of the sets in L1d cache, and some small number in L2.
Striding by 1024 means you'd only be using 4 sets in L1d, with smaller powers of 2 using progressively more sets, but non-power-of-2 distributing over all the sets. (Intel CPUs have used 32KiB 8-way associative L1d caches for a long time; 32K/8 = 4K per way. Ice Lake bumped it up to 48K 12-way, so the same indexing where the set depends only on bits below the page number. This is not a coincidence for VIPT caches that want to index in parallel with TLB.)
But with a non-power-of-2 stride, your accesses will be distributed over more sets in the cache. Performance advantages of powers-of-2 sized data? (answer describes this disadvantage)
Which cache mapping technique is used in intel core i7 processor? - shared L3 cache is resistant to aliasing from big power-of-2 offsets because it uses a more complex indexing function.
4k aliasing (e.g. in some Intel CPUs). Although with only stores this probably doesn't matter. It's mainly a factor for memory disambiguation, when the CPU has to quickly figure out if a load might be reloading recently-stored data, and it does so in the first pass by just looking just at page-offset bits.
This is probably not what's going on for you, but for more details see:
L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes and
Why are elementwise additions much faster in separate loops than in a combined loop?
Either or both of these effects could be a factor in Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?
Another possible factor is that HW prefetching stops at physical page boundaries. Why does the speed of memcpy() drop dramatically every 4KB? But changing a stride from 1024 to 1023 wouldn't help that by a big factor. "Next-page" prefetching in IvyBridge and later is only TLB prefetching, not data from the next page.
I kind of assumed x86 for most of this answer, but the cache aliasing / conflict-miss stuff applies generally. Set-associative caches with simple indexing are universally used for L1d caches. (Or on older CPUs, direct-mapped where each "set" only has 1 member). The 4k aliasing stuff might be mostly Intel-specific.
Prefetching across virtual page boundaries is likely also a general problem.

Query Intel CPU details of execution unit, port, etc

Is it possible to query the number of execution unit/port per core and similar information on Intel CPU?
I have an assembly program, and noticed that the performance is quite different on different CPU's. For example, on an Core i5 4570, some functions takes consistently 25% cycles to complete than on an Core i7 4970HQ. They are both Haswell based, from the same generation. No memory movement is involved in the part of program benchmarked. So I am thinking maybe the difference comes from the details such as number of execution unit, number of ports etc. The benchmark measures single core CPU cycles, so frequencies/HT etc does not come into play.
Am I right to assume such an explanation of performance difference? If yes, where can I find such informations for specific CPUs. And is it possible to query it dynamically? If possible, then I can dispatch dynamically based on such informations and distribution uops more evenly and similar techniques to optimize the program for multiple CPUs.
Did you time reference cycles (RDTSC) instead of core clock cycles (with perf counters)? That would explain your observations.
Turbo makes a big difference, and the ratio between max turbo and max sustained / rated clock speed (i.e. reference cycle tick rate) is different on different CPUs. e.g. see my answer on this related question
The lower the CPU's TDP, the bigger the ratio between sustained and peak. The Haswell wikipedia article has tables:
84W desktop i5 4570: sustained 3.2GHz = RDTSC frequency, max turbo 3.6GHz (the speed the core was probably actually running for most of your benchmark, if it had time to go up from low-power idle speed).
47W laptop i7-4960HQ: 2.6GHz sustained = RDTSC frequency vs. 3.8GHz max turbo.
Time your code with performance counters, and look at the "core clock cycles" count. (And lots of other neat stuff).
Every Haswell core is identical from Core-M 5Watt CPUs to high-power quad core to 18-core Xeon (which actually has a per-core power-budget more like a laptop CPU); it's only the L3 caches, number of cores (and interconnect), and support or not for HT and/or Turbo that differ. Basically everything outside the cores themselves can be different, including the GPU. They don't disable execution ports, and even the L1/L2 caches are identical. I think disabling execution ports would require significant redesigns in the out-of-order scheduler and stuff like that.
More importantly, every port has at least one execution unit that isn't found on any other port: p0 has the divider, p1 has the integer multiply unit, p5 has the shuffle unit, and p6 is the only port that can execute predicted-taken branches. Actually, p2 and p3 are identical load ports (and can handle store-address uops)...
See Agner Fog's microarch pdf for more about Haswell internals, and also David Kanter's writeup with diagrams of the different blocks.
(However, it's not strictly true that the entire core is identical: Haswell Pentium/Celeron CPUs don't support AVX/AVX2, or BMI/BMI2. I think they do that by disabling decode of VEX prefixes in the decoders. This is still the case for Skylake Pentiums/Celerons, so thanks Intel for delaying the time when we can assume support for new instruction sets. Presumably they do this so CPUs with defects in one only the upper or lower half of their vector execution units can still be sold as Celeron or Pentium, just like CPUs with a defect in some of their L3 can be sold as i5 instead of i7)

hyperthreading and turbo boost in matrix multiply - worse performance using hyper threading

I am tunning my GEMM code and comparing with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000) I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can be also seen in this plot
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable. The performance jumps around quit a bit each iteration both with Eigen and my own GEMM code. I'm surprised that Hyperthreading makes the performance so much worse. I guess this is not not a question. It's an unexpected observation which I'm hoping to find feedback on.
I see that not using hyper threading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring max performance. What I do now is run CPUz and look at the frequency as I'm running my GEMM code and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How to I properly account for turbo boost?
The purpose of hyperthreading is to improve CPU usage for code exhibiting high latency. Hyperthreading masks this latency by treating two threads at once thus having more instruction level parallelism.
However, a well written matrix product kernel exhibits an excellent instruction level parallelism and thus exploits nearly 100% of the CPU ressources. Therefore there is no room for a second "hyper" thread, and the overhead of its management can only decrease the overall performance.
Unless I've missed something, always possible, your CPU has one clock shared by all its components so if you measure it's rate at 4.3GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, some cores running at one rate, others at another rate; the shared components (eg memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor-person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along pushing your n*10^6 contiguous memory locations through the FPUs a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles, at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more that the actual physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work, is likely unaware that it is doing so, relying on merely the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide, is that you can max out all physical processors, without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads, they are not additional processors.

Cycles/byte calculations

In Crypto communities it is common to measure algorithm performance in cycles/byte. My question is, which parameters in the CPU architecture are affecting this number? Except the clockspeed ofcourse :)
Two important factors are:
the ISA of the CPU, or more specifically how closely CPU instructions map to the operations that you need to perform - if you can perform a given operation in one instruction one CPU but it requires 3 instructions on another CPU then the first CPU will probably be faster. If you have specific crypto instructions on the CPU, or extensions such as SIMD which can be leveraged, then so much the better.
the instruction issue rate of the CPU, i.e. how many instructions can be issued per clock cycle
Here are some CPU features that can impact cycles/byte:
depth of pipeline
number of IU and/or FPU able to work in parallel
size of cache memories
algorithms for branch prediction
algorithms for handling cache miss
Moreover, you may be interested in the general problem of assessing WCET (worst case execution time)
Mainly:
Memory bus bandwidth
CPU instructions per cycle
How much memory the CPU can access per second can be a limiting factor. That depends on the algorithm and how big part of the work is memory access. Also which parts of the memory that is accesses will affect how well the memory cache works.
Nowadays instruction times is not measured in how many cycles an instruction takes, but how many instructions can be executed in the same cycle. The pre-processor in the CPU lines up several instructions to be executed in parallel, so it depends on how many parallel lines the CPU has and how well the code can be parallelised. Generally a lot of conditional branching in the algorithm makes it harder to parallelise.

Resources