When should we use prefetch?

When should we use prefetch? - performance

Some CPU and compilers supply prefetch instructions. Eg: __builtin_prefetch in GCC Document. Although there is a comment in GCC's document, but it's too short to me.
I want to know, in practice, when should we use prefetch? Are there some examples?

This question isn't really about compilers as they're just providing some hook to insert prefetch instructions into your assembly code / binary. Different compilers may provide different intrinsic formats but you can just ignore all these and (carefully) add it directly in assembly code.
Now the real question seems to be "when are prefetches useful", and the answer is - in any scenario where youre bounded on memory latency, and the access pattern isn't regular and distinguishable for the HW prefetch to capture (organized in a stream or strides), or when you suspect there are too many different streams for the HW to track simultaneously.
Most compilers would only very seldom insert their own prefetches for you, so it's basically up to you to play with your code and benchmark how prefetches could be useful.
The link by #Mysticial shows a nice example, but here's a more straight forward one that I think can't be caught by HW:
#include "stdio.h"
#include "sys/timeb.h"
#include "emmintrin.h"
#define N 4096
#define REP 200
#define ELEM int
int main() {
int i,j, k, b;
const int blksize = 64 / sizeof(ELEM);
ELEM __attribute ((aligned(4096))) a[N][N];
for (i = 0; i < N; ++i) {
for (j = 0; j < N; ++j) {
a[i][j] = 1;
}
}
unsigned long long int sum = 0;
struct timeb start, end;
unsigned long long delta;
ftime(&start);
for (k = 0; k < REP; ++k) {
for (i = 0; i < N; ++i) {
for (j = 0; j < N; j ++) {
sum += a[i][j];
}
}
}
ftime(&end);
delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
printf ("Prefetching off: N=%d, sum=%lld, time=%lld\n", N, sum, delta);
ftime(&start);
sum = 0;
for (k = 0; k < REP; ++k) {
for (i = 0; i < N; ++i) {
for (j = 0; j < N; j += blksize) {
for (b = 0; b < blksize; ++b) {
sum += a[i][j+b];
}
_mm_prefetch(&a[i+1][j], _MM_HINT_T2);
}
}
}
ftime(&end);
delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
printf ("Prefetching on: N=%d, sum=%lld, time=%lld\n", N, sum, delta);
}
What I do here is traverse each matrix line (enjoying the HW prefetcher help with the consecutive lines), but prefetch ahead the element with the same column index from the next line that resides in a different page (which the HW prefetch should be hard pressed to catch). I sum the data just so that it's not optimized away, the important thing is that I basically just loop over a matrix, should have been pretty straightforward and simple to detect, and yet still get a speedup.
Built with gcc 4.8.1 -O3, it gives me an almost 20% boost on an Intel Xeon X5670:
Prefetching off: N=4096, sum=3355443200, time=1839
Prefetching on: N=4096, sum=3355443200, time=1502
Note that the speedup is received even though I made the control flow more complicated (extra loop nesting level), the branch predictor should easily catch the pattern of that short block-size loop, and it saves execution of unneeded prefetches.
Note that Ivybridge and onward on should have a "next-page prefetcher", so the HW may be able to mitigate that on these CPUs (if anyone has one available and cares to try i'll be happy to know). In that case i'd modify the benchmark to sum every second line (and the prefetch would look ahead two lines everytime), that should confuse the hell out of the HW prefetchers.
Skylake results
Here are some results from a Skylake i7-6700-HQ, running at 2.6 GHz (no turbo) with gcc:
Compile flags: -O3 -march=native
Prefetching off: N=4096, sum=28147495993344000, time=896
Prefetching on: N=4096, sum=28147495993344000, time=1222
Prefetching off: N=4096, sum=28147495993344000, time=886
Prefetching on: N=4096, sum=28147495993344000, time=1291
Prefetching off: N=4096, sum=28147495993344000, time=890
Prefetching on: N=4096, sum=28147495993344000, time=1234
Prefetching off: N=4096, sum=28147495993344000, time=848
Prefetching on: N=4096, sum=28147495993344000, time=1220
Prefetching off: N=4096, sum=28147495993344000, time=852
Prefetching on: N=4096, sum=28147495993344000, time=1253
Compile flags: -O2 -march=native
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on: N=4096, sum=28147495993344000, time=1813
Prefetching off: N=4096, sum=28147495993344000, time=1956
Prefetching on: N=4096, sum=28147495993344000, time=1814
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on: N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1961
Prefetching on: N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1965
Prefetching on: N=4096, sum=28147495993344000, time=1814
So using prefetch is either about 40% slower, or 8% faster depending on if you use -O3 or -O2 respectively for this particular example. The big slowdown for -O3 is actually due to a code generation quirk: at -O3 the loop without prefetch is vectorized, but the extra complexity of the prefetch variant loop prevents vectorization on my version of gcc anyway.
So the -O2 results are probably more apples-to-apples, and the benefit is about half (8% speedup vs 16%) of what we saw on Leeor's Westmere. Still it's worth noting that you have to be careful not to change code generation such that you get a big slowdown.
This test probably isn't ideal in that by going int by int implies a lot of CPU overhead rather than stressing the memory subsystem (that's why vectorization helped so much).

On recent Intel chips one reason you apparently might want to use prefetching is to avoid CPU power-saving features artificially limiting your achieved memory bandwidth. In this scenario, simple prefetching can as much as double your performance versus the same code without prefetching, but it depends entirely on the selected power management plan.
I ran a simplified version (code here)of the test in Leeor's answer, which stresses the memory subsystem a bit more (since that's where prefetch will help, hurt or do nothing). The original test stressed the CPU in parallel with the memory subsystem since it added together every int on each cache line. Since typical memory read bandwidth is in the region of 15 GB/s, that's 3.75 billion integers per second, putting a pretty hard cap on the maximum speed (code that isn't vectorized will usually process 1 int or less per cycle, so a 3.75 GHz CPU will be about equally CPU and memory bount).
First, I got results that seemed to show prefetching kicking butt on my i7-6700HQ (Skylake):
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=204, MiB/s=12549
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=200, MiB/s=12800
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=201, MiB/s=12736
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=197, MiB/s=12994
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Eyeballing the numbers, prefetch achieves something a bit above 16 GiB/s and without only about 12.5, so prefetch is increasing speed by about 30%. Right?
Not so fast. Remembering that the powersaving mode has all sorts of wonderful interactions on modern chips, I changed my Linux CPU governor to performance from the default of powersave1. Now I get:
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=155, MiB/s=16516
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=152, MiB/s=16842
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=159, MiB/s=16100
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=163, MiB/s=15705
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=161, MiB/s=15900
It's a total toss-up. Both with and without prefetching seem to perform identically. So either hardware prefetching is less aggressive in the high powersaving modes, or there is some other interaction with power saving that behaves differently with the explicit software prefetches.
Investigation
In fact, the difference between prefetching and not is even more extreme if you change the benchark. The existing benchmark alternates between runs with prefetching on and off, and it turns out that this helped the "off" variant because the speed increase which occurs in the "on" test partly carries over to the subsequent off test2. If you run only the "off" test you get results around 9 GiB/s:
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=280, MiB/s=9142
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=277, MiB/s=9241
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=285, MiB/s=8982
... versus about 17 GiB/s for the prefetching version:
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=149, MiB/s=17181
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297
So the prefetching version is almost twice as fast.
Let's take a look at what's going on with perf stat, for both the **off* version:
Performance counter stats for './prefetch-test off':
2907.485684 task-clock (msec) # 1.000 CPUs utilized
3,197,503,204 cycles # 1.100 GHz
2,158,244,139 instructions # 0.67 insns per cycle
429,993,704 branches # 147.892 M/sec
10,956 branch-misses # 0.00% of all branches
... and the on version:
1502.321989 task-clock (msec) # 1.000 CPUs utilized
3,896,143,464 cycles # 2.593 GHz
2,576,880,294 instructions # 0.66 insns per cycle
429,853,720 branches # 286.126 M/sec
11,444 branch-misses # 0.00% of all branches
The difference is that the version with prefetching on consistently runs at the max non-turbo frequency of ~2.6 GHz (I have disabled turbo via an MSR). The version without prefetching, however, has decided to run at a much lower speed of 1.1 GHz. Such large CPU differences often also reflect a large difference in uncore frequency, which can explain the worse bandwdith.
Now we've seen this before, and it is probably an outcome of the Energy Efficient Turbo feature on recent Intel chips, which try to ramp down the CPU frequency when they determine a process is mostly memory bound, presumably since increased CPU core speed doesn't provide much benefit in those cases. As we can see here, this assumption isn't always true, but it isn't clear to me if the tradeoff is a bad one in general, or perhaps the heuristic only occasionally gets it wrong.
1 I'm running the intel_pstate driver, which is the default for Intel chips on recent kernels which implements "hardware p-states", also known as "HWP". Command used: sudo cpupower -c 0,1,2,3 frequency-set -g performance.
2 Conversely, the slowdown from the "off" test partly carries over into the "on" test, although the effect is less extreme, possibly because the powersaving "ramp up" behavior is faster than "ramp down".

Here's a brief summary of cases that I'm aware of in which software prefetching may prove especially useful. Some may not apply to all hardware.
This list should be read from the point of view that the most obvious place software prefetches could be used is where the stream of accesses can be predicted in software, and yet this case isn't necessarily such an obvious win for SW prefetch because out-of-order processing often ends up having a similar effect since it can execute behind existing misses in order to get more misses in flight.
So this list is more a "in light of the fact that SW prefetch isn't as obviously useful as it might first seem, here are some places it might still be useful anyways", often compared to the alternative of either just letting out-of-order processing do its thing or just using "plain loads" to load some values before they are needed.
Fitting more loads in the out-of-order window
Although out-of-order processing can potentially expose the same type of MLP (Memory-Level Parallelism) as software prefetches, there are limits inherent to the total possible lookahead distance after a cache miss. These include reorder-buffer capacity, load buffer capacity, scheduler capacity and so on. See this blog post for an example of where extra work seriously hinders MLP because the CPU can't run ahead far enough to get enough loads executing at once.
In this case, software prefetch allows you to effectively stuff more loads earlier in the instruction stream. As an example, imagine you have a loop which performs one load and then 20 instructions worth of work on the loaded data, and your CPU has an out-of-order buffer of 100 instructions and that loads are independent from each other (e.g,. accessing an array with a known stride).
After the first miss, you can run ahead 99 more instructions which will be composed of 95 non-load and 5 load instructions (including the first load). So your MLP is inherently limited to 5 by the size of the out-of-order buffer. If instead you paired every load with two software prefetches to a location say 6 or more iterations ahead, you'd end up instead with 90 non-load instructions, 5 loads and 5 software prefetches and since all those loads are you just doubled your MLP to 102.
There is of course no limit of one additional prefetch per load: you could add more to hit higher numbers, but there is a point of diminishing and then negative returns as you hit the MLP limits of your machine and the prefetches take up resources you'd rather spend on other things.
This is similar to software pipelining, where you load data for a future iteration, and then don't touch that register until after a significant amount of other work. This was mostly used on in-order machines to hide latency of computation as well as memory. Even on a RISC with 32 architectural registers, software-pipelining typically can't place the loads as far ahead of use as an optimal prefetch-distance on a modern machine; the amount of work a CPU can do during one memory latency has grown a lot since the early days of in-order RISCs.
In-order machines
Not all machines are bit out-of-order cores: in-order CPUs are still common in some places (especially outside x86), and you'll also find "weak" out of order cores that don't have the capability to run ahead very far and so partly act like in-order machines.
On these machines software prefetches may help gain MLP that you wouldn't otherwise be able access (of course, an in-order machine probably doesn't support a lot of inherent MLP otherwise).
Working around hardware prefetch restrictions
Hardware prefetch may have restrictions which you could work around using software prefetch.
For example, Leeor's answer has an example of hardware prefetch stopping at page boundaries, while software prefetch doesn't have any such restriction.
Another example might be any time that hardware prefetch is too aggressive or too conservative (after all it has to guess at your intentions): you might use software prefetch instead since you know exactly how your application will behave.
Examples of the latter include prefetching discontiguous areas: such as rows in a sub-matrix of a larger matrix: hardware prefetch won't understand the boundaries of the "rectangular" region and will constantly prefetch beyond the end of each row, and then take a bit of time to pick up the new row pattern. Software prefetching can get this exactly right: never issuing any useless prefetches at all (but it often requires ugly splitting of loops).
If you do enough software prefetches, the hardware prefeteches should in theory mostly shut down, because the activity of the memory subsystem is one heuristic they use to decide whether to activate.
Counterpoint
I should note here that software prefetching is not equivalent to hardware prefetching when it comes to possible speedups for cases the hardware prefetching can pick up: hardware prefetching can be considerably faster. That is because hardware prefetching can start working closer to memory (e.g., from the L2) where it has a lower latency to memory and also access to more buffers (in the so-called "superqueue" on Intel chips) and so more concurrency. So if you turn off hardware prefetching and try to implement a memcpy or some other streaming load with pure software prefetching, you'll find that it is likely slower.
Special load hints
Prefetching may give you access to special hints that you can't achieve with regular loads. For example x86 has the prefetchnta, prefetcht0, prefetcht1, and prefetchw instructions which hint to the processor how to treat the loaded data in the caching subsystem. You can't achieve the same effect with plain loads (at least on x86).
2 It's not actually as simple as just adding a single prefetch to the loop, since after the first five iterations, the loads will start hitting already prefetched values, reducing your MLP back to 5 - but the idea still holds. A real implementation would also involve reorganizing the loop so that the MLP can be sustained (e.g., "jamming" the loads and prefetches together every few iterations).

There are definitely situations where software prefetch provides significant performance improvements.
For example, if you are accessing a relatively slow memory device such as Optane DC Persistent Memory, which has access times of several hundred nanoseconds, prefetching can reduce effective latency by 50 percent or more if you can do it far enough in advance of the read or write.
This isn't a very common case at present but it will become a lot more common if and when such storage devices become mainstream.

The article 'What Every Programmer Should Know About Memory
Ulrich Drepper' discusses situations where pre-fetching is advantageous;
http://www.akkadia.org/drepper/cpumemory.pdf , warning: this is quite a long article that discusses things like memory architecture / how the cpu works, etc.
prefetching gives something if the data is aligned to cache lines; and if you are loading data that is about to be accessed by the algorithm;
In any event one should do this when trying to optimize highly used code; benchmarking is a must and things usually work out differently than one might use to think.

It seems, that the best policy to follow is to never use __builtin_prefetch (and its friend, __builtin_expect) at all. On some platforms those may help (and even help a lot) - however, one must always do some benchmarking to confirm this. The real question, whether the short term performance gains will worth the trouble in the longer run.
First, one may ask the following question: what these statements actually do when fed to a higher end modern CPU? The answer is: nobody really knows (except, may be, few guys on the CPU's core architecture team, but they are not going to tell anybody). Modern CPUs are very complex machines, capable of instruction reordering, speculative execution of instructions across possibly not taken branches, etc., etc. Moreover, the details of this complex behavior may (and will) differ considerably between CPU generations and vendors (Intel Core vs Intel I* vs AMD Opteron; with more fragmented platforms like ARM the situation is even worse).
One neat example (not prefetch related, but still) of CPU functionality which used to speed things up on older Intel CPUs, but sucks badly on the more modern one is outlined here: http://lists-archives.com/git/744742-git-gc-speed-it-up-by-18-via-faster-hash-comparisons.html. In that particular case, it was possible to achieve 18% performance increase by replacing the optimized version of gcc supplied memcmp with an explicit ("naive" so to say) loop.

Related

Does the Meltdown mitigation, in combination with `calloc()`s CoW "lazy allocation", imply a performance hit for calloc()-allocated memory?

So calloc() works by asking the OS for some virtual memory. The OS is working in cahoots with the MMU, and cleverly responds with a virtual memory address which actually maps to a copy-on-write, read-only page full of zeroes. When a program tries to write to anywhere in that page, a page fault occurs (because you cannot write to read-only pages), a copy of the page is created, and your program's virtual memory is mapped to this brand new copy of those zeroes.
Now that Meltdown is a thing, OSes have been patched so that it's no longer possible to speculatively execute across the kernel-user boundary. This means that whenever user code calls kernel code, it effectively causes a pipeline stall. Typically, when the pipeline stalls in a loop, it's devastating for performance, since the CPU ends up wasting time waiting for data, whether from cache or main memory.
Given such, what I want to know is:
When a program writes to a never-before-accessed page which was allocated with calloc(), and the remapping to the new CoW page occurs, is this executing kernel code?
Is the page fault copy-on-write functionality implemented at the OS level or the MMU level?
If I call calloc() to allocate 4GiB of memory, then initialize it with some arbitrary value (say, 0xFF instead of 0x00) in a tight loop, is my (Intel) CPU going to be hitting a speculation boundary every time it writes to a new page?
And finally, if it is real, is there any case where this effect is significant to real-world performance?

Your premise is wrong. Page faults were never pipelined / super-cheap. Meltdown (and Spectre) mitigation does make them more expensive, though, along with system calls and all other user->kernel transitions.
Speculative execution across the kernel/user boundary was never possible; Intel CPUs don't rename the privilege level, i.e. kernel/user transitions always required a full pipeline flush. I think you're misunderstanding Meltdown: it's cause purely by speculative execution in user-space and delayed handling of the privilege checks on TLB hits.
This is universal in CPU design, AFAIK. I'm not aware of any microarchitectures that rename the privilege level or otherwise speculate into kernel code, x86 or otherwise.
The cost added by Meltdown mitigation is that entering the kernel flushes the TLB. (Or on CPUs with TLB process-context ID support, the kernel can use PCIDs to make using separate page-tables for kernel vs. user-space much cheaper).
The kernel entry point (on Linux) becomes a trampoline that swaps page tables and jumps to the real kernel entry point, to avoid exposing the kernel ASLR offset to user-space. But other than that and an extra mov cr3, reg on entry and exit from the kernel (setting a new page table), nothing else is changed.
(Spectre mitigation is tricky, too, and required more changes like retpolines... and might also significantly increase the cost of user->kernel->user. IDK about page fault costs.)
#BeeOnRope reports (see comments and his answer for full details) that without Spectre patches, just Meltdown patches applied but nopti boot option to "disable" it, increased the cost of a round trip to the kernel on a Skylake CPU (with syscall with bogus RAX, returning -ENOSYS right away) went up from ~100 to ~300 cycles. So that's maybe the cost of the trampoline? And with actual page-table isolation enabled, it went up to ~700 cycles. That's without Spectre mitigation patches at all. (Also, that's the x86-64 syscall entry point, not page-fault. They're likely similar, though.)
Page fault exceptions:
CPUs don't predict page faults, so they couldn't speculatively execute the handler anyway. Prefetch or decode of the page fault entry point could maybe happen while the pipeline was flushing, but that process wouldn't start until the page-faulting instruction tried to retire. A faulting load/store is marked to take effect on retirement, and doesn't re-steer the front-end; the whole key to Meltdown is the lack of action on a faulting load until it reaches retirement.
Related: When an interrupt occurs, what happens to instructions in the pipeline?
Also: Out-of-order execution vs. speculative execution has some detail about what kind of speculation really causes Meltdown, and how CPUs handle faults.
When a program writes to a never-before-accessed page which was allocated with calloc(), and the remapping to the new CoW page occurs, is this executing kernel code?
Yes, page faults are handled by the kernel's page-fault handler. There's no pure-hardware handling for copy-on-write.
If I call calloc() to allocate 4GiB of memory, then initialize it with some arbitrary value (say, 0xFF instead of 0x00) in a tight loop, is my (Intel) CPU going to be hitting a speculation boundary every time it writes to a new page?
Yes. The kernel doesn't fault-around for zeroed pages (unlike for file-backed mappings when data is hot in the pagecache). So every new page touched causes a pagefault, even for small 4k normal pages. (Thanks to #BeeOnRope for accurate info on this.) With anonymous hugepages, you'll only pagefault once per 2MiB (x86-64), which is tremendously better.
If you want to avoid per-page costs, allocate with mmap(MAP_POPULATE) to prefault all the pages into the HW page table, on a Linux system. I'm not sure if madvise can prefault pages for you, e.g. madvise(MADV_WILLNEED) on an already-mapped region. But madvise(MADV_HUGEPAGE) will encourage the kernel to use anonymous hugepages (and maybe to defrag physical memory to free up contiguous 2M blocks to enable that, if you don't have it configured to do that without madvise).
Related: Two TLB-miss per mmap/access/munmap has some perf results on a Linux kernel with KPTI patches.

Yes use of calloc()-allocated memory will suffer a performance degradation due to the Meltdown and Spectre patches.
In fact, calloc() isn't special here: malloc(), new and more generally all allocated memory will probably suffer approximately the same performance impact. Both calloc() and malloc() are ultimately backed by pages returned by the OS (although the allocator will re-use them after they are freed). The only real difference being that a smart allocator, when it goes down the path of using new pages from the OS (rather than re-using a previously freed allocation) in the case of calloc it can omit the zeroing because the OS-provided pages are guaranteed to be zero. Other than that the allocator behavior is largely the same and the OS-level zeroing behavior is the same (there is usually no option to ask the OS for non-zero pages).
So the performance impact applies more broadly than you thought, but the performance impact is likely smaller than you suggest, since a page fault is already doing a lot of work anyways, so you aren't talking an order of magnitude degradation or anything. See Peter's answer on the reasons the performance impact is likely to be limited. I wrote this answer mostly because the answer to your headline question is still yes as there is some impact.
To estimate the impact on a malloc heavy workflow, I tried running some allocation and page-fault heavy test on a current kernel (4.13.0-39-generic) with the Spectre and Meltdown mitigations, as well as on an older kernel prior to these mitigations.
The test code is very simple:
#include <stdlib.h>
#include <stdio.h>
#define SIZE (40 * 1024 * 1024)
#define PG_SIZE 4096
int main() {
char *mem = malloc(SIZE);
for (volatile char *p = mem; p < mem + SIZE; p += PG_SIZE) {
*p = 'z';
}
printf("pages touched: %d\npoitner value : %p\n", SIZE / PG_SIZE, mem);
}
The results on the newer kernel were about ~3700 cycles per page fault, and on the older kernel without mitigations around ~3300 cycles. The overall regression (presumably) due to the mitigations was about 14%. Note that this in on Skylake hardware (i7-6700HQ) where some of the Spectre mitigations are somewhat cheaper, and the kernel supports PCID which makes the KPTI Meltdown mitigations cheaper. The results might be worse on different hardware.
Oddly, the results on the new kernel with Spectre and Meltdown mitigations disabled at boot (using spectre_v2=off nopti) were much worse than either the new kernel default or the old kernel, coming in at about 5050 cycles per page fault, something like a 35% regression over the same kernel with the mitigations enabled. So something is going really wrong, performance-wise when the mitigations are disabled.
Full Results
Here is the full perf stat output for the two runs.
Old Kernel (4.10.0-42)
pages touched: 10240
poitner value : 0x7f7d2561e010
Performance counter stats for './pagefaults':
12.980048 task-clock (msec) # 0.976 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
10,286 page-faults # 0.792 M/sec
33,662,397 cycles # 2.593 GHz
27,230,864 instructions # 0.81 insn per cycle
4,535,443 branches # 349.417 M/sec
11,760 branch-misses # 0.26% of all branches
0.013293417 seconds time elapsed
New Kernel (4.13.0-39)
pages touched: 10240
poitner value : 0x7f306ad69010
Performance counter stats for './pagefaults':
14.789615 task-clock (msec) # 0.966 CPUs utilized
8 context-switches # 0.541 K/sec
0 cpu-migrations # 0.000 K/sec
10,288 page-faults # 0.696 M/sec
38,318,595 cycles # 2.591 GHz
28,796,523 instructions # 0.75 insn per cycle
4,693,944 branches # 317.381 M/sec
26,853 branch-misses # 0.57% of all branches
0.015312764 seconds time elapsed
New Kernel (4.13.0.-39) spectre_v2=off nopti
pages touched: 10240
poitner value : 0x7ff079ede010
Performance counter stats for './pagefaults':
16.690621 task-clock (msec) # 0.982 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
10,286 page-faults # 0.616 M/sec
51,964,080 cycles # 3.113 GHz
28,602,441 instructions # 0.55 insn per cycle
4,699,608 branches # 281.572 M/sec
25,064 branch-misses # 0.53% of all branches
0.017001581 seconds time elapsed

L2 instruction fetch misses much higher than L1 instruction fetch misses

I am generating a synthetic C benchmark aimed at causing a large number of instruction fetch misses via the following Python script:
#!/usr/bin/env python
import tempfile
import random
import sys
if __name__ == '__main__':
functions = list()
for i in range(10000):
func_name = "f_{}".format(next(tempfile._get_candidate_names()))
sys.stdout.write("void {}() {{\n".format(func_name))
sys.stdout.write(" double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
sys.stdout.write(" res = pi*r*r*h;\n")
sys.stdout.write(" res = res/(e*e);\n")
sys.stdout.write("}\n")
functions.append(func_name)
sys.stdout.write("int main() {\n")
sys.stdout.write("unsigned int i;\n")
sys.stdout.write("for(i =0 ; i < 100000 ;i ++ ){\n")
for i in range(10000):
r = random.randint(0, len(functions)-1)
sys.stdout.write("{}();\n".format(functions[r]))
sys.stdout.write("}\n")
sys.stdout.write("}\n")
What the code does is simply generating a C program that consists of a lot of randomly named dummy functions that are in turn called in random order in main(). I am compiling the resulting code with gcc 4.8.5 under CentOS 7 with -O0. The code is running on a dual socket machine fitted with 2x Intel Xeon E5-2630v3 (Haswell architecture).
What I am interested in is understanding instruction-related counters reported by perf when profiling the binary compiled from the C code (not the Python script, that is only used to automatically generate the code). In particular, I am observing the following counters with perf stat:
instructions
L1-icache-load-misses (instruction fetches that miss L1, aka r0280 on Haswell)
r2424, L2_RQSTS.CODE_RD_MISS (instruction fetches that miss L2)
rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)
I first profiled the code with all hardware prefetchers disabled in the BIOS, i.e.
MLC Streamer Disabled
MLC Spatial Prefetcher Disabled
DCU Data Prefetcher Disabled
DCU Instruction Prefetcher Disabled
and the results are the following (process is pinned to first core of second CPU and corresponding NUMA domain, but I guess this doesn't make much difference):
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code
Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':
25,108,610,204 instructions
2,613,075,664 L1-icache-load-misses
5,065,167,059 r2424
17 rf824
33.696954142 seconds time elapsed
Considering the figures above, I cannot explain such a high number of instruction fetch misses in L2. I have disabled all prefetchers, and L2_RQSTS.ALL_PF confirms so. But why do I see twice as much the number of instruction fetch misses in L2 than in L1i? In my (simple) mental processor model, if an instruction is looked up in L2, it must have necessarily been looked up in L1i before. Clearly I am wrong, what am I missing?
I then tried to run the same code with all the hardware prefetchers enabled, i.e.
MLC Streamer Enabled
MLC Spatial Prefetcher Enabled
DCU Data Prefetcher Enabled
DCU Instruction Prefetcher Enabled
and the results are the following:
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code
Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':
25,109,877,626 instructions
2,599,883,072 L1-icache-load-misses
5,054,883,231 r2424
908,494 rf824
Now, L2_RQSTS.ALL_PF seems to indicate that something more is happening and although I expected the prefetcher to be a bit more aggressive, I imagine that the instruction prefetcher is severely put to the test due to the jump-intensive type of workload and data prefetcher has not much to do with this kind of workload. But again, L2_RQSTS.CODE_RD_MISS is still too high with the prefetchers enabled.
So, to sum up, my question is:
With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS seems to be much higher than L1-icache-load-misses. Even with hardware prefetchers enabled, I still cannot explain it. What is the reason behind such a high count of L2_RQSTS.CODE_RD_MISS compared to L1-icache-load-misses?

The instruction prefetcher can generate requests are that don't count as accesses to the L1I cache, but are counted as code fetch requests at higher-numbered memory levels, such as the L2. This is generally true on all Intel microarchitectures with an instruction prefetcher. L2_RQSTS.CODE_RD_MISS counts both demand and prefetch requests from the L1I. Demand requests are generated by a multiplexing unit in the IFU that chooses a target fetch linear address from among the different units in the pipeline that may change the flow, such as the branch prediction units. Prefetch requests are generated by the L1I instruction prefetcher on an L1I miss if possible.
In general, the number of prefetch fetch requests is nearly proportional to the number of L1I misses. For instruction fetches from memory regions of cacheable memory types, the following formula holds:
ICACHE.MISSES <= L2_RQSTS.CODE_RD_MISS + L2_RQSTS.CODE_RD_HIT
I'm not sure whether this formula also holds for uncacheable fetch requests. I didn't test it in that condition. I know these requests are counted as ICACHE.MISSES, but not sure about the other events.
In your case, most instruction fetches will miss in the L1I and L2. You have 10,000 functions each nearly fully spans 2 64-btye cache lines (here is a version with only two functions), so the code size is much larger than the 256 KiB L2 available on Haswell. The functions are being called in a non-sequential and upredictable order, so the L1I and L2 prefetchers won't significantly help. The only noteworthy exception are returns, all of which will be predicted correctly using the RSB mechanism.
Each of the 10,000 functions are being called 100,000 times in a loop. Most fetch requests are for lines occupied by these functions. The total number of useful instruction fetch requests is about 2 lines per function * 10,000 function * 100,000 iterations = 2,000,000,000 lines, most of which will miss in the L1I and L2 (but probably hit in the L3 after the first cold iteration). Several millions of other requests will be for lines occupied by the loop body. Your measurements show that there are about 30% more instruction fetches that miss in the L1I. This is because of branch mispredictions, which cause fetch requests for incorrect lines that may not be even be in the L1I and/or L2. Each L1I miss may trigger a prefetch, so it's normal for L2 instruction fetches to be within two times the number of L1I misses. This is consistent with your numbers.
In my two-function version, I'm counting 24 instructions per invoked function, so I expect the total number of retired instructions to be approximately 24 billion, but you got 25 billion. Either I don't know how to count, or you have 25 instructions per function for some reason.

Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

We've got a simple memory throughput benchmark. All it does is memcpy repeatedly for a large block of memory.
Looking at the results (compiled for 64-bit) on a few different machines, Skylake machines do significantly better than Broadwell-E, keeping OS (Win10-64), processor speed, and RAM speed (DDR4-2133) the same. We're not talking a few percentage points, but rather a factor of about 2. Skylake is configured dual-channel, and the results for Broadwell-E don't vary for dual/triple/quad-channel.
Any ideas why this might be happening? The code that follows is compiled in Release in VS2015, and reports average time to complete each memcpy at:
64-bit: 2.2ms for Skylake vs 4.5ms for Broadwell-E
32-bit: 2.2ms for Skylake vs 3.5ms for Broadwell-E.
We can get greater memory throughput on a quad-channel Broadwell-E build by utilizing multiple threads, and that's nice, but to see such a drastic difference for single-threaded memory access is frustrating. Any thoughts on why the difference is so pronounced?
We've also used various benchmarking software, and they validate what this simple example shows - single-threaded memory throughput is way better on Skylake.
#include <memory>
#include <Windows.h>
#include <iostream>
//Prevent the memcpy from being optimized out of the for loop
_declspec(noinline) void MemoryCopy(void *destinationMemoryBlock, void *sourceMemoryBlock, size_t size)
{
memcpy(destinationMemoryBlock, sourceMemoryBlock, size);
}
int main()
{
const int SIZE_OF_BLOCKS = 25000000;
const int NUMBER_ITERATIONS = 100;
void* sourceMemoryBlock = malloc(SIZE_OF_BLOCKS);
void* destinationMemoryBlock = malloc(SIZE_OF_BLOCKS);
LARGE_INTEGER Frequency;
QueryPerformanceFrequency(&Frequency);
while (true)
{
LONGLONG total = 0;
LONGLONG max = 0;
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
for (int i = 0; i < NUMBER_ITERATIONS; ++i)
{
QueryPerformanceCounter(&StartingTime);
MemoryCopy(destinationMemoryBlock, sourceMemoryBlock, SIZE_OF_BLOCKS);
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
total += ElapsedMicroseconds.QuadPart;
max = max(ElapsedMicroseconds.QuadPart, max);
}
std::cout << "Average is " << total*1.0 / NUMBER_ITERATIONS / 1000.0 << "ms" << std::endl;
std::cout << "Max is " << max / 1000.0 << "ms" << std::endl;
}
getchar();
}

Single-threaded memory bandwidth on modern CPUs is limited by max_concurrency / latency of the transfers from L1D to the rest of the system, not by DRAM-controller bottlenecks. Each core has 10 Line-Fill Buffers (LFBs) which track outstanding requests to/from L1D. (And 16 "superqueue" entries which track lines to/from L2).
(Update: experiments show that Skylake probably has 12 LFBs, up from 10 in Broadwell. e.g. Fig7 in the ZombieLoad paper, and other performance experiments including #BeeOnRope's testing of multiple store streams)
Intel's many-core chips have higher latency to L3 / memory than quad-core or dual-core desktop / laptop chips, so single-threaded memory bandwidth is actually much worse on a big Xeon, even though the max aggregate bandwidth with many threads is much better. They have many more hops on the ring bus that connects cores, memory controllers, and the System Agent (PCIe and so on).
SKX (Skylake-server / AVX512, including the i9 "high-end desktop" chips) is really bad for this: L3 / memory latency is significantly higher than for Broadwell-E / Broadwell-EP, so single-threaded bandwidth is even worse than on a Broadwell with a similar core count. (SKX uses a mesh instead of a ring bus because that scales better, see this for details on both. But apparently the constant factors are bad in the new design; maybe future generations will have better L3 bandwidth/latency for small / medium core counts. The private per-core L2 is bumped up to 1MiB though, so maybe L3 is intentionally slow to save power.)
(Skylake-client (SKL) like in the question, and later quad/hex-core desktop/laptop chips like Kaby Lake and Coffee Lake, still use the simpler ring-bus layout. Only the server chips changed. We don't yet know for sure what Ice Lake client will do.)
A quad or dual core chip only needs a couple threads (especially if the cores + uncore (L3) are clocked high) to saturate its memory bandwidth, and a Skylake with fast DDR4 dual channel has quite a lot of bandwidth.
For more about this, see the Latency-bound Platforms section of this answer about x86 memory bandwidth. (And read the other parts for memcpy/memset with SIMD loops vs. rep movs/rep stos, and NT stores vs. regular RFO stores, and more.)
Also related: What Every Programmer Should Know About Memory? (2017 update on what's still true and what's changed in that excellent article from 2007).

I finally got VTune (evalutation) up and running. It gives a DRAM bound score of .602 (between 0 and 1) on Broadwell-E and .324 on Skylake, with a huge part of the Broadwell-E delay coming from Memory Latency. Given that the memory sticks are the same speed (except dual-channel configured in Skylake and quad-channel in Broadwell-E), my best guess is that something about the memory controller in Skylake is just tremendously better.
It makes buying into the Broadwell-E architecture a much tougher call, and requires that you really need the extra cores to even consider it.
I also got L3/TLB miss counts. On Broadwell-E, TLB miss count was about 20% higher, and L3 miss count about 36% higher.
I don't think this is really an answer for "why" so I won't mark it as such, but is as close as I think I'll get to one for the time being. Thanks for all the helpful comments along the way.

Why can't my ultraportable laptop CPU maintain peak performance in HPC

I have developed a high performance Cholesky factorization routine, which should have peak performance at around 10.5 GFLOPs on a single CPU (without hyperthreading). But there is some phenomenon which I don't understand when I test its performance. In my experiment, I measured the performance with increasing matrix dimension N, from 250 up to 10000.
In my algorithm I have applied caching (with tuned blocking factor), and data are always accessed with unit stride during computation, so cache performance is optimal; TLB and paging problem are eliminated;
I have 8GB available RAM, and the maximum memory footprint during experiment is under 800MB, so no swapping comes across;
During experiment, no resource demanding process like web browser is running at the same time. Only some really cheap background process is running to record CPU frequency as well as CPU temperature data every 2s.
I would expect the performance (in GFLOPs) should maintain at around 10.5 for whatever N I am testing. But a significant performance drop is observed in the middle of the experiment as shown in the first figure.
CPU frequency and CPU temperature are seen in the 2nd and 3rd figure. The experiment finishes in 400s. Temperature was at 51 degree when experiment started, and quickly rose up to 72 degree when CPU got busy. After that it grew slowly to the highest at 78 degree. CPU frequency is basically stable, and it did not drop when temperature got high.
So, my question is:
since CPU frequency did not drop, why performance suffers?
how exactly does temperature affect CPU performance? Does the increment from 72 degree to 78 degree really make things worse?
CPU info
System: Ubuntu 14.04 LTS
Laptop model: Lenovo-YOGA-3-Pro-1370
Processor: Intel Core M-5Y71 CPU # 1.20 GHz * 2
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0,1
Off-line CPU(s) list: 2,3
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 61
Stepping: 4
CPU MHz: 1474.484
BogoMIPS: 2799.91
Virtualisation: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0,1
CPU 0, 1
driver: intel_pstate
CPUs which run at the same hardware frequency: 0, 1
CPUs which need to have their frequency coordinated by software: 0, 1
maximum transition latency: 0.97 ms.
hardware limits: 500 MHz - 2.90 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 500 MHz and 2.90 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.40 GHz.
boost state support:
Supported: yes
Active: yes
update 1 (control experiment)
In my original experiment, CPU is kept busy working from N = 250 to N = 10000. Many people (primarily those whose saw this post before re-editing) suspected that the overheating of CPU is the major reason for performance hit. Then I went back and installed lm-sensors linux package to track such information, and indeed, CPU temperature rose up.
But to complete the picture, I did another control experiment. This time, I give CPU a cooling time between each N. This is achieved by asking the program to pause for a number of seconds at the start of iteration of the loop through N.
for N between 250 and 2500, the cooling time is 5s;
for N between 2750 and 5000, the cooling time is 20s;
for N between 5250 and 7500, the cooling time is 40s;
finally for N between 7750 and 10000, the cooling time is 60s.
Note that the cooling time is much larger than the time spent for computation. For N = 10000, only 30s are needed for Cholesky factorization at peak performance, but I ask for a 60s cooling time.
This is certainly a very uninteresting setting in high performance computing: we want our machine to work all the time at peak performance, until a very large task is completed. So this kind of halt makes no sense. But it helps to better know the effect of temperature on performance.
This time, we see that peak performance is achieved for all N, just as theory supports! The periodic feature of CPU frequency and temperature is the result of cooling and boost. Temperature still has an increasing trend, simply because as N increases, the work load is getting bigger. This also justifies more cooling time for a sufficient cooling down, as I have done.
The achievement of peak performance seems to rule out all effects other than temperature. But this is really annoying. Basically it says that computer will get tired in HPC, so we can't get expected performance gain. Then what is the point of developing HPC algorithm?
OK, here are the new set of plots:
I don't know why I could not upload the 6th figure. SO simply does not allow me to submit the edit when adding the 6th figure. So I am sorry I can't attach the figure for CPU frequency.
update 2 (how I measure CPU frequency and temperature)
Thanks to Zboson for adding the x86 tag. The following bash commands are what I used for measurement:
while true
do
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> cpu0_freq.txt ## parameter "freq0"
cat sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq >> cpu1_freq.txt ## parameter "freq1"
sensors | grep "Core 0" >> cpu0_temp.txt ## parameter "temp0"
sensors | grep "Core 1" >> cpu1_temp.txt ## parameter "temp1"
sleep 2
done
Since I did not pin the computation to 1 core, the operating system will alternately use two different cores. It makes more sense to take
freq[i] <- max (freq0[i], freq1[i])
temp[i] <- max (temp0[i], temp1[i])
as the overall measurement.

TL:DR: Your conclusion is correct. Your CPU's sustained performance is nowhere near its peak. This is normal: the peak perf is only available as a short term "bonus" for bursty interactive workloads, above its rated sustained performance, given the light-weight heat-sink, fans, and power-delivery.
You can develop / test on this machine, but benchmarking will be hard. You'll want to run on a cluster, server, or desktop, or at least a gaming / workstation laptop.
From the CPU info you posted, you have a dual-core-with-hyperthreading Intel Core M with a rated sustainable frequency of 1.20 GHz, Broadwell generation. Its max turbo is 2.9GHz, and it's TDP-up sustainable frequency is 1.4GHz (at 6W).
For short bursts, it can run much faster and make much more heat than it requires its cooling system to handle. This is what Intel's "turbo" feature is all about. It lets low-power ultraportable laptops like yours have snappy UI performance in stuff like web browsers, because the CPU load from interactive is almost always bursty.
Desktop/server CPUs (Xeon and i5/i7, but not i3) do still have turbo, but the sustained frequency is much closer to the max turbo. e.g. a Haswell i7-4790k has a sustained "rated" frequency of 4.0GHz. At that frequency and below, it won't use (and convert to heat) more than its rated TDP of 88W. Thus, it needs a cooling system that can handle 88W. When power/current/temperature allow, it can clock up to 4.4GHz and use more than 88W of power. (The sliding window for calculating the power history to keep the sustained power with 88W is sometimes configurable in the BIOS, e.g. 20sec or 5sec. Depending on what code is running, 4.4GHz might not increase the electrical current demand to anywhere near peak. e.g. code with lots of branch mispredicts that's still limited by CPU frequency, but that doesn't come anywhere near saturating the 256b AVX FP units like Prime95 would.)
Your laptop's max turbo is a factor of 2.4x higher than rated frequency. That high-end Haswell desktop CPU can only upclock by 1.1x. The max sustained frequency is already pretty close to the max peak limits, because it's rated to need a good cooling system that can keep up with that kind of heat production. And a solid power supply that can supply that much current.
The purpose of Core M is to have a CPU that can limit itself to ultra low power levels (rated TDP of 4.5 W at 1.2GHz, 6W at 1.4GHz). So the laptop manufacturer can safely design a cooling and power delivery system that's small and light, and only handles that much power. The "Scenario Design Power" is only 3.5W, and that's supposed to represent the thermal requirements for real-world code, not max-power stuff like Prime95.
Even a "normal" ULV laptop CPU is rated for 15W sustained, and high power gaming/workstation laptop CPUs at 45W. And of course laptop vendors put those CPUs into machines with beefier heat-sinks and fans. See a table on wikipedia, and compare desktop / server CPUs (also on the same page).
The achievement of peak performance seems to rule out all effects
other than temperature. But this is really annoying. Basically it says
that computer will get tired in HPC, so we can't get expected
performance gain. Then what is the point of developing HPC algorithm?
The point is to run them on hardware that's not so badly thermally limited! An ultra-low-power CPU like a Core M makes a decent dev platform, but not a good HPC compute platform.
Even a laptop with an xxxxM CPU, rather than a xxxxU CPU, will do ok. (e.g. a "gaming" or "workstation" laptop that's designed to run CPU-intensive stuff for sustained periods). Or in Skylake-family, "xxxxH" or "HK" are the 45W mobile CPUs, at least quad-core.
Further reading:
Modern Microprocessors
A 90-Minute Guide!
[Power Delivery in a Modern Processor] - general background, including the "power wall" that Pentium 4 ran into.
(https://www.realworldtech.com/power-delivery/) - really deep technical dive into CPU / motherboard design and the challenges of delivering stable low-voltage to very bursty demands, and reacting quickly to the CPU requesting more / less voltage as it changes frequency.

Does software prefetching allocate a Line Fill Buffer (LFB)?

I've realized that Little's Law limits how fast data can be transferred at a given latency and with a given level of concurrency. If you want to transfer something faster, you either need larger transfers, more transfers "in flight", or lower latency. For the case of reading from RAM, the concurrency is limited by the number of Line Fill Buffers.
A Line Fill Buffer is allocated when a load misses the L1 cache. Modern Intel chips (Nehalem, Sandy Bridge, Ivy Bridge, Haswell) have 10 LFB's per core, and thus are limited to 10 outstanding cache misses per core. If RAM latency is 70 ns (plausible), and each transfer is 128 Bytes (64B cache line plus its hardware prefetched twin), this limits bandwidth per core to: 10 * 128B / 75 ns = ~16 GB/s. Benchmarks such as single-threaded Stream confirm that this is reasonably accurate.
The obvious way to reduce the latency would be prefetching the desired data with x64 instructions such as PREFETCHT0, PREFETCHT1, PREFETCHT2, or PREFETCHNTA so that it doesn't have to be read from RAM. But I haven't been able to speed anything up by using them. The problem seems to be that the __mm_prefetch() instructions themselves consume LFB's, so they too are subject to the same limits. Hardware prefetches don't touch the LFB's, but also will not cross page boundaries.
But I can't find any of this documented anywhere. The closest I've found is 15 year old article that says mentions that prefetch on the Pentium III uses the Line Fill Buffers. I worry things may have changed since then. And since I think the LFB's are associated with the L1 cache, I'm not sure why a prefetch to L2 or L3 would consume them. And yet, the speeds I measure are consistent with this being the case.
So: Is there any way to initiate a fetch from a new location in memory without using up one of those 10 Line Fill Buffers, thus achieving higher bandwidth by skirting around Little's Law?

Based on my testing, all types of prefetch instructions consume line fill buffers on recent Intel mainstream CPUs.
In particular, I added some load & prefetch tests to uarch-bench, which use large-stride loads over buffers of various sizes. Here are typical results on my Skylake i7-6700HQ:
Benchmark Cycles Nanos
16-KiB parallel loads 0.50 0.19
16-KiB parallel prefetcht0 0.50 0.19
16-KiB parallel prefetcht1 1.15 0.44
16-KiB parallel prefetcht2 1.24 0.48
16-KiB parallel prefetchtnta 0.50 0.19
32-KiB parallel loads 0.50 0.19
32-KiB parallel prefetcht0 0.50 0.19
32-KiB parallel prefetcht1 1.28 0.49
32-KiB parallel prefetcht2 1.28 0.49
32-KiB parallel prefetchtnta 0.50 0.19
128-KiB parallel loads 1.00 0.39
128-KiB parallel prefetcht0 2.00 0.77
128-KiB parallel prefetcht1 1.31 0.50
128-KiB parallel prefetcht2 1.31 0.50
128-KiB parallel prefetchtnta 4.10 1.58
256-KiB parallel loads 1.00 0.39
256-KiB parallel prefetcht0 2.00 0.77
256-KiB parallel prefetcht1 1.31 0.50
256-KiB parallel prefetcht2 1.31 0.50
256-KiB parallel prefetchtnta 4.10 1.58
512-KiB parallel loads 4.09 1.58
512-KiB parallel prefetcht0 4.12 1.59
512-KiB parallel prefetcht1 3.80 1.46
512-KiB parallel prefetcht2 3.80 1.46
512-KiB parallel prefetchtnta 4.10 1.58
2048-KiB parallel loads 4.09 1.58
2048-KiB parallel prefetcht0 4.12 1.59
2048-KiB parallel prefetcht1 3.80 1.46
2048-KiB parallel prefetcht2 3.80 1.46
2048-KiB parallel prefetchtnta 16.54 6.38
The key thing to note is that none of the prefetching techniques are much faster than loads at any buffer size. If any prefetch instruction didn't use the LFB, we would expect it to be very fast for a benchmark that fit into the level of cache it prefetches to. For example prefetcht1 brings lines into the L2, so for the 128-KiB test we might expect it to be faster than the load variant if it doesn't use LFBs.
More conclusively, we can examine the l1d_pend_miss.fb_full counter, whose description is:
Number of times a request needed a FB (Fill Buffer) entry but there
was no entry available for it. A request includes
cacheable/uncacheable demands that are load, store or SW prefetch
instructions.
The description already indicates that SW prefetches need LFB entries and testing confirmed it: for all types of prefetch, this figure was very high for any test where concurrency was a limiting factor. For example, for the 512-KiB prefetcht1 test:
Performance counter stats for './uarch-bench --test-name 512-KiB parallel prefetcht1':
38,345,242 branches
1,074,657,384 cycles
284,646,019 mem_inst_retired.all_loads
1,677,347,358 l1d_pend_miss.fb_full
The fb_full value is more than the number of cycles, meaning that the LFB was full almost all the time (it can be more than the number of cycles since up to two loads might want an LFB per cycle). This workload is pure prefetches, so there is nothing to fill up the LFBs except prefetch.
The results of this test also contract the claimed behavior in the section of the manual quoted by Leeor:
There are cases where a PREFETCH will not perform the data prefetch.
These include:
...
If the memory subsystem runs out of request buffers
between the first-level cache and the second-level cache.
Clearly this is not the case here: the prefetch requests are not dropped when the LFBs fill up, but are stalled like a normal load until resources are available (this is not an unreasonable behavior: if you asked for a software prefetch, you probably want to get it, perhaps even if it means stalling).
We also note the following interesting behaviors:
It seems like there is some small difference between prefetcht1 and prefetcht2 as they report different performance for the 16-KiB test (the difference varies, but is consistently different), but if you repeat the test you'll see that this is more likely just run-to-run variation as those particular values are somewhat unstable (most other values are very stable).
For the L2 contained tests, we can sustain 1 load per cycle, but only one prefetcht0 prefetch. This is kind of weird because prefetcht0 should be very similar to a load (and it can issue 2 per cycle in the L1 cases).
Even though the L2 has ~12 cycle latency, we are able to fully hide the latency LFB with only 10 LFBs: we get 1.0 cycles per load (limited by L2 throughput), not 12 / 10 == 1.2 cycles per load that we'd expect (best case) if LFB were the limiting fact (and very low values for fb_full confirms it). That's probably because the 12 cycle latency is the full load-to-use latency all the way to the execution core, which includes also several cycles of additional latency (e.g., L1 latency is 4-5 cycles), so the actual time spent in the LFB is less than 10 cycles.
For the L3 tests, we see values of 3.8-4.1 cycles, very close to the expected 42/10 = 4.2 cycles based on the L3 load-to-use latency. So we are definitely limited by the 10 LFBs when we hit the L3. Here prefetcht1 and prefetcht2 are consistently 0.3 cycles faster than loads or prefetcht0. Given the 10 LFBs, that equals 3 cycles less occupancy, more or less explained by the prefetch stopping at L2 rather than going all the way to L1.
prefetchtnta generally has much lower throughput than the others outside of L1. This probably means that prefetchtnta is actually doing what it is supposed to, and appears to bring lines into L1, not into L2, and only "weakly" into L3. So for the L2-contained tests it has concurrency-limited throughput as if it was hitting the L3 cache, and for the 2048-KiB case (1/3 of the L3 cache size) it has the performance of hitting main memory. prefetchnta limits L3 cache pollution (to something like only one way per set), so we seem to be getting evictions.
Could it be different?
Here's an older answer I wrote before testing, speculating on how it could work:
In general, I would expect any prefetch that results in data ending up in L1 to consume a line fill buffer, since I believe that the only path between L1 and the rest of the memory hierarchy is the LFB1. So SW and HW prefetches that target the L1 probably both use LFBs.
However, this leaves open the probability that prefetches that target L2 or higher levels don't consume LFBs. For the case of hardware prefetch, I'm quite sure this is the case: you can find many reference that explain that HW prefetch is a mechanism to effectively get more memory parallelism beyond the maximum of 10 offered by the LFB. Furthermore, it doesn't seem like the L2 prefetchers could use the LFBs if they wanted: they live in/near the L2 and issue requests to higher levels, presumably using the superqueue and wouldn't need the LFBs.
That leaves software prefetch that target the L2 (or higher), such as prefetcht1 and prefetcht22. Unlike requests generated by the L2, these start in the core, so they need some way to get from the core out, and this could be via the LFB. From the Intel Optimization guide have the following interesting quote (emphasis mine):
Generally, software prefetching into the L2 will show more benefit
than L1 prefetches. A software prefetch into L1 will consume critical
hardware resources (fill buffer) until the cacheline fill completes. A
software prefetch into L2 does not hold those resources, and it is
less likely to have a negative perfor- mance impact. If you do use L1
software prefetches, it is best if the software prefetch is serviced
by hits in the L2 cache, so the length of time that the hardware
resources are held is minimized.
This would seem to indicate that software prefetches don't consume LFBs - but this quote only applies to the Knights Landing architecture, and I can't find similar language for any of the more mainstream architectures. It appears that the cache design of Knights Landing is significantly different (or the quote is wrong).
1 In fact, I think that even non-temporal stores use the LFBs to get get out of the execution core - but their occupancy time is short because as soon as they get to the L2 they can enter the superqueue (without actually going into L2) and then free up their associated LFB.
2 I think both of these target the L2 on recent Intel, but this is also unclear - perhaps the t2 hint actually targets LLC on some uarchs?

First of all a minor correction - read the optimization guide, and you'll note that some HW prefetchers belong in the L2 cache, and as such are not limited by the number of fill buffers but rather by the L2 counterpart.
The "spatial prefetcher" (the colocated-64B line you meantion, completing to 128B chunks) is one of them, so in theory if you fetch every other line you'll be able to get a higher bandwidth (some DCU prefetchers might try to "fill the gaps for you", but in theory they should have lower priority so it might work).
However, the "king" prefetcher is the other guy, the "L2 streamer". Section 2.1.5.4 reads:
Streamer : This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page
The important part is -
The streamer may issue two prefetch requests on every L2 lookup. The streamer
can run up to 20 lines ahead of the load reques
This 2:1 ratio means that for a stream of accesses that is recognized by this prefetcher, it would always run ahead of your accesses. It's true that you won't see these lines in your L1 automatically, but it does mean that if all works well, you should always get L2 hit latency for them (once the prefetch stream had enough time to run ahead and mitigate L3/memory latencies). You may only have 10 LFBs, but as you noted in your calculation - the shorter the access latency becomes, the faster you can replace them the the higher bandwidth you can reach. This is essentially detaching the L1 <-- mem latency into parallel streams of L1 <-- L2 and L2 <-- mem.
As for the question in your headline - it stands to reason that prefetches attempting to fill the L1 would require a line fill buffer to hold the retrieved data for that level. This should probably include all L1 prefetches. As for SW prefetches, section 7.4.3 says:
There are cases where a PREFETCH will not perform the data prefetch. These include:
PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.
An access to the specified address that causes a fault/exception.
If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.
...
So I assume you're right and SW prefetches are not a way to artificially increase your number of outstanding requests. However, the same explanation applies here as well - if you know how to use SW prefetching to access your lines well enough in advance, you may be able to mitigate some of the access latency and increase your effective BW. This however won't work for long streams for two reasons: 1) your cache capacity is limited (even if the prefetch is temporal, like t0 flavor), and 2) you still need to pay the full L1-->mem latency for each prefetch, so you're just moving your stress ahead a bit - if your data manipulation is faster than memory access, you'll eventually catch up with your SW prefetching. So this only works if you can prefetch all you need well enough in advance, and keep it there.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio