Cycles/cost for L1 Cache hit vs. Register on x86?


I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?
How many cycles does an L1 cache hit take? How does it compare to register access?

Here's a great article on the subject:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1
To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)
PS:
The specifics will vary, but this link has some good ballpark figures:
Approximate cost to access various caches and main memory?
Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns
PPS:
These figures represent much older, slower CPUs, but the ratios basically hold:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2
Level                     Access Time  Typical Size   Technology   Managed By
-----                     -----------  ------------   ----------   ----------
Registers                 1-3 ns       ?1 KB          Custom CMOS  Compiler
Level 1 Cache (on-chip)   2-8 ns       8 KB-128 KB    SRAM         Hardware
Level 2 Cache (off-chip)  5-12 ns      0.5 MB - 8 MB  SRAM         Hardware
Main Memory               10-60 ns     64 MB - 1 GB   DRAM         Operating System
Hard Disk                 3M - 10M ns  20 - 100 GB    Magnetic     Operating System/User

Throughput and latency are different things. You can't just add up cycle costs. For throughput, see Load/stores per cycle for recent CPU architecture generations - 2 loads per clock throughput for most modern microarchitectures. And see How can cache be that fast? for microarchitectural details of load/store execution units, including showing load / store buffers which limit how much memory-level parallelism they can track. The rest of this answer will focus only on latency, which is relevant for workloads that involve pointer-chasing (like linked lists and trees), and how much latency out-of-order exec needs to hide. (L3 Cache misses are usually too long to fully hide.)
Single-cycle cache latency used to be a thing on simple in-order pipelines at lower clock speeds (so each cycle was more nanoseconds), especially with simpler caches (smaller, not as associative, and with a smaller TLB for caches that weren't purely virtually addressed.) e.g. the classic 5-stage RISC pipeline like MIPS I assumes 1 cycle for memory access on a cache hit, with address calculation in EX and memory access in a single MEM pipeline stage, before WB.
Modern high-performance CPUs divide the pipeline up into more stages, allowing each cycle to be shorter. This lets simple instructions like add / or / and run really fast, still 1 cycle latency but at high clock speed.
For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.
Intel Haswell's L1 load-use latency is 4 cycles for pointer-chasing, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax] can run in a loop, with a pointer that points to itself. (Or for a linked list that hits in cache, easy to microbench with a closed loop). See also Is there a penalty when base+offset is in a different page than the base? That 4-cycle latency special case only applies if the pointer comes directly from another load, otherwise it's 5 cycles.
Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.
Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, reading from the store buffer for store data that hasn't yet committed to L1d cache).
As harold commented, register access is 0 cycles. So, for example:
inc eax has 1 cycle latency (just the ALU operation)
add dword [mem], 1 has 6 cycle latency until a load from dword [mem] will be ready. (ALU + store-forwarding). e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles.
mov rax, [rsi] has 4 cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency.)
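As a rough illustration of the three figures above (a sketch, not a measurement): when each iteration depends on the result of the previous one, the loop cannot run faster than the latency of that dependency chain. The dictionary and the function name `min_cycles` are hypothetical; the cycle counts are simply the Haswell numbers quoted in this answer.

```python
# Latencies in cycles, as quoted above for Haswell. The keys are only labels.
CHAIN_LATENCY_CYCLES = {
    "inc eax": 1,             # ALU only: counter kept in a register
    "add dword [mem], 1": 6,  # ALU + store-forwarding: counter kept in memory
    "mov rax, [rax]": 4,      # L1d load-use latency: pointer chasing
}


def min_cycles(instruction: str, iterations: int) -> int:
    """Lower bound on cycles for a loop whose only loop-carried
    dependency is `instruction` (ignores throughput limits)."""
    return CHAIN_LATENCY_CYCLES[instruction] * iterations


for insn in CHAIN_LATENCY_CYCLES:
    print(f"{insn:20s} >= {min_cycles(insn, 1_000_000):,} cycles per 1M iterations")
```

This only models latency-bound loops; independent work in the loop body can overlap with the chain, which is exactly what out-of-order exec is for.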
http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).
Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).
L1 Data cache = 32 KB, 64 B/line, 8-WAY.
L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.
L2 cache = 256 KB, 64 B/line, 8-WAY
L3 cache = 8 MB, 64 B/line
L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])
L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).
L2 Cache Latency = 12 cycles
L3 Cache Latency = 36 cycles
RAM Latency = 36 cycles + 57 ns
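To compare these entries with each other, it can help to put them on one scale. The sketch below assumes that "36 cycles + 57 ns" means the L3 latency in core cycles plus a DRAM part in nanoseconds, and that the core runs at the listed 3.4 GHz; the function names are hypothetical.

```python
def ram_latency_cycles(l3_cycles: float, dram_ns: float, core_ghz: float) -> float:
    """Total RAM latency in core cycles: cycles spent getting through L3,
    plus the DRAM time in ns converted at the core clock (GHz = cycles per ns)."""
    return l3_cycles + dram_ns * core_ghz


def cycles_to_ns(cycles: float, core_ghz: float) -> float:
    """Convert a latency in core cycles to nanoseconds."""
    return cycles / core_ghz


CORE_GHZ = 3.4  # i7-4770 with Turbo Boost off, as listed above

for name, cycles in [("L1d", 4), ("L2", 12), ("L3", 36)]:
    print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles, CORE_GHZ):.2f} ns")

ram = ram_latency_cycles(36, 57, CORE_GHZ)
print(f"RAM: about {ram:.0f} cycles = {cycles_to_ns(ram, CORE_GHZ):.1f} ns")
```

At a different core frequency the cycle count for RAM changes while the nanosecond part stays roughly the same, which is why the table mixes the two units.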
The top-level benchmark page is http://www.7-cpu.com/utils.html. It doesn't really explain what the different test sizes mean, but the code is available. The test results include Skylake, which is nearly the same as Haswell in this test.
@paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.

If I remember correctly it's about 1-2 clock cycles but this is an estimate and newer caches may be faster. This is out of a Computer Architecture book I have and this is information for AMD so Intel may be slightly different but I would bound it between 5 and 15 clock cycles which seems like a good estimate to me.
EDIT: Whoops L2 is 10 cycles with TAG access, L1 takes 1 to two cycles, my mistake :\

Actually, the cost of an L1 cache hit is almost the same as the cost of a register access. It was surprising to me, but it is true, at least for my processor (Athlon 64). Some time ago I wrote a simple test application to benchmark the efficiency of access to shared data in a multiprocessor system. The application body simply increments a memory variable for a predefined period of time. For comparison, I benchmarked a non-shared variable first. I captured that result, but when I disassembled the application I found that the compiler had defeated my expectations and applied an unwanted optimisation to my code: it put the variable in a CPU register and incremented it there, without any memory access. The real surprise came after I forced the compiler to use an in-memory variable instead of a register variable. With the updated application I got almost the same benchmark results. The performance degradation was really negligible (~1-2%) and looks like it was related to some side effect.
As a result:
1) I think you can consider the L1 cache as an unmanaged pool of processor registers.
2) There is no sense in applying brutal assembly optimization by forcing the compiler to keep frequently accessed data in processor registers. If the data is really accessed frequently, it will live in the L1 cache and will therefore have the same access cost as a processor register.

Related

Why are the x86 bit-string manipulation instructions slow with a memory destination? (BTS, BTR, BTC)

Agner finds that the x86 bit manipulation instructions (btr bts btc, no lock) applied to a memory operand are slower than other read-modify-write instructions (like add, xor, etc.) on most processors where they are supported. Why is this? The instructions seem quite straightforward to implement.
Is it because the address actually loaded from is not the same as that specified by the memory operand, and this confuses some frontend mechanism for tracking memory accesses? This seems plausible, but I wouldn't expect it to affect throughput (at least, not by so much); only latency.

Is it because the address actually loaded from is not the same as that specified by the memory operand
Yes, pretty clearly that's the thing that separates it from a memory-destination shift.
The reg-reg version is 1 uop with 1 cycle latency on Intel, running on execution ports 0 or 6 on Intel Haswell and later for example, same as shifts. (Decoding an index to a 1-hot mask is cheaper than a general shifter, but since there are shift units presumably Intel just uses those.)
AMD for some reason runs bts reg,reg as 2 uops, slower than simple shifts. IDK why, maybe something about the FLAGS setting.
bts mem, imm8 is also pretty normal, 3 front-end uops on Intel. xor mem, imm8 is only 2 front-end uops, but that's because it can micro-fuse the load+xor. not mem is 3 front-end uops, only micro-fusing the store-address and store-uop instructions.
and this confuses some frontend mechanism for tracking memory accesses?
No. The front-end doesn't track memory accesses, that's the back end.
It's partly slow because it's implemented as multiple uops; that hurts even when you do one surrounded by different instructions. On Intel Haswell and Alder Lake (and probably all in between), it's 10 front-end uops for bts mem, r32, vs. 3 for bts mem, imm8
Since it can't use the usual address-generation hardware directly, it's implemented in microcode as multiple uops, presumably something like LEA into a temporary from the normal addressing mode, and adding (bit_index>>6) * 4 to that to index by dwords or something like that. Oh, maybe the reason it's 10 uops is that it always wants to access the aligned dword containing the bit, not just a multiple-of-4 offset from the address in the [] addressing mode for something like [rax + rdx*4 + 123].
Doing it manually is more efficient for the normal case where you know the start of the bitstring is aligned, so you can shr the bit-index to get a dword index for load / bts reg,reg (1 uop) / store. That takes fewer uops than bts [mem], reg. Note that bts reg,reg truncates / wraps the bit-index, so if you arrange things correctly that modulo comes for free. For example a Sieve of Eratosthenes. Also How can memory destination BTS be significantly slower than load / BTS reg,reg / store?
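The index arithmetic behind the manual approach can be sketched in Python (an illustration of the math only, not of the uop counts). It assumes a bitmap stored as a list of 32-bit words starting at an aligned address; `set_bit` is a hypothetical helper, not a real API.

```python
def set_bit(words: list, bit_index: int) -> bool:
    """Set one bit in a bitmap of 32-bit words and return its previous value,
    mirroring load / bts reg,reg / store (bts puts the old bit in CF)."""
    word_index = bit_index >> 5    # shr: which dword holds the bit
    mask = 1 << (bit_index & 31)   # bts reg,reg wraps the bit-index modulo 32 for free
    old = words[word_index]        # load
    words[word_index] = old | mask # bts + store
    return bool(old & mask)


bitmap = [0] * 4                   # 128 bits
was_set = set_bit(bitmap, 37)      # bit 37 lives in word 1, bit position 5
print(was_set, [hex(w) for w in bitmap])
```

In assembly the `& 31` step needs no separate instruction, because bts with a register destination already masks the bit-index to the operand size.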
But Agner Fog and https://uops.info/ both measure a throughput of one per 5 cycles on Haswell / Alder Lake P-cores, significantly slower than the front-end bottleneck (or any per-port back-end bottleneck) would account for.
I don't know what accounts for that. The actual load and store uops should just be normal, with inputs coming from internal temporary registers but still a normal load uop and store uop as far as the addresses in the store buffer and load buffer are concerned. (Together, Intel calls that a Memory order buffer = MOB.)
I don't expect it to be a special case of memory-dependency prediction since that happens when a load uop executes (and there are previous store-address uops not yet executed, so the addresses of some previous stores are still unknown.)
TODO: run some experiments to see what if any other instructions mixed in with bts mem,reg will slow it down, competing for whatever resource it bottlenecks on.
It doesn't look like a benchmarking error on the part of https://uops.info/ (e.g. using the same address every time and stalling on store-forwarding latency). Their testing included some unrolled sequences using different offsets. e.g. Haswell throughput testing for bts m64, r64 measured 6.02 or 6.0 cycle throughput with the same address every time (bts qword ptr [r14], r8), or an average of 5.0 cycles per BTS when unrolling a repeated sequence like bts [r14],r8 ; bts [r14+0x8],r8 ; ... ; bts [r14+0x38],r8. Even for a sequence of 16 independent instructions covering two adjacent cache lines, it was still the same 5 cycles per iteration.

Can all of L2/L3 cache be used by data? If so, why does the Graviton 3 bandwidth plot drop off after half the L2/L3 size, but only gradually?

Consider Graviton3, for example. It's a 64-core CPU with per-core caches 64KiB L1d and 1MiB L2. And a shared L3 of 64MiB across all cores. The RAM bandwidth per socket is 307GB/s (source).
In this plot (source),
we see that all-cores bandwidth drops off to roughly half, when the data exceeds 4MB. This makes sense: 64x 64KiB = 4 MiB is the size of the L1 data cache.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there? The private L2 caches of 64 cores is a total of 64 MiB, same as the shared L3 size.

It looks from the plot like they may not have tested any sizes between 32M and 64M. Looks like a straight line between those points on all 3 CPUs.
Since 64M is the total size of both L2 and L3, I'd expect a test like this to have slowed most of the way down at 64M. As Brendan says, page tables and a bit of code will take space, competing with the actual intended test data. If the benchmark loop is tight, stack won't come into play, except for interrupt handling.
Once you're evicting anything from a working set slightly larger than cache, you often evict almost everything before getting back to it, depending on pseudo-LRU luck. I'd expect a test size of 48 or even 56 MiB to be a lot closer to the 32 MiB data point than the 64 MiB data point.

Can all of L2/L3 cache be used by data?
In theory, yes; but only if there's no "non-data" (code) in the cache, only if you count "all data" (and don't just count a process' data and ignore things like stack and page tables), and only if there aren't any aliasing problems.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there?
For a fully associative cache I'd expect a sudden drop off at/near 32 MiB. However, large caches are almost never fully associative as it costs way too much to find anything in the cache.
As associativity decreases the chance of conflicts increases. For example, for an 8-way associative 64 MiB cache the pathological case is that everything conflicts and you're only able to effectively use 8 MiB of it.
More specifically, for a 64 MiB cache (with unknown associativity), and an "assumed Linux" environment that lacks support for cache coloring, it's reasonable to expect a smooth drop off that ends at 64 MiB.

Just to be clear, on a running Graviton 3 in AWS, an lscpu gives me 32MiB for L3 and not 64 MiB.
Caches (sum of all):
L1d: 4 MiB (64 instances)
L1i: 4 MiB (64 instances)
L2: 64 MiB (64 instances)
L3: 32 MiB (1 instance)
The original question is assuming an L3 of 64 MiB across all cores.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there? The private L2 caches of 64 cores is a total of 64 MiB, same as the shared L3 size.

Is the L1-Dcache the ultimate data cache and is DSB also a cache that can be simulated by gem5?

I wonder if the L1-Dcache is the ultimate cache that data comes from. Because I know for i-cache, there is a DSB which is even closer to CPU which could be seen as L0-icache.
Also, I am interested in what hardware changes could influence DSB's performance? I mean for cache, there are things such as cache size, Cache Associativity. But is DSB also just a cache that can be influenced by those factors?
If yes, can I simulate the results using gem5. I know with gem5, I can configure the L1 instruction cache and observe L1 instruction cache performance. How could same things be done for DSB on gem?

I wonder if the L1-Dcache is the ultimate cache that data comes from
Yes, or the store buffer. Globally Invisible load instructions explains how partial store-forwarding can let a core load a dword value that was never globally visible, so no other core could have loaded.
The DSB (uop cache) is a cache, but it doesn't cache machine code. It caches the result of decoding x86 machine code into uops.
It has various limitations like not using more than 3 "lines" for uops from the same 32-byte block of x86 machine code, so modeling it is not as simple as just size / associativity. e.g. each way (aka line) can hold up to 6 uops, but ends with an unconditional (or predicted-taken) branch uop. And all the uops from a multi-uop instruction have to go in the same line.
The number of fused-domain uops from each x86 instruction depends on exactly what instruction it is; see https://uops.info/, but note that un-lamination will mean some instructions take more uops in the issue/rename stage and ROB than they do in the decoders and uop-cache. (Micro fusion and addressing modes)
Agner Fog's microarch guide has some detailed testing results (https://agner.org/optimize/), and see also https://www.realworldtech.com/sandy-bridge/4/
The basic parameters of Intel's uop cache are, as described in the Sandybridge section of Agner's microarch guide:
The µop cache is organized as 32 sets x 8 ways x 6 µops, totaling a maximum capacity of
1536 µops. It can allocate a maximum of 3 lines of 6 µops each for each aligned and
contiguous 32-bytes block of code.
AFAIK, this geometry has remained unchanged from SnB through Skylake and Ice Lake.
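The quoted geometry can be turned into a very simplified capacity model. This is only a sketch: it counts uops per 32-byte block and ignores the other rules mentioned above (a line ending at a taken branch, multi-uop instructions not being split across lines). All names are hypothetical.

```python
import math

SETS, WAYS, UOPS_PER_LINE = 32, 8, 6   # geometry quoted from Agner Fog's guide
MAX_LINES_PER_32B_BLOCK = 3            # limit per aligned 32-byte block of code


def capacity_uops() -> int:
    """Maximum number of uops the uop cache can hold if every line is full."""
    return SETS * WAYS * UOPS_PER_LINE


def lines_needed(uops_in_block: int) -> int:
    """Lines needed for one 32-byte block, assuming perfect packing."""
    return math.ceil(uops_in_block / UOPS_PER_LINE)


def fits_in_uop_cache(uops_in_block: int) -> bool:
    """True if the block stays within the 3-line limit under this simplified model."""
    return lines_needed(uops_in_block) <= MAX_LINES_PER_32B_BLOCK


print(capacity_uops())
for n in (6, 7, 18, 19):
    print(n, "uops ->", lines_needed(n), "lines, fits:", fits_in_uop_cache(n))
```

Real code usually packs worse than this, so a block with fewer than 18 uops can still fail to fit.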
The L1i cache is inclusive of the uop cache. The uop cache is virtually-addressed, so TLB lookups aren't needed. But it has to be evicted on TLB invalidation as well, I guess. (That's not a huge problem because the legacy decoders are quite good; Sandybridge-family avoided problems of P4's slow decoding, and trying to use its trace cache instead of a normal L1i.)
Note that AMD's Zen microarchitecture family also uses a uop cache. They don't call it a DSB, and it presumably has some differences from Intel's.
Also, I am interested in what hardware changes could influence DSB's performance?
Skylake increased the bandwidth of uop-cache -> IDQ from 4 to 6 uops per cycle. So even in high-throughput code, the uop-cache can "catch up" after bubbles partially drain the IDQ.
It can still only read 1 uop cache line per cycle, though, so for example on a Skylake where microcode updates disabled the loop buffer (LSD), a tiny loop that would normally run at 1 cycle per iteration can slow down to 2 cycles if the loop is split across a 32-byte boundary, because that means its uops will be in 2 separate uop-cache lines. (Like 1 or 2 from each line.)
But Haswell can sustain 4 uops per clock from the uop cache under ideal conditions, even with instructions that fully pack uop cache lines with 6 uops per line. So there's apparently some buffering between uop cache-line fetch and adding to the IDQ, otherwise it would be a 4 : 2 pattern if all the uops added to the IDQ had to come from the same line.

Cache miss latency in clock cycles

To measure the impact of cache misses in a program, I want to compare the latency caused by cache misses to the cycles used for actual computation.
I use perf stat to measure the cycles, L1-loads, L1-misses, LLC-loads and LLC-misses in my program. Here is an example output:
467 769,70 msec task-clock # 1,000 CPUs utilized
1 234 063 672 432 cycles # 2,638 GHz (62,50%)
572 761 379 098 instructions # 0,46 insn per cycle (75,00%)
129 143 035 219 branches # 276,083 M/sec (75,00%)
6 457 141 079 branch-misses # 5,00% of all branches (75,00%)
195 360 583 052 L1-dcache-loads # 417,643 M/sec (75,00%)
33 224 066 301 L1-dcache-load-misses # 17,01% of all L1-dcache hits (75,00%)
20 620 655 322 LLC-loads # 44,083 M/sec (50,00%)
6 030 530 728 LLC-load-misses # 29,25% of all LL-cache hits (50,00%)
Then my question is:
How to convert the number of cache-misses into a number of "lost" clock cycles?
Or alternatively, what is the proportion of time spent for fetching data?
I think the factor should be known by the manufacturer. My processor is an Intel Core i7-10810U, and I couldn't find this information in the specifications or in this list of benchmarked CPUs.
This related problem describes how to measure the number of cycles lost in a cache-miss, but is there a way to obtain this as hardware information? Ideally, the output would be something like:
L1-hit: 3 cycles
L2-hit: 10 cycles
LLC-hit: 30 cycles
RAM: 300 cycles

Out-of-order exec and memory-level parallelism exist to hide some of that latency by overlapping useful work with time data is in flight. If you simply multiplied L3 miss count by say 300 cycles each, that could exceed the total number of cycles your whole program took. The perf event cycle_activity.stalls_l3_miss (which exists on my Skylake CPU) should count cycles when no uops execute and there's an outstanding L3 cache miss. i.e. cycles when execution is fully stalled. But there will also be cycles with some work, but less than without a cache miss, and that's harder to evaluate.
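A quick sanity check with the counts from the question shows why the naive multiplication doesn't work. The 300-cycle figure is only the illustrative number used above, not a measured latency for this CPU, and the names below are hypothetical.

```python
# Counts copied from the perf stat output in the question.
TOTAL_CYCLES = 1_234_063_672_432
LLC_LOAD_MISSES = 6_030_530_728

# Assumption: illustrative DRAM latency in core cycles, not a measured value.
ASSUMED_MISS_LATENCY_CYCLES = 300


def naive_stall_cycles(misses: int, latency_cycles: int) -> int:
    """Naive estimate that treats every miss as a full, non-overlapped stall."""
    return misses * latency_cycles


naive = naive_stall_cycles(LLC_LOAD_MISSES, ASSUMED_MISS_LATENCY_CYCLES)
print(f"naive estimate : {naive:,} cycles")
print(f"total measured : {TOTAL_CYCLES:,} cycles")
print(f"naive / total  : {naive / TOTAL_CYCLES:.2f}")
```

The naive estimate comes out larger than the whole run, so many of those misses must have overlapped with each other or with other work.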
TL:DR: memory access is heavily pipelined; the whole core doesn't stop on one cache miss, that's the whole point. A pointer-chasing benchmark (to measure latency) is merely a worst case, where the only work is dereferencing a load result. See Modern Microprocessors: A 90-Minute Guide! which has a section about memory and the "memory wall". See also https://agner.org/optimize/ and https://www.realworldtech.com/haswell-cpu/ to learn more about the details of out-of-order exec CPUs and how they can continue doing independent work while one instruction is waiting for data from a cache miss, up to the limit of their out-of-order window size. (https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/)
Re: numbers from vendors:
L3 and RAM latencies aren't a fixed number of core clock cycles: first, core frequency is variable (and independent of uncore and memory clocks), and second because of contention from other cores, and number of hops over the interconnect. (Related: Is cycle count itself reliable on program timing? discusses some effects of core frequency independent of L3 and memory)
That said, Intel's optimization manual does include a table with exact latencies for L1 and L2, and typical for L3, and DRAM on Skylake-server. (2.2.1.3 Skylake Server Microarchitecture Cache Recommendations)
https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#optimization - they say SKX L3 latency is typically 50-70 cycles. DRAM speed depends some on the timing of your DIMMs.
Other people have tested specific CPUs, like https://www.7-cpu.com/cpu/Skylake.html.

Does software prefetching allocate a Line Fill Buffer (LFB)?

I've realized that Little's Law limits how fast data can be transferred at a given latency and with a given level of concurrency. If you want to transfer something faster, you either need larger transfers, more transfers "in flight", or lower latency. For the case of reading from RAM, the concurrency is limited by the number of Line Fill Buffers.
A Line Fill Buffer is allocated when a load misses the L1 cache. Modern Intel chips (Nehalem, Sandy Bridge, Ivy Bridge, Haswell) have 10 LFBs per core, and thus are limited to 10 outstanding cache misses per core. If RAM latency is about 75 ns (plausible), and each transfer is 128 bytes (64B cache line plus its hardware prefetched twin), this limits bandwidth per core to: 10 * 128 B / 75 ns ≈ 17 GB/s (about 16 GiB/s). Benchmarks such as single-threaded Stream confirm that this is reasonably accurate.
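The Little's Law estimate above can be written out as a small calculation. This is a sketch using the figures from the question (10 buffers, 128 bytes per request, latency in the 70-75 ns range); the function name is hypothetical.

```python
def bandwidth_gb_per_s(outstanding: int, bytes_per_request: int, latency_ns: float) -> float:
    """Little's Law: throughput = concurrency * request size / latency.
    Bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second)."""
    return outstanding * bytes_per_request / latency_ns


LFBS = 10  # line fill buffers per core, as stated above

for latency_ns in (70, 75):
    for size in (64, 128):
        bw = bandwidth_gb_per_s(LFBS, size, latency_ns)
        print(f"{LFBS} x {size:3d} B / {latency_ns} ns = {bw:5.2f} GB/s")
```

The same formula shows the three available levers: more requests in flight, larger requests, or lower latency.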
The obvious way to reduce the latency would be prefetching the desired data with x64 instructions such as PREFETCHT0, PREFETCHT1, PREFETCHT2, or PREFETCHNTA so that it doesn't have to be read from RAM. But I haven't been able to speed anything up by using them. The problem seems to be that the _mm_prefetch() intrinsics themselves consume LFBs, so they too are subject to the same limits. Hardware prefetches don't touch the LFBs, but also will not cross page boundaries.
But I can't find any of this documented anywhere. The closest I've found is a 15-year-old article that mentions that prefetch on the Pentium III uses the Line Fill Buffers. I worry things may have changed since then. And since I think the LFBs are associated with the L1 cache, I'm not sure why a prefetch to L2 or L3 would consume them. And yet, the speeds I measure are consistent with this being the case.
So: Is there any way to initiate a fetch from a new location in memory without using up one of those 10 Line Fill Buffers, thus achieving higher bandwidth by skirting around Little's Law?

Based on my testing, all types of prefetch instructions consume line fill buffers on recent Intel mainstream CPUs.
In particular, I added some load & prefetch tests to uarch-bench, which use large-stride loads over buffers of various sizes. Here are typical results on my Skylake i7-6700HQ:
Benchmark Cycles Nanos
16-KiB parallel loads 0.50 0.19
16-KiB parallel prefetcht0 0.50 0.19
16-KiB parallel prefetcht1 1.15 0.44
16-KiB parallel prefetcht2 1.24 0.48
16-KiB parallel prefetchtnta 0.50 0.19
32-KiB parallel loads 0.50 0.19
32-KiB parallel prefetcht0 0.50 0.19
32-KiB parallel prefetcht1 1.28 0.49
32-KiB parallel prefetcht2 1.28 0.49
32-KiB parallel prefetchtnta 0.50 0.19
128-KiB parallel loads 1.00 0.39
128-KiB parallel prefetcht0 2.00 0.77
128-KiB parallel prefetcht1 1.31 0.50
128-KiB parallel prefetcht2 1.31 0.50
128-KiB parallel prefetchtnta 4.10 1.58
256-KiB parallel loads 1.00 0.39
256-KiB parallel prefetcht0 2.00 0.77
256-KiB parallel prefetcht1 1.31 0.50
256-KiB parallel prefetcht2 1.31 0.50
256-KiB parallel prefetchtnta 4.10 1.58
512-KiB parallel loads 4.09 1.58
512-KiB parallel prefetcht0 4.12 1.59
512-KiB parallel prefetcht1 3.80 1.46
512-KiB parallel prefetcht2 3.80 1.46
512-KiB parallel prefetchtnta 4.10 1.58
2048-KiB parallel loads 4.09 1.58
2048-KiB parallel prefetcht0 4.12 1.59
2048-KiB parallel prefetcht1 3.80 1.46
2048-KiB parallel prefetcht2 3.80 1.46
2048-KiB parallel prefetchtnta 16.54 6.38
The key thing to note is that none of the prefetching techniques are much faster than loads at any buffer size. If any prefetch instruction didn't use the LFB, we would expect it to be very fast for a benchmark that fit into the level of cache it prefetches to. For example prefetcht1 brings lines into the L2, so for the 128-KiB test we might expect it to be faster than the load variant if it doesn't use LFBs.
More conclusively, we can examine the l1d_pend_miss.fb_full counter, whose description is:
Number of times a request needed a FB (Fill Buffer) entry but there
was no entry available for it. A request includes
cacheable/uncacheable demands that are load, store or SW prefetch
instructions.
The description already indicates that SW prefetches need LFB entries and testing confirmed it: for all types of prefetch, this figure was very high for any test where concurrency was a limiting factor. For example, for the 512-KiB prefetcht1 test:
Performance counter stats for './uarch-bench --test-name 512-KiB parallel prefetcht1':
38,345,242 branches
1,074,657,384 cycles
284,646,019 mem_inst_retired.all_loads
1,677,347,358 l1d_pend_miss.fb_full
The fb_full value is more than the number of cycles, meaning that the LFB was full almost all the time (it can be more than the number of cycles since up to two loads might want an LFB per cycle). This workload is pure prefetches, so there is nothing to fill up the LFBs except prefetch.
The results of this test also contradict the claimed behavior in the section of the manual quoted by Leeor:
There are cases where a PREFETCH will not perform the data prefetch.
These include:
...
If the memory subsystem runs out of request buffers
between the first-level cache and the second-level cache.
Clearly this is not the case here: the prefetch requests are not dropped when the LFBs fill up, but are stalled like a normal load until resources are available (this is not an unreasonable behavior: if you asked for a software prefetch, you probably want to get it, perhaps even if it means stalling).
We also note the following interesting behaviors:
It seems like there is some small difference between prefetcht1 and prefetcht2 as they report different performance for the 16-KiB test (the difference varies, but is consistently different), but if you repeat the test you'll see that this is more likely just run-to-run variation as those particular values are somewhat unstable (most other values are very stable).
For the L2 contained tests, we can sustain 1 load per cycle, but only one prefetcht0 prefetch every two cycles. This is kind of weird because prefetcht0 should be very similar to a load (and it can issue 2 per cycle in the L1 cases).
Even though the L2 has ~12 cycle latency, we are able to fully hide the latency with only 10 LFBs: we get 1.0 cycles per load (limited by L2 throughput), not the 12 / 10 == 1.2 cycles per load that we'd expect (best case) if the LFBs were the limiting factor (and very low values for fb_full confirm it). That's probably because the 12 cycle latency is the full load-to-use latency all the way to the execution core, which also includes several cycles of additional latency (e.g., L1 latency is 4-5 cycles), so the actual time spent in the LFB is less than 10 cycles.
For the L3 tests, we see values of 3.8-4.1 cycles, very close to the expected 42/10 = 4.2 cycles based on the L3 load-to-use latency. So we are definitely limited by the 10 LFBs when we hit the L3. Here prefetcht1 and prefetcht2 are consistently 0.3 cycles faster than loads or prefetcht0. Given the 10 LFBs, that equals 3 cycles less occupancy, more or less explained by the prefetch stopping at L2 rather than going all the way to L1.
prefetchnta generally has much lower throughput than the others outside of L1. This probably means that prefetchnta is actually doing what it is supposed to, and appears to bring lines into L1, not into L2, and only "weakly" into L3. So for the L2-contained tests it has concurrency-limited throughput as if it was hitting the L3 cache, and for the 2048-KiB case (1/3 of the L3 cache size) it has the performance of hitting main memory. prefetchnta limits L3 cache pollution (to something like only one way per set), so we seem to be getting evictions.
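The concurrency arithmetic used in these observations can be sketched as follows. It assumes the 10 LFBs are the only limit and that each request occupies its buffer for the whole latency; the function names are hypothetical, and the inputs are the figures quoted above.

```python
def min_cycles_per_line(latency_cycles: float, fill_buffers: int = 10) -> float:
    """Best-case cycles per line when at most `fill_buffers` requests
    are in flight and each one takes `latency_cycles`."""
    return latency_cycles / fill_buffers


def implied_occupancy(cycles_per_line: float, fill_buffers: int = 10) -> float:
    """Inverse view: buffer occupancy (cycles) implied by a measured throughput,
    assuming all buffers were busy the whole time."""
    return cycles_per_line * fill_buffers


print(min_cycles_per_line(12))   # L2 latency: bound if LFBs were the limit
print(min_cycles_per_line(42))   # L3 latency: close to the measured 3.8-4.1
# Measured L3 results: loads/prefetcht0 vs. prefetcht1/prefetcht2
print(implied_occupancy(4.1) - implied_occupancy(3.8))
```

The last line gives roughly 3 cycles, which is the occupancy difference attributed above to the prefetch stopping at L2.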
Could it be different?
Here's an older answer I wrote before testing, speculating on how it could work:
In general, I would expect any prefetch that results in data ending up in L1 to consume a line fill buffer, since I believe that the only path between L1 and the rest of the memory hierarchy is the LFB [1]. So SW and HW prefetches that target the L1 probably both use LFBs.
However, this leaves open the possibility that prefetches that target L2 or higher levels don't consume LFBs. For the case of hardware prefetch, I'm quite sure this is the case: you can find many references that explain that HW prefetch is a mechanism to effectively get more memory parallelism beyond the maximum of 10 offered by the LFBs. Furthermore, it doesn't seem like the L2 prefetchers could use the LFBs even if they wanted to: they live in/near the L2 and issue requests to higher levels, presumably using the superqueue, and wouldn't need the LFBs.
That leaves software prefetches that target the L2 (or higher), such as prefetcht1 and prefetcht2 [2]. Unlike requests generated by the L2, these start in the core, so they need some way to get from the core out, and this could be via the LFB. The Intel Optimization guide has the following interesting quote (emphasis mine):
Generally, software prefetching into the L2 will show more benefit
than L1 prefetches. A software prefetch into L1 will consume critical
hardware resources (fill buffer) until the cacheline fill completes. A
software prefetch into L2 does not hold those resources, and it is
less likely to have a negative performance impact. If you do use L1
software prefetches, it is best if the software prefetch is serviced
by hits in the L2 cache, so the length of time that the hardware
resources are held is minimized.
This would seem to indicate that software prefetches don't consume LFBs - but this quote only applies to the Knights Landing architecture, and I can't find similar language for any of the more mainstream architectures. It appears that the cache design of Knights Landing is significantly different (or the quote is wrong).
[1] In fact, I think that even non-temporal stores use the LFBs to get out of the execution core - but their occupancy time is short because as soon as they get to the L2 they can enter the superqueue (without actually going into L2) and then free up their associated LFB.
[2] I think both of these target the L2 on recent Intel, but this is also unclear - perhaps the t2 hint actually targets LLC on some uarchs?

First of all a minor correction - read the optimization guide, and you'll note that some HW prefetchers belong in the L2 cache, and as such are not limited by the number of fill buffers but rather by the L2 counterpart.
The "spatial prefetcher" (the colocated-64B line you mention, completing to 128B chunks) is one of them, so in theory if you fetch every other line you'll be able to get a higher bandwidth (some DCU prefetchers might try to "fill the gaps for you", but in theory they should have lower priority so it might work).
However, the "king" prefetcher is the other guy, the "L2 streamer". Section 2.1.5.4 reads:
Streamer : This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page
The important part is -
The streamer may issue two prefetch requests on every L2 lookup. The streamer
can run up to 20 lines ahead of the load request.
This 2:1 ratio means that for a stream of accesses that is recognized by this prefetcher, it would always run ahead of your accesses. It's true that you won't see these lines in your L1 automatically, but it does mean that if all works well, you should always get L2 hit latency for them (once the prefetch stream had enough time to run ahead and mitigate L3/memory latencies). You may only have 10 LFBs, but as you noted in your calculation - the shorter the access latency becomes, the faster you can replace them and the higher bandwidth you can reach. This is essentially detaching the L1 <-- mem latency into parallel streams of L1 <-- L2 and L2 <-- mem.
As for the question in your headline - it stands to reason that prefetches attempting to fill the L1 would require a line fill buffer to hold the retrieved data for that level. This should probably include all L1 prefetches. As for SW prefetches, section 7.4.3 says:
There are cases where a PREFETCH will not perform the data prefetch. These include:
PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.
An access to the specified address that causes a fault/exception.
If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.
...
So I assume you're right and SW prefetches are not a way to artificially increase your number of outstanding requests. However, the same explanation applies here as well - if you know how to use SW prefetching to access your lines well enough in advance, you may be able to mitigate some of the access latency and increase your effective BW. This however won't work for long streams for two reasons: 1) your cache capacity is limited (even if the prefetch is temporal, like t0 flavor), and 2) you still need to pay the full L1-->mem latency for each prefetch, so you're just moving your stress ahead a bit - if your data manipulation is faster than memory access, you'll eventually catch up with your SW prefetching. So this only works if you can prefetch all you need well enough in advance, and keep it there.
