Does software prefetching allocate a Line Fill Buffer (LFB)? - caching

I've realized that Little's Law limits how fast data can be transferred at a given latency and with a given level of concurrency. If you want to transfer something faster, you either need larger transfers, more transfers "in flight", or lower latency. For the case of reading from RAM, the concurrency is limited by the number of Line Fill Buffers.
A Line Fill Buffer is allocated when a load misses the L1 cache. Modern Intel chips (Nehalem, Sandy Bridge, Ivy Bridge, Haswell) have 10 LFBs per core, and thus are limited to 10 outstanding cache misses per core. If RAM latency is 75 ns (plausible), and each transfer is 128 bytes (a 64B cache line plus its hardware-prefetched twin), this limits bandwidth per core to: 10 * 128 B / 75 ns ≈ 17 GB/s. Benchmarks such as single-threaded Stream confirm that this is reasonably accurate.
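To make the arithmetic concrete, here is the same back-of-the-envelope bound as a tiny C program (the LFB count, line size and latency are just the assumed values from above, not measurements):

#include <stdio.h>

int main(void) {
    double lfbs       = 10.0;   /* outstanding L1 misses per core */
    double bytes      = 128.0;  /* 64B line plus its hardware-prefetched twin */
    double latency_ns = 75.0;   /* assumed round-trip latency to RAM */
    /* Little's Law: bandwidth = concurrency * bytes-per-request / latency.
       Bytes per nanosecond happens to equal GB/s. */
    printf("max bandwidth ~= %.1f GB/s\n", lfbs * bytes / latency_ns);
    return 0;
}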
The obvious way to reduce the latency would be prefetching the desired data with x64 instructions such as PREFETCHT0, PREFETCHT1, PREFETCHT2, or PREFETCHNTA so that it doesn't have to be read from RAM. But I haven't been able to speed anything up by using them. The problem seems to be that the _mm_prefetch() instructions themselves consume LFBs, so they too are subject to the same limits. Hardware prefetches don't touch the LFBs, but also will not cross page boundaries.
But I can't find any of this documented anywhere. The closest I've found is a 15-year-old article that mentions that prefetch on the Pentium III uses the Line Fill Buffers. I worry things may have changed since then. And since I think the LFBs are associated with the L1 cache, I'm not sure why a prefetch to L2 or L3 would consume them. And yet, the speeds I measure are consistent with this being the case.
So: Is there any way to initiate a fetch from a new location in memory without using up one of those 10 Line Fill Buffers, thus achieving higher bandwidth by skirting around Little's Law?
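For reference, the kind of loop I've been experimenting with looks roughly like this (a minimal sketch; the buffer, the summing and the PF_DIST distance are just illustrative):

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

#define PF_DIST (16 * 64)  /* prefetch this many bytes ahead (made-up tuning value) */

uint64_t sum_with_prefetch(const uint64_t *buf, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if ((i & 7) == 0)  /* once per 64B cache line */
            _mm_prefetch((const char *)&buf[i] + PF_DIST, _MM_HINT_T0);
        sum += buf[i];
    }
    return sum;
}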

Based on my testing, all types of prefetch instructions consume line fill buffers on recent Intel mainstream CPUs.
In particular, I added some load & prefetch tests to uarch-bench, which use large-stride loads over buffers of various sizes. Here are typical results on my Skylake i7-6700HQ:
Benchmark                        Cycles    Nanos
  16-KiB parallel loads            0.50     0.19
  16-KiB parallel prefetcht0       0.50     0.19
  16-KiB parallel prefetcht1       1.15     0.44
  16-KiB parallel prefetcht2       1.24     0.48
  16-KiB parallel prefetchtnta     0.50     0.19
  32-KiB parallel loads            0.50     0.19
  32-KiB parallel prefetcht0       0.50     0.19
  32-KiB parallel prefetcht1       1.28     0.49
  32-KiB parallel prefetcht2       1.28     0.49
  32-KiB parallel prefetchtnta     0.50     0.19
 128-KiB parallel loads            1.00     0.39
 128-KiB parallel prefetcht0       2.00     0.77
 128-KiB parallel prefetcht1       1.31     0.50
 128-KiB parallel prefetcht2       1.31     0.50
 128-KiB parallel prefetchtnta     4.10     1.58
 256-KiB parallel loads            1.00     0.39
 256-KiB parallel prefetcht0       2.00     0.77
 256-KiB parallel prefetcht1       1.31     0.50
 256-KiB parallel prefetcht2       1.31     0.50
 256-KiB parallel prefetchtnta     4.10     1.58
 512-KiB parallel loads            4.09     1.58
 512-KiB parallel prefetcht0       4.12     1.59
 512-KiB parallel prefetcht1       3.80     1.46
 512-KiB parallel prefetcht2       3.80     1.46
 512-KiB parallel prefetchtnta     4.10     1.58
2048-KiB parallel loads            4.09     1.58
2048-KiB parallel prefetcht0       4.12     1.59
2048-KiB parallel prefetcht1       3.80     1.46
2048-KiB parallel prefetcht2       3.80     1.46
2048-KiB parallel prefetchtnta    16.54     6.38
The key thing to note is that none of the prefetching techniques are much faster than loads at any buffer size. If any prefetch instruction didn't use the LFB, we would expect it to be very fast for a benchmark that fit into the level of cache it prefetches to. For example prefetcht1 brings lines into the L2, so for the 128-KiB test we might expect it to be faster than the load variant if it doesn't use LFBs.
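For reference, each test is conceptually just a stream of independent accesses, one per cache line, swept repeatedly over a buffer of the given size; something like this C paraphrase (not the actual uarch-bench source):

#include <stddef.h>
#include <xmmintrin.h>

/* Paraphrase of one "parallel prefetcht0" kernel: the prefetches are all
 * independent, so as many as the hardware allows can be in flight at once. */
void parallel_prefetcht0(const char *buf, size_t buf_size, long reps) {
    for (long r = 0; r < reps; r++)
        for (size_t i = 0; i < buf_size; i += 64)
            _mm_prefetch(buf + i, _MM_HINT_T0);
}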
More conclusively, we can examine the l1d_pend_miss.fb_full counter, whose description is:
Number of times a request needed a FB (Fill Buffer) entry but there
was no entry available for it. A request includes
cacheable/uncacheable demands that are load, store or SW prefetch
instructions.
The description already indicates that SW prefetches need LFB entries and testing confirmed it: for all types of prefetch, this figure was very high for any test where concurrency was a limiting factor. For example, for the 512-KiB prefetcht1 test:
Performance counter stats for './uarch-bench --test-name 512-KiB parallel prefetcht1':
38,345,242 branches
1,074,657,384 cycles
284,646,019 mem_inst_retired.all_loads
1,677,347,358 l1d_pend_miss.fb_full
The fb_full value is more than the number of cycles, meaning that the LFB was full almost all the time (it can be more than the number of cycles since up to two loads might want an LFB per cycle). This workload is pure prefetches, so there is nothing to fill up the LFBs except prefetch.
The results of this test also contradict the claimed behavior in the section of the manual quoted by Leeor:
There are cases where a PREFETCH will not perform the data prefetch.
These include:
...
If the memory subsystem runs out of request buffers
between the first-level cache and the second-level cache.
Clearly this is not the case here: the prefetch requests are not dropped when the LFBs fill up, but are stalled like a normal load until resources are available (this is not an unreasonable behavior: if you asked for a software prefetch, you probably want to get it, perhaps even if it means stalling).
We also note the following interesting behaviors:
It seems like there is some small difference between prefetcht1 and prefetcht2, as they report different performance for the 16-KiB test, but if you repeat the test you'll see that this is more likely just run-to-run variation, as those particular values are somewhat unstable (most other values are very stable).
For the L2 contained tests, we can sustain 1 load per cycle, but only one prefetcht0 prefetch. This is kind of weird because prefetcht0 should be very similar to a load (and it can issue 2 per cycle in the L1 cases).
Even though the L2 has ~12 cycle latency, we are able to fully hide it with only 10 LFBs: we get 1.0 cycles per load (limited by L2 throughput), not the 12 / 10 == 1.2 cycles per load that we'd expect (best case) if the LFBs were the limiting factor (and very low values for fb_full confirm it). That's probably because the 12 cycle latency is the full load-to-use latency all the way to the execution core, which includes several cycles of additional latency (e.g., L1 latency is 4-5 cycles), so the actual time spent in the LFB is less than 10 cycles.
For the L3 tests, we see values of 3.8-4.1 cycles, very close to the expected 42/10 = 4.2 cycles based on the L3 load-to-use latency. So we are definitely limited by the 10 LFBs when we hit the L3. Here prefetcht1 and prefetcht2 are consistently 0.3 cycles faster than loads or prefetcht0. Given the 10 LFBs, that equals 3 cycles less occupancy, more or less explained by the prefetch stopping at L2 rather than going all the way to L1.
prefetchnta generally has much lower throughput than the others outside of L1. This probably means that prefetchnta is actually doing what it is supposed to: it appears to bring lines into L1, not into L2, and only "weakly" into L3. So for the L2-contained tests it has concurrency-limited throughput as if it were hitting the L3 cache, and for the 2048-KiB case (1/3 of the L3 cache size) it has the performance of hitting main memory. prefetchnta limits L3 cache pollution (to something like only one way per set), so we seem to be getting evictions.
Could it be different?
Here's an older answer I wrote before testing, speculating on how it could work:
In general, I would expect any prefetch that results in data ending up in L1 to consume a line fill buffer, since I believe that the only path between L1 and the rest of the memory hierarchy is the LFB[1]. So SW and HW prefetches that target the L1 probably both use LFBs.
However, this leaves open the possibility that prefetches that target the L2 or higher levels don't consume LFBs. For the case of hardware prefetch, I'm quite sure this is the case: you can find many references that explain that HW prefetch is a mechanism to effectively get more memory parallelism beyond the maximum of 10 offered by the LFBs. Furthermore, it doesn't seem like the L2 prefetchers could use the LFBs even if they wanted to: they live in/near the L2 and issue requests to higher levels, presumably using the superqueue, and wouldn't need the LFBs.
That leaves the software prefetches that target the L2 (or higher), such as prefetcht1 and prefetcht2[2]. Unlike requests generated by the L2, these start in the core, so they need some way to get from the core out, and this could be via the LFB. The Intel Optimization guide has the following interesting quote (emphasis mine):
Generally, software prefetching into the L2 will show more benefit
than L1 prefetches. A software prefetch into L1 will consume critical
hardware resources (fill buffer) until the cacheline fill completes. A
software prefetch into L2 does not hold those resources, and it is
less likely to have a negative performance impact. If you do use L1
software prefetches, it is best if the software prefetch is serviced
by hits in the L2 cache, so the length of time that the hardware
resources are held is minimized.
This would seem to indicate that software prefetches don't consume LFBs - but this quote only applies to the Knights Landing architecture, and I can't find similar language for any of the more mainstream architectures. It appears that the cache design of Knights Landing is significantly different (or the quote is wrong).
[1] In fact, I think that even non-temporal stores use the LFBs to get out of the execution core - but their occupancy time is short because as soon as they get to the L2 they can enter the superqueue (without actually going into L2) and then free up their associated LFB.
[2] I think both of these target the L2 on recent Intel, but this is also unclear - perhaps the t2 hint actually targets LLC on some uarchs?

First of all a minor correction - read the optimization guide, and you'll note that some HW prefetchers belong in the L2 cache, and as such are not limited by the number of fill buffers but rather by the L2 counterpart.
The "spatial prefetcher" (the colocated-64B line you meantion, completing to 128B chunks) is one of them, so in theory if you fetch every other line you'll be able to get a higher bandwidth (some DCU prefetchers might try to "fill the gaps for you", but in theory they should have lower priority so it might work).
However, the "king" prefetcher is the other guy, the "L2 streamer". Section 2.1.5.4 reads:
Streamer : This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page
The important part is -
The streamer may issue two prefetch requests on every L2 lookup. The streamer
can run up to 20 lines ahead of the load request.
This 2:1 ratio means that for a stream of accesses that is recognized by this prefetcher, it would always run ahead of your accesses. It's true that you won't see these lines in your L1 automatically, but it does mean that if all works well you should always get L2 hit latency for them (once the prefetch stream has had enough time to run ahead and mitigate L3/memory latencies). You may only have 10 LFBs, but as you noted in your calculation - the shorter the access latency becomes, the faster you can replace them and the higher bandwidth you can reach. This essentially splits the L1 <-- mem latency into parallel streams of L1 <-- L2 and L2 <-- mem.
As for the question in your headline - it stands to reason that prefetches attempting to fill the L1 would require a line fill buffer to hold the retrieved data for that level. This should probably include all L1 prefetches. As for SW prefetches, section 7.4.3 says:
There are cases where a PREFETCH will not perform the data prefetch. These include:
PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.
An access to the specified address that causes a fault/exception.
If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.
...
So I assume you're right and SW prefetches are not a way to artificially increase your number of outstanding requests. However, the same explanation applies here as well - if you know how to use SW prefetching to access your lines well enough in advance, you may be able to mitigate some of the access latency and increase your effective BW. This however won't work for long streams for two reasons: 1) your cache capacity is limited (even if the prefetch is temporal, like t0 flavor), and 2) you still need to pay the full L1-->mem latency for each prefetch, so you're just moving your stress ahead a bit - if your data manipulation is faster than memory access, you'll eventually catch up with your SW prefetching. So this only works if you can prefetch all you need well enough in advance, and keep it there.

Related

Cache miss latency in clock cycles

To measure the impact of cache-misses in a program, I want to compare the latency caused by cache-misses to the cycles used for actual computation.
I use perf stat to measure the cycles, L1-loads, L1-misses, LLC-loads and LLC-misses in my program. Here is a example output:
467 769,70 msec task-clock # 1,000 CPUs utilized
1 234 063 672 432 cycles # 2,638 GHz (62,50%)
572 761 379 098 instructions # 0,46 insn per cycle (75,00%)
129 143 035 219 branches # 276,083 M/sec (75,00%)
6 457 141 079 branch-misses # 5,00% of all branches (75,00%)
195 360 583 052 L1-dcache-loads # 417,643 M/sec (75,00%)
33 224 066 301 L1-dcache-load-misses # 17,01% of all L1-dcache hits (75,00%)
20 620 655 322 LLC-loads # 44,083 M/sec (50,00%)
6 030 530 728 LLC-load-misses # 29,25% of all LL-cache hits (50,00%)
Then my question is:
How to convert the number of cache-misses into a number of "lost" clock cycles?
Or alternatively, what is the proportion of time spent for fetching data?
I think the factor should be known by the manufacturer. My processor is an Intel Core i7-10810U, and I couldn't find this information in the specifications nor in this list of benchmarked CPUs.
This related problem describes how to measure the number of cycles lost in a cache-miss, but is there a way to obtain this as hardware information? Ideally, the output would be something like:
L1-hit: 3 cycles
L2-hit: 10 cycles
LLC-hit: 30 cycles
RAM: 300 cycles
Out-of-order exec and memory-level parallelism exist to hide some of that latency by overlapping useful work with time data is in flight. If you simply multiplied L3 miss count by say 300 cycles each, that could exceed the total number of cycles your whole program took. The perf event cycle_activity.stalls_l3_miss (which exists on my Skylake CPU) should count cycles when no uops execute and there's an outstanding L3 cache miss. i.e. cycles when execution is fully stalled. But there will also be cycles with some work, but less than without a cache miss, and that's harder to evaluate.
TL:DR: memory access is heavily pipelined; the whole core doesn't stop on one cache miss, that's the whole point. A pointer-chasing benchmark (to measure latency) is merely a worst case, where the only work is dereferencing a load result. See Modern Microprocessors
A 90-Minute Guide! which has a section about memory and the "memory wall". See also https://agner.org/optimize/ and https://www.realworldtech.com/haswell-cpu/ to learn more about the details of out-of-order exec CPUs and how they can continue doing independent work while one instruction is waiting for data from a cache miss, up to the limit of their out-of-order window size. (https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/)
Re: numbers from vendors:
L3 and RAM latencies aren't a fixed number of core clock cycles: first, core frequency is variable (and independent of uncore and memory clocks), and second because of contention from other cores, and number of hops over the interconnect. (Related: Is cycle count itself reliable on program timing? discusses some effects of core frequency independent of L3 and memory)
That said, Intel's optimization manual does include a table with exact latencies for L1 and L2, and typical for L3, and DRAM on Skylake-server. (2.2.1.3 Skylake Server Microarchitecture Cache Recommendations)
https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#optimization - they say SKX L3 latency is typically 50-70 cycles. DRAM speed depends some on the timing of your DIMMs.
Other people have tested specific CPUs, like https://www.7-cpu.com/cpu/Skylake.html.

Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake

I'm seeing unexpectedly poor performance for a simple store loop which has two stores: one with a forward stride of 16 bytes and one that's always to the same location[1], like this:
volatile uint32_t value;
void weirdo_cpp(size_t iters, uint32_t* output) {
uint32_t x = value;
uint32_t *rdx = output;
volatile uint32_t *rsi = output;
do {
*rdx = x;
*rsi = x;
rdx += 4; // 16 byte stride
} while (--iters > 0);
}
In assembly this loop probably[3] looks like:
weirdo_cpp:
...
align 16
.top:
mov [rdx], eax ; stride 16
mov [rsi], eax ; never changes
add rdx, 16
dec rdi
jne .top
ret
When the memory region accessed is in L2 I would expect this to run at less than 3 cycles per iteration. The second store just keeps hitting the same location and should add about a cycle. The first store implies bringing in a line from L2 and hence also evicting a line once every 4 iterations. I'm not sure how you evaluate the L2 cost, but even if you conservatively estimate that the L1 can only do one of the following every cycle: (a) commit a store or (b) receive a line from L2 or (c) evict a line to L2, you'd get something like 1 + 0.25 + 0.25 = 1.5 cycles for the stride-16 store stream.
Indeed, if you comment out one store you get ~1.25 cycles per iteration for the first store only, and ~1.01 cycles per iteration for the second store, so 2.5 cycles per iteration seems like a conservative estimate.
The actual performance is very odd, however. Here's a typical run of the test harness:
Estimated CPU speed: 2.60 GHz
output size : 64 KiB
output alignment: 32
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
3.89 cycles/iter, 1.49 ns/iter, cpu before: 0, cpu after: 0
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
4.73 cycles/iter, 1.81 ns/iter, cpu before: 0, cpu after: 0
7.33 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.33 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.34 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.26 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.31 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.29 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.29 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.27 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.30 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.30 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
Two things are weird here.
First are the bimodal timings: there is a fast mode and a slow mode. We start out in slow mode taking about 7.3 cycles per iteration, and at some point transition to about 3.9 cycles per iteration. This behavior is consistent and reproducible, and the two timings are always quite consistent, clustered around the two values. The transition shows up in both directions: from slow mode to fast mode and the other way around (and sometimes multiple transitions in one run).
The other weird thing is the really bad performance. Even in fast mode, at about 3.9 cycles the performance is much worse than the 1.0 + 1.3 = 2.3 cycles worst case you'd expect from adding together each of the cases with a single store (and assuming that absolutely zero work can be overlapped when both stores are in the loop). In slow mode, performance is terrible compared to what you'd expect based on first principles: it is taking 7.3 cycles to do 2 stores, and if you put it in L2 store bandwidth terms, that's roughly 29 cycles per L2 store (since we only store one full cache line every 4 iterations).
Skylake is recorded as having a 64B/cycle throughput between L1 and L2, which is way higher than the observed throughput here (about 2 bytes/cycle in slow mode).
What explains the poor throughput and bimodal performance and can I avoid it?
I'm also curious if this reproduces on other architectures and even on other Skylake boxes. Feel free to include local results in the comments.
You can find the test code and harness on github. There is a Makefile for Linux or Unix-like platforms, but it should be relatively easy to build on Windows too. If you want to run the asm variant you'll need nasm or yasm for the assembly[4] - if you don't have that you can just try the C++ version.
Eliminated Possibilities
Here are some possibilities that I considered and largely eliminated. Many of the possibilities are eliminated by the simple fact that you see the performance transition randomly in the middle of the benchmarking loop, when many things simply haven't changed (e.g., if it was related to the output array alignment, it couldn't change in the middle of a run since the same buffer is used the entire time). I'll refer to this as the default elimination below (even for things that are default elimination there is often another argument to be made).
Alignment factors: the output array is 16 byte aligned, and I've tried up to 2MB alignment without change. Also eliminated by the default elimination.
Contention with other processes on the machine: the effect is observed more or less identically on an idle machine and even on a heavily loaded one (e.g., using stress -vm 4). The benchmark itself should be completely core-local anyways since it fits in L2, and perf confirms there are very few L2 misses per iteration (about 1 miss every 300-400 iterations, probably related to the printf code).
TurboBoost: TurboBoost is completely disabled, confirmed by three different MHz readings.
Power-saving stuff: The performance governor is intel_pstate in performance mode. No frequency variations are observed during the test (CPU stays essentially locked at 2.59 GHz).
TLB effects: The effect is present even when the output buffer is located in a 2 MB huge page. In any case, the 64 4k TLB entries more than cover the 128K output buffer. perf doesn't report any particularly weird TLB behavior.
4k aliasing: older, more complex versions of this benchmark did show some 4k aliasing but this has been eliminated since there are no loads in the benchmark (it's loads that might incorrectly alias earlier stores). Also eliminated by the default elimination.
L2 associativity conflicts: eliminated by the default elimination and by the fact that this doesn't go away even with 2MB pages, where we can be sure the output buffer is laid out linearly in physical memory.
Hyperthreading effects: HT is disabled.
Prefetching: Only two of the prefetchers could be involved here (the "DCU", aka L1<->L2 prefetchers), since all the data lives in L1 or L2, but the performance is the same with all prefetchers enabled or all disabled.
Interrupts: no correlation between interrupt count and slow mode. There is a limited number of total interrupts, mostly clock ticks.
toplev.py
I used toplev.py which implements Intel's Top Down analysis method, and to no surprise it identifies the benchmark as store bound:
BE Backend_Bound: 82.11 % Slots [ 4.83%]
BE/Mem Backend_Bound.Memory_Bound: 59.64 % Slots [ 4.83%]
BE/Core Backend_Bound.Core_Bound: 22.47 % Slots [ 4.83%]
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 0.03 % Stalls [ 4.92%]
This metric estimates how often the CPU was stalled without
loads missing the L1 data cache...
Sampling events: mem_load_retired.l1_hit:pp mem_load_retired.fb_hit:pp
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 74.91 % Stalls [ 4.96%] <==
This metric estimates how often CPU was stalled due to
store memory accesses...
Sampling events: mem_inst_retired.all_stores:pp
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 28.20 % Clocks [ 4.93%]
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.1_Port_Utilized: 26.28 % CoreClocks [ 4.83%]
This metric represents Core cycles fraction where the CPU
executed total of 1 uop per cycle on all execution ports...
MUX: 4.65 %
PerfMon Event Multiplexing accuracy indicator
This doesn't really shed much light: we already knew it must be the stores messing things up, but why? Intel's description of the condition doesn't say much.
Here's a reasonable summary of some of the issues involved in L1-L2 interaction.
Update Feb 2019: I can no longer reproduce the "bimodal" part of the performance: for me, on the same i7-6700HQ box, the performance is now always very slow in the cases where the slow and very slow bimodal performance used to apply, i.e., with results around 16-20 cycles per line.
This change seems to have been introduced in the August 2018 Skylake microcode update, revision 0xC6. The prior microcode, 0xC2 shows the original behavior described in the question.
[1] This is a greatly simplified MCVE of my original loop, which was at least 3 times the size and which did lots of additional work, but exhibited exactly the same performance as this simple version, bottlenecked on the same mysterious issue.
[3] In particular, it looks exactly like this if you write the assembly by hand, or if you compile it with gcc -O1 (version 5.4.1), and probably most reasonable compilers (volatile is used to avoid sinking the mostly-dead second store outside the loop).
[4] No doubt you could convert this to MASM syntax with a few minor edits since the assembly is so trivial. Pull requests accepted.
What I've found so far. Unfortunately it doesn't really offer an explanation for the poor performance, and none at all for the bimodal distribution, but it is more a set of rules for when you might see the slow performance and notes on mitigating it:
The store throughput into L2 appears to be at most one 64-byte cache line per three cycles[0], putting a ~21 bytes per cycle upper limit on store throughput. Said another way, a series of stores that miss in L1 and hit in L2 will take at least three cycles per cache line touched.
Above that baseline there is a significant penalty when stores that hit in L2 are interleaved with stores to a different cache line (regardless of whether those stores hit in L1 or L2).
The penalty is apparently somewhat larger for stores that are nearby (but still not in the same cache line).
The bimodal performance is at least superficially related to above effect since in the non-interleaving case it does not appear to occur, although I don't have a further explanation for it.
If you ensure the cache line is already in L1 before the store, by prefetch or a dummy load, the slow performance disappears and the performance is no longer bimodal.
Details and Pictures
64-byte Stride
The original question arbitrarily used a stride of 16, but let's start with probably the simplest case: a stride of 64, i.e., one full cache line. As it turns out the various effects are visible with any stride, but 64 ensures an L1 miss on every store and so removes some variables.
Let's also remove the second store for now - so we're just testing a single 64-byte strided store over 64K of memory:
top:
mov BYTE PTR [rdx],al
add rdx,0x40
sub rdi,0x1
jne top
Running this in the same harness as above, I get about 3.05 cycles/store[2], although there is quite a bit of variance compared to what I'm used to seeing (you can even find a 3.0 in there).
So we know already we probably aren't going to do better than this for sustained stores purely to L2[1]. While Skylake apparently has a 64 byte throughput between L1 and L2, in the case of a stream of stores that bandwidth has to be shared between evictions from L1 and loading the new line into L1. 3 cycles seems reasonable if it takes, say, 1 cycle each to (a) evict the dirty victim line from L1 to L2, (b) update L1 with the new line from L2 and (c) commit the store into L1.
What happens when we add a second write to the same cache line (to the next byte, although it turns out not to matter) in the loop? Like this:
top:
mov BYTE PTR [rdx],al
mov BYTE PTR [rdx+0x1],al
add rdx,0x40
sub rdi,0x1
jne top
Here's a histogram of the timing for 1000 runs of the test harness for the above loop:
count cycles/itr
1 3.0
51 3.1
5 3.2
5 3.3
12 3.4
733 3.5
139 3.6
22 3.7
2 3.8
11 4.0
16 4.1
1 4.3
2 4.4
So the majority of times are clustered around 3.5 cycles. That means that this additional store only added 0.5 cycles to the timing. It could be something like the store buffer is able to drain two stores to the L1 if they are in the same line, but this only happens about half the time.
Consider that the store buffer contains a series of stores like 1, 1, 2, 2, 3, 3 where 1 indicates the cache line: half of the positions have two consecutive values from the same cache line and half don't. As the store buffer is waiting to drain stores, and the L1 is busily evicting to and accepting lines from L2, the L1 will come available for a store at an "arbitrary" point, and if it is at the position 1, 1 maybe the stores drain in one cycle, but if it's at 1, 2 it takes two cycles.
Note there is another peak of about 6% of results around 3.1 rather than 3.5. That could be a steady state where we always get the lucky outcome. There is another peak of around 3% at ~4.0-4.1 - the "always unlucky" arrangement.
Let's test this theory by looking at various offsets between the first and second stores:
top:
mov BYTE PTR [rdx + FIRST],al
mov BYTE PTR [rdx + SECOND],al
add rdx,0x40
sub rdi,0x1
jne top
We try all values of FIRST and SECOND from 0 to 256 in steps of 8. The results, with varying FIRST values on the vertical axis and SECOND on the horizontal:
We see a specific pattern - the white values are "fast" (around the 3.0-4.1 values discussed above for the offset of 1). Yellow values are higher, up to 8 cycles, and red up to 10. The purple outliers are the highest and are usually cases where the "slow mode" described in the OP kicks in (usually clocking in at 18.0 cycles/iter). We notice the following:
From the pattern of white cells, we see that we get the fast ~3.5 cycle result as long as the second store is in the same cache line or the next relative to the first store. This is consistent with the idea above that stores to the same cache line are handled more efficiently. The reason that having the second store in the next cache line works is that the pattern ends up being the same, except for the very first access: 0, 0, 1, 1, 2, 2, ... vs 0, 1, 1, 2, 2, ... - where in the second case it is the second store that first touches each cache line. The store buffer doesn't care though. As soon as you get into different cache lines, you get a pattern like 0, 2, 1, 3, 2, ... and apparently this sucks?
The purple "outliers" are never appear in the white areas, so are apparently restricted to the scenario that is already slow (and the slow more here makes it about 2.5x slower: from ~8 to 18 cycles).
We can zoom out a bit and look at even larger offsets:
The same basic pattern holds, although we see that the performance improves (green area) as the second store gets further away (ahead or behind) from the first one, up until it gets worse again at an offset of about ~1700 bytes. Even in the improved area we only get to 5.8 cycles/iteration at best, still much worse than the same-line performance of 3.5.
If you add any kind of load or prefetch instruction that runs ahead[3] of the stores, both the overall slow performance and the "slow mode" outliers disappear.
You can port this back to the original stride-by-16 problem - any type of prefetch or load in the core loop, pretty much regardless of the distance (even if it's behind, in fact), fixes the issue and you get 2.3 cycles/iteration, close to the best possible ideal of 2.0, and equal to the sum of the two stores with separate loops.
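For example, a version of the original C loop with such a prefetch added might look like this (a sketch; per the note above, the exact prefetch distance hardly matters):

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

volatile uint32_t value;

void weirdo_cpp_prefetch(size_t iters, uint32_t* output) {
    uint32_t x = value;
    uint32_t *rdx = output;
    volatile uint32_t *rsi = output;
    do {
        _mm_prefetch((const char *)rdx + 64, _MM_HINT_T0); /* any load or prefetch works */
        *rdx = x;
        *rsi = x;
        rdx += 4;  /* 16 byte stride */
    } while (--iters > 0);
}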
So the basic rule is that stores to L2 without corresponding loads are much slower than if you software prefetch them - unless the entire store stream accesses cache lines in a single sequential pattern. That's contrary to the idea that a linear pattern like this never benefits from SW prefetch.
I don't really have a fleshed out explanation, but it could include these factors:
Having other stores in the store buffers may reduce the concurrency of the requests going to L2. It isn't clear exactly when stores that are going to miss in L1 allocate a store buffer, but perhaps it occurs near when the store is going to retire and there is a certain amount of "lookahead" into the store buffer to bring locations into L1, so having additional stores that aren't going to miss in L1 hurts the concurrency since the lookahead can't see as many requests that will miss.
Perhaps there are conflicts for L1 and L2 resources like read and write ports, inter-cache bandwidth, that are worse with this pattern of stores. For example when stores to different lines interleave, maybe they cannot drain as quickly from the store queue (see above where it appears that in some scenarios more than one store may drain per cycle).
These comments by Dr. McCalpin on the Intel forums are also quite interesting.
[0] Mostly only achievable with the L2 streamer disabled, since otherwise the additional contention on the L2 slows this down to about 1 line per 3.5 cycles.
[1] Contrast this with loads, where I get almost exactly 1.5 cycles per load, for an implied bandwidth of ~43 bytes per cycle. This makes perfect sense: the L1<->L2 bandwidth is 64 bytes, but assuming that the L1 is either accepting a line from the L2 or servicing load requests from the core every cycle (but not both in parallel) then you have 3 cycles for two loads to different L2 lines: 2 cycles to accept the lines from L2, and 1 cycle to satisfy two load instructions.
[2] With prefetching off. As it turns out, the L2 prefetcher competes for access to the L2 cache when it detects streaming access: even though it always finds the candidate lines and doesn't go to L3, this slows down the code and increases variability. The conclusions generally hold with prefetching on, but everything is just a bit slower (here's a big blob of results with prefetching on - you see about 3.3 cycles per load, but with lots of variability).
[3] It doesn't even really need to be ahead - prefetching several lines behind also works: I guess the prefetch/loads just quickly run ahead of the stores which are bottlenecked, so they get ahead anyways. In this way, the prefetching is kind of self-healing and seems to work with almost any value you put in.
Sandy Bridge has "L1 data hardware pre-fetchers". What this means is that initially when you do your store the CPU has to fetch data from L2 into L1; but after this has happened several times the hardware pre-fetcher notices the nice sequential pattern and starts pre-fetching data from L2 into L1 for you, so that the data is either in L1 or "half way to L1" before your code does its store.

L2 instruction fetch misses much higher than L1 instruction fetch misses

I am generating a synthetic C benchmark aimed at causing a large number of instruction fetch misses via the following Python script:
#!/usr/bin/env python
import tempfile
import random
import sys
if __name__ == '__main__':
    functions = list()
    for i in range(10000):
        func_name = "f_{}".format(next(tempfile._get_candidate_names()))
        sys.stdout.write("void {}() {{\n".format(func_name))
        sys.stdout.write(" double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
        sys.stdout.write(" res = pi*r*r*h;\n")
        sys.stdout.write(" res = res/(e*e);\n")
        sys.stdout.write("}\n")
        functions.append(func_name)
    sys.stdout.write("int main() {\n")
    sys.stdout.write("unsigned int i;\n")
    sys.stdout.write("for(i =0 ; i < 100000 ;i ++ ){\n")
    for i in range(10000):
        r = random.randint(0, len(functions)-1)
        sys.stdout.write("{}();\n".format(functions[r]))
    sys.stdout.write("}\n")
    sys.stdout.write("}\n")
What the code does is simply generating a C program that consists of a lot of randomly named dummy functions that are in turn called in random order in main(). I am compiling the resulting code with gcc 4.8.5 under CentOS 7 with -O0. The code is running on a dual socket machine fitted with 2x Intel Xeon E5-2630v3 (Haswell architecture).
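For reference, the generated C looks roughly like this (abbreviated; the function names are random, f_abc123 here is just a stand-in):

void f_abc123() {
    double pi = 3.14, r = 50, h = 100, e = 2.7, res;
    res = pi*r*r*h;
    res = res/(e*e);
}
/* ... 9,999 more functions like the above ... */
int main() {
    unsigned int i;
    for(i = 0; i < 100000; i++){
        f_abc123();
        /* ... 9,999 more calls, each to a randomly chosen function ... */
    }
}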
What I am interested in is understanding instruction-related counters reported by perf when profiling the binary compiled from the C code (not the Python script, that is only used to automatically generate the code). In particular, I am observing the following counters with perf stat:
instructions
L1-icache-load-misses (instruction fetches that miss L1, aka r0280 on Haswell)
r2424, L2_RQSTS.CODE_RD_MISS (instruction fetches that miss L2)
rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)
I first profiled the code with all hardware prefetchers disabled in the BIOS, i.e.
MLC Streamer Disabled
MLC Spatial Prefetcher Disabled
DCU Data Prefetcher Disabled
DCU Instruction Prefetcher Disabled
and the results are the following (process is pinned to first core of second CPU and corresponding NUMA domain, but I guess this doesn't make much difference):
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code
Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':
25,108,610,204 instructions
2,613,075,664 L1-icache-load-misses
5,065,167,059 r2424
17 rf824
33.696954142 seconds time elapsed
Considering the figures above, I cannot explain such a high number of instruction fetch misses in L2. I have disabled all prefetchers, and L2_RQSTS.ALL_PF confirms so. But why do I see twice as many instruction fetch misses in L2 as in L1i? In my (simple) mental processor model, if an instruction is looked up in L2, it must necessarily have been looked up in L1i before. Clearly I am wrong; what am I missing?
I then tried to run the same code with all the hardware prefetchers enabled, i.e.
MLC Streamer Enabled
MLC Spatial Prefetcher Enabled
DCU Data Prefetcher Enabled
DCU Instruction Prefetcher Enabled
and the results are the following:
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code
Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':
25,109,877,626 instructions
2,599,883,072 L1-icache-load-misses
5,054,883,231 r2424
908,494 rf824
Now, L2_RQSTS.ALL_PF seems to indicate that something more is happening and although I expected the prefetcher to be a bit more aggressive, I imagine that the instruction prefetcher is severely put to the test due to the jump-intensive type of workload and data prefetcher has not much to do with this kind of workload. But again, L2_RQSTS.CODE_RD_MISS is still too high with the prefetchers enabled.
So, to sum up, my question is:
With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS seems to be much higher than L1-icache-load-misses. Even with hardware prefetchers enabled, I still cannot explain it. What is the reason behind such a high count of L2_RQSTS.CODE_RD_MISS compared to L1-icache-load-misses?
The instruction prefetcher can generate requests that don't count as accesses to the L1I cache, but are counted as code fetch requests at higher-numbered memory levels, such as the L2. This is generally true on all Intel microarchitectures with an instruction prefetcher. L2_RQSTS.CODE_RD_MISS counts both demand and prefetch requests from the L1I. Demand requests are generated by a multiplexing unit in the IFU that chooses a target fetch linear address from among the different units in the pipeline that may change the flow, such as the branch prediction units. Prefetch requests are generated by the L1I instruction prefetcher on an L1I miss, if possible.
In general, the number of prefetch fetch requests is nearly proportional to the number of L1I misses. For instruction fetches from memory regions of cacheable memory types, the following formula holds:
ICACHE.MISSES <= L2_RQSTS.CODE_RD_MISS + L2_RQSTS.CODE_RD_HIT
I'm not sure whether this formula also holds for uncacheable fetch requests. I didn't test it in that condition. I know these requests are counted as ICACHE.MISSES, but not sure about the other events.
In your case, most instruction fetches will miss in the L1I and L2. You have 10,000 functions, each of which nearly fully spans 2 64-byte cache lines (here is a version with only two functions), so the code size is much larger than the 256 KiB L2 available on Haswell. The functions are being called in a non-sequential and unpredictable order, so the L1I and L2 prefetchers won't significantly help. The only noteworthy exception is returns, all of which will be predicted correctly using the RSB mechanism.
Each of the 10,000 functions is called about 100,000 times in a loop. Most fetch requests are for lines occupied by these functions. The total number of useful instruction fetch requests is about 2 lines per function * 10,000 functions * 100,000 iterations = 2,000,000,000 lines, most of which will miss in the L1I and L2 (but probably hit in the L3 after the first cold iteration). Several millions of other requests will be for lines occupied by the loop body. Your measurements show about 30% more instruction fetches that miss in the L1I than that estimate. This is because of branch mispredictions, which cause fetch requests for incorrect lines that may not even be in the L1I and/or L2. Each L1I miss may trigger a prefetch, so it's normal for L2 instruction fetches to be within two times the number of L1I misses. This is consistent with your numbers.
In my two-function version, I'm counting 24 instructions per invoked function, so I expect the total number of retired instructions to be approximately 24 billion, but you got 25 billion. Either I don't know how to count, or you have 25 instructions per function for some reason.

Interpretation of perf stat output

I have developed a code that gets as input a large 2-D image (up to 64MPixels) and:
Applies a filter on each row
Transposes the image (used blocking to avoid lots of cache misses; a sketch is shown after this list)
Applies a filter on the columns (now-rows) of the image
Transposes the filtered image back to carry on with other calculations
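For context, the blocked transpose mentioned in step 2 is roughly of this shape (a generic sketch, not my exact code; the element type and tile size are placeholders):

#include <stddef.h>

#define BLOCK 64   /* tile size chosen so a BLOCK x BLOCK tile of src and dst fits in cache */

void transpose_blocked(const float *src, float *dst, size_t rows, size_t cols) {
    for (size_t ib = 0; ib < rows; ib += BLOCK)
        for (size_t jb = 0; jb < cols; jb += BLOCK)
            for (size_t i = ib; i < ib + BLOCK && i < rows; i++)
                for (size_t j = jb; j < jb + BLOCK && j < cols; j++)
                    dst[j * rows + i] = src[i * cols + j];
}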
Although it doesn't change anything, for the sake of completeness of my question: the filtering applies a discrete wavelet transform and the code is written in C.
My end goal is to make this run as fast as possible. The speedups I have so far are more than 10X through the use of the blocked matrix transpose, vectorization, multithreading, compiler-friendly code, etc.
Coming to my question: The latest profiling stats of the code I have (using perf stat -e) have troubled me.
76,321,873 cache-references
8,647,026,694 cycles # 0.000 GHz
7,050,257,995 instructions # 0.82 insns per cycle
49,739,417 cache-misses # 65.171 % of all cache refs
0.910437338 seconds time elapsed
The (# of cache-misses)/(# instructions) is low at around ~0.7%. Here it is mentioned that this number is a good thing to have in mind to check for memory efficiency.
On the other hand, the % of cache-misses to cache-references is significantly high (65%!), which as I see it could indicate that something is going wrong with the execution in terms of cache efficiency.
The detailed stat from perf stat -d is:
2711.191150 task-clock # 2.978 CPUs utilized
1,421 context-switches # 0.524 K/sec
50 cpu-migrations # 0.018 K/sec
362,533 page-faults # 0.134 M/sec
8,518,897,738 cycles # 3.142 GHz [40.13%]
6,089,067,266 stalled-cycles-frontend # 71.48% frontend cycles idle [39.76%]
4,419,565,197 stalled-cycles-backend # 51.88% backend cycles idle [39.37%]
7,095,514,317 instructions # 0.83 insns per cycle
# 0.86 stalled cycles per insn [49.66%]
858,812,708 branches # 316.766 M/sec [49.77%]
3,282,725 branch-misses # 0.38% of all branches [50.19%]
1,899,797,603 L1-dcache-loads # 700.724 M/sec [50.66%]
153,927,756 L1-dcache-load-misses # 8.10% of all L1-dcache hits [50.94%]
45,287,408 LLC-loads # 16.704 M/sec [40.70%]
26,011,069 LLC-load-misses # 57.44% of all LL-cache hits [40.45%]
0.910380914 seconds time elapsed
Here frontend and backend stalled cycles are also high and the lower level caches seem to suffer from a high miss rate of 57.5%.
Which metric is the most appropriate for this scenario? One idea I had is that the workload may no longer require further "touching" of the LL caches after the initial image load (it loads the values once and after that it's done - the workload is more CPU-bound than memory-bound, being an image filtering algorithm).
The machine I'm running this on is a Xeon E5-2680 (20M of Smart cache, out of which 256KB L2 cache per core, 8 cores).
The first thing you want to make sure is that no other compute intensive process is running on your machine. That's a server CPU so I thought that could be a problem.
If you use multi-threading in your program, and you distribute an equal amount of work between threads, you might be interested in collecting metrics only on one CPU.
I suggest disabling hyper-threading in the optimization phase as it can lead to confusion when interpreting the profiling metrics. (e.g. increased #cycles spent in the back-end). Also if you distribute work to 3 threads, you have a high chance that 2 threads will share the resources of one core and the 3rd will have the entire core for itself - and it will be faster.
Perf has never been very good at explaining the metrics. Judging by the order of magnitude, the cache references are the L2 misses that hit the LLC. A high LLC miss number compared with LLC references is not always a bad thing if the number of LLC references / #Instructions is low. In your case, you have 0.018 so that means that most of your data is being used from L2. The high LLC miss ratio means that you still need to get data from RAM and write it back.
Regarding #Cycles FE and BE bound, I'm a bit concerned about the values because they don't add up to the total number of cycles: you have 8G cycles in total, but 6G stalled cycles in the FE and 4G stalled cycles in the BE. That does not seem right.
If the FE cycles is high, that means you have misses in the instruction cache or bad branch speculation. If the BE cycles is high, that means you wait for data.
Anyway, regarding your question: the most relevant metric to assess the performance of your code is Instructions / Cycle (IPC). Your CPU can execute up to 4 instructions / cycle. You only execute 0.8. That means resources are underutilized, except for the case where you have many vector instructions. After IPC you need to check branch misses and L1 misses (data and code) because those generate most penalties.
A final suggestion: you may be interested in trying Intel's VTune Amplifier. It gives a much better explanation of the metrics and points you to potential problems in your code.

Cycles/cost for L1 Cache hit vs. Register on x86?

I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?
How many cycles does an L1 cache hit take? How does it compare to register access?
Here's a great article on the subject:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1
To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)
PS:
The specifics will vary, but this link has some good ballpark figures:
Approximate cost to access various caches and main memory?
Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns
PPS:
These figures represent much older, slower CPUs, but the ratios basically hold:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2
Level                     Access Time   Typical Size    Technology    Managed By
-----                     -----------   ------------    ----------    ----------
Registers                 1-3 ns        < 1 KB          Custom CMOS   Compiler
Level 1 Cache (on-chip)   2-8 ns        8 KB-128 KB     SRAM          Hardware
Level 2 Cache (off-chip)  5-12 ns       0.5 MB - 8 MB   SRAM          Hardware
Main Memory               10-60 ns      64 MB - 1 GB    DRAM          Operating System
Hard Disk                 3M - 10M ns   20 - 100 GB     Magnetic      Operating System/User
Throughput and latency are different things. You can't just add up cycle costs. For throughput, see Load/stores per cycle for recent CPU architecture generations - 2 loads per clock throughput for most modern microarchitectures. And see How can cache be that fast? for microarchitectural details of load/store execution units, including showing load / store buffers which limit how much memory-level parallelism they can track. The rest of this answer will focus only on latency, which is relevant for workloads that involve pointer-chasing (like linked lists and trees), and how much latency out-of-order exec needs to hide. (L3 Cache misses are usually too long to fully hide.)
Single-cycle cache latency used to be a thing on simple in-order pipelines at lower clock speeds (so each cycle was more nanoseconds), especially with simpler caches (smaller, not as associative, and with a smaller TLB for caches that weren't purely virtually addressed.) e.g. the classic 5-stage RISC pipeline like MIPS I assumes 1 cycle for memory access on a cache hit, with address calculation in EX and memory access in a single MEM pipeline stage, before WB.
Modern high-performance CPUs divide the pipeline up into more stages, allowing each cycle to be shorter. This lets simple instructions like add / or / and run really fast, still 1 cycle latency but at high clock speed.
For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.
Intel Haswell's L1 load-use latency is 4 cycles for pointer-chasing, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax] can run in a loop, with a pointer that points to itself. (Or for a linked list that hits in cache, easy to microbench with a closed loop). See also Is there a penalty when base+offset is in a different page than the base? That 4-cycle latency special case only applies if the pointer comes directly from another load, otherwise it's 5 cycles.
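A minimal version of such a pointer-chasing microbenchmark might look like this (a sketch; wrap the chase loop in your favorite timer and divide by the iteration count):

#include <stdio.h>
#include <stddef.h>

#define N 1024   /* 1024 pointers * 8 bytes = 8 KiB, comfortably inside L1d */

int main(void) {
    static void *chain[N];
    for (size_t i = 0; i < N; i++)        /* simple ring: element i points to element i+1 */
        chain[i] = &chain[(i + 1) % N];

    void *p = chain[0];
    for (long i = 0; i < 100000000; i++)  /* each load's address depends on the previous load */
        p = *(void **)p;

    printf("%p\n", p);                    /* keep the chain from being optimized away */
    return 0;
}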
Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.
Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, reading from the store buffer for store data that hasn't yet committed to L1d cache).
As harold commented, register access is 0 cycles. So, for example:
inc eax has 1 cycle latency (just the ALU operation)
add dword [mem], 1 has 6 cycle latency until a load from dword [mem] will be ready. (ALU + store-forwarding). e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles (see the sketch after this list).
mov rax, [rsi] has 4 cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency.)
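A quick way to see that ~6-cycle store-forwarding chain in practice (a sketch; it assumes the compiler keeps the volatile counter in memory rather than doing anything clever):

#include <stdio.h>

int main(void) {
    volatile long i;                 /* loop counter forced to live in memory */
    const long n = 100000000;
    for (i = 0; i < n; i++) {
        /* empty body: each iteration is load + add + store of the counter,
           so the store-forwarding latency becomes the bottleneck (~6 cycles) */
    }
    printf("%ld\n", (long)i);
    return 0;
}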
http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).
Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).
L1 Data cache = 32 KB, 64 B/line, 8-WAY.
L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.
L2 cache = 256 KB, 64 B/line, 8-WAY
L3 cache = 8 MB, 64 B/line
L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])
L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).
L2 Cache Latency = 12 cycles
L3 Cache Latency = 36 cycles
RAM Latency = 36 cycles + 57 ns
The top-level benchmark page is http://www.7-cpu.com/utils.html, but it still doesn't really explain what the different test sizes mean; the code is available, though. The test results include Skylake, which is nearly the same as Haswell in this test.
@paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.
If I remember correctly it's about 1-2 clock cycles, but this is an estimate and newer caches may be faster. This is out of a computer architecture book I have, and the information is for AMD, so Intel may be slightly different, but I would bound it between 5 and 15 clock cycles, which seems like a good estimate to me.
EDIT: Whoops, L2 is 10 cycles with TAG access; L1 takes 1 to 2 cycles, my mistake :\
Actually the cost of an L1 cache hit is almost the same as the cost of a register access. It was surprising to me, but this is true, at least for my processor (Athlon 64). Some time ago I wrote a simple test application to benchmark the efficiency of access to shared data in a multiprocessor system. The application body is simply a memory variable being incremented for a predefined period of time. To make a comparison, I benchmarked a non-shared variable first. During that activity I captured the result, but then while disassembling the application I found that the compiler had deceived my expectations and applied an unwanted optimization to my code: it put the variable in a CPU register and incremented it iteratively there, without memory access. The real surprise came after I forced the compiler to use an in-memory variable instead of a register variable. With the updated application I achieved almost the same benchmarking results. The performance degradation was really negligible (~1-2%) and looks related to some side effect.
As a result:
1) I think you can consider the L1 cache as an unmanaged pool of processor registers.
2) There is no sense in applying brutal assembly optimization to force the compiler to keep frequently accessed data in processor registers. If data is really frequently accessed, it will live in the L1 cache, and therefore will have nearly the same access cost as a processor register.
