Do store misses hurt performance? - caching

We know that the dirty victim data is not immediately written back to RAM; it is stashed away in the store buffer and then written back to RAM later as time permits. Also, the store-forwarding technique means that if you do a subsequent LOAD to the same location on the same core before the value is flushed to the cache/memory, the value from the store buffer will be "forwarded" and you will get the value that was just stored. This can be done in parallel with the cache access, so it doesn’t slow things down.
My question is: with the help of the store buffer and store forwarding, store misses don’t necessarily require the processor (corresponding core) to stall. Therefore, store misses do not contribute to the total cache miss latency, right?
Thanks.

DRAM latency is really high, so it's easy for the store buffer to fill up and stall allocation of new store instructions into the back-end when a cache miss store stalls its progress. The ability of the store buffer to decouple / insulate execution from cache misses is limited by its finite size. It always helps some, though. You're right, stores are easier to hide cache-miss latency for.
Stalling and filling up the store buffer is more of a problem with a strongly ordered memory model like x86's TSO: stores can only commit from the store buffer into L1d cache in program order, so any cache-miss store blocks store-buffer progress until the RFO (Read For Ownership) completes. Initiating the RFO early (before the store reaches the commit end of the store buffer, e.g. upon retire) can hide some of this latency by getting the RFO in flight before the data needs to arrive.
How do the store buffer and Line Fill Buffer interact with each other?
Consecutive stores into the same cache line can be coalesced into a buffer that lets them all commit at once when the data arrives from RAM (or from another core which had ownership). There's some evidence that Intel CPUs actually do this, in the limited cases where that wouldn't violate the memory-ordering rules.
See Why doesn't RFO after retirement break memory ordering? for links to #BeeOnRope's experimental testing of this commit into LFBs before the RFO data arrives, on Intel Skylake.

Related

Size of store buffers on Intel hardware? What exactly is a store buffer?

The Intel optimization manual talks about the number of store buffers that exist in many parts of the processor, but does not seem to talk about the size of the store buffers. Is this public information, or is the size of a store buffer kept as a microarchitectural detail?
The processors I am looking into are primarily Broadwell and Skylake, but information about others would be nice as well.
Also, what do store buffers do, exactly?
Related: what is a store buffer? and a beginner-friendly (but detailed) intro to the concept of buffers in Can a speculatively executed CPU branch contain opcodes that access RAM? which I highly recommend reading for CPU-architecture background on why we need them and what they do (decouple execution from commit to L1d / cache misses, and allow speculative exec of stores without making speculation visible in coherent cache.)
Also How do the store buffer and Line Fill Buffer interact with each other? has a good description of the steps in executing a store instruction and how it eventually commits to L1d cache.
The store buffer as a whole is composed of multiple entries.
Each core has its own store buffer1 to decouple execution and retirement from commit into L1d cache. Even an in-order CPU benefits from a store buffer to avoid stalling on cache-miss stores, because unlike loads they just have to become visible eventually. (No practical CPUs use a sequential-consistency memory model, so at least StoreLoad reordering is allowed, even in x86 and SPARC-TSO).
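For example, the StoreLoad reordering that the store buffer creates shows up in the classic litmus test. Here's a minimal C++ sketch of it (illustration only): on x86 the r1 == 0 && r2 == 0 outcome is possible precisely because each store can still be sitting in its core's store buffer when the other core's load executes.

    // Minimal sketch: two threads each store then load. Even on x86-TSO,
    // both loads can read 0 because each store may still be in its core's
    // store buffer (not yet committed to L1d) when the other core's load runs.
    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        std::thread t1([] {
            x.store(1, std::memory_order_release);   // goes into the store buffer
            r1 = y.load(std::memory_order_acquire);  // may execute before the store commits
        });
        std::thread t2([] {
            y.store(1, std::memory_order_release);
            r2 = x.load(std::memory_order_acquire);
        });
        t1.join();
        t2.join();
        std::printf("r1=%d r2=%d\n", r1, r2);  // r1==0 && r2==0 is an allowed outcome
    }

In practice you need to run it many times (re-zeroing x and y each iteration) to catch the 0/0 case, but it does show up on real hardware.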
For speculative / out-of-order CPUs, it also makes it possible to roll back a store after detecting an exception or other mis-speculation in an older instruction, without speculative stores ever being globally visible. This is obviously essential for correctness! (You can't roll back other cores, so you can't let them see your store data until it's known to be non-speculative.)
When both logical cores are active (hyperthreading), Intel partitions the store buffer in two; each logical core gets half. Loads from one logical core only snoop its own half of the store buffer2. See What will be used for data exchange between threads are executing on one Core with HT?
The store buffer commits data from retired store instructions into L1d as fast as it can, in program order (to respect x86's strongly-ordered memory model3). Requiring stores to commit as they retire would unnecessarily stall retirement for cache-miss stores. Retired stores still in the store buffer are definitely going to happen and can't be rolled back, so they can actually hurt interrupt latency. (Interrupts aren't technically required to be serializing, but any stores done by an IRQ handler can't become visible until after existing pending stores are drained. And iret is serializing, so even in the best case the store buffer drains before returning.)
It's a common(?) misconception that it has to be explicitly flushed for data to become visible to other threads. Memory barriers don't cause the store buffer to be flushed; full barriers make the current core wait until the store buffer drains itself before allowing any later loads to happen (i.e. read L1d). Atomic RMW operations have to wait for the store buffer to drain before they can lock a cache line and do both their load and store to that line without allowing it to leave MESI Modified state, thus stopping any other agent in the system from observing it during the atomic operation.
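To make that concrete, here's how the litmus test above would change to forbid the 0/0 result (a sketch of the C++-level equivalent, assuming the same x, y, r1, r2 globals as before, and used in both threads). The fence doesn't flush the store buffer; it makes this core wait for its own store buffer to drain before the later load reads L1d. On x86 it typically compiles to mfence or a lock-prefixed RMW.

    // Sketch: drop-in replacement for each thread's body in the litmus test
    // above. A seq_cst fence between the store and the load (or making both
    // operations seq_cst) blocks StoreLoad reordering, so r1 == 0 && r2 == 0
    // is no longer an allowed outcome.
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // wait for the store buffer to drain
    r1 = y.load(std::memory_order_relaxed);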
To implement x86's strongly ordered memory model while still microarchitecturally allowing early / out-of-order loads (and later checking if the data is still valid when the load is architecturally allowed to happen), load buffer + store buffer entries collectively form the Memory Order Buffer (MOB). (If a cache line isn't still present when the load was allowed to happen, that's a memory-order mis-speculation.) This structure is presumably where mfence and locked instructions can put a barrier that blocks StoreLoad reordering without blocking out-of-order execution. (Although mfence on Skylake does block OoO exec of independent ALU instructions, as an implementation detail.)
movnt cache-bypassing stores (like movntps) also go through the store buffer, so they can be treated as speculative just like everything else in an OoO exec CPU. But they commit directly to an LFB (Line Fill Buffer), aka write-combining buffer, instead of to L1d cache.
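For reference, this is what using them looks like from C++ with intrinsics; a rough sketch assuming x86 with SSE2 and a 16-byte-aligned destination.

    // Sketch: NT stores still go through the store buffer, but commit into a
    // write-combining LFB instead of L1d. If a whole 64-byte line is written,
    // it can be sent off-core without an RFO. sfence orders the weakly-ordered
    // NT stores before any later stores (e.g. a "ready" flag).
    #include <emmintrin.h>   // _mm_stream_si128, _mm_setzero_si128, _mm_sfence
    #include <cstddef>

    void nt_zero(void* dst, std::size_t bytes) {       // dst assumed 16-byte aligned
        __m128i zero = _mm_setzero_si128();
        auto* p = static_cast<__m128i*>(dst);
        for (std::size_t i = 0; i < bytes / 16; ++i)
            _mm_stream_si128(p + i, zero);             // movntdq: cache-bypassing store
        _mm_sfence();                                  // make NT stores visible before later stores
    }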
Store instructions on Intel CPUs decode to store-address and store-data uops (micro-fused into one fused-domain uop). The store-address uop just writes the address (and probably the store width) into the store buffer, so later loads can set up store->load forwarding or detect that they don't overlap. The store-data uop writes the data.
Store-address and store-data uops can execute in either order, whichever is ready first: the allocate/rename stage that writes uops from the front-end into the ROB and RS in the back end also allocates a load-buffer or store-buffer entry for load or store uops at issue time, or stalls until one is available. Since allocation and commit happen in order, that probably means older/younger is easy to keep track of, because it can just be a circular buffer that doesn't have to worry about old long-lived entries still being in use after wrapping around. (Unless cache-bypassing / weakly-ordered NT stores can do that? They can commit to an LFB (Line Fill Buffer) out of order. Unlike normal stores, they commit directly to an LFB for transfer off-core, rather than to L1d.)
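As an aside, the store->load forwarding path mentioned above is easy to measure: a dependency chain that bounces a value through memory runs at store-forwarding latency (typically a handful of cycles per iteration) instead of a store plus L1d load-use latency. A minimal sketch of such a microbenchmark:

    // Sketch: each iteration stores a value and immediately reloads it. The
    // reload hits the store buffer and is forwarded, so the loop's speed is
    // set by store-forwarding latency, not by a round trip through L1d.
    #include <cstdint>

    uint64_t forward_chain(uint64_t iters) {
        volatile uint64_t slot = 0;   // volatile keeps the store + reload in the asm
        uint64_t x = 1;
        for (uint64_t i = 0; i < iters; ++i) {
            slot = x;                 // store-address + store-data uops, SB entry allocated
            x = slot + 1;             // load overlaps the pending store -> forwarded
        }
        return x;
    }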
but what is the size of an entry?
Store buffer sizes are measured in entries, not bits.
Narrow stores don't "use less space" in the store buffer, they still use exactly 1 entry.
Skylake's store buffer has 56 entries (wikichip), up from 42 in Haswell/Broadwell, and 36 in SnB/IvB (David Kanter's HSW writeup on RealWorldTech has diagrams). You can find numbers for most earlier x86 uarches in Kanter's writeups on RWT, or Wikichip's diagrams, or various other sources.
SKL/BDW/HSW also have 72 load buffer entries, SnB/IvB have 64. This is the number of in-flight load instructions that either haven't executed or are waiting for data to arrive from outer caches.
The size in bits of each entry is an implementation detail that has zero impact on how you optimize software. Similarly, we don't know the size in bits of a uop (in the front-end, in the ROB, in the RS), or TLB implementation details, or many other things, but we do know how many ROB and RS entries there are, and how many TLB entries of different types there are in various uarches.
Intel doesn't publish circuit diagrams for their CPU designs and (AFAIK) these sizes aren't generally known, so we can't even satisfy our curiosity about design details / tradeoffs.
Write coalescing in the store buffer:
Back-to-back narrow stores to the same cache line can (probably?) be combined aka coalesced in the store buffer before they commit, so it might only take one cycle on a write port of L1d cache to commit multiple stores.
We know for sure that some non-x86 CPUs do this, and we have some evidence / reason to suspect that Intel CPUs might do this. But if it happens, it's limited. #BeeOnRope and I currently think Intel CPUs probably don't do any significant merging. And if they do, the most plausible case is that entries at the end of the store buffer (ready to commit to L1d) that all go to the same cache line might merge into one buffer, optimizing commit if we're waiting for an RFO for that cache line. See discussion in comments on Are two store buffer entries needed for split line/page stores on recent Intel?. I proposed some possible experiments but haven't done them.
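A rough sketch of the general kind of timing test such an experiment might start from (just an illustration of the idea, not the specific experiments proposed in those comments): compare back-to-back narrow stores that all land in one cache line against the same number of stores spread over many lines.

    // Rough sketch of a timing experiment (illustration only): both loops hit
    // in L1d. If stores to the same line merge before commit, the same-line
    // loop could beat an L1d commit rate of 1 store/cycle; on cores where
    // store *execution* is already limited to 1/cycle, both loops will look
    // the same and this test can't distinguish anything.
    #include <chrono>
    #include <cstdio>

    alignas(64) volatile char one_line[64];
    alignas(64) volatile char many_lines[64 * 64];    // 64 separate cache lines

    template <class F>
    double seconds(F f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const long iters = 200'000'000;
        double same = seconds([&] {
            for (long i = 0; i < iters; ++i)
                one_line[i & 63] = (char)i;           // all stores within one line
        });
        double spread = seconds([&] {
            for (long i = 0; i < iters; ++i)
                many_lines[(i & 63) * 64] = (char)i;  // each store to a different line
        });
        std::printf("same line: %.3f s, spread: %.3f s\n", same, spread);
    }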
Earlier stuff about possible store-buffer merging:
See discussion starting with this comment: Are write-combining buffers used for normal writes to WB memory regions on Intel?
And also Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake may be relevant.
We know for sure that some weakly-ordered ISAs like Alpha 21264 did store coalescing in their store buffer, because the manual documents it, along with its limitations on what it can commit and/or read to/from L1d per cycle. Also PowerPC RS64-II and RS64-III, with less detail, in docs linked from a comment here: Are there any modern CPUs where a cached byte store is actually slower than a word store?
People have published papers on how to do (more aggressive?) store coalescing in TSO memory models (like x86), e.g. Non-Speculative Store Coalescing in Total Store Order
Coalescing could allow a store-buffer entry to be freed before its data commits to L1d (presumably only after retirement), if its data is copied to a store to the same line. This could only happen if no stores to other lines separate them, or else it would cause stores to commit (become globally visible) out of program order, violating the memory model. But we think this can happen for any 2 stores to the same line, even the first and last byte.
A problem with this idea is that SB entry allocation is probably a ring buffer, like the ROB. Releasing entries out of order would mean hardware would need to scan every entry to find a free one, and then if they're reallocated out of order then they're not in program order for later stores. That could make allocation and store-forwarding much harder so it's probably not plausible.
As discussed in Are two store buffer entries needed for split line/page stores on recent Intel?, it would make sense for an SB entry to hold all of one store even if it spans a cache-line boundary. Cache-line boundaries become relevant when committing to L1d cache on leaving the SB. We know that store-forwarding can work for stores that split across a cache line. That seems unlikely if they were split into multiple SB entries in the store ports.
Terminology: I've been using "coalescing" to talk about merging in the store buffer, vs. "write combining" to talk about NT stores that combine in an LFB before (hopefully) doing a full-line write with no RFO. Or stores to WC memory regions which do the same thing.
This distinction / convention is just something I made up. According to discussion in comments, this might not be standard computer architecture terminology.
Intel's manuals (especially the optimization manual) are written over many years by different authors, and also aren't consistent in their terminology. Take most parts of the optimization manual with a grain of salt especially if it talks about Pentium4. The new sections about Sandybridge and Haswell are reliable, but older parts might have stale advice that's only / mostly relevant for P4 (e.g. inc vs. add 1), or the microarchitectural explanations for some optimization rules might be confusing / wrong. Especially section 3.6.10 Write Combining. The first bullet point about using LFBs to combine stores while waiting for lines to arrive for cache-miss stores to WB memory just doesn't seem plausible, because of memory-ordering rules. See discussion between me and BeeOnRope linked above, and in comments here.
Footnote 1:
A write-combining cache to buffer write-back (or write-through) from inner caches would have a different name. e.g. Bulldozer-family uses 16k write-through L1d caches, with a small 4k write-back buffer. (See Why do L1 and L2 Cache waste space saving the same data? for details and links to even more details. See Cache size estimation on your system? for a rewrite-an-array microbenchmark that slows down beyond 4k on a Bulldozer-family CPU.)
Footnote 2: Some POWER CPUs let other SMT threads snoop retired stores in the store buffer: this can cause different threads to disagree about the global order of stores from other threads. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?
Footnote 3: non-x86 CPUs with weak memory models can commit retired stores in any order, allowing more aggressive coalescing of multiple stores to the same line, and making a cache-miss store not stall commit of other stores.
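For example, on a weakly-ordered ISA the following publication pattern can be observed out of order unless a release store (or barrier) is used; on x86-TSO the in-order store-buffer commit already rules it out. (Sketch; note the compiler is also free to reorder relaxed stores, independent of the hardware.)

    // Sketch: with relaxed ordering, a weakly-ordered core may commit these two
    // stores to cache in either order, so the reader can see flag == 1 while
    // data is still 0. x86-TSO commits stores in program order, so its store
    // buffer can't produce this particular reordering.
    #include <atomic>

    std::atomic<int> data{0}, flag{0};

    void writer() {
        data.store(42, std::memory_order_relaxed);
        flag.store(1, std::memory_order_relaxed);    // may become visible before data on weak ISAs
    }

    int reader() {
        while (flag.load(std::memory_order_relaxed) == 0) { }
        return data.load(std::memory_order_relaxed); // may read 0 on weakly-ordered hardware
    }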

Does the store buffer send a read-invalidate message or an invalidate request message?

I think that, to let the CPU continue executing subsequent instructions, the store buffer must do part of the MESI processing to maintain cache coherence, because the latest value is stored in the store buffer and not in the cache. So the store buffer sends read-invalidate or invalidate request messages and flushes the latest value to the cache after the ACK arrives.
And the cache cannot do this.
Is my analysis and result right?
Or shall all MESI processing be done by cache?
On most designs the store buffer wouldn't directly send invalidate requests and is usually not even snooped1 by external requests. That is, it is part of the private/core-side of the coherence domain and so doesn't need to participate in coherence. Instead, the store buffer ultimately interacts with the first level of the caching subsystem which itself would be responsible for the various parts of the MESI protocol.
How that interaction works exactly depends on the design, of course. A simple design may only process one store at a time: the oldest one, at the head of the store buffer; it performs the RFO for that address and, when it completes, moves on to the next entry. A more sophisticated design might send RFOs for several "upcoming" requests in the store buffer in an attempt to exploit more MLP. The exact mechanism isn't clear to me on x86: stores to L2 seem to perform quite poorly in some scenarios, but I'm pretty sure a bunch of store misses to RAM will perform much better than if they were handled serially.
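As a rough illustration of the difference MLP makes, a store-miss loop like this one (sketch; assumes the array is cold and much larger than the caches) touches a new line per store, so a design that keeps several RFOs in flight will run it far faster than one that handles the head-of-queue miss serially.

    // Sketch: one int store per 64-byte line, so every store misses (for a
    // sufficiently large, cold array) and each needs its own RFO. Throughput
    // depends on how many of those RFOs the core can keep in flight at once.
    #include <cstddef>

    void store_miss_per_line(int* arr, std::size_t n_lines) {
        for (std::size_t i = 0; i < n_lines; ++i)
            arr[i * 16] = 1;     // 16 ints * 4 bytes = one store per cache line
    }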
1 There are some exceptions, e.g. simultaneous multithreading (hyperthreading on x86), which involves two logical cores sharing all levels of cache and hence not being able to rely on the normal cache-coherency mechanisms between themselves, and so may require store buffer snoops.

Would buffering cache changes prevent Meltdown?

If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed, would attacks similar to Meltdown still be possible?
The proposal is to make speculative execution be able to load from memory, but not write to the CPU caches until they are actually committed.
TL:DR: yes, I think it would solve Spectre (and Meltdown) in their current form (using a flush+read cache-timing side channel to copy the secret data from a physical register), but it would probably be too expensive (in power cost, and maybe also performance) to be a likely implementation.
But with hyperthreading (or more generally any SMT), there's also an ALU / port-pressure side-channel if you can get mis-speculation to run data-dependent ALU instructions with the secret data, instead of using it as an array index. The Meltdown paper discusses this possibility before focusing on the flush+reload cache-timing side-channel. (It's more viable for Meltdown than Spectre, because you have much better control of the timing of when the secret data is used.)
So modifying cache behaviour doesn't block the attacks. It would take away the reliable side-channel for getting the secret data into the attacking process, though (i.e. ALU timing has higher noise and thus lower bandwidth to get the same reliability; Shannon's noisy-channel theorem), and you have to make sure your code runs on the same physical core as the code under attack.
On CPUs without SMT (e.g. Intel's desktop i5 chips), the ALU timing side-channel is very hard to use with Spectre, because you can't directly use perf counters on code you don't have privilege for. (But Meltdown could still be exploited by timing your own ALU instructions with Linux perf, for example).
Meltdown specifically is much easier to defend against, microarchitecturally, with simpler and cheaper changes to the hard-wired parts of the CPU that microcode updates can't rewire.
You don't need to block speculative loads from affecting cache; the change could be as simple as letting speculative execution continue after a TLB-hit load that will fault if it reaches retirement, but with the value used by speculative execution of later instructions forced to 0 because of the failed permission check against the TLB entry.
So the mis-speculated (after the faulting load of secret) touch array[secret*4096] load would always make the same cache line hot, with no secret-data-dependent behaviour. The secret data itself would enter cache, but not a physical register. (And this stops ALU / port-pressure side-channels, too.)
Stopping the faulting load from even bringing the "secret" line into cache in the first place could make it harder to tell the difference between a kernel mapping and an unmapped page, which could possibly help protect against user-space trying to defeat KASLR by finding which virtual addresses the kernel has mapped. But that's not Meltdown.
Spectre
Spectre is the hard one because the mis-speculated instructions that make data-dependent modifications to microarchitectural state do have permission to read the secret data. Yes, a "load queue" that works similarly to the store queue could do the trick, but implementing it efficiently could be expensive. (Especially given the cache coherency problem that I didn't think of when I wrote this first section.)
(There are other ways of implementing your basic idea; maybe there's even a way that's viable. But extra bits on L1D lines to track their status have downsides and aren't obviously easier.)
The store queue tracks stores from execution until they commit to L1D cache. (Stores can't commit to L1D until after they retire, because that's the point at which they're known to be non-speculative, and thus can be made globally visible to other cores).
A load queue would have to store whole incoming cache lines, not just the bytes that were loaded. (But note that Skylake-X can do 64-byte ZMM stores, so its store-buffer entries do have to be the size of a cache line. But if they can borrow space from each other or something, then there might not be 64 * entries bytes of storage available, i.e. maybe only the full number of entries is usable with scalar or narrow-vector stores. I've never read anything about a limitation like this, so I don't think there is one, but it's plausible.)
A more serious problem is that Intel's current L1D design has 2 read ports + 1 write port. (And maybe another port for writing lines that arrive from L2 in parallel with committing a store? There was some discussion about that on Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake.)
If your loaded data can't enter L1D until after the loads retire, then they're probably going to be competing for the same write port that stores use.
Loads that hit in L1D can still come directly from L1D, though, and loads that hit in the memory-order-buffer could still be executed at 2 per clock. (The MOB would now include this new load queue as well as the usual store queue + markers for loads to maintain x86 memory ordering semantics). You still need both L1D read ports to maintain performance for code that doesn't touch a lot of new memory, and mostly is reloading stuff that's been hot in L1D for a while.
This would make the MOB about twice as large (in terms of data storage), although it doesn't need any more entries. As I understand it, the MOB in current Intel CPUs is composed of the individual load-buffer and store-buffer entries. (Haswell has 72 and 42 respectively).
Hmm, a further complication is that the load data in the MOB has to maintain cache coherency with other cores. This is very different from store data, which is private and hasn't become globally visible / isn't part of the global memory order and cache coherency until it commits to L1D.
So this proposed "load queue" implementation mechanism for your idea is probably not feasible without tweaks: it would have to be checked by invalidation-requests from other cores, so that's another read-port needed in the MOB.
Any possible implementation would have the problem of needing to later commit to L1D like a store. I think it would be a significant burden not to be able to evict + allocate a new line when it arrived from off-core.
(Even allowing speculative eviction but not speculative replacement from conflicts leaves open a possible cache-timing attack. You'd prime all the lines and then do a load that would evict one from one set of lines or another, and find which line was evicted instead of which one was fetched using a similar cache-timing side channel. So using extra bits in L1D to find / evict lines loaded during recovery from mis-speculation wouldn't eliminate this side-channel.)
Footnote: all instructions are speculative. This question is worded well, but I think many people reading about OoO exec and thinking about Meltdown / Spectre fall into this trap of confusing speculative execution with mis-speculation.
Remember that all instructions are speculative when they're executed. It's not known to be correct speculation until retirement. Meltdown / Spectre depend on accessing secret data and using it during mis-speculation. But the basis of current OoO CPU designs is that you don't know whether you've speculated correctly or not; everything is speculative until retirement.
Any load or store could potentially fault, and so can some ALU instructions (e.g. floating point if exceptions are unmasked), so any performance cost that applies "only when executing speculatively" actually applies all the time. This is why stores can't commit from the store queue into L1D until after the store uops have retired from the out-of-order CPU core (with the store data in the store queue).
However, I think conditional and indirect branches are treated specially, because they're expected to mis-speculate some of the time, and optimizing recovery for them is important. Modern CPUs do better with branches than just rolling back to the current retirement state when a mispredict is detected, I think using a checkpoint buffer of some sort. So out-of-order execution for instructions before the branch can continue during recovery.
But loop and other branches are very common, so most code executes "speculatively" in this sense, too, with at least one branch-rollback checkpoint not yet verified as correct speculation. Most of the time it's correct speculation, so no rollback happens.
Recovery for mis-speculation of memory ordering or faulting loads is a full pipeline-nuke, rolling back to the retirement architectural state. So I think only branches consume the branch checkpoint microarchitectural resources.
Anyway, all of this is what makes Spectre so insidious: the CPU can't tell the difference between mis-speculation and correct speculation until after the fact. If it knew it was mis-speculating, it would initiate rollback instead of executing useless instructions / uops. Indirect branches are not rare, either (in user-space); every DLL or shared library function call uses one in normal executables on Windows and Linux.
I suspect the overhead from buffering and committing the buffer would render the specEx/caching useless?
This is purely speculative (no pun intended) - I would love to see someone with a lower-level background weigh in on this!

If a cache miss happens, is the data moved to the register directly, or first moved to the cache and then to the register?

If a cache miss happens, is the data moved to the register directly from main memory, or is it first moved to the cache and then to the register? Is there a direct way to connect the register with main memory?
I think you're asking if a cache-miss load has to wait for L1 load-use latency after the cache line arrives from outer cache. i.e. wait for the line to be written to L1, then retry the load normally.
I'm almost certain that high-performance CPUs don't work that way. L2-hit latency is important for many workloads, and you need a load buffer tracking that incoming cache line anyway to know when to restart the load. So you just grab the data as it comes in, in parallel with writing it to the cache. The TLB check was already done as part of generating a physical address to send to the outer cache.
Most real CPUs use an early-restart design that lets the pipeline restart as soon as the word / byte they were waiting for arrives, so the rest of the cache line transfers "in the background".
A further optimization is critical-word-first, which asks for the cache line to be sent starting with the needed word, so a demand miss for a word in the middle of a cache line can receive that word first. I think modern DDR DRAM still supports this when reading from main memory, starting the 64-byte burst at a specified 64-bit chunk. I'm not 100% sure modern out-of-order CPUs use this, though; when out-of-order execution allows multiple outstanding misses for the same line, it probably makes it more complicated.
See which is optimal a bigger block cache size or a smaller one? for some discussion of early-restart and critical-word-first.
Is there a direct way to connect the register with main memory?
It depends what you mean by "direct". In a modern high-performance CPU, there will be 2 or 3 layers of cache and a memory controller with its own buffering to arbitrate access to memory for multiple cores. So no, you can't.
If you design a simple single-core CPU with special cache-bypassing load and store instructions, then sure. Or if you consider early-restart as "direct", then yes it already happens.
For stores, x86 and some other architectures have cache-bypassing stores, but x86's MOVNT instructions don't directly connect registers with memory. Stores go into a line-fill buffer which is flushed when full, so you get write-combining.
There are also uncacheable memory regions: a load or store to uncacheable memory is architecturally "direct", but in the actual microarchitecture it still goes through the memory hierarchy from the load/store execution unit, through the same mechanism that L1D uses to talk to the memory controller.

Write Allocate / Fetch on Write Cache Policy

I couldn't find a source that explains how the policy works in great detail. The combinations of write policies are explained in Jouppi's Paper for the interested. This is how I understood it.
1. A write request is sent from the CPU to the cache.
2. The request results in a cache miss.
3. A cache block is allocated for this request in the cache. (Write-Allocate)
4. The requested block is fetched from lower-level memory into the allocated cache block. (Fetch-on-Write)
5. Now we are able to write onto the allocated cache block, which has been updated by the fetch.
The question is what happens between step 4 and step 5. (Let's say the cache is a non-blocking cache using Miss Status Handling Registers.)
Does the CPU have to retry the write request on the cache until a write hit happens (after fetching the block into the allocated cache block)?
If not, where is the write-request data held in the meantime?
Edit: I think I've found my answer in Implementation of Write Allocate in the K86™ Processors . It is directly being written into the allocated cache block and it gets merged with the read request later on.
It is directly being written into the allocated cache block and it gets merged with the read request later on.
No, that's not what AMD's pdf says. They say the store-data is merged with the just-fetched data from memory and then stored into the L1 cache's data array.
Cache tracks validity with cache-line granularity. There's no way for it to store the fact that "bytes 3 to 6 are valid; keep them when data arrives from memory". That kind of logic is too big to replicate in each line of the cache array.
Also note that the pdf you found describes some specific behaviour of AMD's K6 microarchitectures, which were single-core only, and some models only had a single level of cache, so no cache-coherency protocol was even necessary. They do describe the K6-III (model 9) using MESI between the L1 and L2 caches.
A CPU writing to cache has to hold onto the data until the cache is ready to accept it. It's not a retry-until-success process, though. It's more like the cache notifies the store hardware when it's ready to accept that store (i.e. it has that line active, and in the Modified state if the cache is coherent with other caches using the MESI protocol).
In a real CPU, multiple outstanding misses can be in flight at once (even without full out-of-order speculative execution). This is called miss under miss. The CPU<->cache connection needs a buffer for each outstanding miss that can be supported in parallel, to hold the store data. e.g. a core might have 8 buffers and support 8 outstanding load or store misses. A 9th memory operation couldn't start to happen until one of the 8 buffers became available. Until then, data would have to stay in the CPU's store queue.
These buffers might be shared between loads and stores, or there might be dedicated store buffers. The OP reports that searching on store buffer found lots of related stuff of interest; one example being this part of Wikipedia's MESI article.
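To tie the pieces together, here's a purely conceptual sketch of the bookkeeping such a miss buffer does (an illustration of the idea, not how any real core implements it): the buffer remembers which bytes came from the pending store(s) so they survive the merge when the line arrives, which matches the merge-then-write behaviour the AMD pdf describes.

    // Conceptual sketch of one fill buffer / MSHR entry for a write-allocate,
    // fetch-on-write cache: it holds the pending store data plus a byte-valid
    // mask, and merges it with the line fetched from the next level before the
    // result is written into the L1 data array.
    #include <array>
    #include <cstdint>

    struct FillBuffer {
        uint64_t line_addr = 0;                 // which 64-byte line this miss is for
        bool     busy      = false;             // entry in use?
        std::array<uint8_t, 64> data{};         // bytes supplied by the store(s)
        std::array<bool, 64>    byte_valid{};   // which bytes came from the store(s)

        // A store to this (missing) line: stash its bytes until the fill arrives.
        void add_store(unsigned offset, const uint8_t* src, unsigned len) {
            for (unsigned i = 0; i < len; ++i) {
                data[offset + i] = src[i];
                byte_valid[offset + i] = true;
            }
        }

        // The line arrived from L2/memory: keep the store's bytes, take the rest
        // from the fetched line, and hand the merged line to the L1 data array.
        void merge_fill(const uint8_t* fetched_line, uint8_t* l1_line_out) const {
            for (unsigned i = 0; i < 64; ++i)
                l1_line_out[i] = byte_valid[i] ? data[i] : fetched_line[i];
        }
    };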
The L1 cache is really a part of a CPU core in modern high-performance designs. It's very tightly integrated with the memory-order logic, and needs to be able to efficiently support atomic operations like lock inc [mem] and lots of other complications (like memory reordering). See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for example.
Some other terms:
store buffer
store queue
memory order buffer
cache write port / cache read port / cache port
globally visible
distantly related: An interesting post investigating the adaptive replacement policy of Intel IvyBridge's L3 cache, making it more resistant against evicting valuable data when scanning a huge array.

Resources