Cache Line update consistency for atomically updating a whole cache line? - caching

I have the following scenario and looking for suggestions please:
Need to share data between two threads, A and B, each running on a different core of the same processor, where thread A writes to an instance of data structure S and thread B reads it. I need the sharing of S to be as consistent and as fast as possible.
struct alignas(64) S
{
    char cacheline[64];
};
Planning to leverage the consistency of a cache line, being visible to other cores as an atomic update. Therefore have thread A write to S as fast as possible (*1) so the update is consistent (atomic from a visibility perspective) and then demote (CLDEMOTE instruction) the cache line to the shared cache so that thread B can read it as fast as possible.
Note 1: The reason it needs to happen fast is that once the core running thread A starts writing to the cache line, it should update all of its contents completely before the core makes the line visible in L1 (the updates happen in the core's store buffer first). Otherwise, if the update takes too long, a "mid-state" of the cache line may be pushed to L1, incurring unnecessary MESI invalidation penalties (since the rest of the update still has to be done) and, worse, an inconsistent state seen by thread B.
Are there better ways to achieve this?
Thanks!

Yes, store then cldemote is a good plan. It runs as a nop on CPUs that don't support it, so you can use it optimistically. (Test that it actually helps your program on CPUs where it's not a nop, though, in case you accidentally demote before reading the line some more.)
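For illustration, a minimal sketch of the store-then-demote idea. The intrinsic name below follows Intel's intrinsics guide (some compilers spell it _cldemote instead of _mm_cldemote, and you need to enable CLDEMOTE support, e.g. -mcldemote), so treat it as an assumption to check against your toolchain:

#include <immintrin.h>   // _mm_cldemote per Intel's intrinsics guide; may be _cldemote on some compilers

struct alignas(64) S { char cacheline[64]; };

void publish_line(S* shared, const S& src)
{
    *shared = src;            // write the whole line with ordinary stores
    _mm_cldemote(shared);     // hint: push the now-Modified line toward the shared LLC
    // On CPUs without CLDEMOTE this executes as a NOP, so it's safe to use unconditionally.
}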
Do you actually need atomicity, or is that just nice to have some of the time? If you need atomicity, you can't use separate store instructions. Coalescing in the store buffer isn't guaranteed; for L1d hits it may only sometimes happen on Ice Lake. And an interrupt can happen at any point (unless interrupts are disabled, but SMI and NMI can't be disabled). Including between two stores you were hoping would commit together.
32-byte AVX aligned loads and stores aren't guaranteed atomic, but in practice they probably are on Haswell and later (where the load/store units are 32 bytes wide).
Similarly, 64-byte AVX-512 loads and stores aren't guaranteed atomic, and very likely won't be in practice on Zen4 where they're done in two 32-byte halves. But they probably are on Intel CPUs with AVX-512, if you want to do some testing and find some "works in practice" functionality that doesn't show any tearing on the actual machine you care about.
16-byte loads/stores are guaranteed atomic on Intel CPUs that have the AVX feature flag. (Finally documented after being true for years, fortunately retroactive with an existing feature bit.) AMD doesn't document this yet, but it's probably true of AMD CPUs with AVX, too.
Related: https://rigtorp.se/isatomic/ / SSE instructions: which CPUs can do atomic 16B memory operations?
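As a concrete example of relying on that 16-byte guarantee, plain SSE intrinsics compile to a single 16-byte load/store instruction; note this is a hardware guarantee only, not something the C++ memory model knows about (names here are illustrative):

#include <emmintrin.h>   // SSE2: _mm_load_si128 / _mm_store_si128

alignas(16) char shared16[16];   // must be 16-byte aligned for movdqa

void write16(__m128i v)
{
    _mm_store_si128(reinterpret_cast<__m128i*>(shared16), v);          // one 16-byte store
}

__m128i read16()
{
    return _mm_load_si128(reinterpret_cast<const __m128i*>(shared16)); // one 16-byte load
}

The compiler still treats these as ordinary non-atomic accesses, so you need something like std::atomic_signal_fence or volatile if you want to stop it from reordering or combining them.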
movdir64b will provide guaranteed 64-byte write atomicity, but only with NT semantics: evicting the cache line all the way to DRAM. It also doesn't provide 64-byte atomic read, so the read side would need to check sequence numbers or something, like a SeqLock.
Intel TSX (transactional memory) can let you commit changes to a whole cache line (or more) as a single atomic transaction. But Intel keeps disabling it with microcode updates. The HLE part (optimistic lock add handling) is fully gone, but the RTM part (xbegin / xend) can still be enabled on some CPUs, I think.
For a use case like this where one thread is only writing, you might consider a SeqLock, using 4 bytes of the cache line as a sequence number. Optimal way to pass a few variables between 2 threads pinning different CPUs / how to implement a seqlock lock using c++11 atomic library
The writer can load the sequence number, store seq+1 (with a plain mov store, no lock inc needed), store the payload with regular stores, or SIMD if convenient, then store seq+2.
Unfortunately without guarantees of vector load/store atomicity, or of ordering between parts of it, you can't have the reader just load the whole cache line at once, you do need 3 separate loads. (Seq number, whole line, then seq number again.)
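A minimal SeqLock sketch along those lines (hypothetical names; the payload copy is formally a data race in ISO C++, which real seqlock implementations either tolerate or paper over with relaxed atomics):

#include <atomic>
#include <cstdint>
#include <cstring>

struct alignas(64) Shared {
    std::atomic<uint32_t> seq{0};   // even = consistent, odd = write in progress
    char payload[60];
};

void seqlock_write(Shared& s, const char (&src)[60])   // single writer only
{
    uint32_t q = s.seq.load(std::memory_order_relaxed);
    s.seq.store(q + 1, std::memory_order_relaxed);           // plain store, no lock inc needed
    std::atomic_thread_fence(std::memory_order_release);     // keep payload stores after seq+1
    std::memcpy(s.payload, src, sizeof s.payload);
    s.seq.store(q + 2, std::memory_order_release);           // publish
}

void seqlock_read(const Shared& s, char (&dst)[60])
{
    uint32_t q0, q1;
    do {
        q0 = s.seq.load(std::memory_order_acquire);            // load #1: seq
        std::memcpy(dst, s.payload, sizeof dst);               // load #2: the payload
        std::atomic_thread_fence(std::memory_order_acquire);   // keep payload loads before re-check
        q1 = s.seq.load(std::memory_order_relaxed);            // load #3: seq again
    } while ((q0 & 1) || q0 != q1);                            // retry if torn or mid-write
}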
But if you want to use 32-byte atomicity which appears to be true in practice on Haswell and Zen2 and later, maybe put a sequence number in each 32-byte half of a cache line, so the reader can check with vpcmpeqd / vpmovmskps / test al,1 to check that the first dword element (sequence number) matched between halves. Or maybe put them somewhere else within the vector to make reassembling the payload cheaper.
This spends space for two sequence numbers to save loads in the reader, but might cost more overhead in shuffling data into / out of vectors. I guess maybe storing with vmovdqu [rdi+28], ymm1 / vmovdqu [rdi], ymm0 could leave you with 60 useful bytes starting at rdi+4, overwriting the 4-byte sequence number at the start of ymm1. Store-forwarding to a 32-byte load from [rdi+4] would stall, but narrower loads that don't span the boundary between the two earlier stores would even be fine.
Related Q&As about solving the same problem of pushing data for other cores to be able to read cheaply:
CPU cache inhibition
x86 MESI invalidate cache line latency issue
Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions? - Sapphire Rapids has UIPI for user-space interrupt handling of special inter-processor interrupts. So that's fun if you want low latency notification. If you just want to read whatever the current state of a shared data structure is, SeqLock or RCU are good.

Related

Size of store buffers on Intel hardware? What exactly is a store buffer?

The Intel optimization manual talks about the number of store buffers that exist in many parts of the processor, but does not seem to talk about the size of the store buffers. Is this public information, or is the size of a store buffer kept as a microarchitectural detail?
The processors I am looking into are primarily Broadwell and Skylake, but information about others would be nice as well.
Also, what do store buffers do, exactly?
Related: what is a store buffer? and a beginner-friendly (but detailed) intro to the concept of buffers in Can a speculatively executed CPU branch contain opcodes that access RAM? which I highly recommend reading for CPU-architecture background on why we need them and what they do (decouple execution from commit to L1d / cache misses, and allow speculative exec of stores without making speculation visible in coherent cache.)
Also How do the store buffer and Line Fill Buffer interact with each other? has a good description of the steps in executing a store instruction and how it eventually commits to L1d cache.
The store buffer as a whole is composed of multiple entries.
Each core has its own store buffer (footnote 1) to decouple execution and retirement from commit into L1d cache. Even an in-order CPU benefits from a store buffer to avoid stalling on cache-miss stores, because unlike loads they just have to become visible eventually. (No practical CPUs use a sequential-consistency memory model, so at least StoreLoad reordering is allowed, even in x86 and SPARC-TSO).
For speculative / out-of-order CPUs, it also makes it possible roll back a store after detecting an exception or other mis-speculation in an older instruction, without speculative stores ever being globally visible. This is obviously essential for correctness! (You can't roll back other cores, so you can't let them see your store data until it's known to be non-speculative.)
When both logical cores are active (hyperthreading), Intel partitions the store buffer in two; each logical core gets half. Loads from one logical core only snoop its own half of the store buffer (footnote 2). What will be used for data exchange between threads are executing on one Core with HT?
The store buffer commits data from retired store instructions into L1d as fast as it can, in program order (to respect x86's strongly-ordered memory model (footnote 3)). Requiring stores to commit as they retire would unnecessarily stall retirement for cache-miss stores. Retired stores still in the store buffer are definitely going to happen and can't be rolled back, so they can actually hurt interrupt latency. (Interrupts aren't technically required to be serializing, but any stores done by an IRQ handler can't become visible until after existing pending stores are drained. And iret is serializing, so even in the best case the store buffer drains before returning.)
It's a common(?) misconception that it has to be explicitly flushed for data to become visible to other threads. Memory barriers don't cause the store buffer to be flushed; full barriers make the current core wait until the store buffer drains itself before allowing any later loads to happen (i.e. read L1d). Atomic RMW operations have to wait for the store buffer to drain before they can lock a cache line and do both their load and store to that line without allowing it to leave MESI Modified state, thus stopping any other agent in the system from observing it during the atomic operation.
To implement x86's strongly ordered memory model while still microarchitecturally allowing early / out-of-order loads (and later checking if the data is still valid when the load is architecturally allowed to happen), load buffer + store buffer entries collectively form the Memory Order Buffer (MOB). (If a cache line isn't still present when the load was allowed to happen, that's a memory-order mis-speculation.) This structure is presumably where mfence and locked instructions can put a barrier that blocks StoreLoad reordering without blocking out-of-order execution. (Although mfence on Skylake does block OoO exec of independent ALU instructions, as an implementation detail.)
movnt cache-bypassing stores (like movntps) also go through the store buffer, so they can be treated as speculative just like everything else in an OoO exec CPU. But they commit directly to an LFB (Line Fill Buffer), aka write-combining buffer, instead of to L1d cache.
Store instructions on Intel CPUs decode to store-address and store-data uops (micro-fused into one fused-domain uop). The store-address uop just writes the address (and probably the store width) into the store buffer, so later loads can set up store->load forwarding or detect that they don't overlap. The store-data uop writes the data.
Store-address and store-data can execute in either order, whichever is ready first: the allocate/rename stage that writes uops from the front-end into the ROB and RS in the back end also allocates a load-buffer or store-buffer entry for load or store uops at issue time, or stalls until one is available. Since allocation and commit happen in-order, that probably means older/younger is easy to keep track of, because it can just be a circular buffer that doesn't have to worry about old long-lived entries still being in use after wrapping around. (Unless cache-bypassing / weakly-ordered NT stores can do that? They can commit to an LFB (Line Fill Buffer) out of order. Unlike normal stores, they commit directly to an LFB for transfer off-core, rather than to L1d.)
but what is the size of an entry?
Store buffer sizes are measured in entries, not bits.
Narrow stores don't "use less space" in the store buffer, they still use exactly 1 entry.
Skylake's store buffer has 56 entries (wikichip), up from 42 in Haswell/Broadwell, and 36 in SnB/IvB (David Kanter's HSW writeup on RealWorldTech has diagrams). You can find numbers for most earlier x86 uarches in Kanter's writeups on RWT, or Wikichip's diagrams, or various other sources.
SKL/BDW/HSW also have 72 load buffer entries, SnB/IvB have 64. This is the number of in-flight load instructions that either haven't executed or are waiting for data to arrive from outer caches.
The size in bits of each entry is an implementation detail that has zero impact on how you optimize software. Similarly, we don't know the size in bits of a uop (in the front-end, in the ROB, in the RS), or TLB implementation details, or many other things, but we do know how many ROB and RS entries there are, and how many TLB entries of different types there are in various uarches.
Intel doesn't publish circuit diagrams for their CPU designs and (AFAIK) these sizes aren't generally known, so we can't even satisfy our curiosity about design details / tradeoffs.
Write coalescing in the store buffer:
Back-to-back narrow stores to the same cache line can (probably?) be combined aka coalesced in the store buffer before they commit, so it might only take one cycle on a write port of L1d cache to commit multiple stores.
We know for sure that some non-x86 CPUs do this, and we have some evidence / reason to suspect that Intel CPUs might do this. But if it happens, it's limited. @BeeOnRope and I currently think Intel CPUs probably don't do any significant merging. And if they do, the most plausible case is that entries at the end of the store buffer (ready to commit to L1d) that all go to the same cache line might merge into one buffer, optimizing commit if we're waiting for an RFO for that cache line. See discussion in comments on Are two store buffer entries needed for split line/page stores on recent Intel?. I proposed some possible experiments but haven't done them.
Earlier stuff about possible store-buffer merging:
See discussion starting with this comment: Are write-combining buffers used for normal writes to WB memory regions on Intel?
And also Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake may be relevant.
We know for sure that some weakly-ordered ISAs like Alpha 21264 did store coalescing in their store buffer, because the manual documents it, along with its limitations on what it can commit and/or read to/from L1d per cycle. Also PowerPC RS64-II and RS64-III, with less detail, in docs linked from a comment here: Are there any modern CPUs where a cached byte store is actually slower than a word store?
People have published papers on how to do (more aggressive?) store coalescing in TSO memory models (like x86), e.g. Non-Speculative Store Coalescing in Total Store Order
Coalescing could allow a store-buffer entry to be freed before its data commits to L1d (presumably only after retirement), if its data is copied to a store to the same line. This could only happen if no stores to other lines separate them, or else it would cause stores to commit (become globally visible) out of program order, violating the memory model. But we think this can happen for any 2 stores to the same line, even the first and last byte.
A problem with this idea is that SB entry allocation is probably a ring buffer, like the ROB. Releasing entries out of order would mean hardware would need to scan every entry to find a free one, and then if they're reallocated out of order then they're not in program order for later stores. That could make allocation and store-forwarding much harder so it's probably not plausible.
As discussed in Are two store buffer entries needed for split line/page stores on recent Intel?, it would make sense for an SB entry to hold all of one store even if it spans a cache-line boundary. Cache line boundaries become relevant when committing to L1d cache on leaving the SB. We know that store-forwarding can work for stores that split across a cache line. That seems unlikely if they were split into multiple SB entries in the store ports.
Terminology: I've been using "coalescing" to talk about merging in the store buffer, vs. "write combining" to talk about NT stores that combine in an LFB before (hopefully) doing a full-line write with no RFO. Or stores to WC memory regions which do the same thing.
This distinction / convention is just something I made up. According to discussion in comments, this might not be standard computer architecture terminology.
Intel's manuals (especially the optimization manual) are written over many years by different authors, and also aren't consistent in their terminology. Take most parts of the optimization manual with a grain of salt especially if it talks about Pentium4. The new sections about Sandybridge and Haswell are reliable, but older parts might have stale advice that's only / mostly relevant for P4 (e.g. inc vs. add 1), or the microarchitectural explanations for some optimization rules might be confusing / wrong. Especially section 3.6.10 Write Combining. The first bullet point about using LFBs to combine stores while waiting for lines to arrive for cache-miss stores to WB memory just doesn't seem plausible, because of memory-ordering rules. See discussion between me and BeeOnRope linked above, and in comments here.
Footnote 1:
A write-combining cache to buffer write-back (or write-through) from inner caches would have a different name. e.g. Bulldozer-family uses 16k write-through L1d caches, with a small 4k write-back buffer. (See Why do L1 and L2 Cache waste space saving the same data? for details and links to even more details. See Cache size estimation on your system? for a rewrite-an-array microbenchmark that slows down beyond 4k on a Bulldozer-family CPU.)
Footnote 2: Some POWER CPUs let other SMT threads snoop retired stores in the store buffer: this can cause different threads to disagree about the global order of stores from other threads. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?
Footnote 3: non-x86 CPUs with weak memory models can commit retired stores in any order, allowing more aggressive coalescing of multiple stores to the same line, and making a cache-miss store not stall commit of other stores.

Exclusive access to L1 cacheline on x86?

If one has a 64 byte buffer that is heavily read/written to then it's likely that it'll be kept in L1; but is there any way to force that behaviour?
As in, give one core exclusive access to those 64 bytes and tell it not to sync the data with other cores nor the memory controller so that those 64 bytes always live in one core's L1 regardless of whether or not the CPU thinks it's used often enough.
No, x86 doesn't let you do this. You can force eviction with clflushopt, or (on upcoming CPUs) write back without evicting with clwb, but you can't pin a line in cache or disable coherency.
You can put the whole CPU (or a single core?) into cache-as-RAM (aka no-fill) mode to disable sync with the memory controller, and disable ever writing back the data. Cache-as-Ram (no fill mode) Executable Code. It's typically used by BIOS / firmware in early boot before configuring the memory controllers. It's not available on a per-line basis, and is almost certainly not practically useful here. Fun fact: leaving this mode is one of the use-cases for invd, which drops cached data without writeback, as opposed to wbinvd.
I'm not sure if no-fill mode prevents eviction from L1d to L3 or whatever; or if data is just dropped on eviction. So you'd just have to avoid accessing more than 7 other cache lines that alias the one you care about in your L1d, or the equivalent for L2/L3.
Being able to force one core to hang on to a line of L1d indefinitely and not respond to MESI requests to write it back / share it would make the other cores vulnerable to lockups if they ever touched that line. So obviously if such a feature existed, it would require kernel mode. (And with HW virtualization, require hypervisor privilege.) It could also block hardware DMA (because modern x86 has cache-coherent DMA).
So supporting such a feature would require lots of parts of the CPU to handle indefinite delays, where currently there's probably some upper bound, which may be shorter than a PCIe timeout, if there is such a thing. (I don't write drivers or build real hardware, just guessing about this).
As @fuz points out, a coherency-violating instruction (xdcbt) was tried on PowerPC (in the Xbox 360 CPU), with disastrous results from mis-speculated execution of the instruction. So it's hard to implement.
You normally don't need this.
If the line is frequently used, LRU replacement will keep it hot. And if it's lost from L1d at frequent enough intervals, then it will probably stay hot in L2 which is also on-core and private, and very fast, in recent designs (Intel since Nehalem). Intel's inclusive L3 on CPUs other than Skylake-AVX512 means that staying in L1d also means staying in L3.
All this means that full cache misses all the way to DRAM are very unlikely with any kind of frequency for a line that's heavily used by one core. So throughput shouldn't be a problem. I guess you could maybe want this for realtime latency, where the worst-case run time for one call of a function mattered. Dummy reads from the cache line in some other part of the code could be helpful in keeping it hot.
However, if pressure from other cores in L3 cache causes eviction of this line from L3, Intel CPUs with an inclusive L3 also have to force eviction from inner caches that still have it hot. IDK if there's any mechanism to let L3 know that a line is heavily used in a core's L1d, because that doesn't generate any L3 traffic.
I'm not aware of this being much of a problem in real code. L3 is highly associative (like 16 or 24 way), so it takes a lot of conflicts before you'd get an eviction. L3 also uses a more complex indexing function (like a real hash function, not just modulo by taking a contiguous range of bits). In IvyBridge and later, it also uses an adaptive replacement policy to mitigate eviction from touching a lot of data that won't be reused often. http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.
See also Which cache mapping technique is used in intel core i7 processor?
@AlexisWilke points out that you could maybe use vector register(s) instead of a line of cache, for some use-cases. Using ymm registers as a "memory-like" storage location. You could globally dedicate some vector regs to this purpose. To get this in gcc-generated code, maybe use -ffixed-ymm8, or declare it as a volatile global register variable. (How to inform GCC to not use a particular register)
Using ALU instructions or store-forwarding to get data to/from the vector reg will give you guaranteed latency with no possibility of data-cache misses. But code-cache misses are still a problem for extremely low latency.
There is no direct way to achieve that on Intel and AMD x86 processors, but you can get pretty close with some effort. First, you said you're worried that the cache line might get evicted from the L1 because some other core might access it. This can only happen in the following situations:
The line is shared, and therefore it can be accessed by multiple agents in the system concurrently. If another agent attempts to read the line, its state will change from Modified or Exclusive to Shared. That is, it will stay in the L1. If, on the other hand, another agent attempts to write to the line, it has to be invalidated from the L1.
The line can be private or shared, but the thread got rescheduled by the OS to run on another core. Similar to the previous case, if it attempts to read the line, its state will change from Modified or Exclusive to Shared in both L1 caches. If it attempts to write to the line, it has to be invalidated from the L1 of the previous core on which it was running.
There are other reasons why the line may get evicted from the L1 as I will discuss shortly.
If the line is shared, then you cannot disable coherency. What you can do, however, is make a private copy of it, which effectively does disable coherency. If doing that may lead to faulty behavior, then the only thing you can do is to set the affinity of all threads that share the line to run on the same physical core on a hyperthreaded (SMT) Intel processor. Since the L1 is shared between the logical cores, the line will not get evicted due to sharing, but it can still get evicted due to other reasons.
Setting the affinity of a thread does not guarantee though that other threads cannot get scheduled to run on the same core. To reduce the probability of scheduling other threads (that don't access the line) on the same core or rescheduling the thread to run on other physical cores, you can increase the priority of the thread (or all the threads that share the line).
Intel processors are mostly 2-way hyperthreaded, so you can only run two threads that share the line at a time. So if you play with the affinity and priority of the threads, performance can change in interesting ways. You'll have to measure it. Recent AMD processors also support SMT.
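On Linux, pinning a thread to a chosen logical core can be done with pthread_setaffinity_np; a minimal sketch (the core number is machine-specific -- check which logical CPUs are siblings of one physical core, e.g. in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single logical core.
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

You'd call this from each of the two threads that share the line, with the two sibling logical-core numbers, and raise their priority (e.g. with sched_setscheduler or nice) as discussed above.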
If the line is private (only one thread can access it), a thread running on a sibling logical core in an Intel processor may cause the line to be evicted because the L1 is competitively shared, depending on its memory access behavior. I will discuss how this can be dealt with shortly.
Another issue is interrupts and exceptions. On Linux and maybe other OSes, you can configure which cores should handle which interrupts. I think it's OK to map all interrupts to all other cores, except the periodic timer interrupt whose interrupt handler's behavior is OS-dependent and it may not be safe to play with it. Depending on how much effort you want to spend on this, you can perform carefully designed experiments to determine the impact of the timer interrupt handler on the L1D cache contents. Also you should avoid exceptions.
I can think of two reasons why a line might get invalidated:
A (potentially speculative) RFO with intent for modification from another core.
The line was chosen to be evicted to make space for another line. This depends on the design of the cache hierarchy:
The L1 cache placement policy.
The L1 cache replacement policy.
Whether lower level caches are inclusive or not.
The replacement policy is commonly not configurable, so you should strive to avoid conflict L1 misses, which depends on the placement policy, which depends on the microarchitecture. On Intel processors, the L1D is typically virtually indexed and physically tagged, but the index bits fall within the page offset and so don't require translation, making it behave like a physically indexed cache. Since you know the virtual addresses of all memory accesses, you can determine which lines would be allocated in which cache set. You need to make sure that the number of lines mapped to the same set (including the line you don't want to be evicted) does not exceed the associativity of the cache. Otherwise, you'd be at the mercy of the replacement policy. Note also that an L1D prefetcher can also change the contents of the cache. You can disable it on Intel processors and measure its impact in both cases. I cannot think of an easy way to deal with inclusive lower level caches.
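For a concrete worked example of the set-index arithmetic (assuming a typical Intel client L1D: 32 KiB, 8-way associative, 64-byte lines, giving 32768 / (8 * 64) = 64 sets, so the set index is address bits [11:6], entirely inside the 4 KiB page offset):

#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineSize  = 64;
constexpr std::size_t kWays      = 8;
constexpr std::size_t kCacheSize = 32 * 1024;
constexpr std::size_t kSets      = kCacheSize / (kWays * kLineSize);   // 64 sets

// Which L1D set an address maps to, under the assumed parameters above.
std::size_t l1d_set(std::uintptr_t addr)
{
    return (addr / kLineSize) % kSets;   // i.e. bits [11:6] of the address
}

Keep at most 8 hot lines mapping to the same set (counting the line you care about) and the replacement policy never has to evict it.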
I think the idea of "pinning" a line in the cache is interesting and can be useful. It's a hybrid between caches and scratch pad memories. The line would be like a temporary register mapped to the virtual address space.
The main issue here is that you want to both read from and write to the line, while still keeping it in the cache. This sort of behavior is currently not supported.

Would buffering cache changes prevent Meltdown?

If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed would attacks similar to Meltdown still be possible?
The proposal is to make speculative execution be able to load from memory, but not write to the CPU caches until they are actually committed.
TL:DR: yes I think it would solve Spectre (and Meltdown) in their current form (using a flush+read cache-timing side channel to copy the secret data from a physical register), but probably be too expensive (in power cost, and maybe also performance) to be a likely implementation.
But with hyperthreading (or more generally any SMT), there's also an ALU / port-pressure side-channel if you can get mis-speculation to run data-dependent ALU instructions with the secret data, instead of using it as an array index. The Meltdown paper discusses this possibility before focusing on the flush+reload cache-timing side-channel. (It's more viable for Meltdown than Spectre, because you have much better control of the timing of when the secret data is used.)
So modifying cache behaviour doesn't block the attacks. It would take away the reliable side-channel for getting the secret data into the attacking process, though. (i.e. ALU timing has higher noise and thus lower bandwidth to get the same reliability; Shannon's noisy channel theorem), and you have to make sure your code runs on the same physical core as the code under attack.
On CPUs without SMT (e.g. Intel's desktop i5 chips), the ALU timing side-channel is very hard to use with Spectre, because you can't directly use perf counters on code you don't have privilege for. (But Meltdown could still be exploited by timing your own ALU instructions with Linux perf, for example).
Meltdown specifically is much easier to defend against, microarchitecturally, with simpler and cheaper changes to the hard-wired parts of the CPU that microcode updates can't rewire.
You don't need to block speculative loads from affecting cache; the change could be as simple as letting speculative execution continue after a TLB-hit load that will fault if it reaches retirement, but with the value used by speculative execution of later instructions forced to 0 because of the failed permission check against the TLB entry.
So the mis-speculated (after the faulting load of secret) touch array[secret*4096] load would always make the same cache line hot, with no secret-data-dependent behaviour. The secret data itself would enter cache, but not a physical register. (And this stops ALU / port-pressure side-channels, too.)
Stopping the faulting load from even bringing the "secret" line into cache in the first place could make it harder to tell the difference between a kernel mapping and an unmapped page, which could possibly help protect against user-space trying to defeat KASLR by finding which virtual addresses the kernel has mapped. But that's not Meltdown.
Spectre
Spectre is the hard one because the mis-speculated instructions that make data-dependent modifications to microarchitectural state do have permission to read the secret data. Yes, a "load queue" that works similarly to the store queue could do the trick, but implementing it efficiently could be expensive. (Especially given the cache coherency problem that I didn't think of when I wrote this first section.)
(There are other ways of implementing your basic idea; maybe there's even a way that's viable. But extra bits on L1D lines to track their status have downsides and aren't obviously easier.)
The store queue tracks stores from execution until they commit to L1D cache. (Stores can't commit to L1D until after they retire, because that's the point at which they're known to be non-speculative, and thus can be made globally visible to other cores).
A load queue would have to store whole incoming cache lines, not just the bytes that were loaded. (But note that Skylake-X can do 64-byte ZMM stores, so its store-buffer entries do have to be the size of a cache line. But if they can borrow space from each other or something, then there might not be 64 * entries bytes of storage available, i.e. maybe only the full number of entries is usable with scalar or narrow-vector stores. I've never read anything about a limitation like this, so I don't think there is one, but it's plausible)
A more serious problem is that Intel's current L1D design has 2 read ports + 1 write port. (And maybe another port for writing lines that arrive from L2 in parallel with committing a store? There was some discussion about that on Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake.)
If your loaded data can't enter L1D until after the loads retire, then they're probably going to be competing for the same write port that stores use.
Loads that hit in L1D can still come directly from L1D, though, and loads that hit in the memory-order-buffer could still be executed at 2 per clock. (The MOB would now include this new load queue as well as the usual store queue + markers for loads to maintain x86 memory ordering semantics). You still need both L1D read ports to maintain performance for code that doesn't touch a lot of new memory, and mostly is reloading stuff that's been hot in L1D for a while.
This would make the MOB about twice as large (in terms of data storage), although it doesn't need any more entries. As I understand it, the MOB in current Intel CPUs is composed of the individual load-buffer and store-buffer entries. (Haswell has 72 and 42 respectively).
Hmm, a further complication is that the load data in the MOB has to maintain cache coherency with other cores. This is very different from store data, which is private and hasn't become globally visible / isn't part of the global memory order and cache coherency until it commits to L1D.
So this proposed "load queue" implementation mechanism for your idea is probably not feasible without tweaks: it would have to be checked by invalidation-requests from other cores, so that's another read-port needed in the MOB.
Any possible implementation would have the problem of needing to later commit to L1D like a store. I think it would be a significant burden not to be able to evict + allocate a new line when it arrived from off-core.
(Even allowing speculative eviction but not speculative replacement from conflicts leaves open a possible cache-timing attack. You'd prime all the lines and then do a load that would evict one from one set of lines or another, and find which line was evicted instead of which one was fetched using a similar cache-timing side channel. So using extra bits in L1D to find / evict lines loaded during recovery from mis-speculation wouldn't eliminate this side-channel.)
Footnote: all instructions are speculative. This question is worded well, but I think many people reading about OoO exec and thinking about Meltdown / Spectre fall into this trap of confusing speculative execution with mis-speculation.
Remember that all instructions are speculative when they're executed. It's not known to be correct speculation until retirement. Meltdown / Spectre depend on accessing secret data and using it during mis-speculation. But the basis of current OoO CPU designs is that you don't know whether you've speculated correctly or not; everything is speculative until retirement.
Any load or store could potentially fault, and so can some ALU instructions (e.g. floating point if exceptions are unmasked), so any performance cost that applies "only when executing speculatively" actually applies all the time. This is why stores can't commit from the store queue into L1D until after the store uops have retired from the out-of-order CPU core (with the store data in the store queue).
However, I think conditional and indirect branches are treated specially, because they're expected to mis-speculate some of the time, and optimizing recovery for them is important. Modern CPUs do better with branches than just rolling back to the current retirement state when a mispredict is detected, I think using a checkpoint buffer of some sort. So out-of-order execution for instructions before the branch can continue during recovery.
But loop and other branches are very common, so most code executes "speculatively" in this sense, too, with at least one branch-rollback checkpoint not yet verified as correct speculation. Most of the time it's correct speculation, so no rollback happens.
Recovery for mis-speculation of memory ordering or faulting loads is a full pipeline-nuke, rolling back to the retirement architectural state. So I think only branches consume the branch checkpoint microarchitectural resources.
Anyway, all of this is what makes Spectre so insidious: the CPU can't tell the difference between mis-speculation and correct speculation until after the fact. If it knew it was mis-speculating, it would initiate rollback instead of executing useless instructions / uops. Indirect branches are not rare, either (in user-space); every DLL or shared library function call uses one in normal executables on Windows and Linux.
I suspect the overhead from buffering and committing the buffer would render the specEx/caching useless?
This is purely speculative (no pun intended) - I would love to see someone with a lower level background weigh in on this!

Prefetch instruction behavior

In order to satisfy some security property, I want to make sure that an important piece of data is already in the cache when a statement accesses it (so there will be no cache miss). For example, for this code
...
a += 2;
...
I want to make sure that a is in the cache right before a += 2 is executed.
I was considering to use the PREFETCHh instruction of x86 to achieve this:
...
__prefetch(&a); /* pseudocode */
a += 2;
...
However, I have read that inserting the prefetch instruction right before a += 2 might be too late to ensure a is in the cache when a += 2 gets executed. Is this claim true? If it is true, can I fix it by inserting a CPUID instruction after the prefetch to ensure the prefetch instruction has been executed (because the Intel manual says PREFETCHh is ordered with respect to CPUID)?
Yes, you need to prefetch with a lead-time of about the memory latency for it to be optimal. Ulrich Drepper's What Every Programmer Should Know About Memory talks a lot about prefetching.
Making this happen will be highly non-trivial for a single access. Too soon and your data might be evicted before the insn you care about. Too late and it might reduce the access time some. Tuning this will depend on compiler version/options, and on the hardware you're running on. (Higher instructions-per-cycle means you need to prefetch earlier. Higher memory latency also means you need to prefetch earlier).
Since you want to do a read-modify-write to a, you should use PREFETCHW if available. The other prefetch instructions only prefetch for reading, so the read part of the RMW could hit, but I think the store part could be delayed by MESI cache coherency getting write-ownership of the cache line.
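With GCC/Clang you don't have to hand-write the prefetch instruction; __builtin_prefetch with a non-zero second argument requests a write-prefetch (compiling to PREFETCHW where the target supports it). A sketch, with the caveat from above that the prefetch has to be issued far enough ahead of the access:

void bump_later(int* a)
{
    __builtin_prefetch(a, /*rw=*/1, /*locality=*/3);  // hint: fetch the line with intent to write
    // ... enough independent work here to hide the memory latency ...
    *a += 2;
}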
If a isn't atomic, you can also just load a well ahead of time and use the copy in a register. The store back to the global could easily miss in this case, which could eventually stall execution, though.
You'll probably have a hard time doing something like that reliably with a compiler, instead of writing asm yourself. Any of the other ideas will also require checking the compiler output to make sure the compiler did what you're hoping.
Prefetch instructions don't necessarily prefetch anything. They're "hints", which presumably get ignored when the number of outstanding loads is near max (i.e. almost out of load buffers).
Another option is to load it (not just prefetch) and then serialize with a CPUID. (A load that throws away the result is like a prefetch). The load would have to complete before the serializing instruction, and instructions after the serializing insn can't start decoding until then. I think a prefetch can retire before the data arrives, which is normally an advantage, but not in this case where we care about one operation hitting at the expense of overall performance.
From Intel's insn ref manual (see the x86 tag wiki) entry for CPUID:
Serializing instruction execution guarantees that any modifications to flags, registers, and memory for previous instructions are completed before the next instruction is fetched and executed.
I think a sequence like this is fairly good (but still doesn't guarantee anything in a pre-emptive multi-tasking system):
add [mem], 0 # can't retire until the store completes, requiring that our core owns the cache line for writing
CPUID # later insns can't start until the prev add retires
add [mem], 2 # a += 2 Can't miss in cache unless an interrupt or the other hyper-thread evicts the cache line before this insn can execute
Here we're using add [mem], 0 as a write-prefetch which is otherwise a near no-op. (It is a non-atomic read-modify-write.) I'm not sure if PREFETCHW really will ensure the cache line is ready if you do PREFETCHW / CPUID / add [mem], 2. The insn is ordered wrt. CPUID, but the manual doesn't say that the prefetch effect is ordered.
If a is volatile, then (void)a; will get gcc or clang to emit a load insn. I assume most other compilers (MSVC?) are the same. You can probably do (void) *(volatile something*)&a to dereference a pointer to volatile and force a load from a's address.
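For example (a hypothetical snippet; names are illustrative):

volatile int a;
void touch_a() { (void)a; }                      // volatile read: the compiler must emit a load

int b;                                           // non-volatile object
void touch_b() { (void)*(volatile int *)&b; }    // cast to pointer-to-volatile to force a load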
To guarantee that a memory access will hit in cache, you'd need to be running at realtime priority pinned to a core that doesn't receive interrupts. Depending on the OS, the timer-interrupt handler is probably lightweight enough that the chance of evicting your data from cache is low enough.
If your process is descheduled between executing a prefetch insn and doing the real access, the data will probably have been evicted from at least L1 cache.
So it's unlikely you can defeat an attacker determined to do a timing attack on your code, unless it's realistic to run at realtime priority. An attacker could run many many threads of memory-intensive code...

Why do we bother with CPU registers in assembly, instead of just working directly with memory?

I have a basic question about assembly.
Why do we bother doing arithmetic operations only on registers if they can work on memory as well?
For example both of the following cause (essentially) the same value to be calculated as an answer:
Snippet 1
.data
var dd 00000400h
.code
Start:
add var,0000000Bh
mov eax,var
;breakpoint: var = 00000B04
End Start
Snippet 2
.code
Start:
mov eax,00000400h
add eax,0000000bh
;breakpoint: eax = 0000040B
End Start
From what I can see most texts and tutorials do arithmetic operations mostly on registers. Is it just faster to work with registers?
If you look at computer architectures, you find a series of levels of memory. Those that are close to the CPU are fast, expensive (per bit), and therefore small, while at the other end you have big, slow, and cheap memory devices. In a modern computer, these are typically something like:
CPU registers (slightly complicated, but on the order of 1KB per core - there are different types of registers. You might have 16 64-bit general purpose registers plus a bunch of registers for special purposes)
L1 cache (64KB per core)
L2 cache (256KB per core)
L3 cache (8MB)
Main memory (8GB)
HDD (1TB)
The internet (big)
Over time, more and more levels of cache have been added - I can remember a time when CPUs didn't have any onboard caches, and I'm not even old! These days, HDDs come with onboard caches, and the internet is cached in any number of places: in memory, on the HDD, and maybe on caching proxy servers.
There is a dramatic (often orders of magnitude) decrease in bandwidth and increase in latency in each step away from the CPU. For example, a HDD might be able to be read at 100MB/s with a latency of 5ms (these numbers may not be exactly correct), while your main memory can read at 6.4GB/s with a latency of 9ns (six orders of magnitude!). Latency is a very important factor, as you don't want to keep the CPU waiting any longer than it has to (this is especially true for architectures with deep pipelines, but that's a discussion for another day).
The idea is that you will often be reusing the same data over and over again, so it makes sense to put it in a small fast cache for subsequent operations. This is referred to as temporal locality. Another important principle of locality is spatial locality, which says that memory locations near each other will likely be read at about the same time. It is for this reason that reading from RAM will cause a much larger block of RAM to be read and put into on-CPU cache.

If it wasn't for these principles of locality, then any location in memory would have an equally likely chance of being read at any one time, so there would be no way to predict what will be accessed next, and all the levels of cache in the world will not improve speed. You might as well just use a hard drive, but I'm sure you know what it's like to have the computer come to a grinding halt when paging (which is basically using the HDD as an extension to RAM). It is conceptually possible to have no memory except for a hard drive (and many small devices have a single memory), but this would be painfully slow compared to what we're familiar with.
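A small (illustrative) example of spatial locality in practice: summing a 2-D array row by row walks through consecutive addresses, so every 64-byte line brought into cache is fully used, while walking column by column touches a different line on almost every access:

constexpr int N = 4096;
static float grid[N][N];

float sum_rows()               // good spatial locality: consecutive addresses
{
    float s = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += grid[i][j];
    return s;
}

float sum_cols()               // poor spatial locality: stride of N * sizeof(float) bytes
{
    float s = 0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += grid[i][j];
    return s;
}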
One other advantage of having registers (and only a small number of registers) is that it lets you have shorter instructions. If you have instructions that contain two (or more) 64 bit addresses, you are going to have some long instructions!
Because RAM is slow. Very slow.
Registers are placed inside the CPU, right next to the ALU so signals can travel almost instantly. They're also the fastest memory type but they take significant space so we can have only a limited number of them. Increasing the number of registers increases
die size
distance needed for signals to travel
work to save the context when switching between threads
number of bits in the instruction encoding
Read If registers are so blazingly fast, why don't we have more of them?
More commonly used data will be placed in caches for faster access. In the past, caches were very expensive, so they were an optional part that could be purchased separately and plugged into a socket outside the CPU. Nowadays they're often on the same die as the CPU cores. Caches are constructed from SRAM cells, which are smaller than register cells but maybe tens or hundreds of times slower.
Main memory is made from DRAM, which needs only one transistor per cell but is thousands of times slower than registers; hence we can't work with only DRAM in a high-performance system. However, some embedded systems do make the register file serve as main memory, so there the registers are also the main memory.
More information: Can we have a computer with just registers as memory?
Registers are much faster and also the operations that you can perform directly on memory are far more limited.
In reality, there are tiny implementations that do not separate registers from memory. They can expose this, for example, by having 512 bytes of RAM, the first 64 of which are exposed as 32 16-bit registers while also being accessible as addressable RAM. Or, as another example, the MOS Technology 6502's "zero page" (RAM range 0-255, accessed using a 1-byte address) was a poor substitute for registers, due to the small number of real registers in the CPU. But this scales poorly to larger setups.
The advantages of registers are the following:
They are the fastest. They are faster in a typical modern system than any cache, and much faster than DRAM. (In the example above, the RAM is likely SRAM. But SRAM of a few gigabytes would be unusably expensive.) And they are close to the processor. The difference in access time between a register and DRAM can reach a factor of 200 or even 1000. Even compared to the L1 cache, register access is typically 2-4 times faster.
Their number is limited. A typical instruction set would become too bloated if every memory location could be addressed explicitly.
Registers are specific to each CPU (core, hardware thread, hart) separately. (In systems where fixed RAM addresses serve role of special registers, as e.g. zSeries does, this needs special remapping of such service area in absolute addresses, separate for each core.)
In the same manner as (3), registers are specific to each process thread without a need to adjust locations in code for a thread.
Registers (relatively easily) allow specific optimizations, as register renaming. This is too complex if memory addresses are used.
Additionally, there are registers that could not be implemented in a separate block of RAM, because access to RAM itself requires changing them. I mean the "execution phase" register in the simplest CPU designs, which takes values like "instruction fetch phase", "instruction decoding phase", "ALU phase", "data writing phase" and so on, plus the equivalents of this register in more complicated (pipelined, out-of-order) designs; also the various buffer registers used for bus access, and so on. But such registers are not visible to the programmer, so you likely did not mean them.
x86, like pretty much every other "normal" CPU you might learn assembly for, is a register machine1. There are other ways to design something that you can program (e.g. a Turing machine that moves along a logical "tape" in memory, or the Game of Life), but register machines have proven to be basically the only way to go for high-performance.
https://www.realworldtech.com/architecture-basics/2/ covers possible alternatives like accumulator or stack machines which are also obsolete now. Although it omits CISCs like x86 which can be either load-store or register-memory. x86 instructions can actually be reg,mem; reg,reg; or even mem,reg. (Or with an immediate source.)
Footnote 1: The abstract model of computation called a register machine doesn't distinguish between registers and memory; what it calls registers are more like memory in real computers. I say "register machine" here to mean a machine with multiple general-purpose registers, as opposed to just one accumulator, or a stack machine or whatever. Most x86 instructions have 2 explicit operands (but it varies), up to one of which can be memory. Even microcontrollers like 6502 that can only really do math into one accumulator register almost invariably have some other registers (e.g. for pointers or indices), unlike true toy ISAs like Marie or LMC that are extremely inefficient to program for because you need to keep storing and reloading different things into the accumulator, and can't even keep an array index or loop counter anywhere that you can use it directly.
Since x86 was designed to use registers, you can't really avoid them entirely, even if you wanted to and didn't care about performance.
Current x86 CPUs can read/write many more registers per clock cycle than memory locations.
For example, Intel Skylake can do two loads and one store from/to its 32KiB 8-way associative L1D cache per cycle (best case), but can read upwards of 10 registers per clock, and write 3 or 4 (plus EFLAGS).
Building an L1D cache with as many read/write ports as the register file would be prohibitively expensive (in transistor count/area and power usage), especially if you wanted to keep it as large as it is. It's probably just not physically possible to build something that can use memory the way x86 uses registers with the same performance.
Also, writing a register and then reading it again has essentially zero latency because the CPU detects this and forwards the result directly from the output of one execution unit to the input of another, bypassing the write-back stage. (See https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing).
These result-forwarding connections between execution units are called the "bypass network" or "forwarding network", and it's much easier for the CPU to do this for a register design than if everything had to go into memory and back out. The CPU only has to check a 3 to 5 bit register number, instead of a 32-bit or 64-bit address, to detect cases where the output of one instruction is needed right away as the input for another operation. (And those register numbers are hard-coded into the machine-code, so they're available right away.)
As others have mentioned, 3 or 4 bits to address a register make the machine-code format much more compact than if every instruction had absolute addresses.
See also https://en.wikipedia.org/wiki/Memory_hierarchy: you can think of registers as a small fast fixed-size memory space separate from main memory, where only direct absolute addressing is supported. (You can't "index" a register: given an integer N in one register, you can't get the contents of the Nth register with one insn.)
Registers are also private to a single CPU core, so out-of-order execution can do whatever it wants with them. With memory, it has to worry about what order things become visible to other CPU cores.
Having a fixed number of registers is part of what lets CPUs do register-renaming for out-of-order execution. Having the register-number available right away when an instruction is decoded also makes this easier: there's never a read or write to a not-yet-known register.
See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an explanation of register renaming, and a specific example (the later edits to the question / later parts of my answer showing the speedup from unrolling with multiple accumulators to hide FMA latency even though it reuses the same architectural register repeatedly).
The store buffer with store forwarding does basically give you "memory renaming". A store/reload to a memory location is independent of earlier stores and load to that location from within this core. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
Repeated function calls with a stack-args calling convention, and/or returning a value by reference, are cases where the same bytes of stack memory can be reused multiple times.
The second store/reload can execute even if the first store is still waiting for its inputs. (I've tested this on Skylake, but IDK if I ever posted the results in an answer anywhere.)
Registers are accessed way faster than RAM memory, since you don't have to access the "slow" memory bus!
We use registers because they are fast. Usually, they operate at the CPU's speed. Registers and CPU cache are made with different technology / fabrication, and they are expensive. RAM, on the other hand, is cheap and 100 times slower.
Generally speaking register arithmetic is much faster and much preferred. However there are some cases where the direct memory arithmetic is useful.
If all you want to do is increment a number in memory (and nothing else at least for a few million instructions) then a single direct memory arithmetic instruction is usually slightly faster than load/add/store.
Also if you are doing complex array operations you generally need a lot of registers to keep track of where you are and where your arrays end. On older architectures you could run out of registers really quickly, so the option of adding two bits of memory together without zapping any of your current registers was really useful.
Yes, it's much much much faster to use registers. Even if you only consider the physical distance from processor to register compared to proc to memory, you save a lot of time by not sending electrons so far, and that means you can run at a higher clock rate.
Yes - also you can typically push/pop registers easily for calling procedures, handling interrupts, etc
It's just that the instruction set will not allow you to do such complex operations:
add [0x40001234],[0x40002234]
You have to go through the registers.
