Would buffering cache changes prevent Meltdown? - caching

If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed would attacks similar to Meltdown still be possible?
The proposal is to make speculative execution be able to load from memory, but not write to the CPU caches until they are actually committed.

TL;DR: yes, I think it would solve Spectre (and Meltdown) in their current form (using a flush+reload cache-timing side channel to copy the secret data from a physical register), but it would probably be too expensive (in power cost, and maybe also performance) to be a likely implementation.
But with hyperthreading (or more generally any SMT), there's also an ALU / port-pressure side-channel if you can get mis-speculation to run data-dependent ALU instructions with the secret data, instead of using it as an array index. The Meltdown paper discusses this possibility before focusing on the flush+reload cache-timing side-channel. (It's more viable for Meltdown than Spectre, because you have much better control of the timing of when the secret data is used.)
So modifying cache behaviour doesn't completely block the attacks. It would take away the reliable side-channel for getting the secret data into the attacking process, though: ALU timing has higher noise and thus lower bandwidth for the same reliability (Shannon's noisy-channel theorem), and the attacker has to make sure their code runs on the same physical core as the code under attack.
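To put a rough shape on the "higher noise means lower bandwidth" point, here is a minimal sketch that treats each timing probe as one use of a binary symmetric channel and computes its Shannon capacity. The channel model and the two error rates are assumptions chosen purely for illustration; they are not measurements of any real side channel.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a binary event that occurs with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(error_rate: float) -> float:
    """Capacity, in bits per channel use, of a binary symmetric channel
    that flips each transmitted bit with probability error_rate."""
    return 1.0 - binary_entropy(error_rate)


if __name__ == "__main__":
    # Hypothetical error rates: a clean cache-timing probe vs. a noisy ALU-timing probe.
    for label, err in [("low-noise probe", 0.01), ("high-noise probe", 0.30)]:
        print(f"{label}: {bsc_capacity(err):.3f} bits per probe at best")
```

The noisier the probe, the more probes (and so the more time) an attacker needs per reliably recovered bit, which is the sense in which the ALU channel is worse but not closed.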
On CPUs without SMT (e.g. Intel's desktop i5 chips), the ALU timing side-channel is very hard to use with Spectre, because you can't directly use perf counters on code you don't have privilege for. (But Meltdown could still be exploited by timing your own ALU instructions with Linux perf, for example).
Meltdown specifically is much easier to defend against, microarchitecturally, with simpler and cheaper changes to the hard-wired parts of the CPU that microcode updates can't rewire.
You don't need to block speculative loads from affecting cache; the change could be as simple as letting speculative execution continue after a TLB-hit load that will fault if it reaches retirement, but with the value used by speculative execution of later instructions forced to 0 because of the failed permission check against the TLB entry.
So the mis-speculated (after the faulting load of secret) touch array[secret*4096] load would always make the same cache line hot, with no secret-data-dependent behaviour. The secret data itself would enter cache, but not a physical register. (And this stops ALU / port-pressure side-channels, too.)
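The effect of forcing the faulting load's result to 0 can be sketched with a toy model of the probe array. This is not a model of any real pipeline: `speculative_gadget`, `recover_byte`, and the `zero_on_fault` switch are hypothetical names, and the "cache" is reduced to the set of probe-array offsets that became hot.

```python
LINE_STRIDE = 4096  # one probe slot per byte value, as in array[secret*4096]


def speculative_gadget(secret: int, zero_on_fault: bool) -> set:
    """Return the probe-array offsets made hot by the mis-speculated access.

    zero_on_fault=True models the proposed fix: the value produced by the
    faulting load is forced to 0 before dependent instructions can use it.
    """
    value = 0 if zero_on_fault else secret
    return {value * LINE_STRIDE}


def recover_byte(hot_offsets: set) -> int:
    """Attacker's reload step: map the single hot slot back to a byte value."""
    (offset,) = hot_offsets
    return offset // LINE_STRIDE


if __name__ == "__main__":
    print(recover_byte(speculative_gadget(0x5A, zero_on_fault=False)))
    print(recover_byte(speculative_gadget(0x5A, zero_on_fault=True)))
```

With the fix enabled, the hot slot no longer depends on the secret, so the reload step learns nothing regardless of how good the timing channel is.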
Stopping the faulting load from even bringing the "secret" line into cache in the first place could make it harder to tell the difference between a kernel mapping and an unmapped page, which could possibly help protect against user-space trying to defeat KASLR by finding which virtual addresses the kernel has mapped. But that's not Meltdown.
Spectre
Spectre is the hard one because the mis-speculated instructions that make data-dependent modifications to microarchitectural state do have permission to read the secret data. Yes, a "load queue" that works similarly to the store queue could do the trick, but implementing it efficiently could be expensive. (Especially given the cache coherency problem that I didn't think of when I wrote this first section.)
(There are other ways of implementing your basic idea; maybe there's even a way that's viable. But extra bits on L1D lines to track their status have downsides and aren't obviously easier.)
The store queue tracks stores from execution until they commit to L1D cache. (Stores can't commit to L1D until after they retire, because that's the point at which they're known to be non-speculative, and thus can be made globally visible to other cores).
A load queue would have to store whole incoming cache lines, not just the bytes that were loaded. (But note that Skylake-X can do 64-byte ZMM stores, so its store-buffer entries do have to be the size of a cache line. If they can borrow space from each other or something, then there might not be 64 * entries bytes of storage available, i.e. maybe the full number of entries is only usable with scalar or narrow-vector stores. I've never read anything about a limitation like this, so I don't think there is one, but it's plausible.)
A more serious problem is that Intel's current L1D design has 2 read ports + 1 write port. (And maybe another port for writing lines that arrive from L2 in parallel with committing a store? There was some discussion about that on Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake.)
If your loaded data can't enter L1D until after the loads retire, then they're probably going to be competing for the same write port that stores use.
Loads that hit in L1D can still come directly from L1D, though, and loads that hit in the memory-order-buffer could still be executed at 2 per clock. (The MOB would now include this new load queue as well as the usual store queue + markers for loads to maintain x86 memory ordering semantics). You still need both L1D read ports to maintain performance for code that doesn't touch a lot of new memory, and mostly is reloading stuff that's been hot in L1D for a while.
This would make the MOB about twice as large (in terms of data storage), although it doesn't need any more entries. As I understand it, the MOB in current Intel CPUs is composed of the individual load-buffer and store-buffer entries. (Haswell has 72 and 42 respectively).
Hmm, a further complication is that the load data in the MOB has to maintain cache coherency with other cores. This is very different from store data, which is private and hasn't become globally visible / isn't part of the global memory order and cache coherency until it commits to L1D.
So this proposed "load queue" implementation mechanism for your idea is probably not feasible without tweaks: it would have to be checked by invalidation-requests from other cores, so that's another read-port needed in the MOB.
Any possible implementation would have the problem of needing to later commit to L1D like a store. I think it would be a significant burden not to be able to evict + allocate a new line when it arrived from off-core.
(Even allowing speculative eviction but not speculative replacement from conflicts leaves open a possible cache-timing attack. You'd prime all the lines and then do a load that would evict one from one set of lines or another, and find which line was evicted instead of which one was fetched using a similar cache-timing side channel. So using extra bits in L1D to find / evict lines loaded during recovery from mis-speculation wouldn't eliminate this side-channel.)
Footnote: all instructions are speculative. This question is worded well, but I think many people reading about OoO exec and thinking about Meltdown / Spectre fall into this trap of confusing speculative execution with mis-speculation.
Remember that all instructions are speculative when they're executed. It's not known to be correct speculation until retirement. Meltdown / Spectre depend on accessing secret data and using it during mis-speculation. But the basis of current OoO CPU designs is that you don't know whether you've speculated correctly or not; everything is speculative until retirement.
Any load or store could potentially fault, and so can some ALU instructions (e.g. floating point if exceptions are unmasked), so any performance cost that applies "only when executing speculatively" actually applies all the time. This is why stores can't commit from the store queue into L1D until after the store uops have retired from the out-of-order CPU core (with the store data in the store queue).
However, I think conditional and indirect branches are treated specially, because they're expected to mis-speculate some of the time, and optimizing recovery for them is important. Modern CPUs do better with branches than just rolling back to the current retirement state when a mispredict is detected, I think using a checkpoint buffer of some sort. So out-of-order execution for instructions before the branch can continue during recovery.
But loop and other branches are very common, so most code executes "speculatively" in this sense, too, with at least one branch-rollback checkpoint not yet verified as correct speculation. Most of the time it's correct speculation, so no rollback happens.
Recovery for mis-speculation of memory ordering or faulting loads is a full pipeline-nuke, rolling back to the retirement architectural state. So I think only branches consume the branch checkpoint microarchitectural resources.
Anyway, all of this is what makes Spectre so insidious: the CPU can't tell the difference between mis-speculation and correct speculation until after the fact. If it knew it was mis-speculating, it would initiate rollback instead of executing useless instructions / uops. Indirect branches are not rare, either (in user-space); every DLL or shared library function call uses one in normal executables on Windows and Linux.

I suspect the overhead from buffering and committing the buffer would render the specEx/caching useless?
This is purely speculative (no pun intended). I would love to see someone with a lower-level background weigh in on this!

Related

How does write-invalidate policy work with set-associative caches?

I was going through the Cache Write Policies paper by Norman P. Jouppi, and I understand why write-invalidate (defined on page 193) works well with direct-mapped caches: the data can be written while checking the tag, and if the access turns out to be a miss, the cache line is invalidated because it has been corrupted by the write. This can be done in one cycle.
But is there any benefit if write-invalidate is used for set-associative caches?
What is the usual configuration that is used for L1 caches in real processors? Do they use direct or set-associative and write-validate/write around/write invalidate/fetch-on-write policy?
TL;DR: for a non-blocking cache using write-invalidate, changing it from direct-mapped to set-associative could hurt the hit rate unless writes are very rare, or mean that you introduce the possibility of needing to block.
Write-invalidate only makes sense for a simple in-order pipeline with a simple cache that tries to avoid stalling the pipeline even without a store buffer, and go really fast at the expense of hit-rate. If you were going to change things to improve hit-rate, changing away from write-invalidate (usually to write-back + write-allocate + fetch-on-write) would be one of the first things. Write-invalidate with set-associative cache is possible with some ugly tradeoffs, but you wouldn't like the results.
The 1993 paper you linked is using that term to mean something other than the modern cache-coherence mechanism meaning. In the paper:
The combination of
write-before-hit, no-fetch-on-write, and no-write-allocate
we call write-invalidate
Yes, real-world caches these days are basically always set-associative; the more complex tag-comparator logic is worth the increased hit-rate for the same data size. Which cache mapping technique is used in intel core i7 processor? has some general stuff, not just x86. Modern examples of direct-mapped caches include the DRAM cache when a part of the persistent memory on an Intel platform operates in memory mode. Also many server-grade processors from multiple vendors support L3 way-wise partitioning, so you can, for example, allocate one way for a thread which would basically behave like a direct-mapped cache.
Write policy is usually write-allocate + fetch-on-write + no-write-before-hit for modern CPU caches; some ISAs offer methods such as special instructions to bypass cache for "non-temporal" stores that won't be re-read soon, to avoid cache pollution for those cases. Most workloads do re-load their stores with enough temporal locality that write-allocate is the only sane choice, especially when caches are larger and/or more associative so they're more likely to be able to hang onto a line until the next read or write.
It's also very common to do multiple small writes into the same line, making write-allocate very valuable, especially if a store buffer didn't manage to merge those writes.
But is there any benefit if write-invalidate is used for set-associative caches?
It doesn't seem so.
The only advantage it has is not stalling a simple in-order pipeline that lacks a store buffer ("write buffer" in the paper). It allows the write to happen in parallel with the tag check, so you find out after modifying the line whether you hit or not. (Modern CPUs do use a store buffer to decouple store commit to L1d from store execution and hide store-miss latency. Even in-order CPUs typically have a store buffer to allow memory-level parallelism of RFOs (read-for-ownership), e.g. ARM Cortex-A53 found in phones.)
Anyway, in a set-associative cache, you need to check tags to know which "way" of the set to write into on a write hit. (Or detect a miss and pick one to evict according to some policy, like random or pseudo-LRU using some extra state bits, or write-around if no-write-allocate). If you wait until after the tag check to find the write way, you've lost the only benefit of write-invalidate.
Blindly writing to a random way could lead to a situation where there's a hit in a different way than the one you guessed. Way-prediction is a thing (and can do better than random), but the downside of an incorrect prediction for a write like this would be unnecessarily invalidating a line, instead of just a bit of extra latency. Way prediction in modern cache. I don't know what kind of success-rate way-prediction usually achieves. I'd guess not great, like maybe 80 to 90% at best. Probably spending transistors to do way-prediction would be better spent elsewhere, to do something that sucks less than write-invalidate! A store buffer with store forwarding probably costs more, but is a lot better.
The advantage of write-invalidate is to help make the cache non-blocking. But if you find a write hit in a way other than the one you picked, you need to go back and fix things up by updating the correct line, so you'd lose the non-blocking property. Never stalling is better than not usually stalling, because it means you don't even need to make the hardware handle that possible case at all. (Although you do need to be able to stall for memory.)
The write-in-one-way-hit-in-another situation can be avoided by writing in all of the ways. But there will be at most one hit, and the rest will have to be invalidated. The negative impact on hit rate grows significantly with associativity. (Unless writes are quite infrequent vs. reads, reducing the associativity would probably help hit rate with the write-all-ways strategy, so for a given total cache capacity, direct-mapped might be the best choice if you insist on fully-non-blocking write-invalidate.) Even for a direct-mapped cache, the experimental evaluation given in the paper itself shows that write-invalidate has a higher miss rate compared to the other evaluated write policies. So it's a win only if the benefits of reducing latency and bandwidth demand outweigh the damage of the high miss rate.
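The cost of the write-all-ways strategy can be sketched with a toy tag-only cache model. This is a minimal illustration, not the paper's simulator: the class name, the FIFO replacement on read misses, and the 64-byte line size are all assumptions made for the example.

```python
class WriteInvalidateCache:
    """Toy tag-only model: write-before-hit, no-fetch-on-write, no-write-allocate."""

    def __init__(self, num_sets: int, ways: int, line_bytes: int = 64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # Each set is a list of valid tags, oldest first (FIFO replacement on read misses).
        self.sets = [[] for _ in range(num_sets)]

    def _locate(self, addr: int):
        line = addr // self.line_bytes
        return line % self.num_sets, line // self.num_sets  # (set index, tag)

    def read(self, addr: int) -> bool:
        """Return True on a hit; on a miss, fetch the line, evicting the oldest if full."""
        index, tag = self._locate(addr)
        tags = self.sets[index]
        if tag in tags:
            return True
        if len(tags) == self.ways:
            tags.pop(0)
        tags.append(tag)
        return False

    def write(self, addr: int) -> bool:
        """Write into every way of the set in parallel with the tag check.

        Every way whose tag did not match now holds corrupted data and is
        invalidated, so at most the written line survives in this set.
        """
        index, tag = self._locate(addr)
        hit = tag in self.sets[index]
        self.sets[index] = [tag] if hit else []
        return hit


if __name__ == "__main__":
    cache = WriteInvalidateCache(num_sets=64, ways=2)
    a, b = 0, 64 * 64           # two addresses that map to the same set
    cache.read(a)
    cache.read(b)
    cache.write(a)              # write hit on a, but b is collateral damage
    print("b still cached after write to a:", cache.read(b))
```

In this model a single write wipes out up to `ways - 1` unrelated lines, which is why the hit-rate penalty scales with associativity, while with `ways=1` it degenerates to the direct-mapped behaviour from the paper.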
Also, as I said, write-allocate is very good for CPUs, especially when it's set-associative so you're spending more resources trying to get a higher hit-rate. You could maybe still implement write-allocate by triggering a fetch on miss, remembering where in the line you stored the data, and merging that with the old copy of the line when it arrives.
You don't want to defeat that by blowing away lines that didn't need to die.
Also, write-invalidate implies write-through even for write hits, because it could lose data if a line is ever dirty. But write-back is also very good in modern L1d caches to insulate larger/slower caches from the write bandwidth. (Especially if there's no per-core private L2 to separately reduce total traffic to shared caches.) However, AMD Bulldozer-family did have a write-through L1d with a small 4k write-buffer between it and a write-back L2. This was generally considered a failed experiment or weak point of the design, and they dropped it in favour of a standard write-back write-allocate L1d for Zen. When use write-through cache policy for pages.
So in summary, write-invalidate is incompatible with several things that modern mainstream CPU designs have settled on as the best options:
write-allocate write policy
write-back (not write-through). https://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies
set-associative (huge downsides that can only be partially mitigated by way-prediction)
store buffer to decouple store miss from execution, and allow memory parallelism. (Not strictly incompatible, but a store buffer makes it pointless. Necessary for OoO exec and widely used for in-order)
write-invalidate in cache-coherent SMP systems
You'd never consider using it in a single-chip multi-core CPU; spend more transistors on each core to get more of the low-hanging fruit before you start building more cores. e.g. a proper store buffer. Use some flavour of SMT if you want high throughput for multiple low-IPC threads that stall a lot.
But for multi-socket SMP, this could have made sense historically if you want to use multiple of the biggest single-core chip you can build, and that was still not big enough to just have a store buffer instead of this.
I guess it could even make sense to use a really "thin" direct-mapped write-through L1d in front of a private medium-sized write-back set-associative L2 that's still pretty fast. (Maybe call this an L0d cache because it can act like an unordered store buffer. The next-level cache will still see a lot of reads and writes from the low hit-rate of this small direct-mapped cache.)
Normally all caches (including L1d) are part of the same global coherency domain so writing into L1d cache can't happen until you have exclusive ownership. (Which you check for as part of the tag check.) But if this L1d / L0d is not like that, then it's not coherent and is more like a store buffer.
Of course, you need to queue the write-throughs for L2, and eventually stall when it can't keep up, so you're just adding complexity. The write-through to L2 mechanism would also need to deal with waiting for L2 to gain exclusive ownership of the line before writing (MESI Exclusive or Modified state). So this is very much just an unordered store buffer.
The case of writing to a line that hadn't made it to L2 yet is interesting: if it's an L0d write hit you effectively get store merging for free. You'd need per-word or per-byte needs-writeback bits (aka dirty bits) for this. Normally write-through would be sending along the write while the offset within line is still available, but if L2 isn't ready to accept it yet (e.g. because of a write miss) then you can't do that. This is morphing it into a write-combining buffer. Marking the whole line as needing write-back doesn't work because the unwritten parts are still invalid.
But if it's a write miss (same cache line, different tag bits) on a line that still hasn't finished write-back to L2, you have a big problem because you'd be invalidating a line that's still "dirty" (has the only copy of some older store data). You can't detect that before writing; the whole point is to write in parallel with checking tags.
It might be possible to still make this work: if the cache access is a read+write exchange that keeps the previous value in a one-word buffer (or whatever the max write size is), you still have all the data. Stall everything (including writeback of this line so you don't make wrong data globally visible in coherent L2 cache). Then exchange back, wait for the old state of that L0d line to actually write back to that address, then store the tmp buffer into L0d and update the tag and needs-writeback bits to reflect this store. So aliasing between nearby stores becomes extra costly and stalls the pipeline. Or maybe you can let non-memory instructions continue and only stall execution at the next load or store. (If you have the transistor budget to do much of that stall-avoidance, you can probably just use a completely different strategy, like having a store buffer and a normal L1d.)
To be usable (assuming you work around the dirty-store-miss problem), you'd need some way to track relative order of stores (and loads). If that's as simplistic as making sure every entry in the entire L0d has finished its write-through process before allowing another write, then even store-store barriers will be very expensive. The less order-tracking a CPU does, the more expensive barriers have to be (flush more stuff to make sure).

Is it possible to “abort” when loading a register from memory rather the triggering a page fault?

I am thinking about 'Minimizing page faults (and TLB faults) while “walking” a large graph'
'How to know whether a pointer is in physical memory or it will trigger a Page Fault?' is a related question looking at the problem from the other side, but does not have a solution.
I wish to be able to load some data from memory into a register, but have the load abort rather than getting a page fault if the memory is currently paged out. I need the code to work in user space on both Windows and Linux without needing any non-standard permissions.
(Ideally, I would also like to abort on a TLB fault.)
The RTM (Restricted Transactional Memory) part of the TSX-NI feature allows suppressing exceptions:
Any fault or trap in a transactional region that must be exposed to software will be suppressed. Transactional
execution will abort and execution will transition to a non-transactional execution, as if the fault or trap had never
occurred.
[...]
Synchronous exception events (#DE, #OF, #NP, #SS, #GP, #BR, #UD, #AC, #XM, #PF, #NM, #TS, #MF, #DB, #BP/INT3) that occur during transactional execution may cause an execution not to commit transactionally, and
require a non-transactional execution. These events are suppressed as if they had never occurred.
I've never used RTM but it should work something like this:
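xbegin fallback    ; start a transaction; on abort, execution resumes at fallback
; Loads that might fault go here; a fault aborts the transaction instead of being delivered
xend               ; commit the transaction
; Somewhere else
fallback:
; Reached on abort (EAX holds the abort status). Retry non-transactionally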
Note that a transaction can be aborted for many reasons, see chapter 16.8.3.2 of the Intel manual volume 1.
Also note that RTM is not ubiquitous.
Besides RTM I cannot think of another way to suppress a load since it must return a value or eventually signal an abort condition (which would be the same as a #PF).
There's unfortunately no instruction that just queries the TLB or the current page table with the result in a register, on x86 (or any other ISA I know of). Maybe there should be, because it could be implemented very cheaply.
(For querying whether pages of virtual memory are paged out or not, there is the Linux system call mincore(2), which produces a bitmap of present/absent for a range of pages given as void *start / size_t length. That's maybe similar to the HW page tables, so it could probably let you avoid page faults until after you've touched memory, but it's unrelated to TLB or cache. And it maybe doesn't rule out soft page faults, only hard ones. And of course that's only the current situation: pages could be evicted between query and access.)
Would a CPU feature like this be useful? Probably yes, for a few cases
Such a thing would be hard to use in a way that paid off, because every "false" attempt is CPU time / instructions that didn't accomplish any useful work. But a case like this could possibly be a win, when you don't care what order you traverse a tree / graph in, and some nodes might be hot in cache, TLB, or even just RAM while others are cold or even paged out to disk.
When memory is tight, touching a cold page could even evict a currently-hot page before you get to it.
Normal CPUs (like modern x86) can do speculative / out-of-order page walks (to fill TLB entries), and definitely speculative loads into cache, but not page faults. Page faults are handled in software by the kernel. Taking a page-fault can't happen speculatively, and is serializing. (CPUs don't rename the privilege level.)
So software prefetch can cheaply get the hardware to fill TLB and cache while you touch other memory, if the one you're going to touch 2nd was cold. If it was hot and you touch the cold side first, that's unfortunate. If there was a cheap way to check hot/cold, it might be worth using it to always go the right way (at least on the first step) in traversal order when one pointer is hot and the other is cold. Unless a read-only transaction is quite cheap, it's probably not worth actually using Margaret's clever answer.
If you have 2 pointers you will eventually dereference, and one of them points to a page that's been paged out while the other is hot, the best case would be to somehow detect this and get the OS to start paging in one page from disk in the background while you traverse the side that's already in RAM. (e.g. with Windows PrefetchVirtualMemory or Linux madvise(MADV_WILLNEED). See answers on the OP's other question: Minimizing page faults (and TLB faults) while "walking" a large graph.)
This will require a system call, but system calls are expensive and pollute caches + TLBs, especially on current x86 where Spectre + Meltdown mitigation adds thousands of clock cycles. So it's not worth it to make a VM prefetch system call for one of every pair of pointers in a tree. You'd get a massive slowdown for cases when all the pointers were in RAM.
CPU design possibilities
Like I said, I don't think any current ISAs have this, but it would I think be easy to support in hardware with instructions that run kind of like load instructions, but produce a result based on the TLB lookup instead of fetching data from L1d cache.
There are a couple possibilities that come to mind:
a queryTLB m8 instruction that writes flags (e.g. CF=1 for present) according to whether the memory operand is currently hot in TLB (including 2nd-level TLB), never doing a page walk. And a querypage m8 that will do a page walk on a TLB miss, and sets flags according to whether there's a page table entry. Putting the result in a r32 integer reg you could test/jcc on would also be an option.
a try_load r32, r/m32 instruction that does a normal load if possible, but sets flags instead of taking a page fault if a page walk finds no valid entry for the virtual address. (e.g. CF=1 for valid, CF=0 for abort with integer result = 0, like rdrand. It could make itself useful and set other flags (SF/ZF/PF) according to the value, if there is one.)
The query idea would only be useful for performance, not correctness, because there'd always be a gap between querying and using during which the page could be unmapped. (Like the IsBadXxxPtr Windows system call, except that that probably checks the logical memory map, not the hardware page tables.)
A try_load insn that also sets/clears flags instead of raising #PF could avoid the race condition. You could have different versions of it, or it could take an immediate to choose the abort condition (e.g. TLB miss without attempting a page walk).
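The difference between the query-then-load pattern and a combined try_load can be sketched in software. Everything here is hypothetical: `ToyAddressSpace` is a stand-in for a page table, and `query_page` / `try_load` only mimic the proposed semantics (flag plus zero result on abort); no real ISA or OS API is being modelled.

```python
class PageFault(Exception):
    """Stands in for a #PF that the kernel would have to handle."""


class ToyAddressSpace:
    PAGE = 4096

    def __init__(self):
        self.resident = {}  # page number -> {address: value}

    def map_page(self, addr: int, values=None) -> None:
        self.resident[addr // self.PAGE] = dict(values or {})

    def evict_page(self, addr: int) -> None:
        self.resident.pop(addr // self.PAGE, None)

    def query_page(self, addr: int) -> bool:
        """Hypothetical querypage: reports presence only, result can go stale."""
        return addr // self.PAGE in self.resident

    def load(self, addr: int) -> int:
        """Normal load: faults if the page is not resident."""
        page = self.resident.get(addr // self.PAGE)
        if page is None:
            raise PageFault(hex(addr))
        return page.get(addr, 0)

    def try_load(self, addr: int):
        """Hypothetical try_load: (True, value) on success, (False, 0) on abort."""
        page = self.resident.get(addr // self.PAGE)
        if page is None:
            return False, 0
        return True, page.get(addr, 0)


if __name__ == "__main__":
    mem = ToyAddressSpace()
    mem.map_page(0x1000, {0x1008: 42})
    present = mem.query_page(0x1008)   # query says "present"...
    mem.evict_page(0x1008)             # ...but the page is evicted before the access
    print("stale query result:", present)
    print("try_load result:", mem.try_load(0x1008))
```

Because the check and the access are a single operation in `try_load`, there is no window in which eviction can invalidate the answer, which is the race the query-style instructions cannot close.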
These instructions could easily decode to a load uop, probably just one. The load ports on modern x86 already support normal loads, software prefetch, broadcast loads, zero or sign-extending loads (movsx r32, m8 is a single uop for a load port on Intel), and even vmovddup ymm, m256 (two in-lane broadcasts) for some reason, so adding another kind of load uop doesn't seem like a problem.
Loads that hit a TLB entry they don't have permission for (kernel-only mapping) do currently behave specially on some x86 uarches (the ones that aren't vulnerable to Meltdown). See The Microarchitecture Behind Meltdown on Henry Wong's blog (stuffedcow.net). According to his testing, some CPUs produce a zero for speculative execution of later instructions after a TLB/page miss (entry not present). So we already know that doing something with a TLB hit/miss result should be able to affect the integer result of a load. (Of course, a TLB miss is different from a hit on a privileged entry.)
Setting flags from a load is not something that ever normally happens on x86 (only from micro-fused load+alu), so maybe it would be implemented with an ALU uop as well, if Intel ever did implement this idea.
Aborting on a condition other than TLB/page miss or L1d miss would require outer levels of cache to also support this special request, though. A try_load that runs if it hits L3 cache but aborts on L3 miss would need support from the L3 cache. I think we could do without that, though.
The low-hanging fruit for this CPU-architecture idea is reducing page faults and maybe page walks, which are significantly more expensive than L3 cache misses.
I suspect that trying to branch on L3 cache misses would cost you too much in branch misses for it to really be worth it vs. just letting out-of-order exec do its thing. Especially if you have hyperthreading so this latency-bound process can happen on one logical core of a CPU that's also doing something else.

Exclusive access to L1 cacheline on x86?

If one has a 64 byte buffer that is heavily read/written to then it's likely that it'll be kept in L1; but is there any way to force that behaviour?
As in, give one core exclusive access to those 64 bytes and tell it not to sync the data with other cores nor the memory controller so that those 64 bytes always live in one core's L1 regardless of whether or not the CPU thinks it's used often enough.
No, x86 doesn't let you do this. You can force eviction with clflushopt, or (on upcoming CPUs) force just write-back without eviction with clwb, but you can't pin a line in cache or disable coherency.
You can put the whole CPU (or a single core?) into cache-as-RAM (aka no-fill) mode to disable sync with the memory controller, and disable ever writing back the data. Cache-as-Ram (no fill mode) Executable Code. It's typically used by BIOS / firmware in early boot before configuring the memory controllers. It's not available on a per-line basis, and is almost certainly not practically useful here. Fun fact: leaving this mode is one of the use-cases for invd, which drops cached data without writeback, as opposed to wbinvd.
I'm not sure if no-fill mode prevents eviction from L1d to L3 or whatever; or if data is just dropped on eviction. So you'd just have to avoid accessing more than 7 other cache lines that alias the one you care about in your L1d, or the equivalent for L2/L3.
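The "7 other lines that alias" figure follows from set-associative indexing, which can be sketched as below. The defaults (32 KiB, 8-way, 64-byte lines, plain modulo indexing) are assumptions that match many Intel L1d caches but not all CPUs, and outer caches may use physical addresses and hashed index functions instead.

```python
def l1d_set_index(addr: int, cache_bytes: int = 32 * 1024,
                  ways: int = 8, line_bytes: int = 64) -> int:
    """Set index of an address under simple modulo indexing (assumed geometry)."""
    num_sets = cache_bytes // (ways * line_bytes)
    return (addr // line_bytes) % num_sets


def aliasing_stride(cache_bytes: int = 32 * 1024, ways: int = 8) -> int:
    """Distance in bytes between addresses that land in the same set."""
    return cache_bytes // ways


if __name__ == "__main__":
    base = 0x10000
    stride = aliasing_stride()
    # Lines at base + k*stride all compete for the same set as the line at base.
    rivals = [base + k * stride for k in range(1, 8)]
    print("stride:", stride)
    print("all alias:", all(l1d_set_index(r) == l1d_set_index(base) for r in rivals))
```

Under these assumptions, a set holds 8 lines, so touching more than 7 other lines at that stride from the one you care about is what would force it out.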
Being able to force one core to hang on to a line of L1d indefinitely and not respond to MESI requests to write it back / share it would make the other cores vulnerable to lockups if they ever touched that line. So obviously if such a feature existed, it would require kernel mode. (And with HW virtualization, require hypervisor privilege.) It could also block hardware DMA (because modern x86 has cache-coherent DMA).
So supporting such a feature would require lots of parts of the CPU to handle indefinite delays, where currently there's probably some upper bound, which may be shorter than a PCIe timeout, if there is such a thing. (I don't write drivers or build real hardware, just guessing about this).
As @fuz points out, a coherency-violating instruction (xdcbt) was tried on PowerPC (in the Xbox 360 CPU), with disastrous results from mis-speculated execution of the instruction. So it's hard to implement.
You normally don't need this.
If the line is frequently used, LRU replacement will keep it hot. And if it's lost from L1d at frequent enough intervals, then it will probably stay hot in L2 which is also on-core and private, and very fast, in recent designs (Intel since Nehalem). Intel's inclusive L3 on CPUs other than Skylake-AVX512 means that staying in L1d also means staying in L3.
All this means that full cache misses all the way to DRAM are very unlikely with any kind of frequency for a line that's heavily used by one core. So throughput shouldn't be a problem. I guess you could maybe want this for realtime latency, where the worst-case run time for one call of a function mattered. Dummy reads from the cache line in some other part of the code could be helpful in keeping it hot.
However, if pressure from other cores in L3 cache causes eviction of this line from L3, Intel CPUs with an inclusive L3 also have to force eviction from inner caches that still have it hot. IDK if there's any mechanism to let L3 know that a line is heavily used in a core's L1d, because that doesn't generate any L3 traffic.
I'm not aware of this being much of a problem in real code. L3 is highly associative (like 16 or 24 way), so it takes a lot of conflicts before you'd get an eviction. L3 also uses a more complex indexing function (like a real hash function, not just modulo by taking a contiguous range of bits). In IvyBridge and later, it also uses an adaptive replacement policy to mitigate eviction from touching a lot of data that won't be reused often. http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.
See also Which cache mapping technique is used in intel core i7 processor?
@AlexisWilke points out that you could maybe use vector register(s) instead of a line of cache, for some use-cases (see Using ymm registers as a "memory-like" storage location). You could globally dedicate some vector regs to this purpose. To get this in gcc-generated code, maybe use -ffixed-ymm8, or declare it as a volatile global register variable (see How to inform GCC to not use a particular register).
Using ALU instructions or store-forwarding to get data to/from the vector reg will give you guaranteed latency with no possibility of data-cache misses. But code-cache misses are still a problem for extremely low latency.
There is no direct way to achieve that on Intel and AMD x86 processors, but you can get pretty close with some effort. First, you said you're worried that the cache line might get evicted from the L1 because some other core might access it. This can only happen in the following situations:
The line is shared, and therefore, it can be accessed by multiple agents in the system concurrently. If another agent attempts to read the line, its state will change from Modified or Exclusive to Shared. That is, it will stay in the L1. If, on the other hand, another agent attempts to write to the line, it has to be invalidated from the L1.
The line can be private or shared, but the thread got rescheduled by the OS to run on another core. Similar to the previous case, if it attempts to read the line, its state will change from Modified or Exclusive to Shared in both L1 caches. If it attempts to write to the line, it has to be invalidated from the L1 of the previous core on which it was running.
There are other reasons why the line may get evicted from the L1 as I will discuss shortly.
If the line is shared, then you cannot disable coherency. What you can do, however, is make a private copy of it, which effectively does disable coherency. If doing that may lead to faulty behavior, then the only thing you can do is to set the affinity of all threads that share the line to run on the same physical core on a hyperthreaded (SMT) Intel processor. Since the L1 is shared between the logical cores, the line will not get evicted due to sharing, but it can still get evicted due to other reasons.
Setting the affinity of a thread does not guarantee though that other threads cannot get scheduled to run on the same core. To reduce the probability of scheduling other threads (that don't access the line) on the same core or rescheduling the thread to run on other physical cores, you can increase the priority of the thread (or all the threads that share the line).
Intel processors are mostly 2-way hyperthreaded, so you can only run two threads that share the line at a time. So if you play with the affinity and priority of the threads, performance can change in interesting ways. You'll have to measure it. Recent AMD processors also support SMT.
If the line is private (only one thread can access it), a thread running on a sibling logical core in an Intel processor may cause the line to be evicted because the L1 is competitively shared, depending on its memory access behavior. I will discuss how this can be dealt with shortly.
Another issue is interrupts and exceptions. On Linux and maybe other OSes, you can configure which cores should handle which interrupts. I think it's OK to map all interrupts to all other cores, except the periodic timer interrupt whose interrupt handler's behavior is OS-dependent and it may not be safe to play with it. Depending on how much effort you want to spend on this, you can perform carefully designed experiments to determine the impact of the timer interrupt handler on the L1D cache contents. Also you should avoid exceptions.
I can think of two reasons why a line might get invalidated:
A (potentially speculative) RFO with intent for modification from another core.
The line was chosen to be evicted to make space for another line. This depends on the design of the cache hierarchy:
The L1 cache placement policy.
The L1 cache replacement policy.
Whether lower level caches are inclusive or not.
The replacement policy is commonly not configurable, so you should strive to avoid conflict misses in the L1, which depend on the placement policy, which in turn depends on the microarchitecture. On Intel processors, the L1D is typically effectively both virtually indexed and physically indexed, because the bits used for the index don't require translation. Since you know the virtual addresses of all memory accesses, you can determine which lines would be allocated in which cache set. You need to make sure that the number of lines mapped to the same set (including the line you don't want evicted) does not exceed the associativity of the cache. Otherwise, you'd be at the mercy of the replacement policy. Note also that an L1D prefetcher can change the contents of the cache. You can disable it on Intel processors and measure its impact in both cases. I cannot think of an easy way to deal with inclusive lower level caches.
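As a rough sketch of the set-counting check described above: the geometry below (32 KiB, 8-way, 64-byte lines, hence 64 sets) is an assumption that matches many Intel L1D caches but should be checked for your CPU, and the function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed L1D geometry (check your CPU): 32 KiB, 8-way, 64-byte lines. */
enum { LINE_BYTES = 64, NUM_SETS = 64, NUM_WAYS = 8 };

/* Set index = address bits [11:6]. These bits lie inside the 4 KiB page
   offset, so the virtual and physical address give the same index. */
static unsigned l1d_set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Count the distinct cache lines among addrs[] that map to the same set
   as `pinned`, including the pinned line itself. */
static unsigned lines_in_same_set(uintptr_t pinned,
                                  const uintptr_t *addrs, size_t n)
{
    unsigned set = l1d_set_index(pinned);
    unsigned count = 1;                      /* the pinned line itself */

    for (size_t i = 0; i < n; i++) {
        uintptr_t line = addrs[i] / LINE_BYTES;
        if (l1d_set_index(addrs[i]) != set || line == pinned / LINE_BYTES)
            continue;                        /* other set, or the pinned line */

        int seen = 0;                        /* skip lines already counted */
        for (size_t j = 0; j < i; j++) {
            if (addrs[j] / LINE_BYTES == line) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}

/* Nonzero if the working set cannot force a conflict eviction of `pinned`
   under the assumed geometry. */
static int fits_in_set(uintptr_t pinned, const uintptr_t *addrs, size_t n)
{
    return lines_in_same_set(pinned, addrs, n) <= NUM_WAYS;
}
```

If fits_in_set() returns 0 for the addresses your hot loop touches, the pinned line's survival depends on the replacement policy. This only models conflict misses; it says nothing about prefetchers, the sibling hyperthread, or evictions forced by an inclusive outer cache.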
I think the idea of "pinning" a line in the cache is interesting and can be useful. It's a hybrid between caches and scratch pad memories. The line would be like a temporary register mapped to the virtual address space.
The main issue here is that you want to both read from and write to the line, while still keeping it in the cache. This sort of behavior is currently not supported.

How can an unlock/lock operation on a mutex be faster than a fetch from memory?

Norvig claims that a mutex lock or unlock operation takes only a quarter of the time needed to do a fetch from memory.
This answer explains that a mutex is essentially a flag and a wait queue, and that it would only take a few instructions to flip the flag on an uncontended mutex.
I assume that if a different CPU or core tries to lock that mutex, it needs to wait for the cache line to be written back to memory (if that didn't already happen) and for its own memory read to get the state of the flag. Is that correct? What is the difference if it is a different core compared to a different CPU?
So the numbers Norvig states are only for an uncontended mutex where the CPU or core trying the operation already has that flag in its cache and the cache line isn't dirty?
A typical PC runs an x86 CPU, and Intel's CPUs can perform the locking entirely in the caches:
if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow it’s cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.”
The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.
From Intel Software Developer Manual 3, Section 8.1.4
The cache coherence mechanism is a variation of the MESI protocol.
In such a protocol, before a CPU can write to a cached location, it must hold the corresponding line in the Exclusive (E) or Modified (M) state.
This means that only one CPU at a time has a given memory location in a dirty state.
When other CPUs want to read the same location, the owner CPU will delay such reads until the atomic operation is finished.
It then follows the coherence protocol to either forward, invalidate or write-back the line.
In the above scenario a lock can be performed faster than an uncached load.
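To make the "flag" part concrete, here is a minimal sketch of an uncontended lock and unlock using C11 atomics. The toy_lock type and its functions are hypothetical names for illustration; a real mutex also has a wait queue (and a system call path) for the contended case, which is omitted here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock: just the flag. 0 = free, 1 = held. */
typedef struct {
    atomic_int locked;
} toy_lock;

static void toy_lock_init(toy_lock *l)
{
    atomic_init(&l->locked, 0);
}

/* One atomic read-modify-write. On x86 this is typically a LOCK CMPXCHG,
   which can complete inside the core's own cache when the line is already
   held there in Exclusive or Modified state ("cache locking"). */
static bool toy_trylock(toy_lock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* A release store is enough to hand the lock back. */
static void toy_unlock(toy_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

When the line holding the flag is already owned by the locking core, neither operation needs to go to DRAM, which is why the uncontended case can beat an uncached load.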
Those times, however, are a bit off and surely outdated.
They are intended to give a ranking of typical operations, along with an order of magnitude for each.
The timing for an L1 hit is a bit odd: it isn't faster than typical instruction execution (which by itself cannot be described with a single number).
The Intel optimization manual reports, for an old CPU like Sandy Bridge, an L1 access time of 4 cycles, while there are a lot of instructions with a latency of 4 cycles or less.
I would take those numbers with a grain of salt, avoiding reasoning too much on them.
The lesson Norvig tried to teach us is: hardware is layered, and the closer something is to the CPU (from a topological point of view [1]), the faster it is.
So when parsing a file, a programmer should avoid moving data back and forth to the file and should instead minimize the IO pressure.
The same applies when processing an array: locality will improve performance.
Note, however, that these are technically micro-optimisations, and the topic is not as simple as it appears.
[1] In general, divide the hardware into what is: inside the core (registers), inside the CPU (caches, possibly not the LLC), inside the socket (GPU, LLC), behind dedicated bus devices (memory, other CPUs), behind one generic bus (PCIe - internal devices like network cards), behind two or more buses (USB devices, disks) and in another computer entirely (servers).

Out-of-order instruction execution: is commit order preserved?

On the one hand, Wikipedia writes about the steps of the out-of-order execution:
Instruction fetch.
Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
The instruction is issued to the appropriate functional unit and executed by that unit.
The results are queued.
Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.
Similar information can be found in the "Computer Organization and Design" book:
To make programs behave as if they were running on a simple in-order
pipeline, the instruction fetch and decode unit is required to issue
instructions in order, which allows dependences to be tracked, and the
commit unit is required to write results to registers and memory in
program fetch order. This conservative mode is called in-order
commit... Today, all dynamically scheduled pipelines use in-order commit.
So, as far as I understand, even if instructions are executed out of order, their results are held in the reorder buffer and then committed to memory/registers in a deterministic order.
On the other hand, it is well known that modern CPUs can reorder memory operations for performance (for example, two adjacent independent load instructions can be reordered). Wikipedia writes about it here.
Could you please shed some light on this discrepancy?
TL:DR: memory ordering is not the same thing as out of order execution. It happens even on in-order pipelined CPUs.
In-order commit is necessary (see footnote 1) for precise exceptions that can roll back to exactly the instruction that faulted, without any instructions after that having already retired. The cardinal rule of out-of-order execution is don't break single-threaded code. If you allowed out-of-order commit (retirement) without any kind of other mechanism, you could have a page-fault happen while some later instructions had already executed once, and/or some earlier instructions hadn't executed yet. This would make restarting execution after handling a page-fault impossible the normal way.
(In-order issue/rename and dependency-tracking takes care of correct execution in the normal case of no exceptions.)
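The split between out-of-order completion and in-order retirement can be sketched with a toy reorder buffer. This is a simplified model with made-up names (toy_rob and friends), not how any particular CPU implements it: entries are allocated in program order, may be marked complete in any order, and only leave from the head.

```c
#include <stdbool.h>
#include <stddef.h>

enum { ROB_SIZE = 8 };

/* Toy reorder buffer: slots [head, tail) hold in-flight instructions in
   program order. No wrap-around, to keep the sketch short. */
typedef struct {
    bool   done[ROB_SIZE];  /* has this instruction finished executing? */
    size_t head;            /* oldest instruction not yet retired */
    size_t tail;            /* next free slot */
} toy_rob;

static void rob_init(toy_rob *r)
{
    r->head = 0;
    r->tail = 0;
    for (size_t i = 0; i < ROB_SIZE; i++)
        r->done[i] = false;
}

/* Allocate in program order (issue/rename). Returns the slot, or -1 if full. */
static int rob_alloc(toy_rob *r)
{
    if (r->tail == ROB_SIZE)
        return -1;
    r->done[r->tail] = false;
    return (int)r->tail++;
}

/* Execution units may complete instructions in any order. */
static void rob_complete(toy_rob *r, int slot)
{
    r->done[slot] = true;
}

/* Retire strictly from the head: a finished younger instruction waits
   behind an unfinished older one. Returns how many retired. */
static size_t rob_retire(toy_rob *r)
{
    size_t retired = 0;
    while (r->head < r->tail && r->done[r->head]) {
        r->head++;
        retired++;
    }
    return retired;
}
```

If three instructions are allocated and the two younger ones complete first, rob_retire() retires nothing until the oldest completes, at which point all three retire together. A fault on the oldest instruction would therefore still find no younger instruction retired.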
Memory ordering is all about what other cores see. Also notice that what you quoted is only talking about committing results to the register file, not to memory.
(Footnote 1: Kilo-instruction Processors: Overcoming the Memory Wall is a theoretical paper about checkpointing state to allow rollback to a consistent machine state at some point before an exception, allowing much larger out-of-order windows without a gigantic ROB of that size. AFAIK, no mainstream commercial designs have used that, but it shows that there are in theory approaches other than strictly in-order retirement to building a usable CPU.
Apple's M1 reportedly has a significantly larger out-of-order window than its x86 contemporaries, but I haven't seen any definite info that it uses anything other than a very large ROB.)
Since each core's private L1 cache is coherent with all the other data caches in the system, memory ordering is a question of when instructions read or write cache. This is separate from when they retire from the out-of-order core.
Loads become globally visible when they read their data from cache. This is more or less when they "execute", and definitely way before they retire (aka commit).
Stores become globally visible when their data is committed to cache. This has to wait until they're known to be non-speculative, i.e. that no exceptions or interrupts will cause a roll-back that has to "undo" the store. So a store can commit to L1 cache as early as when it retires from the out-of-order core.
But even in-order CPUs use a store queue or store buffer to hide the latency of stores that miss in L1 cache. The out-of-order machinery doesn't need to keep tracking a store once it's known that it will definitely happen, so a store insn/uop can retire even before it commits to L1 cache. The store buffer holds onto it until L1 cache is ready to accept it. i.e. when it owns the cache line (Exclusive or Modified state of the MESI cache coherency protocol), and the memory-ordering rules allow the store to become globally visible now.
See also my answer on Write Allocate / Fetch on Write Cache Policy
As I understand it, a store's data is added to the store queue when it "executes" in the out-of-order core, and that's what a store execution unit does. (Store-address writing the address, and store-data writing the data into the store-buffer entry reserved for it at allocation/rename time, so either of those parts can execute first on CPUs where those parts are scheduled separately, e.g. Intel.)
Loads have to probe the store queue so that they see recently-stored data.
For an ISA like x86, with strong ordering, the store queue has to preserve the memory-ordering semantics of the ISA, i.e. stores can't reorder with other stores, and stores can't become globally visible before earlier loads. (Only StoreLoad reordering is allowed; LoadStore, StoreStore and LoadLoad reordering are not.)
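The store-buffer behaviour described above can be modelled in a few lines. This is a toy single-core sketch with hypothetical names; the cache array stands in for coherent L1d (what other cores could observe), and the FIFO commit order mimics x86's no-StoreStore-reordering rule.

```c
#include <stddef.h>
#include <stdint.h>

enum { MEM_WORDS = 16, SB_SIZE = 4 };

typedef struct {
    uint32_t cache[MEM_WORDS];   /* stands in for coherent L1d: globally visible */
    struct {
        size_t   addr;
        uint32_t data;
    } sb[SB_SIZE];               /* store buffer, oldest entry at index 0 */
    size_t sb_count;
} toy_core;

static void core_init(toy_core *c)
{
    *c = (toy_core){0};
}

/* A store "executes" by writing into the store buffer, not the cache.
   Returns 0 on success, -1 if the buffer is full or addr is out of range. */
static int core_store(toy_core *c, size_t addr, uint32_t data)
{
    if (addr >= MEM_WORDS || c->sb_count == SB_SIZE)
        return -1;
    c->sb[c->sb_count].addr = addr;
    c->sb[c->sb_count].data = data;
    c->sb_count++;
    return 0;
}

/* A load probes the store buffer youngest-first (store forwarding) so the
   core sees its own recent stores; otherwise it reads the cache.
   The caller must pass addr < MEM_WORDS. */
static uint32_t core_load(const toy_core *c, size_t addr)
{
    for (size_t i = c->sb_count; i-- > 0;) {
        if (c->sb[i].addr == addr)
            return c->sb[i].data;
    }
    return c->cache[addr];
}

/* Commit the oldest store to cache: only now is it globally visible.
   Draining strictly in FIFO order keeps stores in program order. */
static void core_commit_oldest(toy_core *c)
{
    if (c->sb_count == 0)
        return;
    c->cache[c->sb[0].addr] = c->sb[0].data;
    for (size_t i = 1; i < c->sb_count; i++)
        c->sb[i - 1] = c->sb[i];
    c->sb_count--;
}
```

Between core_store() and core_commit_oldest(), the storing core's own loads already see the new value while the cache does not. That window is what lets a later load become globally visible before an earlier store, i.e. StoreLoad reordering.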
David Kanter's article on how TSX (transactional memory) could be implemented in different ways than what Haswell does provides some insight into the Memory Order Buffer, and how it's a separate structure from the ReOrder Buffer (ROB) that tracks instruction/uop reordering. He starts by describing how things currently work, before getting into how it could be modified to track a transaction that can commit or abort as a group.

Resources