Prefetch instruction behavior

Prefetch instruction behavior - caching

In order to satisfy some security property, I want to make sure that an important piece of data is already in the cache when a statement accesses it (so there will be no cache miss). For example, for this code
...
a += 2;
...
I want to make sure that a is in the cache right before a += 2 is executed.
I was considering using the PREFETCHh instruction of x86 to achieve this:
...
__prefetch(&a); /* pseudocode */
a += 2;
...
However, I have read that inserting the prefetch instruction right before a += 2 might be too late to ensure a is in the cache when a += 2 gets executed. Is this claim true? If it is true, can I fix it by inserting a CPUID instruction after the prefetch to ensure the prefetch instruction has been executed (because the Intel manual says PREFETCHh is ordered with respect to CPUID)?

Yes, you need to prefetch with a lead-time of about the memory latency for it to be optimal. Ulrich Drepper's What Every Programmer Should Know About Memory talks a lot about prefetching.
Making this happen will be highly non-trivial for a single access. Too soon and your data might be evicted before the insn you care about. Too late and it will only partly hide the miss latency. Tuning this will depend on compiler version/options, and on the hardware you're running on. (Higher instructions-per-cycle means you need to prefetch earlier. Higher memory latency also means you need to prefetch earlier.)
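To make the lead-time idea concrete, here is a minimal sketch of prefetching a fixed number of iterations ahead in a gather loop. It assumes GCC or Clang (for the __builtin_prefetch builtin); the function name gather_sum and the distance parameter are made up for illustration, and the right distance has to be found by measurement on the target machine.

```cpp
#include <cassert>
#include <cstddef>

// Sums table[idx[i]] while hinting the element needed `distance` iterations
// later. __builtin_prefetch is a GCC/Clang builtin; its arguments are
// (address, 0 = read / 1 = write, temporal locality 0..3).
long gather_sum(const int* table, const std::size_t* idx, std::size_t n,
                std::size_t distance) {
    assert(table != nullptr && idx != nullptr);
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            // Only a hint: the CPU may drop it, and it never faults.
            __builtin_prefetch(&table[idx[i + distance]], 0, 3);
        }
        total += table[idx[i]];
    }
    return total;
}
```

A larger distance gives the hint more time to complete but raises the chance that the line is evicted again before it is used.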
Since you want to do a read-modify-write to a, you should use PREFETCHW if available. The other prefetch instructions only prefetch for reading, so the read part of the RMW could hit, but I think the store part could be delayed by MESI cache coherency getting write-ownership of the cache line.
If a isn't atomic, you can also just load a well ahead of time and use the copy in a register. The store back to the global could easily miss in this case, which could eventually stall execution, though.
You'll probably have a hard time doing something like that reliably with a compiler, instead of writing asm yourself. Any of the other ideas will also require checking the compiler output to make sure the compiler did what you're hoping.
Prefetch instructions don't necessarily prefetch anything. They're "hints", which presumably get ignored when the number of outstanding loads is near max (i.e. almost out of load buffers).
Another option is to load it (not just prefetch) and then serialize with a CPUID. (A load that throws away the result is like a prefetch). The load would have to complete before the serializing instruction, and instructions after the serializing insn can't start decoding until then. I think a prefetch can retire before the data arrives, which is normally an advantage, but not in this case where we care about one operation hitting at the expense of overall performance.
From Intel's insn ref manual (see the x86 tag wiki) entry for CPUID:
Serializing instruction execution guarantees that any modifications to flags, registers, and memory for previous instructions are completed before the next instruction is fetched and executed.
I think a sequence like this is fairly good (but still doesn't guarantee anything in a pre-emptive multi-tasking system):
add dword [mem], 0   ; can't retire until the store completes, requiring that our core owns the cache line for writing
cpuid                ; later insns can't start until the prev add retires (note: clobbers eax, ebx, ecx, edx)
add dword [mem], 2   ; a += 2. Can't miss in cache unless an interrupt or the other hyper-thread evicts the cache line before this insn can execute
Here we're using add [mem], 0 as a write-prefetch which is otherwise a near no-op. (It is a non-atomic read-modify-rewrite). I'm not sure if PREFETCHW really will ensure the cache line is ready if you do PREFETCHW / CPUID / add [mem], 2. The insn is ordered wrt. CPUID, but the manual doesn't say that the prefetch effect is ordered.
If a is volatile, then (void)a; will get gcc or clang to emit a load insn. I assume most other compilers (MSVC?) are the same. You can probably do (void) *(volatile something*)&a to dereference a pointer to volatile and force a load from a's address.
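As a rough illustration of the compiler-level options mentioned above, the sketch below combines a write-intent prefetch hint with a forced load through a pointer-to-volatile. It assumes GCC or Clang; the helper names (prefetch_for_write, force_load, warm_then_add) are hypothetical, and none of this guarantees a cache hit. You would still need to check the generated asm.

```cpp
#include <cassert>

// Write-intent prefetch hint (GCC/Clang builtin; second argument 1 = write).
// Whether this becomes PREFETCHW depends on the target options; it is only
// a hint either way.
inline void prefetch_for_write(const void* p) {
    __builtin_prefetch(p, 1, 3);
}

// A real load whose result is discarded: dereferencing a pointer-to-volatile
// makes the compiler emit the access even though the value is unused.
inline void force_load(const int* p) {
    (void)*static_cast<const volatile int*>(p);
}

// Warm the line, then do the read-modify-write we care about.
inline int warm_then_add(int* a, int delta) {
    assert(a != nullptr);
    prefetch_for_write(a);
    force_load(a);
    *a += delta;
    return *a;
}
```

There is no serializing instruction here, so the CPU is still free to overlap the warm-up load with the add.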
To guarantee that a memory access will hit in cache, you'd need to be running at realtime priority pinned to a core that doesn't receive interrupts. Depending on the OS, the timer-interrupt handler is probably lightweight enough that the chance of evicting your data from cache is low enough.
If your process is descheduled between executing a prefetch insn and doing the real access, the data will probably have been evicted from at least L1 cache.
So it's unlikely you can defeat an attacker determined to do a timing attack on your code, unless it's realistic to run at realtime priority. An attacker could run many many threads of memory-intensive code...

Related

Is cache coherency only an issue when storing and not when loading?

I came across this code emission for x64 where "Atomic Load" is using a simple movq whereas "Atomic Store" is using xchgq.
This link explains that Atomic Load/Stores on aligned addresses are atomic by default. I'm assuming that's why Atomic Load in the above link is using a simple movq.
I have the following questions:
Is Atomic Store using an xchgq (which implies LOCK) to fix any issues with cache lines? Essentially, is it making sure all cache lines are updated properly? If cache lines weren't an issue, could they have just used movq?
Does it also mean cache coherency is only an issue when storing, since Load above is not using a locked instruction?

No, seq_cst stores use xchg (or mov + mfence but that's slower on recent CPUs) for ordering wrt. other operations. release or relaxed atomic stores can just use mov and will still be promptly visible to other cores. (Not before later loads in this thread might have executed, though.)
Cache coherence isn't the cause of memory-reordering, that's local to each core. (For x86, the memory model is program order + a store buffer with store-forwarding. It's the store buffer that causes stores to not become visible until after the store instruction has retired from out-of-order exec.)
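For reference, the three kinds of access being compared look like this in C++. This is a small sketch with made-up function names; the instruction named in each comment is what mainstream x86-64 compilers typically emit, not something the language guarantees.

```cpp
#include <atomic>
#include <cassert>

// Typically xchg (or mov + mfence) on x86-64: a full barrier after the store.
void store_seq_cst(std::atomic<int>& x, int v) {
    x.store(v, std::memory_order_seq_cst);
}

// Typically a plain mov on x86-64: later loads may run before it commits.
void store_release(std::atomic<int>& x, int v) {
    x.store(v, std::memory_order_release);
}

// Typically a plain mov on x86-64.
int load_acquire(const std::atomic<int>& x) {
    return x.load(std::memory_order_acquire);
}

// Single-threaded usage: both kinds of store become visible; they differ
// only in how they are ordered against this thread's later memory operations.
int publish_and_read(std::atomic<int>& x) {
    assert(x.is_lock_free());
    store_release(x, 1);
    store_seq_cst(x, 2);
    return load_acquire(x);
}
```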
The answer you linked is somewhat misleading where it says "if I set this to true (or false), no other thread will read a different value after I've set it" (that's not quite such a certainty - you need a "lock" prefix to guarantee that). They mean that (implicit-lock) xchg includes a full memory barrier, so no code in the storing thread can access memory until after the store is actually committed to cache and thus globally visible.
A clearer way to state that is that it makes this thread wait without doing anything until the store is visible. i.e. stall this thread until the store buffer has finished committing all previous stores. That would eventually happen on its own. So it's really about ordering of this thread relative to store visibility, not other threads. Other threads (cores) can locally do their own early loading / late storing, although on x86 all loads happen in program order. That's why I commented on that answer you linked to disagree with the way it was presenting things.
Can a speculatively executed CPU branch contain opcodes that access RAM? (What a store buffer does)
C++ How is release-and-acquire achieved on x86 only using MOV? discusses cache-coherency and limits on local reordering being enough to give release/acquire synchronization.
Why does a std::atomic store with sequential consistency use XCHG?
Acquire-release on x86
Which is a better write barrier on x86: lock+addl or xchgl? - shows in more detail why we need xchg or a separate memory barrier for a seq_cst store.
https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ (It's talking about Java volatile, which is like C++ std::atomic with memory_order_seq_cst.)
How does memory reordering help processors and compilers?
Does atomic read guarantees reading of the latest value? - people often get hung up on "latest value" guarantees. Don't. Acquire/release just works, and stronger orders or memory barriers don't make stores visible to other cores sooner in any significant way.

Cache Line update consistency for atomically updating a whole cache line?

I have the following scenario and am looking for suggestions:
I need to share data between two threads, A and B, each running on a different core of the same processor, where thread A writes to an instance of data structure S and thread B reads it. I need the sharing of S to be as consistent and as fast as possible.
struct alignas(64) S
{
char cacheline [64];
};
Planning to leverage the consistency of a cache line, being visible to other cores as an atomic update. Therefore have thread A write to S as fast as possible (*1) so the update is consistent (atomic from a visibility perspective) and then demote (CLDEMOTE instruction) the cache line to the shared cache so that thread B can read it as fast as possible.
Note 1: The reason it needs to happen fast is so that when the core running thread A starts writing to the cache line, it can update all of its contents (the updates happen in the core's store buffer) before the line becomes visible in L1. Otherwise, if the update takes too long, a "mid-state" of the cache line may be pushed to L1, incurring unnecessary invalidation (MESI) penalties (as the rest has to be done again) and, worse, an inconsistent state visible to thread B.
Are there better ways to achieve this?
Thanks!

Yes, store then cldemote is a good plan. It runs as a nop on CPUs that don't support it, so you can use it optimistically. (Test that it actually helps your program on CPUs where it's not a nop, though, in case you accidentally demote before reading the line some more.)
Do you actually need atomicity, or is that just nice to have some of the time? If you need atomicity, you can't use separate store instructions. Coalescing in the store buffer isn't guaranteed; for L1d hits it may only sometimes happen on Ice Lake. And an interrupt can happen at any point (unless interrupts are disabled, but SMI and NMI can't be disabled), including between two stores you were hoping would commit together.
32-byte AVX aligned loads and stores aren't guaranteed atomic, but in practice they probably are on Haswell and later (where the load/store units are 32 bytes wide).
Similarly, 64-byte AVX-512 loads and stores aren't guaranteed atomic, and very likely won't be in practice on Zen4 where they're done in two 32-byte halves. But they probably are on Intel CPUs with AVX-512, if you want to do some testing and find some "works in practice" functionality that doesn't show any tearing on the actual machine you care about.
16-byte loads/stores are guaranteed atomic on Intel CPUs that have the AVX feature flag. (Finally documented after being true for years, fortunately retroactive with an existing feature bit.) AMD doesn't document this yet, but it's probably true of AMD CPUs with AVX, too.
Related: https://rigtorp.se/isatomic/ / SSE instructions: which CPUs can do atomic 16B memory operations?
movdir64b will provide guaranteed 64-byte write atomicity, but only with NT semantics: evicting the cache line all the way to DRAM. It also doesn't provide 64-byte atomic read, so the read side would need to check sequence numbers or something, like a SeqLock.
Intel TSX (transactional memory) can let you commit changes to a whole cache line (or more) as a single atomic transaction. But Intel keeps disabling it with microcode updates. The HLE part (Hardware Lock Elision, i.e. optimistic handling of locked instructions) is fully gone, but the RTM part (xbegin / xend) can still be enabled on some CPUs, I think.
For a use case like this where one thread is only writing, you might consider a SeqLock, using 4 bytes of the cache line as a sequence number. See: Optimal way to pass a few variables between 2 threads pinning different CPUs / how to implement a seqlock lock using c++11 atomic library
The writer can load the sequence number, store seq+1 (with a plain mov store, no lock inc needed), store the payload with regular stores, or SIMD if convenient, then store seq+2.
Unfortunately without guarantees of vector load/store atomicity, or of ordering between parts of it, you can't have the reader just load the whole cache line at once, you do need 3 separate loads. (Seq number, whole line, then seq number again.)
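The writer and reader steps described above can be sketched roughly as follows. This is a minimal single-writer sketch, not a tuned implementation: the type and function names (SeqLockLine, seqlock_write, seqlock_try_read) are hypothetical, and the payload is stored as relaxed atomic words purely so the code is free of data races in ISO C++. A real implementation might use wider or SIMD copies instead.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWords = 15;  // 60 payload bytes + 4-byte sequence number

struct alignas(64) SeqLockLine {
    std::atomic<std::uint32_t> seq{0};
    std::atomic<std::uint32_t> words[kWords]{};
};
// Assumption: a 4-byte atomic has no padding, as on mainstream x86-64 ABIs.
static_assert(sizeof(SeqLockLine) == 64, "expected exactly one cache line");

// Single writer only: no lock inc needed, just plain stores.
void seqlock_write(SeqLockLine& s, const std::uint32_t (&src)[kWords]) {
    const std::uint32_t s0 = s.seq.load(std::memory_order_relaxed);
    assert((s0 & 1u) == 0);  // even = no write in progress
    s.seq.store(s0 + 1, std::memory_order_relaxed);       // odd: write started
    std::atomic_thread_fence(std::memory_order_release);  // payload stays after seq+1
    for (std::size_t i = 0; i < kWords; ++i)
        s.words[i].store(src[i], std::memory_order_relaxed);
    s.seq.store(s0 + 2, std::memory_order_release);       // even: payload complete
}

// Three separate loads: seq, payload, seq again. Returns false on a torn read.
bool seqlock_try_read(const SeqLockLine& s, std::uint32_t (&dst)[kWords]) {
    const std::uint32_t before = s.seq.load(std::memory_order_acquire);
    if (before & 1u) return false;  // writer is mid-update
    for (std::size_t i = 0; i < kWords; ++i)
        dst[i] = s.words[i].load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);  // payload loads before re-check
    const std::uint32_t after = s.seq.load(std::memory_order_relaxed);
    return before == after;
}
```

A reader would normally call seqlock_try_read in a retry loop until it returns true.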
But if you want to use 32-byte atomicity, which appears to be true in practice on Haswell and Zen2 and later, maybe put a sequence number in each 32-byte half of a cache line, so the reader can use vpcmpeqd / vmovmskps / test al,1 to check that the first dword element (sequence number) matched between halves. Or maybe put them somewhere else within the vector to make reassembling the payload cheaper.
This spends space for two sequence numbers to save loads in the reader, but might cost more overhead in shuffling data into / out of vectors. I guess maybe storing with vmovdqu [rdi+28], ymm1 / vmovdqu [rdi], ymm0 could leave you with 60 useful bytes starting at rdi+4, overwriting the 4 byte sequence number at the start of ymm1. Store-forwarding to a 32-byte load from [rdi+4] would stall, but narrower loads that don't span the boundary between the two earlier stores would be fine.
Related Q&As about solving the same problem of pushing data for other cores to be able to read cheaply:
CPU cache inhibition
x86 MESI invalidate cache line latency issue
Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions? - Sapphire Rapids has UIPI for user-space interrupt handling of special inter-processor interrupts. So that's fun if you want low latency notification. If you just want to read whatever the current state of a shared data structure is, SeqLock or RCU are good.

Is it possible to “abort” when loading a register from memory rather the triggering a page fault?

I am thinking about 'Minimizing page faults (and TLB faults) while “walking” a large graph'
'How to know whether a pointer is in physical memory or it will trigger a Page Fault?' is a related question looking at the problem from the other side, but does not have a solution.
I wish to be able to load some data from memory into a register, but have the load abort rather than getting a page fault, if the memory is currently paged out. I need the code to work in user space on both Windows and Linux without needing any non-standard permissions.
(Ideally, I would also like to abort on a TLB fault.)

The RTM (Restricted Transactional Memory) part of the TSX-NI feature allows suppressing exceptions:
Any fault or trap in a transactional region that must be exposed to software will be suppressed. Transactional execution will abort and execution will transition to a non-transactional execution, as if the fault or trap had never occurred.
[...]
Synchronous exception events (#DE, #OF, #NP, #SS, #GP, #BR, #UD, #AC, #XM, #PF, #NM, #TS, #MF, #DB, #BP/INT3) that occur during transactional execution may cause an execution not to commit transactionally, and require a non-transactional execution. These events are suppressed as if they had never occurred.
I've never used RTM but it should work something like this:
xbegin fallback
; Don't fault here
xend
; Somewhere else
fallback:
; Retry non-transactionally
Note that a transaction can be aborted for many reasons, see chapter 16.8.3.2 of the Intel manual volume 1.
Also note that RTM is not ubiquitous.
Besides RTM I cannot think of another way to suppress a load since it must return a value or eventually signal an abort condition (which would be the same as a #PF).

There's unfortunately no instruction that just queries the TLB or the current page table with the result in a register, on x86 (or any other ISA I know of). Maybe there should be, because it could be implemented very cheaply.
(For querying virtual memory for pages being paged out or not, there is the Linux system call mincore(2) that produces a bitmap of present/absent for a range of pages (given as void* start / size_t length). That's maybe similar to the HW page tables so probably could let you avoid page faults until after you've touched memory, but unrelated to TLB or cache. And maybe doesn't rule out soft page faults, only hard. And of course that's only the current situation: pages could be evicted between query and access.)
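For the mincore(2) route, a Linux-only sketch might look like the following (so it does not cover the Windows half of the question). The helper name page_resident is made up; it rounds the address down to a page boundary because mincore requires a page-aligned start, and the answer is only a snapshot that can be stale by the time you touch the memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Returns 1 if the page containing p is resident in RAM, 0 if it is not,
// and -1 if the query itself failed (e.g. the address is not mapped).
int page_resident(const void* p) {
    const long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;
    assert((page & (page - 1)) == 0);  // page sizes are powers of two

    // mincore() needs a page-aligned start address.
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    addr &= ~(static_cast<std::uintptr_t>(page) - 1);

    unsigned char vec = 0;  // one status byte per page; bit 0 = resident
    if (mincore(reinterpret_cast<void*>(addr),
                static_cast<std::size_t>(page), &vec) != 0) {
        return -1;
    }
    return vec & 1;
}
```

This costs a system call per query, so it only makes sense for coarse decisions, not per-pointer checks in a hot traversal loop.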
Would a CPU feature like this be useful? Probably yes for a few cases
Such a thing would be hard to use in a way that paid off, because every "false" attempt is CPU time / instructions that didn't accomplish any useful work. But a case like this could possibly be a win, when you don't care what order you traverse a tree / graph in, and some nodes might be hot in cache, TLB, or even just RAM while others are cold or even paged out to disk.
When memory is tight, touching a cold page could even evict a currently-hot page before you get to it.
Normal CPUs (like modern x86) can do speculative / out-of-order page walks (to fill TLB entries), and definitely speculative loads into cache, but not page faults. Page faults are handled in software by the kernel. Taking a page-fault can't happen speculatively, and is serializing. (CPUs don't rename the privilege level.)
So software prefetch can cheaply get the hardware to fill TLB and cache while you touch other memory, if the one you're going to touch 2nd was cold. If it was hot and you touch the cold side first, that's unfortunate. If there was a cheap way to check hot/cold, it might be worth using it to always go the right way (at least on the first step) in traversal order when one pointer is hot and the other is cold. Unless a read-only transaction is quite cheap, it's probably not worth actually using Margaret's clever answer.
If you have 2 pointers you will eventually dereference, and one of them points to a page that's been paged out while the other is hot, the best case would be to somehow detect this and get the OS to start paging in one page from disk in the background while you traverse the side that's already in RAM. (e.g. with Windows PrefetchVirtualMemory or Linux madvise(MADV_WILLNEED). See answers on the OP's other question: Minimizing page faults (and TLB faults) while "walking" a large graph.)
This will require a system call, but system calls are expensive and pollute caches + TLBs, especially on current x86 where Spectre + Meltdown mitigation adds thousands of clock cycles. So it's not worth it to make a VM prefetch system call for one of every pair of pointers in a tree. You'd get a massive slowdown for cases when all the pointers were in RAM.
CPU design possibilities
Like I said, I don't think any current ISAs have this, but I think it would be easy to support in hardware with instructions that run kind of like load instructions, but produce a result based on the TLB lookup instead of fetching data from L1d cache.
There are a couple possibilities that come to mind:
a queryTLB m8 instruction that writes flags (e.g. CF=1 for present) according to whether the memory operand is currently hot in TLB (including 2nd-level TLB), never doing a page walk. And a querypage m8 that will do a page walk on a TLB miss, and sets flags according to whether there's a page table entry. Putting the result in a r32 integer reg you could test/jcc on would also be an option.
a try_load r32, r/m32 instruction that does a normal load if possible, but sets flags instead of taking a page fault if a page walk finds no valid entry for the virtual address. (e.g. CF=1 for valid, CF=0 for abort with integer result = 0, like rdrand. It could make itself useful and set other flags (SF/ZF/PF) according to the value, if there is one.)
The query idea would only be useful for performance, not correctness, because there'd always be a gap between querying and using during which the page could be unmapped. (Like the IsBadXxxPtr Windows API functions, except that those probably check the logical memory map, not the hardware page tables.)
A try_load insn that also sets/clears flags instead of raising #PF could avoid the race condition. You could have different versions of it, or it could take an immediate to choose the abort condition (e.g. TLB miss without attempting a page walk).
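To pin down the intended semantics of these hypothetical instructions, here is a toy software model. Everything in it is invented for illustration (the ToyMmu class, its members, the 4 KiB page size); it models only the present/absent and TLB-hot/cold distinctions, not permissions, caches, or timing.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <unordered_set>

class ToyMmu {
public:
    static constexpr std::uint64_t kPageSize = 4096;

    // Marks the page containing addr as present and sets one 32-bit word.
    void map_word(std::uint64_t addr, std::uint32_t value) {
        assert(addr % 4 == 0);
        present_.insert(addr / kPageSize);
        words_[addr] = value;
    }

    // queryTLB: true only if the translation is already cached; never walks.
    bool query_tlb(std::uint64_t addr) const {
        return tlb_.count(addr / kPageSize) != 0;
    }

    // querypage: on a TLB miss, walks the page table and caches a valid entry.
    bool query_page(std::uint64_t addr) {
        const std::uint64_t page = addr / kPageSize;
        if (tlb_.count(page) != 0) return true;
        if (present_.count(page) == 0) return false;
        tlb_.insert(page);
        return true;
    }

    // try_load: the value if the page is present ("CF=1"), or std::nullopt
    // where a normal load would have raised #PF ("CF=0").
    std::optional<std::uint32_t> try_load(std::uint64_t addr) {
        if (!query_page(addr)) return std::nullopt;
        const auto it = words_.find(addr);
        return it != words_.end() ? it->second : 0u;
    }

private:
    std::unordered_set<std::uint64_t> present_;  // pages with a valid PTE
    std::unordered_set<std::uint64_t> tlb_;      // pages with a cached translation
    std::unordered_map<std::uint64_t, std::uint32_t> words_;
};
```

Because try_load checks and loads in one step, there is no window between the check and the access, which is the race the separate query instructions cannot avoid.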
These instructions could easily decode to a load uop, probably just one. The load ports on modern x86 already support normal loads, software prefetch, broadcast loads, zero or sign-extending loads (movsx r32, m8 is a single uop for a load port on Intel), and even vmovddup ymm, m256 (two in-lane broadcasts) for some reason, so adding another kind of load uop doesn't seem like a problem.
Loads that hit a TLB entry they don't have permission for (kernel-only mapping) do currently behave specially on some x86 uarches (the ones that aren't vulnerable to Meltdown). See The Microarchitecture Behind Meltdown on Henry Wong's blog (stuffedcow.net). According to his testing, some CPUs produce a zero for speculative execution of later instructions after a TLB/page miss (entry not present). So we already know that doing something with a TLB hit/miss result should be able to affect the integer result of a load. (Of course, a TLB miss is different from a hit on a privileged entry.)
Setting flags from a load is not something that ever normally happens on x86 (only from micro-fused load+alu), so maybe it would be implemented with an ALU uop as well, if Intel ever did implement this idea.
Aborting on a condition other than TLB/page miss or L1d miss would require outer levels of cache to also support this special request, though. A try_load that runs if it hits L3 cache but aborts on L3 miss would need support from the L3 cache. I think we could do without that, though.
The low-hanging fruit for this CPU-architecture idea is reducing page faults and maybe page walks, which are significantly more expensive than L3 cache misses.
I suspect that trying to branch on L3 cache misses would cost you too much in branch misses for it to really be worth it vs. just letting out-of-order exec do its thing. Especially if you have hyperthreading so this latency-bound process can happen on one logical core of a CPU that's also doing something else.

Would buffering cache changes prevent Meltdown?

If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed would attacks similar to Meltdown still be possible?
The proposal is to make speculative execution be able to load from memory, but not write to the CPU caches until they are actually committed.

TL:DR: yes I think it would solve Spectre (and Meltdown) in their current form (using a flush+read cache-timing side channel to copy the secret data from a physical register), but probably be too expensive (in power cost, and maybe also performance) to be a likely implementation.
But with hyperthreading (or more generally any SMT), there's also an ALU / port-pressure side-channel if you can get mis-speculation to run data-dependent ALU instructions with the secret data, instead of using it as an array index. The Meltdown paper discusses this possibility before focusing on the flush+reload cache-timing side-channel. (It's more viable for Meltdown than Spectre, because you have much better control of the timing of when the secret data is used).
So modifying cache behaviour doesn't block the attacks. It would take away the reliable side-channel for getting the secret data into the attacking process, though. (i.e. ALU timing has higher noise and thus lower bandwidth to get the same reliability; Shannon's noisy channel theorem), and you have to make sure your code runs on the same physical core as the code under attack.
On CPUs without SMT (e.g. Intel's desktop i5 chips), the ALU timing side-channel is very hard to use with Spectre, because you can't directly use perf counters on code you don't have privilege for. (But Meltdown could still be exploited by timing your own ALU instructions with Linux perf, for example).
Meltdown specifically is much easier to defend against, microarchitecturally, with simpler and cheaper changes to the hard-wired parts of the CPU that microcode updates can't rewire.
You don't need to block speculative loads from affecting cache; the change could be as simple as letting speculative execution continue after a TLB-hit load that will fault if it reaches retirement, but with the value used by speculative execution of later instructions forced to 0 because of the failed permission check against the TLB entry.
So the mis-speculated (after the faulting load of secret) touch array[secret*4096] load would always make the same cache line hot, with no secret-data-dependent behaviour. The secret data itself would enter cache, but not a physical register. (And this stops ALU / port-pressure side-channels, too.)
Stopping the faulting load from even bringing the "secret" line into cache in the first place could make it harder to tell the difference between a kernel mapping and an unmapped page, which could possibly help protect against user-space trying to defeat KASLR by finding which virtual addresses the kernel has mapped. But that's not Meltdown.
Spectre
Spectre is the hard one because the mis-speculated instructions that make data-dependent modifications to microarchitectural state do have permission to read the secret data. Yes, a "load queue" that works similarly to the store queue could do the trick, but implementing it efficiently could be expensive. (Especially given the cache coherency problem that I didn't think of when I wrote this first section.)
(There are other ways of implementing your basic idea; maybe there's even a way that's viable. But extra bits on L1D lines to track their status has downsides and isn't obviously easier.)
The store queue tracks stores from execution until they commit to L1D cache. (Stores can't commit to L1D until after they retire, because that's the point at which they're known to be non-speculative, and thus can be made globally visible to other cores).
A load queue would have to store whole incoming cache lines, not just the bytes that were loaded. (But note that Skylake-X can do 64-byte ZMM stores, so its store-buffer entries do have to be the size of a cache line. But if they can borrow space from each other or something, then there might not be 64 * entries bytes of storage available, i.e. maybe only the full number of entries is usable with scalar or narrow-vector stores. I've never read anything about a limitation like this, so I don't think there is one, but it's plausible)
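As a purely conceptual sketch of the "load queue" idea (lines fetched by not-yet-retired loads are held aside and only become visible in cache at retirement), consider the toy model below. The class ToyLoadQueue and its methods are invented for illustration; it ignores coherency, ports, and timing, which are exactly the parts that make a real implementation hard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

class ToyLoadQueue {
public:
    explicit ToyLoadQueue(std::size_t entries) : entries_(entries) {
        assert(entries > 0);
    }

    // A load executes. On a cache miss the incoming line is parked in the
    // queue instead of being installed. Returns false if the queue is full
    // (a real core would have to stall the load).
    bool speculative_load(std::uint64_t addr) {
        const std::uint64_t line = addr / 64;
        if (cache_.count(line) != 0) return true;  // hit: no new state
        if (pending_.size() == entries_) return false;
        pending_.push_back(line);
        return true;
    }

    // The queued loads retire: only now do their lines enter the cache.
    void retire_all() {
        for (const std::uint64_t line : pending_) cache_.insert(line);
        pending_.clear();
    }

    // Mis-speculation detected: drop the parked lines without a trace.
    void squash_all() { pending_.clear(); }

    // What a flush+reload timing probe could observe.
    bool is_cached(std::uint64_t addr) const {
        return cache_.count(addr / 64) != 0;
    }

private:
    std::size_t entries_;
    std::vector<std::uint64_t> pending_;
    std::unordered_set<std::uint64_t> cache_;
};
```

In this model a squashed array[secret*4096] access leaves is_cached false for every candidate line, so the cache-timing probe learns nothing.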
A more serious problem is that Intel's current L1D design has 2 read ports + 1 write port. (And maybe another port for writing lines that arrive from L2 in parallel with committing a store? There was some discussion about that on Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake.)
If your loaded data can't enter L1D until after the loads retire, then they're probably going to be competing for the same write port that stores use.
Loads that hit in L1D can still come directly from L1D, though, and loads that hit in the memory-order-buffer could still be executed at 2 per clock. (The MOB would now include this new load queue as well as the usual store queue + markers for loads to maintain x86 memory ordering semantics). You still need both L1D read ports to maintain performance for code that doesn't touch a lot of new memory, and mostly is reloading stuff that's been hot in L1D for a while.
This would make the MOB about twice as large (in terms of data storage), although it doesn't need any more entries. As I understand it, the MOB in current Intel CPUs is composed of the individual load-buffer and store-buffer entries. (Haswell has 72 and 42 respectively).
Hmm, a further complication is that the load data in the MOB has to maintain cache coherency with other cores. This is very different from store data, which is private and hasn't become globally visible / isn't part of the global memory order and cache coherency until it commits to L1D.
So this proposed "load queue" implementation mechanism for your idea is probably not feasible without tweaks: it would have to be checked by invalidation-requests from other cores, so that's another read-port needed in the MOB.
Any possible implementation would have the problem of needing to later commit to L1D like a store. I think it would be a significant burden not to be able to evict + allocate a new line when it arrived from off-core.
(Even allowing speculative eviction but not speculative replacement from conflicts leaves open a possible cache-timing attack. You'd prime all the lines and then do a load that would evict one from one set of lines or another, and find which line was evicted instead of which one was fetched using a similar cache-timing side channel. So using extra bits in L1D to find / evict lines loaded during recovery from mis-speculation wouldn't eliminate this side-channel.)
Footnote: all instructions are speculative. This question is worded well, but I think many people reading about OoO exec and thinking about Meltdown / Spectre fall into this trap of confusing speculative execution with mis-speculation.
Remember that all instructions are speculative when they're executed. It's not known to be correct speculation until retirement. Meltdown / Spectre depend on accessing secret data and using it during mis-speculation. But the basis of current OoO CPU designs is that you don't know whether you've speculated correctly or not; everything is speculative until retirement.
Any load or store could potentially fault, and so can some ALU instructions (e.g. floating point if exceptions are unmasked), so any performance cost that applies "only when executing speculatively" actually applies all the time. This is why stores can't commit from the store queue into L1D until after the store uops have retired from the out-of-order CPU core (with the store data in the store queue).
However, I think conditional and indirect branches are treated specially, because they're expected to mis-speculate some of the time, and optimizing recovery for them is important. Modern CPUs do better with branches than just rolling back to the current retirement state when a mispredict is detected, I think using a checkpoint buffer of some sort. So out-of-order execution for instructions before the branch can continue during recovery.
But loop branches and other branches are very common, so most code executes "speculatively" in this sense, too, with at least one branch-rollback checkpoint not yet verified as correct speculation. Most of the time it's correct speculation, so no rollback happens.
Recovery for mis-speculation of memory ordering or faulting loads is a full pipeline-nuke, rolling back to the retirement architectural state. So I think only branches consume the branch checkpoint microarchitectural resources.
Anyway, all of this is what makes Spectre so insidious: the CPU can't tell the difference between mis-speculation and correct speculation until after the fact. If it knew it was mis-speculating, it would initiate rollback instead of executing useless instructions / uops. Indirect branches are not rare, either (in user-space); every DLL or shared library function call uses one in normal executables on Windows and Linux.

I suspect the overhead from buffering and committing the buffer would render the specEx/caching useless?
This is purely speculative (no pun intended) - I would love to see someone with a lower level background weigh in on this!

Out-of-order instruction execution: is commit order preserved?

On the one hand, Wikipedia writes about the steps of the out-of-order execution:
Instruction fetch.
Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
The instruction is issued to the appropriate functional unit and executed by that unit.
The results are queued.
Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.
Similar information can be found in the book "Computer Organization and Design":
To make programs behave as if they were running on a simple in-order pipeline, the instruction fetch and decode unit is required to issue instructions in order, which allows dependences to be tracked, and the commit unit is required to write results to registers and memory in program fetch order. This conservative mode is called in-order commit... Today, all dynamically scheduled pipelines use in-order commit.
So, as far as I understand, even if instructions execute out of order, their results are held in the reorder buffer and then committed to memory/registers in a deterministic order.
On the other hand, it is well known that modern CPUs can reorder memory operations for performance (for example, two adjacent independent load instructions can be reordered). Wikipedia writes about it here.
Could you please shed some light on this discrepancy?

TL;DR: memory ordering is not the same thing as out-of-order execution. Memory reordering happens even on in-order pipelined CPUs.
In-order commit is necessary (see footnote 1) for precise exceptions that can roll back to exactly the instruction that faulted, without any instructions after that having already retired. The cardinal rule of out-of-order execution is don't break single-threaded code. If you allowed out-of-order commit (retirement) without any kind of other mechanism, you could have a page-fault happen while some later instructions had already executed once, and/or some earlier instructions hadn't executed yet. This would make restarting execution after handling a page-fault impossible the normal way.
(In-order issue/rename and dependency-tracking takes care of correct execution in the normal case of no exceptions.)
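To make the in-order retirement rule concrete, here is a minimal toy model, not a description of any real CPU. The names (`ToyRob`, `RobEntry`, `retire_order_example`) are made up for illustration. Entries can be marked as executed in any order, but they only leave the buffer from the head, so a slow older instruction blocks retirement of everything younger.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Toy reorder buffer (hypothetical names): execution completes in any order,
// retirement happens strictly in program order.
struct RobEntry {
    std::string name;  // label for the instruction
    bool done;         // has it finished executing?
};

class ToyRob {
public:
    // Allocate an entry in program order and return its id.
    std::size_t issue(const std::string& name) {
        entries_.push_back(RobEntry{name, false});
        return next_id_++;
    }

    // Mark an in-flight instruction as executed (any order is allowed).
    void complete(std::size_t id) {
        assert(id >= head_id_ && id < next_id_);
        entries_[id - head_id_].done = true;
    }

    // Retire from the head only; stop at the first unfinished instruction.
    std::vector<std::string> retire() {
        std::vector<std::string> retired;
        while (!entries_.empty() && entries_.front().done) {
            retired.push_back(entries_.front().name);
            entries_.pop_front();
            ++head_id_;
        }
        return retired;
    }

private:
    std::deque<RobEntry> entries_;
    std::size_t head_id_ = 0;  // id of the oldest in-flight instruction
    std::size_t next_id_ = 0;  // id the next issued instruction will get
};

// Usage sketch: the younger instructions finish first but cannot retire
// until the older load (think: cache miss) is done.
inline std::vector<std::string> retire_order_example() {
    ToyRob rob;
    std::size_t load = rob.issue("load");
    std::size_t add = rob.issue("add");
    std::size_t mul = rob.issue("mul");
    rob.complete(mul);
    rob.complete(add);
    assert(rob.retire().empty());  // nothing retires past the pending load
    rob.complete(load);
    return rob.retire();           // load, add, mul
}
```

If the load had faulted instead of completing, nothing younger would have retired yet, which is what makes the exception precise in this model.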
Memory ordering is all about what other cores see. Also notice that what you quoted is only talking about committing results to the register file, not to memory.
(Footnote 1: Kilo-instruction Processors: Overcoming the Memory Wall is a theoretical paper about checkpointing state to allow rollback to a consistent machine state at some point before an exception, allowing much larger out-of-order windows without a gigantic ROB of that size. AFAIK, no mainstream commercial designs have used that, but it shows that there are in theory approaches other than strictly in-order retirement to building a usable CPU.
Apple's M1 reportedly has a significantly larger out-of-order window than its x86 contemporaries, but I haven't seen any definite info that it uses anything other than a very large ROB.)
Since each core's private L1 cache is coherent with all the other data caches in the system, memory ordering is a question of when instructions read or write cache. This is separate from when they retire from the out-of-order core.
Loads become globally visible when they read their data from cache. This is more or less when they "execute", and definitely way before they retire (aka commit).
Stores become globally visible when their data is committed to cache. This has to wait until they're known to be non-speculative, i.e. that no exceptions or interrupts will cause a roll-back that has to "undo" the store. So a store can commit to L1 cache as early as when it retires from the out-of-order core.
But even in-order CPUs use a store queue or store buffer to hide the latency of stores that miss in L1 cache. The out-of-order machinery doesn't need to keep tracking a store once it's known that it will definitely happen, so a store insn/uop can retire even before it commits to L1 cache. The store buffer holds onto it until L1 cache is ready to accept it, i.e. when the core owns the cache line (Exclusive or Modified state of the MESI cache coherency protocol), and the memory-ordering rules allow the store to become globally visible now.
See also my answer on Write Allocate / Fetch on Write Cache Policy
As I understand it, a store's data is added to the store queue when it "executes" in the out-of-order core, and that's what a store execution unit does. (Store-address writing the address, and store-data writing the data into the store-buffer entry reserved for it at allocation/rename time, so either of those parts can execute first on CPUs where those parts are scheduled separately, e.g. Intel.)
Loads have to probe the store queue so that they see recently-stored data.
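As a rough sketch of why loads must probe the store queue, here is a toy single-core model. All names (`ToyCore`, `drain_one`, `cache_value`) are hypothetical, and real hardware is far more involved (partial overlaps, speculation, separate store-address and store-data uops). A store goes into a FIFO buffer first, a load checks that buffer youngest-first before falling back to cache, and only drained stores are visible through `cache_value`, which stands in for what another core could observe.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <unordered_map>

// Toy single-core model with a store buffer (hypothetical names).
class ToyCore {
public:
    // A store "executes" by writing into the store buffer, not into cache.
    void store(std::uint64_t addr, int value) {
        buffer_.push_back(Pending{addr, value});
    }

    // A load probes the store buffer youngest-first, then falls back to cache.
    int load(std::uint64_t addr) const {
        for (auto it = buffer_.rbegin(); it != buffer_.rend(); ++it) {
            if (it->addr == addr) return it->value;  // store-to-load forwarding
        }
        return cache_value(addr);
    }

    // Commit the oldest buffered store to cache (FIFO order).
    bool drain_one() {
        if (buffer_.empty()) return false;
        cache_[buffer_.front().addr] = buffer_.front().value;
        buffer_.pop_front();
        return true;
    }

    // What another core could observe: committed data only (0 if never written).
    int cache_value(std::uint64_t addr) const {
        auto it = cache_.find(addr);
        return it == cache_.end() ? 0 : it->second;
    }

private:
    struct Pending {
        std::uint64_t addr;
        int value;
    };
    std::deque<Pending> buffer_;
    std::unordered_map<std::uint64_t, int> cache_;
};

// Usage sketch: the storing core sees its own store before anyone else can.
inline void forwarding_example() {
    ToyCore core;
    core.store(0x40, 7);
    assert(core.load(0x40) == 7);         // forwarded from the store buffer
    assert(core.cache_value(0x40) == 0);  // not yet globally visible
    core.drain_one();
    assert(core.cache_value(0x40) == 7);  // now committed to cache
}
```

Draining in FIFO order is a simplification chosen to match the x86-style rule that stores don't reorder with other stores; a weakly-ordered ISA could drain out of order.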
For an ISA like x86, with strong ordering, the store queue has to preserve the memory-ordering semantics of the ISA. That is, stores can't reorder with other stores, and stores can't become globally visible before earlier loads. (LoadStore, StoreStore, and LoadLoad reordering aren't allowed; only StoreLoad reordering is.)
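StoreLoad reordering is what the classic "store buffer" litmus test probes. Below is a minimal sketch using C++11 atomics; the function name `count_both_zero` is made up for illustration, and it assumes a toolchain with thread support (e.g. building with `-pthread`). Each thread stores 1 to its own variable and then loads the other one. With `memory_order_seq_cst` the outcome where both loads see 0 is forbidden by the C++ memory model. With `memory_order_relaxed` it is allowed, either because the hardware lets the load pass the buffered store or because the compiler reorders the two operations.

```cpp
#include <atomic>
#include <thread>

// Store-buffer litmus test (hypothetical helper name).
// Returns how many of `iterations` runs ended with both threads reading 0.
inline int count_both_zero(int iterations, bool use_seq_cst) {
    const std::memory_order order =
        use_seq_cst ? std::memory_order_seq_cst : std::memory_order_relaxed;
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;
        // Thread 1: store x, then load y.
        std::thread t1([&] {
            x.store(1, order);
            r1 = y.load(order);
        });
        // Thread 2: store y, then load x.
        std::thread t2([&] {
            y.store(1, order);
            r2 = x.load(order);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;  // each load passed the other's store
    }
    return both_zero;
}
```

`count_both_zero(n, true)` has to return 0. `count_both_zero(n, false)` may return a non-zero count, but spawning fresh threads on every iteration means the two bodies rarely overlap, so seeing 0 there proves nothing about the hardware; a more serious test would keep two long-lived threads and synchronize the start of each round.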
David Kanter's article on how TSX (transactional memory) could be implemented in different ways than what Haswell does provides some insight into the Memory Order Buffer, and how it's a separate structure from the ReOrder Buffer (ROB) that tracks instruction/uop reordering. He starts by describing how things currently work, before getting into how it could be modified to track a transaction that can commit or abort as a group.
