Sequential Consistency in WebAssembly

The document describing WebAssembly threads says:
Atomic load/store memory accesses behave like their non-atomic counterparts, with the exception that the ordering of accesses is sequentially consistent.
This is in reference to i32.atomic.load versus i32.load, i32.atomic.store versus i32.store, etc.
What does it mean that the non-atomic operations aren't sequentially consistent? In what situations would the non-atomic operations not be suitable?

The memory consistency of an atomic operation defines (the minimum requirement of) how atomic accesses are sequenced by the processor in relation to other variables. This is related to memory barriers. A relaxed atomic operation is sequenced independently of other variables. This means that if you do:
non_atomic_variable = 42;
atomic_increment(atomic_variable);
Then other threads may not see the updated value of non_atomic_variable even after atomic_variable has been incremented by the current thread. This is not possible with a sequentially consistent memory ordering, because the compiler must emit instructions that include a memory barrier, forcing other threads to see the updated value of non_atomic_variable once the increment is done and the atomic operation (e.g. a read of atomic_variable) has been observed by another thread.
A sequentially consistent memory ordering is safe but also slow. A relaxed memory ordering is fast because of weaker synchronization (and more room for processor optimizations during loads/stores). For example, with a relaxed memory ordering, a processor can perform the store to non_atomic_variable later because of a cache miss (thanks to out-of-order execution). With a sequentially consistent memory ordering, the increment needs to wait for that store to complete, which can take some time when there is a cache miss.
Note that the memory ordering of the processor can be stronger than the one required by the software stack (e.g. x86-64 processors have a strong memory ordering).
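A minimal C++11 sketch of the situation above (variable and function names are invented for this example): a non-atomic payload is published through an atomic flag. If the store to ready used memory_order_relaxed, the reader could observe the flag without observing payload = 42; with release/acquire (or the sequentially consistent default that WebAssembly's atomics provide), the write is guaranteed to be visible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int payload = 0;                  // invented names for this sketch
static std::atomic<bool> ready{false};

int publish_and_read() {
    payload = 0;
    ready.store(false);
    std::thread writer([] {
        payload = 42;                                  // non-atomic write
        ready.store(true, std::memory_order_release);  // publishes payload
    });
    int seen = -1;
    std::thread reader([&] {
        while (!ready.load(std::memory_order_acquire)) {}  // spin on the flag
        seen = payload;  // acquire pairs with release: must observe 42
    });
    writer.join();
    reader.join();
    return seen;
}
```

With memory_order_relaxed on both the store and the load, the same code would still be race-free on the flag itself, but the assignment to payload could legally be seen late by the reader.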

Can CPU Out-of-Order-Execution cause memory reordering?

I know store buffers and invalidate queues are causes of memory reordering. What I don't know is whether out-of-order execution can cause memory reordering.
In my opinion, out-of-order execution can't cause reordering, because the results are always retired in order, as mentioned in this question.
To make my question clearer, let's say we have a relaxed-memory-consistency architecture like this:
It doesn't have store buffers or invalidate queues
It can do out-of-order execution
Can memory reordering still happen in this architecture?
Does a memory barrier have two functions: one forbidding out-of-order execution, the other flushing the invalidation queue and draining the store buffer?
Yes, out-of-order execution can definitely cause memory reordering, such as load/load reordering.
It is not so much a question of the loads being retired in order as of when the load value is bound to the load instruction. E.g. Load1 may precede Load2 in program order, yet Load2 gets its value from memory before Load1 does; if there is an intervening store to the location read by Load2, then load/load reordering has occurred.
However, certain systems, such as Intel P6 family systems, have additional mechanisms to detect such conditions to obtain stronger memory order models.
In these systems all loads are buffered until retirement, and if a possible store to the location of such a buffered but not yet retired load is detected, then the load and the instructions after it in program order are "nuked", and execution is resumed at, e.g., Load2.
I call this Freye’s Rule snooping, after I learned that Brad Freye at IBM had invented it many years before I thought I had. I believe the standard academic reference is Gharachorloo.
I.e. it is not so much buffering loads until retirement, as it is providing such a detection and correction mechanism associated with buffering loads until retirement. Many CPUs provide buffering until retirement but do not provide this detection mechanism.
Note also that this requires something like snoop-based cache coherence. Many systems, including Intel systems that have such mechanisms, also support noncoherent memory, e.g. memory that may be cached but which is managed by software. If speculative loads are allowed to such cacheable but non-coherent memory regions, the Freye's Rule mechanism will not work and memory will be weakly ordered.
Note: I said "buffer until retirement", but if you think about it you can easily come up with ways of buffering not quite until retirement. E.g. you can stop this snooping when all earlier loads have themselves been bound and there is no longer any possibility of an intervening store being observed, even transitively.
This can be important, because there is quite a lot of performance to be gained by "early retirement", removing instructions such as loads from buffering and repair mechanisms before all earlier instructions have retired. Early retirement can greatly reduce the cost of out-of-order hardware mechanisms.
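The squash-and-replay mechanism described above can be illustrated with a toy single-threaded simulation (all names invented; this is a didactic model, not real hardware): loads may bind their values out of program order, and if a snooped store hits the address of a load that bound early but has not yet retired, that load and everything after it is squashed and re-bound, so the retired results look as if the loads had run in order.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct PendingLoad { int addr; int value; bool bound; };

struct ToyCore {
    std::map<int, int> mem;          // toy memory
    std::vector<PendingLoad> loads;  // in program order

    void issue_load(int addr) { loads.push_back({addr, 0, false}); }

    void bind(std::size_t i) {       // load i reads memory (possibly early)
        loads[i].value = mem[loads[i].addr];
        loads[i].bound = true;
    }

    // Another core's store is snooped: squash any bound, unretired load to
    // that address (and everything after it in program order) for replay.
    void snoop_store(int addr, int value) {
        mem[addr] = value;
        for (std::size_t i = 0; i < loads.size(); ++i) {
            if (loads[i].bound && loads[i].addr == addr) {
                for (std::size_t j = i; j < loads.size(); ++j)
                    loads[j].bound = false;  // "nuke" from the hit onward
                break;
            }
        }
    }

    // Retire in program order, re-binding anything that was squashed.
    std::vector<int> retire() {
        std::vector<int> results;
        for (std::size_t i = 0; i < loads.size(); ++i) {
            if (!loads[i].bound) bind(i);
            results.push_back(loads[i].value);
        }
        return results;
    }
};

std::vector<int> demo_replay() {
    ToyCore c;
    c.mem[0] = 0; c.mem[1] = 0;
    c.issue_load(0);      // Load1 (first in program order)
    c.issue_load(1);      // Load2
    c.bind(1);            // Load2 binds early: sees mem[1] == 0
    c.snoop_store(1, 7);  // intervening store to Load2's address
    c.bind(0);            // Load1 binds: sees mem[0] == 0
    return c.retire();    // Load2 was squashed, so it re-binds and sees 7
}
```

Without the snoop check, retirement would publish the stale value 0 for Load2 even though Load1 (earlier in program order) could have seen the store's effects, which is exactly the load/load reordering the answer describes.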

What are the advantages of strict consistency over sequential consistency in the field of cache consistency?

When a lock is implemented, both strict consistency and sequential consistency only need CAS (or TAS) instructions, and neither needs barriers. Since there is no observer of the physical world on the CPU, consistency from an observer's perspective seems to have no practical meaning.
I'm not really familiar with the definition of strict consistency.
Below is a long story with a high risk of information overload, but I hope it will show how coherence fits in memory consistency models or at least give you enough structure to look for additional information.
Cache coherence is normally defined in terms of sequential consistency (SC) per location. SC doesn't need to respect the real-time order of requests, so reads and writes can be skewed as long as the program order (PO) is preserved. This prevents SC from being composable: if you have a cache of coherent locations (SC per location), the cache as a whole doesn't need to be SC (with PO per location). The consequence is that there is no guarantee that a total order over the memory order exists that can explain the execution. So, in simple terms, you can't build an SC cache if the cache is only SC per location.
If coherence were implemented using linearizability per location instead of SC per location, then a linearizable cache could be made (with PO per location). This is also called 'atomic memory'.
With linearizability the real-time order of a request is respected. The advantage of linearizability is that it is composable: if you have a system of linearizable locations, then the cache as a whole will be linearizable. As a consequence, there always exists at least one total order over the memory order that explains the execution. So if the CPU prevents any reordering of loads/stores before they hit the cache, then in combination with a linearizable cache you can create an SC CPU.
A typical protocol for cache coherence is MESI: a write needs to wait until the cache line has been invalidated on all other CPUs before it can write the change to the cache line. The consequence of this approach is that MESI-based caches are linearizable.
Most CPUs have store buffers, so an older store can be reordered with a newer load to a different address, and as a consequence the memory order doesn't order older stores with newer loads to a different address. So [StoreLoad] is dropped as a requirement for the memory order. Dropping [StoreLoad] doesn't prevent you from having a total order over all memory accesses; it just means that the memory model doesn't care in which order they appear in the total order over the memory order.
The key problem here is a store followed by a load to the same address. There are 2 possible solutions:
1 (A strict solution): The load needs to wait for the store to be committed to the cache before it can be executed. The advantage of this approach is that the loads and stores are properly ordered in the memory order and a total order over the memory order exists. This is the memory model of the IBM-370. So IBM-370 is SC + dropping [StoreLoad].
2 (A relaxed solution): The load looks inside the store buffer. If there is a match, it returns the stored value. This is called store-to-load forwarding (STLF). The problem here is that it isn't possible to create a total order over the memory order because the store isn't atomic: a load is by definition globally ordered after the store it reads from, but because the load is performed (from the store buffer) before the store is globally performed (committed to the cache), the store and the load to the same address are not properly ordered in the memory order. This is demonstrated with the following test:
A=B=0
CPU1:
A=1
r1=A
r2=B
CPU2:
B=1
r3=B
r4=A
With STLF it can be that r1=1, r2=0, r3=1, r4=0, but with IBM-370/SC/linearizability that would not be possible. In the above example the load r1=A is ordered both after A=1 (it reads the stored value) and before A=1 (due to STLF, before the store is globally performed). So a total order over all memory actions doesn't exist, because the load would be ordered both before and after the store. Instead the requirement of the memory model is relaxed: a total order over all stores needs to exist. And this is how we get Total Store Order, the memory model of the x86. So TSO is a relaxation of SC whereby [StoreLoad] is dropped + STLF.
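The litmus test above can be sketched with C++11 atomics (function name invented). The default seq_cst ordering inserts the full barrier that plain TSO stores lack, so the outcome r2 == 0 and r4 == 0 can never appear; compiling the same test with memory_order_relaxed stores and loads could expose that outcome on real x86 hardware, because each store would sit in the store buffer while the other thread's load executes.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Store-buffering (SB) litmus test with seq_cst atomics: returns true if
// the forbidden outcome (r2 == 0 && r4 == 0) was never observed.
bool seq_cst_forbids_sb(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> A{0}, B{0};
        int r1 = -1, r2 = -1, r3 = -1, r4 = -1;
        std::thread t1([&] { A.store(1); r1 = A.load(); r2 = B.load(); });
        std::thread t2([&] { B.store(1); r3 = B.load(); r4 = A.load(); });
        t1.join(); t2.join();
        if (r2 == 0 && r4 == 0) return false;  // only possible with STLF/TSO
    }
    return true;
}
```

Note that r1 and r3 are always 1 in either variant: a CPU reading its own recent store is exactly the store-to-load forwarding the answer describes.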
We can relax the memory order further. With TSO we have the guarantee that at least one total order over all stores exists, but this is because the cache is linearizable. If we relax this requirement, we get processor consistency (PC). So PC allows an older store to be reordered with a newer load and requires a coherent cache, but writes to different addresses made by different CPUs can be seen out of order (so there is no total order over the stores).
This is demonstrated with the Independent Reads of Independent Writes (IRIW) litmus test:
A=B=0
CPU1:
A=1
CPU2:
B=1
CPU3:
r1=A
r2=B
CPU4:
r3=B
r4=A
Can it be that we see r1=1, r2=0, r3=1, r4=0? I.e. can CPU3 and CPU4 see the writes to A and B in different orders? If a total order over the stores exists (e.g. TSO/IBM-370/SC/linearizability), then this isn't possible. But on PC, it is allowed.
I hope this example makes it clear that 'just' a coherent cache is still a pretty weak property.
Linearizability, SC and IBM-370 are also called atomic/store-atomic/single-copy store atomic because there is only a single copy of the data. There is a logical point where the store becomes visible to all CPUs.
TSO is called multi copy store atomic because a store can become visible to the issuing CPU early (STLF).
A memory model like PC is called non atomic (or non store atomic) because there is no logical moment where a store becomes visible to other CPUs.
A CAS instruction is not just sequentially consistent; it is linearizable. And depending on the architecture, a CAS involves fences. E.g. an atomic instruction like CMPXCHG on the x86 has an implicit LOCK prefix which acts as a full barrier. So it is guaranteed to preserve all 4 orderings, although it only needs to preserve [StoreLoad], since the other fences are automatically provided.
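As a sketch of that point (helper names invented, C++11): a spinlock built from nothing but compare_exchange needs no explicit fences, because the read-modify-write defaults to seq_cst ordering, and on x86 it compiles to a LOCK CMPXCHG, which acts as a full barrier on its own.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

static std::atomic<bool> locked{false};

void spin_lock() {
    bool expected = false;
    // Retry until we flip false -> true; no separate barriers needed.
    while (!locked.compare_exchange_weak(expected, true))
        expected = false;  // CAS failure overwrote `expected`; reset it
}

void spin_unlock() { locked.store(false); }

int locked_count(int threads, int per_thread) {
    int counter = 0;  // plain int, protected only by the CAS spinlock
    std::vector<std::thread> ts;
    for (int t = 0; t < threads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                spin_lock();
                ++counter;
                spin_unlock();
            }
        });
    for (auto& th : ts) th.join();
    return counter;
}
```

If the increments were lost (counter below threads * per_thread), that would indicate the lock failed to order the critical sections; with seq_cst CAS this cannot happen.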
For more information about this topic see "A primer on memory consistency and cache coherence 2e" which is available for free.
Note 1:
A frequent requirement of a memory model is that some kind of total order over all loads and stores in that memory model exists that explains the execution. This can be found by using a topological sort.
Note 2:
Any requirement in the memory order can be violated as long as nobody is able to observe it.
Note 3:
If there is a total order of loads/stores (either per location or for all locations) a load needs to see the most recent store before it in the memory order.
Strict consistency is distinguishable from sequential consistency when implicit writes are present. Implicit writes are not unheard-of when dealing with I/O devices.
One obvious example would be a clock; a clock has an implicit write at every clock tick independent of reads.
A perhaps more meaningful example would be a buffer presented as a single word address. Writes to the buffer would only become visible after previous writes have been read, so even if such writes were visible to the consistency mechanism as updating that address the order of the visibility of writes would depend on the order of the reads of the buffer. The writes might be effectively invisible to the consistency mechanism because they come from non-coherent I/O activity or because the interface specifies a different address for adding a value to the buffer from the address used for taking a value from the buffer (where a read from the write address might provide the number of buffer entries filled or the number vacant).
A shared pseudorandom number generator or access counter would have a similar read side effect of advancing the position in a "buffer".
The C programming language's volatile keyword informs the compiler that a variable can change without explicit writes, recognizing a distinction at the level of programming language between strict consistency and sequential consistency.

Out-of-order instruction execution: is commit order preserved?

On the one hand, Wikipedia writes about the steps of the out-of-order execution:
Instruction fetch.
Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
The instruction is issued to the appropriate functional unit and executed by that unit.
The results are queued.
Only after all older instructions have had their results written back to the register file is this result written back to the register file. This is called the graduation or retire stage.
The similar information can be found in the "Computer Organization and Design" book:
To make programs behave as if they were running on a simple in-order
pipeline, the instruction fetch and decode unit is required to issue
instructions in order, which allows dependences to be tracked, and the
commit unit is required to write results to registers and memory in
program fetch order. This conservative mode is called in-order
commit... Today, all dynamically scheduled pipelines use in-order commit.
So, as far as I understand, even if the instructions are executed in an out-of-order manner, the results of their execution are preserved in the reorder buffer and then committed to memory/registers in a deterministic order.
On the other hand, there is a known fact that modern CPUs can reorder memory operations for the performance acceleration purposes (for example, two adjacent independent load instructions can be reordered). Wikipedia writes about it here.
Could you please shed some light on this discrepancy?
TL:DR: memory ordering is not the same thing as out of order execution. It happens even on in-order pipelined CPUs.
In-order commit is necessary1 for precise exceptions that can roll back to exactly the instruction that faulted, without any instructions after that having already retired. The cardinal rule of out-of-order execution is: don't break single-threaded code. If you allowed out-of-order commit (retirement) without any other kind of mechanism, you could have a page fault happen while some later instructions had already executed once, and/or some earlier instructions hadn't executed yet. This would make restarting execution after handling a page fault impossible the normal way.
(In-order issue/rename and dependency-tracking takes care of correct execution in the normal case of no exceptions.)
Memory ordering is all about what other cores see. Also notice that what you quoted is only talking about committing results to the register file, not to memory.
(Footnote 1: Kilo-instruction Processors: Overcoming the Memory Wall is a theoretical paper about checkpointing state to allow rollback to a consistent machine state at some point before an exception, allowing much larger out-of-order windows without a gigantic ROB of that size. AFAIK, no mainstream commercial designs have used that, but it shows that there are in theory approaches other than strictly in-order retirement to building a usable CPU.
Apple's M1 reportedly has a significantly larger out-of-order window than its x86 contemporaries, but I haven't seen any definite info that it uses anything other than a very large ROB.)
Since each core's private L1 cache is coherent with all the other data caches in the system, memory ordering is a question of when instructions read or write cache. This is separate from when they retire from the out-of-order core.
Loads become globally visible when they read their data from cache. This is more or less when they "execute", and definitely way before they retire (aka commit).
Stores become globally visible when their data is committed to cache. This has to wait until they're known to be non-speculative, i.e. that no exceptions or interrupts will cause a roll-back that has to "undo" the store. So a store can commit to L1 cache as early as when it retires from the out-of-order core.
But even in-order CPUs use a store queue or store buffer to hide the latency of stores that miss in L1 cache. The out-of-order machinery doesn't need to keep tracking a store once it's known that it will definitely happen, so a store insn/uop can retire even before it commits to L1 cache. The store buffer holds onto it until L1 cache is ready to accept it. i.e. when it owns the cache line (Exclusive or Modified state of the MESI cache coherency protocol), and the memory-ordering rules allow the store to become globally visible now.
See also my answer on Write Allocate / Fetch on Write Cache Policy
As I understand it, a store's data is added to the store queue when it "executes" in the out-of-order core, and that's what a store execution unit does. (Store-address writing the address, and store-data writing the data into the store-buffer entry reserved for it at allocation/rename time, so either of those parts can execute first on CPUs where those parts are scheduled separately, e.g. Intel.)
Loads have to probe the store queue so that they see recently-stored data.
For an ISA like x86, with strong ordering, the store queue has to preserve the memory-ordering semantics of the ISA: stores can't reorder with other stores, and stores can't become globally visible before earlier loads. (LoadStore, StoreStore, and LoadLoad reordering aren't allowed; only StoreLoad reordering is.)
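A sketch of why that one permitted reordering, StoreLoad, matters in practice (all names invented, C++11): Dekker-style mutual exclusion stores a flag and then loads the other thread's flag. If the load could be served from cache while the store still sat in the store buffer, both threads could enter the critical section. Default seq_cst atomics emit the required StoreLoad barrier (e.g. a LOCK-prefixed operation or MFENCE on x86), so the overlap can never be observed here.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Dekker-style entry attempt: each thread raises its flag, then enters only
// if the other flag is down. Returns how often both got in simultaneously.
int dekker_overlap_count(int iterations) {
    int overlaps = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<bool> want0{false}, want1{false};
        std::atomic<int> inside{0};
        std::atomic<bool> both{false};
        std::thread t0([&] {
            want0.store(true);            // seq_cst store...
            if (!want1.load()) {          // ...cannot be passed by this load
                if (inside.fetch_add(1) == 1) both.store(true);
                inside.fetch_sub(1);
            }
        });
        std::thread t1([&] {
            want1.store(true);
            if (!want0.load()) {
                if (inside.fetch_add(1) == 1) both.store(true);
                inside.fetch_sub(1);
            }
        });
        t0.join(); t1.join();
        if (both.load()) ++overlaps;
    }
    return overlaps;  // always 0 with seq_cst ordering
}
```

With relaxed (or even release/acquire) ordering on the flags, the store-then-load pair in each thread could be reordered by the store buffer and both threads could slip inside, which is exactly the StoreLoad reordering the paragraph describes.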
David Kanter's article on how TSX (transactional memory) could be implemented in different ways than what Haswell does provides some insight into the Memory Order Buffer, and how it's a separate structure from the ReOrder Buffer (ROB) that tracks instruction/uop reordering. He starts by describing how things currently work, before getting into how it could be modified to track a transaction that can commit or abort as a group.

Go atomic and memory order

I am porting a lock-free queue from C++11 to Go and I came across things such as
auto currentRead = writeIndex.load(std::memory_order_relaxed);
and in some cases std::memory_order_release and std::memory_order_acquire.
Also, the equivalent of the above in C11 is something like
unsigned long currentRead = atomic_load_explicit(&q->writeIndex,memory_order_relaxed);
the meaning of those is described here
Is there an equivalent in Go, or do I just use something like
var currentRead uint64 = atomic.LoadUint64(&q.writeIndex)
After porting I benchmarked, and just using LoadUint64 it seems to work as expected but is orders of magnitude slower, and I wonder how much effect those specialized ops have on performance.
Further info from the link I attached:
memory_order_relaxed:Relaxed operation: there are no synchronization
or ordering constraints, only atomicity is required of this operation.
memory_order_consume:A load operation with this memory order performs
a consume operation on the affected memory location: no reads in the
current thread dependent on the value currently loaded can be
reordered before this load. This ensures that writes to data-dependent
variables in other threads that release the same atomic variable are
visible in the current thread. On most platforms, this affects
compiler optimizations only.
memory_order_acquire:A load operation with this memory order performs the acquire operation on the affected memory location: no
memory accesses in the current thread can be reordered before this
load. This ensures that all writes in other threads that release the
same atomic variable are visible in the current thread.
memory_order_release:A store operation with this memory order performs the release operation: no memory accesses in the current
thread can be reordered after this store. This ensures that all writes
in the current thread are visible in other threads that acquire or the
same atomic variable and writes that carry a dependency into the
atomic variable become visible in other threads that consume the same
atomic.
You need to read The Go Memory Model
You'll discover that Go has nothing like the control that you have in C++ - there isn't a direct translation of the C++ features in your post. This is a deliberate design decision by the Go authors - the Go motto is Do not communicate by sharing memory; instead, share memory by communicating.
Assuming that the standard Go channel isn't good enough for what you want to do, you have 2 choices for each memory access: use the facilities in sync/atomic or not. Whether you need to use them will depend on a careful reading of the Go Memory Model and an analysis of your code which only you can do.

How does a processor know about the latest copy of a cache line in a multiprocessor system

In a multiprocessor system where each processor has its own cache, how does a processor know where to get its copy of the data?
The data may be present in its own cache, in the caches of other processors, or in main memory; how does the processor know which copy is the latest one?
Most processors (in particular the x86 in our laptops, desktops, and servers) have some hardware-provided cache coherence.
Often, some synchronization memory barrier instructions exist.
It is rumored that some synchronization machine instructions can be quite slow.
Actually, the recent C++11 and C11 standards have specific wording and atomic data types to deal with these, like C++11 std::atomic.
In practice, you should use some well-established standard library like pthreads (or the C++11 std::thread, etc.).
In a typical modern cache coherent system, if the contents of a memory address are present in multiple caches, their content will be the same. Using the typical invalidation-based coherence mechanism, in order for a processor to change the content, it must gain exclusive ownership of that block of memory. This is done by invalidating any copies. Any subsequent request from a processor that previously had the block cached would result in a miss (the block was invalidated) and a coherence action will find the updated content in the writing processor's cache.
(In earlier implementations of cache coherence with write-through caches, a common bus to memory could be snooped to grab any content changes. Similarly, a processor changing content could broadcast or multicast the changes to any sharers. These methods would keep cached contents the same.)
A more subtle aspect of this process is memory consistency--how different processors see the orderings of memory accesses to different addresses. With sequential consistency all processors see a single ordering of every read and write in the system. This is the easiest consistency model to understand, but in order to support greater parallel operation hardware complexity increases (e.g., rather than waiting to confirm that no ordering conflicts exist, a processor can speculatively continue execution and rollback to a previous known-correct state if an ordering conflict occurred).
A relaxed consistency model allows reads and writes to have inconsistent orderings among different processors. To provide stronger ordering guarantees, memory barrier operations are provided. These operations guarantee that certain types of memory accesses later in program order for the processor performing the barrier operation will be observed by all other processors as occurring after the barrier and certain types of memory accesses (earlier for that processor) will be observed by all processors before the barrier.
A system using a relaxed consistency model could provide the same behavior as a sequential consistency model system by using memory barriers after every memory access. However, systems using a relaxed model will generally not handle such excessive use of barriers well since they are designed to exploit the relaxed demands on memory ordering.
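The barrier idea in the last two paragraphs can be sketched with C++11 standalone fences (names invented): a release fence before a relaxed flag store and an acquire fence after a relaxed flag load give those relaxed operations exactly the ordering guarantee described, without making every access sequentially consistent.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int data_word = 0;               // plain (non-atomic) data
static std::atomic<bool> flag{false};   // relaxed flag + explicit fences

int fence_publish_demo() {
    data_word = 0;
    flag.store(false);
    std::thread producer([] {
        data_word = 99;                                      // plain write
        std::atomic_thread_fence(std::memory_order_release); // barrier: no
        flag.store(true, std::memory_order_relaxed);         // earlier write
    });                                                      // passes it
    int seen = -1;
    std::thread consumer([&] {
        while (!flag.load(std::memory_order_relaxed)) {}     // spin
        std::atomic_thread_fence(std::memory_order_acquire); // barrier: no
        seen = data_word;                                    // later read
    });                                                      // moves up
    producer.join();
    consumer.join();
    return seen;  // the fence pair guarantees 99 is observed
}
```

Using std::memory_order_seq_cst fences (or seq_cst operations) everywhere would recover sequentially consistent behavior, at the cost the paragraph above describes.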
