Sequential consistency with store buffers in a multiprocessor? - memory-management

How does a multiprocessor with write buffers maintain sequential consistency?
To my knowledge, in a uniprocessor, if the buffer is FIFO and a read of an element that is still pending to be written to main memory is supplied by the buffer, consistency is maintained.
But how does it work in an MP? I think that if a processor puts a store in its buffer, another processor can't read it, and that this breaks sequential consistency.
How does it work in a multithreaded environment with a write buffer per thread? Does that also break sequential consistency?

You referred to:
Typically, a CPU only sees random access; the fact that memory busses are accessed sequentially is hidden from the CPU itself, so from the point of view of the CPU, there's no FIFO involved here.
In modern SMP machines, there are so-called snoop control units that watch the memory transfers and invalidate the cached copy of the RAM if necessary. So there's dedicated hardware to make sure data is kept in sync. This doesn't mean it's really synchronous -- there's always more than one way to end up with stale data (for example, by having already loaded a memory value into a register before the other CPU core changed it), but that is what you were getting at.
Also, multiple threads are basically a software concept. So if you need to synchronize software FIFOs, you will need to use proper locking mechanisms.

I'm assuming X86 here.
The store sitting in the store buffer isn't in itself the problem. If, for example, a CPU only performed stores and the stores in the store buffer all retire in order, the behavior would be exactly the same as that of a processor without a store buffer. For SC the real-time order doesn't need to be preserved.
And as you already indicated, a processor will see its own stores in the store buffer in order. The part where SC gets violated is when a store is followed by a load to a different address.
So imagine
A=1
r1=B
Then without a store buffer, first the store of A would be written to cache/memory, and then B would be read from cache/memory.
But with a store buffer, the load of B can overtake the store of A, so the load reads from cache/memory before the store of A is written to cache/memory.
The typical example of where SC breaks with store buffers is Dekker's algorithm.
lock_a=1
while(lock_b==1){
    if(turn == b){
        lock_a=0
        while(lock_b==1);
        lock_a=1
    }
}
So at the top you can see the store lock_a=1 followed by the load of lock_b. Due to the store buffer these two can get reordered, and as a consequence two threads could enter the critical section.
One way to solve this is to add a [StoreLoad] fence between the store and the load, which prevents loads from being executed until the store buffer has been drained. This way SC is restored.
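As a minimal sketch (assuming C11 atomics; the names lock_a/lock_b mirror the pseudocode above, and the turn handling is omitted), the store/fence/load pattern could look like this:

#include <stdatomic.h>

atomic_int lock_a, lock_b;                                     /* shared flags, as in the pseudocode */

void lock_enter_a(void)                                        /* hypothetical entry path for thread A */
{
    atomic_store_explicit(&lock_a, 1, memory_order_relaxed);   /* the store */
    atomic_thread_fence(memory_order_seq_cst);                 /* acts as the [StoreLoad] fence; typically an mfence on x86 */
    while (atomic_load_explicit(&lock_b, memory_order_relaxed) == 1) {   /* the load */
        /* turn handling and back-off from the pseudocode omitted */
    }
}

Without the fence, the load of lock_b could be satisfied before the store of lock_a leaves the store buffer, which is exactly the reordering described above.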
Note 1: store buffers are per CPU; not per thread.
Note 2: store (and load) buffers are before the cache.

Related

What are the advantages of strict consistency over sequential consistency in the field of cache consistency?

When a lock is implemented, both strict consistency and sequential consistency only need CAS (TAS) instructions, and neither needs barriers. Since there is no observer of the physical world on the CPU, consistency from the observer's perspective has no practical meaning.
I'm not really familiar with the definition of strict consistency.
Below is a long story with a high risk of information overload, but I hope it will show how coherence fits in memory consistency models or at least give you enough structure to look for additional information.
Cache coherence is normally defined in terms of sequential consistency (SC) per location. SC doesn't need to respect the real-time order of requests; reads and writes can be skewed as long as the program order (PO) is preserved. This prevents SC from being composable, so if you have a cache of coherent locations (i.e. SC per location), the cache as a whole doesn't need to be SC (with PO per location). The consequence is that there is no guarantee that a total order over the memory order exists that can explain the execution. So in simple terms, you can't build an SC cache if the cache is only SC per location.
If coherence were implemented using linearizability per location instead of SC per location, then a linearizable cache could be made (with PO per location). This is also called 'atomic memory'.
With linearizability the real-time order of a request is respected. The advantage of linearizability is that it is composable. So if you have a system of linearizable locations, then the cache as a whole will be linearizable. As a consequence there always exists at least one total order over the memory order that explains the execution. So if the CPU prevents any reordering of loads/stores before they hit the cache, then in combination with a linearizable cache you can create an SC CPU.
A typical protocol for cache coherence is MESI: a write needs to wait till the cache line has been invalidated on all other CPUs before it can write the change to the cache line. The consequence of this approach is that MESI-based caches are linearizable.
Most CPUs have store buffers, so an older store can be reordered with a newer load to a different address, and as a consequence the memory order doesn't order older stores with newer loads to a different address. So [StoreLoad] is dropped as a requirement for the memory order. Dropping [StoreLoad] doesn't prevent you from having a total order over all memory accesses; it just means that the memory model doesn't care in which order those pairs appear in the total order over the memory order.
The key problem here is when a store is followed by a load to the same address. There are two possible solutions:
1 (a strict solution): The load needs to wait for the store to be committed to the cache before the load can be executed. The advantage of this approach is that loads and stores are properly ordered in the memory order and a total order over the memory order exists. This is the memory model of the IBM-370. So IBM-370 is SC with [StoreLoad] dropped.
2 (a relaxed solution): The load looks inside the store buffer. If there is a match, it returns the stored value. This is called store-to-load forwarding (STLF). The problem here is that it isn't possible to create a total order over the memory order because the store isn't atomic: a load is by definition globally ordered after the store it reads from, but because the load is performed (from the store buffer) before the store is globally performed (committed to the cache), the store and the load to the same address are not properly ordered in the memory order. This is demonstrated with the following test:
A=B=0
CPU1:
A=1
r1=A
r2=B
CPU2:
B=1
r3=B
r4=A
With STLF it can be that r1=1, r2=0, r3=1, r4=0, but with IBM-370/SC/linearizability this would not be possible. In the above example the load r1=A is ordered both after A=1 (it reads the stored value) and before A=1 (due to STLF it is performed before the store commits). So a total order over all memory actions doesn't exist, because the load would be ordered both before and after the store. Instead the requirement of the memory model is relaxed: only a total order over all stores needs to exist. And this is how we get Total Store Order (TSO), the memory model of the X86. So TSO is a relaxation of SC whereby [StoreLoad] is dropped + STLF is allowed.
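For concreteness, the same store-buffering test can be written with C11 atomics. This is only a sketch (the thread and variable names are mine): with memory_order_relaxed, or with plain stores/loads on x86/TSO, the outcome r1=1, r2=0, r3=1, r4=0 is allowed, while replacing relaxed with memory_order_seq_cst forbids it.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int A, B;                 /* both initially 0 */
int r1, r2, r3, r4;

void *cpu1(void *arg)
{
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&A, memory_order_relaxed);   /* satisfied from the store buffer (STLF) */
    r2 = atomic_load_explicit(&B, memory_order_relaxed);   /* may overtake the store of A */
    return arg;
}

void *cpu2(void *arg)
{
    atomic_store_explicit(&B, 1, memory_order_relaxed);
    r3 = atomic_load_explicit(&B, memory_order_relaxed);
    r4 = atomic_load_explicit(&A, memory_order_relaxed);
    return arg;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, cpu1, NULL);
    pthread_create(&t2, NULL, cpu2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* r1=1 r2=0 r3=1 r4=0 is allowed here; with memory_order_seq_cst it would be forbidden */
    printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
    return 0;
}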
We can relax the memory order further. With TSO we have the guarantee that at least one total order over all the stores exists, but this is because the cache is linearizable. If we relax this requirement, we get processor consistency (PC). So PC allows an older store to be reordered with a newer load, and requires a coherent cache, but writes to different addresses made by different CPUs can be seen out of order (so there is no total order over the stores).
This is demonstrated using the Independent Reads of Independent Writes (IRIW) litmus test:
A=B=0
CPU1:
A=1
CPU2:
B=1
CPU3:
r1=A
r2=B
CPU4:
r3=B
r4=A
Can it be that we see r1=1, r2=0, r3=1, r4=0? In other words, can CPU3 and CPU4 see the writes to A and B in different orders? If a total order over the stores exists (e.g. TSO/IBM-370/SC/linearizability), then this isn't possible. But under PC, it is allowed.
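Written with C11 atomics (thread bodies only, names mine), the test looks like the sketch below; with the default seq_cst accesses the outcome above is forbidden, which reflects the total order over stores that TSO and stronger models provide, while fully relaxed accesses would permit it on a non-store-atomic machine.

#include <stdatomic.h>

atomic_int A, B;                           /* both initially 0 */
int r1, r2, r3, r4;

void cpu1(void) { atomic_store(&A, 1); }   /* writer 1 */
void cpu2(void) { atomic_store(&B, 1); }   /* writer 2 */

void cpu3(void)                            /* reader 1 */
{
    r1 = atomic_load(&A);
    r2 = atomic_load(&B);
}

void cpu4(void)                            /* reader 2 */
{
    r3 = atomic_load(&B);
    r4 = atomic_load(&A);
}

/* With seq_cst (the default), r1=1, r2=0, r3=1, r4=0 cannot occur:
   both readers must agree on the order of the two independent writes. */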
I hope this example makes it clear that 'just' a coherent cache is still a pretty weak property.
Linearizability, SC and IBM-370 are also called atomic/store-atomic/single-copy store atomic because there is only a single copy of the data. There is a logical point where the store becomes visible to all CPUs.
TSO is called multi copy store atomic because a store can become visible to the issuing CPU early (STLF).
A memory model like PC is called non atomic (or non store atomic) because there is no logical moment where a store becomes visible to other CPUs.
A CAS instruction is not just sequentially consistent; it is linearizable. And depending on the architecture, a CAS involves fences. E.g. an atomic instruction like CMPXCHG on the X86 has an implicit LOCK which acts like a full barrier. So it is guaranteed to preserve all 4 orderings, although it only needs to preserve [StoreLoad], since the other fences are automatically provided.
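As a hedged illustration of that point, a simple spinlock built on a C11 CAS needs no extra fences on x86, because the locked read-modify-write itself acts as a full barrier (the function and variable names below are invented for the sketch):

#include <stdatomic.h>

atomic_int lock_word;                          /* 0 = free, 1 = held */

void spin_lock(void)
{
    int expected = 0;
    /* typically compiles to a LOCK CMPXCHG loop on x86; the LOCK prefix acts as a full barrier */
    while (!atomic_compare_exchange_weak(&lock_word, &expected, 1)) {
        expected = 0;                          /* CAS overwrites 'expected' on failure */
    }
}

void spin_unlock(void)
{
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}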
For more information about this topic see "A primer on memory consistency and cache coherence 2e" which is available for free.
Note 1:
A frequent requirement of a memory model is that some kind of total order over all loads and stores in that memory model exists that explains the execution. This order can be found using a topological sort.
Note 2:
Any requirement in the memory order can be violated as long as nobody is able to observe it.
Note 3:
If there is a total order of loads/stores (either per location or for all locations) a load needs to see the most recent store before it in the memory order.
Strict consistency is distinguishable from sequential consistency when implicit writes are present. Implicit writes are not unheard-of when dealing with I/O devices.
One obvious example would be a clock; a clock has an implicit write at every clock tick independent of reads.
A perhaps more meaningful example would be a buffer presented as a single word address. Writes to the buffer would only become visible after previous writes have been read, so even if such writes were visible to the consistency mechanism as updating that address, the order of the visibility of writes would depend on the order of the reads of the buffer. The writes might be effectively invisible to the consistency mechanism, either because they come from non-coherent I/O activity or because the interface specifies a different address for adding a value to the buffer than the address used for taking a value from the buffer (where a read from the write address might provide the number of buffer entries filled or the number vacant).
A shared pseudorandom number generator or access counter would have a similar read side effect of advancing the position in a "buffer".
The C programming language's volatile keyword informs the compiler that a variable can change without explicit writes, recognizing a distinction at the level of programming language between strict consistency and sequential consistency.
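A tiny example of that use of volatile (the register address is purely illustrative): reading a hypothetical memory-mapped clock register whose value advances through implicit writes, without any store from the program.

#include <stdint.h>

/* hypothetical memory-mapped tick counter; the address is made up for the example */
#define CLOCK_REG ((volatile uint32_t *)0x40001000u)

uint32_t read_ticks(void)
{
    /* volatile forces a real load on every call: the device performs "implicit writes"
       between reads, so the compiler must not cache or eliminate these accesses */
    return *CLOCK_REG;
}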

Does the store buffer send a read-invalidate message or an invalidate request message?

I think that, to let the CPU continue executing subsequent instructions, the store buffer must do part of the MESI processing to maintain cache consistency, because the latest value is stored in the store buffer and not in the cache. So the store buffer sends read-invalidate or invalidate request messages and flushes the latest value to the cache after the ACK arrives.
And the cache cannot do it.
Is my analysis and conclusion right?
Or should all MESI processing be done by the cache?
On most designs the store buffer wouldn't directly send invalidate requests and is usually not even snooped [1] by external requests. That is, it is part of the private/core side of the coherence domain and so doesn't need to participate in coherence. Instead, the store buffer ultimately interacts with the first level of the caching subsystem, which itself is responsible for the various parts of the MESI protocol.
How that interaction works exactly depends on the design, of course. A simple design may only process one store at a time: the oldest one, at the head of the store buffer, performing the RFO for that address and, when complete, moving on to the next element. A more sophisticated design might send RFOs for several "upcoming" requests in the store buffer in an attempt to exploit more MLP (memory-level parallelism). The exact mechanism isn't clear to me on x86: stores to L2 seem to perform quite poorly in some scenarios, but I'm pretty sure a bunch of store misses to RAM will perform much better than if they were handled serially.
[1] There are some exceptions: e.g. simultaneous multithreading (hyperthreading on x86), which involves two logical cores sharing all levels of cache and hence not being able to rely solely on the normal cache coherency mechanisms, may require store buffer snoops.

Are malloc memory pages attributes set to cacheable or non-cacheable?

When we use malloc and access the memory, what kind of page attributes do the physical pages backing this address space have -- are they cacheable or non-cacheable pages?
Ordinary memory -- whether for user-space or kernel -- is pretty much always marked cacheable. Otherwise, using that memory would entail a huge performance hit.
Generally speaking, the only time you want memory to be marked non-cacheable is when the memory is actually part of an external device (i.e. a device other than a memory chip): for example, a PCI device BAR region used to implement device control registers.
Caching is good for performance since reading and writing the cache is usually much faster than reading and writing the underlying RAM. And the caching can "bundle up" reads and writes so that those operations on the RAM chip are done significantly less often. The downside is that by using it you generally give up exact control over the reading and writing of the RAM.
The main RAM usually gets read and written at "random" times as determined by the cache controller, and it typically gets read and written in large blocks called "cache lines" -- blocks of 32, 64 or 128 bytes at a time. When you write a value to cached memory, that value may not get written to the actual RAM chip until some indeterminate later time (if ever: it might get overwritten before it is ever transferred out of the cache). This is of course all hidden from you as a user of the memory -- you generally don't even need to be aware of it.
But if the memory being written to is a control register -- setting some mode or characteristic of a device, for example -- then you want the value of that register to be set exactly when you write to it, not at some indeterminate later time, and you don't want the write to that register to affect any other registers that may be located near it in the address space.
Likewise, if you read the value of a status register, it might be "volatile": i.e. its value might change between two consecutive reads of the same register, so you don't want the value cached. And reading a register might have side effects, so you only want explicit reads to access it.

What is the cache's role when writing to memory?

I have a function that does very little reading but a lot of writing to RAM. When I run it multiple times on the same core (the main thread), it runs about 5x as fast as when I launch the function on a new thread every run (which doesn't guarantee the same core is used between runs), as I launch and join between runs.
This suggests the cache is being used heavily for the write process, but I don't understand how. I thought the cache was only useful for reads.
Modern processors usually have write-buffers. The reason is that writes are, to a first approximation, pure sinks. The processor doesn't usually have to wait for the store to reach the coherent memory hierarchy before it executes the next instruction.
(Aside: Obviously stores are not pure sinks. A later read from the written-to memory location should return the written value, so the processor must snoop the write-buffer, and either stall the read or forward the written value to it)
Obviously such buffer(s) are of finite size, so when the buffers are full the next store in the program can't be executed and stalls until a slot in the buffer is made available by an older store becoming architecturally visible.
Ordinarily, the way a write leaves the buffer is when the value is written to the cache (since a lot of written data is read back again quickly; think of the program stack as an example). If the write only sets part of the cache line, the rest of the cache line must remain unmodified, and consequently it must be loaded from the memory hierarchy.
There are ways to avoid loading the old cacheline, like non-temporal stores, write-combining memory or cacheline-zeroing instructions.
Non-temporal stores and write-combining memory combine adjacent writes to fill a whole cacheline, sending the new cacheline to the memory hierarchy to replace the old one.
POWER has an instruction that zeroes a full cacheline (dcbz), which also removes the need to load the old value from memory.
x86 with AVX512 has cacheline-sized registers, which suggests that an aligned zmm-register store could avoid loading the old cacheline (though I do not know whether it does).
Note that many of these techniques are not consistent with the usual memory-ordering of the respective processor architectures. Using them may require additional fences/barriers in multi-threaded operation.
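For example, on x86 the non-temporal-store idea might be sketched with SSE2 intrinsics roughly as follows (a simplified illustration, assuming a 16-byte-aligned destination and a size that is a multiple of 16; the sfence at the end is the extra ordering the previous paragraph mentions):

#include <stddef.h>
#include <emmintrin.h>   /* SSE2: _mm_set1_epi32, _mm_stream_si128; also pulls in _mm_sfence */

/* Fill 'dst' without first pulling the old cache lines into the cache. */
void fill_nontemporal(void *dst, int value, size_t n)
{
    __m128i v = _mm_set1_epi32(value);
    char *p = dst;
    for (size_t i = 0; i < n; i += 16)
        _mm_stream_si128((__m128i *)(p + i), v);   /* non-temporal (write-combining) store */
    _mm_sfence();   /* order the streaming stores before later normal stores */
}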

Why do we use a write buffer in MIPS? [cache]

In a Computer Architecture lecture, I learned the function of the write buffer: to hold data waiting to be written to memory. My professor just said that it improves time performance.
However, I'm really curious how it improves time performance.
Could you explain more precisely how a write buffer works?
The paper Design Issues and Tradeoffs for Write Buffers describes the purpose of write buffers as follows:
In a system with a write-through first-level cache, a write buffer has
two essential functions: it absorbs processor writes (store
instructions) at a rate faster than the next-level cache could,
thereby preventing processor stalls; and it aggregates writes to the
same cache block, thereby reducing traffic to the next-level cache.
To put this another way, the two primary benefits are:
If the processor has a burst of writes that occur faster than the cache can respond, then the write buffer can store multiple outstanding writes that are waiting to go to the cache. This improves performance because some of the other instructions won't be writes and thus they can continue executing instead of being stalled.
If there are multiple writes to different words in the write buffer that go to the same cache line, then these writes can be grouped together into a single write to the cache line. This improves performance because it reduces the total number of writes that need to go to the cache (since the cache line contains multiple words). A toy sketch of this coalescing follows below.
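To make the coalescing point concrete, here is a small toy model (not real hardware; all names and sizes are invented) of a FIFO write buffer that merges byte writes falling into the same cache block before they are drained to the next level:

#include <stdint.h>
#include <string.h>

#define LINE_SIZE   32            /* bytes per cache block in this toy model */
#define BUF_ENTRIES 4             /* write-buffer depth */

struct wb_entry {
    uint32_t line_addr;           /* address of the cache block */
    uint8_t  data[LINE_SIZE];     /* partial line being assembled */
    uint32_t valid_mask;          /* which bytes of the line have been written */
};

struct wb_entry buf[BUF_ENTRIES];
int used;                         /* number of occupied entries (FIFO order) */

/* Returns 1 if the write was absorbed, 0 if the buffer is full (the CPU would stall). */
int wb_write(uint32_t addr, uint8_t byte)
{
    uint32_t line = addr & ~(uint32_t)(LINE_SIZE - 1);
    for (int i = 0; i < used; i++) {
        if (buf[i].line_addr == line) {           /* coalesce: this block is already buffered */
            buf[i].data[addr % LINE_SIZE] = byte;
            buf[i].valid_mask |= 1u << (addr % LINE_SIZE);
            return 1;
        }
    }
    if (used == BUF_ENTRIES)
        return 0;                                 /* full: processor must stall */
    memset(&buf[used], 0, sizeof buf[used]);      /* allocate a fresh entry at the tail */
    buf[used].line_addr = line;
    buf[used].data[addr % LINE_SIZE] = byte;
    buf[used].valid_mask = 1u << (addr % LINE_SIZE);
    used++;
    return 1;
}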
