Cache coherence for data-race-free programs - multiprocessing

All examples I've ever seen on when cache coherence is relevant are code examples that are data races (two cores simultaneously write to the same memory location).
When it comes to memory consistency, hardware vendors have decided not to provide serial consistency guarantees and e.g. C++11 has adopted the SC for DRF memory model, which basically says, if you want serial consistency, make sure your program doesn't have data races.
Why isn't the same approach applied to cache coherency? That is, a data-race-free program doesn't need transparent cache coherency; cache lines would be synchronized exactly where the programmer/compiler inserted a synchronization barrier.
Or put another way: why worry about cache coherence if it is only relevant for race conditions?
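(For reference, "data-race-free" here is meant in the C++11 sense: every pair of conflicting accesses is ordered by synchronization. A minimal sketch, with illustrative names:)

    #include <mutex>
    #include <thread>

    std::mutex m;
    int shared_counter = 0;   // plain data, only ever touched under the lock

    void worker() {
        // The lock/unlock pair is the only synchronization point; between
        // two such points the compiler and hardware may reorder and cache
        // values freely.
        std::lock_guard<std::mutex> guard(m);
        ++shared_counter;
    }

    int main() {
        std::thread t1(worker), t2(worker);
        t1.join();
        t2.join();
        // Data-race-free, so SC-for-DRF guarantees sequentially consistent
        // behaviour: shared_counter == 2 here.
    }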

All examples I've ever seen on when cache coherence is relevant are code examples that are data races (two cores simultaneously write to the same memory location).
A concurrent write/write or a concurrent read/write to the same memory location is a data-race; not just a concurrent write/write.
When it comes to memory consistency, hardware vendors have decided not to provide serial consistency guarantees
You mean sequential consistency.
Or put another way: why worry about cache coherence if it is only relevant for race conditions?
I guess you mean data-races. Race conditions are a higher-level problem.
I'm not sure what your question is. You need to understand that coherence is equivalent to sequential consistency for a single location. So any system that is sequentially consistent is automatically coherent.
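To make "sequential consistency for a single location" concrete, here is a sketch of the classic read-read coherence litmus test in C++11 form (relaxed atomics, so only the coherence guarantee is in play; all names are illustrative):

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0};
    int a1, a2, b1, b2;

    int main() {
        std::thread w1([] { x.store(1, std::memory_order_relaxed); });
        std::thread w2([] { x.store(2, std::memory_order_relaxed); });
        std::thread ra([] { a1 = x.load(std::memory_order_relaxed);
                            a2 = x.load(std::memory_order_relaxed); });
        std::thread rb([] { b1 = x.load(std::memory_order_relaxed);
                            b2 = x.load(std::memory_order_relaxed); });
        w1.join(); w2.join(); ra.join(); rb.join();
        // Coherence forbids the readers from disagreeing on the order of
        // stores to the single location x: the combined outcome
        //   a1 == 1 && a2 == 2   (reader A saw 1, then 2)
        //   b1 == 2 && b2 == 1   (reader B saw 2, then 1)
        // is impossible, even with relaxed ordering and no fences.
    }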

Related

Sequential program cache consistency

I wanted to ask you a question regarding the consistency of the cache memory.
If I have a sequential program, I shouldn't have cache consistency problems, because the instructions are executed sequentially and consequently there is no danger that several processors write to the same memory location at the same time, even when there is shared memory.
The situation is different when I have a parallel program: it runs on multiple processors, so there is a high probability of cache consistency problems.
Is that right?
In a single-threaded program, execution doesn't migrate between cores by itself; only the OS moves it (and when it does, the thread's state is re-loaded from memory into the new core's cache, so there is no coherence problem there).
In a multi-threaded program, an update to a variable that is also present in other caches needs to inform those caches somehow, which causes data to flow through those caches again. This may not block the updating thread, but as soon as a user wants only up-to-date values, the synchronization/locking takes a performance hit. It is especially bad when other variables at nearby addresses are updated too, close enough to land in the same cache line (false sharing). That's why an array of locks with 20-byte elements performs worse than one with 128-byte elements.
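A sketch of the padding idea in C++, assuming 64-byte cache lines (the 128-byte figure above would also cover adjacent-line prefetching):

    #include <mutex>

    // Unpadded: several mutexes share one cache line, so threads taking
    // *different* locks still ping-pong the same line between cores
    // (false sharing).
    std::mutex locks_bad[16];

    // Padded: each mutex occupies its own cache line, so contention on
    // one lock no longer slows down users of its neighbours.
    struct alignas(64) PaddedMutex {
        std::mutex m;
    };
    PaddedMutex locks_good[16];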
If CPUs did not have coherence, multi-threading wouldn't work efficiently. So, in some designs, they chose to broadcast every update to all caches (as in snooping caches). But this is not efficient for a high number of cores: if 1000 cores existed in the same CPU, it would require 1000-way broadcast logic consuming a lot of circuit area. So designers break the problem into smaller parts and add other mechanisms, like directory-based coherence across multiple clusters of cores, but this adds latency to the coherence.
On the other hand, many GPUs do not implement automatic cache coherence because
the algorithms developers give them are generally embarrassingly parallel, with only a few synchronization points, and blocks of threads do not need to communicate with other blocks (when they do, they go through a common cache via the developer's explicit choice of instructions anyway)
there are thousands of streaming pipelines (not real cores) that just need to issue memory requests efficiently; there wouldn't be enough die area for that many pipelines otherwise
high throughput is required instead of low latency (so there is no need for implicit coherence anywhere)
so the multiprocessors in a GPU are designed to do completely independent work, and adding automatic coherence would add little performance, if it didn't subtract from it. When a developer needs to synchronize data between multiple threads of the same block in a GPU, there are explicit instructions for this; without them, no update is guaranteed to be visible. So cache coherence in a GPU is effectively opt-in.

Can CPU Out-of-Order-Execution cause memory reordering?

I know store buffer and invalidate queues are reasons that cause memory reordering. What I don't know is if Out-of-Order-Execution can cause memory reordering.
In my opinion, Out-of-Order-Execution can't cause reordering because the results are always retired in-order as mentioned in this question.
To make my question clearer, let's say we have a relaxed memory consistency architecture like this:
It doesn't have store buffers or invalidate queues
It can do Out-of-Order-Execution
Can memory reordering still happen in this architecture?
Does a memory barrier have two functions: one forbidding out-of-order execution, the other flushing the invalidation queue and draining the store buffer?
Yes, out-of-order execution can definitely cause memory reordering, such as load/load reordering.
It is not so much a question of the loads being retired in order, as of when the load value is bound to the load instruction. E.g., Load1 may precede Load2 in program order, yet Load2 gets its value from memory before Load1 does; if there is an intervening store to the location read by Load2, then load/load reordering has occurred.
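Expressed as a C++11 litmus test (relaxed atomics, so the hardware is free to bind the loads out of order; all names are illustrative):

    #include <atomic>
    #include <thread>

    std::atomic<int> data{0}, flag{0};
    int r1, r2;

    int main() {
        std::thread producer([] {
            data.store(42, std::memory_order_relaxed);
            flag.store(1, std::memory_order_release);    // publish
        });
        std::thread consumer([] {
            r1 = flag.load(std::memory_order_relaxed);   // Load1 in program order
            r2 = data.load(std::memory_order_relaxed);   // Load2
        });
        producer.join(); consumer.join();
        // On hardware that reorders loads, r1 == 1 && r2 == 0 is observable:
        // Load2 was bound to a value from memory before Load1, with the
        // store to `data` intervening. Making the flag load an acquire
        // (memory_order_acquire) forbids this outcome.
    }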
However, certain systems, such as Intel P6 family systems, have additional mechanisms to detect such conditions to obtain stronger memory order models.
In these systems all loads are buffered until retirement, and if a store that may conflict with such a buffered but not yet retired load is detected, then the load and the instructions after it in program order are “nuked”, and execution is resumed at, e.g., Load2.
I call this Freye’s Rule snooping, after I learned that Brad Freye at IBM had invented it many years before I thought I had. I believe the standard academic reference is Gharachorloo.
I.e. it is not so much buffering loads until retirement, as it is providing such a detection and correction mechanism associated with buffering loads until retirement. Many CPUs provide buffering until retirement but do not provide this detection mechanism.
Note also that this requires something like snoop based cache coherence. Many systems, including Intel systems that have such mechanisms also support noncoherent memory, e.g. memory that may be cached but which is managed by software. If speculative loads are allowed to such cacheable but non-coherent memory regions, the Freye’s Rule mechanism will not work and memory will be weakly ordered.
Note: I said “buffer until retirement”, but if you think about it you can easily come up with ways of buffering not quite until retirement. E.g., you can stop this snooping when all earlier loads have themselves been bound, and there is no longer any possibility of an intervening store being observed, even transitively.
This can be important, because there is quite a lot of performance to be gained by “early retirement“, removing instructions such as loads from buffering and repair mechanisms before all earlier instructions have retired. Early retirement can greatly reduce the cost of out of order hardware mechanisms.

Memory models / cache coherence protocols: how TSO goes together with MESIF

Having just worked through my system's programming lecture material, I stumbled upon the crucial concepts of memory models as well as cache coherence protocols. Although they make sense as independent concepts, it is not really clear how they go together. Specifically, when looking at x86, I am working with an ISA enforcing the TSO memory model, and a CPU (in the case of Intel) using the MESIF cache coherence protocol.
In the beginning, the professor introduced cache coherence protocols as a means of ensuring that, to any core in the chip, it appears as if they all access one large, monolithic block of memory. Then, after wrapping up cache coherence, he continued with memory models, specifically TSO (we were introduced to linearizability / sequential consistency in our parallel programming class already). The following is a direct quote from the lecture material about the x86 memory model:
Standard for 64-bit x86 processors
Sometimes called Total Store Ordering (TSO)
Earlier 32-bit x86 implemented PRAM – weaker!
Write-to-read relaxation: later reads can bypass earlier writes
All processors see writes from one processor in the order they were issued.
Processors can see different interleavings of writes from different processors.
It seems as if we "solved" the problem of slow sequential consistency by introducing what is (yet another) layer in the cache hierarchy, namely the (ordered) store buffer.
To me, TSO seems orthogonal to the principles of cache coherence. We worked so hard to get our caches to match, only to add another layer in between not covered by cache coherence.
Questions:
Why are the store buffers not covered by cache coherence protocols? Is it assumed that writeback to L1 from store buffers is so fast, that inconsistencies due to intermediate writebacks would not be an issue in the majority of cases? (i.e. I reckon store buffer -> L1 transfer only takes a few cycles, so as soon as L1 receives the data, it sends a transaction across the bus, telling other cores to invalidate their copy)
How should I think of the two concepts of cache coherence as well as memory model? The way I understand it, memory models are the theoretical concepts of what we want, and cache coherence is part of the practical implementation to achieve said model.
The sequential consistency model is the most commonly prescribed memory model for shared-memory parallel programming. For a parallel program comprising multiple tasks or threads, sequential consistency requires the two conditions described below.
Program order execution: All memory operations in each task appear to execute in that task's program order.
Memory access atomicity: Memory operations (in all tasks of the parallel program) appear to execute one-at-a-time.
Every programmer assumes these conditions to reason about their parallel programs.
Unfortunately, sequential consistency is a less useful model than imagined. The main reason is the implementation cost of these two properties: enforcing them prohibits many basic compiler and hardware optimizations [1].
Other weak/relaxed memory models have been proposed that relax these properties and allow compiler and hardware optimizations. These weak memory models trade programmability for performance.
Why are the store buffers not covered by cache coherence protocols?
It is a design choice of TSO for performance reasons. Serving a load from the store buffer, or serving a load while its preceding store (to a different address) is still in the store buffer, reduces the store latency. To keep store buffers coherent, the load would have to wait until all other processors had acknowledged receipt of the invalidations generated by the store. Moreover, most of the time there may not be any copies of the stored address in other caches (the variable is local to a task), and then waiting for the acknowledgements is a waste of time. In case other tasks share this variable, waiting for the acknowledgements can be enforced explicitly by using atomic or fence instructions.
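The reordering this design choice permits is exactly the classic store-buffer litmus test. In C++11 form (illustrative names):

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);   // sits in t1's store buffer
            r1 = y.load(std::memory_order_relaxed);  // may be served before the
                                                     // store is visible to t2
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join(); t2.join();
        // Under TSO (and anything weaker), r1 == 0 && r2 == 0 is observable:
        // each load was served while the other core's store was still in its
        // store buffer. Making all four accesses memory_order_seq_cst (which
        // on x86 puts a full barrier or locked operation after each store)
        // forbids this outcome.
    }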
How should I think of the two concepts of cache coherence as well as memory model?
Cache coherence protocols are concerned with serializing stores to the same memory location and ensuring that a load returns the value of the most recent store to the same memory location. Cache coherence protocols are required only when there are caches or multiple copies of the same memory location, and its job is to keep all the copies coherent.
A memory consistency model is concerned with the relative order of loads and stores (of the same task) to different memory locations. Any system that involves executing shared-memory parallel programs (multiple tasks or threads communicating through a shared memory) must define its memory consistency model.
Broadly, cache coherence protocols implement a part of the memory consistency model. More precisely, the combination of the core pipeline and the cache coherence protocol (and every other component that a memory instruction traverses) must adhere to the memory model specification.
[1]: Adve and Gharachorloo, "Shared Memory Consistency Models: A Tutorial".

What is the point of MESI on Intel 64 and IA-32

The point of MESI is to retain a notion of a shared memory system.
However, with store buffers, things are complicated:
Memory is coherent downstream of the store buffers, i.e., once the data hits the MESI-implementing caches.
However, upstream of that, each core may disagree on what is in memory location X, dependent on what is in each core's local store buffer.
As such, it seems that, from the viewpoint of each core, the state of memory is different - it is not coherent.
So, why do we bother "partially" enforcing coherency with MESI?
Edit: A substantial edit was made, after some further narrowing of what was really confusing me. I have tried to keep the general notion of the question the same, to preserve the relevance of the great answers received.
The point of MESI on x86 is the same as on pretty much any multiple core/CPU system: to enforce cache coherence. There is no "partial coherency" used for the cache coherency part of the equation on x86: the caches are fully coherent. The possible re-orderings, then, are a result of both the coherent caching system and the interaction with core-local components such as the load/store subsystem (especially store buffers) and other out-of-order machinery.
The result of that interaction is the architected strong memory model that x86 provides, with only limited re-ordering. Without coherent caches, you couldn't reasonably implement this model at all, or almost any model that was anything other than completely weak1.
Your question seems to embed the assumption that the only possible states are "coherent" and "everything else". Also, there is some mixing of the ideas of cache coherency (which mostly deals with the caches specifically, and is mostly a hidden detail), and the memory consistency model, which is architecturally defined and will be implemented by each architecture2. Wikipedia explains that one difference between cache coherency and memory consistency is that the rules for the former apply only to one location at a time, whereas consistency rules apply across locations. In practice, the more important distinction is that the memory consistency model is the only architecturally documented one.
Briefly, Intel (and AMD likewise) define a specific memory consistency model, x86-TSO3 - which is relatively strong as far as memory models go, but is still weaker than sequential consistency. The primary behaviors weakened compared to sequential consistency are:
That later loads can pass earlier stores.
That stores can be seen in a different order from the total store order, but only by cores that performed one of the stores.
In order to implement this memory model, various parts must play by the rules to achieve it. On all recent x86, this means ordered load and store buffers, which avoid the disallowed re-orderings. The use of a store buffer results in the two re-orderings mentioned above: without allowing those, the implementation would be very restricted and probably much slower. In practice it also means fully coherent data caches, since many of the guarantees (e.g., no load-load reordering) would be very difficult to implement without that.
To wrap it all up:
Memory consistency is different than cache coherency: the former is what is documented and forms part of the programming model.
In practice, x86 implementations have fully coherent caches, which helps them implement their x86-TSO memory model, which is fairly strong but weaker than sequential consistency.
Finally, perhaps the answer you were looking for, in different words: a memory model weaker than sequential consistency is still very useful since you can program against it, and in the case you need sequential consistency for some particular operation(s) you insert the right memory barriers4.
If you program against a language-supplied memory model, such as Java's or C++11's, you don't need to worry about the hardware specifics, but rather about the language memory model, and the compiler inserts the barriers required to match the language memory model semantics to the hardware one. The stronger the hardware model, the fewer the barriers required.
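As an illustration of that mapping (the exact instructions are compiler- and target-dependent; this is just the typical x86 lowering):

    #include <atomic>

    std::atomic<int> g{0};

    void publish(int v) {
        // Release store: on x86-TSO a plain mov suffices, because the
        // hardware already forbids the reorderings that release must
        // rule out. No barrier instruction is emitted.
        g.store(v, std::memory_order_release);

        // Sequentially consistent store: the compiler must additionally
        // prevent store-load reordering, typically via xchg (or
        // mov + mfence) on x86.
        g.store(v, std::memory_order_seq_cst);
    }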
1 If your memory model was completely weak, i.e., not really placing any restrictions on cross-core reordering, I suppose you could implement it directly on a non-cache coherent system in a cheap way for normal operations, but then memory barriers potentially become very expensive since they would need to flush a potentially large part of the local private cache.
2 Various chips may implement it differently internally, and in particular some chips may implement stronger semantics than the model (i.e., some allowed re-orderings can never be observed), but absent bugs none will implement a weaker one.
3 This is the name given to it in that paper, which I used because Intel themselves don't give it a name, and the paper gives a more formal definition than Intel, who present a less formal model as a series of litmus tests.
4 In practice on x86 you usually use locked instructions (using the lock prefix) rather than separate barriers, although standalone barriers also exist. Here I'll just use the term barriers to refer to both standalone barriers and the barrier semantics embedded in locked instructions.
Re: your edit, which seems to be a new question: right, store-forwarding "violates" coherency. A core can see its own stores earlier than any other core can see them. The store buffer is not coherent.
The x86 memory ordering rules require that loads become globally visible in program order, but allows a core to load data from its own stores before they become globally visible. It doesn't have to pretend it waited and check for memory-order mis-speculation, like it does in other cases of doing loads earlier than the memory model says it should.
Also related; Can x86 reorder a narrow store with a wider load that fully contains it? is a specific example of the store buffer + store-forwarding violating the usual memory-ordering rules. See this collection of mailing list posts by Linus Torvalds explaining the effect of store-forwarding on memory ordering (and how it means that the proposed locking scheme doesn't work).
Without coherency at all, how would you atomically increment a shared counter, or implement other atomic read-modify-write operations that are essential for implementing locks, or for use directly in lockless code. (See Can num++ be atomic for 'int num'?).
lock add [shared_counter], 1 in multiple threads at the same time never loses any counts on actual x86, because the lock prefix makes a core keep exclusive ownership of the cache line from the load until the store commits to L1d (and thus becomes globally visible).
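The C++11 equivalent, which compilers lower to exactly such a locked read-modify-write on x86:

    #include <atomic>

    std::atomic<long> shared_counter{0};

    void hit() {
        // Compiles to a `lock add`/`lock xadd` on x86: the core keeps the
        // cache line in Modified state from the load until the store
        // commits, so no increments are lost however many threads call
        // hit() concurrently.
        shared_counter.fetch_add(1, std::memory_order_relaxed);
    }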
A system without coherent caches would let each thread increment its own copy of a shared counter, and then the final value in memory would come from whichever thread last flushed that line.
Allowing different caches to hold conflicting data for the same line long-term even while other loads/stores happened, and across memory barriers, would allow all sorts of weirdness.
It would also violate the assumption that a pure store becomes visible to other cores promptly. If you didn't have coherency at all, then cores could keep using their cached copy of a shared variable. So if you wanted readers to notice updates, you'd have to clflush before every read of the shared variable, making the common case expensive (when nobody has modified the data since you last checked).
MESI is like a push notification system, instead of forcing every reader to re-validate their cache on every read.
MESI (or coherency in general) allows things like RCU (Read-Copy-Update) to have zero overhead for readers (compared to single threaded) in the case where the shared data structure hasn't been modified. See https://lwn.net/Articles/262464/, and https://en.wikipedia.org/wiki/Read-copy-update. The basic idea is that instead of locking a data structure, a writer copies the whole thing, modifies the copy, and then updates a shared pointer to point to the new version. So readers are always entirely wait-free; they just dereference an (atomic) pointer, and data stays hot in their L1d caches.
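A minimal sketch of that pointer-publication idea (not a full RCU implementation; in particular, reclamation of the old version is omitted and the old object is simply leaked):

    #include <atomic>
    #include <string>

    struct Config {
        std::string endpoint;
        int timeout_ms;
    };

    std::atomic<const Config*> current{new Config{"localhost", 100}};

    // Reader: wait-free, a single atomic pointer load; the data stays hot
    // in its L1d cache as long as no writer publishes a new version.
    int read_timeout() {
        return current.load(std::memory_order_acquire)->timeout_ms;
    }

    // Writer: copy the whole structure, modify the copy, publish it with
    // one pointer swap. A real RCU defers freeing the old Config until all
    // readers are done with it.
    void update_timeout(int t) {
        const Config* old = current.load(std::memory_order_relaxed);
        Config* fresh = new Config(*old);
        fresh->timeout_ms = t;
        current.store(fresh, std::memory_order_release);
        // `old` is intentionally leaked here (reclamation omitted).
    }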
Hardware-supported coherency is extremely valuable, and almost every shared-memory SMP architecture uses it. Even ISAs with much weaker memory-ordering rules than x86, like PowerPC, use MESI.

how do processor knows about the latest copy of cache line in multiprocessor system

In a multiprocessor system where each processor has its own cache, how does a processor know where to get its copy of the data from?
The data may be present in its own cache, in the caches of other processors, or in main memory; how does the processor know which copy is the latest one?
Most processors (in particular the x86 in our laptops, desktops, and servers) have some hardware-provided cache coherence.
Often, some synchronization memory barrier instructions exist.
It is rumored that some synchronization machine instructions could be quite slow.
Actually, the recent C++11 and C11 standards have specific wording and atomic data types to deal with this, like C++11 std::atomic.
In practice, you should use some well-established standard library like pthreads (or C++11 std::thread, etc.).
In a typical modern cache coherent system, if the contents of a memory address are present in multiple caches, their content will be the same. Using the typical invalidation-based coherence mechanism, in order for a processor to change the content, it must gain exclusive ownership of that block of memory. This is done by invalidating any copies. Any subsequent request from a processor that previously had the block cached would result in a miss (the block was invalidated) and a coherence action will find the updated content in the writing processor's cache.
(In earlier implementations of cache coherence with write-through caches, a common bus to memory could be snooped to grab any content changes. Similarly, a processor changing content could broadcast or multicast the changes to any sharers. These methods would keep cached contents the same.)
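A toy sketch of the invalidation logic described above (states and events heavily simplified; real protocols add request queues, acknowledgements, and transient states):

    enum class State { Modified, Exclusive, Shared, Invalid };
    enum class Event {
        LocalRead, LocalWrite,   // requests from this cache's own core
        RemoteRead, RemoteWrite  // requests snooped from other caches
    };

    // Next state of one cache line in one cache, invalidation-based (MESI).
    State next(State s, Event e, bool othersHaveCopy) {
        switch (e) {
        case Event::LocalRead:
            if (s == State::Invalid)                   // read miss: fetch the line,
                return othersHaveCopy ? State::Shared  // Shared if others cache it
                                      : State::Exclusive;
            return s;                                  // read hit: no change
        case Event::LocalWrite:
            return State::Modified;   // gain exclusive ownership, invalidating
                                      // every other copy along the way
        case Event::RemoteRead:
            if (s == State::Modified || s == State::Exclusive)
                return State::Shared; // supply/write back the data, keep a copy
            return s;
        case Event::RemoteWrite:
            return State::Invalid;    // another core took ownership
        }
        return s;
    }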
A more subtle aspect of this process is memory consistency: how different processors see the ordering of memory accesses to different addresses. With sequential consistency all processors see a single ordering of every read and write in the system. This is the easiest consistency model to understand, but in order to support greater parallel operation, hardware complexity increases (e.g., rather than waiting to confirm that no ordering conflicts exist, a processor can speculatively continue execution and roll back to a previous known-correct state if an ordering conflict occurred).
A relaxed consistency model allows reads and writes to have inconsistent orderings among different processors. To provide stronger ordering guarantees, memory barrier operations are provided. These operations guarantee that certain types of memory accesses later in program order for the processor performing the barrier operation will be observed by all other processors as occurring after the barrier and certain types of memory accesses (earlier for that processor) will be observed by all processors before the barrier.
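In C++11 terms, such a barrier pair looks like the following sketch (only the fences order the surrounding accesses; the loads and stores themselves are relaxed):

    #include <atomic>

    std::atomic<int> payload{0}, ready{0};

    void producer() {
        payload.store(42, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release); // writes before the
        ready.store(1, std::memory_order_relaxed);           // fence become visible
    }                                                        // before `ready`

    void consumer() {
        while (ready.load(std::memory_order_relaxed) != 1) { /* spin */ }
        std::atomic_thread_fence(std::memory_order_acquire); // reads after the fence
        int v = payload.load(std::memory_order_relaxed);     // see writes before the
        (void)v;                                             // release fence: v == 42
    }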
A system using a relaxed consistency model could provide the same behavior as a sequential consistency model system by using memory barriers after every memory access. However, systems using a relaxed model will generally not handle such excessive use of barriers well since they are designed to exploit the relaxed demands on memory ordering.
