Does the Cache Coherency issue apply to UMA architectures as well? - caching

I have learned that Shared Memory computer architectures can be divided in Uniform Memory Access (UMA) and Non-uniform Memory Access (NUMA), depending on whether the access times to a given memory location are the same for all processors or not.
I've also learned that NUMA architectures can be further divided into cache-coherent and non-cache-coherent, based on whether they have a mechanism for propagating (or invalidating) modified data from one processor's (or core's) cache to another's, leading to the term "ccNUMA". (Please correct me if I got anything wrong...)
Based on this question, it is also my understanding that the term NUMA specifically refers to access times to main memory, not cache, so that even though most multiprocessor systems have necessarily distributed caches, these systems are still called UMA if they have uniform access to main memory.
What I don't understand is this: why is the concept of a "ccUMA" architecture rarely mentioned? For example, wikipedia only has a page for ccNUMA (which redirects to NUMA), not for ccUMA, and the page for Cache Coherence doesn't explicitly refer to either (except that it links to Distributed Shared Memory, which seems to be roughly equivalent to NUMA...) Also, a google search for ccUMA returns far less results than ccNUMA...
Does the cache coherency problem not apply on UMA architectures? It seems to me that it does, but why is it never mentioned then?

Related

Memory models-/Cache coherence protocols: How TSO goes together with MESIF

Having just worked through my system's programming lecture material, I stumbled upon the crucial concepts of memory models as well as cache coherence protocols. Although they make sense as independent concepts, it is not really clear how they go together. Specifically, when looking at x86, I am working with an ISA enforcing the TSO memory model, and a CPU (in the case of Intel) using the MESIF cache coherence protocol.
In the beginning, the professor introduced cache coherence protocols as means of ensuring that to any core in the chip, it appears as if they all access one large, monolithic block of memory. Then, after wrapping up with cache coherence, he continued with memory models, specifically TSO (we were introduced to linearizability-/sequential consistency in our parallel programming class already). The following is a direct quote from the lecture material about the x86 memory model:
Standard for 64-bit x86 processors
Sometimes called Total Store Ordering (TSO)
Earlier 32-bit x86 implemented PRAM – weaker!
Write-to-read relaxation: later reads can bypass earlier writes
All processors see writes from one processor in the order they were issued.
Processors can see different interleavings of writes from different processors.
It seems as if we "solved" the problem of slow sequential consistency by introducing what is (yet another) layer in the cache hierarchy, namely the (ordered) store buffer.
To me, TSO seems orthogonal to the principles of cache coherence. We worked so hard to get our caches to match, only to add another layer in between not covered by cache coherence.
Questions:
Why are the store buffers not covered by cache coherence protocols? Is it assumed that writeback to L1 from store buffers is so fast, that inconsistencies due to intermediate writebacks would not be an issue in the majority of cases? (i.e. I reckon store buffer -> L1 transfer only takes a few cycles, so as soon as L1 receives the data, it sends a transaction across the bus, telling other cores to invalidate their copy)
How should I think of the two concepts of cache coherence as well as memory model? The way I understand it, memory models are the theoretical concepts of what we want, and cache coherence is part of the practical implementation to achieve said model.
Thank you so much in advance for the clarifications!
Best,
Felix
The sequential consistency model is the most commonly prescribed memory model for shared memory parallel programming. A parallel-program comprising multiple tasks or threads, sequential consistency requires two conditions as described below.
Program order execution: All memory operations in each task appear to execute in that task's program order.
Memory access atomicity: Memory operations (in all tasks of the parallel program) appear to execute one-at-a-time.
Every programmer assumes these conditions to reason about their parallel programs.
Unfortunately, sequential consistency is a less useful model than imagined. The main reason is the implementation cost of these two properties. Enforcing these
properties prohibit many basic compiler and hardware optimizations[1].
Other weak/relaxed memory models are proposed that relax these properties and allow compiler and hardware optimizations. These weak memory models trade
programmability for performance.
Why are the store buffers not covered by cache coherence protocols?
It is a design choice of TSO for performance reasons. Serving a load from the store buffer or serving a load when its preceding store (of different address) is still in the store buffer, reduces the store latency. To keep store buffers coherent, the load has to wait until all other processors have acknowledged receipt of the invalidates generated by the store. Moreover, most of the time, there may not be any copies of the store address in other caches (the variable is local to a task), then waiting for the acknowledges is a waste of time. In case other tasks share this variable, then waiting for the acknowledges can be explicitly enforced by using atomic or fence instructions.
How should I think of the two concepts of cache coherence as well as memory model?
Cache coherence protocols are concerned with serializing stores to the same memory location and ensuring that a load returns the value of the most recent store to the same memory location. Cache coherence protocols are required only when there are caches or multiple copies of the same memory location, and its job is to keep all the copies coherent.
A memory consistency model is concerned with the relative order of loads and stores (of the same task) to different memory locations. Any system that involves executing shared-memory parallel programs (multiple tasks or threads communicating through a shared memory) must define its memory consistency model.
Broadly, cache coherence protocols implement a part of the memory consistency model. More precisely, it is the combination of core pipeline, and cache coherence protocols (and every other component that a memory instruction traverses) must adhere to the memory model specifications.
[1]: Shared memory consistency models: a tutorial

What is the point of MESI on Intel 64 and IA-32

The point of MESI is to retain a notion of a shared memory system.
However, with store buffers, things are complicated:
Memory is coherent downstream of once the data hits the MESI-implementing caches.
However, upstream of that, each core may disagree on what is in memory location X, dependent on what is in each core's local store buffer.
As such, it seems like, from the viewpoint of each core, that the state of memory is different - it is not coherent.
So, why do we bother "partially" enforcing coherency with MESI?
Edit: A substantial edit was made, after some further narrowing of what was really confusing me. I have tried to keep the general notion of the question the same, to preserve the relevance of the great answers received.
The point of MESI on x86 is the same as on pretty much any multiple core/CPU system: to enforce cache consistency. There is no "partial coherency" used for the cache coherency part of the equation on x86: the caches are fully coherent. The possible re-orderings, then, are a result of both the coherent caching system and the interaction with core-local components such as the load/store subsystem (especially store buffers) and other out-of-order machinery.
The result of that interaction is the architected strong memory model that x86 provides, with only limited re-ordering. Without coherent caches, you couldn't reasonably implement this model at all, or almost any model that was anything other than completely weak1.
Your question seems to embed the assumption that there are only possible states "coherent" and "everything every else". Also, there is some mixing of the ideas of cache coherency (which mostly deals with the caches specifically, and is mostly a hidden detail), and the memory consistency model which is architecturally defined and will be implemented by each architecture2. Wikipedia explains that one difference between cache coherency and memory consistency is that the rules for the former applies only to one location at a time, whereas consistency rules apply across locations. In practice, the more important distinction is that the memory consistency model is the only architecturally documented one.
Briefly, Intel (and AMD likewise) define a specific memory consistency model, x86-TSO3 - which is relatively strong as far as memory models go, but is still weaker than sequential consistency. The primary behaviors weakened compared to sequential consistency are:
That later loads can pass earlier stores.
That stores can be seen in a different order from the total store order, but only by cores that performed one of the stores.
To order to implement this memory model, various parts must play by the rules to achieve it. On all recent x86, this means ordered load and store buffers, which avoid the disallowed re-orderings. The use of a store buffer results in the two re-orderings mentioned above: without allowing those, the implementation would be very restricted and probably much slower. In practice it also means fully coherent data caches, since many of the guarantees (e.g., no load-load reordering) would be very difficult to implement without that.
To wrap it all up:
Memory consistency is different than cache coherency: the former is what is documented and forms part of the programming model.
In practice, x86 implementations have fully coherent caches, which helps them implement their x86-TSO memory model, which is fairly strong but weaker than sequential consistency.
Finally, perhaps the answer you were looking for, in different words: a memory model weaker than sequential consistency is still very useful since you can program against it, and in the case you need sequential consistency for some particular operations(s) you insert the right memory barriers4.
If you program against a language supplied memory model, such as Java's or C++11's you don't need to worry about the hardware specifics, but rather than language memory model, and the compiler inserts the barriers required to match the language memory model semantics to the hardware one. The stronger the hardware model, the fewer the barriers required.
1 If your memory model was completely weak, i.e., not really placing any restrictions on cross-core reordering, I suppose you could implement it directly on a non-cache coherent system in a cheap way for normal operations, but then memory barriers potentially become very expensive since they would need to flush a potentially large part of the local private cache.
2 Various chips may implement in differently internally, and in particular some chips may implement stronger semantics than the model (i.e., some allowed re-orderings can never be observed), but absent bugs none will implement a weaker one.
3 This is the name given to it in that paper, which I used because Intel themselves doesn't give it a name, and the paper is a more formal definition than the one Intel gives a less formal model as a series of litmus tests.
4 It practice on x86 you usually use locked instructions (using the lock prefix) rather than separate barriers, although standalone barriers exist also. Here's I'll just use the term barries to refer to both standalone barriers and the barrier semantics embedded into locked instructions.
Re: your edit, which seems to be a new question: right, store-forwarding "violates" coherency. A core can see its own stores earlier than any other core can see them. The store buffer is not coherent.
The x86 memory ordering rules require that loads become globally visible in program order, but allows a core to load data from its own stores before they become globally visible. It doesn't have to pretend it waited and check for memory-order mis-speculation, like it does in other cases of doing loads earlier than the memory model says it should.
Also related; Can x86 reorder a narrow store with a wider load that fully contains it? is a specific example of the store buffer + store-forwarding violating the usual memory-ordering rules. See this collection of mailing list posts by Linus Torvalds explaining the effect of store-forwarding on memory ordering (and how it means that the proposed locking scheme doesn't work).
Without coherency at all, how would you atomically increment a shared counter, or implement other atomic read-modify-write operations that are essential for implementing locks, or for use directly in lockless code. (See Can num++ be atomic for 'int num'?).
lock add [shared_counter], 1 in multiple threads at the same time never loses any counts on actual x86, because the lock prefix makes a core keep exclusive ownership of the cache line from the load until the store commits to L1d (and thus becomes globally visible).
A system without coherent caches would let each thread increment its own copy of a shared counter, and then the final value in memory would come from whichever thread last flushed that line.
Allowing different caches to hold conflicting data for the same line long-term even while other loads/stores happened, and across memory barriers, would allow all sorts of weirdness.
It would also violate the assumption that a pure store becomes visible to other cores promptly. If you didn't have coherency at all, then cores could keep using their cached copy of a shared variable. So if you wanted readers to notice updates, you'd have to clflush before every read of the shared variable, making the common case expensive (when nobody has modified the data since you last checked).
MESI is like a push notification system, instead of forcing every reader to re-validate their cache on every read.
MESI (or coherency in general) allows things like RCU (Read-Copy-Update) to have zero overhead for readers (compared to single threaded) in the case where the shared data structure hasn't been modified. See https://lwn.net/Articles/262464/, and https://en.wikipedia.org/wiki/Read-copy-update. The basic idea is that instead of locking a data structure, a writer copies the whole thing, modifies the copy, and then updates a shared pointer to point to the new version. So readers are always entirely wait-free; they just dereference an (atomic) pointer, and data stays hot in their L1d caches.
Hardware-supported coherency is extremely valuable, and almost every shared-memory SMP architecture uses it. Even ISAs with much weaker memory-ordering rules than x86, like PowerPC, use MESI.

Under which conditions caching of a page(not web caching) should not be allowed?

I would like to know in what ways,using a cache buffer(eg.TLB) to cache frequent pages would be not advantageous or potentially catastrophic.
I searched a bit but could not comprehend this:
"When that page is shared with another process running on a different core of the machine. For example, in Intel 3rd generation core architecture (Ivybridge) L1 and L2 cores are private for a core and L3 is shared. So, a shared page must not go above L3 cache or otherwise explicit coherence mechanism must be done by the programmer "
I don't know if you are asking about x86 CPU caches or generally for any architecture.
As far as I know (and I didn't find any source saying otherwise), in x86, the CPU hardware always ensures that you have a cache-coherent shared memory. So you as an application or OS programmer don't need to care about this. Also there is no instruction which would allow you to deny the CPU from caching your data in L1/L2 or L3 cache. This wouldn't make any sense, as for every write the CPUs must ensure shared cache is coherent and it wouldn't know about your single non-cacheable page. Although it seems there are ways to pre-fetch some data into cache if needed.
For other (theoretical and future) architectures, it is possible to present the cache (and hence whole memory) as not coherent - then the OS must be written so that it will ensure memory coherence itself. For example this paper writes about OS design for non-cache-coherent systems (like Intel's experimental Single-Chip Cloud computer).
For writing applications "friendly" to caching, there are many good questions and answers also here in SO, few of them mentioning this 114-pdf from RedHat: What Every Programmer Should Know About Memory. Here is also Wikipedia page for Cache Coherence.

Which architecture to call Non-uniform memory access (NUMA)?

According to wiki: Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor.
But it is not clear whether it is about any memory including caches or about main memory only.
For example Xeon Phi processor have next architecture:
Memory access to main memory (GDDR) is same for all cores. Meanwhile memory access to L2 cache is different for different cores, since first native L2 cache is checked, then L2 cache of other cores is checked via ring. Is it NUMA or UMA architecture?
Technically, NUMA should probably only be used to describe non-uniform access latency or bandwidth to main memory. (If the NUMA factor [latency far/latency near or bandwidth far/bandwidth near] is small [e.g., comparable to dynamic variability due to DRAM row misses, buffering, etc.], then the system might still be considered UMA.)
(Technically, the Xeon Phi has a small but non-zero NUMA factor since each hop on the ring interconnect takes time [a core might be only one hop from one memory controller and several hops from the most distant one].)
The term NUCA (Non-Uniform Cache Access) has been taken to describe a single cache with different latency of access for different cache blocks. A shared cache level with portions more closely tied to a core or cluster of cores would also fall under NUCA, but separate cache hierarchies would (I believe) not justify the term (even though snooping might find a desired cache block in a 'remote' cache).
I do not know of any term being used to describe a system with variable cache latency associated with snooping (i.e., with separate cache hierarchies) and a small/zero NUMA factor.
(Since caches can transparently replicate and migrate cache blocks, the NUMA concept is a little less fitting. [Yes, an OS can migrate and replicate pages transparently to application software in a NUMA system, so this difference is not absolute.])
Perhaps somewhat interestingly, Azul Systems claims UMA across sockets for its Vega systems:
Azul builds are gear as ‘UMA’ because our programs do not have well understood access patterns. Instead, the patterns are mostly random (after cache filtering) and so it makes sense to have uniform mediocre speeds instead of 1/16th of memory fast and 15/16ths as slow.
I would definitely go for UMA on that one. Having looked at the NUMA Wiki as yourself, the Wiki page for UMA actually gives a little more detail:
In the UMA architecture, each processor may use a private cache. WIKI
This is why I would go for UMA, also because, as you state, once L2 is missed, it goes to a general Memory Directory link and finds the correct physical page using the Memory Controller (MC in your picture)

When are page frame specific cache management policies useful?

I'm reading the O'Reilly Linux Kernel book and one of the things that was pointed out during the chapter on paging is that the Pentium cache lets the operating system associate a different cache management policy with each page frame. So I get that there could be scenarios where a program has very little spacial/temporal locality and memory accesses are random/infrequent enough that the probability of cache hits is below some sort of threshold.
I was wondering whether this mechanism is actually used in practice today? Or is it more of a feature that was necessary back when caches where fairly small and not as efficient as they are now? I could see it being useful for an embedded system with little overhead as far as system calls are necessary, are there other applications I am missing?
Having multiple cache management policies is widely used, whether by assigning whole regions using MTRRs (fixed/dynamic, as explained in Intel's PRM), MMIO regions, or through special instructions (e.g. streaming loads/stores, non-temporal prefetches, etc..). The use-cases also vary a lot, whether you're trying to map an external I/O device into virtual memory (and don't want CPU caching to impact its coherence), or whether you want to define a writethrough region for better integrity management of some database, or just want plain writeback to maximize the cache-hierarchy capacity and replacement efficiency (which means performance).
These usages often overlap (especially when multiple applications are running), so the flexibility is very much needed, as you said - you don't want data with little to no spatial/temporal locality to thrash out other lines you use all the time.
By the way, caches are never going to be big enough in the foreseeable future (with any known technology), since increasing them requires you to locate them further away from the core and pay in latency. So cache management is still, and is going to be for a long while, one of the most important things for performance critical systems and applications

Resources