How is coherence implemented in multi-level caches? - caching

I understand how the cache coherence FSM works for a single private L1 cache and a common LLC/memory. But I couldn't find good resources that discuss cache coherence when there are two private caches, L1 and L2, and a common memory. I don't understand where the interconnection network is placed in the whole model, what the L1 caches snoop, or what their FSM looks like.

Related

Cache coherency(MESI protocol) between different levels of cache namely L1, L2 and L3

This is about the cache coherency protocol across different layers of cache. My understanding (x86-64) is that L1 is owned exclusively by a core, L2 is shared between 2 cores, and L3 is shared by all the cores in a CPU socket. I have read about how the MESI protocol functions, and about store buffers, invalidate queues, invalidate messages, etc. My doubt is whether MESI is applicable to L1 only or to L2 and L3 as well. Or is there a different cache synchronization mechanism for L2 and L3?
The number of cache levels, how each level is organized with respect to other processors or cores in the system, and the coherence protocol implemented in each cache are defined by the core microarchitecture, the uncore microarchitecture, and, in some cases, relevant boot-time configuration options. These design aspects vary by vendor, by processor generation, and by model within the same generation. There are a lot of different designs even if you just consider the processors released in the past few years.
The organization of the cache hierarchy is always clearly documented by Intel and AMD. However, the coherence protocols are not always clearly documented. You won't find a section in any official document that directly tells you all the protocols that caches use. Some hardware performance event names allude to what coherence protocol is used in the cache to which the events apply.
The instruction cache (L1I) always uses the SI protocol because a line is never modified between the point of fill and the point of invalidation. So an entry can either be in the S or I state. The M and E states are only relevant when the cache supports modifying an existing line.
Some microarchitectures have caches that only support the write-through write hit policy. For example, the L1D in the AMD Bulldozer is a write-through cache. The M state doesn't make sense in a write-through cache. This means that the L1D either uses SI or ESI. SI is more likely because it requires only a single bit of state per entry.
Intel processors almost always support the write-back policy in all data and unified caches. Old Intel processors (90s and early 2000s) with two levels of caches use MESI for the L1D and L2. Intel processors with three levels of caches also use MESI for the L1D and L2. The fact that four states are available doesn't necessarily mean that all are being used. A cache line whose physical address falls within a region with the write-through (WT) memory type doesn't use the M state. (It's possible that the type changed from WB to WT, so the first WT access could hit in M.) So the effective protocol for a WT line is ESI or SI.
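To make the four states concrete, here is a minimal sketch of the textbook MESI transitions for a single line in one private write-back cache. It is an illustration only, not the state machine of any particular Intel or AMD cache; the function names and the `others_have_copy` flag are made up for this example.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def on_local_read(state, others_have_copy):
    """Next state after this core reads the line."""
    if state is State.I:
        # Read miss: fill as Shared if another cache holds the line,
        # otherwise as Exclusive.
        return State.S if others_have_copy else State.E
    return state  # read hit: M, E and S are unchanged


def on_local_write(state):
    """Next state after this core writes the line (write-back cache)."""
    # From S or I the cache first has to gain ownership (other copies get
    # invalidated); from E the upgrade is silent. All paths end in M.
    return State.M


def on_remote_read(state):
    """Next state after a read by another core is snooped."""
    if state in (State.M, State.E):
        return State.S  # an M line also has to supply its dirty data
    return state


def on_remote_write(state):
    """Next state after another core's request for ownership is snooped."""
    return State.I
```

For example, a line that starts Invalid, is read with no other sharers, and is then written goes I -> E -> M; a snooped read from another core then moves it to S.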
The L3 caches in Intel processors starting with Nehalem-EX use the MESIF protocol with an inclusive directory (used on a hit) for the entire NUMA node. Nehalem-EX also employs an in-memory 2-state directory to track which lines are owned by the off-package IOH. The in-memory directory protocol changed in Westmere-EX, and then changed again in the Xeon E5, and again in the Xeon E5/E7 v2, and again in the Xeon E5/E7 v3. These processors also support multiple coherence protocols in the L3-miss scenario with different tradeoffs.
I'm not sure what else to say to answer your question. I guess you could say that MESI is more or less applicable to the L2 and L3.

Why do L1 and L2 Cache waste space saving the same data?

I don't know why L1 Cache and L2 Cache save the same data.
For example, let's say we want to access Memory[x] for the first time. Memory[x] is mapped to the L2 cache first, then the same piece of data is mapped to the L1 cache, from which CPU registers can retrieve it.
But now we have duplicated data stored in both the L1 and L2 caches; isn't that a problem, or at least a waste of storage space?
I edited your question to ask about why CPUs waste cache space storing the same data in multiple levels of cache, because I think that's what you're asking.
Not all caches are like that. The Cache Inclusion Policy for an outer cache can be Inclusive, Exclusive, or Not-Inclusive / Not-Exclusive.
NINE is the "normal" case, not maintaining either special property, but L2 does tend to have copies of most lines in L1 for the reason you describe in the question. If L2 is less associative than L1 (like in Skylake-client) and the access pattern creates a lot of conflict misses in L2 (unlikely), you could get a decent amount of data that's only in L1. And maybe in other ways, e.g. via hardware prefetch, or from L2 evictions of data due to code-fetch, because real CPUs use split L1i / L1d caches.
For the outer caches to be useful, you need some way for data to enter them so you can get an L2 hit sometime after the line was evicted from the smaller L1. Having inner caches like L1d fetch through outer caches gives you that for free, and has some advantages. You can put hardware prefetch logic in an outer or middle level of cache, which doesn't have to be as high-performance as L1. (e.g. Intel CPUs have most of their prefetch logic in the private per-core L2, but also some prefetch logic in L1d).
The other main option is for the outer cache to be a victim cache, i.e. lines enter it only when they're evicted from L1. So you can loop over an array of L1 + L2 size and probably still get L2 hits. The extra logic to implement this is useful if you want a relatively large L1 compared to L2, so the total size is more than a little larger than L2 alone.
With an exclusive L2, an L1 miss / L2 hit can just exchange lines between L1d and L2 if L1d needs to evict something from that set.
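As a rough illustration of the victim-cache idea, the sketch below models a fully associative LRU L1 with an exclusive L2 that is filled only by L1 evictions. The class name, the sizes, and the fully associative organization are assumptions made for this example; real caches are set-associative and the exchange happens within a set.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy model: LRU L1 plus a victim L2; a line lives in at most one level."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()  # insertion order doubles as LRU order
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, addr):
        """Return 'L1', 'L2' or 'MEM' depending on where the line was found."""
        if addr in self.l1:
            self.l1.move_to_end(addr)
            return "L1"
        where = "L2" if addr in self.l2 else "MEM"
        if where == "L2":
            del self.l2[addr]  # the line moves to L1, so L2 gives it up
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)  # LRU line of L1
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)  # make room in L2
            self.l2[victim] = True  # the L1 victim enters L2
        self.l1[addr] = True
        return where


h = ExclusiveHierarchy(l1_lines=2, l2_lines=4)
first_pass = [h.access(a) for a in range(6)]   # 6 lines = L1 + L2 capacity
second_pass = [h.access(a) for a in range(6)]
```

In this model the second pass over an array of L1 + L2 lines hits in L2 every time, because no line is ever stored twice.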
Some CPUs do in fact use an L2 that's exclusive of L1d (e.g. AMD K10 / Barcelona). Both of those caches are private per-core caches, not shared, so it's like the simple L1 / L2 situation for a single core CPU you're talking about.
Things get more complicated with multi-core CPUs and shared caches!
Barcelona's shared L3 cache is also mostly exclusive of the inner caches, but not strictly. David Kanter explains:
First, it is mostly exclusive, but not entirely so. When a line is sent from the L3 cache to an L1D cache, if the cache line is shared, or is likely to be shared, then it will remain in the L3 – leading to duplication which would never happen in a totally exclusive hierarchy. A fetched cache line is likely to be shared if it contains code, or if the data has been previously shared (sharing history is tracked). Second, the eviction policy for the L3 has been changed. In the K8, when a cache line is brought in from memory, a pseudo-least recently used algorithm would evict the oldest line in the cache. However, in Barcelona’s L3, the replacement algorithm has been changed to also take into account sharing, and it prefers evicting unshared lines.
AMD's successor to K10/Barcelona is Bulldozer. https://www.realworldtech.com/bulldozer/3/ points out that Bulldozer's shared L3 is also a victim cache, and thus mostly exclusive of L2. It's probably like Barcelona's L3.
But Bulldozer's L1d is a small write-through cache with an even smaller (4k) write-combining buffer, so it's mostly inclusive of L2. Bulldozer's write-through L1d is generally considered a mistake in the CPU design world, and Ryzen went back to a normal 32kiB write-back L1d like Intel has been using all along (with great results). A pair of weak integer cores form a "cluster" that shares an FPU/SIMD unit, and shares a big L2 that's "mostly inclusive". (i.e. probably a standard NINE). This cluster thing is Bulldozer's alternative to SMT / Hyperthreading, which AMD also ditched for Ryzen in favour of normal SMT with a massively wide out-of-order core.
Ryzen also has some exclusivity between core clusters (CCX), apparently, but I haven't looked into the details.
I've been talking about AMD first because they have used exclusive caches in recent designs, and seem to have a preference for victim caches. Intel hasn't tried as many different things, because they hit on a good design with Nehalem and stuck with it until Skylake-AVX512.
Intel Nehalem and later use a large shared tag-inclusive L3 cache. For lines that are modified / exclusive (MESI) in a private per-core L1d or L2 (NINE) cache, the L3 tags still indicate which cores (might) have a copy of a line, so requests from one core for exclusive access to a line don't have to be broadcast to all cores, only to cores that might still have it cached. (i.e. it's a snoop filter for coherency traffic, which lets CPUs scale up to dozens of cores per chip without flooding each other with requests when they're not even sharing memory.)
i.e. L3 tags hold info about where a line is (or might be) cached in an L2 or L1 somewhere, so it knows where to send invalidation messages instead of broadcasting messages from every core to all other cores.
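The snoop-filter role of the L3 tags can be sketched as a per-line set of cores that might hold a copy. This is a hypothetical model for illustration (the class and method names are invented here); real hardware uses per-core valid bits in the tag array rather than Python sets.

```python
class SnoopFilter:
    """Toy model of inclusive tags: line address -> cores that may cache it."""

    def __init__(self):
        self.sharers = {}

    def record_fill(self, addr, core):
        """A core has fetched the line into its private L1/L2."""
        self.sharers.setdefault(addr, set()).add(core)

    def request_exclusive(self, addr, core):
        """Return only the cores that need an invalidation message."""
        targets = self.sharers.get(addr, set()) - {core}
        self.sharers[addr] = {core}  # requester is now the only possible holder
        return targets


sf = SnoopFilter()
sf.record_fill(0x40, core=0)
sf.record_fill(0x40, core=3)
targets = sf.request_exclusive(0x40, core=1)  # only cores 0 and 3 are snooped
untracked = sf.request_exclusive(0x80, core=2)  # nobody has it: no snoops
```

The point of the design is visible in the second request: a line nobody else has cached generates no coherence traffic at all, regardless of how many cores the chip has.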
With Skylake-X (Skylake-server / SKX / SKL-SP), Intel dropped the inclusive L3 and made it NINE, and only a bit bigger than the total per-core L2 size. But there's still a snoop filter; it just doesn't have data. I don't know what Intel's planning to do for future (dual?)/quad/hex-core laptop / desktop chips (e.g. Cannonlake / Icelake). Those are small enough that their classic ring bus would still be great, so they could keep doing that in mobile/desktop parts and only use a mesh in high-end / server parts, like they do in Skylake.
Realworldtech forum discussions of inclusive vs. exclusive vs. non-inclusive:
CPU architecture experts spend time discussing what makes for a good design on that forum. While searching for stuff about exclusive caches, I found this thread, where some disadvantages of strictly inclusive last-level caches are presented. e.g. they force private per-core L2 caches to be small (otherwise you waste too much space with duplication between L3 and L2).
Also, L2 caches filter requests to L3, so when its LRU algorithm needs to drop a line, the one it's seen least-recently can easily be one that stays permanently hot in L2 / L1 of a core. But when an inclusive L3 decides to drop a line, it has to evict it from all inner caches that have it, too!
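The back-invalidation problem can be reproduced with a small model. The sketch below assumes a fully associative LRU L3 and, as a simplification, unbounded private caches represented as sets; all names are made up for the example.

```python
from collections import OrderedDict


class InclusiveL3:
    """Toy inclusive L3: evicting a line also removes it from private caches."""

    def __init__(self, capacity, num_cores):
        self.lines = OrderedDict()  # LRU order as seen by L3 only
        self.capacity = capacity
        self.private = [set() for _ in range(num_cores)]  # per-core L1/L2

    def fill(self, addr, core):
        """A core misses in its private caches and requests the line from L3."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
        else:
            if len(self.lines) >= self.capacity:
                victim, _ = self.lines.popitem(last=False)
                for cache in self.private:
                    cache.discard(victim)  # back-invalidate inner copies
            self.lines[addr] = True
        self.private[core].add(addr)

    def private_hit(self, addr, core):
        """Private hits never reach L3, so its LRU state is not refreshed."""
        return addr in self.private[core]


l3 = InclusiveL3(capacity=2, num_cores=2)
l3.fill("A", core=0)
hot_before = l3.private_hit("A", core=0)  # core 0 keeps hitting A privately
l3.fill("B", core=1)
l3.fill("C", core=1)                      # L3 evicts A, its apparent LRU line
hot_after = l3.private_hit("A", core=0)
```

Here line A is hot in core 0 the whole time, but L3 never sees those hits, picks A as its victim, and core 0 loses the line.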
David Kanter replied with an interesting list of advantages for inclusive outer caches. I think he's comparing to exclusive caches, rather than to NINE. e.g. his point about data sharing being easier only applies vs. exclusive caches, where I think he's suggesting that a strictly exclusive cache hierarchy might cause evictions when multiple cores want the same line even in a shared/read-only manner.

Cache Inclusion Property- Multilevel Caching

I am not able to understand the concept of the cache inclusion property in multi-level caching. As per my understanding, if we have 2 levels of cache, L1 and L2, then the contents of L1 must be a subset of L2. This implies that L2 must be at least as large as L1. Further, when a block in L1 is modified, we have to update it in two places, L2 and memory. Are these concepts correct?
In general, we can say that adding more levels of cache adds more levels of access in the memory hierarchy. It's always a trade-off between capacity and access time: the larger the cache, the more we can store, but the longer it takes to search through. As you have said, the L2 cache must be larger than the L1 cache; otherwise it fails its basic purpose.
Now, coming to whether L1 is a subset of L2: it's not always necessary. There are inclusive cache hierarchies and exclusive cache hierarchies. In an inclusive hierarchy, as you said, the last level is a superset of all the other caches.
You can check this presentation for more details: PPT.
Updating the different levels is a cache coherence problem, and the larger the number of levels, the larger the headache. You can check various protocols here: cache coherence
You are correct about an inclusive L2 cache being larger than the L1 cache. However, your statement that in an inclusive cache a modification in the L1 also requires a modification in the L2 and memory is not correct. The system you describe is called a "write-through" cache, where all writes to the private cache also write the next level(s) of cache. Inclusive cache hierarchies do not imply write-through caches.
Most architectures that have inclusive hierarchies use a "write-back" cache. A "write-back" cache differs from a write-through cache in that it does not require modifications in the current level of cache to be eagerly propagated to the next level of cache (e.g. a write in the L1 cache does not have to immediately write the L2). Instead, write-back caches update only the current level of cache and mark the data "dirty" (this describes a cache line whose most recent value is in the current level, while all levels farther from the core have stale values). A write-back cache flushes the dirty cache line to the next level of cache on an eviction (when space needs to be created in the current cache to service a miss).
These concepts are summarized in the seminal work by Baer and Wang "On the inclusion property of Multi level cache hierarchies", ISCA 1988 paper_link. The paper explains your confusion in the initially confusing statement:
“A MultiLevel cache hierarchy has the inclusion property (MLI) if the contents of a cache at level C_(i+1), is a superset of the contents of all its children caches, C_i, at level i.” This definition implies that the write-through policy must be used for lower level caches. As we will assume write-back caches in this paper, the MLI is actually a “space” MLI, i.e., space is provided for inclusion but a write-back policy is implemented.
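The difference between the two write policies, and what "space" inclusion means, can be shown with a minimal sketch. A plain dict stands in for the next level (L2 or memory), and the class names are invented for this illustration.

```python
class WriteThroughCache:
    """Every write is propagated to the next level immediately."""

    def __init__(self, next_level):
        self.data = {}
        self.next_level = next_level  # dict standing in for L2 or memory

    def write(self, addr, value):
        self.data[addr] = value
        self.next_level[addr] = value  # next level is always up to date


class WriteBackCache:
    """Writes stay local and are flushed only when the line is evicted."""

    def __init__(self, next_level):
        self.data = {}
        self.dirty = set()
        self.next_level = next_level

    def write(self, addr, value):
        self.data[addr] = value
        self.dirty.add(addr)  # the next level now holds a stale value

    def evict(self, addr):
        value = self.data.pop(addr)
        if addr in self.dirty:
            self.dirty.discard(addr)
            self.next_level[addr] = value  # write the dirty line back


l2 = {0x10: 1}
l1 = WriteBackCache(l2)
l1.write(0x10, 2)
stale = l2[0x10]      # L2 still has space for the line, but the old value
l1.evict(0x10)
flushed = l2[0x10]    # the eviction wrote the new value back
```

With the write-back L1, L2 keeps an entry for the line (inclusion of space) even while its contents are out of date, which is the situation the quoted passage describes.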

what is the difference between l1 cache and l2 cache?

I know that l1 and l2 caches are levels in multi-level cache.
I would like to know where each level cache is placed, and what is the maximum number of cache levels allowed?
Both of these depend on the CPU. There are CPUs which have no cache at all, there are CPUs which have the L1 cache on die and the L2 cache on a separate die on the same chip or even on a separate chip, or there are CPUs which have both L1 and L2 cache on the same die as the CPU core.
There are multi-core, multi-chip CPUs where each core has its own L1 cache on die, the 4 cores of one multi-core chip share an L2 cache that is on chip, but on a separate die, and the 2 chips share an L3 cache that is on a separate chip, but in the same package. Sometimes, there are also so-called CPU books which contain multiple chip packages, which might or might not have their own shared cache, which would then be an L4 cache.
Of course, multi-core chips don't have to share their L2 cache, they can also have private L2 caches.
And it's not always obvious, what level a certain cache is, or even whether or not a piece of RAM is a cache at all.
For example, on later Intel 80486 processors, there was an L1 cache on the chip and an L2 cache on the motherboard. But then AMD came out with a socket-compatible CPU that had both an L1 and L2 cache on the chip. So, the exact same cache chip on the motherboard was either an L2 or L3 cache, depending on what kind of CPU you used.
On the Cell BE CPU, the SPEs have 256 KiByte of RAM each. Except that this RAM has about the same size and the same speed as a typical L2 cache, and since the SPEs don't have any other caches, you could also view this as a cache. However, caches are normally managed automatically by the CPU, whereas RAM is typically managed by the user program, the language runtime or the OS, not the CPU. So, is this RAM or a cache? It turns out that, in order to achieve best performance, you should really not view this as RAM, but more as a software-controlled cache.
The difference between L1 and L2 cache
Although both L1 and L2 are cache memories they have their key differences. L1 and L2 are the first and second cache in the hierarchy of cache levels.
L1 has a smaller memory capacity than L2.
Also, L1 can be accessed faster than L2.
L2 is accessed only if the requested data is not found in L1.
In older systems, L1 was built into the chip while L2 was soldered on the motherboard very close to the chip; in modern CPUs both are on the processor die. Either way, L1 has very little delay compared to L2. Both L1 and L2 are normally implemented using SRAM, which does not need refreshing; it is main memory that uses DRAM and needs to be refreshed.
If the caches are strictly inclusive, all data in L1 can be found in L2 as well. However, if the caches are exclusive, the same data will not be available in both L1 and L2.
Taken from this link -
L1 and L2 are levels of cache memory in a computer. If the computer processor can find the data it needs for its next operation in cache memory, it will save time compared to having to get it from random access memory. L1 is "level-1" cache memory, usually built onto the microprocessor chip itself. For example, the Intel MMX microprocessor comes with 32 thousand bytes of L1.
L2 (that is, level-2) cache memory is on a separate chip (possibly on an expansion card) that can be accessed more quickly than the larger "main" memory. A popular L2 cache memory size is 1,024 kilobytes (one megabyte).
Complete Cache architecture is here in WIKI

L1/2 cache problem

Could L1/L2 cache lines each hold multiple copies of the same main memory data word?
It's possible for the same main memory location to be in a cache more than once. Obviously that's true and a common occurrence for multiprocessor machines. But even on uniprocessor machines, it can happen.
Consider a Pentium CPU that has a split L1 instruction/data cache. Instructions only go to the I-cache, data only to the D-cache. Now if the OS allows self modifying code, the same memory could be loaded into both the I- and D-cache, once as data, once as instructions. Now you have that data twice in the L1 cache. Therefore a CPU with such a split cache architecture must employ a cache coherence protocol to avoid race conditions/corruption.
No - if it's already in the cache the MMU will use that rather than creating another copy.
Every cache basically stores some small subset of the whole memory. When CPU needs a word from memory it first goes to L1, then to L2 cache and so on, before the main memory is checked.
So a particular memory word can be in L2 and in L1 simultaneously, but it can't be stored two times in L1, because that is not necessary.
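The lookup order described above can be sketched as follows. This assumes a simple hierarchy where L1 fills through L2 (so both end up with a copy), and it uses plain dicts as caches with no capacity limit; the `load` function is made up for this example.

```python
def load(addr, l1, l2, memory):
    """Look up addr in L1, then L2, then memory; return (value, where found)."""
    if addr in l1:
        return l1[addr], "L1"
    if addr in l2:
        l1[addr] = l2[addr]  # fill L1 from L2
        return l1[addr], "L2"
    value = memory[addr]
    l2[addr] = value  # the fill passes through L2 ...
    l1[addr] = value  # ... and lands in L1 as well
    return value, "MEM"


l1, l2, memory = {}, {}, {0x100: 7}
first = load(0x100, l1, l2, memory)   # comes from memory, fills both levels
second = load(0x100, l1, l2, memory)  # now an L1 hit
```

After the first load the word is present once in L1 and once in L2; because each level is keyed by address, a level never holds two copies of the same line.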
Yes, it can: the L1 copy may have been updated but not yet flushed to L2. This happens only if L1 and L2 are non-exclusive caches. This is obvious for uniprocessors, but it is even more so for multiprocessors, which typically have a separate L1 cache for each core.
It all depends on the cache architecture - whether it guarantees any sort of thing.

Resources