L1/L2 cache problem - caching

Could the L1/L2 caches each hold multiple copies of the same main-memory data word?

It is possible for the same main-memory data to be in a cache more than once. That is obviously true, and a common occurrence, on multiprocessor machines, but it can happen even on uniprocessor machines.
Consider a Pentium CPU with a split L1 instruction/data cache. Instructions go only to the I-cache, data only to the D-cache. Now, if the OS allows self-modifying code, the same memory can be loaded into both the I-cache and the D-cache, once as data and once as instructions. You then have that data twice in the L1 cache. Therefore a CPU with such a split cache architecture must employ a cache coherence protocol to avoid race conditions and corruption.
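As a rough illustration of the self-modifying-code case (an editorial sketch, not part of the answer above): code that generates instructions at runtime writes them through the D-cache and must make sure the instruction-fetch side sees them. With GCC or Clang, the __builtin___clear_cache builtin does the required maintenance; on x86 it compiles to essentially nothing because the hardware keeps the I- and D-caches coherent, while on ARM it emits explicit cache-maintenance instructions.

```c
#include <string.h>

/* Sketch: copy freshly generated machine code into an executable buffer
 * and make sure the instruction-fetch path will see it.  `dst` is assumed
 * to point at writable + executable memory; the bytes are placeholders. */
void install_code(unsigned char *dst, const unsigned char *code, size_t len)
{
    memcpy(dst, code, len);                    /* goes through the D-cache */

    /* Without this, a CPU with split I/D caches may still execute stale
     * bytes sitting in its I-cache.  No-op on x86; real work on ARM. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```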

No - if it's already in the cache the MMU will use that rather than creating another copy.

Every cache basically stores some small subset of the whole memory. When the CPU needs a word from memory, it first checks L1, then L2, and so on, before main memory is checked.
So a particular memory word can be in L2 and in L1 simultaneously, but it can't be stored twice within L1, because that is not necessary.
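A minimal sketch of that lookup order, as a toy direct-mapped model (made-up structures, not any real hardware interface): each level can hold a given line in at most one slot, and a miss falls through to the next level, filling both levels on the way back.

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy direct-mapped cache level: only tags are modelled, data comes from
 * the fake "memory" array below.  LINE is the cache-line size in bytes. */
#define LINE 64
typedef struct { uintptr_t tag[256]; bool valid[256]; unsigned sets; } level_t;

static uint8_t memory[1 << 20];                  /* pretend main memory */

static bool probe(level_t *c, uintptr_t addr)
{
    unsigned idx = (addr / LINE) % c->sets;
    return c->valid[idx] && c->tag[idx] == addr / LINE;
}

static void fill(level_t *c, uintptr_t addr)
{
    unsigned idx = (addr / LINE) % c->sets;
    c->valid[idx] = true;                        /* old occupant is simply evicted */
    c->tag[idx] = addr / LINE;
}

/* Lookup order: L1, then L2, then memory.  A miss in both levels installs
 * the line in both (a non-exclusive policy), but only once per level,
 * because each level indexes the line by its address. */
uint8_t read_byte(level_t *l1, level_t *l2, uintptr_t addr)
{
    if (!probe(l1, addr)) {
        if (!probe(l2, addr))
            fill(l2, addr);
        fill(l1, addr);
    }
    return memory[addr];
}
```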

Yes, it can: the L1 copy may have been updated but not yet flushed to L2. This happens only if L1 and L2 are non-exclusive caches. It is obvious for uniprocessors, but even more so for multiprocessors, which typically have a private L1 cache for each core.
It all depends on the cache architecture and what it actually guarantees.

Related

Why do L1 and L2 Cache waste space saving the same data?

I don't know why the L1 cache and L2 cache save the same data.
For example, let's say we want to access Memory[x] for the first time. Memory[x] is mapped into the L2 cache first, then the same piece of data is copied into the L1 cache, from which the CPU registers can retrieve it.
But now we have duplicate data stored in both the L1 and L2 caches. Isn't that a problem, or at least a waste of storage space?
I edited your question to ask about why CPUs waste cache space storing the same data in multiple levels of cache, because I think that's what you're asking.
Not all caches are like that. The Cache Inclusion Policy for an outer cache can be Inclusive, Exclusive, or Not-Inclusive / Not-Exclusive.
NINE is the "normal" case, not maintaining either special property, but L2 does tend to have copies of most lines in L1 for the reason you describe in the question. If L2 is less associative than L1 (like in Skylake-client) and the access pattern creates a lot of conflict misses in L2 (unlikely), you could get a decent amount of data that's only in L1. And maybe in other ways, e.g. via hardware prefetch, or from L2 evictions of data due to code-fetch, because real CPUs use split L1i / L1d caches.
For the outer caches to be useful, you need some way for data to enter them so you can get an L2 hit sometime after the line was evicted from the smaller L1. Having inner caches like L1d fetch through outer caches gives you that for free, and has some advantages. You can put hardware prefetch logic in an outer or middle level of cache, which doesn't have to be as high-performance as L1. (e.g. Intel CPUs have most of their prefetch logic in the private per-core L2, but also some prefetch logic in L1d).
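As a sketch of why that placement works (a toy stride detector with hypothetical helper names, nothing vendor-specific): the prefetcher only needs to watch the address stream arriving at L2 and issue fills ahead of demand, so it never sits on the latency-critical L1 load path.

```c
#include <stdint.h>

/* Hypothetical helper: ask the memory system to bring a line into L2. */
static void l2_fill_from_memory(uintptr_t line_addr) { (void)line_addr; }

/* Toy stride prefetcher attached to L2: when two consecutive accesses show
 * the same stride, fetch the next line in that pattern early. */
struct prefetcher { uintptr_t last_addr; intptr_t last_stride; };

void on_l2_access(struct prefetcher *p, uintptr_t addr)
{
    intptr_t stride = (intptr_t)(addr - p->last_addr);

    if (stride != 0 && stride == p->last_stride)
        l2_fill_from_memory(addr + stride);   /* prefetch ahead of demand */

    p->last_stride = stride;
    p->last_addr = addr;
}
```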
The other main option is for the outer cache to be a victim cache, i.e. lines enter it only when they're evicted from L1. So you can loop over an array of L1 + L2 size and probably still get L2 hits. The extra logic to implement this is useful if you want a relatively large L1 compared to L2, so the total size is more than a little larger than L2 alone.
With an exclusive L2, an L1 miss / L2 hit can just exchange lines between L1d and L2 if L1d needs to evict something from that set.
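A toy model of that exchange (one line per level for brevity, not a real replacement policy): on an L1 miss that hits the exclusive L2, the two levels swap lines, and on a miss in both levels the L1 victim drops into L2, so no line is ever duplicated between them.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uintptr_t line; bool valid; } slot_t;

static slot_t l1, l2;                 /* one line per level, for brevity */

void access_line(uintptr_t line)
{
    if (l1.valid && l1.line == line)
        return;                       /* L1 hit: nothing moves */

    if (l2.valid && l2.line == line) {
        /* L1 miss, L2 hit: exchange.  The L1 victim drops into L2 and the
         * requested line moves up, so it exists in exactly one level. */
        slot_t victim = l1;
        l1 = l2;
        l2 = victim;
    } else {
        /* Miss in both: fetch from memory straight into L1; the evicted L1
         * line (if any) becomes the new L2 occupant - a victim cache. */
        l2 = l1;
        l1.line = line;
        l1.valid = true;
    }
}
```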
Some CPUs do in fact use an L2 that's exclusive of L1d (e.g. AMD K10 / Barcelona). Both of those caches are private per-core caches, not shared, so it's like the simple L1 / L2 situation for a single core CPU you're talking about.
Things get more complicated with multi-core CPUs and shared caches!
Barcelona's shared L3 cache is also mostly exclusive of the inner caches, but not strictly. David Kanter explains:
First, it is mostly exclusive, but not entirely so. When a line is sent from the L3 cache to an L1D cache, if the cache line is shared, or is likely to be shared, then it will remain in the L3 – leading to duplication which would never happen in a totally exclusive hierarchy. A fetched cache line is likely to be shared if it contains code, or if the data has been previously shared (sharing history is tracked). Second, the eviction policy for the L3 has been changed. In the K8, when a cache line is brought in from memory, a pseudo-least recently used algorithm would evict the oldest line in the cache. However, in Barcelona’s L3, the replacement algorithm has been changed to also take into account sharing, and it prefers evicting unshared lines.
AMD's successor to K10/Barcelona is Bulldozer. https://www.realworldtech.com/bulldozer/3/ points out that Bulldozer's shared L3 is also a victim cache, and thus mostly exclusive of L2. It's probably like Barcelona's L3.
But Bulldozer's L1d is a small write-through cache with an even smaller (4k) write-combining buffer, so it's mostly inclusive of L2. Bulldozer's write-through L1d is generally considered a mistake in the CPU design world, and Ryzen went back to a normal 32kiB write-back L1d like Intel has been using all along (with great results). A pair of weak integer cores form a "cluster" that shares an FPU/SIMD unit, and shares a big L2 that's "mostly inclusive". (i.e. probably a standard NINE). This cluster thing is Bulldozer's alternative to SMT / Hyperthreading, which AMD also ditched for Ryzen in favour of normal SMT with a massively wide out-of-order core.
Ryzen also has some exclusivity between core clusters (CCX), apparently, but I haven't looked into the details.
I've been talking about AMD first because they have used exclusive caches in recent designs, and seem to have a preference for victim caches. Intel hasn't tried as many different things, because they hit on a good design with Nehalem and stuck with it until Skylake-AVX512.
Intel Nehalem and later use a large shared tag-inclusive L3 cache. For lines that are modified / exclusive (MESI) in a private per-core L1d or L2 (NINE) cache, the L3 tags still indicate which cores (might) have a copy of a line, so requests from one core for exclusive access to a line don't have to be broadcast to all cores, only to cores that might still have it cached. (i.e. it's a snoop filter for coherency traffic, which lets CPUs scale up to dozens of cores per chip without flooding each other with requests when they're not even sharing memory.)
i.e. L3 tags hold info about where a line is (or might be) cached in an L2 or L1 somewhere, so it knows where to send invalidation messages instead of broadcasting messages from every core to all other cores.
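One way to picture those tags (a hypothetical layout, not Intel's actual structure): each L3 entry carries a per-core bitmap, and a request for exclusive ownership only generates invalidations for the cores whose bits are set.

```c
#include <stdint.h>

/* Hypothetical stand-in for putting an invalidation on the interconnect. */
static void send_invalidate(int core, uint64_t tag) { (void)core; (void)tag; }

/* Hypothetical inclusive-L3 tag entry: besides the address tag and its own
 * coherence state, it records which cores might hold the line. */
struct l3_tag {
    uint64_t tag;
    uint8_t  mesi_state;
    uint32_t core_bitmap;     /* bit i set => core i may have a copy in L1/L2 */
};

/* A core asking for exclusive access only disturbs the cores marked in the
 * bitmap, instead of broadcasting to every core on the chip. */
void request_exclusive(struct l3_tag *e, int requester, int ncores)
{
    for (int core = 0; core < ncores; core++) {
        if (core != requester && (e->core_bitmap & (1u << core)))
            send_invalidate(core, e->tag);
    }
    e->core_bitmap = 1u << requester;   /* only the requester holds it now */
}
```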
With Skylake-X (Skylake-server / SKX / SKL-SP), Intel dropped that and made L3 NINE and only a bit bigger than the total per-core L2 size. But there's still a snoop filter, it just doesn't have data. I don't know what Intel's planning to do for future (dual?)/quad/hex-core laptop / desktop chips (e.g. Cannonlake / Icelake). That's small enough that their classic ring bus would still be great, so they could keep doing that in mobile/desktop parts and only use a mesh in high-end / server parts, like they are in Skylake.
Realworldtech forum discussions of inclusive vs. exclusive vs. non-inclusive:
CPU architecture experts spend time discussing what makes for a good design on that forum. While searching for stuff about exclusive caches, I found this thread, where some disadvantages of strictly inclusive last-level caches are presented. e.g. they force private per-core L2 caches to be small (otherwise you waste too much space with duplication between L3 and L2).
Also, L2 caches filter requests to L3, so when the L3's LRU algorithm needs to drop a line, the line it has seen least recently can easily be one that stays permanently hot in a core's L2 / L1. But when an inclusive L3 decides to drop a line, it has to evict it from all inner caches that have it, too!
David Kanter replied with an interesting list of advantages for inclusive outer caches. I think he's comparing to exclusive caches, rather than to NINE. e.g. his point about data sharing being easier only applies vs. exclusive caches, where I think he's suggesting that a strictly exclusive cache hierarchy might cause evictions when multiple cores want the same line even in a shared/read-only manner.

PoU with non-shareable attribute

Another question regarding caching in ARMv7-A.
In this case, the SoC in question is the Allwinner A20, a dual-core Cortex-A7.
From what I have read, the definition of the PoU for a core is the point at which the instruction and data caches of that core are guaranteed to see the same copy of a memory location.
In regard to the SoC in question, since both cores share the PoU at the L2 (unified) cache, it means that whatever is put into L1 will be visible to L2. Is that right?
Even if I change the attributes of a memory region to Non-Shareable, L2 will still be able to see what is inside L1 on either core. Is that true?
To elaborate on what I meant by that, I did a little experiment:
I wrote to a memory address inside a Non-Shareable, Write-Back region from core #0. Then, without doing any cache maintenance operation, I read from the same memory address from core #1, and it returned the correct value that had been written from core #0.
I speculated that this behaviour was a result of L2 being the PoU: when I wrote from core #0, L2 also stored a copy (even though nothing was flushed). Then, when I read from core #1, after a read miss, core #1's L1 retrieved the value from L2.
...since both cores share the PoU at the L2 (unified) cache, it means that whatever is put into L1 will be visible to L2. Is that right?
No. One CPU's data accesses may snoop the data caches of another in the same shareability domain, but that has nothing to do with the PoU for instruction accesses; it's just the coherency protocol.
Even if I change the attributes of a memory region to Non-Shareable, L2 will still be able to see what is inside L1 on either core. Is that true?
No. Non-shareable memory is not guaranteed to be coherent. Sure, you might see it work - maybe Cortex-A7 happens to still snoop non-shareable cache lines, or maybe your data just got naturally evicted from L1D in the meantime such that the other CPU hit it at L2 - but it definitely should not be relied upon. Either way, having multiple CPUs access the same non-shareable location is a totally backwards thing to do in practice; you've deliberately said you don't want to share it!
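If data really does have to be handed between cores through a Non-Shareable, Write-Back region, the robust approach is explicit cache maintenance to the Point of Coherency rather than relying on snooping. The following is a rough ARMv7 sketch assuming privileged (kernel or bare-metal) code, GCC inline assembly, and the standard DCCMVAC / DCIMVAC CP15 operations; treat it as an illustration of the idea, not verified driver code.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* Cortex-A7 data cache line size */

/* Writer core: clean the dirty lines out to the Point of Coherency and wait
 * for completion, so the data reaches a level the other core can observe. */
static void clean_dcache_range(void *start, size_t len)
{
    uintptr_t p = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    for (; p < (uintptr_t)start + len; p += CACHE_LINE)
        __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" : : "r"(p) : "memory"); /* DCCMVAC */
    __asm__ volatile("dsb" : : : "memory");
}

/* Reader core: invalidate its own, possibly stale, copies before reading. */
static void invalidate_dcache_range(void *start, size_t len)
{
    uintptr_t p = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    for (; p < (uintptr_t)start + len; p += CACHE_LINE)
        __asm__ volatile("mcr p15, 0, %0, c7, c6, 1" : : "r"(p) : "memory");  /* DCIMVAC */
    __asm__ volatile("dsb" : : : "memory");
}
```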

Cache miss on a multilevel cache

There are two levels of cache, L1 and L2. If there is a cache miss at both levels, the data is read from main memory. While the data is being read from main memory, is it first placed into the L2 and L1 caches and then read by the processor from L1, or do the fills into L1 and L2 and the read by the processor happen simultaneously?
I believe this depends on the hardware implementation. I think it also depends on whether it is a write-through or a write-back cache. A write-through cache would have the same data at all levels because it updates them all at the same time. The incoming data could also be put into a write buffer to be written into the cache, in which case the fill would happen at the same time as the read. If there were no write buffer, the processor might stall to allow the cache to be updated.

When L1 misses are very different from L2 accesses... TLB related?

I have been running some benchmarks on some algorithms and profiling their memory usage and efficiency (L1/L2/TLB accesses and misses), and some of the results are quite intriguing for me.
Considering an inclusive cache hierarchy (L1 and L2 caches), shouldn't the number of L1 cache misses coincide with the number of L2 cache accesses? One explanation I can come up with is TLB-related: when a virtual address is not mapped in the TLB, the system automatically skips searches in some cache levels.
Does this seem legitimate?
First, inclusive cache hierarchies may not be so common as you assume. For example, I do not think any current Intel processors - not Nehalem, not Sandybridge, possibly Atoms - have an L1 that is included within the L2. (Nehalem and probably Sandybridge do, however, have both L1 and L2 included within L3; using Intel's current terminology, FLC and MLC in LLC.)
But, this doesn't necessarily matter. In most cache hierarchies, if you have an L1 cache miss, then that miss will probably be looked up in the L2. It doesn't matter whether it is inclusive or not. To do otherwise, you would have to have something that told you that the data you care about is (probably) not in the L2, so you don't need to look. I have designed protocols and memory types that do this - e.g. a memory type that is cached only in the L1 and not the L2, useful for things like graphics where you get the benefit of combining in the L1, but where you are repeatedly scanning over a large array, so caching in the L2 is not a good idea. But I am not aware of anyone shipping them at the moment.
Anyway, here are some reasons why the number of L1 cache misses may not be equal to the number of L2 cache accesses.
You don't say what systems you are working on - I know my answer is applicable to Intel x86s such as Nehalem and Sandybridge, whose EMON performance event monitoring allows you to count things such as L1 and L2 cache misses, etc. It will probably also apply to any modern microprocessor with hardware performance counters for cache misses, such as those on ARM and Power.
Most modern microprocessors do not stop at the first cache miss, but keep going, trying to do extra work. This is often called speculative execution. Furthermore, the processor may be in-order or out-of-order; although the latter may give you even greater differences between the number of L1 misses and the number of L2 accesses, it's not necessary - you can get this behavior even on in-order processors.
Short answer: many of these speculative memory accesses will be to the same memory location. They will be squashed and combined.
The performance event "L1 cache misses" is probably[*] counting the number of (speculative) instructions that missed the L1 cache. These then allocate a hardware data structure, called a fill buffer at Intel and a miss status handling register (MSHR) elsewhere. Subsequent cache misses to the same cache line will miss the L1 cache but hit the fill buffer, and will get squashed. Only one of them, typically the first, will get sent to the L2 and counted as an L2 access.
By the way, there may be a performance event for this: Squashed_Cache_Misses.
There may also be a performance event L1_Cache_Misses_Retired. But this may undercount, since speculation may pull the data into the cache, and a cache miss at retirement may never occur.
([*] By the way, when I say "probably" here I mean "On the machines that I helped design". Almost definitely. I might have to check the definition, look at the RTL, but I would be immensely surprised if not. It is almost guaranteed.)
E.g. imagine that you are accessing bytes A[0], A[1], A[2], ... A[63], A[64], ...
If the address of A[0] is zero modulo 64, then A[0]..A[63] will be in the same cache line on a machine with 64-byte cache lines. If the code that uses these is simple, it is quite possible that all of them can be issued speculatively. QED: 64 speculative memory accesses, 64 L1 cache misses, but only one L2 access.
(By the way, don't expect the numbers to be quite so clean. You might not get exactly 64 L1 accesses per L2 access.)
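To make the arithmetic concrete (a sketch; the loop and the counts below are assumptions for illustration, not measurements, and the exact performance-event definitions vary by CPU):

```c
#include <stddef.h>

/* Sum a 64-byte-aligned array one byte at a time.  On a machine with
 * 64-byte lines, every group of 64 consecutive loads touches one line. */
unsigned sum_bytes(const unsigned char *a, size_t n)
{
    unsigned s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];            /* many of these loads can be in flight at once */
    return s;
}

/* Back-of-the-envelope for n = 64 * 1024 with the array not cached anywhere:
 *   distinct cache lines touched      : n / 64  = 1024
 *   L2 accesses (one fill per line)   : roughly 1024
 *   speculative L1 "misses"           : anywhere up to n, because loads to a
 *     line whose fill is still pending miss L1 but hit the fill buffer / MSHR
 *     and are squashed instead of being sent to L2. */
```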
Some more possibilities:
If the number of L2 accesses is greater than the number of L1 cache misses (I have almost never seen it, but it is possible), you may have a memory access pattern that is confusing a hardware prefetcher. The hardware prefetcher tries to predict which cache lines you are going to need. If the prefetcher predicts badly, it may fetch cache lines that you don't actually need. Oftentimes there is a performance event to count Prefetches_from_L2 or Prefetches_from_Memory.
Some machines may cancel speculative accesses that have caused an L1 cache miss, before they are sent to the L2. However, I don't know of Intel doing this.
The write policy of a data cache determines whether a store hit writes its data only on that cache (write-back or copy-back) or also at the following level of the cache hierarchy (write-through).
Hence, a store that hits at a write-through L1-D cache, also writes its data at the L2 cache.
This could be another source of L2 accesses that do not come from L1 cache misses.
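A toy model of that distinction (not a real cache controller): on a store hit, write-back only dirties the L1 line, while write-through also pushes the store down to L2, which is exactly the extra L2 traffic described above.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { WRITE_BACK, WRITE_THROUGH } policy_t;

/* Toy single L1 line, just enough state to show where L2 traffic comes from. */
struct l1_line { uintptr_t tag; bool valid, dirty; uint8_t data[64]; };

/* Hypothetical helper: the write path into the next cache level. */
static void l2_write(uintptr_t addr, uint8_t byte) { (void)addr; (void)byte; }

void store_byte(struct l1_line *line, policy_t policy, uintptr_t addr, uint8_t byte)
{
    /* Assume a store hit: the line is already present and valid in L1. */
    line->data[addr % 64] = byte;

    if (policy == WRITE_THROUGH)
        l2_write(addr, byte);   /* every store hit is also an L2 access         */
    else
        line->dirty = true;     /* write-back: L2 sees the data only on eviction */
}
```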

details of MESI protocol for multicore processors

The details of the MESI protocol for multicore processors would be really important for me, but I can't find them anywhere. Even http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf doesn't contain enough detail. For instance: assume a private L1 and a shared L2 cache. If the state of a line is Exclusive in L1, is it Exclusive in L2 too (or Invalid, because a line can be Exclusive in only one cache)? And clearly, if another core writes to this line, the state of the previously Exclusive line in L1 becomes Invalid, but how does the state of the L2 cache line change? If a Modified line in L1 is read by another core, will the new state of that line be Shared, and is it written back to main memory through the L2 cache, or does it stay Modified in L2 too? Etc.
The reason you are having trouble finding these answers is that the traditional protocols were not defined for hierarchical cache architectures, so the MESI protocol by itself doesn't define what will happen when you have an L1 and an L2 cache. It depends on three other system attributes.
If the L2 is designed to be exclusive of the L1 (i.e., it is guaranteed that L2 and L1 can never have common cache lines), then any line in the L1 will be invalid state (basically not present) in the L2.
If the L2 is inclusive of the L1, i.e., every line in the L1 must have an entry in the L2 as well, the entry in the L2 will contain a descriptor stating which L1 cache has the line in E state.
Whether the value is written out to L2 or to memory on a read of a line in the E or M state depends on whether your system supports cache-to-cache transfers. In the old days, when each chip was a single core and core-to-core communication was as expensive as a read/write to memory, systems would write the data to memory and make the other processor read it from there (this allowed them to not support cache-to-cache transfers). In multi-core chips, talking via memory is insanely expensive compared to talking to other cores on-chip, so almost all multi-core chips today support cache-to-cache transfer. Thus, a read of a line in the E or M state is not serviced by writing to memory.
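To picture the inclusive case described above (a hypothetical layout, not taken from any shipping design): the L2 entry carries, next to its own state, a record of which private L1 holds the line and in what state, so a write by another core can be resolved by invalidating just that one L1 copy.

```c
#include <stdint.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Hypothetical entry in an inclusive shared L2: besides its own copy of the
 * line, it tracks which private L1 currently holds it and in what state. */
struct l2_entry {
    uint64_t tag;
    mesi_t   l2_state;
    int      owner_l1;       /* index of the core whose L1 has the line, -1 if none */
    mesi_t   l1_state;       /* state of that L1 copy */
};

/* Another core writes the line: the old owner's L1 copy must be invalidated,
 * and the descriptor now points at the writer, holding the line Modified. */
void handle_write(struct l2_entry *e, int writer)
{
    if (e->owner_l1 >= 0 && e->owner_l1 != writer) {
        /* an invalidation message to core e->owner_l1's L1 would go here */
    }
    e->owner_l1 = writer;
    e->l1_state = MODIFIED;
    e->l2_state = MODIFIED;  /* line is now dirty with respect to memory */
}
```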
I hope this helps.
I found this. It might help.
Third comment here might also be useful.

Resources