I am not able to understand the concepts of the cache inclusion property in multi-level caching. As per my understanding, if we have 2 levels of cache, L1 and L2, then the contents of L1 must be a subset of L2. This implies that L2 must be at least as large as L1. Further, when a block in L1 is modified, we have to update it in two places: L2 and memory. Are these concepts correct?
In general, adding more levels of cache means adding more levels of access in the memory hierarchy. It is always a trade-off between capacity and access time: the larger the cache, the more we can store, but the longer it takes to search through. As you have said, the L2 cache must be larger than the L1 cache; otherwise it defeats its basic purpose.
Now, coming to whether L1 is a subset of L2: it is not always necessary. There are inclusive cache hierarchies and exclusive cache hierarchies. In an inclusive hierarchy, as you said, the last level is a superset of all the other caches.
You can check this presentation for more details: PPT.
Now, updating the different levels is a cache coherence problem, and the larger the number of levels, the larger the headache. You can check the various protocols here: cache coherence.
You are correct about an inclusive L2 cache being larger than the L1 cache. However, your statement that modifying a block in the L1 of an inclusive hierarchy also requires modifying the L2 and memory is not correct. The system you describe is called a "write-through" cache, where every write in the private cache also writes the next level(s) of cache. Inclusive cache hierarchies do not imply write-through caches.
Most architectures that have inclusive hierarchies use a "write-back" cache. A write-back cache differs from a write-through cache in that it does not require modifications in the current level of cache to be eagerly propagated to the next level (e.g. a write in the L1 cache does not have to immediately write the L2). Instead, a write-back cache updates only the current level and marks the data "dirty" (a dirty cacheline is one whose most recent value is in the current level while all outer levels hold stale values). A write-back cache flushes a dirty cacheline to the next level only on an eviction, i.e. when space needs to be created in the current cache to service a miss.
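To make the difference concrete, here is a minimal Python sketch (toy classes, not any real hardware interface; the names `WriteThroughL1`, `WriteBackL1` and `NextLevel` are made up for illustration) that counts how many writes reach the next level under each policy:

```python
# Minimal illustration (not real hardware): contrast write-through vs write-back.
# A store to a write-through L1 is propagated to the next level immediately;
# a write-back L1 only marks the line dirty and propagates it on eviction.

class NextLevel:
    def __init__(self):
        self.writes = 0                   # how many writes reached this level

    def write(self, addr, value):
        self.writes += 1

class WriteThroughL1:
    def __init__(self, next_level):
        self.data = {}
        self.next = next_level

    def store(self, addr, value):
        self.data[addr] = value
        self.next.write(addr, value)      # eager propagation on every store

class WriteBackL1:
    def __init__(self, next_level):
        self.data = {}
        self.dirty = set()
        self.next = next_level

    def store(self, addr, value):
        self.data[addr] = value
        self.dirty.add(addr)              # just mark the line dirty

    def evict(self, addr):
        if addr in self.dirty:            # flush only on eviction
            self.next.write(addr, self.data[addr])
            self.dirty.discard(addr)
        self.data.pop(addr, None)

l2a, l2b = NextLevel(), NextLevel()
wt, wb = WriteThroughL1(l2a), WriteBackL1(l2b)
for i in range(100):
    wt.store(0x40, i)
    wb.store(0x40, i)
wb.evict(0x40)
print(l2a.writes, l2b.writes)             # 100 1: write-back saves bandwidth
```

Repeated stores to the same line reach the next level once (on eviction) under write-back, versus on every store under write-through, which is exactly the bandwidth argument for write-back hierarchies.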
These concepts are summarized in the seminal work by Baer and Wang, "On the Inclusion Properties for Multi-Level Cache Hierarchies", ISCA 1988 (paper_link). The paper addresses your confusion with an initially confusing statement:
"A multilevel cache hierarchy has the inclusion property (MLI) if the contents of a cache at level C_(i+1) is a superset of the contents of all its children caches, C_i, at level i." This definition implies that the write-through policy must be used for lower-level caches. As we will assume write-back caches in this paper, the MLI is actually a "space" MLI, i.e., space is provided for inclusion but a write-back policy is implemented.
Related
I have asked a similar question: Can a lower level cache have higher associativity and still hold inclusion?
Suppose we have 2 levels of cache (L1 being nearest to the CPU (inner / lower level) and L2 being outside that, nearest to main memory). Can the L1 cache be write-back?
My attempt)
I think we can only have a write-through cache and cannot have a write-back cache at L1. If a block is replaced in the L1 cache, then it has to be written back to L2 and also to main memory in order to maintain inclusion. Hence it has to be write-through and not write-back.
All these doubts arise from the exam question below. :P
Question) For inclusion to hold between two cache levels L1 and L2 in a multi-level cache hierarchy, which of the following are necessary?
I) L1 must be write-through cache
II) L2 must be a write-through cache
III) The associativity of L2 must be greater than that of L1
IV) The L2 cache must be at least as large as the L1 cache
A) IV only
B) I and IV only
C) I, II and IV only
D) I, II, III and IV
As per my understanding, the answer should be option (B).
Real-life counterexample: the Intel i7 series (since Nehalem) has a large shared (between cores) L3 that's inclusive. And all levels are write-back (including the per-core private L2 and L1d) to reduce bandwidth requirements for the outer caches.
Inclusive just means that the outer cache tags have a state other than Invalid for every line in a valid state in any inner cache. Not necessarily that the data is also kept in sync. https://en.wikipedia.org/wiki/Cache_inclusion_policy calls that "value inclusion", and yes it does require a write-through (or read-only) inner cache. That's Option B, and is even stronger than just "inclusive".
My understanding of regular inclusion, specifically in Intel i7, is that data can be stale but tags are always inclusive. Moreover, since this is a multi-core CPU, L3 tags tell you which core's private L2/L1d cache owns a line in Exclusive or Modified state, if any. So you know which one to talk to if another core wants to read or write the line. i.e. it works as a snoop filter for those multi-core CPUs.
And conversely, if there are no tag matches in the inclusive L3 cache, the line is definitely not present anywhere on chip. (So an invalidate message doesn't need to be passed on to every core.) See also Which cache mapping technique is used in intel core i7 processor? for more details.
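As a rough illustration of the snoop-filter idea, here is a simplified toy model in Python (not Intel's actual protocol or data structures; `InclusiveL3Directory` and its methods are invented for this sketch):

```python
# Toy snoop-filter sketch: an inclusive L3 keeps a tag entry for every line
# cached in any core's private L1/L2, plus a record of which core (if any)
# may hold the line in Exclusive or Modified state.

class InclusiveL3Directory:
    def __init__(self):
        self.presence = {}   # line address -> set of cores that may hold a copy
        self.owner = {}      # line address -> core holding it Exclusive/Modified

    def core_fill(self, core, line, exclusive=False):
        """Record that `core` has brought `line` into its private L1/L2."""
        self.presence.setdefault(line, set()).add(core)
        if exclusive:
            self.owner[line] = core

    def snoop_read(self, requesting_core, line):
        """Decide where a read from `requesting_core` must be serviced."""
        if line not in self.presence:
            # No tag match in the inclusive L3: the line is definitely not in
            # any core's private cache, so nobody needs to be snooped.
            return "fetch from memory, snoop nobody"
        owner = self.owner.get(line)
        if owner is not None and owner != requesting_core:
            # Only the owning core may have newer data than the L3 copy.
            return "snoop core %d only" % owner
        return "serve from L3 / shared copies"

d = InclusiveL3Directory()
d.core_fill(core=0, line=0x1000, exclusive=True)
print(d.snoop_read(requesting_core=1, line=0x1000))  # snoop core 0 only
print(d.snoop_read(requesting_core=1, line=0x2000))  # fetch from memory, snoop nobody
```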
To write a line, the inner cache has to fetch / RFO it through the outer cache, so the outer cache gets a chance to maintain inclusion as it handles the RFO (read for ownership) triggered by an L1d/L2 write miss (a write to a line not already in Exclusive or Modified state).
Apparently this is not called "tag-inclusive"; that term may have some other technical meaning. I think I saw it used and made a wrong(?) assumption about what it meant. What is tag-only forced cache inclusion called? suggests "tag-inclusive" doesn't mean tags but no data either.
Having a line in Modified state in the inner cache (L1) means an inclusive outer cache will have a tag match for that line, even if the actual data in the outer cache is stale. (I'm not sure what state caches typically use for this case; according to @Hadi in the comments it's not Invalid. I assume it's not Shared either, because it needs to avoid using this stale data to satisfy read requests from other cores.)
When the data does eventually get written back from L1, the line can be in Modified state in the outer cache only, having been evicted from L1.
The answer to your question will be IV): L2 only needs to be at least as large as L1, i.e. option A.
Inclusive only means that a line present in L1 also needs to be present in L2. The line could be modified further in L1, and the state in L1 will reflect that.
When some other core looks up L2, it can snoop the state of the line in L1 and force a write-back if needed.
Can a lower level cache have higher associativity and still hold inclusion?
Suppose we have 2 levels of cache (L1 being nearest to the CPU and L2 being nearest to main memory).
The L1 cache is 2-way set associative with 4 sets, and let's say the L2 cache is direct-mapped with 16 cache lines; assume that both caches have the same block size. Then I think it will follow the inclusion property even though L1 (the lower level) has higher associativity than L2 (the upper level).
As per my understanding, the lower-level cache can have higher associativity (and still hold inclusion). This will only change the number of tag bits (as seen in the physical address at each level) and the number of comparators and MUXes to be used. Please let me know if this is correct.
Inclusion is a property that is enforced on the contents of the included cache, and is independent of the associativity of the caches. Inclusion provides the benefit of eliminating most snoops of the included cache, which allows an implementation to get away with reduced tag bandwidth, and may also result in reduced snoop latency.
Intuitively, it makes sense that when the enclosing cache has more associativity than the included cache, the contents of the included cache should always "fit" into the enclosing cache. This "static" view is an inappropriate oversimplification. Differences in replacement policy and access patterns will almost always generate cases in which lines are chosen as victim in the enclosing cache before being chosen as victim in the included cache. The inclusion policy requires that such lines be evicted from the included cache -- independent of associativity.
The case that is intuitively more problematic occurs when the enclosing cache has less associativity than the included cache. In this case it is clear that associativity conflicts in the enclosing cache will force evictions from the included cache.
In either case, judging whether the additional evictions from the included cache outweigh the benefits of inclusion is multi-dimensional. The performance impact will depend on the specific sizes, associativities, and indexing of the caches, as well as on the application access patterns. The importance of the performance impact depends on the application characteristics -- e.g., tightly-coupled parallel applications typically show throughput proportional to the worst-case performance of the participating processors, while independent applications typically show throughput proportional to the average performance of the participating processors.
Yes, but conflict evictions in the outer cache may force eviction from the inner cache to maintain inclusivity.
(I think if both caches use simple indexing, you wouldn't have that with an outer cache that's larger and at least as associative, because aliasing in the outer cache would only happen when you also alias in the inner.)
With the outer cache being larger you don't necessarily get aliasing in it for lines that would alias in L1, so it's not useless.
But it is unusual: usually outer caches are larger and more associative than inner caches because they don't have to be as fast, and high hit rate is more valuable.
If you were going to make the outer cache less associative than an inner cache, making it NINE (non-inclusive non-exclusive) might be a better idea. But you only asked if it was possible.
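For completeness, here is a minimal sketch of how such a hierarchy can enforce inclusion by back-invalidation, using the toy geometry from the question above (L1 2-way with 4 sets, L2 direct-mapped with 16 lines, same block size). The classes and the LRU policy are assumptions for illustration, not a model of any real design:

```python
# Inclusive two-level hierarchy with back-invalidation: when the outer cache
# (L2) evicts a line, the line must also be invalidated in the inner cache (L1).

from collections import OrderedDict

class SetAssocCache:
    def __init__(self, num_sets, ways):
        self.num_sets, self.ways = num_sets, ways
        self.sets = [OrderedDict() for _ in range(num_sets)]  # LRU order per set

    def _set(self, line):
        return self.sets[line % self.num_sets]

    def contains(self, line):
        return line in self._set(line)

    def insert(self, line):
        """Insert a line; return the victim line address if one was evicted."""
        s = self._set(line)
        if line in s:
            s.move_to_end(line)
            return None
        victim = None
        if len(s) >= self.ways:
            victim, _ = s.popitem(last=False)      # evict the LRU line
        s[line] = True
        return victim

    def invalidate(self, line):
        self._set(line).pop(line, None)

l1 = SetAssocCache(num_sets=4, ways=2)    # 8 lines total, 2-way
l2 = SetAssocCache(num_sets=16, ways=1)   # 16 lines, direct mapped

def access(line):
    l2_victim = l2.insert(line)
    if l2_victim is not None:
        l1.invalidate(l2_victim)          # back-invalidate to keep inclusion
    l1.insert(line)                       # L1 may evict on its own, too
    # inclusion invariant: every line in L1 is also in L2
    assert all(l2.contains(a) for s in l1.sets for a in s)

# Lines 0 and 16 conflict in L2's single-way set 0 but would both fit in L1's
# 2-way set 0; inserting 16 evicts 0 from L2, which forces 0 out of L1 as well.
access(0); access(16)
print(l1.contains(0), l2.contains(0))     # False False
```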
A capacity miss occurs when blocks are discarded from the cache because the cache cannot contain all the blocks needed for program execution (the program's working set is much larger than the cache capacity).
A conflict miss occurs with set-associative or direct-mapped block placement strategies: conflict misses occur when several blocks map to the same set or block frame; they are also called collision misses or interference misses.
Are they actually very closely related?
For example, suppose all the cache lines are filled and we have a read request for memory block B, for which we have to evict memory block A.
Should that be considered a capacity miss, since we don't have enough space? And later, when we want to access memory block A again, since it was evicted earlier, is that considered a conflict miss?
Am I understanding this correctly? Thanks
The important distinction here is between cache misses caused by the size of your data set, and cache misses caused by the way your cache and data alignment are organized.
Let's assume you have a 32k direct-mapped cache, and consider the following 2 cases:
You repeatedly iterate over a 128k array. There's no way the data can fit in that cache, therefore all the misses are capacity ones (except the first access of each line which is a compulsory miss, and would remain even if you could increase your cache infinitely).
You have 2 small 8k arrays, but unfortunately they are both aligned and map to the same sets. This means that while they could theoretically fit in the cache (if you fix your alignment), they will not utilize the full cache size and instead compete for the same group of sets and thrash each other. These are conflict misses, since the data could fit, but still collides due to organization. The same problem can occur with set associative caches, although less often (let's say the cache is 2-way, but you have 4 aligned data sets...).
The two types are indeed related: you could say that given high levels of associativity, set skewing, proper data alignment and other techniques, you can reduce the conflict misses until you're mostly left with true capacity misses that are unavoidable.
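The aliasing case above can be reproduced with a few lines of arithmetic. This sketch assumes a 32 KiB direct-mapped cache with 64-byte lines (so 512 sets) and two 8 KiB arrays whose base addresses differ by a multiple of the cache size; the addresses are made up for illustration:

```python
# Index-aliasing demo: two 8 KiB arrays whose bases differ by a multiple of the
# cache size map to exactly the same sets of a 32 KiB direct-mapped cache, so
# they thrash each other even though 16 KiB of data fits easily in 32 KiB.

CACHE_SIZE = 32 * 1024
LINE_SIZE  = 64
NUM_SETS   = CACHE_SIZE // LINE_SIZE      # 512 sets in a direct-mapped cache

def sets_touched(base, size):
    return {(addr // LINE_SIZE) % NUM_SETS
            for addr in range(base, base + size, LINE_SIZE)}

a_base = 0x100000                  # arbitrary cache-size-aligned base address
b_base = a_base + CACHE_SIZE       # aliases with a_base: same index bits
a_sets = sets_touched(a_base, 8 * 1024)
b_sets = sets_touched(b_base, 8 * 1024)

print(len(a_sets), len(b_sets))    # 128 128: each array touches 128 sets
print(a_sets == b_sets)            # True: every access collides with the other array
```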
My favorite definition of conflict misses is from Reducing Compulsory and Capacity Misses by Norman P. Jouppi:
Conflict misses are misses that would not occur if the cache were
fully associative with LRU replacement.
Let's look at an example. We have a direct-mapped cache of size 4. The access sequence is: 0 (compulsory miss), 1 (compulsory miss), 2 (compulsory miss), 3 (compulsory miss), 4 (compulsory miss), 1 (hit), 2 (hit), 3 (hit), 0 (capacity miss), 4 (capacity miss), 0 (conflict miss).
The second-to-last 0 is a capacity miss because, even if the cache were fully associative with LRU replacement, it would still be a miss: 4, 1, 2, 3 are all accessed between the first 0 and that one. However, the last 0 is a conflict miss, because in a fully associative cache the preceding 4 would have replaced 1 in the cache instead of 0.
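The same classification can be reproduced mechanically by running the trace against both the direct-mapped cache and a same-sized fully-associative LRU cache, per Jouppi's definition. This is a toy sketch, not a full simulator:

```python
# Classify misses for the example above: a miss is a conflict miss exactly when
# a fully-associative LRU cache of the same total size would have hit.

from collections import OrderedDict

def classify(trace, num_lines):
    direct = {}                      # set index -> block currently resident
    fa_lru = OrderedDict()           # fully-associative LRU cache of same size
    seen = set()
    result = []
    for block in trace:
        dm_hit = direct.get(block % num_lines) == block
        fa_hit = block in fa_lru
        # maintain the fully-associative LRU reference cache
        if fa_hit:
            fa_lru.move_to_end(block)
        else:
            if len(fa_lru) >= num_lines:
                fa_lru.popitem(last=False)
            fa_lru[block] = True
        if dm_hit:
            result.append((block, "hit"))
        elif block not in seen:
            result.append((block, "compulsory miss"))
        elif fa_hit:
            result.append((block, "conflict miss"))   # FA-LRU would have hit
        else:
            result.append((block, "capacity miss"))   # even FA-LRU misses
        direct[block % num_lines] = block
        seen.add(block)
    return result

for block, kind in classify([0, 1, 2, 3, 4, 1, 2, 3, 0, 4, 0], num_lines=4):
    print(block, kind)
```

Running it prints five compulsory misses, three hits, two capacity misses and a final conflict miss, matching the annotated sequence above.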
Compulsory miss: the very first access to a memory block, which must be brought into the cache; the block occupies a fresh, empty cache line.
Conflict miss: even though there are still empty lines in the cache, a block of main memory conflicts with an already filled line, i.e. the block has to occupy a line that is already filled even though empty space is available elsewhere. This is called a conflict miss.
Capacity miss: a miss that occurs when all the lines of the cache are filled.
A conflict miss occurs only in direct-mapped and set-associative caches, because with fully associative mapping no block of main memory is forced to occupy an already filled line.
From a previous question on this forum, I learned that in most memory systems the L1 cache is a subset of the L2 cache, meaning that any entry removed from L2 is also removed from L1.
So now my question is: how do I determine the corresponding entry in the L1 cache for an entry in the L2 cache? The only information stored in an L2 entry is the tag. Based on this tag information, if I re-create the address, it may span multiple lines in the L1 cache if the line sizes of the L1 and L2 caches are not the same.
Does the architecture really bother to flush both of those lines, or does it just maintain the L1 and L2 caches with the same line size?
I understand that this is a policy decision but I want to know the commonly used technique.
Cache-line size is (typically) 64 bytes.
Moreover, take a look at this very interesting article about processor caches:
Gallery of Processor Cache Effects
You will find the following chapters:
Memory accesses and performance
Impact of cache lines
L1 and L2 cache sizes
Instruction-level parallelism
Cache associativity
False cache line sharing
Hardware complexities
In Core i7 the line sizes in L1, L2 and L3 are the same: 64 bytes.
I guess this simplifies maintaining the inclusive property, and coherence.
See page 10 of: https://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
The most common technique of handling cache block size in a strictly inclusive cache hierarchy is to use the same size cache blocks for all levels of cache for which the inclusion property is enforced. This results in greater tag overhead than if the higher level cache used larger blocks, which not only uses chip area but can also increase latency since higher level caches generally use phased access (where tags are checked before the data portion is accessed). However, it also simplifies the design somewhat and reduces the wasted capacity from unused portions of the data. It does not take a large fraction of unused 64-byte chunks in 128-byte cache blocks to compensate for the area penalty of an extra 32-bit tag. In addition, the larger cache block effect of exploiting broader spatial locality can be provided by relatively simple prefetching, which has the advantages that no capacity is left unused if the nearby chunk is not loaded (to conserve memory bandwidth or reduce latency on a conflicting memory read) and that the adjacency prefetching need not be limited to a larger aligned chunk.
A less common technique divides the cache block into sectors. Having the sector size the same as the block size for lower level caches avoids the problem of excess back-invalidation since each sector in the higher level cache has its own valid bit. (Providing all the coherence state metadata for each sector rather than just validity can avoid excessive writeback bandwidth use when at least one sector in a block is not dirty/modified and some coherence overhead [e.g., if one sector is in shared state and another is in the exclusive state, a write to the sector in the exclusive state could involve no coherence traffic—if snoopy rather than directory coherence is used].)
The area savings from sectored cache blocks were especially significant when tags were on the processor chip but the data was off-chip. Obviously, if the data storage takes area comparable to the size of the processor chip (which is not unreasonable), then 32-bit tags with 64-byte blocks would take roughly a 16th (~6%) of the processor area while 128-byte blocks would take half as much. (IBM's POWER6+, introduced in 2009, is perhaps the most recent processor to use on-processor-chip tags and off-processor data. Storing data in higher-density embedded DRAM and tags in lower-density SRAM, as IBM did, exaggerates this effect.)
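The fractions quoted above can be checked with a quick back-of-the-envelope calculation, under the stated simplification that the data array takes area comparable to the processor chip and that a tag costs about 32 bits per block:

```python
# Rough tag-overhead arithmetic for the area fractions mentioned above.
TAG_BITS = 32                                  # tag size assumed in the text

for block_bytes in (64, 128):
    tag_bytes = TAG_BITS / 8
    fraction = tag_bytes / block_bytes         # tag area relative to data array
    print("%3d-byte blocks: tags ~ %.1f%% of the data array"
          % (block_bytes, 100 * fraction))
# 64-byte blocks:  tags ~ 6.2% of the data array (roughly 1/16)
# 128-byte blocks: tags ~ 3.1% of the data array (half as much)
```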
It should be noted that Intel uses "cache line" to refer to the smaller unit and "cache sector" for the larger unit. (This is one reason why I used "cache block" in my explanation.) Using Intel's terminology it would be very unusual for cache lines to vary in size among levels of cache regardless of whether the levels were strictly inclusive, strictly exclusive, or used some other inclusion policy.
(Strict exclusion typically uses the higher level cache as a victim cache where evictions from the lower level cache are inserted into the higher level cache. Obviously, if the block sizes were different and sectoring was not used, then an eviction would require the rest of the larger block to be read from somewhere and invalidated if present in the lower level cache. [Theoretically, strict exclusion could be used with inflexible cache bypassing where an L1 eviction would bypass L2 and go to L3 and L1/L2 cache misses would only be allocated to either L1 or L2, bypassing L1 for certain accesses. The closest to this being implemented that I am aware of is Itanium's bypassing of L1 for floating-point accesses; however, if I recall correctly, the L2 was inclusive of L1.])
Typically, one access to main memory transfers 64 bytes of data and 8 bytes of parity/ECC (I don't remember exactly which). And it is rather complicated to maintain different cache line sizes at the various memory levels. Note that the cache line size is more correlated with the word alignment size on that architecture than with anything else; based on that, a cache line size is highly unlikely to be different from the memory access size. The parity bits are for the use of the memory controller, so the cache line size is typically 64 bytes. The processor really controls very little beyond the registers; everything else going on in the computer is more about getting hardware in place to optimize CPU performance. In that sense too, it really would not make sense to add extra complexity by making cache line sizes different at different levels of memory.
I have been running some benchmarks on some algorithms and profiling their memory usage and efficiency (L1/L2/TLB accesses and misses), and some of the results are quite intriguing for me.
Considering an inclusive cache hierarchy (L1 and L2 caches), shouldn't the number of L1 cache misses coincide with the number of L2 cache accesses? One explanation I can find would be TLB-related: when a virtual address is not mapped in the TLB, the system automatically skips searches in some cache levels.
Does this seem legitimate?
First, inclusive cache hierarchies may not be so common as you assume. For example, I do not think any current Intel processors - not Nehalem, not Sandybridge, possibly Atoms - have an L1 that is included within the L2. (Nehalem and probably Sandybridge do, however, have both L1 and L2 included within L3; using Intel's current terminology, FLC and MLC in LLC.)
But, this doesn't necessarily matter. In most cache hierarchies, if you have an L1 cache miss, then that miss will probably be looked up in the L2. It doesn't matter if the hierarchy is inclusive or not. To do otherwise, you would have to have something that told you that the data you care about is (probably) not in the L2, so you don't need to look. Although I have designed protocols and memory types that do this - e.g. a memory type that is cached only in the L1 but not the L2, useful for stuff like graphics where you get the benefits of combining in the L1, but where you are repeatedly scanning over a large array, so caching in the L2 is not a good idea - I am not aware of anyone shipping them at the moment.
Anyway, here are some reasons why the number of L1 cache misses may not be equal to the number of L2 cache accesses.
You don't say what systems you are working on - I know my answer is applicable to Intel x86s such as Nehalem and Sandybridge, whose EMON performance event monitoring allows you to count things such as L1 and L2 cache misses, etc. It will probably also apply to any modern microprocessor with hardware performance counters for cache misses, such as those on ARM and Power.
Most modern microprocessors do not stop at the first cache miss, but keep going, trying to do extra work. This is often called speculative execution. Furthermore, the processor may be in-order or out-of-order, but although the latter may give you even greater differences between the number of L1 misses and the number of L2 accesses, it's not necessary - you can get this behavior even on in-order processors.
Short answer: many of these speculative memory accesses will be to the same memory location. They will be squashed and combined.
The performance event "L1 cache misses" is probably[*] counting the number of (speculative) instructions that missed the L1 cache. These then allocate a hardware data structure, called a fill buffer at Intel and a miss status handling register (MSHR) in some other places. Subsequent cache misses to the same cache line will miss the L1 cache but hit the fill buffer, and will get squashed. Only one of them, typically the first, will get sent to the L2 and counted as an L2 access.
By the way, there may be a performance event for this: Squashed_Cache_Misses.
There may also be a performance event L1_Cache_Misses_Retired. But this may undercount, since speculation may pull the data into the cache, and a cache miss at retirement may never occur.
([*] By the way, when I say "probably" here I mean "On the machines that I helped design". Almost definitely. I might have to check the definition, look at the RTL, but I would be immensely surprised if not. It is almost guaranteed.)
E.g. imagine that you are accessing bytes A[0], A[1], A[2], ... A[63], A[64], ...
If the address of A[0] is equal to zero modulo 64, then A[0]..A[63] will be in the same cache line, on a machine with 64-byte cache lines. If the code that uses these is simple, it is quite possible that all of them can be issued speculatively. QED: 64 speculative memory accesses, 64 L1 cache misses, but only one L2 memory access.
(By the way, don't expect the numbers to be quite so clean. You might not get exactly 64 L1 accesses per L2 access.)
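A toy model of that miss-combining behaviour (idealized: it assumes every access is issued before the first fill completes, and it ignores replacement entirely; the function name and parameters are invented for this sketch) looks like this:

```python
# Toy model of miss combining via fill buffers / MSHRs: each access that misses
# L1 counts as an L1 miss, but only the first miss to a given 64-byte line
# allocates a fill buffer and goes to the L2; later misses to the same line
# are "squashed" into that buffer.

LINE_SIZE = 64

def simulate(byte_addresses):
    l1_lines = set()          # lines currently present in L1 (none ever "return"
                              # here, modelling all accesses issued back-to-back)
    fill_buffers = set()      # lines with an outstanding miss
    l1_misses = l2_accesses = 0
    for addr in byte_addresses:
        line = addr // LINE_SIZE
        if line in l1_lines:
            continue                      # L1 hit
        l1_misses += 1                    # counted once per missing access
        if line not in fill_buffers:
            fill_buffers.add(line)        # allocate a fill buffer
            l2_accesses += 1              # only this one reaches the L2
    return l1_misses, l2_accesses

# 64 byte accesses A[0]..A[63], with A starting on a 64-byte boundary,
# all issued speculatively before the line comes back:
print(simulate(range(0, 64)))             # (64, 1): 64 L1 misses, one L2 access
```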
Some more possibilities:
If the number of L2 accesses is greater than the number of L1 cache misses (I have almost never seen it, but it is possible), you may have a memory access pattern that is confusing a hardware prefetcher. The hardware prefetcher tries to predict which cache lines you are going to need. If the prefetcher predicts badly, it may fetch cache lines that you don't actually need. Oftentimes there is a performance event to count Prefetches_from_L2 or Prefetches_from_Memory.
Some machines may cancel speculative accesses that have caused an L1 cache miss, before they are sent to the L2. However, I don't know of Intel doing this.
The write policy of a data cache determines whether a store hit writes its data only on that cache (write-back or copy-back) or also at the following level of the cache hierarchy (write-through).
Hence, a store that hits at a write-through L1-D cache, also writes its data at the L2 cache.
This could be another source of L2 accesses that do not come from L1 cache misses.