Related
I was reading a book on cache optimizations. In the third optimization, i.e., Higher Associativity to Reduce Miss Rate, the author says:
2:1 cache rule (for caches up to size 128 KB): A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.
But the author provides no explanation or proof of this, and I am unable to understand it.
Could anyone explain how he came up with this rule?
It's a "rule of thumb" that's (apparently) usually true for normal workloads, I'd guess from simulating traces of real workloads like SPECint and/or commercial software. Not an actual law that's always true.
Seems reasonable if most of the misses are just conflict misses, not capacity misses.
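One way to see where such a rule of thumb could come from is to simulate both organizations on the same address trace and compare miss rates. The sketch below is only a toy experiment with a synthetic trace (the published rule is averaged over real benchmark traces, per the answer above), but it shows the kind of comparison involved:

    # Toy comparison: direct-mapped cache of size N vs. 2-way LRU cache of size N/2.
    # The trace is synthetic; real evaluations use traces from benchmarks like SPEC.
    import random

    BLOCK = 64  # assumed block size in bytes

    def misses_direct_mapped(trace, size_bytes):
        n_sets = size_bytes // BLOCK
        lines = [None] * n_sets            # one tag per set
        misses = 0
        for addr in trace:
            block = addr // BLOCK
            idx, tag = block % n_sets, block // n_sets
            if lines[idx] != tag:
                misses += 1
                lines[idx] = tag
        return misses

    def misses_two_way_lru(trace, size_bytes):
        n_sets = size_bytes // BLOCK // 2
        sets = [[] for _ in range(n_sets)]  # each set: list of tags, MRU last
        misses = 0
        for addr in trace:
            block = addr // BLOCK
            idx, tag = block % n_sets, block // n_sets
            ways = sets[idx]
            if tag in ways:
                ways.remove(tag)            # hit: move to MRU position
            else:
                misses += 1
                if len(ways) == 2:
                    ways.pop(0)             # evict the LRU way
            ways.append(tag)
        return misses

    random.seed(0)
    # Synthetic trace with some locality: 90% reuse of a hot region, 10% random.
    hot = [random.randrange(0, 1 << 16) for _ in range(256)]
    trace = [random.choice(hot) if random.random() < 0.9 else random.randrange(0, 1 << 24)
             for _ in range(200_000)]

    N = 32 * 1024
    print("direct-mapped, size N   :", misses_direct_mapped(trace, N))
    print("2-way LRU,     size N/2 :", misses_two_way_lru(trace, N // 2))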
Can a lower level cache have higher associativity and still hold inclusion?
Suppose we have two levels of cache (L1 being nearest to the CPU and L2 being nearest to main memory).
The L1 cache is 2-way set-associative with 4 sets, and let's say the L2 cache is direct-mapped with 16 cache lines; assume that both caches have the same block size. Then I think it will follow the inclusion property even though L1 (the lower level) has higher associativity than L2 (the upper level).
As per my understanding, a lower-level cache can have higher associativity (and still hold inclusion). This will only change the number of tag bits (as seen in the physical address at each level) and the number of comparators and MUXes to be used. Please let me know if this is correct.
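For concreteness, here is a quick sketch of how the tag/index bit counts work out for the configuration above, assuming 32-bit physical addresses and a 64-byte block size (both are assumptions, not given in the question):

    # Split an address into offset/index/tag for each cache level.
    import math

    ADDR_BITS = 32
    BLOCK = 64
    offset_bits = int(math.log2(BLOCK))

    def split(total_lines, ways):
        sets = total_lines // ways
        index_bits = int(math.log2(sets))
        tag_bits = ADDR_BITS - index_bits - offset_bits
        return sets, index_bits, tag_bits

    # L1: 2-way, 4 sets -> 8 lines; L2: direct-mapped, 16 lines.
    for name, lines, ways in [("L1", 8, 2), ("L2", 16, 1)]:
        sets, idx, tag = split(lines, ways)
        print(f"{name}: {sets} sets, index bits = {idx}, tag bits = {tag}, "
              f"comparators per lookup = {ways}")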
Inclusion is a property that is enforced on the contents of the included cache, and is independent of the associativity of the caches. Inclusion provides the benefit of eliminating most snoops of the included cache, which allows an implementation to get away with reduced tag bandwidth, and may also result in reduced snoop latency.
Intuitively, it makes sense that when the enclosing cache has more associativity than the included cache, the contents of the included cache should always "fit" into the enclosing cache. This "static" view is an inappropriate oversimplification. Differences in replacement policy and access patterns will almost always generate cases in which lines are chosen as victim in the enclosing cache before being chosen as victim in the included cache. The inclusion policy requires that such lines be evicted from the included cache -- independent of associativity.
The case that is intuitively more problematic occurs when the enclosing cache has less associativity than the included cache. In this case it is clear that associativity conflicts in the enclosing cache will force evictions from the included cache.
In either case, judging whether the additional evictions from the included cache outweigh the benefits of inclusion is multi-dimensional. The performance impact will depend on the specific sizes, associativities, and indexing of the caches, as well as on the application access patterns. The importance of the performance impact depends on the application characteristics -- e.g., tightly-coupled parallel applications typically show throughput proportional to the worst-case performance of the participating processors, while independent applications typically show throughput proportional to the average performance of the participating processors.
Yes, but conflict evictions in the outer cache may force eviction from the inner cache to maintain inclusivity.
(I think if both caches use simple indexing, you wouldn't have that with an outer cache that's larger and at least as associative, because aliasing in the outer cache would only happen when you also alias in the inner.)
With the outer cache being larger you don't necessarily get aliasing in it for lines that would alias in L1, so it's not useless.
But it is unusual: usually outer caches are larger and more associative than inner caches because they don't have to be as fast, and high hit rate is more valuable.
If you were going to make the outer cache less associative than an inner cache, making it NINE (non-inclusive non-exclusive) might be a better idea. But you only asked if it was possible.
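To make the back-invalidation mentioned above concrete, here is a toy model (not modeled on any particular CPU) of an inclusive two-level hierarchy with the configuration from the question: an eviction from the direct-mapped outer cache forces the same line out of the more-associative inner cache, turning what could have been an inner-cache hit into a miss.

    class SetAssocCache:
        def __init__(self, n_sets, ways):
            self.n_sets, self.ways = n_sets, ways
            self.sets = [[] for _ in range(n_sets)]   # per-set list of tags, MRU last

        def lookup(self, block):
            ways = self.sets[block % self.n_sets]
            tag = block // self.n_sets
            if tag in ways:
                ways.remove(tag)
                ways.append(tag)
                return True
            return False

        def insert(self, block):
            """Insert a block; return the block number it evicted, if any."""
            idx = block % self.n_sets
            ways = self.sets[idx]
            victim = None
            if len(ways) == self.ways:
                victim = ways.pop(0) * self.n_sets + idx   # rebuild block number from tag
            ways.append(block // self.n_sets)
            return victim

        def invalidate(self, block):
            ways = self.sets[block % self.n_sets]
            tag = block // self.n_sets
            if tag in ways:
                ways.remove(tag)

    l1 = SetAssocCache(n_sets=4, ways=2)    # inner cache: 2-way, 4 sets
    l2 = SetAssocCache(n_sets=16, ways=1)   # outer cache: direct-mapped, 16 lines

    def access(block):
        if l1.lookup(block):
            return "L1 hit"
        if not l2.lookup(block):
            victim = l2.insert(block)
            if victim is not None:
                l1.invalidate(victim)       # back-invalidate to preserve inclusion
        l1.insert(block)                    # L1 victims are assumed clean and dropped
        return "L1 miss"

    # Blocks 0 and 16 conflict in the direct-mapped outer cache but would fit
    # together in the 2-way inner cache; inclusion still forces block 0 out.
    for b in [0, 16, 0]:
        print(b, access(b))                 # the final access to block 0 misses in L1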
I came to the topic of caching and mapping and cache misses, and how cache blocks get replaced, and in what order, when all blocks are already full.
There is the least-recently-used algorithm, the FIFO algorithm, the least-frequently-used algorithm, random replacement, ...
But which algorithms are used in actual CPU caches? Or can you use all of them and the... operating system decides what the best algorithm is?
Edit: Even though I chose an answer, any further information is welcome ;)
As hivert said, it's hard to get a clear picture of the specific algorithm, but one can deduce some of the information from hints or clever reverse engineering.
You didn't specify which CPU you mean; each one can have a different policy (actually, even within the same CPU, different cache levels may have different policies, not to mention TLBs and other associative arrays, which may also have such policies). I did find a few hints about Intel (specifically Ivy Bridge), so we'll use this as a benchmark for industry-level "standards" (which may or may not apply elsewhere).
First, Intel presented some LRU-related features here:
http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-1-Microprocessor/HC24.28.117-HotChips_IvyBridge_Power_04.pdf
Slide 46 mentions "Quad-Age LRU" - this is apparently an age-based LRU that assigns an "age" to each line according to its predicted importance. They mention that prefetches get a middle age, so demand accesses are probably allocated at a higher age (or lower, whichever survives longest), and all lines likely age gradually, so the oldest gets replaced first. Not as good as perfect "fifo-like" LRU, but keep in mind that most caches don't implement that anyway, but rather a complicated pseudo-LRU solution, so this might be an improvement.
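To make the age idea concrete, here is a hedged sketch of a generic quad-age (2-bit) policy for a single set. The actual insertion ages, promotion rules, and ageing rules are exactly the undocumented per-CPU details, so treat the constants below as placeholders, not Ivy Bridge's real policy:

    # A generic 2-bit "age" policy for one set: hits make a block young, the
    # victim is a block of maximal age, and the set is aged until one exists.
    class QLRUSet:
        def __init__(self, ways, insert_age=2):
            self.ages = [3] * ways        # 3 = oldest; empty ways start as oldest
            self.tags = [None] * ways
            self.insert_age = insert_age

        def access(self, tag):
            if tag in self.tags:
                self.ages[self.tags.index(tag)] = 0     # hit: make youngest
                return True
            while 3 not in self.ages:                   # no candidate: age everyone
                self.ages = [a + 1 for a in self.ages]
            way = self.ages.index(3)                    # evict an oldest block
            self.tags[way] = tag
            self.ages[way] = self.insert_age            # insert at a "middle" age
            return False

    s = QLRUSet(4)
    for t in ["A", "B", "C", "D", "A", "E"]:
        print(t, "hit" if s.access(t) else "miss", s.tags, s.ages)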
Another interesting mechanism mentioned there, which goes the extra mile beyond classic LRU, is the adaptive fill policy. There's a pretty good analysis here - http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ - but in a nutshell (if the blog is correct, and it does seem to match his measurements well), the cache dynamically chooses between two LRU policies, trying to decide whether the lines are going to be reused or not (and therefore whether they should be kept or not).
I guess this could answer your question about multiple LRU schemes to some extent. Implementing several schemes is probably hard and expensive in terms of HW, but when you have a policy that's complicated enough to have parameters, it's possible to use tricks like dynamic selection, set dueling, etc.
The following are some examples of replacement policies used in actual processors.
The PowerPC 7450's 8-way L1 cache used binary tree pLRU. Binary tree pLRU uses one bit per pair of ways to set an LRU for that pair, then an LRU bit for each pair of pairs of ways, etc. The 8-way L2 used pseudo-random replacement settable by privileged software (the OS) as using either a 3-bit counter incremented every clock cycle or a shift-register-based pseudo-random number generator.
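A minimal sketch of that binary-tree pLRU scheme for a single 8-way set (7 tree bits, each steering toward the half that should be victimized next); this is an illustration of the idea, not the 7450's actual logic:

    # Binary-tree pseudo-LRU for one 8-way set: N-1 bits approximate LRU order.
    class TreePLRU:
        def __init__(self, ways=8):
            self.ways = ways
            self.bits = [0] * (ways - 1)     # bits[0] is the root of the tree

        def touch(self, way):
            """On an access, flip the path bits to point away from this way."""
            node, lo, hi = 0, 0, self.ways
            while hi - lo > 1:
                mid = (lo + hi) // 2
                if way < mid:
                    self.bits[node] = 1      # 1 = next victim lives in the right half
                    node, hi = 2 * node + 1, mid
                else:
                    self.bits[node] = 0      # 0 = next victim lives in the left half
                    node, lo = 2 * node + 2, mid

        def victim(self):
            """Follow the bits from the root down to the pseudo-LRU way."""
            node, lo, hi = 0, 0, self.ways
            while hi - lo > 1:
                mid = (lo + hi) // 2
                if self.bits[node] == 0:
                    node, hi = 2 * node + 1, mid
                else:
                    node, lo = 2 * node + 2, mid
            return lo

    plru = TreePLRU(8)
    for w in [0, 1, 2, 3]:
        plru.touch(w)
    print(plru.victim())   # 4: the bits steer the victim into the untouched half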
The StrongARM SA-1110 32-way L1 data cache used FIFO. It also had a 2-way minicache for transient data, which also seems to have used FIFO. (The Intel StrongARM SA-1110 Microprocessor Developer’s Manual states "Replacements in the minicache use the same round-robin pointer mechanism as in the main data cache. However, since this cache is only two-way set-associative, the replacement algorithm reduces to a simple least-recently-used (LRU) mechanism."; but 2-way FIFO is not the same as LRU even with only two ways, though for streaming data it works out the same.)
The HP PA 7200 had a 64-block fully associative "assist cache" that was accessed in parallel with an off-chip direct-mapped data cache. The assist cache used FIFO replacement with the option of evicting to the off-chip L1 cache. Load and store instructions had a "locality only" hint; if an assist cache entry was loaded by such a memory access, it would be evicted to memory bypassing the off-chip L1.
For 2-way associativity, true LRU might be the most common choice since it has good behavior (and, incidentally, is the same as binary tree pLRU when there are only two ways). E.g., the Fairchild Clipper Cache And Memory Management Unit used LRU for its 2-way cache. FIFO is slightly cheaper than LRU since the replacement information is only updated when the tags are written anyway, i.e., when inserting a new cache block, but has better behavior than counter-based pseudo-random replacement (which has even lower overhead). The HP PA 7300LC used FIFO for its 2-way L1 caches.
The Itanium 9500 series (Poulson) uses NRU for L1 and L2 data caches, L2 instruction cache, and the L3 cache (L1 instruction cache is documented as using LRU.). For the 24-way L3 cache in the Itanium 2 6M (Madison), a bit per block was provided for NRU with an access to a block setting the bit corresponding to its set and way ("Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache", Stefan Rusu et al., 2004). This is similar to the clock page replacement algorithm.
I seem to recall reading elsewhere that the bits were cleared when all were set (rather than keeping the one that set the last unset bit) and that the victim was chosen by a find first unset scan of the bits. This would have the hardware advantage of only having to read the information (which was stored in distinct arrays from but nearby the L3 tags) on a cache miss; a cache hit could simply set the appropriate bit. Incidentally, this type of NRU avoids some of the bad traits of true LRU (e.g., LRU degrades to FIFO in some cases and in some of these cases even random replacement can increase the hit rate).
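A small sketch of that bit-per-block scheme for a single set, following the description above (an access sets a bit, an all-set condition triggers a reset, and the victim is found by a find-first-unset scan). This is an illustration of the idea, not the actual Itanium logic; in particular, the sketch keeps the most recent access marked across the reset, and as noted above some variants clear that bit too:

    # NRU for one set: one "recently used" bit per way.
    class NRUSet:
        def __init__(self, ways):
            self.ways = ways
            self.bits = [0] * ways           # 0 = not recently used (eviction candidate)

        def touch(self, way):
            self.bits[way] = 1               # mark as recently used
            if all(self.bits):               # every bit set: clear them all...
                self.bits = [0] * self.ways
                self.bits[way] = 1           # ...keeping only the latest access marked

        def victim(self):
            return self.bits.index(0)        # find-first-unset scan

    s = NRUSet(4)
    for w in [0, 2, 1, 3]:
        s.touch(w)
    print(s.bits, "victim:", s.victim())     # touching way 3 set the last bit, so the
                                             # reset makes ways 0-2 candidates again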
For Intel CPUs, the replacement policies are usually undocumented. I have done some experiments to uncover the policies in recent Intel CPUs, the results of which can be found on https://uops.info/cache.html. The code that I used is available on GitHub.
The following is a summary of my findings.
Tree-PLRU: This policy is used by the L1 data caches of all CPUs that I tested, as well as by the L2 caches of the Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell and Broadwell CPUs.
Randomized Tree-PLRU: Some Core 2 Duo CPUs use variants of Tree-PLRU in their L2 caches where either the lowest or the highest bits in the tree are replaced by (pseudo-)randomness.
MRU: This policy is sometimes also called NRU. It uses one bit per cache block. An access to a block sets the bit to 0. If the last 1-bit was set to 0, all other bits are set to 1. Upon a miss, the first block with its bit set to 1 is replaced. This policy is used for the L3 caches of the Nehalem, Westmere, and Sandy Bridge CPUs.
Quad-Age LRU (QLRU): This is a generalization of the MRU policy that uses two bits per cache block. Different variants of this policy are used for the L3 caches, starting with Ivy Bridge, and for the L2 caches, starting with Skylake.
Adaptive policies: The Ivy Bridge, Haswell, and Broadwell CPUs can dynamically choose between two different QLRU variants. This is implemented via set dueling: A small number of dedicated sets always use the same QLRU variant; the remaining sets are "follower sets" that use the variant that performs better on the dedicated sets. See also http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.
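For the set-dueling part, here is a hedged sketch of the selection mechanism in the spirit of the DIP paper by Qureshi et al., not a description of Intel's actual circuitry: a few leader sets are pinned to each policy, a saturating counter tracks which group misses more, and the follower sets copy whichever policy is currently winning.

    # Set dueling: leader sets pinned to policy A or B, followers copy the winner.
    class SetDueling:
        def __init__(self, n_sets, leaders_per_policy=4, max_count=1023):
            self.psel = max_count // 2                  # saturating selector counter
            self.max_count = max_count
            # Simple leader assignment; real hardware uses a hashed selection.
            self.leader_a = set(range(0, leaders_per_policy))
            self.leader_b = set(range(n_sets - leaders_per_policy, n_sets))

        def policy_for(self, set_index):
            if set_index in self.leader_a:
                return "A"
            if set_index in self.leader_b:
                return "B"
            # Follower sets: pick the policy whose leaders are missing less.
            return "A" if self.psel <= self.max_count // 2 else "B"

        def on_miss(self, set_index):
            # A miss in a leader set counts against that leader's policy.
            if set_index in self.leader_a:
                self.psel = min(self.psel + 1, self.max_count)
            elif set_index in self.leader_b:
                self.psel = max(self.psel - 1, 0)

    duel = SetDueling(n_sets=2048)
    for _ in range(100):
        duel.on_miss(0)            # policy A's leader sets keep missing...
    print(duel.policy_for(1000))   # ...so the follower sets switch to policy B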
I am not able to understand the concept of the cache inclusion property in multi-level caching. As per my understanding, if we have 2 levels of cache, L1 and L2, then the contents of L1 must be a subset of L2. This implies that L2 must be at least as large as L1. Further, when a block in L1 is modified, we have to update it in two places: L2 and memory. Are these concepts correct?
In general, we can say that adding more levels of cache adds more levels of access in the memory hierarchy. It's always a trade-off between capacity and access time: the larger the cache, the more we can store, but the more time it takes to search through. As you have said, the L2 cache must be larger than the L1 cache; otherwise it would defeat its basic purpose.
Now, coming to whether L1 is a subset of L2: it is not always necessarily so. There are inclusive cache hierarchies and exclusive cache hierarchies. In an inclusive hierarchy, as you said, the last level is a superset of all the other caches.
You can check this presentation for more details: PPT.
Now, updating the different levels is a cache coherence problem, and the larger the number of levels, the larger the headache. You can check the various protocols here: cache coherence.
You are correct about an inclusive L2 cache being larger than the L1 cache. However, your statement that an inclusive cache requires a modification in the L1 to also modify the L2 and memory is not correct. The system you describe is called a "write-through" cache, where all writes in the private cache also write to the next level(s) of cache. Inclusive cache hierarchies do not imply write-through caches.
Most architectures that have inclusive hierarchies use a "write-back" cache. A "write-back" cache differs from a write-through cache in that it does not require modifications in the current level of cache to be eagerly propagated to the next level of cache (e.g., a write in the L1 cache does not have to immediately write the L2). Instead, write-back caches update only the current level of cache and mark the data "dirty" (describing a cacheline whose most recent value is in the current level while all upper levels have stale values). A write-back cache flushes the dirty cacheline to the next level of cache on an eviction (when space needs to be created in the current cache to service a miss).
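A small sketch of that write-back behaviour, with a plain dict standing in for the next level (this is only an illustration of the policy, not a model of any real cache):

    # Write-back: a store only dirties the line locally; the next level sees the
    # data when the dirty line is evicted. A write-through cache would instead
    # update the next level on every store.
    class WriteBackCache:
        def __init__(self, capacity, next_level):
            self.capacity = capacity
            self.lines = {}                   # addr -> (value, dirty)
            self.next_level = next_level      # dict standing in for L2 / memory

        def store(self, addr, value):
            if len(self.lines) >= self.capacity and addr not in self.lines:
                self._evict()
            self.lines[addr] = (value, True)  # dirty: the next level is now stale

        def _evict(self):
            # Evict an arbitrary resident line (replacement policy is not the point here).
            addr, (value, dirty) = self.lines.popitem()
            if dirty:
                self.next_level[addr] = value # write back only now

    memory = {0x100: 1}
    l1 = WriteBackCache(capacity=1, next_level=memory)
    l1.store(0x100, 2)
    print(memory[0x100])                      # still 1: the write has not propagated
    l1.store(0x200, 3)                        # evicting 0x100 flushes the dirty data
    print(memory[0x100])                      # now 2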
These concepts are summarized in the seminal work by Baer and Wang, "On the Inclusion Properties for Multi-Level Cache Hierarchies", ISCA 1988 (paper_link). The paper addresses your confusion in the following (initially confusing) statement:
“A multilevel cache hierarchy has the inclusion property (MLI) if the contents of a cache at level C_(i+1) is a superset of the contents of all its children caches, C_i, at level i. This definition implies that the write-through policy must be used for lower level caches. As we will assume write-back caches in this paper, the MLI is actually a ‘space’ MLI, i.e., space is provided for inclusion but a write-back policy is implemented.”
I am learning about caching and have a question on the concurrency of caches.
As I know, an LRU cache is implemented with a doubly linked list plus a hash table. Then how does an LRU cache handle highly concurrent access? Note that both getting data from the cache and putting data into the cache update the linked list and the hash table, so the cache is modified all the time.
If we use a mutex lock for thread safety, won't the speed be reduced if the cache is accessed by a large number of clients? If we do not use a lock, what techniques are used? Thanks in advance.
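For reference, here is a minimal single-threaded version of the structure described above (a hash map for O(1) lookup plus a recency order), just to make concrete that both get and put mutate shared state; Python's OrderedDict stands in for the hash table + doubly linked list pair:

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = OrderedDict()          # key -> value, LRU first, MRU last

        def get(self, key):
            if key not in self.data:
                return None
            self.data.move_to_end(key)         # every read mutates the recency order
            return self.data[key]

        def put(self, key, value):
            self.data[key] = value
            self.data.move_to_end(key)
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict the least recently used entry

    cache = LRUCache(2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")                             # "a" becomes most recently used
    cache.put("c", 3)                          # evicts "b"
    print(list(cache.data))                    # ['a', 'c']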
Traditional LRU caches are not designed for high concurrency, both because the hardware was limited and because the hit penalty is far smaller than the miss penalty (e.g. a database lookup). For most applications, locking the cache is acceptable if it's only used to update the underlying structure (not to compute the value on a miss). Simple techniques like segmenting the LRU policy were usually good enough when the locks became contended.
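As an illustration of that segmenting idea (the general lock-striping technique, not any particular library), here is a sketch:

    # Shard the cache by key hash: each segment has its own lock and its own
    # LRU order, so threads hitting different segments do not contend.
    import threading
    from collections import OrderedDict

    class SegmentedLRUCache:
        def __init__(self, capacity, segments=16):
            self.segments = [
                (threading.Lock(), OrderedDict()) for _ in range(segments)
            ]
            self.per_segment_capacity = max(1, capacity // segments)

        def _segment(self, key):
            return self.segments[hash(key) % len(self.segments)]

        def get(self, key):
            lock, data = self._segment(key)
            with lock:                          # only this segment is blocked
                if key not in data:
                    return None
                data.move_to_end(key)
                return data[key]

        def put(self, key, value):
            lock, data = self._segment(key)
            with lock:
                data[key] = value
                data.move_to_end(key)
                if len(data) > self.per_segment_capacity:
                    data.popitem(last=False)

    cache = SegmentedLRUCache(capacity=1024)
    cache.put("user:1", {"name": "alice"})
    print(cache.get("user:1"))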
The way to make an LRU cache scale is to avoid updating the policy on every access. The critical observation to make is that the user of the cache does not care what the current LRU ordering is. The only concern of the caller is that the cache maintains a threshold size and a high hit rate. This opens the door for optimizations by avoiding mutating the LRU policy on every read.
The approach taken by memcached is to discard subsequent reads within a time window, e.g. 1 second. The cache is expected to be very large so there is a very low chance of evicting a poor candidate by this simpler LRU.
The approach taken by ConcurrentLinkedHashMap (CLHM), and subsequently Guava's Cache, is to record the access in a buffer. This buffer is drained under the LRU's lock and by using a try-lock no other operation has to be blocked. CLHM uses multiple ring buffers that are lossy if the cache cannot keep up, as losing events is preferred to degraded performance.
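A loose sketch of that buffering idea, with plain Python structures standing in for a concurrent hash map and a lossy ring buffer; it shows the shape of the approach (record reads, drain them under a try-lock), not CLHM's or Guava's actual code:

    import threading
    from collections import OrderedDict, deque

    class BufferedLRUCache:
        def __init__(self, capacity, buffer_size=128):
            self.capacity = capacity
            self.data = OrderedDict()
            self.lock = threading.Lock()
            self.read_buffer = deque(maxlen=buffer_size)   # lossy: old events drop out

        def get(self, key):
            value = self.data.get(key)          # read without reordering on the hot path
            if value is not None:
                self.read_buffer.append(key)    # record the access instead of mutating LRU
                self._try_drain()
            return value

        def put(self, key, value):
            with self.lock:
                self.data[key] = value
                self.data.move_to_end(key)
                if len(self.data) > self.capacity:
                    self.data.popitem(last=False)
            self._try_drain()

        def _try_drain(self):
            if not self.lock.acquire(blocking=False):
                return                          # someone else holds the lock; don't block
            try:
                while self.read_buffer:
                    key = self.read_buffer.popleft()
                    if key in self.data:
                        self.data.move_to_end(key)   # apply the buffered accesses
            finally:
                self.lock.release()

    cache = BufferedLRUCache(capacity=2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")
    cache.put("c", 3)
    print(list(cache.data))                     # ['a', 'c']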
The approach taken by Ehcache and redis is a probabilistic LRU policy. A read updates the entry's timestamp and a write iterates the cache to obtain a random sample. The oldest entry is evicted from that sample. If the sample is fast to construct and the cache is large, the evicted entry was likely a good candidate.
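A sketch of that sampled approach, in the spirit of Redis's approximated LRU rather than a copy of it: reads only stamp a timestamp, and eviction picks the oldest entry from a small random sample.

    import random
    import time

    class SampledLRUCache:
        def __init__(self, capacity, sample_size=5):
            self.capacity = capacity
            self.sample_size = sample_size
            self.data = {}                       # key -> (value, last_access_time)

        def get(self, key):
            entry = self.data.get(key)
            if entry is None:
                return None
            self.data[key] = (entry[0], time.monotonic())   # cheap: just update a timestamp
            return entry[0]

        def put(self, key, value):
            if len(self.data) >= self.capacity and key not in self.data:
                sample = random.sample(list(self.data), min(self.sample_size, len(self.data)))
                victim = min(sample, key=lambda k: self.data[k][1])  # oldest in the sample
                del self.data[victim]
            self.data[key] = (value, time.monotonic())

    cache = SampledLRUCache(capacity=100)
    for i in range(200):
        cache.put(i, str(i))
    print(len(cache.data))                       # stays at the capacity bound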
There are probably other techniques and, of course, pseudo LRU policies (like CLOCK) that offer better concurrency at lower hit rates.