I was reading a book on cache optimizations, and in the third optimization, i.e. Higher Associativity to Reduce Miss Rate, the author says:
2:1 cache rule (for caches up to size 128 KB): A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.
But the author provides no explanation or proof of this, and I am unable to understand it.
Could anyone explain how he came up with this rule?
It's a "rule of thumb" that's (apparently) usually true for normal workloads, I'd guess from simulating traces of real workloads like SPECint and/or commercial software. Not an actual law that's always true.
Seems reasonable if most of the misses in the direct-mapped cache are conflict misses rather than capacity misses: halving the size adds some capacity misses, but doubling the associativity removes roughly as many conflict misses, so the two effects cancel out on typical workloads.
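If you want to see how such a rule could be checked, a tiny set-associative simulator run over an address trace is enough. Below is a minimal sketch (my own, not from the book): the LRU-based simulator is generic, but the synthetic trace, 64-byte line size, and 32 KiB / 16 KiB cache sizes are arbitrary choices, and how close the two miss rates come out depends entirely on the trace.

```python
from collections import OrderedDict
import random

def miss_rate(trace, cache_bytes, ways, line_bytes=64):
    """Simulate one set-associative cache with LRU replacement and
    return its miss rate for the given list of byte addresses."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        s = sets[line % num_sets]
        if line in s:
            s.move_to_end(line)            # hit: mark as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)      # evict the least recently used line
            s[line] = True
    return misses / len(trace)

# Synthetic trace: mostly local accesses with occasional jumps in a 1 MiB region.
random.seed(0)
trace, addr = [], 0
for _ in range(200_000):
    if random.random() < 0.05:
        addr = random.randrange(1 << 20)
    else:
        addr = (addr + random.randrange(-128, 129)) % (1 << 20)
    trace.append(addr)

N = 32 * 1024
print("direct-mapped 32 KiB:", miss_rate(trace, N, ways=1))
print("2-way set-assoc 16 KiB:", miss_rate(trace, N // 2, ways=2))
```

Feeding it traces captured from real programs instead of this synthetic walk is presumably the kind of experiment the rule of thumb came from.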
Related
If I have two different cache subsystem designs, C1 and C2, that both have roughly the same hardware complexity, can I decide which one is the better choice, given that the effectiveness of the cache subsystem is the prime factor, i.e. the number of misses should be minimized?
Given the total miss rate below:
miss_rate = (number of cache misses) / (number of cache references)
miss rate of C1 = 0.77
miss rate of C2 = 0.73
Is the given miss rate information sufficient to decide which subsystem is better?
Yes, assuming hit latency is the same for both caches, actual miss rate on the workload you care about is the ultimate factor for that workload. It doesn't always generalize.
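As a worked illustration with made-up latencies (only the 0.77 and 0.73 miss rates come from the question): if both designs really do have the same hit time and miss penalty, average memory access time reduces to a straight comparison of miss rates.

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time = hit time + miss rate * miss penalty
    return hit_time + miss_rate * miss_penalty

# Hypothetical, identical latencies for both designs (cycles); only the
# miss rates 0.77 and 0.73 come from the question.
HIT, PENALTY = 2, 100
print("C1:", amat(HIT, 0.77, PENALTY))   # 2 + 0.77 * 100 = 79.0 cycles
print("C2:", amat(HIT, 0.73, PENALTY))   # 2 + 0.73 * 100 = 75.0 cycles
```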
Differences in size, associativity, and eviction policy all matter because of their impact on miss rate on any given workload. Even cache line (block) size factors into this: a cache with twice as many 32-byte lines vs. a cache with half as many 64-byte lines would be able to cache more scattered words, but pull in less nearby data on a miss. (Unless you have hardware prefetching, but again, prefetch algorithms ultimately just affect the miss rate.)
If hit and miss latencies are fixed, then all misses are equal and you just want fewer of them.
Well, not just latency, but overall effect on the pipeline, if the CPU isn't a simple in-order design from the 1980s that simply stalls on a miss. Which is what homework usually assumes, because otherwise the miss cost depends on details of the context, making it impossible to calculate performance based on just instruction mix, hit/miss rate, and miss costs.
An out-of-order exec CPU can hide the latency of some misses better than others (on the critical path of some larger dependency chain vs. not). Even an in-order CPU that can scoreboard loads can get work done in the shadow of a cache-miss load, up until it reaches an instruction that reads the load result. (And with a store buffer, it can usually hide store-miss latency.) So the miss penalty can differ depending on which loads miss, and on how much of the latency software instruction scheduling was able to hide. (If the independent work after a load includes other loads, then you'd need a non-blocking cache that handles hit-under-miss. Miss-under-miss, for memory-level parallelism with multiple misses in flight, also helps, as does being able to get to a hit after 2 or more cache-miss loads.)
I think for most workloads, across different cache geometries and sizes, there usually won't be a significant bias towards the misses being easier or harder to hide, so you could still say that miss rate is the only thing that ultimately matters.
Miss-rate for a cache depends on workload, so you can't say that a lower miss rate on just one workload or trace makes it better on average or for all cases. e.g. an 8-way associative 16 KiB cache might have a higher hit rate than a 32 KiB 2-way cache on one workload (with a lot of conflict misses for the 2-way cache), but on a different workload where the working set is mostly one contiguous 24KiB array, the 32K 2-way cache might have a higher hit rate.
The term "better" is subjective as follows:
Hardware cost, in terms of silicon real-estate, meaning that a larger chip is more expensive to produce and thus costs more per chip. (A larger cache may not even fit on the chip in question.)
Hardware cost, in terms of silicon process technology, meaning that a faster cache requires a more advanced chip process, so will increase costs per chip.
A miss rate on a given cache is workload specific (e.g. application specific or algorithm specific). Thus, two different workloads may have different miss rates on each of the caches in question. So, "better" here may mean across an average workload (or an average across several different workloads), but there's a lot of room for variability.
We would have to know the performance of the caches upon hit, and also upon miss — as a more complex cache with a higher hit rate might have longer timings.
In summary, in order to say that a lower miss rate is better, we would have to know that all the other factors are equal. Otherwise, the notion of "better" needs to be defined, perhaps to include a cost/benefit definition.
I am asking this question about the Haswell microarchitecture (Intel Xeon E5-2640 v3 CPU). From the CPU's specifications and other resources I found out that there are 10 LFBs and that the size of the super queue is 16. I have two questions related to the LFBs and the super queue:
1) What will be the maximum degree of memory-level parallelism the system can provide: 10 or 16 (LFBs or SQ entries)?
2) According to some sources, every L1D miss is recorded in the SQ, which then assigns a line fill buffer; other sources write that the SQ and the LFBs can work independently. Could you briefly explain how the SQ works?
Here is an example figure (not for Haswell) of the SQ and LFBs.
References:
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
http://www.realworldtech.com/haswell-cpu/
For (1), logically the maximum parallelism would be limited by the least-parallel part of the pipeline, which is the 10 LFBs, and this is probably strictly true for demand-load parallelism when prefetching is disabled or can't help. In practice, everything is more complicated once your loads are at least partly helped by prefetching, since then the wider queues between L2 and RAM come into play, which can make the observed parallelism greater than 10. The most practical approach is probably direct measurement: given the measured latency to RAM and the observed throughput, you can calculate an effective parallelism for any particular load.
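For the direct-measurement approach, Little's law (outstanding requests = latency x throughput) is the usual tool. Here is a sketch with placeholder numbers, not Haswell measurements; you would substitute a latency measured with a pointer-chasing loop and the bandwidth you actually observe.

```python
# Little's law: outstanding requests = latency * request rate.
# All numbers below are placeholders, not Haswell measurements.
latency_ns     = 90.0    # measured load-to-use latency to DRAM
bandwidth_gbps = 12.0    # observed bandwidth of the access pattern (GB/s)
line_bytes     = 64      # one request per cache line

requests_per_ns = bandwidth_gbps / line_bytes   # 1 GB/s == 1 byte/ns
effective_mlp = latency_ns * requests_per_ns
print(f"effective parallelism ~= {effective_mlp:.1f} lines in flight")
# A result above 10 would indicate that more than just the 10 L1 LFBs
# (e.g. L2 prefetch via the superqueue) is contributing.
```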
For (2) my understanding is that it is the other way around: all demand misses in L1 first allocate into the LFB (unless of course they hit an existing LFB) and may involve the "superqueue" later (or whatever it is called these days) if they also miss higher in the cache hierarchy. The diagram you included seems to confirm that: the only path from the L1 is through the LFB queue.
Let's assume that one cache line has a size of 2^n bytes. What hit ratio should be expected when sequentially reading a long contiguous region of memory byte by byte?
To my eye it is (2^n - 1) / 2^n.
However, I am not sure if I am right. What do you think?
Yes, that looks right for simple hardware (non-pipelined, with no prefetching), e.g. 1 miss and 63 hits per 64-byte cache line.
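A quick sanity check of that formula, under the same simple-hardware assumptions (one miss brings in a whole line, and nothing is evicted while we are still reading inside it):

```python
line_bytes  = 64          # 2^n = 64 in this example
total_bytes = 1 << 20     # read 1 MiB sequentially, byte by byte

misses = total_bytes // line_bytes    # one miss per new cache line
hits   = total_bytes - misses         # the other 2^n - 1 bytes of each line
print(hits / total_bytes)             # 0.984375 == (64 - 1) / 64
```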
On real hardware, even in-order single-issue (non-superscalar) CPUs usually support miss under miss (multiple outstanding misses), so multiple misses can be in flight at once, until you run out of load buffers. This pipelines memory accesses as well, which is useful because misses to different cache lines can overlap instead of each one waiting out the full latency.
Real hardware will also have hardware prefetching. For example, have a look at Intel's article about disabling HW prefetching for some use-cases.
HW prefetching can probably keep up with a one-byte-at-a-time loop on most CPUs, so with good prefetching you might see hardly any L1 cache misses.
See Ulrich Drepper's What Every Programmer Should Know About Memory, and other links in the x86 tag wiki for more about real HW performance.
I came across the topic of caching and mapping and cache misses, and how cache blocks get replaced, and in what order, when all blocks are already full.
There is the least-recently-used algorithm, the FIFO algorithm, the least-frequently-used algorithm, random replacement, ...
But what algorithms are used in actual CPU caches? Or can all of them be used, with the operating system deciding which algorithm is best?
Edit: Even though I chose an answer, any further information is welcome ;)
As hivert said, it's hard to get a clear picture of the specific algorithm, but one can deduce some information from hints or clever reverse engineering.
You didn't specify which CPU you mean; each one can have a different policy (actually even within the same CPU, different cache levels may have different policies, not to mention TLBs and other associative arrays which may also have such policies). I did find a few hints about Intel (specifically Ivy Bridge), so we'll use this as a benchmark for industry-level "standards" (which may or may not apply elsewhere).
First, Intel presented some LRU-related features here:
http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-1-Microprocessor/HC24.28.117-HotChips_IvyBridge_Power_04.pdf
Slide 46 mentions "Quad-Age LRU" - this is apparently an age-based LRU that assigns an "age" to each line according to its predicted importance. They mention that prefetches get a middle age, so demands are probably allocated at a higher age (or lower, whichever survives longest), and all lines likely age gradually, so the oldest gets replaced first. Not as good as perfect "fifo-like" LRU, but keep in mind that most caches don't implement that anyway, but rather a complicated pseudo-LRU approximation, so this might be an improvement.
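The exact mechanism isn't public, but the description above is close in spirit to the RRIP family of policies, so a very rough sketch of one set might look like the following; the insertion ages and the hit rule here are guesses, not Intel's actual algorithm.

```python
AGE_MAX = 3   # two bits of "age" per line; 3 = oldest, evicted first

def pick_victim(ages):
    """Evict the first line that has reached the maximum age,
    ageing every line until one does (RRIP-style)."""
    while True:
        for way, age in enumerate(ages):
            if age == AGE_MAX:
                return way
        for way in range(len(ages)):
            ages[way] += 1

def on_hit(ages, way):
    ages[way] = 0                  # reuse makes a line "young" again

def on_fill(ages, way, prefetched):
    # Guess: demand fills start young, prefetches start at a middle age,
    # so a prefetched line that is never used gets evicted sooner.
    ages[way] = 2 if prefetched else 1
```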
Another interesting mechanism mentioned there, which goes the extra mile beyond classic LRU, is adaptive fill policy. There's a pretty good analysis here - http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ , but in a nutshell (if the blog is correct, and he does seem to make a good match with his results), the cache dynamically chooses between two LRU policies, trying to decide whether the lines are going to be reused or not (and should be kept or not).
I guess this could answer your question about multiple LRU schemes to some extent. Implementing several schemes is probably hard and expensive in terms of HW, but when you have a policy that's complicated enough to have parameters, it's possible to use tricks like dynamic selection, set dueling, etc.
The following are some examples of replacement policies used in actual processors.
The PowerPC 7450's 8-way L1 cache used binary tree pLRU. Binary tree pLRU uses one bit per pair of ways to track the LRU for that pair, then an LRU bit for each pair of pairs of ways, and so on. The 8-way L2 used pseudo-random replacement, settable by privileged software (the OS) to use either a 3-bit counter incremented every clock cycle or a shift-register-based pseudo-random number generator.
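As an illustration of that binary-tree pLRU scheme, here is a minimal sketch for a single set (my own code, not the PowerPC implementation; real hardware differs in how the bits are encoded):

```python
class TreePLRU:
    """Binary-tree pseudo-LRU for one set; `ways` must be a power of two.
    bits[i] == 0 means the pseudo-LRU way is in the left subtree of node i,
    bits[i] == 1 means it is in the right subtree."""

    def __init__(self, ways=8):
        self.ways = ways
        self.bits = [0] * (ways - 1)       # 7 bits for an 8-way set

    def victim(self):
        """Follow the bits from the root down to the pseudo-LRU way."""
        node = 0
        while node < self.ways - 1:
            node = 2 * node + 1 + self.bits[node]
        return node - (self.ways - 1)      # leaf index -> way number

    def touch(self, way):
        """On an access, make every bit on the path point away from `way`."""
        leaf = way + self.ways - 1
        while leaf:
            parent = (leaf - 1) // 2
            came_from_right = (leaf == 2 * parent + 2)
            self.bits[parent] = 0 if came_from_right else 1
            leaf = parent

# Example: after touching ways 0, 3 and 5, the victim comes from an
# untouched part of the tree (prints way 1 here).
plru = TreePLRU(8)
for w in (0, 3, 5):
    plru.touch(w)
print(plru.victim())
```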
The StrongARM SA-1110's 32-way L1 data cache used FIFO. It also had a 2-way minicache for transient data, which also seems to have used FIFO. (The Intel StrongARM SA-1110 Microprocessor Developer's Manual states "Replacements in the minicache use the same round-robin pointer mechanism as in the main data cache. However, since this cache is only two-way set-associative, the replacement algorithm reduces to a simple least-recently-used (LRU) mechanism."; but 2-way FIFO is not the same as LRU, even with only two ways, though for streaming data it works out the same.)
The HP PA 7200 had a 64-block fully associative "assist cache" that was accessed in parallel with an off-chip direct-mapped data cache. The assist cache used FIFO replacement with the option of evicting to the off-chip L1 cache. Load and store instructions had a "locality only" hint; if an assist cache entry was loaded by such a memory access, it would be evicted to memory bypassing the off-chip L1.
For 2-way associativity, true LRU might be the most common choice since it has good behavior (and, incidentally, is the same as binary tree pLRU when there are only two ways). E.g., the Fairchild Clipper Cache And Memory Management Unit used LRU for its 2-way cache. FIFO is slightly cheaper than LRU since the replacement information is only updated when the tags are written anyway, i.e., when inserting a new cache block, but has better behavior than counter-based pseudo-random replacement (which has even lower overhead). The HP PA 7300LC used FIFO for its 2-way L1 caches.
The Itanium 9500 series (Poulson) uses NRU for the L1 and L2 data caches, the L2 instruction cache, and the L3 cache (the L1 instruction cache is documented as using LRU). For the 24-way L3 cache in the Itanium 2 6M (Madison), a bit per block was provided for NRU, with an access to a block setting the bit corresponding to its set and way ("Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache", Stefan Rusu et al., 2004). This is similar to the clock page replacement algorithm.
I seem to recall reading elsewhere that the bits were cleared when all were set (rather than keeping the one that set the last unset bit) and that the victim was chosen by a find first unset scan of the bits. This would have the hardware advantage of only having to read the information (which was stored in distinct arrays from but nearby the L3 tags) on a cache miss; a cache hit could simply set the appropriate bit. Incidentally, this type of NRU avoids some of the bad traits of true LRU (e.g., LRU degrades to FIFO in some cases and in some of these cases even random replacement can increase the hit rate).
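In code, that style of NRU for one set might look roughly like this; it follows the description above, which is itself partly from memory, so treat it as an illustration of the idea rather than the Itanium's exact behaviour.

```python
class NRUSet:
    """Not-recently-used replacement for one cache set:
    one 'used' bit per way, set on access; when all bits become set
    they are all cleared; the victim is the first way with a clear bit."""

    def __init__(self, ways):
        self.used = [0] * ways

    def touch(self, way):
        self.used[way] = 1
        if all(self.used):                     # every way recently used:
            self.used = [0] * len(self.used)   # clear all bits and start over

    def victim(self):
        for way, bit in enumerate(self.used):  # find-first-unset scan
            if bit == 0:
                return way
        return 0   # unreachable once touch() clears on saturation
```

Note that a hit only needs touch(), a single bit write; the scan in victim() only runs on a miss, which matches the hardware advantage described above.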
For Intel CPUs, the replacement policies are usually undocumented. I have done some experiments to uncover the policies in recent Intel CPUs, the results of which can be found on https://uops.info/cache.html. The code that I used is available on GitHub.
The following is a summary of my findings.
Tree-PLRU: This policy is used by the L1 data caches of all CPUs that I tested, as well as by the L2 caches of the Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell and Broadwell CPUs.
Randomized Tree-PLRU: Some Core 2 Duo CPUs use variants of Tree-PLRU in their L2 caches where either the lowest or the highest bits in the tree are replaced by (pseudo-)randomness.
MRU: This policy is sometimes also called NRU. It uses one bit per cache block. An access to a block sets the bit to 0. If the last 1-bit was set to 0, all other bits are set to 1. Upon a miss, the first block with its bit set to 1 is replaced. This policy is used for the L3 caches of the Nehalem, Westmere, and Sandy Bridge CPUs.
Quad-Age LRU (QLRU): This is a generalization of the MRU policy that uses two bits per cache block. Different variants of this policy are used for the L3 caches, starting with Ivy Bridge, and for the L2 caches, starting with Skylake.
Adaptive policies: The Ivy Bridge, Haswell, and Broadwell CPUs can dynamically choose between two different QLRU variants. This is implemented via set dueling: A small number of dedicated sets always use the same QLRU variant; the remaining sets are "follower sets" that use the variant that performs better on the dedicated sets. See also http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.
I learned that due to computational overhead, true LRU is not implemented in virtual memory systems. So, why is the LRU algorithm feasible in a file cache?
I think the reason may be the time field in an inode. Is that correct?
It's about speed.
Virtual memory status bits must be updated in nanoseconds, so hardware support is needed, and status information for LRU is expensive to implement in hardware. E.g. the clock algorithm is designed to approximate LRU with less expensive hardware support.
File system operations are on the order of milliseconds. A CPU can do LRU in software in a very small fraction of this time. Milliseconds are so "slow" from the CPU's point of view (hundreds of thousands of instructions) that preventing even a small number of cache misses produces a big payoff.
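To make the software side concrete, here is a minimal sketch of an exact-LRU block cache, the kind of bookkeeping a file cache can afford once per I/O operation but an MMU could never afford once per memory access:

```python
from collections import OrderedDict

class LRUFileCache:
    """Exact LRU over cached file blocks; every operation is O(1)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()               # block number -> data

    def get(self, block_no, read_from_disk):
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)     # hit: mark most recently used
            return self.blocks[block_no]
        data = read_from_disk(block_no)           # miss: costs milliseconds
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)       # evict the least recently used
        self.blocks[block_no] = data
        return data
```

The bookkeeping costs at most a few hundred nanoseconds per operation, which is negligible next to a millisecond-scale disk read, so exact LRU is easily worth it in this setting.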