What cache invalidation algorithms are used in actual CPU caches? - algorithm

I came to the topic caching and mapping and cache misses and how the cache blocks get replaced in what order when all blocks are already full.
There is the least recently used algorithm or the fifo algorithm or the least frequently algorithm and random replacement, ...
But what algorithms are used on actual cpu caches? Or can you use all and the... operating system decides what the best algorithm is?
Edit: Even when i chose an answer, any further information is welcome ;)

As hivert said - it's hard to get a clear picture on the specific algorithm, but one can deduce some of the information according to hints or clever reverse engineering.
You didn't specify which CPU you mean, each one can have a different policy (actually even within the same CPU different cache levels may have different policies, not to mention TLBs and other associative arrays which also may have such policies). I did find a few hints about Intel (specifically Ivy bridge), so we'll use this as a benchmark for industry level "standards" (which may or may not apply elsewhere).
First, Intel presented some LRU related features here -
http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-1-Microprocessor/HC24.28.117-HotChips_IvyBridge_Power_04.pdf
Slide 46 mentioned "Quad-Age LRU" - this is apparently an age based LRU that assigned some "age" to each line according to its predicted importance. They mention that prefetches get middle age, so demands are probably allocated at a higher age (or lower, whatever survives longest), and all lines likely age gradually, so the oldest gets replaced first. Not as good as perfect "fifo-like" LRU, but keep in mind that most caches don't implement that, but rather a complicated pseudo-LRU solution, so this might be an improvement.
Another interesting mechanism mentioned there, which goes the extra mile beyond classic LRU, is adaptive fill policy. There's a pretty good analysis here - http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ , but in a nutshell (if the blog is correct, and he does seem to make a good match with his results), the cache dynamically chooses between two LRU policies, trying to decide whether the lines are going to be reused or not (and should be kept or not).
I guess this could answer to some extent your question on multiple LRU schemes. Implementing several schemes is probably hard and expensive in terms of HW, but when you have some policy that's complicated enough to have parameters, it's possible to use tricks like dynamic selection, set dueling , etc..

The following are some examples of replacement policies used in actual processors.
The PowerPC 7450's 8-way L1 cache used binary tree pLRU. Binary tree pLRU uses one bit per pair of ways to set an LRU for that pair, then an LRU bit for each pair of pairs of ways, etc. The 8-way L2 used pseudo-random replacement settable by privileged software (the OS) as using either a 3-bit counter incremented every clock cycle or a shift-register-based pseudo-random number generator.
The StrongARM SA-1110 32-way L1 data cache used FIFO. It also had a 2-way minicache for transient data, which also seems to have used FIFO. (Intel StrongARM SA-1110 Microprocessor Developer’s Manual states "Replacements in the minicache use the same round-robin pointer mechanism as in the main data cache. However, since this cache is only two-way set-associative, the replacement algorithm reduces to a simple least-recently-used (LRU) mechanism."; but 2-way FIFO is not the same as LRU even with only two ways, though for streaming data it works out the same.])
The HP PA 7200 had a 64-block fully associative "assist cache" that was accessed in parallel with an off-chip direct-mapped data cache. The assist cache used FIFO replacement with the option of evicting to the off-chip L1 cache. Load and store instructions had a "locality only" hint; if an assist cache entry was loaded by such a memory access, it would be evicted to memory bypassing the off-chip L1.
For 2-way associativity, true LRU might be the most common choice since it has good behavior (and, incidentally, is the same as binary tree pLRU when there are only two ways). E.g., the Fairchild Clipper Cache And Memory Management Unit used LRU for its 2-way cache. FIFO is slightly cheaper than LRU since the replacement information is only updated when the tags are written anyway, i.e., when inserting a new cache block, but has better behavior than counter-based pseudo-random replacement (which has even lower overhead). The HP PA 7300LC used FIFO for its 2-way L1 caches.
The Itanium 9500 series (Poulson) uses NRU for L1 and L2 data caches, L2 instruction cache, and the L3 cache (L1 instruction cache is documented as using LRU.). For the 24-way L3 cache in the Itanium 2 6M (Madison), a bit per block was provided for NRU with an access to a block setting the bit corresponding to its set and way ("Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache", Stefan Rusu et al., 2004). This is similar to the clock page replacement algorithm.
I seem to recall reading elsewhere that the bits were cleared when all were set (rather than keeping the one that set the last unset bit) and that the victim was chosen by a find first unset scan of the bits. This would have the hardware advantage of only having to read the information (which was stored in distinct arrays from but nearby the L3 tags) on a cache miss; a cache hit could simply set the appropriate bit. Incidentally, this type of NRU avoids some of the bad traits of true LRU (e.g., LRU degrades to FIFO in some cases and in some of these cases even random replacement can increase the hit rate).

For Intel CPUs, the replacement policies are usually undocumented. I have done some experiments to uncover the policies in recent Intel CPUs, the results of which can be found on https://uops.info/cache.html. The code that I used is available on GitHub.
The following is a summary of my findings.
Tree-PLRU: This policy is used by the L1 data caches of all CPUs that I tested, as well as by the L2 caches of the Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell and Broadwell CPUs.
Randomized Tree-PLRU: Some Core 2 Duo CPUs use variants of Tree-PLRU in their L2 caches where either the lowest or the highest bits in the tree are replaced by (pseudo-)randomness.
MRU: This policy is sometimes also called NRU. It uses one bit per cache block. An access to a block sets the bit to 0. If the last 1-bit was set to 0, all other bits are set to 1. Upon a miss, the first block with its bit set to 1 is replaced. This policy is used for the L3 caches of the Nehalem, Westmere, and Sandy Bridge CPUs.
Quad-Age LRU (QLRU): This is a generalization of the MRU policy that uses two bits per cache block. Different variants of this policy are used for the L3 caches, starting with Ivy Bridge, and for the L2 caches, starting with Skylake.
Adaptive policies: The Ivy Bridge, Haswell, and Broadwell CPUs can dynamically choose between two different QLRU variants. This is implemented via set dueling: A small number of dedicated sets always use the same QLRU variant; the remaining sets are "follower sets" that use the variant that performs better on the dedicated sets. See also http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.

Related

How does write-invalidate policy work with set-associative caches?

I was going through Cache Write Policies paper by Norman P. Jouppi and I understand why write-invalidate (defined on page 193) works well with direct mapped caches which is because of the ability to write the data which checking the tag and if found to be miss, the cache line is invalidates as it is corrupted by the write. This can be done in one cycle.
But is there any benefit if write-invalidate is used for set-associative caches?
What is the usual configuration that is used for L1 caches in real processors? Do they use direct or set-associative and write-validate/write around/write invalidate/fetch-on-write policy?
TL:DR: for a non-blocking cache using write-invalidate, changing it from direct-mapped to set-associative could hurt the hit rate unless writes are very rare, or mean that you introduce the possibility of needing to block.
Write-invalidate only makes sense for a simple in-order pipeline with a simple cache that tries to avoid stalling the pipeline even without a store buffer, and go really fast at the expense of hit-rate. If you were going to change things to improve hit-rate, changing away from write-invalidate (usually to write-back + write-allocate + fetch-on-write) would be one of the first things. Write-invalidate with set-associative cache is possible with some ugly tradeoffs, but you wouldn't like the results.
The 1993 paper you linked is using that term to mean something other than the modern cache-coherence mechanism meaning. In the paper:
The combination of
write-before-hit, no-fetch-on-write, and no-write-allocate
we call write-invalidate
Yes, real-world caches these days are basically always set-associative; the more complex tag-comparator logic is worth the increased hit-rate for the same data size. Which cache mapping technique is used in intel core i7 processor? has some general stuff, not just x86. Modern examples of direct-mapped caches include the DRAM cache when a part of the persistent memory on an Intel platform operates in memory mode. Also many server-grade processors from multiple vendors support L3 way-wise partitioning, so you can, for example, allocate one way for a thread which would basically behave like a direct-mapped cache.
Write policy is usually write-allocate + fetch-on-write + no-write-before-hit for modern CPU caches; some ISAs offer methods such as special instructions to bypass cache for "non-temporal" stores that won't be re-read soon, to avoid cache pollution for those cases. Most workloads do re-load their stores with enough temporal locality that write-allocate is the only sane choice, especially when caches are larger and/or more associative so they're more likely to be able to hang onto a line until the next read or write.
It's also very common to do multiple small writes into the same line, making write-allocate very valuable, especially if a store buffer didn't manage to merge those writes.
But is there any benefit if write-invalidate is used for set-associative caches?
It doesn't seem so.
The only advantage it has is not stalling a simple in-order pipeline that lacks a store buffer ("write buffer" in the paper). It allows write in parallel with the tag-check, so you find out after modifying the line whether you hit or not. (Modern CPUs do use a store buffer to decouple store commit to L1d from store execution and hide store-miss latency. Even in-order CPUs typically have a store buffer to allow memory-level parallelism of RFOs (read-for-ownership). (e.g. ARM Cortex-A53 found in phones).
Anyway, in a set-associative cache, you need to check tags to know which "way" of the set to write into on a write hit. (Or detect a miss and pick one to evict according to some policy, like random or pseudo-LRU using some extra state bits, or write-around if no-write-allocate). If you wait until after the tag check to find the write way, you've lost the only benefit of write-invalidate.
Blindly writing to a random way could lead to a situation where there's a hit in a different way than the one you guessed. Way-prediction is a thing (and can do better than random), but the downside of an incorrect prediction for a write like this would be unnecessarily invalidating a line, instead of just a bit of extra latency. Way prediction in modern cache. I don't know what kind of success-rate way-prediction usually achieves. I'd guess not great, like maybe 80 to 90% at best. Probably spending transistors to do way-prediction would be better spent elsewhere, to do something that sucks less than write-invalidate! A store buffer with store forwarding probably costs more, but is a lot better.
The advantage of write-invalidate is to help make the cache non-blocking. But if you need to correct the situation when you do find a write-hit in a way other than the one you picked, you need to go back and correct the situation, updating the correct line. So you'd lose the non-blocking property. Never stalling is better than not usually stalling, because it means you don't even need to make the hardware handle that possible case at all. (Although you do need to be able to stall for memory.)
The write-in-one-way-hit-in-another situation can be avoided by writing in all of the ways. But there will be at most one hit and the rest will have to be invalidated. The negative impact on hit rate will significantly grow with associativity. (Unless writes are quite infrequent vs. reads, reducing the associativity would probably help hit rate with the write-all-ways strategy, so for a given total cache capacity, direct-mapped might be the best choice if you insist on fully-non-blocking write-invalidate.) Even for a direct-mapped cache, the experimental evaluation given in the paper itself shows that write-invalidate has higher miss rate compare to the other evaluated write policies. So it's win only if the benefits of reducing latency and bandwidth demand outweighs the damage of high miss rate.
Also, as I said, write-allocate is very good for CPUs, especially when it's set-associative so you're spending more resources trying to get a higher hit-rate. You could maybe still implement write-allocate by triggering a fetch on miss, remembering where in the line you stored the data, and merging that with the old copy of the line when it arrives.
You don't want to defeat that by blowing away lines that didn't need to die.
Also, write-invalidate implies write-through even for write hits, because it could lose data if a line is ever dirty. But write-back is also very good in modern L1d caches to insulate larger/slower caches from the write bandwidth. (Especially if there's no per-core private L2 to separately reduce total traffic to shared caches.) However, AMD Bulldozer-family did have a write-through L1d with a small 4k write-buffer between it and a write-back L2. This was generally considered a failed experiment or weak point of the design, and they dropped it in favour of a standard write-back write-allocate L1d for Zen. When use write-through cache policy for pages.
So in summary, write-invalidate is incompatible with several things that modern mainstream CPU designs have settled on as the best options, that you'll find in most mainstream CPU designs
write-allocate write policy
write-back (not write-through). https://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies
set-associative (huge downsides that can only be partially mitigated by way-prediction)
store buffer to decouple store miss from execution, and allow memory parallelism. (Not strictly incompatible, but a store buffer makes it pointless. Necessary for OoO exec and widely used for in-order)
write-invalidate in cache-coherent SMP systems
You'd never consider using it in a single-chip multi-core CPU; spend more transistors on each core to get more of the low-hanging fruit before you start building more cores. e.g. a proper store buffer. Use some flavour of SMT if you want high throughput for multiple low-IPC threads that stall a lot.
But for multi-socket SMP, this could have made sense historically if you want to use multiple of the biggest single-core chip you can build, and that was still not big enough to just have a store buffer instead of this.
I guess it could even make sense to use a really "thin" direct-mapped write-through L1d in front of a private medium-sized write-back set-associative L2 that's still pretty fast. (Maybe call this an L0d cache because it can act like an unordered store buffer. The next-level cache will still see a lot of reads and writes from the low hit-rate of this small direct-mapped cache.)
Normally all caches (including L1d) are part of the same global coherency domain so writing into L1d cache can't happen until you have exclusive ownership. (Which you check for as part of the tag check.) But if this L1d / L0d is not like that, then it's not coherent and is more like a store buffer.
Of course, you need to queue the write-throughs for L2, and eventually stall when it can't keep up, so you're just adding complexity. The write-through to L2 mechanism would also need to deal with waiting for L2 to gain exclusive ownership of the line before writing (MESI Exclusive or Modified state). So this is very much just an unordered store buffer.
The case of writing to a line that hadn't made it to L2 yet is interesting: if it's an L0d write hit you effectively get store merging for free. You'd need per-word or per-byte needs-writeback bits (aka dirty bits) for this. Normally write-through would be sending along the write while the offset within line is still available, but if L2 isn't ready to accept it yet (e.g. because of a write miss) then you can't do that. This is morphing it into a write-combining buffer. Marking the whole line as needing write-back doesn't work because the unwritten parts are still invalid.
But if it's a write miss (same cache line, different tag bits) on a line that still hasn't finished write-back to L2, you have a big problem because you'd be invalidating a line that's still "dirty" (has the only copy of some older store data). You can't detect that before writing; the whole point is to write in parallel with checking tags.
It might be possible to still make this work: if the cache access is a read+write exchange that keeps the previous value in a one-word buffer (or whatever the max write size is), you still have all the data. Stall everything (including writeback of this line so you don't make wrong data globally visible in coherent L2 cache). Then exchange back, wait for the old state of that L0d line to actually write back to that address, then store the tmp buffer into L0d and update the tag and needs-writeback bits to reflect this store. So aliasing between nearby stores becomes extra costly and stalls the pipeline. Or maybe you can let non-memory instructions continue and only stall execution at the next load or store. (If you have the transistor budget to do much of that stall-avoidance, you can probably just use a completely different strategy, like having a store buffer and a normal L1d.)
To be usable (assuming you work around the dirty-store-miss problem), you'd need some way to track relative order of stores (and loads). If that's as simplistic as making sure every entry in the entire L0d has finished its write-through process before allowing another write, then even store-store barriers will be very expensive. The less order-tracking a CPU does, the more expensive barriers have to be (flush more stuff to make sure).

How is an LRU cache implemented in a CPU?

I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick?
Also what happens in a dual core system where both CPUs write to the one address simultaneously?
For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.
For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.
For a 4-way cache, the following encoding of the state that would seem to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating if the higher or lower numbered way was more recently used.
On a MRU hit, the state is unchanged.
On a next-MRU hit the two bit fields are swapped.
On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the next-MRU way number is higher or lower than the less recently used way that did not hit.
On a miss, the state is updated as if an LRU hit had occurred.
Because LRU tracking has such overhead, simpler mechanisms like binary tree pseudo-LRU are often used. On a hit, such just updates each branching part of the tree with which half of the associated ways the hit was in. For a power of two number of ways W, a binary tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) are less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) are less recently used and set the higher bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.
For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.
With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race and one core will get exclusive access. If only one of the writing core already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in shared state, the cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present a write would typically need to request the cache line of data as well as asking for exclusive state.
Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.
There is a good slide-deck Page replacement algorithms that talks about various page replacement schemes. It also explains the LRU implementation using mxm matrix really well.

Cache or Registers - which is faster?

I'm sorry if this is the wrong place to ask this but I've searched and always found different answer. My question is:
Which is faster? Cache or CPU Registers?
According to me, the registers are what directly load data to execute it while the cache is just a storage place close or internally in the CPU.
Here are the sources I found that confuses me:
2 for cache | 1 for registers
http://in.answers.yahoo.com/question/index?qid=20110503030537AAzmDGp
Cache is faster.
http://wiki.answers.com/Q/Is_cache_memory_faster_than_CPU_registers
So which really is it?
CPU register is always faster than the L1 cache. It is the closest. The difference is roughly a factor of 3.
Trying to make this as intuitive as possible without getting lost in the physics underlying the question: there is a simple correlation between speed and distance in electronics. The further you make a signal travel, the harder it gets to get that signal to the other end of the wire without the signal getting corrupted. It is the "there is no free lunch" principle of electronic design.
The corollary is that bigger is slower. Because if you make something bigger then inevitably the distances are going to get larger. Something that was automatic for a while, shrinking the feature size on the chip automatically produced a faster processor.
The register file in a processor is small and sits physically close to the execution engine. The furthest removed from the processor is the RAM. You can pop the case and actually see the wires between the two. In between sit the caches, designed to bridge the dramatic gap between the speed of those two opposites. Every processor has an L1 cache, relatively small (32 KB typically) and located closest to the core. Further down is the L2 cache, relatively big (4 MB typically) and located further from the core. More expensive processors also have an L3 cache, bigger and further away.
Specifically on x86 architecture:
Reading from register has 0 or 1 cycle latency.
Writing to registers has 0 cycle latency.
Reading/Writing L1 cache has a 3 to 5 cycle latency (varies by architecture age)
Actual load/store requests may execute within 0 or 1 cycles due to write-back buffer and store-forwarding features (details below)
Reading from register can have a 1 cycle latency on Intel Core 2 CPUs (and earlier models) due to its design: If enough simultaneously-executing instructions are reading from different registers, the CPU's register bank will be unable to service all the requests in a single cycle. This design limitation isn't present in any x86 chip that's been put on the consumer market since 2010 (but it is present in some 2010/11-released Xeon chips).
L1 cache latencies are fixed per-model but tend to get slower as you go back in time to older models. However, keep in mind three things:
x86 chips these days have a write-back cache that has a 0 cycle latency. When you store a value to memory it falls into that cache, and the instruction is able to retire in a single cycle. Memory latency then only becomes visible if you issue enough consecutive writes to fill the write-back cache. Writeback caches have been prominent in desktop chip design since about 2001, but was widely missing from the ARM-based mobile chip markets until much more recently.
x86 chips these days have store forwarding from the write-back cache. If you store an address to the WB cache and then read back the same address several instructions later, the CPU will fetch the value from the WB cache instead of accessing L1 memory for it. This reduces the visible latency on what appears to be an L1 request to 1 cycle. But in fact, the L1 isn't be referenced at all in that case. Store forwarding also has some other rules for it to work properly, which also vary a lot across the various CPUs available on the market today (typically requiring 128-bit address alignment and matched operand size).
The store forwarding feature can generate false positives where-in the CPU thinks the address is in the writeback buffer based on a fast partial-bits check (usually 10-14 bits, depending on chip). It uses an extra cycle to verify with a full check. If that fails then the CPU must re-route as a regular memory request. This miss can add an extra 1-2 cycles latency to qualifying L1 cache accesses. In my measurements, store forwarding misses happen quite often on AMD's Bulldozer, for example; enough so that its L1 cache latency over-time is about 10-15% higher than its documented 3-cycles. It is almost a non-factor on Intel's Core series.
Primary reference: http://www.agner.org/optimize/ and specifically http://www.agner.org/optimize/microarchitecture.pdf
And then manually graph info from that with the tables on architectures, models, and release dates from the various List of CPUs pages on wikipedia.

Line size of L1 and L2 caches

From a previous question on this forum, I learned that in most of the memory systems, L1 cache is a subset of the L2 cache means any entry removed from L2 is also removed from L1.
So now my question is how do I determine a corresponding entry in L1 cache for an entry in the L2 cache. The only information stored in the L2 entry is the tag information. Based on this tag information, if I re-create the addr it may span multiple lines in the L1 cache if the line-sizes of L1 and L2 cache are not same.
Does the architecture really bother about flushing both the lines or it just maintains L1 and L2 cache with the same line-size.
I understand that this is a policy decision but I want to know the commonly used technique.
Cache-Lines size is (typically) 64 bytes.
Moreover, take a look at this very interesting article about processors caches:
Gallery of Processor Cache Effects
You will find the following chapters:
Memory accesses and performance
Impact of cache lines
L1 and L2 cache sizes
Instruction-level parallelism
Cache associativity
False cache line sharing
Hardware complexities
In core i7 the line sizes in L1 , L2 and L3 are the same: that is 64 Bytes.
I guess this simplifies maintaining the inclusive property, and coherence.
See page 10 of: https://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
The most common technique of handling cache block size in a strictly inclusive cache hierarchy is to use the same size cache blocks for all levels of cache for which the inclusion property is enforced. This results in greater tag overhead than if the higher level cache used larger blocks, which not only uses chip area but can also increase latency since higher level caches generally use phased access (where tags are checked before the data portion is accessed). However, it also simplifies the design somewhat and reduces the wasted capacity from unused portions of the data. It does not take a large fraction of unused 64-byte chunks in 128-byte cache blocks to compensate for the area penalty of an extra 32-bit tag. In addition, the larger cache block effect of exploiting broader spatial locality can be provided by relatively simple prefetching, which has the advantages that no capacity is left unused if the nearby chunk is not loaded (to conserve memory bandwidth or reduce latency on a conflicting memory read) and that the adjacency prefetching need not be limited to a larger aligned chunk.
A less common technique divides the cache block into sectors. Having the sector size the same as the block size for lower level caches avoids the problem of excess back-invalidation since each sector in the higher level cache has its own valid bit. (Providing all the coherence state metadata for each sector rather than just validity can avoid excessive writeback bandwidth use when at least one sector in a block is not dirty/modified and some coherence overhead [e.g., if one sector is in shared state and another is in the exclusive state, a write to the sector in the exclusive state could involve no coherence traffic—if snoopy rather than directory coherence is used].)
The area savings from sectored cache blocks were especially significant when tags were on the processor chip but the data was off-chip. Obviously, if the data storage takes area comparable to the size of the processor chip (which is not unreasonable), then 32-bit tags with 64-byte blocks would take roughly a 16th (~6%) of the processor area while 128-byte blocks would take half as much. (IBM's POWER6+, introduced in 2009, is perhaps the most recent processor to use on-processor-chip tags and off-processor data. Storing data in higher-density embedded DRAM and tags in lower-density SRAM, as IBM did, exaggerates this effect.)
It should be noted that Intel uses "cache line" to refer to the smaller unit and "cache sector" for the larger unit. (This is one reason why I used "cache block" in my explanation.) Using Intel's terminology it would be very unusual for cache lines to vary in size among levels of cache regardless of whether the levels were strictly inclusive, strictly exclusive, or used some other inclusion policy.
(Strict exclusion typically uses the higher level cache as a victim cache where evictions from the lower level cache are inserted into the higher level cache. Obviously, if the block sizes were different and sectoring was not used, then an eviction would require the rest of the larger block to be read from somewhere and invalidated if present in the lower level cache. [Theoretically, strict exclusion could be used with inflexible cache bypassing where an L1 eviction would bypass L2 and go to L3 and L1/L2 cache misses would only be allocated to either L1 or L2, bypassing L1 for certain accesses. The closest to this being implemented that I am aware of is Itanium's bypassing of L1 for floating-point accesses; however, if I recall correctly, the L2 was inclusive of L1.])
Typically, in one access to the main memory 64 bytes of data and 8 bytes of parity/ECC (I don't remember exactly which) is accessed. And it is rather complicated to maintain different cache line sizes at the various memory levels. You have to note that cache line size would be more correlated to the word alignment size on that architecture than anything else. Based on that, a cache line size is highly unlikely to be different from memory access size. Now, the parity bits are for the use of the memory controller - so cache line size typically is 64 bytes. The processor really controls very little beyond the registers. Everything else going on in the computer is more about getting hardware in to optimize CPU performance. In that sense also, it really would not make any sense to import extra complexity by making cache line sizes different at different levels of memory.

LRU vs FIFO vs Random

When there is a page fault or a cache miss we can use either the Least Recently Used (LRU), First in Fist Out (FIFO) or Random replacement algorithms. I was wondering, which one provides the best performance aka the least possible future cache miss'/page faults?
Architecture: Coldfire processor
The expression "There are no stupid questions" fits this so well. This was such a good question I had to create an account and post on it and share my views as somebody who has modelled caches on a couple of CPUs.
You specify the architecture a 68000 which is a CPU rather than a GPU or USB controller, or other piece of hardware which may access a cache however...
Therefore the code you run on the 68000 will make a huge difference to the part of the question "least possible future cache miss'/page faults".
In this you differentiate between cache miss's and page faults, I'm not sure exactly which coldfire architecure you are refering to but I'm assuming this doesn't have a hardware TLB replacement, it uses a software mechansim (so the cache will be shared with the data of applications).
In the replacement policy the most important factor is the number of associations (or ways).
A direct map cache (1 way), directly correlates (all most always) with the low order bits of the address (the number of bits specify the size of the cache) so a 32k cache would be the lower 15bits. In this case the replacement algorthims LRU, FIFO or Random would be useless as there is only one possible choice.
However Writeback or Writethrough selection of the cache would have more effect. For writes to memory only Writethrough means the cache line isn't allocated as apposed to a writeback cache where the line currently in the cache which shares the same lower 15 bits is ejected out of the cache and read back in then modified, for use IF the code running on the CPU uses this data).
For operations that write and don't perform multiple operations on the data then writethrough is usually much better, also on modern processors (and I don' know if this architecture supports it), but Writethrough or Writeback can be selected on a TLB/Page basis. This can have a much greator effect on the cache than the policy, you can programme the system to match the type of data in each page, especially in a direct map cache ;-)
So a direct map cache is pretty easy to understand, it's also easy to understand the basis of cache's worst case, best case and average case.
Imagin a memcpy routine which copies data which is aligned to the cache size. For example a 32k direct mapped cache, with two 32k buffers aligned on a 32k boundry....
0x0000 -> read
0x8000 -> write
0x8004 -> read
0x8004 -> write
...
0x8ffc -> read
0x8ffc -> write
Here you see the read and write's as they copy each word of data, notice the lower 15 bits are the same for each read and write operation.
A direct mapped cache using writeback (remember writeback allocates lines does the following)
0x0000 -> read
cache performs: (miss)
0x0000:0x001f -> READ from main memory (ie. read 32 bytes of the source)
0x8000 -> write
cache performs: (miss)
invalidate 0x0000:0x001f (line 0)
0x8000:0x801f -> READ from main memory (ie. read 32 bytes of the destination)
0x8000 (modify this location in the cache with the read source data)
<loop>
0x0004 -> read
cache performs: (miss)
writeback 0x8000:0x801f -> WRITE to main memory (ie. write 32 bytes to the desitnation)
0x0000:0x001f -> READ from main memory (ie. read 32 bytes of source (the same as we did just before)
0x8004 -> write
cache performs: (miss)
invalidate 0x0000:0x001f (line 0)
0x8000:0x801f -> READ from main memory (ie. read 32 bytes of the destination)
0x8004 (modify this location in the cache with the read source data)
</loop> <--- (side note XML is not a language but we use it as such)
As you see a lot of memory operations go on, this in fact is called "thrashing" and is the best example of a worst case scenairo.
Now imagine that we use a writethrough cache, these are the operations:
<loop>
0x0000 -> read
cache performs: (miss)
0x0000:0x001f -> READ from main memory (ie. read 32 bytes of the source)
0x8000 -> write
cache performs: (not a miss)
(not a lot, the write is "posted" to main memory) (posted is like a letter you just place it in the mailbox and you don't care if it takes a week to get there).
<loop>
0x0004 -> read
cache performs: (hit)
(not a lot, it just pulls the data it fetched last time which it has in it's memory so it goes very quickly to the CPU)
0x8004 -> write
cache performs: (not a miss)
(not a lot, the write is "posted" to main memory)
</loop until next 32 bytes>
</loop until end of buffer>
As you can see a huge difference we now don't thrash, in fact we are best case in this example.
Ok so that's the simple case of write through vs. write back.
Direct map caches however are now not very common most people use, 2,4 or 8 way cache's, that is there are 2, 4 or 8 different possible allocations in a line. So we could store 0x0000, 0x8000, 0x1000, 0x1800 all in the cache at the same time in a 4 way or 8 way cache (obviously an 8 way could also store 0x2000, 0x2800, 0x3000, 0x3800 as well).
This would avoid this thrashing issue.
Just to clarify the line number in a 32k direct mapped cache is the bottom 15 bits of the address.
In a 32k 2 way it's the bottom 14 bits.
In a 32k 4 way it's the bottom 13 bits.
In a 32k 8 way it's the bottom 12 bits.
And in a fully associatve cache it's the lines size (or the bottom 5 bits with a 32 byte line). You can't have less than a line. 32 bytes is typically the most optimum operation in a DDR memory system (there are other reasons as well, someimes 16 or sometimes 64 bytes can be better, and 1 byte would be optimal in the algorthmic case, lets uses 32 as that's very common)
To help understand LRU, FIFO and Random consider the cache is full associative, in a 32k 32 byte line cache this is 1024 lines.
A random replacement policy would on random cause a worst case hit every 1024 replacements (ie. 99.9% hit), in either LRU or FIFO I could always write a programme which would "thrash" ie. always cause a worst case behavouir (ie. 0% hit).
Clearly if you had a fully associative cache you would only choose LRU or FIFO if the programme was known and it was known the exact behavouir of the programme.
For ANYTHING which wasn't 99.9% predictable you would choose RANDOM, it's simply the best at not being worst, and one of the best for being average, but how about best case (where I get the best performance)...
Well it depends basically on the number of ways...
2 ways and I can optimise things like memcpy and other algorthims to do a good job. Random would get it wrong half the time.
4 ways and when I switch between other tasks I might not damage the cache so much that their data is still local. Random would get it wrong quater of the time.
8 ways now statistics can take effect a 7/8% hit rate on a memcpy isn't as good as a 1023/1024% (fully associative or optimised code) but for non optimised code it makes a difference.
So why don't people make fully associative cache's with random replacement policies!
Well it's not because they can't generate good random numbers, in fact a pseudo random number generator is just as good and yes I can write a programme to get 100% miss rate, but that isn't the point, I couldn't write a useful programme that would have 100% miss, which I could with an LRU or FIFO algo.
A 32k 32 byte line Fully associatve cache's require you to compare 1024 values, in hardware this is done through a CAM, but this is an expensive piece of hardware, and also it's just not possible to compare this many values in a "FAST" processing time, I wonder if a quantum computer could....
Anyway to answer your question which one is better:
Consider if writethrough might be better than writeback.
Large way's RANDOM is better
Unknown code RANDOM is better for 4 way or above.
If it's single function or you want the most speed from something your willing to optimise, and or you don't care about worst case, then LRU is probably what you want.
If you have very few way's LRU is likely what you want unless you have a very specific scenario then FIFO might be OK.
References:
http://www.freescale.com/files/training/doc/APF_ENT_T0801_5225_COLDFIRE_MCU.pdf
http://en.wikipedia.org/wiki/Cache_algorithms
No perfect caching policy exists because it would require knowledge of the future (how a program will access memory).
But, some are measurably better than others in the common memory access pattern cases. This is the case with LRU. LRU has historically given very good performance in overall use.
But, for what you are trying to do, another policy may be better. There is always some memory access pattern that will cause a caching policy to perform poorly.
You may find this thread helpful (and more elaborate!)
Why is LRU better than FIFO?
Many of the architectures I have studied use LRU, as it generally provides not only efficiency in implementation, but also is pretty good on average at preventing misses. However, in the latest x86 architectures, I think there are some more complicated things going on. LRU is sort of a basic model.
It really depends on what kind of operations you are performing on your device. Depending on the types of operations, different evacuation policies will work better. For example, FIFO works well with traversing memory sequentially.
Hope this helps, I'm not really an architecture guy.
Between the three, I'd recommend LRU. First, it's a good approximation to optimal scheduling when locality is assumed (this turns out to be a good assumption). Random scheduling cannot benefit from locality. Second, it doesn't suffer from Belady's anomaly (like FIFO); that is, bigger caches mean better performance, which isn't necessarily true with FIFO.
Only if your specific problem domain strongly suggests using something else, LRU is going to be hard to beat in the general case.
Of the three, LRU is generally the best while FIFO is the worst and random comes in somewhere between. You can construct access patterns where any of the three is superior to either of the others, but it is somewhat tricky. Interestingly enough, this order is also roughly how expensive they are to implement -- LRU is the most expensive and FIFO is the cheapest. Just goes to show, there's no free lunch
If you want the best of both worlds, consider an adaptive approach that alters it strategy based on actual usage patterns. For example, look at the algorith IBM's Adaptive Replacement Cache: http://code.activestate.com/recipes/576532-adaptive-replacement-cache-in-python/

Resources