Cache Line Format/Layout

Cache Line Format/Layout - caching

I would like to understand the layout of a cache line (not cache layout) and was searching for more in depth explanation or a figure on which fields are contained in a cache line, preferably for Intel CPUs.
On Computer systems: a programmer's perspective from Randal E. Bryant; David R. O'Hallaron it is stated that:
A cache line is a container in a cache that stores a block, as well as other information such as the valid
bit and the tag bits.
However, this is very generic and does not state if there are other bits as well.
I was searching the web, aforementioned book and the Intel manuals, but didn't find anything else. The only thing that seems to be readily available is information about the layout of the cache and page table entries.
Is this any undisclosed information or is it just too trivial and the only fields available in a cache line are data block, valid bit and tag bits (as stated in the book)?

In addition to the data itself, each cache line will typically have coherence metadata (not just validity, using four-state MESI is a popular coherence tracking method) and either parity or ECC metadata.
For each set (or sometimes for each line), there will also be replacement information. Not Recently Used can be implemented with a bit per line that is set on use and cleared when all use bits are set in a cache set. Tree-based pseudo-Least Recently Used has a binary tree for each set where each bit indicates if its half of the group was more recently used.
An L2 or L3 cache that is used by more than one L1 may have metadata about which L1 caches hold the data to avoid having to send invalidation or sharing update requests to all the L1s.
Other metadata may be present to improve replacement beyond the basic method (e.g., EvictMe bits have been proposed), to indicate compressed status, to provide prefetch hints, etc. AMD's Athlon stole ECC bits to store branch information in L2 (only providing parity protection for instruction memory).
Instruction caches can also predecode cache lines to make decode simpler and faster. This can alter the encoding (e.g., replacing a branch target offset with the branch target inset (the sum of offset and the offset)), rearrange fields for more regular encoding, or pad the instruction with instruction type information.

Related

VIPT Cache: Connection between TLB & Cache?

I just want to clarify the concept and could find detail enough answers which can throw some light upon how everything actually works out in the hardware. Please provide any relevant details.
In case of VIPT caches, the memory request is sent in parallel to both the TLB and the Cache.
From the TLB we get the traslated physical address.
From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set).
Then the translated TLB address is matched with the list of tags to find a candidate.
My question is where is this check performed ?
In Cache ?
If not in Cache, where else ?
If the check is performed in Cache, then
is there a side-band connection from TLB to the Cache module to get the
translated physical address needed for comparison with the tag addresses?
Can somebody please throw some light on "actually" how this is generally implemented and the connection between Cache module & the TLB(MMU) module ?
I know this dependents on the specific architecture and implementation.
But, what is the implementation which you know when there is VIPT cache ?
Thanks.

At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They're very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e. taking advantage of the index bits all being below the page offset and thus being translated "for free". Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?)
The L1dTLB itself is a small/fast Content addressable memory with (for example) 64 entries and 4-way set associative (Intel Skylake). Hugepages are often handled with a second (and 3rd) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages: 4-entry fully (4-way) associative.
But for now, simplify your mental model and forget about hugepages.
The L1dTLB is a single CAM, and checking it is a single lookup operation.
"The cache" consists of at least these parts:
the SRAM array that stores the tags + data in sets
control logic to fetch a set of data+tags based on the index bits. (High-performance L1d caches typically fetch data for all ways of the set in parallel with tags, to reduce hit latency vs. waiting until the right tag is selected like you would with larger more highly associative caches.)
comparators to check the tags against a translated address, and select the right data if one of them matches, or trigger miss-handling. (And on hit, update the LRU bits to mark this way as Most Recently Used). For a diagram of the basics for a 2-way associative cache without a TLB, see https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec16.pdf#page=17. The = inside a circle is the comparator: producing a boolean true output if the tag-width inputs are equal.
The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:
AGU generates an address from register(s) + offset.
(Fun fact: Sandybridge-family optimistically shortcuts this process for simple addressing mode: [reg + 0-2047] has 1c lower load-use latency than other addressing modes, if the reg value is in the same 4k page as reg+disp. Is there a penalty when base+offset is in a different page than the base?)
The index bits come from the offset-within-page part of the address, so they don't need translating from virtual to physical. Or translation is a no-op. This VIPT speed with the non-aliasing of a PIPT cache works as long as L1_size / associativity <= page_size. e.g. 32kiB / 8-way = 4k pages.
The index bits select a set. Tags+data are fetched in parallel for all ways of that set. (This costs power to save latency, and is probably only worth it for L1. Higher-associativity (more ways per set) L3 caches definitely not)
The high bits of the address are looked up in the L1dTLB CAM array.
The tag comparator receives the translated physical-address tag and the fetched tags from that set.
If there's a tag match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line low bits of the address, and the operand-size).
Or instead of fetching the full 64-byte line, it could have used the offset bits earlier to fetch just one (aligned) word from each way. CPUs without efficient unaligned loads are certainly designed this way. I don't know if this is worth doing to save power for simple aligned loads on a CPU which supports unaligned loads.
But modern Intel CPUs (P6 and later) have no penalty for unaligned load uops, even for 32-byte vectors, as long as they don't cross a cache-line boundary. Byte-granularity indexing for 8 ways in parallel probably costs more than just fetching the whole 8 x 64 bytes and setting up the muxing of the output while the fetch+TLB is happening, based on offset-within-line, operand-size, and special attributes like zero- or sign-extension, or broadcast-load. So once the tag-compare is done, the 64 bytes of data from the selected way might just go into an already-configured mux network that grabs the right bytes and broadcasts or sign-extends.
AVX512 CPUs can even do 64-byte full-line loads.
If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure if / how CPUs manage to pipeline this so other loads can keep executing while the TLB-miss is resolved. That process involves checking the L2TLB (Skylake: unified 1536 entry 12-way for 4k and 2M, 16-entry for 1G), and if that fails then with a page-walk.
I assume that a TLB miss results in the tag+data fetch being thrown away. They'll be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.
At the simplest, it could just re-run the whole operation (including fetching the translation from L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly instead of putting it into L1dTLB and getting it back out again.
Obviously that requires that the dTLB and L1D are really designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses to avoid a catch 22 / chicken-egg problem).
is there a side-band connection from TLB to the Cache?
I wouldn't call it a side-band connection. The L1D cache is the only thing that uses the L1dTLB. Similarly, L1iTLB is used only by the L1I cache.
If there's a 2nd-level TLB, it's usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.
Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 check, so physical addresses can be sent to other caches.

Opposite of cache prefetch hint

Is there a hint I can put in my code indicating that a line should be removed from cache? As opposed to a prefetch hint, which would indicate I will soon need a line. In my case, I know when I won't need a line for a while, so I want to be able to get rid of it to free up space for lines I do need.

clflush, clflushopt
Invalidates from every level of the cache hierarchy in the cache coherence domain the cache line that contains the
linear address specified with the memory operand. If that cache line contains modified data at any level of the
cache hierarchy, that data is written back to memory.
They are not available on every CPU (in particular, clflushopt is only available on the 6th generation and later). To be certain, you should use CPUID to verify their availability:
The availability of CLFLUSH is indicated by the presence of the CPUID feature flag CLFSH
(CPUID.01H:EDX[bit 19]).
The availability of CLFLUSHOPT is indicated by the presence of the CPUID feature flag CLFLUSHOPT
(CPUID.(EAX=7,ECX=0):EBX[bit 23]).
If available, you should use clflushopt. It outperforms clflush when flushing buffers larger than 4KiB (64 lines).
This is the benchmark from Intel's Optimization Manual:
For informational purpose (assuming you are running in a privileged context), you can also use invd (as a nuke-from-orbit option). This:
Invalidates (flushes) the processor’s internal caches and issues a special-function bus cycle that directs external
caches to also flush themselves. Data held in internal caches is not written back to main memory.
or wbinvd, which:
Writes back all modified cache lines in the processor’s internal cache to main memory and invalidates (flushes) the
internal caches. The instruction then issues a special-function bus cycle that directs external caches to also write
back modified data and another bus cycle to indicate that the external caches should be invalidated.
A future instruction that could make it into the ISA is club. Although this won't fit your need (because it doesn't necessarily invalidate the line), it's worth mentioning for completeness. This would:
Writes back to memory the cache line (if dirty) that contains the linear address specified with the memory
operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the
cache hierarchy in non-modified state. Retaining the line in the cache hierarchy is a performance optimization
(treated as a hint by hardware) to reduce the possibility of cache miss on a subsequent access. Hardware may
choose to retain the line at any of the levels in the cache hierarchy, and in some cases, may invalidate the line
from the cache hierarchy.

MESI protocol. Write with cache miss, but cache line copy exists on another CPU. Why needs fetch from main memory?

According to this diagram in case of write cache miss with copy in another CPU cache (for example Shared/Exclusive state). The steps are:
1. Snooping cores (with cache line copy) sets state to Invalid.
2. Current cache stores fresh main memory value.
Why one of the snooping cores can't put its cache line value on the bus at first? And then go to Invalid state. The same algorithm is used in read miss with existing copy. Thank you.

You're absolutely right in that it's pretty silly to go fetch a line from memory when you already have it right next to you, but this diagram describes the minimal requirement for functional correctness of the coherence protocol (i.e. what must be done to avoid coherence bugs), and that only dictates snooping the data out for modified lines since that's the only correct copy. What you describe is a possible optimization, and some systems indeed behave that way.
However, keep in mind that most systems today employ a shared cache as well (L2 or L3, sometimes even beyond that), and this is often inclusive (with regards to all lines that exist in all cores). In such systems, there's no real need to go all the way to memory, since having the line in another core means it's also in the shared cache, and after invalidation the requesting core can obtain it from there. Your proposal is therefore relevant only for systems with no shared cache, or with a cache that is not strictly inclusive.

How is an LRU cache implemented in a CPU?

I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick?
Also what happens in a dual core system where both CPUs write to the one address simultaneously?

For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.
For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.
For a 4-way cache, the following encoding of the state that would seem to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating if the higher or lower numbered way was more recently used.
On a MRU hit, the state is unchanged.
On a next-MRU hit the two bit fields are swapped.
On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the next-MRU way number is higher or lower than the less recently used way that did not hit.
On a miss, the state is updated as if an LRU hit had occurred.
Because LRU tracking has such overhead, simpler mechanisms like binary tree pseudo-LRU are often used. On a hit, such just updates each branching part of the tree with which half of the associated ways the hit was in. For a power of two number of ways W, a binary tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) are less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) are less recently used and set the higher bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.
For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.
With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race and one core will get exclusive access. If only one of the writing core already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in shared state, the cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present a write would typically need to request the cache line of data as well as asking for exclusive state.
Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.

There is a good slide-deck Page replacement algorithms that talks about various page replacement schemes. It also explains the LRU implementation using mxm matrix really well.

What cache invalidation algorithms are used in actual CPU caches?

I came to the topic caching and mapping and cache misses and how the cache blocks get replaced in what order when all blocks are already full.
There is the least recently used algorithm or the fifo algorithm or the least frequently algorithm and random replacement, ...
But what algorithms are used on actual cpu caches? Or can you use all and the... operating system decides what the best algorithm is?
Edit: Even when i chose an answer, any further information is welcome ;)

As hivert said - it's hard to get a clear picture on the specific algorithm, but one can deduce some of the information according to hints or clever reverse engineering.
You didn't specify which CPU you mean, each one can have a different policy (actually even within the same CPU different cache levels may have different policies, not to mention TLBs and other associative arrays which also may have such policies). I did find a few hints about Intel (specifically Ivy bridge), so we'll use this as a benchmark for industry level "standards" (which may or may not apply elsewhere).
First, Intel presented some LRU related features here -
http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-1-Microprocessor/HC24.28.117-HotChips_IvyBridge_Power_04.pdf
Slide 46 mentioned "Quad-Age LRU" - this is apparently an age based LRU that assigned some "age" to each line according to its predicted importance. They mention that prefetches get middle age, so demands are probably allocated at a higher age (or lower, whatever survives longest), and all lines likely age gradually, so the oldest gets replaced first. Not as good as perfect "fifo-like" LRU, but keep in mind that most caches don't implement that, but rather a complicated pseudo-LRU solution, so this might be an improvement.
Another interesting mechanism mentioned there, which goes the extra mile beyond classic LRU, is adaptive fill policy. There's a pretty good analysis here - http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ , but in a nutshell (if the blog is correct, and he does seem to make a good match with his results), the cache dynamically chooses between two LRU policies, trying to decide whether the lines are going to be reused or not (and should be kept or not).
I guess this could answer to some extent your question on multiple LRU schemes. Implementing several schemes is probably hard and expensive in terms of HW, but when you have some policy that's complicated enough to have parameters, it's possible to use tricks like dynamic selection, set dueling , etc..

The following are some examples of replacement policies used in actual processors.
The PowerPC 7450's 8-way L1 cache used binary tree pLRU. Binary tree pLRU uses one bit per pair of ways to set an LRU for that pair, then an LRU bit for each pair of pairs of ways, etc. The 8-way L2 used pseudo-random replacement settable by privileged software (the OS) as using either a 3-bit counter incremented every clock cycle or a shift-register-based pseudo-random number generator.
The StrongARM SA-1110 32-way L1 data cache used FIFO. It also had a 2-way minicache for transient data, which also seems to have used FIFO. (Intel StrongARM SA-1110 Microprocessor Developer’s Manual states "Replacements in the minicache use the same round-robin pointer mechanism as in the main data cache. However, since this cache is only two-way set-associative, the replacement algorithm reduces to a simple least-recently-used (LRU) mechanism."; but 2-way FIFO is not the same as LRU even with only two ways, though for streaming data it works out the same.])
The HP PA 7200 had a 64-block fully associative "assist cache" that was accessed in parallel with an off-chip direct-mapped data cache. The assist cache used FIFO replacement with the option of evicting to the off-chip L1 cache. Load and store instructions had a "locality only" hint; if an assist cache entry was loaded by such a memory access, it would be evicted to memory bypassing the off-chip L1.
For 2-way associativity, true LRU might be the most common choice since it has good behavior (and, incidentally, is the same as binary tree pLRU when there are only two ways). E.g., the Fairchild Clipper Cache And Memory Management Unit used LRU for its 2-way cache. FIFO is slightly cheaper than LRU since the replacement information is only updated when the tags are written anyway, i.e., when inserting a new cache block, but has better behavior than counter-based pseudo-random replacement (which has even lower overhead). The HP PA 7300LC used FIFO for its 2-way L1 caches.
The Itanium 9500 series (Poulson) uses NRU for L1 and L2 data caches, L2 instruction cache, and the L3 cache (L1 instruction cache is documented as using LRU.). For the 24-way L3 cache in the Itanium 2 6M (Madison), a bit per block was provided for NRU with an access to a block setting the bit corresponding to its set and way ("Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache", Stefan Rusu et al., 2004). This is similar to the clock page replacement algorithm.
I seem to recall reading elsewhere that the bits were cleared when all were set (rather than keeping the one that set the last unset bit) and that the victim was chosen by a find first unset scan of the bits. This would have the hardware advantage of only having to read the information (which was stored in distinct arrays from but nearby the L3 tags) on a cache miss; a cache hit could simply set the appropriate bit. Incidentally, this type of NRU avoids some of the bad traits of true LRU (e.g., LRU degrades to FIFO in some cases and in some of these cases even random replacement can increase the hit rate).

For Intel CPUs, the replacement policies are usually undocumented. I have done some experiments to uncover the policies in recent Intel CPUs, the results of which can be found on https://uops.info/cache.html. The code that I used is available on GitHub.
The following is a summary of my findings.
Tree-PLRU: This policy is used by the L1 data caches of all CPUs that I tested, as well as by the L2 caches of the Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell and Broadwell CPUs.
Randomized Tree-PLRU: Some Core 2 Duo CPUs use variants of Tree-PLRU in their L2 caches where either the lowest or the highest bits in the tree are replaced by (pseudo-)randomness.
MRU: This policy is sometimes also called NRU. It uses one bit per cache block. An access to a block sets the bit to 0. If the last 1-bit was set to 0, all other bits are set to 1. Upon a miss, the first block with its bit set to 1 is replaced. This policy is used for the L3 caches of the Nehalem, Westmere, and Sandy Bridge CPUs.
Quad-Age LRU (QLRU): This is a generalization of the MRU policy that uses two bits per cache block. Different variants of this policy are used for the L3 caches, starting with Ivy Bridge, and for the L2 caches, starting with Skylake.
Adaptive policies: The Ivy Bridge, Haswell, and Broadwell CPUs can dynamically choose between two different QLRU variants. This is implemented via set dueling: A small number of dedicated sets always use the same QLRU variant; the remaining sets are "follower sets" that use the variant that performs better on the dedicated sets. See also http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio