VIPT Cache: Connection between TLB & Cache? - caching

I just want to clarify the concept and could not find detailed enough answers that throw some light on how everything actually works out in hardware. Please provide any relevant details.
In case of VIPT caches, the memory request is sent in parallel to both the TLB and the Cache.
From the TLB we get the translated physical address.
From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set).
Then the physical address from the TLB is matched against the list of tags to find a candidate.
My question is where is this check performed ?
In Cache ?
If not in Cache, where else ?
If the check is performed in Cache, then
is there a side-band connection from TLB to the Cache module to get the
translated physical address needed for comparison with the tag addresses?
Can somebody please shed some light on how this is "actually" generally implemented, and on the connection between the Cache module and the TLB (MMU) module?
I know this depends on the specific architecture and implementation.
But what is an implementation you know of that has a VIPT cache?
Thanks.

At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They're very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e. taking advantage of the index bits all being below the page offset, and thus being translated "for free"). Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
The L1dTLB itself is a small/fast content-addressable memory (CAM) with (for example) 64 entries, 4-way set associative (Intel Skylake). Hugepages are often handled with a second (and third) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages: 4-entry fully (4-way) associative.
But for now, simplify your mental model and forget about hugepages.
The L1dTLB is a single CAM, and checking it is a single lookup operation.
"The cache" consists of at least these parts:
the SRAM array that stores the tags + data in sets
control logic to fetch a set of data+tags based on the index bits. (High-performance L1d caches typically fetch data for all ways of the set in parallel with tags, to reduce hit latency vs. waiting until the right tag is selected like you would with larger more highly associative caches.)
comparators to check the tags against a translated address, and select the right data if one of them matches, or trigger miss-handling. (And on hit, update the LRU bits to mark this way as Most Recently Used). For a diagram of the basics for a 2-way associative cache without a TLB, see https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec16.pdf#page=17. The = inside a circle is the comparator: producing a boolean true output if the tag-width inputs are equal.
The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:
AGU generates an address from register(s) + offset.
(Fun fact: Sandybridge-family optimistically shortcuts this process for simple addressing mode: [reg + 0-2047] has 1c lower load-use latency than other addressing modes, if the reg value is in the same 4k page as reg+disp. Is there a penalty when base+offset is in a different page than the base?)
The index bits come from the offset-within-page part of the address, so they don't need translating from virtual to physical. Or rather, translation is a no-op for them. This gives VIPT speed with the non-aliasing of a PIPT cache, as long as L1_size / associativity <= page_size. e.g. 32kiB / 8-way = 4k pages.
The index bits select a set. Tags+data are fetched in parallel for all ways of that set. (This costs power to save latency, and is probably only worth it for L1. Higher-associativity (more ways per set) L3 caches definitely don't do this.)
The high bits of the address are looked up in the L1dTLB CAM array.
The tag comparator receives the translated physical-address tag and the fetched tags from that set.
If there's a tag match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line low bits of the address, and the operand-size).
Or instead of fetching the full 64-byte line, it could have used the offset bits earlier to fetch just one (aligned) word from each way. CPUs without efficient unaligned loads are certainly designed this way. I don't know if this is worth doing to save power for simple aligned loads on a CPU which supports unaligned loads.
But modern Intel CPUs (Nehalem and later) have no penalty for unaligned load uops, even for 32-byte vectors, as long as they don't cross a cache-line boundary. Byte-granularity indexing for 8 ways in parallel probably costs more than just fetching the whole 8 x 64 bytes and setting up the muxing of the output while the fetch+TLB is happening, based on offset-within-line, operand-size, and special attributes like zero- or sign-extension, or broadcast-load. So once the tag-compare is done, the 64 bytes of data from the selected way might just go into an already-configured mux network that grabs the right bytes and broadcasts or sign-extends.
AVX512 CPUs can even do 64-byte full-line loads.
If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure if / how CPUs manage to pipeline this so other loads can keep executing while the TLB miss is resolved. That process involves checking the L2TLB (Skylake: unified 1536-entry 12-way for 4k and 2M, 16-entry for 1G), and if that fails, a page walk.
I assume that a TLB miss results in the tag+data fetch being thrown away. They'll be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.
At the simplest, it could just re-run the whole operation (including fetching the translation from L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly instead of putting it into L1dTLB and getting it back out again.
Obviously that requires that the dTLB and L1D are really designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses to avoid a catch 22 / chicken-egg problem).
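The size constraint mentioned above (L1_size / associativity <= page_size) can be sanity-checked numerically. A minimal sketch assuming Skylake-like parameters; the numbers are illustrative, not a statement about any design beyond what's quoted above:

```python
# VIPT-behaves-like-PIPT condition: all index bits must fall inside
# the page offset, so they're identical in virtual and physical addresses.
L1_SIZE = 32 * 1024   # 32 KiB
WAYS = 8              # 8-way set associative
LINE = 64             # 64-byte lines
PAGE = 4096           # 4 KiB pages

sets = L1_SIZE // (WAYS * LINE)        # 64 sets
offset_bits = LINE.bit_length() - 1    # bits 5:0 select the byte in the line
index_bits = sets.bit_length() - 1     # bits 11:6 select the set

# One way covers exactly one page, so the index never needs translating:
assert L1_SIZE // WAYS <= PAGE
assert offset_bits + index_bits <= PAGE.bit_length() - 1  # 12 page-offset bits
```

Growing the cache beyond this bound means either adding ways (Intel's approach) or accepting index bits above the page offset, which reintroduces aliasing.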
is there a side-band connection from TLB to the Cache?
I wouldn't call it a side-band connection. The L1D cache is the only thing that uses the L1dTLB. Similarly, L1iTLB is used only by the L1I cache.
If there's a 2nd-level TLB, it's usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.
Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 check, so physical addresses can be sent to other caches.
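To make the tag/index/offset split concrete, here is a toy decomposition for the 64-set, 64-byte-line geometry discussed above. Only the `tag` field needs the TLB's output; the other two fields come straight from the untranslated address:

```python
def split_addr(addr, line=64, sets=64):
    """Split an address into (tag, set index, offset-within-line)."""
    offset = addr % line            # bits 5:0 -- byte within the line
    index = (addr // line) % sets   # bits 11:6 -- inside the page offset
    tag = addr // (line * sets)     # bits 12 and up -- compared post-TLB
    return tag, index, offset
```

For example, `split_addr(0x12345678)` gives tag `0x12345`, set `25`, offset `56`.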

Related

How is AMD's micro-tagged L1 data cache accessed?

I am learning about the access process of L1 cache of AMD processor. But I read AMD's manual repeatedly, and I still can't understand it.
My understanding of L1 data cache with Intel is:
The L1 cache is virtually indexed and physically tagged. Therefore, use the index bits of the virtual address to find the corresponding cache set, and finally determine which cache line in the cache set it is based on the tag.
(Intel makes their L1d caches associative enough and small enough that the index bits come only from the offset-within-page which is the same in the physical address. So they get the speed of VIPT with none of the aliasing problems, behaving like PIPT.)
But AMD used a new method. In Zen 1, they have a 32-Kbyte, 8-way set associative L1d cache, which (unlike the 64KB 4-way L1i) is small enough to avoid aliasing problems without micro-tags.
From AMD's 2017 Software Optimization Manual, section 2.6.2.2 "Microarchitecture of AMD Family 17h Processor" (Zen 1):
The L1 data cache tags contain a linear-address-based microtag (utag)
that tags each cacheline with the linear address that was used to
access the cacheline initially. Loads use this utag to determine which
way of the cache to read using their linear address, which is
available before the load's physical address has been determined via
the TLB. The utag is a hash of the load's linear address. This linear
address based lookup enables a very accurate prediction of in which
way the cacheline is located prior to a read of the cache data. This
allows a load to read just a single cache way, instead of all 8. This
saves power and reduces bank conflicts.
It is possible for the utag to
be wrong in both directions: it can predict hit when the access will
miss, and it can predict miss when the access could have hit. In
either case, a fill request to the L2 cache is initiated and the utag
is updated when L2 responds to the fill request.
Linear aliasing occurs when two different linear addresses are mapped
to the same physical address. This can cause performance penalties for
loads and stores to the aliased cachelines. A load to an address that
is valid in the L1 DC but under a different linear alias will see an
L1 DC miss, which requires an L2 cache request to be made. The latency
will generally be no larger than that of an L2 cache hit. However, if
multiple aliased loads or stores are in-flight simultaneously, they
each may experience L1 DC misses as they update the utag with a
particular linear address and remove another linear address from being
able to access the cacheline.
It is also possible for two different
linear addresses that are NOT aliased to the same physical address to
conflict in the utag, if they have the same linear hash. At a given L1
DC index (11:6), only one cacheline with a given linear hash is
accessible at any time; any cachelines with matching linear hashes are
marked invalid in the utag and are not accessible.
It is possible for the utag to be wrong in both directions
What is the specific scenario of this sentence in the second paragraph? Under what circumstances will hit be predicted as miss and miss as hit?
When the CPU brings data from memory into the cache, it will calculate a cache way based on the utag. And just put it there? Even if the other cache ways are empty?
Linear aliasing occurs when two different linear addresses are mapped to the same physical address.
How can different linear addresses map to the same physical address?
However, if multiple aliased loads or stores are in-flight simultaneously, they each may experience L1 DC misses as they update the utag with a particular linear address and remove another linear address from being able to access the cacheline.
What does this sentence mean? My understanding is to first calculate the utag based on the linear address (virtual address) to determine which cache way to use. Then use the tag field of the physical address to determine whether it is a cache hit? How is utag updated? Will it be recorded in the cache?
any cachelines with matching linear hashes are marked invalid in the utag and are not accessible.
What does this sentence mean?
How does AMD judge cache hit or miss? Why are some hits regarded as misses? Can someone explain? Many thanks!
The L1 data cache tags contain a linear-address-based microtag (utag)
that tags each cacheline with the linear address that was used to
access the cacheline initially.
Each cache line in the L1D has a utag associated with it. This implies the utag memory structure is organized exactly like the L1D (i.e., 8 ways and 64 sets) and there is a one-to-one correspondence between the entries. The utag is calculated based on the linear address of the request that caused the line to be filled in the L1D.
Loads use this utag to determine which way of the cache to read using
their linear address, which is available before the load's physical
address has been determined via the TLB.
The linear address of a load is sent simultaneously to the way predictor and the TLB (it's better to use the term MMU, since there are multiple TLBs). A particular set in the utag memory is selected using certain bits of the linear address (11:6) and all of the 8 utags in that set are read at the same time. Meanwhile, a utag is calculated based on the linear address of the load request. When both of these operations complete, the given utag is compared against all the utags stored in the set. The utag memory is maintained such that there can be at most one utag in each set with the same value. In case of a hit in the utag memory, the way predictor predicts that the target cache line is in the corresponding cache entry in the L1D. Up until this point, the physical address is not yet needed.
The utag is a hash of the load's linear address.
The hash function was reverse-engineered in the paper titled Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors in Section 3 for a number of microarchitectures. Basically, certain bits of the linear address at positions 27:12 are XOR'ed with each other to produce an 8-bit value, which is the utag. A good hash function should: (1) minimize the number of linear address pairs that map to the same utag, (2) minimize the size of the utag, and (3) have a latency not larger than the utag memory access latency.
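The exact hash is microarchitecture-specific (see the paper), but the general shape is easy to sketch: fold 16 bits of the linear address down to 8, so distinct addresses can collide. The following is a hypothetical XOR-fold for illustration, NOT AMD's actual reverse-engineered function:

```python
def utag(linear_addr):
    # Hypothetical example hash: XOR the two bytes of linear-address
    # bits 27:12 together to get an 8-bit utag. Not AMD's real function.
    bits = (linear_addr >> 12) & 0xFFFF
    return (bits ^ (bits >> 8)) & 0xFF

# Two unrelated pages can share a utag (a "same linear hash" conflict):
assert utag(0x1000) == utag(0x100000)
```

Any 16-to-8-bit fold has this property by pigeonhole, which is exactly what makes cases like the utag conflicts described later possible.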
This linear address based lookup enables a very accurate prediction of
in which way the cacheline is located prior to a read of the cache
data. This allows a load to read just a single cache way, instead of
all 8. This saves power and reduces bank conflicts.
Besides the utag memory and associated logic, the L1D also includes a tag memory and a data memory, all of which have the same organization. The tag memory stores physical tags (from bit 6 up to the highest bit of the physical address). The data memory stores cache lines. In case of a hit in the utag, the way predictor reads only one entry in the corresponding way of the tag memory and data memory. The size of a physical address is more than 35 bits on modern x86 processors, so the size of a physical tag is more than 29 bits. This is more than 3x larger than the size of a utag. Without way prediction, in a cache with more than one way, multiple tags would have to be read and compared in parallel. In an 8-way cache, reading and comparing 1 tag consumes much less energy than reading and comparing 8 tags.
In a cache where each way can be activated separately, each cache entry has its own wordline, which is shorter than a wordline shared across multiple cache ways. Due to signal propagation delays, reading a single way takes less time than reading 8 ways. However, in a parallelly-accessed cache there is no way-prediction delay, but linear address translation is on the critical path of the load latency. With way prediction, the data from the predicted entry can be speculatively forwarded to dependent uops. This can provide a significant load latency advantage, especially since linear address translation latency can vary due to the multi-level design of the MMU, even in the typical case of an MMU hit. The downside is that it introduces a new reason why replays may occur: in case of a misprediction, tens or even hundreds of uops may need to be replayed. I don't know if AMD actually forwards the requested data before validating the prediction, but it's possible even though not mentioned in the manual.
Reduction of bank conflicts is another advantage of way prediction as mentioned in the manual. This implies that different ways are placed in different banks. Section 2.6.2.1 says that bits 5:2 of the address, the size of the access, and the cache way number determine the banks to be accessed. This suggests there are 16*8 = 128 banks, one bank for each 4-byte chunk in each way. Bits 5:2 are obtained from the linear address of the load, the size of the load is obtained from the load uop, and the way number is obtained from the way predictor. Section 2.6.2 says that the L1D supports two 16-byte loads and one 16-byte store in the same cycle. This suggests that each bank has a single 16-byte read-write port. Each of the 128 bank ports is connected through an interconnect to each of the 3 ports of the data memory of the L1D. One of the 3 ports is connected to the store buffer and the other two to the load buffer, possibly with intermediary logic for efficiently handling cross-line loads (single load uop but two load requests whose results are merged), overlapping loads (to avoid bank conflicts), and loads that cross bank boundaries.
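Under those inferences (16 four-byte banks per way, selected by bits 5:2), bank selection can be sketched as follows. The scheme is a reading of the manual, not a verified hardware design:

```python
def banks_touched(addr, size, way):
    # Assumes the access stays within one 64-byte line (no line split).
    first = (addr >> 2) & 0xF                # bits 5:2: bank within the way
    last = ((addr + size - 1) >> 2) & 0xF    # a wide access spans several banks
    return {way * 16 + b for b in range(first, last + 1)}

# A 16-byte load at line offset 0 in way 3 touches banks 48..51:
assert banks_touched(0, 16, 3) == {48, 49, 50, 51}
```

Two simultaneous loads conflict only when their touched-bank sets intersect, which is why reading a single predicted way (rather than all 8) sharply reduces the chance of a conflict.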
The fact that way prediction requires accessing only a single way in the tag memory and the data memory of the L1D allows reducing or completely eliminating the need (depending on how snoops are handled) to make the tag and data memories truly multiported (which is the approach Intel has followed in Haswell), while still achieving about the same throughput. Bank conflicts can still occur, though, when there are simultaneous accesses to the same way and identical 5:2 address bits, but different utags. Way prediction does reduce bank conflicts because it doesn't require reading multiple entries (at least in the tag memory, but possibly also in the data memory) for each access, but it doesn't completely eliminate bank conflicts.
That said, the tag memory may require true multiporting to handle fill checks (see later), validation checks (see later), snooping, and "normal path" checks for non-load accesses. I think only load requests use the way predictor. Other types of requests are handled normally.
A highly accurate L1D hit/miss prediction can have other benefits too. If a load is predicted to miss in the L1D, the scheduler wakeup signal for dependent uops can be suppressed to avoid likely replays. In addition, the physical address, as soon as it's available, can be sent early to the L2 cache before fully resolving the prediction. I don't know if these optimizations are employed by AMD.
It is possible for the utag to be wrong in both directions: it can
predict hit when the access will miss, and it can predict miss when
the access could have hit. In either case, a fill request to the L2
cache is initiated and the utag is updated when L2 responds to the
fill request.
On an OS that supports multiple linear address spaces or allows synonyms in the same address space, cache lines can only be identified uniquely using physical addresses. As mentioned earlier, when looking up a utag in the utag memory, there can either be one hit or zero hits. Consider first the hit case. This linear address-based lookup results in a speculative hit and still needs to be verified. Even if paging is disabled, a utag is still not a unique substitute for a full address. As soon as the physical address is provided by the MMU, the prediction can be validated by comparing the physical tag from the predicted way with the tag from the physical address of the access. One of the following cases can occur:
The physical tags match and the speculative hit is deemed a true hit. Nothing needs to be done, except possibly triggering a prefetch or updating the replacement state of the line.
The physical tags don't match and the target line doesn't exist in any of the other entries of the same set. Note that the target line cannot possibly exist in other sets because all of the L1D memories use the same set indexing function. I'll discuss how this is handled later.
The physical tags don't match and the target line does exist in another entry of the same set (associated with a different utag). I'll discuss how this is handled later.
If no matching utag was found in the utag memory, there will be no physical tag to compare against because no way is predicted. One of the following cases can occur:
The target line actually doesn't exist in the L1D, so the speculative miss is a true miss. The line has to be fetched from somewhere else.
The target line actually exists in the same set but with a different utag. I'll discuss how this is handled later.
(I'm making two simplifications here. First, the load request is assumed to be to cacheable memory. Second, on a speculative or true hit in the L1D, there are no detected errors in the data. I'm trying to stay focused on Section 2.6.2.2.)
Accessing the L2 is needed only in cases 3 and 5 and not in cases 2 and 4. The only way to determine which is the case is by comparing the physical tag of the load with the physical tags of all present lines in the same set. This can be done either before or after accessing the L2. Either way, it has to be done to avoid the possibility of having multiple copies of the same line in the L1D. Doing the checks before accessing the L2 improves the latency in cases 3 and 5, but hurts it in cases 2 and 4. Doing the checks after accessing the L2 improves the latency in cases 2 and 4, but hurts it in cases 3 and 5. It's possible to both perform the checks and send a request to the L2 at the same time. But this may waste energy and L2 bandwidth in cases 3 and 5. It seems that AMD decided to do the checks after the line is fetched from the L2 (which is inclusive of the L1 caches).
When the line arrives from the L2, the L1D doesn't have to wait until the line is filled into it before responding with the requested data, so a higher fill latency is tolerable. The physical tags are now compared to determine which of the four cases has occurred. In case 4, the line is filled into the data memory, tag memory, and utag memory in the way chosen by the replacement policy. In case 2, the requested line replaces the existing line that happened to have the same utag, and the replacement policy is not engaged to choose a way. This happens even if there was a vacant entry in the same set, essentially reducing the effective capacity of the cache. In case 5, the utag can simply be overwritten. Case 3 is a little complicated because it involves an entry with a matching physical tag and a different entry with a matching utag. One of them will have to be invalidated and the other will have to be replaced. A vacant entry can also exist in this case and go unused.
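The five outcomes above can be condensed into a small decision sketch (hypothetical predicate names, purely to encode the case analysis, not real hardware):

```python
def classify(utag_hit, predicted_ptag_match, line_elsewhere_in_set):
    if utag_hit and predicted_ptag_match:
        return 1  # true hit: nothing to do (maybe update replacement state)
    if utag_hit and not line_elsewhere_in_set:
        return 2  # wrong way predicted, line absent: fill replaces that way
    if utag_hit:
        return 3  # utag hit one way, physical tag matches another: fix both
    if not line_elsewhere_in_set:
        return 4  # true miss: fill via the replacement policy
    return 5  # line present under a different utag: overwrite the utag
```

Only cases 3 and 5 have the line already in the L1D, which is why the full set of physical tags must eventually be checked to keep the cache free of duplicate copies.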
Linear aliasing occurs when two different linear addresses are mapped
to the same physical address. This can cause performance penalties for
loads and stores to the aliased cachelines. A load to an address that
is valid in the L1 DC but under a different linear alias will see an
L1 DC miss, which requires an L2 cache request to be made. The latency
will generally be no larger than that of an L2 cache hit. However, if
multiple aliased loads or stores are in-flight simultaneously, they
each may experience L1 DC misses as they update the utag with a
particular linear address and remove another linear address from being
able to access the cacheline.
This is how case 5 (and case 2 to a lesser extent) can occur. Linear aliasing can occur within the same linear address space and across different address spaces (context switching and hyperthreading effects come into play).
It is also possible for two different linear addresses that are NOT
aliased to the same physical address to conflict in the utag, if they
have the same linear hash. At a given L1 DC index (11:6), only one
cacheline with a given linear hash is accessible at any time; any
cachelines with matching linear hashes are marked invalid in the utag
and are not accessible.
This is how cases 2 and 3 can occur, and they're handled as discussed earlier. This part tells us that the L1D uses the simple set indexing function; the set number is bits 11:6.
I think huge pages make cases 2 and 3 more likely to occur because more than half of the bits used by the utag hash function become part of the page offset rather than page number. Physical memory shared between multiple OS processes makes case 5 more likely.

How does cache associativity impact performance [duplicate]

This question already has answers here:
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
(3 answers)
Closed 3 years ago.
I am reading "Pro .NET Benchmarking" by Andrey Akinshin and one thing puzzles me (p.536) -- explanation how cache associativity impacts performance. In a test author used 3 square arrays 1023x1023, 1024x1024, 1025x1025 of ints and observed that accessing first column was slower for 1024x1024 case.
Author explained (background info: the CPU is an Intel with a 32 KB, 8-way associative L1 cache):
When N=1024, this difference is exactly 4096 bytes; it equals the
critical stride value. This means that all elements from the first
column match the same eight cache lines of L1. We don’t really have
performance benefits from the cache because we can’t use it
efficiently: we have only 512 bytes (8 cache lines * 64-byte cache
line size) instead of the original 32 kilobytes. When we iterate the
first column in a loop, the corresponding elements pop each other from
the cache. When N=1023 and N=1025, we don’t have problems with the
critical stride anymore: all elements can be kept in the cache, which
is much more efficient.
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.
It strikes me as odd, after reading wiki page I would say the performance penalty comes from resolving address conflicts. Since each row can be potentially mapped into the same cache line, it is conflict after conflict, and CPU has to resolve those -- it takes time.
Thus my question: what is the real nature of the performance problem here? Is the accessible size of the cache lower, or is the entire cache available but the CPU spends more time resolving mapping conflicts? Or is there some other reason?
Caching is a layer between two other layers. In your case, between the CPU and RAM. At its best, the CPU rarely has to wait for something to be fetched from RAM. At its worst, the CPU usually has to wait.
The 1024 example hits a bad case. For that entire column, all words requested from RAM map to the same cache set (or to the same 2 slots, in a 2-way associative cache, etc).
Meanwhile, the CPU does not care -- it asks the cache for a word from memory; the cache either has it (fast access) or needs to reach into RAM (slow access) to get it. And RAM does not care -- it responds to requests, whenever they come.
Back to 1024. Look at the layout of that array in memory. The cells of a row are in consecutive words of RAM; when one row is finished, the next row starts. With a little bit of thought, you can see that consecutive cells in a column have addresses differing by 1024*B, where B is the size of a cell (4 or 8 bytes, say). That is a power of 2.
Now let's look at the relatively trivial architecture of a cache. (It is 'trivial' because it needs to be fast and easy to implement.) It simply takes several bits out of the address to form the address in the cache's "memory".
Because of the power of 2, those bits will always be the same -- hence the same slot is accessed. (I left out a few details, like how many bits are needed, hence the size of the cache, 2-way, etc, etc.)
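The effect is easy to reproduce arithmetically. Assuming a 32 KiB, 8-way cache (64 sets of 64-byte lines) and a row-major array of 4-byte elements, count how many distinct sets the first column of an NxN array touches:

```python
def distinct_sets(n, elem=4, line=64, sets=64):
    # Set index of the column-0 element in each row of an n x n
    # row-major array starting at address 0.
    return len({(r * n * elem // line) % sets for r in range(n)})

assert distinct_sets(1024) == 1   # stride 4096: every element, same set
assert distinct_sets(1023) == 64  # stride 4092: the column sweeps all sets
assert distinct_sets(1025) == 64  # stride 4100: likewise
```

With N=1024 the whole column competes for the 8 ways of one set, so walking it evicts constantly; with N=1023 or 1025 the column spreads over all 64 sets and fits easily.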
A cache is useful when the process above it (CPU) fetches an item (word) more than once before that item gets bumped out of cache by some other item needing the space.
Note: This is talking about the CPU->RAM cache, not disk controller caching, database caches, web site page caches, etc, etc; they use more sophisticated algorithms (often hashing) instead of "picking a few bits out of an address".
Back to your Question...
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.
There are conceptual problems with that quote.
Main memory is not "mapped to a cache"; see virtual versus real addresses.
The penalty comes when the cache does not have the desired word.
"shrinking the cache" -- The cache is a fixed size, based on the hardware involved.
Definition: In this context, a "word" is a consecutive string of bytes from RAM. It is always(?) a power-of-2 bytes and positioned at some multiple of that in the real address space. A "word" for caching depends on the vintage of the CPU, which level of cache, etc. 4-, 8-, 16-byte words probably can be found today. Again, the power-of-2 and positioned-at-multiple... are simple optimizations.
Back to your 1K*1K array of, say, 4-byte numbers. That adds up to 4MB, plus or minus (for 1023, 1025). If you have 8MB of cache, the entire array will eventually get loaded, and further actions on the array will be faster due to being in the cache. But if you have, say, 1MB of cache, some of the array will get in the cache, then be bumped out -- repeatedly. It might not be much better than if you had no cache.

Is Translation Lookaside Buffer (TLB) the same level as L1 cache to CPU? So, Can I overlap virtual address translation with the L1 cache access?

I am trying to understand the whole structure and concepts about caching. As we use TLB for fast mapping virtual addresses to physical addresses, in case if we use virtually-indexed, physically-tagged L1 cache, can one overlap the virtual address translation with the L1 cache access?
Yes, that's the whole point of a VIPT cache.
Since virtual and physical addresses match over the lower bits (the page offset is the same), you don't need to translate them. Most VIPT caches are built around this (note that this limits the number of sets you can use, but you can grow their associativity instead), so you can use the lower bits to do a lookup in that cache even before you've found the translation in the TLB.
This is critical because the TLB lookup itself takes time, and the L1 caches are usually designed to provide as much BW and low latency as possible to avoid stalling the often much-faster execution.
If you miss the TLB and suffer an even greater latency (either a 2nd-level TLB lookup or, god forbid, a page walk), it's less critical, since you can't really do anything with the cache lookup until you can compare the tag. But the few cycles saved in the TLB-hit + cache-hit case should be the common case in many applications, so that's usually considered worth optimizing and aligning the pipelines for.
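A toy model of why the overlap is legal, assuming 4 KiB pages: translation rewrites only the page-number bits, so any cache index drawn from bits 11:0 is identical before and after translation:

```python
PAGE_SHIFT = 12                    # 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1  # bits 11:0

def translate(vaddr, phys_page_number):
    # Hypothetical TLB result: swap the page number, keep the page offset.
    return (phys_page_number << PAGE_SHIFT) | (vaddr & PAGE_MASK)

v = 0xDEADBEEF
p = translate(v, 0x1234)
assert (v & PAGE_MASK) == (p & PAGE_MASK)  # cache index bits unchanged
```

So the set lookup can start from `v` immediately, and only the tag comparison has to wait for `p`.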

CPU cycle speed

Finding the latencies of L1/L2/L3 caches is easy:
Approximate cost to access various caches and main memory?
but I am interested in what the cost is (in CPU cycles) for translating a virtual address to physical page address when:
There is a hit in the L1 TLB
There is a miss in the L1 TLB but a hit in the L2 TLB
There is a miss in the L2 TLB and a hit in the page table
(I don't think there can be a miss in the page table, can there? If there can, the cost of this.)
I did find this:
Data TLB L1 size = 64 items. 4-WAY. Miss penalty = 7 cycles. Parallel miss: 1 cycle per access
TLB L2 size = 512 items. 4-WAY. Miss penalty = 10 cycles. Parallel miss: 21 cycles per access
Instruction TLB L1 size = 64 items per thread (128 per core). 4-WAY
PDE cache = 32 items?
http://www.7-cpu.com/cpu/SandyBridge.html
but it doesn't mention the cost of a hit/accessing the relevant TLB cache?
Typically the L1 TLB access time will be less than the cache access time to allow tag comparison in a set associative, physically tagged cache. A direct mapped cache can delay the tag check by assuming a hit. (For an in-order processor, a miss with immediate use of the data would need to wait for the miss to be handled, so there is no performance penalty. For an out-of-order processor, correcting for such wrong speculation can have noticeable performance impact. While an out-of-order processor is unlikely to use a direct mapped cache, it may use way prediction, which can behave similarly.) A virtually tagged cache can (in theory) delay TLB access even more, since the TLB is only needed to verify permissions, not to determine a cache hit, and the handling of permission violations is generally somewhat expensive and rare.
This means that L1 TLB access time will generally not be made public since it will not influence software performance tuning.
L2 TLB hit time would be equivalent to the L1 TLB miss penalty. This will vary depending on the specific implementation and may not be a single value. E.g., if the TLB uses banking to support multiple accesses in a single cycle, bank conflicts can delay accesses; or if rehashing is used to support multiple page sizes, a page of the alternate size will take longer to find (both of these cases can accumulate delay under high utilization).
The time required for an L2 TLB fill can vary greatly. ARM and x86 use hardware TLB fill using a multi-level page table. Depending on where page table data can be cached and whether there is a cache hit, the latency of a TLB fill can be between the latency of a main memory access for each level of the page table and the latency of the cache where the page table data is found for each level (plus some overhead).
Complicating this further, more recent Intel x86 processors have paging-structure caches which allow levels of the page table to be skipped. E.g., if a page directory entry (an entry in a second-level page table which points to a page of page table entries) is found in this cache, then rather than starting from the base of the page table and doing four dependent look-ups, only a single look-up is required.
(It might be worth noting that using a page whose size equals the virtual address region covered by a level of the page table (e.g., 2 MiB or 1 GiB for x86-64) reduces the depth of the page table hierarchy. Not only can using such large pages reduce TLB pressure, but it can also reduce the latency of a TLB miss.)
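To make the walk depth concrete, here is a small sketch that decomposes a 48-bit x86-64 virtual address into its four 9-bit page-table indices (PML4, PDPT, PD, PT) plus the 12-bit offset, and shows that a 2 MiB page (21-bit offset) stops the walk one level early:

```python
# Sketch: splitting a 48-bit x86-64 virtual address into page-walk indices.
# Each level of the 4-level page table consumes 9 bits; the remainder is
# the page offset. A 2 MiB page absorbs the PT level's 9 bits into the
# offset (9 + 12 = 21 bits), so one dependent lookup is saved.

def split_va(va, page_shift=12):
    offset = va & ((1 << page_shift) - 1)
    indices = []
    shift = page_shift
    while shift < 48:
        indices.append((va >> shift) & 0x1ff)  # 9 bits per level
        shift += 9
    return list(reversed(indices)), offset     # top level (PML4) first

va = 0x0000_7f12_3456_789a                     # arbitrary example address
idx_4k, off_4k = split_va(va)                  # 4 KiB page: 4 levels
idx_2m, off_2m = split_va(va, 21)              # 2 MiB page: 3 levels
print(len(idx_4k), len(idx_2m))                # → 4 3
```

Each element of the index list corresponds to one dependent memory access during a hardware page walk, absent paging-structure-cache hits.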
A page table miss is handled by the operating system. This might result in the page still being in memory (e.g., if the write to swap has not been completed) in which case the latency will be relatively small. (The actual latency will depend on how the operating system implements this and on cache hit behavior, though cache misses both for the code and the data are likely since paging is an uncommon event.) If the page is no longer in memory, the latency of reading from secondary storage (e.g., a disk drive) is added to the software latency of handling an invalid page table entry (i.e., a page table miss).

Line size of L1 and L2 caches

From a previous question on this forum, I learned that in most memory systems the L1 cache is a subset of the L2 cache, meaning any entry removed from L2 is also removed from L1.
So now my question is: how do I determine the corresponding entry in the L1 cache for an entry in the L2 cache? The only information stored in the L2 entry is the tag. Based on this tag, if I re-create the address, it may span multiple lines in the L1 cache if the line sizes of L1 and L2 are not the same.
Does the architecture really bother flushing both of those lines, or does it just maintain L1 and L2 with the same line size?
I understand that this is a policy decision, but I want to know the commonly used technique.
Cache-line size is (typically) 64 bytes.
Moreover, take a look at this very interesting article about processors caches:
Gallery of Processor Cache Effects
You will find the following chapters:
Memory accesses and performance
Impact of cache lines
L1 and L2 cache sizes
Instruction-level parallelism
Cache associativity
False cache line sharing
Hardware complexities
On Core i7 the line sizes in L1, L2 and L3 are the same: 64 bytes.
I guess this simplifies maintaining the inclusive property and coherence.
See page 10 of: https://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
The most common technique of handling cache block size in a strictly inclusive cache hierarchy is to use the same size cache blocks for all levels of cache for which the inclusion property is enforced. This results in greater tag overhead than if the higher level cache used larger blocks, which not only uses chip area but can also increase latency since higher level caches generally use phased access (where tags are checked before the data portion is accessed). However, it also simplifies the design somewhat and reduces the wasted capacity from unused portions of the data. It does not take a large fraction of unused 64-byte chunks in 128-byte cache blocks to compensate for the area penalty of an extra 32-bit tag. In addition, the larger cache block effect of exploiting broader spatial locality can be provided by relatively simple prefetching, which has the advantages that no capacity is left unused if the nearby chunk is not loaded (to conserve memory bandwidth or reduce latency on a conflicting memory read) and that the adjacency prefetching need not be limited to a larger aligned chunk.
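A toy model (hypothetical sizes, not any real design) illustrates why mismatched block sizes are awkward under strict inclusion: if L2 used 128-byte blocks over a 64-byte-line L1, a single L2 eviction would have to back-invalidate every L1 line the block covers.

```python
# Toy sketch: back-invalidation in a strictly inclusive hierarchy where
# a (hypothetical) 128-byte L2 block covers two 64-byte L1 lines.
# Evicting one L2 block must invalidate both covered L1 lines.

L1_LINE = 64
L2_BLOCK = 128

l1 = set()  # base addresses of currently valid L1 lines

def l1_fill(addr):
    l1.add(addr - addr % L1_LINE)

def l2_evict(addr):
    """Back-invalidate all L1 lines inside the evicted L2 block."""
    base = addr - addr % L2_BLOCK
    for line in range(base, base + L2_BLOCK, L1_LINE):
        l1.discard(line)

l1_fill(0x1000)   # line [0x1000, 0x1040)
l1_fill(0x1040)   # line [0x1040, 0x1080), same L2 block as above
l1_fill(0x1080)   # line in the next L2 block
l2_evict(0x1000)  # evicts L2 block [0x1000, 0x1080)

print(sorted(hex(a) for a in l1))  # → ['0x1080']
```

With equal block sizes the loop degenerates to a single invalidation, which is part of why equal sizes are the common choice.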
A less common technique divides the cache block into sectors. Having the sector size the same as the block size for lower level caches avoids the problem of excess back-invalidation since each sector in the higher level cache has its own valid bit. (Providing all the coherence state metadata for each sector rather than just validity can avoid excessive writeback bandwidth use when at least one sector in a block is not dirty/modified and some coherence overhead [e.g., if one sector is in shared state and another is in the exclusive state, a write to the sector in the exclusive state could involve no coherence traffic—if snoopy rather than directory coherence is used].)
The area savings from sectored cache blocks were especially significant when tags were on the processor chip but the data was off-chip. Obviously, if the data storage takes area comparable to the size of the processor chip (which is not unreasonable), then 32-bit tags with 64-byte blocks would take roughly a 16th (~6%) of the processor area while 128-byte blocks would take half as much. (IBM's POWER6+, introduced in 2009, is perhaps the most recent processor to use on-processor-chip tags and off-processor data. Storing data in higher-density embedded DRAM and tags in lower-density SRAM, as IBM did, exaggerates this effect.)
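The tag-overhead arithmetic above is easy to check: a 32-bit tag per 64-byte (512-bit) block is 32/512 = 6.25% overhead, roughly a sixteenth, and doubling the block halves it.

```python
# Quick check of the area arithmetic: tag bits as a fraction of data bits.

def tag_overhead(tag_bits, block_bytes):
    return tag_bits / (block_bytes * 8)

print(tag_overhead(32, 64))   # → 0.0625   (~1/16, as stated above)
print(tag_overhead(32, 128))  # → 0.03125  (half as much with 128 B blocks)
```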
It should be noted that Intel uses "cache line" to refer to the smaller unit and "cache sector" for the larger unit. (This is one reason why I used "cache block" in my explanation.) Using Intel's terminology it would be very unusual for cache lines to vary in size among levels of cache regardless of whether the levels were strictly inclusive, strictly exclusive, or used some other inclusion policy.
(Strict exclusion typically uses the higher level cache as a victim cache where evictions from the lower level cache are inserted into the higher level cache. Obviously, if the block sizes were different and sectoring was not used, then an eviction would require the rest of the larger block to be read from somewhere and invalidated if present in the lower level cache. [Theoretically, strict exclusion could be used with inflexible cache bypassing where an L1 eviction would bypass L2 and go to L3 and L1/L2 cache misses would only be allocated to either L1 or L2, bypassing L1 for certain accesses. The closest to this being implemented that I am aware of is Itanium's bypassing of L1 for floating-point accesses; however, if I recall correctly, the L2 was inclusive of L1.])
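The victim-cache arrangement described in the parenthetical can be sketched with a toy model (same-size blocks, tiny capacities, plain LRU — all hypothetical simplifications): a block lives in L1 or L2, never both, with L1 evictions inserted into L2 and L2 hits promoted back to L1.

```python
# Toy sketch: strict exclusion with L2 acting as a victim cache for L1.
from collections import OrderedDict

class VictimHierarchy:
    def __init__(self, l1_lines=2, l2_lines=4):
        self.l1 = OrderedDict()  # insertion order doubles as LRU order
        self.l2 = OrderedDict()
        self.l1_lines, self.l2_lines = l1_lines, l2_lines

    def access(self, line):
        if line in self.l1:
            self.l1.move_to_end(line)  # L1 hit: refresh LRU position
        elif line in self.l2:          # L2 hit: move block back up to L1
            del self.l2[line]
            self._fill_l1(line)
        else:                          # miss everywhere: fill L1 from memory
            self._fill_l1(line)

    def _fill_l1(self, line):
        if len(self.l1) == self.l1_lines:
            victim, _ = self.l1.popitem(last=False)  # evict LRU L1 line
            self.l2[victim] = True                   # ...into L2 (victim cache)
            if len(self.l2) > self.l2_lines:
                self.l2.popitem(last=False)          # L2 eviction: drop it
        self.l1[line] = True

c = VictimHierarchy()
for line in (0, 1, 2):   # third fill evicts line 0 from L1 into L2
    c.access(line)
assert 0 in c.l2 and 0 not in c.l1   # exclusion: in exactly one level
assert set(c.l1) == {1, 2}
```

A real design would also have to handle dirty state and coherence on the promotion/demotion paths, which this sketch omits.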
Typically, one access to main memory fetches 64 bytes of data plus 8 bytes of parity/ECC (I don't remember exactly which). Maintaining different cache-line sizes at the various memory levels is rather complicated. Note that cache-line size is more correlated with the word-alignment size of the architecture than with anything else; based on that, a cache-line size is highly unlikely to differ from the memory access size. The parity bits are for the use of the memory controller, so the cache-line size is typically 64 bytes. The processor really controls very little beyond the registers; everything else going on in the computer is more about getting the hardware to optimize CPU performance. In that sense too, it really would not make sense to import extra complexity by making cache-line sizes different at different levels of the memory hierarchy.

Resources