Assuming that we have intentionally thrashed the DTLB, and would like to proceed to flush a specific cache line from L1-3 using clflush on a memory region which is (most likely) disjoint from the addresses pointed to by the TLB entries; would this in fact bring the page base address of the cache line we are flushing back into the TLB?
In short, does a clflush touch the TLB at all? I'm assuming that due to this instruction honouring coherency, it will subsequently write that line back to memory (obviously needing a TLB look-up.)
From Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A: Instruction Set Reference, A-L: "Invalidates the cache line that contains the linear address specified with the source operand from all levels of the processor cache hierarchy (data and instruction) ."
Since it uses the linear (virtual) address, the address needs to be translated, which means that a page table walk would be needed on a TLB miss. (This would generally be the case even for a different kind of instruction that pushed cache entries out to higher levels of cache since L1 caches are typically physically tagged for x86. In general, tagging with the virtual address has fallen out of favor. Using the physical address for tags means that the physical address is needed to check the cache for a hit, so even if it was not sent to memory, translation would be needed.)
While it would be possible to avoid loading the TLB for such accesses, the extra complexity of such special-case handling would almost certainly not be viewed as worth the bother given that CLFLUSH is not commonly used.
Related
Suppose two address spaces share a largish lump of non-contiguous memory.
The system might want to share physical page table(s) between them.
These tables wouldn't use Global bits (even if supported), and would tie them to asids if supported.
There are immediate benefits since the data cache will be less polluted than by a copy, less pinned ram, etc.
Does the page walk take explicit advantage of this in any known architecture?
If so, does that imply the mmu is explicitly caching & sharing interior page tree nodes based on physical tag?
Sorry for the multiple questions; it really is one broken down. I am trying to determine if it is worth devising a measurement test for this.
On modern x86 CPUs (like Sandybridge-family), page walks fetch through the cache hierarchy (L1d / L2 / L3), so yes there's an obvious benefit there for having to different page directories point to the same subtree for a shared region of virtual address space. Or for some AMD, fetch through L2, skipping L1d.
What happens after a L2 TLB miss? has more details about the fact that page-walk definitely fetches through cache, e.g. Broadwell perf counters exist to measure hits.
("The MMU" is part of a CPU core; the L1dTLB is tightly coupled to load/store execution units. The page walker is a fairly separate thing, though, and runs in parallel with instruction execution, but is still part of the core and can be triggered speculatively, etc. So it's tightly coupled enough to access memory through L1d cache.)
Higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. Section 3 of this paper confirms that Intel and AMD actually do this in practice, so you need to flush the TLB in cases where you might think you didn't need to.
However, I don't think you'll find that PDE caching happening across a change in the top-level page-tables.
On x86, you install a new page table with a mov to CR3; that implicitly flushes all cached translations and internal page-walker PDE caching, like invlpg does for one virtual address. (Or with ASIDs, makes TLB entries from different ASIDs unavailable for hits).
The main issue is that TLB the and page-walker internal caches are not coherent with main memory / data caches. I think all ISAs that do HW page walks at all require manual flushing of TLBs, with semantics like x86 for installing a new page table. (Some ISAs like MIPS only do software TLB management, invoking a special kernel TLB-miss handler; your question won't apply there.)
So yes, they could detect same physical address, but for sanity you also have to avoid using stale cached data from after a store to that physical address.
Without hardware-managed coherence between page-table stores and TLB/pagewalk, there's no way this cache could happen safely.
That said; some x86 CPUs do go beyond what's on paper and do limited coherency with stores, but only protecting you from speculative page walks for backwards compat with OSes that assumed a valid but not-yet-used PTE could be modified without invlpg. http://blog.stuffedcow.net/2015/08/pagewalk-coherence/
So it's not unheard of for microarchitectures to snoop stores to detect stores to certain ranges; you could plausibly have stores snoop the address ranges near locations the page-walker had internally cached, effectively providing coherence for internal page-walker caches.
Modern x86 does in practice detect self-modifying code by snoop for stores near any in-flight instructions. Observing stale instruction fetching on x86 with self-modifying code In that case snoop hits are handled by nuking the whole back-end state back to retirement state.
So it's plausible that you could in theory design a CPU with an efficient mechanism to be able to take advantage of this transparently, but it has significant cost (snooping every store against a CAM to check for matches on page-walker-cached addresses) for very low benefit. Unless I'm missing something, I don't think there's an easier way to do this, so I'd bet money that no real designs actually do this.
Hard to imagine outside of x86; almost everything else takes a "weaker" / "fewer guarantees" approach and would only snoop the store buffer (for store-forwarding). CAMs (content-addressable-memory = hardware hash table) are power-hungry, and handling the special case of a hit would complicate the pipeline. Especially an OoO exec pipeline where the store to a PTE might not have its store-address ready until after a load wanted to use that TLB entry. Introducing more pipeline nukes is a bad thing.
The benefit of this would be tiny
After the first page-walk fetches data from L1d cache (or farther away if it wasn't hot in L1d either), then the usual cache-within-page-walker mechanisms can act normally.
So further page walks for nearby pages before the next context switch can benefit from page-walker internal caches. This has benefits, and is what some real HW does (at least some x86; IDK about others).
All the argument above about why this would require snooping for coherent page tables is about having the page-walker internal caches stay hot across a context switch.
L1d can easily do that; VIPT caches that behave like PIPT (no aliasing) simply cache based on physical address and don't need flushing on context switch.
If you're context-switching very frequently, the ASIDs let TLB entries proper stay cached. If you're still getting a lot of TLB misses, the worst case is that they have to fetch through cache all the way from the top. This is really not bad and very much not worth spending a lot of transistors and power budget on.
I'm only considering OS on bare metal, not HW virtualization with nested page tables. (Hypervisor virtualizing the guest OS's page tables). I think all the same arguments basically apply, though. Page walk still definitely fetches through cache.
I just want to clarify the concept and could find detail enough answers which can throw some light upon how everything actually works out in the hardware. Please provide any relevant details.
In case of VIPT caches, the memory request is sent in parallel to both the TLB and the Cache.
From the TLB we get the traslated physical address.
From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set).
Then the translated TLB address is matched with the list of tags to find a candidate.
My question is where is this check performed ?
In Cache ?
If not in Cache, where else ?
If the check is performed in Cache, then
is there a side-band connection from TLB to the Cache module to get the
translated physical address needed for comparison with the tag addresses?
Can somebody please throw some light on "actually" how this is generally implemented and the connection between Cache module & the TLB(MMU) module ?
I know this dependents on the specific architecture and implementation.
But, what is the implementation which you know when there is VIPT cache ?
Thanks.
At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They're very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e. taking advantage of the index bits all being below the page offset and thus being translated "for free". Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?)
The L1dTLB itself is a small/fast Content addressable memory with (for example) 64 entries and 4-way set associative (Intel Skylake). Hugepages are often handled with a second (and 3rd) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages: 4-entry fully (4-way) associative.
But for now, simplify your mental model and forget about hugepages.
The L1dTLB is a single CAM, and checking it is a single lookup operation.
"The cache" consists of at least these parts:
the SRAM array that stores the tags + data in sets
control logic to fetch a set of data+tags based on the index bits. (High-performance L1d caches typically fetch data for all ways of the set in parallel with tags, to reduce hit latency vs. waiting until the right tag is selected like you would with larger more highly associative caches.)
comparators to check the tags against a translated address, and select the right data if one of them matches, or trigger miss-handling. (And on hit, update the LRU bits to mark this way as Most Recently Used). For a diagram of the basics for a 2-way associative cache without a TLB, see https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec16.pdf#page=17. The = inside a circle is the comparator: producing a boolean true output if the tag-width inputs are equal.
The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:
AGU generates an address from register(s) + offset.
(Fun fact: Sandybridge-family optimistically shortcuts this process for simple addressing mode: [reg + 0-2047] has 1c lower load-use latency than other addressing modes, if the reg value is in the same 4k page as reg+disp. Is there a penalty when base+offset is in a different page than the base?)
The index bits come from the offset-within-page part of the address, so they don't need translating from virtual to physical. Or translation is a no-op. This VIPT speed with the non-aliasing of a PIPT cache works as long as L1_size / associativity <= page_size. e.g. 32kiB / 8-way = 4k pages.
The index bits select a set. Tags+data are fetched in parallel for all ways of that set. (This costs power to save latency, and is probably only worth it for L1. Higher-associativity (more ways per set) L3 caches definitely not)
The high bits of the address are looked up in the L1dTLB CAM array.
The tag comparator receives the translated physical-address tag and the fetched tags from that set.
If there's a tag match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line low bits of the address, and the operand-size).
Or instead of fetching the full 64-byte line, it could have used the offset bits earlier to fetch just one (aligned) word from each way. CPUs without efficient unaligned loads are certainly designed this way. I don't know if this is worth doing to save power for simple aligned loads on a CPU which supports unaligned loads.
But modern Intel CPUs (P6 and later) have no penalty for unaligned load uops, even for 32-byte vectors, as long as they don't cross a cache-line boundary. Byte-granularity indexing for 8 ways in parallel probably costs more than just fetching the whole 8 x 64 bytes and setting up the muxing of the output while the fetch+TLB is happening, based on offset-within-line, operand-size, and special attributes like zero- or sign-extension, or broadcast-load. So once the tag-compare is done, the 64 bytes of data from the selected way might just go into an already-configured mux network that grabs the right bytes and broadcasts or sign-extends.
AVX512 CPUs can even do 64-byte full-line loads.
If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure if / how CPUs manage to pipeline this so other loads can keep executing while the TLB-miss is resolved. That process involves checking the L2TLB (Skylake: unified 1536 entry 12-way for 4k and 2M, 16-entry for 1G), and if that fails then with a page-walk.
I assume that a TLB miss results in the tag+data fetch being thrown away. They'll be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.
At the simplest, it could just re-run the whole operation (including fetching the translation from L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly instead of putting it into L1dTLB and getting it back out again.
Obviously that requires that the dTLB and L1D are really designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses to avoid a catch 22 / chicken-egg problem).
is there a side-band connection from TLB to the Cache?
I wouldn't call it a side-band connection. The L1D cache is the only thing that uses the L1dTLB. Similarly, L1iTLB is used only by the L1I cache.
If there's a 2nd-level TLB, it's usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.
Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 check, so physical addresses can be sent to other caches.
Is there a hint I can put in my code indicating that a line should be removed from cache? As opposed to a prefetch hint, which would indicate I will soon need a line. In my case, I know when I won't need a line for a while, so I want to be able to get rid of it to free up space for lines I do need.
clflush, clflushopt
Invalidates from every level of the cache hierarchy in the cache coherence domain the cache line that contains the
linear address specified with the memory operand. If that cache line contains modified data at any level of the
cache hierarchy, that data is written back to memory.
They are not available on every CPU (in particular, clflushopt is only available on the 6th generation and later). To be certain, you should use CPUID to verify their availability:
The availability of CLFLUSH is indicated by the presence of the CPUID feature flag CLFSH
(CPUID.01H:EDX[bit 19]).
The availability of CLFLUSHOPT is indicated by the presence of the CPUID feature flag CLFLUSHOPT
(CPUID.(EAX=7,ECX=0):EBX[bit 23]).
If available, you should use clflushopt. It outperforms clflush when flushing buffers larger than 4KiB (64 lines).
This is the benchmark from Intel's Optimization Manual:
For informational purpose (assuming you are running in a privileged context), you can also use invd (as a nuke-from-orbit option). This:
Invalidates (flushes) the processor’s internal caches and issues a special-function bus cycle that directs external
caches to also flush themselves. Data held in internal caches is not written back to main memory.
or wbinvd, which:
Writes back all modified cache lines in the processor’s internal cache to main memory and invalidates (flushes) the
internal caches. The instruction then issues a special-function bus cycle that directs external caches to also write
back modified data and another bus cycle to indicate that the external caches should be invalidated.
A future instruction that could make it into the ISA is club. Although this won't fit your need (because it doesn't necessarily invalidate the line), it's worth mentioning for completeness. This would:
Writes back to memory the cache line (if dirty) that contains the linear address specified with the memory
operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the
cache hierarchy in non-modified state. Retaining the line in the cache hierarchy is a performance optimization
(treated as a hint by hardware) to reduce the possibility of cache miss on a subsequent access. Hardware may
choose to retain the line at any of the levels in the cache hierarchy, and in some cases, may invalidate the line
from the cache hierarchy.
I do understand how the a virtual address is translated to a physical address to access the main memory. I also understand how the cache memory works as well.
But my problem is in putting the 2 concepts together and understanding the big picture of how a process accesses memory and what will happen if we have a cache miss. so i have this drawing that will help me asks the following questions:
click to see the image ( assume one-level cache)
1- Does the process access the cache with the exact same physical address that represent the location of byte in the main memory ?
2- Is the TLB actually in the first level of Cache or is it a separate memory inside the CPU chip that is dedicated for the translation purpose ?
3- When there is a cache miss, i need to get a whole block and allocated in the cache, but the main memory organized in frames(pages) not blocks. So does a process page is divided itself to cache blocks that can be brought to cache in case of a miss ?
4- Lets assume there is a TLB miss, does that mean that I need to go all the way to the main memory and do the page walk there , or does the page walk happen in the cache ?
5- Does a TLB miss guarantee that there will be a cache miss ?
6- If you have any reading material that explain the big picture that i am trying to understand i would really appreciate sharing it with me.
Thanks and feel free to answer any single question i have asked
Yes. The cache is not memory that can be addressed separately. Cache mapping will translate a physical address into an address for the cache but this mapping is not something a process usually controls. For some CPU architecture it is completely controlled by the hardware (e.g. Intel x86). For others the operating system would be expected to program the mapping.
The TLB in the diagram you gave is for virtual to physical address mapping. It is probably not for the cache. Again on some architecture the TLBs are programmed whereas on others it is controlled by the hardware.
Page size and cache line size do not have to be the same as one relates to virtual memory and the other to physical memory. When a process access a virtual address that address will be translated to a physical address using the TLB considering page size. Once that's done the size of a page is of no concern. The access is for a byte/word at a physical address. If this causes a cache miss occurs then the cache block that will be read will be of the size of a cache block that covers the physical memory address that's being accessed.
TLB miss will require a page translation by reading other memory. This process can occur in hardware on some CPU (such as Intel x86/x64) or need to be handled in software. Once the page translation has been completed the TLB will be reloaded with the page translation.
TLB miss does not imply cache miss. TLB miss just means the virtual to physical address mapping was not known and required a page address translation to occur. A cache miss means the physical memory content could not be provided quickly.
To recap:
the TLB is to convert virtual addresses to physical address quickly. It exist to cache the virtual to physical memory mapping quickly. It does not have anything to do with physical memory content.
the cache is to allow faster access to memory. It is only there to provide the content of physical memory faster.
Keep in mind that the term cache can be used for lots of purposes (e.g. note the usage of cache when describing the TLB). TLB is a bit more specific and usually implies a virtual memory translation though that's not universal. For example some DMA controllers have a TLB too but that TLB is not necessarily used to translate virtual to physical addresses but rather to convert block addresses to physical addresses.
I am trying to understand the whole structure and concepts about caching. As we use TLB for fast mapping virtual addresses to physical addresses, in case if we use virtually-indexed, physically-tagged L1 cache, can one overlap the virtual address translation with the L1 cache access?
Yes, that's the whole point of a VIPT cache.
Since the virtual addresses and physical one match over the lower bits (the page offset is the same), you don't need to translate them. Most VIPT caches are built around this (note that this limits the number of sets you can use, but you can grow their associativity instead), so you can use the lower bits to do a lookup in that cache even before you found the translation in the TLB.
This is critical because the TLB lookup itself takes time, and the L1 caches are usually designed to provide as much BW and low latency as possible to avoid stalling the often much-faster execution.
If you miss the TLB and suffer an even greater latency (either some level2 TLB or, god forbid, a page walk), it's less critical since you can't really do anything with the cache lookup until you compare the tag, but the few cycles you did save in the TLB hit + cache hit case should be the common case on many applications, so that's usually considered worthy to optimize and align the pipelines for.