About Adaptive Mode for L1 Cache in Hyper-threading - performance

I'm a student who has been doing some research on Hyper-Threading recently, and I'm a little confused about one feature: L1 Data Cache Context Mode.
The architecture optimization manual describes that the L1 cache can operate in two modes:
The first level cache can operate in two modes depending on a context-ID bit:
Shared mode: The L1 data cache is fully shared by two logical processors.
Adaptive mode: In adaptive mode, memory accesses using the page directory is mapped identically across logical processors sharing the L1 data cache.
However, I am curious how the cache gets partitioned in adaptive mode, given this description.

On Intel architectures, an L1 Context ID value of 1 indicates that the L1 data cache mode can be set to either adaptive mode or shared mode, while a value of 0 indicates this feature is not supported. See the definition of IA32_MISC_ENABLE MSR bit 24 (L1 Data Cache Context Mode) for details.
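As a quick way to check this on your own machine (a minimal sketch, assuming GCC or Clang on x86; the MSR itself can only be read in ring 0, e.g. with the Linux msr module and the rdmsr tool), you can test the CNXT-ID feature flag, which Intel documents as CPUID.01H:ECX bit 10:

/* Check CPUID.01H:ECX bit 10 (CNXT-ID): reports whether the L1 data cache
   context mode is selectable via IA32_MISC_ENABLE[24]. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;                   /* CPUID leaf 1 not available */
    int cnxt_id = (ecx >> 10) & 1;  /* bit 10: L1 Context ID (CNXT-ID) */
    printf("L1 data cache context mode %ssupported\n", cnxt_id ? "" : "not ");
    return 0;
}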
According to Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A (Chapter 11/Cache Control), which I quote below:
Shared Mode
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the logical processors use identical CR3 registers and paging modes. In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst microarchitecture that support Intel Hyper-Threading Technology.
Adaptive Mode
Adaptive mode facilitates L1 data cache sharing between logical processors. When running in adaptive mode, the L1 data cache is shared across logical processors in the same core if:
• CR3 control registers for logical processors sharing the cache are identical.
• The same paging mode is used by logical processors sharing the cache.
In this situation, the entire L1 data cache is available to each logical processor (instead of being competitively shared).
If CR3 values are different for the logical processors sharing an L1 data cache or the logical processors use different paging modes, processors compete for cache resources. This reduces the effective size of the cache for each logical processor.
Aliasing of the cache is not allowed (which prevents data thrashing).
My guess is that there is no definite approach for partitioning the L1 data cache.

The document just states that if you use adaptive mode and CR3 or the paging mode differs between the logical processors, the cache is not shared and they "compete" for the cache. It doesn't say how the partitioning works.
The most straightforward way to implement this would be to statically reserve half of the ways of the data cache for each logical processor, essentially assigning half the data cache to each one.
Alternatively, they could add an extra bit to the virtual tag of each cache line to distinguish which processor the line belongs to. This would allow a dynamic partition of the cache, which fits the description of "competing" for the cache better than a static partition does.
If you really need to know, you could design some micro-benchmarks to verify which of these schemes is actually used.
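As a rough illustration of such a micro-benchmark (only a sketch, with assumed details: Linux, logical CPUs 0 and 1 being siblings of one physical core, a ~16 KiB NetBurst-style L1D, and a simple strided walk rather than a proper pointer-chasing measurement), you could time per-access latency with a per-thread working set just above half the L1D size. Comparing two threads of one process (identical CR3) against two separate processes (different CR3 values), and adaptive against shared mode, should reveal whether a static half-partition kicks in:

/* build: gcc -O2 -pthread bench.c
   Assumptions: logical CPUs 0 and 1 share one physical core; WORKING_SET is
   chosen to overflow half of a 16 KiB L1D but fit in the full cache. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define WORKING_SET (10 * 1024)        /* bytes touched by each thread */
#define ITERS       10000000UL

static char buf[2][WORKING_SET];

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;
    pin_to_cpu(id);                    /* CPUs 0 and 1 assumed to be HT siblings */
    volatile char *p = buf[id];
    struct timespec t0, t1;
    unsigned long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERS; i++)
        sink += p[(i * 64) % WORKING_SET];   /* one access per cache line */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("thread %d: %.2f ns/access (sink=%lu)\n", id, ns / ITERS, sink);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}

A real experiment would also need warm-up passes, a randomized pointer chase to defeat the hardware prefetchers, and repeated runs before drawing any conclusion.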

The L1 data cache is not partitioned in either mode and is always competitively shared.
Note that there is an obvious error in the manual: the mode isn't determined by a context-ID bit, but by IA32_MISC_ENABLE[24]. This enhancement is supported on later steppings of Northwood with HT and all Prescott processors with HT. The default value is zero, which selects adaptive mode. However, on certain processors, an updated BIOS may switch to shared mode by setting IA32_MISC_ENABLE[24], due to a bug in those processors that occurs only in adaptive mode.
In earlier steppings of Northwood with HT, only shared mode is supported. In shared mode, when a load request is issued to the L1 data cache, the request is first processed on the "fast path," which involves making a way prediction based on bits 11-15 of the linear address and producing a speculative hit/miss signal as a result. In processors with HT, the logical core ID is also compared; both the partial tag and the logical core ID have to match to get a speculative hit. In general, this helps improve the correct speculative hit rate.
If the two sibling logical cores operate in the same paging mode and have identical CR3 values, which indicates that accesses from both cores use the same page tables (if paging is enabled), it would be better to produce a speculative hit on the fast path of the cache even if the logical core ID doesn't match.
In adaptive mode, a context ID value is recalculated whenever the paging mode or the CR3 register of one of the cores changes. If the paging modes and the CR3 values match, the context ID bit is set to one of the two possible values; otherwise, it's set to the other value. When a load request is issued to the cache, the context ID is checked. If it indicates that the cores have the same address translation structures, the logical core ID comparison result is ignored and a speculative hit is produced if the partial virtual tag matches. Otherwise, the logical core ID comparison takes effect as in shared mode.
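Purely as an illustration of the logic described above (a toy model of the decision, not how the hardware is actually implemented), the fast-path check could be sketched like this:

/* Toy model: a speculative hit needs a partial-virtual-tag match, plus a
   logical-core-ID match unless the context-ID bit says both cores use the
   same translation (adaptive mode with matching CR3 and paging mode). */
#include <stdbool.h>

struct line_state {
    unsigned partial_tag;   /* bits 11-15 of the linear address */
    unsigned core_id;       /* logical core that brought the line in */
};

static bool speculative_hit(struct line_state line,
                            unsigned access_partial_tag,
                            unsigned access_core_id,
                            bool adaptive_mode,
                            bool same_translation)  /* context-ID comparison */
{
    if (line.partial_tag != access_partial_tag)
        return false;
    if (adaptive_mode && same_translation)
        return true;                        /* ignore the core-ID comparison */
    return line.core_id == access_core_id;  /* shared-mode behaviour */
}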

Related

How does TLB differentiate between entries of different Page tables?

Since different processes have their own page tables, how does the TLB differentiate between two page tables?
Or is the TLB flushed every time a different process gets CPU?
Yes, setting a new top-level page table physical address (such as x86 mov cr3, rax) invalidates all existing TLB entries (see footnote 1). On other ISAs, software might need to use additional instructions to ensure safety. (I'm guessing about that; I only know how x86 does it.)
Some ISAs do purely software management of TLBs, in which case it would definitely be up to software to flush all or at least the non-global TLB entries on context switch.
A more recent CPU feature allows us to avoid full invalidations in some cases. A context ID gives some extra tag bits with each TLB entry, so the CPU can keep track of which page-table they came from and only hit on entries that match the current context. This way, frequent switches between a small set of page tables can keep some entries valid.
On x86, the relevant feature is PCID (Process Context ID): when the OS sets a new top-level page-table address, it's associated with a context ID number (12 bits on current x86 CPUs). It's passed in the low bits of the page-table address. Page tables have to be page-aligned, so those bits are otherwise unused; this feature repurposes them as a separate bitfield, with the CR3 bits above the page offset used normally as the physical page number.
And the OS can tell the CPU whether or not to flush the TLB entries for that PCID when it loads a new page table, for either switching back to a previous context or recycling a context ID for a different task. (By setting the high bit, bit 63, of the new CR3 value; see the mov-to-control-register manual entry.)
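For illustration, here is a sketch of how such a CR3 value is laid out on x86 when CR4.PCIDE is enabled (the macro and function names below are made up, not from any real kernel; the bit positions, PCID in bits 11:0 and the no-flush request in bit 63, follow the SDM description):

#include <stdint.h>

#define CR3_NOFLUSH   (1ULL << 63)   /* keep existing TLB entries for this PCID */
#define CR3_PCID_MASK 0xFFFULL      /* 12-bit process-context identifier */

static inline uint64_t make_cr3(uint64_t pml4_phys, uint16_t pcid, int noflush)
{
    /* page-table base is page-aligned, so its low 12 bits are free for the PCID */
    uint64_t cr3 = (pml4_phys & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);
    if (noflush)
        cr3 |= CR3_NOFLUSH;
    return cr3;   /* ring 0 would load it with: asm volatile("mov %0, %%cr3" :: "r"(cr3)); */
}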
x86 PCID was new in 2nd-gen Nehalem, i.e. Westmere: https://www.realworldtech.com/westmere/ has a brief description of it from a CPU-architecture PoV.
Similar support I think extends to HW virtualization / nested page tables, to reduce the cost of hypervisor switches between guests.
I expect other ISAs that have any kind of page-table context mechanism work broadly similarly, with it being a small integer that the OS sets along with / as part of a new top-level page-table address.
Footnote 1: Except for "global" ones where the PTE indicates that this page will be mapped the same in all page tables. This lets OSes optimize by marking kernel pages that way, so those TLB entries can stay hot when the kernel context-switches user-space tasks. Both page tables should actually have valid entries for that page that do map to the same phys address, of course. On x86 at least, there is a bit in the PTE format that lets the CPU know it can assume the TLB entry is still valid across different page directories.

Hyper-Threading data cache context aliasing

In Intel's manual, the following section confuses me:
11.5.6.2 Shared Mode
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the logical processors use identical CR3 registers and paging modes.
In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst microarchitecture that support Intel Hyper-Threading Technology.
Since Intel uses VIPT (which behaves like PIPT) to access the cache, how could cache aliasing happen?
Based on Intel® 64 and IA-32 Architectures Optimization Reference Manual, November 2009 (248966-020), Section 2.6.1.3:
Most resources in a physical processor are fully shared to improve the dynamic utilization of the resource, including caches and all the execution units. Some shared resources which are linearly addressed, like the DTLB, include a logical processor ID bit to distinguish whether the entry belongs to one logical processor or the other.
The first level cache can operate in two modes depending on a context-ID bit:
Shared mode: The L1 data cache is fully shared by two logical processors.
Adaptive mode: In adaptive mode, memory accesses using the page directory is mapped identically across logical processors sharing the L1 data cache.
Aliasing is possible because the processor ID/context-ID bit (which is just a bit indicating which virtual processor the memory access came from) would be different for different threads and shared mode uses that bit. Adaptive mode simply addresses the cache as one would normally expect, only using the memory address.
Specifically how the processor ID is used when indexing the cache in shared mode appears not to be documented. (XORing with several address bits would provide dispersal of indexes such that adjacent indexes for one hardware thread would map to more separated indexes for the other thread. Selecting a different bit order for different threads is less likely since such would tend to increase delay. Dispersal reduces conflict frequency given spatial locality above cache line granularity but less than way-size granularity.)
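To make the speculated XOR-dispersal idea concrete (emphasis: this scheme is undocumented guesswork, and the bit widths and constant below are invented), the idea is simply to fold the logical-processor ID into a few index bits:

#include <stdint.h>

#define LINE_BITS  6     /* 64-byte cache lines */
#define INDEX_BITS 7     /* e.g. 128 sets */

static unsigned cache_index(uint64_t linear_addr, unsigned logical_cpu)
{
    unsigned index = (linear_addr >> LINE_BITS) & ((1u << INDEX_BITS) - 1);
    /* dispersal: flip several index bits for the second logical processor,
       so adjacent indexes for one thread map to separated indexes for the other */
    return index ^ (logical_cpu ? 0x55u : 0x00u);
}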

VIPT Cache: Connection between TLB & Cache?

I just want to clarify the concept, and I couldn't find detailed enough answers that throw some light on how everything actually works out in the hardware. Please provide any relevant details.
In case of VIPT caches, the memory request is sent in parallel to both the TLB and the Cache.
From the TLB we get the translated physical address.
From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set).
Then the tag portion of the translated physical address is matched against the list of tags to find a candidate.
My question is where is this check performed ?
In Cache ?
If not in Cache, where else ?
If the check is performed in Cache, then
is there a side-band connection from the TLB to the Cache module to get the translated physical address needed for comparison with the tag addresses?
Can somebody please throw some light on "actually" how this is generally implemented and the connection between Cache module & the TLB(MMU) module ?
I know this depends on the specific architecture and implementation.
But what is the implementation you know of when there is a VIPT cache?
Thanks.
At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They're very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e. taking advantage of the index bits all being below the page offset and thus being translated "for free". Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?)
The L1dTLB itself is a small/fast Content addressable memory with (for example) 64 entries and 4-way set associative (Intel Skylake). Hugepages are often handled with a second (and 3rd) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages: 4-entry fully (4-way) associative.
But for now, simplify your mental model and forget about hugepages.
The L1dTLB is a single CAM, and checking it is a single lookup operation.
"The cache" consists of at least these parts:
the SRAM array that stores the tags + data in sets
control logic to fetch a set of data+tags based on the index bits. (High-performance L1d caches typically fetch data for all ways of the set in parallel with tags, to reduce hit latency vs. waiting until the right tag is selected like you would with larger more highly associative caches.)
comparators to check the tags against a translated address, and select the right data if one of them matches, or trigger miss-handling. (And on hit, update the LRU bits to mark this way as Most Recently Used). For a diagram of the basics for a 2-way associative cache without a TLB, see https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec16.pdf#page=17. The = inside a circle is the comparator: producing a boolean true output if the tag-width inputs are equal.
The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:
AGU generates an address from register(s) + offset.
(Fun fact: Sandybridge-family optimistically shortcuts this process for simple addressing mode: [reg + 0-2047] has 1c lower load-use latency than other addressing modes, if the reg value is in the same 4k page as reg+disp. Is there a penalty when base+offset is in a different page than the base?)
The index bits come from the offset-within-page part of the address, so they don't need translating from virtual to physical; or rather, the translation is a no-op. This gives VIPT speed with the non-aliasing behaviour of a PIPT cache, as long as L1_size / associativity <= page_size, e.g. 32 KiB / 8-way = 4 KiB pages.
The index bits select a set. Tags + data are fetched in parallel for all ways of that set. (This costs power to save latency, and is probably only worth it for L1; higher-associativity (more ways per set) outer caches like L3 definitely don't do this.)
The high bits of the address are looked up in the L1dTLB CAM array.
The tag comparator receives the translated physical-address tag and the fetched tags from that set.
If there's a tag match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line low bits of the address, and the operand-size).
Or instead of fetching the full 64-byte line, it could have used the offset bits earlier to fetch just one (aligned) word from each way. CPUs without efficient unaligned loads are certainly designed this way. I don't know if this is worth doing to save power for simple aligned loads on a CPU which supports unaligned loads.
But modern Intel CPUs (P6 and later) have no penalty for unaligned load uops, even for 32-byte vectors, as long as they don't cross a cache-line boundary. Byte-granularity indexing for 8 ways in parallel probably costs more than just fetching the whole 8 x 64 bytes and setting up the muxing of the output while the fetch+TLB is happening, based on offset-within-line, operand-size, and special attributes like zero- or sign-extension, or broadcast-load. So once the tag-compare is done, the 64 bytes of data from the selected way might just go into an already-configured mux network that grabs the right bytes and broadcasts or sign-extends.
AVX512 CPUs can even do 64-byte full-line loads.
If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure if / how CPUs manage to pipeline this so other loads can keep executing while the TLB-miss is resolved. That process involves checking the L2TLB (Skylake: unified 1536 entry 12-way for 4k and 2M, 16-entry for 1G), and if that fails then with a page-walk.
I assume that a TLB miss results in the tag+data fetch being thrown away. They'll be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.
At the simplest, it could just re-run the whole operation (including fetching the translation from L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly instead of putting it into L1dTLB and getting it back out again.
Obviously that requires that the dTLB and L1D are really designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses to avoid a catch 22 / chicken-egg problem).
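Putting the steps above together, here is a toy software model of a VIPT L1d lookup (a sketch only, with assumed parameters: 32 KiB, 8-way, 64-byte lines, 4 KiB pages, and a stubbed-out identity TLB; real hardware does the TLB lookup and the set fetch in parallel, which a sequential C model can't show):

#include <stdbool.h>
#include <stdint.h>

#define WAYS       8
#define SETS       64                 /* 32 KiB / 64 B lines / 8 ways */
#define LINE_SHIFT 6
#define PAGE_SHIFT 12

struct way   { bool valid; uint64_t ptag; /* ...plus 64 B of data... */ };
struct cache { struct way sets[SETS][WAYS]; };

/* stand-in for the L1dTLB: an identity mapping, purely for illustration */
static bool l1dtlb_lookup(uint64_t virt_page, uint64_t *phys_page)
{
    *phys_page = virt_page;
    return true;
}

static int l1d_lookup(struct cache *c, uint64_t vaddr)
{
    /* index bits sit below the page offset, so no translation is needed */
    unsigned set = (vaddr >> LINE_SHIFT) % SETS;
    uint64_t phys_page;
    /* in hardware this TLB lookup runs in parallel with the tag/data fetch */
    if (!l1dtlb_lookup(vaddr >> PAGE_SHIFT, &phys_page))
        return -1;                     /* TLB miss: fetch is retried later */
    uint64_t ptag = ((phys_page << PAGE_SHIFT) |
                     (vaddr & ((1u << PAGE_SHIFT) - 1))) >> LINE_SHIFT;
    for (int w = 0; w < WAYS; w++)     /* one tag comparator per way */
        if (c->sets[set][w].valid && c->sets[set][w].ptag == ptag)
            return w;                  /* hit: this way's data is selected */
    return -1;                         /* miss */
}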
is there a side-band connection from TLB to the Cache?
I wouldn't call it a side-band connection. The L1D cache is the only thing that uses the L1dTLB. Similarly, L1iTLB is used only by the L1I cache.
If there's a 2nd-level TLB, it's usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.
Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 check, so physical addresses can be sent to other caches.

Is cache prefetching done in hardware address space or virtual address space?

Does the hardware prefetcher operate on contiguous virtual addresses, or on contiguous physical addresses? Imagine the case where you have a large array of bytes which spans multiple pages. In the virtual address space the bytes are contiguous, but in fact the pages could be allocated at disjoint physical addresses. I would hope that the prefetcher is able to do the appropriate conversion using the TLB before it starts to bring in cache lines that belong to the next page.
Is this so?
I couldn't find information that confirmed this and was hoping someone could give more insight.
I'm asking for x86 mainly, but any insight would be appreciated
I can't answer this for AMD processors, but I can answer it for Intel ones.
As far as I know, the hardware prefetcher(s) should not prefetch cache lines across page boundaries on current Intel processors.
From Intel's Intel® 64 and IA-32 Architectures Optimization Reference Manual, section 7.5.2, Hardware Prefetch:
Automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. It will attempt to prefetch two cache lines ahead of the prefetch stream. Characteristics of the hardware prefetcher are:
[...]
It will not prefetch across a 4-KByte page boundary. A program has to initiate demand loads for the new page before the hardware prefetcher starts prefetching from the new page.
The above paragraph is talking about the "unified last-level cache", but things aren't better in L1d land:
2.3.5.4, Data Prefetching
Data Prefetch to L1 Data Cache
Data prefetching is triggered by load operations when the following conditions are met:
[...]
The prefetched data is within the same 4K byte page as the load instruction that triggered it.
Or in L2:
The following two hardware prefetchers fetched data from memory to the L2 cache and last level cache:
Spatial Prefetcher: [...]
Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
However, the processor might prefetch paging data. From Intel's Intel® 64 and IA-32 Architectures Software Developer Manuals, Volume 3A, 4.10.2.3, Details of TLB Use:
The processor may cache translations required for prefetches and for accesses that are a result of speculative execution that would never actually occur in the executed code path.
Volume 3A, 4.10.3.1, Caches for Paging Structures:
The processor may create entries in paging-structure caches for translations required for prefetches and for accesses that are a result of speculative execution that would never actually occur in the executed code path.
I know you asked about hardware prefetching, but you should be able to use software prefetching for data (not instructions):
In older microarchitectures, PREFETCH causing a Data Translation Lookaside Buffer (DTLB) miss would be dropped. In processors based on Nehalem, Westmere, Sandy Bridge, and newer microarchitectures, Intel Core 2 processors, and Intel Atom processors, PREFETCH causing a DTLB miss can be fetched across a page boundary.
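So for a streaming loop, one option is to issue a software prefetch into the next page ahead of the hardware prefetcher's reach (a sketch; the 4 KiB prefetch distance and the T0 hint are arbitrary choices here, not tuned recommendations):

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

static long sum_bytes(const unsigned char *buf, size_t len)
{
    long sum = 0;
    for (size_t i = 0; i < len; i += 64) {          /* one access per cache line */
        if (i + 4096 < len)
            _mm_prefetch((const char *)&buf[i + 4096], _MM_HINT_T0);
        sum += buf[i];
    }
    return sum;
}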

CUDA caches data into the unified cache from the global memory to store them into the shared memory?

As far as I know, the GPU follows the steps (global memory -> L2 -> L1 -> register -> shared memory) to store data into shared memory on previous NVIDIA GPU architectures.
However, the Maxwell GPU (GTX 980) has physically separated unified cache and shared memory, and I want to know whether this architecture also follows the same steps to store data into shared memory, or whether it supports direct communication between global and shared memory.
the unified cache is enabled with option "-dlcm=ca"
This might answer most of your questions about memory types and steps within the Maxwell architecture:
As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.
In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt-in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.
Local loads also are cached in L2 only, which could increase the cost of register spilling if L1 local load hit rates were high with Kepler. The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance. Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.
The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler.
From section "1.4.2. Memory Throughput", sub-section "1.4.2.1. Unified L1/Texture Cache" in the Maxwell tuning guide from Nvidia.
The sections and sub-sections following these also explain other useful details about shared memory sizes/bandwidth, caching, etc.
Give it a try!
