Hyper-Threading data cache context aliasing - caching

In Intel's manual, the following section confuses me:
11.5.6.2 Shared Mode
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if
the logical processors use identical CR3 registers and paging modes.
In shared mode, linear addresses in the L1 data cache can be aliased,
meaning that one linear address in the cache can point to different
physical locations. The mechanism for resolving aliasing can lead to
thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the
preferred configuration for processors based on the Intel NetBurst
microarchitecture that support Intel Hyper-Threading Technology.
As Intel uses VIPT (which behaves like PIPT, since the index bits fall within the page offset) to access the cache,
how could cache aliasing happen?

Based on Intel® 64 and IA-32 Architectures Optimization Reference Manual, November 2009 (248966-020), Section 2.6.1.3:
Most resources in a physical processor are fully shared to improve the
dynamic utilization of the resource, including caches and all the
execution units. Some shared resources which are linearly addressed,
like the DTLB, include a logical processor ID bit to distinguish
whether the entry belongs to one logical processor or the other.
The first level cache can operate in two modes depending on a context-ID
bit:
Shared mode: The L1 data cache is fully shared by two logical
processors.
Adaptive mode: In adaptive mode, memory accesses using the page
directory is mapped identically across logical processors sharing the
L1 data cache.
Aliasing is possible because the processor ID/context-ID bit (which is just a bit indicating which logical processor the memory access came from) differs between the two hardware threads, and shared mode incorporates that bit into the cache lookup. Adaptive mode simply addresses the cache as one would normally expect, using only the memory address.
Specifically how the processor ID is used when indexing the cache in shared mode appears not to be documented. (XORing with several address bits would provide dispersal of indexes such that adjacent indexes for one hardware thread would map to more separated indexes for the other thread. Selecting a different bit order for different threads is less likely since such would tend to increase delay. Dispersal reduces conflict frequency given spatial locality above cache line granularity but less than way-size granularity.)
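To make the speculated dispersal concrete, here is a minimal C sketch of how a logical-processor ID could be folded into the set index. The cache geometry (64-byte lines, 64 sets) and the dispersal mask are assumptions for illustration only; the actual mechanism is undocumented.

    #include <stdint.h>

    /* Hypothetical L1D indexing: 64-byte lines, 64 sets. */
    #define LINE_BITS 6
    #define SET_BITS  6
    #define SET_MASK  ((1u << SET_BITS) - 1u)

    /* Plain VIPT-style index: bits [11:6] of the linear address. */
    static uint32_t plain_index(uint64_t linear)
    {
        return (uint32_t)(linear >> LINE_BITS) & SET_MASK;
    }

    /* Speculative "shared mode" index: the logical-processor ID selects a
     * dispersal mask that is XORed into the index, so the same linear
     * address lands in different sets for the two hardware threads.
     * The mask value 0x15 is made up. */
    static uint32_t shared_mode_index(uint64_t linear, unsigned lp_id)
    {
        return plain_index(linear) ^ (lp_id ? 0x15u : 0x00u);
    }

With such a scheme a hot region for one thread does not line up set-for-set with the same region on the sibling thread, which is the conflict-reduction effect speculated above.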

Related

Is there a performance gain on x86 Windows/Linux apps in minimizing page count?

I'm writing and running software on Windows and Linux versions from roughly the last five years, on machines of similar vintage.
I believe the running processes will never get close to the amount of memory we have (we're typically using like 3GB out of 32GB) but I'm certain we'll be filling up the cache to overflowing.
I have a data structure modification which makes the data structure run a bit slower (that is, do quite a bit more pointer math) but reduces both the cache line count and page count (working set) of the application.
My understanding is that if processes' memory is being paged out to disk, reducing the page count and thus disk I/O is a huge win in terms of wall clock time and latency (e.g. to respond to network events, user events, timers, etc.). However, with memory so cheap nowadays, I don't think processes using my data structure will ever page out of memory. So is there still an advantage in improving this? For example, if I'm using 10,000 cache lines, does it matter whether they're spread over 157 4k pages (=643k) or 10,000 4k pages (=40MB)?
Subquestion: even if you're nowhere near filling up RAM, are there cases a modern OS will page a running process out anyways? (I think Linux of the 90s might have done so to increase file caching, for instance.)
In contrast, I can also reduce cache lines, and I suspect this will probably improve wall clock time quite a bit. Just to be clear on what I'm counting, I'm counting the number of 64-byte-aligned areas of memory in which I touch at least one byte as the total cache lines used by this structure. Example: if hypothetically I have 40,000 16-byte structures, a computer whose processes are constantly referring to more memory than it has L1 cache will probably run everything faster if those structures are packed into 10,000 64-byte cache lines instead of each straddling two cache lines apiece for 80,000 cache lines.
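For reference, here is a small C sketch of the counting described above: given the addresses the structure touches, it counts the distinct 64-byte lines and distinct 4 KiB pages. The helper names are mine, not from any library.

    #include <stdint.h>
    #include <stdlib.h>

    #define LINE_SIZE 64u
    #define PAGE_SIZE 4096u

    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    /* Count distinct values of addr / granule over the touched addresses,
     * e.g. granule = LINE_SIZE for cache lines, PAGE_SIZE for pages. */
    static size_t count_distinct(const uint64_t *addrs, size_t n, uint64_t granule)
    {
        uint64_t *keys = malloc(n * sizeof *keys);
        size_t distinct = 0;
        if (keys == NULL || n == 0) { free(keys); return 0; }
        for (size_t i = 0; i < n; i++) keys[i] = addrs[i] / granule;
        qsort(keys, n, sizeof *keys, cmp_u64);
        for (size_t i = 0; i < n; i++)
            if (i == 0 || keys[i] != keys[i - 1]) distinct++;
        free(keys);
        return distinct;
    }

For the hypothetical 40,000 16-byte structures, count_distinct(..., LINE_SIZE) would report 10,000 lines when they are packed and aligned, and up to 80,000 when each one straddles two lines.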
The general recommendation is to minimize the number of cache lines and virtual pages that exhibit high temporal locality on the same core during the same time interval of the lifetime of a given application.
When the total number of needed physical pages is about to exceed the capacity of main memory, the OS will attempt to free up some physical memory by paging out some of the resident pages to secondary storage. This situation may result in major page faults, which can significantly impact performance. Even if you know with certainty that the system will not reach that point, there are other performance issues that may occur when unnecessarily using more virtual pages (i.e., there is wasted space in the used pages).
On Intel and AMD processors, page table entries and other paging structures are cached in hardware caches in the memory management unit to efficiently translate virtual addresses to physical addresses. These include the L1 and L2 TLBs. On an L2 TLB miss, a hardware component called the page walker is engaged to perform the required address translation. More pages means more misses. On pre-Broadwell (1) and pre-Zen microarchitectures, there can only be one outstanding page walk at any time. On later microarchitectures, there can be two. In addition, on Intel Ivy Bridge and later, it may be more difficult for the TLB prefetcher to keep up with the misses.
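One way to see this effect on Linux is to count dTLB load misses around the code that walks the data structure; a minimal sketch using perf_event_open (the region to measure is left as a comment):

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Open a counter for dTLB load misses on the calling thread. */
    static int open_dtlb_miss_counter(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type = PERF_TYPE_HW_CACHE;
        attr.size = sizeof attr;
        attr.config = PERF_COUNT_HW_CACHE_DTLB
                    | (PERF_COUNT_HW_CACHE_OP_READ << 8)
                    | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int fd = open_dtlb_miss_counter();
        uint64_t misses = 0;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... walk the data structure under test here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        if (read(fd, &misses, sizeof misses) != sizeof misses)
            return 1;
        printf("dTLB load misses: %llu\n", (unsigned long long)misses);
        return 0;
    }

Comparing the counts for the packed and spread-out layouts gives a rough picture of the extra TLB pressure the extra pages cause.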
On Intel and AMD processors, the L1D and L2 caches are designed so that all cache lines within the same 4K page are guaranteed to map to different cache sets. So if all of the cache lines of a page are used, in contrast to, for example, spreading the cache lines over 10 different pages, conflict misses in these cache levels may be reduced. That said, on all AMD processors and on Intel pre-Haswell microarchitectures, bank conflicts are more likely to occur when accesses are more spread across cache sets.
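As a quick illustration of that guarantee, assuming a typical 32 KiB, 8-way L1D with 64-byte lines (64 sets), the set index comes from address bits [11:6], so the 64 lines of a 4 KiB page each land in a different set:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed L1D geometry: 32 KiB, 8 ways, 64-byte lines -> 64 sets. */
    #define LINE_SIZE 64u
    #define NUM_SETS  64u

    static unsigned l1_set(uint64_t addr)
    {
        return (unsigned)((addr / LINE_SIZE) % NUM_SETS);  /* bits [11:6] */
    }

    int main(void)
    {
        uint64_t page = 0x12345000u;          /* any 4 KiB-aligned base */
        unsigned seen[NUM_SETS] = {0};
        int one_per_set = 1;
        for (uint64_t off = 0; off < 4096; off += LINE_SIZE)
            seen[l1_set(page + off)]++;
        for (unsigned s = 0; s < NUM_SETS; s++)
            one_per_set &= (seen[s] == 1);
        printf("every set used exactly once within the page: %s\n",
               one_per_set ? "yes" : "no");
        return 0;
    }

Spreading the same 64 lines across many pages gives no such guarantee, so some sets can end up oversubscribed while others stay empty.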
On Intel and AMD processors, hardware data prefetchers don't work across 4K boundaries. An access pattern that could be detected by one or more prefetchers but has the accesses spread out across many pages would benefit less from hardware prefetching.
On Windows/Intel, the accessed bits of the page table entries of all present pages are reset every second by the working set manager. So when accesses are unnecessarily spread out in the virtual address space, the number of page walks that require microcode assists (to set the accessed bit) per memory access may become larger.
The same applies to minor page faults as well (on Intel and AMD). Both Windows and Linux use demand paging, i.e., an allocated virtual page is only mapped to a physical page on demand. When a virtual page that is not yet mapped to a physical page is accessed, a minor page fault occurs. Just like with the accessed bit, the number of minor page faults per access may be larger.
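A common mitigation, sketched below in C, is to touch every page of a freshly allocated buffer once up front, so the minor faults (and first-touch accessed-bit costs) are paid before the latency-sensitive work starts; this assumes a writable, newly allocated buffer:

    #include <stddef.h>

    #define PAGE_SIZE 4096u

    /* Touch one byte per page so demand paging maps the whole buffer now,
     * taking the minor page faults here instead of in the hot path.
     * (mlock() or mmap() with MAP_POPULATE are heavier alternatives.) */
    static void prefault(void *buf, size_t len)
    {
        volatile unsigned char *p = buf;
        for (size_t off = 0; off < len; off += PAGE_SIZE)
            p[off] = 0;
        if (len)
            p[len - 1] = 0;   /* make sure the final page is touched too */
    }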
On Intel processors, with more pages accessed, it becomes more likely for nearby accesses on the same logical core to 4K-alias. For more information on 4K aliasing, see: L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes.
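To make the pattern concrete, here is a C sketch of the kind of loop the linked question is about: the destination starts 4096 + 64 bytes after the source, so each load of src[i] matches a still-buffered store to dst[i-16] in address bits [11:0] and may be replayed. The sizes are illustrative.

    #include <stdlib.h>

    int main(void)
    {
        size_t n = 1 << 20;
        float *buf = aligned_alloc(4096, (n + 2048) * sizeof(float));
        float *src = buf;
        float *dst = buf + 1024 + 16;     /* dst - src = 4096 + 64 bytes */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];              /* loads 4K-alias recent stores */
        free(buf);
        return 0;
    }

Offsetting one of the buffers so that the low 12 bits of the two base addresses are well separated avoids the worst case.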
There are probably other potential issues.
Subquestion: even if you're nowhere near filling up RAM, are there
cases a modern OS will page a running process out anyways? (I think
Linux of the 90s might have done so to increase file caching, for
instance.)
On both Windows and Linux, there is a maximum working set size limit on each process. On Linux, this is called RLIMIT_RSS and is not enforced. On Windows, this limit can be either soft (the default) or hard. The limit is only enforced if it is hard (which can be specified by calling the SetProcessWorkingSetSizeEx function and passing the QUOTA_LIMITS_HARDWS_MIN_ENABLE flag). When a process reaches its hard working set limit, further memory allocation requests will be satisfied by paging out some of its pages to the page file, even if free physical memory is available.
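A minimal C sketch of setting these limits, with the caveats above (the Linux limit is accepted but not enforced; the sizes and the choice to make both the minimum and maximum hard are illustrative):

    #ifdef _WIN32
    #include <windows.h>

    static void cap_working_set(void)
    {
        /* Hard limits: once the maximum is reached, further allocations are
         * satisfied by paging out some of the process's own pages. */
        SetProcessWorkingSetSizeEx(GetCurrentProcess(),
                                   16u * 1024 * 1024,    /* minimum: 16 MiB */
                                   256u * 1024 * 1024,   /* maximum: 256 MiB */
                                   QUOTA_LIMITS_HARDWS_MIN_ENABLE |
                                   QUOTA_LIMITS_HARDWS_MAX_ENABLE);
    }
    #else
    #include <sys/resource.h>

    static void cap_working_set(void)
    {
        /* RLIMIT_RSS: accepted by the kernel but, as noted above, not
         * actually enforced on modern Linux. */
        struct rlimit rl = { 256ul * 1024 * 1024, 256ul * 1024 * 1024 };
        setrlimit(RLIMIT_RSS, &rl);
    }
    #endif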
I don't know about Linux of the 90s.
Footnotes:
(1) The Intel optimization manual mentions in Section 2.2.3 that Skylake can do two page walks in parallel compared to one in Haswell and earlier microarchitectures. To my knowledge, the manual does not clearly mention whether Broadwell can do one or two page walks in parallel. However, according to slide 10 of these Intel slides (entitled "Intel Xeon Processor D: The First Xeon processor optimized for dense solutions"), Xeon Broadwell supports two page walks. I think this also applies to all Broadwell models.

Managing cache with memory mapped I/O

I have a question regarding memory mapped io.
Suppose there is a memory-mapped I/O peripheral whose value is being read by the CPU. Once read, the value is stored in the cache. But then the value in memory is updated by the external I/O peripheral.
In such cases, how will the CPU determine that the cache has been invalidated, and what could be the workaround for such a case?
That's strongly platform dependent. And actually, there are two different cases.
Case #1. Memory-mapped peripheral. This means that accesses to some range of physical memory addresses are routed to a peripheral device. There is no actual RAM involved. To control caching, x86, for example, has MTRRs ("memory type range registers") and PAT ("page attribute table"). They allow setting the caching mode on a particular range of physical memory. Under normal circumstances, a range of memory mapped to RAM is write-back cacheable, while a range of memory mapped to peripheral devices is uncacheable. The different caching policies are described in Intel's system programming guide, 11.3 "Methods of caching available". So, when you issue a read or write request to a memory-mapped peripheral, the CPU cache is bypassed and the request goes directly to the device.
Case #2. DMA. It allows peripheral devices to access RAM asynchronously. In this case, the DMA controller is no different from any CPU and participates equally in the cache coherency protocol. A write request from the peripheral is seen by the caches of other CPUs, and cache lines are either invalidated or updated with the new data. A read request is also seen by the caches of other CPUs, and the data is returned from the cache rather than from main RAM. (This is only an example: the actual implementation is platform dependent. For example, SoCs typically do not guarantee strong peripheral <-> CPU cache coherency.)
In both cases, the problem of caching also exists at the compiler level: the compiler may cache data values in registers. That's why programming languages have some means of prohibiting such optimizations: for example, the volatile keyword in C.
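As a minimal C sketch of that compiler-level point, assume a hypothetical device status register that has already been mapped uncacheable (case #1 above); the address and bit layout are made up:

    #include <stdint.h>

    /* Hypothetical status register of a memory-mapped peripheral. The
     * mapping is assumed to be uncacheable (via MTRR/PAT), so every access
     * really reaches the device. */
    #define DEV_STATUS ((volatile uint32_t *)0x40001000u)
    #define DEV_READY  (1u << 0)

    static void wait_until_ready(void)
    {
        /* volatile forces the compiler to re-read the register on every
         * iteration instead of keeping the first value in a CPU register. */
        while ((*DEV_STATUS & DEV_READY) == 0)
            ;
    }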

Does CUDA cache data from global memory in the unified cache when storing it into shared memory?

As far as I know, the GPU follows the steps (global memory → L2 → L1 → register → shared memory) to store data into shared memory on previous NVIDIA GPU architectures.
However, the Maxwell GPU (GTX 980) has physically separated unified cache and shared memory, and I want to know whether this architecture also follows the same steps to store data into shared memory, or whether it supports direct communication between global and shared memory.
The unified cache is enabled with the option "-dlcm=ca".
This might answer most of your questions about memory types and steps within the Maxwell architecture:
As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.
In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt-in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.
Local loads also are cached in L2 only, which could increase the cost of register spilling if L1 local load hit rates were high with Kepler. The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance. Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.
The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler.
From section "1.4.2. Memory Throughput", sub-section "1.4.2.1. Unified L1/Texture Cache" in the Maxwell tuning guide from Nvidia.
The sections and sub-sections following these two also explain other useful details about shared memory sizes/bandwidth, caching, etc.
Give it a try!

How are the modern Intel CPU L3 caches organized?

Given that CPUs are now multi-core and have their own L1/L2 caches, I was curious as to how the L3 cache is organized, given that it's shared by multiple cores. I would imagine that if we had, say, 4 cores, then the L3 cache would contain 4 pages worth of data, each page corresponding to the region of memory that a particular core is referencing. Assuming I'm somewhat correct, is that as far as it goes? It could, for example, divide each of these pages into sub-pages. This way, when multiple threads run on the same core, each thread may find their data in one of the sub-pages. I'm just coming up with this off the top of my head, so I'm very interested in educating myself on what is really going on behind the scenes. Can anyone share their insights or provide me with a link that will cure me of my ignorance?
Many thanks in advance.
There is a single (sliced) L3 cache in a single-socket chip, and several L2 caches (one per physical core).
The L3 cache caches data in segments of 64 bytes (cache lines), and there is a special cache coherence protocol between the L3 and the different L2/L1 caches (and between several chips in NUMA/ccNUMA multi-socket systems too); it tracks which copy of a cache line is current, which is shared between several caches, and which has just been modified (and should be invalidated in the other caches). Some of the protocols (possible cache line states and state transitions): https://en.wikipedia.org/wiki/MESI_protocol, https://en.wikipedia.org/wiki/MESIF_protocol, https://en.wikipedia.org/wiki/MOESI_protocol
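For readers who want the protocol idea in code form, here is a textbook-level C sketch of how one cache's copy of a line reacts to snooped requests; this is simplified MESI, not the exact MESIF/MOESI variants the links above describe:

    /* Simplified MESI: the state of one line in one cache. */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* Another cache wants to read the line. */
    static mesi_t on_remote_read(mesi_t s, int *must_write_back)
    {
        *must_write_back = (s == MODIFIED);  /* dirty data must be supplied */
        return (s == INVALID) ? INVALID : SHARED;
    }

    /* Another cache wants to write the line. */
    static mesi_t on_remote_write(mesi_t s, int *must_write_back)
    {
        *must_write_back = (s == MODIFIED);
        return INVALID;                      /* our copy is invalidated */
    }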
In older chips (the Core 2 era) cache coherence was snooped on a shared bus; now it is checked with the help of a directory.
In real life the L3 is not just a single block but is divided into several slices, each of them having high-speed access ports. There is some method of selecting the slice based on the physical address, which allows a multicore system to do many accesses at every moment (each access will be directed by an undocumented method to some slice; when two cores use the same physical address, their accesses will be served by the same slice or by slices which will do the cache coherence protocol checks).
Information about the L3 cache slices was reverse engineered in several papers (a sketch of the kind of hash function involved follows the list):
https://cmaurice.fr/pdf/raid15_maurice.pdf Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters
https://eprint.iacr.org/2015/690.pdf Systematic Reverse Engineering of Cache Slice Selection in Intel Processors
https://arxiv.org/pdf/1508.03767.pdf Cracking Intel Sandy Bridge’s Cache Hash Function
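As a rough C sketch of the kind of function those papers recover: each bit of the slice index is the XOR (parity) of a subset of physical address bits. The masks below are placeholders; the real, undocumented masks differ per product and are tabulated in the papers above.

    #include <stdint.h>

    static unsigned parity64(uint64_t x)
    {
        x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
        x ^= x >> 4;  x ^= x >> 2;  x ^= x >> 1;
        return (unsigned)(x & 1u);
    }

    /* Illustrative slice hash for a hypothetical 4-slice L3. */
    static unsigned l3_slice(uint64_t phys_addr)
    {
        const uint64_t mask_bit0 = 0x0000000065400000ull;  /* placeholder */
        const uint64_t mask_bit1 = 0x000000009a800000ull;  /* placeholder */
        return parity64(phys_addr & mask_bit0)
             | (parity64(phys_addr & mask_bit1) << 1);
    }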
With recent chips the programmer has the ability to partition the L3 cache between applications using "Cache Allocation Technology" (v4 family): https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology https://software.intel.com/en-us/articles/introduction-to-code-and-data-prioritization-with-usage-models https://danluu.com/intel-cat/ https://lwn.net/Articles/659161/
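On Linux, CAT is normally driven through the resctrl filesystem (see the lwn.net article above). A minimal C sketch, assuming resctrl is already mounted at /sys/fs/resctrl and using a made-up group name and way mask:

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        /* Create a resource group and restrict it to 4 ways of the L3 on
         * cache domain 0; "low_prio" and the 0xf mask are illustrative. */
        mkdir("/sys/fs/resctrl/low_prio", 0755);

        FILE *f = fopen("/sys/fs/resctrl/low_prio/schemata", "w");
        if (f == NULL) { perror("schemata"); return 1; }
        fprintf(f, "L3:0=f\n");
        fclose(f);

        /* PIDs written to /sys/fs/resctrl/low_prio/tasks are then confined
         * to that partition of the L3. */
        return 0;
    }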
Modern Intel L3 caches (since Nehalem) use a 64B line size, the same as L1/L2. They're shared, and inclusive.
See also http://www.realworldtech.com/nehalem/2/
Since SnB at least, each core has part of the L3, and they're on a ring bus. So in big Xeons, L3 size scales linearly with number of cores.
See also Which cache mapping technique is used in intel core i7 processor? where I wrote a much larger and more complete answer.

About Adaptive Mode for L1 Cache in Hyper-threading

I'm a student doing some research on Hyper-threading recently. I'm a little confused about the feature - L1 Data Cache Context Mode.
In the architecture optimization manual, it was described that L1 cache can operate in two modes:
The first level cache can operate in two modes depending on a context-ID bit:
Shared mode: The L1 data cache is fully shared by two logical processors.
Adaptive mode: In adaptive mode, memory accesses using the page directory is mapped identically across logical processors sharing the L1 data cache.
However, I am curious about how the cache gets partitioned in adaptive mode according to the description.
On Intel arch, a value of 1 of L1 Context ID indicates the L1 data cache mode can be set to either adaptive mode or shared mode, while a value of 0 indicates this feature is not supported. Check the definition of IA32_MISC_ENABLE MSR Bit 24 (L1 Data Cache Context Mode) for details.
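Support for the feature can be checked from user space via CPUID; here is a small C sketch using GCC/Clang's cpuid.h (the bit is only set on the old NetBurst parts discussed here, so expect "no" on anything current):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;
        /* CPUID.01H:ECX bit 10 is CNXT-ID (L1 Context ID): when set, the L1
         * data cache mode can be selected via IA32_MISC_ENABLE[24]. */
        printf("L1 Context ID supported: %s\n",
               (ecx & (1u << 10)) ? "yes" : "no");
        return 0;
    }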
According to Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A (Chapter 11/Cache Control), which I quote below:
Shared Mode
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the logical processors use identical CR3 registers and paging modes. In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst microarchitecture that support Intel Hyper-Threading Technology.
Adaptive Mode
Adaptive mode facilitates L1 data cache sharing between logical processors. When running in adaptive mode, the L1 data cache is shared across logical processors in the same core if:
• CR3 control registers for logical processors sharing the cache are identical.
• The same paging mode is used by logical processors sharing the cache.
In this situation, the entire L1 data cache is available to each logical processor (instead of being competitively shared).
If CR3 values are different for the logical processors sharing an L1 data cache or the logical processors use different paging modes, processors compete for cache resources. This reduces the effective size of the cache for each logical processor.
Aliasing of the cache is not allowed (which prevents data thrashing).
I just guess there is no definite approach for partitioning the L1 data cache.
The document just states that if you use the adaptive mode and if CR3 or the paging mode differ between cores, the cache is not shared and the cores "compete" for the cache. It doesn't tell how the partitioning works.
The most straightforward manner to implement this would be to statically reserve half of the ways of the data cache to each of the processors. This would essentially assign half the data cache statically to each processor.
Alternatively they could add an additional bit to the virtual tag of each cache line to distinguish which processor the line belongs to. This would allow a dynamic partition of the cache. This fits the description of "competing" for the cache better than a static partition.
If you really need to know, you could design some micro-benchmarks to verify that one of these schemes is actually used.
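A sketch of such a micro-benchmark in C: pin a thread to one logical CPU, pointer-chase a working set of a chosen size, and compare per-pass times when the sibling logical CPU is idle versus running the same chase. The sibling CPU numbers, sizes, and the sequential chain are placeholders; a serious version would randomize the chain and repeat many runs.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <time.h>

    #define LINE 64

    static volatile void *sink;   /* keeps the chase from being optimized out */

    /* Pointer-chase 'bytes' of memory on logical CPU 'cpu'; returns seconds. */
    static double chase(int cpu, size_t bytes, long passes)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);

        size_t n = bytes / LINE;
        char *buf = aligned_alloc(4096, n * LINE);
        for (size_t i = 0; i < n; i++)           /* build a circular chain */
            *(void **)(buf + i * LINE) = buf + ((i + 1) % n) * LINE;

        struct timespec t0, t1;
        void *p = buf;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long k = 0; k < passes; k++)
            for (size_t i = 0; i < n; i++)
                p = *(void **)p;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = p;
        free(buf);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

If, with identical CR3 values, a lone thread still sees its latency jump once its working set exceeds half the L1D, that points to a static way partition; if a lone thread can use the whole cache, the sharing is dynamic.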
The L1 data cache is not partitioned in either mode and is always competitively shared.
Note that there is an obvious error in the manual: the mode isn't determined by the context-ID bit, but by IA32_MISC_ENABLE[24]. This enhancement is supported on later steppings of Northwood with HT and on all Prescott processors with HT. The default value is zero, which represents adaptive mode. However, in certain processors, an updated BIOS may switch to shared mode by setting IA32_MISC_ENABLE[24] due to a bug in these processors that occurs only in adaptive mode.
In earlier steppings of Northwood with HT, only shared mode is supported. In shared mode, when a load request is issued to the L1 data cache, the request is first processed on the "fast path," which involves making a way prediction based on bits 11-15 of the linear address and producing a speculative hit/miss signal as a result. In processors with HT, the logical core ID is also compared. Both the partial tag and the logical core ID have to match in order to get a speculative hit. In general, this helps improve the correct speculative hit rate.
If the two sibling logical cores operate in the same paging mode and have identical CR3 values, which indicate that accesses from both cores use the same page tables (if paging is enabled), it would be better to produce a speculative hit even if the logical core ID doesn't match on the fast path of the cache.
In adaptive mode, a context ID value is calculated whenever the paging mode or the CR3 register of one of the cores is changed. If the paging modes and the CR3 values match, the context ID bit is set to one of the two possible values. Otherwise, it's set to the other value. When a load request is issued to the cache, the context ID is checked. If it indicates that the cores have the same address translation structures, the logical core ID comparison result is ignored and a speculative hit is produced if the partial virtual tag matches. Otherwise, the logical core ID comparison takes effect as in shared mode.
