L1 cache in GPU - caching

I see some similar terms while reading about the memory hierarchy of GPUs, and since there have been architectural changes across past generations, I don't know whether they can be used interchangeably or have different meanings. The device is an M2000, which is compute capability 5.2.
The top level (closest to the pipeline) is a unified L1/texture cache, which is 24KB per SM. Is it unified for instructions and data too?
Below that is the L2 cache, which is also known as shared memory and is shared by all SMs. According to ./deviceQuery, the L2 size is 768KB. If that is an aggregate value, then each SM has 768KB/6 = 128KB. However, according to the programming guide, shared memory is 96KB.
What is constant memory then, and where does it reside? There is no information about its size in either deviceQuery or the nvprof metrics. The programming guide says:
There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Device Memory Accesses). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Texture and Surface Memory).
The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
Below L2 is the global memory, which is known as device memory and can be 2GB, 4GB, and so on.
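For reference, most of the sizes in question can be queried from the CUDA runtime; a minimal sketch, assuming device 0 is the M2000 and ignoring error checking:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                                   // device 0 assumed
    printf("Compute capability : %d.%d\n", prop.major, prop.minor);
    printf("L2 cache size      : %d bytes\n", prop.l2CacheSize);         // aggregate for the whole GPU
    printf("Shared mem / block : %zu bytes\n", prop.sharedMemPerBlock);  // per-SM on-chip RAM, not L2
    printf("Constant memory    : %zu bytes\n", prop.totalConstMem);      // typically 64 KB
    printf("Global memory      : %zu bytes\n", prop.totalGlobalMem);
    return 0;
}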

The NVIDIA GPU architecture has the following access paths. A given GPU may have additional caches within the hierarchy presented below.
Path for Global, Local Memory
(CC3.*) L1 -> L2
(CC5.*-6.*) L1TEX -> L2
(CC7.*) L1TEX (LSU) -> L2
Path for Surface, Texture (CC5.*/6.*)
(CC < 5) TEX
(CC5.*-6.*) L1TEX -> L2
(CC7.*) L1TEX (TEX) -> L2
Path for Shared
(CC3.*) L1
(CC5.*-6.*) SharedMemory
(CC7.*) L1TEX (LSU)
Path for Immediate Constant
... c[bank][offset] -> IMC - Immediate Constant Cache -> L2 Cache
Path for Indexed Constant
LDC Rd, c[bank][offset] -> IDC - Indexed Constant Cache -> L2 Cache
Path for Instruction
ICC - Instruction Cache -> L2
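As a hedged illustration of the two constant paths above (the names here are just an example): a __constant__ access with a compile-time-known offset can be encoded as an immediate c[bank][offset] operand, while an access indexed by a runtime value is typically compiled to an LDC and goes through the indexed constant cache.

#include <cuda_runtime.h>

__constant__ float coeff[16];                       // lives in the constant memory space

__global__ void apply(const float* in, float* out, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = coeff[0];   // compile-time offset: may be folded into an immediate constant operand
    float b = coeff[k];   // runtime index: typically an LDC through the indexed constant cache
    out[i] = a * in[i] + b;
}

int main() {
    float h_coeff[16] = {2.0f, 0.5f};
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // populate constant memory from the host
    // ... allocate in/out with cudaMalloc and launch apply<<<grid, block>>>(...) as usual
    return 0;
}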
The NVIDIA CUDA profilers (Nsight Compute, NVIDIA Visual Profiler, and Nsight VSE CUDA Profiler) have high-level diagrams of the memory hierarchy to help you understand how logical requests map to the hardware.
CC3.* Memory Hierarchy
For CC5.*/6.* there are two unified L1TEX caches per SM. Each L1/TEX unit services one SM partition. Each SM partition has two sub-partitions (2 warp schedulers). The SM contains a separate RAM and data path for shared memory. The L1TEX unit services neither instruction fetches nor constant data loads (via c[bank][offset]); those are handled through the separate cache hierarchies listed above. The CUDA programming model also supports read-only (const) data access through the L1TEX via the global memory address space.
L2 cache is shared by all engines in the GPU including but not limited to SMs, copy engines, video decoders, video encoders, and display controllers. The L2 cache is not partitioned by client. L2 is not referred to as shared memory. In NVIDIA GPUs shared memory is a RAM local to the SM that supports efficient non-linear access.
Global memory is a virtual memory address that may include:
dedicated memory attached to the GPU, referred to as device memory, video memory, or frame buffer depending on the context
pinned system memory
non-pinned system memory through unified virtual memory
peer memory
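All of these are addressed through the same global memory address space; a minimal sketch of how each backing store is typically obtained (error checking omitted, peer access shown only as a comment since it needs two devices):

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    float* d_mem = nullptr;      // dedicated device memory (on-board DRAM)
    cudaMalloc(&d_mem, bytes);

    float* pinned = nullptr;     // page-locked (pinned) system memory, directly accessible from the device on UVA systems
    cudaMallocHost(&pinned, bytes);

    float* managed = nullptr;    // unified (managed) memory, migrated on demand between host and device
    cudaMallocManaged(&managed, bytes);

    // Peer memory: with two devices that support it, device 0 can access device 1's memory directly.
    // cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);

    cudaFree(d_mem);
    cudaFreeHost(pinned);
    cudaFree(managed);
    return 0;
}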

Related

How are caches connected to cores?

I have a very fundamental question about how caches (e.g. L1, L2) are physically (in RTL) connected to cores (e.g. the Arm Cortex-A53). How many read/write ports/buses are there, and how wide are they? Is it a 32-bit bus? How do I calculate the theoretical maximum bandwidth/throughput of an L1 cache connected to an Arm Cortex-A53 running at 1400MHz?
There is a lot of information on the web about how caches work, but I couldn't find how they are connected.
You can get the information in the ARM documentation (which is pretty complete compared to others):
L1 data cache:
(configurable) sizes of 8KB, 16KB, 32KB, or 64KB.
Data side cache line length of 64 bytes.
256-bit write interface to the L2 memory system.
128-bit read interface to the L2 memory system.
64-bit read path from the data L1 memory system to the datapath.
128-bit write path from the datapath to the L1 memory system.
Note that the documentation refers to a single datapath (it says so explicitly when there are multiple), so there is almost certainly one port, unless two ports share the same datapath, which would be surprising.
L2 cache:
All bus interfaces are 128-bits wide.
Configurable L2 cache size of 128KB, 256KB, 512KB, 1MB and 2MB.
Fixed line length of 64 bytes.
General information:
One to four cores, each with an L1 memory system and a single shared L2 cache.
In-order pipeline with symmetric dual-issue of most instructions.
Harvard Level 1 (L1) memory system with a Memory Management Unit (MMU).
Level 2 (L2) memory system providing cluster memory coherency, optionally including an L2 cache.
The Level 1 (L1) data cache controller, that generates the control signals for the associated embedded tag, data, and dirty RAMs, and arbitrates between the different sources requesting access to the memory resources. The data cache is 4-way set associative and uses a Physically Indexed, Physically Tagged (PIPT) scheme for lookup that enables unambiguous address management in the system.
The Store Buffer (STB) holds store operations when they have left the load/store pipeline and have been committed by the DPU. The STB can request access to the cache RAMs in the DCU, request the BIU to initiate linefills, or request the BIU to write out the data on the external write channel. External data writes are through the SCU.
The STB can merge several store transactions into a single transaction if they are to the same 128-bit aligned address.
An upper bound for the L1 bandwidth is frequency * interface_width * number_of_paths, so 1400MHz * 64bit * 1 = 10.43 GiB/s from the L1 (reads) and 1400MHz * 128bit * 1 = 20.86 GiB/s to the L1 (writes). In practice, concurrency can be a problem, but it is hard to know which part of the chip will be the limiting factor.
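A minimal sketch of that upper-bound arithmetic (the 1400 MHz clock, the single datapath, and the 64/128-bit widths are the figures quoted above):

#include <cstdio>

int main() {
    const double freq_hz    = 1400e6;   // core clock of the Cortex-A53 in the question
    const double read_bits  = 64.0;     // L1 -> datapath read width
    const double write_bits = 128.0;    // datapath -> L1 write width
    const double paths      = 1.0;      // a single datapath assumed

    const double GiB = 1024.0 * 1024.0 * 1024.0;
    printf("L1 read  upper bound: %.2f GiB/s\n", freq_hz * (read_bits  / 8.0) * paths / GiB);  // ~10.43
    printf("L1 write upper bound: %.2f GiB/s\n", freq_hz * (write_bits / 8.0) * paths / GiB);  // ~20.86
    return 0;
}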
Note that there are many other documents available, but this one is the most interesting. I am not sure you can get the physical RTL-level information about the caches, since I expect it to be confidential and therefore not publicly available (presumably because competitors could benefit from it).

Hyper-Threading data cache context aliasing

In Intel's manual, the following section confuses me:
11.5.6.2 Shared Mode
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if
the logical processors use identical CR3 registers and paging modes.
In shared mode, linear addresses in the L1 data cache can be aliased,
meaning that one linear address in the cache can point to different
physical locations. The mechanism for resolving aliasing can lead to
thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the
preferred configuration for processors based on the Intel NetBurst
microarchitecture that support Intel Hyper-Threading Technology.
As Intel uses VIPT (effectively equivalent to PIPT, since the index bits fall within the page offset) to access the cache,
how could cache aliasing happen?
Based on the Intel® 64 and IA-32 Architectures Optimization Reference Manual, November 2009 (248966-020), Section 2.6.1.3:
Most resources in a physical processor are fully shared to improve the
dynamic utilization of the resource, including caches and all the
execution units. Some shared resources which are linearly addressed,
like the DTLB, include a logical processor ID bit to distinguish
whether the entry belongs to one logical processor or the other.
The first level cache can operate in two modes depending on a context-ID
bit:
Shared mode: The L1 data cache is fully shared by two logical
processors.
Adaptive mode: In adaptive mode, memory accesses using the page
directory is mapped identically across logical processors sharing the
L1 data cache.
Aliasing is possible because the processor ID/context-ID bit (which is just a bit indicating which virtual processor the memory access came from) would be different for different threads and shared mode uses that bit. Adaptive mode simply addresses the cache as one would normally expect, only using the memory address.
Specifically how the processor ID is used when indexing the cache in shared mode appears not to be documented. (XORing with several address bits would provide dispersal of indexes such that adjacent indexes for one hardware thread would map to more separated indexes for the other thread. Selecting a different bit order for different threads is less likely since such would tend to increase delay. Dispersal reduces conflict frequency given spatial locality above cache line granularity but less than way-size granularity.)
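A purely hypothetical toy sketch of such an XOR-based dispersal (not Intel's documented scheme): assume 64-byte lines and 64 sets, so the set index comes from address bits 6..11; XORing a pattern derived from the logical-processor ID into those bits shifts one thread's mapping relative to the other's, which is also exactly how the same linear address can end up in two different places (aliasing).

#include <cstdio>
#include <cstdint>

// Hypothetical shared-mode index function: 64 sets of 64-byte lines (4KB per way).
static uint32_t cache_set(uint64_t addr, uint32_t logical_cpu_id) {
    uint32_t index    = (uint32_t)(addr >> 6) & 0x3F;   // bits 6..11 select one of 64 sets
    uint32_t disperse = (logical_cpu_id & 1u) * 0x15u;  // spread the context-ID bit over several index bits
    return index ^ disperse;                            // same address -> different set per logical processor
}

int main() {
    for (uint64_t addr = 0; addr < 4 * 64; addr += 64)
        printf("addr %#llx -> set %u (lp0) / set %u (lp1)\n",
               (unsigned long long)addr, cache_set(addr, 0), cache_set(addr, 1));
    return 0;
}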

Does CUDA cache data from global memory in the unified cache before storing it in shared memory?

As far as I know, on previous NVIDIA GPU architectures data follows the path global memory -> L2 -> L1 -> register -> shared memory when being stored into shared memory.
However, the Maxwell GPU (GTX 980) has physically separate unified cache and shared memory, and I want to know whether this architecture also follows the same steps to store data into shared memory, or whether it supports direct communication between global and shared memory.
The unified cache is enabled with the option "-dlcm=ca".
This might answer most of your questions about memory types and steps within the Maxwell architecture:
As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.
In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt-in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.
Local loads also are cached in L2 only, which could increase the cost of register spilling if L1 local load hit rates were high with Kepler. The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance. Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.
The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler.
From section "1.4.2. Memory Throughput", sub-section "1.4.2.1. Unified L1/Texture Cache" in the Maxwell tuning guide from Nvidia.
The other sections and sub-sections that follow these two also spell out other useful details about shared memory size/bandwidth, caching, etc.
Give it a try!
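Tying this back to the original question: on Maxwell there is still no direct global-to-shared copy (hardware asynchronous copies only arrive much later, with Ampere), so a global load lands in a register and a separate instruction stores it to shared memory. A minimal sketch of that staging pattern, assuming 256 threads per block:

__global__ void stage_tile(const float* __restrict__ in, float* out, int n) {
    __shared__ float tile[256];        // on-chip shared memory; a separate RAM on Maxwell

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];               // global load: L2 (and, with -Xptxas -dlcm=ca, the unified L1/texture cache) into a register
        tile[threadIdx.x] = v;         // separate store from the register into shared memory
    }
    __syncthreads();

    if (i < n)
        out[i] = tile[threadIdx.x] * 2.0f;   // consume the staged value
}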

L2 cache in Kepler

How does the L2 cache work in GPUs with the Kepler architecture in terms of locality of reference? For example, if a thread accesses an address in global memory and the value at that address is not in the L2 cache, how is the value cached? Is it temporal? Or are other nearby values of that address brought into the L2 cache too (spatial)?
The picture below is from an NVIDIA whitepaper.
The unified L2 cache was introduced with compute capability 2.0 and continues to be supported on the Kepler architecture. The caching policy used is LRU (least recently used), the main intention of which is to avoid the global memory bandwidth bottleneck. A GPU application can exhibit both types of locality (temporal and spatial).
Whenever there is an attempt to read a specific memory location, the hardware looks in the L1 and L2 caches; if the data is not found there, a 128-byte cache line is loaded. This is the default mode. The diagram below also shows why a 128-byte access pattern gives good results.
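In practice this means spatial locality is exploited at cache-line granularity: when the 32 threads of a warp read 32 consecutive 4-byte words, the whole warp can be served from a single 128-byte line. A minimal sketch of such a coalesced pattern:

__global__ void coalesced_read(const float* __restrict__ in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads touch consecutive addresses
    if (i < n)
        out[i] = in[i] + 1.0f;                       // one warp reads 32 * 4 bytes = one 128-byte line
}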

How are cache memories shared in multicore Intel CPUs?

I have a few questions regarding the cache memories used in multicore CPUs or multiprocessor systems. (Although not directly related to programming, this has many repercussions when writing software for multicore processors/multiprocessor systems, hence asking here!)
In a multiprocessor system or a multicore processor (Intel Quad Core, Core 2 Duo etc.), does each CPU core/processor have its own cache memory (data and program cache)?
Can one processor/core access another's cache memory? If they were allowed to access each other's caches, I believe there might be fewer cache misses: if a particular processor's cache does not have some data but a second processor's cache does, a read from memory into the first processor's cache could be avoided. Is this assumption valid and true?
Will there be any problems in allowing any processor to access other processor's cache memory?
In a multiprocessor system or a multicore processor (Intel Quad Core,
Core two Duo etc..) does each cpu core/processor have its own cache
memory (data and program cache)?
Yes. It varies by the exact chip model, but the most common design is for each CPU core to have its own private L1 data and instruction caches.
On old and/or low-power CPUs, the next level of cache is typically a unified L2 shared between all cores. Or, on the 65nm Core 2 Quad (which was two Core 2 Duo dies in one package), each pair of cores had its own last-level cache and the two pairs couldn't communicate as efficiently.
Modern mainstream Intel CPUs (since the first-gen i7 CPUs, Nehalem) use 3 levels of cache.
32kiB split L1i/L1d: private per-core (same as earlier Intel)
256kiB unified L2: private per-core. (1MiB on Skylake-avx512).
large unified L3: shared among all cores
The last-level cache is a large shared L3. It's physically distributed between cores, with a slice of L3 going with each core on the ring bus that connects the cores. Typically there is 1.5 to 2.25MB of L3 cache per core, so a many-core Xeon might have a 36MB L3 cache shared between all its cores. This is why a dual-core chip has 2 to 4 MB of L3, while a quad-core has 6 to 8 MB.
On CPUs other than Skylake-avx512, L3 is inclusive of the per-core private caches so its tags can be used as a snoop filter to avoid broadcasting requests to all cores. i.e. anything cached in a private L1d, L1i, or L2, must also be allocated in L3. See Which cache mapping technique is used in intel core i7 processor?
David Kanter's Sandybridge write-up has a nice diagram of the memory hierarchy / system architecture, showing the per-core caches and their connection to the shared L3, and DDR3 / DMI (chipset) / PCIe connecting to that. (This still applies to Haswell / Skylake-client / Coffee Lake, except with DDR4 in later CPUs.)
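If you want to see the concrete numbers for the machine you are on, Linux/glibc exposes them through sysconf; a minimal sketch (these _SC_LEVEL* names are glibc extensions and may return 0 or -1 where the kernel does not report a level):

#include <cstdio>
#include <unistd.h>

int main() {
    printf("L1d size      : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 size       : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size       : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}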
Can one processor/core access each other's cache memory, because if
they are allowed to access each other's cache, then I believe there
might be lesser cache misses, in the scenario that if that particular
processors cache does not have some data but some other second
processors' cache might have it thus avoiding a read from memory into
cache of first processor? Is this assumption valid and true?
No. Each CPU core's L1 caches tightly integrate into that core. Multiple cores accessing the same data will each have their own copy of it in their own L1d caches, very close to the load/store execution units.
The whole point of multiple levels of cache is that a single cache can't be fast enough for very hot data, but can't be big enough for less-frequently used data that's still accessed regularly. Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Going off-core to another core's caches wouldn't be faster than just going to L3 in Intel's current CPUs. Or the required mesh network between cores to make this happen would be prohibitive compared to just building a larger / faster L3 cache.
The small/fast caches built-in to other cores are there to speed up those cores. Sharing them directly would probably cost more power (and maybe even more transistors / die area) than other ways of increasing cache hit rate. (Power is a bigger limiting factor than transistor count or die area. That's why modern CPUs can afford to have large private L2 caches).
Plus you wouldn't want other cores polluting the small private cache that's probably caching stuff relevant to this core.
Will there be any problems in allowing any processor to access other
processor's cache memory?
Yes -- there simply aren't wires connecting the various CPU caches to the other cores. If a core wants to access data in another core's cache, the only data path through which it can do so is the system bus.
A very important related issue is the cache coherency problem. Consider the following: suppose one CPU core has a particular memory location in its cache, and it writes to that memory location. Then, another core reads that memory location. How do you ensure that the second core sees the updated value? That is the cache coherency problem.
The normal solution is the MESI protocol, or a variation on it. Intel uses MESIF.
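That coherency traffic is also why false sharing hurts so much: two cores repeatedly writing different variables that happen to share one 64-byte line keep stealing the line from each other in the Modified state. A minimal sketch of the usual fix, padding each core's data to its own cache line (64 bytes assumed as the line size):

#include <cstdio>
#include <thread>

struct alignas(64) PaddedCounter {    // 64 = typical x86 cache-line size; keeps each counter in its own line
    long value = 0;
};

PaddedCounter counters[2];            // without alignas(64), both counters could share one line (false sharing)

void work(int id) {
    for (long i = 0; i < 100000000; ++i)
        counters[id].value++;          // each core now keeps its own line in Modified state without ping-ponging
}

int main() {
    std::thread t0(work, 0), t1(work, 1);
    t0.join(); t1.join();
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}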
Quick answers
1) Yes. 2) No, but it may depend on which memory instance/resource you are referring to; data may exist in several locations at the same time. 3) Yes.
For a full-length explanation of the issue you should read the 9-part article "What every programmer should know about memory" by Ulrich Drepper (http://lwn.net/Articles/250967/); it gives you the full picture of the issues you seem to be asking about in good, accessible detail.
To answer your first question: the Core 2 Duo has a 2-tier caching system, in which each processor has its own first-level cache and they share a second-level cache. This helps with both data synchronization and utilization of memory.
To answer your second question, I believe your assumption to be correct. If the processors were able to access each other's caches, there would obviously be fewer cache misses, as there would be more data for the processors to choose from. Consider, however, shared cache. In the case of the Core 2 Duo, having a shared cache allows programmers to place commonly used variables safely in this environment so that the processors will not have to access their individual first-level caches.
To answer your third question, there could potentially be a problem with accessing other processors' cache memory, which goes back to the "Single Write Multiple Read" principle: we can't allow more than one process to write to the same location in memory at the same time.
For more info on the Core 2 Duo, read this neat article:
http://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems/
