L2 cache in Kepler

L2 cache in Kepler - caching

How does L2 cache work in GPUs with Kepler architecture in terms of locality of references? For example if a thread accesses an address in global memory, supposing the value of that address is not in L2 cache, how is the value being cached? Is it temporal? Or are other nearby values of that address brought to L2 cache too (spatial)?
Below picture is from NVIDIA whitepaper.

Unified L2 cache was introduced with compute capability 2.0 and higher and continues to be supported on the Kepler architecture. The caching policy used is LRU (least recently used) the main intention of which was to avoid the global memory bandwidth bottleneck. The GPU application can exhibit both types of locality (temporal and spatial).
Whenever there is an attempt read a specific memory it looks in the cache L1 and L2 if not found, then it will load 128 byte from the cache line. This is the default mode. The same can be understood from the below diagram as to why the 128 bit access pattern gives the good result.

Related

How caches are connected to cores?

I have very fundamental question on how physically (in RTL) caches (e.g. L1,L2) are connected to cores (e.g. Arm Cortex A53)? How many read/writes ports/bus are there and what is width of it? Is it 32-bit bus? How to calculate theoretical max bandwidth/throughput on L1 cache connected to Arm Cortex A53 running at 1400MHz?
On web lots of information is available on how caches work but couldn't find how it is connected.

You can get the information in the ARM documentation (which is pretty complete compared to others):
L1 data cache:
(configurable) sizes of 8KB, 16KB, 32KB, or 64KB.
Data side cache line length of 64 bytes.
256-bit write interface to the L2 memory system.
128-bit read interface to the L2 memory system.
64-bit read path from the data L1 memory system to the datapath.
128-bit write path from the datapath to the L1 memory system.
Note there is one datapath since it is mentioned when there are multiple of them, hence there is certainly 1 port unless 2 ports share the same datapath which would be surprising.
L2 cache:
All bus interfaces are 128-bits wide.
Configurable L2 cache size of 128KB, 256KB, 512KB, 1MB and 2MB.
Fixed line length of 64 bytes.
General information:
One to four cores, each with an L1 memory system and a single shared L2 cache.
In-order pipeline with symmetric dual-issue of most instructions.
Harvard Level 1 (L1) memory system with a Memory Management Unit (MMU).
Level 2 (L2) memory system providing cluster memory coherency, optionally including an L2 cache.
The Level 1 (L1) data cache controller, that generates the control signals for the associated embedded tag, data, and dirty RAMs, and arbitrates between the different sources requesting access to the memory resources. The data cache is 4-way set associative and uses a Physically Indexed, Physically Tagged (PIPT) scheme for lookup that enables unambiguous address management in the system.
The Store Buffer (STB) holds store operations when they have left the load/store pipeline and have been committed by the DPU. The STB can request access to the cache RAMs in the DCU, request the BIU to initiate linefills, or request the BIU to write out the data on the external write channel. External data writes are through the SCU.
The STB can merge several store transactions into a single transaction if they are to the same 128-bit aligned address.
An upper-bound for the L1 bandwidth is frequency * interface_width * number_of_paths so 1400MHz * 64bit * 1 = 10.43 GiB/s from the L1 (reads) and 20.86 GiB/s to the L1 (writes). In practice, the concurrency can be a problem but it is hard to know which part of the chip will be a limiting factor.
Note that there are many other documents available but this one is the most interesting. I am not sure you can get the physical information about cache in RTL since I expect this information to be confidential, hence not publicly available (because I guess competitors could take benefit of this).

CUDA caches data into the unified cache from the global memory to store them into the shared memory?

As far as I know GPU follows steps(global memory-l2-l1-register-shared memory) to store data into the shared memory for previous NVIDIA GPU architectures.
However, the maxwell gpu(GTX980) has physically separated unified cache and shared memory, and I want to know that this architecture also follows the same step to store data into the shared memory? or do they support direct communication between global and shared memory?
the unified cache is enabled with option "-dlcm=ca"

This might answer most of your questions about memory types and steps within the Maxwell architecture :
As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.
In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt-in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.
Local loads also are cached in L2 only, which could increase the cost of register spilling if L1 local load hit rates were high with Kepler. The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance. Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.
The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler.
From section "1.4.2. Memory Throughput", sub-section "1.4.2.1. Unified L1/Texture Cache" in the Maxwell tuning guide from Nvidia.
The other sections and sub-sections following these two also teach and/or explicit useful other details about shared memory sizes/bandwidth, caching, etc.
Give it a try !

Does anyone knows about details of L2 cache structure of NVIDIA Kepler GPUs(Mapping function, Replacement policy, Associativity)?

I need the detail information about L2 caches of NVIDIA Kepler GPUs. I know the size (e.g. 512KB on GT740M GPU) and block size (32B) of the cache. I tried to capture the associativity, replacement policy, and more importantly, the mapping function (from global address to cache line), by a sample kernel and profiling the read hit ratio by nvprof profiler. I realized that mapping is not modulo operation. Is there any trick to find out what cache line a given global address is mapped to? can anyone help me?

Just as #Dmitri Budnikov memtioned, cache information is not public available.
Some researchers are working on this problem, and This Paper give us some insights on the memory hierarchy of GPU architecture.
Their findings on L2 cache can be summarized as below:
The replacement policy of the L2 cache is not LRU;
The L2 cache line size is 32 bytes;
The data mapping is sophisticated and not conventional bits-defined;
A hardware-level pre-fetching mechanism from the DRAM to the L2 data cache is found on fermi, Kepler and Maxwell architecture.
The benchmarks they developed can be found here.

How are the modern Intel CPU L3 caches organized?

Given that CPUs are now multi-core and have their own L1/L2 caches, I was curious as to how the L3 cache is organized given that its shared by multiple cores. I would imagine that if we had, say, 4 cores, then the L3 cache would contain 4 pages worth of data, each page corresponding to the region of memory that a particular core is referencing. Assuming I'm somewhat correct, is that as far as it goes? It could, for example, divide each of these pages into sub-pages. This way when multiple threads run on the same core each thread may find their data in one of the sub-pages. I'm just coming up with this off the top of my head so I'm very interested in educating myself on what is really going on underneath the scenes. Can anyone share their insights or provide me with a link that will cure me of my ignorance?
Many thanks in advance.

There is single (sliced) L3 cache in single-socket chip, and several L2 caches (one per real physical core).
L3 cache caches data in segments of size of 64 bytes (cache lines), and there is special Cache coherence protocol between L3 and different L2/L1 (and between several chips in the NUMA/ccNUMA multi-socket systems too); it tracks which cache line is actual, which is shared between several caches, which is just modified (and should be invalidated from other caches). Some of protocols (cache line possible states and state translation): https://en.wikipedia.org/wiki/MESI_protocol, https://en.wikipedia.org/wiki/MESIF_protocol, https://en.wikipedia.org/wiki/MOESI_protocol
In older chips (epoch of Core 2) cache coherence was snooped on shared bus, now it is checked with help of directory.
In real life L3 is not just "single" but sliced into several slices, each of them having high-speed access ports. There is some method of selecting the slice based on physical address, which allow multicore system to do many accesses at every moment (each access will be directed by undocumented method to some slice; when two cores uses same physical address, their accesses will be served by same slice or by slices which will do cache coherence protocol checks).
Information about L3 cache slices was reversed in several papers:
https://cmaurice.fr/pdf/raid15_maurice.pdf Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters
https://eprint.iacr.org/2015/690.pdf Systematic Reverse Engineering of Cache Slice Selection in Intel Processors
https://arxiv.org/pdf/1508.03767.pdf Cracking Intel Sandy Bridge’s Cache Hash Function
With recent chips programmer has ability to partition the L3 cache between applications "Cache Allocation Technology" (v4 Family): https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology https://software.intel.com/en-us/articles/introduction-to-code-and-data-prioritization-with-usage-models https://danluu.com/intel-cat/ https://lwn.net/Articles/659161/

Modern Intel L3 caches (since Nehalem) use a 64B line size, the same as L1/L2. They're shared, and inclusive.
See also http://www.realworldtech.com/nehalem/2/
Since SnB at least, each core has part of the L3, and they're on a ring bus. So in big Xeons, L3 size scales linearly with number of cores.
See also Which cache mapping technique is used in intel core i7 processor? where I wrote a much larger and more complete answer.

Line size of L1 and L2 caches

From a previous question on this forum, I learned that in most of the memory systems, L1 cache is a subset of the L2 cache means any entry removed from L2 is also removed from L1.
So now my question is how do I determine a corresponding entry in L1 cache for an entry in the L2 cache. The only information stored in the L2 entry is the tag information. Based on this tag information, if I re-create the addr it may span multiple lines in the L1 cache if the line-sizes of L1 and L2 cache are not same.
Does the architecture really bother about flushing both the lines or it just maintains L1 and L2 cache with the same line-size.
I understand that this is a policy decision but I want to know the commonly used technique.

Cache-Lines size is (typically) 64 bytes.
Moreover, take a look at this very interesting article about processors caches:
Gallery of Processor Cache Effects
You will find the following chapters:
Memory accesses and performance
Impact of cache lines
L1 and L2 cache sizes
Instruction-level parallelism
Cache associativity
False cache line sharing
Hardware complexities

In core i7 the line sizes in L1 , L2 and L3 are the same: that is 64 Bytes.
I guess this simplifies maintaining the inclusive property, and coherence.
See page 10 of: https://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf

The most common technique of handling cache block size in a strictly inclusive cache hierarchy is to use the same size cache blocks for all levels of cache for which the inclusion property is enforced. This results in greater tag overhead than if the higher level cache used larger blocks, which not only uses chip area but can also increase latency since higher level caches generally use phased access (where tags are checked before the data portion is accessed). However, it also simplifies the design somewhat and reduces the wasted capacity from unused portions of the data. It does not take a large fraction of unused 64-byte chunks in 128-byte cache blocks to compensate for the area penalty of an extra 32-bit tag. In addition, the larger cache block effect of exploiting broader spatial locality can be provided by relatively simple prefetching, which has the advantages that no capacity is left unused if the nearby chunk is not loaded (to conserve memory bandwidth or reduce latency on a conflicting memory read) and that the adjacency prefetching need not be limited to a larger aligned chunk.
A less common technique divides the cache block into sectors. Having the sector size the same as the block size for lower level caches avoids the problem of excess back-invalidation since each sector in the higher level cache has its own valid bit. (Providing all the coherence state metadata for each sector rather than just validity can avoid excessive writeback bandwidth use when at least one sector in a block is not dirty/modified and some coherence overhead [e.g., if one sector is in shared state and another is in the exclusive state, a write to the sector in the exclusive state could involve no coherence traffic—if snoopy rather than directory coherence is used].)
The area savings from sectored cache blocks were especially significant when tags were on the processor chip but the data was off-chip. Obviously, if the data storage takes area comparable to the size of the processor chip (which is not unreasonable), then 32-bit tags with 64-byte blocks would take roughly a 16th (~6%) of the processor area while 128-byte blocks would take half as much. (IBM's POWER6+, introduced in 2009, is perhaps the most recent processor to use on-processor-chip tags and off-processor data. Storing data in higher-density embedded DRAM and tags in lower-density SRAM, as IBM did, exaggerates this effect.)
It should be noted that Intel uses "cache line" to refer to the smaller unit and "cache sector" for the larger unit. (This is one reason why I used "cache block" in my explanation.) Using Intel's terminology it would be very unusual for cache lines to vary in size among levels of cache regardless of whether the levels were strictly inclusive, strictly exclusive, or used some other inclusion policy.
(Strict exclusion typically uses the higher level cache as a victim cache where evictions from the lower level cache are inserted into the higher level cache. Obviously, if the block sizes were different and sectoring was not used, then an eviction would require the rest of the larger block to be read from somewhere and invalidated if present in the lower level cache. [Theoretically, strict exclusion could be used with inflexible cache bypassing where an L1 eviction would bypass L2 and go to L3 and L1/L2 cache misses would only be allocated to either L1 or L2, bypassing L1 for certain accesses. The closest to this being implemented that I am aware of is Itanium's bypassing of L1 for floating-point accesses; however, if I recall correctly, the L2 was inclusive of L1.])

Typically, in one access to the main memory 64 bytes of data and 8 bytes of parity/ECC (I don't remember exactly which) is accessed. And it is rather complicated to maintain different cache line sizes at the various memory levels. You have to note that cache line size would be more correlated to the word alignment size on that architecture than anything else. Based on that, a cache line size is highly unlikely to be different from memory access size. Now, the parity bits are for the use of the memory controller - so cache line size typically is 64 bytes. The processor really controls very little beyond the registers. Everything else going on in the computer is more about getting hardware in to optimize CPU performance. In that sense also, it really would not make any sense to import extra complexity by making cache line sizes different at different levels of memory.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio