What is the L2 cache accessPolicyWindow introduced in CUDA 11?

A new runtime API was introduced in CUDA 11 to fine-tune the L2 cache access policy, but I do not fully understand the meaning of policy fields such as hitRatio or how they can be used in practice.
In particular, the CUDA API documentation says of cudaAccessPolicyWindow:
Specifies an access policy for a window, a contiguous extent of memory
beginning at base_ptr and ending at base_ptr + num_bytes. Partition
into many segments and assign segments such that. sum of "hit
segments" / window == approx. ratio. sum of "miss segments" / window
== approx 1-ratio. Segments and ratio specifications are fitted to the capabilities of the architecture. Accesses in a hit segment apply the
hitProp access policy. Accesses in a miss segment apply the missProp
access policy.
My questions:
How is the contiguous extent of memory "partitioned" into segments? Are these partitions decided statically based on the hit prop, such that the hit or miss attribute of a segment remains unchanged once assigned, or are there running counters that dynamically adjust the assignment, e.g. on a per-access basis?
How should these attributes be applied to optimize performance in practice? E.g., to put it in an over-simplified and naive way: suppose I have 1 MB of L2 cache set aside, should I create a window of 1 MB for my most frequently used data and set hitRatio to 1, or should I create a window of 2 MB and set hitRatio to 0.5? Why?

How is the contiguous extent of memory "partitioned" into segments? Are these partitions decided statically based on the hit prop, such that the hit or miss attribute of a segment remains unchanged once assigned, or are there running counters that dynamically adjust the assignment, e.g. on a per-access basis?
I don't think any of this is explicitly specified. However, we can make some fairly solid (I believe) conjectures based on two ideas:
NVIDIA GPU L2 caches have had a 32-byte basic line size for a "long time". This aligns with the design of the DRAM subsystem, which has 32-byte segment boundaries.
The documentation describes the selection of which lines are made persistent in the cache as "random". This pretty much implies it is not tied to access patterns.
Assembling these ideas, I would say that the partitioning is done at a granularity of no less than an L2 line (32 bytes), and it may be a coarser granularity if required by the L2 tag/TLB system; most of these L2 details are unpublished by NVIDIA. Once a segment is selected at random, I would not expect the selection to change based on access pattern.
How should these attributes be applied to optimize performance in practice? E.g., to put it in an over-simplified and naive way: suppose I have 1 MB of L2 cache set aside, should I create a window of 1 MB for my most frequently used data and set hitRatio to 1, or should I create a window of 2 MB and set hitRatio to 0.5? Why?
There is not enough information to give a specific answer to this question. The fact that you have decided to set aside 1MB of L2 cache is the first key piece of information, but the second key piece is how much data you actually have to cache. In addition, the expected access pattern matters. Let's cover several cases:
1MB cache set aside, 1MB window, hitRatio 1. This implies that you only have 1MB of data that you would like to cache with this mechanism. With a 1MB cache and a 1MB window, there isn't any reason to choose anything other than 1 for hitRatio. If you only have 1MB of data to cache, and you can afford to carve off 1MB of L2, this is a completely sensible choice. You're essentially guaranteeing that no other activity can "evict" this "protected" data, once it appears in the cache carve-out.
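For concreteness, case 1 might be set up roughly like this with the CUDA runtime API (a minimal sketch, not taken from any particular application; error checking is omitted and the allocation/stream names are placeholders):

    #include <cuda_runtime.h>

    int main() {
        const size_t window_bytes = 1 << 20;      // the 1 MB of frequently reused data

        void *dev_ptr = nullptr;                  // placeholder for the data to protect
        cudaMalloc(&dev_ptr, window_bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Carve 1 MB of L2 out for persisting accesses (the driver clamps this
        // to the device's persistingL2CacheMaxSize).
        cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, window_bytes);

        // 1 MB window over the data with hitRatio = 1: every line in the window
        // is eligible to persist in the carve-out; anything outside it streams.
        cudaStreamAttrValue attr = {};
        attr.accessPolicyWindow.base_ptr  = dev_ptr;
        attr.accessPolicyWindow.num_bytes = window_bytes;
        attr.accessPolicyWindow.hitRatio  = 1.0f;
        attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
        attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
        cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

        // ... launch the kernels that reuse dev_ptr into `stream` here ...

        cudaStreamDestroy(stream);
        cudaFree(dev_ptr);
        return 0;
    }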
1MB cache set aside, 2MB window, hitRatio 0.5. This implies of course that you have at least 2MB of data that you would like to cache with this mechanism, so this is not directly comparable to case 1 above. The hitRatio of 0.5 can be thought of as a "guard" against thrashing. Let's consider several sub-cases:
A. Suppose your 2MB of data is broken into 1MB regions A and B, and your code accesses all the data in region A (once), then all the data in region B (once), then all the data in region A (once), etc. If you chose a hitRatio of 1 with a 1MB cache carve-out but a 2MB window, this setup would thrash: you'd fill the cache with region A, then evict and fill with region B, then evict and fill with region A, etc. Without a hitRatio mechanism/control, this kind of behavior would be unavoidable for this kind of scenario. So if you anticipate this kind of cyclic access pattern, a hitRatio of 0.5 would be better (50% of your data accesses would benefit from the L2 carve-out, 50% would not) rather than getting no benefit from the cache at all.
B. Suppose your 2MB of data is accessed with high temporal locality: your code accesses 1MB of the data (say, region A) repeatedly, then accesses the other 1MB (region B) repeatedly. In this case, a carve-out might not be needed at all. If you wanted to use a carve-out, a hitRatio of 1 might make sense, because it means the carve-out behaves more or less as an ordinary cache, with the exception that the cacheable window is user-defined. In this example, the cache would fill up with your first 1MB of region A data, and your code would benefit from the cache as it reused that data. When your code switched modality and started to use the second 1MB (region B), that second 1MB would evict the first, and as your code used the second 1MB repeatedly, it would again derive full benefit from the cache.
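Case 2 would only change the window parameters, and the persisting lines can be dropped once the phase is over (again only a sketch; it reuses the attr, stream, and dev_ptr from the previous snippet, with dev_ptr now assumed to point at a 2 MB allocation):

    // Case 2: a 2 MB window with hitRatio = 0.5, meaning roughly half of the
    // window's lines, chosen at random, get the persisting property and the
    // other half the streaming property.
    attr.accessPolicyWindow.base_ptr  = dev_ptr;
    attr.accessPolicyWindow.num_bytes = 2 * (1 << 20);   // 2 MB window
    attr.accessPolicyWindow.hitRatio  = 0.5f;            // guards the 1 MB carve-out against thrashing
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

    // ... kernels with the cyclic A / B access pattern go here ...

    // Once this phase is over, drop all persisting lines so the full L2 is
    // available to other data again.
    cudaCtxResetPersistingL2Cache();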


How does cache associativity impact performance [duplicate]

This question already has answers here:
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
I am reading "Pro .NET Benchmarking" by Andrey Akinshin, and one thing puzzles me (p. 536): the explanation of how cache associativity impacts performance. In a test, the author used three square arrays of ints, 1023x1023, 1024x1024, and 1025x1025, and observed that accessing the first column was slower in the 1024x1024 case.
The author explained (background: the CPU is an Intel with a 32 KB, 8-way associative L1 cache):
When N=1024, this difference is exactly 4096 bytes; it equals the
critical stride value. This means that all elements from the first
column match the same eight cache lines of L1. We don’t really have
performance benefits from the cache because we can’t use it
efficiently: we have only 512 bytes (8 cache lines * 64-byte cache
line size) instead of the original 32 kilobytes. When we iterate the
first column in a loop, the corresponding elements pop each other from
the cache. When N=1023 and N=1025, we don’t have problems with the
critical stride anymore: all elements can be kept in the cache, which
is much more efficient.
So it looks like the penalty comes from somehow shrinking the cache, just because main memory cannot be mapped to the full cache.
That strikes me as odd; after reading the Wikipedia page, I would say the performance penalty comes from resolving address conflicts: since each row can potentially be mapped to the same cache lines, it is conflict after conflict, and the CPU has to resolve those, which takes time.
Thus my question: what is the real nature of the performance problem here? Is the accessible size of the cache effectively lower, or is the entire cache available but the CPU spends more time resolving mapping conflicts? Or is there some other reason?
Caching is a layer between two other layers. In your case, between the CPU and RAM. At its best, the CPU rarely has to wait for something to be fetched from RAM. At its worst, the CPU usually has to wait.
The 1024 example hits a bad case. For that entire column all words requested from RAM land in the same cell in cache (or the same 2 cells, if using a 2-way associative cache, etc).
Meanwhile, the CPU does not care -- it asks the cache for a word from memory; the cache either has it (fast access) or needs to reach into RAM (slow access) to get it. And RAM does not care -- it responds to requests, whenever they come.
Back to 1024. Look at the layout of that array in memory. The cells of a row are in consecutive words of RAM; when one row is finished, the next row starts. With a little thought, you can see that consecutive cells in a column have addresses differing by 1024*S bytes, where S = 4 or 8 (or whatever the size of a cell). That is a power of 2.
Now let's look at the relatively trivial architecture of a cache. (It is 'trivial' because it needs to be fast and easy to implement.) It simply takes several bits out of the address to form the address in the cache's "memory".
Because of the power of 2, those bits will always be the same -- hence the same slot is accessed. (I left out a few details, like how many bits are needed, hence the size of the cache, 2-way associativity, etc.)
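To make that concrete, here is a tiny sketch of the bit-picking for the cache geometry in the question (32 KB, 8-way, 64-byte lines; the numbers are only illustrative of a simple set-index mapping, real CPUs differ in details):

    #include <cstdio>

    // Illustration of "picking a few bits out of the address" for the cache in
    // the question: 32 KB, 8-way associative, 64-byte lines.
    // Number of sets = 32768 / (64 * 8) = 64, so the set index is bits 6..11
    // of the address.
    int main() {
        const unsigned lineSize = 64, ways = 8, cacheSize = 32 * 1024;
        const unsigned numSets  = cacheSize / (lineSize * ways);     // 64 sets

        // Walking down the first column of a 1024 x 1024 int matrix: the byte
        // stride is 1024 * 4 = 4096, a multiple of lineSize * numSets, so the
        // set index never changes and only the 8 ways of one set get reused.
        for (unsigned row = 0; row < 4; ++row) {
            unsigned byteOffset = row * 1024 * 4;                    // offset of element [row][0]
            std::printf("row %u -> set %u\n", row, (byteOffset / lineSize) % numSets);
        }
        return 0;
    }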
A cache is useful when the process above it (CPU) fetches an item (word) more than once before that item gets bumped out of cache by some other item needing the space.
Note: This is talking about the CPU->RAM cache, not disk controller caching, database caches, web site page caches, etc, etc; they use more sophisticated algorithms (often hashing) instead of "picking a few bits out of an address".
Back to your Question...
So it looks like the penalty comes from somehow shrinking the cache, just because main memory cannot be mapped to the full cache.
There are conceptual problems with that quote.
Main memory is not "mapped to a cache"; see virtual versus real addresses.
The penalty comes when the cache does not have the desired word.
"shrinking the cache" -- The cache is a fixed size, based on the hardware involved.
Definition: in this context, a "word" is a consecutive string of bytes from RAM. It is always(?) a power-of-2 number of bytes, positioned at some multiple of that size in the real address space. The "word" size for caching depends on the vintage of the CPU, the level of cache, etc.; 4-, 8-, and 16-byte words can probably be found today. Again, the power-of-2 size and the positioning at a multiple of it are simple optimizations.
Back to your 1K*1K array of, say, 4-byte numbers. That adds up to 4MB, plus or minus (for 1023, 1025). If you have 8MB of cache, the entire array will eventually get loaded, and further actions on the array will be faster due to being in the cache. But if you have, say, 1MB of cache, some of the array will get in the cache, then be bumped out -- repeatedly. It might not be much better than if you had no cache.
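For anyone who wants to poke at this themselves, here is a rough sketch of the kind of measurement the question describes (not the book's code; the iteration count is arbitrary):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Sums the first column of an N x N int matrix stored row-major, so
    // consecutive accesses are N * sizeof(int) bytes apart.
    static long long sumFirstColumn(const std::vector<int>& m, int n) {
        long long sum = 0;
        for (int row = 0; row < n; ++row)
            sum += m[(std::size_t)row * n];
        return sum;
    }

    int main() {
        for (int n : {1023, 1024, 1025}) {
            std::vector<int> m((std::size_t)n * n, 1);
            auto t0 = std::chrono::steady_clock::now();
            long long total = 0;
            for (int rep = 0; rep < 5000; ++rep)   // repeat to get a measurable time
                total += sumFirstColumn(m, n);
            auto t1 = std::chrono::steady_clock::now();
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            std::printf("N=%d  checksum=%lld  time=%lld us\n", n, total, (long long)us);
        }
        return 0;
    }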

Using texture cache versus coalesced global memory with low cache hit rate?

In the process of optimizing and profiling a kernel, I noticed that its L2 and global cache hit rate was very low (~1.2% on average). My kernel typically reads 4 full cache lines per pass per warp, with 3 blocks per SM (so 4 * 32 * 2 = 256 cache lines per SM per pass of my kernel, which has a variable number of passes). The reads come from different regions of global memory, which is obviously hard to cache. (The pattern of the regions is A, 32 * B, A, ...)
It is clear, then, that for data which is so "dispersed" and read only once before moving on, the L1/L2 cache is almost useless. To compensate for the vastness of my kernel's reads, I am considering using texture memory, which is "pre-cached" in L1.
Can it be considered "good" practice to do this?
Side question 1: if the accesses to that texture are coalesced (supposing row-major layout), does it still have a performance gain over non-coalesced texture reads?
Side question 2: since my data is read in such a way that each warp reads one row, is a 2D texture really that useful? Or is a 1D layered texture better for the job?
Sorry if the side questions are already answered elsewhere; they went through my mind while I was writing, and a quick search (probably using the wrong vocabulary) did not yield an answer. Sorry if my question is dumb; my CUDA literature is limited to the NVIDIA documentation.
In the end, the texture-based implementation did not bring much. From what I understand, while the cache hit rate went up (~50%), there definitely is an overhead in the cache hierarchy or the texture units.
Takeaway (not application-specific):
Texture memory comes with a slight overhead, which makes it worth it only in situations where the filtering it provides is a benefit AND the whole texture can fit in the caches, giving perfectly cached 2D memory that is resistant to non-coalesced accesses.
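For what it's worth, one low-overhead way to route such reads through the read-only/texture cache path on current GPUs is __ldg(); the sketch below contrasts it with a plain global load (the kernel, its launch geometry, and all names are purely illustrative, not the asker's code):

    #include <cuda_runtime.h>

    // Each warp sums one row of a row-major matrix. The commented-out line is
    // the plain coalesced global load; the active line routes the same access
    // through the read-only (texture) data cache via __ldg(). Launch with
    // blockDim.x == 32 (one warp per row of the block).
    __global__ void sumRows(const float* __restrict__ in, float* out,
                            int rowLen, int numRows)
    {
        int row  = blockIdx.x * blockDim.y + threadIdx.y;
        int lane = threadIdx.x;
        if (row >= numRows) return;            // whole warp shares `row`, so no divergence here

        float acc = 0.0f;
        for (int col = lane; col < rowLen; col += 32) {
            // acc += in[(size_t)row * rowLen + col];           // plain global load
            acc += __ldg(&in[(size_t)row * rowLen + col]);      // read-only / texture path
        }

        // Warp-level reduction of the partial sums.
        for (int offset = 16; offset > 0; offset >>= 1)
            acc += __shfl_down_sync(0xffffffffu, acc, offset);
        if (lane == 0) out[row] = acc;
    }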

Size of neighbouring data a modern computer caches to favour locality

I have a contiguous memory region of 1024 buffers, each 2 KB in size. I use a linked list to keep track of the available buffers (a buffer here can be thought of as being used by a producer and a consumer). After some operations, the order of the buffers in the linked list becomes random.
Modern computer architectures strongly favour compact data and locality; they cache neighbouring data when a location needs to be accessed. The cache line of my computer is 64 bytes (corrected from 64K).
Question 1: in my case, are there a lot of cache misses because my access pattern is random?
Question 2: what is the size of the neighbouring data a modern computer caches? I think that if you access a location in an array of integers, it will cache the neighbouring integers. But my unit of data (2 KB) is much larger than an int (4 bytes), so I am not sure how many neighbours will be cached.
First of all, I doubt that "the cache line of my computer is 64K bytes"; it is most likely only 64 bytes. Let me try to answer your questions:
Question 1: in my case, are there a lot of cache misses because my access pattern is random?
Not necessarily. It depends on how many operations you do on a buffer once it is cached.
If you cache a 2 KB buffer and do lots of sequential work on it, your cache hit rate will be good. As Paul suggested, this works even better with hardware prefetching enabled.
However, if you constantly jump between buffers and do a relatively small amount of work on each one, the cache hit rate will drop.
That said, 1024 x 2 KB = 2 MB, which could be the size of an L2 cache (if you also have an L3, the L2 is generally smaller). So even if you miss L1, there is a high chance that in both cases you will hit L2.
Question 2: what is the size of the neighbouring data a modern computer caches?
Usually the number of neighbors fetched is given by the cache line size. If the line size is 64B, you could fetch 16 integer values. So on each read, you fill a cache line. However you need to take into consideration prefetching. If your CPU detects that the memory reads are sequential, it will prefetch more neighbors and bring more cache lines in advance.
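Putting numbers on that for the setup in the question (a throwaway sketch that assumes 64-byte lines and 4-byte ints):

    #include <cstdio>

    // Quick arithmetic for the numbers in the question, assuming 64-byte
    // cache lines and 4-byte ints.
    int main() {
        const int lineSize   = 64;          // bytes per cache line
        const int intSize    = 4;           // bytes per int
        const int bufferSize = 2 * 1024;    // one 2 KB buffer
        const int numBuffers = 1024;

        std::printf("ints brought in per cache line : %d\n", lineSize / intSize);     // 16
        std::printf("cache lines per 2 KB buffer    : %d\n", bufferSize / lineSize);  // 32
        std::printf("cache lines for all buffers    : %d\n",
                    numBuffers * (bufferSize / lineSize));                            // 32768
        return 0;
    }

So touching one whole buffer pulls in up to 32 lines (one per 64-byte chunk actually accessed, plus whatever the prefetcher adds); jumping to a random other buffer gets no reuse out of those lines beyond whatever still sits in L2.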
Hope this helps!

Does larger cache size always lead to improved performance?

Cache inside the processor increases instruction execution speed, so I'm wondering: what if we increase the size of the cache to many MBs, or even to something like 1 GB? Is that possible? If it is, will increasing the cache size always result in increased performance?
There is a trade-off between cache size and hit rate on one side, and read latency and power consumption on the other. So the answer to your first question is: technically (probably) possible, but unlikely to make sense, since the L3 cache in modern CPUs, at just a few MB in size, already has a read latency of dozens of cycles.
Performance depends more on the memory access pattern than on cache size. More precisely, if the program is mainly sequential, cache size is not a big deal; if there is a lot of random access (e.g. when associative containers are actively used), cache size really matters.
The above is true for a single computational task. In a multiprocess environment with several active processes, a bigger cache is always better because it reduces interprocess contention.
This is a simplification, but one of the primary reasons a cache increases 'speed' is that it provides fast memory very close to the processor, which is much faster to access than main memory. So, in theory, increasing the size of the cache should allow more information to be stored in this 'fast' memory and thereby improve performance. In the real world things are obviously much more complex than this, and there will of course be added complexity and cost associated with such a large cache, and with dealing with issues like cache coherency, caching algorithms, etc.
A cache stores data temporarily; it is there so that frequently used data can be found quickly. If you grew the cache to 1 GB or more it would no longer act as a cache, it would effectively become another RAM. Part of what makes a cache fast is precisely that it is small: the processor can access data in it directly, without waiting for RAM to locate and fetch the data from its much larger capacity. As an analogy, think of a large classroom (RAM), the principal (the processor), and the class representative (the data). If the principal asks for the class representative, someone has to find them among 1000 students, which takes time. If we reserve a known seat (the cache) for the class representative, because the principal calls on them most often, they can be found immediately.

Line size of L1 and L2 caches

From a previous question on this forum, I learned that in most memory systems the L1 cache is a subset of the L2 cache, meaning that any entry removed from L2 is also removed from L1.
So now my question is: how do I determine the corresponding entry in the L1 cache for an entry in the L2 cache? The only information stored in an L2 entry is the tag. If I reconstruct the address from this tag, it may span multiple lines in the L1 cache if the line sizes of the L1 and L2 caches are not the same.
Does the architecture really bother to flush all the corresponding L1 lines, or does it simply keep the L1 and L2 caches at the same line size?
I understand that this is a policy decision but I want to know the commonly used technique.
Cache line size is (typically) 64 bytes.
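If you want to check the line size on your own machine, glibc exposes it through sysconf (a Linux-specific sketch; other platforms report it differently, e.g. sysctl on macOS or GetLogicalProcessorInformation on Windows):

    #include <unistd.h>
    #include <cstdio>

    // Query the L1 data cache line size at run time (Linux/glibc only).
    int main() {
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        std::printf("L1 data cache line size: %ld bytes\n", line);
        return 0;
    }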
Moreover, take a look at this very interesting article about processors caches:
Gallery of Processor Cache Effects
You will find the following chapters:
Memory accesses and performance
Impact of cache lines
L1 and L2 cache sizes
Instruction-level parallelism
Cache associativity
False cache line sharing
Hardware complexities
In the Core i7, the line sizes in L1, L2, and L3 are the same: 64 bytes.
I guess this simplifies maintaining the inclusive property, and coherence.
See page 10 of: https://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
The most common technique of handling cache block size in a strictly inclusive cache hierarchy is to use the same size cache blocks for all levels of cache for which the inclusion property is enforced. This results in greater tag overhead than if the higher level cache used larger blocks, which not only uses chip area but can also increase latency since higher level caches generally use phased access (where tags are checked before the data portion is accessed). However, it also simplifies the design somewhat and reduces the wasted capacity from unused portions of the data. It does not take a large fraction of unused 64-byte chunks in 128-byte cache blocks to compensate for the area penalty of an extra 32-bit tag. In addition, the larger cache block effect of exploiting broader spatial locality can be provided by relatively simple prefetching, which has the advantages that no capacity is left unused if the nearby chunk is not loaded (to conserve memory bandwidth or reduce latency on a conflicting memory read) and that the adjacency prefetching need not be limited to a larger aligned chunk.
A less common technique divides the cache block into sectors. Having the sector size the same as the block size for lower level caches avoids the problem of excess back-invalidation since each sector in the higher level cache has its own valid bit. (Providing all the coherence state metadata for each sector rather than just validity can avoid excessive writeback bandwidth use when at least one sector in a block is not dirty/modified and some coherence overhead [e.g., if one sector is in shared state and another is in the exclusive state, a write to the sector in the exclusive state could involve no coherence traffic—if snoopy rather than directory coherence is used].)
The area savings from sectored cache blocks were especially significant when tags were on the processor chip but the data was off-chip. Obviously, if the data storage takes area comparable to the size of the processor chip (which is not unreasonable), then 32-bit tags with 64-byte blocks would take roughly a 16th (~6%) of the processor area while 128-byte blocks would take half as much. (IBM's POWER6+, introduced in 2009, is perhaps the most recent processor to use on-processor-chip tags and off-processor data. Storing data in higher-density embedded DRAM and tags in lower-density SRAM, as IBM did, exaggerates this effect.)
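Spelling out that arithmetic (a throwaway sketch; it just computes tag bits as a fraction of data bits, which, under the paragraph's assumption that the off-chip data array is roughly the size of the processor chip, approximates the fraction of chip area spent on tags):

    #include <cstdio>

    // Tag overhead for a 32-bit tag per cache block, as a fraction of the
    // data bits in the block.
    int main() {
        const double tagBits = 32.0;
        const int blockSizes[] = {64, 128};
        for (int blockBytes : blockSizes) {
            double dataBits = blockBytes * 8.0;
            std::printf("%3d-byte blocks: 32 / %.0f = %.1f%% tag overhead\n",
                        blockBytes, dataBits, 100.0 * tagBits / dataBits);  // ~6.3% and ~3.1%
        }
        return 0;
    }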
It should be noted that Intel uses "cache line" to refer to the smaller unit and "cache sector" for the larger unit. (This is one reason why I used "cache block" in my explanation.) Using Intel's terminology it would be very unusual for cache lines to vary in size among levels of cache regardless of whether the levels were strictly inclusive, strictly exclusive, or used some other inclusion policy.
(Strict exclusion typically uses the higher level cache as a victim cache where evictions from the lower level cache are inserted into the higher level cache. Obviously, if the block sizes were different and sectoring was not used, then an eviction would require the rest of the larger block to be read from somewhere and invalidated if present in the lower level cache. [Theoretically, strict exclusion could be used with inflexible cache bypassing where an L1 eviction would bypass L2 and go to L3 and L1/L2 cache misses would only be allocated to either L1 or L2, bypassing L1 for certain accesses. The closest to this being implemented that I am aware of is Itanium's bypassing of L1 for floating-point accesses; however, if I recall correctly, the L2 was inclusive of L1.])
Typically, one access to main memory transfers 64 bytes of data plus 8 bytes of parity/ECC (I don't remember exactly which), and it is rather complicated to maintain different cache line sizes at the various memory levels. Note that the cache line size is more correlated with the word alignment size of the architecture than with anything else; based on that, a cache line size is highly unlikely to differ from the memory access size. The parity bits are for the use of the memory controller, so the cache line size is typically 64 bytes. The processor really controls very little beyond the registers; everything else going on in the computer is more about getting the hardware to optimize CPU performance. In that sense too, it really would not make sense to introduce extra complexity by making cache line sizes different at different levels of memory.
