Avoid L1 cache pollution on GCN device

I have a kernel that writes results to a global buffer; these results are never read back by that kernel (they are processed by another kernel at a later time).
So I don't want this data sitting in the L1 cache if I can help it. Is there a way of ensuring that it is not cached? I need L1 for another array that is frequently read from and written to. That array is around 4 KB, so it should stay resident in L1.
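No answer is recorded for this one, but as a hedged sketch of the idea in CUDA terms: GCN is AMD hardware, so this is only an analogy (on the AMD side, clang/HIP's __builtin_nontemporal_store is, as far as I know, the closest corresponding knob; take that as an assumption rather than a verified recipe). CUDA exposes streaming-store intrinsics that map onto the PTX cache operators discussed further down. Kernel and buffer names are made up:

// Hypothetical kernel: write-only results are stored with __stcs
// ("cache streaming", st.global.cs in PTX), which marks the lines
// evict-first so they don't displace the hot 4 KB array from L1.
__global__ void produce_results(float* __restrict__ out,
                                const float* __restrict__ in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float r = in[i] * 2.0f;   // stand-in for the real computation

    __stcs(&out[i], r);       // streaming store: evict-first cache hint
}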

Related

Can CUDA atomic operations use the L1 cache?

cc: 7.5 Windows: 10.0 cuda: 11.7
I'm performing a bunch of atomic operations on device memory. Every thread in a warp is operating on a consecutive uint32_t. And every warp in the block updates those same values, before they all move on to the next line.
Since I'm not using any shared memory, I was hoping L1 would be used to cache the device memory, effectively giving me an atomicAnd at shared-memory speed without all the overhead and headaches of __syncthreads() and copying the data around.
But the performance suggests that's not what's happening.
Indeed, looking at Nsight, it's saying there's a 0% hit rate in the L1 cache. Ouch. The Memory Workload Analysis also shows 0% Hit under Global Atomic ALU.
Google turned up one hit (somewhat dated) suggesting that atomics are always done via L2 for device memory. Not exactly an authoritative source, but it matches what I'm seeing. On the other hand, there's this, which seems to suggest it does (did?) go through L1. A more authoritative source, but not exactly on point.
Could I have something misconfigured? Maybe my code isn't doing what I think it is? Or do atomic operations on device memory always go through L2?
I tried using RED instead of atomics, but that didn't make any difference.
I also tried using atomicAnd_block instead of just atomicAnd, and somehow that made things even slower? Not what I expected.
I'd like to experiment with redux.sync, but cc 8.0 isn't an option for me yet. __shfl_sync turned out to be disappointing (performance-wise).
At this point I'm inclined to believe that on cc 7.5, atomics on device memory always go through L2. But if someone has evidence to the contrary, I can keep digging.
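For reference, the workaround the answer below points at is to do the staging in shared memory explicitly, despite the headaches. A minimal sketch, assuming the access pattern described above (every warp of a block ANDs into the same 32 consecutive words; kernel and buffer names are invented):

// Stage the 32 hot words in shared memory, AND into them with cheap
// on-chip atomics, then flush once per block -- 32 global (L2) atomics
// per block instead of 32 per warp.
__global__ void block_staged_and(unsigned* gmem, const unsigned* src)
{
    __shared__ unsigned stage[32];
    int lane = threadIdx.x & 31;

    if (threadIdx.x < 32)
        stage[lane] = 0xffffffffu;   // AND identity
    __syncthreads();

    unsigned v = src[blockIdx.x * blockDim.x + threadIdx.x];
    atomicAnd(&stage[lane], v);      // shared-memory atomic: stays on-chip
    __syncthreads();

    if (threadIdx.x < 32)
        atomicAnd(&gmem[lane], stage[lane]);   // one L2 atomic per word per block
}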
As usual with Nvidia, concrete information is hard to come by. But we can have a look at the PTX documentation and infer a few things.
Atomic load and store
Atomic loads and stores use variations of the regular ld and st instructions, which follow these patterns:
ld{.weak}{.ss}{.cop}{.level::cache_hint}{.level::prefetch_size}{.vec}.type d, [a]{, cache-policy};
ld.sem.scope{.ss}{.level::eviction_priority}{.level::cache_hint}{.level::prefetch_size}{.vec}.type d, [a]{, cache-policy};
st{.weak}{.ss}{.cop}{.level::cache_hint}{.vec}.type [a], b{, cache-policy};
st.sem.scope{.ss}{.level::eviction_priority}{.level::cache_hint}{.vec}.type [a], b{, cache-policy};
Weak loads and stores are regular memory operations. The cop part specifies the cache behavior. For our purposes, there are ld.cg (cache-global), which uses only the L2 cache, and ld.ca (cache-all), which uses both L1 and L2. As the documentation notes:
Global data is coherent at the L2 level, but multiple L1 caches are not coherent for global data. If one thread stores to global memory via one L1 cache, and a second thread loads that address via a second L1 cache with ld.ca, the second thread may get stale L1 cache data, rather than the data stored by the first thread. The driver must invalidate global L1 cache lines between dependent grids of parallel threads. Stores by the first grid program are then correctly fetched by the second grid program issuing default ld.ca loads cached in L1.
Similarly, there is st.cg which caches only in L2. It "bypasses the L1 cache." The wording isn't precise but it sounds as if this should invalidate the L1 cache. Otherwise even within a single thread, a sequence of ld.ca; st.cg; ld.ca would read stale data and that sounds like an insane idea.
The second relevant cop for writes is st.wb (write-back). The wording in the documentation is very weird. I take it to mean the write goes to the L1 cache and is only written back to L2 and beyond when the line is evicted.
The ld.sem and st.sem forms (where sem is relaxed or acquire for loads, and relaxed or release for stores) are the true atomic loads and stores. scope gives the, well, scope of the synchronization, meaning for example whether an acquire synchronizes within a thread block or across the whole GPU.
Notice how these operations have no cop element. So you cannot even specify a cache layer. You can give cache hints but I don't see how those are sufficient to specify the desired semantics. cache_hint and cache-policy only work on L2.
Only the eviction_priority mentions L1. But just because that performance hint is accepted does not mean it has any effect. I assume it works for weak memory operations but for atomics, only the L2 policies have any effect. But this is just conjecture.
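To make the asymmetry concrete, here are two hand-rolled inline-PTX wrappers (my own, not from the question; the relaxed form needs sm_70+). The weak load has a cop slot to request L2-only caching; the atomic load's grammar has no such slot:

// Weak load with a cache operator: cache in L2 only, not L1.
__device__ unsigned ld_weak_cg(const unsigned* p)
{
    unsigned v;
    asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(v) : "l"(p));
    return v;
}

// Relaxed atomic load, GPU scope. There is nowhere to put .ca or .cg.
__device__ unsigned ld_relaxed_gpu(const unsigned* p)
{
    unsigned v;
    asm volatile("ld.relaxed.gpu.global.u32 %0, [%1];"
                 : "=r"(v) : "l"(p) : "memory");
    return v;
}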
Atomic Read-modify-write
The atom instruction is used for atomic exchange, compare-and-swap, addition, etc. red is used for reductions. They have the following structure:
atom{.sem}{.scope}{.space}.op{.level::cache_hint}.type d, [a], b{, cache-policy};
red{.sem}{.scope}{.space}.op{.level::cache_hint}.type [a], b{, cache-policy};
With these elements:
sem: memory synchronization behavior, such as acquire, release, or relaxed
scope: memory synchronization scope, e.g. acquire-release within a CTA (thread block) or GPU
space: global or shared memory
level::cache_hint and cache-policy: cache-eviction hints, but there are no options for L1, only L2
Given that there is no way to specify L1 caching or write-back behavior, there is no way of using atomic RMW operations on L1 cache. This makes a lot of sense to me. Why should the GPU waste transistors on implementing this? Shared memory exists for the exact purpose of allowing fast memory operations within a thread block.
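As a sanity check, a similar hand-written wrapper around the red instruction discussed above; again the grammar offers no way to ask for L1:

// Inline-PTX reduction -- roughly what atomicAnd() with an ignored
// return value compiles down to. No cache-operator slot exists to fill.
__device__ void red_global_and(unsigned* p, unsigned v)
{
    asm volatile("red.global.and.b32 [%0], %1;" :: "l"(p), "r"(v) : "memory");
}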

What happens in CPU, cache and memory when CPU is instructed to store data to memory?

Let's suppose the memory hierarchy is one CPU with L1i, L1d, L2i, L2d, L3, and DRAM.
I'm wondering what happens at the lower levels of the computer when I use MOV/store instruction (or any other instruction that will cause CPU transfer data to memory)? I know what happens if there is just CPU and memory, but with the caches I'm a bit confused. I've searched for this, but it only yielded information about data transfer between:
registers and memory
CPU and cache
cache and memory
I'm trying to understand more about this: when will the cache write through, and when will it write back? I only know that write-through updates the cache line and the corresponding memory line immediately, while write-back defers the memory update until the line is replaced. Can the two policies coexist? In write-through, does the data go directly to memory? And in write-back, does the data pass down through the cache hierarchy?
What caused my confusion is volatile in C/C++. As I understand it, such variables are stored directly to memory, bypassing the cache. Am I right? So if I define a volatile variable and a normal variable like an int, how does the CPU distinguish which writes go directly to memory and which go through the cache hierarchy?
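A small C++ sketch (invented names) of what volatile actually buys you: it constrains the compiler, not the memory hierarchy. Both stores below go through L1d like any other store; truly bypassing the cache needs special instructions (e.g. x86 non-temporal stores) or memory the OS has mapped as uncacheable:

volatile int flag = 0;   // compiler may not cache this in a register
int scratch = 0;         // compiler may keep this in a register freely

void producer()
{
    scratch = 42;   // store may be deferred or elided by the optimizer
    flag = 1;       // a real store instruction MUST be emitted here --
                    // but it still goes through the cache hierarchy
}

void consumer()
{
    while (flag == 0) {
        // flag MUST be re-loaded from memory each iteration; without
        // volatile the optimizer could hoist the load and spin forever.
        // After the first miss these loads all hit in L1d.
    }
}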
Is there any instruction that can control the cache? If not, how is the cache controlled? Some other hardware? The OS? A cache controller (if such a thing exists)?
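There are indeed such instructions. On x86, for example, clflush evicts the cache line holding a given address from every level (writing it back first if dirty), and it's exposed as a compiler intrinsic:

#include <emmintrin.h>   // SSE2; provides _mm_clflush

// Flush the cache line containing p from the whole hierarchy,
// writing it back to memory first if it is dirty.
void flush_line(const void* p)
{
    _mm_clflush(p);
}

Beyond explicit instructions like this, caching is managed automatically by the hardware, with the OS setting per-region policy (e.g. marking ranges uncacheable or write-combining via MTRRs/PAT on x86).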

cache to memory mapping

When a cache is first designed, is it mapped to some random memory addresses, or is it empty at the beginning and filled with data from memory or a lower-level cache only after a load or store instruction from the processor?
I ask because I have designed the RTL for an L1 cache. Should I leave it blank and wait for the processor to request a read/write, or fill it with some memory-mapped data and then determine hit/miss accordingly?
First designed? Do you mean first powered on? The normal way would be to start out with all the tags invalid (so it doesn't matter what's in the data arrays or anywhere else).
It's easy to imagine bugs if all the data in your cache were randomly initialized: some lines would be valid, not dirty, and have contents different from what's actually in RAM/ROM, so obviously you shouldn't do that. E.g. a hit in this out-of-sync L1 for boot-ROM code would be bad!
If any part of memory is initialized at power-on to known contents (like all-zeros), you could in theory init your cache tags and data so it's caching that memory.
If you init your cache as valid for anywhere that doesn't match what's in memory, you'd need to initialize it as dirty, which would trigger a writeback when the lines are evicted in favour of whatever the CPU actually needs, so that makes no sense.
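A toy software model (all names invented) of that reset strategy: only the tag/state bits need a defined value, because a line with valid == false can never produce a hit, so whatever garbage sits in its data array is never observed:

#include <cstdint>

struct CacheLine {
    bool     valid;
    bool     dirty;
    uint32_t tag;
    uint8_t  data[64];   // don't-care while !valid
};

CacheLine lines[512];

void reset_cache()
{
    for (auto& line : lines) {
        line.valid = false;   // the only initialization reset really needs
        line.dirty = false;
        // line.data is deliberately left untouched
    }
}

bool lookup(uint32_t tag, int index)
{
    const CacheLine& line = lines[index];
    return line.valid && line.tag == tag;   // an invalid line never hits
}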

Write Allocate / Fetch on Write Cache Policy

I couldn't find a source that explains how the policy works in great detail. The combinations of write policies are explained in Jouppi's paper, for the interested. This is how I understood it:
1. A write request is sent from the CPU to the cache.
2. The request results in a cache miss.
3. A cache block is allocated for this request in the cache. (Write-Allocate)
4. The requested block is fetched from lower memory into the allocated cache block. (Fetch-on-Write)
5. Now we can write into the allocated cache block, which the fetch has just filled.
The question is what happens between step 4 and step 5. (Let's say the cache is a non-blocking cache using Miss Status Handling Registers.)
Does the CPU have to retry the write request against the cache until a write hit happens (i.e. after the block has been fetched into the allocated cache block)?
If not, where is the write-request data held in the meantime?
Edit: I think I've found my answer in Implementation of Write Allocate in the K86™ Processors. It is directly being written into the allocated cache block and it gets merged with the read request later on.
"It is directly being written into the allocated cache block and it gets merged with the read request later on."
No, that's not what AMD's pdf says. They say the store-data is merged with the just-fetched data from memory and then stored into the L1 cache's data array.
Cache tracks validity with cache-line granularity. There's no way for it to store the fact that "bytes 3 to 6 are valid; keep them when data arrives from memory". That kind of logic is too big to replicate in each line of the cache array.
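A toy sketch (invented names) of the merge being described: the pending store's bytes wait in a miss buffer alongside a byte mask; when the full line arrives from memory, the masked bytes are overlaid and the result is written into the data array as one fully valid (and now dirty) line, so per-byte validity never needs to exist in the cache array itself:

#include <cstdint>
#include <cstring>

struct MissBuffer {
    uint8_t  store_data[64];   // bytes the pending store wants to write
    uint64_t byte_mask;        // bit i set => store owns byte i of the line
};

// Called when the 64-byte line arrives from memory for a store miss.
void fill_and_merge(uint8_t line_from_mem[64], const MissBuffer& mb,
                    uint8_t l1_data_line[64])
{
    for (int i = 0; i < 64; ++i)
        if (mb.byte_mask & (1ull << i))
            line_from_mem[i] = mb.store_data[i];   // store data wins over memory

    std::memcpy(l1_data_line, line_from_mem, 64);  // one whole-line write
}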
Also note that the pdf you found describes specific behaviour of AMD's K6 microarchitectures, which were single-core only; some models had only a single level of cache, so no cache-coherency protocol was even necessary. They do describe the K6-III (model 9) using MESI between its L1 and L2 caches.
A CPU writing to cache has to hold onto the data until the cache is ready to accept it. It's not a retry-until-success process, though; it's more that the cache notifies the store hardware when it's ready to accept that store (i.e. it has that line active, and in the Modified state if the cache is coherent with other caches using the MESI protocol).
In a real CPU, multiple outstanding misses can be in flight at once (even without full out-of-order speculative execution). This is called miss under miss. The CPU<->cache connection needs a buffer for each outstanding miss that can be supported in parallel, to hold the store data. e.g. a core might have 8 buffers and support 8 outstanding load or store misses. A 9th memory operation couldn't start to happen until one of the 8 buffers became available. Until then, data would have to stay in the CPU's store queue.
These buffers might be shared between loads and stores, or there might be dedicated store buffers. The OP reports that searching on store buffer found lots of related stuff of interest; one example being this part of Wikipedia's MESI article.
The L1 cache is really a part of a CPU core in modern high-performance designs. It's very tightly integrated with the memory-order logic, and needs to be able to efficiently support atomic operations like lock inc [mem] and lots of other complications (like memory reordering). See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for example.
Some other terms:
store buffer
store queue
memory order buffer
cache write port / cache read port / cache port
globally visible
distantly related: An interesting post investigating the adaptive replacement policy of Intel IvyBridge's L3 cache, making it more resistant against evicting valuable data when scanning a huge array.

relationship between CPUECTLR.SMPEN, caches and MMU

I'm reading an ARM document (ARM® Cortex®-A57 MPCore Processor) and see the following description:
You must set CPUECTLR.SMPEN to 1 before the caches and MMU are enabled, or any instruction cache or TLB maintenance operations are performed.
CPUECTLR.SMPEN is for:
Enables the processor to receive instruction cache and TLB maintenance operations broadcast from other processors in the cluster.
You must set this bit before enabling the caches and MMU, or performing any cache and TLB maintenance operations.
You must clear this bit during a processor power down sequence.
However, the real reason is still unclear to me (i.e., why we must set CPUECTLR.SMPEN to 1 before the caches and MMU are enabled). Please help me with this. Thanks.
Simply put, SMPEN essentially controls whether the core participates in coherency protocols or not.
Without it set, any TLB or cache maintenance operation a core performs will only affect that core, and it won't be aware of other cores doing the same, nor of data in other cores' private caches - on an SMP system with all the cores operating on the same regions of memory, this is generally a recipe for data corruption and disaster.
Say everyone has their MMUs and caches enabled, and core A goes to remap some page of memory: it writes zeros to the PTE, invalidates its TLB for that VA, then writes the updated PTE. Core B could also have a TLB entry for that VA; unless the TLBI is broadcast, core B won't be aware that its entry for that VA is no longer valid, and could read bogus data or, worse, corrupt the old physical page now that it may have been reused for something else.
OK, perhaps core B didn't have that address cached in its TLB, but goes to access it after the update, and kicks off a page table walk. Without cache coherency, this could go one of several ways:
Core B happens to have the page table cached in its L1; unless it can snoop core A's L1 to know that someone else now has a dirty copy of that line and its own copy is now invalid, it's going to read the stale old PTE and go wrong.
Core B doesn't have the page tables cached at L1; unless it can coherently snoop the dirty line from core A's L1, the read goes out to L2 or main memory, hits the stale old PTE and goes wrong.
Core B doesn't have the page tables cached at L1, but core A's first write has already propagated out to L2 or further; unless core B's read can snoop the second write from core A's L1, it reads the intermediate invalid PTE from L2 and takes a fault.
Core B doesn't have the page tables cached at L1, but both of core A's writes have already propagated out to L2 or further; core B's read hits the new PTE in L2, and everything manages to work as expected by pure chance.
Now, there are some situations in which you might not want this - in asymmetric multiprocessing, where the two cores might be doing completely unrelated things, running different operating systems, and working in separate areas of memory, there might be a small benefit from not having unnecessary coherency chit-chat going on in the background - on the rare occasions the cores might want to communicate with each other there, they would probably do so via inter-processor interrupts and a specific shared area of uncached memory. For SMP, though, you really do want the cores to know about each other and be part of the same coherency domain before they have a chance to start actually allocating cache lines and TLB entries, which is precisely why the control of all the broadcast and coherency machinery is wrapped up in a single, somewhat-vaguely-named "SMP enable" bit.
To elaborate on actually entering and exiting coherency: when coming in, you want to be sure that your whole data cache is invalid, to avoid conflicting entries - if a CPU enters SMP with valid lines already in its cache for addresses owned by lines in other CPUs' coherent caches, the coherency protocol is broken and data loss/corruption ensues. Conversely, when going offline, the CPU has to guarantee its cache is clean to avoid data loss - it can prevent itself dirtying any more entries by disabling its cache/MMU, but it also has to exit coherency to prevent dirty lines being transferred in from other CPUs behind its back. Only then is it safe to perform the set/way operations necessary to clean the whole local cache before the contents are lost at powerdown.
