what is clean state in L2 cache? - linux-kernel

While reading about the CPU shutdown sequence for the ARM architecture, I found these steps:
1. save per-CPU peripherals (IC, VFP, PMU)
2. save CPU registers
3. clean L1 D-cache
4. clean state from L2
5. disable L1 D-cache allocation
6. clean L1 D-cache
7. exit coherency
8. call WFI (wait for interrupt)
What does "clean L1" mean? Does it mean deleting all the contents of L1?
And what does "clean state from L2" mean?

What is clean?
Clean, in the ARM Cortex-A documents, usually means a flush: write dirty cache lines out to the next level. It is only meaningful for the D-cache or unified caches. Sometimes we need both clean and invalidate (write the dirty lines back and then discard them). This is important if some other entity (a bus master/peripheral) may change the memory. Usually a bus (AXI) has a mechanism to avoid this. Also, if you update code in main memory and the I-cache holds stale data for it, you need to invalidate the I-cache.
Why multiple cleans?
You need to clean the L1 to make sure the data has been flushed to the L2, so that you may then clean the L2. As the L1 D-cache is being disabled, you may still end up with some stale data in the L1 from the act of flushing the L2. I am not completely sure why they say clean as opposed to invalidate for step 6. You haven't given an exact ARM CPU, and these details vary with the implementation. It seems this is perhaps a Cortex-A5/A8/A9 with an external L2C-310.
The second L1 clean is due to a race between the two levels of cache. It is described in one of the Cortex-A technical reference manuals (TRMs). I would follow their advice, as it probably avoids some rare corner case, and this type of code is difficult to debug. Shutdown/suspend/sleep code by necessity disables all your debug devices and, like boot code, is difficult to troubleshoot.
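For reference, cleaning the whole L1 is normally done with the set/way cache maintenance operations. A rough sketch (ARMv7-A inline assembly, assuming a 4-way, 256-set, 32-byte-line L1 D-cache, i.e. a 32 KB Cortex-A9-style cache; real code reads CLIDR/CCSIDR to discover the geometry instead of hardcoding it):

    /* Clean (write back) every line of the L1 D-cache by set/way. */
    static void l1_dcache_clean_all(void)
    {
        for (unsigned way = 0; way < 4; way++) {
            for (unsigned set = 0; set < 256; set++) {
                /* DCCSW operand: way in bits [31:30], set in bits [12:5], level (L1) in bits [3:1] */
                unsigned long sw = ((unsigned long)way << 30) | ((unsigned long)set << 5);
                __asm__ volatile("mcr p15, 0, %0, c7, c10, 2" :: "r"(sw)); /* DCCSW: clean by set/way */
            }
        }
        __asm__ volatile("dsb" ::: "memory");
    }

To clean and invalidate instead (drop the lines after writing them back), the same loop issues DCCISW (mcr p15, 0, Rt, c7, c14, 2).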

Related

Can CUDA atomic operations use the L1 cache?

Compute capability (cc): 7.5, Windows: 10.0, CUDA: 11.7
I'm performing a bunch of atomic operations on device memory. Every thread in a warp is operating on a consecutive uint32_t. And every warp in the block updates those same values, before they all move on to the next line.
Since I'm not using any shared memory, I was hoping the L1 cache would be used to cache the device memory, effectively giving me the speed of an atomicAnd against shared memory without all the overhead and headaches of __syncthreads and copying the data around.
But the performance suggests that's not what's happening.
Indeed, looking at NSight, it's saying there's a 0% hit rate in L1 cache. Ouch. The memory workload analysis also shows 0% Hit under Global Atomic ALU.
Google turned up one hit (somewhat dated) suggesting that atomics are always done via L2 for device memory. Not exactly an authoritative source, but it matches what I'm seeing. On the other hand, there's this, which seems to suggest it does (did?) go through L1. A more authoritative source, but not exactly on point.
Could I have something misconfigured? Maybe my code isn't doing what I think it is? Or do atomic operations against device memory always go through L2?
I tried using RED instead of atomics, but that didn't make any difference.
I also tried using atomicAnd_block instead of just atomicAnd, and somehow that made things even slower? Not what I expected.
I'd like to experiment with redux, but cc 8.0 isn't an option for me yet. __shfl_sync turned out to be disappointing (performance-wise).
At this point I'm inclined to believe that on cc 7.5, atomics on device memory always go through L2. But if someone has evidence to the contrary, I can keep digging.
As usual with Nvidia, concrete information is hard to come by. But we can have a look at the PTX documentation and infer a few things.
Atomic load and store
Atomic loads and stores use variations of their regular ld and st instructions which have the following pattern:
ld{.weak}{.ss}{.cop}{.level::cache_hint}{.level::prefetch_size}{.vec}.type d, [a]{, cache-policy};
ld.sem.scope{.ss}{.level::eviction_priority}{.level::cache_hint}{.level::prefetch_size}{.vec}.type d, [a]{, cache-policy};
st{.weak}{.ss}{.cop}{.level::cache_hint}{.vec}.type [a], b{, cache-policy};
st.sem.scope{.ss}{.level::eviction_priority}{.level::cache_hint}{.vec}.type [a], b{, cache-policy};
weak loads and stores are regular memory operations. The cop part specifies the cache behavior. For our purposes, there is ld.cg (cache-global) that only uses the L2 cache and ld.ca (cache-all), which uses L1 and L2 cache. As the documentation notes:
Global data is coherent at the L2 level, but multiple L1 caches are not coherent for global data. If one thread stores to global memory via one L1 cache, and a second thread loads that address via a second L1 cache with ld.ca, the second thread may get stale L1 cache data, rather than the data stored by the first thread. The driver must invalidate global L1 cache lines between dependent grids of parallel threads. Stores by the first grid program are then correctly fetched by the second grid program issuing default ld.ca loads cached in L1.
Similarly, there is st.cg which caches only in L2. It "bypasses the L1 cache." The wording isn't precise but it sounds as if this should invalidate the L1 cache. Otherwise even within a single thread, a sequence of ld.ca; st.cg; ld.ca would read stale data and that sounds like an insane idea.
The second relevant cop for writes is st.wb (write-back). The wording in the documentation is very weird. I guess this writes back to the L1 cache and may later evict to L2 and up.
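For what it's worth, these cop variants can be requested from CUDA C through the documented cache-hint load/store intrinsics, so you can see in the generated PTX which level a plain (weak) access is allowed to use. A small illustration (the wrapper names are mine; __ldca, __ldcg and __stcg are the intrinsics):

    __device__ unsigned int load_via_l1(const unsigned int *p) { return __ldca(p); } /* ld.global.ca: L1 and L2 */
    __device__ unsigned int load_via_l2(const unsigned int *p) { return __ldcg(p); } /* ld.global.cg: L2 only   */
    __device__ void store_via_l2(unsigned int *p, unsigned int v) { __stcg(p, v); }  /* st.global.cg: bypass L1 */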
The ld.sem and st.sem (where sem is one of relaxed, acquire, or release) are the true atomic loads and stores. Scope gives the, well, scope of the synchronization, meaning for example whether an acquire is synchronized within a thread block or on the whole GPU.
Notice how these operations have no cop element. So you cannot even specify a cache layer. You can give cache hints but I don't see how those are sufficient to specify the desired semantics. cache_hint and cache-policy only work on L2.
Only the eviction_priority mentions L1. But just because that performance hint is accepted does not mean it has any effect. I assume it works for weak memory operations but for atomics, only the L2 policies have any effect. But this is just conjecture.
Atomic Read-modify-write
The atom instruction is used for atomic exchange, compare-and-swap, addition, etc. red is used for reductions. They have the following structure:
atom{.sem}{.scope}{.space}.op{.level::cache_hint}.type d, [a], b{, cache-policy};
red{.sem}{.scope}{.space}.op{.level::cache_hint}.type [a], b{, cache-policy};
With these elements:
sem: memory synchronization behavior, such as acquire, release, or relaxed
scope: memory synchronization scope, e.g. acquire-release within a CTA (thread block) or GPU
space: global or shared memory
cache policy, level and hint: cache eviction policy. But there are no options for L1, only L2
Given that there is no way to specify L1 caching or write-back behavior, there is no way of using atomic RMW operations on L1 cache. This makes a lot of sense to me. Why should the GPU waste transistors on implementing this? Shared memory exists for the exact purpose of allowing fast memory operations within a thread block.
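To illustrate that last point, here is a minimal sketch (a hypothetical kernel, not the questioner's code) of the usual pattern: fold values into a shared-memory accumulator with block-level atomics, and touch device memory (and therefore L2) only once per block:

    __global__ void and_reduce(const unsigned int *in, unsigned int *out, int n)
    {
        __shared__ unsigned int block_acc;          /* per-block accumulator, lives on-chip */
        if (threadIdx.x == 0) block_acc = ~0u;      /* identity value for AND               */
        __syncthreads();

        /* grid-stride loop: each thread folds its elements into the shared accumulator */
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
            atomicAnd(&block_acc, in[i]);           /* fast: shared-memory atomic           */

        __syncthreads();
        if (threadIdx.x == 0)
            atomicAnd(out, block_acc);              /* one global (L2) atomic per block     */
    }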

Why does DSB not flush the cache?

I'm debugging an HTTP server on STM32H725VG using LWIP and HAL drivers, all initially generated by STM32CubeMX. The problem is that in some cases data sent via HAL_ETH_Transmit have some octets replaced by 0x00, and this corrupted content successfully gets to the client.
I've checked that the data in the buffers passed as arguments to HAL_ETH_Transmit are intact both before and after the call to this function. So, apparently, the corruption occurs on the transfer from RAM to the MAC, because the checksum is calculated on the corrupted data. So I supposed that the problem might be due to the interaction between the cache and DMA. I tried disabling the D-cache, and then the corruption doesn't occur.
Then I thought that I should just use the __DSB() instruction, which should write the cached data out to RAM. After enabling the D-cache again, I added __DSB() right before the call to HAL_ETH_Transmit (which is inside the low_level_output function generated by STM32CubeMX), and... nothing happened: the data are still corrupted.
Then, after some experimentation I found that SCB_CleanDCache() call after (or instead of) __DSB() fixes the problem.
This makes me wonder. The description of DSB instruction is as follows:
Data Synchronization Barrier acts as a special kind of memory barrier. No instruction in program order after this instruction executes until this instruction completes. This instruction completes when:
All explicit memory accesses before this instruction complete.
All Cache, Branch predictor and TLB maintenance operations before this instruction complete.
And the description of SCB_DisableDCache has the following note about SCB_CleanDCache:
When disabling the data cache, you must clean (SCB_CleanDCache) the entire cache to ensure that any dirty data is flushed to external memory.
Why doesn't the DSB flush the cache if it's supposed to be complete when "all explicit memory accesses" complete, which seems to include flushing of caches?
dsb ish works as a memory barrier for inter-thread memory order; it just orders the current CPU's access to coherent cache. You wouldn't expect dsb ish to flush any cache because that's not required for visibility within the same inner-shareable cache-coherency domain. Like it says in the manual you quoted, it finishes memory operations.
Cacheable memory operations on write-back cache only update cache; waiting for them to finish doesn't imply flushing the cache.
I think your ARM system has multiple coherency domains for the microcontroller vs. the DSP? Does your __DSB intrinsic compile to a dsb sy instruction? Assuming that doesn't flush cache either, what they presumably mean is that it orders memory/cache operations, including explicit flushes, which are still necessary.
I'd put my money on performance.
Flushing cache means to write data from cache to memory. Memory access is slow.
The L1 cache (assuming an ARM Cortex-A9) is 32 KB. You don't want to move a whole 32 KB from cache to memory for no reason. There might be an L2 cache, which is easily 512 KB-1 MB (it could be even more). You really don't want to flush a whole L2 either.
As a matter of fact, your whole DMA transfer might be smaller than the caches themselves. There is simply no justification for doing that.
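A sketch of the usual fix (CMSIS-style; the helper name and the lwIP pbuf usage are illustrative, not the HAL's own API): clean only the lines that cover the TX buffer before handing it to the MAC, rather than the whole D-cache:

    #include <stdint.h>
    #include <stddef.h>
    #include "stm32h7xx.h"              /* assumed device header providing SCB_CleanDCache_by_Addr() */

    #define DCACHE_LINE 32U             /* Cortex-M7 D-cache lines are 32 bytes */

    /* Write back only the cache lines covering buf..buf+len to RAM so the
     * Ethernet DMA reads up-to-date data. Address/length are rounded to lines. */
    static void clean_dcache_range(const void *buf, size_t len)
    {
        uintptr_t start = (uintptr_t)buf & ~(uintptr_t)(DCACHE_LINE - 1U);
        uintptr_t end   = ((uintptr_t)buf + len + DCACHE_LINE - 1U) & ~(uintptr_t)(DCACHE_LINE - 1U);

        SCB_CleanDCache_by_Addr((uint32_t *)start, (int32_t)(end - start));
        __DSB();                        /* ensure the clean completes before the DMA is kicked off */
    }

    /* e.g. inside low_level_output(), for each pbuf q, before HAL_ETH_Transmit():
     *     clean_dcache_range(q->payload, q->len);
     */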

PoU with non-shareable attribute

Another question regarding caching in ARMv7-A.
In this case, the SoC in question is Allwinner A20, Dual-Core Cortex-A7.
From what I have read, the definition of PoU for a core is "the point at which the instruction and data caches of the core are guaranteed to see the same copy of a memory location".
In regard to the SoC in question, since both cores share the PoU at the L2 (unified) cache, it means that whatever is put in L1 will be visible to L2. Is that right?
Even if I change the attribute of a memory region to Non-shareable, L2 will be able to see what is inside L1 in either core. Is that true?
To elaborate on what I meant by that, I did a little experiment:
I wrote to a memory address inside a Non-shareable, Write-Back region from core #0. Then, without doing any cache maintenance operation, I read from the same memory address from core #1, and it happened to read the correct value that was written by core #0.
I speculated that this behaviour was a result of L2 being the PoU: when I wrote from core #0, L2 also stored a copy of it (even though nothing was flushed). Then, when I read from core #1, after a read miss, core #1's L1 retrieved the value from L2.
...since both cores share the PoU at the L2 (unified) cache, it means that whatever is put in L1 will be visible to L2. Is that right?
No. One CPU's data accesses may snoop the data caches of another in the same shareability domain, but that has nothing to do with the PoU for instruction accesses; it's just the coherency protocol.
Even if I change the attribute of a memory region to Non-shareable, L2 will be able to see what is inside L1 in either core. Is that true?
No. Non-shareable memory is not guaranteed to be coherent. Sure, you might see it work - maybe Cortex-A7 happens to still snoop non-shareable cache lines, or maybe your data just got naturally evicted from L1D in the meantime such that the other CPU hit it at L2 - but it definitely should not be relied upon. Either way, having multiple CPUs access the same non-shareable location is a totally backwards thing to do in practice; you've deliberately said you don't want to share it!
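For completeness: if you really did need to hand data between cores through Non-shareable Write-Back memory, you would have to manage coherency yourself with explicit maintenance to the Point of Coherency. A rough sketch (ARMv7-A inline assembly; the helper names are mine, and the cores still need separate signalling so the consumer only runs after the producer):

    /* clean one D-cache line by MVA to the Point of Coherency (DCCMVAC) */
    static inline void dccmvac(void *va)
    {
        __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(va) : "memory");
    }

    /* invalidate one D-cache line by MVA to the Point of Coherency (DCIMVAC) */
    static inline void dcimvac(void *va)
    {
        __asm__ volatile("mcr p15, 0, %0, c7, c6, 1" :: "r"(va) : "memory");
    }

    /* core #0: write, then clean the line so the data actually reaches memory */
    void producer(volatile unsigned *p, unsigned v)
    {
        *p = v;
        dccmvac((void *)p);
        __asm__ volatile("dsb" ::: "memory");
    }

    /* core #1: throw away its own (possibly stale) copy before reading */
    unsigned consumer(volatile unsigned *p)
    {
        dcimvac((void *)p);
        __asm__ volatile("dsb" ::: "memory");
        return *p;
    }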

relationship between CPUECTLR.SMPEN, caches and MMU

I'm reading an ARM document (the ARM Cortex-A57 MPCore Processor TRM) and see the following description of CPUECTLR.SMPEN:
You must set CPUECTLR.SMPEN to 1 before the caches and MMU are enabled, or any instruction cache or TLB maintenance operations are performed.
CPUECTLR.SMPEN is for:
Enables the processor to receive instruction cache and TLB maintenance operations broadcast from other processors in the cluster.
You must set this bit before enabling the caches and MMU, or performing any cache and TLB maintenance operations.
You must clear this bit during a processor power down sequence.
However, the real reason is still unclear to me (i.e., why we must set CPUECTLR.SMPEN to 1 before the caches and MMU are enabled). Please help me with this. Thanks.
Simply put, SMPEN essentially controls whether the core participates in coherency protocols or not.
Without it set, any TLB or cache maintenance operation a core performs will only affect that core, and it won't be aware of other cores doing the same, nor of data in other cores' private caches - on an SMP system with all the cores operating on the same regions of memory, this is generally a recipe for data corruption and disaster.
Say everyone has their MMUs and caches enabled, and core A goes to remap some page of memory - it writes zeros to the PTE, invalidates its TLB for that VA, then writes the updated PTE. Core B could also have a TLB entry for that VA: unless the TLBI is broadcast, core B won't be aware that its entry for that VA is no longer valid, and could read bogus data or worse corrupt the old physical page now that it may have been reused for something else.
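As a rough sketch (AArch64 inline assembly; the helper and its arguments are illustrative, and ASID handling is omitted), the remap on core A only keeps core B honest if the TLB invalidate is the broadcast, inner-shareable variant - which is exactly the kind of operation SMPEN lets the other cores receive:

    #include <stdint.h>

    /* break-before-make remap of one page on core A */
    static void remap_page(volatile uint64_t *pte, uint64_t new_desc, uint64_t va)
    {
        *pte = 0;                                               /* 1. invalidate the descriptor        */
        __asm__ volatile("dsb ishst" ::: "memory");             /* 2. make that write visible          */
        __asm__ volatile("tlbi vae1is, %0" :: "r"(va >> 12)     /* 3. broadcast invalidate by VA       */
                         : "memory");
        __asm__ volatile("dsb ish" ::: "memory");               /* 4. wait for completion on all cores */
        __asm__ volatile("isb" ::: "memory");
        *pte = new_desc;                                        /* 5. install the new translation      */
        __asm__ volatile("dsb ishst" ::: "memory");
    }

Without SMPEN set on core B, the broadcast invalidate in step 3 would not be received there, and core B could carry on using its stale entry.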
OK, perhaps core B didn't have that address cached in its TLB, but goes to access it after the update, and kicks off a page table walk. Without cache coherency, this goes several ways:
Core B happens to have the page table cached in its L1; unless it can snoop core A's L1 to know that someone else now has a dirty copy of that line and its own copy is now invalid, it's going to read the stale old PTE and go wrong.
Core B doesn't have the page tables cached at L1; unless it can coherently snoop the dirty line from core A's L1, the read goes out to L2 or main memory, hits the stale old PTE and goes wrong.
Core B doesn't have the page tables cached at L1, but core A's first write has already propagated out to L2 or further; unless core B's read can snoop the second write from core A's L1, it reads the intermediate invalid PTE from L2 and takes a fault.
Core B doesn't have the page tables cached at L1, but both of core A's writes have already propagated out to L2 or further; core B's read hits the new PTE in L2, and everything manages to work as expected by pure chance.
Now, there are some situations in which you might not want this - in asymmetric multiprocessing, where the two cores might be doing completely unrelated things, running different operating systems, and working in separate areas of memory, there might be a small benefit from not having unnecessary coherency chit-chat going on in the background - on the rare occasions the cores might want to communicate with each other there, they would probably do so via inter-processor interrupts and a specific shared area of uncached memory. For SMP, though, you really do want the cores to know about each other and be part of the same coherency domain before they have a chance to start actually allocating cache lines and TLB entries, which is precisely why the control of all the broadcast and coherency machinery is wrapped up in a single, somewhat-vaguely-named "SMP enable" bit.
To elaborate on actually entering and exiting coherency: when coming in, you want to be sure that your whole data cache is invalid to avoid conflicting entries - if a CPU enters SMP with valid lines already in its cache for addresses owned by lines in other CPUs' coherent caches, the coherency protocol is broken and data loss/corruption ensues. Conversely, when going offline, the CPU has to guarantee its cache is clean to avoid data loss - it can prevent itself dirtying any more entries by disabling its cache/MMU, but it also has to exit coherency to prevent dirty lines being transferred in from other CPUs behind its back. Only then is it safe to perform the set/way operations necessary to clean the whole local cache before the contents are lost at powerdown.

L1/L2 cache problem

Could the L1/L2 caches each hold multiple copies of the same main memory data word?
It's possible for the same main memory data to be in a cache more than once. Obviously that's true and a common occurrence on multiprocessor machines, but it can happen even on uniprocessor machines.
Consider a Pentium CPU that has a split L1 instruction/data cache. Instructions only go into the I-cache, data only into the D-cache. Now, if the OS allows self-modifying code, the same memory could be loaded into both the I-cache and the D-cache, once as data and once as instructions. Now you have that data twice in the L1 cache. Therefore, a CPU with such a split cache architecture must employ a cache coherence protocol to avoid race conditions/corruption.
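As an aside, on architectures where hardware does not keep the I-side and D-side consistent automatically (ARM, for instance; x86 handles it in hardware), software has to reconcile the two copies after writing code. A small sketch using the GCC/Clang builtin (which compiles to nothing on x86):

    #include <stddef.h>
    #include <stdint.h>

    /* Patch n instruction words through the data side, then make the
     * instruction side consistent for that range before executing it. */
    void patch_and_sync(uint32_t *code, size_t n, uint32_t new_insn)
    {
        for (size_t i = 0; i < n; i++)
            code[i] = new_insn;                          /* lands in the D-cache  */
        __builtin___clear_cache((char *)code,
                                (char *)(code + n));     /* clean D, invalidate I */
    }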
No - if it's already in the cache the MMU will use that rather than creating another copy.
Every cache basically stores some small subset of the whole memory. When the CPU needs a word from memory, it first checks L1, then L2, and so on, before main memory is accessed.
So a particular memory word can be in L2 and in L1 simultaneously, but it can't be stored twice in L1, because that is not necessary.
Yes, it can: the L1 copy has been updated but not yet flushed to L2. This happens only if L1 and L2 are non-exclusive caches. It is obvious for uniprocessors, and even more so for multiprocessors, which typically have their own L1 cache per core.
It all depends on the cache architecture and what guarantees it makes.
