Another question regarding caching in ARMv7-A.
In this case, the SoC in question is Allwinner A20, Dual-Core Cortex-A7.
From what I have read, the definition of the PoU for a core is the point at which the instruction and data caches of the core are guaranteed to see the same copy of a memory location.
With regard to the SoC in question, since both cores share their PoU at the L2 (unified) cache, it means that whatever is put in L1 will be visible to L2. Is that right?
Even if I change the attributes of a memory region to Non-Shareable, L2 will be able to see what is inside L1 in either core. Is that true?
To elaborate on what I mean, I did a little experiment:
I wrote to a memory address inside a Non-Shareable, Write-Back region from core #0. Then, without doing any cache maintenance operation, I read the same memory address from core #1, and it returned the correct value that had been written from core #0.
I speculated that this behaviour was a result of L2 being the PoU, so when I wrote from core #0, L2 also stored a copy of it (even though it was not flushed). Then when I read from core #1, after a read miss, core #1's L1 retrieved the value from L2.
...since both cores share their PoU at the L2 (unified) cache, it means that whatever is put in L1 will be visible to L2. Is that right?
No. One CPU's data accesses may snoop the data caches of another in the same shareability domain, but that has nothing to do with the PoU for instruction accesses; it's just the coherency protocol.
Even if I change the attributes of a memory region to Non-Shareable, L2 will be able to see what is inside L1 in either core. Is that true?
No. Non-shareable memory is not guaranteed to be coherent. Sure, you might see it work - maybe Cortex-A7 happens to still snoop non-shareable cache lines, or maybe your data just got naturally evicted from L1D in the meantime such that the other CPU hit it at L2 - but it definitely should not be relied upon. Either way, having multiple CPUs access the same non-shareable location is a totally backwards thing to do in practice; you've deliberately said you don't want to share it!
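For what it's worth, if you genuinely needed another observer to see data you've written to Non-Shareable, Write-Back memory, you'd have to push it past your own caches explicitly, e.g. clean to the Point of Coherency on the writer and invalidate on the reader before loading. A rough bare-metal sketch of that idea in C with ARMv7 CP15 operations (the function names and single-line handling are mine, not anything from the question; real code would walk whole buffers line by line):

    #define CACHE_LINE 64  /* Cortex-A7 L1/L2 line size is 64 bytes */

    /* Clean one data cache line by VA to the Point of Coherency (DCCMVAC). */
    static inline void dccmvac(void *va)
    {
        __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(va) : "memory");
    }

    /* Invalidate one data cache line by VA to the Point of Coherency (DCIMVAC). */
    static inline void dcimvac(void *va)
    {
        __asm__ volatile("mcr p15, 0, %0, c7, c6, 1" :: "r"(va) : "memory");
    }

    static inline void dsb(void)
    {
        __asm__ volatile("dsb" ::: "memory");
    }

    /* Writer (core #0): make the store visible beyond its own caches. */
    void publish(volatile unsigned *p, unsigned value)
    {
        *p = value;
        dccmvac((void *)p);   /* push the dirty line out to the PoC */
        dsb();                /* wait for the maintenance to complete */
    }

    /* Reader (core #1): throw away any stale local copy before loading. */
    unsigned consume(volatile unsigned *p)
    {
        dcimvac((void *)p);   /* discard any stale line covering *p */
        dsb();
        return *p;
    }

But again: if two cores need to share the data, marking it Shareable in the first place is the sane answer.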
Environment: compute capability (cc) 7.5, Windows 10.0, CUDA 11.7.
I'm performing a bunch of atomic operations on device memory. Every thread in a warp is operating on a consecutive uint32_t. And every warp in the block updates those same values, before they all move on to the next line.
Since I'm not using any shared memory, I was hoping that the L1 (which shares its on-chip storage with shared memory) would be used to cache the device memory, effectively giving me an atomicAnd at shared-memory speed without all the overhead and headaches of __syncthreads and copying the data around.
But the performance suggests that's not what's happening.
Indeed, looking at NSight, it's saying there's a 0% hit rate in L1 cache. Ouch. The memory workload analysis also shows 0% Hit under Global Atomic ALU.
Google turned up one hit (somewhat dated) suggesting that atomic is always done via L2 for device memory. Not exactly an authoritative source, but it matches what I'm seeing. On the other hand, there's this which seems to suggest it does (did?) go thru L1. A more authoritative source, but not exactly on point.
Could I have something misconfigured? Maybe my code isn't doing what I think it is? Or do atomic operations against device memory always go thru L2?
I tried using RED instead of atomics, but that didn't make any difference.
I also tried using atomicAnd_block instead of just atomicAnd, and somehow that made things even slower? Not what I expected.
I'd like to experiment with redux, but cc 8.0 isn't an option for me yet. __shfl_sync turned out to be disappointing (performance-wise).
At this point I'm inclined to believe that in 7.5, atomics on device memory always go thru L2. But if someone has evidence to the contrary, I can keep digging.
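For reference, here's a stripped-down sketch of the access pattern described above (the names and the exact indexing are illustrative, not my actual kernel):

    #include <cstdint>

    // Each warp applies its mask to the same 32 consecutive uint32_t words
    // of the current line; every warp in the block updates those same words
    // before the block moves on to the next line.
    __global__ void and_lines(uint32_t *data, const uint32_t *masks, int num_lines)
    {
        const int lane = threadIdx.x % warpSize;    // which word of the line
        const int warp = threadIdx.x / warpSize;    // which mask this warp applies
        const int warps_per_block = blockDim.x / warpSize;

        for (int line = blockIdx.x; line < num_lines; line += gridDim.x) {
            // Every warp in the block ANDs its mask into the same 32 words.
            atomicAnd(&data[line * warpSize + lane],
                      masks[line * warps_per_block + warp]);
            __syncthreads();
        }
    }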
As usual with Nvidia, concrete information is hard to come by. But we can have a look at the PTX documentation and infer a few things.
Atomic load and store
Atomic loads and stores use variations of their regular ld and st instructions which have the following pattern:
ld{.weak}{.ss}{.cop}{.level::cache_hint}{.level::prefetch_size}{.vec}.type d, [a]{, cache-policy};
ld.sem.scope{.ss}{.level::eviction_priority}{.level::cache_hint}{.level::prefetch_size}{.vec}.type d, [a]{, cache-policy};
st{.weak}{.ss}{.cop}{.level::cache_hint}{.vec}.type [a], b{, cache-policy};
st.sem.scope{.ss}{.level::eviction_priority}{.level::cache_hint}{.vec}.type [a], b{, cache-policy};
weak loads and stores are regular memory operations. The cop part specifies the cache behavior. For our purposes, there is ld.cg (cache-global) that only uses the L2 cache and ld.ca (cache-all), which uses L1 and L2 cache. As the documentation notes:
Global data is coherent at the L2 level, but multiple L1 caches are not coherent for global data. If one thread stores to global memory via one L1 cache, and a second thread loads that address via a second L1 cache with ld.ca, the second thread may get stale L1 cache data, rather than the data stored by the first thread. The driver must invalidate global L1 cache lines between dependent grids of parallel threads. Stores by the first grid program are then correctly fetched by the second grid program issuing default ld.ca loads cached in L1.
Similarly, there is st.cg which caches only in L2. It "bypasses the L1 cache." The wording isn't precise but it sounds as if this should invalidate the L1 cache. Otherwise even within a single thread, a sequence of ld.ca; st.cg; ld.ca would read stale data and that sounds like an insane idea.
The second relevant cop for writes is st.wb (write-back). The wording in the documentation is very weird. I guess this writes back to the L1 cache and may later evict to L2 and up.
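As far as I can tell, these cache operators are what the CUDA C++ load/store intrinsics with cache hints (__ldca, __ldcg, __stwb, __stcg) map to, so you can experiment with them without writing PTX by hand. A small sketch (treat the mapping comments as my reading of the docs, not gospel):

    __global__ void cache_op_demo(unsigned int *buf, unsigned int *out)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

        unsigned int a = __ldca(&buf[i]);   // ld.global.ca: may be cached in L1 and L2
        unsigned int b = __ldcg(&buf[i]);   // ld.global.cg: skips L1, uses L2 only

        __stwb(&out[i], a + b);             // st.global.wb: ordinary write-back store
        __stcg(&buf[i], 0u);                // st.global.cg: cached in L2, not in L1
    }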
The ld.sem and st.sem (where sem is one of relaxed, acquire, or release) are the true atomic loads and stores. Scope gives the, well, scope of the synchronization, meaning for example whether an acquire is synchronized within a thread block or on the whole GPU.
Notice how these operations have no cop element. So you cannot even specify a cache layer. You can give cache hints but I don't see how those are sufficient to specify the desired semantics. cache_hint and cache-policy only work on L2.
Only the eviction_priority mentions L1. But just because that performance hint is accepted does not mean it has any effect. I assume it works for weak memory operations but for atomics, only the L2 policies have any effect. But this is just conjecture.
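For completeness, from CUDA C++ you reach these sem/scope combinations through libcu++, e.g. cuda::atomic_ref with an explicit thread scope (available since roughly CUDA 11.3, I believe); the generated PTX uses ld.acquire/st.release with the requested scope, and again nothing there lets you pin the data in L1. A sketch, with made-up kernel names:

    #include <cuda/atomic>

    __global__ void writer(unsigned int *data, unsigned int *flag)
    {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            data[0] = 42u;
            cuda::atomic_ref<unsigned int, cuda::thread_scope_device> f(*flag);
            f.store(1u, cuda::std::memory_order_release);      // st.release.gpu
        }
    }

    __global__ void reader(const unsigned int *data, unsigned int *flag, unsigned int *out)
    {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            cuda::atomic_ref<unsigned int, cuda::thread_scope_device> f(*flag);
            if (f.load(cuda::std::memory_order_acquire) == 1u) // ld.acquire.gpu
                out[0] = data[0];
        }
    }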
Atomic Read-modify-write
The atom instruction is used for atomic exchange, compare-and-swap, addition, etc. red is used for reductions. They have the following structure:
atom{.sem}{.scope}{.space}.op{.level::cache_hint}.type d, [a], b{, cache-policy};
red{.sem}{.scope}{.space}.op{.level::cache_hint}.type [a], b{, cache-policy};
With these elements:
sem: memory synchronization behavior, such as acquire, release, or relaxed
scope: memory synchronization scope, e.g. acquire-release within a CTA (thread block) or GPU
space: global or shared memory
cache policy, level and hint: cache eviction policy. But there are no options for L1, only L2
Given that there is no way to specify L1 caching or write-back behavior, there is no way of using atomic RMW operations on L1 cache. This makes a lot of sense to me. Why should the GPU waste transistors on implementing this? Shared memory exists for the exact purpose of allowing fast memory operations within a thread block.
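Which, in practice, is also the workaround I'd suggest for the question: do the heavy ANDing against shared memory inside the block and touch global memory (and therefore L2) only once per block. A rough sketch of that idea (sizes, names and indexing are illustrative, and it assumes each block owns the lines it processes):

    #include <cstdint>

    // Warps AND their masks into a shared-memory staging line, then the block
    // writes one combined atomic per word out to global memory (i.e. via L2).
    __global__ void and_via_shared(uint32_t *data, const uint32_t *masks,
                                   int num_lines, int masks_per_line)
    {
        __shared__ uint32_t line[32];
        const int lane = threadIdx.x % warpSize;
        const int warp = threadIdx.x / warpSize;
        const int warps_per_block = blockDim.x / warpSize;

        for (int l = blockIdx.x; l < num_lines; l += gridDim.x) {
            if (threadIdx.x < 32) line[threadIdx.x] = 0xFFFFFFFFu;
            __syncthreads();

            // Shared-memory atomics stay on-chip and are cheap.
            for (int m = warp; m < masks_per_line; m += warps_per_block)
                atomicAnd(&line[lane], masks[l * masks_per_line + m]);
            __syncthreads();

            // One global (L2) atomic per word per block, instead of one per warp.
            if (threadIdx.x < 32)
                atomicAnd(&data[l * 32 + threadIdx.x], line[threadIdx.x]);
            __syncthreads();
        }
    }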
I'm reading an ARM document (ARM Cortex-A57 MPCore Processor) and see the following description of CPUECTLR.SMPEN:
You must set CPUECTLR.SMPEN to 1 before the caches and MMU are enabled, or any instruction cache or TLB maintenance operations are performed.
CPUECTLR.SMPEN is for:
Enables the processor to receive instruction cache and TLB maintenance operations broadcast from other processors in the cluster.
You must set this bit before enabling the caches and MMU, or performing any cache and TLB maintenance operations.
You must clear this bit during a processor power down sequence.
However, the real reason is still unclear to me (i.e., why we should set CPUECTLR.SMPEN to 1 before the caches and MMU are enabled). Please help me with this. Thanks.
Simply put, SMPEN essentially controls whether the core participates in coherency protocols or not.
Without it set, any TLB or cache maintenance operation a core performs will only affect that core, and it won't be aware of other cores doing the same, nor of data in other cores' private caches - on an SMP system with all the cores operating on the same regions of memory, this is generally a recipe for data corruption and disaster.
Say everyone has their MMUs and caches enabled, and core A goes to remap some page of memory - it writes zeros to the PTE, invalidates its TLB for that VA, then writes the updated PTE. Core B could also have a TLB entry for that VA: unless the TLBI is broadcast, core B won't be aware that its entry for that VA is no longer valid, and could read bogus data or, worse, corrupt the old physical page now that it may have been reused for something else.
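To make that concrete, with SMPEN set the TLB invalidate in a standard break-before-make remap is issued in its broadcast (Inner Shareable) form, which is what keeps core B's TLB honest. A rough bare-metal AArch64 sketch (names are mine; ASID and granule handling depend entirely on your page-table setup, a 4KB granule is assumed here):

    #include <stdint.h>

    /* Core A remapping one page: break-before-make with a broadcast TLBI. */
    void remap_page(volatile uint64_t *pte, uint64_t new_desc, uint64_t va)
    {
        *pte = 0;                                        /* 1. "break": invalid PTE         */
        __asm__ volatile("dsb ishst" ::: "memory");      /* PTE write visible to walkers    */
        __asm__ volatile("tlbi vaae1is, %0"              /* 2. invalidate this VA across    */
                         :: "r"(va >> 12) : "memory");   /*    the Inner Shareable domain   */
        __asm__ volatile("dsb ish" ::: "memory");        /* wait for all cores to finish it */
        __asm__ volatile("isb" ::: "memory");
        *pte = new_desc;                                 /* 3. "make": install new mapping  */
        __asm__ volatile("dsb ishst" ::: "memory");
    }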
OK, perhaps core B didn't have that address cached in its TLB, but goes to access it after the update, and kicks off a page table walk. Without cache coherency, this goes several ways:
Core B happens to have the page table cached in its L1; unless it can snoop core A's L1 to know that someone else now has a dirty copy of that line and its own copy is now invalid, it's going to read the stale old PTE and go wrong.
Core B doesn't have the page tables cached at L1; unless it can coherently snoop the dirty line from core A's L1, the read goes out to L2 or main memory, hits the stale old PTE and goes wrong.
Core B doesn't have the page tables cached at L1, but core A's first write has already propagated out to L2 or further; unless core B's read can snoop the second write from core A's L1, it reads the intermediate invalid PTE from L2 and takes a fault.
Core B doesn't have the page tables cached at L1, but both of core A's writes have already propagated out to L2 or further; core B's read hits the new PTE in L2, and everything manages to work as expected by pure chance.
Now, there are some situations in which you might not want this - in asymmetric multiprocessing, where the two cores might be doing completely unrelated things, running different operating systems, and working in separate areas of memory, there might be a small benefit from not having unnecessary coherency chit-chat going on in the background - on the rare occasions the cores might want to communicate with each other there, they would probably do so via inter-processor interrupts and a specific shared area of uncached memory. For SMP, though, you really do want the cores to know about each other and be part of the same coherency domain before they have a chance to start actually allocating cache lines and TLB entries, which is precisely why the control of all the broadcast and coherency machinery is wrapped up in a single, somewhat-vaguely-named "SMP enable" bit.
To elaborate on actually entering and exiting coherency, when coming in you want to be sure that your whole data cache is invalid to avoid conflicting entries - if a CPU enters SMP with valid lines already in its cache for addresses owned by lines in other CPUs' coherent caches, the coherency protocol is broken and data loss/corruption ensues. Conversely, when going offline, the CPU has to guarantee its cache is clean to avoid data loss - it can prevent itself dirtying any more entries by disabling its cache/MMU, but it also has to exit coherency to prevent dirty lines being transferred in from other CPUs behind its back. Only then is it safe to perform the set/way operations necessary to clean the whole local cache before the contents are lost at powerdown.
I have been running some benchmarks on some algorithms and profiling their memory usage and efficiency (L1/L2/TLB accesses and misses), and some of the results are quite intriguing for me.
Considering an inclusive cache hierarchy (L1 and L2 caches), shouldn't the number of L1 cache misses coincide with the number of L2 cache accesses? One of the explanations I find would be TLB related: when a virtual address is not mapped in TLB, the system automatically skips searches in some cache levels.
Does this seem legitimate?
First, inclusive cache hierarchies may not be so common as you assume. For example, I do not think any current Intel processors - not Nehalem, not Sandybridge, possibly Atoms - have an L1 that is included within the L2. (Nehalem and probably Sandybridge do, however, have both L1 and L2 included within L3; using Intel's current terminology, FLC and MLC in LLC.)
But, this doesn't necessarily matter. In most cache hierarchies, if you have an L1 cache miss, then that miss will probably be looked up in the L2. It doesn't matter if it is inclusive or not. To do otherwise, you would have to have something that told you that the data you care about is (probably) not in the L2, so you don't need to look. I have designed protocols and memory types that do this - e.g. a memory type that cached only in the L1 but not the L2, useful for stuff like graphics where you get the benefits of combining in the L1 but are repeatedly scanning over a large array, so caching in the L2 is not a good idea - but I am not aware of anyone shipping them at the moment.
Anyway, here are some reasons why the number of L1 cache misses may not be equal to the number of L2 cache accesses.
You don't say what systems you are working on - I know my answer is applicable to Intel x86s such as Nehalem and Sandybridge, whose EMON performance event monitoring allows you to count things such as L1 and L2 cache misses, etc. It will probably also apply to any modern microprocessor with hardware performance counters for cache misses, such as those on ARM and Power.
Most modern microprocessors do not stop at the first cache miss, but keep going, trying to do extra work. This is often called speculative execution. Furthermore, the processor may be in-order or out-of-order; although the latter may give you even greater differences between the number of L1 misses and the number of L2 accesses, it's not necessary - you can get this behavior even on in-order processors.
Short answer: many of these speculative memory accesses will be to the same memory location. They will be squashed and combined.
The performance event "L1 cache misses" is probably[*] counting the number of (speculative) instructions that missed the L1 cache. These then allocate a hardware data structure, called a fill buffer at Intel and a miss status handling register (MSHR) elsewhere. Subsequent cache misses to the same cache line will miss the L1 cache but hit the fill buffer, and will get squashed. Only one of them, typically the first, will get sent to the L2 and counted as an L2 access.
By the way, there may be a performance event for this: Squashed_Cache_Misses.
There may also be a performance event L1_Cache_Misses_Retired. But this may undercount, since speculation may pull the data into the cache, and a cache miss at retirement may never occur.
([*] By the way, when I say "probably" here I mean "On the machines that I helped design". Almost definitely. I might have to check the definition, look at the RTL, but I would be immensely surprised if not. It is almost guaranteed.)
E.g. imagine that you are accessing bytes A[0], A[1], A[2], ... A[63], A[64], ...
If the address of A[0] is equal to zero modulo 64, then A[0]..A[63] will be in the same cache line on a machine with 64-byte cache lines. If the code that uses these is simple, it is quite possible that all of them can be issued speculatively. QED: 64 speculative memory accesses, 64 L1 cache misses, but only one L2 memory access.
(By the way, don't expect the numbers to be quite so clean. You might not get exactly 64 L1 misses per L2 access.)
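In code, the scenario is nothing more exotic than this (idealized, per the caveat above):

    // Touch 64 bytes that all live in one 64-byte cache line. An aggressive
    // core can issue many of these loads speculatively: each may be counted
    // as an L1 miss, but once the first one has allocated a fill buffer for
    // the line, only a single request actually goes out to the L2.
    unsigned sum_line(const unsigned char *A)   /* assume A is 64-byte aligned */
    {
        unsigned s = 0;
        for (int i = 0; i < 64; ++i)
            s += A[i];
        return s;
    }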
Some more possibilities:
If the number of L2 accesses is greater than the number of L1 cache misses (I have almost never seen it, but it is possible), you may have a memory access pattern that is confusing the hardware prefetcher. The hardware prefetcher tries to predict which cache lines you are going to need. If the prefetcher predicts badly, it may fetch cache lines that you don't actually need. Oftentimes there is a performance event to count Prefetches_from_L2 or Prefetches_from_Memory.
Some machines may cancel speculative accesses that have caused an L1 cache miss, before they are sent to the L2. However, I don't know of Intel doing this.
The write policy of a data cache determines whether a store hit writes its data only in that cache (write-back or copy-back) or also to the following level of the cache hierarchy (write-through).
Hence, a store that hits at a write-through L1-D cache, also writes its data at the L2 cache.
This could be another source of L2 accesses that do not come from L1 cache misses.
The details of the MESI protocol for multicore processors would be really important for me, but I can't find them anywhere. Even http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf doesn't contain enough detail. For instance: assume a private L1 and a shared L2 cache. If the state of a line is Exclusive in L1, is it Exclusive in L2 too (or Invalid, because a line can be Exclusive in only one cache)? And clearly, if another core writes this line, the state of the previously Exclusive line in L1 becomes Invalid, but how does the state of the L2 cache line change? If a Modified line in L1 is read by another core, will the new state of that line be Shared, and is it written back to main memory through the L2 cache, or does it stay Modified in L2 too? Etc.
The reason you are having trouble finding these answers is that the traditional protocols were not defined for hierarchical cache architectures, so the MESI protocol by itself doesn't define what will happen when you have an L1 and an L2 cache. It depends on other system attributes, in particular whether the L2 is inclusive or exclusive of the L1, and whether cache-to-cache transfers are supported.
If the L2 is designed to be exclusive of the L1 (i.e., it is guaranteed that the L2 and L1 can never hold common cache lines), then any line in the L1 will be in the Invalid state (basically not present) in the L2.
If the L2 is inclusive of the L1, i.e., every line in the L1 must have an entry in the L2 as well, the entry in the L2 will contain a descriptor stating which L1 cache has the line in E state.
Whether or not the value is written out to the L2 or to memory on a read of a line in the E or M state depends on whether your system supports cache-to-cache transfers. In the old days, when each chip was a single core and core-to-core communication was as expensive as a read/write to memory, systems would write the data to memory and make the other processor read it from there (this allowed them to not support cache-to-cache transfers). In multi-core, talking via memory is insanely expensive compared to talking to other cores on-chip, so almost all multi-core chips today support cache-to-cache transfer. Thus, a read of a line in the E or M state is not serviced by writing to memory.
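If it helps to see it written down, the textbook single-cache MESI snoop behaviour (which is all the protocol by itself defines) boils down to the following sketch; how a shared L2 mirrors these states, and whether the data moves cache-to-cache or via memory, is exactly the implementation-defined part described above:

    /* Textbook MESI: how this cache's copy of a line reacts to a snooped
     * access from another core. Hierarchy details (what the L2 does) are
     * deliberately left out, because MESI itself does not specify them. */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    typedef struct {
        mesi_t next;         /* our new state for the line                      */
        int    supply_data;  /* nonzero if we must source the data, either by a */
                             /* cache-to-cache transfer or by writing it back   */
    } snoop_result_t;

    snoop_result_t snoop(mesi_t current, int remote_is_write)
    {
        snoop_result_t r = { current, 0 };
        switch (current) {
        case MODIFIED:
            r.supply_data = 1;                           /* only valid copy is ours */
            r.next = remote_is_write ? INVALID : SHARED;
            break;
        case EXCLUSIVE:
        case SHARED:
            r.next = remote_is_write ? INVALID : SHARED; /* demote or drop our copy */
            break;
        case INVALID:
            break;                                       /* nothing to do */
        }
        return r;
    }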
I hope this helps.
I found this. It might help.
Third comment here might also be useful.
Could the L1/L2 caches each hold multiple copies of the same main memory data word?
It's possible for the same main memory location to be in a cache more than once. Obviously that's true, and a common occurrence, for multiprocessor machines. But even on uniprocessor machines, it can happen.
Consider a Pentium CPU that has a split L1 instruction/data cache. Instructions only go to the I-cache, data only to the D-cache. Now if the OS allows self modifying code, the same memory could be loaded into both the I- and D-cache, once as data, once as instructions. Now you have that data twice in the L1 cache. Therefore a CPU with such a split cache architecture must employ a cache coherence protocol to avoid race conditions/corruption.
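The software-visible consequence is that anything generating code at run time (a JIT, or self-modifying code in general) has to resynchronize the two caches by hand on architectures where the hardware doesn't do it for you. A POSIX-flavoured sketch (error handling omitted; the builtin is GCC/Clang-specific):

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    typedef int (*fn_t)(void);

    /* The same bytes end up in the D-cache (via the memcpy) and later in the
     * I-cache (when we jump to them); the cache-flush builtin is what forces
     * the two views back into agreement where hardware doesn't do it itself. */
    fn_t emit(const uint8_t *code, size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memcpy(buf, code, len);                                  /* data side  */
        __builtin___clear_cache((char *)buf, (char *)buf + len);
        return (fn_t)buf;                                        /* fetch side */
    }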
No - if it's already in the cache the MMU will use that rather than creating another copy.
Every cache basically stores some small subset of the whole memory. When the CPU needs a word from memory, it first checks L1, then L2, and so on, before main memory is accessed.
So a particular memory word can be in L2 and in L1 simultaneously, but it can't be stored twice in L1, because that is not necessary.
Yes, it can. The L1 copy has been updated but has not yet been flushed to L2. This happens only if L1 and L2 are non-exclusive caches. This is obvious for uniprocessors, but it is even more so for multiprocessors, which typically have their own L1 caches for each core.
It all depends on the cache architecture and what it actually guarantees.