Opposite of cache prefetch hint - performance

Is there a hint I can put in my code indicating that a line should be removed from cache? As opposed to a prefetch hint, which would indicate I will soon need a line. In my case, I know when I won't need a line for a while, so I want to be able to get rid of it to free up space for lines I do need.

clflush, clflushopt
Invalidates from every level of the cache hierarchy in the cache coherence domain the cache line that contains the
linear address specified with the memory operand. If that cache line contains modified data at any level of the
cache hierarchy, that data is written back to memory.
They are not available on every CPU (in particular, clflushopt is only available on the 6th generation and later). To be certain, you should use CPUID to verify their availability:
The availability of CLFLUSH is indicated by the presence of the CPUID feature flag CLFSH
(CPUID.01H:EDX[bit 19]).
The availability of CLFLUSHOPT is indicated by the presence of the CPUID feature flag CLFLUSHOPT
(CPUID.(EAX=7,ECX=0):EBX[bit 23]).
If available, you should use clflushopt. It outperforms clflush when flushing buffers larger than 4KiB (64 lines).
This is the benchmark from Intel's Optimization Manual:
For informational purpose (assuming you are running in a privileged context), you can also use invd (as a nuke-from-orbit option). This:
Invalidates (flushes) the processor’s internal caches and issues a special-function bus cycle that directs external
caches to also flush themselves. Data held in internal caches is not written back to main memory.
or wbinvd, which:
Writes back all modified cache lines in the processor’s internal cache to main memory and invalidates (flushes) the
internal caches. The instruction then issues a special-function bus cycle that directs external caches to also write
back modified data and another bus cycle to indicate that the external caches should be invalidated.
A future instruction that could make it into the ISA is club. Although this won't fit your need (because it doesn't necessarily invalidate the line), it's worth mentioning for completeness. This would:
Writes back to memory the cache line (if dirty) that contains the linear address specified with the memory
operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the
cache hierarchy in non-modified state. Retaining the line in the cache hierarchy is a performance optimization
(treated as a hint by hardware) to reduce the possibility of cache miss on a subsequent access. Hardware may
choose to retain the line at any of the levels in the cache hierarchy, and in some cases, may invalidate the line
from the cache hierarchy.

Related

Why does DSB not flush the cache?

I'm debugging an HTTP server on STM32H725VG using LWIP and HAL drivers, all initially generated by STM32CubeMX. The problem is that in some cases data sent via HAL_ETH_Transmit have some octets replaced by 0x00, and this corrupted content successfully gets to the client.
I've checked that the data in the buffers passed as arguments into HAL_ETH_Transmit are intact both before and after the call to this function. So, apparently, the corruption occurs on transfer from the RAM to the MAC, because the checksum is calculated on the corrupted data. So I supposed that the problem may be due to interaction between cache and DMA. I've tried disabling D-cache, and then the corruption doesn't occur.
Then I thought that I should just use __DSB() instruction that should write the cached data into the RAM. After enabling D-cache back, I added __DSB() right before the call to HAL_ETH_Transmit (which is inside low_level_output function generated by STM32CubeMX), and... nothing happened: the data are still corrupted.
Then, after some experimentation I found that SCB_CleanDCache() call after (or instead of) __DSB() fixes the problem.
This makes me wonder. The description of DSB instruction is as follows:
Data Synchronization Barrier acts as a special kind of memory barrier. No instruction in program order after this instruction executes until this instruction completes. This instruction completes when:
All explicit memory accesses before this instruction complete.
All Cache, Branch predictor and TLB maintenance operations before this instruction complete.
And the description of SCB_DisableDCache has the following note about SCB_CleanDCache:
When disabling the data cache, you must clean (SCB_CleanDCache) the entire cache to ensure that any dirty data is flushed to external memory.
Why doesn't the DSB flush the cache if it's supposed to be complete when "all explicit memory accesses" complete, which seems to include flushing of caches?
dsb ish works as a memory barrier for inter-thread memory order; it just orders the current CPU's access to coherent cache. You wouldn't expect dsb ish to flush any cache because that's not required for visibility within the same inner-shareable cache-coherency domain. Like it says in the manual you quoted, it finishes memory operations.
Cacheable memory operations on write-back cache only update cache; waiting for them to finish doesn't imply flushing the cache.
Your ARM system I think has multiple coherency domains for microcontroller vs. DSP? Does your __DSB intrinsic compile to a dsb sy instruction? Assuming that doesn't flush cache, what they mean is presumably that it orders memory / cache operations including explicit flushes, which are still necessary.
I'd put my money on performance.
Flushing cache means to write data from cache to memory. Memory access is slow.
L1 cache size (assuming ARM Cortex-A9) is 32KB. You don't want to move a whole 32KB from cache into memory for no reason. There might be L2 cache which is easily 512KB-1MB (could be even more). You really don't want to move a whole L2 either.
As a matter of fact your whole DMA transfer might be smaller than size of caches. There is simply no justification to do that.

Exclusive access to L1 cacheline on x86?

If one has a 64 byte buffer that is heavily read/written to then it's likely that it'll be kept in L1; but is there any way to force that behaviour?
As in, give one core exclusive access to those 64 bytes and tell it not to sync the data with other cores nor the memory controller so that those 64 bytes always live in one core's L1 regardless of whether or not the CPU thinks it's used often enough.
No, x86 doesn't let you do this. You can force evict with clfushopt, or (on upcoming CPUs) for just write-back without evict with clwb, but you can't pin a line in cache or disable coherency.
You can put the whole CPU (or a single core?) into cache-as-RAM (aka no-fill) mode to disable sync with the memory controller, and disable ever writing back the data. Cache-as-Ram (no fill mode) Executable Code. It's typically used by BIOS / firmware in early boot before configuring the memory controllers. It's not available on a per-line basis, and is almost certainly not practically useful here. Fun fact: leaving this mode is one of the use-cases for invd, which drops cached data without writeback, as opposed to wbinvd.
I'm not sure if no-fill mode prevents eviction from L1d to L3 or whatever; or if data is just dropped on eviction. So you'd just have to avoid accessing more than 7 other cache lines that alias the one you care about in your L1d, or the equivalent for L2/L3.
Being able to force one core to hang on to a line of L1d indefinitely and not respond to MESI requests to write it back / share it would make the other cores vulnerable to lockups if they ever touched that line. So obviously if such a feature existed, it would require kernel mode. (And with HW virtualization, require hypervisor privilege.) It could also block hardware DMA (because modern x86 has cache-coherent DMA).
So supporting such a feature would require lots of parts of the CPU to handle indefinite delays, where currently there's probably some upper bound, which may be shorter than a PCIe timeout, if there is such a thing. (I don't write drivers or build real hardware, just guessing about this).
As #fuz points out, a coherency-violating instruction (xdcbt) was tried on PowerPC (in the Xbox 360 CPU), with disastrous results from mis-speculated execution of the instruction. So it's hard to implement.
You normally don't need this.
If the line is frequently used, LRU replacement will keep it hot. And if it's lost from L1d at frequent enough intervals, then it will probably stay hot in L2 which is also on-core and private, and very fast, in recent designs (Intel since Nehalem). Intel's inclusive L3 on CPUs other than Skylake-AVX512 means that staying in L1d also means staying in L3.
All this means that full cache misses all the way to DRAM are very unlikely with any kind of frequency for a line that's heavily used by one core. So throughput shouldn't be a problem. I guess you could maybe want this for realtime latency, where the worst-case run time for one call of a function mattered. Dummy reads from the cache line in some other part of the code could be helpful in keeping it hot.
However, if pressure from other cores in L3 cache causes eviction of this line from L3, Intel CPUs with an inclusive L3 also have to force eviction from inner caches that still have it hot. IDK if there's any mechanism to let L3 know that a line is heavily used in a core's L1d, because that doesn't generate any L3 traffic.
I'm not aware of this being much of a problem in real code. L3 is highly associative (like 16 or 24 way), so it takes a lot of conflicts before you'd get an eviction. L3 also uses a more complex indexing function (like a real hash function, not just modulo by taking a contiguous range of bits). In IvyBridge and later, it also uses an adaptive replacement policy to mitigate eviction from touching a lot of data that won't be reused often. http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.
See also Which cache mapping technique is used in intel core i7 processor?
#AlexisWilke points out that you could maybe use vector register(s) instead of a line of cache, for some use-cases. Using ymm registers as a "memory-like" storage location. You could globally dedicate some vector regs to this purpose. To get this in gcc-generated code, maybe use -ffixed-ymm8, or declare it as a volatile global register variable. (How to inform GCC to not use a particular register)
Using ALU instructions or store-forwarding to get data to/from the vector reg will give you guaranteed latency with no possibility of data-cache misses. But code-cache misses are still a problem for extremely low latency.
There is no direct way to achieve that on Intel and AMD x86 processors, but you can get pretty close with some effort. First, you said you're worried that the cache line might get evicted from the L1 because some other core might access it. This can only happen in the following situations:
The line is shared, and therefore, it can be accessed by multiple agents in the system concurrently. If another agent attempts to read the line, its state will change from Modified or Exclusive to Shared. That is, it will state in the L1. If, on the other hand, another agent attempts to write to the line, it has to be invalidated from the L1.
The line can be private or shared, but the thread got rescheduled by the OS to run on another core. Similar to the previous case, if it attempts to read the line, its state will change from Modified or Exclusive to Shared in both L1 caches. If it attempts to write to the line, it has to be invalidated from the L1 of the previous core on which it was running.
There are other reasons why the line may get evicted from the L1 as I will discuss shortly.
If the line is shared, then you cannot disable coherency. What you can do, however, is make a private copy of it, which effectively does disable coherency. If doing that may lead to faulty behavior, then the only thing you can do is to set the affinity of all threads that share the line to run on the same physical core on a hyperthreaded (SMT) Intel processor. Since the L1 is shared between the logical cores, the line will not get evicted due to sharing, but it can still get evicted due to other reasons.
Setting the affinity of a thread does not guarantee though that other threads cannot get scheduled to run on the same core. To reduce the probability of scheduling other threads (that don't access the line) on the same core or rescheduling the thread to run on other physical cores, you can increase the priority of the thread (or all the threads that share the line).
Intel processors are mostly 2-way hyperthreaded, so you can only run two threads that share the line at a time. so if you play with the affinity and priority of the threads, performance can change in interesting ways. You'll have to measure it. Recent AMD processors also support SMT.
If the line is private (only one thread can access it), a thread running on a sibling logical core in an Intel processor may cause the line to be evicted because the L1 is competitively shared, depending on its memory access behavior. I will discuss how this can be dealt with shortly.
Another issue is interrupts and exceptions. On Linux and maybe other OSes, you can configure which cores should handle which interrupts. I think it's OK to map all interrupts to all other cores, except the periodic timer interrupt whose interrupt handler's behavior is OS-dependent and it may not be safe to play with it. Depending on how much effort you want to spend on this, you can perform carefully designed experiments to determine the impact of the timer interrupt handler on the L1D cache contents. Also you should avoid exceptions.
I can think of two reasons why a line might get invalidated:
A (potentially speculative) RFO with intent for modification from another core.
The line was chosen to be evicted to make space for another line. This depends on the design of the cache hierarchy:
The L1 cache placement policy.
The L1 cache replacement policy.
Whether lower level caches are inclusive or not.
The replacement policy is commonly not configurable, so you should strive to avoid conflict L1 misses, which depends on the placement policy, which depends on the microarchitecture. On Intel processors, the L1D is typically both virtually indexed and physically indexed because the bits used for the index don't require translation. Since you know the virtual addresses of all memory accesses, you can determine which lines would be allocated from which cache set. You need to make sure that the number of lines mapped to the same set (including the line you don't want it to be evicted) does not exceed the associativity of the cache. Otherwise, you'd be at the mercy of the replacement policy. Note also that an L1D prefetcher can also change the contents of the cache. You can disable it on Intel processors and measure its impact in both cases. I cannot think of an easy way to deal with inclusive lower level caches.
I think the idea of "pinning" a line in the cache is interesting and can be useful. It's a hybrid between caches and scratch pad memories. The line would be like a temporary register mapped to the virtual address space.
The main issue here is that you want to both read from and write to the line, while still keeping it in the cache. This sort of behavior is currently not supported.

relationship between CPUECTLR.SMPEN, caches and MMU

I'm reading ARM document (ARM ® Cortex ® -A57 MPCore Processor) and see the following descriptions about
You must set CPUECTLR.SMPEN to 1 before the caches and MMU are enabled, or any instruction cache or TLB maintenance operations are performed.
CPUECTLR.SMPEN is for:
Enables the processor to receive instruction cache and TLB maintenance operations broadcast from other processors in the cluster.
You must set this bit before enabling the caches and MMU, or performing any cache and TLB maintenance operations.
You must clear this bit during a processor power down sequence.
However, it is still unclear for me the real reason (i.e., why we should set CPUECTLR.SMPEN to 1 before the caches and MMU are enabled). Please help me on this. Thanks.
Simply put, SMPEN essentially controls whether the core participates in coherency protocols or not.
Without it set, any TLB or cache maintenance operation a core performs will only affect that core, and it won't be aware of other cores doing the same, nor of data in other cores' private caches - on an SMP system with all the cores operating on the same regions of memory, this is generally a recipe for data corruption and disaster.
Say everyone has their MMUs and caches enabled, and core A goes to remap some page of memory - it writes zeros to the PTE, invalidates its TLB for that VA, then writes the updated PTE. Core B could also have a TLB entry for that VA: unless the TLBI is broadcast, core B won't be aware that its entry for that VA is no longer valid, and could read bogus data or worse corrupt the old physical page now that it may have been reused for something else.
OK, perhaps core B didn't have that address cached in its TLB, but goes to access it after the update, and kicks off a page table walk. Without cache coherency, this goes several ways:
Core B happens to have the page table cached in its L1; unless it can snoop core A's L1 to know that someone else now has a dirty copy of that line and its own copy is now invalid, it's going to read the stale old PTE and go wrong.
Core B doesn't have the page tables cached at L1; unless it can coherently snoop the dirty line from core A's L1, the read goes out to L2 or main memory, hits the stale old PTE and goes wrong.
Core B doesn't have the page tables cached at L1, but core A's first write has already propagated out to L2 or further; unless core B's read can snoop the second write from core A's L1, it reads the intermediate invalid PTE from L2 and takes a fault.
Core B doesn't have the page tables cached at L1, but both of core A's writes have already propagated out to L2 or further; core B's read hits the new PTE in L2, and everything manages to work as expected by pure chance.
Now, there are some situations in which you might not want this - in asymmetric multiprocessing, where the two cores might be doing completely unrelated things, running different operating systems, and working in separate areas of memory, there might be a small benefit from not having unnecessary coherency chit-chat going on in the background - on the rare occasions the cores might want to communicate with each other there, they would probably do so via inter-processor interrupts and a specific shared area of uncached memory. For SMP, though, you really do want the cores to know about each other and be part of the same coherency domain before they have a chance to start actually allocating cache lines and TLB entries, which is precisely why the control of all the broadcast and coherency machinery is wrapped up in a single, somewhat-vaguely-named "SMP enable" bit.
To elaborate on actually entering and exiting coherency, when coming in you want to be sure that your whole data cache is invalid to avoid conflicting entries - If a CPU enters SMP with valid lines already in its cache for addresses owned by lines in other CPUs' coherent caches, the coherency protocol is broken and data loss/corruption ensues. Conversely, when going offline, the CPU has to guarantee its cache is clean to avoid data loss - it can prevent itself dirtying any more entries by disabling its cache/MMU, but it also has to exit coherency to prevent dirty lines being transferred in from other CPUs behind its back. Only then is it safe to perform the set/way operations necessary to clean the whole local cache before the contents are lost at powerdown.

Does a cache line flush access the TLB?

Assuming that we have intentionally thrashed the DTLB, and would like to proceed to flush a specific cache line from L1-3 using clflush on a memory region which is (most likely) disjoint from the addresses pointed to by the TLB entries; would this in fact bring the page base address of the cache line we are flushing back into the TLB?
In short, does a clflush touch the TLB at all? I'm assuming that due to this instruction honouring coherency, it will subsequently write that line back to memory (obviously needing a TLB look-up.)
From Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A: Instruction Set Reference, A-L: "Invalidates the cache line that contains the linear address specified with the source operand from all levels of the processor cache hierarchy (data and instruction) ."
Since it uses the linear (virtual) address, the address needs to be translated, which means that a page table walk would be needed on a TLB miss. (This would generally be the case even for a different kind of instruction that pushed cache entries out to higher levels of cache since L1 caches are typically physically tagged for x86. In general, tagging with the virtual address has fallen out of favor. Using the physical address for tags means that the physical address is needed to check the cache for a hit, so even if it was not sent to memory, translation would be needed.)
While it would be possible to avoid loading the TLB for such accesses, the extra complexity of such special-case handling would almost certainly not be viewed as worth the bother given that CLFLUSH is not commonly used.

Invalidating the CPU's cache

When my program performs a load operation with acquire semantics/store operation with release semantics or perhaps a full-fence, it invalidates the CPU's cache.
My question is this: which part of the cache is actually invalidated? only the cache-line that held the variable that I've used acquire/release? or perhaps the entire cache is invalidated? (L1 + L2 + L3 .. and so on?). Is there a difference in this subject when I use acquire/release semantics, or when i use a full-fence?
When you perform a load without fences or mutexes, then the loaded value could potentially come from anywhere, i.e, caches, registers (by way of compiler optimizations), or RAM... but from your question, you already knew this.
In most mutex implementations, when you acquire a mutex, a fence is always applied, either explicitly (e.g., mfence, barrier, etc.) or implicitly (e.g., lock prefix to lock the bus on x86). This causes the cache-lines of all caches on the path to be invalidated.
Note that the entire cache isn't invalidated, just the respective cache-lines for the memory location. This also includes the lines for the mutex (which is usually implemented as a value in memory).
Of course, there are architecture-specific details, but this is how it works in general.
Also note that this isn't the only reason for invalidating caches, as there may be operations on one CPU that would need caches on another one to be invalidated. Doing a google search for "cache coherence protocols" will provide you with a lot of information on this subject.

Resources