Is there an architectural reason to use writealloc cache policy in ARM SMP Linux kernel? Can we change it to writeback cache policy?
Kernel boot log :
[ 0.000000] Forcing write-allocate cache policy for SMP
[ 0.000000] Memory policy: Data cache writealloc
Is there an architectural reason to use writealloc cache policy in ARM SMP Linux kernel?
First, it is much faster for most workloads. Second, the spin_locks and other Linux synchronization primitives use LDREX and STREX and probably need to have a write-allocate policy (see Xilinx: W/A and exclusive access), or at least anything else would complicate the code using exclusive access, which is a large benefit for SMP systems.
Write allocate implies a write-back cache; no-write allocate implies a write-through cache (or basically no caching of writes). It is probably much harder to get exclusive locking to work with a write-through cache (because you would effectively have to duplicate a write-back cache to implement the exclusive lock).
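To make the connection between exclusives and cacheable memory concrete, here is a minimal illustrative sketch (not the kernel's arch_spin_lock) of a test-and-set spinlock written with the GCC/Clang __atomic builtins, which on ARMv7 compile down to an LDREX/STREX retry loop:

    /* Illustrative test-and-set spinlock. On ARMv7 the exchange compiles to an
     * LDREX/STREX loop; the exclusive monitor behind those instructions only
     * works reliably on normal, cacheable memory on many implementations,
     * which is part of why the cache policy matters for SMP. */
    typedef struct { volatile unsigned int v; } toy_spinlock_t;

    static inline void toy_spin_lock(toy_spinlock_t *lock)
    {
        /* Swap 1 in; a non-zero old value means someone else holds the lock. */
        while (__atomic_exchange_n(&lock->v, 1u, __ATOMIC_ACQUIRE))
            ;   /* spin until we observe the 0 -> 1 transition ourselves */
    }

    static inline void toy_spin_unlock(toy_spinlock_t *lock)
    {
        __atomic_store_n(&lock->v, 0u, __ATOMIC_RELEASE);
    }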
Can we change it to writeback cache policy?
It looks like NO, at least not without modifying the source, which is what I think you mean. The kernel parameter cachepolicy can be one of:
uncached
buffered
writethrough
writeback
writealloc
build_mem_type_table forces this to write-allocate for an SMP system. At the very least you would need to change this code. However, if you naively remove it, there will be consequences; see, for instance, commit ca8f0b0a545f55b.
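The forcing logic in build_mem_type_table looks roughly like the sketch below (paraphrased from arch/arm/mm/mmu.c; the exact form varies between kernel versions):

    /* Paraphrased sketch of the check in arch/arm/mm/mmu.c, not exact kernel code. */
    if (is_smp() && cachepolicy != CPOLICY_WRITEALLOC) {
        pr_warn("Forcing write-allocate cache policy for SMP\n");
        cachepolicy = CPOLICY_WRITEALLOC;
    }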
Source: Wikipedia
There are two basic cache writing approaches:
Write-through: write is done synchronously both to the cache and to the backing store.
Write-back (also called write-behind): initially, writing is done only to the cache. The write to the backing store is postponed until the modified content is about to be replaced by another cache block.
...
Since no data is returned to the requester on write operations, a decision needs to be made on write misses, whether or not data would be loaded into the cache. This is defined by these two approaches:
Write allocate (also called fetch on write): data at the missed-write location is loaded to cache, followed by a write-hit operation. In this approach, write misses are similar to read misses.
No-write allocate (also called write-no-allocate or write around): data at the missed-write location is not loaded to cache, and is written directly to the backing store. In this approach, data is loaded into the cache on read misses only.
...
A write-back cache uses write allocate, hoping for subsequent writes (or even reads) to the same location, which is now cached.
A write-through cache uses no-write allocate. Here, subsequent writes have no advantage, since they still need to be written directly to the backing store.
Entities other than the cache may change the data in the backing store, in which case the copy in the cache may become out-of-date or stale. Alternatively, when the client updates the data in the cache, copies of those data in other caches will become stale. Communication protocols between the cache managers which keep the data consistent are known as coherency protocols.
ARM CPUs typically have a write buffer, so multiple writes (say, 32-bit) will be ganged into 128-bit transfers (the AXI bus size) or even larger for SDRAM devices.
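To make the two write-miss policies concrete, here is a toy, purely software model of a direct-mapped write-back cache in C; everything here (line size, structures, the ram[] array) is invented for illustration and has nothing to do with real hardware:

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 32
    #define NUM_LINES 64

    struct line { int valid, dirty; uintptr_t tag; uint8_t data[LINE_SIZE]; };
    static struct line cache[NUM_LINES];
    static uint8_t ram[1 << 20];            /* pretend main memory; addr must stay below 1 MB */

    static void write_byte(uintptr_t addr, uint8_t val, int write_allocate)
    {
        struct line *l = &cache[(addr / LINE_SIZE) % NUM_LINES];
        uintptr_t tag = addr / LINE_SIZE;

        if (l->valid && l->tag == tag) {    /* write hit: update the line, mark it dirty */
            l->data[addr % LINE_SIZE] = val;
            l->dirty = 1;
            return;
        }
        if (!write_allocate) {              /* no-write-allocate: write around the cache */
            ram[addr] = val;
            return;
        }
        if (l->valid && l->dirty)           /* write-back: flush the dirty victim line first */
            memcpy(&ram[l->tag * LINE_SIZE], l->data, LINE_SIZE);
        memcpy(l->data, &ram[tag * LINE_SIZE], LINE_SIZE);  /* fetch-on-write */
        l->valid = 1;
        l->tag = tag;
        l->data[addr % LINE_SIZE] = val;    /* the miss has become a hit */
        l->dirty = 1;
    }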
Related
I'm debugging an HTTP server on STM32H725VG using LWIP and HAL drivers, all initially generated by STM32CubeMX. The problem is that in some cases data sent via HAL_ETH_Transmit have some octets replaced by 0x00, and this corrupted content successfully gets to the client.
I've checked that the data in the buffers passed as arguments into HAL_ETH_Transmit are intact both before and after the call to this function. So, apparently, the corruption occurs on transfer from the RAM to the MAC, because the checksum is calculated on the corrupted data. So I supposed that the problem may be due to interaction between cache and DMA. I've tried disabling D-cache, and then the corruption doesn't occur.
Then I thought that I should just use __DSB() instruction that should write the cached data into the RAM. After enabling D-cache back, I added __DSB() right before the call to HAL_ETH_Transmit (which is inside low_level_output function generated by STM32CubeMX), and... nothing happened: the data are still corrupted.
Then, after some experimentation I found that SCB_CleanDCache() call after (or instead of) __DSB() fixes the problem.
This makes me wonder. The description of DSB instruction is as follows:
Data Synchronization Barrier acts as a special kind of memory barrier. No instruction in program order after this instruction executes until this instruction completes. This instruction completes when:
All explicit memory accesses before this instruction complete.
All Cache, Branch predictor and TLB maintenance operations before this instruction complete.
And the description of SCB_DisableDCache has the following note about SCB_CleanDCache:
When disabling the data cache, you must clean (SCB_CleanDCache) the entire cache to ensure that any dirty data is flushed to external memory.
Why doesn't the DSB flush the cache if it's supposed to be complete when "all explicit memory accesses" complete, which seems to include flushing of caches?
dsb ish works as a memory barrier for inter-thread memory order; it just orders the current CPU's access to coherent cache. You wouldn't expect dsb ish to flush any cache because that's not required for visibility within the same inner-shareable cache-coherency domain. Like it says in the manual you quoted, it finishes memory operations.
Cacheable memory operations on write-back cache only update cache; waiting for them to finish doesn't imply flushing the cache.
I think your ARM system has multiple coherency domains (e.g. microcontroller vs. DSP)? Does your __DSB intrinsic compile to a dsb sy instruction? Assuming that doesn't flush cache, what they mean is presumably that it orders memory / cache operations, including explicit flushes, which are still necessary.
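For the STM32 case above, a common pattern (assuming the CMSIS cache-maintenance functions available on a Cortex-M7 such as the H7 series) is to clean only the D-cache lines covering the TX buffer right before handing it to the Ethernet DMA; tx_buffer_clean below is a hypothetical helper, not HAL code:

    #include "stm32h7xx_hal.h"      /* assumed CMSIS/HAL environment */

    /* Clean (write back) only the cache lines covering the TX buffer before the
     * DMA reads it. The address and size passed to SCB_CleanDCache_by_Addr are
     * rounded to the 32-byte cache-line size. */
    static void tx_buffer_clean(void *buf, uint32_t len)
    {
        uint32_t start = (uint32_t)buf & ~31u;                  /* align down to a line */
        uint32_t end   = ((uint32_t)buf + len + 31u) & ~31u;    /* align up to a line   */

        SCB_CleanDCache_by_Addr((uint32_t *)start, (int32_t)(end - start));
        __DSB();    /* make sure the clean has completed before the DMA is kicked off */
    }

For RX buffers the direction is reversed: you would invalidate the range (SCB_InvalidateDCache_by_Addr) before reading what the DMA wrote, or place the DMA buffers in an MPU region configured as non-cacheable and avoid the maintenance altogether.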
I'd put my money on performance.
Flushing the cache means writing data from the cache to memory, and memory access is slow.
The L1 cache size (assuming ARM Cortex-A9) is 32KB. You don't want to move a whole 32KB from cache into memory for no reason. There might be an L2 cache, which is easily 512KB-1MB (could be even more). You really don't want to move the whole L2 either.
As a matter of fact, your whole DMA transfer might be smaller than the size of the caches. There is simply no justification for doing that.
If a cache miss happens, is the data moved to the register directly from main memory, or is the data first moved into the cache and then to the register? Is there a direct way to connect the registers with main memory?
I think you're asking if a cache-miss load has to wait for L1 load-use latency after the cache line arrives from outer cache. i.e. wait for the line to be written to L1, then retry the load normally.
I'm almost certain that high-performance CPUs don't work that way. L2-hit latency is important for many workloads, and you need a load buffer tracking that incoming cache line anyway to know when to restart the load. So you just grab the data as it comes in, in parallel with writing it to the cache. The TLB check was already done as part of generating a physical address to send to the outer cache.
Most real CPUs use an early-restart design that lets the pipeline restart as soon as the word / byte they were waiting for arrives, so the rest of the cache line transfers "in the background".
A further optimization is critical-word-first, which asks for the cache line to be sent starting with the needed word, so a demand miss for a word in the middle of a cache line can receive that word first. I think modern DDR DRAM still supports this when reading from main memory, starting the 64-byte burst at a specified 64-bit chunk. I'm not 100% sure modern out-of-order CPUs use this, though; when out-of-order execution allows multiple outstanding misses for the same line, it probably makes it more complicated.
See "Which is optimal, a bigger block cache size or a smaller one?" for some discussion of early-restart and critical-word-first.
Is there a direct way connect the register with main memory?
It depends what you mean by "direct". In a modern high-performance CPU, there will be 2 or 3 layers of cache and a memory controller with its own buffering to arbitrate access to memory for multiple cores. So no, you can't.
If you design a simple single-core CPU with special cache-bypassing load and store instructions, then sure. Or if you consider early-restart as "direct", then yes it already happens.
For stores, x86 and some other architectures have cache-bypassing stores, but x86's MOVNT instructions don't directly connect registers with memory. Stores go into a line-fill buffer which is flushed when full, so you get write-combining.
There are also uncacheable memory regions: a load or store to uncacheable memory is architecturally "direct", but in the actual microarchitecture it still goes through the memory hierarchy, from the load/store execution unit through the same mechanism that L1D uses to talk to the memory controller.
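As an illustration of the cache-bypassing store path mentioned above, here is a small C sketch using the SSE2 intrinsics that map to MOVNT stores (fill_nt is a hypothetical helper; dst is assumed 16-byte aligned and bytes a multiple of 16):

    #include <emmintrin.h>      /* SSE2: _mm_stream_si128 maps to MOVNTDQ */
    #include <stddef.h>

    /* Fill a buffer with non-temporal stores. The data goes through
     * write-combining buffers rather than being allocated into L1/L2. */
    static void fill_nt(void *dst, size_t bytes)
    {
        __m128i zero = _mm_set1_epi32(0);

        for (size_t i = 0; i < bytes; i += 16)
            _mm_stream_si128((__m128i *)((char *)dst + i), zero);

        _mm_sfence();   /* order the NT stores before subsequent normal stores */
    }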
I couldn't find a source that explains how the policy works in great detail. The combinations of write policies are explained in Jouppi's paper for those interested. This is how I understood it:
1. A write request is sent from the CPU to the cache.
2. The request results in a cache miss.
3. A cache block is allocated for this request in the cache. (Write-Allocate)
4. The requested block is fetched from lower memory into the allocated cache block. (Fetch-on-Write)
5. Now we are able to write into the allocated cache block, which has been updated by the fetch.
The question is what happens between step 4 and step 5. (Let's say the cache is a non-blocking cache using Miss Status Handling Registers.)
Does the CPU have to retry the write request on the cache until a write hit happens (after the block has been fetched into the allocated cache block)?
If not, where is the write-request data held in the meantime?
Edit: I think I've found my answer in Implementation of Write Allocate in the K86™ Processors. The data is written directly into the allocated cache block and gets merged with the read request later on.
The data is written directly into the allocated cache block and gets merged with the read request later on.
No, that's not what AMD's pdf says. They say the store-data is merged with the just-fetched data from memory and then stored into the L1 cache's data array.
Cache tracks validity with cache-line granularity. There's no way for it to store the fact that "bytes 3 to 6 are valid; keep them when data arrives from memory". That kind of logic is too big to replicate in each line of the cache array.
Also note that the pdf you found describes some specific behaviour of AMD's K6 microarchitectures, which were single-core only, and some models only had a single level of cache, so no cache-coherency protocol was even necessary. They do describe the K6-III (model 9) using MESI between the L1 and L2 caches.
A CPU writing to cache has to hold onto the data until the cache is ready to accept it. It's not a retry-until-success process, though; it's more like the cache notifies the store hardware when it's ready to accept that store (i.e. it has that line active, and in the Modified state if the cache is coherent with other caches using the MESI protocol).
In a real CPU, multiple outstanding misses can be in flight at once (even without full out-of-order speculative execution). This is called miss under miss. The CPU<->cache connection needs a buffer for each outstanding miss that can be supported in parallel, to hold the store data. e.g. a core might have 8 buffers and support 8 outstanding load or store misses. A 9th memory operation couldn't start to happen until one of the 8 buffers became available. Until then, data would have to stay in the CPU's store queue.
These buffers might be shared between loads and stores, or there might be dedicated store buffers. The OP reports that searching on store buffer found lots of related stuff of interest; one example being this part of Wikipedia's MESI article.
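Conceptually, each of those per-miss buffers can be pictured like the hypothetical C structure below; real hardware implements this in registers and comparators, so this is only a mental model of where the store data sits while the line fill is outstanding:

    #include <stdint.h>

    #define LINE_SIZE 64

    /* Hypothetical model of one miss-status-handling register (MSHR) entry.
     * Pending store data is parked here and merged into the incoming line
     * before the line is written into the cache's data array. */
    struct mshr_entry {
        uint64_t line_addr;                 /* which line is being fetched            */
        uint8_t  pending_data[LINE_SIZE];   /* store data waiting to be merged        */
        uint8_t  byte_valid[LINE_SIZE];     /* which bytes of pending_data are valid  */
        uint8_t  in_use;                    /* entry allocated to an outstanding miss */
    };

    /* e.g. 8 entries => up to 8 outstanding load/store misses in flight */
    static struct mshr_entry mshr_file[8];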
The L1 cache is really a part of a CPU core in modern high-performance designs. It's very tightly integrated with the memory-order logic, and needs to be able to efficiently support atomic operations like lock inc [mem] and lots of other complications (like memory reordering). See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for example.
Some other terms:
store buffer
store queue
memory order buffer
cache write port / cache read port / cache port
globally visible
distantly related: An interesting post investigating the adaptive replacement policy of Intel IvyBridge's L3 cache, making it more resistant against evicting valuable data when scanning a huge array.
I have a question regarding memory mapped io.
Suppose there is a memory-mapped IO peripheral whose value is being read by the CPU. Once read, the value is stored in the cache. But the value in memory has since been updated by the external IO peripheral.
In such a case, how will the CPU determine that its cached copy has become stale, and what could be the workaround?
That's strongly platform dependent. And actually, there are two different cases.
Case #1. Memory-mapped peripheral. This means that accesses to some range of physical memory addresses are routed to a peripheral device; there is no actual RAM involved. To control caching, x86, for example, has MTRRs ("memory type range registers") and PAT ("page attribute tables"). They allow setting the caching mode for a particular range of physical memory. Under normal circumstances, a range of memory mapped to RAM is write-back cacheable, while a range of memory mapped to peripheral devices is uncacheable. The different caching policies are described in Intel's system programming guide, 11.3 "Methods of caching available". So, when you issue a read or write request to a memory-mapped peripheral, the CPU cache is bypassed, and the request goes directly to the device.
Case #2. DMA. This allows peripheral devices to access RAM asynchronously. In this case, the DMA controller is no different from any CPU and participates equally in the cache-coherency protocol. A write request from the peripheral is seen by the caches of other CPUs, and cache lines are either invalidated or updated with the new data. A read request is also seen by the caches of other CPUs, and data is returned from the cache rather than from main RAM. (This is only an example: the actual implementation is platform dependent. For example, SoCs typically do not guarantee strong cache coherency between peripherals and the CPU.)
In both cases, the problem of caching also exists at the compiler level: the compiler may cache data values in registers. That's why programming languages have some means of prohibiting such optimization: for example, the volatile keyword in C.
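A minimal example of that compiler-level part, assuming a hypothetical device status register at an invented address (the mapping itself must still be made uncacheable via MTRR/PAT on x86 or device memory attributes on ARM):

    #include <stdint.h>

    /* Hypothetical device register address, for illustration only. */
    #define DEV_STATUS_REG  ((volatile uint32_t *)0x40001000u)

    /* volatile forces the compiler to re-read the register on every iteration
     * instead of caching the value in a CPU register. */
    static void wait_for_ready(void)
    {
        while ((*DEV_STATUS_REG & 0x1u) == 0)
            ;   /* spin until the device sets its READY bit */
    }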
I am writing a dummy driver to share kernel buffer to user space on ARM v7.
I want to implement fsync() operation for this buffer. Which APIs should I use to flush L1 and L2 cache for a given user address range in my fsync?
There are many APIs available in asm/cacheflush.h, but I am not sure whether they will flush both L1 and L2 or only L1.
Currently I am using
dmac_flush_range()
outer_flush_range()
APIs. Are they fine for the use case?
Thanks!
ARMv7 mandates that data caches behave as if physically-indexed and physically-tagged*, which means that multiple virtual addresses mapping to the same physical address are naturally coherent with each other without requiring any cache maintenance or barriers. Therefore the kernel mapping and user mapping of your buffer are already fully in sync at all times, and there's not really anything you need to do. You certainly don't have any of the VIVT cache problems of older CPUs.
That said, using those architecture-private cache APIs directly from a driver would get you roundly shouted at by kernel maintainers these days - drivers should normally only need to care about cache maintenance at all when DMA is involved, but correct use of the DMA mapping API already takes care of everything in that regard.
* they don't strictly have to be PIPT, for instance Cortex-A8's L1 which is actually non-aliasing VIPT under the hood.
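For completeness, the kind of DMA mapping API usage the answer alludes to looks roughly like the sketch below (streaming mapping, CPU-to-device direction, error handling trimmed); the API performs whatever cache maintenance the platform needs, so the driver never touches dmac_flush_range() or outer_flush_range() itself:

    #include <linux/device.h>
    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    /* Sketch: hand a driver-owned kernel buffer to a device for a CPU -> device
     * transfer. start_tx is a hypothetical helper, not an existing kernel API. */
    static int start_tx(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

        if (dma_mapping_error(dev, handle))
            return -ENOMEM;

        /* ... program the hardware with 'handle' and kick off the transfer ... */

        dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
        return 0;
    }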