Read to Cache in Write-Allocate policy - caching

[write-allocate diagram]
What is the read-to-cache for in this case?
To my understanding, the Store is paused while a line in the cache is allocated, after the Write-Back occurs (to free a line whose dirty bit is set). Then the Store can occur. But what is the read-to-cache doing in this case?

Related

Why does DSB not flush the cache?

I'm debugging an HTTP server on STM32H725VG using LWIP and HAL drivers, all initially generated by STM32CubeMX. The problem is that in some cases data sent via HAL_ETH_Transmit have some octets replaced by 0x00, and this corrupted content successfully gets to the client.
I've checked that the data in the buffers passed as arguments into HAL_ETH_Transmit are intact both before and after the call to this function. So, apparently, the corruption occurs on transfer from the RAM to the MAC, because the checksum is calculated on the corrupted data. So I supposed that the problem may be due to interaction between cache and DMA. I've tried disabling D-cache, and then the corruption doesn't occur.
Then I thought that I should just use __DSB() instruction that should write the cached data into the RAM. After enabling D-cache back, I added __DSB() right before the call to HAL_ETH_Transmit (which is inside low_level_output function generated by STM32CubeMX), and... nothing happened: the data are still corrupted.
Then, after some experimentation I found that SCB_CleanDCache() call after (or instead of) __DSB() fixes the problem.
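For reference, a simplified sketch of where the fix ended up in my case (the pbuf handling and the exact HAL_ETH_Transmit arguments are omitted; only the cache-maintenance lines matter here):

    #include "lwip/netif.h"
    #include "lwip/pbuf.h"
    #include "stm32h7xx_hal.h"   /* CMSIS: SCB_CleanDCache(), __DSB() */

    static err_t low_level_output(struct netif *netif, struct pbuf *p)
    {
        __DSB();             /* on its own this did not help: it doesn't write dirty lines back */
        SCB_CleanDCache();   /* this is what actually pushes the dirty D-cache lines to RAM */

        /* ... the CubeMX-generated HAL_ETH_Transmit(...) call goes here ... */

        return ERR_OK;
    }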
This makes me wonder. The description of DSB instruction is as follows:
Data Synchronization Barrier acts as a special kind of memory barrier. No instruction in program order after this instruction executes until this instruction completes. This instruction completes when:
All explicit memory accesses before this instruction complete.
All Cache, Branch predictor and TLB maintenance operations before this instruction complete.
And the description of SCB_DisableDCache has the following note about SCB_CleanDCache:
When disabling the data cache, you must clean (SCB_CleanDCache) the entire cache to ensure that any dirty data is flushed to external memory.
Why doesn't the DSB flush the cache if it's supposed to be complete when "all explicit memory accesses" complete, which seems to include flushing of caches?
dsb ish works as a memory barrier for inter-thread memory order; it just orders the current CPU's access to coherent cache. You wouldn't expect dsb ish to flush any cache because that's not required for visibility within the same inner-shareable cache-coherency domain. Like it says in the manual you quoted, it finishes memory operations.
Cacheable memory operations on write-back cache only update cache; waiting for them to finish doesn't imply flushing the cache.
I think your ARM system has multiple coherency domains (e.g. microcontroller vs. DSP)? Does your __DSB intrinsic compile to a dsb sy instruction? Assuming that doesn't flush the cache either, what they mean is presumably that it orders memory / cache operations, including explicit flushes, which are still necessary.
I'd put my money on performance.
Flushing the cache means writing data from the cache back to memory. Memory access is slow.
The L1 data cache (the STM32H7 has a Cortex-M7) is 32KB. You don't want to move a whole 32KB from cache into memory for no reason. There might be an L2 cache, which could easily be 512KB-1MB (or even more). You really don't want to move a whole L2 either.
As a matter of fact, your whole DMA transfer might be smaller than the caches. There is simply no justification for doing that.
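Following that logic, the usual compromise is to clean only the lines that cover the DMA buffer rather than the whole cache. A hedged sketch using the CMSIS by-address call (32-byte lines on the Cortex-M7; buf/len here stand for whatever buffer is handed to the DMA):

    #include <stdint.h>
    #include "stm32h7xx_hal.h"   /* CMSIS: SCB_CleanDCache_by_Addr(), __DSB() */

    /* Clean only the D-cache lines overlapping [buf, buf+len) before the DMA reads it. */
    static void clean_dcache_for_dma_tx(const void *buf, uint32_t len)
    {
        uint32_t start = (uint32_t)buf & ~31u;                 /* align down to a 32-byte line */
        uint32_t end   = ((uint32_t)buf + len + 31u) & ~31u;   /* align up past the buffer */
        SCB_CleanDCache_by_Addr((uint32_t *)start, (int32_t)(end - start));
        __DSB();
    }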

Store misses hurt performance?

We know that the stored data is not immediately written back to RAM; it is stashed away in the store buffer and then written back later as time permits. Also, store forwarding means that if you do a subsequent LOAD to the same location on the same core before the value is flushed to the cache/memory, the value from the store buffer will be "forwarded" and you will get the value that was just stored. This can be done in parallel with the cache access, so it doesn't slow things down.
My question is: with the help of the store buffer and store forwarding, store misses don't necessarily require the processor (the corresponding core) to stall. Therefore, store misses do not contribute to the total cache-miss latency, right?
Thanks.
DRAM latency is really high, so it's easy for the store buffer to fill up and stall allocation of new store instructions into the back-end when a cache miss store stalls its progress. The ability of the store buffer to decouple / insulate execution from cache misses is limited by its finite size. It always helps some, though. You're right, stores are easier to hide cache-miss latency for.
Stalling and filling up the store buffer is more of a problem with a strongly ordered memory model like x86's TSO: stores can only commit from the store buffer into L1d cache in program order, so any cache-miss store blocks store-buffer progress until the RFO (Read For Ownership) completes. Initiating the RFO early (before the store reaches the commit end of the store buffer, e.g. upon retire) can hide some of this latency by getting the RFO in flight before the data needs to arrive.
How do the store buffer and Line Fill Buffer interact with each other?
Consecutive stores into the same cache line can be coalesced into a buffer that lets them all commit at once when the data arrives from RAM (or from another core which had ownership). There's some evidence that Intel CPUs actually do this, in the limited cases where that wouldn't violate the memory-ordering rules.
See Why doesn't RFO after retirement break memory ordering? for links to BeeOnRope's experimental testing of this commit into LFBs before the RFO data arrives, on Intel Skylake.
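If you want to see the store-buffer effect for yourself, here's a rough user-space sketch (buffer sizes, the stride and the timing method are arbitrary choices, and first-touch page faults are ignored, so treat the numbers as qualitative only): a store-only loop over a small cache-resident buffer vs. one whose stores mostly miss.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Time a store-only loop; the stride skips a page plus a line so hardware
       prefetching doesn't hide the misses on the large buffer. */
    static double store_loop(volatile char *buf, size_t size, size_t iters)
    {
        struct timespec t0, t1;
        size_t idx = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++) {
            buf[idx] = (char)i;                /* plain store: hits or misses in cache */
            idx = (idx + 4096 + 64) % size;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void)
    {
        size_t small = 16 * 1024;            /* fits in L1d: stores mostly hit */
        size_t big   = 256u * 1024 * 1024;   /* much bigger than any cache: stores mostly miss */
        size_t iters = 50u * 1000 * 1000;
        char *a = malloc(small), *b = malloc(big);
        if (!a || !b) return 1;
        printf("small buffer (store hits):   %.3f s\n", store_loop(a, small, iters));
        printf("large buffer (store misses): %.3f s\n", store_loop(b, big, iters));
        free(a); free(b);
        return 0;
    }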

If a cache miss happens, is the data moved to the register directly, or first moved to the cache and then to the register?

If a cache miss happens, will the data be moved to the register directly from main memory, or will it first be moved to the cache and then to the register? Is there a direct way to connect the registers with main memory?
I think you're asking if a cache-miss load has to wait for L1 load-use latency after the cache line arrives from outer cache. i.e. wait for the line to be written to L1, then retry the load normally.
I'm almost certain that high-performance CPUs don't work that way. L2-hit latency is important for many workloads, and you need a load buffer tracking that incoming cache line anyway to know when to restart the load. So you just grab the data as it comes in, in parallel with writing it to the cache. The TLB check was already done as part of generating a physical address to send to the outer cache.
Most real CPUs use an early-restart design that lets the pipeline restart as soon as the word / byte they were waiting for arrives, so the rest of the cache line transfers "in the background".
A further optimization is critical-word-first, which asks for the cache line to be sent starting with the needed word, so a demand miss for a word in the middle of a cache line can receive that word first. I think modern DDR DRAM still supports this when reading from main memory, starting the 64-byte burst at a specified 64-bit chunk. I'm not 100% sure modern out-of-order CPUs use this, though; when out-of-order execution allows multiple outstanding misses for the same line, it probably makes it more complicated.
See which is optimal a bigger block cache size or a smaller one? for some discussion of early-restart and critical-word-first.
Is there a direct way to connect the registers with main memory?
It depends what you mean by "direct". In a modern high-performance CPU, there will be 2 or 3 layers of cache and a memory controller with its own buffering to arbitrate access to memory for multiple cores. So no, you can't.
If you design a simple single-core CPU with special cache-bypassing load and store instructions, then sure. Or if you consider early-restart as "direct", then yes it already happens.
For stores, x86 and some other architectures have cache-bypassing stores, but x86's MOVNT instructions don't directly connect registers with memory. Stores go into a line-fill buffer which is flushed when full, so you get write-combining.
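As a concrete (hedged) x86 sketch, the SSE2 MOVNTDQ instruction is reachable from C through the _mm_stream_si128 intrinsic; the function name below is made up, the destination must be 16-byte aligned, and the stores land in write-combining buffers rather than going from a register straight to a DRAM cell:

    #include <emmintrin.h>   /* SSE2: _mm_set1_epi64x, _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill `bytes` bytes at 16-byte-aligned `dst` with a repeating 64-bit pattern
       using non-temporal (cache-bypassing) stores. */
    void fill_nontemporal(void *dst, uint64_t pattern, size_t bytes)
    {
        __m128i v = _mm_set1_epi64x((long long)pattern);
        __m128i *p = (__m128i *)dst;
        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(p + i, v);   /* MOVNTDQ: goes into a write-combining buffer */
        _mm_sfence();                     /* drain the WC buffers / order vs. later stores */
    }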
There are also uncacheable memory regions: a load or store to uncacheable memory is architecturally "direct", but in the actual microarchitecture it still goes from the load/store execution unit through the same mechanism that L1D uses to talk to the memory controller.

Write Allocate / Fetch on Write Cache Policy

I couldn't find a source that explains how the policy works in great detail. The combinations of write policies are explained in Jouppi's Paper for the interested. This is how I understood it.
1. A write request is sent from the CPU to the cache.
2. The request results in a cache miss.
3. A cache block is allocated for this request in the cache. (Write-Allocate)
4. The requested block is fetched from lower memory into the allocated cache block. (Fetch-on-Write)
5. Now we are able to write onto the allocated, fetch-updated cache block.
The question is what happens between step 4 and step 5. (Let's say the cache is a non-blocking cache using Miss Status Handling Registers.)
Does the CPU have to retry the write request on the cache until a write hit happens (after the block has been fetched into the allocated cache block)?
If not, where is the write-request data held in the meantime?
Edit: I think I've found my answer in Implementation of Write Allocate in the K86™ Processors. It is directly being written into the allocated cache block and it gets merged with the read request later on.
It is directly being written into the allocated cache block and it gets merged with the read request later on.
No, that's not what AMD's pdf says. They say the store-data is merged with the just-fetched data from memory and then stored into the L1 cache's data array.
Cache tracks validity with cache-line granularity. There's no way for it to store the fact that "bytes 3 to 6 are valid; keep them when data arrives from memory". That kind of logic is too big to replicate in each line of the cache array.
Also note that the pdf you found describes some specific behaviour of AMD's K6 microarchitectures, which were single-core only, and some models only had a single level of cache, so no cache-coherency protocol was even necessary. They do describe the K6-III (model 9) using MESI between its L1 and L2 caches.
A CPU writing to cache has to hold onto the data until the cache is ready to accept it. It's not a retry-until-success process, though. It's more like the cache notifies the store hardware when it's ready to accept that store (i.e. it has that line active, and in the Modified state if the cache is coherent with other caches using the MESI protocol).
In a real CPU, multiple outstanding misses can be in flight at once (even without full out-of-order speculative execution). This is called miss under miss. The CPU<->cache connection needs a buffer for each outstanding miss that can be supported in parallel, to hold the store data. e.g. a core might have 8 buffers and support 8 outstanding load or store misses. A 9th memory operation couldn't start to happen until one of the 8 buffers became available. Until then, data would have to stay in the CPU's store queue.
These buffers might be shared between loads and stores, or there might be dedicated store buffers. The OP reports that searching on store buffer found lots of related stuff of interest; one example being this part of Wikipedia's MESI article.
The L1 cache is really a part of a CPU core in modern high-performance designs. It's very tightly integrated with the memory-order logic, and needs to be able to efficiently support atomic operations like lock inc [mem] and lots of other complications (like memory reordering). See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for example.
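As a toy sketch of the merge described above (not any real microarchitecture; the 64-byte line size and the struct layout are assumptions): a per-miss buffer holds the pending store bytes plus a byte-valid mask, and when the fill arrives the store bytes win and the whole line goes into the data array.

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 64

    struct miss_buffer {
        uint64_t line_addr;              /* address of the line being fetched */
        uint8_t  store_data[LINE_SIZE];  /* pending store bytes */
        uint8_t  byte_valid[LINE_SIZE];  /* 1 = byte written by a pending store */
    };

    /* Record a store into the miss buffer while the fetch is still in flight. */
    static void miss_buffer_add_store(struct miss_buffer *mb, unsigned offset,
                                      const void *data, unsigned len)
    {
        memcpy(&mb->store_data[offset], data, len);
        memset(&mb->byte_valid[offset], 1, len);
    }

    /* On fill: store bytes override the fetched bytes, then the whole line is
       installed into the cache data array (tagged Modified / dirty). */
    static void miss_buffer_fill(const struct miss_buffer *mb, const uint8_t *fetched,
                                 uint8_t *cache_line)
    {
        for (unsigned i = 0; i < LINE_SIZE; i++)
            cache_line[i] = mb->byte_valid[i] ? mb->store_data[i] : fetched[i];
    }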
Some other terms:
store buffer
store queue
memory order buffer
cache write port / cache read port / cache port
globally visible
distantly related: An interesting post investigating the adaptive replacement policy of Intel IvyBridge's L3 cache, making it more resistant against evicting valuable data when scanning a huge array.

For Write-Back Cache Policy, why data should first be read from memory, before writing to cache?

Caches with a write-back policy perform write operations to the cache memory and return immediately, but only when the data is already present in the cache. If the data is not present in the cache, it is first fetched from the lower memories and then written into the cache.
I do not understand why it is important to first fetch the data from memory before writing it. If the data is to be written, it will become invalid anyway.
I do know the basic concept, but want to know the reason behind having to read data before writing to the address.
I have the following guess,
This is done for Cache Coherency in a multi-processor environment. Other processors snoop on the bus to maintain Cache Coherency. The processor writing to the address needs to gain exclusive access, and the other processors must find out about this.
But, does that mean, this is not required on Single-Processor computers?
Short answer
A write that misses in the cache may or may not fetch the block being written, depending on the write-miss policy of the cache (fetch-on-write-miss vs. no-fetch-on-write-miss).
It does not depend on the write-hit policy (write-back vs. write-through).
Explanation
In order to simplify, let us assume that we have a one-level cache hierarchy:
-------     ------     ---------------
| CPU | <-> | L1 | <-> | main memory |
-------     ------     ---------------
The L1 write-policy is fetch-on-write-miss.
The cache stores blocks of data. A typical L1 block is 32 bytes wide, that is, it contains several words (for instance, 8 x 4-byte words).
The transfer unit between the cache and main memory is a block, but transfers between CPU and cache can be of different sizes (1, 2, 4 or 8 bytes).
Let us assume that the CPU performs a 4-byte word write.
If the block containing the word is not stored in the cache, we have a cache miss. The whole block (32 bytes) is transferred from main memory to the cache, and then the corresponding word (4 bytes) is stored in the cache.
A write-back cache would tag the block as dirty (not invalid, as you stated).
A write-through cache would send the updated word to main memory.
If the block containing the word is stored in the cache, we have a cache hit. The corresponding word is updated.
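A minimal toy model of that write-miss path (a single line, 32-byte blocks, fetch-on-write-miss, write-back; the mem_* helpers and the fake RAM array are placeholders for the main-memory transfers):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 32

    static uint8_t main_memory[1 << 16];   /* toy 64 KiB "RAM" */

    static void mem_read_block(uint32_t addr, uint8_t *dst)        { memcpy(dst, &main_memory[addr], BLOCK_SIZE); }
    static void mem_write_block(uint32_t addr, const uint8_t *src) { memcpy(&main_memory[addr], src, BLOCK_SIZE); }

    struct cache_line {
        bool     valid, dirty;
        uint32_t tag;                 /* block address currently held */
        uint8_t  data[BLOCK_SIZE];
    };

    /* CPU writes one 4-byte word in a write-back, fetch-on-write-miss cache. */
    static void cache_write_word(struct cache_line *line, uint32_t addr, uint32_t word)
    {
        uint32_t block = addr & ~(uint32_t)(BLOCK_SIZE - 1);

        if (!line->valid || line->tag != block) {            /* write miss */
            if (line->valid && line->dirty)
                mem_write_block(line->tag, line->data);      /* write back the dirty victim */
            mem_read_block(block, line->data);               /* fetch the whole 32-byte block */
            line->valid = true;
            line->tag   = block;
        }
        memcpy(&line->data[addr - block], &word, sizeof word);  /* update the 4-byte word */
        line->dirty = true;   /* write-back: only mark dirty; write-through would also send the word to memory now */
    }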
More information:
Cache Write Policies and Performance. Norman P. Jouppi.
http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-91-12.pdf
Your guess is almost correct. However, this behavior is also needed in multi-core, single-processor systems.
Your processor can have multiple cores, so when writing a cache line (in a write-back cache), the core that issues the write needs to get exclusive access to that line. If the line intended for the write is marked as dirty, it will be "flushed" to the lower memories before being written with the new information.
In a multi-core CPU, each core has its own L1 cache, and there is the possibility that each core could store a copy of a shared L2 line. Therefore you need this behavior for Cache Coherency.
You can find out more by reading about the MESI protocol and its derivatives.
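As a tiny illustrative sketch of that write path (names made up, not a real protocol engine): a core may write a line it already holds in Modified or Exclusive state; otherwise it first issues a read-for-ownership that invalidates the copies in the other cores' caches.

    #include <stdint.h>

    enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

    struct l1_line { enum mesi state; uint32_t tag; };

    /* Placeholder for the interconnect request: other cores invalidate their
       copies and a dirty owner supplies the data (details omitted here). */
    static void bus_read_for_ownership(uint32_t block_addr) { (void)block_addr; }

    /* Called before a core commits a store into this L1 line. */
    static void l1_obtain_write_permission(struct l1_line *line, uint32_t block_addr)
    {
        if (line->tag == block_addr &&
            (line->state == MODIFIED || line->state == EXCLUSIVE)) {
            line->state = MODIFIED;             /* already owned: the line just becomes dirty */
            return;
        }
        bus_read_for_ownership(block_addr);     /* Shared or not present: gain exclusive ownership first */
        line->tag   = block_addr;
        line->state = MODIFIED;
    }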

Resources