My understanding is that the main difference between the two methods is that with "write-through", data is written to main memory through the cache immediately, while with "write-back", data is written at a later time.
We still have to wait for the memory at that later time, so what is the benefit of "write-through"?
The benefit of write-through to main memory is that it simplifies the design of the computer system. With write-through, the main memory always has an up-to-date copy of the line. So when a read is done, main memory can always reply with the requested data.
If write-back is used, sometimes the up-to-date data is in a processor cache, and sometimes it is in main memory. If the data is in a processor cache, then that processor must stop main memory from replying to the read request, because the main memory might have a stale copy of the data. This is more complicated than write-through.
Also, write-through can simplify the cache coherency protocol because it doesn't need the Modified state. The Modified state records that the cache must write back the cache line before it invalidates or evicts the line. With write-through, a cache line can always be invalidated without writing back, since memory already has an up-to-date copy of the line.
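To make that concrete, here is a minimal hypothetical C sketch (state names and eviction logic are invented for illustration, not any particular processor's implementation) of why write-through needs no Modified state:

```c
/* Write-through: a valid bit is enough, because memory is always current. */
enum wt_state { WT_INVALID, WT_VALID };

/* Write-back: MESI-style states, including Modified for dirty lines. */
enum wb_state { WB_INVALID, WB_SHARED, WB_EXCLUSIVE, WB_MODIFIED };

/* Evicting a write-through line: just drop it. */
static void wt_evict(enum wt_state *line) {
    *line = WT_INVALID;            /* memory already has an up-to-date copy */
}

/* Evicting a write-back line: dirty data must reach memory first. */
static void wb_evict(enum wb_state *line, void (*write_back_line)(void)) {
    if (*line == WB_MODIFIED)
        write_back_line();         /* memory is stale until this completes */
    *line = WB_INVALID;
}
```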
One more thing - on a write-back architecture, software that writes to memory-mapped I/O registers must take extra steps to make sure that writes are immediately sent out of the cache. Otherwise the writes are not visible outside the core until the line is read by another processor or the line is evicted.
Hope this article can help you: Differences between disk Cache Write-through and Write-back.
Write-through: Write is done synchronously both to the cache and to the backing store.
Write-back (or Write-behind): Writing is done only to the cache. A modified cache block is written back to the store just before it is replaced.
Write-through: When data is updated, it is written to both the cache and the back-end storage. This mode is easy to operate but slow for writes, because data has to be written to both the cache and the storage.
Write-back: When data is updated, it is written only to the cache. The modified data is written to the back-end storage only when it is removed from the cache. This mode gives fast writes, but data will be lost if a power failure occurs before the updated data reaches the storage.
Let's look at this with the help of an example.
Suppose we have a direct mapped cache and the write back policy is used. So we have a valid bit, a dirty bit, a tag and a data field in a cache line.
Suppose we have an operation: write A (where A is mapped to the first line of the cache).
What happens is that the data (A) from the processor gets written to the first line of the cache. The valid bit and tag bits are set, and the dirty bit is set to 1.
The dirty bit simply indicates whether the cache line has been written since it was last brought into the cache.
Now suppose another operation is performed: read E (where E is also mapped to the first cache line).
Since we have a direct-mapped cache, the first line can simply be replaced by the E block, which will be brought from memory. But the block last written into the line (block A) has not yet been written to memory (as indicated by the dirty bit), so the cache controller will first issue a write-back to memory to transfer block A, and then replace the line with block E by issuing a read operation to memory. The dirty bit is now set to 0.
So the write-back policy does not guarantee that the block in memory and its associated cache line hold the same data. However, whenever the line is about to be replaced, a write-back is performed first.
The write-through policy is just the opposite: memory always has up-to-date data. That is, if the cache block is written, memory is written accordingly as well (no use of dirty bits).
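Here is a small hypothetical C sketch of this exact example (a single direct-mapped line; the block addresses and the toy memory model are made up for illustration):

```c
#include <stdio.h>
#include <stdbool.h>

struct cache_line {
    bool     valid;
    bool     dirty;
    unsigned tag;
    int      data;
};

static int memory[16];                       /* toy backing store, indexed by tag */

/* Write-back policy: a write only touches the cache and sets the dirty bit. */
static void cache_write(struct cache_line *l, unsigned tag, int value) {
    if (l->valid && l->dirty && l->tag != tag)
        memory[l->tag] = l->data;            /* evict: write back dirty block first */
    l->valid = true;
    l->dirty = true;                         /* line now differs from memory */
    l->tag   = tag;
    l->data  = value;
}

/* A read that misses on a dirty line also triggers a write-back first. */
static int cache_read(struct cache_line *l, unsigned tag) {
    if (!l->valid || l->tag != tag) {        /* miss */
        if (l->valid && l->dirty)
            memory[l->tag] = l->data;        /* write back block A */
        l->valid = true;
        l->dirty = false;                    /* freshly fetched line is clean */
        l->tag   = tag;
        l->data  = memory[tag];              /* fetch block E */
    }
    return l->data;
}

int main(void) {
    struct cache_line line = {0};
    cache_write(&line, /*A*/ 1, 42);         /* write A: dirty = 1 */
    int e = cache_read(&line, /*E*/ 2);      /* read E: A written back, dirty = 0 */
    printf("E = %d, memory[A] = %d\n", e, memory[1]);
    return 0;
}
```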
Write-back and write-through describe policies for when a write hit occurs, that is, when the cache already has the requested information. In these examples, we assume a single processor writing to main memory through a cache.
Write-through: The information is written to both the cache and memory, and the write finishes when both have finished. This has the advantage of being simpler to implement; main memory is always consistent (in sync) with the cache (in the uniprocessor case - if some other device modifies main memory, this policy alone is not enough); and a read miss never results in writes to main memory. The obvious disadvantage is that every write hit has to do two writes, one of which accesses slower main memory.
Write-back: The information is written to a block in the cache. The modified cache block is only written to memory when it is replaced (in effect, a lazy write). A special bit for each cache block, the dirty bit, marks whether or not the cache block has been modified while in the cache. If the dirty bit is not set, the cache block is "clean" and a write miss does not have to write the block to memory.
The advantage is that writes occur at the speed of the cache, and if the same block is written multiple times, only one write to main memory is needed (when the block is eventually replaced). The disadvantages are that this protocol is harder to implement, main memory can be inconsistent (out of sync) with the cache, and reads that result in replacement may cause writes of dirty blocks to main memory.
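Reusing the hypothetical cache_line struct and memory array from the sketch above, the two write-hit paths differ only in whether memory is touched immediately:

```c
/* Write-through hit: update cache and memory together; no dirty bit needed. */
static void write_hit_through(struct cache_line *l, int value) {
    l->data        = value;
    memory[l->tag] = value;   /* the slow part: every write hit goes to memory */
}

/* Write-back hit: update only the cache and mark the line dirty. */
static void write_hit_back(struct cache_line *l, int value) {
    l->data  = value;
    l->dirty = true;          /* memory write deferred until eviction */
}
```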
The policies for a write miss are detailed in my first link.
These protocols don't take care of the cases with multiple processors and multiple caches, as is common in modern processors. For this, more complicated cache coherence mechanisms are required. Write-through caches have simpler protocols since a write to the cache is immediately reflected in memory.
Good resources:
http://web.cs.iastate.edu/~prabhu/Tutorial/CACHE/interac.html (what my post is largely based on)
http://www.cs.cornell.edu/courses/cs3410/2013sp/lecture/18-caches3-w.pdf
Write-back is the more complex of the two and requires a complicated cache coherence protocol (e.g. MOESI), but it is worth it, as it makes the system fast and efficient.
The only benefit of write-through is that it makes the implementation extremely simple, and no complicated cache coherency protocol is required.
I'm debugging an HTTP server on STM32H725VG using LWIP and HAL drivers, all initially generated by STM32CubeMX. The problem is that in some cases data sent via HAL_ETH_Transmit have some octets replaced by 0x00, and this corrupted content successfully gets to the client.
I've checked that the data in the buffers passed as arguments into HAL_ETH_Transmit are intact both before and after the call to this function. So, apparently, the corruption occurs on transfer from the RAM to the MAC, because the checksum is calculated on the corrupted data. So I supposed that the problem may be due to interaction between cache and DMA. I've tried disabling D-cache, and then the corruption doesn't occur.
Then I thought that I should just use the __DSB() instruction, which should write the cached data into the RAM. After enabling the D-cache again, I added __DSB() right before the call to HAL_ETH_Transmit (which is inside the low_level_output function generated by STM32CubeMX), and... nothing happened: the data are still corrupted.
Then, after some experimentation I found that SCB_CleanDCache() call after (or instead of) __DSB() fixes the problem.
This makes me wonder. The description of DSB instruction is as follows:
Data Synchronization Barrier acts as a special kind of memory barrier. No instruction in program order after this instruction executes until this instruction completes. This instruction completes when:
All explicit memory accesses before this instruction complete.
All Cache, Branch predictor and TLB maintenance operations before this instruction complete.
And the description of SCB_DisableDCache has the following note about SCB_CleanDCache:
When disabling the data cache, you must clean (SCB_CleanDCache) the entire cache to ensure that any dirty data is flushed to external memory.
Why doesn't the DSB flush the cache if it's supposed to be complete when "all explicit memory accesses" complete, which seems to include flushing of caches?
dsb ish works as a memory barrier for inter-thread memory order; it just orders the current CPU's access to coherent cache. You wouldn't expect dsb ish to flush any cache because that's not required for visibility within the same inner-shareable cache-coherency domain. Like it says in the manual you quoted, it finishes memory operations.
Cacheable memory operations on write-back cache only update cache; waiting for them to finish doesn't imply flushing the cache.
I think your ARM system has multiple coherency domains for microcontroller vs. DSP? Does your __DSB intrinsic compile to a dsb sy instruction? Assuming that doesn't flush the cache, what they presumably mean is that it orders memory / cache operations, including explicit flushes, which are still necessary.
I'd put my money on performance.
Flushing the cache means writing data from the cache to memory. Memory access is slow.
The L1 cache size (assuming ARM Cortex-A9) is 32KB. You don't want to move a whole 32KB from cache into memory for no reason. There might be an L2 cache, which is easily 512KB-1MB (could be even more). You really don't want to move a whole L2 either.
As a matter of fact, your whole DMA transfer might be smaller than the size of the caches. There is simply no justification for flushing everything.
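This is why cleaning only the buffer that the DMA will read is the usual fix. A hedged sketch for the STM32H7/CMSIS case (the buffer name and size are made up; SCB_CleanDCache_by_Addr needs 32-byte-aligned addresses and sizes on Cortex-M7, and whether your buffers meet that depends on your linker setup):

```c
#include <stdint.h>
#include <string.h>
/* Assumes the CMSIS device header (e.g. "stm32h7xx.h") is included so that
 * SCB_CleanDCache_by_Addr() and __DSB() are available. */

/* Hypothetical Ethernet TX buffer; 32-byte alignment is required because
 * Cortex-M7 cache maintenance operates on 32-byte lines. */
__attribute__((aligned(32))) static uint8_t tx_buf[1536];

void send_frame(const uint8_t *payload, uint32_t len)
{
    memcpy(tx_buf, payload, len);

    /* Clean just this buffer's lines to RAM so the Ethernet DMA reads the
     * real data - much cheaper than SCB_CleanDCache() on the whole D-cache. */
    SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, (int32_t)((len + 31u) & ~31u));
    __DSB();  /* ensure the clean completes before handing the buffer to DMA */

    /* ...pass tx_buf to HAL_ETH_Transmit() as before... */
}
```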
I have been reading some articles and blogs, which describe atomics.
All of them mention that a value written by a store is visible to subsequent loads.
Can someone point me to a document that describes how the memory is handled on atomic writes?
One thing I read is that the store buffers are flushed - are they flushed all the way to main memory?
Apart from this what else happens?
Are the caches on other CPUs invalidated?
According to this diagram, in the case of a write cache miss with a copy in another CPU's cache (for example, in the Shared/Exclusive state), the steps are:
1. Snooping cores (with a cache-line copy) set their state to Invalid.
2. The current cache stores the fresh main-memory value.
Why can't one of the snooping cores put its cache-line value on the bus first, and then go to the Invalid state? The same algorithm is used on a read miss with an existing copy. Thank you.
You're absolutely right in that it's pretty silly to go fetch a line from memory when you already have it right next to you, but this diagram describes the minimal requirement for functional correctness of the coherence protocol (i.e. what must be done to avoid coherence bugs), and that only dictates snooping the data out for modified lines since that's the only correct copy. What you describe is a possible optimization, and some systems indeed behave that way.
However, keep in mind that most systems today employ a shared cache as well (L2 or L3, sometimes even beyond that), and this is often inclusive (with regards to all lines that exist in all cores). In such systems, there's no real need to go all the way to memory, since having the line in another core means it's also in the shared cache, and after invalidation the requesting core can obtain it from there. Your proposal is therefore relevant only for systems with no shared cache, or with a cache that is not strictly inclusive.
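For reference, a minimal sketch of the correctness-only rule described above (hypothetical C, with states and handlers simplified to the bare MESI minimum):

```c
enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* What a snooping core must do when it sees another core's write (RFO).
 * Only a Modified line is obliged to supply data; Shared/Exclusive copies
 * may simply invalidate, letting the requester fetch from memory (or from
 * a shared L2/L3, if the system has one). */
static void on_snooped_write(enum mesi *line, void (*supply_data)(void)) {
    if (*line == MODIFIED)
        supply_data();   /* the only correct copy: must be snooped out */
    /* optimization (what the question proposes): a Shared/Exclusive core
     * could also supply the data here; the protocol just doesn't require it */
    *line = INVALID;
}
```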
I couldn't find a source that explains how the policy works in great detail. The combinations of write policies are explained in Jouppi's Paper for the interested. This is how I understood it.
1. A write request is sent from the CPU to the cache.
2. The request results in a cache miss.
3. A cache block is allocated for this request in the cache. (Write-Allocate)
4. The write request's block is fetched from lower memory into the allocated cache block. (Fetch-on-Write)
5. Now we are able to write onto the allocated cache block, which has been updated by the fetch.
The question is what happens between step 4 and step 5. (Let's say the cache is a non-blocking cache using Miss Status Handling Registers.)
Does the CPU have to retry the write request on the cache until a write hit happens (after fetching the block into the allocated cache block)?
If not, where is the write-request data held in the meantime?
Edit: I think I've found my answer in Implementation of Write Allocate in the K86™ Processors. It is written directly into the allocated cache block and gets merged with the read request later on.
It is written directly into the allocated cache block and gets merged with the read request later on.
No, that's not what AMD's pdf says. They say the store-data is merged with the just-fetched data from memory and then stored into the L1 cache's data array.
The cache tracks validity at cache-line granularity. There's no way for it to store the fact that "bytes 3 to 6 are valid; keep them when data arrives from memory". That kind of logic is too big to replicate in each line of the cache array.
Also note that the pdf you found describes specific behaviour of AMD's K6 microarchitectures, which were single-core only, and some models had only a single level of cache, so no cache-coherency protocol was even necessary. They do describe the K6-III (model 9) using MESI between the L1 and L2 caches.
A CPU writing to cache has to hold onto the data until the cache is ready to accept it. It's not a retry-until-success process, though; it's more like the cache notifies the store hardware when it's ready to accept that store (i.e. when it has that line active, and in the Modified state if the cache is coherent with other caches using the MESI protocol).
In a real CPU, multiple outstanding misses can be in flight at once (even without full out-of-order speculative execution). This is called miss under miss. The CPU<->cache connection needs a buffer for each outstanding miss that can be supported in parallel, to hold the store data. e.g. a core might have 8 buffers and support 8 outstanding load or store misses. A 9th memory operation couldn't start to happen until one of the 8 buffers became available. Until then, data would have to stay in the CPU's store queue.
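A hypothetical sketch of such a miss buffer (field names and sizes invented; real MSHRs are far more involved), showing how the store data waits and is merged with the fill from memory, as the AMD pdf describes:

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64

/* One miss-buffer entry: holds the store data (and which bytes are valid)
 * while the rest of the line is fetched from memory. */
struct mshr_entry {
    bool     busy;
    uint64_t line_addr;
    uint8_t  store_data[LINE_SIZE];
    bool     byte_valid[LINE_SIZE];   /* per-byte mask lives here, not in the cache */
};

/* When the fill arrives, merge: fetched bytes fill the gaps, the buffered
 * store bytes win, then the whole line is written to the L1 data array. */
static void complete_fill(struct mshr_entry *e,
                          const uint8_t fetched[LINE_SIZE],
                          uint8_t l1_line[LINE_SIZE]) {
    for (int i = 0; i < LINE_SIZE; i++)
        l1_line[i] = e->byte_valid[i] ? e->store_data[i] : fetched[i];
    e->busy = false;                  /* entry freed for the next miss */
}
```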
These buffers might be shared between loads and stores, or there might be dedicated store buffers. The OP reports that searching on store buffer found lots of related stuff of interest; one example being this part of Wikipedia's MESI article.
The L1 cache is really a part of a CPU core in modern high-performance designs. It's very tightly integrated with the memory-order logic, and needs to be able to efficiently support atomic operations like lock inc [mem] and lots of other complications (like memory reordering). See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for example.
Some other terms:
store buffer
store queue
memory order buffer
cache write port / cache read port / cache port
globally visible
distantly related: An interesting post investigating the adaptive replacement policy of Intel IvyBridge's L3 cache, making it more resistant against evicting valuable data when scanning a huge array.
I'm using perf as a basic event counter. I'm working on a program which suffers from data-cache store misses; the miss ratio is as high as 80%.
I know how caches work in principle. The cache loads data from memory on various miss cases and removes data when it pleases. What I don't understand is the difference between store misses and load misses. How do loading and storing differ? How can you store-miss?
A load miss (as you know) occurs when the processor needs to fetch data from main memory but the data does not exist in the cache. So whenever the processor wants some data from main memory, it queries the cache; if the data is already loaded, you get a load hit, and otherwise you get a load miss.
A store miss is related to when the processor wants to write newly calculated data back to main memory. When it wants to write the data back, it has to make sure that the contents of the cache and main memory are in sync with each other. This can happen under two different policies that you can find here: Writing Policies.
So no matter which policy you choose, you first need to check whether the data is already in the cache so you can store to the cache first (since it's faster); if the data block you are looking for has been evicted from the cache, you get a store miss related to that cache.
You can check the applet here, to get a better idea of what happens in different scenarios.
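As a concrete illustration, here is a toy C loop whose stores mostly miss (the array size is arbitrary; pick anything comfortably larger than your last-level cache): every store touches a new cache line, so the line must first be brought in before the write can complete.

```c
#include <stdlib.h>

#define N (64 * 1024 * 1024)   /* ~64 MB: larger than typical caches */

int main(void) {
    char *buf = malloc(N);
    if (!buf) return 1;
    /* Streaming writes: each 64-byte line is written once and never
     * revisited, so nearly every store misses the cache. */
    for (long i = 0; i < N; i += 64)
        buf[i] = 1;
    free(buf);
    return 0;
}
```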
I'm not fully familiar with how perf defines these events, but given the common definition, I believe the load/store miss split is just a way to break down the overall miss counting, so that you can tell which accesses miss more often. Note that loads are usually performed speculatively (at least in modern x86 CPUs), while stores are performed much later along the pipeline, after the commit point, so even a piece of code with both loads and stores to the same region can have different miss rates.
In MESI-based cache protocols a load would hit the cache, or miss and fetch the line from the memory or next cache levels, either exclusively if it's not owned by anyone else, or in a shared state if it is. It would write the data to the caches along the way in the process.
A store would fetch a line in the same manner, but use an RFO (read-for-ownership) request which grants it exclusive ownership and the right to modify the line. The line would still get cached, but once the new data is written to it locally (usually in your L1 cache), it would become modified. The hit/miss process would look the same though.
What Saman referred to in his answer is the breakdown between reads and writes. Loads and stores (and other forms of access, like code reads) all form the "read" part, and writebacks (or intentional write-throughs using special commands or memory types like uncacheable) form the "write" part.