On Computer Architecture lecture, I learned that the function of write buffer; hold data waiting to be written memory. My professor just told that it improves time performance.
However, I'm really curious 'how it improves time-performance'?
Could you explain more precisely how write buffer works?
The paper Design Issues and Tradeoffs for Write Buffers describes the purpose of write buffers as follows:
In a system with a write-through first-level cache, a write buffer has
two essential functions: it absorbs processor writes (store
instructions) at a rate faster than the next-level cache could,
thereby preventing processor stalls; and it aggregates writes to the
same cache block, thereby reducing traffic to the next-level cache.
To put this another way, the two primary benefits are:
If the processor has a burst of writes that occur faster than the cache can respond, then the write buffer can store multiple outstanding writes that are waiting to go to the cache. This improves performance because some of the other instructions won't be writes and thus they can continue executing instead of being stalled.
If there are multiple writes to different words in the write buffer than go to the same cache line, then these writes can be grouped together into a single write to the cache line. This improves performance because it reduces the total number of writes that need to go to the cache (since the cache line contains multiple words).
Related
I was going through Cache Write Policies paper by Norman P. Jouppi and I understand why write-invalidate (defined on page 193) works well with direct mapped caches which is because of the ability to write the data which checking the tag and if found to be miss, the cache line is invalidates as it is corrupted by the write. This can be done in one cycle.
But is there any benefit if write-invalidate is used for set-associative caches?
What is the usual configuration that is used for L1 caches in real processors? Do they use direct or set-associative and write-validate/write around/write invalidate/fetch-on-write policy?
TL:DR: for a non-blocking cache using write-invalidate, changing it from direct-mapped to set-associative could hurt the hit rate unless writes are very rare, or mean that you introduce the possibility of needing to block.
Write-invalidate only makes sense for a simple in-order pipeline with a simple cache that tries to avoid stalling the pipeline even without a store buffer, and go really fast at the expense of hit-rate. If you were going to change things to improve hit-rate, changing away from write-invalidate (usually to write-back + write-allocate + fetch-on-write) would be one of the first things. Write-invalidate with set-associative cache is possible with some ugly tradeoffs, but you wouldn't like the results.
The 1993 paper you linked is using that term to mean something other than the modern cache-coherence mechanism meaning. In the paper:
The combination of
write-before-hit, no-fetch-on-write, and no-write-allocate
we call write-invalidate
Yes, real-world caches these days are basically always set-associative; the more complex tag-comparator logic is worth the increased hit-rate for the same data size. Which cache mapping technique is used in intel core i7 processor? has some general stuff, not just x86. Modern examples of direct-mapped caches include the DRAM cache when a part of the persistent memory on an Intel platform operates in memory mode. Also many server-grade processors from multiple vendors support L3 way-wise partitioning, so you can, for example, allocate one way for a thread which would basically behave like a direct-mapped cache.
Write policy is usually write-allocate + fetch-on-write + no-write-before-hit for modern CPU caches; some ISAs offer methods such as special instructions to bypass cache for "non-temporal" stores that won't be re-read soon, to avoid cache pollution for those cases. Most workloads do re-load their stores with enough temporal locality that write-allocate is the only sane choice, especially when caches are larger and/or more associative so they're more likely to be able to hang onto a line until the next read or write.
It's also very common to do multiple small writes into the same line, making write-allocate very valuable, especially if a store buffer didn't manage to merge those writes.
But is there any benefit if write-invalidate is used for set-associative caches?
It doesn't seem so.
The only advantage it has is not stalling a simple in-order pipeline that lacks a store buffer ("write buffer" in the paper). It allows write in parallel with the tag-check, so you find out after modifying the line whether you hit or not. (Modern CPUs do use a store buffer to decouple store commit to L1d from store execution and hide store-miss latency. Even in-order CPUs typically have a store buffer to allow memory-level parallelism of RFOs (read-for-ownership). (e.g. ARM Cortex-A53 found in phones).
Anyway, in a set-associative cache, you need to check tags to know which "way" of the set to write into on a write hit. (Or detect a miss and pick one to evict according to some policy, like random or pseudo-LRU using some extra state bits, or write-around if no-write-allocate). If you wait until after the tag check to find the write way, you've lost the only benefit of write-invalidate.
Blindly writing to a random way could lead to a situation where there's a hit in a different way than the one you guessed. Way-prediction is a thing (and can do better than random), but the downside of an incorrect prediction for a write like this would be unnecessarily invalidating a line, instead of just a bit of extra latency. Way prediction in modern cache. I don't know what kind of success-rate way-prediction usually achieves. I'd guess not great, like maybe 80 to 90% at best. Probably spending transistors to do way-prediction would be better spent elsewhere, to do something that sucks less than write-invalidate! A store buffer with store forwarding probably costs more, but is a lot better.
The advantage of write-invalidate is to help make the cache non-blocking. But if you need to correct the situation when you do find a write-hit in a way other than the one you picked, you need to go back and correct the situation, updating the correct line. So you'd lose the non-blocking property. Never stalling is better than not usually stalling, because it means you don't even need to make the hardware handle that possible case at all. (Although you do need to be able to stall for memory.)
The write-in-one-way-hit-in-another situation can be avoided by writing in all of the ways. But there will be at most one hit and the rest will have to be invalidated. The negative impact on hit rate will significantly grow with associativity. (Unless writes are quite infrequent vs. reads, reducing the associativity would probably help hit rate with the write-all-ways strategy, so for a given total cache capacity, direct-mapped might be the best choice if you insist on fully-non-blocking write-invalidate.) Even for a direct-mapped cache, the experimental evaluation given in the paper itself shows that write-invalidate has higher miss rate compare to the other evaluated write policies. So it's win only if the benefits of reducing latency and bandwidth demand outweighs the damage of high miss rate.
Also, as I said, write-allocate is very good for CPUs, especially when it's set-associative so you're spending more resources trying to get a higher hit-rate. You could maybe still implement write-allocate by triggering a fetch on miss, remembering where in the line you stored the data, and merging that with the old copy of the line when it arrives.
You don't want to defeat that by blowing away lines that didn't need to die.
Also, write-invalidate implies write-through even for write hits, because it could lose data if a line is ever dirty. But write-back is also very good in modern L1d caches to insulate larger/slower caches from the write bandwidth. (Especially if there's no per-core private L2 to separately reduce total traffic to shared caches.) However, AMD Bulldozer-family did have a write-through L1d with a small 4k write-buffer between it and a write-back L2. This was generally considered a failed experiment or weak point of the design, and they dropped it in favour of a standard write-back write-allocate L1d for Zen. When use write-through cache policy for pages.
So in summary, write-invalidate is incompatible with several things that modern mainstream CPU designs have settled on as the best options, that you'll find in most mainstream CPU designs
write-allocate write policy
write-back (not write-through). https://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies
set-associative (huge downsides that can only be partially mitigated by way-prediction)
store buffer to decouple store miss from execution, and allow memory parallelism. (Not strictly incompatible, but a store buffer makes it pointless. Necessary for OoO exec and widely used for in-order)
write-invalidate in cache-coherent SMP systems
You'd never consider using it in a single-chip multi-core CPU; spend more transistors on each core to get more of the low-hanging fruit before you start building more cores. e.g. a proper store buffer. Use some flavour of SMT if you want high throughput for multiple low-IPC threads that stall a lot.
But for multi-socket SMP, this could have made sense historically if you want to use multiple of the biggest single-core chip you can build, and that was still not big enough to just have a store buffer instead of this.
I guess it could even make sense to use a really "thin" direct-mapped write-through L1d in front of a private medium-sized write-back set-associative L2 that's still pretty fast. (Maybe call this an L0d cache because it can act like an unordered store buffer. The next-level cache will still see a lot of reads and writes from the low hit-rate of this small direct-mapped cache.)
Normally all caches (including L1d) are part of the same global coherency domain so writing into L1d cache can't happen until you have exclusive ownership. (Which you check for as part of the tag check.) But if this L1d / L0d is not like that, then it's not coherent and is more like a store buffer.
Of course, you need to queue the write-throughs for L2, and eventually stall when it can't keep up, so you're just adding complexity. The write-through to L2 mechanism would also need to deal with waiting for L2 to gain exclusive ownership of the line before writing (MESI Exclusive or Modified state). So this is very much just an unordered store buffer.
The case of writing to a line that hadn't made it to L2 yet is interesting: if it's an L0d write hit you effectively get store merging for free. You'd need per-word or per-byte needs-writeback bits (aka dirty bits) for this. Normally write-through would be sending along the write while the offset within line is still available, but if L2 isn't ready to accept it yet (e.g. because of a write miss) then you can't do that. This is morphing it into a write-combining buffer. Marking the whole line as needing write-back doesn't work because the unwritten parts are still invalid.
But if it's a write miss (same cache line, different tag bits) on a line that still hasn't finished write-back to L2, you have a big problem because you'd be invalidating a line that's still "dirty" (has the only copy of some older store data). You can't detect that before writing; the whole point is to write in parallel with checking tags.
It might be possible to still make this work: if the cache access is a read+write exchange that keeps the previous value in a one-word buffer (or whatever the max write size is), you still have all the data. Stall everything (including writeback of this line so you don't make wrong data globally visible in coherent L2 cache). Then exchange back, wait for the old state of that L0d line to actually write back to that address, then store the tmp buffer into L0d and update the tag and needs-writeback bits to reflect this store. So aliasing between nearby stores becomes extra costly and stalls the pipeline. Or maybe you can let non-memory instructions continue and only stall execution at the next load or store. (If you have the transistor budget to do much of that stall-avoidance, you can probably just use a completely different strategy, like having a store buffer and a normal L1d.)
To be usable (assuming you work around the dirty-store-miss problem), you'd need some way to track relative order of stores (and loads). If that's as simplistic as making sure every entry in the entire L0d has finished its write-through process before allowing another write, then even store-store barriers will be very expensive. The less order-tracking a CPU does, the more expensive barriers have to be (flush more stuff to make sure).
I have a function that does very little reading, but a lot of writing to RAM. When I run it multiple times on the same core (the main thread), it runs about 5x as fast than if I launch the function on a new thread every run (which doesn't guarantee the same core is used between runs), as I launch and join between runs.
This suggests the cache is being used heavily for the write process, but I don't understand how. I thought the cache was only useful for reads.
Modern processors usually have write-buffers. The reason is that writes are, to a first approximation, pure sinks. The processor doesn't usually have to wait for the store to reach the coherent memory hierarchy before it executes the next instruction.
(Aside: Obviously stores are not pure sinks. A later read from the written-to memory location should return the written value, so the processor must snoop the write-buffer, and either stall the read or forward the written value to it)
Obviously such buffer(s) are of finite size, so when the buffers are full the next store in the program can't be executed and stalls until a slot in the buffer is made available by an older store becoming architecturally visible.
Ordinarily, the way a write leaves the buffer is when the value is written to cache (since a lot of writes are actually read back again quickly, think of the program stack as an example). If the write only sets part of the cacheline, the rest of the cacheline must remain unmodified, so consequently it must be loaded from the memory hierarchy.
There are ways to avoid loading the old cacheline, like non-temporal stores, write-combining memory or cacheline-zeroing instructions.
Non-temporal stores and write-combining memory combine adjacent writes to fill a whole cacheline, sending the new cacheline to the memory hierarchy to replace the old one.
POWER has an instruction that zeroes a full cacheline (dcbz), which also removes the need to load the old value from memory.
x86 with AVX512 has cacheline-sized registers, which suggests that an aligned zmm-register store could avoid loading the old cacheline (though I do not know whether it does).
Note that many of these techniques are not consistent with the usual memory-ordering of the respective processor architectures. Using them may require additional fences/barriers in multi-threaded operation.
How does a multiprocessor with write-buffers maintain the sequential consistency?
To my knowledge, in a uniprocessor, If the buffer is FIFO and the reads to an element that is pending to be write on main memory is supplied by the buffer, it maintains the consistency.
But how it works in a MP? I think that If a processor puts an store in his buffer, another processor can't read this, and I think that this break the sequencial consistency.
How does it work in a multithread environment with a write-buffer per thread? It also breaks the sequential consistency?
You referred to:
Typically, a CPU only sees the random access; the fact that memory busses are sequentially accessed is hidden to the CPU itself, so from the point of view of the CPU, there's no FIFO involved here.
In SMP modern machines, there's so-called snoop control units that watch the memory transfers and invalidate the cache copy of the RAM if necessary. So there's dedicated hardware to make sure data is synchronous. This doesn't mean it's really synchronous -- there's always more than one way to get invalid data (for example, by already having loaded a memory value into a register before the other CPU core changed it), but that is what you were getting at.
Also, multiple threads are basically a software concept. So if you need to synchronize software FIFOs, you will need to use proper locking mechanisms.
I'm assuming X86 here.
The store in the store buffer in itself isn't the problem. If for example a CPU would only do stores and the stores in the store buffer all retire in order, it would be exactly the same behavior as a processor that doesn't have a store buffer. For SC the real time order doesn't need to be preserved.
And you already indicated that a processor will see its own stores in the store buffer in order. The part where SC gets violated is when a store is followed by a load to a different address.
So imagine
A=1
r1=B
Then without a store buffer, first the store of A would be written to cache/memory. And then the B would be read from cache/memory.
But with a store buffer, it can be that the load of B will overtake the store of A. So the load will read from cache/memory before the store of A is written to cache/memory.
The typical example of where SC breaks with store buffers is Dekkers algorithm.
lock_a=1
while(lock_b==1){
if(turn == b){
lock_a=0
while(lock_b==1);
lock_a=1
}
}
So at the top you can see a store of lock_a=1 followed by a load of lock_b. Due to store buffer it can be that these 2 get reordered and as a consequence 2 threads could enter the critical section.
One way to solve it is to add a [StoreLoad] fence between the load and store, which prevents loads from being executed till the store buffer has been drained. This way SC is restored.
Note 1: store buffers are per CPU; not per thread.
Note 2: store (and load) buffers are before the cache.
My understanding is that the main difference between the two methods is that in "write-through" method data is written to the main memory through the cache immediately, while in "write-back" data is written in a "later time".
We still need to wait for the memory in "later time" so What is the benefit of "write-through"?
The benefit of write-through to main memory is that it simplifies the design of the computer system. With write-through, the main memory always has an up-to-date copy of the line. So when a read is done, main memory can always reply with the requested data.
If write-back is used, sometimes the up-to-date data is in a processor cache, and sometimes it is in main memory. If the data is in a processor cache, then that processor must stop main memory from replying to the read request, because the main memory might have a stale copy of the data. This is more complicated than write-through.
Also, write-through can simplify the cache coherency protocol because it doesn't need the Modify state. The Modify state records that the cache must write back the cache line before it invalidates or evicts the line. In write-through a cache line can always be invalidated without writing back since memory already has an up-to-date copy of the line.
One more thing - on a write-back architecture software that writes to memory-mapped I/O registers must take extra steps to make sure that writes are immediately sent out of the cache. Otherwise writes are not visible outside the core until the line is read by another processor or the line is evicted.
Hope this article can help you Differences between disk Cache Write-through and Write-back
Write-through: Write is done synchronously both to the cache and to the backing store.
Write-back (or Write-behind): Writing is done only to the cache. A modified cache block is written back to the store, just before it is replaced.
Write-through: When data is updated, it is written to both the cache and the back-end storage. This mode is easy for operation but is slow in data writing because data has to be written to both the cache and the storage.
Write-back: When data is updated, it is written only to the cache. The modified data is written to the back-end storage only when data is removed from the cache. This mode has fast data write speed but data will be lost if a power failure occurs before the updated data is written to the storage.
Let's look at this with the help of an example.
Suppose we have a direct mapped cache and the write back policy is used. So we have a valid bit, a dirty bit, a tag and a data field in a cache line.
Suppose we have an operation : write A ( where A is mapped to the first line of the cache).
What happens is that the data(A) from the processor gets written to the first line of the cache. The valid bit and tag bits are set. The dirty bit is set to 1.
Dirty bit simply indicates was the cache line ever written since it was last brought into the cache!
Now suppose another operation is performed : read E(where E is also mapped to the first cache line)
Since we have direct mapped cache, the first line can simply be replaced by the E block which will be brought from memory. But since the block last written into the line (block A) is not yet written into the memory(indicated by the dirty bit), so the cache controller will first issue a write back to the memory to transfer the block A to memory, then it will replace the line with block E by issuing a read operation to the memory. dirty bit is now set to 0.
So write back policy doesnot guarantee that the block will be the same in memory and its associated cache line. However whenever the line is about to be replaced, a write back is performed at first.
A write through policy is just the opposite. According to this, the memory will always have a up-to-date data. That is, if the cache block is written, the memory will also be written accordingly. (no use of dirty bits)
Write-back and write-through describe policies when a write hit occurs, that is when the cache has the requested information. In these examples, we assume a single processor is writing to main memory with a cache.
Write-through: The information is written to the cache and memory, and the write finishes when both have finished. This has the advantage of being simpler to implement, and the main memory is always consistent (in sync) with the cache (for the uniprocessor case - if some other device modifies main memory, then this policy is not enough), and a read miss never results in writes to main memory. The obvious disadvantage is that every write hit has to do two writes, one of which accesses slower main memory.
Write-back: The information is written to a block in the cache. The modified cache block is only written to memory when it is replaced (in effect, a lazy write). A special bit for each cache block, the dirty bit, marks whether or not the cache block has been modified while in the cache. If the dirty bit is not set, the cache block is "clean" and a write miss does not have to write the block to memory.
The advantage is that writes can occur at the speed of the cache, and if writing within the same block only one write to main memory is needed (when the previous block is being replaced). The disadvantages are that this protocol is harder to implement, main memory can be not consistent (not in sync) with the cache, and reads that result in replacement may cause writes of dirty blocks to main memory.
The policies for a write miss are detailed in my first link.
These protocols don't take care of the cases with multiple processors and multiple caches, as is common in modern processors. For this, more complicated cache coherence mechanisms are required. Write-through caches have simpler protocols since a write to the cache is immediately reflected in memory.
Good resources:
http://web.cs.iastate.edu/~prabhu/Tutorial/CACHE/interac.html (what my post is largely based on)
http://www.cs.cornell.edu/courses/cs3410/2013sp/lecture/18-caches3-w.pdf
Write-Back is a more complex one and requires a complicated Cache Coherence Protocol(MOESI) but it is worth it as it makes the system fast and efficient.
The only benefit of Write-Through is that it makes the implementation extremely simple and no complicated cache coherency protocol is required.
I have encountered some Intel compiler intrinsic functions which I believe allow developers to bypass the cache?
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-mac/GUID-AF42A867-B796-4D29-8FED-C20193FD87E0.htm
I have also come across the GCC compiler prefetch keyword, although I cannot admit to fully appreciating what this does.
With the above in mind I wondered if any members could either elaborate on the above (which I badly described) or provide other techniques which allow the developer to have close control over which data (or instructions) is/isn't loaded in the CPU cache?
This page contains a lot of information about all intrinsics:
Intel Intrinsics Guide
The series of instructions that will write data to memory, avoiding cache evictions are generally named _mm_stream_.... As the name implies, these are ideal for applications that write a large stream of data that is basically contiguous in memory and unlikely to be accessed again in the near future. So, for example, if you are mixing audio buffers and producing a single waveform output this would work well.
One of the keys to using these instructions effectively is taking advantage of write combining. If your write locations are scattered throughout memory, these instructions will stall as badly, or possibly worse than any other kind of memory storage instruction you attempt. Since these writes do not wind up in cache, if you're not filling an entire write buffer then essentially your operation becomes a write-through operation, requiring a stall until the write is completed. If you are writing contiguous memory locations then write combining will apply, and make your data writes much more efficient.
The flip side of that coin is prefetching. Prefetching tells the system to start pulling a memory address into the desired level of cache so that by the time the memory read is complete, you are ready to use the data. This is much harder to use, and requires an appropriate data "stride" which takes into account the cache sizes, cache line size, and the number of instructions which can execute before the memory read completes. Using the hinting parameter, you can "suggest" that the data goes into the L1, L2, or L3 cache, or that it is "non-temporal", meaning that you're just going to use it once and it should be evicted first before any other cache evictions. The hardware has its own prefetching heuristics that work well for most problems without explicit prefetching instructions, but the classic counter-example is a matrix transpose:
Prefetching examples
Prefetching is generally very difficult to use effectively except in some very specific cases like this. Without a more specific problem statement from you, this is about all I can provide.