Is there any benefit to use cache if there is read miss for every access? - caching

Is there any benefit to use cache if there is read miss for every access?
My question aims at a better understanding of caches.
Read miss for every access can also happen during cold start, am I right?

If you are doing only reads and you miss all levels of cache on every read, then by definition caches didn't help. You paid extra in power and latency to check each level of cache, and extra power and (probably) latency to load a whole cache line of data which will never be used (by definition) except for the data you read since all your accesses miss.
You didn't say you are doing only reads though. So of course caches can help in the case that your reads all miss but some of your writes hit.
Perhaps you meant that the read-portion of all accesses, both reads and writes, misses - where the read-portion of a write is the read-for-ownership access present on most cache-based systems where a cache-line must be read into cache before (part of it) can be written. In that case, the cache probably also didn't help and probably hurt.
Read miss for every access can also happen during cold start, am I right?
No, almost never. Some reads will miss, but those will bring in adjacent data on the same cache line which will often result in later hits (spatial locality). Many reads will go back to the same location even in a cold start (temporal locality) which will also often hit. Even beyond the dynamics of a single cache line, modern CPUs often offer hardware prefetching which will recognize certain access patterns and will bring in data before you need it which can result in hits even the first time you access a cache line.
Finally, on most general purpose hardware there you usually cannot decide to simply "not use the caches" so as a practical matter you pay the built-in costs of caching even if your hit rate is low.
That said, sometimes, when you know your access pattern you can provide hints to the CPU. For example, x86 CPUs provide "non-temporal store" instructions which essentially bypass the caches when used - meaning that the stored cache line won't be cached. This is useful not necessarily to speed up the store itself (which still largely pays the price of the cache hierarchy which is baked into the hardware), but to avoid polluting the cache with data the developer knows will not soon be accessed.

Related

How does write-invalidate policy work with set-associative caches?

I was going through Cache Write Policies paper by Norman P. Jouppi and I understand why write-invalidate (defined on page 193) works well with direct mapped caches which is because of the ability to write the data which checking the tag and if found to be miss, the cache line is invalidates as it is corrupted by the write. This can be done in one cycle.
But is there any benefit if write-invalidate is used for set-associative caches?
What is the usual configuration that is used for L1 caches in real processors? Do they use direct or set-associative and write-validate/write around/write invalidate/fetch-on-write policy?
TL:DR: for a non-blocking cache using write-invalidate, changing it from direct-mapped to set-associative could hurt the hit rate unless writes are very rare, or mean that you introduce the possibility of needing to block.
Write-invalidate only makes sense for a simple in-order pipeline with a simple cache that tries to avoid stalling the pipeline even without a store buffer, and go really fast at the expense of hit-rate. If you were going to change things to improve hit-rate, changing away from write-invalidate (usually to write-back + write-allocate + fetch-on-write) would be one of the first things. Write-invalidate with set-associative cache is possible with some ugly tradeoffs, but you wouldn't like the results.
The 1993 paper you linked is using that term to mean something other than the modern cache-coherence mechanism meaning. In the paper:
The combination of
write-before-hit, no-fetch-on-write, and no-write-allocate
we call write-invalidate
Yes, real-world caches these days are basically always set-associative; the more complex tag-comparator logic is worth the increased hit-rate for the same data size. Which cache mapping technique is used in intel core i7 processor? has some general stuff, not just x86. Modern examples of direct-mapped caches include the DRAM cache when a part of the persistent memory on an Intel platform operates in memory mode. Also many server-grade processors from multiple vendors support L3 way-wise partitioning, so you can, for example, allocate one way for a thread which would basically behave like a direct-mapped cache.
Write policy is usually write-allocate + fetch-on-write + no-write-before-hit for modern CPU caches; some ISAs offer methods such as special instructions to bypass cache for "non-temporal" stores that won't be re-read soon, to avoid cache pollution for those cases. Most workloads do re-load their stores with enough temporal locality that write-allocate is the only sane choice, especially when caches are larger and/or more associative so they're more likely to be able to hang onto a line until the next read or write.
It's also very common to do multiple small writes into the same line, making write-allocate very valuable, especially if a store buffer didn't manage to merge those writes.
But is there any benefit if write-invalidate is used for set-associative caches?
It doesn't seem so.
The only advantage it has is not stalling a simple in-order pipeline that lacks a store buffer ("write buffer" in the paper). It allows write in parallel with the tag-check, so you find out after modifying the line whether you hit or not. (Modern CPUs do use a store buffer to decouple store commit to L1d from store execution and hide store-miss latency. Even in-order CPUs typically have a store buffer to allow memory-level parallelism of RFOs (read-for-ownership). (e.g. ARM Cortex-A53 found in phones).
Anyway, in a set-associative cache, you need to check tags to know which "way" of the set to write into on a write hit. (Or detect a miss and pick one to evict according to some policy, like random or pseudo-LRU using some extra state bits, or write-around if no-write-allocate). If you wait until after the tag check to find the write way, you've lost the only benefit of write-invalidate.
Blindly writing to a random way could lead to a situation where there's a hit in a different way than the one you guessed. Way-prediction is a thing (and can do better than random), but the downside of an incorrect prediction for a write like this would be unnecessarily invalidating a line, instead of just a bit of extra latency. Way prediction in modern cache. I don't know what kind of success-rate way-prediction usually achieves. I'd guess not great, like maybe 80 to 90% at best. Probably spending transistors to do way-prediction would be better spent elsewhere, to do something that sucks less than write-invalidate! A store buffer with store forwarding probably costs more, but is a lot better.
The advantage of write-invalidate is to help make the cache non-blocking. But if you need to correct the situation when you do find a write-hit in a way other than the one you picked, you need to go back and correct the situation, updating the correct line. So you'd lose the non-blocking property. Never stalling is better than not usually stalling, because it means you don't even need to make the hardware handle that possible case at all. (Although you do need to be able to stall for memory.)
The write-in-one-way-hit-in-another situation can be avoided by writing in all of the ways. But there will be at most one hit and the rest will have to be invalidated. The negative impact on hit rate will significantly grow with associativity. (Unless writes are quite infrequent vs. reads, reducing the associativity would probably help hit rate with the write-all-ways strategy, so for a given total cache capacity, direct-mapped might be the best choice if you insist on fully-non-blocking write-invalidate.) Even for a direct-mapped cache, the experimental evaluation given in the paper itself shows that write-invalidate has higher miss rate compare to the other evaluated write policies. So it's win only if the benefits of reducing latency and bandwidth demand outweighs the damage of high miss rate.
Also, as I said, write-allocate is very good for CPUs, especially when it's set-associative so you're spending more resources trying to get a higher hit-rate. You could maybe still implement write-allocate by triggering a fetch on miss, remembering where in the line you stored the data, and merging that with the old copy of the line when it arrives.
You don't want to defeat that by blowing away lines that didn't need to die.
Also, write-invalidate implies write-through even for write hits, because it could lose data if a line is ever dirty. But write-back is also very good in modern L1d caches to insulate larger/slower caches from the write bandwidth. (Especially if there's no per-core private L2 to separately reduce total traffic to shared caches.) However, AMD Bulldozer-family did have a write-through L1d with a small 4k write-buffer between it and a write-back L2. This was generally considered a failed experiment or weak point of the design, and they dropped it in favour of a standard write-back write-allocate L1d for Zen. When use write-through cache policy for pages.
So in summary, write-invalidate is incompatible with several things that modern mainstream CPU designs have settled on as the best options, that you'll find in most mainstream CPU designs
write-allocate write policy
write-back (not write-through). https://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies
set-associative (huge downsides that can only be partially mitigated by way-prediction)
store buffer to decouple store miss from execution, and allow memory parallelism. (Not strictly incompatible, but a store buffer makes it pointless. Necessary for OoO exec and widely used for in-order)
write-invalidate in cache-coherent SMP systems
You'd never consider using it in a single-chip multi-core CPU; spend more transistors on each core to get more of the low-hanging fruit before you start building more cores. e.g. a proper store buffer. Use some flavour of SMT if you want high throughput for multiple low-IPC threads that stall a lot.
But for multi-socket SMP, this could have made sense historically if you want to use multiple of the biggest single-core chip you can build, and that was still not big enough to just have a store buffer instead of this.
I guess it could even make sense to use a really "thin" direct-mapped write-through L1d in front of a private medium-sized write-back set-associative L2 that's still pretty fast. (Maybe call this an L0d cache because it can act like an unordered store buffer. The next-level cache will still see a lot of reads and writes from the low hit-rate of this small direct-mapped cache.)
Normally all caches (including L1d) are part of the same global coherency domain so writing into L1d cache can't happen until you have exclusive ownership. (Which you check for as part of the tag check.) But if this L1d / L0d is not like that, then it's not coherent and is more like a store buffer.
Of course, you need to queue the write-throughs for L2, and eventually stall when it can't keep up, so you're just adding complexity. The write-through to L2 mechanism would also need to deal with waiting for L2 to gain exclusive ownership of the line before writing (MESI Exclusive or Modified state). So this is very much just an unordered store buffer.
The case of writing to a line that hadn't made it to L2 yet is interesting: if it's an L0d write hit you effectively get store merging for free. You'd need per-word or per-byte needs-writeback bits (aka dirty bits) for this. Normally write-through would be sending along the write while the offset within line is still available, but if L2 isn't ready to accept it yet (e.g. because of a write miss) then you can't do that. This is morphing it into a write-combining buffer. Marking the whole line as needing write-back doesn't work because the unwritten parts are still invalid.
But if it's a write miss (same cache line, different tag bits) on a line that still hasn't finished write-back to L2, you have a big problem because you'd be invalidating a line that's still "dirty" (has the only copy of some older store data). You can't detect that before writing; the whole point is to write in parallel with checking tags.
It might be possible to still make this work: if the cache access is a read+write exchange that keeps the previous value in a one-word buffer (or whatever the max write size is), you still have all the data. Stall everything (including writeback of this line so you don't make wrong data globally visible in coherent L2 cache). Then exchange back, wait for the old state of that L0d line to actually write back to that address, then store the tmp buffer into L0d and update the tag and needs-writeback bits to reflect this store. So aliasing between nearby stores becomes extra costly and stalls the pipeline. Or maybe you can let non-memory instructions continue and only stall execution at the next load or store. (If you have the transistor budget to do much of that stall-avoidance, you can probably just use a completely different strategy, like having a store buffer and a normal L1d.)
To be usable (assuming you work around the dirty-store-miss problem), you'd need some way to track relative order of stores (and loads). If that's as simplistic as making sure every entry in the entire L0d has finished its write-through process before allowing another write, then even store-store barriers will be very expensive. The less order-tracking a CPU does, the more expensive barriers have to be (flush more stuff to make sure).

write through access pattern for cache

I know "write-through" means the write is committed only if DB write and cache write are both good. However below statement confused me
"rite-through cache is good for applications that write and then re-read data frequently as data is stored in cache and results in low read latency"
I think This pattern has to write 2 layer, which would lead to higher write latency. How could this be good for write-frequent application.
When you are using write-through as writing policy, you make sure that on either write misses or write hits, main memory keeps updated with correct values. As you said , if an application writes and then re-reads some data frequently, part of this data could stay in cache memory, which would result in less misses overall.
However, this is not an absolute truth at all, due to the fact that both CPU and memory performance depend on several factors and cannot be measured by testing just one program or application.
Hope this helps!

MESI protocol. Write with cache miss, but cache line copy exists on another CPU. Why needs fetch from main memory?

According to this diagram in case of write cache miss with copy in another CPU cache (for example Shared/Exclusive state). The steps are:
1. Snooping cores (with cache line copy) sets state to Invalid.
2. Current cache stores fresh main memory value.
Why one of the snooping cores can't put its cache line value on the bus at first? And then go to Invalid state. The same algorithm is used in read miss with existing copy. Thank you.
You're absolutely right in that it's pretty silly to go fetch a line from memory when you already have it right next to you, but this diagram describes the minimal requirement for functional correctness of the coherence protocol (i.e. what must be done to avoid coherence bugs), and that only dictates snooping the data out for modified lines since that's the only correct copy. What you describe is a possible optimization, and some systems indeed behave that way.
However, keep in mind that most systems today employ a shared cache as well (L2 or L3, sometimes even beyond that), and this is often inclusive (with regards to all lines that exist in all cores). In such systems, there's no real need to go all the way to memory, since having the line in another core means it's also in the shared cache, and after invalidation the requesting core can obtain it from there. Your proposal is therefore relevant only for systems with no shared cache, or with a cache that is not strictly inclusive.

cpu cache performance. store misses vs load misses

I'm using perf as basic event counter. I'm working on a program which suffers from data cache store misses. Which as as high as ratio of %80.
I know how caches in principle work. It loads from memory on various miss cases, removes data from cache when it pleases. What I don't understand is , what is difference between store - load misses. How does it differ loading and storing. How can you store-miss ?
A load-miss (as you know) is referring to when the processor needs to fetch data from main memory, but data does not exist in the cache. So whenever the processor wants some data from the main memory, it esquires the cache, and if the data is already loaded you get a load-hit and otherwise you get a load-miss.
A store-miss is related to when the processor wants to write back the newly calculated data to the main memory.When it wants to write-back the data to the main memory, it hasto make sure that the content of the cache and main memory are in sync with each other. It can happen with two different policies that you can find here: Writing Policies.
So no matter what policy you choose, you first need to check whether the data is already in the cache so you can store it to cache first (since it's faster), and if the data block you are looking for has been evicted from the cache, you get a store-miss related to that cache.
You can check the applet here, to get a better idea of what happens in different scenarios.
I'm not fully familiar with how perf define these events, but given the common definition I believe load/store miss is just a way to break down the overall miss rate counting, so that you may tell which accesses miss more often. Note that loads are usually performed speculatively (at least in modern x86 cpus), while stores are performed much later along the pipeline, after the commit point, so even a piece of code with both loads and stores to the same region can have different miss rates.
In MESI-based cache protocols a load would hit the cache, or miss and fetch the line from the memory or next cache levels, either exclusively if it's not owned by anyone else, or in a shared state if it is. It would write the data to the caches along the way in the process.
A store would fetch a line in the same manner, but use an RFO (read-for-ownership) request which grants it exclusive ownership and the right to modify the line. The line would still get cached, but once the new data is written to it locally (usually in your L1 cache), it would become modified. The hit/miss process would look the same though.
What Saman referred to in his answer is the breakdown between reads and writes. Loads and stores (and other forms of access like code-read) all form the "read" part, and writebacks (or intentional write-throughs using special command or mem types like uncacheable) form the "write part.

In what applications caching does not give any advantage?

Our professor asked us to think of an embedded system design where caches cannot be used to their full advantage. I have been trying to find such a design but could not find one yet. If you know such a design, can you give a few tips?
Caches exploit the fact data (and code) exhibit locality.
So an embedded system wich does not exhibit locality, will not benefit from a cache.
Example:
An embedded system has 1MB of memory and 1kB of cache.
If this embedded system is accessing memory with short jumps it will stay long in the same 1kB area of memory, which could be successfully cached.
If this embedded system is jumping in different distant places inside this 1MB and does that frequently, then there is no locality and cache will be used badly.
Also note that depending on architecture you can have different caches for data and code, or a single one.
More specific example:
If your embedded system spends most of its time accessing the same data and (e.g.) running in a tight loop that will fit in cache, then you're using cache to a full advantage.
If your system is something like a database that will be fetching random data from any memory range, then cache can not be used to it's full advantage. (Because the application is not exhibiting locality of data/code.)
Another, but weird example
Sometimes if you are building safety-critical or mission-critical system, you will want your system to be highly predictable. Caches makes your code execution being very unpredictable, because you can't predict if a certain memory is cached or not, thus you don't know how long it will take to access this memory. Thus if you disable cache it allows you to judge you program's performance more precisely and calculate worst-case execution time. That is why it is common to disable cache in such systems.
I do not know what you background is but I suggest to read about what the "volatile" keyword does in the c language.
Think about how a cache works. For example if you want to defeat a cache, depending on the cache, you might try having your often accessed data at 0x10000000, 0x20000000, 0x30000000, 0x40000000, etc. It takes very little data at each location to cause cache thrashing and a significant performance loss.
Another one is that caches generally pull in a "cache line" A single instruction fetch may cause 8 or 16 or more bytes or words to be read. Any situation where on average you use a small percentage of the cache line before it is evicted to bring in another cache line, will make your performance with the cache on go down.
In general you have to first understand your cache, then come up with ways to defeat the performance gain, then think about any real world situations that would cause that. Not all caches are created equal so there is no one good or bad habit or attack that will work for all caches. Same goes for the same cache with different memories behind it or a different processor or memory interface or memory cycles in front of it. You also need to think of the system as a whole.
EDIT:
Perhaps I answered the wrong question. not...full advantage. that is a much simpler question. In what situations does the embedded application have to touch memory beyond the cache (after the initial fill)? Going to main memory wipes out the word full in "full advantage". IMO.
Caching does not offer an advantage, and is actually a hindrance, in controlling memory-mapped peripherals. Things like coprocessors, motor controllers, and UARTs often appear as just another memory location in the processor's address space. Instead of simply storing a value, those locations can cause something to happen in the real world when written to or read from.
Cache causes problems for these devices because when software writes to them, the peripheral doesn't immediately see the write. If the cache line never gets flushed, the peripheral may never actually receive a command even after the CPU has sent hundreds of them. If writing 0xf0 to 0x5432 was supposed to cause the #3 spark plug to fire, or the right aileron to tilt down 2 degrees, then the cache will delay or stop that signal and cause the system to fail.
Similarly, the cache can prevent the CPU from getting fresh data from sensors. The CPU reads repeatedly from the address, and cache keeps sending back the value that was there the first time. On the other side of the cache, the sensor waits patiently for a query that will never come, while the software on the CPU frantically adjusts controls that do nothing to correct gauge readings that never change.
In addition to almost complete answer by Halst, I would like to mention one additional case where caches may be far from being an advantage. If you have multiple-core SoC where all cores, of course, have own cache(s) and depending on how program code utilizes these cores - caches can be very ineffective. This may happen if ,for example, due to incorrect design or program specific (e.g. multi-core communication) some data block in RAM is concurrently used by 2 or more cores.

Resources