I am learning caching and have a question on the concurrency of cache.
As I understand it, an LRU cache is implemented with a doubly linked list plus a hash table. How does an LRU cache handle highly concurrent access? Note that both getting data from the cache and putting data into the cache update the linked list and the hash table, so the cache is modified all the time.
If we use a mutex lock for thread safety, won't the cache be slowed down when it is accessed by a large number of clients? If we do not use a lock, what techniques are used? Thanks in advance.
Traditional LRU caches are not designed for high concurrency, both because the hardware was limited and because the hit penalty is far smaller than the miss penalty (e.g. a database lookup). For most applications, locking the cache is acceptable if the lock is only used to update the underlying structure (not to compute the value on a miss). Simple techniques like segmenting the LRU policy were usually good enough when the locks became contended.
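For example, segmenting can be sketched like this in Java (an illustrative lock-striping sketch, not any particular library's implementation; the segment count and the per-segment LinkedHashMap are arbitrary choices here):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal sketch of a segmented LRU cache: each key hashes to one segment,
// and each segment is an independently locked LRU map, so threads touching
// different segments do not contend on the same lock.
public class SegmentedLruCache<K, V> {
    private final Map<K, V>[] segments;

    @SuppressWarnings("unchecked")
    public SegmentedLruCache(int segmentCount, int capacityPerSegment) {
        segments = new Map[segmentCount];
        for (int i = 0; i < segmentCount; i++) {
            segments[i] = new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > capacityPerSegment;   // evict the LRU entry of this segment
                }
            };
        }
    }

    private Map<K, V> segmentFor(K key) {
        return segments[Math.floorMod(key.hashCode(), segments.length)];
    }

    public V get(K key) {
        Map<K, V> segment = segmentFor(key);
        synchronized (segment) {            // only this segment is locked
            return segment.get(key);
        }
    }

    public void put(K key, V value) {
        Map<K, V> segment = segmentFor(key);
        synchronized (segment) {
            segment.put(key, value);
        }
    }
}
```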
The way to make an LRU cache scale is to avoid updating the policy on every access. The critical observation to make is that the user of the cache does not care what the current LRU ordering is. The only concern of the caller is that the cache maintains a threshold size and a high hit rate. This opens the door for optimizations by avoiding mutating the LRU policy on every read.
The approach taken by memcached is to skip the LRU update for subsequent reads within a time window, e.g. 1 second. The cache is expected to be very large, so there is a very low chance of evicting a poor candidate with this simpler LRU.
The approach taken by ConcurrentLinkedHashMap (CLHM), and subsequently Guava's Cache, is to record the access in a buffer. This buffer is drained under the LRU's lock and by using a try-lock no other operation has to be blocked. CLHM uses multiple ring buffers that are lossy if the cache cannot keep up, as losing events is preferred to degraded performance.
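That idea can be sketched roughly as follows (a simplification for illustration, not CLHM's or Guava's actual code): reads record the key into a bounded, lossy buffer, and whichever thread wins the try-lock replays the buffered accesses against the LRU ordering.

```java
import java.util.LinkedHashMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Rough sketch of "record reads in a buffer, drain under a try-lock": the hash
// table is read concurrently, the LRU ordering is only touched by whichever
// thread wins the try-lock, and buffered reads may be dropped (lossy) when the
// buffer is full rather than blocking the reader.
public class BufferedReadLruCache<K, V> {
    private final ConcurrentHashMap<K, V> data = new ConcurrentHashMap<>();
    private final LinkedHashMap<K, Boolean> lruOrder = new LinkedHashMap<>(16, 0.75f, true);
    private final ArrayBlockingQueue<K> readBuffer = new ArrayBlockingQueue<>(1024);
    private final ReentrantLock lruLock = new ReentrantLock();
    private final int maximumSize;

    public BufferedReadLruCache(int maximumSize) {
        this.maximumSize = maximumSize;
    }

    public V get(K key) {
        V value = data.get(key);           // lock-free read of the hash table
        if (value != null) {
            readBuffer.offer(key);         // lossy: dropped if the buffer is full
            tryDrainBuffer();
        }
        return value;
    }

    public void put(K key, V value) {
        data.put(key, value);
        lruLock.lock();
        try {
            lruOrder.put(key, Boolean.TRUE);
            evictIfNeeded();
        } finally {
            lruLock.unlock();
        }
    }

    private void tryDrainBuffer() {
        if (lruLock.tryLock()) {           // never block other readers
            try {
                K key;
                while ((key = readBuffer.poll()) != null) {
                    if (data.containsKey(key)) {
                        lruOrder.put(key, Boolean.TRUE);   // refresh recency
                    }
                }
            } finally {
                lruLock.unlock();
            }
        }
    }

    private void evictIfNeeded() {         // caller must hold lruLock
        while (lruOrder.size() > maximumSize) {
            K eldest = lruOrder.keySet().iterator().next();
            lruOrder.remove(eldest);
            data.remove(eldest);
        }
    }
}
```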
The approach taken by Ehcache and Redis is a probabilistic LRU policy. A read updates the entry's timestamp and a write iterates the cache to obtain a random sample. The oldest entry is evicted from that sample. If the sample is fast to construct and the cache is large, the evicted entry was likely a good candidate.
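A rough sketch of that sampling idea (again not Redis's or Ehcache's actual implementation): each entry carries a last-access timestamp that a read simply overwrites, and eviction picks the oldest entry out of a small random sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a sampled ("probabilistic") LRU: a read only stamps the entry with
// a timestamp, and eviction compares the timestamps of a small random sample
// instead of maintaining an exact recency ordering.
public class SampledLruCache<K, V> {
    private static final int SAMPLE_SIZE = 5;   // small sample, in the spirit of Redis's approximated LRU

    private static final class Entry<V> {
        final V value;
        volatile long lastAccessNanos = System.nanoTime();
        Entry(V value) { this.value = value; }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final int maximumSize;

    public SampledLruCache(int maximumSize) { this.maximumSize = maximumSize; }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        e.lastAccessNanos = System.nanoTime();  // the only bookkeeping a read does
        return e.value;
    }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value));
        if (map.size() > maximumSize) {
            evictOldestOfSample();
        }
    }

    private void evictOldestOfSample() {
        // Copying the key set is only for the sketch; a real implementation
        // samples its table in place without this O(n) copy.
        List<K> keys = new ArrayList<>(map.keySet());
        if (keys.isEmpty()) return;
        K victim = null;
        long oldest = Long.MAX_VALUE;
        for (int i = 0; i < SAMPLE_SIZE; i++) {
            K candidate = keys.get(ThreadLocalRandom.current().nextInt(keys.size()));
            Entry<V> e = map.get(candidate);
            if (e != null && e.lastAccessNanos < oldest) {
                oldest = e.lastAccessNanos;
                victim = candidate;
            }
        }
        if (victim != null) {
            map.remove(victim);   // evict the oldest entry seen in the sample
        }
    }
}
```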
There are probably other techniques and, of course, pseudo LRU policies (like CLOCK) that offer better concurrency at lower hit rates.
I was going through the Cache Write Policies paper by Norman P. Jouppi, and I understand why write-invalidate (defined on page 193) works well with direct-mapped caches: the data can be written in parallel with checking the tag, and if it turns out to be a miss, the cache line is invalidated because it has been corrupted by the write. This can be done in one cycle.
But is there any benefit if write-invalidate is used for set-associative caches?
What is the usual configuration used for L1 caches in real processors? Do they use direct-mapped or set-associative caches, and which of the write-validate / write-around / write-invalidate / fetch-on-write policies?
TL;DR: for a non-blocking cache using write-invalidate, changing it from direct-mapped to set-associative could hurt the hit rate unless writes are very rare, or mean that you introduce the possibility of needing to block.
Write-invalidate only makes sense for a simple in-order pipeline with a simple cache that tries to avoid stalling the pipeline even without a store buffer, and to go really fast at the expense of hit rate. If you were going to change things to improve hit rate, changing away from write-invalidate (usually to write-back + write-allocate + fetch-on-write) would be one of the first things. Write-invalidate with a set-associative cache is possible with some ugly tradeoffs, but you wouldn't like the results.
The 1993 paper you linked is using that term to mean something other than the modern cache-coherence mechanism meaning. In the paper:
The combination of
write-before-hit, no-fetch-on-write, and no-write-allocate
we call write-invalidate
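To make that quoted definition concrete, here is a purely behavioral model in Java of a direct-mapped write-through cache with that policy (my own sketch, not a hardware description and not from the paper): the store goes into the indexed line before we know whether it is a hit, and the tag check afterwards only decides whether the line stays valid.

```java
// Behavioral model of "write-invalidate" on a direct-mapped, write-through cache:
// write-before-hit, no-fetch-on-write, no-write-allocate. It only shows when a
// line ends up invalidated; it is not how the hardware is wired.
public class WriteInvalidateModel {
    static final int LINES = 256;
    static final int LINE_BYTES = 64;

    long[] tags = new long[LINES];
    boolean[] valid = new boolean[LINES];
    byte[][] data = new byte[LINES][LINE_BYTES];

    // `memory` is a flat byte array standing in for DRAM in this model.
    void store(long address, byte value, byte[] memory) {
        int index = (int) ((address / LINE_BYTES) % LINES);
        long tag = address / LINE_BYTES / LINES;
        int offset = (int) (address % LINE_BYTES);

        // Write into the line "in parallel" with the tag check: the data array
        // is updated before we know whether this was a hit.
        data[index][offset] = value;
        memory[(int) address] = value;      // write-through: memory always receives the store

        if (valid[index] && tags[index] == tag) {
            // Write hit: the line already held this address, so the early write was correct.
        } else {
            // Write miss: we just corrupted whatever line was cached here, so the
            // only safe thing to do is invalidate it (no fetch, no allocate).
            valid[index] = false;
        }
    }
}
```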
Yes, real-world caches these days are basically always set-associative; the more complex tag-comparator logic is worth the increased hit-rate for the same data size. Which cache mapping technique is used in intel core i7 processor? has some general stuff, not just x86. Modern examples of direct-mapped caches include the DRAM cache when a part of the persistent memory on an Intel platform operates in memory mode. Also many server-grade processors from multiple vendors support L3 way-wise partitioning, so you can, for example, allocate one way for a thread which would basically behave like a direct-mapped cache.
Write policy is usually write-allocate + fetch-on-write + no-write-before-hit for modern CPU caches; some ISAs offer methods such as special instructions to bypass cache for "non-temporal" stores that won't be re-read soon, to avoid cache pollution for those cases. Most workloads do re-load their stores with enough temporal locality that write-allocate is the only sane choice, especially when caches are larger and/or more associative so they're more likely to be able to hang onto a line until the next read or write.
It's also very common to do multiple small writes into the same line, making write-allocate very valuable, especially if a store buffer didn't manage to merge those writes.
But is there any benefit if write-invalidate is used for set-associative caches?
It doesn't seem so.
The only advantage it has is not stalling a simple in-order pipeline that lacks a store buffer ("write buffer" in the paper). It allows the write to happen in parallel with the tag check, so you find out after modifying the line whether you hit or not. (Modern CPUs do use a store buffer to decouple store commit to L1d from store execution and hide store-miss latency. Even in-order CPUs typically have a store buffer to allow memory-level parallelism of RFOs (read-for-ownership), e.g. the ARM Cortex-A53 found in phones.)
Anyway, in a set-associative cache, you need to check tags to know which "way" of the set to write into on a write hit. (Or detect a miss and pick one to evict according to some policy, like random or pseudo-LRU using some extra state bits, or write-around if no-write-allocate). If you wait until after the tag check to find the write way, you've lost the only benefit of write-invalidate.
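In code-ish terms, the problem is simply that the write target is not known until a loop like this finishes (a software model of the tag comparison, nothing more):

```java
// In a set-associative cache, the way to write into is only known after the
// tag comparison across all ways of the indexed set has completed.
static int findWriteWay(long[] setTags, boolean[] setValid, long tag) {
    for (int way = 0; way < setTags.length; way++) {
        if (setValid[way] && setTags[way] == tag) {
            return way;                // write hit: the store belongs in this way
        }
    }
    return -1;                         // miss: a victim way must be chosen by the replacement policy
}
```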
Blindly writing to a random way could lead to a situation where there's a hit in a different way than the one you guessed. Way-prediction is a thing (and can do better than random), but the downside of an incorrect prediction for a write like this would be unnecessarily invalidating a line, instead of just a bit of extra latency. Way prediction in modern cache. I don't know what kind of success-rate way-prediction usually achieves. I'd guess not great, like maybe 80 to 90% at best. Probably spending transistors to do way-prediction would be better spent elsewhere, to do something that sucks less than write-invalidate! A store buffer with store forwarding probably costs more, but is a lot better.
The advantage of write-invalidate is to help make the cache non-blocking. But if you do find a write hit in a way other than the one you picked, you need to go back and correct the situation by updating the correct line. So you'd lose the non-blocking property. Never stalling is better than not usually stalling, because it means you don't even need to make the hardware handle that possible case at all. (Although you do need to be able to stall for memory.)
The write-in-one-way-hit-in-another situation can be avoided by writing into all of the ways. But there will be at most one hit and the rest will have to be invalidated. The negative impact on hit rate will grow significantly with associativity. (Unless writes are quite infrequent vs. reads, reducing the associativity would probably help hit rate with the write-all-ways strategy, so for a given total cache capacity, direct-mapped might be the best choice if you insist on fully-non-blocking write-invalidate.) Even for a direct-mapped cache, the experimental evaluation given in the paper itself shows that write-invalidate has a higher miss rate compared to the other evaluated write policies. So it's a win only if the benefits of reducing latency and bandwidth demand outweigh the damage of a high miss rate.
Also, as I said, write-allocate is very good for CPUs, especially when it's set-associative so you're spending more resources trying to get a higher hit-rate. You could maybe still implement write-allocate by triggering a fetch on miss, remembering where in the line you stored the data, and merging that with the old copy of the line when it arrives.
You don't want to defeat that by blowing away lines that didn't need to die.
Also, write-invalidate implies write-through even for write hits, because it could lose data if a line is ever dirty. But write-back is also very good in modern L1d caches to insulate larger/slower caches from the write bandwidth. (Especially if there's no per-core private L2 to separately reduce total traffic to shared caches.) However, AMD Bulldozer-family did have a write-through L1d with a small 4k write-buffer between it and a write-back L2. This was generally considered a failed experiment or weak point of the design, and they dropped it in favour of a standard write-back write-allocate L1d for Zen. When use write-through cache policy for pages.
So in summary, write-invalidate is incompatible with several things that modern mainstream CPU designs have settled on as the best options:
write-allocate write policy
write-back (not write-through). https://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies
set-associative (huge downsides that can only be partially mitigated by way-prediction)
store buffer to decouple store miss from execution, and allow memory parallelism. (Not strictly incompatible, but a store buffer makes it pointless. Necessary for OoO exec and widely used for in-order)
Write-invalidate in cache-coherent SMP systems
You'd never consider using it in a single-chip multi-core CPU; spend more transistors on each core to get more of the low-hanging fruit before you start building more cores. e.g. a proper store buffer. Use some flavour of SMT if you want high throughput for multiple low-IPC threads that stall a lot.
But for multi-socket SMP, this could have made sense historically if you want to use multiple of the biggest single-core chip you can build, and that was still not big enough to just have a store buffer instead of this.
I guess it could even make sense to use a really "thin" direct-mapped write-through L1d in front of a private medium-sized write-back set-associative L2 that's still pretty fast. (Maybe call this an L0d cache because it can act like an unordered store buffer. The next-level cache will still see a lot of reads and writes from the low hit-rate of this small direct-mapped cache.)
Normally all caches (including L1d) are part of the same global coherency domain so writing into L1d cache can't happen until you have exclusive ownership. (Which you check for as part of the tag check.) But if this L1d / L0d is not like that, then it's not coherent and is more like a store buffer.
Of course, you need to queue the write-throughs for L2, and eventually stall when it can't keep up, so you're just adding complexity. The write-through to L2 mechanism would also need to deal with waiting for L2 to gain exclusive ownership of the line before writing (MESI Exclusive or Modified state). So this is very much just an unordered store buffer.
The case of writing to a line that hadn't made it to L2 yet is interesting: if it's an L0d write hit you effectively get store merging for free. You'd need per-word or per-byte needs-writeback bits (aka dirty bits) for this. Normally write-through would be sending along the write while the offset within line is still available, but if L2 isn't ready to accept it yet (e.g. because of a write miss) then you can't do that. This is morphing it into a write-combining buffer. Marking the whole line as needing write-back doesn't work because the unwritten parts are still invalid.
But if it's a write miss (same cache line, different tag bits) on a line that still hasn't finished write-back to L2, you have a big problem because you'd be invalidating a line that's still "dirty" (has the only copy of some older store data). You can't detect that before writing; the whole point is to write in parallel with checking tags.
It might be possible to still make this work: if the cache access is a read+write exchange that keeps the previous value in a one-word buffer (or whatever the max write size is), you still have all the data. Stall everything (including writeback of this line so you don't make wrong data globally visible in coherent L2 cache). Then exchange back, wait for the old state of that L0d line to actually write back to that address, then store the tmp buffer into L0d and update the tag and needs-writeback bits to reflect this store. So aliasing between nearby stores becomes extra costly and stalls the pipeline. Or maybe you can let non-memory instructions continue and only stall execution at the next load or store. (If you have the transistor budget to do much of that stall-avoidance, you can probably just use a completely different strategy, like having a store buffer and a normal L1d.)
To be usable (assuming you work around the dirty-store-miss problem), you'd need some way to track relative order of stores (and loads). If that's as simplistic as making sure every entry in the entire L0d has finished its write-through process before allowing another write, then even store-store barriers will be very expensive. The less order-tracking a CPU does, the more expensive barriers have to be (flush more stuff to make sure).
An excerpt from the Wikipedia article on cache invalidation -
"Cache invalidation is a process in a computer system whereby entries in a cache are replaced or removed." But why on earth do we need to invalidate the cache?
I can think of only one possible scenario -
If for some reason the cache and the database go out of sync, the data in the cache will be stale. To sync them, we will need to invalidate the cache. But the cache and DB going out of sync (except for the short period when the data has yet to be written to both) is not desirable behaviour. So cache invalidation acts as a remedy if we discover that the cache does not contain the correct data. Is this its sole purpose?
Cache invalidation exists because most caches operate based upon a trade-off of performance vs capacity.
Consider a solid state drive vs a hard drive. The performance of the SSD will be better but the amount of data you can store will be worse at the same cost level. Often people will combine them to get the performance of an SSD for frequently accessed files (such as the operating system), and a HDD for raw storage capacity.
CPUs are structured in a similar hierarchy, where the closest to the CPU is the fastest but also the smallest. The costs in this case are not necessarily just monetary cost but also physical space, power usage, heat production etc.
CPU registers - fastest, very small
CPU caches (also have their own hierarchy) - fast, small
RAM - medium, large
To keep the caches performing at their best, the most frequently accessed items must be maintained so that there is a better ratio of cache hits to misses. We want to be fetching from our slower sources as infrequently as possible. Similarly, because of the limited size constraint, we need to evict the items which are accessed least frequently.
Cache invalidation is the strategy we use to decide which items to evict, and when, in order to make space for newer items that have a higher likelihood of being required again. It is not applicable if your cache contains a full representation of some other data source.
There are plenty of reasons. Probably one of the most common ones: a cache is (often by nature) much smaller than the overall amount of data that needs to be stored.
In other words: if you just keep adding and adding elements to your cache, it becomes a full copy of your data. Or rather, you quickly run out of memory.
In other words: the nature of a cache is this: it is limited (somehow) in size. Thus, sooner or later you are facing a decision like: "I can't just add a new element to the cache, I have to make room first". And then you have to do exactly that: invalidate one of the entries in your cache so that there is room for that "newer" entry.
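For instance, in Java this capacity-driven eviction can be expressed with LinkedHashMap's removeEldestEntry hook (a standard-library idiom, shown only to illustrate the "make room first" step; the capacity of 100 is arbitrary):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A tiny bounded cache: once it holds more than MAX_ENTRIES items, the
// least-recently-accessed entry is dropped to make room for the new one.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final int MAX_ENTRIES = 100;

    public BoundedCache() {
        super(16, 0.75f, true);   // accessOrder = true -> LRU iteration order
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > MAX_ENTRIES;   // "invalidate" the oldest entry when full
    }
}
```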
And given the comment by the OP: invalidating a whole cache is often seen as similar to restarting your program, re-installing your app, or restarting your device. It is often used as a generic means of resetting the program/application to a known good state.
The following page from geeksforgeeks.org states that
Cache performance of chaining is not good as keys are stored using linked list. Open addressing provides better cache performance as everything is stored in same table.
So my questions are:
What causes chaining to have a bad cache performance?
Where is the cache being used?
Why would open addressing provide better cache performance as I cannot see how the cache comes into this?
Also, what considerations would you take into account when deciding between chaining, linear-probed open addressing, and quadratic-probed open addressing?
Sorry, since the questions are quite broad, the answers will also be quite generic, with some links to more detailed information.
It is better to start with the question:
Where is the cache being used?
On modern CPUs, cache is used everywhere: to read program instructions and read/write data in memory. On most CPUs cache is transparent, i.e. there is no need to explicitly manage the cache.
Cache is much faster than the main memory (DRAM). Just to give you some perspective, accessing data in the Level 1 cache takes ~4 CPU cycles, while accessing DRAM on the same CPU takes ~200 CPU cycles, i.e. roughly 50 times slower.
Caches operate on small blocks called cache lines, which are usually 64 bytes long.
More info: https://en.wikipedia.org/wiki/CPU_cache
What causes chaining to have a bad cache performance?
Basically, chaining is not cache friendly. This is not specific to hash tables; "classical" linked lists have the same issue.
Hash keys (or list nodes) are far away from each other, so each key access generates a "cache miss", i.e. slow DRAM access. So checking 10 keys in a chain takes 10 DRAM accesses, i.e. 200 x 10 = 2000 cycles for our generic CPU.
The address of the next key is not known until the next pointer is read from the current node, so there is not much room for optimization...
Why would open addressing provide better cache performance as I cannot see how the cache comes into this?
Linear probing is cache friendly. Keys are "clustered" together, so once we have accessed the first key (a slow DRAM access), the next key will most probably already be in the cache, since a cache line is 64 bytes. So accessing the same 10 keys with open addressing takes 1 DRAM access and 9 cache accesses, i.e. 200 x 1 + 9 x 4 = 236 cycles for our generic CPU. That is much faster than the 2000 cycles for chained keys.
Also, since we access memory in a predictable manner, there is room for optimizations like cache prefetching: https://en.wikipedia.org/wiki/Cache_prefetching
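To make the difference concrete, here is a deliberately simplified pair of lookups (my sketch; real tables handle resizing, deletion, and full buckets): the chained version follows pointers to separately allocated nodes, while the probing version walks consecutive slots of a single array.

```java
// Simplified lookup paths, just to show where the memory accesses go.

// Chaining: every node is a separate heap allocation, so following `next`
// typically lands on a different cache line (a likely cache miss per node).
final class ChainedTable {
    static final class Node {
        final int key;
        final int value;
        final Node next;
        Node(int key, int value, Node next) { this.key = key; this.value = value; this.next = next; }
    }

    final Node[] buckets = new Node[1024];

    void put(int key, int value) {
        int b = key & (buckets.length - 1);
        buckets[b] = new Node(key, value, buckets[b]);   // prepend a new node (no duplicate handling)
    }

    Integer get(int key) {
        for (Node n = buckets[key & (buckets.length - 1)]; n != null; n = n.next) {
            if (n.key == key) return n.value;            // one pointer chase per probed key
        }
        return null;
    }
}

// Open addressing with linear probing: the probed slots are adjacent array
// elements, so after the first access the next few probes usually hit the
// same (or the next) cache line. Assumes the table never fills up completely.
final class ProbedTable {
    final int[] keys = new int[1024];
    final int[] values = new int[1024];
    final boolean[] used = new boolean[1024];

    void put(int key, int value) {
        int i = key & (keys.length - 1);
        while (used[i] && keys[i] != key) {
            i = (i + 1) & (keys.length - 1);             // linear probe to the next slot
        }
        keys[i] = key;
        values[i] = value;
        used[i] = true;
    }

    Integer get(int key) {
        int i = key & (keys.length - 1);
        while (used[i]) {
            if (keys[i] == key) return values[i];        // sequential scan, cache friendly
            i = (i + 1) & (keys.length - 1);
        }
        return null;
    }
}
```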
Also, what considerations would you take into account when deciding between chaining, linear-probed open addressing, and quadratic-probed open addressing?
Long chains or probe sequences are not a good sign anyway. So the first thing I would consider is making sure the probability of collisions is at a minimum, by using a good hash function and a reasonable hash table size.
The second thing I would consider is a ready-to-use solution. Sure, there might still be some rare cases when you need your own implementation...
Not sure about the language, but here is a blazingly fast hash table implementation with a BSD license: http://dpdk.org/browse/dpdk/tree/lib/librte_hash/rte_cuckoo_hash.h
So, if you still need your own hash table implementation and you do care about performance, the next fairly easy thing to implement would be to use cache-aligned buckets instead of plain hash elements. It will waste a few bytes per element (i.e. each hash table element will be 64 bytes long), but in case of a collision there will be some fast storage for at least a few keys. The code to manage those buckets will also be a bit more complicated, so consider whether it is worth the effort for you...
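Java does not let you control alignment directly, but the bucket idea can be approximated with a flat array carved into fixed-size buckets, so colliding keys sit next to each other in memory (a rough sketch of the idea, not DPDK's design):

```java
import java.util.Arrays;

// Approximation of "cache-sized buckets": all keys for a bucket live in
// BUCKET_SLOTS consecutive ints of one flat array, so a collision scan stays
// within (roughly) one cache line instead of chasing pointers.
final class BucketizedIntSet {
    static final int BUCKET_SLOTS = 16;     // 16 ints = 64 bytes, one typical cache line
    static final int EMPTY = Integer.MIN_VALUE;   // sentinel; this sketch cannot store MIN_VALUE itself

    final int bucketCount;
    final int[] slots;                      // bucketCount * BUCKET_SLOTS entries

    BucketizedIntSet(int bucketCount) {
        this.bucketCount = bucketCount;
        this.slots = new int[bucketCount * BUCKET_SLOTS];
        Arrays.fill(slots, EMPTY);
    }

    boolean add(int key) {
        int base = Math.floorMod(Integer.hashCode(key), bucketCount) * BUCKET_SLOTS;
        for (int i = 0; i < BUCKET_SLOTS; i++) {
            if (slots[base + i] == key) return false;    // already present
            if (slots[base + i] == EMPTY) {
                slots[base + i] = key;
                return true;
            }
        }
        return false;   // bucket full; a real table would resize or overflow elsewhere
    }

    boolean contains(int key) {
        int base = Math.floorMod(Integer.hashCode(key), bucketCount) * BUCKET_SLOTS;
        for (int i = 0; i < BUCKET_SLOTS; i++) {
            if (slots[base + i] == key) return true;
            if (slots[base + i] == EMPTY) return false;  // end of this bucket's occupied slots
        }
        return false;
    }
}
```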
I'm talking about the LRU memory page replacement algorithm implemented in C, NOT in Java or C++.
According to the OS course notes:
OK, so how do we actually implement a LRU? Idea 1): mark everything we touch with a timestamp. Whenever we need to evict a page, we select the oldest page (= least-recently used). It turns out that this simple idea is not so good. Why? Because for every memory load, we would have to read contents of the clock and perform a memory store! So it is clear that keeping timestamps would make the computer at least twice as slow.
Memory load and store operations should be very fast. Is it really necessary to get rid of these tiny operations?
In the case of memory page replacement, the overhead of loading a page from disk should be a lot more significant than the memory operations. Why would we actually care about memory stores and loads?
If what the notes said isn't correct, then what is the real problem with implementing LRU with timestamp?
EDIT:
As I dig deeper, the reason I can think of is the following. These memory store and load operations happen when there is a page hit. In this case, we are not loading a page from disk, so the comparison is not valid.
Since the hit rate is expected to be very high, updating the data structure associated with the LRU will be very frequent. That's why we care about the operations repeated in the update process, e.g., memory loads and stores.
But still, I'm not convinced of how significant the overhead of these memory loads and stores is. There should be some measurements around. Can someone point me to them? Thanks!
Memory load and store operations can be quite fast, but in most real life cases the memory subsystem is slower - sometimes much slower - than the CPU's execution engine.
Rough numbers for memory access times:
L1 cache hit: 2-4 CPU cycles
L2 cache hit: 10-20 CPU cycles
L3 cache hit: 50 CPU cycles
Main memory access: 100-200 CPU cycles
So it costs real time to do loads and stores. With LRU, every regular memory access will also incur the cost of a memory store operation. This alone doubles the number of memory accesses the CPU does. In most situations this will slow the program execution. In addition, on a page eviction all the timestamps will need to be read. This will be quite slow.
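As a rough sketch of what that bookkeeping looks like (a software model with made-up names; the course notes are about doing this for every hardware memory access, which is where the "twice as slow" comes from):

```java
// Model of "timestamp everything we touch": the point is that *each* access
// performs an extra store (the timestamp), and eviction has to scan every
// page frame's timestamp to find the minimum.
public class TimestampLru {
    private final long[] lastUsed;     // one timestamp per resident page frame
    private long clock = 0;            // logical clock, incremented per access

    public TimestampLru(int frames) {
        this.lastUsed = new long[frames];
    }

    public void onAccess(int frame) {
        lastUsed[frame] = ++clock;     // the extra memory store on EVERY access
    }

    public int chooseVictim() {
        int victim = 0;
        for (int f = 1; f < lastUsed.length; f++) {   // O(frames) scan at eviction time
            if (lastUsed[f] < lastUsed[victim]) {
                victim = f;
            }
        }
        return victim;
    }
}
```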
In addition, reading and storing the timestamps constantly means they will be taking up space in the L1 or L2 caches. Space in these caches is limited, so your cache miss rate for other accesses will probably be higher, which will cost more time.
In short - LRU is quite expensive.
Since the cache inside the processor increases instruction execution speed, I'm wondering what would happen if we increased the cache size to many MBs, like 1 GB. Is that possible? And if it is, will increasing the cache size always result in increased performance?
There is a trade-off between cache size and hit rate on one side, and read latency and power consumption on the other. So the answer to your first question is: it is technically (probably) possible, but unlikely to make sense, since the L3 cache in modern CPUs, with a size of just a few MBs, already has a read latency of dozens of cycles.
Performance depends more on the memory access pattern than on cache size. More precisely, if the program is mainly sequential, cache size is not a big deal. If there are a lot of random accesses (e.g. when associative containers are actively used), cache size really matters.
The above is true for single computational tasks. In a multiprocess environment with several active processes, a bigger cache size is always better, because it decreases interprocess contention.
This is a simplification, but one of the primary reasons the cache increases 'speed' is that it provides fast memory very close to the processor, which is much faster to access than main memory. So, in theory, increasing the size of the cache should allow more information to be stored in this 'fast' memory and thereby improve performance. In the real world things are obviously much more complex than this, and there will of course be added complexity and cost associated with such a large cache, and with dealing with issues like cache coherency, caching algorithms, etc.
A cache stores data temporarily, so that data used frequently can be located quickly. If the cache size were increased to 1 GB or more, it would no longer behave like a cache; it would effectively become another RAM. Data is stored in RAM temporarily too, and if there were no cache, then whenever the processor asked for data, the RAM would take time to find and fetch it because of its much larger size of 4 GB or more. So we use the cache as a small, fast memory for the things we used recently or frequently. That way the RAM is not required to find and fetch the data for the processor every time: the processor accesses the data directly from the cache, and because the cache is small, finding the data there takes very little time. Let's take an example: we have a large classroom (RAM), and the principal (the processor) calls for the class representative, the CR (the data), for some purpose. Someone has to go to the classroom, find the CR among 1000 students, and bring him to the principal, which takes time. If we set aside a known spot (the cache) for the CR, because the principal calls for the CR most of the time, it becomes easy to find the CR quickly.