Assume we have 500 TB of key-value data and can use 2.5 TB of memory to cache these pairs for future requests. The requests are somehow random.
The probability of a cache hit would be 2.5/500 = 0.5%.
I know the hit rate may increase over time if we use LFU eviction, since the more frequently requested keys will tend to stay in the cache.
So, if the system can serve 10K QPS reading from storage, then using the cache would improve throughput by 0.5% (neglecting the memory access time).
The throughput would then be 10,050 QPS.
How effective is a cache in this case?
Should we go without a cache?
UPDATE
I think I made a mistake here. With a 100% hit rate, the throughput would be 1M QPS. With a 0% hit rate, it would be 10K QPS.
A 0.5% hit ratio (assuming a linear relation) yields
(0.5*(1M-10K)/100)+10K = 14950 QPS
That is roughly a 50% increase in throughput.
"Somehow random" is the key.
If the requests are truly random, the cache is unlikely to help. Your logic is correct. But in real systems, it turns out that many data stores have non-uniform, highly correlated access patterns.
This still holds for huge amounts of data. It doesn't matter how much data there is in total. It just matters how little is needed frequently.
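To put a rough number on "how little is needed frequently", here is a minimal sketch. It assumes key popularity follows a Zipf distribution with exponent 1 and that the cache ideally holds the most popular 0.5% of keys. Both are assumptions for illustration, not something stated in the question, and `n_keys` is a hypothetical, scaled-down key count.

```python
def ideal_hit_rate(n_keys, cached_keys, s=1.0):
    """Hit rate if the cache holds the `cached_keys` most popular keys
    and the i-th most popular key is requested with weight 1 / i**s."""
    weight_cached = sum(1.0 / i**s for i in range(1, cached_keys + 1))
    weight_total = sum(1.0 / i**s for i in range(1, n_keys + 1))
    return weight_cached / weight_total

n_keys = 1_000_000        # hypothetical, scaled-down key count
cached = n_keys // 200    # the cache holds 0.5% of the keys

uniform_rate = cached / n_keys                 # truly random requests
skewed_rate = ideal_hit_rate(n_keys, cached)   # assumed Zipf(1) popularity

print(f"uniform: {uniform_rate:.1%}, Zipf-like: {skewed_rate:.1%}")
```

Under these assumptions the same 0.5% of capacity serves roughly 63% of requests instead of 0.5%. A real workload may be more or less skewed, so measuring the actual key distribution is the only way to know.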
[edit]
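The update does not make sense. You're averaging speeds there, but you need to average the time per operation: each request takes either the cache time or the storage time, so it is the per-request times, weighted by the hit ratio, that add up.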
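A minimal sketch of averaging times instead of speeds, using the numbers from the question. It assumes requests are served one at a time, so that throughput is simply the inverse of the average time per request. The function name `effective_qps` is made up for this example.

```python
def effective_qps(hit_ratio, cache_qps, storage_qps):
    """Average the time per request, then invert to get throughput."""
    time_per_request = hit_ratio / cache_qps + (1.0 - hit_ratio) / storage_qps
    return 1.0 / time_per_request

qps = effective_qps(hit_ratio=0.005, cache_qps=1_000_000, storage_qps=10_000)
print(round(qps))  # 10050
```

So the original estimate of about 10,050 QPS was right, and the linear interpolation in the update overstates the gain by a wide margin.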
If I have two different cache subsystem designs, C1 and C2, with roughly the same hardware complexity, can I decide which one is the better choice if the effectiveness of the cache subsystem is the prime factor, i.e., the number of misses should be minimized?
Given the total miss rates below:
miss_rate = (number of cache misses)/(number of cache references)
miss rate of C1 = 0.77
miss rate of C2 = 0.73
Is the given miss rate information sufficient to decide which subsystem is better?
Yes, assuming hit latency is the same for both caches, actual miss rate on the workload you care about is the ultimate factor for that workload. It doesn't always generalize.
Differences in size, associativity, and eviction policy all matter because of their impact on miss rate on any given workload. Even cache line (block) size factors into this: a cache with twice as many 32-byte lines vs. a cache with half as many 64-byte lines would be able to cache more scattered words, but pull in less nearby data on a miss. (Unless you have hardware prefetching, but again prefetch algorithms ultimately just affect miss rate.)
If hit and miss latencies are fixed, then all misses are equal and you just want fewer of them.
Well, not just latency, but overall effect on the pipeline, if the CPU isn't a simple in-order design from the 1980s that simply stalls on a miss. Which is what homework usually assumes, because otherwise the miss cost depends on details of the context, making it impossible to calculate performance based on just instruction mix, hit/miss rate, and miss costs.
An out-of-order exec CPU can hide the latency of some misses better than others. (On the critical path of some larger dependency chain vs. not.) Even an in-order CPU that can scoreboard loads can get work done in the shadow of a cache miss load, up until it reaches an instruction that reads the load result. (And with a store buffer, can usually hide store miss latency.) So miss penalty can differ depending on which loads miss, whether it's one that software instruction scheduling was able to hide more vs. less of the latency for. (If the independent work after a load includes other loads, then you'd need a non-blocking cache that handles hit-under-miss. Miss-under-miss, for memory-level parallelism with multiple misses in flight, also helps, as does being able to get to a hit after 2 or more cache-miss loads.)
I think usually for most workloads with different cache geometries and sizes, there won't be significant bias towards more of the misses being easier to hide or not, so you could still say that miss-rate is the only thing that ultimately matters.
Miss-rate for a cache depends on workload, so you can't say that a lower miss rate on just one workload or trace makes it better on average or for all cases. e.g. an 8-way associative 16 KiB cache might have a higher hit rate than a 32 KiB 2-way cache on one workload (with a lot of conflict misses for the 2-way cache), but on a different workload where the working set is mostly one contiguous 24KiB array, the 32K 2-way cache might have a higher hit rate.
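The two workloads described above can be sketched with a tiny simulator. This is an illustrative model, not any real CPU: it assumes 64-byte lines, true LRU replacement within each set, and addresses used directly as cache indices. The class name `SetAssocCache` and both traces are made up for the example.

```python
from collections import OrderedDict

class SetAssocCache:
    """Set-associative cache with LRU replacement inside each set."""

    def __init__(self, size_bytes, ways, line_bytes=64):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        entries = self.sets[line % self.num_sets]
        if line in entries:
            entries.move_to_end(line)          # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(entries) >= self.ways:
                entries.popitem(last=False)    # evict the least recently used line
            entries[line] = True

    def miss_rate(self):
        return self.misses / (self.hits + self.misses)

def run(cache, trace):
    for addr in trace:
        cache.access(addr)
    return cache.miss_rate()

KIB = 1024
# Workload A: four lines spaced 16 KiB apart, touched round-robin (conflict-heavy).
conflict_trace = [i * 16 * KIB for i in range(4)] * 100
# Workload B: ten sweeps over one contiguous 24 KiB array.
array_trace = list(range(0, 24 * KIB, 64)) * 10

results = {}
for name, trace in [("conflict", conflict_trace), ("array", array_trace)]:
    small_8way = SetAssocCache(16 * KIB, ways=8)
    big_2way = SetAssocCache(32 * KIB, ways=2)
    results[name] = (run(small_8way, trace), run(big_2way, trace))
    print(name, results[name])
```

On the conflict-heavy trace the 16 KiB 8-way cache misses 1% of the time while the 32 KiB 2-way cache misses on every access. On the array sweep the ranking flips: 100% misses for the 16 KiB cache versus 10% for the 32 KiB one.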
The term "better" is subjective, for the following reasons:
Hardware cost, in terms of silicon real-estate, meaning that a larger chip is more expensive to produce and thus costs more per chip. (A larger cache may not even fit on the chip in question.)
Hardware cost, in terms of silicon process technology, meaning that a faster cache requires a more advanced chip process, so will increase costs per chip.
A miss rate on a given cache is workload specific (e.g. application specific or algorithm specific). Thus, two different workloads may have different miss rates on each of the caches in question. So, "better" here may mean across an average workload (or an average across several different workloads), but there's a lot of room for variability.
We would have to know the performance of the caches upon hit, and also upon miss — as a more complex cache with a higher hit rate might have longer timings.
In summary, in order to say that lower miss rate is better, we would have to know that all the other factors are equal. Otherwise, the notion of better needs to be defined, perhaps to include cost/benefit definition.
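One common way to fold hit time and miss penalty into a single figure is average memory access time (AMAT = hit time + miss rate × miss penalty). The sketch below uses the miss rates from the question; the hit times and the 100-cycle miss penalty are invented purely for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty

# Same hit time for both caches: the lower miss rate wins.
c1 = amat(hit_time=2, miss_rate=0.77, miss_penalty=100)
c2 = amat(hit_time=2, miss_rate=0.73, miss_penalty=100)

# Hypothetical: C2 pays for its lower miss rate with a slower hit.
c2_slow_hit = amat(hit_time=8, miss_rate=0.73, miss_penalty=100)

print(c1, c2, c2_slow_hit)
```

With equal hit times C2 comes out ahead, but with the assumed slower hit it ends up worse than C1 despite missing less often. That is why the miss rate alone only settles the question when everything else is equal.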
I'm talking about the LRU memory page replacement algorithm implemented in C, NOT in Java or C++.
According to the OS course notes:
OK, so how do we actually implement an LRU? Idea 1): mark everything we touch with a timestamp. Whenever we need to evict a page, we select the oldest page (=least-recently used). It turns out that this simple idea is not so good. Why? Because for every memory load, we would have to read contents of the clock and perform a memory store! So it is clear that keeping timestamps would make the computer at least twice as slow.
Memory load and store operations should be very fast. Is it really necessary to get rid of these tiny operations?
In the case of page replacement, the overhead of loading a page from disk should be far more significant than the memory operations. Why would we actually care about memory stores and loads?
If what the notes say isn't correct, then what is the real problem with implementing LRU with timestamps?
EDIT:
As I dig deeper, the reason I can think of is the following. These memory store and load operations happen on a page hit. In that case we are not loading a page from disk, so the comparison is not valid.
Since the hit rate is expected to be very high, the data structure associated with LRU has to be updated very frequently. That's why we care about the operations repeated in the update process, e.g., memory loads and stores.
But I'm still not convinced how significant the overhead of memory loads and stores is. There should be some measurements around. Can someone point me to them? Thanks!
Memory load and store operations can be quite fast, but in most real life cases the memory subsystem is slower - sometimes much slower - than the CPU's execution engine.
Rough numbers for memory access times:
L1 cache hit: 2-4 CPU cycles
L2 cache hit: 10-20 CPU cycles
L3 cache hit: 50 CPU cycles
Main memory access: 100-200 CPU cycles
So it costs real time to do loads and stores. With timestamp-based LRU, every regular memory access also incurs the cost of a memory store for the timestamp. This alone doubles the number of memory accesses the CPU does. In most situations this will slow program execution. In addition, on a page eviction all the timestamps need to be read to find the oldest one. This will be quite slow.
In addition, reading and storing the timestamps constantly means they will be taking up space in the L1 or L2 caches. Space in these caches is limited, so your cache miss rate for other accesses will probably be higher, which will cost more time.
In short, timestamp-based LRU is quite expensive.
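To make the two costs concrete, here is a small sketch of timestamp-based LRU that counts its own bookkeeping. It is written in Python for brevity and models only the logic, not the hardware. The class name and the counters `timestamp_writes` and `timestamps_scanned` are hypothetical, and a logical counter stands in for the real clock.

```python
class TimestampLRU:
    """Exact LRU over a fixed number of page frames, using per-page timestamps."""

    def __init__(self, frames):
        self.frames = frames
        self.clock = 0                # logical clock, bumped on every access
        self.last_used = {}           # resident page -> timestamp of last access
        self.timestamp_writes = 0     # extra stores caused by the bookkeeping
        self.timestamps_scanned = 0   # extra loads caused by evictions
        self.faults = 0

    def access(self, page):
        self.clock += 1
        if page not in self.last_used:
            self.faults += 1
            if len(self.last_used) >= self.frames:
                self._evict()
        # One extra store on every access, whether it hits or misses.
        self.last_used[page] = self.clock
        self.timestamp_writes += 1

    def _evict(self):
        # Every resident page's timestamp has to be read to find the oldest.
        self.timestamps_scanned += len(self.last_used)
        victim = min(self.last_used, key=self.last_used.get)
        del self.last_used[victim]
        return victim

lru = TimestampLRU(frames=3)
for page in [1, 2, 3, 1, 4]:
    lru.access(page)

print(sorted(lru.last_used), lru.timestamp_writes, lru.timestamps_scanned)
```

Five accesses cause five timestamp stores, and the single eviction (of page 2) reads all three resident timestamps. On a real machine the store happens on every load and store instruction, which is why practical systems fall back on approximations such as a hardware-set reference bit.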
I am learning caching and have a question on the concurrency of cache.
As far as I know, an LRU cache is implemented with a doubly linked list plus a hash table. How, then, does an LRU cache handle highly concurrent access? Note that both getting data from the cache and putting data into it update the linked list and the hash table, so the cache is modified all the time.
If we use a mutex for thread safety, won't the cache slow down when it is accessed by a large number of users? If we don't use a lock, what techniques are used? Thanks in advance.
Traditional LRU caches are not designed for high concurrency, both because hardware parallelism used to be limited and because the hit penalty is far smaller than the miss penalty (e.g. a database lookup). For most applications, locking the cache is acceptable if the lock is only used to update the underlying structure (not to compute the value on a miss). Simple techniques like segmenting the LRU policy were usually good enough when the locks became contended.
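A minimal sketch of the segmenting idea: split the cache into independent LRU segments, each with its own lock, so that threads touching different segments do not contend. The class name `SegmentedLRU` and its parameters are made up for illustration. Note that recency is tracked per segment, so eviction only approximates a global LRU order.

```python
import threading
from collections import OrderedDict

class SegmentedLRU:
    """LRU cache split into independently locked segments."""

    def __init__(self, capacity, segments=16):
        self.per_segment = max(1, capacity // segments)
        self.segments = [(threading.Lock(), OrderedDict()) for _ in range(segments)]

    def _segment(self, key):
        # Keys are spread over segments by hash; each segment is its own small LRU.
        return self.segments[hash(key) % len(self.segments)]

    def get(self, key, default=None):
        lock, entries = self._segment(key)
        with lock:
            if key not in entries:
                return default
            entries.move_to_end(key)           # mark as most recently used
            return entries[key]

    def put(self, key, value):
        lock, entries = self._segment(key)
        with lock:
            entries[key] = value
            entries.move_to_end(key)
            if len(entries) > self.per_segment:
                entries.popitem(last=False)    # evict this segment's LRU entry

cache = SegmentedLRU(capacity=2, segments=1)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" becomes most recently used
cache.put("c", 3)     # evicts "b"
```

Only the structure update happens under the lock; computing a missing value (for example, the database lookup) would be done by the caller outside it.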
The way to make an LRU cache scale is to avoid updating the policy on every access. The critical observation to make is that the user of the cache does not care what the current LRU ordering is. The only concern of the caller is that the cache maintains a threshold size and a high hit rate. This opens the door for optimizations by avoiding mutating the LRU policy on every read.
The approach taken by memcached is to discard subsequent reads within a time window, e.g. 1 second. The cache is expected to be very large so there is a very low chance of evicting a poor candidate by this simpler LRU.
The approach taken by ConcurrentLinkedHashMap (CLHM), and subsequently Guava's Cache, is to record the access in a buffer. This buffer is drained under the LRU's lock and by using a try-lock no other operation has to be blocked. CLHM uses multiple ring buffers that are lossy if the cache cannot keep up, as losing events is preferred to degraded performance.
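The buffering idea can be sketched roughly as follows. This is not CLHM's or Guava's actual code: it uses a single bounded `deque` as the lossy buffer (which drops its oldest events when full) instead of multiple ring buffers, and it leans on CPython's global interpreter lock for the unlocked dictionary read. All names are hypothetical.

```python
import threading
from collections import OrderedDict, deque

_MISSING = object()

class BufferedLRU:
    """LRU cache whose reads are recorded in a buffer and replayed later."""

    def __init__(self, capacity, buffer_size=64):
        self.capacity = capacity
        self.data = {}                            # key -> value, read without the policy lock
        self.order = OrderedDict()                # LRU order, guarded by policy_lock
        self.reads = deque(maxlen=buffer_size)    # lossy: old events drop when full
        self.policy_lock = threading.Lock()

    def get(self, key, default=None):
        value = self.data.get(key, _MISSING)
        if value is _MISSING:
            return default
        self.reads.append(key)                    # record the access, do not reorder yet
        # Try-lock: if another thread is already draining, just move on.
        if self.policy_lock.acquire(blocking=False):
            try:
                self._drain()
            finally:
                self.policy_lock.release()
        return value

    def _drain(self):
        # Caller must hold policy_lock. Replays buffered reads onto the LRU order.
        while True:
            try:
                key = self.reads.popleft()
            except IndexError:
                break
            if key in self.order:
                self.order.move_to_end(key)

    def put(self, key, value):
        with self.policy_lock:
            self._drain()
            self.data[key] = value
            self.order[key] = None
            self.order.move_to_end(key)
            while len(self.order) > self.capacity:
                victim, _ = self.order.popitem(last=False)
                self.data.pop(victim, None)

cache = BufferedLRU(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # buffered, then replayed: "a" is now most recently used
cache.put("c", 3)     # evicts "b"
```

A reader never waits for the policy lock: it either drains the buffer itself or leaves the work to whoever holds the lock, at the cost of an LRU order that can lag slightly behind the true access order.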
The approach taken by Ehcache and Redis is a probabilistic LRU policy. A read updates the entry's timestamp and a write iterates the cache to obtain a random sample. The oldest entry is evicted from that sample. If the sample is fast to construct and the cache is large, the evicted entry was likely a good candidate.
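A rough sketch of sampled eviction, in the spirit of the description above but not taken from Ehcache or Redis. The class name, the `sample_size` default, and the logical clock are assumptions made for the example.

```python
import itertools
import random

class SampledLRU:
    """Approximate LRU: evict the oldest entry from a small random sample."""

    def __init__(self, capacity, sample_size=5, rng=None):
        self.capacity = capacity
        self.sample_size = sample_size
        self.rng = rng or random.Random()
        self.values = {}
        self.last_access = {}
        self.ticks = itertools.count()    # logical clock instead of wall time

    def get(self, key, default=None):
        if key not in self.values:
            return default
        self.last_access[key] = next(self.ticks)   # a read only updates a timestamp
        return self.values[key]

    def put(self, key, value):
        if key not in self.values and len(self.values) >= self.capacity:
            self._evict_one()
        self.values[key] = value
        self.last_access[key] = next(self.ticks)

    def _evict_one(self):
        # Copying the keys is O(n) and only acceptable in a sketch; a real
        # implementation would sample without materializing the whole key set.
        k = min(self.sample_size, len(self.values))
        sample = self.rng.sample(list(self.values), k)
        victim = min(sample, key=self.last_access.__getitem__)
        del self.values[victim]
        del self.last_access[victim]

# With a sample as large as the cache, the policy degenerates to exact LRU.
cache = SampledLRU(capacity=3, sample_size=3, rng=random.Random(0))
for key in "abc":
    cache.put(key, key.upper())
cache.get("a")
cache.put("d", "D")   # evicts "b", the least recently used entry
```

Because reads touch only one timestamp, there is no shared list to reorder. With a small sample on a large cache, the evicted entry is not guaranteed to be the globally oldest, only the oldest among those sampled.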
There are probably other techniques and, of course, pseudo LRU policies (like CLOCK) that offer better concurrency at lower hit rates.
Since the cache inside the processor increases instruction execution speed, I'm wondering what would happen if we increased the cache size a lot, say to 1 GB. Is that possible? If it is, will increasing the cache size always result in increased performance?
There is a tradeoff between cache size and hit rate on one side and read latency and power consumption on the other. So the answer to your first question is: technically (probably) possible, but unlikely to make sense, since the L3 cache in modern CPUs, at just a few MBs, already has a read latency of dozens of cycles.
Performance depends more on the memory access pattern than on cache size. More precisely, if the program's accesses are mainly sequential, cache size is not a big deal. If there is a lot of random access (e.g. when associative containers are heavily used), cache size really matters.
The above is true for a single computational task. In a multiprocess environment with several active processes, a bigger cache is generally better, because it reduces contention between processes.
This is a simplification, but one of the primary reasons the cache increases 'speed' is that it provides a fast memory very close to the processor, which is much faster to access than main memory. So, in theory, increasing the size of the cache should allow more information to be stored in this 'fast' memory, and thereby improve performance. In the real world things are obviously much more complex than this, and there will of course be added complexity and cost associated with such a large cache, and with dealing with issues like cache coherency, caching algorithms, etc.
A cache stores data temporarily and is used to quickly locate data that is accessed frequently. If the cache grew to 1 GB or more, it would no longer behave like a cache; it would effectively become RAM, which also stores data temporarily. Without a cache, every time the processor requests data, RAM takes time to find and fetch it because of its large size (4 GB or more). So we use the cache as temporary memory for the things we used recently or frequently. That way RAM does not have to find and fetch the data: the processor accesses it directly in the cache, and because the cache is small, the lookup takes very little time.
Let's take an example. We have a large classroom (RAM), and the principal (processor) calls for the class's CR (data) for some purpose. Someone has to go to the classroom, find the CR among 1000 students, and bring him to the principal. That takes time. If we reserve a dedicated spot (cache) for the CR in the class, because the principal mostly calls for the CR, then the CR becomes easy to find.
I know that cache memory stores frequently used data to speed up execution instead of fetching it from the slower main memory every time. It is always small compared with main memory, both because the technology is expensive and because the data actually being processed at any one time is much smaller than everything held in main memory.
But are there any limitations or constraints on cache size for a given CPU speed or a given main memory size? Theoretically, if we increased the cache size a lot, would that have the opposite effect, or would it just be a wasted increase?
Indeed, the performance gain becomes less and less significant beyond 64 KB of cache.
A graph on Wikipedia shows that, regardless of the set-associativity scheme, the miss rate decreases only slightly as the cache size increases past 64 KB.
Caches are small because the silicon used to build them is quite expensive and, especially on CISC-type CPUs, there might not be enough space on the chip to hold them. Making chips bigger also has its cost, and there's the possibility that the chip won't fit in its socket, which adds many more issues. It's not that simple ;)
EDIT:
Well, I haven't got any papers about this, but I'll explain my opinion anyway with a simple question: if a program needs x bytes of memory, what difference would it make whether the cache's size is 10 * x bytes or 100 * x? Once all the data is loaded into the cache (which doesn't depend on its size at all), the difference is all in the cache's access speed. And given locality of reference, it isn't necessary to have everything in the cache.
Also, bigger caches require better algorithms for finding the requested data. For example, accessing data in a fully associative cache gets slower as the cache grows (there are more and more places to look for the data) and could eventually become slower than accessing main memory. Multitasking systems introduce other issues that I don't know much about.
To conclude, the performance gain from increasing cache size shrinks as the size approaches the amount of data typically used by all the software running on a given machine.
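The "10 * x versus 100 * x" argument can be checked with a toy experiment. The sketch below assumes a fully associative LRU cache and a program that loops repeatedly over a fixed working set of 1,000 lines. Both are simplifications chosen for illustration, and the function name `lru_hit_rate` is made up.

```python
from collections import OrderedDict

def lru_hit_rate(cache_lines, trace):
    """Hit rate of a fully associative LRU cache holding `cache_lines` lines."""
    cache = OrderedDict()
    hits = 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)          # mark as most recently used
        else:
            if len(cache) >= cache_lines:
                cache.popitem(last=False)    # evict the least recently used line
            cache[line] = True
    return hits / len(trace)

working_set = 1_000
trace = list(range(working_set)) * 20        # 20 passes over the working set

rates = {size: lru_hit_rate(size, trace)
         for size in (500, 1_000, 10_000, 100_000)}
print(rates)
```

Once the cache holds the whole working set, the hit rate settles at 95% (only the first pass misses), and making the cache 10 or 100 times larger changes nothing. The cache half the size of the working set gets no hits at all on this cyclic pattern, which shows how sharply things change around the working-set size.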