I learned that due to computational overhead, true LRU is not implemented in virtual memory systems. So, why is the LRU algorithm feasible in a file cache?
I think the reason may be the time field in an inode. Is that correct?
It's about speed.
Virtual Memory status bits must be updated in nanoseconds, so hardware support is needed, and status information for LRU is expensive to implement in hardware. E.g. the clock algorithm is designed to approximate LRU with less expensive hardware support.
File system operations are on the order of milliseconds. A CPU can do LRU in software in a very small fraction of this time. Milliseconds are so "slow" from the CPU's point of view (hundreds of thousands of instructions) that preventing even a small number of cache misses produces a big payoff.
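As a rough illustration of how cheap this is in software, here is a minimal sketch of an LRU list update for a buffer/file cache (the struct and function names are made up for illustration, not any real kernel's API). A hit costs a handful of pointer writes, which is nothing next to a millisecond-scale disk access:

```c
#include <stddef.h>

/* Hypothetical cached file block; field names are illustrative only. */
struct buf {
    struct buf *lru_prev, *lru_next;   /* doubly linked LRU list */
    /* ... block number, data pointer, dirty flag, etc. ... */
};

/* Sentinel node: head.lru_next is most recently used, head.lru_prev is LRU. */
static struct buf lru_head = { &lru_head, &lru_head };

/* Called on every cache hit: a few pointer writes, trivial next to the
 * milliseconds a miss (disk read) would cost. */
static void lru_touch(struct buf *b)
{
    /* unlink from current position */
    b->lru_prev->lru_next = b->lru_next;
    b->lru_next->lru_prev = b->lru_prev;
    /* insert at head (most recently used) */
    b->lru_next = lru_head.lru_next;
    b->lru_prev = &lru_head;
    lru_head.lru_next->lru_prev = b;
    lru_head.lru_next = b;
}

/* Called when a block must be reclaimed: take the tail (least recently used). */
static struct buf *lru_evict(void)
{
    struct buf *victim = lru_head.lru_prev;
    if (victim == &lru_head)
        return NULL;                   /* list is empty */
    victim->lru_prev->lru_next = &lru_head;
    lru_head.lru_prev = victim->lru_prev;
    return victim;
}
```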
Related
If I have two different cache subsystem designs C1 and C2 that both have roughly the same hardware complexity, can I decide which one is the better choice, given that the effectiveness of the cache subsystem is the prime factor, i.e., the number of misses should be minimized?
Given the total miss rates below:
miss_rate = (number of cache misses) / (number of cache references)
miss rate of C1 = 0.77
miss rate of C2 = 0.73
Is the given miss-rate information sufficient to decide which subsystem is better?
Yes, assuming hit latency is the same for both caches, actual miss rate on the workload you care about is the ultimate factor for that workload. It doesn't always generalize.
All differences in size, associativity, and eviction policy matter because of their impact on miss rate on any given workload. Even cache line (block) size factors into this: a cache with twice as many 32-byte lines vs. a cache with half as many 64-byte lines would be able to cache more scattered words, but pull in less nearby data on a miss. (Unless you have hardware prefetching, but again, prefetch algorithms ultimately just affect miss rate.)
If hit and miss latencies are fixed, then all misses are equal and you just want fewer of them.
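One common way to put that into numbers is average memory access time, AMAT = hit_time + miss_rate * miss_penalty. A small sketch using the miss rates from the question and assumed, purely illustrative latencies:

```c
#include <stdio.h>

/* Average memory access time: hit_time + miss_rate * miss_penalty.
 * The latency numbers in main() are assumptions for illustration only. */
static double amat(double hit_cycles, double miss_rate, double miss_penalty_cycles)
{
    return hit_cycles + miss_rate * miss_penalty_cycles;
}

int main(void)
{
    const double hit = 2.0, penalty = 100.0;   /* assumed, same for both caches */
    printf("C1: %.1f cycles/access\n", amat(hit, 0.77, penalty));  /* 79.0 */
    printf("C2: %.1f cycles/access\n", amat(hit, 0.73, penalty));  /* 75.0 */
    return 0;
}
```

With identical hit and miss costs, the lower miss rate wins directly; the rest of this discussion is about when those assumptions don't hold.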
Well, not just latency, but the overall effect on the pipeline, if the CPU isn't a simple in-order design from the 1980s that simply stalls on a miss. That is what homework usually assumes, because otherwise the miss cost depends on details of the context, making it impossible to calculate performance from just instruction mix, hit/miss rate, and miss costs.
An out-of-order exec CPU can hide the latency of some misses better than others (ones on the critical path of some larger dependency chain vs. not). Even an in-order CPU that can scoreboard loads can get work done in the shadow of a cache-miss load, up until it reaches an instruction that reads the load result. (And with a store buffer, it can usually hide store-miss latency.) So the miss penalty can differ depending on which loads miss, and whether software instruction scheduling was able to hide more vs. less of the latency for each one. (If the independent work after a load includes other loads, then you'd need a non-blocking cache that handles hit-under-miss. Miss-under-miss, for memory-level parallelism from multiple misses in flight, also helps, as does being able to get to a hit after 2 or more cache-miss loads.)
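To make the "some misses are easier to hide" point concrete, here is a hedged sketch (the struct and function names are made up for illustration). Both loops can miss on every element, but in the first one each miss serializes behind the previous load, while in the second an out-of-order core with a non-blocking cache can keep several misses in flight:

```c
#include <stddef.h>

struct node { struct node *next; long payload; };

/* Dependent misses: the address of each load comes from the previous load,
 * so misses serialize and each one sits on the critical path. */
long sum_list(struct node *p)
{
    long total = 0;
    while (p) {
        total += p->payload;
        p = p->next;          /* next address unknown until this load returns */
    }
    return total;
}

/* Independent misses: addresses are known in advance, so a non-blocking
 * cache can overlap several outstanding misses (memory-level parallelism). */
long sum_array(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];        /* a[i+1], a[i+2], ... can miss in parallel */
    return total;
}
```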
I think usually for most workloads with different cache geometries and sizes, there won't be significant bias towards more of the misses being easier to hide or not, so you could still say that miss-rate is the only thing that ultimately matters.
Miss rate for a cache depends on workload, so you can't say that a lower miss rate on just one workload or trace makes it better on average or for all cases. E.g. an 8-way associative 16 KiB cache might have a higher hit rate than a 32 KiB 2-way cache on one workload (with a lot of conflict misses for the 2-way cache), but on a different workload where the working set is mostly one contiguous 24 KiB array, the 32 KiB 2-way cache might have a higher hit rate.
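To make the geometry comparison concrete, here is a sketch of how the set index is derived for those two hypothetical caches, which is what decides whether hot addresses conflict. The 64-byte line size and the two addresses are assumptions for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Set index for a set-associative cache: sets = size / (line_size * ways). */
static unsigned set_index(uint64_t addr, unsigned size, unsigned ways, unsigned line)
{
    unsigned sets = size / (line * ways);
    return (unsigned)((addr / line) % sets);
}

int main(void)
{
    uint64_t a = 0x10000, b = 0x14000;   /* arbitrary addresses, 16 KiB apart */

    /* 16 KiB, 8-way, 64-byte lines: 32 sets. Both addresses map to set 0,
       but 8 ways can hold up to 8 such conflicting lines. */
    printf("16K/8-way: set %u vs set %u\n",
           set_index(a, 16 * 1024, 8, 64), set_index(b, 16 * 1024, 8, 64));

    /* 32 KiB, 2-way, 64-byte lines: 256 sets. These two also share a set,
       and only 2 ways are available, so a third conflicting line evicts one. */
    printf("32K/2-way: set %u vs set %u\n",
           set_index(a, 32 * 1024, 2, 64), set_index(b, 32 * 1024, 2, 64));
    return 0;
}
```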
The term "better" is subjective as follows:
Hardware cost, in terms of silicon real-estate, meaning that a larger chip is more expensive to produce and thus costs more per chip. (A larger cache may not even fit on the chip in question.)
Hardware cost, in terms of silicon process technology, meaning that a faster cache requires a more advanced chip process, so will increase costs per chip.
A miss rate on a given cache is workload specific (e.g. application specific or algorithm specific). Thus, two different workloads may have different miss rates on each of the caches in question. So, "better" here may mean across an average workload (or an average across several different workloads), but there's a lot of room for variability.
We would have to know the performance of the caches upon hit, and also upon miss — as a more complex cache with a higher hit rate might have longer timings.
In summary, in order to say that lower miss rate is better, we would have to know that all the other factors are equal. Otherwise, the notion of better needs to be defined, perhaps to include cost/benefit definition.
I've been reading benchmarks that test the benefits of systems with Multiple Memory Channel Architectures. The general conclusion of most of these benchmarks is that the performance benefits of systems with greater numbers of memory channels over those systems with fewer channels are negligible.
However, nowhere have I found an explanation of why this is the case, just benchmark results indicating that this is the real-world performance attained.
The theory is that every doubling of the system's memory channels doubles the bandwidth of memory access, so in theory there should be a performance gain; however, in real-world applications the gains are negligible. Why?
My postulation is that when the NT Kernel allocates physical memory, it is not distributing the allocations evenly across the memory channels. If all of a process's virtual memory is mapped to a single memory channel within an MMC system, then the process will effectively only be able to attain the performance of having a single memory channel at its disposal. Is this the reason for the negligible real-world performance gains?
Naturally a process is allocated virtual memory and the kernel allocates physical memory pages, so is this negligible performance gain the fault of the NT Kernel not distributing allocations across the available channels?
Related: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? Two memory controllers are sufficient for single-threaded memory bandwidth. Only if you have multiple threads / processes that all miss in cache a lot do you start to benefit from the extra memory controllers in a big Xeon.
(e.g. your example from comments of running many independent image-processing tasks on different images in parallel might do it, depending on the task.)
Going from two down to one DDR4 channel could hurt even a single-threaded program on a quad-core if it was bottlenecked on DRAM bandwidth a lot of the time, but one important part of tuning for performance is to optimize for data reuse so you get at least L3 cache hits.
Matrix multiplication is a classic example: instead of looping over rows / columns of the whole matrix N^2 times (which is too big to fit in cache) (one row x column dot product for each output element), you break the work up into "tiles" and compute partial results, so you're looping repeatedly over a tile of the matrix that stays hot in L1d or L2 cache. (And you hopefully bottleneck on FP ALU throughput, running FMA instructions, not memory at all, because matmul takes O(N^3) multiply+add operations over N^2 elements for a square matrix.) These optimizations are called "loop tiling" or "cache blocking".
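As a rough sketch of what that loop tiling / cache blocking looks like for the matmul example (TILE is only an assumed tuning parameter, and the caller is assumed to have zeroed C):

```c
/* Cache-blocked (tiled) matrix multiply, C = A * B for n x n row-major
 * matrices. The caller is assumed to have zeroed C. TILE is an assumed
 * tuning parameter: pick it so a few TILE x TILE tiles fit in L1d or L2. */
#define TILE 64

void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                /* Work repeatedly within one block so it stays hot in cache. */
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```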
So well-optimized code that touches a lot of memory can often get enough work done as it's looping that it doesn't actually bottleneck on DRAM bandwidth (L3 cache misses) most of the time.
If a single channel of DRAM is enough to keep up with hardware prefetch requests for how quickly/slowly the code is actually touching new memory, there won't be any measurable slowdown from memory bandwidth. (Of course that's not always possible, and sometimes you do loop over a big array doing not very much work, or even just copying it, but if that only makes up a small fraction of the total run time then it's still not significant.)
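A back-of-the-envelope check of that "can one channel keep up" question, assuming a round ~25.6 GB/s peak for one DDR4-3200 channel (an illustrative figure, not a measurement of any particular system), and made-up per-iteration numbers:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed, illustrative numbers: one DDR4-3200 channel ~ 25.6 GB/s peak. */
    const double channel_gbps   = 25.6;
    /* Suppose the loop streams 8 bytes of new data per ~4 ns of useful work. */
    const double bytes_per_iter = 8.0;
    const double ns_per_iter    = 4.0;

    double demanded_gbps = bytes_per_iter / ns_per_iter;   /* bytes/ns == GB/s */
    printf("demand %.1f GB/s vs one channel %.1f GB/s -> %s\n",
           demanded_gbps, channel_gbps,
           demanded_gbps < channel_gbps ? "not DRAM-bandwidth bound"
                                        : "DRAM-bandwidth bound");
    return 0;
}
```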
The theory is that every doubling of the system's memory channels doubles the bandwidth of memory access, so in theory there should be a performance gain; however, in real-world applications the gains are negligible. Why?
Think of it as a hierarchy, like "CPU <-> L1 cache <-> L2 cache <-> L3 cache <-> RAM <-> swap space". RAM bandwidth only matters when L3 cache wasn't big enough (and swap space bandwidth only matters if RAM wasn't big enough, and ...).
For most (not all) real world applications, the cache is big enough, so RAM bandwidth isn't important and the gains (of multi-channel) are negligible.
My postulation is that when the NT Kernel allocates physical memory it is not distributing the allocations evenly across the memory channels.
It doesn't work like that. The CPU mostly only works with whole cache lines (e.g. 64 byte pieces); and with one channel the entire cache line comes from one channel; and with 2 channels half of a cache line comes from one channel and the other half comes from a different channel. There is almost nothing that any software can do that will make any difference. The NT kernel only works with whole pages (e.g. 4 KiB pieces), so whatever the kernel does is even less likely to matter (until you start thinking about NUMA optimizations, which is a completely different thing).
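A toy model of that interleaving, just to illustrate why page-granularity decisions by the kernel can't unbalance the channels. The 32-byte interleave granularity (half of a 64-byte line per channel, as described above) and the addresses are assumptions for illustration; real controllers hash several address bits:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model: physical addresses interleave across channels in small aligned
 * chunks (here 32 bytes). The granularity is far below a 4 KiB page. */
static unsigned channel_of(uint64_t paddr, unsigned channels, unsigned chunk)
{
    return (unsigned)((paddr / chunk) % channels);
}

int main(void)
{
    /* Count how a single 4 KiB page splits across 2 channels. */
    unsigned counts[2] = { 0, 0 };
    for (uint64_t off = 0; off < 4096; off += 32)
        counts[channel_of(0x200000 + off, 2, 32)]++;
    /* Prints 64 and 64: the page is spread evenly no matter which physical
     * page the kernel happened to pick. */
    printf("channel 0: %u chunks, channel 1: %u chunks\n", counts[0], counts[1]);
    return 0;
}
```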
What are the disadvantages of using larger cache memories? Could we just use a cache memory large enough that secondary memory wouldn't be needed at all? I understand that the most compelling arguments are related to its cost and the problem of its size. But if we assume that creating such a cache memory is possible, would it encounter any additional problems?
There would be many problems even if it were not expensive:
Size will degrade performance
A cache is fast because it's very small compared to main memory, and hence it requires only a small amount of time to search it. If you build a large cache, it will not be able to perform at the same speed as its smaller counterpart.
Larger die area
Most DRAM chips require only a capacitor and a transistor to store a bit. SRAM, on the other hand, requires at least 6 transistors to make a single memory cell, which takes up more area.
High power requirements
Because of the extra transistors, SRAM requires more power to operate, which in turn generates more heat, so you will have to handle the cooling problem.
So as you can see, it's not worth the effort, given that today's computers already achieve a 90% hit ratio most of the time.
I'm talking about the LRU memory page replacement algorithm implemented in C, NOT in Java or C++.
According to the OS course notes:
OK, so how do we actually implement a LRU? Idea 1): mark everything we touch with a timestamp. Whenever we need to evict a page, we select the oldest page (= least-recently used). It turns out that this simple idea is not so good. Why? Because for every memory load, we would have to read contents of the clock and perform a memory store! So it is clear that keeping timestamps would make the computer at least twice as slow.
Memory load and store operations should be very fast. Is it really necessary to get rid of these tiny operations?
In the case of memory page replacement, the overhead of loading a page from disk should be a lot more significant than the memory operations. Why would we actually care about memory stores and loads?
If what the notes said isn't correct, then what is the real problem with implementing LRU with timestamps?
EDIT:
As I dig deeper, the reason I can think of is the following. These memory store and load operations happen when there is a page hit. In this case, we are not loading a page from disk, so the comparison is not valid.
Since the hit rate is expected to be very high, updating the data structure associated with LRU will be very frequent. That's why we care about the operations repeated in the update process, e.g., memory loads and stores.
But still, I'm not convinced of how significant the overhead of these memory loads and stores is. There should be some measurements around. Can someone point me to them? Thanks!
Memory load and store operations can be quite fast, but in most real life cases the memory subsystem is slower - sometimes much slower - than the CPU's execution engine.
Rough numbers for memory access times:
L1 cache hit: 2-4 CPU cycles
L2 cache hit: 10-20 CPU cycles
L3 cache hit: 50 CPU cycles
Main memory access: 100-200 CPU cycles
So it costs real time to do loads and stores. With LRU, every regular memory access will also incur the cost of a memory store operation. This alone doubles the number of memory accesses the CPU does. In most situations this will slow the program execution. In addition, on a page eviction all the timestamps will need to be read. This will be quite slow.
In addition, reading and storing the timestamps constantly means they will be taking up space in the L1 or L2 caches. Space in these caches is limited, so your cache miss rate for other accesses will probably be higher, which will cost more time.
In short - LRU is quite expensive.
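For reference, here is a minimal sketch of the timestamp scheme from the course notes (frame count and names are made up), showing exactly where the cost lands: one extra store on every single memory reference, plus a linear scan at eviction time:

```c
#include <stddef.h>
#include <stdint.h>

#define NFRAMES 1024                  /* illustrative number of page frames */

static uint64_t last_used[NFRAMES];   /* one timestamp per resident page */
static uint64_t logical_clock;

/* Executed on EVERY memory reference that hits: the timestamp store roughly
 * doubles the memory traffic of the access it is accounting for. */
static inline void touch(size_t frame)
{
    last_used[frame] = ++logical_clock;
}

/* Executed only on a miss: linear scan for the minimum (oldest) timestamp. */
static size_t pick_victim(void)
{
    size_t victim = 0;
    for (size_t f = 1; f < NFRAMES; f++)
        if (last_used[f] < last_used[victim])
            victim = f;
    return victim;
}
```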
I know that cache memory stores frequently used data to speed up process execution instead of fetching it from main memory (which is slower) every time, and its size is always small in comparison with main memory because it's an expensive technology and because the data actually being processed at any one time is much smaller than the whole data set held in main memory.
But are there any limitations or constraints regarding cache memory size for a given CPU speed or main memory size? Theoretically, if we increased the cache memory a lot, would that have the opposite effect, or would it just be a wasted increase?
Indeed, the performance gain becomes less and less significant beyond a cache size of 64 KB.
Here is a graph from Wikipedia showing that, regardless of the set-associativity scheme, the miss rate decreases only slightly as the cache size increases past 64 KB.
Caches are small because the silicon used to build them is quite expensive and, especially on CISC-type CPUs, there might not be enough space on the chip to hold them. Also, making chips bigger has its cost, and there's the possibility that the chip won't fit in its socket, which adds many more issues. It's not that simple ;)
EDIT:
Well, I haven't got any papers about this, but I'll explain my opinion anyway with a simple question: if a program needs x bytes of memory, what would be the difference if the cache's size were 10 * x bytes or 100 * x? Once all the data is loaded into the cache (which doesn't depend on its size at all), the difference is all in the cache's access speed. And given locality of reference, it's not necessary to have everything in the cache.
Also, having big caches requires a better algorithm for searching the requested data in them. For example, accessing data in a fully associative cache will become slower than accessing main memory as the cache size increases (which implies there are more and more places to look for the data). Considering a multitasking system, though, introduces other issues which I don't actually know much about.
To conclude, the performance gain from increasing a cache's size becomes smaller as the cache size approaches the usual amount of data used by all the software running on a given machine.
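Regarding the point above about searching a big fully associative cache, here is a toy sketch (sizes and names are illustrative; real hardware compares all tags in parallel rather than in a loop, but the amount of comparison work still grows with the number of entries):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy fully associative cache: a lookup must check every entry's tag. */
#define NENTRIES 4096                 /* illustrative entry count */

struct line { bool valid; uint64_t tag; /* data omitted */ };
static struct line cache[NENTRIES];

static bool lookup(uint64_t addr)
{
    uint64_t tag = addr / 64;                 /* 64-byte lines assumed */
    for (size_t i = 0; i < NENTRIES; i++)     /* work grows with cache size */
        if (cache[i].valid && cache[i].tag == tag)
            return true;
    return false;
}
```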