FIFO cache vs LRU cache - caching

I'm really sorry for such a simple question. I just want to be sure that I understand the FIFO cache model correctly, and I hope that someone will help me with that :) An LRU cache deletes the entry that was accessed least recently when the cache is full. A FIFO cache deletes the entry that was added earliest when it needs free space (for example, if 'a' - 'v' - 'f' - 'k' are the entries in the cache and 'a' is the oldest entry, then the cache will delete 'a' when it needs free space).
Am I right?

You are correct.
Think of FIFO as cars going through a tunnel. The first car to go in the tunnel will be the first one to go out the other side.
Think of the LRU cache as cleaning out the garage. You will throw away items that you have not used for a long time, and keep the ones that you use frequently. An evolution of that algorithm (an improvement to simple LRU) would be to throw away items that have not been used for a long time, and are not expensive to replace if you need them after all.

Yes, an LRU cache is based on the least recent use of an object in the cache, but FIFO is based on the time at which an object was cached.

Yes, that is correct. FIFO means First In, First Out, i.e., consider (in this case, delete) elements strictly in arrival order. LRU is Least Recently Used: the cache element that hasn't been used for the longest time is evicted (on the hunch that it won't be needed soon).
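To make the difference concrete, here is a minimal sketch of both policies in Python (my own illustration, not production code), using an OrderedDict's insertion order to track age:

```python
from collections import OrderedDict

class FIFOCache:
    """Evicts the entry that was inserted first, regardless of later accesses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        return self.data.get(key)          # a hit does NOT change eviction order

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # drop the oldest insertion
        self.data[key] = value

class LRUCache:
    """Evicts the entry that was accessed (read or written) least recently."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)     # a hit refreshes the entry
        return self.data.get(key)

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        elif len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # drop the least recently used entry
        self.data[key] = value
```

With 'a', 'v', 'f', 'k' cached and 'a' then read again, the FIFO cache still evicts 'a' on the next insertion, while the LRU cache evicts 'v'.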

Related

Reasons to use a LIFO cache eviction policy?

On the Wikipedia page about cache replacement policies, there is a small section about LIFO/FILO policy:
Last in first out (LIFO) or First in last out (FILO)
Using this algorithm the cache behaves in the same way as a stack and opposite way as a FIFO queue. The cache evicts the block added most recently first without any regard to how often or how many times it was accessed before.
I tried to look for an application of this policy but didn't find any example. In my opinion, if you discard the most recently added entry, it defeats the purpose of caching. When there is a cache miss, you'll fetch the data and save it in the cache, but it will likely be the first thing discarded on the next cache miss, so why cache it at all? The only reason I can see is that each entry will likely be fetched only once, but then why implement caching at all?

Can a fully associative cache have a higher miss rate than a direct mapped cache?

The following was an interview question:
Why might a fully associative cache have higher miss rates than a direct-mapped cache?
I thought this was not possible at all. Can anyone please share some insight on this?
Are you supposed to assume they're the same size? If not, then a smaller fully associative cache can still miss more if most of the misses are "capacity" misses, not conflict misses.
If they are the same size, then it's a lot less likely, so you have to start thinking of pathological cases. An unlucky case for the replacement policy can lead to unrelated misses evicting a line that will be useful later. A miss in a fully associative cache can evict any of the current lines, and has to pick one. (And with a high associativity, LRU would take a lot of bits, so it's probably not going to be true LRU. Even if true LRU, you can certainly construct a sequence that still evicts more lines.)
Consider the following sequence of events:
1. A miss brings a line into the cache. It will be useful in the future, but of course the hardware can't know this yet because it can't see the future.
2. Many other compulsory misses to other lines (so no cache could help with them), to addresses which (for a direct-mapped or set-associative cache) don't alias the same set, i.e. have a different index. Thus they can't possibly disturb the future-value line for direct-mapped, but they can for fully associative: everything is in one large set. This also works if it's just a loop over a big array (except for a line that would alias the one that has future value). There can be some hits in here, too, but it's easier to verify correctness if there are none, as long as none of the lines alias each other.
3. Another access to that first line. Direct-mapped definitely still has it cached, because no previous accesses have had this index. Fully associative will have evicted it, unless the replacement policy somehow predicted or guessed the future and never evicted it despite all the comings and goings of other lines.
There might be more-reasonable examples where the average hit rate isn't so low. e.g. simply looping over an array and thus benefiting from the cache when reading each byte or word sequentially.
An associative cache has much more room for other accesses to alias the same set because there are fewer sets. Usually the replacement policy makes useful choices for your workload, so this behaviour is only possible with one that defeats it.
Related: https://en.wikipedia.org/wiki/Adaptive_replacement_cache https://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ - adaptive replacement can help keep some lines around when a cache-blasting loop over a huge array runs.
An adaptive replacement policy can make an associative cache more resistant to this kind of worst-case where a direct-mapped cache could beat it. Of course, looping over a huge array will usually completely wipe out a direct-mapped cache; what's special here is that we're avoiding anything that would alias the line that will hit. So looping through a linked list is more plausible for a sequence of accesses that touches a lot of lines but happens to skip one.
Also related re: pathological worst-cases for LRU: Bélády's anomaly for virtual memory page-replacement, where one can construct cases where LRU gets worse with more page-frames but FIFO doesn't. The analogous case for a CPU cache would be getting worse with more ways per set, for the same sequences of accesses to a set.
Of course nobody said anything about the fully-associative cache being true LRU, and as I said earlier that's unlikely in practice.
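To see the effect concretely, here is a toy Python simulation of the sequence above (a sketch of my own: both caches hold 8 lines, the fully associative one uses true LRU, and addresses are just line numbers with the direct-mapped index being addr % 8):

```python
from collections import OrderedDict

NUM_LINES = 8

def run(trace):
    dm = {}                      # index -> cached line, for the direct-mapped cache
    fa = OrderedDict()           # LRU order for the fully associative cache
    dm_hits = fa_hits = 0
    for addr in trace:
        # direct-mapped lookup
        idx = addr % NUM_LINES
        if dm.get(idx) == addr:
            dm_hits += 1
        else:
            dm[idx] = addr
        # fully associative lookup with true LRU replacement
        if addr in fa:
            fa_hits += 1
            fa.move_to_end(addr)
        else:
            if len(fa) >= NUM_LINES:
                fa.popitem(last=False)
            fa[addr] = addr
    return dm_hits, fa_hits

# line 0, then many distinct lines whose index is never 0, then line 0 again
trace = [0] + [n * NUM_LINES + idx for n in range(4) for idx in range(1, NUM_LINES)] + [0]
print(run(trace))   # direct-mapped: 1 hit, fully associative: 0 hits
```

The intervening misses never touch index 0, so the direct-mapped cache keeps the useful line, while the fully associative cache cycles it out.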

MESI protocol: write with a cache miss, but a cache line copy exists on another CPU. Why does it need to fetch from main memory?

According to this diagram, in the case of a write cache miss with a copy in another CPU's cache (for example, in the Shared or Exclusive state), the steps are:
1. Snooping cores (with a cache line copy) set their state to Invalid.
2. The current cache stores the fresh main memory value.
Why can't one of the snooping cores put its cache line value on the bus first, and then go to the Invalid state? The same algorithm is used on a read miss with an existing copy. Thank you.
You're absolutely right in that it's pretty silly to go fetch a line from memory when you already have it right next to you, but this diagram describes the minimal requirement for functional correctness of the coherence protocol (i.e. what must be done to avoid coherence bugs), and that only dictates snooping the data out for modified lines since that's the only correct copy. What you describe is a possible optimization, and some systems indeed behave that way.
However, keep in mind that most systems today employ a shared cache as well (L2 or L3, sometimes even beyond that), and this is often inclusive (with regards to all lines that exist in all cores). In such systems, there's no real need to go all the way to memory, since having the line in another core means it's also in the shared cache, and after invalidation the requesting core can obtain it from there. Your proposal is therefore relevant only for systems with no shared cache, or with a cache that is not strictly inclusive.
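As a rough illustration of that "minimal requirement" point, here is a toy sketch of my own (not real coherence hardware, and the helper name is invented): each cache maps a line address to a (state, value) pair, only a Modified copy is snooped out, and everything else is just invalidated while the requester goes to memory.

```python
def write_miss(requester, new_value, caches, memory, line):
    """Minimal MESI handling of a write miss by `requester`."""
    for core, cache in enumerate(caches):
        if core == requester or line not in cache:
            continue
        state, value = cache[line]
        if state == 'M':
            memory[line] = value             # the only guaranteed-correct copy: write it back
        cache[line] = ('I', None)            # snooping cores invalidate their copy
    _ = memory[line]                         # minimal protocol: fetch from memory,
                                             # even though an S/E copy was right next door
    caches[requester][line] = ('M', new_value)

# Example: line 0x40 is Shared in core 1's cache; core 0 writes to it.
memory = {0x40: 'old data'}
caches = [{}, {0x40: ('S', 'old data')}]
write_miss(0, 'new data', caches, memory, line=0x40)
print(caches)   # core 1 now Invalid, core 0 Modified with the new data
```

The optimization you describe (cache-to-cache transfer of clean lines) would simply replace the memory read with data supplied by one of the snoopers, and some real systems do exactly that.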

How is an LRU cache implemented in a CPU?

I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick?
Also, what happens in a dual-core system where both CPUs write to the same address simultaneously?
For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.
For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.
For a 4-way cache, the following encoding of the state would seem to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating whether the higher- or lower-numbered of the two remaining ways was more recently used.
On a MRU hit, the state is unchanged.
On a next-MRU hit the two bit fields are swapped.
On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the next-MRU way number is higher or lower than the less recently used way that did not hit.
On a miss, the state is updated as if an LRU hit had occurred.
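Here is a sketch of that 5-bit encoding in Python (my interpretation of the scheme above; the reset ordering is arbitrary):

```python
class Lru4State:
    """Full LRU state for one 4-way set, packed as MRU way (2 bits),
    next-MRU way (2 bits), and one bit saying whether the higher-numbered
    of the two remaining ways was more recently used."""
    def __init__(self):
        self.mru, self.next_mru, self.high_bit = 0, 1, True   # arbitrary reset order

    def _remaining(self):
        low, high = sorted(set(range(4)) - {self.mru, self.next_mru})
        return low, high

    def lru_way(self):
        low, high = self._remaining()
        return low if self.high_bit else high

    def touch(self, way):                        # update on a hit to `way`
        if way == self.mru:
            return                               # MRU hit: state unchanged
        if way == self.next_mru:                 # next-MRU hit: swap the two fields
            self.mru, self.next_mru = self.next_mru, self.mru
            return
        low, high = self._remaining()
        other = high if way == low else low      # the remaining way that did not hit
        old_next_mru = self.next_mru             # becomes the 3rd most recently used
        self.mru, self.next_mru = way, self.mru
        self.high_bit = old_next_mru > other     # which of the leftover pair is newer

    def miss(self):                              # pick a victim, then treat it as a hit
        victim = self.lru_way()
        self.touch(victim)
        return victim
```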
Because LRU tracking has such overhead, simpler mechanisms like binary tree pseudo-LRU are often used. On a hit, such a scheme just updates each branching part of the tree with which half of the associated ways the hit was in. For a power-of-two number of ways W, a binary tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) are less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) are less recently used, and set the higher bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.
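For comparison, here is a sketch of the tree pLRU update and victim selection (one possible bit convention, my own illustration; real hardware does this with a handful of gates rather than a loop):

```python
WAYS = 8
LEVELS = 3                                  # log2(WAYS)

class TreePlru:
    """Binary-tree pseudo-LRU for one 8-way set: W-1 = 7 bits, one per node.
    Bit = 0 means the lower half of the ways under that node is the less
    recently used side; bit = 1 means the upper half is."""
    def __init__(self):
        self.bits = [0] * (WAYS - 1)        # node 0 is the root

    def touch(self, way):
        """On a hit (or fill) of `way`, point every node on its path away from it."""
        node, lo, hi = 0, 0, WAYS
        for _ in range(LEVELS):
            mid = (lo + hi) // 2
            if way < mid:                   # hit in lower half -> upper half is LRU side
                self.bits[node] = 1
                node, hi = 2 * node + 1, mid
            else:                           # hit in upper half -> lower half is LRU side
                self.bits[node] = 0
                node, lo = 2 * node + 2, mid
        # Each bit is simply written; the old state never has to be read.

    def victim(self):
        """Follow the bits toward the (pseudo) least recently used way."""
        node, lo, hi = 0, 0, WAYS
        for _ in range(LEVELS):
            mid = (lo + hi) // 2
            if self.bits[node] == 0:        # lower half is the LRU side
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo
```

Calling touch(6) on an all-zero state sets exactly the three bits described above (root cleared, upper-half node cleared, the (6,7) leaf set).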
For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.
With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race and one core will get exclusive access. If only one of the writing cores already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in shared state, the cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present, a write would typically need to request the cache line of data as well as asking for exclusive state.
Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.
There is a good slide deck, Page replacement algorithms, that talks about various page replacement schemes. It also explains the LRU implementation using an m×m matrix really well.

Data structure and algorithm for representing/allocating free space in a file

I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space and make free space.
I was thinking of using a bi-map that maps offset and length. However, I am not sure that is the best approach if there are really tiny gaps in the file. A bitmap would work, but I don't know how to switch to it dynamically for only certain regions of the file. Perhaps some sort of radix tree is the way to go?
For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.
My goals are to have pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees... One sorted by offset and the other sorted by length with ties broken by offset. Note that this gives you amortized log(n) for all operations with a working set time of log(m)... Pretty darn good... But, as previously mentioned, does not handle issues concerning high fragmentation.
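For concreteness, here is a rough sketch of that two-index idea (my own illustration, with plain sorted lists and bisect standing in for splay trees, so updates are O(n) here rather than amortized O(log n); the allocate/free/coalesce logic is the point):

```python
import bisect

class FreeSpace:
    """Free extents kept sorted by offset (for coalescing) and by
    (length, offset) (for best-fit allocation)."""
    def __init__(self):
        self.by_offset = []                 # [(offset, length)]
        self.by_length = []                 # [(length, offset)]

    def _insert(self, off, length):
        bisect.insort(self.by_offset, (off, length))
        bisect.insort(self.by_length, (length, off))

    def _remove(self, off, length):
        self.by_offset.remove((off, length))
        self.by_length.remove((length, off))

    def allocate(self, size):
        """Best fit: the smallest free extent that is at least `size` long."""
        i = bisect.bisect_left(self.by_length, (size, -1))
        if i == len(self.by_length):
            return None                     # no extent is big enough
        length, off = self.by_length[i]
        self._remove(off, length)
        if length > size:                   # put the tail back as a smaller extent
            self._insert(off + size, length - size)
        return off

    def free(self, off, length):
        """Return an extent, merging with free neighbours to limit fragmentation."""
        i = bisect.bisect_left(self.by_offset, (off, 0))
        if i < len(self.by_offset):         # merge with the following extent?
            noff, nlen = self.by_offset[i]
            if off + length == noff:
                self._remove(noff, nlen)
                length += nlen
        if i > 0:                           # merge with the preceding extent?
            poff, plen = self.by_offset[i - 1]
            if poff + plen == off:
                self._remove(poff, plen)
                off, length = poff, length + plen
        self._insert(off, length)
```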
I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as:
1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.
When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far)
One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.
Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If it becomes a problem, we could load the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.
Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file!
Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.
For memory management I am a fan of the BiBOP* approach, which is normally efficient at managing fragmentation.
The idea is to segregate data based on their size. This way, within a "bag" you only have "pages" of small blocks with identical sizes:
no need to store the size explicitly, it's known depending on the bag you're in
no "real" fragmentation within a bag
The bag keeps a simple free-list of the available pages. Each page keeps a free-list of available storage units in an overlay over those units.
You need an index to map size to its corresponding bag.
You also need a special treatment for "out-of-norm" requests (ie requests that ask for allocation greater than the page size).
This storage is extremely space efficient, especially for small objects, because the overhead is not per-object. However, there is one drawback: you can end up with "almost empty" pages that still contain one or two occupied storage units.
This can be alleviated if you have the ability to "move" existing objects, which effectively allows pages to be merged.
(*) BiBOP: Big Bag Of Pages
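A minimal in-memory sketch of the BiBOP layout (my own illustration; the page size and size classes are made-up values, and real implementations carve pages out of the file or address space rather than holding Python objects):

```python
PAGE_SIZE = 4096
SIZE_CLASSES = [16, 32, 64, 128, 256]       # assumed size classes

class Page:
    """A page of fixed-size slots; the slot size is implied by the owning bag."""
    def __init__(self, slot_size):
        self.slot_size = slot_size
        self.free_slots = list(range(PAGE_SIZE // slot_size))  # per-page free list

class Bag:
    """All pages of one size class."""
    def __init__(self, slot_size):
        self.slot_size = slot_size
        self.pages = []                     # only pages that still have free slots

    def allocate(self):
        if not self.pages:
            self.pages.append(Page(self.slot_size))
        page = self.pages[-1]
        slot = page.free_slots.pop()
        if not page.free_slots:
            self.pages.pop()                # page is now full
        return page, slot

    def free(self, page, slot):
        if not page.free_slots:
            self.pages.append(page)         # page becomes allocatable again
        page.free_slots.append(slot)

class BiBop:
    def __init__(self):
        self.bags = {size: Bag(size) for size in SIZE_CLASSES}

    def allocate(self, size):
        for cls in SIZE_CLASSES:            # the size -> bag index
            if size <= cls:
                return self.bags[cls].allocate()
        raise ValueError("out-of-norm request, needs special treatment")
```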
I would recommend making a customized file system (it might contain just one file, of course) based on FUSE. There are a lot of available FUSE solutions you can base it on - I recommend choosing not the most closely related but the simplest projects, in order to learn easily.
Which algorithm and data structure to choose depends highly on your needs. It could be: a map, a list, or a file split into chunks with on-the-fly compression/decompression.
The data structures you proposed are good ideas. As you can clearly see, there is a trade-off: fragmentation vs. compaction.
On one side - best compaction, highest fragmentation - splay trees and many other kinds of trees.
On the other side - lowest fragmentation, worst compaction - a linked list.
In between there are B-trees and others.
As I understand it, you stated space saving as the priority, while still taking care of performance.
I would recommend a mixed data structure in order to achieve all the requirements:
a kind of list of contiguous blocks of data
a kind of tree for the current "add/remove" operations
when data is required on demand, allocate from the tree; when data is deleted, keep track of what's "deleted" using the tree as well
mixing -> during each operation (or in idle moments), do "step by step" defragmentation, applying the changes kept in the tree to the contiguous blocks while moving them slowly.
This solution gives you a fast response on demand, while "optimising" things as they are used (for example, each read of 10 MB of data -> defragmentation of 1 MB) or in idle moments.
The simplest solution is a free list: keep a linked list of free blocks, reusing the free space to store the address of the next block in the list.
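For example, a sketch of such an on-disk free list (my own, assuming fixed-size blocks and a little-endian 8-byte "next" field stored at the start of each free block, with a sentinel value for "no next block"):

```python
import struct

BLOCK_SIZE = 4096
NIL = 0xFFFFFFFFFFFFFFFF                    # sentinel: "no next free block"

def push_free(f, head, offset):
    """Free the block at `offset`: it becomes the new head of the free list."""
    f.seek(offset)
    f.write(struct.pack('<Q', head))        # old head stored inside the freed block
    return offset                           # new head of the list

def pop_free(f, head):
    """Reuse a free block if there is one; otherwise the caller appends to the file."""
    if head == NIL:
        return NIL, None
    f.seek(head)
    next_head = struct.unpack('<Q', f.read(8))[0]
    return next_head, head                  # new head, offset of the reusable block
```

The head offset itself would live in the file header, so it can be committed atomically along with the rest of the metadata.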
