Why Bit-PLRU is different from LRU - caching

The following is a description of bit-PLRU:
Bit-PLRU stores one status bit for each cache line. We call these bits MRU-bits. Every access to a line sets its MRU-bit to 1, indicating that the line was recently used. Whenever the last remaining 0 bit of a set's status bits is set to 1, all other bits are reset to 0. At cache misses, the line with lowest index whose MRU-bit is 0 is replaced.
I think the replacement policies of LRU and bit-PLRU are the same, so their miss rates should also be the same.
My reasoning: at cache misses, the line with the lowest index whose MRU-bit is 0 is replaced.
The line with the lowest index is the least recently used one, so its MRU-bit is definitely zero (it couldn't be 1 because it hasn't been recently used). So, isn't the MRU-bit redundant?
If my reasoning is wrong, could anyone give me a reason or an example showing where bit-PLRU and LRU differ? Why does bit-PLRU give better performance (miss rate)?
Thanks!

Least recently used means the line with the oldest access time. But keeping track of accesses so as to always know which one is the oldest may be complex in a cache context. Storing the full access order of n blocks would require at least ceil(log2(n!)) bits, or, more practically, n×log2(n) bits, which is close for small n and much simpler to manage; for a 4-way set that is 5 bits versus 8 bits, compared with just 4 MRU-bits for bit-PLRU. Whenever a memory reference is accessed, its line must be removed from the order list, put at the top, and the rest of the list updated. This may be complex to do in one cycle.
This is the reason why pseudo-LRU methods have been developed. They guarantee that an "ancient" line will be ejected, but not that the most ancient one will be.
Consider an example for the bit-PLRU of your question. We assume that the initial set state is the following.
line     status   real order
(index)  (MRU)
  0        0        3   LRU
  1        1        0   MRU
  2        1        1
  3        0        2
The real order is not stored, but we will use it to understand the behavior of the algorithm (smallest is youngest).
Now, assume we access existing line 0. Status becomes
line     status   real order
(index)  (MRU)
  0        1        0   MRU
  1        1        1
  2        1        2
  3        0        3   LRU
Assume this is followed by a miss, so we apply the method and replace line 3:
Whenever the last remaining 0 bit of a set's status bits is set to 1, all other bits are reset to 0.
line     status   real order
(index)  (MRU)
  0        0        1
  1        0        2
  2        0        3   LRU
  3        1        0   MRU
So the algorithm has properly ejected the LRU line (line 3).
Assume that there is another miss. The algorithm states:
At cache misses, the line with lowest index whose MRU-bit is 0 is replaced.
So line 0 will be replaced. But it is not the LRU, which is line 2. It is even the "youngest" of the ancient lines. But it has the lowest index.
To eject a better (older) line would require additional information about the access times.
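To make the rules concrete, here is a minimal sketch of one 4-way set managed by the quoted bit-PLRU rules. This is my own illustration, not code from any particular cache; the names BitPLRUSet, touch and victim are invented for the example.

#include <array>

// Sketch of one 4-way set managed by the bit-PLRU rules quoted in the question.
struct BitPLRUSet {
    static constexpr int kWays = 4;
    std::array<int, kWays>  tag;     // which block each line holds (-1 = empty)
    std::array<bool, kWays> mru{};   // one MRU-bit per line, all start at 0

    BitPLRUSet() { tag.fill(-1); }

    // Every access to a line sets its MRU-bit to 1; when the last 0 bit
    // would disappear, all other bits are reset to 0.
    void touch(int way) {
        mru[way] = true;
        bool allSet = true;
        for (bool b : mru) allSet = allSet && b;
        if (allSet) { mru.fill(false); mru[way] = true; }
    }

    // On a miss, the line with the lowest index whose MRU-bit is 0 is replaced.
    int victim() const {
        for (int w = 0; w < kWays; ++w)
            if (!mru[w]) return w;
        return 0;  // not reachable: touch() always leaves at least one 0 bit
    }

    bool access(int block) {                 // returns true on a hit
        for (int w = 0; w < kWays; ++w)
            if (tag[w] == block) { touch(w); return true; }
        int w = victim();                    // miss: replace the victim line
        tag[w] = block;
        touch(w);
        return false;
    }
};

Since it follows exactly the quoted rules, it reproduces the walk-through above: starting from the state in the first table, a hit on line 0 followed by two misses replaces line 3 and then line 0, even though line 2 is the true LRU at the second miss.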
Maybe some randomness in the ejection could simply be added, but finding the real LRU is more complex.
Note that there are better ways to build a pseudo-LRU. Tree-PLRU, for instance, is much better, but it still does not guarantee that the real LRU line will be the one replaced.
For practical applications, pLRU gives a miss rate similar to real LRU, while being much simpler.
But even real LRU may not always be the best policy: if a line has been accessed frequently, it is likely that it will continue to be accessed, and it should probably not be replaced even if it is the LRU line.
So the most efficient methods extend pLRU by keeping track of the number of accesses and by treating differently lines that have been accessed only once and lines that have been accessed twice or more. This way, whenever a line has to be ejected, lines that have been accessed only once are preferred.
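As a minimal sketch of that last idea (my own illustration, not a specific published policy; pick_victim and accessCount are invented names): among the lines whose MRU-bit is 0, prefer one that has been accessed only once.

#include <array>

// Among lines with MRU-bit 0, prefer evicting one that was accessed only once.
int pick_victim(const std::array<bool, 4>& mru,
                const std::array<int, 4>& accessCount) {
    for (int w = 0; w < 4; ++w)
        if (!mru[w] && accessCount[w] <= 1) return w;   // used once: evict first
    for (int w = 0; w < 4; ++w)
        if (!mru[w]) return w;                          // fall back to the plain bit-PLRU rule
    return 0;                                           // defensive default
}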

Related

Data Oriented Design with Mike Acton - Are 'loops per cache line' calculations right?

I've watched Mike Acton's talks about DOD a few times now to better understand it (it is not an easy subject for me). I'm referring to CppCon 2014: Mike Acton "Data-Oriented Design and C++"
and GDC 2015: How to Write Code the Compiler Can Actually Optimize.
But in both talks he presents some calculations that I'm confused with:
This shows that FooUpdateIn takes 12 bytes, but if you stack 32 of them you will get 6 fully packed cache lines. The same goes for FooUpdateOut: it takes 4 bytes, and 32 of them give you 2 fully packed cache lines.
In the UpdateFoos function, you can do ~5.33 loop iterations per cache line (assuming that count is indeed 32); he then assumes that all the math takes about 40 cycles, which means that each cache line would take about 213.33 cycles.
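For reference, here is a sketch of that arithmetic. The struct definitions are reconstructed from the member names mentioned in the question (m_Velocity, m_Foo) and the stated sizes, so they are an assumption, not Acton's exact slide code.

#include <cstdio>

// Reconstruction (an assumption) of the two structures and the cache-line math.
struct FooUpdateIn  { float m_Velocity[2]; float m_Foo; };  // 12 bytes
struct FooUpdateOut { float m_Foo; };                       //  4 bytes

int main() {
    const int count = 32, kCacheLine = 64;
    std::printf("in:  %zu B * %d = %zu B = %zu cache lines\n",
                sizeof(FooUpdateIn), count, sizeof(FooUpdateIn) * count,
                sizeof(FooUpdateIn) * count / kCacheLine);        // 384 B -> 6 lines
    std::printf("out: %zu B * %d = %zu B = %zu cache lines\n",
                sizeof(FooUpdateOut), count, sizeof(FooUpdateOut) * count,
                sizeof(FooUpdateOut) * count / kCacheLine);       // 128 B -> 2 lines
    std::printf("loops per input cache line: %.2f\n",
                kCacheLine / (double)sizeof(FooUpdateIn));        // ~5.33
}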
Now here's where I'm confused: isn't he forgetting about reads and writes? Even though he has 2 fully packed data structures, they are in different memory spaces.
In my head this is what's happening:
1. Read in[0].m_Velocity[0] (which would take about 200 cycles based on his previous slides).
2. Since in[0].m_Velocity[1] and in[0].m_Foo are in the same cache line as in[0].m_Velocity[0], their access is free.
3. Do all the calculation.
4. Write the result to out[0].m_Foo - here is where I don't know what happens; I assume it would discard the previous cache line (fetched in 1.) and load the new one to write the result.
5. Read in[1].m_Velocity[0], which would again discard another cache line (fetched in 4.), taking again about 200 cycles.
6. ...
So, jumping between in and out, the calculation goes from ~5.33 loops/cache line to 0.5 loops/cache line, which would mean only 20 cycles of computation per cache line.
Could someone explain why he wasn't concerned about reads/writes? Or what is wrong in my thinking?
Thank you.
If we assume the L1 cache is 64 KB and one cache line is 64 bytes, then there are 1024 cache lines in total. So, in step 4, the write of the result to out[0].m_Foo will not evict the input data fetched in step 1, as in and out are at different memory locations and the cache has room for both. This is the reason why he uses a separate output structure for m_Foo instead of mutating it in place as in his first implementation. He is only talking about the cost up to the point of computing the value; updating/writing the value will have the same cost as in his first implementation. Also, the processor can optimize such loops quite well, as it can do multiple iterations in parallel (not sequentially, since the results of the first and second iterations do not depend on each other). I hope this helps.

Dirty bit value after changing data to original state

If the value in some part of the cache is 4 and we change it to 5, that sets the dirty bit for that data to 1. But what if we then set the value back to 4: will the dirty bit stay 1, or change back to 0?
I am interested in this because it could mean a higher-level optimization of the computer system when dealing with read-write operations between main memory and the cache.
In order for a cache to work like you said, it would need to reserve half of its data space to store the old values.
Since caches are expensive precisely because they have a high cost per bit, and considering that:
- that mechanism would only detect a two-level write history, A -> B -> A, and nothing deeper (like A -> B -> C -> A);
- writing would imply copying the current values into the old values;
- the minimum amount of taggable data in a cache is the line, and the whole line would need to be changed back to its original value - considering that a line is on the order of 64 bytes, that is very unlikely to happen;
- the hierarchical structure of the caches (L1, L2, L3, ...) is there exactly to mitigate the problem of eviction.
The solution you propose has few benefits compared to its costs, and thus it is not implemented.
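As a minimal sketch of the mechanism being discussed (my own illustration, not a real cache model): a write-back line sets its dirty bit on any store, unconditionally, because comparing against the original value would require keeping a second copy of the line, which is exactly the cost listed above.

#include <cstddef>
#include <cstdint>

// Illustration: the dirty bit is set on every store, regardless of the value
// written; nothing remembers the original contents to compare against.
struct CacheLine {
    std::uint8_t data[64] = {};
    bool         dirty    = false;

    void store(std::size_t offset, std::uint8_t value) {
        data[offset] = value;
        dirty = true;              // set unconditionally
    }
};

int main() {
    CacheLine line;
    line.data[0] = 4;   // pretend the line was filled from memory with 4 (clean)
    line.store(0, 5);   // 4 -> 5: dirty bit set
    line.store(0, 4);   // back to 4: dirty stays 1, as the answer explains
    return line.dirty;  // 1
}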

How do I map a memory address to a block when there is an offset in a direct-mapped cache?

To start off, the first cache has 16 one-word blocks. As an example I will use the memory reference 0x03. The index has 4 bits (0011). It is clear that the bits equal 3 mod 16 (0011 = 0x03 = 3). However, I am getting confused using this mod equation to determine the block location in a cache with offset bits.
The second cache has a total size of eight two-word blocks. This means that there is 1 offset bit. Since there are now 8 blocks, there are only 3 index bits. As an example, I will take the same memory reference of 0x03. However, now I am having trouble mapping to the block using the mod equation I used before. I try 3 mod 8, which is 3; however, in this case, since there is an offset bit, the index bits are 001. 001 is not equal to 3, so what did I do wrong? Does mod not work when there are offset bits? I was under the impression that the mod equation would always equal the index bits.
It's all in the address. You take the address, then mask off a number of bits from the end, for the following reasons.
The number of words in the cache line: if you've got a 2-word cache line, take one bit out (a 4-word line, 2 bits, etc.).
Then how many cache-line entries you have: if it is a 1024-line cache, you take out 10 bits. These 10 bits are your index; the remaining bits are your tag.
Now, you also need to consider the 'way'. If it's a direct-mapped cache, the above applies. If it's a 2-way set-associative cache, you don't have 1024 individually indexed lines; what you have is 512 sets, each holding 2 lines, which means you only need 9 bits to determine the index of the set. If it's 4-way, you've got 256 sets with 4 lines in them, meaning you only need 8 bits for your index.
In a set-associative cache, the index is there to choose a set; once a set is chosen, you can use a policy like LRU to fill an entry in case of a cache miss. Hits are determined by comparing the tags in the selected set.
Bottom line: the exact line is not determined by the address; only a set is selected by the address, and thereafter it is tag comparison that finds the data.
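To make that concrete, here is a minimal sketch under the question's convention of word-granular addresses (implied by "16 one-word blocks" giving a 4-bit index for 0x03); decode and CacheGeometry are names invented for the example.

#include <cstdio>

// Sketch (assumption: word-addressed references) of splitting an address
// into offset / index / tag bits.
struct CacheGeometry {
    unsigned offsetBits;   // log2(words per line)
    unsigned indexBits;    // log2(number of sets)
};

void decode(unsigned addr, CacheGeometry g) {
    unsigned offset = addr & ((1u << g.offsetBits) - 1);
    unsigned index  = (addr >> g.offsetBits) & ((1u << g.indexBits) - 1);
    unsigned tag    =  addr >> (g.offsetBits + g.indexBits);
    std::printf("addr 0x%02x -> tag %u, index %u, offset %u\n",
                addr, tag, index, offset);
}

int main() {
    decode(0x03, {0, 4});  // 16 one-word blocks: index = 3 (0011), no offset
    decode(0x03, {1, 3});  // 8 two-word blocks:  index = 1 (001), offset = 1
}

The second call shows the point of the question: the mod is applied to the block number (the address shifted right by the offset bits), so 0x03 >> 1 = 1 and 1 mod 8 = 1 = 001, which matches the index computed by hand.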

Cache Memory Blocks Organization

I am not able to understand how exactly the cache is organized in the following scenario.
The cache size is 256 bytes. The cache line size is 8 bytes. All variables are 4 bytes. Assume that an array A[1024] is stored in memory locations 0-4095. Suppose we are using the fully associative mapping technique: how is the array mapped to this particular cache? Consider that the cache is initially empty and that we use the LRU algorithm for replacement. During each replacement, an entire line of cache is replaced.
Initial analysis :
There will be 32 cache blocks, each 8 bytes long. But the variables to be stored in these locations are only 4 bytes long. I am not able to take this analysis any further, as to how these array elements are mapped to the 32 cache blocks.
Let's assume it's accessed sequentially:
for (int i=0; i<1024; ++i)
read(A[i]);
In that case, you'll fill the first 64 elements (A[0] through A[63]) into the 32 cache blocks in adjacent pairs like MSalters said.
The next access, A[64], would have to kick out the least recently used line. The cache has to pick a victim, and since you're using LRU and accessed the array in sequential order, that victim is the first block (way 0). You therefore replace A[0] and A[1] with A[64] and A[65], and so on; in general, element i ends up in way floor(i/2) % 32.
Now, computing the hit rate requires an additional assumption: each memory fetch is the size of a full block (8 bytes), since you can't fill half blocks (actually there are ways, using mask bits, but let's assume the simple case). We therefore get every second element "for free": fetching A[0] also fetches A[1], and so on. In theory this means the hit rate could be 50% (miss the even elements, hit the odd ones; in reality most CPUs would perform the accesses in parallel, so you won't really see that hit rate, but let's say the accesses are serialized here).
Note that each new block fetched after the first 64 elements would have to evict a block from the cache; if processing the elements also modifies them, you'll have to write them back too.
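For the curious, here is a small simulation sketch (my own, under the serialized-access assumption above, with invented names) of a 32-line x 8-byte fully associative cache with true LRU, fed the sequential 4-byte accesses A[0]..A[1023]:

#include <cstdio>
#include <vector>

// Sketch (not from the original answer): fully associative, 32 lines of 8 bytes,
// true LRU replacement, sequential 4-byte reads of A[0]..A[1023] at addresses 0..4095.
int main() {
    const int kLines = 32, kLineSize = 8;
    std::vector<long> lastUse(kLines, -1);   // LRU timestamps
    std::vector<long> tagOf(kLines, -1);     // which memory line each cache line holds
    long hits = 0, misses = 0;

    for (long i = 0; i < 1024; ++i) {
        long addr = i * 4;                   // A[i] is 4 bytes
        long tag  = addr / kLineSize;        // memory line number
        int found = -1, lru = 0;
        for (int w = 0; w < kLines; ++w) {
            if (tagOf[w] == tag) found = w;
            if (lastUse[w] < lastUse[lru]) lru = w;
        }
        if (found >= 0) { ++hits; lastUse[found] = i; }    // hit: refresh LRU stamp
        else            { ++misses; tagOf[lru] = tag; lastUse[lru] = i; }  // miss: evict LRU
    }
    std::printf("hits %ld, misses %ld\n", hits, misses);   // 512 hits, 512 misses
}

It reports 512 hits and 512 misses, i.e. the 50% figure above, and each element i indeed ends up in way floor(i/2) % 32.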
Elements A[0] and A[1] are stored in adjacent memory locations, 0-3 and 4-7. That means they share the first cache block. The other elements are similarly mapped pairwise to a cache line. Which pair goes where?

Algo - Given a sequence of memory accesses, give the minimum number of cache misses

The given inputs are: the size of the cache s, the number of memory entries n, and a series of memory accesses.
Give the minimum number of cache misses possible.
Example:
s = 3, n = 4
1 2 3 1 4 1 2 3
min_miss = 4
I've been stuck the entire day. Thanks in advance!
You get to decide whatever behaviour the cache takes. You don't have to take in an entry even if it's accessed, for example. And the behaviour need not be regular; you don't need to follow a fixed "rule" for caching.
Try following http://en.wikipedia.org/wiki/Page_replacement_algorithm#The_theoretically_optimal_page_replacement_algorithm - when you need to swap something out, swap out the item that will not be used again for the longest possible time. Since you get the entire sequence of memory accesses ahead of time, this is feasible for you. It is obviously locally optimal at least up to the first cache miss after the cache becomes full, because every other strategy has had at least one cache miss by then. It is not obvious to me that it is globally optimal - searching, I found a proof at http://www.stanford.edu/~bvr/psfiles/paging.pdf, with a claim that other proofs of its optimality exist but are even longer.
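Here is a small sketch of that strategy (my own code, with the twist from the first answer that an access may bypass the cache, i.e. the incoming item itself may be the "victim"); on the question's example it reports 4 misses.

#include <iostream>
#include <set>
#include <vector>

// On a miss with a full cache, drop the candidate whose next use is farthest
// away (possibly never) -- and that candidate may be the incoming block itself,
// which models "you don't have to take in an entry even if it's accessed".
int min_misses(int s, const std::vector<int>& a) {
    std::set<int> cache;
    int misses = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (cache.count(a[i])) continue;                 // hit
        ++misses;                                        // miss either way
        if ((int)cache.size() < s) { cache.insert(a[i]); continue; }
        auto next_use = [&](int b) {                     // index of next access to b
            for (std::size_t j = i + 1; j < a.size(); ++j)
                if (a[j] == b) return j;
            return a.size();                             // never used again
        };
        int victim = a[i];                               // bypassing is a candidate
        std::size_t farthest = next_use(a[i]);
        for (int b : cache)
            if (next_use(b) > farthest) { farthest = next_use(b); victim = b; }
        if (victim != a[i]) { cache.erase(victim); cache.insert(a[i]); }
        // else: bypass, leave the cache untouched
    }
    return misses;
}

int main() {
    std::cout << min_misses(3, {1, 2, 3, 1, 4, 1, 2, 3}) << "\n";  // prints 4
}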
