Given a sequence of memory accesses, find the minimum number of cache misses (algorithm)

The given inputs are: the cache size s, the number of memory entries n, and a series of memory accesses.
Give the minimum number of cache misses possible.
Example:
s = 3, n = 4
1 2 3 1 4 1 2 3
min_miss = 4
I've been stuck the entire day. Thanks in advance!
You get to decide whatever behaviour the cache takes. For example, you don't have to take in an entry even if it's accessed, and the policy need not be regular: you don't need to follow a fixed "rule" to cache.

Try following http://en.wikipedia.org/wiki/Page_replacement_algorithm#The_theoretically_optimal_page_replacement_algorithm - when you need to swap something out, swap out the item that will not be used again for the longest possible time. Since you get the entire sequence of memory accesses ahead of time, this is feasible for you. This is obviously locally optimal, at least up to the first cache miss after the cache becomes full, because every other strategy has had at least one cache miss by then. It is not obvious to me that this is globally optimal; searching, I found a proof at http://www.stanford.edu/~bvr/psfiles/paging.pdf, with a claim that other proofs of its optimality exist but are even longer.
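To make that concrete, here is a minimal Go sketch of that greedy (the function and names are mine, not from the answer). Because the question allows you not to admit an accessed entry at all, the incoming entry is treated as just another eviction candidate; with that twist the greedy reproduces min_miss = 4 for the example above.

```go
package main

import "fmt"

// minMisses computes the minimum number of cache misses with a Belady-style
// greedy: on a miss with a full cache, discard (or simply don't admit)
// whichever candidate -- the cached entries or the new entry itself -- is
// next used farthest in the future.
func minMisses(cacheSize int, accesses []int) int {
	inCache := make(map[int]bool)
	misses := 0
	for i, a := range accesses {
		if inCache[a] {
			continue // hit
		}
		misses++
		if len(inCache) < cacheSize {
			inCache[a] = true
			continue
		}
		// Candidates: everything currently cached, plus the new entry.
		victim, farthest := a, nextUse(accesses, i, a)
		for c := range inCache {
			if n := nextUse(accesses, i, c); n > farthest {
				victim, farthest = c, n
			}
		}
		if victim != a { // otherwise bypass: don't cache the new entry
			delete(inCache, victim)
			inCache[a] = true
		}
	}
	return misses
}

// nextUse returns the index of the next access to v after position i,
// or len(accesses) if v is never accessed again.
func nextUse(accesses []int, i, v int) int {
	for j := i + 1; j < len(accesses); j++ {
		if accesses[j] == v {
			return j
		}
	}
	return len(accesses)
}

func main() {
	// The example from the question: s = 3, accesses 1 2 3 1 4 1 2 3.
	fmt.Println(minMisses(3, []int{1, 2, 3, 1, 4, 1, 2, 3})) // prints 4
}
```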

Related

Using a slice instead of a list when working with large data volumes in Go

I have a question on the utility of slices in Go. I have just seen Why are lists used infrequently in Go? and Why use arrays instead of slices? but had some questions which I did not see answered there.
In my application:
I read a CSV file containing approx 10 million records, with 23 columns per record.
For each record, I create a struct and put it into a linked list.
Once all records have been read, the rest of the application logic works with this linked list (the processing logic itself is not relevant for this question).
The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need. Also, since I don't know the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Every Go tutorial or article I read seems to suggest that I should use slices and not lists (as a slice can do everything a list can, but do it better somehow). However, I don't see how or why a slice would be more helpful for what I need? Any ideas from anyone?
... approx 10 million records, with 23 columns per record ... The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need.
This contiguous memory is its own benefit as well as its own drawback. Let's consider both parts.
(Note that it is also possible to use a hybrid approach: a list of chunks. This seems unlikely to be very worthwhile here though.)
Also, since I don't know the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Clearly, if there are n records, and you allocate and fill in each one once (using a list), this is O(n).
If you use a slice, and allocate a single extra slice entry every time, you start with none, grow it to size 1, then copy the 1 entry to a new array of size 2 and fill in item #2, grow it to size 3 and fill in item #3, and so on. The first of the n entries is copied n times, the second is copied n-1 times, and so on, for n(n+1)/2 = O(n²) copies. But if you use a multiplicative expansion technique (which Go's append implementation does), this drops to O(log n) reallocations. Each reallocation copies more bytes, though, and the total ends up being O(n), amortized (see Why do dynamic arrays have to geometrically increase their capacity to gain O(1) amortized push_back time complexity?).
The space used with the slice is obviously O(n). The space used for the linked list approach is O(n) as well (though the records now require at least one forward pointer so you need some extra space per record).
So in terms of the time needed to construct the data and the space needed to hold it, it's O(n) either way, and you end up with the same total memory requirement. The main difference, at first glance anyway, is that the linked-list approach doesn't require contiguous memory.
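For what it's worth, the slice version of the load loop is just one append per record. Here is a minimal sketch; the file name and the record struct are hypothetical placeholders, and the initial capacity passed to make is only an optional guess (append works fine, amortized O(1), without it).

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
)

// record stands in for the asker's 23-column struct.
type record struct {
	fields []string
}

func main() {
	f, err := os.Open("data.csv") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)

	// Optionally pre-size the slice if a rough estimate is known.
	records := make([]record, 0, 1<<20)

	for {
		row, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		records = append(records, record{fields: row})
	}
	fmt.Println("loaded", len(records), "records")
}
```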
So: What do we lose when using contiguous memory, and what do we gain?
What we lose
The thing we lose is obvious. If we already have fragmented memory regions, we might not be able to get a contiguous block of the right size. That is, given:
used: 1 MB (starting at base, ending at base+1M)
free: 1 MB (starting at +1M, ending at +2M)
used: 1 MB (etc)
free: 1 MB
used: 1 MB
free: 1 MB
we have a total of 6 MB, 3 used and 3 free. We can allocate three 1 MB blocks, but we can't allocate one 3 MB block unless we can somehow compact the three "used" regions.
Since Go programs tend to run in virtual memory on large-memory-space machines (virtual sizes of 64 GB or more), this tends not to be a big problem. Of course everyone's situation differs, so if you really are VM-constrained, that's a real concern. (Other languages have compacting GC to deal with this, and a future Go implementation could at least in theory use a compacting GC.)
What we gain
The first gain is also obvious: we don't need pointers in each record. This saves some space; the exact amount depends on the size of the pointers, whether we're using singly or doubly linked lists, and so on. Let's just assume two 8-byte pointers, or 16 bytes per record. Multiply by 10 million records and we're looking pretty good here: we've saved 160 MB. (Go's container/list implementation uses a doubly linked list, and on a 64-bit machine, this is the size of the per-element threading needed.)
We gain something less obvious at first, though, and it's huge. Because Go is a garbage-collected language, every pointer is something the GC must examine at various times. The slice approach has zero extra pointers per record; the linked-list approach has two. That means that the GC system can avoid examining the nonexistent 20 million pointers (in the 10 million records).
Conclusion
There are times to use container/list. If your algorithm really calls for a list and is significantly clearer that way, do it that way, unless and until it proves to be a problem in practice. Or, if you have items that can be on some collection of lists—items that are actually shared, but some of them are on the X list and some are on the Y list and some are on both—this calls for a list-style container. But if there's an easy way to express something as either a list or a slice, go for the slice version first. Because slices are built into Go, you also get the type safety / clarity mentioned in the first link (Why are lists used infrequently in Go?).

Why is Bit-PLRU different from LRU?

The following is a description of Bit-PLRU:
Bit-PLRU stores one status bit for each cache line. We call these bits MRU-bits. Every access to a line sets its MRU-bit to 1, indicating that the line was recently used. Whenever the last remaining 0 bit of a set's status bits is set to 1, all other bits are reset to 0. At cache misses, the line with lowest index whose MRU-bit is 0 is replaced.
I think the replacement policies of LRU and Bit-PLRU are the same, and that their miss rates are also the same.
My reason: "At cache misses, the line with lowest index whose MRU-bit is 0 is replaced."
The line with the lowest index means the least recently used line, so its MRU-bit is definitely zero (it couldn't be 1, because it isn't recently used). So, is the MRU-bit redundant?
If my reasoning is wrong, could anyone give me a reason or an example showing where Bit-PLRU and LRU differ? Why does Bit-PLRU give better performance (miss rate)?
Thanks!
Least recently used means the line with the oldest access time. But keeping track of accesses so as to always know which one is the oldest may be complex in a cache context. Storing the full access order would require at least ceil(log2(n!)) bits, or, more practically, n*log2(n) bits, which is close for small n and much simpler to manage (for an 8-way set, ceil(log2(8!)) = 16 bits versus 8*3 = 24 bits). Whenever a line is accessed, it must be removed from the order list, put at the top, and the rest of the list updated. This may be complex to do in one cycle.
This is the reason why pseudo-LRU methods have been developed. They guarantee that an "ancient" line will be ejected, but not that the most ancient one will be.
Consider an example of the Bit-PLRU from your question. We assume that the initial set state is the following.
line (index)   status (MRU)   real order
0              0              3  (LRU)
1              1              0  (MRU)
2              1              1
3              0              2
The real order is not stored, but we will use it to understand the behavior of the algorithm (smallest is youngest).
Now, assume we access existing line 0. Status becomes
line (index)   status (MRU)   real order
0              1              0  (MRU)
1              1              1
2              1              2
3              0              3  (LRU)
Assume this is followed by a miss, so we apply the method and replace line 3:
Whenever the last remaining 0 bit of a set's status bits is set to 1, all other bits are reset to 0.
line (index)   status (MRU)   real order
0              0              1
1              0              2
2              0              3  (LRU)
3              1              0  (MRU)
So the algorithm has properly ejected LRU (line 3).
Assume that there is another miss. The algorithm states:
"At cache misses, the line with lowest index whose MRU-bit is 0 is replaced."
So line 0 will be replaced. But it is not the LRU line, which is line 2. It is even the "youngest" of the ancient lines; it simply has the lowest index.
Ejecting a better line would require additional information on the access times.
Some randomness in the ejection could be added cheaply, but finding the real LRU line is more complex.
Note that there are better ways to implement a pseudo-LRU. Tree-LRU, for instance, is much better, but it still does not guarantee that the real LRU line will be chosen.
For practical applications, pLRU gives a miss rate similar to real LRU, while being much simpler.
But even real LRU may not always be the best policy: if a line has been accessed frequently, it is likely to be accessed again, and it should probably not be replaced even if it is the LRU line.
So the most efficient methods extend pLRU by keeping track of the number of accesses, treating lines that have been accessed only once differently from lines that have been accessed twice or more. This way, whenever a line has to be ejected, lines that have been accessed only once are preferred.
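To make the bookkeeping concrete, here is a small Go sketch of one set under the Bit-PLRU policy quoted above; the tag values in main are arbitrary, and a real cache would of course do this in hardware.

```go
package main

import "fmt"

// bitPLRUSet models one cache set under Bit-PLRU: one MRU bit per line;
// when the last remaining 0 bit would be set, all other bits are cleared;
// on a miss, the lowest-indexed line with MRU bit 0 is the victim.
type bitPLRUSet struct {
	tags []int  // tag stored in each line (-1 = empty)
	mru  []bool // MRU bit per line
}

func newBitPLRUSet(ways int) *bitPLRUSet {
	s := &bitPLRUSet{tags: make([]int, ways), mru: make([]bool, ways)}
	for i := range s.tags {
		s.tags[i] = -1
	}
	return s
}

// touch marks line i as recently used, clearing the other bits
// if its bit was the last remaining 0 bit.
func (s *bitPLRUSet) touch(i int) {
	s.mru[i] = true
	for _, b := range s.mru {
		if !b {
			return // some 0 bit remains
		}
	}
	for j := range s.mru {
		s.mru[j] = j == i
	}
}

// access returns true on a hit, false on a miss (after filling a line).
func (s *bitPLRUSet) access(tag int) bool {
	for i, t := range s.tags {
		if t == tag {
			s.touch(i)
			return true
		}
	}
	// Miss: replace the lowest-indexed line whose MRU bit is 0.
	for i, b := range s.mru {
		if !b {
			s.tags[i] = tag
			s.touch(i)
			return false
		}
	}
	return false // not reached for 2+ ways: touch always leaves a 0 bit
}

func main() {
	set := newBitPLRUSet(4)
	for _, tag := range []int{10, 11, 12, 13, 10, 14, 15} {
		hit := set.access(tag)
		fmt.Printf("access %d: hit=%v mru=%v tags=%v\n", tag, hit, set.mru, set.tags)
	}
}
```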

Dirty bit value after changing data to original state

If the value in some part of the cache is 4 and we change it to 5, that sets the dirty bit for that data to 1. But what if we then set the value back to 4: will the dirty bit stay 1, or change back to 0?
I am interested in this because it could enable a higher-level optimization of the computer system when dealing with read-write operations between main memory and the cache.
In order for a cache to work the way you describe, it would need to reserve half of its data space to store the old values.
Caches are expensive precisely because they have a high cost per bit, and consider also that:
That mechanism would only detect a two-level write history (A -> B -> A), not anything deeper (like A -> B -> C -> A).
Every write would imply copying the current values into the old values.
The minimum amount of taggable data in a cache is a line, and the whole line would have to be changed back to its original value. Given that a line is on the order of 64 bytes, that is very unlikely to happen.
The hierarchical structure of the caches (L1, L2, L3, ...) is there exactly to mitigate the cost of evictions.
The solution you propose has few benefits compared to its costs, and thus is not implemented.

How can I force an L2 cache miss?

I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create benchmarks that gradually increase the working set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 requests per cycle) remain constant, while the ratio of L2 misses to L2 requests increases.
Can anyone show me an example of a C program which forces "N" L2 cache misses?
You can generally force cache misses at some cache level by randomly accessing a working set larger than that cache level [1].
You would expect the probability that any given load is a miss to be something like: p(hit) = min(1, C / W) and p(miss) = 1 - p(hit), where p(hit) and p(miss) are the probabilities of a hit and a miss, C is the relevant cache size, and W is the working set size. So for a miss rate of 50%, use a working set of twice the cache size.
A quick look at the formula above shows that p(miss) will never be 100%, since C/W only goes to 0 as W goes to infinity (and you probably can't afford an infinite amount of RAM). So your options are:
Getting "close enough" by using a very large working set (e.g., 4 GB gives you a 99%+ miss chance for a 256 KB), and pretending you have a miss rate of 100%.
Applying the formula to determine the actual expected number of misses. E.g., if you are using a working set of 2560 KB against an L2 cache of 256 KB, you have a miss rate of 90%. So if you want to examine the effect of 1,000 misses, you should make 1000 / 0.9 = ~1111 memory accesses to get about 1,000 misses.
Use any approximate approach but then actually count the number of misses you incur using the performance counter units on your CPU. For example, on Linux you could use PAPI or on Linux and Windows you could use Intel's PCM (if you are using Intel hardware).
Use an "almost random" approach to force the number of misses you want. The formula above is valid for random accesses, but if you choose you access pattern so that it is random with the caveat that it doesn't repeat "recent" accesses, you can get a 100% miss ratio. Here "recent" means accesses to cache lines that are likely to still be in the cache. Calculating what that means exactly is tricky, and depends in detail on the associativity and replacement algorithm of the cache, but if you don't repeat any access that has occurred in the last cache_size * 10 accesses, you should be pretty safe.
As for the C code, you should at least show us what you've tried. A basic outline is to create a vector of bytes or ints or whatever with the required size, then randomly access that vector. If you make each access dependent on the previous access (e.g., use the integer read to calculate the index of the next read), you will also get a rough measurement of the latency of that level of cache. If the accesses are independent, you'll probably have several outstanding misses to the cache at once and get more misses per unit time. Which one you are interested in depends on what you are studying.
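The question asks for C, but just to sketch the dependent-access ("pointer chasing") idea in code, here is the same structure in Go; the sizes below are arbitrary placeholders, and translating this to C with malloc and a plain loop is mechanical.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	// Working set much larger than a typical L2 (a few hundred KB),
	// so nearly every access should miss in L2 (and usually L3 too).
	const n = 16 << 20 // 16 Mi entries * 4 bytes = 64 MB

	// Build a random cyclic permutation so that each load depends on the
	// previous one (pointer chasing). This serializes the misses, which
	// also gives a rough per-access latency figure.
	perm := rand.Perm(n)
	next := make([]int32, n)
	for i := 0; i < n; i++ {
		next[perm[i]] = int32(perm[(i+1)%n])
	}

	const accesses = 10_000_000
	idx := int32(0)
	start := time.Now()
	for i := 0; i < accesses; i++ {
		idx = next[idx] // dependent, effectively random access
	}
	elapsed := time.Since(start)

	// Print idx so the chase cannot be optimized away.
	fmt.Printf("idx=%d, %.1f ns per access\n",
		idx, float64(elapsed.Nanoseconds())/accesses)
}
```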
For an open source project that does this kind of memory testing across different stride and working set sizes, take a look at TinyMemBench.
[1] This gets a bit trickier for levels of cache that are shared among cores (usually L3 for recent Intel chips, for example), but it should work well if your machine is pretty quiet while testing.

How to calculate the miss rate of Data and Instruction Caches

The Situation
I'm trying to answer an architecture question on instruction and data caches I have found in a past exam paper (not homework!).
The question seems to give a lot of information which I haven't used in my solution. This makes me think I'm missing something and would be really grateful if someone could help me out!
The (Full) Question
This is the full question as asked in the paper. For a summary of key points please see below.
Calculate the miss rate for a machine S with separate instruction cache and data cache, each of N bytes. There are I misses per K instructions for the instruction cache, and D misses per K instructions for the data cache.
A fraction X of instructions involve data transfer, while a fraction Y of instructions contain instruction references; the rest contain data references. A hit takes H cycles and the miss penalty is M cycles.
Key Question Points
Given:
Data and instruction caches are separate
Each cache has N bytes
I misses per K instructions for instruction cache
D misses per K instructions for the data cache
A fraction X of the instructions involve data transfer
A fraction Y of the instructions involve instruction references
The rest of the instructions contain data references
A hit takes H cycles
The miss penalty is M cycles
Calculate: Miss Rate of Machine
Attempts so far
I originally thought that the miss rate would be (I/K)*Y + (D/K)*(1 - X - Y) but since this doesn't use all the data provided, I don't think it can be right :(.
Any help would be awesome!
I think that you may be interpreting the question wrong, or the question is not well framed. The miss rate of a cache is simply the number of misses divided by the total number of accesses. The only thing I can think of which uses all the information in the question is calculating the average cost of a memory access, e.g. for a two-level cache hierarchy:
Hit time(L1) + Miss rate(L1) * (Hit time(L2) + Miss rate(L2) * Miss penalty(L2))
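For illustration, here is a tiny Go sketch that plugs made-up numbers into that formula; all of the values below are hypothetical.

```go
package main

import "fmt"

func main() {
	// Hypothetical example numbers, just to show how the terms combine.
	const (
		hitTimeL1     = 1.0   // cycles
		missRateL1    = 0.05  // 5% of L1 accesses miss
		hitTimeL2     = 10.0  // cycles
		missRateL2    = 0.20  // 20% of L2 accesses miss
		missPenaltyL2 = 100.0 // cycles to go to main memory
	)

	// Average memory access time for a two-level cache hierarchy.
	amat := hitTimeL1 + missRateL1*(hitTimeL2+missRateL2*missPenaltyL2)
	fmt.Printf("average memory access time: %.2f cycles\n", amat) // 2.50
}
```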

Resources