Calculate cache hit rate - caching

I am working through an old exam for the course and am having a hard time getting a grip on calculating the cache hit rate. Here is the question:
Suppose you have a 32-bit processor with a direct mapped instruction cache. The capacity of the cache is 2048 bytes and there are 256 sets. Assume that a program loop contains 5 instructions, including a conditional jump instruction, which is the last of the 5 instructions. The other 4 instructions are not jump instructions. The first instruction in the loop is located at memory address 0x40001000. What is the instruction cache hit rate when executing the loop 10 times? Does the example show temporal locality, spatial locality, or both?
The answer should be a hit rate of 47/50.
My attempt does not get far. All I can see is that 2048/256 gives a block size of 8 bytes, but I have no clue how to calculate the hit rate or where to start. An explanation of how this can be calculated would be much appreciated. Meanwhile, I am reading up on other topics to try to understand.
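One way to sanity-check the expected 47/50 is to simulate the instruction fetches. The sketch below is not from the exam; it assumes fixed 4-byte instructions (a 32-bit processor), an initially empty cache, and no other code evicting the loop. The function name `simulate_direct_mapped` is made up for illustration.

```python
def simulate_direct_mapped(addresses, num_sets, block_size):
    """Return (hits, accesses) for a direct-mapped cache fed with byte addresses."""
    tags = [None] * num_sets           # direct mapped: one block per set
    hits = 0
    for addr in addresses:
        block = addr // block_size     # memory block number
        index = block % num_sets       # set this block maps to
        tag = block // num_sets        # remaining upper address bits
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag          # miss: the whole 8-byte block is loaded
    return hits, len(addresses)

CAPACITY = 2048
NUM_SETS = 256
BLOCK_SIZE = CAPACITY // NUM_SETS      # 8 bytes, i.e. two instructions per block
INSTR_SIZE = 4                         # assumption: fixed 32-bit instructions
LOOP_START = 0x40001000

loop = [LOOP_START + i * INSTR_SIZE for i in range(5)]
trace = loop * 10                      # the loop body is fetched 10 times

hits, total = simulate_direct_mapped(trace, NUM_SETS, BLOCK_SIZE)
print(f"{hits}/{total}")               # prints 47/50
```

The five instructions span three 8-byte blocks, so the first iteration misses on instructions 1, 3 and 5 and hits on 2 and 4 (spatial locality). Iterations 2 to 10 hit on every fetch because the blocks are still cached (temporal locality), so under these assumptions the loop shows both.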


Cache blocking brings no improvement for image filter on ARM

I'm experimenting with cache blocking. To do that, I implemented two convolution-based smoothing algorithms, both using a 5x5 Gaussian kernel.
The first algorithm is just the simple double for loop, looping from left to right, top to bottom.
In the second algorithm I tried cache blocking by splitting the loops into chunks. I used a BLOCK size of 512x512.
I'm running the code on a Raspberry Pi 3B+, which has a Cortex-A53 with 32KB of L1 and 256KB of L2, I believe. I ran the two algorithms with different image sizes (2048x1536, 6000x4000, 12000x8000 and 16000x12000, all 8-bit grayscale images). But at every image size, the run times of the two algorithms were very similar.
My question is: shouldn't the first algorithm suffer memory access latency that the second avoids, especially with large images (like 12000x8000)? Based on the description of cache blocking in this link, when the first algorithm is processing data at the end of the image rows, the data at the beginning of those rows should already have been evicted from the L1 cache. Taking the 12000x8000 image as an example: since the kernel is 5x5, 5 rows of data are needed, which is 12000x5 = 60KB, already larger than the 32KB L1. When we start processing a new row, 4 rows of previous data are still needed, but they are likely gone from L1 and need to be re-fetched. The second algorithm shouldn't have this problem because the block size is small. Can anyone please tell me what I am missing?
I also profiled the algorithm using oprofile with the following data:
Algorithm 1
Algorithm 2
So it looks like the first algorithm does have more cache misses than the second, as reflected by the L1D_CACHE_REFILL counts. But it also has a higher data prefetching rate, which may be due to the simple behavior of the loop. So does the usual story of cache blocking not take data prefetching into account?
Conceptually, you're right: blocking will reduce cache misses by keeping the input window in cache.
I suspect the main reason you're not seeing a speedup is that the cache is prefetching from all 5 input rows. Your performance counters show more prefetch loads in the unblocked implementation. I suspect many textbook examples are out of date, since cache prefetching has kept getting better. Intel's L2 cache could already detect and prefetch from up to 16 linear streams about 10 years ago, I think.
Assume the filter takes 5 * 5 = 25 cycles per output pixel. On the RPi 3 that is 25 / 1.2 GHz = 20.8 ns. The I/O cost per output pixel is reading a 5-high column of new input pixels, so the amortized I/O cost is 5 bytes / 20.8 ns = 229 MiB/s, which is much less than the ~2 GiB/s DRAM bandwidth. So in theory, the relatively slow computation combined with prefetching (I'm not certain how effective it is) means that memory access isn't the bottleneck.
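The estimate above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the one-cycle-per-tap cost and the 1.2 GHz clock are the assumptions stated in this answer, not measured values.

```python
CLOCK_HZ = 1.2e9             # assumed core clock
CYCLES_PER_PIXEL = 5 * 5     # assumed: one cycle per kernel tap
NEW_BYTES_PER_PIXEL = 5      # one new 5-high column of 8-bit pixels per output pixel

time_per_pixel_s = CYCLES_PER_PIXEL / CLOCK_HZ
bandwidth_bytes_per_s = NEW_BYTES_PER_PIXEL / time_per_pixel_s
bandwidth_mib_per_s = bandwidth_bytes_per_s / 2**20

print(round(time_per_pixel_s * 1e9, 1), "ns per pixel")
print(round(bandwidth_mib_per_s), "MiB/s of new input data")
```

If the computation were, say, 8 times faster through vectorization, the required bandwidth would scale up by the same factor and get much closer to what DRAM can deliver.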
Try increasing the filter height: the cache can only detect and prefetch from a certain number of streams. Or try vectorizing the computation so that memory access becomes the bottleneck.

Writing a full cache line at an uncached address before reading it again on x64

On x64, if you write the contents of a full cache line to a previously uncached address within a short period of time, and then read from that address again soon after, can the CPU avoid having to read the old contents of that address from memory?
Effectively, it shouldn't matter what the previous contents of the memory were, because the full cache line's worth of data was overwritten. I can understand that a partial cache line write to an uncached address, followed by a read, would incur the overhead of having to synchronise with main memory, etc.
Looking at documentation regarding write allocate, write combining and snooping has left me a little confused about this matter. Currently I think that an x64 CPU cannot do this?
In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!
Basically, what happens on normally mapped memory (i.e., WB memory regions) is that the store(s) add several entries to the store buffer of the CPU. Since the associated memory isn't currently cached, these entries are going to linger for some time, because an RFO request has to pull that line into the cache before it can be written.
In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store is already in the store buffer for the same address and uses it as the result of the load, without needing to go to memory.
Now, store forwarding doesn't always work. In particular, it never works on any Intel (or likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10, and then read 4 bytes from address 9, only 3 bytes come from that write, and the byte at address 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.
In the past, there were many other cases that would also fail. For example, a smaller read that was fully contained in an earlier store would often fail: given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write, but it often would not forward because the hardware was not sophisticated enough to detect that case.
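To make the two examples above concrete, here is a small sketch that only classifies the address ranges of a store and a later load. The function `classify_load` is hypothetical and says nothing about what a particular microarchitecture actually does with each case; it just computes which bytes overlap.

```python
def classify_load(store_addr, store_size, load_addr, load_size):
    """Classify a load against one earlier store; returns (kind, overlapping_bytes)."""
    store_end = store_addr + store_size    # exclusive end of the stored range
    load_end = load_addr + load_size       # exclusive end of the loaded range
    overlap = max(0, min(store_end, load_end) - max(store_addr, load_addr))
    if overlap == 0:
        return "disjoint", 0
    if store_addr <= load_addr and load_end <= store_end:
        return "fully contained", overlap
    return "partial overlap", overlap

# 4-byte write to address 10, then a 4-byte read from address 9
print(classify_load(10, 4, 9, 4))     # ('partial overlap', 3)

# 4-byte write to address 10, then a 2-byte read from address 12
print(classify_load(10, 4, 12, 2))    # ('fully contained', 2)
```

The "partial overlap" result is the case described as never forwarding, while "fully contained" is the case that used to fail on older hardware.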
The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above successfully forward on modern CPUs. The gory details are well-covered, with pretty pictures, on stuffedcow and Agner also covers it well in his microarchitecture guide.
From the above linked document, here's what Agner says about store-forwarding on Skylake:
The Skylake processor can forward a memory write to a subsequent read
from the same address under certain conditions. Store forwarding is
one clock cycle faster than on previous processors. A memory write
followed by a read from the same address takes 4 clock cycles in the
best case for operands of 32 or 64 bits, and 5 clock cycles for other
operand sizes.
Store forwarding has a penalty of up to 3 clock cycles extra when an
operand of 128 or 256 bits is misaligned.
A store forwarding usually takes 4 - 5 clock cycles extra when an
operand of any size crosses a cache line boundary, i.e. an address
divisible by 64 bytes.
A write followed by a smaller read from the same address has little or
no penalty.
A write of 64 bits or less followed by a smaller read has a penalty of
1 - 3 clocks when the read is offset but fully contained in the
address range covered by the write.
An aligned write of 128 or 256 bits followed by a read of one or both
of the two halves or the four quarters, etc., has little or no
penalty. A partial read that does not fit into the halves or quarters
can take 11 clock cycles extra.
A read that is bigger than the write, or a read that covers both
written and unwritten bytes, takes approximately 11 clock cycles
The last case, where the read is bigger than the write, is definitely a case where store forwarding stalls. The quoted 11 cycles probably applies when all of the involved bytes are in L1, but when some bytes aren't cached at all (your scenario) it could of course take on the order of a DRAM miss, which can be hundreds of cycles.
Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.
There is an effect similar to what you mention with full cache lines, but it deals with write combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or using the non-temporal store instructions. The NT instructions are mostly targeted towards writing memory that won't soon be subsequently read, skipping the RFO overhead, and probably don't forward to subsequent loads.

How can I force an L2 cache miss?

I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create a benchmark that gradually increases the working set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 requests per cycle) remain constant, but the ratio of L2 misses to L2 requests increases.
Can anyone show me an example of a C program which forces "N" L2 cache misses?
You can generally force cache misses at some cache level by randomly accessing a working set larger than that cache level [1].
You would expect the hit and miss probabilities for any given load to be something like: p(hit) = min(1, C / W) and p(miss) = 1 - p(hit), where p(hit) and p(miss) are the probabilities of a hit and a miss, C is the relevant cache size, and W is the working set size. So for a miss rate of 50%, use a working set of twice the cache size.
A quick look at the formula above shows that p(miss) will never be 100%, since C/W only goes to 0 as W goes to infinity (and you probably can't afford an infinite amount of RAM). So your options are:
Getting "close enough" by using a very large working set (e.g., 4 GB gives you a 99%+ miss chance for a 256 KB cache), and pretending you have a miss rate of 100%.
Applying the formula to determine the actual expected number of misses. E.g., if you are using a working set size of 2560 KB against an L2 cache of 256 KB, you have a miss rate of 90%. So if you want to examine the effect of 1,000 misses, you should make 1000 / 0.9 = ~1111 memory accesses to get about 1,000 misses.
Using any approximate approach, but then actually counting the number of misses you incur using the performance counter units on your CPU. For example, on Linux you could use PAPI, or on Linux and Windows you could use Intel's PCM (if you are using Intel hardware).
Using an "almost random" approach to force the number of misses you want. The formula above is valid for random accesses, but if you choose your access pattern so that it is random with the caveat that it doesn't repeat "recent" accesses, you can get a 100% miss ratio. Here "recent" means accesses to cache lines that are likely to still be in the cache. Calculating what that means exactly is tricky, and depends in detail on the associativity and replacement algorithm of the cache, but if you don't repeat any access that has occurred in the last cache_size * 10 accesses, you should be pretty safe.
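The formula and the numbers quoted in the options above can be checked with a short script. This is just the idealized model for uniformly random accesses; the helper names are made up for illustration, and real caches will deviate from it depending on associativity and replacement policy.

```python
def p_hit(cache_size, working_set):
    """Idealized hit probability for uniformly random accesses."""
    return min(1.0, cache_size / working_set)

def p_miss(cache_size, working_set):
    return 1.0 - p_hit(cache_size, working_set)

def accesses_for_misses(target_misses, cache_size, working_set):
    """Expected number of accesses needed to incur target_misses misses."""
    return target_misses / p_miss(cache_size, working_set)

KB = 1024
L2_SIZE = 256 * KB   # example L2 size used in the text

print(p_miss(L2_SIZE, 2 * L2_SIZE))                              # 0.5
print(round(p_miss(L2_SIZE, 2560 * KB), 3))                      # 0.9
print(round(accesses_for_misses(1000, L2_SIZE, 2560 * KB)))      # 1111
print(round(p_miss(L2_SIZE, 4 * 1024**3), 5))                    # 4 GB working set
```

Note that the model gives p(miss) = 0 whenever the working set fits in the cache, which is why the min() clamps the hit probability at 1.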
As for the C code, you should at least show us what you've tried. A basic outline is to create a vector of bytes or ints or whatever with the required size, then to randomly access that vector. If you make each access dependent on the previous access (e.g., use the integer read to calculate the index of the next read) you will also get a rough measurement of the latency of that level of cache. If the accesses are independent, you'll probably have several outstanding misses to the cache at once, and get more misses per unit time. Which one you are interested in depends on what you are studying.
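The dependent-access outline can be sketched as follows. This is in Python only to show the structure; interpreter overhead dominates here, so it is not a usable benchmark, and the same layout in C (an array of indices, one load per step) is what you would actually time. The function names and the use of Sattolo's algorithm to build the index chain are my own choices, not something from the question.

```python
import random

def build_chain(n, seed=0):
    """Sattolo's algorithm: a random permutation forming one cycle of length n."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)               # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, steps, start=0):
    """Follow the chain; each index read depends on the previous one."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx

chain = build_chain(1 << 12)               # element count; scale to exceed the cache in C
print(chase(chain, len(chain)))            # a full lap returns to the start: prints 0
```

Because the chain is a single cycle, every element is visited exactly once before any element repeats, which is one simple way to get the "random but never recent" pattern when the working set is much larger than the cache.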
For an open source project that does this kind of memory testing across different stride and working set sizes, take a look at TinyMemBench.
[1] This gets a bit trickier for levels of cache that are shared among cores (usually L3 for recent Intel chips, for example), but it should work well if your machine is pretty quiet while testing.

Memory Access time in 2 level Paging

Consider a system with a two-level paging scheme in which a regular memory
access takes 150 nanoseconds, and servicing a page fault takes 8 ms.
An average instruction takes 100 ns of CPU time, and two memory
accesses. The TLB hit ratio is 90%, and the page fault rate is one in every 10,000
instructions. What is the effective average instruction execution time?
This was asked in GATE 2004. To solve the question, I would use the following approach:
T(average memory access) = 0.90(150) + 0.10(150 + 150 + 150) = 180 ns
(150 for the level-1 table, 150 for the level-2 table, and 150 for the memory access itself)
T(effective) = 100 + 2 * 180 + (1/10000) * 8 * 10^6 = 1260 ns
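Written out as a script, the calculation above looks like this. It is only a restatement of the same arithmetic, assuming the TLB lookup itself costs no time and that a TLB miss costs two extra memory accesses for the two page-table levels.

```python
MEM_NS = 150                 # one regular memory access
CPU_NS = 100                 # CPU time per instruction
TLB_HIT = 0.90
ACCESSES_PER_INSTR = 2
PAGE_FAULT_NS = 8e6          # 8 ms
FAULTS_PER_INSTR = 1 / 10_000

# TLB hit: 1 access; TLB miss: level-1 table + level-2 table + the access itself
avg_access_ns = TLB_HIT * MEM_NS + (1 - TLB_HIT) * (3 * MEM_NS)

effective_ns = (CPU_NS
                + ACCESSES_PER_INSTR * avg_access_ns
                + FAULTS_PER_INSTR * PAGE_FAULT_NS)

print(round(avg_access_ns), round(effective_ns))   # 180 1260
```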
Is this approach correct? I also have the following doubts:
There won't be a page fault when there is a TLB hit, because the most frequently used pages have to be in memory. Is that correct?
What is the size of the page table for a process? Say, for a 32-bit virtual address, do we allocate a page table with 2^32 entries for every process? How are memory limits managed in paging?
Please explain these concepts.
I would suggest the following.
The 100 ns for instruction execution stays as it is (no difference of opinion there).
The TLB hit ratio is given as 90%, so whenever there is a TLB miss we have to do 2 memory accesses for the page tables, since this is a 2-level paging scheme.
Irrespective of a TLB hit or miss, 2 * (150 + 8 * 10^6 * 1/20000) has to be added, which is the memory access time for the contents plus the overhead for page faults.
I think your expression assumes that, for an instruction, whenever a TLB hit occurs for the first access, it also occurs for the second. So you assume hit-hit or miss-miss, whereas the given TLB hit ratio of 90% is per access and not per instruction, so I feel all 4 possible combinations should be considered:
hit-hit, miss-miss, hit-miss, miss-hit
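The four combinations can be enumerated directly. The sketch below assumes the two accesses of an instruction hit or miss the TLB independently, each with the 90% hit ratio, and ignores page faults and the TLB lookup time; those are my assumptions for illustration. Under them, the weighted sum over the four cases comes out the same as twice the per-access average, so both ways of writing it give the same memory time per instruction.

```python
from itertools import product

MEM_NS = 150
TLB_HIT = 0.90

# Cost of one access: 1 memory access on a TLB hit, 3 on a miss (2-level walk)
cost = {"hit": MEM_NS, "miss": 3 * MEM_NS}
prob = {"hit": TLB_HIT, "miss": 1 - TLB_HIT}

# hit-hit, hit-miss, miss-hit, miss-miss, weighted by their probabilities
expected_two_accesses = sum(
    prob[a] * prob[b] * (cost[a] + cost[b])
    for a, b in product(cost, repeat=2)
)

per_access = sum(prob[k] * cost[k] for k in cost)

print(round(expected_two_accesses), round(2 * per_access))   # 360 360
```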

understanding CPI and cache access

These are previous homework problems, but I am using them as exam review. I am changing numbers around from what is actually in the problem. I just want to make sure I have a grasp on the concepts. I already have the answers, just need clarification that I understand them. This is not homework but review work.
Anyway, this focuses on aspects of CPI
The first problem:
An application running on a 1GHz processor has 30% load-store instructions, 30% arithmetic, and 40% branch instructions. The individual CPIs are 3 for load-store, 4 for arithmetic, 5 for branch instructions. Determine the overall CPI of this program on the given processor.
My answer: The overall CPI is the sum of the sub-CPIs, multiplied by the percentages in which they occur i.e. 3*0.3 + 4*0.3 + 5*0.4 = 0.9 + 1.2 + 2 = 4.1
Now, the processor is enhanced to run at 1.6GHz. The CPIs of the branch instructions remain the same but load-store and arithmetic instruction CPIs both increase to 6 cycles. A new compiler is in use which eliminates 30% of branch instructions and 10% of load-stores. Determine the new overall CPI and the factor by which the application will be faster or slower.
My answer: Once again, the new CPI is just the sum of its parts. However, the parts have changed and this must be accounted for. Branch instructions will drop by 30% (0.4*0.7=0.28) and load-stores will drop by 10% (0.3*0.9=0.27); arithmetic instructions will now account for the rest of the instructions (1-0.28-0.27=0.45), or 45%. These will be multiplied by the new sub-CPIs to get: 6*0.45+6*0.27+5*0.28=5.72.
Now, the processor enhancement is 60% faster, and the CPI is greater by (5.72-4.1)/4.1 = 39.5%. Thus, the application will run roughly 0.6*0.395 = 23.7% faster.
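One way to cross-check this part is to track instruction counts instead of fractions and compare execution times as instructions * CPI / clock rate. The sketch below assumes (this is not stated in the problem) that the number of arithmetic instructions is unchanged, so 100 original instructions shrink to 85 and the mix has to be renormalized. Under that assumption the figures come out differently from the ones above.

```python
def weighted_cpi(counts, cpis):
    """Overall CPI = total cycles / total instructions."""
    cycles = sum(counts[k] * cpis[k] for k in counts)
    return cycles / sum(counts.values())

# Per 100 original instructions
old_counts = {"load_store": 30, "arithmetic": 30, "branch": 40}
old_cpis = {"load_store": 3, "arithmetic": 4, "branch": 5}

# Compiler removes 10% of load-stores and 30% of branches (assumed: arithmetic unchanged)
new_counts = {"load_store": 30 * 0.9, "arithmetic": 30, "branch": 40 * 0.7}
new_cpis = {"load_store": 6, "arithmetic": 6, "branch": 5}

old_cpi = weighted_cpi(old_counts, old_cpis)
new_cpi = weighted_cpi(new_counts, new_cpis)

old_time = sum(old_counts.values()) * old_cpi / 1.0e9   # seconds at 1 GHz
new_time = sum(new_counts.values()) * new_cpi / 1.6e9   # seconds at 1.6 GHz

print(round(old_cpi, 2), round(new_cpi, 2), round(old_time / new_time, 2))
```

With these assumptions the new CPI is about 5.67 and the application runs about 1.36 times faster, because the higher clock rate and the smaller instruction count together outweigh the higher CPI.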
Now, the second problem:
A new processor with a load/store architecture has an ideal CPI of 1.25. Typical applications on this processor are a mix of 50% arithmetic and logic, 25% conditional branching and 25% load/store. Memory is accessed via a separate data and instruction cache, with a 5% instruction cache miss rate and 10% data miss rate. The penalty of any cache miss is 100 cycles and hits don't produce any penalties.
What is the effective CPI?
My answer: The effective CPI is the ideal CPI, plus the stalled cycles per instruction due to cache access. The ideal CPI is, as given, 1.25. The stalled cycles per instruction is (0.1*100*0.25) + (0.05*100*1) = 7.5. 0.1*100*0.25 is the data miss rate multiplied by the stalled cycle penalty which is also multiplied by the load/store percentage (which is where the data accesses take place); 0.05*100*1 is the instruction miss rate, which is the instruction cache miss rate times the stalled cycle penalty, instruction access take place in 100% of the program, so this is multiplied by 1. Following from this, the effective CPI is 1.25 + 7.5 = 8.75.
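The same calculation as a short script, using only the numbers given in the problem and assuming one instruction fetch per instruction and one data access per load/store:

```python
IDEAL_CPI = 1.25
MISS_PENALTY = 100        # cycles
I_MISS_RATE = 0.05
D_MISS_RATE = 0.10
LOAD_STORE_FRACTION = 0.25

# Every instruction is fetched; only load/store instructions access the data cache
instr_stalls = 1.0 * I_MISS_RATE * MISS_PENALTY
data_stalls = LOAD_STORE_FRACTION * D_MISS_RATE * MISS_PENALTY

stall_cycles_per_instr = instr_stalls + data_stalls
effective_cpi = IDEAL_CPI + stall_cycles_per_instr

print(round(stall_cycles_per_instr, 2), round(effective_cpi, 2))   # 7.5 8.75
```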
What is the misses per 1000 instruction for typical applications and what is the average memory access time (in clock cycles) for typical applications?
My answers: The misses per 1000 instructions is equal to the stalled cycles per instruction due to cache access (as given above: 7.5), divided by 1000, which equals 7.5/1000 = 0.0075
When discussing the average memory access time (AMAT), we first must talk about the total number of accesses here, which is the percentage of data accesses (25%) plus the percentage of instruction accesses (100%), or 125%=1.25. The data accesses are .25/1.25 and the instruction accesses are 1/1.25.
The AMAT equals the percentage of data accesses (.25/1.25) multiplied by the sum of the hit time (1) and the data miss rate multiplied by the miss penalty (0.1*100), or (.25/1.25)(1+0.1*100) and this is added to the percentage of instruction accesses (1/1.25) multiplied by the sum of the hit time (1) and the instruction miss rate multiplied by the miss penalty (0.05*100), or (1/1.25)(1+0.05*100). Put together, the AMAT is (.25/1.25)(1+0.1*100)+(1/1.25)(1+0.05*100)=7.
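For reference, the AMAT working above as a script. The hit time of 1 cycle is the value assumed in this working; the problem statement itself only says that hits produce no penalty.

```python
HIT_TIME = 1              # cycles, as assumed in the working above
MISS_PENALTY = 100        # cycles
I_MISS_RATE = 0.05
D_MISS_RATE = 0.10

instr_accesses = 1.0      # per instruction
data_accesses = 0.25      # per instruction (load/store fraction)
total_accesses = instr_accesses + data_accesses

amat = (
    (data_accesses / total_accesses) * (HIT_TIME + D_MISS_RATE * MISS_PENALTY)
    + (instr_accesses / total_accesses) * (HIT_TIME + I_MISS_RATE * MISS_PENALTY)
)

print(round(amat, 2))     # 7.0
```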
Once again, sorry for the wall of text. If I am wrong, please try to help me understand how I am wrong. I tried to show all my work to make it as easy as possible to understand. Thanks in advance.
There's an error in the last part of your question. When they ask:
What is the misses per 1000 instruction for typical applications and what is the average memory
access time (in clock cycles) for typical applications?
what's needed here is the number of misses you will get for every 1000 instructions, which in this case would be 1000*1*0.05 for instruction cache misses and 1000*0.25*0.1 for data cache misses. This equals 75 misses per 1000 instructions.
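The count of misses per 1000 instructions can be checked like this, assuming one instruction fetch per instruction and one data access per load/store instruction:

```python
INSTRUCTIONS = 1000
I_MISS_RATE = 0.05
D_MISS_RATE = 0.10
LOAD_STORE_FRACTION = 0.25

instr_misses = INSTRUCTIONS * 1.0 * I_MISS_RATE                   # every instruction is fetched
data_misses = INSTRUCTIONS * LOAD_STORE_FRACTION * D_MISS_RATE    # only load/stores touch data

misses_per_1000 = instr_misses + data_misses
print(round(instr_misses), round(data_misses), round(misses_per_1000))   # 50 25 75
```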
To calculate the AMAT, you use the formula AMAT = hit time + (miss rate*miss penalty)
In this case, your miss rate is 75/1000 and your miss penalty is 100 cycles. The hit time is given as 1.25 cycles (your ideal CPI!).
Hope this helps and all the best for your exam!
