Finding average memory access time, AMAT and global miss rate - caching

I'm quite confused about this question. So, I have IL1, DL1 and UL2 and when I try to find AMAT do I use the formula AMAT = Hit Time(1) + Miss Rate * (Hit time(2) + Miss Rate * Miss Penalty ? or Do I also add Hit time(3) because there are 3 miss rates
For Example: 0.4 + 0.1 * (0.8 + 0.05 * (10 + 0.02 * 48))
I used AMAT = Hit Time(1) + Miss Rate * (Hit time(2) + Miss Rate * (Hit time(3) + Miss Rate * Miss Penalty))
Here is the Table, and also Frequency is 2.5 GHZ and It is also provided that 20% of all instructions are of load/store type.
By the way are there also a way to find global miss rate of UL2 in %? I'm also quite stuck on that one too.

There are two different cache hierarchies to consider.  I cannot tell from your question post if you're trying to compute AMAT for just data operations (load & store) or for instruction access + data operations (20% of the them).
The hierarchies:
Instruction Cache: IL1 backed by UL2 backed by Main Memory
Data Cache: DL1 backed by UL2 backed by Main Memory
There is a stated hit time & miss rate associated with each individual cache, and, this is necessary because the caches are of different construction and size (and also in different positions in the hierarchy).
All instructions participate in accessing of the Instruction Cache, so hit/miss there applies to every instruction regardless of the nature or type of the instruction.  So, you can compute the AMAT for instruction access alone generally using the IL1->UL2->Main Memory hierarchy — be sure to use the specific hit time and miss rate for each given level in the hierarchy: 1clk & 10% for IL1; 25clk & 2% for UL2; and 120clk & 0% for Main Memory.
20% of the instructions participate in accessing of the Data Cache.
Of those that do data accesses, you can compute that component of AMAT using the DL1->UL2->Main Memory hierarchy — here you have DL1 with 2clk & 5%; UL2 with 25clk & 2%; and Main Memory with 120clk & 0%.
These numbers can be combined to an overall value that accounts for 100% of the instructions incurring the instructions cache hierarchy AMAT, and 20% of them incurring the data cache hierarchy AMAT.
As needed you can convert AMAT in cycles/clocks to AMAT in (nano) seconds.

Related

how do you find the miss penalty of a single level cache?

I am attempting to find the average memory access time (AMAT) of a single level cache. In order to do so, miss penalty must be calculated since the AMAT formula requires it.
Doing this for a multilevel cache requires using the next level cache penalty. But for a single level, there is obviously no other cache level.
So how is this calculated?
formula:
AMAT = HIT-TIME + MISS-RATE * MISS-PENALTY
You have the correct formula to calculate the AMAT, however you may be misinterpreting the components of the formula. Let’s take a look at how to use this equation, first with a single-level cache and next with a multi-level cache.
Suppose you have a single-level cache. Hit time represents the amount of time required to search and retrieve data from the cache. Miss rate denotes the percentage of the data requested that does not reside in the cache i.e. the percentage of data you have to go to main memory to retrieve. Miss penalty is the amount of time required to retrieve the data once you miss in the cache. Because we are dealing with a single-level cache, the only other level in the memory hierarchy to consider is main memory for the miss penalty.
Here’s a good example for single-level cache:
L1 cache has an access time of 5ns and a miss rate of 50%
Main memory has an access time of 500ns
AMAT = 5ns + 0.5 * 500ns = 255ns
You always check the cache first so you always incur a 5 ns hit time overhead. Because our miss rate is 0.5, we find what we are looking for in the L1 cache half the time and must go to main memory the remaining half time. You can calculate the miss penalty in the following way using a weighted average:
(0.5 * 0ns) + (0.5 * 500ns) = (0.5 * 500ns) = 250ns.
Now, suppose you have a multi-level cache i.e. L1 and L2 cache. Hit time now represents the amount of time to retrieve data in the L1 cache. Miss rate is an indication of how often we miss in the L1 cache. Calculating the miss penalty in a multi-level cache is not as straightforward as before because we need to consider the time required to read data from the L2 cache as well as how often we miss in the L2 cache.
Here’s a good example:
L1 cache has an access time of 5 ns and miss rate of 50%
L2 cache has an access time of 50 ns and miss rate of 20%
Main memory has an access time of 500 ns
AMAT = 5ns + 0.5 * (50ns + 0.2 * 500ns) = 80 ns
Again, you always check the L1 cache first so you always incur a 5 ns hit time overhead. Because our miss rate is 0.5, we find what we are looking for in the L1 cache half the time and must down the memory hierarchy (L2 cache, main memory) the remaining half time. If we do not find the data in the L1 cache, we always look in the L2 cache next. We thus incur a 50 ns hit time overhead every time we miss in the L1 cache. In the case that the data is not in the L2 cache also (which is 20% of the time), we must go to main memory which has a memory access time of 500 ns.

How can I force an L2 cache miss?

I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create a benchmarks that gradually increase the working set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 request per cycle) remain constant, but the ratio of L2 misses to L2 requests increases.
Can anyone show me an example of C program which forces "N" numbers of L2 cache misses?
You can generally force cache misses at some cache level by randomly accessing a working set larger than that cache level1.
You would expect the probability of any given load to be a miss to be something like: p(hit) = min(100, C / W), and p(miss) = 1 - p(hit) where p(hit) and p(miss) are the probabilities of a hit and miss, C is the relevant cache size, and W is the working set size. So for a miss rate of 50%, use a working set of twice the cache size.
A quick look at the formula above shows that p(miss) will never be 100%, since C/W only goes to 0 as W goes to infinity (and you probably can't afford an infinite amount of RAM). So your options are:
Getting "close enough" by using a very large working set (e.g., 4 GB gives you a 99%+ miss chance for a 256 KB), and pretending you have a miss rate of 100%.
Applying the formula to determine the actual expected number of misses. E.g., if you are using a working size of 2560 KB against an L2 cache of 256 KB, you have a miss rate of 90%. So if you want to examine the effect of 1,000 misses, you should make 1000 / 0.9 = ~1111 memory access to get about 1,000 misses.
Use any approximate approach but then actually count the number of misses you incur using the performance counter units on your CPU. For example, on Linux you could use PAPI or on Linux and Windows you could use Intel's PCM (if you are using Intel hardware).
Use an "almost random" approach to force the number of misses you want. The formula above is valid for random accesses, but if you choose you access pattern so that it is random with the caveat that it doesn't repeat "recent" accesses, you can get a 100% miss ratio. Here "recent" means accesses to cache lines that are likely to still be in the cache. Calculating what that means exactly is tricky, and depends in detail on the associativity and replacement algorithm of the cache, but if you don't repeat any access that has occurred in the last cache_size * 10 accesses, you should be pretty safe.
As for the C code, you should at least show us what you've tried. A basic outline is to create a vector of bytes or ints or whatever with the required size, then to randomly access that vector. If you make each access dependent on the previous access (e.g., use the integer read to calculate the index of the next read) you will also get a rough measurement of the latency of that level of cache. If the accesses are independent, you'll probably have several outstanding misses to the cache at once, and get more misses per unit time. Which one you are interested in depend on what you are studying.
For an open source project that does this kind of memory testing across different stride and working set sizes, take a look at TinyMemBench.
1 This gets a bit trickier for levels of caches that are shared among cores (usually L3 for recent Intel chips, for example) - but it should work well if your machine is pretty quiet while testing.

Finding Average Penalty from AMAT

I can calculate penalty when I have a single cache. But I'm unsure what to do when I am presented with two L1 caches (one for data and one for instruction) that are accessed in parallel. I'm also unsure what to do when I'm presented with clock cycles instead of actual time such as ns.
How do I calculate the average miss penalty using these new parameters?
Do I just use the formula two times and then average the miss penalty or is there more to this?
AMAT = hit time + miss rate * miss penalty
For example I have the following values:
AMAT = 4 clock cycles
L1 data access = 2 clock cycle (also hit time)
L1 instruction access = 2 clock cycle (also hit time)
60% of instructions are loads and stores
L1 instruction miss rate = 1%
L1 data miss rate = 3%
How would these values fit into AMAT?
Short answer
The average memory access time (AMAT) is typically calculated by taking the total number of instructions and dividing it by the total number of cycles spent servicing the memory request.
Details
On page B-17 of Computer Architecture a Quantiative Approach, 5th edition AMAT is defined as:
Average memory access time = % instructions x (Hit time + instruction miss rate x miss penalty) + % data x (Hit time + Data miss rate x miss penalty)`.
As you can see in this formula each instruction counts for a single memory access and the instructions that operate on data (load/store) constitute an additional memory access.
Note that there are many simplifying instructions that are made when using AMAT, and depending on the performance analysis that you want to perform. The same textbook I quotes earlier notes that:
In summary, although the state of the art in defining and measuring
memory stalls for out-of-order processors is complex, be aware of the
issues because they significantly affect performance. The complexity
arises because out-of-order processors tolerate some latency due to
cache misses without hurting performance. Consequently, designers
normally use simulators of the out-of-order processor and memory when
evaluating trade-offs in the memory hierarchy to be sure that an
improvement that helps the average memory latency actually helps
program performance.
My point of including this quote is that in practice AMAT is used for getting an approximate comparison between various different option. And as a result there are always simplifying assumptions used. But generally the memory accesses for instructions and data are added together to get a total number of accesses when calculating AMAT, rather than being calculated separately.
The way I see it, since the L1 Instruction Cache and the L1 Data Cache are accessed in parallel, you should compute AMAT for Instructions and AMAT for data, and then take the largest value as the final AMAT.
In your example since the Data Miss Rate is higher than Instruction Miss Rate you can consider that during the time the CPU waits for data, it solves all the misses on the instruction cache.
If the measure unit is cycles you do the same as if it were nanoseconds. If you know the frequency of your processor, you can convert back the AMAT in nanoseconds.

Memory access time analytical modelling

There is this question regarding solving the AMAT(Average Memory Access Time) given these data:
Legends: Cache Level 1 = L1 Cache Level 2 = L2 Main Memory = M
L1, L2 and M's Hit Time are 1, 10 and 100 respectively whilst
L1 Miss Rate is 5%, L2 5% and M 50%.
Find the AMAT in clock cycles.
After attempting to solve this question, here is my solution:
AMAT's formula is = Hit Time X Hit Rate + Miss Penalty * Miss Rate
Miss Penalty = AMAT for the next cache(say for example, AMAT of L2)
So I manipulated the formula, resulting into something like this:
AMAT = Hit Time L1 X Hit Rate L1 + AMAT L2 * Miss Rate L1
AMAT L2 = Hit Time L2 X Hit Rate L2 + AMAT M * Miss Rate L2
AMAT M = Hit Time M X Hit Rate M + [???] * Miss Rate M
providing the numerical value for the said formula would look like this:
AMAT = 1 X .95 + AMAT L2 * .05
AMAT L2 = 10 X .95 + AMAT M * .05
AMAT M = 100 X .5 + [???] * .5
So my first question would be, is my formula correct?
Next, how to get M's Miss Penalty?
Your "cascading" deductions are correct. If you miss L1 then the data has to be fetched from L2, involving a penalty, and if you miss L2 the data has to be fetched from RAM involving a higher penalty. Based on this your method of computing AMAT is correct.
Now, let's look a bit at the computer architecture. After L1 you have L2, after L2 there can be an L3 or it can be a RAM. After RAM you have the persistent storage (Hard disk).
However, the access time towards the HDD is very difficult to compute. Majority of the profilers (e.g perf, oprofile) just give you miss rates for the cache and assume that the rest of reads are taken directly from the RAM. Usually this is the case in modern computers. Reads from hard drives are marked as I/O requests. Why is it hard to compute the HDD access time? Because your data block can be at any location on your HDD, depending fragmentation. In case of spinning disks the access time can be lower f the data is located near the magnetic reader. So usually HDD access time is hard to predict when computing AMAT.
This means that if the problem statement gives you only L1, L2, M hit time, you cannot compute HDD hit time. However you can measure/estimate it in real life applications, using profilers.
So in short:
Is your formula correct? Yes - From mathematical point of view. From practical point of view, it is hard to use and generally does not make sense if M hit rate is not 100%.
How to compute M miss penalty? If it is not given you cannot compute it. In practice, by analyzing profiler output you can determine it for a specific measurement (no guaranteed that it will be the same for another measurement, because of the reasons described above).
Cheers.

understanding CPI and cache access

These are previous homework problems, but I am using them as exam review. I am changing numbers around from what is actually in the problem. I just want to make sure I have a grasp on the concepts. I already have the answers, just need clarification that I understand them. This is not homework but review work.
Anyway, this focuses on aspects of CPI
The fist problem:
An application running on a 1GHz processor has 30% load-store instructions, 30% arithmetic, and 40% branch instructions. The individual CPIs are 3 for load-store, 4 for arithmetic, 5 for branch instructions. Determine the overall CPI of this program on the given processor.
My answer: The overall CPI is the sum of the sub-CPIs, multiplied by the percentages in which they occur i.e. 3*0.3 + 4*0.3 + 5*0.4 = 0.9 + 1.2 + 2 = 4.1
Now, the processor is enhanced to run at 1.6GHz. The CPIs of the branch instructions remain the same but load-store and arithmetic instruction CPIs both increase to 6 cycles. A new compiler is in use which eliminates 30% of branch instructions and 10% of load-stores. Determine the new overall CPI and the factor by which the application will be faster or slower.
My answer: Once again, the new CPI is just the sum of its parts. However, the parts have changed and this must be accounted for. Branch instructions will drop by 30% (0.4*0.7=0.28) and load-stores will drop by 10% (0.3*0.9=0.27); arithmetic instructions will now account for the rest of the instructions (1-0.28-0.27=0.45), or 45%. These will be multiplied by the new sub-CPIs to get: 6*0.45+6*0.27+5*0.28=5.72.
Now, the processor enhancement is 60% faster, and the CPI is greater by (5.72-4.1)/4.1 = 39.5%. Thus, the application will run roughly 0.6*0.395 = 23.7% faster.
Now, the second problem:
A new processor with a load/store architecture has an ideal CPI of 1.25. Typical applications on this processor are a mix of 50% arithmetic and logic, 25% conditional branching and 25% load/store. Memory is accessed via a separate data and instruction cache, with a 5% instruction cache miss rate and 10% data miss rate. The penalty of any cache miss is 100 cycles and hits don't produce any penalties.
What is the effective CPI?
My answer: The effective CPI is the ideal CPI, plus the stalled cycles per instruction due to cache access. The ideal CPI is, as given, 1.25. The stalled cycles per instruction is (0.1*100*0.25) + (0.05*100*1) = 7.5. 0.1*100*0.25 is the data miss rate multiplied by the stalled cycle penalty which is also multiplied by the load/store percentage (which is where the data accesses take place); 0.05*100*1 is the instruction miss rate, which is the instruction cache miss rate times the stalled cycle penalty, instruction access take place in 100% of the program, so this is multiplied by 1. Following from this, the effective CPI is 1.25 + 7.5 = 8.75.
What is the misses per 1000 instruction for typical applications and what is the average memory access time (in clock cycles) for typical applications?
My answers: The misses per 1000 instructions is equal to the stalled cycles per instruction due to cache access (as given above: 7.5), divided by 1000, which equals 7.5/1000 = 0.0075
When discussing the average memory access time (AMAT), we first must talk about the total number of accesses here, which is the percentage of data accesses (25%) plus the percentage of instruction accesses (100%), or 125%=1.25. The data accesses are .25/1.25 and the instruction accesses are 1/1.25.
The AMAT equals the percentage of data accesses (.25/1.25) multiplied by the sum of the hit time (1) and the data miss rate multiplied by the miss penalty (0.1*100), or (.25/1.25)(1+0.1*100) and this is added to the percentage of instruction accesses (1/1.25) multiplied by the sum of the hit time (1) and the instruction miss rate multiplied by the miss penalty (0.05*100), or (1/1.25)(1+0.05*100). Put together, the AMAT is (.25/1.25)(1+0.1*100)+(1/1.25)(1+0.05*100)=7.
Once again, sorry for the wall of text. If I am wrong, please try to help me understand how I am wrong. I tried to show all my work to make it as easy as possible to understand. Thanks in advance.
There's an error in the lat part of your question. When they ask:
What is the misses per 1000 instruction for typical applications and what is the average memory
access time (in clock cycles) for typical applications?
what's needed here is the number of misses you will get for every 1000 instructions, which in this case would be 1000*1*0.05 for instruction cache misses and 1000*0.25*0.1 for data cache misses. This equals 75 misses per 1000 instructions.
To calculate the AMAT, you use the formula AMAT = hit time + (miss rate*miss penalty)
In this case, your miss rate is 75/1000 and your miss penalty is 100 cycles. The hit time is given as 1.25 cycles (your ideal CPI!).
Hope this helps and all the best for your exam!

Resources