Calculating Effective CPI when using write-through/write-back architecture - caching

So I'm trying to understand a homework problem given by an instructor and I'm honestly lost. I understand the concept of write-through/write-back, etc., but I can't figure out the actual calculations needed for the effective CPI. Could anyone give me a hand? The problem follows:
The following table provides the statistics of a cache for a
particular program. It is known that the base CPI (without cache
misses) is 1. It is also known that the memory bus bandwidth (the
bandwidth to transfer data between cache and memory) is 4 bytes per
cycle, and it takes one cycle to send the address before data
transfer. The memory spends 10 cycles to store data from bus or fetch
data to bus. The clock rate used by memory and the bus is a quarter of
the CPU clock rate.
Data reads per 1000 instructions: 100
Data writes per 1000 instructions: 150
Instruction cache miss rate: 0.4%
Data cache miss rate: 3%
Block size in bytes: 32

The effective CPI is the base CPI plus the CPI contribution from cache misses.
The cache miss CPI is the sum of the instruction cache CPI and the data cache CPI.
The cache miss cost is the cost of reading or writing a block to memory, so we need that first.
The cost in bus cycles is 1 (for the address) + 10 (memory busy time) + 8 (32-byte block divided by 4 bytes/cycle) = 19 bus cycles. The bus runs at a quarter of the CPU clock rate, so multiply by 4 to get CPU cycles: 76 CPU cycles per miss.
So the cost of I-cache misses is .004 * 76 = .304 cycles per instruction.
The cost of D-cache misses is (.10 + .15) * .03 * 76 = .57 cycles per instruction.
So the effective CPI is 1 + .304 + .57 = 1.874.
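The arithmetic above can be checked with a short Python sketch. It follows the same simplification as the answer: every miss, instruction or data, read or write, is assumed to cost one full block transfer, and extra write-through or write-back traffic is not modelled. The constant and function names are illustrative only.

```python
# Parameters taken from the problem statement.
BASE_CPI = 1.0
BUS_BYTES_PER_CYCLE = 4
ADDRESS_CYCLES = 1            # bus cycles to send the address
MEMORY_CYCLES = 10            # bus cycles the memory needs to fetch/store
BLOCK_BYTES = 32
CPU_CYCLES_PER_BUS_CYCLE = 4  # bus clock is a quarter of the CPU clock
ICACHE_MISS_RATE = 0.004
DCACHE_MISS_RATE = 0.03
DATA_ACCESSES_PER_INSTR = (100 + 150) / 1000  # reads + writes


def miss_penalty_cpu_cycles():
    """Cost of one block transfer, expressed in CPU cycles."""
    transfer_cycles = BLOCK_BYTES / BUS_BYTES_PER_CYCLE
    bus_cycles = ADDRESS_CYCLES + MEMORY_CYCLES + transfer_cycles
    return bus_cycles * CPU_CYCLES_PER_BUS_CYCLE


def effective_cpi():
    """Base CPI plus instruction-cache and data-cache miss contributions."""
    penalty = miss_penalty_cpu_cycles()
    icache_cpi = ICACHE_MISS_RATE * penalty  # one fetch per instruction
    dcache_cpi = DATA_ACCESSES_PER_INSTR * DCACHE_MISS_RATE * penalty
    return BASE_CPI + icache_cpi + dcache_cpi


print(miss_penalty_cpu_cycles())   # 76.0
print(round(effective_cpi(), 3))   # 1.874
```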


Memory access time in deep cache hierarchy

I am trying to solve an exercise question from a computer architecture textbook. The book includes the equation for calculating average memory access time (AMAT) for up to an L2 cache (equation below). However, the exercise goes up to an L4 cache plus off-chip memory, and I don't understand how to use the equation to calculate the average memory access time in that case.
So, Average memory access time = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)
The exercise describes a cache hierarchy as [32 KB L1; 128 KB L2; 2 MB L3; 8 MB L4; off-chip memory], for which the memory access time needs to be calculated.
Given cache size/latency/misses per thousand instructions values: 32 KB/1/100, 128 KB/2/80, 512 KB/4/50, 2 MB/8/40, 8 MB/16/10. Off-chip memory access requires 200 cycles on average. Also, per 1000 instructions of a program, an average of 20 memory accesses may exhibit low enough locality that they can't be serviced by the 2 MB cache, which has 20 misses per thousand instructions.
Could anyone help me to solve the problem?
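The two-level equation extends to deeper hierarchies by nesting: the miss penalty of each level is the average access time of the level below it. Here is a minimal Python sketch of that nesting. The function name `amat` and the miss rates in the example are hypothetical. The exercise gives misses per thousand instructions, and turning those into per-access (local) miss rates needs the number of memory accesses per instruction, which is not stated here.

```python
def amat(levels, memory_latency):
    """Average memory access time for a multi-level cache hierarchy.

    levels: list of (hit_time, local_miss_rate) tuples, L1 first.
            The local miss rate is misses at that level divided by
            accesses that reach that level.
    memory_latency: cycles for an access that misses every cache.
    """
    penalty = memory_latency
    # Work upwards from the last cache level: each level's miss penalty
    # is the average access time of everything below it.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


# Two levels reproduces the textbook formula:
# 1 + 0.1 * (10 + 0.5 * 100) = 7
print(amat([(1, 0.1), (10, 0.5)], 100))

# Four levels with made-up miss rates and a 200-cycle memory:
# 1 + 0.1 * (2 + 0.5 * (8 + 0.5 * (16 + 0.25 * 200))) = 3.25
print(amat([(1, 0.1), (2, 0.5), (8, 0.5), (16, 0.25)], 200))
```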

CPI calculation

I have plotted a graph of cache miss rate (mr) vs. cache size (sc). How can the CPI (cycles per instruction) be calculated for various cache sizes?
Assumptions are :
Given cache miss latency (say 10 ) ,
base CPI of 1 and
33.33% of instructions as memory operations.
What I understand is that the CPI can be calculated using the following formula. Is the below method correct?
CPI = miss rate*(.3333)*10 + 1
For a miss rate of 2.700978% (0.02700978),
I got the following CPI:
CPI: 1.090024
To calculate CPI when given a baseline CPI and statistics about cache hierarchy you can use the following formula:
Effective CPI = Baseline CPI + CPI of memory accesses
Your baseline CPI is 1 (given in the problem statement). So you just need to find the CPI of the memory accesses.
If the memory access is a hit in the cache then we assume that the CPI is the same as the baseline CPI. If it is a miss then it will be the miss latency.
So you have 33% of instructions that are memory accesses. Of those the ones that are misses will take 10 cycles. So putting all of this together you get:
CPI = miss rate*(.3333)*10 + 1
Which is what you have in your question.
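As a quick check, the formula can be evaluated in a few lines of Python. This sketch reads the quoted miss rate of 2.700978 as a percentage, which is an assumption, but it is the reading that reproduces the CPI of 1.090024. Like the formula itself, it counts only data-access misses and ignores instruction-fetch misses.

```python
BASE_CPI = 1.0
MEM_OP_FRACTION = 0.3333   # fraction of instructions that access memory
MISS_LATENCY = 10          # cycles per cache miss


def cpi(miss_rate):
    """Effective CPI for a miss rate given as a fraction (0.05 == 5%)."""
    return BASE_CPI + miss_rate * MEM_OP_FRACTION * MISS_LATENCY


print(f"{cpi(0.02700978):.6f}")   # 1.090024

# One CPI value per point on a miss-rate-vs-cache-size curve
# (these miss rates are placeholders, not measured data).
for miss_rate in (0.01, 0.05, 0.10):
    print(miss_rate, round(cpi(miss_rate), 4))
```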
As for the code you included and the "Answer" section I don't know what you are asking about there or what its purpose is.

Calculating actual/effective CPI for 3 level cache

(a) You are given a memory system that has two levels of cache (L1 and L2). Following are the specifications:
Hit time of L1 cache: 2 clock cycles
Hit rate of L1 cache: 92%
Miss penalty to L2 cache (hit time of L2): 8 clock cycles
Hit rate of L2 cache: 86%
Miss penalty to main memory: 37 clock cycles
Assume for the moment that hit rate of main memory is 100%.
Given a 2000 instruction program with 37% data transfer instructions (loads/stores), calculate the CPI (Clock Cycles per Instruction) for this scenario.
For this part, I calculated it like this (am I doing this right?):
(m1: miss rate of L1, m2: miss rate of L2)
AMAT = HitTime_L1 + m1*(HitTime_L2 + m2*MissPenalty_L2)
CPI(actual) = CPI(ideal) + (AMAT - CPI(ideal))*AverageMemoryAccess
(b) Now let's add another level of cache, i.e., an L3 cache between the L2 cache and the main memory. Consider the following:
Miss penalty to L3 cache (hit time of L3 cache): 13 clock cycles
Hit rate of L3 cache: 81%
Miss penalty to main memory: 37 clock cycles
Other specifications remain as part (a)
For the same 2000 instruction program (which has 37% data transfer instructions), calculate the CPI.
(m1: miss rate of L1, m2: miss rate of L2, m3: miss rate of L3)
AMAT = HitTime_L1
+ m1*(HitTime_L2 + m2*MissPenalty_L2)
+ m2*(HitTime_L3 + m3*MissPenalty_L3)
Is this formula correct and where do I add the miss penalty to main memory in this formula?
It should probably be added with the miss penalty of L3 but I am not sure.
(a) The AMAT calculation is correct if you notice that the MissPenalty_L2 parameter is what you called Miss penalty to main memory.
The CPI is a bit more difficult.
First of all, let's assume that the CPU is not pipelined (sequential processor).
There are 1.37 memory accesses per instruction (one access to fetch the instruction and 0.37 due to data transfer instructions). The ideal case is that all memory accesses hit in the L1 cache.
So, knowing that:
CPI(ideal) = CPI(computation) + CPI(mem) =
CPI(computation) + Memory_Accesses_per_Instruction*HitTime_L1 =
CPI(computation) + 1.37*HitTime_L1
With real memory, the average memory access time is AMAT, so:
CPI(actual) = CPI(computation) + Memory_Accesses_per_Instruction*AMAT =
CPI(ideal) + Memory_Accesses_per_Instruction*(AMAT - HitTime_L1) =
CPI(ideal) + 1.37*(AMAT - HitTime_L1)
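Plugging the part (a) numbers into these formulas gives the following Python sketch. CPI(computation) is not given in the problem, so the sketch reports only the AMAT and the increase over CPI(ideal). The variable names are illustrative.

```python
HIT_TIME_L1 = 2            # cycles
MISS_RATE_L1 = 1 - 0.92    # L1 hit rate is 92%
HIT_TIME_L2 = 8            # cycles (miss penalty to L2)
MISS_RATE_L2 = 1 - 0.86    # L2 hit rate is 86%
MEMORY_PENALTY = 37        # cycles (miss penalty to main memory)
ACCESSES_PER_INSTR = 1 + 0.37   # instruction fetch + loads/stores

# Two-level AMAT: an L1 miss pays the L2 hit time, and an L2 miss
# additionally pays the main-memory penalty.
amat = HIT_TIME_L1 + MISS_RATE_L1 * (HIT_TIME_L2 + MISS_RATE_L2 * MEMORY_PENALTY)

# Extra cycles per instruction relative to the all-L1-hits ideal.
extra_cpi = ACCESSES_PER_INSTR * (amat - HIT_TIME_L1)

print(round(amat, 4))        # 3.0544
print(round(extra_cpi, 4))   # 1.4445
```

Note that the program length (2000 instructions) drops out: CPI is a per-instruction average.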
(b) Your AMAT calculation is wrong. A miss at L2 is followed by an L3 access, which can itself be a hit or a miss, so the L3 term has to be nested inside the L2 miss term rather than added alongside it. Try to finish the exercise yourself.

understanding CPI and cache access

These are previous homework problems, but I am using them as exam review. I am changing numbers around from what is actually in the problem. I just want to make sure I have a grasp on the concepts. I already have the answers, just need clarification that I understand them. This is not homework but review work.
Anyway, this focuses on aspects of CPI
The first problem:
An application running on a 1GHz processor has 30% load-store instructions, 30% arithmetic, and 40% branch instructions. The individual CPIs are 3 for load-store, 4 for arithmetic, 5 for branch instructions. Determine the overall CPI of this program on the given processor.
My answer: The overall CPI is the sum of the sub-CPIs, multiplied by the percentages in which they occur i.e. 3*0.3 + 4*0.3 + 5*0.4 = 0.9 + 1.2 + 2 = 4.1
Now, the processor is enhanced to run at 1.6GHz. The CPIs of the branch instructions remain the same but load-store and arithmetic instruction CPIs both increase to 6 cycles. A new compiler is in use which eliminates 30% of branch instructions and 10% of load-stores. Determine the new overall CPI and the factor by which the application will be faster or slower.
My answer: Once again, the new CPI is just the sum of its parts. However, the parts have changed and this must be accounted for. Branch instructions will drop by 30% (0.4*0.7=0.28) and load-stores will drop by 10% (0.3*0.9=0.27); arithmetic instructions will now account for the rest of the instructions (1-0.28-0.27=0.45), or 45%. These will be multiplied by the new sub-CPIs to get: 6*0.45+6*0.27+5*0.28=5.72.
Now, the processor enhancement is 60% faster, and the CPI is greater by (5.72-4.1)/4.1 = 39.5%. Thus, the application will run roughly 0.6*0.395 = 23.7% faster.
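One way to sanity-check these figures is to compare execution times, since time = instruction count x CPI / clock rate. The Python sketch below assumes the compiler removes instructions outright, so the arithmetic count is unchanged and the total instruction count shrinks to 85% of the original. That is one reading of the problem, not the only one, and under it the new CPI and the speedup come out different from the numbers worked out above. The helper `weighted_cpi` is hypothetical.

```python
def weighted_cpi(mix):
    """mix maps instruction class -> (relative count, CPI).

    Counts need not sum to 1; they are normalised here.
    """
    total_count = sum(count for count, _ in mix.values())
    total_cycles = sum(count * cpi for count, cpi in mix.values())
    return total_cycles / total_count


old_mix = {
    "load_store": (0.30, 3),
    "arithmetic": (0.30, 4),
    "branch": (0.40, 5),
}
# Counts are relative to the ORIGINAL instruction count.
new_mix = {
    "load_store": (0.30 * 0.9, 6),   # 10% of load-stores eliminated
    "arithmetic": (0.30, 6),         # unchanged count
    "branch": (0.40 * 0.7, 5),       # 30% of branches eliminated
}

old_cpi = weighted_cpi(old_mix)
new_cpi = weighted_cpi(new_mix)
relative_count = sum(count for count, _ in new_mix.values())

# Execution time per original instruction, in seconds.
old_time = 1.0 * old_cpi / 1.0e9
new_time = relative_count * new_cpi / 1.6e9
speedup = old_time / new_time

print(round(old_cpi, 2))         # 4.1
print(round(relative_count, 2))  # 0.85
print(round(new_cpi, 2))         # 5.67
print(round(speedup, 2))         # 1.36
```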
Now, the second problem:
A new processor with a load/store architecture has an ideal CPI of 1.25. Typical applications on this processor are a mix of 50% arithmetic and logic, 25% conditional branching and 25% load/store. Memory is accessed via a separate data and instruction cache, with a 5% instruction cache miss rate and 10% data miss rate. The penalty of any cache miss is 100 cycles and hits don't produce any penalties.
What is the effective CPI?
My answer: The effective CPI is the ideal CPI, plus the stalled cycles per instruction due to cache access. The ideal CPI is, as given, 1.25. The stalled cycles per instruction is (0.1*100*0.25) + (0.05*100*1) = 7.5. 0.1*100*0.25 is the data miss rate multiplied by the stalled cycle penalty which is also multiplied by the load/store percentage (which is where the data accesses take place); 0.05*100*1 is the instruction miss rate, which is the instruction cache miss rate times the stalled cycle penalty, instruction access take place in 100% of the program, so this is multiplied by 1. Following from this, the effective CPI is 1.25 + 7.5 = 8.75.
What is the misses per 1000 instruction for typical applications and what is the average memory access time (in clock cycles) for typical applications?
My answers: The misses per 1000 instructions is equal to the stalled cycles per instruction due to cache access (as given above: 7.5), divided by 1000, which equals 7.5/1000 = 0.0075
When discussing the average memory access time (AMAT), we first must talk about the total number of accesses here, which is the percentage of data accesses (25%) plus the percentage of instruction accesses (100%), or 125%=1.25. The data accesses are .25/1.25 and the instruction accesses are 1/1.25.
The AMAT equals the percentage of data accesses (.25/1.25) multiplied by the sum of the hit time (1) and the data miss rate multiplied by the miss penalty (0.1*100), or (.25/1.25)(1+0.1*100) and this is added to the percentage of instruction accesses (1/1.25) multiplied by the sum of the hit time (1) and the instruction miss rate multiplied by the miss penalty (0.05*100), or (1/1.25)(1+0.05*100). Put together, the AMAT is (.25/1.25)(1+0.1*100)+(1/1.25)(1+0.05*100)=7.
Once again, sorry for the wall of text. If I am wrong, please try to help me understand how I am wrong. I tried to show all my work to make it as easy as possible to understand. Thanks in advance.
There's an error in the last part of your question. When they ask:
What is the misses per 1000 instruction for typical applications and what is the average memory
access time (in clock cycles) for typical applications?
what's needed here is the number of misses you will get for every 1000 instructions, which in this case would be 1000*1*0.05 for instruction cache misses and 1000*0.25*0.1 for data cache misses. This equals 75 misses per 1000 instructions.
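The miss count can be reproduced with a small Python sketch using the figures from the problem statement. It also recomputes the stall cycles and the effective CPI from the same miss count, as a consistency check against the 7.5 and 8.75 in the question. The variable names are illustrative.

```python
INSTR_FETCHES_PER_INSTR = 1.0   # every instruction is fetched once
DATA_ACCESSES_PER_INSTR = 0.25  # 25% of instructions are loads/stores
ICACHE_MISS_RATE = 0.05
DCACHE_MISS_RATE = 0.10
MISS_PENALTY = 100              # cycles
IDEAL_CPI = 1.25

misses_per_instr = (INSTR_FETCHES_PER_INSTR * ICACHE_MISS_RATE
                    + DATA_ACCESSES_PER_INSTR * DCACHE_MISS_RATE)
misses_per_1000 = 1000 * misses_per_instr

stall_cycles_per_instr = misses_per_instr * MISS_PENALTY
effective_cpi = IDEAL_CPI + stall_cycles_per_instr

print(round(misses_per_1000, 6))          # 75.0
print(round(stall_cycles_per_instr, 6))   # 7.5
print(round(effective_cpi, 6))            # 8.75
```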
To calculate the AMAT, you use the formula AMAT = hit time + (miss rate*miss penalty)
In this case, your miss rate is 75/1000 and your miss penalty is 100 cycles. The hit time is given as 1.25 cycles (your ideal CPI!).
Hope this helps and all the best for your exam!

Cycles/cost for L1 Cache hit vs. Register on x86?

I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?
How many cycles does an L1 cache hit take? How does it compare to register access?
Here's a great article on the subject:
To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)
The specifics will vary, but this link has some good ballpark figures:
Approximate cost to access various caches and main memory?
Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns
These figures represent much older, slower CPUs, but the ratios basically hold:
Level                     Access Time  Typical Size   Technology   Managed By
-----                     -----------  ------------   ----------   ----------
Registers                 1-3 ns       ?1 KB          Custom CMOS  Compiler
Level 1 Cache (on-chip)   2-8 ns       8 KB-128 KB    SRAM         Hardware
Level 2 Cache (off-chip)  5-12 ns      0.5 MB - 8 MB  SRAM         Hardware
Main Memory               10-60 ns     64 MB - 1 GB   DRAM         Operating System
Hard Disk                 3M - 10M ns  20 - 100 GB    Magnetic     Operating System/User
Throughput and latency are different things. You can't just add up cycle costs. For throughput, see Load/stores per cycle for recent CPU architecture generations - 2 loads per clock throughput for most modern microarchitectures. And see How can cache be that fast? for microarchitectural details of load/store execution units, including showing load / store buffers which limit how much memory-level parallelism they can track. The rest of this answer will focus only on latency, which is relevant for workloads that involve pointer-chasing (like linked lists and trees), and how much latency out-of-order exec needs to hide. (L3 Cache misses are usually too long to fully hide.)
Single-cycle cache latency used to be a thing on simple in-order pipelines at lower clock speeds (so each cycle was more nanoseconds), especially with simpler caches (smaller, not as associative, and with a smaller TLB for caches that weren't purely virtually addressed.) e.g. the classic 5-stage RISC pipeline like MIPS I assumes 1 cycle for memory access on a cache hit, with address calculation in EX and memory access in a single MEM pipeline stage, before WB.
Modern high-performance CPUs divide the pipeline up into more stages, allowing each cycle to be shorter. This lets simple instructions like add / or / and run really fast, still 1 cycle latency but at high clock speed.
For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.
Intel Haswell's L1 load-use latency is 4 cycles for pointer-chasing, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax] can run in a loop, with a pointer that points to itself. (Or for a linked list that hits in cache, easy to microbench with a closed loop). See also Is there a penalty when base+offset is in a different page than the base? That 4-cycle latency special case only applies if the pointer comes directly from another load, otherwise it's 5 cycles.
Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.
Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, reading from the store buffer for store data that hasn't yet committed to L1d cache).
As harold commented, register access is 0 cycles. So, for example:
inc eax has 1 cycle latency (just the ALU operation)
add dword [mem], 1 has 6 cycle latency until a load from dword [mem] will be ready. (ALU + store-forwarding). e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles.
mov rax, [rsi] has 4 cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency).
The benchmark page has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).
Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).
L1 Data cache = 32 KB, 64 B/line, 8-WAY.
L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.
L2 cache = 256 KB, 64 B/line, 8-WAY
L3 cache = 8 MB, 64 B/line
L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])
L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).
L2 Cache Latency = 12 cycles
L3 Cache Latency = 36 cycles
RAM Latency = 36 cycles + 57 ns
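The RAM figure mixes units (core cycles plus nanoseconds), so it helps to convert it at the quoted 3.4 GHz clock. A minimal Python sketch of that conversion, assuming the core really runs at a fixed 3.4 GHz as stated (Turbo Boost off). The helper names are made up for illustration.

```python
CLOCK_GHZ = 3.4   # cycles per nanosecond for the i7-4770 above


def ns_to_cycles(ns, ghz=CLOCK_GHZ):
    """Convert a time in nanoseconds to core clock cycles."""
    return ns * ghz


def cycles_to_ns(cycles, ghz=CLOCK_GHZ):
    """Convert core clock cycles to nanoseconds."""
    return cycles / ghz


# "36 cycles + 57 ns": an L3 miss pays the L3 latency plus DRAM time.
ram_latency_cycles = 36 + ns_to_cycles(57)
ram_latency_ns = cycles_to_ns(36) + 57

print(round(ram_latency_cycles, 1))   # 229.8
print(round(ram_latency_ns, 1))       # 67.6
print(round(cycles_to_ns(4), 2))      # L1 hit: 1.18 ns
```

So on this machine a trip to DRAM costs roughly 230 cycles of latency, compared with 4 for an L1 hit.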
The top-level benchmark page still doesn't really explain what the different test sizes mean, but the code is available. The test results include Skylake, which is nearly the same as Haswell in this test.
@paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.
If I remember correctly it's about 1-2 clock cycles but this is an estimate and newer caches may be faster. This is out of a Computer Architecture book I have and this is information for AMD so Intel may be slightly different but I would bound it between 5 and 15 clock cycles which seems like a good estimate to me.
EDIT: Whoops L2 is 10 cycles with TAG access, L1 takes 1 to two cycles, my mistake :\
Actually, the cost of an L1 cache hit is almost the same as the cost of a register access. It was surprising to me, but it is true, at least for my processor (Athlon 64). Some time ago I wrote a simple test application to benchmark the efficiency of access to shared data in a multiprocessor system. The application body simply increments a memory variable for a predefined period of time. For comparison, I benchmarked a non-shared variable first. While doing that I captured the result, but when I disassembled the application I found that the compiler had defied my expectations and applied an unwanted optimisation to my code: it just put the variable in a CPU register and incremented it there iteratively, without any memory access. The real surprise came after I forced the compiler to use an in-memory variable instead of a register variable. With the updated application I got almost the same benchmark results. The performance degradation was really negligible (~1-2%) and looks like it was related to some side effect.
As a result:
1) I think you can consider the L1 cache as an unmanaged pool of processor registers.
2) There is no sense in applying brutal assembly optimisation by forcing the compiler to keep frequently accessed data in processor registers. If the data really is frequently accessed, it will live in the L1 cache and therefore have the same access cost as a processor register.
