Finding Average Penalty from AMAT - performance

I can calculate penalty when I have a single cache. But I'm unsure what to do when I am presented with two L1 caches (one for data and one for instruction) that are accessed in parallel. I'm also unsure what to do when I'm presented with clock cycles instead of actual time such as ns.
How do I calculate the average miss penalty using these new parameters?
Do I just use the formula two times and then average the miss penalty or is there more to this?
AMAT = hit time + miss rate * miss penalty
For example I have the following values:
AMAT = 4 clock cycles
L1 data access = 2 clock cycles (also hit time)
L1 instruction access = 2 clock cycles (also hit time)
60% of instructions are loads and stores
L1 instruction miss rate = 1%
L1 data miss rate = 3%
How would these values fit into AMAT?

Short answer
The average memory access time (AMAT) is typically calculated by dividing the total number of cycles spent servicing memory requests by the total number of memory accesses.
Details
On page B-17 of Computer Architecture: A Quantitative Approach, 5th edition, AMAT is defined as:
Average memory access time = % instructions x (Hit time + Instruction miss rate x Miss penalty) + % data x (Hit time + Data miss rate x Miss penalty)
As you can see in this formula each instruction counts for a single memory access and the instructions that operate on data (load/store) constitute an additional memory access.
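As a rough sketch, the formula can be rearranged to solve for the miss penalty using the numbers from the question. This assumes (the question does not say so) that both L1 caches share a single miss penalty, and that "% instructions" and "% data" are fractions of all memory accesses, with each instruction making one fetch plus 0.6 data accesses on average. The function name is made up for illustration.

```python
# Sketch: solve the split-cache AMAT formula for one shared miss penalty.
# Assumption: both L1 caches have the same hit time and the same miss penalty.

def split_cache_miss_penalty(amat, hit_time, ls_fraction, i_miss_rate, d_miss_rate):
    # Every instruction is fetched once; a fraction of them also access data.
    accesses_per_instr = 1.0 + ls_fraction
    frac_instr = 1.0 / accesses_per_instr          # share of accesses that are fetches
    frac_data = ls_fraction / accesses_per_instr   # share of accesses that are loads/stores
    # With equal hit times: AMAT = hit_time + penalty * weighted miss rate
    avg_miss_rate = frac_instr * i_miss_rate + frac_data * d_miss_rate
    return (amat - hit_time) / avg_miss_rate

penalty = split_cache_miss_penalty(amat=4, hit_time=2, ls_fraction=0.60,
                                   i_miss_rate=0.01, d_miss_rate=0.03)
print(f"miss penalty ~= {penalty:.1f} cycles")
```

Under these assumptions the weighted miss rate is 1.75% and the penalty comes out at roughly 114 cycles; a different weighting convention would give a different number.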
Note that many simplifying assumptions are made when using AMAT, depending on the performance analysis that you want to perform. The same textbook I quoted earlier notes that:
In summary, although the state of the art in defining and measuring
memory stalls for out-of-order processors is complex, be aware of the
issues because they significantly affect performance. The complexity
arises because out-of-order processors tolerate some latency due to
cache misses without hurting performance. Consequently, designers
normally use simulators of the out-of-order processor and memory when
evaluating trade-offs in the memory hierarchy to be sure that an
improvement that helps the average memory latency actually helps
program performance.
My point in including this quote is that in practice AMAT is used to get an approximate comparison between different options, so there are always simplifying assumptions. But generally the memory accesses for instructions and data are added together to get a total number of accesses when calculating AMAT, rather than being calculated separately.

The way I see it, since the L1 Instruction Cache and the L1 Data Cache are accessed in parallel, you should compute AMAT for Instructions and AMAT for data, and then take the largest value as the final AMAT.
In your example since the Data Miss Rate is higher than Instruction Miss Rate you can consider that during the time the CPU waits for data, it solves all the misses on the instruction cache.
If the unit of measure is cycles, you do the same as if it were nanoseconds. If you know the frequency of your processor, you can convert the AMAT back to nanoseconds.
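A minimal sketch of this approach, under stated assumptions: the 100-cycle miss penalty and the 2 GHz clock are hypothetical values chosen only for illustration, since neither is given in the question.

```python
# Sketch of the "parallel caches, take the larger AMAT" view from this answer.
# MISS_PENALTY and CLOCK_GHZ are hypothetical, not values from the question.

HIT_TIME = 2          # cycles, from the question
MISS_PENALTY = 100    # cycles, assumed
CLOCK_GHZ = 2.0       # assumed clock frequency

def amat_cycles(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

def cycles_to_ns(cycles, clock_ghz):
    # One cycle lasts 1 / clock_ghz nanoseconds.
    return cycles / clock_ghz

amat_instr = amat_cycles(HIT_TIME, 0.01, MISS_PENALTY)
amat_data = amat_cycles(HIT_TIME, 0.03, MISS_PENALTY)
amat = max(amat_instr, amat_data)
print(amat, "cycles =", cycles_to_ns(amat, CLOCK_GHZ), "ns")
```

The cycle-to-nanosecond conversion is independent of how the AMAT itself was obtained.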

Related

CPU cache: does the distance between two address needs to be smaller than 8 bytes to have cache advantage?

It may seem a weird question..
Say a cache line's size is 64 bytes. Further, assume that L1, L2, and L3 have the same cache line size (this post said it's the case for Intel Core i7).
There are two objects A, B on memory, whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on the cache boundary, that is, its address is an integer multiple of 64.
1) If N < 64, when A is fetched by CPU, B will be read into the cache, too. So if B is needed, and the cache line is not evicted yet, CPU fetches B in a very short time. Everybody is happy.
2) If N >> 64 (i.e. much larger than 64), when A is fetched by CPU, B is not read into the cache line along with A. So we say "CPU doesn't like to chase pointers around", and it is one of the reasons to avoid heap-allocated node-based data structures, like std::list.
My question is, if N > 64 but is still small, say N = 70, in other words, A and B do not fit in one cache line but are not too far apart, when A is loaded by CPU, does fetching B take the same number of clock cycles as it would take when N is much larger than 64?
Rephrased: when A is loaded, let t represent the time elapsed fetching B. Is t(N=70) much smaller than, or almost equal to, t(N=9999999)?
I ask this question because I suspect t(N=70) is much smaller than t(N=9999999), since CPU cache is hierarchical.
It is even better if there is a quantitative research.
There are at least three factors which can make a fetch of B after A misses faster. First, a processor may speculatively fetch the next block (independent of any stride-based prefetch engine, which would depend on two misses being encountered near each other in time and location in order to determine the stride; unit stride prefetching does not need to determine the stride value [it is one] and can be started after the first miss). Since such prefetching consumes memory bandwidth and on-chip storage, it will typically have a throttling mechanism (which can be as simple as having a modest sized prefetch buffer and only doing highly speculative prefetching when the memory interface is sufficiently idle).
Second, because DRAM is organized into rows and changing rows (within a single bank) adds latency, if B is in the same DRAM row as A, the access to B may avoid the latency of a row precharge (to close the previously open row) and activate (to open the new row). (This can also improve memory bandwidth utilization.)
Third, if B is in the same address translation page as A, a TLB miss may be avoided. (In many designs hierarchical page table walks are also faster in nearby regions because paging structures can be cached. E.g., in x86-64, if B is in the same 2MiB region as A, a TLB miss may only have to perform one memory access because the page directory may still be cached; furthermore, if the translation for B is in the same 64-byte cache line as the translation for A and the TLB miss for A was somewhat recent, the cache line may still be present.)
In some cases one can also exploit stride-based prefetch engines by arranging objects that are likely to miss together in a fixed, ordered stride. This would seem to be a rather difficult optimization that applies only in limited contexts.
One obvious way that stride can increase latency is by introducing conflict misses. Most caches use simple modulo a power of two indexing with limited associativity, so power of two strides (or other mappings to the same cache set) can place a disproportionate amount of data in a limited number of sets. Once the associativity is exceeded, conflict misses will occur. (Skewed associativity and non-power-of-two modulo indexing have been proposed to reduce this issue, but these techniques have not been broadly adopted.)
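To illustrate the conflict-miss point, here is a small sketch of simple power-of-two modulo indexing. The cache geometry (32 KiB, 8-way, 64-byte lines, hence 64 sets) is a hypothetical example, not a description of any particular CPU.

```python
# Sketch: which cache sets do strided accesses land in?
# Hypothetical geometry: 32 KiB, 8-way, 64-byte lines -> 32768 / (64 * 8) = 64 sets.

LINE_SIZE = 64
NUM_SETS = 64
WAYS = 8

def set_index(addr):
    # Simple modulo indexing: drop the line offset bits, keep the low set bits.
    return (addr // LINE_SIZE) % NUM_SETS

def sets_touched(stride, count=16):
    # Set indices hit by `count` accesses spaced `stride` bytes apart.
    return {set_index(i * stride) for i in range(count)}

print(len(sets_touched(64)))    # consecutive lines spread over many sets
print(len(sets_touched(4096)))  # stride = LINE_SIZE * NUM_SETS: all in one set
```

With the 4096-byte stride, all 16 lines compete for the 8 ways of a single set, so they cannot all stay resident even though the cache as a whole is mostly empty.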
(By the way, the reason pointer chasing is particularly slow is not just low spatial locality but that the access to B cannot be started until after the access to A has completed because there is a data dependency, i.e., the latency of fetching B cannot be overlapped with the latency of fetching A.)
If B is at a lower address than A, it won't be in the same cache line even if they're adjacent. So your N < 64 case is misnamed: it's really the "same cache line" case.
Since you mention Intel i7: Sandybridge-family has a "spatial" prefetcher in L2, which (if there aren't a lot of outstanding misses already) prefetches the other cache line in a pair to complete a naturally-aligned 128B pair of lines.
From Intel's optimization manual, in section 2.3 SANDY BRIDGE:
2.3.5.4 Data Prefetching
... Some prefetchers fetch into L1.
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with
the pair line that completes it to a 128-byte aligned chunk.
... several other prefetchers try to prefetch into L2
IDK how soon it does this; if it doesn't issue the request until the first cache line arrives, it won't help much for a pointer-chasing case. A dependent load can execute only a couple cycles after the cache line arrives in L1D, if it's really just pointer-chasing without a bunch of computation latency. But if it issues the prefetch soon after the first miss (which contains the address for the 2nd load), the 2nd load could find its data already in L1D cache, having arrived a cycle or two after the first demand-load.
Anyway, this makes 128B boundaries relevant for prefetching in Intel CPUs.
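As a small sketch of what "same cache line" and "same 128B-aligned pair" mean for two addresses, assuming 64-byte lines; the example addresses are made up.

```python
# Sketch: classify two addresses by 64-byte line and 128-byte aligned pair.

LINE = 64
PAIR = 128

def same_cache_line(a, b):
    return a // LINE == b // LINE

def same_aligned_pair(a, b):
    return a // PAIR == b // PAIR

A = 0x1000                          # 128-byte aligned (hypothetical address)
print(same_cache_line(A, A + 70))   # False: N = 70 crosses into the next line
print(same_aligned_pair(A, A + 70)) # True: that next line is A's pair line
print(same_cache_line(A, A - 1))    # False: B just below A is in the previous line

A2 = 0x1040                          # second line of a pair
print(same_aligned_pair(A2, A2 + 70)) # False: next line belongs to a different pair
```

So whether N = 70 can benefit from the spatial prefetcher depends on where A sits within its 128-byte pair, not only on N.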
See Paul's excellent answer for other factors.

Interpreting Intel VTune's Memory Bound Metric

I see the following when I run Intel VTune on my workload:
Memory Bound 50.8%
I read the Intel doc, which says (Intel doc):
Memory Bound measures a fraction of slots where pipeline could be stalled due to demand load or store instructions. This accounts mainly for incomplete in-flight memory demand loads that coincide with execution starvation in addition to less common cases where stores could imply back-pressure on the pipeline.
Does that mean that roughly half of the instructions in my app are stalled waiting for memory, or is it more subtle than that?
The pipeline slots concept used by VTune is explained, e.g., here: https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win.
In short, a pipeline slot represents the hardware resources needed to process one uOp. So for 4-wide CPUs (most Intel processors) we can execute 4 uOps each cycle, and the total number of slots is measured as 4 * CPU_CLK_UNHALTED.THREAD by VTune.
The Memory Bound metric is built on the CYCLE_ACTIVITY.STALLS_MEM_ANY event, which directly gives you stalls due to memory, taking out-of-order execution into account. Basically, the counter is incremented only if the CPU is stalled and at the same time has in-flight loads. If there are loads in flight but the CPU is kept busy, it is not counted as a memory stall.
So the Memory Bound metric provides a quite accurate estimate of how much the workload is bound by memory performance issues. A value of 50% means that half of the time was wasted waiting for data from memory.
A slot is an execution port of the pipeline. In general in the VTune documentation, a stall could either mean "not retired" or "not dispatched for execution". In this case, it refers to the number of cycles in which zero uops were dispatched.
According to the VTune include configuration files, Memory Bound is calculated as follows:
Memory_Bound = Memory_Bound_Fraction * BackendBound
Memory_Bound_Fraction is basically the fraction of slots mentioned in the documentation. However, according to the top-down method discussed in the optimization manual, the memory bound metric is relative to the backend bound metric. So this is why it is multiplied by BackendBound.
I'll focus on the first term of the formula, Memory_Bound_Fraction. The formula for the second term, BackendBound, is actually complicated.
Memory_Bound_Fraction is calculated as follows:
Memory_Bound_Fraction = ((CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) * NUM_OF_PORTS) / (Backend_Bound_Cycles * NUM_OF_PORTS)
NUM_OF_PORTS is the number of execution ports of the microarchitecture of the target CPU. Since it appears in both the numerator and the denominator, this can be simplified to:
Memory_Bound_Fraction = (CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / Backend_Bound_Cycles
CYCLE_ACTIVITY.STALLS_MEM_ANY and RESOURCE_STALLS.SB are performance events. Backend_Bound_Cycles is calculated as follows:
Backend_Bound_Cycles = CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - Few_Uops_Executed_Threshold - Frontend_RS_Empty_Cycles + RESOURCE_STALLS.SB
Few_Uops_Executed_Threshold is either UOPS_EXECUTED.CYCLES_GE_2_UOP_EXEC or UOPS_EXECUTED.CYCLES_GE_3_UOP_EXEC depending on some other metric. Frontend_RS_Empty_Cycles is either RS_EVENTS.EMPTY_CYCLES or zero depending on some metric.
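The formulas above can be sketched as a small calculation, reading the ratio as (stall cycles) divided by (backend-bound cycles). The counter values below are made-up numbers for illustration, not measurements, and the two threshold terms are passed in directly because the rule for choosing them is not spelled out here.

```python
# Sketch: Memory_Bound_Fraction from raw event counts, per the formulas above.
# All counter values are hypothetical.

def backend_bound_cycles(c, few_uops_threshold, frontend_rs_empty):
    return (c["CYCLE_ACTIVITY.STALLS_TOTAL"]
            + c["UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC"]
            - few_uops_threshold
            - frontend_rs_empty
            + c["RESOURCE_STALLS.SB"])

def memory_bound_fraction(c, few_uops_threshold, frontend_rs_empty):
    # NUM_OF_PORTS cancels out, so it does not appear here.
    stalls = c["CYCLE_ACTIVITY.STALLS_MEM_ANY"] + c["RESOURCE_STALLS.SB"]
    return stalls / backend_bound_cycles(c, few_uops_threshold, frontend_rs_empty)

counters = {
    "CYCLE_ACTIVITY.STALLS_TOTAL": 500,
    "CYCLE_ACTIVITY.STALLS_MEM_ANY": 300,
    "UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC": 800,
    "RESOURCE_STALLS.SB": 50,
}
# Assume the threshold is UOPS_EXECUTED.CYCLES_GE_2_UOP_EXEC = 600 and
# Frontend_RS_Empty_Cycles is RS_EVENTS.EMPTY_CYCLES = 50 (both made up).
fraction = memory_bound_fraction(counters, few_uops_threshold=600, frontend_rs_empty=50)
print(fraction)
```

The reported Memory Bound value would then be this fraction multiplied by BackendBound, which is not expanded here.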
I realize this answer still needs a lot of additional explanation and BackendBound needs to be expanded. But this early edit makes the answer accurate.

How can I force an L2 cache miss?

I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create benchmarks that gradually increase the working set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 requests per cycle) remain constant, but the ratio of L2 misses to L2 requests increases.
Can anyone show me an example of a C program which forces "N" L2 cache misses?
You can generally force cache misses at some cache level by randomly accessing a working set larger than that cache level [1].
You would expect the probability of any given load being a miss to be something like: p(hit) = min(1, C / W), and p(miss) = 1 - p(hit), where p(hit) and p(miss) are the probabilities of a hit and a miss, C is the relevant cache size, and W is the working set size. So for a miss rate of 50%, use a working set of twice the cache size.
A quick look at the formula above shows that p(miss) will never be 100%, since C/W only goes to 0 as W goes to infinity (and you probably can't afford an infinite amount of RAM). So your options are:
Getting "close enough" by using a very large working set (e.g., 4 GB gives you a 99%+ miss chance for a 256 KB cache), and pretending you have a miss rate of 100%.
Applying the formula to determine the actual expected number of misses. E.g., if you are using a working set size of 2560 KB against an L2 cache of 256 KB, you have a miss rate of 90%. So if you want to examine the effect of 1,000 misses, you should make 1000 / 0.9 = ~1111 memory accesses to get about 1,000 misses.
Use any approximate approach but then actually count the number of misses you incur using the performance counter units on your CPU. For example, on Linux you could use PAPI or on Linux and Windows you could use Intel's PCM (if you are using Intel hardware).
Use an "almost random" approach to force the number of misses you want. The formula above is valid for random accesses, but if you choose your access pattern so that it is random with the caveat that it doesn't repeat "recent" accesses, you can get a 100% miss ratio. Here "recent" means accesses to cache lines that are likely to still be in the cache. Calculating what that means exactly is tricky, and depends in detail on the associativity and replacement algorithm of the cache, but if you don't repeat any access that has occurred in the last cache_size * 10 accesses, you should be pretty safe.
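The random-access estimate can be sketched as follows, treating probabilities as fractions between 0 and 1. The function names are made up for illustration, and the model is only the simple C / W approximation described above.

```python
# Sketch: expected miss probability for uniformly random accesses, and the
# number of accesses needed to see a target number of misses on average.

def p_miss(cache_kb, working_set_kb):
    p_hit = min(1.0, cache_kb / working_set_kb)
    return 1.0 - p_hit

def accesses_needed(target_misses, cache_kb, working_set_kb):
    pm = p_miss(cache_kb, working_set_kb)
    if pm == 0.0:
        raise ValueError("working set fits in the cache; no misses expected")
    return target_misses / pm

print(p_miss(256, 2560))                 # 2560 KB working set, 256 KB L2
print(accesses_needed(1000, 256, 2560))  # accesses for ~1000 misses
print(p_miss(256, 4 * 1024 * 1024))      # 4 GB working set
```

These are expected values only; counting real misses with performance counters remains the more reliable check.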
As for the C code, you should at least show us what you've tried. A basic outline is to create a vector of bytes or ints or whatever with the required size, then to randomly access that vector. If you make each access dependent on the previous access (e.g., use the integer read to calculate the index of the next read) you will also get a rough measurement of the latency of that level of cache. If the accesses are independent, you'll probably have several outstanding misses to the cache at once, and get more misses per unit time. Which one you are interested in depends on what you are studying.
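The dependent-access outline might be structured like the sketch below. It is written in Python only to show the logic; interpreter overhead swamps cache effects, so a real measurement would port the same structure to C. Using Sattolo's algorithm to build the index chain is a choice made here (it guarantees one cycle through every element), not something the outline above prescribes.

```python
# Sketch: a pointer-chasing chain where each read yields the next index.
# Illustrates structure only; Python timings say nothing about cache latency.
import random

def single_cycle_permutation(n, rng):
    # Sattolo's algorithm: a random permutation consisting of a single n-cycle,
    # so following nxt[] from any start visits every element before repeating.
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i (never i itself)
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, steps, start=0):
    idx = start
    for _ in range(steps):
        idx = nxt[idx]                # next access depends on this load
    return idx

rng = random.Random(0)                # fixed seed for repeatability
n = 1 << 12                           # hypothetical element count
nxt = single_cycle_permutation(n, rng)
print(chase(nxt, n))                  # back at the start after n steps
```

In a C version, each element would typically be padded to a full cache line so that every step touches a different line.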
For an open source project that does this kind of memory testing across different stride and working set sizes, take a look at TinyMemBench.
[1] This gets a bit trickier for levels of cache that are shared among cores (usually L3 for recent Intel chips, for example), but it should work well if your machine is pretty quiet while testing.

understanding CPI and cache access

These are previous homework problems, but I am using them as exam review. I am changing numbers around from what is actually in the problem. I just want to make sure I have a grasp on the concepts. I already have the answers, just need clarification that I understand them. This is not homework but review work.
Anyway, this focuses on aspects of CPI
The first problem:
An application running on a 1GHz processor has 30% load-store instructions, 30% arithmetic, and 40% branch instructions. The individual CPIs are 3 for load-store, 4 for arithmetic, 5 for branch instructions. Determine the overall CPI of this program on the given processor.
My answer: The overall CPI is the sum of the sub-CPIs, multiplied by the percentages in which they occur i.e. 3*0.3 + 4*0.3 + 5*0.4 = 0.9 + 1.2 + 2 = 4.1
Now, the processor is enhanced to run at 1.6GHz. The CPIs of the branch instructions remain the same but load-store and arithmetic instruction CPIs both increase to 6 cycles. A new compiler is in use which eliminates 30% of branch instructions and 10% of load-stores. Determine the new overall CPI and the factor by which the application will be faster or slower.
My answer: Once again, the new CPI is just the sum of its parts. However, the parts have changed and this must be accounted for. Branch instructions will drop by 30% (0.4*0.7=0.28) and load-stores will drop by 10% (0.3*0.9=0.27); arithmetic instructions will now account for the rest of the instructions (1-0.28-0.27=0.45), or 45%. These will be multiplied by the new sub-CPIs to get: 6*0.45+6*0.27+5*0.28=5.72.
Now, the processor enhancement is 60% faster, and the CPI is greater by (5.72-4.1)/4.1 = 39.5%. Thus, the application will run roughly 0.6*0.395 = 23.7% faster.
Now, the second problem:
A new processor with a load/store architecture has an ideal CPI of 1.25. Typical applications on this processor are a mix of 50% arithmetic and logic, 25% conditional branching and 25% load/store. Memory is accessed via a separate data and instruction cache, with a 5% instruction cache miss rate and 10% data miss rate. The penalty of any cache miss is 100 cycles and hits don't produce any penalties.
What is the effective CPI?
My answer: The effective CPI is the ideal CPI, plus the stalled cycles per instruction due to cache access. The ideal CPI is, as given, 1.25. The stalled cycles per instruction is (0.1*100*0.25) + (0.05*100*1) = 7.5. 0.1*100*0.25 is the data miss rate multiplied by the stalled cycle penalty which is also multiplied by the load/store percentage (which is where the data accesses take place); 0.05*100*1 is the instruction miss rate, which is the instruction cache miss rate times the stalled cycle penalty, instruction access take place in 100% of the program, so this is multiplied by 1. Following from this, the effective CPI is 1.25 + 7.5 = 8.75.
What is the misses per 1000 instruction for typical applications and what is the average memory access time (in clock cycles) for typical applications?
My answers: The misses per 1000 instructions is equal to the stalled cycles per instruction due to cache access (as given above: 7.5), divided by 1000, which equals 7.5/1000 = 0.0075
When discussing the average memory access time (AMAT), we first must talk about the total number of accesses here, which is the percentage of data accesses (25%) plus the percentage of instruction accesses (100%), or 125%=1.25. The data accesses are .25/1.25 and the instruction accesses are 1/1.25.
The AMAT equals the percentage of data accesses (.25/1.25) multiplied by the sum of the hit time (1) and the data miss rate multiplied by the miss penalty (0.1*100), or (.25/1.25)(1+0.1*100) and this is added to the percentage of instruction accesses (1/1.25) multiplied by the sum of the hit time (1) and the instruction miss rate multiplied by the miss penalty (0.05*100), or (1/1.25)(1+0.05*100). Put together, the AMAT is (.25/1.25)(1+0.1*100)+(1/1.25)(1+0.05*100)=7.
Once again, sorry for the wall of text. If I am wrong, please try to help me understand how I am wrong. I tried to show all my work to make it as easy as possible to understand. Thanks in advance.
There's an error in the last part of your question. When they ask:
What is the misses per 1000 instruction for typical applications and what is the average memory
access time (in clock cycles) for typical applications?
what's needed here is the number of misses you will get for every 1000 instructions, which in this case would be 1000*1*0.05 for instruction cache misses and 1000*0.25*0.1 for data cache misses. This equals 75 misses per 1000 instructions.
To calculate the AMAT, you use the formula AMAT = hit time + (miss rate*miss penalty)
In this case, your miss rate is 75/1000 and your miss penalty is 100 cycles. The hit time is given as 1.25 cycles (your ideal CPI!).
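The arithmetic for misses per 1000 instructions can be checked with a short sketch. It also recomputes the stall cycles and effective CPI from the question, using only the numbers given in the problem statement; the variable names are made up.

```python
# Sketch: misses per 1000 instructions and effective CPI for the second problem.

instr_miss_rate = 0.05   # instruction cache miss rate
data_miss_rate = 0.10    # data cache miss rate
ls_fraction = 0.25       # fraction of instructions that are loads/stores
miss_penalty = 100       # cycles
ideal_cpi = 1.25

# Every instruction is fetched once; only loads/stores touch the data cache.
misses_per_instr = 1.0 * instr_miss_rate + ls_fraction * data_miss_rate
misses_per_1000 = 1000 * misses_per_instr

stall_cycles_per_instr = misses_per_instr * miss_penalty
effective_cpi = ideal_cpi + stall_cycles_per_instr

print(round(misses_per_1000, 6), round(effective_cpi, 6))
```

The 75 misses split into 50 instruction cache misses and 25 data cache misses, and the stall component matches the 7.5 cycles per instruction worked out in the question.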
Hope this helps and all the best for your exam!

How to avoid TLB miss (and high Global Memory Replay Overhead) in CUDA GPUs?

The title might be more specific than my actual problem is, although I believe answering this question would solve a more general problem, which is: how to decrease the effect of high latency (~700 cycle) that comes from random (but coalesced) global memory access in GPUs.
In general if one accesses the global memory with coalesced load (eg. I read 128 consecutive bytes), but with very large distance (256KB-64MB) between coalesced accesses, one gets a high TLB (Translation Lookaside Buffer) miss rate. This high TLB miss rate is due to the limited number (~512) and size (~4KB) of the memory pages used in the TLB lookup table.
I suspect a high TLB miss rate for three reasons: NVIDIA uses virtual memory, the profiler reports high (98%) Global Memory Replay Overhead and low throughput (45GB/s, with a K20c), and partition camping has not been an issue since Fermi.
Is it possible to avoid high TLB miss rate somehow? Would 3D texture cache help if I'm accessing a (X x Y x Z) cube coalesced along X dimension and with a X*Y "stride" along the Z dimension?
Any comment on this topic is appreciated.
Constraints: 1) global data can not be reordered/transposed; 2) kernel is communication bound.
You can only avoid TLB misses by changing your memory access pattern. A different layout of your data in memory can help with this.
A 3D texture will not improve your situation, as it trades improved spatial locality in two additional dimensions against reduced spatial locality in the third dimension. Thus you would unnecessarily read data of neighbors along the Y axis.
What you can do, however, is mitigate the impact of the resulting latency on throughput. In order to hide t = 700 cycles of latency at a global memory bandwidth of b = 250GB/s, you need to have memory transactions for b * t = 175 KB of data in flight at any time (or 12.5 KB for each of the 14 SMX). With a fully loaded memory interface and a high ratio of TLB misses, you will however find that latency gets closer to 2000 cycles, requiring roughly 32 KB of transactions in flight per SMX.
As each word of a memory read transaction in flight requires one register where the value will be stored once it arrives, hiding memory latency has to be balanced against register pressure. Keeping 32 KB of data in flight requires 8192 registers, or 12.5% of the total registers available on an SMX.
(Note that for above rough estimates I have neglected the difference between KiB and KB).
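These estimates follow from the bandwidth-latency product and can be reproduced with a short sketch. The 1 GHz clock is an assumption made here so that 700 cycles equals 700 ns, which is what the 175 KB figure implies; the 4-byte word and 65536 registers per SMX are likewise assumptions consistent with the 12.5% figure.

```python
# Sketch: data that must be in flight to hide memory latency (bandwidth * latency).
# Assumed: 1 GHz clock, 4-byte words, 65536 32-bit registers per SMX.

def bytes_in_flight(bandwidth_bytes_per_s, latency_cycles, clock_hz):
    latency_s = latency_cycles / clock_hz
    return bandwidth_bytes_per_s * latency_s

BANDWIDTH = 250e9   # 250 GB/s
CLOCK_HZ = 1e9      # assumed
NUM_SMX = 14

total = bytes_in_flight(BANDWIDTH, 700, CLOCK_HZ)
per_smx = total / NUM_SMX
print(total, per_smx)                 # bytes in total and per SMX

# Register cost of keeping 32 KB in flight on one SMX.
regs = 32 * 1024 // 4                 # one 32-bit register per word
print(regs, regs / 65536)             # register count and share of the SMX file
```

Like the estimates above, this ignores the difference between KB and KiB.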

Resources