Memory access time in deep cache hierarchy - caching

I am trying to solve an exercise from a computer architecture textbook. The book gives the equation for average memory access time (MAT) for a hierarchy with up to an L2 cache (below), but the exercise goes up to an L4 cache plus off-chip memory, and I don't understand how to extend the equation to calculate the average MAT.
So, Average memory access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
The exercise gives a cache hierarchy of [32 KB L1; 128 KB L2; 2 MB L3; 8 MB L4; off-chip memory] for which the memory access time needs to be calculated.
Given cache size / latency / misses-per-1000-instructions values: 32 KB/1/100, 128 KB/2/80, 512 KB/4/50, 2 MB/8/40, 8 MB/16/10. An off-chip memory access requires 200 cycles on average. Also, per 1000 instructions of a program, an average of 20 memory accesses may exhibit low enough locality that they cannot be serviced by the 2 MB cache, which therefore has 20 misses per thousand instructions.
Could anyone help me to solve the problem?
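For reference, the two-level formula nests, because each level's miss penalty is just the average access time of the next level down: AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × (Hit time_L3 + Miss rate_L3 × (Hit time_L4 + Miss rate_L4 × Memory latency))). Below is a minimal sketch of that folding; the hit times are taken from the latency column quoted above, but the local miss rates are placeholders, since converting misses-per-1000-instructions into per-level miss rates still requires the number of memory accesses per instruction, so this is not the exercise's answer.

```c
#include <stdio.h>

/* Nested AMAT: each level's miss penalty is the AMAT of the next level.
 * AMAT = hit[0] + miss[0] * (hit[1] + miss[1] * ( ... + memory_latency)) */
static double amat(const double *hit_time, const double *miss_rate,
                   int levels, double memory_latency)
{
    double penalty = memory_latency;        /* cost once every cache level has missed */
    for (int i = levels - 1; i >= 0; i--)   /* fold from the last level back to L1    */
        penalty = hit_time[i] + miss_rate[i] * penalty;
    return penalty;                         /* AMAT as seen by the CPU */
}

int main(void)
{
    /* Placeholder example: L1..L4 hit times from the exercise's latency column,
     * with made-up local miss rates purely to show the arithmetic. */
    double hit_time[]  = { 1.0, 2.0, 8.0, 16.0 };     /* cycles */
    double miss_rate[] = { 0.10, 0.50, 0.40, 0.25 };  /* local miss rates (illustrative only) */
    printf("AMAT = %.2f cycles\n", amat(hit_time, miss_rate, 4, 200.0));
    return 0;
}
```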

Related

Can all of L2/L3 cache be used by data? If so, why does the Graviton 3 bandwidth plot drop off after half the L2/L3 size, but only gradually?

Consider Graviton3, for example. It's a 64-core CPU with per-core caches of 64 KiB L1d and 1 MiB L2, and a shared L3 of 64 MiB across all cores. The RAM bandwidth per socket is 307 GB/s (source).
In this plot (source),
we see that the all-cores bandwidth drops to roughly half when the data exceeds 4 MB. This makes sense: 64 × 64 KiB = 4 MiB is the total size of the L1 data caches.
But why does the next cliff begin at 32 MB? And why is the drop-off so gradual there? The private L2 caches of the 64 cores total 64 MiB, the same as the shared L3 size.
It looks from the plot like they may not have tested any sizes between 32M and 64M. Looks like a straight line between those points on all 3 CPUs.
Since 64M is the total size of both L2 and L3, I'd expect a test like this to have slowed most of the way down at 64M. As Brendan says, page tables and a bit of code will take space, competing with the actual intended test data. If the benchmark loop is tight, stack won't come into play, except for interrupt handling.
Once you're evicting anything from a working set slightly larger than cache, you often evict almost everything before getting back to it, depending on pseudo-LRU luck. I'd expect a test size of 48 or even 56 MiB to be a lot closer to the 32 MiB data point than the 64 MiB data point.
Can all of L2/L3 cache be used by data?
In theory, yes; but only if there's no "non-data" (code) in the cache, only if you count "all data" (and don't just count a process's data while ignoring things like the stack and page tables), and only if there aren't any aliasing problems.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there?
For a fully associative cache I'd expect a sudden drop-off at/near 32 MiB. However, large caches are almost never fully associative, as it costs way too much to find anything in the cache.
As associativity decreases the chance of conflicts increases. For example, for an 8-way associative 64 MiB cache the pathological case is that everything conflicts and you're only able to effectively use 8 MiB of it.
More specifically, for a 64 MiB cache (with unknown associativity), and an "assumed Linux" environment that lacks support for cache coloring, it's reasonable to expect a smooth drop off that ends at 64 MiB.
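To make the conflict effect concrete, here is a small sketch that maps physical addresses onto the sets of a hypothetical 64 MiB, 8-way, 64 B/line physically indexed cache (the geometry is assumed for illustration, not Graviton3's actual L3 design). Addresses spaced by cache size / ways = 8 MiB all land in the same set, so as soon as more than 8 such lines are live they start evicting each other, which is the pathological case described above.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical physically-indexed cache geometry (assumed, not a real part). */
#define CACHE_BYTES   (64u << 20)   /* 64 MiB total            */
#define LINE_BYTES    64u           /* 64 B per line           */
#define WAYS          8u            /* 8-way set associative   */
#define NUM_SETS      (CACHE_BYTES / (LINE_BYTES * WAYS))   /* 131072 sets */

static uint32_t set_index(uint64_t paddr)
{
    return (uint32_t)((paddr / LINE_BYTES) % NUM_SETS);
}

int main(void)
{
    /* Addresses spaced by CACHE_BYTES / WAYS = 8 MiB all hit the same set:
     * the 9th such line must evict one of the first 8. */
    uint64_t stride = CACHE_BYTES / WAYS;   /* 8 MiB */
    for (int i = 0; i < 10; i++)
        printf("addr %2d * 8MiB -> set %u\n", i,
               (unsigned)set_index((uint64_t)i * stride));
    return 0;
}
```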
Just to be clear, on a running Graviton 3 in AWS, an lscpu gives me 32MiB for L3 and not 64 MiB.
Caches (sum of all):
L1d: 4 MiB (64 instances)
L1i: 4 MiB (64 instances)
L2: 64 MiB (64 instances)
L3: 32 MiB (1 instance)
The original question is assuming an L3 of 64 MiB across all cores.
But why does the next cliff begin at 32 MB? And why is the drop-off so gradual there? The private L2 caches of the 64 cores total 64 MiB, the same as the shared L3 size.

What is reference when it says L1 Cache Reference or Main Memory Reference

So I am trying to learn the performance numbers of various components of a computer, like the L1 cache, L2 cache, main memory, Ethernet, disk, etc., as below:
Latency Comparison Numbers
--------------------------
L1 cache **reference** 0.5 ns
Branch mispredict 5 ns
L2 cache **reference** 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory **reference** 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 10,000 ns 10 us
Send 1 KB over 1 Gbps network 10,000 ns 10 us
Read 4 KB randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from 1 Gbps 10,000,000 ns 10,000 us 10 ms 40x memory, 10X SSD
Read 1 MB sequentially from disk 30,000,000 ns 30,000 us 30 ms 120x memory, 30X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
I don't think the "reference" mentioned above is about how much data is read, in bits or bytes; it seems to be about accessing a single address in cache or memory.
Can someone please explain what exactly this reference is that takes 0.5 ns?
This table lists typical numbers for some representative system; the actual values for a real system would hardly be such smooth numbers, but rather complicated sums over non-even multiples of CPU and/or bus clock periods. You could find such a table in a textbook for educational use. This one apparently found its way into a general introduction to system design [1] from conference presentations that Google AI's lead, Jeff Dean, gave back in 2009 [3], [4].
The two presentation PDFs [3], [4] do not give an explicit definition of what exactly was meant by "reference" in those tables. Instead, the tables are presented to make the point that the ability to do "back-of-the-envelope calculations" is crucial for successful system design.
The term "reference" likely means retrieving a piece of information from the corresponding level of memory if the requested value is maintained there, so that it doesn't have to be reloaded from a slower source:
L1 cache <- L2 cache <- Main memory (RAM) <- Disk (e.g., swap)
The slower levels (RAM, disk) should be taken as only a very rough sketch, because there you will find lots of sub-levels and variants (type of mass storage device, internal cache on the disk's controller, buses/bridges, etc.).
The numbers appear to be drawn from experience at Google's data centers, so let's assume they are based on high-performance-class hardware that was relevant in 2009 (or earlier).
Today (2020), the numbers should not be taken literally; they demonstrate the orders of magnitude relative to the corresponding values for the other levels of data transfer.
The label "branch mispredict" stands for all cases in which a fetch operation from the next level is necessary, because a mispredicted branch decision is the most important reason for such a fetch being critical with respect to latency.
In other cases, the branch-prediction infrastructure is supposed to trigger fetch operations early enough that all latencies beyond the low "reference" value are hidden behind pipeline operations.
[1] "Latency numbers every programmer should know", in: "The System Design Primer" (the URL you gave in the comment discussion), which references the following sources:
[2] Jeff Dean: "Latency Numbers Every Programmer Should Know", 31 May 2012. "Originally by Peter Norvig ("Teach Yourself Programming in Ten Years") with some updates from Brendan", 1 Jun 2012.
[3] Jeff Dean: "Designs, Lessons and Advice from Building Large Distributed Systems", 13 Oct 2009, page 24.
[4] Jeff Dean: "Software Engineering Advice from Building Large-Scale Distributed Systems", 17 Mar 2009, page 13.
Coming to the specific question of what an L1 cache reference is: it helps to understand multi-level caching -- https://en.wikipedia.org/wiki/CPU_cache#MULTILEVEL
When designing any cache there is a trade-off between hit rate and latency: larger caches generally have a higher hit rate but also a longer latency. To get the best of both worlds, many architectures implement two or more levels of cache: a small, very fast L1 backed by an L2 that is looked up on an L1 miss, the L2 being larger but slower, and so on. The numbers in your reference appear to be a rough ballpark for an L1 hit.
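Using those numbers, the kind of back-of-the-envelope estimate mentioned earlier looks like the sketch below. Only the 0.5 ns / 7 ns / 100 ns reference latencies come from the table; the hit rates are invented purely for illustration.

```c
#include <stdio.h>

int main(void)
{
    /* Reference latencies from the table above (ns); hit rates are illustrative guesses. */
    double l1_ns = 0.5, l2_ns = 7.0, ram_ns = 100.0;
    double l1_hit = 0.95, l2_hit = 0.80;   /* assumed, not measured */

    /* Average cost of one reference: pay L1 always, L2 on an L1 miss,
     * RAM when both miss. */
    double avg = l1_ns + (1.0 - l1_hit) * (l2_ns + (1.0 - l2_hit) * ram_ns);
    printf("average reference ~ %.2f ns\n", avg);   /* ~1.85 ns with these guesses */
    return 0;
}
```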

Calculating Effective CPI when using write-through/write-back architecture

So I'm trying to understand a homework problem given by an instructor and I'm honestly lost. I understand the concept of write-through/write-back, etc., but I can't figure out the actual calculations needed for the effective CPI. Could anyone give me a hand? The problem follows:
The following table provides the statistics of a cache for a particular program. It is known that the base CPI (without cache misses) is 1. It is also known that the memory bus bandwidth (the bandwidth to transfer data between cache and memory) is 4 bytes per cycle, and it takes one cycle to send the address before data transfer. The memory spends 10 cycles to store data from the bus or fetch data to the bus. The clock rate used by the memory and the bus is a quarter of the CPU clock rate.
Data reads per 1000 instructions: 100
Data writes per 1000 instructions: 150
Instruction cache miss rate: 0.4%
Data cache miss rate: 3%
Block size in bytes: 32
The effective CPI is the base CPI plus the CPI contribution from cache misses.
The cache-miss CPI is the sum of the instruction cache CPI and the data cache CPI.
The cache-miss cost is the cost of reading from or writing to memory, so we need that first.
The cost in bus cycles is 1 (for the address) + 10 (memory busy time) + 8 (the 32-byte block size divided by 4 bytes/cycle) = 19 bus cycles. Multiply by 4 to convert to CPU cycles: 76 CPU cycles per miss.
So the CPI contribution from I-cache misses is 0.004 × 76 = 0.304 cycles.
The contribution from D-cache misses is (0.10 + 0.15) × 0.03 × 76 = 0.57 cycles.
So the effective CPI is 1 + 0.304 + 0.57 = 1.874.
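The same arithmetic written out as a small program, with the numbers taken directly from the problem statement above (the only implicit step made explicit is normalizing the per-1000-instruction read/write counts to accesses per instruction):

```c
#include <stdio.h>

int main(void)
{
    /* Miss penalty in bus cycles: 1 (address) + 10 (memory busy) + 32/4 (data transfer). */
    double bus_cycles = 1 + 10 + 32.0 / 4;          /* 19 bus cycles */
    double miss_penalty = bus_cycles * 4;           /* bus runs at 1/4 CPU clock -> 76 CPU cycles */

    double i_miss_rate = 0.004;                     /* instruction cache miss rate */
    double d_accesses  = (100 + 150) / 1000.0;      /* data accesses per instruction */
    double d_miss_rate = 0.03;                      /* data cache miss rate */

    double cpi = 1.0                                        /* base CPI */
               + i_miss_rate * miss_penalty                 /* 0.304 */
               + d_accesses * d_miss_rate * miss_penalty;   /* 0.570 */
    printf("effective CPI = %.3f\n", cpi);                  /* 1.874 */
    return 0;
}
```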

How to avoid TLB miss (and high Global Memory Replay Overhead) in CUDA GPUs?

The title might be more specific than my actual problem is, although I believe answering this question would solve a more general problem, which is: how to decrease the effect of high latency (~700 cycle) that comes from random (but coalesced) global memory access in GPUs.
In general if one accesses the global memory with coalesced load (eg. I read 128 consecutive bytes), but with very large distance (256KB-64MB) between coalesced accesses, one gets a high TLB (Translation Lookaside Buffer) miss rate. This high TLB miss rate is due to the limited number (~512) and size (~4KB) of the memory pages used in the TLB lookup table.
I suspect the high TLB miss rate because virtual memory is used by NVIDIA, because I see a high (98%) Global Memory Replay Overhead and low throughput (45 GB/s on a K20c) in the profiler, and because partition camping has not been an issue since Fermi.
Is it possible to avoid high TLB miss rate somehow? Would 3D texture cache help if I'm accessing a (X x Y x Z) cube coalesced along X dimension and with a X*Y "stride" along the Z dimension?
Any comment on this topic is appreciated.
Constraints: 1) global data can not be reordered/transposed; 2) kernel is communication bound.
You can only avoid TLB misses by changing your memory access pattern. A different layout of your data in memory can help with this.
A 3D texture will not improve your situation, as it trades improved spatial locality in two additional dimensions against reduced spatial locality in the third dimension. Thus you would unnecessarily read data of neighbors along the Y axis.
What you can do however is mitigate the impact of the resulting latency on throughput. In order to hide t = 700 cycles of latency at a global memory bandwidth of b = 250 GB/s, you need to have memory transactions for roughly b × t ≈ 175 KB of data in flight at any time (or 12.5 KB for each of the 14 SMX). With a fully loaded memory interface and a high ratio of TLB misses, you will however find that latency gets closer to 2000 cycles, requiring roughly 32 KB of transactions in flight per SMX.
As each word of a memory read transaction in flight requires one register where the value will be stored once it arrives, hiding memory latency has to be balanced against register pressure. Keeping 32 KB of data in flight requires 8192 registers, or 12.5% of the total registers available on an SMX.
(Note that for above rough estimates I have neglected the difference between KiB and KB).
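The data-in-flight figure is just bandwidth × latency (Little's law), with cycles converted to time. Here is a small sketch of that arithmetic; it treats one cycle as roughly one nanosecond, which is the assumption that reproduces the 175 KB figure above, so plug in your GPU's actual core clock for a tighter estimate.

```c
#include <stdio.h>

int main(void)
{
    /* Little's law: bytes in flight = bandwidth * latency.
     * One cycle is treated as ~1 ns here (i.e. a ~1 GHz clock, an assumption). */
    double bandwidth = 250e9;        /* B/s */
    double latency_cycles = 700.0;
    double cycle_ns = 1.0;           /* assumed cycle time in ns */
    double in_flight = bandwidth * latency_cycles * cycle_ns * 1e-9;

    int smx = 14;                    /* SMX count on a K20c */
    printf("in flight: %.0f KB total, %.1f KB per SMX\n",
           in_flight / 1e3, in_flight / 1e3 / smx);

    /* 32 KiB of pending load data at 4 bytes per register = 8192 registers. */
    printf("registers to hold 32 KiB of pending loads: %.0f\n", 32.0 * 1024 / 4);
    return 0;
}
```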

Cycles/cost for L1 Cache hit vs. Register on x86?

I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?
How many cycles does an L1 cache hit take? How does it compare to register access?
Here's a great article on the subject:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1
To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)
PS:
The specifics will vary, but this link has some good ballpark figures:
Approximate cost to access various caches and main memory?
Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns
PPS:
These figures represent much older, slower CPUs, but the ratios basically hold:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2
Level Access Time Typical Size Technology Managed By
----- ----------- ------------ --------- -----------
Registers 1-3 ns <1 KB Custom CMOS Compiler
Level 1 Cache (on-chip) 2-8 ns 8 KB-128 KB SRAM Hardware
Level 2 Cache (off-chip) 5-12 ns 0.5 MB - 8 MB SRAM Hardware
Main Memory 10-60 ns 64 MB - 1 GB DRAM Operating System
Hard Disk 3M - 10M ns 20 - 100 GB Magnetic Operating System/User
Throughput and latency are different things. You can't just add up cycle costs. For throughput, see Load/stores per cycle for recent CPU architecture generations - 2 loads per clock throughput for most modern microarchitectures. And see How can cache be that fast? for microarchitectural details of load/store execution units, including showing load / store buffers which limit how much memory-level parallelism they can track.
The rest of this answer will focus only on latency, which is relevant for workloads that involve pointer-chasing (like linked lists and trees), and how much latency out-of-order exec needs to hide. (L3 Cache misses are usually too long to fully hide.)
Single-cycle cache latency used to be a thing on simple in-order pipelines at lower clock speeds (so each cycle was more nanoseconds), especially with simpler caches (smaller, not as associative, and with a smaller TLB for caches that weren't purely virtually addressed.) e.g. the classic 5-stage RISC pipeline like MIPS I assumes 1 cycle for memory access on a cache hit, with address calculation in EX and memory access in a single MEM pipeline stage, before WB.
Modern high-performance CPUs divide the pipeline up into more stages, allowing each cycle to be shorter. This lets simple instructions like add / or / and run really fast, still 1 cycle latency but at high clock speed.
For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.
Intel Haswell's L1 load-use latency is 4 cycles for pointer-chasing, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax] can run in a loop, with a pointer that points to itself. (Or for a linked list that hits in cache, easy to microbench with a closed loop). See also Is there a penalty when base+offset is in a different page than the base? That 4-cycle latency special case only applies if the pointer comes directly from another load, otherwise it's 5 cycles.
Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.
Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, reading from the store buffer for store data that hasn't yet committed to L1d cache).
As harold commented, register access is 0 cycles. So, for example:
inc eax has 1 cycle latency (just the ALU operation)
add dword [mem], 1 has 6 cycle latency until a load from dword [mem] will be ready. (ALU + store-forwarding). e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles.
mov rax, [rsi] has 4 cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency.)
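A minimal C version of that closed-loop pointer-chase microbenchmark (the asm version being mov rax, [rax] in a loop): build a random cyclic permutation inside a buffer small enough to stay in L1d and chase it, so each load's address depends on the previous load and iterations cannot overlap. The 32 KB buffer size and iteration count are arbitrary choices; on a Haswell-like core the result should land near the 4-5 cycle L1d load-use latency quoted below, plus a little measurement noise.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODES 4096   /* 4096 * 8 B = 32 KB: small enough to fit in a 32 KB L1d */

int main(void)
{
    /* Build a random cyclic permutation so every load's address comes from
     * the previous load: the CPU cannot overlap the iterations. */
    void **buf = malloc(NODES * sizeof *buf);
    size_t order[NODES];
    for (size_t i = 0; i < NODES; i++) order[i] = i;
    srand(1);
    for (size_t i = NODES - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < NODES; i++)
        buf[order[i]] = &buf[order[(i + 1) % NODES]];

    const long iters = 100 * 1000 * 1000;
    void **p = buf;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        p = *p;                                    /* the dependent-load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per dependent load (p=%p)\n", ns / iters, (void *)p);
    free(buf);
    return 0;
}
```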
http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).
Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).
L1 Data cache = 32 KB, 64 B/line, 8-WAY.
L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.
L2 cache = 256 KB, 64 B/line, 8-WAY
L3 cache = 8 MB, 64 B/line
L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])
L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).
L2 Cache Latency = 12 cycles
L3 Cache Latency = 36 cycles
RAM Latency = 36 cycles + 57 ns
The top-level benchmark page is http://www.7-cpu.com/utils.html, but still doesn't really explain what the different test-sizes mean, but the code is available. The test results include Skylake, which is nearly the same as Haswell in this test.
paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.
If I remember correctly it's about 1-2 clock cycles, but this is an estimate and newer caches may be faster. This is from a computer architecture book I have, and the information is for AMD, so Intel may be slightly different, but I would bound it between 5 and 15 clock cycles, which seems like a good estimate to me.
EDIT: Whoops, L2 is 10 cycles with tag access; L1 takes 1 to 2 cycles, my mistake.
Actually the cost of an L1 cache hit is almost the same as the cost of a register access. It was surprising to me, but it's true, at least for my processor (Athlon 64). Some time ago I wrote a simple test application to benchmark the efficiency of access to shared data in a multiprocessor system. The application body is simply a memory variable incremented over a predefined period of time. To make a comparison, I first benchmarked a non-shared variable. I captured the result, but then while disassembling the application I found that the compiler had defeated my expectations and applied an unwanted optimization to my code: it simply put the variable in a CPU register and incremented it iteratively there, without any memory access. The real surprise came after I forced the compiler to use an in-memory variable instead of a register variable: the updated application gave almost the same benchmark results. The performance degradation was negligible (~1-2%) and looks related to some side effect.
As a result:
1) I think you can consider the L1 cache as an unmanaged pool of processor registers.
2) There is no sense in applying brutal assembly-level optimization to force the compiler to keep frequently accessed data in processor registers. If the data really is frequently accessed, it will live in the L1 cache and so will have nearly the same access cost as a processor register.
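For reference, here is a minimal sketch of that experiment in C. The volatile qualifier is what forces the counter to live in memory (preventing the register optimization described above), and the empty asm barrier is a GCC/Clang-specific trick to keep the register version from being folded away; the iteration count is arbitrary. Whether the two versions come out nearly equal, as reported here, or several cycles apart, as the store-forwarding latency quoted in the earlier answer predicts, depends on the CPU and the compiler.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define ITERS (200 * 1000 * 1000L)

int main(void)
{
    struct timespec t0, t1;

    /* Counter the compiler is free to keep in a register. The empty asm
     * barrier (GCC/Clang extension) forces the add to actually execute
     * each iteration instead of being collapsed to reg_counter = ITERS. */
    long reg_counter = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) {
        reg_counter++;
        __asm__ __volatile__("" : "+r"(reg_counter));
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double reg_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);

    /* volatile forces a load and a store on every iteration, so the counter
     * lives in memory -- in practice in a hot L1d line. */
    volatile long mem_counter = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        mem_counter++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double mem_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);

    printf("register: %.2f ns/iter, memory: %.2f ns/iter (reg_counter=%ld)\n",
           reg_ns / ITERS, mem_ns / ITERS, reg_counter);
    return 0;
}
```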
