Related
It may seem a weird question..
Say the a cache line's size is 64 bytes. Further, assume that L1, L2, L3 has the same cache line size (this post said it's the case for Intel Core i7).
There are two objects A, B on memory, whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on the cache boundary, that is, its address is an integer multiple of 64.
1) If N < 64, when A is fetched by CPU, B will be read into the cache, too. So if B is needed, and the cache line is not evicted yet, CPU fetches B in a very short time. Everybody is happy.
2) If N >> 64 (i.e. much larger than 64), when A is fetched by CPU, B is not read into the cache line along with A. So we say "CPU doesn't like chase pointers around", and it is one of the reason to avoid heap allocated node-based data structure, like std::list.
My question is, if N > 64 but is still small, say N = 70, in other words, A and B do not fit in one cache line but are not too far away apart, when A is loaded by CPU, does fetching B takes the same amount of clock cycles as it would take when N is much larger than 64?
Rephrase - when A is loaded, let t represent the time elapse of fetching B, is t(N=70) much smaller than, or almost equal to, t(N=9999999)?
I ask this question because I suspect t(N=70) is much smaller than t(N=9999999), since CPU cache is hierarchical.
It is even better if there is a quantitative research.
There are at least three factors which can make a fetch of B after A misses faster. First, a processor may speculatively fetch the next block (independent of any stride-based prefetch engine, which would depend on two misses being encountered near each other in time and location in order to determine the stride; unit stride prefetching does not need to determine the stride value [it is one] and can be started after the first miss). Since such prefetching consumes memory bandwidth and on-chip storage, it will typically have a throttling mechanism (which can be as simple as having a modest sized prefetch buffer and only doing highly speculative prefetching when the memory interface is sufficiently idle).
Second, because DRAM is organized into rows and changing rows (within a single bank) adds latency, if B is in the same DRAM row as A, the access to B may avoid the latency of a row precharge (to close the previously open row) and activate (to open the new row). (This can also improve memory bandwidth utilization.)
Third, if B is in the same address translation page as A, a TLB may be avoided. (In many designs hierarchical page table walks are also faster in nearby regions because paging structures can be cached. E.g., in x86-64, if B is in the same 2MiB region as A, a TLB miss may only have to perform one memory access because the page directory may still be cached; furthermore, if the translation for B is in the same 64-byte cache line as the translation for A and the TLB miss for A was somewhat recent, the cache line may still be present.)
In some cases one can also exploit stride-base prefetch engines by arranging objects that are likely to miss together in a fixed, ordered stride. This would seem to be a rather difficult and limited context optimization.
One obvious way that stride can increase latency is by introducing conflict misses. Most caches use simple modulo a power of two indexing with limited associativity, so power of two strides (or other mappings to the same cache set) can place a disproportionate amount of data in a limited number of sets. Once the associativity is exceeded, conflict misses will occur. (Skewed associativity and non-power-of-two modulo indexing have been proposed to reduce this issue, but these techniques have not been broadly adopted.)
(By the way, the reason pointer chasing is particularly slow is not just low spatial locality but that the access to B cannot be started until after the access to A has completed because there is a data dependency, i.e., the latency of fetching B cannot be overlapped with the latency of fetching A.)
If B is at a lower address than A, it won't be in the same cache line even if they're adjacent. So your N < 64 case is misnamed: it's really the "same cache line" case.
Since you mention Intel i7: Sandybridge-family has a "spatial" prefetcher in L2, which (if there aren't a lot of outstanding misses already) prefetches the other cache line in a pair to complete a naturally-aligned 128B pair of lines.
From Intel's optimization manual, in section 2.3 SANDY BRIDGE:
2.3.5.4 Data Prefetching
... Some prefetchers fetch into L1.
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with
the pair line that completes it to a 128-byte aligned chunk.
... several other prefetchers try to prefetch into L2
IDK how soon it does this; if it doesn't issue the request until the first cache line arrives, it won't help much for a pointer-chasing case. A dependent load can execute only a couple cycles after the cache line arrives in L1D, if it's really just pointer-chasing without a bunch of computation latency. But if it issues the prefetch soon after the first miss (which contains the address for the 2nd load), the 2nd load could find its data already in L1D cache, having arrived a cycle or two after the first demand-load.
Anyway, this makes 128B boundaries relevant for prefetching in Intel CPUs.
See Paul's excellent answer for other factors.
I am generating a synthetic C benchmark aimed at causing a large number of instruction fetch misses via the following Python script:
#!/usr/bin/env python
import tempfile
import random
import sys
if __name__ == '__main__':
functions = list()
for i in range(10000):
func_name = "f_{}".format(next(tempfile._get_candidate_names()))
sys.stdout.write("void {}() {{\n".format(func_name))
sys.stdout.write(" double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
sys.stdout.write(" res = pi*r*r*h;\n")
sys.stdout.write(" res = res/(e*e);\n")
sys.stdout.write("}\n")
functions.append(func_name)
sys.stdout.write("int main() {\n")
sys.stdout.write("unsigned int i;\n")
sys.stdout.write("for(i =0 ; i < 100000 ;i ++ ){\n")
for i in range(10000):
r = random.randint(0, len(functions)-1)
sys.stdout.write("{}();\n".format(functions[r]))
sys.stdout.write("}\n")
sys.stdout.write("}\n")
What the code does is simply generating a C program that consists of a lot of randomly named dummy functions that are in turn called in random order in main(). I am compiling the resulting code with gcc 4.8.5 under CentOS 7 with -O0. The code is running on a dual socket machine fitted with 2x Intel Xeon E5-2630v3 (Haswell architecture).
What I am interested in is understanding instruction-related counters reported by perf when profiling the binary compiled from the C code (not the Python script, that is only used to automatically generate the code). In particular, I am observing the following counters with perf stat:
instructions
L1-icache-load-misses (instruction fetches that miss L1, aka r0280 on Haswell)
r2424, L2_RQSTS.CODE_RD_MISS (instruction fetches that miss L2)
rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)
I first profiled the code with all hardware prefetchers disabled in the BIOS, i.e.
MLC Streamer Disabled
MLC Spatial Prefetcher Disabled
DCU Data Prefetcher Disabled
DCU Instruction Prefetcher Disabled
and the results are the following (process is pinned to first core of second CPU and corresponding NUMA domain, but I guess this doesn't make much difference):
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code
Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':
25,108,610,204 instructions
2,613,075,664 L1-icache-load-misses
5,065,167,059 r2424
17 rf824
33.696954142 seconds time elapsed
Considering the figures above, I cannot explain such a high number of instruction fetch misses in L2. I have disabled all prefetchers, and L2_RQSTS.ALL_PF confirms so. But why do I see twice as much the number of instruction fetch misses in L2 than in L1i? In my (simple) mental processor model, if an instruction is looked up in L2, it must have necessarily been looked up in L1i before. Clearly I am wrong, what am I missing?
I then tried to run the same code with all the hardware prefetchers enabled, i.e.
MLC Streamer Enabled
MLC Spatial Prefetcher Enabled
DCU Data Prefetcher Enabled
DCU Instruction Prefetcher Enabled
and the results are the following:
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code
Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':
25,109,877,626 instructions
2,599,883,072 L1-icache-load-misses
5,054,883,231 r2424
908,494 rf824
Now, L2_RQSTS.ALL_PF seems to indicate that something more is happening and although I expected the prefetcher to be a bit more aggressive, I imagine that the instruction prefetcher is severely put to the test due to the jump-intensive type of workload and data prefetcher has not much to do with this kind of workload. But again, L2_RQSTS.CODE_RD_MISS is still too high with the prefetchers enabled.
So, to sum up, my question is:
With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS seems to be much higher than L1-icache-load-misses. Even with hardware prefetchers enabled, I still cannot explain it. What is the reason behind such a high count of L2_RQSTS.CODE_RD_MISS compared to L1-icache-load-misses?
The instruction prefetcher can generate requests are that don't count as accesses to the L1I cache, but are counted as code fetch requests at higher-numbered memory levels, such as the L2. This is generally true on all Intel microarchitectures with an instruction prefetcher. L2_RQSTS.CODE_RD_MISS counts both demand and prefetch requests from the L1I. Demand requests are generated by a multiplexing unit in the IFU that chooses a target fetch linear address from among the different units in the pipeline that may change the flow, such as the branch prediction units. Prefetch requests are generated by the L1I instruction prefetcher on an L1I miss if possible.
In general, the number of prefetch fetch requests is nearly proportional to the number of L1I misses. For instruction fetches from memory regions of cacheable memory types, the following formula holds:
ICACHE.MISSES <= L2_RQSTS.CODE_RD_MISS + L2_RQSTS.CODE_RD_HIT
I'm not sure whether this formula also holds for uncacheable fetch requests. I didn't test it in that condition. I know these requests are counted as ICACHE.MISSES, but not sure about the other events.
In your case, most instruction fetches will miss in the L1I and L2. You have 10,000 functions each nearly fully spans 2 64-btye cache lines (here is a version with only two functions), so the code size is much larger than the 256 KiB L2 available on Haswell. The functions are being called in a non-sequential and upredictable order, so the L1I and L2 prefetchers won't significantly help. The only noteworthy exception are returns, all of which will be predicted correctly using the RSB mechanism.
Each of the 10,000 functions are being called 100,000 times in a loop. Most fetch requests are for lines occupied by these functions. The total number of useful instruction fetch requests is about 2 lines per function * 10,000 function * 100,000 iterations = 2,000,000,000 lines, most of which will miss in the L1I and L2 (but probably hit in the L3 after the first cold iteration). Several millions of other requests will be for lines occupied by the loop body. Your measurements show that there are about 30% more instruction fetches that miss in the L1I. This is because of branch mispredictions, which cause fetch requests for incorrect lines that may not be even be in the L1I and/or L2. Each L1I miss may trigger a prefetch, so it's normal for L2 instruction fetches to be within two times the number of L1I misses. This is consistent with your numbers.
In my two-function version, I'm counting 24 instructions per invoked function, so I expect the total number of retired instructions to be approximately 24 billion, but you got 25 billion. Either I don't know how to count, or you have 25 instructions per function for some reason.
I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create a benchmarks that gradually increase the working set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 request per cycle) remain constant, but the ratio of L2 misses to L2 requests increases.
Can anyone show me an example of C program which forces "N" numbers of L2 cache misses?
You can generally force cache misses at some cache level by randomly accessing a working set larger than that cache level1.
You would expect the probability of any given load to be a miss to be something like: p(hit) = min(100, C / W), and p(miss) = 1 - p(hit) where p(hit) and p(miss) are the probabilities of a hit and miss, C is the relevant cache size, and W is the working set size. So for a miss rate of 50%, use a working set of twice the cache size.
A quick look at the formula above shows that p(miss) will never be 100%, since C/W only goes to 0 as W goes to infinity (and you probably can't afford an infinite amount of RAM). So your options are:
Getting "close enough" by using a very large working set (e.g., 4 GB gives you a 99%+ miss chance for a 256 KB), and pretending you have a miss rate of 100%.
Applying the formula to determine the actual expected number of misses. E.g., if you are using a working size of 2560 KB against an L2 cache of 256 KB, you have a miss rate of 90%. So if you want to examine the effect of 1,000 misses, you should make 1000 / 0.9 = ~1111 memory access to get about 1,000 misses.
Use any approximate approach but then actually count the number of misses you incur using the performance counter units on your CPU. For example, on Linux you could use PAPI or on Linux and Windows you could use Intel's PCM (if you are using Intel hardware).
Use an "almost random" approach to force the number of misses you want. The formula above is valid for random accesses, but if you choose you access pattern so that it is random with the caveat that it doesn't repeat "recent" accesses, you can get a 100% miss ratio. Here "recent" means accesses to cache lines that are likely to still be in the cache. Calculating what that means exactly is tricky, and depends in detail on the associativity and replacement algorithm of the cache, but if you don't repeat any access that has occurred in the last cache_size * 10 accesses, you should be pretty safe.
As for the C code, you should at least show us what you've tried. A basic outline is to create a vector of bytes or ints or whatever with the required size, then to randomly access that vector. If you make each access dependent on the previous access (e.g., use the integer read to calculate the index of the next read) you will also get a rough measurement of the latency of that level of cache. If the accesses are independent, you'll probably have several outstanding misses to the cache at once, and get more misses per unit time. Which one you are interested in depend on what you are studying.
For an open source project that does this kind of memory testing across different stride and working set sizes, take a look at TinyMemBench.
1 This gets a bit trickier for levels of caches that are shared among cores (usually L3 for recent Intel chips, for example) - but it should work well if your machine is pretty quiet while testing.
If I have this class:
class MyClass{
short a;
short b;
short c;
};
and I have this code performing calculations on the above:
std::vector<MyClass> vec;
//
for(auto x : vec){
sum = vec.a * (3 + vec.b) / vec.c;
}
I understand the CPU only loads the very data it needs from the L1 cache, but when the L1 cache retrieves data from the L2 cache it loads a whole "cache line" (which could include a few bytes of data it doesn't need).
How much data does the L2 cache load from the L3 cache, and the L3 cache load from main memory? Is it defined in terms of pages and if so, how would this answer differ according to different L2/L3 cache sizes?
L2 and L3 caches also have cache lines that are smaller than a virtual memory system page. The size of L2 and L3 cache lines is greater than or equal to the L1 cache line size, not uncommonly being twice that of the L1 cache line size.
For recent x86 processors, all caches use the same 64-byte cache line size. (Early Pentium 4 processors had 64-byte L1 cache lines and 128-byte L2 cache lines.)
IBM's POWER7 uses 128-byte cache blocks in L1, L2, and L3. (However, POWER4 used 128-byte blocks in L1 and L2, but sectored 512-byte blocks in the off-chip L3. Sectored blocks provide a valid bit for subblocks. For L2 and L3 caches, sectoring allows a single coherence size to be used throughout the system.)
Using a larger cache line size in last level cache reduces tag overhead and facilitates long burst accesses between the processor and main memory (longer bursts can provide more bandwidth and facilitate more extensive error correction and DRAM chip redundancy), while allowing other levels of cache and cache coherence to use smaller chunks which reduces bandwidth use and capacity waste. (Large last level cache blocks also provide a prefetching effect whose cache polluting issues are less severe because of the relatively high capacity of last level caches. However, hardware prefetching can accomplish the same effect with less waste of cache capacity.) With a smaller cache (e.g., typical L1 cache), evictions happen more frequently so the time span in which spatial locality can be exploited is smaller (i.e., it is more likely that only data in one smaller chunk will be used before the cache line is evicted). A larger cache line also reduces the number of blocks available, in some sense reducing the capacity of the cache; this capacity reduction is particularly problematic for a small cache.
It depends somewhat on the ISA and microarchitecture of your platform. Recent x86-64 based microarchitectures use 64 byte lines in all levels of the cache hierarchy.
Typically signed shorts will require two bytes each meaning that MyClass will need 6 bytes in addition the class overhead. If your C++ implementation stores the vector<> contiguously like an array you should get about 10 MyClass objects per 64-byte lines. Provided the vector<> is the right length, you won't load much garbage.
It's wise to note that since you're accessing the elements in a very predictable pattern the hardware prefetcher should kick in and fetch a reasonable amount of data it expects to use in the future. This could potentially bring more than you need into various levels of the cache hierarchy. It will vary from chip to chip.
I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?
How many cycles does an L1 cache hit take? How does it compare to register access?
Here's a great article on the subject:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1
To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)
PS:
The specifics will vary, but this link has some good ballpark figures:
Approximate cost to access various caches and main memory?
Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles remote
L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns
PPS:
These figures represent much older, slower CPUs, but the ratios basically hold:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2
Level Access Time Typical Size Technology Managed By
----- ----------- ------------ --------- -----------
Registers 1-3 ns ?1 KB Custom CMOS Compiler
Level 1 Cache (on-chip) 2-8 ns 8 KB-128 KB SRAM Hardware
Level 2 Cache (off-chip) 5-12 ns 0.5 MB - 8 MB SRAM Hardware
Main Memory 10-60 ns 64 MB - 1 GB DRAM Operating System
Hard Disk 3M - 10M ns 20 - 100 GB Magnetic Operating System/User
Throughput and latency are different things. You can't just add up cycle costs. For throughput, see Load/stores per cycle for recent CPU architecture generations - 2 loads per clock throughput for most modern microarchitectures. And see How can cache be that fast? for microarchitectural details of load/store execution units, including showing load / store buffers which limit how much memory-level parallelism they can track. The rest of this answer will focus only on latency, which is relevant for workloads that involve pointer-chasing (like linked lists and trees), and how much latency out-of-order exec needs to hide. (L3 Cache misses are usually too long to fully hide.)
Single-cycle cache latency used to be a thing on simple in-order pipelines at lower clock speeds (so each cycle was more nanoseconds), especially with simpler caches (smaller, not as associative, and with a smaller TLB for caches that weren't purely virtually addressed.) e.g. the classic 5-stage RISC pipeline like MIPS I assumes 1 cycle for memory access on a cache hit, with address calculation in EX and memory access in a single MEM pipeline stage, before WB.
Modern high-performance CPUs divide the pipeline up into more stages, allowing each cycle to be shorter. This lets simple instructions like add / or / and run really fast, still 1 cycle latency but at high clock speed.
For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.
Intel Haswell's L1 load-use latency is 4 cycles for pointer-chasing, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax] can run in a loop, with a pointer that points to itself. (Or for a linked list that hits in cache, easy to microbench with a closed loop). See also Is there a penalty when base+offset is in a different page than the base? That 4-cycle latency special case only applies if the pointer comes directly from another load, otherwise it's 5 cycles.
Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.
Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, reading from the store buffer for store data that hasn't yet committed to L1d cache).
As harold commented, register access is 0 cycles. So, for example:
inc eax has 1 cycle latency (just the ALU operation)
add dword [mem], 1 has 6 cycle latency until a load from dword [mem] will be ready. (ALU + store-forwarding). e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles.
mov rax, [rsi] has 4 cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency.)
http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).
Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).
L1 Data cache = 32 KB, 64 B/line, 8-WAY.
L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.
L2 cache = 256 KB, 64 B/line, 8-WAY
L3 cache = 8 MB, 64 B/line
L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])
L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).
L2 Cache Latency = 12 cycles
L3 Cache Latency = 36 cycles
RAM Latency = 36 cycles + 57 ns
The top-level benchmark page is http://www.7-cpu.com/utils.html, but still doesn't really explain what the different test-sizes mean, but the code is available. The test results include Skylake, which is nearly the same as Haswell in this test.
#paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.
If I remember correctly it's about 1-2 clock cycles but this is an estimate and newer caches may be faster. This is out of a Computer Architecture book I have and this is information for AMD so Intel may be slightly different but I would bound it between 5 and 15 clock cycles which seems like a good estimate to me.
EDIT: Whoops L2 is 10 cycles with TAG access, L1 takes 1 to two cycles, my mistake :\
Actually the cost of the L1 cache hit is almost the same as a cost of register access. It was surprising for me, but this is true, at least for my processor (Athlon 64). Some time ago I written a simple test application to benchmark efficiency of access to the shared data in a multiprocessor system. The application body is a simple memory variable incrementing during the predefined period of time. To make a comapison, I benchmarked non-shared variable at first. And during that activity I captured the result, but then during application disassembling I found that compiler was deceived my expectations and apply unwanted optimisation to my code. It just put variable in the CPU register and increment it iterativetly in the register without memory access. But real surprise was achived after I force compliler to use in-memory variable instead of register variable. On updated application I achived almost the same benchmarking results. Performance degradation was really negligeble (~1-2%) and looks like related to some side effect.
As result:
1) I think you can consider L1 cache as an unmanaged processor registers pool.
2) There is no any sence to apply brutal assambly optimization by forcing compiler store frequently accesing data in processor registers. If they are really frequently accessed, they will live in the L1 cache, and due to this will have same access cost as the processor register.