Cache interaction steps and read cycles

I'm struggling to fully grasp how caches work.
Let's say I have an L1 cache and an L2 cache. My understanding of a read that misses in L1 is:
1. The CPU gives the L1 controller the memory address.
2. The L1 cache controller determines the cache set, the requested cache tag, and the block offset.
3. The L1 cache circuits check whether the requested tag is in that set.
4. No matching L1 cache tag is found.
Does step 2 happen here, or after L1 sends L2 the memory address?
On read times: say L1 takes x cycles, L2 takes y cycles, and main memory takes z cycles. If the steps above happen, and then L2 finds a cache tag match and sends the line back to L1, which returns it to the CPU, how many cycles does that take in total? And when L1 returns the data to the CPU, does that count as a read cycle or not?
Thanks in advance for the help!

L1 might be inside the processor, but the process is still the same. Say the processor performs a read: the address and read/control signals go out. From the address, the L1 cache looks up the tag and determines hit or miss. If it is a hit, it returns the data; if it misses, the L1 needs to go out on its own address bus, adjusting the address so that it is aligned and sized to a cache line.

The L2 does the same thing the L1 does, at a high level: the address turns into a tag, the tag turns into a hit or miss, and on a miss it puts the aligned, line-sized fetch on its external address bus. This repeats until you hit something that answers (DRAM, a peripheral, etc.). When the L2 responds, it sends the line back to the L1; the L1, per the rules of the design and its settings, saves the line and then returns to the processor the data, at the size it originally asked for.

At that moment, depending on the design and settings, the L1 and L2 contain the same data; ideally the L2 contains everything that is in the L1, plus more. That said, non-cacheable requests should pass through, so you may have an L2 hit that does not result in the L1 storing the data. Also, depending on the design, a non-cacheable request may pass through to the far side of the L1 and/or L2 in the original processor size/shape rather than being aligned and sized to a cache line.
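To make that flow concrete, here is a toy C++ sketch of a read going through two cache levels, tied to the x/y/z cycle question above. Everything here is an assumption for illustration (fully associative lookup, no eviction, made-up latencies); it is not how a real controller is built.

#include <cstdint>
#include <unordered_map>

// Hypothetical latencies in cycles, standing in for the question's x, y, z.
constexpr int kL1Cycles  = 4;    // x: L1 lookup
constexpr int kL2Cycles  = 12;   // y: L2 lookup after an L1 miss
constexpr int kMemCycles = 100;  // z: main memory after an L2 miss

struct CacheLevel {
    static constexpr uint64_t kLineSize = 64;
    std::unordered_map<uint64_t, bool> lines;  // toy: tag -> present, no eviction

    bool lookup(uint64_t addr) const { return lines.count(addr / kLineSize) != 0; }
    void fill(uint64_t addr) { lines[addr / kLineSize] = true; }  // install the aligned line
};

// Approximate cost of one read: L1 hit ~ x, L2 hit ~ x + y, memory ~ x + y + z.
int read(CacheLevel& l1, CacheLevel& l2, uint64_t addr) {
    int cycles = kL1Cycles;               // L1 derives set/tag/offset and checks its ways
    if (l1.lookup(addr)) return cycles;   // L1 hit: data goes straight back to the CPU

    cycles += kL2Cycles;                  // L1 miss: line-aligned request forwarded to L2
    if (!l2.lookup(addr)) {               // L2 repeats the same tag check
        cycles += kMemCycles;             // L2 miss: fetch the line from DRAM
        l2.fill(addr);
    }
    l1.fill(addr);                        // line installed in L1, data returned to the CPU
    return cycles;
}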

Related

Cache misses on macOS

There are some questions about this topic, but none has a real answer. The question is: how can I measure L1, L2, L3 (if any) cache misses on macOS?
The problem is not that macOS fails to provide those values; in theory it does, even without any external tool. In Instruments we can use the Counters template and go to Recording Options..., as shown here:
However, there is no "L1 cache miss" or "L2 cache miss" entry, just a huge list of possible items that can be selected:
So, when measuring L1 and L2 cache misses (or even L3, if there is one), how can I count them?
Which item in the list is the "cache misses" counter I should pay attention to in order to retrieve that magic cache-miss number?
On Ivy Bridge, Haswell, Broadwell, and Goldmont processors, you can use the following events to count the number of data cache lines that were needed by demand load requests from cacheable(1) load instructions that missed the L1, L2, and L3: MEM_LOAD_UOPS_RETIRED.L1_MISS, MEM_LOAD_UOPS_RETIRED.L2_MISS, and MEM_LOAD_UOPS_RETIRED.L3_MISS, respectively. On Skylake and later, the corresponding events are called: MEM_LOAD_RETIRED.L1_MISS, MEM_LOAD_RETIRED.L2_MISS, and MEM_LOAD_RETIRED.L3_MISS. These events only count cache lines that were needed by load instructions that were retired.
On Nehalem and later, you can use the following events to count the number of cache lines that were needed by demand store requests from cacheable store instructions that missed the L1, L2, and L3: L2_RQSTS.ALL_RFO, L2_RQSTS.RFO_MISS, and OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37), respectively. These events count cache lines that were needed by store instructions that were retired or flushed from the pipeline.
Counting only retired instructions can be more useful than counting accesses from all instructions, depending on the scenario. Unfortunately, there are no store events that correspond to MEM_LOAD_UOPS_*. However, there are load events that count both retired and flushed loads. These include L2_RQSTS.ALL_DEMAND_DATA_RD for L1 load misses, L2_RQSTS.DEMAND_DATA_RD_MISS for L2 load misses, and OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37) for L3 load misses. Note that the first two events also include loads from the L1 hardware prefetchers. The L2_RQSTS.DEMAND_DATA_RD_MISS event is only supported on Ivy Bridge and later. On Sandy Bridge, I think it can be calculated by subtracting L2_RQSTS.DEMAND_DATA_RD_HIT from L2_RQSTS.ALL_DEMAND_DATA_RD.
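For the Sandy Bridge case in the last sentence, the derivation is just a subtraction of two raw counter values; a trivial sketch of that arithmetic (the counter names come from the text above, and actually reading them out of the PMU is out of scope here):

#include <cstdint>

// Approximate L2 demand-load misses on Sandy Bridge from the two counters
// named above; on Ivy Bridge and later, read L2_RQSTS.DEMAND_DATA_RD_MISS directly.
uint64_t l2_demand_load_misses(uint64_t all_demand_data_rd,    // L2_RQSTS.ALL_DEMAND_DATA_RD
                               uint64_t demand_data_rd_hit) {  // L2_RQSTS.DEMAND_DATA_RD_HIT
    return all_demand_data_rd - demand_data_rd_hit;
}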
See also: How does Linux perf calculate the cache-references and cache-misses events.
Footnotes:
(1) The IN instruction is counted as a MEM_LOAD_UOPS_RETIRED.L1_MISS event on Haswell (See: What does port-mapped I/O look like on Sandy Bridge). I've also verified empirically that all of the MEM_LOAD_UOPS_RETIRED.L1|2|3|LFB_MISS|HIT events don't count loads from the UC or WC memory types and that they do count loads from the WP, WB, and WT memory types. Note that the manual only mentions that UC loads are excluded and for only some of the events. By the way, MEM_UOPS_RETIRED.ALL_LOADS counts loads from all memory types.

CPU cache: does the distance between two addresses need to be smaller than 8 bytes to have a cache advantage?

It may seem like a weird question...
Say a cache line's size is 64 bytes. Further, assume that L1, L2, and L3 have the same cache line size (this post said it's the case for Intel Core i7).
There are two objects A and B in memory, whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on a cache line boundary, that is, its address is an integer multiple of 64.
1) If N < 64, when A is fetched by the CPU, B will be read into the cache too. So if B is needed, and the cache line has not been evicted yet, the CPU fetches B in a very short time. Everybody is happy.
2) If N >> 64 (i.e. much larger than 64), when A is fetched by the CPU, B is not read into the cache line along with A. So we say "the CPU doesn't like to chase pointers around", and it is one of the reasons to avoid heap-allocated, node-based data structures like std::list.
My question is: if N > 64 but still small, say N = 70 (in other words, A and B do not fit in one cache line but are not too far apart), when A is loaded by the CPU, does fetching B take the same number of clock cycles as it would when N is much larger than 64?
Rephrased: when A is loaded, let t represent the time it takes to fetch B; is t(N=70) much smaller than, or almost equal to, t(N=9999999)?
I ask this question because I suspect t(N=70) is much smaller than t(N=9999999), since the CPU cache is hierarchical.
It would be even better if there is quantitative research on this.
There are at least three factors which can make a fetch of B after A misses faster. First, a processor may speculatively fetch the next block (independent of any stride-based prefetch engine, which would depend on two misses being encountered near each other in time and location in order to determine the stride; unit stride prefetching does not need to determine the stride value [it is one] and can be started after the first miss). Since such prefetching consumes memory bandwidth and on-chip storage, it will typically have a throttling mechanism (which can be as simple as having a modest sized prefetch buffer and only doing highly speculative prefetching when the memory interface is sufficiently idle).
Second, because DRAM is organized into rows and changing rows (within a single bank) adds latency, if B is in the same DRAM row as A, the access to B may avoid the latency of a row precharge (to close the previously open row) and activate (to open the new row). (This can also improve memory bandwidth utilization.)
Third, if B is in the same address translation page as A, a TLB may be avoided. (In many designs hierarchical page table walks are also faster in nearby regions because paging structures can be cached. E.g., in x86-64, if B is in the same 2MiB region as A, a TLB miss may only have to perform one memory access because the page directory may still be cached; furthermore, if the translation for B is in the same 64-byte cache line as the translation for A and the TLB miss for A was somewhat recent, the cache line may still be present.)
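If you want to check these conditions for a concrete pair of addresses, the tests are simple integer divisions. A small sketch, assuming the common x86-64 sizes mentioned above (64-byte lines, 4 KiB pages, 2 MiB page-directory granularity):

#include <cstdint>

constexpr uint64_t kLine   = 64;          // cache line
constexpr uint64_t kPage   = 4096;        // 4 KiB page: same TLB entry
constexpr uint64_t kRegion = 2ull << 20;  // 2 MiB region: same page-directory entry

bool same_cache_line(uint64_t a, uint64_t b)  { return a / kLine   == b / kLine; }
bool same_page(uint64_t a, uint64_t b)        { return a / kPage   == b / kPage; }
bool same_2mib_region(uint64_t a, uint64_t b) { return a / kRegion == b / kRegion; }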
In some cases one can also exploit stride-based prefetch engines by arranging objects that are likely to miss together with a fixed, ordered stride. This would seem to be a rather difficult and limited-context optimization.
One obvious way that stride can increase latency is by introducing conflict misses. Most caches use simple modulo a power of two indexing with limited associativity, so power of two strides (or other mappings to the same cache set) can place a disproportionate amount of data in a limited number of sets. Once the associativity is exceeded, conflict misses will occur. (Skewed associativity and non-power-of-two modulo indexing have been proposed to reduce this issue, but these techniques have not been broadly adopted.)
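To see the conflict effect numerically, here is a small sketch that counts how many distinct sets a strided access pattern touches, assuming a hypothetical 8-way cache with 64 sets and 64-byte lines (a 32 KiB, L1-like geometry) and the simple modulo indexing described above:

#include <cstdint>
#include <set>

constexpr uint64_t kLineSize = 64;
constexpr uint64_t kNumSets  = 64;
constexpr uint64_t kWays     = 8;   // associativity: conflict misses start once a set holds more than 8 hot lines

uint64_t distinct_sets(uint64_t base, uint64_t stride, int accesses) {
    std::set<uint64_t> sets;
    for (int i = 0; i < accesses; ++i)
        sets.insert(((base + i * stride) / kLineSize) % kNumSets);
    return sets.size();
}

// distinct_sets(0, 4096, 64) == 1: a 4 KiB stride maps all 64 accesses to one set,
// so after kWays of them the set's associativity is exhausted and conflict misses begin.
// distinct_sets(0, 64, 64) == 64: a one-line stride spreads the same accesses over every set.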
(By the way, the reason pointer chasing is particularly slow is not just low spatial locality but that the access to B cannot be started until after the access to A has completed because there is a data dependency, i.e., the latency of fetching B cannot be overlapped with the latency of fetching A.)
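That data dependency is easy to see in code: in a list walk, each load's address is the result of the previous load, while in an array walk all the addresses are known up front and the misses can overlap. A minimal sketch (the types here are just for illustration):

#include <vector>

struct Node { int value; Node* next; };

// Dependent chain: the address for load i+1 is the result of load i,
// so the miss latencies add up serially.
long sum_list(const Node* head) {
    long sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next) sum += p->value;
    return sum;
}

// Independent accesses: addresses are computable in advance, so the hardware
// can keep several misses in flight and overlap their latencies.
long sum_array(const std::vector<int>& v) {
    long sum = 0;
    for (int x : v) sum += x;
    return sum;
}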
If B is at a lower address than A, it won't be in the same cache line even if they're adjacent. So your N < 64 case is misnamed: it's really the "same cache line" case.
Since you mention Intel i7: Sandybridge-family has a "spatial" prefetcher in L2, which (if there aren't a lot of outstanding misses already) prefetches the other cache line in a pair to complete a naturally-aligned 128B pair of lines.
From Intel's optimization manual, in section 2.3 SANDY BRIDGE:
2.3.5.4 Data Prefetching
... Some prefetchers fetch into L1.
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
... several other prefetchers try to prefetch into L2
IDK how soon it does this; if it doesn't issue the request until the first cache line arrives, it won't help much for a pointer-chasing case. A dependent load can execute only a couple cycles after the cache line arrives in L1D, if it's really just pointer-chasing without a bunch of computation latency. But if it issues the prefetch soon after the first miss (which contains the address for the 2nd load), the 2nd load could find its data already in L1D cache, having arrived a cycle or two after the first demand-load.
Anyway, this makes 128B boundaries relevant for prefetching in Intel CPUs.
See Paul's excellent answer for other factors.
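If two objects are almost always touched together, one way to lean on that pair-completing behaviour is to co-locate them inside a single naturally aligned 128-byte block, so that even when they land in different 64-byte lines they form one pair. A hedged layout sketch, not something the manual prescribes:

// Two 64-byte halves inside one naturally aligned 128-byte chunk: on
// Sandy Bridge-family parts, the L2 spatial prefetcher tends to pull in the
// second line of the pair when the first one is fetched.
struct alignas(128) HotPair {
    char first_half[64];   // occupies the first 64-byte line
    char second_half[64];  // occupies the pair line that completes the 128-byte chunk
};

static_assert(sizeof(HotPair) == 128, "exactly two cache lines");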

Can you help me understand the cache behaviour on an ARM Cortex-A9?

I tried to understand what was going on during a LOAD and/or STORE instruction. Therefore I performed 4 tests, and each time I measured the number of CPU cycles (CC), cache hits (CH), cache misses (CM), data reads (DR), and data writes (DW).
After reading the different counters, I just flush the L1 (I/D cache).
Test1:
LDRB R3, [R4,#1]!
STR R3, [SP,#0x48+var_34]
Results: 4 (CC), 3 (CH), 1 (CM), 1 (DR), 2 (DW)
Test2:
LDR R3, [SP,#0x48+var_34]
LDR R3, [R3]
Results: 4 (CC), 3 (CH), 1 (CM), 2 (DR), 1 (DW)
Test3:
LDR R3, [SP,#0x48+var_38]
LDR R3, [R3]
STR R3, [SP,#0x48+var_30]
Results: 4 (CC), 4 (CH), 1 (CM), 2 (DR), 2 (DW)
var_30 is returned at the end of the current function.
Test4:
LDR R2, [SP,#0x48+var_34]
LDR R3, [R2]
Results: 4 (CC), 3 (CH), 1 (CM), 2 (DR), 1 (DW)
Here is my understanding:
1. Cache misses
In each test we have 1 cache miss because when one performs
LDR reg, something
"Something" is going to be cached, and there will be a cache miss.
And... that's pretty much the only "logical" interpretation I could make...
I do not understand the different values for the cache hits, data reads, and data writes.
Any idea?
The ARM documentation at infocenter.arm.com spells out quite clearly, in the AMBA/AXI documentation, what happens on the AXI/AMBA bus. The processor-to-L1 path is tightly coupled, not AMBA/AXI; it is all within the core. If you are only clearing the L1, then the L2 may still contain the values, so one experiment compared to another may show different results depending on whether the L2 misses or not. Also, you are not just measuring the load and store but the fetch of the instructions too, and their alignment will change the results: even with just two instructions, if a cache-line boundary falls between them the performance may differ from when they sit in the same line. There are experiments to do just with that, based on alignment within a line, to see when and whether another cache-line fetch goes out.
Also, trying to get deterministic numbers on processors like these is a bit difficult, particularly with the caches on. If you are running these experiments on anything but bare metal, then there is no reason to expect any kind of meaningful results. On bare metal the results are still suspect, but they can be made more deterministic.
If you are simply trying to understand cache basics, not specific to ARM or any other platform, then just google it, go to Wikipedia, etc. There is a ton of info out there. A cache is just faster RAM, closer to the processor in time as well as being fast (more expensive) SRAM. Quite simply, the cache looks at your address, looks it up in a table or set of tables, and determines hit or miss. If it is a hit, it returns the value or accepts the write data and completes the processor side of the transaction (allowing the processor to continue while the cache finishes the write; fire and forget, basically). If it is a miss, it has to figure out whether there is a spare opening in the cache for this data; if not, it has to evict something by writing it out. Then, or if there was already an empty spot, it can do a cache-line read, which is often larger than the read you asked for. That read hits the L2 in the same way as the L1 (hit or miss, evict or not, and so on) until it reaches a cache layer that gets a hit, or the final RAM or peripheral, where it gets the data. The data is then written into all the cache layers on the way back to the L1, and the processor finally gets the little bit of data it asked for. If the processor then asks for another data item in that cache line, it is already in L1 and returns really fast.

L2 is usually bigger than L1, and so on, such that everything in L1 is also in L2 but not everything in L2 is in L1. That way you can evict from L1 to L2, and if something comes along later it may miss L1 but hit L2, which is still much faster than going to slow DRAM. It is a bit like keeping the tools or reference materials you use most often close to you at your desk and the things you use less often further away, since there isn't room for everything; as you change projects, what is used most and least often changes, and their positions on the desk change.
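The "spare opening or evict something" step described above can be sketched as a single cache set with limited associativity and an LRU policy; the geometry and the replacement policy here are assumptions for illustration only, real parts vary:

#include <cstddef>
#include <cstdint>
#include <list>

// Toy model of a single 4-way set: a hit refreshes recency, a miss installs the
// tag and evicts the least recently used way if the set is full (a write-back
// cache would first have to write that victim out if it were dirty).
struct CacheSet {
    static constexpr std::size_t kWays = 4;
    std::list<uint64_t> tags;  // front = most recently used

    bool access(uint64_t tag) {
        for (auto it = tags.begin(); it != tags.end(); ++it) {
            if (*it == tag) {
                tags.splice(tags.begin(), tags, it);  // hit: move to front
                return true;
            }
        }
        if (tags.size() == kWays) tags.pop_back();  // no spare opening: evict the LRU way
        tags.push_front(tag);                       // fill with the new line's tag
        return false;                               // miss: the next level must supply the line
    }
};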

Do typical multicore processors have multiple ports from L1 to L2

For typical x86 multicore processors, let us say we have a processor with 2 cores and both cores encounter an L1 instruction cache miss when reading an instruction. Let's also assume that both of the cores are accessing data at addresses which are in separate cache lines. Would those two cores get data from L2 to L1 instruction cache simultaneously or would it be serialized? In other words, do we have multiple ports for L2 cache access for different cores?
For typical x86 multicore processors, let us say we have a processor with 2 cores
OK, let's use some early variant of the Intel Core 2 Duo with two cores (Conroe). It has 2 CPU cores, 2 L1i caches, and a shared L2 cache.
and both cores encounter an L1 instruction cache miss when reading an instruction.
OK, there will be a miss in L1i when reading the next instruction (a miss in L1d, when you access data, works in a similar way, but there are only reads from L1i, while there are both reads and writes from L1d). Each L1i that misses will generate a request to the next layer of the memory hierarchy, the L2 cache.
Let's also assume that both of the cores are accessing data at addresses which are in separate cache lines.
Now we must know how the caches are organized (this is the classic middle-detail cache scheme, which is logically similar to real hardware). A cache is a memory array with special access circuits, and it looks like a 2D array. We have many sets (64 in this picture), and each set has several ways. When we ask the cache to get data from some address, the address is split into 3 parts: tag, set index, and offset inside the cache line. The set index is used to select the set (the row in our 2D cache memory array), then the tags in all ways of that set are compared (to find the right column in the 2D array) with the tag part of the request address; this is done in parallel by 8 tag comparators. If there is a tag in the cache equal to the tag part of the request address, the cache has a "hit", and the cache line from the selected cell is returned to the requester.
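In code, that split is just divisions (or shifts and masks); a sketch for the geometry in the picture, 64 sets, with a 64-byte line size assumed:

#include <cstdint>

constexpr uint64_t kLineSize = 64;  // assumed: 6 offset bits
constexpr uint64_t kNumSets  = 64;  // from the picture: 6 index bits

struct AddrParts { uint64_t tag, set, offset; };

// The offset picks the byte inside the line, the set index picks the row, and
// the tag is what the 8 per-way comparators of that set compare in parallel.
AddrParts split(uint64_t addr) {
    return {
        addr / (kLineSize * kNumSets),  // tag: all bits above the index
        (addr / kLineSize) % kNumSets,  // set index: bits [11:6]
        addr % kLineSize                // offset inside the line: bits [5:0]
    };
}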
Ways and sets; 2D array of cache (image from http://www.cnblogs.com/blockcipher/archive/2013/03/27/2985115.html or http://duartes.org/gustavo/blog/post/intel-cpu-caches/)
Here is an example where set index 2 was selected, and the parallel tag comparators give a "hit" (tag equality) for Way 1:
What is a "port" to some memory or to a cache? It is the hardware interface between external hardware blocks and the memory, with lines for the request address (set by the external block: for L1 it is set by the CPU, for L2 by L1), the access type (load or store; may be fixed for the port), data input (for stores), and data output with a ready bit (set by the memory; the cache logic handles misses too, so it returns data both on a hit and on a miss, just later for a miss).
If we want to increase the true port count, we must add hardware: for a raw SRAM memory array we must add two transistors for every bit to increase the port count by 1; for a cache we must duplicate ALL of the tag comparator logic. This has too high a cost, so there is not much truly multiported memory in a CPU, and where it does have several ports, the total count of true ports is small.
But we can emulate having several ports. From http://web.eecs.umich.edu/~twenisch/470_F07/lectures/15.pdf (EECS 470, 2007, slide 11):
Parallel cache access is harder than parallel FUs
fundamental difference: caches have state, FUs don’t
one port affects future for other ports
Several approaches used
true multi‐porting
multiple cache copies
virtual multi‐porting
multi‐banking (interleaving)
line buffers
Multi-banking (sometimes called slicing) is used by modern chips ("Intel Core i7 has four banks in L1 and eight banks in L2"; figure 1.6 from page 9 of ISBN 1598297546 (2011) - https://books.google.com/books?id=Uc9cAQAAQBAJ&pg=PA9&lpg=PA9 ). It means that there are several smaller hardware caches, and some bits of the request address (part of the set index; think of the sets, the rows, as split over 8 parts, or as interleaved rows colored by bank) are used to select the bank. Each bank has a low number of ports (1) and functions just like a classic cache (there is a full set of tag comparators in each bank, but the height of the bank, i.e. the number of sets in it, is smaller, and every tag in the array is routed to only a single tag comparator, which is as cheap as in a single-ported cache).
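A sketch of the bank-selection idea: a few low bits of the line address pick one of the 8 banks from the i7 quote (which bits are used, and how many, is an assumption about one particular arrangement):

#include <cstdint>

constexpr uint64_t kLineSize = 64;
constexpr uint64_t kNumBanks = 8;

uint64_t bank_of(uint64_t addr) { return (addr / kLineSize) % kNumBanks; }

// Requests landing in different single-ported banks can be served in the same
// cycle; requests to the same bank must take turns.
bool can_serve_in_parallel(uint64_t a, uint64_t b) { return bank_of(a) != bank_of(b); }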
Would those two cores get data from L2 to L1 instruction cache simultaneously or would it be serialized? In other words, do we have multiple ports for L2 cache access for different cores?
If two accesses are routed to different L2 banks (slices), then the cache behaves as if multiported and can handle both requests at the same time. But if both are routed to a single bank with a single port, they will be serialized by the cache. This serialization may cost several ticks, and the request will be stalled near the port; the CPU will see this as slightly higher access latency.

How much data is loaded into the L2 and L3 caches?

If I have this class:
class MyClass {
    short a;
    short b;
    short c;
};
and I have this code performing calculations on the above:
std::vector<MyClass> vec;
//
for (auto x : vec) {
    sum = x.a * (3 + x.b) / x.c;
}
I understand the CPU only loads the very data it needs from the L1 cache, but when the L1 cache retrieves data from the L2 cache it loads a whole "cache line" (which could include a few bytes of data it doesn't need).
How much data does the L2 cache load from the L3 cache, and the L3 cache load from main memory? Is it defined in terms of pages and if so, how would this answer differ according to different L2/L3 cache sizes?
L2 and L3 caches also have cache lines that are smaller than a virtual memory system page. The size of L2 and L3 cache lines is greater than or equal to the L1 cache line size, not uncommonly being twice that of the L1 cache line size.
For recent x86 processors, all caches use the same 64-byte cache line size. (Early Pentium 4 processors had 64-byte L1 cache lines and 128-byte L2 cache lines.)
IBM's POWER7 uses 128-byte cache blocks in L1, L2, and L3. (However, POWER4 used 128-byte blocks in L1 and L2, but sectored 512-byte blocks in the off-chip L3. Sectored blocks provide a valid bit for subblocks. For L2 and L3 caches, sectoring allows a single coherence size to be used throughout the system.)
Using a larger cache line size in last level cache reduces tag overhead and facilitates long burst accesses between the processor and main memory (longer bursts can provide more bandwidth and facilitate more extensive error correction and DRAM chip redundancy), while allowing other levels of cache and cache coherence to use smaller chunks which reduces bandwidth use and capacity waste. (Large last level cache blocks also provide a prefetching effect whose cache polluting issues are less severe because of the relatively high capacity of last level caches. However, hardware prefetching can accomplish the same effect with less waste of cache capacity.) With a smaller cache (e.g., typical L1 cache), evictions happen more frequently so the time span in which spatial locality can be exploited is smaller (i.e., it is more likely that only data in one smaller chunk will be used before the cache line is evicted). A larger cache line also reduces the number of blocks available, in some sense reducing the capacity of the cache; this capacity reduction is particularly problematic for a small cache.
It depends somewhat on the ISA and microarchitecture of your platform. Recent x86-64 based microarchitectures use 64 byte lines in all levels of the cache hierarchy.
Typically, signed shorts will require two bytes each, meaning that MyClass will need 6 bytes in addition to any class overhead. If your C++ implementation stores the vector<> contiguously like an array, you should get about 10 MyClass objects per 64-byte line. Provided the vector<> is the right length, you won't load much garbage.
It's worth noting that since you're accessing the elements in a very predictable pattern, the hardware prefetcher should kick in and fetch a reasonable amount of data it expects to use in the future. This could potentially bring more than you need into various levels of the cache hierarchy. It will vary from chip to chip.
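To make the 6-bytes-per-object arithmetic concrete (assuming 2-byte shorts and no padding, as the answer does):

#include <cstddef>

struct MyClass { short a, b, c; };  // 3 x 2 bytes; no padding expected

static_assert(sizeof(MyClass) == 6, "assumes 2-byte shorts and no padding");

// A 64-byte line holds 64 / 6 = 10 complete objects (the 11th straddles into
// the next line), so sequential iteration over a contiguous std::vector touches
// a new line roughly once every 10 elements.
constexpr std::size_t kObjectsPerLine = 64 / sizeof(MyClass);  // == 10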
