Can you help me understand the cache behaviour on an ARM Cortex-A9?

I am trying to understand what happens during a LOAD and/or STORE instruction, so I ran 4 tests. For each test I measured the number of CPU cycles (CC), cache hits (CH), cache misses (CM), data reads (DR), and data writes (DW).
After reading the counters I flush the L1 (I and D) cache.
Test1:
LDRB R3, [R4,#1]!
STR R3, [SP,#0x48+var_34]
Results: 4 (CC), 3 (CH), 1 (CM), 1 (DR), 2 (DW)
Test2:
LDR R3, [SP,#0x48+var_34]
LDR R3, [R3]
Results: 4 3 1 2 1
Test3:
LDR R3, [SP,#0x48+var_38]
LDR R3, [R3]
STR R3, [SP,#0x48+var_30]
Results: 4 4 1 2 2
var_30 is returned at the end of the current function.
Test4:
LDR R2, [SP,#0x48+var_34]
LDR R3, [R2]
Results: 4 3 1 2 1
Here is my understanding:
1. Cache misses
In each test we have 1 cache miss, because when one performs
LDR reg, something
"something" has to be brought into the cache, so there will be a cache miss.
And... that's pretty much the only "logical" interpretation I could make...
I do not understand the different values for the cache hits, data reads, and data writes.
Any idea?

The ARM documentation at infocenter.arm.com spells out quite clearly what happens on the AXI/AMBA bus in the AMBA/AXI documentation. The processor-to-L1 connection, though, is tightly coupled and not AMBA/AXI; it is all within the core. If you are only clearing the L1, then the L2 may still contain the values, so one experiment compared to another may show different results depending on whether the L2 misses or not. Also, you are not just measuring the load and store but the fetch of the instructions too, and their alignment will change the results: even with only two instructions, if a cache-line boundary falls between them the performance may differ from when they sit in the same line. There are experiments to do just on that, based on alignment within a line, to see when and whether another cache-line fetch goes out.
Also, trying to get deterministic numbers on processors like these is a bit difficult, particularly with the caches on. If you are running these experiments on anything but bare metal, then there is no reason to expect any kind of meaningful results. With bare metal the results are still suspect, but they can be made more deterministic.
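For reference, on bare metal the counters in question are read from the ARMv7-A PMU. A minimal sketch of doing that from C with GCC-style inline assembly might look like the following; this assumes PL1 privilege and uses architectural event 0x03 (L1 data cache refill), and it is only an illustration, not the exact setup used for the numbers above:

#include <stdint.h>

/* Read the cycle counter (PMCCNTR). */
static inline uint32_t read_cycles(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
    return v;
}

/* Enable the PMU and program event counter 0 to count L1 D-cache refills
   (ARMv7 common event number 0x03). Assumes we are running at PL1. */
static inline void setup_l1d_refill_counter(void)
{
    uint32_t pmcr;
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));       /* PMCR */
    pmcr |= 0x7;                       /* E=enable, P=reset event counters, C=reset cycle counter */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(0));          /* PMSELR: select counter 0 */
    __asm__ volatile("mcr p15, 0, %0, c9, c13, 1" :: "r"(0x03));       /* PMXEVTYPER: L1D refill */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"((1u << 31) | 1u)); /* PMCNTENSET: cycles + ctr 0 */
}

/* Read event counter 0 (PMXEVCNTR for the counter selected in PMSELR). */
static inline uint32_t read_l1d_refills(void)
{
    uint32_t v;
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(0));
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(v));
    return v;
}

Read the counters immediately before and after the two-instruction sequence and subtract; even then, the fetch of the measurement code itself perturbs the numbers, which is part of why they are hard to make deterministic.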
If you are simply trying to understand cache basics, not specific to ARM or any other platform, then just google that, go to Wikipedia, etc. There is TONS of info out there. A cache is just faster RAM: closer to the processor in time as well as being fast (more expensive) SRAM. Quite simply, the cache looks at your address and looks it up in a table or set of tables to determine hit or miss. If it is a hit, it returns the value, or it accepts the write data and completes the processor side of the transaction (allowing the processor to continue, then going on to write the cache itself; fire and forget, basically). If it is a miss, it has to figure out whether there is a spare opening in the cache for this data; if not, it has to evict something by writing it out. Then, or if there was already an empty spot, it can do a cache-line read, which is often larger than the read you asked for. That read hits the L2 in the same way it hit the L1: hit or miss, evict or not, and so on, until it reaches either a cache layer that gets a hit or the final RAM or peripheral that supplies the data. The data is then written to all the cache layers on the way back to the L1, and the processor finally gets the little bit of data it asked for. If the processor then asks for another data item in that cache line, it is already in L1 and returns really fast.

The L2 is usually bigger than the L1, and so on, such that everything in L1 is in L2 but not everything in L2 is in L1. That way you can evict from L1 to L2, and if something comes along later it may miss L1 but hit L2 and still be much faster than going to slow DRAM. It is a bit like keeping the tools or reference materials you use most often closer to you at your desk and the things you use less often further away, since there isn't room for everything; as you change projects, what you use most and least often evolves, and their positions on the desk change.
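If it helps to see that lookup as code, here is a toy model in C of the hit/miss decision for a single-level, direct-mapped, read-only cache. The 64-byte line and 256 sets are arbitrary choices; a real L1 is set-associative and also handles writes, dirty evictions to L2, and so on:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64u
#define NUM_SETS   256u

static uint32_t tags[NUM_SETS];
static bool     valid[NUM_SETS];

/* Returns true on a hit; on a miss, pretend to fetch the whole aligned line and install it. */
static bool cache_access(uint32_t addr)
{
    uint32_t set = (addr / LINE_BYTES) % NUM_SETS;   /* index bits pick the set */
    uint32_t tag = addr / (LINE_BYTES * NUM_SETS);   /* remaining upper bits are the tag */

    if (valid[set] && tags[set] == tag)
        return true;                 /* hit: data comes straight from this level */

    /* miss: a real cache would evict/write back the old line if needed;
       here we just overwrite the tag, i.e. "fetch" the aligned line and install it */
    valid[set] = true;
    tags[set]  = tag;
    return false;
}

int main(void)
{
    printf("%d\n", cache_access(0x1000));  /* 0: miss, line 0x1000..0x103f brought in */
    printf("%d\n", cache_access(0x1038));  /* 1: hit, same 64-byte line */
    printf("%d\n", cache_access(0x1040));  /* 0: miss, next line */
    return 0;
}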

Related

Cortex M4 LDR/STR timing

I am reading through the Cortex-M4 TRM to understand instruction execution cycles. However, there are some confusing descriptions there.
1. In the Table of Processor Instructions, STR takes 2 cycles.
2. Later, in Load/store timings, it indicates that:
STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing.
If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete.
If the store is not to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.
3. Still in Load/store timings, it indicates that LDR can be pipelined by a following LDR or STR, but STR can't be pipelined by following instructions:
Other instructions cannot be pipelined after STR with register offset. STR can only be pipelined when it follows an LDR, but nothing can be pipelined after the store. Even a stalled STR normally only takes two cycles, because of the write buffer.
More specifically, here is what confused me:
Q1. Points 1 and 2 seem to conflict with each other: how many cycles does STR actually take, 1 or 2? (My experiment shows 1, though.)
Q2. Point 2 indicates that if the store goes through the write buffer and the buffer is not available, the pipeline stalls regardless; but if the store bypasses the buffer, the pipeline may only stall when another load/store instruction follows. It smells like the write buffer can only make things worse, which is contrary to common sense.
Q3. Point 3 says STR can't be pipelined with the following instruction, yet point 2 says STR is always pipelined with the following instruction under the proper conditions. How should I understand these conflicting statements? (And here it says STR takes 2 cycles instead of 1, because of the write buffer.)
Q4. I can't find more information on how the write buffer is implemented. How large is the buffer? How does STR determine whether to use it or bypass it?
Type of STR
Note that on "Load/Store timings page" the first statement refers to STR with a literal offset to the base address register (STR Rx,[Ry,#imm]). Further down it refers to an STR with a register offset to the base address register (STR R1,[R3,R2]). These are two different variants of the STR instruction.
Literal offset STR (STR Rx,[Ry,#imm])
Hmm, I wonder if the documentation is misleading when it says "always 1 cycle", because it then adds a caveat which means it could take multiple cycles: "... the next instruction is delayed until the store can complete".
I am going to do my best to interpret the documentation:
STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing. If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete. If the store is not to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.
I would assume that the first STR takes 1 cycle, if the write buffer is available. If it is not available, the next instruction will be stalled until the buffer is available. However, if the buffer is not in use, it will delay the next instruction until the bus transaction completes.
With a non-consecutive STR (the first STR) the write buffer will be empty, and the instruction takes 1 cycle. If there are 2 consecutive STR instructions, the 2nd STR can begin immediately because the 1st STR has written to the buffer. However, if the bus transaction for the 1st STR stalls and remains in the write buffer, the 2nd STR will be unable to write to the buffer and will block further instructions. Then, when the bus transaction for the 1st STR completes, the buffer is emptied, the 2nd STR writes to the buffer, and the next instruction is unblocked.
A stalled bus transaction, where the transaction is buffered in the write buffer, doesn't affect non-STR instructions, as they do not need access to the write buffer to complete. So an STR instruction whose bus transaction has stalled will not delay further instructions unless the next one is another STR. However, if the write buffer is not in use, then a stalled bus transaction will delay all instructions.
It does seem a bit off that the instruction set summary page puts a solid "2" as the number of cycles for STR when clearly it is not as predictable as this.
Register offset STR (STR R1,[R3,R2])
I stand with you on your confusion over the following apparently conflicting statement:
Other instructions cannot be pipelined after STR with register offset. STR can only be pipelined when it follows an LDR, but nothing can be pipelined after the store. Even a stalled STR normally only takes two cycles, because of the write buffer.
This is contradicted by the first clause on the page, but I believe that is because the two passages are referring to 2 different STR types: literal offset (the first one) and register offset, the register offset STR being the one that can't allow pipelined instructions afterwards. The language could be clearer, though. What does it mean by a stalled STR? Is it referring to a register offset STR, which always stalls by default? Is this stall different from a stall caused by the write buffer being unavailable? It is easy to get lost here.
I think basically a register offset STR is a minimum of 2 cycles. It is going to block and take more cycles if the write buffer is unavailable, or if the transaction is not buffered and the bus stalls.
Size of write buffer
The size is a single entry, see https://developer.arm.com/documentation/100166/0001/Programmers-Model/Write-buffer?lang=en
To prevent bus wait cycles from stalling the processor during data stores, buffered stores to the DCode and System buses go through a one-entry write buffer. If the write buffer is full, subsequent accesses to the bus stall until the write buffer has drained.
The write buffer is only used if the bus waits the data phase of the buffered store, otherwise the transaction completes on the bus.
Usefulness of write buffer
As far as my understanding goes: if the CPU could write to a bus instantly, it would not need a buffer, as the bus would be free immediately for the next instruction. On a high-performance part like the M4, some of the memory buses can't keep up with the CPU clock rate, which means a transaction can take multiple cycles. There can also be DMA units that make use of the same bus. To prevent stalling the CPU until a bus transaction completes, the buffer provides an immediate place to put the store, and the hardware then writes it out to the bus when the bus is free.
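To make that benefit concrete, here is a toy cycle model in C of a one-entry write buffer in front of a slow bus. The 1-cycle issue time and the 4-cycle bus latency are made-up parameters for illustration, not Cortex-M4 numbers:

#include <stdio.h>

#define BUS_LATENCY 4   /* cycles a store occupies the bus (made-up value) */

/* Issue n back-to-back stores; return the cycle at which the CPU can run the next instruction. */
static int run_stores(int n, int use_buffer)
{
    int cycle = 0;          /* current CPU cycle */
    int bus_free_at = 0;    /* cycle at which the previous bus write finishes draining */

    for (int i = 0; i < n; i++) {
        if (use_buffer) {
            if (cycle < bus_free_at)            /* one-entry buffer still draining: stall */
                cycle = bus_free_at;
            cycle += 1;                         /* store issues, data parked in the buffer */
            bus_free_at = cycle + BUS_LATENCY;  /* buffer drains in the background */
        } else {
            cycle += 1 + BUS_LATENCY;           /* CPU waits for the bus transaction itself */
        }
    }
    return cycle;
}

int main(void)
{
    printf("1 store:  buffered %d vs unbuffered %d cycles\n", run_stores(1, 1), run_stores(1, 0));
    printf("3 stores: buffered %d vs unbuffered %d cycles\n", run_stores(3, 1), run_stores(3, 0));
    return 0;
}

With this model an isolated store is fully hidden by the buffer, while back-to-back stores to a slow bus eventually pay the bus latency anyway, which matches the "subsequent accesses to the bus stall until the write buffer has drained" wording above.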
#EmbeddedSoftwareEngineer, thanks for the reply. I'd like to post what I summarized from my experiment.
As a baseline, LDR takes 2 cycles and STR takes 1 cycle.
There are 2 kinds of dependency between adjacent instructions:
Content dependency. A typical example is an STR followed by an LDR: since the assembly gives no guarantee that the memory the LDR targets was not just modified by the STR, the LDR always gets delayed, i.e. it takes 3 cycles.
Addressing dependency. When the 2nd instruction's address is based on the result of the 1st instruction, the 2nd instruction always gets delayed. Typical examples:
sub SP, SP, #20
ldr r1, [SP, #4]
;OR
ldr r3, [SP, #8]
ldr r4, [r3]
The second LDR will always get an extra wait cycle, yielding 3 cycles.
When there are none of the dependencies described above, an LDR following an LDR takes 1 cycle, and an STR following an LDR takes 0 cycles.
All of these numbers are based on TCM, which introduces no extra cycles from cache fills or external bus stalls.

CPU cache: does the distance between two address needs to be smaller than 8 bytes to have cache advantage?

It may seem like a weird question...
Say a cache line's size is 64 bytes. Further, assume that L1, L2 and L3 have the same cache line size (this post said that is the case for the Intel Core i7).
There are two objects A and B in memory whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on a cache line boundary, that is, its address is an integer multiple of 64.
1) If N < 64, when A is fetched by the CPU, B will be read into the cache too. So if B is needed and the cache line has not been evicted yet, the CPU fetches B in a very short time. Everybody is happy.
2) If N >> 64 (i.e. much larger than 64), when A is fetched by the CPU, B is not read into the cache line along with A. So we say "the CPU doesn't like to chase pointers around", and it is one of the reasons to avoid heap-allocated, node-based data structures like std::list.
My question is: if N > 64 but is still small, say N = 70 (in other words, A and B do not fit in one cache line but are not too far apart), when A is loaded by the CPU, does fetching B take the same number of clock cycles as it would when N is much larger than 64?
To rephrase: when A is loaded, let t represent the time it takes to fetch B; is t(N=70) much smaller than, or almost equal to, t(N=9999999)?
I ask because I suspect t(N=70) is much smaller than t(N=9999999), since the CPU cache is hierarchical.
It would be even better if there is quantitative research on this.
There are at least three factors which can make a fetch of B after A misses faster. First, a processor may speculatively fetch the next block (independent of any stride-based prefetch engine, which would depend on two misses being encountered near each other in time and location in order to determine the stride; unit stride prefetching does not need to determine the stride value [it is one] and can be started after the first miss). Since such prefetching consumes memory bandwidth and on-chip storage, it will typically have a throttling mechanism (which can be as simple as having a modest sized prefetch buffer and only doing highly speculative prefetching when the memory interface is sufficiently idle).
Second, because DRAM is organized into rows and changing rows (within a single bank) adds latency, if B is in the same DRAM row as A, the access to B may avoid the latency of a row precharge (to close the previously open row) and activate (to open the new row). (This can also improve memory bandwidth utilization.)
Third, if B is in the same address translation page as A, a TLB miss may be avoided. (In many designs hierarchical page table walks are also faster in nearby regions because paging structures can be cached. E.g., in x86-64, if B is in the same 2MiB region as A, a TLB miss may only have to perform one memory access because the page directory may still be cached; furthermore, if the translation for B is in the same 64-byte cache line as the translation for A and the TLB miss for A was somewhat recent, the cache line may still be present.)
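As a concrete illustration of those locality classes for x86-64 with 4 KiB pages (8-byte PTEs, so 8 translations per 64-byte line) and 2 MiB page-directory regions, the checks reduce to a few shifts; the constants below are just those architectural sizes:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same 64-byte cache line? */
static bool same_line(uint64_t a, uint64_t b)      { return (a >> 6)  == (b >> 6);  }

/* Same 4 KiB page => the same TLB entry translates both. */
static bool same_page(uint64_t a, uint64_t b)      { return (a >> 12) == (b >> 12); }

/* PTEs for both pages land in the same 64-byte line of the page table
   (8-byte PTEs: 8 consecutive pages share one PTE cache line). */
static bool same_pte_line(uint64_t a, uint64_t b)  { return (a >> 15) == (b >> 15); }

/* Same 2 MiB region => same page directory entry / last-level page table. */
static bool same_2mib(uint64_t a, uint64_t b)      { return (a >> 21) == (b >> 21); }

int main(void)
{
    uint64_t A = 0, B1 = 70, B2 = 9999999;
    printf("N=70:      line %d page %d pte-line %d 2MiB %d\n",
           same_line(A, B1), same_page(A, B1), same_pte_line(A, B1), same_2mib(A, B1));
    printf("N=9999999: line %d page %d pte-line %d 2MiB %d\n",
           same_line(A, B2), same_page(A, B2), same_pte_line(A, B2), same_2mib(A, B2));
    return 0;
}

So N=70 misses the "same line" test but keeps every other kind of locality, while N=9999999 keeps none of them, which is one reason the two cases can behave very differently.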
In some cases one can also exploit stride-based prefetch engines by arranging objects that are likely to miss together in a fixed, ordered stride. This would seem to be a rather difficult and limited-context optimization.
One obvious way that stride can increase latency is by introducing conflict misses. Most caches use simple modulo a power of two indexing with limited associativity, so power of two strides (or other mappings to the same cache set) can place a disproportionate amount of data in a limited number of sets. Once the associativity is exceeded, conflict misses will occur. (Skewed associativity and non-power-of-two modulo indexing have been proposed to reduce this issue, but these techniques have not been broadly adopted.)
(By the way, the reason pointer chasing is particularly slow is not just low spatial locality but that the access to B cannot be started until after the access to A has completed because there is a data dependency, i.e., the latency of fetching B cannot be overlapped with the latency of fetching A.)
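If you want to see that serialization effect for yourself, here is a small C benchmark sketch; the array size, the shuffle, and timing via clock_gettime are my own choices, not anything from the question. It compares a dependent pointer chase with an independent streaming pass over the same data; the chase also loses spatial locality, so the gap mixes both effects, but it shows why dependent misses cannot be overlapped:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 22)   /* 4M elements (~32 MB), well beyond any cache level */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's variant of Fisher-Yates: produces a single random cycle,
       so next[i] always points somewhere far away and every load depends
       on the result of the previous load. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (((size_t)rand() << 16) ^ (size_t)rand()) % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    double t0 = seconds();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];    /* dependent chain: misses serialize */
    double t1 = seconds();

    size_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += next[i]; /* independent loads: misses overlap, prefetch helps */
    double t2 = seconds();

    printf("pointer chase %.3f s, sequential sum %.3f s (ignore: %zu %zu)\n",
           t1 - t0, t2 - t1, p, sum);
    free(next);
    return 0;
}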
If B is at a lower address than A, it won't be in the same cache line even if they're adjacent. So your N < 64 case is misnamed: it's really the "same cache line" case.
Since you mention Intel i7: Sandybridge-family has a "spatial" prefetcher in L2, which (if there aren't a lot of outstanding misses already) prefetches the other cache line in a pair to complete a naturally-aligned 128B pair of lines.
From Intel's optimization manual, in section 2.3 SANDY BRIDGE:
2.3.5.4 Data Prefetching
... Some prefetchers fetch into L1.
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
... several other prefetchers try to prefetch into L2
IDK how soon it does this; if it doesn't issue the request until the first cache line arrives, it won't help much for a pointer-chasing case. A dependent load can execute only a couple cycles after the cache line arrives in L1D, if it's really just pointer-chasing without a bunch of computation latency. But if it issues the prefetch soon after the first miss (which contains the address for the 2nd load), the 2nd load could find its data already in L1D cache, having arrived a cycle or two after the first demand-load.
Anyway, this makes 128B boundaries relevant for prefetching in Intel CPUs.
See Paul's excellent answer for other factors.

cache interaction steps and read cycles

I'm struggling to fully grasp how caches work.
Let's say I have a L1 cache and L2 cache.
1. The CPU gives the L1 controller the memory address.
2. The L1 cache controller determines the cache set, the requested cache tag, and the block offset.
3. The L1 cache circuits check whether the requested tag is in that set.
4. No L1 cache tag match is found.
Does step #2 happen here, or after L1 sends L2 the memory address?
On read times: say L1 takes x cycles, L2 takes y cycles, and main memory takes z cycles. If the steps above happen and then L2 finds a cache tag match and sends the data back to L1, which returns it to the CPU, how many cycles does that take in total? And when L1 returns the data to the CPU, does that count as a read cycle or not?
Thanks in advance for the help!
L1 might be inside the processor, but the process is still the same. Say the processor performs a read: the address and the read/control signals go out. From the address, the L1 cache looks up the tag and determines hit or miss. If it is a hit, it returns the information. If it misses, the L1 needs to go out on its own address bus, adjusting the address so it is aligned and sized to the cache line. The L2 does the same thing the L1 does at a high level: the address turns into a tag, the tag turns into a hit or a miss, and on a miss it puts the aligned, cache-line-sized fetch on its external address bus. This repeats until you hit something that answers (DRAM, a peripheral, etc.).

When the L2 responds, it sends the line back to L1; L1, per the rules of the design and its settings, saves the line and then returns to the processor the data of the size it asked for. For that moment, depending on design and settings, the L1 and L2 contain the same data; ideally everything in the L1 is also in the L2, so the L2 contains all of the L1 data plus some. Granted, non-cacheable requests should pass through, so you may have an L2 hit that results in L1 not storing the data. Also, based on the design, a non-cacheable request may pass through to the other side of L1 and/or L2 in the original processor size and shape rather than being aligned and sized to a cache line.
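On the "how many cycles" part of the question, a common back-of-the-envelope model (my assumption, not something from a specific TRM) is that the levels are checked one after the other, so the latencies simply add up; real designs overlap some of these steps, so treat it as a rough upper bound:

#include <stdio.h>

/* Rough load-to-use latency, assuming a sequential lookup: the L1 check (x cycles)
   completes before the request goes out to L2, and so on down the hierarchy. */
static int load_latency(int x, int y, int z, int hit_l1, int hit_l2)
{
    if (hit_l1) return x;          /* L1 hit */
    if (hit_l2) return x + y;      /* L1 miss, L2 hit: L1 lookup + L2 lookup/fill */
    return x + y + z;              /* both miss: go all the way to main memory */
}

int main(void)
{
    /* e.g. x=4, y=12, z=100 -- made-up numbers just to show the arithmetic */
    printf("L1 hit %d, L2 hit %d, memory %d cycles\n",
           load_latency(4, 12, 100, 1, 0),
           load_latency(4, 12, 100, 0, 1),
           load_latency(4, 12, 100, 0, 0));
    return 0;
}

Whether the final return of data to the CPU is counted as an extra "read cycle" or folded into the x cycles is a bookkeeping choice of the particular design's documentation.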

Why do memory instructions take 4 cycles in ARM assembly?

Memory instructions such as ldr, str or b take 4 cycles each in ARM assembly.
Is it because each memory location is 4 bytes long?
ARM has a pipelined architecture. Each clock cycle advances the pipeline by one step (e.g. fetch/decode/execute/read...). Since the pipeline is continuously fed, the overall time to execute each instruction can approach 1 cycle, but the actual time for an individual instruction from 'fetch' through completion can be 3+ cycles. ARM has a good explanation on their website:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0222b/ch01s01s01.html
Memory latency adds another layer of complication to this idea. ARM employs a multi-level cache system which aims to have the most frequently used data available in the fewest cycles. Even a read from the fastest (L0) cache involves several cycles of latency. The pipeline includes facilities to allow read requests to complete at a later time if the data is not used right away. It's easier to understand by way of example:
LDR R0,[R1]
MOV R2,R3 // Allow time for memory read to occur
ADD R4,R4,#200 // by interleaving other instructions
CMP R0,#0 // before trying to use the value
// By trying to access the data immediately, this will cause a pipeline
// 'stall' and waste time waiting for the data to become available.
LDR R0,[R1]
CMP R0,#0 // Wastes at least 1 cycle due to pipeline not having the data
The idea is to hide the inherent latencies in the pipeline and, if you can, hide additional latencies in the memory access by delaying dependencies on registers (aka instruction interleaving).

If registers are so blazingly fast, why don't we have more of them?

In 32bit, we had 8 "general purpose" registers. With 64bit, the amount doubles, but it seems independent of the 64bit change itself.
Now, if registers are so fast (no memory access), why aren't there more of them naturally? Shouldn't CPU builders work as many registers as possible into the CPU? What is the logical restriction to why we only have the amount we have?
There are many reasons why you don't just have a huge number of registers:
They're highly linked to most pipeline stages. For starters, you need to track their lifetime, and forward results back to previous stages. The complexity gets intractable very quickly, and the number of wires (literally) involved grows at the same rate. It's expensive on area, which ultimately means it's expensive on power, price and performance after a certain point.
It takes up instruction encoding space. 16 registers take up 4 bits for source and destination, and another 4 if you have 3-operand instructions (e.g. ARM). That's an awful lot of instruction set encoding space taken up just to specify the register. This eventually impacts decoding, code size and, again, complexity.
There are better ways to achieve the same result...
These days we really do have lots of registers - they're just not explicitly programmed. We have "register renaming". While you only access a small set (8-32 registers), they're actually backed by a much larger set (e.g. 64-256). The CPU then tracks the visibility of each register and allocates them to the renamed set. For example, you can load, modify, then store to a register many times in a row, and have each of these operations actually performed independently, depending on cache misses etc. In ARM:
ldr r0, [r4]
add r0, r0, #1
str r0, [r4]
ldr r0, [r5]
add r0, r0, #1
str r0, [r5]
Cortex A9 cores do register renaming, so the first load to "r0" actually goes to a renamed virtual register - let's call it "v0". The load, increment and store happen on "v0". Meanwhile, we also perform a load/modify/store to r0 again, but that'll get renamed to "v1" because this is an entirely independent sequence using r0. Let's say the load from the pointer in "r4" stalled due to a cache miss. That's ok - we don't need to wait for "r0" to be ready. Because it's renamed, we can run the next sequence with "v1" (also mapped to r0) - and perhaps that's a cache hit and we just had a huge performance win.
ldr v0, [v2]
add v0, v0, #1
str v0, [v2]
ldr v1, [v3]
add v1, v1, #1
str v1, [v3]
I think x86 is up to a gigantic number of renamed registers these days (ballpark 256). That would mean having 8 bits times 2 for every instruction just to say what the source and destination are. It would massively increase the number of wires needed across the core, and its size. So there's a sweet spot around 16-32 registers which most designers have settled on, and for out-of-order CPU designs, register renaming is the way to mitigate it.
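If you want to put numbers on that encoding cost, the register fields alone scale as operands times log2(register count); a quick sketch:

#include <stdio.h>

/* Bits needed to name one of n registers. */
static int bits_for(int n)
{
    int b = 0;
    while ((1 << b) < n) b++;
    return b;
}

int main(void)
{
    int counts[] = { 8, 16, 32, 256 };
    for (int i = 0; i < 4; i++)
        printf("%3d registers: %d bits per operand, %2d bits for a 3-operand instruction\n",
               counts[i], bits_for(counts[i]), 3 * bits_for(counts[i]));
    return 0;
}

Going from 32 architected registers to 256 would roughly double the register-field bits in every 3-operand instruction, which is exactly the code-density cost renaming lets you avoid.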
Edit: a note on how much out-of-order execution and register renaming matter here. Once you have OOO, the number of architectural registers doesn't matter so much, because they're just "temporary tags" and get renamed onto the much larger virtual register set. You don't want the number to be too small, because it gets difficult to write short code sequences. This is a problem for x86-32, because the limited 8 registers mean a lot of temporaries end up going through the stack, and the core needs extra logic to forward reads/writes to memory. If you don't have OOO, you're usually talking about a small core, in which case a large register set offers a poor cost/performance benefit.
So there's a natural sweet spot for register bank size which maxes out at about 32 architected registers for most classes of CPU. x86-32 has 8 registers and it's definitely too small. ARM went with 16 registers and it's a good compromise. 32 registers is slightly too many if anything - you end up not needing the last 10 or so.
None of this touches on the extra registers you get for SSE and other vector floating point coprocessors. Those make sense as an extra set because they run independently of the integer core, and don't grow the CPU's complexity exponentially.
We Do Have More of Them
Because almost every instruction must select 1, 2, or 3 architecturally visible registers, expanding the number of them would increase code size by several bits on each instruction and so reduce code density. It also increases the amount of context that must be saved as thread state, and partially saved in a function's activation record. These operations occur frequently. Pipeline interlocks must check a scoreboard for every register and this has quadratic time and space complexity. And perhaps the biggest reason is simply compatibility with the already-defined instruction set.
But it turns out that, thanks to register renaming, we really do have lots of registers available, and we don't even need to save them. The CPU actually has many register sets, and it automatically switches between them as your code executes. It does this purely to get you more registers.
Example:
load r1, a # x = a
store r1, x
load r1, b # y = b
store r1, y
In an architecture that has only r0-r7, the code above may be rewritten automatically by the CPU as something like:
load r1, a
store r1, x
load r10, b
store r10, y
In this case r10 is a hidden register that is substituted for r1 temporarily. The CPU can tell that the value of r1 is never used again after the first store. This allows the first load to be delayed (even an on-chip cache hit usually takes several cycles) without requiring the delay of the second load or the second store.
They add registers all of the time, but they are often tied to special purpose instructions (e.g. SIMD, SSE2, etc) or require compiling to a specific CPU architecture, which lowers portability. Existing instructions often work on specific registers and couldn't take advantage of other registers if they were available. Legacy instruction set and all.
To add a little interesting info here: you'll notice that having 8 same-sized registers allows opcodes to maintain consistency with hexadecimal notation. For example, the instruction push ax is opcode 0x50 on x86 and goes up to 0x57 for the last register, di. Then the instruction pop ax starts at 0x58 and goes up to 0x5F for pop di, completing the first group of 16. Hexadecimal consistency is maintained with 8 registers per size.
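In other words, the encoding is literally base opcode plus register index; a tiny sketch that prints both ranges (the register order is the standard x86 encoding order):

#include <stdio.h>

int main(void)
{
    /* x86 16/32-bit general registers in encoding order 0..7 */
    const char *reg[8] = { "ax", "cx", "dx", "bx", "sp", "bp", "si", "di" };

    for (int i = 0; i < 8; i++)
        printf("push %s = 0x%02X   pop %s = 0x%02X\n",
               reg[i], 0x50 + i, reg[i], 0x58 + i);
    return 0;
}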

Resources