Duration of an LDR instruction on STM32H7 depending on memory - performance

I'm doing some evaluations on STM32H7, on the STM32H753I-EVAL2 board. I used STMicro example code to configure, write and read the QSPI Flash in memory mapped mode.
I was surprised by some figures regarding the duration of an LDR instruction:
I measure the number of cycles of instructions using SysTick (clocked from the CPU clock). As far as I understand, one SysTick cycle = one CPU cycle.
I measured two otherwise identical instructions, ldrb.w Rn, [Rp, Rq], except that in one case Rp holds an address in DTCM-RAM and in the other case an address in QSPI Flash.
The results are (code executed from internal flash): 15 cycles from DTCM-RAM, 12 cycles from QSPI.
I'm surprised by these results. I guess the QSPI content is cached, which might explain the figures?
Also, 15 cycles for a single LDR instruction seems like quite a lot. What do you think? Is there something wrong in my procedure?

If the internal flash is not cached, or the cache is invalid, or the pipeline was flushed, or ... (many, many other reasons), it may take more time than the access that goes to QSPI Flash.
To measure execution time you have dedicated registers (the DWT cycle counter on the Cortex-M7).
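On the Cortex-M7 the usual way to get cycle-accurate timings is the DWT cycle counter (CYCCNT). A minimal C sketch, assuming the standard CMSIS register names and a project that pulls in stm32h7xx.h (the function names here are made up; on some parts the DWT lock access register may also need to be unlocked first):
#include "stm32h7xx.h"                      /* CMSIS device header (assumed project setup) */
static void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the trace/debug block */
    DWT->CYCCNT = 0u;                                /* reset the cycle counter */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* count CPU clock cycles */
}
uint32_t measure_load(volatile uint8_t *p)
{
    uint32_t start = DWT->CYCCNT;
    (void)*p;                        /* the load under test (DTCM or QSPI address) */
    return DWT->CYCCNT - start;      /* includes the overhead of reading CYCCNT */
}
Reading the counter has its own overhead and the M7 pipeline is dual-issue, so single-instruction figures are best taken as an average over a long, unrolled loop, after calibrating with an empty measurement.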

Related

Store forwarding Address vs Data: what is the difference between STD and STA in the Intel Optimization guide?

I'm wondering if any Intel experts out there can tell me the difference between STD and STA with respect to the Intel Skylake core.
In the Intel optimization guide, there's a picture describing the "super-scalar ports" of the Intel Cores.
Here's the PDF. The picture is on page 40.
Here's another picture from page 78; it describes "Store Address" and "Store Data":
Store Address: Prepares the store forwarding and store retirement logic with the address of the data being stored.
Store Data: Prepares the store forwarding and store retirement logic with the data being stored.
Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle, I was curious what the difference was between these two.
It seems "natural" to me that store-forwarding would be done to the address of the data. But I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done. Are there any assembly / optimization experts out there that can help me understand exactly the difference between STD and STA is?
Intel CPUs have been splitting stores into store-address and store-data since the first P6-family microarchitecture, Pentium Pro.
But store-address and store-data uops can micro-fuse into one fused-domain uop. On Sandy/IvyBridge, indexed addressing modes are un-laminated as described in Intel's optimization manual. But Haswell and later can keep them micro-fused even in the ROB, so they aren't un-laminated. See Micro fusion and addressing modes. (Intel doesn't mention this, and Agner Fog hasn't had time to test extensively for Haswell/Skylake so his usually-good microarch PDF doesn't even mention un-lamination at all. But you should still definitely read it to learn more about how uops work and how instructions are decoded and go through the pipeline. See also other x86 performance links in the x86 tag wiki)
Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle
Ports 2 and 3 can also run load uops on their AGUs, leaving the load-data part of the port unused that cycle. Port7 only has a dedicated store-AGU for simple addressing modes.
Store addressing modes with an index register can't use port 7, only p2/p3. But if you do use "simple" addressing modes for stores, the peak throughput is 2 loads + 1 store per clock.
On Nehalem and earlier (P6 family), p2 was the only load port, p3 was the store-address port, and p4 was store-data.
On IvyBridge/Sandybridge, there weren't separate ports for store-address uops, they always just ran on the AGU (Address Generation Unit) in the load ports (p23). With 256b loads / stores, the AGU was only needed every other cycle (256b load or store uops occupy the load or store-data ports for 2 cycles, but the load ports can accept a store-address uop during that 2nd cycle). So 2 load / 1 store per clock was in theory sustainable on Sandybridge, but only if most of it was with AVX 256-bit vector loads / stores running as two 128-bit halves.
Haswell added the dedicated store-AGU on port7 and widened the load/store execution units to 256b, because there aren't spare cycles when the load ports don't need their AGUs if there's a steady supply of loads.
A store-address uop writes the address (and width, I guess) into the store buffer (aka Memory Order Buffer in Intel's terminology). Having this happen separately, and possibly before the data to be stored is even ready lets later loads (in program order) detect whether they overlap the store or not.
Out-of-order execution of loads when there are pending stores with unknown address is problematic: a wrong guess means having to roll back the pipeline. (I think the machine_clears.memory_ordering perf counter event includes this. It is possible to get non-zero counts for this from single-threaded code, but I forget if I had definite evidence that Skylake sometimes speculatively guesses that loads don't overlap unknown-address stores).
As David Kanter points out in his Haswell microarch writeup, a load uop also needs to probe the store buffer to check for forwarding / conflicts, so an execution unit that only runs store-address uops is cheaper to build.
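As a hedged C illustration (the function and parameter names are invented for the example): whether a later load may run early depends only on whether its address overlaps the pending store, and that can be checked as soon as the store-address uop has executed, even if the store's data isn't ready yet.
/* Hypothetical example: the store and the load may or may not alias. */
int store_then_load(int *a, int i_store, int i_load, int v)
{
    a[i_store] = v;       /* split into a store-address uop and a store-data uop */
    return a[i_load];     /* checked against the pending store: if i_load != i_store
                             it doesn't overlap and can complete on its own; if they
                             are equal, the data is forwarded from the store buffer
                             once the store-data uop has run */
}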
Anyway, I'm not sure what the performance implications would be if Intel redesigned things so port7 had a full AGU that could handle indexed addressing modes, too, and made store-address uops only run on p7, not p2/p3.
That would stop store-address uops from "stealing" p23, which does happen and which reduces max sustained L1D bandwidth from 96 bytes / cycle (2 load + 1 store of 32-byte YMM vectors) down to ~81 bytes / cycle for Skylake according to a table in Intel's optimization manual. But under the right circumstances, Skylake can sustain 2 loads + 1 store per clock of 4-byte operands, so maybe that 81-byte / cycle number is limited by some other microarchitectural limit. The peak is 96B/clock, but apparently that can't happen back-to-back indefinitely.
One downside to stopping store-address uops from running on p23 is that it would take longer for store addresses to be known, maybe delaying loads more.
I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done.
A store/reload can have the load take the data from the store buffer, instead of waiting for it to commit to L1D and reading it from there.
How does store to load forwarding happens in case of unaligned memory access?
Store-to-Load Forwarding and Memory Disambiguation in x86 Processors
Store/reload can happen when a function spills some registers before calling a function, or as part of passing args on the stack (especially with crappy stack-args calling conventions that pass all args on the stack). Or passing something by reference to a non-inline function. Or in a histogram, if the same bin is hit repeatedly, you're basically doing a memory-destination increment in a loop.
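The histogram case is easy to sketch in C (illustrative only, not code from the answer):
#include <stddef.h>
#include <stdint.h>
/* Each iteration loads a bin, adds 1, and stores it back.  When the same bin is
   hit repeatedly, the reload takes its value from the store buffer via
   store-to-load forwarding instead of waiting for the store to commit to L1d. */
void histogram(uint32_t *bins, const uint8_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        bins[data[i]]++;          /* memory-destination increment: store/reload */
}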
It's been a few days without a response, so here's my best guess at "answering my own question".
The raw x86 instruction set isn't executed directly by modern processors. Instead, the x86 instruction set is "compiled" down into Micro-ops (uOps) before being executed by the Intel core. This shouldn't be too surprising, because some x86 instructions can be complex. An example taken from the optimization guide is as follows:
Similarly, the following store instruction has three register sources and is broken into "generate store address" and "generate store data" sub-components.
MOV [ESP+ECX*4+12345678], AL
This is currently found on page 50 of the optimization manual (2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD)).
In this case, the address of the store operation is complex, so it is its own uOp. So at the very least, this single x86 instruction gets converted into two uOps internally. The names of these two uOps are "Store Address" and "Store Data". The manual doesn't describe the internal uOps at all, so it may take even more than two uOps to accomplish.
Since there's only one "store data" port on Skylake systems, that means that Skylake can only modify at most one memory location per cycle. The three "Store Address" ports means that Skylake can calculate the effective address of many instructions simultaneously (possibly because some very complicated addresses may take more than one uOp to execute??).

Why do memory instructions take 4 cycles in ARM assembly?

Memory instructions such as ldr, str or b take 4 cycles each in ARM assembly.
Is it because each memory location is 4 bytes long?
ARM has a pipelined architecture. Each clock cycle advances the pipeline by one step (e.g. fetch/decode/execute/read...). Since the pipeline is continuously fed, the overall time to execute each instruction can approach 1 cycle, but the actual time for an individual instruction from 'fetch' through completion can be 3+ cycles. ARM has a good explanation on their website:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0222b/ch01s01s01.html
Memory latency adds another layer of complication to this idea. ARM employs a multi-level cache system which aims to have the most frequently used data available in the fewest cycles. Even a read from the fastest (L0) cache involves several cycles of latency. The pipeline includes facilities to allow read requests to complete at a later time if the data is not used right away. It's easier to understand by way of example:
LDR R0,[R1]
MOV R2,R3 // Allow time for memory read to occur
ADD R4,R4,#200 // by interleaving other instructions
CMP R0,#0 // before trying to use the value
// By trying to access the data immediately, this will cause a pipeline
// 'stall' and waste time waiting for the data to become available.
LDR R0,[R1]
CMP R0,#0 // Wastes at least 1 cycle due to pipeline not having the data
The idea is to hide the inherent latencies in the pipeline and, if you can, hide additional latencies in the memory access by delaying dependencies on registers (aka instruction interleaving).
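The same idea expressed in C (a sketch; in compiled code the compiler's instruction scheduler normally does this interleaving for you): give the core independent work to issue while a load's result is still in flight.
/* Summing two independent arrays: the load from b[i] does not depend on the
   previously loaded a[i], so the two streams can overlap, hiding part of the
   load latency instead of stalling on each value. */
int sum_two_arrays(const int *a, const int *b, int n)
{
    int sum_a = 0, sum_b = 0;
    for (int i = 0; i < n; i++) {
        int va = a[i];       /* load issued; result not needed immediately */
        int vb = b[i];       /* independent load can start while a[i] is in flight */
        sum_a += va;
        sum_b += vb;
    }
    return sum_a + sum_b;
}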

Understanding CPU pipeline stages vs. Instruction throughput

I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?
Besides the obvious of "different instructions require a different amount of work to complete", hear me out...
Consider an i7 with an approx 14 stage pipeline. That takes 14 clock cycles to complete a run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.
An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more than the stage count) and a reciprocal throughput of 8 cycles (on Ivy Bridge).
Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.
I know about the multiple execution units. I don't understand how the length of instructions in terms of latency and throughput relates to the number of pipeline stages.
I think what's missing from the existing answers is the existence of "bypass" or "forwarding" datapaths. For simplicity, let's stick with the MIPS 5-stage pipeline. Every instruction takes 5 cycles from birth to death -- fetch, decode, execute, memory, writeback. So that's how long it takes to process a single instruction.
What you want to know is how long it takes for one instruction to hand off its result to a dependent instruction. Say you have two consecutive ADD instructions, and there's a dependency through R1:
ADD R1, R2, R3
ADD R4, R1, R5
If there were no forwarding paths, we'd have to stall the second instruction for multiple cycles (2 or 3 depending on how writeback works), so that the first one could store its result into the register file before the second one reads that as input in the decode stage.
However, there are forwarding paths that allow valid results (but ones that are not yet written back) to be picked out of the pipeline. So let's say the first ADD gets all its inputs from the register file in decode. The second one will get R5 out of the register file, but it'll get R1 out of the pipeline register following the execute stage. In other words, we're routing the output of the ALU back into its input one cycle later.
Out-of-order processors make ubiquitous use of forwarding. They will have lots of different functional units that have lots of different latencies. For instance, ADD and AND will typically take one cycle (TO DO THE MATH, putting aside all of the pipeline stages before and after), MUL will take like 4, floating point operations will take lots of cycles, memory access has variable latency (due to cache misses), etc.
By using forwarding, we can limit the critical path of an instruction to just the latencies of the execution units, while everything else (fetch, decode, retirement) is out of the critical path. Instructions get decoded and dumped into instruction queues, awaiting their inputs to be produced by other executing instructions. When an instruction's dependencies are satisfied, it can begin executing.
Let's consider this example
MUL R1,R5,R6
ADD R2,R1,R3
AND R7,R2,R8
I'm going to make an attempt at drawing a timeline that shows the flow of these instructions through the pipeline.
MUL  F D I X X X X W R
ADD    F D I I I I X W R
AND      F D I I I I X W R
Key:
F - Fetch
D - Decode
I - Instruction queue (IQ)
X - execute
W - writeback/forward/bypass
R - retire
So, as you see, the multiply instruction has a total lifetime of 9 cycles. But there is overlap in execution of the MUL and the ADD, because the processor is pipelined. When the ADD enters the IQ, it has to wait for its input (R1), and likewise so does the AND that is dependent on the ADD's result (R2). What we care about is not how long the MUL lives in total but how long any dependent instruction has to wait. That is its EFFECTIVE latency, which is 4 cycles. As you can see, once the ADD executes, the dependent AND can execute on the next cycle, again due to forwarding.
I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?
Because what we're interested in is the speed between instructions, not the start-to-end time of a single instruction.
Besides the obvious of "different instructions require a different amount of work to complete", hear me out...
Well that's the key answer to why different instructions have different latencies.
Consider an i7 with an approx 14 stage pipeline. That takes 14 clock cycles to complete a run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.
That is correct, though that's not a particularly meaningful number. For example, why do we care how long it takes before the CPU is entirely done with an instruction? That has basically no effect.
An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more than the stage count) and a reciprocal throughput of 8 cycles (on Ivy Bridge).
This is just a bunch of misunderstandings. An XOR introduces one cycle of latency into a dependency chain. That is, if I do 12 instructions that each modify the previous instruction's value and then add an XOR as the 13th instruction, it will take one cycle more. That's what the latency means.
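A C sketch of the latency-vs-throughput distinction (illustrative only; an optimizing or vectorizing compiler may transform these loops, so inspect the generated code before drawing conclusions from measurements):
#include <stdint.h>
/* Dependent chain: every XOR needs the previous result, so the loop runs at
   roughly one iteration per XOR *latency*. */
uint64_t xor_chain(uint64_t x, long iters)
{
    for (long i = 0; i < iters; i++)
        x ^= (uint64_t)i;            /* next iteration depends on this result */
    return x;
}
/* Independent accumulators: the four XORs per iteration don't depend on each
   other, so the core can overlap them -- the limit is now XOR *throughput*. */
uint64_t xor_parallel(uint64_t a, uint64_t b, uint64_t c, uint64_t d, long iters)
{
    for (long i = 0; i < iters; i++) {
        a ^= (uint64_t)i;
        b ^= (uint64_t)i + 1;
        c ^= (uint64_t)i + 2;
        d ^= (uint64_t)i + 3;
    }
    return a ^ b ^ c ^ d;
}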
Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.
Right. So?
I know about the multiple execution units. I don't understand how the length of instructions in terms of latency and throughput relates to the number of pipeline stages.
They don't. Why should there be any connection? Say there are 14 extra stages at the beginning of the pipeline. Why would that affect latency or throughput at all? It would just mean everything happens 14 clock cycles later, but still at the same rate. (Though it would likely impact the cost of a mispredicted branch and other things.)

How many CPU cycles does it take to execute a MOV A, 5 instruction?

My question is to calculate how many CPU cycles it takes to execute the MOV A, 5 instruction, and to describe each.
Can anyone please explain to me how this works? And the 5 is a value, is it? Just explain the main points to me.
As far as I know, the first steps are:
- get the instruction from memory (one clock cycle)
- update the instruction pointer (one clock cycle)
- decode the instruction to see what it does (one clock cycle)
I'm stuck after this.
I'm assuming you are talking about the x86 processors. This of course depends on the processor, but it usually takes 1 clock cycle from what I remember.
This is because instructions are executed in a pipeline. This means that while the processor is computing the result of one instruction, it is decoding the next one and fetching the one after that, so that each part of the processor is busy doing something.
Usually the instructions that need data from memory or do complex calculations like multiplications or divisions take a longer time to execute.
You can also get the number of cycles with RDTSC: https://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf
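A minimal RDTSC sketch in C, assuming a GCC/Clang toolchain on x86 (MSVC exposes the same intrinsic through <intrin.h>); a single instruction is far below the timer's resolution and overhead, so real measurements are done over long loops and averaged:
#include <stdio.h>
#include <x86intrin.h>            /* __rdtsc() on GCC/Clang */
int main(void)
{
    volatile int a;               /* volatile so the store isn't optimized away */
    unsigned long long t0 = __rdtsc();
    a = 5;                        /* roughly the "MOV A, 5" being asked about */
    unsigned long long t1 = __rdtsc();
    printf("TSC ticks: %llu (includes RDTSC overhead and out-of-order effects)\n",
           t1 - t0);
    return 0;
}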

If registers are so blazingly fast, why don't we have more of them?

In 32-bit, we had 8 "general purpose" registers. With 64-bit, the amount doubles, but it seems independent of the 64-bit change itself.
Now, if registers are so fast (no memory access), why aren't there more of them naturally? Shouldn't CPU builders work as many registers as possible into the CPU? What is the logical restriction to why we only have the amount we have?
There are many reasons you don't just have a huge number of registers:
They're highly linked to most pipeline stages. For starters, you need to track their lifetime, and forward results back to previous stages. The complexity gets intractable very quickly, and the number of wires (literally) involved grows at the same rate. It's expensive on area, which ultimately means it's expensive on power, price and performance after a certain point.
It takes up instruction encoding space. 16 registers take up 4 bits for source and destination, and another 4 if you have 3-operand instructions (e.g. ARM). That's an awful lot of instruction set encoding space taken up just to specify the register. This eventually impacts decoding, code size and again complexity.
There's better ways to achieve the same result...
These days we really do have lots of registers - they're just not explicitly programmed. We have "register renaming". While you only access a small set (8-32 registers), they're actually backed by a much larger set (e.g. 64-256). The CPU then tracks the visibility of each register, and allocates them to the renamed set. For example, you can load, modify, then store to a register many times in a row, and have each of these operations actually performed independently depending on cache misses etc. In ARM:
ldr r0, [r4]
add r0, r0, #1
str r0, [r4]
ldr r0, [r5]
add r0, r0, #1
str r0, [r5]
Cortex A9 cores do register renaming, so the first load to "r0" actually goes to a renamed virtual register - let's call it "v0". The load, increment and store happen on "v0". Meanwhile, we also perform a load/modify/store to r0 again, but that'll get renamed to "v1" because this is an entirely independent sequence using r0. Let's say the load from the pointer in "r4" stalled due to a cache miss. That's ok - we don't need to wait for "r0" to be ready. Because it's renamed, we can run the next sequence with "v1" (also mapped to r0) - and perhaps that's a cache hit and we just had a huge performance win.
ldr v0, [v2]
add v0, v0, #1
str v0, [v2]
ldr v1, [v3]
add v1, v1, #1
str v1, [v3]
I think x86 is up to a gigantic number of renamed registers these days (ballpark 256). That would mean having 8 bits times 2 for every instruction just to say what the source and destination is. It would massively increase the number of wires needed across the core, and its size. So there's a sweet spot around 16-32 registers which most designers have settled for, and for out-of-order CPU designs, register renaming is the way to mitigate it.
Edit: a note on the importance of out-of-order execution and register renaming here. Once you have OOO, the number of registers doesn't matter so much, because they're just "temporary tags" and get renamed to the much larger virtual register set. You don't want the number to be too small, because it gets difficult to write small code sequences. This is a problem for x86-32, because the limit of 8 registers means a lot of temporaries end up going through the stack, and the core needs extra logic to forward reads/writes to memory. If you don't have OOO, you're usually talking about a small core, in which case a large register set is a poor cost/performance benefit.
So there's a natural sweet spot for register bank size which maxes out at about 32 architected registers for most classes of CPU. x86-32 has 8 registers and it's definitely too small. ARM went with 16 registers and it's a good compromise. 32 registers is slightly too many if anything - you end up not needing the last 10 or so.
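A small C illustration of the register-pressure point (a hypothetical function; exact spilling depends on the compiler and optimization level, so inspect the generated assembly): keep more values live across a loop than the architecture has registers and some of them have to live on the stack.
/* Eight values plus the loop counter and bound stay live across every
   iteration.  With x86-32's 8 general-purpose registers that is likely to
   spill to the stack; with 16 (x86-64) everything can stay in registers. */
int register_pressure(const int *p, int n)
{
    int a = p[0], b = p[1], c = p[2], d = p[3];
    int e = p[4], f = p[5], g = p[6], h = p[7];
    for (int k = 0; k < n; k++) {
        a += b; c += d; e += f; g += h;
        b ^= c; d ^= e; f ^= g; h ^= a;
    }
    return a + b + c + d + e + f + g + h;
}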
None of this touches on the extra registers you get for SSE and other vector floating point coprocessors. Those make sense as an extra set because they run independently of the integer core, and don't grow the CPU's complexity exponentially.
We Do Have More of Them
Because almost every instruction must select 1, 2, or 3 architecturally visible registers, expanding the number of them would increase code size by several bits on each instruction and so reduce code density. It also increases the amount of context that must be saved as thread state, and partially saved in a function's activation record. These operations occur frequently. Pipeline interlocks must check a scoreboard for every register and this has quadratic time and space complexity. And perhaps the biggest reason is simply compatibility with the already-defined instruction set.
But it turns out, thanks to register renaming, we really do have lots of registers available, and we don't even need to save them. The CPU actually has many register sets, and it automatically switches between them as your code executes. It does this purely to get you more registers.
Example:
load r1, a # x = a
store r1, x
load r1, b # y = b
store r1, y
In an architecture that has only r0-r7, the CPU may automatically rewrite the code above as something like:
load r1, a
store r1, x
load r10, b
store r10, y
In this case r10 is a hidden register that is substituted for r1 temporarily. The CPU can tell that the value of r1 is never used again after the first store. This allows the first load to be delayed (even an on-chip cache hit usually takes several cycles) without requiring the delay of the second load or the second store.
They add registers all of the time, but they are often tied to special purpose instructions (e.g. SIMD, SSE2, etc) or require compiling to a specific CPU architecture, which lowers portability. Existing instructions often work on specific registers and couldn't take advantage of other registers if they were available. Legacy instruction set and all.
To add a little interesting info here, you'll notice that having 8 same-sized registers allows opcodes to maintain consistency with hexadecimal notation. For example, the instruction push ax is opcode 0x50 on x86 and goes up to 0x57 for the last register, di. Then the instruction pop ax starts at 0x58 and goes up to 0x5F (pop di), filling out the full 16-value block. Hexadecimal consistency is maintained with 8 registers per register size.
