Cortex M4 LDR/STR timing - performance

I am reading through the Cortex-M4 TRM to understand instruction execution cycles. However, there are some confusing descriptions there.
In the Table of Processor Instructions, STR takes 2 cycles.
Later in Load/store timings, it indicates that
STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing.
If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete.
If the store is not to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion
Still in Load/store timings, it indicates that an LDR can be pipelined with a following LDR or STR, but an STR cannot be pipelined with following instructions.
Other instructions cannot be pipelined after STR with register offset. STR can only be pipelined when it follows an LDR, but nothing can be pipelined after the store. Even a stalled STR normally only takes two cycles, because of the write buffer
More specifically, here is what confused me:
Q1. Statements 1 and 2 seem to conflict with each other: how many cycles does STR actually take, 1 or 2? (My experiment shows 1, though.)
Q2. Statement 2 indicates that if a store goes through the write buffer and the buffer is not available, it stalls the pipeline regardless, but if the store bypasses the buffer, the pipeline may only stall when load/store instructions follow. It sounds as though the write buffer can only make things worse, which is contrary to common sense.
Q3. Statement 3 means STR can't be pipelined with the following instruction, yet statement 2 means STR is always pipelined with the following instruction under the proper conditions. How should I understand these conflicting statements? (And here it indicates STR takes 2 cycles instead of 1 because of the write buffer.)
Q4. I can't find more information on how the write buffer is implemented. How large is the buffer? How does an STR determine whether to use it or bypass it?

Type of STR
Note that on "Load/Store timings page" the first statement refers to STR with a literal offset to the base address register (STR Rx,[Ry,#imm]). Further down it refers to an STR with a register offset to the base address register (STR R1,[R3,R2]). These are two different variants of the STR instruction.
Literal offset STR (STR Rx,[Ry,#imm])
Hmm, I wonder if the documentation is misleading when it says "always 1 cycle", because it then adds a caveat that means it could take multiple cycles: "... the next instruction is delayed until the store can complete".
I am going to do my best to interpret the documentation:
STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing. If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete. If the store is not to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.
I would assume that the first STR takes 1 cycle, if the write buffer is available. If it is not available, the next instruction will be stalled until the buffer is available. However, if the buffer is not in use, it will delay the next instruction until the bus transaction completes.
With a non-consecutive STR (the first STR) the write buffer will be empty, and the instruction takes 1 cycle. If there are 2 consecutive STR instructions, the 2nd STR will begin immediately as the 1st STR has written to the buffer. However, if the bus transaction for the 1st STR stalls and remains in the write buffer, the 2nd STR will be unable to write to the buffer and will block further instructions. Then, when the bus transaction for the 1st STR completes, the buffer is emptied and the 2nd STR writes to the buffer, unblocking the next instruction.
A stalled bus transaction, where the transaction is buffered in the write buffer, doesn't affect non-STR instructions, as they do not need access to the write buffer to complete. So an STR instruction where the bus is stalled will not delay further instructions unless the next one is another STR. However, if the write buffer is not in use, then a stalled bus transaction will delay all instructions.
It does seem a bit off that the instruction set summary page puts a solid "2" as the number of cycles for STR when clearly it is not as predictable as this.
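To make the consecutive-STR scenario above concrete, here is a minimal sketch of the kind of sequence being described, written as GCC-style inline assembly for a Cortex-M4 toolchain (the function name and the target address are just placeholders, not anything from the TRM):
#include <stdint.h>
/* Minimal sketch for a GCC/Clang Cortex-M4 toolchain. Two back-to-back
 * immediate-offset stores: the first STR's data phase overlaps the next
 * instruction, so ideally the pair costs about 2 cycles. If the first
 * transaction is still sitting in the one-entry write buffer when the
 * second STR reaches its data phase, the second STR stalls until the
 * buffer drains. */
static inline void back_to_back_str(volatile uint32_t *p, uint32_t v)
{
    __asm volatile(
        "str %1, [%0, #0]\n\t"   /* STR Rx,[Ry,#imm]: address in cycle 1, data phase overlaps */
        "str %1, [%0, #4]\n\t"   /* stalls only if the previous store has not drained yet */
        :
        : "r"(p), "r"(v)
        : "memory");
}
In principle, timing this against fast SRAM versus a slow peripheral region should show the second store absorbing the stall, assuming the write buffer is enabled (it can be disabled via the DISDEFWBUF bit in ACTLR).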
Register offset STR (STR R1,[R3,R2])
I stand with you on your confusion over the following apparently conflicting statement:
Other instructions cannot be pipelined after STR with register offset. STR can only be pipelined when it follows an LDR, but nothing can be pipelined after the store. Even a stalled STR normally only takes two cycles, because of the write buffer.
As this is contradicted by the first clause on the page. But I believe this is because it is referring to 2 different STR types: literal offset (the first one) and register offset. The register offset STR is the one that doesn't allow pipelined instructions afterwards. The language could be clearer, though. What does it mean by a stalled STR? Is it referring to a register offset STR, which always stalls by default? Is this stall different from a stall caused by the write buffer being unavailable? It is easy to get lost here.
I think basically a register offset STR is a minimum of 2 cycles. It is going to block and take more cycles if the write buffer is unavailable, or if the transaction is not buffered and the bus stalls.
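A small illustration of the two addressing forms side by side (again GCC-style inline assembly; the function name and register choices are just for illustration): the immediate-offset store lets the following ALU instruction overlap its data phase, while nothing is pipelined behind the register-offset store.
static inline void str_imm_vs_reg(volatile uint32_t *base, uint32_t off, uint32_t v)
{
    uint32_t tmp = 0;
    __asm volatile(
        "str %2, [%1, #0]\n\t"   /* immediate offset: the following ADD overlaps the data phase */
        "adds %0, %0, #1\n\t"
        "str %2, [%1, %3]\n\t"   /* register offset: the following ADD cannot be pipelined behind it */
        "adds %0, %0, #1\n\t"
        : "+r"(tmp)
        : "r"(base), "r"(v), "r"(off)
        : "memory", "cc");
    (void)tmp;
}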
Size of write buffer
The size is a single entry, see https://developer.arm.com/documentation/100166/0001/Programmers-Model/Write-buffer?lang=en
To prevent bus wait cycles from stalling the processor during data stores, buffered stores to the DCode and System buses go through a one-entry write buffer. If the write buffer is full, subsequent accesses to the bus stall until the write buffer has drained.
The write buffer is only used if the bus waits the data phase of the buffered store, otherwise the transaction completes on the bus.
Usefulness of write buffer
As far as my understanding goes: if the CPU could write to a bus instantly, it would not need a buffer, as the bus would be free immediately for the next instruction. On a high-performance part like the M4, some of the memory buses can't keep up with the CPU clock rate, which means a transaction could take multiple cycles. There could also be DMA units making use of the same bus. To prevent stalling the CPU until a bus transaction completes, the buffer accepts the store immediately, and the hardware then writes it out to the bus when the bus is free.
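One way to picture this is a toy software model of a one-entry write buffer (purely illustrative, not derived from ARM's documentation): the CPU hands the store to the buffer and moves on, and it only stalls if the single slot is still occupied by a store the bus hasn't accepted yet.
/* Toy model of a one-entry write buffer, for intuition only. */
struct write_buffer {
    int drain_cycles_left;   /* > 0 means the single slot still holds an un-drained store */
};
/* Time passes (other instructions execute) and the buffered store keeps draining. */
static void elapse(struct write_buffer *wb, int cycles)
{
    wb->drain_cycles_left = (wb->drain_cycles_left > cycles) ? wb->drain_cycles_left - cycles : 0;
}
/* Returns how many cycles the CPU stalls on this store; bus_cycles is how long
 * the downstream bus needs to accept the write. The CPU only waits if the
 * previous store is still occupying the slot. */
static int buffered_store(struct write_buffer *wb, int bus_cycles)
{
    int stall = wb->drain_cycles_left;      /* wait for the slot to free up, if needed */
    wb->drain_cycles_left = bus_cycles;     /* this store now drains in the background */
    return stall;
}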

#EmbeddedSoftwareEngineer, thanks for the reply. I'd like to post what I summarized from my experiment
As a baseline, LDR takes 2 cycles, STR takes 1 cycle
There are 2 kinds of dependency between adjacent instructions:
Content dependency. A typical example is an STR followed by an LDR: because nothing guarantees that the memory the LDR reads was not just modified by the STR, the LDR is always delayed, i.e. it takes 3 cycles.
Addressing dependency. When the 2nd instruction's address is based on the result of the 1st instruction, the 2nd instruction is always delayed. Typical examples:
sub SP, SP, #20
ldr r1, [SP, #4]
;OR
ldr r3, [SP, #8]
ldr r4, [r3]
The second LDR will always get an extra wait cycle, yielding 3 cycles.
When neither of the dependencies described above is present, an LDR following an LDR takes 1 cycle, and an STR following an LDR takes 0 cycles.
All of these results were measured from TCM, which introduces no extra cycles from cache loads or external bus stalls.
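For reference, measurements like these are typically made with the DWT cycle counter; a minimal sketch of such a harness (raw register addresses instead of a CMSIS device header, and the STR/LDR pair is just an example sequence) could look like this:
#include <stdint.h>
/* Raw Cortex-M4 debug register addresses (normally these come from CMSIS). */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)   /* Debug Exception and Monitor Control */
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)
static uint32_t cycles_for_str_ldr(volatile uint32_t *p)
{
    uint32_t v = 0, start, end;
    DEMCR      |= (1u << 24);   /* TRCENA: enable the DWT/ITM block   */
    DWT_CYCCNT  = 0;
    DWT_CTRL   |= 1u;           /* CYCCNTENA: start the cycle counter */
    start = DWT_CYCCNT;
    __asm volatile(
        "str %0, [%1]\n\t"      /* sequence under test: STR then dependent LDR */
        "ldr %0, [%1]\n\t"
        : "+r"(v)
        : "r"(p)
        : "memory");
    end = DWT_CYCCNT;
    return end - start;         /* still includes the overhead of reading CYCCNT itself */
}
In practice you would measure an empty sequence first and subtract it, since the counter reads add a few cycles of their own.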

Related

Duration of a LDR instruction on STM32H7 depending on memory

I'm doing some evaluations on STM32H7, on the STM32H753I-EVAL2 board. I used STMicro example code to configure, write and read the QSPI Flash in memory mapped mode.
I was surprised by some figures regarding duration of LDR instruction:
I measure the number of cycles of instructions using the SysTick (connected on CPU clock). As far as I understood: one cycle of SysTick = one cycle of CPU.
I measured two instructions that are exactly identical, ldrb.w Rn, [Rp, Rq], except that Rp is in one case an address in DTCM-RAM and in the other case an address in QSPI Flash.
The results are (code executed from internal flash): 15 cycles from DTCM-RAM, 12 cycles from QSPI.
I'm surprised by the results. I guess the QSPI content is cached, which might explain the figures?
Also, I find that 15 cycles for a single LDR instruction seems like quite a lot. What do you think? Is there something wrong with my procedure?
If the internal flash is not cached, or the cache is invalid, or the pipeline was flushed, or ... (many, many other reasons), it may take more time than the QSPI-Flash-located access.
To measure execution time there are dedicated registers (for example, the cycle counter in the DWT unit).
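As a rough illustration of the SysTick approach the question describes (CMSIS core macro names; stm32h7xx.h is assumed to be the device header in use, and the reload value must be large enough that the down-counter doesn't wrap during the measurement):
#include <stdint.h>
#include "stm32h7xx.h"          /* assumed CMSIS device header: provides SysTick */
static uint32_t systick_cycles_for_ldrb(volatile uint8_t *p)
{
    uint8_t  v;
    uint32_t start, end;
    SysTick->LOAD = 0x00FFFFFFu;                      /* max 24-bit reload            */
    SysTick->VAL  = 0;                                /* any write clears the counter */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk        /* count the processor clock    */
                  | SysTick_CTRL_ENABLE_Msk;
    start = SysTick->VAL;
    __asm volatile("ldrb %0, [%1]" : "=r"(v) : "r"(p));
    end   = SysTick->VAL;
    (void)v;
    return start - end;    /* SysTick counts down; this still includes measurement overhead */
}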

Bring code into the L1 instruction cache without executing it

Let's say I have a function that I plan to execute as part of a benchmark. I want to bring this code into the L1 instruction cache prior to executing since I don't want to measure the cost of I$ misses as part of the benchmark.
The obvious way to do this is to simply execute the code at least once before the benchmark, hence "warming it up" and bringing it into the L1 instruction cache and possibly the uop cache, etc.
What are my alternatives in the case I don't want to execute the code (e.g., because I want the various predictors which key off of instruction addresses to be cold)?
In Granite Rapids and later, there is PREFETCHIT0 [rip+rel32] to prefetch code into "all levels" of cache, or prefetchit1 to prefetch into all levels except L1i. These instructions are a NOP with an addressing mode other than RIP-relative, or on CPUs that don't support them. (Perhaps they also prime the iTLB or even the uop cache, or at least could on paper, in which case this isn't what you want.) The docs in Intel's "future extensions" manual as of Dec 2022 recommend that the target address be the start of some instruction.
Note that this Q&A is about priming things for a microbenchmark. Not things that would be worth doing to improve overall performance. For that, probably just best-effort prefetch into L2 cache (the inner-most unified cache) with prefetcht1 SW prefetch, also priming the dTLB in a way that possibly helps the iTLB (although it will evict a possibly-useful dTLB entry).
Or not, if the L2 TLB is a victim cache. See X86 prefetching optimizations: "computed goto" threaded code for more discussion.
Caveat: some of this answer is Intel-centric. If I just say "the uop cache", I'm talking about Sandybridge-family. I know Ryzen has a uop-cache too, but I haven't read much of anything about its internals, and only know some basics about Ryzen from reading Agner Fog's microarch guide.
You can at least prefetch into L2 with software prefetch, but that doesn't even necessarily help with iTLB hits. (The 2nd-level TLB is a victim cache, I think, so a dTLB miss might not populate anything that the iTLB checks.)
But this doesn't help with L1I$ misses, or getting the target code decoded into the uop cache.
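As a concrete example of that limited option, a best-effort software prefetch of the function's first line might look like this sketch (treating the code address as data, which is the point; for longer functions you would repeat it every 64 bytes):
#include <xmmintrin.h>          /* _mm_prefetch / _MM_HINT_T1 */
extern void function_under_test(void);
static void prefetch_code_into_L2(void)
{
    /* prefetcht1: best-effort pull of the line holding the first bytes of the
     * function into L2 (and it primes a dTLB entry for that page). It does not
     * populate L1i or the uop cache. Casting a function pointer to a data
     * pointer is a common extension, not strict ISO C. */
    _mm_prefetch((const char *)function_under_test, _MM_HINT_T1);
}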
If there is a way to do what you want, it's going to be with some kind of trick. x86 has no "standard" way to do this; no code-prefetch instruction. Darek Mihocka wrote about code-prefetch as part of a CPU-emulator interpreter loop: The Common CPU Interpreter Loop Revisited: the Nostradamus Distributor, back in 2008 when P4 and Core2 were the hot CPU microarchitectures to tune for.
But of course that's a different case: the goal is sustained performance of indirect branches, not priming things for a benchmark. It doesn't matter if you spend a lot of time achieving the microarchitectural state you want outside the timed portion of a micro-benchmark.
Speaking of which, modern branch predictors aren't just "cold", they always contain some prediction based on aliasing1. This may or may not be important.
Prefetch the first / last lines (and maybe others) with call to a ret
I think instruction fetch / prefetch normally continues past an ordinary ret or jmp, because it can't be detected until decode. So you could just call a function that ends at the end of the previous cache line. (Make sure they're in the same page, so an iTLB miss doesn't block prefetch.)
ret after a call will predict reliably if no other call did anything to the return-address predictor stack, except in rare cases if an interrupt happened between the call and ret, and the kernel code had a deep enough call tree to push the prediction for this ret out of the RSB (return-stack-buffer). Or if Spectre mitigation flushed it intentionally on a context switch.
; make sure this is in the same *page* as function_under_test, to prime the iTLB
ALIGN 64
; could put the label here, but probably better not
60 bytes of (long) NOPs or whatever padding
prime_Icache_first_line:
ret
; jmp back_to_benchmark_startup ; alternative if JMP is handled differently than RET.
lfence ; prevent any speculative execution after RET, in case it or JMP aren't detected as soon as they decode
;;; cache-line boundary here
function_under_test:
...
prime_Icache_last_line: ; label the last RET in function_under_test
ret ; this will prime the "there's a ret here" predictor (correctly)
benchmark_startup:
call prime_Icache_first_line
call prime_Icache_first_line ; IDK if calling twice could possibly help in case prefetch didn't get far the first time? But now the CPU has "seen" the RET and may not fetch past it.
call prime_Icache_last_line ; definitely only need to call once; it's in the right line
lfence
rdtsc
.timed_loop:
call function_under_test
...
jnz .timed_loop
We can even extend this technique to more than 2 cache lines by calling to any 0xC3 (ret) byte inside the body of function_under_test. But as #BeeOnRope points out, that's dangerous because it may prime branch prediction with "there's a ret here" causing a mispredict you otherwise wouldn't have had when calling function_under_test for real.
Early in the front-end, branch prediction is needed based on fetch-block address (which block to fetch after this one), not on individual branches inside each block, so this could be a problem even if the ret byte was part of another instruction.
But if this idea is viable, then you can look for a 0xc3 byte as part of an instruction in the cache line, or at worst add a 3-byte NOP r/m32 (0f 1f c3 nop ebx,eax). c3 as a ModR/M encodes a reg,reg instruction (with ebx and eax as operands), so it doesn't have to be hidden in a disp8 to avoid making the NOP even longer, and it's easy to find in short instructions: e.g. 89 c3 mov ebx,eax, or use the other opcode so the same modrm byte gives you mov eax,ebx. Or 83 c3 01 add ebx,0x1, or many other instructions with e/rbx, bl (and r/eax or al).
With a REX prefix, those can be rbx / r11 (and rax/r8 for the /r field if applicable). It's likely you can choose (or modify for this microbenchmark) your register allocation to produce an instruction using the relevant registers to produce a c3 byte without any overhead at all, especially if you can use a custom calling convention (at least for testing purposes) so you can clobber rbx if you weren't already saving/restoring it.
I found these by searching for (space)c3(space) in the output of objdump -d /bin/bash, just to pick a random not-small executable full of compiler-generated code.
Evil hack: end the preceding cache line with the start of a multi-byte instruction.
; at the end of a cache line
prefetch_Icache_first_line:
db 0xe9 ; the opcode for 5-byte jmp rel32
function_under_test:
... normal code ; first 4 bytes will be treated as a rel32 when decoding starts at prefetch_I...
ret
; then at function_under_test+4 + rel32:
;org whatever (that's not how ORG works in NASM, so actually you'll need a linker script or something to put code here)
prefetch_Icache_branch_target:
jmp back_to_test_startup
So it jumps to a virtual address which depends on the instruction bytes of function_under_test. Map that page and put code in it that jumps back to your benchmark-prep code. The destination has to be within 2GiB, so (in 64-bit code) it's always possible to choose a virtual address for function_under_test that makes the destination a valid user-space virtual address. Actually, for many rel32 values, it's possible to choose the address of function_under_test to keep both it and the target within the low 2GiB of virtual address space, (and definitely 3GiB) and thus valid 32-bit user-space addresses even under a 32-bit kernel.
Or less crazy, using the end of a ret imm16 to consume a byte or two, just requiring a fixup of RSP after return (and treating whatever is temporarily below RSP as a "red zone" if you don't reserve extra space):
; at the end of a cache line
prefetch_Icache_first_line:
db 0xc2 ; the opcode for 3-byte ret imm16
; db 0x00 ; optional: one byte of the immediate at least keeps RSP aligned
; But x86 is little-endian, so the last byte is most significant
;; Cache-line boundary here
function_under_test:
... normal code ; first 2 bytes will be added to RSP when decoding starts at prefetch_Icache_first_line
ret
prefetch_caller:
push rbp
mov rbp, rsp ; save old RSP
;sub rsp, 65536 ; reserve space in case of the max RSP+imm16.
call prefetch_Icache_first_line
;;; UNSAFE HERE in case of signal handler if you didn't SUB.
mov rsp, rbp ; restore RSP; if no signal handlers installed, probably nothing could step on stack memory
...
pop rbp
ret
Using sub rsp, 65536 before calling to the ret imm16 makes it safe even if there might be a signal handler (or interrupt handler in kernel code, if your kernel stack is big enough, otherwise look at the actual byte and see how much will really be added to RSP). It means that call's push/store will probably miss in data cache, and maybe even cause a pagefault to grow the stack. That does happen before fetching the ret imm16, so that won't evict the L1I$ line we wanted to prime.
This whole idea is probably unnecessary; I think the above method can reliably prefetch the first line of a function anyway, and this only works for the first line. (Unless you put a 0xe9 or 0xc2 in the last byte of each relevant cache line, e.g. as part of a NOP if necessary.)
But this does give you a way to non-speculatively do code-fetch from the cache line you want without architecturally executing any instructions in it. Hopefully a direct jmp is detected before any later instructions execute, and probably without any others even decoding, except ones that decoded in the same block. (And an unconditional jmp always ends a uop-cache line on Intel). i.e. the mispredict penalty is all in the front-end from re-steering the fetch pipeline as soon as decode detects the jmp. I hope ret is like this too, in cases where the return-predictor stack is not empty.
A jmp r/m64 would let you control the destination just by putting the address in the right register or memory. (Figure out what register or memory addressing mode the first byte(s) of function_under_test encode, and put an address there). The opcode is FF /4, so you can only use a register addressing mode if the first byte works as a ModRM that has /r = 4 and mode=11b. But you could put the first 2 bytes of the jmp r/m64 in the previous line, so the extra bytes form the SIB (and disp8 or disp32). Whatever those are, you can set up register contents such that the jump-target address will be loaded from somewhere convenient.
But the key problem with a jmp r/m64 is that default-prediction for an indirect branch can fall through and speculatively execute function_under_test, affecting the branch-prediction entries for those branches. You could have bogus values in registers so you prime branch prediction incorrectly, but that's still different from not touching them at all.
How does this overlapping-instructions hack to consume bytes from the target cache line affect the uop cache?
I think (based on previous experimental evidence) Intel's uop cache puts instructions in the uop-cache line that corresponds to their start address, in cases where they cross a 32 or 64-byte boundary. So when the real execution of function_under_test begins, it will simply miss in the uop-cache because no uop-cache line is caching the instruction-start-address range that includes the first byte of function_under_test. i.e. the overlapping decode is probably not even noticed when it's split across an L1I$ boundary this way.
It is normally a problem for the uop cache to have the same bytes decode as parts of different instructions, but I'm optimistic that we wouldn't have a penalty in this case. (I haven't double-checked that for this case. I'm mostly assuming that lines record which range of start-addresses they cache, and not the whole range of x86 instruction bytes they're caching.)
Create mis-speculation to fetch arbitrary lines, but block exec with lfence
Spectre / Meltdown exploits and mitigation strategies provide some interesting ideas: you could maybe trigger a mispredict that fetches at least the start of the code you want, but maybe doesn't speculate into it.
lfence blocks speculative execution, but (AFAIK) not instruction prefetch / fetch / decode.
I think (and hope) the front-end will follow direct relative jumps on its own, even after lfence, so we can use jmp target_cache_line in the shadow of a mispredict + lfence to fetch and decode but not execute the target function.
If lfence works by blocking the issue stage until the reservation station (OoO scheduler) is empty, then an Intel CPU should probably decode past lfence until the IDQ is full (64 uops on Skylake). There are further buffers before other stages (fetch -> instruction-length-decode, and between that and decode), so fetch can run ahead of that. Presumably there's a HW prefetcher that runs ahead of where actual fetch is reading from, so it's pretty plausible to get several cache lines into the target function in the shadow of a single mispredict, especially if you introduce delays before the mispredict can be detected.
We can use the same return-address frobbing as a retpoline to reliably trigger a mispredict to jmp rel32 which sends fetch into the target function. (I'm pretty sure a re-steer of the front-end can happen in the shadow of speculative execution without waiting to confirm correct speculation, because that would make every re-steer serializing.)
function_under_test:
...
some_line: ; not necessarily the first cache line
...
ret
;;; This goes in the same page as the test function,
;;; so we don't iTLB-miss when trying to send the front-end there
ret_frob:
xorps xmm0,xmm0
movq xmm1, rax
;; The point of this LFENCE is to make sure the RS / ROB are empty so the front-end can run ahead in a burst.
;; so the sqrtpd delay chain isn't gradually issued.
lfence
;; alternatively, load the return address from the stack and create a data dependency on it, e.g. and eax,0
;; create a many-cycle dependency-chain before the RET misprediction can be detected
times 10 sqrtpd xmm0,xmm0 ; high latency, single uop
orps xmm0, xmm1 ; target address with data-dep on the sqrtpd chain
movq [rsp], xmm0 ; overwrite return address
; movd [rsp], xmm0 ; store-forwarding stall: do this *as well* as the movq
ret ; mis-speculate to the lfence/jmp some_line
; but architecturally jump back to the address we got in RAX
prefetch_some_line:
lea rax, [rel back_to_bench_startup]
; or pop rax or load it into xmm1 directly,
; so this block can be CALLed as a function instead of jumped to
call ret_frob
; speculative execution goes here, but architecturally never reached
lfence ; speculative *execution* stops here, fetch continues
jmp some_line
I'm not sure the lfence in ret_frob is needed. But it does make it easier to reason about what the front-end is doing relative to the back-end. After the lfence, the return address has a data dependency on the chain of 10x sqrtpd. (10x 15 to 16 cycle latency on Skylake, for example). But the 10x sqrtpd + orps + movq only take 3 cycles to issue (on 4-wide CPUs), leaving at least 148 cycles + store-forwarding latency before ret can read the return address back from the stack and discover that the return-stack prediction was wrong.
This should be plenty of time for the front-end to follow the jmp some_line and load that line into L1I$, and probably load several lines after that. It should also get some of them decoded into the uop cache.
You need a separate call / lfence / jmp block for each target line (because the target address has to be hard-coded into a direct jump for the front-end to follow it without the back-end executing anything), but they can all share the same ret_frob block.
If you left out the lfence, you could use the above retpoline-like technique to trigger speculative execution into the function. This would let you jump to any target branch in the target function with whatever args you like in registers, so you can mis-prime branch prediction however you like.
Footnote 1:
Modern branch predictors aren't just "cold", they contain predictions
from whatever aliased the target virtual addresses in the various branch-prediction data structures. (At least on Intel where SnB-family pretty definitely uses TAGE prediction.)
So you should decide whether you want to specifically anti-prime the branch predictors by (speculatively) executing the branches in your function with bogus data in registers / flags, or whether your micro-benchmarking environment resembles the surrounding conditions of the real program closely enough.
If your target function has enough branching in a very specific complex pattern (like a branchy sort function over 10 integers), then presumably only that exact input can train the branch predictor well, so any initial state other than a specially-warmed-up state is probably fine.
You may not want the uop-cache primed at all, to look for cold-execution effects in general (like decode), so that might rule out any speculative fetch / decode, not just speculative execution. Or maybe speculative decode is ok if you then run some uop-cache-polluting long-NOPs or times 800 xor eax,eax (2-byte instructions -> 16 per 32-byte block uses up all 3 entries that SnB-family uop caches allow without running out of room and not being able to fit in the uop cache at all). But not so many that you evict L1I$ as well.
Even speculative decode without execute will prime the front-end branch prediction that knows where branches are ahead of decode, though. I think that a ret (or jmp rel32) at the end of the previous cache line is harmless on that front: it primes that prediction correctly, because there really is a ret (or jmp) at that address.
Map the same physical page to two different virtual addresses.
L1I$ is physically addressed. (VIPT, but with all the index bits coming from within the page offset, so it behaves like PIPT).
Branch-prediction and uop caches are virtually addressed, so with the right choice of virtual addresses, a warm-up run of the function at the alternate virtual address will prime L1I, but not branch prediction or uop caches. (This only works if branch aliasing happens modulo something larger than 4096 bytes, because the position within the page is the same for both mappings.)
Prime the iTLB by calling to a ret in the same page as the test function, but outside it.
After setting this up, no modification of the page tables is required between the warm-up run and the timing run. This is why you use two mappings of the same page instead of remapping a single mapping.
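A rough Linux sketch of that double-mapping setup, using memfd_create (glibc 2.27+) and two MAP_SHARED mappings of the same pages; error handling is omitted, and it assumes the copied code is safe to run, or at least to call into, at a different address:
#define _GNU_SOURCE             /* memfd_create */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
/* Copies `len` bytes of code into a memfd and maps the same physical pages at
 * two different virtual addresses. */
static void *map_code_twice(const void *code, size_t len, void **warmup_alias)
{
    int fd = memfd_create("bench_code", 0);
    ftruncate(fd, (off_t)len);
    void *warm = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(warm, code, len);
    mprotect(warm, len, PROT_READ | PROT_EXEC);
    /* Second mapping of the same physical pages: running through `warm` heats
     * the physically-indexed L1i lines, while the virtually-addressed branch
     * predictors / uop-cache entries for `timed` stay cold (assuming branch
     * aliasing happens modulo something larger than the 4 KiB page). */
    void *timed = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    close(fd);
    *warmup_alias = warm;
    return timed;
}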
Margaret Bloom suggests that CPUs vulnerable to Meltdown might speculatively fetch instructions from a no-exec page if you jump there (in the shadow of a mispredict so it doesn't actually fault), but that would then require changing the page table, and thus a system call which is expensive and might evict that line of L1I. But if it doesn't pollute the iTLB, you could then re-populate the iTLB entry with a mispredicted branch anywhere into the same page as the function. Or just a call to a dummy ret outside the function in the same page.
None of this will let you get the uop cache warmed up, though, because it's virtually addressed. OTOH, in real life, if branch predictors are cold then probably the uop cache will also be cold.
One approach that could work for small functions would be to execute some code which appears on the same cache line(s) as your target function, which will bring in the entire cache line.
For example, you could organize your code as follows:
ALIGN 64
function_under_test:
; some code, less than 64 bytes
dummy:
ret
and then call the dummy function prior to calling function_under_test - if dummy starts on the same cache line as the target function, it would bring the entire cache line into L1I. This works for functions of 63 bytes or less1.
This can probably be extended to functions up to ~126 bytes or so by using this trick both before2 and after the target function. You could extend it to arbitrarily sized functions by inserting dummy functions on every cache line and having the target code jump over them, but this comes at a cost of inserting otherwise-unnecessary jumps into your code under test, and requires careful control over the code size so that the dummy functions are placed correctly.
You need fine control over function alignment and placement to achieve this: assembler is probably the easiest, but you can also probably do it with C or C++ in combination with compiler-specific attributes.
1 You could even reuse the ret in the function_under_test itself to support slightly longer functions (e.g., those whose ret starts within 64 bytes of the start).
2 You'd have to be more careful about the dummy function appearing before the code under test: the processor might fetch instructions past the ret and it might (?) even execute them. A ud2 after the dummy ret is likely to block further fetch (but you might want fetch if populating the uop cache is important).
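For the C-with-attributes route mentioned a couple of paragraphs up, a GCC/Clang sketch might look like the following; the attribute names are GCC-specific, the section name is arbitrary, and whether the dummy really lands in the same 64-byte line as function_under_test still has to be verified in the disassembly:
/* Align the function to a cache-line boundary and place a tiny dummy right
 * after it in the same section, hoping they share a 64-byte line. */
__attribute__((aligned(64), section(".text.bench")))
int function_under_test(int x)
{
    return x * 41 + 1;          /* keep the body under ~60 bytes of code */
}
__attribute__((section(".text.bench"), noinline))
void icache_dummy(void)
{
    /* Calling this first should pull the shared cache line into L1i. */
}
void warm_then_time(void)
{
    icache_dummy();             /* prime the I-cache line without executing function_under_test */
    /* ... start timing here ... */
    volatile int r = function_under_test(123);
    (void)r;
}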

Why do memory instructions take 4 cycles in ARM assembly?

Memory instructions such as ldr, str or b take 4 cycles each in ARM assembly.
Is it because each memory location is 4 bytes long?
ARM has a pipelined architecture. Each clock cycle advances the pipeline by one step (e.g. fetch/decode/execute/read...). Since the pipeline is continuously fed, the overall time to execute each instruction can approach 1 cycle, but the actual time for an individual instruction from 'fetch' through completion can be 3+ cycles. ARM has a good explanation on their website:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0222b/ch01s01s01.html
Memory latency adds another layer of complication to this idea. ARM employs a multi-level cache system which aims to have the most frequently used data available in the fewest cycles. Even a read from the fastest (L0) cache involves several cycles of latency. The pipeline includes facilities to allow read requests to complete at a later time if the data is not used right away. It's easier to understand by way of example:
LDR R0,[R1]
MOV R2,R3 // Allow time for memory read to occur
ADD R4,R4,#200 // by interleaving other instructions
CMP R0,#0 // before trying to use the value
// By trying to access the data immediately, this will cause a pipeline
// 'stall' and waste time waiting for the data to become available.
LDR R0,[R1]
CMP R0,#0 // Wastes at least 1 cycle due to pipeline not having the data
The idea is to hide the inherent latencies in the pipeline and, if you can, hide additional latencies in the memory access by delaying dependencies on registers (aka instruction interleaving).

Why Load/Store from buffer takes one whole clock cycle?

In this video: http://www.infoq.com/presentations/click-crash-course-modern-hardware
At 28:00, he starts to explain the following example:
ld rax <- [rbx+16] //Clock 0 Starts
add rbx,16
cmp rax,0
jeq null_chk
st [rbx-16] <- rcx //Clock 1 Starts
ld rcx <- [rdx+0] //Clock 2 Starts. Why clock 1 only does one op?
ld rax <- [rax+8] //Clock 3 Starts. Why?
And at 29:48, he says 2 cache misses happen in the example. Why 2?
No, he says that there are 2 cache misses running in parallel.
Out of the 4 memory accesses here (data memory, but let's ignore code/page walks, etc):
1. The first ld, of address [rbx+16], can be dispatched and sent to memory.
2. The store can not be performed until retirement, when it's known to be non-bogus for sure (he mentioned the branch possibly being mis-predicted - that can be known at execution time, which is before retirement, but there could be other cases he didn't mention, like faults or exceptions). In addition, the store address needs to wait for the calculation of add rbx, 16, but that shouldn't take too long. However, the execution and eventual retirement of the branch that's holding the store depends on the result of the rax compare & jmp, which in turn depends on the first load - assumed here to go to memory (in other words - don't hold your breath).
3. The 2nd load (from [rdx+0]) is independent and can be dispatched - that's the 2nd load done in parallel.
4. The 3rd load can't be dispatched - its address depends on rax, which, just like the cmp+jmp, has to wait for the first load to finish. This means we can't do anything with this load; we don't even know the actual address. Think, for example, about what happens in HW if the load returns, writes to rax, and it turns out that [rax+8] equals [rbx-16]: you'd have to forward the data from the store.
So, when executing the code, we can right away (within 1-2 cycles, depending on the decode and address calculation time) send 2 loads - items #1 and #3. The other memory operations will have to wait. The crucial element for performance here is to launch your loads as early as possible, and in parallel whenever possible.
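The same principle expressed in C (a generic sketch, not taken from the talk): when load addresses are independent of earlier loads, the hardware can keep several cache misses in flight at once, whereas a pointer-chasing style chain serializes them.
#include <stddef.h>
/* Dependent chain: each load's address comes from the previous load, so the
 * misses happen one after another (like the 3rd load in the example above). */
long sum_chained(const long *table, const size_t *next, size_t n, size_t start)
{
    long   sum = 0;
    size_t i   = start;
    for (size_t k = 0; k < n; k++) {
        sum += table[i];
        i = next[i];            /* the address of the next load depends on this load */
    }
    return sum;
}
/* Independent loads: all addresses are known up front, so several cache
 * misses can be outstanding at the same time (like loads #1 and #3 above). */
long sum_independent(const long *table, const size_t *idx, size_t n)
{
    long sum = 0;
    for (size_t k = 0; k < n; k++)
        sum += table[idx[k]];
    return sum;
}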

If registers are so blazingly fast, why don't we have more of them?

In 32bit, we had 8 "general purpose" registers. With 64bit, the amount doubles, but it seems independent of the 64bit change itself.
Now, if registers are so fast (no memory access), why aren't there more of them naturally? Shouldn't CPU builders work as many registers as possible into the CPU? What is the logical restriction to why we only have the amount we have?
There are many reasons you don't just have a huge number of registers:
They're highly linked to most pipeline stages. For starters, you need to track their lifetime, and forward results back to previous stages. The complexity gets intractable very quickly, and the number of wires (literally) involved grows at the same rate. It's expensive on area, which ultimately means it's expensive on power, price and performance after a certain point.
It takes up instruction encoding space. 16 registers take up 4 bits for source and destination, and another 4 if you have 3-operand instructions (e.g. ARM). That's an awful lot of instruction set encoding space taken up just to specify the register. This eventually impacts decoding, code size and again complexity.
There's better ways to achieve the same result...
These days we really do have lots of registers - they're just not explicitly programmed. We have "register renaming". While you only access a small set (8-32 registers), they're actually backed by a much larger set (e.g. 64-256). The CPU then tracks the visibility of each register, and allocates them to the renamed set. For example, you can load, modify, then store to a register many times in a row, and have each of these operations actually performed independently depending on cache misses etc. In ARM:
ldr r0, [r4]
add r0, r0, #1
str r0, [r4]
ldr r0, [r5]
add r0, r0, #1
str r0, [r5]
Cortex A9 cores do register renaming, so the first load to "r0" actually goes to a renamed virtual register - let's call it "v0". The load, increment and store happen on "v0". Meanwhile, we also perform a load/modify/store to r0 again, but that'll get renamed to "v1" because this is an entirely independent sequence using r0. Let's say the load from the pointer in "r4" stalled due to a cache miss. That's ok - we don't need to wait for "r0" to be ready. Because it's renamed, we can run the next sequence with "v1" (also mapped to r0) - and perhaps that's a cache hit and we just had a huge performance win.
ldr v0, [v2]
add v0, v0, #1
str v0, [v2]
ldr v1, [v3]
add v1, v1, #1
str v1, [v3]
I think x86 is up to a gigantic number of renamed registers these days (ballpark 256). That would mean having 8 bits times 2 for every instruction just to say what the source and destination is. It would massively increase the number of wires needed across the core, and its size. So there's a sweet spot around 16-32 registers which most designers have settled for, and for out-of-order CPU designs, register renaming is the way to mitigate it.
Edit: a note on the importance of out-of-order execution and register renaming here. Once you have OOO, the number of registers doesn't matter so much, because they're just "temporary tags" and get renamed to the much larger virtual register set. You don't want the number to be too small, because it gets difficult to write small code sequences. This is a problem for x86-32, because the limited 8 registers means a lot of temporaries end up going through the stack, and the core needs extra logic to forward reads/writes to memory. If you don't have OOO, you're usually talking about a small core, in which case a large register set is a poor cost/performance benefit.
So there's a natural sweet spot for register bank size which maxes out at about 32 architected registers for most classes of CPU. x86-32 has 8 registers and it's definitely too small. ARM went with 16 registers and it's a good compromise. 32 registers is slightly too many if anything - you end up not needing the last 10 or so.
None of this touches on the extra registers you get for SSE and other vector floating point coprocessors. Those make sense as an extra set because they run independently of the integer core, and don't grow the CPU's complexity exponentially.
We Do Have More of Them
Because almost every instruction must select 1, 2, or 3 architecturally visible registers, expanding the number of them would increase code size by several bits on each instruction and so reduce code density. It also increases the amount of context that must be saved as thread state, and partially saved in a function's activation record. These operations occur frequently. Pipeline interlocks must check a scoreboard for every register and this has quadratic time and space complexity. And perhaps the biggest reason is simply compatibility with the already-defined instruction set.
But it turns out, thanks to register renaming, we really do have lots of registers available, and we don't even need to save them. The CPU actually has many register sets, and it automatically switches between them as your code executes. It does this purely to get you more registers.
Example:
load r1, a # x = a
store r1, x
load r1, b # y = b
store r1, y
In an architecture that has only r0-r7, the following code may be rewritten automatically by the CPU as something like:
load r1, a
store r1, x
load r10, b
store r10, y
In this case r10 is a hidden register that is substituted for r1 temporarily. The CPU can tell that the value of r1 is never used again after the first store. This allows the first load to be delayed (even an on-chip cache hit usually takes several cycles) without requiring the delay of the second load or the second store.
They add registers all of the time, but they are often tied to special purpose instructions (e.g. SIMD, SSE2, etc) or require compiling to a specific CPU architecture, which lowers portability. Existing instructions often work on specific registers and couldn't take advantage of other registers if they were available. Legacy instruction set and all.
To add a little interesting info here: you'll notice that having 8 same-sized registers allows opcodes to maintain consistency with hexadecimal notation. For example, the instruction push ax is opcode 0x50 on x86 and goes up to 0x57 for the last register di. Then the instruction pop ax starts at 0x58 and goes up to 0x5F, pop di, to complete the first base-16. Hexadecimal consistency is maintained with 8 registers per register size.
