Memory instructions such as ldr, str or b take 4 cycles each in ARM assembly.
Is it because each memory location is 4 bytes long?

ARM has a pipelined architecture. Each clock cycle advances the pipeline by one step (e.g. fetch/decode/execute/read...). Since the pipeline is continuously fed, the overall time to execute each instruction can approach 1 cycle, but the actual time for an individual instruction from 'fetch' through completion can be 3+ cycles. ARM has a good explanation on their website:
Memory latency adds another layer of complication to this idea. ARM employs a multi-level cache system which aims to have the most frequently used data available in the fewest cycles. Even a read from the fastest (L1) cache involves several cycles of latency. The pipeline includes facilities to allow read requests to complete at a later time if the data is not used right away. It's easier to understand by way of example:
LDR R0,[R1]
MOV R2,R3 // Allow time for memory read to occur
ADD R4,R4,#200 // by interleaving other instructions
CMP R0,#0 // before trying to use the value
// By trying to access the data immediately, this will cause a pipeline
// 'stall' and waste time waiting for the data to become available.
LDR R0,[R1]
CMP R0,#0 // Wastes at least 1 cycle due to pipeline not having the data
The idea is to hide the inherent latencies in the pipeline and, if you can, hide additional latencies in the memory access by delaying dependencies on registers (aka instruction interleaving).
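The effect of interleaving can be illustrated with a toy model. The Python sketch below is not ARM code and does not model any real core: it assumes a single-issue, in-order pipeline and a hypothetical load-to-use latency of 3 cycles (`LOAD_LATENCY`), and counts how many cycles each of the two sequences above needs to issue.

```python
LOAD_LATENCY = 3  # assumed load-to-use latency in cycles (illustrative only)

def issue_cycles(program, load_latency=LOAD_LATENCY):
    """Return (total_cycles, stall_cycles) for a single-issue, in-order model.

    program is a list of (opcode, destination, sources) tuples.
    """
    ready = {}      # register -> first cycle its value can be read
    cycle = 0       # cycle in which the next instruction may issue
    stalls = 0
    for opcode, dest, sources in program:
        # Wait until every source register has its value available.
        earliest = max([cycle] + [ready.get(reg, 0) for reg in sources])
        stalls += earliest - cycle
        result_latency = load_latency if opcode == "LDR" else 1
        ready[dest] = earliest + result_latency
        cycle = earliest + 1
    return cycle, stalls

interleaved = [
    ("LDR", "R0", ["R1"]),
    ("MOV", "R2", ["R3"]),
    ("ADD", "R4", ["R4"]),
    ("CMP", "flags", ["R0"]),
]
back_to_back = [
    ("LDR", "R0", ["R1"]),
    ("CMP", "flags", ["R0"]),
]

print(issue_cycles(interleaved))   # (total cycles, stall cycles)
print(issue_cycles(back_to_back))
```

Under these assumptions both sequences need 4 cycles, but the interleaved one gets two extra instructions done in that time, while the back-to-back one spends 2 cycles stalled.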


Duration of a LDR instruction on STM32H7 depending on memory

I'm doing some evaluations on STM32H7, on the STM32H753I-EVAL2 board. I used STMicro example code to configure, write and read the QSPI Flash in memory mapped mode.
I was surprised by some figures regarding duration of LDR instruction:
I measure the number of cycles of instructions using the SysTick (connected on CPU clock). As far as I understood: one cycle of SysTick = one cycle of CPU.
I measured two exactly identical instructions, ldrb.w Rn, [Rp, Rq], except that Rp is in one case an address in DTCM-RAM and in the other case an address in QSPI Flash.
The results are (code executed from internal flash): 15 cycles from DTCM-RAM, 12 cycles from QSPI.
I'm surprised by the results. I guess the QSPI content is cached, which might explain the figures?
Also, 15 cycles for a single LDR instruction seems quite a lot. What do you think? Is there something wrong in my procedure?
If the internal flash is not cached, or the cache is invalid, or the pipeline was flushed (or any of many other things), it may take more time than an instruction located in QSPI Flash.
To measure execution time, the core provides dedicated registers (for example the DWT cycle counter, DWT->CYCCNT, on Cortex-M7).

Deterministic execution time on x86-64

Is there an x64 instruction(s) that takes a fixed amount of time, regardless of the micro-architectural state such as caches, branch predictors, etc.?
For instance, if a hypothetical add or increment instruction always takes n cycles, then I can implement a timer in my program by performing that add instruction multiple times. Perhaps an increment instruction with register operands may work, but it's not clear to me whether Intel's spec guarantees that it would take deterministic number of cycles. Note that I am not interested in current time, but only a primitive / instruction sequence that takes a fixed number of cycles.
Assume that I have a way to force atomic execution i.e. no context switches during timer's execution i.e. only my program gets to run.
On a related note, I also cannot use system services to keep track of time, because I am working in a setting where my program is a user-level program running on an untrusted OS.
The x86 ISA documents don't guarantee anything about what takes a certain amount of cycles. The ISA allows things like Transmeta's Crusoe that JIT-compiled x86 instructions to an internal VLIW instruction set. It could conceivably do optimizations between adjacent instructions.
The best you can do is write something that will work on as many known microarchitectures as possible. I'm not aware of any x86-64 microarchitectures that are "weird" like Transmeta, only the usual superscalar decode-to-uops designs like Intel and AMD use.
Simple integer ALU instructions like ADD are almost all 1c latency, and tiny loops that don't touch memory are almost totally unaffected by anything, and are very predictable. If they run a lot of iterations, they're also almost totally unaffected by the impact of surrounding code on the out-of-order core, and recover very quickly from disruptions like timer interrupts.
On nearly every Intel microarchitecture, this loop will run at one iteration per clock:
mov ecx, 1234567 ; or use a 64-bit register for higher counts.
.loop:
sub ecx, 1 ; not dec because of Pentium 4.
jnz .loop
Agner Fog's microarch guide and instruction tables say that VIA Nano3000 has a taken-branch throughput of one per 3 cycles, so this loop would only run at one iteration per 3 clocks there. AMD Bulldozer-family and Jaguar similarly have a max throughput of one taken JCC per 2 clocks.
See also other performance links in the x86 tag wiki.
If you want a more power-efficient loop, you could use PAUSE in the loop, but it waits ~100 cycles on Skylake, up from ~5 cycles on previous microarchitectures. (You can make cycle-accurate predictions for more complicated loops that don't touch memory, but that depends on microarchitectural details.)
You could make a more reliable loop that's less likely to have different bottlenecks on different CPUs by making a longer dependency chain within each iteration. Since each instruction depends on the previous one, it can still only run at one instruction per cycle (not counting the branch), drastically reducing the branches per cycle.
; one add/sub per clock, limited by latency
; should run one iteration per 6 cycles on every CPU listed in Agner Fog's tables
; And should be the same on all future CPUs unless they do magic inter-instruction optimizations.
; Or it could be slower on CPUs that always have a bubble on taken branches, but it seems unlikely anyone would design one.
.loop:
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 2 ; net result ecx-1
jnz .loop
Unrolling like this ensures that front-end effects are not a bottleneck. It gives the frontend decoders plenty of time to queue up the 6 add/sub insns and the jcc before the next branch.
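To see why the longer chain evens out the differences between CPUs, here is a rough Python sketch. It assumes an iteration can go no faster than the slower of two limits: the latency of the dependency chain and the interval between taken branches. The branch intervals are the figures quoted above. The function names are made up for illustration, and the model ignores every other bottleneck.

```python
# Cycles between taken conditional branches, as quoted in the text above.
TAKEN_BRANCH_INTERVAL = {
    "most Intel": 1,
    "AMD Bulldozer-family / Jaguar": 2,
    "VIA Nano3000": 3,
}

def cycles_per_iteration(chain_latency, branch_interval):
    """A loop iteration cannot finish faster than its slowest bottleneck."""
    return max(chain_latency, branch_interval)

def loop_cycles(iterations, chain_latency, branch_interval):
    """Estimated total cycles for the whole delay loop."""
    return iterations * cycles_per_iteration(chain_latency, branch_interval)

for cpu, interval in TAKEN_BRANCH_INTERVAL.items():
    short = cycles_per_iteration(1, interval)     # sub/jnz loop: 1-cycle chain
    unrolled = cycles_per_iteration(6, interval)  # six dependent add/sub: 6-cycle chain
    print(f"{cpu}: short loop {short} c/iter, unrolled loop {unrolled} c/iter")
```

In this model the short loop runs at 1, 2 or 3 cycles per iteration depending on the CPU, while the unrolled loop runs at 6 cycles per iteration on all three.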
Using add/sub instead of dec/inc avoids a partial-flag false dependency on Pentium 4. (Although I don't think that would be an issue anyway.)
Pentium4's double-clocked ALUs can each run two ADDs per clock, but the latency is still one cycle. i.e. apparently it can't forward a result internally to chew through this dependency chain twice as fast as any other CPU.
And yes, Prescott P4 is an x86-64 CPU, so we can't quite ignore P4 if we need a general purpose answer.

Understanding CPU pipeline stages vs. Instruction throughput

I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?
Besides the obvious of "different instructions require a different amount of work to complete", hear me out...
Consider an i7 with an approx 14 stage pipeline. That takes 14 clock cycles to complete a run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.
An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more than the stage count) and a throughput of 8 (on an Ivy Bridge).
Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.
I know about the multiple execution units. I don't understand how the length of instructions in terms of latency and throughput relates to the number of pipeline stages.
I think what's missing from the existing answers is the existence of "bypass" or "forwarding" datapaths. For simplicity, let's stick with the MIPS 5-stage pipeline. Every instruction takes 5 cycles from birth to death -- fetch, decode, execute, memory, writeback. So that's how long it takes to process a single instruction.
What you want to know is how long it takes for one instruction to hand off its result to a dependent instruction. Say you have two consecutive ADD instructions, and there's a dependency through R1:
ADD R1, R2, R3
ADD R4, R1, R5
If there were no forwarding paths, we'd have to stall the second instruction for multiple cycles (2 or 3 depending on how writeback works), so that the first one could store its result into the register file before the second one reads that as input in the decode stage.
However, there are forwarding paths that allow valid results (but ones that are not yet written back) to be picked out of the pipeline. So let's say the first ADD gets all its inputs from the register file in decode. The second one will get R5 out of the register file, but it'll get R1 out of the pipeline register following the execute stage. In other words, we're routing the output of the ALU back into its input one cycle later.
Out-of-order processors make ubiquitous use of forwarding. They will have lots of different functional units that have lots of different latencies. For instance, ADD and AND will typically take one cycle (TO DO THE MATH, putting aside all of the pipeline stages before and after), MUL will take like 4, floating point operations will take lots of cycles, memory access has variable latency (due to cache misses), etc.
By using forwarding, we can limit the critical path of an instruction to just the latencies of the execution units, while everything else (fetch, decode, retirement) is off the critical path. Instructions get decoded and dumped into instruction queues, where they wait for their inputs to be produced by other executing instructions. When an instruction's dependencies are satisfied, it can begin executing.
Let's consider this example
MUL R1,R5,R6
ADD R2,R1,R3
AND R7,R2,R8
I'm going to make an attempt at drawing a timeline that shows the flow of these instructions through the pipeline.
F - Fetch
D - Decode
I - Instruction queue (IQ)
X - execute
W - writeback/forward/bypass
R - retire
So, as you see, the multiply instruction has a total lifetime of 9 cycles. But there is overlap in execution of the MUL and the ADD, because the processor is pipelined. When the ADD enters the IQ, it has to wait for its input (R1), and likewise so does the AND that is dependent on the ADD's result (R2). What we care about is not how long the MUL lives in total but how long any dependent instruction has to wait. That is its EFFECTIVE latency, which is 4 cycles. As you can see, once the ADD executes, the dependent AND can execute on the next cycle, again due to forwarding.
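A timeline consistent with the numbers above can be produced with a small model. This Python sketch is an illustration under stated assumptions, not a description of any real core: one instruction is fetched per cycle; F, D, W and R take one cycle each; every instruction spends at least one cycle in the IQ; and the execute latencies are 4 cycles for MUL and 1 cycle for ADD and AND.

```python
EXEC_LATENCY = {"MUL": 4, "ADD": 1, "AND": 1}   # assumed execute latencies

def timeline(program, latency=EXEC_LATENCY):
    """Return a list of (opcode, first_execute_cycle, row) per instruction.

    program is a list of (opcode, destination, sources) tuples. Each row
    holds one letter per cycle; '.' marks cycles before the fetch.
    """
    ready = {}   # register -> first cycle a forwarded value can be consumed
    rows = []
    for index, (opcode, dest, sources) in enumerate(program):
        fetch = index                 # F in cycle `index`, D in the next cycle
        queue = fetch + 2             # first cycle spent in the IQ
        # Execute once past the IQ and once all inputs have been forwarded.
        start = max([queue + 1] + [ready.get(reg, 0) for reg in sources])
        ready[dest] = start + latency[opcode]
        row = ("." * fetch + "FD" + "I" * (start - queue)
               + "X" * latency[opcode] + "WR")
        rows.append((opcode, start, row))
    return rows

program = [
    ("MUL", "R1", ["R5", "R6"]),
    ("ADD", "R2", ["R1", "R3"]),
    ("AND", "R7", ["R2", "R8"]),
]
rows = timeline(program)
for opcode, start, row in rows:
    print(f"{opcode:4}{row}")
```

With these assumptions the MUL row is FDIXXXXWR (9 cycles). The ADD starts executing 4 cycles after the MUL started, and the AND starts 1 cycle after the ADD.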
I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?
Because what we're interested in is in speed between instructions, not the start to end time of a single instruction.
Besides the obvious of "different instructions require a different amount of work to complete", hear me out...
Well that's the key answer to why different instructions have different latencies.
Consider an i7 with an approx 14 stage pipeline. That takes 14 clock cycles to complete a run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.
That is correct, though that's not a particularly meaningful number. For example, why do we care how long it takes before the CPU is entirely done with an instruction? That has basically no effect.
An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more than the stage count) and a throughput of 8 (on an Ivy Bridge).
This is just a bunch of misunderstandings. An XOR introduces one cycle of latency into a dependency chain. That is, if I do 12 instructions that each modify the previous instruction's value and then add an XOR as the 13th instruction, it will take one cycle more. That's what the latency means.
Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.
Right. So?
I know about the multiple execution units. I don't understand how the length of instructions in terms of latency and throughput relates to the number of pipeline stages.
They don't. Why should there be any connection? Say there are 14 extra stages at the beginning of the pipeline. Why would that affect latency or throughput at all? It would just mean everything happens 14 clock cycles later, but still at the same rate. (Though likely it would impact the cost of a mispredicted branch and other things.)
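A back-of-the-envelope Python sketch makes the same point. It assumes a fully dependent chain and a front end that keeps the queue fed, so the pipeline depth is paid once as a fixed offset. The function name and the simple additive model are illustrative assumptions.

```python
def chain_finish_cycle(latencies, front_end_stages):
    """Cycle at which the last result of a fully dependent chain is ready.

    Assumes the front end keeps the queue full, so its depth is paid once.
    """
    return front_end_stages + sum(latencies)

chain = [1] * 12          # twelve dependent 1-cycle instructions
with_xor = chain + [1]    # plus a 13th dependent XOR with 1-cycle latency

for depth in (14, 28):
    base = chain_finish_cycle(chain, depth)
    longer = chain_finish_cycle(with_xor, depth)
    print(depth, base, longer, longer - base)
```

Doubling the depth from 14 to 28 moves both finish times 14 cycles later, but the extra XOR still costs exactly one cycle in either case.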

Cycles per Instruction - Does a line of code in assembly sum different cpi operation?

Imagine you have two instructions in assembly:
movl $10, %ecx
movl 0(%eax), %edx
The CPI for register moves is 1, and for memory access is 2.
For the 1st line CPI = 1. For the second one, is the CPI = 2 or 3? Do we sum the memory access (2 cycles) and the move cost, or just consider the memory access?
Cycle counting doesn't really work anymore, ever since the Pentium 4 hit the market. Deep pipelines, three-level memory cache hierarchies, multiple execution units with out-of-order execution, branch prediction...
It is often possible to make a good guess about the timing of a bigger piece of code but for two isolated instructions it is virtually impossible (unless one instruction happens to be DIV or IDIV, then we know it must be bad). The context is important because dependency chains play a big role (critical path).
In real code, your two instructions might well contribute nothing at all to the total timing, if they execute in the latency shadow of some other instruction. On the other hand, if the value addressed by EAX is not in any of the caches then it costs you hundreds of cycles, or many thousands if the data has to be paged in from disk...
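The "latency shadow" idea can be sketched with a tiny dataflow model in Python. All the numbers here are assumptions chosen for illustration: a hypothetical 20-cycle operation already in flight, a 4-cycle cache hit and a 300-cycle miss. The model assumes unlimited execution units and perfect register renaming, so only true data dependencies count.

```python
def critical_path(instructions):
    """Finish cycle of a block, assuming unlimited execution units and
    perfect register renaming (only true data dependencies matter)."""
    done = {}    # register -> cycle at which its newest value is ready
    finish = 0
    for dest, sources, latency in instructions:
        start = max((done.get(reg, 0) for reg in sources), default=0)
        done[dest] = start + latency
        finish = max(finish, done[dest])
    return finish

LONG_OP = ("xmm0", ["xmm0"], 20)     # hypothetical 20-cycle operation
MOV_IMM = ("ecx", [], 1)             # movl $10, %ecx
LOAD_HIT = ("edx", ["eax"], 4)       # movl 0(%eax), %edx, assumed 4-cycle hit
LOAD_MISS = ("edx", ["eax"], 300)    # same load, assumed 300-cycle miss

print(critical_path([LONG_OP]))
print(critical_path([LONG_OP, MOV_IMM, LOAD_HIT]))
print(critical_path([LONG_OP, MOV_IMM, LOAD_MISS]))
```

In this model the two moves add nothing when the load hits, because they finish inside the shadow of the 20-cycle operation. When the load misses, it becomes the critical path instead.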
The current Intel® 64 and IA-32 Architectures Optimization Reference Manual contains everything that you need. It contains tables with cycle counts (latency and throughput) for most instructions, as well as several hundred pages of explanation why simple cycle counting doesn't work.

Why is "execute" located before "memory" in Instruction Set Architecture?

I learned Processor Architecture 3 years ago.
To this day, I can't figure out why execute is located before memory in the sequence of instruction stages.
When executing the instruction [ mov (%eax), %ebx ], doesn't it need to access memory?
Let's remember classic RISC pipeline, which is usually studied: http://en.wikipedia.org/wiki/Classic_RISC_pipeline. Here are its stages:
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute
MEM = Memory access
WB = Register write back
In RISC, only loads and stores work with memory. For a memory access instruction, the EX stage computes the address in memory (it takes the address from the register file, then scales it or adds an offset). The address is then passed to the MEM stage.
Your example, mov (%eax), %ebx is actually a load from memory without any additional computation and it can be represented even in RISC pipeline:
IF - get the instruction from instruction memory
ID - decode instruction, pass "eax" register to ALU as operand; remember "ebx" as output for WB (in control unit);
EX - compute "eax+0" in ALU and pass result to next stage MEM (as address in memory)
MEM - take address from EX stage (from ALU), go to memory and take value (this stage can take several ticks to reach memory with blocking of the pipeline). Pass value to WB
WB - take value from MEM and pass it back to register file. Control unit should set the register file into mode: "Writing"+"EBX selected"
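The five steps above can be sketched as a short Python walk-through. This is a functional illustration only: it has no timing and no real instruction encoding, and the register and memory contents are made-up values. It shows why EX (address computation) has to come before MEM (the actual read).

```python
def run_load(registers, memory, base_reg, dest_reg, offset=0):
    """Walk a load through IF, ID, EX, MEM, WB and return a per-stage trace."""
    trace = []
    instruction = ("load", base_reg, dest_reg, offset)   # IF: fetch the instruction
    trace.append(("IF", instruction))
    _, base, dest, off = instruction                     # ID: decode, read base register
    base_value = registers[base]
    trace.append(("ID", base_value))
    address = base_value + off                           # EX: ALU computes the address
    trace.append(("EX", address))
    value = memory[address]                              # MEM: read data memory
    trace.append(("MEM", value))
    registers[dest] = value                              # WB: write the register file
    trace.append(("WB", dest))
    return trace

registers = {"eax": 0x1000, "ebx": 0}   # made-up initial register values
memory = {0x1000: 42}                   # made-up memory contents
trace = run_load(registers, memory, "eax", "ebx")   # mov (%eax), %ebx
for stage, payload in trace:
    print(stage, payload)
print(registers["ebx"])
```

The address produced in EX is the input to MEM, so the memory access cannot happen until the ALU has run, even when the offset is 0.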
The situation is more complex for a true CISC instruction, e.g. add (%eax), %ebx (load word T from memory at [%eax], then store T+%ebx to %ebx). This instruction needs both address computation and addition in the ALU. This can't be easily represented in the simplest RISC (MIPS) pipelines.
The first x86 CPU (8086) was not pipelined; it executed only a single instruction at any moment. But since the 80386 there is a pipeline with 6 stages, which is more complex than in RISC. There is a presentation about its pipeline, comparing it with MIPS: http://www.academic.marist.edu/~jzbv/architecture/Projects/projects2004/INTEL%20X86%20PIPELINING.ppt
Slide 17 says:
Intel combines the mem and EX stages to avoid loads and stalls, but does create stalls for address computation
All stages in mips takes one cycle, where as Intel may take more than one for certain stages. This creates asymmetric performance
In my example, add will be executed in that combined "MEM+EX" stage for several CPU ticks, generating many stalls.
Modern x86 CPUs have a very long pipeline (16 stages is typical), and they are RISC-like CPUs internally. The decoder stages (3 stages or more) break most complex x86 instructions into a series of internal RISC-like micro-operations (sometimes up to 450 micro-operations per instruction are generated with the help of microcode; 2-3 micro-operations is more typical). For complex ALU/MEM operations, there will be a micro-op for address computation, then a micro-op for the memory load, and then a micro-op for the ALU action. Micro-operations have dependencies between them and are scheduled onto different execution ports.
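As a rough illustration of that split, the Python sketch below "cracks" add (%eax), %ebx into three dependent micro-ops and executes them in order. The micro-op names (agen, load, add) and the temporaries t_addr and t_val are invented for this example; real decoders use their own internal formats and may fuse some of these steps.

```python
def crack_add_mem_to_reg(base_reg, dest_reg):
    """Split 'add (base), dest' into three dependent micro-ops (names are made up)."""
    return [
        ("agen", "t_addr", (base_reg,)),          # address computation
        ("load", "t_val", ("t_addr",)),           # memory load
        ("add", dest_reg, ("t_val", dest_reg)),   # ALU action
    ]

def execute(uops, registers, memory):
    """Run micro-ops in order and return the architectural registers."""
    temps = dict(registers)   # architectural registers plus temporaries
    for op, dest, sources in uops:
        values = [temps[name] for name in sources]
        if op == "agen":
            temps[dest] = values[0]
        elif op == "load":
            temps[dest] = memory[values[0]]
        elif op == "add":
            temps[dest] = values[0] + values[1]
    # Only architectural registers are visible after the instruction completes.
    return {name: temps[name] for name in registers}

registers = {"eax": 0x2000, "ebx": 5}   # made-up initial values
memory = {0x2000: 37}
uops = crack_add_mem_to_reg("eax", "ebx")
result = execute(uops, registers, memory)
print(result)
```

Each micro-op reads the result of the one before it. That is the dependency chain the scheduler has to respect when it assigns them to execution ports.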
