Fully utilizing pipelines on Kaby Lake

(Followup code-review question here, with more details of the context of this loop.)
Environment:
Windows 7 x64
VS 2017 community
Targeting x64 code on an Intel i7-7700K (Kaby Lake)
I don't write a lot of assembler code, and when I do, it's either short enough or simple enough that I don't have to worry much about squeezing the maximum amount of perf out of it. My more complex code is usually written in C and I let the compiler's optimizers worry about latency, code alignment, etc.
However in my current project, MSVC's optimizer does a remarkably poor job on the code in my critical path. So...
I haven't yet found a good tool that does either a static or runtime analysis of x64 assembler code with a view to removing stalls, improving latency, etc. All I've got is the VS profiler which tells me (roughly) which instructions are taking the most time. And the clock on the wall that tells me if the latest change has made things better or worse.
As an alternative, I've been slogging my way thru Agner's docs with the hope of squeezing some more perf out of my code. The problem is that it's hard to understand any of his work until you understand all of it. But pieces of it make sense, and I'm trying to apply what I have learned.
With that in mind, here's the core of my innermost loop, which (not surprisingly) is where the VS profiler says my time is being spent:
nottop:
    vpminub ymm2, ymm2, ymm3   ; reset out of range values
    vpsubb  ymm2, ymm2, ymm0   ; take a step
top:
    vptest  ymm2, ymm1         ; check for out of range values
    jnz     nottop
; Outer loop that does some math, does a "vpsubb ymm2, ymm2, ymm0",
; and eventually jumps back to top
Yup, this is pretty much a textbook example of a dependency chain: Each instruction in this tight little loop depends upon the results of the previous operation. This means there can be no parallelism, which means I'm not taking full advantage of the processor.
Inspired by Agner's "optimizing assembler" doc, I've come up with an approach that (hopefully) allows me to do 2 operations at a time, so I could have one pipeline updating ymm2 and another updating (say) ymm8.
It's a non-trivial change though, so before I start ripping everything apart, I wonder if it's likely to help. Looking at Agner's "Instruction tables" for kaby lake (my target), I see that:
        uops each port   Latency
pminub  p01              1
psubb   p015             1
ptest   p0 p5            3
Given this, it looks like while one pipeline is using p0+p5 to do the vptest against ymm2, another can be utilizing p1 to do both vpminub and vpsubb on ymm8. Yeah, things are still going to get stacked up behind vptest, but it should help.
Or would it?
I'm currently running this code from 8 threads (yes, 8 threads really does give me better total throughput than 4, 5, 6 or 7). Given that my i7-7700K has 4 hyperthreaded cores, wouldn't the fact that there are 2 threads running on each core mean that I'm already maxing out the ports? Ports are "per core," not "per logical cpu," right?
So.
Based on my current understanding of Agner's work, it appears that there is no way to further optimize this code in its current form. If I want better perf, I'm going to need to come up with a different approach.
And yes, I'm sure if I posted my whole asm routine here, someone could suggest an alternate approach. But the purpose of this Question isn't to have someone write my code for me. I'm trying to see if I'm starting to understand how to think about optimizing asm code.
Is this (roughly) the right way to look at things? Am I missing a few pieces? Or is this flat-out wrong?

TL:DR: I think Hyperthreading should keep all your vector ALU ports busy with 2 threads per core.
vptest doesn't write either vector register, only flags. The next iteration doesn't have to wait for it, so its latency is mostly irrelevant.
Only jnz is dependent on vptest, and speculative execution + branch prediction hides the latency of control dependencies. vptest latency is relevant for how quickly a branch mispredict can be detected, but not for throughput in the correctly-predicted case.
Good point about hyperthreading. Interleaving two independent dep chains within a single thread can be helpful, but it's a lot harder to do correctly and efficiently.
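To see why it's harder, here's a naive interleaving sketch (the register choices follow the question's "ymm2 and ymm8" idea; the exit logic is a placeholder, not a drop-in replacement):

nottop:
    vpminub ymm2, ymm2, ymm3   ; chain 1
    vpsubb  ymm2, ymm2, ymm0
    vpminub ymm8, ymm8, ymm3   ; chain 2: independent data, same constants
    vpsubb  ymm8, ymm8, ymm0
top:
    vptest  ymm2, ymm1
    jnz     nottop             ; problem: re-steps ymm8 as well
    vptest  ymm8, ymm1
    jnz     nottop             ; problem: re-steps ymm2 as well

The two chains get coupled through the exit tests: once one register is in range you either keep stepping it anyway or split into separate loops/branch targets, and that extra logic is where the complexity (and the extra uops) comes from.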
Let's look at the instructions in your loop. predicted-taken jnz will always run on p6, so we can discount it. (Unrolling could actually hurt: predicted-not-taken jnz can also run on p0 or p6)
On a core by itself, your loop should run at 2 cycles per iteration, bottlenecked on latency. It's 5 fused-domain uops, so it takes 1.25 cycles to issue. (Unlike test, jnz can't macro-fuse with vptest). With hyperthreading, the front-end is already a worse bottleneck than latency. Each thread can issue 4 uops every other cycle, which is less than the 5 uops every other cycle of the dependency-chain bottleneck.
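For reference, here's the loop again, annotated with the per-instruction uop and port breakdown those numbers come from (taken from the question's table plus Agner's Skylake/Kaby Lake data):

nottop:
    vpminub ymm2, ymm2, ymm3   ; 1 uop, p01
    vpsubb  ymm2, ymm2, ymm0   ; 1 uop, p015
top:
    vptest  ymm2, ymm1         ; 2 uops, p0 + p5
    jnz     nottop             ; 1 uop, p6 (predicted-taken)
                               ; total: 5 fused-domain uops per iteration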
(This is common for recent Intel, especially SKL/KBL: many uops have enough ports to choose from that sustaining 4 uops per clock throughput is realistic, especially with SKL's improved throughput of the uop-cache and decoders to avoid issue bubbles due to front-end limitations rather than the back-end filling up.)
Every time one thread stalls (e.g. for a branch mispredict), the front-end can catch up on the other thread and get lots of future iterations into the out-of-order core for it to chew through at one iter per 2 cycles. (Or less because of execution-port throughput limits, see below).
Execution-port throughput (unfused-domain):
Only 1 of every 5 uops runs on p6 (the jnz). It can't be a bottleneck because the front-end issue rate limits us to less than one branch issuing per clock while running this loop.
The other 4 vector ALU uops per iteration have to run on the 3 ports with vector execution units. The p01 and p015 uops have enough scheduling flexibility that no single port will be a bottleneck, so we can just look at total ALU throughput. That's 4 uops / iter for 3 ports, for a max average throughput for a physical core of one iter per 1.333 cycles.
For a single thread (no HT), this is not the most serious bottleneck. But with two hyperthreads, that's one iter per 2.6666 cycles.
Hyperthreading should saturate your execution units, with some front-end throughput to spare. Each thread should average one per 2.666c, with the front-end able to issue at one per 2.5c. Since latency only limits you to one per 2c, it can catch up after any delays on the critical-path due to resource conflicts. (a vptest uop stealing a cycle from one of the other two uops).
If you can change the loop to check any less often, or with fewer vector uops, that could be a win. But everything I'm thinking of is more vector uops (e.g. vpand instead of vptest and then vpor a couple of those results together before checking... Or vpxor to produce an all-zero vector when vptest would). Maybe if there was a vector XNOR or something, but there isn't.
To check what's actually happening, you could use perf counters to profile your current code, and see what uop throughput you're getting for a whole core (not just each logical thread separately). Or profile one logical thread and see if it's saturating about half of p015.

A partial answer:
Intel provides a tool named Intel Architecture Code Analyzer (described here) that does static analysis of code, showing (kind of) what ports are in use in a section of asm code.
Unfortunately:
v2.3 doesn't include the essential (and only) header file. You can find this file in v2.2.
v2.2 includes the header, but omits a python script (pt.py) for analyzing the output. This file is also not included in v2.3 (haven't found it yet).
One of the output formats from iaca is a .dot file, which is read by graphviz, but the Intel docs fail to describe which of the 38 executables in graphviz is used to display the output.
But perhaps most importantly (for my needs):
v2.3 (currently the most recent release) supports Skylake, but not Kaby Lake.
Given how the implementation details change between processors, that makes all the output suspect. Dates in the pdf file suggest that v2.3 was released July 2017, meaning that I'm probably going to have to wait a while for the next release.

Related

How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?

I'm on the hook to analyze the "timing channels" of some x86 binary code, and I'm posting this question to understand the bsf/bsr opcodes.
At a high level, these two opcodes can be modeled as a "loop" that counts the leading or trailing zeros of a given operand. The x86 manual has a good formalization of these opcodes, something like the following:
IF SRC = 0
  THEN
    ZF ← 1;
    DEST is undefined;
  ELSE
    ZF ← 0;
    temp ← OperandSize – 1;
    WHILE Bit(SRC, temp) = 0
    DO
      temp ← temp - 1;
    OD;
    DEST ← temp;
FI;
But to my surprise, bsf/bsr instructions seem to take a fixed number of CPU cycles. According to some documents I found here: https://gmplib.org/~tege/x86-timing.pdf, it seems that they always take 8 CPU cycles to finish.
So here are my questions:
I want to confirm that these instructions have fixed cycle counts. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind them. I cannot find corresponding specifications in Intel's official documents.
Then how is that possible? Apparently this is a "loop" of sorts, at least at a high level. What is the design decision behind this? Is it easier for CPU pipelines?
BSF/BSR performance is not data dependent on any modern CPUs. See https://agner.org/optimize/, https://uops.info/, or http://instlatx64.atw.hu/ for experimental timing results, as well as the https://gmplib.org/~tege/x86-timing.pdf you found.
On modern Intel, they decode to 1 uop with 3 cycle latency and 1/clock throughput, running only on port 1. Ryzen also runs them with 3c latency for BSF, 4c latency for BSR, but multiple uops. Earlier AMD is sometimes even slower.
(Prefer rep bsf aka tzcnt in code that might run on AMD CPUs, if you don't need the FLAGS difference between bsf and tzcnt for zero inputs. lzcnt and tzcnt are fast on AMD as well, like 1 cycle latency with 3/clock throughput for lzcnt on Zen 2 (https://uops.info/). Unfortunately lzcnt and bsr aren't compatible that way, so you can't use it in an "optimistic" forward-compatible way, you have to know which you're getting.)
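As an aside, here's what that "rep bsf aka tzcnt" aliasing looks like in practice (a sketch; the zero-input behaviour is as described in the next paragraphs):

    tzcnt   rax, rcx     ; encoded as an F3 (REP) prefix on the bsf opcode, so
                         ; CPUs without BMI1 ignore the prefix and run it as bsf.
                         ; On BMI1 CPUs: rcx == 0 gives rax = 64 and sets CF.
    bsf     rax, rcx     ; rcx == 0: sets ZF, leaves rax unmodified in practice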
Your "8 cycle" (latency and throughput) cost appears to be for 32-bit BSF on AMD K8, from Granlund's table that you linked. Agner Fog's table agrees, (and shows it decodes to 21 uops instead of having a dedicated bit-scan execution unit. But the microcoded implementation is presumably still branchless and not data-dependent). No clue why you picked that number; K8 doesn't have SMT / Hyperthreading so the opportunity for an ALU-timing side channel is much reduced.
Do note that they have an output dependency on the destination register, which they leave unmodified if the input was zero. AMD documents this behaviour, Intel implements it in hardware but documents it as an "undefined" result, so unfortunately compilers won't take advantage of it and human programmers maybe should be cautious. IDK if some ancient 32-bit only CPU had different behaviour, or if Intel is planning to ever change (doubtful!), but I wish Intel would document the behaviour at least for 64-bit mode (which excludes any older CPUs).
lzcnt/tzcnt and popcnt on Intel CPUs (but not AMD) have the same output dependency before Skylake and before Cannon Lake (respectively), even though architecturally the result is well-defined for all inputs. They all use the same execution unit. (How is POPCNT implemented in hardware?). AMD Bulldozer/Ryzen builds their bit-scan execution unit without the output dependency baked in, so BSF/BSR are slower than LZCNT/TZCNT (multiple uops to handle the input=0 case, and probably also setting ZF according to the input, not the result).
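A sketch of the usual workaround for that false output dependency on affected Intel CPUs (the well-known xor-zeroing trick, not something specific to any one compiler):

    xor     eax, eax     ; dependency-breaking idiom, handled at register rename,
                         ; so it doesn't even need an execution unit on SnB-family
    popcnt  eax, edi     ; the "read" of the old eax value can never stall now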
(Taking advantage of that with intrinsics isn't possible; not even with MSVC's _BitScanReverse64 which uses a by-reference output arg that you could set first. MSVC doesn't respect the previous value and assumes it's output-only. VS: unexpected optimization behavior with _BitScanReverse64 intrinsic)
The pseudocode in the manual is not the implementation
(i.e. it's not necessarily how hardware or microcode works).
It gives precisely the same result in all cases, so you can use it to understand exactly what will happen for any corner cases the text leaves you wondering about. That is all.
The point is to be simple and easy to understand, and that means modeling things in terms of simple 2-input operations which happen serially. C / Fortran / typical pseudocode doesn't have operators for many-input AND, OR, or XOR, but you can build that in hardware up to a point (limited by fan-in, the opposite of fan-out).
Integer addition can be modelled as bit-serial ripple carry, but that's not how it's implemented! Instead, we get single-cycle latency for 64-bit addition with far fewer than 64 gate delays using tricks like carry lookahead adders.
The actual implementation techniques used in Intel's bit-scan / popcnt execution unit are described in US Patent US8214414 B2.
Abstract
A merged datapath for PopCount and BitScan is described. A hardware circuit includes a compressor tree utilized for a PopCount function, which is reused by a BitScan function (e.g., bit scan forward (BSF) or bit scan reverse (BSR)).

Selector logic enables the compressor tree to operate on an input word for the PopCount or BitScan operation, based on a microprocessor instruction. The input word is encoded if a BitScan operation is selected.

The compressor tree receives the input word, operates on the bits as though all bits have same level of significance (e.g., for an N-bit input word, the input word is treated as N one-bit inputs). The result of the compressor tree circuit is a binary value representing a number related to the operation performed (the number of set bits for PopCount, or the bit position of the first set bit encountered by scanning the input word).
It's fairly safe to assume that Intel's actual silicon works similarly to this. Other Intel patents for things like out-of-order machinery (ROB, RS) do tend to match up with performance experiments we can perform.
AMD may do something different, but regardless we know from performance experiments that it's not data-dependent.
It's well known that fixed latency is a hugely beneficial thing for out-of-order scheduling, so it's very surprising when instructions don't have fixed latency. Sandybridge even went so far as to standardize uop latencies to simplify the scheduler and reduce the opportunities for write-back conflicts (e.g. a 3-cycle latency uop followed by a 2-cycle latency uop to the same port would produce 2 results in the same cycle). This meant making complex-LEA (with all 3 components: [disp + base + idx*scale]) take 3 cycles instead of just 2 for the 2 additions like on previous CPUs. There are no 2-cycle latency uops on Sandybridge-family. (There are some 2-cycle latency instructions, because they decode to 2 uops with 1c latency each. The scheduler schedules uops, not instructions.)
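For example (latencies per Agner Fog's Sandybridge-family tables):

    lea  eax, [rdi + rsi]        ; 2 components: 1 uop, 1 cycle latency
    lea  eax, [rdi + rsi*4]      ; scaled index but still 2 components: 1 cycle
    lea  eax, [rdi + rsi*4 + 8]  ; 3 components (base + index + disp): 3 cycles,
                                 ; and it can only run on port 1 on SnB-family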
One of the few exceptions to the rule of fixed latency for ALU uops is division / sqrt, which uses a not-fully-pipelined execution unit. Division is inherently iterative, unlike multiplication where you can make wide hardware that does the partial products and partial additions in parallel.
On Intel CPUs, variable-latency for L1d cache access can produce replays of dependent uops if the data wasn't ready when the scheduler optimistically hoped it would be.
Is there a penalty when base+offset is in a different page than the base?
Why does the number of uops per iteration increase with the stride of streaming loads?
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
The 80x86 manual has a good description of the expected behavior, but that has nothing to do with how it's actually implemented in silicon in any model from any manufacturer.
Let's say that there's been 50 different CPU designs from Intel, 25 CPU designs from AMD, then 25 more from other manufacturers (VIA, Cyrix, SiS/Vortex, NSC, ...). Out of those 100 different CPU designs, maybe there's 20 completely different ways that BSF has been implemented, and maybe 10 of them have fixed timing, 5 have timing that depends on every bit of the source operand, and 5 depend on groups of bits of the source operand (e.g. maybe like "if highest 32 bits of 64-bit operand are zeros { switch to 32-bit logic that's 2 cycles faster }").
I want to confirm that these instructions have fixed cycle counts. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind them. I cannot find corresponding specifications in Intel's official documents.
You can't. More specifically, you can test or research existing CPUs, but that's a waste of time because next week Intel (or AMD or VIA or someone else) can release a new CPU that has completely different timing.
As soon as you rely on "measured from existing CPUs" you're doing it wrong. You have to rely on "architectural guarantees" that apply to all future CPUs. There is no "architectural guarantee". You have to assume that there may be a timing side-channel (even if there isn't for current CPUs)
Then how is that possible? Apparently this is a "loop" of sorts, at least at a high level. What is the design decision behind this? Is it easier for CPU pipelines?
Instead of doing a 64-bit BSF, why not split it into a pair of 32-bit pieces and do them in parallel, then merge the results? Why not split it into eight 8-bit pieces? Why not use a table lookup for each 8-bit piece?
The answers posted have explained well that the implementation is different from the pseudocode. But if you are still curious why the latency is fixed and not data-dependent, or why it doesn't use any loops for that matter, you need to look at the electronic side of things.
One way you could implement this feature in hardware is by using a Priority encoder.
A priority encoder accepts n input lines that can be on or off (0 or 1) and gives out the index of the highest-priority line that is on. Below is a table from the linked Wikipedia article, modified for a most-significant-set-bit function.
input | output | index of first set bit
0000  |   xx   | undefined
0001  |   00   | 0
001x  |   01   | 1
01xx  |   10   | 2
1xxx  |   11   | 3

x denotes that the bit value does not matter and can be anything
If you look at the circuit diagram in the article, there are no loops of any kind; it is all parallel.

What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular architecture, taking into account instruction reordering, superscalarity, latencies, CPIs, etc.
What rules must be followed to achieve this?
I think I've got some preliminary rules figured out, but I haven't been able to find any references on breaking down any example code to this level of detail, so I've had to take some guesses. (For example, the Intel optimization manual barely even mentions instruction reordering.)
At minimum, I'm looking for (1) confirmation that each rule is correct or else a correct statement of each rule, and (2) a list of any rules that I may have forgotten.
As many instructions as possible are issued each cycle, starting in-order from the current cycle and potentially as far ahead as the reorder buffer size.
An instruction can be issued on a given cycle if:
No instructions that affect its operands are still being executed. And:
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering). And:
There is a functional unit available for that instruction on that cycle. Every (?) functional unit is pipelined, meaning it can accept 1 new instruction per cycle, and the number of total functional units is 1/CPI, for the CPI of a given function class (nebulous here: presumably e.g. addps and subps use the same functional unit? How do I determine this?). And:
Fewer than the superscalar width (typically 4) number of instructions have already been issued this cycle.
If no instructions can be issued, the processor simply doesn't issue any—a condition called a "stall".
As an example, consider the following example code (which computes a cross-product):
shufps xmm3, xmm2, 210
shufps xmm0, xmm1, 201
shufps xmm2, xmm2, 201
mulps xmm0, xmm3
shufps xmm1, xmm1, 210
mulps xmm1, xmm2
subps xmm0, xmm1
My attempt to predict the latency for Haswell looks something like this:
; `mulps` Haswell latency=5, CPI=0.5
; `shufps` Haswell latency=1, CPI=1
; `subps` Haswell latency=3, CPI=1
shufps xmm3, xmm2, 210 ; cycle 1
shufps xmm0, xmm1, 201 ; cycle 2
shufps xmm2, xmm2, 201 ; cycle 3
mulps xmm0, xmm3 ; (superscalar execution)
shufps xmm1, xmm1, 210 ; cycle 4
mulps xmm1, xmm2 ; cycle 5
; cycle 6 (stall `xmm0` and `xmm1`)
; cycle 7 (stall `xmm1`)
; cycle 8 (stall `xmm1`)
subps xmm0, xmm1 ; cycle 9
; cycle 10 (stall `xmm0`)
TL:DR: look for dependency chains, especially loop-carried ones. For a long-running loop, see which of latency, front-end throughput, or back-end port contention/throughput is the worst bottleneck. That's how many cycles your loop probably takes per iteration, on average, if there are no cache misses or branch mispredicts.
Latency bounds and throughput bounds for processors for operations that must occur in sequence is a good example of analyzing loop-carried dependency chains in a specific loop with two dep chains, one pulling values from the other.
Related: How many CPU cycles are needed for each assembly instruction? is a good introduction to throughput vs. latency on a per-instruction basis, and what that means for sequences of multiple instructions. See also Assembly - How to score a CPU instruction by latency and throughput for how to measure a single instruction.
This is called static (performance) analysis. Wikipedia says (https://en.wikipedia.org/wiki/List_of_performance_analysis_tools) that AMD's AMD CodeXL has a "static kernel analyzer" (i.e. for computational kernels, aka loops). I've never tried it.
Intel also has a free tool for analyzing how loops will go through the pipeline in Sandybridge-family CPUs: What is IACA and how do I use it?
IACA is not bad, but has bugs (e.g. wrong data for shld on Sandybridge, and last I checked, it doesn't know that Haswell/Skylake can keep indexed addressing modes micro-fused for some instructions. But maybe that will change now that Intel's added details on that to their optimization manual.) IACA is also unhelpful for counting front-end uops to see how close to a bottleneck you are (it likes to only give you unfused-domain uop counts).
Static analysis is often pretty good, but definitely check by profiling with performance counters. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of profiling a simple loop to investigate a microarchitectural feature.
Essential reading:
Agner Fog's microarch guide (chapter 2: Out of order exec) explains some of the basics of dependency chains and out-of-order execution. His "Optimizing Assembly" guide has more good introductory and advanced performance stuff.
The later chapters of his microarch guide cover the details of the pipelines in CPUs like Nehalem, Sandybridge, Haswell, K8/K10, Bulldozer, and Ryzen. (And Atom / Silvermont / Jaguar).
Agner Fog's instruction tables (spreadsheet or PDF) are also normally the best source for instruction latency / throughput / execution-port breakdowns.
David Kanter's microarch analysis docs are very good, with diagrams. e.g. https://www.realworldtech.com/sandy-bridge/, https://www.realworldtech.com/haswell-cpu/, and https://www.realworldtech.com/bulldozer/.
See also other performance links in the x86 tag wiki.
I also took a stab at explaining how a CPU core finds and exploits instruction-level parallelism in this answer, but I think you've already grasped those basics as far as it's relevant for tuning software. I did mention how SMT (Hyperthreading) works as a way to expose more ILP to a single CPU core, though.
In Intel terminology:
"issue" means to send a uop into the out-of-order part of the core; along with register-renaming, this is the last step in the front-end. The issue/rename stage is often the narrowest point in the pipeline, e.g. 4-wide on Intel since Core2. (With later uarches like Haswell and especially Skylake often actually coming very close to that in some real code, thanks to SKL's improved decoders and uop-cache bandwidth, as well as back-end and cache bandwidth improvements.) This is fused-domain uops: micro-fusion lets you send 2 uops through the front-end and only take up one ROB entry. (I was able to construct a loop on Skylake that sustains 7 unfused-domain uops per clock). See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: out-of-order window size.
"dispatch" means the scheduler sends a uop to an execution port. This happens as soon as all the inputs are ready, and the relevant execution port is available. How are x86 uops scheduled, exactly?. Scheduling happens in the "unfused" domain; micro-fused uops are tracked separately in the OoO scheduler (aka Reservation Station, RS).
A lot of other computer-architecture literature uses these terms in the opposite sense, but this is the terminology you will find in Intel's optimization manual, and the names of hardware performance counters like uops_issued.any or uops_dispatched_port.port_5.
exactly how long arbitrary arithmetical x86-64 assembly code will take
It depends on the surrounding code as well, because of OoO exec
Your final subps result doesn't have to be ready before the CPU starts running later instructions. Latency only matters for later instructions that need that value as an input, not for integer looping and whatnot.
Sometimes throughput is what matters, and out-of-order exec can hide the latency of multiple independent short dependency chains. (e.g. if you're doing the same thing to every element of a big array of multiple vectors, multiple cross products can be in flight at once.) You'll end up with multiple iterations in flight at once, even though in program order you finish all of one iteration before doing any of the next. (Software pipelining can help for high-latency loop bodies if OoO exec has a hard time doing all the reordering in HW.)
There are three major dimensions to analyze for a short block
You can approximately characterize a short block of non-branching code in terms of these three factors. Usually only one of them is the bottleneck for a given use-case. Often you're looking at a block that you will use as part of a loop, not as the whole loop body, but OoO exec normally works well enough that you can just add up these numbers for a couple different blocks, if they're not so long that OoO window size prevents finding all the ILP.
latency from each input to the output(s). Look at which instructions are on the dependency chain from each input to each output. e.g. one choice might need one input to be ready sooner.
total uop count (for front-end throughput bottlenecks), fused-domain on Intel CPUs. e.g. Core2 and later can in theory issue/rename 4 fused-domain uops per clock into the out-of-order scheduler/ROB. Sandybridge-family can often achieve that in practice with the uop cache and loop buffer, especially Skylake with its improved decoders and uop-cache throughput.
uop count for each back-end execution port (unfused domain). e.g. shuffle-heavy code will often bottleneck on port 5 on Intel CPUs. Intel usually only publishes throughput numbers, not port breakdowns, which is why you have to look at Agner Fog's tables (or IACA output) to do anything meaningful if you're not just repeating the same instruction a zillion times.
Generally you can assume best-case scheduling/distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen some of the time. (How are x86 uops scheduled, exactly?)
Looking at CPI is not sufficient; two CPI=1 instructions might or might not compete for the same execution port. If they don't, they can execute in parallel. e.g. Haswell can only run psadbw on port 0 (5c latency, 1c throughput, i.e. CPI=1) but it's a single uop so a mix of 1 psadbw + 3 add instructions could sustain 4 instructions per clock. There are vector ALUs on 3 different ports in Intel CPUs, with some operations replicated on all 3 (e.g. booleans) and some only on one port (e.g. shifts before Skylake).
Sometimes you can come up with a couple different strategies, one maybe lower latency but costing more uops. A classic example is multiplying by constants like imul eax, ecx, 10 (1 uop, 3c latency on Intel) vs. lea eax, [rcx + rcx*4] / add eax,eax (2 uops, 2c latency). Modern compilers tend to choose 2 LEA vs. 1 IMUL, although clang up to 3.7 favoured IMUL unless it could get the job done with only a single other instruction.
See What is the efficient way to count set bits at a position or lower? for an example of static analysis for a few different ways to implement a function.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (which ended up being way more detailed than you'd guess from the question title) for another summary of static analysis, and some neat stuff about unrolling with multiple accumulators for a reduction.
Every (?) functional unit is pipelined
The divider is pipelined in recent CPUs, but not fully pipelined. (FP divide is single-uop, though, so if you do one divps mixed in with dozens of mulps / addps, it can have negligible throughput impact if latency doesn't matter: Floating point division vs floating point multiplication. rcpps + a Newton iteration is worse throughput and about the same latency.)
Everything else is fully pipelined on mainstream Intel CPUs; no single-uop instruction has a multi-cycle (reciprocal) throughput. (Variable-count integer shifts like shl eax, cl have lower-than-expected throughput for their 3 uops, because they create a dependency through the flag-merging uops. But if you break that dependency through FLAGS with an add or something, you can get better throughput and latency.)
On AMD before Ryzen, the integer multiplier is also only partially pipelined. e.g. Bulldozer's imul ecx, edx is only 1 uop, but with 4c latency, 2c throughput.
Xeon Phi (KNL) also has some not-fully-pipelined shuffle instructions, but it tends to bottleneck on the front-end (instruction decode), not the back-end, and does have a small buffer + OoO exec capability to hide back-end bubbles.
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering)
No.
Maybe you read that for Silvermont, which doesn't do OoO exec for FP/SIMD, only integer (with a small ~20 uop window). Maybe some ARM chips are like that, too, with simpler schedulers for NEON? I don't know much about ARM uarch details.
The mainstream big-core microarchitectures like P6 / SnB-family, and all AMD OoO chips, do OoO exec for SIMD and FP instructions the same as for integer. AMD CPUs use a separate scheduler, but Intel uses a unified scheduler so its full size can be applied to finding ILP in integer or FP code, whichever is currently running.
Even the silvermont-based Knight's Landing (in Xeon Phi) does OoO exec for SIMD.
x86 is generally not very sensitive to instruction ordering, but uop scheduling doesn't do critical-path analysis. So it could sometimes help to put instructions on the critical path first, so they aren't stuck waiting with their inputs ready while other instructions run on that port, leading to a bigger stall later when we get to instructions that need the result of the critical path. (i.e. that's why it is the critical path.)
My attempt to predict the latency for Haswell looks something like this:
Yup, that looks right. shufps runs on port 5, addps runs on p1, mulps runs on p0 or p1. Skylake drops the dedicated FP-add unit and runs SIMD FP add/mul/FMA on the FMA units on p0/p1, all with 4c latency (up/down from 3/5/5 in Haswell, or 3/3/5 in Broadwell).
This is a good example of why keeping a whole XYZ direction vector in a SIMD vector usually sucks. Keeping an array of X, an array of Y, and an array of Z, would let you do 4 cross products in parallel without any shuffles.
The SSE tag wiki has a link to these slides: SIMD at Insomniac Games (GDC 2015) which covers that array-of-structs vs. struct-of-arrays issues for 3D vectors, and why it's often a mistake to always try to SIMD a single operation instead of using SIMD to do multiple operations in parallel.
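For example, with a hypothetical struct-of-arrays layout where ax/ay/az and bx/by/bz each hold one component of 4 different vectors, the x components of 4 cross products take no shuffles at all:

    movaps  xmm0, [ay]        ; y components of 4 "a" vectors (ay etc. are
    mulps   xmm0, [bz]        ;   placeholder labels/pointers):  ay*bz
    movaps  xmm1, [az]
    mulps   xmm1, [by]        ; az*by
    subps   xmm0, xmm1        ; cx = ay*bz - az*by, for 4 cross products at once
    ; repeat the same pattern for cy = az*bx - ax*bz and cz = ax*by - ay*bx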

Why are loops always compiled into "do...while" style (tail jump)?

When trying to understand assembly (with compiler optimization on), I see this behavior:
A very basic loop like this
outside_loop;
while (condition) {
    statements;
}
Is often compiled into (pseudocode)
; outside_loop
    jmp loop_condition    ; unconditional
loop_start:
    loop_statements
loop_condition:
    condition_check
    jmp_if_true loop_start
; outside_loop
However, if the optimization is not turned on, it compiles to normally understandable code:
loop_condition:
    condition_check
    jmp_if_false loop_end
    loop_statements
    jmp loop_condition    ; unconditional
loop_end:
According to my understanding, the compiled code is better resembled by this:
goto condition;
do {
    statements;
condition:
}
while (condition_check);
I can't see a huge performance boost or code readability boost, so why is this often the case? Is there a name for this loop style, for example "trailing condition check"?
Related: asm loop basics: While, Do While, For loops in Assembly Language (emu8086)
Terminology: Wikipedia says "loop inversion" is the name for turning a while(x) into if(x) do{}while(x), putting the condition at the bottom of the loop where it belongs.
Fewer instructions / uops inside the loop = better. Structuring the code outside the loop to achieve this is very often a good idea.
Sometimes this requires "loop rotation" (peeling part of the first iteration so the actual loop body has the conditional branch at the bottom). So you do some of the first iteration and maybe skip the loop entirely, then fall into the loop. Sometimes you also need some code after the loop to finish the last iteration.
Sometimes loop rotation is extra useful if the last iteration is a special case, e.g. a store you need to skip. This lets you implement a while(1) {... ; if(x)break; ...; } loop as a do-while, or put one of the conditions of a multiple-condition loop at the bottom.
Some of these optimizations are related to or enable software pipelining, e.g. loading something for the next iteration. (OoO exec on x86 makes SW pipelining not very important these days but it's still useful for in-order cores like many ARM. And unrolling with multiple accumulators is still very valuable for hiding loop-carried FP latency in a reduction loop like a dot product or sum of an array.)
do{}while() is the canonical / idiomatic structure for loops in asm on all architectures, get used to it. IDK if there's a name for it; I would say such a loop has a "do while structure". If you want names, you could call the while() structure "crappy unoptimized code" or "written by a novice". :P Loop-branch at the bottom is universal, and not even worth mentioning as a Loop Optimization. You always do that.
This pattern is so widely used that on CPUs that use static branch prediction for branches without an entry in the branch-predictor caches, unknown forward conditional branches are predicted not-taken, unknown backwards branches are predicted taken (because they're probably loop branches). See Static branch prediction on newer Intel processors on Matt Godbolt's blog, and Agner Fog's branch-prediction chapter at the start of his microarch PDF.
This answer ended up using x86 examples for everything, but much of this applies across the board for all architectures. I wouldn't be surprised if other superscalar / out-of-order implementations (like some ARM, or POWER) also have limited branch-instruction throughput whether they're taken or not. But fewer instructions inside the loop is nearly universal when all you have is a conditional branch at the bottom, and no unconditional branch.
If the loop might need to run zero times, compilers more often put a test-and-branch outside the loop to skip it, instead of jumping to the loop condition at the bottom. (i.e. if the compiler can't prove the loop condition is always true on the first iteration).
BTW, this paper calls transforming while() to if(){ do{}while; } an "inversion", but loop inversion usually means inverting a nested loop. (e.g. if the source loops over a row-major multi-dimensional array in the wrong order, a clever compiler could change for(i) for(j) a[j][i]++; into for(j) for(i) a[j][i]++; if it can prove it's correct.) But I guess you can look at the if() as a zero-or-one iteration loop. Fun fact: compiler devs teaching their compilers how to invert a loop (to allow auto-vectorization) for a (very) specific case is why SPECint2006's libquantum benchmark is "broken". Most compilers can't invert loops in the general case, just ones that look almost exactly like the one in SPECint2006...
You can help the compiler make more compact asm (fewer instructions outside the loop) by writing do{}while() loops in C when you know the caller isn't allowed to pass size=0 or whatever else guarantees a loop runs at least once.
(Actually 0 or negative for signed loop bounds. Signed vs. unsigned loop counters is a tricky optimization issue, especially if you choose a narrower type than pointers; check your compiler's asm output to make sure it isn't sign-extending a narrow loop counter inside the loop every time you use it as an array index. But note that signed can actually be helpful, because the compiler can assume that i++ <= bound will eventually become false, because signed overflow is UB but unsigned isn't. So with unsigned, while(i++ <= bound) is infinite if bound = UINT_MAX.) I don't have a blanket recommendation for when to use signed vs. unsigned; size_t is often a good choice for looping over arrays, but if you want to avoid the x86-64 REX prefixes in the loop overhead (for a trivial saving in code size) while also convincing the compiler not to waste any instructions zero- or sign-extending, it can be tricky.
I can't see a huge performance boost
Here's an example where that optimization will give speedup of 2x on Intel CPUs before Haswell, because P6 and SnB/IvB can only run branches on port 5, including not-taken conditional branches.
Required background knowledge for this static performance analysis: Agner Fog's microarch guide (read the Sandybridge section). Also read his Optimizing Assembly guide, it's excellent. (Occasionally outdated in places, though.) See also other x86 performance links in the x86 tag wiki. See also Can x86's MOV really be "free"? Why can't I reproduce this at all? for some static analysis backed up by experiments with perf counters, and some explanation of fused vs. unfused domain uops.
You could also use Intel's IACA software (Intel Architecture Code Analyzer) to do static analysis on these loops.
; sum(int []) using SSE2 PADDD (dword elements)
; edi = pointer, esi = end_pointer.
; scalar cleanup / unaligned handling / horizontal sum of XMM0 not shown.
; NASM syntax
ALIGN 16                        ; not required for max performance for tiny loops on most CPUs
.looptop:                       ; while (edi < end_pointer) {
    cmp     edi, esi            ; 32-bit code so this can macro-fuse on Core2
    jae     .done               ; 1 uop, port5 only (macro-fused with cmp)
    paddd   xmm0, [edi]         ; 1 micro-fused uop, p1/p5 + a load port
    add     edi, 16             ; 1 uop, p015
    jmp     .looptop            ; 1 uop, p5 only
                                ; Sandybridge/Ivybridge ports each uop can use
.done:                          ; }
This is 4 total fused-domain uops (with macro-fusion of the cmp/jae), so it can issue from the front-end into the out-of-order core at one iteration per clock. But in the unfused domain there are 4 ALU uops and Intel pre-Haswell only has 3 ALU ports.
More importantly, port5 pressure is the bottleneck: This loop can execute at only one iteration per 2 cycles because cmp/jae and jmp both need to run on port5. Other uops stealing port5 could reduce practical throughput somewhat below that.
Writing the loop idiomatically for asm, we get:
ALIGN 16
.looptop:                       ; do {
    paddd   xmm0, [edi]         ; 1 micro-fused uop, p1/p5 + a load port
    add     edi, 16             ; 1 uop, p015
    cmp     edi, esi            ; macro-fuses with jb into 1 uop, port5 only
    jb      .looptop            ; } while(edi < end_pointer);
Notice right away, independent of everything else, that this is one fewer instruction in the loop. This loop structure is at least slightly better on everything from simple non-pipelined 8086 through classic RISC (like early MIPS), especially for long-running loops (assuming they don't bottleneck on memory bandwidth).
Core2 and later should run this at one iteration per clock, twice as fast as the while(){}-structured loop, if memory isn't a bottleneck (i.e. assuming L1D hits, or at least L2 actually; this is only SSE2 16-bytes per clock).
This is only 3 fused-domain uops, so can issue at better than one per clock on anything since Core2, or just one per clock if issue groups always end with a taken branch.
But the important part is that port5 pressure is vastly reduced: only cmp/jb needs it. The other uops will probably be scheduled to port5 some of the time and steal cycles from loop-branch throughput, but this will be a few % instead of a factor of 2. See How are x86 uops scheduled, exactly?.
Most CPUs that normally have a taken-branch throughput of one per 2 cycles can still execute tiny loops at 1 per clock. There are some exceptions, though. (I forget which CPUs can't run tight loops at 1 per clock; maybe Bulldozer-family? Or maybe just some low-power CPUs like VIA Nano.) Sandybridge and Core2 can definitely run tight loops at one per clock. They even have loop buffers; Core2 has a loop buffer after instruction-length decode but before regular decode. Nehalem and later recycle uops in the queue that feeds the issue/rename stage. (Except on Skylake with microcode updates; Intel had to disable the loop buffer because of a partial-register merging bug.)
However, there is a loop-carried dependency chain on xmm0: Intel CPUs have 1-cycle latency paddd, so we're right up against that bottleneck, too. add edi, 16 is also 1 cycle latency. On Bulldozer-family, even integer vector ops have 2c latency, so that would bottleneck the loop at 2c per iteration. (AMD since K8 and Intel since SnB can run two loads per clock, so we need to unroll anyway for max throughput.) With floating point, you definitely want to unroll with multiple accumulators. Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators).
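A sketch of what that unrolling looks like for this loop (assuming the size is a multiple of 32 bytes; cleanup not shown):

ALIGN 16
.looptop:                       ; do {
    paddd   xmm0, [edi]         ; accumulator 1
    paddd   xmm1, [edi + 16]    ; accumulator 2: an independent dep chain
    add     edi, 32
    cmp     edi, esi
    jb      .looptop            ; } while(edi < end_pointer);
    paddd   xmm0, xmm1          ; combine the accumulators after the loop

With 1-cycle paddd this mainly helps feed both load ports; with FP adds (3 to 5 cycle latency) two or more accumulators are what make the difference between being latency-bound and throughput-bound.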
If I'd used an indexed addressing mode, like paddd xmm0, [edi + eax], I could have used sub eax, 16 / jnc at the loop condition. SUB/JNC can macro-fuse on Sandybridge-family, but the indexed load would un-laminate on SnB/IvB (but stay fused on Haswell and later, unless you use the AVX form).
; index relative to the end of the array, with an index counting up towards zero
    add     rdi, rsi            ; rdi = end_pointer
    xor     eax, eax
    sub     rax, rsi            ; rax = -length, so [rdi+rax] = first element
.looptop:                       ; do {
    paddd   xmm0, [rdi + rax]
    add     rax, 16
    jl      .looptop            ; } while((idx += 16) < 0);  // or JNC still works
(It's usually better to unroll some to hide the overhead of pointer increments instead of using indexed addressing modes, especially for stores, partly because indexed stores can't use the port7 store AGU on Haswell+.)
On Core2/Nehalem add/jl don't macro-fuse, so this is 3 fused-domain uops even in 64-bit mode, without depending on macro-fusion. Same for AMD K8/K10/Bulldozer-family/Ryzen: no fusion of the loop condition, but PADDD with a memory operand is 1 m-op / uop.
On SnB, paddd un-laminates from the load, but add/jl macro-fuse, so again 3 fused-domain uops. (But in the unfused domain, only 2 ALU uops + 1 load, so probably fewer resource conflicts reducing throughput of the loop.)
On HSW and later, this is 2 fused-domain uops because an indexed load can stay micro-fused with PADDD, and add/jl macro-fuses. (Predicted-taken branches run on port 6, so there are never resource conflicts.)
Of course, the loops can only run at best 1 iteration per clock because of taken branch throughput limits even for tiny loops. This indexing trick is potentially useful if you had something else to do inside the loop, too.
But all of these loops had no unrolling
Yes, that exaggerates the effect of loop overhead. But gcc doesn't unroll by default even at -O3 (unless it decides to fully unroll). It only unrolls with profile-guided optimization to let it know which loops are hot. (-fprofile-use). You can enable -funroll-all-loops, but I'd only recommend doing that on a per-file basis for a compilation unit you know has one of your hot loops that needs it. Or maybe even on a per-function basis with an __attribute__, if there is one for optimization options like that.
So this is highly relevant for compiler-generated code. (But clang does default to unrolling tiny loops by 4, or small loops by 2, and extremely importantly, using multiple accumulators to hide latency.)
Benefits with very low iteration count:
Consider what happens when the loop body should run once or twice: There's a lot more jumping with anything other than do{}while.
For do{}while, execution is a straight-line with no taken branches and one not-taken branch at the bottom. This is excellent.
For an if() { do{}while; } that might run the loop zero times, it's two not-taken branches. That's still very good. (Not-taken is slightly cheaper for the front-end than taken when both are correctly predicted).
For a jmp-to-the-bottom jmp; do{}while(), it's one taken unconditional branch, one taken loop condition, and then the loop branch is not-taken. This is kinda clunky but modern branch predictors are very good...
For a while(){} structure, this is one not-taken loop exit, one taken jmp at the bottom, then one taken loop-exit branch at the top.
With more iterations, each loop structure does one more taken branch. while(){} also does one more not-taken branch per iteration, so it quickly becomes obviously worse.
The latter two loop structures have more jumping around for small trip counts.
Jumping to the bottom also has a disadvantage for non-tiny loops that the bottom of the loop might be cold in L1I cache if it hasn't run for a while. Code fetch / prefetch is good at bringing code to the front-end in a straight line, but if prediction didn't predict the branch early enough, you might have a code miss for the jump to the bottom. Also, parallel decode will probably have (or could have) decoded some of the top of the loop while decoding the jmp to the bottom.
Conditionally jumping over a do{}while loop avoids all that: you only jump forwards into code that hasn't been run yet in cases where the code you're jumping over shouldn't run at all. It often predicts very well because a lot of code never actually takes 0 trips through the loop. (i.e. it could have been a do{}while, the compiler just didn't manage to prove it.)
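In asm, that conditional-skip shape looks something like this (hypothetical trip count in ecx):

    test    ecx, ecx
    jz      .skip               ; if (count == 0) skip the whole loop
.loop:                          ; do {
    ; ... loop body ...
    dec     ecx
    jnz     .loop               ; } while (--count);
.skip: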
Jumping to the bottom also means the core can't start working on the real loop body until after the front-end chases two taken branches.
There are cases with complicated loop conditions where it's easiest to write it this way, and the performance impact is small, but compilers often avoid it.
Loops with multiple exit conditions:
Consider a memchr loop, or a strchr loop: they have to stop at the end of the buffer (based on a count) or the end of an implicit-length string (0 byte). But they also have to break out of the loop if they find a match before the end.
So you'll often see a structure like
do {
    if () break;
    blah blah;
} while(condition);
Or just two conditions near the bottom. Ideally you can test multiple logical conditions with the same actual instruction (e.g. 5 < x && x < 25 using sub eax, 6 / cmp eax, 18 / ja .outside_range, the unsigned-compare trick for range checking, or combine that with an OR to check for alphabetic characters of either case in 4 instructions) but sometimes you can't and just need to use an if()break style loop-exit branch as well as a normal backwards taken branch.
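A minimal sketch of that multi-exit shape in asm (a hypothetical memchr-style search; rdi = pointer, rcx = count assumed nonzero, al = byte to find):

.search:                        ; do {
    cmp     al, [rdi]
    je      .found              ;     if (*p == c) break;   // early-out exit
    inc     rdi
    dec     rcx
    jnz     .search             ; } while (--count);        // normal loop branch
    ; fell through the loop: not found
.found: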
Further reading:
Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” for good ways to look at compiler output (e.g. what kind of inputs give interesting output, and a primer on reading x86 asm for beginners). related: How to remove "noise" from GCC/clang assembly output?
Modern Microprocessors: A 90-Minute Guide! A detailed look at superscalar pipelined CPUs, mostly architecture-neutral. Very good. Explains instruction-level parallelism and stuff like that.
Agner Fog's x86 optimization guide and microarch pdf. This will take you from being able to write (or understand) correct x86 asm to being able to write efficient asm (or see what the compiler should have done).
other links in the x86 tag wiki, including Intel's optimization manuals. Also several of my answers (linked in the tag wiki) have things that Agner missed in his testing on more recent microarchitectures (like un-lamination of micro-fused indexed addressing modes on SnB, and partial register stuff on Haswell+).
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators): how to use multiple accumulators to hide latency of a reduction loop (like an FP dot product).
Lecture 7: Loop Transformations (also on archive.org). Lots of cool stuff that compilers do to loops, using C syntax to describe the asm.
https://en.wikipedia.org/wiki/Loop_optimization
Sort of off topic:
Memory bandwidth is almost always important, but it's not widely known that a single core on most modern x86 CPUs can't saturate DRAM, and not even close on many-core Xeons where single-threaded bandwidth is worse than on a quad-core with dual channel memory controllers.
How much of ‘What Every Programmer Should Know About Memory’ is still valid? (my answer has commentary on what's changed and what's still relevant in Ulrich Drepper's well-known excellent article.)

Deterministic execution time on x86-64

Is there an x64 instruction(s) that takes a fixed amount of time, regardless of the micro-architectural state such as caches, branch predictors, etc.?
For instance, if a hypothetical add or increment instruction always takes n cycles, then I can implement a timer in my program by performing that add instruction multiple times. Perhaps an increment instruction with register operands would work, but it's not clear to me whether Intel's spec guarantees that it would take a deterministic number of cycles. Note that I am not interested in the current time, but only in a primitive / instruction sequence that takes a fixed number of cycles.
Assume that I have a way to force atomic execution i.e. no context switches during timer's execution i.e. only my program gets to run.
On a related note, I also cannot use system services to keep track of time, because I am working in a setting where my program is a user-level program running on an untrusted OS.
The x86 ISA documents don't guarantee anything about what takes a certain amount of cycles. The ISA allows things like Transmeta's Crusoe that JIT-compiled x86 instructions to an internal VLIW instruction set. It could conceivably do optimizations between adjacent instructions.
The best you can do is write something that will work on as many known microarchitectures as possible. I'm not aware of any x86-64 microarchitectures that are "weird" like Transmeta, only the usual superscalar decode-to-uops designs like Intel and AMD use.
Simple integer ALU instructions like ADD are almost all 1c latency, and tiny loops that don't touch memory are very predictable. If they run a lot of iterations, they're almost totally unaffected by the impact of surrounding code on the out-of-order core, and they recover very quickly from disruptions like timer interrupts.
On nearly every Intel microarchitecture, this loop will run at one iteration per clock:
    mov     ecx, 1234567        ; or use a 64-bit register for higher counts.
ALIGN 16
.loop:
    sub     ecx, 1              ; not dec because of Pentium 4.
    jnz     .loop
Agner Fog's microarch guide and instruction tables say that VIA Nano3000 has a taken-branch throughput of one per 3 cycles, so this loop would only run at one iteration per 3 clocks there. AMD Bulldozer-family and Jaguar similarly have a max throughput of one taken JCC per 2 clocks.
See also other performance links in the x86 tag wiki.
If you want a more power-efficient loop, you could use PAUSE in the loop, but it waits ~100 cycles on Skylake, up from ~5 cycles on previous microarchitectures. (You can make cycle-accurate predictions for more complicated loops that don't touch memory, but that depends on microarchitectural details.)
You could make a more reliable loop, one that's less likely to have different bottlenecks on different CPUs, by making a longer dependency chain within each iteration. Since each instruction depends on the previous one, it can still only run at one instruction per cycle (not counting the branch), which drastically reduces the number of branches per cycle.
; one add/sub per clock, limited by latency
; should run one iteration per 6 cycles on every CPU listed in Agner Fog's tables,
; and should be the same on all future CPUs unless they do magic inter-instruction optimizations.
; Or it could be slower on CPUs that always have a bubble on taken branches, but it seems unlikely anyone would design one.
ALIGN 16
.loop:
    add     ecx, 1
    sub     ecx, 1              ; net result ecx+0
    add     ecx, 1
    sub     ecx, 1              ; net result ecx+0
    add     ecx, 1
    sub     ecx, 2              ; net result ecx-1
    jnz     .loop
Unrolling like this ensures that front-end effects are not a bottleneck. It gives the frontend decoders plenty of time to queue up the 6 add/sub insns and the jcc before the next branch.
Using add/sub instead of dec/inc avoids a partial-flag false dependency on Pentium 4. (Although I don't think that would be an issue anyway.)
Pentium4's double-clocked ALUs can each run two ADDs per clock, but the latency is still one cycle. i.e. apparently it can't forward a result internally to chew through this dependency chain twice as fast as any other CPU.
And yes, Prescott P4 is an x86-64 CPU, so we can't quite ignore P4 if we need a general purpose answer.

Understanding CPU pipeline stages vs. Instruction throughput

I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?
Besides the obvious of "different instructions require a different amount of work to complete", hear me out...
Consider an i7 with an approx 14 stage pipeline. That takes 14 clock cycles to complete a run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.
An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more than the stage count) and a throughput of one per 8 cycles (on Ivy Bridge).
Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.
I know about the multiple execution units. I don't understand how the lengths of instructions, in terms of latency and throughput, relate to the number of pipeline stages.
I think what's missing from the existing answers is the existence of "bypass" or "forwarding" datapaths. For simplicity, let's stick with the MIPS 5-stage pipeline. Every instruction takes 5 cycles from birth to death -- fetch, decode, execute, memory, writeback. So that's how long it takes to process a single instruction.
What you want to know is how long it takes for one instruction to hand off its result to a dependent instruction. Say you have two consecutive ADD instructions, and there's a dependency through R1:
ADD R1, R2, R3
ADD R4, R1, R5
If there were no forwarding paths, we'd have to stall the second instruction for multiple cycles (2 or 3 depending on how writeback works), so that the first one could store its result into the register file before the second one reads that as input in the decode stage.
However, there are forwarding paths that allow valid results (but ones that are not yet written back) to be picked out of the pipeline. So let's say the first ADD gets all its inputs from the register file in decode. The second one will get R5 out of the register file, but it'll get R1 out of the pipeline register following the execute stage. In other words, we're routing the output of the ALU back into its input one cycle later.
Out-of-order processors make ubiquitous use of forwarding. They will have lots of different functional units that have lots of different latencies. For instance, ADD and AND will typically take one cycle (TO DO THE MATH, putting aside all of the pipeline stages before and after), MUL will take like 4, floating point operations will take lots of cycles, memory access has variable latency (due to cache misses), etc.
By using forwarding, we can limit the critical path of an instruction to just the latencies of the execution units, while everything else (fetch, decode, retirement) is out of the critical path. Instructions get decoded and dumped into instruction queues, awaiting their inputs to be produced by other executing instructions. When an instruction's dependencies are satisfied, it can begin executing.
Let's consider this example
MUL R1,R5,R6
ADD R2,R1,R3
AND R7,R2,R8
I'm going to make an attempt at drawing a timeline that shows the flow of these instructions through the pipeline.
MUL FDIXXXXWR
ADD  FDIIIIXWR
AND   FDIIIIXWR
Key:
F - Fetch
D - Decode
I - Instruction queue (IQ)
X - execute
W - writeback/forward/bypass
R - retire
So, as you see, the multiply instruction has a total lifetime of 9 cycles. But there is overlap in execution of the MUL and the ADD, because the processor is pipelined. When the ADD enters the IQ, it has to wait for its input (R1), and likewise so does the AND that is dependent on the ADD's result (R2). What we care about is not how long the MUL lives in total but how long any dependent instruction has to wait. That is its EFFECTIVE latency, which is 4 cycles. As you can see, once the ADD executes, the dependent AND can execute on the next cycle, again due to forwarding.
I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?
Because what we're interested in is in speed between instructions, not the start to end time of a single instruction.
Besides the obvious of "different instructions require a different amount of work to complete", hear me out...
Well that's the key answer to why different instructions have different latencies.
Consider an i7 with an approx 14 stage pipeline. That takes 14 clock cycles to complete a run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.
That is correct, though that's not a particularly meaningful number. For example, why do we care how long it takes before the CPU is entirely done with an instruction? That has basically no effect.
An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more than the stage count) and a throughput of one per 8 cycles (on Ivy Bridge).
This is just a bunch of misunderstandings. An XOR introduces one cycle of latency into a dependency chain. That is, if I do 12 instructions that each modify the previous instruction's value and then add an XOR as the 13th instruction, it will take one cycle more. That's what the latency means.
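For instance (a hypothetical chain through eax):

    ; 12 dependent 1-cycle instructions, each using the previous result...
    add   eax, 1
    add   eax, 1
    ; ... (10 more like these)
    xor   eax, ebx              ; 13th instruction in the chain: exactly 1 more cycle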
Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.
Right. So?
I know about the multiple execution units. I don't understand how the lengths of instructions, in terms of latency and throughput, relate to the number of pipeline stages.
They don't. Why should there be any connection? Say there's 14 extra stages at the beginning of the pipeline. Why would that affect latency or throughput at all? It would just mean everything happens 14 clock cycles later, but still at the same rate. (Though it would likely impact the cost of a mispredicted branch and other things.)
