Loop optimization. How does register renaming break dependencies? What is execution port capacity? - performance

I am analyzing an example of a loop from Agner Fog's optimization_assembly. I mean the 12.9 chapter.
The code is: ( I simplified a bit)
L1:
vmulpd ymm1, ymm2, [rsi+rax]
vaddpd ymm1, ymm1, [rdi+rax]
vmovupd [rdi+rax], ymm1
add rax, 32
jl L1
And I have some questions:
The author said that there is no loop-carried dependency. I don't understand why it is so. ( I skipped the case of add rax, 32 ( it is loop-carried indeed, but only one cycle)). But, after all, the next iteration cannot modify ymm1 register before the previous iteration will not have finished. Maybe register-renaming plays a role here?
Let's assume that there is a loop-carried dependency.
vaddpd ymm1, ymm1, [rdi+rax] -> vmovupd [rdi+rax], ymm1
And let latency for first is 3, and latency for second is 7.
( In fact, there is no such dependency, but I would like to ask a hypothetical question)
Now, How to determine a total latency. Should I add latencies and the result would be 10? I have no idea.
It is written:
There are two 256-bit read operations, each using a read port for two
consecutive clock cycles, which is indicated as 1+ in the table. Using
both read ports (port 2 and 3), we will have a throughput of two
256-bit reads in two clock cycles. One of the read ports will make an
address calculation for the write in the second clock cycle. The write
port (port 4) is occupied for two clock cycles by the 256-bit write.
The limiting factor will be the read and write operations, using the
two read ports and the write port at their maximum capacity.
What exactly is capacity for ports? How can I determine them, for example for IvyBridge (my CPU).

Yes, the whole point of register renaming is to break dependency chains when an instruction writes a register without depending on the old value. The destination of a mov, or the write-only destination operand of AVX instructions, is like this. Also zeroing idioms like xor eax,eax are recognized as independent of the old value, even though they appear to have the old value as an input.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for a more detailed description of register-renaming, and some performance experiments with multiple loop-carried dependency chains in flight at once.
Without renaming, vmulpd couldn't write ymm1 until vmovupd had read its operand (Write-After-Read hazard), but it wouldn't have to wait for vmovupd to complete. See a computer architecture textbook to learn about in-order pipelines and stuff. I'm not sure if any out-of-order CPUs without register renaming exist.
update: early OoO CPUs used scoreboarding to do some limited out-of-order execution without register renaming, but were much more limited in their capacity to find and exploit instruction-level parallelism.
Each of the two load ports on IvB has a capacity of one 128b load per clock. And also of one address-generation per clock.
In theory, SnB/IvB can sustain a throughput of 2x 128b load and 1x 128b store per clock, but only by using 256b instructions. They can only generate two addresses per clock, but a 256b load or store only needs one address calculation per 2 cycles of data transfer. See Agner Fog's microarch guide
Haswell added a dedicated store AGU on port 7 that handles simple addressing modes only, and widened the data paths to 256b. A single cycle can do a peak of 96 bytes total loaded + stored. (But some unknown bottleneck limits sustained throughput to less than that. On Skylake-client, about 84 bytes / cycle reported by Intel, and matches my testing.)
(IceLake client reportedly can sustain 2x64B loaded + 1x64B stored per cycle, or 2x32B stored, according to a recent update to Intel's optimization guide.)
Also note that your indexed addressing modes won't micro-fuse, so fused-domain uop throughput is also a concern.

Related

Assembly - How to score a CPU instruction by latency and throughput

I'm looking for a type of a formula / way to measure how fast an instruction is, or more specific to give a "score" each of the instruction by CPU cycles.
Let's take the follow assembly program for an example,
nop
mov eax,dword ptr [rbp+34h]
inc eax
mov dword ptr [rbp+34h],eax
and the following Intel Skylake information:
mov r,m : Throughput=0.5 Latency=2
mov m,r
: Throughput=1 Latency=2
nop : Throughput=0.25 Latency=non
inc : Throughput=0.25 Latency=1
I know that the order of the instructions in the program are matter in here but
I'm looking to create something general that not need to be "accurate to the single cycle"
any one have any idea how can I do that?
There is no formula you can apply; you have to measure.
The same instruction on different versions of the same uarch family can have different performance. e.g. mulps:
Sandybridge 1c / 5c throughput/latency.
HSW 0.5 / 5. BDW 0.5 / 3 (faster multiply path in the FMA unit? FMA is still 5c).
SKL 0.5 / 4 (lower latency FMA, too). SKL runs addps on the FMA unit as well, dropping the dedicated FP multiply unit so add latency is higher, but throughput is higher.
There's no way you could predict any of this without measuring, or knowing some microarchitectural details. We expect FP math ops won't be single-cycle latency, because they're much more complicated than integer ops. (So if they were single cycle, the clock speed is set too low for integer ops.)
You measure by repeating the instruction many times in an unrolled loop. Or fully unrolled with no looping, but then you defeat the uop-cache and can get front-end bottlenecks. (e.g. for decoding 10-byte mov r64, imm64)
https://uops.info/ has already automated this testing for every form of every (unprivileged) instruction, and you can even click on any table entry to see what test loops they used. e.g. Skylake xchg r32, eax latency testing (https://uops.info/html-lat/SKL/XCHG_R32_EAX-Measurements.html) from each input operand to each output. (2 cycle latency from EAX -> R8D, but 1 cycle latency from R8D -> EAX.) So we can guess that the 3 uops include copying EAX to an internal temporary, but moving directly from the other operand to EAX.
https://uops.info/ is the current best source of test data; when it and Agner's tables disagree, my own measurements and/or other sources have always confirmed uops.info's testing was accurate. And they don't try to make up a latency number for 2 halves of a round-trip like movd xmm0,eax and back, they show you the range of possible latencies assuming the rest of the chain was the minimum plausible.
Agner Fog creates his instruction tables (which you appear to be reading) by timing large non-looping blocks of code that repeat an instruction. https://agner.org/optimize/. The intro section of his instruction-tables explains briefly how he measures, and his microarch guide explains more details of how different x86 microarchitectures work internally. Unfortunately there are occasional typos or copy/paste errors in his hand-edited tables.
http://instlatx64.atw.hu/ also has results of experimental measurements. I think they use a similar technique of a large block of the same instruction repeated, maybe small enough to fit in the uop cache. But they don't use perf counters to measure what execution port each instruction needs, so their throughput numbers don't help you figure out which instructions compete with which other instructions.
These latter two sources have been around for longer than uops.info, and cover some older CPUs, especially older AMD.
To measure latency yourself, you make the output of each instruction an input for the next.
mov ecx, 10000000
inc_latency:
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
sub ecx,1 ; avoid partial-flag false dep for P4
jnz inc_latency ; dec or sub/jnz macro-fuses into 1 uop on Intel SnB-family
This dependency chain of 7 inc instructions will bottleneck the loop at 1 iteration per 7 * inc_latency cycles. Using perf counters for core clock cycles (not RDTSC cycles), you can easily measure the time for all the iterations to 1 part in 10k, and with more care probably even more precisely than that. The repeat count of 10000000 hides start/stop overhead of whatever timing you use.
I normally put a loop like this in a Linux static executable that just makes a sys_exit(0) system call directly (with a syscall) instruction, and time the whole executable with perf stat ./testloop to get time and a cycle count. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example).
Another example is Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths, with the added complication of using lfence to drain the out-of-order execution window for two dep chains.
To measure throughput, you use separate registers, and/or include an xor-zeroing occasionally to break dep chains and let out-of-order exec overlap things. Don't forget to also use perf counters to see which ports it can run on, so you can tell which other instructions it will compete with. (e.g. FMA (p01) and shuffles (p5) don't compete at all for back-end resources on Haswell/Skylake, only for front-end throughput.) Don't forget to measure front-end uop counts, too: some instructions decode to multiply uops.
How many different dependency chains do we need to avoid a bottleneck? Well we know the latency (measure it first), and we know the max possible throughput (number of execution ports, or front-end throughput.)
For example, if FP multiply had 0.25c throughput (4 per clock), we could keep 20 in flight at once on Haswell (5c latency). That's more than we have registers, so we could just use all 16 and discover that in fact the throughput is only 0.5c. But if it had turned out that 16 registers was a bottleneck, we could add xorps xmm0,xmm0 occasionally and let out-of-order execution overlap some blocks.
More is normally better; having just barely enough to hide latency can slow down with imperfect scheduling. If we wanted to go nuts measuring inc, we'd do this:
mov ecx, 10000000
inc_latency:
%rep 10 ;; source-level repeat of a block, no runtime branching
inc eax
inc ebx
; not ecx, we're using it as a loop counter
inc edx
inc esi
inc edi
inc ebp
inc r8d
inc r9d
inc r10d
inc r11d
inc r12d
inc r13d
inc r14d
inc r15d
%endrep
sub ecx,1 ; break partial-flag false dep for P4
jnz inc_latency ; dec/jnz macro-fuses into 1 uop on Intel SnB-family
If we were worried about partial-flag false dependencies or flag-merging effects, we might experiment with mixing in an xor eax,eax somewhere to let OoO exec overlap more than just when sub wrote all flags. (See INC instruction vs ADD 1: Does it matter?)
There's a similar problem for measuring throughput and latency of shl r32, cl on Sandybridge-family: the flag dependency chain isn't normally relevant for a computation, but putting shl back-to-back creates a dependency through FLAGS as well as through the register. (Or for throughput, there isn't even a register dep).
I posted about this on Agner Fog's blog: https://www.agner.org/optimize/blog/read.php?i=415#860. I mixed shl edx,cl in with four add edx,1 instructions, to see what incremental slowdown adding one more instruction had, where the FLAGS dependency was a non-issue. On SKL, it only slows down by an extra 1.23 cycles on average, so the true latency cost of that shl was only ~1.23 cycles, not 2. (It's not a whole number or just 1 because of resource conflicts to run the flag-merging uops of the shl, I guess. BMI2 shlx edx, edx, ecx would be exactly 1c because it's only a single uop.)
Related: for static performance analysis of whole blocks of code (containing different instructions), see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?. (It's using the word "latency" for the end-to-end latency of a whole computation, but actually asking about things small enough for OoO exec to overlap different parts, so instruction latency and throughput both matter.)
The Latency=2 numbers for load/store appear to be from Agner Fog's instruction tables (https://agner.org/optimize/). They unfortunately aren't accurate for a chain of mov rax, [rax]. You'll find that's 4c
latency if you measure it by putting that in a loop.
Agner splits up load/store latency into something that makes the total store/reload latency come out correct, but for some reason he doesn't make the load part equal to the L1d load-use latency when it comes from cache instead of the store buffer. (But also note that if the load feeds an ALU instruction instead of another load, the latency is 5c. So the simple addressing-mode fast-path only helps for pure pointer-chasing.)

What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular architecture, taking into account instruction reordering, superscalarity, latencies, CPIs, etc.
What / describe the rules must be followed to achieve this?
I think I've got some preliminary rules figured out, but I haven't been able to find any references on breaking down any example code to this level of detail, so I've had to take some guesses. (For example, the Intel optimization manual barely even mentions instruction reordering.)
At minimum, I'm looking for (1) confirmation that each rule is correct or else a correct statement of each rule, and (2) a list of any rules that I may have forgotten.
As many instructions as possible are issued each cycle, starting in-order from the current cycle and potentially as far ahead as the reorder buffer size.
An instruction can be issued on a given cycle if:
No instructions that affect its operands are still being executed. And:
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering). And:
There is a functional unit available for that instruction on that cycle. Every (?) functional unit is pipelined, meaning it can accept 1 new instruction per cycle, and the number of total functional units is 1/CPI, for the CPI of a given function class (nebulous here: presumably e.g. addps and subps use the same functional unit? How do I determine this?). And:
Fewer than the superscalar width (typically 4) number of instructions have already been issued this cycle.
If no instructions can be issued, the processor simply doesn't issue any—a condition called a "stall".
As an example, consider the following example code (which computes a cross-product):
shufps xmm3, xmm2, 210
shufps xmm0, xmm1, 201
shufps xmm2, xmm2, 201
mulps xmm0, xmm3
shufps xmm1, xmm1, 210
mulps xmm1, xmm2
subps xmm0, xmm1
My attempt to predict the latency for Haswell looks something like this:
; `mulps` Haswell latency=5, CPI=0.5
; `shufps` Haswell latency=1, CPI=1
; `subps` Haswell latency=3, CPI=1
shufps xmm3, xmm2, 210 ; cycle 1
shufps xmm0, xmm1, 201 ; cycle 2
shufps xmm2, xmm2, 201 ; cycle 3
mulps xmm0, xmm3 ; (superscalar execution)
shufps xmm1, xmm1, 210 ; cycle 4
mulps xmm1, xmm2 ; cycle 5
; cycle 6 (stall `xmm0` and `xmm1`)
; cycle 7 (stall `xmm1`)
; cycle 8 (stall `xmm1`)
subps xmm0, xmm1 ; cycle 9
; cycle 10 (stall `xmm0`)
TL:DR: look for dependency chains, especially loop-carried ones. For a long-running loop, see which latency, front-end throughput, or back-end port contention/throughput is the worst bottleneck. That's how many cycles your loop probably takes per iteration, on average, if there are no cache misses or branch mispredicts.
Latency bounds and throughput bounds for processors for operations that must occur in sequence is a good example of analyzing loop-carried dependency chains in a specific loop with two dep chains, one pulling values from the other.
Related: How many CPU cycles are needed for each assembly instruction? is a good introduction to throughput vs. latency on a per-instruction basis, and how what that means for sequences of multiple instructions. See also Assembly - How to score a CPU instruction by latency and throughput for how to measure a single instruction.
This is called static (performance) analysis. Wikipedia says (https://en.wikipedia.org/wiki/List_of_performance_analysis_tools) that AMD's AMD CodeXL has a "static kernel analyzer" (i.e. for computational kernels, aka loops). I've never tried it.
Intel also has a free tool for analyzing how loops will go through the pipeline in Sandybridge-family CPUs: What is IACA and how do I use it?
IACA is not bad, but has bugs (e.g. wrong data for shld on Sandybridge, and last I checked, it doesn't know that Haswell/Skylake can keep indexed addressing modes micro-fused for some instructions. But maybe that will change now that Intel's added details on that to their optimization manual.) IACA is also unhelpful for counting front-end uops to see how close to a bottleneck you are (it likes to only give you unfused-domain uop counts).
Static analysis is often pretty good, but definitely check by profiling with performance counters. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of profiling a simple loop to investigate a microarchitectural feature.
Essential reading:
Agner Fog's microarch guide (chapter 2: Out of order exec) explains some of the basics of dependency chains and out-of-order execution. His "Optimizing Assembly" guide has more good introductory and advanced performance stuff.
The later chapters of his microarch guide cover the details of the pipelines in CPUs like Nehalem, Sandybridge, Haswell, K8/K10, Bulldozer, and Ryzen. (And Atom / Silvermont / Jaguar).
Agner Fog's instruction tables (spreadsheet or PDF) are also normally the best source for instruction latency / throughput / execution-port breakdowns.
David Kanter's microarch analysis docs are very good, with diagrams. e.g. https://www.realworldtech.com/sandy-bridge/, https://www.realworldtech.com/haswell-cpu/, and https://www.realworldtech.com/bulldozer/.
See also other performance links in the x86 tag wiki.
I also took a stab at explaining how a CPU core finds and exploits instruction-level parallelism in this answer, but I think you've already grasped those basics as far as it's relevant for tuning software. I did mention how SMT (Hyperthreading) works as a way to expose more ILP to a single CPU core, though.
In Intel terminology:
"issue" means to send a uop into the out-of-order part of the core; along with register-renaming, this is the last step in the front-end. The issue/rename stage is often the narrowest point in the pipeline, e.g. 4-wide on Intel since Core2. (With later uarches like Haswell and especially Skylake often actually coming very close to that in some real code, thanks to SKL's improved decoders and uop-cache bandwidth, as well as back-end and cache bandwidth improvements.) This is fused-domain uops: micro-fusion lets you send 2 uops through the front-end and only take up one ROB entry. (I was able to construct a loop on Skylake that sustains 7 unfused-domain uops per clock). See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: out-of-order window size.
"dispatch" means the scheduler sends a uop to an execution port. This happens as soon as all the inputs are ready, and the relevant execution port is available. How are x86 uops scheduled, exactly?. Scheduling happens in the "unfused" domain; micro-fused uops are tracked separately in the OoO scheduler (aka Reservation Station, RS).
A lot of other computer-architecture literature uses these terms in the opposite sense, but this is the terminology you will find in Intel's optimization manual, and the names of hardware performance counters like uops_issued.any or uops_dispatched_port.port_5.
exactly how long arbitrary arithmetical x86-64 assembly code will take
It depends on the surrounding code as well, because of OoO exec
Your final subps result doesn't have to be ready before the CPU starts running later instructions. Latency only matters for later instructions that need that value as an input, not for integer looping and whatnot.
Sometimes throughput is what matters, and out-of-order exec can hide the latency of multiple independent short dependency chains. (e.g. if you're doing the same thing to every element of a big array of multiple vectors, multiple cross products can be in flight at once.) You'll end up with multiple iterations in flight at once, even though in program order you finish all of one iteration before doing any of the next. (Software pipelining can help for high-latency loop bodies if OoO exec has a hard time doing all the reordering in HW.)
There are three major dimensions to analyze for a short block
You can approximately characterize a short block of non-branching code in terms of these three factors. Usually only one of them is the bottleneck for a given use-case. Often you're looking at a block that you will use as part of a loop, not as the whole loop body, but OoO exec normally works well enough that you can just add up these numbers for a couple different blocks, if they're not so long that OoO window size prevents finding all the ILP.
latency from each input to the output(s). Look at which instructions are on the dependency chain from each input to each output. e.g. one choice might need one input to be ready sooner.
total uop count (for front-end throughput bottlenecks), fused-domain on Intel CPUs. e.g. Core2 and later can in theory issue/rename 4 fused-domain uops per clock into the out-of-order scheduler/ROB. Sandybridge-family can often achieve that in practice with the uop cache and loop buffer, especially Skylake with its improved decoders and uop-cache throughput.
uop count for each back-end execution port (unfused domain). e.g. shuffle-heavy code will often bottleneck on port 5 on Intel CPUs. Intel usually only publishes throughput numbers, not port breakdowns, which is why you have to look at Agner Fog's tables (or IACA output) to do anything meaningful if you're not just repeating the same instruction a zillion times.
Generally you can assuming best-case scheduling/distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen some. (How are x86 uops scheduled, exactly?)
Looking at CPI is not sufficient; two CPI=1 instructions might or might not compete for the same execution port. If they don't, they can execute in parallel. e.g. Haswell can only run psadbw on port 0 (5c latency, 1c throughput, i.e. CPI=1) but it's a single uop so a mix of 1 psadbw + 3 add instructions could sustain 4 instructions per clock. There are vector ALUs on 3 different ports in Intel CPUs, with some operations replicated on all 3 (e.g. booleans) and some only on one port (e.g. shifts before Skylake).
Sometimes you can come up with a couple different strategies, one maybe lower latency but costing more uops. A classic example is multiplying by constants like imul eax, ecx, 10 (1 uop, 3c latency on Intel) vs. lea eax, [rcx + rcx*4] / add eax,eax (2 uops, 2c latency). Modern compilers tend to choose 2 LEA vs. 1 IMUL, although clang up to 3.7 favoured IMUL unless it could get the job done with only a single other instruction.
See What is the efficient way to count set bits at a position or lower? for an example of static analysis for a few different ways to implement a function.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (which ended up being way more detailed than you'd guess from the question title) for another summary of static analysis, and some neat stuff about unrolling with multiple accumulators for a reduction.
Every (?) functional unit is pipelined
The divider is pipelined in recent CPUs, but not fully pipelined. (FP divide is single-uop, though, so if you do one divps mixed in with dozens of mulps / addps, it can have negligible throughput impact if latency doesn't matter: Floating point division vs floating point multiplication. rcpps + a Newton iteration is worse throughput and about the same latency.
Everything else is fully pipelined on mainstream Intel CPUs; multi-cycle (reciprocal) throughput for a single uop. (variable-count integer shifts like shl eax, cl have lower-than-expected throughput for their 3 uops, because they create a dependency through the flag-merging uops. But if you break that dependency through FLAGS with an add or something, you can get better throughput and latency.)
On AMD before Ryzen, the integer multiplier is also only partially pipelined. e.g. Bulldozer's imul ecx, edx is only 1 uop, but with 4c latency, 2c throughput.
Xeon Phi (KNL) also has some not-fully-pipelined shuffle instructions, but it tends to bottleneck on the front-end (instruction decode), not the back-end, and does have a small buffer + OoO exec capability to hide back-end bubbles.
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering)
No.
Maybe you read that for Silvermont, which doesn't do OoO exec for FP/SIMD, only integer (with a small ~20 uop window). Maybe some ARM chips are like that, too, with simpler schedulers for NEON? I don't know much about ARM uarch details.
The mainstream big-core microarchitectures like P6 / SnB-family, and all AMD OoO chips, do OoO exec for SIMD and FP instructions the same as for integer. AMD CPUs use a separate scheduler, but Intel uses a unified scheduler so its full size can be applied to finding ILP in integer or FP code, whichever is currently running.
Even the silvermont-based Knight's Landing (in Xeon Phi) does OoO exec for SIMD.
x86 is generally not very sensitive to instruction ordering, but uop scheduling doesn't do critical-path analysis. So it could sometimes help to put instructions on the critical path first, so they aren't stuck waiting with their inputs ready while other instructions run on that port, leading to a bigger stall later when we get to instructions that need the result of the critical path. (i.e. that's why it is the critical path.)
My attempt to predict the latency for Haswell looks something like this:
Yup, that looks right. shufps runs on port 5, addps runs on p1, mulps runs on p0 or p1. Skylake drops the dedicated FP-add unit and runs SIMD FP add/mul/FMA on the FMA units on p0/p1, all with 4c latency (up/down from 3/5/5 in Haswell, or 3/3/5 in Broadwell).
This is a good example of why keeping a whole XYZ direction vector in a SIMD vector usually sucks. Keeping an array of X, an array of Y, and an array of Z, would let you do 4 cross products in parallel without any shuffles.
The SSE tag wiki has a link to these slides: SIMD at Insomniac Games (GDC 2015) which covers that array-of-structs vs. struct-of-arrays issues for 3D vectors, and why it's often a mistake to always try to SIMD a single operation instead of using SIMD to do multiple operations in parallel.

Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?

I'm doing micro-optimization on a performance critical part of my code and came across the sequence of instructions (in AT&T syntax):
add %rax, %rbx
mov %rdx, %rax
mov %rbx, %rdx
I thought I finally had a use case for xchg which would allow me to shave an instruction and write:
add %rbx, %rax
xchg %rax, %rdx
However, to my dimay I found from Agner Fog's instruction tables, that xchg is a 3 micro-op instruction with a 2 cycle latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell and even Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throws off my 4-1-1-1 cadence and the 2 cycle latency makes it worse than the original in the best case since the last 2 instructions in the original might execute in parallel.
Now... I get that the CPU might be breaking the instruction into micro-ops that are equivalent to:
mov %rax, %tmp
mov %rdx, %rax
mov %tmp, %rdx
where tmp is an anonymous internal register and I suppose the last two micro-ops could be run in parallel so the latency is 2 cycles.
Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that this is done this way. Why wouldn't the register renamer just swap the labels? In theory, this would have a latency of only 1 cycle (possibly 0?) and could be represented as a single micro-op so it would be much cheaper.
Supporting efficient xchg is non-trivial, and presumably not worth the extra complexity it would require in various parts of the CPU. A real CPU's microarchitecture is much more complicated than the mental model that you can use while optimizing software for it. For example, speculative execution makes everything more complicated, because it has to be able to roll back to the point where an exception occurred.
Making fxch efficient was important for x87 performance because the stack nature of x87 makes it (or alternatives like fld st(2)) hard to avoid. Compiler-generated FP code (for targets without SSE support) really does use fxch a significant amount. It seems that fast fxch was done because it was important, not because it's easy. Intel Haswell even dropped support for single-uop fxch. It's still zero-latency, but decodes to 2 uops on HSW and later (up from 1 in P5, and PPro through IvyBridge).
xchg is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx / add rdx, rax instead of add rax, rdx / xchg rax, rdx. Compilers generally don't use xchg reg,reg, and usually hand-written asm doesn't either. (This chicken/egg problem is pretty similar to loop being slow (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). loop would have been very useful for for adc loops on Core2/Nehalem where an adc + dec/jnz loop causes partial-flag stalls.)
Since xchg is still slow-ish on previous CPUs, compilers wouldn't start using it with -mtune=generic for several years. Unlike fxch or mov-elimination, a design-change to support fast xchg wouldn't help the CPU run most existing code faster, and would only enable performance gains over the current design in rare cases where it's actually a useful peephole optimization.
Integer registers are complicated by partial-register stuff, unlike x87
There are 4 operand sizes of xchg, 3 of which use the same opcode with REX or operand-size prefixes. (xchg r8,r8 is a separate opcode, so it's probably easier to make the decoders decode it differently from the others). The decoders already have to recognize xchg with a memory operand as special, because of the implicit lock prefix, but it's probably less decoder complexity (transistor-count + power) if the reg-reg forms all decode to the same number of uops for different operand sizes.
Making some r,r forms decode to a single uop would be even more complexity, because single-uop instructions have to be handled by the "simple" decoders as well as the complex decoder. So they would all need to be able to parse xchg and decide whether it was a single uop or multi-uop form.
AMD and Intel CPUs behave somewhat similarly from a programmer's perspective, but there are many signs that the internal implementation is vastly different. For example, Intel mov-elimination only works some of the time, limited by some kind of microarchitectural resources, but AMD CPUs that do mov-elimination do it 100% of the time (e.g. Bulldozer for the low lane of vector regs).
See Intel's optimization manual, Example 3-23. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, where they discuss overwriting the zero-latency-movzx result right away to free up the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.)
I don't know exactly what needs tracking in a limited-size table(?) for mov-elimination. Probably it's related to needing to free register-file entries as soon as possible when they're no longer needed, because Physical Register File size limits rather than ROB size can be the bottleneck for the out-of-order window size. Swapping around indices might make this harder.
xor-zeroing is eliminated 100% of the time on Intel Sandybridge-family; it's assumed that this works by renaming to a physical zero register, and this register never needs to be freed.
If xchg used the same mechanism that mov-elimination does, it also could probably only work some of the time. It would need to decode to enough uops to work in cases where it isn't handled at rename. (Or else the issue/rename stage would have to insert extra uops when an xchg will take more than 1 uop, like it does when un-laminating micro-fused uops with indexed addressing modes that can't stay micro-fused in the ROB, or when inserting merging uops for flags or high-8 partial registers. But that's a significant complication that would only be worth doing if xchg was a common and important instruction.)
Note that xchg r32,r32 has to zero-extend both results to 64 bits, so it can't be a simple swap of RAT (Register Alias Table) entries. It would be more like truncating both registers in-place. And note that Intel CPUs never eliminate mov same,same. It does already need to support mov r32,r32 and movzx r32, r8 with no execution port, so presumably it has some bits that indicate that rax = al or something. (And yes, Intel HSW/SKL do that, not just Ivybridge, despite what Agner's microarch guide says.)
We know P6 and SnB had upper-zeroed bits like this, because xor eax,eax before setz al avoids a partial-register stall when reading eax. HSW/SKL never rename al separately in the first place, only ah. It may not be a coincidence that partial-register renaming (other than AH) seems to have been dropped in the same uarch that introduced mov-elimination (Ivybridge). Still, setting that bit for 2 registers at once would be a special case that required special support.
xchg r64,r64 could maybe just swap the RAT entries, but decoding that differently from the r32 case is yet another complication. It might still need to trigger partial-register merging for both inputs, but add r64,r64 needs to do that, too.
Also note that an Intel uop (other than fxch) only ever produces one register result (plus flags). Not touching flags doesn't "free up" an output slot; For example mulx r64,r64,r64 still takes 2 uops to produce 2 integer outputs on HSW/SKL, even though all the "work" is done in the multiply unit on port 1, same as with mul r64 which does produce a flag result.)
Even if it is as simple as "swap the RAT entries", building a RAT that supports writing more than one entry per uop is a complication. What to do when renaming 4 xchg uops in a single issue group? It seems to me like it would make the logic significantly more complicated. Remember that this has to be built out of logic gates / transistors. Even if you say "handle that special case with a trap to microcode", you have to build the whole pipeline to support the possibility that that pipeline stage could take that kind of exception.
Single-uop fxch requires support for swapping RAT entries (or some other mechanism) in the FP RAT (fRAT), but it's a separate block of hardware from the integer RAT (iRAT). Leaving out that complication in the iRAT seems reasonable even if you have it in the fRAT (pre-Haswell).
Issue/rename complexity is definitely an issue for power consumption, though. Note that Skylake widened a lot of the front-end (legacy decode and uop cache fetch), and retirement, but kept the 4-wide issue/rename limit. SKL also added replicated execution units on more port in the back-end, so issue bandwidth is a bottleneck even more of the time, especially in code with a mix of loads, stores, and ALU.
The RAT (or the integer register file, IDK) may even have limited read ports, since there seem to be some front-end bottlenecks in issuing/renaming many 3-input uops like add rax, [rcx+rdx]. I posted some microbenchmarks (this and the follow-up post) showing Skylake being faster than Haswell when reading lots of registers, e.g. with micro-fusion of indexed addressing modes. Or maybe the bottleneck there was really some other microarchitectural limit.
But how does 1-uop fxch work? IDK how it's done in Sandybridge / Ivybridge. In P6-family CPUs, an extra remapping table exists basically to support FXCH. That might only be needed because P6 uses a Retirement Register File with 1 entry per "logical" register, instead of a physical register file (PRF). As you say, you'd expect it to be simpler when even "cold" register values are just a pointer to a PRF entry. (Source: US patent 5,499,352: Floating point register alias table FXCH and retirement floating point register array (describes Intel's P6 uarch).
One main reason the rfRAT array 802 is included within the present invention fRAT logic is a direct result of the manner in which the present invention implements the FXCH instruction.
(Thanks Andy Glew (#krazyglew), I hadn't thought of looking up patents to find out about CPU internals.) It's pretty heavy going, but may provide some insight into the bookkeeping needed for speculative execution.
Interesting tidbit: the patent describes integer as well, and mentions that there are some "hidden" logical registers which are reserved for use by microcode. (Intel's 3-uop xchg almost certain uses one of these as a temporary.)
We might be able to get some insight from looking at what AMD does.
Interestingly, AMD has 2-uop xchg r,r in K10, Bulldozer-family, Bobcat/Jaguar, and Ryzen. (But Jaguar xchg r8,r8 is 3 uops. Maybe to support the xchg ah,al corner case without a special uop for swapping the low 16 of a single reg).
Presumably both uops read the old values of the input architectural registers before the first one updates the RAT. IDK exactly how this works, since they aren't necessarily issued/renamed in the same cycle (but they are at least contiguous in the uop flow, so at worst the 2nd uop is the first uop in the next cycle). I have no idea if Haswell's 2-uop fxch works similarly, or if they're doing something else.
Ryzen is a new architecture designed after mov-elimination was "invented", so presumably they take advantage of it wherever possible. (Bulldozer-family renames vector moves (but only for the low 128b lane of YMM vectors); Ryzen is the first AMD architecture to do it for GP regs too.) xchg r32,r32 and r64,r64 are zero-latency (renamed), but still 2 uops each. (r8 and r16 need an execution unit, because they merge with the old value instead of zero-extending or copying the entire reg, but are still only 2 uops).
Ryzen's fxch is 1 uop. AMD (like Intel) probably isn't spending a lot of transistors on making x87 fast (e.g. fmul is only 1 per clock and on the same port as fadd), so presumably they were able to do this without a lot of extra support. Their micro-coded x87 instructions (like fyl2x) are faster than on recent Intel CPUs, so maybe Intel cares even less (at least about the microcoded x87 instruction).
Maybe AMD could have made xchg r64,r64 a single uop too, more easily than Intel. Maybe even xchg r32,r32 could be single uop, since like Intel it needs to support mov r32,r32 zero-extension with no execution port, so maybe it could just set whatever "upper 32 zeroed" bit exists to support that. Ryzen doesn't eliminate movzx r32, r8 at rename, so presumably there's only an upper32-zero bit, not bits for other widths.
What Intel might be able to do cheaply if they wanted to:
It's possible that Intel could support 2-uop xchg r,r the way Ryzen does (zero latency for the r32,r32 and r64,r64 forms, or 1c for the r8,r8 and r16,r16 forms) without too much extra complexity in critical parts of the core, like the issue/rename and retirement stages that manage the Register Alias Table (RAT). But maybe not, if they can't have 2 uops read the "old" value of a register when the first uop writes it.
Stuff like xchg ah,al is definitely a extra complication, since Intel CPUs don't rename partial registers separately anymore, except AH/BH/CH/DH.
xchg latency in practice on current hardware
Your guess about how it might work internally is good. It almost certainly uses one of the internal temporary registers (accessible only to microcode). Your guess about how they can reorder is too limited, though.
In fact, one direction has 2c latency and the other direction has ~1c latency.
00000000004000e0 <_start.loop>:
4000e0: 48 87 d1 xchg rcx,rdx # slow version
4000e3: 48 83 c1 01 add rcx,0x1
4000e7: 48 83 c1 01 add rcx,0x1
4000eb: 48 87 ca xchg rdx,rcx
4000ee: 48 83 c2 01 add rdx,0x1
4000f2: 48 83 c2 01 add rdx,0x1
4000f6: ff cd dec ebp
4000f8: 7f e6 jg 4000e0 <_start.loop>
This loop runs in ~8.06 cycles per iteration on Skylake. Reversing the xchg operands makes it run in ~6.23c cycles per iteration (measured with perf stat on Linux). uops issued/executed counters are equal, so no elimination happened. It looks like the dst <- src direction is the slow one, since putting the add uops on that dependency chain makes things slower than when they're on the dst -> src dependency chain.
If you ever want to use xchg reg,reg on the critical path (code-size reasons?), do it with the dst -> src direction on the critical path, because that's only about 1c latency.
Other side-topics from comments and the question
The 3 micro-ops throws off my 4-1-1-1 cadence
Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1, 2-1-1, 3-1, or 4.
Also beware that if the last uop is one that can macro-fuse, they will hang onto it until the next decode cycle in case the first instruction in the next block is a jcc. (This is a win when code runs multiple times from the uop cache for each time it's decoded. And that's still usually 3 uops per clock decode throughput.)
Skylake has an extra "simple" decoder so it can do 1-1-1-1-1 up to 4-1 I guess, but > 4 uops for one instruction still requires the microcode ROM. Skylake beefed up the uop cache, too, and can often bottleneck on the 4 fused-domain uops per clock issue/rename throughput limit if the back-end (or branch misses) aren't a bottleneck first.
I'm literally searching for ~1% speed bumps so hand optimization has been working out on the main loop code. Unfortunately that's ~18kB of code so I'm not even trying to consider the uop cache anymore.
That seems kinda crazy, unless you're mostly limiting yourself to asm-level optimization in shorter loops inside your main loop. Any inner loops within the main loop will still run from the uop cache, and that should probably be where you're spending most of your time optimizing. Compilers usually do a good-enough job that it's not practical for a human to do much over a large scale. Try to write your C or C++ in such a way that the compiler can do a good job with it, of course, but looking for tiny peephole optimizations like this over 18kB of code seems like going down the rabbit hole.
Use perf counters like idq.dsb_uops vs. uops_issued.any to see how many of your total uops came from the uop cache (DSB = Decoded Stream Buffer or something). Intel's optimization manual has some suggestions for other perf counters to look at for code that doesn't fit in the uop cache, such as DSB2MITE_SWITCHES.PENALTY_CYCLES. (MITE is the legacy-decode path). Search the pdf for DSB to find a few places it's mentioned.
Perf counters will help you find spots with potential problems, e.g. regions with higher than average uops_issued.stall_cycles could benefit from finding ways to expose more ILP if there are any, or from solving a front-end problem, or from reducing branch-mispredicts.
As discussed in comments, a single uop produces at most 1 register result
As an aside, with a mul %rbx, do you really get %rdx and %rax all at once or does the ROB technically have access to the lower part of the result one cycle earlier than the higher part? Or is it like the "mul" uop goes into the multiplication unit and then the multiplication unit issues two uops straight into the ROB to write the result at the end?
Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.
The mul %rbx instruction decodes to 2 uops in the decoders. They don't even have to issue in the same cycle, let alone execute in the same cycle.
However, Agner Fog's instruction tables only list a single latency number. It turns out that 3 cycles is the latency from both inputs to RAX. The minimum latency for RDX is 4c, according to InstlatX64 testing on both Haswell and Skylake-X.
From this, I conclude that the 2nd uop is dependent on the first, and exists to write the high half of the result to an architectural register. The port1 uop produces a full 128b multiply result.
I don't know where the high-half result lives until the p6 uop reads it. Perhaps there's some sort of internal queue between the multiply execution unit and hardware connected to port 6. By scheduling the p6 uop with a dependency on the low-half result, that might arrange for the p6 uops from multiple in-flight mul instructions to run in the correct order. But then instead of actually using that dummy low-half input, the uop would take the high half result from the queue output in an execution unit that's connected to port 6 and return that as the result. (This is pure guess work, but I think it's plausible as one possible internal implementation. See comments for some earlier ideas).
Interestingly, according to Agner Fog's instruction tables, on Haswell the two uops for mul r64 go to ports 1 and 6. mul r32 is 3 uops, and runs on p1 + p0156. Agner doesn't say whether that's really 2p1 + p0156 or p1 + 2p0156 like he does for some other insns. (However, he says that mulx r32,r32,r32 runs on p1 + 2p056 (note that p056 doesn't include p1).)
Even more strangely, he says that Skylake runs mulx r64,r64,r64 on p1 p5 but mul r64 on p1 p6. If that's accurate and not a typo (which is a possibility), it pretty much rules out the possibility that the extra uop is an upper-half multiplier.

Intel Intrinsics guide - Latency and Throughput

Can somebody explain the Latency and the Throughput values given in the Intel Intrinsic Guide?
Have I understood it correctly that the latency is the amount of time units an instruction takes to run, and the throughput is the number of instructions that can be started per time unit?
If my definition is correct, why is the latency for some instructions higher on newer CPU versions (e.g. mulps)?
Missing from that table: MULPS latency on Broadwell: 3. On Skylake: 4.
The intrinsic finder's latency is accurate in this case, although it occasionally doesn't match Agner Fog's experimental testing. (That VEXTRACTF128 latency may be a case of Intel not including a bypass delay in their table). See my answer on that linked question for more details about what to do with throughput and latency numbers, and what they mean for a modern out-of-order CPU.
MULPS latency did increase from 4 (Nehalem) to 5 (Sandybridge). This may have been to save power or transistors, but more likely because SandyBridge standardized uop latencies to only a few different values, to avoid writeback conflict: i.e. when the same execution unit would produce two results in the same cycle, e.g. from starting a 2c uop one cycle, then a 1c uop the next cycle.
This simplifies the uop scheduler, which dispatches uops from the Reservation Station to the execution units. More or less in oldest-first order, but it has has to filter by which ones have their inputs ready. The scheduler is power-hungry, and this is a significant part of the power cost of out-of-order execution. (It's unfortunately not practical to make a scheduler that picks uops in critical-path-first order, to avoid having independent uops steal cycles from the critical path with resource conflicts.)
Agner Fog explains the same thing (in the SnB section of his microarch pdf):
Mixing μops with different latencies
Previous processors have a write-back conflict when μops with
different latencies are issued to the same execution port, as
described on page 114. This problem is largely solved on the Sandy
Bridge. Execution latencies are standardized so that all μops with a
latency of 3 are issued to port 1 and all μops with a latency of 5 go
to port 0. μops with a latency of 1 can go to port 0, 1 or 5. No other
latencies are allowed, except for division and square root.
The standardization of latencies has the advantage that write-back
conflicts are avoided. The disadvantage is that some μops have higher
latencies than necessary.
Hmm, I just realized that Agner's numbers for VEXTRACTF128 xmm, ymm, imm8 are weird. Agner lists it as 1 uop 2c latency on SnB, but Intel lists it as 1c latency (as discussed here). Maybe the execution unit is 1c latency, but there's a built-in 1c bypass delay (for lane-crossing?) before you can use the result. That would explain the discrepancy between Intel's numbers and Agner's experimental test.
Some instructions are still 2c latency, because they decode to 2 dependent uops that are each 1c latency. MULPS is a single uop, even the AVX 256b version, because even Intel's first-gen AVX CPUs have full-width 256b execution units (except the divide/sqrt unit). Needing twice as many copies of the FP multiplier circuitry is a good reason for optimizing it to save transistors at the cost of latency.
This pattern holds up to and including Broadwell, AFAICT from searching Agner's tables. (Using LibreOffice, I selected the whole table, and did data->filter->standard filter, and looked for rows with column C = 1 and column F = 4. (And then repeat for 2.) Look for any uops that aren't loads or stores.
Haswell sticks to the pattern of only 1, 3 and 5 cycle ALU uop latencies (except for AESENC/AESDEC, which is 1 uop for port5 with 7c latency. And of course DIVPS and SQRTPS). There's also CVTPI2PS xmm, mm, at 1 uop 4c latency, but maybe that's 3c for the p1 uop and 1c of bypass delay, the way Agner Fog measured it or unavoidable. VMOVMSKPS r32, ymm is also 2c (vs. 3c for the r32,xmm version).
Broadwell dropped MULPS latency to 3, same as ADDPS, but kept FMA at 5c. Presumably they figured out how to shortcut the FMA unit to produce just a multiply when no add was needed.
Skylake is able to handle uops with latency=4. Latency for FMA, ADDPS/D, and MULPS/D = 4 cycles. (SKL drops the dedicated vector-FP add unit, and does everything with the FMA unit. So ADDPS/D throughput is doubled to match MULPS/D and FMA...PS/D. I'm not sure which change motivated what, and whether they would have introduced 4c latency instructions at all if they hadn't wanted to drop the vec-FP adder without hurting ADDPS latency too badly.)
Other SKL instructions with 4c latency: PHMINPOSUW (down from 5c), AESDEC/AESENC, CVTDQ2PS (up from 3c, but this might be 3c + bypass), RCPPS (down from 5c), RSQRTPS, CMPPS/D (up from 3c). Hmm, I guess FP compares were done in the adder, and now have to use FMA.
MOVD r32, xmm and MOVD xmm, r32 are listed as 2c, perhaps a bypass delay from int-vec to int? Or a glitch in Agner's testing? Testing the latency would require other instructions to create a round-trip back to xmm. It's 1c on HSW. Agner lists SKL MOVQ r64, xmm as 2 cycles (port0), but MOVQ xmm, r64 as 1c (port5), and it seems extremely weird that reading a 64-bit register is faster than reading a 32-bit register. Agner has had mistakes in his table in the past; this may be another.

Deterministic execution time on x86-64

Is there an x64 instruction(s) that takes a fixed amount of time, regardless of the micro-architectural state such as caches, branch predictors, etc.?
For instance, if a hypothetical add or increment instruction always takes n cycles, then I can implement a timer in my program by performing that add instruction multiple times. Perhaps an increment instruction with register operands may work, but it's not clear to me whether Intel's spec guarantees that it would take deterministic number of cycles. Note that I am not interested in current time, but only a primitive / instruction sequence that takes a fixed number of cycles.
Assume that I have a way to force atomic execution i.e. no context switches during timer's execution i.e. only my program gets to run.
On a related note, I also cannot use system services to keep track of time, because I am working in a setting where my program is a user-level program running on an untrusted OS.
The x86 ISA documents don't guarantee anything about what takes a certain amount of cycles. The ISA allows things like Transmeta's Crusoe that JIT-compiled x86 instructions to an internal VLIW instruction set. It could conceivably do optimizations between adjacent instructions.
The best you can do is write something that will work on as many known microarchitectures as possible. I'm not aware of any x86-64 microarchitectures that are "weird" like Transmeta, only the usual superscalar decode-to-uops designs like Intel and AMD use.
Simple integer ALU instructions like ADD are almost all 1c latency, and tiny loops that don't touch memory are almost totally unaffected anything, and are very predictable. If they run a lot of iterations, they're also almost totally unaffected by anything to do with the impact of surrounding code on the out-of-order core, and recover very quickly from disruptions like timer interrupts.
On nearly every Intel microarchitecture, this loop will run at one iteration per clock:
mov ecx, 1234567 ; or use a 64-bit register for higher counts.
ALIGN 16
.loop:
sub ecx, 1 ; not dec because of Pentium 4.
jnz .loop
Agner Fog's microarch guide and instruction tables say that VIA Nano3000 has a taken-branch throughput of one per 3 cycles, so this loop would only run at one iteration per 3 clocks there. AMD Bulldozer-family and Jaguar similarly have a max throughput of one taken JCC per 2 clocks.
See also other performance links in the x86 tag wiki.
If you want a more power-efficient loop, you could use PAUSE in the loop, but it waits ~100 cycles on Skylake, up from ~5 cycles on previous microarchitectures. (You can make cycle-accurate predictions for more complicated loops that don't touch memory, but that depends on microarchitectural details.)
You could make a more reliable loop that's less likely to have different bottlenecks on different CPUs by making a longer dependency chain within each iteration. Since each instruction depends on the previous, it can still only run at one instruction per cycle (not counting the branch), drastically the branches per cycle.
# one add/sub per clock, limited by latency
# should run one iteration per 6 cycles on every CPU listed in Agner Fog's tables
# And should be the same on all future CPUs unless they do magic inter-instruction optimizations.
# Or it could be slower on CPUs that always have a bubble on taken branches, but it seems unlikely anyone would design one.
ALIGN 16
.loop:
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 2 ; net result ecx-1
jnz .loop
Unrolling like this ensures that front-end effects are not a bottleneck. It gives the frontend decoders plenty of time to queue up the 6 add/sub insns and the jcc before the next branch.
Using add/sub instead of dec/inc avoids a partial-flag false dependency on Pentium 4. (Although I don't think that would be an issue anyway.)
Pentium4's double-clocked ALUs can each run two ADDs per clock, but the latency is still one cycle. i.e. apparently it can't forward a result internally to chew through this dependency chain twice as fast as any other CPU.
And yes, Prescott P4 is an x86-64 CPU, so we can't quite ignore P4 if we need a general purpose answer.

Resources