What's up with the "half fence" behavior of rdtscp? - performance

For many years x86 CPUs supported the rdtsc instruction, which reads the "time stamp counter" of the current CPU. The exact definition of this counter has changed over time, but on recent CPUs it is a counter that increments at a fixed frequency with respect to wall clock time, so it is very useful as building block for a fast, accurate clock or measuring the time taken by small segments of code.
One important fact about the rdtsc instruction isn't ordered in any special way with the surrounding code. Like most instructions, it can be freely reordered with respect to other instructions which it isn't in a dependency relationship with. This is actually "normal" and for most instructions it's just a mostly invisible way of making the CPU faster (it's just a long winder way of saying out-of-order execution).
For rdtsc it is important because it means you might not be timing the code you expect to be timing. For example, given the following sequence1:
rdtsc
mov ecx, eax
mov rdi, [rdi]
mov rdi, [rdi]
rdtsc
You might expect rdtsc to measure the latency of the two pointer chasing loading loads mov rdi, [rdi]. In practice, however, even if both of these loads take a look time (100s of cycles if they miss in the cache), you'll get a fairly small reading for the rdtsc pair. The problem is that the second rdtsc doens't wait for the loads to finish, it just executes out of order, so you aren't timing the interval you think you are. Perhaps both rdtsc instruction actually even execute before the first load even starts, depending how rdi was calculated in the code prior to this example.
So far, this is sounding more like an answer to a question nobody asked than a real question, but I'm getting there.
You have two basic use-cases for rdtsc:
As a quick timestamp, in which can you usually don't care exactly how it reorders with the surrounding code, since you probably don't have have an instruction-level concept of where the timestamp should be taken, anyways.
As a precise timing mechanism, e.g., in a micro-benchmark. In this case you'll usually protect your rdtsc from re-ordering with the lfence instruction. For the example above, you might do something like:
lfence
rdtsc
lfence
mov ecx, eax
...
lfence
rdtsc
To ensure the timed instructions (...) don't escape outside of the timed region, and also to ensure instructions from inside the time region don't come in (probably less of a problem, but they may compete for resources with the code you want to measure).
Years later, Intel looked down upon us poor programmers and came up with a new instruction: rdtscp. Like rdtsc it returns a reading of the time stamp counter, and this guy does something more: it reads a core-specific MSR value atomically with the timestamp reading. On most OSes this contains a core ID value. I think the idea is that this value can be used to properly adjust the returned value to real time on CPUs that may have different TSC offsets per core.
Great.
The other thing rdtscp introduced was half-fencing in terms of out-of-order execution:
From the manual:
The RDTSCP instruction is not a serializing instruction, but it does
wait until all previous instructions have executed and all previous
loads are globally visible.1 But it does not wait for previous stores
to be globally visible, and subsequent instructions may begin
execution before the read operation is performed.
So it's like putting an lfence before the rdtscp, but not after. What is the point of this half-fencing behavior? If you want a general timestamp and don't care about instruction ordering, the unfenced behavior is what you want. If you want to use this for timing short code sections, the half-fencing behavior is useful only for the second (final) reading, but not for the initial reading, since the fence is on the "wrong" side (in practice you want fences on both sides, but having them on the inside is probably the most important).
What purpose does such half-fencing serve?
1 I'm ignoring the upper 32-bits of the counter in this case.

Related

Why are loops always compiled into "do...while" style (tail jump)?

When trying to understand assembly (with compiler optimization on), I see this behavior:
A very basic loop like this
outside_loop;
while (condition) {
statements;
}
Is often compiled into (pseudocode)
; outside_loop
jmp loop_condition ; unconditional
loop_start:
loop_statements
loop_condition:
condition_check
jmp_if_true loop_start
; outside_loop
However, if the optimization is not turned on, it compiles to normally understandable code:
loop_condition:
condition_check
jmp_if_false loop_end
loop_statements
jmp loop_condition ; unconditional
loop_end:
According to my understanding, the compiled code is better resembled by this:
goto condition;
do {
statements;
condition:
}
while (condition_check);
I can't see a huge performance boost or code readability boost, so why is this often the case? Is there a name for this loop style, for example "trailing condition check"?
Related: asm loop basics: While, Do While, For loops in Assembly Language (emu8086)
Terminology: Wikipedia says "loop inversion" is the name for turning a while(x) into if(x) do{}while(x), putting the condition at the bottom of the loop where it belongs.
Fewer instructions / uops inside the loop = better. Structuring the code outside the loop to achieve this is very often a good idea.
Sometimes this requires "loop rotation" (peeling part of the first iteration so the actual loop body has the conditional branch at the bottom). So you do some of the first iteration and maybe skip the loop entirely, then fall into the loop. Sometimes you also need some code after the loop to finish the last iteration.
Sometimes loop rotation is extra useful if the last iteration is a special case, e.g. a store you need to skip. This lets you implement a while(1) {... ; if(x)break; ...; } loop as a do-while, or put one of the conditions of a multiple-condition loop at the bottom.
Some of these optimizations are related to or enable software pipelining, e.g. loading something for the next iteration. (OoO exec on x86 makes SW pipelining not very important these days but it's still useful for in-order cores like many ARM. And unrolling with multiple accumulators is still very valuable for hiding loop-carried FP latency in a reduction loop like a dot product or sum of an array.)
do{}while() is the canonical / idiomatic structure for loops in asm on all architectures, get used to it. IDK if there's a name for it; I would say such a loop has a "do while structure". If you want names, you could call the while() structure "crappy unoptimized code" or "written by a novice". :P Loop-branch at the bottom is universal, and not even worth mentioning as a Loop Optimization. You always do that.
This pattern is so widely used that on CPUs that use static branch prediction for branches without an entry in the branch-predictor caches, unknown forward conditional branches are predicted not-taken, unknown backwards branches are predicted taken (because they're probably loop branches). See Static branch prediction on newer Intel processors on Matt Godbolt's blog, and Agner Fog's branch-prediction chapter at the start of his microarch PDF.
This answer ended up using x86 examples for everything, but much of this applies across the board for all architectures. I wouldn't be surprised if other superscalar / out-of-order implementations (like some ARM, or POWER) also have limited branch-instruction throughput whether they're taken or not. But fewer instructions inside the loop is nearly universal when all you have is a conditional branch at the bottom, and no unconditional branch.
If the loop might need to run zero times, compilers more often put a test-and-branch outside the loop to skip it, instead of jumping to the loop condition at the bottom. (i.e. if the compiler can't prove the loop condition is always true on the first iteration).
BTW, this paper calls transforming while() to if(){ do{}while; } an "inversion", but loop inversion usually means inverting a nested loop. (e.g. if the source loops over a row-major multi-dimensional array in the wrong order, a clever compiler could change for(i) for(j) a[j][i]++; into for(j) for(i) a[j][i]++; if it can prove it's correct.) But I guess you can look at the if() as a zero-or-one iteration loop. Fun fact, compiler devs teaching their compilers how to invert a loop (to allow auto-vectorization) for a (very) specific case is why SPECint2006's libquantum benchmark is "broken". Most compilers can't invert loops in the general case, just ones that looks almost exactly like the one in SPECint2006...
You can help the compiler make more compact asm (fewer instructions outside the loop) by writing do{}while() loops in C when you know the caller isn't allowed to pass size=0 or whatever else guarantees a loop runs at least once.
(Actually 0 or negative for signed loop bounds. Signed vs. unsigned loop counters is a tricky optimization issue, especially if you choose a narrower type than pointers; check your compiler's asm output to make sure it isn't sign-extending a narrow loop counter inside the loop very time if you use it as an array index. But note that signed can actually be helpful, because the compiler can assume that i++ <= bound will eventually become false, because signed overflow is UB but unsigned isn't. So with unsigned, while(i++ <= bound) is infinite if bound = UINT_MAX.) I don't have a blanket recommendation for when to use signed vs. unsigned; size_t is often a good choice for looping over arrays, though, but if you want to avoid the x86-64 REX prefixes in the loop overhead (for a trivial saving in code size) but convince the compiler not to waste any instructions zero or sign-extending, it can be tricky.
I can't see a huge performance boost
Here's an example where that optimization will give speedup of 2x on Intel CPUs before Haswell, because P6 and SnB/IvB can only run branches on port 5, including not-taken conditional branches.
Required background knowledge for this static performance analysis: Agner Fog's microarch guide (read the Sandybridge section). Also read his Optimizing Assembly guide, it's excellent. (Occasionally outdated in places, though.) See also other x86 performance links in the x86 tag wiki. See also Can x86's MOV really be "free"? Why can't I reproduce this at all? for some static analysis backed up by experiments with perf counters, and some explanation of fused vs. unfused domain uops.
You could also use Intel's IACA software (Intel Architecture Code Analyzer) to do static analysis on these loops.
; sum(int []) using SSE2 PADDD (dword elements)
; edi = pointer, esi = end_pointer.
; scalar cleanup / unaligned handling / horizontal sum of XMM0 not shown.
; NASM syntax
ALIGN 16 ; not required for max performance for tiny loops on most CPUs
.looptop: ; while (edi<end_pointer) {
cmp edi, esi ; 32-bit code so this can macro-fuse on Core2
jae .done ; 1 uop, port5 only (macro-fused with cmp)
paddd xmm0, [edi] ; 1 micro-fused uop, p1/p5 + a load port
add edi, 16 ; 1 uop, p015
jmp .looptop ; 1 uop, p5 only
; Sandybridge/Ivybridge ports each uop can use
.done: ; }
This is 4 total fused-domain uops (with macro-fusion of the cmp/jae), so it can issue from the front-end into the out-of-order core at one iteration per clock. But in the unfused domain there are 4 ALU uops and Intel pre-Haswell only has 3 ALU ports.
More importantly, port5 pressure is the bottleneck: This loop can execute at only one iteration per 2 cycles because cmp/jae and jmp both need to run on port5. Other uops stealing port5 could reduce practical throughput somewhat below that.
Writing the loop idiomatically for asm, we get:
ALIGN 16
.looptop: ; do {
paddd xmm0, [edi] ; 1 micro-fused uop, p1/p5 + a load port
add edi, 16 ; 1 uop, p015
cmp edi, esi ; 1 uop, port5 only (macro-fused with cmp)
jb .looptop ; } while(edi < end_pointer);
Notice right away, independent of everything else, that this is one fewer instruction in the loop. This loop structure is at least slightly better on everything from simple non-pipelined 8086 through classic RISC (like early MIPS), especially for long-running loops (assuming they don't bottleneck on memory bandwidth).
Core2 and later should run this at one iteration per clock, twice as fast as the while(){}-structured loop, if memory isn't a bottleneck (i.e. assuming L1D hits, or at least L2 actually; this is only SSE2 16-bytes per clock).
This is only 3 fused-domain uops, so can issue at better than one per clock on anything since Core2, or just one per clock if issue groups always end with a taken branch.
But the important part is that port5 pressure is vastly reduced: only cmp/jb needs it. The other uops will probably be scheduled to port5 some of the time and steal cycles from loop-branch throughput, but this will be a few % instead of a factor of 2. See How are x86 uops scheduled, exactly?.
Most CPUs that normally have a taken-branch throughput of one per 2 cycles can still execute tiny loops at 1 per clock. There are some exceptions, though. (I forget which CPUs can't run tight loops at 1 per clock; maybe Bulldozer-family? Or maybe just some low-power CPUs like VIA Nano.) Sandybridge and Core2 can definitely run tight loops at one per clock. They even have loop buffers; Core2 has a loop buffer after instruction-length decode but before regular decode. Nehalem and later recycle uops in the queue that feeds the issue/rename stage. (Except on Skylake with microcode updates; Intel had to disable the loop buffer because of a partial-register merging bug.)
However, there is a loop-carried dependency chain on xmm0: Intel CPUs have 1-cycle latency paddd, so we're right up against that bottleneck, too. add esi, 16 is also 1 cycle latency. On Bulldozer-family, even integer vector ops have 2c latency, so that would bottleneck the loop at 2c per iteration. (AMD since K8 and Intel since SnB can run two loads per clock, so we need to unroll anyway for max throughput.) With floating point, you definitely want to unroll with multiple accumulators. Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators).
If I'd used an indexed addressing mode, like paddd xmm0, [edi + eax], I could have used sub eax, 16 / jnc at the loop condition. SUB/JNC can macro-fuse on Sandybridge-family, but the indexed load would un-laminate on SnB/IvB (but stay fused on Haswell and later, unless you use the AVX form).
; index relative to the end of the array, with an index counting up towards zero
add rdi, rsi ; edi = end_pointer
xor eax, eax
sub eax, esi ; eax = -length, so [rdi+rax] = first element
.looptop: ; do {
paddd xmm0, [rdi + rax]
add eax, 16
jl .looptop ; } while(idx+=16 < 0); // or JNC still works
(It's usually better to unroll some to hide the overhead of pointer increments instead of using indexed addressing modes, especially for stores, partly because indexed stores can't use the port7 store AGU on Haswell+.)
On Core2/Nehalem add/jl don't macro-fuse, so this is 3 fused-domain uops even in 64-bit mode, without depending on macro-fusion. Same for AMD K8/K10/Bulldozer-family/Ryzen: no fusion of the loop condition, but PADDD with a memory operand is 1 m-op / uop.
On SnB, paddd un-laminates from the load, but add/jl macro-fuse, so again 3 fused-domain uops. (But in the unfused domain, only 2 ALU uops + 1 load, so probably fewer resource conflicts reducing throughput of the loop.)
On HSW and later, this is 2 fused-domain uops because an indexed load can stay micro-fused with PADDD, and add/jl macro-fuses. (Predicted-taken branches run on port 6, so there are never resource conflicts.)
Of course, the loops can only run at best 1 iteration per clock because of taken branch throughput limits even for tiny loops. This indexing trick is potentially useful if you had something else to do inside the loop, too.
But all of these loops had no unrolling
Yes, that exaggerates the effect of loop overhead. But gcc doesn't unroll by default even at -O3 (unless it decides to fully unroll). It only unrolls with profile-guided optimization to let it know which loops are hot. (-fprofile-use). You can enable -funroll-all-loops, but I'd only recommend doing that on a per-file basis for a compilation unit you know has one of your hot loops that needs it. Or maybe even on a per-function basis with an __attribute__, if there is one for optimization options like that.
So this is highly relevant for compiler-generated code. (But clang does default to unrolling tiny loops by 4, or small loops by 2, and extremely importantly, using multiple accumulators to hide latency.)
Benefits with very low iteration count:
Consider what happens when the loop body should run once or twice: There's a lot more jumping with anything other than do{}while.
For do{}while, execution is a straight-line with no taken branches and one not-taken branch at the bottom. This is excellent.
For an if() { do{}while; } that might run the loop zero times, it's two not-taken branches. That's still very good. (Not-taken is slightly cheaper for the front-end than taken when both are correctly predicted).
For a jmp-to-the-bottom jmp; do{}while(), it's one taken unconditional branch, one taken loop condition, and then the loop branch is not-taken. This is kinda clunky but modern branch predictors are very good...
For a while(){} structure, this is one not-taken loop exit, one taken jmp at the bottom, then one taken loop-exit branch at the top.
With more iterations, each loop structure does one more taken branch. while(){} also does one more not-taken branch per iteration, so it quickly becomes obviously worse.
The latter two loop structures have more jumping around for small trip counts.
Jumping to the bottom also has a disadvantage for non-tiny loops that the bottom of the loop might be cold in L1I cache if it hasn't run for a while. Code fetch / prefetch is good at bringing code to the front-end in a straight line, but if prediction didn't predict the branch early enough, you might have a code miss for the jump to the bottom. Also, parallel decode will probably have (or could have) decoded some of the top of the loop while decoding the jmp to the bottom.
Conditionally jumping over a do{}while loop avoids all that: you only jump forwards into code that hasn't been run yet in cases where the code you're jumping over shouldn't run at all. It often predicts very well because a lot of code never actually takes 0 trips through the loop. (i.e. it could have been a do{}while, the compiler just didn't manage to prove it.)
Jumping to the bottom also means the core can't start working on the real loop body until after the front-end chases two taken branches.
There are cases with complicated loop conditions where it's easiest to write it this way, and the performance impact is small, but compilers often avoid it.
Loops with multiple exit conditions:
Consider a memchr loop, or a strchr loop: they have to stop at the end of the buffer (based on a count) or the end of an implicit-length string (0 byte). But they also have to break out of the loop if they find a match before the end.
So you'll often see a structure like
do {
if () break;
blah blah;
} while(condition);
Or just two conditions near the bottom. Ideally you can test multiple logical conditions with the same actual instruction (e.g. 5 < x && x < 25 using sub eax, 5 / cmp eax, 20 / ja .outside_range, unsigned compare trick for range checking, or combine that with an OR to check for alphabetic characters of either case in 4 instructions) but sometimes you can't and just need to use an if()break style loop-exit branch as well as a normal backwards taken branch.
Further reading:
Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” for good ways to look at compiler output (e.g. what kind of inputs give interesting output, and a primer on reading x86 asm for beginners). related: How to remove "noise" from GCC/clang assembly output?
Modern Microprocessors A 90-Minute Guide!. Details look at superscalar pipelined CPUs, mostly architecture neutral. Very good. Explains instruction-level parallelism and stuff like that.
Agner Fog's x86 optimization guide and microarch pdf. This will take you from being able to write (or understand) correct x86 asm to being able to write efficient asm (or see what the compiler should have done).
other links in the x86 tag wiki, including Intel's optimization manuals. Also several of my answers (linked in the tag wiki) have things that Agner missed in his testing on more recent microarchitectures (like un-lamination of micro-fused indexed addressing modes on SnB, and partial register stuff on Haswell+).
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators): how to use multiple accumulators to hide latency of a reduction loop (like an FP dot product).
Lecture 7: Loop Transformations (also on archive.org). Lots of cool stuff that compilers do to loops, using C syntax to describe the asm.
https://en.wikipedia.org/wiki/Loop_optimization
Sort of off topic:
Memory bandwidth is almost always important, but it's not widely known that a single core on most modern x86 CPUs can't saturate DRAM, and not even close on many-core Xeons where single-threaded bandwidth is worse than on a quad-core with dual channel memory controllers.
How much of ‘What Every Programmer Should Know About Memory’ is still valid? (my answer has commentary on what's changed and what's still relevant in Ulrich Drepper's well-known excellent article.)

Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?

I'm doing micro-optimization on a performance critical part of my code and came across the sequence of instructions (in AT&T syntax):
add %rax, %rbx
mov %rdx, %rax
mov %rbx, %rdx
I thought I finally had a use case for xchg which would allow me to shave an instruction and write:
add %rbx, %rax
xchg %rax, %rdx
However, to my dimay I found from Agner Fog's instruction tables, that xchg is a 3 micro-op instruction with a 2 cycle latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell and even Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throws off my 4-1-1-1 cadence and the 2 cycle latency makes it worse than the original in the best case since the last 2 instructions in the original might execute in parallel.
Now... I get that the CPU might be breaking the instruction into micro-ops that are equivalent to:
mov %rax, %tmp
mov %rdx, %rax
mov %tmp, %rdx
where tmp is an anonymous internal register and I suppose the last two micro-ops could be run in parallel so the latency is 2 cycles.
Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that this is done this way. Why wouldn't the register renamer just swap the labels? In theory, this would have a latency of only 1 cycle (possibly 0?) and could be represented as a single micro-op so it would be much cheaper.
Supporting efficient xchg is non-trivial, and presumably not worth the extra complexity it would require in various parts of the CPU. A real CPU's microarchitecture is much more complicated than the mental model that you can use while optimizing software for it. For example, speculative execution makes everything more complicated, because it has to be able to roll back to the point where an exception occurred.
Making fxch efficient was important for x87 performance because the stack nature of x87 makes it (or alternatives like fld st(2)) hard to avoid. Compiler-generated FP code (for targets without SSE support) really does use fxch a significant amount. It seems that fast fxch was done because it was important, not because it's easy. Intel Haswell even dropped support for single-uop fxch. It's still zero-latency, but decodes to 2 uops on HSW and later (up from 1 in P5, and PPro through IvyBridge).
xchg is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx / add rdx, rax instead of add rax, rdx / xchg rax, rdx. Compilers generally don't use xchg reg,reg, and usually hand-written asm doesn't either. (This chicken/egg problem is pretty similar to loop being slow (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). loop would have been very useful for for adc loops on Core2/Nehalem where an adc + dec/jnz loop causes partial-flag stalls.)
Since xchg is still slow-ish on previous CPUs, compilers wouldn't start using it with -mtune=generic for several years. Unlike fxch or mov-elimination, a design-change to support fast xchg wouldn't help the CPU run most existing code faster, and would only enable performance gains over the current design in rare cases where it's actually a useful peephole optimization.
Integer registers are complicated by partial-register stuff, unlike x87
There are 4 operand sizes of xchg, 3 of which use the same opcode with REX or operand-size prefixes. (xchg r8,r8 is a separate opcode, so it's probably easier to make the decoders decode it differently from the others). The decoders already have to recognize xchg with a memory operand as special, because of the implicit lock prefix, but it's probably less decoder complexity (transistor-count + power) if the reg-reg forms all decode to the same number of uops for different operand sizes.
Making some r,r forms decode to a single uop would be even more complexity, because single-uop instructions have to be handled by the "simple" decoders as well as the complex decoder. So they would all need to be able to parse xchg and decide whether it was a single uop or multi-uop form.
AMD and Intel CPUs behave somewhat similarly from a programmer's perspective, but there are many signs that the internal implementation is vastly different. For example, Intel mov-elimination only works some of the time, limited by some kind of microarchitectural resources, but AMD CPUs that do mov-elimination do it 100% of the time (e.g. Bulldozer for the low lane of vector regs).
See Intel's optimization manual, Example 3-23. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, where they discuss overwriting the zero-latency-movzx result right away to free up the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.)
I don't know exactly what needs tracking in a limited-size table(?) for mov-elimination. Probably it's related to needing to free register-file entries as soon as possible when they're no longer needed, because Physical Register File size limits rather than ROB size can be the bottleneck for the out-of-order window size. Swapping around indices might make this harder.
xor-zeroing is eliminated 100% of the time on Intel Sandybridge-family; it's assumed that this works by renaming to a physical zero register, and this register never needs to be freed.
If xchg used the same mechanism that mov-elimination does, it also could probably only work some of the time. It would need to decode to enough uops to work in cases where it isn't handled at rename. (Or else the issue/rename stage would have to insert extra uops when an xchg will take more than 1 uop, like it does when un-laminating micro-fused uops with indexed addressing modes that can't stay micro-fused in the ROB, or when inserting merging uops for flags or high-8 partial registers. But that's a significant complication that would only be worth doing if xchg was a common and important instruction.)
Note that xchg r32,r32 has to zero-extend both results to 64 bits, so it can't be a simple swap of RAT (Register Alias Table) entries. It would be more like truncating both registers in-place. And note that Intel CPUs never eliminate mov same,same. It does already need to support mov r32,r32 and movzx r32, r8 with no execution port, so presumably it has some bits that indicate that rax = al or something. (And yes, Intel HSW/SKL do that, not just Ivybridge, despite what Agner's microarch guide says.)
We know P6 and SnB had upper-zeroed bits like this, because xor eax,eax before setz al avoids a partial-register stall when reading eax. HSW/SKL never rename al separately in the first place, only ah. It may not be a coincidence that partial-register renaming (other than AH) seems to have been dropped in the same uarch that introduced mov-elimination (Ivybridge). Still, setting that bit for 2 registers at once would be a special case that required special support.
xchg r64,r64 could maybe just swap the RAT entries, but decoding that differently from the r32 case is yet another complication. It might still need to trigger partial-register merging for both inputs, but add r64,r64 needs to do that, too.
Also note that an Intel uop (other than fxch) only ever produces one register result (plus flags). Not touching flags doesn't "free up" an output slot; For example mulx r64,r64,r64 still takes 2 uops to produce 2 integer outputs on HSW/SKL, even though all the "work" is done in the multiply unit on port 1, same as with mul r64 which does produce a flag result.)
Even if it is as simple as "swap the RAT entries", building a RAT that supports writing more than one entry per uop is a complication. What to do when renaming 4 xchg uops in a single issue group? It seems to me like it would make the logic significantly more complicated. Remember that this has to be built out of logic gates / transistors. Even if you say "handle that special case with a trap to microcode", you have to build the whole pipeline to support the possibility that that pipeline stage could take that kind of exception.
Single-uop fxch requires support for swapping RAT entries (or some other mechanism) in the FP RAT (fRAT), but it's a separate block of hardware from the integer RAT (iRAT). Leaving out that complication in the iRAT seems reasonable even if you have it in the fRAT (pre-Haswell).
Issue/rename complexity is definitely an issue for power consumption, though. Note that Skylake widened a lot of the front-end (legacy decode and uop cache fetch), and retirement, but kept the 4-wide issue/rename limit. SKL also added replicated execution units on more port in the back-end, so issue bandwidth is a bottleneck even more of the time, especially in code with a mix of loads, stores, and ALU.
The RAT (or the integer register file, IDK) may even have limited read ports, since there seem to be some front-end bottlenecks in issuing/renaming many 3-input uops like add rax, [rcx+rdx]. I posted some microbenchmarks (this and the follow-up post) showing Skylake being faster than Haswell when reading lots of registers, e.g. with micro-fusion of indexed addressing modes. Or maybe the bottleneck there was really some other microarchitectural limit.
But how does 1-uop fxch work? IDK how it's done in Sandybridge / Ivybridge. In P6-family CPUs, an extra remapping table exists basically to support FXCH. That might only be needed because P6 uses a Retirement Register File with 1 entry per "logical" register, instead of a physical register file (PRF). As you say, you'd expect it to be simpler when even "cold" register values are just a pointer to a PRF entry. (Source: US patent 5,499,352: Floating point register alias table FXCH and retirement floating point register array (describes Intel's P6 uarch).
One main reason the rfRAT array 802 is included within the present invention fRAT logic is a direct result of the manner in which the present invention implements the FXCH instruction.
(Thanks Andy Glew (#krazyglew), I hadn't thought of looking up patents to find out about CPU internals.) It's pretty heavy going, but may provide some insight into the bookkeeping needed for speculative execution.
Interesting tidbit: the patent describes integer as well, and mentions that there are some "hidden" logical registers which are reserved for use by microcode. (Intel's 3-uop xchg almost certain uses one of these as a temporary.)
We might be able to get some insight from looking at what AMD does.
Interestingly, AMD has 2-uop xchg r,r in K10, Bulldozer-family, Bobcat/Jaguar, and Ryzen. (But Jaguar xchg r8,r8 is 3 uops. Maybe to support the xchg ah,al corner case without a special uop for swapping the low 16 of a single reg).
Presumably both uops read the old values of the input architectural registers before the first one updates the RAT. IDK exactly how this works, since they aren't necessarily issued/renamed in the same cycle (but they are at least contiguous in the uop flow, so at worst the 2nd uop is the first uop in the next cycle). I have no idea if Haswell's 2-uop fxch works similarly, or if they're doing something else.
Ryzen is a new architecture designed after mov-elimination was "invented", so presumably they take advantage of it wherever possible. (Bulldozer-family renames vector moves (but only for the low 128b lane of YMM vectors); Ryzen is the first AMD architecture to do it for GP regs too.) xchg r32,r32 and r64,r64 are zero-latency (renamed), but still 2 uops each. (r8 and r16 need an execution unit, because they merge with the old value instead of zero-extending or copying the entire reg, but are still only 2 uops).
Ryzen's fxch is 1 uop. AMD (like Intel) probably isn't spending a lot of transistors on making x87 fast (e.g. fmul is only 1 per clock and on the same port as fadd), so presumably they were able to do this without a lot of extra support. Their micro-coded x87 instructions (like fyl2x) are faster than on recent Intel CPUs, so maybe Intel cares even less (at least about the microcoded x87 instruction).
Maybe AMD could have made xchg r64,r64 a single uop too, more easily than Intel. Maybe even xchg r32,r32 could be single uop, since like Intel it needs to support mov r32,r32 zero-extension with no execution port, so maybe it could just set whatever "upper 32 zeroed" bit exists to support that. Ryzen doesn't eliminate movzx r32, r8 at rename, so presumably there's only an upper32-zero bit, not bits for other widths.
What Intel might be able to do cheaply if they wanted to:
It's possible that Intel could support 2-uop xchg r,r the way Ryzen does (zero latency for the r32,r32 and r64,r64 forms, or 1c for the r8,r8 and r16,r16 forms) without too much extra complexity in critical parts of the core, like the issue/rename and retirement stages that manage the Register Alias Table (RAT). But maybe not, if they can't have 2 uops read the "old" value of a register when the first uop writes it.
Stuff like xchg ah,al is definitely a extra complication, since Intel CPUs don't rename partial registers separately anymore, except AH/BH/CH/DH.
xchg latency in practice on current hardware
Your guess about how it might work internally is good. It almost certainly uses one of the internal temporary registers (accessible only to microcode). Your guess about how they can reorder is too limited, though.
In fact, one direction has 2c latency and the other direction has ~1c latency.
00000000004000e0 <_start.loop>:
4000e0: 48 87 d1 xchg rcx,rdx # slow version
4000e3: 48 83 c1 01 add rcx,0x1
4000e7: 48 83 c1 01 add rcx,0x1
4000eb: 48 87 ca xchg rdx,rcx
4000ee: 48 83 c2 01 add rdx,0x1
4000f2: 48 83 c2 01 add rdx,0x1
4000f6: ff cd dec ebp
4000f8: 7f e6 jg 4000e0 <_start.loop>
This loop runs in ~8.06 cycles per iteration on Skylake. Reversing the xchg operands makes it run in ~6.23c cycles per iteration (measured with perf stat on Linux). uops issued/executed counters are equal, so no elimination happened. It looks like the dst <- src direction is the slow one, since putting the add uops on that dependency chain makes things slower than when they're on the dst -> src dependency chain.
If you ever want to use xchg reg,reg on the critical path (code-size reasons?), do it with the dst -> src direction on the critical path, because that's only about 1c latency.
Other side-topics from comments and the question
The 3 micro-ops throws off my 4-1-1-1 cadence
Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1, 2-1-1, 3-1, or 4.
Also beware that if the last uop is one that can macro-fuse, they will hang onto it until the next decode cycle in case the first instruction in the next block is a jcc. (This is a win when code runs multiple times from the uop cache for each time it's decoded. And that's still usually 3 uops per clock decode throughput.)
Skylake has an extra "simple" decoder so it can do 1-1-1-1-1 up to 4-1 I guess, but > 4 uops for one instruction still requires the microcode ROM. Skylake beefed up the uop cache, too, and can often bottleneck on the 4 fused-domain uops per clock issue/rename throughput limit if the back-end (or branch misses) aren't a bottleneck first.
I'm literally searching for ~1% speed bumps so hand optimization has been working out on the main loop code. Unfortunately that's ~18kB of code so I'm not even trying to consider the uop cache anymore.
That seems kinda crazy, unless you're mostly limiting yourself to asm-level optimization in shorter loops inside your main loop. Any inner loops within the main loop will still run from the uop cache, and that should probably be where you're spending most of your time optimizing. Compilers usually do a good-enough job that it's not practical for a human to do much over a large scale. Try to write your C or C++ in such a way that the compiler can do a good job with it, of course, but looking for tiny peephole optimizations like this over 18kB of code seems like going down the rabbit hole.
Use perf counters like idq.dsb_uops vs. uops_issued.any to see how many of your total uops came from the uop cache (DSB = Decoded Stream Buffer or something). Intel's optimization manual has some suggestions for other perf counters to look at for code that doesn't fit in the uop cache, such as DSB2MITE_SWITCHES.PENALTY_CYCLES. (MITE is the legacy-decode path). Search the pdf for DSB to find a few places it's mentioned.
Perf counters will help you find spots with potential problems, e.g. regions with higher than average uops_issued.stall_cycles could benefit from finding ways to expose more ILP if there are any, or from solving a front-end problem, or from reducing branch-mispredicts.
As discussed in comments, a single uop produces at most 1 register result
As an aside, with a mul %rbx, do you really get %rdx and %rax all at once or does the ROB technically have access to the lower part of the result one cycle earlier than the higher part? Or is it like the "mul" uop goes into the multiplication unit and then the multiplication unit issues two uops straight into the ROB to write the result at the end?
Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.
The mul %rbx instruction decodes to 2 uops in the decoders. They don't even have to issue in the same cycle, let alone execute in the same cycle.
However, Agner Fog's instruction tables only list a single latency number. It turns out that 3 cycles is the latency from both inputs to RAX. The minimum latency for RDX is 4c, according to InstlatX64 testing on both Haswell and Skylake-X.
From this, I conclude that the 2nd uop is dependent on the first, and exists to write the high half of the result to an architectural register. The port1 uop produces a full 128b multiply result.
I don't know where the high-half result lives until the p6 uop reads it. Perhaps there's some sort of internal queue between the multiply execution unit and hardware connected to port 6. By scheduling the p6 uop with a dependency on the low-half result, that might arrange for the p6 uops from multiple in-flight mul instructions to run in the correct order. But then instead of actually using that dummy low-half input, the uop would take the high half result from the queue output in an execution unit that's connected to port 6 and return that as the result. (This is pure guess work, but I think it's plausible as one possible internal implementation. See comments for some earlier ideas).
Interestingly, according to Agner Fog's instruction tables, on Haswell the two uops for mul r64 go to ports 1 and 6. mul r32 is 3 uops, and runs on p1 + p0156. Agner doesn't say whether that's really 2p1 + p0156 or p1 + 2p0156 like he does for some other insns. (However, he says that mulx r32,r32,r32 runs on p1 + 2p056 (note that p056 doesn't include p1).)
Even more strangely, he says that Skylake runs mulx r64,r64,r64 on p1 p5 but mul r64 on p1 p6. If that's accurate and not a typo (which is a possibility), it pretty much rules out the possibility that the extra uop is an upper-half multiplier.

Deterministic execution time on x86-64

Is there an x64 instruction(s) that takes a fixed amount of time, regardless of the micro-architectural state such as caches, branch predictors, etc.?
For instance, if a hypothetical add or increment instruction always takes n cycles, then I can implement a timer in my program by performing that add instruction multiple times. Perhaps an increment instruction with register operands may work, but it's not clear to me whether Intel's spec guarantees that it would take deterministic number of cycles. Note that I am not interested in current time, but only a primitive / instruction sequence that takes a fixed number of cycles.
Assume that I have a way to force atomic execution i.e. no context switches during timer's execution i.e. only my program gets to run.
On a related note, I also cannot use system services to keep track of time, because I am working in a setting where my program is a user-level program running on an untrusted OS.
The x86 ISA documents don't guarantee anything about what takes a certain amount of cycles. The ISA allows things like Transmeta's Crusoe that JIT-compiled x86 instructions to an internal VLIW instruction set. It could conceivably do optimizations between adjacent instructions.
The best you can do is write something that will work on as many known microarchitectures as possible. I'm not aware of any x86-64 microarchitectures that are "weird" like Transmeta, only the usual superscalar decode-to-uops designs like Intel and AMD use.
Simple integer ALU instructions like ADD are almost all 1c latency, and tiny loops that don't touch memory are almost totally unaffected anything, and are very predictable. If they run a lot of iterations, they're also almost totally unaffected by anything to do with the impact of surrounding code on the out-of-order core, and recover very quickly from disruptions like timer interrupts.
On nearly every Intel microarchitecture, this loop will run at one iteration per clock:
mov ecx, 1234567 ; or use a 64-bit register for higher counts.
ALIGN 16
.loop:
sub ecx, 1 ; not dec because of Pentium 4.
jnz .loop
Agner Fog's microarch guide and instruction tables say that VIA Nano3000 has a taken-branch throughput of one per 3 cycles, so this loop would only run at one iteration per 3 clocks there. AMD Bulldozer-family and Jaguar similarly have a max throughput of one taken JCC per 2 clocks.
See also other performance links in the x86 tag wiki.
If you want a more power-efficient loop, you could use PAUSE in the loop, but it waits ~100 cycles on Skylake, up from ~5 cycles on previous microarchitectures. (You can make cycle-accurate predictions for more complicated loops that don't touch memory, but that depends on microarchitectural details.)
You could make a more reliable loop that's less likely to have different bottlenecks on different CPUs by making a longer dependency chain within each iteration. Since each instruction depends on the previous, it can still only run at one instruction per cycle (not counting the branch), drastically the branches per cycle.
# one add/sub per clock, limited by latency
# should run one iteration per 6 cycles on every CPU listed in Agner Fog's tables
# And should be the same on all future CPUs unless they do magic inter-instruction optimizations.
# Or it could be slower on CPUs that always have a bubble on taken branches, but it seems unlikely anyone would design one.
ALIGN 16
.loop:
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 2 ; net result ecx-1
jnz .loop
Unrolling like this ensures that front-end effects are not a bottleneck. It gives the frontend decoders plenty of time to queue up the 6 add/sub insns and the jcc before the next branch.
Using add/sub instead of dec/inc avoids a partial-flag false dependency on Pentium 4. (Although I don't think that would be an issue anyway.)
Pentium4's double-clocked ALUs can each run two ADDs per clock, but the latency is still one cycle. i.e. apparently it can't forward a result internally to chew through this dependency chain twice as fast as any other CPU.
And yes, Prescott P4 is an x86-64 CPU, so we can't quite ignore P4 if we need a general purpose answer.

per clock perf. - can I use different registers for same instruction?

Can I use say four general purpose registers say r8,r9,r10,r11 each with MOV instruction for independent operations and be in impression that CPU is doing all those instructions in a single clock ?
I want to know because according to Agner Fog's Instruction Table, it says reciprocal throughput of MOV instruction is 0.25. It means CPU should be able to execute 4 MOV operations per cycle. Or I misinterpreted that all ??
I am a noob and have been learning Assembly in MASM since two months (mainly for learning debugging stuffs how registers works and it is really fun).
Edit, just re-read your question, and you're asking about different registers. I'll leave in my original answer; let's pretend your question wasn't just the most trivial case. :P
Yes, even without register renaming, these instructions can all execute (on separate execution units) in the same cycle because they're completely independent of each other.
mov eax, 1
mov ebx, ecx
mov edx, [mem]
xor esi,esi ;xor-zero: doesn't even use an execution unit on SnB-family
This is the easiest case for superscalar execution. If eax/rax was the destination for all four instructions, register-renaming would still allow all four instructions to execute in parallel.
Out-of-order execution allows four nearby instructions from separate dependency chains to execute at the same time, even if they weren't decoded or issued in the same clock cycle. And they probably won't retire in the same cycle either, if there are instructions between them. (The x86 ISA guarantees precise exceptions, like most other ISAs (ARM/PPC/etc.). All current designs accomplish with in-order retirement. So if a memory op segfaults, the program will stop at exactly that instruction, not just "well, there was a segfault somewhere recently, but we can't tell you where". (That would be non-precise exceptions).)
Superscalar in-order designs like Atom, or P5 (original Pentium) can still take advantage of the parallelism in these four independent instructions, but not in many other cases.
In a hand-crafted loop, it's common for a SnB-family CPU to be able to sustain well over 3 fused-domain uops per cycle. (It's also very easy to write loops that run at less than one fused-domain uop per cycle, due to latency, to say nothing of cache misses or branch mispredicts.)
Yes, multiple writes to the same architectural register can execute in parallel. Register renaming is not a bottleneck on Intel or AMD designs.
To understand and make full use of Agner Fog's tables, you have to read his microarch guide, or at least his "optimizing assembly" guide. See also good stuff at the x86 wiki.
As Agner Fog's microarch pdf points out (section 9.8 about Intel SnB/IvB):
Register renaming is controlled by the register alias table (RAT) and
the reorder buffer (ROB), shown in figure 6.1. The μops from the
decoders and the stack engine go to the RAT via a queue and then to
the ROB-read and the reservation station. The RAT can handle 4 μops
per clock cycle. The RAT can rename four registers per clock cycle,
and it can even rename the same register four times in one clock
cycle.
read-modify-write is another story (destination of an add instruction). A read-modify-write of an architectural register is (part of) a dependency chain, while an unconditional mov or an xor-zeroing starts a new dep chain. (Same for the output of certain other instructions like lea which don't read their destination).
Those register writes still rename the architectural register to a new physical register as well. This is how CPUs handle cases like
mov eax, 1 ; start of a dep chain
mov [mem+rax+rcx], eax
inc eax ; eax renamed again
The store needs the value of eax from before the inc. It gets it because when it checks the RAT, the architectural eax is still pointing to the same physical register that the mov eax,1 wrote. The inc can't just modify that same physical register because it doesn't know what if anything is not done yet with the previous value of eax.

Instruction Level Profiling: The Meaning of the Instruction Pointer?

When profiling code at the the assembly instruction level, what does the position of the instruction pointer really mean given that modern CPUs don't execute instructions serially or in-order? For example, assume the following x64 assembly code:
mov RAX, [RBX]; // Assume a cache miss here.
mov RSI, [RBX + RCX]; // Another cache miss.
xor R8, R8;
add RDX, RAX; // Dependent on the load into RAX.
add RDI, RSI; // Dependent on the load into RSI.
Which instruction will the instruction pointer spend most of its time on? I can think of good arguments for all of them:
mov RAX, [RBX] is taking probably 100s of cycles because it's a cache miss.
mov RSI, [RBX + RCX] also takes 100s of cycles, but probably executes in parallel with the previous instruction. What does it even mean for the instruction pointer to be on one or the other of these?
xor R8, R8 probably executes out-of-order and finishes before the memory loads finish, but the instruction pointer might stay here until all previous instructions are also finished.
add RDX, RAX generates a pipeline stall because it's the instruction where the value of RAX is actually used after a slow cache-miss load into it.
add RDI, RSI also stalls because it's dependent on the load into RSI.
CPUs maintains a fiction that there are only the architectural registers (RAX, RBX, etc) and there is a specific instruction pointer (IP). Programmers and compilers target this fiction.
Yet as you noted, modern CPUs don't execute serially or in-order. Until you the programmer / user request the IP, it is like Quantum Physics, the IP is a wave of instructions being executed; all so that the processor can run the program as fast as possible. When you request the current IP (for example, via a debugger breakpoint or profiler interrupt), then the processor must recreate the fiction that you expect so it collapses this wave form (all "in flight" instructions), gathers the register values back into architectural names, and builds a context for executing the debugger routine, etc.
In this context, there is an IP that indicates the instruction where the processor should resume execution. During the out-of-order execution, this instruction was the oldest instruction yet to complete, even though at the time of the interrupt the processor was perhaps fetching instructions well past that point.
For example, perhaps the interrupt indicates mov RSI, [RBX + RCX]; as the IP, but the xor had already executed and completed; however, when the processor would resume execution after the interrupt, it will re-execute the xor.
It's a good question, but in the kind of performance tuning I do, it doesn't matter.
It doesn't really matter because what you're looking for is speed-bugs.
These are things that the code is doing that take clock time and that could be done better or not at all. Examples:
- Spending I/O time looking in DLLs for resources that don't, actually, need to be looked for.
- Spending time in memory-allocation routines making and freeing objects that could simply be re-used.
- Re-calculating things in functions that could be memo-ized.
... this is just a few off the top of my head
Your biggest enemy is a self-congratulatory tendency to say "I wouldn't consciously write any bugs. Why would I?" Of course, you know that's why you test software. But the same goes for speed-bugs, and if you don't know how to find those you assume there are none, which is a way of saying "My code has no possible speedups, except maybe a profiler can show me how to shave a few cycles."
In my half-century experience, there is no code that, as first written, contains no speed-bugs. What's more, there's an enormous multiplier effect, where every speed-bug you remove makes the remaining ones more obvious. As a contrived example, suppose bug A accounts for 90% of clock time, and bug B accounts for 9%. If you only fix B, big deal - the code is 11% faster. If you only fix A, that's good - it's 10x faster. But if you fix both, that's really good - it's 100x faster. Fixing A made B big.
So the thing you need most in performance tuning is to find the speed-bugs, and not miss any. When you've done all that, then you can get down to cycle-shaving.

Resources