Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs

Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs - performance

This is related, but not the same, as this question: Performance optimisations of x86-64 assembly - Alignment and branch prediction and is slightly related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++
The following is a not real-world test case. This primality testing algorithm is not sensible. I suspect any real-world algorithm would never execute such a small inner-loop quite so many times (num is a prime of size about 2**50). In C++11:
using nt = unsigned long long;
bool is_prime_float(nt num)
{
for (nt n=2; n<=sqrt(num); ++n) {
if ( (num%n)==0 ) { return false; }
}
return true;
}
Then g++ -std=c++11 -O3 -S produces the following, with RCX containing n and XMM6 containing sqrt(num). See my previous post for the remaining code (which is never executed in this example, as RCX never becomes large enough to be treated as a signed negative).
jmp .L20
.p2align 4,,10
.L37:
pxor %xmm0, %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .L36 // Exit the loop
.L20:
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .L30 // Failed divisibility test
addq $1, %rcx
jns .L37
// Further code to deal with case when ucomisd can't be used
I time this using std::chrono::steady_clock. I kept getting weird performance changes: from just adding or deleting other code. I eventually tracked this down to an alignment issue. The command .p2align 4,,10 tried to align to a 2**4=16 byte boundary, but only uses at most 10 bytes of padding to do so, I guess to balance between alignment and code size.
I wrote a Python script to replace .p2align 4,,10 by a manually controlled number of nop instructions. The following scatter plot shows the fastest 15 of 20 runs, time in seconds, number of bytes padding at the x-axis:
From objdump with no padding, the pxor instruction will occur at offset 0x402f5f. Running on a laptop, Sandybridge i5-3210m, turboboost disabled, I found that
For 0 byte padding, slow performance (0.42 secs)
For 1-4 bytes padding (offset 0x402f60 to 0x402f63) get slightly better (0.41s, visible on the plot).
For 5-20 bytes padding (offset 0x402f64 to 0x402f73) get fast performance (0.37s)
For 21-32 bytes padding (offset 0x402f74 to 0x402f7f) slow performance (0.42 secs)
Then cycles on a 32 byte sample
So a 16-byte alignment doesn't give the best performance-- it puts us in the slightly better (or just less variation, from the scatter plot) region. Alignment of 32 plus 4 to 19 gives the best performance.
Why am I seeing this performance difference? Why does this seem to violate the rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation manual)
I don't see any branch-prediction problems. Could this be a uop cache quirk??
By changing the C++ algorithm to cache sqrt(num) in an 64-bit integer and then make the loop purely integer based, I remove the problem-- alignment now makes no difference at all.

Here's what I found on Skylake for the same loop. All the code to reproduce my tests on your hardware is on github.
I observe three different performance levels based on alignment, whereas the OP only really saw 2 primary ones. The levels are very distinct and repeatable2:
We see three distinct performance levels here (the pattern repeats starting from offset 32), which we'll call regions 1, 2 and 3, from left to right (region 2 is split into two parts straddling region 3). The fastest region (1) is from offset 0 to 8, the middle (2) region is from 9-18 and 28-31, and the slowest (3) is from 19-27. The difference between each region is close to or exactly 1 cycle/iteration.
Based on the performance counters, the fastest region is very different from the other two:
All the instructions are delivered from the legacy decoder, not from the DSB1.
There are exactly 2 decoder <-> microcode switches (idq_ms_switches) for every iteration of the loop.
On the hand, the two slower regions are fairly similar:
All the instructions are delivered from the DSB (uop cache), and not from the legacy decoder.
There are exactly 3 decoder <-> microcode switches per iteration of the loop.
The transition from the fastest to the middle region, as the offset changes from 8 to 9, corresponds exactly to when the loop starts fitting in the uop buffer, because of alignment issues. You count this out in exactly the same way as Peter did in his answer:
Offset 8:
LSD? <_start.L37>:
ab 1 4000a8: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000ac: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000b1: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000b5: 72 21 jb 4000d8 <_start.L36>
ab 2 4000b7: 31 d2 xor edx,edx
ab 2 4000b9: 48 89 d8 mov rax,rbx
ab 3 4000bc: 48 f7 f1 div rcx
!!!! 4000bf: 48 85 d2 test rdx,rdx
4000c2: 74 0d je 4000d1 <_start.L30>
4000c4: 48 83 c1 01 add rcx,0x1
4000c8: 79 de jns 4000a8 <_start.L37>
In the first column I've annotated how the uops for each instruction end up in the uop cache. "ab 1" means they go in the set associated with address like ...???a? or ...???b? (each set covers 32 bytes, aka 0x20), while 1 means way 1 (out of a max of 3).
At the point !!! this busts out of the uop cache because the test instruction has no where to go, all the 3 ways are used up.
Let's look at offset 9 on the other hand:
00000000004000a9 <_start.L37>:
ab 1 4000a9: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000ad: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000b2: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000b6: 72 21 jb 4000d9 <_start.L36>
ab 2 4000b8: 31 d2 xor edx,edx
ab 2 4000ba: 48 89 d8 mov rax,rbx
ab 3 4000bd: 48 f7 f1 div rcx
cd 1 4000c0: 48 85 d2 test rdx,rdx
cd 1 4000c3: 74 0d je 4000d2 <_start.L30>
cd 1 4000c5: 48 83 c1 01 add rcx,0x1
cd 1 4000c9: 79 de jns 4000a9 <_start.L37>
Now there is no problem! The test instruction has slipped into the next 32B line (the cd line), so everything fits in the uop cache.
So that explains why stuff changes between the MITE and DSB at that point. It doesn't, however, explain why the MITE path is faster. I tried some simpler tests with div in a loop, and you can reproduce this with simpler loops without any of the floating point stuff. It's weird and sensitive to random other stuff you put in the loop.
For example this loop also executes faster out of the legacy decoder than the DSB:
ALIGN 32
<add some nops here to swtich between DSB and MITE>
.top:
add r8, r9
xor eax, eax
div rbx
xor edx, edx
times 5 add eax, eax
dec rcx
jnz .top
In that loop, adding the pointless add r8, r9 instruction, which doesn't really interact with the rest of the loop, sped things up for the MITE version (but not the DSB version).
So I think the difference between region 1 an region 2 and 3 is due to the former executing out of the legacy decoder (which, oddly, makes it faster).
Let's also take a look at the offset 18 to offset 19 transition (where region2 ends and 3 starts):
Offset 18:
00000000004000b2 <_start.L37>:
ab 1 4000b2: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000b6: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000bb: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000bf: 72 21 jb 4000e2 <_start.L36>
cd 1 4000c1: 31 d2 xor edx,edx
cd 1 4000c3: 48 89 d8 mov rax,rbx
cd 2 4000c6: 48 f7 f1 div rcx
cd 3 4000c9: 48 85 d2 test rdx,rdx
cd 3 4000cc: 74 0d je 4000db <_start.L30>
cd 3 4000ce: 48 83 c1 01 add rcx,0x1
cd 3 4000d2: 79 de jns 4000b2 <_start.L37>
Offset 19:
00000000004000b3 <_start.L37>:
ab 1 4000b3: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000b7: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000bc: 66 0f 2e f0 ucomisd xmm6,xmm0
cd 1 4000c0: 72 21 jb 4000e3 <_start.L36>
cd 1 4000c2: 31 d2 xor edx,edx
cd 1 4000c4: 48 89 d8 mov rax,rbx
cd 2 4000c7: 48 f7 f1 div rcx
cd 3 4000ca: 48 85 d2 test rdx,rdx
cd 3 4000cd: 74 0d je 4000dc <_start.L30>
cd 3 4000cf: 48 83 c1 01 add rcx,0x1
cd 3 4000d3: 79 de jns 4000b3 <_start.L37>
The only difference I see here is that the first 4 instructions in the offset 18 case fit into the ab cache line, but only 3 in the offset 19 case. If we hypothesize that the DSB can only deliver uops to the IDQ from one cache set, this means that at some point one uop may be issued and executed a cycle earlier in the offset 18 scenario than in the 19 scenario (imagine, for example, that the IDQ is empty). Depending on exactly what port that uop goes to in the context of the surrounding uop flow, that may delay the loop by one cycle. Indeed, the difference between region 2 and 3 is ~1 cycle (within the margin of error).
So I think we can say that the difference between 2 and 3 is likely due to uop cache alignment - region 2 has a slightly better alignment than 3, in terms of issuing one additional uop one cycle earlier.
Some addition notes on things I checked that didn't pan out as being a possible cause of the slowdowns:
Despite the DSB modes (regions 2 and 3) having 3 microcode switches versus the 2 of the MITE path (region 1), that doesn't seem to directly cause the slowdown. In particular, simpler loops with div execute in identical cycle counts, but still show 3 and 2 switches for DSB and MITE paths respectively. So that's normal and doesn't directly imply the slowdown.
Both paths execute essentially identical number of uops and, in particular, have identical number of uops generated by the microcode sequencer. So it's not like there is more overall work being done in the different regions.
There wasn't really an difference in cache misses (very low, as expected) at various levels, branch mispredictions (essentially zero3), or any other types of penalties or unusual conditions I checked.
What did bear fruit is looking at the pattern of execution unit usage across the various regions. Here's a look at the distribution of uops executed per cycle and some stall metrics:
+----------------------------+----------+----------+----------+
| | Region 1 | Region 2 | Region 3 |
+----------------------------+----------+----------+----------+
| cycles: | 7.7e8 | 8.0e8 | 8.3e8 |
| uops_executed_stall_cycles | 18% | 24% | 23% |
| exe_activity_1_ports_util | 31% | 22% | 27% |
| exe_activity_2_ports_util | 29% | 31% | 28% |
| exe_activity_3_ports_util | 12% | 19% | 19% |
| exe_activity_4_ports_util | 10% | 4% | 3% |
+----------------------------+----------+----------+----------+
I sampled a few different offset values and the results were consistent within each region, yet between the regions you have quite different results. In particular, in region 1, you have fewer stall cycles (cycles where no uop is executed). You also have significant variation in the non-stall cycles, although no clear "better" or "worse" trend is evident. For example, region 1 has many more cycles (10% vs 3% or 4%) with 4 uops executed, but the other regions largely make up for it with more cycles with 3 uops executed, and few cycles with 1 uop executed.
The difference in UPC4 that the execution distribution above implies fully explains the difference in performance (this is probably a tautology since we already confirmed the uop count is the same between them).
Let's see what toplev.py has to say about it ... (results omitted).
Well, toplev suggests that the primary bottleneck is the front-end (50+%). I don't think you can trust this because the way it calculates FE-bound seems broken in the case of long strings of micro-coded instructions. FE-bound is based on frontend_retired.latency_ge_8, which is defined as:
Retired instructions that are fetched after an interval where the
front-end delivered no uops for a period of 8 cycles which was not
interrupted by a back-end stall. (Supports PEBS)
Normally that makes sense. You are counting instructions which were delayed because the frontend wasn't delivering cycles. The "not interrupted by a back-end stall" condition ensures that this doesn't trigger when the front-end isn't delivering uops simply because is the backend is not able to accept them (e.g,. when the RS is full because the backend is performing some low-throuput instructions).
It kind of seems for div instructions - even a simple loop with pretty much just one div shows:
FE Frontend_Bound: 57.59 % [100.00%]
BAD Bad_Speculation: 0.01 %below [100.00%]
BE Backend_Bound: 0.11 %below [100.00%]
RET Retiring: 42.28 %below [100.00%]
That is, the only bottleneck is the front-end ("retiring" is not a bottleneck, it represents the useful work). Clearly, such a loop is trivially handled by the front-end and is instead limited by the backend's ability to chew threw all the uops generated by the div operation. Toplev might get this really wrong because (1) it may be that the uops delivered by the microcode sequencer aren't counted in the frontend_retired.latency... counters, so that every div operation causes that event to count all the subsequent instructions (even though the CPU was busy during that period - there was no real stall), or (2) the microcode sequencer might deliver all its ups essentially "up front", slamming ~36 uops into the IDQ, at which point it doesn't deliver any more until the div is finished, or something like that.
Still, we can look at the lower levels of toplev for hints:
The main difference toplev calls out between the regions 1 and region 2 and 3 is the increased penalty of ms_switches for the latter two regions (since they incur 3 every iteration vs 2 for the legacy path. Internally, toplev estimates a 2-cycle penalty in the frontend for such switches. Of course, whether these penalties actually slow anything down depends in a complex way on the instruction queue and other factors. As mentioned above, a simple loop with div doesn't show any difference between the DSB and MITE paths, a loop with additional instructions does. So it could be that the extra switch bubble is absorbed in simpler loops (where the backend processing of all the uops generated by the div is the main factor), but once you add some other work in the loop, the switches become a factor at least for the transition period between the div and non-div` work.
So I guess my conclusion is that the way the div instruction interacts with the rest of the frontend uop flow, and backend execution, isn't completely well understood. We know it involves a flood of uops, delivered both from the MITE/DSB (seems like 4 uops per div) and from the microcode sequencer (seems like ~32 uops per div, although it changes with different input values to the div op) - but we don't know what those uops are (we can see their port distribution though). All that makes the behavior fairly opaque, but I think it is probably down to either the MS switches bottlnecking the front-end, or slight differences in the uop delivery flow resulting in different scheduling decisions which end up making the MITE order master.
1 Of course, most of the uops are not delivered from the legacy decoder or DSB at all, but by the microcode sequencer (ms). So we loosely talk about instructions delivered, not uops.
2 Note that the x axis here is "offset bytes from 32B alignment". That is, 0 means the top of the loop (label .L37) is aligned to a 32B boundary, and 5 means the loop starts five bytes below a 32B boundary (using nop for padding) and so on. So my padding bytes and offset are the same. The OP used a different meaning for offset, if I understand it correctly: his 1 byte of padding resulted in a 0 offset. So you would subtract 1 from the OPs padding values to get my offset values.
3 In fact, the branch prediction rate for a typical test with prime=1000000000000037 was ~99.999997%, reflecting only 3 mispredicted branches in the entire run (likely on the first pass through the loop, and the last iteration).
4 UPC, i.e., uops per cycle - a measure closely related to IPC for similar programs, and one that is a bit more precise when we are looking in detail at uop flows. In this case, we already know the uop counts are the same for all variations of alignment, so UPC and IPC will be directly proportional.

I don't have a specific answer, just a few different hypotheses that I'm unable to test (lack of hardware). I thought I'd found something conclusive, but I had the alignment off by one (because the question counts padding from 0x5F, not from an aligned boundary). Anyway, hopefully it's useful to post this anyway to describe the factors that are probably at play here.
The question also doesn't specify the encoding of the branches (short (2B) or near (6B)). This leaves too many possibilities to look at and theorize about exactly which instruction crossing a 32B boundary or not is causing the issue.
I think it's either a matter of the loop fitting in the uop cache or not, or else it's a matter of alignment mattering for whether it decodes fast with the legacy decoders.
Obviously that asm loop could be improved a lot (e.g. by hoisting the floating-point out of it, not to mention using a different algorithm entirely), but that's not the question. We just want to know why alignment matters for this exact loop.
You might expect that a loop that bottlenecks on division wouldn't bottleneck on the front-end or be affected by alignment, because division is slow and the loop runs very few instructions per clock. That's true, but 64-bit DIV is micro-coded as 35-57 micro-ops (uops) on IvyBridge, so it turns out there can be front-end issues.
The two main ways alignment can matter are:
Front-end bottlenecks (in the fetch/decode stages), leading to bubbles in keeping the out-of-order core supplied with work to do.
Branch prediction: if two branches have the same address modulo some large power of 2, they can alias each other in the branch prediction hardware. Code alignment in one object file is affecting the performance of a function in another object file
scratches the surface of this issue, but much has been written about it.
I suspect this is a purely front-end issue, not branch prediction, since the code spends all its time in this loop, and isn't running other branches that might alias with the ones here.
Your Intel IvyBridge CPU is a die-shrink of SandyBridge. It has a few changes (like mov-elimination, and ERMSB), but the front-end is similar between SnB/IvB/Haswell. Agner Fog's microarch pdf has enough details to analyze what should happen when the CPU runs this code. See also David Kanter's SandyBridge writeup for a block diagram of the fetch/decode stages, but he splits the fetch/decode from the uop cache, microcode, and decoded-uop queue. At the end, there's a full block diagram of a whole core. His Haswell article has a block diagram including the whole front-end, up to the decoded-uop queue that feeds the issue stage. (IvyBridge, like Haswell, has a 56 uop queue / loopback buffer when not using Hyperthreading. Sandybridge statically partitions them into 2x28 uop queues even when HT is disabled.)
Image copied from David Kanter's also-excellent Haswell write-up, where he includes the decoders and uop-cache in one diagram.
Let's look at how the uop cache will probably cache this loop, once things settle down. (i.e. assuming that the loop entry with a jmp to the middle of the loop doesn't have any serious long-term effect on how the loop sits in the uop cache).
According to Intel's optimization manual (2.3.2.2 Decoded ICache):
All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have
their EIPs within the same aligned 32-byte region. (I think this means an instruction that extends past the boundary goes in the uop cache for the block containing its start, rather than end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block).
A multi micro-op instruction cannot be split across Ways.
An instruction which turns on the MSROM consumes an entire Way. (i.e. any instruction that takes more than 4 uops (for the reg,reg form) is microcoded. For example, DPPD is not micro-coded (4 uops), but DPPS is (6 uops). DPPD with a memory operand that can't micro-fuse would be 5 total uops, but still wouldn't need to turn on the microcode sequencer (not tested).
Up to two branches are allowed per Way.
A pair of macro-fused instructions is kept as one micro-op.
David Kanter's SnB writeup has some more great details about the uop cache.
Let's see how the actual code will go into the uop cache
# let's consider the case where this is 32B-aligned, so it runs in 0.41s
# i.e. this is at 0x402f60, instead of 0 like this objdump -Mintel -d output on a .o
# branch displacements are all 00, and I forgot to put in dummy labels, so they're using the rel32 encoding not rel8.
0000000000000000 <.text>:
0: 66 0f ef c0 pxor xmm0,xmm0 # 1 uop
4: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx # 2 uops
9: 66 0f 2e f0 ucomisd xmm6,xmm0 # 2 uops
d: 0f 82 00 00 00 00 jb 0x13 # 1 uop (end of one uop cache line of 6 uops)
13: 31 d2 xor edx,edx # 1 uop
15: 48 89 d8 mov rax,rbx # 1 uop (end of a uop cache line: next insn doesn't fit)
18: 48 f7 f1 div rcx # microcoded: fills a whole uop cache line. (And generates 35-57 uops)
1b: 48 85 d2 test rdx,rdx ### PROBLEM!! only 3 uop cache lines can map to the same 32-byte block of x86 instructions.
# So the whole block has to be re-decoded by the legacy decoders every time, because it doesn't fit in the uop-cache
1e: 0f 84 00 00 00 00 je 0x24 ## spans a 32B boundary, so I think it goes with TEST in the line that includes the first byte. Should actually macro-fuse.
24: 48 83 c1 01 add rcx,0x1 # 1 uop
28: 79 d6 jns 0x0 # 1 uop
So with 32B alignment for the start of the loop, it has to run from the legacy decoders, which is potentially slower than running from the uop cache. There could even be some overhead in switching from uop cache to legacy decoders.
#Iwill's testing (see comments on the question) reveals that any microcoded instruction prevents a loop from running from the loopback buffer. See comments on the question. (LSD = Loop Stream Detector = loop buffer; physically the same structure as the IDQ (instruction decode queue). DSB = Decode Stream Buffer = the uop cache. MITE = legacy decoders.)
Busting the uop cache will hurt performance even if the loop is small enough to run from the LSD (28 uops minimum, or 56 without hyperthreading on IvB and Haswell).
Intel's optimization manual (section 2.3.2.4) says the LSD requirements include
All micro-ops are also resident in the Decoded ICache.
So this explains why microcode doesn't qualify: in that case the uop-cache only holds a pointer into to microcode, not the uops themselves. Also note that this means that busting the uop cache for any other reason (e.g. lots of single-byte NOP instructions) means a loop can't run from the LSD.
With the minimum padding to go fast, according to the OP's testing.
# branch displacements are still 32-bit, except the loop branch.
# This may not be accurate, since the question didn't give raw instruction dumps.
# the version with short jumps looks even more unlikely
0000000000000000 <loop_start-0x64>:
...
5c: 00 00 add BYTE PTR [rax],al
5e: 90 nop
5f: 90 nop
60: 90 nop # 4NOPs of padding is just enough to bust the uop cache before (instead of after) div, if they have to go in the uop cache.
# But that makes little sense, because looking backward should be impossible (insn start ambiguity), and we jump into the loop so the NOPs don't even run once.
61: 90 nop
62: 90 nop
63: 90 nop
0000000000000064 <loop_start>: #uops #decode in cycle A..E
64: 66 0f ef c0 pxor xmm0,xmm0 #1 A
68: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx #2 B
6d: 66 0f 2e f0 ucomisd xmm6,xmm0 #2 C (crosses 16B boundary)
71: 0f 82 db 00 00 00 jb 152 #1 C
77: 31 d2 xor edx,edx #1 C
79: 48 89 d8 mov rax,rbx #1 C
7c: 48 f7 f1 div rcx #line D
# 64B boundary after the REX in next insn
7f: 48 85 d2 test rdx,rdx #1 E
82: 74 06 je 8a <loop_start+0x26>#1 E
84: 48 83 c1 01 add rcx,0x1 #1 E
88: 79 da jns 64 <loop_start>#1 E
The REX prefix of test rdx,rdx is in the same block as the DIV, so this should bust the uop cache. One more byte of padding would put it into the next 32B block, which would make perfect sense. Perhaps the OP's results are wrong, or perhaps prefixes don't count, and it's the position of the opcode byte that matters. Perhaps that matters, or perhaps a macro-fused test+branch is pulled to the next block?
Macro-fusion does happen across the 64B L1I-cache line boundary, since it doesn't fall on the boundary between instructions.
Macro fusion does not happen if the first instruction ends on byte 63 of a cache line, and the second instruction is a conditional branch that starts at byte 0 of the next cache line. -- Intel's optimization manual, 2.3.2.1
Or maybe with a short encoding for one jump or the other, things are different?
Or maybe busting the uop cache has nothing to do with it, and that's fine as long as it decodes fast, which this alignment makes happen. This amount of padding just barely puts the end of UCOMISD into a new 16B block, so maybe that actually improves efficiency by letting it decode with the other instructions in the next aligned 16B block. However, I'm not sure that a 16B pre-decode (instruction-length finding) or 32B decode block have to be aligned.
I also wondered if the CPU ends up switching from uop cache to legacy decode frequently. That can be worse than running from legacy decode all the time.
Switching from the decoders to the uop cache or vice versa takes a cycle, according to Agner Fog's microarch guide. Intel says:
When micro-ops cannot be stored in the Decoded ICache due to these restrictions, they are delivered from the legacy decode pipeline. Once micro-ops are delivered from the legacy pipeline, fetching micro-
ops from the Decoded ICache can resume only after the next branch micro-op. Frequent switches can incur a penalty.
The source that I assembled + disassembled:
.skip 0x5e
nop
# this is 0x5F
#nop # OP needed 1B of padding to reach a 32B boundary
.skip 5, 0x90
.globl loop_start
loop_start:
.L37:
pxor %xmm0, %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .Loop_exit // Exit the loop
.L20:
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .Lnot_prime // Failed divisibility test
addq $1, %rcx
jns .L37
.skip 200 # comment this to make the jumps rel8 instead of rel32
.Lnot_prime:
.Loop_exit:

From what I can see in your algorithm, there is certainly not much you can do to improve it.
The problem you are hitting is probably not so much the branch to an aligned position, although that can still help, you're current problem is much more likely the pipeline mechanism.
When you write two instructions one after another such as:
mov %eax, %ebx
add 1, %ebx
In order to execute the second instruction, the first one has to be complete. For that reason compilers tend to mix instructions. Say you need to set %ecx to zero, you could do this:
mov %eax, %ebx
xor %ecx, %ecx
add 1, %ebx
In this case, the mov and the xor can both be executed in parallel. This makes things go faster... The number of instructions that can be handled in parallel vary very much between processors (Xeons are generally better at that).
The branch adds another parameter where the best processors may start executing both sides of the branch (the true and the false...) simultaneously. But really most processors will make a guess and hope they are right.
Finally, it is obvious that converting the sqrt() result to an integer will make things a lot faster since you will avoid all that non-sense with SSE2 code which is definitively slower if used only for a conversion + compare when those two instructions could be done with integers.
Now... you are probably still wondering why the alignment does not matter with the integers. The fact is that if your code fits in the L1 instruction cache, then the alignment is not important. If you lose the L1 cache, then it has to reload the code and that's where the alignment becomes quite important since on each loop it could otherwise be loading useless code (possibly 15 bytes of useless code...) and memory access is still dead slow.

The performance difference can be explained by the different ways the instruction encoding mechanism "sees" the instructions. A CPU reads the instructions in chunks (was on core2 16 byte I believe) and it tries to give the different superscalar units microops. If the instructions are on boundaries or ordered unlikely the units in one core can starve quite easily.

Related

Does a Length-Changing Prefix (LCP) incur a stall on a simple x86_64 instruction?

Consider a simple instruction like
mov RCX, RDI # 48 89 f9
The 48 is the REX prefix for x86_64. It is not an LCP. But consider adding an LCP (for alignment purposes):
.byte 0x67
mov RCX, RDI # 67 48 89 f9
67 is an address size prefix which in this case is for an instruction without addresses. This instruction also has no immediates, and it doesn't use the F7 opcode (False LCP stalls; F7 would be TEST, NOT, NEG, MUL, IMUL, DIV + IDIV). Assume that it doesn't cross a 16-byte boundary either. Those are the LCP stall cases mentioned in Intel's Optimization Reference Manual.
Would this instruction incur an LCP stall (on Skylake, Haswell, ...) ? What about two LCPs?
My daily driver is a MacBook. So I don't have access to VTune and I can't look at the ILD_STALL event. Is there any other way to know?

TL:DR: 67h is safe here on all CPUs. In 64-bit mode1, 67h can only LCP-stall with addr32 movabs load/store of the accumulator (AL/AX/EAX/RAX) from/to a moffs 32-bit absolute address (vs. the normal 64-bit absolute for that special opcode).
67h isn't length-changing with normal instructions that use a ModRM, even with a disp32 component in the addressing mode, because 32 and 64-bit address-size use identical ModRM formats. That 67h-LCP-stallable form of mov is special and doesn't use a modrm addressing mode.
(It also almost certainly won't have some other meaning in future CPUs, like being part of longer opcode the way rep is3.)
A Length Changing Prefix is when the opcode(+modrm) would imply a different length in bytes for the non-prefixes part of the instruction's machine code, if you ignored prefixes. I.e. it changes the length of the rest of the instruction. (Parallel length-finding is hard, and done separately from full decode: Later insns in a 16-byte block don't even have known start points. So this min(16-byte, 6-instruction) stage needs to look at as few bits as possible after prefixes, for the normal fast case to work. This is the stage where LCP stalls can happen.)
Usually only with an actual imm16 / imm32 opcode, e.g. 66h is length-changing in add cx, 1234, but not add cx, 12: after prefixes or in the appropriate mode, add r/m16, imm8 and add r/m32, imm8 are both opcode + modrm + imm8, 3 bytes regardless, (https://www.felixcloutier.com/x86/add). Pre-decode hardware can find the right length by just skipping prefixes, not modifying interpretation of later opcode+modrm based on what it saw, unlike when 66h means the opcode implies 2 immediate bytes instead of 4. Assemblers will always pick the imm8 encoding when possible because it's shorter (or equal length for the no-modrm add ax, imm16 special case).
(Note that REX.W=1 is length-changing for mov r64, imm64 vs. mov r32, imm32, but all hardware handles that relatively common instruction efficiently so only 66h and 67h can ever actually LCP-stall.)
SnB-family doesn't have any false2 LCP stalls for prefixes that can be length-changing for this opcode but not this this particular instruction, for either 66h or 67h. So F7 is a non-issue on SnB, unlike Core2 and Nehalem. (Earlier P6-family Intel CPUs didn't support 64-bit mode.) Atom/Silvermont don't have LCP penalties at all, nor do AMD or Via CPUs.
Agner Fog's microarch guide covers this well, and explains things clearly. Search for "length-changing prefixes". (This answer is an attempt to put those pieces together with some reminders about how x86 instruction encoding works, etc.)
Footnote 1: 67h increases length-finding difficulty more in non-64-bit modes:
In 64-bit mode, 67h changes from 64 to 32-bit address size, both of which use disp0 / 8 / 32 (0, 1, or 4 bytes of immediate displacement as part of the instruction), and which use the same ModRM + optional SIB encoding for normal addressing modes. RIP+rel32 repurposes the shorter (no SIB) encoding of 32-bit mode's two redundant ways to encode [disp32], so length decoding is unaffected. Note that REX was already designed not to be length-changing (except for mov r64, imm64), by burdening R13 and R12 in the same ways as RBP and RSP as ModRM "escape codes" to signal no base reg, or presence of a SIB byte, respectively.
In 16 and 32-bit modes, 67h switches to 32 or 16-bit address size. Not only are [x + disp32] vs. [x + disp16] different lengths after the ModRM byte (just like immediates for the operand-size prefix), but also 16-bit address-size can't signal a SIB byte. Why don't x86 16-bit addressing modes have a scale factor, while the 32-bit version has it? So the same bits in the mode and /rm fields can imply different lengths.
Footnote 2: "False" LCP stalls
This need (see footnote 1) to sometimes look differently at ModRM even to find the length is presumably why Intel CPUs before Sandybridge have "false" LCP stalls in 16/32-bit modes on 67h prefixes on any instruction with a ModRM, even when they aren't length-changing (e.g. register addressing mode). Instead of optimistically length-finding and checking somehow, a Core2/Nehalem just punts if they see addr32 + most opcodes, if they're not in 64-bit mode.
Fortunately there's basically zero reason to ever use it in 32-bit code so this mostly only matters for 16-bit code that uses 32-bit registers without switching to protected mode. Or code using 67h for padding like you're doing, except in 32-bit mode. .byte 0x67 / mov ecx, edi would be a problem for Core 2 / Nehalem. (I didn't check earlier 32-bit-only P6 family CPUs. They're a lot more obsolete than Nehalem.)
False LCP stalls for 67h never happen in 64-bit mode; as discussed above that's the easy case, and the length pre-decoders already have to know what mode they're in, so fortunately there's no downside to using it for padding. Unlike rep (which could become part of some future opcode), 67h is extremely likely to be safely ignored for instructions where it can apply to some form of the same opcode, even if there isn't actually a memory operand for this one.
Sandybridge-family doesn't ever have any false LCP stalls, removing both the 16/32-bit mode address-size (67h) and the all-modes 66 F7 cases (which needs to look at ModRM to disambiguate instructions like neg di or mul di from test di, imm16.)
SnB-family also removes some 66h true-LCP stalls, e.g. from mov-immediate like mov word ptr [rdi], 0 which is actually useful.
Footnote 3: forward compat of using 67h for padding
When 67h applies to the opcode in general (i.e. it can use a memory operand), it's very unlikely that it will mean something else for the same opcode with a modrm that just happens to encode reg,reg operands. So this is safe for What methods can be used to efficiently extend instruction length on modern x86?.
In fact, "relaxing" a 6-byte call [RIP+rel32] to a 5-byte call rel32 is done by GNU binutils by padding the call rel32 with a 67h address-size prefix, even though that's never meaningful for E8 call rel32. (This happens when linking code compiled with -fno-plt, which uses call [RIP + foo#gotpcrel] for any foo that's not found in the current compilation unit and doesn't have "hidden" visibility.)
But that's not a good precedent: at this point it's too widespread for CPU vendors to want to break that specific prefix+opcode combo (like for What does `rep ret` mean?), but some homebrewed thing in your program like 67h cdq would not get the same treatment from vendors.
The rules, for Sandybridge-family CPUs
edited/condensed from Agner's microarch PDF, these cases can LCP-stall, taking an extra 2 to 3 cycles in pre-decode (if they miss in the uop cache).
Any ALU op with an imm16 that would be imm32 without a 66h. (Except mov-immediate).
Remember that mov and test don't have imm8 forms for wider operand-size so prefer test al, 1, or imm32 if necessary. Or sometimes even test ah, imm8 if you want to test bits in the top half of AX, although beware of 1 cycle of extra latency for reading AH after writing the full reg on HSW and later. GCC uses this trick but should maybe start being careful with it, perhaps sometimes using bt reg, imm8 when feeding a setcc or cmovcc (which can't macro-fuse with test like JCC can).
67h with movabs moffs (A0/A1/A2/A3 opcodes in 64-bit mode, and probably also in 16 or 32-bit mode). Confirmed by my testing with perf counters for ild_stall.lcp on Skylake when LLVM was deciding whether to optimize mov al, [0x123456] to use 67 A0 4-byte-address or a normal opcode + modrm + sib + disp32 (to get absolute instead of rip-relative). That refers to an old version of Agner's guide; he updated soon after I sent him my test results.
If one of the instructions NEG, NOT, DIV, IDIV, MUL and IMUL with a single operand
has a 16-bit operand and there is a 16-bytes boundary between the opcode byte and
the mod-reg-rm byte. These instructions have a bogus length-changing prefix
because these instructions have the same opcode as the TEST instruction with a 16-
bit immediate operand [...]
No penalty on SnB-family for div cx or whatever, regardless of alignment.
The address size prefix (67H) will always cause a delay in 16-bit and 32-bit mode on any
instruction that has a mod/reg/rm byte even if it doesn't change the length of the instruction.
SnB-family removed this penalty, making address-size prefixes usable as padding if you're careful.
Or to summarize another way:
SnB-family has no false LCP stalls.
SnB-family has LCP stalls on every 66h and 67h true LCP except for:
mov r/m16, imm16 and the mov r16, imm16 no-modrm version.
67h address size interaction with ModRM (in 16/32-bit modes).
(That excludes the no-modrm absolute address load/store of AL/AX/EAX/RAX forms- they can still LCP-stall, presumably even in 32-bit mode, like in 64-bit.)
Length-changing REX doesn't stall (on any CPU).
Some examples
(This part ignores the false LCP stalls that some CPUs have in some non-length-changing cases which turn out not to matter here, but perhaps that's why you were worried about 67h for mov reg,reg.)
In your case, the rest of the instruction bytes, starting after the 67, decode as a 3-byte instruction whether the current address-size is 32 or 64. Same even with addressing modes like mov eax, [e/rsi + 1024] (reg+disp32) or addr32 mov edx, [RIP + rel32].
In 16 and 32-bit modes, 67h switches between 16 and 32-bit address size. [x + disp32] vs. [x + disp16] are different lengths after the ModRM byte, but also non-16-bit address-size can signal a SIB byte depending on the R/M field. But in 64-bit mode, 32 and 64-bit address-size both use [x + disp32], and the same ModRM->SIB or not encoding.
There is only one case where a 67h address-size prefix is length-changing in 64-bit mode: movabs load/store with 8-byte vs. 4-byte absolute addresses, and yes it does LCP-stall Intel CPUs. (I posted test results on https://bugs.llvm.org/show_bug.cgi?id=34733#c3)
For example, addr32 movabs [0x123456], al
.intel_syntax noprefix
addr32 mov [0x123456], cl # non-AL to make movabs impossible
mov [0x123456], al # GAS picks normal absolute [disp32]
addr32 mov [0x123456], al # GAS picks A2 movabs since addr32 makes that the shortest choice, same as NASM does.
movabs [0x123456], al # 64-bit absolute address
Note that GAS (fortunately) doesn't choose to use an addr32 prefix on its own, even with as -Os (gcc -Wa,-Os).
$ gcc -c foo.s
$ objdump -drwC -Mintel foo.o
...
0: 67 88 0c 25 56 34 12 00 mov BYTE PTR ds:0x123456,cl
8: 88 04 25 56 34 12 00 mov BYTE PTR ds:0x123456,al # same encoding after the 67
f: 67 a2 56 34 12 00 addr32 mov ds:0x123456,al
15: a2 56 34 12 00 00 00 00 00 movabs ds:0x123456,al # different length for same opcode
As you can see from the last 2 instructions, using the a2 mov moffs, al opcode, with a 67 the rest of the instruction is a different length for the same opcode.
This does LCP-stall on Skylake, so it's only fast when running from the uop cache.
Of course the more common source of LCP stalls is with the 66 prefix and an imm16 (instead of imm32). Like add ax, 1234, as in this random test where I wanted to see if jumping over the LCP-stalling instruction could avoid the problem: Label in %rep section in NASM. But not cases like add ax, 12 which will use add r/m16, imm8 (which is the same length after the 66 prefix as add r/m32, imm8).
Also, Sandybridge-family reportedly avoid LCP stalls for mov-immediate with 16-bit immediate.
Related:
Another example of working around add r/m16, imm16: add 1 byte immediate value to a 2 bytes memory location
x86 assembly 16 bit vs 8 bit immediate operand encoding - choose add r/m16, imm8 instead of the also-3-byte add ax, imm16 form.
Sign or Zero Extension of address in 64bit mode for MOV moffs32? - how address-size interacts with the moffs forms of movabs. (The kind that can LCP-stall)
What methods can be used to efficiently extend instruction length on modern x86? - the general case of what you're doing.
Tuning advice and uarch details:
Usually don't try to save space with addr32 mov [0x123456], al, except maybe when it's a choice between saving 1 byte or using 15 bytes of padding including actual NOPs inside a loop. (more tuning advice below)
One LCP stall usually won't be a disaster with a uop cache, especially if length-decoding probably isn't a front-end bottleneck here (although it often can be if the front-end is a bottleneck at all). Hard to test a single instance in one function by micro-benchmarking, though; only a real full-app benchmark will accurately reflect when the code can run from the uop cache (what Intel perf counters call the DSB), bypassing legacy decode (MITE).
There are queues between stages in modern CPUs that can at least partly absorb stalls https://www.realworldtech.com/haswell-cpu/2/ (moreso than in PPro/PIII), and SnB-family has shorter LCP-stalls than Core2/Nehalem. (But other reasons for pre-decode slowness already dips into their capacity, and after an I-cache miss they may all be empty.)
When prefixes aren't length-changing, the pre-decode pipeline stage that finds instruction boundaries (before steering chunks of bytes to actual complex/simple decoders or doing actual decoding) will find the correct instruction-length / end by skipping all prefixes and then looking at just the opcode (and modrm if applicable).
This pre-decode length-finding is where LCP stalls happen, so fun fact: even Core 2's pre-decode loop buffer can hide LCP stalls in subsequent iterations because it locks down up to 64 bytes / 18 insns of x86 machine code after finding instruction boundaries, using the decode queue (pre-decode output) as a buffer.
In later CPUs, the LSD and uop cache are post decode, so unless something defeats the uop cache (like the pesky JCC-erratum mitigation or simply having too many uops for the uop cache in a 32-byte aligned block of x86 machine code), loops only pay the LCP-stall cost on the first iteration, if they weren't already hot.
I'd say generally work around LCP stalls if you can do so cheaply, especially for code that usually runs "cold". Or if you can just use 32-bit operand-size and avoid partial-register shenanigans, costing usually only a byte of code-size and no extra instructions or uops. Or if you'd have multiple LCP stalls in a row, e.g. from naively using 16-bit immediates, that would be too many bubbles for buffers to hide so you'd have a real problem and it's worth spending extra instructions. (e.g. mov eax, imm32 / add [mem], ax, or movzx load / add r32,imm32 / store, or whatever.)
Padding to end 16-byte fetch blocks at instruction boundaries: not needed
(This is separate from aligning the start of a fetch block at a branch target, which is also sometimes unnecessary given the uop cache.)
Wikichip's section on Skylake pre-decode is incorrectly implies that a partial instruction left at the end of a block has to pre-decode on its own, rather than along with the next 16-byte group that contains the end of the instruction. It seems to be paraphrased from Agner Fog's text, with some changes and additions that make it wrong:
[from wikichip...] As with previous microarchitectures, the pre-decoder has a throughput of 6 macro-ops per cycle or until all 16 bytes are consumed, whichever happens first. Note that the predecoder will not load a new 16-byte block until the previous block has been fully exhausted. For example, suppose a new chunk was loaded, resulting in 7 instructions. In the first cycle, 6 instructions will be processed and a whole second cycle will be wasted for that last instruction. This will produce the much lower throughput of 3.5 instructions per cycle which is considerably less than optimal.
[this part is paraphrased from Agner Fog's Core2/Nehalem section, with the word "fully" having been added"]
Likewise, if the 16-byte block resulted in just 4 instructions with 1 byte of the 5th instruction received, the first 4 instructions will be processed in the first cycle and a second cycle will be required for the last instruction. This will produce an average throughput of 2.5 instructions per cycle.
[nothing like this appears in the current version of Agner's guide, IDK where this misinformation came from. Perhaps made up based on a misunderstanding of what Agner said, but without testing.]
Fortunately no. The rest of the instruction is in the next fetch block, so reality makes a lot more sense: the leftover bytes are prepended to the next 16-byte block.
(Starting a new 16-byte pre-decode block starting with this instruction would also have been plausible, but my testing rules that out: 2.82 IPC with a repeating 5,6,6 byte = 17-byte pattern. If it only ever looked at 16 bytes and left the partial 5 or 6-byte instruction to be the start of the next block, that would give us 2 IPC.)
A repeating pattern of 3x 5 byte instructions unrolled many times (a NASM %rep 2500 or GAS .rept 2500 block, so 7.5k instructions in ~36kiB) runs at 3.19 IPC, pre-decoding and decoding at ~16 bytes per cycle. (16 bytes/cycle) / (5 bytes/insn) = 3.2 instructions per cycle theoretical.
(If wikichip was right, it would predict close to 2 IPC in a 3-1 pattern, which is of course unreasonably low and would be not be an acceptable design for Intel for long runs of long or medium-length when running from legacy decode. 2 IPC is so much narrower than the 4-wide pipeline that it would not be ok even for legacy decode. Intel learned from P4 that running at least decently well from legacy decode is important, even when your CPU caches decoded uops. That's why SnB's uop cache can be so small, only ~1.5k uops. A lot smaller than P4's trace cache, but P4's problem was trying to replace L1i with a trace cache, and having weak decoders. (Also the fact it was a trace cache, so it cached the same code multiple times.))
These perf differences are large enough that you can verify it on your Mac, using a plenty-large repeat count so you don't need perf counters to verify uop-cache misses. (Remember that L1i is inclusive of uop cache, so loops that don't fit in L1i will also evict themselves from uop cache.) Anyway, just measuring total time and knowing the approximate max-turbo that you'll hit is sufficient for a sanity check like this.
Getting better than the theoretical-max that wikichip predicts, even after startup overhead and conservative frequency estimates, will completely rule out that behaviour even on a machine where you don't have perf counters.
$ nasm -felf64 && ld # 3x 5 bytes, repeated 2.5k times
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_retired.retire_slots,uops_executed.thread,idq.dsb_uops -r2 ./testloop
Performance counter stats for './testloop' (2 runs):
604.16 msec task-clock # 1.000 CPUs utilized ( +- 0.02% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1 page-faults # 0.002 K/sec
2,354,699,144 cycles # 3.897 GHz ( +- 0.02% )
7,502,000,195 instructions # 3.19 insn per cycle ( +- 0.00% )
7,506,746,328 uops_issued.any # 12425.167 M/sec ( +- 0.00% )
7,506,686,463 uops_retired.retire_slots # 12425.068 M/sec ( +- 0.00% )
7,506,726,076 uops_executed.thread # 12425.134 M/sec ( +- 0.00% )
0 idq.dsb_uops # 0.000 K/sec
0.6044392 +- 0.0000998 seconds time elapsed ( +- 0.02% )
(and from another run):
7,501,076,096 idq.mite_uops # 12402.209 M/sec ( +- 0.00% )
No clue why idq.mite_uops:u is not equal to issued or retired. There's nothing to un-laminate, and no stack-sync uops should be necessary, so IDK where the extra issued+retired uops could be coming from. The excess is consistent across runs, and proportional to the %rep count I think.
With other pattern like 5-5-6 (16 bytes) and 5-6-6 (17 bytes), I get similar results.
I do sometimes measure a slight difference when the 16-byte groups are misaligned relative to an absolute 16-byte boundary or not (put a nop at the top of the loop). But that seems to only happen with larger repeat counts. %rep 2500 for 39kiB total size, I still get 2.99 IPC (just under one 16-byte group per cycle), with 0 DSB uops, regardless of aligned vs. misaligned.
I still get 2.99IPC at %rep 5000, but I see a diff at %rep 10000: 2.95 IPC misaligned vs. 2.99 IPC aligned. That largest %rep count is ~156kiB and still fits in the 256k L2 cache so IDK why anything would be different from half that size. (They're much larger than 32k Li1). I think earlier I was seeing a different at 5k, but I can't repro it now. Maybe that was with 17-byte groups.
The actual loop runs 1000000 times in a static executable under _start, with a raw syscall to _exit, so perf counters (and time) for the whole process is basically just the loop. (especially with perf --all-user to only count user-space.)
; complete Linux program
default rel
%use smartalign
alignmode p6, 64
global _start
_start:
mov ebp, 1000000
align 64
.loop:
%ifdef MISALIGN
nop
%endif
%rep 2500
mov eax, 12345 ; 5 bytes.
mov ecx, 123456 ; 5 bytes. Use r8d for 6 bytes
mov edx, 1234567 ; 5 bytes. Use r9d for 6 bytes
%endrep
dec ebp
jnz .loop
.end:
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)

What methods can be used to efficiently extend instruction length on modern x86?

Imagine you want to align a series of x86 assembly instructions to certain boundaries. For example, you may want to align loops to a 16 or 32-byte boundary, or pack instructions so they are efficiently placed in the uop cache or whatever.
The simplest way to achieve this is single-byte NOP instructions, followed closely by multi-byte NOPs. Although the latter is generally more efficient, neither method is free: NOPs use front-end execution resources, and also count against your 4-wide1 rename limit on modern x86.
Another option is to somehow lengthen some instructions to get the alignment you want. If this is done without introducing new stalls, it seems better than the NOP approach. How can instructions be efficiently made longer on recent x86 CPUs?
In the ideal world lengthening techniques would simultaneously be:
Applicable to most instructions
Capable of lengthening the instruction by a variable amount
Not stall or otherwise slow down the decoders
Be efficiently represented in the uop cache
It isn't likely that there is a single method that satisfies all of the above points simultaneously, so good answers will probably address various tradeoffs.
1The limit is 5 or 6 on AMD Ryzen.

Consider mild code-golfing to shrink your code instead of expanding it, especially before a loop. e.g. xor eax,eax / cdq if you need two zeroed registers, or mov eax, 1 / lea ecx, [rax+1] to set registers to 1 and 2 in only 8 total bytes instead of 10. See Set all bits in CPU register to 1 efficiently for more about that, and Tips for golfing in x86/x64 machine code for more general ideas. Probably you still want to avoid false dependencies, though.
Or fill extra space by creating a vector constant on the fly instead of loading it from memory. (Adding more uop-cache pressure could be worse, though, for the larger loop that contains your setup + inner loop. But it avoids d-cache misses for constants, so it has an upside to compensate for running more uops.)
If you weren't already using them to load "compressed" constants, pmovsxbd, movddup, or vpbroadcastd are longer than movaps. dword / qword broadcast loads are free (no ALU uop, just a load).
If you're worried about code alignment at all, you're probably worried about how it sits in the L1I cache or where the uop-cache boundaries are, so just counting total uops is no longer sufficient, and a few extra uops in the block before the one you care about may not be a problem at all.
But in some situations, you might really want to optimize decode throughput / uop-cache usage / total uops for the instructions before the block you want aligned.
Padding instructions, like the question asked for:
Agner Fog has a whole section on this: "10.6 Making instructions longer for the sake of alignment" in his "Optimizing subroutines in assembly language" guide. (The lea, push r/m64, and SIB ideas are from there, and I copied a sentence / phrase or two, otherwise this answer is my own work, either different ideas or written before checking Agner's guide.)
It hasn't been updated for current CPUs, though: lea eax, [rbx + dword 0] has more downsides than it used to vs mov eax, ebx, because you miss out on zero-latency / no execution unit mov. If it's not on the critical path, go for it though. Simple lea has fairly good throughput, and an LEA with a large addressing mode (and maybe even some segment prefixes) can be better for decode / execute throughput than mov + nop.
Use the general form instead of the short form (no ModR/M) of instructions like push reg or mov reg,imm. e.g. use 2-byte push r/m64 for push rbx. Or use an equivalent instruction that is longer, like add dst, 1 instead of inc dst, in cases where there are no perf downsides to inc so you were already using inc.
Use SIB byte. You can get NASM to do that by using a single register as an index, like mov eax, [nosplit rbx*1] (see also), but that hurts the load-use latency vs. simply encoding mov eax, [rbx] with a SIB byte. Indexed addressing modes have other downsides on SnB-family, like un-lamination and not using port7 for stores.
So it's best to just encode base=rbx + disp0/8/32=0 using ModR/M + SIB with no index reg. (The SIB encoding for "no index" is the encoding that would otherwise mean idx=RSP). [rsp + x] addressing modes require a SIB already (base=RSP is the escape code that means there's a SIB), and that appears all the time in compiler-generated code. So there's very good reason to expect this to be fully efficient to decode and execute (even for base registers other than RSP) now and in the future. NASM syntax can't express this, so you'd have to encode manually. GNU gas Intel syntax from objdump -d says 8b 04 23 mov eax,DWORD PTR [rbx+riz*1] for Agner Fog's example 10.20. (riz is a fictional index-zero notation that means there's a SIB with no index). I haven't tested if GAS accepts that as input.
Use an imm32 and/or disp32 form of an instruction that only needed imm8 or disp0/disp32. Agner Fog's testing of Sandybridge's uop cache (microarch guide table 9.1) indicates that the actual value of an immediate / displacement is what matters, not the number of bytes used in the instruction encoding. I don't have any info on Ryzen's uop cache.
So NASM imul eax, [dword 4 + rdi], strict dword 13 (10 bytes: opcode + modrm + disp32 + imm32) would use the 32small, 32small category and take 1 entry in the uop cache, unlike if either the immediate or disp32 actually had more than 16 significant bits. (Then it would take 2 entries, and loading it from the uop cache would take an extra cycle.)
According to Agner's table, 8/16/32small are always equivalent for SnB. And addressing modes with a register are the same whether there's no displacement at all, or whether it's 32small, so mov dword [dword 0 + rdi], 123456 takes 2 entries, just like mov dword [rdi], 123456789. I hadn't realized [rdi] + full imm32 took 2 entries, but apparently that' is the case on SnB.
Use jmp / jcc rel32 instead of rel8. Ideally try to expand instructions in places that don't require longer jump encodings outside the region you're expanding. Pad after jump targets for earlier forward jumps, pad before jump targets for later backward jumps, if they're close to needing a rel32 somewhere else. i.e. try to avoid padding between a branch and its target, unless you want that branch to use a rel32 anyway.
You might be tempted to encode mov eax, [symbol] as 6-byte a32 mov eax, [abs symbol] in 64-bit code, using an address-size prefix to use a 32-bit absolute address. But this does cause a Length-Changing-Prefix stall when it decodes on Intel CPUs. Fortunately, none of NASM/YASM / gas / clang do this code-size optimization by default if you don't explicitly specify a 32-bit address-size, instead using 7-byte mov r32, r/m32 with a ModR/M+SIB+disp32 absolute addressing mode for mov eax, [abs symbol].
In 64-bit position-dependent code, absolute addressing is a cheap way to use 1 extra byte vs. RIP-relative. But note that 32-bit absolute + immediate takes 2 cycles to fetch from uop cache, unlike RIP-relative + imm8/16/32 which takes only 1 cycle even though it still uses 2 entries for the instruction. (e.g. for a mov-store or a cmp). So cmp [abs symbol], 123 is slower to fetch from the uop cache than cmp [rel symbol], 123, even though both take 2 entries each. Without an immediate, there's no extra cost for
Note that PIE executables allow ASLR even for the executable, and are the default in many Linux distro, so if you can keep your code PIC without any perf downsides, then that's preferable.
Use a REX prefix when you don't need one, e.g. db 0x40 / add eax, ecx.
It's not in general safe to add prefixes like rep that current CPUs ignore, because they might mean something else in future ISA extensions.
Repeating the same prefix is sometimes possible (not with REX, though). For example, db 0x66, 0x66 / add ax, bx gives the instruction 3 operand-size prefixes, which I think is always strictly equivalent to one copy of the prefix. Up to 3 prefixes is the limit for efficient decoding on some CPUs. But this only works if you have a prefix you can use in the first place; you usually aren't using 16-bit operand-size, and generally don't want 32-bit address-size (although it's safe for accessing static data in position-dependent code).
A ds or ss prefix on an instruction that accesses memory is a no-op, and probably doesn't cause any slowdown on any current CPUs. (#prl suggested this in comments).
In fact, Agner Fog's microarch guide uses a ds prefix on a movq
[esi+ecx],mm0 in Example 7.1. Arranging IFETCH blocks to tune a loop for PII/PIII (no loop buffer or uop cache), speeding it up from 3 iterations per clock to 2.
Some CPUs (like AMD) decode slowly when instructions have more than 3 prefixes. On some CPUs, this includes the mandatory prefixes in SSE2 and especially SSSE3 / SSE4.1 instructions. In Silvermont, even the 0F escape byte counts.
AVX instructions can use a 2 or 3-byte VEX prefix. Some instructions require a 3-byte VEX prefix (2nd source is x/ymm8-15, or mandatory prefixes for SSSE3 or later). But an instruction that could have used a 2-byte prefix can always be encoded with a 3-byte VEX. NASM or GAS {vex3} vxorps xmm0,xmm0. If AVX512 is available, you can use 4-byte EVEX as well.
Use 64-bit operand-size for mov even when you don't need it, for example mov rax, strict dword 1 forces the 7-byte sign-extended-imm32 encoding in NASM, which would normally optimize it to 5-byte mov eax, 1.
mov eax, 1 ; 5 bytes to encode (B8 imm32)
mov rax, strict dword 1 ; 7 bytes: REX mov r/m64, sign-extended-imm32.
mov rax, strict qword 1 ; 10 bytes to encode (REX B8 imm64). movabs mnemonic for AT&T.
You could even use mov reg, 0 instead of xor reg,reg.
mov r64, imm64 fits efficiently in the uop cache when the constant is actually small (fits in 32-bit sign extended.) 1 uop-cache entry, and load-time = 1, the same as for mov r32, imm32. Decoding a giant instruction means there's probably not room in a 16-byte decode block for 3 other instructions to decode in the same cycle, unless they're all 2-byte. Possibly lengthening multiple other instructions slightly can be better than having one long instruction.
Decode penalties for extra prefixes:
P5: prefixes prevent pairing, except for address/operand-size on PMMX only.
PPro to PIII: There is always a penalty if an instruction has more than one prefix. This penalty is usually one clock per extra prefix. (Agner's microarch guide, end of section 6.3)
Silvermont: it's probably the tightest constraint on which prefixes you can use, if you care about it. Decode stalls on more than 3 prefixes, counting mandatory prefixes + 0F escape byte. SSSE3 and SSE4 instructions already have 3 prefixes so even a REX makes them slow to decode.
some AMD: maybe a 3-prefix limit, not including escape bytes, and maybe not including mandatory prefixes for SSE instructions.
... TODO: finish this section. Until then, consult Agner Fog's microarch guide.
After hand-encoding stuff, always disassemble your binary to make sure you got it right. It's unfortunate that NASM and other assemblers don't have better support for choosing cheap padding over a region of instructions to reach a given alignment boundary.
Assembler syntax
NASM has some encoding override syntax: {vex3} and {evex} prefixes, NOSPLIT, and strict byte / dword, and forcing disp8/disp32 inside addressing modes. Note that [rdi + byte 0] isn't allowed, the byte keyword has to come first. [byte rdi + 0] is allowed, but I think that looks weird.
Listing from nasm -l/dev/stdout -felf64 padding.asm
line addr machine-code bytes source line
num
4 00000000 0F57C0 xorps xmm0,xmm0 ; SSE1 *ps instructions are 1-byte shorter
5 00000003 660FEFC0 pxor xmm0,xmm0
6
7 00000007 C5F058DA vaddps xmm3, xmm1,xmm2
8 0000000B C4E17058DA {vex3} vaddps xmm3, xmm1,xmm2
9 00000010 62F1740858DA {evex} vaddps xmm3, xmm1,xmm2
10
11
12 00000016 FFC0 inc eax
13 00000018 83C001 add eax, 1
14 0000001B 4883C001 add rax, 1
15 0000001F 678D4001 lea eax, [eax+1] ; runs on fewer ports and doesn't set flags
16 00000023 67488D4001 lea rax, [eax+1] ; address-size and REX.W
17 00000028 0501000000 add eax, strict dword 1 ; using the EAX-only encoding with no ModR/M
18 0000002D 81C001000000 db 0x81, 0xC0, 1,0,0,0 ; add eax,0x1 using the ModR/M imm32 encoding
19 00000033 81C101000000 add ecx, strict dword 1 ; non-eax must use the ModR/M encoding
20 00000039 4881C101000000 add rcx, strict qword 1 ; YASM requires strict dword for the immediate, because it's still 32b
21 00000040 67488D8001000000 lea rax, [dword eax+1]
22
23
24 00000048 8B07 mov eax, [rdi]
25 0000004A 8B4700 mov eax, [byte 0 + rdi]
26 0000004D 3E8B4700 mov eax, [ds: byte 0 + rdi]
26 ****************** warning: ds segment base generated, but will be ignored in 64-bit mode
27 00000051 8B8700000000 mov eax, [dword 0 + rdi]
28 00000057 8B043D00000000 mov eax, [NOSPLIT dword 0 + rdi*1] ; 1c extra latency on SnB-family for non-simple addressing mode
GAS has encoding-override pseudo-prefixes {vex3}, {evex}, {disp8}, and {disp32} These replace the now-deprecated .s, .d8 and .d32 suffixes.
GAS doesn't have an override to immediate size, only displacements.
GAS does let you add an explicit ds prefix, with ds mov src,dst
gcc -g -c padding.S && objdump -drwC padding.o -S, with hand-editting:
# no CPUs have separate ps vs. pd domains, so there's no penalty for mixing ps and pd loads/shuffles
0: 0f 28 07 movaps (%rdi),%xmm0
3: 66 0f 28 07 movapd (%rdi),%xmm0
7: 0f 58 c8 addps %xmm0,%xmm1 # not equivalent for SSE/AVX transitions, but sometimes safe to mix with AVX-128
a: c5 e8 58 d9 vaddps %xmm1,%xmm2, %xmm3 # default {vex2}
e: c4 e1 68 58 d9 {vex3} vaddps %xmm1,%xmm2, %xmm3
13: 62 f1 6c 08 58 d9 {evex} vaddps %xmm1,%xmm2, %xmm3
19: ff c0 inc %eax
1b: 83 c0 01 add $0x1,%eax
1e: 48 83 c0 01 add $0x1,%rax
22: 67 8d 40 01 lea 1(%eax), %eax # runs on fewer ports and doesn't set flags
26: 67 48 8d 40 01 lea 1(%eax), %rax # address-size and REX
# no equivalent for add eax, strict dword 1 # no-ModR/M
.byte 0x81, 0xC0; .long 1 # add eax,0x1 using the ModR/M imm32 encoding
2b: 81 c0 01 00 00 00 add $0x1,%eax # manually encoded
31: 81 c1 d2 04 00 00 add $0x4d2,%ecx # large immediate, can't get GAS to encode this way with $1 other than doing it manually
37: 67 8d 80 01 00 00 00 {disp32} lea 1(%eax), %eax
3e: 67 48 8d 80 01 00 00 00 {disp32} lea 1(%eax), %rax
mov 0(%rdi), %eax # the 0 optimizes away
46: 8b 07 mov (%rdi),%eax
{disp8} mov (%rdi), %eax # adds a disp8 even if you omit the 0
48: 8b 47 00 mov 0x0(%rdi),%eax
{disp8} ds mov (%rdi), %eax # with a DS prefix
4b: 3e 8b 47 00 mov %ds:0x0(%rdi),%eax
{disp32} mov (%rdi), %eax
4f: 8b 87 00 00 00 00 mov 0x0(%rdi),%eax
{disp32} mov 0(,%rdi,1), %eax # 1c extra latency on SnB-family for non-simple addressing mode
55: 8b 04 3d 00 00 00 00 mov 0x0(,%rdi,1),%eax
GAS is strictly less powerful than NASM for expressing longer-than-needed encodings.

Let's look at a specific piece of code:
cmp ebx,123456
mov al,0xFF
je .foo
For this code, none of the instructions can be replaced with anything else, so the only options are redundant prefixes and NOPs.
However, what if you change the instruction ordering?
You could convert the code into this:
mov al,0xFF
cmp ebx,123456
je .foo
After re-ordering the instructions; the mov al,0xFF could be replaced with or eax,0x000000FF or or ax,0x00FF.
For the first instruction ordering there is only one possibility, and for the second instruction ordering there are 3 possibilities; so there's a total of 4 possible permutations to choose from without using any redundant prefixes or NOPs.
For each of those 4 permutations you can add variations with different amounts of redundant prefixes, and single and multi-byte NOPs, to make it end on a specific alignment/s. I'm too lazy to do the maths, so let's assume that maybe it expands to 100 possible permutations.
What if you gave each of these 100 permutations a score (based on things like how long it would take to execute, how well it aligns the instruction after this piece, if size or speed matters, ...). This can include micro-architectural targeting (e.g. maybe for some CPUs the original permutation breaks micro-op fusion and makes the code worse).
You could generate all the possible permutations and give them a score, and choose the permutation with the best score. Note that this may not be the permutation with the best alignment (if alignment is less important than other factors and just makes performance worse).
Of course you can break large programs into many small groups of linear instructions separated by control flow changes; and then do this "exhaustive search for the permutation with the best score" for each small group of linear instructions.
The problem is that instruction order and instruction selection are co-dependent.
For the example above, you couldn't replace mov al,0xFF until after we re-ordered the instructions; and it's easy to find cases where you can't re-order the instructions until after you've replaced (some) instructions. This makes it hard to do an exhaustive search for the best solution, for any definition of "best", even if you only care about alignment and don't care about performance at all.

I can think of four ways off the top of my head:
First: Use alternate encodings for instructions (Peter Cordes mentioned something similar). There are a lot of ways to call the ADD operation for example, and some of them take up more bytes:
http://www.felixcloutier.com/x86/ADD.html
Usually an assembler will try to choose the "best" encoding for the situation whether that is optimizing for speed or length, but you can always use another one and get the same result.
Second: Use other instructions that mean the same thing and have different lengths. I'm sure you can think of countless examples where you could drop one instruction into the code to replace an existing one and get the same results. People that hand optimize code do it all the time:
shl 1
add eax, eax
mul 2
etc etc
Third: Use the variety of NOPs available to pad out extra space:
nop
and eax, eax
sub eax, 0
etc etc
In an ideal world you'd probably have to use all these tricks to get code to be the exact byte length you want.
Fourth: Change your algorithm to get more options using the above methods.
One final note: Obviously targeting more modern processors will give you better results due to the number and complexity of instructions. Having access to MMX, XMM, SSE, SSE2, floating point, etc instructions could make your job easier.

Depends on the nature of the code.
Floatingpoint heavy code
AVX prefix
One can resort to the longer AVX prefix for most SSE instructions.
Note that there is a fixed penalty when switching between SSE and AVX on intel CPUs [1][2]. This requires vzeroupper which can be interpreted as another NOP for SSE code or AVX code which doesn't require the higher 128 bits.
SSE/AVX NOPS
typical NOPs I can think of are:
XORPS the same register, use SSE/AVX variations for integers of these
ANDPS the same register, use SSE/AVX variations for integers of these

What's the reason for padding executable sections with "long NOPs"?

I found that x86-64 programs (at least those compiled using GCC) have functions start by default at addresses aligned to multiples of 16 bytes and that the padding is done by NOP instructions with as many prefixes as could fit to optimally fill the space. For example,
(...)
447454: c3 retq
447455: 90 nop
447456: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:0x0(%rax,%rax,1)
0000000000447460 <__libc_csu_fini>:
447460: f3 c3 repz retq
What's the advantage to filling the space with regular NOPs like observed here or here?

There's no downside, so why not? It makes the disassembly easier to read for humans, because you don't have a huge amount of lines separating functions.
GCC (the actual compiler part that transforms C to assembly) uses the same .p2align directive to ask the assembler to insert padding whether it's inside a function to align branch targets, or whether it's between functions to align function entry points.
GCC could emit .p2align 4,,0x90 to ask the assembler to fill with single-byte NOPs in cases where the NOPs won't be executed, but like I said, there's no reason to bother doing that instead of .p2align 4 (pad out to the next 2^4 boundary with the default choice of filler).
If the end of the function is an indirect branch (tail-call with jmp [rax] or something), speculative execution could run into these NOP instructions. Decoding many short NOPs could overflow the uop cache on Intel SnB-family. (more than 3 cache lines of up-to-6 uops per 32-byte block). (http://agner.org/optimize/ microarch pdf). Long NOPs are potentially better for that.
IDK how Pentium4's trace cache builder behaved; maybe it was useful for that, too? Again, fewer longer NOP instructions are less likely to trigger anything weird in the front-end of a CPU before it figures out that the NOPs aren't executed.
MSVC pads with int3 between functions, IIRC, which will stop speculative execution. That's not a bad idea.
This is guesswork; it's probably not a real factor in performance; if it still mattered on modern CPUs, all compilers would probably avoid short NOPs between functions, but as one of your links showed, not all do.
Some CPUs, like AMD K8/K10 and Bulldozer-family, mark instruction-lengths in L1I cache. Agner Fog says that bandwidth from L2 to L1I is low on K8/K10, and guesses that it may be from adding extra pre-decode information. IDK if this takes longer when there are lots of small instructions? It would have to know where to start decoding, because the middle of an instruction can span a cache-line boundary. IDK how that works.
BTW, these instructions might be decoded as part of a group containing a normal ret, but I don't think there's anything to worry about either way in that case.
Decoding happens in 2 stages in some CPUs: first, instruction-length decoding, which finds blocks of up-to-16 bytes containing up-to-4 instructions (e.g. on Intel P6-family / Sandybridge-family). Then it feeds those blocks to the decoders.
With correct branch prediction for the ret, even nasty stuff like LCP stalls after the ret don't seem to hurt.
Anyway, I don't think this difference is significant. Decoded NOP instructions after a RET should be cancelled before they go anywhere, because the RET is an unconditional branch. I probably makes no difference whether the instruction-length decoder finds many single-byte instructions vs. some prefixes but not the end of an instruction before the end of a 16-byte window.

Intel x86 0x2E/0x3E Prefix Branch Prediction actually used?

In the latest Intel software dev manual it describes two opcode prefixes:
Group 2 > Branch Hints
0x2E: Branch Not Taken
0x3E: Branch Taken
These allow for explicit branch prediction of Jump instructions (opcodes likeJxx)
I remember reading a couple of years ago that on x86 explicit branch prediction was essentially a no-op in the context of gccs branch prediciton intrinsics.
I am now unclear if these x86 branch hints are a new feature or whether they are essentially no-ops in practice.
Can anyone clear this up?
(That is: Does gccs branch prediction functions generate these x86 branch hints? - and do current Intel CPUs not ignore them? - and when did this happen?)
Update:
I created a quick test program:
int main(int argc, char** argv)
{
if (__builtin_expect(argc,0))
return 1;
if (__builtin_expect(argc == 2, 1))
return 2;
return 3;
}
Disassembles to the following:
00000000004004cc <main>:
4004cc: 55 push %rbp
4004cd: 48 89 e5 mov %rsp,%rbp
4004d0: 89 7d fc mov %edi,-0x4(%rbp)
4004d3: 48 89 75 f0 mov %rsi,-0x10(%rbp)
4004d7: 8b 45 fc mov -0x4(%rbp),%eax
4004da: 48 98 cltq
4004dc: 48 85 c0 test %rax,%rax
4004df: 74 07 je 4004e8 <main+0x1c>
4004e1: b8 01 00 00 00 mov $0x1,%eax
4004e6: eb 1b jmp 400503 <main+0x37>
4004e8: 83 7d fc 02 cmpl $0x2,-0x4(%rbp)
4004ec: 0f 94 c0 sete %al
4004ef: 0f b6 c0 movzbl %al,%eax
4004f2: 48 85 c0 test %rax,%rax
4004f5: 74 07 je 4004fe <main+0x32>
4004f7: b8 02 00 00 00 mov $0x2,%eax
4004fc: eb 05 jmp 400503 <main+0x37>
4004fe: b8 03 00 00 00 mov $0x3,%eax
400503: 5d pop %rbp
400504: c3 retq
400505: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40050c: 00 00 00
40050f: 90 nop
I don't see 2E or 3E ? Maybe gcc has elided them for some reason?

These instruction prefixes have no effect on modern processors (anything newer than Pentium 4). They just cost one byte of code space, and thus, not generating them is the right thing.
For details, see Agner Fog's optimization manuals, in particular 3. Microarchitecture: http://www.agner.org/optimize/
The "Intel® 64 and IA-32 Architectures Optimization Reference Manual" no longer mentions them in the section about optimizing branches (section 3.4.1):
http://www.intel.de/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
These prefixes are a (harmless) relict of the Netburst architecture. In all-out optimization, you can use them to align code, but that's all they're good for nowadays.

gcc is right to not generate the prefix, as they have no effect for all processors since the Pentium 4.
But __builtin_expect has other effects, like moving a not expected code path away from the cache-hot locations in the code or inlining decisions, so it is still useful.

While Pentium 4 is the only generation which actually respects the branch-hint instructions, most CPUs do have some form of static branch prediction, which can be used to achieve the same effect. This answer is a bit tangential to the original question, but I think this would be valuable information to anyone who comes to this page.
The Intel optimisation guide and Agner Fog's guide (which have been mentioned here already) both have excellent descriptions of this feature.
Intel has this to say about generations newer than Core 2:
Make the fall-through code following a conditional branch be the likely target for a branch with a forward target
So conditional branches which jump forward in the code are predicted to be not-taken, by the static prediction algorithm.
This is consistent with what GCC seems to have generated using __builtin_expect: the 'expected' return 1 / return 2 code is placed in the not-taken paths from the conditional branches, which will be statically predicted as not-taken.
Additionally:
Branches that do not have a history in the Branch Target Buffer are predicted using a static prediction algorithm:
Predict unconditional branches to be taken.
Predict indirect branches to be NOT taken.
So in the 'expected' not-taken paths where GCC has placed unconditional jmps to the end of the function, those jumps will be statically predicted as taken (i.e. not skipped).
Intel also says:
make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target
So conditional branches which jump backwards in the code are predicted to be taken, by the static prediction algorithm.
According to Agner Fog, most Pentiums also follow this algorithm:
On PPro, P2, P3, P4 and P4E, a control transfer instruction which has not been seen before, or which is not in the Branch Target Buffer, is predicted to fall through if it goes forwards, and to be taken if it goes backwards (e.g. a loop). Static prediction takes longer time than dynamic prediction on these processors.
However, the Core 2 family (and Pentium M) has a completely different policy:
These processors do not use static prediction. The predictor simply makes a random prediction the first time a branch is seen, depending on what happens to be in the BTB entry that is assigned to the new branch. There is simply a 50% chance of making the right prediction of jump or no jump, but the predicted target is correct.
As do AMD processors apparently:
A branch is predicted not taken the first time it is seen. A branch is predicted always taken after the first time it has been taken. Dynamic prediction is used only after a branch has been taken and then not taken. Branch hint prefixes have no effect.
There is one additional factor to consider: CPUs generally like to execute in a linear fashion, so even correctly-predicted taken branches are often more expensive than correctly-predicted not-taken branches.

Intel® 64 and IA-32 Architectures Software Developer’s Manual -> Volume 2: Instruction Set Reference, A-Z -> Chapter 2: Instruction Format -> 2.1 Instruction Format for Protected Mode, real-address Mode, and virtual-8086 mode -> 2.1.1 Instruction Prefixes
Some earlier microarchitectures used these as branch hints, but recent
generations have not and they are reserved for future hint usage.

Relative performance of x86 inc vs. add instruction

Quick question, assuming beforehand
mov eax, 0
which is more efficient?
inc eax
inc eax
or
add eax, 2
Also, in case the two incs are faster, do compilers (say, the GCC) commonly (i.e. w/o aggressive optimization flags) optimize var += 2 to it?
PS: Don't bother to answer with a variation of "don't prematurely optimize", this is merely academic interest.

Two inc instructions on the same register (or more generally speaking two read-modify-write instructions) do always have a dependency chain of at least two cycles. This is assuming a one clock latency for a inc, which is the case since the 486. That means if the surrounding instructions can't be interleaved with the two inc instructions to hide those latencies, the code will execute slower.
But no compiler will emit the instruction sequence you propose anyway (mov eax,0 will be replaced by xor eax,eax, see What is the purpose of XORing a register with itself?)
mov eax,0
inc eax
inc eax
it will be optimizied to
mov eax,2

If you ever wanna know raw performance stats of x86 instructions, see Dr Agner Fogs listings (volume 4 to be exact). As for the part about compilers, thats dependent on the compiler's code generator, and not something you should rely on too much.
on a side note: I find it funny/ironic that in a question about performance, you used MOV EAX,0 to zero a register instead of XOR EAX,EAX :P (and if MOV EAX,0 was done beforehand, the fastest variant would be to remove the inc's and add's and just MOV EAX,2).

For all purposes, it probably doesn't matter. But take into account that inc uses less bytes.
Consider the following code:
int x = 0;
x += 2;
Without using any optimization flags, GCC compiles this code into:
80483ed: c7 44 24 1c 00 00 00 movl $0x0,0x1c(%esp)
80483f4: 00
80483f5: 83 44 24 1c 02 addl $0x2,0x1c(%esp)
Using -O1 and -O2, it becomes:
c7 44 24 08 02 00 00 movl $0x2,0x8(%esp)
Funny, isn't it?

From the Intel manual that you can find here it looks like the ADD/SUB instructions are half a cycle cheaper on one particular architecture. But remember that Intel uses an out-of-order execution model for it's (recent) processors. This primarily means, performance bottlenecks show up wherever the processor has to wait for data to come in (eg. it ran out of things to do during the L1/L2/L3/RAM data-fetch). So if you're profiler tells you INC might be the problem; look at it form a data-throughput point of view instead of looking at raw cycle-counts.
Instruction Latency1 Throughput Execution Unit
2
CPUID 0F_3H 0F_2H 0F_3H 0F_2H 0F_2H
ADD/SUB 1 0.5 0.5 0.5 ALU
[...]
DEC/INC 1 1 0.5 0.5 ALU

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio