Why does gcc compile _mm256_permute2f128_ps to a vinsertf128 instruction? - gcc

These instructions are part of the assembly output of a C program (gcc -O2). As I understand the result, ymm6 is source operand 1: all of it is copied to ymm9, and then xmm1 is copied into the upper half (bits 255:128). I read the Intel manual, but it uses Intel assembly syntax, not AT&T, and I don't want to use Intel syntax. So ymm8, ymm2 and ymm6 here are SRC1. Is this true?
vshufps $68, %ymm0, %ymm8, %ymm6
vshufps $68, %ymm4, %ymm2, %ymm1
vinsertf128 $1, %xmm1, %ymm6, %ymm9
And the main question is why gcc changed the instruction
row0 = _mm256_permute2f128_ps(__tt0, __tt4, 0x20);
to
vinsertf128 $1, %xmm1, %ymm6, %ymm9
and
row4 = _mm256_permute2f128_ps(__tt0, __tt4, 0x31);
to
vperm2f128 $49, %ymm1, %ymm6, %ymm1
How can I disable this optimization? I tried -O0 but that doesn't work.

So ymm8, ymm2 and ymm6 here are SRC1. Is this true?
Yes, the middle operand is always src1 in a 3-operand instruction in both syntaxes.
AT&T: op %src2, %src1, %dest
Intel: op dest, src1, src2
I don't want to use Intel syntax
Tough. The only really good documentation I know of for exactly what every instruction does is the Intel insn ref manual. I used to think AT&T syntax was better, because the $ and % decorators remove ambiguity. I do like that, but otherwise prefer the Intel syntax now. The rules for each are simple enough that you can easily mentally convert, or "think" in whichever one you're reading ATM.
Unless you're actually writing GNU C inline asm, you can just use gcc -masm=intel and objdump -Mintel to get GNU-flavoured asm using intel mnemonics, operand order, and so on. The assembler directives are still gas style, not NASM. Use http://gcc.godbolt.org/ to get nicely-formatted asm output for code with only the essential labels left in.
gcc and clang both have some understanding of what the intrinsics actually do, so internally they translate the intrinsic to some data movement. When it comes time to emit code, they see that said data movement can be done with vinsertf128, so they emit that.
On some CPUs (Intel SnB-family), both instructions have equal performance, but on AMD Bulldozer-family (which only has 128b ALUs), vinsertf128 is much faster than vperm2f128. (source: see Agner Fog's guides, and other links at the x86 tag wiki). They both take 6 bytes to encode, including the immediate, so there's no code-size difference. vinsertf128 is always a better choice than a vperm2f128 that does identical data movement.
gcc and clang don't have a "literal translation of intrinsics to instructions" mode, because it would take extra work to implement. If you care exactly which instructions the compiler uses, that's what inline asm is for.
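For example, a minimal GNU C inline-asm sketch (my own, untested; AT&T operand order, assuming AVX is enabled) that forces a vperm2f128 for the 0x20 case instead of letting the compiler choose:
#include <immintrin.h>

static inline __m256 perm2f128_lo(__m256 src1, __m256 src2)
{
    __m256 dst;
    /* dst low lane = src1 low lane, dst high lane = src2 low lane,
     * i.e. the same data movement as _mm256_permute2f128_ps(src1, src2, 0x20) */
    asm("vperm2f128 $0x20, %2, %1, %0" : "=x"(dst) : "x"(src1), "x"(src2));
    return dst;
}
The downside is that the compiler can't optimize through an asm statement, so only do this if you really need one specific instruction.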
Keep in mind that -O0 doesn't mean "no optimization". It still has to transform through a couple of internal representations before emitting asm.

Examination of the instructions that bind to port 5 in the instruction analysis report shows that the instructions were broadcasts and vpermilps. The broadcasts can only execute on port 5, but replacing them with 128-bit loads followed by vinsertf128 instructions reduces the pressure on port 5, because vinsertf128 can execute on port 0. (From the IACA User's Guide.)

Related

What exactly do the gcc compiler switches (-mavx -mavx2 -mavx512f) do?

I explicitly use the Intel SIMD extension intrinsics in my C/C++ code. In order to compile the code I need to specify -mavx, -mavx512, or something similar on the command line. I'm good with all that.
However, from reading the gcc man page, it's not clear if these command-line flags also tell the gcc compiler to try to automatically vectorize the C/C++ code with the Intel SIMD instructions. Does someone know if that is the case? Does the -mavx flag simply allow you to manually insert SIMD intrinsics into your code, or does it also tell the compiler to use the SIMD instructions when compiling your C/C++ code?
-mavx/-mavx2/-mavx512f (and -march= options that imply them with relevant tuning settings) let GCC use AVX / AVX2 / AVX-512 instructions for anything it thinks is a good idea when compiling your code, including but not limited to auto-vectorization of loops, if you also enable that.
Other use-cases for SSE instructions (where GCC will use the AVX encoding if you tell it AVX is enabled) include copying and zero-initializing structs and arrays, and other cases of inlining small constant-size memset and memcpy. And also scalar FP math, even at -O0 in 64-bit code where -mfpmath=sse is the default.
Code built with -mavx usually can't be run on CPUs without AVX, even if auto-vectorization wasn't enabled and you didn't use any AVX intrinsics; it makes GCC use the VEX encoding instead of legacy SSE for every SIMD instruction. AVX2, on the other hand, doesn't usually get used except when actually auto-vectorizing a loop. It's not relevant for just copying data around, or for scalar FP math. GCC will use scalar FMA instructions if -mfma is enabled, though.
Examples on Godbolt
void ext(void *);
void caller(void){
int arr[16] = {0};
ext(arr);
}
double fp(double a, double b){
return b-a;
}
compiles with AVX instructions with gcc -O2 -fno-tree-vectorize -march=haswell, because when AVX is enabled, GCC completely avoids legacy-SSE encodings everywhere.
caller:
sub rsp, 72
vpxor xmm0, xmm0, xmm0
mov rdi, rsp
vmovdqa XMMWORD PTR [rsp], xmm0 # only 16-byte vectors, not using YMM + vzeroupper
vmovdqa XMMWORD PTR [rsp+16], xmm0
vmovdqa XMMWORD PTR [rsp+32], xmm0
vmovdqa XMMWORD PTR [rsp+48], xmm0
call ext
add rsp, 72
ret
fp:
vsubsd xmm0, xmm1, xmm0
ret
-m options do not enable auto-vectorization; -ftree-vectorize does that. It's on at -O3 and higher. (Or at -O2 with GCC12 and later, like with clang.)
If you do want auto-vectorization with enabled extensions, use -O3 as well, and preferably -march=native or -march=znver2 or something instead of just -mavx2. -march sets tuning options as well, and will enable other ISA extensions you probably forgot about, like -mfma and -mbmi2.
The tuning options implied by -march=haswell (or just -mtune=haswell) are especially useful on older GCC, when tune=generic cared more about old CPUs that didn't have AVX2, or where doing unaligned 256-bit loads as two separate parts was a win in some cases: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?
Unfortunately there isn't anything like -mtune=generic-avx2 or -mtune=enabled-extension to still care about both AMD and Intel CPUs, but not about ones too old for all the extensions you enabled.
When manually vectorizing with intrinsics, you can only use intrinsics for instruction-sets you've enabled. (Or ones that are on by default, like SSE2 which is baseline for x86-64, and often enabled even with -m32 in modern GCC configs.)
e.g. if you use _mm256_add_epi32, your code won't compile unless you use -mavx2. (Or better, something like -march=haswell or -march=native that enables AVX2, FMA, BMI2, and other stuff modern x86 has, and sets appropriate tuning options.)
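For example, a minimal function (my own illustration) that won't compile without AVX2 enabled:
#include <immintrin.h>

/* without -mavx2 (or -march=haswell, -march=native, etc.), GCC rejects this
 * with the "target specific option mismatch" error quoted below */
__m256i load_and_add(const __m256i *p, __m256i v)
{
    return _mm256_add_epi32(_mm256_loadu_si256(p), v);
}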
The GCC error message in that case is error: inlining failed in call to 'always_inline' '_mm256_loadu_si256': target specific option mismatch.
In GCC terminology, the "target" is the machine you're compiling for. i.e. -mavx2 tells GCC that the target supports AVX2. Thus GCC will make an executable that might use AVX2 instructions anywhere, e.g. for copying a struct or zero-initializing a local array, or otherwise expanding a small constant-size memcpy or memset.
It will also define the CPP macro __AVX2__, so #ifdef __AVX2__ can test whether AVX2 can be assumed at compile-time.
If that's not what you want for the whole program, you need to make sure not to use -mavx2 to compile any code that gets called without a run-time check of CPU features. e.g. put your AVX2 versions of functions in a separate file to compile with -mavx2, or use __attribute__((target("avx2"))). Have your program set function pointers after checking __builtin_cpu_supports("avx2"), or use GCC's ifunc dispatching mechanism to do multi-versioning.
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5
https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning.html
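A minimal sketch of the function-pointer dispatch described above (the kernel names here are hypothetical; the AVX2 version would go in a file compiled with -mavx2, or use the target attribute):
#include <stddef.h>

/* hypothetical kernels: add_floats_avx2 built with -mavx2 or
 * __attribute__((target("avx2"))), add_floats_sse2 built as baseline */
void add_floats_avx2(float *dst, const float *a, const float *b, size_t n);
void add_floats_sse2(float *dst, const float *a, const float *b, size_t n);

static void (*add_floats)(float *, const float *, const float *, size_t);

void init_dispatch(void)
{
    /* resolve once at startup, after checking CPU features at run time */
    add_floats = __builtin_cpu_supports("avx2") ? add_floats_avx2
                                                : add_floats_sse2;
}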
-m options do not on their own enable auto-vectorization
(Auto-vectorization is not the only way GCC can use SIMD instruction sets.)
-ftree-vectorize (enabled as part of -O3, or even at -O2 in GCC12 and later) is necessary for GCC to auto-vectorize. And/or -fopenmp if the code has some #pragma omp simd. (You definitely always want at least -O2 or -Os if you care about performance; -O3 should be fastest, but may not always be. Sometimes GCC has missed-optimization bugs where -O3 makes things worse, or in large programs it might happen that larger code-size costs more I-cache and I-TLB misses.)
When auto-vectorizing and optimizing in general, GCC will (maybe) use any instruction sets you told it were available (with -m options). So for example, -O3 -march=haswell will auto-vectorize with AVX2 + FMA. -O3 without -m options will just auto-vectorize with SSE2.
e.g. compare on Godbolt GCC -O3 -march=nehalem (SSE4.2) vs. -march=znver2 (AVX2) for summing an integer array. (Compile-time constant size to keep the asm simple).
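The kind of test case I mean is just a plain reduction loop, e.g. (my own minimal version):
/* sum of an integer array with a compile-time-constant size to keep the asm
 * simple; compare gcc -O3 -march=nehalem vs. -march=znver2 on this */
int sum_array(const int *a)
{
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i];
    return sum;
}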
If you use -O3 -mgeneral-regs-only (the latter option normally only used in kernel code), GCC will still auto-vectorize, but only in cases where it thinks it's profitable to do SWAR (e.g. xor of an array is straightforward using 64-bit integer regs, or even sum of bytes using SWAR bit-hacks to block/correct for carry between bytes).
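A byte-xor reduction like this (my own sketch) is the kind of loop that can still be done in wide integer registers:
#include <stddef.h>
#include <stdint.h>

/* even with -mgeneral-regs-only, GCC may turn this into 64-bit integer (SWAR)
 * chunks rather than pure byte-at-a-time code, depending on version and tuning */
uint8_t xor_bytes(const uint8_t *p, size_t n)
{
    uint8_t x = 0;
    for (size_t i = 0; i < n; i++)
        x ^= p[i];
    return x;
}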
e.g. gcc -O1 -mavx still just uses scalar code.
Normally if you want full optimization but not auto-vectorization, you'd use something like -O3 -march=znver1 -fno-tree-vectorize
Other compilers
All of the above is true for clang as well, except it doesn't understand -mgeneral-regs-only. (I think you'd need -mno-mmx -mno-sse and maybe other options.)
(The Effect of Architecture When Using SSE / AVX Intrinisics repeats some of this info)
For MSVC / ICC, you can use intrinsics for ISA extensions you haven't told the compiler it can use on its own. So for example, MSVC -O2 without -arch:AVX would let it auto-vectorize with SSE2 (because that's baseline for x86-64), and use movaps for copying around 16-byte structs or whatever.
But with MSVC's style of target options, you can still use SSE4 intrinsics like _mm_cvtepi8_epi32 (pmovsxbd), or even AVX intrinsics, without telling the compiler it's allowed to use those instructions itself.
Older MSVC used to make really bad asm when you used AVX / AVX2 intrinsics without -arch:AVX, e.g. resulting in mixing VEX and legacy-SSE encodings in the same function (e.g. using the non-VEX encoding for 128-bit intrinsics like _mm_add_ps), and failure to use vzeroupper after 256-bit vectors, both of which were disastrous for performance.
But I think modern MSVC has mostly solved that. Although it still doesn't optimize intrinsics much at all, like not even doing constant-propagation through them.
Not optimizing intrinsics is likely related to MSVC's ability to let you write code like if(avx_supported) { __m256 v = _mm256_load_ps(p); ... and so on. If it was trying to optimize, it would have to keep track of the minimum extension-level already seen along paths of execution that could reach any given intrinsic, so it would know what alternatives would be valid. ICC is like that, too.
For the same reason, GCC can't inline functions with different target options into each other. So using __attribute__((target(""))) doesn't avoid the cost of run-time dispatching; you still want to avoid function-call overhead inside a hot loop, i.e. make sure there's a loop inside the AVX2 function. Otherwise it may not be worth having an AVX2 version at all; just use the SSE2 version.
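For example, keeping the whole loop inside the per-target function (a sketch using the same hypothetical add_floats_avx2 as above):
#include <immintrin.h>
#include <stddef.h>

/* the loop lives inside the AVX2 version, so the dispatch (and the
 * non-inlinable call) is paid once per call, not once per element */
__attribute__((target("avx2")))
void add_floats_avx2(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)          /* scalar cleanup for the tail */
        dst[i] = a[i] + b[i];
}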
I don't know about Intel's new OneAPI compiler, ICX. I think it's based on LLVM, so it might be more like clang.
Currently I use gcc 11.3.1 or higher. I am not a programmer, but I do distinguish between C and C++, and I have been building the latest codecs from GitHub / the Doom9 forum for three years. On my old Intel Core i5-2500K CPU @ 3.30GHz I notice the following.
With codecs written in C, you can run builds whose AVX2 SIMD is written in assembly even on a processor without that SIMD, e.g. libjpeg or dav1d with SIMD but without -mavx2. Whether you can use codecs posted on the forum, who knows. Examples:
xeve, xevd, uvg266, uavs3e, uavs3d, aom, libavif
With C++ codecs built for AVX2 SIMD, the binary won't even print its help output.
The second issue is threading and Unix/Windows compatibility. In C this works faster than in C++; also, in C++ you have to add special untested pieces like mingw-std-thread to g++ to get everything working.
Another curiosity about C++:
With MSYS2 GCC 12.1.0, codecs built with AVX2/AVX3 still open on old processors. How is that done? I don't know, but not with the functions above. Examples:
jpegxl, libwebp2, libheif, jvetvvc, vvenc, vvdec, libraw, jpegls, jpegxt, openhtj2k, openjph, grok (C++20 openjpeg)

Timing of executing 8-bit and 64-bit instructions on 64-bit x64/AMD64 processors

Is there any execution-timing difference between 8-bit and 64-bit instructions on a 64-bit x64/AMD64 processor, when those instructions are the same except for their bit width?
Is there a way to find the real processor timing of executing these two tiny assembly functions?
-Thanks.
; 64 bit instructions
add64:
mov $0x1, %rax
add $0x2, %rax
ret
; 8 bit instructions
add8:
mov $0x1, %al
add $0x2, %al
ret
Yes, there's a difference. mov $0x1, %al has a false dependency on the old value of RAX on most CPUs, including everything newer than Sandybridge. It's a 2-input 1-output instruction; from the CPU's point of view it's like add $1, %al as far as scheduling it independently or not relative to other uses of RAX. Only writing a 32 or 64-bit register starts a new dependency chain.
This means the AL return value of your add8 function might not be ready until after a cache miss for some independent work the caller happened to be doing in EAX before the call, but the RAX result of add64 could be ready right away for out-of-order execution to get started on later instructions in the caller that use the return value. (Assuming their other inputs are also ready.)
Why doesn't GCC use partial registers? and
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - Important background for understanding performance on modern OoO exec CPUs.
Their code-size also differs: both of the 8-bit instructions are 2 bytes long (thanks to the AL, imm8 short-form encoding; add $1, %dl would be 3 bytes). The RAX instructions are 7 and 4 bytes long. This matters for L1i cache footprint (and, on a large scale, for how many bytes have to get paged in from disk). On a small scale, it affects how many instructions can fit into a 16- or 32-byte fetch block if the CPU is doing legacy decode because the code wasn't already hot in the uop cache. Code alignment of later instructions is also affected by the varying lengths of previous instructions, sometimes affecting which branches alias each other.
https://agner.org/optimize/ explains the details of the pipelines of various x86 microarchitectures, including front-end decoding effects that can make instruction-length matter beyond just code density in the I-cache / uop-cache.
Generally 32-bit operand-size is the most efficient (for performance, and pretty good for code-size). 32-bit and 8-bit are the operand-sizes x86-64 can use without extra prefixes, but in practice, with 8-bit you need more (or longer) instructions to avoid stalls and partial-register badness, because 8-bit writes don't zero-extend. See The advantages of using 32bit registers/instructions in x86-64.
A few instructions are actually slower in the ALUs for 64-bit operand-size, not just front-end effects. That includes div on most CPUs, and imul on some older CPUs. Also popcnt and bswap. e.g. Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux
Note that mov $0x1, %rax will assemble to 7 bytes with GAS, unless you use as -O2 (not the same as gcc -O2, see this for examples) to get it to optimize to mov $1, %eax, which has exactly the same architectural effects but is shorter (no REX or ModRM byte). Some assemblers do that optimization by default, but GAS doesn't. Why NASM on Linux changes registers in x86_64 assembly has more about why this optimization is safe and good, and why you should do it yourself in the source, especially if your assembler doesn't do it for you.
But other than the false dep and code-size, they're the same for the back-end of the CPU: all those instructions are single-uop and can run on any scalar-integer ALU execution port (footnote 1). (https://uops.info/ has automated test results for every form of every unprivileged instruction).
Footnote 1: Excavator (last-gen Bulldozer-family) can also run mov $imm, %reg on 2 more ports (AGU) for 32 and 64-bit operand-size. But merging a new low-8 or low-16 into a full register needs an ALU port. So mov $1, %rax has 4/clock throughput on Excavator, but mov $1, %al only has 2/clock throughput. (And of course only if you use a few different destination registers, not actually AL repeatedly; that would be a latency bottleneck of 1/clock because of the false dependency from writing a partial register on that microarchitecture.)
Previous Bulldozer-family CPUs starting with Piledriver can run mov reg, reg (for r32 or r64) on EX0, EX1, AGU0, AGU1, while most ALU instructions including mov $imm, %reg can only run on EX0/1. Further extending the AGU port's capabilities to also handle mov-immediate was a new feature in Excavator.
Fortunately Bulldozer was obsoleted by AMD's much better Zen architecture which has 4 full scalar integer ALU ports / execution units. (And a wider front end and a uop cache, good caches, and generally doesn't suck in a lot of the ways that Bulldozer sucked.)
Is there a way to measure it?
Yes, but generally not in a function you call with call. Instead, put it in an unrolled loop so you can run it lots of times with minimal other instructions. It's especially useful to look at CPU performance-counter results to find front-end / back-end uop counts, as well as just the overall time for your loop.
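As a very rough C-level sketch (my own; serious measurements use hand-written unrolled asm loops and hardware performance counters, see the links below), you can at least time a long dependency chain with rdtsc:
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    uint64_t x = 1;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < 100000000; i++) {
        x += 2;
        asm volatile("" : "+r"(x));  /* keep x in a register; stop GCC folding the loop */
    }
    uint64_t t1 = __rdtsc();
    /* roughly 1 reference cycle per iteration would match 1-cycle add latency,
     * modulo turbo/idle clock speed differing from the TSC frequency */
    printf("%llu reference cycles, x = %llu\n",
           (unsigned long long)(t1 - t0), (unsigned long long)x);
    return 0;
}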
You can construct your loop to measure latency or throughput; see RDTSCP in NASM always returns the same value (timing a single instruction). Also:
Assembly - How to score a CPU instruction by latency and throughput
Idiomatic way of performance evaluation?
Can x86's MOV really be "free"? Why can't I reproduce this at all? is a good specific example of constructing a microbenchmark to measure / prove something specific.
Generally you don't need to measure yourself (although it's good to understand how; that helps you know what the measurements really mean). People have already done that for most CPU microarchitectures. You can predict performance for a specific CPU for some loops (if you can assume no stalls or cache misses) based on analyzing the instructions. Often that can predict performance fairly accurately, but medium-length dependency chains that OoO exec can only partially hide make it too hard to accurately predict or account for every cycle.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? has links to lots of good details, and stuff about CPU internals.
How many CPU cycles are needed for each assembly instruction? (you can't add up a cycle count for each instruction; front-end and back-end throughput, and latency, could each be the bottleneck for a loop.)

How can I mitigate the impact of the Intel jcc erratum on gcc?

If I have a chip that is subject to the Intel jcc erratum, how can I enable the mitigation in gcc (which adjusts branch locations to avoid the problematic alignment), and which gcc versions support it?
By compiler:
GCC: -Wa,-mbranches-within-32B-boundaries
clang (10+): -mbranches-within-32B-boundaries compiler option directly, not -Wa.
MSVC: /QIntel-jcc-erratum See Intel JCC Erratum - what is the effect of prefixes used for mitigation?
ICC: TODO, look for docs.
The GNU toolchain does mitigation in the assembler, with as -mbranches-within-32B-boundaries, which enables (GAS manual: x86 options):
-malign-branch-boundary=32 (care about 32-byte boundaries). Except the manual says this option takes an exponent, not a direct power of 2, so probably it's actually ...boundary=5.
-malign-branch=jcc+fused+jmp (the default which does not include any of +call+ret+indirect)
-malign-branch-prefix-size=5 (up to 5 segment prefixes per insn).
So the relevant GCC invocation is gcc -Wa,-mbranches-within-32B-boundaries
Unfortunately, GCC -mtune=skylake doesn't enable this.
GAS's strategy seems to be to pad as early as possible after the last alignment directive (e.g. .p2align) or after the last jcc/jmp that can end before a 32B boundary. I guess that might end up with padding in outer loops, before or after inner loops, maybe helping them fit in fewer uop cache lines? (Skylake also has its LSD loop buffer disabled, so a tiny loop split across two uop cache lines can run at best 2 cycles per iteration, instead of 1.)
It can lead to quite a large amount of padding with long macro-fused jumps, such as with -fstack-protector-strong which in recent GCC uses sub rdx,QWORD PTR fs:0x28 / jnz (earlier GCC used to use xor, which can't fuse even on Intel). That's 11 bytes total of sub + jnz, so could require 11 bytes of CS prefixes in the worst case to shift it to the start of a new 32B block. Example showing 8 CS prefixes in the insns before it: https://godbolt.org/z/n1dYGMdro
GCC doesn't know instruction sizes, it only prints text. That's why it needs GAS to support stuff like .p2align 4,,10 to align by 16 if that will take fewer than 10 bytes of padding, to implement the alignment heuristics it wants to use. (Often followed by .p2align 3 to unconditionally align by 8.)
as has other fun options that aren't on by default, like -Os to optimize hand-written asm like mov $1, %rax => mov $1, %eax / xor %rax,%rax => %eax / test $1, %eax => al and even EVEX => VEX for stuff like vmovdqa64 => vmovdqa.
Also stuff like -msse2avx to always use VEX prefixes even when the mnemonic isn't v..., and -momit-lock-prefix=yes which could be used to build std::atomic code for a uniprocessor system.
And -mfence-as-lock-add=yes to assemble mfence into lock addl $0x0, (%rsp). But insanely it also does that for sfence and even lfence, so it's unusable in code that uses lfence as an execution barrier, which is the primary use-case for lfence. e.g. for retpolines or timing like lfence;rdtsc.
as also has CPU feature-level checking with -march=znver3 for example, or .arch directives. And -mtune=CPU, although IDK what that does. Perhaps set NOP strategy?

x86 - instruction interleaving to avoid cpu stall

GCC 6, Intel Core 2 Duo.
Compilation flags: "-march=native -O3" (-S)
I was compiling a simple program and asked for the assembly output:
Code
movq 8(%rsi), %rdi
call _atoi
movq 16(%rbp), %rdi
movl %eax, %ebx
call _atof
pxor %xmm1, %xmm1
movl $1, %eax <- this instruction is my problem
cvtsi2sd %ebx, %xmm1
leaq LC0(%rip), %rdi
addsd %xmm1, %xmm0
call _printf
addq $8, %rsp
Execution
read/convert an integer variable, then read/convert a double value and add them.
The problem
I perfectly understand that one (the compiler more so) has to avoid cpu stalls as much as possible.
I've shown the offending instruction in the code section above.
To me, with cpu reordering, and different execution context, this interleaved instruction is useless.
My rationale is: the chances that we stall are very high anyway, and the cpu will wait for pxor xmm1 to complete before being able to reuse that register in the next instruction. Adding an instruction just fills the cpu decoder for nothing. The cpu HAS to wait anyway. So why not just leave it out?
Moving the pxor before the call to atof doesn't seem possible, as atof may use that register.
Question
Is that a bug, legacy junk (from when CPUs were not able to reorder), or something else?
Thanks
EDIT:
I admit my question was not clear: can this instruction be safely removed without performance consequences?
The x86-64 ABI requires that calls to varargs functions (like printf) set %al = the count of floating-point args passed in xmm registers. In this case, you're passing one double, so the ABI requires %al = 1. (Fun fact: C's promotion rules make it impossible to pass a float to a vararg function. This is why there are no printf conversion specifiers for float, only double.)
mov $1, %eax avoids a false dependency on the rest of eax (compared to mov $1, %al), so gcc prefers spending extra instruction bytes on that, even though it's tuning for Core2 (which renames partial registers).
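A short C illustration of that fun fact (my own example, not from the question):
#include <stdio.h>

void show(float f)
{
    /* default argument promotions convert the float to double before the
     * variadic call, so printf's %f always receives a double.  With one
     * double passed in an xmm register, the ABI requires al = 1 here. */
    printf("value = %f\n", f);
}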
Previous answer, before it was clarified that the question was why the mov is done at all, not about its ordering.
IIRC, gcc doesn't do much instruction scheduling for x86, because it's assuming out-of-order execution. I tried to google that, but didn't find the quote from a gcc developer that I seem to remember reading (maybe in a gcc bug report comment).
Anyway, it looks ok to me, unless you're tuning for in-order Atom or P5. If you are, use gcc -O3 -march=atom (which implies -mtune=atom). But anyway, you're clearly not doing that, because you used -march=native on a C2Duo, which is a 4-wide out-of-order design with a fairly large scheduler.
To me, with cpu reordering, and different execution context, this interleaved instruction is useless.
I have no idea what you think the problem is, or what ordering you think would be better, so I'll just explain why it looks good.
I didn't take the time to edit this down to a short answer, so you might prefer to just read Agner Fog's microarch pdf for details of the Core2 pipeline, and skim this answer. See also other links from the x86 tag wiki.
...
call _atof
# xmm0 is probably still not ready when the following instructions issue
pxor %xmm1, %xmm1 # no inputs, so can run any time after being issued.
gcc uses pxor because cvtsi2sd is badly designed, giving it a false dependency on the previous value of the vector register. Note how the upper half of the vector register keeps its old value. Intel probably designed it this way because the original SSE cvtsi2ss was first implemented on Pentium III, where 128b vectors were handled as two halves. Zeroing the rest of the register (including the upper half) instead of merging probably would have taken an extra uop on PIII.
This short-sighted design choice saddled the architecture with the choice between an extra dependency-breaking instruction, or a false dependency. A false dep might not matter at all, or might be a big slowdown if the register used by one function happened to be used for a very long FP dependency chain in another function (maybe including a cache miss).
On Intel SnB-family CPUs, xor-zeroing is handled at register-rename time, so the uop never needs to execute on an execution port; it's already completed as soon as it issues into the ROB. This is true for integer and vector registers.
On other CPUs, the pxor will need an execution port, but has no input dependencies so it can execute any time there's a free ALU port, after it issues.
movl $1, %eax # no input dependencies, can execute any time.
This instruction could be placed anywhere after call atof and before call printf.
cvtsi2sd %ebx, %xmm1 # no false dependency thanks to pxor.
This is a 2 uop instruction on Core2 (Merom and Penryn), according to Agner Fog's tables. That's weird because cvtsi2ss is 1 uop. (They're both 2 uops in SnB; presumably one uop to move data between integer and vector, and another for the conversion).
Putting this insn earlier would be good, potentially issue it a cycle earlier, since it's part of the longest dependency chain here. (The integer stuff is all simple and trivial). However, printf has to parse the format string before it will decide to look at xmm0, so the FP instructions aren't actually on the critical path.
It can't go ahead of pxor, and call / pxor / cvtsi2sd would mean pxor would decode by itself that cycle. Decoding will start with the instruction after the call, after the ret in the called function has been decoded (and the return-address predictor predicts the jump back to the insn after the call). Multi-uop instructions have to be the first instruction in a block, so having pxor and mov imm32 decode that cycle means less of a decode bottleneck.
leaq LC0(%rip), %rdi # 1 uop
addsd %xmm1, %xmm0 # 1 uop
call _printf # 3 uop insn
cvtsi2sd/lea/addsd can all decode in the same cycle, which is optimal. If the mov imm32 was after the cvt, it could decode in the same cycle as well (since pre-SnB decoders can handle up to 4-1-1-1), but it couldn't have issued as soon.
If decoding was only barely keeping up with issue, that would mean pxor would issue by itself (because no other instructions were decoded yet). Then cvtsi2sd/mov imm/lea (4 uops), then addsd / call (4 uops). (addsd decoded with the previous issue group; core2 has a short queue between decode and issue to help absorb decode bubbles like this, and make it useful to be able to decode up to 7 uops in a cycle.)
That's not appreciably different from the current issue pattern in a decode-bottleneck situation: (pxor / mov imm) / (cvtsi2sd/lea/addsd) / (call printf)
If decode isn't the bottleneck, I'm not sure if Core2 can issue a ret or jmp in the same cycle as uops that follow the jump. In SnB-family CPUs, an unconditional jump always ends an issue group. e.g. a 3-uop loop issues ABC, ABC, ABC, not ABCA, BCAB, CABC.
Assuming the instructions after the ret issue with a group not including the ret, we'd have
(pxor/mov imm/cvtsi2sd), (lea / addsd / 2 of call's 3 uops) / (last call uop)
So the cvtsi2sd still issues in the first cycle after returning from atof, which means it can get started executing right away. Even on Core2, where pxor takes an execution unit, the first of the 2 uops from cvtsi2sd can probably execute in the same cycle as pxor. It's probably only the 2nd uop that has an input dependency on the dst register.
(mov imm / pxor / cvtsi2sd) would be equivalent, and so would the slower-to-decode (pxor / cvtsi2sd / mov imm), or getting the lea executed before mov imm.

Unnecessary instructions generated for _mm_movemask_epi8 intrinsic in x64 mode

The intrinsic function _mm_movemask_epi8 from SSE2 is defined by Intel with the following prototype:
int _mm_movemask_epi8 (__m128i a);
This intrinsic function directly corresponds to the pmovmskb instruction, which is generated by all compilers.
According to this reference, the pmovmskb instruction can write the resulting integer mask to either a 32-bit or a 64-bit general purpose register in x64 mode. In any case, only 16 lower bits of the result can be nonzero, i.e. the result is surely within range [0; 65535].
Speaking of the intrinsic function _mm_movemask_epi8, its return value is of type int, which is a signed 32-bit integer on most platforms. Unfortunately, there is no alternative function which returns a 64-bit integer in x64 mode. As a result:
The compiler usually generates the pmovmskb instruction with a 32-bit destination register (e.g. eax).
The compiler cannot assume that the upper 32 bits of the whole register (e.g. rax) are zero.
The compiler inserts an unnecessary instruction (e.g. mov eax, eax) to zero the upper half of the 64-bit register when the register is later used as a 64-bit value (e.g. as an array index).
An example of code and generated assembly with such a problem can be seen in this answer. Also the comments to that answer contain some related discussion. I regularly experience this problem with MSVC2013 compiler, but it seems that it is also present on GCC.
The questions are:
Why is this happening?
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by CPU internally and they do not actually occupy time of execution units (Agner Fog's instruction tables document mentions such a possibility).
Why is this happening?
gcc's internal instruction definitions that tell it what pmovmskb does must be failing to inform it that the upper 32 bits of rax will always be zero. My guess is that it's treated like a function-call return value, where the ABI allows a function returning a 32-bit int to leave garbage in the upper 32 bits of rax.
GCC does know that 32-bit operations in general zero-extend for free, but this missed optimization is widespread for intrinsics, also affecting scalar intrinsics like _mm_popcnt_u32.
There's also the issue of gcc (not) knowing that the actual result has set bits only in the low 16 of its 32-bit int result (unless you used AVX2 vpmovmskb ymm). So actual sign extension is unnecessary; implicit zero extension is totally fine.
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
No, other than fixing gcc. Has anyone reported this as a compiler missed-optimization bug?
clang doesn't have this bug. I added code to Paul R's test to actually use the result as an array index, and clang is still fine.
gcc always either zero- or sign-extends (to a different register in this case, perhaps because it wants to "keep" the 32-bit value in the bottom of RAX, not because it's optimizing for mov-elimination).
Casting to unsigned helps with GCC6 and later; it will use the pmovmskb result directly as part of an addressing mode, although returning the value as well still results in a mov rax, rdx.
With older GCC, the cast at least gets it to use mov instead of movsxd or cdqe.
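A sketch of that workaround (the array here is hypothetical and would need 65536 entries to cover every possible mask):
#include <emmintrin.h>

/* casting to unsigned tells GCC the zero-extension to 64 bits is free, so
 * GCC6+ can fold the pmovmskb result straight into the addressing mode */
int lookup(const int *array, __m128i v)
{
    return array[(unsigned)_mm_movemask_epi8(v)];
}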
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by CPU internally and they do not actually occupy time of execution units (Agner Fog's instruction tables document mentions such a possibility).
mov same,same is never eliminated on SnB-family microarchitectures or AMD zen. mov ecx, eax would be eliminated. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details.
Even if it doesn't take an execution unit, it still takes a slot in the fused-domain part of the pipeline, and a slot in the uop-cache. And code-size. If you're close to the front-end 4 fused-domain uops per clock limit (pipeline width), then it's a problem.
It also costs an extra 1c of latency in the dep chain.
(Back-end throughput is not a problem, though. On Haswell and newer, it can run on port6 which has no vector execution units. On AMD, the integer ports are separate from the vector ports.)
gcc.godbolt.org is a great online resource for testing this kind of issue with different compilers.
clang seems to do the best with this, e.g.
#include <immintrin.h>
#include <cstdint>

int32_t test32(const __m128i v) {
    int32_t mask = _mm_movemask_epi8(v);
    return mask;
}

int64_t test64(const __m128i v) {
    int64_t mask = _mm_movemask_epi8(v);
    return mask;
}
generates:
test32(long long __vector(2)): # #test32(long long __vector(2))
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)): # #test64(long long __vector(2))
vpmovmskb eax, xmm0
ret
Whereas gcc generates an extra cdqe instruction in the 64-bit case:
test32(long long __vector(2)):
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)):
vpmovmskb eax, xmm0
cdqe
ret
