What exactly do the gcc compiler switches (-mavx -mavx2 -mavx512f) do?

I explicitly use the Intel SIMD extension intrinsics in my C/C++ code. In order to compile the code I need to specify -mavx, or -mavx512, or something similar on the command line. I'm good with all that.
However, from reading the gcc man page, it's not clear whether these command-line flags also tell the gcc compiler to try to automatically vectorize the C/C++ code with the Intel SIMD instructions. Does someone know if that is the case? Does the -mavx flag simply allow you to manually insert SIMD intrinsics into your code, or does it also tell the compiler to use the SIMD instructions when compiling your C/C++ code?

-mavx/-mavx2/-mavx512f (and -march= options that imply them with relevant tuning settings) let GCC use AVX / AVX2 / AVX-512 instructions for anything it thinks is a good idea when compiling your code, including but not limited to auto-vectorization of loops, if you also enable that.
Other use-cases for SSE instructions (where GCC will use the AVX encoding if you tell it AVX is enabled) include copying and zero-initializing structs and arrays, and other cases of inlining small constant-size memset and memcpy. And also scalar FP math, even at -O0 in 64-bit code where -mfpmath=sse is the default.
Code built with -mavx usually can't be run on CPUs without AVX, even if auto-vectorization wasn't enabled and you didn't use any AVX intrinsics; it makes GCC use the VEX encoding instead of legacy SSE for every SIMD instruction. AVX2, on the other hand, doesn't usually get used except when actually auto-vectorizing a loop. It's not relevant for just copying data around, or for scalar FP math. GCC will use scalar FMA instructions if -mfma is enabled, though.
Examples on Godbolt
void ext(void *);
void caller(void){
    int arr[16] = {0};
    ext(arr);
}
double fp(double a, double b){
    return b-a;
}
compiles with AVX instructions with gcc -O2 -fno-tree-vectorize -march=haswell, because when AVX is enabled, GCC completely avoids legacy-SSE encodings everywhere.
caller:
        sub     rsp, 72
        vpxor   xmm0, xmm0, xmm0
        mov     rdi, rsp
        vmovdqa XMMWORD PTR [rsp], xmm0      # only 16-byte vectors, not using YMM + vzeroupper
        vmovdqa XMMWORD PTR [rsp+16], xmm0
        vmovdqa XMMWORD PTR [rsp+32], xmm0
        vmovdqa XMMWORD PTR [rsp+48], xmm0
        call    ext
        add     rsp, 72
        ret
fp:
        vsubsd  xmm0, xmm1, xmm0
        ret
-m options do not enable auto-vectorization; -ftree-vectorize does that. It's on at -O3 and higher. (Or at -O2 with GCC12 and later, like with clang.)
If you do want auto-vectorization with the extensions you've enabled, use -O3 as well, and preferably -march=native or -march=znver2 or something instead of just -mavx2. -march sets tuning options as well, and will enable other ISA extensions you probably forgot about, like -mfma and -mbmi2.
The tuning options implied by -march=haswell (or just -mtune=haswell) are especially useful on older GCC, when tune=generic cared more about old CPUs that didn't have AVX2, or where doing unaligned 256-bit loads as two separate parts was a win in some cases: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?
Unfortunately there isn't anything like -mtune=generic-avx2 or -mtune=enabled-extension to still care about both AMD and Intel CPUs, but not about ones too old for all the extensions you enabled.
When manually vectorizing with intrinsics, you can only use intrinsics for instruction-sets you've enabled. (Or ones that are on by default, like SSE2 which is baseline for x86-64, and often enabled even with -m32 in modern GCC configs.)
e.g. if you use _mm256_add_epi32, your code won't compile unless you use -mavx2. (Or better, something like -march=haswell or -march=native that enables AVX2, FMA, BMI2, and other stuff modern x86 has, and sets appropriate tuning options.)
The GCC error message in that case is error: inlining failed in call to 'always_inline' '_mm256_loadu_si256': target specific option mismatch.
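For example, a minimal sketch (hypothetical function name) that hits exactly that:
#include <immintrin.h>

// Compiled with plain gcc -O2 -c, GCC rejects the AVX2 intrinsics below with
// the "target specific option mismatch" error quoted above; with -mavx2 or
// -march=haswell it compiles fine.
void add_arrays(int *dst, const int *a, const int *b) {
    __m256i va = _mm256_loadu_si256((const __m256i*)a);
    __m256i vb = _mm256_loadu_si256((const __m256i*)b);
    _mm256_storeu_si256((__m256i*)dst, _mm256_add_epi32(va, vb));
}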
In GCC terminology, the "target" is the machine you're compiling for. i.e. -mavx2 tells GCC that the target supports AVX2. Thus GCC will make an executable that might use AVX2 instructions anywhere, e.g. for copying a struct or zero-initializing a local array, or otherwise expanding a small constant-size memcpy or memset.
It will also define the CPP macro __AVX2__, so #ifdef __AVX2__ can test whether AVX2 can be assumed at compile-time.
If that's not what you want for the whole program, you need to make sure not to use -mavx2 to compile any code that gets called without a run-time check of CPU features. e.g. put your AVX2 versions of functions in a separate file to compile with -mavx2, or use __attribute__((target("avx2"))). Have your program set function pointers after checking __builtin_cpu_supports("avx2"), or use GCC's ifunc dispatching mechanism to do multi-versioning.
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5
https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning.html
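A minimal sketch of the manual-dispatch approach described above (the function names are made up; GCC's ifunc / target_clones machinery in the links is the more automatic route):

#include <immintrin.h>
#include <stddef.h>

__attribute__((target("avx2")))          // AVX2 intrinsics allowed only in here
static void scale_avx2(float *p, size_t n) {
    __m256 k = _mm256_set1_ps(2.0f);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), k));
    for (; i < n; i++)                   // scalar tail
        p[i] *= 2.0f;
}

static void scale_sse2(float *p, size_t n) {   // baseline fallback
    for (size_t i = 0; i < n; i++)
        p[i] *= 2.0f;
}

static void (*scale_ptr)(float *, size_t);

void scale_init(void) {                  // call once before using scale_ptr
    scale_ptr = __builtin_cpu_supports("avx2") ? scale_avx2 : scale_sse2;
}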
-m options do not on their own enable auto-vectorization
(Auto-vectorization is not the only way GCC can use SIMD instruction sets.)
-ftree-vectorize (enabled as part of -O3, or even at -O2 in GCC12 and later) is necessary for GCC to auto-vectorize. And/or -fopenmp if the code has some #pragma omp simd. (You definitely always want at least -O2 or -Os if you care about performance; -O3 should be fastest, but may not always be. Sometimes GCC has missed-optimization bugs where -O3 makes things worse, or in large programs it might happen that larger code-size costs more I-cache and I-TLB misses.)
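For the #pragma omp simd case, a sketch (assuming -fopenmp or -fopenmp-simd on the command line):

void twice(float *restrict dst, const float *restrict src, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}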
When auto-vectorizing and optimizing in general, GCC will (maybe) use any instruction sets you told it were available (with -m options). So for example, -O3 -march=haswell will auto-vectorize with AVX2 + FMA. -O3 without -m options will just auto-vectorize with SSE2.
e.g. compare on Godbolt GCC -O3 -march=nehalem (SSE4.2) vs. -march=znver2 (AVX2) for summing an integer array. (Compile-time constant size to keep the asm simple).
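The loop in question is roughly this kind of thing (a sketch; the exact Godbolt source may differ):

int sum_array(const int *a) {
    int sum = 0;
    for (int i = 0; i < 1024; i++)   // compile-time-constant trip count
        sum += a[i];
    return sum;
}

With -march=nehalem you should see a 128-bit paddd reduction; with -march=znver2, 256-bit vpaddd.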
If you use -O3 -mgeneral-regs-only (the latter option normally only used in kernel code), GCC will still auto-vectorize, but only in cases where it thinks it's profitable to do SWAR (e.g. xor of an array is straightforward using 64-bit integer regs, or even sum of bytes using SWAR bit-hacks to block/correct for carry between bytes).
e.g. gcc -O1 -mavx still just uses scalar code.
Normally if you want full optimization but not auto-vectorization, you'd use something like -O3 -march=znver1 -fno-tree-vectorize
Other compilers
All of the above is true for clang as well, except it doesn't understand -mgeneral-regs-only. (I think you'd need -mno-mmx -mno-sse and maybe other options.)
(The Effect of Architecture When Using SSE / AVX Intrinsics repeats some of this info)
For MSVC / ICC, you can use intrinsics for ISA extensions you haven't told the compiler it can use on its own. So for example, MSVC -O2 without -arch:AVX would let it auto-vectorize with SSE2 (because that's baseline for x86-64), and use movaps for copying around 16-byte structs or whatever.
But with MSVC's style of target options, you can still use SSE4 intrinsics like _mm_cvtepi8_epi32 (pmovsxwd), or even AVX intrinsics, without telling the compiler it's allowed to use those instructions itself.
Older MSVC used to make really bad asm when you used AVX / AVX2 intrinsics without -arch:AVX, e.g. resulting in mixing VEX and legacy-SSE encodings in the same function (e.g. using the non-VEX encoding for 128-bit intrinsics like _mm_add_ps), and failure to use vzeroupper after 256-bit vectors, both of which were disastrous for performance.
But I think modern MSVC has mostly solved that. Although it still doesn't optimize intrinsics much at all, like not even doing constant-propagation through them.
Not optimizing intrinsics is likely related to MSVC's ability to let you write code like if(avx_supported) { __m256 v = _mm256_load_ps(p); ... and so on. If it was trying to optimize, it would have to keep track of the minimum extension-level already seen along paths of execution that could reach any given intrinsic, so it would know what alternatives would be valid. ICC is like that, too.
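A sketch of that MSVC / classic-ICC style (avx_supported is a made-up flag, assumed to be set from a CPUID check elsewhere); it compiles with MSVC even without /arch:AVX, while GCC and clang reject the 256-bit intrinsics unless AVX is enabled for the file or the function:

#include <immintrin.h>

extern int avx_supported;     // assumed: set from a CPUID check at startup

void add1(float *p) {         // operates on 8 floats
    if (avx_supported) {
        _mm256_storeu_ps(p, _mm256_add_ps(_mm256_loadu_ps(p),
                                          _mm256_set1_ps(1.0f)));
    } else {
        for (int i = 0; i < 8; i++)
            p[i] += 1.0f;
    }
}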
For the same reason, GCC can't inline functions with different target options into each other. So you can't use __attribute__((target(""))) to avoid the cost of run-time dispatching; you still want to avoid function-call overhead inside a loop, i.e. make sure there's a loop inside the AVX2 function, otherwise it may not be worth having an AVX2 version, just use the SSE2 version.
I don't know about Intel's new OneAPI compiler, ICX. I think it's based on LLVM, so it might be more like clang.

I currently use gcc 11.3.1 or higher. I am not a programmer, but I do distinguish between C and C++. For three years I have been building the latest codecs from GitHub / the doom9 forum, and on my old Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz I notice the following.
In C you can run AVX2 SIMD codecs whose SIMD is written in assembler even on a non-SIMD processor. Can the codecs posted on the forum be used by everyone? Who knows -- e.g. libjpeg and dav1d have SIMD without needing -mavx2, and likewise:
xeve, xevd, uvg266, uavs3e, uavs3d, aom, libavif
With C++ SIMD AVX2 builds, the program won't even print its help output. The second issue is threading and Unix/Windows compatibility. The C code runs faster than the C++ code, and in C++ you have to add special untested additions like mingw-std-threads to g++ to get everything working.
Another curiosity about C++: with MSYS2 GCC 12.1.0, codecs built with AVX2/AVX3 open on old processors. How is that done? I don't know, but not with the options above:
jpegxl, libwebp2, libheif, jvetvvc, vvenc, vvdec, libraw, jpegls, jpegxt, openhtj2k, openjph, grok (C++20 openjpeg)

Related

Is there a way to force visual studio to generate aligned instructions from SSE intrinsics?

The _mm_load_ps() SSE intrinsic is defined as aligned, throwing an exception if the address is not aligned. However, it seems Visual Studio generates an unaligned read instead.
Since not all compilers are made the same, this hides bugs. It would be nice to be able to turn the actual aligned operations on, even though the performance hit that used to exist doesn't seem to be there anymore.
In other words, writing code:
__m128 p1 = _mm_load_ps(data);
currently produces:
movups xmm0,xmmword ptr [eax]
expected result:
movaps xmm0,xmmword ptr [eax]
(I was asked by Microsoft to ask here.)
MSVC and ICC only use instructions that do alignment checking when they fold a load into a memory source operand without AVX enabled, like addps xmm0, [rax]. SSE memory source operands require alignment, unlike AVX. But you can't reliably control when this happens, and in debug builds it generally doesn't.
As Mysticial points out in Visual Studio 2017: _mm_load_ps often compiled to movups, another case is NT load/store, because there is no unaligned version.
If your code is compatible with clang-cl, have Visual Studio use it instead of MSVC. It's a modified version of clang that tries to act more like MSVC. But like GCC, clang uses aligned load and store instructions for aligned intrinsics.
Either disable optimization, or make sure AVX is not enabled, otherwise it could fold a _mm_load_ps into a memory source operand like vaddps xmm0, [rax] which doesn't require alignment because it's the AVX version. This may be a problem if your code also uses AVX intrinsics in the same file, because clang requires that you enable ISA extensions for intrinsics you want to use; the compiler won't emit asm instructions for an extension that isn't enabled, even with intrinsics. Unlike MSVC and ICC.
A debug build should work even with AVX enabled, especially if you _mm_load_ps or _mm256_load_ps into a separate variable in a separate statement, not v=_mm_add_ps(v, _mm_load_ps(ptr));
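i.e. something like this sketch, which in a debug build is more likely to keep the aligned load as a standalone movaps/vmovaps instead of folding it into an ALU memory operand:

#include <immintrin.h>

__m128 sum4(const float *p, __m128 v) {
    __m128 x = _mm_load_ps(p);    // separate statement, separate variable
    return _mm_add_ps(v, x);      // not _mm_add_ps(v, _mm_load_ps(p))
}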
With MSVC itself, for debugging purposes only (usually very big speed penalty for stores), you could substitute normal loads/stores with NT. Since they're special, the compiler won't fold loads into memory source operands for ALU instructions, so this can maybe work even with AVX with optimization enabled.
// alignment_debug.h (untested)
// #include this *after* immintrin.h
#ifdef DEBUG_SIMD_ALIGNMENT
#warn "using slow alignment-debug SIMD instructions to work around MSVC/ICC limitations"
// SSE4.1 MOVNTDQA doesn't do anything special on normal WB memory, only WC
// On WB, it's just a slower MOVDQA, wasting an ALU uop.
#define _mm_load_si128 _mm_stream_load_si128
#define _mm_load_ps(ptr) _mm_castsi128_ps(_mm_stream_load_si128((const __m128i*)(ptr)))
#define _mm_load_pd(ptr) _mm_castsi128_pd(_mm_stream_load_si128((const __m128i*)(ptr)))
// SSE1/2 MOVNTPS / PD / MOVNTDQ evict data from cache if it was hot, and bypass cache
#define _mm_store_ps _mm_stream_ps // SSE1 movntps
#define _mm_store_pd _mm_stream_pd // SSE2 movntpd is a waste of space vs. the ps encoding, but whatever
#define _mm_store_si128 _mm_stream_si128 // SSE2 movntdq
// and repeat for _mm256_... versions with _mm256_castsi256_ps
// and _mm512_... versions
// edit welcome if anyone tests this and adds those versions
#endif
Related: for auto-vectorization with MSVC (and gcc/clang), see Alex's answer on Alignment attribute to force aligned load/store in auto-vectorization of GCC/CLang

How can I mitigate the impact of the Intel jcc erratum on gcc?

If I have a chip that is subject to the Intel jcc erratum, how I can enable the mitigation in gcc (which adjusts branch locations to avoid the problematic alignment), and which gcc versions support it?
By compiler:
GCC: -Wa,-mbranches-within-32B-boundaries
clang (10+): -mbranches-within-32B-boundaries compiler option directly, not -Wa.
MSVC: /QIntel-jcc-erratum See Intel JCC Erratum - what is the effect of prefixes used for mitigation?
ICC: TODO, look for docs.
The GNU toolchain does mitigation in the assembler, with as -mbranches-within-32B-boundaries, which enables (GAS manual: x86 options):
-malign-branch-boundary=32 (care about 32-byte boundaries). Except the manual says this option takes an exponent, not a direct power of 2, so probably it's actually ...boundary=5.
-malign-branch=jcc+fused+jmp (the default which does not include any of +call+ret+indirect)
-malign-branch-prefix-size=5 (up to 5 segment prefixes per insn).
So the relevant GCC invocation is gcc -Wa,-mbranches-within-32B-boundaries
Unfortunately, GCC -mtune=skylake doesn't enable this.
GAS's strategy seems to be to pad as early as possible after the last alignment directive (e.g. .p2align) or after the last jcc/jmp that can end before a 32B boundary. I guess that might end up with padding in outer loops, before or after inner loops, maybe helping them fit in fewer uop cache lines? (Skylake also has its LSD loop buffer disabled, so a tiny loop split across two uop cache lines can run at best 2 cycles per iteration, instead of 1.)
It can lead to quite a large amount of padding with long macro-fused jumps, such as with -fstack-protector-strong which in recent GCC uses sub rdx,QWORD PTR fs:0x28 / jnz (earlier GCC used to use xor, which can't fuse even on Intel). That's 11 bytes total of sub + jnz, so could require 11 bytes of CS prefixes in the worst case to shift it to the start of a new 32B block. Example showing 8 CS prefixes in the insns before it: https://godbolt.org/z/n1dYGMdro
GCC doesn't know instruction sizes, it only prints text. That's why it needs GAS to support stuff like .p2align 4,,10 to align by 16 if that will take fewer than 10 bytes of padding, to implement the alignment heuristics it wants to use. (Often followed by .p2align 3 to unconditionally align by 8.)
as has other fun options that aren't on by default, like -Os to optimize hand-written asm like mov $1, %rax => mov $1, %eax / xor %rax,%rax => %eax / test $1, %eax => al and even EVEX => VEX for stuff like vmovdqa64 => vmovdqa.
Also stuff like -msse2avx to always use VEX prefixes even when the mnemonic isn't v..., and -momit-lock-prefix=yes which could be used to build std::atomic code for a uniprocessor system.
And -mfence-as-lock-add=yes to assemble mfence into lock addl $0x0, (%rsp). But insanely it also does that for sfence and even lfence, so it's unusable in code that uses lfence as an execution barrier, which is the primary use-case for lfence. e.g. for retpolines or timing like lfence;rdtsc.
as also has CPU feature-level checking with -march=znver3 for example, or .arch directives. And -mtune=CPU, although IDK what that does. Perhaps set NOP strategy?

Xcode Apple Clang enable avx512

In Xcode(Version 10.1 (10B61)), I used Macro as below to detect AVX512 support.
#ifdef __SSE4_1__
#error "sse4_1"
#endif
#ifdef __AVX__
#error "avx"
#endif
#ifdef __AVX2__
#error "avx2"
#endif
#ifdef __AVX512__
#error "avx512"
#endif
In the default Build Settings, SSE4_1 is active, but AVX, AVX2 and AVX512 are not. When I add -mavx in Build Settings --> Apple Clang - Custom Compiler Flags --> Other C Flags, that enables AVX; further adding -mavx2 enables AVX and AVX2, but -mavx512 gives Unknown argument: '-mavx512'.
How do you enable avx512 and detect it?
It seems like there are a few macros to detect avx512.
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512F__ 1
#define __AVX512VL__ 1
What are the differences between them?
AVX512 isn't a single extension, and doesn't have a specific-enough meaning in this context to be useful. Compilers only deal with specific CPU features, like AVX512F, AVX512DQ, AVX512CD, etc.
All CPUs that support any AVX512 extensions must support AVX512F, the "Foundation". AVX512F is the baseline AVX512 extension that other AVX512 extensions build on.
In code that wants to use AVX512 intrinsics, you should look at https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 and pick a set of extensions that are available together on one CPU you care about, e.g. F + CD and VL, DQ, BW on currently-available Skylake-X.
Then for example use #if defined(__AVX512BW__) && defined(__AVX512VL__) before code that uses vpermt2w on 256-bit vectors or something. __AVX512(anything)__ implies __AVX512F__; that's the one extension you don't have to check for separately.
But if you only use AVX512F instructions, then yeah, just check for that macro.
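For example, a sketch of guarding by the specific features an intrinsic needs (vpermt2w on a 256-bit vector needs AVX512BW + AVX512VL):

#include <immintrin.h>

#if defined(__AVX512BW__) && defined(__AVX512VL__)
__m256i interleave_words(__m256i a, __m256i idx, __m256i b) {
    return _mm256_permutex2var_epi16(a, idx, b);   // vpermt2w ymm
}
#endif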
You should pretty much never use -mavx512f directly: use -march=skylake-avx512, -march=knl, or -march=native. Or in future, -march=icelake or whatever.
The compiler knows which CPUs support which sets of extensions (or can detect which extensions the machine you're compiling on supports). There are a lot of them, and leaving out important ones like AVX512VL (support for AVX512 instructions on 128-bit and 256-bit vectors) or Xeon Phi's AVX512ER (fast 1/x and 1/sqrt(x) with twice the precision of the normal AVX512 14-bit versions) could hurt performance significantly. Especially AVX512ER is very important if you do any division or log/exp on Xeon Phi, because full-precision division is very slow on KNL compared to Skylake.
-march=x implies -mtune=x, enabling tuning options relevant for the target as well. KNL is basically Silvermont with AVX512 bolted on, and has significant differences from -mtune=skylake-avx512.
These are the same reasons you should generally not use -mfma -mavx2 directly, except that there are currently no AMD CPUs with AVX512, so there are only 2 main tuning targets (Xeon Phi and mainstream Skylake/CannonLake/Icelake), and they also support different sets of AVX512 extensions. There is unfortunately no -mtune=generic-avx2 tuning setting, but Ryzen supports almost all extensions that Haswell does (and the ones it doesn't GCC / clang won't use automatically, like transactional memory), so -march=haswell might be reasonable to make code tuned for CPUs with FMA, AVX2, popcnt, etc, without suffering too much on Ryzen.
Also relevant (for GCC, maybe not clang currently. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html):
-mprefer-vector-width=256 auto-vectorize with 256-bit vectors by default, in case most of the time is spent in non-vectorized loops. Using 512-bit vectors reduces the max turbo clock speed by a significant amount on Intel Xeon CPUs (maybe not as much on i9 desktop versions of Skylake-X), so it can be a net slowdown to use 512-bit vectors in small scattered bits of your program. So 256 is the default for tune=skylake-avx512 in GCC, but KNL uses 512.
-mprefer-avx-128 the old version of the -mprefer-vector-width= option, before AVX512 existed.
Using AVX512 mask registers, 32 vector registers, and/or its new instructions, can be a significant win even at the same vector width, so it makes sense to enable AVX512 even if you don't want to use 512-bit vector width. (Although sometimes code using intrinsics or auto-vectorization will compile in a worse way, instead of better, if AVX512 compare-into-register versions of comparison are available at all. But hopefully anti-optimization bugs like that will be sorted out as AVX512 becomes more widely used.)

MinGW Windows GCC can't compile C program with 2GB global data

MinGW's GCC/G++ gives relocation errors when building applications with large global or static data.
Understanding the x64 code models
References to both code and data on x64 are done with instruction-relative (RIP-relative in x64 parlance) addressing modes. The offset from RIP in these instructions is limited to 32 bits. The small code model promises to the compiler that 32-bit relative offsets should be enough for all code and data references in the compiled object. The large code model, on the other hand, tells it not to make any assumptions and use absolute 64-bit addressing modes for code and data references. To make things more interesting, there's also a middle road, called the medium code model.
For the below example program, despite adding the options -mcmodel=medium or -mcmodel=large, the code fails to compile:
#define SIZE 16384
float a[SIZE][SIZE], b[SIZE][SIZE];
int main(){
    return 0;
}
gcc -mcmodel=medium example.c fails to compile on MinGW/Cygwin on Windows, and with the Intel compiler or MSVC on Windows.
You are limited to 32 bits for an offset, but this is a signed offset. So in practice, you are actually limited to 2GiB. You asked why this is not possible, but your two arrays alone are 2GiB in size, and there are things in the data segment other than just your arrays.
C is a high-level language. You get the ease of just being able to define a main function, and you get all of these other things for free -- a standard in and output, etc. The C runtime implements this for you, and all of it consumes stack space and room in your data segment. For example, if I build this on x86_64-pc-linux-gnu, my .bss size is 0x80000020 -- an additional 32 bytes. (I've erased PE information from my brain, so I don't remember how those are laid out.)
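The arithmetic, as a runnable sketch (assuming 4-byte float and SIZE = 16384 as in the question):

#include <stdio.h>

int main(void) {
    unsigned long long one_array = 16384ULL * 16384 * sizeof(float);  // 1 GiB
    unsigned long long both      = 2 * one_array;                     // 2 GiB
    printf("per array: %llu bytes, both: %llu bytes, max signed 32-bit offset: %d\n",
           one_array, both, 2147483647);
    return 0;
}

So the two arrays by themselves already exceed the +2147483647 bytes reachable with a signed 32-bit RIP-relative displacement, before counting anything else in .bss.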
I don't remember much about the various machine models, but it's probably helpful to note that the x86_64 instruction set doesn't even contain instructions (that I'm aware of, although I'm not an x86 assembly expert) to access any register-relative address beyond a signed 32-bit value. For example, when you want to cram that much stuff on the stack, gcc has to do weird things like this stack pointer allocation:
movabsq $-10000000016, %r11
addq %r11, %rsp
You can't addq $-10000000016, %rsp because it's more than a signed 32-bit offset. The same applies to RIP-relative addressing:
movq $10000000016(%rip), %rax # No such addressing mode

Why does gcc compile _mm256_permute2f128_ps to a vinsertf128 instruction?

This instruction is part of the assembly output of a C program (gcc -O2). From the result I understand that ymm6 is source operand 1: all of it is copied to ymm9, and then xmm1 is copied into the upper half, ymm9[255:128]. I read the Intel manual, but it uses Intel assembly syntax, not AT&T, and I don't want to use Intel syntax. So ymm8, ymm2 and ymm6 here are SRC1 -- is this true?
vshufps $68, %ymm0, %ymm8, %ymm6
vshufps $68, %ymm4, %ymm2, %ymm1
Vinsertf128 $1, %xmm1, %ymm6, %ymm9
And the main question is why gcc changed the instruction
row0 = _mm256_permute2f128_ps(__tt0, __tt4, 0x20);
to
Vinsertf128 $1, %xmm1, %ymm6, %ymm9
and
row4 = _mm256_permute2f128_ps(__tt0, __tt4, 0x31);
to
Vperm2f128 $49, %ymm1, %ymm6, %ymm1
How could I disable this optimization? I tried -O0 but it doesn't work.
So ymm8, ymm2 and ymm6 here are SRC1 -- is this true?
Yes, the middle operand is always src1 in a 3-operand instruction in both syntaxes.
AT&T: op %src2, %src1, %dest
Intel: op dest, src1, src2
I don't want to use Intel syntax
Tough. The only really good documentation I know of for exactly what every instruction does is the Intel insn ref manual. I used to think AT&T syntax was better, because the $ and % decorators remove ambiguity. I do like that, but otherwise prefer the Intel syntax now. The rules for each are simple enough that you can easily mentally convert, or "think" in whichever one you're reading ATM.
Unless you're actually writing GNU C inline asm, you can just use gcc -masm=intel and objdump -Mintel to get GNU-flavoured asm using intel mnemonics, operand order, and so on. The assembler directives are still gas style, not NASM. Use http://gcc.godbolt.org/ to get nicely-formatted asm output for code with only the essential labels left in.
gcc and clang both have some understanding of what the intrinsics actually do, so internally they translate the intrinsic to some data movement. When it comes time to emit code, they see that said data movement can be done with vinsertf128, so they emit that.
On some CPUs (Intel SnB-family), both instructions have equal performance, but on AMD Bulldozer-family (which only has 128b ALUs), vinsertf128 is much faster than vperm2f128. (source: see Agner Fog's guides, and other links at the x86 tag wiki). They both take 6 bytes to encode, including the immediate, so there's no code-size difference. vinsertf128 is always a better choice than a vperm2f128 that does identical data movement.
gcc and clang don't have a "literal translation of intrinsics to instructions" mode, because it would take extra work to implement. If you care exactly which instructions the compiler uses, that's what inline asm is for.
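If you do go the inline-asm route, an untested sketch of forcing the exact instruction for the 0x20 case (requires compiling with AVX enabled):

#include <immintrin.h>

static inline __m256 perm2f128_20(__m256 a, __m256 b) {
    __m256 dst;
    // vperm2f128 dst, src1, src2, imm -- operand order reversed in AT&T syntax
    asm("vperm2f128 $0x20, %2, %1, %0" : "=x"(dst) : "x"(a), "x"(b));
    return dst;
}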
Keep in mind that -O0 doesn't mean "no optimization". It still has to transform through a couple internal representations before emitting asm.
Examination of the instructions that bind to port 5 in the instruction analysis report shows that the instructions were broadcasts and vpermilps. The broadcasts can only execute on port 5, but replacing them with 128-bit loads followed by vinsertf128 instructions reduces the pressure on port 5 because vinsertf128 can execute on port 0. -- from the IACA user guide
