When compiling for CPUs that have AVX (such as with -march=sandy-bridge), GCC seems to always prefer the AVX versions of simple, scalar floating-point instructions over the SSE versions. Such as, it uses vmulsd instead of mulsd.
I'm wondering, are there any particular performance-related reasons for this, or is it just some implementation detail of GCC that makes it easier/more natural for it to schedule such instructions? From what I can tell from the sources I have (mostly Agner's instruction tables), the AVX and SSE instructions seem to be equal in performance. I realize that AVX instructions are three-operand, but GCC seems to almost always only use the same destination register as one of the source operands anyway.
Related
Title says it all. What are the differences and tradeoffs between -march=haswell, -march=core-avx2, and -mavx2 for compiling avx2 intrinsics?
I know that -mavx2 is a flag and -march=haswell/core-avx2 are architectures which just translate to a bunch of flags. So -mavx2 is a subset of the other two. But beyond that, how do I choose the right one for my application?
In particular, if I'm compiling for Skylake and adding -mtune=skylake, but I can't actually do -march=skylake, how do I decide and why? (That's my particular case, but also interested in the more general tradeoffs).
-march sets -mtune, and enables other useful stuff you forgot like -mfma and -mbmi / -mbmi2 which aren't implied by -mavx2.
Unfortunately GCC's -mtune=generic doesn't adapt itself to the subset of CPUs that support the extensions enabled. Most notably, -mavx2 rules out Sandy Bridge, but doesn't imply -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store (Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?). Out of all AVX2 CPUs, only I think Excavator would really benefit much from those options, and it's not a big fraction of market share, and I think misaligned 32-byte load/store isn't disastrous on it like it was on Sandybridge. GCC11 and later have a tune=generic that does do unaligned 32-byte load/store (https://godbolt.org/z/qr9cEcne3) so this specific issue is no longer critical if you only care about AVX2 and newer CPUs. But still you generally want a -mtune= option somewhat appropriate for the set of CPUs you're targeting. But usually erring on the side of less loop unrolling, since too much can hurt I-cache / uop cache hit rate.
AVX2 CPUs have been widespread for long enough that -mtune=generic happens to be mostly appropriate for them in GCC11 and later, fortunately. (Especially CPUs like Haswell and Zen2 that have 32-byte-wide load/store units). But in general, with other CPU features and other far-future cases, that won't always be true.
But still don't just use -mavx2, that will leave out -mfma (which all AVX2 CPUs support except for one Via model), and leave out -mbmi/2 which I think all AVX2 CPUs also support. More efficient variable-count shifts are always nice, and so is lzcnt/tzcnt and efficient x &= x-1 lowest-set-bit idioms.
-march=haswell -mtune=skylake could be useful if you want to enable a Haswell baseline for compatibility, but tune for Skylake. (There's very little difference between tuning heuristics for those microarchitectures, and probably no new instructions that GCC would use automatically. For SKL specifically, see also How can I mitigate the impact of the Intel jcc erratum on gcc? which isn't implied by -march=skylake)
-march=core-avx2 and similar are an obsolete naming convention
It's just a synonym for -march=haswell, AFAIK. But maybe leaving -mtune=generic, unlike -march=haswell which sets -mtune=haswell. How to correctly determine -march and -mtune for Intel processors? has more details.
-march=core-i7 is I think just a synonym for Nehalem, which is really dumb because Nehalem (1st-gen i7) is a different microarchitecture from Sandy Bridge and later (2nd-gen i7). Intel unfortunately kept that branding through SnB's major re-architecting of the internals, although the 3-level cache with shared inclusive L3 has been a constant.
There's no such thing as "a generic i7", because i7 includes two different microarchitectures, one with a uop cache, one without.
Good riddance to those stupid ambiguous confusing -march= / -mtune names that are trying to be generic but actually aren't.
It would be nice if there was a -march=generic-avx2 to tune for Haswell and later, Zen1 and later, etc., using the common subset of instructions they all support efficiently. And setting tuning options appropriate for all of them, not going too crazy unrolling. But there isn't.
Or more generally, -mtune=generic-enabled or something that would look at the enabled -m options and not care about CPUs that couldn't run this code. e.g. not doing rep ret if you enabled any features that a Phenom II couldn't run, such as AVX. GCC bug 78762 is basically a feature request for that.
(Current GCC -mtune=generic no longer uses rep ret; Phenom II is old enough that GCC doesn't spend extra code-size to avoid potholes there. Most binary releases of software built with GCC use a GCC version that was in development a year or more before that, so there's some lead time in terms of what's appropriate for -mtune=generic. The CPU with a glass jaw / performance pothole you're working around doesn't have to be totally gone before a current nightly build of GCC stops tuning for it.)
Following the advice of Linus Torvalds (and cross platform performance), I wish to not use avx512. Is there a flag I can specify to the compiler (both gcc and msvc) such that all avx512 instructions are split into pairs of avx2 instructions if a library I am using tries to use axv512 either from intrinsics or compiler optimiszation?
No, compile your code not to use AVX-512 in the first place by telling the compiler it can't; you only have to do anything about code using intrinsics that require AVX-512.
However, if you're compiling for a CPU that supports AVX-512, it's often worth using it, especially with 256-bit vectors to avoid the turbo-frequency and other penalties that come with 512-bit vectors. GCC's default tuning is already -mprefer-vector-width=256 for CPUs like -march=skylake-avx512.
If you want to make a binary that can run on CPUs without AVX-512, then yes obviously you need to make sure it never executes and instructions that would fault without it. e.g. gcc -O3 -march=znver2 or -march=skylake or whatever. Neither of those target arch options include AVX-512. Or -march=native if compiling for whatever CPU you have.
But if you do have a CPU that supports AVX-512, and you want to not use it, you can use something like -march=native -mno-avx512f (All other AVX-512 extensions depend on the "Foundation" AVX-512F, so disabling that also prevents even AVX-512VL for 128 and 256-bit vectors.)
(Part of the benefit of -march=native and then disabling stuff is to also set tuning options. If you want a binary that runs well on both Skylake and Zen2, I'm not sure what to recommend; probably -march=skylake or -march=znver2 are both ok; there's the default "tune=generic" but it cares too much about really old CPUs that don't even support AVX2, like Sandybridge: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)
Intrinsics
Even with intrinsics, GCC will only ever emit instructions supported by the target options, so -mno-avx512f can let you be sure you didn't miss anything. You'll get compile time errors, instead of EVEX instructions slipping through the cracks.
(MSVC is different and is designed around a single-binary model where using new instruction-sets is done in functions that you only call if the CPU supports it, so it won't stop you from using AVX-512. AFAIK, MSVC still doesn't even have an option to auto-vectorize with AVX-512, only /arch:AVX2. But anyway, MSVC won't emit AVX-512 instructions on its own if you don't tell it to, if you don't use any option like /arch:AVX512 if such a thing exists; AFAIK it doesn't have a /arch:native unfortunately. With MSVC you do have to be sure you caught all uses of intrinsics, although compiling with GCC can help to make sure your codebase doesn't do that.)
If you still want to compile code that uses _mm512_add_epi32 or _mm256_ternlog_epi32 or whatever, you'll need a version of immintrin.h that defines __m512i as a struct/class with two __m256i members and emulates all the intrinsics. Some AVX512 intrinsics won't be cheap to emulate, especially masked operations, and the whole concept of compare-into-mask to get an integer instead of a vector. So it's probably a bad idea to try to make this happen fully transparently; instead just get GCC to stop you from using any AVX-512 instructions while you make AVX2-only versions of any intrinsics code that didn't already have AVX2 versions.
Last time this came up, Coding on insufficient hardware, I was able to find an avxintrin-emu.h that let you develop for AVX while only compiling for SSE4. But I didn't find an equivalent for AVX-512. (Normally you would compile an AVX-512 binary and test it on an emulator like SDE that emulates at runtime, not compile-time.)
Agner Fog's VectorClass wrapper library (https://www.agner.org/optimize/#vectorclass) has support for basic operations like + - * /, and shuffles and blends, and has versions 512-bit vectors emulated with a pair of AVX2 vectors. (And VCL types are implicitly convertible to __m256i or __m512i and so on, so for operations it doesn't have its own functions for, you can use Intel intrinsics. But then you're back in the same boat of needing a library that emulates __m256_ternlog_epi32 with only AVX2 instructions.)
This won't stop libc from possibly using hand-written AVX-512 instructions in functions like strcmp or log/exp, since dynamic CPU dispatching happens at run-time, and you can't stop your CPU from reporting that it supports AVX-512. (Except with a VM, or by telling the kernel not to enable AVX-512 at boot, if Linux has an option for that.)
I have written a library, where I use CMake for verifying the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition to this, I check for the presence of the instructions and if present, I add the necessary compiler flags, -msse2 -mavx -mfma etc.
This is all very good, but I would like to deploy a single binary, which works across a range of generations of processors.
Question: Is it possible to tell the compiler (GCC) that whenever it optimizes a function using SIMD, it must generate code for a list of architectures? And of of course introduce high-level branches
I am thinking similar to how the compiler generates code for functions, where input pointers are either 4 or 8 byte aligned. To prevent this, I use the __builtin_assume_aligned macro.
What is best practice? Multiple binaries? Naming?
As long as you don't care about portability, yes.
Recent versions of GCC make this easier than any other compiler I'm aware of by using the target_clones function attribute. Just add the attribute, with a list of targets you want to create versions for, and GCC will automatically create the different variants, as well as a dispatch function to choose a version automatically at runtime.
If you want a bit more portability you can use the target attribute, which clang and icc also support, but you'll have to write the dispatch function yourself (which isn't difficult), and emit the function multiple times (generally using a macro, or repeatedly including a header).
AFAIK, if you want your code to work with MSVC you'll need multiple compiler invocations with different options.
If you're talking about just getting the compiler to generate SSE/AVX etc instructions, and you've got "general purpose" code (ie you're not explicitly vectorising using intrinsics, or got lots of code that the compiler will spot and auto-vectorise) then I should warn you that AVX, AVX2 or AVX512 compiling your entire codebase will probably run significantly slower than compiling for SSE versions.
When AVX opcodes using the upper halves of the registers are detected, the CPU powers up the upper half of the circuitry (which is otherwise powered down). This consumes more power, generates more heat and reduces the base clock speed of the chip, typically by 10-20% depending on the mix of high power and low-power opcodes, so you lose maybe 15% of performance immediately, and then have to be doing quite a lot of vectorised processing in order to make up for this performance deficit before you start seeing any gains.
See my longer explanation and references in this thread.
If on the other hand you're explicitly vectorising using intrinsics and you're sure you have large enough burst of AVX etc to make it worthwhile, I've successfully written code where I tell MSVC to compile for SSE2 (default for x64) but then I dynamically check the CPU capabilities and some functions switch to a codepath implemented using AVX intrinsics.
MSVC allows this (it will produce warnings, but you can silence these), but the same technique is hard to make work under GCC 4.9 as the intrinsics are only considered declared by the compiler when the appropriate code generation flag is used. [UPDATE: #nemequ explains below how you can make this work under gcc using attributes to decorate the functions] Depending on the version of GCC you may have to compile files with different flags to get a workable system.
Oh, and you have to watch for AVX-SSE transitions too (call VZEROUPPER when you leave an AVX section of code to return to SSE code) - it can be done but I found that understanding the CPU implications was a bigger battle than I originally envisaged.
Is it possible to use xmm register parameter with AVX intrinsics function (_mm256_**_**)?
My code require the usage of vecter integer operation (for load and storing data) along with vector floating point operation. The integer code is written with SSE2 intrinsics to be compatible with older CPU, while floating point is written with AVX to improve speed (there is also SSE code branch, so do not suggest this).
Currently, except for using compiler flag to automatically convert all SSE instructions to VEX-encoded version, are there any way using intrinsics function (i.e. no inline/external assembly) to force the use of VEX-encoded instruction on XMM register?
Note: I tried _mm256_castsi128_si256(), and this generates instruction with ymm operand.
You have a processor with AVX. It does not have XMM registers in only has YMM registers. If you compile all your code with AVX support (e.g. with -mavx in GCC or /arch:AVX in MSVC) then all your SSE2 code operates on the lower 128-bits of the YMM registers. There is nothing to worry about.
However, let's say you have two different modules one you compiled with SSE2 support (e.g. with -msse2 in GCC or /arch:SSE2 in MSVC) and the other with AVX support and you use functions from both then you do have something to worry about when you switch between them. In that case you should call _mm256_zeroupper() or _mm256_zeroall() when you switch from AVX to SSE2 code unless you want to take a performance hit. Using AVX CPU instructions: Poor performance without "/arch:AVX"
The simple solutions is to just compile all your code with AVX support. The only reason I can think of to compile different modules with different instruction set support is if you want to make a CPU dispatcher so your code can run on different processors. That's a bit of a pain to implement. But then you don't do state changes so the only time I can think of you need to worry about a state change is when you call functions from a shared library which were compiled with another instruction set (e.g. a DLL compiled with SSE2). In that case you may need to call _mm256_zeroupper() or _mm256_zeroall() when calling the library function from AVX code.
When compiled for processor that support AVX extension (say -m64 -march=corei7-avx -mtune=corei7-avx is applicable), does it make sense to use -mfpmath=both -mavx keys at the same time? Does not it so much that it causes the compiler to use three sets of instructions (i87, SSE, AVX) at the same time? Or just i87 for scalars (in some sense) and AVX for vectors only?
The AVX registers are only extensions of the SSE registers. You cannot mix SSE and AVX instructions to increase the number of available registers (you can still mix x87 and AVX instructions, I assume this is what -mfpmath=both does in this case).
See for instance the discussion “Mixing AVX and SSE” on this page.
You normally don't want this; I don't think gcc is smart enough at deciding to use x87 when there's high register pressure to make it worth it.
x87 and SSE/AVX instructions compete for the same FP execution units on normal x86 CPUs (Intel and AMD), so you don't get more throughput from interleaving them.
Normally you should just use -mfpmath=sse (which means AVX when used with -mavx, or better -mfpmath=sse -march=native. The default for x86-64 is sse, so -mfpmath=sse only changes anything for -m32, AFAIK.
The main benefit of -mfpmath=both is more total registers, but managing the x87 register stack often costs extra instructions. Moving data between x87 and AVX also costs a store/reload (store-forwarding round trip, ~6 cycle latency on Haswell, http://agner.org/optimize/), so it's only really useful if you have two independent sets of calculations for the compiler to interleave. Otherwise it's not better than normal spill/reload.
Last time I looked at gcc -O3 -mfpmath=both, the results were not impressive: https://godbolt.org/g/p2KLEC shows gcc5.4 using some store/reloads to bounce data between x87 and xmm (AVX) registers. It would be better off just keeping some of the constants in memory.