How can I compile *without* various instruction sets enabled? - gcc

I am attempting to recompile some software with various instruction sets, specifically, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, and AVX, and I would like to see how the code performs without these instruction sets to make sure I am getting the full effect of them.
For example, I want to compile it with just -O2 with a gnu compiler and see how it performs when restricting it to only SSE, to see which flags it is invoking by default. I also have an intel compiler that I am working with and I would like to isolate what each flag (or combination of flags) is doing to my code, so how can I specify exactly which flags are being invoked?
If it matters, I am working with C, C++, and Fortran on Linux.

For GCC compiler:
You have to use the -mno-<option> switches for this, for example -mno-avx or -mno-sse4.2.
These switches enable or disable the use of instructions in the MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, AVX512F, AVX512PF, AVX512ER, AVX512CD, SHA, AES, PCLMUL, FSGSBASE, RDRND, F16C, FMA, SSE4A, FMA4, XOP, LWP, ABM, BMI, BMI2, FXSR, XSAVE, XSAVEOPT, LZCNT, RTM, or 3DNow! extended instruction sets. These extensions are also available as built-in functions: see X86 Built-in Functions, for details of the functions enabled and disabled by these switches.
You can find more info on the official GCC site.
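To see which of these switches are actually in effect for a given set of flags, one option is to look at the macros GCC predefines (__SSE2__, __AVX__ and so on). The following is only a minimal sketch; the ISA_* names and both function names are made up for this example and are not part of GCC.

```c
#include <stdio.h>

/* Bit flags for the instruction sets this sketch reports on.
   The ISA_* names are hypothetical, chosen for this example. */
enum {
    ISA_SSE     = 1u << 0,
    ISA_SSE2    = 1u << 1,
    ISA_SSE3    = 1u << 2,
    ISA_SSSE3   = 1u << 3,
    ISA_SSE4_1  = 1u << 4,
    ISA_SSE4_2  = 1u << 5,
    ISA_AVX     = 1u << 6,
    ISA_AVX2    = 1u << 7,
    ISA_AVX512F = 1u << 8
};

/* Returns the extensions the compiler was allowed to use for this
   translation unit, based on GCC's predefined macros. */
unsigned enabled_isa_mask(void)
{
    unsigned mask = 0;
#ifdef __SSE__
    mask |= ISA_SSE;
#endif
#ifdef __SSE2__
    mask |= ISA_SSE2;
#endif
#ifdef __SSE3__
    mask |= ISA_SSE3;
#endif
#ifdef __SSSE3__
    mask |= ISA_SSSE3;
#endif
#ifdef __SSE4_1__
    mask |= ISA_SSE4_1;
#endif
#ifdef __SSE4_2__
    mask |= ISA_SSE4_2;
#endif
#ifdef __AVX__
    mask |= ISA_AVX;
#endif
#ifdef __AVX2__
    mask |= ISA_AVX2;
#endif
#ifdef __AVX512F__
    mask |= ISA_AVX512F;
#endif
    return mask;
}

/* Prints one line per extension with "on" or "off". */
void print_enabled_isa(void)
{
    static const char *const names[] = {
        "SSE", "SSE2", "SSE3", "SSSE3", "SSE4.1",
        "SSE4.2", "AVX", "AVX2", "AVX512F"
    };
    unsigned mask = enabled_isa_mask();
    for (unsigned i = 0; i < sizeof names / sizeof names[0]; ++i)
        printf("%-8s %s\n", names[i], (mask & (1u << i)) ? "on" : "off");
}
```

Calling print_enabled_isa() from a small main and rebuilding with, say, -O2, then -O2 -march=native, then -O2 -march=native -mno-avx should show the difference between the builds. GCC can also report this without any code: gcc -Q --help=target <flags> lists the state of each -m option, and gcc -dM -E <flags> - < /dev/null dumps the predefined macros.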
For the ICC compiler, you have to use a combination of:
-march=<cpu>   generate code only for a specific CPU
-mtune=<cpu>   optimize (tune) for a specific CPU
-msse3, -msse4, -mavx, etc.   level of SIMD and vector instructions
More info here

Related

Is there way to automatically replace avx512 with avx2?

Following the advice of Linus Torvalds (and for cross-platform performance), I wish to not use avx512. Is there a flag I can specify to the compiler (both gcc and msvc) such that all avx512 instructions are split into pairs of avx2 instructions if a library I am using tries to use avx512, either from intrinsics or compiler optimization?
No, compile your code not to use AVX-512 in the first place by telling the compiler it can't; you only have to do anything about code using intrinsics that require AVX-512.
However, if you're compiling for a CPU that supports AVX-512, it's often worth using it, especially with 256-bit vectors to avoid the turbo-frequency and other penalties that come with 512-bit vectors. GCC's default tuning is already -mprefer-vector-width=256 for CPUs like -march=skylake-avx512.
If you want to make a binary that can run on CPUs without AVX-512, then yes obviously you need to make sure it never executes any instructions that would fault without it, e.g. gcc -O3 -march=znver2 or -march=skylake or whatever. Neither of those target arch options includes AVX-512. Or -march=native if compiling for whatever CPU you have.
But if you do have a CPU that supports AVX-512, and you want to not use it, you can use something like -march=native -mno-avx512f. (All other AVX-512 extensions depend on the "Foundation" AVX-512F, so disabling that also prevents even AVX-512VL for 128 and 256-bit vectors.)
(Part of the benefit of -march=native and then disabling stuff is to also set tuning options. If you want a binary that runs well on both Skylake and Zen2, I'm not sure what to recommend; probably -march=skylake or -march=znver2 are both ok; there's the default "tune=generic" but it cares too much about really old CPUs that don't even support AVX2, like Sandybridge: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)
Intrinsics
Even with intrinsics, GCC will only ever emit instructions supported by the target options, so -mno-avx512f can let you be sure you didn't miss anything. You'll get compile time errors, instead of EVEX instructions slipping through the cracks.
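As a rough illustration of how intrinsics code can follow the target options, here is a sketch that picks its vector width from the macros GCC predefines. With -march=native -mno-avx512f, __AVX512F__ is not defined, so only the AVX2 (or scalar) path is compiled. The function name add_i32 is hypothetical and the structure is just one way to lay this out.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__) || defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Adds two int32 arrays element-wise (wrapping on overflow).
   The widest path is chosen at compile time from the target options. */
void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    size_t i = 0;
#if defined(__AVX512F__)
    /* Only compiled when AVX-512F is enabled, e.g. -march=skylake-avx512. */
    for (; i + 16 <= n; i += 16) {
        __m512i va = _mm512_loadu_si512((const void *)(a + i));
        __m512i vb = _mm512_loadu_si512((const void *)(b + i));
        _mm512_storeu_si512((void *)(out + i), _mm512_add_epi32(va, vb));
    }
#elif defined(__AVX2__)
    /* Used with e.g. -march=native -mno-avx512f on an AVX-512 machine. */
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
    }
#endif
    /* Scalar tail, and the only path when neither macro is defined. */
    for (; i < n; ++i)
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}
```

If the #if guard around the 512-bit loop were missing, building with -mno-avx512f would fail at compile time instead of quietly emitting EVEX instructions, which is the safety net described above.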
(MSVC is different and is designed around a single-binary model where using new instruction-sets is done in functions that you only call if the CPU supports it, so it won't stop you from using AVX-512. AFAIK, MSVC still doesn't even have an option to auto-vectorize with AVX-512, only /arch:AVX2. But anyway, MSVC won't emit AVX-512 instructions on its own if you don't tell it to, if you don't use any option like /arch:AVX512 if such a thing exists; AFAIK it doesn't have a /arch:native unfortunately. With MSVC you do have to be sure you caught all uses of intrinsics, although compiling with GCC can help to make sure your codebase doesn't do that.)
If you still want to compile code that uses _mm512_add_epi32 or _mm256_ternarylogic_epi32 or whatever, you'll need a version of immintrin.h that defines __m512i as a struct/class with two __m256i members and emulates all the intrinsics. Some AVX512 intrinsics won't be cheap to emulate, especially masked operations, and the whole concept of compare-into-mask to get an integer instead of a vector. So it's probably a bad idea to try to make this happen fully transparently; instead just get GCC to stop you from using any AVX-512 instructions while you make AVX2-only versions of any intrinsics code that didn't already have AVX2 versions.
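To make the "struct with two __m256i members" idea concrete, here is a very small sketch that emulates just a load, a store and _mm512_add_epi32. All emu512* names are invented for this example and do not come from any existing header. It assumes GCC or clang, uses the target attribute so the file builds without -mavx2, and checks the CPU at run time before taking the AVX2 path.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical stand-in for __m512i: two 256-bit halves. */
typedef struct {
    __m256i lo;  /* elements 0..7  */
    __m256i hi;  /* elements 8..15 */
} emu512i;

/* Emulates an unaligned 512-bit load with two 256-bit loads. */
__attribute__((target("avx2")))
static emu512i emu512_loadu(const void *p)
{
    const __m256i *q = (const __m256i *)p;
    emu512i v = { _mm256_loadu_si256(q), _mm256_loadu_si256(q + 1) };
    return v;
}

/* Emulates an unaligned 512-bit store with two 256-bit stores. */
__attribute__((target("avx2")))
static void emu512_storeu(void *p, emu512i v)
{
    __m256i *q = (__m256i *)p;
    _mm256_storeu_si256(q, v.lo);
    _mm256_storeu_si256(q + 1, v.hi);
}

/* Emulates _mm512_add_epi32 with two AVX2 adds. */
__attribute__((target("avx2")))
static emu512i emu512_add_epi32(emu512i a, emu512i b)
{
    emu512i r = { _mm256_add_epi32(a.lo, b.lo),
                  _mm256_add_epi32(a.hi, b.hi) };
    return r;
}

__attribute__((target("avx2")))
static void add16_avx2(const int32_t *a, const int32_t *b, int32_t *out)
{
    emu512_storeu(out, emu512_add_epi32(emu512_loadu(a), emu512_loadu(b)));
}
#endif

/* Adds 16 int32 values; takes the emulated path when the CPU has AVX2,
   otherwise falls back to a scalar loop. */
void add16_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        add16_avx2(a, b, out);
        return;
    }
#endif
    for (int i = 0; i < 16; ++i)
        out[i] = a[i] + b[i];
}
```

Lane-wise operations like this are the easy part. Masked operations and compare-into-mask have no such one-to-one mapping, which is why a fully transparent emulation layer is a lot more work than this sketch suggests.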
Last time this came up, Coding on insufficient hardware, I was able to find an avxintrin-emu.h that let you develop for AVX while only compiling for SSE4. But I didn't find an equivalent for AVX-512. (Normally you would compile an AVX-512 binary and test it on an emulator like SDE that emulates at runtime, not compile-time.)
Agner Fog's VectorClass wrapper library (https://www.agner.org/optimize/#vectorclass) has support for basic operations like + - * /, and shuffles and blends, and has versions of 512-bit vectors emulated with a pair of AVX2 vectors. (And VCL types are implicitly convertible to __m256i or __m512i and so on, so for operations it doesn't have its own functions for, you can use Intel intrinsics. But then you're back in the same boat of needing a library that emulates _mm256_ternarylogic_epi32 with only AVX2 instructions.)
This won't stop libc from possibly using hand-written AVX-512 instructions in functions like strcmp or log/exp, since dynamic CPU dispatching happens at run-time, and you can't stop your CPU from reporting that it supports AVX-512. (Except with a VM, or by telling the kernel not to enable AVX-512 at boot, if Linux has an option for that.)

Generate code for multiple SIMD architectures

I have written a library, where I use CMake for verifying the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition to this, I check for the presence of the instructions and if present, I add the necessary compiler flags, -msse2 -mavx -mfma etc.
This is all very good, but I would like to deploy a single binary, which works across a range of generations of processors.
Question: Is it possible to tell the compiler (GCC) that whenever it optimizes a function using SIMD, it must generate code for a list of architectures? And, of course, introduce high-level branches?
I am thinking of something similar to how the compiler generates code for functions where input pointers are either 4 or 8 byte aligned. To prevent this, I use the __builtin_assume_aligned built-in.
What is best practice? Multiple binaries? Naming?
As long as you don't care about portability, yes.
Recent versions of GCC make this easier than any other compiler I'm aware of by using the target_clones function attribute. Just add the attribute, with a list of targets you want to create versions for, and GCC will automatically create the different variants, as well as a dispatch function to choose a version automatically at runtime.
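A minimal sketch of what that can look like. The MULTIVERSION macro and the scale_add function are hypothetical names for this example, and the guard assumes GCC on x86 GNU/Linux, since target_clones relies on ifunc support; elsewhere the macro expands to nothing and you get a single plain version.

```c
#include <stddef.h>

/* target_clones needs GCC with ifunc support (e.g. x86 GNU/Linux).
   On other setups this sketch builds one ordinary version instead. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__gnu_linux__) && \
    (defined(__x86_64__) || defined(__i386__))
#define MULTIVERSION \
    __attribute__((target_clones("avx2", "sse4.2", "default")))
#else
#define MULTIVERSION
#endif

/* y[i] += a * x[i]. GCC compiles one clone per listed target and adds
   a resolver that selects a clone when the program is loaded. */
MULTIVERSION
void scale_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Callers just call scale_add() as usual. Building with something like -O3 lets the auto-vectorizer use AVX2 in the "avx2" clone, while the "default" clone stays at the baseline given on the command line.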
If you want a bit more portability you can use the target attribute, which clang and icc also support, but you'll have to write the dispatch function yourself (which isn't difficult), and emit the function multiple times (generally using a macro, or repeatedly including a header).
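For the target-attribute route, a hand-written dispatcher could look roughly like this. It is a sketch assuming GCC or clang on x86; sum_avx, sum_baseline, pick_sum and sum_floats are names made up for this example, and __builtin_cpu_supports is the GCC/clang built-in used for the run-time check.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX version: only this function is allowed to use AVX instructions. */
__attribute__((target("avx")))
static float sum_avx(const float *x, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; ++k)
        total += lanes[k];
    for (; i < n; ++i)      /* leftover elements */
        total += x[i];
    return total;
}
#endif

/* Baseline version, built with whatever the command line allows. */
static float sum_baseline(const float *x, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
}

typedef float (*sum_fn)(const float *, size_t);

/* Hand-written dispatch: ask the CPU once which version it can run. */
static sum_fn pick_sum(void)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx"))
        return sum_avx;
#endif
    return sum_baseline;
}

float sum_floats(const float *x, size_t n)
{
    /* Resolved on first call. This sketch ignores thread safety. */
    static sum_fn impl;
    if (!impl)
        impl = pick_sum();
    return impl(x, n);
}
```

Note that the two versions add the elements in a different order, so for general floating-point input the results can differ in the last bits.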
AFAIK, if you want your code to work with MSVC you'll need multiple compiler invocations with different options.
If you're talking about just getting the compiler to generate SSE/AVX etc instructions, and you've got "general purpose" code (i.e. you're not explicitly vectorising using intrinsics, and don't have lots of code that the compiler will spot and auto-vectorise), then I should warn you that compiling your entire codebase for AVX, AVX2 or AVX512 will probably make it run significantly slower than compiling for SSE versions.
When AVX opcodes using the upper halves of the registers are detected, the CPU powers up the upper half of the circuitry (which is otherwise powered down). This consumes more power, generates more heat and reduces the base clock speed of the chip, typically by 10-20% depending on the mix of high power and low-power opcodes, so you lose maybe 15% of performance immediately, and then have to be doing quite a lot of vectorised processing in order to make up for this performance deficit before you start seeing any gains.
See my longer explanation and references in this thread.
If on the other hand you're explicitly vectorising using intrinsics and you're sure you have large enough burst of AVX etc to make it worthwhile, I've successfully written code where I tell MSVC to compile for SSE2 (default for x64) but then I dynamically check the CPU capabilities and some functions switch to a codepath implemented using AVX intrinsics.
MSVC allows this (it will produce warnings, but you can silence these), but the same technique is hard to make work under GCC 4.9 as the intrinsics are only considered declared by the compiler when the appropriate code generation flag is used. [UPDATE: @nemequ explains below how you can make this work under gcc using attributes to decorate the functions] Depending on the version of GCC you may have to compile files with different flags to get a workable system.
Oh, and you have to watch for AVX-SSE transitions too (call VZEROUPPER when you leave an AVX section of code to return to SSE code) - it can be done but I found that understanding the CPU implications was a bigger battle than I originally envisaged.

Why does GCC prefer the AVX version of FP instructions?

When compiling for CPUs that have AVX (such as with -march=sandybridge), GCC seems to always prefer the AVX versions of simple, scalar floating-point instructions over the SSE versions. For example, it uses vmulsd instead of mulsd.
I'm wondering, are there any particular performance-related reasons for this, or is it just some implementation detail of GCC that makes it easier/more natural for it to schedule such instructions? From what I can tell from the sources I have (mostly Agner's instruction tables), the AVX and SSE instructions seem to be equal in performance. I realize that AVX instructions are three-operand, but GCC seems to almost always only use the same destination register as one of the source operands anyway.

Using xmm parameter in AVX intrinsics

Is it possible to use xmm register parameter with AVX intrinsics function (_mm256_**_**)?
My code requires the usage of vector integer operations (for loading and storing data) along with vector floating point operations. The integer code is written with SSE2 intrinsics to be compatible with older CPUs, while the floating point code is written with AVX to improve speed (there is also an SSE code branch, so do not suggest this).
Currently, except for using a compiler flag to automatically convert all SSE instructions to their VEX-encoded versions, is there any way using intrinsic functions (i.e. no inline/external assembly) to force the use of VEX-encoded instructions on XMM registers?
Note: I tried _mm256_castsi128_si256(), and this generates instruction with ymm operand.
You have a processor with AVX. It does not have separate XMM registers; it only has YMM registers. If you compile all your code with AVX support (e.g. with -mavx in GCC or /arch:AVX in MSVC) then all your SSE2 code operates on the lower 128 bits of the YMM registers. There is nothing to worry about.
However, let's say you have two different modules, one compiled with SSE2 support (e.g. with -msse2 in GCC or /arch:SSE2 in MSVC) and the other with AVX support, and you use functions from both. Then you do have something to worry about when you switch between them. In that case you should call _mm256_zeroupper() or _mm256_zeroall() when you switch from AVX to SSE2 code unless you want to take a performance hit. See: Using AVX CPU instructions: Poor performance without "/arch:AVX"
The simple solution is to just compile all your code with AVX support. The only reason I can think of to compile different modules with different instruction set support is if you want to make a CPU dispatcher so your code can run on different processors. That's a bit of a pain to implement. But then you don't do state changes, so the only time I can think of that you need to worry about a state change is when you call functions from a shared library which were compiled with another instruction set (e.g. a DLL compiled with SSE2). In that case you may need to call _mm256_zeroupper() or _mm256_zeroall() when calling the library function from AVX code.
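A sketch of the mixed case: one AVX function living in a file that is otherwise compiled for SSE2, calling _mm256_zeroupper() before it hands control back. The names scale_avx and scale are hypothetical. It assumes GCC or clang (target attribute and __builtin_cpu_supports). As far as I know, GCC usually inserts vzeroupper by itself in AVX-enabled functions, so the explicit call here mainly documents the intent.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX code path: multiplies every element of x by a. */
__attribute__((target("avx")))
static void scale_avx(float *x, float a, size_t n)
{
    const __m256 va = _mm256_set1_ps(a);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), va));
    for (; i < n; ++i)
        x[i] *= a;

    /* Clear the upper halves of the YMM registers before returning to
       callers that may run legacy (non-VEX) SSE instructions. */
    _mm256_zeroupper();
}
#endif

/* Entry point compiled for the baseline (SSE2 on x86-64). */
void scale(float *x, float a, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx")) {
        scale_avx(x, a, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; ++i)
        x[i] *= a;
}
```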

Disabling -msse

I am trying to run various benchmark tests using CPU2006 to see what various optimizations do in terms of speed on gcc. I am familiar with -O1, -O2, and -O3, but have heard that -msse is a decent optimization. What exactly is -msse? I've also seen that -msse is default on a 64 bit architecture, so how do I disable it to compare the difference between using it and not using it?
-msse activates the generation of SSE instructions. All 64-bit processors (x86-64) have them, but some older 32-bit processors (IA-32) do not have these instructions. This is the reason for GCC's default settings.
SSE instructions have to do with vector operations and floating-point. Considering that opportunities for automatic vectorization are rare in general-purpose code, the only difference you are likely to observe is if you use floating-point.
On 64-bit, to disable SSE instructions, use -mno-sse.
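When comparing builds, it helps to confirm that the flag actually changed how floating-point code is generated. Here is a small sketch based on GCC's predefined macros __SSE_MATH__ / __SSE2_MATH__ and the standard FLT_EVAL_METHOD from <float.h>; the two function names are made up for this example. One caveat, as far as I know: on x86-64 the calling convention passes floating-point values in SSE registers, so -mno-sse tends to be usable only for code without floating-point arguments or return values, and a 32-bit build (-m32, optionally with -mfpmath=387) is the easier place to compare.

```c
#include <float.h>

/* Returns 1 if scalar floating-point math is compiled to SSE
   instructions, 0 otherwise (for example x87 code on 32-bit x86).
   Relies on macros that GCC predefines. */
int uses_sse_math(void)
{
#if defined(__SSE2_MATH__) || defined(__SSE_MATH__)
    return 1;
#else
    return 0;
#endif
}

/* FLT_EVAL_METHOD: 0 means float and double expressions are evaluated
   in their own type, 2 means they are evaluated in long double, which
   is typical for x87 code. */
int fp_eval_method(void)
{
    return FLT_EVAL_METHOD;
}
```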
http://www.justskins.com/forums/gcc-option-msse-and-128289.html
SSE (http://it.wikipedia.org/wiki/Streaming_SIMD_Extensions), as the name says, are the Streaming SIMD Extensions instructions, present in processors since the Pentium 3. They are fast for some kinds of vector and floating-point computation. They are available in all 64-bit processors, so why should we disable them?
You can choose between -msse and -msse2. SSE2 is another instruction set built on top of SSE that adds other powerful and very fast vector instructions.
Pentium 3 did have SSE, and is a 32 bit processor.
SSE2 is more modern: the Pentium 4, which is still a 32-bit processor, has SSE2.
