I have written a library, where I use CMake for verifying the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition to this, I check for the presence of the instructions and if present, I add the necessary compiler flags, -msse2 -mavx -mfma etc.
This is all very good, but I would like to deploy a single binary, which works across a range of generations of processors.
Question: Is it possible to tell the compiler (GCC) that whenever it optimizes a function using SIMD, it must generate code for a list of architectures? And of of course introduce high-level branches
I am thinking similar to how the compiler generates code for functions, where input pointers are either 4 or 8 byte aligned. To prevent this, I use the __builtin_assume_aligned macro.
What is best practice? Multiple binaries? Naming?
As long as you don't care about portability, yes.
Recent versions of GCC make this easier than any other compiler I'm aware of by using the target_clones function attribute. Just add the attribute, with a list of targets you want to create versions for, and GCC will automatically create the different variants, as well as a dispatch function to choose a version automatically at runtime.
If you want a bit more portability you can use the target attribute, which clang and icc also support, but you'll have to write the dispatch function yourself (which isn't difficult), and emit the function multiple times (generally using a macro, or repeatedly including a header).
AFAIK, if you want your code to work with MSVC you'll need multiple compiler invocations with different options.
If you're talking about just getting the compiler to generate SSE/AVX etc instructions, and you've got "general purpose" code (ie you're not explicitly vectorising using intrinsics, or got lots of code that the compiler will spot and auto-vectorise) then I should warn you that AVX, AVX2 or AVX512 compiling your entire codebase will probably run significantly slower than compiling for SSE versions.
When AVX opcodes using the upper halves of the registers are detected, the CPU powers up the upper half of the circuitry (which is otherwise powered down). This consumes more power, generates more heat and reduces the base clock speed of the chip, typically by 10-20% depending on the mix of high power and low-power opcodes, so you lose maybe 15% of performance immediately, and then have to be doing quite a lot of vectorised processing in order to make up for this performance deficit before you start seeing any gains.
See my longer explanation and references in this thread.
If on the other hand you're explicitly vectorising using intrinsics and you're sure you have large enough burst of AVX etc to make it worthwhile, I've successfully written code where I tell MSVC to compile for SSE2 (default for x64) but then I dynamically check the CPU capabilities and some functions switch to a codepath implemented using AVX intrinsics.
MSVC allows this (it will produce warnings, but you can silence these), but the same technique is hard to make work under GCC 4.9 as the intrinsics are only considered declared by the compiler when the appropriate code generation flag is used. [UPDATE: #nemequ explains below how you can make this work under gcc using attributes to decorate the functions] Depending on the version of GCC you may have to compile files with different flags to get a workable system.
Oh, and you have to watch for AVX-SSE transitions too (call VZEROUPPER when you leave an AVX section of code to return to SSE code) - it can be done but I found that understanding the CPU implications was a bigger battle than I originally envisaged.
Related
Following the advice of Linus Torvalds (and cross platform performance), I wish to not use avx512. Is there a flag I can specify to the compiler (both gcc and msvc) such that all avx512 instructions are split into pairs of avx2 instructions if a library I am using tries to use axv512 either from intrinsics or compiler optimiszation?
No, compile your code not to use AVX-512 in the first place by telling the compiler it can't; you only have to do anything about code using intrinsics that require AVX-512.
However, if you're compiling for a CPU that supports AVX-512, it's often worth using it, especially with 256-bit vectors to avoid the turbo-frequency and other penalties that come with 512-bit vectors. GCC's default tuning is already -mprefer-vector-width=256 for CPUs like -march=skylake-avx512.
If you want to make a binary that can run on CPUs without AVX-512, then yes obviously you need to make sure it never executes and instructions that would fault without it. e.g. gcc -O3 -march=znver2 or -march=skylake or whatever. Neither of those target arch options include AVX-512. Or -march=native if compiling for whatever CPU you have.
But if you do have a CPU that supports AVX-512, and you want to not use it, you can use something like -march=native -mno-avx512f (All other AVX-512 extensions depend on the "Foundation" AVX-512F, so disabling that also prevents even AVX-512VL for 128 and 256-bit vectors.)
(Part of the benefit of -march=native and then disabling stuff is to also set tuning options. If you want a binary that runs well on both Skylake and Zen2, I'm not sure what to recommend; probably -march=skylake or -march=znver2 are both ok; there's the default "tune=generic" but it cares too much about really old CPUs that don't even support AVX2, like Sandybridge: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)
Intrinsics
Even with intrinsics, GCC will only ever emit instructions supported by the target options, so -mno-avx512f can let you be sure you didn't miss anything. You'll get compile time errors, instead of EVEX instructions slipping through the cracks.
(MSVC is different and is designed around a single-binary model where using new instruction-sets is done in functions that you only call if the CPU supports it, so it won't stop you from using AVX-512. AFAIK, MSVC still doesn't even have an option to auto-vectorize with AVX-512, only /arch:AVX2. But anyway, MSVC won't emit AVX-512 instructions on its own if you don't tell it to, if you don't use any option like /arch:AVX512 if such a thing exists; AFAIK it doesn't have a /arch:native unfortunately. With MSVC you do have to be sure you caught all uses of intrinsics, although compiling with GCC can help to make sure your codebase doesn't do that.)
If you still want to compile code that uses _mm512_add_epi32 or _mm256_ternlog_epi32 or whatever, you'll need a version of immintrin.h that defines __m512i as a struct/class with two __m256i members and emulates all the intrinsics. Some AVX512 intrinsics won't be cheap to emulate, especially masked operations, and the whole concept of compare-into-mask to get an integer instead of a vector. So it's probably a bad idea to try to make this happen fully transparently; instead just get GCC to stop you from using any AVX-512 instructions while you make AVX2-only versions of any intrinsics code that didn't already have AVX2 versions.
Last time this came up, Coding on insufficient hardware, I was able to find an avxintrin-emu.h that let you develop for AVX while only compiling for SSE4. But I didn't find an equivalent for AVX-512. (Normally you would compile an AVX-512 binary and test it on an emulator like SDE that emulates at runtime, not compile-time.)
Agner Fog's VectorClass wrapper library (https://www.agner.org/optimize/#vectorclass) has support for basic operations like + - * /, and shuffles and blends, and has versions 512-bit vectors emulated with a pair of AVX2 vectors. (And VCL types are implicitly convertible to __m256i or __m512i and so on, so for operations it doesn't have its own functions for, you can use Intel intrinsics. But then you're back in the same boat of needing a library that emulates __m256_ternlog_epi32 with only AVX2 instructions.)
This won't stop libc from possibly using hand-written AVX-512 instructions in functions like strcmp or log/exp, since dynamic CPU dispatching happens at run-time, and you can't stop your CPU from reporting that it supports AVX-512. (Except with a VM, or by telling the kernel not to enable AVX-512 at boot, if Linux has an option for that.)
Let's say I take a compiler: gcc 4.8. And processor from intel, let's say skylake or some other fancy new family.
checking this question: How to see which flags -march=native will activate?; if I do gcc -march=native -E -v - </dev/null 2>&1 | grep cc1, this will spout out some flags for the host machine, which is the above processor, skylake.
How does gcc know what flags to enable disable... when 4.8 was released before skylake processors were out? What about other newer family of processors?
Consequently, next question is upgrading the compiler to latest necessary for it accurately and optimally compile for target processor which is new?
The question isn't really specific to gcc/intel, I would like to know how others maintain synchronicity between processor and compiler too.
Old compilers don't know how to tune for new microarchitectures. (And are also missing out on better optimization in general: New versions of gcc/clang usually add new optimizations that help across the board, e.g. gcc8 can coalesce loads/stores of multiple adjacent small variables or array elements into a single 4 or 8-byte load or store. This helps on everything.)
They can also only use ISA extensions they know about.
They can make correct code because new x86 CPUs are still x86, and are backwards compatible with code for older CPUs1. Same with ARM. The ARMv8 ISA is backwards compatible with ARMv7, ARMv6, and so on, so new ARM CPUs can run existing ARM binaries. (There are some AArch64 CPUs that dropped support for 32-bit mode, but nevermind that.)
Consequently, next question is upgrading the compiler to latest necessary for it accurately and optimally compile for target processor which is new?
Yes, you want your compiler to at least know about your CPU for tuning options.
But yes, always, even when your CPU isn't new. New compiler versions often benefit old CPUs, too, but yes a new set of SIMD extensions to auto-vectorize with can lead to potentially large speedups for code that spends a lot of time in one hot loop. Assuming that loop auto-vectorizes well.
e.g. Phoronix recently posted GCC 5 Through GCC 10 Compiler Benchmarks - Five Years Worth Of C/C++ Compiler Performance where they benchmarked on an i7 5960X (Haswell-E) CPU. I think GCC5 knows about -march=haswell. GCC9.2 makes measurably faster code than even gcc8 on some benchmarks.
But I can pretty much guarantee it's not optimal!! Compilers are good over large scales but there's usually something a human can find in a single hot loop, if they know the low level details of optimizing for a given microarchitecture. It's merely as good as you're going to get from any compiler. (Actually performance regressions exist, so even that's not always true. File a missed-optimization bug if you find one).
-march=native does two separate things
CPU feature detection to enable stuff like -mfma and -mbmi2. This is easy on x86 with the CPUID instruction. GCC will enable all extensions it knows about that are supported by the actual CPU. e.g. I think GCC4.8 was the first GCC to know about any AVX512 extensions, so you might even get some AVX512 auto-vectorization on an Ice Lake or Skylake-avx512. Whether it does a good job or not is another matter, for anything non-trivial. But no AVX512 with GCC4.7.
CPU type detection to set -mtune=skylake. This depends on GCC actually recognizing your specific CPU as something it knows about. If not, it falls back to -mtune=generic. It might detect (with CPUID) your L1/L2/L3 cache sizes and use that to influence some tuning decisions like inlining / unrolling, instead of using a known size for -mtune=haswell. I don't think that's a big deal; current compilers don't AFAIK introduce cache-blocking optimizations to matmul loops or things like that, and that's where knowing cache sizes really matters.
CPU type detection can also use CPUID on x86; the vendor-string and model / family / stepping numbers uniquely identify the microarchitecture. ((wikipedia), sandpile, InstLatx64, https://agner.org/optimize/)
x86 is very much designed to support single binaries that run on multiple microarchitectures and might want do to runtime feature detection / dispatching. So an efficient / portable / extensible CPU detection mechanism exists in the form of the CPUID instruction, introduced in Pentium and some late 486 CPUs. (And thus baseline for x86-64.)
Other ISAs are more often used in embedded uses where code gets recompiled for the specific CPU. They mostly don't have as good support for runtime detection. GCC might have to install a handler for SIGILL and just try running some instructions. Or query the OS which knows what's supported, e.g. Linux's /proc/cpuinfo.
Footnote 1:
For x86 specifically, its main claim to fame / reason for popularity is strict backwards compatibility. A new CPU that fails to run some existing programs would be a lot harder to sell, so vendors don't do that. They'll even bend over backwards to go beyond the on-paper ISA docs to make sure existing code keeps working. As former Intel architect Andy Glew said: All or almost all modern Intel processors are stricter than the manual. (For self-modifying code, and in general).
Modern PC motherboard firmwares even still emulate the legacy hardware of an IBM PC/XT when you boot in legacy BIOS mode, as well as implementing a software ABI for disk, keyboard, and screen access. So even bootloaders and stuff like GRUB have a consistent backwards-compatible interface to use, before they load a kernel which has actual drivers for the real hardware that's actually present.
A modern PC can I think still run real MS-DOS (the operating system) binaries in 16-bit real mode.
Adding new instruction opcodes without breaking backwards compat makes variable-length x86 machine code instructions ever more complex, and careless / anti-competitive developments in x86's history haven't helped, leading to more bloated instruction encodings for SSSE3 and later, for example. See Agner Fog's article Stop the instruction set war.
Code that depended on rep foo to decode as foo can break, though: Intel's manuals are pretty clear that random prefixes can cause code to misbehave in future. This makes it safe for Intel or AMD to introduce new instructions that decode in a known way on old CPUs, but do something new on newer CPUs. Like pause = rep nop. Or transactional memory HLE uses prefixes on locked instructions that old CPUs will ignore.
And prefixes like VEX (AVX) and EVEX (AVX512) are carefully chosen to not overlap with valid encodings of instructions, especially in 32-bit mode. See How does the instruction decoder differentiate between EVEX prefix and BOUND opcode in 32-bit mode?. This is one reason why 32-bit mode can still only use 8 vector registers (zmm0..7) even with VEX or EVEX which allow ymm0..15 or zmm0..31 respectively in 64-bit mode. (In 32-bit mode, a VEX prefix is invalid encodings of some opcode. In 64-bit mode, that opcode isn't valid in the first place to the later bytes are more flexible. But to simplify decoder HW they aren't fundamentally different.)
MIPS32r6 / MIPS64r6 in 2014 is one notable example that's not backwards compatible. It rearranged a few opcodes for instructions that stayed the same, and removed some instructions to reuse their opcode for other new instructions, e.g. branches without a delay slot. This is highly unusual and only makes sense for CPUs that are used for embedded systems (like current MIPS). Recompiling everything for MIPS32r6 is not a problem for an embedded system.
Some compiles can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but still of course only for extensions that the compiler knows about when it compiles. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code.
And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.
GCC has some support for this, via its ifunc mechanism, but IIRC you can't do that without source changes.
Intel's compiler (ICC) I think does support multi-versioning some hot functions when auto-vectorizing, with just command-line options.
It can only happen if the new processor is specifically designed to be backwards compatible with older models.
Forget gcc for a moment. You have a compiled X86 binary from year 2000, say, an executable built for the original Windows NT. Will a Skylake CPU run it? You betcha. Will an Itanium CPU run iit? Nope, it is not designed to do that. It is a completely different architecture
Now that executable most probably wouldn't use the Skylake efficiently, but that's the whole point of evolving architectures and introducing new instructions.
Returning to gcc, -march=native is not magic. It cannot possibly divine out the new instructions and new timings. It simply selects the "best" instruction set it knows that is supported by the CPU it runs on. How it's done is architecture specific. X86 CPUs can be queried about their capabilities with the the CPUID instruction. Other architectures may do it differently.
To put it another way, -O3 -march=native optimizes for the machine you compiled on, so it's good when you're compiling code to run on the build host. A binary built with -march=native on a Nehalem system is essentially the same as one built with -march=nehalem on any system. -march=native might detect your specific L3 cache size instead of using a default for that, if any GCC tuning decisions (like inlining or unrolling) depend on L3 size. Except if you run an old compiler on a new CPU it doesn't recognize, you get feature detection for stuff like -mavx but for tuning only tune=generic.
None of this can take advantage of new features like AVX2 or BMI2 when running on a Skylake or Ice Lake system. And some specific tuning decisions that were good on Nehalem might be sub-optimal on a different CPU. (Although this is less likely; Intel mostly maintains backwards compatibility for performance as well as correctness. Getting everyone to recompile everything for P4 didn't work out so they usually try to make existing binaries run well on new CPUs.)
Some compiles can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but only for extensions that the compiler knows about when it compiled. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code. And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.
The Eigen web site says:
Explicit vectorization is performed for SSE 2/3/4, AVX, FMA, AVX512, ARM NEON (32-bit and 64-bit), PowerPC AltiVec/VSX (32-bit and 64-bit) instruction sets, and now S390x SIMD (ZVector) with graceful fallback to non-vectorized code.
Does that mean if you compile, e.g. with FMA, and the CPU you are running on does not support it it will fall back to completely unvectorised code? Or will it fall back to the best vectorisation available?
If not, is there any way to have Eigen compile for all or several SIMD ISAs and automatically pick the best at runtime?
Edit: To be clear I'm talking about run-time fallbacks.
There is absolutely no run-time dispatching in Eigen. Everything that happens happens at compile-time. This is where you need to make all of the choices, either via preprocessor macros that control the library's behavior, or using optimization settings in your C++ compiler.
In order to implement run-time dispatching, you would either need to check and see what CPU features were supported on each and every call into the library and branch into the applicable implementation, or you would need to do this check once at launch and set up a table of function pointers (or some other method to facilitate dynamic dispatching). Eigen can't do the latter because it is a header-only library (just a collection of types and functions), with no "main" function that gets called upon initialization where all of this setup code could be localized. So the only option would be the former, which would result in a significant performance penalty. The whole point of this library is speed; introducing this type of performance penalty to each of the library's functions would be a disaster.
The documentation also contains a breakdown of how Eigen works, using a simple example. This page says:
The goal of this page is to understand how Eigen compiles it, assuming that SSE2 vectorization is enabled (GCC option -msse2).
which lends further credence to the claim that static, compile-time options determine how the library will work.
Whichever instruction set you choose to target, the generated code will have those instructions in it. If you try to execute that code on a processor that does not support these instructions (for example, you compile Eigen with AVX optimizations enabled, but you run it on an Intel Nehalem processor that doesn't support the AVX instruction set), then you will get an invalid instruction exception (presented to your program in whatever form the operating system passes CPU exceptions through). This will happen as soon as your CPU encounters an unrecognized/unsupported instruction, which will probably be deep in the bowels of one of the Eigen functions (i.e., not immediately upon startup).
However, as I said in the comments, there is a fallback mechanism of sorts, but it is a static one, all done at compile time. As the documentation indicates, Eigen supports multiple vector instruction sets. If you choose a lowest-common-denominator instruction set, like SSE2, you will still get some degree of vectorization. For example, although SSE3 may provide a specialized instruction for some particular task, if you can't target SSE3, all hope is not lost. There is, in many places, code that uses a series of SSE2 instructions to accomplish the same task. These are probably not as efficient as if SSE3 were available, but they'll still be faster than having no vector code whatsoever. You can see examples of this by digging into the source code, specifically the arch folder that contains the tidbits that have been specifically optimized for the different instruction sets (often by the use of intrinsics). In other words, you get the best code it can give you for your target architecture/instruction set.
I am porting the code from INTEL architecture to ARM architecture. The same code I am trying to build with arm cross compilers on centos. Are there any compiler options that increase the performance of executable image because the image on INTEL might not give similar performance on ARM. Is there a way to achieve this?
A lot of optimization options exist in GCC. By default the compiler tries to make the compilation process as short as possible and produce an object code that makes debugging easy. However, gcc also provides several options for optimization.
Four general levels of incremental optimizations for performance are available in GCC and can be activated by passing one of the options -O0, -O1, -O2 or -O3. Each of these levels activates a set of optimizations that can be also activated manually by specifying the corresponding command line option. For instance with -O1 the compiler performs branches using the decrement and branch instruction (if available and applicable) rather then decrementing the register, comparing it to zero and branching in separate instructions. This decrement and branch can be also specified manually by passing the -fbranch-count-reg option. Consider that optimizations performed at each level also depend on the target architecture, you can get the list of available and enabled optimization by running GCC with the -Q --help=optimizers option.
Generally speaking the levels of optimization correspond to (notice that at each level also the optimization of the previous ones are applied):
-O0: the default level, the compiler tries to reduce compilation time and produce an object code that can be easily processed by a debugger
-O1: the compiler tries to reduce both code size and execution time. Only optimizations that do not take a lot of compile time are performed. Compilation may take considerably more memory.
-O2: the compiler applies almost all optimizations available that do not affect the code size. Compiling takes more but performance should improve.
-O3: the compiler applies also optimization that might increase the code size (for instance inlining functions)
For a detailed description of all optimization options you can have a look here.
As a general remark consider that compiler optimization are designed to work in the general case but their effectiveness will depend a lot on both you program and the architecture you are running it on.
Edit:
If you are interested in memory paging optimization there is the -freorder-blocks-and-partition option (activated also with -O2). This option reorders basic blocks inside each function to partition them in hot blocks (called frequently) and cold blocks (called rarely). Hot blocks are then placed in contiguous memory locations. This should increasing cache locality and paging performance.
I tried to scrub the GCC man page for this, but still don't get it, really.
What's the difference between -march and -mtune?
When does one use just -march, vs. both? Is it ever possible to just -mtune?
If you use -march then GCC will be free to generate instructions that work on the specified CPU, but (typically) not on earlier CPUs in the architecture family.
If you just use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated. e.g. setting loop-unrolling heuristics appropriately for that CPU.
-march=foo implies -mtune=foo unless you also specify a different -mtune. This is one reason why using -march is better than just enabling options like -mavx without doing anything about tuning.
Caveat: -march=native on a CPU that GCC doesn't specifically recognize will still enable new instruction sets that GCC can detect, but will leave -mtune=generic. Use a new enough GCC that knows about your CPU if you want it to make good code.
This is what i've googled up:
The -march=X option takes a CPU name X and allows GCC to generate code that uses all features of X. GCC manual explains exactly which CPU names mean which CPU families and features.
Because features are usually added, but not removed, a binary built with -march=X will run on CPU X, has a good chance to run on CPUs newer than X, but it will almost assuredly not run on anything older than X. Certain instruction sets (3DNow!, i guess?) may be specific to a particular CPU vendor, making use of these will probably get you binaries that don't run on competing CPUs, newer or otherwise.
The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on. -march=X implies -mtune=X. -mtune=Y will not override -march=X, so, for example, it probably makes no sense to -march=core2 and -mtune=i686 - your code will not run on anything older than core2 anyway, because of -march=core2, so why on Earth would you want to optimize for something older (less featureful) than core2? -march=core2 -mtune=haswell makes more sense: don't use any features beyond what core2 provides (which is still a lot more than what -march=i686 gives you!), but do optimize code for much newer haswell CPUs, not for core2.
There's also -mtune=generic. generic makes GCC produce code that runs best on current CPUs (meaning of generic changes from one version of GCC to another). There are rumors on Gentoo forums that -march=X -mtune=generic produces code that runs faster on X than code produced by -march=X -mtune=X does (or just -march=X, as -mtune=X is implied). No idea if this is true or not.
Generally, unless you know exactly what you need, it seems that the best course is to specify -march=<oldest CPU you want to run on> and -mtune=generic (-mtune=generic is here to counter the implicit -mtune=<oldest CPU you want to run on>, because you probably don't want to optimize for the oldest CPU). Or just -march=native, if you ever going to run only on the same machine you build on.