When compiling/linking C/C++ libraries or programs that are meant to work on all implementations of an ISA (e.g. x86-64), what optimization flags are safe from the correctness and run-time performance perspectives? I want optimizations that yield correct results and won't be detrimental performance-wise for a particular CPU. E.g I would like to avoid optimization flags that yield run-time performance improvements on an 8th-gen Intel Core i7, but result in performance degradation on an AMD Ryzen.
Are PGO, LTO, and -O3 safe? Is it solely dependent on -march and -mtune (or the absence thereof)?
They're all supposed to be "safe", assuming that your code is well defined.
If you don't want to specialize for a particular CPU family then just leave -march and -mtune alone; the default suits a generic x86_64.
PGO is always a good idea, it's mostly used for avoiding branches.
LTO and -O3 can have different effects on different code-bases. For example, if your code benefits from vectorization then -O3 is a big win over -O2, but the extra inlining and unrolling can lead to larger code sizes, and that can be a disadvantage on systems with more limited caches.
In the end, the only advice that ever really means anything here is: measure it and see what's good for your code.
Related
Title says it all. What are the differences and tradeoffs between -march=haswell, -march=core-avx2, and -mavx2 for compiling avx2 intrinsics?
I know that -mavx2 is a flag and -march=haswell/core-avx2 are architectures which just translate to a bunch of flags. So -mavx2 is a subset of the other two. But beyond that, how do I choose the right one for my application?
In particular, if I'm compiling for Skylake and adding -mtune=skylake, but I can't actually do -march=skylake, how do I decide and why? (That's my particular case, but also interested in the more general tradeoffs).
-march sets -mtune, and enables other useful stuff you forgot like -mfma and -mbmi / -mbmi2 which aren't implied by -mavx2.
Unfortunately GCC's -mtune=generic doesn't adapt itself to the subset of CPUs that support the extensions enabled. Most notably, -mavx2 rules out Sandy Bridge, but doesn't imply -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store (Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?). Out of all AVX2 CPUs, only I think Excavator would really benefit much from those options, and it's not a big fraction of market share, and I think misaligned 32-byte load/store isn't disastrous on it like it was on Sandybridge. GCC11 and later have a tune=generic that does do unaligned 32-byte load/store (https://godbolt.org/z/qr9cEcne3) so this specific issue is no longer critical if you only care about AVX2 and newer CPUs. But still you generally want a -mtune= option somewhat appropriate for the set of CPUs you're targeting. But usually erring on the side of less loop unrolling, since too much can hurt I-cache / uop cache hit rate.
AVX2 CPUs have been widespread for long enough that -mtune=generic happens to be mostly appropriate for them in GCC11 and later, fortunately. (Especially CPUs like Haswell and Zen2 that have 32-byte-wide load/store units). But in general, with other CPU features and other far-future cases, that won't always be true.
But still don't just use -mavx2, that will leave out -mfma (which all AVX2 CPUs support except for one Via model), and leave out -mbmi/2 which I think all AVX2 CPUs also support. More efficient variable-count shifts are always nice, and so is lzcnt/tzcnt and efficient x &= x-1 lowest-set-bit idioms.
-march=haswell -mtune=skylake could be useful if you want to enable a Haswell baseline for compatibility, but tune for Skylake. (There's very little difference between tuning heuristics for those microarchitectures, and probably no new instructions that GCC would use automatically. For SKL specifically, see also How can I mitigate the impact of the Intel jcc erratum on gcc? which isn't implied by -march=skylake)
-march=core-avx2 and similar are an obsolete naming convention
It's just a synonym for -march=haswell, AFAIK. But maybe leaving -mtune=generic, unlike -march=haswell which sets -mtune=haswell. How to correctly determine -march and -mtune for Intel processors? has more details.
-march=core-i7 is I think just a synonym for Nehalem, which is really dumb because Nehalem (1st-gen i7) is a different microarchitecture from Sandy Bridge and later (2nd-gen i7). Intel unfortunately kept that branding through SnB's major re-architecting of the internals, although the 3-level cache with shared inclusive L3 has been a constant.
There's no such thing as "a generic i7", because i7 includes two different microarchitectures, one with a uop cache, one without.
Good riddance to those stupid ambiguous confusing -march= / -mtune names that are trying to be generic but actually aren't.
It would be nice if there was a -march=generic-avx2 to tune for Haswell and later, Zen1 and later, etc., using the common subset of instructions they all support efficiently. And setting tuning options appropriate for all of them, not going too crazy unrolling. But there isn't.
Or more generally, -mtune=generic-enabled or something that would look at the enabled -m options and not care about CPUs that couldn't run this code. e.g. not doing rep ret if you enabled any features that a Phenom II couldn't run, such as AVX. GCC bug 78762 is basically a feature request for that.
(Current GCC -mtune=generic no longer uses rep ret; Phenom II is old enough that GCC doesn't spend extra code-size to avoid potholes there. Most binary releases of software built with GCC use a GCC version that was in development a year or more before that, so there's some lead time in terms of what's appropriate for -mtune=generic. The CPU with a glass jaw / performance pothole you're working around doesn't have to be totally gone before a current nightly build of GCC stops tuning for it.)
Let's say I take a compiler: gcc 4.8. And processor from intel, let's say skylake or some other fancy new family.
checking this question: How to see which flags -march=native will activate?; if I do gcc -march=native -E -v - </dev/null 2>&1 | grep cc1, this will spout out some flags for the host machine, which is the above processor, skylake.
How does gcc know what flags to enable disable... when 4.8 was released before skylake processors were out? What about other newer family of processors?
Consequently, next question is upgrading the compiler to latest necessary for it accurately and optimally compile for target processor which is new?
The question isn't really specific to gcc/intel, I would like to know how others maintain synchronicity between processor and compiler too.
Old compilers don't know how to tune for new microarchitectures. (And are also missing out on better optimization in general: New versions of gcc/clang usually add new optimizations that help across the board, e.g. gcc8 can coalesce loads/stores of multiple adjacent small variables or array elements into a single 4 or 8-byte load or store. This helps on everything.)
They can also only use ISA extensions they know about.
They can make correct code because new x86 CPUs are still x86, and are backwards compatible with code for older CPUs1. Same with ARM. The ARMv8 ISA is backwards compatible with ARMv7, ARMv6, and so on, so new ARM CPUs can run existing ARM binaries. (There are some AArch64 CPUs that dropped support for 32-bit mode, but nevermind that.)
Consequently, next question is upgrading the compiler to latest necessary for it accurately and optimally compile for target processor which is new?
Yes, you want your compiler to at least know about your CPU for tuning options.
But yes, always, even when your CPU isn't new. New compiler versions often benefit old CPUs, too, but yes a new set of SIMD extensions to auto-vectorize with can lead to potentially large speedups for code that spends a lot of time in one hot loop. Assuming that loop auto-vectorizes well.
e.g. Phoronix recently posted GCC 5 Through GCC 10 Compiler Benchmarks - Five Years Worth Of C/C++ Compiler Performance where they benchmarked on an i7 5960X (Haswell-E) CPU. I think GCC5 knows about -march=haswell. GCC9.2 makes measurably faster code than even gcc8 on some benchmarks.
But I can pretty much guarantee it's not optimal!! Compilers are good over large scales but there's usually something a human can find in a single hot loop, if they know the low level details of optimizing for a given microarchitecture. It's merely as good as you're going to get from any compiler. (Actually performance regressions exist, so even that's not always true. File a missed-optimization bug if you find one).
-march=native does two separate things
CPU feature detection to enable stuff like -mfma and -mbmi2. This is easy on x86 with the CPUID instruction. GCC will enable all extensions it knows about that are supported by the actual CPU. e.g. I think GCC4.8 was the first GCC to know about any AVX512 extensions, so you might even get some AVX512 auto-vectorization on an Ice Lake or Skylake-avx512. Whether it does a good job or not is another matter, for anything non-trivial. But no AVX512 with GCC4.7.
CPU type detection to set -mtune=skylake. This depends on GCC actually recognizing your specific CPU as something it knows about. If not, it falls back to -mtune=generic. It might detect (with CPUID) your L1/L2/L3 cache sizes and use that to influence some tuning decisions like inlining / unrolling, instead of using a known size for -mtune=haswell. I don't think that's a big deal; current compilers don't AFAIK introduce cache-blocking optimizations to matmul loops or things like that, and that's where knowing cache sizes really matters.
CPU type detection can also use CPUID on x86; the vendor-string and model / family / stepping numbers uniquely identify the microarchitecture. ((wikipedia), sandpile, InstLatx64, https://agner.org/optimize/)
x86 is very much designed to support single binaries that run on multiple microarchitectures and might want do to runtime feature detection / dispatching. So an efficient / portable / extensible CPU detection mechanism exists in the form of the CPUID instruction, introduced in Pentium and some late 486 CPUs. (And thus baseline for x86-64.)
Other ISAs are more often used in embedded uses where code gets recompiled for the specific CPU. They mostly don't have as good support for runtime detection. GCC might have to install a handler for SIGILL and just try running some instructions. Or query the OS which knows what's supported, e.g. Linux's /proc/cpuinfo.
Footnote 1:
For x86 specifically, its main claim to fame / reason for popularity is strict backwards compatibility. A new CPU that fails to run some existing programs would be a lot harder to sell, so vendors don't do that. They'll even bend over backwards to go beyond the on-paper ISA docs to make sure existing code keeps working. As former Intel architect Andy Glew said: All or almost all modern Intel processors are stricter than the manual. (For self-modifying code, and in general).
Modern PC motherboard firmwares even still emulate the legacy hardware of an IBM PC/XT when you boot in legacy BIOS mode, as well as implementing a software ABI for disk, keyboard, and screen access. So even bootloaders and stuff like GRUB have a consistent backwards-compatible interface to use, before they load a kernel which has actual drivers for the real hardware that's actually present.
A modern PC can I think still run real MS-DOS (the operating system) binaries in 16-bit real mode.
Adding new instruction opcodes without breaking backwards compat makes variable-length x86 machine code instructions ever more complex, and careless / anti-competitive developments in x86's history haven't helped, leading to more bloated instruction encodings for SSSE3 and later, for example. See Agner Fog's article Stop the instruction set war.
Code that depended on rep foo to decode as foo can break, though: Intel's manuals are pretty clear that random prefixes can cause code to misbehave in future. This makes it safe for Intel or AMD to introduce new instructions that decode in a known way on old CPUs, but do something new on newer CPUs. Like pause = rep nop. Or transactional memory HLE uses prefixes on locked instructions that old CPUs will ignore.
And prefixes like VEX (AVX) and EVEX (AVX512) are carefully chosen to not overlap with valid encodings of instructions, especially in 32-bit mode. See How does the instruction decoder differentiate between EVEX prefix and BOUND opcode in 32-bit mode?. This is one reason why 32-bit mode can still only use 8 vector registers (zmm0..7) even with VEX or EVEX which allow ymm0..15 or zmm0..31 respectively in 64-bit mode. (In 32-bit mode, a VEX prefix is invalid encodings of some opcode. In 64-bit mode, that opcode isn't valid in the first place to the later bytes are more flexible. But to simplify decoder HW they aren't fundamentally different.)
MIPS32r6 / MIPS64r6 in 2014 is one notable example that's not backwards compatible. It rearranged a few opcodes for instructions that stayed the same, and removed some instructions to reuse their opcode for other new instructions, e.g. branches without a delay slot. This is highly unusual and only makes sense for CPUs that are used for embedded systems (like current MIPS). Recompiling everything for MIPS32r6 is not a problem for an embedded system.
Some compiles can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but still of course only for extensions that the compiler knows about when it compiles. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code.
And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.
GCC has some support for this, via its ifunc mechanism, but IIRC you can't do that without source changes.
Intel's compiler (ICC) I think does support multi-versioning some hot functions when auto-vectorizing, with just command-line options.
It can only happen if the new processor is specifically designed to be backwards compatible with older models.
Forget gcc for a moment. You have a compiled X86 binary from year 2000, say, an executable built for the original Windows NT. Will a Skylake CPU run it? You betcha. Will an Itanium CPU run iit? Nope, it is not designed to do that. It is a completely different architecture
Now that executable most probably wouldn't use the Skylake efficiently, but that's the whole point of evolving architectures and introducing new instructions.
Returning to gcc, -march=native is not magic. It cannot possibly divine out the new instructions and new timings. It simply selects the "best" instruction set it knows that is supported by the CPU it runs on. How it's done is architecture specific. X86 CPUs can be queried about their capabilities with the the CPUID instruction. Other architectures may do it differently.
To put it another way, -O3 -march=native optimizes for the machine you compiled on, so it's good when you're compiling code to run on the build host. A binary built with -march=native on a Nehalem system is essentially the same as one built with -march=nehalem on any system. -march=native might detect your specific L3 cache size instead of using a default for that, if any GCC tuning decisions (like inlining or unrolling) depend on L3 size. Except if you run an old compiler on a new CPU it doesn't recognize, you get feature detection for stuff like -mavx but for tuning only tune=generic.
None of this can take advantage of new features like AVX2 or BMI2 when running on a Skylake or Ice Lake system. And some specific tuning decisions that were good on Nehalem might be sub-optimal on a different CPU. (Although this is less likely; Intel mostly maintains backwards compatibility for performance as well as correctness. Getting everyone to recompile everything for P4 didn't work out so they usually try to make existing binaries run well on new CPUs.)
Some compiles can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but only for extensions that the compiler knows about when it compiled. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code. And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.
I have written a library, where I use CMake for verifying the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition to this, I check for the presence of the instructions and if present, I add the necessary compiler flags, -msse2 -mavx -mfma etc.
This is all very good, but I would like to deploy a single binary, which works across a range of generations of processors.
Question: Is it possible to tell the compiler (GCC) that whenever it optimizes a function using SIMD, it must generate code for a list of architectures? And of of course introduce high-level branches
I am thinking similar to how the compiler generates code for functions, where input pointers are either 4 or 8 byte aligned. To prevent this, I use the __builtin_assume_aligned macro.
What is best practice? Multiple binaries? Naming?
As long as you don't care about portability, yes.
Recent versions of GCC make this easier than any other compiler I'm aware of by using the target_clones function attribute. Just add the attribute, with a list of targets you want to create versions for, and GCC will automatically create the different variants, as well as a dispatch function to choose a version automatically at runtime.
If you want a bit more portability you can use the target attribute, which clang and icc also support, but you'll have to write the dispatch function yourself (which isn't difficult), and emit the function multiple times (generally using a macro, or repeatedly including a header).
AFAIK, if you want your code to work with MSVC you'll need multiple compiler invocations with different options.
If you're talking about just getting the compiler to generate SSE/AVX etc instructions, and you've got "general purpose" code (ie you're not explicitly vectorising using intrinsics, or got lots of code that the compiler will spot and auto-vectorise) then I should warn you that AVX, AVX2 or AVX512 compiling your entire codebase will probably run significantly slower than compiling for SSE versions.
When AVX opcodes using the upper halves of the registers are detected, the CPU powers up the upper half of the circuitry (which is otherwise powered down). This consumes more power, generates more heat and reduces the base clock speed of the chip, typically by 10-20% depending on the mix of high power and low-power opcodes, so you lose maybe 15% of performance immediately, and then have to be doing quite a lot of vectorised processing in order to make up for this performance deficit before you start seeing any gains.
See my longer explanation and references in this thread.
If on the other hand you're explicitly vectorising using intrinsics and you're sure you have large enough burst of AVX etc to make it worthwhile, I've successfully written code where I tell MSVC to compile for SSE2 (default for x64) but then I dynamically check the CPU capabilities and some functions switch to a codepath implemented using AVX intrinsics.
MSVC allows this (it will produce warnings, but you can silence these), but the same technique is hard to make work under GCC 4.9 as the intrinsics are only considered declared by the compiler when the appropriate code generation flag is used. [UPDATE: #nemequ explains below how you can make this work under gcc using attributes to decorate the functions] Depending on the version of GCC you may have to compile files with different flags to get a workable system.
Oh, and you have to watch for AVX-SSE transitions too (call VZEROUPPER when you leave an AVX section of code to return to SSE code) - it can be done but I found that understanding the CPU implications was a bigger battle than I originally envisaged.
I am porting the code from INTEL architecture to ARM architecture. The same code I am trying to build with arm cross compilers on centos. Are there any compiler options that increase the performance of executable image because the image on INTEL might not give similar performance on ARM. Is there a way to achieve this?
A lot of optimization options exist in GCC. By default the compiler tries to make the compilation process as short as possible and produce an object code that makes debugging easy. However, gcc also provides several options for optimization.
Four general levels of incremental optimizations for performance are available in GCC and can be activated by passing one of the options -O0, -O1, -O2 or -O3. Each of these levels activates a set of optimizations that can be also activated manually by specifying the corresponding command line option. For instance with -O1 the compiler performs branches using the decrement and branch instruction (if available and applicable) rather then decrementing the register, comparing it to zero and branching in separate instructions. This decrement and branch can be also specified manually by passing the -fbranch-count-reg option. Consider that optimizations performed at each level also depend on the target architecture, you can get the list of available and enabled optimization by running GCC with the -Q --help=optimizers option.
Generally speaking the levels of optimization correspond to (notice that at each level also the optimization of the previous ones are applied):
-O0: the default level, the compiler tries to reduce compilation time and produce an object code that can be easily processed by a debugger
-O1: the compiler tries to reduce both code size and execution time. Only optimizations that do not take a lot of compile time are performed. Compilation may take considerably more memory.
-O2: the compiler applies almost all optimizations available that do not affect the code size. Compiling takes more but performance should improve.
-O3: the compiler applies also optimization that might increase the code size (for instance inlining functions)
For a detailed description of all optimization options you can have a look here.
As a general remark consider that compiler optimization are designed to work in the general case but their effectiveness will depend a lot on both you program and the architecture you are running it on.
Edit:
If you are interested in memory paging optimization there is the -freorder-blocks-and-partition option (activated also with -O2). This option reorders basic blocks inside each function to partition them in hot blocks (called frequently) and cold blocks (called rarely). Hot blocks are then placed in contiguous memory locations. This should increasing cache locality and paging performance.
Have advancements in CPU design like dynamic instruction scheduling narrowed the performance gap between code generated by splat compilers and by optimizing compilers, i.e. can compilers get away with being more stupid these days?
On the contrary, optimizing compilers achieve more on contemporary CPUs. Automatic vectorization makes code up to several times faster. Modern instruction sets also give some optimization opportunities (for example, using CMOV instead of conditional branch on x86).
There are some areas, where performance gap is narrowed. CPU performs function calls faster, so function inlining may be not as beneficial as earlier. Loop unrolling may sometimes make code a little bit slower. But in most cases compiler optimizations and CPU optimizations are orthogonal to each other. CPUs cannot do loop fusion or common subexpression elimination. Compilers cannot provide good alternative to dynamic instruction scheduling, branch prediction or data prefetch.