GCC/Clang support the compilation options -march and -mcpu.
What exactly is architecture in the context of -march ?
What exactly is processor (cpu) in the context of option -mcpu ?
What is the difference between architecture and processor(cpu) ?
Thanks.
for the flags:
-march flag specifies the target architecture version, i.e. a set of ISA extensions. -march tells the compiler that it is allowed to generate special instructions to use the specific hardware features of a given architecture. (On x86, it also implies -mtune= the same thing)
-mtune flag specifies the target microarchitecture (specific CPU) to tune for, in terms of choices like loop unrolling thresholds or instruction choices that don't affect compatibility. -mtune without any -march tells the compiler to generate a binary with a bare-minimum, generic instruction set but also tune the resulting binary code for the specified target. The -mtune flag does not enable the compiler to use the special hardware features of the target, but can be used with it.
It can be used with -march to set a baseline CPU feature level, but also to tune for a specific CPU, which is hopefully representative of the CPUs you care most about actually running on. Even if that's newer than the oldest CPUs that match the -march setting, i.e. which could run the binary.
-mcpu flag on ARM specifies the target processor to tune for, and to use any extension it supports. GCC uses this name to derive the name of the target ARM architecture (as if specified by -march), such as -mcpu=cortex-a53 implying -march= and -mtune= settings.
-mcpu flag on x86 is a deprecated synonym for -mtune. See GCC: mtune vs march vs mcpu
GCC and LLVM agree on the meanings of these flags, e.g. both treating -mcpu as a synonym for -mtune on x86-64, but being different on ARM.
example:
-mcpu=cortex-a8 will perform specific optimisations for the Cortex-A8 such as instruction scheduling and will produce better performing code on that core, as well as using instructions it supports but older CPUs don't.
-march=armv7-a just selects the ARMv7-a architecture, which tells the compiler that it can use the instructions in ARMv7-a, but it will leave the performance tuning / instruction scheduling heuristics at their default values, aimed at running well across a range of CPUs of the overall architecture. (e.g. ARM in general, not restricted to ones new enough to support the selected -march, unfortunately).
With -mtune=thunderx2t99, the compiler will generate a binary that is tuned for the ThunderX2 microarchitecture. This binary will not take full advantage of all the ThunderX2's hardware features, but it may be somewhat optimized for the ThunderX2, and it will be more portable than a binary compiled with -march=armv8.1-a, for instance.
The x86 convention is to let -march imply -mtune as stated here in the docs
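One way to see the difference in practice is to look at the feature macros that GCC and Clang predefine. The sketch below is only an illustration: the helper names enabled_simd and print_enabled_simd are made up for this example, and the macro list is deliberately short. The macros follow -march / -mcpu / -m<feature>; -mtune on its own does not change them.

```c
#include <stdio.h>

/* Hypothetical helper: names the widest SIMD extension the compiler was
   allowed to use for this translation unit. These macros are predefined
   by GCC and Clang according to the selected ISA, not the tuning target. */
const char *enabled_simd(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__ARM_NEON)
    return "NEON";
#else
    return "none detected";
#endif
}

/* Hypothetical helper: prints the result for quick experiments. */
void print_enabled_simd(void)
{
    printf("compiler may emit: %s\n", enabled_simd());
}
```

On x86-64, building this with only -mtune=haswell should still report the baseline (SSE2), while -march=haswell should report AVX2.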
Links:
For further reference maybe these arm docs help
Deprecation of -mcpu as synonym of -mtune
Docs for x86 options
Related
The CUDA 3.0 toolkit documentation listed several flags NVIDIA's OpenCL compiler accepts, as an extension beyond what the OpenCL standard mandates:
Option | Description
-cl-nv-maxrregcount <N> | Max number of registers a kernel (or device function?) may use; passed on to ptxas as --maxrregcount
-cl-nv-opt-level <N> | Code optimization level.
-cl-nv-verbose | Enable verbose mode.
But I know there are others. For example, the clcc project mentions cl-nv-arch and cl-nv-cstd (which actually regards the OpenCL C version targeted). I vaguely recall one flag in particular which turns off support for grids/block sizes beyond CUDA's natively-supported grid and block sizes. How can I determine those extra flags, in a recent NVIDIA OpenCL runtime version?
This one is due to @Tim, in this answer:
Option | Description
-nv-line-info | Generates information about the locations in the source files corresponding to PTX instructions; more-or-less the same as the NVCC command-line option --generate-line-info.
Following the advice of Linus Torvalds (and cross platform performance), I wish to not use avx512. Is there a flag I can specify to the compiler (both gcc and msvc) such that all avx512 instructions are split into pairs of avx2 instructions if a library I am using tries to use avx512, either from intrinsics or compiler optimization?
No, compile your code not to use AVX-512 in the first place by telling the compiler it can't; you only have to do anything about code using intrinsics that require AVX-512.
However, if you're compiling for a CPU that supports AVX-512, it's often worth using it, especially with 256-bit vectors to avoid the turbo-frequency and other penalties that come with 512-bit vectors. GCC's default tuning is already -mprefer-vector-width=256 for CPUs like -march=skylake-avx512.
If you want to make a binary that can run on CPUs without AVX-512, then yes, obviously you need to make sure it never executes any instructions that would fault without it, e.g. gcc -O3 -march=znver2 or -march=skylake or whatever. Neither of those target arch options includes AVX-512. Or -march=native if compiling for whatever CPU you have.
But if you do have a CPU that supports AVX-512, and you want to not use it, you can use something like -march=native -mno-avx512f (All other AVX-512 extensions depend on the "Foundation" AVX-512F, so disabling that also prevents even AVX-512VL for 128 and 256-bit vectors.)
(Part of the benefit of -march=native and then disabling stuff is to also set tuning options. If you want a binary that runs well on both Skylake and Zen2, I'm not sure what to recommend; probably -march=skylake or -march=znver2 are both ok; there's the default "tune=generic" but it cares too much about really old CPUs that don't even support AVX2, like Sandybridge: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)
Intrinsics
Even with intrinsics, GCC will only ever emit instructions supported by the target options, so -mno-avx512f can let you be sure you didn't miss anything. You'll get compile time errors, instead of EVEX instructions slipping through the cracks.
(MSVC is different and is designed around a single-binary model where using new instruction-sets is done in functions that you only call if the CPU supports it, so it won't stop you from using AVX-512. AFAIK, MSVC still doesn't even have an option to auto-vectorize with AVX-512, only /arch:AVX2. But anyway, MSVC won't emit AVX-512 instructions on its own if you don't tell it to, if you don't use any option like /arch:AVX512 if such a thing exists; AFAIK it doesn't have a /arch:native unfortunately. With MSVC you do have to be sure you caught all uses of intrinsics, although compiling with GCC can help to make sure your codebase doesn't do that.)
If you still want to compile code that uses _mm512_add_epi32 or _mm256_ternarylogic_epi32 or whatever, you'll need a version of immintrin.h that defines __m512i as a struct/class with two __m256i members and emulates all the intrinsics. Some AVX512 intrinsics won't be cheap to emulate, especially masked operations, and the whole concept of compare-into-mask to get an integer instead of a vector. So it's probably a bad idea to try to make this happen fully transparently; instead just get GCC to stop you from using any AVX-512 instructions while you make AVX2-only versions of any intrinsics code that didn't already have AVX2 versions.
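As a rough sketch of what "keeping an AVX2-only version" can look like, the function below selects its implementation at compile time from the predefined macros, so that -mno-avx512f automatically drops the 512-bit branch. The name add_i32_static is hypothetical, and a real codebase would likely organise this differently.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__) || defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n 32-bit integers.
   The widest branch the target options allow is chosen at compile time. */
void add_i32_static(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;
#if defined(__AVX512F__)
    /* Only compiled when AVX-512F is enabled, e.g. -march=skylake-avx512. */
    for (; i + 16 <= n; i += 16) {
        __m512i va = _mm512_loadu_si512((const void *)(a + i));
        __m512i vb = _mm512_loadu_si512((const void *)(b + i));
        _mm512_storeu_si512((void *)(dst + i), _mm512_add_epi32(va, vb));
    }
#elif defined(__AVX2__)
    /* AVX2-only version, used when AVX-512F is disabled but AVX2 is on. */
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when no SIMD branch was compiled. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Built with -march=native -mno-avx512f on an AVX-512 machine, this should take the AVX2 branch; code that calls AVX-512 intrinsics without such a guard would fail to compile instead.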
Last time this came up, Coding on insufficient hardware, I was able to find an avxintrin-emu.h that let you develop for AVX while only compiling for SSE4. But I didn't find an equivalent for AVX-512. (Normally you would compile an AVX-512 binary and test it on an emulator like SDE that emulates at runtime, not compile-time.)
Agner Fog's VectorClass wrapper library (https://www.agner.org/optimize/#vectorclass) has support for basic operations like + - * /, and shuffles and blends, and has versions of 512-bit vectors emulated with a pair of AVX2 vectors. (And VCL types are implicitly convertible to __m256i or __m512i and so on, so for operations it doesn't have its own functions for, you can use Intel intrinsics. But then you're back in the same boat of needing a library that emulates _mm256_ternarylogic_epi32 with only AVX2 instructions.)
This won't stop libc from possibly using hand-written AVX-512 instructions in functions like strcmp or log/exp, since dynamic CPU dispatching happens at run-time, and you can't stop your CPU from reporting that it supports AVX-512. (Except with a VM, or by telling the kernel not to enable AVX-512 at boot, if Linux has an option for that.)
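To illustrate why compile flags cannot influence that kind of dispatch, here is a minimal sketch of the run-time side, using the GCC/Clang builtin __builtin_cpu_supports. The wrapper name cpu_has_avx512f is made up for this example. Its result depends on the machine it runs on, not on whether the file was built with -mno-avx512f.

```c
/* Hypothetical helper: 1 if the CPU we are running on reports AVX-512F,
   0 otherwise (including on non-x86 builds, where the builtin is not used). */
int cpu_has_avx512f(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* Queries the running CPU; the target options of this file do not matter. */
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return 0;
#endif
}
```

A library with hand-written AVX-512 code paths typically makes a comparable check in its own dispatcher, which is why only hiding the feature from the CPUID results (a VM, or a kernel option) would change its choice.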
Let's say I take a compiler: gcc 4.8. And processor from intel, let's say skylake or some other fancy new family.
checking this question: How to see which flags -march=native will activate?; if I do gcc -march=native -E -v - </dev/null 2>&1 | grep cc1, this will spout out some flags for the host machine, which is the above processor, skylake.
How does gcc know which flags to enable or disable when 4.8 was released before Skylake processors were out? What about other newer families of processors?
Consequently, the next question: is upgrading the compiler to the latest version necessary for it to compile accurately and optimally for a new target processor?
The question isn't really specific to gcc/intel, I would like to know how others maintain synchronicity between processor and compiler too.
Old compilers don't know how to tune for new microarchitectures. (And are also missing out on better optimization in general: New versions of gcc/clang usually add new optimizations that help across the board, e.g. gcc8 can coalesce loads/stores of multiple adjacent small variables or array elements into a single 4 or 8-byte load or store. This helps on everything.)
They can also only use ISA extensions they know about.
They can make correct code because new x86 CPUs are still x86, and are backwards compatible with code for older CPUs (see footnote 1). Same with ARM. The ARMv8 ISA is backwards compatible with ARMv7, ARMv6, and so on, so new ARM CPUs can run existing ARM binaries. (There are some AArch64 CPUs that dropped support for 32-bit mode, but never mind that.)
Consequently, the next question: is upgrading the compiler to the latest version necessary for it to compile accurately and optimally for a new target processor?
Yes, you want your compiler to at least know about your CPU for tuning options.
But yes, always, even when your CPU isn't new. New compiler versions often benefit old CPUs, too, but yes a new set of SIMD extensions to auto-vectorize with can lead to potentially large speedups for code that spends a lot of time in one hot loop. Assuming that loop auto-vectorizes well.
e.g. Phoronix recently posted GCC 5 Through GCC 10 Compiler Benchmarks - Five Years Worth Of C/C++ Compiler Performance where they benchmarked on an i7 5960X (Haswell-E) CPU. I think GCC5 knows about -march=haswell. GCC9.2 makes measurably faster code than even gcc8 on some benchmarks.
But I can pretty much guarantee it's not optimal!! Compilers are good over large scales but there's usually something a human can find in a single hot loop, if they know the low level details of optimizing for a given microarchitecture. It's merely as good as you're going to get from any compiler. (Actually performance regressions exist, so even that's not always true. File a missed-optimization bug if you find one).
-march=native does two separate things
CPU feature detection to enable stuff like -mfma and -mbmi2. This is easy on x86 with the CPUID instruction. GCC will enable all extensions it knows about that are supported by the actual CPU. e.g. I think GCC4.9 was the first GCC to know about any AVX512 extensions, so you might even get some AVX512 auto-vectorization on an Ice Lake or Skylake-avx512. Whether it does a good job or not is another matter, for anything non-trivial. But no AVX512 with GCC4.8.
CPU type detection to set -mtune=skylake. This depends on GCC actually recognizing your specific CPU as something it knows about. If not, it falls back to -mtune=generic. It might detect (with CPUID) your L1/L2/L3 cache sizes and use that to influence some tuning decisions like inlining / unrolling, instead of using a known size for -mtune=haswell. I don't think that's a big deal; current compilers don't AFAIK introduce cache-blocking optimizations to matmul loops or things like that, and that's where knowing cache sizes really matters.
CPU type detection can also use CPUID on x86; the vendor-string and model / family / stepping numbers uniquely identify the microarchitecture. (Wikipedia, sandpile, InstLatx64, https://agner.org/optimize/)
x86 is very much designed to support single binaries that run on multiple microarchitectures and might want to do runtime feature detection / dispatching. So an efficient / portable / extensible CPU detection mechanism exists in the form of the CPUID instruction, introduced in Pentium and some late 486 CPUs. (And thus baseline for x86-64.)
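For a concrete feel of the CPUID mechanism, the following sketch reads the vendor string using the __get_cpuid helper from the <cpuid.h> header that GCC and Clang ship. The function name cpu_vendor is an assumption made for this example; it is not how GCC itself implements -march=native.

```c
#include <string.h>
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#endif

/* Hypothetical helper: writes the 12-character CPUID vendor string
   (e.g. "GenuineIntel") plus a terminating NUL into out.
   Returns 1 on success, 0 if CPUID is unavailable (e.g. a non-x86 build). */
int cpu_vendor(char out[13])
{
    out[0] = '\0';
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    /* Leaf 0 returns the vendor string in EBX, EDX, ECX order. */
    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
    return 1;
#else
    return 0;
#endif
}
```

Family / model / stepping come from leaf 1 in a similar way, and feature bits from further leaves; a compiler that predates a given model number simply has no table entry to map it to a tuning target.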
Other ISAs are more often used in embedded uses where code gets recompiled for the specific CPU. They mostly don't have as good support for runtime detection. GCC might have to install a handler for SIGILL and just try running some instructions. Or query the OS which knows what's supported, e.g. Linux's /proc/cpuinfo.
Footnote 1:
For x86 specifically, its main claim to fame / reason for popularity is strict backwards compatibility. A new CPU that fails to run some existing programs would be a lot harder to sell, so vendors don't do that. They'll even bend over backwards to go beyond the on-paper ISA docs to make sure existing code keeps working. As former Intel architect Andy Glew said: All or almost all modern Intel processors are stricter than the manual. (For self-modifying code, and in general).
Modern PC motherboard firmwares even still emulate the legacy hardware of an IBM PC/XT when you boot in legacy BIOS mode, as well as implementing a software ABI for disk, keyboard, and screen access. So even bootloaders and stuff like GRUB have a consistent backwards-compatible interface to use, before they load a kernel which has actual drivers for the real hardware that's actually present.
A modern PC can I think still run real MS-DOS (the operating system) binaries in 16-bit real mode.
Adding new instruction opcodes without breaking backwards compat makes variable-length x86 machine code instructions ever more complex, and careless / anti-competitive developments in x86's history haven't helped, leading to more bloated instruction encodings for SSSE3 and later, for example. See Agner Fog's article Stop the instruction set war.
Code that depended on rep foo to decode as foo can break, though: Intel's manuals are pretty clear that random prefixes can cause code to misbehave in future. This makes it safe for Intel or AMD to introduce new instructions that decode in a known way on old CPUs, but do something new on newer CPUs. Like pause = rep nop. Or transactional memory HLE uses prefixes on locked instructions that old CPUs will ignore.
And prefixes like VEX (AVX) and EVEX (AVX512) are carefully chosen to not overlap with valid encodings of instructions, especially in 32-bit mode. See How does the instruction decoder differentiate between EVEX prefix and BOUND opcode in 32-bit mode?. This is one reason why 32-bit mode can still only use 8 vector registers (zmm0..7) even with VEX or EVEX, which allow ymm0..15 or zmm0..31 respectively in 64-bit mode. (In 32-bit mode, a VEX prefix is an invalid encoding of some opcode. In 64-bit mode, that opcode isn't valid in the first place, so the later bytes are more flexible. But to simplify decoder HW they aren't fundamentally different.)
MIPS32r6 / MIPS64r6 in 2014 is one notable example that's not backwards compatible. It rearranged a few opcodes for instructions that stayed the same, and removed some instructions to reuse their opcode for other new instructions, e.g. branches without a delay slot. This is highly unusual and only makes sense for CPUs that are used for embedded systems (like current MIPS). Recompiling everything for MIPS32r6 is not a problem for an embedded system.
Some compilers can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but still of course only for extensions that the compiler knows about when it compiles. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code.
And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.
GCC has some support for this, via its ifunc mechanism, but IIRC you can't do that without source changes.
Intel's compiler (ICC) I think does support multi-versioning some hot functions when auto-vectorizing, with just command-line options.
It can only happen if the new processor is specifically designed to be backwards compatible with older models.
Forget gcc for a moment. You have a compiled X86 binary from year 2000, say, an executable built for the original Windows NT. Will a Skylake CPU run it? You betcha. Will an Itanium CPU run it? Nope, it is not designed to do that. It is a completely different architecture.
Now that executable most probably wouldn't use the Skylake efficiently, but that's the whole point of evolving architectures and introducing new instructions.
Returning to gcc, -march=native is not magic. It cannot possibly divine out the new instructions and new timings. It simply selects the "best" instruction set it knows that is supported by the CPU it runs on. How it's done is architecture specific. X86 CPUs can be queried about their capabilities with the CPUID instruction. Other architectures may do it differently.
To put it another way, -O3 -march=native optimizes for the machine you compiled on, so it's good when you're compiling code to run on the build host. A binary built with -march=native on a Nehalem system is essentially the same as one built with -march=nehalem on any system. -march=native might detect your specific L3 cache size instead of using a default for that, if any GCC tuning decisions (like inlining or unrolling) depend on L3 size. Except if you run an old compiler on a new CPU it doesn't recognize, you get feature detection for stuff like -mavx but for tuning only tune=generic.
None of this can take advantage of new features like AVX2 or BMI2 when running on a Skylake or Ice Lake system. And some specific tuning decisions that were good on Nehalem might be sub-optimal on a different CPU. (Although this is less likely; Intel mostly maintains backwards compatibility for performance as well as correctness. Getting everyone to recompile everything for P4 didn't work out so they usually try to make existing binaries run well on new CPUs.)
Some compilers can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but only for extensions that the compiler knows about when it compiled. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code. And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.
I have written a library, where I use CMake for verifying the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition to this, I check for the presence of the instructions and if present, I add the necessary compiler flags, -msse2 -mavx -mfma etc.
This is all very good, but I would like to deploy a single binary, which works across a range of generations of processors.
Question: Is it possible to tell the compiler (GCC) that whenever it optimizes a function using SIMD, it must generate code for a list of architectures? And of course introduce high-level branches.
I am thinking similar to how the compiler generates code for functions, where input pointers are either 4 or 8 byte aligned. To prevent this, I use the __builtin_assume_aligned macro.
What is best practice? Multiple binaries? Naming?
As long as you don't care about portability, yes.
Recent versions of GCC make this easier than any other compiler I'm aware of by using the target_clones function attribute. Just add the attribute, with a list of targets you want to create versions for, and GCC will automatically create the different variants, as well as a dispatch function to choose a version automatically at runtime.
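A minimal sketch of the target_clones approach follows. The MULTIVERSION macro and the axpy function are names invented for this example, and the guard conditions (x86-64 Linux, a compiler that reports the attribute via __has_attribute) are assumptions about where the attribute and its ifunc-based dispatch are available; elsewhere the macro expands to nothing and a single baseline version is built.

```c
#include <stddef.h>

/* Enable multiversioning only where the attribute is likely to work. */
#if defined(__has_attribute)
#  if __has_attribute(target_clones) && defined(__x86_64__) && defined(__linux__)
#    define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#  endif
#endif
#ifndef MULTIVERSION
#  define MULTIVERSION /* fall back to a single, baseline version */
#endif

/* y[i] += a * x[i]. With the attribute, the compiler emits an AVX2 clone,
   a baseline clone, and a resolver that picks one of them at run time. */
MULTIVERSION
void axpy(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Callers just call axpy as usual. Whether the AVX2 clone is actually faster depends on the optimization level (the loop has to auto-vectorize) and on the data size.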
If you want a bit more portability you can use the target attribute, which clang and icc also support, but you'll have to write the dispatch function yourself (which isn't difficult), and emit the function multiple times (generally using a macro, or repeatedly including a header).
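The hand-written variant might look roughly like this. All function names here (add_i32, add_i32_avx2, add_i32_baseline) and the HAVE_AVX2_VERSION macro are hypothetical; the sketch uses the target attribute for the AVX2 version and __builtin_cpu_supports for the run-time check, both of which are GCC/Clang features.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX2_VERSION 1

/* Compiled with AVX2 enabled for this one function only. */
__attribute__((target("avx2")))
static void add_i32_avx2(int32_t *dst, const int32_t *a,
                         const int32_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
    }
    for (; i < n; ++i)          /* scalar tail */
        dst[i] = a[i] + b[i];
}
#endif

/* Baseline version, built with the file's normal target options. */
static void add_i32_baseline(int32_t *dst, const int32_t *a,
                             const int32_t *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* Hand-written dispatcher: checks the running CPU on every call. */
void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
#ifdef HAVE_AVX2_VERSION
    if (__builtin_cpu_supports("avx2")) {
        add_i32_avx2(dst, a, b, n);
        return;
    }
#endif
    add_i32_baseline(dst, a, b, n);
}
```

Checking on every call keeps the sketch short; caching the choice in a function pointer on first use is a common refinement.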
AFAIK, if you want your code to work with MSVC you'll need multiple compiler invocations with different options.
If you're talking about just getting the compiler to generate SSE/AVX etc instructions, and you've got "general purpose" code (ie you're not explicitly vectorising using intrinsics, and don't have lots of code that the compiler will spot and auto-vectorise) then I should warn you that compiling your entire codebase for AVX, AVX2 or AVX512 will probably make it run significantly slower than compiling for SSE versions.
When AVX opcodes using the upper halves of the registers are detected, the CPU powers up the upper half of the circuitry (which is otherwise powered down). This consumes more power, generates more heat and reduces the base clock speed of the chip, typically by 10-20% depending on the mix of high power and low-power opcodes, so you lose maybe 15% of performance immediately, and then have to be doing quite a lot of vectorised processing in order to make up for this performance deficit before you start seeing any gains.
See my longer explanation and references in this thread.
If on the other hand you're explicitly vectorising using intrinsics and you're sure you have large enough burst of AVX etc to make it worthwhile, I've successfully written code where I tell MSVC to compile for SSE2 (default for x64) but then I dynamically check the CPU capabilities and some functions switch to a codepath implemented using AVX intrinsics.
MSVC allows this (it will produce warnings, but you can silence these), but the same technique is hard to make work under GCC 4.9 as the intrinsics are only considered declared by the compiler when the appropriate code generation flag is used. [UPDATE: @nemequ explains below how you can make this work under gcc using attributes to decorate the functions] Depending on the version of GCC you may have to compile files with different flags to get a workable system.
Oh, and you have to watch for AVX-SSE transitions too (call VZEROUPPER when you leave an AVX section of code to return to SSE code) - it can be done but I found that understanding the CPU implications was a bigger battle than I originally envisaged.
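As a sketch of the VZEROUPPER point, the AVX function below calls the _mm256_zeroupper intrinsic once its 256-bit work is done. The names sum_f32, sum_avx and HAVE_AVX_PATH are invented for this example. GCC and Clang often insert vzeroupper on their own at such boundaries, so treat the explicit call as documentation of intent rather than a requirement for those compilers.

```c
#include <stddef.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX_PATH 1

/* AVX code path: only this function is compiled with AVX enabled. */
__attribute__((target("avx")))
static float sum_avx(const float *x, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    /* Done with 256-bit registers: clear the upper halves so that
       legacy-SSE code running afterwards avoids transition penalties. */
    _mm256_zeroupper();

    float total = 0.0f;
    for (int k = 0; k < 8; ++k)
        total += lanes[k];
    for (; i < n; ++i)          /* scalar tail */
        total += x[i];
    return total;
}
#endif

/* Dispatcher: uses the AVX path only if the running CPU supports it. */
float sum_f32(const float *x, size_t n)
{
#ifdef HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx"))
        return sum_avx(x, n);
#endif
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
}
```

Note that the AVX path adds in a different order than the scalar loop, so results can differ in the last bits for inputs that are not exactly representable sums.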
I tried to scrub the GCC man page for this, but still don't get it, really.
What's the difference between -march and -mtune?
When does one use just -march, vs. both? Is it ever possible to just -mtune?
If you use -march then GCC will be free to generate instructions that work on the specified CPU, but (typically) not on earlier CPUs in the architecture family.
If you just use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated. e.g. setting loop-unrolling heuristics appropriately for that CPU.
-march=foo implies -mtune=foo unless you also specify a different -mtune. This is one reason why using -march is better than just enabling options like -mavx without doing anything about tuning.
Caveat: -march=native on a CPU that GCC doesn't specifically recognize will still enable new instruction sets that GCC can detect, but will leave -mtune=generic. Use a new enough GCC that knows about your CPU if you want it to make good code.
This is what I've googled up:
The -march=X option takes a CPU name X and allows GCC to generate code that uses all features of X. GCC manual explains exactly which CPU names mean which CPU families and features.
Because features are usually added, but not removed, a binary built with -march=X will run on CPU X, has a good chance to run on CPUs newer than X, but it will almost assuredly not run on anything older than X. Certain instruction sets (3DNow!, I guess?) may be specific to a particular CPU vendor; making use of these will probably get you binaries that don't run on competing CPUs, newer or otherwise.
The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on. -march=X implies -mtune=X. -mtune=Y will not override -march=X, so, for example, it probably makes no sense to -march=core2 and -mtune=i686 - your code will not run on anything older than core2 anyway, because of -march=core2, so why on Earth would you want to optimize for something older (less featureful) than core2? -march=core2 -mtune=haswell makes more sense: don't use any features beyond what core2 provides (which is still a lot more than what -march=i686 gives you!), but do optimize code for much newer haswell CPUs, not for core2.
There's also -mtune=generic. generic makes GCC produce code that runs best on current CPUs (meaning of generic changes from one version of GCC to another). There are rumors on Gentoo forums that -march=X -mtune=generic produces code that runs faster on X than code produced by -march=X -mtune=X does (or just -march=X, as -mtune=X is implied). No idea if this is true or not.
Generally, unless you know exactly what you need, it seems that the best course is to specify -march=<oldest CPU you want to run on> and -mtune=generic (-mtune=generic is here to counter the implicit -mtune=<oldest CPU you want to run on>, because you probably don't want to optimize for the oldest CPU). Or just -march=native, if you are only ever going to run on the same machine you build on.