Compiler and linker flags for Intel Advisor - compilation

I'm using the Intel C++ compiler v16 on a Xeon Phi Knights Landing (KNL) with an application that uses OpenMP. I'm reading about which compiler and linker options to use for the Vectorization Advisor, the Threading Advisor and finally VTune.
Combining the tables in the 3 linked documents, I came up with this list (considering that Xeon Phi KNL supports AVX512):
-g -O3 -parallel -Bdynamic -ldl -simd -qopenmp -parallel-source-info=2 -qopenmp-link dynamic -debug inline-debug-info -shared-intel -xCORE-AVX512
However, I don't know which of these flags have to be used during compiling and/or linking. Second, am I missing any useful flags (or are some of them redundant)?
Btw, this is for compiling OpenCV.
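A hedged sketch of the typical split, assuming icpc drives both steps (the file names are illustrative, and most of these flags are harmlessly accepted at both steps):

    # compile step: debug info, optimization, OpenMP/parallelization, AVX-512 codegen
    icpc -c -g -debug inline-debug-info -O3 -qopenmp -simd -parallel \
         -parallel-source-info=2 -xCORE-AVX512 foo.cpp -o foo.o

    # link step: dynamic-linking options and libraries; repeating -qopenmp and
    # -parallel lets icpc pull in the matching runtime libraries
    icpc -g -qopenmp -parallel -qopenmp-link dynamic -shared-intel -Bdynamic \
         foo.o -ldl -o foo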

Related

What is the difference between architecture and processor?

GCC/Clang support the compilation options -march and -mcpu.
What exactly is "architecture" in the context of -march?
What exactly is "processor" (cpu) in the context of the option -mcpu?
What is the difference between architecture and processor (cpu)?
Thanks.
For the flags:
The -march flag specifies the target architecture version, i.e. a set of ISA extensions. -march tells the compiler that it is allowed to generate special instructions to use the specific hardware features of a given architecture. (On x86, -march=x also implies -mtune=x.)
The -mtune flag specifies the target microarchitecture (a specific CPU) to tune for, in terms of choices like loop-unrolling thresholds or instruction selection that don't affect compatibility. -mtune without any -march tells the compiler to generate a binary with a bare-minimum, generic instruction set, but tuned for the specified target; it does not let the compiler use the special hardware features of that target.
-mtune can be combined with -march: -march sets a baseline CPU feature level, while -mtune tunes for a specific CPU that is hopefully representative of the CPUs you care most about actually running on, even if that CPU is newer than the oldest CPUs that match the -march setting (i.e. the oldest ones that could still run the binary).
The -mcpu flag on ARM specifies the target processor to tune for and to use any extensions it supports. GCC uses this name to derive the target ARM architecture (as if specified by -march), so e.g. -mcpu=cortex-a53 implies both -march= and -mtune= settings.
-mcpu flag on x86 is a deprecated synonym for -mtune. See GCC: mtune vs march vs mcpu
GCC and LLVM agree on the meanings of these flags, e.g. both treat -mcpu as a synonym for -mtune on x86-64, while giving it the architecture-deriving meaning on ARM.
Examples:
-mcpu=cortex-a8 will perform specific optimisations for the Cortex-A8, such as instruction scheduling, and will produce better-performing code on that core, as well as using instructions it supports but older CPUs don't.
-march=armv7-a just selects the ARMv7-A architecture, which tells the compiler that it can use the instructions in ARMv7-A, but it will leave the performance tuning / instruction-scheduling heuristics at their default values, aimed at running well across a range of CPUs of the overall architecture (e.g. ARM in general, unfortunately not restricted to CPUs new enough to support the selected -march).
With -mtune=thunderx2t99, the compiler will generate a binary that is optimized for the ThunderX2 microarchitecture. This binary will not take full advantage of all the ThunderX2's hardware features, but it will be somewhat optimized for the ThunderX2 and more portable than a binary compiled with -march=armv8.1-a, for instance.
The x86 convention is to let -march imply -mtune, as stated in the docs.
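To make the x86 behaviour concrete, here's a minimal sketch (file and function names are illustrative) that you can compile with different flag combinations and compare the generated assembly:

    // sum.c -- compare, e.g.:
    //   gcc -O3 -march=x86-64 -mtune=skylake -S sum.c   (baseline ISA, Skylake tuning)
    //   gcc -O3 -march=skylake -S sum.c                 (AVX2 allowed; implies -mtune=skylake)
    int sum(const int *a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];   // with -march=skylake, GCC auto-vectorizes this with AVX2
        return s;
    }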
Links:
For further reference, these ARM docs may help
Deprecation of -mcpu as synonym of -mtune
Docs for x86 options

Is there a way to automatically replace avx512 with avx2?

Following the advice of Linus Torvalds (and for cross-platform performance), I wish not to use avx512. Is there a flag I can specify to the compiler (both gcc and msvc) such that all avx512 instructions are split into pairs of avx2 instructions, if a library I am using tries to use avx512 either from intrinsics or compiler optimization?
No, compile your code not to use AVX-512 in the first place by telling the compiler it can't; you then only have to do anything special about code using intrinsics that require AVX-512.
However, if you're compiling for a CPU that supports AVX-512, it's often worth using it, especially with 256-bit vectors to avoid the turbo-frequency and other penalties that come with 512-bit vectors. GCC's default tuning is already -mprefer-vector-width=256 for CPUs like -march=skylake-avx512.
If you want to make a binary that can run on CPUs without AVX-512, then yes, obviously you need to make sure it never executes any instructions that would fault without it. E.g. gcc -O3 -march=znver2 or -march=skylake or whatever; neither of those target arch options includes AVX-512. Or -march=native if compiling for whatever CPU you have.
But if you do have a CPU that supports AVX-512 and you want to not use it, you can use something like -march=native -mno-avx512f. (All other AVX-512 extensions depend on the "Foundation" AVX-512F, so disabling that also prevents even AVX-512VL for 128 and 256-bit vectors.)
(Part of the benefit of -march=native and then disabling stuff is to also set tuning options. If you want a binary that runs well on both Skylake and Zen 2, I'm not sure what to recommend; probably -march=skylake or -march=znver2 are both OK. There's the default -mtune=generic, but it cares too much about really old CPUs that don't even support AVX2, like Sandybridge: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)
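A hedged sketch of that approach (names illustrative; grepping the disassembly for zmm registers is just a quick sanity check):

    // addf.c -- build and verify no 512-bit instructions were emitted:
    //   gcc -O3 -march=native -mno-avx512f -c addf.c
    //   objdump -d addf.o | grep zmm     # expect no output
    void add_arrays(float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] += b[i];   // auto-vectorized with at most AVX2 (ymm registers)
    }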
Intrinsics
Even with intrinsics, GCC will only ever emit instructions supported by the target options, so -mno-avx512f lets you be sure you didn't miss anything: you'll get compile-time errors instead of EVEX instructions slipping through the cracks.
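For example (a hedged sketch; the exact diagnostic wording varies by GCC version):

    // avx512_use.c -- compiling with AVX-512F disabled fails:
    //   gcc -O2 -mavx2 -mno-avx512f -c avx512_use.c
    //   error: inlining failed in call to 'always_inline' '_mm512_add_epi32':
    //          target specific option mismatch
    #include <immintrin.h>

    __m512i add16(__m512i a, __m512i b) {
        return _mm512_add_epi32(a, b);   // requires AVX-512F
    }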
(MSVC is different: it's designed around a single-binary model where new instruction sets are used in functions that you only call if the CPU supports them, so it won't stop you from using AVX-512 intrinsics. MSVC won't emit AVX-512 instructions on its own unless you ask for it with /arch:AVX512, which recent versions do have; older versions could only auto-vectorize up to /arch:AVX2. AFAIK it doesn't have a /arch:native, unfortunately. With MSVC you do have to be sure you caught all uses of intrinsics yourself, although compiling with GCC can help to make sure your codebase doesn't do that.)
If you still want to compile code that uses _mm512_add_epi32 or _mm256_ternarylogic_epi32 or whatever, you'll need a version of immintrin.h that defines __m512i as a struct/class with two __m256i members and emulates all the intrinsics. Some AVX-512 intrinsics won't be cheap to emulate, especially masked operations and the whole concept of compare-into-mask producing an integer mask instead of a vector. So it's probably a bad idea to try to make this happen fully transparently; instead, just get GCC to stop you from using any AVX-512 instructions while you write AVX2-only versions of any intrinsics code that didn't already have AVX2 versions.
Last time this came up (Coding on insufficient hardware), I was able to find an avxintrin-emu.h that let you develop for AVX while only compiling for SSE4. But I didn't find an equivalent for AVX-512. (Normally you would compile an AVX-512 binary and test it on an emulator like Intel SDE that emulates at run time, not compile time.)
Agner Fog's VectorClass wrapper library (https://www.agner.org/optimize/#vectorclass) has support for basic operations like + - * /, and shuffles and blends, and can emulate 512-bit vectors with a pair of AVX2 vectors. (VCL types are implicitly convertible to __m256i or __m512i and so on, so for operations it doesn't have its own functions for, you can use Intel intrinsics. But then you're back in the same boat of needing a library that emulates _mm256_ternarylogic_epi32 with only AVX2 instructions.)
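A minimal hedged sketch of that emulation mode (MAX_VECTOR_SIZE and Vec16i are from the VCL documentation; the function is illustrative):

    // vcl_demo.cpp -- compile with only AVX2 enabled, e.g.:
    //   g++ -O3 -march=skylake -c vcl_demo.cpp
    #define MAX_VECTOR_SIZE 512   // allow 512-bit types even without AVX-512
    #include "vectorclass.h"

    void add16(const int *x, const int *y, int *out) {
        Vec16i a, b;               // without AVX-512, emulated as two 256-bit halves
        a.load(x);                 // two ymm loads under AVX2
        b.load(y);
        (a + b).store(out);        // operator+ becomes two vpaddd instructions
    }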
This won't stop libc from possibly using hand-written AVX-512 instructions in functions like strcmp or log/exp, since dynamic CPU dispatching happens at run-time, and you can't stop your CPU from reporting that it supports AVX-512. (Except with a VM, or by telling the kernel not to enable AVX-512 at boot, if Linux has an option for that.)

What GCC optimization flags and techniques are safe across CPUs?

When compiling/linking C/C++ libraries or programs that are meant to work on all implementations of an ISA (e.g. x86-64), which optimization flags are safe from the correctness and run-time performance perspectives? I want optimizations that yield correct results and won't be detrimental performance-wise on any particular CPU. E.g. I would like to avoid optimization flags that yield run-time performance improvements on an 8th-gen Intel Core i7 but result in performance degradation on an AMD Ryzen.
Are PGO, LTO, and -O3 safe? Is it solely dependent on -march and -mtune (or the absence thereof)?
They're all supposed to be "safe", assuming that your code is well defined.
If you don't want to specialize for a particular CPU family then just leave -march and -mtune alone; the default suits a generic x86_64.
PGO is always a good idea; it mostly helps the compiler lay out and predict branches based on how the code actually behaves at run time.
LTO and -O3 can have different effects on different code-bases. For example, if your code benefits from vectorization then -O3 is a big win over -O2, but the extra inlining and unrolling can lead to larger code sizes, and that can be a disadvantage on systems with more limited caches.
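For reference, a hedged sketch of the usual GCC workflows (file names illustrative):

    # PGO: instrument, run a representative workload, then rebuild with the profile
    gcc -O2 -fprofile-generate app.c -o app
    ./app < training-input
    gcc -O2 -fprofile-use app.c -o app

    # LTO: pass -flto at both compile and link time so GCC can inline and
    # optimize across translation units
    gcc -O3 -flto -c a.c
    gcc -O3 -flto -c b.c
    gcc -O3 -flto a.o b.o -o app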
In the end, the only advice that ever really means anything here is: measure it and see what's good for your code.

How can I compile *without* various instruction sets enabled?

I am attempting to recompile some software with various instruction sets, specifically, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, and AVX, and I would like to see how the code performs without these instruction sets to make sure I am getting the full effect of them.
For example, I want to compile it with just -O2 with a GNU compiler and see how it performs when restricted to only SSE, and also see which flags it invokes by default. I also have an Intel compiler that I am working with, and I would like to isolate what each flag (or combination of flags) does to my code, so how can I specify exactly which flags are invoked?
If it matters, I am working with C, C++, and Fortran on Linux.
For the GCC compiler:
You have to use the -mno-<option> switches for this.
These switches enable or disable the use of instructions in the MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, AVX512F, AVX512PF, AVX512ER, AVX512CD, SHA, AES, PCLMUL, FSGSBASE, RDRND, F16C, FMA, SSE4A, FMA4, XOP, LWP, ABM, BMI, BMI2, FXSR, XSAVE, XSAVEOPT, LZCNT, RTM, or 3DNow! extended instruction sets. These extensions are also available as built-in functions: see X86 Built-in Functions, for details of the functions enabled and disabled by these switches.
More info can be found on the official GCC site.
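A hedged sketch for an x86-64 target (the baseline already includes SSE/SSE2, so only the newer sets need to be switched off; file name illustrative):

    # restrict codegen to SSE/SSE2 by disabling everything newer
    gcc -O2 -mno-sse3 -mno-ssse3 -mno-sse4.1 -mno-sse4.2 -mno-avx foo.c -o foo

    # list which target flags a given -march (or the default) enables
    gcc -Q --help=target -march=native | grep enabled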
For the ICC compiler, you have to use a combination of:
-march="cpu": generate code for a specific cpu, using instructions that older CPUs may not have
-mtune="cpu": tune the code for a specific cpu while staying compatible with other CPUs
-msse3, -msse4, -mavx, etc.: set the level of SIMD and vector instructions
More info here
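A hedged sketch using the flags named above (exact option spellings vary between ICC versions):

    icc -O2 -msse3 foo.c -o foo                       # cap vectorization at SSE3
    icc -O2 -mavx foo.c -o foo                        # allow AVX
    icc -O2 -march=corei7 -mtune=corei7 foo.c -o foo  # target and tune for a specific cpu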

Disabling -msse

I am trying to run various benchmark tests using CPU2006 to see what various optimizations do in terms of speed on gcc. I am familiar with -O1, -O2, and -O3, but have heard that -msse is a decent optimization. What exactly is -msse? I've also seen that -msse is default on a 64 bit architecture, so how do I disable it to compare the difference between using it and not using it?
-msse activates the generation of SSE instructions. All 64-bit processors (x86-64) have them, but some older 32-bit processors (IA-32) do not have these instructions. This is the reason for GCC's default settings.
SSE instructions have to do with vector operations and floating point. Considering that opportunities for automatic vectorization are rare in general-purpose code, the only difference you are likely to observe is with floating-point code.
On 64-bit, to disable SSE instructions, use -mno-sse.
http://www.justskins.com/forums/gcc-option-msse-and-128289.html
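A hedged sketch of what that looks like (file name illustrative; note that the x86-64 ABI passes and returns float/double in SSE registers, so functions taking or returning them by value will typically be rejected with SSE disabled):

    // scale.c -- gcc -O2 -mno-sse -S scale.c
    // The loop compiles to x87 floating-point instructions; no xmm registers appear.
    void scale(double *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] *= 2.0;
    }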
SSE (http://it.wikipedia.org/wiki/Streaming_SIMD_Extensions), as the name says, is the set of SSE instructions present in processors since the Pentium 3. They are fast for some kinds of vector and floating-point computation. They are available in all 64-bit processors, so why should we disable them?
You can choose between -msse and -msse2. SSE2 is another instruction set built on top of SSE that adds other powerful and very fast vector instructions.
The Pentium 3 did have SSE, and it is a 32-bit processor.
SSE2 is more modern: the Pentium 4, which is still a 32-bit processor, has SSE2.

Resources