What happens if we define the wrong flag for -mfpu? - gcc

For example, if I have a chip with -mcpu=cortex-a7, I should specify -mfpu=neon-vfpv4, not -mfpu=neon. However, I'm wondering what will happen if I specify -mfpu=neon on a Cortex-A7. Will it just ignore the flag and not use SIMD, or what will it do with a wrong flag like that?

It will use an older set of NEON instructions (-mfpu=neon is for selecting the NEON instructions that are available on the Cortex-A8 core). For example, it will not include the VFMA instructions.
Note that from GCC 8 (still in development) you will be able to just use -mfpu=auto, or leave out -mfpu entirely, and have the compiler pick the optimal FPU setting for the -mcpu option you selected.

If you do this, the compiler won't use VFPv4 instructions and may generate suboptimal code.
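One way to see which FPU features the compiler assumes for a given -mfpu is to probe the feature macros it predefines. The sketch below is illustrative only: the helper names are made up for this example, and it assumes a GCC or Clang toolchain following ACLE, where __ARM_NEON (or the older __ARM_NEON__) is defined when NEON is enabled and __ARM_FEATURE_FMA is defined when fused multiply-add instructions such as VFMA are available.

```c
#include <stdio.h>

/* Hypothetical helper (not from the answers above): returns a bitmask
 * describing what the compiler was told about the FPU at build time.
 *   bit 0: NEON / Advanced SIMD is enabled
 *   bit 1: fused multiply-add instructions are enabled */
unsigned fpu_assumptions(void)
{
    unsigned flags = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    flags |= 1u;
#endif
#if defined(__ARM_FEATURE_FMA)
    flags |= 2u;
#endif
    return flags;
}

/* Prints the same information in readable form. */
void print_fpu_assumptions(void)
{
    unsigned flags = fpu_assumptions();
    printf("NEON: %s\n", (flags & 1u) ? "yes" : "no");
    printf("FMA:  %s\n", (flags & 2u) ? "yes" : "no");
}
```

Building this for -mcpu=cortex-a7 once with -mfpu=neon and once with -mfpu=neon-vfpv4 should, under those assumptions, differ only in the FMA line.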

Related

AVX512 and MSVC preprocessor symbol

According to this link, there are no predefined preprocessor symbols for AVX512 (MSVC 2017).
I'm trying to build thundersvm, which uses the Eigen library, on (you guessed it) Windows. Both Eigen and thundersvm use cmake, and depending on the compiler's preprocessor symbols, Eigen compiles with AVX512 instructions or not.
It seems that using /arch:AVX512 doesn't trigger any errors in MSVC but doesn't define the __AVX512F__ symbol, which Eigen needs. I also tried to include -D__AVX512F__=ON in the cmake arguments, but still no luck.
Since there is no predefined preprocessor symbol for AVX512, is there any way to force Eigen to compile with avx512?
Update
Following chtz's comment, I checked out the default branch of Eigen and recompiled thundersvm with /arch:AVX512 and these cmake arguments (maybe not all are needed):
-DUSE_CUDA=OFF -DUSE_EIGEN=ON -DBUILD_SHARED_LIBS=OFF -DEIGEN_ENABLE_AVX512=ON -D__AVX512F__=ON -DEIGEN_VECTORIZE_AVX512=ON -DEIGEN_VECTORIZE_AVX2=ON -DEIGEN_VECTORIZE_AVX=ON -DEIGEN_VECTORIZE_FMA=ON
Comparing the instruction mix from Intel's SDE -mix tool before and after the patch, I can clearly see that AVX instructions are used (SDE complains that it doesn't recognise the instruction vbroadcastss zmm0, xmm0 when running for an skl CPU, but works fine for skx). The problem is that MSVC uses the scalar versions of the AVX instructions, and there is no improvement in the runtime (the total number of instructions is also the same), which is similar to this post.
Are there other flags I need to define so that MSVC generates non-scalar instructions? (I think I'll also give gcc a try.)
MSVC has poor support for AVX-512 and makes no distinction between the different subsets. There is no safe way to produce AVX512F code on MSVC without possibly also emitting AVX512DQ instructions.
The best compilers for AVX-512 are gcc and clang. There is a Clang plugin for Visual Studio that you can use if you like the IDE. The gcc and clang compilers have preprocessor symbols like __AVX512F__, __AVX512VL__, etc.
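To show why the missing symbol matters, here is a minimal sketch of how a library can key its code path on __AVX512F__. This is not Eigen's actual code; the function name add_floats is invented for the example. When the compiler defines the macro (for instance gcc or clang with -mavx512f), the 16-wide path is compiled in; otherwise only the scalar loop exists, which matches the behaviour described in the question.

```c
#include <stddef.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical example function: out[i] = a[i] + b[i] for i in [0, n). */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__AVX512F__)
    /* Vector path: only compiled when the compiler advertises AVX-512F. */
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(out + i, _mm512_add_ps(va, vb));
    }
#endif
    /* Scalar path: handles the tail, or everything if the macro is absent. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Both paths give the same results, so whether the macro was defined at build time is only visible in the generated instructions.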

How to use GCC LTO with differently optimized object files?

I'm compiling an executable with arm-none-eabi-gcc for a Cortex-M4 based microcontroller. Non-performance-critical code is compiled with -Os (optimized for executable code size) and performance-critical parts with other optimization flags, e.g. -Og / -O2 etc.
Is it safe to use -flto in such a build? If so, which optimization flag should be passed to the linker?
According to the GCC documentation regarding optimise options:
It is recommended that you compile all the files participating in the same link with the same options
Such a statement is rather vague. Nevertheless, when digging into the release notes of GCC 5, there are some additional details:
Command-line optimization and target options are now streamed on a per-function basis and honored by the link-time optimizer. This change makes link-time optimization a more transparent replacement of per-file optimizations. It is now possible to build projects that require different optimization settings for different translation units (such as -ffast-math, -mavx, or -finline).
And also information about which flags are affected by such limitations and which aren't:
Note that this applies only to those command-line options that can be passed to optimize and target attributes. Command-line options affecting global code generation (such as -fpic), warnings (such as -Wodr), optimizations affecting the way static variables are optimized (such as -fcommon), debug output (such as -g), and --param parameters can be applied only to the whole link-time optimization unit. In these cases, it is recommended to consistently use the same options at both compile time and link time.
In your scenario, the optimisation flags -Og, -O2 and -Os can be passed as optimise attributes and do not fall into the cases where the compile time and link time flags ought to be the same. So yes, it should be safe to use -flto in such a build.
Regarding the optimisations flags passed at link time, as stated in the release notes:
Contrary to earlier GCC releases, the optimization and target options passed on the link command line are ignored.
GCC automatically determines which optimisation level to use, which is the highest level used when compiling the object files. You therefore don't need to pass any of your -O optimisation options to the linker.
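The release notes tie this behaviour to options that "can be passed to optimize and target attributes". As a rough illustration of what such per-function settings look like in source form, the sketch below uses GCC's optimize function attribute. The macro and function names are made up for the example, and the attribute is GCC-specific, so it is compiled out for other compilers.

```c
#include <stddef.h>

/* The optimize attribute is a GCC extension; define it away elsewhere. */
#if defined(__GNUC__) && !defined(__clang__)
#define OPT_SIZE  __attribute__((optimize("Os")))
#define OPT_SPEED __attribute__((optimize("O2")))
#else
#define OPT_SIZE
#define OPT_SPEED
#endif

/* Non-critical code: ask for size optimization for this function only. */
OPT_SIZE unsigned checksum_small(const unsigned char *p, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += p[i];
    }
    return sum;
}

/* Performance-critical code: ask for -O2 for this function only. */
OPT_SPEED int dot_fast(const int *a, const int *b, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

In the build described in the question no attributes are needed; the point is only that -Os, -Og and -O2 belong to the class of options that can be attached to individual functions like this.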

How to turn on Fused Multiply Add in GCC for ARM processor

In my C program, I want the processor to compute a*b + c using an FMADD instruction rather than MUL and ADD. How do I tell the compiler to do this? I would also like to see the FMADD instruction in the assembly code after compiling.
gcc version 4.9.2
ARM v7 Processor
You need to have one of the following FPUs,
vfpv4
vfpv4-d16
fpv4-sp-d16
fpv5-sp-d16
fpv5-d16
neon-vfpv4
fp-armv8
neon-fp-armv8
crypto-neon-fp-armv8
You must use the hard-float ABI option.
An example with integers.
An example with floats.
You shouldn't need to call any special functions; the compiler will use the instruction if it finds it beneficial.
The code in arm.c responsible for generation is,
case FMA:
if (TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_FMA)
With TARGET_FMA being a version '4' or better FPU.
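As a minimal sketch of the kind of source the compiler can turn into a fused multiply-add, the function below is plain C with no intrinsics. The name mul_add is chosen only for this example.

```c
/* Hypothetical example: computes a*b + c in single precision.
 * With a version 4 or later FPU and the hard-float ABI, GCC may emit one
 * fused multiply-add here; otherwise it emits a multiply and an add. */
float mul_add(float a, float b, float c)
{
    return a * b + c;
}
```

To inspect the result, something like arm-linux-gnueabihf-gcc -O2 -mfpu=vfpv4 -mfloat-abi=hard -S on this file should write an assembly listing in which a fused instruction (for example a vfma variant) can be looked for. The toolchain name here is an assumption and depends on your setup.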

gcc; Aarch64; Armv8; enable crypto; -mcpu=cortex-a53+crypto

I am trying to optimize code for an Arm processor (Cortex-A53) with an Armv8 architecture for crypto purposes.
The problem is that although the compiler accepts -mcpu=cortex-a53+crypto etc., it doesn't change the output (I checked the assembly output).
Changing -mfpu or -mcpu to add features like crypto or simd doesn't matter; it is completely ignored.
To enable Neon code, -ftree-vectorize is needed; how do I make use of crypto?
(I checked the -O(1,2,3) flags, it won't help).
Edit: I realized I made a mistake by thinking the crypto flag works like an optimization flag solved by the compiler. My bad.
You had two questions...
Why does -mcpu=cortex-a53+crypto not change code output?
The crypto extensions are an optional feature under the AArch64 state of ARMv8-A. The +crypto feature flag indicates to the compiler that these instructions are available for use. From a practical perspective, in GCC 4.8/4.9/5.1, this defines the macro __ARM_FEATURE_CRYPTO, and controls whether or not you can use the crypto intrinsics defined in ACLE, for example:
uint8x16_t vaeseq_u8 (uint8x16_t data, uint8x16_t key)
There is no optimisation in current GCC which will automatically convert a sequence of C code to use the cryptography instructions. If you want to make this transformation, you have to do it by hand (and guard it by the appropriate feature macro).
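A hand-written use of the intrinsic, guarded by the feature macro, could look like the sketch below. The wrapper name aese_step and its return convention are assumptions made for this example; only vaeseq_u8, vld1q_u8, vst1q_u8 and __ARM_FEATURE_CRYPTO come from ACLE. Note that vaeseq_u8 performs a single AESE step, not a full AES encryption.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_FEATURE_CRYPTO)
#include <arm_neon.h>
#endif

/* Hypothetical wrapper: applies one AESE step to a 16-byte state.
 * Returns 1 if the crypto intrinsic was used, or 0 if the code was built
 * without crypto support (out is then zero-filled as a placeholder). */
int aese_step(const uint8_t state[16], const uint8_t key[16], uint8_t out[16])
{
#if defined(__ARM_FEATURE_CRYPTO)
    uint8x16_t s = vld1q_u8(state);
    uint8x16_t k = vld1q_u8(key);
    vst1q_u8(out, vaeseq_u8(s, k));
    return 1;
#else
    (void)state;
    (void)key;
    memset(out, 0, 16);
    return 0;
#endif
}
```

Built with -mcpu=cortex-a53+crypto the intrinsic path should be selected; without the feature flag the fallback branch is compiled instead, so the same source builds in both cases.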
Why do the +fp and +simd flags not change code output?
For -mcpu=cortex-a53 the +fp and +simd flags are implied by default (for some configurations of GCC +crypto may also be implied by default). Adding these feature flags will therefore not change code generation.

Restrict SSE instruction set

I want my compiler to use only instructions of the specified version of SSE.
For now, it looks like -msse2 -mno-sse3 -mno-sse4 -mno-sse41 -mno-sse42 does it; however, I'm looking for something like -monly-sse2.
Unless you specify -msse3/-march=<cpu-with-sse3> only SSE2 will be used on x86-64 (and even lower instruction sets on x86).
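To check which SSE level a given set of flags actually enables, the predefined macros of gcc and clang can be inspected. The sketch below is only an illustration; the function name enabled_sse_level is invented here, and it reports the highest level whose macro is defined at compile time.

```c
/* Hypothetical helper: names the highest SSE level enabled at compile time,
 * based on the macros gcc and clang predefine for each instruction set. */
const char *enabled_sse_level(void)
{
#if defined(__SSE4_2__)
    return "sse4.2";
#elif defined(__SSE4_1__)
    return "sse4.1";
#elif defined(__SSSE3__)
    return "ssse3";
#elif defined(__SSE3__)
    return "sse3";
#elif defined(__SSE2__)
    return "sse2";
#else
    return "below sse2 or not x86";
#endif
}
```

On a default x86-64 gcc build without -march or -msse3, this would be expected to report sse2, in line with the answer above.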
