How to turn on Fused Multiply Add in GCC for ARM processor - gcc

In my C program, I want the processor to compute a*b + c using the FMADD instruction rather than MUL and ADD. How do I tell the compiler to do this? I would also like to see the FMADD instruction in the assembly code after compiling.
gcc version 4.9.2
ARM v7 Processor

You need to have one of the following FPUs:
vfpv4
vfpv4-d16
fpv4-sp-d16
fpv5-sp-d16
fpv5-d16
neon-vfpv4
fp-armv8
neon-fp-armv8
crypto-neon-fp-armv8
You must use the hard-float ABI option.
An example with integers.
An example with floats.
You shouldn't need to specify any special function calls; the compiler will use the instruction if it finds it beneficial.
The code in arm.c responsible for the generation is:
case FMA:
if (TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_FMA)
With TARGET_FMA being a version '4' or better FPU.

Related

What happens if we define wrong flag for -mfpu?

For example, if I have a chip with -mcpu=cortex-a7, I should define -mfpu=neon-vfpv4, not -mfpu=neon. However, I'm wondering what will happen if I define -mfpu=neon on a Cortex-A7. Will it just ignore the flag and not do SIMD, or what will it do with a wrong flag like that?
It will use an older set of NEON instructions (-mfpu=neon is for selecting the NEON instructions that are available on the Cortex-A8 core). For example, it will not include the VFMA instructions.
Note that from GCC 8 (still in development) you will be able to just use -mfpu=auto, or leave out -mfpu entirely, and have the compiler pick the optimal FPU setting for the -mcpu option you selected.
If you specify -mfpu=neon, the compiler won't use VFPv4 instructions and will potentially generate suboptimal code.

Can I make my compiler use fast-math on a per-function basis?

Suppose I have
template <bool UsesFastMath> void foo(float* data, size_t length);
and I want to compile one instantiation with -ffast-math (--use-fast-math for nvcc), and the other instantiation without it.
This can be achieved by instantiating each of the variants in a separate translation unit, and compiling each of them with a different command-line - with and without the switch.
My question is whether it's possible to indicate to popular compilers (*) to apply or not apply -ffast-math for individual functions - so that I'll be able to have my instantiations in the same translation unit.
Notes:
If the answer is "no", bonus points for explaining why not.
This is not the same question as this one, which is about turning fast-math on and off at runtime. I'm much more modest...
(*) By popular compilers I mean any of gcc, clang, msvc, icc, nvcc (for GPU kernel code) about which you have that information.
In GCC you can declare functions like the following:
__attribute__((optimize("-ffast-math")))
double
myfunc(double val)
{
return val / 2;
}
This is a GCC-only feature. See a working example here: https://gcc.gnu.org/ml/gcc/2009-10/msg00385.html
It seems that GCC does not verify the optimize() arguments, so typos like "-ffast-match" will be silently ignored.
As of CUDA 7.5 (the latest version I am familiar with, although CUDA 8.0 is currently shipping), nvcc does not support function attributes that allow programmers to apply specific compiler optimizations on a per-function basis.
Since optimization configurations set via command line switches apply to the entire compilation unit, one possible approach is to use as many different compilation units as there are different optimization configurations, as already noted in the question; source code may be shared and #include-ed from a common file.
With nvcc, the command line switch --use_fast_math basically controls three areas of functionality:
Flush-to-zero mode is enabled (that is, denormal support is disabled)
Single-precision reciprocal, division, and square root are switched to approximate versions
Certain standard math functions are replaced by equivalent, lower-precision, intrinsics
You can apply some of these changes with per-operation granularity by using appropriate intrinsics, others by using PTX inline assembly.

gcc; Aarch64; Armv8; enable crypto; -mcpu=cortex-a53+crypto

I am trying to optimize for an Arm processor (Cortex-A53) with the Armv8 architecture for crypto purposes.
The problem is that although the compiler accepts -mcpu=cortex-a53+crypto etc., it doesn't change the output (I checked the assembly).
Changing -mfpu or -mcpu to add features like crypto or simd doesn't matter; they are completely ignored.
To enable Neon code, -ftree-vectorize is needed. How do I make use of crypto?
(I checked the -O(1,2,3) flags; they don't help.)
Edit: I realized I made a mistake by thinking the crypto flag works like an optimization flag handled by the compiler. My bad.
You had two questions...
Why does -mcpu=cortex-a53+crypto not change code output?
The crypto extensions are an optional feature under the AArch64 state of ARMv8-A. The +crypto feature flag indicates to the compiler that these instructions are available for use. From a practical perspective, in GCC 4.8/4.9/5.1, this defines the macro __ARM_FEATURE_CRYPTO and controls whether you can use the crypto intrinsics defined in ACLE, for example:
uint8x16_t vaeseq_u8 (uint8x16_t data, uint8x16_t key)
There is no optimisation in current GCC which will automatically convert a sequence of C code to use the cryptography instructions. If you want to make this transformation, you have to do it by hand (and guard it by the appropriate feature macro).
Why do the +fpu and +simd flags not change code output?
For -mcpu=cortex-a53 the +fp and +simd flags are implied by default (for some configurations of GCC +crypto may also be implied by default). Adding these feature flags will therefore not change code generation.

Is the mno-mul option still supported in the mips compiler?

I am trying to compile my C code to use soft multiplication in MIPS I as my hardware does not have a hard multiplier.
This document (page 10): http://www.sm.luth.se/csee/courses/smd/137/doc/gcc.pdf indicates that the "-mno-mul" option can be used to tell the compiler not to generate integer multiply/divide instructions and instead insert calls to multiply/divide subroutines.
However, when I feed in the "-mno-mul" option to my compiler, the error message returned is:
unrecognized command line option "-mno-mul"
I tried googling for more information on "-mno-mul", but there are very limited search results. The option is not even listed here: https://gcc.gnu.org/onlinedocs/gcc/Option-Summary.html
My question is: Has the mno-mul option become obsolete? If so, is there a workaround for the compiler to generate code for soft multiplication?
This option is obsolete, since all MIPS architecture specifications since MIPS I require an integer multiplier.
You might still be able to track down a copy of GCC 2.96 and compile using that. Or you could write a handler for the illegal instruction trap that implements soft multiplication.
According to the GCC MIPS options documentation, you can use -mno-mad:
-mno-mad
Enable (disable) use of the mad, madu and mul instructions, as provided by the R4650 ISA.

How to get GCC to use more than two SIMD registers when using intrinsics?

I am writing some code and trying to speed it up using SIMD intrinsics SSE2/3. My code is of such a nature that I need to load some data into an XMM register and act on it many times. When I look at the generated assembly, it seems that GCC keeps flushing the data back to memory in order to reload something else into XMM0 and XMM1. I am compiling for x86-64, so I have 16 XMM registers. Why is GCC using only two, and what can I do to ask it to use more? Is there any way that I can "pin" some value in a register? I added the "register" keyword to my variable definition, but the generated assembly code is identical.
Yes, you can. Explicit Reg Vars talks about the syntax you need to pin a variable to a specific register.
If you're getting to the point where you're specifying individual registers for each intrinsic, you might as well just write the assembly directly, especially given gcc's nasty habit of pessimizing intrinsics unnecessarily in many cases.
It sounds like you compiled with optimization disabled, so no variables are kept in registers between C statements, not even int.
Compile with gcc -O3 -march=native to let the compiler make non-terrible asm, optimized for your machine. The default is -O0 with a "generic" target ISA and tuning.
See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why "debug" builds in general are like that, and the fact that register int foo; or register __m128 bar; can stay in a register even in a debug build. But it's much better to actually have the compiler optimize, as well as using registers, if you want your code to run fast overall!
