GCC hidden optimizations - gcc

The GCC manual lists all optimization flags being applied for the different levels of optimizations (-O1, -O2, etc.). However when compiling and measuring a benchmark program (e.g. cBench's automotive_bitcount) there is a significant difference when applying an optimization level instead of turning on all the listed optimizations manually. For -O1 with the automotive_bitcount program, I measured a speedup of roughly 100% when compiling with -O1 instead of manually applying all the listed flags. Those "hidden" optimizations seem in fact to be the main part of the optimization work GCC does for -O1. When applying the flags manually, I only get a speedup of about 10% compared to no optimizations.
The same can be observed when applying all enabled flags from gcc -c -Q -O3 --help=optimizers.
In the GCC manual I found this section which would explain this behavior:
Not all optimizations are controlled directly by a flag. Only optimizations that have a flag are listed in this section.
Most optimizations are completely disabled at -O0 or if an -O level is not set on the command line, even if individual optimization flags are specified.
Since I couldn't find any further documentation on those optimizations, I wonder if there is a way of controlling them and what the optimizations are in detail?

Some optimizations are directly gated by -O flags e.g. complete unroller:
{
public:
pass_complete_unrolli (gcc::context *ctxt)
: gimple_opt_pass (pass_data_complete_unrolli, ctxt)
{}
/* opt_pass methods: */
virtual bool gate (function *) { return optimize >= 2; }
virtual unsigned int execute (function *);
}; // class pass_complete_unrolli
and for others -O influences their internal algorithms e.g. in optimization of expressions:
/* If FROM is a SUBREG, put it into a register. Do this
so that we always generate the same set of insns for
better cse'ing; if an intermediate assignment occurred,
we won't be doing the operation directly on the SUBREG. */
if (optimize > 0 && GET_CODE (from) == SUBREG)
from = force_reg (from_mode, from);
There is no way to work around this, you have to use -O.

Related

The Effect of Architecture When Using SSE / AVX Intrinisics

I wonder how does a Compiler treats Intrinsics.
If one uses SSE2 Intrinsics (Using #include <emmintrin.h>) and compile with -mavx flag. What will the compiler generate? Will it generate AVX or SSE code?
If one uses AVX2 Intrinsics (Using #include <immintrin.h>) and compile with -msse2 flag. What will the compiler generate? Will it generate SSE Only or AVX code?
How does compilers treat Intrinsics?
If one uses Intrinsics, does it help the compiler understand the dependency in the loop for better vectorization?
For instance, what's going on here - https://godbolt.org/z/Y4J5OA (Or https://godbolt.org/z/LZOJ2K)?
See all 3 panes.
The Context
I'm trying to build various version of the same functions with different CPU features (SSE4 and AVX2).
I'm writing the same version one with SSE Intrinsics and once with AVX Intrinsics.
Let's say theyare name MyFunSSE() and MyFunAVX(). Both are in the same file.
How can I make the Compiler (Same method should work for MSVC, GCC and ICC) build each of them using only the respective functions?
GCC and clang require that you enable all extensions you use. Otherwise it's a compile-time error, like error: inlining failed to call always_inline error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch
Using -march=native or -march=haswell or whatever is preferred over enabling specific extensions, because that also sets appropriate tuning options. And you don't forget useful ones like -mpopcnt that will let std::bitset::count() inline a popcnt instruction, and make all variable-count shifts more efficient with BMI2 shlx / shrx (1 uop vs. 3)
MSVC and ICC do not, and will let you use intrinsics to emit instructions that they couldn't auto-vectorize with.
You should definitely enable AVX if you use AVX intrinsics. Older MSVC without enabling AVX didn't always use vzeroupper automatically where needed, but that's been fixed for a few years. Still, if your whole program can assume AVX support, definitely tell the compiler about it even for MSVC.
For compilers that support GNU extensions (GCC, clang, ICC), you can use stuff like __attribute__((target("avx"))) on specific functions in a compilation unit. Or better, __attribute__((target("arch=haswell"))) to maybe also set tuning options. (That also enables AVX2 and FMA, which you might not want. And I'm not sure if target attributes do set -mtune=xx). See
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
__attribute__((target())) will prevent them from inlining into functions with other target options, so be careful to use this on functions they will inline into, if the function itself is too small. Use it on a function containing a loop, not a helper function called in a loop.
See also
https://gcc.gnu.org/wiki/FunctionMultiVersioning for using different target options on multiple definitions of the same function name, for compiler supported runtime dispatching. But I don't think there's a portable (to MSVC) way to do that.
See specify simd level of a function that compiler can use for more about doing runtime dispatch on GCC/clang.
With MSVC you don't need anything, although like I said I think it's normally a bad idea to use AVX intrinsics without -arch:AVX, so you might be better off putting those in a separate file. But for AVX vs. AVX2 + FMA, or SSE2 vs. SSE4.2, you're fine without anything.
Just #define AVX2_FUNCTION to the empty string instead of __attribute__((target("avx2,fma")))
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
// apparently ICC doesn't support target attributes, despite supporting GNU C
#define TARGET_HASWELL __attribute__((target("arch=haswell")))
#else
#define TARGET_HASWELL // empty
// maybe warn if __AVX__ isn't defined for functions where this is used?
// if you need to make sure MSVC uses vzeroupper everywhere needed.
#endif
TARGET_HASWELL
void foo_avx(float *__restrict dst, float *__restrict src)
{
for (size_t i = 0 ; i<1024 ; i++) {
__m256 v = _mm256_loadu_ps(src);
...
...
}
}
With GCC and clang, the macro expands to the __attribute__((target)) stuff; with MSVC and ICC it doesn't.
ICC pragma:
https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-optimization-parameter documents a pragma which you'd want to put before AVX functions to make sure vzeroupper is used properly in functions that use _mm256 intrinsics.
#pragma intel optimization_parameter target_arch=AVX
For ICC, you could #define TARGET_AVX as this, and always used it on a line by itself before the function, where you can put an __attribute__ or a pragma. You might also want separate macros for defining vs. declaring functions, if ICC doesn't want this on declarations. And a macro to end a block of AVX functions, if you want to have non-AVX functions after them. (For non-ICC compilers, this would be empty.)
If you compile code with -mavx2 enabled your compiler will (usually) generate so-called "VEX encoded" instructions. In case of _mm_loadu_ps, this will generate vmovups instead of movups, which is almost equivalent, except that the latter will only modify the lower 128 bit of the target register, whereas the former will zero-out everything above the lower 128 bits. However, it will only run on machines which support at least AVX. Details on [v]movups are here.
For other instructions like [v]addps, AVX has the additional advantage of allowing three operands (i.e., the target can be different from both sources), which in some cases can avoid copying registers. E.g.,
_mm_mul_ps(_mm_add_ps(a,b), _mm_sub_ps(a,b));
requires a register copy (movaps) when compiled for SSE, but not when compiled for AVX:
https://godbolt.org/z/YHN5OA
Regarding using AVX-intrinsics but compiling without AVX, compilers either fail (like gcc/clang) or silently generate the corresponding instructions which would then fail on machines without AVX support (see #PeterCordes answer for details on that).
Addendum: If you want to implement different functions depending on the architecture (at compile-time) you can check that using #ifdef __AVX__ or #if defined(__AVX__): https://godbolt.org/z/ZVAo-7
Implementing them in the same compilation unit is difficult, I think. The easiest solutions are to built different shared-libraries or even different binaries and have a small binary which detects the available CPU features and loads the corresponding library/binary. I assume there are related questions on that topic.

Flags mapping of tasking compiler with GCC compiler

I am working on GCC compiler. I have previously worked on tasking compiler where most of the flags are different.
Can someone help me to map below mentioned flags of tasking with GCC.
-t (section information is listed on list file)
--emit-locals = -equ (emit equ symbol to object file)
-wa-ogs (short and long option names to assembler)
--tasking-sfr (compiler includes regcpu.sfr and assembler includes regcpu.def)
--language = -gcc,-volatile,+strings (enable number of GCC extensions, don't optimization across volatile access, disable const check for string lateral)
--switch = auto (which code must be generated for switch statement)
--align = 0 (compiler aligns variables and function to minimum alignment)
--default-near-size = 8 (compiler allocates objects smaller than or equal to threshold memory in __near memory)
--default-a0-size = 0 (compiler will never use __a0 memory)
--default-a1-size = 0 (compiler will never use __a1 memory)
--tradeoff =4 (whether used optimization should optimize for more speed/smaller code size)
--source (merge c source code with assembly code in output file)

How to use GCC LTO with differently optimized object files?

I'm compiling an executable with arm-none-eabi-gcc for a Cortex-M4 based microcontroller. Non-performance-critical code is compiled with -Os (optimized for executable code size) and performance critical parts with another optimalization flags, eg. -Og / -O2 etc.
Is it safe to use -flto in such a build? If so, which optimalization flag should be passed to the linker?
According to the GCC documentation regarding optimise options:
It is recommended that you compile all the files participating in the same link with the same options
Such a statement is rather vague. Nevertheless, when digging into the release notes of GCC 5, there are some additional details:
Command-line optimization and target options are now streamed on a per-function basis and honored by the link-time optimizer. This change makes link-time optimization a more transparent replacement of per-file optimizations. It is now possible to build projects that require different optimization settings for different translation units (such as -ffast-math, -mavx, or -finline).
And also information about which flags are affected by such limitations and which aren't:
Note that this applies only to those command-line options that can be passed to optimize and target attributes. Command-line options affecting global code generation (such as -fpic), warnings (such as -Wodr), optimizations affecting the way static variables are optimized (such as -fcommon), debug output (such as -g), and --param parameters can be applied only to the whole link-time optimization unit. In these cases, it is recommended to consistently use the same options at both compile time and link time.
In your scenario, the optimisation flags -Og, -O2 and -Os can be passed as optimise attributes and do not fall into the cases where the compile time and link time flags ought to be the same. So yes, it should be safe to use -flto in such a build.
Regarding the optimisations flags passed at link time, as stated in the release notes:
Contrary to earlier GCC releases, the optimization and target options
passed on the link command line are ignored.
GCC automatically determines which optimisation level to use, which is the highest level used when compiling the object files. You therefore don't need to pass any of your -O optimisation options to the linker.

Can I make my compiler use fast-math on a per-function basis?

Suppose I have
template <bool UsesFastMath> void foo(float* data, size_t length);
and I want to compile one instantiation with -ffast-math (--use-fast-math for nvcc), and the other instantiation without it.
This can be achieved by instantiating each of the variants in a separate translation unit, and compiling each of them with a different command-line - with and without the switch.
My question is whether it's possible to indicate to popular compilers (*) to apply or not apply -ffast-math for individual functions - so that I'll be able to have my instantiations in the same translation unit.
Notes:
If the answer is "no", bonus points for explaining why not.
This is not the same questions as this one, which is about turning fast-math on and off at runtime. I'm much more modest...
(*) by popular compilers I mean any of: gcc, clang, msvc icc, nvcc (for GPU kernel code) about which you have that information.
In GCC you can declare functions like following:
__attribute__((optimize("-ffast-math")))
double
myfunc(double val)
{
return val / 2;
}
This is GCC-only feature.
See working example here -> https://gcc.gnu.org/ml/gcc/2009-10/msg00385.html
It seems that GCC not verifies optimize() arguments. So typos like "-ffast-match" will be silently ignored.
As of CUDA 7.5 (the latest version I am familiar with, although CUDA 8.0 is currently shipping), nvcc does not support function attributes that allow programmers to apply specific compiler optimizations on a per-function basis.
Since optimization configurations set via command line switches apply to the entire compilation unit, one possible approach is to use as many different compilation units as there are different optimization configurations, as already noted in the question; source code may be shared and #include-ed from a common file.
With nvcc, the command line switch --use_fast_math basically controls three areas of functionality:
Flush-to-zero mode is enabled (that is, denormal support is disabled)
Single-precision reciprocal, division, and square root are switched to approximate versions
Certain standard math functions are replaced by equivalent, lower-precision, intrinsics
You can apply some of these changes with per-operation granularity by using appropriate intrinsics, others by using PTX inline assembly.

ARM Cortex M4 hard fault - floating point

When I run my program, which just calculates a sine wave:
for(i = 0; i < ADS1299_SIGNAL_WINDOW; i++){
TEST[i] = (float32_t)(10.0f * (float32_t)(arm_sin_f32((float32_t)(3.14f * i/ADS1299_SIGNAL_WINDOW))));
}
The compiler generates the following line, which results in a hard fault:
800702a: ed2d 8b04 vpush {d8-d9}
What is happening? For reference, here are my flags for the compiler:
SETTINGS="-g -nostartfiles -mthumb -mthumb-interwork -march=armv7e-m -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -fsingle-precision-constant -fdata-sections -ffunction-sections -O3 -Wl,-T,../STM32F407VG_FLASH.ld"
DECLARE="-DARM_MATH_CM4 -D__FPU_PRESENT=1 -D__FPU_USED"
.... -larm_cortexM4lf_math
The problem is that you're doing both the CPACR enable, and some floating-point operations in the same scope. Because code in main uses floating-point registers, the compiler (being well-behaved and respecting the ABI), will emit code to preserve those registers on entry to main. Before any other code in main executes. Including the write to CAPCR which makes them accessible. Oops.
To avoid that, either enable FP in the CPACR before entry to main in a reset handler (if your toolchain allows), or simply do all FP operations in another function, and ensure main itself doesn't touch any FP registers.
It would also be wise (if you haven't already) to ensure you have a DSB; ISB synchronisation sequence after the CPACR write. Otherwise, you could potentially still get a fault from any stale FP instuctions already in the pipeline.
I think the problem is that the FPU is not enabled. I've got same problem with Nordic Semiconductors SDK examples on Keil 4. In the Keil IDE the check box for FPU enable is marked, but in the SystemInit code it is conditional compilation like this:
void SystemInit(void) {
#if (__FPU_USED == 1)
SCB->CPACR |= (3UL << 20) | (3UL << 22);
__DSB();
__ISB();
#endif
}
But I think the Keil 4 IDE does not set this __FPU_USED to 1 and on the VPUSH instruction I've got a HardFault because the FPU is not enabled.
I think you need to enable the FPU in SystemInit and the problem then will be solved.
If you use the FPU, the stack should be aligned to 8 bytes boundaries. If you are using an RTOS, check the thread stack initialization code. If you are running on pure bare metal, check the startup code for the stack setup.

Resources