gcc differences between -O3 vs -Ofast optimizations

I was just reading through the gcc manual to find out the difference between -O3 and -Ofast.
For -O3
-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags:
-fgcse-after-reload
-fipa-cp-clone
-floop-interchange
-floop-unroll-and-jam
-fpeel-loops
-fpredictive-commoning
-fsplit-paths
-ftree-loop-distribute-patterns
-ftree-loop-distribution
-ftree-loop-vectorize
-ftree-partial-pre
-ftree-slp-vectorize
-funswitch-loops
-fvect-cost-model
-fversion-loops-for-strides
While for -Ofast
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for
all standard-compliant programs. It turns on -ffast-math,
-fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens
Therefore I was wondering if maybe -Ofast is for some reason less safe than -O3, and whether I should therefore stick to -O3 most of the time.
Can you clarify the "practical difference" between them, and whether -Ofast is actually safe?

-Ofast enables optimizations which violate the requirements of the C standard for floating-point semantics. In particular, under -Ofast (via -ffast-math) the compiler will freely reorder floating-point computations, which is prohibited by default because in general (a + b) + c != a + (b + c) for floats.
Whether -Ofast is safe in a particular case depends on the algorithm, but most non-scientific applications usually work fine with it. For example, most game engines are built with -ffast-math.
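A minimal illustration of why that reordering is observable (the specific values are just an example):
#include <cstdio>

int main() {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    std::printf("%g\n", (a + b) + c);  // prints 1: the large terms cancel first
    std::printf("%g\n", a + (b + c));  // prints 0: c is absorbed into b before the cancellation
}
Under -ffast-math the compiler is allowed to rewrite one form into the other, which is why results can change between optimization levels.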

Related

random_number subroutine run time comparison ifort vs gfortran

I wrote this code:
program random
   implicit none
   integer :: i, j, limit
   real(8) :: r, max_val
   real(8) :: start, finish
   max_val = 0.d0
   limit = 10000
   call CPU_TIME(start)
   do i = 1, limit
      do j = 1, limit
         call random_number(r)
         max_val = max(max_val, r)
      end do
   end do
   call CPU_TIME(finish)
   print *, max_val
   print '("Time = ",f6.3," seconds.")', finish - start
end program random
And I compiled it with gfortran 10.1.0 and ifort 19.1.3.304 on CentOS Linux 7 using:
ifort *.f90 -O3 -no-vec -o intel.out
gfortran *.f90 -O3 -fno-tree-vectorize -o gnu.out
and the outputs are:
gnu:
0.9999999155521957
Time = 0.928 seconds.
intel:
0.999999968800691 (same for every run btw)
Time = 1.989 seconds.
When I run each binary a few times, the run time is pretty much the same from run to run.
Why is gfortran faster than ifort and how can I make ifort run as fast as gfortran?
Different compilers ship their own runtime libraries implementing their intrinsic functions and subroutines. These will differ in performance and may also differ in their results. Gfortran uses the GLIBC library for many general intrinsics and the libgfortran library for many Fortran-specific ones. The Intel compiler comes with its own runtime-library suite.
Notably, the Fortran standard gives no guarantees about the quality of the pseudo-random generator used for random_number(). Even if it did, the actual implementation could still differ between compilers, and hence so could the performance.
There are many external pseudo-random number generator libraries available. Some are faster, some slower. Some are more robust, some fail certain randomness tests (sometimes that matters, sometimes it does not). Some give more random bits in a single call, some give fewer. If you need particular properties from the generator with all your compilers, you may be better off with an external library.
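For comparison, a rough C++ analogue of the benchmark above (a sketch with an arbitrary generator and seed, just to show the shape of such a test; the point is that the generator's implementation dominates the runtime):
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>

int main() {
    std::mt19937_64 gen(42);  // a specific, documented generator; Fortran's random_number() guarantees nothing comparable
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double max_val = 0.0;
    const long long limit = 10000LL * 10000LL;
    auto start = std::chrono::steady_clock::now();
    for (long long i = 0; i < limit; ++i)
        max_val = std::max(max_val, dist(gen));
    auto finish = std::chrono::steady_clock::now();
    std::printf("%.16f\n", max_val);
    std::printf("Time = %.3f seconds\n",
                std::chrono::duration<double>(finish - start).count());
}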

Is there a good reason why GCC would generate jump to jump just over one cheap instruction?

I was benchmarking some code that counts matches in a loop.
g++ was used with -O2, and I noticed that it has some performance problems when a condition is true in 50% of the cases. I assumed that might mean the code does unnecessary jumps (since clang produces faster code, so it is not some fundamental limitation).
What I find funny in this asm output is that the code jumps over one simple add.
=> 0x42b46b <benchmark_many_ints()+1659>: movslq (%rdx),%rax
0x42b46e <benchmark_many_ints()+1662>: mov %rax,%rcx
0x42b471 <benchmark_many_ints()+1665>: imul %r9,%rax
0x42b475 <benchmark_many_ints()+1669>: shr $0xe,%rax
0x42b479 <benchmark_many_ints()+1673>: and $0x1ff,%eax
0x42b47e <benchmark_many_ints()+1678>: cmp (%r10,%rax,4),%ecx
0x42b482 <benchmark_many_ints()+1682>: jne 0x42b488 <benchmark_many_ints()+1688>
0x42b484 <benchmark_many_ints()+1684>: add $0x1,%rbx
0x42b488 <benchmark_many_ints()+1688>: add $0x4,%rdx
0x42b48c <benchmark_many_ints()+1692>: cmp %rdx,%r8
0x42b48f <benchmark_many_ints()+1695>: jne 0x42b46b <benchmark_many_ints()+1659>
Note that my question is not how to fix my code; I am just asking if there is a reason why a good compiler at -O2 would generate a jne instruction to jump over 1 cheap instruction.
I ask because, from what I understand, one could "simply" take the comparison result and use it to increment the counter (rbx in my example) by 0 or 1 without any jumps.
edit: source:
https://godbolt.org/z/v0Iiv4
The relevant part of the source (from a Godbolt link in a comment which you should really edit into your question) is:
const auto cnt = std::count_if(lookups.begin(), lookups.end(),
    [](const auto& val) { return buckets[hash_val(val) % 16] == val; });
I didn't check the libstdc++ headers to see if count_if is implemented with an if() { count++; }, or if it uses a ternary to encourage branchless code. Probably a conditional. (The compiler can choose either, but a ternary is more likely to compile to a branchless cmovcc or setcc.)
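For illustration, the two shapes look roughly like this (a sketch, not the actual libstdc++ source):
// Branchy shape: tends to compile to a conditional jump like the asm above.
long count_branchy(const int* p, const int* end, int key) {
    long cnt = 0;
    for (; p != end; ++p)
        if (*p == key) ++cnt;
    return cnt;
}

// Ternary shape: more likely to compile to a branchless setcc/cmov + add.
long count_ternary(const int* p, const int* end, int key) {
    long cnt = 0;
    for (; p != end; ++p)
        cnt += (*p == key) ? 1 : 0;
    return cnt;
}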
It looks like gcc overestimated the cost of branchless for this code with generic tuning. -mtune=skylake (implied by -march=skylake) gives us branchless code for this regardless of -O2 vs. -O3, or -fno-tree-vectorize vs. -ftree-vectorize. (On the Godbolt compiler explorer, I also put the count in a separate function that counts a vector<int>&, so we don't have to wade through the timing and cout code-gen in main.)
branchy code: gcc8.2 -O2 or -O3, and -O2/-O3 with -march=haswell or broadwell
branchless code: gcc8.2 -O2/-O3 -march=skylake.
That's weird. The branchless code it emits has the same cost on Broadwell vs. Skylake. I wondered if Skylake vs. Haswell was favouring branchless because of cheaper cmov. GCC's internal cost model isn't always in terms of x86 instructions when it's optimizing in the middle-end (in GIMPLE, an architecture-neutral representation). It doesn't yet know what x86 instructions would actually be used for a branchless sequence. So maybe a conditional-select operation is involved, and gcc models it as more expensive on Haswell, where cmov is 2 uops? But I tested -march=broadwell and still got branchy code. Hopefully we can rule that out, assuming gcc's cost model knows that Broadwell (not Skylake) was the first Intel P6/SnB-family uarch to have single-uop cmov, adc, and sbb (3-input integer ops).
I don't know what else about gcc's Skylake tuning option makes it favour branchless code for this loop. Gather is efficient on Skylake, but gcc is auto-vectorizing (with vpgatherqd xmm) even with -march=haswell, where it doesn't look like a win because gather is expensive and requires 32x64 => 64-bit SIMD multiplies using 2x vpmuludq per input vector. Maybe worth it with SKL, but I doubt HSW. It's also probably a missed optimization not to pack back down to dword elements to gather twice as many elements with nearly the same throughput with vpgatherdd.
I did rule out the function being less optimized because it was called main (and marked cold). It's generally recommended not to put your microbenchmarks in main: compilers at least used to optimize main differently (e.g. for code-size instead of just speed).
Clang does make it branchless even with just -O2.
When compilers have to decide between branchless and branchy, they have heuristics that guess which will be better. If they think the condition is highly predictable (e.g. probably mostly not-taken), that leans in favour of branchy.
In this case, the heuristic could have decided that out of all 2^32 possible values for an int, finding exactly the value you're looking for is rare. The == may have fooled gcc into thinking it would be predictable.
Branchy can be better sometimes, depending on the loop, because it can break a data dependency. See gcc optimization flag -O3 makes code slower than -O2 for a case where it was very predictable, and the -O3 branchless code-gen was slower.
-O3 at least used to be more aggressive at if-conversion of conditionals into branchless sequences like cmp ; lea 1(%rbx), %rcx; cmove %rcx, %rbx, or in this case more likely xor-zero / cmp/ sete / add. (Actually gcc -march=skylake uses sete / movzx, which is pretty much strictly worse.)
Without any runtime profiling / instrumentation data, these guesses can easily be wrong. Stuff like this is where Profile-Guided Optimization shines. Compile with -fprofile-generate, run it, then compile with -fprofile-use, and you'll probably get branchless code.
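The workflow looks something like this (the file and input names are placeholders):
g++ -O2 -fprofile-generate count.cpp -o count
./count representative_input          # writes .gcda profile data
g++ -O2 -fprofile-use count.cpp -o count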
BTW, -O3 is generally recommended these days. Is optimisation level -O3 dangerous in g++?. It does not enable -funroll-loops by default, so it only bloats code when it auto-vectorizes (especially with very large fully-unrolled scalar prologue/epilogue around a tiny SIMD loop that bottlenecks on loop overhead. /facepalm.)

OpenMP code compiled with different flags gives different results

I developed a FORTRAN code which I compiled with the following command:
ifort -g -O0 -openmp -openmp_report -threads -ipo
When running the code built with the above flags, I get the same result to 15 digits in serial and parallel (OpenMP) runs. I have also checked with Intel Inspector 2013, and I do not have any data race condition.
However, when I change the optimization flag to -O2 or -O3, I get a small error which grows with time (it is a simulation which integrates over time), starting at the level of the 15th digit and growing towards larger numbers.
The results with -O2 and -O3 also differ from each other (up to the fifth digit after the decimal point).
Can anyone advise how I can, in general, improve my code so that it runs with the same precision (double precision) as with the -O0 flag?
Thanks in advance,
Jack.
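A hedged note (an assumption about Intel's tooling, not an answer recorded in this thread): ifort's -fp-model precise switch requests value-safe floating-point semantics and usually restores -O0-like reproducibility at higher optimization levels, e.g.:
ifort -g -O2 -qopenmp -fp-model precise -threads -ipo
(-qopenmp is the newer spelling of -openmp; how much of the drift this removes depends on the code.)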

mfpmath option to MinGW (or even gcc)

Do the -march=corei7-avx -mtune=corei7-avx or -march=corei7 -mtune=corei7 -mavx command-line options to MinGW, together with the -mfpmath=sse option (or even with -mfpmath=both), enable the use of AVX instructions for math routines? Note that --with-fpmath=avx from here does not work (it is an "unrecognized option" for recent builds of MinGW).
AVX is enabled by either -march=corei7-avx or -mavx. The -mtune option is neither necessary nor sufficient to enable AVX.
An -mfpmath=avx does not make any sense, because with this switch you control the generation of scalar floating-point code. It makes no difference whether you use only one float of a 4-float vector register or only one element of an 8-float vector register. If you have AVX enabled (via -mavx or a suitable -march), scalar floating-point instructions will use the VEX encoding anyway, which saves a few mov instructions.
Note that on x86_64, -mfpmath defaults to sse, so using this switch is usually not necessary, and it can even be harmful if you don't know exactly what you are doing.
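To see the VEX point on scalar code, a trivial function is enough (the function is hypothetical, for illustration):
float add(float a, float b) { return a + b; }
Compiled with -O2 this becomes the SSE instruction addss; with -O2 -mavx it becomes the VEX-encoded vaddss, whose three-operand form is what saves the occasional mov.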

How to vectorize with gcc?

The v4 series of the gcc compiler can automatically vectorize loops using the SIMD processor on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips. How is this done?
The original page offers details on getting gcc to automatically vectorize
loops, including a few examples:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
While the examples are great, it turns out the syntax for those options has changed a bit in recent GCC; see now:
https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info
In summary, the following options will work for x86 chips with SSE2,
giving a log of loops that have been vectorized:
gcc -O2 -ftree-vectorize -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5
Note that -msse is also a possibility, but it will only vectorize loops
using floats, not doubles or ints. (SSE2 is baseline for x86-64. For 32-bit code use -mfpmath=sse as well. That's the default for 64-bit but not 32-bit.)
Modern versions of GCC enable -ftree-vectorize at -O3, so just use that in GCC 4.x and later:
gcc -O3 -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5
(Clang enables auto-vectorization at -O2. ICC defaults to optimization enabled + fast-math.)
Most of the following was written by Peter Cordes, who could have just written a new answer. Over time, as compilers change, options and compiler output will change. I am not entirely sure whether it is worth tracking it in great detail here. Comments? -- Author
To also use instruction set extensions supported by the hardware you're compiling on, and tune for it, use -march=native.
Reduction loops (like sum of an array) will need OpenMP or -ffast-math to treat FP math as associative and vectorize. Example on the Godbolt compiler explorer with -O3 -march=native -ffast-math including a reduction (array sum) which is scalar without -ffast-math. (Well, GCC8 and later do a SIMD load and then unpack it to scalar elements, which is pointless vs. simple unrolling. The loop bottlenecks on the latency of the one addss dependency chain.)
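A minimal stand-alone version of such a reduction (a hypothetical function, handy for experimenting with these flags):
// Sums a float array. Without -ffast-math GCC must preserve the serial
// addition order, so the loop stays scalar; with -ffast-math it may
// reassociate and vectorize.
float sum(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}
Compiling with g++ -O3 -march=native -ffast-math -fopt-info-vec reports the loop as vectorized; drop -ffast-math and the report goes away.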
Sometimes you don't need -ffast-math, just -fno-math-errno can help gcc inline math functions and vectorize something involving sqrt and/or rint / nearbyint.
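For example, a sketch of the sqrt case:
#include <cmath>

// With -fno-math-errno GCC can inline sqrt to a hardware sqrt instruction
// and vectorize the loop; with errno semantics it has to keep a path that
// calls the libm function, which blocks vectorization.
void roots(float* out, const float* in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = std::sqrt(in[i]);
}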
Other useful options include -flto (link-time optimization for cross-file inlining, constant propagation, etc) and / or profile-guided optimization with -fprofile-generate / test run(s) with realistic input(s) /-fprofile-use. PGO enables loop unrolling for "hot" loops; in modern GCC that's off by default even at -O3.
There is a GIMPLE (an intermediate representation of GCC) pass called pass_vectorize. This pass performs auto-vectorization at the GIMPLE level.
To enable auto-vectorization for a target (as of GCC 4.4.0), the following steps are needed:
Specify the number of units in a vector for the target architecture. This is done by defining the macro UNITS_PER_SIMD_WORD.
Define the possible vector modes in a separate file, usually <target>-modes.def. This file has to reside in the same directory as the other machine-description files (as determined by the configuration script; if you can change the script, you can place the file in whatever directory you want).
Specify which modes are to be considered for vectorization on the target: for example, four words constitute a vector, or eight half-words constitute a vector, or two double-words constitute a vector. These details go in the <target>-modes.def file. For example:
VECTOR_MODES (INT, 8);     /*       V8QI V4HI V2SI */
VECTOR_MODES (INT, 16);    /* V16QI V8HI V4SI V2DI */
VECTOR_MODES (FLOAT, 8);   /*            V4HF V2SF */
Build the port. Vectorization can be enabled using the command line options -O2 -ftree-vectorize.
