OpenMP code compiled with different optimization flags gives different results

I developed a Fortran code which I compiled with the following command:
ifort -g -O0 -openmp -openmp_report -threads -ipo
When running this code with the above flags, the serial and the parallel (OpenMP) runs agree to 15 digits. I have also checked with Intel Inspector 2013, and I do not have any data race.
However, when I change the optimization flag to -O2 or -O3, I get a small error which grows with time (it is a simulation which integrates over time), starting at the order of 10^-15 and growing from there.
The -O2 and -O3 results also differ from each other (by up to the fifth digit after the decimal point).
Can anyone advise how I can, in general, improve my code so that it runs with the same (double) precision as with the -O0 flag?
Thanks in advance,
Jack.
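A hedged starting point, assuming Intel's classic ifort as in the question: its value-safe floating-point model trades some speed for consistent results across optimization levels, for example (mysim.f90 is a placeholder file name):
ifort -g -O2 -openmp -fp-model precise -fp-model source mysim.f90
-fp-model precise disables value-changing optimizations such as reassociation, which is the usual source of -O2/-O3 drift in time-integration codes.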

Related

random_number subroutine run time comparison ifort vs gfortran

I wrote this code:
program random
  implicit none
  integer :: i, j, limit
  real(8) :: r, max_val
  real(8) :: start, finish

  max_val = 0.d0
  limit = 10000

  call CPU_TIME(start)
  do i = 1, limit
     do j = 1, limit
        call random_number(r)
        max_val = max(max_val, r)
     end do
  end do
  call CPU_TIME(finish)

  print *, max_val
  print '("Time = ",f6.3," seconds.")', finish - start
end program random
And I compiled it with gfortran 10.1.0 and ifort 19.1.3.304 on CentOS Linux 7 using:
ifort *.f90 -O3 -no-vec -o intel.out
gfortran *.f90 -O3 -fno-tree-vectorize -o gnu.out
and the outputs are:
gnu:
0.9999999155521957
Time = 0.928 seconds.
intel:
0.999999968800691 (same for every run btw)
Time = 1.989 seconds.
When I run each a few times, the run times are consistent.
Why is gfortran faster than ifort and how can I make ifort run as fast as gfortran?
Different compilers ship their own runtime libraries with their own implementations of the intrinsic functions and subroutines. These will differ in performance and may also differ in their results. Gfortran uses the GLIBC library for many general intrinsics and the libgfortran library for many Fortran-specific ones. The Intel compiler comes with its own runtime-library suite.
Notably, the Fortran standard gives no guarantees about the quality of the pseudo-random generator used for random_number(). Even if it did, the actual implementations could still differ, and hence so could the actual performance.
There are many external pseudo-random number generator libraries available. Some are faster, some slower. Some are more robust, some fail certain randomness tests (sometimes that matters, sometimes it does not). Some give more random bits in a single call, some fewer. If you need particular properties from the generator under all your compilers, you may be better off with an external library.
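As an illustration only, here is a minimal sketch of Marsaglia's xorshift64 generator, which behaves identically under any compiler (the module and function names are made up for this example):
module xorshift_mod
  use iso_fortran_env, only: int64, real64
  implicit none
  integer(int64) :: state = 88172645463325252_int64   ! any nonzero seed
contains
  function xorshift64() result(r)
    real(real64) :: r
    ! Marsaglia xorshift64: x ^= x<<13; x ^= x>>7; x ^= x<<17
    state = ieor(state, ishft(state, 13))
    state = ieor(state, ishft(state, -7))
    state = ieor(state, ishft(state, 17))
    ! use the top 53 bits to form a real in [0, 1)
    r = real(ishft(state, -11), real64) / 9007199254740992.0_real64
  end function xorshift64
end module xorshift_mod
This generator is fast but fails some stringent randomness tests, so whether it is an acceptable replacement depends on the application.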

gcc differences between -O3 vs -Ofast optimizations

I was just reading through the gcc manual to find out the difference between -O3 and -Ofast.
For -O3
-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags:
-fgcse-after-reload
-fipa-cp-clone
-floop-interchange
-floop-unroll-and-jam
-fpeel-loops
-fpredictive-commoning
-fsplit-paths
-ftree-loop-distribute-patterns
-ftree-loop-distribution
-ftree-loop-vectorize
-ftree-partial-pre
-ftree-slp-vectorize
-funswitch-loops
-fvect-cost-model
-fversion-loops-for-strides
While for -Ofast:
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens.
Therefore I was wondering if -Ofast is for some reason less safe than -O3, and whether I should stick to -O3 most of the time.
Can you clarify the practical difference between them, and whether -Ofast is actually safe?
-Ofast enables optimizations which violate the requirements of the C standard for floating-point semantics. In particular, under -Ofast (i.e. with -ffast-math) the compiler will freely reorder floating-point computations, which is prohibited by default because in general a + (b + c) != (a + b) + c for floats (floating-point addition is not associative).
Whether -Ofast is safe in a particular case depends on the algorithm, but usually most non-scientific applications work fine with it. For example, most game engines are built with -ffast-math.
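As a minimal illustration of why reassociation changes results (a sketch; the program and variable names are made up), cancellation depends on evaluation order:
program reassoc
  implicit none
  real(8) :: x, y, z
  x = 1.0d20
  y = -1.0d20
  z = 1.0d0
  print *, (x + y) + z   ! 1.0: the huge terms cancel first
  print *, x + (y + z)   ! 0.0: z is absorbed into y before the cancellation
end program reassoc
Under -ffast-math the compiler may rewrite one form into the other, which is exactly the kind of change -Ofast licenses.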

Intrinsic dot_product slower than a*a+b*b+c*c?

Recently I tested the runtime difference between explicit summation and the intrinsic dot_product function for calculating a dot product. Surprisingly, the naïve explicit version was faster.
program test
  implicit none
  real*8, dimension(3) :: idmat
  real*8 :: dummy(3)
  integer :: i
  integer(8) :: j          ! 8-byte counter: 10**10 overflows a default 4-byte integer

  idmat = 0
  dummy = 0
  do i = 1, 3
     idmat(i) = 1
  end do
  do j = 1, 10_8**10
     ! dummy(mod(j,3_8)+1) = dot_product(idmat, idmat)
     dummy(mod(j,3_8)+1) = idmat(1)*idmat(1) + idmat(2)*idmat(2) + idmat(3)*idmat(3)
  end do
  print *, dummy
end program test
Here is what confuses me:
1. No -O3 Optimization
If I use: gfortran test.f90 -o test ; time ./test
I find a runtime of 6.297 s using the function dot_product (commented out above) and 4.486 s using the manual explicit form.
How does that make sense?
2. Including -O3 Optimization
If I use: gfortran test.f90 -O3 -o test ; time ./test
I find runtimes of 1.808 s and 1.803 s respectively. So both are actually the same speed.
3. What I actually expect
...is the intrinsic function to be faster, as it could:
compute the 3 products in parallel
add the 3 products
where the explicit form has to sequentially:
compute product 1
compute product 2
compute product 3
add the 3 products
Do I have to create a new parallel dot_product function to be faster? Or is there an additional option for the gfortran compiler which I don't know?
Please note: I read across the internet about SIMD, auto-vectorization and parallelisation in modern Fortran. Although I learned something, my question wasn't answered anywhere.
It makes no sense to even look at the non-optimized numbers. The optimized numbers are the same, so everything is fine.
"...is the intrinsic function to be faster, as it could: compute the 3 products in parallel"
There will be nothing done in parallel unless you enable specific parallel optimizations. These optimizations will be as easy to do for the loop as for the intrinsic and often even much easier for the loop.
Well, at least for the normal sense of parallel using threads or similar. What can be done in parallel is to use the vector instructions and to schedule the instructions to overlap in the CPU pipeline. That can be done by the optimizing compiler and is likely done for both versions when you use -O3. You should not expect this to happen when no optimizations are enabled.
The use of the "parallel" instructions (SIMD) can sometimes be improved by using compiler directives like !$omp simd or !DIR$ VECTOR.
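For instance, a minimal self-contained sketch of a directive-assisted SIMD reduction (assuming gfortran with -fopenmp or -fopenmp-simd; the program name is made up):
program dotsimd
  implicit none
  real(8) :: idmat(3), s
  integer :: i
  idmat = 1.d0
  s = 0.d0
  !$omp simd reduction(+:s)   ! lets the compiler vectorize the sum
  do i = 1, 3
     s = s + idmat(i)*idmat(i)
  end do
  print *, s   ! prints 3.0
end program dotsimd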
"Do I have to create a new parallel dot_product function to be faster?"
Yes, normally you do. For example using OpenMP. Or you could:
"Or is there an additional option for the gfortran compiler which i don't know?"
Yes, the automatic parallelization https://gcc.gnu.org/wiki/AutoParInGCC , for example -floop-parallelize-all -ftree-parallelize-loops=4
Note that it will not make those individual multiplications in parallel, it will make the i loop parallel.
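For example (a hedged compile line using the flags above; the source file name is a placeholder):
gfortran test.f90 -O3 -floop-parallelize-all -ftree-parallelize-loops=4 -o test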

compilation option FFLAGS in MAKEFILE

What do these compilation options mean?
export FFLAGS = -O3 -r8 -i4 -I${PWD}/headers -nofor_main
What does -r8 mean? What does -i4 mean? Where can I find the help file? Can anybody explain the FFLAGS compilation options? I'd really appreciate it.
You apparently already know that FFLAGS is a list of options for a FORTRAN compiler.
-r8 sets the size of certain data types to 8 bytes, depending on architecture. It is approximately the same as setting double precision.
-i4 sets the default integer size to 4 bytes.
Do you need more?
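A quick way to see the effect (a sketch, assuming an ifort-style compiler that accepts -r8/-i4; the program name is made up):
program kinds
  implicit none
  real :: x
  integer :: n
  ! with -r8 the default real kind is reported as 8 instead of 4
  print *, 'default real kind:', kind(x), ' default integer kind:', kind(n)
end program kinds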
EDIT:
There are a lot of different compilers, and versions of compilers. The default for GNU Make is f77, and from the UNIX man page:
-r8
Double the size of default REAL, DOUBLE, INTEGER, and COMPLEX data.
NOTE: This option is now considered obsolete and may be removed in future releases. Use the more flexible -xtypemap option instead.
This option sets the default size for REAL, INTEGER, and LOGICAL to 8, and for COMPLEX to 16. For INTEGER and LOGICAL the compiler allocates 8 bytes, but does 4-byte arithmetic. For actual 8-byte arithmetic, see -dbl.

How to vectorize with gcc?

The v4 series of the gcc compiler can automatically vectorize loops using the SIMD processor on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips. How is this done?
The original page offers details on getting gcc to automatically vectorize
loops, including a few examples:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
While the examples are great, the syntax for invoking those options has changed a bit in more recent GCC; see now:
https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info
In summary, the following options will work for x86 chips with SSE2,
giving a log of loops that have been vectorized:
gcc -O2 -ftree-vectorize -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5
Note that -msse is also a possibility, but it will only vectorize loops
using floats, not doubles or ints. (SSE2 is baseline for x86-64, and -mfpmath=sse is the default for 64-bit code but not for 32-bit, so for 32-bit code pass it explicitly.)
Modern versions of GCC enable -ftree-vectorize as part of -O3 (and GCC 12 and later enable it even at -O2), so in GCC 4.x and later just use:
gcc -O3 -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5
(Clang enables auto-vectorization at -O2. ICC defaults to optimization enabled + fast-math.)
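With current GCC the old -ftree-vectorizer-verbose=N reporting is deprecated in favor of -fopt-info (see the link above); a hedged modern equivalent of the command above would be:
gcc -O3 -msse2 -mfpmath=sse -fopt-info-vec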
Most of the following was contributed by Peter Cordes. Over time, as compilers change, the relevant options and compiler output will change too.
To also use instruction set extensions supported by the hardware you're compiling on, and tune for it, use -march=native.
Reduction loops (like sum of an array) will need OpenMP or -ffast-math to treat FP math as associative and vectorize. Example on the Godbolt compiler explorer with -O3 -march=native -ffast-math including a reduction (array sum) which is scalar without -ffast-math. (Well, GCC8 and later do a SIMD load and then unpack it to scalar elements, which is pointless vs. simple unrolling. The loop bottlenecks on the latency of the one addss dependency chain.)
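For instance, a minimal sketch of such a reduction in Fortran (the program name is made up): compiled with gfortran -O3 -fopt-info-vec the sum stays scalar, but adding -ffast-math allows reassociation and the loop vectorizes.
program sumvec
  implicit none
  real(8) :: a(100000), s
  integer :: i
  call random_number(a)
  s = 0.d0
  do i = 1, size(a)
     s = s + a(i)      ! serial dependency chain unless FP reassociation is allowed
  end do
  print *, s
end program sumvec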
Sometimes you don't need -ffast-math, just -fno-math-errno can help gcc inline math functions and vectorize something involving sqrt and/or rint / nearbyint.
Other useful options include -flto (link-time optimization for cross-file inlining, constant propagation, etc) and / or profile-guided optimization with -fprofile-generate / test run(s) with realistic input(s) /-fprofile-use. PGO enables loop unrolling for "hot" loops; in modern GCC that's off by default even at -O3.
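A typical (hedged) PGO sequence, with prog.c and the training input as placeholders:
gcc -O3 -march=native -fprofile-generate prog.c -o prog
./prog < realistic_input      # training run writes .gcda profile data
gcc -O3 -march=native -fprofile-use prog.c -o prog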
There is a GIMPLE (an intermediate representation of GCC) pass, pass_vectorize, which performs auto-vectorization at the GIMPLE level.
To enable auto-vectorization for a new target port (as of GCC 4.4.0), the following steps are needed:
Mention the number of words in a vector for the target architecture. This is done by defining the macro UNITS_PER_SIMD_WORD.
The possible vector modes need to be defined in a separate file, usually <target>-modes.def. This file has to reside in the directory containing the other machine-description files (as determined by the configuration script; if you can change the script, you can place the file in whatever directory you want).
The modes to be considered for vectorization depend on the target architecture: for example, four words may constitute a vector, or eight half-words, or two double-words. The details need to be specified in the <target>-modes.def file. For example:
VECTOR_MODES (INT, 8);     /*       V8QI V4HI V2SI */
VECTOR_MODES (INT, 16);    /* V16QI V8HI V4SI V2DI */
VECTOR_MODES (FLOAT, 8);   /*            V4HF V2SF */
Build the port. Vectorization can be enabled using the command line options -O2 -ftree-vectorize.
