OpenMP: different answer using ifort than gfortran

I have recently encountered a compiler-specific OpenMP issue. I have written a code that works correctly with OpenMP and gfortran. When I compile the same code with ifort and OpenMP, the OpenMP section segfaults. I have cut most of the fat away and written this snippet, which reproduces the error:
!$omp parallel do schedule(dynamic) reduction(+:energy) default(private) shared(r,tr)
do i=0,listvar%ncellT-1
   c1s = tr(i)%start
   c1e = tr(i)%end
   do j = c1s,c1e
      x1 = r(j)%x; y1 = r(j)%y; z1 = r(j)%z
      energy = energy + x1
   enddo
enddo
!$omp end parallel do
All this loop does is access pointers to elements in r from the array tr, and then loop between those elements. However, ifort is for some reason assigning nonsense values to c1s and c1e causing the code to segfault when I loop over r. Again, the code works with gfortran. Does anyone have any idea why this may be?

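Found the error. Because of default(private), the variable listvar is private as well, so each thread gets its own uninitialized copy and the loop bound listvar%ncellT is undefined. For some reason, gfortran was kind enough to initialize listvar for all threads! ifort, however, requires firstprivate(listvar) (declaring listvar as shared would also work). I am really surprised gfortran never threw an error here. I have run the code using gfortran over a wide range of conditions, and it worked every time.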
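The same data-sharing rule exists in OpenMP's C/C++ binding, so the fix can be sketched in C++ (the other language used on this page). This is only an illustration, not the original code: the struct ListVar and the function sum_first_n are hypothetical stand-ins for listvar and the loop in the question.

```cpp
#include <vector>

// Hypothetical stand-in for the derived type `listvar` in the question.
struct ListVar {
    int ncellT;
};

// Sums the first lv.ncellT entries of r.
// firstprivate(lv) gives every thread its own copy of lv, initialized
// from the value lv had before the parallel region. A plain private(lv)
// would leave each copy uninitialized, which is the bug in the question.
double sum_first_n(const std::vector<double>& r, ListVar lv) {
    double energy = 0.0;
    #pragma omp parallel for schedule(dynamic) reduction(+:energy) firstprivate(lv)
    for (int i = 0; i < lv.ncellT; ++i) {
        energy += r[i];
    }
    return energy;
}
```

If the file is built without OpenMP support, the pragma is ignored and the loop simply runs serially with the same result.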

Related

random_number subroutine run time comparison ifort vs gfortran

I wrote this code:
program random
   implicit none
   integer :: i, j, limit
   real(8) :: r, max_val
   real(8) :: start, finish
   max_val = 0.d0
   limit = 10000
   call CPU_TIME(start)
   do i=1, limit
      do j=1, limit
         call random_number(r)
         max_val = max(max_val, r)
      end do
   end do
   call CPU_TIME(finish)
   print *, max_val
   print '("Time = ",f6.3," seconds.")',finish-start
end program random
And I compiled it with gfortran 10.1.0 and ifort 19.1.3.304 on CentOS Linux 7 using:
ifort *.f90 -O3 -no-vec -o intel.out
gfortran *.f90 -O3 -fno-tree-vectorize -o gnu.out
and the outputs are:
gnu:
0.9999999155521957
Time = 0.928 seconds.
intel:
0.999999968800691 (same for every run btw)
Time = 1.989 seconds.
When I run a few times, the run time of each is pretty much the same.
Why is gfortran faster than ifort and how can I make ifort run as fast as gfortran?
Different compilers have their own libraries with implementations of their intrinsic functions and subroutines. They will differ in performance and may also differ in their results. Gfortran uses the GLIBC library for many general intrinsics and the libgfortran library for many Fortran-specific ones. The Intel compiler comes with its own runtime-library suite.
Notably, the Fortran standard gives no guarantees about the quality of the pseudo-random generator used for random_number(). Even if it did, the actual implementation in code could always differ and hence the actual performance.
There are many external pseudo-random number generator libraries available. Some are faster, some slower. Some are more robust, some fail certain randomness tests (sometimes that matters, sometimes it does not). Some give more random bits in a single call, some fewer. If you need some particular properties for the generator in all your compilers, you might be better off with an external library.

How to efficiently vectorize polynomial computation with condition (roofline model)

I want to apply a polynomial of small degree (2-5) to a vector whose length can be between 50 and 3000, and do this as efficiently as possible.
Example: take the function (1+x^2)^3 when x>3 and 0 when x<=3.
Such a function would be executed 100k times for vectors of double elements. The size of each vector can be anything between 50 and 3000.
One idea would be to use Eigen:
Eigen::ArrayXd v;
then simply apply a functor:
v.unaryExpr([&](double x) {return x>3 ? std::pow((1+x*x), 3.00) : 0.00;});
Trying with both GCC 9 and GCC 10, I saw that this loop is not being vectorized. I did vectorize it manually, only to see that the gain is much smaller than I expected (1.5x). I also replaced the conditioning with logical AND instructions, basically executing both branches and zeroing out the result when x<=3. I presume that the gain came mostly from the lack of branch misprediction.
Some considerations
There are multiple factors at play. First of all, there are RAW dependencies in my code (using intrinsics). I am not sure how this affects the computation. I wrote my code with AVX2 so I was expecting a 4x gain. I presume that this plays a role, but I cannot be sure, as the CPU has out-of-order-processing. Another problem is that I am unsure if the performance of the loop I am trying to write is bound by the memory bandwidth.
Question
How can I determine if either the memory bandwidth or pipeline hazards are affecting the implementation of this loop? Where can I learn techniques to better vectorize this loop? Are there good tools for this in either MSVC or Linux? I am using an AMD CPU as opposed to Intel.
You can fix the GCC missed optimization with -fno-trapping-math, which should really be the default because -ftrapping-math doesn't even fully work. It auto-vectorizes just fine with that option: https://godbolt.org/z/zfKjjq.
#include <stdlib.h>
void foo(double *arr, size_t n) {
    for (size_t i=0 ; i<n ; i++){
        double &tmp = arr[i];
        double sqrp1 = 1.0 + tmp*tmp;
        tmp = tmp>3 ? sqrp1*sqrp1*sqrp1 : 0;
    }
}
It's avoiding the multiplies in one side of the ternary because they could raise FP exceptions that the C++ abstract machine wouldn't.
You'd hope that writing it with the cubing outside a ternary should let GCC auto-vectorize, because none of the FP math operations are conditional in the source. But it doesn't actually help: https://godbolt.org/z/c7Ms9G. GCC's default -ftrapping-math still decides to branch on the input to avoid all the FP computation, potentially not raising an overflow (to infinity) exception that the C++ abstract machine would have raised, or an invalid exception if the input was NaN. This is the kind of thing I meant about -ftrapping-math not working. (related: How to force GCC to assume that a floating-point expression is non-negative?)
Clang also has no problem: https://godbolt.org/z/KvM9fh
I'd suggest using clang -O3 -march=native -ffp-contract=fast to get FMAs across statements when FMA is available.
(In this case, -ffp-contract=on is sufficient to contract 1.0 + tmp*tmp within that one expression, but not across statements if you need to avoid that for Kahan summation for example. The clang default is apparently -ffp-contract=off, giving separate mulpd and addpd)
Of course you'll want to avoid std::pow with a small integer exponent. Compilers might not optimize that into just 2 multiplies and instead call a full pow function.
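For reference, a version with the cubing hoisted out of the ternary and std::pow replaced by two multiplies might look like the sketch below. It is not copied from the Godbolt links; the function name apply_poly and the use of std::vector are assumptions made for illustration. As noted above, GCC still needs -fno-trapping-math to auto-vectorize it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: applies f(x) = (1 + x^2)^3 for x > 3, else 0, in place.
void apply_poly(std::vector<double>& v) {
    double* p = v.data();
    const std::size_t n = v.size();
    for (std::size_t i = 0; i < n; ++i) {
        const double x = p[i];
        const double s = 1.0 + x * x;   // may contract into one FMA
        const double cube = s * s * s;  // two multiplies instead of std::pow
        p[i] = (x > 3.0) ? cube : 0.0;  // both sides are cheap, so this can become a select
    }
}
```

All the arithmetic is unconditional here and only the final store depends on the comparison, which is the shape a vectorizer can turn into a compare plus blend or mask.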

Are these array initialization methods in Fortran equally fast?

In Fortran you can initialize whole array just by typing
array = 1.0
But it is also possible to type
do i = first_element, last_element
   array(i) = 1.0
enddo
Is the first method faster or is there no performance difference?
Any sensible contemporary Fortran compiler would optimise the loop to the same assembly as the short-form initialisation.
(Optimising Fortran compilers have been around since the 1960s).
If you are in any doubt about this, check the generated assembly.

Intrinsic dot_product slower than a*a+b*b+c*c?

Recently I tested the runtime difference of explicit summation and intrinsic functions to calculate a dot product. Surprisingly the naïve explicit writing was faster.
program test
   real*8 , dimension(3) :: idmat
   real*8 :: dummy(3)
   idmat=0
   dummy=0
   do i=1,3
      idmat(i)=1
   enddo
   do j=1,10**10
      ! dummy(mod(j,3)+1)=dot_product(idmat,idmat)
      dummy(mod(j,3)+1)=idmat(1)*idmat(1)+idmat(2)*idmat(2)+idmat(3)*idmat(3)
   enddo
   print*, dummy
end program test
Here is what confuses me:
1. No -O3 Optimization
If I use: gfortran test.f90 -o test ; time ./test
I find a runtime of 6.297 s using the function dot_product (commented out above) and 4.486 s using the manual explicit form.
How does that make sense?
2. Including -O3 Optimization
If I use: gfortran test.f90 -O3 -o test ; time ./test
I find runtimes of 1.808 s and 1.803 s respectively. So both are actually the same speed.
3. What I actually expect
...is the intrinsic function to be faster, as it could:
compute the 3 products in parallel
add the 3 products
where the explicit form has to sequentially:
compute product 1
compute product 2
compute product 3
add the 3 products
Do I have to create a new parallel dot_product function to be faster? Or is there an additional option for the gfortran compiler which I don't know?
Please note: I read across the internet about SIMD, auto-vectorization and parallelisation in modern Fortran. Although I learned something my question wasn't answered anywhere.
It makes no sense even looking at the non-optimized numbers. The optimized numbers are the same, so everything is fine.
"...is the intrinsic function to be faster, as it could: compute the 3 products in parallel"
There will be nothing done in parallel unless you enable specific parallel optimizations. These optimizations will be as easy to do for the loop as for the intrinsic and often even much easier for the loop.
Well, at least for the normal sense of parallel using threads or similar. What can be done in parallel is to use the vector instructions and to schedule the instructions to overlap in the CPU pipeline. That can be done by the optimizing compiler and is likely done for both versions when you use -O3. You should not expect this to happen when no optimizations are enabled.
The use of the "parallel" instructions (SIMD) can be sometimes improved by using compiler directives like !$omp simd or !$DEC VECTOR.
"Do I have to create a new parallel dot_product function to be faster?"
Yes, normally you do. For example using OpenMP. Or you could:
"Or is there an additional option for the gfortran compiler which i don't know?"
Yes, the automatic parallelization https://gcc.gnu.org/wiki/AutoParInGCC , for example -floop-parallelize-all -ftree-parallelize-loops=4
Note that it will not make those individual multiplications in parallel, it will make the i loop parallel.
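The !$omp simd directive mentioned above has a C/C++ counterpart, #pragma omp simd. A minimal sketch in C++ (the other language used on this page), with a hypothetical function name dot, shows the idea of asking for a vectorized reduction explicitly. GCC honors the pragma with -fopenmp or -fopenmp-simd; without either flag it is ignored and the loop stays a plain scalar loop that the optimizer may or may not vectorize on its own.

```cpp
#include <cstddef>

// Hypothetical helper: dot product with an explicit SIMD reduction hint.
double dot(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    // Ask the compiler to vectorize the loop and treat sum as a reduction.
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

For a 3-element vector as in the question there is little to gain either way; the hint only starts to matter for longer arrays.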

Are gfortran whole array expressions enabled?

I'm new to Fortran and to gfortran. I learned that whole-array expressions are calculated in parallel, but I see that the calculations take place on just one core of my computer.
I use the following code:
program prueba_matrices
   implicit none
   integer, parameter :: num = 5000
   double precision, dimension(1:num,1:num) :: A, B, C
   double precision, dimension (num*num) :: temp
   integer :: i
   temp = (/ (i/2.0, i=1,num*num) /)
   A = reshape(temp, (/ num, num/) )
   B = reshape(temp, (/ num, num/) )
   C = matmul(A , B)
end program prueba_matrices
I compile like this:
gfortran prueba_matrices.f03 -o prueba_gfortran
And, watching the graphs produced in real time by gnome-system-monitor, I can see that there is only one core working. If I substitute the line with the calculation
C = matmul(A , B)
for
C = A * B
It yields the same behaviour.
What am I doing wrong?
GFortran/GCC does have some automatic parallelization features, see http://gcc.gnu.org/wiki/AutoParInGCC . They are frequently not that good, so they are not enabled at any of the -ON optimization levels; you have to select them specifically with -ftree-parallelize-loops=N, where N is the number of threads you want to use. Note however that in your example above a loop like "A*B" is likely constrained by memory bandwidth (for sufficiently large arrays), and thus adding cores might not help that much. Furthermore, the MATMUL intrinsic leads to an implementation in the gfortran runtime library, which is not compiled with the autopar options (unless you have specifically built it that way).
What could help your example code above more is to actually enable any optimization at all. With -O3 gfortran automatically enables vectorization, which can be seen as a way to parallelize loops as well, although not over several CPU cores.
If you want your call to matmul from gfortran to be multithreaded, the easiest way is to link to an external BLAS package that has been compiled with multithreading support. Candidates include OpenBLAS (née GotoBLAS), ATLAS, or commercial packages like Intel's MKL, AMD's ACML, or Apple's Accelerate framework.
So for instance, for this simple example:
program timematmult
   real, allocatable, dimension(:,:) :: A, B, C
   integer, parameter :: N = 2048
   allocate( A(N,N) )
   allocate( B(N,N) )
   allocate( C(N,N) )
   call random_seed
   call random_number(A)
   call random_number(B)
   C = matmul(A,B)
   print *, C(1,1)
   deallocate(C)
   deallocate(B)
   deallocate(A)
end program timematmult
With the base matmul:
$ gfortran -o matmult matmult.f90
$ time ./matmult
514.38751
real 0m6.518s
user 0m6.374s
sys 0m0.021s
and with the multithreaded gotoblas library:
$ gfortran -o matmult matmult.f90 -fexternal-blas -lgoto2
$ time ./matmult
514.38696
real 0m0.564s
user 0m2.202s
sys 0m0.964s
Note in particular here that the real time is less than the user time, indicating multiple cores are being used.
I think that a key sentence in the course that you cited is "With array assignment there is no implied order of the individual assignments, they are performed, conceptually, in parallel." The key word is "conceptually". It isn't saying that whole array expressions are actually executed in parallel; you shouldn't expect more than one core to be used. For that, you need to use OpenMP or MPI (outside of Fortran itself) or the coarrays of Fortran 2008.
EDIT: Fortran didn't have, as part of the language, actual parallel execution until the coarrays of Fortran 2008. Some compilers might provide parallelization otherwise and some language features make it easier for compilers to implement parallel execution (optionally). The sentence that I cited from the web article better states reality than the portion you cite. Whole-array expressions were not intended to require parallel execution; they are a syntactical convenience to the programmer, making the language higher level, so that array operations can be expressed in single statements, without writing do loops. In any case, no article on the web is definitive. Your observation of the lack of parallel executions shows which statement is correct. It does not contradict the Fortran language.

Resources