random_number subroutine run time comparison ifort vs gfortran - performance

I wrote this code:
program random
  implicit none
  integer :: i, j, limit
  real(8) :: r, max_val
  real(8) :: start, finish

  max_val = 0.d0
  limit = 10000

  call CPU_TIME(start)
  do i = 1, limit
     do j = 1, limit
        call random_number(r)
        max_val = max(max_val, r)
     end do
  end do
  call CPU_TIME(finish)

  print *, max_val
  print '("Time = ",f6.3," seconds.")', finish - start
end program random
And I compiled it with gfortran 10.1.0 and ifort 19.1.3.304 on CentOS Linux 7 using:
ifort *.f90 -O3 -no-vec -o intel.out
gfortran *.f90 -O3 -fno-tree-vectorize -o gnu.out
and the outputs are:
gnu:
0.9999999155521957
Time = 0.928 seconds.
intel:
0.999999968800691 (same for every run btw)
Time = 1.989 seconds.
The run times are pretty much the same across repeated runs.
Why is gfortran faster than ifort, and how can I make ifort run as fast as gfortran?

Different compilers ship their own runtime libraries with their own implementations of the intrinsic functions and subroutines. These will differ in performance and may also differ in their results. Gfortran uses the GLIBC library for many general intrinsics and the libgfortran library for many Fortran-specific ones. The Intel compiler comes with its own runtime-library suite.
Notably, the Fortran standard gives no guarantees about the quality of the pseudo-random generator used for random_number(). Even if it did, the actual implementation could always differ, and hence so could the performance.
There are many external pseudo-random number generator libraries available. Some are faster, some slower. Some are more robust, some fail certain randomness tests (sometimes that matters, sometimes it does not). Some give more random bits in a single call, some give fewer. If you need particular properties from the generator under all your compilers, you may be better off with an external library.
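As an illustration only, here is a minimal sketch of a self-contained generator you could carry between compilers (Marsaglia's xorshift64 step; the module and function names are invented for this example). It is fast and portable, but comes with no simulation-grade quality guarantee:

module xorshift_mod
  use iso_fortran_env, only: int64, real64
  implicit none
  integer(int64) :: state = 88172645463325252_int64  ! any nonzero seed
contains
  function xorshift64() result(r)
    real(real64) :: r
    ! Marsaglia xorshift64 step: x ^= x<<13; x ^= x>>7; x ^= x<<17
    state = ieor(state, ishft(state, 13))
    state = ieor(state, ishft(state, -7))
    state = ieor(state, ishft(state, 17))
    ! take the top 53 bits to form a real(8) uniform value in [0,1)
    r = real(ishft(state, -11), real64) * 2.0_real64**(-53)
  end function xorshift64
end module xorshift_mod

Calling r = xorshift64() in place of call random_number(r) then yields the same sequence, at the same speed, with every compiler.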

Related

Intrinsic dot_product slower than a*a+b*b+c*c?

Recently I tested the runtime difference between explicit summation and the intrinsic function for calculating a dot product. Surprisingly, the naïve explicit version was faster.
program test
  real*8, dimension(3) :: idmat
  real*8 :: dummy(3)
  integer :: i
  integer(8) :: j   ! a count of 10**10 overflows a default (32-bit) integer
  idmat = 0
  dummy = 0
  do i = 1, 3
     idmat(i) = 1
  enddo
  do j = 1, 10_8**10
     ! dummy(mod(j,3_8)+1) = dot_product(idmat, idmat)
     dummy(mod(j,3_8)+1) = idmat(1)*idmat(1) + idmat(2)*idmat(2) + idmat(3)*idmat(3)
  enddo
  print *, dummy
end program test
Here is what confuses me:
1. No -O3 Optimization
If I use: gfortran test.f90 -o test ; time ./test
I find a runtime of 6.297 s using the intrinsic dot_product (commented out above) and 4.486 s using the manual explicit expression.
How does that make sense?
2. Including -O3 Optimization
If I use: gfortran test.f90 -O3 -o test ; time ./test
I find runtimes of 1.808 s and 1.803 s respectively. So both are actually the same speed.
3. What I actually expect
...is the intrinsic function to be faster, as it could:
compute the 3 products in parallel
add the 3 products
where the explicit form has to sequentially:
compute product 1
compute product 2
compute product 3
add the 3 products
Do I have to create a new parallel dot_product function to be faster? Or is there an additional option for the gfortran compiler which I don't know?
Please note: I have read across the internet about SIMD, auto-vectorization and parallelisation in modern Fortran. Although I learned something, my question wasn't answered anywhere.
There is little point in even looking at the non-optimized numbers. The optimized numbers are the same, so everything is fine.
"...is the intrinsic function to be faster, as it could: compute the 3 products in parallel"
There will be nothing done in parallel unless you enable specific parallel optimizations. These optimizations will be as easy to do for the loop as for the intrinsic and often even much easier for the loop.
Well, at least for the normal sense of parallel using threads or similar. What can be done in parallel is to use the vector instructions and to schedule the instructions to overlap in the CPU pipeline. That can be done by the optimizing compiler and is likely done for both versions when you use -O3. You should not expect this to happen when no optimizations are enabled.
The use of the "parallel" (SIMD) instructions can sometimes be improved by using compiler directives like !$omp simd or Intel's !DIR$ VECTOR.
"Do I have to create a new parallel dot_product function to be faster?"
Yes, normally you do. For example using OpenMP. Or you could:
"Or is there an additional option for the gfortran compiler which i don't know?"
Yes, the automatic parallelization https://gcc.gnu.org/wiki/AutoParInGCC , for example -floop-parallelize-all -ftree-parallelize-loops=4
Note that it will not make those individual multiplications in parallel, it will make the i loop parallel.
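For the SIMD-directive route mentioned above, a minimal sketch looks like this (array size is arbitrary; gfortran needs -fopenmp or -fopenmp-simd for the directive to take effect):

program dot_simd
  implicit none
  integer, parameter :: n = 1024
  real(8) :: a(n), b(n), s
  integer :: i
  call random_number(a)
  call random_number(b)
  s = 0.d0
  !$omp simd reduction(+:s)   ! ask the compiler to vectorize the accumulation
  do i = 1, n
     s = s + a(i)*b(i)
  end do
  print *, s
end program dot_simd

Compile with, e.g., gfortran -O3 -fopenmp-simd dot_simd.f90. Note this uses vector lanes within one core, not multiple threads.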

random_number() gives vastly different behavior between GNU and PGI Fortran compilers

Here's a simple Fortran program I was using to understand the behavior of the Fortran intrinsic uniform random number generator.
program test_prog
  implicit none
  integer, allocatable :: seed(:)
  real(8), dimension(2) :: unif_rand
  integer :: nseed   ! size of the random seed array
  integer :: n
  call random_seed( size=nseed )
  nseed = 100        ! override the queried size with a generously large array
  allocate( seed(nseed) )
  write(*,*) "nseed: ", nseed
  do n = 1, 5
     seed(:) = n**10
     call random_seed( put=seed )
     call random_number( harvest=unif_rand )
     write(*,1000) seed(nseed), unif_rand(1), unif_rand(2)
     write(*,*) ""
  enddo
1000 format(i12," ",f12.8," ",f12.8)
end program test_prog
When I compile with gfortran I get sensible results:
1 0.76322100 0.72975598
1024 0.30901699 0.80380552
59049 0.05916934 0.69849271
1048576 0.59972035 0.71558547
9765625 0.79167428 0.37621382
But when I compile with pgf90 I get very different results:
1 0.00000024 0.00000024
1024 0.00024414 0.00024414
59049 0.01407838 0.01407838
1048576 0.25000003 0.25000003
9765625 0.32830648 0.32830648
With small seed values the PGI results are always very close to zero, as if the random values were being scaled by the seed. This is very problematic for my current project because I need consistent results across different compilers.
Some google searches haven't turned up any explanation, so I'm wondering if anyone here can explain why these results are so different?
Or does anyone know of a trick to make the PGI compiler results more in line with the GNU compiler results?
Or does anyone know of some code for a decent random number generator available online that I could implement as an alternative to the intrinsic routines?
Firstly, as discussed here several times before, the Fortran standard gives NO requirement or guarantee about the quality of the intrinsic random number generator. One has to use external libraries to have any quality guarantee.
Secondly, the seed of a PRNG is the source of its initial entropy. Different generators cope with different seeds with varying success, but with poor generators the bits of the seed also need to be distributed relatively "randomly". Too much coherence in the seed bits and the results may be terrible.
As to which external library to use: recommendations are off topic, but there are plenty of them, and you will even find articles comparing the quality of individual methods. Many recommendations have already been given on this very site in comments and answers under similar questions; just search for random and fortran.
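One way to reduce the seed-coherence problem is to scramble the user seed before handing it to random_seed. A minimal sketch (the subroutine name and mixing constant are choices made for this example; a few xorshift steps spread the seed's bits over the whole word):

subroutine scrambled_seed(s)
  use iso_fortran_env, only: int64
  implicit none
  integer, intent(in) :: s
  integer, allocatable :: seed(:)
  integer(int64) :: x
  integer :: i, nseed
  call random_seed(size=nseed)
  allocate(seed(nseed))
  ! mix the user seed so its bits are well distributed
  x = ieor(int(s, int64), 6364136223846793005_int64)
  do i = 1, nseed
     x = ieor(x, ishft(x, 13))
     x = ieor(x, ishft(x, -7))
     x = ieor(x, ishft(x, 17))
     seed(i) = int(ibits(x, 0, 31))   ! keep 31 low bits, always nonnegative
  end do
  call random_seed(put=seed)
end subroutine scrambled_seed

This still gives no guarantee about the generator's quality, but it stops near-identical small seeds from producing near-identical streams.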

OpenMP: different answer with ifort than with gfortran

I have recently encountered a compiler-specific OpenMP issue. I have written a code that works correctly with OpenMP and gfortran. I recently compiled the same code with ifort and OpenMP, and now the OpenMP section segfaults. I have cut most of the fat away and written this snippet, which reproduces my error:
!$omp parallel do schedule(dynamic) reduction(+:energy) default(private) shared(r,tr)
do i = 0, listvar%ncellT-1
   c1s = tr(i)%start
   c1e = tr(i)%end
   do j = c1s, c1e
      x1 = r(j)%x; y1 = r(j)%y; z1 = r(j)%z
      energy = energy + x1
   enddo
enddo
!$omp end parallel do
All this loop does is access pointers to elements in r from the array tr, and then loop between those elements. However, ifort is for some reason assigning nonsense values to c1s and c1e causing the code to segfault when I loop over r. Again, the code works with gfortran. Does anyone have any idea why this may be?
Found the error. For some reason, gfortran was kind enough to initialize the variable listvar for all threads; ifort, however, requires the use of firstprivate(listvar). I am really surprised gfortran never threw an error here. I have run the code using gfortran over a wide range of conditions, and it worked every time.
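For reference, a sketch of what the corrected directive could look like (based on the snippet above; with default(none), every variable must be given an explicit data-sharing attribute, so an unlisted listvar would have been flagged at compile time):

!$omp parallel do schedule(dynamic) reduction(+:energy) &
!$omp default(none) private(i, j, c1s, c1e, x1, y1, z1) &
!$omp shared(r, tr) firstprivate(listvar)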

OpenMP code compiled with different flags gives different results

I developed a Fortran code which I compiled with the following command:
ifort -g -O0 -openmp -openmp-report -threads -ipo
When running with the above flags, the serial and parallel (OpenMP) results agree to all 15 digits. I have also checked with Intel Inspector 2013, and I do not have any data race condition.
However, when I change the optimization flag to -O2 or -O3, I get a small error, on the order of 10^-15, which grows as the simulation integrates over time.
The results with -O2 and -O3 also differ from each other (from about the fifth digit after the decimal point).
Can anyone advise how I can, in general, improve my code so that it runs with the same (double) precision as with the -O0 flag?
Thanks in advance,
Jack.

Are gfortran whole-array expressions evaluated in parallel?

I'm new to Fortran and to gfortran. I learned that whole-array expressions are calculated in parallel, but I see that the calculations only take place on one core of my computer.
I use the following code:
program prueba_matrices
  implicit none
  integer, parameter :: num = 5000
  double precision, dimension(1:num,1:num) :: A, B, C
  double precision, dimension(num*num) :: temp
  integer :: i
  temp = (/ (i/2.0, i=1,num*num) /)
  A = reshape(temp, (/ num, num /))
  B = reshape(temp, (/ num, num /))
  C = matmul(A, B)
end program prueba_matrices
I compile like this:
gfortran prueba_matrices.f03 -o prueba_gfortran
And, watching the graphs produced in real time by gnome-system-monitor, I can see that only one core is working. If I replace the calculation line
C = matmul(A , B)
with
C = A * B
it yields the same behaviour.
What am I doing wrong?
GFortran/GCC does have some automatic parallelization features, see http://gcc.gnu.org/wiki/AutoParInGCC . They are frequently not that good, so they are not enabled at any of the -ON optimization levels; you have to select them specifically with -ftree-parallelize-loops=N, where N is the number of threads you want to use. Note however that in your example above a loop like "A*B" is likely constrained by memory bandwidth (for sufficiently large arrays), and thus adding cores might not help much. Furthermore, the MATMUL intrinsic is implemented in the gfortran runtime library, which is not compiled with the autopar options (unless you have specifically built it that way).
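For instance, to try the automatic parallelizer on the program above (the thread count of 4 is arbitrary):
gfortran -O3 -ftree-parallelize-loops=4 prueba_matrices.f03 -o prueba_gfortran
Keep in mind, per the above, that this will not make the matmul call itself multithreaded.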
What could help your example code above more is to actually enable any optimization at all. With -O3 Gfortran automatically enables vectorization, which can be seen as a way to parallelize loops as well, although not over several cpu cores.
If you want your call to matmul from gfortran to be multithreaded, the easiest way is to simply link to an external BLAS package that has been compiled with multithreading support. Candidates include OpenBLAS (née GotoBLAS), ATLAS, or commercial packages like Intel's MKL, AMD's ACML, or Apple's Accelerate framework.
So for instance, for this simple example:
program timematmult
  real, allocatable, dimension(:,:) :: A, B, C
  integer, parameter :: N = 2048
  allocate( A(N,N) )
  allocate( B(N,N) )
  allocate( C(N,N) )
  call random_seed
  call random_number(A)
  call random_number(B)
  C = matmul(A,B)
  print *, C(1,1)
  deallocate(C)
  deallocate(B)
  deallocate(A)
end program timematmult
With the base matmul:
$ gfortran -o matmult matmult.f90
$ time ./matmult
514.38751
real 0m6.518s
user 0m6.374s
sys 0m0.021s
and with the multithreaded gotoblas library:
$ gfortran -o matmult matmult.f90 -fexternal-blas -lgoto2
$ time ./matmult
514.38696
real 0m0.564s
user 0m2.202s
sys 0m0.964s
Note in particular here that the real time is less than the user time, indicating multiple cores are being used.
I think that a key sentence in the course that you cited is "With array assignment there is no implied order of the individual assignments, they are performed, conceptually, in parallel." The key word is "conceptually". It isn't saying that whole array expressions are actually executed in parallel; you shouldn't expect more than one core to be used. For that, you need to use OpenMP or MPI (outside of Fortran itself) or the coarrays of Fortran 2008.
EDIT: Fortran didn't have actual parallel execution as part of the language until the coarrays of Fortran 2008. Some compilers may provide parallelization by other means, and some language features make it easier for compilers to implement parallel execution (optionally). The sentence I cited from the web article states reality better than the portion you cite. Whole-array expressions were never intended to require parallel execution; they are a syntactic convenience for the programmer, making the language higher level, so that array operations can be expressed in single statements without writing do loops. In any case, no article on the web is definitive. Your observation of the lack of parallel execution shows which statement is correct; it does not contradict the Fortran language.
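For completeness, a minimal coarray sketch (with gfortran this needs -fcoarray=lib plus the OpenCoarrays library for actual multi-image execution; -fcoarray=single compiles it to run on a single image):

program coarray_demo
  implicit none
  integer :: val[*], i, s      ! val has one copy per image
  val = this_image()           ! each image stores its own number
  sync all
  if (this_image() == 1) then
     s = 0
     do i = 1, num_images()
        s = s + val[i]         ! image 1 reads every image's copy
     end do
     print *, 'sum over', num_images(), 'images:', s
  end if
end program coarray_demo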
