Slow-down when using OpenMP and calling a subroutine in a loop

Here I present a simple Fortran code using OpenMP that calculates a summation of arrays multiple times. My computer has 6 cores (12 threads) and 16 GB of memory.
There are two versions of this code. The first version has only one file, test.f90, in which the summation is implemented directly:
program main
implicit none
integer*8 :: begin, end, rate
integer :: i, j, cnt
real*8, allocatable, dimension(:,:,:) :: theta, e
allocate(theta(2000,50,5))
allocate(e(2000,50,5))
! initialize both arrays so they are not used while undefined
theta = 0.d0
e = 0.d0
call system_clock(count_rate=rate)
call system_clock(count=begin)
!$omp parallel do
do cnt = 1, 8
   do i = 1, 1001
      do j = 1, 50
         theta = theta + 0.5d0*e
      end do
   end do
end do
!$omp end parallel do
call system_clock(count=end)
write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate
deallocate(theta)
deallocate(e)
end program main
This version has no problem with OpenMP, and we can see acceleration.
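For reference, one quick way to check the scaling (a sketch; result is the executable produced by the Makefile shown further below) is to time the run with different thread counts:
OMP_NUM_THREADS=1 ./result
OMP_NUM_THREADS=6 ./result
OMP_NUM_THREADS=12 ./result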
The second version is modified so that the summation is implemented in a subroutine. There are now two files, test.f90 and sub.f90:
! test.f90
program main
use sub
implicit none
integer*8 :: begin, end, rate
integer :: cnt
call system_clock(count_rate=rate)
call system_clock(count=begin)
!$omp parallel do
do cnt = 1, 8
   call summation()
end do
!$omp end parallel do
call system_clock(count=end)
write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate
end program main
and
! sub.f90
module sub
implicit none
contains
   subroutine summation()
      real*8, allocatable, dimension(:,:,:) :: theta, e
      integer :: i, j
      allocate(theta(2000,50,5))
      allocate(e(2000,50,5))
      theta = 0.d0
      e = 0.d0
      do i = 1, 101
         do j = 1, 50
            theta = theta + 0.5d0*e
         end do
      end do
      deallocate(theta)
      deallocate(e)
   end subroutine summation
end module sub
I also wrote a Makefile (note that the recipe lines must begin with a tab):
FC = ifort -O2 -mcmodel=large -qopenmp
LN = ifort -O2 -mcmodel=large -qopenmp
FFLAGS = -c
LFLAGS =
result: sub.o test.o
	$(LN) $(LFLAGS) -o result test.o sub.o
# test.f90 uses module sub, so sub.o (and sub.mod) must be built first
test.o: test.f90 sub.o
	$(FC) $(FFLAGS) -o test.o test.f90
sub.o: sub.f90
	$(FC) $(FFLAGS) -o sub.o sub.f90
clean:
	rm result *.o* *.mod *.e*
(We could use gfortran instead.) However, when I run this version, there is a dramatic slow-down when using OpenMP, and it is even much slower than the single-threaded run (no OpenMP). So what is happening here, and how can I fix it?
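One hedged hypothesis, not a confirmed diagnosis: in the second version, each of the 8 concurrent calls to summation() performs its own allocate/deallocate, and heap allocation is typically serialized by a global lock, so the threads spend their time contending for the allocator instead of computing. Under that assumption, a minimal sketch of a fix is to hoist the allocation out of the parallel loop and pass the arrays in:

! sub.f90 (sketch): summation works on caller-provided arrays, so no
! allocate/deallocate happens inside the parallel region
module sub
implicit none
contains
   subroutine summation(theta, e)
      real*8, dimension(:,:,:), intent(inout) :: theta
      real*8, dimension(:,:,:), intent(in) :: e
      integer :: i, j
      do i = 1, 101
         do j = 1, 50
            theta = theta + 0.5d0*e
         end do
      end do
   end subroutine summation
end module sub

In test.f90, theta and e would then be allocated and initialized once before the timed region, and the loop body becomes call summation(theta, e), which reproduces the data layout of the first version.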

Related

array initialization run time comparison ifort vs gfortran

I would like to compare array initialization run time for ifort vs gfortran, using these compilation lines with gfortran 10.1.0 and ifort 19.1.3.304 on CentOS Linux 7:
ifort array-initialize.f90 -O3 -init=arrays,zero,minus_huge,snan -g -o intel-array.out
gfortran array-initialize.f90 -O3 -finit-local-zero -finit-integer=-2147483647 -finit-real=snan -finit-logical=True -finit-derived -g -o gnu-array.out
array-initialize.f90:
program array_initialize
   implicit none
   integer :: i, j, limit
   real :: my_max
   real :: start, finish
   my_max = -1.0
   limit = 10000
   call cpu_time(start)
   do j = 1, limit
      do i = 1, limit
         my_max = max(my_max, initializer(i, j))
      end do
   end do
   call cpu_time(finish)
   print *, my_max
   print '("Time = ", f6.3," seconds.")', finish-start
contains
   function initializer(i, j)
      implicit none
      real :: initializer
      real :: arr(2)
      integer :: i, j
      arr(1) = -1.0/(2*i+j+1)
      arr(2) = -1.0/(2*j+i+1)
      initializer = max(arr(1), arr(2))
   end function
end program array_initialize
Run times for this code:
gnu - 0.096 sec
intel - 0.392 sec
When I remove the init flags:
gnu - 0.098 sec
intel - 0.057 sec
When I replace the array with two variables:
gnu - 0.099 sec
intel - 0.065 sec
What happens here? Does gnu not initialize its arrays? Does intel initialize arrays very slowly?
OOPS.
I disabled vectorization using -no-vec on ifort and -fno-tree-vectorize on gfortran, and now the run times are the same, about 0.39 sec (just like the original Intel time).
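For reference, the no-vectorization runs presumably used compile lines along these lines (the output names here are illustrative):
ifort array-initialize.f90 -O3 -no-vec -g -o intel-novec.out
gfortran array-initialize.f90 -O3 -fno-tree-vectorize -g -o gnu-novec.out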

In parallel computing, why is execution time longer using all threads (4) than using only half (2)?

E.g., I am using this code (CPU: 4 cores, 1 thread per core):
program main
   use omp_lib
   implicit none
   integer, parameter :: ma=100, n=10000, mb=100
   integer :: istart, iend
   real, dimension(ma,n) :: a
   real, dimension(n,mb) :: b
   real, dimension(ma,mb) :: c = 0.
   integer :: i, j, k, threads=2, ppt, thread_num
   integer :: toc, tic, rate
   real :: time_parallel, time
   call random_number(a)
   call random_number(b)
   !/////////////////////// 1 - PARALLEL PRIVATE ///////////////////////
   call system_clock(count_rate=rate)
   call system_clock(tic)
   ppt = ma/threads
   !$ call omp_set_num_threads(threads)
   ! j and k must be private as well; leaving them shared is a data race
   !$omp parallel default(shared) private(istart, iend, &
   !$omp thread_num, i, j, k)
   !$ thread_num = omp_get_thread_num()
   !$ istart = thread_num*ppt + 1
   !$ iend = min(ma, thread_num*ppt + ppt)
   do i = istart, iend
      do j = 1, mb
         do k = 1, n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
         end do
      end do
   end do
   !$omp end parallel
   print *, 'Result in parallel mode'
   !$ print *, c(85:90,40)
   call system_clock(toc)
   time_parallel = real(toc-tic)/real(rate)
   !/////////////////////// 2 - normal execution ///////////////////////
   c = 0
   call system_clock(count_rate=rate)
   call system_clock(tic)
   do i = 1, ma
      do j = 1, mb
         do k = 1, n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
         end do
      end do
   end do
   call system_clock(toc)
   time = real(toc-tic)/real(rate)
   print *, 'Result in serial mode'
   print *, c(85:90,40)
   print *, '------------------------------------------------'
   print *, 'Threads: ', threads, '| Time Parallel Private', time_parallel, 's '
   print *, '          Time Normal ', time, 's'
end program main
I get the following results:
First execution:
Result in parallel mode
2477.89478 2528.50391 2511.84204 2528.12061 2500.79517
2510.69971
Result in serial mode
2477.89478 2528.50391 2511.84204 2528.12061 2500.79517
2510.69971
Threads: 2 | Time Parallel Private 0.379999995 s
Time Normal 0.603999972 s
Second execution:
Result in parallel mode
2492.20679 2496.56152 2500.58203 2516.51685 2516.43604
2530.71313
Result in serial mode
2492.20679 2496.56152 2500.58203 2516.51685 2516.43604
2530.71313
------------------------------------------------
Threads: 4 | Time Parallel Private 1.11500001 s
Time Normal 0.486000001 s
It was compiled using:
gfortran -Wall -fopenmp -g -O2 -o prog.exe prueba.f90
./prog.exe
If you have N cores and use N threads, then some of your threads get switched out in favor of other processes and threads. So it is preferable to use fewer threads than the number of available cores.
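A small sketch of that advice (assuming OpenMP; note that omp_get_num_procs() reports logical CPUs, i.e. hardware threads, so on an SMT machine halving it approximates the number of physical cores):

program choose_threads
   use omp_lib
   implicit none
   integer :: nlogical
   nlogical = omp_get_num_procs()               ! logical CPUs visible to the runtime
   call omp_set_num_threads(max(1, nlogical/2)) ! roughly one thread per physical core
   print *, 'using ', max(1, nlogical/2), ' threads'
end program choose_threads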

Discrepancy in results between OpenMP/OpenACC implementation and gcc/PGI compilers

I have a larger Fortran program that I am trying to convert so that the computationally intensive part will run on an NVIDIA GPU using OpenMP and/or OpenACC. During development I had some issues understanding how variables declared in a module can be used within subroutines that are executed on the GPU (some of them also on the CPU). I therefore created a small example and worked on that, experimenting and adding the corresponding OpenMP and OpenACC directives. I have included the three files that comprise my example at the end of this message.
Just as I thought that I had understood things and that my example program worked, I noticed the following:
I compile the program with gcc 10.2 using the OpenMP directives:
gfortran -O3 -fopenmp -Wall -Wextra test_link.f90 parameters.f90 common_vars.f90 -o test_link
The results are as expected, i.e. all elements of array XMO are 1, of DCP are 2, of IS1 are 3 and of IS2 are 24.
I compile the program with PGI compiler 19.10 community edition using the OpenACC directives:
pgfortran -O4 -acc -ta=tesla,cc35 -Minfo=all,mp,accel -Mcuda=cuda10.0 test_link.f90 common_vars.f90 parameters.f90 -o test_link
The results are the same as above.
I compile the program with gcc 10.2 using the OpenACC directives:
gfortran -O3 -fopenacc -Wall -Wextra test_link.f90 parameters.f90 common_vars.f90 -o test_link
The results for arrays XMO, DCP and IS1 are correct, but all elements of IS2 are 0. It is easy to verify that this happens because variable NR has a value of 0 on the device.
My understanding is that the OpenMP and OpenACC versions of my example are equivalent, but I cannot figure out why the OpenACC version works only with the PGI compiler and not with gcc.
If possible, please provide solutions that do not require changes in the code, only in the directives. As I mentioned, my original code is much larger, contains many more module variables, and calls many more subroutines in the code to be executed on the GPU. Changes to that code would be much more difficult, and obviously I would prefer to make them only if really necessary.
Thank you in advance!
The files of my example follow.
File parameters.f90
MODULE PARAMETERS
   IMPLICIT NONE
   INTEGER, PARAMETER :: MAX_SOURCE_POSITIONS = 100
END MODULE PARAMETERS
File common_vars.f90
MODULE COMMON_VARS
   USE PARAMETERS
   IMPLICIT NONE
   !$OMP DECLARE TARGET TO(NR)
   INTEGER :: NR
   !$ACC DECLARE COPYIN(NR)
END MODULE COMMON_VARS
File test_link.f90
SUBROUTINE TEST()
   USE COMMON_VARS
   IMPLICIT NONE
   !$OMP DECLARE TARGET
   !$ACC ROUTINE SEQ
   INTEGER I
   I = NR
END SUBROUTINE TEST

PROGRAM TEST_LINK
   USE COMMON_VARS
   USE PARAMETERS
   IMPLICIT NONE
   INTERFACE
      SUBROUTINE TEST()
         !$OMP DECLARE TARGET
         !$ACC ROUTINE SEQ
      END SUBROUTINE TEST
   END INTERFACE
   REAL :: XMO(MAX_SOURCE_POSITIONS), DCP(MAX_SOURCE_POSITIONS)
   INTEGER :: IS1(MAX_SOURCE_POSITIONS), IS2(MAX_SOURCE_POSITIONS)
   INTEGER :: X, Y, Z, MAX_X, MAX_Y, MAX_Z, ISOUR
   MAX_X = 3
   MAX_Y = 4
   MAX_Z = 5
   NR = 6
   ! ISOUR is recomputed in every iteration, so it must be private in both variants
   !$OMP TARGET UPDATE TO(NR)
   !$OMP TARGET MAP(TOFROM:IS1,IS2,DCP,XMO)
   !$OMP TEAMS DISTRIBUTE PARALLEL DO COLLAPSE(3) PRIVATE(ISOUR)
   !$ACC UPDATE DEVICE(NR)
   !$ACC PARALLEL LOOP GANG WORKER COLLAPSE(3) INDEPENDENT PRIVATE(ISOUR) &
   !$ACC COPY(IS1,IS2,DCP,XMO)
   DO X = 1, MAX_X
      DO Y = 1, MAX_Y
         DO Z = 1, MAX_Z
            ISOUR = (X - 1)*MAX_Y*MAX_Z + (Y - 1)*MAX_Z + Z
            XMO(ISOUR) = 1.0
            DCP(ISOUR) = 2.0
            IS1(ISOUR) = 3
            IS2(ISOUR) = 4 * NR
            CALL TEST()
         ENDDO ! End of z loop
      ENDDO ! End of y loop
   ENDDO ! End of x loop
   !$ACC END PARALLEL LOOP
   !$OMP END TEAMS DISTRIBUTE PARALLEL DO
   !$OMP END TARGET
   DO X = 1, MAX_X
      DO Y = 1, MAX_Y
         DO Z = 1, MAX_Z
            ISOUR = (X - 1)*MAX_Y*MAX_Z + (Y - 1)*MAX_Z + Z
            WRITE(*, *) 'ISOUR = ', ISOUR, 'XMO = ', XMO(ISOUR), 'DCP = ', DCP(ISOUR), &
               'IS1 = ', IS1(ISOUR), 'IS2 = ', IS2(ISOUR)
         ENDDO ! End of z loop
      ENDDO ! End of y loop
   ENDDO ! End of x loop
END PROGRAM TEST_LINK

Compiling a Fortran 90 file with different parameters each time

I have recently been working on a Fortran 90 program that measures the time needed for, and the result of, some mathematical calculations. Here is the code:
program loops
   use omp_lib
   implicit none
   integer, parameter :: N = 729
   integer, parameter :: reps = 1000
   real(kind=8), allocatable :: a(:,:), b(:,:), c(:)
   integer :: jmax(N)
   real(kind=8) :: start1, start2, end1, end2
   integer :: r
   allocate(a(N,N), b(N,N), c(N))
   call init1()
   start1 = omp_get_wtime()
   do r = 1, reps
      call loop1()
   end do
   end1 = omp_get_wtime()
   call valid1()
   print *, "Total time for ", reps, " reps of loop 1 = ", end1-start1
   call init2()
   start2 = omp_get_wtime()
   do r = 1, reps
      call loop2()
   end do
   end2 = omp_get_wtime()
   call valid2()
   print *, "Total time for ", reps, " reps of loop 2 = ", end2-start2
contains
   subroutine init1()
      implicit none
      integer :: i, j
      do i = 1, N
         do j = 1, N
            a(j,i) = 0.0
            b(j,i) = 3.142*(i+j)
         end do
      end do
   end subroutine init1
   subroutine init2()
      implicit none
      integer :: i, j, expr
      do i = 1, N
         expr = mod(i, 3*(i/30)+1)
         if (expr == 0) then
            jmax(i) = N
         else
            jmax(i) = 1
         end if
         c(i) = 0.0
      end do
      do i = 1, N
         do j = 1, N
            b(j,i) = dble(i*j+1)/dble(N*N)
         end do
      end do
   end subroutine init2
   subroutine loop1()
      implicit none
      integer :: i, j
      !$OMP PARALLEL DO DEFAULT(NONE), PRIVATE(i,j), SHARED(a,b), SCHEDULE(type,chunksize)
      do i = 1, N
         do j = N, i, -1
            a(j,i) = a(j,i) + cos(b(j,i))
         end do
      end do
      !$OMP END PARALLEL DO
   end subroutine loop1
   subroutine loop2()
      implicit none
      integer :: i, j, k
      real(kind=8) :: rN2
      rN2 = 1.0 / dble(N*N)
      !$OMP PARALLEL DO DEFAULT(NONE), PRIVATE(i,j,k), SHARED(rN2,c,b,jmax), SCHEDULE(type,chunksize)
      do i = 1, N
         do j = 1, jmax(i)
            do k = 1, j
               c(i) = c(i) + k * log(b(j,i)) * rN2
            end do
         end do
      end do
      !$OMP END PARALLEL DO
   end subroutine loop2
   subroutine valid1()
      implicit none
      integer :: i, j
      real(kind=8) :: suma
      suma = 0.0
      do i = 1, N
         do j = 1, N
            suma = suma + a(j,i)
         end do
      end do
      print *, "Loop 1 check: Sum of a is ", suma
   end subroutine valid1
   subroutine valid2()
      implicit none
      integer :: i
      real(kind=8) :: sumc
      sumc = 0.0
      do i = 1, N
         sumc = sumc + c(i)
      end do
      print *, "Loop 2 check: Sum of c is ", sumc
   end subroutine valid2
end program loops
Consider the lines
!$OMP PARALLEL DO DEFAULT(NONE), PRIVATE(i,j), SHARED(a,b), SCHEDULE(type,chunksize)
and
!$OMP PARALLEL DO DEFAULT(NONE), PRIVATE(i,j,k), SHARED(rN2,c,b,jmax), SCHEDULE(type,chunksize)
I want to test different schedule cases and compare the results, so I need to change the SCHEDULE(type,chunksize) part, with a different schedule type and a different chunk size each time. For example, a schedule type of static and a chunk size of 1.
Say I have schedule types (static, a, b, c) and chunk sizes (1,2,3,4,5,6,7). As I am new to Fortran, I wonder whether it is possible to compile and run the code for all cases in one go, without having to change the parameters manually every time; i.e. it compiles and runs to give the result of the first case, e.g. (static,1), then compiles and runs the file again with the parameters changed automatically to give another result, for instance (static,2) ... (b,4), etc.
I have heard that one can create a script file to perform such a task, but I am not sure what exactly I need to do.
Thank you so much.
You may want to investigate the use of the preprocessor. I am speaking from experience with gfortran, but I believe this applies to (almost) all other compilers as well, even though it is outside the scope of the Fortran standard.
If you name your source file with a capital F in the suffix, i.e. file.F, file.F90, file.F95, etc., it will be run through the C preprocessor before being compiled. That may sound complicated, but cutting this down to what you need, it means that if you compile your code with a command like
$ gfortran -DCHUNK_SIZE=1 mySource.F90
then all occurrences of CHUNK_SIZE (with qualifications that are not essential to your problem) will be replaced by 1. More technically, CHUNK_SIZE becomes a macro defined to expand to 1. So if you replace SCHEDULE(type,chunksize) with SCHEDULE(type,CHUNK_SIZE) in your source file, you can repeatedly invoke the compiler with different values, -DCHUNK_SIZE=1, -DCHUNK_SIZE=2, etc., and get the result that you described. The same can be done for the schedule type.
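For instance, a minimal driver script (a sketch, assuming bash, the mySource.F90 name from above, and a second hypothetical macro SCHED_TYPE used in the SCHEDULE clause and handled the same way as CHUNK_SIZE) could look like:

#!/bin/bash
# Build and run one binary per (schedule type, chunk size) combination.
for type in static dynamic guided; do
    for chunk in 1 2 3 4 5 6 7; do
        gfortran -fopenmp -O2 -DSCHED_TYPE=$type -DCHUNK_SIZE=$chunk mySource.F90 -o loops_${type}_${chunk}
        echo "=== schedule($type,$chunk) ==="
        ./loops_${type}_${chunk}
    done
done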
Now you may want to change the function names accordingly as well. One way would be to add a few preprocessor statements near the top of your file declaring a few macros, namely
#ifdef __GFORTRAN__
#define PASTE2(a,b) a/**/b
#define FUNC_NAME_WITH_CHUNK_SIZE(fn) PASTE2(PASTE2(fn,_),CHUNK_SIZE)
#else
#define FUNC_NAME_WITH_CHUNK_SIZE(fn) fn ## _ ## CHUNK_SIZE
#endif
#define LOOP1 FUNC_NAME_WITH_CHUNK_SIZE(loop1)
#define LOOP2 FUNC_NAME_WITH_CHUNK_SIZE(loop2)
and replace loop1 with LOOP1, etc. You could do this from the command line as before, but since these rules are not supposed to change between compilations, it makes sense to keep them in the source file. I think the only part that is not self-explanatory is the use of ## and /**/ between #ifdef and #endif. This is how one does string concatenation with the preprocessor; because gfortran does it the way C preprocessors did before the language was standardized, it gets exceptional treatment (see e.g. this answer for some information on these operators). The purpose of this operation is to replace LOOP1 with loop1_<CHUNK_SIZE>, where <CHUNK_SIZE> is filled in from the command line. Feel free to follow any other convention for naming these functions.
If you want to call these functions from another translation unit, you will have to process the function names in the same way, of course. In order to make your life easier, you may want to research the #include statement. Detailing this would take us too far here, but the idea is that you put all your macro definitions into a file (conventionally named <something>.inc in the Fortran world, with <something> replaced by whatever makes sense to you) and use #include "<something>.inc" in all source files to obtain the same macro definitions.

Why does a large matrix pass through several subroutines as fast as a smaller matrix?

What exactly is happening to my matrix? How is Fortran handling it?
Attached is a snippet of code inspired by a larger project that simulates light transport in eye tissue. It passes some large matrices through subroutines and then randomly puts values in them.
My goal: to see how passing such a large matrix through several subroutines impacts performance.
My reference: the exact same code, except that the dimension of the matrix of interest is now [5,5] (it was previously [250,200]).
My question: why is there no significant difference in results?
MY RESULTS
MATRIX A_rz dimension [250,200]
real 0m6.661s
user 0m6.638s
sys 0m0.012s
MATRIX A_rz dimension [5,5]
real 0m6.508s
user 0m6.489s
sys 0m0.011s
bMatMOD.f90

module bMatMOD
   implicit none
   type :: INPUT
      integer :: nLayers = 1
      integer :: nPhotons = 50000000
      real, dimension(2) :: dZR = (/0.0004, 0.001/)
      integer, dimension(3) :: nZRA = (/250,200,30/)
      real, dimension(1) :: d = (/0.03/)
   end type INPUT
   type :: OUTPUT
      real, allocatable :: Rd_ra(:,:)
      real, allocatable :: A_rz(:,:)
      real, allocatable :: Tt_ra(:,:)
   end type OUTPUT
contains
   subroutine initOUTPUTS(in_INPUT, out_OUTPUT)
      type (INPUT), intent (in) :: in_INPUT
      type (OUTPUT), intent (out) :: out_OUTPUT
      allocate (out_OUTPUT%A_rz(in_INPUT%nZRA(2), in_INPUT%nZRA(1)))
      allocate (out_OUTPUT%Rd_ra(in_INPUT%nZRA(2), in_INPUT%nZRA(3)))
      allocate (out_OUTPUT%Tt_ra(in_INPUT%nZRA(2), in_INPUT%nZRA(3)))
      out_OUTPUT%A_rz = 0.0
      out_OUTPUT%Rd_ra = 0.0
      out_OUTPUT%Tt_ra = 0.0
   end subroutine initOUTPUTS
end module bMatMOD
bMatRoutines.f90

subroutine A(o)
   use bMatMOD
   implicit none
   type (OUTPUT) :: o
   real :: rnd1, rnd2
   rnd1 = rand()
   rnd2 = rand()
   call B(o, rnd1, rnd2)
end subroutine A

subroutine B(o, x, y)
   use bMatMOD
   implicit none
   type (OUTPUT) :: o
   real, intent (in) :: x
   real, intent (in) :: y
   integer, dimension(2) :: temp
   integer :: i, j
   temp = SHAPE(o%A_rz)
   i = INT(temp(1)*y)
   j = INT(temp(2)*x)
   if (i .eq. 0) then
      i = 1
   endif
   if (i .eq. temp(1)) then
      i = i - 1
   endif
   if (j .eq. 0) then
      j = 1
   endif
   if (j .eq. temp(2)) then
      j = j - 1
   endif
   o%A_rz(i,j) = o%A_rz(i,j) + x + y
end subroutine B
bMatmcml.f90

program bMatmcml
   use bMatMOD
   implicit none
   type (INPUT) :: u
   type (OUTPUT) :: o
   integer :: i
   call initOUTPUTS(u, o)
   call srand(0)
   do i = 1, u%nPhotons
      call A(o)
   end do
end program bMatmcml
bMat.sh

rm -f *.o *~ *.exe
echo "MATRIX A_rz dimension [250,200]"
gfortran bMatMOD.f90 bMatRoutines.f90 bMatmcml.f90 -g -Wall -Werror -O3 -ffast-math -o bMat.exe
time ./bMat.exe
echo "MATRIX A_rz dimension [5,5]"
gfortran bMatMOD-v1.f90 bMatRoutines.f90 bMatmcml.f90 -g -Wall -Werror -O3 -ffast-math -o bMat-v1.exe
time ./bMat-v1.exe
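A hedged observation that may already answer the question: Fortran passes arrays to subroutines by reference (the callee receives an address or descriptor, not a copy), so the cost of the call itself does not grow with the matrix size. A minimal sketch demonstrating this with gfortran's non-standard LOC extension (program and names are illustrative, not part of the code above):

program pass_by_ref
   implicit none
   real :: big(250,200)
   big = 0.0
   print *, 'address seen by caller: ', loc(big)
   call peek(big)
end program pass_by_ref

subroutine peek(a)
   implicit none
   real :: a(250,200)
   ! prints the same address as the caller: no copy of the matrix was made
   print *, 'address seen by callee: ', loc(a)
end subroutine peek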
