OpenMP calling a function gives wrong results

I am trying to distribute a do loop over several threads. Inside the loop I call a function, which in turn calls a subroutine, and I add the result to a running total. If I simply enclose the do loop in a parallel region, the results are random; if I put the function call inside a CRITICAL section, the result is correct. But that costs more CPU time and gives no speedup at all.
I tested the logic with a small test program and it is correct. However, in the big program (which I cannot post here) it only works when I enclose the function call in CRITICAL.
The test program is given below. It works and gives the correct result, but in the big program funb is not evaluated correctly in the different threads unless it is inside a CRITICAL section.
sum=0d0
!$OMP PARALLEL PRIVATE(i,j,sum1,xcn,fun)
ithrd=OMP_GET_THREAD_NUM()
!$OMP DO
do i=1,5
sum1=0d0
do j=1,3
xcn=i+j+xx
!$OMP CRITICAL
fun=funb(xnc)
write(*,*)fun
!$OMP END CRITICAL
sum1=sum1+fun
enddo
enddo
!$OMP END DO
!$OMP CRITICAL
sum=sum+sum1
!$OMP END CRITICAL
!$OMP END PARALLEL
write(*,*)sum
If I remove OMP CRITICAL in the big program, different threads get the same value from funb when the values should differ. My understanding is therefore that there is some restriction on calling a function inside a PARALLEL section. I would be thankful if anybody could clarify the issue.
The function funb is given as:
COMPLEX*16 FUNCTION FUNB(ZAA)
IMPLICIT COMPLEX*16 (A-H,O-Z)
real*8 X1,X2
COMMON/ZVAR/ZA
COMMON/XVAR/X1,X2
ZA=ZAA
call myinvini
call myinvc(x2,fout)
funb=fout
RETURN
END
myinvini only sets up some data for wl8 and xl8, but myinvc is again a subroutine:
subroutine myinvc(x,f2)
complex*16 dir,dirc,sta,ss,ssc,cn,cnc,f2,ff,ffc,func
complex*16 f22,ans
integer igauss,inte,l,m
double precision x,range,phi,w,z,zz,zr
double precision st,st0,zint,xbl,a,b,dli,sli
double precision cpar,zero
double precision xl8,wl8,xl32,wl32
dimension zint(51)
COMMON/iinte/inte
complex*16 cbeta
common /wgauss/ xl8(8),wl8(8),xl32(32),wl32(32)
common /ccpar/ cpar
include 'constants.h'
igauss = 8
zero=0.0d0
range=201.0d0
phi=3.0d0/4.0d0*pi
dir=dcmplx(dcos(phi),dsin(phi))
dirc=dcmplx(dcos(phi),-dsin(phi))
sta=dcmplx(cpar,zero)
st =dexp(dlog(range)/dble(inte))
st0=1.0d0
zint(1)=zero
do 11 l=1,inte
st0 =st0*st
zint(l+1)=st0-1.0d0
11 continue
ss=dcmplx(zero,zero)
ssc=dcmplx(zero,zero)
xbl=dlog(x)
do 23 l=1,inte ! inte=5
a=zint(l)
b=zint(l+1)
dli=(b-a)/2.d0
sli=(b+a)/2.d0
do 24 m=1,igauss
if(igauss.eq. 8) w=wl8(m)
if(igauss.eq.32) w=wl32(m)
if(igauss.eq. 8) zz=xl8(m)
if(igauss.eq.32) zz=xl32(m)
z =dli*zz+sli
cn=sta+z*dir
cnc=sta+z*dirc
ff=func(cn)
ffc=func(cnc)
ss=ss+ff*dir*exp(-xbl*cn)*w*dli
ssc=ssc+ffc*dirc*exp(-xbl*cnc)*w*dli
24 continue
23 continue
f2=(ss+ssc)
return
end

In the absence of a threadprivate directive, common block variables are shared. The function referenced inside the parallel section modifies such a common block variable (ZA in COMMON/ZVAR/). This causes a data race and is not permitted by the OpenMP standard.
The code uses implicit typing and implicit specification of the data-sharing attributes for most of the variables referenced in the OpenMP construct. Both are appalling from a coding style perspective. The code as shown has one likely variable spelling mistake (xcn is assigned, but xnc is passed to funb), which would likely have been avoided if implicit specifications were avoided.
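A minimal sketch of how the points above could be combined, assuming the common block only needs to carry a per-thread value: the common block is declared THREADPRIVATE in every program unit that uses it, DEFAULT(NONE) forces explicit data-sharing attributes, and a REDUCTION clause replaces the CRITICAL sections. The names demo, funb_safe and work, and the arithmetic inside work, are hypothetical stand-ins for the original routines.

```fortran
! Sketch only; compile with OpenMP enabled, e.g. gfortran -fopenmp.
program demo
  implicit none
  integer :: i, j
  double precision :: xx, xcn, total
  double precision, external :: funb_safe

  xx = 0.5d0
  total = 0d0
  ! default(none): every variable must be given a sharing attribute explicitly.
  ! reduction(+:total): each thread sums privately, results are combined at the end.
  !$omp parallel do default(none) shared(xx) private(i, j, xcn) reduction(+:total)
  do i = 1, 5
    do j = 1, 3
      xcn = i + j + xx
      total = total + funb_safe(xcn)
    end do
  end do
  !$omp end parallel do
  write(*,*) total   ! the exact sum is 165 for this placeholder computation
end program demo

double precision function funb_safe(zaa)
  implicit none
  double precision, intent(in) :: zaa
  double precision :: za, f
  common /zvar/ za
  !$omp threadprivate(/zvar/)
  za = zaa           ! each thread writes its own copy of /zvar/
  call work(f)
  funb_safe = f
end function funb_safe

subroutine work(f)
  implicit none
  double precision, intent(out) :: f
  double precision :: za
  common /zvar/ za
  ! The directive must be repeated wherever the common block is declared.
  !$omp threadprivate(/zvar/)
  f = 2d0*za         ! placeholder for the real computation
end subroutine work
```

Passing the value as an ordinary argument instead of through a common block would avoid the THREADPRIVATE directive altogether; the sketch keeps the common block only because the original code is built around one.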

Related

Reduction of output array dimension in Fortran77 procedure

I am working on a large Fortran code, where parts are written in FORTRAN77.
There is a piece of code that causes runtime errors in a debug build, such as:
Fortran runtime error:
Index '2' of dimension 1 of array 'trigs' above upper bound of 1
but when compiled without debugging options it runs and does not crash the program. The debugging options used are:
-g -ggdb -w -fstack-check -fbounds-check\
-fdec -fmem-report -fstack-usage
The logic of the problematic piece of code is as follows: in file variables.cmn I declare
implicit none
integer factors,n
real*8 triggers
parameter (n=32)
common /fft/ factors(19), triggers(6*n)
Variables factors and triggers are initialized in procedure initialize:
include 'variables.cmn'
...
CALL FFTFAX(n,factors,triggers)
...
FFTFAX is declared in another procedure as:
SUBROUTINE FFTFAX(N,IFAX,TRIGS)
implicit real*8(a-h,o-z)
DIMENSION IFAX(13),TRIGS(1)
CALL FAX (IFAX, N, 3)
CALL FFTRIG (TRIGS, N, 3)
RETURN
END
and let's look at procedure FFTRIG:
SUBROUTINE FFTRIG(TRIGS,N,MODE)
implicit real*8(a-h,o-z)
DIMENSION TRIGS(1)
PI=2.0d0*ASIN(1.0d0)
NN=N/2
DEL=(PI+PI)/dFLOAT(NN)
L=NN+NN
DO 10 I=1,L,2
ANGLE=0.5*FLOAT(I-1)*DEL
TRIGS(I)=COS(ANGLE)
TRIGS(I+1)=SIN(ANGLE)
10 CONTINUE
DEL=0.5*DEL
NH=(NN+1)/2
L=NH+NH
LA=NN+NN
DO 20 I=1,L,2
ANGLE=0.5*FLOAT(I-1)*DEL
TRIGS(LA+I)=COS(ANGLE)
TRIGS(LA+I+1)=SIN(ANGLE)
20 CONTINUE
In both FFTFAX and FFTRIG the dummy arguments are declared with bounds that differ from the size of the actual arrays passed in: TRIGS is declared with bound 1 while triggers has 6*n elements, and IFAX is declared with bound 13 while factors has 19.
I printed out TRIGS after calling FFTFAX in no-debugger compilation setup:
trigs: 1.0000000000000000 0.0000000000000000\
0.99144486137381038 0.13052619222005157 0.96592582628906831\
0.25881904510252074 0.92387953251128674 0.38268343236508978\
...
My questions are:
Is the notation:
DIMENSION TRIGS(1)
something more than setting bound of an array?
Why is the program even working in no-debugger mode?
Is setting:
DIMENSION TRIGS(*)
a good fix if I want the variable trigs to be a result of the procedure?
In F77, a declaration such as DIMENSION TRIGS(1), or the same with (*) or any other number, when it applies to a dummy argument of the procedure, only tells the compiler the rank of the array. The storage itself must be provided by the array passed in the call, and F77 compilers normally do not check its length.
My recommendation: either use (*) or, better, convert the F77 sources to F90 where necessary (the parts shown would compile without change) and declare the dimension from n inside the subroutines.
Fortran compilers typically pass arrays by address, i.e. trigs(i) in the subroutine simply refers to the memory location at the address of trigs(1) + (i-1)*size(real*8).
A more consistent way to write the subroutine code could be:
SUBROUTINE FFTRIG(TRIGS,N,MODE)
! implicit real*8(a-h,o-z)
integer, intent(in) :: n
real(kind=8) :: trigs(6*n)
integer :: mode
! DIMENSION TRIGS(1)
.....
PI=2.0d0*ASIN(1.0d0)
.....
or with less ability for the compiler to check
SUBROUTINE FFTRIG(TRIGS,N,MODE)
! implicit real*8(a-h,o-z)
integer, intent(in) :: n
real(kind=8) :: trigs(:)
integer :: mode
! DIMENSION TRIGS(1)
.....
PI=2.0d0*ASIN(1.0d0)
.....
To answer your question, I would change TRIGS(1) to TRIGS(*), but only to identify more clearly that the dimension of TRIGS is not provided. TRIGS(1) is a carry-over from pre-F77 code for expressing the same thing.
Using TRIGS(:) is incorrect here, because declaring TRIGS this way (assumed shape) requires every routine calling FFTRIG to have an explicit INTERFACE for it. This change would lead to other errors.
Your question mixes two things: the bounds checker's need to know the array size, and a syntax that omits the size. To overcome this, you could pass the declared dimension of TRIGS as an extra argument for the bounds checker to use. In "debugger" mode, some compilers do provide hidden properties, including the declared size of all arrays.
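A minimal sketch of the "pass the declared dimension" suggestion, assuming an explicit-shape dummy argument is acceptable. The names fill_trigs and ntrigs are hypothetical and not part of the original FFT routines; the loop body only mirrors the first loop of FFTRIG.

```fortran
program demo_bounds
  implicit none
  integer, parameter :: n = 32
  double precision :: triggers(6*n)

  triggers = 0d0
  ! The declared size travels with the array as an ordinary argument.
  call fill_trigs(triggers, 6*n, n)
  write(*,*) triggers(1:4)
end program demo_bounds

subroutine fill_trigs(trigs, ntrigs, n)
  implicit none
  integer, intent(in) :: ntrigs, n
  ! Explicit-shape dummy: no INTERFACE block is needed in the caller,
  ! and a bounds-checking build can compare indices against ntrigs.
  double precision, intent(inout) :: trigs(ntrigs)
  double precision :: pi, del, angle
  integer :: i, nn

  pi = 2.0d0*asin(1.0d0)
  nn = n/2
  del = (pi + pi)/dble(nn)
  do i = 1, nn + nn, 2
    angle = 0.5d0*dble(i - 1)*del
    trigs(i) = cos(angle)
    trigs(i + 1) = sin(angle)
  end do
end subroutine fill_trigs
```

Compared with TRIGS(*), this keeps the F77-style calling convention but gives the compiler an upper bound to check; compared with TRIGS(:), it does not require changes in the callers beyond the extra argument.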

Writing a parallel for loop

I'm confused as to how to declare which parts of a program are accessible or not from the different workers. On a relatively low level in a program of mine I have a for loop I want to parallelize.
module module_name
[...]
addprocs(3)
totalsum = @parallel (+) for i in 1:large_number
tmp_sum = 0
for j in 1:num
... # calls f1 f2
end
tmp_sum # not sure how to 'return' the result, the examples have a conveniently placed calculation at the end
end
rmprocs([2 3 4])
[...]
end
As I understand it, I'd have to put the @everywhere macro in front of f1 and f2. But the program fails long before that, with the additional workers complaining that UndefVarError: module_name not defined, and I have no clue how to fix that.
I feel like I've missed something needed for setting up parallel processing. As I understood it, other than writing the actual @parallel part, one needs to call addprocs and then add @everywhere to the functions used inside the loop. Is that really it?
I know pmap is better suited for what I'm doing here, but I wanted to get the simpler option to work first (I'd need to pass several arguments to the pmap function).

Can we create pure functions in Fortran which generate random numbers?

My goal is to write a pure function using random numbers which can be used in a DO CONCURRENT structure. The compiler does not seem to permit this.
mwe.f95:8:30:
call init_seed ( )
1
Error: Subroutine call to ‘init_seed’ at (1) is not PURE
mwe.f95:9:36:
call random_number ( y )
1
Error: Subroutine call to intrinsic ‘random_number’ at (1) is not PURE
mwe.f95:16:8:
use myFunction
1
Fatal Error: Can't open module file ‘myfunction.mod’ for reading at (1): No such file or directory
compilation terminated.
Why is this so and is there a way to generate random numbers in a pure routine?
The MWE follows. Compilation command is gfortran mwe.f95. Compiler version is GCC 5.1.0.
module myFunction
implicit none
contains
pure real function action ( ) result ( new_number )
real :: y
call init_seed ( )
call random_number ( y )
new_number = y**2
end function
end module myFunction
program mwe
use myFunction
implicit none
real :: x
x = action ( )
end program mwe
This is completely against the concept of pureness. True pure functions, as found in true functional languages, should always return the same result for the same input. Fortran pure functions can read module variables and therefore are more complex.
It is not even a good idea to have any function, not just a pure function, to return pseudo-random numbers. When you have more function calls in an expression the Fortran compiler is permitted to evaluate the function just once. That is even more likely, or better justified, when that function is pure.
I would suggest just using regular DO loops and calling random_number or another custom PRNG subroutine. Even if you want automatic parallelization or similar, compilers are normally able to treat regular DO loops just as well as DO CONCURRENT.
You'll need a pure random number generator. It is quite possible to write one, for example a Linear Congruential Generator, where the seed (a 64-bit unsigned integer) is at once the state and the return value. In that case the state/seed is kept outside the sampling routine, passed in explicitly, and stored again when the RNG returns it.
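A minimal sketch of such a generator with explicitly passed state, under some assumptions: Fortran has no unsigned integers, so a signed 64-bit integer holds the state, and the Park-Miller "minimal standard" constants are used purely for illustration. The names lcg_mod and lcg_next are hypothetical. A pure subroutine is used rather than a pure function because the state has to be updated through an argument.

```fortran
module lcg_mod
  use iso_fortran_env, only: int64, real64
  implicit none
  ! Park-Miller constants, chosen here only as an example.
  integer(int64), parameter :: a = 16807_int64
  integer(int64), parameter :: m = 2147483647_int64
contains
  ! Pure: the generator state is an explicit argument, there is no hidden global state.
  pure subroutine lcg_next(state, u)
    integer(int64), intent(inout) :: state
    real(real64), intent(out) :: u
    state = mod(a*state, m)      ! a*state stays below 2**46, so int64 cannot overflow
    u = real(state, real64)/real(m, real64)
  end subroutine lcg_next
end module lcg_mod

program demo_lcg
  use iso_fortran_env, only: int64, real64
  use lcg_mod
  implicit none
  integer, parameter :: n = 4
  integer(int64) :: state(n)
  real(real64) :: x(n)
  integer :: i

  ! One state per iteration, so the iterations do not interact.
  ! Seeding with 1..n is naive and only serves the illustration.
  do i = 1, n
    state(i) = int(i, int64)
  end do

  do concurrent (i = 1:n)
    call lcg_next(state(i), x(i))
    x(i) = x(i)**2
  end do
  write(*,*) x
end program demo_lcg
```

The same input state always produces the same output, which is what makes the routine compatible with PURE; the randomness lives entirely in the state array that the caller owns.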

Get Wtime function returning "***"

I'm currently working on converting some Fortran code to run in parallel using OpenMP. I'm trying to use omp_get_wtime() to measure how much wall-clock time passes, but it prints ******. Other OpenMP functions work, yet for some reason this one doesn't. I've removed all the code between the two timer calls just to try to get something different. Removing the finish and displaying only the start gives the same result. Any ideas about what I'm doing wrong would be much appreciated.
C$ USE OMP_LIB
DOUBLE PRECISION START,FINISH
START = OMP_GET_WTIME()
FINISH=OMP_GET_WTIME()
WRITE(OUT,850) FINISH-START
850 FORMAT(25X,'ELAPSED TIME',I6)
Your problem has nothing to do with the OMP_GET_WTIME function. Rather it stems from the fact that the I edit descriptor is used to display integers and you are feeding it with a double precision number instead. You should use one of the floating-point edit descriptors like, e.g. F10.6:
$ cat wtime.f
USE OMP_LIB
IMPLICIT NONE
DOUBLE PRECISION START,FINISH
START = OMP_GET_WTIME()
CALL SLEEP(1)
FINISH=OMP_GET_WTIME()
WRITE(*,850) FINISH-START
850 FORMAT(25X,'ELAPSED TIME',F10.6)
END
$ ifort -openmp -o wtime.exe wtime.f
$ ./wtime.exe
ELAPSED TIME 1.000277

OpenMP FORTRAN issue with privates

In the following code, when I declare the variable "aa" as private, the results go wrong. The code works fine as posted, but when I replace the line
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
with
!$OMP PARALLEL PRIVATE(aa,iter,y,i,yt) SHARED(bb)
the code is not working properly.
!!!!!!!! module
module common
use iso_fortran_env
implicit none
integer,parameter:: dp=real64
real(dp):: aa,bb
contains
subroutine evolve(y,yevl)
implicit none
integer(dp),parameter:: id=2
real(dp),intent(in):: y(id)
real(dp),intent(out):: yevl(id)
yevl(1)=y(2)+1.d0-aa*y(1)**2
yevl(2)=bb*y(1)
end subroutine evolve
end module common
use common
implicit none
integer(dp):: iii,iter,i
integer(dp),parameter:: id=2
real(dp),allocatable:: y(:),yt(:)
integer(dp):: OMP_GET_THREAD_NUM, IXD
allocate(y(id)); allocate(yt(id)); y=0.d0; yt=0.d0; bb=0.3d0
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
!$OMP DO
do iii=1,20000; print*,iii !! EXPECTED THREADS TO BE OF 5000 ITERATIONS EACH
aa=1.d0+dfloat(iii-1)*0.4d0/2000.d0
loop1: do iter=1,10 !! THE INITIAL CONDITION LOOP
call random_number(y)!! RANDOM INITIALIZATION OF THE VARIABLE
loop2: do i=1,70000 !! ITERATION OF THE SYSTEM
call evolve(y,yt)
y=yt
enddo loop2 !! END OF SYSTEM ITERATION
write(IXD+1,*)aa,yt !!! WRITING FILE CORRESPONDING TO EACH THREAD
enddo loop1 !!INITIAL CONDITION ITERATION DONE
enddo
!$OMP ENDDO
!$OMP END PARALLEL
end
What could be the issue? Works fine when I generate "aa" from "iii" but not when I pass it as a private variable. Thanks in advance for any comments or suggestions.
aa is a module variable. Module variables can either be shared (the default) or threadprivate. Example A.32.2f from the OpenMP standard document illustrates that when module variables are accessed in the dynamic scope of a construct, it is unspecified whether the original variable or the private thread copy is being accessed. This is not the case with threadprivate variables as they are always stored in the thread-local storage, no matter if used inside the lexical scope of a parallel region or not.
There are many possible outcomes if you declare a module variable private and then access it in a subroutine. What is most likely to happen depends on the kind of analysis the compiler performs on the code. Some compilers might detect that the module subroutine is only called inside the parallel region and hence make aa refer to the private copy of each thread. Other compilers might decide to always access the original module variable. On the other hand, if the subroutine gets inlined into the calling subroutine, it might refer to the same aa that is used in the calling context (e.g. the private version if aa is declared private).
Here is an example of how gfortran handles PRIVATE(iii,aa,iter,y,i,yt) at the default optimisation level:
; aa is declared as a global symbol in the BSS section
.globl __common_MOD_aa
.bss
.align 8
.type __common_MOD_aa, #object
.size __common_MOD_aa, 8
__common_MOD_aa:
.zero 8
; Here is how evolve accesses aa
...
movsd __common_MOD_aa(%rip), %xmm2
...
; Here is how the assignment to aa is done inside the parallel region
...
movsd %xmm0, -72(%rbp)
...
The private aa is implemented as an automatic variable and stored on the thread's stack, while evolve uses the value of aa from the module. Therefore this statement:
aa=1.d0+dfloat(iii-1)*0.4d0/2000.d0
only alters the value of aa inside the thread, while evolve uses the original value of aa from outside the parallel region.
At the high optimisation level -O3 gfortran inlines evolve into the parallel region and...
...
mulsd __common_MOD_aa(%rip), %xmm2
...
The inlined code also refers to the global value of aa in the module, i.e. the behaviour is consistent between the two optimisation levels.
The same applies to Intel Fortran.
The correct approach is to declare aa to be threadprivate and to not put it in a private clause:
module common
use iso_fortran_env
implicit none
integer,parameter:: dp=real64
real(dp):: aa,bb
!$OMP THREADPRIVATE(aa)
...
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
!$OMP DO
do iii=1,20000; print*,iii !! EXPECTED THREADS TO BE OF 5000 ITERATIONS EACH
aa=1.d0+dfloat(iii-1)*0.4d0/2000.d0
...
Now both the parallel region and evolve use the same per-thread private copy of aa. As access to threadprivate variables is usually slower than access to normal private (stack) variables, on 64-bit x86 systems it might make more sense to pass the value of aa as an argument to evolve instead, as suggested by @Bálint Aradi.
You should try to carefully analyze your variables, especially to think about which of them would have different values on the different threads at the same time, as those must be declared OMP private. In your case, both variables aa and iii must be OMP private. Variable iii because it is a counter in a loop which is distributed over the threads, and aa because it gets a value which depends on iii.
EDIT: As each thread calls the evolve subroutine itself and evolve is supposed to use the thread specific value of aa (I guess), you should also pass aa to your subroutine instead of using the module variable aa.
The routine should look like:
subroutine evolve(y, aa, yevl)
integer(dp),parameter:: id=2
real(dp),intent(in):: y(id), aa
real(dp),intent(out):: yevl(id)
yevl(1)=y(2)+1.d0-aa*y(1)**2
yevl(2)=bb*y(1)
end subroutine evolve
and the corresponding call in your main program:
call evolve(y, aa, yt)

Resources