I have a few questions about using common blocks in parallel programming in Fortran.
My subroutines have common blocks. Do I have to declare all the common blocks and threadprivate them in the parallel do region?
How do they pass information? I want a separate common block for each thread, and I want the values to persist through to the end of the parallel region. Does that happen here?
My FORD subroutine changes some variables in the common blocks, the CONDACT subroutine overwrites them again, and the function uses the values set by CONDACT. Do the second subroutine and the function see each thread's copies of the variables set by the previous subroutine?
program
...
! Loop which I want to parallelize
!$OMP PARALLEL DO
! do I need to declare all the common blocks and threadprivate them here?
do I = 1, N
   ...
   call FORD(i, j)
   ...
end do
!$OMP END PARALLEL DO
end program
subroutine FORD(i,j)
dimension zl(3),zg(3)
common /ellip/ b1,c1,f1,g1,h1,d1,
. b2,c2,f2,g2,h2,p2,q2,r2,d2
common /root/ root1,root2
!$OMP threadprivate (/ellip/,/root/)
! this subroutine rewrites the values of b1, c1 and f1
CALL CONDACT(genflg,lapflg)
return
end subroutine
SUBROUTINE CONDACT(genflg,lapflg)
common /ellip/ b1,c1,f1,g1,h1,d1,b2,c2,f2,g2,h2,p2,q2,r2,d2
!$OMP threadprivate (/ellip/)
! this subroutine rewrites b1, c1 and f1 again
y = f(x)
RETURN
END
function f(x)
common /ellip/ b1,c1,f1,g1,h1,d1,
. b2,c2,f2,g2,h2,p2,q2,r2,d2
!$OMP threadprivate (/ellip/)
! here the function uses the values of b1, c1, f1 set by the CONDACT subroutine
end
Firstly, as the comment above says, I would strongly advise against the use of common, especially in modern code. Mixing global data and parallelism is just asking for a world of pain; in fact, global data is just a bad idea, full stop.
OK, your questions:
My subroutines have common blocks. Do I have to declare all the common blocks and threadprivate them in the parallel do region?
No. threadprivate is a declarative directive and belongs in the declarative part of the code, not in the parallel region; it must appear after the declaration of the common blocks it names, in every program unit that declares them.
How do they pass information? I want a separate common block for each thread, and I want the values to persist through to the end of the parallel region. Does that happen here?
As you suspect, each thread will get its own version of the common block. When you enter the first parallel region the values in the block will be undefined, unless you use copyin to broadcast the values from the master thread. For subsequent parallel regions the values will be retained, as long as the number of threads used in each region is the same. Between regions the values in the common block will be those of the master thread.
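For example, here is a sketch of the loop from the question with copyin added, assuming the common blocks have been declared threadprivate as in the code above:

!$OMP PARALLEL DO COPYIN(/ellip/, /root/)
do i = 1, N
   call FORD(i, j)
end do
!$OMP END PARALLEL DO

On entry to the region, each thread's private copies of /ellip/ and /root/ are initialised from the master thread's values.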
Are those common blocks accessible through the subroutines? My FORD subroutine rewrites some variables in the common block and the CONDACT subroutine rewrites them again, but the function uses the values from CONDACT. Is it possible to rewrite and pass the common block variables using threadprivate here?
I have to admit I am unsure what you are asking here. But if you are asking whether common can be used to communicate variables between different sub-programs in OpenMP code, the answer is yes, just as in serial Fortran (note capitalisation).
How about converting the common blocks into modules?
Change common /root/ root1, root2 to use root, then make a new file root.f that contains:
module root
implicit none
save
real :: root1, root2
!$omp threadprivate( root1, root2 )
end module root
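A subroutine that previously declared the common block would then simply use the module; here is a minimal sketch of FORD after the conversion (the body is a placeholder):

subroutine FORD(i, j)
   use root            ! replaces common /root/ root1, root2
   implicit none
   integer :: i, j
   root1 = 0.0         ! placeholder: root1 and root2 are now threadprivate
   root2 = 0.0         ! module variables, one copy per thread
end subroutine FORD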
Related
Why does this OpenMP Fortran program work (every element of out is equal to num)? Each thread in the parallel loop might read the variable num simultaneously. I thought this was not acceptable?
program example
   implicit none
   integer :: i
   integer, parameter :: n = 100000
   double precision :: num
   double precision, dimension(n) :: out
   num = 1.123456789123456789123456d-5
   out = 0.d0
   !$OMP PARALLEL
   !$OMP DO
   do i = 1, n
      out(i) = num
   end do
   !$OMP END DO
   !$OMP END PARALLEL
   do i = 1, n
      if (out(i) .ne. num) print *, 'Problem with ', i
   end do
end program
Thanks so much for any insights.
Can reading a variable be a data race in OpenMP?
Any race is between two things happening, so a read can be part of a race. However for the competition between two actions to be a race, there has to be a different outcome depending on the order in which the two actions occur.
Given that the possible actions in a parallel program which we are considering are read and write occurring in different threads, we have four possible cases:
Read, Read: no values are changed, and no code can detect which order the two reads occurred in (at least, not without looking at meta-data such as code performance in a system with caches :-)).
Read, Write: this clearly can be a race; whether the write wins the race or not affects the value which will be read.
Write, Read: as with case 2 (Read,Write), the result seen by the read is affected by the order.
Write, Write: here we have a race too, since we assume that someone will ultimately read the value, and which value they see will depend on the order of the writes.
So, reading a variable can be part of a race.
However, if your question is really "Is there a race if a variable is only read?", then the answer is "No".
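To make cases 2 and 3 concrete, here is a minimal sketch of a read/write race on a shared counter, followed by a version where an atomic update removes the race (a reduction clause would serve equally well):

program race_demo
   implicit none
   integer :: i, total
   total = 0
   ! Race: every thread reads and writes the shared variable total,
   ! so the final value depends on how the threads interleave.
   !$omp parallel do
   do i = 1, 100000
      total = total + 1
   end do
   !$omp end parallel do
   print *, total   ! typically less than 100000
   total = 0
   !$omp parallel do
   do i = 1, 100000
      !$omp atomic
      total = total + 1   ! atomic update removes the race
   end do
   !$omp end parallel do
   print *, total   ! always 100000
end program race_demo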
Variables are shared by default in OpenMP, so they are accessible from all the threads. Furthermore, you're not writing to num, so even if all the threads were accessing the same memory (which here they probably aren't), there would be no issue.
The Fortran 2008 do concurrent construct is a do loop that tells the compiler that no iteration affects any other. It can thus be parallelized safely.
A valid example:
program main
implicit none
integer :: i
integer, dimension(10) :: array
do concurrent( i= 1: 10)
array(i) = i
end do
end program main
where iterations can be done in any order. You can read more about it here.
To my knowledge, gfortran does not automatically parallelize these do concurrent loops, though I remember a gfortran-diffusion-list mail about doing it (here). It just transforms them into classical do loops.
My question: Do you know a way to systematically parallelize do concurrent loops? For instance with a systematic openmp syntax?
It is not that easy to do it automatically. The DO CONCURRENT construct has a forall-header which means that it could accept multiple loops, index variables definition and a mask. Basically, you need to replace:
DO CONCURRENT([<type-spec> :: ]<forall-triplet-spec 1>, <forall-triplet-spec 2>, ...[, <scalar-mask-expression>])
<block>
END DO
with:
[BLOCK
<type-spec> :: <indexes>]
!$omp parallel do
DO <forall-triplet-spec 1>
DO <forall-triplet-spec 2>
...
[IF (<scalar-mask-expression>) THEN]
<block>
[END IF]
...
END DO
END DO
!$omp end parallel do
[END BLOCK]
(things in square brackets are optional, based on the presence of the corresponding parts in the forall-header)
Note that this would not be as effective as parallelising one big loop with <iters 1>*<iters 2>*... independent iterations, which is what DO CONCURRENT is expected to do. Note also that the forall-header permits a type-spec that allows one to define loop indexes inside the header; in that case you will need to surround the whole thing in a BLOCK ... END BLOCK construct to preserve the semantics. You would also need to check whether a scalar-mask-expr exists at the end of the forall-header, and if it does, put that IF ... END IF inside the innermost loop.
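As a concrete (made-up) instance of the template, a two-index masked loop over a real array a(n, m) would be transformed like this:

! Original:
do concurrent (integer :: i = 1:n, j = 1:m, a(i,j) > 0.0)
   a(i,j) = sqrt(a(i,j))
end do

! Transformed (BLOCK needed because the indexes were declared in the header;
! j must be made private explicitly, i is private as the parallel loop index):
block
   integer :: i, j
   !$omp parallel do private(j)
   do i = 1, n
      do j = 1, m
         if (a(i,j) > 0.0) then
            a(i,j) = sqrt(a(i,j))
         end if
      end do
   end do
   !$omp end parallel do
end block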
If you only have array assignments inside the body of the DO CONCURRENT, you could also transform it into FORALL and use the workshare OpenMP directive. That would be much easier than the above.
DO CONCURRENT <forall-header>
<block>
END DO
would become:
!$omp parallel workshare
FORALL <forall-header>
<block>
END FORALL
!$omp end parallel workshare
Given all the above, the only systematic way that I can think of is to go through your source code, search for DO CONCURRENT, and replace each occurrence with one of the above transformed constructs, based on the content of the forall-header and the loop body.
Edit: Usage of the OpenMP workshare directive is currently discouraged. It turns out that at least the Intel Fortran Compiler and GCC serialise FORALL statements and constructs inside OpenMP workshare directives by surrounding them with an OpenMP single directive during compilation, which brings no speedup whatsoever. Other compilers might implement it differently, but it's better to avoid workshare if portable performance is to be achieved.
I'm not sure what you mean by "a way to systematically parallelize do concurrent loops". However, to simply parallelise an ordinary do loop with OpenMP you could just use something like:
!$omp parallel private (i)
!$omp do
do i = 1,10
array(i) = i
end do
!$omp end do
!$omp end parallel
Is this what you are after?
I have a Fortran code like this:
file1.f90
program myprog
use func1mod
do i = 1, N
   call subroutine1
end do
contains
subroutine subroutine1
   integer*8 :: var1, var2, var3, ...
   do j = 1, N
      x = func1(var1, var2, var3, ..)
      ! computations based on x
   end do
   return
end subroutine subroutine1
end program myprog
file2.f90
module func1mod
contains
function func1(var1, var2, var3, ....)
   func1 = some computation based on var1, var2, var3, ...
return
end function func1
end module func1mod
function func1 does not modify any of its arguments. It computes a value based on the arguments and returns it. The number of arguments is large, but the function is less than 30 lines of code. What is the best approach to reduce the function call overhead?
One approach would be to inline the function. Is there any other way out?
The best you can do is to be as explicit as possible about the semantics of the function, turn optimization up as high as possible, and let the compiler make the best decision it can about how to implement the call. Make sure the dummy arguments are marked intent(in) and mark the function as pure; if it's only 30 lines, the compiler will doubtless notice these things anyway at high optimization, but check your compiler options to see if there's anything you can do to encourage inlining.
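As an illustration only (the argument list, kinds, and body are placeholders, since the real ones are not shown in the question), the declarations might look like:

module func1mod
   implicit none
contains
   ! pure + intent(in) tell the compiler the function has no side
   ! effects and never modifies its arguments, which helps inlining.
   pure function func1(var1, var2, var3) result(res)
      integer(kind=8), intent(in) :: var1, var2, var3
      real(kind=8) :: res
      res = real(var1 + var2 * var3, kind=8)   ! placeholder computation
   end function func1
end module func1mod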
Generally the overhead of a procedure call is low. If the function has 30 lines of code, you will probably gain very little, because the work inside the function will dominate the cost of the call. If you want to be sure, measure the runtime of the current implementation, then inline the code and measure that runtime.
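A quick way to take such a measurement is system_clock; the sketch below assumes the placeholder func1mod from the previous snippet:

program timing
   use func1mod                          ! placeholder module from above
   implicit none
   integer(kind=8) :: t0, t1, rate, j
   real(kind=8) :: x
   x = 0.0d0
   call system_clock(t0, rate)
   do j = 1, 10000000_8
      x = x + func1(j, j + 1_8, j + 2_8) ! hypothetical call site
   end do
   call system_clock(t1)
   print *, 'elapsed seconds:', real(t1 - t0, kind=8) / real(rate, kind=8)
   print *, x                            ! use x so the loop isn't optimised away
end program timing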
I have a section of a Fortran90 program that should be parallelized with OpenMP.
!$omp parallel num_threads(8) &
!$omp private(j, s, prop_states) &
!$omp firstprivate(targets, pulses)
! ... modify something in pulses. targets(s)%ham contains pointers to
! elements of pulses ...
do s = 1, n_systems
prop_states(s) = targets(s)%psi_i
call prop(prop_states(s), targets(s)%grid, targets(s)%ham, &
& targets(s)%work, para)
end do
!$omp end parallel
What I'm unsure about is whether complex data structures can be private to each thread (and how this should be done -- is firstprivate correct?). In the example code above, targets is of a somewhat complicated user-defined type, with equally complex sub-fields. For example, targets(s)%ham%op(1)%pulse is a pointer to some element of an array pulses. Also, targets(s)%work contains allocated space to be used as work arrays in Fast-Fourier-Transforms.
Obviously, every thread needs to maintain an independent copy both of targets and of pulses, and maintain the pointers between the two independently. It seems to me that this might be asking a little bit too much from the automatic memory management of OpenMP. Is this correct, or should this work out of the box?
The alternative of course is to create copies of the original data within each thread (stored in an array), and use this private copied data in the call to prop.
From my reading of the OpenMP 2.5 standard, you can't use the targets of Fortran pointers in private (or firstprivate or threadprivate) clauses, which seems to rule out your code. Having said that, it's not something I've ever tried in OpenMP, so if you bash ahead and get anywhere, do let us know.
And firstprivate is correct if your private variables are to be initialised, on entry into the parallel region, with the values that the variables of the same name had just before the region.
I guess you will probably have to implement your plan B.
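For what it's worth, here is a compact, self-contained sketch of that plan B; target_t and prop are deliberately simplified stand-ins for the real derived types and routines in the question. Each thread copies the pulse data, re-aims the pointers at its private copy, and only then runs the work loop:

program plan_b
   implicit none
   integer, parameter :: n_systems = 4, np = 8
   type target_t
      real(kind=8), pointer :: pulse(:) => null()
   end type target_t
   real(kind=8), allocatable, target :: pulses(:), my_pulses(:)
   type(target_t) :: targets(n_systems), my_targets(n_systems)
   integer :: s

   allocate(pulses(np))
   pulses = 1.0d0
   do s = 1, n_systems
      targets(s)%pulse => pulses         ! shared data, as in the question
   end do

   !$omp parallel private(s, my_targets, my_pulses)
   my_pulses = pulses                    ! per-thread deep copy of the pulse data
   do s = 1, n_systems
      my_targets(s)%pulse => my_pulses   ! re-aim pointers at the private copy
   end do
   !$omp do
   do s = 1, n_systems
      call prop(my_targets(s))           ! each thread touches only its own copy
   end do
   !$omp end do
   !$omp end parallel

contains

   subroutine prop(t)                    ! stand-in for the real prop
      type(target_t), intent(inout) :: t
      t%pulse = t%pulse * 2.0d0
   end subroutine prop

end program plan_b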