Recently I had to convert a serial program written in Fortran to a parallel version to get results faster, but I ran into some problems.
I'm using Ubuntu with the gfortran compiler, and OpenMP as the parallel API. In the serial version I use many modules to share data; in the OpenMP version I give those variables the threadprivate attribute, and some of them are also allocatable. In the serial version I allocate the arrays before the do loop, but if I do the same in the OpenMP version the program aborts with an invalid memory reference, even though the variables are threadprivate. So instead I allocate and deallocate the arrays inside the loop, and I placed that do loop in a parallel region. It gives no error and the program runs. But there is another problem: after about 800 min of CPU time, the ps -ux command shows the status of the parallel program change from Rl to Sl. I searched for the meaning of S, and it represents
Interruptible sleep (waiting for an event to complete)
Why does this happen? Is it because I frequently allocate and free memory? The following is the sample code:
module variables
real, dimension(:), allocatable, save :: a
real, dimension(:,:), allocatable, save :: b
!$omp threadprivate(a,b)
integer, parameter :: n=100
contains
subroutine alloc_var
integer :: status
allocate(a(n),stat=status)
allocate(b(n,n),stat=status)
end subroutine
subroutine free_var
integer :: status
deallocate(a,stat=status)
deallocate(b,stat=status)
end subroutine
end module
some other subroutines use the variables a and b:
subroutine cal_sth
use variables, only: a
...
end subroutine
for the serial version, the main program is
program main
use variables, only: alloc_var, free_var
implicit none
external :: cal_sth
integer :: i, j
call alloc_var
do j=1, count1
...
other expresion ...
do i=1, count2
call cal_sth
end do
end do
call free_var
end program
for the parallel version,
program main
use variables, only: alloc_var, free_var
implicit none
external :: cal_sth
integer :: i,j
!$omp parallel do private(i,j)
do j=1, count1
...
other expression ...
do i=1, count2
call alloc_var
call cal_sth
if (logical expression) then
call free_var
cycle
end if
call free_var
end do
end do
!$omp end parallel do
end program
Either split the combined parallel do directive and rewrite the parallel loop like this:
!$omp parallel
call alloc_var
!$omp do
do i=1, count
call cal_sth
end do
!$omp end do
call free_var
!$omp end parallel
or use dedicated parallel regions as per Gilles' comment:
program main
use variables, only: alloc_var, free_var
implicit none
external :: cal_sth
integer :: i
!$omp parallel
call alloc_var
!$omp end parallel
...
!$omp parallel do
do i=1, count
call cal_sth
end do
!$omp end parallel do
...
! other OpenMP regions
...
!$omp parallel
call free_var
!$omp end parallel
end program
With your updated code, I think you have two different paths to explore for improving performance:
The memory allocation: as previously mentioned, the calls to alloc_var and free_var only need to be made inside a parallel region, but definitely not inside the do loop. By splitting the parallel do into a parallel followed by a do, you gain room to call alloc_var before entering the loop and free_var after exiting it. The potential early exit from the inner loop, which may require a release / re-allocation of the memory, isn't by itself a constraint preventing you from doing this (see the code below for an example of how this can be done).
The scheduling: the early termination of some of your inner iterations might translate into load imbalance between threads. This could explain the waiting times you experience. Explicitly setting the schedule to dynamic might reduce this effect and improve performance. You will need to experiment a bit to find the best scheduling policy, but dynamic seems a good starting point.
So here is your code as it could look once these two ideas are implemented:
program main
use variables, only: alloc_var, free_var
implicit none
external :: cal_sth
integer :: i,j
!$omp parallel
call alloc_var
!$omp do schedule(dynamic) private(i,j)
do j=1, count1
...
other expression ...
do i=1, count2
call cal_sth
if (logical expression) then
! uncomment these only if needed for some reason
!call free_var
!call alloc_var
cycle
end if
end do
end do
!$omp end do
call free_var
!$omp end parallel
end program
Related
I have a Fortran program using OpenMP, which is simplified below:
program test
use omp_lib
use function_module
use data_module
implicit none
real*8 :: theta,f
call readdata()
call func_subroutine(theta,f)
end program
subroutine func_subroutine(theta,f)
use omp_lib
use data_module
implicit none
integer :: i
real*8 :: theta,f
real*8 :: t1,t2,t_vec(20),tsum
real*8 :: f_ind
!$omp parallel do shared(data_variables,t_vec) private(i,t1,t2,f_ind)
do i=1,20
! read individual data
! count time
t1=omp_get_wtime()
call func(theta,f_ind)
t2=omp_get_wtime()
t_vec(i)=t2-t1
end do
!$omp end parallel do
tsum=sum(t_vec)
end subroutine
In short, I calculate the function value for individual 1 (f_ind) 20 times, and tsum only counts the time spent calculating f_ind.
With 5 threads, tsum = 7.2 s and the program takes 33 s (not counting the time spent in the subroutine readdata()).
However, with 10 threads, tsum = 9.5 s and the program takes 20 s.
And with 20 threads, tsum = 12 s and the program takes 12 s.
The computation time does not improve proportionally to the number of threads I am using. I am very confused. Is there something wrong?
I have three nested loops. I want to parallelize the middle loop as such:
do a = 1,amax
!$omp parallel do private(c)
do b = 1,bmax
do c = 1,cmax
call mysubroutine(b,c)
end do
end do
!$omp end parallel do
end do
However this creates a problem, in that for every iteration of the a loop, threads are spawned, run through the inner loops, then terminate. I assume this is causing an excessive amount of overhead, as the inner loops do not take long to execute (~ 10^-4 s). So I would like to spawn the threads only once. How can I spawn the threads before starting the a loop while still executing the a loop sequentially? Due to the nature of the code, each iteration of the a loop must be complete before the next can be executed. For example, clearly this would not work:
!$omp parallel private(c)
do a = 1,amax
!$omp do
do b = 1,bmax
do c = 1,cmax
call mysubroutine(b,c)
end do
end do
!$omp end do
end do
!$omp end parallel
because all of the threads will attempt to execute the a loop. Any help appreciated.
"For example, clearly this would not work"
That is not only not clear, that is completely incorrect. The code you show is exactly what you should do (better with private(a)).
"because all of the threads will attempt to execute the a loop"
Of course they will, and they have to! All of them have to execute it if they are supposed to take part in worksharing in the omp do inner loop! If they don't execute it, they simply won't be there to help with the inner loop.
A different remark: you may benefit from the collapse(2) clause for the omp do nested loop.
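To illustrate both remarks together, here is a minimal, self-contained sketch (loop bounds and the counting body are invented for illustration; the question calls mysubroutine instead) of the "spawn once" pattern with collapse(2) on the inner loop nest:

```fortran
program collapse_demo
   implicit none
   integer, parameter :: amax = 2, bmax = 3, cmax = 4
   integer :: a, b, c
   integer :: total
   total = 0
   !$omp parallel private(a)
   do a = 1, amax               ! every thread executes the a loop...
      !$omp do collapse(2) reduction(+:total)
      do b = 1, bmax            ! ...but shares the bmax*cmax collapsed
         do c = 1, cmax         ! iteration space of the two inner loops
            total = total + 1
         end do
      end do
      !$omp end do              ! implicit barrier keeps a iterations ordered
   end do
   !$omp end parallel
   print *, total               ! amax*bmax*cmax = 24
end program collapse_demo
```

Note that collapse requires the two loops to be perfectly nested, and the collapsed loop indices b and c are implicitly private.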
A good way to test whether "this is causing an excessive amount of overhead" is to evaluate the scaling using different numbers of threads.
1 s is a long time; respawning threads isn't so costly…
I have a few questions about using common blocks in parallel programming in Fortran.
My subroutines have common blocks. Do I have to declare all the common blocks as threadprivate in the parallel do region?
How do they pass information? I want a separate common block for each thread and want them to pass information beyond the end of the parallel region. Does that happen here?
My Ford subroutine changes some variables in common blocks and the Condact subroutine overwrites them again, but the function uses the values from the Condact subroutine. Do the second subroutine and the function copy the variables from the previous subroutine for each thread?
program
...
! Loop which I want to parallelize
!$OMP parallel DO
!do I need to declare all common blocks and threadprivate them here?
do I = 1, N
...
call FORD(i,j)
...
end do
!$OMP END parallel DO
end program
subroutine FORD(i,j)
dimension zl(3),zg(3)
common /ellip/ b1,c1,f1,g1,h1,d1,
. b2,c2,f2,g2,h2,p2,q2,r2,d2
common /root/ root1,root2
!$OMP threadprivate (/ellip/,/root/)
!this subroutine rewrites the values of the b1, c1 and f1 variables.
CALL CONDACT(genflg,lapflg)
return
end subroutine
SUBROUTINE CONDACT(genflg,lapflg)
common /ellip/ b1,c1,f1,g1,h1,d1,b2,c2,f2,g2,h2,p2,q2,r2,d2
!$OMP threadprivate (/ellip/)
! this subroutine rewrite b1, c1 and f1 again
result = f(x)
RETURN
END
function f(x)
common /ellip/ b1,c1,f1,g1,h1,d1,
. b2,c2,f2,g2,h2,p2,q2,r2,d2
!$OMP threadprivate (/ellip/)
! here the function uses the values of b1, c1, f1 from the CONDACT subroutine.
end
Firstly as the comment above says I would strongly advise against the use of common especially in modern code, and mixing global data and parallelism is just asking for a world of pain - in fact global data is just a bad idea full stop.
OK, your questions:
My subroutines have common blocks. Do I have to declare all the
common blocks as threadprivate in the parallel do region?
No, threadprivate is a declarative directive and should be used only in the declarative part of the code; it must appear after the declaration of the common block in every routine that uses that block.
How do they pass information? I want a separate common block for each
thread and want them to pass information beyond the end of the parallel
region. Does that happen here?
As you suspect, each thread gets its own copy of the common block. When you enter the first parallel region the values in the block will be undefined, unless you use copyin to broadcast the values from the master thread. For subsequent parallel regions the values will be retained as long as the number of threads used in each region is the same. Between regions the values in the common block are those of the master thread.
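As a minimal sketch of the copyin behaviour (variable names borrowed from the question's /ellip/ block, values hypothetical):

```fortran
      program demo_copyin
      common /ellip/ b1, c1, f1
!$omp threadprivate (/ellip/)
      b1 = 1.0
      c1 = 2.0
      f1 = 3.0
!$omp parallel copyin(/ellip/)
      ! copyin broadcasts the master thread's values at region entry,
      ! so every thread starts with b1=1.0, c1=2.0, f1=3.0;
      ! changes made here stay private to each thread
      b1 = b1 + 1.0
!$omp end parallel
      ! after the region, the master thread sees its own copy of b1
      end
```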
Are those common blocks accessible through the subroutines? My Ford subroutine rewrites some variables in a common block and the Condact
subroutine rewrites over them again, but the function uses the values
from the Condact subroutine. Is it possible to rewrite and pass the common
block variables using threadprivate here?
I have to admit I am unsure what you are asking here. But if you are asking whether common can be used to communicate variables between different sub-programs in OpenMP code, the answer is yes, just as in serial Fortran.
How about converting the common blocks into modules?
Change common /root/ root1, root2 to use root, then make a new file root.f that contains:
module root
implicit none
save
real :: root1, root2
!$omp threadprivate( root1, root2 )
end module root
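A routine would then pick the module up in place of the common block; a hedged sketch (the subroutine body is invented for illustration):

```fortran
subroutine ford(i, j)
   use root            ! replaces: common /root/ root1, root2
   implicit none
   integer, intent(in) :: i, j
   root1 = real(i)     ! each thread writes its own threadprivate copy
   root2 = real(j)
end subroutine ford
```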
The Fortran 2008 do concurrent construct is a do loop that tells the compiler that no iteration affects any other. It can thus be parallelized safely.
A valid example:
program main
implicit none
integer :: i
integer, dimension(10) :: array
do concurrent( i= 1: 10)
array(i) = i
end do
end program main
where iterations can be done in any order. You can read more about it here.
To my knowledge, gfortran does not automatically parallelize these do concurrent loops, though I remember a gfortran-diffusion-list mail about doing it (here). It just transforms them into classical do loops.
My question: Do you know a way to systematically parallelize do concurrent loops? For instance with a systematic openmp syntax?
It is not that easy to do it automatically. The DO CONCURRENT construct has a forall-header which means that it could accept multiple loops, index variables definition and a mask. Basically, you need to replace:
DO CONCURRENT([<type-spec> :: ]<forall-triplet-spec 1>, <forall-triplet-spec 2>, ...[, <scalar-mask-expression>])
<block>
END DO
with:
[BLOCK
<type-spec> :: <indexes>]
!$omp parallel do
DO <forall-triplet-spec 1>
DO <forall-triplet-spec 2>
...
[IF (<scalar-mask-expression>) THEN]
<block>
[END IF]
...
END DO
END DO
!$omp end parallel do
[END BLOCK]
(things in square brackets are optional, based on the presence of the corresponding parts in the forall-header)
Note that this would not be as effective as parallelising one big loop with <iters 1>*<iters 2>*... independent iterations which is what DO CONCURRENT is expected to do. Note also that forall-header permits a type-spec that allows one to define loop indexes inside the header and you will need to surround the whole thing in BLOCK ... END BLOCK construct to preserve the semantics. You would also need to check if scalar-mask-expr exists at the end of the forall-header and if it does you should also put that IF ... END IF inside the innermost loop.
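As a concrete instance of the recipe above (array names, bounds and mask invented for illustration), a masked two-index DO CONCURRENT would be rewritten like this; since i and j are declared in the enclosing scope here, the BLOCK wrapper can be dropped:

```fortran
program do_concurrent_rewrite
   implicit none
   integer, parameter :: n = 4, m = 3
   real :: a(n,m), b(n,m)
   logical :: mask(n,m)
   integer :: i, j
   b = 1.0
   mask = .true.
   ! original:
   !   do concurrent (i = 1:n, j = 1:m, mask(i,j))
   !      a(i,j) = b(i,j) + 1.0
   !   end do
   !$omp parallel do private(j)
   do i = 1, n
      do j = 1, m
         if (mask(i,j)) then     ! the scalar-mask-expression becomes an IF
            a(i,j) = b(i,j) + 1.0
         end if
      end do
   end do
   !$omp end parallel do
   print *, a(1,1)               ! 2.0
end program do_concurrent_rewrite
```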
If you only have array assignments inside the body of the DO CONCURRENT, you could also transform it into FORALL and use the workshare OpenMP directive. It would be much easier than the above.
DO CONCURRENT <forall-header>
<block>
END DO
would become:
!$omp parallel workshare
FORALL <forall-header>
<block>
END FORALL
!$omp end parallel workshare
Given all the above, the only systematic way that I can think about is to systematically go through your source code, searching for DO CONCURRENT and systematically replacing it with one of the above transformed constructs based on the content of the forall-header and the loop body.
Edit: Usage of OpenMP workshare directive is currently discouraged. It turns out that at least Intel Fortran Compiler and GCC serialise FORALL statements and constructs inside OpenMP workshare directives by surrounding them with OpenMP single directive during compilation which brings no speedup whatsoever. Other compilers might implement it differently but it's better to avoid its usage if portable performance is to be achieved.
I'm not sure what you mean "a way to systematically parallelize do concurrent loops". However, to simply parallelise an ordinary do loop with OpenMP you could just use something like:
!$omp parallel private (i)
!$omp do
do i = 1,10
array(i) = i
end do
!$omp end do
!$omp end parallel
Is this what you are after?