OpenMp with fortran : why multiples DO loops are faster than workshare [duplicate] - parallel-processing

The fortran 2008 do concurrent construct is a do loop that tells the compiler that no iteration affect any other. It can thus be parallelized safely.
A valid example:
program main
implicit none
integer :: i
integer, dimension(10) :: array
do concurrent( i= 1: 10)
array(i) = i
end do
end program main
where iterations can be done in any order. You can read more about it here.
To my knowledge, gfortran does not automatically parallelize these do concurrent loops, while I remember a gfortran-diffusion-list mail about doing it (here). It justs transform them to classical do loops.
My question: Do you know a way to systematically parallelize do concurrent loops? For instance with a systematic openmp syntax?

It is not that easy to do it automatically. The DO CONCURRENT construct has a forall-header which means that it could accept multiple loops, index variables definition and a mask. Basically, you need to replace:
DO CONCURRENT([<type-spec> :: ]<forall-triplet-spec 1>, <forall-triplet-spec 2>, ...[, <scalar-mask-expression>])
<block>
END DO
with:
[BLOCK
<type-spec> :: <indexes>]
!$omp parallel do
DO <forall-triplet-spec 1>
DO <forall-triplet-spec 2>
...
[IF (<scalar-mask-expression>) THEN]
<block>
[END IF]
...
END DO
END DO
!$omp end parallel do
[END BLOCK]
(things in square brackets are optional, based on the presence of the corresponding parts in the forall-header)
Note that this would not be as effective as parallelising one big loop with <iters 1>*<iters 2>*... independent iterations which is what DO CONCURRENT is expected to do. Note also that forall-header permits a type-spec that allows one to define loop indexes inside the header and you will need to surround the whole thing in BLOCK ... END BLOCK construct to preserve the semantics. You would also need to check if scalar-mask-expr exists at the end of the forall-header and if it does you should also put that IF ... END IF inside the innermost loop.
If you only have array assignments inside the body of the DO CONCURRENT you would could also transform it into FORALL and use the workshare OpenMP directive. It would be much easier than the above.
DO CONCURRENT <forall-header>
<block>
END DO
would become:
!$omp parallel workshare
FORALL <forall-header>
<block>
END FORALL
!$omp end parallel workshare
Given all the above, the only systematic way that I can think about is to systematically go through your source code, searching for DO CONCURRENT and systematically replacing it with one of the above transformed constructs based on the content of the forall-header and the loop body.
Edit: Usage of OpenMP workshare directive is currently discouraged. It turns out that at least Intel Fortran Compiler and GCC serialise FORALL statements and constructs inside OpenMP workshare directives by surrounding them with OpenMP single directive during compilation which brings no speedup whatsoever. Other compilers might implement it differently but it's better to avoid its usage if portable performance is to be achieved.

I'm not sure what you mean "a way to systematically parallelize do concurrent loops". However, to simply parallelise an ordinary do loop with OpenMP you could just use something like:
!$omp parallel private (i)
!$omp do
do i = 1,10
array(i) = i
end do
!$omp end do
!$omp end parallel
Is this what you are after?

Related

Can reading a variable be a data race in OpenMP?

Why does this OpenMP fortran program work (every element of out is equal to num)? Each thread in the parallel loop might read the variable num simultaneously. I thought this was not acceptable?
program example
implicit none
integer i
integer, parameter :: n = 100000
double precision :: num
double precision, dimension(n) :: out
num = 1.123456789123456789123456d-5
out = 0.d0
!$OMP PARALLEL
!$OMP DO
do i=1,n
out(i) = num
enddo
!$OMP END DO
!$OMP END PARALLEL
do i=1,n
if (out(i).ne.num) print*,'Problem with ',i
enddo
end program
Thanks so much for any insights.
Can reading a variable be a data race in OpenMP?
Any race is between two things happening, so a read can be part of a race. However for the competition between two actions to be a race, there has to be a different outcome depending on the order in which the two actions occur.
Given that the possible actions in a parallel program which we are considering are read and write occurring in different threads, we have four possible cases:
Read, Read: no values are changed, and no code can detect which order the two reads occurred in (at least, not without looking at meta-data such as code performance in a system with caches :-)).
Read, Write: this clearly can be a race; whether the write wins the race or not affects the value which will be read.
Write, Read: as with case 2 (Read,Write), the result seen by the read is affected by the order.
Write, Write: here we have a race too, since we asssume that someone will ultimately read the value, and which value they see will depend on the order of the writes.
So, reading a variable can be part of a race.
However, if your question is really "Is there a race if a variable is only read?", then the answer is "No".
Variables are shared by default in openMP so they are accessible from all the threads. Furthermore, you're not writing to num so even if all the threads were accessing the same memory (which here they probably aren't) there would be no issue.

Sequential loop inside parallel regions with OpenMP

I have three nested loops. I want to parallelize the middle loop as such:
do a = 1,amax
!$omp parallel do private(c)
do b = 1,bmax
do c = 1,cmax
call mysubroutine(b,c)
end do
end do
!$omp end parallel do
end do
However this creates a problem, in that for every iteration of the a loop, threads are spawned, run through the inner loops, then terminate. I assume this is causing an excessive amount of overhead, as the inner loops do not take long to execute (~ 10^-4 s). So I would like to spawn the threads only once. How can I spawn the threads before starting the a loop while still executing the a loop sequentially? Due to the nature of the code, each iteration of the a loop must be complete before the next can be executed. For example, clearly this would not work:
!$omp parallel private(c)
do a = 1,amax
!$omp do
do b = 1,bmax
do c = 1,cmax
call mysubroutine(b,c)
end do
end do
!$omp end do
end do
!$omp end parallel
because all of the threads will attempt to execute the a loop. Any help appreciated.
"For example, clearly this would not work"
That is not only not clear, that is completely incorrect. The code you show is exactly what you should do (better with private(a)).
"because all of the threads will attempt to execute the a loop"
Of course they will and they have to! All of them have to execute it if they are supposed to take part in worksharing in the omp do inner loop! If they don't execute it, they simply won't be there to help with inner loop.
A different remark: you may benefit from the collapse(2) clause for the omp do nested loop.
A good way to assert that "this is causing an excessive amount of overhead" is to evaluate the scaling using different number of threads.
1s is a long time, respawing threads isn't so costly…

Newbie OpenACC issue with CYCLE instruction in Fortran

quite newbie with OpenACC here, so please be patient :-)
I'm trying to port some Fortran code to use OpenACC, and I'm finding a strange (at least to me) behaviour.
The code is given below, but as you can see is just some nested loops which ultimately update the variable zc, which I copyout. I have tried to make private copies where I think they are needed and for the moment specified that all loops are independent. Now, when I compile with and without OpenACC all is fine if I remove the line "if(mu2-mup2.ne.q2) cycle", but if that line is present, then the results for the zc calculated with OpenACC are very different to those calculated without OpenACC.
Any ideas why that line could be giving me trouble?
Many thanks in advance,
Ángel de Vicente
!$acc data copyout(zc)
!$acc update device(fact)
!$acc kernels
!$acc loop independent private(k2)
do k=kmin,kmax
k2=2*k
!$acc loop independent private(km,kp2,z0)
do kp=kmin,kmax
km = MIN(k,kp)
kp2=2*kp
z0=3.d0*dble(ju2+1)*dsqrt(dble(k2+1))*dsqrt(dble(kp2+1))
!$acc loop independent private(q2)
do q=-km,km
q2=2*q
!$acc loop independent
do mu2=-ju2,ju2,2
!$acc loop independent private(p2,z1)
do ml2=-jl2,jl2,2
p2=mu2-ml2
if(iabs(p2).gt.2) cycle
z1=w3js(ju2,jl2,2,mu2,-ml2,-p2)
!$acc loop independent private(pp2,z2)
do mup2=-ju2,ju2,2
if(mu2-mup2.ne.q2) cycle
pp2=mup2-ml2
if(iabs(pp2).gt.2) cycle
z2=w3js(ju2,jl2,2,mup2,-ml2,-pp2)
!$acc loop independent
do mlp2=-jl2,jl2,2
zc(ml2,mlp2,mu2,mup2,k,kp,q) = z2
enddo
enddo
enddo
enddo
end do
end do
end do
!$acc end kernels
!$acc end data
Without a reproducing example it's difficult to give a complete answer, but I'll do my best.
First, there are only three parallel dimensions in OpenACC: gang, worker, and vector. Hence, the compiler will need to ignore 4 of the 7 loop directives. Most likely the middle 4 (if using PGI, you can see which loops the compiler is parallelizing from the compiler feedback messages, i.e. -Minfo=accel). Not that you can't parallelize all the loops, but you'd need to make them tightly nested and then use the collapse clause to collapse them into a single parallel loop.
Also since scalars are private by default, there's no need to put them into a private clause (except for a few cases). While putting them in a private clause shouldn't impact correctness, it can cause performance slow downs since you'd be fetching the private copy from global memory rather than having the potential of the scalar being put into a register.
My guess is that the inner loops are not that large so may not be beneficial to parallelize. Hence, I would first try removing all the inner "loop" directives, and only parallelize the "k" and "kp" loops. Depending of the values of "kmin" and "kmax", I'd try collapsing them as well. Something like:
!$acc loop independent collapse(2)
do k=kmin,kmax
do kp=kmin,kmax
k2=2*k
km = MIN(k,kp)
Assuming that gets you the correct answers but not as much parallelism as you want, you can then try collapsing the middle two loops:
!$acc loop independent collapse(2)
do q=-km,km
do mu2=-ju2,ju2,2
q2=2*q
do ml2=-jl2,jl2,2
I wouldn't recommend parallelizing loops with cycles in them. Not that you can't, but doing so would hurt performance due to thread divergence.
If none of this helps, please post a full reproducing example.

Issue with common block in OpenMP parallel programming

I have a few questions about using common blocks in parallel programming in Fortran.
My subroutines have common blocks. Do I have to declare all the common blocks and threadprivate in the parallel do region?
How do they pass information? I want seperate common clock for each thread and want them to pass information through the end of parallel region. Does it happen here?
My Ford subroutine changes some variables in common blocks and Condact subroutine overwrites over them again but the function uses the values from Condact subroutine. Do the second subroutine and function copy the variables from the previous subroutine for each thread?
program
...
! Loop which I want to parallelize
!$OMP parallel DO
!do I need to declear all common block and threadprivate them here?
I = 1, N
...
call FORD(i,j)
...
!$OMP END parallel DO
end program
subroutine FORD(i,j)
dimension zl(3),zg(3)
common /ellip/ b1,c1,f1,g1,h1,d1,
. b2,c2,f2,g2,h2,p2,q2,r2,d2
common /root/ root1,root2
!$OMP threadprivate (/ellip/,/root/)
!this subroutine rewrite values of b1, c1 and f1 variable.
CALL CONDACT(genflg,lapflg)
return
end subroutine
SUBROUTINE CONDACT(genflg,lapflg)
common /ellip/ b1,c1,f1,g1,h1,d1,b2,c2,f2,g2,h2,p2,q2,r2,d2
!$OMP threadprivate (/ellip/)
! this subroutine rewrite b1, c1 and f1 again
call function f(x)
RETURN
END
function f(x)
common /ellip/ b1,c1,f1,g1,h1,d1,
. b2,c2,f2,g2,h2,p2,q2,r2,d2
!$OMP threadprivate (/ellip/)
! here the function uses the value of b1, c1, f1 from CONDAT subroutine.
end
Firstly as the comment above says I would strongly advise against the use of common especially in modern code, and mixing global data and parallelism is just asking for a world of pain - in fact global data is just a bad idea full stop.
OK, your questions:
My subroutines has common blocks. Do I have to declare all the
common block and threadprivate in the parallel do region?
No,threadprivate is a declarative directive, and should be used only in the declarative part of the code, and it must appear after every declaration.
How do they pass information? I want seperate common clock for each
thread and want them to pass information through the end of parallel
region. Does it happen here?
As you suspect each thread will gets its own version of the common block. When you enter the first parallel region the values in the block will be undefined, unless you use copyin to broadcast the values from the master thread. For subsequent parallel regions the values will be retained as long as the number of threads used in each region is the same. Between regions the values in the common block will be those of the master thread.
Are those common block accessible through the subroutine? My Ford subroutine rewrite some variables in common block and Condat
subroutine rewrite over them again but the function uses the values
from Condat subroutine. Is that possible rewrite and pass the common
block variable using threadprivate here?
I have to admit I am unsure what you are asking here. But if you are asking whether common can be used to communicate variables between different sub-programs in OpenMP code, the answer is yes, just as in serial Fortran (note capitalisation)
How about converting the common blocks into modules?
Change common /root/ root1, root2 to use gammax, then make a new file root.f that contains:
module root
implicit none
save
real :: root1, root2
!$omp threadprivate( root1, root2 )
end module root

Parallelizing fortran 2008 `do concurrent` systematically, possibly with openmp

The fortran 2008 do concurrent construct is a do loop that tells the compiler that no iteration affect any other. It can thus be parallelized safely.
A valid example:
program main
implicit none
integer :: i
integer, dimension(10) :: array
do concurrent( i= 1: 10)
array(i) = i
end do
end program main
where iterations can be done in any order. You can read more about it here.
To my knowledge, gfortran does not automatically parallelize these do concurrent loops, while I remember a gfortran-diffusion-list mail about doing it (here). It justs transform them to classical do loops.
My question: Do you know a way to systematically parallelize do concurrent loops? For instance with a systematic openmp syntax?
It is not that easy to do it automatically. The DO CONCURRENT construct has a forall-header which means that it could accept multiple loops, index variables definition and a mask. Basically, you need to replace:
DO CONCURRENT([<type-spec> :: ]<forall-triplet-spec 1>, <forall-triplet-spec 2>, ...[, <scalar-mask-expression>])
<block>
END DO
with:
[BLOCK
<type-spec> :: <indexes>]
!$omp parallel do
DO <forall-triplet-spec 1>
DO <forall-triplet-spec 2>
...
[IF (<scalar-mask-expression>) THEN]
<block>
[END IF]
...
END DO
END DO
!$omp end parallel do
[END BLOCK]
(things in square brackets are optional, based on the presence of the corresponding parts in the forall-header)
Note that this would not be as effective as parallelising one big loop with <iters 1>*<iters 2>*... independent iterations which is what DO CONCURRENT is expected to do. Note also that forall-header permits a type-spec that allows one to define loop indexes inside the header and you will need to surround the whole thing in BLOCK ... END BLOCK construct to preserve the semantics. You would also need to check if scalar-mask-expr exists at the end of the forall-header and if it does you should also put that IF ... END IF inside the innermost loop.
If you only have array assignments inside the body of the DO CONCURRENT you would could also transform it into FORALL and use the workshare OpenMP directive. It would be much easier than the above.
DO CONCURRENT <forall-header>
<block>
END DO
would become:
!$omp parallel workshare
FORALL <forall-header>
<block>
END FORALL
!$omp end parallel workshare
Given all the above, the only systematic way that I can think about is to systematically go through your source code, searching for DO CONCURRENT and systematically replacing it with one of the above transformed constructs based on the content of the forall-header and the loop body.
Edit: Usage of OpenMP workshare directive is currently discouraged. It turns out that at least Intel Fortran Compiler and GCC serialise FORALL statements and constructs inside OpenMP workshare directives by surrounding them with OpenMP single directive during compilation which brings no speedup whatsoever. Other compilers might implement it differently but it's better to avoid its usage if portable performance is to be achieved.
I'm not sure what you mean "a way to systematically parallelize do concurrent loops". However, to simply parallelise an ordinary do loop with OpenMP you could just use something like:
!$omp parallel private (i)
!$omp do
do i = 1,10
array(i) = i
end do
!$omp end do
!$omp end parallel
Is this what you are after?

Resources