OpenMP slows down near the end of execution

I converted a program to a parallel version, but ran into a problem. The program is large, and I tried to parallelize it at the outermost level. It is written in Fortran, and the following is an outline of my code.
program main
use omp_lib
implicit none
! declare some variables...
call omp_set_num_threads(n)
!$OMP parallel do private(a,b,c,d)
do i=1,5000000
call sub1(a,b,c)
b=c+d;c=b+d;...
call sub2(b,c,d)
if (logical_expression) cycle
call sub3()
call sub4()
enddo
end program
There are some conditions under which the outermost loop cycles, and these conditions mostly occur when i is small. When I printed out which iteration i was being executed, I found that near the end of the execution the values of i were consecutive and large. Printing i to the screen also showed that, near the end of the execution, my program ran as slowly as the sequential version. Does anyone know why this happens and how to solve it?
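One possible explanation, which the post itself does not confirm, is load imbalance. OpenMP leaves the default loop schedule implementation-defined, and many implementations use a static schedule that hands each thread one contiguous block of iterations. If the cheap (cycling) iterations cluster at small i, the threads that own those blocks finish early and only the threads holding the large-i blocks keep working. The sketch below assumes free-form source and uses a made-up workload (the inner `sin` loop) and an arbitrary chunk size of 100; it only illustrates how a `schedule(dynamic, ...)` clause is written.

```fortran
program schedule_sketch
  implicit none
  integer, parameter :: n = 20000        ! hypothetical iteration count
  integer :: i, j
  double precision :: total, x

  total = 0.d0
  ! schedule(dynamic, 100): idle threads fetch the next chunk of 100
  ! iterations instead of owning one fixed contiguous block.
  !$omp parallel do schedule(dynamic, 100) private(j, x) reduction(+:total)
  do i = 1, n
     if (i <= n/2) cycle                 ! stand-in for the cheap iterations at small i
     x = 0.d0
     do j = 1, 1000                      ! stand-in for sub3/sub4 (hypothetical work)
        x = x + sin(dble(i + j))
     end do
     total = total + x
  end do
  !$omp end parallel do

  print *, 'total =', total
end program schedule_sketch
```

Dynamic scheduling adds some overhead per chunk, so the chunk size is a tuning parameter; `schedule(guided)` is another option worth timing.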

Related

Can reading a variable be a data race in OpenMP?

Why does this OpenMP fortran program work (every element of out is equal to num)? Each thread in the parallel loop might read the variable num simultaneously. I thought this was not acceptable?
program example
implicit none
integer i
integer, parameter :: n = 100000
double precision :: num
double precision, dimension(n) :: out
num = 1.123456789123456789123456d-5
out = 0.d0
!$OMP PARALLEL
!$OMP DO
do i=1,n
out(i) = num
enddo
!$OMP END DO
!$OMP END PARALLEL
do i=1,n
if (out(i).ne.num) print*,'Problem with ',i
enddo
end program
Thanks so much for any insights.
Can reading a variable be a data race in OpenMP?
Any race is between two things happening, so a read can be part of a race. However for the competition between two actions to be a race, there has to be a different outcome depending on the order in which the two actions occur.
Given that the possible actions in a parallel program which we are considering are read and write occurring in different threads, we have four possible cases:
1. Read, Read: no values are changed, and no code can detect which order the two reads occurred in (at least, not without looking at meta-data such as code performance in a system with caches :-)).
2. Read, Write: this clearly can be a race; whether the write wins the race or not affects the value which will be read.
3. Write, Read: as with case 2 (Read, Write), the result seen by the read is affected by the order.
4. Write, Write: here we have a race too, since we assume that someone will ultimately read the value, and which value they see will depend on the order of the writes.
So, reading a variable can be part of a race.
However, if your question is really "Is there a race if a variable is only read?", then the answer is "No".
Variables are shared by default in OpenMP, so they are accessible from all the threads. Furthermore, you're not writing to num, so even if all the threads were accessing the same memory (which here they probably aren't) there would be no issue.
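To make the distinction concrete, here is a minimal sketch (not from the original posts) in which `num` is only read, so sharing it is safe, while `total` is written by every thread and therefore needs a `reduction` clause. The variable `total` and the value assigned to `num` are assumptions chosen for illustration.

```fortran
program race_sketch
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  double precision :: num, total

  num = 2.d0
  total = 0.d0

  ! num is shared and only read: the Read, Read case, no race.
  ! total is updated by every thread: without reduction(+:total) this
  ! would be the Write, Write / Read, Write case and the result would
  ! depend on thread timing.
  !$omp parallel do reduction(+:total)
  do i = 1, n
     total = total + num
  end do
  !$omp end parallel do

  if (total /= dble(n) * num) error stop 'unexpected sum'
  print *, 'total =', total
end program race_sketch
```

Removing the `reduction` clause leaves the reads of `num` just as safe as before; only the updates to `total` become racy.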

Newbie OpenACC issue with CYCLE instruction in Fortran

quite newbie with OpenACC here, so please be patient :-)
I'm trying to port some Fortran code to use OpenACC, and I'm finding a strange (at least to me) behaviour.
The code is given below, but as you can see is just some nested loops which ultimately update the variable zc, which I copyout. I have tried to make private copies where I think they are needed and for the moment specified that all loops are independent. Now, when I compile with and without OpenACC all is fine if I remove the line "if(mu2-mup2.ne.q2) cycle", but if that line is present, then the results for the zc calculated with OpenACC are very different to those calculated without OpenACC.
Any ideas why that line could be giving me trouble?
Many thanks in advance,
Ángel de Vicente
!$acc data copyout(zc)
!$acc update device(fact)
!$acc kernels
!$acc loop independent private(k2)
do k=kmin,kmax
k2=2*k
!$acc loop independent private(km,kp2,z0)
do kp=kmin,kmax
km = MIN(k,kp)
kp2=2*kp
z0=3.d0*dble(ju2+1)*dsqrt(dble(k2+1))*dsqrt(dble(kp2+1))
!$acc loop independent private(q2)
do q=-km,km
q2=2*q
!$acc loop independent
do mu2=-ju2,ju2,2
!$acc loop independent private(p2,z1)
do ml2=-jl2,jl2,2
p2=mu2-ml2
if(iabs(p2).gt.2) cycle
z1=w3js(ju2,jl2,2,mu2,-ml2,-p2)
!$acc loop independent private(pp2,z2)
do mup2=-ju2,ju2,2
if(mu2-mup2.ne.q2) cycle
pp2=mup2-ml2
if(iabs(pp2).gt.2) cycle
z2=w3js(ju2,jl2,2,mup2,-ml2,-pp2)
!$acc loop independent
do mlp2=-jl2,jl2,2
zc(ml2,mlp2,mu2,mup2,k,kp,q) = z2
enddo
enddo
enddo
enddo
end do
end do
end do
!$acc end kernels
!$acc end data
Without a reproducing example it's difficult to give a complete answer, but I'll do my best.
First, there are only three parallel dimensions in OpenACC: gang, worker, and vector. Hence, the compiler will need to ignore 4 of the 7 loop directives, most likely the middle 4. (If you are using PGI, you can see which loops the compiler is parallelizing from the compiler feedback messages, enabled with the flag -Minfo=accel.) You can still parallelize all the loops, but you'd need to make them tightly nested and then use the collapse clause to collapse them into a single parallel loop.
Also, since scalars are private by default, there's no need to put them into a private clause (except for a few cases). While putting them in a private clause shouldn't impact correctness, it can cause performance slowdowns, since you'd be fetching the private copy from global memory rather than giving the compiler the chance to keep the scalar in a register.
My guess is that the inner loops are not that large, so parallelizing them may not be beneficial. Hence, I would first try removing all the inner "loop" directives and only parallelize the "k" and "kp" loops. Depending on the values of "kmin" and "kmax", I'd try collapsing them as well. Something like:
!$acc loop independent collapse(2)
do k=kmin,kmax
do kp=kmin,kmax
k2=2*k
km = MIN(k,kp)
Assuming that gets you the correct answers but not as much parallelism as you want, you can then try collapsing the middle two loops:
!$acc loop independent collapse(2)
do q=-km,km
do mu2=-ju2,ju2,2
q2=2*q
do ml2=-jl2,jl2,2
I wouldn't recommend parallelizing loops with cycles in them. Not that you can't, but doing so would hurt performance due to thread divergence.
If none of this helps, please post a full reproducing example.
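For reference, a self-contained sketch of the collapsed "k"/"kp" version could look like the program below. It is only an illustration under assumptions: the bounds `kmin`/`kmax` are made up, the array `zc` is reduced to two dimensions, and the loop body is placeholder arithmetic rather than the `w3js` calls from the question. Scalars assigned inside the loop are left to the default privatization, as suggested above.

```fortran
program collapse_sketch
  implicit none
  integer, parameter :: kmin = 0, kmax = 3     ! hypothetical bounds
  integer :: k, kp, k2, kp2, km
  double precision :: zc(kmin:kmax, kmin:kmax) ! simplified stand-in for zc

  ! The two loops are tightly nested (no statements between the two DO
  ! lines), which is what collapse(2) requires.
  !$acc parallel loop collapse(2) copyout(zc)
  do k = kmin, kmax
     do kp = kmin, kmax
        k2  = 2*k
        kp2 = 2*kp
        km  = min(k, kp)
        ! Placeholder arithmetic, not the original physics.
        zc(k, kp) = dble(km) * dsqrt(dble(k2+1)) * dsqrt(dble(kp2+1))
     end do
  end do

  print *, zc
end program collapse_sketch
```

Compiled without OpenACC support, the directive is treated as a comment and the loops run serially, which gives a convenient reference result to compare against.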

OpenMP ensemble execution

I am new to OpenMP and at the moment have no access to my workstation, where I could check the details. I have a quick question to get the basics right before moving on to the hands-on part.
Suppose I have a serial program written in FORTRAN90 which evolves a map with iterations and gives the final value of the variable after the evolution, the code looks like:
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
do i=1,50000 !! ITERATION OF THE SYSTEM
xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
xi=xf
enddo !! END OF SYSTEM ITERATION
print*, xf
I want to run the same code as independent processes on a cluster for 100 different random initial conditions and see how the output changes with the initial conditions. A serial program for this purpose would look like:
do iter=1,100 !! THE INITIAL CONDITION LOOP
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
do i=1,50000 !! ITERATION OF THE SYSTEM
xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
xi=xf
enddo !! END OF SYSTEM ITERATION
print*, xf
enddo !! END OF THE INITIAL CONDITION LOOP
Will the OpenMP implementation that I could think of work? The code I could come up with is as follows:
!$OMP PARALLEL PRIVATE(xi,xf,i)
!$OMP DO
do iter=1,100 !! THE INITIAL CONDITION LOOP
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
do i=1,50000 !! ITERATION OF THE SYSTEM
xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
xi=xf
enddo !! END OF SYSTEM ITERATION
print*, xf
enddo !! END OF THE INITIAL CONDITION LOOP
!$OMP END DO
!$OMP END PARALLEL
Thank you in advance for any suggestions or help.
I think that this line
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
might cause some problems. Is the implementation of random_number on your system thread-safe? I haven't a clue; I know nothing about your compiler or operating system. If it isn't thread-safe, then your program might do a number of things when the OpenMP threads all start using the random number generator; those things include crashing or deadlocking.
If the implementation is thread-safe, you will want to figure out whether the threads all generate the same sequence of random numbers or different ones. It's entirely sensible to write programs which use the same random numbers in each thread, or which use different sequences in different threads, but you ought to make sure that what you get is what you want.
And if the random number generator is thread-safe and generates different sequences for each thread, do those sequences pass the sort of tests for randomness that a single-threaded random number generator might pass?
It's quite tricky to generate properly independent sequences of pseudo-random numbers in parallel programs; certainly not something I can cover in the space of an SO answer.
While you figure all that out one workaround which might help would be to generate, in a sequential part of your code, all the random numbers you need (into an array perhaps) and let the different threads read different elements out of the array.
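A minimal sketch of that workaround, assuming free-form source: the initial conditions are drawn once, outside the parallel region, and each iteration of the parallel loop reads its own element. The array names `x0` and `xfinal` and the parameters `nens` and `nsteps` are introduced here for illustration and do not appear in the question.

```fortran
program ensemble_sketch
  implicit none
  integer, parameter :: nens = 100, nsteps = 50000
  double precision :: x0(nens), xfinal(nens), xi
  integer :: iter, i

  ! Sequential part: one call fills the whole array, so the random
  ! number generator is never called from inside the parallel region.
  call random_number(x0)

  !$omp parallel do private(xi, i)
  do iter = 1, nens                 ! the initial condition loop
     xi = x0(iter)                  ! each iteration reads its own element
     do i = 1, nsteps               ! iteration of the system
        xi = 4.d0*xi*(1.d0 - xi)    ! evolution of the system
     end do
     xfinal(iter) = xi
  end do
  !$omp end parallel do

  ! Printing after the loop keeps the output in a fixed order.
  do iter = 1, nens
     print *, xfinal(iter)
  end do
end program ensemble_sketch
```

Storing the results in `xfinal` and printing them afterwards also avoids interleaved output from several threads writing at once.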
I want to run the same code as independent processes on a cluster
Then you do not want OpenMP. OpenMP is about exploiting parallelism inside a single address space.
I suggest you look at MPI if you want to run on a cluster.

Parallelizing fortran 2008 `do concurrent` systematically, possibly with openmp

The Fortran 2008 do concurrent construct is a do loop that tells the compiler that no iteration affects any other. It can thus be parallelized safely.
A valid example:
program main
implicit none
integer :: i
integer, dimension(10) :: array
do concurrent( i= 1: 10)
array(i) = i
end do
end program main
where iterations can be done in any order. You can read more about it here.
To my knowledge, gfortran does not automatically parallelize these do concurrent loops, although I remember a gfortran-diffusion-list mail about doing it (here). It just transforms them into classical do loops.
My question: Do you know a way to systematically parallelize do concurrent loops? For instance with a systematic openmp syntax?
It is not that easy to do it automatically. The DO CONCURRENT construct has a forall-header which means that it could accept multiple loops, index variables definition and a mask. Basically, you need to replace:
DO CONCURRENT([<type-spec> :: ]<forall-triplet-spec 1>, <forall-triplet-spec 2>, ...[, <scalar-mask-expression>])
<block>
END DO
with:
[BLOCK
<type-spec> :: <indexes>]
!$omp parallel do
DO <forall-triplet-spec 1>
DO <forall-triplet-spec 2>
...
[IF (<scalar-mask-expression>) THEN]
<block>
[END IF]
...
END DO
END DO
!$omp end parallel do
[END BLOCK]
(things in square brackets are optional, based on the presence of the corresponding parts in the forall-header)
Note that this would not be as effective as parallelising one big loop with <iters 1>*<iters 2>*... independent iterations which is what DO CONCURRENT is expected to do. Note also that forall-header permits a type-spec that allows one to define loop indexes inside the header and you will need to surround the whole thing in BLOCK ... END BLOCK construct to preserve the semantics. You would also need to check if scalar-mask-expr exists at the end of the forall-header and if it does you should also put that IF ... END IF inside the innermost loop.
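As a concrete instance of the transformation above, the sketch below runs a small DO CONCURRENT loop with two indexes and a mask, then the hand-written OpenMP equivalent, and checks that both give the same result. The arrays `a` and `b`, the bounds, and the mask `i /= j` are made up for illustration. The `collapse(2)` clause is an addition that is not part of the template above; it asks OpenMP to treat the two perfectly nested loops as a single iteration space, which is closer to the "one big loop" behaviour described in the note.

```fortran
program do_concurrent_to_omp
  implicit none
  integer, parameter :: n = 4, m = 3   ! hypothetical bounds
  integer :: a(n, m), b(n, m)
  integer :: i, j

  a = 0
  b = 0

  ! Original form: two index triplets and a scalar mask in the header.
  do concurrent (i = 1:n, j = 1:m, i /= j)
     a(i, j) = i + j
  end do

  ! Transformed form: one DO per triplet, the mask becomes an IF
  ! inside the innermost loop.
  !$omp parallel do collapse(2)
  do i = 1, n
     do j = 1, m
        if (i /= j) then
           b(i, j) = i + j
        end if
     end do
  end do
  !$omp end parallel do

  if (any(a /= b)) error stop 'results differ'
  print *, 'results match'
end program do_concurrent_to_omp
```

No type-spec appears in this header, so the BLOCK ... END BLOCK wrapper from the template is not needed here.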
If you only have array assignments inside the body of the DO CONCURRENT, you could also transform it into FORALL and use the workshare OpenMP directive. It would be much easier than the above.
DO CONCURRENT <forall-header>
<block>
END DO
would become:
!$omp parallel workshare
FORALL <forall-header>
<block>
END FORALL
!$omp end parallel workshare
Given all the above, the only systematic way that I can think of is to go through your source code, search for DO CONCURRENT, and replace each occurrence with one of the above transformed constructs, based on the content of the forall-header and the loop body.
Edit: Usage of OpenMP workshare directive is currently discouraged. It turns out that at least Intel Fortran Compiler and GCC serialise FORALL statements and constructs inside OpenMP workshare directives by surrounding them with OpenMP single directive during compilation which brings no speedup whatsoever. Other compilers might implement it differently but it's better to avoid its usage if portable performance is to be achieved.
I'm not sure what you mean by "a way to systematically parallelize do concurrent loops". However, to simply parallelise an ordinary do loop with OpenMP you could just use something like:
!$omp parallel private (i)
!$omp do
do i = 1,10
array(i) = i
end do
!$omp end do
!$omp end parallel
Is this what you are after?

Resources