Heterogeneous OpenMP parallel loop with Intel MIC offloading

I am working on a code which includes a loop with many iterations (~10^6-10^7) where an array (let's say, 'myresult') is being calculated via summation over lots of contributions. In Fortran 90 with OpenMP, this will look something like:
!$omp parallel do reduction(+:myresult)
do i = 1, N
   myresult(i) = myresult(i) + [contribution]
enddo
!$omp end parallel do
The code will be run on a system with Intel Xeon Phi coprocessors, and I would of course like to benefit from their existence, if possible. I have tried using MIC offload directives (!dir$ offload target ...) with OpenMP so that the loop runs only on the coprocessor, but then host CPU time is wasted while the host sits waiting for the coprocessor to finish. Ideally, one could divide the loop between the host and the device, so I would like to know whether something like the following is feasible (or whether there is a better approach); in this sketch the host part of the loop would run on only one core (though perhaps with OMP_NUM_THREADS=2?):
!$omp parallel sections reduction(+:myresult)
!$omp section   ! parallel calculation on device
!dir$ offload target mic
!$omp parallel do reduction(+:myresult)
(do i = N/2+1, N)
!$omp end parallel do
!$omp section   ! serial calculation on host
(do i = 1, N/2)
!$omp end parallel sections

The general idea would be to use an asynchronous offload to MIC so that the CPU can continue working. Setting aside the details of how to divide the work, here is how it can be expressed:
module m
!dir$ attributes offload:mic :: myresult, micresult
integer :: myresult(10000)
integer :: result
integer :: micresult
end module

use m
integer :: i, N
N = 10000
result = 0
micresult = 0
myresult = 0
! Start the offloaded half asynchronously; signal() lets the host continue
! past this construct instead of blocking until the coprocessor finishes.
!dir$ omp offload target(mic:0) signal(micresult)
!$omp parallel do reduction(+:micresult)
do i = N/2+1, N
   micresult = micresult + myresult(i) + 55
enddo
!$omp end parallel do
! Meanwhile the host works on the first half of the loop.
!$omp parallel do reduction(+:result)
do i = 1, N/2
   result = result + myresult(i) + 55
enddo
!$omp end parallel do
! Wait for the offloaded section to complete, then combine both halves.
!dir$ offload_wait target(mic:0) wait(micresult)
result = result + micresult
end

Have you considered using MPI symmetric mode instead of offload? In case you haven't: MPI can do what you have just described. You start two MPI ranks, one on the host and one on the coprocessor, and each rank performs a parallel loop using OpenMP.
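For reference, here is a minimal sketch of the symmetric-mode structure. The program name, the even host/coprocessor split, and the launch line are illustrative assumptions rather than a tested recipe; the real code would reduce the whole myresult array instead of a scalar.
! Every rank (host or coprocessor) takes a contiguous chunk of the loop,
! runs it with its own OpenMP threads, and the partial sums are combined
! with MPI_Allreduce at the end.
program symmetric_sketch
use mpi
implicit none
integer, parameter :: N = 10000
integer :: ierr, rank, nranks, i, lo, hi, chunk
integer :: partial, total
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
chunk = (N + nranks - 1) / nranks    ! even split; a weighted split is more realistic
lo = rank*chunk + 1
hi = min((rank + 1)*chunk, N)
partial = 0
!$omp parallel do reduction(+:partial)
do i = lo, hi
   partial = partial + 55            ! stand-in for the real contribution
enddo
!$omp end parallel do
call MPI_Allreduce(partial, total, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)
if (rank == 0) print *, 'total =', total
call MPI_Finalize(ierr)
end program symmetric_sketch
A symmetric-mode launch then starts one rank on the host and one on the coprocessor with a separately built MIC binary, along the lines of mpirun -n 1 -host localhost ./prog : -n 1 -host mic0 ./prog.mic (the exact options depend on the MPI installation).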


Fortran OMP : how to do a parallel and a single task?

I am a newbie in parallel programming. This is my serial code that I would like to parallelize
program main
implicit none
integer :: pr_number, i, pr_sum
real :: pr_av
pr_sum = 0
do i=1,1000
! The following instruction is an example to simplify the problem.
! In the real case, it takes a long time that is more or less the same for all threads
! and it returns a large array
pr_number = int(rand()*10)
pr_sum = pr_sum+pr_number
pr_av = (1.d0*pr_sum) / i
print *,i,pr_av ! In real case, writing a huge amount of data on one file
enddo
end program main
I would like to parallelize pr_number = int(rand()*10) and to have only one print every num_threads iterations.
I tried many things but it does not work. For example,
program main
implicit none
integer :: pr_number, i, pr_sum
real :: pr_av
pr_sum = 0
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(pr_number) SHARED(pr_sum,pr_av)
!$OMP DO REDUCTION(+:pr_sum)
do i=1,1000
pr_number = int(rand()*10)
pr_sum = pr_sum+pr_number
!$OMP SINGLE
pr_av = (1.d0*pr_sum) / i
print *,i,pr_av
!$OMP END SINGLE
enddo
!$OMP END DO
!$OMP END PARALLEL
end program main
At compile time I get the error message: "work-sharing region may not be closely nested inside of work-sharing, critical or explicit task region".
How can I get an output like this (if I have 4 threads, for example)?
4 3.00000000
8 3.12500000
12 4.00000000
16 3.81250000
20 3.50000000
...
I repeat, I am a beginner at parallel programming. I have read many things on Stack Overflow but I think I do not yet have the skills to understand them. I am working on it, but ...
Edit 1
To explain as suggested in the comments: a do loop performs a lengthy calculation N times (N Markov chain Monte Carlo runs), and the average over all calculations so far is written to a file at each iteration. The previous average is overwritten and only the latest one is kept, so the process can be followed. I would like to parallelise this calculation over 4 threads.
This is what I imagined doing, but perhaps it is not the best idea.
Thanks for the help.
The value of the reduction variable inside the construct where the reduction happens is not really well defined. The reduction clause with a sum is typically implemented by each thread having a private copy of the reduction variable that it uses for summing just the numbers for that thread. At the end of the loop, the private copies are summed into the final sum. There is little point in printing the intermediate value before the reduction has actually been made.
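To see the mechanics, here is roughly that scheme written out by hand (a sketch only; actual compiler implementations differ, and the per-iteration contribution is a stand-in):
program reduction_by_hand
implicit none
integer :: i, pr_sum, my_sum
pr_sum = 0
!$omp parallel private(my_sum) shared(pr_sum)
my_sum = 0                      ! each thread sums into its own private copy
!$omp do
do i = 1, 1000
   my_sum = my_sum + mod(i, 10) ! stand-in for the per-iteration contribution
enddo
!$omp end do
!$omp critical                  ! the private copies are combined only here
pr_sum = pr_sum + my_sum
!$omp end critical
!$omp end parallel
print *, pr_sum                 ! only after the parallel region is pr_sum well defined
end program reduction_by_hand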
You can do the reduction in a nested loop and print the intermediate result every n iterations:
program main
implicit none
integer :: pr_number, i, j, pr_sum
real :: pr_av
pr_sum = 0
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(pr_number) SHARED(pr_sum,pr_av)
do j = 1, 10
!$OMP DO REDUCTION(+:pr_sum)
do i=1,100
pr_number = int(rand()*10)
pr_sum = pr_sum+pr_number
enddo
!$OMP END DO
!$omp single
pr_av = (1.d0*pr_sum) / 100
print *,j*100,pr_av
!$omp end single
end do
!$OMP END PARALLEL
end program main
I kept the same rand(), which may or may not work correctly in parallel depending on the compiler. Even if it gives the right results, it may actually be executed sequentially using some locks or barriers. However, the main point carries over to other libraries; a rand()-free variant is sketched after the result below.
Result
> gfortran -fopenmp reduction-intermediate.f90
> ./a.out
100 4.69000006
200 9.03999996
300 13.7600002
400 18.2299995
500 22.3199997
600 26.5900002
700 31.0599995
800 35.4300003
900 40.1599998
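As noted above, rand() itself is the weak point. One way to sidestep the question of its thread safety entirely is to give each thread its own private generator state. The sketch below uses a deliberately simple LCG, chosen only to keep the example self-contained (not for statistical quality), with a distinct seed per thread:
program private_rng
use omp_lib
implicit none
integer :: i, j, pr_number, pr_sum
integer(kind=8) :: seed
real :: pr_av
pr_sum = 0
!$omp parallel default(shared) private(pr_number, seed)
seed = 123456789_8 + 97_8 * omp_get_thread_num()            ! distinct seed per thread
do j = 1, 10
!$omp do reduction(+:pr_sum)
do i = 1, 100
   seed = mod(1103515245_8 * seed + 12345_8, 2147483648_8)   ! LCG step on private state
   pr_number = int(mod(seed, 10_8))
   pr_sum = pr_sum + pr_number
enddo
!$omp end do
!$omp single
pr_av = (1.d0*pr_sum) / 100
print *, j*100, pr_av
!$omp end single
enddo
!$omp end parallel
end program private_rng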

I cannot get !$acc parallel to work (but acc kernels does)

I've been trying to use OpenACC with a simple code, but I guess I don't fully understand how to write nested OpenACC loops or what private does. The routine that I'm trying to parallelize is:
SUBROUTINE zcs(zc,kmin,kmax,ju2,jl2)
INTEGER, INTENT(IN) :: kmin,kmax,ju2,jl2
DOUBLE PRECISION, DIMENSION(-jl2:jl2,-jl2:jl2,-ju2:ju2,-ju2:ju2,kmin:kmax,kmin:kmax,-kmax:kmax) :: zc
INTEGER :: k,kp,k2,km,kp2,q,q2,mu2,ml2,p2,mup2,pp2,mlp2,ps2,pt2
DOUBLE PRECISION :: z0,z1,z2,z3,z4,z5,z6,z7
! Start loop over K, K' and Q
!$acc kernels
do k=kmin,kmax
do kp=kmin,kmax
k2=2*k
km = MIN(k,kp)
kp2=2*kp
z0=3.d0*dble(ju2+1)*dsqrt(dble(k2+1))*dsqrt(dble(kp2+1))
do q=-km,km
q2=2*q
! Calculate quantity C and its sum over magnetic quantum numbers
do mu2=-ju2,ju2,2
do ml2=-jl2,jl2,2
p2=mu2-ml2
if(abs(p2).gt.2) cycle
z1=w3js(ju2,jl2,2,mu2,-ml2,-p2)
do mup2=-ju2,ju2,2
if(mu2-mup2.ne.q2) cycle
pp2=mup2-ml2
if(abs(pp2).gt.2) cycle
z2=w3js(ju2,jl2,2,mup2,-ml2,-pp2)
do mlp2=-jl2,jl2,2
ps2=mu2-mlp2
if(abs(ps2).gt.2) cycle
pt2=mup2-mlp2
if(abs(pt2).gt.2) cycle
z3=w3js(ju2,jl2,2,mu2,-mlp2,-ps2)
z4=w3js(ju2,jl2,2,mup2,-mlp2,-pt2)
z5=w3js(2,2,k2,-p2,pp2,q2)
z6=w3js(2,2,kp2,-ps2,pt2,q2)
z7=1.d0
if(mod(2*ju2-ml2-mlp2,4).ne.0) z7=-1.d0
zc(ml2,mlp2,mu2,mup2,k,kp,q)=z0*z1*z2*z3*z4*z5*z6*z7
enddo
enddo
enddo
enddo
end do
end do
end do
!$acc end kernels
END SUBROUTINE zcs
As it is, the code behaves fine, and if I compare the zc matrix after calling this routine, the non-OpenACC and OpenACC versions give identical answers. But if I try to do it with a parallel directive there seems to be a race condition that I cannot locate. The relevant changes are just:
!$acc parallel
!$acc loop private(k,kp,k2,km,kp2,z0,q,q2)
do k=kmin,kmax
do kp=kmin,kmax
k2=2*k
km = MIN(k,kp)
kp2=2*kp
z0=3.d0*dble(ju2+1)*dsqrt(dble(k2+1))*dsqrt(dble(kp2+1))
do q=-km,km
q2=2*q
! Calculate quantity C and its sum over magnetic quantum numbers
!$acc loop private(mu2,ml2,p2,z1,mup2,pp2,z2,mlp2,ps2,pt2,z3,z4,z5,z6,z7)
do mu2=-ju2,ju2,2
[...]
!$acc end parallel
As far as I can see I have declared the appropriate variables as private, but I guess I don't fully understand how to nest several loops, and/or what private really does. Any suggestions to help me properly understand what is going on?
Many thanks,
AdV
The core problem here is that you're passing the loop bounds variables "ju2" and "jl2" by reference to the "w3js" routine. This means that the loop trip count could change during the execution of the loop and thus prevents parallelization. You could try making these variables private, but the easiest thing to do is add the "VALUE" attribute on w3js' arguments so they are passed in by value.
Note that it works in the "kernels" case since the compiler is only parallelizing the outer loops. In the "parallel" case, you're trying to parallelize these "non-countable" inner loops.
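For illustration, here is a sketch of the kind of interface change being suggested. The body is a stub (the real w3js presumably evaluates a Wigner 3-j symbol); only the VALUE attribute, plus the acc routine directive typically needed for a routine called inside an OpenACC compute region, are the point:
function w3js(j1, j2, j3, m1, m2, m3) result(w)
!$acc routine seq
integer, value :: j1, j2, j3, m1, m2, m3
double precision :: w
w = 0.d0          ! placeholder; the real routine computes the 3-j symbol here
end function w3js
With the dummies passed by value, the compiler can prove that ju2, jl2 and the other loop variables cannot be modified through the call, so the inner loops become countable again.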

Why bother to initialize a reduction variable outside the parallel construct?

I am learning how to port my Fortran code to OpenMP. When I read an online tutorial (see here) I came across one question.
First, I learned from page 28 that the value of a reduction variable is undefined from the moment the first thread reaches the clause until the operation has completed.
To my understanding, this statement implies that it does not matter whether I initialize the reduction variable before the program hits the parallel construct, because it is not defined until the operation has completed. However, the sample code on page 27 of the same tutorial initializes the reduction variable before the parallel construct.
Could anyone please let me know which treatment is correct? Thanks.
Lee
sum = 0.0
!$omp parallel default(none) shared(n,x) private(i)
do i = 1, n
sum = sum + x(i)
end do
!$omp end do
!$omp end parallel
print*, sum
After fixing your code:
integer,parameter :: n = 10000
real :: x(n)
x = 1
sum = 0
!$omp parallel do default(none) shared(x) private(i) reduction(+:sum)
do i = 1, n
sum = sum + x(i)
end do
!$omp end parallel do
print*, sum
end
Notice that the value to which you initialize sum matters! If you change it you get a different result (for example, initializing sum to 100 simply adds 100 to the final reduced value). It is quite obvious that you have to initialize it properly, and even the OpenMP version is ill-defined without proper initialization.
Yes, the value of sum is not defined until completing the loop, but that doesn't mean it can be undefined before the loop.
For one thing, one of the nice features of OpenMP is that if you compile the program without enabling OpenMP, it can (and should!) be a valid serial program as well. The serial version of your example would be ill-defined without initializing sum before the loop.

OpenMP private array - Segmentation fault: 11

When I try to parallelize my program in Fortran 90 using OpenMP, I get a segmentation fault error.
!$OMP PARALLEL DO NUM_THREADS(4) &
!$OMP PRIVATE(numstrain, i)
do irep = 1, nrep
do i=1, 10
PRINT *, numstrain(i)
end do
end do
!$OMP END PARALLEL DO
I find that if I comment out "PRINT *, numstrain(i)" or remove the OpenMP flags, it works without error. I think it is because a memory access conflict happens when I access numstrain(i) in parallel. I have already declared i and numstrain as private variables. Could someone please give me some idea why this is the case? Thank you so much. :)
UPDATE:
I modified the previous version and this version can print out correct result.
integer, allocatable :: numstrain(:)
integer :: allocate_status
integer :: n
!$OMP PARALLEL DO NUM_THREADS(4) &
!$OMP PRIVATE(numstrain, i)
n = 1000000
do irep = 1, nrep
allocate (numstrain(n), stat = allocate_status)
do i=1, 10
PRINT *, numstrain(i)
end do
deallocate (numstrain, stat = allocate_status)
end do
!$OMP END PARALLEL DO
However, if I move the numstrain access into another subroutine called by this one (code attached below), then: 1. it always runs in only one thread; 2. at some point (i=4 or 5) it returns Segmentation Fault: 11. The value of i at which it returns Segmentation Fault: 11 differs when I use a different NUM_THREADS.
integer, allocatable :: numstrain(:)
integer :: allocate_status
integer :: n
!$OMP PARALLEL DO NUM_THREADS(4) &
!$OMP PRIVATE(numstrain, i)
n = 1000000
do irep = 1, nrep
allocate (numstrain(n), stat = allocate_status)
call anotherSubroutine(numstrain)
deallocate (numstrain, stat = allocate_status)
end do
!$OMP END PARALLEL DO
subroutine anotherSubroutine(numstrain)
integer, allocatable :: numstrain(:)
do i=1, 10
PRINT *, numstrain(i)
end do
end subroutine anotherSubroutine
I also tried allocating/deallocating in both the helper subroutine and the main subroutine, and allocating/deallocating only in the helper subroutine. Nothing changed.
The most typical reason for this is that not enough space is available on the stack to hold the private copy of numstrain. Compute and compare the following two values:
the size of the array in bytes
the stack size limit
There are two kinds of stack size limits. The stack size of the main thread is controlled by things like process limits on Unix systems (use ulimit -s to check and modify this limit) or is fixed at link time on Windows (recompilation or binary edit of the executable is necessary in order to change the limit). The stack size of the additional OpenMP threads is controlled by environment variables like the standard OMP_STACKSIZE, or the implementation-specific GOMP_STACKSIZE (GNU/GCC OpenMP) and KMP_STACKSIZE (Intel OpenMP).
Note that most Fortran OpenMP implementations always put private arrays on the stack, no matter if you enable compiler options that allocate large arrays on the heap (tested with GNU's gfortran and Intel's ifort).
If you comment out the PRINT statement, you effectively remove the reference to numstrain and the compiler is free to optimise it out, e.g. it could simply not make a private copy of numstrain, thus the stack limit is not exceeded.
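If it helps, the first of those two numbers can be computed from inside the program itself. A small sketch (storage_size is a Fortran 2008 intrinsic; the array length is the one from the question):
program private_copy_size
implicit none
integer, allocatable :: numstrain(:)
allocate (numstrain(1000000))
! bytes one private copy would need on each thread's stack
print *, 'bytes per private copy:', size(numstrain) * storage_size(numstrain) / 8
end program private_copy_size
The result (roughly 4 MB here) can then be compared against ulimit -s and OMP_STACKSIZE.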
After the additional information that you've provided, one can conclude that stack size is not the culprit. When dealing with private ALLOCATABLE arrays, you should know that:
private copies of unallocated arrays remain unallocated;
private copies of allocated arrays are allocated with the same bounds.
If you do not use numstrain outside of the parallel region, it is fine to do what you've done in your first case, but with some modifications:
integer, allocatable :: numstrain(:)
integer :: allocate_status
integer, parameter :: n = 1000000
interface
subroutine anotherSubroutine(numstrain)
integer, allocatable :: numstrain(:)
end subroutine anotherSubroutine
end interface
!$OMP PARALLEL NUM_THREADS(4) PRIVATE(numstrain, allocate_status)
allocate (numstrain(n), stat = allocate_status)
!$OMP DO
do irep = 1, nrep
call anotherSubroutine(numstrain)
end do
!$OMP END DO
deallocate (numstrain)
!$OMP END PARALLEL
If you also use numstrain outside of the parallel region, then the allocation and deallocation go outside:
allocate (numstrain(n), stat = allocate_status)
!$OMP PARALLEL DO NUM_THREADS(4) PRIVATE(numstrain)
do irep = 1, nrep
call anotherSubroutine(numstrain)
end do
!$OMP END PARALLEL DO
deallocate (numstrain)
You should also know that when you call a routine that takes an ALLOCATABLE array as argument, you have to provide an explicit interface for that routine. You can either write an INTERFACE block or you can put the called routine in a module and then USE that module - both cases would provide the explicit interface. If you do not provide the explicit interface, the compiler would not pass the array correctly and the subroutine would fail to access its content.
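A minimal sketch of the module route (the module name is made up; the body is just the print loop from the question):
module strain_utils
contains
subroutine anotherSubroutine(numstrain)
integer, allocatable :: numstrain(:)
integer :: i
do i = 1, 10
   print *, numstrain(i)
enddo
end subroutine anotherSubroutine
end module strain_utils
Any caller that does "use strain_utils" then sees the explicit interface automatically, so no separate INTERFACE block is needed.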

Low Performance of Nested DO Loop using OpenMP for FORTRAN90

I am trying to parallel a portion of my code which is as follows
!$OMP PARALLEL PRIVATE(j,x,y,xnew, ynew) SHARED(xDim, yDim, ex, f, fplus)
!$OMP DO
DO j = 1, 8
DO y=1, yDim
ynew = y+ey(j)
DO x=1, xDim
xnew = x+ex(j)
IF ((xnew >= 1 .AND. xnew <= xDim) .AND. (ynew >= 1 .AND. ynew <= yDim)) f(xnew,ynew,j)=fplus(x,y,j)
END DO
END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
I am new to OpenMP and Fortran. The single-core version gives better performance than the parallel code. Please suggest what mistake I am making here.
The problem here is that you're just copying an array slice -- there's nothing really CPU limited here that splitting things up between cores will significantly help with. Ultimately this problem is memory bound, copying data from one piece of memory to another, and increasing the number of CPUs working at once likely only increases contention.
Having said that, I can get small (~10%) speedups if I rework the loop a bit to get that if statement out from inside the loop. This:
CALL tick(clock)
!$OMP PARALLEL PRIVATE(j,x,y,xnew,ynew) SHARED(xDim,yDim,ex,ey,f,fplus) DEFAULT(none)
!$OMP DO
DO j = 1, 8
DO y=1+ey(j), yDim
DO x=1+ex(j), xDim
f(x,y,j)=fplus(x-ex(j),y-ey(j),j)
END DO
END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
time2 = tock(clock)
or this:
CALL tick(clock)
!$OMP PARALLEL PRIVATE(j,x,y,xnew,ynew) SHARED(xDim,yDim,ex,ey,f,fplus) DEFAULT(none)
!$OMP DO
DO j = 1, 8
f(1+ex(j):xDim, 1+ey(j):yDim, j) = fplus(1:xDim-ex(j),1:yDim-ey(j),j)
ENDDO
!$OMP END DO
!$OMP END PARALLEL
time3 = tock(clock)
make very modest improvements. If fplus were a function of the arguments x, y, and j and were compute-intensive, things would be different; but a memory copy isn't likely to be sped up much.
Your performance will also depend on the sizes of the loops. You have the correct arrangement of loops, with the right-most index on the outer loop for more optimized memory access. If these loops are small and all the memory fits in the cache of a single processor, there will likely be no performance improvement from using OpenMP. As you saw, you can actually see a degradation of performance because of OpenMP overhead such as thread creation/destruction. And in the future, try to avoid IF statements inside nested loops; they will really hurt your performance!
