Low Performance of Nested DO Loop using OpenMP for FORTRAN90

I am trying to parallelize a portion of my code, which is as follows:
!$OMP PARALLEL PRIVATE(j,x,y,xnew, ynew) SHARED(xDim, yDim, ex, f, fplus)
!$OMP DO
DO j = 1, 8
DO y=1, yDim
ynew = y+ey(j)
DO x=1, xDim
xnew = x+ex(j)
IF ((xnew >= 1 .AND. xnew <= xDim) .AND. (ynew >= 1 .AND. ynew <= yDim)) f(xnew,ynew,j)=fplus(x,y,j)
END DO
END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
I am new to OpenMP and Fortran. The single-core version performs better than the parallel code. Please suggest what mistake I am making here.

The problem here is that you're just copying an array slice -- there's nothing really CPU limited here that splitting things up between cores will significantly help with. Ultimately this problem is memory bound, copying data from one piece of memory to another, and increasing the number of CPUs working at once likely only increases contention.
Having said that, I can get small (~10%) speedups if I rework the loop a bit to get that if statement out from inside the loop. This:
CALL tick(clock)
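!$OMP PARALLEL PRIVATE(j,x,y) SHARED(xDim, yDim, ex, ey, f, fplus) DEFAULT(none)
!$OMP DO
DO j = 1, 8
! Clamp the bounds so that source and destination both stay inside the grid for either sign of ex(j), ey(j)
DO y=MAX(1,1+ey(j)), MIN(yDim,yDim+ey(j))
DO x=MAX(1,1+ex(j)), MIN(xDim,xDim+ex(j))
f(x,y,j)=fplus(x-ex(j),y-ey(j),j)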
END DO
END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
time2 = tock(clock)
or this:
CALL tick(clock)
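!$OMP PARALLEL PRIVATE(j) SHARED(xDim, yDim, ex, ey, f, fplus) DEFAULT(none)
!$OMP DO
DO j = 1, 8
! The section bounds are clamped so that negative offsets do not index outside 1:xDim, 1:yDim
f(MAX(1,1+ex(j)):MIN(xDim,xDim+ex(j)), MAX(1,1+ey(j)):MIN(yDim,yDim+ey(j)), j) = fplus(MAX(1,1-ex(j)):MIN(xDim,xDim-ex(j)), MAX(1,1-ey(j)):MIN(yDim,yDim-ey(j)), j)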
ENDDO
!$OMP END DO
!$OMP END PARALLEL
time3 = tock(clock)
make very modest improvements. If fplus were a function of the arguments x, y, and j and were compute-intensive, things would be different; but a memory copy isn't likely to be sped up much.
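As a rough check that a branch-free rewrite copies exactly the same elements as the loop with the IF test, the two variants can be put side by side. This is only a sketch: the module name shift_demo and the subroutine names are made up for illustration, and the grid size is taken from the array shapes rather than from xDim and yDim.

```fortran
module shift_demo
  implicit none
contains

  ! Reference version: bounds test inside the innermost loop, as in the question.
  subroutine shift_with_if(f, fplus, ex, ey)
    real,    intent(inout) :: f(:,:,:)
    real,    intent(in)    :: fplus(:,:,:)
    integer, intent(in)    :: ex(:), ey(:)
    integer :: j, x, y, xnew, ynew, xdim, ydim
    xdim = size(f, 1)
    ydim = size(f, 2)
    !$omp parallel do private(x, y, xnew, ynew)
    do j = 1, size(f, 3)
      do y = 1, ydim
        ynew = y + ey(j)
        do x = 1, xdim
          xnew = x + ex(j)
          if (xnew >= 1 .and. xnew <= xdim .and. ynew >= 1 .and. ynew <= ydim) &
            f(xnew, ynew, j) = fplus(x, y, j)
        end do
      end do
    end do
    !$omp end parallel do
  end subroutine shift_with_if

  ! Branch-free version: the valid destination range is computed once per j,
  ! so the copy becomes a plain array-section assignment.
  subroutine shift_with_slices(f, fplus, ex, ey)
    real,    intent(inout) :: f(:,:,:)
    real,    intent(in)    :: fplus(:,:,:)
    integer, intent(in)    :: ex(:), ey(:)
    integer :: j, xdim, ydim, xlo, xhi, ylo, yhi
    xdim = size(f, 1)
    ydim = size(f, 2)
    !$omp parallel do private(xlo, xhi, ylo, yhi)
    do j = 1, size(f, 3)
      xlo = max(1, 1 + ex(j))
      xhi = min(xdim, xdim + ex(j))
      ylo = max(1, 1 + ey(j))
      yhi = min(ydim, ydim + ey(j))
      f(xlo:xhi, ylo:yhi, j) = fplus(xlo-ex(j):xhi-ex(j), ylo-ey(j):yhi-ey(j), j)
    end do
    !$omp end parallel do
  end subroutine shift_with_slices

end module shift_demo
```

Calling both subroutines on copies of the same f with the same fplus, ex and ey, and comparing the results with any(fa /= fb), should show no difference for positive, zero and negative offsets. Either version still only moves memory, so the timing argument above is unchanged.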

Your performance will also depend on the sizes of the loops. You have the correct arrangement of loops, with your right-most index on the outer loop for better memory access. If these loops are small and all the memory fits in the cache of a single processor, there will likely be no performance improvement from using OpenMP. As you saw, you can actually see a degradation of performance because of OpenMP overhead such as thread creation and destruction. In the future, try to avoid IF statements inside nested loops, as they can hurt performance considerably.

Related

Fortran OMP : how to do a parallel and a single task?

I am a newbie in parallel programming. This is my serial code that I would like to parallelize:
program main
implicit none
integer :: pr_number, i, pr_sum
real :: pr_av
pr_sum = 0
do i=1,1000
! The following instruction is an example to simplify the problem.
! In the real case, it takes a long time that is more or less the same for all threads
! and it returns a large array
pr_number = int(rand()*10)
pr_sum = pr_sum+pr_number
pr_av = (1.d0*pr_sum) / i
print *,i,pr_av ! In real case, writing a huge amount of data on one file
enddo
end program main
I would like to parallelize pr_number = int(rand()*10) and to have only one print every num_threads iterations.
I tried many things but it does not work. For example,
program main
implicit none
integer :: pr_number, i, pr_sum
real :: pr_av
pr_sum = 0
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(pr_number) SHARED(pr_sum,pr_av)
!$OMP DO REDUCTION(+:pr_sum)
do i=1,1000
pr_number = int(rand()*10)
pr_sum = pr_sum+pr_number
!$OMP SINGLE
pr_av = (1.d0*pr_sum) / i
print *,i,pr_av
!$OMP END SINGLE
enddo
!$OMP END DO
!$OMP END PARALLEL
end program main
I have an error message at compilation time : work-sharing region may not be closely nested inside of work-sharing, critical or explicit task region.
How can I have an output like that (if I have 4 threads for example) ?
4 3.00000000
8 3.12500000
12 4.00000000
16 3.81250000
20 3.50000000
...
I repeat, I am a beginner on parallel programming. I read many things on stackoverflow but, I think, I have not yet the skill to understand. I work on it, but ...
Edit 1
To explain, as suggested in the comments: a do loop performs a lengthy calculation N times (N Markov chain Monte Carlo) and the average of all calculations is written to a file at each iteration. The previous average is deleted and only the last one is kept, so the process can be followed. I would like to parallelise this calculation over 4 threads.
This is what I imagine to do but perhaps, it is not the best idea.
Thanks for help.

The value of the reduction variable inside the construct where the reduction happens is not really well defined. The reduction clause with a sum is typically implemented by giving each thread a private copy of the reduction variable, which it uses to sum just the numbers handled by that thread. At the end of the loop, the private copies are summed into the final sum. There is little point in printing the intermediate value before the reduction is actually made.
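To make the description above concrete, here is a minimal sketch that writes the same sum twice: once with the REDUCTION clause and once by hand with a private partial sum per thread. The module and function names are hypothetical, and the hand-written version only approximates what an implementation may do internally.

```fortran
module reduction_demo
  implicit none
contains

  ! Sum using the REDUCTION clause.
  integer function sum_with_reduction(values) result(total)
    integer, intent(in) :: values(:)
    integer :: i, acc
    acc = 0
    !$omp parallel do reduction(+:acc)
    do i = 1, size(values)
      acc = acc + values(i)
    end do
    !$omp end parallel do
    total = acc
  end function sum_with_reduction

  ! Roughly the same thing written out: each thread accumulates into its own
  ! private variable, and the partial sums are combined once per thread.
  integer function sum_by_hand(values) result(total)
    integer, intent(in) :: values(:)
    integer :: i, acc, partial
    acc = 0
    !$omp parallel private(partial)
    partial = 0
    !$omp do
    do i = 1, size(values)
      partial = partial + values(i)
    end do
    !$omp end do nowait
    ! Only this final combination touches the shared variable.
    !$omp atomic
    acc = acc + partial
    !$omp end parallel
    total = acc
  end function sum_by_hand

end module reduction_demo
```

In sum_by_hand the shared acc is incomplete until every thread has added its partial sum, which is why reading the reduction variable in the middle of the loop gives no meaningful value.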
You can do the reduction in a nested loop and print the intermediate result every n iterations
program main
implicit none
integer :: pr_number, i, j, pr_sum
real :: pr_av
pr_sum = 0
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(pr_number) SHARED(pr_sum,pr_av)
do j = 1, 10
!$OMP DO REDUCTION(+:pr_sum)
do i=1,100
pr_number = int(rand()*10)
pr_sum = pr_sum+pr_number
enddo
!$OMP END DO
!$omp single
pr_av = (1.d0*pr_sum) / 100
print *,j*100,pr_av
!$omp end single
end do
!$OMP END PARALLEL
end program main
I kept the same rand() that may or may not work correctly in parallel depending on the compiler. Even if it gives the right results, it may actually be executed sequentially using some locks or barriers. However, the main point carries over to other libraries.
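Result (the code above divides the cumulative sum by 100 rather than by 100*j, so the printed values grow with j; dividing by 100*j instead gives the running average):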
> gfortran -fopenmp reduction-intermediate.f90
> ./a.out
100 4.69000006
200 9.03999996
300 13.7600002
400 18.2299995
500 22.3199997
600 26.5900002
700 31.0599995
800 35.4300003
900 40.1599998
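A variant of the same chunked pattern that stores the running average after each chunk, instead of printing it, could look like the sketch below. The module name, the function name and the chunk_size argument are assumptions made for this illustration, and deterministic input replaces rand() so that the result can be checked.

```fortran
module running_average_demo
  implicit none
contains

  ! Returns the running average of values after each chunk of chunk_size elements.
  function chunked_running_average(values, chunk_size) result(averages)
    integer, intent(in) :: values(:), chunk_size
    real, allocatable :: averages(:)
    integer :: nchunks, i, j, total
    nchunks = size(values) / chunk_size   ! an incomplete last chunk is ignored
    allocate(averages(nchunks))
    total = 0
    !$omp parallel default(shared) private(i, j)
    do j = 1, nchunks
      ! Every thread runs the j loop; the inner loop is shared among threads.
      !$omp do reduction(+:total)
      do i = (j - 1)*chunk_size + 1, j*chunk_size
        total = total + values(i)
      end do
      !$omp end do
      ! After the implicit barrier, total holds the sum of the first j chunks.
      !$omp single
      averages(j) = real(total) / real(j*chunk_size)
      !$omp end single
    end do
    !$omp end parallel
  end function chunked_running_average

end module running_average_demo
```

With values = 1, 2, ..., 1000 and chunk_size = 100, the first entry is the mean of 1..100 and the last is the mean of 1..1000. The divisor j*chunk_size is what turns the cumulative sum into an average.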

Improper use of OpenMP in fortran code increase wall time

I am trying to parallelize a FORTRAN code, but apparently I am not using the OpenMP statements correctly. I started by implementing OpenMP parallelization for a single subroutine, expecting a decrease in wall time, but it increases the total wall time of the program instead! Here is the subroutine that I am trying to run in parallel. It uses common variables, and ncell, nnode, node(3,mxc), neigh(3,mxc), xy(2,mxn), area(mxc), mxc and mxn are assigned before gradient is called.
I made the variables ne, Tneigh, Tface, n1 and n2 PRIVATE because they are computed differently in each thread. Is this approach wrong, and is it the reason for the longer wall time?
Any help will be appreciated. Thanks in advance.
subroutine GRADIENT
parameter (mxc=5001,mxn=3001)
common /grid/ ncell,nnode,node(3,mxc),neigh(3,mxc),
> xy(2,mxn),area(mxc)
common /var/ time,dt,Tcell(mxc),Tbc(10),outflux(mxc)
common /grad/ dTdx(mxc),dTdy(mxc)
!$OMP PARALLEL
!$OMP DO PRIVATE(ne,Tneigh,Tface,n1,n2)
DO n = 1,ncell
dTdx(n) = 0.
dTdy(n) = 0.
do nf = 1,3
ne = neigh(nf,n)
if(ne .gt. 0) then !..real neighbor
Tneigh = Tcell(ne)
else !..other walls
Tneigh = Tbc(-ne)
endif
Tface = 0.5*(Tcell(n)+Tneigh)
n1 = node(nf,n)
if(nf .lt. 3) then
n2=node(nf+1,n)
else
n2=node(1,n)
endif
dTdx(n) = dTdx(n) + Tface*(xy(2,n2)-xy(2,n1))
dTdy(n) = dTdy(n) - Tface*(xy(1,n2)-xy(1,n1))
enddo
dTdx(n) = dTdx(n)/area(n)
dTdy(n) = dTdy(n)/area(n)
ENDDO
!$OMP END DO !NOWAIT
!$OMP END PARALLEL
return
end
Edit: The time is measured with the omp_get_wtime function, and these are the measurements:
Sequential = 0.5 s
Parallel = 8.4 s (on average; sometimes it is much higher and I do not know why)
The 2D arrays node, neigh and xy are filled with numbers read from a dat file, and the 1D array area is filled by some operations using the xy, node and neigh arrays.

Is it possible to remove the following !$OMP CRITICAL regions

I have a Fortran code that shows some very unsatisfactory performance due to some !$OMP CRITICAL regions. This question is really about how the critical regions can be avoided and whether they can be removed. In those critical regions I am updating counters and reading/writing values to an array:
i=0
j=MAX/2
total = 0
!$OMP PARALLEL PRIVATE(x,N)
MAIN_LOOP:do
!$OMP CRITICAL
total = total + 1
x = array(i)
i = i + 1
if ( i > MAX) i=1 ! if the counter is past the end, start from the beginning
!$OMP END CRITICAL
if (total > MAX_TOTAL) exit
! do some calculations here and get the value of the integer (N)
! store (N) copies of x it back in the original array with some offset
!$OMP CRITICAL
do p=1,N
array(j)=x
j=j+1
if (j>MAX) j=1
end do
!$OMP END CRITICAL
end do MAIN_LOOP
!$OMP END PARALLEL
One simple thing that came to my mind is to eliminate the counter on total by using explicit dynamic loop scheduling.
!$OMP PARALLEL DO SCHEDULE(DYNAMIC)
MAIN_LOOP:do total = 1,MAX_TOTAL
! do the calculation here
end do MAIN_LOOP
!$OMP END PARALLEL DO
I was also thinking of allocating a different portion of the array to each thread and using the thread ID for the offset. This time each thread would have its own counter, stored in an array count_i(ID), something of the sort:
! this time the size of array is NUM_OMP_THREADS*MAX
ID=omp_get_thread_num()
x=array(ID + sum(count_i)) ! get the offset by summing up all values
count_i(ID)=count_i(ID)+1
if (count_i(ID) > MAX) count_i(ID) = 1
This, however, will mess up the order and will not do the same as the original method. Moreover, some empty space will be present, since the different threads will not be able to fit the entire range 1:MAX.
I would appreciate your help and ideas.

Your use of critical sections is a bit strange here. The motivation for using critical sections must be to avoid having an entry in the array being clobbered before it can be read. Your code does accomplish this, but only accidentally, by acting as barriers. Try replacing the critical stuff with OMP barriers, and you should still get the right result and the same horrible speed.
Since you always write to the array half its length away from where you read from it, you can avoid critical sections by dividing the operation into one step which reads from the first half and writes to the second half, and a second step which does the reverse. (Edit: After the question was edited, this is no longer true, so the approach below won't work).
nhalf = size(array)/2
!$omp parallel do
do i = 1, nhalf
array(i+nhalf) = f(array(i))
end do
!$omp parallel do
do i = 1, nhalf
array(i) = f(array(i+nhalf))
end do
Here f(x) represents whatever calculation you want to do to the array values. It doesn't have to be a function if you don't want it to. If it isn't clear, this code first loops through the entries in the first half of the array in parallel. The first task may go through i=1,1+nproc,1+2*nproc, etc. while the second task goes through i=2,2+nproc,2+2*nproc, and so on. This can be done in parallel without any locking because there is no overlap between the part of the array that is read from and written to in this loop. The second loop only starts once every task has finished the first loop, so there is no clobbering between the loops.
Unlike in your code, each thread here has its own i, so no locking is needed to update it (the loop variable is automatically private).
This assumes that you only want to make one pass through the array. Otherwise you can just loop over these two loops:
do iouter = 1, (max_total+size(array)-1)/size(array)
nleft = max_total-(iouter-1)*size(array)
nhalf = size(array)/2
!$omp parallel do
do i = 1, min(nhalf,nleft)
array(i+nhalf) = f(array(i))
end do
!$omp parallel do
do i = 1, min(nhalf,nleft-nhalf)
array(i) = f(array(i+nhalf))
end do
end do
Edit: Your new example is confusing. I'm not sure what it's supposed to do. Depending on the value of N, the array values may end up being clobbered before they can be used. Is this intentional? It's hard to answer your question when it's not clear what you're trying to do. :/

I thought about this for a while and my feeling is that there is no good answer to this specific issue.
Indeed, your code seems, at first glance, like a good approach to the problem such as stated (although I personally find the problem itself a bit strange). However, there are problems in your implementation:
What happens if for some reason one of the threads gets delayed in processing its iteration? Just imagine that the thread owning the very first index takes a while to process it (delayed by some third-party process getting in the way and taking CPU time on the core where the thread was pinned or scheduled, for example) and is the last to finish. Then it will write its values back to array in a completely different order than the sequential algorithm would have done. Is that something you can accept in your algorithm?
Even without this sort of "extreme" delay, can you accept that the order in which the i indexes were distributed among threads is different from the order in which the j indexes are subsequently updated? If the thread owning i+1 finishes right before the one owning i, it will use index j instead of index j+n as it should have.
Again, I'm not sure I understand all the subtleties of your algorithm and how resilient it is to misordering of iterations, but if ordering is important, then the approach is wrong. In this case, I guess that a proper parallelisation could be something like this (put in a subroutine to make it compilable):
subroutine loop(array, maxi, max_iteration)
implicit none
integer, intent(in) :: maxi, max_iteration
real, intent(inout) :: array(maxi)
real :: x
integer :: iteration, i, j, n, p
i = 0
j = maxi/2
!$omp parallel do ordered private(x, n, p) schedule(static,1)
do iteration = 1,max_iteration
!$omp ordered
x = array(wrap_around(i, maxi))
!$omp end ordered
! do some calculations here and get the value of the integer (n)
!$omp ordered
do p = 1,n
array(wrap_around(j, maxi)) = x
end do
!$omp end ordered
end do
!$omp end parallel do
contains
integer function wrap_around(i, maxi)
implicit none
integer, intent(in) :: maxi
integer, intent(inout) :: i
i = i+1
if (i > maxi) i = 1
wrap_around = i
end function wrap_around
end subroutine loop
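I hope this would work. Note, however, that the OpenMP specification does not allow a thread to execute more than one ordered region in the same loop iteration, so the two ordered blocks above would have to be merged or the loop split for the code to be conforming. In any case, unless the central part of the loop where n is retrieved does some serious computation, this won't be any faster than the sequential version.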

Heterogeneous OpenMP parallel loop with Intel MIC offloading

I am working on a code which includes a loop with many iterations (~10^6-10^7) where an array (let's say, 'myresult') is being calculated via summation over lots of contributions. In Fortran 90 with OpenMP, this will look something like:
!$omp parallel do
!$omp& reduction(+:myresult)
do i=1,N
myresult(i) = myresult(i) + [contribution]
enddo
!$omp end parallel do
The code will be run on a system with Intel Xeon Phi coprocessors, and I would of course like to benefit from them if possible. I have tried using MIC offloading statements (!dir$ offload target ...) with OpenMP so that the loop runs on just the coprocessor, but then I am wasting host CPU time while it sits there waiting for the coprocessor to finish. Ideally, one could divide the loop between the host and the device, so I would like to know if something like the following is feasible (or if there is a better approach); the loop will only run on one core on the host (though perhaps with OMP_NUM_THREADS=2?):
!$omp parallel sections
!$omp& reduction(+:myresult)
!$omp section ! parallel calculation on device
!dir$ offload target mic
!$omp parallel do
!$omp& reduction(+:myresult)
(do i=N/2+1,N)
!$omp end parallel do
!$omp section ! serial calculation on host
(do i=1,N/2)
!$omp end parallel sections

The general idea would be to use an asynchronous offload to MIC so that the CPU could continue. Setting aside the details of how to divide the work, this is how it is expressed:
module m
!dir$ attributes offload:mic :: myresult, micresult
integer :: myresult(10000)
integer :: result
integer :: micresult
end module
use m
N = 10000
result = 0
micresult = 0
myresult = 0
!dir$ omp offload target(mic:0) signal(micresult)
!$omp parallel do reduction(+:micresult)
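do i=N/2+1,N ! second half of the array, on the coprocessor
micresult = micresult + myresult(i) + 55
enddo
!$omp end parallel do
!$omp parallel do reduction(+:result)
do i=1,N/2 ! first half of the array, on the host
result = result + myresult(i) + 55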
enddo
!$omp end parallel do
!dir$ offload_wait target(mic:0) wait(micresult)
result = result + micresult
end

Have you considered using MPI symmetric mode instead of offload? In case you haven't, MPI can do what you describe: you start two MPI ranks, one on the host and one on the coprocessor. Each rank performs a parallel loop using OpenMP.

Why bother to initialize a reduction variable outside the parallel construct?

I am learning how to port my Fortran code to OpenMP. When I read an online tutorial (see here) I came across one question.
First, I learned from page 28 that the value of a reduction variable is undefined from the moment the first thread reaches the clause until the operation has completed.
To my understanding, this implies that it doesn't matter whether I initialize the reduction variable before the program reaches the parallel construct, because it is not defined until the operation completes. However, the sample code on page 27 of the same tutorial initializes the reduction variable before the parallel construct.
Could anyone please let me know which treatment is correct? Thanks.
Lee
sum = 0.0
!$omp parallel default(none) shared(n,x) private(i)
do i = 1, n
sum = sum + x(i)
end do
!$omp end do
!$omp end parallel
print*, sum

After fixing your code:
integer,parameter :: n = 10000
real :: x(n)
x = 1
sum = 0
!$omp parallel do default(none) shared(x) private(i) reduction(+:sum)
do i = 1, n
sum = sum + x(i)
end do
!$omp end parallel do
print*, sum
end
Notice that the value to which you initialize sum matters! If you change it, you get a different result. You clearly have to initialize it properly; without proper initialization even the OpenMP version is ill-defined.
Yes, the value of sum is not defined until completing the loop, but that doesn't mean it can be undefined before the loop.

For one thing, one of the nice features of OpenMP is that if you compile the program without enabling OpenMP, the program can (and should!) be a valid serial program as well. The serial version of your example would be ill-defined without initializing "sum" before the loop.
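Both points can be seen in a small sketch: the value the reduction variable holds before the construct is combined with the per-thread sums, and the same source gives the same answer when the directives are ignored. The module name, the function name offset_sum and its start argument are made up for this illustration.

```fortran
module reduction_init_demo
  implicit none
contains

  ! Sums x, starting from the value "start" that the reduction variable
  ! holds before the parallel loop is entered.
  real function offset_sum(x, start) result(total)
    real, intent(in) :: x(:), start
    real :: acc
    integer :: i
    acc = start          ! this initial value is part of the final result
    !$omp parallel do reduction(+:acc)
    do i = 1, size(x)
      acc = acc + x(i)
    end do
    !$omp end parallel do
    total = acc
  end function offset_sum

end module reduction_init_demo
```

For an array of 10000 ones, offset_sum(x, 0.0) gives 10000 and offset_sum(x, 5.0) gives 10005, with or without OpenMP enabled. Leaving acc unset before the loop would make the result undefined in both builds.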
