omp sections using private(thread_num) clause vs default (without clauses)

I run the following code in two variants:
"!$omp sections" and "!$omp sections private(thread_num)"
Is each section executed by a different thread in both cases?
program main
use omp_lib
implicit none
integer, parameter:: ma=100, n=10000, mb= 100
real, dimension (ma,n) :: a
real, dimension (n,mb) :: b
real, dimension (ma,mb) :: c = 0.
integer:: i,j,k, threads=2, ppt, thread_num
integer:: toc, tic, rate
real:: time_parallel, time
call random_number (a)
call random_number (b)
!/////////////////////// PARALLEL PRIVATE ///////////////////////
c=0
CALL system_clock(count_rate=rate)
call system_clock(tic)
ppt = ma/threads
!$ call omp_set_num_threads(threads)
!$omp parallel
!$omp sections private(thread_num) !(HERE IS THE QUESTION TOPIC)
! EXAMPLE PROCESS 1 (it is only an example to test 'omp sections')
!$omp section
!$ thread_num = omp_get_thread_num()
!$ print*, "Section 1 started by thread number:", thread_num
do i= 1,50
do j= 1,mb
do k= 1,n
c(i,j) = c(i,j) + a(i,k)*b(k,j)
end do
end do
end do
!$ print*, "Section 1 finished by thread number:", thread_num
! EXAMPLE PROCESS 2
!$omp section
!$ thread_num = omp_get_thread_num()
!$ print*, "Section 2 started by thread number:", thread_num
do i= 51,100
do j= 1,mb
do k= 1,n
c(i,j) = c(i,j) + a(i,k)*b(k,j)
end do
end do
end do
!$ print*, "Section 2 finished by thread number:", thread_num
!$omp end sections
!$omp end parallel
print*, '//////////////////////////////////////////////////////////////'
print*, 'Result in Parallel'
!$ print*, c(85:90,40)
call system_clock(toc)
time_parallel = real(toc-tic)/real(rate)
!/////////////////////// normal execution ///////////////////////
c = 0
CALL system_clock(count_rate=rate)
call system_clock(tic)
do i= 1,ma
do j= 1,mb
do k= 1,n
c(i,j) = c(i,j) + a(i,k)*b(k,j)
end do
end do
end do
call system_clock(toc)
time = real(toc-tic)/real(rate)
print*, 'Result in serial mode'
print*, c(85:90,40)
print*, '------------------------------------------------'
print*, 'Threads: ', threads, '| Time Parallel ', time_parallel, 's '
print*, ' Time Normal ', time, 's'
!----------------------------------------------------------------
end program main
These are the results for "!$omp sections" and "!$omp sections private(thread_num)", respectively:
Section 1 started by thread number: 1
Section 2 started by thread number: 1
Section 1 finished by thread number: 1
Section 2 finished by thread number: 1
//////////////////////////////////////////////////////////////
Result in Parallel
2507.23853 2494.16162 2496.83960 2503.58960 2509.34448
2518.64160
Result in serial mode
2507.23853 2494.16162 2496.83960 2503.58960 2509.34448
2518.64160
Threads: 2 | Time Parallel 0.428116574
Time Normal 0.605000019 s
Section 1 started by thread number: 0
Section 2 started by thread number: 1
Section 1 finished by thread number: 0
Section 2 finished by thread number: 1
//////////////////////////////////////////////////////////////
Result in Parallel
2523.38281 2501.28369 2517.81860 2502.66235 2503.13940
2532.35791
Result in serial mode
2523.38281 2501.28369 2517.81860 2502.66235 2503.13940
2532.35791
Threads: 2 | Time Parallel 0.432999998
Time Normal 0.610204018 s
It was compiled using:
gfortran -Wall -fopenmp -O2 -Wall -o prog.exe prueba.f90
./prog.exe
CPU model of my laptop:
AMD A6-6310 (4 cores and a thread per core)
P.S: the main goal is to test parallel clauses not to speedup matrix calculations

thread_num should definitely be a private variable. Otherwise both threads share the same variable, which is why you got the value 1 from both threads: writing to the same variable from two threads is a race condition.
You can make it private for the whole parallel region and call omp_get_thread_num() only once, at the start of the region.
!$omp parallel private(thread_num)
!$ thread_num = omp_get_thread_num()
!$omp sections
!$omp section
!$ print*, "Section 1 started by thread number:", thread_num
...
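For reference, here is a minimal self-contained sketch of the same pattern with the matrix work left out. The program name sections_demo and the choice of two threads are assumptions made for illustration; it needs to be compiled with OpenMP enabled (e.g. -fopenmp with gfortran), since the OpenMP calls are not guarded by the !$ sentinel.

```fortran
program sections_demo
  use omp_lib
  implicit none
  integer :: thread_num

  call omp_set_num_threads(2)

  ! thread_num is private for the whole parallel region, so each
  ! thread keeps its own copy and there is no race on it.
  !$omp parallel private(thread_num)
  thread_num = omp_get_thread_num()
  !$omp sections
  !$omp section
  print *, "Section 1 run by thread number:", thread_num
  !$omp section
  print *, "Section 2 run by thread number:", thread_num
  !$omp end sections
  !$omp end parallel
end program sections_demo
```

Which thread picks up which section is up to the runtime, so the thread numbers printed may differ from run to run.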

Related

In parallel computing, why is the execution time longer when using all threads (4) than when using only half (2)?

E.g, I'm using this code (CPU: 4 cores (thread per core)):
program main
use omp_lib
implicit none
integer, parameter:: ma=100, n=10000, mb= 100
integer:: istart, iend
real, dimension (ma,n) :: a
real, dimension (n,mb) :: b
real, dimension (ma,mb) :: c = 0.
integer:: i,j,k, threads=2, ppt, thread_num
integer:: toc, tic, rate
real:: time_parallel, time
call random_number (a)
call random_number (b)
!/////////////////////// 1- PARALLEL PRIVATE ///////////////////////
CALL system_clock(count_rate=rate)
call system_clock(tic)
ppt = ma/threads
!$ call omp_set_num_threads(threads)
!$omp parallel default(shared) private(istart, iend, &
!$omp thread_num, i)
!$ thread_num = omp_get_thread_num()
!$ istart = thread_num*ppt +1
!$ iend = min(ma, thread_num*ppt + ppt)
do i= istart,iend
do j= 1,mb
do k= 1,n
c(i,j) = c(i,j) + a(i,k)*b(k,j)
end do
end do
end do
!$omp end parallel
print*, 'Result in parallel mode'
!$ print*, c(85:90,40)
call system_clock(toc)
time_parallel = real(toc-tic)/real(rate)
!/////////////////////// 2-normal execution ///////////////////////
c = 0
CALL system_clock(count_rate=rate)
call system_clock(tic)
do i= 1,ma
do j= 1,mb
do k= 1,n
c(i,j) = c(i,j) + a(i,k)*b(k,j)
end do
end do
end do
call system_clock(toc)
time = real(toc-tic)/real(rate)
print*, 'Result in serial mode'
print*, c(85:90,40)
print*, '------------------------------------------------'
print*, 'Threads: ', threads, '| Time Parallel Private', time_parallel, 's '
print*, ' Time Normal ', time, 's'
!----------------------------------------------------------------
end program main
I get the following results:
First execution:
Result in parallel mode
2477.89478 2528.50391 2511.84204 2528.12061 2500.79517
2510.69971
Result in serial mode
2477.89478 2528.50391 2511.84204 2528.12061 2500.79517
2510.69971
Threads: 2 | Time Parallel Private 0.379999995 s
Time Normal 0.603999972 s
Second execution:
Result in parallel mode
2492.20679 2496.56152 2500.58203 2516.51685 2516.43604
2530.71313
Result in serial mode
2492.20679 2496.56152 2500.58203 2516.51685 2516.43604
2530.71313
------------------------------------------------
Threads: 4 | Time Parallel Private 1.11500001 s
Time Normal 0.486000001 s
It was compiled using:
gfortran -Wall -fopenmp -g -O2 -o prog.exe prueba.f90
./prog.exe
If you have N cores and use N threads, then some of your threads get switched out for other processes and threads. So it is preferable to use fewer threads than the number of available cores.
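One way to act on this suggestion is to derive the thread count from the number of processors the runtime reports. The sketch below is only an illustration: the program name pick_threads and the "leave one core free" rule are assumptions, not a general recommendation, and the best count depends on what else is running on the machine.

```fortran
program pick_threads
  use omp_lib
  implicit none
  integer :: ncores, threads

  ! omp_get_num_procs() returns the number of processors available
  ! to the program.
  ncores = omp_get_num_procs()

  ! Assumption for illustration: leave one core free for the OS and
  ! other processes, but never go below one thread.
  threads = max(1, ncores - 1)
  call omp_set_num_threads(threads)

  !$omp parallel
  !$omp single
  print *, 'Cores:', ncores, ' Threads in team:', omp_get_num_threads()
  !$omp end single
  !$omp end parallel
end program pick_threads
```

Timing the real workload with a few different thread counts is still the most reliable way to choose.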

Discrepancy in results between OpenMP/OpenACC implementation and gcc/PGI compilers

I have a larger Fortran program that I am trying to convert so that the computationally intensive part will run on an NVidia GPU using OpenMP and/or OpenACC. During development I had trouble understanding how variables declared in a module can be used within subroutines that are executed on the GPU (and some of them also on the CPU). Therefore, I created a small example and worked on that, experimenting and adding the corresponding OpenMP and OpenACC directives. I have included the three files that comprise my example at the end of this message.
Just as I thought that I had understood things and that my example program works, I noticed the following:
I compile the program with gcc 10.2 using the OpenMP directives:
gfortran -O3 -fopenmp -Wall -Wextra test_link.f90 parameters.f90 common_vars.f90 -o test_link
The results are as expected, i.e. all elements of array XMO are 1, of DCP are 2, of IS1 are 3 and of IS2 are 24.
I compile the program with PGI compiler 19.10 community edition using the OpenACC directives:
pgfortran -O4 -acc -ta=tesla,cc35 -Minfo=all,mp,accel -Mcuda=cuda10.0 test_link.f90 common_vars.f90 parameters.f90 -o test_link
The results are the same as above.
I compile the program with gcc 10.2 using the OpenACC directives:
gfortran -O3 -fopenacc -Wall -Wextra test_link.f90 parameters.f90 common_vars.f90 -o test_link
The results for arrays XMO, DCP and IS1 are correct, but all elements of IS2 are 0. It is easy to verify that variable NR has a value of 0 to get this result.
My understanding is that the OpenMP and OpenACC version of my example are equivalent, but I cannot figure out why the OpenACC version works only for the PGI compiler and not for gcc.
If possible, please provide solutions that do not require changes in the code but only in the directives. As I mentioned, my original code is much larger, contains many more module variables and calls many more subroutines in the code to be executed on the GPU. Changes in that code will be much more difficult to do and obviously I would prefer to do that only if really necessary.
Thank you in advance!
The files of my example follow.
File parameters.f90
MODULE PARAMETERS
IMPLICIT NONE
INTEGER, PARAMETER :: MAX_SOURCE_POSITIONS = 100
END MODULE PARAMETERS
File common_vars.f90
MODULE COMMON_VARS
USE PARAMETERS
IMPLICIT NONE
!$OMP DECLARE TARGET TO(NR)
INTEGER :: NR
!$ACC DECLARE COPYIN(NR)
END MODULE COMMON_VARS
File test_link.f90
SUBROUTINE TEST()
USE COMMON_VARS
IMPLICIT NONE
!$OMP DECLARE TARGET
!$ACC ROUTINE SEQ
INTEGER I
I = NR
END SUBROUTINE TEST
PROGRAM TEST_LINK
USE COMMON_VARS
USE PARAMETERS
IMPLICIT NONE
INTERFACE
SUBROUTINE TEST()
!$OMP DECLARE TARGET
!$ACC ROUTINE SEQ
END SUBROUTINE TEST
END INTERFACE
REAL :: XMO(MAX_SOURCE_POSITIONS), DCP(MAX_SOURCE_POSITIONS)
INTEGER :: IS1(MAX_SOURCE_POSITIONS), IS2(MAX_SOURCE_POSITIONS)
INTEGER :: X, Y, Z, MAX_X, MAX_Y, MAX_Z, ISOUR
MAX_X = 3
MAX_Y = 4
MAX_Z = 5
NR = 6
!$OMP TARGET UPDATE TO(NR)
!$OMP TARGET MAP(TOFROM:IS1,IS2,DCP,XMO)
!$OMP TEAMS DISTRIBUTE PARALLEL DO COLLAPSE(3)
!$ACC UPDATE DEVICE(NR)
!$ACC PARALLEL LOOP GANG WORKER COLLAPSE(3) INDEPENDENT &
!$ACC COPY(IS1,IS2,DCP,XMO)
DO X = 1, MAX_X
DO Y = 1, MAX_Y
DO Z = 1, MAX_Z
ISOUR = (X - 1)*MAX_Y*MAX_Z + (Y - 1)*MAX_Z + Z
XMO(ISOUR) = 1.0
DCP(ISOUR) = 2.0
IS1(ISOUR) = 3
IS2(ISOUR) = 4 * NR
CALL TEST()
ENDDO ! End of z loop
ENDDO ! End of y loop
ENDDO ! End of x loop
!$ACC END PARALLEL LOOP
!$OMP END TEAMS DISTRIBUTE PARALLEL DO
!$OMP END TARGET
DO X = 1, MAX_X
DO Y = 1, MAX_Y
DO Z = 1, MAX_Z
ISOUR = (X - 1)*MAX_Y*MAX_Z + (Y - 1)*MAX_Z + Z
WRITE(*, *) 'ISOUR = ', ISOUR, 'XMO = ', XMO(ISOUR), 'DCP = ', DCP(ISOUR), 'IS1 = ', IS1(ISOUR), 'IS2 = ', IS2(ISOUR)
ENDDO ! End of z loop
ENDDO ! End of y loop
ENDDO ! End of x loop
END PROGRAM TEST_LINK

How to use task (or probably section) to read big-data while data-processing

As can be seen from the code below, reading the big data takes up a large portion of the total CPU time.
In my opinion, there should be some way to make the data reading more efficient. For instance, while one thread is reading the data, the other threads could do some data processing at the same time.
I have used OpenMP to speed up the data processing (part two), but I need further help to figure out a way to optimize part one (TASK OR SECTION).
---------------updated-----------------------
At the current stage, I do not want to do multiple reads/writes, which could possibly be done with MPI (MPI_FILE_WRITE_ALL). All I want is the following: one thread reads the data of the next time step while the other threads do the rest of the work for the current time step, using task or section constructs. Any suggestions in this direction?
Program main
Implicit none
Integer i,j,k, ii, Count, rl
Integer, Parameter :: Nxt=961, Nyt=526, Nzt=100
Integer OMP_GET_THREAD_NUM, TID, OMP_GET_NUM_THREADS, NTHREADS
Real(4), Dimension(Nxt,Nyt,Nzt) :: Ui, Vi, Wi, Pi
Real(4), Dimension(Nxt*4,Nyt,Nzt) :: Utotal
real*8:: start, finish, OMP_GET_WTIME
Character(len=50) :: filename
call OMP_SET_NUM_THREADS(6)
!---------=====OpenMP Number Threads=======------------
!$OMP PARALLEL PRIVATE(NTHREADS, TID)
!$ TID = OMP_GET_THREAD_NUM()
! Only master thread does this
!$ IF (TID .EQ. 0) THEN
!$ NTHREADS = OMP_GET_NUM_THREADS()
!$ PRINT *, 'Number of threads = ', NTHREADS
!$ END IF
!$OMP END PARALLEL
Do ii = 200000, 700000, 20
1912 format('../../../volume7/20_40/WI_Inst3Dsub_UVWP',I7.7)
1913 format('../../../volume8/40_60/WI_Inst3Dsub_UVWP',I7.7)
1914 format('../../../volume5/60_70/WI_Inst3Dsub_UVWP',I7.7)
if(ii .le. 400000) Write(filename,1912) ii
if(ii .gt. 400000) Write(filename,1913) ii
if(ii .ge. 600000) Write(filename,1914) ii
!$ start=OMP_GET_WTIME()
!---------Part 1---------------
inquire(iolength=rl) Utotal(:,:,:)
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED',&
ACCESS='DIRECT', RECL=rl, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1; READ(10,REC=COUNT) Utotal(:,:,:)
CLOSE(10)
!---------Part 2 ---------------
!$OMP PARALLEL DO PRIVATE(i,j,k) SHARED(Ui,Vi,Wi,Pi)
DO k = 1, Nzt
DO j = 1, Nyt
DO i = 1, Nxt
Ui(i,j,k) = Utotal(i+Nxt*0,j,k)
Vi(i,j,k) = Utotal(i+Nxt*1,j,k)
Wi(i,j,k) = Utotal(i+Nxt*2,j,k)
Pi(i,j,k) = Utotal(i+Nxt*3,j,k)
END DO; End Do; End Do
!$OMP END PARALLEL DO
!$ finish=OMP_GET_WTIME()
!$ Write(*,*) ii,'Time cost per step', finish-start
! THERE IS ALSO OTHER WORK
End DO
End program
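The overlap described above (one thread reads the next time step while the current one is processed) can be sketched with two buffers and a two-section construct. Everything in this sketch is an assumption made for illustration: the program name overlap_sketch, the helper load_step (a stand-in for the OPEN/READ/CLOSE sequence), the small array size, and the sum used as the "processing". It is not a drop-in replacement for the code above.

```fortran
program overlap_sketch
  implicit none
  integer, parameter :: n = 100000, nsteps = 4
  real :: buf(n, 2), total
  integer :: step, cur, nxt

  cur = 1
  call load_step(1, buf(:, cur))     ! read the first step up front
  do step = 1, nsteps
    nxt = 3 - cur                    ! index of the other buffer (1 <-> 2)
    !$omp parallel sections num_threads(2)
    !$omp section
    ! Reader: fetch the next step into the buffer that is not in use.
    if (step < nsteps) call load_step(step + 1, buf(:, nxt))
    !$omp section
    ! Worker: process the current step while the read is in flight.
    total = sum(buf(:, cur))
    !$omp end parallel sections
    print *, 'step', step, 'sum', total
    cur = nxt                        ! swap buffers for the next step
  end do
contains
  ! Hypothetical stand-in for reading one time step from disk.
  subroutine load_step(istep, dest)
    integer, intent(in) :: istep
    real, intent(out) :: dest(:)
    dest = real(istep)
  end subroutine load_step
end program overlap_sketch
```

In this form the worker section runs on a single thread. Spreading the processing over several threads while the read is in progress would need nested parallelism or tasks, which this sketch does not attempt.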

Restart a loop in Fortran

I have an algorithm that looks like this:
10 WRITE (*,*) "Start"
DO I = 1, 10
WRITE (*,*) "Step"
IF(I .EQ. 5) then
go to 10
END IF
END DO
I want to restart the loop when that if statement executes. However, I don't want to use a go to, so I tried this:
10 WRITE (*,*) "Start"
DO I = 1, 10
WRITE (*,*) "Step"
IF(I .EQ. 5) then
I = 0; CYCLE
END IF
END DO
But then I get an error saying that I can't redefine the variable I inside the loop. So I'm not sure how to approach this. Any help would be much appreciated.
A conceptually simple way to phrase this problem is: "I want to repeat a loop until it completes, where there is some abort condition".
This "repeat until it completes" is idiomatically a do construct with indeterminate iterations:
do
... ! Our actions; we exit this outer loop when we are satisfied
end do
[This could also be phrased as a do-while loop.]
With the inner loop:
do
do i=1,10
... ! A conditional statement which aborts the inner loop
... ! And some actions
end do
! And once we're complete we exit the outer loop
end do
Now it's just a matter of handling "abort inner" and "exit outer". For this we use cycle and exit:
outer: do
print*, 'Start'
do i=1,10
print*, 'Step'
if (...) cycle outer ! Abort the inner loop
end do
exit outer ! The inner loop completed, so we're done
end do outer
The outer loop is labelled so that the cycle statement in the inner loop can reference it. Without that label the cycle will cycle the innermost loop containing it.
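As a complete program, the labelled-loop pattern might look like the sketch below. The abort condition is left open in the snippet above, so the one used here (restart once, the first time i reaches 5) and the restarted flag are assumptions chosen so that the example terminates.

```fortran
program restart_demo
  implicit none
  integer :: i
  logical :: restarted

  restarted = .false.
  outer: do
    print *, 'Start'
    do i = 1, 10
      print *, 'Step', i
      ! Hypothetical abort condition: restart once, the first time
      ! i reaches 5.
      if (i == 5 .and. .not. restarted) then
        restarted = .true.
        cycle outer              ! abandon the inner loop, start over
      end if
    end do
    exit outer                   ! the inner loop completed, so we're done
  end do outer
end program restart_demo
```

This prints Start, steps 1 to 5, then Start again followed by steps 1 to 10. Without some state like the restarted flag, a condition that always holds at i = 5 would restart forever.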
There is likely a typo in here somewhere, but this shows some other ideas in a "do while" sense.
For example, one could poll for something coming in.
... USE ISO... etc stuff.
REAL(KIND=C_FLOAT), DIMENSION(10) :: A
LOGICAL(KIND=C_BOOL) :: Question1 = .TRUE.
LOGICAL(KIND=C_BOOL) :: Question2 = .TRUE.
INTEGER(KIND=C_INT) :: Index = 0
INTEGER(KIND=C_INT) :: I = 5
WRITE(*,*) 'Enter first index to skip:'
READ(*,*) I
Outer_Loop: DO WHILE (Question1) ! .eq. .TRUE.
Inner_Loop: DO WHILE (Question2) ! .EQV. .TRUE.
Index = Index + 1
IF(Index > UBOUND(A,1)) THEN
Question1 = .FALSE.
Question2 = .FALSE.
Exit
ENDIF
IF(Index == I) EXIT
!Whatever like A(INdex) = something....
ENDDO Inner_Loop
IF(Question1) THEN !We must have more to possibly read or do...
WRITE(*,*) 'Do more? 1=yes, 0=No'
READ(*,*) I
IF(I == 1) THEN
WRITE(*,*) 'Enter next index to skip:'
READ(*,*) I
Question2 = .TRUE.
!and one can do a reset of I=0 here, which then becomes I=1 the next time through the inner loop...
ELSE
Question1 = .FALSE.
Question2 = .FALSE.
EXIT
ENDIF
ELSE !Must have gotten here to exit at the end of array?
EXIT
ENDIF
ENDDO Outer_Loop
I can suggest two ways of solving this issue: either use a while loop or a recursive subroutine (which one depends heavily on your actual algorithm). I am attaching sample code for both cases. Keep in mind that relying on variables with the save attribute is not the best idea; it is much better to pass variables as arguments to the procedure. It is used here just for simplicity.
module functions
implicit none
integer, save :: restarted = 0
contains
recursive subroutine restart_loop
integer :: i
do i = 1, 10
print*, i
if ( i == 5 .and. restarted < 5 ) then
restarted = restarted + 1
call restart_loop
exit
end if
end do
end subroutine restart_loop
subroutine while_loop
integer :: i = 1
do while (i <= 10)
print*, i
if ( i == 5 .and. restarted < 5 ) then
i = 0 ! incremented to 1 below, so the loop restarts at 1
restarted = restarted + 1
end if
i = i + 1
end do
end subroutine while_loop
end module functions
program test_prog
use functions
implicit none
call while_loop
end program test_prog

When is !$OMP END PARALLEL DO necessary?

The OpenMP specification says that [!$OMP END PARALLEL DO] is optional and that, if it is not present, the compiler assumes it by default. Can I rely on this being handled correctly? For example:
Version 1: (no end parallel do)
do i = 1, N
!$OMP PARALLEL DO DEFAULT(SHARED)
do j = xadj(i), xadj(i+1) - 1
call residual(j, ...)
end do
end do
Version 2: (with end parallel do)
do i = 1, N
!$OMP PARALLEL DO DEFAULT(SHARED)
do j = xadj(i), xadj(i+1) - 1
call residual(j, ...)
end do
!$OMP END PARALLEL DO
end do
Although the specification says it is optional (safe), how safe is it in practice (for the above example)?
