OpenMP calling subroutines in threads - parallel-processing

The problem I'm facing is as outlined below:
module k
integer :: l,m
end module k
program p4
use k
integer :: i,j,omp_get_thread_num,cr
i = 2
j = 3
!$omp parallel num_threads(2) shared(l,m) private(i,j,cr)
cr = omp_get_thread_num()
if (cr == 0) goto 1111
call sub1(i)
write(*,*) l
goto 2222
1111 call sub2(j)
write(*,*) m
2222 continue
!$omp end parallel
end program p4
subroutine sub1(a)
use k
integer :: a
l = a**2
write(*,*) 'entered sub1'
end subroutine sub1
subroutine sub2(b)
use k
integer :: b
m = b**2
write(*,*) 'entered sub2'
end subroutine sub2
I've tried to parallelize a serial program (which, after parallelization, looks as written above). I want essentially the same operation performed twice, so ideally I want the output to be
entered sub1
4
entered sub2
9
but the output is
entered sub2
0
entered sub1
923239424
I'm new to parallel programming (my actual problem is a more complicated version of the one outlined above). Can anyone point out the mistakes and suggest improvements? Thanks.

OpenMP private variables are NOT given initial values; therefore both calls to sub1 and sub2 are made with undefined values of i and j. What you are (probably) looking for is firstprivate instead:
!$omp parallel num_threads(2) shared(l,m) private(cr) firstprivate(i,j)
...
!$omp end parallel
firstprivate initialises each private copy with the value that the corresponding variable in the main thread had on entry into the parallel region.
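For completeness, here is a minimal sketch of the parallel region from the question with only the data-sharing clauses changed (the body is exactly as posted):
!$omp parallel num_threads(2) shared(l,m) private(cr) firstprivate(i,j)
cr = omp_get_thread_num()
if (cr == 0) goto 1111
call sub1(i)   ! i starts with the value 2 set before the region
write(*,*) l
goto 2222
1111 call sub2(j)   ! j starts with the value 3 set before the region
write(*,*) m
2222 continue
!$omp end parallel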
By the way, emulating IF/THEN/ELSE/END IF with IF/GOTO/CONTINUE in Fortran 90 and later is considered bad programming style by many. You could use OpenMP sections instead:
!$omp parallel sections num_threads(2) shared(l,m) private(cr) firstprivate(i,j)
!$omp section
call sub1(i)
write(*,*) l
!$omp section
call sub2(j)
write(*,*) m
!$omp end parallel sections

Related

FFTW3 gfortran code is correct only when arraysize is LE 16

I have fftw3 code and it works ONLY when the array size is less than or equal to 16. Could some expert please tell me what I am doing wrong?
program fftcalc
implicit none
integer n
parameter (n=128)
double precision in(n)
double complex out(1+(n/2))
integer*8 p, p2
integer i,j
real fact
real f,dt,t,pi
double precision re
integer FFTW_FORWARD,FFTW_BACKWARD
parameter (FFTW_FORWARD=-1,FFTW_BACKWARD=1)
integer FFTW_REAL_TO_COMPLEX,FFTW_COMPLEX_TO_REAL
parameter (FFTW_REAL_TO_COMPLEX=-1,FFTW_COMPLEX_TO_REAL=1)
integer FFTW_ESTIMATE,FFTW_MEASURE
parameter (FFTW_ESTIMATE=0,FFTW_MEASURE=1)
integer FFTW_OUT_OF_PLACE,FFTW_IN_PLACE,FFTW_USE_WISDOM
parameter (FFTW_OUT_OF_PLACE=0)
parameter (FFTW_IN_PLACE=8,FFTW_USE_WISDOM=16)
integer FFTW_THREADSAFE
parameter (FFTW_THREADSAFE=128)
f=25.
pi=3.14159
dt=0.004
do i=1,n
j=i-1
t=j*dt
re=cos(2.*pi*f*t)
in(i)=re
write(*,*) 'i,in',i,in(i)
enddo
write(*,*) "creating plans"
call dfftw_plan_dft_r2c_1d(p,n,in,out,fftw_measure)
call dfftw_plan_dft_c2r_1d(p2,n,in,out,fftw_measure)
write(*,*) "execute"
call dfftw_execute_dft_r2c(p,in,out)
write(*,*) "forward is executed"
do i=1,n
write(*,*) out(i)
enddo
write(*,*)
write(*,*)"do inverse fft"
call dfftw_execute_dft_c2r(p2, out,in)
write(*,*)"inverse fft completed"
fact=1.0/n
do i=1,n
write(*,*) in(i)*fact
enddo
write(*,*)"clean up"
call dfftw_destroy_plan(p)
call dfftw_destroy_plan(p2)
end program fftcalc
I tried with n=16 and n=10 and it works, but when n=64 or so it gives a mostly zero forward transform.
How do I solve the problem?

OpenMP Sparse Jacobi

I'm trying to determine if there is a way to parallelize the Jacobi method using sparse matrix formats (specifically Compressed Row Format)
I have a working sparse matrix Jacobi. I don't know if I can place
!$OMP PARALLEL DO
directives on the middle do loop, because x is being both written to and read from. I guess the inner do loop could have it, but the same t is being overwritten, so I don't know if it is possible there either. Am I overlooking something here? Thanks.
x(:) = 0
do p = 1, numIterations
do i=1, n
t=b(i)
do j = IA(i), IA(i+1) - 1
if (j == i) then
d=A(j)
else
t = t - A(j) * x(jA(j))
end if
end do
x(i) = t/d
end do
end do
It is true that you have a dependency on t in the inner loop, since it is used as an accumulator. However, that also means you can have a private copy of t in each of the threads (since the arrays A and x are not written in the loop, the value of t only depends on j, which is also thread-private).
The following should work:
x(:) = 0
do p = 1, numIterations
do i=1, n
t=0
!$OMP PARALLEL DO REDUCTION(+:t)
do j = IA(i), IA(i+1) - 1
if (j == i) then
d = A(j)
else
t = t + A(j) * x(jA(j))
end if
end do
!$OMP END PARALLEL DO
x(i) = (b(i)-t)/d
end do
end do
Note that d can only be written by one of the threads, so the variable can be shared between the threads; there are no loop-carried dependencies on d.
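Regarding the question of parallelizing the middle (row) loop: below is a hedged sketch that assumes the textbook Jacobi formulation with a separate array xnew for the new iterate (xnew is not in the posted code). With the new values written to xnew, each row update only reads the previous iterate x, so the row loop carries no dependencies; the diagonal test is kept as written in the question.
x(:) = 0
do p = 1, numIterations
!$OMP PARALLEL DO PRIVATE(j, t, d)
do i = 1, n
t = b(i)
do j = IA(i), IA(i+1) - 1
if (j == i) then
d = A(j)
else
t = t - A(j) * x(jA(j))
end if
end do
xnew(i) = t / d
end do
!$OMP END PARALLEL DO
x(:) = xnew(:)   ! advance to the next Jacobi iterate
end do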

How to Avoid Conditionals in Loops

In this link, the author gives an example as
subroutine threshold(a, thresh, ic)
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt
integer :: n, j
ic = 0
tt = 0.d0
n = size(a)
do j = 1, n
tt = tt + a(j) * a(j)
if (sqrt(tt) >= thresh) then
ic = j
return
end if
end do
end subroutine threshold
and the author commented this code as
An alternative approach, which would allow for many optimizations
(loop unrolling, CPU pipelining, less time spent evaluating the
conditional) would involve adding tt in blocks (e.g., blocks of size
128) and checking the conditional after each block. When the
condition is met, the last block can be repeated to determine the
value of ic.
What does this mean? Loop unrolling? CPU pipelining? Adding tt in blocks?
How can the code be optimized as the author suggests?
If the loop is performed in chunks/blocks that fit into the CPU cache you will reduce the number of cache misses, and consequently the number of cache lines retrieved from memory. This increases the performance on all loops that are limited by memory operations.
If the corresponding block size is BLOCKSIZE, this is achieved by
do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
end do
This, however, leaves a remainder that is not treated in the main loop. To illustrate this, consider an array of length 1000 and BLOCKSIZE = 128. The first seven blocks cover elements 1 to 896, but the remaining elements 897 to 1000 do not fill a complete block. Therefore, another loop for the remainder is required:
do j = (n/BLOCKSIZE)*BLOCKSIZE + 1, n
! ...
enddo
While it makes little sense to remove the conditional from the short remainder loop, in the blocked main loop the check can be moved out of the inner loop and performed once per block.
Since no branches then occur in the inner loop, aggressive optimizations become applicable.
However, this limits the "accuracy" of the determined position to whole blocks. To recover element-wise accuracy, the calculation has to be repeated for the block in which the threshold was crossed.
Here is the complete code:
subroutine threshold_block(a, thresh, ic)
implicit none
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt, tt_bak, thresh_sqr
integer :: n, j, jj
integer,parameter :: BLOCKSIZE = 128
ic = 0
tt = 0.d0
thresh_sqr = thresh**2
n = size(a)
! Perform the loop in chunks of BLOCKSIZE
do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE   ! only complete blocks; remainder handled below
tt_bak = tt
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
! Perform the check on the block level
if (tt >= thresh_sqr) then
! If the threshold is reached, repeat the last block
! to determine the last position
tt = tt_bak
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
if (tt >= thresh_sqr) then
ic = jj
return
end if
end do
end if
end do
! Remainder is treated element-wise
do j = (n/BLOCKSIZE)*BLOCKSIZE + 1, n
tt = tt + a(j) * a(j)
if (tt >= thresh_sqr) then
ic = j
return
end if
end do
end subroutine threshold_block
Please note that compilers are nowadays very good at creating blocked loops in combination with other optimizations. In my experience it is quite difficult to get better performance out of such simple loops by tweaking them manually.
Loop blocking is enabled in gfortran with the compiler option -floop-block.
Loop unrolling can be done manually, but should usually be left to the compiler. The idea is to perform the loop in blocks and, instead of an inner loop as shown above, perform the operations by duplicating the code. Here is an example for the inner loop given above, with an unrolling factor of four:
do jj = j, j+BLOCKSIZE-1,4
tt = tt + a(jj) * a(jj)
tt = tt + a(jj+1) * a(jj+1)
tt = tt + a(jj+2) * a(jj+2)
tt = tt + a(jj+3) * a(jj+3)
end do
Here, no remainder can occur if BLOCKSIZE is a multiple of 4. You can probably shave off a few operations in here ;-)
The gfortran compiler option to enable this is -funroll-loops
As far as I know, CPU pipelining (instruction pipelining) cannot be enforced manually in Fortran; this task is up to the compiler.
Pipelining sets up a pipeline of instructions. You feed the complete array into that pipeline and, after the wind-up phase, you get a result with each clock cycle. This drastically increases the throughput.
However, branches are difficult (impossible?) to handle in pipelines, and the array should be long enough that the time required for setting up the pipeline and for the wind-up and wind-down phases is compensated.

OpenMP over Summation

I have been trying to apply OpenMP to a simple summation operation inside two nested loops, but it has produced incorrect results so far. I have been looking around here and here, and also here. All suggest using the reduction clause, but it does not work in my case: it produces a very large number, which leads to a segmentation fault.
I also tried the approach posted here and in my own question here, which has been solved. Both avoid reduction and simply set the summation variable as shared, but that also produces incorrect results. Is there something I am missing? When should reduction be used, and when not, for a summation operation?
Codes using reduction clause
index = 0
!$OMP PARALLEL DO PRIVATE(iy,ix) REDUCTION(:+index)
do iy = 1, number(2)
do ix = 1, number(1)
index = index + 1
xoutput(index)=xinput(ix)
youtput(index)=yinput(iy)
end do
end do
!$OMP END PARALLEL DO
Code without using reduction clause
index = 0
!$OMP PARALLEL DO PRIVATE(iy,ix) SHARED(index)
do iy = 1, number(2)
do ix = 1, number(1)
index = index + 1
xoutput(index)=xinput(ix)
youtput(index)=yinput(iy)
end do
end do
!$OMP END PARALLEL DO
I think you have a misconception of what the reduction clause does...
REDUCTION(+:index)
means that you will have the correct sum in index at the end. In each step of the iteration, each thread will have its own copy with a different value! So the reduction is not suitable for managing array indices during the parallel section.
Let me try to illustrate this...
The following loop
!$OMP PARALLEL DO PRIVATE(iy) REDUCTION(+:index)
do iy = 1, number(2)
index = index + 1
end do
!$OMP END PARALLEL DO
is (more or less) equivalent to
!$OMP PARALLEL PRIVATE(iy, privIndex) SHARED(index)
privIndex = 0   ! each private copy starts at the reduction identity
!$OMP DO
do iy = 1, number(2)
privIndex = privIndex + 1
end do
!$OMP END DO
!$OMP CRITICAL
index = index + privIndex
!$OMP END CRITICAL
!$OMP END PARALLEL
You can see that during the loop each thread works on its own private variable privIndex and calculates a local (partial) sum. In the end, the total sum is formed, using a critical section to avoid race conditions.
This might not be what the compiler does, but it gives you an idea how a reduction works: at no point within the first loop does privIndex correspond to the correct index you would expect in the serial version.
As Vladimir suggests in his comment, you can calculate the index directly as you are only incrementing it in the inner loop:
!$OMP PARALLEL DO PRIVATE(iy,ix, index)
do iy = 1, number(2)
do ix = 1, number(1)
index = (iy-1)*number(1) + ix
xoutput(index)=xinput(ix)
youtput(index)=yinput(iy)
end do
end do
!$OMP END PARALLEL DO

Implied do vs. explicit loop with IO

I realize this question has been asked before, but not in the context of IO. Is there any reason to believe that:
!compiler can tell that it should write the whole array at once?
!but perhaps compiler allocates/frees temporary array?
write(UNIT) (/( arr(i), i=1,N )/)
would be any more efficient than:
!compiler does lots of IO here?
do i=1,N
write(UNIT) arr(i)
enddo
for a file which is opened as:
open(unit=UNIT,access='STREAM',file=fname,status='UNKNOWN')
There is a possibility that this will be used with compiler options that turn off buffered writing as well...
As suggested by @HighPerformanceMark, here's a simple benchmark I set up:
Using gfortran:
program main
implicit none
include 'mpif.h'
integer, parameter :: N = 1000000
integer :: unit = 22
integer i
real*8 arr(N)
real*8 t1
integer repeat
external test1
external test2
external test3
repeat=15
call MPI_INIT(i)
arr = 0
call timeit(test1,repeat,arr,N,t1)
print*,t1/repeat
call timeit(test2,repeat,arr,N,t1)
print*,t1/repeat
call timeit(test3,repeat,arr,N,t1)
print*,t1/repeat
call MPI_Finalize(i)
end
subroutine timeit(sub,repeat,arr,size,time)
include 'mpif.h'
external sub
integer repeat
integer size
real*8 time,t1
real*8 arr(size)
integer i
time = 0
do i=1,repeat
open(unit=10,access='STREAM',file='test1',status='UNKNOWN')
t1 = mpi_wtime()
call sub(10,arr,size)
time = time + (mpi_wtime()-t1)
close(10)
enddo
return
end
subroutine test1(ou,a,N)
integer N
real*8 a(N)
integer ou
integer i
do i=1,N
write(ou),a(i)
enddo
return
end
subroutine test2(ou,a,N)
integer N
real*8 a(N)
integer ou
integer i
write(ou),(a(i),i=1,N)
return
end
subroutine test3(ou,a,N)
integer N
real*8 a(N)
integer ou
write(ou),a(1:N)
return
end
My results are (unbuffered):
temp $ GFORTRAN_UNBUFFERED_ALL=1 mpirun -np 1 ./test
6.2392100652058922
3.3046503861745200
9.76902325948079409E-002
(buffered):
temp $ GFORTRAN_UNBUFFERED_ALL=0 mpirun -np 1 ./test
2.7789104779561362
0.15584923426310221
9.82964992523193415E-002
I compiled and ran the above benchmark code using both gfortran (4.7.2 20120921) and ifort (13.0.0.079 Build 20120731). My results are as follows:
gfortran
UNBUFFERED BUFFERED
test1: 1.2614487171173097 0.20308602650960286
test2: 1.0525423844655355 3.4633986155192059E-002
test3: 5.9630711873372398E-003 6.0543696085611975E-003
ifort
UNBUFFERED BUFFERED
test1: 1.33864809672038 0.171342913309733
test2: 6.001885732014974E-003 6.095488866170247E-003
test3: 5.962880452473959E-003 6.007925669352213E-003
It would appear that the explicit loop in test1 is by far the most disadvantageous in both cases (without any optimisation flags set). Furthermore, with the Intel compiler there is no significant difference in execution time whether you run write(ou), (a(i), i=1, N) (case 2) or write(ou), a(1:N) (case 3, identical to simply write(ou), a in this case).
By the way, for this single-threaded process you can also just use the Fortran 95 intrinsic subroutine cpu_time, which returns the processor time in seconds (summed over all threads in a multi-threaded program). Alternatively, there is system_clock, which returns the number of elapsed clock ticks and the tick rate as integers, possibly at a higher resolution.
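For illustration, here is a minimal sketch of the timing routine rewritten around system_clock so that no MPI dependency is needed (the subroutine name timeit_sysclock and the 8-byte integer kind are illustrative choices, not part of the original code):
subroutine timeit_sysclock(sub, repeat, arr, size, time)
external sub
integer repeat, size
real*8 arr(size), time
integer i
integer(kind=8) :: c0, c1, rate
time = 0
call system_clock(count_rate=rate)   ! ticks per second
do i = 1, repeat
open(unit=10, access='STREAM', file='test1', status='UNKNOWN')
call system_clock(c0)
call sub(10, arr, size)
call system_clock(c1)
time = time + dble(c1 - c0) / dble(rate)   ! elapsed wall time in seconds
close(10)
enddo
return
end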
