How to use lock in OpenMP - parallel-processing

I am new to OpenMP and find it a little bit hard to understand how locks in OpenMP really work. Here is an example code written in Fortran 90 to do LU factorization. Can anyone explain how locks work in this code?
program lu
implicit none
integer, parameter :: DP=kind(0.0D0),n=20
!-- Variables
integer :: i,j,k,nthr,thrid,chunk=1
real(kind=DP), dimension(:,:),allocatable :: A,B,L,U
real(kind=DP) :: timer,error,walltime
integer(kind=8), dimension(n)::lck
integer::omp_get_thread_num,omp_get_max_threads
nthr=omp_get_max_threads()
allocate(A(n,n))
allocate(B(n,n))
allocate(L(n,n))
allocate(U(n,n))
!-- Set up locks for each column
do i=1,n
call omp_init_lock(lck(i))
end do
timer=walltime()
!$OMP PARALLEL PRIVATE(i,j,k,thrid)
thrid=omp_get_thread_num();
!-- Initiate matrix
!$OMP DO SCHEDULE(STATIC,chunk)
do j=1,n
do i=1,n
A(i,j)=1.0/(i+j)
end do
call omp_set_lock(lck(j))
end do
!$OMP END DO
!-- First column of L
if (thrid==0) then
do i=2,n
A(i,1)=A(i,1)/A(1,1)
end do
call omp_unset_lock(lck(1))
end if
!-- LU-factorization
do k=1,n
call omp_set_lock(lck(k))
call omp_unset_lock(lck(k))
!$OMP DO SCHEDULE(STATIC,chunk)
do j=1,n
if (j>k) then
do i=k+1,n
A(i,j)=A(i,j)-A(i,k)*A(k,j)
end do
if (j==k+1) then
do i=k+2,n
A(i,k+1)=A(i,k+1)/A(k+1,k+1)
end do
call omp_unset_lock(lck(k+1))
end if
end if
end do
!$OMP END DO NOWAIT
end do
!$OMP END PARALLEL
timer=walltime()-timer
write(*,*) 'n = ',n,' time = ',timer,' nthr = ',nthr
! CHECK CORRECTNESS
do j=1,n
L(j,j)=1
U(j,j)=A(j,j)
do i=j+1,n
L(i,j)=A(i,j)
U(i,j)=0
end do
do i=1,j-1
U(i,j)=A(i,j)
L(i,j)=0
end do
end do
B=0
do j=1,n
do k=1,n
do i=1,n
B(i,j)=B(i,j)+L(i,k)*U(k,j)
end do
end do
end do
error=0.0
do j=1,n
do i=1,n
error=error+abs(1.0/(i+j)-B(i,j))
end do
end do
write(*,*) 'ERROR: ',error
end program lu
Another file is listed below which contains the walltime function. It should be compiled with the main file together.
function walltime()
integer, parameter:: DP = kind(0.0D0)
real(DP) walltime
integer::count,count_rate,count_max
call system_clock(count,count_rate,count_max)
walltime=real(count,DP)/real(count_rate,DP)
end function walltime

DISCLAIMER: I don't have experience with the lock mechanism and took a look to the standard to learn, how this will work. I might be wrong...
At first some problems with your code: This code won't compile with a recent version of gfortran. You have to move the function walltime to the contains section of your program and you should use USE omp_lib which defines all necessary functions (and remove the resulting duplicate definitions). Additionally, you have to define your lock in the standard way:
integer(kind=OMP_LOCK_KIND), dimension(n) :: lck
Now to your question: The call to OMP_INIT_LOCK initializes your lck array to unlocked state. All threads will get a copy of this variable. Then the parallel section is started.
In the first loop, the array is initialized as something similar to a Hilbert matrix and each lock is set.
The second block is only executed by the first thread and the first lock is released. Still nothing interesting. The following loop is entered by all threads and all threads are waiting for the k-th lock, because omp_set_lock waits, till the lock is acquired. The following omp_unset_lock lets all other threads follow. Due to the already released 1st lock, all threads will immediately enter the inner loop and finally one of the threads will release the next lock. By the time, this thread releases this lock, the other threads might already be waiting for this lock.
In principle, this algorithm provides some form of synchronization, to make sure, that the data, which is required by the k+1-th loop is already calculated, when entering it.

Related

There's a set of complex numbers,How to do parallel numerical integrals work in Fortran?

program array
implicit none
integer(8)::i,j
real(8):: dt
real(8):: RR(1317,26123),t(26123),IR(1317,26123),f(8,26123)
complex(8):: sum(1317),factor(1317),fnl(1317)
complex(8),dimension(:,:),allocatable::R
allocate( R(1317,26123))
j=0.
sum(j)=0.
factor(j)=0.
Open(0 , File ='/Users/clyrionchen/Desktop/result1-4.log')
Open(11 , File ='/Users/clyrionchen/Desktop/ReR' )
Open(22 , File ='/Users/clyrionchen/Desktop/ImR' )
do i=1,26123
read(11,*) RR(:,i)
read(0,*) f(:,i)
read(22,*) IR(:,i)
end do
t(:)=f(1,:)
r(:,:)=cmplx(Rr(:,:),IR(:,:),kind=16)
do j=1,1317
do i=2,26123
factor(j)=(r(j,i)+r(j,i-1))/2. *(t(i)-t(i-1))
sum(j)=sum(j)+factor(j)
end do
end do
do j=1,1317
write(*,*) sum(j)
end do
end program
“Rr” and “Dr” Form a complex number, and now we need to integrate the complex number, the trapezoidal integral I wrote myself (top base plus bottom base, divided by 2, times the time interval). There are 1317 groups of points, and each group needs to be integrated to get the result. But it turns out that the value of the integral doesn't match the result. Is it the parallel error, or is it the numerical integration error? How should I write this program? Thank you very much for your answer.

How to change seed number in Fortran stochastic simulator code

I'm running a Fortran code which performs a stochastic simulation of a marked Poisson cluster process. In practice, event properties (eg. time of occurrences) are generated by inversion method, i.e. by random sampling of the cumulative distribution function.
Because of the Poissonian randomness, I expect each generated sequence to be different, but this is not the case. I guess the reason is that the seed for the pseudorandom number generator is the same at each simulation.
I do not know Fortran, so I have no idea how to solve this issue. Here is the part of the code involved with the pseudorandom number generator, any idea?
subroutine pseud0(r)
c generation of pseudo-random numbers
c data ir/584287/
data ir/574289/
ir=ir*48828125
if(ir) 10,20,20
10 ir=(ir+2147483647)+1
20 r=float(ir)*0.4656613e-9
return
end
subroutine pseudo(random)
c wichmann+hill (1982) Appl. Statist 31
data ix,iy,iz /1992,1111,1151/
ix=171*mod(ix,177)-2*(ix/177)
iy=172*mod(iy,176)-35*(iy/176)
iz=170*mod(iz,178)-63*(iz/178)
if (ix.lt.0) ix=ix+30269
if (iy.lt.0) iy=iy+30307
if (iz.lt.0) iz=iz+30323
random=mod(float(ix)/30269.0+float(iy)/30307.0+
& float(iz)/30323.0,1.0)
return
end
First, I would review the modern literature for PRNG and pick a modern implementation. Second, I would rewrite the code in modern Fortran.
You need to follow #francescalus advice and have a method for updating the seed. Without attempting to modernizing your code, here is one method for the pseud0 prng
subroutine init0(i)
integer, intent(in) :: i
common /myseed0/iseed
iseed = i
end subroutine init0
subroutine pseud0(r)
common /myseed0/ir
ir = ir * 48828125
if (ir) 10,20,20
10 ir = (ir+2147483647)+1
20 r = ir*0.4656613e-9
end subroutine pseud0
program foo
integer i
real r1
call init0(574289) ! Original seed
do i = 1, 10
call pseud0(r1)
print *, r1
end do
print *
call init0(289574) ! New seed
do i = 1, 10
call pseud0(r1)
print *, r1
end do
print *
end program foo

Fortran OMP Parallel Do for a Do While loop

As of lately I have been reading and playing around with OpenMP parallel do's in Fortran 95. However, I still have not figured out how the parallel do would be used in a code like the one beneath:
I=1
DO WHILE I<100
A=2*I
B=3*I
C=A+B
SUM(I)=C
I=I+1
END DO
Using simply !$OMP PARALLEL DO before the do loop and !$OMP END PARALLEL DO doesn't seem to work. I have read a couple of things about private and shared variables however I think that each successive loop of the code above is completely independent. Any help would be appreciated greatly.
The parallel do construct doesn't work with do while loops. You need to change the do while loop to a standard DO loop. This is from the OpenMP 4.0 standard on the parallel do construct at https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf, page 59:
• The associated do-loops must be structured blocks.
• Only an iteration of the innermost associated loop may be curtailed by a CYCLE statement.
• No statement in the associated loops other than the DO statements can cause a branch out of the loops.
• The do-loop iteration variable must be of type integer.
• The do-loop cannot be a DO WHILE or a DO loop without loop control.
The following example may help with understanding your approach for what you have outlined.
It shows the use of !$OMP and also identifies the thread being used for each iteration of the loop.
I changed SUM to SUMI to retain SUM as an intrinsic function.
Hopefully you can build on this.
use omp_lib
real sumi(99), a,b,c
integer thread_used(0:9), I
nThreads = omp_get_max_threads ()
thread_used = 0
!$OMP PARALLEL DO &
!$OMP SHARED (SUMI,thread_used) &
!$OMP PRIVATE (i,a,b,c,iThread)
DO I = 1,99
iThread = omp_get_thread_num ()
thread_used(iThread) = thread_used(iThread) + 1
A=2*I
B=3*I
C=A+B
SUMI(I)=C
END DO
!$OMP END PARALLEL DO
write (*,*) sum (SUMI)
do i = 0, nThreads
write (*,*) i, thread_used(i)
end do
end

Parallelising nested do loop Fortran

I don't get any speedup when I try to do the following in the subroutine:
!$ call omp_set_num_threads(threadno)
call system_clock(x1)
!$OMP PARALLEL do private(i), reduction(+:total)
do i = 1,m
total = 0.d0
call result(a,l,b,qm,q,en) !here l is input for subroutine and en is output
qm(:,i) = q
qtv(i) = qt
mean = sum(q)/size(q)
do i2 = 1,k
total = total + ((mean-q(i2))**2)/(a+b)
end do
qvv(i1) = total
end do
call system_clock(x2)
print *, x2-x1
!$OMP END PARALLEL do
Comments on the OpenMP part:
total should not be reset in the loop but before the !$OMP clause.
i2 and mean should be private.
If q does not change between iterations of the loop, sum(q)/size(q) should be placed outside.
The lack of private setting can lead to memory access conflicts (and thus slowdowns).
I guess that the code you show is close to but not really the one that you compile. It would be useful to have a compiled code to provide a better help.
Cheers,
Pierre
EDIT: for timing OpenMP code, you should use omp_get_wtime (see https://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fwtime.html) that gives the walltime https://en.wikipedia.org/wiki/Wall-clock_time. The module for openmp routines is loaded with use omp_lib

Non-blocking MPI communication from every node to every other node fails

I'm trying to implement a data transfer using Fortran 90 and MPI in which every node sends a specific buffer to every other node, i.e. for N nodes I have N-1 buffers to be sent, each one with a different content specific to the recipient node. This involves a double loop and non-blocking send/recv calls; here's the code:
program a2a
implicit none
include "mpif.h"
integer, parameter :: ILEN=4
integer :: nn,i,j,me,ierr
integer :: status(MPI_status_size)
integer, allocatable :: sndv(:),rcvv(:),ireq(:)
call MPI_init(ierr)
call MPI_comm_size(mpi_comm_world,nn,ierr)
nn=nn-1
call MPI_comm_rank(mpi_comm_world,me,ierr)
allocate(sndv(0:nn),rcvv(0:nn),ireq(0:nn))
do i=0,nn
sndv(i)=10*me+i
rcvv(i)=0
end do
do i=0,nn
if (i == me) then
do j=0,nn
if (i == j) cycle
call MPI_isend(sndv(j),ILEN,MPI_byte,j,1000+j,MPI_comm_world,ireq(j),ierr)
write(*,*) 1000+j,'Send - #',me,' -> #',j,': ',sndv(j),ireq(j)
end do
else
do j=0,nn
if (i == j) cycle
call MPI_irecv(rcvv(j),ILEN,MPI_byte,j,1000+j,MPI_comm_world,ireq(j),ierr)
write(*,*) 1000+j,'Recv0 #',i,' -> #',j,': ',rcvv(j),ireq(j)
end do
end if
end do
do j=0,nn
if (me == j) cycle
call MPI_wait(ireq(j),status,ierr)
write(*,*) 1000+j,'Recv1 #',me,' -> #',j,': ',rcvv(j),ireq(j)
end do
call MPI_barrier(MPI_comm_world,ierr)
do i=0,nn
write(*,*) 'Recv2 #',i,' -> #',me,': ',rcvv(i)
end do
call MPI_finalize(ierr)
end program a2a
The expected result for a run with just two nodes is that node 0 sends "1" to node 1 and node 1 sends "10" to node 0. The actual result is that nothing seems to be sent, although there is no deadlock and the tags and request numbers seem to be correct. What is wrong here?
Thomas
Look at the MPI_irecv command, and what it should be:
MPI_irecv(rcvv(j),ILEN,MPI_byte,j, 1000+j,MPI_comm_world,ireq(j), ierr)
MPI_irecv(sendBuf, len,type, source, tag, comm, request, ierr)
Specifically, you have set your source variable to be j. If you look at the MPI_isend command, however, the processor that is sending the information is processor i (the send only occurs if i == me). Change the source in your MPI_irecv command to i and it should work fine.
That said, this seems like a perfect use case for an MPI_Alltoall command, why don't you use that instead?
Turns out, the whole approach of the program was flawed, because for tests with more than 2 nodes, deadlocks occurred and/or buffers got mixed up. For the record, below is a new program that seems to do the job correctly.
#wolfPack88 concerning the suggestion to use MPI_Alltoallv: yes, in principle that would do it. However, in my actual problem, for which this is just a test, it is even more complicated in that the nodes involved in the whole task can be only a fairly small subset of all nodes of the run. In that case MPI_Alltoallv might be overkill and would presumably cause unnecessary communication. Still, pointing me to the mistake with the source finally opened my eyes to the root of the trouble, so thanks for that.
Here's the code:
program a2a
implicit none
include "mpif.h"
integer, parameter :: ILEN=4
integer :: nn,i,me,ierr
integer :: status(MPI_status_size)
integer, allocatable :: sndv(:),rcvv(:),ireq(:)
integer, external :: isend,irecv,mynode,numnodes
call MPI_init(ierr)
call MPI_comm_size(mpi_comm_world,nn,ierr)
nn=nn-1
call MPI_comm_rank(mpi_comm_world,me,ierr)
allocate(sndv(0:nn),rcvv(0:nn),ireq(0:nn))
do i=0,nn
sndv(i)=10*me+i
rcvv(i)=0
end do
do i=0,nn
if (i == me) cycle
call MPI_irecv(rcvv(i),ILEN,MPI_byte,i,1000*i+me,MPI_comm_world,ireq(i),ierr)
end do
do i=0,nn
if (me == i) cycle
call MPI_isend(sndv(i),ILEN,MPI_byte,i,1000*me+i,MPI_comm_world,ireq(i),ierr)
write(*,*) 1000*me+i,'Send - #',me,' -> #',i,': ',sndv(i),ireq(i)
end do
do i=0,nn
if (me == i) cycle
call MPI_wait(ireq(i),status,ierr)
end do
call MPI_barrier(MPI_comm_world,ierr)
do i=0,nn
if (i /= me) write(*,*) 'Recv2 #',i,' -> #',me,': ',rcvv(i)
end do
call MPI_finalize(ierr)
end program a2a

Resources