Non-blocking MPI communication from every node to every other node fails

I'm trying to implement a data transfer using Fortran 90 and MPI in which every node sends a specific buffer to every other node, i.e. for N nodes I have N-1 buffers to be sent, each one with a different content specific to the recipient node. This involves a double loop and non-blocking send/recv calls; here's the code:
program a2a
  implicit none
  include "mpif.h"
  integer, parameter :: ILEN=4
  integer :: nn,i,j,me,ierr
  integer :: status(MPI_status_size)
  integer, allocatable :: sndv(:),rcvv(:),ireq(:)
  call MPI_init(ierr)
  call MPI_comm_size(mpi_comm_world,nn,ierr)
  nn=nn-1
  call MPI_comm_rank(mpi_comm_world,me,ierr)
  allocate(sndv(0:nn),rcvv(0:nn),ireq(0:nn))
  do i=0,nn
    sndv(i)=10*me+i
    rcvv(i)=0
  end do
  do i=0,nn
    if (i == me) then
      do j=0,nn
        if (i == j) cycle
        call MPI_isend(sndv(j),ILEN,MPI_byte,j,1000+j,MPI_comm_world,ireq(j),ierr)
        write(*,*) 1000+j,'Send - #',me,' -> #',j,': ',sndv(j),ireq(j)
      end do
    else
      do j=0,nn
        if (i == j) cycle
        call MPI_irecv(rcvv(j),ILEN,MPI_byte,j,1000+j,MPI_comm_world,ireq(j),ierr)
        write(*,*) 1000+j,'Recv0 #',i,' -> #',j,': ',rcvv(j),ireq(j)
      end do
    end if
  end do
  do j=0,nn
    if (me == j) cycle
    call MPI_wait(ireq(j),status,ierr)
    write(*,*) 1000+j,'Recv1 #',me,' -> #',j,': ',rcvv(j),ireq(j)
  end do
  call MPI_barrier(MPI_comm_world,ierr)
  do i=0,nn
    write(*,*) 'Recv2 #',i,' -> #',me,': ',rcvv(i)
  end do
  call MPI_finalize(ierr)
end program a2a
The expected result for a run with just two nodes is that node 0 sends "1" to node 1 and node 1 sends "10" to node 0. The actual result is that nothing seems to be sent, although there is no deadlock and the tags and request numbers seem to be correct. What is wrong here?
Thomas

Look at your MPI_irecv call and compare it with the expected argument order:
MPI_irecv(rcvv(j),ILEN,MPI_byte,j, 1000+j,MPI_comm_world,ireq(j), ierr)
MPI_irecv(recvbuf, len, type, source, tag, comm, request, ierr)
Specifically, you have set your source variable to be j. If you look at the MPI_isend command, however, the processor that is sending the information is processor i (the send only occurs if i == me). Change the source in your MPI_irecv command to i and it should work fine.
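In other words, the receive in the inner loop becomes:
call MPI_irecv(rcvv(j),ILEN,MPI_byte,i,1000+j,MPI_comm_world,ireq(j),ierr)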
That said, this seems like a perfect use case for MPI_Alltoall; why don't you use that instead?
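For reference, a minimal sketch of what that could look like with the buffers from the question (note that MPI_Alltoall also exchanges the i == me element, which simply copies sndv(me) into rcvv(me)):
call MPI_alltoall(sndv,ILEN,MPI_byte,rcvv,ILEN,MPI_byte,MPI_comm_world,ierr)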

It turns out the whole approach of the program was flawed: in tests with more than 2 nodes, deadlocks occurred and/or buffers got mixed up. For the record, below is a new program that seems to do the job correctly.
@wolfPack88: concerning the suggestion to use MPI_Alltoallv: yes, in principle that would do it. However, in my actual problem, for which this is just a test, things are even more complicated in that the nodes involved in the whole task may be only a fairly small subset of all nodes of the run. In that case MPI_Alltoallv might be overkill and would presumably cause unnecessary communication. Still, pointing me to the mistake with the source finally opened my eyes to the root of the trouble, so thanks for that.
Here's the code:
program a2a
  implicit none
  include "mpif.h"
  integer, parameter :: ILEN=4
  integer :: nn,i,me,ierr
  integer :: status(MPI_status_size)
  ! separate request arrays for sends and receives, so no request handle
  ! is overwritten before it has been waited on
  integer, allocatable :: sndv(:),rcvv(:),sreq(:),rreq(:)
  call MPI_init(ierr)
  call MPI_comm_size(mpi_comm_world,nn,ierr)
  nn=nn-1
  call MPI_comm_rank(mpi_comm_world,me,ierr)
  allocate(sndv(0:nn),rcvv(0:nn),sreq(0:nn),rreq(0:nn))
  do i=0,nn
    sndv(i)=10*me+i
    rcvv(i)=0
  end do
  ! post all receives first; tag 1000*i+me is unique to the (sender,receiver) pair
  do i=0,nn
    if (i == me) cycle
    call MPI_irecv(rcvv(i),ILEN,MPI_byte,i,1000*i+me,MPI_comm_world,rreq(i),ierr)
  end do
  ! post the matching sends
  do i=0,nn
    if (me == i) cycle
    call MPI_isend(sndv(i),ILEN,MPI_byte,i,1000*me+i,MPI_comm_world,sreq(i),ierr)
    write(*,*) 1000*me+i,'Send - #',me,' -> #',i,': ',sndv(i),sreq(i)
  end do
  ! complete both the sends and the receives before touching the buffers
  do i=0,nn
    if (me == i) cycle
    call MPI_wait(sreq(i),status,ierr)
    call MPI_wait(rreq(i),status,ierr)
  end do
  call MPI_barrier(MPI_comm_world,ierr)
  do i=0,nn
    if (i /= me) write(*,*) 'Recv2 #',i,' -> #',me,': ',rcvv(i)
  end do
  call MPI_finalize(ierr)
end program a2a

Related

MT19937 does NOT reproduce the same pseudo-random sequence when the seed value is held constant

I'm writing a checkpoint function for my Monte Carlo simulation in Fortran 90/95; the compiler I'm using is ifort 18.0.2. Before going into detail, let me clarify which version of the pseudo-random generator I'm using:
A C-program for MT19937, with initialization, improved 2002/1/26.
Coded by Takuji Nishimura and Makoto Matsumoto.
Code converted to Fortran 95 by José Rui Faustino de Sousa
Date: 2002-02-01
See mt19937 for the source code.
The general structure of my Monte Carlo simulation code is given below:
program montecarlo
call read_iseed(...)
call mc_subroutine(...)
end
Within the read_iseed subroutine:
subroutine read_iseed(...)
  use mt19937
  if (Restart == 'n') then
    call system('od -vAn -N4 -td4 < /dev/urandom > '//trim(IN_ISEED))
    open(unit=7,file=trim(IN_ISEED),status='old')
    read(7,*) i
    close(7)
    !This is only used to initialise the PRNG sequence
    iseed = abs(i)
  else if (Restart == 'y') then
    !Taking seed value from the latest iteration of previous simulation
    iseed = RestartSeed
  endif
  call init_genrand(iseed)
  print *, 'first pseudo-random value ',genrand_real3(), 'iseed ',iseed
  return
end subroutine
Based on my understanding, if the seed value is held constant, the PRNG should reproduce the same pseudo-random sequence every time, right?
To prove this is the case, I ran two individual simulations using the same seed value, and they reproduce the exact same sequence. So far so good!
Based on the previous test, I'd further assume that regardless of the number of times init_genrand() is called within one individual simulation, the PRNG should also reproduce the same pseudo-random sequence. So I made a small modification to my read_iseed() subroutine:
subroutine read_iseed(...)
  use mt19937
  if (Restart == 'n') then
    call system('od -vAn -N4 -td4 < /dev/urandom > '//trim(IN_ISEED))
    open(unit=7,file=trim(IN_ISEED),status='old')
    read(7,*) i
    close(7)
    !This is only used to initialise the PRNG sequence
    iseed = abs(i)
  else if (Restart == 'y') then
    !Taking seed value from the latest iteration of the previous simulation
    iseed = RestartSeed
  endif
  call init_genrand(iseed)
  print *, 'first time initialisation ',genrand_real3(), 'iseed ',iseed
  call init_genrand(iseed)
  print *, 'second time initialisation ',genrand_real3(), 'iseed ',iseed
  return
end subroutine
Surprisingly, the output is not what I expected: the iseed values printed are identical between the two initializations, but the genrand_real3() outputs are not.
Because of this unexpected result, I'm struggling to resume the simulation from an arbitrary state, since it does not reproduce the latest configuration of the system I'm simulating.
I'm not sure if I've provided enough information; please let me know if any part of this question needs to be more specific.
From the source code you've provided (see mt19937ar.f90 at http://web.mst.edu/~vojtat/class_5403/mt19937/mt19937ar.f90), init_genrand does not clear the whole state.
There are 3 critical state variables:
integer( kind = wi ) :: mt(n) ! the array for the state vector
logical( kind = wi ) :: mtinit = .false._wi ! means mt[N] is not initialized
integer( kind = wi ) :: mti = n + 1_wi ! mti==N+1 means mt[N] is not initialized
The first one is the "array for the state vector", the second is a flag that ensures we don't start with an uninitialized array, and the third is a position marker, as I guess from the condition stated in the comment.
Looking at subroutine init_genrand( s ), it sets the mtinit flag and fills the mt() array from 1 up to n. Alright.
Looking at genrand_real3 it's based on genrand_int32.
Looking at genrand_int32, it starts up with
if ( mti > n ) then ! generate N words at one time
! if init_genrand() has not been called, a default initial seed is used
if ( .not. mtinit ) call init_genrand( seed_d )
and does its arithmetic magic and then starts getting the result:
y = mt(mti)
mti = mti + 1_wi
So mti is a positional index into the state array, and it is incremented by 1 after each integer read from the generator.
Back to init_genrand: remember, it resets the array mt(), but it does not reset mti back to its starting value mti = n + 1_wi.
I bet this is the cause of the phenomenon you've observed: after re-initializing with the same seed, the array is filled with the same set of values, but the int32 generator then reads from a different starting point. I doubt this was intended, so it's probably a tiny bug that is easy to overlook.
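If that is indeed the cause, a minimal sketch of a fix (against the Fortran module quoted above, not the authoritative upstream source) would be to reset the position marker at the end of init_genrand, so that re-seeding also restarts the output sequence:
subroutine init_genrand( s )
  integer( kind = wi ), intent( in ) :: s
  ! ... existing code that fills mt(1:n) from the seed s and sets mtinit ...
  mti = n + 1_wi  ! also reset the read position, so the next call to
                  ! genrand_int32 regenerates the block from the fresh state
end subroutine init_genrand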

Using MPI_TYPE_VECTOR instead of MPI_GATHER

Suppose that k processes compute the elements of a matrix A of dimension (n,m), where n is the number of rows and m is the number of columns. I am trying to use MPI_GATHER to gather these local matrices into a matrix B at the root process, where the dimension of B is (n,km). To be more specific, I wrote the example Fortran code below. Here, I am passing the columns of matrix A (not the entire matrix) to matrix B, but this doesn't work. When I run the executable using mpirun -n 2 a.out, I get the error:
malloc: *** error for object 0x7ffa89413fb8: incorrect checksum for freed object - object was probably modified after being freed.
1) Why do I get this error message?
2) Can someone please explain conceptually why I have to use MPI_TYPE_VECTOR?
3) How should I correct the MPI_GATHER part of the code? Can I pass the entire matrix A?
PROGRAM test
  IMPLICIT NONE
  INCLUDE "mpif.h"
  INTEGER, PARAMETER :: n=100, m=100
  INTEGER, ALLOCATABLE, DIMENSION(:,:) :: A
  INTEGER, DIMENSION(n,m) :: B
  INTEGER :: ind_a, ind_c
  INTEGER :: NUM_PROC, PROC_ID, IERROR, MASTER_ID=0
  INTEGER :: c
  INTEGER, DIMENSION(m) :: cvec
  CALL MPI_INIT(IERROR)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, PROC_ID, IERROR)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NUM_PROC, IERROR)
  ALLOCATE(A(n,m/NUM_PROC))
  DO ind_c=1,m
    cvec(ind_c)=ind_c
  END DO
  ! Fill in matrix A
  DO ind_a=1,n
    DO ind_c=1,m/NUM_PROC
      c=cvec(ind_c+PROC_ID*m/NUM_PROC)
      A(ind_a,ind_c)=c*ind_a
    END DO
  END DO
  ! Gather the elements at the root process
  DO ind_a=1,n
    CALL MPI_GATHER(A(ind_a,:),m/NUM_PROC,MPI_INTEGER, &
                    B(ind_a,PROC_ID*m/NUM_PROC+1:(PROC_ID+1)*m/NUM_PROC), &
                    m/NUM_PROC,MPI_INTEGER,MASTER_ID,MPI_COMM_WORLD,IERROR)
  END DO
  CALL MPI_FINALIZE(IERROR)
END PROGRAM
There are two types of gather operation that can be performed on a 2-dimensional array:
1. gathering the elements from dimension-2 of all the processes and collecting them in dimension-2 of one process; and
2. gathering the elements from dimension-2 of all the processes and collecting them in dimension-1 of one process.
That said, in this example:
n is dimension-1 and m is dimension-2, and we know that Fortran is column major, so dimension-1 is contiguous in memory.
In your gather statement you are trying to gather dimension-2 of array A from all the processes and collect it into dimension-2 of array B on the MASTER_ID process (type 1). Since dimension-2 is non-contiguous in memory, this causes the memory error you see.
A single MPI_Gather call, as shown below, will achieve the required operation without any of the looping tricks used above:
CALL MPI_GATHER(A, n*(m/NUM_PROC), MPI_INTEGER, &
B, n*(m/NUM_PROC), MPI_INTEGER, MASTER_ID, &
MPI_COMM_WORLD, IERROR)
But if you are attempting to gather elements from dimension-2 of array A on all the processes into dimension-1 of array B on the MASTER_ID process, that is when we need to make use of MPI_TYPE_VECTOR, which creates a new type describing the non-contiguous elements. Let me know if that is the intention.
From the current code logic, though, it doesn't look like MPI_TYPE_VECTOR is needed.
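For illustration only, here is roughly how one row of an (n,m) integer array (non-contiguous in Fortran's column-major storage) could be described with MPI_TYPE_VECTOR; row_type is a local INTEGER introduced just for this sketch:
INTEGER :: row_type
! m blocks of 1 integer each, separated by a stride of n elements:
! exactly the memory pattern of one row of an (n,m) column-major array
CALL MPI_TYPE_VECTOR(m, 1, n, MPI_INTEGER, row_type, IERROR)
CALL MPI_TYPE_COMMIT(row_type, IERROR)
! ... use row_type as the datatype in the communication calls ...
CALL MPI_TYPE_FREE(row_type, IERROR)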

Passing user-defined variables using MPI [duplicate]

I have a Fortran program where I specify the kind of the numeric data types in an attempt to retain a minimum level of precision, regardless of what compiler is used to build the program. For example:
integer, parameter :: rsp = selected_real_kind(4)
...
real(kind=rsp) :: real_var
The problem is that I have used MPI to parallelize the code and I need to make sure the MPI communications are specifying the same type with the same precision. I was using the following approach to stay consistent with the approach in my program:
call MPI_Type_create_f90_real(4,MPI_UNDEFINED,rsp_mpi,mpi_err)
...
call MPI_Send(real_var,1,rsp_mpi,dest,tag,MPI_COMM_WORLD,err)
However, I have found that this MPI routine is not particularly well supported across MPI implementations, so it's actually making my program non-portable. If I omit the MPI_Type_create routine, I'm left to rely on the standard MPI_REAL and MPI_DOUBLE_PRECISION data types, but what if those are not consistent with the kind that selected_real_kind picks for the reals that will ultimately be passed around by MPI? Am I stuck just using the standard real declaration, with no kind attribute? And if I do that, am I guaranteed that MPI_REAL and real will always have the same precision, regardless of compiler and machine?
UPDATE:
I created a simple program that demonstrates the issue I see when my internal reals have higher precision than what is afforded by the MPI_DOUBLE_PRECISION type:
program main
  use mpi
  implicit none
  integer, parameter :: rsp = selected_real_kind(16)
  integer :: err
  integer :: rank
  real(rsp) :: real_var
  call MPI_Init(err)
  call MPI_Comm_rank(MPI_COMM_WORLD,rank,err)
  if (rank.eq.0) then
    real_var = 1.123456789012345
    call MPI_Send(real_var,1,MPI_DOUBLE_PRECISION,1,5,MPI_COMM_WORLD,err)
  else
    call MPI_Recv(real_var,1,MPI_DOUBLE_PRECISION,0,5,MPI_COMM_WORLD,&
                  MPI_STATUS_IGNORE,err)
  end if
  print *, rank, real_var
  call MPI_Finalize(err)
end program main
If I build and run with 2 cores, I get:
0 1.12345683574676513672
1 4.71241976735884452383E-3998
Now change the 16 to a 15 in selected_real_kind and I get:
0 1.1234568357467651
1 1.1234568357467651
Is it always going to be safe to use selected_real_kind(15) with MPI_DOUBLE_PRECISION no matter what machine/compiler is used to do the build?
Use the Fortran 2008 intrinsic STORAGE_SIZE to determine the number of bytes that each number requires and send the data as bytes. Note that STORAGE_SIZE returns the size in bits, so you will need to divide by 8 to get the size in bytes.
This solution works for moving data but does not help you use reductions. For that you will have to implement a user-defined reduction operation. If that's important to you, I will update my answer with the details.
For example:
program main
  use mpi
  implicit none
  integer, parameter :: rsp = selected_real_kind(16)
  integer :: err
  integer :: rank
  real(rsp) :: real_var
  call MPI_Init(err)
  call MPI_Comm_rank(MPI_COMM_WORLD,rank,err)
  if (rank.eq.0) then
    real_var = 1.123456789012345
    call MPI_Send(real_var,storage_size(real_var)/8,MPI_BYTE,1,5,MPI_COMM_WORLD,err)
  else
    call MPI_Recv(real_var,storage_size(real_var)/8,MPI_BYTE,0,5,MPI_COMM_WORLD,&
                  MPI_STATUS_IGNORE,err)
  end if
  print *, rank, real_var
  call MPI_Finalize(err)
end program main
I confirmed that this change corrects the problem and the output I see is:
0 1.12345683574676513672
1 1.12345683574676513672
Not really an answer, but we have the same problem and use something like this:
!> Number of digits for single precision numbers
integer, parameter, public :: single_prec = 6
!> Number of digits for double precision numbers
integer, parameter, public :: double_prec = 15
!> Number of digits for extended double precision numbers
integer, parameter, public :: xdble_prec = 18
!> Number of digits for quadruple precision numbers
integer, parameter, public :: quad_prec = 33
integer, parameter, public :: rk_prec = double_prec
!> The kind to select for default reals
integer, parameter, public :: rk = selected_real_kind(rk_prec)
And then have an initialization routine where we do:
!call mpi_type_create_f90_real(rk_prec, MPI_UNDEFINED, rk_mpi, iError)
!call mpi_type_create_f90_integer(long_prec, long_k_mpi, iError)
! Workaround shitty MPI-Implementations.
select case(rk_prec)
case(single_prec)
rk_mpi = MPI_REAL
case(double_prec)
rk_mpi = MPI_DOUBLE_PRECISION
case(quad_prec)
rk_mpi = MPI_REAL16
case default
write(*,*) 'unknown real type specified for mpi_type creation'
end select
long_k_mpi = MPI_INTEGER8
While this is not nice, it works reasonably well, and seems to be usable on Cray, IBM BlueGene and conventional Linux Clusters.
The best thing to do is to push sites and vendors to properly support this in MPI. As far as I know, it has been fixed in OpenMPI and is planned to be fixed in MPICH 3.1.1. See OpenMPI tickets 3432 and 3435 as well as MPICH tickets 1769 and 1770.
How about:
integer, parameter :: DOUBLE_PREC = kind(0.0d0)
integer, parameter :: SINGLE_PREC = kind(0.0e0)
integer, parameter :: MYREAL = DOUBLE_PREC
if (MYREAL .eq. DOUBLE_PREC) then
  MPIREAL = MPI_DOUBLE_PRECISION
else if (MYREAL .eq. SINGLE_PREC) then
  MPIREAL = MPI_REAL
else
  print *, "Error: Can't figure out MPI precision."
  STOP
end if
and use MPIREAL instead of MPI_DOUBLE_PRECISION from then on.
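For example, the send from the question would then read (dest and tag as before):
call MPI_Send(real_var,1,MPIREAL,dest,tag,MPI_COMM_WORLD,err)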

How to use locks in OpenMP

I am new to OpenMP and find it a little hard to understand how locks in OpenMP really work. Here is an example code, written in Fortran 90, that does LU factorization. Can anyone explain how the locks work in this code?
program lu
  implicit none
  integer, parameter :: DP=kind(0.0D0),n=20
  !-- Variables
  integer :: i,j,k,nthr,thrid,chunk=1
  real(kind=DP), dimension(:,:),allocatable :: A,B,L,U
  real(kind=DP) :: timer,error,walltime
  integer(kind=8), dimension(n)::lck
  integer::omp_get_thread_num,omp_get_max_threads
  nthr=omp_get_max_threads()
  allocate(A(n,n))
  allocate(B(n,n))
  allocate(L(n,n))
  allocate(U(n,n))
  !-- Set up locks for each column
  do i=1,n
    call omp_init_lock(lck(i))
  end do
  timer=walltime()
  !$OMP PARALLEL PRIVATE(i,j,k,thrid)
  thrid=omp_get_thread_num();
  !-- Initiate matrix
  !$OMP DO SCHEDULE(STATIC,chunk)
  do j=1,n
    do i=1,n
      A(i,j)=1.0/(i+j)
    end do
    call omp_set_lock(lck(j))
  end do
  !$OMP END DO
  !-- First column of L
  if (thrid==0) then
    do i=2,n
      A(i,1)=A(i,1)/A(1,1)
    end do
    call omp_unset_lock(lck(1))
  end if
  !-- LU-factorization
  do k=1,n
    call omp_set_lock(lck(k))
    call omp_unset_lock(lck(k))
    !$OMP DO SCHEDULE(STATIC,chunk)
    do j=1,n
      if (j>k) then
        do i=k+1,n
          A(i,j)=A(i,j)-A(i,k)*A(k,j)
        end do
        if (j==k+1) then
          do i=k+2,n
            A(i,k+1)=A(i,k+1)/A(k+1,k+1)
          end do
          call omp_unset_lock(lck(k+1))
        end if
      end if
    end do
    !$OMP END DO NOWAIT
  end do
  !$OMP END PARALLEL
  timer=walltime()-timer
  write(*,*) 'n = ',n,' time = ',timer,' nthr = ',nthr
  ! CHECK CORRECTNESS
  do j=1,n
    L(j,j)=1
    U(j,j)=A(j,j)
    do i=j+1,n
      L(i,j)=A(i,j)
      U(i,j)=0
    end do
    do i=1,j-1
      U(i,j)=A(i,j)
      L(i,j)=0
    end do
  end do
  B=0
  do j=1,n
    do k=1,n
      do i=1,n
        B(i,j)=B(i,j)+L(i,k)*U(k,j)
      end do
    end do
  end do
  error=0.0
  do j=1,n
    do i=1,n
      error=error+abs(1.0/(i+j)-B(i,j))
    end do
  end do
  write(*,*) 'ERROR: ',error
end program lu
Another file, listed below, contains the walltime function; it should be compiled together with the main file.
function walltime()
  integer, parameter:: DP = kind(0.0D0)
  real(DP) walltime
  integer::count,count_rate,count_max
  call system_clock(count,count_rate,count_max)
  walltime=real(count,DP)/real(count_rate,DP)
end function walltime
DISCLAIMER: I don't have experience with the lock mechanism and took a look at the standard to learn how it works. I might be wrong...
First, some problems with your code: this code won't compile with a recent version of gfortran. You have to move the function walltime into the contains section of your program, and you should add USE omp_lib, which defines all the necessary routines (and then remove the resulting duplicate declarations). Additionally, you have to declare your locks in the standard way:
integer(kind=OMP_LOCK_KIND), dimension(n) :: lck
Now to your question: the call to OMP_INIT_LOCK initializes each entry of your lck array to the unlocked state. The lck array is shared by all threads (it is not in the PRIVATE clause, and locks have to be shared to be useful). Then the parallel section is started.
In the first loop, the matrix is initialized to something similar to a Hilbert matrix, and each lock is set.
The second block is executed only by thread 0, which computes the first column of L and then releases the first lock. Still nothing interesting. The following loop is entered by all threads, and all threads wait for the k-th lock, because omp_set_lock blocks until the lock is acquired. The omp_unset_lock immediately afterwards lets the other threads follow. Since the 1st lock has already been released, all threads immediately enter the inner loop, and eventually one of the threads releases the next lock. By the time this thread releases that lock, the other threads may already be waiting for it.
In principle, this algorithm provides a form of synchronization that makes sure the data required by the (k+1)-th iteration has already been calculated when that iteration is entered.
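For completeness, here is a minimal, self-contained sketch of the lock API itself, independent of the LU code above (the names lock_demo, lck and total are just for this example):
program lock_demo
  use omp_lib
  implicit none
  integer(kind=omp_lock_kind) :: lck
  integer :: i, total
  total = 0
  call omp_init_lock(lck)          ! lck starts in the unlocked state
  !$omp parallel do shared(total, lck)
  do i = 1, 1000
     call omp_set_lock(lck)        ! blocks until the lock is acquired
     total = total + i             ! only one thread at a time executes this
     call omp_unset_lock(lck)      ! release so the other threads can proceed
  end do
  !$omp end parallel do
  call omp_destroy_lock(lck)
  write(*,*) 'total = ', total     ! 500500, independent of the number of threads
end program lock_demo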

Need help with Binary Search algorithm error

Using the g95 compiler, I have an error:
Error: Operands of comparison operator '.EQ.' at (1) are LOGICAL(4)/LOGICAL(4)
I have no idea what this means. I'm including the subroutine. Do you have any ideas how to fix this?
Thanks so much for your time.
SUBROUTINE search(iarray, ItemSought, Found, Location)
CHARACTER(20), DIMENSION(50),INTENT(IN)::itemarray
CHARACTER(20)::ItemSought
LOGICAL, INTENT(OUT)::Found
INTEGER, INTENT(OUT)::Location
INTEGER:: First, Last, Middle
WRITE(*,'(1x,A)',ADVANCE="NO"),"What are you searching for? "
READ*, ItemSought
First=1
Last=SIZE(Iarray)
FOUND = .FALSE.
DO
IF ((First > Last) .OR. Found) RETURN
Middle = (First+Last)/2
IF (ItemSought < Iarray(Middle)) THEN
Last=Middle-1
ELSE IF (ItemSought > Iarray(Middle)) THEN
First=Middle+1
ELSE
Found = .TRUE.
Location = Middle
END IF
END DO
IF (Found == .TRUE.) THEN
PRINT*, Itemsought
END SUBROUTINE
I'm not going to admit how long it's been since I last used FORTRAN, but it sure looks a lot different than I remember, so this is just a guess.
Based on the error message I'd say it's on this line (you didn't say which):
IF (Found == .TRUE.) THEN
Again, just guessing: you usually don't test a logical value by comparing it to true/false; you use it directly:
IF (Found) THEN
The .EQ. (or ==) relational operator, just like .NE. (/=), .LT. (<) and so on, is for comparing numbers only; for comparing logical values you should use .EQV. and .NEQV.
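As a minimal illustration (a made-up snippet, not part of the question's code), both of the following compile, unlike == on logicals:
program eqv_demo
  implicit none
  logical :: found
  found = .true.
  if (found .eqv. .true.) print *, 'comparison with .EQV. is legal for logicals'
  if (found) print *, 'but testing the logical directly is the idiomatic form'
end program eqv_demo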
