the use of MPI_Init() - parallel-processing

I encountered a question about the use of MPI_Init().
I want to initialize random number "randv" only on the root processor with the code in the context below. To see if my goal is fulfilled, I have the program print out the array "randv" by placing a do loop immediately after the line "call RANDOM_NUMBER(randv)."
However, what is shown on the outcome screen is the repetition of the random number array by 8 times (given the number of processors is 8). My question is why the processors other than the root one are initialized before call MPI_Init(). If all the processors are awaken and have the same random number array before evoking MPI_Init, why bother to place call MPI_Init() for initialization? Thanks.
Lee
Here is the example I use:
program main
include 'mpif.h'
integer :: i
integer :: ierr
integer :: irank
integer :: nrow, ncol
real, dimension(:,:), allocatable :: randv
nrow = 4
ncol = 2
allocate(randv(nrow,ncol))
call RANDOM_SEED
call RANDOM_NUMBER(randv)
do i = 1, nrow
write(*,'(2(f5.2,x))') randv(i,:)
enddo
call MPI_Init ( ierr )
allocate(row_list(ncol), col_list(nrow))
call MPI_Comm_rank ( MPI_COMM_WORLD, irank, ierr )
if( irank == 0 )then
do i = 1, nrow
write(*,'(2(f5.2,x))') randv(i,:)
enddo
endif
call MPI_Finalize ( ierr )
deallocate( randv )
end program

I think you misunderstand how MPI works. The program you wrote is executed by every process. MPI_Init initializes the MPI environment s.t. those processes can interact. After initialization every process is uniquely identified by its rank. You have to make sure that, based on these ranks, each process works on different portions of your data, or performs different tasks.
Typically, you should run MPI_Init before anything else in your program.
Using MPI_Comm_rank you can obtain the ID of the current process (its rank). The first process always has the rank 0. Therefore, if you want to run parts of the code on the "master" process only, you can test for irank == 0:
program main
include 'mpif.h'
integer :: i
integer :: ierr
integer :: irank
integer :: nrow, ncol
real, dimension(:,:), allocatable :: randv
! Initialize MPI
call MPI_Init ( ierr )
! Get process ID
call MPI_Comm_rank ( MPI_COMM_WORLD, irank, ierr )
! Executed on all processes
nrow = 4
ncol = 2
allocate(randv(nrow,ncol))
! Only exectued on the master process
if ( irank == 0 ) then
call RANDOM_SEED
call RANDOM_NUMBER(randv)
do i = 1, nrow
write(*,'(2(f5.2,x))') randv(i,:)
enddo
endif
! Executed on all threads
allocate(row_list(ncol), col_list(nrow))
! Only exectued on the master process
if ( irank == 0 ) then
do i = 1, nrow
write(*,'(2(f5.2,x))') randv(i,:)
enddo
endif
deallocate( randv )
! Finalize MPI, should always be executed last
call MPI_Finalize ( ierr )
end program

Related

Why are my MPI-parallelised DO-loops faster the second time round?

I'm working on a scientific model that's parallelised using Open MPI, and I'm finding some weird results with computation time for parallelised loops. Basically, I find that a parallelised loop over a large array is slow right after allocating the (shared) memory, but a second operation is much faster. Why is this?
To demonstrate, here's a dummy program I wrote that illustrates this:
PROGRAM test_program
USE mpi
USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_PTR, C_F_POINTER
IMPLICIT NONE
INTEGER, PARAMETER :: dp = KIND(1.0D0) ! Double precision
INTEGER :: process_rank, number_of_processes
INTEGER :: ierr
INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
INTEGER :: disp_unit
TYPE(C_PTR) :: baseptr
REAL(dp), DIMENSION(:,:), POINTER :: A
INTEGER :: win
INTEGER :: nx, ny, i, j, i1, i2, j1, j2
REAL(dp) :: tstart, tstop, dt1, dt2
! Initialise MPI split processes
CALL MPI_INIT(ierr)
CALL MPI_COMM_RANK( MPI_COMM_WORLD, process_rank, ierr)
CALL MPI_COMM_SIZE( MPI_COMM_WORLD, number_of_processes, ierr)
! Dimensions of data array
nx = 20000
ny = 20000
! Allocate MPI-shared memory for data array, with an associated window object
! (done by all processes, but only the master actually allocates any space)
IF (process_rank == 0) THEN
windowsize = ny*nx*8_MPI_ADDRESS_KIND
disp_unit = ny*8
ELSE
windowsize = 0_MPI_ADDRESS_KIND
disp_unit = 1
END IF
CALL MPI_WIN_ALLOCATE_SHARED( windowsize, disp_unit, MPI_INFO_NULL, MPI_COMM_WORLD, baseptr, win, ierr)
IF (.NOT. process_rank == 0) THEN
! Get the baseptr of the master's memory space.
CALL MPI_WIN_SHARED_QUERY( win, 0, windowsize, disp_unit, baseptr, ierr)
END IF
! Associate the pointer with this memory space (so all processes act on the same physical memory)
CALL C_F_POINTER(baseptr, A, [ny, nx])
! Divide the memory over the processors; each gets a range i1-i2 of row
i1 = MAX(1, FLOOR(REAL(nx * process_rank / number_of_processes)) + 1)
i2 = MIN(nx, FLOOR(REAL(nx * (process_rank + 1) / number_of_processes)))
! Do some operations on the data, measure how long this takes on each core
tstart = MPI_WTIME()
DO i = i1, i2
DO j = 1, ny
A( j,i) = SQRT( (REAL(i,dp) - REAL(nx,dp)/2._dp)**2 + (REAL(j,dp) - REAL(ny,dp)/2._dp)**2 )
END DO
END DO
tstop = MPI_WTIME()
dt1 = tstop - tstart
WRITE(0,*) ' Process ', process_rank, ': dt1 = ', dt1
! Do some more calculations on the data, measure how long this takes on each core
tstart = MPI_WTIME()
DO i = i1, i2
DO j = 1, ny
A( j,i) = SQRT( (REAL(i,dp) - REAL(nx,dp)/4._dp)**2 + (REAL(j,dp) - REAL(ny,dp)/3._dp)**2 )
END DO
END DO
tstop = MPI_WTIME()
dt2 = tstop - tstart
WRITE(0,*) ' Process ', process_rank, ': dt2 = ', dt2
CALL MPI_FINALIZE( ierr)
END PROGRAM test_program
This gives me the following output:
Process 0 : dt1 = 2.1637410982511938
Process 1 : dt1 = 2.6094086961820722
Process 0 : dt2 = 1.0437976177781820
Process 1 : dt2 = 0.96576740033924580
Replacing the first loop by a vectorised operation (e.g. A(:,i1:i2) = 0._dp) gives the same "fast" results for the second loop.
When you perform memory allocation (such as with MPI_WIN_ALLOCATE_SHARED), the memory is only allocated virtually by the operating system: so until you use it, it doesn't take up space neither on RAM nor on the disk.
The first iteration writes into the virtually-allocated memory location for the first time, so it causes what is called page faults. When handling a page fault, the operating system tries to make the required page accessible at the location in physical memory (or terminates the program in cases of an illegal memory access). This process is very slow, hence the slow execution.
During the second iteration, no page faults should occur because the memory should be physically mapped, hence the faster execution.
Note that this effect is completely independent of the use of MPI (although MPI_WIN_ALLOCATE_SHARED should allocate memory in a different way than usual since memory must be shared between processes).

Problem in splitting array with MPI_ISEND with fortran

I wrote the following lines of code with the aim to divide the 1D array Jp (which lives on the master process only) over different processes (included the master one).
Each process have to receive a block of non-contigouos modified data (the values are changed in the inner loop) and I created a new format newF with the MPI_TYPE_INDEXED function to select the correct portion of data to send.
I used both MPI_RECV or MPI_IRECV to receive the data.
The problem is that this part of code works fine, with any numbers of tasks (from 1 to 8), until the number of element of Jp is small, when I increase such number (i.e. n = 5000), not all the processes receive the data and the splitted array JpS show the values I used to initialized it (i.e. -10000).
The lines commented show all the changes done in order to solve this problem, Does anyone have an idea?
program test_send
use mpi
implicit none
integer :: rank, nproc, mpi_stat
integer :: n, m, k, io, i, j
integer, allocatable :: Jp(:), JpS(:), JpAux(:)
integer :: count, n_distro,newF
integer, allocatable :: sendcounts(:), displ(:), &
blocklens(:), blockdisp(:), &
request(:)
integer :: ARRAY_OF_STATUS(MPI_STATUS_SIZE), error
data count /3/
call mpi_init(mpi_stat)
call mpi_comm_size(mpi_comm_world, nproc, mpi_stat)
call mpi_comm_rank(mpi_comm_world, rank, mpi_stat)
n = 400*count
allocate(sendcounts(nproc), displ(nproc), &
blocklens(count), blockdisp(count), request(nproc))
if (rank.eq.0) then
allocate(Jp(n+1),JpAux(n+1))
Jp = 0
do i = 1,n+1
Jp(i) = i
enddo
endif
call mpi_barrier(mpi_comm_world, mpi_stat)
m = n/count
n_distro = (m+1)/nproc
k = 0
do i = 1,nproc
if (i<nproc) then
sendcounts(i) = n_distro
else
sendcounts(i) = m - (nproc-1)*n_distro
endif
displ(i) = k
k = k + sendcounts(i)
enddo
call mpi_barrier(mpi_comm_world, mpi_stat)
allocate(JpS(count*sendcounts(rank+1)+1))
call mpi_barrier(mpi_comm_world, mpi_stat)
! call mpi_irecv(JpS, (sendcounts(rank+1))*count+1,mpi_int,0,0,mpi_comm_world, request(rank+1), mpi_stat)
! call mpi_recv(JpS, (sendcounts(rank+1))*count+1,mpi_int,0,0,mpi_comm_world, MPI_STATUS_IGNORE,mpi_stat)
!call mpi_waitall(1,request,ARRAY_OF_STATUS,error)
! call mpi_barrier(mpi_comm_world, mpi_stat)
if (rank.eq.0) then
do i = 0,nproc-1
JpAux = -100000
blocklens = spread(sendcounts(i+1),1,count)
blockdisp = spread(displ(i+1),1,count) + (/ (k*m, k=0,count-1) /)
blocklens(count) = blocklens(count)+1
do j = 1,count
if (j.eq.1) then
JpAux(blockdisp(j)+1:blockdisp(j)+blocklens(j)) = Jp(blockdisp(j)+1:blockdisp(j)+blocklens(j))&
-Jp(blockdisp(j)+1)
else
JpAux(blockdisp(j)+1:blockdisp(j)+blocklens(j)) = Jp( blockdisp(j) + 1 : blockdisp(j) + blocklens(j) )&
-Jp( blockdisp(j)+1 ) + JpAux( blockdisp(j-1) + blocklens(j-1))&
+(Jp( blockdisp(j-1)+blocklens(j-1)+1 )-Jp( blockdisp(j-1)+blocklens(j-1)))
endif
enddo
call mpi_type_indexed(count, blocklens, blockdisp, mpi_int, newF, mpi_stat)
call mpi_type_commit(newF, mpi_stat)
call mpi_isend(JpAux, 1, newF, i, i, mpi_comm_world, request(i+1), mpi_stat)
call mpi_type_free(newF, mpi_stat)
enddo
endif
! call mpi_wait(request(rank+1), ARRAY_OF_STATUS, mpi_stat)
call mpi_barrier(mpi_comm_world, mpi_stat)
!call mpi_waitall(1,request,ARRAY_OF_STATUS,error)
call mpi_recv(JpS, (sendcounts(rank+1))*count+1,mpi_int,0,MPI_ANY_TAG,mpi_comm_world, MPI_STATUS_IGNORE,mpi_stat)
! print*, request
print*, 'rank: ', rank, ', size: ', size(JpS), ', Jp: ', JpS
call mpi_barrier(mpi_comm_world, mpi_stat)
call mpi_finalize(mpi_stat)
end program test_send

Fortran OMP parallel do loop scales differently for gfortran and Intel

I abstracted some code from a much larger coding project I'm working on. The code has an OMP parallel do loop which scales well with processor number when compiled with gfortran but badly when compiled with Intel. With gfortran, the code takes 18 seconds to run with 1 processor and 5 seconds to run with 4 processors. With Intel it takes 7 seconds to run with 1 processor and 14 seconds to run with 4 processors. I don't understand what's going on here. The code is below.
MODULE test
TYPE walker
DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: R
END TYPE walker
TYPE walkerlist
INTEGER :: nwlkr
TYPE(walker), DIMENSION(:), ALLOCATABLE :: W
END TYPE walkerlist
CONTAINS
FUNCTION step( dTau, nelec, ndim ) RESULT ( dR )
DOUBLE PRECISION, INTENT(IN) :: dTau
INTEGER, INTENT(IN) :: nelec, ndim
DOUBLE PRECISION :: dR(ndim,nelec), rand1, rand2, N2DTau
INTEGER :: d, k
DOUBLE PRECISION, PARAMETER :: twopi = 8.d0 * atan(1.d0)
N2DTau = -2 * dTau
DO k = 1, nelec
DO d = 1, ndim
CALL RANDOM_NUMBER(rand1)
CALL RANDOM_NUMBER(rand2)
dR(d,k) = SQRT( N2DTau * LOG( rand1 ) ) * COS( twopi * rand2 )
END DO
END DO
END FUNCTION step
END MODULE test
PROGRAM walk
USE test
TYPE(walkerlist), TARGET :: Wl
DOUBLE PRECISION :: dTau
INTEGER :: istp, i, t1, t2, clock_rate, clock_max
Wl % nwlkr = 10000
ALLOCATE( Wl % W ( Wl % nwlkr ) )
DO i = 1, Wl % nwlkr
ALLOCATE( Wl % W(i) % R(3,2) )
END DO
dTau = 0.001
CALL SYSTEM_CLOCK ( t1, clock_rate, clock_max )
!$OMP PARALLEL DO SHARED( W ) DEFAULT( FIRSTPRIVATE )
DO i = 1, Wl % nwlkr
DO istp = 1, 4000
Wl % W(i) % R = Wl % W(i) % R + step( dTau, 2, 3 )
END DO
END DO
!$OMP END PARALLEL DO
CALL SYSTEM_CLOCK ( t2, clock_rate, clock_max )
Print*, "time:", REAL ( t2 - t1 ) / REAL ( clock_rate )
END PROGRAM walk
The issue was the random_number calls, where I'm guessing the threads were sharing seeds. I solved it by instead using the random number generating function ran.
rand1 = ran(s)
rand2 = ran(s)
Ran lets you input the seed s, which I made thread_private and of the save type. Ran changes the seed only for ifort and not gfortran, so I can't use it for the latter. Ran also sometimes outputs 0, which I personally need to always check for and discard. I also need to Ensure all threads start with a different seed.

OpenMP FFTW with Fortran not thread safe

I am trying to use the FFTW with openMP and Fortran, but I get wrong results when executing in parallel, which also change their values every execution step, displaying typical behaviour when parallelisation goes wrong.
I am aiming for a simple 3d real-to-complex transformation. Following the FFTW tutorial, I took all but the call to dfftw_execute_dft_r2c() out of the parallel region, but it doesn't seem to work.
I use FFTW 3.3.8, configured with ./configure --enable-threads --enable-openmp --enable-mpi and compile my code with gfortran program.f03 -o program.o -I/usr/include -fopenmp -lfftw3_omp -lfftw3 -g -Wall.
This is how my program looks like:
program use_fftw
use,intrinsic :: iso_c_binding
use omp_lib
implicit none
include 'fftw3.f03'
integer, parameter :: dp=kind(1.0d0)
integer, parameter :: Nx = 10
integer, parameter :: Ny = 5
integer, parameter :: Nz = 5
real(dp), parameter :: pi = 3.1415926d0
real(dp), parameter :: physical_length_x = 20.d0
real(dp), parameter :: physical_length_y = 10.d0
real(dp), parameter :: physical_length_z = 10.d0
real(dp), parameter :: lambda1 = 0.5d0
real(dp), parameter :: lambda2 = 0.7d0
real(dp), parameter :: lambda3 = 0.9d0
real(dp), parameter :: dx = physical_length_x/real(Nx,dp)
real(dp), parameter :: dy = physical_length_y/real(Ny,dp)
real(dp), parameter :: dz = physical_length_z/real(Nz,dp)
integer :: void, nthreads
integer :: i, j, k
real(dp):: d
complex(dp), allocatable, dimension(:,:,:) :: arr_out
real(dp), allocatable, dimension(:,:,:) :: arr_in
integer*8 :: plan_forward
allocate(arr_in( 1:Nx, 1:Ny, 1:Nz)); arr_in = 0
allocate(arr_out(1:Nx/2+1, 1:Ny, 1:Nz)); arr_out = 0
!------------------------------
! Initialize fftw stuff
!------------------------------
! Call before any FFTW routine is called outside of parallel region
void = fftw_init_threads()
if (void==0) then
write(*,*) "Error in fftw_init_threads, quitting"
stop
endif
nthreads = omp_get_num_threads()
call fftw_plan_with_nthreads(nthreads)
! plan execution is thread-safe, but plan creation and destruction are not:
! you should create/destroy plans only from a single thread
call dfftw_plan_dft_r2c_3d(plan_forward, Nx, Ny, Nz, arr_in, arr_out, FFTW_ESTIMATE)
!--------------------------------
! Start parallel region
!--------------------------------
!$OMP PARALLEL PRIVATE( j, k, d)
! Fill array with wave
! NOTE: wave only depends on x so you can plot it later.
!$OMP DO
do i = 1, Nx
d = 2.0*pi*i*dx
do j = 1, Ny
do k = 1, Nz
arr_in(i,j,k) = cos(d/lambda1)+sin(d/lambda2)+cos(d/lambda3)
enddo
enddo
enddo
!$OMP END DO
call dfftw_execute_dft_r2c(plan_forward, arr_in, arr_out)
!$OMP END PARALLEL
!-----------------
! print results
!-----------------
do i=1, Nx/2+1
do j=1, Ny
do k=1, Nz
write(*,'(F12.6,A3,F12.6,A3)',advance='no') real(arr_out(i,j,k)), " , ", aimag(arr_out(i,j,k)), " ||"
enddo
write(*,*)
enddo
write(*,*)
enddo
deallocate(arr_in, arr_out)
! destroy plans is not thread-safe; do only with single
call dfftw_destroy_plan(plan_forward)
end program use_fftw
I also tried moving the initialisation part of FFTW (void = fftw_init_threads(); call fftw_plan_with_nthreads(nthreads); call dfftw_plan_dft_r2c_3d(...) into the parallel region, using a !$OMP SINGLE block and synchronising with a barrier afterwards, but the situation didn't improve.
Can anyone help me?
EDIT: I was able to test my program on another system, the problem remains. So the issue apparently isn't in my implementation of openmp or FFTW, but somewhere in the program itself.
You should normally call fftw execute routines outside of the parallel region. They have their own parallel regions inside them and they will take care of running the transform in parallel with that many threads as you requested during planning. They will re-use your existing OpenMP threads.
You can also call them inside a parallel region, but on different arrays, not on the same arrays! And then your plan should be planned to use 1 thread. Each thread would then preform a 2D transform of a slice of the array, for example.
The thread-safety means you can call the routines concurrently, but each for different data.

Writing a large matrix in a single file using MPI

I have a large N by N matrix containing real numbers, which has been decomposed into blocks using MPI. I am now trying to recompose this matrix and write it in a single file.
This topic (writing a matrix into a single txt file with mpi) covered a similar issue, but I got pretty confused by all the 'integer-to-string' conversion, etc (I am not an expert!). I am using Fortran for my code, but I guess that even a C explanation should help. I have been reading tutorials on MPI-IO, but there are still a few things I do not understand. Here is the code I have been working on:
use mpi
implicit none
! matrix dimensions
integer, parameter :: imax = 200
integer, parameter :: jmax = 100
! domain decomposition in each direction
integer, parameter :: iprocs = 3
integer, parameter :: jprocs = 3
! variables
integer :: i, j
integer, dimension(mpi_status_size) :: wstatus
integer :: ierr, proc_num, numprocs, fileno, localarray
integer :: loc_i, loc_j, ppp
integer :: istart, iend, jstart, jend
real, dimension(:,:), allocatable :: x
! initialize MPI
call mpi_init(ierr)
call mpi_comm_size(mpi_comm_world, numprocs, ierr)
call mpi_comm_rank(mpi_comm_world, proc_num, ierr)
! define the beginning and end of blocks
loc_j = proc_num/iprocs
loc_i = proc_num-loc_j*iprocs
ppp = (imax+iprocs-1)/iprocs
istart = loc_i*ppp + 1
iend = min((loc_i+1)*ppp, imax)
ppp = (jmax+jprocs-1)/jprocs
jstart = loc_j*ppp + 1
jend = min((loc_j+1)*ppp, jmax)
! write random data in each block
allocate(x(istart:iend,jstart:jend))
do j = jstart, jend
do i = istart, iend
x(i,j) = real(i + j)
enddo
enddo
! create subarrays
call mpi_type_create_subarray( 2, [imax,jmax], [iend-istart+1,jend-jstart+1], &
[istart,jstart], mpi_order_fortran, mpi_real, localarray, ierr )
call mpi_type_commit( localarray, ierr )
! write to file
call mpi_file_open( mpi_comm_world, 'test.dat', IOR(MPI_mode_create,MPI_mode_wronly), &
mpi_info_null, fileno, ierr )
call mpi_file_set_view( fileno, 0, mpi_real, localarray, "native", mpi_info_null, ierr )
call mpi_file_write_all( fileno, x, (jend-jstart+1)*(iend-istart+1), MPI_real, wstatus, ierr )
call mpi_file_close( fileno, ierr )
! deallocate data
deallocate(x)
! finalize MPI
call mpi_finalize(ierr)
I have been following this tutorial (PDF), but my compiler complains that there is no specific subroutine for the generic mpi_file_set_view. Did I do something wrong? Is the rest of the code ok?
Thank you very much for your help!!
Joachim
I would say that the easy way is to use a library designed to perform such operations efficiently : http://2decomp.org/mpiio.html
You can also look at their source code (files io.f90 and io_write_one.f90).
In the source code, you will see a call to MPI_FILE_SET_SIZE that may be relevant for your case.
EDIT : consider using "call MPI_File_Set_View(fhandle, 0_MPI_OFFSET_KIND,...". Answer from MPI-IO: MPI_File_Set_View vs. MPI_File_Seek

Resources