Why are my MPI-parallelised DO-loops faster the second time round? - performance

I'm working on a scientific model that's parallelised using Open MPI, and I'm finding some weird results with computation time for parallelised loops. Basically, I find that a parallelised loop over a large array is slow right after allocating the (shared) memory, but a second operation is much faster. Why is this?
To demonstrate, here's a dummy program I wrote that illustrates this:
PROGRAM test_program

  USE mpi
  USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_PTR, C_F_POINTER

  IMPLICIT NONE

  INTEGER, PARAMETER :: dp = KIND(1.0D0) ! Double precision
  INTEGER :: process_rank, number_of_processes
  INTEGER :: ierr
  INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
  INTEGER :: disp_unit
  TYPE(C_PTR) :: baseptr
  REAL(dp), DIMENSION(:,:), POINTER :: A
  INTEGER :: win
  INTEGER :: nx, ny, i, j, i1, i2, j1, j2
  REAL(dp) :: tstart, tstop, dt1, dt2

  ! Initialise MPI, split processes
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK( MPI_COMM_WORLD, process_rank, ierr)
  CALL MPI_COMM_SIZE( MPI_COMM_WORLD, number_of_processes, ierr)

  ! Dimensions of data array
  nx = 20000
  ny = 20000

  ! Allocate MPI-shared memory for data array, with an associated window object
  ! (done by all processes, but only the master actually allocates any space)
  IF (process_rank == 0) THEN
    windowsize = ny*nx*8_MPI_ADDRESS_KIND
    disp_unit  = ny*8
  ELSE
    windowsize = 0_MPI_ADDRESS_KIND
    disp_unit  = 1
  END IF
  CALL MPI_WIN_ALLOCATE_SHARED( windowsize, disp_unit, MPI_INFO_NULL, MPI_COMM_WORLD, baseptr, win, ierr)

  IF (.NOT. process_rank == 0) THEN
    ! Get the baseptr of the master's memory space.
    CALL MPI_WIN_SHARED_QUERY( win, 0, windowsize, disp_unit, baseptr, ierr)
  END IF

  ! Associate the pointer with this memory space (so all processes act on the same physical memory)
  CALL C_F_POINTER(baseptr, A, [ny, nx])

  ! Divide the memory over the processors; each gets a range i1-i2 of rows
  i1 = MAX(1, FLOOR(REAL(nx * process_rank / number_of_processes)) + 1)
  i2 = MIN(nx, FLOOR(REAL(nx * (process_rank + 1) / number_of_processes)))

  ! Do some operations on the data, measure how long this takes on each core
  tstart = MPI_WTIME()
  DO i = i1, i2
    DO j = 1, ny
      A( j,i) = SQRT( (REAL(i,dp) - REAL(nx,dp)/2._dp)**2 + (REAL(j,dp) - REAL(ny,dp)/2._dp)**2 )
    END DO
  END DO
  tstop = MPI_WTIME()
  dt1 = tstop - tstart
  WRITE(0,*) ' Process ', process_rank, ': dt1 = ', dt1

  ! Do some more calculations on the data, measure how long this takes on each core
  tstart = MPI_WTIME()
  DO i = i1, i2
    DO j = 1, ny
      A( j,i) = SQRT( (REAL(i,dp) - REAL(nx,dp)/4._dp)**2 + (REAL(j,dp) - REAL(ny,dp)/3._dp)**2 )
    END DO
  END DO
  tstop = MPI_WTIME()
  dt2 = tstop - tstart
  WRITE(0,*) ' Process ', process_rank, ': dt2 = ', dt2

  CALL MPI_FINALIZE( ierr)

END PROGRAM test_program
This gives me the following output:
Process 0 : dt1 = 2.1637410982511938
Process 1 : dt1 = 2.6094086961820722
Process 0 : dt2 = 1.0437976177781820
Process 1 : dt2 = 0.96576740033924580
Replacing the first loop by a vectorised operation (e.g. A(:,i1:i2) = 0._dp) gives the same "fast" results for the second loop.

When you allocate memory (for example with MPI_WIN_ALLOCATE_SHARED), the operating system only allocates it virtually: until you actually touch it, it occupies no space in RAM or on disk.
The first loop writes into the virtually-allocated memory for the first time, which triggers page faults. When handling a page fault, the operating system maps the required page into physical memory (or terminates the program in the case of an illegal memory access). This process is slow, hence the slow first pass.
During the second loop no page faults occur, because the memory is already physically mapped, hence the faster execution.
Note that this effect is completely independent of MPI (although MPI_WIN_ALLOCATE_SHARED allocates memory differently from a plain ALLOCATE, since the memory must be shared between processes).
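If you want the timed loops to measure only the computation, one option (consistent with the vectorised-assignment experiment in the question) is to touch the memory once beforehand, so the page faults are paid outside the timed region. A minimal sketch using the same variables as the program above; the pre-touch loop and the barrier are additions, not part of the original code:
! Pre-touch this rank's slice right after C_F_POINTER, so the page faults
! happen here rather than inside the timed loops. The barrier keeps one
! rank's timing from overlapping another rank's pre-touch.
DO i = i1, i2
  DO j = 1, ny
    A( j,i) = 0._dp   ! first write forces the OS to map the page
  END DO
END DO
CALL MPI_BARRIER( MPI_COMM_WORLD, ierr)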

Related

OpenACC constant parameters

I am wondering what is the proper way to handle constants in OpenACC kernels.
For example, in the following code
module vecaddmod
implicit none
integer, parameter :: n = 100000
!$acc declare create(n)
contains
subroutine vecaddgpu(r, a, b)
real, dimension(:) :: r, a, b
integer :: i
!$acc update self(n)
!$acc data present(n)
!$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
do i = 1, n
r(i) = a(i) + b(i)
enddo
!$acc end data
end subroutine vecaddgpu
end module vecaddmod
program main
use vecaddmod
implicit none
integer :: i, errs, argcount
real, dimension(:), allocatable :: a, b, r, e
character*10 :: arg1
allocate( a(n), b(n), r(n), e(n) )
do i = 1, n
a(i) = i
b(i) = 1000*i
enddo
! compute on the GPU
call vecaddgpu( r, a, b )
! compute on the host to compare
do i = 1, n
e(i) = a(i) + b(i)
enddo
! compare results
errs = 0
do i = 1, n
if( r(i) /= e(i) )then
errs = errs + 1
endif
enddo
print *, errs, ' errors found'
if( errs /= 0 ) call exit(errs)
end program main
n is declared as a constant on the CPU in a module, and it is used as the loop range. nvfortran warns me: "Constant or Parameter used in data clause". Is the above example the proper way to handle this? Can I take advantage of the constant memory on the GPU, so that I don't need to copy it from CPU to GPU for each kernel launch?
Thanks.
The compiler will replace parameters with their literal value, so there is no need to put them in data regions.
module vecaddmod
  implicit none
  integer, parameter :: n = 100000
contains
  subroutine vecaddgpu(r, a, b)
    real, dimension(:) :: r, a, b
    integer :: i
    !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
    do i = 1, n
      r(i) = a(i) + b(i)
    enddo
  end subroutine vecaddgpu
end module vecaddmod
...
% nvfortran -acc -Minfo=accel test.f90
vecaddgpu:
11, Generating copyin(a(:100000)) << "n" is replaced with 100000
Generating copyout(r(:100000))
Generating copyin(b(:100000))
12, Loop is parallelizable
Generating Tesla code
12, !$acc loop gang, vector(128) ! blockidx%x threadidx%x

F95 Send/Receive Memory Errors for Array Sending

I'm new to parallel programming and attempting to produce a sparse matrix-vector calculation in Fortran 95. I'm working on a subprogram that only gathers the components of the vector that the sparse matrix will touch (instead of using MPI_AllGather), but I keep getting SIGSEGV errors. I know this means I've asked a process to touch something it can't access or that doesn't exist, but I can't for the life of me figure out what it could be.
!Gather the vector matrix in matrix vector multiplication for sparse matrices
subroutine sparsegather(u,BW,myid,nprocs)
use header
include "mpif.h"
type(Vector), intent(inout) :: u
integer,intent(in) :: BW !Bandwidth
integer,intent(in) :: myid !process id
integer,intent(in) :: nprocs !number of processes
integer :: n, i
integer,dimension(BW) :: rlr, rrr, slr, srr !Range of receive left/right, send left/right
real(kind=rk),dimension(BW) :: rl, rr, sl, sr !Arrays of actual values
integer :: ierr
n = u%n !Length of whole vector - used in periodic condition
!Define ranges
do i = 1,BW
rlr(i) = u%ibeg - BW - 1 + i
rrr(i) = u%iend + i
srr(i) = u%iend - i + 1
slr(i) = u%ibeg + i - 1
end do
!Periodic conditions
do i = 1,BW
if (rlr(i) < 1) then
rlr(i) = rlr(i) + n
end if
if (srr(i) < 1) then
srr(i) = srr(i) + n
end if
if (rrr(i) > n ) then
rrr(i) = rrr(i) - n
end if
if (slr(i) > n ) then
slr(i) = slr(i) - n
end if
end do
!Store the matrix values being sent over
sl = u%xx(slr)
sr = u%xx(srr)
!Pass the value parcels around
if (myid == 0) then
call MPI_Recv(rl,BW,MPI_DOUBLE_PRECISION,nprocs-1,MPI_ANY_TAG,MPI_COMM_WORLD,ierr)
call MPI_Send(sr,BW,MPI_DOUBLE_PRECISION,myid+1,0,MPI_COMM_WORLD,ierr)
call MPI_Recv(rr,BW,MPI_DOUBLE_PRECISION,myid+1,MPI_ANY_TAG,MPI_COMM_WORLD,ierr)
call MPI_Send(sl,BW,MPI_DOUBLE_PRECISION,nprocs-1,0,MPI_COMM_WORLD,ierr)
elseif (myid == nprocs-1) then
call MPI_Send(sr,BW,MPI_DOUBLE_PRECISION,0,0,MPI_COMM_WORLD,ierr)
call MPI_Recv(rl,BW,MPI_DOUBLE_PRECISION,myid-1,MPI_ANY_TAG,MPI_COMM_WORLD,ierr)
call MPI_Send(sl,BW,MPI_DOUBLE_PRECISION,myid-1,0,MPI_COMM_WORLD,ierr)
call MPI_Recv(rr,BW,MPI_DOUBLE_PRECISION,0,MPI_ANY_TAG,MPI_COMM_WORLD,ierr)
elseif (mod(myid,2) == 0) then
call MPI_Recv(rl,BW,MPI_DOUBLE_PRECISION,myid-1,MPI_ANY_TAG,MPI_COMM_WORLD,ierr)
call MPI_Send(sr,BW,MPI_DOUBLE_PRECISION,myid+1,0,MPI_COMM_WORLD,ierr)
call MPI_Recv(rr,BW,MPI_DOUBLE_PRECISION,myid+1,MPI_ANY_TAG,MPI_COMM_WORLD,ierr)
call MPI_Send(sl,BW,MPI_DOUBLE_PRECISION,myid-1,0,MPI_COMM_WORLD,ierr)
else
call MPI_Send(sr,BW,MPI_DOUBLE_PRECISION,myid+1,0,MPI_COMM_WORLD,ierr)
call MPI_Recv(rl,BW,MPI_DOUBLE_PRECISION,myid-1,MPI_ANY_TAG,MPI_COMM_WORLD,ierr)
call MPI_Send(sl,BW,MPI_DOUBLE_PRECISION,myid-1,0,MPI_COMM_WORLD,ierr)
call MPI_Recv(rr,BW,MPI_DOUBLE_PRECISION,myid+1,MPI_ANY_TAG,MPI_COMM_WORLD,ierr)
end if
u%xx(rrr) = rr
u%xx(rlr) = rl
end subroutine sparsegather
u is an object with the vector values stored in %xx and its size in %n. The relevant starting and end points for each processor are in %ibeg and %iend.
BW is the bandwidth of the sparse banded matrix. The equation has periodic conditions, so values to the left of the start of the vector wrap around to the right side (and vice versa), which is handled in the periodic-conditions section.

MPI program for the Poisson equation stuck after one iteration

I am trying to solve the Poisson equation in the square domain (0,1) x (0,1) using MPI and overlapping domains. Currently, my code takes as input the number of domain divisions in the X and Y directions, the length of the overlap between two domains as a fraction of the sub-domain length, and the number of elemental divisions in the overlap.
Input file looks like
2,2
10,10
10,10
program main
!implicit none
include 'mpif.h'
integer cols, divfx, divfy, iter
integer xdiv, ydiv, info, max_iter, x_shift, y_shift
integer, allocatable:: ipiv(:)
double precision, allocatable :: A(:,:), Ainv(:,:)
real, allocatable:: edge(:,:)
double precision, allocatable :: u(:,:), f(:,:)
double precision, allocatable :: u_exact(:,:)
allocatable :: Left(:,:), Right(:,:)
allocatable :: Top(:,:), Bottom(:,:)
allocatable :: TempLeft(:,:), TempRight(:,:)
allocatable :: TempTop(:,:), TempBottom(:,:)
integer myid, master, numprocs, ierr, status(MPI_STATUS_SIZE)
integer i, j, numsent, sender, L, T, R, B
integer anstype, row, dovfx, dovfy, domx, domy, idx
real dom1,dom2,buff
double precision mesh(2), buffer(4), divx, divy, dx, dy, xd, yd
double precision error, derror, error_norm
character(len=100) :: domaindata
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
master = 0
divx=0.d0
divy=0.d0
! Input the number of divisions for domain decomposition and calculate sub-domain dimensions.
open(1, file='Inputdata.dat', status='old')
! read(1,*) domx,domy
! read(1,*) dovfx,dovfy
! read(1,*) divfx,divfy
write(*,*)'Starting the Program'
write(*,*) "Enter the number of domain divisions in x-direction &
and y-direction ( Enter 4 if you want three sub-domains)"
read(1,*) domx,domy
write(*,*) domx,domy
write(*,*) "Total number of sub-domains for the problem"
write(*,*) domx*domy
write(*,*) "Enter the sub-domain overlap in x & y -direction as &
a fraction of sub-domain length (multiple of 10)"
read(1,*) dovfx,dovfy
write(*,*) dovfx,dovfy
write(*,*) "Enter the number of divisions in the overlap in &
x & yas a fraction of sub-domain(multiple of 5)"
read(1,*) divfx,divfy
write(*,*) divfx,divfy
divx=1.d0/(((1.d0/domx)/dovfx)/divfx)
divy=1.d0/(((1.d0/domy)/dovfy)/divfy)
write(*,*)"Total number of elemental divisions for the &
problem domain (0,1) in both dimensions"
write(*,*) divx, divy, divx*divy
write(*,*)"Total number of nodal divisions for the problem domain"
write(*,*) (divx+1)*(divy+1)
! time
! **************************
tic = MPI_Wtime();
! Maximum number of iterations.
max_iter=100
! Mesh Size
mesh(1)=1/divx
mesh(2)=1/divy
write(*,*) 'Element Size'
write(*,*) mesh(1), mesh(2)
if ( myid .eq. master ) then
! Send iteration number to subdomain and receive the error from each to
! calculate total error.
write(*,*) 'still1'
do 10 iter = 1,max_iter
do 20 i = 1,domx*domy
call MPI_SEND(iter, 1, MPI_INTEGER, i, i, MPI_COMM_WORLD, ierr)
20 continue
! Receive results obtained from sub-processor/sub-domain
!
error = 0.d0
do 30 i = 1,domx*domy
call MPI_RECV(d_error , 1, MPI_DOUBLE_PRECISION, i, iter, &
MPI_COMM_WORLD, status, ierr)
error = error + d_error
30 continue
write(*,*) 'In iteration ', iter, 'cumulative error is', error*1.d0/domx/domy
10 continue
! time:
! *************
toc = MPI_Wtime();
! Write results to output
! **************************
write(*,*)
write(*,*) 'Time taken for parallel computation is: ',(toc-tic)*1000, 'miliseconds'
else
!************************************ Slaves receive mesh size for discretization ******************************************************
write(*,*) 'iter', iter
write(*,*) 'myid', myid
! Slaves receive corners, then creates a Cartesian grid for finite
! difference until done message received, for one iteration.
! This is done for the first iteration
! Get Domain ID :
if (myid.gt.(domx*domy)) goto 200
write(*,*) 'still31'
1000 call MPI_RECV(iter, 1, MPI_INTEGER, master, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
if (status(MPI_TAG) .eq. 0) goto 200
write(*,*) 'still4'
if (iter.eq.1) then
write(*,*) 'still5'
dom1=domx
dom2=domy
allocate (edge(domx*domy,4))
! Determining the edge matrices for each subdomain - the bounding box
do j =1,domx
do k=1,domy
idx=(j-1)*(domx-1)+k+(j-1)
buff=REAL((mod(idx-1,domx)))/domx
IF (buff-((1.d0/domx)/dovfx) .gt. 0) THEN
buff=buff-((1.d0/domx)/dovfx)
ENDIF
edge(idx,1) = buff
IF ((mod(idx ,domx)) .eq. 0) THEN
buff=1
ELSE
buff=REAL(mod(idx ,domx))/domx
ENDIF
!write(*,*) buff
IF (buff + ((1.d0/domx)/dovfx) .lt. 1) THEN
buff=buff+((1.d0/domx)/dovfx)
ENDIF
edge(idx,2) = buff
!
buff=REAL(floor((idx-1)/dom1))/dom1
IF (buff -((1.d0/domy)/dovfy) .gt. 0) THEN
buff=buff-((1.d0/domy)/dovfy)
ENDIF
edge(idx,3) = buff
buff=REAL(ceiling(idx/dom1))/dom1
IF (buff+((1.d0/domy)/dovfy) .lt. 1) THEN
buff= buff+((1.d0/domy)/dovfy)
ENDIF
edge(idx,4) = buff
end do
end do
write(*,*) myid, edge(myid,:)
write(*,*) 'iter', iter
call Surround_dom(myid,domx,domy,LeftC, RightC, BottomC, TopC)
! Calculate data for the matrices: Divisions in each subdomain. :
xdiv=(edge(myid,2)-edge(myid,1))/mesh(1)
ydiv=(edge(myid,4)-edge(myid,3))/mesh(2)
dx=mesh(1)
dy=mesh(2)
allocate (A((xdiv-1)*(ydiv-1),(xdiv-1)*(ydiv-1)))
allocate (Ainv((xdiv-1)*(ydiv-1),(xdiv-1)*(ydiv-1)))
allocate (u((xdiv-1)*(ydiv-1),1),f((xdiv-1)*(ydiv-1),1))
allocate (u_exact((xdiv-1)*(ydiv-1),1))
allocate (ipiv((xdiv-1)*(ydiv-1)))
allocate (Left((ydiv-1),1),Right((ydiv-1),1))
allocate (Top((xdiv-1),1), Bottom(((xdiv-1)),1))
allocate (TempLeft((ydiv-1),1),TempRight((ydiv-1),1))
allocate (TempTop((xdiv-1),1), TempBottom(((xdiv-1)),1))
Left = 0.d0; Right = 0.d0; Bottom = 0.d0; Top = 0.d0;
TempLeft = 0.d0; TempRight = 0.d0; TempBottom = 0.d0; TempTop = 0.d0;
A=0;
endif
write(*,*) 'still6'
! ******************************************************************
! SendReceive data based on location
! ******************************************************************
if (LeftC.ne.0) then
call MPI_SENDRECV(Left, ydiv - 1, MPI_DOUBLE_PRECISION, LeftC, iter, &
TempLeft, ydiv - 1, MPI_DOUBLE_PRECISION, LeftC, iter, MPI_COMM_WORLD, status, ierr)
end if
if (RightC.ne.0) then
call MPI_SENDRECV(Right, ydiv - 1, MPI_DOUBLE_PRECISION, RightC, iter, &
TempRight, ydiv - 1, MPI_DOUBLE_PRECISION, RightC, iter, MPI_COMM_WORLD, status, ierr)
end if
if (BottomC.ne.0) then
call MPI_SENDRECV(Bottom, xdiv - 1, MPI_DOUBLE_PRECISION, BottomC, iter, &
TempBottom, xdiv - 1, MPI_DOUBLE_PRECISION, BottomC, iter, MPI_COMM_WORLD, status, ierr)
end if
if (TopC.ne.0) then
call MPI_SENDRECV(Top, xdiv - 1, MPI_DOUBLE_PRECISION, TopC, iter, &
TempTop, xdiv - 1, MPI_DOUBLE_PRECISION, TopC, iter, MPI_COMM_WORLD, status, ierr)
end if
Left = TempLeft ;
Right = TempRight;
Top = TempTop ;
Bottom = TempBottom;
write(*,*) 'still7'
! Form the coefficient matrices
do i =1,(xdiv-1)*(ydiv-1)
A(i,i)=-2.d0*(1.d0/(dx**2)+1.d0/(dy**2))
enddo
do i=1,(xdiv-2)
do j=1,(ydiv-1)
A(i+(j-1)*(xdiv-1),i+(j-1)*(xdiv-1)+1)=1.d0/(dx**2)
A(i+(j-1)*(xdiv-1)+1,i+(j-1)*(xdiv-1))=1.d0/(dx**2)
enddo
enddo
do i=1,(xdiv-1)
do j=1,(ydiv-2)
A(i+(j-1)*(xdiv-1),i+(j)*(xdiv-1))=1.d0/(dy**2)
A(i+(j)*(xdiv-1),i+(j-1)*(xdiv-1))=1.d0/(dy**2)
enddo
enddo
write(*,*) 'still9'
L=1
T=1
R=1
B=1
write(*,*) 'still10'
! Impose Boundary Conditions in F matrix
do i=1,(xdiv-1)*(ydiv-1)
xd = edge(myid,1) + (dx)*mod(i,(xdiv-1))
if (mod(i,xdiv-1).eq.0) xd = edge(myid,1) + (dx)*(xdiv-1)
yd = edge(myid,3) + (dy)*ceiling(i*1.d0/(xdiv-1))
!if (iter.eq.1 .and. myid.eq.2) write(*,*) xd,yd
u_exact(i,1) = sin(2.d0*3.1415*xd)*sin(2.d0*3.1415*yd)
f(i,1) = 8.d0*3.1415*3.1415*u_exact(i,1)
IF (mod(i,(xdiv-1)) .eq. 1) THEN
f(i,1)= f(i,1)+Left(L,1)/dx/dx
L=L+1
ENDIF
IF (mod(i,(xdiv-1)) .eq. 0) THEN
f(i,1)=f(i,1)+Right(R,1)/dx/dx
R=R+1
ENDIF
IF (i .le. (xdiv-1)) THEN
f(i,1)=f(i,1)+Bottom(B,1)/dy/dy
B=B+1
ENDIF
IF (i .gt. (xdiv-1)*(ydiv-2)) THEN
f(i,1)=f(i,1)+Top(T,1)/dy/dy
T=T+1
END IF
! enddo
enddo
!Solve AU=F by LU factorization!
write(*,*) 'still11'
do i=1,(xdiv-1)*(ydiv-1)
do j=1,(xdiv-1)*(ydiv-1)
Ainv(i,j)=A(i,j)
end do
end do
! do i=1,(xdiv-1)*(ydiv-1)
! write(*,*) myid,Ainv(i,i)
!end do
call DGESV((xdiv-1)*(ydiv-1), 1, A, &
(xdiv-1)*(ydiv-1), ipiv, f, (xdiv-1)*(ydiv-1), info)
write(*,*) 'still12'
call ErrorNorm(f,u_exact,(xdiv-1)*(ydiv-1),error_norm)
write(*,*) 'still13'
! ****************************************************
! Update boundary conditions based on new solution:
! ****************************************************
x_shift = divfx-1 ;
y_shift = divfy-1 ;
! write(*,*) 'LeftC', myid,LeftC,RightC,TopC,BottomC
if (LeftC.ne.0) then
do 50 i = 1,ydiv - 1
Left(i,1) = f((xdiv - 1)*(i - 1) + 1 + x_shift,1)
!if ((myid.eq.2).and.(iter.eq.1)) write(*,*) 'for left',i, &
!(xdiv - 1)*(i - 1) + 1 + x_shift
50 continue
end if
if (RightC.ne.0) then
do 60 i = 1,ydiv - 1
Right(i,1) = f((xdiv - 1)*i - x_shift,1)
!if ((myid.eq.1).and.(iter.eq.1)) write(*,*) 'for right',i, &
!(xdiv - 1)*i - x_shift
60 continue
end if
if (TopC.ne.0) then
do 70 i = 1,xdiv - 1
Top(i,1) = f((xdiv - 1)*((ydiv - 2) - y_shift) + i,1)
!if ((myid.eq.1).and.(iter.eq.1)) write(*,*) 'for top',i, &
!((xdiv - 1)*((ydiv - 2) - y_shift) + i)
70 continue
end if
if ( BottomC.ne.0) then
do 80 i = 1,xdiv - 1
Bottom(i,1) = f((xdiv - 1)*y_shift + i,1)
!if ((myid.eq.3).and.(iter.eq.1)) write(*,*) 'for bottom',i, &
!((xdiv - 1)*y_shift + i)
80 continue
end if
write(*,*) 'still14'
TempLeft =Left;
TempRight = Right;
TempTop = Top;
TempBottom = Bottom;
call MPI_SEND(error_norm, 1, MPI_DOUBLE_PRECISION, master, iter, &
MPI_COMM_WORLD, ierr)
write(*,*) 'still15'
if (iter.lt.iter_max) go to 1000
! *********************************************************************************
! Write solution to data file to view the results.
! *********************************************************************************
write (domaindata, "(A7,I2,A4)") "domain_",myid,".dat"
open (unit=myid*10, file = domaindata)
write (myid*10,*) ' VARIABLE= "X","Y","U_EXACT","U_CALC" '
do i=1,(xdiv-1)*(ydiv-1)
xd = edge(myid,1) + (dx)*mod(i,(xdiv-1))
if (mod(i,xdiv-1).eq.0) xd = edge(myid,1) + (dx)*(xdiv-1)
yd = edge(myid,3) + (dy)*ceiling(i*1.d0/(xdiv-1))
write (myid*10,*) xd, yd, u_exact(i,1), f(i,1)
enddo
write(*,*) 'still16'
if (iter.eq.max_iter) go to 200
200 continue
write(*,*) 'still45'
endif
call MPI_FINALIZE(ierr)
stop
end program main
subroutine Surround_dom(myid,domx,domy,LeftID, RightID, BottomID, TopID)
implicit none
integer myid, j, k, domy, domx, BottomID, TopID, LeftID, RightID
j = ceiling(1.d0*myid/domx)
k = mod(myid,domx)
if (k.eq.0) k = domx
! Domain on the left
if(k.eq.1) then
LeftID = 0
else
LeftID = ((j-1)*domx + k) - 1
end if
! Domain on the Right
if(k.eq.domx) then
RightID = 0
else
RightID = ((j-1)*domx + k) + 1
end if
! Domain on the Bottom
if(j.eq.1) then
BottomID = 0
else
BottomID = ((j-1)*domx + k) - domx
end if
! Domain on the Top
if(j.eq.domy) then
TopID = 0
else
TopID = ((j-1)*domx + k) + domx
end if
return
end
subroutine ErrorNorm(u,u_exact,N,error_norm)
implicit none
double precision u(N), u_exact(N), err, error_norm
integer i, N
error_norm = 0.d0
do 10 i = 1,N
err = (u(i) - u_exact(i))
error_norm = error_norm + err*err
10 continue
error_norm = sqrt(error_norm)/(N*1.d0)
return
end
I expect the code to run through all the iterations, giving me a respectable error on the order of 1e-3 to 1e-4.
Currently no error shows up: the code runs successfully for one iteration and then produces no output at all, even after days. It would be really helpful if I could get some guidance. I am sorry that the structure of my code is awkward; I am just a beginner. It won't run if the number of domains is odd, or if the number of domains is not equal to the number of processors. Any suggestions on how to remove these limitations are also welcome.

Writing a large matrix in a single file using MPI

I have a large N by N matrix containing real numbers, which has been decomposed into blocks using MPI. I am now trying to recompose this matrix and write it in a single file.
This topic (writing a matrix into a single txt file with mpi) covered a similar issue, but I got pretty confused by all the 'integer-to-string' conversion, etc (I am not an expert!). I am using Fortran for my code, but I guess that even a C explanation should help. I have been reading tutorials on MPI-IO, but there are still a few things I do not understand. Here is the code I have been working on:
use mpi
implicit none
! matrix dimensions
integer, parameter :: imax = 200
integer, parameter :: jmax = 100
! domain decomposition in each direction
integer, parameter :: iprocs = 3
integer, parameter :: jprocs = 3
! variables
integer :: i, j
integer, dimension(mpi_status_size) :: wstatus
integer :: ierr, proc_num, numprocs, fileno, localarray
integer :: loc_i, loc_j, ppp
integer :: istart, iend, jstart, jend
real, dimension(:,:), allocatable :: x
! initialize MPI
call mpi_init(ierr)
call mpi_comm_size(mpi_comm_world, numprocs, ierr)
call mpi_comm_rank(mpi_comm_world, proc_num, ierr)
! define the beginning and end of blocks
loc_j = proc_num/iprocs
loc_i = proc_num-loc_j*iprocs
ppp = (imax+iprocs-1)/iprocs
istart = loc_i*ppp + 1
iend = min((loc_i+1)*ppp, imax)
ppp = (jmax+jprocs-1)/jprocs
jstart = loc_j*ppp + 1
jend = min((loc_j+1)*ppp, jmax)
! write random data in each block
allocate(x(istart:iend,jstart:jend))
do j = jstart, jend
do i = istart, iend
x(i,j) = real(i + j)
enddo
enddo
! create subarrays
call mpi_type_create_subarray( 2, [imax,jmax], [iend-istart+1,jend-jstart+1], &
[istart,jstart], mpi_order_fortran, mpi_real, localarray, ierr )
call mpi_type_commit( localarray, ierr )
! write to file
call mpi_file_open( mpi_comm_world, 'test.dat', IOR(MPI_mode_create,MPI_mode_wronly), &
mpi_info_null, fileno, ierr )
call mpi_file_set_view( fileno, 0, mpi_real, localarray, "native", mpi_info_null, ierr )
call mpi_file_write_all( fileno, x, (jend-jstart+1)*(iend-istart+1), MPI_real, wstatus, ierr )
call mpi_file_close( fileno, ierr )
! deallocate data
deallocate(x)
! finalize MPI
call mpi_finalize(ierr)
I have been following this tutorial (PDF), but my compiler complains that there is no specific subroutine for the generic mpi_file_set_view. Did I do something wrong? Is the rest of the code ok?
Thank you very much for your help!!
Joachim
I would say that the easy way is to use a library designed to perform such operations efficiently: http://2decomp.org/mpiio.html
You can also look at their source code (files io.f90 and io_write_one.f90).
In the source code, you will see a call to MPI_FILE_SET_SIZE that may be relevant for your case.
EDIT: consider using call MPI_File_Set_View(fhandle, 0_MPI_OFFSET_KIND, ...), as in this answer to MPI-IO: MPI_File_Set_View vs. MPI_File_Seek.
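As a minimal sketch of how that fix would look in the code above (using the question's fileno and localarray variables; this line is not from the original answer):
! The displacement literal must have kind MPI_OFFSET_KIND; a default-integer 0
! matches no specific interface, which is what the compiler is complaining about.
call mpi_file_set_view( fileno, 0_MPI_OFFSET_KIND, mpi_real, localarray, &
                        "native", mpi_info_null, ierr )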

MPI_Gather 2-d arrays from all processors into bigger 2-d array on root with FORTRAN

Here's what I am trying to do with my code.
I have to calculate Fw and Fi on a 2-d grid of size nx by nz. I split the k loop between all processors so that each processor calculates an nx by (nz/p) chunk, where p is the number of processors being used. After each processor is done, I want to gather all the chunks, that is, each nx by nz/p piece of Fw and Fi, into Fw and Fi on the root. I eventually want to use an allgather, i.e. gather all the calculated Fw and Fi into all processors.
I have attached the code below.
I am not sure that I am specifying the sendcount and recvcount correctly, or why my code is deadlocking. Any help is appreciated. Thanks!
PROGRAM gridtestpar
IMPLICIT NONE
INTEGER :: nx, nz, i, k, t
INTEGER :: order, mx, mz
INTEGER :: count
INTEGER :: ierror, comm, p, rank, npr, s, f, np(2)
REAL(KIND = 8) :: dx, dz, startx, startz, finishx, finishz
REAL(KIND = 8) :: dt
REAL(KIND = 8) :: cx, cz
REAL(KIND = 8) :: cbx, cbz
REAL(KIND = 8), ALLOCATABLE ::X(:), Z(:), Fw(:,:), Fi(:,:)
REAL(KIND = 8), ALLOCATABLE :: Fn(:,:), Fnp1(:,:)
include 'mpif.h'
!----------------------------------------------------------
!Parameters that can be changed
!---------------------------------------------------------
!Time step
dt = 0.000000001d0
!Number of points in x and z direction(i.e. streamwise and
!spanwise) directions respectively
nx = (400*5)
nz = (400*5)
!First and last grid point locations in x and z directions
startx = 0.d0
finishx = 60.d0*5.d0
startz = 0.d0
finishz = 60.d0*5.d0
!Distance between grid points
dx = (finishx-startx)/REAL(nx-1)
dz = (finishz-startz)/REAL(nz-1)
!Allocate
ALLOCATE(X(nx), Z(nz))
ALLOCATE(Fw(nx,nz), Fi(nx,nz))
ALLOCATE(Fn(nx,nz), Fnp1(nx,nz))
! Make Grid
!--------------------------------------------------------------
DO i = 1, nx
X(i) = (i-1)*dx
END DO
DO k = 1, nz
Z(k) = (k-1)*dx
END DO
CALL MPI_INIT(ierror)
comm = MPI_COMM_WORLD
!Get rank
CALL MPI_COMM_RANK(comm, rank, ierror)
!Get number of processors
CALL MPI_COMM_SIZE(comm, p, ierror)
!split job between all processors
npr = INT((nz-1)/p)
DO k = rank*npr+1, (rank+1)*npr
DO i = 1, nx
cx = 50.d0
Fi(i,k) = 0.d0
DO mx = 1,30
cz = 0.d0;
DO mz = 1,13*5
Fi(i,k) = Fi(i,k) + EXP(-0.9d0*((X(i)-cx)**2+(Z(k)-cz)**2))
cz = cz + 5.d0
END DO
cx = cx + 5.d0
END DO
cbz = 0.d0
cbx = 30.d0
DO mx = 1,4*5
Fw(i,k) = Fw(i,k) + 0.05d0 + 7.d0*EXP(-0.1*((X(i)-cbx)**2 &
+ (Z(k)-cbz)**2)) + 0.1d0*Fi(i,k)
cbz = cbz + 20.d0
END DO
END DO
END DO
s = rank*npr+1
f = (rank+1)*npr
np(1) = nx
np(2) = npr
CALL MPI_GATHER(Fw(:,s:f), np , MPI_DOUBLE_PRECISION, &
Fw,np , MPI_DOUBLE_PRECISION, 0, comm, ierror)
CALL MPI_GATHER(Fi(:,s:f), np , MPI_DOUBLE_PRECISION, &
Fi,np , MPI_DOUBLE_PRECISION, 0, comm, ierror)
Fn(:,:) = Fw(:,:) - Fi(:,:)
Fnp1 = Fn
WRITE(*,*) "I'm here"
IF(rank == 0) THEN
!Output initial condition
!----------------------------------------------------------------
OPEN(unit = 11, file = "Fiinitial.dat")
WRITE(11,*) 'Variables = "X", "Z", "Fi"'
WRITE(11,*) 'Zone I = ', nx, 'J = ', nz, 'F = POINT'
DO k = 1, nz
DO i = 1, nx
WRITE(11,*) X(i), Z(k), Fi(i,k)
END DO
END DO
WRITE(11,*) 'Zone I = ', nx, 'J = ', nz, 'F = POINT'
DO k = 1, nz
DO i = 1, nx
WRITE(11,*) X(i), Z(k), Fw(i,k)
END DO
END DO
CLOSE(11)
END IF
CALL MPI_FINALIZE(ierror)
END PROGRAM gridtestpar
You call the mpi_gather() subroutine incorrectly. You have to pass it the total number of elements to be communicated: one integer for the send buffer and another integer for the receive buffer. Instead of each integer you passed an array of two integers containing the number of elements along each dimension. Just multiply the numbers in your array and pass the result as a single integer:
program gridtestpar
  use mpi
  implicit none

  integer, parameter :: dp = kind(1.0d0)
  integer :: nx, nz
  integer :: ierror, comm, p, rank, npr, s, f
  real(dp), allocatable :: Fw(:,:), Fi(:,:)

  nx = (400*5)
  nz = (400*5)
  allocate(Fw(nx,nz))
  allocate(Fi(nx,nz))
  Fw(:,:) = 0.0_dp
  Fi(:,:) = 0.0_dp

  call mpi_init(ierror)
  comm = MPI_COMM_WORLD
  call mpi_comm_rank(comm, rank, ierror)
  call mpi_comm_size(comm, p, ierror)

  npr = nz / p          ! columns per process (assumes nz is divisible by p)
  s = rank * npr + 1
  f = (rank + 1) * npr

  call mpi_gather(Fw(:,s:f), nx * (f - s + 1), MPI_DOUBLE_PRECISION, &
      Fw, nx * npr, MPI_DOUBLE_PRECISION, 0, comm, ierror)
  call mpi_gather(Fi(:,s:f), nx * (f - s + 1), MPI_DOUBLE_PRECISION, &
      Fi, nx * npr, MPI_DOUBLE_PRECISION, 0, comm, ierror)

  write(*,*) "I'm here"

  call mpi_finalize(ierror)

end program gridtestpar
Maybe a few additional comments:
Please always post the shortest possible self-contained code that demonstrates the issue. Nobody likes to spend time reading and trying to understand irrelevant code snippets. Leave out everything that is not essential for reproducing your problem. Maybe this way you will even find the solution yourself.
Do not use kind = 8 when specifying precision. See the last part of this answer and some of the comments to it for alternatives.
You should use the mpi module instead of the include file (a brief sketch of these last two points follows below).
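A minimal sketch of those two style points (assuming a compiler that provides both the mpi module and ISO_FORTRAN_ENV; this example is not from the original answer):
program style_example
  ! Portable real kind via ISO_FORTRAN_ENV instead of kind = 8,
  ! and "use mpi" instead of include 'mpif.h'.
  use mpi
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer :: ierror, rank
  real(real64) :: x
  call mpi_init(ierror)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierror)
  x = 1.0_real64
  if (rank == 0) write(*,*) 'kind(x) =', kind(x)
  call mpi_finalize(ierror)
end program style_example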
