Writing a large matrix in a single file using MPI - view

I have a large N by N matrix containing real numbers, which has been decomposed into blocks using MPI. I am now trying to recompose this matrix and write it in a single file.
This topic (writing a matrix into a single txt file with mpi) covered a similar issue, but I got pretty confused by all the 'integer-to-string' conversion, etc (I am not an expert!). I am using Fortran for my code, but I guess that even a C explanation should help. I have been reading tutorials on MPI-IO, but there are still a few things I do not understand. Here is the code I have been working on:
use mpi
implicit none
! matrix dimensions
integer, parameter :: imax = 200
integer, parameter :: jmax = 100
! domain decomposition in each direction
integer, parameter :: iprocs = 3
integer, parameter :: jprocs = 3
! variables
integer :: i, j
integer, dimension(mpi_status_size) :: wstatus
integer :: ierr, proc_num, numprocs, fileno, localarray
integer :: loc_i, loc_j, ppp
integer :: istart, iend, jstart, jend
real, dimension(:,:), allocatable :: x
! initialize MPI
call mpi_init(ierr)
call mpi_comm_size(mpi_comm_world, numprocs, ierr)
call mpi_comm_rank(mpi_comm_world, proc_num, ierr)
! define the beginning and end of blocks
loc_j = proc_num/iprocs
loc_i = proc_num-loc_j*iprocs
ppp = (imax+iprocs-1)/iprocs
istart = loc_i*ppp + 1
iend = min((loc_i+1)*ppp, imax)
ppp = (jmax+jprocs-1)/jprocs
jstart = loc_j*ppp + 1
jend = min((loc_j+1)*ppp, jmax)
! write random data in each block
allocate(x(istart:iend,jstart:jend))
do j = jstart, jend
do i = istart, iend
x(i,j) = real(i + j)
enddo
enddo
! create subarrays
call mpi_type_create_subarray( 2, [imax,jmax], [iend-istart+1,jend-jstart+1], &
[istart,jstart], mpi_order_fortran, mpi_real, localarray, ierr )
call mpi_type_commit( localarray, ierr )
! write to file
call mpi_file_open( mpi_comm_world, 'test.dat', IOR(MPI_mode_create,MPI_mode_wronly), &
mpi_info_null, fileno, ierr )
call mpi_file_set_view( fileno, 0, mpi_real, localarray, "native", mpi_info_null, ierr )
call mpi_file_write_all( fileno, x, (jend-jstart+1)*(iend-istart+1), MPI_real, wstatus, ierr )
call mpi_file_close( fileno, ierr )
! deallocate data
deallocate(x)
! finalize MPI
call mpi_finalize(ierr)
I have been following this tutorial (PDF), but my compiler complains that there is no specific subroutine for the generic mpi_file_set_view. Did I do something wrong? Is the rest of the code ok?
Thank you very much for your help!!
Joachim

I would say that the easy way is to use a library designed to perform such operations efficiently: http://2decomp.org/mpiio.html
You can also look at their source code (the files io.f90 and io_write_one.f90).
In that source code you will see a call to MPI_FILE_SET_SIZE that may be relevant for your case.
EDIT: consider using "call MPI_File_Set_View(fhandle, 0_MPI_OFFSET_KIND,...". See the answer to MPI-IO: MPI_File_Set_View vs. MPI_File_Seek.
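For concreteness, here is a minimal sketch of the corrected write section, reusing the variable names from the question (an untested illustration, not a drop-in fix). The displacement passed to MPI_File_set_view must be an integer of kind MPI_OFFSET_KIND, hence 0_mpi_offset_kind instead of the plain literal 0; note also that the array_of_starts argument of mpi_type_create_subarray is zero-based, hence istart-1 and jstart-1 below.
! create the subarray type with zero-based starting offsets
call mpi_type_create_subarray( 2, [imax,jmax], [iend-istart+1,jend-jstart+1], &
     [istart-1,jstart-1], mpi_order_fortran, mpi_real, localarray, ierr )
call mpi_type_commit( localarray, ierr )
! open the file and set the view with a displacement of kind MPI_OFFSET_KIND
call mpi_file_open( mpi_comm_world, 'test.dat', ior(mpi_mode_create,mpi_mode_wronly), &
     mpi_info_null, fileno, ierr )
call mpi_file_set_view( fileno, 0_mpi_offset_kind, mpi_real, localarray, &
     "native", mpi_info_null, ierr )
call mpi_file_write_all( fileno, x, (iend-istart+1)*(jend-jstart+1), mpi_real, wstatus, ierr )
call mpi_file_close( fileno, ierr )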

Related

Implementing qsort in Fortran 95

I am trying to implement the quicksort (qsort) algorithm in Fortran.
The implemented quicksort is intended to operate on an array of a derived type which itself contains another derived type.
The derived types are defined in a separate module as:
MODULE DATA_MODEL
! -------------------
! CONSTANTS
! -------------------
integer,parameter :: max_records = 100000000
type :: timestamp
integer :: year
integer :: month
integer :: day
integer :: hour
integer :: minute
integer :: second
end type
type :: tape
type(timestamp) :: ts
integer :: value1
integer :: value2
end type
END MODULE
This is my attempt at implementing the quicksort algorithm.
! DESCRIPTION:
! THIS MODULE IMPLEMENTS THE QSORT ALGORITHM USING THE LOMUTO PARTITION SCHEME
! PSEUDOCODE:
! ALGORITHM QUICKSORT(A, LO, HI) IS
! IF LO < HI THEN
! P := PARTITION(A, LO, HI)
! QUICKSORT(A, LO, P - 1)
! QUICKSORT(A, P + 1, HI)
!
! ALGORITHM PARTITION(A, LO, HI) IS
! PIVOT := A[HI]
! I := LO
! FOR J := LO TO HI DO
! IF A[J] < PIVOT THEN
! SWAP A[I] WITH A[J]
! I := I + 1
! SWAP A[I] WITH A[HI]
! RETURN I
!
! SORTING THE ENTIRE ARRAY IS ACCOMPLISHED BY QUICKSORT(A, 0, LENGTH(A) - 1).
module qsort
use data_model
contains
subroutine quicksort(a, lo, hi)
implicit none
! SUBROUTINE PARAMETERS
type(tape),allocatable,intent(in out) :: a
integer,intent(in) :: lo, hi
! ALGORITHM INTERNAL VARIABLES
integer :: p
if (lo < hi) then
call partition(a, lo, hi, p)
call quicksort(a, lo, p - 1)
call quicksort(a, p + 1, hi)
end if
end subroutine
subroutine partition(a, lo, hi, p)
implicit none
! SUBROUTINE PARAMETERS
type(tape),allocatable,intent(inout) :: a
integer,intent(in) :: lo
integer,intent(in) :: hi
integer,intent(out) :: p
! ALGORITHM INTERNAL VARIABLES
type(tape) :: pivot
type(tape) :: swap
integer :: i,j
pivot = a(hi)
i = lo
do j = lo, hi
if (compare(a(j), pivot)) then
swap = a(i)
a(i) = a(j)
a(j) = swap
i = i + 1
endif
end do
swap = a(i)
a(i) = a(hi)
a(hi) = swap
p = i
end subroutine
function compare(a,b)
implicit none
! FUNCTION PARAMETERS
type(tape) :: a
type(tape) :: b
logical :: compare
if (a%ts%year < b%ts%year) then
compare = .true.
else if (a%ts%year > a%ts%year) then
compare = .false.
else if (a%ts%month < b%ts%month) then
compare = .true.
else if (a%ts%month > b%ts%month) then
compare = .false.
else if (a%ts%day < b%ts%day) then
compare = .true.
else if (a%ts%day > b%ts%day) then
compare = .false.
else if (a%ts%hour < b%ts%hour) then
compare = .true.
else if (a%ts%hour > b%ts%hour) then
compare = .false.
else if (a%ts%minute < b%ts%minute) then
compare = .true.
else if (a%ts%minute > b%ts%minute) then
compare = .false.
else if (a%ts%second < b%ts%second) then
compare = .true.
else if (a%ts%second > b%ts%second) then
compare = .false.
else
compare = .false.
end if
end function
end module
These are the errors I get when trying to compile it:
$ flang -c data_model.f95
$ flang -c qsort.f95
F90-S-0072-Assignment operation illegal to external procedure a (qsort.f95: 79)
F90-S-0076-Subscripts specified for non-array variable a (qsort.f95: 80)
F90-S-0076-Subscripts specified for non-array variable a (qsort.f95: 84)
F90-S-0076-Subscripts specified for non-array variable a (qsort.f95: 85)
F90-S-0076-Subscripts specified for non-array variable a (qsort.f95: 85)
F90-S-0076-Subscripts specified for non-array variable a (qsort.f95: 86)
0 inform, 0 warnings, 6 severes, 0 fatal for partition
$
Edit 1: I have modified the source code to use subroutines, which makes more sense since we want to modify the arguments.
Edit 2: changing the declaration of a to type(tape),intent(in out) :: a(:) in both the quicksort and partition subroutines makes the module compile without errors – see the comments.
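For reference, the changed declarations look like this (only the dummy-argument declarations change; the bodies stay the same):
! in subroutine quicksort:
type(tape), intent(in out) :: a(:)   ! was: type(tape),allocatable,intent(in out) :: a
! in subroutine partition:
type(tape), intent(in out) :: a(:)   ! was: type(tape),allocatable,intent(inout) :: a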
I see that the comments already unblocked you, but let me give you some suggestions to make your implementation more modular, easier to use and more modern.
Disclaimer: Some of my suggestions might need a more recent Fortran version than 95.
You can improve your timestamp type definition by providing overloads for the relational operators.
type :: timestamp
integer :: year, month, day, hour = 0, minute = 0, second = 0
contains
procedure, private :: eq, ne, gt, ge, lt, le
generic :: operator(==) => eq
generic :: operator(/=) => ne
generic :: operator(>) => gt
generic :: operator(>=) => ge
generic :: operator(<) => lt
generic :: operator(<=) => le
end type
(A subtle change there is that I have default values for hour, minute and second. So you can instantiate like this: timestamp(2021,5,22))
To get this working, you just need to provide implementations for the functions eq, ne, gt, ge, lt and le in the module where you define your type. Note that, when writing a generic type-bound procedure, you must declare the bound argument as class(timestamp) instead of type(timestamp).
elemental function lt(left, right) result(result)
class(timestamp), intent(in) :: left, right
logical :: result
result = compare(left, right) < 0
end function
elemental function compare(this, other) result(result)
class(timestamp), intent(in) :: this, other
integer :: result
if (this%year /= other%year) then
result = sign(1, this%year - other%year)
else if (this%month /= other%month) then
result = sign(1, this%month - other%month)
else if (this%day /= other%day) then
result = sign(1, this%day - other%day)
else if (this%hour /= other%hour) then
result = sign(1, this%hour - other%hour)
else if (this%minute /= other%minute) then
result = sign(1, this%minute - other%minute)
else if (this%second /= other%second) then
result = sign(1, this%second - other%second)
else
result = 0
end if
end function
Another good practice you can implement is to control access of your module elements by using public and private.
module data_model
implicit none
public :: timestamp, tape
private
type :: timestamp
! (...)
end type
type :: tape
type(timestamp) :: ts
integer :: value1, value2
end type
contains
! (...) implementations of eq, ne, gt, ge, lt, le
end
Then, when you use this module from another program unit, only the public names will be available. You can also use only specific name with the use only clause:
module qsort
use data_model, only: tape
implicit none
public :: quicksort
private
contains
! (...) your quicksort implementation
end
Finally, let me suggest some tweaks on your quicksort implementation.
First, you don't need to pass the bounds lo and hi around together with your array. One of the most distinctive features of Fortran is how easy it is to operate on array segments. You can call the quicksort procedure on a contiguous portion of your array, and the procedure can work on it in a bounds-agnostic way if you use an assumed-shape array, like this: type(tape) :: a(:). Inside the procedure, the array segment is rebounded to start at index 1, no matter what the bounds are at the call site.
Besides that, as I mentioned in the comments, you don't need to declare the array argument as allocatable in this case. Even if the actual array you pass is allocatable, you can pass it to a procedure whose dummy argument is not declared allocatable; it will be handled as a normal array. Declaring the argument allocatable only makes sense if you plan to allocate or deallocate it inside the procedure.
pure recursive subroutine quicksort(a)
type(tape), intent(inout) :: a(:)
integer :: p
if (size(a) == 0) return
call partition(a, p)
call quicksort(a(:p-1))
call quicksort(a(p+1:))
end
I declared this procedure as pure here, but whether you can do so depends on your specific use case. Making it pure helps me remember to declare intents correctly and to write well-thought-out procedures (and there is a performance gain in some cases), but it brings many restrictions (like not being able to print inside the procedure). You can search for pure procedures to learn more.
Both quicksort and partition are implemented as subroutines here. I like to do it this way whenever the procedure performs important side effects, like updating the passed argument. If I need a returned value, I can have an intent(out) argument, like the argument out in partition, which returns the pivot position.
pure subroutine partition(a, out)
type(tape), intent(inout) :: a(:)
integer, intent(out) :: out
integer :: i, j
i = 1
do j = 1, size(a)
if (a(j)%ts < a(size(a))%ts) then
call swap(a(i), a(j))
i = i + 1
end if
end do
call swap(a(i), a(size(a)))
out = i
end
elemental subroutine swap(a, b)
type(tape), intent(inout) :: a, b
type(tape) :: temp
temp = a
a = b
b = temp
end
You may note, at a(j)%ts < a(size(a))%ts, that I am making use of the overloaded operator < to compare two timestamps. This way, the comparison logic lives in the same module as the type definition.
Finally, you can use the modules and make some tests on your quicksort implementation!
program main
use data_model, only: tape, timestamp
use qsort, only: quicksort
implicit none
type(tape) :: a(8) = [ &
tape(timestamp(2020, 01, 08), 0, 0), &
tape(timestamp(2021, 01, 30), 0, 0), &
tape(timestamp(2020, 01, 06), 0, 0), &
tape(timestamp(2019, 12, 14), 0, 0), &
tape(timestamp(2020, 01, 08), 0, 0), &
tape(timestamp(2020, 05, 05), 0, 0), &
tape(timestamp(2021, 04, 30), 0, 0), &
tape(timestamp(2020, 10, 22), 0, 0) &
]
call quicksort(a(3:7)) ! will sort in place, only from index 3 to 7
call quicksort(a) ! will sort whole array
end
Works like a charm!
This is not an answer directly related to the quicksort algorithm, but rather about how to implement type-bound operators.
You can move the compare function inside the data_model module.
This decouples the modules further, so that the quicksort module contains only the quicksort algorithm.
The compare function can be implemented by a type-bound operator operator(<).
The following is a quick implementation (only for year/month/day); it should help you adapt your own code accordingly.
module timestamp_m
implicit none
private
public timestamp
type timestamp
integer :: y, m, d
contains
generic :: operator(<) => timestamp_lt
procedure, private :: timestamp_lt
end type
contains
logical function timestamp_lt(this, rhs) result(tf)
!! result of: this < rhs
class(timestamp), intent(in) :: this
type(timestamp), intent(in) :: rhs
! compare year
if (this%y < rhs%y) then
tf = .true.
else if (this%y > rhs%y) then
tf = .false.
else
! compare month
if (this%m < rhs%m) then
tf = .true.
else if (this%m > rhs%m) then
tf = .false.
else
! compare day
if (this%d < rhs%d) then
tf = .true.
else
tf = .false.
end if
end if
end if
end function
end module
You will need to adjust one line in your quicksort module:
module qsort
..
subroutine quicksort(a, lo, hi)
..
! if (compare(a(j), pivot)) then ! OLD. replace by:
if (a(j)%ts < pivot%ts) then
..

Why are my MPI-parallelised DO-loops faster the second time round?

I'm working on a scientific model that's parallelised using Open MPI, and I'm finding some weird results with computation time for parallelised loops. Basically, I find that a parallelised loop over a large array is slow right after allocating the (shared) memory, but a second operation is much faster. Why is this?
To demonstrate, here's a dummy program I wrote that illustrates this:
PROGRAM test_program
USE mpi
USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_PTR, C_F_POINTER
IMPLICIT NONE
INTEGER, PARAMETER :: dp = KIND(1.0D0) ! Double precision
INTEGER :: process_rank, number_of_processes
INTEGER :: ierr
INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
INTEGER :: disp_unit
TYPE(C_PTR) :: baseptr
REAL(dp), DIMENSION(:,:), POINTER :: A
INTEGER :: win
INTEGER :: nx, ny, i, j, i1, i2, j1, j2
REAL(dp) :: tstart, tstop, dt1, dt2
! Initialise MPI split processes
CALL MPI_INIT(ierr)
CALL MPI_COMM_RANK( MPI_COMM_WORLD, process_rank, ierr)
CALL MPI_COMM_SIZE( MPI_COMM_WORLD, number_of_processes, ierr)
! Dimensions of data array
nx = 20000
ny = 20000
! Allocate MPI-shared memory for data array, with an associated window object
! (done by all processes, but only the master actually allocates any space)
IF (process_rank == 0) THEN
windowsize = ny*nx*8_MPI_ADDRESS_KIND
disp_unit = ny*8
ELSE
windowsize = 0_MPI_ADDRESS_KIND
disp_unit = 1
END IF
CALL MPI_WIN_ALLOCATE_SHARED( windowsize, disp_unit, MPI_INFO_NULL, MPI_COMM_WORLD, baseptr, win, ierr)
IF (.NOT. process_rank == 0) THEN
! Get the baseptr of the master's memory space.
CALL MPI_WIN_SHARED_QUERY( win, 0, windowsize, disp_unit, baseptr, ierr)
END IF
! Associate the pointer with this memory space (so all processes act on the same physical memory)
CALL C_F_POINTER(baseptr, A, [ny, nx])
! Divide the memory over the processors; each gets a range i1-i2 of row
i1 = MAX(1, FLOOR(REAL(nx * process_rank / number_of_processes)) + 1)
i2 = MIN(nx, FLOOR(REAL(nx * (process_rank + 1) / number_of_processes)))
! Do some operations on the data, measure how long this takes on each core
tstart = MPI_WTIME()
DO i = i1, i2
DO j = 1, ny
A( j,i) = SQRT( (REAL(i,dp) - REAL(nx,dp)/2._dp)**2 + (REAL(j,dp) - REAL(ny,dp)/2._dp)**2 )
END DO
END DO
tstop = MPI_WTIME()
dt1 = tstop - tstart
WRITE(0,*) ' Process ', process_rank, ': dt1 = ', dt1
! Do some more calculations on the data, measure how long this takes on each core
tstart = MPI_WTIME()
DO i = i1, i2
DO j = 1, ny
A( j,i) = SQRT( (REAL(i,dp) - REAL(nx,dp)/4._dp)**2 + (REAL(j,dp) - REAL(ny,dp)/3._dp)**2 )
END DO
END DO
tstop = MPI_WTIME()
dt2 = tstop - tstart
WRITE(0,*) ' Process ', process_rank, ': dt2 = ', dt2
CALL MPI_FINALIZE( ierr)
END PROGRAM test_program
This gives me the following output:
Process 0 : dt1 = 2.1637410982511938
Process 1 : dt1 = 2.6094086961820722
Process 0 : dt2 = 1.0437976177781820
Process 1 : dt2 = 0.96576740033924580
Replacing the first loop by a vectorised operation (e.g. A(:,i1:i2) = 0._dp) gives the same "fast" results for the second loop.
When you perform a memory allocation (such as with MPI_WIN_ALLOCATE_SHARED), the memory is only allocated virtually by the operating system: until you actually use it, it takes up space neither in RAM nor on disk.
The first loop writes to this virtually allocated memory for the first time, which triggers what are called page faults. When handling a page fault, the operating system makes the required page accessible at that location in physical memory (or terminates the program in the case of an illegal memory access). This process is very slow, hence the slow first execution.
During the second iteration, no page faults should occur because the memory should be physically mapped, hence the faster execution.
Note that this effect is completely independent of the use of MPI (although MPI_WIN_ALLOCATE_SHARED should allocate memory in a different way than usual since memory must be shared between processes).
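As a quick illustration (a sketch reusing the names A, i1, i2 and dp from the program above, not verified here), touching the rank-local part of the array once before starting the timer makes the first timed loop run at the "fast" speed as well, consistent with the observation about the vectorised initialisation:
! pre-fault the pages this rank will write to, so the timed loops below
! no longer pay the page-fault cost
A(:, i1:i2) = 0.0_dp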

OpenMP FFTW with Fortran not thread safe

I am trying to use FFTW with OpenMP and Fortran, but I get wrong results when executing in parallel; the results also change from run to run, which is typical behaviour when parallelisation goes wrong.
I am aiming for a simple 3d real-to-complex transformation. Following the FFTW tutorial, I took all but the call to dfftw_execute_dft_r2c() out of the parallel region, but it doesn't seem to work.
I use FFTW 3.3.8, configured with ./configure --enable-threads --enable-openmp --enable-mpi and compile my code with gfortran program.f03 -o program.o -I/usr/include -fopenmp -lfftw3_omp -lfftw3 -g -Wall.
This is what my program looks like:
program use_fftw
use,intrinsic :: iso_c_binding
use omp_lib
implicit none
include 'fftw3.f03'
integer, parameter :: dp=kind(1.0d0)
integer, parameter :: Nx = 10
integer, parameter :: Ny = 5
integer, parameter :: Nz = 5
real(dp), parameter :: pi = 3.1415926d0
real(dp), parameter :: physical_length_x = 20.d0
real(dp), parameter :: physical_length_y = 10.d0
real(dp), parameter :: physical_length_z = 10.d0
real(dp), parameter :: lambda1 = 0.5d0
real(dp), parameter :: lambda2 = 0.7d0
real(dp), parameter :: lambda3 = 0.9d0
real(dp), parameter :: dx = physical_length_x/real(Nx,dp)
real(dp), parameter :: dy = physical_length_y/real(Ny,dp)
real(dp), parameter :: dz = physical_length_z/real(Nz,dp)
integer :: void, nthreads
integer :: i, j, k
real(dp):: d
complex(dp), allocatable, dimension(:,:,:) :: arr_out
real(dp), allocatable, dimension(:,:,:) :: arr_in
integer*8 :: plan_forward
allocate(arr_in( 1:Nx, 1:Ny, 1:Nz)); arr_in = 0
allocate(arr_out(1:Nx/2+1, 1:Ny, 1:Nz)); arr_out = 0
!------------------------------
! Initialize fftw stuff
!------------------------------
! Call before any FFTW routine is called outside of parallel region
void = fftw_init_threads()
if (void==0) then
write(*,*) "Error in fftw_init_threads, quitting"
stop
endif
nthreads = omp_get_num_threads()
call fftw_plan_with_nthreads(nthreads)
! plan execution is thread-safe, but plan creation and destruction are not:
! you should create/destroy plans only from a single thread
call dfftw_plan_dft_r2c_3d(plan_forward, Nx, Ny, Nz, arr_in, arr_out, FFTW_ESTIMATE)
!--------------------------------
! Start parallel region
!--------------------------------
!$OMP PARALLEL PRIVATE( j, k, d)
! Fill array with wave
! NOTE: wave only depends on x so you can plot it later.
!$OMP DO
do i = 1, Nx
d = 2.0*pi*i*dx
do j = 1, Ny
do k = 1, Nz
arr_in(i,j,k) = cos(d/lambda1)+sin(d/lambda2)+cos(d/lambda3)
enddo
enddo
enddo
!$OMP END DO
call dfftw_execute_dft_r2c(plan_forward, arr_in, arr_out)
!$OMP END PARALLEL
!-----------------
! print results
!-----------------
do i=1, Nx/2+1
do j=1, Ny
do k=1, Nz
write(*,'(F12.6,A3,F12.6,A3)',advance='no') real(arr_out(i,j,k)), " , ", aimag(arr_out(i,j,k)), " ||"
enddo
write(*,*)
enddo
write(*,*)
enddo
deallocate(arr_in, arr_out)
! destroy plans is not thread-safe; do only with single
call dfftw_destroy_plan(plan_forward)
end program use_fftw
I also tried moving the FFTW initialisation (void = fftw_init_threads(); call fftw_plan_with_nthreads(nthreads); call dfftw_plan_dft_r2c_3d(...)) into the parallel region, using a !$OMP SINGLE block and synchronising with a barrier afterwards, but the situation didn't improve.
Can anyone help me?
EDIT: I was able to test my program on another system, and the problem remains. So the issue apparently isn't in my OpenMP or FFTW installation, but somewhere in the program itself.
You should normally call the FFTW execute routines outside of the parallel region. They have their own parallel regions inside and will take care of running the transform in parallel with as many threads as you requested during planning; they will re-use your existing OpenMP threads.
You can also call them inside a parallel region, but on different arrays, not on the same array! In that case the plan should be created for 1 thread, and each thread would then perform, for example, a 2D transform of a slice of the array.
Thread safety means you can call the routines concurrently, but each call must operate on different data.
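As a sketch of that structure, reusing the names from the program above (omp_get_max_threads() is my assumption for the thread count to plan with, since omp_get_num_threads() returns 1 outside a parallel region; this is an illustration, not tested code):
! initialise FFTW threading and create the plan once, outside any parallel region
void = fftw_init_threads()
call fftw_plan_with_nthreads(omp_get_max_threads())
call dfftw_plan_dft_r2c_3d(plan_forward, Nx, Ny, Nz, arr_in, arr_out, FFTW_ESTIMATE)
! fill the input array with an ordinary OpenMP loop
!$OMP PARALLEL DO PRIVATE(j, k, d)
do i = 1, Nx
d = 2.0*pi*i*dx
do j = 1, Ny
do k = 1, Nz
arr_in(i,j,k) = cos(d/lambda1) + sin(d/lambda2) + cos(d/lambda3)
enddo
enddo
enddo
!$OMP END PARALLEL DO
! execute outside the parallel region: FFTW runs the transform with its own threads
call dfftw_execute_dft_r2c(plan_forward, arr_in, arr_out)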

the use of MPI_Init()

I encountered a question about the use of MPI_Init().
I want to initialize the random number array "randv" only on the root processor, with the code in the context below. To see if my goal is fulfilled, I have the program print out the array "randv" by placing a do loop immediately after the line "call RANDOM_NUMBER(randv)".
However, the output shows the random number array repeated 8 times (given that the number of processors is 8). My question is: why are the processors other than the root one already running before the call to MPI_Init()? If all the processors are awake and have the same random number array before invoking MPI_Init, why bother calling MPI_Init() for initialization at all? Thanks.
Lee
Here is the example I use:
program main
include 'mpif.h'
integer :: i
integer :: ierr
integer :: irank
integer :: nrow, ncol
real, dimension(:,:), allocatable :: randv
nrow = 4
ncol = 2
allocate(randv(nrow,ncol))
call RANDOM_SEED
call RANDOM_NUMBER(randv)
do i = 1, nrow
write(*,'(2(f5.2,x))') randv(i,:)
enddo
call MPI_Init ( ierr )
allocate(row_list(ncol), col_list(nrow))
call MPI_Comm_rank ( MPI_COMM_WORLD, irank, ierr )
if( irank == 0 )then
do i = 1, nrow
write(*,'(2(f5.2,x))') randv(i,:)
enddo
endif
call MPI_Finalize ( ierr )
deallocate( randv )
end program
I think you misunderstand how MPI works. The program you wrote is executed by every process. MPI_Init initializes the MPI environment so that those processes can interact. After initialization, every process is uniquely identified by its rank. You have to make sure that, based on these ranks, each process works on a different portion of your data, or performs different tasks.
Typically, you should run MPI_Init before anything else in your program.
Using MPI_Comm_rank you can obtain the ID of the current process (its rank). The first process always has the rank 0. Therefore, if you want to run parts of the code on the "master" process only, you can test for irank == 0:
program main
include 'mpif.h'
integer :: i
integer :: ierr
integer :: irank
integer :: nrow, ncol
real, dimension(:,:), allocatable :: randv
! Initialize MPI
call MPI_Init ( ierr )
! Get process ID
call MPI_Comm_rank ( MPI_COMM_WORLD, irank, ierr )
! Executed on all processes
nrow = 4
ncol = 2
allocate(randv(nrow,ncol))
! Only executed on the master process
if ( irank == 0 ) then
call RANDOM_SEED
call RANDOM_NUMBER(randv)
do i = 1, nrow
write(*,'(2(f5.2,x))') randv(i,:)
enddo
endif
! Executed on all processes
allocate(row_list(ncol), col_list(nrow))
! Only executed on the master process
if ( irank == 0 ) then
do i = 1, nrow
write(*,'(2(f5.2,x))') randv(i,:)
enddo
endif
deallocate( randv )
! Finalize MPI, should always be executed last
call MPI_Finalize ( ierr )
end program

Jacobi Iteration in Fortran and MPI

I'm trying to solve Ax=b with a Jacobi iteration. My serial code works fine, but the MPI version won't even run. Can anyone help me?
Serial
program jacobis
implicit none
integer, parameter :: n=10
integer :: i,j,k,ni,s,seed
double precision :: tol,t1,t2,sig
double precision, dimension(0:n-1,0:n-1) :: A
double precision, dimension(0:n-1) :: B, x, xb, buff
ni=1000
seed=time()
call srand(seed)
do i=0, n-1
do j=0, n-1
A(i,j)=rand(0)
B(i)=rand(0)
end do
end do
do i = 0, n-1
A(i,i) = sum(A(i,:)) + 1
enddo
!do i=0,n-1
!A(i,i)=4
!end do
print *, "a", A
print *, "b", B
x=B
call cpu_time(t1)
do k=1,ni
xb=x
do i=0,n-1
s=0
do j=0,n-1
if (j/=i) then
s=s+A(i,j)*xb(j)
endif
end do
x(i)=(B(i)-s)/A(i,i)
sig=(x(i)-xb(i))*(x(i)-xb(i))
tol=tol+sig
tol=sqrt(tol)
end do
print *, "x", x
!print *, "tol=", tol
print *, "iter =",k
if (tol<1.000001) EXIT
if (k==(ni-1)) then
print *, "Numero Maximo de Iteracoes"
EXIT
endif
end do
call cpu_time(t2)
print *, "t=",t2-t1
end
MPI version
program jacobis
use mpi
implicit none
integer, parameter :: n=2
integer :: i_local,i_global,j,k,ni,s,m
double precision :: tol,t,t2,sig
double precision, dimension(:,:), ALLOCATABLE :: A_local
double precision, dimension(:), ALLOCATABLE :: B_local, x_local, x_temp1,x_temp2,x_old,x_new, buff
INTEGER, DIMENSION (MPI_STATUS_SIZE) :: STATUS
integer :: rank,procs,tag,ierror
CALL MPI_INIT(ierror)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierror)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,procs,ierror)
ni=100
m=n/procs
ALLOCATE (A_local(0:n-1,0:n-1))
ALLOCATE (B_local(0:m-1))
ALLOCATE (x_temp1(0:m-1))
ALLOCATE (x_temp2(0:m-1))
A_local=0
B_local=2
do i_global=0,n-1
A_local(i_global,i_global)=2
end do
CALL MPI_ALLGATHER(B_local, m, MPI_DOUBLE, x_temp1, m, MPI_DOUBLE, MPI_COMM_WORLD,ierror)
x_new=x_temp1
x_old=x_temp2
print *, "a", A_local
print *, "b", B_local
t=mpi_wtime()
do k=1,ni
x_old=x_new
do i_local=0,m-1
i_global=i_local+rank*m
!x_local(i_local)=b_local(i_local)
s=0
do j=0,n-1
if (j/=i_local) then
s=s+A_local(i_local,j)*x_old(j)
endif
end do
x_local(i_local)=(B_local(i_local)-s)/A_local(i_local,i_global)
end do
CALL MPI_ALLGATHER(x_local,m, MPI_DOUBLE, x_new, m, MPI_DOUBLE, MPI_COMM_WORLD,ierror)
do i_global=0,n-1
sig=(x_new(i_global)-x_old(i_global))*(x_new(i_global)-x_old(i_global))
tol=tol+sig
tol=sqrt(tol)
end do
print *, "x", x_local
print *, "tol=", tol
print *, "iter =",k
if (tol<1.000001) EXIT
if (k==(ni-1)) then
print *, "Numero Maximo de Iteracoes"
EXIT
endif
end do
t2=mpi_wtime()-t;
print *, "t=",t2
CALL MPI_FINALIZE(ierror)
end
Can anyone point out what I'm doing wrong? Is it an index problem? Please, I really need to solve this today or I'll flunk the course. I've spent countless hours on this and can't make it work.
OK, you were right! Now I have a segmentation fault, but I can't find it! I have replaced the code with the new version.
Your program has several issues that I can see. The error message that you included indicates a non-allocated receive buffer in this call:
CALL MPI_ALLGATHER(B_local, m, MPI_DOUBLE, x_temp1, m, MPI_DOUBLE, MPI_COMM_WORLD)
Array x_temp1, the receive buffer, needs to be allocated before using in this context.
Fixing this will only get you so far, though: you will then hit a less informative segmentation fault. It is useful to look up the correct usage of MPI_ALLGATHER for your MPI implementation. In the Fortran bindings, most MPI routines take an integer error status argument at the end:
MPI_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT,
              RECVTYPE, COMM, IERROR)
    <type>  SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR
This should get you going with your assignment. Make sure to allocate all the allocatable arrays that you use, and consult the documentation for your MPI implementation and your compiler manual.
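For concreteness, a sketch of the corrected call (receive buffer sized to hold m elements from every rank, i.e. n in total, and the error argument passed at the end; MPI_DOUBLE is kept as in the question, although MPI_DOUBLE_PRECISION is the standard Fortran datatype name):
ALLOCATE (x_temp1(0:n-1))   ! receive buffer must hold the gathered data from all ranks
CALL MPI_ALLGATHER(B_local, m, MPI_DOUBLE, x_temp1, m, MPI_DOUBLE, MPI_COMM_WORLD, ierror)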
I've solved the problem: it now computes the iteration correctly, as verified against the serial program using the same matrix. It was an allocation and indexing problem. Thanks for the previous answer, it was very helpful.
program jacobis
use mpi
implicit none
integer, parameter :: n=1000
integer :: i_local,i_global,j,k,ni,s,m,seed
double precision :: tol,t,t2,sig
double precision, dimension(:,:), ALLOCATABLE :: A_local
double precision, dimension(:), ALLOCATABLE :: B_local, x_local, x_temp1,x_old,x_new, buff
INTEGER, DIMENSION (MPI_STATUS_SIZE) :: STATUS
integer :: rank,procs,tag,ierror
CALL MPI_INIT(ierror)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierror)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,procs,ierror)
ni=1000
m=n/procs
ALLOCATE (A_local(0:n-1,0:n-1))
ALLOCATE (B_local(0:n-1))
ALLOCATE (x_local(0:n-1))
ALLOCATE (x_temp1(0:n-1))
ALLOCATE (x_new(0:n-1))
!A_local=23
!B_local=47
seed=time()
call srand(seed)
do k=0, n-1
do j=0, n-1
A_local(k,j)=rand(0)
B_local(k)=rand(0)
end do
end do
do i_global = 0, m-1
A_local(i_global,i_global) = sum(A_local(i_global,:)) + n
enddo
CALL MPI_ALLGATHER(B_local, m, MPI_DOUBLE, x_temp1, m, MPI_DOUBLE, MPI_COMM_WORLD,ierror)
x_new=x_temp1
print *, "a", A_local
print *, "b", B_local
t=mpi_wtime()
do k=1,ni
x_old=x_new
do i_local=0,m-1
i_global=i_local+rank*m
!x_local(i_local)=b_local(i_local)
s=0
do j=0,n-1
if (j/=i_local) then
s=s+A_local(i_local,j)*x_old(j)
endif
end do
x_local(i_local)=(B_local(i_local)-s)/A_local(i_local,i_global)
end do
CALL MPI_ALLGATHER(x_local,m, MPI_DOUBLE, x_new, m, MPI_DOUBLE, MPI_COMM_WORLD,ierror)
do j=0,n-1
sig=(x_new(j)-x_old(j))*(x_new(j)-x_old(j))
tol=tol+sig
tol=sqrt(tol)
end do
print *, "x", x_local
print *, "tol=", tol
print *, "iter =",k
if (tol<1.01) EXIT
if (k==(ni-1)) then
print *, "Numero Maximo de Iteracoes"
EXIT
endif
end do
t2=mpi_wtime()-t;
print *, "t=",t2
CALL MPI_FINALIZE(ierror)
end
Your program has a serious issue and you may be getting wrong results. The variable s is declared as an integer, while it is assigned non-integer values. Redeclare it as double precision to get correct results. (Posted for anyone who copies this code.)
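In the programs above, that amounts to moving s from the integer declaration into the double precision one; for example, in the serial version:
integer :: i,j,k,ni,seed                ! s removed from the integer list
double precision :: tol,t1,t2,sig,s     ! s now holds the floating-point partial sum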
