Jacobi Iteration in Fortran and MPI

I'm trying to solve Ax = b with a Jacobi iteration. My serial code works fine, but the MPI version won't even run. Can anyone help me?
Serial
program jacobis
  implicit none
  integer, parameter :: n=10
  integer :: i,j,k,ni,s,seed
  double precision :: tol,t1,t2,sig
  double precision, dimension(0:n-1,0:n-1) :: A
  double precision, dimension(0:n-1) :: B, x, xb, buff
  ni=1000
  seed=time()
  call srand(seed)
  do i=0, n-1
    do j=0, n-1
      A(i,j)=rand(0)
      B(i)=rand(0)
    end do
  end do
  do i = 0, n-1
    A(i,i) = sum(A(i,:)) + 1
  enddo
  !do i=0,n-1
  !  A(i,i)=4
  !end do
  print *, "a", A
  print *, "b", B
  x=B
  call cpu_time(t1)
  do k=1,ni
    xb=x
    do i=0,n-1
      s=0
      do j=0,n-1
        if (j/=i) then
          s=s+A(i,j)*xb(j)
        endif
      end do
      x(i)=(B(i)-s)/A(i,i)
      sig=(x(i)-xb(i))*(x(i)-xb(i))
      tol=tol+sig
      tol=sqrt(tol)
    end do
    print *, "x", x
    !print *, "tol=", tol
    print *, "iter =",k
    if (tol<1.000001) EXIT
    if (k==(ni-1)) then
      print *, "Numero Maximo de Iteracoes"
      EXIT
    endif
  end do
  call cpu_time(t2)
  print *, "t=",t2-t1
end
MPI version
program jacobis
  use mpi
  implicit none
  integer, parameter :: n=2
  integer :: i_local,i_global,j,k,ni,s,m
  double precision :: tol,t,t2,sig
  double precision, dimension(:,:), ALLOCATABLE :: A_local
  double precision, dimension(:), ALLOCATABLE :: B_local, x_local, x_temp1,x_temp2,x_old,x_new, buff
  INTEGER, DIMENSION (MPI_STATUS_SIZE) :: STATUS
  integer :: rank,procs,tag,ierror
  CALL MPI_INIT(ierror)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierror)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD,procs,ierror)
  ni=100
  m=n/procs
  ALLOCATE (A_local(0:n-1,0:n-1))
  ALLOCATE (B_local(0:m-1))
  ALLOCATE (x_temp1(0:m-1))
  ALLOCATE (x_temp2(0:m-1))
  A_local=0
  B_local=2
  do i_global=0,n-1
    A_local(i_global,i_global)=2
  end do
  CALL MPI_ALLGATHER(B_local, m, MPI_DOUBLE, x_temp1, m, MPI_DOUBLE, MPI_COMM_WORLD,ierror)
  x_new=x_temp1
  x_old=x_temp2
  print *, "a", A_local
  print *, "b", B_local
  t=mpi_wtime()
  do k=1,ni
    x_old=x_new
    do i_local=0,m-1
      i_global=i_local+rank*m
      !x_local(i_local)=b_local(i_local)
      s=0
      do j=0,n-1
        if (j/=i_local) then
          s=s+A_local(i_local,j)*x_old(j)
        endif
      end do
      x_local(i_local)=(B_local(i_local)-s)/A_local(i_local,i_global)
    end do
    CALL MPI_ALLGATHER(x_local,m, MPI_DOUBLE, x_new, m, MPI_DOUBLE, MPI_COMM_WORLD,ierror)
    do i_global=0,n-1
      sig=(x_new(i_global)-x_old(i_global))*(x_new(i_global)-x_old(i_global))
      tol=tol+sig
      tol=sqrt(tol)
    end do
    print *, "x", x_local
    print *, "tol=", tol
    print *, "iter =",k
    if (tol<1.000001) EXIT
    if (k==(ni-1)) then
      print *, "Numero Maximo de Iteracoes"
      EXIT
    endif
  end do
  t2=mpi_wtime()-t;
  print *, "t=",t2
  CALL MPI_FINALIZE(ierror)
end
Can anyone point out what I'm doing wrong? Is it an index problem? Please, I really need to solve this today or I'll flunk the course. I've spent countless hours on this and can't make it work.
OK, you were right! Now I have a segmentation fault but can't find it. I have replaced the code above with the new version.

Your program has several issues that I can see. The error message that you included indicates a non-allocated receive buffer in this call:
CALL MPI_ALLGATHER(B_local, m, MPI_DOUBLE, x_temp1, m, MPI_DOUBLE, MPI_COMM_WORLD)
Array x_temp1, the receive buffer, needs to be allocated before it is used in this context.
Fixing this will only get you so far, though: the next failure is a less informative segmentation fault. It is worth looking up the correct usage of MPI_Allgather for your MPI implementation. In Fortran, MPI routines take an integer error-status argument at the end:
MPI_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT,
              RECVTYPE, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR
This should get you going with your assignment. Make sure to allocate every allocatable array before you use it, and consult the documentation for your MPI implementation and your compiler manual.
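For reference, a minimal sketch of the corrected call with the variable names from the question; MPI_DOUBLE_PRECISION is the standard Fortran datatype name (MPI_DOUBLE is the C name):
! the receive buffer must hold m elements from every rank (n in total),
! and the Fortran binding needs the trailing ierror argument
ALLOCATE (x_temp1(0:n-1))
CALL MPI_ALLGATHER(B_local, m, MPI_DOUBLE_PRECISION, &
                   x_temp1, m, MPI_DOUBLE_PRECISION, &
                   MPI_COMM_WORLD, ierror)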

I've solved the problem; it now computes the iteration correctly, verified against the serial program using the same matrix. It was an allocation and indexing problem. Thanks for the previous answer, it was very helpful.
program jacobis
  use mpi
  implicit none
  integer, parameter :: n=1000
  integer :: i_local,i_global,j,k,ni,s,m,seed
  double precision :: tol,t,t2,sig
  double precision, dimension(:,:), ALLOCATABLE :: A_local
  double precision, dimension(:), ALLOCATABLE :: B_local, x_local, x_temp1,x_old,x_new, buff
  INTEGER, DIMENSION (MPI_STATUS_SIZE) :: STATUS
  integer :: rank,procs,tag,ierror
  CALL MPI_INIT(ierror)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierror)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD,procs,ierror)
  ni=1000
  m=n/procs
  ALLOCATE (A_local(0:n-1,0:n-1))
  ALLOCATE (B_local(0:n-1))
  ALLOCATE (x_local(0:n-1))
  ALLOCATE (x_temp1(0:n-1))
  ALLOCATE (x_new(0:n-1))
  !A_local=23
  !B_local=47
  seed=time()
  call srand(seed)
  do k=0, n-1
    do j=0, n-1
      A_local(k,j)=rand(0)
      B_local(k)=rand(0)
    end do
  end do
  do i_global = 0, m-1
    A_local(i_global,i_global) = sum(A_local(i_global,:)) + n
  enddo
  CALL MPI_ALLGATHER(B_local, m, MPI_DOUBLE, x_temp1, m, MPI_DOUBLE, MPI_COMM_WORLD,ierror)
  x_new=x_temp1
  print *, "a", A_local
  print *, "b", B_local
  t=mpi_wtime()
  do k=1,ni
    x_old=x_new
    do i_local=0,m-1
      i_global=i_local+rank*m
      !x_local(i_local)=b_local(i_local)
      s=0
      do j=0,n-1
        if (j/=i_local) then
          s=s+A_local(i_local,j)*x_old(j)
        endif
      end do
      x_local(i_local)=(B_local(i_local)-s)/A_local(i_local,i_global)
    end do
    CALL MPI_ALLGATHER(x_local,m, MPI_DOUBLE, x_new, m, MPI_DOUBLE, MPI_COMM_WORLD,ierror)
    do j=0,n-1
      sig=(x_new(j)-x_old(j))*(x_new(j)-x_old(j))
      tol=tol+sig
      tol=sqrt(tol)
    end do
    print *, "x", x_local
    print *, "tol=", tol
    print *, "iter =",k
    if (tol<1.01) EXIT
    if (k==(ni-1)) then
      print *, "Numero Maximo de Iteracoes"
      EXIT
    endif
  end do
  t2=mpi_wtime()-t;
  print *, "t=",t2
  CALL MPI_FINALIZE(ierror)
end

Your program still has a serious issue and you may be getting wrong results: the variable s is declared as an integer while it is assigned non-integer values. Redeclare it as double precision to get correct results. (Posted for anyone who ever copies this code.)
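A minimal sketch of that fix, keeping the other declarations as posted:
double precision :: s    ! accumulator for the off-diagonal sum, previously declared integer

s = 0.0d0                ! reset with a real literal at the start of each row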

Related

OpenMP application in Fortran: find the best thread

For example, I have this k-means code:
program read_from_file
use functions_module
!$ use omp_lib
character(len=2) :: c2
integer :: i, j, k, l, c, d
real, dimension(:,:), allocatable :: r, centroid, new_centro, converge
real, dimension(:), allocatable :: cost
integer,dimension(:),allocatable :: indices,distancereg,cluster
integer :: ios_read = 0
integer :: n = 0
integer :: omega, tid, n_threads
real, dimension(:,:), allocatable :: comparison_value
print *, 'which data index?'
read*, idx
write(c2, '(i2.2)') idx
open(unit=99, file='datatest1.dat', iostat=ios_read)
if (ios_read /= 0) then
print *, "kmeans_data_distrib_"//c2//"_small.dat could not be opened"
! print
end if
!find the maximum lines
do
read(99, *, iostat=ios_read) i, x, y
if (ios_read > 0) then
print *, "something is wrong"
stop
else if (ios_read < 0) then
print *, "end of file reached"
exit
else
n = n+1
end if
end do
rewind(99)
!do i=1,n
open(unit=98, file='rawdata.dat')
allocate(r(2, n))
do i = 1,n
read(99, *, iostat=ios_read) j, x, y
r(1, j) = x
r(2, j) = y
write(98, *) x, y
end do
close(99) ! close kmeans
close(98) ! close rawdatai
print*, 'put k'
read*, k
allocate (comparison_value(2,k))
comparison_value = 0.02
** do l=1,10
call centroid_inits(r, n, k, centroid)
call min_distance(r, n, k, centroid, distance,indices,distancereg)
call new_centroid(r,n,k,centroid,indices,new_centro,omega)
call costfunction(r,n,k,distancereg,indices,new_centro,cluster,cost)
end do
open(unit=99,file="kmeans3_test.dat")
do i = 1, n
write(99,"(2es14.5,i4)") r(:,i),indices(i)
enddo
close(99)
Contains
subroutine centroid_inits(r,n,k,centroid)
real,dimension (:,:),intent(in),allocatable :: r
real,dimension (:,:),intent(out),allocatable:: centroid
real,dimension(k),allocatable::xc(:) ,yc(:)
integer,intent(in) :: n,k
integer :: i
real :: maks_x,maks_y,min_x,min_y
allocate(centroid(2, k))
allocate(xc(k))
allocate(yc(k))
maks_x = maxval(r(1,:))
maks_y = maxval(r(2,:))
min_x = minval(r(1,:))
min_y = minval(r(2,:))
! print *, min_x, maks_x, min_y, maks_y
do i = 1,k
xc (i) = min_x + (maks_x - min_x) * fib_rnd()
yc (i) = min_y + (maks_y - min_y) * fib_rnd()
centroid (1,i) = xc(i)
centroid (2,i) = yc(i)
end do
do i = 1,k
print *, centroid(:,i)
end do
end subroutine centroid_inits
subroutine min_distance(r,n,k,centroid,distance,indices,distancereg)
integer, intent(out):: n,k
real,dimension(:,:),intent(in),allocatable::centroid
real,dimension(:,:),intent(in),allocatable::r
integer,dimension(:),intent(out),allocatable::indices,distancereg
real ::d_min
integer::y,i_min,j,i
integer,parameter :: data_dim=2
allocate (indices(n))
allocate (distancereg(k))
!cost=0.d0
do j=1,n
i_min = -1
d_min=1.d6
! !$ OMP DO
do i=1,k
distance=0.d0
distancereg(i)=0.d0
do y=1,data_dim
distance = distance+abs(r(y,j)-centroid(y,i))
distancereg(i)=distancereg(i)+abs(r(y,j)-centroid(y,i))
end do
if (distance<d_min) then
d_min=distance
i_min=i
end if
end do
!!$OMP END DO
if( i_min < 0 ) print*," found error by assigning k-index to particle ",j
indices(j)=i_min
end do
end subroutine
subroutine new_centroid(r,n,k,centroid,indices,new_centro,omega)
integer, intent(in):: n
real,dimension(:,:),intent(inout),allocatable ::centroid
real,dimension(:,:),intent(in),allocatable ::r
integer,dimension(:),intent(in),allocatable::indices
real,dimension(:,:),intent(out),allocatable:: new_centro
integer,intent(inout)::k
integer :: t,y,j,k_ind
integer,intent(out) :: omega
real,dimension(:),allocatable :: summ
allocate(summ(2))
allocate (new_centro(2,k))
t=2
do k_ind=1,k
omega = count(indices==k_ind)
summ(1)=0
summ(2)=0
do j=1,n
if (indices(j)==k_ind) then
summ(1) =+ r(1,j)
summ(2) =+ r(2,j)
end if
end do
new_centro(1,k_ind) = summ(1)/omega
new_centro(2,k_ind) = summ(2)/omega
end do
centroid = new_centro
!do k_ind=1,k
!print*, 'new centro',new_centro(:,k_ind)
!end do
end subroutine
subroutine costfunction(r,n,k,distancereg,indices,new_centro,cluster,cost)
integer, dimension (:), allocatable, intent(out) :: distancereg, indices
integer, dimension (:), allocatable, intent(out) :: cluster
real, dimension (:,:), allocatable, intent(in) :: r
real, dimension (:,:), intent(in), allocatable :: new_centro
real, dimension(:), intent(out), allocatable :: cost
integer :: i,k
allocate(cluster(k))
allocate(cost(k))
allocate(distancereg(k))
call min_distance(r,n,k,centroid,distance,indices,distancereg)
cluster = 0
do i=1,k
cost(i)=0
cluster(i)=count(indices==i)
cost(i)=(1.0/cluster(i))*distancereg(i)
! print*,cost(i)
end do
print*," total sum of cluster members ",sum(cluster)," vs N ",n
end subroutine
subroutine convergence_value(converge, centroid, new_centro, cost, cluster)
real, dimension (:,:), intent(inout), allocatable :: new_centro
real, dimension (:,:), intent(inout), allocatable :: centroid
real, dimension(:), allocatable, intent(out):: cost
integer, dimension (:), allocatable, intent(out) :: cluster
real, dimension(:,:), intent (inout), allocatable:: converge
allocate(converge(2,k))
call centroid_inits(r, n, k, centroid)
call min_distance(r, n, k, centroid, distance,indices,distancereg)
call new_centroid(r,n,k,centroid,indices,new_centro,omega)
converge = (abs(centroid-new_centro))
print*, 'this is c',converge
end subroutine
end program read_from_file
It runs fine in serial, but I want to apply OpenMP: I want each thread to do the same calculation and then find which thread gets the better cost function and time (every thread doing the clusterization). My idea is to parallelize the code just before the subroutine calls, at the loop marked with the two asterisks, but I do not know whether that is enough (I tried it and it showed an error), and how do I display a report for each thread?
*You might notice from the code that I am a beginner.
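One possible direction, sketched rather than a full solution: wrap the marked loop in a plain parallel region so every thread runs its own clusterization, then have each thread report its ID, elapsed time and total cost. The variables t0, t1 and total_cost are illustrative additions, and the sketch assumes the work inside the region only touches thread-private data; as posted, costfunction and convergence_value also reach centroid, distance, r, n and k by host association, which would need to be untangled (for example by passing them as arguments) before this is safe.
!$omp parallel private(tid, t0, t1, total_cost)
  tid = 0
  !$ tid = omp_get_thread_num()
  t0 = 0.0
  !$ t0 = real(omp_get_wtime())

  ! each thread would run its own copy of the clusterization loop here,
  ! operating only on private data

  t1 = 0.0
  !$ t1 = real(omp_get_wtime())
  total_cost = 0.0        ! e.g. sum(cost) once each thread has its own cost array
  !$omp critical
  print *, 'thread', tid, ': cost =', total_cost, ', time =', t1 - t0, 's'
  !$omp end critical
!$omp end parallel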

OpenACC constant parameters

I am wondering what is the proper way to handle constants in OpenACC kernels.
For example, in the following code
module vecaddmod
implicit none
integer, parameter :: n = 100000
!$acc declare create(n)
contains
subroutine vecaddgpu(r, a, b)
real, dimension(:) :: r, a, b
integer :: i
!$acc update self(n)
!$acc data present(n)
!$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
do i = 1, n
r(i) = a(i) + b(i)
enddo
!$acc end data
end subroutine vecaddgpu
end module vecaddmod
program main
use vecaddmod
implicit none
integer :: i, errs, argcount
real, dimension(:), allocatable :: a, b, r, e
character*10 :: arg1
allocate( a(n), b(n), r(n), e(n) )
do i = 1, n
a(i) = i
b(i) = 1000*i
enddo
! compute on the GPU
call vecaddgpu( r, a, b )
! compute on the host to compare
do i = 1, n
e(i) = a(i) + b(i)
enddo
! compare results
errs = 0
do i = 1, n
if( r(i) /= e(i) )then
errs = errs + 1
endif
enddo
print *, errs, ' errors found'
if( errs ) call exit(errs)
end program main
n is declared as a constant in a module on the CPU, and it is used as the range of the loop. nvfortran warns me about "Constant or Parameter used in data clause". Is the above example the proper way to handle this? Can I take advantage of constant memory on the GPU, so that I don't need to copy n from CPU to GPU for each kernel launch?
Thanks.
The compiler will replace parameters with their literal values, so there is no need to put them in data regions.
module vecaddmod
  implicit none
  integer, parameter :: n = 100000
contains
  subroutine vecaddgpu(r, a, b)
    real, dimension(:) :: r, a, b
    integer :: i
    !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
    do i = 1, n
      r(i) = a(i) + b(i)
    enddo
  end subroutine vecaddgpu
end module vecaddmod
...
% nvfortran -acc -Minfo=accel test.f90
vecaddgpu:
11, Generating copyin(a(:100000)) << "n" is replaced with 100000
Generating copyout(r(:100000))
Generating copyin(b(:100000))
12, Loop is parallelizable
Generating Tesla code
12, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
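As a side note, if n were a run-time variable rather than a parameter, one common pattern (assumed here, not taken from the question) is to give it a permanent device copy with a declare directive and push updates explicitly; the subroutine name set_n is illustrative:
module vecaddmod
  implicit none
  integer :: n = 100000          ! now a variable, not a parameter
  !$acc declare create(n)
contains
  subroutine set_n(m)
    integer, intent(in) :: m
    n = m
    !$acc update device(n)       ! keep the device copy in sync with the host
  end subroutine set_n

  subroutine vecaddgpu(r, a, b)
    real, dimension(:) :: r, a, b
    integer :: i
    !$acc kernels loop copyin(a(1:n), b(1:n)) copyout(r(1:n))
    do i = 1, n
      r(i) = a(i) + b(i)
    enddo
  end subroutine vecaddgpu
end module vecaddmod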

How to allocate memory for a matrix?

I want to allocate memory for a matrix of double-precision elements in Fortran 90; below is the corresponding C code:
int dim = 1024;
double *M = (double *)malloc(dim*dim*sizeof(double));
I wrote the code below but I could not access M(i) with i>=100:
program matrix
INTEGER :: i,d
CHARACTER(len=32) :: arg
REAL*8 M(*)
POINTER(ptr_M, M)
d=0
if(iargc() == 1) then
call getarg(1, arg)
read(arg, '(I10)') d
end if
print '("Dimension=", i6)', d
!allocate and init matrix
ptr_M = malloc(d*d*8)
do i=1,d*d
M(i) = i
end do
print '("M(i)=", f7.4)', M(100)
call free(ptr_M)
end program matrix
what's wrong?
Thanks to all, here is my final solution:
program matrix
  IMPLICIT NONE
  REAL, ALLOCATABLE :: M(:,:)
  INTEGER :: i, j, d
  CHARACTER(len=32) :: arg
  !specify dimension with program parameter
  if(iargc() == 1) then
    call getarg(1, arg)
    read(arg, '(I10)') d
  end if
  !create and init matrix
  ALLOCATE (M(d, d))
  do i=1,d
    do j=1,d
      M(i, j) = (i - 1)*d+j
      write (*,*) "M(",i,",",j,")=",M(i, j)
    end do
  end do
  DEALLOCATE (M)
end program matrix
Using an ALLOCATABLE array, you can allocate a matrix with 100 rows and 200 columns as follows:
program xalloc
real, allocatable :: x(:,:)
allocate(x(100,200))
end program xalloc
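For something closer to the C snippet in the question (double precision, run-time dimension), a minimal sketch along the same lines, with an optional stat check, could be:
program matrix_alloc
  implicit none
  double precision, allocatable :: M(:,:)
  integer :: d, istat
  d = 1024                        ! same dimension as the C example
  allocate(M(d, d), stat=istat)   ! roughly malloc(dim*dim*sizeof(double))
  if (istat /= 0) stop 'allocation failed'
  M = 0.0d0
  print *, 'allocated a', d, 'x', d, 'matrix'
  deallocate(M)
end program matrix_alloc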

Writing a large matrix in a single file using MPI

I have a large N by N matrix containing real numbers, which has been decomposed into blocks using MPI. I am now trying to recompose this matrix and write it in a single file.
This topic (writing a matrix into a single txt file with mpi) covered a similar issue, but I got pretty confused by all the 'integer-to-string' conversion, etc (I am not an expert!). I am using Fortran for my code, but I guess that even a C explanation should help. I have been reading tutorials on MPI-IO, but there are still a few things I do not understand. Here is the code I have been working on:
use mpi
implicit none
! matrix dimensions
integer, parameter :: imax = 200
integer, parameter :: jmax = 100
! domain decomposition in each direction
integer, parameter :: iprocs = 3
integer, parameter :: jprocs = 3
! variables
integer :: i, j
integer, dimension(mpi_status_size) :: wstatus
integer :: ierr, proc_num, numprocs, fileno, localarray
integer :: loc_i, loc_j, ppp
integer :: istart, iend, jstart, jend
real, dimension(:,:), allocatable :: x
! initialize MPI
call mpi_init(ierr)
call mpi_comm_size(mpi_comm_world, numprocs, ierr)
call mpi_comm_rank(mpi_comm_world, proc_num, ierr)
! define the beginning and end of blocks
loc_j = proc_num/iprocs
loc_i = proc_num-loc_j*iprocs
ppp = (imax+iprocs-1)/iprocs
istart = loc_i*ppp + 1
iend = min((loc_i+1)*ppp, imax)
ppp = (jmax+jprocs-1)/jprocs
jstart = loc_j*ppp + 1
jend = min((loc_j+1)*ppp, jmax)
! write random data in each block
allocate(x(istart:iend,jstart:jend))
do j = jstart, jend
do i = istart, iend
x(i,j) = real(i + j)
enddo
enddo
! create subarrays
call mpi_type_create_subarray( 2, [imax,jmax], [iend-istart+1,jend-jstart+1], &
[istart,jstart], mpi_order_fortran, mpi_real, localarray, ierr )
call mpi_type_commit( localarray, ierr )
! write to file
call mpi_file_open( mpi_comm_world, 'test.dat', IOR(MPI_mode_create,MPI_mode_wronly), &
mpi_info_null, fileno, ierr )
call mpi_file_set_view( fileno, 0, mpi_real, localarray, "native", mpi_info_null, ierr )
call mpi_file_write_all( fileno, x, (jend-jstart+1)*(iend-istart+1), MPI_real, wstatus, ierr )
call mpi_file_close( fileno, ierr )
! deallocate data
deallocate(x)
! finalize MPI
call mpi_finalize(ierr)
I have been following this tutorial (PDF), but my compiler complains that there is no specific subroutine for the generic mpi_file_set_view. Did I do something wrong? Is the rest of the code ok?
Thank you very much for your help!!
Joachim
I would say that the easy way is to use a library designed to perform such operations efficiently: http://2decomp.org/mpiio.html
You can also look at their source code (files io.f90 and io_write_one.f90).
In the source code, you will see a call to MPI_FILE_SET_SIZE that may be relevant for your case.
EDIT: consider using "call MPI_File_Set_View(fhandle, 0_MPI_OFFSET_KIND, ...)", as discussed in the answer to MPI-IO: MPI_File_Set_View vs. MPI_File_Seek.
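With the variable names from the question, a minimal sketch of that call would be (disp is an added variable; everything else is as in the posted code):
integer(kind=mpi_offset_kind) :: disp

disp = 0_mpi_offset_kind
call mpi_file_set_view( fileno, disp, mpi_real, localarray, &
                        "native", mpi_info_null, ierr )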

the use of MPI_Init()

I encountered a question about the use of MPI_Init().
I want to initialize the random number array "randv" only on the root processor, using the code below. To see whether my goal is fulfilled, I have the program print the array "randv" by placing a do loop immediately after the line "call RANDOM_NUMBER(randv)".
However, what shows up on screen is the random number array repeated 8 times (the number of processors being 8). My question is: why are the processors other than the root one initialized before the call to MPI_Init()? If all the processors are awake and have the same random number array before MPI_Init is invoked, why bother calling MPI_Init() for initialization at all? Thanks.
Lee
Here is the example I use:
program main
include 'mpif.h'
integer :: i
integer :: ierr
integer :: irank
integer :: nrow, ncol
real, dimension(:,:), allocatable :: randv
nrow = 4
ncol = 2
allocate(randv(nrow,ncol))
call RANDOM_SEED
call RANDOM_NUMBER(randv)
do i = 1, nrow
write(*,'(2(f5.2,x))') randv(i,:)
enddo
call MPI_Init ( ierr )
allocate(row_list(ncol), col_list(nrow))
call MPI_Comm_rank ( MPI_COMM_WORLD, irank, ierr )
if( irank == 0 )then
do i = 1, nrow
write(*,'(2(f5.2,x))') randv(i,:)
enddo
endif
call MPI_Finalize ( ierr )
deallocate( randv )
end program
I think you misunderstand how MPI works. The program you wrote is executed by every process. MPI_Init initializes the MPI environment so that those processes can interact. After initialization, every process is uniquely identified by its rank. You have to make sure that, based on these ranks, each process works on a different portion of your data, or performs different tasks.
Typically, you should run MPI_Init before anything else in your program.
Using MPI_Comm_rank you can obtain the ID of the current process (its rank). The first process always has rank 0. Therefore, if you want to run parts of the code on the "master" process only, you can test for irank == 0:
program main
  include 'mpif.h'
  integer :: i
  integer :: ierr
  integer :: irank
  integer :: nrow, ncol
  real, dimension(:,:), allocatable :: randv
  integer, dimension(:), allocatable :: row_list, col_list  ! declared so the example compiles; not used further here
  ! Initialize MPI
  call MPI_Init ( ierr )
  ! Get process ID
  call MPI_Comm_rank ( MPI_COMM_WORLD, irank, ierr )
  ! Executed on all processes
  nrow = 4
  ncol = 2
  allocate(randv(nrow,ncol))
  ! Only executed on the master process
  if ( irank == 0 ) then
    call RANDOM_SEED
    call RANDOM_NUMBER(randv)
    do i = 1, nrow
      write(*,'(2(f5.2,x))') randv(i,:)
    enddo
  endif
  ! Executed on all processes
  allocate(row_list(ncol), col_list(nrow))
  ! Only executed on the master process
  if ( irank == 0 ) then
    do i = 1, nrow
      write(*,'(2(f5.2,x))') randv(i,:)
    enddo
  endif
  deallocate( randv )
  ! Finalize MPI, should always be executed last
  call MPI_Finalize ( ierr )
end program
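As a side note, not part of the answer above: if the other processes later need the values generated on the master, they would typically be sent explicitly, for example with a broadcast placed after the first if block:
! make randv available on every process once the master has filled it
call MPI_Bcast ( randv, nrow*ncol, MPI_REAL, 0, MPI_COMM_WORLD, ierr )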
