Fortran MPI_Isend and MPI_Irecv - parallel-processing

I have a problem with MPI_Isend and MPI_Irecv: the receive buffers are never filled correctly.
The code is written in Fortran.
Every process has a number of neighbouring processes to which I want to send some values. The values that I want to send consist of 4 vectors that are part of a type called variables for each process.
Here is the code that I use:
program isend_test
use mpi
real, dimension(:,:,:), allocatable :: receivedValues
real, dimension(:,:), allocatable :: sendReals
integer, dimension(:,:), allocatable :: requestSend
integer, dimension(:,:), allocatable :: requestReceive
integer, dimension(:), allocatable :: neighbours
integer, dimension(mpi_status_size) :: status
integer :: ierr, currentNeighbour, k, me, nTasks, nValues, nNeighbours, addedNeighbours
call MPI_init(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nTasks, ierr)
nNeighbours = 2
! Only 3 values for each variable to keep it simple
nValues = 3
allocate(receivedValues(nNeighbours,4,nValues))
allocate(sendReals(4,nValues))
allocate(requestSend(4,nNeighbours))
allocate(requestReceive(4,nNeighbours))
allocate(neighbours(2))
receivedValues = -9999
! Initializing neighbours - Every process is adjacent to every other process in this example
addedNeighbours = 0
do j = 0,2
if (j == me) then
cycle
endif
addedNeighbours = addedNeighbours + 1
neighbours(addedNeighbours) = j
enddo
! fill in some values to send
do j = 1,4
do i=1,nValues
sendReals(j,i) = j + 10*me + 100*i
enddo
enddo
do j = 1,4
do i = 1,nNeighbours
call mpi_isend(sendReals(j,:), nValues, mpi_real, neighbours(i), j, MPI_COMM_WORLD, requestSend(j,i), ierr)
call mpi_irecv(receivedValues(i, j, :), nValues, mpi_real, neighbours(i), j, MPI_COMM_WORLD, requestReceive(j,i), ierr)
enddo
enddo
do j = 1,4
do i = 1,nNeighbours
call mpi_wait(requestSend(j,i), status, ierr)
call mpi_wait(requestreceive(j,i), status, ierr)
enddo
enddo
write(*,*)receivedValues
call MPI_finalize(ierr)
end
I know that the datatypes are correct (they work with MPI_Send and MPI_Recv), and the matching of the neighbours and the tags is correct as well, because the code runs to completion. However, if I set receivedValues = -9999 at the beginning, before the exchange, the values are never changed.
I know the code could be written much more efficiently, but I changed so much trying to find the error, without success. Does anyone have an idea? It is probably a problem with the buffers; I just can't find it.
By the way: sending sendReals(j,1) and receiving into receivedValues(i, j, 1) does not work either.

The receive buffer is not contiguous in memory (Fortran is column-major, so the elements of receivedValues(i, j, :) are strided):
mpi_irecv(receivedValues(i, j, :), ...
so this can only work if MPI_SUBARRAYS_SUPPORTED is .true. (it is not in Open MPI; it might be the case with MPICH and the mpi_f08 bindings). Otherwise the compiler typically passes a temporary contiguous copy, which no longer exists when the non-blocking receive completes.
Not only are you getting incorrect data, you are also very likely causing silent data corruption.
You can either reorder the receivedValues array, or use a derived datatype and use receivedValues(i,j,1) as the receive buffer.
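A minimal sketch of the first suggestion (reordering the arrays), under a few assumptions that are not in the original code: the values of one variable are stored along the first dimension, the program name isend_contiguous and the single requests array are made up for illustration, and every other rank is treated as a neighbour as in the question. With this layout sendReals(:, j) and receivedValues(:, j, i) are contiguous, and compilers normally pass such sections without a temporary copy.

```fortran
program isend_contiguous
  use mpi
  implicit none
  integer, parameter :: nVars = 4, nValues = 3
  real, allocatable :: sendReals(:,:), receivedValues(:,:,:)
  integer, allocatable :: requests(:), neighbours(:)
  integer :: ierr, me, nTasks, nNeighbours, i, j, k, n

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nTasks, ierr)

  ! As in the question, every other rank counts as a neighbour.
  nNeighbours = nTasks - 1
  allocate(neighbours(nNeighbours))
  n = 0
  do k = 0, nTasks - 1
    if (k == me) cycle
    n = n + 1
    neighbours(n) = k
  end do

  ! Assumed layout: the values of one variable run along the FIRST
  ! dimension, so each slice passed to MPI below is contiguous.
  allocate(sendReals(nValues, nVars))
  allocate(receivedValues(nValues, nVars, nNeighbours))
  allocate(requests(2 * nVars * nNeighbours))
  receivedValues = -9999.0

  do j = 1, nVars
    do i = 1, nValues
      sendReals(i, j) = j + 10*me + 100*i
    end do
  end do

  ! Post all receives and sends; the tag j identifies the variable.
  n = 0
  do i = 1, nNeighbours
    do j = 1, nVars
      n = n + 1
      call MPI_Irecv(receivedValues(:, j, i), nValues, MPI_REAL, &
                     neighbours(i), j, MPI_COMM_WORLD, requests(n), ierr)
      n = n + 1
      call MPI_Isend(sendReals(:, j), nValues, MPI_REAL, &
                     neighbours(i), j, MPI_COMM_WORLD, requests(n), ierr)
    end do
  end do

  ! The buffers must not be touched until all requests have completed.
  call MPI_Waitall(n, requests, MPI_STATUSES_IGNORE, ierr)

  write(*,*) 'rank', me, ':', receivedValues
  call MPI_Finalize(ierr)
end program isend_contiguous
```

On a typical installation this would be built with the MPI compiler wrapper (for example mpifort) and started with three processes to mirror the question. Collecting all requests in one array and calling MPI_Waitall is a design choice of this sketch, not something the answer prescribes.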

Related

How to sort distances in a 2D array using Fortran code?

I want to sort the distances. For example, if r(1,3) < r(1,2), then r(1,3) should come first.
The input file dist.txt looks like this:
1,2,3.5
1,3,0.5
1,4,4.7
1,5,4.5
The output file sort.txt should look like this:
1,3,0.5
1,2,3.5
1,5,4.5
1,4,4.7
Here the first column is i, the second column is j, and the third column is r(i,j).
I have written a Fortran program that is meant to sort the 2D array, but it has a problem. I would be very glad if someone could fix it.
program sort
implicit none
character CN*8,O*7
integer i,m,k,j
integer n,nmax,ind,num
integer L
parameter (n=3,m=n**2-n)
double precision xbox,rq
parameter (nmax=3091,nconf=1)
double precision atom(nmax),id(nmax),ox(nmax),oy(nmax),oz(nmax)
double precision xij,yij,zij,rij,t
double precision a(n,n)
double precision r(n,n)
open(unit=10,status='unknown',file='a.gro')
do i=1,n
read(10,'(A8,A7,1i5,3f8.3)')CN,O,num,ox(i),oy(i),oz(i)
enddo
read(10,*)xbox ! read the xbox for PBC
t=0.0d0
open(unit=3,file='dist.txt')
open(unit=4,file='2d_1d_dist.txt')
open(unit=5,file='sort.txt')
do i=1,n
do j=1,n
if(i .ne. j) then
xij=ox(i)-ox(j)
yij=oy(i)-oy(j)
zij=oz(i)-oz(j)
xij=xij - nint(xij/xbox)*xbox
yij=yij - nint(yij/xbox)*xbox
zij=zij - nint(zij/xbox)*xbox
r(i,j)=dsqrt(xij**2 + yij**2 + zij**2) !distance calculation
write(3,'(i3,2x,i3,4x,f17.15)') i,j, r(i,j)
endif
enddo
enddo
t=0.0d0
do i = 1,m-2
do j = i+1,m-1
if(i .ne. j) then
write(4,*) r(i,j)," ", r(i,j+1)
if (r(i,j) .gt. r(i,j+1)) then
t=r(i,j)
r(i,j)=r(i,j+1)
r(i,j+1)=t
endif
endif
write(5,*) r(i,j)
enddo
enddo
END program sort
Please look at the code.

The first thing I think of when I encounter such a situation is: Do I need to write my own program?
Quick answer here is no: Linux/Unix has the sort command that works just fine:
sort -t, -k3 -g dist.txt
-t, tells sort that the field separator is the comma,
-k3 tells it to sort according to the 3rd field,
-g tells it to use a general numeric sort
If I needed to do that using Fortran, I would probably read i, j, and r into separate 1-D arrays, then write a sorting routine that not only sorts r, but also returns the order. Then you can easily rearrange the i and j arrays to correspond to the same order. See this example:
program sort_r
implicit none
integer :: u
integer, parameter :: num_of_elements = 4
integer :: i(num_of_elements), j(num_of_elements)
real :: r(num_of_elements)
integer :: order(num_of_elements)
integer :: ii
open(newunit=u, file='dist.txt')
do ii=1, num_of_elements
read(u, *) i(ii), j(ii), r(ii)
end do
close(u)
order = [(ii, ii=1, num_of_elements)]
call qsort(r, first=1, last=num_of_elements, order=order)
i(:) = i(order)
j(:) = j(order)
do ii = 1, num_of_elements
write(*,'(I0,",",I0,",",F3.1)') i(ii), j(ii), r(ii)
end do
contains
recursive subroutine qsort(a, first, last, order)
real, intent(inout) :: a(:)
integer, intent(in) :: first, last
integer, intent(inout) :: order(:)
! Prerequisites:
! first >= lbound(a, 1)
! last <= ubound(a, 1)
! lbound(a, 1) == lbound(order, 1)
! ubound(a, 1) == ubound(order, 1)
real :: pivot
integer :: i, j
if (.not. first < last) return ! less than 2 elements
! Get pivot from middle to beginning of subarray.
call swap(a, first, (first+last)/2, order)
pivot = a(first)
i = first + 1
j = last
do while (j >= i)
! move up from left while elements are smaller than pivot;
! stop at last so that a(i) is never read out of bounds
do
if (i > last) exit
if (a(i) >= pivot) exit
i = i + 1
end do
! move down from right while elements are larger than pivot
do while (a(j) > pivot)
j = j - 1
end do
! If we moved past the other index, exit loop
if (j < i) exit
! We have found a larger than pivot element left of a smaller than
! pivot element to the right, swap the two, move the indices to next
call swap(a,i,j,order)
i = i + 1
j = j - 1
end do
! Move pivot back to centre
call swap(a,first,j, order)
call qsort(a,first=first,last=j-1,order=order)
call qsort(a,first=i,last=last,order=order)
end subroutine qsort
subroutine swap(a, i, j, order)
real, intent(inout) :: a(:)
integer, intent(in) :: i, j
integer, intent(inout) :: order(:)
real :: t
integer :: k
t = a(i)
a(i) = a(j)
a(j) = t
k = order(i)
order(i) = order(j)
order(j) = k
end subroutine swap
end program sort_r

Calculating the convolution of a matrix with a recursive subroutine in Fortran 2003

I've never asked a question here, so please let me know if I am describing my problem enough.
I'm pretty new to Fortran and I wanted to create a recursive subroutine that computes each square of the 3 x 3 output matrix of a convolution. If you are not familiar with convolution, this is a good resource: http://songho.ca/dsp/convolution/convolution2d_example.html.
I used the same values as in this example to make sure I was doing it right.
The purpose of the program is to have the recursive subroutine called inside two nested do loops (both indexes run over [0,2]). When the recursive subroutine is called, it finds the sum of all the products for one square of the output matrix. The loops call it 9 times so that every square's value is calculated to produce the desired output. After much editing on paper, I thought I had a pretty good idea that the subroutine would work, but it seems that only the first square (0,0) gets its correct answer, -13.
I believe my problem has to do with the assignments in the subroutine. I want to continue calling the next 'temp' value to add it to the total, which will be returned to the program calling it.
Depending on the current i and j values, there might not be any multiplication needed for every turn through the recursive method, so I wanted the subroutine to find a way to continue adding temp if that was the case.
program conprac
implicit none
integer, dimension(0:2,0:2) :: mat1(0:2,0:2) = reshape((/1,4,7,2,5,8,3,6,9/),(/3,3/))
integer, dimension(0:2,0:2) :: totals(0:2,0:2) = 0
integer, dimension(2,2) :: kernal(0:2,0:2) = reshape((/1,0,-1,2,0,-2,1,0,-1/),(/3,3/))
integer :: i=0, j=0, this_total=0, total=0, m=0, n=0, k=0
!do m = 0,0
!do n = 0,1
total = 0
call calc(kernal, mat1, i, j, m, n, this_total, total)
totals(m,n) = total
!end do
!end do
write(*,*) "totals(0,0): ", totals(0,0) !-13
!write(*,*) "totals(0,1): ", totals(0,1) !-20
end program conprac
recursive subroutine calc(kernal, mat, i, j, addToi, addToj, this_total, total)
implicit none
!declare calling parameter types and definitions
!to calculate with
integer, intent(in), dimension(0:2,0:2) :: kernal, mat
integer, intent(in) :: addToi, addToj
integer, intent(out) :: i, j, this_total
!to calculate
integer, intent(out) :: total
!temp variable
integer :: temp
if (i <= 2) then
if (j > 2) then
i = i + 1
j = 0
end if
if ((i + addToi) - 1 < 0 .or. (j + addToj) - 1 < 0 .or. (i + addToi) - 1 > 2 .or. (j + addToj) - 1 > 2) then
j = j+1
call calc(kernal, mat, i, j, addToi, addToj, this_total, temp)
total = total + temp
write(*,*) "total1: ", total
else
this_total = kernal(i,j) * mat((i + addToi) - 1, (j + addToj) - 1)
j = j+1
call calc(kernal, mat, i, j, addToi, addToj, this_total, temp)
total = this_total + temp
write(*,*) "total2: ", total
end if
end if
end subroutine calc
As of right now, the do loops are commented out so I can test one value at a time using the values that m and n are initialized with.
The parameters of the subroutine are: kernal and mat, the 2 matrices; i and j, both starting at 0 when the subroutine is called; and the m and n values, which are added to i and j respectively so that the kernel is shifted and not evaluated in the same spot for all 9 squares. this_total is a holder for the product of an overlapping square to be added to the returned value total, and total is the value returned, which is stored in the totals array in the program. For the first one, it would be at index (m,n), or (0,0).
These are the outputs for the first square in totals:
m=0,n=0
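For comparison, the sums described above can also be written with plain nested loops. This is only a sketch under the question's own conventions (0-based arrays, the same mat1 and kernal values, out-of-range positions skipped); the program name conv_loops and the helper indices r and c are made up for illustration, and it does not attempt to repair the recursive version.

```fortran
program conv_loops
  implicit none
  integer :: mat(0:2,0:2), kernal(0:2,0:2), totals(0:2,0:2)
  integer :: m, n, i, j, r, c

  ! Same values as in the question (column-major fill order).
  mat    = reshape([1,4,7,2,5,8,3,6,9], [3,3])
  kernal = reshape([1,0,-1,2,0,-2,1,0,-1], [3,3])
  totals = 0

  do n = 0, 2
    do m = 0, 2
      ! Sum the products for output square (m,n).
      do j = 0, 2
        do i = 0, 2
          r = i + m - 1
          c = j + n - 1
          ! Skip kernel entries that fall outside the matrix.
          if (r < 0 .or. r > 2 .or. c < 0 .or. c > 2) cycle
          totals(m,n) = totals(m,n) + kernal(i,j) * mat(r,c)
        end do
      end do
    end do
  end do

  ! Print the result row by row.
  do m = 0, 2
    write(*,'(3(i5))') totals(m,:)
  end do
end program conv_loops
```

The first printed row should begin with -13 and -20, the two values quoted in the question's comments, which makes this a convenient reference when debugging the recursion.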

How to find the dimension of a matrix?

I have a matrix that contains both character and reals and I want a program that reads this matrix (finds the dimensions by itself). Here is my code:
! A fortran95 program for G95
Program Project2nd
implicit none
character(len=40), allocatable :: a(:,:)
integer i,j,k,n,m,l,st
character(len=40) d
n=0; m=1; j=1;
open(10,file=&
'/Users/dariakowsari/Documents/Physics/Programming/Fortran95-Projects/Project2nd/input.txt', &
IOstat=st)
do while (st == 0)
read(10,*,IOstat=st) d
n=n+1
end do
st=0
do j=1,m
do while (st == 0)
allocate(a(1,m))
read(10,*,IOstat=st) (a(1,j),j=1,m)
m=m+1
deallocate(a)
end do
print*, n,m
end
Here is my Matrix:
a b 13 15.5 13.2
c d 16 16.75 19
e f 19.2 12.2 18.2
With this code I got (3,2) for the dimensions of my matrix.

There are a few errors in your example code, which means it doesn't compile for me, but after a few changes I managed to get a result similar to yours.
(Update: As noted by @francescalus in the comments to my other (now deleted) answer, that approach involved undefined behaviour and as such is not an appropriate solution. This arose from trying to read more elements from the file than were present.)
Here's an alternative approach, which should avoid this undefined behaviour, but is probably pretty inefficient.
Program Project2nd
implicit none
character(len=40), allocatable :: a(:)
integer, allocatable :: ind(:)
integer, parameter :: maxElements = 100
integer i,j,n,m,st
character(len=40) d
n=0;
open(10,file='mat.txt',IOstat=st)
!Find number of lines
do while (st == 0)
read(10,*,IOstat=st) d
if(st ==0) n=n+1
end do
!Move back to the start of the file
rewind(10)
!Read all of the data
do m=n,maxElements,n
allocate(a(m))
read(10,*,IOstat=st) a
deallocate(a)
rewind(10)
if(st.ne.0) exit
enddo
m = m -n !Need to roll back m by one iteration to get the last which worked.
if(mod(m,n).ne.0) then
print*,"Error: Number of elements not divisible by number of rows."
stop
endif
!Number of columns = n_elements/nrow
m=m/n
print*, n,m
end Program Project2nd
Essentially this uses the same code as you had for counting the number of lines; however, note that you only want to increment n when the read was successful (i.e. st==0). Note that we do not exit the while block as soon as st becomes non-zero; the loop only ends once we reach the end of the while block. After that we need to rewind the file so that the next read starts at the beginning of the file.
In a previous comment you mentioned that you'd rather not have to specify maxElements. If you really want to avoid this, then replace the second do loop with something like
st = 0 ; m = n
do while (st==0)
allocate(a(m))
read(10,*,IOstat=st) a
deallocate(a)
rewind(10)
if(st.ne.0) then
m = m - n !Go back to value of m that worked
exit
endif
m=m+n
enddo

Here is how to do it without rewinding.
implicit none
character(len=100) wholeline
character(len=20), allocatable :: c(:)
integer iline,io,ni,nums
open(20,file='testin.dat')
iline=0
do while(.true.)
read(20,'(a)',iostat=io)wholeline
if(io.ne.0)exit
iline=iline+1
ni=lineitems(wholeline)
allocate(c(ni))
read(wholeline,*)c
nums=ctnums(c)
write(*,*)'line',iline,' contains ',ni,'items',nums,
$ 'are numbers'
deallocate(c)
enddo
write(*,*)'total lines is ',iline
contains
integer function ctnums(c)
! count the number of items in a character array that are numbers
! this is a template,
! obviously you could assign the numbers to a real array here
character(len=*), allocatable :: c(:)
real f
integer i,io
ctnums=0
do i = 1,size(c)
read(c(i),*,iostat=io)f
if(io.eq.0)ctnums=ctnums+1
enddo
end function
integer function lineitems(line)
! count the number of items in a space delimited string
integer,parameter ::maxitems=100
character(len=*) line
character(len=80) :: c(maxitems)
integer iline,io
lineitems=0
do iline=1,maxitems
read(line,*,iostat=io)c(:iline)
if(io.ne.0)return
lineitems=iline
enddo
if(lineitems.eq.maxitems)write(*,*)'warning maxitems reached'
end function
end
Output:
line 1 contains 5 items 3 are numbers
line 2 contains 5 items 3 are numbers
total lines is 2
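As a complement to the trial-read approach, the number of items on a line can also be counted by scanning for transitions from blank to non-blank characters. The sketch below is an illustration only: the function name nitems is hypothetical, the three rows of the question's matrix are embedded as strings so that it runs without an input file, and it assumes items are separated by spaces (not tabs or commas).

```fortran
program count_items
  implicit none
  character(len=40) :: rows(3)
  integer :: k

  ! The matrix from the question, one string per row.
  rows(1) = 'a b 13 15.5 13.2'
  rows(2) = 'c d 16 16.75 19'
  rows(3) = 'e f 19.2 12.2 18.2'

  do k = 1, size(rows)
    write(*,'(a,i0,a,i0)') 'row ', k, ': items = ', nitems(rows(k))
  end do
  write(*,'(a,i0)') 'number of rows = ', size(rows)

contains

  ! Count space-delimited items by detecting the start of each item.
  integer function nitems(line)
    character(len=*), intent(in) :: line
    integer :: p
    logical :: in_item

    nitems = 0
    in_item = .false.
    do p = 1, len_trim(line)
      if (line(p:p) /= ' ') then
        if (.not. in_item) nitems = nitems + 1
        in_item = .true.
      else
        in_item = .false.
      end if
    end do
  end function nitems

end program count_items
```

For the matrix in the question each row contains 5 items, so the dimensions would come out as 3 rows by 5 columns. In a real program the strings would be filled by reading each line with the '(a)' format, as in the answer above.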

Read array of unknown size from keyboard

I want to insert an unknown number of values in an array (no matter the order).
I could first read how many values are to be inserted, then allocate the allocatable array, and finally read its values, as in the following code
PROGRAM try
IMPLICIT NONE
INTEGER :: N
REAL, DIMENSION(:), ALLOCATABLE :: x
WRITE (*,*) "how many values?"
READ (*,*) N
ALLOCATE(x(N))
WRITE (*,*) "insert the values"
READ (*,*) x
END PROGRAM
What if I want to insert the values without declaring how many before allocating the array?
I think I should use a DO WHILE loop to insert the values in ascending order until a descending value is inserted, thus indicating that the sequence has ended. I think part of the code would be the following:
index = 1
WRITE(*,*) x
READ(*,*) x(index)
exit = .FALSE.
DO WHILE (exit.EQV..FALSE.)
index = index + 1
READ(*,*) x(index)
IF (x(index)<x(index-1)) THEN
exit = .TRUE.
index = index - 1
END IF
END DO
How to declare the array x?

I tried the following solution, building on the concept of "a lot of memory allocation and reallocation" expressed by @High Performance Mark.
PROGRAM COEFFS
USE COMPACT
IMPLICIT NONE
REAL, DIMENSION(:), ALLOCATABLE :: x,x2
INTEGER :: nL,nR,nT,index,oL,oR
LOGICAL :: exit
WRITE(*,*) "Input an increasing sequence of reals (end the sequence &
& with the first decreasing element, which will be discarded):"
index = 1
ALLOCATE(x(index))
READ(*,*) x(index)
ALLOCATE(x2(index))
x2 = x
DEALLOCATE(x)
exit = .FALSE.
DO WHILE (exit.EQV..FALSE.)
index = index + 1
ALLOCATE(x(index))
x(1:index-1) = x2
READ(*,*) x(index)
DEALLOCATE(x2)
ALLOCATE(x2(index))
x2 = x
DEALLOCATE(x)
IF (x2(index)<x2(index-1)) THEN
exit = .TRUE.
index = index - 1
ALLOCATE(x(index))
x = x2(1:index)
END IF
END DO
DEALLOCATE(x2)
WRITE(*,*) "x = ", x
END PROGRAM
Since the array is input from the keyboard, I don't think allocation/reallocation is a problem: it happens at a much higher speed than my fingers can type the values, doesn't it?
Still, I think the code could be made better. For instance, is using two arrays the only way to take advantage of allocation/reallocation?
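One way to avoid the second array is to rely on automatic reallocation on assignment, a Fortran 2003 feature. The following is a minimal sketch, not a drop-in replacement for the program above: the program name grow_array and the variables val and io are made up, and it assumes a compiler with reallocation on assignment enabled (the default in gfortran; some older compilers need an option for it).

```fortran
program grow_array
  implicit none
  real, allocatable :: x(:)
  real :: val
  integer :: io

  write(*,*) "Input an increasing sequence of reals (end it with a smaller value):"

  allocate(x(0))               ! start with an empty array
  do
    read(*, *, iostat=io) val
    if (io /= 0) exit          ! end of input or unreadable value
    if (size(x) > 0) then
      ! The first decreasing element ends the sequence and is discarded.
      if (val < x(size(x))) exit
    end if
    x = [x, val]               ! x is reallocated to the new size
  end do

  write(*,*) "x = ", x
end program grow_array
```

Each assignment still copies the whole array, so the cost is the same in principle as in the two-array version; the compiler just does the bookkeeping. The intrinsic move_alloc is another option when the copy should be controlled explicitly.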

How to Avoid Conditionals in Loops

In this link, the author gives an example as
subroutine threshold(a, thresh, ic)
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt
integer :: n
ic = 0
tt = 0.d0
n = size(a)
do j = 1, n
tt = tt + a(j) * a(j)
if (sqrt(tt) >= thresh) then
ic = j
return
end if
end do
end subroutine threshold
and the author commented this code as
An alternative approach, which would allow for many optimizations
(loop unrolling, CPU pipelining, less time spent evaluating the
conditional) would involve adding tt in blocks (e.g., blocks of size
128) and checking the conditional after each block. When it the
condition is met, the last block can be repeated to determine the
value of ic.
What does this mean? What are loop unrolling and CPU pipelining, and what does adding tt in blocks look like?
How can the code be optimized as the author suggests?

If the loop is performed in chunks/blocks that fit into the CPU cache you will reduce the number of cache misses, and consequently the number of cache lines retrieved from memory. This increases the performance on all loops that are limited by memory operations.
If the corresponding block size is BLOCKSIZE, this is achieved by
do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
end do
This, however, leaves a remainder that is not treated in the main loop. To illustrate this, consider an array of length 1000 and a BLOCKSIZE of 128. The first seven chunks (1--896) are covered in the loop, but an eighth chunk (897--1024) would run past the end of the array, so elements 897--1000 remain. Therefore, another loop for the remainder is required:
do j=(n/BLOCKSIZE)*BLOCKSIZE+1,n
! ...
enddo
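The index arithmetic for the array of length 1000 can be checked with a few lines. This is only an illustration; the program name block_bounds and the variables nfull and last_full are made up, and BLOCKSIZE = 128 is taken from the complete code.

```fortran
program block_bounds
  implicit none
  integer, parameter :: BLOCKSIZE = 128
  integer :: n, nfull, last_full

  n = 1000
  nfull = n / BLOCKSIZE          ! integer division: number of complete blocks
  last_full = nfull * BLOCKSIZE  ! last index handled by the blocked loop

  write(*,'(a,i0)')      'complete blocks: ', nfull
  write(*,'(a,i0)')      'blocked loop   : 1..', last_full
  write(*,'(a,i0,a,i0)') 'remainder loop : ', last_full + 1, '..', n
end program block_bounds
```

It reports 7 complete blocks covering 1..896 and a remainder of 897..1000. When n is smaller than BLOCKSIZE, nfull is 0, so the blocked loop is skipped and the remainder loop handles the whole array.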
While it makes little sense to remove the conditional from the remainder loop, in the blocked main loop the check can be moved from the inner loop to the outer loop.
As no branches then occur in the inner loop, aggressive optimizations may become applicable.
However, this limits the "accuracy" of the determined position to whole blocks. To get element-wise accuracy, you have to repeat the calculation for the block in which the threshold was reached.
Here is the complete code:
subroutine threshold_block(a, thresh, ic)
implicit none
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt, tt_bak, thresh_sqr
integer :: n, j, jj
integer,parameter :: BLOCKSIZE = 128
ic = 0
tt = 0.d0
thresh_sqr = thresh**2
n = size(a)
! Perform the loop in chunks of BLOCKSIZE (complete blocks only)
do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
tt_bak = tt
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
! Perform the check on the block level
if (tt >= thresh_sqr) then
! If the threshold is reached, repeat the last block
! to determine the last position
tt = tt_bak
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
if (tt >= thresh_sqr) then
ic = jj
return
end if
end do
end if
end do
! Remainder (elements after the last complete block) is treated element-wise
do j=(n/BLOCKSIZE)*BLOCKSIZE+1,n
tt = tt + a(j) * a(j)
if (tt >= thresh_sqr) then
ic = j
return
end if
end do
end subroutine threshold_block
Please note that compilers are nowadays very good at creating blocked loops in combination with other optimizations. In my experience it is quite difficult to get better performance out of such simple loops by tweaking them manually.
Loop blocking is enabled in gfortran with the compiler option -floop-block.
Loop unrolling can be done manually, but should be left to the compiler. The idea is to manually perform a loop in blocks and instead of a second loop as shown above, perform the operations by duplicating the code. Here is an example for the inner loop as given above, for a loop unrolling of factor four:
do jj = j, j+BLOCKSIZE-1,4
tt = tt + a(jj) * a(jj)
tt = tt + a(jj+1) * a(jj+1)
tt = tt + a(jj+2) * a(jj+2)
tt = tt + a(jj+3) * a(jj+3)
end do
Here, no remainder can occur if BLOCKSIZE is a multiple of 4. You can probably shave off a few operations in here ;-)
The gfortran compiler option to enable this is -funroll-loops
As far as I know, CPU Pipelining (Instruction Pipelining) cannot be enforced manually in Fortran. This task is up to the compiler.
Pipelining sets up a pipe of instructions. You feed the complete array into that pipe and, after the wind-up phase, you will get a result with each clock cycle. This drastically increases the throughput.
However, branches are difficult (impossible?) to treat in pipes, and the array should be long enough that the time required for setting up the pipe, wind-up, and wind-down phase are compensated.
