I've been playing around with various random seed generators. Here is a simple one:
subroutine init_random_seed()
implicit none
integer :: i, n, clock
integer, dimension(:), allocatable :: seed
! Ask the intrinsic generator how many integers it needs for a seed
call random_seed(size = n)
allocate(seed(n))
! Build the seed from the current clock count
call system_clock(count=clock)
seed = clock + 37 * (/ (i - 1, i = 1, n) /)
call random_seed(put = seed)
deallocate(seed)
end subroutine init_random_seed
...and a more robust one:
SUBROUTINE init_random_seed()
USE ISO_Fortran_env, ONLY: INT64
IMPLICIT NONE
INTEGER, ALLOCATABLE :: seed(:)
INTEGER :: i, n, un, istat, dt(8), pid
INTEGER(INT64) :: t
CALL RANDOM_SEED(size = n)
ALLOCATE(seed(n))
! First try the random source provided by the OS
OPEN(newunit=un, file='/dev/urandom', access='stream', status='old', action='read', form='unformatted', iostat=istat)
IF (istat == 0) THEN
READ(un) seed
CLOSE(un)
ELSE
! Fall back to the system clock XORed with the PID, fed through a simple LCG
CALL SYSTEM_CLOCK(t)
IF (t == 0) THEN
CALL DATE_AND_TIME(values = dt)
t = (dt(1) - 1970) * 365_INT64 * 24 * 60 * 60 * 1000 + dt(2) * 31_INT64 * 24 * 60 * 60 * 1000 + dt(3) * 24_INT64 * 60 * 60 &
* 1000 + dt(5) * 60 * 60 * 1000 + dt(6) * 60 * 1000 + dt(7) * 1000 + dt(8)
END IF
pid = GETPID()
t = IEOR(t, INT(pid, KIND(t)))
DO i = 1, n
seed(i) = lcg(t)
END DO
END IF
CALL RANDOM_SEED(put = seed)
DEALLOCATE(seed)
CONTAINS
FUNCTION lcg(s)
INTEGER :: lcg
INTEGER(INT64), INTENT(IN OUT) :: s
IF (s == 0) THEN
s = 104729
ELSE
s = MOD(s, 4294967296_INT64)
END IF
s = MOD(s * 279470273_INT64, 4294967291_INT64)
lcg = INT(MOD(s, INT(HUGE(0), INT64)), KIND(0))
END FUNCTION lcg
END SUBROUTINE init_random_seed
The second one generates higher-quality random numbers, but is comparatively slow. Does anyone see why?
The second subroutine performs many more algebraic operations than the first. In addition, it does file I/O through the call to OPEN, which is the likely performance killer: reading from disk is usually orders of magnitude slower than reading from memory. You could try commenting out the OPEN (replacing the read with a hard-coded value) and benchmarking both subroutines again.
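For example, a minimal timing harness along these lines would show the difference; since both routines above share the same name, I'm calling the clock-only one init_random_seed_simple here, and nrep is an arbitrary repetition count:
program bench_seed
  implicit none
  integer :: it, t0, t1, rate
  integer, parameter :: nrep = 10000   ! arbitrary repetition count

  call system_clock(count_rate=rate)

  call system_clock(t0)
  do it = 1, nrep
    call init_random_seed_simple()     ! placeholder name for the clock-only version
  end do
  call system_clock(t1)
  print *, 'clock-only version:   ', real(t1 - t0) / real(rate), ' s'

  call system_clock(t0)
  do it = 1, nrep
    call init_random_seed()            ! the /dev/urandom version
  end do
  call system_clock(t1)
  print *, '/dev/urandom version: ', real(t1 - t0) / real(rate), ' s'
end program bench_seed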
I abstracted some code from a much larger project I'm working on. The code has an OMP parallel do loop which scales well with the number of processors when compiled with gfortran but poorly when compiled with Intel. With gfortran, the code takes 18 seconds to run on 1 processor and 5 seconds on 4 processors. With Intel it takes 7 seconds on 1 processor and 14 seconds on 4 processors. I don't understand what's going on here. The code is below.
MODULE test
TYPE walker
DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: R
END TYPE walker
TYPE walkerlist
INTEGER :: nwlkr
TYPE(walker), DIMENSION(:), ALLOCATABLE :: W
END TYPE walkerlist
CONTAINS
FUNCTION step( dTau, nelec, ndim ) RESULT ( dR )
DOUBLE PRECISION, INTENT(IN) :: dTau
INTEGER, INTENT(IN) :: nelec, ndim
DOUBLE PRECISION :: dR(ndim,nelec), rand1, rand2, N2DTau
INTEGER :: d, k
DOUBLE PRECISION, PARAMETER :: twopi = 8.d0 * atan(1.d0)
N2DTau = -2 * dTau
! Box-Muller transform: two uniform deviates give a normally distributed step
DO k = 1, nelec
DO d = 1, ndim
CALL RANDOM_NUMBER(rand1)
CALL RANDOM_NUMBER(rand2)
dR(d,k) = SQRT( N2DTau * LOG( rand1 ) ) * COS( twopi * rand2 )
END DO
END DO
END FUNCTION step
END MODULE test
PROGRAM walk
USE test
TYPE(walkerlist), TARGET :: Wl
DOUBLE PRECISION :: dTau
INTEGER :: istp, i, t1, t2, clock_rate, clock_max
Wl % nwlkr = 10000
ALLOCATE( Wl % W ( Wl % nwlkr ) )
DO i = 1, Wl % nwlkr
ALLOCATE( Wl % W(i) % R(3,2) )
END DO
dTau = 0.001
CALL SYSTEM_CLOCK ( t1, clock_rate, clock_max )
!$OMP PARALLEL DO SHARED( Wl ) DEFAULT( FIRSTPRIVATE )
DO i = 1, Wl % nwlkr
DO istp = 1, 4000
Wl % W(i) % R = Wl % W(i) % R + step( dTau, 2, 3 )
END DO
END DO
!$OMP END PARALLEL DO
CALL SYSTEM_CLOCK ( t2, clock_rate, clock_max )
Print*, "time:", REAL ( t2 - t1 ) / REAL ( clock_rate )
END PROGRAM walk
The issue was the random_number calls: I'm guessing the threads were sharing seed state. I solved it by instead using the random-number-generating function ran.
rand1 = ran(s)
rand2 = ran(s)
ran lets you pass in the seed s, which I made threadprivate and gave the save attribute. ran updates the seed only under ifort and not gfortran, so I can't use it with the latter. ran also sometimes returns 0, which I personally need to always check for and discard. I also need to ensure that all threads start with a different seed.
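For reference, here is a rough sketch of the same idea without the non-standard ran, using a hand-rolled per-thread LCG; the module name, the LCG constants, and the per-thread seed offset are just placeholders I picked, not something taken from the code above:
! Rough per-thread generator: each thread keeps its own saved seed via
! THREADPRIVATE, so no seed state is shared between threads.
module thread_rng
  use iso_fortran_env, only: int64
  implicit none
  integer(int64), save :: s = 104729_int64
  !$omp threadprivate(s)
contains
  subroutine seed_thread(tid)
    ! Give every thread a distinct starting seed (placeholder offset)
    integer, intent(in) :: tid
    s = 104729_int64 + 7919_int64 * int(tid, int64)
  end subroutine seed_thread

  function thread_ran() result(r)
    ! Simple LCG modulo a prime; loops until the result is nonzero,
    ! since exact zeros need to be discarded (as noted above)
    double precision :: r
    do
      s = mod(s * 279470273_int64, 4294967291_int64)
      r = dble(s) / 4294967291.d0
      if (r > 0.d0) exit
    end do
  end function thread_ran
end module thread_rng
Each thread calls seed_thread(omp_get_thread_num()) once at the top of the parallel region (with use omp_lib), then draws numbers with thread_ran() in place of random_number or ran.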
I have two approaches to rounding numbers to a given number of decimals. The first function just rounds a number to the nearest integer:
function round(num)
local under = math.floor(num)
local over = math.floor(num) + 1
local underV = -(under - num)
local overV = over - num
if overV > underV then
return under
else
return over
end
end
The next two functions use this function to round a number to decimals:
function roundf(num, dec)
return round(num * (1 * dec)) / (1 * dec)
end
function roundf_alt(num, dec)
local r = math.exp(1 * math.log(dec));
return round(r * num) / r;
end
Why not simply
function round(num)
return num >= 0 and math.floor(num+0.5) or math.ceil(num-0.5)
end
Instead of math.floor(num) + 1 you can simply use math.ceil(num) btw.
Why do you multiply with 1 multiple times?
There are many things to consider when rounding numbers. Please do some research on how to handle special cases.
The following code just generates a simple triple of random numbers:
program testrand
integer, parameter :: nz = 160, nf = 160, nlt = 90
real :: tmpidx(3)
integer :: idxarr(3), idx1, idx2, idx3, seed_size, ticks, i
integer, allocatable :: seed(:)
call random_seed(size=seed_size)
allocate(seed(seed_size))
call system_clock(count=ticks)
seed = ticks+37*(/(i-1, i=1,seed_size)/)
call random_seed(put=seed)
deallocate(seed)
call random_number(tmpidx)
idxarr = tmpidx * (/nz, nf, nlt/)
idx1 = max(1,idxarr(1))
idx2 = max(1,idxarr(2))
idx3 = max(1,idxarr(3))
print *,idx1, idx2, idx3
end program
I compile this with gfortran and run a few times and I get:
> gfortran testrand.f90
> ./a.out
74 98 86
> ./a.out
113 3 10
> ./a.out
44 104 27
Looks pretty random. Now I compile with PGI Fortran and run a few times:
> pgf90 testrand.f90
> ./a.out
1 1 1
> ./a.out
1 1 1
> ./a.out
1 1 1
Of course, there's no way to be completely sure, but I suspect this is not random. :) Anyone know what is going on here? Anyone know the right way to get random numbers with PGI Fortran?
Somehow, PGI does not implement system_clock the way the GNU compilers do. I do not know why; I found this out recently while doing similar things.
To see what I am talking about, just print ticks after calling system_clock: chances are you get 0 every time with PGI and varying numbers with the GNU compilers. To solve your problem, you can adapt the code below. It is a slightly modified version of code that you can get from the GNU Fortran website.
program testrand
use iso_fortran_env, only: int64
integer, parameter :: nz = 160, nf = 160, nlt = 90
real :: tmpidx(3)
integer :: idxarr(3), idx1, idx2, idx3, seed_size, ticks
integer, allocatable :: seed(:)
call random_seed(size=seed_size)
allocate(seed(seed_size))
! call system_clock(count=ticks)
! seed = ticks+37*(/(i-1, i=1,seed_size)/)
! call random_seed(put=seed)
!
! deallocate(seed)
call init_random_seed()
call random_number(tmpidx)
idxarr = tmpidx * (/nz, nf, nlt/)
idx1 = max(1,idxarr(1))
idx2 = max(1,idxarr(2))
idx3 = max(1,idxarr(3))
print *,idx1, idx2, idx3
contains
!
subroutine init_random_seed()
implicit none
integer, allocatable :: seed(:)
integer :: i, n, istat, dt(8), pid
integer(int64) :: t
integer, parameter :: un=703
call random_seed(size = n)
allocate(seed(n))
! First try if the OS provides a random number generator
open(unit=un, file="/dev/urandom", access="stream", &
form="unformatted", action="read", status="old", iostat=istat)
if (istat == 0) then
read(un) seed
close(un)
else
! Fall back to XORing the current time and the PID. The PID is
! useful in case one launches multiple instances of the same
! program in parallel.
call system_clock(t)
if (t == 0) then
call date_and_time(values=dt)
t = (dt(1) - 1970) * 365_int64 * 24 * 60 * 60 * 1000 &
+ dt(2) * 31_int64 * 24 * 60 * 60 * 1000 &
+ dt(3) * 24_int64 * 60 * 60 * 1000 &
+ dt(5) * 60 * 60 * 1000 &
+ dt(6) * 60 * 1000 + dt(7) * 1000 &
+ dt(8)
end if
pid = getpid()
t = ieor( t, int(pid, kind(t)) )
do i = 1, n
seed(i) = lcg(t)
end do
end if
call random_seed(put=seed)
!print*, "optimal seed = ", seed
end subroutine init_random_seed
!
function lcg(s)
integer :: lcg
integer(int64), intent(in out) :: s
if (s == 0) then
s = 104729
else
s = mod(s, 4294967296_int64)
end if
s = mod(s * 279470273_int64, 4294967291_int64)
lcg = int(mod(s, int(huge(0), 8)), kind(0))
end function lcg
!
!this option is especially used for pgf90 to provide a getpid() function
!> @brief Returns the process ID of the current process
!! @todo write the actual code; for now returns a fixed value
!<
function getpid() result(pid)
integer pid
pid = 53 !just a prime number, no special meaning
end function getpid
end program
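As a quick check of the system_clock behaviour described above, something like this just prints the raw count so you can see what each compiler returns:
program check_clock
  implicit none
  integer :: ticks
  ! gfortran typically prints a varying, nonzero count here; if the
  ! compiler returns 0, clock-based seeding degenerates to a constant seed
  call system_clock(count=ticks)
  print *, 'system_clock count = ', ticks
end program check_clock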
I am working to solve Poisson's equation (in 2d axisymmetric cylindrical coordinates) using the Jacobi method. The L2 norm decreases from ~1E3 on the first iteration (I have a really bad guess) to ~0.2 very slowly. Then, the L2 norm begins to increase over many iterations.
My geometry is parallel plates with sharp points at r = 0 on both plates. (If that matters).
Is there some error in my code? Do I need to go to a different algorithm? (I have a not yet working DADI algorithm.)
Here is my Jacobi method subroutine; it is just wrapped in a while loop (a sketch of that driver loop is shown after the subroutine below).
subroutine Jacobi(PoissonRHS, V, resid)
implicit none
real, dimension(0:,0:) :: PoissonRHS, V
REAL resid
integer :: i,j, lb, ub
real, dimension(0:size(V,1)-1, 0:size(V,2)-1) :: oldV
real :: dr = delta(1)
real :: dz = delta(2)
real :: dr2 = (delta(1))**(-2)
real :: dz2 = (delta(2))**(-2)
integer :: M = cells(1)
integer :: N = cells(2)
oldV = V
!Note: All of the equations are second order accurate
!If at r = 0 and in the computational domain
! This is the smoothness condition, dV(r=0)/dr = 0
V(0,:) = (4.0*oldV(1,:)-oldV(2,:))/3.0
!If at r = rMax and in the computational domain
! This is an approximation and should be fixed to improve accuracy, it should be
! lim r->inf V' = 0, while this is V'(r = R) = 0
V(M, 1:N-1) = 0.5 / (dr2 + dz2) * ( &
(2.0*dr2)*oldV(M-1,1:N-1) + &
dz2 * (oldV(M,2:N) + oldV(M,0:N-2)) &
- PoissonRHS(M,1:N-1))
do i = 1, M-1
lb = max(0, nint(lowerBoundary(i * dr) / dz)) + 1
ub = min(N, nint(upperBoundary(i * dr) / dz)) - 1
V(i,lb:ub) = 0.5 / (dr2 + dz2) * ( &
((1.0 - 0.5/dble(i))*dr2)*oldV(i-1,lb:ub) + &
((1.0 + 0.5/dble(i))*dr2)*oldV(i+1,lb:ub) + &
dz2 * (oldV(i,lb+1:ub+1) + oldV(i,lb-1:ub-1)) &
- PoissonRHS(i,lb:ub))
V(i, 0:lb-1) = V0
V(i, ub+1:N) = VL
enddo
!compare to old V values to check for convergence
resid = sqrt(sum((oldV-V)**2))
return
end subroutine Jacobi
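The surrounding driver loop is roughly this (tol and maxIter are placeholder values, not my actual settings, and the module containing Jacobi, the arrays, and the boundary data are assumed to be set up already):
! Placeholder driver; assumes PoissonRHS, V, delta, cells, V0, VL, ... exist
real, parameter :: tol = 1.0e-6          ! placeholder tolerance
integer, parameter :: maxIter = 100000   ! placeholder iteration cap
real :: resid
integer :: iter

resid = huge(resid)
iter = 0
do while (resid > tol .and. iter < maxIter)
  call Jacobi(PoissonRHS, V, resid)
  iter = iter + 1
end do
print *, "iterations:", iter, "  final residual:", resid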
Based on additional reading, it seems it was a precision problem, because (for example) I had the expression
V(i,lb:ub) = 0.5 / (dr2 + dz2) * ( &
((1.0 - 0.5/dble(i))*dr2)*oldV(i-1,lb:ub) + &
((1.0 + 0.5/dble(i))*dr2)*oldV(i+1,lb:ub) + &
dz2 * (oldV(i,lb+1:ub+1) + oldV(i,lb-1:ub-1)) &
- PoissonRHS(i,lb:ub))
where dr2 and dz2 are very large. By distributing these factors I got terms that are ~1, and the code now converges (slowly, but that is a function of the mathematics).
So my new code is
subroutine Preconditioned_Jacobi(PoissonRHS, V, resid)
implicit none
real, dimension(0:,0:) :: PoissonRHS, V
REAL resid
integer :: i,j, lb, ub
real, dimension(0:size(V,1)-1, 0:size(V,2)-1) :: oldV
real :: dr = delta(1)
real :: dz = delta(2)
real :: dr2 = (delta(1))**(-2)
real :: dz2 = (delta(2))**(-2)
real :: b,c,d
integer :: M = cells(1)
integer :: N = cells(2)
b = 0.5*(dr**2)/((dr**2) + (dz**2))
c = 0.5*(dz**2)/((dr**2) + (dz**2))
d = -0.5 / (dr2 + dz2)
oldV = V
!Note: All of the equations are second order accurate
!If at r = 0 and in the computational domain
! This is the smoothness condition, dV(r=0)/dr = 0
V(0,:) = (4.0*oldV(1,:)-oldV(2,:))/3.0 !same as: oldV(0,:) - 2.0/3.0 * (1.5 * oldV(0,:) - 2.0 * oldV(1,:) + 0.5 * oldV(2,:) - 0)
!If at r = rMax and in the computational domain
! This is an approximation and should be fixed to improve accuracy, it should be
! lim r->inf V' = 0, while this is V'(r = R) = 0
V(M,1:N-1) = d*PoissonRHS(M,1:N-1) &
+ 2.0*c * oldV(M-1,1:N-1) &
+ b * ( oldV(M,0:N-2) + oldV(M,2:N) )
do i = 1, M-1
lb = max(0, nint(lowerBoundary(i * dr) / dz)) + 1
ub = min(N, nint(upperBoundary(i * dr) / dz)) - 1
V(i,lb:ub) = d*PoissonRHS(i,lb:ub) &
+ (c * (1.0-0.5/dble(i)) * oldV(i-1,lb:ub)) &
+ (c * (1.0+0.5/dble(i)) * oldV(i+1,lb:ub)) &
+ b * (oldV(i,lb-1:ub-1) + oldV(i,lb+1:ub+1))
V(i, 0:lb-1) = V0
V(i, ub+1:N) = VL
enddo
!compare to old V values to check for convergence
resid = sum(abs(oldV-V))
return
end subroutine Preconditioned_Jacobi
I have a fortran MPI code in which a compute intensive function is invoked on every element of a 2D array. I'm trying to split the tasks among the ranks. For example if there are 30 columns and 10 ranks, then each rank gets 3 columns. The following code does this split and gathers the results using allgather. But the final array doesn't have the values from all ranks.
program allgather
include 'mpif.h'
!create a 2 x 30 myarray
integer :: x=2,y=30
integer :: numprocs,myid
integer :: i,j,k,myelements,mycolumns,jb,je
integer*4,dimension(:),allocatable :: displacement,recvcnt
real :: checksum
real,dimension(:,:),allocatable :: myarr,combinedarr
call MPI_INIT(IERR)
call MPI_COMM_SIZE(MPI_COMM_WORLD,NUMPROCS,IERR)
call MPI_COMM_RANK(MPI_COMM_WORLD,MYID,IERR)
mycolumns = y/numprocs
myelements = x * mycolumns
allocate(displacement(numprocs),recvcnt(numprocs))
jb = 1 + ( myid * mycolumns )
je = ( myid + 1 ) * mycolumns
allocate(myarr(x,mycolumns))
allocate(combinedarr(x,y))
myarr(:,:) =0
do j=jb,je
do i=1,x
myarr(i,j) = 1
enddo
enddo
!myarr(:,:)=1
if(mod(y,numprocs) > 0) then
if(myid==numprocs-1) then
jb=(myid + 1) * mycolumns + 1
do j=jb,y
do i=1,x
myarr(i,j) = 1
enddo
enddo
endif
endif
combinedarr(:,:) =0
recvcnt(:)=myelements
do k=1,numprocs
displacement(k) = (k-1) *myelements
enddo
call MPI_ALLGATHERV(myarr,myelements,MPI_REAL,combinedarr,recvcnt,displacement,MPI_REAL,MPI_COMM_WORLD,IERR)
if(mod(y,numprocs) > 0) then
recvcnt(:) = 0
recvcnt(numprocs) = (x*y) - myelements * (numprocs)
displacement(numprocs) = displacement(numprocs) + myelements
call MPI_ALLGATHERV(myarr,recvcnt(numprocs),MPI_REAL,combinedarr,recvcnt,displacement,MPI_REAL,MPI_COMM_WORLD,IERR)
endif
if (myid==0) then
checksum=0
write(6,*) "mycolumns:",mycolumns,"myelements:",myelements
do j=1,y
do i=1,x
checksum = checksum + combinedarr(i,j)
enddo
enddo
write(6,*) checksum
endif
end
First of all, you are using MPI_ALLGATHERV() just as MPI_ALLGATHER() and get no benefit from its ability to send different numbers of elements to/from each process. But that is not the error in your program. The error lies in the way you fill myarr. You allocate it as myarr(x,mycolumns), but when filling it from column jb to column je you go past the end of the array in all processes but rank 0, since jb and je are greater than mycolumns there. Thus myarr contains ones only on rank 0 and zeroes on all other ranks. So, yes, the final array does not have the values you expect, but that is because you filled it incorrectly, not because of the way the MPI subroutines are used.
Writing past the end of an allocatable array destroys the hidden structures that are used to manage heap allocation and usually crashes the program. In your case you are just lucky that it does not - I ran your code with Open MPI and it crashed with core dumps each time.
And you are also missing a call to MPI_FINALIZE() at the end of your code.
Hint: use the Fortran 90 interface if available - replace include 'mpif.h' with use mpi
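For illustration, the fill loop could use local column indices so that nothing is written past the end of myarr - a minimal fragment reusing the variable names from your code (the full corrected program is in the follow-up post below):
! Fill only the local block of mycolumns columns, using the local
! column index j - jb + 1, so no write goes past the end of myarr
do j = jb, je
    do i = 1, x
        myarr(i, j - jb + 1) = 1
    enddo
enddo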
Here is the final version of the code. I have implemented the fixes suggested by "Hristo Iliev" and also fixed the part where the number of ranks does not evenly divide the number of columns; here the last rank does the computation on the leftover columns.
program allgather
include 'mpif.h'
!create a 4 x 6 myarray
integer :: x=4,y=6
integer :: numprocs,myid
integer :: i,j,k,myelements,mycolumns,jb,je,jbb
integer*4,dimension(:),allocatable :: displacement,recvcnt
real :: checksum
real,dimension(:,:),allocatable :: myarr,combinedarr
call MPI_INIT(IERR)
call MPI_COMM_SIZE(MPI_COMM_WORLD,NUMPROCS,IERR)
call MPI_COMM_RANK(MPI_COMM_WORLD,MYID,IERR)
mycolumns = y/numprocs
myelements = x * mycolumns
allocate(displacement(numprocs),recvcnt(numprocs))
jb = 1 + ( myid * mycolumns )
je = ( myid + 1 ) * mycolumns
allocate(myarr(x,y))
allocate(combinedarr(x,y))
myarr(:,:) =0
do j=jb,je
do i=1,x
myarr(i,j) = (j-1) * x + i
enddo
enddo
if(mod(y,numprocs) > 0) then
if(myid==numprocs-1) then
jbb=(myid + 1) * mycolumns + 1
do j=jbb,y
do i=1,x
myarr(i,j) = (j-1) * x + i
enddo
enddo
endif
endif
combinedarr(:,:) =0
recvcnt(:)=myelements
do k=1,numprocs
displacement(k) = (k-1) *myelements
enddo
call MPI_ALLGATHERV(myarr(1,jb),myelements,MPI_REAL,combinedarr,recvcnt,displacement,MPI_REAL,MPI_COMM_WORLD,IERR)
if(mod(y,numprocs) > 0) then
recvcnt(:) = 0
recvcnt(numprocs) = (x*y) - myelements * (numprocs)
displacement(numprocs) = displacement(numprocs) + myelements
call MPI_ALLGATHERV(myarr(1,jbb),recvcnt(numprocs),MPI_REAL,combinedarr,recvcnt,displacement,MPI_REAL,MPI_COMM_WORLD,IERR)
endif
if (myid==0) then
checksum=0
write(6,*) "mycolumns:",mycolumns,"myelements:",myelements
do j=1,y
do i=1,x
checksum = checksum + combinedarr(i,j)
enddo
enddo
write(6,*) checksum
endif
call MPI_FINALIZE(IERR)
end