I wrote a Fortran program implementing this algorithm (https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_A-ExpJ). It works on my computer, but after asking these two questions (Intrinsic Rand: what is the interval, [0,1], ]0,1] or [0,1[? and How far can we trust calculus with infinity?), I think I could have a problem with log(random()), because call random_number(Xw); Xw = log(Xw) is used.
Indeed, random_number(Xw) can return 0, and log(0) = -infinity.
Therefore, I plan to modify this line as follows: call random_number(Xw); Xw = log(1-Xw), to change the interval of the random value from [0,1[ to ]0,1].
Is this a good idea, or is there a better solution?
While mathematically it is true that if X is uniformly distributed on (the real interval) [0,1) then 1-X is uniformly distributed on (0,1], this does not particularly help you.
As noted in the description of the algorithm to which you link, the underlying assumption is that the base uniform distribution is over the interval (0,1). This is not the same as (0,1].
You can use rejection sampling to generate X uniformly over (0,1) from random_number() (which is [0,1)): throw away all zero occurrences.
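For illustration, here is a minimal sketch of that rejection approach (the helper name random_open_interval is mine, not from the question or the algorithm):

program demo_open_interval
   use iso_fortran_env, only: real64
   implicit none
   real(real64) :: Xw

   Xw = log(random_open_interval())   ! never evaluates log(0)
   print *, Xw

contains

   ! Draw uniformly on (0,1): reject the (rare) exact zero that
   ! random_number, which samples [0,1), may return.
   function random_open_interval() result(x)
      real(real64) :: x
      do
         call random_number(x)
         if (x > 0.0_real64) exit
      end do
   end function random_open_interval

end program demo_open_interval

The expected number of retries is negligible, since a zero draw occurs with a probability on the order of the generator's resolution.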
Not a good idea. If you want your algorithm to be stable, you need to define bounds. Your log function represents a priority; it can be as low as you like, but it must be a finite number. You can bound it using numerical precision:
program t
   use iso_fortran_env
   implicit none
   ! Smallest positive normal real64, so that log(SAFE) stays finite (about -708).
   real(real64), parameter :: SAFE = tiny(0.0_real64)

   print *, log(randoms_in_range(100, SAFE, 1.0_real64))

contains

   elemental real(real64) function in_range(f, low, hi) result(x)
      real(real64), intent(in) :: f        ! in: [0,1)
      real(real64), intent(in) :: low, hi
      real(real64) :: frac
      frac = max(min(f, 1.0_real64), 0.0_real64)
      x = low + frac*(hi - low)
   end function in_range

   real(real64) function random_in_range(low, hi) result(x)
      real(real64), intent(in) :: low, hi
      call random_number(x)      ! [0,1)
      x = in_range(x, low, hi)   ! [low,hi]
   end function random_in_range

   function randoms_in_range(n, low, hi) result(x)
      integer,      intent(in) :: n
      real(real64), intent(in) :: low, hi
      real(real64) :: x(n)
      call random_number(x)      ! [0,1)
      x = in_range(x, low, hi)   ! [low,hi]
   end function randoms_in_range

end program
I want to perform a matrix-vector product in Fortran using the SGEMV subroutine from BLAS.
I have code similar to this:
program test
integer, parameter :: DP = selected_real_kind(15)
real(kind=DP), dimension (3,3) :: A
real(kind=DP), dimension (3) :: XP,YP
call sgemv(A,XP,YP)
A is a 3x3 matrix; XP and YP are vectors.
In the included module one can see the following code:
PURE SUBROUTINE SGEMV_F95(A,X,Y,ALPHA,BETA,TRANS)
! Fortran77 call:
! SGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
USE F95_PRECISION, ONLY: WP => SP
REAL(WP), INTENT(IN), OPTIONAL :: ALPHA
REAL(WP), INTENT(IN), OPTIONAL :: BETA
CHARACTER(LEN=1), INTENT(IN), OPTIONAL :: TRANS
REAL(WP), INTENT(IN) :: A(:,:)
REAL(WP), INTENT(IN) :: X(:)
REAL(WP), INTENT(INOUT) :: Y(:)
END SUBROUTINE SGEMV_F95
I understand that some of the parameters are optional, so where am I going wrong in the call?
When you look at BLAS or LAPACK routines, you should always have a look at the first letter:
S: single precision
D: double precision
C: single precision complex
Z: double precision complex
You defined your matrix A as well as the vectors XP and YP as double precision using the statement:
integer, parameter :: DP = selected_real_kind(15)
So for this, you need to use dgemv or define your precision as single precision.
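If you go the single-precision route, the declarations could look like this (a sketch, declarations only; choosing the kind via selected_real_kind(6) is my assumption, not part of the original code):

program test_sp
   implicit none
   ! Single-precision kind, matching what sgemv expects.
   integer, parameter :: SP = selected_real_kind(6)
   real(kind=SP), dimension(3,3) :: A
   real(kind=SP), dimension(3)   :: XP, YP
end program test_sp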
There is also a difference between calling dgemv and dgemv_f95: dgemv_f95 is part of Intel MKL and not really a common naming convention. For portability reasons, I would not use that notation but stick to the classic dgemv, which is also part of Intel MKL.
DGEMV performs one of the matrix-vector operations
y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y,
where alpha and beta are scalars, x and y are vectors and A is an
m by n matrix.
If you want to know how to call the function, I suggest having a look here, but in the end it should look something like this:
call DGEMV('N',3,3,ALPHA,A,3,XP,1,BETA,YP,1)
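Put together, a minimal self-contained sketch could look like the following (the values of A, XP, ALPHA and BETA are placeholders chosen for illustration; link against a BLAS library, e.g. with -lblas):

program test_dgemv
   implicit none
   integer, parameter :: DP = selected_real_kind(15)
   real(kind=DP), dimension(3,3) :: A
   real(kind=DP), dimension(3)   :: XP, YP
   real(kind=DP) :: ALPHA, BETA

   ! Columns of A are (1,2,3), (4,5,6), (7,8,9).
   A     = reshape([1.0_DP, 2.0_DP, 3.0_DP, &
                    4.0_DP, 5.0_DP, 6.0_DP, &
                    7.0_DP, 8.0_DP, 9.0_DP], [3,3])
   XP    = [1.0_DP, 1.0_DP, 1.0_DP]
   YP    = 0.0_DP
   ALPHA = 1.0_DP
   BETA  = 0.0_DP

   ! DGEMV(TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY)
   ! computes YP := ALPHA*A*XP + BETA*YP
   call DGEMV('N', 3, 3, ALPHA, A, 3, XP, 1, BETA, YP, 1)

   print *, YP   ! expected: 12, 15, 18 for this A and XP
end program test_dgemv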
The precisions are incompatible: you are calling sgemv, which takes single precision arguments, but you are passing double precision arrays and vectors.
Perhaps the trans parameter is required?
trans: Must be 'N', 'C', or 'T'.
(As per the note at the bottom of the Developer Reference for Intel® Math Kernel Library - Fortran.)
In the main part of my Fortran code I have these lines:
Gmat=0
do i=1,indCompMax
   do j=(i-1)*useSymFlag+1,nsit-(i-1)*useSymFlag
      l=1
      do while (G0poles(l,2)/=0)
         Gmat(i,j)=Gmat(i,j)+real(G0int(i,j,l))/(omega(k)-G0poles(l,1))**G0poles(l,3)
         l=l+1
      enddo
   enddo
enddo
call ExtendBySymmetry(Gmat)
This part is repeated several times in the code, so I defined this subroutine:
!=============================================================================
SUBROUTINE EvaluateFunc(matrixPol,matrixInt,z,matrix)
   use NAGmodule
   integer i,j,k
   REAL*8, DIMENSION(Npoles,3) :: matrixPol
   COMPLEX*16, DIMENSION(nsit,nsit,Npoles) :: matrixInt
   COMPLEX*16, DIMENSION(nsit,nsit) :: matrix
   COMPLEX*16 :: z

   do i=1,indCompMax
      do j=(i-1)*useSymFlag+1,nsit-(i-1)*useSymFlag
         k=1
         do while (matrixPol(k,2)/=0)
            matrix(i,j)=matrix(i,j)+real(matrixInt(i,j,k))/(z-matrixPol(k,1))**matrixPol(k,3)
            k=k+1
         enddo
      enddo
   enddo
   call ExtendBySymmetry(matrix)
end
The problem is that if I use this subroutine, the evaluation of the output matrix takes much longer (around 5 times slower) than the same evaluation done directly in the main part of the code.
How can I optimize the code and make the evaluation in the subroutine faster?
UPDATE: Thanks for the reply. First, the operation **matrixPol(k,3) is also present in the main code; I forgot to write it in the post.
For the comparison (matrixPol(k,2)/=0) there is no problem, because starting from a certain position of the vector all the elements are exactly zero.
Computing the prefactor outside the i,j loop helped to speed up the subroutine, while switching the two indices i and j has practically no effect. Here are the running times of the subroutine:
All in the main part: 1.688 s
My old subroutine: 19.063 s
With factor outside the i,j loop: 5.193 s
Switching the indices i and j: 5.281 s
With dot_product: 4.958 s
But the subroutine is still more than 2.5 times slower.
Here is a minimal example:
module NAGmodule
implicit none
real*8, allocatable :: hmat(:,:),eval(:),eigmat(:,:)
real*8, allocatable :: G0poles(:,:)
complex*16, allocatable :: G0int(:,:,:)
complex*16, allocatable :: Gmat(:,:)
real*8, allocatable :: omega(:)
integer :: numpoles, halffillingflag, iter, indCompMax
complex*16 :: omegaComplex
real*8, parameter :: pi=3.141592653589793
integer, parameter :: out_unit=10
integer, parameter :: timeFag=1
integer :: counti, countf, count_rate
real :: dt
integer, parameter :: Npoles=1000
real*8, parameter :: U=4
real*8, parameter :: omegamin=-20
real*8, parameter :: omegamax=20
integer, parameter :: Nomega=1500000
integer, parameter :: nsit = 4
integer, parameter :: nup = 1
integer, parameter :: ndw = 1
integer, parameter :: PBCflag=1
integer, parameter :: useSymFlag=1
end module NAGmodule
use nagmodule
integer :: i,j,k,l,m,n,p,q
REAL*8 t1,t2
allocate(hmat(nsit,nsit),eigmat(nsit,nsit),eval(nsit))
allocate(G0poles(Npoles,3),G0int(nsit,nsit,Npoles))
allocate(omega(Nomega))
allocate(Gmat(nsit,nsit))
indCompMax=1
hmat=0.
do i=1,(nsit-1)
hmat(i,i+1)=-1
hmat(i+1,i)=-1
enddo
if (PBCflag==1) then
hmat(1,nsit)=-1
hmat(nsit,1)=-1
end if
call NAGdiag(nsit)
eigmat=hmat
do k=1,Nomega
omega(k)=(omegamax-omegamin)/(Nomega-1)*(k-1)+omegamin
enddo
do k=1,nup
G0poles(k,1)=eval(k)
G0poles(k,2)=-1
G0poles(k,3)=1
enddo
do k=(nup+1),nsit
G0poles(k,1)=eval(k)
G0poles(k,2)=1
G0poles(k,3)=1
enddo
do k=1,nsit
G0int(k,k,k)=1
if ((k==nup).AND.(abs(eval(k)-eval(k+1))<1e-15)) then
G0int(k,k,k)=0.5
G0int(k+1,k+1,k)=0.5
else if ((k==nup+1).AND.(abs(eval(k)-eval(k-1))<1e-15)) then
G0int(k,k,k)=0.5
G0int(k-1,k-1,k)=0.5
end if
enddo
do k=1,nsit
G0int(:,:,k)=matmul(eigmat,matmul(G0int(:,:,k),transpose(eigmat)))
enddo
t1=0
t2=0
do k=1,Nomega
omegaComplex=CMPLX(omega(k),0)
call system_clock(counti,count_rate)
Gmat=0
call EvaluateFunc3(G0poles,G0int,omegaComplex,Gmat)
call system_clock(countf)
dt=REAL(countf-counti)/REAL(count_rate)
t1=t1+dt
call system_clock(counti,count_rate)
Gmat=0
do i=1,indCompMax
do j=(i-1)*useSymFlag+1,nsit-(i-1)*useSymFlag
l=1
do while (G0poles(l,2)/=0)
Gmat(i,j)=Gmat(i,j)+real(G0int(i,j,l))/(omega(k)-G0poles(l,1))
l=l+1
enddo
enddo
enddo
call ExtendBySymmetry(Gmat)
call system_clock(countf)
dt=REAL(countf-counti)/REAL(count_rate)
t2=t2+dt
enddo
write(*,*)'time with subroutine',t1
write(*,*)'time main',t2
stop
end
!=================================================================================
SUBROUTINE EvaluateFunc3(matrixPol,matrixInt,z,matrix)
use NAGmodule
integer i,j,k
REAL*8, DIMENSION(Npoles,3) :: matrixPol
COMPLEX*16, DIMENSION(nsit,nsit,Npoles) :: matrixInt
COMPLEX*16, DIMENSION(nsit,nsit) :: matrix
COMPLEX*16 :: z
integer :: maxPoles
COMPLEX*16, DIMENSION(Npoles) :: factor
maxPoles=0
do while (matrixPol(maxPoles+1,2)/=0)
maxPoles=maxPoles+1
enddo
factor(:maxPoles) = (1.,0.)/(z-matrixPol(:maxPoles,1))**matrixPol(:maxPoles,3)
do j=1,indCompMax
do i=(j-1)*useSymFlag+1,nsit-(j-1)*useSymFlag
matrix(i,j)=matrix(i,j)+dot_product(matrixInt(i,j,1:maxPoles),factor(1:maxPoles))
enddo
enddo
call ExtendBySymmetry2(matrix)
end
!=================================================================================
SUBROUTINE ExtendBySymmetry2(matrix)
use NAGmodule
COMPLEX*16, DIMENSION(nsit,nsit) :: matrix
integer k,i,j,l,m
if ((PBCflag==1).AND.(useSymFlag==1)) then
do i=2,nsit
matrix(2:nsit,i)=matrix(1:nsit-1,i-1)
matrix(1,i)=matrix(nsit,i-1)
enddo
else if ((PBCflag==0).AND.(useSymFlag==1)) then
do j=1,nsit/2
do i=j,nsit-j+1
matrix(j,i)=matrix(i,j)
matrix(nsit-i+1,nsit-j+1)=matrix(i,j)
matrix(nsit-j+1,nsit-i+1)=matrix(i,j)
enddo
enddo
end if
end
!=================================================================================
SUBROUTINE ExtendBySymmetry(matrix)
use NAGmodule
COMPLEX*16, DIMENSION(nsit,nsit) :: matrix
integer k,i,j,l,m
if ((PBCflag==1).AND.(useSymFlag==1)) then
do i=2,nsit
matrix(i,2:nsit)=matrix(i-1,1:nsit-1)
matrix(i,1)=matrix(i-1,nsit)
enddo
else if ((PBCflag==0).AND.(useSymFlag==1)) then
do i=1,nsit/2
do j=i,nsit-i+1
matrix(j,i)=matrix(i,j)
matrix(nsit-i+1,nsit-j+1)=matrix(i,j)
matrix(nsit-j+1,nsit-i+1)=matrix(i,j)
enddo
enddo
end if
end
!=================================================================================
SUBROUTINE NAGdiag(nsit1)
use NAGmodule
real*8, allocatable :: WORK(:)
integer, allocatable :: IWORK(:)
CHARACTER JOB, UPLO
EXTERNAL dsyevd
NMAX=nsit1
LDA=NMAX
LWORK=4*NMAX*NMAX+100
LIWORK=5*NMAX
LIWORK=10*NMAX
ALLOCATE(WORK(LWORK),IWORK(LIWORK))
JOB='V'
UPLO='L'
CALL dsyevd(JOB,UPLO,nsit1,hmat,LDA,eval,WORK,LWORK,IWORK,LIWORK,INFO)
IF (INFO.GT.0) THEN
WRITE (*,*) 'Failure to converge.'
stop
endif
deALLOCATE(WORK,IWORK)
return
end
Due to several edits of the original question, this answer is partially superfluous by now. However, the optimization part is still valid.
The real issue with your code is that you are passing z to the subroutine as a complex number (omegaComplex), while omega(k) is real. As a result, the exponentiation and division are performed as complex operations instead of real ones.
Fixing z to be real (and factor in the optimization below as well) leads to the expected results. With optimizations I get:
time with subroutine 0.24000001139938831
time main 0.35700001695659012
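To see why this matters, here is a small standalone timing sketch (my own illustration, not part of the original code) comparing real and complex division; the exact numbers depend on compiler and flags:

program complex_vs_real
   implicit none
   integer, parameter :: n = 10000000
   integer :: k, counti, countf, count_rate
   real*8 :: sr, xr
   complex*16 :: sc, xc

   sr = 0.d0
   call system_clock(counti, count_rate)
   do k = 1, n
      xr = 1.d0 + mod(k, 7)
      sr = sr + 1.d0/xr                      ! real division
   enddo
   call system_clock(countf)
   print *, 'real    :', real(countf-counti)/real(count_rate), sr

   sc = (0.d0, 0.d0)
   call system_clock(counti, count_rate)
   do k = 1, n
      xc = cmplx(1.d0 + mod(k, 7), 0.d0, kind(0.d0))
      sc = sc + (1.d0, 0.d0)/xc              ! complex division
   enddo
   call system_clock(countf)
   print *, 'complex :', real(countf-counti)/real(count_rate), sc
end program complex_vs_real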
Original answer:
First of all, the subroutine does not do the same operations as your inline code. The operation **matrixPol(k,3) raises a complex number to a power, which is computationally heavy. No wonder the subroutine is a lot slower.
I see a few possibilities to accelerate your code:
The divisor (z-matrixPol(k,1))**matrixPol(k,3) is independent of i and j and can be taken out of the loop.
Divisions are more expensive than multiplications. Therefore, you should pre-compute factor = 1/div outside the loop, and multiply with factor in the loop.
The loop only terminates when matrixPol(k,2) compares exactly equal to zero, which will almost never happen unless you set the corresponding values to exactly zero. I assume you know the number of your poles before you call the subroutine, so why not pass it along and save yourself this comparison? If that is not possible, determine the number of poles inside the subroutine before the main loop. Then, the inner loop over k is much simpler.
Inside the loop, you convert the input matrix to real over and over again. This can be done outside the loop in one go. Or, even better, just pass only the real part to the function!
At this point your code looks something like:
!=============================================================================
SUBROUTINE EvaluateFunc(matrixPol,matrixInt,z,matrix)
   use NAGmodule
   integer i,j,k
   REAL*8, DIMENSION(Npoles,3) :: matrixPol
   COMPLEX*16, DIMENSION(nsit,nsit,Npoles) :: matrixInt
   COMPLEX*16, DIMENSION(nsit,nsit) :: matrix
   COMPLEX*16 :: z, factor(Npoles)
   REAL*8, DIMENSION(nsit,nsit,Npoles) :: matrixInt_re
   integer :: maxPoles

   ! Determine maximum number of poles
   do k=1,Npoles
      ! Only valid if the poles are set to exactly zero outside. If not,
      ! use ( abs(matrixPol(k,2)) <= someEpsilon )
      if ( matrixPol(k,2) == 0 ) then
         maxPoles = k-1
         exit
      endif
   enddo

   ! Pre-compute factors
   factor(:maxPoles) = (1.,0.)/(z-matrixPol(:maxPoles,1))**matrixPol(:maxPoles,3)

   ! Convert input to real
   matrixInt_re = real(matrixInt)

   do i=1,indCompMax
      do j=i,nsit-i+1
         do k=1,maxPoles
            matrix(i,j)=matrix(i,j)+matrixInt_re(i,j,k)*factor(k)
         enddo
      enddo
   enddo
   call ExtendBySymmetry(matrix)
end
Further optimization:
Rewriting the code like this it becomes apparent that the inner loop over k is nothing more than a dot product. This could potentially be sped up by the compiler. The main loop would then look like:
do i=1,indCompMax
   do j=i,nsit-i+1
      matrix(i,j)=matrix(i,j) + &
         dot_product( matrixInt_re(i,j,:maxPoles), factor(:maxPoles) )
   enddo
enddo
As chw21 noted, Fortran uses a column-major memory layout and you are accessing it in a row-major fashion. This potentially costs you a lot of performance. You should consider transposing your arrays in the main program, or at least switching the loops over i and j. I would prefer the first option, as the inner dot product would then operate on contiguous memory chunks.
Try to see whether you can swap the loops around. Since Fortran stores arrays in the order
(1, 1), (2, 1), (3, 1), ..., (n, 1), (1, 2), (2, 2), ...
memory access is far faster if your innermost loop runs along that first dimension.
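As a generic illustration of that loop order (a sketch, not code from the question), the inner loop should run over the first index:

program column_major_order
   implicit none
   integer, parameter :: m = 1000, n = 1000
   real*8 :: a(m,n), s
   integer :: i, j

   a = 1.d0
   s = 0.d0
   do j = 1, n          ! outer loop over the second index (columns)
      do i = 1, m       ! inner loop over the first index: contiguous in memory
         s = s + a(i,j)
      enddo
   enddo
   print *, s
end program column_major_order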