Consecutive calls to the same subroutine in Fortran with significantly slower performance

TL;DR
My program calls a subroutine twice. The subroutine performs essentially identical operations both times, but takes significantly longer to run the second time. The only difference between the two calls is that data produced in the first call is used as an input in the second call. This data is stored in allocatable arrays declared in a module before the first call to the subroutine.
Full Story
The following shows the relevant portions of my code for the problem:
program Economy
  !! Declarations !!
  use nag_library, only: nag_wp ! assumed: nag_wp comes from the NAG library module (not shown in the excerpt)
  use Modern_mod, only: Modern
  use Globals, only: Na, Ny, Ne, Vimp, Xmp, Pmp, FCp, Vimu,& ! Globals is a module containing variables.
      & Xmu, Pmu, FCu
  implicit none
  real(kind=nag_wp) :: param(4)
  !! Execution !!
  ! First call to Modern !
  param = (/1.0d0,2.0d0,3.0d0,4.0d0/)
  allocate(Vimp(Na,Ne),FCp(4,Na*Ne),Pmp(Ny,Ne),Xmp(Ny,Ne))
  call Modern(param,Vimp,FCp,Pmp,Xmp)
  ! Second call to Modern !
  param = (/5.0d0,6.0d0,7.0d0,8.0d0/)
  allocate(Vimu(Na,Ne),FCu(4,Na*Ne),Pmu(Ny,Ne),Xmu(Ny,Ne))
  call Modern(param,Vimu,FCu,Pmu,Xmu)
end program Economy
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module Modern_mod
  implicit none
contains
  subroutine Modern(param,Vim,FCm,Pm,Xm)
    !! Declarations !!
    ! Modules !
    use nag_library, only: nag_wp ! assumed source of nag_wp, as above
    use Globals, only: Na, Ne, Ny
    use FindVmp_mod, only: FindVmp
    use FindVmu_mod, only: FindVmu
    implicit none
    ! Declaring other variables !
    real(kind=nag_wp), intent(in) :: param(4) ! param-Parameters specific to one of the modern sectors;
    real(kind=nag_wp), intent(out), allocatable :: FCm(:,:), Xm(:,:),& ! FCm-Firm choices; Xm-Policy fun;
        & Pm(:,:), Vim(:,:) ! Pm-Price of a share; Vim-Start of period value function;
    real(kind=nag_wp), allocatable :: Vm1(:,:), Vim1(:,:), Pm1(:,:),& ! Vm1-Vm next guess; Pm1-Next share price guess;
        & Vm(:,:) ! Vm-End of period value function; Vim1-Next Vim guess;
    !! Execution !!
    ! Allocating and initializing functions !
    allocate(Vim(Na,Ne),FCm(4,Na*Ne),Vm(Ny,Ne),Pm(Ny,Ne),Xm(Ny,Ne))
    allocate(Vim1(Na,Ne),Vm1(Ny,Ne),Pm1(Ny,Ne))
    ! Initializing arrays !
    Vm = ...
    Vim = ...
    ...
    ! Doing calculations !
    if(param(1) .eq. 1.0d0) then
      call FindVmp(FCm,Vim,Pm,Vm1,Pm1,Xm) ! New value function guess for productive guys
    else
      call FindVmu(FCm,Vim,Pm,Vm1,Pm1,Xm) ! New value function guess for unproductive guys
    end if
  end subroutine Modern
end module Modern_mod
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module FindVmp_mod
  implicit none
contains
  subroutine FindVmp(FCm,Vim0,P0,Vm,P,x)
    !! Declarations !!
    use nag_library, only: nag_wp ! assumed source of nag_wp, as above
    use Globals, only: Ny, Ne ! needed by the allocate below (presumably omitted from the excerpt)
    use VmFp_mod, only: Calculations ! Contains the operations computing the final values of the outputs to FindVmp
    implicit none
    real(kind=nag_wp), allocatable, intent(out) :: Vm(:,:), x(:,:), P(:,:) ! Vm-New value function; x-Policy function; P-Share price;
    real(kind=nag_wp), intent(in) :: P0(:,:), Vim0(:,:), FCm(:,:) ! P0-Initial share price guess; Vim0-Initial guess for beginning of period value function;
                                                                  ! FCm-Firm choices;
    !! Execution !!
    ! Allocate matrices !
    allocate(Vm(Ny,Ne), x(Ny,Ne), P(Ny,Ne))
    ! Compute results !
    call Calculations(FCm,Vim0,P0,Vm,P,x)
  end subroutine FindVmp
end module FindVmp_mod
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module FindVmu_mod
  implicit none
contains
  subroutine FindVmu(FCm,Vim0,P0,Vm,P,x)
    !! Declarations !!
    use nag_library, only: nag_wp ! assumed source of nag_wp, as above
    use Globals, only: Ny, Ne, Vmp, Pmp ! Ny, Ne needed by the allocate below (presumably omitted from the excerpt)
    use VmFu_mod, only: Calculations ! Contains the operations computing the final values of the outputs to FindVmu
    implicit none
    real(kind=nag_wp), allocatable, intent(out) :: Vm(:,:), x(:,:), P(:,:) ! Vm-New value function; x-Policy function; P-Share price;
    real(kind=nag_wp), intent(in) :: P0(:,:), Vim0(:,:), FCm(:,:) ! P0-Initial share price guess; Vim0-Initial guess for beginning of period value function;
                                                                  ! FCm-Firm choices;
    !! Execution !!
    ! Allocate matrices !
    allocate(Vm(Ny,Ne), x(Ny,Ne), P(Ny,Ne))
    ! Compute results !
    call Calculations(FCm,Vim0,P0,Vm,P,x,Vmp,Pmp) ! Using the values of Vmp and Pmp computed in the first call to Modern
  end subroutine FindVmu
end module FindVmu_mod
Each run, Modern is fed different arrays of the same size and type (the *p and *u arrays, respectively), which are declared in the module Globals. Modern similarly calls one of two very similar subroutines, FindVmp or FindVmu, feeding them the corresponding arrays. FindVmp and FindVmu perform almost identical operations, except that the latter uses the values of Vmp and Pmp (computed in FindVmp) as inputs.
I've been trying to figure out why the second call to Modern takes up to an order of magnitude longer to complete compared to the first one.
My first guess was that, by allocating Vimp and Pmp at the beginning of the program and a bunch of other arrays afterwards, each reference to the former arrays might be costly because their memory addresses were far away from the arrays currently being computed in FindVmu (for reference, Na = 101, Ny = 91, Ne = 9). But then I read that allocatable arrays are stored on the heap, and that data on the heap isn't stacked (no pun intended), so this was not necessarily the origin of my problem. As a matter of fact, I've tried allocating all matrices at different points and in different orders in Modern, but I get roughly the same execution times.
In the same spirit, I've tried to vary how I declare some of the arrays in different subroutines (e.g. making some automatic instead of allocatable and using the compiler (ifort18) option to force them on the stack) and although I do get overall performance variations throughout the code, the relative performance of the two calls to Modern does not change.
Finally, I read in this thread that the more arrays you have in memory, the slower your code generally becomes. Although the explanation does make sense to me, this is the first time I have experienced such a significant performance loss in a Fortran program. If this were actually the problem I'm facing, I would have expected to run into it in any number of my previous projects. Is this nonetheless a plausible cause of what is happening here?
And I'm basically out of ideas...
Bonus Question
While we're at it, I've found that leaving out the following lines in Economy surprisingly (to me) does not lead to a segfault:
allocate(Vimp(Na,Ne),FCp(4,Na*Ne),Pmp(Ny,Ne),Xmp(Ny,Ne))
allocate(Vimu(Na,Ne),FCu(4,Na*Ne),Pmu(Ny,Ne),Xmu(Ny,Ne))
In words: if I don't manually allocate the arrays declared in Globals, it seems like the program does so automatically once I pass them to Modern. Is this standard behavior or was I just lucky when I was initially not allocating them myself?

Related

Can reading a variable be a data race in OpenMP?

Why does this OpenMP Fortran program work (every element of out is equal to num)? Each thread in the parallel loop might read the variable num simultaneously; I thought this was not acceptable?
program example
  implicit none
  integer i
  integer, parameter :: n = 100000
  double precision :: num
  double precision, dimension(n) :: out
  num = 1.123456789123456789123456d-5
  out = 0.d0
  !$OMP PARALLEL
  !$OMP DO
  do i=1,n
    out(i) = num
  enddo
  !$OMP END DO
  !$OMP END PARALLEL
  do i=1,n
    if (out(i).ne.num) print*,'Problem with ',i
  enddo
end program
Thanks so much for any insights.
Any race is between two things happening, so a read can be part of a race. However for the competition between two actions to be a race, there has to be a different outcome depending on the order in which the two actions occur.
Given that the possible actions in a parallel program which we are considering are read and write occurring in different threads, we have four possible cases:
Read, Read: no values are changed, and no code can detect which order the two reads occurred in (at least, not without looking at meta-data such as code performance in a system with caches :-)).
Read, Write: this clearly can be a race; whether the write wins the race or not affects the value which will be read.
Write, Read: as with case 2 (Read,Write), the result seen by the read is affected by the order.
Write, Write: here we have a race too, since we assume that someone will ultimately read the value, and which value they see will depend on the order of the writes.
So, reading a variable can be part of a race.
However, if your question is really "Is there a race if a variable is only read?", then the answer is "No".
Variables are shared by default in OpenMP, so they are accessible from all the threads. Furthermore, you're not writing to num, so even if all the threads were accessing the same memory (which here they probably aren't) there would be no issue.
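For illustration, here is a minimal sketch (not from the answers above) contrasting the harmless read-only case with a genuine read/modify/write race on a shared variable:
program race_sketch
  implicit none
  integer, parameter :: n = 100000
  integer :: i, total
  double precision :: num
  num = 1.0d0
  total = 0
  !$OMP PARALLEL DO
  do i = 1, n
    ! Reading the shared num in every thread is safe: nothing writes it.
    ! Updating the shared total is a read-modify-write race: two threads
    ! can read the same old value, so increments get lost.
    total = total + int(num)
  end do
  !$OMP END PARALLEL DO
  ! Because of the race, total may print as anything up to n;
  ! declaring the loop with reduction(+:total) would remove the race.
  print *, 'total =', total, ' (expected', n, ')'
end program race_sketch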

Random numbers keep coming out the same, despite random seed being used

I have the following small piece of code:
REAL(8) :: x
INTEGER :: i
call system_clock(i)
WRITE(*,*) 'cpu time', i
CALL random_seed(i)
CALL random_number(x)
WRITE(*,*) 'uniform RandVar', x
CPU time is working fine, but every time I run this I get the same uniform RandVar number = 0.99755959009261719, almost as if random_number were using the same default seed over and over again and ignoring random_seed.
What am I doing wrong?
The same seed may well be being used: that is processor-dependent. The reason for this is that your call to random_seed is not setting the seed.
With the reference
CALL random_seed(i)
the argument i is not the (intent(in)) seed, but is the (intent(out)) size of the seed used by the processor. This call is like
CALL random_seed(SIZE=i) ! SIZE is the first dummy argument
To set the seed you need to explicitly associate your value with the PUT dummy argument: call random_seed(put=seed). Here seed is a rank-1 array of size at least n, where n (again processor-dependent) is the size given by call random_seed(size=n); from your call, i holds this value.
Full details are given in 13.7.136 of F2008.
A common way to seed the generator is:
integer, allocatable :: seed(:)
integer size
call random_seed(size=size)
allocate(seed(size))
! set seed(:) somehow
call random_seed(put=seed)
Setting seed appropriately is not a simple process. I don't address how to do that here, but detail can be found in answers to this other question.
Use of srand(), which is mentioned in the comments, is non-standard.
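That said, purely as an illustration of the mechanics (not a recommendation for statistical quality), here is a minimal sketch that fills the seed array from system_clock:
program seed_sketch
  implicit none
  integer, allocatable :: seed(:)
  integer :: n, i, clock
  real(8) :: x
  call random_seed(size=n)                  ! query the processor-dependent seed size
  allocate(seed(n))
  call system_clock(count=clock)
  seed = clock + 37 * (/ (i, i = 1, n) /)   ! derive distinct values from the clock reading
  call random_seed(put=seed)
  call random_number(x)
  write(*,*) 'uniform RandVar', x
end program seed_sketch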

Allocating arrays in a Fortran Subroutine

I need to read a lot of data from a file in a Fortran program. The size of the data is variable, so I would like to allocate the arrays dynamically. My idea is to make a subroutine that reads all the data and allocates the memory. A simplified version of the program is:
program main
  implicit none
  real*8, dimension(:,:), allocatable :: v
  integer*4 n
  !This subroutine will read all the data and allocate the memory
  call Memory(v,n)
  !From here the program will have other subroutines to make calculations
end

subroutine Memory(v,n)
  implicit none
  real*8, dimension(:,:), allocatable :: v
  integer*4 n,i
  n=5
  allocate(v(n,2))
  do i=1,n
    v(i,1)=1.0
    v(i,2)=2.0
  enddo
  return
end subroutine Memory
This program gives me the following error:
Error: Dummy argument 'v' of procedure 'memory' at (1) has an attribute that requires an explicit interface for this procedure
Is this the right way of structuring this kind of program? If so, how can I solve the error?
Thanks.
Assuming that you have a single source file containing both the program and the subroutine, as your post suggests, the easiest fix is to replace the line containing the statement
end
with a line containing the statement
contains
and writing, at the end of the source file a line containing the statement
end program
(Yes, the keyword program is not required but it is useful.)
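Applied to the code in the question, that restructuring looks like this:
program main
  implicit none
  real*8, dimension(:,:), allocatable :: v
  integer*4 n
  call Memory(v,n)   ! now an internal procedure, so the interface is explicit
contains
  subroutine Memory(v,n)
    real*8, dimension(:,:), allocatable :: v
    integer*4 n,i
    n=5
    allocate(v(n,2))
    do i=1,n
      v(i,1)=1.0
      v(i,2)=2.0
    enddo
  end subroutine Memory
end program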
The problem that your compiler has reported is that, as you have structured your code, the program does not know anything about the interface to the subroutine Memory; that interface is, in Fortran terms, implicit. When you want to call a subroutine and either pass in or pass out an allocatable array, that subroutine must have an explicit interface.
There are several ways to provide an explicit interface. One is, as I have shown you, to contain the subroutine within the program. Another, and more useful way when your programs become a little bit larger, is to write your subroutines in modules and use-associate them in the program that wants to use them. Read the parts of your Fortran tutorial that cover modules and the use statement.
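A sketch of the module route, which also shows the intent attribute recommended below:
module memory_mod
  implicit none
contains
  subroutine Memory(v,n)
    real*8, dimension(:,:), allocatable, intent(out) :: v
    integer*4, intent(out) :: n
    integer*4 i
    n = 5
    allocate(v(n,2))
    do i = 1, n
      v(i,1) = 1.0
      v(i,2) = 2.0
    end do
  end subroutine Memory
end module memory_mod

program main
  use memory_mod, only: Memory   ! use association provides the explicit interface
  implicit none
  real*8, dimension(:,:), allocatable :: v
  integer*4 n
  call Memory(v,n)
end program main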
There is at least one other option but it is, especially for a beginner, unattractive and I won't mention it here.
And, while I'm writing, learn about and use the keyword intent to specify whether an argument to a subroutine will be read, written to, or both. This is a great aid to safe programming; your favourite Fortran resources will explain it in detail.

Does the target attribute affect efficiency of Fortran programs?

I have read difference between POINTER and ALLOCATABLE and Why does a Fortran POINTER require a TARGET? and they have really good answers.
It seems that the most efficient choice is to use allocatables for big arrays and pointers for things like linked lists, trees, etc.
Now, if I have my nice efficient allocatable variable, would the target attribute somehow affect the way it is accessed or how the compiler optimizes code with it? Is it advisable to do this?
For instance, I can think of my coordinate triplet:
real, dimension(:),allocatable :: x,y,z
and within a subroutine I would like to point at them in a cyclic way using a pointer.
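For concreteness, a minimal sketch (hypothetical names) of that cyclic pointing, which requires the very target attribute the question is about:
program cyclic_sketch
  implicit none
  real, dimension(:), allocatable, target :: x, y, z   ! target: pointers may alias these
  real, dimension(:), pointer :: par, perp1, perp2
  allocate(x(8), y(16), z(32))
  ! Cyclic permutation of roles, (x,y,z) -> (y,z,x):
  par   => y
  perp1 => z
  perp2 => x
  print *, size(par), size(perp1), size(perp2)
end program cyclic_sketch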
If you want to permute the directions (i.e., (x,y,z) --> (y,z,x)), then a pointer is not necessary as you can use an if-else condition with allocatables:
if(dir == 1) then
  allocate(x(1:nx),y(1:ny),z(1:nz))
elseif(dir == 2) then
  allocate(x(1:ny),y(1:nz),z(1:nx))
elseif(dir == 3) then
  allocate(x(1:nz),y(1:nx),z(1:ny))
endif
This should enable code re-use since you always have your parallel direction in x and your perpendicular directions in y and z.
This sounds compiler-dependent, and I personally have no knowledge of how compiler developers handle this situation in practice. However, it might depend on the size and complexity of the derived data type and also on the intent (intent(in), intent(out), intent(inout)) used in the subroutine. Compare the following types:
type ex1
  real :: a ! only one member
end type ex1
and
type ex2
  real, dimension(:), allocatable :: a ! member 1
  type(complicated), allocatable :: b(:) ! member 2
  type(more_complicated) :: c
  ! :
  ! :
  ! still has more than 200 members!
end type ex2
Assuming that you don't want to point to objects of type ex2 in your program, I think removing the target attribute might improve the speed of the program, because the compiler can significantly optimize the interior structure of ex2 under the assumption that no aliasing will occur at runtime. The same argument holds regarding intent() when used on a subroutine argument.
However, for the simple type ex1 I really wouldn't care whether I leave it with or without the target attribute when I don't intend to point to it in my program.

Allocatable arrays performance

There is an MPI version of a program which uses COMMON blocks to store arrays that are used everywhere throughout the code. Unfortunately, there is no way to declare arrays in a COMMON block whose size would be known only at run time. So, as a workaround, I decided to move those arrays into modules, which do accept ALLOCATABLE arrays. That is, all arrays in COMMON blocks were removed, and ALLOCATE was used instead. This was the only thing I changed in my program. Unfortunately, the performance of the program became awful compared to the COMMON block implementation. As for the MPI settings, there is a single MPI process on each computational node, and each MPI process has a single thread.
I found a similar question asked here, but I don't see how it could apply to my case (where each process has a single thread). I appreciate any help.
Here is a simple example which illustrates what I was talking about (below is pseudocode):
"SOURCE FILE":
SUBROUTINE ZEROSET()
  INCLUDE 'FILE_1.INC'
  INCLUDE 'FILE_2.INC'
  INCLUDE 'FILE_3.INC'
  ....
  INCLUDE 'FILE_N.INC'
  ARRAY_1 = 0.0
  ARRAY_2 = 0.0
  ARRAY_3 = 0.0
  ARRAY_4 = 0.0
  ...
  ARRAY_N = 0.0
END SUBROUTINE
As you may see, ZEROSET() has no parallel or MPI stuff. FILE_1.INC, FILE_2.INC, ..., FILE_N.INC are files where ARRAY_1, ARRAY_2, ..., ARRAY_N are defined in COMMON blocks, something like this:
REAL ARRAY_1
COMMON /ARRAY_1/ ARRAY_1(NX, NY, NZ)
where NX, NY, NZ are well-defined parameters declared with the PARAMETER statement.
When I switched to modules, I simply removed all the COMMON blocks, so FILE_I.INC now looks like
REAL, ALLOCATABLE:: ARRAY_I(:,:,:)
And then I just changed the INCLUDE 'FILE_I.INC' statements above to USE FILE_I. Actually, when the parallel program is executed, one particular process does not need the whole (NX, NY, NZ) domain, so I calculate the parameters and then allocate ARRAY_I (only ONCE!).
Subroutine ZEROSET() executes in 0.18 seconds with COMMON blocks and 0.36 seconds with modules (when the array dimensions are calculated at run time). So performance worsened by a factor of two.
I hope that everything is clear now. I appreciate your help very much.
Using allocatable arrays in modules can often hurt performance because the compiler has no idea about sizes at compile time. You will get much better performance with many compilers with this code:
subroutine X
  use Y ! Has allocatable array A(N,N) in it
  call Z(A,N)
end subroutine

subroutine Z(A,N)
  integer N
  real A(N,N)
  ! do stuff here
end
than with this code:
subroutine X
  use Y ! Has allocatable array A(N,N) in it
  ! do stuff here
end subroutine
In the first version, the compiler will know that the array is NxN and that the do loops run over N, and it can take advantage of that fact (most code works that way on arrays). Also, after any subroutine call inside "do stuff here", the compiler would otherwise have to assume that array A might have changed size or moved in memory and recheck it. That kills optimization.
This should get you most of your performance back.
Common blocks are located in a specific place in memory too, and that also allows optimizations.
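A self-contained sketch of that pattern (hypothetical names), with the work routine receiving the array and its size explicitly:
module y_mod
  implicit none
  integer :: n
  real, allocatable :: a(:,:)
end module y_mod

subroutine z(a, n)
  implicit none
  integer :: n
  real :: a(n,n)   ! explicit shape: the compiler knows the extents here
  integer :: i, j
  do j = 1, n
    do i = 1, n
      a(i,j) = 0.0
    end do
  end do
end subroutine z

program x
  use y_mod, only: a, n
  implicit none
  n = 500
  allocate(a(n,n))
  call z(a, n)     ! pass the allocatable down as an explicit-shape array
  print *, a(1,1)
end program x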
Actually, I guess your problem here is, in combination with stack vs. heap memory, indeed based on compiler optimization. Depending on the compiler you're using, it might do more efficient memory blanking, and for a fixed chunk of memory it does not even need to check the extent and location of it within the subroutine. Thus, with the fixed-size arrays there will be nearly no overhead involved.
Is this routine called very often, or why do you care about these 0.18 s?
If it is indeed relevant, the best option would be to get rid of the zero-setting altogether and instead, for example, peel off the first iteration of the loop and use it for the initialization; this way you do not have to introduce additional memory accesses just for initialization with 0 (see the sketch below). However, it would duplicate some code...
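A minimal sketch (hypothetical accumulation loop) of what folding the initialization into the first pass looks like:
program peel_sketch
  implicit none
  integer, parameter :: nx = 256, nt = 100
  real :: acc(nx), contrib(nx)
  integer :: it
  call random_number(contrib)
  acc = contrib        ! the first timestep doubles as initialization,
  do it = 2, nt        ! replacing a separate acc = 0.0 sweep over the array
    call random_number(contrib)
    acc = acc + contrib
  end do
  print *, sum(acc) / nx
end program peel_sketch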
I can think of just these reasons when it comes to Fortran performance with arrays:
arrays on the stack vs. the heap, but I doubt this could have a huge performance impact.
passing arrays to a subroutine, because the best way to do that depends on the array; see this page on using arrays efficiently.
