Allocating arrays in a Fortran Subroutine - memory-management

I need to read a lot of data from a file in a Fortran program. The size of the data is variable, so I would like to dynamically allocate the arrays. My idea is to make a subroutine who reads all the data and allocates the memory. A simplified version of the program is:
program main
implicit none
real*8, dimension(:,:), allocatable :: v
integer*4 n
!This subroutine will read all the data and allocate the memory
call Memory(v,n)
!From here the program will have other subroutines to make calculations
end
subroutine Memory(v,n)
implicit none
real*8, dimension(:,:), allocatable :: v
integer*4 n,i
n=5
allocate(v(n,2))
do i=1,n
v(i,1)=1.0
v(i,2)=2.0
enddo
return
end subroutine Memory
This program gives me the following error:
Error: Dummy argument 'v' of procedure 'memory' at (1) has an attribute that requieres an explicit interface for this procedure
Is this the right way of structuring this kind of program? If so, How can I solve the error?
Thanks.

Assuming that you have a single source file containing both the program and the subroutine, as your post suggests, the easiest fix is to replace the line containing the statement
end
with a line containing the statement
contains
and writing, at the end of the source file a line containing the statement
end program
(Yes, the keyword program is not required but it is useful.)
The problem that your compiler has reported is that, as you have structured your code, the program does not know anything about the interface to the subroutine memory, that interface is, in Fortran terms, implicit. When you want to call a subroutine and either pass in or pass out an allocatable array that subroutine must have an explicit interface.
There are several ways to provide an explicit interface. One is, as I have shown you, to contain the subroutine within the program. Another, and more useful way when your programs become a little bit larger, is to write your subroutines in modules and use-associate them in the program that wants to use them. Read the parts of your Fortran tutorial that cover modules and the use statement.
There is at least one other option but it is, especially for a beginner, unattractive and I won't mention it here.
And, while I'm writing, learn about and use the keyword intent to specify whether an argument to a subroutine will be read, written to or both. This is a great aid to safe programming, your favourite Fortran resources will explain in detail.

Related

Consecutive calls to same subroutine in fortran with significantly slower performance

TL;DR
My program calls a subroutine twice. The subroutine performs basically identical operations both times, but takes significantly longer to run the second time. The only difference between both calls is that data produced in the first call is used as an input in the second call. This data is stored in allocatable arrays declared in a module before the first call to the subroutine.
Full Story
The following shows the relevant portions of my code for the problem:
program Economy
!! Declarations !!
use Modern_mod, only: Modern
use Globals, only: Na, Ny, Ne, Vimp, Xmp, Pmp, FCp, Vimu,& ! Globals is a module containing vbles.
& Xmu, Pmu, FCu
implicit none
real(kind=nag_wp) :: param(4)
!! Execution !!
! First call to modern !
param = (/1.0d0,2.0d0,3.0d0,4.0d0/)
allocate(Vimp(Na,Ne),FCp(4,Na*Ne),Pmp(Ny,Ne),Xmp(Ny,Ne))
call Modern(param,Vimp,FCp,Pmp,Xmp)
! Second call to modern !
param = (/5.0d0,6.0d0,7.0d0,8.0d0/)
allocate(Vimu(Na,Ne),FCu(4,Na*Ne),Pmu(Ny,Ne),Xmu(Ny,Ne))
call Modern(param,Vimu,FCu,Pmu,Xmu)
end program Economy
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module Modern_mod
implicit none
contains
subroutine Modern(param,Vim,FCm,Pm,Xm)
!! Declarations !!
! Modules !
use Globals, only: Na, Ne, Ny
use FindVmp_mod, only: FindVmp
use FindVmu_mod, only: FindVmu
implicit none
! Declaring other variables !
real(kind=nag_wp), intent(in) :: param(4) ! param-Parameters specific to one of the modern sectors;
real(kind=nag_wp), intent(out), allocatable :: FCm(:,:), Xm(:,:),& ! FCm-Firm choices; Xm-Policy fun;
& Pm(:,:), Vim(:,:) ! Pm-Price of a share; Vim-Start of period value function;
real(kind=nag_wp), allocatable :: Vm1(:,:), Vim1(:,:), Pm1(:,:),& ! Vm1-Vm next guess; Pm1-Next share price guess;
& Vm(:,:) ! Vm-End of period value function; Vim1-Next Vim guess;
!! Execution !!
! Allocating and initializing functions !
allocate(Vim(Na,Ne),FCm(4,Na*Ne),Vm(Ny,Ne),Pm(Ny,Ne),Xm(Ny,Ne))
allocate(Vim1(Na,Ne),Vm1(Ny,Ne),Pm1(Ny,Ne))
! Inizializing arrays !
Vm = ...
Vim = ...
...
! Doing calculations !
if(param(1) .eq. 1.0d0) then
call FindVmp(FCm,Vim,Pm,Vm1,Pm1,Xm) ! New value funciton guess for productive guys
else
call FindVmu(FCm,Vim,Pm,Vm1,Pm1,Xm) ! New value funciton guess for unproductive guys
end if
end subroutine Modern
end module Modern_mod
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module FindVmp_mod
implicit none
contains
subroutine FindVmp(FCm,Vim0,P0,Vm,P,x)
!! Declarations !!
use VmFp_mod, only: Calculations ! Contains the operations computing the final values of the outputs to FindVmp
implicit none
real(kind=nag_wp), allocatable, intent(out) :: Vm(:,:), x(:,:), P(:,:) ! Vm-New value function; x-Policy function; P-Share price;
real(kind=nag_wp), intent(in) :: P0(:,:), Vim0(:,:), FCm(:,:) ! P0-Initial share price guess; Vim-Initial guess for beginning of period value function;
! FCm-Firm choices;
!! Execution !!
! Allocate matrices !
allocate(Vm(Ny,Ne), x(Ny,Ne), P(Ny,Ne))
! Compute results !
call Calculations(FCm,Vim0,P0,Vm,P,x)
end subroutine FindVmp
end module FindVmp_mod
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module FindVmu_mod
implicit none
contains
subroutine FindVmu(FCm,Vim0,P0,Vm,P,x)
!! Declarations !!
use Globals, only: Vmp, Pmp
use VmFu_mod, only: Calculations ! Contains the operations computing the final values of the outputs to FindVmu
implicit none
real(kind=nag_wp), allocatable, intent(out) :: Vm(:,:), x(:,:), P(:,:) ! Vm-New value function; x-Policy function; P-Share price;
real(kind=nag_wp), intent(in) :: P0(:,:), Vim0(:,:), FCm(:,:) ! P0-Initial share price guess; Vim-Initial guess for beginning of period value function;
! FCm-Firm choices;
!! Execution !!
! Allocate matrices !
allocate(Vm(Ny,Ne), x(Ny,Ne), P(Ny,Ne))
! Compute results !
call Calculations(FCm,Vim0,P0,Vm,P,x,Vmp,Pmp) ! Using the values of Vmp and Pmp computes in the first call to Modern
end subroutine FindVmu
end module FindVmu_mod
Each run, Modern is fed different arrays of the same size and type (*p and *u respectively) which are declared in the module Globals. Modern similarly calls one of two very similar subroutines FindVm?, feeding them the corresponding arrays. FindVmp and FindVmu compute almost identical operations, only that the latter uses the values of Vimp, Pmp (computed in FindVmp) as inputs.
I've been trying to figure out why the second call to Modern takes up to an order of magnitude longer to complete compared to the first one.
My first guess was that maybe by allocating Vimp and Pmp at the beginning of the program, and allocating a bunch of other arrays afterwards, each reference to the former arrays might be costly because their memory addresses were far away from the arrays currently being computed in FindVmu (for reference, Na = 101, Ny = 91, Ne = 9). But then I read that allocated arrays are stored in the heap, and that data in the heap isn't stacked (no pun intended) so that this was not necessarily the origin of my problem. As a matter of fact, I've tried allocating all matrices at different points and in different orders in Modern, but I get roughly the same execution times.
In the same spirit, I've tried to vary how I declare some of the arrays in different subroutines (e.g. making some automatic instead of allocatable and using the compiler (ifort18) option to force them on the stack) and although I do get overall performance variations throughout the code, the relative performance of the two calls to Modern does not change.
Finally, I read in this thread that the more arrays you have in memory, the slower your code generally becomes. Although the explanation does make sense to me, this is the first time I experience such a significant performance loss throughout a fortran program. If this were actually the problem I'm facing, I would have expected to run into it in any number of my previous projects. Is this nonetheless a plausible cause of what is happening here?
And I'm basically out of ideas...
Bonus Question
While we're at it, I've found that leaving out the following lines in Economy surprisingly (to me) does not lead to a segfault:
allocate(Vimp(Na,Ne),FCp(4,Na*Ne),Pmp(Ny,Ne),Xmp(Ny,Ne))
allocate(Vimu(Na,Ne),FCu(4,Na*Ne),Pmu(Ny,Ne),Xmu(Ny,Ne))
In words: if I don't manually allocate the arrays declared in Globals, it seems like the program does so automatically once I pass them to Modern. Is this standard behavior or was I just lucky when I was initially not allocating them myself?

Using MPI_Send/Recv to handle chunk of multi-dim array in Fortran 90

I have to send and receive (MPI) a chunk of a multi-dimensional array in FORTRAN 90. The line
MPI_Send(x(2:5,6:8,1),12,MPI_Real,....)
is not supposed to be used, as per the book "Using MPI..." by Gropp, Lusk, and Skjellum. What is the best way to do this? Do I have to create a temporary array and send it or use MPI_Type_Create_Subarray or something like that?
The reason not to use array sections with MPI_SEND is that the compiler has to create a temporary copy with some MPI implementations. This is due to the fact that Fortran can only properly pass array sections to subroutines with explicit interfaces and has to generate temporary "flattened" copies in all other cases, usually on the stack of the calling subroutine. Unfortunately in Fortran before the TR 29113 extension to F2008 there is no way to declare subroutines that take variable type arguments and MPI implementations usually resort to language hacks, e.g. MPI_Send is entirely implemented in C and relies on Fortran always passing the data as a pointer.
Some MPI libraries work around this issue by generating huge number of overloads for MPI_SEND:
one that takes a single INTEGER
one that takes an 1-d array of INTEGER
one that takes an 2-d array of INTEGER
and so on
The same is then repeated for CHARACTER, LOGICAL, DOUBLE PRECISION, etc. This is still a hack as it does not cover cases where one passes user-defined type. Further it greatly complicates the C implementation as it now has to understand the Fortran array descriptors, which are very compiler-specific.
Fortunately times are changing. The TR 29113 extension to Fortran 2008 includes two new features:
assumed-type arguments: TYPE(*)
assumed-dimension arguments: DIMENSION(..)
The combination of both, i.e. TYPE(*), DIMENSION(..), INTENT(IN) :: buf, describes an argument that can both be of varying type and have any dimension. This is already being taken advantage of in the new mpi_f08 interface in MPI-3.
Non-blocking calls present bigger problems in Fortran that go beyond what Alexander Vogt has described. The reason is that Fortran does not have the concept of suppressing compiler optimisations (i.e. there is no volatile keyword in Fortran). The following code might not run as expected:
INTEGER :: data
data = 10
CALL MPI_IRECV(data, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, req, ierr)
! data is not used here
! ...
CALL MPI_WAIT(req, MPI_STATUS_IGNORE, ierr)
! data is used here
One might expect that after the call to MPI_WAIT data would contain the value received from rank 0, but this might very well not be the case. The reason is that the compiler cannot know that data might change asynchronously after MPI_IRECV returns and therefore keep its value in a register instead. That's why non-blocking MPI calls are generally considered as dangerous in Fortran.
TR 29113 has solution for that second problem too with the ASYNCHRONOUS attribute. If you take a look at the mpi_f08 definition of MPI_IRECV, its buf argument is declared as:
TYPE(*), DIMENSION(..), INTENT(OUT), ASYNCHRONOUS :: buf
Even if buf is a scalar argument, i.e. no temporary copy is created, a TR 29113 compliant compiler would not resort to register optimisations for the buffer argument.
EDIT: As Hristo Iliev pointed out MPI_Send is always blocking, but might choose to send data asynchronously. From here:
MPI_Send will not return until you can use the send buffer.
Non-blocking communications (like MPI_Send), might pose a problem with Fortran when non-contiguous arrays are involved. Then, the compiler creates a temporary array for the dummy variable and passes it to the subroutine. Once the subroutine is finished, the compiler is at liberty to free the memory of that copy.
That's fine as long as you use blocking communication (MPI_Send), because then the message has been sent when the subroutine returns. For the non-blocking communication (MPI_Isend), however, the temporary array is the send buffer, and the subroutine returns before it has been sent.
So it might happen, that MPI will send data from a memory location that holds no valid data any more.
So, either you create a copy yourself (so that your send buffer is contiguous in memory), or you create a sub-array (i.e. tell MPI the addresses in memory of elements you want to send). There are further alternatives out there, like MPI_Pack, but I have no experience with them.
Which way is faster? Well, that depends:
On the actual implementation of your MPI library
On the data and its distribution
On your compiler
On your hardware
See here for a detailed explanation and further options.

Does the target attribute affect efficiency of Fortran programs?

I have read difference between POINTER and ALLOCATABLE and Why does a Fortran POINTER require a TARGET? and they have really good answers.
It seems that the most efficient choice is to use allocatables for big arrays and pointers for things like linked lists, trees, etc.
Now, if I have my nice efficient allocatable variable, would the target attribute somehow affect the way it is accessed or how the compiler optimizes code with it? Is it advisable to do this?
For instance, I can think of my coordinate triplet:
real, dimension(:),allocatable :: x,y,z
and within a subroutine I would like to point at them in a cyclic way using a pointer.
If you want to permute the directions (i.e., (x,y,z) --> (y,z,x)), then a pointer is not necessary as you can use an if-else condition with allocatables:
if(dir == 1) then
allocate(x(1:nx),y(1:ny),z(1:nz))
elseif(dir == 2) then
allocate(x(1:ny),y(1:nz),z(1:nx))
elseif(dir == 3) then
allocate(x(1:nz),y(1:nx),z(1:ny))
endif
This should enable code re-use since you always have your parallel direction in x and your perpendicular directions in y and z.
This sounds to be compiler dependent and I personally have no knowledge of how compiler developers handle this situation in practice. However, this might depend on the size/complexity of derived data type and also the type of {intent(in), intent(out), intent(inout)} used in subroutine. Compare the following types:
type ex1
real :: a ! only one member
end type ex1
and
type ex2
real, dimension(:), allocatable :: a ! member 1
type(complicated), allocatable :: b(:) ! member 2
type(more_complicated) :: c
! :
! :
! still has more 200 members!
end type ex2
Assuming that you don't want to point to the objects made of "ex2 data type" it in your program, I think removing target attribute might improve the speed of the program because compiler can significantly optimize the interior structure of ex2 with the assumption of no aliasing will be done on that during runtime. The same argument holds regarding intent() when used as an argument to a subroutine.
However for simple "type ex1" I really wouldn't care if I leave it with or without target attribute when I don't intent to point to it in my program.

Reduce Fortran function call overhead

I have a Fortran code like this:
file1.f90
program myprog
use func1mod
do i=1,N
call subroutine1
enddo
subroutine subroutine1
integer*8::var1,var2,var3,...
do j=1,N
x=func1(var1,var2,var3,..)
computations based on x
enddo
return
end
end
file2.f90
module func1mod
contains
func1(var1,var2,var3,....)
func1=some computations based on var1, var2, var3, ...
return
end function func1
end module func1mod
function func1 does not modify any of its arguments. It computes a value based on the arguments and returns a value. The # of arguments is large but the function is less than 30 lines of code. What is the best approach to reduce the function call overhead.
One approach would be to inline the function. Is there any other way out?
The best you can do is be as explicit as possible about the semantics of the function, turn optimization up as high as possible, and let the compiler make the best decision it can about how best to implement the call. Make sure the dummy variables are marked intent(in), and mark the function as pure - although if it's only 30 lines, the compiler will doubtless notice these things anyway at high optimization - and check your compiler options to see if there's anything you can do to encourage (for instance) inlining.
Generally the overhead of a procedure call is low. If the function has 30 lines of code probably you will gain very little because the actual function will dominant over the function call. If you want to be sure, measure the runtime of the current implementation, then inline the code and measure that runtime.

Allocatable arrays performance

There is an mpi-version of a program which uses COMMON blocks to store arrays that are used everywhere through the code. Unfortunately, there is no way to declare arrays in COMMON block size of which would be known only run-time. So, as a workaround I decided to move that arrays in modules which accept ALLOCATABLE arrays inside. That is, all arrays in COMMON blocks were vanished, instead ALLOCATE was used. So, this was the only thing I changed in my program. Unfortunately, performance of the program was awful (when compared to COMMON blocks realization). As to mpi-settings, there is a single mpi-process on each computational node and each mpi-process has a single thread.
I found similar question asked here but don't think (don't understand :) ) how it could be applied to my case (where each process has a single thread). I appreciate any help.
Here is a simple example which illustrates what I was talking about (below is a pseudocode):
"SOURCE FILE":
SUBROUTINE ZEROSET()
INCLUDE 'FILE_1.INC'
INCLUDE 'FILE_2.INC'
INCLUDE 'FILE_3.INC'
....
INCLUDE 'FILE_N.INC'
ARRAY_1 = 0.0
ARRAY_2 = 0.0
ARRAY_3 = 0.0
ARRAY_4 = 0.0
...
ARRAY_N = 0.0
END SUBROUTINE
As you may see, ZEROSET() has no parallel or MPI stuff. FILE_1.INC, FILE_2, ... , FILE_N.INC are files where ARRAY_1, ARRAY_2 ... ARRAY_N are defined in COMMON blocks. Something like that
REAL ARRAY_1
COMMON /ARRAY_1/ ARRAY_1(NX, NY, NZ)
Where NX, NY, NZ are well defined parameters described with help of PARAMETER directive.
When I use modules, I just destroyed all COMMON blocks, so FILE_I.INC looks like
REAL, ALLOCATABLE:: ARRAY_I(:,:,:)
And then just changed "INCLUDE 'FILE_I.INC'" statement above to "USE FILE_I". Actually, when parallel program is executed, one particular process does not need a whole (NX, NY, NZ) domain, so I calculate parameters and then allocate ARRAY_I (only ONCE!).
Subroutine ZEROSET() is executed 0.18 seconds with COMMON blocks and 0.36 with modules (when array's dimensions are calculated runtime). So, the performance worsened by two times.
I hope that everything is clear now. I appreciate you help very much.
Using allocatable arrays in modules can often hurt performance because the compiler has no idea about sizes at compile time. You will get much better performance with many compilers with this code:
subroutine X
use Y ! Has allocatable array A(N,N) in it
call Z(A,N)
end subroutine
subroutine Z(A,N)
Integer N
real A(N,N)
do stuff here
end
Then this code:
subroutine X
use Y ! Has allocatable array A(N,N) in it
do stuff here
end subroutine
The compiler will know that the array is NxN and the do loops are over N and be able to take advantage of that fact (most codes work that way on arrays). Also, after any subroutine calls in "do stuff here", the compiler will have to assume that array "A" might have changed sizes or moved locations in memory and recheck. That kills optimization.
This should get you most of your performance back.
Common blocks are located in a specific place in memory also, and that allows optimizations also.
Actually I guess, your problem here is, in combination with stack vs. heap memory, indeed compiler optimization based. Depending on the compiler you're using, it might do some more efficient memory blanking, and for a fixed chunk of memory it does not even need to check the extent and location of it within the subroutine. Thus, in the fixed sized arrays there won't be nearly no overhead involved.
Is this routine called very often, or why do you care about these 0.18 s?
If it is indeed relevant, the best option would be to get rid of the 0 setting at all, and instead for example separate the first iteration loop and use it for the initialization, this way you do not have to introduce additional memory accesses, just for initialization with 0. However it would duplicate some code...
I could think of just these reasons when it comes to fortran performance using arrays:
arrays on the stack VS heap, but I doubt this could have a huge performance impact.
passing arrays to a subroutine, because the best way to do that depends on the array, see this page on using arrays efficiently

Resources