Does the target attribute affect efficiency of Fortran programs?

I have read "Difference between POINTER and ALLOCATABLE" and "Why does a Fortran POINTER require a TARGET?", and both have really good answers.
It seems that the most efficient choice is to use allocatables for big arrays and pointers for things like linked lists, trees, etc.
Now, if I have my nice efficient allocatable variable, would the target attribute somehow affect the way it is accessed or how the compiler optimizes code with it? Is it advisable to do this?
For instance, I can think of my coordinate triplet:
real, dimension(:),allocatable :: x,y,z
and within a subroutine I would like to point at them in a cyclic way using a pointer.

If you want to permute the directions (i.e., (x,y,z) --> (y,z,x)), then a pointer is not necessary as you can use an if-else condition with allocatables:
if(dir == 1) then
allocate(x(1:nx),y(1:ny),z(1:nz))
elseif(dir == 2) then
allocate(x(1:ny),y(1:nz),z(1:nx))
elseif(dir == 3) then
allocate(x(1:nz),y(1:nx),z(1:ny))
endif
This should enable code re-use since you always have your parallel direction in x and your perpendicular directions in y and z.

This sounds compiler dependent, and I personally have no knowledge of how compiler developers handle this situation in practice. However, it might depend on the size and complexity of the derived data type and also on the intent (intent(in), intent(out), or intent(inout)) used in the subroutine. Compare the following types:
type ex1
real :: a ! only one member
end type ex1
and
type ex2
real, dimension(:), allocatable :: a ! member 1
type(complicated), allocatable :: b(:) ! member 2
type(more_complicated) :: c
! :
! :
! still has more than 200 members!
end type ex2
Assuming that you don't want to point to objects of type ex2 in your program, I think removing the target attribute might improve the speed of the program, because the compiler can optimize accesses to the interior structure of ex2 much more aggressively under the assumption that no aliasing will happen at runtime. The same argument holds for intent() when the object is used as an argument to a subroutine.
However, for the simple type ex1, I really wouldn't care whether I leave it with or without the target attribute when I don't intend to point to it in my program.
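For reference, the cyclic pointing described in the question could look roughly like the sketch below. The program name, array sizes and values are made up for illustration; the only point is that x, y and z must carry the target attribute before a pointer may be associated with them, and that attribute is what tells the compiler aliasing is possible.

```fortran
program target_sketch
  implicit none
  ! target is required so that p may legally point at x, y or z
  real, dimension(:), allocatable, target :: x, y, z
  real, dimension(:), pointer :: p => null()
  integer :: dir

  allocate(x(3), y(3), z(3))
  x = 1.0
  y = 2.0
  z = 3.0

  ! Cycle through the three directions with a single pointer
  do dir = 1, 3
    select case (dir)
    case (1)
      p => x
    case (2)
      p => y
    case (3)
      p => z
    end select
    print *, 'direction', dir, 'sum =', sum(p)
  end do
end program target_sketch
```

Whether the target attribute costs anything measurable here is compiler dependent, so timing both variants on the real code is the only reliable check.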

Related

How to do several enumerations type in Fortran?

I tried to declare several enumeration types in Fortran.
This simple example illustrates my problem well:
program Main
enum, bind(c)
enumerator :: Colors = 0
enumerator :: Blue = 1
enumerator :: Red = 2
enumerator :: Green = 3
end enum
enum, bind(c)
enumerator :: Size = 0
enumerator :: Small = 1
enumerator :: Medium = 2
enumerator :: Large = 3
end enum
integer(kind(Colors)) :: myColor
myColor = Green
if (myColor == Large) then
write(*,*) 'MyColor is Large'
end if
end program Main
I also tried to enclose the enumeration in a type, and many other things, but none of them works.
Here I can compare Colors with Size. In C, for example, when I declare a color and a size as typedef enums, I have no such problem, because the two types are different.
Is there a simple way to have several enumerated types in Fortran?
Otherwise, I imagine declaring several types, each with one integer member that holds the value, and then creating interfaces to overload the operators I need (comparison, assignment and so on). I am not sure that this solution is possible, or that I can implement it.
Fortran does not have enumerated types in the sense that you wish to use. [1]
An enumeration in Fortran is a set of enumerators. The program of the question has two of them.
Enumerators themselves are named (integer) constants of a kind interoperable with C's corresponding enumeration type. They exist for the purposes of C interoperability and not to provide a similar functionality within Fortran.
The enumerators Green and Large in the question are two named integer constants with value 3 (of some, possibly different kind). Green==Large is a true expression whatever the kind parameters of the constants.
There is no mechanism in Fortran to restrict a variable to values of an enumeration. The constants could equivalently be declared as
integer(kind=enum_kind1), parameter :: Green = 3_enum_kind1
integer(kind=enum_kind2), parameter :: Large = 3_enum_kind2
for the appropriate kind values (which are quite likely in this case to be the same: C_INT) and the Fortran program would know no difference.
If you wish to use enumerated types in the sense that they exist in C and similar languages, you will have to use a non-intrinsic approach (as intimated in the question).
[1] This is the case for the current, 2018, revision of the language. At this time, there is a proposal for the next revision (provisionally 2023) to include enumerated types closer to what is desired here. This specification is given in 7.6.2 of one particular working draft.
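One possible shape for the non-intrinsic approach is sketched below: each "enumeration" becomes its own derived type with a private integer component, and comparison is only defined between values of the same type. All names here (enum_types_mod, color_t, size_t, color_eq, size_eq) are invented for the illustration and are not part of any library.

```fortran
module enum_types_mod
  implicit none
  private
  public :: color_t, blue, red, green
  public :: size_t, small, medium, large
  public :: operator(==)

  type :: color_t
    private
    integer :: id
  end type color_t

  type :: size_t
    private
    integer :: id
  end type size_t

  ! Named constants play the role of the enumerators
  type(color_t), parameter :: blue = color_t(1), red = color_t(2), green = color_t(3)
  type(size_t), parameter :: small = size_t(1), medium = size_t(2), large = size_t(3)

  ! == is defined only for color/color and size/size pairs
  interface operator(==)
    module procedure color_eq, size_eq
  end interface

contains

  elemental logical function color_eq(a, b)
    type(color_t), intent(in) :: a, b
    color_eq = (a%id == b%id)
  end function color_eq

  elemental logical function size_eq(a, b)
    type(size_t), intent(in) :: a, b
    size_eq = (a%id == b%id)
  end function size_eq

end module enum_types_mod

program enum_sketch
  use enum_types_mod
  implicit none
  type(color_t) :: my_color

  my_color = green
  if (my_color == green) print *, 'my_color is green'
  ! if (my_color == large) ...  ! rejected at compile time: no == for color_t and size_t
end program enum_sketch
```

Intrinsic assignment between two values of the same derived type works without any extra code, so only the operators that are actually needed have to be overloaded.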

Consecutive calls to same subroutine in fortran with significantly slower performance

TL;DR
My program calls a subroutine twice. The subroutine performs basically identical operations both times, but takes significantly longer to run the second time. The only difference between both calls is that data produced in the first call is used as an input in the second call. This data is stored in allocatable arrays declared in a module before the first call to the subroutine.
Full Story
The following shows the relevant portions of my code for the problem:
program Economy
!! Declarations !!
use Modern_mod, only: Modern
use Globals, only: Na, Ny, Ne, Vimp, Xmp, Pmp, FCp, Vimu,& ! Globals is a module containing vbles.
& Xmu, Pmu, FCu
implicit none
real(kind=nag_wp) :: param(4)
!! Execution !!
! First call to modern !
param = (/1.0d0,2.0d0,3.0d0,4.0d0/)
allocate(Vimp(Na,Ne),FCp(4,Na*Ne),Pmp(Ny,Ne),Xmp(Ny,Ne))
call Modern(param,Vimp,FCp,Pmp,Xmp)
! Second call to modern !
param = (/5.0d0,6.0d0,7.0d0,8.0d0/)
allocate(Vimu(Na,Ne),FCu(4,Na*Ne),Pmu(Ny,Ne),Xmu(Ny,Ne))
call Modern(param,Vimu,FCu,Pmu,Xmu)
end program Economy
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module Modern_mod
implicit none
contains
subroutine Modern(param,Vim,FCm,Pm,Xm)
!! Declarations !!
! Modules !
use Globals, only: Na, Ne, Ny
use FindVmp_mod, only: FindVmp
use FindVmu_mod, only: FindVmu
implicit none
! Declaring other variables !
real(kind=nag_wp), intent(in) :: param(4) ! param-Parameters specific to one of the modern sectors;
real(kind=nag_wp), intent(out), allocatable :: FCm(:,:), Xm(:,:),& ! FCm-Firm choices; Xm-Policy fun;
& Pm(:,:), Vim(:,:) ! Pm-Price of a share; Vim-Start of period value function;
real(kind=nag_wp), allocatable :: Vm1(:,:), Vim1(:,:), Pm1(:,:),& ! Vm1-Vm next guess; Pm1-Next share price guess;
& Vm(:,:) ! Vm-End of period value function; Vim1-Next Vim guess;
!! Execution !!
! Allocating and initializing functions !
allocate(Vim(Na,Ne),FCm(4,Na*Ne),Vm(Ny,Ne),Pm(Ny,Ne),Xm(Ny,Ne))
allocate(Vim1(Na,Ne),Vm1(Ny,Ne),Pm1(Ny,Ne))
! Initializing arrays !
Vm = ...
Vim = ...
...
! Doing calculations !
if(param(1) .eq. 1.0d0) then
call FindVmp(FCm,Vim,Pm,Vm1,Pm1,Xm) ! New value function guess for productive guys
else
call FindVmu(FCm,Vim,Pm,Vm1,Pm1,Xm) ! New value function guess for unproductive guys
end if
end subroutine Modern
end module Modern_mod
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module FindVmp_mod
implicit none
contains
subroutine FindVmp(FCm,Vim0,P0,Vm,P,x)
!! Declarations !!
use VmFp_mod, only: Calculations ! Contains the operations computing the final values of the outputs to FindVmp
implicit none
real(kind=nag_wp), allocatable, intent(out) :: Vm(:,:), x(:,:), P(:,:) ! Vm-New value function; x-Policy function; P-Share price;
real(kind=nag_wp), intent(in) :: P0(:,:), Vim0(:,:), FCm(:,:) ! P0-Initial share price guess; Vim-Initial guess for beginning of period value function;
! FCm-Firm choices;
!! Execution !!
! Allocate matrices !
allocate(Vm(Ny,Ne), x(Ny,Ne), P(Ny,Ne))
! Compute results !
call Calculations(FCm,Vim0,P0,Vm,P,x)
end subroutine FindVmp
end module FindVmp_mod
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module FindVmu_mod
implicit none
contains
subroutine FindVmu(FCm,Vim0,P0,Vm,P,x)
!! Declarations !!
use Globals, only: Vmp, Pmp
use VmFu_mod, only: Calculations ! Contains the operations computing the final values of the outputs to FindVmu
implicit none
real(kind=nag_wp), allocatable, intent(out) :: Vm(:,:), x(:,:), P(:,:) ! Vm-New value function; x-Policy function; P-Share price;
real(kind=nag_wp), intent(in) :: P0(:,:), Vim0(:,:), FCm(:,:) ! P0-Initial share price guess; Vim-Initial guess for beginning of period value function;
! FCm-Firm choices;
!! Execution !!
! Allocate matrices !
allocate(Vm(Ny,Ne), x(Ny,Ne), P(Ny,Ne))
! Compute results !
call Calculations(FCm,Vim0,P0,Vm,P,x,Vmp,Pmp) ! Using the values of Vmp and Pmp computed in the first call to Modern
end subroutine FindVmu
end module FindVmu_mod
Each run, Modern is fed different arrays of the same size and type (*p and *u respectively) which are declared in the module Globals. Modern similarly calls one of two very similar subroutines FindVm?, feeding them the corresponding arrays. FindVmp and FindVmu compute almost identical operations, only that the latter uses the values of Vimp, Pmp (computed in FindVmp) as inputs.
I've been trying to figure out why the second call to Modern takes up to an order of magnitude longer to complete compared to the first one.
My first guess was that maybe by allocating Vimp and Pmp at the beginning of the program, and allocating a bunch of other arrays afterwards, each reference to the former arrays might be costly because their memory addresses were far away from the arrays currently being computed in FindVmu (for reference, Na = 101, Ny = 91, Ne = 9). But then I read that allocated arrays are stored in the heap, and that data in the heap isn't stacked (no pun intended) so that this was not necessarily the origin of my problem. As a matter of fact, I've tried allocating all matrices at different points and in different orders in Modern, but I get roughly the same execution times.
In the same spirit, I've tried to vary how I declare some of the arrays in different subroutines (e.g. making some automatic instead of allocatable and using the compiler (ifort18) option to force them on the stack) and although I do get overall performance variations throughout the code, the relative performance of the two calls to Modern does not change.
Finally, I read in this thread that the more arrays you have in memory, the slower your code generally becomes. Although the explanation does make sense to me, this is the first time I have experienced such a significant performance loss within a Fortran program. If this were actually the problem I'm facing, I would have expected to run into it in any number of my previous projects. Is this nonetheless a plausible cause of what is happening here?
And I'm basically out of ideas...
Bonus Question
While we're at it, I've found that leaving out the following lines in Economy surprisingly (to me) does not lead to a segfault:
allocate(Vimp(Na,Ne),FCp(4,Na*Ne),Pmp(Ny,Ne),Xmp(Ny,Ne))
allocate(Vimu(Na,Ne),FCu(4,Na*Ne),Pmu(Ny,Ne),Xmu(Ny,Ne))
In words: if I don't manually allocate the arrays declared in Globals, it seems like the program does so automatically once I pass them to Modern. Is this standard behavior or was I just lucky when I was initially not allocating them myself?
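On the bonus question, the behaviour follows from how allocatable dummy arguments with intent(out) are specified since Fortran 2003: an actual argument that is allocated on entry is deallocated automatically, so only the allocate inside the called procedure matters. A minimal sketch, with invented names (fill_mod, fill, v) that do not appear in the code above:

```fortran
module fill_mod
  implicit none
contains
  subroutine fill(a, n)
    real, allocatable, intent(out) :: a(:)
    integer, intent(in) :: n
    ! With intent(out), an allocated actual argument has already been
    ! deallocated by the time execution reaches this point
    print *, 'allocated on entry? ', allocated(a)
    allocate(a(n))
    a = 1.0
  end subroutine fill
end module fill_mod

program intent_out_sketch
  use fill_mod, only: fill
  implicit none
  real, allocatable :: v(:)

  allocate(v(10))   ! this allocation is discarded when fill is entered
  call fill(v, 3)
  print *, 'size after call: ', size(v)
end program intent_out_sketch
```

Under that rule the allocate statements in Economy are redundant rather than required, which would explain why removing them does not cause a segfault.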

Haskell(GHC) specialization tour & efficient TypeFamilies

Haskell is all about abstraction. But abstraction costs us extra CPU cycles and extra memory usage due to the common representation of all abstract (polymorphic) data: a pointer to the heap. There are some ways to make abstract code play better with high performance demands. As far as I understand, one way it is done is specialization - basically extra code generation (manual or by the compiler), correct?
Let's assume that all code below is Strict (which helps the compiler perform more optimizations?)
If we have a function sum:
sum :: (Num a) => a -> a -> a
We can generate specialized version of it using specialize pragma:
{-# SPECIALIZE sum :: Float -> Float -> Float #-}
Now if the Haskell compiler can determine at compile time that we call sum on two Floats, it is going to use the specialized version of it. No heap allocations, right?
Functions - done. The same pragma can be applied to class instances. The logic does not change here, does it?
But what about data types?
I suspect that TypeFamilies are in charge here?
Let's try to specialize a dependent length-indexed list.
--UVec for unboxed vector
class UVec a where
data Vec (n :: Nat) a :: *
instance UVec Float where
data Vec n Float where
VNilFloat :: Vec 0 Float
VConsFloat :: {-# UNPACK #-} Float ->
Vec n Float ->
Vec (n + 1) Float
But Vec has a problem. We can't pattern match on its constructors, as each instance of UVec does not have to provide Vec with identical constructors. This forces us to implement each function on Vec for each instance of Vec (as lack of pattern matching implies that it can't be polymorphic on Vec). What is the best practice in such a case?
As you say, we can't pattern match on Vec n a without knowing what a is.
One option is to use another typeclass that extends your vector class with a custom function.
class UVec a => UVecSum a where
sum :: Vec n a -> a
instance UVecSum Float where
sum = ... -- use pattern match here
If, later on, we use sum v where v :: Vec n Float, the Float-specific code we defined in the instance will be called.
Partial answer, but perhaps it might help.
As far as I understand, one way it is done is specialization - basically extra code generation (manual or by the compiler), correct?
Yes, this is similar to code instantiation in C++ templates.
Now if the Haskell compiler can determine at compile time that we call sum on two Floats, it is going to use the specialized version of it. No heap allocations, right?
Yes, the compiler calls the specialised version whenever possible. Not sure what you mean regarding the heap allocations.
Regarding the dependently typed vectors: usually (I know this from Idris) the length of the vector is eliminated by the compiler when possible. It is intended for stronger type checking. At runtime the length information is useless and can be dropped.

Allocating arrays in a Fortran Subroutine

I need to read a lot of data from a file in a Fortran program. The size of the data is variable, so I would like to dynamically allocate the arrays. My idea is to write a subroutine that reads all the data and allocates the memory. A simplified version of the program is:
program main
implicit none
real*8, dimension(:,:), allocatable :: v
integer*4 n
!This subroutine will read all the data and allocate the memory
call Memory(v,n)
!From here the program will have other subroutines to make calculations
end
subroutine Memory(v,n)
implicit none
real*8, dimension(:,:), allocatable :: v
integer*4 n,i
n=5
allocate(v(n,2))
do i=1,n
v(i,1)=1.0
v(i,2)=2.0
enddo
return
end subroutine Memory
This program gives me the following error:
Error: Dummy argument 'v' of procedure 'memory' at (1) has an attribute that requires an explicit interface for this procedure
Is this the right way of structuring this kind of program? If so, how can I solve the error?
Thanks.
Assuming that you have a single source file containing both the program and the subroutine, as your post suggests, the easiest fix is to replace the line containing the statement
end
with a line containing the statement
contains
and writing, at the end of the source file, a line containing the statement
end program
(Yes, the keyword program is not required but it is useful.)
The problem that your compiler has reported is that, as you have structured your code, the program does not know anything about the interface to the subroutine memory; that interface is, in Fortran terms, implicit. When you want to call a subroutine and either pass in or pass out an allocatable array, that subroutine must have an explicit interface.
There are several ways to provide an explicit interface. One is, as I have shown you, to contain the subroutine within the program. Another, and more useful way when your programs become a little bit larger, is to write your subroutines in modules and use-associate them in the program that wants to use them. Read the parts of your Fortran tutorial that cover modules and the use statement.
There is at least one other option but it is, especially for a beginner, unattractive and I won't mention it here.
And, while I'm writing, learn about and use the keyword intent to specify whether an argument to a subroutine will be read, written to or both. This is a great aid to safe programming; your favourite Fortran resources will explain it in detail.
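As an illustration of the module route mentioned above, the question's program could be restructured along these lines. The module name memory_mod and the kind constant dp are assumptions introduced for this sketch; the subroutine body follows the question.

```fortran
module memory_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! double precision kind
contains
  subroutine memory(v, n)
    ! intent(out) documents that both arguments are produced here
    real(dp), allocatable, intent(out) :: v(:,:)
    integer, intent(out) :: n
    integer :: i

    n = 5
    allocate(v(n,2))
    do i = 1, n
      v(i,1) = 1.0_dp
      v(i,2) = 2.0_dp
    end do
  end subroutine memory
end module memory_mod

program main
  ! use-association gives the program an explicit interface to memory
  use memory_mod, only: memory, dp
  implicit none
  real(dp), allocatable :: v(:,:)
  integer :: n

  call memory(v, n)
  print *, 'n =', n, ' shape(v) =', shape(v)
end program main
```

When everything lives in one source file, the module has to appear before the program that uses it.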

Allocatable arrays performance

There is an MPI version of a program which uses COMMON blocks to store arrays that are used everywhere throughout the code. Unfortunately, there is no way to declare arrays in a COMMON block whose size is known only at run time. So, as a workaround, I decided to move those arrays into modules, which do accept ALLOCATABLE arrays. That is, all arrays in COMMON blocks were removed and ALLOCATE was used instead. This was the only thing I changed in my program. Unfortunately, the performance of the program was awful (when compared to the COMMON block implementation). As to the MPI settings, there is a single MPI process on each computational node and each MPI process has a single thread.
I found a similar question asked here, but I don't understand how it could be applied to my case (where each process has a single thread). I appreciate any help.
Here is a simple example which illustrates what I was talking about (the following is pseudocode):
"SOURCE FILE":
SUBROUTINE ZEROSET()
INCLUDE 'FILE_1.INC'
INCLUDE 'FILE_2.INC'
INCLUDE 'FILE_3.INC'
....
INCLUDE 'FILE_N.INC'
ARRAY_1 = 0.0
ARRAY_2 = 0.0
ARRAY_3 = 0.0
ARRAY_4 = 0.0
...
ARRAY_N = 0.0
END SUBROUTINE
As you may see, ZEROSET() has no parallel or MPI stuff. FILE_1.INC, FILE_2.INC, ... , FILE_N.INC are files where ARRAY_1, ARRAY_2 ... ARRAY_N are defined in COMMON blocks. Something like this:
REAL ARRAY_1
COMMON /ARRAY_1/ ARRAY_1(NX, NY, NZ)
Here NX, NY, NZ are well-defined parameters declared with the PARAMETER statement.
When I switched to modules, I removed all COMMON blocks, so FILE_I.INC looks like
REAL, ALLOCATABLE:: ARRAY_I(:,:,:)
I then changed the "INCLUDE 'FILE_I.INC'" statements above to "USE FILE_I". Actually, when the parallel program is executed, one particular process does not need the whole (NX, NY, NZ) domain, so I calculate the parameters and then allocate ARRAY_I (only ONCE!).
Subroutine ZEROSET() takes 0.18 seconds with COMMON blocks and 0.36 seconds with modules (when the array dimensions are calculated at run time). So the run time doubled.
I hope that everything is clear now. I appreciate your help very much.
Using allocatable arrays in modules can often hurt performance because the compiler has no idea about sizes at compile time. You will get much better performance with many compilers with this code:
subroutine X
use Y ! Has allocatable array A(N,N) in it
call Z(A,N)
end subroutine
subroutine Z(A,N)
integer N
real A(N,N)
! do stuff here
end subroutine
than with this code:
subroutine X
use Y ! Has allocatable array A(N,N) in it
! do stuff here
end subroutine
In the first version, the compiler knows inside Z that the array is NxN and that the do loops run over N, and it can take advantage of that fact (most codes work that way on arrays). In the second version, after any subroutine call in "do stuff here", the compiler has to assume that array A might have changed size or moved in memory, and recheck. That kills optimization.
This should get you most of your performance back.
Common blocks are also located at a fixed place in memory, and that allows further optimizations.
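A compilable sketch of the wrapper pattern from this answer is given below. The module and procedure names (fields_mod, kernels_mod, zero_kernel) are invented for the example, and whether it recovers the lost time depends on the compiler and its optimization flags.

```fortran
module fields_mod
  implicit none
  ! Module-level allocatable array, sized at run time
  real, allocatable :: a(:,:)
  integer :: n
end module fields_mod

module kernels_mod
  implicit none
contains
  subroutine zero_kernel(a, n)
    ! Explicit-shape dummy: inside this routine the extents are fixed
    ! by n, and the array cannot be reallocated or moved
    integer, intent(in) :: n
    real, intent(out) :: a(n,n)
    integer :: i, j

    do j = 1, n
      do i = 1, n
        a(i,j) = 0.0
      end do
    end do
  end subroutine zero_kernel
end module kernels_mod

program wrapper_sketch
  use fields_mod, only: a, n
  use kernels_mod, only: zero_kernel
  implicit none

  n = 4
  allocate(a(n,n))
  call zero_kernel(a, n)   ! the hot loop sees a plain n-by-n array
  print *, 'sum after zeroing: ', sum(a)
end program wrapper_sketch
```

The thin wrapper keeps the allocatable array in the module while the work is done in a routine that does not use the module at all, which is the property the answer relies on.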
Actually, I guess your problem here is, in combination with stack vs. heap memory, indeed down to compiler optimization. Depending on the compiler you're using, it might do some more efficient memory blanking, and for a fixed chunk of memory it does not even need to check its extent and location within the subroutine. Thus, with fixed-size arrays there will be nearly no overhead involved.
Is this routine called very often, or why do you care about these 0.18 s?
If it is indeed relevant, the best option would be to get rid of the zero-setting altogether and instead, for example, separate out the first iteration of the loop and use it for the initialization. This way you do not have to introduce additional memory accesses just for initialization with 0. However, it would duplicate some code...
I can think of just these reasons when it comes to Fortran performance with arrays:
arrays on the stack VS heap, but I doubt this could have a huge performance impact.
passing arrays to a subroutine, because the best way to do that depends on the array, see this page on using arrays efficiently

Resources