Reduce Fortran function call overhead - performance

I have a Fortran code like this:
file1.f90
program myprog
  use func1mod
  do i=1,N
    call subroutine1
  enddo
contains
  subroutine subroutine1
    integer*8::var1,var2,var3,...
    do j=1,N
      x=func1(var1,var2,var3,..)
      ! computations based on x
    enddo
    return
  end subroutine subroutine1
end program myprog
file2.f90
module func1mod
contains
  function func1(var1,var2,var3,....)
    func1=some computations based on var1, var2, var3, ...
    return
  end function func1
end module func1mod
The function func1 does not modify any of its arguments; it computes a value from them and returns it. The number of arguments is large, but the function is fewer than 30 lines of code. What is the best approach to reduce the function call overhead?
One approach would be to inline the function. Is there any other way out?

The best you can do is be as explicit as possible about the semantics of the function, turn optimization up as high as possible, and let the compiler decide how best to implement the call. Make sure the dummy arguments are marked intent(in), and mark the function as pure (although if it is only 30 lines, the compiler will doubtless notice these things anyway at high optimization). Also check your compiler options to see if there is anything you can do to encourage, for instance, inlining.
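As a minimal sketch of that advice, the module below marks the function pure and its arguments intent(in). The three-argument signature and the arithmetic in the body are placeholders chosen for illustration; the original function takes more arguments and its body is not shown in the question.

```fortran
module func1mod
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains
  ! pure + intent(in) state explicitly that the function has no side
  ! effects and never modifies its arguments.
  pure function func1(var1, var2, var3) result(res)
    integer(int64), intent(in) :: var1, var2, var3
    real(real64) :: res
    ! Placeholder arithmetic (assumption, not the asker's computation).
    res = real(var1, real64) + 2.0_real64*real(var2, real64) &
        + 3.0_real64*real(var3, real64)
  end function func1
end module func1mod

program myprog
  use, intrinsic :: iso_fortran_env, only: int64, real64
  use func1mod, only: func1
  implicit none
  integer(int64) :: j
  real(real64) :: total

  total = 0.0_real64
  do j = 1, 1000
     total = total + func1(j, j, j)
  end do
  print *, 'total =', total
end program myprog
```

Because the function lives in a different file from its caller, a compiler may only be able to inline it with some form of cross-file optimization. With gfortran, for example, that would mean building with something like -O3 -flto; check your own compiler's documentation for the equivalent.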

Generally the overhead of a procedure call is low. If the function has 30 lines of code, you will probably gain very little, because the work done inside the function will dominate the cost of the call. If you want to be sure, measure the runtime of the current implementation, then inline the code and measure that runtime.
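One way to make that measurement is with the system_clock intrinsic, as in the sketch below. The function work, its body, and the iteration count are invented for illustration and stand in for the real func1. Note that an optimizing compiler may inline an internal function like this one on its own, in which case the two timings will be nearly identical; for a realistic comparison the called version should be built the same way as the real code.

```fortran
program time_calls
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64), parameter :: n = 10000000_int64
  integer(int64) :: t0, t1, rate, j
  real(real64) :: acc_call, acc_inline

  ! Version 1: the work is done through a function call.
  call system_clock(t0, rate)
  acc_call = 0.0_real64
  do j = 1, n
     acc_call = acc_call + work(j)
  end do
  call system_clock(t1)
  print *, 'with call   (s):', real(t1 - t0, real64)/real(rate, real64)

  ! Version 2: the same expression written out by hand in the loop.
  call system_clock(t0)
  acc_inline = 0.0_real64
  do j = 1, n
     acc_inline = acc_inline + 0.5_real64*sqrt(real(j, real64))
  end do
  call system_clock(t1)
  print *, 'hand-inlined (s):', real(t1 - t0, real64)/real(rate, real64)

  ! Print the sums so the compiler cannot discard either loop.
  print *, 'checksums:', acc_call, acc_inline
contains
  ! Hypothetical stand-in for the function being timed.
  pure function work(k) result(res)
    integer(int64), intent(in) :: k
    real(real64) :: res
    res = 0.5_real64*sqrt(real(k, real64))
  end function work
end program time_calls
```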

Related

jax minimization with stochastically estimated gradients

I'm trying to use the bfgs optimizer from tensorflow_probability.substrates.jax and from jax.scipy.optimize.minimize to minimize a function f which is estimated from pseudo-random samples and has a jax.random.PRNGKey as argument. To use this function with the jax/tfp bfgs minimizer, I wrap the function inside a lambda function
seed = 100
key = jax.random.PRNGKey(seed)
fun = lambda x: f(x, key)
result = jax.scipy.optimize.minimize(fun = fun, ...)
What is the best way to update the key when the minimization routine calls the function to be minimized so that I use different pseudo-random numbers in a reproducible way? Maybe a global key variable? If yes, is there an example I could follow?
Secondly, is there a way to make the optimization stop after a certain amount of time, as one could do with a callback in scipy? I could directly use the scipy implementation of bfgs/l-bfgs-b/etc. and use jax only for the estimation of the function and of its gradients, which seems to work. Is there a difference between the scipy, jax.scipy and tfp.jax bfgs implementations?
Finally, is there a way to print the values of the arguments of fun during the bfgs optimization in jax.scipy or tfp, given that f is jitted?
Thank you!
There is no way to do what you're asking with jax.scipy.optimize.minimize, because the minimizer does not offer any means to track changing state between function calls, and does not provide for any inbuilt stochasticity in the optimizer.
If you're interested in stochastic optimization in JAX, you might try stochastic optimization in JAXOpt, which provides a much more flexible set of optimization routines.
Regarding your second question, if you'd like to print values during the course of a jit-compiled optimization or other loop, you can use jax.debug.print.

Do all programming languages store the output of function calls?

This is a general question about if programming languages remember/store the output of function calls.
Suppose I need to calculate a quantity X which depends on some number of simpler calculations. Let's say
X=sin(t)+cos(t)+(cos(t)-sin(t))^2.
Naively, I could compute X as above, calling sin(t) twice, and cos(t) twice.
Or I could call sin(t) and cos(t) once:
a=sin(t)
b=cos(t)
and do
X=a+b+(b-a)^2
Intuitively, the second method should be twice as fast, right? Is this the case in all programming languages?
I ask because, doing such a calculation in Julia, I noticed that computing the simpler quantities once vs calling them at each point they appear in the expression for X does not change the runtime.
It depends on how clever your compiler is, and on properties of the function.
First, your compiler would need to figure out, for example, that you are calling sin(t) twice. That's not too difficult.
Second, it needs to convince itself that t has the same value for each call. For example, if t was a static variable, and you didn't call sin(t) but some other function, that function call could modify t, so the second call sin(t) would have a different argument and sin(t) would have to be called twice.
Third, it needs to convince itself that it doesn't matter whether sin(t) is called once or twice. (Such a function is said to be pure, or free of side effects.) For example, if you called a function that writes a message to a logfile, the compiler would have to call it twice; otherwise only one message would be written to the logfile instead of two.
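The third point can be made concrete with a small Fortran sketch. Everything in it is hypothetical: square is a side-effect-free function, and logged_square stands in for a function that writes to a logfile by incrementing a counter instead. The two calls are placed in separate statements so that both are certain to be executed.

```fortran
module calls_mod
  implicit none
  integer :: n_messages = 0   ! stands in for lines written to a logfile
contains
  ! No side effects: one call or two calls are indistinguishable.
  pure function square(t) result(r)
    real, intent(in) :: t
    real :: r
    r = t*t
  end function square

  ! Side effect: every call "writes a message" by bumping the counter.
  function logged_square(t) result(r)
    real, intent(in) :: t
    real :: r
    n_messages = n_messages + 1
    r = t*t
  end function logged_square
end module calls_mod

program cse_demo
  use calls_mod
  implicit none
  real :: t, a, x

  t = 3.0

  ! Pure function: calling twice or reusing one result is equivalent.
  x = square(t)
  x = x + square(t)
  a = square(t)
  print *, 'pure, two calls vs one call:', x, a + a

  ! Impure function: same numerical result, different number of messages.
  x = logged_square(t)
  x = x + logged_square(t)
  print *, 'messages after calling twice:', n_messages
  a = logged_square(t)
  x = a + a
  print *, 'messages after one more call:', n_messages
end program cse_demo
```

The counter goes up by two for the first form and by one for the second, which is exactly the observable difference a compiler must preserve unless it can prove the function has no side effects.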

Issue with common block in OpenMP parallel programming

I have a few questions about using common blocks in parallel programming in Fortran.
My subroutines have common blocks. Do I have to declare all the common blocks and make them threadprivate in the parallel do region?
How do they pass information? I want a separate common block for each thread and want them to pass information through the end of the parallel region. Does it happen here?
My Ford subroutine changes some variables in the common blocks and the Condact subroutine overwrites them again, but the function uses the values from the Condact subroutine. Do the second subroutine and the function copy the variables from the previous subroutine for each thread?
      program
      ...
! Loop which I want to parallelize
!$OMP parallel DO
! do I need to declare all common blocks and threadprivate them here?
      do I = 1, N
      ...
      call FORD(i,j)
      ...
      end do
!$OMP END parallel DO
      end program
      subroutine FORD(i,j)
      dimension zl(3),zg(3)
      common /ellip/ b1,c1,f1,g1,h1,d1,
     .               b2,c2,f2,g2,h2,p2,q2,r2,d2
      common /root/ root1,root2
!$OMP threadprivate (/ellip/,/root/)
! this subroutine rewrites the values of b1, c1 and f1.
      CALL CONDACT(genflg,lapflg)
      return
      end subroutine
      SUBROUTINE CONDACT(genflg,lapflg)
      common /ellip/ b1,c1,f1,g1,h1,d1,b2,c2,f2,g2,h2,p2,q2,r2,d2
!$OMP threadprivate (/ellip/)
! this subroutine rewrites b1, c1 and f1 again
      y = f(x)
      RETURN
      END
      function f(x)
      common /ellip/ b1,c1,f1,g1,h1,d1,
     .               b2,c2,f2,g2,h2,p2,q2,r2,d2
!$OMP threadprivate (/ellip/)
! here the function uses the values of b1, c1, f1 set by the CONDACT subroutine.
      end
First, as the comment above says, I would strongly advise against using common blocks, especially in modern code. Mixing global data and parallelism is asking for a world of pain; in fact, global data is a bad idea, full stop.
OK, your questions:
My subroutines have common blocks. Do I have to declare all the common blocks and threadprivate in the parallel do region?
No. threadprivate is a declarative directive: it belongs only in the declarative part of a program unit, and it must appear after the declaration of the common block in every program unit that declares that block.
How do they pass information? I want a separate common block for each thread and want them to pass information through the end of the parallel region. Does it happen here?
As you suspect, each thread will get its own version of the common block. When you enter the first parallel region, the values in the block will be undefined unless you use copyin to broadcast the values from the master thread. For subsequent parallel regions the values will be retained as long as the number of threads used in each region is the same. Between regions the values in the common block will be those of the master thread.
Are those common blocks accessible through the subroutine? My Ford subroutine rewrites some variables in the common block and the Condact subroutine rewrites them again, but the function uses the values from the Condact subroutine. Is it possible to rewrite and pass the common block variables using threadprivate here?
I have to admit I am unsure what you are asking here. But if you are asking whether common can be used to communicate variables between different sub-programs in OpenMP code, the answer is yes, just as in serial Fortran (note capitalisation).
How about converting the common blocks into modules?
Change common /root/ root1, root2 to use root, then make a new file root.f that contains:
module root
implicit none
save
real :: root1, root2
!$omp threadprivate( root1, root2 )
end module root
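A self-contained sketch of how such a module might be used is shown below, written in free-form source. The driver program, the routines set_roots and sum_roots, and the loop bounds are invented for illustration; only the module itself follows the answer. It needs to be compiled with OpenMP enabled (for example -fopenmp with gfortran) for the directives to take effect.

```fortran
module root
  implicit none
  save
  real :: root1, root2
  !$omp threadprivate(root1, root2)
end module root

program threadprivate_demo
  use root
  implicit none
  integer :: i
  real :: total

  total = 0.0
  !$omp parallel do reduction(+:total)
  do i = 1, 100
     call set_roots(i)             ! writes this thread's own copy
     total = total + sum_roots()   ! reads back the same thread's copy
  end do
  !$omp end parallel do

  print *, 'total =', total
contains
  ! Hypothetical counterpart of a routine that fills the old common block.
  subroutine set_roots(k)
    integer, intent(in) :: k
    root1 = real(k)
    root2 = 2.0*real(k)
  end subroutine set_roots

  ! Hypothetical counterpart of a routine that later reads those values.
  function sum_roots() result(s)
    real :: s
    s = root1 + root2
  end function sum_roots
end program threadprivate_demo
```

Each iteration contributes 3*i, so total should come out as 15150 regardless of the number of threads. Without the threadprivate directive, threads could overwrite each other's root1 and root2 between the two calls.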

Allocating arrays in a Fortran Subroutine

I need to read a lot of data from a file in a Fortran program. The size of the data is variable, so I would like to dynamically allocate the arrays. My idea is to write a subroutine that reads all the data and allocates the memory. A simplified version of the program is:
program main
  implicit none
  real*8, dimension(:,:), allocatable :: v
  integer*4 n
  !This subroutine will read all the data and allocate the memory
  call Memory(v,n)
  !From here the program will have other subroutines to make calculations
end
subroutine Memory(v,n)
  implicit none
  real*8, dimension(:,:), allocatable :: v
  integer*4 n,i
  n=5
  allocate(v(n,2))
  do i=1,n
    v(i,1)=1.0
    v(i,2)=2.0
  enddo
  return
end subroutine Memory
This program gives me the following error:
Error: Dummy argument 'v' of procedure 'memory' at (1) has an attribute that requires an explicit interface for this procedure
Is this the right way of structuring this kind of program? If so, how can I solve the error?
Thanks.
Assuming that you have a single source file containing both the program and the subroutine, as your post suggests, the easiest fix is to replace the line containing the statement
end
with a line containing the statement
contains
and to write, at the end of the source file, a line containing the statement
end program
(Yes, the keyword program is not required but it is useful.)
The problem that your compiler has reported is that, as you have structured your code, the program does not know anything about the interface to the subroutine memory; that interface is, in Fortran terms, implicit. When a subroutine has an allocatable dummy argument, so that an allocatable array can be passed in or out as such, that subroutine must have an explicit interface.
There are several ways to provide an explicit interface. One is, as I have shown you, to contain the subroutine within the program. Another, and more useful way when your programs become a little bit larger, is to write your subroutines in modules and use-associate them in the program that wants to use them. Read the parts of your Fortran tutorial that cover modules and the use statement.
There is at least one other option but it is, especially for a beginner, unattractive and I won't mention it here.
And, while I'm writing: learn about and use the intent attribute to specify whether an argument to a subroutine will be read, written to, or both. This is a great aid to safe programming; your favourite Fortran resources will explain it in detail.
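For comparison, the module-based route could look roughly like this. The module name memory_mod is made up for the example, and real(real64) and default integer are used here in place of the real*8 and integer*4 spellings from the question; the subroutine body otherwise follows the question.

```fortran
module memory_mod
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Being in a module gives this subroutine an explicit interface,
  ! which an allocatable dummy argument requires.
  subroutine memory(v, n)
    real(real64), allocatable, intent(out) :: v(:,:)
    integer, intent(out) :: n
    integer :: i

    n = 5
    allocate(v(n, 2))
    do i = 1, n
       v(i, 1) = 1.0_real64
       v(i, 2) = 2.0_real64
    end do
  end subroutine memory
end module memory_mod

program main
  use, intrinsic :: iso_fortran_env, only: real64
  use memory_mod, only: memory
  implicit none
  real(real64), allocatable :: v(:,:)
  integer :: n

  ! The array comes back allocated and filled by the subroutine.
  call memory(v, n)
  print *, 'n =', n, ' shape(v) =', shape(v)
end program main
```

Both arguments are intent(out) because the subroutine only sets them; an allocatable intent(out) argument that arrives already allocated is deallocated on entry.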

Allocatable arrays performance

There is an MPI version of a program which uses COMMON blocks to store arrays that are used throughout the code. Unfortunately, there is no way to declare arrays in a COMMON block whose size is known only at run time. So, as a workaround, I decided to move those arrays into modules, which can hold ALLOCATABLE arrays. That is, all the arrays in COMMON blocks were removed and ALLOCATE was used instead. This was the only thing I changed in my program. Unfortunately, the performance of the program was awful compared with the COMMON block implementation. As for the MPI settings, there is a single MPI process on each computational node and each MPI process has a single thread.
I found a similar question asked here, but I don't understand how it could be applied to my case (where each process has a single thread). I appreciate any help.
Here is a simple example which illustrates what I was talking about (below is a pseudocode):
"SOURCE FILE":
SUBROUTINE ZEROSET()
  INCLUDE 'FILE_1.INC'
  INCLUDE 'FILE_2.INC'
  INCLUDE 'FILE_3.INC'
  ....
  INCLUDE 'FILE_N.INC'
  ARRAY_1 = 0.0
  ARRAY_2 = 0.0
  ARRAY_3 = 0.0
  ARRAY_4 = 0.0
  ...
  ARRAY_N = 0.0
END SUBROUTINE
As you can see, ZEROSET() has no parallel or MPI code. FILE_1.INC, FILE_2.INC, ..., FILE_N.INC are files where ARRAY_1, ARRAY_2, ..., ARRAY_N are defined in COMMON blocks, something like this:
REAL ARRAY_1
COMMON /ARRAY_1/ ARRAY_1(NX, NY, NZ)
where NX, NY, NZ are well-defined parameters declared with the PARAMETER statement.
When I use modules, I simply removed all the COMMON blocks, so FILE_I.INC looks like
REAL, ALLOCATABLE:: ARRAY_I(:,:,:)
I then changed the "INCLUDE 'FILE_I.INC'" statement above to "USE FILE_I". In practice, when the parallel program is executed, one particular process does not need the whole (NX, NY, NZ) domain, so I calculate the parameters and then allocate ARRAY_I (only ONCE!).
Subroutine ZEROSET() takes 0.18 seconds with COMMON blocks and 0.36 seconds with modules (when the array dimensions are calculated at run time). So the run time doubled.
I hope that everything is clear now. I appreciate your help very much.
Using allocatable arrays in modules can often hurt performance because the compiler has no idea about sizes at compile time. With many compilers you will get much better performance with this code:
subroutine X
  use Y ! Has allocatable array A(N,N) in it
  call Z(A,N)
end subroutine
subroutine Z(A,N)
  integer N
  real A(N,N)
  ! do stuff here
end
than with this code:
subroutine X
  use Y ! Has allocatable array A(N,N) in it
  ! do stuff here
end subroutine
In the first version, the compiler knows inside Z that the array is NxN and that the do loops run over N, and it can take advantage of that fact (most codes work that way on arrays). In the second version, after any subroutine call in "do stuff here", the compiler has to assume that array "A" might have changed size or moved in memory, and recheck. That kills optimization.
This should get you most of your performance back.
Common blocks also sit at a fixed location in memory, which allows further optimizations.
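The wrapper pattern from this answer could be written out as in the following sketch. The module names grid_data and kernels, the routine zero_kernel, and the size n = 4 are all made up for illustration. Whether this actually recovers the lost performance depends on the compiler, so it should be measured rather than assumed.

```fortran
module grid_data
  implicit none
  ! Module-level allocatable array, sized at run time.
  real, allocatable :: a(:,:)
  integer :: n = 0
end module grid_data

module kernels
  implicit none
contains
  ! Inside the kernel the array is an explicit-shape dummy: its extents
  ! are fixed by n for the whole call and it cannot be reallocated here.
  subroutine zero_kernel(a, n)
    integer, intent(in) :: n
    real, intent(out) :: a(n, n)
    integer :: i, j
    do j = 1, n
       do i = 1, n
          a(i, j) = 0.0
       end do
    end do
  end subroutine zero_kernel
end module kernels

program wrapper_demo
  use grid_data
  use kernels
  implicit none

  n = 4                    ! in real code this is computed at run time
  allocate(a(n, n))
  a = 1.0

  ! Thin wrapper call: hand the module array to the kernel.
  call zero_kernel(a, n)
  print *, 'max |a| after zeroing =', maxval(abs(a))
end program wrapper_demo
```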
Actually, I guess your problem here is indeed down to compiler optimization, in combination with stack vs. heap memory. Depending on the compiler you're using, it might zero the memory more efficiently, and for a fixed chunk of memory it does not even need to check its extent and location within the subroutine. Thus, with fixed-size arrays there will be nearly no overhead involved.
Is this routine called very often, or why do you care about these 0.18 s?
If it is indeed relevant, the best option would be to get rid of the zeroing altogether and instead, for example, peel off the first iteration of the loop and use it for the initialization. That way you do not introduce additional memory accesses just to initialize with 0. However, it would duplicate some code...
I can think of just these reasons when it comes to Fortran performance with arrays:
arrays on the stack vs. the heap, but I doubt this could have a huge performance impact.
passing arrays to a subroutine, because the best way to do that depends on the array; see this page on using arrays efficiently
