How efficient are Fortran's (90+) intrinsic (math) functions? I especially care about tanh and sech but am interested in the other Fortran intrinsic functions as well.
By "how efficient" I mean that if it is very hard to come up with a faster method then the intrinsics are efficient but if it is very easy to come up with a faster method then the intrinsics are inefficient.
Here is a MWE, in which my change to try to make it faster actually made it slower, suggesting the intrinsics are efficient.
program main
  implicit none
  integer, parameter :: n = 10000000
  integer :: i
  real :: x, var
  real :: t1, t2, t3, t4

  !! Intrinsic first
  call cpu_time(t1)
  do i = 1, n
    x = REAL(i)/300.0
    var = tanh(x)
  end do
  call cpu_time(t2)
  write(*,*) "Elapsed CPU Time = ", t2 - t1
  write(*,*) var

  !! Intrinsic w/ small change
  call cpu_time(t3)
  do i = 1, n
    x = REAL(i)/300.0
    if (x > 10.0) then
      var = 1.0
    else
      var = tanh(x)
    end if
  end do
  call cpu_time(t4)
  write(*,*) "Elapsed CPU Time = ", t4 - t3
  write(*,*) var
end program main
Note that the compiler seems to be lazy: if I don't include the write(*,*) var, the whole loop is apparently optimized away and it reports Elapsed CPU Time = 0.0.
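A minimal sketch of one way to defeat this (my addition, not part of the original question): accumulate the results and print the sum, so that every iteration has an observable effect and cannot be removed as dead code.

program keep_alive
  implicit none
  integer, parameter :: n = 10000000
  integer :: i
  real :: x, acc
  real :: t1, t2
  acc = 0.0
  call cpu_time(t1)
  do i = 1, n
    x = REAL(i)/300.0
    acc = acc + tanh(x)  ! accumulating makes each iteration observable
  end do
  call cpu_time(t2)
  write(*,*) "Elapsed CPU Time = ", t2 - t1
  write(*,*) acc         ! printing the sum keeps the loop live
end program keep_alive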
Related
I have a complex matrix A, and would like to modify it Nt times according to A = exp( -1i*(A + abs(A).^2) ). The size of A is typically 1000x1000, and the number of times to run would be around 10000.
I am looking to reduce the time taken to carry out these operations. For 1000 iterations on the CPU, I measure around 6.4 seconds. Following the Matlab documentation, I was able to move this to the GPU, which reduced the time taken to 0.07 seconds (an incredible 91x improvement!). So far so good.
However, I also now read this link in the docs, which describes how we can sometimes find even further improvement for element-wise calculations if we use arrayfun() as well. If I try to follow the tutorial, the time taken is actually worse, clocking in at 0.47 seconds. My tests are shown below:
Nt = 1000; % Number of times to run each method
test_functionFcn = @test_function;
A = rand( 500, 600, 'double' ) + rand( 500, 600, 'double' )*1i; % Define an initial complex matrix
gpu_A = gpuArray(A); % Transfer matrix to a GPU array
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%%%%%%%%%
cpu_data_out = A;
tic
for k = 1:Nt
    cpu_data_out = test_function( cpu_data_out );
end
tcpu = toc;
%%%%%%%%%%%%%%%%% Run the calculation Nt times on GPU directly %%%%%%%%%%%%%%%%%%%%
gpu_data_out = gpu_A;
tic
for k = 1:Nt
    gpu_data_out = test_function(gpu_data_out);
end
tgpu = toc;
%%%%%%%%%%%%%% Run the calculation Nt times on GPU using arrayfun() %%%%%%%%%%%%%%
gpuarrayfun_data_out = gpu_A;
tic
for k = 1:Nt
    gpuarrayfun_data_out = arrayfun( test_functionFcn, gpuarrayfun_data_out );
end
tgpu_arrayfun = toc;
%%% Print results %%%
fprintf( 'Time taken using only CPU: %g\n', tcpu );
fprintf( 'Time taken using gpuArray directly: %g\n', tgpu );
fprintf( 'Time taken using GPU + arrayfun(): %g\n', tgpu_arrayfun );
%%% Function to operate on matrices %%%
function y = test_function(x)
    y = exp(-1i*(x + abs(x).^2));
end
and the results are:
Time taken using only CPU: 6.38785
Time taken using gpuArray directly: 0.0680587
Time taken using GPU + arrayfun(): 0.474612
My questions are:
Am I using arrayfun() correctly in this situation, and it is expected that arrayfun() should be worse?
If so, and it is really just expected that it is slower than the direct gpuArray method, is there any easy (i.e non-MEX) way to speed up such a calculation? (I see they also mention using pagefun for example).
Thanks in advance for any advice.
(The graphics card is Nvidia Quadro M4000, and I am running Matlab R2017a)
Edit
After reading @Edric's answer, I think it is important to show a little more of the wider code. One thing I didn't mention in the OP is that in my actual main code, inside the k=1:Nt loop there is an additional operation: a matrix multiplication with the transpose of a sparse, tridiagonal matrix. Here is a more fleshed-out MWE of what is really going on:
Nt = 1000; % Number of times to run each method
N_rows = 500;
N_cols = 600;
test_functionFcn = @test_function;
A = rand( N_rows, N_cols, 'double' ) + rand( N_rows, N_cols, 'double' )*1i; % Define an initial complex matrix
%%% Generate a sparse, tridiagonal, square transformation matrix %%%%%%%%
mm = 10*ones(N_cols,1); % Subdiagonal elements
dd = 20*ones(N_cols,1); % Main diagonal elements
pp = 30*ones(N_cols,1); % Superdiagonal elements
M = spdiags([mm dd pp],-1:1,N_cols,N_cols);
M(1,1) = 6; % Set a couple of other entries
M(2,1) = 3;
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%
cpu_data_out = A;
for k = 1:Nt
    cpu_data_out = test_function( cpu_data_out );
    cpu_data_out = cpu_data_out*M.';
end
%%% Function to operate on matrices %%%
function y = test_function(x)
    y = exp(-1i*(x + abs(x).^2));
end
I'm very sorry for not including that in the OP - I did not realise at the time that it might be relevant to the solution. Does this change things? Are there still gains to be made with arrayfun() on the GPU, or is this now not suitable for converting to arrayfun()?
A few points here. Firstly (and most importantly), to time code on the GPU, you need to use either gputimeit, or you need to inject a call to wait(gpuDevice) before calling toc. That's because work is launched asynchronously on the GPU, and you only get accurate timings by waiting for it to finish. With those minor modifications, on my GPU, I see 0.09 seconds for the gpuArray method, and 0.18 seconds for the arrayfun version.
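For illustration, a minimal sketch of the wait-based timing, reusing the variable names from the question (my sketch, not from the original answer):

gpu_data_out = gpu_A;
tic
for k = 1:Nt
    gpu_data_out = test_function(gpu_data_out);
end
wait(gpuDevice)  % block until all queued GPU work has finished
tgpu = toc;      % toc now reflects the actual GPU execution time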
Running a loop of GPU operations is generally inefficient, so the main gain you can get here is by pushing the loop inside the arrayfun function body so that the loop runs directly on the GPU. Like this:
%%% Function to operate on matrices %%%
function x = test_function(x,Nt)
    for ii = 1:Nt
        x = exp(-1i*(x + abs(x).^2));
    end
end
You'll need to invoke it like A = arrayfun(@test_function, A, Nt). On my GPU, this brings the arrayfun time down to 0.05 seconds, so about twice as fast as the plain gpuArray version.
I encounter an issue with arrays in OpenACC kernels. Here is the demo code:
module mpoint
  type point
    real :: x, y, z
    real :: tmp
    real :: v(3)
  end type point
end module mpoint

program main
  use mpoint
  implicit none
  integer, parameter :: n = 3
  real, allocatable :: array(:)
  !--------------------------------
  allocate(array(n))
  array(:) = 1.0
  call vecadd
contains
  subroutine vecadd()
    integer :: i
    type(point) :: A
    real :: w(3)
    real :: scalar

    A%v(:) = 0.0
    A%v(1) = 1.0
    w(:) = 0.0
    w(1) = 1.0
    scalar = 1.0
    write(*,*) 'host: v(1) = ', A%v(1)
    write(*,*) 'host: w(1) = ', w(1)
    write(*,*) 'host: scalar = ', scalar

    !$acc parallel loop firstprivate(A,w)
    do i = 1, n
      write(*,*) 'device: v(1) = ', A%v(1)
      write(*,*) 'device: w(1) = ', w(1)
      write(*,*) 'device: scalar = ', scalar
    enddo
  end subroutine vecadd
end program main
When I compile it with nvfortran -acc -Minfo=accel test.f90 and run it, it shows that on the device the values in the arrays are 0.0, not the correct values of 1.0 that I set on the host side. This happens only for arrays: scalars, as shown in the example, have the right values.
I am wondering if this is a limitation of nvfortran, or of the current OpenACC standard?
Looks like a compiler issue where we're not initializing the small arrays on the host. In looking through our existing bug reports, I see one that's nearly identical, just with C instead of Fortran, that by coincidence got fixed in our development compiler this morning. Unfortunately it didn't seem to fix your issue as well. I sent a note to the compiler engineer assigned to the issue asking if he can take a look.
Worst case if the issue turns out to be similar but unrelated, I'll open a new problem report and update this post with the tracking number.
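In the meantime, a possible workaround (my suggestion, not part of the original answer): since A and w are only read inside the loop, they do not strictly need to be privatized; making them shared read-only copies with copyin sidesteps the firstprivate initialization while still delivering the host values to the device.

!$acc parallel loop copyin(A, w)  ! shared read-only copies instead of private ones
do i = 1, n
  write(*,*) 'device: v(1) = ', A%v(1)
  write(*,*) 'device: w(1) = ', w(1)
  write(*,*) 'device: scalar = ', scalar
enddo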
I have some basic pseudocode as follows,
PROGRAM PSEUDOEXAMPLE
  IMPLICIT NONE
  !Define all types
  integer :: i, j
  real :: xi, yi, zi, xj, yj, zj, separation
  real :: Array_I(10,3), Array_J(10,3)
  !Load some data into arrays: Array_I, Array_J
  call random_number(Array_I)
  call random_number(Array_J)
  !Do Loops
  do i = 1,10
    xi = Array_I(i,1)
    yi = Array_I(i,2)
    zi = Array_I(i,3)
    do j = 1,10
      xj = Array_J(j,1)
      yj = Array_J(j,2)
      zj = Array_J(j,3)
      separation = ((xi - xj)**2 + (yi-yj)**2 + (zi-zj)**2)**0.5
    enddo
  enddo
END PROGRAM PSEUDOEXAMPLE
I can time a single i-step at ~0.3 seconds. What are the best ways to reduce this time? I suspect that removing the square root would be effective. I am using gfortran as my compiler.
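To sketch the square-root idea (my illustration, with a hypothetical cutoff; not from the original post): if separation is only ever compared against a cutoff distance, you can compare squared values instead and drop the square root from the inner loop entirely.

PROGRAM NOSQRT
  IMPLICIT NONE
  integer :: i, j, num_close
  real :: xi, yi, zi, dx, dy, dz, sep2
  real, parameter :: cutoff = 0.5        ! hypothetical cutoff distance
  real, parameter :: cutoff2 = cutoff**2 ! compare against the square instead
  real :: Array_I(10,3), Array_J(10,3)
  call random_number(Array_I)
  call random_number(Array_J)
  num_close = 0
  do i = 1,10
    xi = Array_I(i,1)
    yi = Array_I(i,2)
    zi = Array_I(i,3)
    do j = 1,10
      dx = xi - Array_J(j,1)
      dy = yi - Array_J(j,2)
      dz = zi - Array_J(j,3)
      sep2 = dx*dx + dy*dy + dz*dz  ! squared separation, no square root
      if (sep2 < cutoff2) num_close = num_close + 1
    enddo
  enddo
  write(*,*) 'pairs within cutoff:', num_close
END PROGRAM NOSQRT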
I am trying to solve a system of ODEs, and my input excitation is a function of time.
I have been using interp1 inside the integration function, but this doesn't seem like a very efficient way to do it. I know it is not, because once I change the input excitation to a sin function, which does not require an interp1 call inside the function, I get much, much faster results; with interpolation at every step it takes about 10–20 times longer to converge. So, is there a better way of solving ODEs for an arbitrary time-dependent excitation, without needing to do interpolation or some other tricks to speed it up?
I am just copying a modified version of a simple example from The MathWorks here:
Input Excitation is a gradually increasing sin function, but after some time later it becomes a constant amplitude sin function.
Dt = 0.01; % sampling time step
Amp0 = 2; % Final Amplitude of signal
Dur_G = 10; % Duration of gradually increasing part of signal
Dur_tot = 25; % Duration of total signal
t_G = 0 : Dt : Dur_G; % time of gradual part
A = linspace(0, Amp0, length(t_G));
carrier_1 = sin(5*t_G); % Unit Normal Signal
carrier_A0 = Amp0*sin(5*t_G);
out_G = A.*carrier_1; % Gradually Increasing Signal
% Total Signal with Gradual Constant Amplitude Parts
t_C = Dur_G+Dt:Dt:Dur_tot; % time of constant part
out_C = Amp0*sin(5*t_C); % Signal of constant part
ft = [t_G t_C]; % total time
f = [out_G out_C]; % total signal
figure; plot(ft, f, '-b'); % input excitation
tspan = [1 5]; ic = 1;
opts = odeset('RelTol',1e-2,'AbsTol',1e-4);
[t,y] = ode45(@(t,y) myode(t,y,ft,f), tspan, ic, opts);
figure;
plot(t,y);

function dydt = myode(t,y,ft,f)
    f = interp1(ft,f,t); % Interpolate the data set (ft,f) at time t
    g = 2;               % a constant
    dydt = -f.*y + g;    % Evaluate ODE at time t
end
Note that I explained only the first part of my problem above, which is solving the system for a gradually increasing sin function.
In the second part, I need to solve it for an arbitrary input excitation (e.g., a ground acceleration input).
For this example, you could use the griddedInterpolant class to get a bit of a speed-up:
ft = linspace(0,5,25);
f = ft.^2 - ft - 3;
Fp = griddedInterpolant(ft,f);
gt = linspace(1,6,25);
g = 3*sin(gt-0.25);
Gp = griddedInterpolant(gt,g);
tspan = [1 5];
ic = 1;
opts = odeset('RelTol',1e-2,'AbsTol',1e-4);
[t,y] = ode45(@(t,y)myode(t,y,Fp,Gp),tspan,ic,opts);
figure;
plot(t,y);
The ODE function is then:
function dydt = myode(t,y,Fp,Gp)
    f = Fp(t); % Interpolate the data set (ft,f) at time t
    g = Gp(t); % Interpolate the data set (gt,g) at time t
    dydt = -f.*y + g; % Evaluate ODE at time t
end
On my system with R2015b, the call to ode45 is about three times faster (0.011 sec vs. 0.035 sec) for your example. You could get a bit more speed by switching to ode23. You can read more about the griddedInterpolant class here.
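For reference, the switch to ode23 mentioned above is a drop-in replacement of the solver call (same function handle and options):

[t,y] = ode23(@(t,y)myode(t,y,Fp,Gp), tspan, ic, opts);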
If your actual system discretely switches between inputs at particular points in time, then you should probably solve the problem piecewise by integrating each case separately. See this question and this question. If the system switches based on the value of the state variable(s), then you should use event location (see this question). However, if "solving ODEs for random time dependent excitation" means that you're adding random noise to the system, then you have an SDE rather than an ODE, which is a completely different beast.
In this link, the author gives this example:
subroutine threshold(a, thresh, ic)
  real, dimension(:), intent(in) :: a
  real, intent(in) :: thresh
  integer, intent(out) :: ic
  real :: tt
  integer :: n, j

  ic = 0
  tt = 0.d0
  n = size(a)
  do j = 1, n
    tt = tt + a(j) * a(j)
    if (sqrt(tt) >= thresh) then
      ic = j
      return
    end if
  end do
end subroutine threshold
and the author commented this code as
An alternative approach, which would allow for many optimizations (loop unrolling, CPU pipelining, less time spent evaluating the conditional), would involve adding tt in blocks (e.g., blocks of size 128) and checking the conditional after each block. When the condition is met, the last block can be repeated to determine the value of ic.
What does this mean? What are loop unrolling and CPU pipelining, and what is adding tt in blocks?
How can I optimize the code as the author suggests?
If the loop is performed in chunks/blocks that fit into the CPU cache you will reduce the number of cache misses, and consequently the number of cache lines retrieved from memory. This increases the performance on all loops that are limited by memory operations.
If the corresponding block size is BLOCKSIZE, this is achieved by
do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
  do jj = j, j+BLOCKSIZE-1
    tt = tt + a(jj) * a(jj)
  end do
end do
This, however, leaves a remainder that is not treated in the main loop. To illustrate this, consider an array of length 1000 with BLOCKSIZE = 128. The first seven blocks cover elements 1-896, but the remaining elements 897-1000 do not fill a complete block. Therefore, another loop for the remainder is required:
do j = (n/BLOCKSIZE)*BLOCKSIZE + 1, n
  ! ...
enddo
While it makes little sense to remove the conditional from the remainder loop, in the blocked main loop it can be hoisted out of the inner loop and checked once per block.
As no branches then occur in the inner loop, the compiler can apply aggressive optimizations.
However, this limits the "accuracy" of the determined position to block granularity. To get element-wise accuracy, you have to repeat the calculation for the last block.
Here is the complete code:
subroutine threshold_block(a, thresh, ic)
  implicit none
  real, dimension(:), intent(in) :: a
  real, intent(in) :: thresh
  integer, intent(out) :: ic
  real :: tt, tt_bak, thresh_sqr
  integer :: n, j, jj
  integer, parameter :: BLOCKSIZE = 128

  ic = 0
  tt = 0.d0
  thresh_sqr = thresh**2
  n = size(a)

  ! Perform the loop in chunks of BLOCKSIZE, covering only complete blocks
  do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
    tt_bak = tt
    do jj = j, j+BLOCKSIZE-1
      tt = tt + a(jj) * a(jj)
    end do
    ! Perform the check on the block level
    if (tt >= thresh_sqr) then
      ! If the threshold is reached, repeat the last block
      ! to determine the exact position
      tt = tt_bak
      do jj = j, j+BLOCKSIZE-1
        tt = tt + a(jj) * a(jj)
        if (tt >= thresh_sqr) then
          ic = jj
          return
        end if
      end do
    end if
  end do

  ! Remainder is treated element-wise
  do j = (n/BLOCKSIZE)*BLOCKSIZE + 1, n
    tt = tt + a(j) * a(j)
    if (tt >= thresh_sqr) then
      ic = j
      return
    end if
  end do
end subroutine threshold_block
Please note that compilers are nowadays very good at creating blocked loops in combination with other optimizations. In my experience it is quite difficult to get better performance out of such simple loops by tweaking them manually.
Loop blocking is enabled in gfortran with the compiler option -floop-block.
Loop unrolling can be done manually, but should be left to the compiler. The idea is to perform the loop in blocks and, instead of a second (inner) loop as shown above, to duplicate the code for the operations. Here is the inner loop from above, unrolled by a factor of four:
do jj = j, j+BLOCKSIZE-1, 4
  tt = tt + a(jj)   * a(jj)
  tt = tt + a(jj+1) * a(jj+1)
  tt = tt + a(jj+2) * a(jj+2)
  tt = tt + a(jj+3) * a(jj+3)
end do
Here, no remainder can occur if BLOCKSIZE is a multiple of 4. You can probably shave off a few operations in here ;-)
The gfortran compiler option to enable this is -funroll-loops.
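For example, a compile line requesting both loop blocking and loop unrolling might look like this (the source file name is illustrative, and -floop-block requires a gfortran build with Graphite support):

gfortran -O3 -floop-block -funroll-loops threshold.f90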
As far as I know, CPU Pipelining (Instruction Pipelining) cannot be enforced manually in Fortran. This task is up to the compiler.
Pipelining sets up a pipe of instructions. You feed the complete array into that pipe and, after the wind-up phase, you will get a result with each clock cycle. This drastically increases the throughput.
However, branches are difficult (if not impossible) to treat in pipes, and the array should be long enough that the time required for setting up the pipe and for the wind-up and wind-down phases is compensated.