I have the following code for testing the Intel MKL DAXPY routine.
program test
implicit none
integer, parameter :: n = 50000000
integer, parameter :: nloop = 100
real(8), dimension(:), allocatable :: a, b
integer start_t, end_t, rate, i
allocate(a(n))
allocate(b(n))
a = 1.0d0
b = 2.0d0
call system_clock(start_t, rate)
do i = 1, nloop
call sumArray(a, b, a, 3.0d0, n)
end do
call system_clock(end_t)
print *, sum(a)
print *, "sumArray time: ", real(end_t-start_t)/real(rate)
a = 1.0d0
b = 2.0d0
call system_clock(start_t, rate)
do i = 1, nloop
call daxpy(n, 3.0d0, b, 1, a, 1)
end do
call system_clock(end_t)
print *, sum(a)
print *, "daxpy time: ", real(end_t-start_t)/real(rate)
a = 1.0d0
b = 2.0d0
call system_clock(start_t, rate)
do i = 1, nloop
a = a + 3.0d0*b
end do
call system_clock(end_t)
print *, sum(a)
print *, "a + 3*b time: ", real(end_t-start_t)/real(rate)
end program test
subroutine sumArray(x, y, z, alfa, n)
implicit none
integer n, i
real(8) x(n), y(n), z(n), alfa
!$OMP PARALLEL DO
do i = 1, n
z(i) = x(i) + alfa*y(i)
end do
!$OMP END PARALLEL DO
end subroutine sumArray
Here, sumArray is a handwritten subroutine with OpenMP that does something similar to DAXPY.
When I compile the code with ifort test.f90 -o test -O3 -openmp -mkl the results are (approximately):
sumArray time: 5.7 sec
daxpy time: 5.7 sec
a + 3*b time: 1.9 sec
However, when I compile it with ifort test.f90 -o test -O3 -openmp -mkl -ipo the results for a + 3*b change dramatically:
sumArray time: 5.7 sec
daxpy time: 5.7 sec
a + 3*b time: 9.3 sec
So firstly, why is the naive array sum faster than MKL? And what does -ipo have to do with the slowdown of the naive array sum? What also bothers me is that when I eliminate the loops, that is, when I time each operation just once, the times are just like the first case divided by 1000 (around 5.7 ms for sumArray and daxpy, 9.3 ms for a + 3*b), regardless of whether I use -ipo. My guess is that something about the naive sum in a loop allows the compiler to optimize further, but the -ipo flag interferes with this optimization. Note: I know that -ipo is useless in this case, since there is only a single file.
I read this post Puzzling performance difference between ifort and gfortran from 2012 and tried to run the code by myself:
PROGRAM PERFECT_SQUARE
IMPLICIT NONE
INTEGER*8 :: N, M, NTOT
LOGICAL :: IS_SQUARE
N=4
WRITE(*,*) IS_SQUARE(N)
NTOT = 0
DO N = 1, 1000000000
IF (IS_SQUARE(N)) THEN
NTOT = NTOT + 1
END IF
END DO
WRITE (*, *) NTOT ! should find 31622 squares
END PROGRAM
LOGICAL FUNCTION IS_SQUARE(N)
IMPLICIT NONE
INTEGER*8 :: N, M
! check if negative
IF (N .LT. 0) THEN
IS_SQUARE = .FALSE.
RETURN
END IF
! check if ending 4 bits belong to (0,1,4,9)
M = IAND(int(N, kind(8)), int(15, kind(8)))
IF (.NOT. (M .EQ. 0 .OR. M .EQ. 1 .OR. M .EQ. 4 .OR. M .EQ. 9)) THEN
IS_SQUARE = .FALSE.
RETURN
END IF
! try to find the nearest integer to sqrt(n)
M = DINT(SQRT(DBLE(N)))
IF (M**2 .NE. N) THEN
IS_SQUARE = .FALSE.
RETURN
END IF
IS_SQUARE = .TRUE.
RETURN
END FUNCTION
(I changed a few lines so that the code will compile).
I compiled with ifort 19.1.3.304 and gfortran 10.2.0.
The ifort executable ran much slower than the gfortran one and contained many more instructions. According to the post, ifort 12.1.2.273 should have fixed the problem, but it does not seem to be fixed.
If I turn off vectorization on ifort, I get much better results on ifort but still worse than gfortran.
If I remove lines:
N=4
WRITE(*,*) IS_SQUARE(N)
I get much worse results on gfortran compared to the version with these lines.
I could keep going with other odd behaviour I have noticed, but I mainly want to know how such simple code can cause such a big difference in run time between the two compilers, and how I can avoid cases like this.
I compiled with -O3 on both and tried adding -no-vec to ifort. I'm open to other compilation flags.
UPDATE:
ifort example.f90 -O3 -o example_ifort
gfortran example.f90 -O3 -o example_gnu
The run time for example_ifort is 3 sec, while for example_gnu it is 1 sec.
I have fftw3 code that works ONLY when the array size is less than or equal to 16. Could some expert please tell me what I am doing wrong?
program fftcalc
implicit none
integer n
parameter (n=128)
double precision in(n)
double complex out(1+(n/2))
integer*8 p, p2
integer i,j
real fact
real f,dt,t,pi
double precision re
integer FFTW_FORWARD,FFTW_BACKWARD
parameter (FFTW_FORWARD=-1,FFTW_BACKWARD=1)
integer FFTW_REAL_TO_COMPLEX,FFTW_COMPLEX_TO_REAL
parameter (FFTW_REAL_TO_COMPLEX=-1,FFTW_COMPLEX_TO_REAL=1)
integer FFTW_ESTIMATE,FFTW_MEASURE
parameter (FFTW_ESTIMATE=0,FFTW_MEASURE=1)
integer FFTW_OUT_OF_PLACE,FFTW_IN_PLACE,FFTW_USE_WISDOM
parameter (FFTW_OUT_OF_PLACE=0)
parameter (FFTW_IN_PLACE=8,FFTW_USE_WISDOM=16)
integer FFTW_THREADSAFE
parameter (FFTW_THREADSAFE=128)
f=25.
pi=3.14159
dt=0.004
do i=1,n
j=i-1
t=j*dt
re=cos(2.*pi*f*t)
in(i)=re
write(*,*) 'i,in',i,in(i)
enddo
write(*,*) "creating plans"
call dfftw_plan_dft_r2c_1d(p,n,in,out,fftw_measure)
call dfftw_plan_dft_c2r_1d(p2,n,in,out,fftw_measure)
write(*,*) "execute"
call dfftw_execute_dft_r2c(p,in,out)
write(*,*) "forward is executed"
do i=1,n
write(*,*) out(i)
enddo
write(*,*)
write(*,*)"do inverse fft"
call dfftw_execute_dft_c2r(p2, out,in)
write(*,*)"inverse fft completed"
fact=1.0/n
do i=1,n
write(*,*) in(i)*fact
enddo
write(*,*)"clean up"
call dfftw_destroy_plan(p)
call dfftw_destroy_plan(p2)
end program fftcalc
I tried n=16 and n=10 and it works, but with n=64 or so the forward transform comes out mostly zero.
How do I solve the problem?
How efficient are Fortran's (90+) intrinsic (math) functions? I especially care about tanh and sech but am interested in the other Fortran intrinsic functions as well.
By "how efficient" I mean that if it is very hard to come up with a faster method then the intrinsics are efficient but if it is very easy to come up with a faster method then the intrinsics are inefficient.
Here is a MWE, in which my change to try to make it faster actually made it slower, suggesting the intrinsics are efficient.
program main
implicit none
integer, parameter :: n = 10000000
integer :: i
real :: x, var
real :: t1,t2,t3,t4
!! Intrinsic first
call cpu_time(t1)
do i = 1, n
x = REAL(i)/300.0
var = tanh(x)
end do
call cpu_time(t2)
write(*,*) "Elapsed CPU Time = ", t2 - t1
write(*,*) var
!! Intrinsic w/ small change
call cpu_time(t3)
do i = 1, n
x = REAL(i)/300.0
if (x > 10.0) then
var = 1.0
else
var = tanh(x)
end if
end do
call cpu_time(t4)
write(*,*) "Elapsed CPU Time = ", t4 - t3
write(*,*) var
end program main
Note that Fortran90 seems to be lazy; if I don't include the "write(*,*) var" statements, it reports an elapsed CPU time of 0.0.
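The 0.0 reading is most likely the optimizer removing a loop whose result is never used, rather than lazy evaluation in the language. A minimal sketch of one way to keep the timed work alive, assuming it is acceptable to sum the tanh values (the accumulator acc is a hypothetical name, not part of the code above):

```fortran
program keep_result_alive
  implicit none
  integer, parameter :: n = 10000000
  integer :: i
  real :: x, t1, t2
  double precision :: acc   ! hypothetical accumulator, not in the original code

  acc = 0.0d0
  call cpu_time(t1)
  do i = 1, n
     x = real(i)/300.0
     ! Every tanh value contributes to acc, so the loop result depends
     ! on all iterations and not only on the last one
     acc = acc + tanh(x)
  end do
  call cpu_time(t2)

  write(*,*) "Elapsed CPU Time = ", t2 - t1
  ! Printing acc makes the result of the loop observable
  write(*,*) "sum of tanh values = ", acc
end program keep_result_alive
```

In the original MWE only the final value of var is printed, so a compiler may still be allowed to skip all but the last iteration. Accumulating over every iteration is one way to guard against that.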
In this link, the author gives an example as
subroutine threshold(a, thresh, ic)
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt
integer :: n
ic = 0
tt = 0.d0
n = size(a)
do j = 1, n
tt = tt + a(j) * a(j)
if (sqrt(tt) >= thresh) then
ic = j
return
end if
end do
end subroutine threshold
and the author commented this code as
An alternative approach, which would allow for many optimizations
(loop unrolling, CPU pipelining, less time spent evaluating the
conditional) would involve adding tt in blocks (e.g., blocks of size
128) and checking the conditional after each block. When the
condition is met, the last block can be repeated to determine the
value of ic.
What does this mean? What are loop unrolling, CPU pipelining, and adding tt in blocks?
How can I optimize the code as the author suggests?
If the loop is performed in chunks/blocks that fit into the CPU cache you will reduce the number of cache misses, and consequently the number of cache lines retrieved from memory. This increases the performance on all loops that are limited by memory operations.
If the corresponding block size is BLOCKSIZE, this is achieved by
do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
end do
This, however, leaves a remainder that is not treated in the main loop. To illustrate this, consider an array of length 1000 and a BLOCKSIZE of 128. The first seven chunks (1--896) are covered by the main loop, but the remaining elements (897--1000) do not fill an eighth chunk and are not. Therefore, another loop for the remainder is required:
do j = (n/BLOCKSIZE)*BLOCKSIZE+1, n
! ...
enddo
It makes little sense to remove the conditional from the remainder loop, but in the blocked main loop the check can be moved to the outer loop, so that it is evaluated only once per block.
As no branches then occur in the inner loop, more aggressive optimizations might become applicable.
However, this limits the "accuracy" of the determined position to whole blocks. To get element-wise accuracy, you have to repeat the calculation for the block in which the threshold was crossed.
Here is the complete code:
subroutine threshold_block(a, thresh, ic)
implicit none
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt, tt_bak, thresh_sqr
integer :: n, j, jj
integer,parameter :: BLOCKSIZE = 128
ic = 0
tt = 0.d0
thresh_sqr = thresh**2
n = size(a)
! Perform the loop in chunks of BLOCKSIZE
do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
tt_bak = tt
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
! Perform the check on the block level
if (tt >= thresh_sqr) then
! If the threshold is reached, repeat the last block
! to determine the last position
tt = tt_bak
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
if (tt >= thresh_sqr) then
ic = jj
return
end if
end do
end if
end do
! Remainder is treated element-wise
do j = (n/BLOCKSIZE)*BLOCKSIZE+1, n
tt = tt + a(j) * a(j)
if (tt >= thresh_sqr) then
ic = j
return
end if
end do
end subroutine threshold_block
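As a rough check of the blocked approach, the sketch below compares a blocked routine against the plain element-wise loop for a few thresholds. It assumes that the main loop covers only complete blocks and that the remainder loop starts one element after the last complete block. The names threshold_ref, threshold_blk and nfull are chosen here for illustration. Both routines are internal procedures because assumed-shape dummy arguments need an explicit interface.

```fortran
program check_threshold_block
  implicit none
  integer, parameter :: BLOCKSIZE = 128
  real, parameter :: thresholds(3) = [10.0, 30.0, 100.0]
  real :: a(1000)
  integer :: k, ic_ref, ic_blk

  a = 1.0   ! running sum of squares after j elements is exactly j
  do k = 1, size(thresholds)
     call threshold_ref(a, thresholds(k), ic_ref)
     call threshold_blk(a, thresholds(k), ic_blk)
     print *, "thresh =", thresholds(k), " ref:", ic_ref, " blocked:", ic_blk
     if (ic_ref /= ic_blk) error stop "blocked and element-wise results differ"
  end do

contains

  ! Element-wise reference, as in the question
  subroutine threshold_ref(a, thresh, ic)
    real, dimension(:), intent(in) :: a
    real, intent(in) :: thresh
    integer, intent(out) :: ic
    real :: tt
    integer :: j
    ic = 0
    tt = 0.0
    do j = 1, size(a)
       tt = tt + a(j)*a(j)
       if (sqrt(tt) >= thresh) then
          ic = j
          return
       end if
    end do
  end subroutine threshold_ref

  ! Blocked variant: check once per block, then rescan the crossing block
  subroutine threshold_blk(a, thresh, ic)
    real, dimension(:), intent(in) :: a
    real, intent(in) :: thresh
    integer, intent(out) :: ic
    real :: tt, tt_bak, thresh_sqr
    integer :: n, nfull, j, jj
    ic = 0
    tt = 0.0
    thresh_sqr = thresh**2
    n = size(a)
    nfull = (n/BLOCKSIZE)*BLOCKSIZE   ! last index covered by complete blocks
    do j = 1, nfull, BLOCKSIZE
       tt_bak = tt
       do jj = j, j+BLOCKSIZE-1
          tt = tt + a(jj)*a(jj)
       end do
       if (tt >= thresh_sqr) then
          tt = tt_bak                 ! rewind and locate the exact element
          do jj = j, j+BLOCKSIZE-1
             tt = tt + a(jj)*a(jj)
             if (tt >= thresh_sqr) then
                ic = jj
                return
             end if
          end do
       end if
    end do
    do j = nfull+1, n                 ! remainder, element-wise
       tt = tt + a(j)*a(j)
       if (tt >= thresh_sqr) then
          ic = j
          return
       end if
    end do
  end subroutine threshold_blk

end program check_threshold_block
```

The three thresholds are picked so that the crossing falls inside the first complete block, inside the remainder, and nowhere at all (ic stays 0).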
Please note that compilers are nowadays very good at creating blocked loops in combination with other optimizations. In my experience it is quite difficult to get better performance out of such simple loops by tweaking them manually.
Loop blocking is enabled in gfortran with the compiler option -floop-block.
Loop unrolling can be done manually, but should be left to the compiler. The idea is to manually perform a loop in blocks and instead of a second loop as shown above, perform the operations by duplicating the code. Here is an example for the inner loop as given above, for a loop unrolling of factor four:
do jj = j, j+BLOCKSIZE-1,4
tt = tt + a(jj) * a(jj)
tt = tt + a(jj+1) * a(jj+1)
tt = tt + a(jj+2) * a(jj+2)
tt = tt + a(jj+3) * a(jj+3)
end do
Here, no remainder can occur if BLOCKSIZE is a multiple of 4. You can probably shave off a few operations in here ;-)
The gfortran compiler option to enable this is -funroll-loops
As far as I know, CPU Pipelining (Instruction Pipelining) cannot be enforced manually in Fortran. This task is up to the compiler.
Pipelining sets up a pipe of instructions. You feed the complete array into that pipe and, after the wind-up phase, you will get a result with each clock cycle. This drastically increases the throughput.
However, branches are difficult (impossible?) to treat in pipes, and the array should be long enough to compensate for the time spent setting up the pipe and in the wind-up and wind-down phases.
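Although pipelining itself cannot be requested from source code, the loop can be written so that the CPU has independent work to overlap. In the unrolled loop above every addition to tt has to wait for the previous one. The sketch below uses four partial sums instead (s1 to s4 are hypothetical names), assuming that a changed order of the floating-point additions is acceptable and that the array length is a multiple of four:

```fortran
program partial_sums
  implicit none
  integer, parameter :: n = 1024   ! a multiple of 4, to avoid a remainder loop
  real :: a(n), tt, s1, s2, s3, s4
  integer :: j

  a = 0.5
  s1 = 0.0
  s2 = 0.0
  s3 = 0.0
  s4 = 0.0

  ! The four accumulations do not depend on each other, so their
  ! additions can be in flight in the pipeline at the same time
  do j = 1, n, 4
     s1 = s1 + a(j)   * a(j)
     s2 = s2 + a(j+1) * a(j+1)
     s3 = s3 + a(j+2) * a(j+2)
     s4 = s4 + a(j+3) * a(j+3)
  end do

  ! Combine the partial sums once, after the loop
  tt = (s1 + s2) + (s3 + s4)

  print *, "partial sums:", tt, " intrinsic sum:", sum(a*a)
  if (abs(tt - sum(a*a)) > 1.0e-3) error stop "partial sums disagree with sum()"
end program partial_sums
```

Because floating-point addition is not associative, the result can differ from the sequential sum in the last bits. Compilers usually apply this transformation on their own only when reassociation is permitted, for example with -ffast-math in gfortran.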
I realize this question has been asked before, but not in the context of IO. Is there any reason to believe that:
!compiler can tell that it should write the whole array at once?
!but perhaps compiler allocates/frees temporary array?
write(UNIT) (/( arr(i), i=1,N )/)
would be any more efficient than:
!compiler does lots of IO here?
do i=1,N
write(UNIT) arr(i)
enddo
for a file which is opened as:
open(unit=UNIT,access='STREAM',file=fname,status='UNKNOWN')
There is a possibility that this will be used with compiler options to turn off buffered writing as well ...
As suggested by @HighPerformanceMark, here's a simple benchmark I set up:
Using gfortran:
program main
implicit none
include 'mpif.h'
integer, parameter :: N = 1000000
integer :: unit = 22
integer i
real*8 arr(N)
real*8 t1
integer repeat
external test1
external test2
external test3
repeat=15
call MPI_INIT(i)
arr = 0
call timeit(test1,repeat,arr,N,t1)
print*,t1/repeat
call timeit(test2,repeat,arr,N,t1)
print*,t1/repeat
call timeit(test3,repeat,arr,N,t1)
print*,t1/repeat
call MPI_Finalize(i)
end
subroutine timeit(sub,repeat,arr,size,time)
include 'mpif.h'
external sub
integer repeat
integer size
real*8 time,t1
real*8 arr(size)
integer i
time = 0
do i=1,repeat
open(unit=10,access='STREAM',file='test1',status='UNKNOWN')
t1 = mpi_wtime()
call sub(10,arr,size)
time = time + (mpi_wtime()-t1)
close(10)
enddo
return
end
subroutine test1(ou,a,N)
integer N
real*8 a(N)
integer ou
integer i
do i=1,N
write(ou),a(i)
enddo
return
end
subroutine test2(ou,a,N)
integer N
real*8 a(N)
integer ou
integer i
write(ou),(a(i),i=1,N)
return
end
subroutine test3(ou,a,N)
integer N
real*8 a(N)
integer ou
write(ou),a(1:N)
return
end
My results are (unbuffered):
temp $ GFORTRAN_UNBUFFERED_ALL=1 mpirun -np 1 ./test
6.2392100652058922
3.3046503861745200
9.76902325948079409E-002
(buffered):
temp $ GFORTRAN_UNBUFFERED_ALL=0 mpirun -np 1 ./test
2.7789104779561362
0.15584923426310221
9.82964992523193415E-002
I compiled and ran the above benchmark code using both gfortran (4.7.2 20120921) and ifort (13.0.0.079 Build 20120731). My results are as follows:
gfortran
         UNBUFFERED               BUFFERED
test1:   1.2614487171173097       0.20308602650960286
test2:   1.0525423844655355       3.4633986155192059E-002
test3:   5.9630711873372398E-003  6.0543696085611975E-003
ifort
         UNBUFFERED               BUFFERED
test1:   1.33864809672038         0.171342913309733
test2:   6.001885732014974E-003   6.095488866170247E-003
test3:   5.962880452473959E-003   6.007925669352213E-003
It would appear that the explicit loop in test1 is by far the most disadvantageous in both cases (without any optimisation flags set). Furthermore, with the Intel compiler there is no significant difference in execution time whether you run write(ou), (a(i), i=1, N) (case 2) or write(ou), a(1:N) (case 3, identical to simply write(ou), a in this case).
By the way, for this single-threaded process you can also just use the Fortran 95 intrinsic subroutine cpu_time, which sums over all threads and returns a time in seconds. Otherwise there is also system_clock, which returns the current count of clock ticks and the clock rate (ticks per second) as integers, possibly with higher precision.
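A minimal sketch of both timers around the same piece of work, assuming a compiler that accepts 64-bit integer arguments to system_clock (the summation loop is only placeholder work):

```fortran
program timing_demo
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 5000000
  integer(int64) :: count_start, count_end, count_rate
  real :: cpu_start, cpu_end
  double precision :: acc
  integer :: i

  call cpu_time(cpu_start)
  call system_clock(count_start, count_rate)

  ! Placeholder work; acc is printed below so the loop cannot be dropped
  acc = 0.0d0
  do i = 1, n
     acc = acc + sqrt(dble(i))
  end do

  call system_clock(count_end)
  call cpu_time(cpu_end)

  print *, "result:         ", acc
  print *, "CPU time (s):   ", cpu_end - cpu_start
  if (count_rate > 0) then
     ! Elapsed ticks divided by ticks per second gives wall-clock seconds
     print *, "wall clock (s): ", dble(count_end - count_start)/dble(count_rate)
  else
     print *, "no system clock available"
  end if
end program timing_demo
```

For I/O benchmarks like the one above, the wall-clock value is usually the more relevant one, since time spent waiting for the disk does not count as CPU time.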