Implied do vs. explicit loop with IO - performance

I realize this question has been asked before, but not in the context of IO. Is there any reason to believe that:
!compiler can tell that it should write the whole array at once?
!but perhaps compiler allocates/frees temporary array?
write(UNIT) (/( arr(i), i=1,N )/)
would be any more efficient than:
!compiler does lots of IO here?
do i=1,N
write(UNIT) arr(i)
enddo
for a file which is opened as:
open(unit=UNIT,access='STREAM',file=fname,status='UNKNOWN')
There is a possibility that this will be used with compiler options to turn off buffered writing as well.

As suggested by @HighPerformanceMark, here's a simple benchmark I set up:
Using gfortran:
program main
implicit none
include 'mpif.h'
integer, parameter :: N = 1000000
integer :: unit = 22
integer i
real*8 arr(N)
real*8 t1
integer repeat
external test1
external test2
external test3
repeat=15
call MPI_INIT(i)
arr = 0
call timeit(test1,repeat,arr,N,t1)
print*,t1/repeat
call timeit(test2,repeat,arr,N,t1)
print*,t1/repeat
call timeit(test3,repeat,arr,N,t1)
print*,t1/repeat
call MPI_Finalize(i)
end
subroutine timeit(sub,repeat,arr,size,time)
include 'mpif.h'
external sub
integer repeat
integer size
real*8 time,t1
real*8 arr(size)
integer i
time = 0
do i=1,repeat
open(unit=10,access='STREAM',file='test1',status='UNKNOWN')
t1 = mpi_wtime()
call sub(10,arr,size)
time = time + (mpi_wtime()-t1)
close(10)
enddo
return
end
subroutine test1(ou,a,N)
integer N
real*8 a(N)
integer ou
integer i
do i=1,N
write(ou),a(i)
enddo
return
end
subroutine test2(ou,a,N)
integer N
real*8 a(N)
integer ou
integer i
write(ou),(a(i),i=1,N)
return
end
subroutine test3(ou,a,N)
integer N
real*8 a(N)
integer ou
write(ou),a(1:N)
return
end
My results are (unbuffered):
temp $ GFORTRAN_UNBUFFERED_ALL=1 mpirun -np 1 ./test
6.2392100652058922
3.3046503861745200
9.76902325948079409E-002
(buffered):
temp $ GFORTRAN_UNBUFFERED_ALL=0 mpirun -np 1 ./test
2.7789104779561362
0.15584923426310221
9.82964992523193415E-002

I compiled and ran the above benchmark code using both gfortran (4.7.2 20120921) and ifort (13.0.0.079 Build 20120731). My results are as follows:
gfortran (average seconds per call)
         UNBUFFERED                 BUFFERED
test1:   1.2614487171173097         0.20308602650960286
test2:   1.0525423844655355         3.4633986155192059E-002
test3:   5.9630711873372398E-003    6.0543696085611975E-003
ifort (average seconds per call)
         UNBUFFERED                 BUFFERED
test1:   1.33864809672038           0.171342913309733
test2:   6.001885732014974E-003     6.095488866170247E-003
test3:   5.962880452473959E-003     6.007925669352213E-003
It would appear that the explicit loop in test1 is by far the most disadvantageous in both cases (without any optimisation flags set). Furthermore, with the Intel compiler there is no significant difference in execution time whether you run write(ou), (a(i), i=1, N) (case 2) or write(ou), a(1:N) (case 3, identical to simply write(ou), a in this case).
By the way, for this single-threaded process you can also just use the Fortran 95 intrinsic subroutine cpu_time, which returns the processor time in seconds (typically summed over all threads). Otherwise there is also the Fortran 90 intrinsic system_clock, which can return the number of elapsed clock ticks and the clock rate as integers, possibly to higher precision.
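As a rough sketch of how the two timers compare (the program name, the 64-bit integer kind, and the workload loop are assumptions made up for illustration, not part of the benchmark above), something like the following could be used:

```fortran
program timing_demo
  implicit none
  ! Illustrative assumptions: 64-bit counters and a made-up workload size.
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000
  integer(i8) :: count0, count1, count_rate
  real(dp) :: cpu0, cpu1, total
  integer :: i

  call system_clock(count0, count_rate)   ! current tick count and ticks per second
  call cpu_time(cpu0)                     ! processor time in seconds

  ! Stand-in workload; replace with the code to be timed.
  total = 0.0_dp
  do i = 1, n
     total = total + sqrt(real(i, dp))
  end do

  call cpu_time(cpu1)
  call system_clock(count1)

  ! Printing the result stops the compiler from discarding the loop.
  write(*,*) 'result        = ', total
  write(*,*) 'CPU time (s)  = ', cpu1 - cpu0
  write(*,*) 'wall time (s) = ', real(count1 - count0, dp) / real(count_rate, dp)
end program timing_demo
```

Using a 64-bit integer kind for the system_clock arguments usually gives a finer tick resolution than default integers, although the actual resolution is processor dependent.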

Related

FFTW3 gfortran code is correct only when arraysize is LE 16

I have FFTW3 code that works ONLY when the array size is less than or equal to 16. Could some expert please tell me what I am doing wrong?
program fftcalc
implicit none
integer n
parameter (n=128)
double precision in(n)
double complex out(1+(n/2))
integer*8 p, p2
integer i,j
real fact
real f,dt,t,pi
double precision re
integer FFTW_FORWARD,FFTW_BACKWARD
parameter (FFTW_FORWARD=-1,FFTW_BACKWARD=1)
integer FFTW_REAL_TO_COMPLEX,FFTW_COMPLEX_TO_REAL
parameter (FFTW_REAL_TO_COMPLEX=-1,FFTW_COMPLEX_TO_REAL=1)
integer FFTW_ESTIMATE,FFTW_MEASURE
parameter (FFTW_ESTIMATE=0,FFTW_MEASURE=1)
integer FFTW_OUT_OF_PLACE,FFTW_IN_PLACE,FFTW_USE_WISDOM
parameter (FFTW_OUT_OF_PLACE=0)
parameter (FFTW_IN_PLACE=8,FFTW_USE_WISDOM=16)
integer FFTW_THREADSAFE
parameter (FFTW_THREADSAFE=128)
f=25.
pi=3.14159
dt=0.004
do i=1,n
j=i-1
t=j*dt
re=cos(2.*pi*f*t)
in(i)=re
write(*,*) 'i,in',i,in(i)
enddo
write(*,*) "creating plans"
call dfftw_plan_dft_r2c_1d(p,n,in,out,fftw_measure)
call dfftw_plan_dft_c2r_1d(p2,n,in,out,fftw_measure)
write(*,*) "execute"
call dfftw_execute_dft_r2c(p,in,out)
write(*,*) "forward is executed"
do i=1,n
write(*,*) out(i)
enddo
write(*,*)
write(*,*)"do inverse fft"
call dfftw_execute_dft_c2r(p2, out,in)
write(*,*)"inverse fft completed"
fact=1.0/n
do i=1,n
write(*,*) in(i)*fact
enddo
write(*,*)"clean up"
call dfftw_destroy_plan(p)
call dfftw_destroy_plan(p2)
end program fftcalc
I tried with n=16 and n=10 and it works, but when n=64 or so it gives a mostly zero forward transform.
How do I solve the problem?

How Efficient Are Intrinsic (Math) Functions in Fortran?

How efficient are Fortran's (90+) intrinsic (math) functions? I especially care about tanh and sech but am interested in the other Fortran intrinsic functions as well.
By "how efficient" I mean that if it is very hard to come up with a faster method then the intrinsics are efficient but if it is very easy to come up with a faster method then the intrinsics are inefficient.
Here is an MWE, in which my change to try to make it faster actually made it slower, suggesting the intrinsics are efficient.
program main
implicit none
integer, parameter :: n = 10000000
integer :: i
real :: x, var
real :: t1,t2,t3,t4
!! Intrinsic first
call cpu_time(t1)
do i = 1, n
x = REAL(i)/300.0
var = tanh(x)
end do
call cpu_time(t2)
write(*,*) "Elapsed CPU Time = ", t2 - t1
write(*,*) var
!! Intrinsic w/ small change
call cpu_time(t3)
do i = 1, n
x = REAL(i)/300.0
if (x > 10.0) then
var = 1.0
else
var = tanh(x)
end if
end do
call cpu_time(t4)
write(*,*) "Elapsed CPU Time = ", t4 - t3
write(*,*) var
end program main
Note that Fortran90 seems to be lazy; if I don't include the "write(*,*) var" then it says elapsed CPU time = 0.0.

How to find the dimension of a matrix?

I have a matrix that contains both characters and reals, and I want a program that reads this matrix and finds its dimensions by itself. Here is my code:
! A fortran95 program for G95
Program Project2nd
implicit none
character(len=40), allocatable :: a(:,:)
integer i,j,k,n,m,l,st
character(len=40) d
n=0; m=1; j=1;
open(10,file=&
'/Users/dariakowsari/Documents/Physics/Programming/Fortran95-Projects/Project2nd/input.txt', &
IOstat=st)
do while (st == 0)
read(10,*,IOstat=st) d
n=n+1
end do
st=0
do j=1,m
do while (st == 0)
allocate(a(1,m))
read(10,*,IOstat=st) (a(1,j),j=1,m)
m=m+1
deallocate(a)
end do
print*, n,m
end
Here is my Matrix:
a b 13 15.5 13.2
c d 16 16.75 19
e f 19.2 12.2 18.2
With this code I got (3,2) for the dimensions of my matrix.
There are a few errors in your example code, which means it doesn't compile for me, but after a few changes I managed to get a similar result to yours.
(Update: As noted by @francescalus in the comments to my other (now deleted) answer, that approach involved undefined behaviour and as such is not an appropriate solution. This arose from trying to read more elements from the file than were present.)
Here's an alternative approach, which should avoid this undefined behaviour, but is probably pretty inefficient.
Program Project2nd
implicit none
character(len=40), allocatable :: a(:)
integer, allocatable :: ind(:)
integer, parameter :: maxElements = 100
integer i,j,n,m,st
character(len=40) d
n=0;
open(10,file='mat.txt',IOstat=st)
!Find number of lines
do while (st == 0)
read(10,*,IOstat=st) d
if(st ==0) n=n+1
end do
!Move back to the start of the file
rewind(10)
!Read all of the data
do m=n,maxElements,n
allocate(a(m))
read(10,*,IOstat=st) a
deallocate(a)
rewind(10)
if(st.ne.0) exit
enddo
m = m -n !Need to roll back m by one iteration to get the last which worked.
if(mod(m,n).ne.0) then
print*,"Error: Number of elements not divisible by number of rows."
stop
endif
!Number of columns = n_elements/nrow
m=m/n
print*, n,m
end Program Project2nd
Essentially this uses the same code as you had for counting the number of lines; however, note that you only want to increment n when the read was successful (i.e. st==0). Note that we do not exit the while block as soon as st becomes non-zero; the loop only ends once we reach the end of the block and the condition is tested again. After that we need to rewind the file so that the next read starts at the start of the file.
In a previous comment you mentioned that you'd rather not have to specify maxElements. If you really want to avoid this, then replace the second do loop with something like:
st = 0 ; m = n
do while (st==0)
allocate(a(m))
read(10,*,IOstat=st) a
deallocate(a)
rewind(10)
if(st.ne.0) then
m = m - n !Go back to value of m that worked
exit
endif
m=m+n
enddo
Here is how to do it without rewinding:
implicit none
character(len=100) wholeline
character(len=20), allocatable :: c(:)
integer iline,io,ni,nums
open(20,file='testin.dat')
iline=0
do while(.true.)
read(20,'(a)',iostat=io)wholeline
if(io.ne.0)exit
iline=iline+1
ni=lineitems(wholeline)
allocate(c(ni))
read(wholeline,*)c
nums=ctnums(c)
write(*,*)'line',iline,' contains ',ni,'items',nums,
$ 'are numbers'
deallocate(c)
enddo
write(*,*)'total lines is ',iline
contains
integer function ctnums(c)
! count the number of items in a character array that are numbers
! this is a template,
! obviously you could assign the numbers to a real array here
character(len=*), allocatable :: c(:)
real f
integer i,io
ctnums=0
do i = 1,size(c)
read(c(i),*,iostat=io)f
if(io.eq.0)ctnums=ctnums+1
enddo
end function
integer function lineitems(line)
! count the number of items in a space delimited string
integer,parameter ::maxitems=100
character(len=*) line
character(len=80) :: c(maxitems)
integer iline,io
lineitems=0
do iline=1,maxitems
read(line,*,iostat=io)c(:iline)
if(io.ne.0)return
lineitems=iline
enddo
if(lineitems.eq.maxitems)write(*,*)'warning maxitems reached'
end function
end
output
line 1 contains 5 items 3 are numbers
line 2 contains 5 items 3 are numbers
total lines is 2

Read array of unknown size from keyboard

I want to insert an unknown number of values in an array (no matter the order).
I could first read how many values are to be inserted, then allocate the allocatable array, and finally read its values, as in the following code
PROGRAM try
IMPLICIT NONE
INTEGER :: N
REAL, DIMENSION(:), ALLOCATABLE :: x
WRITE (*,*) "how many values?"
READ (*,*) N
ALLOCATE(x(N))
WRITE (*,*) "insert the values"
READ (*,*) x
END PROGRAM
What if I want to insert the values without declaring how many before allocating the array?
I think I should use a DO WHILE cycle to insert the values in ascending order, until a descending value is inserted, thus indicating that the sequence has ended. I think a part of the code would be the following:
index = 1
WRITE(*,*) x
READ(*,*) x(index)
exit = .FALSE.
DO WHILE (exit.EQV..FALSE.)
index = index + 1
READ(*,*) x(index)
IF (x(index)<x(index-1)) THEN
exit = .TRUE.
index = index - 1
END IF
END DO
How to declare the array x?
I tried with the following solution, building on the concept "a lot of memory allocation and reallocation" expressed by @High Performance Mark.
PROGRAM COEFFS
USE COMPACT
IMPLICIT NONE
REAL, DIMENSION(:), ALLOCATABLE :: x,x2
INTEGER :: nL,nR,nT,index,oL,oR
LOGICAL :: exit
WRITE(*,*) "Input an increasing sequence of reals (end the sequence &
& with the first decreasing element, which will be discarded):"
index = 1
ALLOCATE(x(index))
READ(*,*) x(index)
ALLOCATE(x2(index))
x2 = x
DEALLOCATE(x)
exit = .FALSE.
DO WHILE (exit.EQV..FALSE.)
index = index + 1
ALLOCATE(x(index))
x(1:index-1) = x2
READ(*,*) x(index)
DEALLOCATE(x2)
ALLOCATE(x2(index))
x2 = x
DEALLOCATE(x)
IF (x2(index)<x2(index-1)) THEN
exit = .TRUE.
index = index - 1
ALLOCATE(x(index))
x = x2(1:index)
END IF
END DO
DEALLOCATE(x2)
WRITE(*,*) "x = ", x
END PROGRAM
With the array being input from the keyboard, I don't think allocation/reallocation is a problem, since it happens at a much higher speed than that of my fingers typing the values, does it?
Still, I think the code could be made better. For instance, is using two arrays the only way to take advantage of allocation/reallocation?
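One possible simplification, sketched here under the assumption that the compiler supports Fortran 2003 automatic reallocation on assignment (some older compilers may need an option to enable it), is to grow a single array with an array constructor. The input array below is a hypothetical stand-in for the values typed at the keyboard, so that the sketch runs without user interaction:

```fortran
program grow_demo
  implicit none
  ! Hypothetical stand-in for keyboard input: an increasing sequence
  ! ended by the first decreasing value (0.5), which is discarded.
  real, parameter :: input(5) = [1.0, 2.0, 3.5, 0.5, 9.0]
  real, allocatable :: x(:)
  integer :: k

  allocate(x(0))                          ! start with an empty array
  do k = 1, size(input)
     ! Nested IFs, because Fortran does not guarantee short-circuit evaluation.
     if (size(x) > 0) then
        if (input(k) < x(size(x))) exit   ! first decreasing value ends the sequence
     end if
     x = [x, input(k)]                    ! x is reallocated to hold one more element
  end do
  write(*,*) "x = ", x                    ! holds 1.0, 2.0 and 3.5
end program grow_demo
```

With real keyboard input, input(k) would be replaced by a value obtained from READ(*,*) inside the loop. The array is still copied on every append, so this mainly tidies the code rather than reducing the amount of reallocation.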

OpenMP calling subroutines in threads

The problem I'm facing is as outlined below:
module k
integer :: l,m
end module k
program p4
use k
integer :: i,j,omp_get_thread_num,cr
i = 2
j = 3
!$omp parallel num_threads(2) shared(l,m) private(i,j,cr)
cr = omp_get_thread_num()
if (cr == 0) goto 1111
call sub1(i)
write(*,*) l
goto 2222
1111 call sub2(j)
write(*,*) m
2222 continue
!$omp end parallel
end program p4
subroutine sub1(a)
use k
integer :: a
l = a**2
write(*,*) 'entered sub1'
end subroutine sub1
subroutine sub2(b)
use k
integer :: b
m = b**2
write(*,*) 'entered sub2'
end subroutine sub2
I've tried to parallelize a serial code (which, after parallelization, looks as written above). I want essentially the same operation performed twice. Ideally, I want the output to be
entered sub1
4
entered sub2
9
but the output is
entered sub2
0
entered sub1
923239424
I'm new to parallel programming (my actual problem is a more complicated version of the one I've outlined). Can anyone point out the mistakes and suggest improvements? Thanks.
OpenMP private variables are NOT given initial values; therefore both the calls to sub1 and sub2 are made with undefined values of i and j. What you are (probably) looking for is firstprivate instead:
!$omp parallel num_threads(2) shared(l,m) private(cr) firstprivate(i,j)
...
!$omp end parallel
firstprivate initialises each private copy with the value that the corresponding variable in the main thread had on entry into the parallel region.
Btw, implementing IF/THEN/ELSE/ENDIF with IF/GOTO/CONTINUE in Fortran 90 and later is considered by many a bad programming style. You should use OpenMP sections instead:
!$omp parallel sections num_threads(2) shared(l,m) private(cr) firstprivate(i,j)
!$omp section
call sub1(i)
write(*,*) l
!$omp section
call sub2(j)
write(*,*) m
!$omp end parallel sections
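For completeness, here is a minimal self-contained sketch of the same idea. The program name and the shared res array are assumptions introduced for illustration; storing each result in its own array element and printing after the parallel region keeps the output order deterministic:

```fortran
program firstprivate_demo
  implicit none
  integer :: i, j
  integer :: res(2)     ! hypothetical shared result array, one element per section

  i = 2
  j = 3
  res = 0
  !$omp parallel sections num_threads(2) shared(res) firstprivate(i, j)
  !$omp section
  res(1) = i**2         ! the private copy of i starts with the value 2
  !$omp section
  res(2) = j**2         ! the private copy of j starts with the value 3
  !$omp end parallel sections
  write(*,*) res        ! prints 4 and 9
end program firstprivate_demo
```

Each section writes to a different element of res, so there is no race between the two threads. Compiled without OpenMP support, the directives are treated as comments and the program gives the same result serially.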
