FFTW3 gfortran code is correct only when array size is <= 16

I have FFTW3 code that works ONLY when the array size is less than or equal to 16. Could some expert please tell me what I am doing wrong?
program fftcalc
implicit none
integer n
parameter (n=128)
double precision in(n)
double complex out(1+(n/2))
integer*8 p, p2
integer i,j
real fact
real f,dt,t,pi
double precision re
integer FFTW_FORWARD,FFTW_BACKWARD
parameter (FFTW_FORWARD=-1,FFTW_BACKWARD=1)
integer FFTW_REAL_TO_COMPLEX,FFTW_COMPLEX_TO_REAL
parameter (FFTW_REAL_TO_COMPLEX=-1,FFTW_COMPLEX_TO_REAL=1)
integer FFTW_ESTIMATE,FFTW_MEASURE
parameter (FFTW_ESTIMATE=0,FFTW_MEASURE=1)
integer FFTW_OUT_OF_PLACE,FFTW_IN_PLACE,FFTW_USE_WISDOM
parameter (FFTW_OUT_OF_PLACE=0)
parameter (FFTW_IN_PLACE=8,FFTW_USE_WISDOM=16)
integer FFTW_THREADSAFE
parameter (FFTW_THREADSAFE=128)
f=25.
pi=3.14159
dt=0.004
do i=1,n
j=i-1
t=j*dt
re=cos(2.*pi*f*t)
in(i)=re
write(*,*) 'i,in',i,in(i)
enddo
write(*,*) "creating plans"
call dfftw_plan_dft_r2c_1d(p,n,in,out,fftw_measure)
call dfftw_plan_dft_c2r_1d(p2,n,in,out,fftw_measure)
write(*,*) "execute"
call dfftw_execute_dft_r2c(p,in,out)
write(*,*) "forward is executed"
do i=1,n
write(*,*) out(i)
enddo
write(*,*)
write(*,*)"do inverse fft"
call dfftw_execute_dft_c2r(p2, out,in)
write(*,*)"inverse fft completed"
fact=1.0/n
do i=1,n
write(*,*) in(i)*fact
enddo
write(*,*)"clean up"
call dfftw_destroy_plan(p)
call dfftw_destroy_plan(p2)
end program fftcalc
I tried n=16 and n=10 and it works, but with n=64 or so it gives a mostly zero forward transform.
How do I solve the problem?
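No answer was recorded for this question, so what follows is only a hedged sketch of likely fixes, assuming FFTW 3.x and its legacy Fortran interface. Three things stand out in the code above. The flag constants are FFTW 2 values: in FFTW 3's fftw3.f, FFTW_MEASURE is 0 and FFTW_ESTIMATE is 64, so passing 1 still requests a measuring plan, and such a plan may overwrite the arrays during planning. The input is filled before the plans are created. The c2r plan receives its arrays as (in,out) instead of (complex, real). In addition, the print loop runs to n although out has only n/2+1 elements. The program name fftcalc_sketch is made up for illustration.

```fortran
program fftcalc_sketch
  implicit none
  integer, parameter :: n = 128
  ! Assumption: value copied from FFTW 3's fftw3.f; prefer include 'fftw3.f'.
  integer, parameter :: FFTW_ESTIMATE = 64
  double precision, parameter :: pi = 3.141592653589793d0
  double precision, parameter :: f = 25.0d0, dt = 0.004d0
  double precision :: in(n)
  double complex :: out(n/2 + 1)
  integer*8 :: p, p2
  integer :: i

  ! Create both plans BEFORE filling the data, so planning cannot destroy it.
  call dfftw_plan_dft_r2c_1d(p, n, in, out, FFTW_ESTIMATE)
  ! For c2r the complex array comes first, then the real array.
  call dfftw_plan_dft_c2r_1d(p2, n, out, in, FFTW_ESTIMATE)

  do i = 1, n
     in(i) = cos(2.0d0*pi*f*(i - 1)*dt)
  end do

  call dfftw_execute_dft_r2c(p, in, out)
  ! A real-to-complex transform of length n yields n/2+1 complex values.
  do i = 1, n/2 + 1
     write(*,*) i, out(i)
  end do

  ! FFTW transforms are unnormalized: divide by n after the round trip.
  call dfftw_execute_dft_c2r(p2, out, in)
  do i = 1, n
     write(*,*) in(i)/n
  end do

  call dfftw_destroy_plan(p)
  call dfftw_destroy_plan(p2)
end program fftcalc_sketch
```

This needs to be linked against FFTW 3 (typically something like gfortran fftcalc_sketch.f90 -lfftw3). Whether these changes account for the n > 16 behaviour on your system is an assumption; they are, however, all required by the FFTW 3 interface.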

Related

OpenACC firstprivate variables

I encounter an issue with arrays in OpenACC kernels. Here is the demo code:
module mpoint
type point
real :: x, y, z
real :: tmp
real :: v(3)
end type point
end module mpoint
program main
use mpoint
implicit none
integer, parameter :: n = 3
real, allocatable :: array(:)
!--------------------------------
allocate(array(n))
array(:) = 1.0
call vecadd
contains
subroutine vecadd()
integer :: i
type(point) :: A
real :: w(3)
real :: scalar
A%v(:) = 0.0
A%v(1) = 1.0
w(:) = 0.0
w(1) = 1.0
scalar = 1.0
write(*,*) 'host: v(1) = ', A%v(1)
write(*,*) 'host: w(1) = ', w(1)
write(*,*) 'host: scalar = ', scalar
!$acc parallel loop firstprivate(A,w)
do i = 1, n
write(*,*) 'device: v(1) = ', A%v(1)
write(*,*) 'device: w(1) = ', w(1)
write(*,*) 'device: scalar = ', scalar
enddo
end subroutine vecadd
end program main
When I compile it with nvfortran -acc -Minfo=accel test.f90 and run it, the values in the arrays on the device are 0.0, not the correct value 1.0 that I set on the host side. This happens only for arrays: scalars, as shown in the example, have the right values.
I am wondering if this is a limitation of nvfortran, or of the current OpenACC standard?
Looks like a compiler issue where we're not initializing the small arrays on the host. In looking through our existing bug reports, I see one that's nearly identical, just with C instead of Fortran, that by coincidence got fixed in our development compiler this morning. Unfortunately it didn't seem to fix your issue as well. I sent a note to the compiler engineer assigned to the issue asking if he can take a look.
Worst case if the issue turns out to be similar but unrelated, I'll open a new problem report and update this post with the tracking number.

LAPACK - DGETRF fails (inverting matrices)

I am trying to calculate the inverse of a matrix, but DGETRF keeps saying that the matrix is numerically singular even when the matrix isn't.
real :: testM(1:2, 1:2), workT(2)
integer :: ipivT(2), info
testM = reshape((/4,2,7,6/), shape(testM))
WRITE(*,*) "started matrix inversion"
! DGETRF computes an LU factorization of a general M-by-N matrix A
! using partial pivoting with row interchanges.
call DGETRF(2, 2, testM, 2, ipivT, info)
WRITE(*,*) "info = ", info
if (info /= 0) then
stop 'Matrix is numerically singular!'
end if
! DGETRI computes the inverse of a matrix using the LU factorization
! computed by DGETRF.
call DGETRI(2, testM, 2, ipivT, workT, 2, info)
if (info /= 0) then
stop 'Matrix inversion failed!'
end if
WRITE(*,*) testM
WRITE(*,*) "Matrix success"
I get the error saying that the matrix is numerically singular (info = 2).
The matrix however is not numerically singular and I don't know why the error is given.
In the end I want to scale the code to a (33,33) matrix. But I'm first trying to get it working for this (2,2) matrix.
The documentation of dgetrf (http://www.netlib.org/lapack/explore-html/dd/d9a/group__double_g_ecomputational_ga0019443faea08275ca60a734d0593e60.html) says:
subroutine dgetrf ( integer M,
integer N,
double precision, dimension( lda, * ) A,
integer LDA,
integer, dimension( * ) IPIV,
integer INFO
)
So A must be a double precision array, but you call the routine with a default real variable testM.
Similar problems will probably occur for DGETRI.
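A minimal sketch of the corrected call, assuming a LAPACK library is available at link time (for example -llapack). The only substantive change is that the matrix and workspace are declared double precision to match the D-prefixed routines; the program name invert2x2 and the kind parameter dp are illustrative names, not from the original post.

```fortran
program invert2x2
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2
  real(dp) :: testM(n, n), workT(n)
  integer :: ipivT(n), info
  external :: dgetrf, dgetri

  ! Column-major fill: testM = [4 7; 2 6], determinant 10.
  testM = reshape([4.0_dp, 2.0_dp, 7.0_dp, 6.0_dp], shape(testM))

  ! LU factorization with partial pivoting.
  call dgetrf(n, n, testM, n, ipivT, info)
  if (info /= 0) stop 'DGETRF failed'

  ! Inverse from the LU factors; the workspace length must be at least n.
  call dgetri(n, testM, n, ipivT, workT, n, info)
  if (info /= 0) stop 'DGETRI failed'

  ! Expect approximately 0.6, -0.2, -0.7, 0.4 in column-major order.
  write(*,*) testM
end program invert2x2
```

Alternatively, keeping testM as default real and calling the single precision routines SGETRF and SGETRI should work as well. For the (33,33) case only n and the input data need to change.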

How to find the dimension of a matrix?

I have a matrix that contains both characters and reals, and I want a program that reads this matrix and finds its dimensions by itself. Here is my code:
! A fortran95 program for G95
Program Project2nd
implicit none
character(len=40), allocatable :: a(:,:)
integer i,j,k,n,m,l,st
character(len=40) d
n=0; m=1; j=1;
open(10,file=&
'/Users/dariakowsari/Documents/Physics/Programming/Fortran95-Projects/Project2nd/input.txt', &
IOstat=st)
do while (st == 0)
read(10,*,IOstat=st) d
n=n+1
end do
st=0
do j=1,m
do while (st == 0)
allocate(a(1,m))
read(10,*,IOstat=st) (a(1,j),j=1,m)
m=m+1
deallocate(a)
end do
print*, n,m
end
Here is my Matrix:
a b 13 15.5 13.2
c d 16 16.75 19
e f 19.2 12.2 18.2
With this code I got (3,2) for the dimensions of my matrix.
There are a few errors in your example code, which means it doesn't compile for me, but after a few changes I managed to get a similar result to yours.
(Update: As noted by @francescalus in the comments to my other (now deleted) answer, that approach involved undefined behaviour and as such is not an appropriate solution. This arose from trying to read more elements from the file than were present.)
Here's an alternative approach, which should avoid this undefined behaviour, but is probably pretty inefficient.
Program Project2nd
implicit none
character(len=40), allocatable :: a(:)
integer, allocatable :: ind(:)
integer, parameter :: maxElements = 100
integer i,j,n,m,st
character(len=40) d
n=0;
open(10,file='mat.txt',IOstat=st)
!Find number of lines
do while (st == 0)
read(10,*,IOstat=st) d
if(st ==0) n=n+1
end do
!Move back to the start of the file
rewind(10)
!Read all of the data
do m=n,maxElements,n
allocate(a(m))
read(10,*,IOstat=st) a
deallocate(a)
rewind(10)
if(st.ne.0) exit
enddo
m = m -n !Need to roll back m by one iteration to get the last which worked.
if(mod(m,n).ne.0) then
print*,"Error: Number of elements not divisible by number of rows."
stop
endif
!Number of columns = n_elements/nrow
m=m/n
print*, n,m
end Program Project2nd
Essentially this uses the same code as you had for counting the number of lines; however, note that you only want to increment n when the read was successful (i.e. st==0). Note that the while block does not exit as soon as st becomes non-zero; the condition is only checked once the end of the block is reached. After that we need to rewind the file so that the next read starts at the beginning of the file.
In a previous comment you mentioned that you'd rather not have to specify maxElements. If you really want to avoid this, then replace the second do loop with something like
st = 0 ; m = n
do while (st==0)
allocate(a(m))
read(10,*,IOstat=st) a
deallocate(a)
rewind(10)
if(st.ne.0) then
m = m - n !Go back to value of m that worked
exit
endif
m=m+n
enddo
Here is how to do it without rewinding.
implicit none
character(len=100) wholeline
character(len=20), allocatable :: c(:)
integer iline,io,ni,nums
open(20,file='testin.dat')
iline=0
do while(.true.)
read(20,'(a)',iostat=io)wholeline
if(io.ne.0)exit
iline=iline+1
ni=lineitems(wholeline)
allocate(c(ni))
read(wholeline,*)c
nums=ctnums(c)
write(*,*)'line',iline,' contains ',ni,'items',nums, &
'are numbers'
deallocate(c)
enddo
write(*,*)'total lines is ',iline
contains
integer function ctnums(c)
! count the number of items in a character array that are numbers
! this is a template,
! obviously you could assign the numbers to a real array here
character(len=*), allocatable :: c(:)
real f
integer i,io
ctnums=0
do i = 1,size(c)
read(c(i),*,iostat=io)f
if(io.eq.0)ctnums=ctnums+1
enddo
end function
integer function lineitems(line)
! count the number of items in a space delimited string
integer,parameter ::maxitems=100
character(len=*) line
character(len=80) :: c(maxitems)
integer iline,io
lineitems=0
do iline=1,maxitems
read(line,*,iostat=io)c(:iline)
if(io.ne.0)return
lineitems=iline
enddo
if(lineitems.eq.maxitems)write(*,*)'warning maxitems reached'
end function
end
output
line 1 contains 5 items 3 are numbers
line 2 contains 5 items 3 are numbers
total lines is 2
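As a complement to the answers above, here is a small sketch of a different technique for finding the number of columns: instead of repeated list-directed reads, it scans one line and counts blank-to-nonblank transitions. The names count_columns and count_tokens are hypothetical, and the sketch assumes items are separated by spaces only (no tabs, commas, or quoted strings). The row count can still be obtained with the line-counting loop shown earlier.

```fortran
program count_columns
  implicit none
  character(len=100) :: line

  ! First row of the example matrix from the question.
  line = 'a b 13 15.5 13.2'
  write(*,*) 'items on line: ', count_tokens(line)   ! reports 5 items

contains

  ! Count blank-separated tokens by detecting blank-to-nonblank transitions.
  integer function count_tokens(s) result(ntok)
    character(len=*), intent(in) :: s
    integer :: i
    logical :: in_token

    ntok = 0
    in_token = .false.
    do i = 1, len_trim(s)
       if (s(i:i) /= ' ') then
          ! A new token starts when a non-blank follows a blank (or the start).
          if (.not. in_token) ntok = ntok + 1
          in_token = .true.
       else
          in_token = .false.
       end if
    end do
  end function count_tokens

end program count_columns
```

In a full program each line would be read with read(unit,'(a)') into line before calling count_tokens, so no element is ever read past the end of a record.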

OpenMP calling subroutines in threads

The problem I'm facing is as outlined below:
module k
integer :: l,m
end module k
program p4
use k
integer :: i,j,omp_get_thread_num,cr
i = 2
j = 3
!$omp parallel num_threads(2) shared(l,m) private(i,j,cr)
cr = omp_get_thread_num()
if (cr == 0) goto 1111
call sub1(i)
write(*,*) l
goto 2222
1111 call sub2(j)
write(*,*) m
2222 continue
!$omp end parallel
end program p4
subroutine sub1(a)
use k
integer :: a
l = a**2
write(*,*) 'entered sub1'
end subroutine sub1
subroutine sub2(b)
use k
integer :: b
m = b**2
write(*,*) 'entered sub2'
end subroutine sub2
I've tried to parallelize a serial code (which after parallelization looks as written above). I want essentially the same operation performed twice. So ideally, I want the output to be
entered sub1
4
entered sub2
9
but the output is
entered sub2
0
entered sub1
923239424
I'm new to parallel programming (my actual problem is a more complicated version of the one I've outlined). Can anyone point out the mistakes and suggest improvements? Thanks
OpenMP private variables are NOT given initial values, so both the calls to sub1 and sub2 are made with undefined (effectively random) values of i and j. What you are (probably) looking for is firstprivate instead:
!$omp parallel num_threads(2) shared(l,m) private(cr) firstprivate(i,j)
...
!$omp end parallel
firstprivate initialises each private copy with the value that the corresponding variable in the main thread had on entry into the parallel region.
Btw, implementing IF/THEN/ELSE/ENDIF with IF/GOTO/CONTINUE in Fortran 90 and later is considered by many a bad programming style. You should use OpenMP sections instead:
!$omp parallel sections num_threads(2) shared(l,m) private(cr) firstprivate(i,j)
!$omp section
call sub1(i)
write(*,*) l
!$omp section
call sub2(j)
write(*,*) m
!$omp end parallel sections
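For completeness, a self-contained sketch of the sections version, assuming an OpenMP-capable compiler (for example gfortran -fopenmp). The program name p4_sections is illustrative, and the subroutines are placed in a contains block only to keep the example in one unit; the unused variable cr is dropped.

```fortran
module k
  implicit none
  integer :: l, m
end module k

program p4_sections
  use k
  implicit none
  integer :: i, j

  i = 2
  j = 3

  ! firstprivate copies the values of i and j into each thread's private copy.
  !$omp parallel sections num_threads(2) shared(l,m) firstprivate(i,j)
  !$omp section
  call sub1(i)
  write(*,*) l        ! 4
  !$omp section
  call sub2(j)
  write(*,*) m        ! 9
  !$omp end parallel sections

contains

  subroutine sub1(a)
    integer, intent(in) :: a
    l = a**2
    write(*,*) 'entered sub1'
  end subroutine sub1

  subroutine sub2(b)
    integer, intent(in) :: b
    m = b**2
    write(*,*) 'entered sub2'
  end subroutine sub2

end program p4_sections
```

The two sections run concurrently, so their output lines may interleave in any order between runs; only the order within each section is fixed.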

Implied do vs. explicit loop with IO

I realize this question has been asked before, but not in the context of IO. Is there any reason to believe that:
!compiler can tell that it should write the whole array at once?
!but perhaps compiler allocates/frees temporary array?
write(UNIT) (/( arr(i), i=1,N )/)
would be any more efficient than:
!compiler does lots of IO here?
do i=1,N
write(UNIT) arr(i)
enddo
for a file which is opened as:
open(unit=UNIT,access='STREAM',file=fname,status='UNKNOWN')
There is a possibility that this will be used with compiler options to turn off buffered writing as well ...
As suggested by @HighPerformanceMark, here's a simple benchmark I set up:
Using gfortran:
program main
implicit none
include 'mpif.h'
integer, parameter :: N = 1000000
integer :: unit = 22
integer i
real*8 arr(N)
real*8 t1
integer repeat
external test1
external test2
external test3
repeat=15
call MPI_INIT(i)
arr = 0
call timeit(test1,repeat,arr,N,t1)
print*,t1/repeat
call timeit(test2,repeat,arr,N,t1)
print*,t1/repeat
call timeit(test3,repeat,arr,N,t1)
print*,t1/repeat
call MPI_Finalize(i)
end
subroutine timeit(sub,repeat,arr,size,time)
include 'mpif.h'
external sub
integer repeat
integer size
real*8 time,t1
real*8 arr(size)
integer i
time = 0
do i=1,repeat
open(unit=10,access='STREAM',file='test1',status='UNKNOWN')
t1 = mpi_wtime()
call sub(10,arr,size)
time = time + (mpi_wtime()-t1)
close(10)
enddo
return
end
subroutine test1(ou,a,N)
integer N
real*8 a(N)
integer ou
integer i
do i=1,N
write(ou),a(i)
enddo
return
end
subroutine test2(ou,a,N)
integer N
real*8 a(N)
integer ou
integer i
write(ou),(a(i),i=1,N)
return
end
subroutine test3(ou,a,N)
integer N
real*8 a(N)
integer ou
write(ou),a(1:N)
return
end
My results are (unbuffered):
temp $ GFORTRAN_UNBUFFERED_ALL=1 mpirun -np 1 ./test
6.2392100652058922
3.3046503861745200
9.76902325948079409E-002
(buffered):
temp $ GFORTRAN_UNBUFFERED_ALL=0 mpirun -np 1 ./test
2.7789104779561362
0.15584923426310221
9.82964992523193415E-002
I compiled and ran the above benchmark code using both gfortran (4.7.2 20120921) and ifort (13.0.0.079 Build 20120731). My results are as follows:
gfortran
        UNBUFFERED                 BUFFERED
test1:  1.2614487171173097         0.20308602650960286
test2:  1.0525423844655355         3.4633986155192059E-002
test3:  5.9630711873372398E-003    6.0543696085611975E-003
ifort
        UNBUFFERED                 BUFFERED
test1:  1.33864809672038           0.171342913309733
test2:  6.001885732014974E-003     6.095488866170247E-003
test3:  5.962880452473959E-003     6.007925669352213E-003
It would appear that the explicit loop in test1 is by far the most disadvantageous in both cases (without any optimisation flags set). Furthermore, with the Intel compiler there is no significant difference in execution time whether you run write(ou), (a(i), i=1, N) (case 2) or write(ou), a(1:N) (case 3, identical to simply write(ou), a in this case).
By the way, for this single-threaded process you can also just use the Fortran 95 intrinsic subroutine cpu_time, which sums over all threads and returns a time in seconds. Otherwise there is also system_clock, which can return the elapsed clock count and the count rate as integers, possibly at higher precision.
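A minimal sketch of timing the whole-array write with system_clock instead of mpi_wtime, so that MPI is not needed. The program name time_write is made up for illustration, and the sketch assumes a compiler that supports newunit= (Fortran 2008) and 64-bit arguments to system_clock; it writes to a scratch file rather than the named file used in the benchmark.

```fortran
program time_write
  implicit none
  integer, parameter :: n = 1000000
  integer, parameter :: i8 = selected_int_kind(18)
  double precision, allocatable :: arr(:)
  integer(i8) :: t0, t1, rate
  integer :: ou

  allocate(arr(n))
  arr = 0.0d0

  ! Unformatted stream access, as in the benchmark above.
  open(newunit=ou, access='stream', form='unformatted', status='scratch')

  call system_clock(t0, rate)   ! rate = clock counts per second
  write(ou) arr                 ! whole-array write (case 3)
  call system_clock(t1)

  close(ou)
  write(*,*) 'elapsed seconds: ', dble(t1 - t0)/dble(rate)
end program time_write
```

Unlike cpu_time, system_clock measures wall-clock time, which is usually the more relevant quantity for I/O because time spent waiting on the disk is included.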

Resources