OpenACC firstprivate variables - openacc

I encounter an issue with arrays in OpenACC kernels. Here is the demo code:
module mpoint
type point
real :: x, y, z
real :: tmp
real :: v(3)
end type point
end module mpoint
program main
use mpoint
implicit none
integer, parameter :: n = 3
real, allocatable :: array(:)
!--------------------------------
allocate(array(n))
array(:) = 1.0
call vecadd
contains
subroutine vecadd()
integer :: i
type(point) :: A
real :: w(3)
real :: scalar
A%v(:) = 0.0
A%v(1) = 1.0
w(:) = 0.0
w(1) = 1.0
scalar = 1.0
write(*,*) 'host: v(1) = ', A%v(1)
write(*,*) 'host: w(1) = ', w(1)
write(*,*) 'host: scalar = ', scalar
!$acc parallel loop firstprivate(A,w)
do i = 1, n
write(*,*) 'device: v(1) = ', A%v(1)
write(*,*) 'device: w(1) = ', w(1)
write(*,*) 'device: scalar = ', scalar
enddo
end subroutine vecadd
end program main
When I compile it with nvfortran -acc -Minfo=accel test.f90 and run, it shows that on the device the values in the arrays are 0.0, not the correct values 1.0 I set on the host side. This happens only for arrays: scalars, like shown in the example, has the right values.
I am wondering if this is a limitation for nvfortran, or the current OpenACC standard?

Looks like a compiler issue where we're not initializing the small arrays on the host. In looking through our existing bug reports, I see one that's nearly identical, just with C instead of Fortran, that by coincidence got fixed in our development compiler this morning. Unfortunately it didn't seem to fix your issue as well. I sent a note to the compiler engineer assigned to the issue asking if he can take a look.
Worst case if the issue turns out to be similar but unrelated, I'll open a new problem report and update this post with the tracking number.

Related

Avoiding Conditional Statements in Loops

There is a portion of my f90 program that is taking up a significant amount of compute time. I am basically looping through three matrices (of the same size, with dimensions as large as 250-by-250), and trying to make sure values stay bounded within the interval [-1.0, 1.0]. I know that it is best practice to avoid conditionals in loops, but I am having trouble figuring out how to re-write this block of code for optimal performance. Is there a way to "unravel" the loop or use a built-in function of some sort to "vectorize" the conditional statements?
do ind2 = 1, size(u_mat,2)
do ind1 = 1,size(u_mat,1)
! Dot product 1 must be bounded between [-1,1]
if (b1_dotProd(ind1,ind2) .GT. 1.0_dp) then
b1_dotProd(ind1,ind2) = 1.0_dp
else if (b1_dotProd(ind1,ind2) .LT. -1.0_dp) then
b1_dotProd(ind1,ind2) = -1.0_dp
end if
! Dot product 2 must be bounded between [-1,1]
if (b2_dotProd(ind1,ind2) .GT. 1.0_dp) then
b2_dotProd(ind1,ind2) = 1.0_dp
else if (b2_dotProd(ind1,ind2) .LT. -1.0_dp) then
b2_dotProd(ind1,ind2) = -1.0_dp
end if
! Dot product 3 must be bounded between [-1,1]
if (b3_dotProd(ind1,ind2) .GT. 1.0_dp) then
b3_dotProd(ind1,ind2) = 1.0_dp
else if (b3_dotProd(ind1,ind2) .LT. -1.0_dp) then
b3_dotProd(ind1,ind2) = -1.0_dp
end if
end do
end do
For what it's worth, I am compiling with ifort.
You can use the intrinsic min and max functions for this.
As they are both elemental, you can use them on the whole array, as
b1_dotProd = max(-1.0_dp, min(b1_dotProd, 1.0_dp))
While there are processor instructions which allow min and max to be implemented without branches, it will depend on the compiler implementation of min and max as to whether or not this is actually done and if this is actually any faster, but it is at least a lot more concise.
The answer by #veryreverie is definitely correct, but there
are two things to consider.
A where statement is another sensible choice. Since it still is a conditional choice the same caveat of
whether or not this actually avoids branches and if it's actually any faster, but it is at least a lot more concise
still applies.
One example is:
pure function clamp(X) result(res)
real, intent(in) :: X(:)
real :: res(size(X))
where (X < -1.0)
res = -1.0
else where (X > 1.0)
res = 1.0
else
res = X
end where
end function
If you want to normalize to strictly 1 or -1, I would actually think about changing the datatype to integer. Then you can actually use a == 1 etc. without thinking about floating point equality problems. Depending on your code I would also think about cases where the dot product gets close to zero. Of course this point only applies, if you are only interested in the sign.
pure function get_sign(X) result(res)
real, intent(in) :: X(:)
integer :: res(size(X))
! Or use another appropiate choice to test for near_zero
where (abs(X) < epsilon(X) * 10.)
res = 0
else where (X < 0.0)
res = -1
else where (X > 0.0)
res = +1
end where
end function

FFTW3 gfortran code is correct only when arraysize is LE 16

I have fftw3 code and it works ONLY when array size is less than or equal to 16. Could some expert please tell me what I am doing wrong
program fftcalc
implicit none
integer n
parameter (n=128)
double precision in(n)
double complex out(1+(n/2))
integer*8 p, p2
integer i,j
real fact
real f,dt,t,pi
double precision re
integer FFTW_FORWARD,FFTW_BACKWARD
parameter (FFTW_FORWARD=-1,FFTW_BACKWARD=1)
integer FFTW_REAL_TO_COMPLEX,FFTW_COMPLEX_TO_REAL
parameter (FFTW_REAL_TO_COMPLEX=-1,FFTW_COMPLEX_TO_REAL=1)
integer FFTW_ESTIMATE,FFTW_MEASURE
parameter (FFTW_ESTIMATE=0,FFTW_MEASURE=1)
integer FFTW_OUT_OF_PLACE,FFTW_IN_PLACE,FFTW_USE_WISDOM
parameter (FFTW_OUT_OF_PLACE=0)
parameter (FFTW_IN_PLACE=8,FFTW_USE_WISDOM=16)
integer FFTW_THREADSAFE
parameter (FFTW_THREADSAFE=128)
f=25.
pi=3.14159
dt=0.004
do i=1,n
j=i-1
t=j*dt
re=cos(2.*pi*f*t)
in(i)=re
write(*,*) 'i,in',i,in(i)
enddo
write(*,*) "creating plans"
call dfftw_plan_dft_r2c_1d(p,n,in,out,fftw_measure)
call dfftw_plan_dft_c2r_1d(p2,n,in,out,fftw_measure)
write(*,*) "execute"
call dfftw_execute_dft_r2c(p,in,out)
write(*,*) "forward is executed"
do i=1,n
write(*,*) out(i)
enddo
write(*,*)
write(*,*)"do inverse fft"
call dfftw_execute_dft_c2r(p2, out,in)
write(*,*)"inverse fft completed"
fact=1.0/n
do i=1,n
write(*,*) in(i)*fact
enddo
write(*,*)"clean up"
call dfftw_destroy_plan(p)
call dfftw_destroy_plan(p2)
end program fftcalc
I tried with n=16 and n=10 it works, but when n=64 or so it give mostly zero forward transform.
How do I solve the problem?

How Efficient Are Intrinsic (Math) Functions in Fortran?

How efficient are Fortran's (90+) intrinsic (math) functions? I especially care about tanh and sech but am interested in the other Fortran intrinsic functions as well.
By "how efficient" I mean that if it is very hard to come up with a faster method then the intrinsics are efficient but if it is very easy to come up with a faster method then the intrinsics are inefficient.
Here is a MWE, in which my change to try to make it faster actually made it slower, suggesting the intrinsics are efficient.
program main
implicit none
integer, parameter :: n = 10000000
integer :: i
real :: x, var
real :: t1,t2,t3,t4
!! Intrinsic first
call cpu_time(t1)
do i = 1, n
x = REAL(i)/300.0
var = tanh(x)
end do
call cpu_time(t2)
write(*,*) "Elapsed CPU Time = ", t2 - t1
write(*,*) var
!! Intrinsic w/ small change
call cpu_time(t3)
do i = 1, n
x = REAL(i)/300.0
if (x > 10.0) then
var = 1.0
else
var = tanh(x)
end if
end do
call cpu_time(t4)
write(*,*) "Elapsed CPU Time = ", t4 - t3
write(*,*) var
end program main
Note that Fortran90 seems to be lazy; if I don't include the "write(,) var" then it says elapsed CPU time = 0.0

Read array of unknown size from keyboard

I want to insert an unknown number of values in an array (no matter the order).
I could first read how many values are to be inserted, then allocate the allocatable array, and finally read its values, as in the following code
PROGRAM try
IMPLICIT NONE
INTEGER :: N
REAL, DIMENSION(:), ALLOCATABLE :: x
WRITE (*,*) "how many values?"
READ (*,*) N
ALLOCATE(x(N))
WRITE (*,*) "insert the values"
READ (*,*) x
END PROGRAM
What if I want to insert the values without declaring how many before allocating the array?
I think I should use a DO WHILE cycle to insert the values in ascending order, till a descending value is insert, thus indicating the sequence is ended. I think a part of the code would be the following,
index = 1
WRITE(*,*) x
READ(*,*) x(index)
exit = .FALSE.
DO WHILE (exit.EQV..FALSE.)
index = index + 1
READ(*,*) x(index)
IF (x(index)>x(index-1)) THEN
exit = .TRUE.
index = index - 1
END IF
END DO
How to declare the array x?
I tried with the following solution, building on the concept "a lot of memory allocation and reallocation" expressed by #High Performance Mark.
PROGRAM COEFFS
USE COMPACT
IMPLICIT NONE
REAL, DIMENSION(:), ALLOCATABLE :: x,x2
INTEGER :: nL,nR,nT,index,oL,oR
LOGICAL :: exit
WRITE(*,*) "Input an increasing sequence of reals (end the sequence &
& with the first decreasing element, which will be discarded):"
index = 1
ALLOCATE(x(index))
READ(*,*) x(index)
ALLOCATE(x2(index))
x2 = x
DEALLOCATE(x)
exit = .FALSE.
DO WHILE (exit.EQV..FALSE.)
index = index + 1
ALLOCATE(x(index))
x(1:index-1) = x2
READ(*,*) x(index)
DEALLOCATE(x2)
ALLOCATE(x2(index))
x2 = x
DEALLOCATE(x)
IF (x2(index)<x2(index-1)) THEN
exit = .TRUE.
index = index - 1
ALLOCATE(x(index))
x = x2(1:index)
END IF
END DO
DEALLOCATE(x2)
WRITE(*,*) "x = ", x
END PROGRAM
With the array being input by keyboard, I don't think allocation/reallocation is a problem, since it happens at a much higher speed than that of my fingers typing the values, doesn't it?
Still I think the code could be made better. For instance, using two arrays is the only way to take advantage of allocation/reallocation?

How to Avoid Conditionals in Loops

In this link, the author gives an example as
subroutine threshold(a, thresh, ic)
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt
integer :: n
ic = 0
tt = 0.d0
n = size(a)
do j = 1, n
tt = tt + a(j) * a(j)
if (sqrt(tt) >= thresh) then
ic = j
return
end if
end do
end subroutine threshold
and the author commented this code as
An alternative approach, which would allow for many optimizations
(loop unrolling, CPU pipelining, less time spent evaluating the
conditional) would involve adding tt in blocks (e.g., blocks of size
128) and checking the conditional after each block. When it the
condition is met, the last block can be repeated to determine the
value of ic.
What does it mean? loop unrolling? CPU pipelining? adding tt in blocks?
How to optimize the code as the author say?
If the loop is performed in chunks/blocks that fit into the CPU cache you will reduce the number of cache misses, and consequently the number of cache lines retrieved from memory. This increases the performance on all loops that are limited by memory operations.
If the corresponding block size is BLOCKSIZE, this is achieved by
do j = 1, n, BLOCKSIZE
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
end do
This, however, will leave a remainder that is not treated in the main loop. To illustrate this, consider an array of length 1000. The first seven chunks (1--896) are covered in the loop, but the eighth one (897--1024) is not. Therefore, another loop for the remainder is required:
do j=(n/BLOCKSIZE)*BLOCKSIZE,n
! ...
enddo
While it makes little sense to remove the conditional from the remainder loop, it can be performed in the outer loop of the blocked main loop.
As now no branches occur in the inner loop, aggressive optimizations might be applicable then.
However, this limits the "accuracy" of the determined position to the blocks. To get to an element-wise accuracy, you have to repeat the calculation.
Here is the complete code:
subroutine threshold_block(a, thresh, ic)
implicit none
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt, tt_bak, thresh_sqr
integer :: n, j, jj
integer,parameter :: BLOCKSIZE = 128
ic = 0
tt = 0.d0
thresh_sqr = thresh**2
n = size(a)
! Perform the loop in chunks of BLOCKSIZE
do j = 1, n, BLOCKSIZE
tt_bak = tt
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
! Perform the check on the block level
if (tt >= thresh_sqr) then
! If the threshold is reached, repeat the last block
! to determine the last position
tt = tt_bak
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
if (tt >= thresh_sqr) then
ic = jj
return
end if
end do
end if
end do
! Remainder is treated element-wise
do j=(n/BLOCKSIZE)*BLOCKSIZE,n
tt = tt + a(j) * a(j)
if (tt >= thresh_sqr) then
ic = j
return
end if
end do
end subroutine threshold_block
Please note that the compilers are nowadays very good in creating blocked loops in combination with other optimizations. In my experience it is quite difficult to get a better performance out of such simple loops by manually tweaking it.
Loop blocking is enabled in gfortran with the compiler option -floop-block.
Loop unrolling can be done manually, but should be left to the compiler. The idea is to manually perform a loop in blocks and instead of a second loop as shown above, perform the operations by duplicating the code. Here is an example for the inner loop as given above, for a loop unrolling of factor four:
do jj = j, j+BLOCKSIZE-1,4
tt = tt + a(jj) * a(jj)
tt = tt + a(jj+1) * a(jj+1)
tt = tt + a(jj+2) * a(jj+2)
tt = tt + a(jj+3) * a(jj+3)
end do
Here, no remainder can occur if BLOCKSIZE is a multiple of 4. You can probably shave off a few operations in here ;-)
The gfortran compiler option to enable this is -funroll-loops
As far as I know, CPU Pipelining (Instruction Pipelining) cannot be enforced manually in Fortran. This task is up to the compiler.
Pipelining sets up a pipe of instructions. You feed the complete array into that pipe and, after the wind-up phase, you will get a result with each clock cycle. This drastically increases the throughput.
However, branches are difficult (impossible?) to treat in pipes, and the array should be long enough that the time required for setting up the pipe, wind-up, and wind-down phase are compensated.

Resources