How to Avoid Conditionals in Loops - performance

In this link, the author gives an example as
subroutine threshold(a, thresh, ic)
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt
integer :: n
ic = 0
tt = 0.d0
n = size(a)
do j = 1, n
tt = tt + a(j) * a(j)
if (sqrt(tt) >= thresh) then
ic = j
return
end if
end do
end subroutine threshold
and the author commented this code as
An alternative approach, which would allow for many optimizations
(loop unrolling, CPU pipelining, less time spent evaluating the
conditional) would involve adding tt in blocks (e.g., blocks of size
128) and checking the conditional after each block. When the
condition is met, the last block can be repeated to determine the
value of ic.
What does this mean? What are loop unrolling and CPU pipelining, and what does "adding tt in blocks" mean?
How can the code be optimized as the author says?

If the loop is performed in chunks/blocks that fit into the CPU cache you will reduce the number of cache misses, and consequently the number of cache lines retrieved from memory. This increases the performance on all loops that are limited by memory operations.
If the corresponding block size is BLOCKSIZE, this is achieved by
do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
end do
This, however, leaves a remainder that is not treated in the main loop whenever n is not a multiple of BLOCKSIZE. To illustrate this, consider an array of length 1000 and a block size of 128. The first seven chunks (1--896) are covered in the loop, but the partial eighth one (897--1000) is not. Therefore, another loop for the remainder is required:
do j = (n/BLOCKSIZE)*BLOCKSIZE + 1, n
! ...
enddo
While it makes little sense to remove the conditional from the remainder loop, in the blocked main loop the check can be moved out of the inner loop and into the outer loop.
As no branches then occur in the inner loop, aggressive optimizations might become applicable.
However, this limits the "accuracy" of the determined position to whole blocks. To get element-wise accuracy, you have to repeat the calculation for the block in which the threshold was crossed.
Here is the complete code:
subroutine threshold_block(a, thresh, ic)
implicit none
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt, tt_bak, thresh_sqr
integer :: n, j, jj
integer,parameter :: BLOCKSIZE = 128
ic = 0
tt = 0.0
! Compare squared values to avoid the sqrt (assumes thresh >= 0)
thresh_sqr = thresh**2
n = size(a)
! Perform the loop in chunks of BLOCKSIZE, covering full blocks only
do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
tt_bak = tt
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
! Perform the check on the block level
if (tt >= thresh_sqr) then
! If the threshold is reached, repeat the last block
! to determine the last position
tt = tt_bak
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
if (tt >= thresh_sqr) then
ic = jj
return
end if
end do
end if
end do
! Remainder (elements after the last full block) is treated element-wise
do j = (n/BLOCKSIZE)*BLOCKSIZE + 1, n
tt = tt + a(j) * a(j)
if (tt >= thresh_sqr) then
ic = j
return
end if
end do
end subroutine threshold_block
Please note that compilers are nowadays very good at creating blocked loops in combination with other optimizations. In my experience it is quite difficult to get better performance out of such simple loops by tweaking them manually.
Loop blocking is enabled in gfortran with the compiler option -floop-block (this requires a GCC build with Graphite support).
Loop unrolling can be done manually, but should be left to the compiler. The idea is to manually perform a loop in blocks and instead of a second loop as shown above, perform the operations by duplicating the code. Here is an example for the inner loop as given above, for a loop unrolling of factor four:
do jj = j, j+BLOCKSIZE-1,4
tt = tt + a(jj) * a(jj)
tt = tt + a(jj+1) * a(jj+1)
tt = tt + a(jj+2) * a(jj+2)
tt = tt + a(jj+3) * a(jj+3)
end do
Here, no remainder can occur if BLOCKSIZE is a multiple of 4. You can probably shave off a few operations in here ;-)
The gfortran compiler option to enable this is -funroll-loops.
As far as I know, CPU Pipelining (Instruction Pipelining) cannot be enforced manually in Fortran. This task is up to the compiler.
Pipelining sets up a pipe of instructions. You feed the complete array into that pipe and, after the wind-up phase, you will get a result with each clock cycle. This drastically increases the throughput.
However, branches are difficult (impossible?) to treat in pipes, and the array should be long enough that the time required for setting up the pipe and for the wind-up and wind-down phases is compensated.

Related

Avoiding Conditional Statements in Loops

There is a portion of my f90 program that is taking up a significant amount of compute time. I am basically looping through three matrices (of the same size, with dimensions as large as 250-by-250), and trying to make sure values stay bounded within the interval [-1.0, 1.0]. I know that it is best practice to avoid conditionals in loops, but I am having trouble figuring out how to re-write this block of code for optimal performance. Is there a way to "unravel" the loop or use a built-in function of some sort to "vectorize" the conditional statements?
do ind2 = 1, size(u_mat,2)
do ind1 = 1,size(u_mat,1)
! Dot product 1 must be bounded between [-1,1]
if (b1_dotProd(ind1,ind2) .GT. 1.0_dp) then
b1_dotProd(ind1,ind2) = 1.0_dp
else if (b1_dotProd(ind1,ind2) .LT. -1.0_dp) then
b1_dotProd(ind1,ind2) = -1.0_dp
end if
! Dot product 2 must be bounded between [-1,1]
if (b2_dotProd(ind1,ind2) .GT. 1.0_dp) then
b2_dotProd(ind1,ind2) = 1.0_dp
else if (b2_dotProd(ind1,ind2) .LT. -1.0_dp) then
b2_dotProd(ind1,ind2) = -1.0_dp
end if
! Dot product 3 must be bounded between [-1,1]
if (b3_dotProd(ind1,ind2) .GT. 1.0_dp) then
b3_dotProd(ind1,ind2) = 1.0_dp
else if (b3_dotProd(ind1,ind2) .LT. -1.0_dp) then
b3_dotProd(ind1,ind2) = -1.0_dp
end if
end do
end do
For what it's worth, I am compiling with ifort.
You can use the intrinsic min and max functions for this.
As they are both elemental, you can use them on the whole array, as
b1_dotProd = max(-1.0_dp, min(b1_dotProd, 1.0_dp))
While there are processor instructions which allow min and max to be implemented without branches, it will depend on the compiler implementation of min and max as to whether or not this is actually done and if this is actually any faster, but it is at least a lot more concise.
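As a minimal sketch of the same idea, the min/max combination can be wrapped in a small elemental function so that all three arrays are clamped with one call each. The module name clamp_sketch, the function clamp, and the definition of dp as double precision are assumptions made for illustration; they do not come from the question.

```fortran
module clamp_sketch
  implicit none
  ! Assumed definition of dp; the question does not show how dp is declared.
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical helper: bound x to the closed interval [lo, hi].
  ! Being elemental, it can be applied to scalars and to whole arrays.
  elemental function clamp(x, lo, hi) result(res)
    real(dp), intent(in) :: x, lo, hi
    real(dp) :: res
    res = max(lo, min(x, hi))
  end function clamp
end module clamp_sketch
```

With this in place, the nested loops reduce to statements such as b1_dotProd = clamp(b1_dotProd, -1.0_dp, 1.0_dp), repeated for b2_dotProd and b3_dotProd. Whether this is faster than the explicit loop still depends on the compiler.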
The answer by @veryreverie is definitely correct, but there
are two things to consider.
A where construct is another sensible choice. Since it is still a conditional, the same caveat of
"whether or not this actually avoids branches and if it's actually any faster, but it is at least a lot more concise"
still applies.
One example is:
pure function clamp(X) result(res)
real, intent(in) :: X(:)
real :: res(size(X))
where (X < -1.0)
res = -1.0
else where (X > 1.0)
res = 1.0
else where
res = X
end where
end function
If you want to normalize to strictly 1 or -1, I would actually think about changing the datatype to integer. Then you can actually use a == 1 etc. without thinking about floating point equality problems. Depending on your code I would also think about cases where the dot product gets close to zero. Of course this point only applies, if you are only interested in the sign.
pure function get_sign(X) result(res)
real, intent(in) :: X(:)
integer :: res(size(X))
! Or use another appropriate choice to test for near_zero
where (abs(X) < epsilon(X) * 10.)
res = 0
else where (X < 0.0)
res = -1
else where (X > 0.0)
res = +1
end where
end function

How Efficient Are Intrinsic (Math) Functions in Fortran?

How efficient are Fortran's (90+) intrinsic (math) functions? I especially care about tanh and sech but am interested in the other Fortran intrinsic functions as well.
By "how efficient" I mean that if it is very hard to come up with a faster method then the intrinsics are efficient but if it is very easy to come up with a faster method then the intrinsics are inefficient.
Here is a MWE, in which my change to try to make it faster actually made it slower, suggesting the intrinsics are efficient.
program main
implicit none
integer, parameter :: n = 10000000
integer :: i
real :: x, var
real :: t1,t2,t3,t4
!! Intrinsic first
call cpu_time(t1)
do i = 1, n
x = REAL(i)/300.0
var = tanh(x)
end do
call cpu_time(t2)
write(*,*) "Elapsed CPU Time = ", t2 - t1
write(*,*) var
!! Intrinsic w/ small change
call cpu_time(t3)
do i = 1, n
x = REAL(i)/300.0
if (x > 10.0) then
var = 1.0
else
var = tanh(x)
end if
end do
call cpu_time(t4)
write(*,*) "Elapsed CPU Time = ", t4 - t3
write(*,*) var
end program main
Note that Fortran 90 seems to be lazy; if I don't include the "write(*,*) var" statements, then the reported elapsed CPU time is 0.0.
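The zero timing is most likely not laziness in the language but the optimizer removing a loop whose result is never used. A minimal sketch of a timing helper that avoids this, by accumulating every tanh value into a sum that is returned to the caller, could look as follows. The module name tanh_timing_sketch and the subroutine time_tanh are hypothetical names chosen for illustration.

```fortran
module tanh_timing_sketch
  implicit none
contains
  ! Hypothetical helper: evaluate tanh n times and report the CPU time used.
  ! The accumulated checksum is returned so that the compiler cannot
  ! discard the loop as dead code.
  subroutine time_tanh(n, elapsed, checksum)
    integer, intent(in) :: n
    real, intent(out) :: elapsed
    double precision, intent(out) :: checksum
    real :: t_start, t_end, x
    integer :: i
    checksum = 0.0d0
    call cpu_time(t_start)
    do i = 1, n
      x = real(i) / 300.0
      checksum = checksum + tanh(x)   ! every iteration contributes to the result
    end do
    call cpu_time(t_end)
    elapsed = t_end - t_start
  end subroutine time_tanh
end module tanh_timing_sketch
```

Printing checksum after the call plays the same role as the write(*,*) var statements in the MWE, but it depends on all iterations rather than only the last one.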

Calculating the convolution of a matrix with a recursive subroutine in Fortran 2003

I've never asked a question here, so please let me know if I am describing my problem enough.
I'm pretty new at Fortran and I wanted to create a recursive subroutine that would compute each square of a 3 x 3 matrix. If you are not familiar with convolution, this is a good resource: http://songho.ca/dsp/convolution/convolution2d_example.html.
I used the same values in this example to make sure I was doing it right.
The purpose of the program is to have the recursive subroutine called in the middle of two do loops (both indexes run from 0 to 2). When the recursive subroutine is called, it will find the sum of all the products for one square of the output matrix. The loops will call it 9 times to ensure that every square's value has been calculated to produce the desired output. Well, after much editing on paper, I thought that I had a pretty good idea that the subroutine would work, but it seems that only the first square (0,0) was able to get its answer, -13.
I believe my problem has to do with the assignments in the subroutine. I want to continue calling the next 'temp' value to add it to the total, which will be returned to the program calling it.
Depending on the current i and j values, there might not be any multiplication needed for every turn through the recursive method, so I wanted the subroutine to find a way to continue adding temp if that was the case.
program conprac
implicit none
integer, dimension(0:2,0:2) :: mat1(0:2,0:2) = reshape((/1,4,7,2,5,8,3,6,9/),(/3,3/))
integer, dimension(0:2,0:2) :: totals(0:2,0:2) = 0
integer, dimension(2,2) :: kernal(0:2,0:2) = reshape((/1,0,-1,2,0,-2,1,0,-1/),(/3,3/))
integer :: i=0, j=0, this_total=0, total=0, m=0, n=0, k=0
!do m = 0,0
!do n = 0,1
total = 0
call calc(kernal, mat1, i, j, m, n, this_total, total)
totals(m,n) = total
!end do
!end do
write(*,*) "totals(0,0): ", totals(0,0) !-13
!write(*,*) "totals(0,1): ", totals(0,1) !-20
end program conprac
recursive subroutine calc(kernal, mat, i, j, addToi, addToj, this_total, total)
implicit none
!declare calling parameter types and definitions
!to calculate with
integer, intent(in), dimension(0:2,0:2) :: kernal, mat
integer, intent(in) :: addToi, addToj
integer, intent(out) :: i, j, this_total
!to calculate
integer, intent(out) :: total
!temp variable
integer :: temp
if (i <= 2) then
if (j > 2) then
i = i + 1
j = 0
end if
if ((i + addToi) - 1 < 0 .or. (j + addToj) - 1 < 0 .or. (i + addToi) - 1 > 2 .or. (j + addToj) - 1 > 2) then
j = j+1
call calc(kernal, mat, i, j, addToi, addToj, this_total, temp)
total = total + temp
write(*,*) "total1: ", total
else
this_total = kernal(i,j) * mat((i + addToi) - 1, (j + addToj) - 1)
j = j+1
call calc(kernal, mat, i, j, addToi, addToj, this_total, temp)
total = this_total + temp
write(*,*) "total2: ", total
end if
end if
end subroutine calc
As of right now, the do loops are commented out so I can test one value at a time up where m, n are initialized.
The parameters for the subroutine are: kernal and mat, the 2 matrices; i and j, both starting at 0 when the subroutine is called; and the m and n values, which are added to i and j respectively to make sure that the kernal is shifted and not calculating in the same spot for all 9 squares. this_total is a holder for the product of an overlapping square to be added to the returning value total, and total is the value returned, which will be stored in the totals array in the program. For the first one, it would be at index (m,n), or (0,0).
These are the outputs for the first square in totals:
m=0,n=0
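For comparison, here is a minimal non-recursive sketch of the same calculation using plain nested loops. It keeps the index convention of the posted code, where the kernal is already flipped and the element used is mat(i+m-1, j+n-1), and it skips positions that fall outside the matrix. The module name conv_sketch and the subroutine convolve3x3 are hypothetical. One thing worth checking in the posted subroutine is that i, j, this_total and total are declared intent(out) but are read before being assigned; intent(out) dummy arguments are undefined on entry, so intent(inout) is probably what was meant.

```fortran
module conv_sketch
  implicit none
contains
  ! Hypothetical helper: fill totals(m,n) with the sum of
  ! kernal(i,j) * mat(i+m-1, j+n-1) over all (i,j) whose shifted
  ! position lies inside the 3x3 matrix (zero padding outside).
  subroutine convolve3x3(kernal, mat, totals)
    integer, intent(in)  :: kernal(0:2,0:2), mat(0:2,0:2)
    integer, intent(out) :: totals(0:2,0:2)
    integer :: m, n, i, j, r, c
    totals = 0
    do n = 0, 2
      do m = 0, 2
        do j = 0, 2
          do i = 0, 2
            r = i + m - 1
            c = j + n - 1
            ! Skip kernal entries that hang over the edge of mat
            if (r < 0 .or. r > 2 .or. c < 0 .or. c > 2) cycle
            totals(m,n) = totals(m,n) + kernal(i,j) * mat(r,c)
          end do
        end do
      end do
    end do
  end subroutine convolve3x3
end module conv_sketch
```

Called with the mat1 and kernal arrays from the program above, this gives totals(0,0) = -13 and totals(0,1) = -20, matching the values noted in the comments of the posted program.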

Random number concentrated within a certain range

I wrote Fortran code to generate a series of random numbers. In this code, I can set a random number window (minimum and maximum random number) and the percentage of random numbers within this window (which determines how many random numbers are generated). I want the generated random numbers to always be different from each other.
I could compile it successfully with the gfortran compiler; however, I found a problem. For instance, when I input 1 and 81 as the minimum and maximum values respectively and 0.07 as the percentage, the code always gave me seven different random numbers, which were always smaller than 10, no matter how many times I ran it. What I expect is that the code should give me seven different random numbers distributed within the 1~81 range, rather than concentrated within the 1~10 range. I do not know why the code gave me random numbers concentrated only within a certain range. I paste my code below.
Could anyone give me some suggestions on my problem? Thank you very much in advance.
PROGRAM RANDOM_POSITION
IMPLICIT NONE
REAL percent, val
INTEGER maxi, mini, num, i, l
INTEGER, DIMENSION(1), ALLOCATABLE :: position(:)
PRINT *,'Range for the impurity position(maximum and minimum value):'
PRINT *,'Minimum value:'
READ (UNIT=*, FMT=*) mini
PRINT *,'Maximum value:'
READ (UNIT=*, FMT=*) maxi
PRINT 11,'Percentage of impurity='
11 FORMAT(A23,$)
READ (UNIT=*, FMT=*) percent
num = (maxi-mini) * percent
IF ((maxi-mini) * percent-num .NE. 0.0) THEN
num = num + 1
END IF
PRINT *, num
ALLOCATE (position(num))
CALL RANDOM_SEED()
DO i=1, num ,1
CALL RANDOM_NUMBER(val)
position(i) = NINT(mini + val * num)
CALL JUDGEMENT(position, i, l)
l = 0
DO WHILE (l .EQ. 0)
CALL RANDOM_NUMBER(val)
position(i) = NINT(mini + val * num)
CALL JUDGEMENT(position, i, l)
END DO
PRINT *, position(i)
END DO
DEALLOCATE(position)
STOP
END PROGRAM RANDOM_POSITION
SUBROUTINE JUDGEMENT(arr, j, l)
IMPLICIT NONE
INTEGER j, k, l
INTEGER, DIMENSION(1) :: arr(j)
l = 1
DO k=1, j-1, 1
IF (arr(k) .EQ. arr(j)) THEN
l = 0
EXIT
ELSE
l = 1
END IF
END DO
RETURN
END SUBROUTINE JUDGEMENT
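A likely cause is visible in the line position(i) = NINT(mini + val * num): the random value is scaled by num, the count of numbers to draw, rather than by the width of the window, so every result lies between mini and mini + num. A minimal sketch that scales by the window width and rejects duplicates is shown below. The module name distinct_random_sketch and the subroutine distinct_positions are hypothetical, and the sketch assumes that both end points mini and maxi are allowed values.

```fortran
module distinct_random_sketch
  implicit none
contains
  ! Hypothetical helper: fill positions with distinct random integers
  ! drawn from the closed range [mini, maxi].
  subroutine distinct_positions(mini, maxi, positions)
    integer, intent(in)  :: mini, maxi
    integer, intent(out) :: positions(:)
    real :: val
    integer :: i, candidate
    ! More distinct values than the range holds can never be produced
    if (size(positions) > maxi - mini + 1) error stop "range too small"
    i = 0
    do while (i < size(positions))
      call random_number(val)                       ! 0.0 <= val < 1.0
      ! Scale by the window width, not by the number of values wanted
      candidate = mini + int(val * real(maxi - mini + 1))
      candidate = min(candidate, maxi)              ! guard against rounding up
      if (any(positions(1:i) == candidate)) cycle   ! duplicate: draw again
      i = i + 1
      positions(i) = candidate
    end do
  end subroutine distinct_positions
end module distinct_random_sketch
```

After call random_seed(), a call such as call distinct_positions(1, 81, position) with position allocated to seven elements would spread the results over the whole window. Rejection sampling like this is fine when few values are drawn from a wide range; it slows down as the requested count approaches the size of the range.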

Read array of unknown size from keyboard

I want to insert an unknown number of values in an array (no matter the order).
I could first read how many values are to be inserted, then allocate the allocatable array, and finally read its values, as in the following code
PROGRAM try
IMPLICIT NONE
INTEGER :: N
REAL, DIMENSION(:), ALLOCATABLE :: x
WRITE (*,*) "how many values?"
READ (*,*) N
ALLOCATE(x(N))
WRITE (*,*) "insert the values"
READ (*,*) x
END PROGRAM
What if I want to insert the values without declaring how many before allocating the array?
I think I should use a DO WHILE loop to insert the values in ascending order until a descending value is inserted, thus indicating that the sequence has ended. I think a part of the code would be the following,
index = 1
WRITE(*,*) x
READ(*,*) x(index)
exit = .FALSE.
DO WHILE (exit.EQV..FALSE.)
index = index + 1
READ(*,*) x(index)
IF (x(index)<x(index-1)) THEN
exit = .TRUE.
index = index - 1
END IF
END DO
How to declare the array x?
I tried the following solution, building on the concept of "a lot of memory allocation and reallocation" expressed by @High Performance Mark.
PROGRAM COEFFS
USE COMPACT
IMPLICIT NONE
REAL, DIMENSION(:), ALLOCATABLE :: x,x2
INTEGER :: nL,nR,nT,index,oL,oR
LOGICAL :: exit
WRITE(*,*) "Input an increasing sequence of reals (end the sequence &
& with the first decreasing element, which will be discarded):"
index = 1
ALLOCATE(x(index))
READ(*,*) x(index)
ALLOCATE(x2(index))
x2 = x
DEALLOCATE(x)
exit = .FALSE.
DO WHILE (exit.EQV..FALSE.)
index = index + 1
ALLOCATE(x(index))
x(1:index-1) = x2
READ(*,*) x(index)
DEALLOCATE(x2)
ALLOCATE(x2(index))
x2 = x
DEALLOCATE(x)
IF (x2(index)<x2(index-1)) THEN
exit = .TRUE.
index = index - 1
ALLOCATE(x(index))
x = x2(1:index)
END IF
END DO
DEALLOCATE(x2)
WRITE(*,*) "x = ", x
END PROGRAM
With the array being input from the keyboard, I don't think allocation/reallocation is a problem, since it happens at a much higher speed than that of my fingers typing the values, doesn't it?
Still, I think the code could be made better. For instance, is using two arrays the only way to take advantage of allocation/reallocation?
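One possible simplification, sketched here under the assumption that a Fortran 2003 compiler is available, is to hide the second array inside a small helper that grows the array by one element and uses the intrinsic move_alloc to hand the new storage back. The module name grow_array_sketch and the subroutine append are hypothetical names chosen for illustration.

```fortran
module grow_array_sketch
  implicit none
contains
  ! Hypothetical helper: append val to the allocatable array x,
  ! allocating x if it is not allocated yet.
  subroutine append(x, val)
    real, allocatable, intent(inout) :: x(:)
    real, intent(in) :: val
    real, allocatable :: tmp(:)
    integer :: n
    n = 0
    if (allocated(x)) n = size(x)
    allocate(tmp(n + 1))
    if (n > 0) tmp(1:n) = x       ! copy the existing values
    tmp(n + 1) = val
    ! x takes over the storage of tmp; tmp becomes unallocated,
    ! and the old storage of x is released.
    call move_alloc(tmp, x)
  end subroutine append
end module grow_array_sketch
```

In the read loop, each value would be read into a scalar, compared with the last element of x, and passed to append only if the sequence is still increasing. A temporary array is still involved, but it lives inside the helper and only one copy is made per element. With automatic reallocation on assignment, x = [x, val] is a shorter alternative, provided x has been allocated beforehand (for example with zero size).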

Resources