I have been trying to apply OpenMP on a simple summation operation inside two nested loop, but it produced incorrect result so far. I have been looking around in here and here, also in here. All suggest to use reduction clause, but it does not work for my case by producing very large number which leads to segmentation fault.
I also tried this way posted in here and my own question here which has been solved. Both do not use reduction and simply just set summation variable as shared, but it also produces incorrect result. Is there something that I am missing? When to use reduction and not using that while facing summation operation?
Codes using reduction clause
index = 0
!$OMP PARALLEL DO PRIVATE(iy,ix) REDUCTION(:+index)
do iy = 1, number(2)
do ix = 1, number(1)
index = index + 1
xoutput(index)=xinput(ix)
youtput(index)=yinput(iy)
end do
end do
!$OMP END PARALLEL DO
Code without using reduction clause
index = 0
!$OMP PARALLEL DO PRIVATE(iy,ix) SHARED(index)
do iy = 1, number(2)
do ix = 1, number(1)
index = index + 1
xoutput(index)=xinput(ix)
youtput(index)=yinput(iy)
end do
end do
!$OMP END PARALLEL DO
I think you have a mis-conception of what the reduction clause does...
REDUCTION(+:index)
means, that you will have the correct sum index in the end. In each step of the iteration, each tread will have a different version with different values! So the reduction is not suitable to manage array indices during the parallel section.
Let me try to illustrate this...
The following loop
!$OMP PARALLEL DO PRIVATE(iy) REDUCTION(+:index)
do iy = 1, number(2)
index = index + 1
end do
!$OMP END PARALLEL DO
is (more or less) equivalent to
!$OMP PARALLEL PRIVATE(iy, privIndex) SHARED(index)
!$OMP DO
do iy = 1, number(2)
privIndex = privIndex + 1
end do
!$OMP END DO
!$OMP CRITICAL
index = index + privIndex
!$OMP END CRITICAL
!$OMP END PARALLEL
You can see that during the loop all threads work on different variables privIndex which are private to that thread, and calculate local (partial) sums. In the end, the total sum is taken, using a critical section to avoid race conditions.
This might not be what the compiler does, but it gives you an idea how a reduction works: at no point within the first loop does privIndex correspond to the correct index you would expect in the serial version.
As Vladimir suggests in his comment, you can calculate the index directly as you are only incrementing it in the inner loop:
!$OMP PARALLEL DO PRIVATE(iy,ix, index)
do iy = 1, number(2)
do ix = 1, number(1)
index = (iy-1)*number(1) + ix
xoutput(index)=xinput(ix)
youtput(index)=yinput(iy)
end do
end do
!$OMP END PARALLEL DO
Related
Integer :: NBE,ierr,RN,i,j
Real(kind=8), allocatable :: AA1(:,:),AA2(:,:)
NBE=40
RN=3*NBE-2
Allocate(AA1(3*NBE,3*NBE),AA2(3*NBE,RN),stat=ierr)
If (ierr .ne. 0) Then
print *, 'allocate steps failed 1'
pause
End If
Do i=1,3*NBE
Do j=1,3*NBE
AA1(i,j)=1
End Do
End Do
I want to remove columns 97 and 113 from the matrix AA1 and then this matrix becomes AA2. I just want to know if any command of Fortran can realize this operation?
Alexander Vogt's answer gives the concept of using a vector subscript to select the elements of the array for inclusion. That answer constructs the vector subscript array using
[(i,i=1,96),(i=98,112),(i=114,3*NBE)]
Some may consider
AA2 = AA1(:,[(i,i=1,96),(i=98,112),(i=114,3*NBE)])
to be less than clear in reading. One could use a "temporary" index vector
integer selected_columns(RN)
selected_columns = [(i,i=1,96),(i=98,112),(i=114,3*NBE)]
AA2 = AA1(:,selected_columns)
but that doesn't address the array constructor being not nice, especially in more complicated cases. Instead, we can create a mask and use our common techniques:
logical column_wanted(3*NBE)
integer, allocatable :: selected_columns(:)
! Create a mask of whether a column is wanted
column_wanted = .TRUE.
column_wanted([97,113]) = .FALSE.
! Create a list of indexes of wanted columns
selected_columns = PACK([(i,i=1,3*NBE)],column_wanted)
AA2 = AA1(:,selected_columns)
Here's a simple one liner:
AA2 = AA1(:,[(i,i=1,96),(i=98,112),(i=114,3*NBE)])
Explanation:
(Inner part) Construct a temporary array for the indices [1,...,96,98,...,112,114,...,3*NBE]
(Outer part) Copy the matrix and only consider the columns in the index array
OK, I yield to #IanBush... Even simpler would be to do three dedicated assignments:
AA2(:,1:96) = AA1(:,1:96)
AA2(:,97:111) = AA1(:,98:112)
AA2(:,112:) = AA1(:,114:)
I don't have a fortran compiler here at home, so I cant test it. But I'd do something line this:
i = 0
DO j = 1, 3*NBE
IF (j == 97 .OR. j == 113) CYCLE
i = i + 1
AA2(:, i) = AA1(:, j)
END DO
The CYCLE command means that the rest of the loop should not be executed any more and the next iteration should start. Thereby, i will not get incremented, so when j=96, then i=96, when j=98 then i=97, and when j=114 then i=112.
A few more words: Due to Fortran's memory layout, you want to cycle over the first index the fastest, and so forth. So your code would run faster if you changed it to:
Do j=1,3*NBE ! Outer loop over second index
Do i=1,3*NBE ! Inner loop over first index
AA1(i,j)=1
End Do
End Do
(Of course, such an easy initialisation can be done even easier with just AA1(:,:) = 1 of just AA1 = 1.
I'm trying to determine if there is a way to parallelize the Jacobi method using sparse matrix formats (specifically Compressed Row Format)
I have a working sparse matrix Jacobi. I don't know if I can place
!$OMP PARALLEL DO
Directives on the middle do loop because x is being both written to and read from. I guess the inner do loop can have it, but the same t is being overwritten so I don't know if it is possible there either. Am I overlooking something here? Thanks.
x(:) = 0
do p = 1, numIterations
do i=1, n
t=b(i)
do j = IA(i), IA(i+1) - 1
if j=i
d=A(j)
else
t = t - A(j) * x(jA(j))
end if
end do
x(i) = t/d
end do
end do
It is true you have a dependency on t in the inner loop since it used as an accumulator. However, that also means you can have a private copy of t in each of the threads (since the arrays A and x are not being written in the loop, the value of t only depends on the value of j, which is also thread private).
The following should work:
x(:) = 0
do p = 1, numIterations
do i=1, n
t=0
!$OMP PARALLEL DO
!$OMP REDUCTION(+:t)
do j = IA(i), IA(i+1) - 1
if j=i
d=A(j)
else
t = A(j) * x(jA(j))
end if
end do
x(i) = (b(i)-t)/d
end do
end do
Note that d can only be be written by one of the threads, so the variable can be shared betewen the threads, no loop-carried dependencies on d.
Does anyone know if the following can be an example of firstprivate in openmp?
rowstr[0] = 0;
for (j = 1; j < nrows+1; j++) {
rowstr[j] = rowstr[j] + rowstr[j-1];
}
nza = rowstr[nrows] - 1;
firstprivate variable is rowstr and j is a private variable.
Actually not, if you use the firstprivate clause you may have inconsistency in your output as some values would never be updated, clarifying:
Let's suppose it's an array of size 4 and you have 2 threads, one thread you get iterations 0 and 1 and the other 2 and 3 (in a perfect world). If you use the firstprivate clause the second thread will sum the position 2 of the array with was initially in the array in position 1, instead of summing it with the previous iteration as the sequential version would do.
Not just that, this particular loop have dependency issues and you should use something like a sum reduction in nza.
In this link, the author gives an example as
subroutine threshold(a, thresh, ic)
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt
integer :: n
ic = 0
tt = 0.d0
n = size(a)
do j = 1, n
tt = tt + a(j) * a(j)
if (sqrt(tt) >= thresh) then
ic = j
return
end if
end do
end subroutine threshold
and the author commented this code as
An alternative approach, which would allow for many optimizations
(loop unrolling, CPU pipelining, less time spent evaluating the
conditional) would involve adding tt in blocks (e.g., blocks of size
128) and checking the conditional after each block. When it the
condition is met, the last block can be repeated to determine the
value of ic.
What does it mean? loop unrolling? CPU pipelining? adding tt in blocks?
How to optimize the code as the author say?
If the loop is performed in chunks/blocks that fit into the CPU cache you will reduce the number of cache misses, and consequently the number of cache lines retrieved from memory. This increases the performance on all loops that are limited by memory operations.
If the corresponding block size is BLOCKSIZE, this is achieved by
do j = 1, n, BLOCKSIZE
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
end do
This, however, will leave a remainder that is not treated in the main loop. To illustrate this, consider an array of length 1000. The first seven chunks (1--896) are covered in the loop, but the eighth one (897--1024) is not. Therefore, another loop for the remainder is required:
do j=(n/BLOCKSIZE)*BLOCKSIZE,n
! ...
enddo
While it makes little sense to remove the conditional from the remainder loop, it can be performed in the outer loop of the blocked main loop.
As now no branches occur in the inner loop, aggressive optimizations might be applicable then.
However, this limits the "accuracy" of the determined position to the blocks. To get to an element-wise accuracy, you have to repeat the calculation.
Here is the complete code:
subroutine threshold_block(a, thresh, ic)
implicit none
real, dimension(:), intent(in) :: a
real, intent(in) :: thresh
integer, intent(out) :: ic
real :: tt, tt_bak, thresh_sqr
integer :: n, j, jj
integer,parameter :: BLOCKSIZE = 128
ic = 0
tt = 0.d0
thresh_sqr = thresh**2
n = size(a)
! Perform the loop in chunks of BLOCKSIZE
do j = 1, n, BLOCKSIZE
tt_bak = tt
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
end do
! Perform the check on the block level
if (tt >= thresh_sqr) then
! If the threshold is reached, repeat the last block
! to determine the last position
tt = tt_bak
do jj = j, j+BLOCKSIZE-1
tt = tt + a(jj) * a(jj)
if (tt >= thresh_sqr) then
ic = jj
return
end if
end do
end if
end do
! Remainder is treated element-wise
do j=(n/BLOCKSIZE)*BLOCKSIZE,n
tt = tt + a(j) * a(j)
if (tt >= thresh_sqr) then
ic = j
return
end if
end do
end subroutine threshold_block
Please note that the compilers are nowadays very good in creating blocked loops in combination with other optimizations. In my experience it is quite difficult to get a better performance out of such simple loops by manually tweaking it.
Loop blocking is enabled in gfortran with the compiler option -floop-block.
Loop unrolling can be done manually, but should be left to the compiler. The idea is to manually perform a loop in blocks and instead of a second loop as shown above, perform the operations by duplicating the code. Here is an example for the inner loop as given above, for a loop unrolling of factor four:
do jj = j, j+BLOCKSIZE-1,4
tt = tt + a(jj) * a(jj)
tt = tt + a(jj+1) * a(jj+1)
tt = tt + a(jj+2) * a(jj+2)
tt = tt + a(jj+3) * a(jj+3)
end do
Here, no remainder can occur if BLOCKSIZE is a multiple of 4. You can probably shave off a few operations in here ;-)
The gfortran compiler option to enable this is -funroll-loops
As far as I know, CPU Pipelining (Instruction Pipelining) cannot be enforced manually in Fortran. This task is up to the compiler.
Pipelining sets up a pipe of instructions. You feed the complete array into that pipe and, after the wind-up phase, you will get a result with each clock cycle. This drastically increases the throughput.
However, branches are difficult (impossible?) to treat in pipes, and the array should be long enough that the time required for setting up the pipe, wind-up, and wind-down phase are compensated.
The problem I'm facing is as outlined below:
module k
integer :: l,m
end module k
program p4
use k
integer :: i,j,omp_get_thread_num,cr
i = 2
j = 3
!$omp parallel num_threads(2) shared(l,m) private(i,j,cr)
cr = omp_get_thread_num()
if (cr == 0) goto 1111
call sub1(i)
write(*,*) l
goto 2222
1111 call sub2(j)
write(*,*) m
2222 continue
!$omp end parallel
end program p4
subroutine sub1(a)
use k
integer :: a
l = a**2
write(*,*) 'entered sub1'
end subroutine sub1
subroutine sub2(b)
use k
integer :: b
m = b**2
write(*,*) 'entered sub2'
end subroutine sub2
I've tried to parallelize a serial, (which after parallelization looks as written above). I want essentially the same operation performed twice. So Ideally, I want the output to be
entered sub1
4
enterer sub2
9
but the output is
entered sub2
0
entered sub1
923239424
I'm new to parallel programming, (my actual problem is a more complicated version of the one I've outlined). Can anyone point out the mistakes and suggest improvements. Thanks
OpenMP private variables are NOT given initial values, therefore both the calls to sub1 and sub2 are made with random values of i and j. What you are (probably) looking for is firstprivate instead:
!$omp parallel num_threads(2) shared(l,m) private(cr) firstprivate(i,j)
...
!$omp end parallel
firstprivate initialises each private copy with the value that the corresponding variable in the main thread had on entry into the parallel region.
Btw, implementing IF/THEN/ELSE/ENDIF with IF/GOTO/CONTINUE in Fortran 90 and later is considered by many a bad programming style. You should use OpenMP sections instead:
!$omp parallel sections num_threads(2) shared(l,m) private(cr) firstprivate(i,j)
!$omp section
call sub1(i)
write(*,*) l
!$omp section
call sub2(j)
write(*,*) m
!$omp end parallel sections