Newbie OpenACC issue with CYCLE instruction in Fortran


I'm quite new to OpenACC, so please be patient :-)
I'm trying to port some Fortran code to OpenACC, and I'm seeing behaviour that is strange (at least to me).
The code is given below, but as you can see it is just some nested loops which ultimately update the variable zc, which I copyout. I have tried to make private copies where I think they are needed, and for the moment I have specified that all loops are independent. When I compile with and without OpenACC, the results agree if I remove the line "if(mu2-mup2.ne.q2) cycle". If that line is present, the zc values calculated with OpenACC are very different from those calculated without OpenACC.
Any ideas why that line could be giving me trouble?
Many thanks in advance,
Ángel de Vicente
!$acc data copyout(zc)
!$acc update device(fact)
!$acc kernels
!$acc loop independent private(k2)
do k=kmin,kmax
   k2=2*k
   !$acc loop independent private(km,kp2,z0)
   do kp=kmin,kmax
      km = MIN(k,kp)
      kp2=2*kp
      z0=3.d0*dble(ju2+1)*dsqrt(dble(k2+1))*dsqrt(dble(kp2+1))
      !$acc loop independent private(q2)
      do q=-km,km
         q2=2*q
         !$acc loop independent
         do mu2=-ju2,ju2,2
            !$acc loop independent private(p2,z1)
            do ml2=-jl2,jl2,2
               p2=mu2-ml2
               if(iabs(p2).gt.2) cycle
               z1=w3js(ju2,jl2,2,mu2,-ml2,-p2)
               !$acc loop independent private(pp2,z2)
               do mup2=-ju2,ju2,2
                  if(mu2-mup2.ne.q2) cycle
                  pp2=mup2-ml2
                  if(iabs(pp2).gt.2) cycle
                  z2=w3js(ju2,jl2,2,mup2,-ml2,-pp2)
                  !$acc loop independent
                  do mlp2=-jl2,jl2,2
                     zc(ml2,mlp2,mu2,mup2,k,kp,q) = z2
                  enddo
               enddo
            enddo
         enddo
      end do
   end do
end do
!$acc end kernels
!$acc end data

Without a reproducing example it's difficult to give a complete answer, but I'll do my best.
First, there are only three parallel dimensions in OpenACC: gang, worker, and vector. Hence, the compiler will need to ignore 4 of the 7 loop directives. Most likely the middle 4 (if using PGI, you can see which loops the compiler is parallelizing from the compiler feedback messages, i.e. -Minfo=accel). Not that you can't parallelize all the loops, but you'd need to make them tightly nested and then use the collapse clause to collapse them into a single parallel loop.
Also, since scalars are private by default, there's no need to put them into a private clause (except for a few cases). While putting them in a private clause shouldn't affect correctness, it can cause slowdowns, since you'd be fetching the private copy from global memory rather than giving the compiler the chance to keep the scalar in a register.
My guess is that the inner loops are not that large, so they may not be worth parallelizing. Hence, I would first try removing all the inner "loop" directives and only parallelize the "k" and "kp" loops. Depending on the values of "kmin" and "kmax", I'd try collapsing them as well. Something like:
!$acc loop independent collapse(2)
do k=kmin,kmax
do kp=kmin,kmax
k2=2*k
km = MIN(k,kp)
Assuming that gets you the correct answers but not as much parallelism as you want, you can then try collapsing the middle two loops:
!$acc loop independent collapse(2)
do q=-km,km
do mu2=-ju2,ju2,2
q2=2*q
do ml2=-jl2,jl2,2
I wouldn't recommend parallelizing loops with cycles in them. Not that you can't, but doing so would hurt performance due to thread divergence.
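A minimal sketch of the restructuring suggested above, not the original code. The array shape, the loop bounds, and the placeholder value standing in for w3js are all assumptions made for illustration. Only the collapsed "k" and "kp" loops carry a directive. The sketch also replaces the "mup2" loop and its CYCLE by the single candidate mup2 = mu2 - q2, since for a given mu2 and q2 at most one mup2 passes the test; that is a swapped-in alternative, not something taken from the question. It uses copy instead of copyout so that elements never written inside the loops keep their host values.

```fortran
program cycle_sketch
  implicit none
  ! Assumed small bounds, chosen only to keep the sketch quick to run.
  integer, parameter :: kmin = 0, kmax = 3, ju2 = 3
  integer :: k, kp, q, km, q2, mu2, mup2
  double precision, dimension(-ju2:ju2, -ju2:ju2, kmin:kmax, kmin:kmax, &
                              -kmax:kmax) :: ref, zc

  ! Serial reference that keeps the original CYCLE test.
  ref = 0.d0
  do k = kmin, kmax
    do kp = kmin, kmax
      km = min(k, kp)
      do q = -km, km
        q2 = 2*q
        do mu2 = -ju2, ju2, 2
          do mup2 = -ju2, ju2, 2
            if (mu2 - mup2 /= q2) cycle
            ! Placeholder value; the real code calls w3js here.
            ref(mu2, mup2, k, kp, q) = dble(mu2 + 10*mup2 + 100*q)
          end do
        end do
      end do
    end do
  end do

  ! Restructured version: collapse the two outer loops, leave the
  ! inner loops without directives, and compute mup2 directly.
  zc = 0.d0
  !$acc data copy(zc)
  !$acc kernels
  !$acc loop independent collapse(2)
  do k = kmin, kmax
    do kp = kmin, kmax
      km = min(k, kp)
      do q = -km, km
        q2 = 2*q
        do mu2 = -ju2, ju2, 2
          mup2 = mu2 - q2            ! the only value that passes the test
          if (abs(mup2) <= ju2) then ! still inside the mup2 range?
            zc(mu2, mup2, k, kp, q) = dble(mu2 + 10*mup2 + 100*q)
          end if
        end do
      end do
    end do
  end do
  !$acc end kernels
  !$acc end data

  if (any(zc /= ref)) error stop 'results differ'
  print *, 'results match'
end program cycle_sketch
```

Built without OpenACC support the directives are plain comments, so the same source doubles as the serial check.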
If none of this helps, please post a full reproducing example.

Related

Can reading a variable be a data race in OpenMP?

Why does this OpenMP fortran program work (every element of out is equal to num)? Each thread in the parallel loop might read the variable num simultaneously. I thought this was not acceptable?
program example
implicit none
integer i
integer, parameter :: n = 100000
double precision :: num
double precision, dimension(n) :: out
num = 1.123456789123456789123456d-5
out = 0.d0
!$OMP PARALLEL
!$OMP DO
do i=1,n
out(i) = num
enddo
!$OMP END DO
!$OMP END PARALLEL
do i=1,n
if (out(i).ne.num) print*,'Problem with ',i
enddo
end program
Thanks so much for any insights.

Can reading a variable be a data race in OpenMP?
Any race is between two things happening, so a read can be part of a race. However for the competition between two actions to be a race, there has to be a different outcome depending on the order in which the two actions occur.
Given that the possible actions in a parallel program which we are considering are read and write occurring in different threads, we have four possible cases:
Read, Read: no values are changed, and no code can detect which order the two reads occurred in (at least, not without looking at meta-data such as code performance in a system with caches :-)).
Read, Write: this clearly can be a race; whether the write wins the race or not affects the value which will be read.
Write, Read: as with case 2 (Read,Write), the result seen by the read is affected by the order.
Write, Write: here we have a race too, since we assume that someone will ultimately read the value, and which value they see will depend on the order of the writes.
So, reading a variable can be part of a race.
However, if your question is really "Is there a race if a variable is only read?", then the answer is "No".
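To make the distinction concrete, here is a small sketch (the variable total and the value 1.5 are illustrative assumptions, not part of the question). The shared variable num is only read, so no synchronization is needed, while total is written by every thread and therefore needs a reduction clause.

```fortran
program read_only_shared
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  double precision :: num, total
  double precision :: out(n)

  num = 1.5d0      ! exactly representable, so the sum below is exact
  total = 0.d0

  !$omp parallel do reduction(+:total)
  do i = 1, n
    out(i) = num          ! num is shared but only read: no race
    total = total + num   ! total is written by all threads: needs reduction
  end do
  !$omp end parallel do

  if (any(out /= num)) error stop 'out differs from num'
  if (total /= n*num) error stop 'total is wrong'
  print *, 'all checks passed'
end program read_only_shared
```

Removing the reduction clause would turn the update of total into the Write, Write case from the list above.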

Variables are shared by default in OpenMP, so they are accessible from all the threads. Furthermore, you're not writing to num, so even if all the threads were accessing the same memory (which here they probably aren't) there would be no issue.

Parallel programming dependency openacc

I am trying to parallelize these loops, but I get an error from the PGI compiler and I don't understand what's wrong.
#pragma acc kernels
{
#pragma acc loop independent
for (i = 0;i < k; i++)
{
for(;dt*j <= Ms[i+1].t;j++)
{
w = (j*dt - Ms[i].t)/(Ms[i+1].t-Ms[i].t);
X[j] = Ms[i].x*(1-w)+Ms[i+1].x*w;
Y[j] = Ms[i].y*(1-w)+Ms[i+1].y*w;
}
}
}
Error
85, Generating Multicore code
87, #pragma acc loop gang
89, Accelerator restriction: size of the GPU copy of Y,X is unknown
Complex loop carried dependence of Ms->t,Ms->x,X->,Ms->y,Y-> prevents parallelization
Loop carried reuse of Y->,X-> prevents parallelization
So what can I do to solve this dependence problem?

I see a few issues here. Also given the output, I'm assuming that you're compiling with "-ta=multicore,tesla" (i.e. targeting both a multicore CPU and a GPU)
First, since "j" is not initialized in the "i" loop, the starting value of "j" depends on the ending value of "j" from the previous iteration of "i". Hence, the loops are not parallelizable as written. By using "loop independent", you have forced parallelization of the outer loop, but you will get answers that differ from running the code sequentially. You will need to rethink your algorithm.
I would suggest making X and Y two-dimensional, with the first dimension of size "k". The second dimension can be a jagged array (i.e. each row having a different size), with the size corresponding to the "Ms[i+1].t" value.
I wrote an example of using jagged arrays as part of my Chapter (#5) of the Parallel Programming with OpenACC book. See: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/jagged_array.c
Alternatively, you might be able to set "j=Ms[i].t" assuming "Ms[0].t" is set.
for(j=Ms[i].t;dt*j <= Ms[i+1].t;j++)
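A sketch of the idea of giving each "i" iteration its own starting index, written in Fortran for consistency with the other examples on this page. The arrays t and mx stand in for Ms[i].t and Ms[i].x, and the data values are made up. It assumes the first index is 0 and derives the bounds from t(i)/dt; the values are chosen to be exactly representable, and real data would need care at the floating-point boundaries.

```fortran
program independent_segments
  implicit none
  integer, parameter :: k = 3                  ! number of segments
  double precision, parameter :: dt = 0.25d0
  double precision :: t(0:k), mx(0:k)          ! stand-ins for Ms[i].t, Ms[i].x
  double precision, allocatable :: x_ser(:), x_par(:)
  double precision :: w
  integer :: i, j, jlo, jhi, nj

  t  = [0.d0, 1.d0, 2.5d0, 4.d0]
  mx = [0.d0, 2.d0, -1.d0, 3.d0]
  nj = floor(t(k)/dt)
  allocate(x_ser(0:nj), x_par(0:nj))

  ! Sequential version: j carries over from one segment to the next.
  j = 0
  do i = 0, k-1
    do while (dt*j <= t(i+1))
      w = (j*dt - t(i))/(t(i+1) - t(i))
      x_ser(j) = mx(i)*(1.d0 - w) + mx(i+1)*w
      j = j + 1
    end do
  end do

  ! Independent version: each segment computes its own index range,
  ! so no iteration of i depends on the previous one.
  x_par = 0.d0
  !$acc parallel loop copyin(t, mx) copy(x_par) private(jlo, jhi, w)
  do i = 0, k-1
    jlo = 0
    if (i > 0) jlo = floor(t(i)/dt) + 1
    jhi = floor(t(i+1)/dt)
    !$acc loop seq
    do j = jlo, jhi
      w = (j*dt - t(i))/(t(i+1) - t(i))
      x_par(j) = mx(i)*(1.d0 - w) + mx(i+1)*w
    end do
  end do

  if (maxval(abs(x_ser - x_par)) > 1.d-12) error stop 'versions differ'
  print *, 'versions agree'
end program independent_segments
```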
"Accelerator restriction: size of the GPU copy of Y,X is unknown"
This is telling you that the compiler cannot implicitly copy the X and Y arrays to the device. In C/C++, unbounded pointers don't have sizes, so the compiler can't tell how big these arrays are. Often it can derive this information from the loop trip counts, but since the loop trip count is unknown (see above), it can't in this case. To fix this, you need to include a data clause on the "kernels" directive or add a data region to your code. For example:
#pragma acc kernels copyout(X[0:size], Y[0:size])
or
#pragma acc data copyout(X[0:size], Y[0:size])
{
...
#pragma acc kernels
...
}
Another thing to keep in mind is pointer aliasing. In C/C++, pointers of the same type are allowed to point at the same object. Hence, without additional information such as the "restrict" attribute, the "independent" clause, or the PGI compiler flag "-Msafeptr", the compiler must assume your pointers do point to the same object making the loop not parallelizable.

This would most likely go away by either adding loop independent to the inner loop as well or using the collapse clause to flatten the loops, applying independent to both. It might also go away if all of your arrays are passed in using restrict, but maybe not.

Parallelizing fortran 2008 `do concurrent` systematically, possibly with openmp

The Fortran 2008 do concurrent construct is a do loop that tells the compiler that no iteration affects any other. It can thus be parallelized safely.
A valid example:
program main
implicit none
integer :: i
integer, dimension(10) :: array
do concurrent( i= 1: 10)
array(i) = i
end do
end program main
where iterations can be done in any order. You can read more about it here.
To my knowledge, gfortran does not automatically parallelize these do concurrent loops, although I remember a gfortran-diffusion-list mail about doing it (here). It just transforms them into classical do loops.
My question: Do you know a way to systematically parallelize do concurrent loops? For instance with a systematic openmp syntax?

It is not that easy to do it automatically. The DO CONCURRENT construct has a forall-header which means that it could accept multiple loops, index variables definition and a mask. Basically, you need to replace:
DO CONCURRENT([<type-spec> :: ]<forall-triplet-spec 1>, <forall-triplet-spec 2>, ...[, <scalar-mask-expression>])
<block>
END DO
with:
[BLOCK
<type-spec> :: <indexes>]
!$omp parallel do
DO <forall-triplet-spec 1>
DO <forall-triplet-spec 2>
...
[IF (<scalar-mask-expression>) THEN]
<block>
[END IF]
...
END DO
END DO
!$omp end parallel do
[END BLOCK]
(things in square brackets are optional, based on the presence of the corresponding parts in the forall-header)
Note that this would not be as effective as parallelising one big loop with <iters 1>*<iters 2>*... independent iterations which is what DO CONCURRENT is expected to do. Note also that forall-header permits a type-spec that allows one to define loop indexes inside the header and you will need to surround the whole thing in BLOCK ... END BLOCK construct to preserve the semantics. You would also need to check if scalar-mask-expr exists at the end of the forall-header and if it does you should also put that IF ... END IF inside the innermost loop.
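A small worked instance of the transformation described above, with made-up bounds and loop body. It adds the OpenMP collapse(2) clause (available since OpenMP 3.0) so that the two loops are scheduled as one iteration space, which is one way to get closer to the "one big loop" behaviour; that clause is an addition here, not part of the template.

```fortran
program dc_to_omp
  implicit none
  integer, parameter :: n = 8, m = 6
  integer :: a(n, m), b(n, m)
  integer :: i, j

  a = 0
  b = 0

  ! Original form: two index triplets plus a scalar mask.
  do concurrent (i = 1:n, j = 1:m, mod(i + j, 2) == 0)
    a(i, j) = i*j
  end do

  ! Transformed form: nested DO loops, the mask as an IF inside the
  ! innermost loop, and collapse(2) to merge the iteration spaces.
  !$omp parallel do collapse(2)
  do j = 1, m
    do i = 1, n
      if (mod(i + j, 2) == 0) then
        b(i, j) = i*j
      end if
    end do
  end do
  !$omp end parallel do

  if (any(a /= b)) error stop 'transformed loop differs'
  print *, 'both forms give the same array'
end program dc_to_omp
```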
If you only have array assignments inside the body of the DO CONCURRENT, you could also transform it into FORALL and use the workshare OpenMP directive. It would be much easier than the above.
DO CONCURRENT <forall-header>
<block>
END DO
would become:
!$omp parallel workshare
FORALL <forall-header>
<block>
END FORALL
!$omp end parallel workshare
Given all the above, the only systematic way that I can think about is to systematically go through your source code, searching for DO CONCURRENT and systematically replacing it with one of the above transformed constructs based on the content of the forall-header and the loop body.
Edit: Usage of OpenMP workshare directive is currently discouraged. It turns out that at least Intel Fortran Compiler and GCC serialise FORALL statements and constructs inside OpenMP workshare directives by surrounding them with OpenMP single directive during compilation which brings no speedup whatsoever. Other compilers might implement it differently but it's better to avoid its usage if portable performance is to be achieved.

I'm not sure what you mean "a way to systematically parallelize do concurrent loops". However, to simply parallelise an ordinary do loop with OpenMP you could just use something like:
!$omp parallel private (i)
!$omp do
do i = 1,10
array(i) = i
end do
!$omp end do
!$omp end parallel
Is this what you are after?

Parallelizing an algorithm with many exit points?

I'm faced with parallelizing an algorithm which in its serial implementation examines the six faces of a cube of array locations within a much larger three-dimensional array. (That is, select an array element, and then define a cube or cuboid around that element 'n' elements distant in x, y, and z, bounded by the bounds of the array.)
Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):
do n1=nlo,nhi
   do o1=olo,ohi
      if (somecondition(n1,o1)) then
         retval = .TRUE.
         RETURN
      endif
   end do
end do
Or C pseudocode:
for (n1=nlo; n1<=nhi; n1++) {
   for (o1=olo; o1<=ohi; o1++) {
      if (somecondition(n1,o1)!=0) {
         return (bool)true;
      }
   }
}
There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.
What I think would be best would be to schedule six or more threads of execution, round-robin if there aren't that many CPU cores, ideally with the loops executing in parallel, with the same goal as the serial algorithm: as soon as somecondition() becomes True, execution in all the threads must stop immediately and a value of True must be set in a shared location.
What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.
Are there message passing mechanisms in OpenMP?
EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?

Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I'm basing this on your comment "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." Parallelizing over the array elements is also flexible in terms of the number of threads. Unless there is a reason that the array elements have to be checked in some order?
It seems that the portion that you are showing us doesn't take that long to execute so making it take less clock time by making it parallel might not be easy ... there is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.

One possibility is to use OpenMP to parallelize over the 6 loops -- declare logical :: array(6), allow each loop to run to completion, and then retval = any(array). Then you can check this value and return outside the parallelized loop. Add a schedule(dynamic) to the parallel do statement if you do this. Or, have a separate !$omp parallel and then put !$omp do schedule(dynamic) ... !$omp end do nowait around each of the 6 loops.
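A sketch of the first option, under stated assumptions: scan_face and somecondition are hypothetical stand-ins for the six work units and the real test, and the loop bounds are invented. Each face runs to completion (with its own early RETURN inside the called function, which is allowed), and the results are combined with any() after the parallel loop.

```fortran
program six_faces
  implicit none
  logical :: found(6), retval
  integer :: face

  ! One iteration per face; each writes only its own element of found.
  !$omp parallel do schedule(dynamic)
  do face = 1, 6
    found(face) = scan_face(face)
  end do
  !$omp end parallel do

  retval = any(found)

  if (.not. retval) error stop 'expected a hit'
  if (count(found) /= 1 .or. .not. found(4)) error stop 'wrong face'
  print *, 'retval = ', retval

contains

  ! Hypothetical work unit: scans one face and returns at the first hit.
  logical function scan_face(face) result(hit)
    integer, intent(in) :: face
    integer :: n1, o1
    hit = .false.
    do n1 = 1, 10
      do o1 = 1, 10
        if (somecondition(face, n1, o1)) then
          hit = .true.
          return
        end if
      end do
    end do
  end function scan_face

  ! Hypothetical test: true at exactly one location on face 4.
  pure logical function somecondition(face, n1, o1)
    integer, intent(in) :: face, n1, o1
    somecondition = (face == 4 .and. n1 == 3 .and. o1 == 7)
  end function somecondition

end program six_faces
```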
Or, you can follow the good advice by @M.S.B. and parallelize the outermost loop over the whole array. The problem here is that you cannot have a RETURN inside a parallel loop -- so label the second outermost loop (the largest one within the parallel part), and EXIT that loop -- something like
retval = .FALSE.
!$omp parallel do default(private) shared(BIGARRAY,retval) schedule(dynamic,1)
do k=1,NN
if(.not. retval) then
outer2: do j=1,NN
do i=1,NN
! --- your loop #1
do n1=nlo,nhi
do o1=olo,ohi
if (somecondition(BIGARRAY(i,j,k),n1,o1)) then
retval =.TRUE.
exit outer2
endif
end do
end do
! --- your loops #2 ... #6 go here
end do
end do outer2
end if
end do
!$omp end parallel do
[edit: the if statement is there presuming that you need to find out if there is at least one element like that in the big array. If you need to figure the condition for every element, you can similarly either add a dummy loop exit or goto, skipping the rest of the processing for that element. Again, use schedule(dynamic) or schedule(guided).]
As a separate point, you might also want to check whether it is a good idea to go through the innermost loop in larger steps (depending on float size), compute a vector of logicals on each iteration and then aggregate the results, e.g. something like if(count(somecondition(x(o1:o1+step,n1,k)))>0); in this case the compiler may be able to vectorize somecondition.

I believe you can do what you want with the task construct introduced in OpenMP 3; Intel Fortran supports tasking in OpenMP. I don't use tasks often so I won't offer you any wonky pseudocode.

You already mentioned the obvious way to stop all threads as soon as any thread finds the ending condition: have each check some shared variable which gives the status of the ending condition, thereby determining whether to break out of the loops. Obviously this is an overhead, so if you decide to take this approach I would suggest a few things:
Use atomics to check the ending condition, this avoids expensive memory flushing as just the variable in question is flushed. Move to OpenMP 3.1, there are some new atomic operations supported.
Check infrequently, maybe like once per outer iteration. You should only be parallelizing large cases to overcome the overhead of multithreading.
This one is optional, but you can try adding compiler hints, e.g. if you expect a certain condition to be false most of the time, the compiler will optimize the code accordingly.
Another (somewhat dirty) approach is to use shared variables for the loop ranges for each thread, maybe use a shared array where index n is for thread n. When one thread finds the ending condition, it changes the loop ranges of all the other threads so that they stop. You'll need the appropriate memory synchronization. Basically the overhead has now moved from checking a dummy variable to synchronizing/checking loop conditions. Again probably not so good to do this frequently, so maybe use shared outer loop variables and private inner loop variables.
On another note, this reminds me of the classic polling versus interrupt problem. Unfortunately I don't think OpenMP supports interrupts where you can send some kind of kill signal to each thread.
There are hacking work-arounds like using a child process for just this parallel work and invoking the operating system scheduler to emulate interrupts, however this is rather tricky to get correct and would make your code extremely unportable.
Update in response to comment:
Try something like this:
char shared_var = 0;
#pragma omp parallel
{
    // thread-local loop indices and a thread-local copy of the done marker
    int n1, o1;
    char private_var = 0;
    // you should have some method for setting loop ranges for each thread
    for (n1=nlo; n1<=nhi; n1++) {
        for (o1=olo; o1<=ohi; o1++) {
            if (somecondition(n1,o1)!=0) {
                #pragma omp atomic write
                shared_var = 1; // done marker, this will also trigger the other break below
                break; // could instead use goto to break out of both loops in one go
            }
        }
        #pragma omp atomic read
        private_var = shared_var;
        if (private_var!=0) break;
    }
}
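Since the serial algorithm is in Fortran, here is a hedged Fortran counterpart of the same polling idea. The bounds and the somecondition function are placeholders. It differs from the C version in one respect: the outer loop is an OpenMP worksharing loop, which may not be exited early, so iterations that start after the flag is set are skipped with CYCLE instead. The flag is read once per outer iteration.

```fortran
program early_stop
  implicit none
  integer, parameter :: nlo = 1, nhi = 300, olo = 1, ohi = 300
  integer :: n1, o1
  integer :: done_flag, seen   ! 0 = keep going, 1 = condition found

  done_flag = 0

  !$omp parallel do private(o1, seen) shared(done_flag) schedule(dynamic)
  do n1 = nlo, nhi
    ! Poll the shared flag once per outer iteration.
    !$omp atomic read
    seen = done_flag
    if (seen /= 0) cycle       ! skip remaining work; EXIT is not allowed here

    do o1 = olo, ohi
      if (somecondition(n1, o1)) then
        !$omp atomic write
        done_flag = 1
        exit                   ! leaving the inner, non-worksharing loop is fine
      end if
    end do
  end do
  !$omp end parallel do

  if (done_flag /= 1) error stop 'condition was not found'
  print *, 'condition found'

contains

  ! Placeholder test: true at a single location.
  pure logical function somecondition(n1, o1)
    integer, intent(in) :: n1, o1
    somecondition = (n1 == 150 .and. o1 == 42)
  end function somecondition

end program early_stop
```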

A suitable parallel approach might be to let each worker examine a part of the overall problem, exactly as in the serial case, and use a local (non-shared) variable for the result (retval). Finally, do a reduction over all workers on these local variables into a shared overall result.
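In OpenMP this can be sketched with a logical reduction; the loop range and the somecondition function below are placeholders. Each thread accumulates into its own private copy of retval, and the copies are combined with .or. at the end of the loop. There is no early exit in this form, so every iteration runs.

```fortran
program any_reduction
  implicit none
  integer, parameter :: n = 1000
  integer :: k
  logical :: retval

  retval = .false.

  ! Each thread gets a private retval initialised to .false.;
  ! the private copies are OR-ed together when the loop finishes.
  !$omp parallel do reduction(.or.:retval)
  do k = 1, n
    retval = retval .or. somecondition(k)
  end do
  !$omp end parallel do

  if (.not. retval) error stop 'expected retval to be true'
  print *, 'retval = ', retval

contains

  ! Placeholder test: true for exactly one element.
  pure logical function somecondition(k)
    integer, intent(in) :: k
    somecondition = (k == 777)
  end function somecondition

end program any_reduction
```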

Non-trivial private data in Fortran90 OpenMP

I have a section of a Fortran90 program that should be parallelized with OpenMP.
!$omp parallel num_threads(8) &
!$omp private(j, s, prop_states) &
!$omp firstprivate(targets, pulses)
! ... modify something in pulses. targets(s)%ham contains pointers to
! elements of pulses ...
do s = 1, n_systems
prop_states(s) = targets(s)%psi_i
call prop(prop_states(s), targets(s)%grid, targets(s)%ham, &
& targets(s)%work, para)
end do
!$omp end parallel
What I'm unsure about is whether complex data structures can be private to each thread (and how this should be done -- is firstprivate correct?). In the example code above, targets is of a somewhat complicated user-defined type, with equally complex sub-fields. For example, targets(s)%ham%op(1)%pulse is a pointer to some element of an array pulses. Also, targets(s)%work contains allocated space to be used as work arrays in Fast-Fourier-Transforms.
Obviously, every thread needs to maintain an independent copy both of targets and of pulses, and maintain the pointers between the two independently. It seems to me that this might be asking a little bit too much from the automatic memory management of OpenMP. Is this correct, or should this work out of the box?
The alternative of course is to create copies of the original data within each thread (stored in an array), and use this private copied data in the call to prop.

From my reading of the OpenMP 2.5 standard you can't use the targets of Fortran pointers in private (or firstprivate or threadprivate) clauses, which seems to rule out your code. Having said that, it's not something I've ever tried in OpenMP so if you bash ahead and get anywhere, do let us know.
And firstprivate is correct if your private variables are to be initialised, on entry into the parallel region, with the value of the variables of the same name at the entry to the parallel region.
I guess you will probably have to implement your plan B.
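One way plan B could look, as a sketch only: the type system_t, the routine run_system, and the data are invented for illustration and are much simpler than the targets and pulses structures in the question. The copying is moved into a subroutine called from the parallel loop, so the copy and the pointer into it are local variables and therefore private to each call.

```fortran
module work_mod
  implicit none

  ! Hypothetical, much simplified stand-in for the targets(s) type.
  type :: system_t
    double precision, pointer :: pulse(:) => null()
  end type system_t

contains

  subroutine run_system(pulses_in, s, res)
    double precision, intent(in)  :: pulses_in(:)
    integer, intent(in)           :: s
    double precision, intent(out) :: res
    ! Locals live on the calling thread's stack, so each call has its own.
    double precision, allocatable, target :: my_pulses(:)
    type(system_t) :: sys

    my_pulses = pulses_in       ! local copy (allocated on assignment)
    sys%pulse => my_pulses      ! re-point into the local copy
    sys%pulse = sys%pulse * s   ! modifies only this call's copy
    res = sum(sys%pulse)
  end subroutine run_system

end module work_mod

program private_copies
  use work_mod
  implicit none
  integer, parameter :: n_systems = 8
  double precision :: pulses(4), results(n_systems)
  integer :: s

  pulses = [1.d0, 2.d0, 3.d0, 4.d0]

  !$omp parallel do
  do s = 1, n_systems
    call run_system(pulses, s, results(s))
  end do
  !$omp end parallel do

  ! The shared original must be untouched, and each result is 10*s.
  if (any(pulses /= [1.d0, 2.d0, 3.d0, 4.d0])) error stop 'pulses changed'
  if (any(results /= [(10.d0*s, s = 1, n_systems)])) error stop 'bad result'
  print *, 'each call worked on its own copy'
end program private_copies
```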
