Parallelizing an algorithm with many exit points? - windows

I'm faced with parallelizing an algorithm which in its serial implementation examines the six faces of a cube of array locations within a much larger three dimensional array. (That is, select an array element, and then define a cube or cuboid around that element 'n' elements distant in x, y, and z, bounded by the bounds of the array.
Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):
do n1=nlo,nhi
do o1=olo,ohi
if (somecondition(n1,o1) .eq. .TRUE.) then
retval =.TRUE.
RETURN
endif
end do
end do
Or C pseudocode:
for (n1=nlo,n1<=nhi,n++) {
for (o1=olo,o1<=ohi,o++) {
if(somecondition(n1,o1)!=0) {
return (bool)true;
}
}
}
There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.
What I think would be best would be to schedule six or more threads of execution, round-robin if there aren't that many CPU cores, ideally with the loops executing in parallel, with the goal the same as the serial algorithm: somecondition() becomes True, execution among all the threads must immediately stop and a value of True set in a shared location.
What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.
Are there message passing mechanisms in OpenMP?
EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?

Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I'm thinking this from your comment "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." The design of implementing the parallelelization of the array elements is also flexible in terms of the number threads. Unless there is a reason that the array elements have to be checked in some order?
It seems that the portion that you are showing us doesn't take that long to execute so making it take less clock time by making it parallel might not be easy ... there is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.

One possibility is to use OpenMP to parallelize over the 6 loops -- declare logical :: array(6), allow each loop to run to completion, and then retval = any(array). Then you can check this value and return outside the parallelized loop. Add a schedule(dynamic) to the parallel do statement if you do this. Or, have a separate !$omp parallel and then put !$omp do schedule(dynamic) ... !$omp end do nowait around each of the 6 loops.
Or, you can follow the good advice by #M.S.B. and parallelize the outermost loop over the whole array. The problem here is that you cannot have a RETURN inside a parallel loop -- so label the second outermost loop (the largest one within the parallel part), and EXIT that loop -- smth like
retval = .FALSE.
!$omp parallel do default(private) shared(BIGARRAY,retval) schedule(dynamic,1)
do k=1,NN
if(.not. retval) then
outer2: do j=1,NN
do i=1,NN
! --- your loop #1
do n1=nlo,nhi
do o1=olo,ohi
if (somecondition(BIGARRAY(i,j,k),n1,o1)) then
retval =.TRUE.
exit outer2
endif
end do
end do
! --- your loops #2 ... #6 go here
end do
end do outer2
end if
end do
!$omp end parallel do
[edit: the if statement is there presuming that you need to find out if there is at least one element like that in the big array. If you need to figure the condition for every element, you can similarly either add a dummy loop exit or goto, skipping the rest of the processing for that element. Again, use schedule(dynamic) or schedule(guided).]
As a separate point, you might also want to check if it may be a good idea to go through the innermost loop by some larger step (depending on float size), compute a vector of logicals on each iteration and then aggregate the results, eg. smth like if(count(somecondition(x(o1:o1+step,n1,k)))>0); in this case the compiler may be able to vectorize somecondition.

I believe you can do what you want with the task construct introduced in OpenMP 3; Intel Fortran supports tasking in OpenMP. I don't use tasks often so I won't offer you any wonky pseudocode.

You already mentioned the obvious way to stop all threads as soon as any thread finds the ending condition: have each check some shared variable which gives the status of the ending condition, thereby determining whether to break out of the loops. Obviously this is an overhead, so if you decide to take this approach I would suggest a few things:
Use atomics to check the ending condition, this avoids expensive memory flushing as just the variable in question is flushed. Move to OpenMP 3.1, there are some new atomic operations supported.
Check infrequently, maybe like once per outer iteration. You should only be parallelizing large cases to overcome the overhead of multithreading.
This one is optional, but you can try adding compiler hints, e.g. if you expect a certain condition to be false most of the time, the compiler will optimize the code accordingly.
Another (somewhat dirty) approach is to use shared variables for the loop ranges for each thread, maybe use a shared array where index n is for thread n. When one thread finds the ending condition, it changes the loop ranges of all the other threads so that they stop. You'll need the appropriate memory synchronization. Basically the overhead has now moved from checking a dummy variable to synchronizing/checking loop conditions. Again probably not so good to do this frequently, so maybe use shared outer loop variables and private inner loop variables.
On another note, this reminds me of the classic polling versus interrupt problem. Unfortunately I don't think OpenMP supports interrupts where you can send some kind of kill signal to each thread.
There are hacking work-arounds like using a child process for just this parallel work and invoking the operating system scheduler to emulate interrupts, however this is rather tricky to get correct and would make your code extremely unportable.
Update in response to comment:
Try something like this:
char shared_var = 0;
#pragma omp parallel
{
//you should have some method for setting loop ranges for each thread
for (n1=nlo; n1<=nhi; n1++) {
for (o1=olo; o1<=ohi; o1++) {
if (somecondition(n1,o1)!=0) {
#pragma omp atomic write
shared_var = 1; //done marker, this will also trigger the other break below
break; //could instead use goto to break out of both loops in 1 go
}
}
#pragma omp atomic read
private_var = shared_var;
if (private_var!=0) break;
}
}

A suitable parallel approach might be, to let each worker examine a part of the overall problem, exactly as in the serial case and use a local (non-shared) variable for the result (retval). Finally do a reduction over all workers on these local variables into a shared overall result.

Related

OpenACC Scheduling

Say that I have a construct like this:
for(int i=0;i<5000;i++){
const int upper_bound = f(i);
#pragma acc parallel loop
for(int j=0;j<upper_bound;j++){
//Do work...
}
}
Where f is a monotonically-decreasing function of i.
Since num_gangs, num_workers, and vector_length are not set, OpenACC chooses what it thinks is an appropriate scheduling.
But does it choose such a scheduling afresh each time it encounters the pragma, or only once the first time the pragma is encountered?
Looking at the output of PGI_ACC_TIME suggests that scheduling is only performed once.
The PGI compiler will choose how to decompose the work at compile-time, but will generally determine the number of gangs at runtime. Gangs are inherently scalable parallelism, so the decision on how many can be deferred until runtime. The vector length and number of workers affects how the underlying kernel gets generated, so they're generally selected at compile-time to maximize optimization opportunities. With loops like these, where the bounds aren't really known at compile-time, the compiler has to generate some extra code in the kernel to ensure exactly the correct number of iterations are performed.
According to OpenAcc 2.6 specification[1] Line 1357 and 1358:
A loop associated with a loop construct that does not have a seq clause must be written such that the loop iteration count is computable when entering the loop construct.
Which seems to be the case, so your code is valid.
However, note it is implementation defined how to distribute the work among the gangs and workers, and it may be that the PGI compiler is simply doing some simple partitioning of the iterations.
You could manually define values of gang/workers using num_gangs and num_workers, and the integer expression passed to those clauses can depend on the value of your function (See 2.5.7 and 2.5.8 on OpenACC specification).
[1] https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.6.final.pdf

Newbie OpenACC issue with CYCLE instruction in Fortran

quite newbie with OpenACC here, so please be patient :-)
I'm trying to port some Fortran code to use OpenACC, and I'm finding a strange (at least to me) behaviour.
The code is given below, but as you can see is just some nested loops which ultimately update the variable zc, which I copyout. I have tried to make private copies where I think they are needed and for the moment specified that all loops are independent. Now, when I compile with and without OpenACC all is fine if I remove the line "if(mu2-mup2.ne.q2) cycle", but if that line is present, then the results for the zc calculated with OpenACC are very different to those calculated without OpenACC.
Any ideas why that line could be giving me trouble?
Many thanks in advance,
Ángel de Vicente
!$acc data copyout(zc)
!$acc update device(fact)
!$acc kernels
!$acc loop independent private(k2)
do k=kmin,kmax
k2=2*k
!$acc loop independent private(km,kp2,z0)
do kp=kmin,kmax
km = MIN(k,kp)
kp2=2*kp
z0=3.d0*dble(ju2+1)*dsqrt(dble(k2+1))*dsqrt(dble(kp2+1))
!$acc loop independent private(q2)
do q=-km,km
q2=2*q
!$acc loop independent
do mu2=-ju2,ju2,2
!$acc loop independent private(p2,z1)
do ml2=-jl2,jl2,2
p2=mu2-ml2
if(iabs(p2).gt.2) cycle
z1=w3js(ju2,jl2,2,mu2,-ml2,-p2)
!$acc loop independent private(pp2,z2)
do mup2=-ju2,ju2,2
if(mu2-mup2.ne.q2) cycle
pp2=mup2-ml2
if(iabs(pp2).gt.2) cycle
z2=w3js(ju2,jl2,2,mup2,-ml2,-pp2)
!$acc loop independent
do mlp2=-jl2,jl2,2
zc(ml2,mlp2,mu2,mup2,k,kp,q) = z2
enddo
enddo
enddo
enddo
end do
end do
end do
!$acc end kernels
!$acc end data
Without a reproducing example it's difficult to give a complete answer, but I'll do my best.
First, there are only three parallel dimensions in OpenACC: gang, worker, and vector. Hence, the compiler will need to ignore 4 of the 7 loop directives. Most likely the middle 4 (if using PGI, you can see which loops the compiler is parallelizing from the compiler feedback messages, i.e. -Minfo=accel). Not that you can't parallelize all the loops, but you'd need to make them tightly nested and then use the collapse clause to collapse them into a single parallel loop.
Also since scalars are private by default, there's no need to put them into a private clause (except for a few cases). While putting them in a private clause shouldn't impact correctness, it can cause performance slow downs since you'd be fetching the private copy from global memory rather than having the potential of the scalar being put into a register.
My guess is that the inner loops are not that large so may not be beneficial to parallelize. Hence, I would first try removing all the inner "loop" directives, and only parallelize the "k" and "kp" loops. Depending of the values of "kmin" and "kmax", I'd try collapsing them as well. Something like:
!$acc loop independent collapse(2)
do k=kmin,kmax
do kp=kmin,kmax
k2=2*k
km = MIN(k,kp)
Assuming that gets you the correct answers but not as much parallelism as you want, you can then try collapsing the middle two loops:
!$acc loop independent collapse(2)
do q=-km,km
do mu2=-ju2,ju2,2
q2=2*q
do ml2=-jl2,jl2,2
I wouldn't recommend parallelizing loops with cycles in them. Not that you can't, but doing so would hurt performance due to thread divergence.
If none of this helps, please post a full reproducing example.

foreach loop is not working(parallelization)

If I want to speed up the following code, How can I do that?
pcg <- foreach(boot.iter=1:boot.rep) %dopar% {
d.boot<-d[in.sample[[boot.iter]],]
*here in.sample[[boot.iter]] randomly generates 1000 row numbers.
I planned to split the overall tasks and send the seperated trials to each core. for example,
sub_task<-foreach(i=1:cores.use)%dopar%{
for (j in 1:trialsPerCore){
d.boot<-d[in.sample[[structure[i,j]]],]}}
*structure is a matrix which contains from 1 to boot.rep
But this one would not work, seems like we cannot use "for" loop inside the foreach? Also, the d.boot only keeps the last iteration of each core.
I tried to search online, I found the following code works,
sub_task<foreach(i=1:cores.use)%:%
foreach(j=1:trialsPerCore)%dopar%{
d.boot<-d[in.sample[[structure[i,j]]],]}
But I think it is similar to my original function, and I do not think there is a great enhancement.
Do you guys have any suggestions?
Unless I'm missing something, it doesn't look like you're doing much if any computation in your foreach loop. You appear to be simply creating a list of matrices from d. That wouldn't benefit from parallel computing unless you can perform an operation on those matrices in your loop, and ideally return a relatively small result from that operation.
Although "chunking" often helps to execute parallel loops more efficiently, I don't think it's going to help here. The communication may be a little more efficient, but you're still just doing a lot of communication and essentially no computation.
Note that your attempt at chunking doesn't work because the for loop in the foreach loop is repeatedly assigning a matrix to the same variable. Then, the for loop itself returns a NULL as the body of the foreach loop, so that sub_task is a list of NULL's. An lapply would work much better in this context.
It will help a little to compute the values in the in.sample list in the foreach loop. That will decrease the amount of data that is auto-exported to each of the workers at the cost of a bit more computation on the workers, which is generally what you want to do in parallel loops. At the very least, you could iterate over in.sample directly:
pcg <- foreach(i=in.sample) %dopar% d[i,]
In this form, it's all the more obvious that there isn't enough computation to warrant parallel computing. If there isn't any real computation to perform, you're better off using lapply:
pcg <- lapply(in.sample, function(i) d[i,])

Or-equals on constant as reduction operation (ex. value |= 1 ) thread-safe?

Let's say that I have a variable x.
x = 0
I then spawn some number of threads, and each of them may or may not run the following expression WITHOUT the use of atomics.
x |= 1
After all threads have joined with my main thread, the main thread branches on the value.
if(x) { ... } else { ... }
Is it possible for there to be a race condition in this situation? My thoughts say no, because it doesn't seem to matter whether or not a thread is interrupted by another thread between reading and writing 'x' (in both cases, either 'x == 1', or 'x == 1'). That said, I want to make sure I'm not missing something stupid obvious or ridiculously subtle.
Also, if you happen to provide an answer to the contrary, please provide an instruction-by-instruction example!
Context:
I'm trying to, in OpenCL, have my threads indicate the presence or absence of a feature among any of their work-items. If any of the threads indicate the presence of the feature, my host ought to be able to branch on the result. I'm thinking of using the above method. If you guys have a better suggestion, that works too!
Detail:
I'm trying to add early-exit to my OpenCL radix-sort implementation, to skip radix passes if the data is banded (i.e. 'x' above would be x[RADIX] and I'd have all work groups, right after partial reduction of the data, indicate presence or absence of elements in the RADIX bins via 'x').
It may work within a work-group. You will need to insert a barrier before testing x. I'm not sure it will be faster than using atomic increments.
It will not work across several work-groups. Imagine you have 1000 work-groups to run on 20 cores. Typically, only a small number of work-groups can be resident on a single core, for example 4, meaning only 80 work-groups can be in flight inside the GPU at a given time. Once a work-group is done executing, it is retired, and another one is started. Halting a kernel in the middle of execution to wait for all 1000 work-groups to reach the same point is impossible.

Is there a difference in runtime between these two loops, and are there exceptions?

Consider the following two loops where N = 10^9 or something large enough to notice inefficiencies with.
Loop x = 1 to N
total += A(x)
total += B(x)
or
Loop x = 1 to N
total += A(x)
Loop x=1 to N
total += B(x)
Where each function takes x, performs some arbitrary arithmetic calculation (e.g. x^2 and 3x^3 or something, doesn't matter), and returns a value.
Are there going to be any differences in overall runtime, and when would this not be the case, if at all?
Each loop requires four actions:
Preparation (once per loop)
Checking of the stopping condition (once per iteration)
Executing the body of the loop (once per iteration)
Adjusting the values used to determine if the iteration should continue (once per iteration)
when you have one loop, you "pay" for items 1, 2 and 4 only once; when you have two loops, you "pay" for everything exactly twice.
Assuming that the order of invoking the two functions is not important, the difference will not be noticeable in most common situations. However, in very uncommon situations of extremely tight loops a single loop will take less CPU resources. In fact, a common technique of loop unwinding relies on reducing the share of per-iteration checks and setup operations in the overall CPU load during the loop by repeating the body several times, and reducing the number of iterations by the corresponding factor.
There are a few things to think about. One is you're doing twice as many instructions for the loop itself (condition check, incrementing x, etc) in the second version. If your functions are really trivial, that could be a major cost.
However, in more realistic situations, cache performance, register sharing, and things like that are going to make a bigger difference. For instance, if both functions need to use a lot of registers, you might find that the second version performs worse than the first because the compiler needs to spill more registers to memory since it's doing it once per loop. Or if A and B both access the same memory, the second version might be faster than the second because all of B's accesses will be cache hits in the second version but misses in the first version.
All of this is highly program- and platform-specific. If there's some particular program you want to optimize, you need to benchmark it.
The primary difference is that the first one test X against N, N times, while the second one tests X against N, 2N times.
There is a slight overhead on the loop itself.
In each Iteration you need to do at least 2 operations, increase the counter, and then compare it to the end value.
So you're doing 2*10^9 more operations.
If both functions used lot's of memory, for example they created some big array, and recursively modified it in each iteration, it could be possible that first loop is slower due to the memory cache or such.
There are a lot of potential factors to be considered;
1) number of iterations -- does loop setup dominate over the task
2) loop comparison penalty vs. the task complexity
for (i=0;i<2;i++) a[i]=b[i];
3) general complexity of function
- with two complex functions one might run out of registers
4) register dependency or are the task serial in nature
- two independent tasks intermixed vs. result of other loop depends on the first one
5) can the loop be executed completely on a prefetch queue -- no need for cache access
- mixing in the second tasks may ruin the throughput
6) what kind of cache hit patterns there are

Resources