OpenMP GCC GOMP wasteful barrier - gcc

I have the following program.
nv is around 100, dgemm is 20x100 or so, so there is plenty of work to go around:
#pragma omp parallel for schedule(dynamic,1)
for (int c = 0; c < int(nv); ++c) {
omp::thread thread;
matrix &t3_c = vv_.at(omp::num_threads()+thread);
if (terms.first) {
blas::gemm(1, t2_, vvvo_, 1, t3_c);
blas::gemm(1, vvvo_, t2_, 1, t3_c);
}
matrix &t3_b = vv_[thread];
if (terms.second) {
matrix &t2_ci = vo_[thread];
blas::gemm(-1, t2_ci, Vjk_, 1, t3_c);
blas::gemm(-1, t2_ci, Vkj_, 0, t3_b);
}
}
However, with GCC 4.4 (GOMP v1), gomp_barrier_wait_end accounts for nearly 50% of the runtime. Changing GOMP_SPINCOUNT alleviates the overhead, but then only 60% of the cores are used. The same happens with OMP_WAIT_POLICY=passive. The system is Linux with 8 cores.
How can I get full utilization without the spinning/waiting overhead?

The barrier is a symptom, not the problem. The reason that there's lots of waiting at the end of the loop is that some of the threads are done well before the others, and they all wait at the end of the for loop for quite a while until everyone's done.
This is a classic load imbalance problem, which is weird here, since it's just a bunch of matrix multiplies. Are they of varying sizes? How are they laid out in memory, in terms of NUMA stuff - are they all currently sitting in one core's cache, or are there other sharing issues? Or, more simply -- are there only 9 matrices, so that the other 7 threads are doomed to be stuck waiting for whoever got the last one?
When this sort of thing happens in a larger parallel block of code, sometimes it's ok to proceed to the next block of code while some of the loop iterations aren't done yet; there you can add the nowait clause to the for directive, which overrides the default behaviour and gets rid of the implied barrier. Here, though, since the parallel block is exactly the size of the for loop, that can't really help.
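As a minimal sketch of the nowait idea, assuming a larger parallel region that contains two loops with no data dependence between them (the function and array names here are hypothetical and not from the question):

```cpp
#include <vector>

// Two independent loops share one parallel region. `nowait` on the first
// loop removes its implied barrier, so a thread that finishes its share of
// loop 1 can start on loop 2 right away. This is only safe because loop 2
// never reads anything that loop 1 writes.
void scale_both(std::vector<double>& a, std::vector<double>& b, double f) {
    const int na = static_cast<int>(a.size());
    const int nb = static_cast<int>(b.size());
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < na; ++i) a[i] *= f;

        #pragma omp for
        for (int i = 0; i < nb; ++i) b[i] *= f;
    }   // the barrier at the end of the parallel region still applies
}
```

In the question's code there is only one loop in the region, so this does not apply there; it only shows where the clause goes.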

Could it be that your BLAS implementation also calls OpenMP inside? Unless you only see one call to gomp_barrier_wait_end.

Related

Same class, 2 programs, different OpenMP speedups; MSVC2017

I have a C++ class, several of whose functions have OpenMP parallel for loops. I'm building it into two apps with MSVC2017, and find that one of those functions runs differently in the 2 apps. The function has two separate parallel for loops. In one build, the VS debugger shows them both using 7 cores for a solid second while processing a block of test data; in the other, it shows just two blips of multicore usage, presumably at the beginning of each parallel section, but only 1 processor runs most of the time.
These functions are deep inside the code for the class, which is identical in the 2 apps. The builds have the same compiler and linker options so far as I can see. I generate the projects with CMake and never modify them by hand.
Can anyone suggest possible reasons for this behavior? I am fully aware of other ways to parallelize code, so please don't tell me about those. I am just looking for expertise on OpenMP under MSVC.
I expect the two calls are passing in significantly different amounts of work. Consider (example, trivial, typed into this post, not compiled, not the way to write this!) code like
void scale(int n, double *d, double f) {
#pragma omp parallel for
for (int i=0; i<n; i++)
d[i] = d[i] * f;
}
If invoked with a large vector where n == 10000, you'll get some parallelism and many threads working. If called with n == 3 there's obviously only work for three threads!
If you use #pragma omp parallel for schedule(dynamic) it's quite possible that even with ten or twenty iterations a single thread will execute most of them.
In summary: context matters.
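One way to make that dependence on context explicit is OpenMP's if clause, which runs the region on the calling thread alone when its condition is false. A minimal sketch, assuming a hypothetical threshold that would have to be tuned by measurement:

```cpp
// Hypothetical cut-off: below it the loop runs on the calling thread only.
constexpr int kParallelThreshold = 10000;

// Scales d[0..n) by f. The `if` clause keeps small calls serial, so they do
// not pay for starting and synchronising a thread team.
void scale(int n, double* d, double f) {
    #pragma omp parallel for if(n >= kParallelThreshold)
    for (int i = 0; i < n; ++i)
        d[i] *= f;
}
```

With this, a call with n == 3 behaves like the serial loop, while large vectors are still split across threads.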

Parallel programming dependency openacc

I am trying to parallelize these loops, but I get an error from the PGI compiler and I don't understand what's wrong.
#pragma acc kernels
{
#pragma acc loop independent
for (i = 0;i < k; i++)
{
for(;dt*j <= Ms[i+1].t;j++)
{
w = (j*dt - Ms[i].t)/(Ms[i+1].t-Ms[i].t);
X[j] = Ms[i].x*(1-w)+Ms[i+1].x*w;
Y[j] = Ms[i].y*(1-w)+Ms[i+1].y*w;
}
}
}
Error
85, Generating Multicore code
87, #pragma acc loop gang
89, Accelerator restriction: size of the GPU copy of Y,X is unknown
Complex loop carried dependence of Ms->t,Ms->x,X->,Ms->y,Y-> prevents parallelization
Loop carried reuse of Y->,X-> prevents parallelization
So what can I do to solve this dependence problem?
I see a few issues here. Also, given the output, I'm assuming that you're compiling with "-ta=multicore,tesla" (i.e. targeting both a multicore CPU and a GPU).
First, since "j" is not initialized in the "i" loop, the starting value of "j" will depend on the ending value of "j" from the previous iteration of "i". Hence, the loops are not parallelizable. By using "loop independent", you have forced parallelization on the outer loop, but you will get different answers from running the code sequentially. You will need to rethink your algorithm.
I would suggest making X and Y two dimensional. With the first dimension of size "k". The second dimension can be a jagged array (i.e. each having a differing size) with the size corresponding to the "Ms[i+1].t" value.
I wrote an example of using jagged arrays as part of my Chapter (#5) of the Parallel Programming with OpenACC book. See: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/jagged_array.c
Alternatively, you might be able to set "j=Ms[i].t" assuming "Ms[0].t" is set.
for(j=Ms[i].t;dt*j <= Ms[i+1].t;j++)
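Because the loop compares dt*j against the t values, a per-segment starting index has to be expressed in grid steps rather than time units. The sketch below shows one way to derive it so that every iteration of the outer loop is self-contained. It assumes that j starts at 0 in the sequential code, that dt > 0, that the t values are non-decreasing, and that X and Y are already large enough. The Sample struct and both function names are hypothetical, and an OpenMP pragma stands in for the OpenACC directive only to mark the loop that has become independent.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Sample { double t, x, y; };   // hypothetical stand-in for the Ms elements

// Smallest j >= 0 with dt*j > t, decided with the same comparison the
// original loop uses. Assumes dt > 0.
long first_index_after(double t, double dt) {
    long j = std::max(0L, static_cast<long>(std::floor(t / dt)));
    while (j > 0 && dt * (j - 1) > t) --j;   // correct a guess that is too high
    while (dt * j <= t) ++j;                 // correct a guess that is too low
    return j;
}

// Interpolate onto the grid j*dt. Each segment i computes its own j range,
// so no iteration depends on the j left behind by another one.
void resample(const std::vector<Sample>& Ms, double dt,
              std::vector<double>& X, std::vector<double>& Y) {
    const int k = static_cast<int>(Ms.size()) - 1;
    #pragma omp parallel for
    for (int i = 0; i < k; ++i) {
        const long begin = (i == 0) ? 0 : first_index_after(Ms[i].t, dt);
        const long end = first_index_after(Ms[i + 1].t, dt);
        for (long j = begin; j < end; ++j) {
            const double w = (j * dt - Ms[i].t) / (Ms[i + 1].t - Ms[i].t);
            X[j] = Ms[i].x * (1 - w) + Ms[i + 1].x * w;
            Y[j] = Ms[i].y * (1 - w) + Ms[i + 1].y * w;
        }
    }
}
```

Since every segment writes a disjoint range of X and Y, the result should match the sequential loop element for element.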
"Accelerator restriction: size of the GPU copy of Y,X is unknown"
This is telling you that the compiler can not implicitly copy the X and Y arrays on the device. In C/C++, unbounded pointers don't have sizes so the compiler can't tell how big these arrays are. Often it can derive this information from the loop trip counts, but since the loop trip count is unknown (see above), it can't in this case. To fix, you need to include a data directive on the "kernels" directive or add a data region to your code. For example:
#pragma acc kernels copyout(X[0:size], Y[0:size])
or
#pragma acc data copyout(X[0:size], Y[0:size])
{
...
#pragma acc kernels
...
}
Another thing to keep in mind is pointer aliasing. In C/C++, pointers of the same type are allowed to point at the same object. Hence, without additional information such as the "restrict" attribute, the "independent" clause, or the PGI compiler flag "-Msafeptr", the compiler must assume your pointers do point to the same object making the loop not parallelizable.
The dependence message would most likely go away by either adding loop independent to the inner loop as well or using the collapse clause to flatten the loops, applying independent to both. It might also go away if all of your arrays are passed in using restrict, but maybe not.

Open mp parallel for does not work

I'm studying OpenMP now, and I have a question. The work time of the following code and the same code without a parallel section is statistically equal, though all threads are accessing the function. I tried to look at some guides in the internet, but it did not help. So the question is, what is wrong with this parallel section?
int sumArrayParallel( )
{
int i = 0;
int sum = 0;
#pragma omp parallel for
for (i = 0; i < arraySize; ++i)
{
cout << omp_get_thread_num() << " ";
sum += testArray[i];
}
return sum;
}
There are two very common causes of OpenMP codes failing to exhibit improved performance over their serial counterparts:
The work being done is not sufficient to outweigh the overhead of parallel computation. Think of there being a cost, in time, for setting up a team of threads, for distributing work to them, for gathering results from them. Unless this cost is less than the time saved by parallelising the computation an OpenMP code, even if correct, will not show any speed up and may show the opposite. You haven't shown us the numbers so do the calculations on this yourself.
The programmer imposes serial operation on the parallel program, perhaps by wrapping data access inside memory fences, perhaps by accessing platform resources which are inherently serial. I suspect (but my knowledge of C is lousy) that your writing to cout may inadvertently serialise that part of your computation.
Of course, you can have a mixture of these two problems, too much serialisation and not enough work, resulting in disappointing performance.
For further reading this page on Intel's website is useful, and not just for beginners.
I think, though, that you have a more serious problem with your code than its poor parallel performance. Does the OpenMP version produce the correct sum? Since you have made no specific provision, sum is shared by all threads and they will race for access to it. While learning OpenMP it is a very good idea to attach the clause default(none) to your parallel regions and to take responsibility for defining the shared/private status of each variable in each region. Then, once you are fluent in OpenMP, you will know why it makes sense to continue to use the default(none) clause.
Even if you reply "Yes, the code does produce the correct result", the data race exists and your program can't be trusted. Data races are funny like that: they don't show up in all the tests you run, then, once you roll out your code into production, bang! Egg all over your face.
However, you seem to be rolling your own reduction and OpenMP provides the tools for doing this. Investigate the reduction clause in your OpenMP references. If I read your code correctly, and taking into account the advice above, you could rewrite the loop to
#pragma omp parallel for default(none) shared(arraySize, testArray) private(i) reduction(+:sum)
for (i = 0; i < arraySize; ++i)
{
sum += testArray[i];
}
In a nutshell, using the reduction clause tells OpenMP to sort out the problems of summing a single value from work distributed across threads, avoiding race conditions etc.
Since OpenMP makes loop iteration variables private by default you could omit the clause private(i) from the directive without too much risk. Even better though might be to declare it inside the for statement:
#pragma omp parallel for default(none) shared(arraySize, testArray) reduction(+:sum)
for (int i = 0; i < arraySize; ++i)
Variables declared inside parallel regions are (leaving aside some special cases) always private.
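Put together as a self-contained function, the reduction might look like the following sketch. The function name and the choice to pass the array and its size as parameters are assumptions made for illustration; the question's code uses globals. Note that sum appears only in the reduction clause, because a variable may not be listed in two data-sharing clauses of the same directive.

```cpp
// Sums the array with an OpenMP reduction: each thread accumulates into a
// private copy of `sum`, and the copies are added together when the loop ends.
int sum_array_parallel(const int* testArray, int arraySize) {
    int sum = 0;
    #pragma omp parallel for default(none) shared(testArray, arraySize) reduction(+:sum)
    for (int i = 0; i < arraySize; ++i) {
        sum += testArray[i];
    }
    return sum;
}
```

Compiled without OpenMP support, the pragma is ignored and the function is an ordinary serial sum, which makes it easy to compare the two results.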

Parallelizing an algorithm with many exit points?

I'm faced with parallelizing an algorithm which in its serial implementation examines the six faces of a cube of array locations within a much larger three dimensional array. (That is, select an array element, and then define a cube or cuboid around that element 'n' elements distant in x, y, and z, bounded by the bounds of the array.)
Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):
do n1=nlo,nhi
  do o1=olo,ohi
    if (somecondition(n1,o1)) then
      retval = .TRUE.
      RETURN
    endif
  end do
end do
Or C pseudocode:
for (n1 = nlo; n1 <= nhi; n1++) {
    for (o1 = olo; o1 <= ohi; o1++) {
        if (somecondition(n1, o1) != 0) {
            return true;
        }
    }
}
There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.
What I think would be best would be to schedule six or more threads of execution, round-robin if there aren't that many CPU cores, ideally with the loops executing in parallel, with the goal the same as the serial algorithm: somecondition() becomes True, execution among all the threads must immediately stop and a value of True set in a shared location.
What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.
Are there message passing mechanisms in OpenMP?
EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?
Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I'm thinking this from your comment "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." The design of parallelizing over the array elements is also flexible in terms of the number of threads. Unless there is a reason that the array elements have to be checked in some order?
It seems that the portion that you are showing us doesn't take that long to execute so making it take less clock time by making it parallel might not be easy ... there is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.
One possibility is to use OpenMP to parallelize over the 6 loops -- declare logical :: array(6), allow each loop to run to completion, and then retval = any(array). Then you can check this value and return outside the parallelized loop. Add a schedule(dynamic) to the parallel do statement if you do this. Or, have a separate !$omp parallel and then put !$omp do schedule(dynamic) ... !$omp end do nowait around each of the 6 loops.
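A C++ rendering of the first possibility (one flag per work unit, combined after the loop) could look like this sketch. The WorkUnit struct and the function names are hypothetical; only the shape of the nested scan comes from the question.

```cpp
#include <algorithm>
#include <array>
#include <functional>

// Hypothetical description of one face scan: index ranges plus the predicate.
struct WorkUnit {
    int nlo, nhi, olo, ohi;
    std::function<bool(int, int)> somecondition;
};

// Serial scan of one unit, same shape as the original nested loops.
bool scan_unit(const WorkUnit& u) {
    for (int n1 = u.nlo; n1 <= u.nhi; ++n1)
        for (int o1 = u.olo; o1 <= u.ohi; ++o1)
            if (u.somecondition(n1, o1)) return true;
    return false;
}

// Run the six scans in parallel. Each iteration writes only its own slot of
// `found`, so there is no race; the slots are combined after the loop.
bool any_unit_true(const std::array<WorkUnit, 6>& units) {
    std::array<bool, 6> found{};
    #pragma omp parallel for schedule(dynamic)
    for (int u = 0; u < 6; ++u)
        found[u] = scan_unit(units[u]);
    return std::any_of(found.begin(), found.end(), [](bool b) { return b; });
}
```

The early return lives inside scan_unit, which is allowed; what OpenMP forbids is jumping out of the parallel loop itself. The cost is that a hit in one unit does not stop the other five.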
Or, you can follow the good advice by @M.S.B. and parallelize the outermost loop over the whole array. The problem here is that you cannot have a RETURN inside a parallel loop -- so label the second outermost loop (the largest one within the parallel part), and EXIT that loop -- something like
retval = .FALSE.
!$omp parallel do default(private) shared(BIGARRAY,retval) schedule(dynamic,1)
do k=1,NN
if(.not. retval) then
outer2: do j=1,NN
do i=1,NN
! --- your loop #1
do n1=nlo,nhi
do o1=olo,ohi
if (somecondition(BIGARRAY(i,j,k),n1,o1)) then
retval =.TRUE.
exit outer2
endif
end do
end do
! --- your loops #2 ... #6 go here
end do
end do outer2
end if
end do
!$omp end parallel do
[edit: the if statement is there presuming that you need to find out if there is at least one element like that in the big array. If you need to figure the condition for every element, you can similarly either add a dummy loop exit or goto, skipping the rest of the processing for that element. Again, use schedule(dynamic) or schedule(guided).]
As a separate point, you might also want to check whether it is a good idea to go through the innermost loop in larger steps (depending on float size), compute a vector of logicals on each iteration and then aggregate the results, e.g. something like if(count(somecondition(x(o1:o1+step,n1,k)))>0); in this case the compiler may be able to vectorize somecondition.
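The same blocking idea can be sketched in C++ as follows. The predicate, its threshold, and the function name are hypothetical; the point is only that the exit test moves out of the innermost loop.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical branch-free predicate on a single value.
inline bool somecondition(double v) { return v > 0.5; }

// Scan `row` in blocks of `step` elements (step must be > 0). Within a block
// the hits are counted without an early exit, which gives the compiler a
// chance to vectorize; the exit test runs once per block.
bool any_hit_blocked(const std::vector<double>& row, std::size_t step) {
    for (std::size_t start = 0; start < row.size(); start += step) {
        const std::size_t stop = std::min(start + step, row.size());
        int hits = 0;
        for (std::size_t i = start; i < stop; ++i)
            hits += somecondition(row[i]) ? 1 : 0;
        if (hits > 0) return true;
    }
    return false;
}
```

Whether this is faster than the element-by-element exit depends on how early hits tend to occur, so it is worth measuring with realistic data.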
I believe you can do what you want with the task construct introduced in OpenMP 3; Intel Fortran supports tasking in OpenMP. I don't use tasks often so I won't offer you any wonky pseudocode.
You already mentioned the obvious way to stop all threads as soon as any thread finds the ending condition: have each check some shared variable which gives the status of the ending condition, thereby determining whether to break out of the loops. Obviously this is an overhead, so if you decide to take this approach I would suggest a few things:
Use atomics to check the ending condition, this avoids expensive memory flushing as just the variable in question is flushed. Move to OpenMP 3.1, there are some new atomic operations supported.
Check infrequently, maybe like once per outer iteration. You should only be parallelizing large cases to overcome the overhead of multithreading.
This one is optional, but you can try adding compiler hints, e.g. if you expect a certain condition to be false most of the time, the compiler will optimize the code accordingly.
Another (somewhat dirty) approach is to use shared variables for the loop ranges for each thread, maybe use a shared array where index n is for thread n. When one thread finds the ending condition, it changes the loop ranges of all the other threads so that they stop. You'll need the appropriate memory synchronization. Basically the overhead has now moved from checking a dummy variable to synchronizing/checking loop conditions. Again probably not so good to do this frequently, so maybe use shared outer loop variables and private inner loop variables.
On another note, this reminds me of the classic polling versus interrupt problem. Unfortunately I don't think OpenMP supports interrupts where you can send some kind of kill signal to each thread.
There are hacky workarounds, like using a child process for just this parallel work and invoking the operating system scheduler to emulate interrupts; however, this is rather tricky to get correct and would make your code extremely unportable.
Update in response to comment:
Try something like this:
char shared_var = 0;
#pragma omp parallel
{
    char private_var = 0; // each thread's local copy of the done marker
    // you should have some method for setting loop ranges for each thread
    for (int n1 = nlo; n1 <= nhi; n1++) {
        for (int o1 = olo; o1 <= ohi; o1++) {
            if (somecondition(n1, o1) != 0) {
                #pragma omp atomic write
                shared_var = 1; // done marker, this will also trigger the other break below
                break; // could instead use goto to break out of both loops in 1 go
            }
        }
        #pragma omp atomic read
        private_var = shared_var;
        if (private_var != 0) break;
    }
}
A suitable parallel approach might be to let each worker examine a part of the overall problem, exactly as in the serial case, and use a local (non-shared) variable for the result (retval). Finally, do a reduction over all workers on these local variables into a shared overall result.
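In OpenMP terms, that local-result-plus-reduction scheme can be written with a logical-or reduction. A minimal sketch, assuming the predicate is passed in as a plain function pointer (the function name and signature are hypothetical):

```cpp
// Each thread scans its share of the n1 range into a private `found`;
// reduction(||:found) combines the private copies with logical OR at the end.
bool scan_parallel(int nlo, int nhi, int olo, int ohi,
                   bool (*somecondition)(int, int)) {
    bool found = false;
    #pragma omp parallel for reduction(||:found)
    for (int n1 = nlo; n1 <= nhi; ++n1) {
        if (found) continue;   // this thread already has a hit: skip the rest of its share
        for (int o1 = olo; o1 <= ohi; ++o1) {
            if (somecondition(n1, o1)) {
                found = true;
                break;         // leaving the inner loop is allowed
            }
        }
    }
    return found;
}
```

A thread stops doing real work once it has found a hit itself, but it does not see hits found by other threads; that is the price of having no shared flag to poll.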

When, if ever, is loop unrolling still useful?

I've been trying to optimize some extremely performance-critical code (a quick sort algorithm that's being called millions and millions of times inside a monte carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
while(true) {
    if(!(myArray[++index1] < pivot)) break;
    if(!(myArray[++index1] < pivot)) break;
    // More unrolling
}
while(true) {
    if(!(pivot < myArray[--index2])) break;
    if(!(pivot < myArray[--index2])) break;
    // More unrolling
}
This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?
Loop unrolling makes sense if you can break dependency chains. This gives an out-of-order or superscalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i=0; i<n; i++)
{
sum += data[i];
}
Here every addition depends on the previous one through sum, so there is a single dependency chain. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait.
On the other hand this code:
for (int i=0; i<n-3; i+=4) // note the n-3 bound for starting i + 0..3
{
sum1 += data[i+0];
sum2 += data[i+1];
sum3 += data[i+2];
sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n%4 != 0, handle final 0..3 elements with a rolled up loop or whatever
could run faster. If you get a cache miss or other stall in one calculation, there are still three other dependency chains that don't depend on the stall. An out-of-order CPU can execute these in parallel.
(See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register-renaming helps CPUs find that parallelism, and an in depth look at the details for FP dot-product on modern x86-64 CPUs with their throughput vs. latency characteristics for pipelined floating-point SIMD FMA ALUs. Hiding latency of FP addition or FMA is a major benefit to multiple accumulators, since latencies are longer than integer but SIMD throughput is often similar.)
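For completeness, here is the four-accumulator loop as a self-contained function, including the remainder handling that the comment in the snippet leaves open. The function name is hypothetical. For floating-point data the additions happen in a different order than in the simple loop, so the result can differ in the last bits.

```cpp
#include <cstddef>

// Sums data[0..n) with four independent accumulators so that consecutive
// additions do not have to wait on each other.
double sum_unrolled(const double* data, std::size_t n) {
    double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    std::size_t i = 0;
    for (; i + 3 < n; i += 4) {   // full groups of four elements
        sum1 += data[i + 0];
        sum2 += data[i + 1];
        sum3 += data[i + 2];
        sum4 += data[i + 3];
    }
    double sum = (sum1 + sum2) + (sum3 + sum4);
    for (; i < n; ++i)            // final 0..3 leftover elements
        sum += data[i];
    return sum;
}
```

Writing the bound as i + 3 < n avoids the unsigned underflow that n - 3 would cause for n < 3 with a size_t counter.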
Those wouldn't make any difference because you're doing the same number of comparisons. Here's a better example. Instead of:
for (int i=0; i<200; i++) {
doStuff();
}
write:
for (int i=0; i<50; i++) {
doStuff();
doStuff();
doStuff();
doStuff();
}
Even then it almost certainly won't matter but you are now doing 50 comparisons instead of 200 (imagine the comparison is more complex).
Manual loop unrolling in general is largely an artifact of history however. It's another of the growing list of things that a good compiler will do for you when it matters. For example, most people don't bother to write x <<= 1 or x += x instead of x *= 2. You just write x *= 2 and the compiler will optimize it for you to whatever is best.
Basically there's increasingly less need to second-guess your compiler.
Regardless of branch prediction on modern hardware, most compilers do loop unrolling for you anyway.
It would be worthwhile finding out how much optimization your compiler does for you.
I found Felix von Leitner's presentation very enlightening on the subject. I recommend you read it. Summary: Modern compilers are VERY clever, so hand optimizations are almost never effective.
As far as I understand it, modern compilers already unroll loops where appropriate. An example is gcc: if passed the relevant optimisation flags, the manual says it will:
"Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop."
So, in practice it's likely that your compiler will do the trivial cases for you. It's up to you therefore to make sure that as many as possible of your loops are easy for the compiler to determine how many iterations will be needed.
Loop unrolling, whether it's hand unrolling or compiler unrolling, can often be counter-productive, particularly with more recent x86 CPUs (Core 2, Core i7). Bottom line: benchmark your code with and without loop unrolling on whatever CPUs you plan to deploy this code on.
Trying without knowing is not the way to do it.
Does this sort take a high percentage of overall time?
All loop unrolling does is reduce the loop overhead of incrementing/decrementing, comparing for the stop condition, and jumping. If what you're doing in the loop takes more instruction cycles than the loop overhead itself, you're not going to see much improvement percentage-wise.
Here's an example of how to get maximum performance.
Loop unrolling can be helpful in specific cases. The only gain isn't skipping some tests!
It can for instance allow scalar replacement, efficient insertion of software prefetching... You would be surprised actually how useful it can be (you can easily get 10% speedup on most loops even with -O3) by aggressively unrolling.
As it was said before though, it depends a lot on the loop and the compiler and experiment is necessary. It's hard to make a rule (or the compiler heuristic for unrolling would be perfect)
Loop unrolling entirely depends on your problem size. It is entirely dependent on your algorithm being able to reduce the size into smaller groups of work. What you did above does not look like that. I am not sure if a monte carlo simulation can even be unrolled.
A good scenario for loop unrolling would be rotating an image, since you could rotate separate groups of work. To get this to work you would have to reduce the number of iterations.
Loop unrolling is still useful if there are a lot of local variables both in and with the loop. To reuse those registers more instead of saving one for the loop index.
In your example, you use small amount of local variables, not overusing the registers.
Comparisons (against the loop end) are also a major drawback if the comparison is heavy (i.e. not a simple test instruction), especially if it depends on an external function.
Loop unrolling also helps the CPU's branch prediction, but mispredictions occur anyway.

Resources