Does OMP Pragmas nesting have significance? - openmp

I'm looking at some code like below (in a reviewer/auditor capacity). The nesting shown below was created with TABS in the source code.
#pragma omp parallel
    #pragma omp sections
    {
        #pragma omp section
        p2 = ModularExponentiation((a % p), dp, p);
        #pragma omp section
        q2 = ModularExponentiation((a % q), dq, q);
    }
Is the nesting a matter of style? Or does it have significance like in, say, Python?

You can collapse the first two pragmas into one, and it won't change the semantics, so you can change
#pragma omp parallel
#pragma omp sections
into
#pragma omp parallel sections
However, the section pragmas must be nested inside the sections pragma in order to specify the task parallelism. One example of when you might want to keep them separate is when you want two sets of sections right after one another with a barrier in between, such as:
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        { ... }
        #pragma omp section
        { ... }
    }
    #pragma omp sections
    {
        #pragma omp section
        { ... }
        #pragma omp section
        { ... }
    }
}
This way, a new thread "team" does not need to be created for the second set of sections.
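Applied to the snippet from the question, the collapsed form would look roughly like this (a sketch; ModularExponentiation and the surrounding variables are taken from the question):
#pragma omp parallel sections
{
    #pragma omp section
    p2 = ModularExponentiation((a % p), dp, p);   // executed by one thread of the team
    #pragma omp section
    q2 = ModularExponentiation((a % q), dq, q);   // may be executed by another thread concurrently
}
The indentation itself carries no meaning in C/C++; only the nesting of the directives and their structured blocks does.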

Related

OpenMP reduction from within parallel region

How to do OpenMP reduction (sum) inside parallel region? (Result is needed on master thread only).
Algorithm prototype:
#pragma omp parallel
{
    t = omp_get_thread_num();
    while (iterate)
    {
        float f = get_local_result(t);
        // fsum is required on master only
        float fsum = /* ? - SUM of f over all threads */;
        if (t == 0)
            MPI_Bcast(&fsum, ...);
    }
}
If I put the OpenMP parallel region inside the iteration loop, the parallel-region overhead at each iteration kills the performance...
Here is the simplest way to do this:
float sharedFsum = 0.f;
float masterFsum;
#pragma omp parallel
{
    const int t = omp_get_thread_num();
    while(iteration_condition)
    {
        float f = get_local_result(t);

        // Manual reduction
        #pragma omp atomic update
        sharedFsum += f;

        // Ensure the reduction is completed
        #pragma omp barrier

        #pragma omp master
        {
            MPI_Bcast(&sharedFsum, ...);
            // Reset the accumulator for the next iteration
            sharedFsum = 0.f;
        }

        // Ensure no other threads update sharedFsum during the MPI_Bcast
        #pragma omp barrier
    }
}
The atomic operations can be costly if you have a lot of threads (e.g. hundreds). A better approach is to let the runtime perform the reduction for you.
Here is a better version:
float sharedFsum = 0;
#pragma omp parallel
{
    const int threadCount = omp_get_num_threads();
    float masterFsum;
    while(iteration_condition)
    {
        // Execute get_local_result on each thread and
        // perform the reduction into sharedFsum
        #pragma omp for reduction(+:sharedFsum) schedule(static,1)
        for(int i=0 ; i<threadCount ; ++i)
            sharedFsum += get_local_result(i);

        #pragma omp master
        {
            MPI_Bcast(&sharedFsum, ...);
            // sharedFsum must be reinitialized for the next iteration
            sharedFsum = 0.f;
        }

        // Ensure no other threads update sharedFsum during the MPI_Bcast
        #pragma omp barrier
    }
}
Side notes:
t is not private in your code; use private(t) on the #pragma omp parallel directive to avoid undefined behavior due to a race condition. Alternatively, declare t inside the parallel region so it is scoped to each thread (see the sketch after these notes).
#pragma omp master should be preferred to a conditional on the thread ID.
Regarding "parallel region overhead at each iteration kills the performance...":
Most of the time this is due to either (implicit) synchronizations/communications or a work imbalance.
The code above may have the same problem since it is quite synchronous.
If it makes sense in your application, you can make it a bit less synchronous (and thus possibly faster) by removing or moving barriers, depending on the relative speed of MPI_Bcast and get_local_result. However, this is far from easy to do correctly. One way to do it is to use OpenMP tasks and multi-buffering.
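As a small illustration of the first side note, here is the question's prototype with t made private (a sketch; iterate and get_local_result are placeholders from the question):
int t;                              // declared outside the region: shared by default
#pragma omp parallel private(t)     // private(t) gives every thread its own copy of t
{
    t = omp_get_thread_num();
    // ... loop body from the question ...
}

// Alternatively, a variable declared inside the region is private by construction:
#pragma omp parallel
{
    const int t = omp_get_thread_num();
    // ... loop body from the question ...
}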

OpenMP, dependency graph

I took some of my old OpenMP exercises to practice a little bit, but I am having difficulty finding the solution for one in particular.
The goal is to write the simplest OpenMP code that corresponds to the dependency graph.
The graphs are visible here: http://imgur.com/a/8qkYb
First one is simple.
It corresponds to the following code:
#pragma omp parallel
{
    #pragma omp simple
    {
        #pragma omp task
        {
            A1();
            A2();
        }
        #pragma omp task
        {
            B1();
            B2();
        }
        #pragma omp task
        {
            C1();
            C2();
        }
    }
}
Second one is still easy.
#pragma omp parallel
{
    #pragma omp simple
    {
        #pragma omp task
        {
            A1();
        }
        #pragma omp task
        {
            B1();
        }
        #pragma omp task
        {
            C1();
        }
        #pragma omp barrier
        A2();
        B2();
        C2();
    }
}
And now comes the last one…
which is bugging me quite a bit, because the number of dependencies is unequal across the function calls. I thought there was a way to explicitly state which task you should be waiting for, but I can't find what I'm looking for in the OpenMP documentation.
If anyone has an explanation for this question, I will be very grateful, because I've been thinking about it for more than a month now.
First of all, there is no #pragma omp simple in the OpenMP 4.5 specification.
I assume you meant #pragma omp single.
If so, #pragma omp barrier is a bad idea inside a single region, since only one thread will execute the code and then wait for all the other threads, which never reach the barrier because they do not execute the region.
Additionally, in the second one, A2, B2 and C2 are not executed in parallel as tasks anymore.
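For the second graph, one way to keep A2, B2 and C2 as parallel tasks while still waiting for the first three is a taskwait instead of a barrier (a sketch, assuming the graph only requires all of A1, B1, C1 to finish before any of A2, B2, C2):
#pragma omp parallel
#pragma omp single
{
    #pragma omp task
    A1();
    #pragma omp task
    B1();
    #pragma omp task
    C1();

    #pragma omp taskwait   // wait for A1, B1 and C1 to complete

    #pragma omp task
    A2();
    #pragma omp task
    B2();
    #pragma omp task
    C2();
}                          // implicit barrier of the parallel region waits for the remaining tasks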
To your actual question:
What you are looking for seems to be the depend clause for task constructs, see the OpenMP specification, p. 169.
There is a pretty good explanation of the depend clause and how it works by Massimiliano for this question.
The last example is not that complex once you understand what is going on there: each task T(t, n) depends on the same task in the previous iteration, T(t-1, n), AND on its neighbors, T(t-1, n-1) and T(t-1, n+1). This pattern is known as a Jacobi stencil. It is very common in partial differential equation solvers.
As Henkersmann said, the easiest option is to use the depend clause of OpenMP tasks:
int val_a[N], val_b[N];

#pragma omp parallel
#pragma omp single
{
    int *a = val_a;
    int *b = val_b;
    for( int t = 0; t < T; ++t ) {
        // Unroll the inner loop for the boundary cases
        #pragma omp task depend(in:a[0], a[1]) depend(out:b[0])
        stencil(b, a, 0);

        for( int i = 1; i < N-1; ++i ) {
            #pragma omp task depend(in:a[i-1],a[i],a[i+1]) \
                             depend(out:b[i])
            stencil(b, a, i);
        }

        #pragma omp task depend(in:a[N-2],a[N-1]) depend(out:b[N-1])
        stencil(b, a, N-1);

        // Swap the pointers for the next iteration
        int *tmp = a;
        a = b;
        b = tmp;
    }
    #pragma omp taskwait
}
As you may see, OpenMP task dependences are point-to-point, which means you cannot express them in terms of array regions.
Another option, a bit cleaner for this specific case, is to enforce the dependences indirectly, using a barrier:
int a[N], b[N];

#pragma omp parallel
for( int t = 0; t < T; ++t ) {
    #pragma omp for
    for( int i = 0; i < N; ++i ) {
        stencil(b, a, i);
    }
    // Note: swapping a and b between outer iterations (as in the task version) is not shown here
}
This second case performs a synchronization barrier every time the inner loop finishes. The synchronization granularity is coarser, in the sense that you have only one synchronization point per outer loop iteration. However, if the stencil function is long and unbalanced, it is probably worth using tasks.

openMP use critical section without brackets

My question is simple: if I don't use any brackets, does #pragma omp critical apply only to the following line?
#pragma omp parallel for shared(k)
for(int i = 0 ; i < bdd.size() ; i++){
    #pragma omp critical
    ++k;
    myfunction();
}
In the example above, is ++k; the only line affected by the critical section?
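For reference, if the intent is to protect both statements, braces extend the critical construct to the whole block (a sketch of the same loop):
#pragma omp parallel for shared(k)
for(int i = 0 ; i < bdd.size() ; i++){
    #pragma omp critical
    {
        ++k;           // both statements now execute inside the critical section
        myfunction();
    }
}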

OpenMP parallelizing loop

I want to parallelize this kind of loop. Note that each "calc_block" uses data obtained in the previous iteration.
for (i=0 ; i<MAX_ITER; i++){
    norma1 = calc_block1();
    norma2 = calc_block2();
    norma3 = calc_block3();
    norma4 = calc_block4();
    norma = norma1+norma2+norma3+norma4;
    ...some calc...
    if(norma<eps)break;
}
I tried this, but the speedup is quite small, ~1.2:
for (i=0 ; i<MAX_ITER; i++){
    #pragma omp parallel sections
    {
        #pragma omp section
        norma1 = calc_block1();
        #pragma omp section
        norma2 = calc_block2();
        #pragma omp section
        norma3 = calc_block3();
        #pragma omp section
        norma4 = calc_block4();
    }
    norma = norma1+norma2+norma3+norma4;
    ...some calc...
    if(norma<eps)break;
}
I think it happens because of the overhead of using sections inside the loop, but I don't know how to fix it...
Thanks in advance!
You could reduce the overhead by moving the entire loop inside the parallel region. Thus the threads in the pool used to implement the team would only get "awakened" once. It is a bit tricky and requires careful consideration of variable sharing classes:
#pragma omp parallel private(i,...) num_threads(4)
{
    for (i = 0; i < MAX_ITER; i++)
    {
        #pragma omp sections
        {
            #pragma omp section
            norma1 = calc_block1();
            #pragma omp section
            norma2 = calc_block2();
            #pragma omp section
            norma3 = calc_block3();
            #pragma omp section
            norma4 = calc_block4();
        }
        #pragma omp single
        {
            norma = norma1 + norma2 + norma3 + norma4;
            // ... some calc ..
        }
        if (norma < eps) break;
    }
}
Both the sections and single constructs have implicit barriers at their ends, hence the threads synchronise before going into the next loop iteration. The single construct reproduces the previously serial part of your program. The ... part in the private clause should list as many as possible of the variables that are only relevant to ... some calc .... The idea is to run the serial part with thread-local variables, since access to shared variables is slower with most OpenMP implementations.
Note that often the speed-up might not be linear for a completely different reason. For example, calc_blockX() (with X being 1, 2, 3 or 4) might have too low a compute intensity and therefore require very high memory bandwidth. If the memory subsystem is not able to feed all 4 threads at the same time, the speed-up will be less than 4. An example of such a case: this question.

Parallelizing a series of independent sequential lines of code

What is the best way to execute multiple lines of code in parallel if they are not dependent on each other? (I'm using OpenMP.)
Pseudo code:
database->connect()
openfile("stuff.txt")
ping("stackoverflow.com")
x = 2;
y = a + b;
The only way I can come up with is:
#pragma omp parallel for
for(i = 0; i < 5; i++)
    switch (i) {
        case 0: database->connect(); break;
        ...
I haven't tried it, but I also remember that you're not supposed to use break inside an OpenMP loop.
So I'm assuming that the individual things you listed as independent tasks were just examples. If they really are things like y = a + b, then as @chrisaycock and @ejd have said, they're too small for this sort of parallelism (e.g. thread-based, as opposed to ILP or something) to actually take advantage of the concurrency, due to overheads. But if they are bigger operations, the way to do task-based parallelism in OpenMP is with the task directive, e.g.:
#include <stdio.h>
#include <omp.h>
#include <unistd.h>

void work(int *v) {
    *v = omp_get_thread_num();
    sleep(1);
}

int main(int argc, char **argv)
{
    int a, b, c;
    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task shared(a) default(none)
            work(&a);
            #pragma omp task shared(b) default(none)
            work(&b);
            #pragma omp task shared(c) default(none)
            work(&c);
        }
    }
    printf("a,b,c = %d,%d,%d\n", a, b, c);
    return 0;
}
