avoid nested parallel region - openmp

I want to write a function which employs OpenMP parallelism but should work whether or not it is called from within a parallel region. So I used the if clause to suppress nested parallelism, but this doesn't work as I thought:
#include <omp.h>
#include <stdio.h>
#include <stdbool.h>   /* for bool when compiled as C */

int m = 0, s = 0;

void func()
{
    bool p = omp_in_parallel();
    // if clause to suppress nested parallelism
    #pragma omp parallel if(!p)
    {
        /* do some massive work in parallel */
        #pragma omp master
        ++m;
        #pragma omp single
        ++s;
    }
}

int main()
{
    fprintf(stderr, "running func() serial:\n");
    m = s = 0;
    func();
    fprintf(stderr, " m=%d s=%d\n", m, s);

    fprintf(stderr, "running func() parallel:\n");
    m = s = 0;
    #pragma omp parallel
    func();
    fprintf(stderr, " m=%d s=%d\n", m, s);
}
which creates the output
running func() serial:
m=1 s=1
running func() parallel:
m=16 s=16
Thus the first call to func() worked fine: m and s obtain the value 1 as they should. But the second call to func() from within a parallel region did create nested parallel regions (16 teams of 1 thread each) even though this was supposed to be suppressed. That is, the omp master and omp single directives bind to the preceding omp parallel if(!p) directive rather than to the outer parallel region.
Of course, one can fix this problem with the following code
void work()
{
    /* do some massive work in parallel */
    #pragma omp master
    ++m;
    #pragma omp single
    ++s;
}

void func()
{
    if (omp_in_parallel())
        work();
    else
    {
        #pragma omp parallel
        work();
    }
}
but this requires an additional function to be defined etc. Is it possible to do this within a single function (and without repeating code)?

OpenMP constructs always bind to the innermost enclosing construct, even if it isn't active. So I don't think it's possible while retaining the #pragma omp parallel for both code paths (at least with the information provided about the problem).
Note that it is a good thing that it behaves like this, because otherwise the use of conditionals would easily lead to very problematic (read: buggy) code. Look at the following example:
void func(void* data, int size)
{
    #pragma omp parallel if(size > 1024)
    {
        // do some work
        #pragma omp barrier
        // do some more work
    }
}
...
#pragma omp parallel
{
    // create foo and bar; bar varies massively between different threads
    // (so it is sometimes bigger, sometimes smaller than 1024)
    func(foo, bar);
    // do more work
}
In general a programmer should not need to know the implementation details of the functions they call, only their behaviour. So I really shouldn't have to care whether func creates a nested parallel region, nor under which exact conditions it does so. However, if the barrier bound to the outer parallel region whenever the inner one is inactive, this code would be buggy, since some threads of the outer parallel region would encounter the barrier and some wouldn't. Therefore such details stay hidden inside the innermost enclosing parallel region, even if it isn't active.
Personally I have never encountered a situation where I wanted it to behave differently (which would go against information hiding and such), so maybe you should tell us a bit more about what you are trying to accomplish to get better answers.

I had a look at the OpenMP standard. The if clause is actually somewhat misleadingly named, for the #pragma omp parallel directive is not conditional (as I originally thought). Instead, when the clause evaluates to false, the parallel region is still created but its team is restricted to a single thread, thereby suppressing parallelisation.
However, this implies that omp single and omp master cannot be used for thread-safe exactly-once writes to global shared variables in a function that may itself be called from a parallel region.
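For reference, one way to get the once-per-call behaviour inside a single function is to guard the update explicitly with the OpenMP nesting API instead of relying on master/single binding. This is only a sketch of mine (not from the thread), assuming OpenMP 3.0 or later for omp_get_active_level() and omp_get_ancestor_thread_num(), and reusing m, s and the includes from the question's code:
void func()
{
    bool p = omp_in_parallel();
    #pragma omp parallel if(!p)
    {
        /* do some massive work in parallel */

        /* update the shared counters exactly once:
           only the thread that is number 0 in the innermost *active*
           team does the write, whether or not the region above is active */
        if (omp_get_ancestor_thread_num(omp_get_active_level()) == 0)
        {
            ++m;
            ++s;
        }
    }
}
When func() is called serially, this picks thread 0 of the newly created team; when it is called from an outer parallel region, the inner region is inactive and the check picks thread 0 of the outer team, so m and s are still incremented only once.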

Related

OpenACC 2.0 routine: data locality

Take the following code, which illustrates the calling of a simple routine on the accelerator, compiled on the device using OpenACC 2.0's routine directive:
#include <iostream>

#pragma acc routine
int function(int *ARRAY, int multiplier)
{
    int sum = 0;
    #pragma acc loop reduction(+:sum)
    for (int i = 0; i < 10; ++i) {
        sum += multiplier * ARRAY[i];
    }
    return sum;
}

int main()
{
    int *ARRAY = new int[10];
    int multiplier = 5;
    int out;
    for (int i = 0; i < 10; i++) {
        ARRAY[i] = 1;
    }
    #pragma acc enter data create(out) copyin(ARRAY[0:10], multiplier)
    #pragma acc parallel present(out, ARRAY[0:10], multiplier)
    if (function(ARRAY, multiplier) == 50) {
        out = 1;
    } else {
        out = 0;
    }
    #pragma acc exit data copyout(out) delete(ARRAY[0:10], multiplier)
    std::cout << out << std::endl;
}
How does function know to use the device copies of ARRAY[0:10] and multiplier when it is called from within a parallel region? How can we enforce the use of the device copies?
When your routine is called within a device region (the parallel in your code), it is being called by the threads on the device, which means those threads will only have access to arrays on the device. The compiler may actually choose to inline that function, or it may be a device-side function call. That means that you can know that when the function is called from the device it will be receiving device copies of the data because the function is essentially inheriting the present data clause from the parallel region. If you still want to convince yourself that you're running on the device once inside the function, you could call acc_on_device, but that only tells you that you're running on the accelerator, not that you received a device pointer.
If you want to enforce the use of device copies more than that, you could make the routine nohost so that it would technically not be valid to call from the host, but that doesn't really do what you're asking, which is to do a check on the GPU that the array really is a device array.
Keep in mind though that any code inside a parallel region that is not inside a loop will be run gang-redundantly, so the write to out is likely a race condition, unless you happen to be running with one gang or you write to it using an atomic.
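As a sketch of that point (my own rework of the question's launch code, not taken from the answer), the comparison can be run with a single gang so the write to out happens exactly once; following the other answers, the routine itself could additionally be given an explicit level, e.g. #pragma acc routine vector:
#pragma acc enter data create(out) copyin(ARRAY[0:10], multiplier)
// one gang: the comparison and the write to out are executed exactly once
#pragma acc parallel num_gangs(1) present(out, ARRAY[0:10], multiplier)
{
    out = (function(ARRAY, multiplier) == 50) ? 1 : 0;
}
#pragma acc exit data copyout(out) delete(ARRAY[0:10], multiplier)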
Basically, when you use a data clause, the runtime creates/copies the data in device memory, and the block of code marked with acc routine is compiled for and executed on the device. Note that, unlike with multi-threading (OpenMP), host and device memory are not shared. So yes, function will be using the device copies of ARRAY and multiplier as long as the call happens inside the data region. Hope this helps! :)
You should give the routine an explicit parallelism level such as gang, worker, or vector; that is the more precise way to do it.
The routine will then use the data in device memory.

gcc openmp tasks do not work

I have already used OpenMP with "pragma omp for" loops and wanted to try OpenMP tasks now.
But a simple program, which should run 2 tasks in parallel, does not seem to work.
Did I misunderstand the use of tasks, or what is wrong here?
#include <iostream>
#include <unistd.h>   // for usleep
#include <omp.h>
// ubuntu 12.04 LTS, gcc 4.6.3
// g++ test_omp.cpp -fopenmp

int main()
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task
            {
                while (true)
                {
                    usleep(1e6);
                    #pragma omp critical (c_out)
                    std::cout << "task1" << std::endl;
                }
            }
            #pragma omp task
            {
                while (true)
                {
                    usleep(1e6);
                    #pragma omp critical (c_out)
                    std::cout << "task2" << std::endl;
                }
            }
        }
    }
}
The output is:
task1
task1
task1
.....
So the second task is not running.
From the OpenMP spec:
When a thread encounters a task construct, a task is generated from
the code for the associated structured block. The data environment of
the task is created according to the data-sharing attribute clauses on
the task construct, per-data environment ICVs, and any defaults that
apply.
The encountering thread may immediately execute the task, or
defer its execution. In the latter case, any thread in the team may be
assigned the task. Completion of the task can be guaranteed using task
synchronization constructs. A task construct may be nested inside an
outer task, but the task region of the inner task is not a part of the
task region of the outer task.
(emphasis mine)
The way I read this: A single thread starts executing your single section. It reaches the task directive, at which point it may decide to either run the task itself, or give it to another thread. The problem occurs when it decides to run the task itself - it never returns.
I'm not quite sure why you have task/single in your example though. What you want to do seems like a case for omp parallel sections instead:
#include <iostream>
#include <unistd.h>
#include <omp.h>

int main()
{
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            while (true)
            {
                usleep(3e5);
                #pragma omp critical (c_out)
                std::cout << "task1" << std::endl;
            }
        }
        #pragma omp section
        {
            while (true)
            {
                usleep(5e5);
                #pragma omp critical (c_out)
                std::cout << "task2" << std::endl;
            }
        }
    }
}
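As an aside, a minimal sketch of my own (not part of the answer) showing the spec behaviour quoted above with finite tasks: the single thread may run one task itself, the other deferred task can be picked up by any thread in the team, and taskwait guarantees both have completed.
#include <iostream>
#include <unistd.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task
            for (int i = 0; i < 3; ++i)
            {
                usleep(100000);
                #pragma omp critical (c_out)
                std::cout << "task1 on thread " << omp_get_thread_num() << std::endl;
            }
            #pragma omp task
            for (int i = 0; i < 3; ++i)
            {
                usleep(100000);
                #pragma omp critical (c_out)
                std::cout << "task2 on thread " << omp_get_thread_num() << std::endl;
            }
            #pragma omp taskwait // wait until both tasks have finished
        }
    }
}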

Openmp not updating

I am trying to write a C (gcc) function that will calculate the maximum of an array of doubles across multiple threads. I create an array of size omp_get_num_threads(), in which I store the local maximum of each thread before finally maximizing this small array. The code is (more or less) the following:
int i;
double *local_max;
double A[1e10]; // made up size

#pragma omp parallel
{
    #pragma omp master
    {
        local_max = (double *)calloc(omp_get_num_threads(), sizeof(double));
    }
    #pragma omp flush // so that all threads point
                      // to the correct location of local_max
    #pragma omp for
    for (i = 0; i < 1e10; i++) {
        if (A[i] > local_max[omp_get_thread_num()])
            local_max[omp_get_thread_num()] = A[i];
    }
}
free(local_max);
This, however, leads to segfaults, and valgrind complains about the use of uninitialized variables. It turns out that local_max is not actually visible to all threads before they enter the for construct. I thought #pragma omp flush should take care of that? If I replace it with #pragma omp barrier, everything works fine.
Could someone explain to me what is going on?
The easiest solution to your problem is to simply replace the master construct with a single one, as it doesn't really matter which thread makes the allocation (unless you are running on a NUMA machine, but then you would have many other things to worry about too):
#pragma omp single
{
    local_max = (double *)calloc(omp_get_num_threads(), sizeof(double));
}
The subtle difference between master and single is that there is an implicit barrier at the end of single, while no such barrier exists at the end of master. This implicit barrier makes all other threads wait until the thread that executes the single block has reached the end of the block (unless the nowait clause is specified, which removes the implicit barrier). With master, the barrier must be added explicitly. It is beyond my comprehension why the OpenMP designers decided that master would not have an implicit barrier like single does.
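In other words (a sketch of that point, not taken verbatim from the answer), the master version needs an explicit barrier, since flush alone only enforces memory consistency and never makes threads wait for each other:
#pragma omp master
{
    local_max = (double *)calloc(omp_get_num_threads(), sizeof(double));
}
#pragma omp barrier // master has no implied barrier, so add one explicitly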
You need to put a barrier there to ensure the memory allocation has completed. Memory allocation is a time-consuming operation, and when your final for loop starts running, local_max is not yet pointing to properly allocated space. I modified your code below to demonstrate the behavior.
int i;
double *local_max;
omp_set_num_threads(8);

#pragma omp parallel
{
    #pragma omp master
    {
        for (int k = 0; k < 999999; k++) {} // Lazy man's sleep function
        cout << "Master start allocating" << endl;
        local_max = (double *)calloc(omp_get_num_threads(), sizeof(double));
        cout << "Master finish allocating" << endl;
    }
    #pragma omp flush
    #pragma omp for
    for (i = 0; i < 10; i++) {
        cout << "for : " << omp_get_thread_num() << " i: " << i << endl;
    }
}
free(local_max);
getchar();
return 0;
Better yet, just move the memory allocation before the #pragma omp parallel (sizing the array with omp_get_max_threads(), since omp_get_num_threads() returns 1 outside a parallel region). Then there is no need for flush, single, or master.
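A sketch of that suggestion, reusing the made-up array A and a size n from the question; the use of omp_get_max_threads() is my assumption rather than part of the answer:
double *local_max = (double *)calloc(omp_get_max_threads(), sizeof(double));

#pragma omp parallel
{
    int tid = omp_get_thread_num(); // each thread writes only its own slot
    #pragma omp for
    for (int i = 0; i < n; i++)
        if (A[i] > local_max[tid])
            local_max[tid] = A[i];
}
free(local_max);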

Race conditions with OpenMP

I need to fill a 2D array (tmp[Ny][Nx]) in which each cell gets the integral of some function, as a function of free parameters. Since I deal with very large arrays (here I have simplified my case), I need to use OpenMP parallelism in order to speed my calculations up. Here I use the simple #pragma omp parallel for directive.
Without #pragma omp parallel for, the code executes perfectly. But adding the parallel directive produces race conditions in the output.
I tried to cure it by making private(i,j,par), but it did not help.
P.S. I use VS2008 Professional with OpenMP 2.0 under Windows 7.
Here is my code: (a short sample)
double testfunc(const double* var, const double* par)
{
    // here is some simple function to be integrated over
    // var[0] and var[1] and two free parameters par[0] and par[1]
    return ....
}

#define Nx 10000
#define Ny 10000
static double tmp[Ny][Nx];

int main()
{
    double par[2];          // parameters
    double xmin[]={0,0};    // limits of 2D integration
    double xmax[]={1,1};    // limits of 2D integration
    double val,err,Tol=1e-7,AbsTol=1e-7;
    int i,j,NDim=2,NEval=1e5;

    #pragma omp parallel for private(i,j,par,val)
    for (i=0;i<Nx;i++)
    {
        for (j=0;j<Ny;j++)
        {
            par[0]=i;
            par[1]=j*j;
            adapt_integrate(testfunc, par, NDim, xmin, xmax,
                            NEval, Tol, AbsTol, &val, &err);
            // adapt_integrate - receives my integrand, performs
            // integration and returns a result through "val"
            tmp[i][j] = val;
        }
    }
}
It produces race conditions in the output. I tried to avoid them by making all internal variables (i, j, par and val) private, but it doesn't help.
P.S. The serial version (#threads=1) of this code runs properly.
(Answered in the question. Converted to a community wiki answer. See Question with no answers, but issue solved in the comments (or extended in chat) )
The OP wrote:
The problem is solved!
I defined the parameters of integration as globals and used the #pragma omp threadprivate(parGlob) directive for them. Now it works like a charm. I had been thinking that private() and threadprivate() have the same meaning, just different ways of implementation, but they do not.
So, playing with these directives may give a correct answer. Another thing is that declaring the iterator i inside the first for loop gives an additional 20%-30% speed-up. So the fastest version of the code now looks as follows:
double testfunc(const double* var, const double* par)
{
    .......
}

#define Nx 10000
#define Ny 10000
static double tmp[Ny][Nx];

double parGlob[2];                  //<- Here they are!!!
#pragma omp threadprivate(parGlob)  // <- Magic directive!!!!

int main()
{
    // Not here !!!! -> double par[2]; // parameters
    double xmin[]={0,0};    // limits of 2D integration
    double xmax[]={1,1};    // limits of 2D integration
    double val,err,Tol=1e-7,AbsTol=1e-7;
    int j,NDim=2,NEval=1e5;

    #pragma omp parallel for private(j,val) // no `i` inside `private` clause
    for (int i=0;i<Nx;i++)
    {
        for (j=0;j<Ny;j++)
        {
            parGlob[0]=i;
            parGlob[1]=j*j;
            adapt_integrate(testfunc, parGlob, NDim, xmin, xmax,
                            NEval, Tol, AbsTol, &val, &err);
            tmp[i][j] = val;
        }
    }
}
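For readers wondering about the difference mentioned above, here is a tiny illustration of mine (not from the answer): threadprivate gives each thread its own persistent copy of a global variable that is visible in every function the thread calls, whereas private only privatises a variable within the lexical extent of the parallel construct.
int counter;                   // file-scope global
#pragma omp threadprivate(counter)

void bump(void) { ++counter; } // sees the calling thread's own copy

void demo(void)
{
    #pragma omp parallel
    {
        counter = 0;           // each thread initialises its own copy
        bump();                // increments that same per-thread copy
    }
}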

How does the SECTIONS directive in OpenMP distribute work?

In OpenMP, when using omp sections, will the threads be distributed among the blocks inside the sections construct, or will each thread be assigned to a section?
When nthreads == 3:
#pragma omp sections
{
    #pragma omp section
    {
        printf ("id = %d, \n", omp_get_thread_num());
    }
    #pragma omp section
    {
        printf ("id = %d, \n", omp_get_thread_num());
    }
}
Output:
id=1
id=1
But when I execute the following code:
#pragma omp sections
{
    #pragma omp section
    {
        printf ("id = %d, \n", omp_get_thread_num());
    }
    #pragma omp section
    {
        printf ("id = %d, \n", omp_get_thread_num());
    }
}
#pragma omp sections
{
    #pragma omp section
    {
        printf ("id = %d, \n", omp_get_thread_num());
    }
    #pragma omp section
    {
        printf ("id = %d, \n", omp_get_thread_num());
    }
}
Output:
id=1
id=1
id=2
id=2
From this output I can't understand what the concept of sections in OpenMP is.
The code posted by the OP will never execute in parallel, because the parallel keyword does not appear. The fact that the OP got ids different from 0 shows that the code was probably embedded in a parallel directive. However, this is not clear from the post, and it might confuse beginners.
The minimum sensible example is (for the first example posted by the OP):
#pragma omp parallel sections
{
    #pragma omp section
    {
        printf ("id = %d, \n", omp_get_thread_num());
    }
    #pragma omp section
    {
        printf ("id = %d, \n", omp_get_thread_num());
    }
}
On my machine, this prints
id = 0,
id = 1,
showing that the two sections are being executed by different threads.
It's worth noting, however, that this code cannot extract more parallelism than two threads: if it is executed with more threads, the other threads have no work to do and simply sit idle.
The idea of parallel sections is to give the compiler a hint that the various (inner) sections can be performed in parallel, for example:
#pragma omp parallel sections
{
    #pragma omp section
    {
        /* Executes in thread 1 */
    }
    #pragma omp section
    {
        /* Executes in thread 2 */
    }
    #pragma omp section
    {
        /* Executes in thread 3 */
    }
    /* ... */
}
This is a hint to the compiler and not guaranteed to happen, though it usually does. Your output is roughly what is expected: it says that the sections were executed by thread 1 and by thread 2. The output order is non-deterministic, as you don't know which thread will run first.
Change the first line from
#pragma omp sections
into
#pragma omp parallel sections
"parallel" directive ensures that the two sections are assigned to two threads.
Then, you will receive the following output
id = 0,
id = 1,
You are missing the parallel keyword.
The parallel keyword is what makes OpenMP run the code in parallel.
According to OpenMP standard 3.1, section 2.5.2 (emphasis mine):
The sections construct is a noniterative worksharing construct that
contains a set of structured blocks that are to be distributed among
and executed by the threads in a team. Each structured block is
executed once by one of the threads in the team in the context of its
implicit task.
...
Each structured block in the sections construct is preceded by a
section directive except possibly the first block, for which a
preceding section directive is optional. The method of scheduling the
structured blocks among the threads in the team is implementation
defined. There is an implicit barrier at the end of a sections
construct unless a nowait clause is specified.
So, applying these rules to your case, we can argue that:
the different structured blocks identified in a sections directive are executed once, by one thread. In other words, you always get four prints, whatever the number of threads
the blocks in the first sections construct will be executed (in a non-deterministic order) before the blocks in the second sections construct (also executed in a non-deterministic order). This is because of the implicit barrier at the end of the worksharing constructs
the scheduling is implementation defined, so that you can't possibly control which thread has been assigned a given section
Your output is thus due to the way your scheduler decided to assign the different blocks to the threads in the team.
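As a small illustration of the second point, sketched by me rather than taken from the answer: removing the implicit barrier with nowait allows blocks of the second sections construct to start before the first one has finished on all threads.
#pragma omp parallel
{
    #pragma omp sections nowait // no barrier at the end of this construct
    {
        #pragma omp section
        printf("first construct, thread %d\n", omp_get_thread_num());
        #pragma omp section
        printf("first construct, thread %d\n", omp_get_thread_num());
    }
    #pragma omp sections        // implicit barrier at the end of this one
    {
        #pragma omp section
        printf("second construct, thread %d\n", omp_get_thread_num());
        #pragma omp section
        printf("second construct, thread %d\n", omp_get_thread_num());
    }
}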
It may be helpful to add more information to the output line and to add more sections (if you have enough threads):
#pragma omp parallel sections
{
    #pragma omp section
    {
        printf ("section 1 id = %d, \n", omp_get_thread_num());
    }
    #pragma omp section
    {
        printf ("section 2 id = %d, \n", omp_get_thread_num());
    }
    #pragma omp section
    {
        printf ("section 3 id = %d, \n", omp_get_thread_num());
    }
}
Then you may get more interesting output like this:
section 1 id = 4,
section 3 id = 3,
section 2 id = 1,
which shows how the sections may be executed in any order, by any available thread.
Note that 'nowait' tells the compiler that threads do not need to wait when leaving the sections construct. In Fortran, 'nowait' goes at the end of the loop or sections construct, which makes this more obvious.
The #pragma omp parallel is what creates (forks) the threads initially. Only once the threads have been created do the other OpenMP constructs become significant.
Hence,
Method 1:
// this creates the threads
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            // code here
        }
        #pragma omp section
        {
            // code here
        }
    }
}
or
Method 2:
// this creates the threads and the sections in one line
#pragma omp parallel sections
{
    #pragma omp section
    {
        // code here
    }
    #pragma omp section
    {
        // code here
    }
}
If you really want the threads not to wait for each other at the end of the sections, the nowait clause tells the compiler that the threads do not need to wait at the implicit barrier of the sections construct. Note that nowait is not allowed on the combined parallel sections construct, so it goes on a standalone sections directive inside a parallel region:
#pragma omp parallel
{
    #pragma omp sections nowait
    {
        ...
    }
}
