Does #pragma omp parallel for num_threads(1) mean the master thread will execute it? - openmp

If I start a parallel region with a thread count of 1, is it guaranteed that no new threads will be started (hence there will be no overhead) and that the master thread will execute that region?
In other words, can we guarantee that this code will increment all elements of A:
#pragma omp parallel for num_threads(1)
for (int i = 0; i < 1e6; ++i) {
    #pragma omp master
    A[i]++;
}

The intention of your code is not entirely clear. The parallel for combined directive distributes the loop iterations among the threads of the team, but the master construct then restricts the loop body to the master thread, so no work sharing can actually take place. The combination is ambiguous.
The OpenMP specification handles this ambiguity explicitly in Section 2.20. The resolution is simple: you may not do it in a conforming program.
A master region may not be closely nested inside a worksharing, loop, atomic, task, or taskloop region.
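If the intent is simply to have the whole loop executed by a single thread, a conforming rewrite (my reading of the question, not part of the original answer) is to drop the master directive entirely; with num_threads(1) the team consists of only the master thread, which therefore executes every iteration:
#pragma omp parallel for num_threads(1)   // team of one thread (the master)
for (int i = 0; i < 1e6; ++i)
    A[i]++;                               // executed entirely by that one thread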

Related

What is "implicit synchronization" in OpenMP

What exactly is "implicit synchronization" in OpenMP and how can you spot one? My teacher said that
#pragma omp parallel
printf("Hello 1\n");
has an implicit sync. Why? And how do you see it?
Synchronisation is an important issue in parallel processing and in OpenMP. In general, parallel processing is asynchronous: you know that several threads are working on a problem, but you have no way to know exactly what their actual state is, which iteration they are working on, and so on. A synchronisation allows you to regain control over thread execution.
There are two kinds of synchronisation in OpenMP: explicit and implicit. An explicit synchronisation is done with a specific OpenMP construct that creates a barrier: #pragma omp barrier. A barrier is a point that no thread may pass until all the threads of the team have reached it. So after the barrier, you know exactly the state of all threads and, more importantly, what amount of work they have done.
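As a minimal illustration (my own sketch, not from the original answer), every "before" line below is guaranteed to be printed before any "after" line, because of the explicit barrier:
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        printf("before barrier: thread %d\n", omp_get_thread_num());
        #pragma omp barrier   // no thread passes until all threads have arrived
        printf("after barrier: thread %d\n", omp_get_thread_num());
    }
    return 0;
}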
Implicit synchronisation happens in two situations:
at the end of a parallel region. OpenMP relies on a fork-join model. When the program starts, a single thread (the master thread) is created. When you create a parallel region with #pragma omp parallel, several threads are created (fork). These threads work concurrently and at the end of the parallel region are joined back (join). So at the end of a parallel region you have a synchronisation and you know precisely the status of all threads (they have finished their work). This is what happens in the example that you give: the parallel region only contains the printf(), and at its end the program waits for the termination of all threads before continuing.
at the end of some OpenMP constructs like #pragma omp for or #pragma omp sections, there is an implicit barrier. No thread can continue until all the threads have reached the barrier. This lets you know exactly what work has been done by the different threads.
For instance, consider the following code.
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++)
        A[i] = f(i);             // compute values for A
    #pragma omp for
    for (int j = 0; j < N/2; j++)
        B[j] = A[j] + A[j+N/2];  // use the previously computed vector A
} // end of parallel section
As all the threads work asynchronously, you do not know which threads have already finished computing their part of vector A. Without a synchronisation, there is a risk that a thread quickly finishes its share of the first for loop, enters the second for loop and accesses elements of vector A while the threads that are supposed to compute them are still in the first loop and have not yet written the corresponding A[i].
This is the reason why OpenMP compilers add an implicit barrier to synchronise all the threads: you are certain that all threads have finished their work and that all values of A have been computed when the second for loop starts.
But in some situations, no synchronisation is required. For instance, consider the following code:
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++)
        A[i] = f(i); // compute values for A
    #pragma omp for
    for (int j = 0; j < N/2; j++)
        B[j] = g(j); // compute values for B
} // end of parallel section
Obviously the two loops are completely independent, and the second for loop does not need A to be fully computed before it starts. So the synchronisation adds nothing to the correctness of the program, and the synchronisation barrier has two major drawbacks:
If function f() has very different running times, you may have some threads that have finished their work while others are still computing. The synchronisation forces the former threads to wait, and this idleness wastes the available parallelism.
Synchronisations are expensive. A simple way to implement a barrier is to increment a global counter when reaching the barrier and to wait until the value of the counter equals the number of threads, omp_get_num_threads(). To avoid races between threads, the increment of the global counter must be done with an atomic read-modify-write, which requires a large number of cycles, and the wait for the proper value of the counter is typically done with a spin lock that wastes processor cycles.
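To make the cost concrete, here is a toy sketch (my own illustration, not how real OpenMP runtimes implement barriers) of the counting-barrier idea just described: one atomic increment followed by a spin wait. A real barrier must also be reusable and far more careful:
#include <omp.h>

static int arrived = 0;          // shared count of threads that reached the barrier

void toy_barrier(int nthreads)   // single-use barrier, for illustration only
{
    #pragma omp atomic
    arrived++;                   // atomic read-modify-write: costs many cycles

    int seen;
    do {                         // spin wait: burns processor cycles
        #pragma omp atomic read
        seen = arrived;
    } while (seen < nthreads);
}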
So there is a clause, nowait, to suppress these implicit synchronisations, and the best way to write the previous loops would be:
#pragma omp parallel
{
    #pragma omp for nowait // nowait suppresses implicit synchronisations.
    for (int i = 0; i < N; i++)
        A[i] = f(i); // compute values for A
    #pragma omp for
    for (int j = 0; j < N/2; j++)
        B[j] = g(j); // compute values for B
} // end of parallel section
This way, as soon as a thread has finished its work in the first loop, it immediately starts to process the second for loop, and, depending on the actual program, this may significantly reduce execution time.

How will the threads be scheduled when their number is greater than the number of iterations?

I am a beginner to OpenMP. There are some loops in my code and they have different numbers of iterations.
I set the number of threads to the greatest number of iterations among these loops.
But what will happen when the number of threads is greater than the number of iterations for some of my loops?
The code to specify the number of threads is
#pragma omp parallel for num_threads
Assuming that when you say "number of loops" you mean "number of iterations in a loop", then the answer should be fairly clear!
Consider
#pragma omp parallel num_threads(10)
#pragma omp for
for (int i = 0; i < 5; i++)
    ...
There are only five iterations, so the largest number of threads that can execute an iteration is also five (if each of them executes one iteration), and, therefore, at least five threads will immediately skip to the implicit barrier at the end of the for loop.
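For a quick experiment (my own runnable variant of the snippet above, not part of the original answer), the following prints which thread executes each iteration; with ten threads and five iterations, at most five distinct thread numbers can appear:
#include <stdio.h>
#include <omp.h>

int main()
{
    // Request 10 threads for a loop with only 5 iterations: at least
    // 5 threads get no iteration and go straight to the implicit barrier.
    #pragma omp parallel num_threads(10)
    #pragma omp for
    for (int i = 0; i < 5; i++)
        printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());

    return 0;
}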
However, I get the impression that what you're doing is trying to adjust the number of threads in each parallel region to match the number of iterations. That is a bad idea. Changing the number of threads (the team size) is a s..l..o..w operation.
Indeed, it is generally a bad idea to set the number of threads explicitly at all. In general the OpenMP runtime will set the number of threads appropriately for the hardware on which the program is running. You don't need to force the number of threads yourself unless you are doing scaling tests, in which case using the OMP_NUM_THREADS environment variable is still easier!
Forcing the number of threads "because my machine has that many cores" prompts these questions:
Will you have that machine for the rest of your life?
Are you the only person who will ever run this code?
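As a quick check of what the runtime picks on its own (my own snippet, not part of the original answer), you can print the defaults before forcing anything:
#include <stdio.h>
#include <omp.h>

int main()
{
    // With no num_threads clause and OMP_NUM_THREADS unset, the runtime
    // chooses a default team size, typically matching the available cores.
    printf("default max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    #pragma omp single
    printf("team size actually used: %d\n", omp_get_num_threads());

    return 0;
}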

OpenMP for-loop chunk scheduling visualization

Are there tools that visualize execution of OpenMP for-loop chunks?
For example, consider the parallel for-loop below:
#pragma omp parallel for schedule(dynamic, 10) num_threads(4)
for (int i = 1; i < 100; i++)
{
    // do work of uneven execution time
}
I want to visualize on which thread each of the 10 chunks (say (1,10), (11,20), ..., (91,100)) executed and how long each took, without modifying the code.
I understand that only four parallel outline functions (one per thread) are started, and that each of these functions asks for chunks in a synchronized manner. I can visualize the four parallel outline functions in tools such as Intel VTune, but I am unable to drill this visualization down to the chunk level.
Thanks in advance for your tips and suggestions!

Different OpenMP output on different machines

When I run the following code on my CentOS system (running in a virtual machine) I get the right output, but when I run the same code on the compact supercomputer "Param Shavak" I get incorrect output:
#include <stdio.h>
#include <omp.h>

int main()
{
    int p=1, s=1, tid;
    #pragma omp parallel private(p,tid) shared(s)
    {
        p = 1;
        tid = omp_get_thread_num();
        p = p + tid;
        s = s + tid;
        printf("Thread %d P=%d S=%d\n", tid, p, s);
    }
    return 0;
}
If your program runs correctly on one machine, it is most likely because it is not actually running in parallel on that machine (or the race just happens not to show up there).
Your program suffers from a race condition in the line s=s+tid;. s is a shared variable, so several threads try to update it at the same time, which results in lost updates.
You can fix the problem by turning that line of code into an atomic operation:
#pragma omp atomic
s=s+tid;
That way only one thread at a time can read and update the variable s, and the race condition is no more.
In more complex programs you should use atomic operations or critical regions only when necessary, because you don't have parallelism in those regions and that hurts performance.
EDIT: As suggested by user High Performance Mark, I must remark that the program above is very inefficient because of the atomic operation. The proper way to do that kind of calculation (adding to the same variable in all iterations of a loop) is to use a reduction. OpenMP makes this easy with the reduction clause, which is attached to a parallel or worksharing directive:
#pragma omp parallel reduction(operator : variable-list)
Try this version of your program, using reduction:
#include <stdio.h>
#include <omp.h>

int main()
{
    int p=1, s=1, tid;
    #pragma omp parallel reduction(+:s) private(p,tid)
    {
        p = 1;
        tid = omp_get_thread_num();
        p = p + tid;
        s = s + tid;   // each thread updates its own private copy of s
        printf("Thread %d P=%d S=%d\n", tid, p, s);
    }
    // after the region, the private copies are combined: s == 1 + sum of all tids
    return 0;
}
The following link explains critical sections, atomic operations and reduction in a more verbose way: http://www.lindonslog.com/programming/openmp/openmp-tutorial-critical-atomic-and-reduction/

OpenMP, use all cores with parallel for

I have a computer with 4 cores and an OpenMP application with 2 weighty tasks.
int main()
{
    #pragma omp parallel sections
    {
        #pragma omp section
        WeightyTask1();
        #pragma omp section
        WeightyTask2();
    }
    return 0;
}
Each task has a weighty part like this:
#pragma omp parallel for
for (int i = 0; i < N; i++)
{
    ...
}
I compiled the program with the -fopenmp flag and did export OMP_NUM_THREADS=4.
The problem is that only two cores are loaded. How can I use all cores in my tasks?
My initial reaction was: you have to declare more parallelism.
You have defined two tasks that can run in parallel, so any attempt by OpenMP to run them on more than two cores will slow you down (because of cache locality and possible false sharing).
Edit: If the parallel for loops are of any significant volume (say, not under 8 iterations) and you are not seeing more than 2 cores used, look at:
omp_set_nested()
the OMP_NESTED=TRUE|FALSE environment variable
This environment variable enables or disables nested parallelism. The setting of this environment variable can be overridden by calling the omp_set_nested() runtime library function.
If nested parallelism is disabled, nested parallel regions are serialized and run in the current thread.
In the current implementation, nested parallel regions are always serialized. As a result, OMP_SET_NESTED does not have any effect, and omp_get_nested() always returns 0. If -qsmp=nested_par option is on (only in non-strict OMP mode), nested parallel regions may employ additional threads as available. However, no new team will be created to run nested parallel regions.
The default value for OMP_NESTED is FALSE.
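Putting this together, here is a self-contained sketch of enabling nested parallelism (my own illustration, with a stub function standing in for the WeightyTask bodies; whether extra threads are actually used depends on the runtime, as the documentation quoted above notes):
#include <stdio.h>
#include <omp.h>

// Stub standing in for WeightyTask1()/WeightyTask2(): each one contains
// its own parallel for, as in the question.
static void WeightyTask(int id)
{
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 8; i++)
        printf("task %d, iteration %d on thread %d (nesting level %d)\n",
               id, i, omp_get_thread_num(), omp_get_level());
}

int main()
{
    // Allow nested parallel regions; without this the inner parallel for
    // is serialized and only the two section threads do any work.
    // (omp_set_nested() is deprecated in OpenMP 5.0 in favour of
    // omp_set_max_active_levels(), but it is still widely supported.)
    omp_set_nested(1);

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        WeightyTask(1);
        #pragma omp section
        WeightyTask(2);
    }
    return 0;
}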
