What is "implicit synchronization" in OpenMP - parallel-processing

What exactly is "implicit synchronization" in OpenMP, and how can you spot one? My teacher said that
#pragma omp parallel
printf("Hello 1\n");
has an implicit sync. Why? And how do you see it?

Synchronisation is an important issue in parallel processing and in OpenMP. In general, parallel processing is asynchronous: you know that several threads are working on a problem, but you have no way of knowing their exact state, which iteration they are working on, and so on. A synchronisation gives you back some control over thread execution.
There are two kinds of synchronisation in OpenMP: explicit and implicit. An explicit synchronisation is requested with a specific OpenMP construct that creates a barrier: #pragma omp barrier. No thread of the team may continue past a barrier until all the threads of the team have reached it. So after the barrier, you know exactly the state of all threads and, more importantly, how much work they have done.
Implicit synchronisation happens in two situations:
at the end of a parallel region. OpenMP relies on a fork-join model. When the program starts, a single thread (the master thread) is created. When you open a parallel region with #pragma omp parallel, a team of threads is created (fork). These threads work concurrently and, at the end of the parallel region, they are joined: only the master thread continues (many runtimes keep the worker threads in a pool rather than actually destroying them). So at the end of a parallel region, there is a synchronisation and you know precisely the status of all threads (they have finished their work). This is what happens in the example that you give. The parallel region only contains the printf() and at the end, the program waits for all threads to finish before continuing.
at the end of some OpenMP constructs like #pragma omp for or #pragma omp sections, there is an implicit barrier. No thread can continue until all the threads have reached the barrier. This is important to know exactly what work has been done by the different threads.
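The first situation can be checked with a small sketch. The function name all_threads_joined is made up for this illustration; it counts the threads that went through the region and compares the count with the team size once the region has ended. The code is written so that it also compiles (as a single-threaded program) without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns 1 if, right after the parallel region,
   every thread of the team has recorded its arrival. The join at the end
   of the region is what makes this check safe without an explicit barrier. */
int all_threads_joined(void)
{
    int team_size = 1;   /* stays 1 when compiled without OpenMP */
    int arrived = 0;

    #pragma omp parallel
    {
#ifdef _OPENMP
        /* one thread records the team size */
        #pragma omp single
        team_size = omp_get_num_threads();
#endif
        /* every thread announces that it has done its "work" */
        #pragma omp atomic
        arrived++;
    }   /* implicit barrier: the initial thread waits here for the whole team */

    return arrived == team_size;
}
```

Calling all_threads_joined() from main should return 1 regardless of the number of threads, because the code after the closing brace only runs once the whole team is done.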
For instance, consider the following code.
#pragma omp parallel
{
    #pragma omp for
    for(int i=0; i<N; i++)
        A[i]=f(i); // compute values for A
    // implicit barrier at the end of the first omp for
    #pragma omp for
    for(int j=0; j<N/2; j++)
        B[j]=A[j]+A[j+N/2]; // use the previously computed vector A
} // end of parallel region
As all the threads work asynchronously, you do not know which threads have finished creating their part of vector A. Without a synchronisation, there is a risk that a thread quickly finishes its part of the first for loop, enters the second for loop and reads elements of vector A while the threads that are supposed to compute them are still in the first loop and have not yet written the corresponding values A[i].
This is the reason why OpenMP places an implicit barrier at the end of the for construct to synchronize all the threads. So you are certain that all threads have finished all their work and that all values of A have been computed when the second for loop starts.
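A self-contained version of this example might look as follows. The array size, the body of f() and the function name dependent_loops are assumptions made for the sketch; the point is only that the check at the end finds no wrong entry in B, because the implicit barrier separates the two loops.

```c
#define N 1000

/* Hypothetical stand-in for f(); any pure function of i would do. */
static double f(int i) { return 2.0 * i; }

/* Returns the number of wrong entries in B (expected: 0). */
int dependent_loops(void)
{
    static double A[N], B[N / 2];

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            A[i] = f(i);                 /* compute values for A */
        /* implicit barrier: all of A is written before any thread goes on */

        #pragma omp for
        for (int j = 0; j < N / 2; j++)
            B[j] = A[j] + A[j + N / 2];  /* reads elements written by other threads */
    }

    int errors = 0;
    for (int j = 0; j < N / 2; j++)
        if (B[j] != f(j) + f(j + N / 2))
            errors++;
    return errors;
}
```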
But in some situations, no synchronisation is required. For instance, consider the following code:
#pragma omp parallel
{
    #pragma omp for
    for(int i=0; i<N; i++)
        A[i]=f(i); // compute values for A
    #pragma omp for
    for(int j=0; j<N/2; j++)
        B[j]=g(j); // compute values for B
} // end of parallel region
Obviously the two loops are completely independent, and the second for loop does not need A to be fully computed before it starts. So the synchronisation adds nothing to the correctness of the program, and the barrier has two major drawbacks:
If the running time of f() varies a lot from call to call, some threads may have finished their work while others are still computing. The synchronisation forces the former threads to wait, and this idle time means the available parallelism is not fully exploited.
Synchronisations are expensive. A simple way to implement a barrier is to increment a global counter when reaching the barrier and to wait until the value of the counter is equal to the number of threads, omp_get_num_threads(). To avoid races between threads, the counter must be incremented with an atomic read-modify-write, which requires a large number of cycles, and the wait for the proper value of the counter is typically done with a spin loop that wastes processor cycles.
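The counter-based barrier described above could be sketched like this. It is a single-use toy, not how any particular OpenMP runtime implements its barriers; the function name counter_barrier_demo and the team size of at most 4 threads are assumptions for the illustration. The flush directives make the ordinary writes to done[] visible across the hand-made barrier.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical demo: each thread marks its slot, passes a hand-made
   counter barrier, then checks that every other slot is already marked.
   Returns the number of unmarked slots observed (expected: 0). */
int counter_barrier_demo(void)
{
    static int done[4];
    int arrived = 0;      /* the global counter */
    int violations = 0;

    for (int t = 0; t < 4; t++)
        done[t] = 0;

    #pragma omp parallel num_threads(4)
    {
        int nthreads = 1, tid = 0;
#ifdef _OPENMP
        nthreads = omp_get_num_threads();
        tid = omp_get_thread_num();
#endif
        done[tid] = 1;               /* "work" done before the barrier */

        #pragma omp flush            /* publish done[tid] before announcing arrival */
        #pragma omp atomic update
        arrived++;                   /* atomic read-modify-write on the counter */

        int seen;
        do {                         /* spin until every thread has arrived */
            #pragma omp atomic read
            seen = arrived;
        } while (seen < nthreads);
        #pragma omp flush            /* make the other threads' writes visible */

        int missing = 0;
        for (int t = 0; t < nthreads; t++)
            if (!done[t])
                missing++;

        #pragma omp atomic update
        violations += missing;
    }
    return violations;
}
```

Both costs mentioned above are visible here: the atomic increment serialises the threads on one memory location, and the do/while loop burns cycles while waiting.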
So there is a clause to suppress the implicit synchronisation, nowait, and the best way to write the previous loops would be:
#pragma omp parallel
{
    #pragma omp for nowait // nowait removes the implicit barrier at the end of this loop
    for(int i=0; i<N; i++)
        A[i]=f(i); // compute values for A
    #pragma omp for
    for(int j=0; j<N/2; j++)
        B[j]=g(j); // compute values for B
} // end of parallel region
This way, as soon as a thread has finished its work in the first loop, it immediately starts on the second for loop and, depending on the actual program, this may significantly reduce execution time.
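As a runnable sketch of the nowait version: the bodies of f() and g(), the array size and the function name independent_loops are assumptions. Both arrays are only checked after the parallel region, whose closing barrier cannot be removed and guarantees that A and B are complete.

```c
#define N 1000

/* Hypothetical stand-ins for f() and g(). */
static double f(int i) { return 2.0 * i; }
static double g(int j) { return j + 1.0; }

/* Returns the number of wrong entries in A and B (expected: 0). */
int independent_loops(void)
{
    static double A[N], B[N / 2];

    #pragma omp parallel
    {
        #pragma omp for nowait          /* no barrier after this loop */
        for (int i = 0; i < N; i++)
            A[i] = f(i);

        #pragma omp for                 /* never reads A, so starting early is safe */
        for (int j = 0; j < N / 2; j++)
            B[j] = g(j);
    }   /* the barrier at the end of the region still applies */

    int errors = 0;
    for (int i = 0; i < N; i++)
        if (A[i] != f(i))
            errors++;
    for (int j = 0; j < N / 2; j++)
        if (B[j] != g(j))
            errors++;
    return errors;
}
```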

Related

Does #pragma omp parallel for num_threads(1) mean master thread will execute it?

If I start a parallel region with a number of threads 1, is it guaranteed that no new threads will be started (hence, there will be no overhead) and the master thread will execute that region?
In other words, can we guarantee that this code will increment all elements of A:
#pragma omp parallel for num_threads(1)
for(int i=0; i< 1e6; ++i){
#pragma omp master
A[i]++;
}
The intention of your code is not entirely clear. By using the parallel for combined directive you start sharing the work between the threads. Then you restrict the body of the loop to the master thread of the team thus no work sharing may occur. This is ambiguous.
The OpenMP specification handles this ambiguity explicitly in Section 2.20. The solution is simple: You may not do it in a conforming program.
A master region may not be closely nested inside a worksharing, loop, atomic, task, or taskloop region.
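One conforming way to express the apparent intent is simply to drop the master construct. With num_threads(1) the team consists of the encountering thread only, so that thread executes every iteration. A minimal sketch, where the array length and the function name increment_all are assumptions:

```c
#define LEN 1000

/* Returns the number of elements that were not incremented (expected: 0). */
int increment_all(void)
{
    static int A[LEN];
    for (int i = 0; i < LEN; ++i)
        A[i] = i;

    /* Team of one thread: the encountering thread runs all iterations. */
    #pragma omp parallel for num_threads(1)
    for (int i = 0; i < LEN; ++i)
        A[i]++;

    int missed = 0;
    for (int i = 0; i < LEN; ++i)
        if (A[i] != i + 1)
            missed++;
    return missed;
}
```

Whether entering such a one-thread region is free of overhead depends on the runtime; the sketch only shows that all elements are incremented.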

OpenMP for-loop chunk scheduling visualization

Are there tools that visualize execution of OpenMP for-loop chunks?
For example, consider the parallel for-loop below:
#pragma omp parallel for schedule(dynamic, 10) num_threads(4)
for(int i=1; i<100; i++)
{
// do work of uneven execution time.
}
I want to visualize which thread executed each of the 10 chunks (say (1,10),(11,20),...,(91,100)) and how long each chunk took, without modifying the code.
I understand that only four (one per thread) parallel outline functions are started, and that each of these functions ask for chunks in a synchronized manner. I can visualize the four parallel outline functions in tools such as Intel VTune, but am unable to drill this visualization down to the chunk level.
Thanks in advance for your tips and suggestions!

different OpenMP output in different machine

When I run the following code on my system (CentOS running in a virtual machine) I get the right output, but when I run the same code on the compact supercomputer "Param Shavak" I get incorrect output.
#include<stdio.h>
#include<omp.h>
int main()
{
int p=1,s=1,tid;
#pragma omp parallel private(p,tid) shared(s)
{
p=1;
tid=omp_get_thread_num();
p=p+tid;
s=s+tid;
printf("Thread %d P=%d S=%d\n",tid,p,s);
}
return 0;
}
If your program appears to run correctly on one machine, it is most likely because it is not actually running in parallel on that machine (for example, the virtual machine has a single core), or because the race simply does not show up there.
Your program suffers from a race condition in the s=s+tid; line of code. s is a shared variable, so several threads may try to update it at the same time, which results in lost updates.
You can fix the problem by turning that line of code into an atomic operation:
#pragma omp atomic
s=s+tid;
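That way only one thread at a time can read and update the variable s, and the race condition on the update is gone. Note that the printf still reads s while other threads may be updating it, so the intermediate values of S that are printed can differ from run to run; only the final value of s is deterministic.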
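A compact way to check the atomic fix is to look at the final value of s after the region: with n threads it should be 1 + 0 + 1 + ... + (n-1). The function name sum_ids_atomic and its out-parameter are made up for this sketch.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Adds every thread id to s with an atomic update.
   Stores the team size in *nthreads and returns the final s. */
int sum_ids_atomic(int *nthreads)
{
    int s = 1;
    *nthreads = 1;       /* value used when compiled without OpenMP */

    #pragma omp parallel shared(s)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        #pragma omp single
        *nthreads = omp_get_num_threads();
#endif
        #pragma omp atomic
        s += tid;        /* one thread at a time updates s */
    }
    return s;
}
```

For a team of 4 threads this gives 1 + 0 + 1 + 2 + 3 = 7, whatever the order in which the threads run.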
In more complex programs you should use atomic operations or critical regions only when necessary, because you don't have parallelism in those regions and that hurts performance.
EDIT: As suggested by user High Performance Mark, I must remark that the program above is very inefficient because of the atomic operation. The proper way to do that kind of calculation (adding to the same variable in all iterations of a loop) is to implement a reduction. OpenMP makes it easy by using the reduction clause:
#pragma omp parallel reduction(operator : variables)
Try this version of your program, using reduction:
#include<stdio.h>
#include<omp.h>
int main()
{
int p=1,s=1,tid;
#pragma omp parallel reduction(+:s) private(p,tid)
{
p=1;
tid=omp_get_thread_num();
p=p+tid;
s=s+tid;
printf("Thread %d P=%d S=%d\n",tid,p,s);
}
return 0;
}
The following link explains critical sections, atomic operations and reduction in a more verbose way: http://www.lindonslog.com/programming/openmp/openmp-tutorial-critical-atomic-and-reduction/
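When reading the output of the reduction version, keep in mind that inside the region each thread works on its own private copy of s, initialised to 0 for the + operator; the copies are combined with the original value only at the end of the region. The sketch below (function name and out-parameter are assumptions) therefore inspects s after the region.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Adds every thread id to s through a reduction.
   Stores the team size in *nthreads and returns the final s. */
int sum_ids_reduction(int *nthreads)
{
    int s = 1;
    *nthreads = 1;       /* value used when compiled without OpenMP */

    #pragma omp parallel reduction(+:s)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        #pragma omp single
        *nthreads = omp_get_num_threads();
#endif
        s += tid;        /* private copy of s, starts at 0 in each thread */
    }                    /* private copies are added to the original s here */
    return s;
}
```

The final value matches the atomic version (1 plus the sum of the thread ids), but no synchronisation is needed inside the region.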

OpenMP slower reduction

There are two versions of an OpenMP code, one with reduction and one without.
// with reduction
#pragma omp parallel for reduction(+:sum)
for (i=1;i<= num_steps; i++){
x = (i-0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
// without reduction
#pragma omp parallel private(i)
{
int id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
double x;
double partial_sum = 0;
for (i=id;i< num_steps; i+=numthreads){
x = (i+0.5)*step;
partial_sum += + 4.0/(1.0+x*x);
}
#pragma omp critical
sum += partial_sum;
}
I ran both codes using 8 cores, and the reduction version takes twice as long in total. What's the reason? Thanks.
Scalar reduction in OpenMP is usually quite fast. The behaviour you observe comes from two different mistakes, one in each code.
In your first code you did not make x private. Therefore it is shared among the threads and, besides giving incorrect results, the execution suffers from the data sharing. Whenever one thread writes to x, the core that it executes on sends a message to all other cores and makes them invalidate their copies of that cache line. When any of them writes to x later, the whole cache line has to be reloaded and then the cache lines in all other cores get invalidated. And so forth. This slows things down significantly.
In your second code you have used the OpenMP critical construct. This is relatively heavyweight compared with the atomic adds that are usually used to implement the reduction at the end. Atomic adds on x86 are performed using the LOCK instruction prefix and everything is implemented in hardware. Critical sections, on the other hand, are implemented using mutexes and require several instructions and often busy-waiting loops. This is far less efficient than the atomic adds.
In the end, your first code is slowed down by bad data sharing and your second code by the use of an unsuitable synchronisation primitive. It just happens that on your particular system the latter effect is less severe than the former, and hence your second example runs faster.
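A corrected version of the first code could look like the sketch below. The question does not show how step is defined or how the result is scaled, so step = 1/num_steps and the final multiplication by step are assumptions (the usual midpoint-rule integration of 4/(1+x*x) on [0,1], which approximates pi). Declaring x inside the loop body makes it private to each thread.

```c
/* Midpoint-rule approximation of pi; compute_pi is a hypothetical name. */
double compute_pi(long num_steps)
{
    double step = 1.0 / (double)num_steps;   /* assumed definition of step */
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= num_steps; i++) {
        double x = (i - 0.5) * step;   /* declared in the loop body, hence private */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

An equivalent fix is to keep x declared outside the loop and add private(x) to the directive.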
If you want to manually parallelize the loop as well as the reduction you can do it like this:
#pragma omp parallel private(i)
{
    int id = omp_get_thread_num();
    int numthreads = omp_get_num_threads();
    // each thread takes one contiguous block of iterations
    int start = id*num_steps/numthreads;
    int finish = (id+1)*num_steps/numthreads;
    double x;
    double partial_sum = 0;
    for (i=start; i<finish; i++){
        x = (i+0.5)*step;
        partial_sum += 4.0/(1.0+x*x);
    }
    // one atomic add per thread combines the partial sums
    #pragma omp atomic
    sum += partial_sum;
}
However, I don't recommend this. Reductions don't have to be done with atomic and you should just let OpenMP parallelize the loop. The first case is the best solution (but make sure you declare x private).
Edit: According to Hristo, once you make x private these two methods are nearly the same in speed. I want to explain why using critical in your second method, instead of atomic or letting OpenMP do the reduction, has hardly any effect on the performance in this case.
There are two ways I can think of doing a reduction:
Sum the partial sums linearly using atomic or critical
Sum the partial sums using a tree. I.e. if you have 8 cores this gives you eight partial sums; you reduce these to 4 partial sums, then 2 partial sums, then 1.
The cost of the first case grows linearly with the number of cores. The cost of the second case grows as the log of the number of cores. So one may be tempted to think the second case is always better. However, for only eight cores the run time is entirely dominated by computing the partial sums. The difference between adding eight numbers with atomic/critical and reducing the tree in 3 steps will be negligible.
What if you have e.g. 1024 cores? Then the tree can be reduced in only 10 steps and the linear sum takes 1024 steps. But the constant term can be much larger for the second case, and computing the partial sums of a large array, e.g. with 1 million elements, probably still dominates the reduction.
So I suspect that using atomic or even critical for a reduction has a negligible effect on the total run time in general.
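To make the tree variant concrete, here is a minimal sketch that combines an array of partial sums pairwise. The function name tree_reduce is made up, and the number of partial sums is assumed to be a power of two; this is an illustration of the idea, not how an OpenMP runtime implements its reduction clause.

```c
/* Pairwise (tree) reduction of n partial sums, done in place.
   Assumes n is a power of two. The total ends up in partial[0];
   the return value is the number of combining steps (log2 of n). */
int tree_reduce(double *partial, int n)
{
    int steps = 0;
    for (int stride = 1; stride < n; stride *= 2) {
        /* all pairs within one step are independent of each other */
        #pragma omp parallel for
        for (int k = 0; k < n; k += 2 * stride)
            partial[k] += partial[k + stride];
        steps++;
    }
    return steps;
}
```

For eight partial sums the loop runs three times (8 to 4 to 2 to 1), compared with eight serialised additions in the atomic/critical approach.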

OpenMP, use all cores with parallel for

I have a computer with 4 cores and an OpenMP application with 2 weighty tasks.
int main()
{
#pragma omp parallel sections
{
#pragma omp section
WeightyTask1();
#pragma omp section
WeightyTask2();
}
return 0;
}
Each task has a weighty part like this:
#pragma omp parallel for
for (int i = 0; i < N; i++)
{
...
}
I compiled the program with the -fopenmp option and ran export OMP_NUM_THREADS=4.
The problem is that only two cores are loaded. How can I use all cores in my tasks?
My initial reaction was: You have to declare more parallelism.
You have defined two tasks that can run in parallel. Any attempt by OpenMP to run it on more than two cores will slow you down (because of cache locality and possible false sharing).
Edit: If the parallel for loops are of any significant size (say, not under 8 iterations), and you are not seeing more than 2 cores used, look at
omp_set_nested()
the OMP_NESTED=TRUE|FALSE environment variable
This environment variable enables or disables nested parallelism. The setting of this environment variable can be overridden by calling the omp_set_nested() runtime library function.
If nested parallelism is disabled, nested parallel regions are serialized and run in the current thread.
In the current implementation, nested parallel regions are always serialized. As a result, OMP_SET_NESTED does not have any effect, and omp_get_nested() always returns 0. If -qsmp=nested_par option is on (only in non-strict OMP mode), nested parallel regions may employ additional threads as available. However, no new team will be created to run nested parallel regions.
The default value for OMP_NESTED is FALSE.
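A sketch of what enabling nested parallelism might look like for the two-task program is given below. It uses omp_set_max_active_levels(), which has been part of the API since OpenMP 3.0 and which OpenMP 5.0 recommends over the deprecated omp_set_nested() and OMP_NESTED. The function weighty_task is a hypothetical stand-in for WeightyTask1/WeightyTask2, and the split of 2 outer times 2 inner threads is an assumption matching the 4 cores in the question; whether the inner regions really get extra threads still depends on the implementation.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define N 100000

/* Hypothetical stand-in for a weighty task: sums 0..N-1 in an inner parallel loop. */
static long long weighty_task(void)
{
    long long total = 0;
    #pragma omp parallel for reduction(+:total) num_threads(2)
    for (int i = 0; i < N; i++)
        total += i;
    return total;
}

/* Runs both tasks in parallel sections and returns the sum of their results. */
long long run_both_tasks(void)
{
    long long r1 = 0, r2 = 0;
#ifdef _OPENMP
    omp_set_max_active_levels(2);   /* allow one level of nested parallel regions */
#endif
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        r1 = weighty_task();        /* outer thread 1, inner team of up to 2 */
        #pragma omp section
        r2 = weighty_task();        /* outer thread 2, inner team of up to 2 */
    }
    return r1 + r2;
}
```

The result is the same whether or not the inner regions are actually run in parallel; only the number of busy cores changes.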

Resources