OpenMP, use all cores with parallel for - parallel-processing

I have computer with 4 cores and OMP application with 2 weighty tasks.
int main()
{
#pragma omp parallel sections
{
#pragma omp section
WeightyTask1();
#pragma omp section
WeightyTask2();
}
return 0;
}
Each task has such weighty part:
#omp pragma parallel for
for (int i = 0; i < N; i++)
{
...
}
I compiled program with -fopenmp parameter, made export OMP_NUM_THREADS=4.
The problem is that only two cores are loaded. How I can use all cores in my tasks?

My initial reaction was: You have to declare more parallelism.
You have defined two tasks that can run in parallel. Any attempt by OpenMP to run it on more than two cores will slow you down (because of cache locality and possible false sharing).
Edit If the parallel for loops are of any significant volume (say, not under 8 iterations), and you are not seeing more than 2 cores used, look at
omp_set_nested()
the OMP_NESTED=TRUE|FALSE environment variable
This environment variable enables or disables nested parallelism. The setting of this environment variable can be overridden by calling the omp_set_nested() runtime library function.
If nested parallelism is disabled, nested parallel regions are serialized and run in the current thread.
In the current implementation, nested parallel regions are always serialized. As a result, OMP_SET_NESTED does not have any effect, and omp_get_nested() always returns 0. If -qsmp=nested_par option is on (only in non-strict OMP mode), nested parallel regions may employ additional threads as available. However, no new team will be created to run nested parallel regions.
The default value for OMP_NESTED is FALSE.

Related

Is there a way to efficiently synchronize a subset of theads using OpenMP?

Because OpenMP nested parallelism has often performance problems, I was wondering if there is a way to implement a partial barrier that would synchronize only a subset of all threads.
Here is an example code structure:
#pragma omp parallel
{
int nth = omp_get_num_threads();
int tid = omp_get_thread_num();
if (tid < nth/2) {
// do some work
...
// I need some synchronization here, but only for nth/2 threads
#pragma omp partial_barrier(nth/2)
// do some more work
...
} else {
// do some other independent work
...
}
}
I did not find anything like that in the openmp standard, but maybe there is a way to efficiently program a similar behaviour with locks or something?
EDIT:
So my actual problem is that I have a computation kernel (a Legendre transform -- part A) that is efficiently parallelized using OpenMP with 4 to 12 threads depending on the problem size.
This computation is followed by a Fast Fourier transform (part B).
I have several independent datasets (typically about 6 to 10) that have to be processed by A followed by B.
I would like to use more parallelism with more threads (48 to 128, depending on the machines).
Since A is not efficiently parallelized with more than 4 to 12 threads, the idea is to split the threads into several groups, each group working on an independent dataset. Because the datasets are independent, I don't need to synchronize all the threads (which is quite expensive when many threads are used) before doing B, only the subset working on a given dataset.
OpenMP tasks with depenencies would do what I need, but my experience is that on some platforms (xeon servers) the performance is significantly lower to what you can get with simple threads.
Is there a way to synchronize a subset of threads efficiently?

Does #pragma omp parallel for num_threads(1) mean master thread will execute it?

If I start a parallel region with a number of threads 1, is it guaranteed that no new threads will be started (hence, there will be no overhead) and the master thread will execute that region?
In other words, can we guarantee that this code will increment all elements of A:
#pragma omp parallel for num_threads(1)
for(int i=0; i< 1e6; ++i){
#pragma omp master
A[i]++;
}
The intention of your code is not entirely clear. By using the parallel for combined directive you start sharing the work between the threads. Then you restrict the body of the loop to the master thread of the team thus no work sharing may occur. This is ambiguous.
The OpenMP specification handles this ambiguity explicitly in Section 2.20. The solution is simple: You may not do it in a conforming program.
A master region may not be closely nested inside a worksharing, loop,
atomic, task, or taskloop region.

What is "implicit synchronization" in OpenMP

What is exactly "implicit synchronization" in OpenMP and how can you spot one? My teacher said that
#pragma omp parallel
printf(“Hello 1\n”);
Has an implicit sync. Why? And how do you see it?
Synchronisation is an important issue in parallel processing and in openmp. In general parallel processing is asynchronous. You know that several threads are working on a problem, but you have no way to know exactly what is their actual state, the iteration they are working on, etc. A synchronisation allows you get control on thread execution.
There are two kinds of synchronisations in openmp: explicit and implicit. An explicit synchronisation is done with a specific openmp construct that allows to create a barrier: #pragma omp barrier. A barrier is a parallel construct that can only be passed by all the threads simultaneously. So after the barrier, you know exactly the state of all threads and, more importantly, what amount of work they have done.
Implicit synchronisation is done in two situations:
at the end of a parallel region. Openmp relies on a fork-join model. When the program starts, a single thread (master thread) is created. When you create a parallel section by #pragma omp parallel, several threads are created (fork). These threads will work concurrently and at the end of the parallel section will be destroyed (join). So at the end of a parallel section, you have a synchronisation and you know precisely the status of all threads (they have finished their work). This is what happens in the example that you give. The parallel section only contains the printf() and at the end, the program waits for the termination of all threads before continuing.
at the end of some openmp constructs like #pragma omp for or #pragma omp sections, there is an implicit barrier. No thread can continue working as long as all the threads have not reached the barrier. This is important to know exactly what work has been done by the different threads.
For instance, consider the following code.
#pragma omp parallel
{
#pragma omp for
for(int i=0; i<N; i++)
A[i]=f(i); // compute values for A
#pragma omp for
for(int j=0; j<N/2; j++)
B[j]=A[j]+A[j+N/2];// use the previously computed vector A
} // end of parallel section
As all the threads work asynchronously, you do not know which threads have finished creating their part of vector A. Without a synchronisation, there is a risk that a thread finishes rapidly its part of the first for loop, enters the second for loop and accesses elements of vector A while the threads that are supposed to compute them are still in the first loop and have not computed the corresponding value of A[i].
This is reason why openmp compilers add an implicit barrier to synchronize all the threads. So you are certain that all threads have finished all their work and that all values of A have been computed when the second for loop starts.
But in some situations, no synchronisation is required. For instance, consider the following code:
#pragma omp parallel
{
#pragma omp for
for(int i=0; i<N; i++)
A[i]=f(i); // compute values for A
#pragma omp for
for(int j=0; j<N/2; j++)
B[j]=g(j);// compute values for B
} // end of parallel section
Obviously the two loops are completely independent and it does not matter if A is properly computed to start the second for loop. So the synchronisation gives nothing for the program correctness
and adding a synchronisation barrier has two major drawbacks:
If function f() has very different running times, you may have some threads that have finished their work, while others are still computing. The synchronisation will force the former threads to wait and this idleness do not exploit properly parallelism.
Synchronisations are expensive. A simple way to realize a barrier is to increment a global counter when reaching the barrier and to wait until the value of the counter is equal to the number of threads omp_get_num_threads(). To avoid races between threads, the incrementation of the global counter must be done with an atomic read-modify-write that requires a large number of cycles and the wait for the proper value of the counter is typically done with a spin lock that wastes processor cycles.
So there is construct to suppress implicit synchronisations and the best way to program the previous loop would be:
#pragma omp parallel
{
#pragma omp for nowait // nowait suppresses implicit synchronisations.
for(int i=0; i<N; i++)
A[i]=f(i); // compute values for A
#pragma omp for
for(int j=0; j<N/2; j++)
B[j]=g(j);// compute values for B
} // end of parallel section
This way, as soon as a thread has finished its work in the first loop, it will immediately start to process the second for loop, and, depending on the actual program, this may reduce significantly execution time.

different OpenMP output in different machine

When I m trying to run the following code in my system centos running virtually i am getting right output but when i am trying to run the same code on compact supercomputer "Param Shavak" I am getting incorrect output.... :(
#include<stdio.h>
#include<omp.h>
int main()
{
int p=1,s=1,ti
#pragma omp parallel private(p,tid)shared(s)
{
p=1;
tid=omp_get_thread_num();
p=p+tid;
s=s+tid;
printf("Thread %d P=%d S=%d\n",tid,p,s);
}
return 0;
}
If your program runs correctly in one machine, it must be because it's actually not running in parallel in that machine.
Your program suffers from a race condition in the s=s+tid; line of code. s is a shared variable, so several threads at the same time try to update it, which results in data loss.
You can fix the problem by turning that line of code into an atomic operation:
#pragma omp atomic
s=s+tid;
That way only one thread at a time can read and update the variable s, and the race condition is no more.
In more complex programs you should use atomic operations or critical regions only when necessary, because you don't have parallelism in those regions and that hurts performance.
EDIT: As suggested by user High Performance Mark, I must remark that the program above is very inefficient because of the atomic operation. The proper way to do that kind of calculation (adding to the same variable in all iterations of a loop) is to implement a reduction. OpenMP makes it easy by using the reduction clause:
#pragma omp reduction(operator : variables)
Try this version of your program, using reduction:
#include<stdio.h>
#include<omp.h>
int main()
{
int p=1,s=1,tid;
#pragma omp parallel reduction(+:s) private(p,tid)
{
p=1;
tid=omp_get_thread_num();
p=p+tid;
s=s+tid;
printf("Thread %d P=%d S=%d\n",tid,p,s);
}
return 0;
}
The following link explains critical sections, atomic operations and reduction in a more verbose way: http://www.lindonslog.com/programming/openmp/openmp-tutorial-critical-atomic-and-reduction/

OpenMP output for "for" loop

I am new to OpenMP and I just tried to write a small program with the parallel for construct. I have trouble understanding the output of my program. I don't understand why thread number 3 prints the output before 1 and 2. Could someone offer me an explanation?
So, the program is:
#pragma omp parallel for
for (i = 0; i < 7; i++) {
printf("We are in thread number %d and are printing %d\n",
omp_get_thread_num(), i);
}
and the output is:
We are in thread number 0 and are printing 0
We are in thread number 0 and are printing 1
We are in thread number 3 and are printing 6
We are in thread number 1 and are printing 2
We are in thread number 1 and are printing 3
We are in thread number 2 and are printing 4
We are in thread number 2 and are printing 5
My processor is a Intel(R) Core(TM) i5-2410M CPU with 4 cores.
Thank you!
OpenMP makes no guarantees of the relative ordering, in time, of the execution of statements by different threads. OpenMP leaves it to the programmer to impose such ordering if it is required. In general it is not required, in many cases not even desirable, which is why OpenMP's default behaviour is as it is. The cost, in time, of imposing such an ordering is likely to be significant.
I suggest you run much larger tests several times, you should observe that the cross-thread sequencing of events is, essentially, random.
If you want to print in order then you can use the ordered construct
#pragma omp parallel for ordered
for (i = 0; i < 7; i++) {
#pragma omp ordered
printf("We are in thread number %d and are printing %d\n",
omp_get_thread_num(), i);
}
I assume this requires threads from larger iterations to wait for the ones with lower iteration so it will have an effect on performance. You can see it used here http://bisqwit.iki.fi/story/howto/openmp/#ExampleCalculatingTheMandelbrotFractalInParallel
That draws the Mandelbrot set as characters using ordered. A much faster solution than using ordered is to fill an array in parallel of the characters and then draw them serially (try the code). Since one uses OpenMP for performance I have never found a good reason to use ordered but I'm sure it has its use somewhere.

Resources