Use of OMP single inside omp parallel for loop - openmp

I have a for loop which I am parallelizing using OMP parallel for. Now inside my main for loop, I have another for loop which contains one section of code which I want to be executed by only one thread per iteration of nested for loop. That means that that Function1() should be called 100 times only while Function2 should be called 400 times per iteration of outer for loop, provided the number of threads is 4.
#pragma omp parallel for schedule (auto)
for (i=0; i<nrows; i++) {
int tid = omp_get_thread_num(); // thread ID
startarray[tid] = 0;
for (int partition = 0; partition < 100 ; partition++)
{
#pragma omp single
{
qqyy = Function1();
}
Function2();
//some work
}
}
Right now my code is giving following warning:
error: work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
What is the correct way to do this?

Related

Question about OpenMP sections and critical

I am trying to make a fast parallel loop. In each iteration of the loop, I build an array which is costly so I want it distributed over many threads. After the array is built, I use it to update a matrix. Here it gets tricky because the matrix is common to all threads so only 1 thread can modify parts of the matrix at one time, but when I work on the matrix, it turns out I can distribute that work too since I can work on different parts of the matrix at the same time.
Here is what I currently am doing:
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
#pragma omp critical
{
update_matrix(A, bi)
}
}
...
subroutine update_matrix(A, b)
{
printf("id0 = %d\n", omp_get_thread_num());
#pragma omp parallel sections
{
#pragma omp section
{
printf("id1 = %d\n", omp_get_thread_num());
modify columns 1 to j of A using b
}
#pragma omp section
{
printf("id2 = %d\n", omp_get_thread_num());
modify columns j+1 to k of A using b
}
}
}
The problem is that the two different sections of the update_matrix() routine are not being parallelized. The output I get looks like this:
id0 = 19
id1 = 0
id2 = 0
id0 = 5
id1 = 0
id2 = 0
...
So the two sections are being executed by the same thread (0). I tried removing the #pragma omp critical in the main loop but it gives the same result. Does anyone know what I'm doing wrong?
#pragma omp parallel sections should not work there because you are already in a parallel part of the code distributed by the #pragma omp prallel for clause. Unless you have enabled nested parallelization with omp_set_nested(1);, the parallel sections clause will be ignored.
Please not that it is not necessarily efficient as spawning new threads has an overhead cost which may not be worth if the update_matrix part is not too CPU intensive.
You have several options:
Forget about that. If the non-critical part of the loop is really what takes most calculations and you already have as many threads as CPUs, spwaning extra threads for a simple operations will do no good. Just remove the parallel sections clause in the subroutine.
Try enable nesting with omp_set_nested(1);
Another option, which comes at the cost of a double synchronization overhead and would be use named critical sections. There may be only one thread in critical section ONE_TO_J and one on critical section J_TO_K so basically up to two threads may update the matrix in parallel. This is costly in term of synchronization overhead.
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
update_matrix(A, bi); // not critical
}
...
subroutine update_matrix(A, b)
{
printf("id0 = %d\n", omp_get_thread_num());
#pragma omp critical(ONE_TO_J)
{
printf("id1 = %d\n", omp_get_thread_num());
modify columns 1 to j of A using b
}
#pragma omp critical(J_TO_K)
{
printf("id2 = %d\n", omp_get_thread_num());
modify columns j+1 to k of A using b
}
}
Or use atomic operations to edit the matrix, if this is suitable.
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
update_matrix(A, bi); // not critical
}
...
subroutine update_matrix(A, b)
{
float tmp;
printf("id0 = %d\n", omp_get_thread_num());
for (int row=0; row<max_row;row++)
for (int column=0;column<k;column++){
float(tmp)=some_function(b,row,column);
#pragma omp atomic
A[column][row]+=tmp;
}
}
By the way, data is stored in row major order in C, so you should be updating the matrix row by row rather than column by column. This will prevent false-sharing and will improve the algorithm memory-access performance.

OpenMP, dependency graph

I took some of my old OpenMP exercises to practice a little bit, but I have difficulties to find the solution for on in particular.
The goal is to write the most simple OpenMP code that correspond to the dependency graph.
The graphs are visible here: http://imgur.com/a/8qkYb
First one is simple.
It correspond to the following code:
#pragma omp parallel
{
#pragma omp simple
{
#pragma omp task
{
A1();
A2();
}
#pragma omp task
{
B1();
B2();
}
#pragma omp task
{
C1();
C2();
}
}
}
Second one is still easy.
#pragma omp parallel
{
#pragma omp simple
{
#pragma omp task
{
A1();
}
#pragma omp task
{
B1();
}
#pragma omp task
{
C1();
}
#pragma omp barrier
A2();
B2();
C2();
}
}
And now comes the last oneā€¦
which is bugging me quite a bit because the number of dependencies is unequal across all function calls. I thought there was a to explicitly state which task you should be waiting for, but I can't find what I'm looking for in the OpenMP documentation.
If anyone have an explanation for this question, I will be very grateful because I've been thinking about it for more than a month now.
First of all there is no #pragma omp simple in the OpenMP 4.5 specification.
I assume you meant #pragma omp single.
If so pragma omp barrier is a bad idea inside a single region, since only one thread will execude the code and waits for all other threads, which do not execute the region.
Additionally in the second on A2,B2 and C2 are not executed in parallel as tasks anymore.
To your acutual question:
What you are looking for seems to be the depend clause for Task constructs at OpenMP Secification pg. 169.
There is a pretty good explaination of the depend clause and how it works by Massimiliano for this question.
The last example is not that complex once you understand what is going on there: each task Tn depends on the previous iteration T-1_n AND its neighbors (T-1_n-1 and T-1_n+1). This pattern is known as Jacobi stencil. It is very common in partial differential equation solvers.
As Henkersmann said, the easiest option is using OpenMP Task's depend clause:
int val_a[N], val_b[N];
#pragma omp parallel
#pragma omp single
{
int *a = val_a;
int *b = val_b;
for( int t = 0; t < T; ++t ) {
// Unroll the inner loop for the boundary cases
#pragma omp task depend(in:a[0], a[1]) depend(out:b[0])
stencil(b, a, i);
for( int i = 1; i < N-1; ++i ) {
#pragma omp task depend(in:a[i-1],a[i],a[i+1]) \
depend(out:b[i])
stencil(b, a, i);
}
#pragma omp task depend(in:a[N-2],a[N-1]) depend(out:b[N-1])
stencil(b, a, N-1);
// Swap the pointers for the next iteration
int *tmp = a;
a = b;
b = tmp;
}
#pragma omp taskwait
}
As you may see, OpenMP task dependences are point-to-point, that means you can not express them in terms of array regions.
Another option, a bit cleaner for this specific case, is to enforce the dependences indirectly, using a barrier:
int a[N], b[N];
#pragma omp parallel
for( int t = 0; t < T; ++t ) {
#pragma omp for
for( int i = 0; i < N-1; ++i ) {
stencil(b, a, i);
}
}
This second case performs a synchronization barrier every time the inner loop finishes. The synchronization granularity is coarser, in the sense that you have only 1 synchronization point for each outer loop iteration. However, if stencil function is long and unbalanced, it is probably worth using tasks.

shared arrays in OpenMP

I'm trying to parallelize a piece of C++ code with OpenMp but I'm facing some problems.
In fact, my parallelized code is not faster than the serial one.
I think I have understood the cause of this, but I'm not able to solve it.
The structure of my code is like this:
int vec1 [M];
int vec2 [N];
...initialization of vec1 and vec2...
for (int it=0; it < tot_iterations; it++) {
if ( (it+1)%2 != 0 ) {
#pragma omp parallel for
for (int j=0 ; j < N ; j++) {
....code involving a call to a function to which I'm passing as a parameter vec1.....
if (something) { vec2[j]=vec2[j]-1;}
}
}
else {
# pragma omp parallel for
for (int i=0 ; i < M ; i++) {
....code involving a call to a function to which I'm passing as a parameter vec2.....
if (something) { vec1[i]=vec1[i]-1;}
}
}
}
I thought that maybe my parallelized code is slower because multiple threads want to access to the same shared array and one has to wait until another has finished, but I'm not sure how things really go. But I can't make vec1 and vec2 private since the updates wouldn't be seen in the other iterations...
How can I improve it??
When you speak about issue when accessing the same array with multiple thread, this is called "false-sharing". Except if your array is small, it should not be the bottle neck here as pragma omp parallel for use static scheduling in default implementation (with gcc at least) so each thread should access most of the array without concurency except if your "...code involving a call to a function to which I'm passing as a parameter vec2....." really access a lot of elements in the array.
Case 1: You do not access most elements in the array in this part of the code
Is M big enough to make parallelism useful?
Can you move parallelism on the outer loop? (with one loop for vec1 only and the other for vec2 only)
Try to move the parallel region code :
int vec1 [M];
int vec2 [N];
...initialization of vec1 and vec2...
#pragma omp parallel
for (int it=0; it < tot_iterations; it++) {
if ( (it+1)%2 != 0 ) {
#pragma omp for
for (int j=0 ; j < N ; j++) {
....code involving a call to a function to which I'm passing as a parameter vec1.....
if (something) { vec2[j]=vec2[j]-1;}
}
}
else {
# pragma omp for
for (int i=0 ; i < M ; i++) {
....code involving a call to a function to which I'm passing as a parameter vec2.....
if (something) { vec1[i]=vec1[i]-1;}
}
}
This should not change much but some implementation have a costly parallel region creation.
case 2: You access every elements with every thread
I would say you can't do that if you perform update, otherwise, you may have concurency issue as you have order dependency in the loop.

OpenMP parallelizing loop

I want to parallelize that kind of loop. Note that each "calc_block" uses the data that obtained on previous iteration.
for (i=0 ; i<MAX_ITER; i++){
norma1 = calc_block1();
norma2 = calc_block2();
norma3 = calc_block3();
norma4 = calc_block4();
norma = norma1+norma2+norma3+norma4;
...some calc...
if(norma<eps)break;
}
I tryed this, but speedup is quite small ~1.2
for (i=0 ; i<MAX_ITER; i++){
#pragma omp parallel sections{
#pragma omp section
norma1 = calc_block1();
#pragma omp section
norma2 = calc_block2();
#pragma omp section
norma3 = calc_block3();
#pragma omp section
norma4 = calc_block4();
}
norma = norma1+norma2+norma3+norma4;
...some calc...
if(norma<eps)break;
}
I think it happened because of the overhead of using sections inside of loop. But i dont know how to fix it up...
Thanks in advance!
You could reduce the overhead by moving the entire loop inside the parallel region. Thus the threads in the pool used to implement the team would only get "awaken" once. It is a bit tricky and involves careful consideration of variable sharing classes:
#pragma omp parallel private(i,...) num_threads(4)
{
for (i = 0; i < MAX_ITER; i++)
{
#pragma omp sections
{
#pragma omp section
norma1 = calc_block1();
#pragma omp section
norma2 = calc_block2();
#pragma omp section
norma3 = calc_block3();
#pragma omp section
norma4 = calc_block4();
}
#pragma omp single
{
norma = norm1 + norm2 + norm3 + norm4;
// ... some calc ..
}
if (norma < eps) break;
}
}
Both sections and single constructs have implicit barriers at their ends, hence the threads would synchronise before going into the next loop iteration. The single construct reproduces the previously serial part of your program. The ... part in the private clause should list as many as possible variables that are only relevant to ... some calc .... The idea is to run the serial part with thread-local variables since access to shared variables is slower with most OpenMP implementations.
Note that often time the speed-up might not be linear for completely different reason. For example calc_blockX() (with X being 1, 2, 3 or 4) might have too low compute intensity and therefore require very high memory bandwidth. If the memory subsystem is not able to feed all 4 threads at the same time, the speed-up would be less than 4. An example of such case - this question.

openmp: difference in combining 2 for loop & not combining

What is the difference in combining 2 for loops and parallizing together and parallizing separately
Example
1. not paralleling together
#pragma omp parallel for
for(i = 0; i < 100; i++) {
//.... some code
}
#pragma omp parallel for
for(i = 0; i < 1000; i++) {
//.... some code
}
2. paralleling together
#pragma omp parallel
{
#pragma omp for
for(i = 0; i < 100; i++) {
//.... some code
}
#pragma omp for
for(i = 0; i < 1000; i++) {
//.... some code
}
}
which code is better and why????
One might expect a small win in the second, because one is fork/joining (or the functional equivalent) the OMP threads twice, rather than once. Whether it makes any actual difference for your code is an empirical question best answered by measurement.
The second can also have a more significant advantage if the work in the two loops are independant, and you can start the second at any time, and there's reason to expect some load imbalance in the first loop. In that case, you can add a nowait clause to the firs tomp for and, rather than all threads waiting until the for loop ends, whoever's done first can immediately go on to start working on the second loop. Or, one could put the two chunks of codes each in a section, or task. In general, you have a lot of control over what threads do and how they do it within a parallel section; whereas once you end the parallel section, you lose that flexibility - everything has to join together and you're done.

Resources