What kind of problem do we have with this program (OpenMP), and how do we avoid it? - parallel-processing

Following is my code. I want to know why a race condition can happen here and how to solve it.
#include <iostream>
#include <omp.h>

int main()
{
    int a = 123;
    #pragma omp parallel num_threads(2)
    {
        int thread_id = omp_get_thread_num();
        int b = (thread_id + 1) * 10;
        a += b;
    }
    std::cout << "a = " << a << "\n";
    return 0;
}

A race condition occurs when two or more threads access shared data at the same time and at least one of them modifies it. In your code this line causes the race condition:
a += b;
a is a shared variable updated by the two threads simultaneously, so the final result may be incorrect. Note that, depending on the hardware used, a possible race condition does not necessarily mean a data race will actually occur, so the result may happen to be correct, but it is still a semantic error in your code.
To fix it you have two options:
Use an atomic operation:
#pragma omp atomic
a += b;
Use a reduction:
#pragma omp parallel num_threads(2) reduction(+:a)
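Putting the reduction option together, here is a minimal sketch of the corrected program (essentially the code from the question plus the missing <omp.h> include; with reduction each thread updates its own private copy of a, and the copies are combined with the original value at the end):
#include <iostream>
#include <omp.h>

int main()
{
    int a = 123;
    #pragma omp parallel num_threads(2) reduction(+:a)
    {
        int thread_id = omp_get_thread_num();
        int b = (thread_id + 1) * 10;
        a += b;   // each thread adds to its own private copy of a
    }
    std::cout << "a = " << a << "\n";   // 123 + 10 + 20 = 153
    return 0;
}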

Related

Question about OpenMP sections and critical

I am trying to make a fast parallel loop. In each iteration of the loop, I build an array which is costly so I want it distributed over many threads. After the array is built, I use it to update a matrix. Here it gets tricky because the matrix is common to all threads so only 1 thread can modify parts of the matrix at one time, but when I work on the matrix, it turns out I can distribute that work too since I can work on different parts of the matrix at the same time.
Here is what I currently am doing:
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
    ... build array bi ...
    #pragma omp critical
    {
        update_matrix(A, bi);
    }
}
...
subroutine update_matrix(A, b)
{
    printf("id0 = %d\n", omp_get_thread_num());
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            printf("id1 = %d\n", omp_get_thread_num());
            modify columns 1 to j of A using b
        }
        #pragma omp section
        {
            printf("id2 = %d\n", omp_get_thread_num());
            modify columns j+1 to k of A using b
        }
    }
}
The problem is that the two different sections of the update_matrix() routine are not being parallelized. The output I get looks like this:
id0 = 19
id1 = 0
id2 = 0
id0 = 5
id1 = 0
id2 = 0
...
So the two sections are being executed by the same thread (0). I tried removing the #pragma omp critical in the main loop but it gives the same result. Does anyone know what I'm doing wrong?
#pragma omp parallel sections does not work there because you are already inside a parallel region created by the #pragma omp parallel for directive. Unless you have enabled nested parallelism with omp_set_nested(1);, the inner parallel sections construct is effectively ignored: it is executed by a team of just one thread.
Please note that enabling nesting is not necessarily efficient either, as spawning new threads has an overhead cost which may not be worth it if the update_matrix part is not very CPU intensive.
You have several options:
Forget about it. If the non-critical part of the loop really is where most of the computation happens and you already have as many threads as CPUs, spawning extra threads for a simple operation will do no good. Just remove the parallel sections construct from the subroutine.
Try enabling nested parallelism with omp_set_nested(1); (a minimal sketch follows the atomic example below).
Another option, which comes at the cost of double synchronization overhead, is to use named critical sections. Only one thread may be inside critical section ONE_TO_J and one inside critical section J_TO_K at any time, so up to two threads may update the matrix in parallel. This is costly in terms of synchronization overhead.
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
    ... build array bi ...
    update_matrix(A, bi); // not critical
}
...
subroutine update_matrix(A, b)
{
    printf("id0 = %d\n", omp_get_thread_num());
    #pragma omp critical(ONE_TO_J)
    {
        printf("id1 = %d\n", omp_get_thread_num());
        modify columns 1 to j of A using b
    }
    #pragma omp critical(J_TO_K)
    {
        printf("id2 = %d\n", omp_get_thread_num());
        modify columns j+1 to k of A using b
    }
}
Or use atomic operations to edit the matrix, if this is suitable.
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
    ... build array bi ...
    update_matrix(A, bi); // not critical
}
...
subroutine update_matrix(A, b)
{
    float tmp;
    printf("id0 = %d\n", omp_get_thread_num());
    for (int row = 0; row < max_row; row++)
        for (int column = 0; column < k; column++)
        {
            tmp = some_function(b, row, column);
            #pragma omp atomic
            A[column][row] += tmp;
        }
}
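Regarding the second option above (enabling nesting), here is a minimal, self-contained sketch (my own example, not the code from the question) showing its effect; without the omp_set_nested(1) call the inner thread numbers all print as 0:
#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_nested(1);                   // allow nested parallel regions
                                         // (deprecated in OpenMP 5.0 in favour of
                                         // omp_set_max_active_levels, but still works)
    #pragma omp parallel num_threads(2)  // outer team, like the parallel for above
    {
        int outer_id = omp_get_thread_num();
        #pragma omp parallel sections num_threads(2)  // inner team, like update_matrix()
        {
            #pragma omp section
            std::printf("outer %d, inner %d\n", outer_id, omp_get_thread_num());
            #pragma omp section
            std::printf("outer %d, inner %d\n", outer_id, omp_get_thread_num());
        }
    }
    return 0;
}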
By the way, data is stored in row-major order in C, so you should update the matrix row by row rather than column by column. This will prevent false sharing and improve the algorithm's memory-access performance.
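To illustrate that point, here is a minimal, self-contained sketch (my own example; the sizes and the per-element value are made up, and it uses the conventional A[row][column] indexing) where the innermost loop walks contiguous memory:
#include <omp.h>

constexpr int ROWS = 128, COLS = 128;   // made-up sizes
float A[ROWS][COLS];                    // row-major: A[row][0..COLS-1] is contiguous

// Meant to be called from inside the outer parallel for, hence the atomic update.
void update_row_major(const float *b)
{
    for (int row = 0; row < ROWS; ++row)              // row by row, as suggested above
        for (int column = 0; column < COLS; ++column)
        {
            float tmp = b[row];                       // stand-in for some_function(b, row, column)
            #pragma omp atomic
            A[row][column] += tmp;                    // unit-stride accesses in the inner loop
        }
}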

Why does my openMP 2.0 critical directive not flush?

I am currently attempting to parallelize a maximum value search using openMP 2.0 and Visual Studio 2012. I feel like this problem is so simple, it could be used as a textbook example. However, I run into a race condition I do not understand.
The code passage in question is:
double globalMaxVal = std::numeric_limits<double>::min();
#pragma omp parallel for
for(int i = 0; i < numberOfLoops; i++)
{
    {/* ... */} // In this section I determine maxVal
    // Besides reading out values from two std::vector via the [] operator, I do not access or manipulate any global variables.
    #pragma omp flush(globalMaxVal) // IF I COMMENT OUT THIS LINE I RUN INTO A RACE CONDITION
    #pragma omp critical
    if(maxVal > globalMaxVal)
    {
        globalMaxVal = maxVal;
    }
}
I do not grasp why it is necessary to flush globalMaxVal. The openMP 2.0 documentation states: "A flush directive without a variable-list is implied for the following directives: [...] At entry to and exit from critical [...]" Yet, I get results diverging from the non-parallelized implementation, if I leave out the flush directive.
I realize that the above code might not be the prettiest or most efficient way to solve my problem, but at the moment I want to understand why I am seeing this race condition.
Any help would be greatly appreciated!
EDIT:
I've now added a minimal, complete and verifiable example below, requiring only OpenMP and the standard library. I've been able to reproduce the problem described above with this code.
For me, some runs yield globalMaxVal != 99 if I omit the flush directive. With the directive, it works just fine.
#include <algorithm>
#include <iostream>
#include <random>
#include <Windows.h>
#include <omp.h>

int main()
{
    // Repeat parallelized code 20 times
    for(int r = 0; r < 20; r++)
    {
        int globalMaxVal = 0;
        #pragma omp parallel for
        for(int i = 0; i < 100; i++)
        {
            int maxVal = i;
            // Some dummy calculations to use computation time
            std::random_device rd;
            std::mt19937 generator(rd());
            std::uniform_real_distribution<double> particleDistribution(-1.0, 1.0);
            for(int j = 0; j < 1000000; j++)
                particleDistribution(generator);
            // The actual code bit again
            #pragma omp flush(globalMaxVal) // IF I COMMENT OUT THIS LINE I RUN INTO A RACE CONDITION
            #pragma omp critical
            if(maxVal > globalMaxVal)
            {
                globalMaxVal = maxVal;
            }
        }
        // Report outcome - expected to be 99
        std::cout << "Run: " << r << ", globalMaxVal: " << globalMaxVal << std::endl;
    }
    system("pause");
    return 0;
}
EDIT 2:
After further testing, we've found that compiling the code in Visual Studio without optimization (/Od) or on Linux gives correct results, whereas the bug surfaces in Visual Studio 2012 (Microsoft C/C++ compiler version 17.00.61030) with optimization enabled (/O2).

Avoiding race condition in OpenMP?

I was reading about OpenMP and shared-memory programming and came across this pseudo code that has an integer x and two threads:
thread 1
x++;
thread 2
x--;
This will lead to a race condition, but it can be avoided. I want to avoid it using OpenMP; how should it be done?
This is how I think it will be avoided:
int x;
#pragma omp parallel shared(x) num_threads(2)
{
    int tid = omp_get_thread_num();
    if(tid == 1)
        x++;
    else
        x--;
}
I know that eliminating the race condition will lead to correct execution but also poor performance, but I don't know why.
If more than one thread is modifying x, the code is at risk of a race condition. Take a simplified version of your example:
int main()
{
    int x = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            ++x;
        }
        #pragma omp section
        {
            --x;
        }
    }
    return x;
}
The two threads modifying x may be interleaved with each other, meaning that the result will not necessarily be zero.
One way to protect the modifications is to wrap the read-modify-write code in a critical region.
Another, suitable for the simple operations here, is to mark the ++ and -- lines with #pragma omp atomic; that will use platform-native atomic instructions where they exist, which is lightweight compared to a critical region.
Another approach that usually works (but isn't strictly guaranteed by OpenMP) is to change the type used for x to a standard atomic type. Simply changing it from int to std::atomic<int> gives you indivisible ++ and -- operators which you can use here.
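For illustration, here is a minimal sketch of the atomic variant (my own example, not code from the question):
#include <iostream>

int main()
{
    int x = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            #pragma omp atomic
            ++x;            // indivisible read-modify-write
        }
        #pragma omp section
        {
            #pragma omp atomic
            --x;
        }
    }
    std::cout << "x = " << x << "\n";   // always prints x = 0
    return 0;
}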

OpenMP, dependency graph

I took some of my old OpenMP exercises to practice a little bit, but I am having difficulty finding the solution to one in particular.
The goal is to write the simplest OpenMP code that corresponds to the dependency graph.
The graphs are visible here: http://imgur.com/a/8qkYb
The first one is simple.
It corresponds to the following code:
#pragma omp parallel
{
    #pragma omp simple
    {
        #pragma omp task
        {
            A1();
            A2();
        }
        #pragma omp task
        {
            B1();
            B2();
        }
        #pragma omp task
        {
            C1();
            C2();
        }
    }
}
The second one is still easy.
#pragma omp parallel
{
    #pragma omp simple
    {
        #pragma omp task
        {
            A1();
        }
        #pragma omp task
        {
            B1();
        }
        #pragma omp task
        {
            C1();
        }
        #pragma omp barrier
        A2();
        B2();
        C2();
    }
}
And now comes the last one…
which is bugging me quite a bit because the number of dependencies is unequal across the function calls. I thought there was a way to explicitly state which task you should be waiting for, but I can't find what I'm looking for in the OpenMP documentation.
If anyone have an explanation for this question, I will be very grateful because I've been thinking about it for more than a month now.
First of all, there is no #pragma omp simple in the OpenMP 4.5 specification.
I assume you meant #pragma omp single.
If so, #pragma omp barrier is a bad idea inside a single region, since only one thread executes the code and would wait for all the other threads, which never reach the barrier because they do not execute the region.
Additionally, in the second one A2, B2 and C2 are no longer executed in parallel as tasks.
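For illustration, here is a hedged sketch of how the second example could be written so that A2, B2 and C2 also run as tasks, assuming the barrier was meant to express that each of them must wait for all of A1, B1 and C1 (taskwait waits for the child tasks created so far):
#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp task
        A1();
        #pragma omp task
        B1();
        #pragma omp task
        C1();

        #pragma omp taskwait   // A1, B1 and C1 have completed past this point

        #pragma omp task
        A2();
        #pragma omp task
        B2();
        #pragma omp task
        C2();
    }
}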
To your actual question:
What you are looking for seems to be the depend clause for task constructs, OpenMP Specification p. 169.
There is a pretty good explanation of the depend clause and how it works by Massimiliano for this question.
The last example is not that complex once you understand what is going on there: each task in iteration t, say T(t, n), depends on the same task and on its neighbors from the previous iteration, i.e. on T(t-1, n-1), T(t-1, n) and T(t-1, n+1). This pattern is known as a Jacobi stencil. It is very common in partial differential equation solvers.
As Henkersmann said, the easiest option is to use the depend clause of the OpenMP task construct:
int val_a[N], val_b[N];

#pragma omp parallel
#pragma omp single
{
    int *a = val_a;
    int *b = val_b;
    for( int t = 0; t < T; ++t ) {
        // Unroll the inner loop for the boundary cases
        #pragma omp task depend(in:a[0], a[1]) depend(out:b[0])
        stencil(b, a, 0);

        for( int i = 1; i < N-1; ++i ) {
            #pragma omp task depend(in:a[i-1],a[i],a[i+1]) \
                             depend(out:b[i])
            stencil(b, a, i);
        }

        #pragma omp task depend(in:a[N-2],a[N-1]) depend(out:b[N-1])
        stencil(b, a, N-1);

        // Swap the pointers for the next iteration
        int *tmp = a;
        a = b;
        b = tmp;
    }
    #pragma omp taskwait
}
As you may see, OpenMP task dependences are point-to-point, which means you cannot express them in terms of array regions.
Another option, a bit cleaner for this specific case, is to enforce the dependences indirectly, using a barrier:
int a[N], b[N];

#pragma omp parallel
for( int t = 0; t < T; ++t ) {
    #pragma omp for
    for( int i = 0; i < N-1; ++i ) {
        stencil(b, a, i);
    }
}
This second version performs a synchronization barrier every time the inner loop finishes. The synchronization granularity is coarser, in the sense that you have only one synchronization point per outer loop iteration. However, if the stencil function is long and unbalanced, it is probably worth using tasks.

The time of execution doesn't change whether I increase the number of threads or not

I am executing the following code snippet as explained in the OpenMP tutorial. But what I see is that the execution time doesn't change with NUM_THREADS; in fact, the execution time just varies a lot between runs. I am wondering if the way I am trying to measure the time is wrong. I tried using clock_gettime, but I see the same results. Can anyone help with this, please? More than the lack of speedup from OpenMP, I am troubled by why the reported time varies so much.
#include "iostream"
#include "omp.h"
#include "stdio.h"
double getTimeNow();
static long num_steps = 10000000;
#define PAD 8
#define NUM_THREADS 1
int main ()
{
int i,nthreads;
double pi, sum[NUM_THREADS][PAD];
double t0,t1;
double step = 1.0/(double) num_steps;
t0 = omp_get_wtime();
#pragma omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id==0) nthreads = nthrds;
for (i=id,sum[id][0]=0;i< num_steps; i=i+nthrds)
{
x = (i+0.5)*step;
sum[id][0] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0;i<nthreads;i++)pi += sum[i][0] * step;
t1 = omp_get_wtime();
printf("\n value obtained is %f\n",pi);
std::cout << "It took "
<< t1-t0
<< " seconds\n";
return 0;
}
You use omp_set_num_threads(), but it is a function, not a compiler directive, so call it without #pragma:
omp_set_num_threads(NUM_THREADS);
Also, you can set the number of threads directly on the parallel directive, using the num_threads clause:
#pragma omp parallel num_threads(4)
The preferred way is not to hardcode the number of threads in your program, but use the environment variable OMP_NUM_THREADS. For example, in bash:
export OMP_NUM_THREADS=4
However, the last example is not suited for your program.
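For reference, here is a minimal, self-contained sketch (my own example, not your pi program) showing both ways of setting the thread count:
#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_num_threads(4);   // plain function call, no #pragma in front of it

    #pragma omp parallel      // alternatively: #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        std::printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}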
