I am currently attempting to parallelize a maximum value search using OpenMP 2.0 and Visual Studio 2012. I feel like this problem is so simple it could be used as a textbook example. However, I run into a race condition I do not understand.
The code passage in question is:
double globalMaxVal = std::numeric_limits&lt;double&gt;::lowest(); // lowest(), not min(): min() is the smallest positive double
#pragma omp parallel for
for(int i = 0; i &lt; numberOfLoops; i++)
{
    {/* ... */} // In this section I determine maxVal.
    // Besides reading values from two std::vector via operator[], I do not access or manipulate any global variables.
    #pragma omp flush(globalMaxVal) // IF I COMMENT OUT THIS LINE I RUN INTO A RACE CONDITION
    #pragma omp critical
    if(maxVal &gt; globalMaxVal)
    {
        globalMaxVal = maxVal;
    }
}
I do not grasp why it is necessary to flush globalMaxVal. The OpenMP 2.0 documentation states: "A flush directive without a variable-list is implied for the following directives: [...] At entry to and exit from critical [...]" Yet I get results diverging from the non-parallelized implementation if I leave out the flush directive.
I realize that the above code might not be the prettiest or most efficient way to solve my problem, but at the moment I want to understand why I am seeing this race condition.
Any help would be greatly appreciated!
EDIT:
Below I've now added a minimal, complete and verifiable example requiring only OpenMP and the standard library. I've been able to reproduce the problem described above with this code.
For me some runs yield a globalMaxVal != 99 if I omit the flush directive. With the directive, it works just fine.
#include &lt;algorithm&gt;
#include &lt;cstdlib&gt;   // for system()
#include &lt;iostream&gt;
#include &lt;random&gt;
#include &lt;omp.h&gt;

int main()
{
    // Repeat the parallelized code 20 times
    for(int r = 0; r &lt; 20; r++)
    {
        int globalMaxVal = 0;
        #pragma omp parallel for
        for(int i = 0; i &lt; 100; i++)
        {
            int maxVal = i;
            // Some dummy calculations to use computation time
            std::random_device rd;
            std::mt19937 generator(rd());
            std::uniform_real_distribution&lt;double&gt; particleDistribution(-1.0, 1.0);
            for(int j = 0; j &lt; 1000000; j++)
                particleDistribution(generator);
            // The actual code bit again
            #pragma omp flush(globalMaxVal) // IF I COMMENT OUT THIS LINE I RUN INTO A RACE CONDITION
            #pragma omp critical
            if(maxVal &gt; globalMaxVal)
            {
                globalMaxVal = maxVal;
            }
        }
        // Report outcome - expected to be 99
        std::cout &lt;&lt; "Run: " &lt;&lt; r &lt;&lt; ", globalMaxVal: " &lt;&lt; globalMaxVal &lt;&lt; std::endl;
    }
    system("pause"); // Windows-only pause
    return 0;
}
EDIT 2:
After further testing, we've found that compiling the code in Visual Studio without optimization (/Od) or on Linux gives correct results, whereas the bug surfaces in Visual Studio 2012 (Microsoft C/C++ compiler version 17.00.61030) with optimization enabled (/O2).
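As an aside, a common way to sidestep both the flush question and per-iteration contention on the critical section is to keep a per-thread maximum and merge it into the global one only once per thread. A minimal sketch of that pattern (the localMax name and the plain-vector data source are illustrative, not taken from the original code):

#include &lt;iostream&gt;
#include &lt;vector&gt;

int main()
{
    std::vector&lt;double&gt; values(1000, 0.0);
    values[123] = 42.0; // make the expected maximum obvious

    double globalMax = values[0];
    #pragma omp parallel
    {
        double localMax = values[0]; // private running maximum, no sharing
        #pragma omp for
        for(int i = 0; i &lt; static_cast&lt;int&gt;(values.size()); i++)
        {
            if(values[i] &gt; localMax)
                localMax = values[i];
        }
        #pragma omp critical // one merge per thread instead of one per iteration
        {
            if(localMax &gt; globalMax)
                globalMax = localMax;
        }
    }
    std::cout &lt;&lt; "globalMax: " &lt;&lt; globalMax &lt;&lt; std::endl; // expected: 42
    return 0;
}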
Related
Following is my code. I want to know why a race condition occurs here and how to solve it.
#include &lt;iostream&gt;
#include &lt;omp.h&gt;   // for omp_get_thread_num()

int main(){
    int a = 123;
    #pragma omp parallel num_threads(2)
    {
        int thread_id = omp_get_thread_num();
        int b = (thread_id + 1)*10;
        a += b;   // both threads update the shared a
    }
    std::cout &lt;&lt; "a = " &lt;&lt; a &lt;&lt; "\n";
    return 0;
}
A race condition occurs when two or more threads access shared data and at least one of them changes its value at the same time. In your code this line causes the race condition:
a += b;
a is a shared variable updated by two threads simultaneously, so the final result may be incorrect. Note that, depending on the hardware used, a possible race condition does not necessarily mean that a data race will actually occur, so the result may happen to be correct, but it is still a semantic error in your code.
To fix it you have two options:
use an atomic operation:
#pragma omp atomic
a += b;
use a reduction:
#pragma omp parallel num_threads(2) reduction(+:a)
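For instance, a complete sketch of the reduction variant (the questioner's program with only the directive changed):

#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main(){
    int a = 123;
    // Each thread gets a private copy of a, initialized to 0 (the identity of +);
    // at the end of the region the copies are summed into the original a.
    #pragma omp parallel num_threads(2) reduction(+:a)
    {
        int thread_id = omp_get_thread_num();
        int b = (thread_id + 1)*10;
        a += b;
    }
    std::cout &lt;&lt; "a = " &lt;&lt; a &lt;&lt; "\n"; // deterministic: 123 + 10 + 20 = 153
    return 0;
}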
What are solutions to the hazards in doing this?
#include &lt;iostream&gt;
#include &lt;unistd.h&gt;
#include &lt;cstdlib&gt;
#include &lt;ctime&gt;

int main(){
    int k = 0;
    #pragma omp parallel for
    for(int i = 0; i &lt; 100000; i++){
        if (i % 2){ /** Conditional based on i **/
            #pragma omp atomic
            k++;
            usleep(1000 * ((float)std::rand() / RAND_MAX));
            #pragma omp task
            std::cout &lt;&lt; k &lt;&lt; std::endl; /** Some sort of task **/
        }
    }
    return 0;
}
I need all ks to be unique. What would be a better way of doing this?
Edit
Notice how this question refers to an aggregate
In particular I want to spawn tasks based on a shared variable. I run the risk of having a race condition.
Consider: thread 2 completes an iteration, evaluates the conditional to true, and increments k before thread 1 has spawned all of its tasks.
Edit edit
I tried to force a race condition. It wasn't obvious without the sleep, but there are in fact problems. How can I overcome this?
Here's a quick solution:
...
#pragma omp atomic
k++;
int c = k;
...
but I'd like a guarantee.
Tangential. Why doesn't this implementation work?
...
int c;
#pragma omp crtical
{
    k++;
    c = k;
}
...
At the end of the function, std::cout &lt;&lt; k; consistently prints less than the expected 50000 (output proof).
I hate to answer my own question so quickly, but I found a solution for this particular instance.
As of OpenMP 3.1 there is the atomic capture pragma.
The use case is for problems just like this. The resultant code:
#include &lt;iostream&gt;
#include &lt;unistd.h&gt;
#include &lt;cstdlib&gt;
#include &lt;ctime&gt;

int main(){
    int k = 0;
    #pragma omp parallel for
    for(int i = 0; i &lt; 100000; i++){
        if (i % 2){ /** Conditional based on i **/
            int c;
            #pragma omp atomic capture
            {
                c = k;
                k++;
            }
            usleep(1000 * ((float)std::rand() / RAND_MAX));
            #pragma omp task
            std::cout &lt;&lt; c &lt;&lt; std::endl; /** Some sort of task **/
        }
    }
    std::cout &lt;&lt; k &lt;&lt; std::endl; /** Final count **/
    std::cout.flush();
    return 0;
}
I will leave this question open in case someone would like to contribute ideas or code-architecture suggestions for avoiding these problems, or the reason the #pragma omp crtical didn't work. (Most likely: crtical is a misspelling of critical, and compilers silently ignore unrecognized pragmas, so that block ran with no mutual exclusion at all.)
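For reference, a sketch of the critical variant with the directive spelled correctly; it also yields unique values, just with more locking overhead than atomic capture:
...
int c;
#pragma omp critical
{
    c = k; // capture the current value...
    k++;   // ...and increment under the same lock
}
...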
I am executing the following code snippet as explained in the OpenMP tutorial. But the time of execution doesn't change with NUM_THREADS; in fact, the reported time just keeps varying a lot. I am wondering if the way I am trying to measure the time is wrong. I tried using clock_gettime, but I see the same results. Can anyone help with this, please? More than the lack of a reduction in time with the use of OpenMP, I am troubled by why the reported time varies so much.
#include "iostream"
#include "omp.h"
#include "stdio.h"
double getTimeNow();
static long num_steps = 10000000;
#define PAD 8
#define NUM_THREADS 1
int main ()
{
int i,nthreads;
double pi, sum[NUM_THREADS][PAD];
double t0,t1;
double step = 1.0/(double) num_steps;
t0 = omp_get_wtime();
#pragma omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id==0) nthreads = nthrds;
for (i=id,sum[id][0]=0;i< num_steps; i=i+nthrds)
{
x = (i+0.5)*step;
sum[id][0] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0;i<nthreads;i++)pi += sum[i][0] * step;
t1 = omp_get_wtime();
printf("\n value obtained is %f\n",pi);
std::cout << "It took "
<< t1-t0
<< " seconds\n";
return 0;
}
You use omp_set_num_threads(), but it is a function, not a compiler directive. You should call it without #pragma:
omp_set_num_threads(NUM_THREADS);
Also, you can set the number of threads in the compiler directive, but the keyword is different:
#pragma omp parallel num_threads(4)
The preferred way is not to hardcode the number of threads in your program, but use the environment variable OMP_NUM_THREADS. For example, in bash:
export OMP_NUM_THREADS=4
However, the last example is not suited to your program as written: sum is dimensioned with the compile-time constant NUM_THREADS, so running with more threads than that (e.g. via OMP_NUM_THREADS) would index past the end of the array.
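Putting the pieces together, a minimal sketch of the corrected thread setup (just the setup, not the full pi computation):

#include &lt;omp.h&gt;
#include &lt;stdio.h&gt;

#define NUM_THREADS 4

int main()
{
    omp_set_num_threads(NUM_THREADS); /* a plain function call, before the region */
    #pragma omp parallel
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}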
I've created a simple C program using GSL (GNU Scientific Library) and OpenMP. In this simple program, I want to compare the execution time of the sequential and parallel versions. Here is the program snippet, main.c:
#include "omp.h"
#include <stdio.h>
#include <gsl/gsl_matrix.h>
#include <time.h>
int main()
{
omp_set_num_threads(4);
int n1=10000, n2=10000;
gsl_matrix *A = gsl_matrix_alloc(n1, n2);
int i,j;
struct timeval tv1, tv2, tv3, tv4;
gettimeofday(&tv1, 0);
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv2, 0);
long elapsed = (tv2.tv_sec-tv1.tv_sec)*1000000 + tv2.tv_usec-tv1.tv_usec;
printf("Sequential Duration:%ldms\n", elapsed);
gettimeofday(&tv3, 0);
#pragma omp parallel for private(i,j)
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv4, 0);
elapsed = (tv4.tv_sec-tv3.tv_sec)*1000000 + tv4.tv_usec-tv3.tv_usec;
printf(" Parallel Duration:%ldms\n", elapsed);
return 0;
}
Then I compiled the above code, using this command:
gcc -fopenmp main.c -o test -lgsl -lgslcblas -lm
Here is the program's result (the elapsed values are microseconds, since the code multiplies the seconds difference by 1000000):
Sequential Duration: 11980106 us
Parallel Duration: 20624043 us
Why is the parallel part slower than the sequential part? How can I optimize this code? Thanks.
As you have written it, the j variable is shared between all threads, so the threads are constantly overwriting each other's state, leading them to iterate over values they have already covered.
You should always minimize the scope of variables when trying to parallelize with OpenMP. Either move the declaration of j into the loop or mark it private explicitly:
#pragma omp parallel for private(j)
Also, clock() counts processor time, not real time; you probably want to use gettimeofday().
Your matrix is too small to benefit much from parallelization; the threading overhead will dominate. Increase it to ~10000x10000 to start seeing something.
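On the timing point, OpenMP also provides its own wall-clock timer, omp_get_wtime(), which avoids platform-specific calls; a sketch:

double t0 = omp_get_wtime(); /* wall-clock seconds */
/* ... region to measure ... */
double t1 = omp_get_wtime();
printf("took %f s\n", t1 - t0);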
The problem here is that you do not know what the procedure gsl_matrix_set does with A. You do not know whether it is thread safe. To change one element in that matrix you supply the whole matrix to the routine instead of only the indices of the element. This smells like false sharing (see e.g. this answer).
I would try writing to the matrix storage directly instead (in GSL, element (i,j) of a dense matrix lives at data[i * tda + j]):
A-&gt;data[i * A-&gt;tda + j] = 1.0e6 * i * j;
If that does not work, and what you are interested in is only the time difference between serial and parallel, I would just benchmark with a plain array:
a[i][j] = 1.0e6 * i * j;
In the threaded part, try this (the for must be part of the directive, or every thread will run the whole loop):
#pragma omp parallel for private(i,j)
for(i=0; i&lt;n1; i++)
{
    for(j=0; j&lt;n2; j++)
    {
        gsl_matrix_set(A, i, j, 1.0e6*i*j);
    }
}
or, with the loop indices declared inside the loops so they are private automatically:
#pragma omp parallel for
for(int i=0; i&lt;n1; i++)
{
    for(int j=0; j&lt;n2; j++)
    {
        gsl_matrix_set(A, i, j, 1.0e6*i*j);
    }
}
My OpenMP version did not give any speed boost. I have a dual-core machine and the CPU usage is always 50%. So I tried the sample program given in the Wiki. It looks like the OpenMP compiler (Visual Studio 2008) is not creating more than one thread.
This is the program:
#include &lt;omp.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

int main (int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        if ( th_id == 0 ) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n",nthreads);
        }
    }
    return EXIT_SUCCESS;
}
This is the output that I get:
Hello World from thread 0
There are 1 threads
Press any key to continue . . .
There's nothing wrong with the program, so presumably there's some issue with how it's being compiled or run. Is this VS2008 Pro? A quick google around suggests OpenMP is not available in the Standard edition. Is OpenMP enabled under Properties -&gt; C/C++ -&gt; Language -&gt; OpenMP Support (i.e., are you compiling with /openmp)? Is the environment variable OMP_NUM_THREADS being set to 1 somewhere when you run this?
If you want to test out your program with more than one thread, there are several constructs for specifying the number of threads in an OpenMP parallel region. They are, in order of precedence:
Evaluation of the if clause
Setting of the num_threads clause
Use of the omp_set_num_threads() library function
Setting of the OMP_NUM_THREADS environment variable
Implementation default
It sounds like your implementation is defaulting to one thread (assuming you don't have OMP_NUM_THREADS=1 set in your environment).
To test with 4 threads, for instance, you could add num_threads(4) to your #pragma omp parallel directive.
As the other answer noted, you won't really see any "speedup" because you aren't exploiting any parallelism. But it is reasonable to want to run a "hello world" program with several threads to test it out.
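For example, a minimal sketch using the num_threads clause (assuming OpenMP is enabled in your compiler):

#include &lt;omp.h&gt;
#include &lt;stdio.h&gt;

int main(void)
{
    /* Request four threads for this region only; the runtime may grant
       fewer if dynamic adjustment is enabled. */
    #pragma omp parallel num_threads(4)
    {
        printf("Hello World from thread %d\n", omp_get_thread_num());
    }
    return 0;
}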
As mentioned here, http://docs.oracle.com/cd/E19422-01/819-3694/5_compiling.html, I got it working by setting the environment variable OMP_DYNAMIC to FALSE.
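The same effect should be achievable from inside the program with the corresponding library calls; a sketch (placed before the parallel region):

omp_set_dynamic(0);     /* disable dynamic adjustment of the team size */
omp_set_num_threads(4); /* then request a fixed team */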
Why would you need more than one thread for that program? It's clearly the case that OpenMP realizes that it doesn't need to create an extra thread to run a program with no loops, no code that could run in parallel whatsoever.
Try running some parallel stuff with OpenMP. Something like this:
#include &lt;omp.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

#define CHUNKSIZE 10
#define N 100

int main (int argc, char *argv[])
{
    int nthreads, tid, i, chunk;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i=0; i &lt; N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n",tid);
        #pragma omp for schedule(dynamic,chunk)
        for (i=0; i&lt;N; i++)
        {
            c[i] = a[i] + b[i];
            printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
        }
    } /* end of parallel section */
}
If you want some hard core stuff, try running one of these.