OpenMP is not creating threads in Visual Studio

My OpenMP version did not give any speed boost. I have a dual-core machine and the CPU usage is always 50%. So I tried the sample program given on the Wiki. It looks like the OpenMP compiler (Visual Studio 2008) is not creating more than one thread.
This is the program:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
    int th_id, nthreads;

    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);

        #pragma omp barrier
        if ( th_id == 0 ) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return EXIT_SUCCESS;
}
This is the output that I get:
Hello World from thread 0
There are 1 threads
Press any key to continue . . .

There's nothing wrong with the program, so presumably there's some issue with how it's being compiled or run. Is this VS2008 Pro? A quick Google around suggests OpenMP is not enabled in the Standard edition. Is OpenMP enabled under Properties -> C/C++ -> Language -> OpenMP Support (i.e., are you compiling with /openmp)? Is the environment variable OMP_NUM_THREADS being set to 1 somewhere when you run this?
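One quick way to check whether OpenMP support was actually enabled at compile time is to test the standard _OPENMP macro, which compilers define only when the OpenMP flag is active. A minimal sketch:

#include <stdio.h>

int main(void)
{
#ifdef _OPENMP
    /* _OPENMP expands to the release date (yyyymm) of the supported spec */
    printf("OpenMP is enabled, version macro: %d\n", _OPENMP);
#else
    printf("OpenMP is NOT enabled; check the /openmp flag\n");
#endif
    return 0;
}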

If you want to test out your program with more than one thread, there are several constructs for specifying the number of threads in an OpenMP parallel region. They are, in order of precedence:
Evaluation of the if clause
Setting of the num_threads clause
Use of the omp_set_num_threads() library function
Setting of the OMP_NUM_THREADS environment variable
Implementation default
It sounds like your implementation is defaulting to one thread (assuming you don't have OMP_NUM_THREADS=1 set in your environment).
To test with 4 threads, for instance, you could add num_threads(4) to your #pragma omp parallel directive.
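For example, the parallel region in the hello-world program above would become:

#pragma omp parallel private(th_id) num_threads(4)
{
    th_id = omp_get_thread_num();
    printf("Hello World from thread %d\n", th_id);
}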
As the other answer noted, you won't really see any "speedup" because you aren't exploiting any parallelism. But it is reasonable to want to run a "hello world" program with several threads to test it out.

As mentioned in http://docs.oracle.com/cd/E19422-01/819-3694/5_compiling.html, I got it working by setting the environment variable OMP_DYNAMIC to FALSE.
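For anyone wanting to try the same workaround from code rather than from the environment: the standard library call omp_set_dynamic() disables dynamic adjustment of the thread count. A minimal sketch (the thread count of 2 is just an example):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_dynamic(0);       /* equivalent to OMP_DYNAMIC=FALSE */
    omp_set_num_threads(2);   /* then request a fixed thread count */

    #pragma omp parallel
    printf("Hello from thread %d\n", omp_get_thread_num());

    return 0;
}

(On the command line, set OMP_DYNAMIC=FALSE in cmd.exe before running the program achieves the same thing.)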

Why would you need more than one thread for that program? Presumably the OpenMP runtime decided it didn't need to create an extra thread to run a program with no loops - no code that could run in parallel whatsoever.
Try running some parallel stuff with OpenMP. Something like this:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNKSIZE 10
#define N 100

int main (int argc, char *argv[])
{
    int nthreads, tid, i, chunk;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n", tid);

        #pragma omp for schedule(dynamic,chunk)
        for (i = 0; i < N; i++)
        {
            c[i] = a[i] + b[i];
            printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
        }
    } /* end of parallel section */

    return 0;
}
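Whichever example you run, make sure OpenMP is actually switched on when you compile, for instance:

cl /openmp test.c              (MSVC)
gcc -fopenmp test.c -o test    (GCC)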
If you want some hard-core stuff, try running one of these.

Related

Why does my OpenMP 2.0 critical directive not flush?

I am currently attempting to parallelize a maximum-value search using OpenMP 2.0 and Visual Studio 2012. I feel like this problem is so simple it could be used as a textbook example. However, I run into a race condition I do not understand.
The code passage in question is:
double globalMaxVal = std::numeric_limits<double>::min();
#pragma omp parallel for
for(int i = 0; i < numberOfLoops; i++)
{
    {/* ... */} // In this section I determine maxVal.
    // Besides reading values from two std::vector via operator[],
    // I do not access or manipulate any global variables.

    #pragma omp flush(globalMaxVal) // IF I COMMENT OUT THIS LINE I RUN INTO A RACE CONDITION
    #pragma omp critical
    if(maxVal > globalMaxVal)
    {
        globalMaxVal = maxVal;
    }
}
I do not grasp why it is necessary to flush globalMaxVal. The OpenMP 2.0 documentation states: "A flush directive without a variable-list is implied for the following directives: [...] At entry to and exit from critical [...]" Yet I get results that diverge from the non-parallelized implementation if I leave out the flush directive.
I realize that the code above might not be the prettiest or most efficient way to solve my problem, but at the moment I want to understand why I am seeing this race condition.
Any help would be greatly appreciated!
EDIT:
Below I've added a minimal, complete and verifiable example requiring only OpenMP and the standard library. I've been able to reproduce the problem described above with this code.
For me, some runs yield globalMaxVal != 99 if I omit the flush directive. With the directive, it works just fine.
#include <algorithm>
#include <iostream>
#include <random>
#include <Windows.h>
#include <omp.h>

int main()
{
    // Repeat the parallelized code 20 times
    for(int r = 0; r < 20; r++)
    {
        int globalMaxVal = 0;

        #pragma omp parallel for
        for(int i = 0; i < 100; i++)
        {
            int maxVal = i;

            // Some dummy calculations to use computation time
            std::random_device rd;
            std::mt19937 generator(rd());
            std::uniform_real_distribution<double> particleDistribution(-1.0, 1.0);
            for(int j = 0; j < 1000000; j++)
                particleDistribution(generator);

            // The actual code bit again
            #pragma omp flush(globalMaxVal) // IF I COMMENT OUT THIS LINE I RUN INTO A RACE CONDITION
            #pragma omp critical
            if(maxVal > globalMaxVal)
            {
                globalMaxVal = maxVal;
            }
        }

        // Report the outcome - expected to be 99
        std::cout << "Run: " << r << ", globalMaxVal: " << globalMaxVal << std::endl;
    }
    system("pause");
    return 0;
}
EDIT 2:
After further testing, we've found that compiling the code in Visual Studio without optimization (/Od), or on Linux, gives correct results, whereas the bug surfaces in Visual Studio 2012 (Microsoft C/C++ compiler version 17.00.61030) with optimization enabled (/O2).
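Editorial note: a portable workaround that sidesteps the flush question entirely (a sketch, not the asker's code) is to keep a purely thread-local maximum and merge it into the shared value once per thread, so the critical section is entered only a handful of times. This works under plain OpenMP 2.0:

#include <iostream>
#include <omp.h>

int main()
{
    int globalMaxVal = 0;

    #pragma omp parallel
    {
        int localMax = 0;           // private to each thread

        #pragma omp for
        for(int i = 0; i < 100; i++)
            if(i > localMax)
                localMax = i;       // no shared state touched in the loop

        #pragma omp critical        // one merge per thread
        if(localMax > globalMaxVal)
            globalMaxVal = localMax;
    }

    std::cout << "globalMaxVal: " << globalMaxVal << std::endl;  // expect 99
    return 0;
}

(OpenMP 3.1 also offers reduction(max:...), but MSVC's OpenMP 2.0 support does not include it.)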

Why is a simple for loop without OpenMP faster than the same loop with OpenMP?

Here is my test code for OpenMP
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>

int main(int argc, char const *argv[]){
    double x[10000];
    clock_t start, end;
    double cpu_time_used;

    start = clock();
    #pragma omp parallel
    #pragma omp for
    for (int i = 0; i < 10000; ++i){
        x[i] = 1;
    }
    end = clock();

    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("%lf\n", cpu_time_used);
    return 0;
}
First I compiled the code without OpenMP:
gcc test.c -o main
The output of running main is 0.000039.
Then I compiled it with OpenMP enabled:
gcc test.c -o main -fopenmp
and the output is 0.008020.
Could anyone help me understand why this happens? Thanks in advance.
As High Performance Mark so eloquently described in his comment, there is a cost (overhead) to creating threads and distributing work. For such a tiny piece of work (39 us), the overhead outweighs any possible gains.
That said, your measurement is also misleading: clock measures CPU time, which is most likely not what you wanted (wall-clock time). For more details, see this question.
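A sketch of the same measurement using omp_get_wtime(), which returns elapsed wall-clock seconds and is the usual way to time OpenMP regions:

#include <stdio.h>
#include <omp.h>

#define N 10000

int main(void){
    double x[N];

    double t0 = omp_get_wtime();   /* wall-clock time, not CPU time */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        x[i] = 1;
    double t1 = omp_get_wtime();

    printf("elapsed: %f s (x[0] = %f)\n", t1 - t0, x[0]);
    return 0;
}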
Another misconception you might have: as soon as x is large enough, the simple loop becomes memory-bound, and you will likely not see the speedup you expect. For example, on a typical desktop system with four cores you might see a speedup of 1.5x instead of 4x.

The time of execution doesn't change whether I increase the number of threads or not

I am executing the following code snippet, as explained in the OpenMP tutorial. But what I see is that the time of execution doesn't change with NUM_THREADS; in fact, the time of execution just keeps changing a lot. I am wondering if the way I am trying to measure the time is wrong. I tried using clock_gettime, but I see the same results. Can anyone help with this, please? Beyond the lack of a reduction in time with OpenMP, I am troubled by why the reported time varies so much.
#include "iostream"
#include "omp.h"
#include "stdio.h"
double getTimeNow();
static long num_steps = 10000000;
#define PAD 8
#define NUM_THREADS 1
int main ()
{
int i,nthreads;
double pi, sum[NUM_THREADS][PAD];
double t0,t1;
double step = 1.0/(double) num_steps;
t0 = omp_get_wtime();
#pragma omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id==0) nthreads = nthrds;
for (i=id,sum[id][0]=0;i< num_steps; i=i+nthrds)
{
x = (i+0.5)*step;
sum[id][0] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0;i<nthreads;i++)pi += sum[i][0] * step;
t1 = omp_get_wtime();
printf("\n value obtained is %f\n",pi);
std::cout << "It took "
<< t1-t0
<< " seconds\n";
return 0;
}
You use omp_set_num_threads(), but it is a function, not a compiler directive. You should call it without #pragma:
omp_set_num_threads(NUM_THREADS);
Also, you can set the number of threads in the compiler directive, but the keyword is different:
#pragma omp parallel num_threads(4)
The preferred way is not to hardcode the number of threads in your program, but use the environment variable OMP_NUM_THREADS. For example, in bash:
export OMP_NUM_THREADS=4
However, the last approach does not suit your program as written, because NUM_THREADS is a compile-time constant used to size the sum array.
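Putting that together, a sketch of how the corrected section might look (only the lines around the directive change; the loop body stays as before):

t0 = omp_get_wtime();
omp_set_num_threads(NUM_THREADS);   /* a plain function call, not a #pragma */
#pragma omp parallel
{
    /* ... loop body unchanged ... */
}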

Why does my parallel linear search using OpenMP take more execution time than the sequential linear search program?

#include <stdio.h>
#include <omp.h>

int main()
{
    int i, key = 85, tid;
    int a[100] = {  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
                   11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                   21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
                   31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
                   41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
                   51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
                   61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
                   71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
                   81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
                   91, 92, 93, 94, 95 };

    #pragma omp parallel num_threads(2) private(i)
    {
        tid = omp_get_thread_num();

        #pragma omp for
        for(i = 0; i < 100; i++)
            if(a[i] == key)
            {
                printf("Key found. Position = %d by thread %d \n", i+1, tid);
            }
    }
    return 0;
}
Here is my parallel program. I'm using GCC on Fedora and the system is dual-core.
I need to compare sequential and parallel programs for linear search and show that the parallel one is better than the sequential one.
Do I need to add user and sys time to calculate the execution time for both the sequential and parallel versions (as the parallel one uses two cores)?
Please help me out. Thanks in advance.
It costs some time to set up the parallel environment. Try a much larger array; you should certainly see a speedup.
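A sketch of how such a comparison might look with a large array and wall-clock timing (the array size and key placement are arbitrary choices, not from the original post):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000   /* large enough that thread setup cost is negligible */

int main(void)
{
    int *a = malloc(N * sizeof *a);
    if (a == NULL)
        return 1;
    for (int i = 0; i < N; i++)
        a[i] = i;
    int key = N - 1;   /* worst case for a linear search */

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        if (a[i] == key)
            printf("Key found. Position = %d by thread %d\n",
                   i + 1, omp_get_thread_num());
    double t1 = omp_get_wtime();

    printf("Parallel search took %f seconds\n", t1 - t0);
    free(a);
    return 0;
}

Running the same loop without the pragma (or compiled without -fopenmp) gives the sequential baseline to compare against.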

Parallelizing a series of independent sequential lines of code

What is the best way to execute multiple lines of code in parallel if they are not dependent on each other? (I'm using OpenMP.)
Pseudo code:
database->connect()
openfile("stuff.txt")
ping("stackoverflow.com")
x = 2;
y = a + b;
The only way I can come up with is:
#pragma omp parallel for
for(i = 0; i < 5; i++)
    switch (i) {
        case 0: database->connect(); break;
        ...
I haven't tried it, but I also remember that you're not supposed to use break while using OpenMP.
So I'm assuming that the individual things you listed as independent tasks were just examples. If they really are things like y = a + b, then as @chrisaycock and @ejd have said, they're too small for this sort of parallelism (e.g. thread-based, as opposed to ILP or something) to actually take advantage of the concurrency, due to overheads. But if they are bigger operations, the way to do task-based parallelism in OpenMP is with the task directive, e.g.:
#include <stdio.h>
#include <omp.h>
#include <unistd.h>

void work(int *v) {
    *v = omp_get_thread_num();
    sleep(1);
}

int main(int argc, char **argv)
{
    int a, b, c;

    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task shared(a) default(none)
            work(&a);
            #pragma omp task shared(b) default(none)
            work(&b);
            #pragma omp task shared(c) default(none)
            work(&c);
        }
    }
    printf("a,b,c = %d,%d,%d\n", a, b, c);
    return 0;
}
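Worth noting: the task directive requires OpenMP 3.0, which Visual Studio's OpenMP 2.0 implementation does not provide. Under OpenMP 2.0 the closest equivalent is the sections directive; a sketch of the same idea, replacing the parallel region in the example above:

#pragma omp parallel sections
{
    #pragma omp section
    work(&a);
    #pragma omp section
    work(&b);
    #pragma omp section
    work(&c);
}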
