Why is a simple for loop without OpenMP faster than with OpenMP?

Here is my test code for OpenMP:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>

int main(int argc, char const *argv[]) {
    double x[10000];
    clock_t start, end;
    double cpu_time_used;

    start = clock();
    #pragma omp parallel
    #pragma omp for
    for (int i = 0; i < 10000; ++i) {
        x[i] = 1;
    }
    end = clock();

    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("%lf\n", cpu_time_used);
    return 0;
}
I compiled the code twice. First without OpenMP:

gcc test.c -o main

The output of running main is 0.000039.
Then I compiled with OpenMP
gcc test.c -o main -fopenmp
and the output is 0.008020
Could anyone help me understand why this happens? Thanks in advance.

As High Performance Mark so eloquently described in his comment, there is a cost (overhead) associated with creating threads and distributing work. For such a tiny piece of work (39 µs), the overhead outweighs any possible gains.
That said, your measurement is also misleading. clock measures CPU time and is most likely not what you wanted (wall clock). For more details, see this question.
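For illustration, here is a minimal sketch of the same loop timed with omp_get_wtime, which measures wall-clock time:

#include <stdio.h>
#include <omp.h>

int main(void) {
    double x[10000];
    double start = omp_get_wtime();   /* wall-clock seconds */
    #pragma omp parallel for
    for (int i = 0; i < 10000; ++i) {
        x[i] = 1;
    }
    double end = omp_get_wtime();
    /* print x[0] as well so the compiler cannot optimize the loop away */
    printf("%lf (x[0] = %lf)\n", end - start, x[0]);
    return 0;
}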
Another possible misconception: as soon as x is large enough, the simple loop becomes memory-bound, and you will likely not see the speedup you expect. For example, on a typical desktop system with four cores you might see a speedup of 1.5x instead of 4x.

Related

How to have the same routine executed sometimes by the CPU and sometimes by the GPU with OpenACC?

I'm dealing with a routine that I want to be executed by the CPU the first time and by the GPU every other time. This routine contains the loop:
for (k = kb; k <= ke; k++) {
    for (j = jb; j <= je; j++) {
        for (i = ib; i <= ie; i++) {
            ...
        }
    }
}
I tried adding #pragma acc loop collapse(3) to the loop and #pragma acc routine(routine) vector just before the calls where I want the GPU to execute the routine. -Minfo=accel doesn't report any messages, and with Nsight Systems I see that the routine is always executed by the CPU, so this approach doesn't work.
Why is the compiler honoring neither of the two #pragma directives?
To follow on from Thomas' answer, here's an example of using the "if" clause:
% cat test.c
#include <stdlib.h>
#include <stdio.h>

void compute(int * Arr, int size, int use_gpu) {
    #pragma acc parallel loop copyout(Arr[:size]) if(use_gpu)
    for (int i = 0; i < size; ++i) {
        Arr[i] = i;
    }
}

int main() {
    int *Arr;
    int size;
    int use_gpu;

    size = 1024;
    Arr = (int*) malloc(sizeof(int)*size);

    // Run on the host
    use_gpu = 0;
    compute(Arr, size, use_gpu);

    // Run on the GPU
    use_gpu = 1;
    compute(Arr, size, use_gpu);

    free(Arr);
}
% nvc -acc -Minfo=accel test.c
compute:
      4, Generating copyout(Arr[:size]) [if not already present]
         Generating NVIDIA GPU code
          7, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
% setenv NV_ACC_TIME 1
% a.out

Accelerator Kernel Timing data
test.c
  compute  NVIDIA  devicenum=0
    time(us): 48
    4: compute region reached 1 time
        4: kernel launched 1 time
            grid: [8]  block: [128]
            device time(us): total=5 max=5 min=5 avg=5
            elapsed time(us): total=331 max=331 min=331 avg=331
    4: data region reached 2 times
        9: data copyout transfers: 1
            device time(us): total=43 max=43 min=43 avg=43
I'm using nvc and set the compiler's runtime profiler (NV_ACC_TIME=1) to show that the kernel is launched only once.
You need to enable OpenACC processing: -acc (with the NVHPC tools) or -fopenacc (with GCC), for example. Then you need to use an OpenACC compute construct (parallel, kernels) to actually launch parallel GPU execution, plus host/device memory management as necessary. For example, you could call your routine from that compute construct, and the routine would annotate the loop nest with OpenACC loop directives, as you've mentioned, to actually make use of the GPU parallelism.
To answer your actual question: OpenACC compute constructs support an if clause to specify whether the region will execute on the current device ("GPU") or the local thread will execute the region ("CPU").
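For completeness, here is a minimal sketch of that compute-construct-plus-routine pattern. Note that my_routine and its signature are placeholders, not code from the question:

/* Sketch only: "my_routine" and its arguments are hypothetical. */
#pragma acc routine vector
void my_routine(float *a, int ib, int ie) {
    #pragma acc loop vector
    for (int i = ib; i <= ie; i++)
        a[i] = 2.0f * a[i];
}

void caller(float *a, int n) {
    /* The compute construct is what actually launches GPU execution;
       the routine directive makes my_routine callable from device code. */
    #pragma acc parallel copy(a[0:n]) num_gangs(1)
    my_routine(a, 0, n - 1);
}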

OpenMP does not provide speedup for a simple program

I just started learning OpenMP with C++, and I wrote a very simple program to check whether I can get some speedup from parallelizing it:
#include <iostream>
#include <ctime>
#include "omp.h"

int main() {
    const uint N = 1000000000;
    clock_t start_time = clock();

    #pragma omp parallel for
    for (uint i = 0; i < N; i++) {
        int x = 1+1;
    }

    clock_t end_time = clock();
    std::cout << "total_time: " << double(end_time - start_time) / CLOCKS_PER_SEC
              << " seconds." << std::endl;
}
The program takes 2.2 seconds without the parallel #pragma and 2.8 seconds with the parallel #pragma and 4 threads. What mistake did I make in the program? My compiler is clang++ 6.0, and the computer is a MacBook Pro with a 2.6 GHz i5 CPU running macOS 10.13.6.
EDIT:
I realized I used the wrong function for measuring execution time. Instead of clock() from the ctime library, I should use high_resolution_clock from the chrono library. In that case, I get 80 seconds for 1 thread, 47 seconds for 2 threads, and 35 seconds for 3 threads. Should the speedup be better than what I get here, since the program is embarrassingly parallel?
As with anything in parallel programming, there is a startup cost to creating new threads. For simple programs, the overhead of creating and managing threads is often great enough that it actually slows down the target program compared to when the program is run in a single thread.
In other words, you didn't make a mistake - this is an inherent part of using threads.
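If you want to observe that overhead directly, here is a small sketch that times an (almost) empty parallel region with omp_get_wtime, which measures wall-clock time:

#include <stdio.h>
#include <omp.h>

int main(void) {
    double start = omp_get_wtime();   /* wall-clock time */
    #pragma omp parallel
    {
        /* no work: whatever time this region takes is thread-management overhead */
    }
    double end = omp_get_wtime();
    printf("parallel region overhead: %g seconds\n", end - start);
    return 0;
}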

The execution time doesn't change when I increase the number of threads

I am executing the following code snippet as explained in the OpenMP tutorial. But what I see is that the execution time doesn't change with NUM_THREADS; in fact, it just varies a lot from run to run. I am wondering if the way I am trying to measure the time is wrong. I tried using clock_gettime, but I see the same results. Can anyone help with this, please? More than the lack of a reduction in time with OpenMP, I am troubled by why the reported time varies so much.
#include "iostream"
#include "omp.h"
#include "stdio.h"
double getTimeNow();
static long num_steps = 10000000;
#define PAD 8
#define NUM_THREADS 1
int main ()
{
int i,nthreads;
double pi, sum[NUM_THREADS][PAD];
double t0,t1;
double step = 1.0/(double) num_steps;
t0 = omp_get_wtime();
#pragma omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id==0) nthreads = nthrds;
for (i=id,sum[id][0]=0;i< num_steps; i=i+nthrds)
{
x = (i+0.5)*step;
sum[id][0] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0;i<nthreads;i++)pi += sum[i][0] * step;
t1 = omp_get_wtime();
printf("\n value obtained is %f\n",pi);
std::cout << "It took "
<< t1-t0
<< " seconds\n";
return 0;
}
You use omp_set_num_threads(), but it is a runtime library function, not a compiler directive. You should call it without #pragma:
omp_set_num_threads(NUM_THREADS);
Also, you can set the number of threads in the compiler directive, but the keyword is different:
#pragma omp parallel num_threads(4)
The preferred way is not to hardcode the number of threads in your program, but use the environment variable OMP_NUM_THREADS. For example, in bash:
export OMP_NUM_THREADS=4
However, this last approach is not well suited to your program, because the sum array is sized with the compile-time constant NUM_THREADS, so a thread count chosen at runtime must not exceed it.
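For reference, a minimal sketch with the corrected call in place (the rest of your program stays as it is):

#include <omp.h>
#include <stdio.h>

#define NUM_THREADS 4

int main(void) {
    omp_set_num_threads(NUM_THREADS);   /* a function call, not a #pragma */
    #pragma omp parallel
    {
        printf("hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}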

Why is my parallel linear search program using OpenMP taking more execution time than the sequential linear search program?

#include <stdio.h>
#include <omp.h>

int main()
{
    int i, key = 85, tid;
    int a[100] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
                  21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,
                  41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,
                  61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,
                  81,82,83,84,85,86,87,88,89,90,91,92,93,94,95};

    #pragma omp parallel num_threads(2) private(i)
    {
        tid = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < 100; i++)
            if (a[i] == key)
            {
                printf("Key found. Position = %d by thread %d \n", i+1, tid);
            }
    }
    return 0;
}
Here is my parallel program. I'm using GCC on Fedora and the system is dual-core.
Actually, I need to compare sequential and parallel programs for linear search and show that the parallel one is better than the sequential one.
Do I need to add user and sys time to calculate the execution time for both the sequential and parallel versions (as the latter uses two cores)?
Please help me out. Thanks in advance.
It costs some time to set up the parallel environment. Try a much larger array; you should certainly see a speedup.
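For example (a sketch; the array size and key are arbitrary choices), you could fill a large array and time the search with omp_get_wtime:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000   /* much larger than 100; arbitrary */

int main(void) {
    int *a = malloc(N * sizeof *a);
    for (int i = 0; i < N; i++) a[i] = i;
    int key = N - 1, pos = -1;

    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        if (a[i] == key)
            pos = i;   /* at most one iteration matches, so this write is race-free */
    double end = omp_get_wtime();

    printf("Key found. Position = %d in %g seconds\n", pos + 1, end - start);
    free(a);
    return 0;
}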

OpenMP is not creating threads in Visual Studio

My OpenMP version did not give any speed boost. I have a dual-core machine and the CPU usage is always 50%, so I tried the sample program given on the wiki. It looks like the OpenMP compiler (Visual Studio 2008) is not creating more than one thread.
This is the program:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        if (th_id == 0) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return EXIT_SUCCESS;
}
This is the output that I get:
Hello World from thread 0
There are 1 threads
Press any key to continue . . .
There's nothing wrong with the program, so presumably there's some issue with how it's being compiled or run. Is this VS2008 Pro? A quick Google search suggests OpenMP is not enabled in the Standard edition. Is OpenMP enabled in Properties -> C/C++ -> Language -> OpenMP? (That is, are you compiling with /openmp?) Is the environment variable OMP_NUM_THREADS being set to 1 somewhere when you run this?
If you want to test out your program with more than one thread, there are several constructs for specifying the number of threads in an OpenMP parallel region. They are, in order of precedence:
Evaluation of the if clause
Setting of the num_threads clause
Use of the omp_set_num_threads() library function
Setting of the OMP_NUM_THREADS environment variable
Implementation default
It sounds like your implementation is defaulting to one thread (assuming you don't have OMP_NUM_THREADS=1 set in your environment).
To test with 4 threads, for instance, you could add num_threads(4) to your #pragma omp parallel directive.
As the other answer noted, you won't really see any "speedup" because you aren't exploiting any parallelism. But it is reasonable to want to run a "hello world" program with several threads to test it out.
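For example, a minimal variant of your hello-world with an explicit thread count (4 here is an arbitrary choice):

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* num_threads(4) overrides the implementation default for this region */
    #pragma omp parallel num_threads(4)
    printf("Hello World from thread %d\n", omp_get_thread_num());
    return 0;
}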
As mentioned at http://docs.oracle.com/cd/E19422-01/819-3694/5_compiling.html, I got it working by setting the environment variable OMP_DYNAMIC to FALSE.
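The same effect can also be achieved from inside the program with the standard OpenMP runtime calls (a small sketch):

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_dynamic(0);       /* same effect as OMP_DYNAMIC=FALSE */
    omp_set_num_threads(4);   /* then request a fixed team size */
    #pragma omp parallel
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    return 0;
}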
Why would you need more than one thread for that program? Presumably OpenMP realizes that it doesn't need to create an extra thread to run a program with no loops and no code that could run in parallel whatsoever.
Try running some parallel stuff with OpenMP. Something like this:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNKSIZE 10
#define N 100

int main (int argc, char *argv[])
{
    int nthreads, tid, i, chunk;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n", tid);

        #pragma omp for schedule(dynamic,chunk)
        for (i = 0; i < N; i++)
        {
            c[i] = a[i] + b[i];
            printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
        }
    } /* end of parallel section */
}
If you want some hard core stuff, try running one of these.
