OpenMP does not provide speedup for simple program

I just started learning OpenMP with C++, and I used a very simple program to check whether I can get some speedup from parallelizing it:
#include <iostream>
#include <ctime>
#include "omp.h"
int main() {
const uint N = 1000000000;
clock_t start_time = clock();
#pragma omp parallel for
for (uint i = 0; i < N; i++) {
int x = 1+1;
}
clock_t end_time = clock();
std::cout << "total_time: " << double(end_time - start_time) / CLOCKS_PER_SEC << " seconds." << std::endl;
}
The program takes 2.2 seconds without the parallel #pragma and 2.8 seconds with it, using 4 threads. What mistake did I make in the program? My compiler is clang++ 6.0, and the computer is a MacBook Pro with a 2.6 GHz i5 CPU running macOS 10.13.6.
EDIT:
I realized I used the wrong function for measuring execution time. Instead of clock() from the ctime library, I should use high_resolution_clock from the chrono library. With that, I get 80 seconds for 1 thread, 47 seconds for 2 threads, and 35 seconds for 3 threads. Shouldn't the speedup be better than this, since the program is embarrassingly parallel?

As with anything in parallel programming, there is a startup cost to creating new threads. For simple programs, the overhead of creating and managing threads is often great enough that it actually slows down the target program compared to when the program is run in a single thread.
In other words, you didn't make a mistake - this is an inherent part of using threads.
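To see the overhead for yourself, it helps to time a loop whose result is actually used (a loop like int x = 1+1; is dead code that an optimiser is free to delete) and to measure wall-clock time. The following is only a minimal sketch under those assumptions; timed_sum is a hypothetical helper, not something from the question. Without -fopenmp the pragma is ignored and the loop runs on one thread, so the same code can be built both ways and compared.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the question): sums 0..n-1 and reports the
// wall-clock time of the loop through *seconds.
// With -fopenmp the iterations are shared between threads; without it the
// pragma is ignored and the loop runs serially.
std::uint64_t timed_sum(std::uint64_t n, double* seconds) {
    const auto start = std::chrono::steady_clock::now();

    std::uint64_t sum = 0;
    // reduction(+ : sum) gives each thread a private partial sum and adds
    // the partial sums together at the end of the loop.
    #pragma omp parallel for reduction(+ : sum)
    for (std::int64_t i = 0; i < static_cast<std::int64_t>(n); ++i) {
        sum += static_cast<std::uint64_t>(i);
    }

    const auto stop = std::chrono::steady_clock::now();
    *seconds = std::chrono::duration<double>(stop - start).count();
    return sum;  // the result is used, so the loop is not dead code
}
```

Calling timed_sum with a small n and then a very large n, in builds with and without -fopenmp, should show the fixed thread start-up cost dominating for small inputs. An optimiser may still simplify a loop this trivial, so treat the numbers as indicative only.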

Related

How to have the same routine executed sometimes by the CPU and sometimes by the GPU with OpenACC?

I have a routine that I want executed by the CPU the first time it is called and by the GPU every time after that. This routine contains the loop:
for (k = kb; k <= ke; k++){
for (j = jb; j <= je; j++){
for (i = ib; i <= ie; i++){
...
}}}
I tried adding #pragma acc loop collapse(3) to the loop and #pragma acc routine(routine) vector just before the calls where I want the GPU to execute the routine. -Minfo=accel doesn't report any message, and with Nsight Systems I see that the routine is always executed by the CPU, so this approach doesn't work.
Why is the compiler ignoring both #pragma directives?

To follow on to Thomas' answer, here's an example of using the "if" clause:
% cat test.c
#include <stdlib.h>
#include <stdio.h>
void compute(int * Arr, int size, int use_gpu) {
#pragma acc parallel loop copyout(Arr[:size]) if(use_gpu)
for (int i=0; i < size; ++i) {
Arr[i] = i;
}
}
int main() {
int *Arr;
int size;
int use_gpu;
size=1024;
Arr = (int*) malloc(sizeof(int)*size);
// Run on the host
use_gpu=0;
compute(Arr,size,use_gpu);
// Run on the GPU
use_gpu=1;
compute(Arr,size,use_gpu);
free(Arr);
}
% nvc -acc -Minfo=accel test.c
compute:
4, Generating copyout(Arr[:size]) [if not already present]
Generating NVIDIA GPU code
7, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
% setenv NV_ACC_TIME 1
% a.out
Accelerator Kernel Timing data
test.c
compute NVIDIA devicenum=0
time(us): 48
4: compute region reached 1 time
4: kernel launched 1 time
grid: [8] block: [128]
device time(us): total=5 max=5 min=5 avg=5
elapsed time(us): total=331 max=331 min=331 avg=331
4: data region reached 2 times
9: data copyout transfers: 1
device time(us): total=43 max=43 min=43 avg=43
I'm using nvc and set the compiler's runtime profiler (NV_ACC_TIME=1) to show that the kernel is launched only once.

You need to enable OpenACC processing: -acc (with NVHPC tools) or -fopenacc (with GCC), for example, and then you need to use an OpenACC compute construct (parallel, kernels) to actually launch parallel GPU execution (plus host/device memory management, as necessary). For example, you could call your routine from that compute construct, and the routine would annotate the loop nest with OpenACC loop directives, as you've mentioned, to actually make use of the GPU parallelism.
To answer your actual question: the OpenACC compute constructs support an if clause that specifies whether the region executes on the current device ("GPU") or on the local thread ("CPU").

Why is a simple for loop faster without OpenMP than with OpenMP?

Here is my test code for OpenMP
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>
int main(int argc, char const *argv[]){
double x[10000];
clock_t start, end;
double cpu_time_used;
start = clock();
#pragma omp parallel
#pragma omp for
for (int i = 0; i < 10000; ++i){
x[i] = 1;
}
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("%lf\n", cpu_time_used);
return 0;
}
I compiled the code with the following two commands:
gcc test.c -o main
The output of running main is 0.000039
Then I compiled with OpenMP
gcc test.c -o main -fopenmp
and the output is 0.008020
Could anyone help me understand why this happens? Thanks in advance.

As High Performance Mark so eloquently described in his comment, there is a cost (overhead) with creating threads and distributing work. For such a tiny piece of work (39 us), the overhead outweighs any possible gains.
That said, your measurement is also misleading. clock() measures CPU time, which on most systems is summed over all threads of the process, and is most likely not what you wanted (wall-clock time). For more details, see this question.
Another misconception that you might have: as soon as x is large enough, the simple loop will become memory-bound, and you will likely not see the speedup you expect. For example, on a typical desktop system with four cores you might see a speedup of 1.5x instead of 4x.
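The difference between the two kinds of time can be made visible by measuring the same work with both clocks. This is a rough sketch rather than a definitive benchmark harness; Timings, measure and fill_ones are hypothetical names introduced here for illustration. On Linux and macOS, std::clock() typically reports processor time accumulated by every thread, so with OpenMP enabled cpu_seconds can exceed wall_seconds.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical helper type: both timings of one run, in seconds.
struct Timings {
    double cpu_seconds;   // std::clock(): processor time used by the process
    double wall_seconds;  // steady_clock: elapsed real time
};

// Runs work() once and measures it with both clocks.
template <typename Work>
Timings measure(Work work) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    work();

    const auto wall_stop = std::chrono::steady_clock::now();
    const std::clock_t cpu_stop = std::clock();

    Timings t;
    t.cpu_seconds = static_cast<double>(cpu_stop - cpu_start) / CLOCKS_PER_SEC;
    t.wall_seconds =
        std::chrono::duration<double>(wall_stop - wall_start).count();
    return t;
}

// Same loop as in the question, writing into a caller-supplied array.
// Without -fopenmp the pragma is ignored and the loop runs serially.
void fill_ones(double* x, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0;
    }
}
```

A call such as measure([&] { fill_ones(x, 10000); }) returns both numbers for one run; comparing them across builds with and without -fopenmp shows which clock tracks what the user actually waits for.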

NVidia CUDA Thrust device vector allocation is too slow

Does anybody know why vector allocation on the device takes so long on the first run when compiled in Debug mode? In my particular case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC2010), the first run of the Debug build takes over 40 seconds, and subsequent runs (without recompilation) take 10 times less (vector allocation on the device for the Release build takes over 1 second).
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdio>   // printf
#include <cstdlib>
#include <ctime>
#include <iostream> // std::cin
int main(void) {
clock_t t;
t = clock();
thrust::host_vector<int> h_vec( 100);
clock_t dt = clock() - t;
printf ("allocation on host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
thrust::generate(h_vec.begin(), h_vec.end(), rand);
dt = clock() - t;
printf ("initialization on host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
thrust::device_vector<int> d_vec( 100); // First run for Debug compiled version takes over 40 seconds...
dt = clock() - t;
printf ("allocation on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
d_vec[0] = h_vec[0];
dt = clock() - t;
printf ("copy one to device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
d_vec = h_vec;
dt = clock() - t;
printf ("copy all to device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
thrust::sort(d_vec.begin(), d_vec.end());
dt = clock() - t;
printf ("sort on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
dt = clock() - t;
printf ("copy to host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
for(int i=0; i<10; i++)
printf("%d\n", h_vec[i]);
dt = clock() - t;
printf ("output - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
std::cin.ignore();
return 0;
}

Most of the time you are measuring for the first vector instantiation isn't the cost of allocating and initialising the vector; it is overhead associated with the CUDA runtime and driver. I would guess that if you changed your code to something like this:
int main(void) {
clock_t t;
....
cudaFree(0); // This forces context establishment and lazy runtime overheads
t = clock();
thrust::device_vector<int> d_vec( 100); // First run for Debug compiled version takes over 40 seconds...
dt = clock() - t;
printf ("allocation on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
.....
you should see that the time measured for the vector allocation becomes the same in the first and second runs, even though the wall-clock time to run the whole program still differs greatly.
I don't have a good explanation for why the startup time differs so much between the first and second runs, but if I were to hazard a guess, some driver-level JIT recompilation is being performed on the first run, and the driver caches the code for subsequent runs. One thing to check is that you are compiling code for the correct architecture for your GPU; that would eliminate driver recompilation as a source of the time difference.
The nvprof utility can provide you with an API trace and timings. You might want to run it and see where in the API call sequence the time difference arises. It isn't beyond the realms of possibility that you are seeing the effects of some sort of driver bug, but without more information it is impossible to say.

It looks like in my case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC2010) the problem is solved by changing the project's CUDA C/C++ / Code Generation option from compute_10,sm_10 to compute_20,sm_20, which targets the newer GPU architecture. So I'm happy for today :)

Using both CUDA GPU devices and zero-copy pinned memory

I am using the CUSP library for sparse matrix multiplication on a CUDA machine. My current code is
#include <cusp/coo_matrix.h>
#include <cusp/multiply.h>
#include <cusp/print.h>
#include <cusp/transpose.h>
#include<stdio.h>
#define CATAGORY_PER_SCAN 1000
#define TOTAL_CATAGORY 100000
#define MAX_SIZE 1000000
#define ELEMENTS_PER_CATAGORY 10000
#define ELEMENTS_PER_TEST_CATAGORY 1000
#define INPUT_VECTOR 1000
#define TOTAL_ELEMENTS ELEMENTS_PER_CATAGORY * CATAGORY_PER_SCAN
#define TOTAL_TEST_ELEMENTS ELEMENTS_PER_TEST_CATAGORY * INPUT_VECTOR
int main(void)
{
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
cusp::coo_matrix<long long int, double, cusp::host_memory> A(CATAGORY_PER_SCAN,MAX_SIZE,TOTAL_ELEMENTS);
cusp::coo_matrix<long long int, double, cusp::host_memory> B(MAX_SIZE,INPUT_VECTOR,TOTAL_TEST_ELEMENTS);
for(int i=0; i< ELEMENTS_PER_TEST_CATAGORY;i++){
for(int j = 0;j< INPUT_VECTOR ; j++){
int index = i * INPUT_VECTOR + j ;
B.row_indices[index] = i; B.column_indices[ index ] = j; B.values[index ] = i;
}
}
for(int i = 0;i < CATAGORY_PER_SCAN; i++){
for(int j=0; j< ELEMENTS_PER_CATAGORY;j++){
int index = i * ELEMENTS_PER_CATAGORY + j ;
A.row_indices[index] = i; A.column_indices[ index ] = j; A.values[index ] = i;
}
}
/*cusp::print(A);
cusp::print(B); */
//test vector
cusp::coo_matrix<long int, double, cusp::device_memory> A_d = A;
cusp::coo_matrix<long int, double, cusp::device_memory> B_d = B;
// allocate output vector
cusp::coo_matrix<int, double, cusp::device_memory> y_d(CATAGORY_PER_SCAN, INPUT_VECTOR ,CATAGORY_PER_SCAN * INPUT_VECTOR);
cusp::multiply(A_d, B_d, y_d);
cusp::coo_matrix<int, double, cusp::host_memory> y=y_d;
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
printf("time elapsed %f ms\n",elapsedTime);
return 0;
}
As far as I understand, the cusp::multiply function uses only one GPU.
1 How can I use setDevice() to run the same program on both GPUs (one cusp::multiply per GPU)?
2 How can I measure the total time accurately?
3 How can I use zero-copy pinned memory with this library, given that I can use malloc myself?

1 How can I use setDevice() to run same program on both the GPU
If you mean "How can I perform a single cusp::multiply operation using two GPUs", the answer is you can't.
EDIT:
For the case where you want to run two separate CUSP sparse matrix-matrix products on different GPUs, it is possible to simply wrap the operation in a loop and call cudaSetDevice before the transfers and the cusp::multiply call. You will probably not, however, get any speedup by doing so. I think I am correct in saying that both the memory transfers and cusp::multiply operations are blocking calls, so the host CPU will stall until they are finished. Because of this, the calls for different GPUs cannot overlap and there will be no speedup over performing the same operation on a single GPU twice. If you were willing to use a multithreaded application and have a host CPU with multiple cores, you could probably still run them in parallel, but the host code won't be as straightforward as it seems you are hoping for.
2 Measure the total time accurately
The CUDA event approach you have now is the most accurate way of measuring the execution time of a single kernel. If you had a hypothetical multi-GPU scheme, then the sum of the events from each GPU context would be the total execution time of the kernels. If, by total time, you mean the "wallclock" time to complete the operation, then you would need to use a host timer around the whole multi-GPU segment of your code. I vaguely recall that it might be possible in the latest versions of CUDA to synchronize between events in streams from different contexts in some circumstances, so a CUDA event based timer might still be usable in such a scenario.
3 How can I use zero-copy pinned memory with this library as I can use malloc myself.
To the best of my knowledge that isn't possible. The underlying Thrust library that CUSP uses can support containers using zero-copy memory, but CUSP doesn't expose the necessary mechanisms in the standard matrix constructors to allocate a CUSP sparse matrix in zero-copy memory.

OpenMP is not creating threads in Visual Studio

My OpenMP version did not give any speed boost. I have a dual-core machine and the CPU usage is always 50%. So I tried the sample program given in the wiki. It looks like the OpenMP compiler (Visual Studio 2008) is not creating more than one thread.
This is the program:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[]) {
int th_id, nthreads;
#pragma omp parallel private(th_id)
{
th_id = omp_get_thread_num();
printf("Hello World from thread %d\n", th_id);
#pragma omp barrier
if ( th_id == 0 ) {
nthreads = omp_get_num_threads();
printf("There are %d threads\n",nthreads);
}
}
return EXIT_SUCCESS;
}
This is the output that I get:
Hello World from thread 0
There are 1 threads
Press any key to continue . . .

There's nothing wrong with the program, so presumably there's some issue with how it's being compiled or run. Is this VS2008 Pro? A quick search suggests OpenMP is not available in the Standard edition. Is OpenMP enabled in Properties -> C/C++ -> Language -> OpenMP (i.e., are you compiling with /openmp)? Is the environment variable OMP_NUM_THREADS being set to 1 somewhere when you run this?
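One quick way to check whether the compiler actually had OpenMP switched on is the _OPENMP macro, which conforming compilers define (as a yyyymm version date) only when OpenMP support is enabled. A small sketch, with openmp_version as a hypothetical helper name:

```cpp
// Returns the OpenMP version date (yyyymm) the compiler was invoked with,
// or 0 if OpenMP support was not enabled for this translation unit.
long openmp_version() {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;  // pragmas were ignored: the program runs single-threaded
#endif
}
```

Printing this value at program start makes the situation obvious: a result of 0 means the #pragma omp lines were silently ignored, which would explain a team of one thread.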

If you want to test out your program with more than one thread, there are several constructs for specifying the number of threads in an OpenMP parallel region. They are, in order of precedence:
1. Evaluation of the if clause
2. Setting of the num_threads clause
3. Use of the omp_set_num_threads() library function
4. Setting of the OMP_NUM_THREADS environment variable
5. Implementation default
It sounds like your implementation is defaulting to one thread (assuming you don't have OMP_NUM_THREADS=1 set in your environment).
To test with 4 threads, for instance, you could add num_threads(4) to your #pragma omp parallel directive.
As the other answer noted, you won't really see any "speedup" because you aren't exploiting any parallelism. But it is reasonable to want to run a "hello world" program with several threads to test it out.
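As a sketch of the num_threads suggestion, the helper below (team_size is a hypothetical name, not part of OpenMP) requests a team of a given size and reports how many threads the runtime actually created. The #ifdef guards are an assumption made here so that the code still compiles, and returns 1, when OpenMP is not enabled.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: asks for `requested` threads and returns the size of
// the team the runtime actually created (it may be smaller than requested).
int team_size(int requested) {
    int size = 1;  // stays 1 when OpenMP is disabled and the pragma is ignored
    #pragma omp parallel num_threads(requested)
    {
#ifdef _OPENMP
        // Exactly one thread of the team records the team size.
        #pragma omp single
        size = omp_get_num_threads();
#endif
    }
    return size;
}
```

If team_size(4) returns 1 even though the program was built with OpenMP enabled, the limit is coming from the runtime or environment rather than from the source code.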

As mentioned at http://docs.oracle.com/cd/E19422-01/819-3694/5_compiling.html, I got it working by setting the environment variable OMP_DYNAMIC to FALSE.

Why would you need more than one thread for that program? It has no loops and no code that could usefully run in parallel, so extra threads would not make it any faster. (A parallel region normally still starts a team of threads whether or not it contains loops, so seeing only one thread points to a build or configuration issue.)
Try running some parallel stuff with OpenMP. Something like this:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define CHUNKSIZE 10
#define N 100
int main (int argc, char *argv[])
{
int nthreads, tid, i, chunk;
float a[N], b[N], c[N];
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
printf("Thread %d starting...\n",tid);
#pragma omp for schedule(dynamic,chunk)
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
} /* end of parallel section */
}
If you want some hard core stuff, try running one of these.
