Why do I get 128 teams (thread blocks) with 96 threads in each when using #pragma omp target teams distribute parallel for in OpenMP?

I am running this code on Ubuntu 18.04 with the clang/LLVM compiler and an Nvidia GTX 1070 GPU:
#pragma omp target data map(to: A,B) map(from: C)
{
#pragma omp target teams distribute
for(int n=0; n<Row; n++)
{
int team_id= omp_get_team_num();
#pragma omp parallel for default(shared) schedule(auto)
for(int j = 0; j <Col; j++)
{
int thread_id = omp_get_thread_num();
printf("Iteration= c[ %d ][ %d ], Team=%d, Thread=%d\n",n, j, team_id, thread_id);
C[n][j] = A[n][j] + B[n][j];
}
}
}
In the above code, the maximum team number printed is 127 and the maximum thread number is 95.
Compile flags: clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_61 -Wall -O3 debug.cpp -o debug

Related

OpenMP device offload reduction to existing device memory location

How do I tell OpenMP device offload to use an existing location in device memory for a reduction? I want to avoid data movement to/from device. Results will only be accessed on the device.
Here's my code
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
for (long i = 0; i < n; ++i)
{
mo[0] += mi[i];
xo[0] += mi[i]*xi[i];
yo[0] += mi[i]*yi[i];
}
#pragma omp target is_device_ptr(mo,xo,yo)
{
xo[0] /= mo[0];
yo[0] /= mo[0];
}
}
with this code and clang++ 15 targeting nvidia ptx, I'm getting the error:
test.cpp:6:109: error: reduction variable cannot be in a is_device_ptr clause in '#pragma omp target teams distribute parallel for' directive
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
^
test.cpp:6:67: note: defined as reduction
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
^
You cannot use array subscripts in a reduction clause. That's non-conforming code. Please try something along these lines:
#include <stdio.h>
int main(int argc, char * argv[]) {
double sum = 0;
#pragma omp target data map(tofrom:sum)
{
for (int t = 0; t < 10; t++) {
#pragma omp target teams distribute parallel for map(tofrom:sum) reduction(+:sum)
for (int j = 0; j < 10000; j++) {
sum += 1;
}
}
}
printf("sum=%lf\n", sum);
return 0;
}
With the target data construct you can allocate a buffer for the reduction variable on the GPU. The target construct's reduction clause will then reduce the value into that buffered variable and will only transfer the variable back from the GPU at the closing curly brace of the target data construct.
It is possible to do this all on the device. A couple of key pieces of information were required:
The map clause's map-type is used to optimize the copies to/from the device and disable unnecessary copies. The alloc map-type disables both copies to and from the device.
Variables have only one instance on the device. By default, map clauses for variables already mapped by an enclosing target data or target enter data do not result in copies to/from the device.
With that the solution is as follows:
// --------------------------------------------------------------------------
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
double m, x, y;
#pragma omp target enter data map(alloc: m,x,y)
#pragma omp target map(alloc: m,x,y)
{
m = 0.;
x = 0.;
y = 0.;
}
#pragma omp target teams distribute parallel for reduction(+: m,x,y), \
is_device_ptr(mi,xi,yi), map(alloc: m,x,y)
for (long i = 0; i < n; ++i)
{
m += mi[i];
x += mi[i]*xi[i];
y += mi[i]*yi[i];
}
#pragma omp target is_device_ptr(mo,xo,yo), map(alloc: m,x,y)
{
mo[0] = m;
xo[0] = x/m;
yo[0] = y/m;
}
#pragma omp target exit data map(release: m,x,y)
}
BEWARE: this is a tentative, unverified answer, since I don't have the target hardware, and Compiler Explorer doesn't seem to have a gcc with offload enabled. Hence this is untested.
However, you can clearly try this for yourself!
I suggest splitting the directives, and adding explicit scalar locals for the reduction.
So your code would look something like this
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
#pragma omp target is_device_ptr(mi,xi,yi,mo,xo,yo)
{
double mTotal = 0.0;
double xTotal = 0.0;
double yTotal = 0.0;
#pragma omp teams distribute parallel for reduction(+: mTotal, xTotal, yTotal)
for (long i = 0; i < n; ++i)
{
mTotal += mi[i];
xTotal += mi[i]*xi[i];
yTotal += mi[i]*yi[i];
}
mo[0] = mTotal;
xo[0] = xTotal/mTotal;
yo[0] = yTotal/mTotal;
}
}
That compiles OK for the host, but, as above, YOUR MILEAGE MAY VARY

Unexplainable LLVM vs GCC OpenMP differences

Consider the following matrix multiplication code:
#define BLOCKING 64
void mat_mult_ijk_blocked(
const int m,
const int n,
const int p,
real a[restrict m][p],
const real b[m][n],
const real c[n][p]) {
for(int i=0; i<m; i++) {
for(int k=0; k<p; k++) {
a[i][k] = 0.0;
}
}
#pragma omp parallel for
for(int block_i=0; block_i<m; block_i += BLOCKING) {
for(int block_j=0; block_j<n; block_j += BLOCKING) {
for(int i=block_i; i<min(block_i + BLOCKING, m); i++) {
for(int j=block_j; j<min(block_j + BLOCKING, n); j++) {
real w = b[i][j];
for(int k=0; k<p; k++) {
a[i][k] += w * c[j][k];
}
}
}
}
}
}
For 1800×2200×1400 multiplication:
With clang without -fopenmp, the code takes 3.2s on my machine.
With -fopenmp and
OMP_NUM_THREADS=4 1.6s
OMP_NUM_THREADS=3 2.1s
OMP_NUM_THREADS=2 1.6s
OMP_NUM_THREADS=1 2.6s
This machine has 2 cores with hyperthreading, so this seems reasonable.
Now with gcc-5.4.0:
without -fopenmp 2.8s
with -fopenmp
OMP_NUM_THREADS=4 5.1s
OMP_NUM_THREADS=3 4.6s
OMP_NUM_THREADS=2 4.9s
OMP_NUM_THREADS=1 10s
What am I doing wrong that gives such poor performance with gcc?
The performance with a single thread of OpenMP is considerably worse than that of the original single-threaded code.
UPDATE - with gcc-6.2
without -fopenmp 3.s
with -fopenmp
OMP_NUM_THREADS=4 2.23s
OMP_NUM_THREADS=3 1.89s
OMP_NUM_THREADS=2 1.61s
OMP_NUM_THREADS=1 2.9s
Clearly now results are similar to those in LLVM. What is the reason for this?

Open MP with gsl_matrix is slower than sequential

I've created a simple C program using GSL (GNU Scientific Library) and OpenMP. In this simple program, I want to compare the execution time of the sequential and parallel versions. Here is the program, main.c.
#include "omp.h"
#include <stdio.h>
#include <gsl/gsl_matrix.h>
#include <sys/time.h> /* gettimeofday, struct timeval */
int main()
{
omp_set_num_threads(4);
int n1=10000, n2=10000;
gsl_matrix *A = gsl_matrix_alloc(n1, n2);
int i,j;
struct timeval tv1, tv2, tv3, tv4;
gettimeofday(&tv1, 0);
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv2, 0);
long elapsed = (tv2.tv_sec-tv1.tv_sec)*1000000 + tv2.tv_usec-tv1.tv_usec;
printf("Sequential Duration:%ldms\n", elapsed);
gettimeofday(&tv3, 0);
#pragma omp parallel for private(i,j)
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv4, 0);
elapsed = (tv4.tv_sec-tv3.tv_sec)*1000000 + tv4.tv_usec-tv3.tv_usec;
printf(" Parallel Duration:%ldms\n", elapsed);
return 0;
}
Then I compiled the above code, using this command:
gcc -fopenmp main.c -o test -lgsl -lgslcblas -lm
Here is the program's result:
Sequential Duration:11980106ms
Parallel Duration:20624043ms
Why is the parallel part slower than the sequential part? How can I optimize this code? Thanks
As you have written it, the j variable is shared between all threads, so the threads are constantly overwriting each other's state, which leads them to iterate over values they have already covered.
You should always minimize the scope of variables when trying to parallelize with OpenMP. Either move the declaration of j into the loop or mark it as private explicitly:
#pragma omp parallel for private(j)
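A minimal sketch of the other option, declaring the loop counters inside the for statements so that each thread gets its own copy without any private clause. The function name fill_matrix and the flat row-major array are assumptions made for illustration; they stand in for the gsl_matrix of the question so that the snippet has no GSL dependency.

```c
#include <stddef.h>

/* Hypothetical helper: fills a rows x cols row-major array so that
   a[i*cols + j] = i + j. The counters i and j are declared in the loop
   headers, so each thread has its own copy and no private clause is needed. */
void fill_matrix(double *a, int rows, int cols)
{
    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            a[(size_t)i * cols + j] = (double)(i + j);
        }
    }
}
```

Compiled without -fopenmp the pragma is ignored and the function runs sequentially with the same result.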
Also, clock counts processor time, not real time; you probably want to use gettimeofday.
Your matrix is too small to benefit much from parallelization; the threading overhead will dominate. Increase it to ~10000x10000 to start seeing something.
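On the timing itself: the expression in the question, (tv_sec difference)*1000000 + tv_usec difference, yields microseconds, even though the printf labels the value as ms. Read that way, the reported figures are roughly 12 s and 20.6 s. A small sketch of the conversion follows; the helper names elapsed_us and elapsed_ms are assumptions, not part of the original program.

```c
#include <sys/time.h>

/* Hypothetical helper: microseconds elapsed between two
   gettimeofday() samples. */
long elapsed_us(struct timeval start, struct timeval end)
{
    return (end.tv_sec - start.tv_sec) * 1000000L
         + (end.tv_usec - start.tv_usec);
}

/* Hypothetical helper: the same interval expressed in milliseconds. */
double elapsed_ms(struct timeval start, struct timeval end)
{
    return elapsed_us(start, end) / 1000.0;
}
```

With these, the print statements could report elapsed_ms(tv1, tv2) and keep the ms label, or report elapsed_us and change the label to us.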
The problem here is that you do not know what the procedure gsl_matrix_set does with A. You do not know if it is thread safe. To change one element in that matrix you supply the whole matrix to the routine instead of only the indices of the element. This smells of false sharing (see e.g. this answer).
I would try this instead
gsl_matrix_set(A[i][j],i*j*1000000);
If that does not work and what you are interested in is only the time difference between serial and parallel I would just do
A[i][j] = i*j*1000000
In the thread part, try this:
#pragma omp parallel private(i,j)
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
or
#pragma omp parallel for
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}

Compiling using GSL and OpenMP

I am not the best when it comes to compiling/writing makefiles.
I am trying to write a program that uses both GSL and OpenMP.
I have no problem using GSL and OpenMP separately, but I'm having issues using both. For instance, I can compile the GSL program
http://www.gnu.org/software/gsl/manual/html_node/An-Example-Program.html
By typing
$gcc -c Bessel.c
$gcc Bessel.o -lgsl -lgslcblas -lm
$./a.out
and it works.
I was also able to compile the program that uses OpenMP that I found here:
Starting a thread for each inner loop in OpenMP
In this case I typed
$gcc -fopenmp test_omp.c
$./a.out
And I got what I wanted (all 4 threads I have were used).
However, when I simply write a program that combines the two codes
#include <stdio.h>
#include <gsl/gsl_sf_bessel.h>
#include <omp.h>
int
main (void)
{
double x = 5.0;
double y = gsl_sf_bessel_J0 (x);
printf ("J0(%g) = %.18e\n", x, y);
int dimension = 4;
int i = 0;
int j = 0;
#pragma omp parallel private(i, j)
for (i =0; i < dimension; i++)
for (j = 0; j < dimension; j++)
printf("i=%d, jjj=%d, thread = %d\n", i, j, omp_get_thread_num());
return 0;
}
Then I try to compile by typing
$gcc -c Bessel_omp_test.c
$gcc Bessel_omp_test.o -fopenmp -lgsl -lgslcblas -lm
$./a.out
The GSL part works (The Bessel function is computed), but only one thread is used for the OpenMP part. I'm not sure what's wrong here...
You are missing the for worksharing directive in your OpenMP part. It should be:
// Just in case GSL modifies the number of threads
omp_set_num_threads(omp_get_max_threads());
omp_set_dynamic(0);
#pragma omp parallel for private(i, j)
for (i =0; i < dimension; i++)
for (j = 0; j < dimension; j++)
printf("i=%d, jjj=%d, thread = %d\n", i, j, omp_get_thread_num());
Edit: To summarise the discussion in the comments below, the OP failed to supply -fopenmp during the compilation phase. That prevented GCC from recognising the OpenMP directives, and thus no parallel code was generated.
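To make the difference between parallel and parallel for concrete, here is a small sketch that counts how often the loop body runs in each case. The function names are assumptions for illustration. The _OPENMP guard lets the file build even without -fopenmp, in which case the pragmas are ignored and everything runs on one thread, which is exactly the situation described in the edit above.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: plain parallel region. Every thread in the team
   executes all n iterations, so the body runs n * team_size times.
   The team size is reported through *team_size. */
long count_parallel_only(int n, int *team_size)
{
    long count = 0;
    int threads = 1;
    #pragma omp parallel
    {
        #ifdef _OPENMP
        #pragma omp single
        threads = omp_get_num_threads();
        #endif
        for (int i = 0; i < n; i++) {
            #pragma omp atomic
            count++;
        }
    }
    *team_size = threads;
    return count;
}

/* Hypothetical helper: parallel for. The n iterations are divided
   among the threads, so the body runs exactly n times in total. */
long count_parallel_for(int n)
{
    long count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        count++;
    }
    return count;
}
```

On a 4-thread team, count_parallel_only(16, &t) would be expected to return 64, while count_parallel_for(16) returns 16.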
IMHO, it's incorrect to declare the variables i and j as shared. Try declaring them private. Otherwise, each thread would get the same j and j++ would generate a race condition among threads.

openMP is not creating threads in visual studio

My openMP version did not give any speed boost. I have a dual core machine and the CPU usage is always 50%. So I tried the sample program given in Wiki. Looks like the openMP compiler (Visual Studio 2008) is not creating more than one thread.
This is the program:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[]) {
int th_id, nthreads;
#pragma omp parallel private(th_id)
{
th_id = omp_get_thread_num();
printf("Hello World from thread %d\n", th_id);
#pragma omp barrier
if ( th_id == 0 ) {
nthreads = omp_get_num_threads();
printf("There are %d threads\n",nthreads);
}
}
return EXIT_SUCCESS;
}
This is the output that I get:
Hello World from thread 0
There are 1 threads
Press any key to continue . . .
There's nothing wrong with the program - so presumably there's some issue with how it's being compiled or run. Is this VS2008 Pro? A quick google around suggests OpenMP is not enabled in Standard. Is OpenMP enabled in Properties -> C/C++ -> Language -> OpenMP? (Eg, are you compiling with /openmp)? Is the environment variable OMP_NUM_THREADS being set to 1 somewhere when you run this?
If you want to test out your program with more than one thread, there are several constructs for specifying the number of threads in an OpenMP parallel region. They are, in order of precedence:
Evaluation of the if clause
Setting of the num_threads clause
Use of the omp_set_num_threads() library function
Setting of the OMP_NUM_THREADS environment variable
Implementation default
It sounds like your implementation is defaulting to one thread (assuming you don't have OMP_NUM_THREADS=1 set in your environment).
To test with 4 threads, for instance, you could add num_threads(4) to your #pragma omp parallel directive.
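A minimal sketch of the num_threads suggestion, written as a function so the requested count can be varied. The name team_size_with_request is an assumption. Note that the runtime may still provide fewer threads than requested, and without OpenMP enabled at compile time the result is always 1.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: requests `requested` threads through the
   num_threads clause and returns the team size the runtime provided. */
int team_size_with_request(int requested)
{
    int size = 1;
    #pragma omp parallel num_threads(requested)
    {
        #ifdef _OPENMP
        /* One thread records the team size for everyone. */
        #pragma omp single
        size = omp_get_num_threads();
        #endif
    }
    return size;
}
```

If team_size_with_request(4) returns 1 on a multi-core machine, that points at the build settings (for example /openmp or -fopenmp not being passed) rather than at the program.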
As the other answer noted, you won't really see any "speedup" because you aren't exploiting any parallelism. But it is reasonable to want to run a "hello world" program with several threads to test it out.
As mentioned here, http://docs.oracle.com/cd/E19422-01/819-3694/5_compiling.html I got it working by setting the environment variable OMP_DYNAMIC to FALSE
Why would you need more than one thread for that program? It's clearly the case that OpenMP realizes that it doesn't need to create an extra thread to run a program with no loops, no code that could run in parallel whatsoever.
Try running some parallel stuff with OpenMP. Something like this:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define CHUNKSIZE 10
#define N 100
int main (int argc, char *argv[])
{
int nthreads, tid, i, chunk;
float a[N], b[N], c[N];
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
printf("Thread %d starting...\n",tid);
#pragma omp for schedule(dynamic,chunk)
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
} /* end of parallel section */
}
If you want some hard core stuff, try running one of these.
