OpenMP device offload reduction to existing device memory location - openmp

How do I tell OpenMP device offload to use an existing location in device memory for a reduction? I want to avoid data movement to/from device. Results will only be accessed on the device.
Here's my code
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
for (long i = 0; i < n; ++i)
{
mo[0] += mi[i];
xo[0] += mi[i]*xi[i];
yo[0] += mi[i]*yi[i];
}
#pragma omp target is_device_ptr(mo,xo,yo)
{
xo[0] /= mo[0];
yo[0] /= mo[0];
}
}
with this code and clang++ 15 targeting nvidia ptx, I'm getting the error:
test.cpp:6:109: error: reduction variable cannot be in a is_device_ptr clause in '#pragma omp target teams distribute parallel for' directive
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
^
test.cpp:6:67: note: defined as reduction
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
^

You cannot use array subscripts in reduction clause. That's non-conforming code. Please try something along these lines:
#include <stdio.h>
int main(int argc, char * argv[]) {
double sum = 0;
#pragma omp target data map(tofrom:sum)
{
for (int t = 0; t < 10; t++) {
#pragma omp target teams distribute parallel for map(tofrom:sum) reduction(+:sum)
for (int j = 0; j < 10000; j++) {
sum += 1;
}
}
}
printf("sum=%lf\n", sum);
return 0;
}
With the the target data construct you can allocate a buffer for the reduction variable on the GPU. The target construct's reduction clause will then reduce the value into that buffered variable and will only transfer the variable back from the GPU at the closing curly brace of the target data construct.

It is possible to do this all on the device. A couple of key pieces of info were required:
the map clause's map-type is used to optimize the copies to/from the device and disable unnecessary copies. The alloc map-type disables both copies to and from the device.
variables have only one instance in the device. map clauses for variables already mapped by an enclosing target data or target enter data by default do not result in copies to/from the device.
With that the solution is as follows:
// --------------------------------------------------------------------------
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
double m, x, y;
#pragma omp target enter data map(alloc: m,x,y)
#pragma omp target map(alloc: m,x,y)
{
m = 0.;
x = 0.;
y = 0.;
}
#pragma omp target teams distribute parallel for reduction(+: m,x,y), \
is_device_ptr(mi,xi,yi), map(alloc: m,x,y)
for (long i = 0; i < n; ++i)
{
m += mi[i];
x += mi[i]*xi[i];
y += mi[i]*yi[i];
}
#pragma omp target is_device_ptr(mo,xo,yo), map(alloc: m,x,y)
{
mo[0] = m;
xo[0] = x/m;
yo[0] = y/m;
}
#pragma omp target exit data map(release: m,x,y)
}

BEWARE this is a tentative, unverified answer, since I don't have the target, and Compiler Explorer doesn.t seem to have a gcc which has offload enabled. Hence this is untested.
However, you can clearly try this for yourself!
I suggest splitting the directives, and adding explicit scalar locals for the reduction.
So your code would look something like this
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
#pragma omp target is_device_ptr(mi,xi,yi,mo,xo,yo)
{
double mTotal = 0.0;
double xTotal = 0.0;
double yTotal = 0.0;
#pragma omp teams distribute parallel for reduction(+: mTotal, xTotal, yTotal)
for (long i = 0; i < n; ++i)
{
mTotal += mi[i];
xTotal += mi[i]*xi[i];
yTotal += mi[i]*yi[i];
}
mo[0] = mTotal;
xo[0] = xTotal/mTotal;
yo[0] = yTotal/mTotal;
}
}
That compiles OK for the host, but, as above, YOUR MILEAGE MAY VARY

Related

Difference between mutual exclusion like atomic and reduction in OpenMP

I'm am following video lectures of Tim Mattson on OpenMP and there was one exercise to find errors in provided code that count area of the Mandelbrot. So here is the solution that was provided:
#define NPOINTS 1000
#define MAXITER 1000
void testpoint(struct d_complex);
struct d_complex{
double r;
double i;
};
struct d_complex c;
int numoutside = 0;
int main(){
int i,j;
double area, error, eps = 1.0e-5;
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps)
for(i = 0; i<NPOINTS; i++){
for(j=0; j < NPOINTS; j++){
c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
testpoint(c);
}
}
area=2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
error=area/(double)NPOINTS;
printf("Area of Mandlebrot set = %12.8f +/- %12.8f\n",area,error);
printf("Correct answer should be around 1.510659\n");
}
void testpoint(struct d_complex c){
// Does the iteration z=z*z+c, until |z| > 2 when point is known to be outside set
// If loop count reaches MAXITER, point is considered to be inside the set
struct d_complex z;
int iter;
double temp;
z=c;
for (iter=0; iter<MAXITER; iter++){
temp = (z.r*z.r)-(z.i*z.i)+c.r;
z.i = z.r*z.i*2+c.i;
z.r = temp;
if ((z.r*z.r+z.i*z.i)>4.0) {
#pragma omp atomic
numoutside++;
break;
}
}
}
The question I have is, could we use reduction in #pragma omp parallel of variable numoutside like:
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps) reduction(+:numoutside)
without atomic construct in testpoint function?
I tested the function without atomic, and the result was different from the one I got in the first place. Why does that happen? And while I understand the concept of mutual exclusion and use of it because of race conditioning, isn't reduction just another form of solving that problem with private variables?
Thank You in advance.

Output container from openmp parallel loop

Which critical section style is better when collecting output container?
// Insert into the output container one object at a time.
vector<float> output;
#pragma omp parallel for
for(int i=0; i<1000000; ++i)
{
float value = // compute something complicated
#pragma omp critical
{
output.push_back(value);
}
}
// Insert object into per-thread container; later aggregate those containers.
vector<float> output;
#pragma omp parallel
{
vector<float> per_thread;
#pragma omp for
for(int i=0; i<1000000; ++i)
{
float value = // compute something complicated
per_thread.push_back(value);
}
#pragma omp critical
{
output.insert(output.end(), per_thread.begin(), per_thread.end());
}
}
EDIT: the above examples were misleading because they indicated that each iteration pushes exactly one item, which is not true in my case. Here are more accurate examples:
// Insert into the output container one object at a time.
vector<float> output;
#pragma omp parallel for
for(int i=0; i<1000000; ++i)
{
int k = // compute number of items
for( int j=0; j<k; ++j)
{
float value = // compute something complicated
#pragma omp critical
{
output.push_back(value);
}
}
}
// Insert object into per-thread container; later aggregate those containers.
vector<float> output;
#pragma omp parallel
{
vector<float> per_thread;
#pragma omp for
for(int i=0; i<1000000; ++i)
{
int k = // compute number of items
for( int j=0; j<k; ++j)
{
float value = // compute something complicated
per_thread.push_back(value);
}
}
#pragma omp critical
{
output.insert(output.end(), per_thread.begin(), per_thread.end());
}
}
If you always insert exactly one item per parallel iteration, the proper way is:
std::vector<float> output(1000000);
#pragma omp parallel for
for(int i=0; i<1000000; ++i)
{
float value = // compute something complicated
output[i] = value;
}
It is threadsafe to assign distinct elements of std::vector (which is guaranteed because all i are different). And there is no significant false-sharing in this case.
If you do not insert exactly one item per parallel iteration either version is basically correct.
Your first version using a critical in the loop can be very slow - note that if the computation is really slow, it may still be fine overall.
The per-thread container / manual reduction is generally fine. Of course it makes the order of the result non-deterministic. You could streamline this by using a user-defined reduction.

GCC v.4 compilation error with #pragma omp task (variables with reference type are not permitted in private/firstprivate clauses)

I am porting one large MPI-based physics code to OpenMP tasking. On one Cray supercomputing machine the code compiled, linked and runs perfectly (cray-mpich library, Cray compiler were used for this). Then, the code moved to a server for Jenkins continuous integration (I don't have admin rights on that server), and there is only GCC v.4 compiler (Cray compiler can't be used as it's not a Cray machine). On that server my code is not compiled, there is an error:
... error: ‘pcls’ implicitly determined as ‘firstprivate’ has reference type
#pragma omp task
^
It's a spaghetti code, so it's hard to copy-paste here the code lines caused this error, but my guess is that this is due to the problem described here:
http://forum.openmp.org/forum/viewtopic.php?f=5&t=117
Is there any possibility to solve this issue? It seems like with GCC v.6 this was resolved, but not sure... I am curious if someone has this situation...
UPD:
I am providing the skeleton of one function, where one such error is caused (sorry for long listing!):
void EMfields3D::sumMoments_vectorized(const Particles3Dcomm* part)
{
grid_initialisation(...);
#pragma omp parallel
{
for (int species_idx = 0; species_idx < ns; species_idx++)
{
const Particles3Dcomm& pcls = part[species_idx];
assert_eq(pcls.get_particleType(), ParticleType::SoA);
const int is = pcls.get_species_num();
assert_eq(species_idx,is);
double const*const x = pcls.getXall();
double const*const y = pcls.getYall();
double const*const z = pcls.getZall();
double const*const u = pcls.getUall();
double const*const v = pcls.getVall();
double const*const w = pcls.getWall();
double const*const q = pcls.getQall();
const int nop = pcls.getNOP();
#pragma omp master
{
start_timing_for_moments_accumulation(...);
}
...
#pragma omp for // because shared
for(int i=0; i<moments1dsize; i++)
moments1d[i]=0;
// prevent threads from writing to the same location
for(int cxmod2=0; cxmod2<2; cxmod2++)
for(int cymod2=0; cymod2<2; cymod2++)
// each mesh cell is handled by its own thread
#pragma omp for collapse(2)
for(int cx=cxmod2;cx<nxc;cx+=2)
for(int cy=cymod2;cy<nyc;cy+=2)
for(int cz=0;cz<nzc;cz++)
#pragma omp task
{
const int ix = cx + 1;
const int iy = cy + 1;
const int iz = cz + 1;
{
// reference the 8 nodes to which we will
// write moment data for particles in this mesh cell.
//
arr1_double_fetch momentsArray[8];
arr2_double_fetch moments00 = moments[ix][iy];
arr2_double_fetch moments01 = moments[ix][cy];
arr2_double_fetch moments10 = moments[cx][iy];
arr2_double_fetch moments11 = moments[cx][cy];
momentsArray[0] = moments00[iz]; // moments000
momentsArray[1] = moments00[cz]; // moments001
momentsArray[2] = moments01[iz]; // moments010
momentsArray[3] = moments01[cz]; // moments011
momentsArray[4] = moments10[iz]; // moments100
momentsArray[5] = moments10[cz]; // moments101
momentsArray[6] = moments11[iz]; // moments110
momentsArray[7] = moments11[cz]; // moments111
const int numpcls_in_cell = pcls.get_numpcls_in_bucket(cx,cy,cz);
const int bucket_offset = pcls.get_bucket_offset(cx,cy,cz);
const int bucket_end = bucket_offset+numpcls_in_cell;
some_manipulation_with_moments_accumulation(...);
}
}
#pragma omp master
{
end_timing_for_moments_accumulation(...);
}
// reduction
#pragma omp master
{
start_timing_for_moments_reduction(...);
}
{
#pragma omp for collapse(2)
for(int i=0;i<nxn;i++)
{
for(int j=0;j<nyn;j++)
{
for(int k=0;k<nzn;k++)
#pragma omp task
{
rhons[is][i][j][k] = invVOL*moments[i][j][k][0];
Jxs [is][i][j][k] = invVOL*moments[i][j][k][1];
Jys [is][i][j][k] = invVOL*moments[i][j][k][2];
Jzs [is][i][j][k] = invVOL*moments[i][j][k][3];
pXXsn[is][i][j][k] = invVOL*moments[i][j][k][4];
pXYsn[is][i][j][k] = invVOL*moments[i][j][k][5];
pXZsn[is][i][j][k] = invVOL*moments[i][j][k][6];
pYYsn[is][i][j][k] = invVOL*moments[i][j][k][7];
pYZsn[is][i][j][k] = invVOL*moments[i][j][k][8];
pZZsn[is][i][j][k] = invVOL*moments[i][j][k][9];
}
}
}
}
#pragma omp master
{
end_timing_for_moments_reduction(...);
}
}
}
for (int i = 0; i < ns; i++)
{
communicateGhostP2G(i);
}
}
Please, don't try to find a logic here (like why there is "#pragma omp parallel" and then the for-loop appears without "#pragma omp for"; or why in a for-loop there is a task construct)... I was not implementing the code, but I has to port it to OpenMP tasking...

OpenMP: How to copy back value of firstprivate variable back to global

I am new to OpenMP and I am stuck with a basic operation. Here is a sample code for my question.
#include <omp.h>
int main(void)
{
int A[16] = {1,2,3,4,5 ...... 16};
#pragma omp parallel for firstprivate(A)
for(int i = 0; i < 4; i++)
{
for(int j = 0; j < 4; j++)
{
A[i*4+j] = Process(A[i*4+j]);
}
}
}
As evident,value of A is local to each thread. However, at the end, I want to write back part of A calculated by each threadto the corresponding position in global variable A. How this can be accomplished?
Simply make A shared. This is fine, because all loop iterations operate on separate elements of A. Remember that OpenMP is shared memory programming.
You can do so explicitly by using shared instead of firstprivate, or simply remove the declaration:
int A[16] = {1,2,3,4,5 ...... 16};
#pragma omp parallel for
for(int i = 0; i < 4; i++)
By default all variables declared outside of the parallel region. You can find an extended exemplary description in this answer.

mandelbrot using openMP

// return 1 if in set, 0 otherwise
int inset(double real, double img, int maxiter){
double z_real = real;
double z_img = img;
for(int iters = 0; iters < maxiter; iters++){
double z2_real = z_real*z_real-z_img*z_img;
double z2_img = 2.0*z_real*z_img;
z_real = z2_real + real;
z_img = z2_img + img;
if(z_real*z_real + z_img*z_img > 4.0) return 0;
}
return 1;
}
// count the number of points in the set, within the region
int mandelbrotSetCount(double real_lower, double real_upper, double img_lower, double img_upper, int num, int maxiter){
int count=0;
double real_step = (real_upper-real_lower)/num;
double img_step = (img_upper-img_lower)/num;
for(int real=0; real<=num; real++){
for(int img=0; img<=num; img++){
count+=inset(real_lower+real*real_step,img_lower+img*img_step,maxiter);
}
}
return count;
}
// main
int main(int argc, char *argv[]){
double real_lower;
double real_upper;
double img_lower;
double img_upper;
int num;
int maxiter;
int num_regions = (argc-1)/6;
for(int region=0;region<num_regions;region++){
// scan the arguments
sscanf(argv[region*6+1],"%lf",&real_lower);
sscanf(argv[region*6+2],"%lf",&real_upper);
sscanf(argv[region*6+3],"%lf",&img_lower);
sscanf(argv[region*6+4],"%lf",&img_upper);
sscanf(argv[region*6+5],"%i",&num);
sscanf(argv[region*6+6],"%i",&maxiter);
printf("%d\n",mandelbrotSetCount(real_lower,real_upper,img_lower,img_upper,num,maxiter));
}
return EXIT_SUCCESS;
}
I need to convert the above code into openMP. I know how to do it for a single matrix or image but i have to do it for 2 images at the same time
the arguments are as follows
$./mandelbrot -2.0 1.0 -1.0 1.0 100 10000 -1 1.0 0.0 1.0 100 10000
Any suggestion how to divide the work in to different threads for the two images and then further divide work for each image.
thanks in advance
If you want to process multiple images at a time, you need to add a #pragma omp parallel for into the loop in the main body such as:
#pragma omp parallel for private(real_lower, real_upper, img_lower, img_upper, num, maxiter)
for(int region=0;region<num_regions;region++){
// scan the arguments
sscanf(argv[region*6+1],"%lf",&real_lower);
sscanf(argv[region*6+2],"%lf",&real_upper);
sscanf(argv[region*6+3],"%lf",&img_lower);
sscanf(argv[region*6+4],"%lf",&img_upper);
sscanf(argv[region*6+5],"%i",&num);
sscanf(argv[region*6+6],"%i",&maxiter);
printf("%d\n",mandelbrotSetCount(real_lower,real_upper,img_lower,img_upper,num,maxiter));
}
Notice that some variables need to be classified as private (i.e. each thread has its own copy).
Now, if you want additional parallelism you need nested OpenMP (see nested and NESTED_OMP in OpenMP specification) as the work will be spawned by OpenMP threads -- but note that nesting may not give you a performance boost always.
In this case, what about adding a #pragma omp parallel for (with the appropriate reduction clause so that each thread accumulates into count) into the mandelbrotSetCount routine such as
// count the number of points in the set, within the region
int mandelbrotSetCount(double real_lower, double real_upper, double img_lower, double img_upper, int num, int maxiter)
{
int count=0;
double real_step = (real_upper-real_lower)/num;
double img_step = (img_upper-img_lower)/num;
#pragma omp parallel for reduction(+:count)
for(int real=0; real<=num; real++){
for(int img=0; img<=num; img++){
count+=inset(real_lower+real*real_step,img_lower+img*img_step,maxiter);
}
}
return count;
}
The whole approach would split images between threads first and then the rest of the available threads would be able to split the loop iterations in this routine among all the available threads each time you invoke the routine.
EDIT
As user Hristo suggest's on the comments, the mandelBrotSetCount routine might be unbalanced (the best reason is that the user simply requests a different number of maxiter) on each invocation. One way to address this performance issue might be to use dynamic thread scheduling in the routine. So rather than having
#pragma omp parallel for reduction(+:count)
we might want to have
#pragma omp parallel for reduction(+:count) schedule(dynamic,N)
and here N should be a relatively small value (and likely larger than 1).

Resources