reduction when allocating data on the device - openacc

I have a simple n-body implementation. To keep it short, I've dropped the extra code from randomizeBodies().
typedef struct
{
float x;
float y;
float z;
float w;
} Point4;
n=16384;
Point4 positions [n];
Point4 velocities [n];
Point4 acceleration[n];
float E_pot [n];
#pragma acc declare device_resident(positions, velocities, acceleration, E_pot, n)
void randomizeBodies()
{
float K=0;
#pragma acc data copy(K)
#pragma acc parallel loop reduction(+:K)
for(int i=0;i<n; ++i)
{
...
Point4 velocity=...;
K+=1;
K+=velocity.y*velocity.y;
K+=velocity.z*velocity.z;
velocities[i].x = velocity.x ;
velocities[i].y = velocity.y ;
velocities[i].z = velocity.z ;
}
printf("K=%f",K);
}
Here velocity.x, velocity.y, velocity.z are float. I call randomizeBodies() in main() and don't understand why the output is "K=0". Is anything wrong with this code?

From this, it looks like you're managing "K" via a data region but haven't copied it back to the host before printing. Try adding an "#pragma acc update self(K)" directive before the print statement. Alternatively, you can remove the "#pragma acc data copy(K)" directive, in which case the compiler will implicitly copy K back after the end of the parallel loop.
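A minimal sketch of the first suggestion, assuming the rest of the loop body stays as in the question (details elided):

void randomizeBodies()
{
    float K = 0;
    #pragma acc data copy(K)
    {
        #pragma acc parallel loop reduction(+:K)
        for (int i = 0; i < n; ++i)
        {
            /* ... compute velocity and accumulate into K as above ... */
        }
        /* bring the reduced value back to the host before it is read */
        #pragma acc update self(K)
        printf("K=%f\n", K);
    }
}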

Related

-ta=tesla:deepcopy flag and #pragma acc shape

I just found out about the deepcopy flag. Until this moment I've always used -ta=tesla:managed to handle deep copy and I would like to explore the alternative.
I read this article: https://www.pgroup.com/blogs/posts/deep-copy-beta.htm which is well written but I think it does not cover my case. I have a structure of this type:
typedef struct Data_{
double ****Vc;
double ****Uc;
} Data;
The shape of these two arrays is not defined by an element of the struct itself but by the elements of another structure, and those are themselves defined only during the execution of the program.
How can I use the #pragma acc shape(Vc, Uc) in this case?
Without this pragma and copying the structure as follows:
int main(){
Data data;
Initialize(&data);
}
int Initialize(Data *data){
data->Uc = ARRAY_4D(ntot[KDIR], ntot[JDIR], ntot[IDIR], NVAR, double);
data->Vc = ARRAY_4D(NVAR, ntot[KDIR], ntot[JDIR], ntot[IDIR], double);
#pragma acc enter data copyin(data)
PrimToCons3D(data->Vc, data->Uc, grid, NULL);
}
void PrimToCons3D(double ****V, double ****U, Grid *grid, RBox *box){
#pragma acc parallel loop collapse(3) present(V[:NVAR][:nx3_tot][:nx2_tot][:nx1_tot])
for (k = kbeg; k <= kend; k++){
for (j = jbeg; j <= jend; j++){
for (i = ibeg; i <= iend; i++){
double v[NVAR];
#pragma acc loop
for (nv = 0; nv < NVAR; nv++) v[nv] = V[nv][k][j][i];
}
I get
FATAL ERROR: data in PRESENT clause was not found on device 1: name=V host:0x1fd2b80
file:/home/Prova/Src/mappers3D.c PrimToCons3D line:140
Btw, this same code works fine with -ta=tesla:managed.
Since you don't provide a full reproducing example, I wasn't able to test this, but it would look something like:
typedef struct Data_{
int i,j,k,l;
double ****Vc;
double ****Uc;
#pragma acc shape(Vc[0:k][0:j][0:i][0:l])
#pragma acc shape(Uc[0:k][0:j][0:i][0:l])
} Data;
int Initialize(Data *data){
/* the shape fields i, j, k, l live on the struct itself and are shared by Vc and Uc */
data->i = ntot[IDIR];
data->j = ntot[JDIR];
data->k = ntot[KDIR];
data->l = NVAR;
data->Uc = ARRAY_4D(ntot[KDIR], ntot[JDIR], ntot[IDIR], NVAR, double);
data->Vc = ARRAY_4D(NVAR, ntot[KDIR], ntot[JDIR], ntot[IDIR], double);
#pragma acc enter data copyin(data)
PrimToCons3D(data->Vc, data->Uc, grid, NULL);
}
void PrimToCons3D(double ****V, double ****U, Grid *grid, RBox *box){
int kbeg, jbeg, ibeg, kend, jend, iend;
#pragma acc parallel loop collapse(3) present(V, U)
for (int k = kbeg; k <= kend; k++){
for (int j = jbeg; j <= jend; j++){
for (int i = ibeg; i <= iend; i++){
Though keep in mind that the "shape" and "policy" directives were not adopted by the OpenACC standard and we (the NVHPC compiler team) only did a Beta version, which we have not maintained.
It's probably better to do a manual deep copy, which will be standard compliant. I can help with that if you can provide a reproducer that includes how you're doing the array allocation, i.e. "ARRAY_4D".
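For reference, a minimal, hedged sketch of what a manual deep copy looks like in standard OpenACC for a struct with one dynamically allocated member (a contiguous 1D buffer here for simplicity; a double**** built from nested pointer arrays needs each pointer level copied and attached in turn). The struct and field names are illustrative, not from the question:

typedef struct {
    int n;
    double *buf;   /* allocated on the host */
} Simple;

void simple_copyin(Simple *s)
{
    /* copy the struct first, then its pointee; the second copyin also
       attaches s->buf inside the device copy of the struct */
    #pragma acc enter data copyin(s[0:1])
    #pragma acc enter data copyin(s->buf[0:s->n])
}

void simple_delete(Simple *s)
{
    #pragma acc exit data delete(s->buf[0:s->n])
    #pragma acc exit data delete(s[0:1])
}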

OpenMP device offload reduction to existing device memory location

How do I tell OpenMP device offload to use an existing location in device memory for a reduction? I want to avoid data movement to/from device. Results will only be accessed on the device.
Here's my code
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
for (long i = 0; i < n; ++i)
{
mo[0] += mi[i];
xo[0] += mi[i]*xi[i];
yo[0] += mi[i]*yi[i];
}
#pragma omp target is_device_ptr(mo,xo,yo)
{
xo[0] /= mo[0];
yo[0] /= mo[0];
}
}
With this code and clang++ 15 targeting NVIDIA PTX, I'm getting the error:
test.cpp:6:109: error: reduction variable cannot be in a is_device_ptr clause in '#pragma omp target teams distribute parallel for' directive
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
^
test.cpp:6:67: note: defined as reduction
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
^
You cannot use array subscripts in a reduction clause; that's non-conforming code. Please try something along these lines:
#include <stdio.h>
int main(int argc, char * argv[]) {
double sum = 0;
#pragma omp target data map(tofrom:sum)
{
for (int t = 0; t < 10; t++) {
#pragma omp target teams distribute parallel for map(tofrom:sum) reduction(+:sum)
for (int j = 0; j < 10000; j++) {
sum += 1;
}
}
}
printf("sum=%lf\n", sum);
return 0;
}
With the target data construct you can allocate a buffer for the reduction variable on the GPU. The target construct's reduction clause will then reduce the value into that buffered variable and will only transfer the variable back from the GPU at the closing curly brace of the target data construct.
It is possible to do this all on the device. A couple of key pieces of info were required:
- The map clause's map-type is used to optimize the copies to/from the device and to disable unnecessary copies. The alloc map-type disables both copies to and from the device.
- Variables have only one instance on the device. map clauses for variables already mapped by an enclosing target data or target enter data do not, by default, result in copies to/from the device.
With that the solution is as follows:
// --------------------------------------------------------------------------
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
double m, x, y;
#pragma omp target enter data map(alloc: m,x,y)
#pragma omp target map(alloc: m,x,y)
{
m = 0.;
x = 0.;
y = 0.;
}
#pragma omp target teams distribute parallel for reduction(+: m,x,y), \
is_device_ptr(mi,xi,yi), map(alloc: m,x,y)
for (long i = 0; i < n; ++i)
{
m += mi[i];
x += mi[i]*xi[i];
y += mi[i]*yi[i];
}
#pragma omp target is_device_ptr(mo,xo,yo), map(alloc: m,x,y)
{
mo[0] = m;
xo[0] = x/m;
yo[0] = y/m;
}
#pragma omp target exit data map(release: m,x,y)
}
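For completeness, a hedged sketch of how such a routine might be driven when every buffer already lives in device memory; the driver function, names, and allocation pattern here are illustrative, not taken from the question:

#include <omp.h>

void reduce(const double *mi, const double *xi, const double *yi,
            double *mo, double *xo, double *yo, long n);

void run(long n)
{
    int dev = omp_get_default_device();
    /* all buffers live in device memory, so reduce() moves no data */
    double *mi = (double *)omp_target_alloc(n * sizeof(double), dev);
    double *xi = (double *)omp_target_alloc(n * sizeof(double), dev);
    double *yi = (double *)omp_target_alloc(n * sizeof(double), dev);
    double *mo = (double *)omp_target_alloc(sizeof(double), dev);
    double *xo = (double *)omp_target_alloc(sizeof(double), dev);
    double *yo = (double *)omp_target_alloc(sizeof(double), dev);
    /* ... fill mi/xi/yi on the device, e.g. with a target loop ... */
    reduce(mi, xi, yi, mo, xo, yo, n);
    /* results stay in mo/xo/yo in device memory */
    omp_target_free(mi, dev); omp_target_free(xi, dev); omp_target_free(yi, dev);
    omp_target_free(mo, dev); omp_target_free(xo, dev); omp_target_free(yo, dev);
}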
BEWARE: this is a tentative, unverified answer, since I don't have the target, and Compiler Explorer doesn't seem to have a gcc build with offload enabled. Hence this is untested.
However, you can clearly try this for yourself!
I suggest splitting the directives, and adding explicit scalar locals for the reduction.
So your code would look something like this
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
#pragma omp target is_device_ptr(mi,xi,yi,mo,xo,yo)
{
double mTotal = 0.0;
double xTotal = 0.0;
double yTotal = 0.0;
#pragma omp teams distribute parallel for reduction(+: mTotal, xTotal, yTotal)
for (long i = 0; i < n; ++i)
{
mTotal += mi[i];
xTotal += mi[i]*xi[i];
yTotal += mi[i]*yi[i];
}
mo[0] = mTotal;
xo[0] = xTotal/mTotal;
yo[0] = yTotal/mTotal;
}
}
That compiles OK for the host but, as above, YOUR MILEAGE MAY VARY.

Difference between mutual exclusion like atomic and reduction in OpenMP

I am following Tim Mattson's video lectures on OpenMP, and there was one exercise to find the errors in provided code that computes the area of the Mandelbrot set. Here is the solution that was provided:
#define NPOINTS 1000
#define MAXITER 1000
void testpoint(struct d_complex);
struct d_complex{
double r;
double i;
};
struct d_complex c;
int numoutside = 0;
int main(){
int i,j;
double area, error, eps = 1.0e-5;
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps)
for(i = 0; i<NPOINTS; i++){
for(j=0; j < NPOINTS; j++){
c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
testpoint(c);
}
}
area=2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
error=area/(double)NPOINTS;
printf("Area of Mandlebrot set = %12.8f +/- %12.8f\n",area,error);
printf("Correct answer should be around 1.510659\n");
}
void testpoint(struct d_complex c){
// Does the iteration z=z*z+c, until |z| > 2 when point is known to be outside set
// If loop count reaches MAXITER, point is considered to be inside the set
struct d_complex z;
int iter;
double temp;
z=c;
for (iter=0; iter<MAXITER; iter++){
temp = (z.r*z.r)-(z.i*z.i)+c.r;
z.i = z.r*z.i*2+c.i;
z.r = temp;
if ((z.r*z.r+z.i*z.i)>4.0) {
#pragma omp atomic
numoutside++;
break;
}
}
}
The question I have is: could we use a reduction on the variable numoutside in the #pragma omp parallel for, like this:
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps) reduction(+:numoutside)
without the atomic construct in the testpoint function?
I tested the function without atomic, and the result was different from the one I got in the first place. Why does that happen? And while I understand the concept of mutual exclusion and its use to avoid race conditions, isn't reduction just another way of solving that problem with private variables?
Thank You in advance.
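For reference, a minimal self-contained sketch (separate from the Mandelbrot code above) of how a reduction privatizes the variable: each thread increments its own copy, and the private copies are combined into the shared variable when the loop ends.

#include <stdio.h>

int main(void)
{
    int count = 0;
    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < 1000000; i++) {
        if (i % 3 == 0)   /* stand-in for the "point is outside" test */
            count++;      /* updates this thread's private copy, no race */
    }
    printf("count = %d\n", count);   /* deterministic result: 333334 */
    return 0;
}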

Eigen JacobiSVD cuda compile error

I've got an error regarding calling JacobiSVD in my CUDA function.
This is the part of the code that is causing the error.
Eigen::JacobiSVD<Eigen::Matrix3d> svd( cov_e, Eigen::ComputeThinU | Eigen::ComputeThinV);
And this is the error message.
CUDA_voxel_building.cu(43): error: calling a __host__
function("Eigen::JacobiSVD , (int)2> ::JacobiSVD") from a __global__
function("kernel") is not allowed
I've used the following command to compile it.
nvcc -std=c++11 -D_MWAITXINTRIN_H_INCLUDED -D__STRICT_ANSI__ -ptx CUDA_voxel_building.cu
I'm using CUDA 8.0 with Eigen 3 on Ubuntu 16.04.
It seems like other functions, such as eigenvalue decomposition, also give the same error.
Anyone knows a solution? I'm enclosing my code below.
//nvcc -ptx CUDA_voxel_building.cu
#include </usr/include/eigen3/Eigen/Core>
#include </usr/include/eigen3/Eigen/SVD>
/*
#include </usr/include/eigen3/Eigen/Sparse>
#include </usr/include/eigen3/Eigen/Dense>
#include </usr/include/eigen3/Eigen/Eigenvalues>
*/
__global__ void kernel(double *p, double *breaks,double *ind, double *mu, double *cov, double *e,double *v, int *n, char *isgood, int minpts, int maxgpu){
bool debuginfo = false;
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if(debuginfo)printf("Thread %d got pointer\n",idx);
if( idx < maxgpu){
int s_ind = breaks[idx];
int e_ind = breaks[idx+1];
int diff = e_ind-s_ind;
if(diff >minpts){
int cnt = 0;
Eigen::MatrixXd local_p(3,diff) ;
for(int k = s_ind;k<e_ind;k++){
int temp_ind=ind[k];
//Eigen::Matrix<double, 3, diff> local_p;
local_p(1,cnt) = p[temp_ind*3];
local_p(2,cnt) = p[temp_ind*3+1];
local_p(3,cnt) = p[temp_ind*3+2];
cnt++;
}
Eigen::Matrix3d centered = local_p.rowwise() - local_p.colwise().mean();
Eigen::Matrix3d cov_e = (centered.adjoint() * centered) / double(local_p.rows() - 1);
Eigen::JacobiSVD<Eigen::Matrix3d> svd( cov_e, Eigen::ComputeThinU | Eigen::ComputeThinV);
/* Eigen::Matrix3d Cp = svd.matrixU() * svd.singularValues().asDiagonal() * svd.matrixV().transpose();
mu[idx]=p[ind[s_ind]*3];
mu[idx+1]=p[ind[s_ind+1]*3];
mu[idx+2]=p[ind[s_ind+2]*3];
e[idx]=svd.singularValues()(0);
e[idx+1]=svd.singularValues()(1);
e[idx+2]=svd.singularValues()(2);
n[idx] = diff;
isgood[idx] = 1;
for(int x = 0; x < 3; x++)
{
for(int y = 0; y < 3; y++)
{
v[x+ 3*y +idx*9]=svd.matrixV()(x, y);
cov[x+ 3*y +idx*9]=cov_e(x, y);
//if(debuginfo)printf("%f ",R[x+ 3*y +i*9]);
if(debuginfo)printf("%f ",Rm(x, y));
}
}
*/
} else {
mu[idx]=0;
mu[idx+1]=0;
mu[idx+2]=0;
e[idx]=0;
e[idx+1]=0;
e[idx+2]=0;
n[idx] = 0;
isgood[idx] = 0;
for(int x = 0; x < 3; x++)
{
for(int y = 0; y < 3; y++)
{
v[x+ 3*y +idx*9]=0;
cov[x+ 3*y +idx*9]=0;
}
}
}
}
}
First of all, Ubuntu 16.04 provides Eigen 3.3-beta1, which is not really recommended for use. I would suggest upgrading to a more recent version. Furthermore, to include Eigen, write (e.g.):
#include <Eigen/Eigenvalues>
and compile with -I /usr/include/eigen3 (if you use the version provided by the OS), or better -I /path/to/local/eigen-version.
Then, as talonmies noted, you can't call host functions from kernels (I'm not sure at the moment why JacobiSVD is not marked as a device function), but in your case it would make much more sense to use Eigen::SelfAdjointEigenSolver anyway. Since the matrix you are decomposing is a fixed-size 3x3, you should actually use the optimized computeDirect method:
Eigen::SelfAdjointEigenSolver<Eigen::Matrix3d> eig; // default constructor
eig.computeDirect(cov_e); // works for 2x2 and 3x3 matrices, does not require loops
It seems the computeDirect even works on the beta version provided by Ubuntu (I'd still recommend to update).
Some unrelated notes:
The following is wrong, since you should start with index 0:
local_p(1,cnt) = p[temp_ind*3];
local_p(2,cnt) = p[temp_ind*3+1];
local_p(3,cnt) = p[temp_ind*3+2];
Also, you can write this in one line:
local_p.col(cnt) = Eigen::Vector3d::Map(p+temp_ind*3);
This line will not fit (unless diff==3):
Eigen::Matrix3d centered = local_p.rowwise() - local_p.colwise().mean();
What you probably mean is (local_p is actually 3xn not nx3)
Eigen::Matrix<double, 3, Eigen::Dynamic> centered = local_p.colwise() - local_p.rowwise().mean();
And when computing cov_e you need to .adjoint() the second factor, not the first.
You can avoid both 'big' matrices local_p and centered by directly accumulating Eigen::Matrix3d sum2 and Eigen::Vector3d sum with sum2 += v*v.adjoint() and sum += v, and then computing
Eigen::Vector3d mu = sum / diff;
Eigen::Matrix3d cov_e = (sum2 - mu*mu.adjoint()*diff)/(diff-1);
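A hedged sketch of that accumulation inside the kernel, reusing the question's s_ind, e_ind, ind, p and diff, and replacing the local_p/centered construction (the local mean is named mu_v here to avoid clashing with the kernel's mu output array):

Eigen::Vector3d sum  = Eigen::Vector3d::Zero();
Eigen::Matrix3d sum2 = Eigen::Matrix3d::Zero();
for (int k = s_ind; k < e_ind; k++) {
    int temp_ind = ind[k];
    Eigen::Vector3d v = Eigen::Vector3d::Map(p + temp_ind * 3);
    sum  += v;                 // first moment
    sum2 += v * v.adjoint();   // second moment
}
Eigen::Vector3d mu_v  = sum / diff;
Eigen::Matrix3d cov_e = (sum2 - mu_v * mu_v.adjoint() * diff) / (diff - 1);
eig.computeDirect(cov_e);      // the SelfAdjointEigenSolver declared above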

parallelizing in openMP

I have the following code that I want to parallelize using OpenMP:
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0;
for(j=0; j<r_b; j++)
for(k=0; k<c_b; k++)
{
double a;
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else
a = h_a[((m-j)*c_a) + (n-k)];
//printf("%lf\t", a);
value += h_b[(j*c_b) + k] * a;
}
h_c[m*c_c + n] = value;
//printf("%lf\t", h_c[m*c_c + n]);
}
//cout<<"row "<<m<<" completed"<<endl;
}
In this, I want every thread to perform the "for j" and "for k" loops simultaneously.
I am trying to do it by putting #pragma omp parallel for before the "for m" loop, but I am not getting the correct result.
How can I do this in an optimized manner? Thanks in advance.
Depending on exactly which loop you want to parallelize, you have three options:
#pragma omp parallel
{
#pragma omp for // Option #1
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0;
#pragma omp for // Option #2
for(j=0; j<r_b; j++)
for(k=0; k<c_b; k++)
{
double a;
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else
a = h_a[((m-j)*c_a) + (n-k)];
//printf("%lf\t", a);
value += h_b[(j*c_b) + k] * a;
}
h_c[m*c_c + n] = value;
//printf("%lf\t", h_c[m*c_c + n]);
}
//cout<<"row "<<m<<" completed"<<endl;
}
}
//////////////////////////////////////////////////////////////////////////
// Option #3
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
#pragma omp parallel
{
double value = 0.0;
#pragma omp for
for(j=0; j<r_b; j++)
for(k=0; k<c_b; k++)
{
double a;
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else
a = h_a[((m-j)*c_a) + (n-k)];
//printf("%lf\t", a);
value += h_b[(j*c_b) + k] * a;
}
h_c[m*c_c + n] = value;
//printf("%lf\t", h_c[m*c_c + n]);
}
}
//cout<<"row "<<m<<" completed"<<endl;
}
Test and profile each. You might find that option #1 is fastest if there isn't a lot of work for each thread, or you may find that with optimizations on, there is no difference (or even a slowdown) when enabling OMP.
Edit
I've adapted the MCVE supplied in the comments as follows:
#include <iostream>
#include <chrono>
#include <omp.h>
#include <algorithm>
#include <vector>
#define W_OMP
int main(int argc, char *argv[])
{
std::vector<double> h_a(9);
std::generate(h_a.begin(), h_a.end(), std::rand);
int r_b = 500;
int c_b = r_b;
std::vector<double> h_b(r_b * c_b);
std::generate(h_b.begin(), h_b.end(), std::rand);
int r_c = 500;
int c_c = r_c;
int r_a = 3, c_a = 3;
std::vector<double> h_c(r_c * c_c);
auto start = std::chrono::system_clock::now();
#ifdef W_OMP
#pragma omp parallel
{
#endif
int m,n,j,k;
#ifdef W_OMP
#pragma omp for
#endif
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0,a;
for(j=0; j<r_b; j++)
{
for(k=0; k<c_b; k++)
{
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else a = h_a[((m-j)*c_a) + (n-k)];
value += h_b[(j*c_b) + k] * a;
}
}
h_c[m*c_c + n] = value;
}
}
#ifdef W_OMP
}
#endif
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms"
#ifdef W_OMP
"\t with OMP"
#else
"\t without OMP"
#endif
"\n";
return 0;
}
As a reference, I'm using VS2012 (OMP 2.0, grrr). I'm not sure when collapse was introduced, but apparently after 2.0. Optimizations were /O2 and compiled in Release x64.
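For what it's worth, on a compiler supporting OpenMP 3.0 or later (where collapse was introduced), option #1 could also be written with the two outer loops collapsed into a single iteration space. An untested sketch, reusing the same arrays and bounds as above:

// requires OpenMP 3.0+; merges the m and n loops into one parallel iteration space
#pragma omp parallel for collapse(2)
for (int m = 0; m < r_c; m++)
{
    for (int n = 0; n < c_c; n++)
    {
        double value = 0.0;
        for (int j = 0; j < r_b; j++)
            for (int k = 0; k < c_b; k++)
            {
                double a = ((m - j) < 0 || (n - k) < 0 || (m - j) > r_a || (n - k) > c_a)
                           ? 0.0
                           : h_a[(m - j) * c_a + (n - k)];
                value += h_b[j * c_b + k] * a;
            }
        h_c[m * c_c + n] = value;
    }
}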
Benchmarks
Using the original sizes of the loops (7,7,5,5) and therefore arrays, the results were 0ms without OMP and 1ms with. Verdict: optimizations were better, and the added overhead wasn't worth it. Also, the measurements are not reliable (too short).
Using the slightly larger sizes of the loops (100, 100, 100, 100) and therefore arrays, the results were about equal at about 108ms. Verdict: still not worth the naive effort, tweaking OMP parameters might tip the scale. Definitely not the x4 speedup I would hope for.
Using even larger loop sizes (500, 500, 500, 500), and therefore arrays, OMP started to pull ahead: without OMP 74.3s, with OMP 15s. Verdict: worth it. Weird. I got a x5 speedup with four threads and four cores on an i5. I'm not going to try and figure out how that happened.
Summary
As has been stated in countless answers here on SO, it's not always a good idea to parallelize every for loop you come across. Things that can screw up your desired xN speedup:
- Not enough work per thread to justify the overhead of creating the additional threads.
- The work itself is memory bound. This means that the CPU can be running at 1 petaHz and you still won't see a speedup.
- Memory access patterns. I'm not going to go there. Feel free to edit in the relevant info if you want it.
- OMP parameters. The best choice of parameters will often be a result of this entire list (not including this item, to avoid recursion issues).
- SIMD operations. Depending on what and how you're doing, the compiler may vectorize your operations. I have no idea if OMP will usurp the SIMD operations, but it is possible. Check your assembly (foreign language to me) to confirm.
