external variables used in acc routine need to be in #pragma acc create() - openacc

NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - i (sim_xy1.c: 1012)
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - i (sim_xy1.c: 1010)
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - j (sim_xy1.c: 1002)
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - j (sim_xy1.c: 994)
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - i (sim_xy1.c: 986)
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - i (sim_xy1.c: 984)
I have marked with ** ** the lines in the code below that correspond to the warnings:
void produto_matriz_vetor(int NX, int NY, double *AN, double *AW, double *AP,
                          double *AE, double *AS, double *x, double *b)
{
    int N, aux, NXY;
    NXY = NX * NY;

    /* corner points */
    N = 1;
    b[N] = (AP[N]*x[N]) + (AE[N]*x[N+1]) + (AS[N]*x[N+NX]);
    N = NX;
    b[N] = (AW[N]*x[N-1]) + (AP[N]*x[N]) + (AS[N]*x[N+NX]);
    N = NXY - NX + 1;
    b[N] = (AN[N]*x[N-NX]) + (AP[N]*x[N]) + (AE[N]*x[N+1]);
    N = NXY;
    b[N] = (AN[N]*x[N-NX]) + (AW[N]*x[N-1]) + (AP[N]*x[N]);

    /* boundary rows and columns */
    for (N = 2; N < NX; N++)
    {
        b[N] = (AP[N]*x[N]) + AE[N]*x[N+1] + AS[N]*x[N+NX] + AW[N]*x[N-1];
    }
    **for (i = 2; i < NX; i++)**
    {
        **N = NXY - NX + i;**
        b[N] = (AN[N]*x[N-NX]) + (AW[N]*x[N-1]) + (AP[N]*x[N]) + (AE[N]*x[N+1]);
    }
    for (j = 2; j < NY; j++)
    {
        **N = (NX*(j-1)) + 1;**
        b[N] = (AN[N]*x[N-NX]) + (AP[N]*x[N]) + (AE[N]*x[N+1]) + (AS[N]*x[N+NX]);
    }
    for (j = 2; j < NY; j++)
    {
        **N = (NX*(j-1)) + NX;**
        b[N] = (AN[N]*x[N-NX]) + (AW[N]*x[N-1]) + (AP[N]*x[N]) + (AS[N]*x[N+NX]);
    }

    /* interior points */
    for (j = 2; j < NY; j++)
    {
        **for (i = 2; i < NX; i++)**
        {
            **N = (NX*(j-1)) + i;**
            b[N] = (AN[N]*x[N-NX]) + (AW[N]*x[N-1]) + (AP[N]*x[N])
                 + (AE[N]*x[N+1]) + (AS[N]*x[N+NX]);
        }
    }
}

Yes, global variables used in a device routine need to be placed in an "acc declare create" directive so that a device copy of the variable exists and can be accessed from the device routine.
However, using global variables as your loop index variables, as done here, is going to cause problems regardless: all threads would share the same i and j. The better solution is to declare the index variables locally inside the routine so they can be made private for each thread.
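A minimal sketch of that fix (identifiers mirror the question's code; the routine is assumed to be compiled for the device with "#pragma acc routine seq", and only the interior loop nest is shown):

#pragma acc routine seq
void produto_matriz_vetor(int NX, int NY, double *AN, double *AW, double *AP,
                          double *AE, double *AS, double *x, double *b)
{
    int i, j, N;       /* local index variables: private per thread, no warning */
    int NXY = NX * NY;

    for (j = 2; j < NY; j++)
    {
        for (i = 2; i < NX; i++)
        {
            N = (NX * (j - 1)) + i;
            b[N] = (AN[N]*x[N-NX]) + (AW[N]*x[N-1]) + (AP[N]*x[N])
                 + (AE[N]*x[N+1]) + (AS[N]*x[N+NX]);
        }
    }
    /* ...corner and edge updates as in the original routine... */
}

If a global truly must remain global (a lookup table, for example), mirror it on the device with #pragma acc declare create(name) next to its definition and keep the copies in sync with acc update directives.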

Related

OPENACC How to handle a library function in a #pragma acc routine

I have to call the <stdlib.h> function exit() inside this routine:
#pragma acc routine(Check) seq
int Check (double **u, char *str)
{
    for (int i = beg; i <= end; i++) {
        for (int v = 0; v < vend; v++) {
            if (isnan(u[i][v])) {
                #pragma acc routine(Here) seq
                Here(i, NULL);
                #pragma acc routine(exit)
                exit(1);
            }
        }
    }
    return 0;
}
I get the error:
nvlink error : Undefined reference to 'exit' in 'tools.o'
Usually I solve this problem by adding a #pragma acc routine directive before the body of the function, but in this case I'm dealing with a library function.
All routines called from the device need a device-callable version. System routines, "exit" included, often do not have device-callable versions, so they can't be used.
In any case, you can't exit a host application from device code, so you may want to rethink this portion of the code. Instead of calling "exit", capture the error and abort once execution has returned to the host.
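A minimal sketch of that pattern, with an error flag combined by a reduction and the abort done on the host (the names and the reduction choice are my own illustration, not from the original code):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#pragma acc routine seq
static int check_row(const double *u, int n)
{
    /* Return an error code instead of calling exit() on the device. */
    for (int v = 0; v < n; v++) {
        if (isnan(u[v])) return 1;
    }
    return 0;
}

int main(void)
{
    enum { N = 1000, M = 8 };
    static double u[N][M];    /* zero-initialized, so no NaNs in this demo */
    int err = 0;

    #pragma acc parallel loop reduction(max:err) copyin(u)
    for (int i = 0; i < N; i++) {
        int e = check_row(u[i], M);
        if (e > err) err = e; /* folded into the max reduction */
    }

    /* Back on the host it is safe to terminate. */
    if (err) {
        fprintf(stderr, "NaN detected\n");
        exit(1);
    }
    return 0;
}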

Is there a better way in C++ for one-time execution of a set of code instead of using a static variable check?

In many places I have code for one-time initialization, as below:
int callback_method(void *userData)
{
    /* This piece of code should run one time only */
    static int init_flag = 0;
    if (0 == init_flag)
    {
        /* do initialization stuff here */
        init_flag = 1;
    }
    /* Do regular stuff here */
    return 0;
}
I have just started using C++11. Is there a better way to replace this one-time code with a C++11 lambda, std::call_once, or anything else?
You can encapsulate your action in a static function call, or use something like an immediately invoked function expression (IIFE) with C++11 lambdas:
int action() { /*...*/ }
...
static int temp = action();
and
static auto temp = [](){ /*...*/ }();
respectively.
And yes, the most common solution is to use std::call_once, but sometimes it's a little bit of overkill (you have to manage a separate std::once_flag with it, which brings us back to the initial question).
With the C++11 standard all of these approaches are thread-safe (the variables will be initialized exactly once and "atomically"), but you must still avoid races inside action itself if it uses shared resources.
Yes - std::call_once, see the docs on how to use this: http://en.cppreference.com/w/cpp/thread/call_once
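For completeness, a minimal std::call_once sketch applied to the callback from the question (the flag and function names are illustrative):

#include <mutex>

static std::once_flag init_flag;

void do_init()
{
    /* do initialization stuff here; runs exactly once */
}

int callback_method(void *userData)
{
    std::call_once(init_flag, do_init); // thread-safe one-time call
    /* Do regular stuff here */
    return 0;
}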

Run a function when the number of references to a shared_ptr decreases

I am developing a cache and I need to know when an object expires.
Is it possible to run a function when the reference counter of a shared_ptr decreases?
std::shared_ptr<MyClass> p1 = std::make_shared<MyClass>();
std::shared_ptr<MyClass> p2 = p1; // p1.use_count() == 2
p2.reset();                       // [ run function ] p1.use_count() == 1
You can't have a function called every time the reference count decreases, but you can have one called when it hits zero. You do this by passing a "custom deleter" to the shared_ptr constructor (you can't use the make_shared utility for this); the deleter is a callable object which is passed the raw pointer and is responsible for deleting the shared object.
Example:
#include <iostream>
#include <memory>

void deleteInt(int* i)
{
    std::cout << "Deleting " << *i << std::endl;
    delete i;
}

int main()
{
    std::shared_ptr<int> ptr(new int(3), &deleteInt); // refcount now 1
    auto ptr2 = ptr;                                  // refcount now 2
    ptr.reset();                                      // refcount now 1
    ptr2.reset();                                     // refcount now 0, deleter called
    return 0;
}
You can specify a deleter functor when creating the shared_ptr. The following article shows an example use of a deleter:
http://en.cppreference.com/w/cpp/memory/shared_ptr/shared_ptr
Not using a vanilla std::shared_ptr, but if you only require customized behaviour when calling reset() (with no arguments), you can easily create a custom adapter:
template <typename T>
struct my_ptr : public std::shared_ptr<T> {
    using std::shared_ptr<T>::shared_ptr;

    void reset() {
        std::shared_ptr<T>::reset(); // Release the managed object.
        /* Run custom function */
    }
};
And use it like this:
my_ptr<int> p = std::make_shared<int>(5);
std::cout << *p << std::endl; // Works as usual.
p.reset(); // Customized behaviour.
Edit
This answer is meant to suggest a solution to an issue that I didn't think the other answers addressed, namely executing custom behaviour every time the refcount is decreased by a call to reset().
If the issue is simply to run a function upon object release, then use a custom deleter functor as suggested in the answers by @Sneftel and @fjardon.

Constness of captured reference

An object can be captured by mutable reference and then changed inside a function which takes the same object by const reference.
#include <functional>
#include <iostream>

void g(const int& x, std::function<void()> f)
{
    std::cout << x << '\n';
    f();
    std::cout << x << '\n';
}

int main()
{
    int y = 0;
    auto f = [&y] { ++y; };
    g(y, f);
}
The object is mutated in a scope where it is const. I understand that the compiler can't enforce constness here without proving that x and y are aliases. I suppose all I'm looking for is confirmation that this is undefined behavior. Is it equivalent in some sense to a const_cast, i.e. using a value as non-const in a context where it should be const?
Reference or pointer to const doesn't mean the referenced object cannot be modified at all - it just means that the object cannot be modified via this reference/pointer. It may very well be modified via another reference/pointer to the same object. This is called aliasing.
Here's an example that doesn't use lambdas or any other fancy features:
#include <iostream>

int x = 0;

void f() { x = 42; }

void g(const int& y) {
    std::cout << y;
    f();
    std::cout << y;
}

int main() {
    g(x);
}
There's nothing undefined going on, because the object itself is not const, and constness on aliases is primarily for the user's benefit. For thoroughness, the relevant section is [dcl.type.cv]p3:
A pointer or reference to a cv-qualified type need not actually point or refer to a cv-qualified object, but it is treated as if it does; a const-qualified access path cannot be used to modify an object even if the object referenced is a non-const object and can be modified through some other access path. [ Note: Cv-qualifiers are supported by the type system so that they cannot be subverted without casting (5.2.11). —end note ]
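To tie this back to the const_cast comparison in the question: what matters is whether the object itself was declared const. A small sketch (my own illustration, not from the question):

int main() {
    int a = 0;
    const int& ra = a;
    const_cast<int&>(ra) = 1;    // OK: the referenced object 'a' is not const

    const int b = 0;
    const int& rb = b;
    // const_cast<int&>(rb) = 1; // undefined behavior: 'b' itself is const
    return 0;
}

The lambda in the question is the first case: y itself is a non-const int, so mutating it through another access path while a const reference to it exists is well-defined.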

Race conditions with OpenMP

I need to fill a 2D array (tmp[Ny][Nx]) where each cell gets an integral (of some function) evaluated as a function of free parameters. Since I deal with very large arrays (here I have simplified my case), I need to use OpenMP parallelism to speed my calculations up, so I use a simple #pragma omp parallel for directive.
Without #pragma omp parallel for the code executes perfectly, but adding the parallel directive produces race conditions in the output.
I tried to cure it by making (i, j, par) private; it did not help.
P.S. I use VS2008 Professional with OpenMP 2.0 under the Windows 7 OS.
Here is my code (a short sample):
double testfunc(const double* var, const double* par)
{
    // here is some simple function to be integrated over
    // var[0] and var[1] with two free parameters par[0] and par[1]
    return ....
}

#define Nx 10000
#define Ny 10000
static double tmp[Ny][Nx];

int main()
{
    double par[2];       // parameters
    double xmin[]={0,0}; // limits of 2D integration
    double xmax[]={1,1}; // limits of 2D integration
    double val,err,Tol=1e-7,AbsTol=1e-7;
    int i,j,NDim=2,NEval=1e5;
    #pragma omp parallel for private(i,j,par,val)
    for (i=0;i<Nx;i++)
    {
        for (j=0;j<Ny;j++)
        {
            par[0]=i;
            par[1]=j*j;
            adapt_integrate(testfunc,par, NDim, xmin, xmax,
                            NEval, Tol, AbsTol, &val, &err);
            // adapt_integrate receives my integrand, performs the
            // integration and returns the result through "val"
            tmp[i][j] = val;
        }
    }
}
Again, making all of the internal variables (i, j, par and val) private does not help.
P.S. The serial version (#threads = 1) of this code runs properly.
(Answered in the question. Converted to a community wiki answer. See Question with no answers, but issue solved in the comments (or extended in chat) )
The OP wrote:
The problem is solved!
I defined the parameters of integration as global variables and used the #pragma omp threadprivate(parGlob) directive for them. Now it works like a charm. I had been thinking that private() and threadprivate() have the same meaning and only differ in implementation, but they do not.
So, playing with these directives may give the correct answer. Another thing: declaring the iterator i inside the first for loop gives an additional 20%-30% speedup in performance. So the fastest version of the code now looks as follows:
double testfunc(const double* var, const double* par)
{
    .......
}

#define Nx 10000
#define Ny 10000
static double tmp[Ny][Nx];

double parGlob[2];                 //<- Here they are!!!
#pragma omp threadprivate(parGlob) // <- Magic directive!!!!

int main()
{
    // Not here!!!! -> double par[2]; // parameters
    double xmin[]={0,0}; // limits of 2D integration
    double xmax[]={1,1}; // limits of 2D integration
    double val,err,Tol=1e-7,AbsTol=1e-7;
    int j,NDim=2,NEval=1e5;
    #pragma omp parallel for private(j,val) // no `i` inside the `private` clause
    for (int i=0;i<Nx;i++)
    {
        for (j=0;j<Ny;j++)
        {
            parGlob[0]=i;
            parGlob[1]=j*j;
            adapt_integrate(testfunc,parGlob, NDim, xmin, xmax,
                            NEval, Tol, AbsTol, &val, &err);
            tmp[i][j] = val;
        }
    }
}
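For comparison, the same race can usually be avoided without globals by declaring the parameter array inside the parallel loop body, so each thread gets its own copy automatically. A sketch under that assumption (testfunc, adapt_integrate and the integration arguments are taken from the question):

int main()
{
    double xmin[] = {0, 0}; // limits of 2D integration
    double xmax[] = {1, 1}; // limits of 2D integration
    double Tol = 1e-7, AbsTol = 1e-7;
    int NDim = 2, NEval = 100000;

    #pragma omp parallel for
    for (int i = 0; i < Nx; i++)
    {
        double parLoc[2]; // declared inside the region: one copy per thread
        double val, err;  // likewise private by construction
        for (int j = 0; j < Ny; j++)
        {
            parLoc[0] = i;
            parLoc[1] = j * j;
            adapt_integrate(testfunc, parLoc, NDim, xmin, xmax,
                            NEval, Tol, AbsTol, &val, &err);
            tmp[i][j] = val; // each (i,j) writes a distinct element: no race
        }
    }
    return 0;
}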
