I have to call the <stdlib.h> function exit() inside this routine:
#pragma acc routine(Check) seq
int Check (double **u, char *str)
{
for (int i = beg; i <= end; i++) {
for (int v = 0; v < vend; v++) {
if (isnan(u[i][v])) {
#pragma acc routine(Here) seq
Here (i,NULL);
#pragma acc routine(exit)
exit(1);
}
}}
return 0;
}
I get the error:
nvlink error : Undefined reference to 'exit' in 'tools.o'
Usually I solve this problem by adding #pragma acc routine before the function definition, but in this case I'm dealing with a library function.
Every routine called from the device needs a device-callable version. System routines often do not have one, and "exit" is among those that don't, so it can't be used here.
In any case, you can't exit a host application from device code, so you may want to rethink this portion of the code. Instead of calling "exit", capture the error on the device and abort once execution has returned to the host.
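One way to capture the error and abort on the host is sketched below. It is only an illustration under assumptions: count_nans and abort_if_nan are hypothetical names, and the data is taken to be a flat array rather than the double ** used in the question. The device loop only counts NaN entries through a reduction; the decision to call exit() is made afterwards in host code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: counts NaN entries on the device instead of calling
// exit() there. The flat array layout is an assumption for this sketch.
int count_nans(const double *u, int n)
{
    assert(n >= 0);  // host-side sanity check
    int bad = 0;
    #pragma acc parallel loop reduction(+:bad) copyin(u[0:n])
    for (int i = 0; i < n; i++) {
        if (std::isnan(u[i])) {
            bad++;
        }
    }
    return bad;
}

// Hypothetical host-side wrapper: exit() is called only after execution
// has returned to the host.
void abort_if_nan(const double *u, int n, const char *where)
{
    int bad = count_nans(u, n);
    if (bad > 0) {
        std::fprintf(stderr, "%s: found %d NaN value(s)\n", where, bad);
        std::exit(1);
    }
}
```

Without OpenACC enabled, the pragma is ignored and the same code runs serially on the host, which makes the logic easy to check before moving it to the device.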
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - i (sim_xy1.c: 1012)
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - i (sim_xy1.c: 1010)
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - j (sim_xy1.c: 1002)
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - j (sim_xy1.c: 994)
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - i (sim_xy1.c: 986)
NVC++-W-1056-External variables used in acc routine need to be in #pragma acc create() - i (sim_xy1.c: 984)
In the code below, I have marked with ** ** the lines that the warnings refer to.
void produto_matriz_vetor(int NX, int NY,double *AN,double *AW,double *AP,double *AE,double *AS, double *x, double *b)
{
int N,aux,NXY;
NXY=NX*NY;
N=1;
b[N]=(AP[N]*x[N])+(AE[N]*x[N+1])+(AS[N]*x[N+NX]);
N=NX;
b[N]=(AW[N]*x[N-1])+(AP[N]*x[N])+(AS[N]*x[N+NX]);
N=NXY-NX+1;
b[N]=(AN[N]*x[N-NX])+(AP[N]*x[N])+(AE[N]*x[N+1]);
N=NXY;
b[N]=(AN[N]*x[N-NX])+(AW[N]*x[N-1])+(AP[N]*x[N]);
for(N=2;N<NX;N++)
{
b[N]=(AP[N]*x[N])+AE[N]*x[N+1]+AS[N]*x[N+NX]+AW[N]*x[N-1];
}
**for(i=2;i<NX;i++)**
{
**N=NXY-NX+i;**
b[N]=(AN[N]*x[N-NX])+(AW[N]*x[N-1])+(AP[N]*x[N])+(AE[N]*x[N+1]);
}
for(j=2;j<NY;j++)
{
**N=(NX*(j-1))+1;**
b[N]=(AN[N]*x[N-NX])+(AP[N]*x[N])+(AE[N]*x[N+1])+(AS[N]*x[N+NX]);
}
for(j=2;j<NY;j++)
{
**N=(NX*(j-1))+NX;**
b[N]=(AN[N]*x[N-NX])+(AW[N]*x[N-1])+(AP[N]*x[N])+(AS[N]*x[N+NX]);
}
for(j=2;j<NY;j++)
{
**for(i=2;i<NX;i++)**
{
**N=(NX*(j-1))+i;**
b[N]=(AN[N]*x[N-NX])+(AW[N]*x[N-1])+(AP[N]*x[N])+(AE[N]*x[N+1])+(AS[N]*x[N+NX]);
}
}
}
Yes, static global variables need to be placed in an "acc declare create" directive so that a device copy of the global variable can be accessed from the device routine.
However here, using a global variable as your index variable is going to cause issues since all threads would be using the same variables. The better solution would be to use local variables for the index variables so they can be made private for each thread.
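A minimal sketch of the local-variable approach, applied to the interior-point loop from the question. The function name interior_update, the data clauses, and the array length NX*NY + 1 (inferred from the 1-based indexing) are assumptions for illustration, not the poster's actual setup.

```cpp
#include <cassert>

// Interior-point update with i, j and N declared locally, so each device
// thread gets its own private copy instead of sharing global indices.
// Arrays are assumed to be 1-based with NX*NY + 1 elements.
void interior_update(int NX, int NY,
                     const double *AN, const double *AW, const double *AP,
                     const double *AE, const double *AS,
                     const double *x, double *b)
{
    assert(NX >= 1 && NY >= 1);  // host-side sanity check
    const int len = NX * NY + 1;
    #pragma acc parallel loop collapse(2) \
        copyin(AN[0:len], AW[0:len], AP[0:len], AE[0:len], AS[0:len], x[0:len]) \
        copy(b[0:len])
    for (int j = 2; j < NY; j++) {
        for (int i = 2; i < NX; i++) {
            int N = NX * (j - 1) + i;  // local, hence private per thread
            b[N] = AN[N] * x[N - NX] + AW[N] * x[N - 1] + AP[N] * x[N]
                 + AE[N] * x[N + 1] + AS[N] * x[N + NX];
        }
    }
}
```

Because the indices are no longer external variables, the W-1056 warnings for i and j should not apply to a routine written this way.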
I'm trying to parallelize this loop but I'm having a problem with the function fRSolver (&sw, ts->cmax);
#pragma acc parallel loop collapse(2) present(d, sw, ts)
for (k = KBEG; k <= KEND; k++){
for (j = JBEG; j <= JEND; j++){
.
.
.
#pragma acc routine(fRSolver) vector
d->fRSolver (&sw, ts->cmax);
}}
I get this error:
139, Accelerator region ignored
141, Accelerator restriction: loop contains unsupported statement type
175, Accelerator restriction: unsupported statement type: opcode=JSRA
d is a type D variable:
typedef struct D_{
.
.
.
void (*fRSolver) (const Sw *, double *);
}D;
fRSolver is a pointer to a function void HSolver (const Sw *sw, double *cmax)
Is there a way I can accelerate this loop without changing the way the function HSolver is called?
No, sorry. Function pointers and indirect function calls are not supported within device code. They require late binding (i.e. the function is resolved at runtime), and we currently don't have a way to resolve the function's address on the device. The same issue occurs with C++ virtual functions and Fortran type-bound procedures.
It's definitely on our list of things we'd like to support, and hopefully will at some point, but thus far has proven to be a major challenge. You need to modify the code to have fRSolver be a direct call, resolved at link time.
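One possible way to turn the indirect call into direct calls is to store a selector instead of a function pointer and dispatch with a switch. The sketch below is only an illustration: the contents of Sw, the SolverKind enum, OtherSolver, solve, max_speed, and the bodies of the solvers are all hypothetical stand-ins, since the real definitions are not shown in the question.

```cpp
#include <cassert>

// Hypothetical stand-in for the Sw type from the question.
struct Sw { double speed; };

// Hypothetical selector stored in place of the function pointer.
enum SolverKind { kHSolver, kOtherSolver };

#pragma acc routine seq
void HSolver(const Sw *sw, double *cmax) { *cmax = sw->speed; }

#pragma acc routine seq
void OtherSolver(const Sw *sw, double *cmax) { *cmax = 2.0 * sw->speed; }

// Every call below is direct, so it can be resolved at link time.
#pragma acc routine seq
void solve(SolverKind kind, const Sw *sw, double *cmax)
{
    switch (kind) {
        case kHSolver:     HSolver(sw, cmax);     break;
        case kOtherSolver: OtherSolver(sw, cmax); break;
    }
}

// Example driver: runs the selected solver on each cell and reduces the result.
double max_speed(SolverKind kind, const Sw *cells, int n)
{
    assert(n > 0);  // host-side sanity check
    double cmax = 0.0;
    #pragma acc parallel loop reduction(max:cmax) copyin(cells[0:n])
    for (int k = 0; k < n; k++) {
        double local = 0.0;
        solve(kind, &cells[k], &local);
        cmax = (local > cmax) ? local : cmax;
    }
    return cmax;
}
```

This does change how the solver is selected, which is the trade-off the answer describes: the choice moves from a pointer in the struct to a value that the device code can branch on.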
I've boiled this down to a simple self-contained example. The main thread enqueues 1000 items, and a worker thread tries to dequeue concurrently. ThreadSanitizer complains that there's a race between the read and the write of one of the elements, even though there is an acquire-release memory barrier sequence protecting them.
#include <atomic>
#include <thread>
#include <cassert>
struct FakeQueue
{
int items[1000];
std::atomic<int> m_enqueueIndex;
int m_dequeueIndex;
FakeQueue() : m_enqueueIndex(0), m_dequeueIndex(0) { }
void enqueue(int x)
{
auto tail = m_enqueueIndex.load(std::memory_order_relaxed);
items[tail] = x; // <- element written
m_enqueueIndex.store(tail + 1, std::memory_order_release);
}
bool try_dequeue(int& x)
{
auto tail = m_enqueueIndex.load(std::memory_order_acquire);
assert(tail >= m_dequeueIndex);
if (tail == m_dequeueIndex)
return false;
x = items[m_dequeueIndex]; // <- element read -- tsan says race!
++m_dequeueIndex;
return true;
}
};
FakeQueue q;
int main()
{
std::thread th([&]() {
int x;
for (int i = 0; i != 1000; ++i)
q.try_dequeue(x);
});
for (int i = 0; i != 1000; ++i)
q.enqueue(i);
th.join();
}
ThreadSanitizer output:
==================
WARNING: ThreadSanitizer: data race (pid=17220)
Read of size 4 at 0x0000006051c0 by thread T1:
#0 FakeQueue::try_dequeue(int&) /home/cameron/projects/concurrentqueue/tests/tsan/issue49.cpp:26 (issue49+0x000000402bcd)
#1 main::{lambda()#1}::operator()() const <null> (issue49+0x000000401132)
#2 _M_invoke<> /usr/include/c++/5.3.1/functional:1531 (issue49+0x0000004025e3)
#3 operator() /usr/include/c++/5.3.1/functional:1520 (issue49+0x0000004024ed)
#4 _M_run /usr/include/c++/5.3.1/thread:115 (issue49+0x00000040244d)
#5 <null> <null> (libstdc++.so.6+0x0000000b8f2f)
Previous write of size 4 at 0x0000006051c0 by main thread:
#0 FakeQueue::enqueue(int) /home/cameron/projects/concurrentqueue/tests/tsan/issue49.cpp:16 (issue49+0x000000402a90)
#1 main /home/cameron/projects/concurrentqueue/tests/tsan/issue49.cpp:44 (issue49+0x000000401187)
Location is global 'q' of size 4008 at 0x0000006051c0 (issue49+0x0000006051c0)
Thread T1 (tid=17222, running) created by main thread at:
#0 pthread_create <null> (libtsan.so.0+0x000000027a67)
#1 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) <null> (libstdc++.so.6+0x0000000b9072)
#2 main /home/cameron/projects/concurrentqueue/tests/tsan/issue49.cpp:41 (issue49+0x000000401168)
SUMMARY: ThreadSanitizer: data race /home/cameron/projects/concurrentqueue/tests/tsan/issue49.cpp:26 FakeQueue::try_dequeue(int&)
==================
ThreadSanitizer: reported 1 warnings
Command line:
g++ -std=c++11 -O0 -g -fsanitize=thread issue49.cpp -o issue49 -pthread
g++ version: 5.3.1
Can anybody shed some light onto why tsan thinks this is a data race?
UPDATE
It seems like this is a false positive. To appease ThreadSanitizer, I've added annotations (see here for the supported ones and here for an example). Note that detecting whether tsan is enabled in GCC via a macro has only recently been added, so I had to manually pass -D__SANITIZE_THREAD__ to g++ for now.
#if defined(__SANITIZE_THREAD__)
#define TSAN_ENABLED
#elif defined(__has_feature)
#if __has_feature(thread_sanitizer)
#define TSAN_ENABLED
#endif
#endif
#ifdef TSAN_ENABLED
#define TSAN_ANNOTATE_HAPPENS_BEFORE(addr) \
AnnotateHappensBefore(__FILE__, __LINE__, (void*)(addr))
#define TSAN_ANNOTATE_HAPPENS_AFTER(addr) \
AnnotateHappensAfter(__FILE__, __LINE__, (void*)(addr))
extern "C" void AnnotateHappensBefore(const char* f, int l, void* addr);
extern "C" void AnnotateHappensAfter(const char* f, int l, void* addr);
#else
#define TSAN_ANNOTATE_HAPPENS_BEFORE(addr)
#define TSAN_ANNOTATE_HAPPENS_AFTER(addr)
#endif
struct FakeQueue
{
int items[1000];
std::atomic<int> m_enqueueIndex;
int m_dequeueIndex;
FakeQueue() : m_enqueueIndex(0), m_dequeueIndex(0) { }
void enqueue(int x)
{
auto tail = m_enqueueIndex.load(std::memory_order_relaxed);
items[tail] = x;
TSAN_ANNOTATE_HAPPENS_BEFORE(&items[tail]);
m_enqueueIndex.store(tail + 1, std::memory_order_release);
}
bool try_dequeue(int& x)
{
auto tail = m_enqueueIndex.load(std::memory_order_acquire);
assert(tail >= m_dequeueIndex);
if (tail == m_dequeueIndex)
return false;
TSAN_ANNOTATE_HAPPENS_AFTER(&items[m_dequeueIndex]);
x = items[m_dequeueIndex];
++m_dequeueIndex;
return true;
}
};
// main() is as before
Now ThreadSanitizer is happy at runtime.
This looks like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78158. Disassembling the binary produced by GCC shows that it doesn't instrument the atomic operations at -O0.
As a workaround, you can either build your code with GCC with -O1/-O2, or get yourself a fresh Clang build and use it to run ThreadSanitizer (this is the recommended way, as TSan is being developed as part of Clang and only backported to GCC).
The comments above are invalid: TSan can easily comprehend the happens-before relation between the atomics in your code (one can check that by running the above reproducer under TSan in Clang).
I also wouldn't recommend using the AnnotateHappensBefore()/AnnotateHappensAfter() for two reasons:
you shouldn't need them in most cases; they denote that the code is doing something really complex (in which case you may want to double-check you're doing it right);
if you make an error in your lock-free code, spraying it with annotations may mask that error, so that TSan won't notice it.
ThreadSanitizer is not good at counting: it cannot understand that the writes to items always happen before the reads.
ThreadSanitizer can see that the stores to m_enqueueIndex happen before the loads, but it does not understand that the store to items[m_dequeueIndex] must happen before the load when tail > m_dequeueIndex.
It is not clear from the documentation. This template function returns void. The documentation says:
If the function cannot lock all objects, the function first unlocks
all objects it successfully locked (if any) before failing.
But how should the caller know it has failed?
Does it block until it is successful, with an exception as the only failure scenario?
It blocks until every lock has been acquired; the only way it reports failure is by throwing an exception.
As a couple other SO members have mentioned to me in the past on my own questions, steer away from CPlusPlus.com - The Canonical Reference for Misinformation.
Please take this as an opportunity to learn the differences between C and C++. C requires return codes or side effects on function arguments, while C++ offers exceptions in addition to those.
Parameters
(none)
Return value
(none)
Exceptions
Throws std::system_error when errors occur, including errors from the
underlying operating system that would prevent lock from meeting its
specifications. The mutex is not locked in the case of any exception
being thrown.
Notes
lock() is usually not called directly: std::unique_lock and
std::lock_guard are used to manage exclusive locking.
Example
This example shows how lock and unlock can be used to protect shared
data.
#include <iostream>
#include <chrono>
#include <thread>
#include <mutex>
int g_num = 0; // protected by g_num_mutex
std::mutex g_num_mutex;
void slow_increment(int id)
{
for (int i = 0; i < 3; ++i) {
g_num_mutex.lock();
++g_num;
std::cout << id << " => " << g_num << '\n';
g_num_mutex.unlock();
std::this_thread::sleep_for(std::chrono::seconds(1));
}
}
int main()
{
std::thread t1(slow_increment, 0);
std::thread t2(slow_increment, 1);
t1.join();
t2.join();
}
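To tie this back to the original question, here is a small sketch of how a caller can observe failure, assuming the function being asked about is std::lock. The names transfer, m_a, m_b, balance_a and balance_b are made up for illustration. std::lock has no return value; if one of the underlying lock operations throws, it releases whatever it had already locked and the exception propagates to the caller.

```cpp
#include <cassert>
#include <mutex>
#include <system_error>

std::mutex m_a, m_b;
int balance_a = 100;  // protected by m_a
int balance_b = 0;    // protected by m_b

// Returns true if the transfer ran, false if acquiring the locks threw.
bool transfer(int amount)
{
    assert(amount >= 0);
    try {
        std::lock(m_a, m_b);  // blocks until both mutexes are held
    } catch (const std::system_error &) {
        return false;  // neither mutex is held at this point
    }
    // Adopt the already-held locks so they are released on scope exit.
    std::lock_guard<std::mutex> guard_a(m_a, std::adopt_lock);
    std::lock_guard<std::mutex> guard_b(m_b, std::adopt_lock);
    balance_a -= amount;
    balance_b += amount;
    return true;
}
```

If a non-blocking attempt with a status result is what you want, std::try_lock is the variant that returns a value instead.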
Take the following code, which illustrates the calling of a simple routine on the accelerator, compiled on the device using OpenACC 2.0's routine directive:
#include <iostream>
#pragma acc routine
int function(int *ARRAY,int multiplier){
int sum=0;
#pragma acc loop reduction(+:sum)
for(int i=0; i<10; ++i){
sum+=multiplier*ARRAY[i];
}
return sum;
}
int main(){
int *ARRAY = new int[10];
int multiplier = 5;
int out;
for(int i=0; i<10; i++){
ARRAY[i] = 1;
}
#pragma acc enter data create(out) copyin(ARRAY[0:10],multiplier)
#pragma acc parallel present(out,ARRAY[0:10],multiplier)
if (function(ARRAY,multiplier) == 50){
out = 1;
}else{
out = 0;
}
#pragma acc exit data copyout(out) delete(ARRAY[0:10],multiplier)
std::cout << out << std::endl;
}
How does function know to use the device copies of ARRAY[0:10] and multiplier when it is called from within a parallel region? How can we enforce the use of the device copies?
When your routine is called within a device region (the parallel in your code), it is being called by the threads on the device, which means those threads will only have access to arrays on the device. The compiler may actually choose to inline that function, or it may be a device-side function call. That means that you can know that when the function is called from the device it will be receiving device copies of the data because the function is essentially inheriting the present data clause from the parallel region. If you still want to convince yourself that you're running on the device once inside the function, you could call acc_on_device, but that only tells you that you're running on the accelerator, not that you received a device pointer.
If you want to enforce the use of device copies more than that, you could make the routine nohost so that it would technically not be valid to call from the host, but that doesn't really do what you're asking, which is to do a check on the GPU that the array really is a device array.
Keep in mind though that any code inside a parallel region that is not inside a loop will be run gang-redundantly, so the write to out is likely a race condition, unless you happen to be running with one gang or you write to it using an atomic.
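One way to avoid the gang-redundant write is to run the check in a single-gang region. The sketch below assumes a compiler that supports the serial construct (OpenACC 2.6 or later) and uses the hypothetical names weighted_sum and check_on_device. The routine is marked seq, so its level of parallelism is explicit.

```cpp
#include <cassert>

// Sequential device routine: each caller computes the whole sum itself.
#pragma acc routine seq
int weighted_sum(const int *a, int n, int multiplier)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += multiplier * a[i];
    }
    return sum;
}

// Runs the check on the device in a serial region, so only one thread
// writes to out and there is no race.
int check_on_device(const int *a, int n, int multiplier)
{
    assert(n > 0);  // host-side sanity check
    int out = 0;
    #pragma acc serial copyin(a[0:n]) copyout(out)
    {
        out = (weighted_sum(a, n, multiplier) == 50) ? 1 : 0;
    }
    return out;
}
```

Inside the serial region the pointer a refers to the device copy created by the copyin clause, which is the same inheritance of data clauses described above for the parallel region.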
Basically, when you use a "data" clause, the data is created on or copied to device memory, and the block of code marked with "acc routine" then executes on the device. Note that host and device memory are not shared, unlike in multi-threading (OpenMP). So yes, "function" will use the device copies of ARRAY and multiplier as long as it is called inside the data region. Hope this helps! :)
You should give the routine an explicit level of parallelism, such as gang, worker, or vector. That is more precise.
The routine will use the data in device memory.