Procedures called in a compute region must have acc routine information - exit - openacc

I have a function that is called from various parts of the code when an error occurs and I need to stop the program. However, I get the error below only when compiling for an NVIDIA GPU (-ta=nvidia:managed); it does not occur when compiling for the CPU (-ta=multicore).
#include <stdio.h>
#include <stdlib.h>

void exit_tool()
{
    printf("Simulacao interrompida");   /* "Simulation interrupted" */
    exit(1);
}
Procedures called in a compute region must have acc routine information - exit

Correct. In order to call a routine on the device, there needs to be a device-callable version available. 'exit' is not a supported device routine since you can't terminate a host process from the device.
You can call "assert(0)" to terminate the running device kernel, but this will corrupt the CUDA context, giving undefined behavior.
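For example, a minimal sketch of a device-side assert inside a compute region (the function and variable names are illustrative, and this assumes the compiler supports assert in device code):

#include <assert.h>

void check_all(const double *a, int n)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
    {
        /* assert(0) aborts the running kernel, but the CUDA context
           is left in an undefined state afterwards */
        if (a[i] < 0.0)
            assert(0);
    }
}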
Given this is likely debugging code used for error checking, the simplest workaround is to conditionally compile this function so that exit isn't used when compiling with OpenACC:
void exit_tool()
{
    printf("Simulacao interrompida");
#ifndef _OPENACC
    exit(1);
#endif
}
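For routines that genuinely can run on the device, the usual fix for this error is to give them acc routine information so the compiler generates a device-callable version. A minimal sketch (the helper and its caller below are made-up examples, not code from the question):

#pragma acc routine seq
void clamp_value(double *v, double lo, double hi)
{
    /* device-callable helper: no I/O, no exit(), just arithmetic */
    if (*v < lo) *v = lo;
    if (*v > hi) *v = hi;
}

void clamp_all(double *a, int n)
{
    /* the decorated routine can now be called from inside a compute region */
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        clamp_value(&a[i], 0.0, 1.0);
}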

Related

Is there a race between starting and seeing yourself in WinApi's EnumProcesses()?

I just found this code in the wild:
def _scan_for_self(self):
    win32api.Sleep(2000)  # sleep to give time for process to be seen in system table.
    basename = self.cmdline.split()[0]
    pids = win32process.EnumProcesses()
    if not pids:
        UserLog.warn("WindowsProcess", "no pids", pids)
    for pid in pids:
        try:
            handle = win32api.OpenProcess(
                win32con.PROCESS_QUERY_INFORMATION | win32con.PROCESS_VM_READ,
                pywintypes.FALSE, pid)
        except pywintypes.error, err:
            UserLog.warn("WindowsProcess", str(err))
            continue
        try:
            modlist = win32process.EnumProcessModules(handle)
        except pywintypes.error, err:
            UserLog.warn("WindowsProcess", str(err))
            continue
This line caught my eye:
win32api.Sleep(2000) # sleep to give time for process to be seen in system table.
It suggests that if you call EnumProcesses() too fast after starting, you won't see yourself. Is there any truth to this?
There is a race, but it's not the race the code tried to protect against.
A successful call to CreateProcess returns only after the kernel object representing the process has been created and enqueued into the kernel's process list. A subsequent call to EnumProcesses accesses the same list, and will immediately observe the newly created process object.
That is, unless the process object has since been destroyed. This isn't entirely unusual since processes in Windows are initialized in-process. The documentation even makes note of that:
Note that the function returns before the process has finished initialization. If a required DLL cannot be located or fails to initialize, the process is terminated.
What this means is that if a call to EnumProcesses immediately following a successful call to CreateProcess doesn't observe the newly created process, it does so because it was late rather than early. If you are late already then adding a delay will only make you more late.
Which swiftly leads to the actual race here: Process IDs uniquely identify processes only for a finite time interval. Once a process object is gone, its ID is up for grabs, and the system will reuse it at some point. The only reliable way to identify a process is by holding a handle to it.
Now it's anyone's guess what the author of _scan_for_self was trying to accomplish. As written, the code takes more time to do something that's probably altogether wrong¹ anyway.
¹ Turns out my gut feeling was correct. This is just your average POSIX developer who, in the process of learning that POSIX is insufficient, would rather call out Microsoft than actually use an all-around superior API.
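To illustrate the point about holding a handle: if the goal was to track the process that was just started, a sketch along these lines (hypothetical, not taken from the code in question) keeps the handle returned by CreateProcess instead of rescanning the process list by PID:

#include <Windows.h>
#include <iostream>

int main()
{
    STARTUPINFOW si{};
    si.cb = sizeof(si);
    PROCESS_INFORMATION pi{};

    // CreateProcessW may modify the command-line buffer, so it must be writable
    wchar_t cmd[] = L"notepad.exe";

    if (!CreateProcessW(nullptr, cmd, nullptr, nullptr, FALSE, 0,
                        nullptr, nullptr, &si, &pi))
    {
        std::cerr << "CreateProcess failed: " << GetLastError() << "\n";
        return 1;
    }

    // pi.hProcess identifies this child for as long as we hold it,
    // even if its PID is later reused by another process.
    WaitForSingleObject(pi.hProcess, INFINITE);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}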
The documentation for EnumProcesses (Win32 API - EnumProcesses function) does not mention anything about a delay being needed before the current process appears in the list it returns.
Microsoft's example of how to use EnumProcesses to enumerate all running processes (Enumerating All Processes) also does not contain any delay before calling EnumProcesses.
A small test application I created in C++ (see below) always reports that the current process is in the list (tested on Windows 10):
#include <Windows.h>
#include <Psapi.h>
#include <iostream>
#include <vector>

const DWORD MAX_NUM_PROCESSES = 4096;
DWORD aProcesses[MAX_NUM_PROCESSES];

int main(void)
{
    // Get the list of running process Ids:
    DWORD cbNeeded;
    if (!EnumProcesses(aProcesses, MAX_NUM_PROCESSES * sizeof(DWORD), &cbNeeded))
    {
        return 1;
    }

    // Check if current process is in the list:
    DWORD curProcId = GetCurrentProcessId();
    bool bFoundCurProcId{ false };
    DWORD numProcesses = cbNeeded / sizeof(DWORD);
    for (DWORD i = 0; i < numProcesses; ++i)
    {
        if (aProcesses[i] == curProcId)
        {
            bFoundCurProcId = true;
        }
    }
    std::cout << "bFoundCurProcId: " << bFoundCurProcId << std::endl;
    return 0;
}
Note: I am aware that the fact that the program reported the expected result does not mean that there is no race; maybe I just couldn't catch it manifesting. But trying to run code like that can sometimes give you a hint (especially if the result had been that there is a race).
The fact that I never had a problem running this test (I did it many times), together with the lack of any mention of a needed delay in Microsoft's documentation, makes me believe that it is not required.
My conclusion is that either there is a unique issue when using it from Python (I doubt it), or the code you found is doing something unnecessary.
There is no race.
EnumProcesses calls an NT API function that switches to kernel mode to walk the kernel's linked list of processes. Your own process was added to that list before it started running.

pthread_recursive_mutex - assertion failed

I'm using the ROS (Robot Operating System) framework. If you are familiar with ROS: in my code I'm not using action servers, just publishers, subscribers and services. Unfortunately, I'm facing an issue with a pthread_recursive_mutex error. The following is the error and its backtrace.
If anyone is familiar with the ROS stack, could you please share what potential causes might lead to this runtime error?
I can give more information about the runtime error. Help much appreciated. Thanks.
/usr/include/boost/thread/pthread/recursive_mutex.hpp:113: void boost::recursive_mutex::lock(): Assertion `!pthread_mutex_lock(&m)' failed.
The lock method implementation merely asserts on the pthread return value:
void lock()
{
    BOOST_VERIFY(!posix::pthread_mutex_lock(&m));
}
This means that, according to the docs, either:
(EAGAIN) The mutex could not be acquired because the maximum number of
recursive locks for mutex has been exceeded.
This would indicate you have some kind of imbalance in your locks (not at this call-site, because unique_lock<> makes sure that doesn't happen) or are simply racking up threads that are all waiting for the same lock.
(EOWNERDEAD) The mutex is a robust mutex and the process containing the
previous owning thread terminated while holding the mutex lock. The mutex
lock shall be acquired by the calling thread and it is up to the new
owner to make the state consistent.
Boost does not deal with this case and simply asserts. This would also be unlikely to occur if all your threads use RAII lock guards (scoped_lock, unique_lock, shared_lock, lock_guard). It could, however, occur if you call lock() (and unlock()) manually somewhere and a thread exits without unlock()ing.
There are some other ways in which (particularly checked) mutexes can fail, but those do not apply to boost::recursive_mutex.
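To illustrate that last point, here is a minimal sketch (the function names are made up, and std types stand in for the Boost ones) of the difference between manual locking, which can leak a locked mutex on an early exit, and an RAII guard, which cannot:

#include <mutex>
#include <stdexcept>

std::recursive_mutex m;   // stand-in for the recursive mutex in question

void risky_manual(bool fail)
{
    m.lock();
    if (fail)
        throw std::runtime_error("oops");   // early exit: the mutex stays locked forever
    m.unlock();
}

void safe_raii(bool fail)
{
    std::lock_guard<std::recursive_mutex> guard(m);   // unlocks on every exit path
    if (fail)
        throw std::runtime_error("oops");             // guard still unlocks here
}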
This looks like a use-after-free problem, where a mutex has already been destroyed, probably because its owning object was deleted.
I had some success using Valgrind to hunt down this type of bug. Install it with apt install valgrind, and add launch-prefix="valgrind" to the <node> in your launch file. It will be super slow, but it's quite adept at pinpointing these issues.
Take this buggy program for example:
struct Test
{
    int a;
};

int main()
{
    Test* test = new Test();
    test->a = 42;
    delete test;
    test->a = 0; // BUG!
}
valgrind ./testprog yields
==8348== Invalid write of size 4
==8348== at 0x108601: main (test.cpp:11)
==8348== Address 0x5b7ec80 is 0 bytes inside a block of size 4 free'd
==8348== at 0x4C3168B: operator delete(void*, unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==8348== by 0x108600: main (test.cpp:10)
==8348== Block was alloc'd at
==8348== at 0x4C303EF: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==8348== by 0x1085EA: main (test.cpp:8)
Note how it will not only tell you where the buggy access happened (test.cpp:11), but also where the Test object was deleted (test.cpp:10), and where it was initially created (test.cpp:8).
Good luck in your bug hunt!

Initialize C runtime by multiple threads?

I am using MinGW-w64 on Windows. I am trying to make sense of what happens inside the CRT startup functions. The program entry point is mainCRTStartup(). This function initializes the buffer security cookie and then jumps to __tmainCRTStartup(), which ultimately calls main(). Now, inside the source code of __tmainCRTStartup() there is a curious operation:
while ((lock_free = InterlockedCompareExchangePointer(
            (volatile PVOID *) &__native_startup_lock, fiberid, 0)) != 0)
{
    if (lock_free == fiberid)
    {
        nested = TRUE;
        break;
    }
    Sleep(1000);
}
Here fiberid is set to the base address of the thread stack.
It seems that the code is preventing multiple threads or fibers from initializing the CRT simultaneously. But why would there be multiple threads running __tmainCRTStartup()? Since all threads share the same address space, wouldn't it be sufficient to initialize the CRT just once? The if-block inside the loop is also difficult to understand: to trigger this condition, __native_startup_lock must already be set to fiberid, which implies that the thread in question has already exited the loop. I do not see what kind of feature MinGW is trying to support here.
Update:
OK. I found a copy of Visual Studio 2013, shipped with the annotated CRT source code. Here's the annotation:
/*
* There is a possiblity that the module where this object is
* linked into is a mixed module. In all the cases we gurantee that
* native initialization will occur before managed initialization.
* Also in anycase this code should never be called when some other
* code is initializing native code, that's why we exit in that case.
*/
This may explain the nested check in the loop, but I still do not see why multithreading is involved here.
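For reference, here is a minimal user-space sketch (not the CRT's actual code; names and values are illustrative) of the pattern that loop implements: a spinlock acquired with InterlockedCompareExchangePointer, where finding your own marker value already stored in the lock means initialization was re-entered from the same thread or fiber:

#include <Windows.h>

static volatile PVOID g_init_lock = nullptr;   // hypothetical stand-in for __native_startup_lock

void run_init_once(PVOID my_marker)
{
    PVOID lock_free;
    BOOL nested = FALSE;

    // Try to swap our marker into the lock; a non-NULL result means someone holds it.
    while ((lock_free = InterlockedCompareExchangePointer(
                &g_init_lock, my_marker, nullptr)) != nullptr)
    {
        if (lock_free == my_marker)
        {
            // The lock already holds our own marker: we re-entered
            // initialization from the same thread/fiber.
            nested = TRUE;
            break;
        }
        Sleep(1000);   // another thread is initializing; wait and retry
    }

    // ... perform one-time initialization here ...

    if (!nested)
        InterlockedExchangePointer(&g_init_lock, nullptr);   // release the lock
}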

linux device driver stuck in spin lock due to ongoing read on first access

I am working on a Linux driver for a USB device which, fortunately, is identical to the one in the usb_skeleton example driver that is part of the standard kernel source.
With the 4.4 kernel it was a breeze: I simply changed the VID and PID and a few strings, and the driver compiled and worked perfectly on both x64 and ARM kernels.
But it turns out I have to make this work with a 3.2 kernel; I have no choice in this. I made the same modifications to the skeleton driver in the 3.2 source. Again, I did not have to change actual code, just the VID, PID and some strings. Although it compiles and loads fine (and shows up in /dev), it permanently hangs on the first attempt to read from /dev/myusbdev0.
The following code is from the read function, which is supposed to read from the bulk endpoint. When I attempt to read from the device, I see the first message saying it is going to block due to ongoing IO. Then nothing. The user program trying to read is hung and cannot be killed with kill -9. The Linux machine cannot even reboot; I have to power cycle it. There are no error messages, exceptions or anything like that. It seems fairly certain it is hanging in the part commented 'IO may take forever'.
My question is: why would there be ongoing IO when no program has done any IO with the driver yet? Can I fix this in driver code, or does the user program have to do something before it can start reading from /dev/myusbdev0?
In this case the target machine is an embedded ARM device similar to a BeagleBone Black. Incidentally, the 4.4 kernel version of this driver works perfectly on the BeagleBone with the same user-mode test program.
/* if IO is under way, we must not touch things */
retry:
    spin_lock_irq(&dev->err_lock);
    ongoing_io = dev->ongoing_read;
    spin_unlock_irq(&dev->err_lock);

    if (ongoing_io) {
        dev_info(&interface->dev,
                 "USB PureView Pulser Receiver device blocking due to ongoing io -%d",
                 interface->minor);
        /* nonblocking IO shall not wait */
        if (file->f_flags & O_NONBLOCK) {
            rv = -EAGAIN;
            goto exit;
        }
        /*
         * IO may take forever
         * hence wait in an interruptible state
         */
        rv = wait_for_completion_interruptible(&dev->bulk_in_completion);
        dev_info(&interface->dev,
                 "USB PureView Pulser Receiver device completion wait done io -%d",
                 interface->minor);
        if (rv < 0)
            goto exit;
        /*
         * by waiting we also semiprocessed the urb
         * we must finish now
         */
        dev->bulk_in_copied = 0;
        dev->processed_urb = 1;
    }
Writing this up as an answer since there was no response to my comments. Kernel commit c79041a4 [1], which was added in 3.10, fixes "blocked forever in skel_read". Looking at the code above, I see that the first message can trigger without the second being shown if the device file has the O_NONBLOCK flag set. As described in the commit message, if the completion occurs between read() calls, the next read() call will end up at the uninterruptible wait, waiting for a completion which has already occurred.
[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c79041a4
Obviously I am not sure that this is what you are seeing, but I think there is a good chance. If that is correct then you can apply the change (manually) to your driver and that should fix the problem.

Details about KiTrap06

I recently started to learn about interrupts, but I feel like I'm missing something important.
I know that when an exception-type interrupt occurs, I can either let the CPU handle it or handle it myself using __try/__except.
I know that this interrupt happens when the CPU attempts to execute an invalid opcode. When such an event occurs, the address of the current instruction (the current eip/rip) is pushed on the stack, the ISR is invoked and does something, and then the same instruction is re-executed.
This got me thinking that it somehow tries to skip the invalid opcode or repair it; otherwise, why would the return address be the address of the invalid instruction?
If I compile the following user mode code:
#include <stdio.h>

int main()
{
    printf("before\n");
    __asm
    {
        UD2
    };
    printf("after\n");
    return 0;
}
the application crashes without printing the second message.
If I put the UD2 in a __try/__except block I can deal with it and the application will no longer crash.
A UD2 instruction in a driver will Blue Screen the system.
So what is KiTrap06 actually doing? Or does the crash happen in this case because UD2 is made for testing purposes, and in a real case the results may differ?
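For completeness, a minimal user-mode sketch (MSVC-specific, using 32-bit inline assembly as in the question; the filter and messages are illustrative) of catching the invalid-opcode exception with structured exception handling:

#include <windows.h>
#include <stdio.h>

int main()
{
    printf("before\n");
    __try
    {
        __asm { ud2 }   /* raises an invalid-opcode (#UD) exception */
    }
    __except (GetExceptionCode() == EXCEPTION_ILLEGAL_INSTRUCTION
                  ? EXCEPTION_EXECUTE_HANDLER
                  : EXCEPTION_CONTINUE_SEARCH)
    {
        /* the kernel's trap handler dispatched the fault back to user
           mode, where SEH lets us handle it and continue */
        printf("caught illegal instruction\n");
    }
    printf("after\n");
    return 0;
}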
