CUDA Nvidia NSight Debugging: "CUDA grid launch failed" - debugging

When I try to debug an arbitrary CUDA application, e.g. the matrix multiplication or convolutionSeparable sample from the Nvidia GPU Computing SDK 4.0, I always get an output similar to:
Parallel Nsight Debug
CUDA grid launch failed: CUcontext: 2059192 CUmodule: 348912936 Function: _Z9matrixMulILi32EEvPfS0_S0_ii
……
……
And a file with the following content is showing up:
Parallel Nsight CUDA Debugger
The application being debugged with the Nexus CUDA debugger, was unable to
find any associated source. This could be for a number of reasons:
1) CUDA has not been initialized.
Make sure cuInit has been called, and it returned a successful result.
2) No CUDA contexts have been created.
Once a context is created, memory can be examined in the context. Each context
shows up as a single "Thread" in the Visual Studio Threads view. (Debug | Windows | Threads)
3) There are no active CUDA grids in any context.
A grid must be launched in order to hit breakpoints.
4) You have selected the "Default Context" in the Visual Studio Threads view.
This context is a placeholder shown when there are no available actual CUDA
contexts. It does not show real data.
5) No CUDA modules have been loaded.
You can see which modules are loaded in each CUDA context by showing the
Visual Studio Modules view. (Debug | Windows | Modules)
6) Symbolics were not found for the loaded .cubin.
The module needs to be built with debug information. Please specify the
-G0 switch when building.
7) A grid launch failed while running a kernel.
Every breakpoint within the corresponding “.cu” file is completely ignored during the run. When I simply run the application without Nsight debugging, the program executes without any problems.
What can I do to tackle this problem?
My Setup:
1x Intel GPU and 1x NVIDIA GTX 570; I want to use the local debugging option
Win 7 Pro 64-bit
Dev Env.: VS2008 or VS2010
CUDA 4.0 & Parallel Nsight 2.0
NV Driver Vers.: 285.38
WPF is disabled
TDR is disabled
Windows runs in Basic mode (no aero)
Project Properties: CUDA Runtime API -> GPU -> Generate GPU Debug Information -> Yes (-G0)

Firstly, you need to ensure that your display is driven by the Intel integrated graphics and not the NVIDIA GPU. When you hit a breakpoint in CUDA code you stall the entire GPU, so if the same GPU were also driving the display your system would naturally lock up.
Note that the hardware requirements for Parallel Nsight say you need two supported GPUs whereas you only have one, but if I understand correctly the GPU driving the display does not have to be one of the supported ones (I haven't tried this).
Assuming the above is working, you should start by trying out the samples included with Parallel Nsight; you can find them in the Parallel Nsight group in the Start menu.

A CUDA grid launch failure has a wide variety of causes. This one is probably an access to an array beyond its allocated size, what the x86 world calls a segmentation fault. I debug these by selectively commenting out parts of the kernel under test until the error goes away (what we used to call wolf-fence debugging); see the sketch below. Another cause of a grid launch failure is the kernel taking too long (one or two seconds) to execute.
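As an illustration, here is a minimal sketch of the kind of out-of-bounds access that typically causes this; the kernel name and sizes are made up for this example, it is not the SDK's matrixMul:

    // Hypothetical kernel, for illustration only: the classic cause of a failed
    // grid launch is a thread index running past the end of the buffer.
    __global__ void scaleKernel(float *data, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // Without this guard, the last (partially filled) block reads and
        // writes past the end of 'data' and the launch fails.
        if (idx < n)
            data[idx] *= 2.0f;
    }

    // Host side: the grid is rounded up, so some threads are out of range,
    // which is exactly why the guard above is needed.
    // int threads = 256;
    // int blocks  = (n + threads - 1) / threads;
    // scaleKernel<<<blocks, threads>>>(d_data, n);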
The reason the debugger isn't helping is that the debugger ONLY stops one thread in one block, and your access error occurs before that point. You also can't use printf to find the bug, because its output does not get returned in the event of a grid launch failure (see the error-checking sketch below).
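Since printf output is lost when a launch fails, a small host-side check of the launch status is usually the more reliable way to see what went wrong. This is only a sketch using standard CUDA runtime calls; the kernel name refers to the hypothetical example above:

    #include <cstdio>
    #include <cuda_runtime.h>

    // After every suspect launch, query the launch status and then synchronize,
    // so that errors raised while the kernel ran (out-of-bounds accesses,
    // watchdog timeouts, ...) are reported as well.
    void checkLaunch(const char *label)
    {
        cudaError_t err = cudaGetLastError();      // error from the launch itself
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();         // errors raised during execution
        if (err != cudaSuccess)
            fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
    }

    // Usage with the hypothetical kernel from the previous sketch:
    // scaleKernel<<<blocks, threads>>>(d_data, n);
    // checkLaunch("scaleKernel");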

To add a potential solution on top of the answers already given: one way to avoid the error is to run the Nsight Monitor with administrator rights.

The answer to this is definitely to use the correct driver for your Parallel Nsight installation. For the latest version (currently 2.1 RC2) that is driver version 285.86; for the current stable version 2.0 it is driver version 270.81, as another poster mentioned.

Related

Unticking "Use Shared Runtime" creates performance issues

I've been investigating performance tuning my Xamarin Forms application when running on an Android device using an Intel Atom CPU.
The largest performance cost I'm currently seeing is when loading a DataTemplate to set as the Content of a ContentView; this DataTemplate then dynamically loads other controls and DataTemplates within it to produce the UI of the application.
Running on this device from the signed APK archive, the time taken from tapping the screen to the completed UI finishing loading is 3.9 seconds.
However, if I deploy the application from Visual Studio then run it on the device without the debugger attached, this same process drops to around 1.5 seconds.
It seems to be reliably triggered by the "Use Shared Runtime" option within the project properties. With it ticked I consistently see this process take about 1.5 seconds; with it unticked it takes up to ~3.9 seconds. This is the same regardless of whether the application has been built in debug or release mode, and regardless of the other settings within the project properties.
This particular scenario is the worst-performing part; however, I see all UI loading/layout processes improve by a similar ratio: displaying another view drops from 1.5 s to 0.8 s, and so on.
As I am unable to create an archive with this setting ticked, what could be causing this change in performance and how do I get this performance replicated with it unticked?
I am running Xamarin.Forms 3.4.0.1008975, VS2017 15.9.4 and Xamarin.Android 9.1.4.2
Having Use Shared Runtime ticked selects all architectures in the Supported Architectures option in the Advanced window.
Unticking it removes all entries except armeabi-v7a, meaning x86 is no longer selected. Since the device is then running an ARM build on an x86 CPU, the application runs slower.
Ensuring x86 is ticked after unticking Use Shared Runtime gives the required performance when running on an x86 based device.

Can't use printf or debugger in Intel SDK for OpenCL

I'm using the Intel SDK for OpenCL with an Intel HD Graphics 4000 GPU to successfully run an OpenCL program. I've made sure to link against the Intel OpenCL libraries since I also have Nvidia libraries installed.
However, putting a printf() call in the kernel gives the OpenCL compiler error
error: implicit declaration of function 'printf' is not allowed in OpenCL
Also, I've enabled OpenCL kernel debugging in the Visual Studio 2012 plugin, and passed the following options to clBuildProgram:
"-g -s C:\\Path\\to\\my\\program.cl"
However, kernel breakpoints are skipped. Hovering over the breakpoint gives the message:
The breakpoint will not currently be hit. No symbols have been loaded for this document.
My kernels are in a separate .cl file, and I'm setting the breakpoints the way I would for C/C++ code. Is this the correct way to set breakpoints using the Intel SDK for OpenCL debugger?
Why are printf() calls and breakpoints not working with the Intel SDK for OpenCL?
The printf() function was introduced in OpenCL 1.2, and Intel released that version only fairly recently; I'd bet you still have a 1.1 implementation.
Regarding the debugger, I have hardly ever used it, but based on this document the path is supposed to be given like this:
"-g -s \"C:\\Path\\to\\my\\program.cl\""
You are also supposed to choose which thread you want to debug.
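If you want to confirm which OpenCL version the device actually reports (and hence whether kernel-side printf can work at all), a minimal host-side query along these lines should do. This is only a sketch using the standard clGetDeviceInfo call, with error handling and platform selection simplified:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Sketch: print the version string of the first GPU device on the first
       platform. Kernel-side printf requires the device to report "OpenCL 1.2"
       (or newer) here. Error checking omitted; if the NVIDIA platform is
       enumerated first you will need to iterate to find the Intel one. */
    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        char version[256];

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, NULL);

        printf("Device reports: %s\n", version);   /* e.g. "OpenCL 1.1 ..." */
        return 0;
    }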

Debugging CUDA kernels

I have an OpenCV application with additional CUDA (.cu) files which I would like to debug using Parallel Nsight. Nsight debugging works on the CUDA samples (without the OpenCV .cpp files), but when I try to start the debugger in my application it loads lots of additional modules ("no symbols loaded") and crashes with this error:
OpenCV Error: Gpu API call (out of memory) in unknown function, file ..\.\
opencv-2.4.4\modules\core\src\gpumat.cpp, line 1415
Also, a window opens: "Microsoft Visual C++ Debug Library", with "Debug error!" and "R6010 abort has been called".
What could be the issue? Could the loading of these modules be avoided? I am not sure that they are necessary.
And how do I correctly debug CUDA kernels? I know CPU and GPU code cannot be debugged at the same time.
Edit:
I am pretty sure that loading more than 200 kernels makes it crash. A single gpu::GpuMat declaration pulls in more than 100 kernels (modules) on its own, and then SURF, BFM, and similar algorithms account for the rest...
I'd like to debug only the kernels in which I put breakpoints (i.e. my own kernels, not the OpenCV ones). Is it possible to exclude the other modules/kernels somehow?
Thanks!
It sounds like symbols have been compiled for all of your OpenCV kernels, and this is not what you want. Make sure you are not building OpenCV with CUDA debug flags. Specifically, you don't want the -g/-G/--debug* flags being passed to nvcc.
Debugging a lot of kernels, while having effects on performance, should not cause crashes. I would recommend upgrading to Nsight 3.0 which is available now from the Nsight Visual Studio Edition Early Access site. Many improvements have been made in this version.
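If you want to confirm that it really is device memory being exhausted as the debug-built OpenCV modules load, a quick check with the standard cudaMemGetInfo call before and after the first GPU object is created will show it. This is only a sketch, nothing Nsight-specific:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Sketch: report free vs. total device memory. Call this before and after
    // the first OpenCV GPU object is created to see how much memory the loaded
    // modules (and their debug symbols) consume.
    void reportDeviceMemory(const char *label)
    {
        size_t freeBytes = 0, totalBytes = 0;
        if (cudaMemGetInfo(&freeBytes, &totalBytes) == cudaSuccess)
            printf("%s: %lu MB free of %lu MB\n", label,
                   (unsigned long)(freeBytes >> 20),
                   (unsigned long)(totalBytes >> 20));
    }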

Nvidia Nsight 2.2 OpenGL shader debugger - not working?

I've got NVIDIA's Parallel Nsight 2.2 configured on my two computers. The target has a GeForce GTS 450 with driver version 301.42 and the host a Quadro 1000M with the same driver version. Loading the simplest OpenGL 3.0 program (displaying a colored triangle using shaders) runs fine, but I can't seem to get the Nsight shader debugger to work.
Everything seems to work: I can open the Nsight->Windows->Shaders List window, double-click a shader, have the source code open, select a line, and set a breakpoint. A big fat red dot shows up to indicate the breakpoint is set, but the breakpoint is NEVER hit, so I'm stuck.
Has anyone ever got the OpenGL shader debugger working with Parallel Nsight 2.2?
By the way, the Nsight->New Analysis Activity works great; I can create a trace of all the OpenGL calls and view it with no problems.
The OpenGL shader debugger requires a driver that is not released yet. You will need a driver more recent than 306.37 to get a good debugging experience.

How do I debug a CUDA library with only 1 graphics card running X11

I'm running a CUDA library that I need to debug for memory problems and other issues. But when I attach cuda-gdb to the process I get the error
error: All CUDA devices are used for X11 and cannot be used while debugging.
I understand the error, but there has to be a way that I can debug the issues. Since I only have 1 GPU, it really isn't practical to turn off X11.
On non-NVIDIA hardware I thought there was a way to emulate a CUDA GPU. Could this be set up for debugging even though I have an NVIDIA GPU?
First of all, as you are using Linux you are in a lucky position: you can kill X fairly easily just for the duration of the debugging session.
However, if you really want to keep X running while debugging, you are out of luck, and for a very good reason: the display driver has a protection mechanism called the watchdog timer, which is enabled when the GPU in use also drives a display. The watchdog timer interrupts any kernel that runs for longer than (AFAIR) about 5 seconds. This is intended to prevent GPU lockups; the small check below shows how to query whether it is active.
Alternatively, you could try using Ocelot, but I am not sure how good its debugging features are.
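To see whether the watchdog described above is active on your GPU, the CUDA runtime exposes it as a device property. A minimal sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Sketch: report whether the display watchdog (kernel execution timeout)
    // is enabled on each CUDA device. On a GPU that also drives X11 this is
    // normally 1, which is what prevents debugging long-running kernels.
    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d (%s): kernelExecTimeoutEnabled = %d\n",
                   i, prop.name, prop.kernelExecTimeoutEnabled);
        }
        return 0;
    }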
