Cannot debug Thrust CUDA in Visual Studio [duplicate]

I am using Visual Studio 2010, Parallel Nsight 2.2, and CUDA 4.2 for learning. My system is Windows 8 Pro x64.
I opened the radix sort project included in the CUDA computing SDK in VS and compiled it without errors. The sort code uses the Thrust library:
if (keysOnly)
    thrust::sort(d_keys.begin(), d_keys.end());
else
    thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_values.begin());
I want to know how Thrust dispatches the sort function to CUDA kernels, so I tried to set breakpoints on the lines above and compiled the project in debug mode. But when I use Parallel Nsight for CUDA debugging, I always get the error "no source correspondence for breakpoint".
So, my questions are:
How do I debug CUDA Thrust programs in Visual Studio with Parallel Nsight?
Or can anyone show me another way to find out how CUDA Thrust dispatches functions to CUDA kernels or other functions?
Any advice will be appreciated!

Normally, to debug device code in CUDA, it's necessary to pass the:
-G -g
switches to nvcc. However, this mode is not supported with Thrust code. You can get an idea of how Thrust code gets dispatched to the device by following the structure in the Thrust include files. Since Thrust is entirely templated code, there are no libraries to worry about. However, that's a challenging proposition. You can also tell the compiler to generate PTX:
-ptx
which is one of the intermediate code types that CUDA code gets compiled to. However, that is not trivial to parse either. This link gives some alternate ideas for debugging with Thrust.
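As a rough illustration of those two switches (radixSort.cu is a stand-in name, not necessarily the actual SDK file):

nvcc -G -g -o radixSort radixSort.cu
nvcc -ptx radixSort.cu -o radixSort.ptx

The first command builds with device debug info (which, as noted, does not work for Thrust code); the second emits the PTX so you can at least inspect which kernels the Thrust templates instantiate.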

Related

Effect of visual studio compiler settings on performance of CUDA kernels

I get about a 3-4x difference in computation time for the same CUDA kernel compiled on two different machines. Both versions run on the same machine and GPU device. The obvious explanation for the difference is different compiler settings. Although there is no single perfect setting and the tuning should be customized to the kernel, I wonder if there is any clear guideline to help choose the right settings. I use Visual Studio 2010. Thank you.
Compile in release mode, not debug mode, if you want the fastest performance. The -G switch passed to the nvcc compiler will usually have a negative effect on GPU code performance.
It's generally recommended to select the right architecture for the GPU you are compiling for. For example, if you have a cc 2.1 capability GPU, make sure that setting (sm_21, in GPU code settings) is being passed to the compiler; see the example command line below. There are some counterexamples to this (e.g. compiling for cc 2.0 seems to run faster in some cases), but as a general recommendation, it is best.
Use the latest version of CUDA (compiler). This is especially important when using GPU libraries (CUFFT, CUBLAS, etc.). (Yes, this last one is not really a compiler setting.)
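For instance, a release build targeting a cc 2.1 device might look like this (a hypothetical invocation; kernel.cu is a placeholder file name):

nvcc -O2 -arch=sm_21 -o app kernel.cu

In Visual Studio, the equivalent is setting sm_21 in the CUDA project's GPU code generation properties and building the Release configuration.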

What compilers support CUDA

I found some problems with Visual Studio. My project that uses OpenMP multithreading was twice as slow on Visual Studio 2010 as on Dev-C++. Now I have written another project that uses CUDA technology, and I think it works slowly because of Visual Studio, so I need some other compiler that supports CUDA. My questions are:
does Dev-C++ support CUDA?
what compilers support CUDA besides Visual Studio?
if there are many compilers supporting CUDA, which will give the best speed for the application?
The CUDA Toolkit Release Notes list the supported platforms and compilers.
Well, I think it's the other way around. The thing is, there is a compiler driver called nvcc. It generates device code and host code and sends the host code to a compiler. That compiler should be a C compiler and should be in the executable search path. (EDIT: it should be gcc on Linux and cl on Windows, and I'll ignore Mac as the release notes do.)
The nvcc compiler documentation reads:
A general purpose C compiler is needed by nvcc in the following situations:
During non-CUDA phases (except the run phase), because these phases will be forwarded by nvcc to this compiler.
During CUDA phases, for several preprocessing stages. On Linux platforms, the compiler is assumed to be ‘gcc’, or ‘g++’ for linking. On Windows platforms, the compiler is assumed to be ‘cl’. The compiler executables are expected to be in the current executable search path, unless option --compiler-bindir is specified, in which case the value of this option must be the name of the directory in which these compiler executables reside.
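For example, to point nvcc at a specific cl.exe (a hypothetical invocation; the Visual Studio path varies per install):

nvcc --compiler-bindir "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -o app.exe kernel.cu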
And please don't talk about compilers like that. Your code simply happens to suit Dev-C++ better. What is generated is assembly code in either case. I don't say compilers make no difference, but maybe 4 to 5%, not 100%.
And absolutely, definitely don't blame the compiler for your slow program. It is far more likely due to inefficient memory access and incorrect use of the different types of memory; see the sketch below.
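A minimal sketch (with hypothetical kernels, not from the original question) of the kind of memory-access difference that dwarfs any compiler setting:

__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // adjacent threads touch adjacent addresses
}

__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];  // scattered reads defeat coalescing
}

On most GPUs the strided version runs several times slower than the coalesced one, even though both compile to near-identical instruction counts.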

Debugging symbols for CUFFT and CUDA Runtime API (cudart)

Where can I find *.debug files with debugging info for the CUDA libraries from the CUDA SDK, namely CUFFT and the CUDA Runtime API (cudart), and how do I provide them to a debugger and/or profiler?
Without this info, debugging an application that uses the CUDA libraries is very difficult, especially when the error is in CUDA code.
These libraries are not open source, and so naturally debug symbols are not provided.
If you find that there is a bug in a library, I recommend you become a registered CUDA developer and report the issue using the online bug report form. Alternatively (but less preferably), report the issue in more detail here or on the NVIDIA forums.
Before you report a bug, make sure you are confident it is not in your own code first (see the sketch below). :)
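One generic way to rule out your own code (a sketch; the checkCuda macro name is invented for illustration) is to check the status of every CUDA runtime call:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Fail loudly on any CUDA runtime error instead of continuing silently.
#define checkCuda(call)                                                 \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

int main()
{
    float *d = NULL;
    checkCuda(cudaMalloc((void **)&d, 1024 * sizeof(float)));
    checkCuda(cudaFree(d));
    return 0;
}

If every call in your code passes such a check and the failure still comes from inside CUFFT or cudart, that is good evidence for a library bug report.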

Performance comparison between Windows gcc compiled & Visual Studio compiled

I'm currently compiling an open source optimization library (native C++) supplied with makefiles for use with gcc. As a Windows user, I'm curious about the two options I see for compiling this: using gcc with MinGW/Cygwin, or manually building a Visual Studio project and compiling the source.
1) If I compile using MinGW/Cygwin + gcc, will the resulting .lib (static library) require any libraries from MinGW/Cygwin? I.e., can I distribute my compiled .lib to a Windows PC that doesn't have MinGW/Cygwin and will it still run?
2) Other than performance differences between the compilers themselves, is there an overhead associated with compiling using MinGW/Cygwin and gcc - as in, does the emulation layer get compiled into the library, or does gcc build a native Windows library?
3) If speed is my primary objective for the library, which is the best method to use? I realise this is quite open-ended, and I may be best off running my own benchmarks, but if someone has experience here that would be great!
The whole point of Cygwin is the Linux emulation layer, and by default (i.e. if you don't cross-compile), binaries need cygwin1.dll to run.
This is not the case for MinGW, which creates binaries as 'native' as the ones from MSVC. However, MinGW comes with its own set of runtime libraries, in particular libstdc++-6.dll. This library can also be linked statically by using -static-libstdc++, in which case you also probably want to compile with -static-libgcc.
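For example, a MinGW build with statically linked runtimes could look like this (a hypothetical command; app.cpp and mylib.cpp are placeholder names):

g++ -O2 -static-libstdc++ -static-libgcc -o app.exe app.cpp mylib.cpp

The resulting app.exe then has no dependency on libstdc++-6.dll or the libgcc DLL.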
This does not mean that you can freely mix C++ libraries from different compilers (see this page on mingw.org). If you do not want to restrict yourself to an extern "C" interface to your library, you most likely will have to choose a single compiler and stick with it.
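A sketch of that extern "C" boundary (the function names here are invented for illustration):

/* mylib_c_api.h - a C-only facade so callers built with a different
   compiler never see C++ types or name mangling. */
#ifdef __cplusplus
extern "C" {
#endif

typedef struct Optimizer Optimizer;   /* opaque handle */

Optimizer *optimizer_create(void);
double     optimizer_solve(Optimizer *opt, const double *x, int n);
void       optimizer_destroy(Optimizer *opt);

#ifdef __cplusplus
}
#endif

Everything behind this header can be C++ built with one compiler; callers only need a C ABI.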
As to your performance concerns: Using Cygwin only causes a (minor?) penalty when actually interacting with the OS - where raw computations are concerned, only the quality of the optimizer matters.

NVIDIA Parallel Nsight and OpenCL

I'm a little confused about NVIDIA Parallel Nsight and OpenCL. Can anyone confirm whether it is possible to debug OpenCL code using NVIDIA Parallel Nsight 1.5 or 2.0 RC?
Currently it is not possible to debug OpenCL kernels with Parallel Nsight. Parallel Nsight 2.0 (the latest release as of June 2011) only supports debugging CUDA kernels. However, OpenCL debugging is one of the features likely to go into the product in future releases.
Yes, it is possible; I've done it myself. The only problem is that you will need two computers connected over a network, with two identical video cards. One executes your kernel step by step (because of this, its graphics adapter can't display results and the display stalls); this is where the second computer comes into play: it displays the results in Visual Studio as if you were debugging an ordinary program.
Personally, I found NVIDIA Parallel Nsight to be a useless tool. Any kernel debugging can be done by adding an extra argument to a kernel and writing whatever data you care about into it, as in the sketch below.
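A minimal sketch of that debug-buffer technique (the kernel and names are invented for illustration):

__global__ void myKernel(const float *in, float *out, float *dbg, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float t = in[i] * 2.0f;   // the intermediate value under suspicion
    dbg[i] = t;               // mirror it into a debug buffer
    out[i] = t + 1.0f;
}

After the launch, copy dbg back with cudaMemcpy and print it on the host to inspect per-thread intermediate values without a debugger.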
Parallel Nsight 2.1 now includes an API for tracing OpenCL 1.1.
See http://nvidia.com/object/parallel-nsight.html
