According to NVIDIA's Programming Guide:
Source files for CUDA applications consist of a mixture of
conventional C++ host code, plus GPU device functions. The CUDA
compilation trajectory separates the device functions from the host
code, compiles the device functions using the proprietary NVIDIA
compilers and assembler, compiles the host code using a C++ host
compiler that is available, and afterwards embeds the compiled GPU
functions as fatbinary images in the host object file. In the linking
stage, specific CUDA runtime libraries are added for supporting remote
SPMD procedure calling and for providing explicit GPU manipulation
such as allocation of GPU memory buffers and host-GPU data transfer.
What does "using the proprietary NVIDIA compilers and assembler" mean here?
Also, what are PTX and cubin files, and in which step of the compilation do they come into play?
I have searched a lot about this, but I would like a simple explanation.
The nvcc documentation explains the different compilation steps and their respective compilers. https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory
Cubin files contain the "real" SASS assembly code for a specific GPU, whereas PTX files contain assembly code for a "virtual" GPU architecture.
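If you want to look at each form yourself, nvcc and cuobjdump can emit them directly. For example (a minimal sketch; the file name kernel.cu and the sm_70 architecture are just placeholders):
nvcc -ptx kernel.cu -o kernel.ptx                     # PTX for the virtual architecture
nvcc -cubin -arch=sm_70 kernel.cu -o kernel.cubin     # cubin containing real SASS for one specific GPU
cuobjdump -sass kernel.cubin                          # disassemble the SASS inside the cubin
In a normal build, nvcc first compiles the device code to PTX, then ptxas assembles the PTX into a cubin, and both are typically embedded as a fatbinary in the host object file.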
I am a newbie at OpenCL.
What is the best way to compile an OpenCL project?
Using a supported compiler (GCC or Clang):
When we use a compiler like gcc or clang, how do we control these options? Do they have to be set inside the source code, or can we pass them on the command line as in a normal compilation flow? Looking at the Khronos Manual 1.2, there are a few options provided for cl_int clBuildProgram for optimizations:
gcc|clang -O3 -I<INCLUDES> OpenCL_app.c -framework OpenCL OPTION -lm
I actually tried this and received an error:
gcc: error: unrecognized command line option '<OPTION>'
Alternatively, using openclc:
I have seen people using openclc to compile, driven by a Makefile.
I would like to know which is the best way (if these are actually two separate ways), and how we control the usage of the different compile-time options.
You might already be aware of this, but it is important to reiterate: the OpenCL standard contains two things:
the OpenCL C language and programming model (I think recent standards include some C++)
the OpenCL host library to manage devices
gcc and clang are compilers for the host side of your OpenCL project, so there is no way to provide compiler options for OpenCL device code compilation through a host compiler; it is not even aware of OpenCL.
The exception is clang, which has a mode that accepts OpenCL device code (a .cl file containing the kernels). That way you can use clang and, if I remember correctly, also provide the flags and options, but the output will be LLVM IR or SPIR, not a device-executable object. You can then load the SPIR object onto a device using the device's runtime environment (the OpenCL drivers).
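If you want to try that route, the invocation looks roughly like the following (a sketch only; the exact flags depend on your clang version, the file name kernel.cl is a placeholder, and you may additionally need -Xclang -finclude-default-header so the OpenCL built-in functions are declared):
clang -x cl -cl-std=CL1.2 -emit-llvm -c kernel.cl -o kernel.bc                                  (LLVM bitcode)
clang -x cl -cl-std=CL1.2 -target spir-unknown-unknown -emit-llvm -c kernel.cl -o kernel.spir   (32-bit SPIR)
Device-side options such as -cl-fast-relaxed-math can typically be added to these command lines as well.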
You can check out these links:
Using Clang to compile kernels
LLVM IR generation
SPIR
Another alternative is to use the tools provided by your target platform. Each vendor that claims to support OpenCL should have a runtime environment, and usually they also have separate CLI tools to compile OpenCL device code. In your case (I guess) you have drivers from Apple, and therefore you have openclc.
Intel CLI as an example
Now to your main question (the best way to compile OpenCL): it depends on what you want to do. You didn't specify what kind of requirements you have, so I have to speculate.
If you want offline compilation without a host program, the considerations above will help you. Otherwise, you have to use the OpenCL host library and compile your kernels online; this is generally preferred for products that need portability, since if you compile all your kernels at the start of your program, you use whatever environment the target machine provides and you don't need to ship a device binary for each target platform.
Therefore, if you have an OpenCL project, you have to decide how to compile. If you really want to use the generic flags and not rely on third-party tools, I suggest you write a small helper (a class, say) that builds your kernels online and passes the flags you want, as sketched below.
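A rough sketch of what such a helper could look like (the function name and error handling are illustrative, not from any particular library; it assumes you already have a cl_context and cl_device_id):
#include <stdio.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* Build an OpenCL program from source with the given option string,
   printing the build log on failure. */
cl_program build_kernels(cl_context ctx, cl_device_id dev,
                         const char *source, const char *options)
{
    cl_int err;
    cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;

    /* "options" is where flags like "-cl-fast-relaxed-math -DTILE=16" go */
    err = clBuildProgram(program, 1, &dev, options, NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = malloc(log_size);
        clGetProgramBuildInfo(program, dev, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        fprintf(stderr, "OpenCL build failed:\n%s\n", log);
        free(log);
        clReleaseProgram(program);
        return NULL;
    }
    return program;
}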
...how do we control these options? Do they have to be set inside the source code, or can we pass them on the command line as in a normal compilation flow?
Options can be set inside the host source code, in the string passed to clBuildProgram. For example:
const char options[] = "-cl-finite-math-only -cl-no-signed-zeros";
/* Build program */
err = clBuildProgram(program, 1, &device, options, NULL, NULL);
I have never seen OpenCL options being specified on the command line, and I'm not aware of whether this is possible or not.
I found a problem with Visual Studio. My project that uses OpenMP multithreading was twice as slow on Visual Studio 2010 as on Dev-C++. Now I have written another project that uses CUDA, and I think it runs slowly because of Visual Studio, so I need another compiler that supports CUDA. My questions are:
Does Dev-C++ support CUDA?
What compilers other than Visual Studio support CUDA?
If there are many compilers supporting CUDA, which will give the best speed for the application?
The CUDA Toolkit Release Notes list the supported platforms and compilers.
Well, I think it's the other way around. The thing is, there is a compiler driver called nvcc. It generates device code and host code and sends the host code to a compiler, which should be a C compiler and should be in the executable search path. (EDIT: it should be gcc on Linux and cl on Windows, and I think I should ignore Mac as the release notes did(?))
nvcc Compiler Info reads:
A general purpose C compiler is needed by nvcc in the following
situations:
During non-CUDA phases (except the run phase), because these phases will be forwarded by nvcc to this compiler.
During CUDA phases, for several preprocessing stages. On Linux platforms, the compiler is assumed to be ‘gcc’, or ‘g++’ for linking. On Windows platforms, the compiler is assumed to be ‘cl’. The compiler executables are expected to be in the current executable search path, unless option --compiler-bindir is specified, in which case the value of this option must be the name of the directory in which these compiler executables reside.
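In practice you rarely need that option, but if you do want to point nvcc at a particular host compiler you can use -ccbin (the short form of --compiler-bindir). A sketch, with illustrative paths:
nvcc main.cu -o app                        (host code goes to the gcc/cl found on PATH)
nvcc -ccbin /opt/gcc/bin main.cu -o app    (explicitly name the directory holding the host compiler)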
And please don't talk like that about compilers. Your code just happens to be written in a way that works better with Dev-C++. What is generated in either case is assembly code. I don't say that compilers make no difference, but maybe 4 to 5%, not 100%.
And absolutely do not blame the compiler for your slow program. It is definitely because of inefficient memory access and incorrect use of the different types of memory; see the example below.
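To make the memory-access point concrete, here is an illustrative pair of kernels (not from the question); on most GPUs the first runs far faster than the second, because neighbouring threads read neighbouring addresses and the loads coalesce:
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* consecutive threads -> consecutive addresses */
    if (i < n)
        out[i] = in[i];
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;   /* threads land far apart -> poor coalescing */
    if (i < n)
        out[i] = in[i];
}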
Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further?
One example: I read that one can tell from the PTX code whether automatic loop unrolling worked. If it did not, one would have to unroll the loops manually in the kernel code.
Are there other use-cases for the PTX code?
Do you look into your PTX code?
Where can I learn how to read the PTX code that CUDA generates for my kernels?
The first point to make about PTX is that it is only an intermediate representation of the code run on the GPU -- a virtual machine assembly language. PTX is assembled to target machine code either by ptxas at compile time, or by the driver at runtime. So when you are looking at PTX, you are looking at what the compiler emitted, but not at what the GPU will actually run. It is also possible to write your own PTX code, either from scratch (this is the only JIT compilation model supported in CUDA), or as part of inline-assembler sections in CUDA C code (the latter officially supported since CUDA 4.0, but "unofficially" supported for much longer than that). CUDA has always shipped a complete guide to the PTX language with the toolkit, so it is fully documented. The Ocelot project has used this documentation to implement their own PTX cross-compiler, which allows CUDA code to run natively on other hardware, initially x86 processors, but more recently AMD GPUs.
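As a small illustration of the inline-assembler route (a sketch; the kernel name is made up, and the syntax follows NVIDIA's inline PTX documentation):
__global__ void read_clock(unsigned int *out)
{
    unsigned int c;
    /* read the per-SM clock special register directly via PTX */
    asm volatile("mov.u32 %0, %%clock;" : "=r"(c));
    out[threadIdx.x] = c;
}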
If you want to see what the GPU is actually running (as opposed to what the compiler is emitting), NVIDIA now supplies a binary disassembler tool called cuobjdump which can show the actual machine code segments in code compiled for Fermi GPUs. There was an older, unofficial tool called decuda which worked for G80 and G90 GPUs.
Having said that, there is a lot to be learned from PTX output, particularly about how the compiler applies optimizations and which instructions it emits to implement certain C constructs. Every version of the NVIDIA CUDA toolkit comes with a guide to nvcc and documentation for the PTX language. There is plenty of information in both documents for learning how to compile CUDA C/C++ kernel code to PTX and for understanding what the PTX instructions will do.
I'm currently compiling an open source optimization library (native C++) supplied with makefiles for use with gcc. As I am a Windows user, I'm curious about the two options I see for compiling it: using gcc with MinGW/Cygwin, or manually building a Visual Studio project and compiling the source that way.
1) If I compile using MinGW/Cygwin + gcc, will the resulting .lib (static library) require any libraries from MinGW/Cygwin? I.e. can I distribute my compiled .lib to a Windows PC that doesn't have MinGW/Cygwin and will it still run?
2) Other than performance differences between the compilers themselves, is there any overhead associated with compiling using MinGW/Cygwin and gcc; that is, does the emulation layer get compiled into the library, or does gcc build a native Windows library?
3) If speed of the library is my primary objective, which is the best method to use? I realise this is quite open-ended, and I may be best off running my own benchmarks, but if someone has experience here that would be great!
The whole point of Cygwin is the Linux emulation layer, and by default (i.e. if you don't cross-compile), binaries need cygwin1.dll to run.
This is not the case for MinGW, which creates binaries as 'native' as the ones from MSVC. However, MinGW comes with its own set of runtime libraries, in particular libstdc++-6.dll. This library can also be linked statically by using -static-libstdc++, in which case you also probably want to compile with -static-libgcc.
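For example, a MinGW build of the library and a statically linked executable might look like this (a sketch; file names are illustrative):
g++ -O2 -c optlib.cpp -o optlib.o
ar rcs liboptlib.a optlib.o
g++ main.cpp -L. -loptlib -static-libgcc -static-libstdc++ -o app.exe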
Note that none of this means you can freely mix C++ libraries from different compilers (see this page on mingw.org). If you do not want to restrict yourself to an extern "C" interface to your library, you will most likely have to choose a single compiler and stick with it.
As to your performance concerns: Using Cygwin only causes a (minor?) penalty when actually interacting with the OS - where raw computations are concerned, only the quality of the optimizer matters.
I have many structs (classes) and standalone functions that I would like to compile separately and then link against the CUDA kernel, but I am getting the "External calls are not supported" error while compiling (not linking) the kernel. nvcc forces the kernel to use only inlined functions. This is very frustrating! If somebody has figured out a way to achieve incremental compilation, please share.
Also see the following thread on NVIDIA forums.
http://forums.nvidia.com/index.php?s=&showtopic=103256&view=findpost&p=1009242
Currently you cannot call device functions from the GPU in CUDA, which is why they are inlined.
Fermi hardware supports device functions without inlining.
OK, this can now be done with CUDA 5, which added separate compilation and device linking of relocatable device code.
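With CUDA 5 or later you compile each translation unit with relocatable device code and let nvcc do the device link. A sketch (file names and the sm_20 architecture are illustrative):
nvcc -arch=sm_20 -dc kernels.cu -o kernels.o     (-dc = compile with relocatable device code)
nvcc -arch=sm_20 -dc helpers.cu -o helpers.o
nvcc -arch=sm_20 kernels.o helpers.o -o app      (nvcc performs the device-link step, then the host link)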