I have many structs (classes) and standalone functions that I would like to compile separately and then link against the CUDA kernel, but I am getting the "External calls are not supported" error while compiling (not linking) the kernel. nvcc forces every function called from a kernel to be inlined. This is very frustrating!! If somebody has figured out a way to achieve incremental compilation, please share.
Also see the following thread on NVIDIA forums.
http://forums.nvidia.com/index.php?s=&showtopic=103256&view=findpost&p=1009242
Currently CUDA cannot make true function calls on the device, which is why __device__ functions are inlined into the calling kernel and must be visible in the same translation unit.
Fermi hardware supports device functions without inlining.
Update: separate compilation and linking of device code can now be done with CUDA 5.0.
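For reference, a minimal sketch of what that looks like with CUDA 5.0's separate compilation (the file, function, and architecture names below are illustrative):

// scale.cu -- a __device__ function compiled on its own
__device__ float scale(float x, float a) { return a * x; }

// main.cu -- a kernel that calls the externally defined device function
extern __device__ float scale(float x, float a);

__global__ void scale_all(float *data, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i], a);
}

int main() { return 0; }  // trivial host entry point so the example links

// Compile each file to relocatable device code, then let nvcc link both:
//   nvcc -arch=sm_35 -dc scale.cu main.cu
//   nvcc -arch=sm_35 scale.o main.o -o app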
According to NVIDIA's Programming Guide:
Source files for CUDA applications consist of a mixture of
conventional C++ host code, plus GPU device functions. The CUDA
compilation trajectory separates the device functions from the host
code, compiles the device functions using the proprietary NVIDIA
compilers and assembler, compiles the host code using a C++ host
compiler that is available, and afterwards embeds the compiled GPU
functions as fatbinary images in the host object file. In the linking
stage, specific CUDA runtime libraries are added for supporting remote
SPMD procedure calling and for providing explicit GPU manipulation
such as allocation of GPU memory buffers and host-GPU data transfer.
What does "compiles the device functions using the proprietary NVIDIA compilers and assembler" mean?
Also, what are PTX and cubin files, and in which step of the compilation do they come into play?
I have searched a lot about this, but I would like a simple explanation.
The nvcc documentation explains the different compilation steps and their respective compilers. https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory
Cubin files contain the "real" SASS machine code for a specific GPU architecture, whereas PTX files contain assembly code for a "virtual" GPU architecture.
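You can inspect both artifacts yourself by asking nvcc for them directly; a toy kernel and the corresponding commands are sketched below (exact flag spellings may vary slightly between toolkit versions, so check nvcc --help):

// trivial.cu -- a toy kernel used only to look at the compilation stages
__global__ void add_one(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

//   nvcc -ptx trivial.cu                 -> trivial.ptx   (virtual ISA, human-readable text)
//   nvcc -cubin -arch=sm_70 trivial.cu   -> trivial.cubin (machine code for one real architecture)
//   cuobjdump -sass trivial.cubin        -> the disassembled SASS the GPU actually runs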
I am a newbie to OpenCL.
What is the best way to compile an OpenCL project?
Using a supported compiler (GCC or Clang):
When we use a compiler like gcc or clang, how do we control these options? Do they have to be set inside the source code, or can we pass them on the command line as in the normal compilation flow? Looking at the Khronos Manual 1.2, there are a few optimization options provided for clBuildProgram:
gcc|clang -O3 -I<INCLUDES> OpenCL_app.c -framework OpenCL OPTION -lm
Actually, I tried this and received an error:
gcc: error: unrecognized command line option '<OPTION>'
Alternatively, using openclc:
I have seen people using openclc to compile via a Makefile.
I would like to know which is the best way (if there are actually two separate ways), and how we can control the different compile-time options.
You might already be aware of this, but it is important to reiterate. The OpenCL standard contains two things:
the OpenCL C language and programming model (I think the recent standard includes some C++)
the OpenCL host library to manage devices
gcc and clang are compilers for the host side of your OpenCL project, so there is no way to provide compiler options for OpenCL device code compilation through a host compiler; the host compiler is not even aware of OpenCL.
The exception is clang, which has a mode that accepts OpenCL device code directly (a .cl file containing the kernels). That way you can use clang and also provide the flags and options, if I remember correctly, but the output would be LLVM IR or SPIR rather than a device-executable object. You can then load the SPIR object onto a device using the device's runtime environment (the OpenCL drivers).
You can check out these links:
Using Clang to compile kernels
LLVM IR generation
SPIR
The other alternative is to use the tools provided by your target platform. Each vendor that claims to support OpenCL should ship a runtime environment, and they usually have separate CLI tools to compile OpenCL device code. In your case (I guess) you have drivers from Apple, therefore you have openclc.
Intel CLI as an example
Now to your main question (the best way to compile OpenCL): it depends on what you want to do. You didn't specify what kind of requirements you have, so I have to speculate.
If you want offline compilation without a host program, the considerations above will help you. Otherwise, you have to use the OpenCL host library and compile your kernels online (at runtime). This is generally preferred for products that need portability: if you compile all your kernels at the start of your program, you use whatever environment is provided on the target machine, and you don't need to ship device binaries for each target platform.
Therefore, if you have an OpenCL project, you have to decide how to compile. If you really want to use the generic flags and not rely on third-party tools, I suggest writing a class that builds your kernels and supplies the flags you want, as sketched below.
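A minimal sketch of such a builder, assuming a C++ host program (the class name KernelBuilder and its interface are invented for illustration; error handling is reduced to the essentials):

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <stdexcept>
#include <string>
#include <vector>

// Builds an OpenCL program from source at runtime and applies the
// requested compile-time options (e.g. "-cl-fast-relaxed-math -DTILE=16").
class KernelBuilder {
public:
    KernelBuilder(cl_context ctx, cl_device_id dev) : ctx_(ctx), dev_(dev) {}

    cl_program build(const std::string &source, const std::string &options) {
        const char *src = source.c_str();
        cl_int err = CL_SUCCESS;
        cl_program prog = clCreateProgramWithSource(ctx_, 1, &src, nullptr, &err);
        if (err != CL_SUCCESS) throw std::runtime_error("clCreateProgramWithSource failed");

        err = clBuildProgram(prog, 1, &dev_, options.c_str(), nullptr, nullptr);
        if (err != CL_SUCCESS) {
            // On failure, fetch the device compiler's build log for diagnostics.
            size_t log_size = 0;
            clGetProgramBuildInfo(prog, dev_, CL_PROGRAM_BUILD_LOG, 0, nullptr, &log_size);
            std::vector<char> log(log_size + 1, '\0');
            clGetProgramBuildInfo(prog, dev_, CL_PROGRAM_BUILD_LOG, log_size, log.data(), nullptr);
            throw std::runtime_error("clBuildProgram failed:\n" + std::string(log.data()));
        }
        return prog;
    }

private:
    cl_context ctx_;
    cl_device_id dev_;
};

Calling build(source, "-cl-fast-relaxed-math") then gives you a cl_program from which kernels can be created with clCreateKernel.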
...how do we control these options? Do they have to be set inside the source code, or can we pass them on the command line as in the normal compilation flow?
Options can be set inside the source code. For example:
const char options[] = "-cl-finite-math-only -cl-no-signed-zeros";
/* Build program */
err = clBuildProgram(program, 1, &device, options, NULL, NULL);
I have never seen OpenCL options being specified at the command line, and I'm unaware whether this is possible or not.
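If you do want something that feels like command-line control, one workaround (not a standard OpenCL mechanism, just a sketch) is to forward your host program's own arguments into the options string passed to clBuildProgram:

/* Forward the host program's own command-line arguments into the options
 * string, e.g.  ./myapp -cl-fast-relaxed-math -DWIDTH=1024
 * (simplified: every argument is passed through verbatim). */
#include <string>

std::string collect_build_options(int argc, char **argv)
{
    std::string options;
    for (int i = 1; i < argc; ++i) {
        if (i > 1) options += ' ';
        options += argv[i];
    }
    return options;  /* hand options.c_str() to clBuildProgram */
}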
I am trying to understand how exactly I can use OpenACC to offload computation to my NVIDIA GPU with GCC 5.3. The more I google, the more confused I become. All the guides I find involve recompiling the entire GCC along with two libraries called nvptx-tools and nvptx-newlib. Other sources say that OpenACC is part of the GOMP library. Yet others say that development of OpenACC support will continue only on GCC 6.x. I have also read that OpenACC support is in GCC's main branch. However, if I compile a program with -fopenacc and -foffload=nvptx-none, it just won't work. Can someone explain to me what exactly it takes to compile and run OpenACC code with GCC 5.3+?
Why do some guides seem to require (re)compiling nvptx-tools, nvptx-newlib, and GCC if, as some internet sources say, OpenACC support is part of GCC's main branch?
What is the role of the GOMP library in all this?
Is it true that development for OpenACC support will only be happening for GCC 6+ from now on?
When OpenACC support matures, is it the goal to enable it in a similar way we enable OpenMP (i.e., by just adding a couple of compiler flags)?
Can someone also provide answers to all the above after replacing "OpenACC" with "OpenMP 4.0 GPU/MIC offload capability"?
Thanks in advance
The link below contains a script that will compile gcc for OpenACC support.
https://github.com/olcf/OLCFHack15/blob/master/GCC5OffloadTest/auto-gcc5-offload-openacc-build-install.sh
OpenACC is part of GCC's main branch now, but there are some points to note. Even for libraries that are part of GCC, when you build GCC you have to specify which of them to build; not all of them are built by default. For OpenACC there is an additional problem: since NVIDIA's drivers are not open source, GCC cannot compile OpenACC directly to GPU binaries. It has to compile OpenACC to intermediate NVPTX instructions, which the NVIDIA runtime then handles. Therefore you also need to install the nvptx libraries.
The GOMP library is the intermediate library that handles both OpenMP and OpenACC.
Yes, I think OpenACC development will only happen in GCC 6, but it may still be backported to GCC 5. Still, your best bet would be to use GCC 6.
While I cannot comment on what the GCC developers decide to do, I think I have already stated the problems in the first point. Unless NVIDIA makes its drivers open source, I think an extra step will always be necessary.
I believe right now OpenMP offloading is planned only for CPUs and MIC, and I believe OpenMP support for both will probably become default behavior. I am not sure whether OpenMP targeting NVIDIA GPUs is immediately part of their plan, but since GCC uses GOMP for both OpenMP and OpenACC, I believe they might eventually be able to do it. GCC is also targeting HSA with OpenMP, so basically AMD APUs. I am not sure whether AMD GPUs will work the same way, but it may be possible. Since AMD is making its drivers open source, I believe they may be easier to integrate into the default behavior.
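Once an offload-enabled GCC build is in place, a minimal OpenACC test could look like the following (the file name and expected output are illustrative; the compile command assumes the nvptx-none offload target mentioned in the question):

/* saxpy.c -- minimal OpenACC offload test */
#include <stdio.h>

#define N (1 << 20)

int main(void)
{
    static float x[N], y[N];

    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    /* This loop is offloaded to the accelerator when offloading is enabled. */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);  /* expect 4.000000 */
    return 0;
}

/* Build with an offload-enabled GCC:
 *   gcc -fopenacc -foffload=nvptx-none -O2 saxpy.c -o saxpy
 */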
Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further?
One example: I read that one can tell from the PTX code whether the automatic loop unrolling worked. If it did not, one would have to unroll the loops manually in the kernel code.
Are there other use-cases for the PTX code?
Do you look into your PTX code?
Where can I find out how to read the PTX code that CUDA generates for my kernels?
The first point to make about PTX is that it is only an intermediate representation of the code run on the GPU -- a virtual machine assembly language. PTX is assembled to target machine code either by ptxas at compile time, or by the driver at runtime. So when you are looking at PTX, you are looking at what the compiler emitted, but not at what the GPU will actually run. It is also possible to write your own PTX code, either from scratch (this is the only JIT compilation model supported in CUDA), or as part of inline-assembler sections in CUDA C code (the latter officially supported since CUDA 4.0, but "unofficially" supported for much longer than that). CUDA has always shipped with a complete guide to the PTX language with the toolkit, and it is fully documented. The ocelot project has used this documentation to implement their own PTX cross compiler, which allows CUDA code to run natively on other hardware, initially x86 processors, but more recently AMD GPUs.
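As a small illustration of the inline-assembler route mentioned above, here is how one of PTX's special registers can be read from CUDA C (a toy helper; the function name is made up):

// Reads the PTX special register %laneid from CUDA C via inline PTX.
__device__ unsigned int lane_id(void)
{
    unsigned int id;
    asm("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}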
If you want to see what the GPU is actually running (as opposed to what the compiler is emitting), NVIDIA now supplies a binary disassembler tool called cuobjdump which can show the actual machine code segments in code compiled for Fermi GPUs. There was an older, unofficial tool called decuda which worked for G80 and G90 GPUs.
Having said that, there is a lot to be learned from PTX output, particularly about how the compiler applies optimizations and what instructions it emits to implement certain C constructs. Every version of the NVIDIA CUDA toolkit comes with a guide to nvcc and documentation for the PTX language. There is plenty of information in both documents, both to learn how to compile CUDA C/C++ kernel code to PTX and to understand what the PTX instructions do.
I want to compile the Linux kernel (written in C) using g++. Is this possible? If not, could you suggest ways of accomplishing it?
Why would you want to do that??? Just use gcc. Compiling towards a C++ environment/runtime is not possible, as there is no way to run a C++ runtime in the kernel. That would imply having exception handling available, for example, which is very problematic in the kernel. So you have to stick to a C compiler such as Intel's icc or gcc.
Here is another question that might interest you:
Is it possible to compile Linux kernel with something other than gcc?
Another Reference:
Why don't we rewrite the Linux kernel in C++?