Should I look into PTX to optimize my kernel? If so, how? - performance

Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further?
One example: I read that one can tell from the PTX code whether the automatic loop unrolling worked, and if it did not, one would have to unroll the loops manually in the kernel code.
Are there other use-cases for the PTX code?
Do you look into your PTX code?
Where can I learn how to read the PTX code CUDA generates for my kernels?

The first point to make about PTX is that it is only an intermediate representation of the code run on the GPU -- a virtual machine assembly language. PTX is assembled into target machine code either by ptxas at compile time or by the driver at runtime. So when you are looking at PTX, you are looking at what the compiler emitted, but not at what the GPU will actually run. It is also possible to write your own PTX code, either from scratch (this is the only JIT compilation model supported in CUDA) or as part of inline-assembler sections in CUDA C code (the latter officially supported since CUDA 4.0, but "unofficially" supported for much longer than that). Every CUDA toolkit has shipped with a complete guide to the PTX language, so it is fully documented. The Ocelot project has used this documentation to implement its own PTX cross compiler, which allows CUDA code to run natively on other hardware, initially x86 processors, but more recently AMD GPUs.
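For reference, an inline-PTX section in CUDA C looks like this -- a minimal sketch that reads the %laneid special register (the wrapper function name is just illustrative):
// Read the %laneid special register via inline PTX (CUDA C++).
__device__ unsigned int lane_id()
{
    unsigned int id;
    asm("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}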
If you want to see what the GPU is actually running (as opposed to what the compiler is emitting), NVIDIA now supplies a binary disassembler tool called cuobjdump which can show the actual machine code segments in code compiled for Fermi GPUs. There was an older, unofficial tool called decuda which worked for G80 and G90 GPUs.
Having said that, there is a lot to be learned from PTX output, particularly about how the compiler is applying optimizations and what instructions it is emitting to implement certain C constructs. Every version of the NVIDIA CUDA toolkit comes with a guide to nvcc and documentation for the PTX language. There is plenty of information in both documents to learn how to compile CUDA C/C++ kernel code to PTX and to understand what the PTX instructions will do.
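As a concrete starting point, here is a minimal sketch (the file name, kernel, and sm_70 target are illustrative) of how you might emit and read the PTX for a kernel to check whether a loop was unrolled:
// unroll_check.cu
// Emit PTX:   nvcc -ptx -arch=sm_70 unroll_check.cu -o unroll_check.ptx
// Emit SASS:  nvcc -cubin -arch=sm_70 unroll_check.cu && cuobjdump -sass unroll_check.cubin
__global__ void scale4(float *out, const float *in)
{
    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
#pragma unroll
    for (int i = 0; i < 4; ++i)
        // If unrolling worked, the PTX contains four straight-line
        // load/multiply/store sequences and no loop branch.
        out[base + i] = 2.0f * in[base + i];
}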

Related

How does the CUDA compilation process take place?

According to NVIDIA's Programming Guide:
Source files for CUDA applications consist of a mixture of
conventional C++ host code, plus GPU device functions. The CUDA
compilation trajectory separates the device functions from the host
code, compiles the device functions using the proprietary NVIDIA
compilers and assembler, compiles the host code using a C++ host
compiler that is available, and afterwards embeds the compiled GPU
functions as fatbinary images in the host object file. In the linking
stage, specific CUDA runtime libraries are added for supporting remote
SPMD procedure calling and for providing explicit GPU manipulation
such as allocation of GPU memory buffers and host-GPU data transfer.
What does using the proprietary NVIDIA compilers and assembler mean?
Also, what are PTX and cubin files, and at which step of the compilation are they produced?
I have searched a lot about this, but I would like a simple explanation.
The nvcc documentation explains the different compilation steps and their respective compilers. https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory
Cubin files contain the "real" SASS assembly code, whereas PTX files contain assembly code for a "virtual" GPU architecture.
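As a hedged illustration of how the two fit together (the architectures below are just examples), you can ask nvcc to embed PTX for a virtual architecture and a cubin for a real architecture in the same fatbinary, then list them with cuobjdump:
// fatbin_demo.cu
__global__ void noop() {}
// Embed PTX (virtual arch) and a cubin (real arch) in one fatbinary:
//   nvcc -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -c fatbin_demo.cu
// Inspect the resulting object file:
//   cuobjdump -lptx fatbin_demo.o    lists the embedded PTX
//   cuobjdump -lelf fatbin_demo.o    lists the embedded cubins
//   cuobjdump -sass fatbin_demo.o    disassembles the real-architecture SASS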

What is meant by "developers must optimise their apps to run on ARM-based processors"?

This is a subject that I am not very knowledgeable about, and I was hoping to get a better understanding of the topic.
I was going through articles about Apple's transition to Apple Silicon and at some point I read "Apple is going to ship Rosetta 2, an emulation layer that lets you run old apps on new Macs."
As far as I know, an application is written in a high-level language (e.g. C/C++, Java, etc.). Then the compiler (let's assume interpreters don't exist for a moment) reads that code and translates it into assembly code. Then the assembler converts the assembly code into machine code, which is readable by the processor.
My question is: assuming the above is correct, why is Rosetta 2 required, since a CPU is supposed to translate high-level code into readable machine code anyway? Why would developers need to "optimise" their applications (or care what processor their applications run on), since they are written (mostly) in a high-level language (which the processor can compile)? I don't get why programmers would care, if the CPU is supposed to handle compiling and assembling.
This question is probably rather trivial but I couldn't find what I was looking for just by reading about compilers or CPU architecture.
a CPU is supposed to translate high level code into readable machine code anyway?
No, the CPU doesn't do that itself; it happens via software running on the CPU (a JIT or ahead-of-time compiler).
For an ahead-of-time compiler (e.g. normal C++ implementations), closed-source software only ships x86 machine code, not source, so you can't just recompile it yourself. Open-source software is usually easily portable by recompiling.
"Rewritten" is an overstatement for most apps; most can simply be recompiled.
But if you have custom x86-specific code, like manually vectorized SIMD loops using SSE / AVX intrinsics or hand-written asm, you'd have to port those to NEON / AArch64 SIMD.
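For example, a hand-vectorized x86 loop and its AArch64 port might look like this (a hedged sketch; the function and the scaling operation are made up):
#include <cstddef>
#if defined(__SSE__)
#include <immintrin.h>
// x86: scale an array by k, four floats per iteration, with SSE intrinsics.
void scale(float *dst, const float *src, std::size_t n, float k)
{
    __m128 vk = _mm_set1_ps(k);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_loadu_ps(src + i), vk));
    for (; i < n; ++i) dst[i] = src[i] * k;   // scalar tail
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
// AArch64: the same loop ported to NEON intrinsics.
void scale(float *dst, const float *src, std::size_t n, float k)
{
    float32x4_t vk = vdupq_n_f32(k);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        vst1q_f32(dst + i, vmulq_f32(vld1q_f32(src + i), vk));
    for (; i < n; ++i) dst[i] = src[i] * k;   // scalar tail
}
#endif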

How to count LLVM IR instructions during OpenCL kernel execution?

I am trying to write OpenCL host code for a custom kernel of mine, and I want to count the LLVM IR instructions that were executed. My problem is that the LLVM IR representation of the kernel is lost after I build it, and the only thing that exists is the native binary. Is there any way to:
count the native architecture instructions executed?
find a mapping between the native architecture instructions and the LLVM IR representation and, through this, manage to count the LLVM IR instructions that were executed?
I think that is not directly possible through the OpenCL API. But there are two ways you could count native/IR instructions using external tools:
IR:
You could use the SPIR-V tools (SPIR-V is a Khronos-defined IR, similar to LLVM IR, but portable); see this README for how to generate SPIR-V from OpenCL source. Then count the IR instructions of the result.
Native:
You can use an offline OpenCL compiler from some SDK, e.g. the Intel OpenCL SDK provides a command line tool called ioc64 which can generate assembly code from OpenCL and even allows you to specify the target architecture.
You could try to disassemble the OpenCL-generated binary (via clGetProgramInfo() with CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES), e.g. with an appropriate command line tool after storing it to disk.
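A minimal sketch of that second approach (error checking omitted; it assumes the program was built for a single device, and the output path is arbitrary):
#include <CL/cl.h>
#include <cstdio>
#include <vector>
// Dump the device binary of an already-built cl_program to disk so it can be
// disassembled offline with whatever tool matches the target architecture.
void dump_binary(cl_program program, const char *path)
{
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, nullptr);
    std::vector<unsigned char> binary(size);
    unsigned char *ptrs[] = { binary.data() };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(ptrs), ptrs, nullptr);
    std::FILE *f = std::fopen(path, "wb");
    std::fwrite(binary.data(), 1, binary.size(), f);
    std::fclose(f);
}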
Hope that helps.

GCC support for Intel AVX instrinsics (dvec.h)

Does GCC support dvec.h, and if not, what can I do to port code written for ICC to work with GCC?
I am getting errors:
fatal error: dvec.h: No such file or directory
#include <dvec.h>
Alternatively, GCC cannot find F32vec8.
See Agner Fog's manual Optimizing software in C++, section 12.5, "Using vector classes".
Agner's Vector Class Library (VCL) is far more powerful than Intel's dvec.h; it works on more compilers (including GCC and Clang), and it's free. However, it requires C++.
Another option is to use Yeppp!. Yeppp! works for C, C++, C#, Java, and FORTRAN, not just C++. However, it's an actual library that you must link in; the VCL is only a set of header files.
Another difference between Yeppp! and the VCL is that Yeppp! is built from assembly, whereas the VCL uses intrinsics. This is one reason Yeppp! needs to be linked in (MSVC in 64-bit mode does not allow inline assembly).
One disadvantage of intrinsics is that the compiler can implement them differently than you expect. This is not normally a problem with ICC and GCC; they are excellent when it comes to intrinsics. However, MSVC with AVX and especially FMA is disappointing (though with SSE it's normally fine). So the performance of the VCL with GCC compared to MSVC may be quite different with AVX and FMA.
With assembly you always get what you want. However, since Yeppp! is not inline assembly, you have to deal with function-call overhead. In my case, most of the time I want something like inline assembly, which is what intrinsics mostly achieve.
I don't know Yeppp! well, but the documentation of the VCL library is excellent and the source code is very clear.
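For example, an F32vec8-style loop from dvec.h maps almost one-to-one onto the VCL (a hedged sketch using VCL's Vec8f type; the function and arrays are illustrative, and n is assumed to be a multiple of 8):
#include "vectorclass.h"   // Agner Fog's VCL; Vec8f holds 8 floats (AVX)
// a[i] = a[i] * b[i] + c[i], eight floats at a time, using overloaded
// operators much like dvec.h's F32vec8. Compile with AVX enabled (e.g. -mavx).
void muladd(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        Vec8f va, vb, vc;
        va.load(a + i);
        vb.load(b + i);
        vc.load(c + i);
        va = va * vb + vc;
        va.store(a + i);
    }
}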

Incremental compilation in nvcc (CUDA)

I have many structs (classes) and standalone functions that I would like to compile separately and then link with the CUDA kernel, but I am getting the "External calls are not supported" error while compiling (not linking) the kernel. nvcc forces functions called from a kernel to be inlined, which is very frustrating!! If somebody has figured out a way to achieve incremental compilation, please share.
Also see the following thread on NVIDIA forums.
http://forums.nvidia.com/index.php?s=&showtopic=103256&view=findpost&p=1009242
Currently CUDA does not support true function calls on the GPU, which is why device functions are inlined.
Fermi hardware supports device functions without inlining.
OK, this can now be done with CUDA 5, which added separate compilation and device linking of relocatable device code.
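A minimal sketch of what that looks like (file and function names are illustrative):
// scale.cu -- device function compiled on its own
__device__ float scale(float x) { return 2.0f * x; }
// kernel.cu -- calls scale() without seeing its definition
extern __device__ float scale(float x);
__global__ void apply(float *d) { d[threadIdx.x] = scale(d[threadIdx.x]); }
// Build with relocatable device code so nvcc can device-link the pieces:
//   nvcc -rdc=true -c scale.cu kernel.cu
//   nvcc -rdc=true scale.o kernel.o -o app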
