How to count LLVM IR instructions during OpenCL kernel execution?

I am trying to write OpenCL host code for a custom kernel of mine, and I want to count the LLVM IR instructions that were executed. My problem is that the LLVM IR representation of the kernel is lost after I build it, and the only thing that exists is the native binary. Is there any way to:
count the native architecture instructions executed?
find a mapping between the native architecture instructions and the LLVM IR representation and, through this, manage to count the LLVM IR instructions that were executed?

I think that is not directly possible through the OpenCL API. But there are two ways you could achieve counting native/IR instructions using tools:
IR:
You could use the SPIR-V (Khronos-defined IR, similar to LLVM IR, but portable) tools, see this README to generate SPIR-V from an OpenCL source. Then count the IR instructions of the result.
Native:
You can use an offline OpenCL compiler from some SDK, e.g. the Intel OpenCL SDK provides a command line tool called ioc64 which can generate assembly code from OpenCL and even allows you to specify the target architecture.
You could try to disassemble the OpenCL-generated binary (via clGetProgramInfo() with CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES), e.g. with an appropriate command line tool after storing it to disk.
Hope that helps.
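For the IR route, here is a minimal sketch of what "count the IR instructions of the result" could look like. It assumes the IR is available as textual LLVM IR (a .ll listing); the function name count_ir_instructions is hypothetical. It counts lines inside `define ... { ... }` bodies that are not blank, comments, or basic-block labels. Note that this is a static count: to get the number of instructions actually executed, the per-block counts would still have to be weighted by how often each basic block runs. Multi-line instructions such as `switch` are over-counted by this heuristic.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Rough static count of instructions in a textual LLVM IR listing.
 * Heuristic (an assumption, not an LLVM API): one instruction per line. */
size_t count_ir_instructions(const char *ir)
{
    size_t count = 0;
    int in_function = 0;
    const char *line = ir;

    assert(ir != NULL);
    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        if (end == NULL)
            end = line + strlen(line);

        /* Drop a trailing ';' comment, e.g. "; preds = %entry". */
        const char *stop = memchr(line, ';', (size_t)(end - line));
        if (stop == NULL)
            stop = end;

        /* Trim leading and trailing whitespace. */
        const char *p = line;
        while (p < stop && isspace((unsigned char)*p))
            p++;
        const char *q = stop;
        while (q > p && isspace((unsigned char)q[-1]))
            q--;

        size_t len = (size_t)(q - p);
        if (len > 0) {
            if (!in_function) {
                /* A function body starts at "define ... {". */
                if (len > 7 && strncmp(p, "define ", 7) == 0 && q[-1] == '{')
                    in_function = 1;
            } else if (len == 1 && *p == '}') {
                in_function = 0;          /* end of function body */
            } else if (q[-1] != ':') {
                count++;                  /* not a label, so count it */
            }
        }
        line = (*end == '\n') ? end + 1 : end;
    }
    return count;
}
```

Calling count_ir_instructions on the contents of a .ll file read into a NUL-terminated buffer returns the total over all functions defined in it.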

Related

Proper way of compiling OpenCL applications and using available compiler options

I am a newbie to OpenCL.
What is the best way to compile an OpenCL project?
Using a supported compiler (GCC or Clang):
When we use a compiler like gcc or clang, how do we control these options? Do they have to be set inside the source code, or can we pass them on the command line as in the normal compilation flow? Looking at the Khronos manual for OpenCL 1.2, there are a few optimization options provided for clBuildProgram:
gcc|clang -O3 -I<INCLUDES> OpenCL_app.c -framework OpenCL OPTION -lm
Actually, I tried this and received an error:
gcc: error: unrecognized command line option '<OPTION>'
Alternatively, using openclc:
I have seen people using openclc to compile using a Makefile.
I would like to know which is the best way (if there are actually two separate ways), and how we control the usage of different compile-time options.
You might be aware of this, but it is important to reiterate. The OpenCL standard contains two things:
The OpenCL C language and programming model (I think recent versions of the standard include some C++)
The OpenCL host library to manage devices
gcc and clang are compilers for the host side of your OpenCL project. So there is no way to provide compiler options for OpenCL device code compilation through a host compiler, since it is not even aware of OpenCL.
The exception is clang, which has a flag that accepts OpenCL device code, i.e. the .cl file that contains the kernels. That way you can use clang and also provide flags and options, if I remember correctly, but the output is then either LLVM IR or SPIR, not a device-executable object. You can then load the SPIR object onto a device using the device's run-time environment (OpenCL drivers).
You can checkout these links:
Using Clang to compile kernels
Llvm IR generation
SPIR
Another alternative is to use the tools provided by your target platform. Each vendor that claims to support OpenCL should have a run-time environment. Usually, they have separate CLI tools to compile OpenCL device code. In your case (I guess) you have drivers from Apple, therefore you have openclc.
Intel CLI as an example
Now to your main question (the best way to compile OpenCL). It depends on what you want to do. You didn't specify what kind of requirements you have, so I had to speculate.
If you want off-line compilation without a host program, the considerations above will help you. Otherwise, you have to use the OpenCL library and compile your kernels on-line; this is generally preferred for products that need portability. If you compile all your kernels at the start of your program, you directly use the provided environment and you don't need to ship libraries for each target platform.
Therefore, if you have an OpenCL project, you have to decide how to compile. If you really want to use the generic flags and not rely on third-party tools, I suggest you have a class that builds your kernels and provides the flags you want.
...how do we control these options? Do they have to be set inside the source code, or can we pass them on the command line as in the normal compilation flow?
Options can be set inside the source code. For example:
const char options[] = "-cl-finite-math-only -cl-no-signed-zeros";
/* Build program */
err = clBuildProgram(program, 1, &device, options, NULL, NULL);
I have never seen OpenCL options being specified on the command line, and I am unaware whether this is possible.
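The host compiler's command line cannot carry these options, but the host program's own command line can. As a sketch under that assumption, the helper below (join_build_options is a hypothetical name, not part of the OpenCL API) joins the program's arguments into a single space-separated string, which could then be passed as the options argument of clBuildProgram in place of the hard-coded array above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Joins argv[first..argc-1] with single spaces into `out`.
 * Returns 0 on success, -1 if the buffer of `cap` bytes is too small. */
int join_build_options(int argc, char *const argv[], int first,
                       char *out, size_t cap)
{
    size_t used = 0;

    assert(out != NULL);
    if (cap == 0)
        return -1;
    out[0] = '\0';

    for (int i = first; i < argc; i++) {
        size_t len = strlen(argv[i]);
        size_t sep = (used > 0) ? 1 : 0;   /* space before 2nd, 3rd, ... */

        if (used + sep + len + 1 > cap)    /* +1 for the terminating NUL */
            return -1;
        if (sep)
            out[used++] = ' ';
        memcpy(out + used, argv[i], len);
        used += len;
        out[used] = '\0';
    }
    return 0;
}
```

In main, one would call join_build_options(argc, argv, 1, options, sizeof options) and hand options to clBuildProgram. The OpenCL compiler still decides which options it accepts, so an unknown flag shows up as a build error at run time rather than at host compile time.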

Mapping Between LLVM IR and x86 Instructions

Is there an easy way to map to LLVM instructions from their associated assembly instructions in the output binary? Given an instruction in an x86 binary, I would like to be able to determine with which LLVM IR instruction it is associated.
One possibility would be to compile the binary with debug symbols turned on and then associate the instructions based off of source code line, but that seems like a hack and is prone to having a many-to-many mapping between x86 and LLVM IR when ideally it would be a many-to-one mapping.

Loading OpenCL kernels from bitcode in the correct architecture

I'm using Xcode (version 5.3) to compile OpenCL kernels to bitcode, as explained in WWDC 2013 session 508.
Xcode generates 4 different files, each with a different extension according to the architecture it targets.
The extensions are: cl.gpu_32.bc, cl.gpu_64.bc, cl.x86_64.bc, cl.i386.bc
In session 508, they only load a single file (the one with the cl.gpu_32.bc extension) and use it.
Is it possible to generate a single cl_program that supports all devices associated with the context?
How do I know which architecture to use for each of the available devices?
Sample code that reads all the files and generates a single cl_program would be very helpful.
Apple provides sample code that covers loading platform-specific bitcode:
https://developer.apple.com/library/mac/samplecode/OpenCLOfflineCompilation/Introduction/Intro.html#//apple_ref/doc/uid/DTS40011196
From the description:
This sample demonstrates how developers can utilize the OpenCL offline
compiler to transform their human-readable OpenCL source files into
shippable bitcode. It includes an example Makefile that demonstrates
how to invoke the compiler, and a self-contained OpenCL program that
shows how to build a program from the generated bitcode. The sample
covers the case of using bitcode on 64 and 32 bit CPU devices, as well
as 32 bit GPU devices.
The readme covers the CLI arguments and the single-file C program contains lots of explanations.
It seems from the Apple sample code (referenced by weichsel) that all that is needed is to query CL_DEVICE_ADDRESS_BITS and CL_DEVICE_TYPE with clGetDeviceInfo, checking the type against CL_DEVICE_TYPE_GPU, to distinguish between the possible architectures.
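A minimal sketch of that selection step, assuming the four file extensions listed in the question and a hypothetical helper name (bitcode_path_for_device). The two inputs are what clGetDeviceInfo would report: whether the CL_DEVICE_TYPE value has the CL_DEVICE_TYPE_GPU bit set, and the CL_DEVICE_ADDRESS_BITS value. The mapping of CPU devices to i386 and x86_64 is inferred from the file names, not taken from Apple's sample.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "<stem>.cl.<arch>.bc" for one device.
 * is_gpu       : nonzero if the device type has CL_DEVICE_TYPE_GPU set
 * address_bits : value reported for CL_DEVICE_ADDRESS_BITS
 * Returns 0 on success, -1 on an unsupported width or a too-small buffer. */
int bitcode_path_for_device(const char *stem, int is_gpu,
                            unsigned address_bits, char *out, size_t cap)
{
    const char *arch;

    assert(stem != NULL && out != NULL);
    if (strlen(stem) == 0)
        return -1;

    if (address_bits == 32)
        arch = is_gpu ? "gpu_32" : "i386";
    else if (address_bits == 64)
        arch = is_gpu ? "gpu_64" : "x86_64";
    else
        return -1;

    int n = snprintf(out, cap, "%s.cl.%s.bc", stem, arch);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

The host program would call this once per device in the context, read each selected file, and pass the per-device buffers to clCreateProgramWithBinary, which takes one binary per device and so can produce a single cl_program covering all of them.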

Compiling for Cortex M3 bare metal

Is there a guide somewhere that describes how to get LLVM to emit a binary for Cortex-M3 that I can massage into running bare metal? I've spent considerable time playing with LLVM on Windows and Ubuntu to no avail. I can get ARM-like assembly out. I can get bit code out, but what I really need is ELF, DWARF, Hobbit, Gandalf or any other Lord of the Rings critter that has a file format specification. Any and all help appreciated! I'm compiling LLVM 3.4 with CLANG on Ubuntu, Windows and/or OS X.
I created a firmware framework, PolyMCU (https://github.com/labapart/polymcu), that is based on CMake and supports GCC and LLVM. Because it is based on CMake, you can build your firmware on Linux/Windows/MacOS.
It also uses Newlib and supports Baremetal/CMSIS RTOS (RTX)/FreeRTOS.
The benefit of using PolyMCU is this framework does not add any software layer on top of the libc and the MCU vendor's SDKs.
Another benefit is you can easily switch toolchains. I used this feature to get more feedback on my code by testing it with many compilers.
I also wrote a blog post comparing GCC and LLVM build sizes on ARM Cortex-M: http://labapart.com/blogs/3-the-importance-of-the-toolchain-version-in-embedded-space Interesting result: Clang-generated code is not much bigger than GCC's on Cortex-M...
The best guide that I know of is here: http://wiki.osdev.org/LLVM_Cross-Compiler. It's mostly about building an LLVM cross-compiler, but it does show a "Usage" section. However, that section specifically shows an example for a Cortex-A processor, but you should be able to get the general idea.
I have created a simple clang bare metal Cortex-M3 "hello world" program, but I don't have it in front of me. IIRC, the only options I needed were -march=thumb -mcpu=cortex-m3, as long as the LLVM compiler was built with ARM Thumb backend support (again, see http://wiki.osdev.org/LLVM_Cross-Compiler). I did, however, need to link with arm-none-eabi-ld from the GCC toolchain here (http://launchpad.net/gcc-arm-embedded), and I believe that is how you can get your ELF binary.
I've since moved on to the D programming language, and I have a simple example using LDC (The LLVM D compiler) here (http://wiki.dlang.org/Extremely_minimal_semihosted_%22Hello_World%22)
So, I believe compiling bare metal ARM Cortex-M3 software with LLVM can be done, but it seems not many people have tried.
It is possible to use clang++ pulled from http://llvm.org/builds with https://launchpad.net/gcc-arm-embedded as a base, at least for the compile step.
Required extra arguments are the include paths hardcoded into gcc and certain arm-none-eabi defaults:
--target=arm-none-eabi -fshort-enums -isystem "../arm-none-eabi/include/c++/5.2.1" [-isystem ...]

Should I look into PTX to optimize my kernel? If so, how?

Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further?
One example: I read that one can find out from the PTX code whether the automatic loop unrolling worked. If it did not, one would have to unroll the loops manually in the kernel code.
Are there other use-cases for the PTX code?
Do you look into your PTX code?
Where can I find out how to be able to read the PTX code CUDA generates for my kernels?
The first point to make about PTX is that it is only an intermediate representation of the code run on the GPU -- a virtual machine assembly language. PTX is assembled to target machine code either by ptxas at compile time, or by the driver at runtime. So when you are looking at PTX, you are looking at what the compiler emitted, but not at what the GPU will actually run. It is also possible to write your own PTX code, either from scratch (this is the only JIT compilation model supported in CUDA), or as part of inline-assembler sections in CUDA C code (the latter officially supported since CUDA 4.0, but "unofficially" supported for much longer than that). CUDA has always shipped with a complete guide to the PTX language with the toolkit, and it is fully documented. The ocelot project has used this documentation to implement their own PTX cross compiler, which allows CUDA code to run natively on other hardware, initially x86 processors, but more recently AMD GPUs.
If you want to see what the GPU is actually running (as opposed to what the compiler is emitting), NVIDIA now supplies a binary disassembler tool called cuobjdump, which can show the actual machine code segments in code compiled for Fermi GPUs. There was an older, unofficial tool called decuda which worked for G80 and G90 GPUs.
Having said that, there is a lot to be learned from PTX output, particularly about how the compiler is applying optimizations and what instructions it is emitting to implement certain C constructs. Every version of the NVIDIA CUDA toolkit comes with a guide to nvcc and documentation for the PTX language. There is plenty of information in both documents to learn how to compile CUDA C/C++ kernel code to PTX and to understand what the PTX instructions will do.

Resources