Loading OpenCL kernels from bitcode in the correct architecture - Xcode

I'm using Xcode (version 5.3) to compile OpenCL kernels to bitcode, as explained in WWDC 2013 session 508.
Xcode generates 4 different files, each with a different extension according to the architecture it targets.
The extensions are: cl.gpu_32.bc, cl.gpu_64.bc, cl.x86_64.bc, cl.i386.bc
In session 508, they only load a single file (the one with the cl.gpu_32.bc extension) and use it.
Is it possible to generate a single cl_program that supports all devices associated with the context?
How do I know which architecture to use for each of the available devices?
Sample code that reads all the files and generates a single cl_program would be very helpful.

Apple provides sample code that covers loading platform-specific bitcode:
https://developer.apple.com/library/mac/samplecode/OpenCLOfflineCompilation/Introduction/Intro.html#//apple_ref/doc/uid/DTS40011196
From the description:
This sample demonstrates how developers can utilize the OpenCL offline
compiler to transform their human-readable OpenCL source files into
shippable bitcode. It includes an example Makefile that demonstrates
how to invoke the compiler, and a self-contained OpenCL program that
shows how to build a program from the generated bitcode. The sample
covers the case of using bitcode on 64 and 32 bit CPU devices, as well
as 32 bit GPU devices.
The README covers the CLI arguments, and the single-file C program contains lots of explanations.

It seems from the Apple sample code (referenced by weichsel) that all that is needed is to query CL_DEVICE_ADDRESS_BITS and CL_DEVICE_TYPE via clGetDeviceInfo to distinguish between all the possible architectures.
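For illustration, here is a minimal sketch of that dispatch logic in C. The file names and the helper's name are placeholders matching the extensions Xcode generates; only the clGetDeviceInfo queries are part of the documented API:
#include <OpenCL/opencl.h>

/* Pick the bitcode file matching a device's architecture
   (file names are hypothetical placeholders). */
static const char *bitcode_path_for_device(cl_device_id device)
{
    cl_device_type type;
    cl_uint address_bits;

    clGetDeviceInfo(device, CL_DEVICE_TYPE, sizeof(type), &type, NULL);
    clGetDeviceInfo(device, CL_DEVICE_ADDRESS_BITS, sizeof(address_bits),
                    &address_bits, NULL);

    if (type & CL_DEVICE_TYPE_GPU)
        return (address_bits == 64) ? "kernels.cl.gpu_64.bc"
                                    : "kernels.cl.gpu_32.bc";
    /* otherwise assume a CPU device */
    return (address_bits == 64) ? "kernels.cl.x86_64.bc"
                                : "kernels.cl.i386.bc";
}
Since clCreateProgramWithBinary accepts one binary per device, you can in principle pass each device its matching bitcode file in a single call, obtain one cl_program covering the whole context, and finish with clBuildProgram.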

Related

How does the CUDA compilation process take place?

According to NVIDIA's Programming Guide:
Source files for CUDA applications consist of a mixture of
conventional C++ host code, plus GPU device functions. The CUDA
compilation trajectory separates the device functions from the host
code, compiles the device functions using the proprietary NVIDIA
compilers and assembler, compiles the host code using a C++ host
compiler that is available, and afterwards embeds the compiled GPU
functions as fatbinary images in the host object file. In the linking
stage, specific CUDA runtime libraries are added for supporting remote
SPMD procedure calling and for providing explicit GPU manipulation
such as allocation of GPU memory buffers and host-GPU data transfer.
What does "using the proprietary NVIDIA compilers and assembler" mean?
Also, what are PTX and cubin files, and in which step of the compilation are they produced?
I have searched a lot about this concept, but I would like a simple explanation.
The nvcc documentation explains the different compilation steps and their respective compilers: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory
Cubin files contain the "real" SASS assembly code for a concrete GPU architecture, whereas PTX files contain assembly code for a "virtual" GPU architecture.
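As a rough sketch (kernel.cu is a placeholder name; assuming nvcc and cuobjdump are on the PATH), you can ask nvcc to stop at each stage explicitly:
nvcc --ptx kernel.cu -o kernel.ptx                  # virtual ISA only
nvcc --cubin -arch=sm_50 kernel.cu -o kernel.cubin  # real ISA for one GPU generation
cuobjdump -sass kernel.cubin                        # disassemble the embedded SASS
The driver can JIT-compile PTX at runtime for GPU architectures that did not exist at build time, which is why fatbinaries typically embed both PTX and one or more cubins.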

How to count LLVM IR instructions during OpenCL kernel execution?

I am trying to write OpenCL host code for a custom kernel of mine, and I want to count the LLVM IR instructions that were executed. My problem is that the LLVM IR representation of the kernel is lost after I build it, and the only thing that remains is the native binary. Is there any way to:
count the native architecture instructions executed?
find a mapping between the native architecture instructions and the LLVM IR representation and, through this, manage to count the LLVM IR instructions that were executed?
I think that is not directly possible through the OpenCL API, but there are two ways you could count native/IR instructions using external tools:
IR:
You could use the SPIR-V tools (SPIR-V is a Khronos-defined IR, similar to LLVM IR, but portable); see this README on how to generate SPIR-V from an OpenCL source. Then count the IR instructions of the result.
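A sketch of that, assuming a clang build with SPIR-V support (or the llvm-spirv translator from that README) plus the spirv-dis disassembler from SPIRV-Tools; kernel.cl is a placeholder name:
clang --target=spirv64 -c kernel.cl -o kernel.spv
spirv-dis kernel.spv | grep -cv '^;'    # rough count: non-comment lines of the disassembly
Keep in mind this gives a static instruction count; counting executed instructions would additionally require instrumenting the kernel or running it in a simulator.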
Native:
You can use an offline OpenCL compiler from some SDK, e.g. the Intel OpenCL SDK provides a command line tool called ioc64 which can generate assembly code from OpenCL and even allows you to specify the target architecture.
You could try to disassemble the OpenCL-generated binary (via clGetProgramInfo() with CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES), e.g. with an appropriate command line tool after storing it to disk.
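A minimal sketch of that retrieval for a program built for a single device (error checking omitted):
#include <stdio.h>
#include <stdlib.h>
#include <OpenCL/opencl.h>  /* <CL/cl.h> on non-Apple platforms */

/* Write the native binary of an already-built, single-device program to disk. */
static void dump_program_binary(cl_program program)
{
    size_t size;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);

    unsigned char *binary = malloc(size);
    /* CL_PROGRAM_BINARIES expects an array of pointers, one per device */
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binary), &binary, NULL);

    FILE *f = fopen("kernel.bin", "wb");
    fwrite(binary, 1, size, f);
    fclose(f);
    free(binary);
}
The resulting kernel.bin can then be fed to the vendor's disassembler.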
Hope that helps.

Build output of SpiderMonkey under Windows

I built SpiderMonkey 60 under Windows (VS2017) according to the documentation, using
../configure --enable-nspr-build followed by mozmake.
In the output folder (dist\bin) I could see 5 DLLs created:
mozglue.dll, mozjs-60.dll, nspr4.dll, plc4.dll, plds4.dll
In order to run the SpiderMonkey Hello World sample, I linked my C++ program against mozjs-60.lib and had to copy the following DLLs next to my executable: mozglue.dll, mozjs-60.dll, nspr4.dll.
It seems that plc4.dll and plds4.dll are not needed for the program to run and execute scripts.
I could not find any documentation about the purpose of each of the DLLs. Do I need all 5 DLLs? What is the purpose of each one?
Quoting from the archived NSPR release notes for an old version:
The plc (Portable Library C) library is a separate library from the
core nspr. You do not need to use plc if you just want to use the core
nspr functions. The plc library currently contains thread-safe string
functions and functions for processing command-line options.
The plds (Portable Library Data Structures) library supports data
structures such as arenas and hash tables. It is important to note
that services of plds are not thread-safe. To use these services in a
multi-threaded environment, clients have to implement their own
thread-safe access, by acquiring locks/monitors, for example.
It sounds like they are unused unless specifically loaded by your application, so it should be safe not to distribute them if you don't need them.
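One way to verify this against your own build (assuming the dumpbin tool that ships with VS2017; YourApp.exe is a placeholder) is to inspect the static import tables:
dumpbin /dependents mozjs-60.dll
dumpbin /dependents YourApp.exe
If plc4.dll and plds4.dll show up in no import list, nothing in your deployment links against them at load time.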

Proper way of compiling OpenCL applications and using available compiler options

I am a newbie to OpenCL. What is the best way to compile an OpenCL project?
Using a supported compiler (GCC or Clang):
When we use a compiler like gcc or clang, how do we control these options? Do they have to be set inside the source code, or can we pass them on the command line as in the normal compilation flow? Looking at the Khronos manual for OpenCL 1.2, there are a few optimization options accepted by clBuildProgram:
gcc|clang -O3 -I<INCLUDES> OpenCL_app.c -framework OpenCL OPTION -lm
Actually, I tried this and received an error:
gcc: error: unrecognized command line option '<OPTION>'
Alternatively, using openclc:
I have seen people using openclc to compile via a Makefile.
I would like to know which is the best way (if there are actually two separate ways), and how we control the usage of the different compile-time options.
You might be aware of this, but it is important to reiterate: the OpenCL standard contains two things:
the OpenCL C language and programming model (I think recent standards include some C++)
the OpenCL host library to manage devices
gcc and clang are compilers for the host side of your OpenCL project, so there is no way to provide compiler options for OpenCL device code compilation through a host compiler; the host compiler is not even aware of OpenCL.
The exception is clang, which also accepts OpenCL device code directly: a .cl file containing the kernels. That way you can use clang and provide the flags and options, if I remember correctly, but the output would be LLVM IR or SPIR, not a device executable object. You can then load a SPIR object onto a device using the device's runtime environment (OpenCL drivers); see the sketch after the links below.
You can check out these links:
Using Clang to compile kernels
Llvm IR generation
SPIR
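As a sketch of that offline route (kernel.cl is a placeholder; exact flags vary by clang version, and newer clangs may need -Xclang -finclude-default-header to find the OpenCL built-ins):
clang -c -emit-llvm -target spir64-unknown-unknown -cl-std=CL1.2 -cl-finite-math-only kernel.cl -o kernel.bc
Note how device-side options such as -cl-finite-math-only are passed on the command line here, whereas with online compilation they go into the options string of clBuildProgram.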
The other alternative is to use the tools provided by your target platform. Each vendor that claims to support OpenCL should have a runtime environment, and usually a separate CLI tool to compile OpenCL device code. In your case (I guess) you have drivers from Apple, therefore you have openclc.
Intel CLI as an example
Now to your main question (the best way to compile OpenCL): it depends on what you want to do. You didn't specify what kind of requirements you have, so I have to speculate.
If you want offline compilation without a host program, the considerations above will help you. Otherwise, you have to use the OpenCL host library and compile your kernels online. This is generally preferred for products that need portability: if you compile all your kernels at the start of your program, you use the environment provided on the target machine, and you don't need to ship binaries for each target platform.
Therefore, if you have an OpenCL project, you have to decide how to compile. If you really want to use the generic flags and not rely on third-party tools, I suggest writing a class that builds your kernels and provides the flags you want.
...how do we control these options? Do they have to be set inside the source code, or can we pass them on the command line as in the normal compilation flow?
Options can be set inside the source code. For example:
/* Device-side compiler flags, passed as a string at build time */
const char options[] = "-cl-finite-math-only -cl-no-signed-zeros";

/* Build the program for one device with those options */
err = clBuildProgram(program, 1, &device, options, NULL, NULL);
I have never seen OpenCL options being specified on the command line, and I'm unaware of whether this is possible or not.
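If a build with given options fails, the compiler output can be retrieved through the standard API. A minimal sketch continuing from the clBuildProgram call above (error checking and includes omitted):
if (err != CL_SUCCESS) {
    size_t log_size;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = malloc(log_size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    printf("%s\n", log);   /* unrecognized options are reported here */
    free(log);
}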

CUDA Build Error with CUDA 5.5.targets [duplicate]

The CUDA FAQ says:
CUDA defines vector types such as float4, but doesn't include any
operators on them by default. However, you can define your own
operators using standard C++. The CUDA SDK includes a header
"cutil_math.h" that defines some common operations on the vector
types.
However, I cannot find it in CUDA SDK 5.0. Has it been removed or renamed?
I've found a version of the header here. How is it related to the one that's supposed to come with the SDK?
The cutil functionality was deleted from the CUDA 5.0 Samples (i.e., the "SDK"). You can still download a previous SDK and compile it under CUDA 5; you should then have everything that came with previous SDKs.
The official notice was given by NVIDIA in the CUDA 5.0 release notes (CUDA_Samples_Release_Notes.pdf, installed with the samples). As to why: I imagine the NVIDIA sentiment regarding cutil was something like what is expressed here, "not suitable for use in a real application. It is completely unsupported", but people were using it in real applications anyway. So one way to try to put a stop to that is to delete it, I suppose. That's just speculation.
Note some additional useful info provided in the release notes:
CUTIL has been removed with the CUDA Samples in CUDA 5.0, and replaced
with helper functions found in NVIDIA_CUDA-5.0/common/inc:
helper_cuda.h, helper_cuda_gl.h, helper_cuda_drvapi.h,
helper_functions.h, helper_image.h, helper_math.h, helper_string.h,
helper_timer.h
These helper functions handle CUDA device
initialization, CUDA error checking, string parsing, image file
loading and saving, and timing functions. The CUDA Samples projects no
longer have references and dependencies to CUTIL, and now use these
helper functions going forward.
So you may find useful functions in some of those header files.
In the latest SDK, helper_math.h implements most of the required operators; however, it is still missing logical operators such as OR and AND.
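Following the FAQ's advice to define your own operators in standard C++, the missing ones are easy to add yourself. A minimal sketch in the style of helper_math.h (the int4 overloads shown here are illustrative, not part of the SDK):
#include <cuda_runtime.h>   // int4, make_int4

// Component-wise logical AND/OR for int4
inline __host__ __device__ int4 operator&&(int4 a, int4 b)
{
    return make_int4(a.x && b.x, a.y && b.y, a.z && b.z, a.w && b.w);
}

inline __host__ __device__ int4 operator||(int4 a, int4 b)
{
    return make_int4(a.x || b.x, a.y || b.y, a.z || b.z, a.w || b.w);
}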
