Halide: OpenCL code generation - halide

Is it possible in Halide to produce a file which contains generated OpenCL code? I have tried to produce a c file from a Halide program which target would be opencl, but I don't see any opencl specfic code there.
Edit 1:
I would like to see especially how kernels are created in Halide. Something like this:
static char
kernelSourceCode[] =
kernel void test_kernel(int a, int b, __global int *out)
{
out[0] = a + b;
}
Edit 2:
Ok, I put HL_DEBUG_CODEGEN=1 to env variable and set in the code set_target(Target::Debug). I got bunch of code on the screen, which some of were OpenCL code but I still can't see any kernel spesific code.
There are two lines on the screen which indicates about kernels. Should there be something?
OpenCL kernel:
/*OpenCL C*/
Then there is also a line:
kernel void _at_least_one_kernel(int x) { }
In example if I have a function like this:
gradient(x, y) = x + y;
Is the function inside a kernel if I want to target to OpenCL?

Here is what I managed to spot from documentation
CUDA or OpenCL are not enabled by default. You have to construct a Target object, enable one of them, and then pass that target object to compile_jit.
Target target = get_host_target();
target.set_feature(Target::OpenCL);
curved.compile_jit(target);
Or similarely you can use compile_to method, by providing the correct target.
EXPORT void Halide::Func::compile_to(const Outputs & output_files,
std::vector<Argument> args,
const std::string& fn_name,
const Target& target = get_target_from_environment()
)

Related

What is causing this error: SSE register return with SSE disabled?

I'm new to kernel development, and I need to write a Linux kernel module that performs several matrix multiplications (I'm working on an x64_64 platform). I'm trying to use fixed-point values for these operations, however during compilation, the compiler encounters this error:
error: SSE register return with SSE disabled
I don't know that much about SSE or this issue in particular, but from what i've found and according to most answers to questions about this problem, it is related to the usage of Floating-Point (FP) arithmetic in kernel space, which seems to be rarely a good idea (hence the utilization of Fixed-Point arithmetics). This error seems weird to me because I'm pretty sure I'm not using any FP values or operations, however it keeps popping up and in some ways that seem weird to me. For instance, I have this block of code:
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
const int scale = 16;
#define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
#define FIXED_TO_DOUBLE(x) ((x) / (1 << scale))
#define MULT(x, y) ((((x) >> 8) * ((y) >> 8)) >> 0)
#define DIV(x, y) (((x) << 8) / (y) << 8)
#define OUTPUT_ROWS 6
#define OUTPUT_COLUMNS 2
struct matrix {
int rows;
int cols;
double *data;
};
double outputlayer_weights[OUTPUT_ROWS * OUTPUT_COLUMNS] =
{
0.7977986, -0.77172316,
-0.43078753, 0.67738613,
-1.04312621, 1.0552227 ,
-0.32619684, 0.14119884,
-0.72325027, 0.64673559,
0.58467862, -0.06229197
};
...
void matmul (struct matrix *A, struct matrix *B, struct matrix *C) {
int i, j, k, a, b, sum, fixed_prod;
if (A->cols != B->rows) {
return;
}
for (i = 0; i < A->rows; i++) {
for (j = 0; j < B->cols; j++) {
sum = 0;
for (k = 0; k < A->cols; k++) {
a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
b = DOUBLE_TO_FIXED(B->data[k * B->rows + j]);
fixed_prod = MULT(a, b);
sum += fixed_prod;
}
/* Commented the following line, causes error */
//C->data[i * C->rows + j] = sum;
}
}
}
...
static int __init insert_matmul_init (void)
{
printk(KERN_INFO "INSERTING MATMUL");
return 0;
}
static void __exit insert_matmul_exit (void)
{
printk(KERN_INFO "REMOVING MATMUL");
}
module_init (insert_matmul_init);
module_exit (insert_matmul_exit);
which compiles with no errors (I left out code that I found irrelevant to the problem). I have made sure to comment any error-prone lines to get to a point where the program can be compiled with no errors, and I am trying to solve each of them one by one. However, when uncommenting this line:
C->data[i * C->rows + j] = sum;
I get this error message in a previous (unmodified) line of code:
error: SSE register return with SSE disabled
sum += fixed_prod;
~~~~^~~~~~~~~~~~~
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error. Maybe my fixed-point implementation is flawed (I'm no expert in that matter either), or maybe I'm missing something obvious. Just in case, I have tested the same logic in a user-space program (using Floating-Point values) and it seems to work fine. In either case, any help in solving this issue would be appreciated. Thanks in advance!
Edit: I have included the definition of matrix and an example matrix. I have been using the default kbuild command for building external modules, here is what my Makefile looks like:
obj-m = matrix_mult.o
KVERSION = $(shell uname -r)
all:
make -C /lib/modules/$(KVERSION)/build M=$(PWD) modules
Linux compiles kernel code with -mgeneral-regs-only on x86, which produces this error in functions that do anything with FP or SIMD. (Except via inline asm, because then the compiler doesn't see the FP instructions, only the assembler does.)
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error.
GCC optimizes whole functions when optimization is enabled, and you are using FP inside that function. You're doing FP multiply and truncating conversion to integer with your macro and assigning the result to an int, since the MCVE you eventually provided shows struct matrix containing double *data.
If you stop the compiler from using FP instructions (like Linux does by building with -mgeneral-regs-only), it refuses to compile your file instead of doing software floating-point.
The only odd thing is that it pins down the error to an integer += instead of one of the statements that compiles to a mulsd and cvttsd2si
If you disable optimization (-O0 -mgeneral-regs-only) you get a more obvious location for the same error (https://godbolt.org/z/Tv5nG6nd4):
<source>: In function 'void matmul(matrix*, matrix*, matrix*)':
<source>:9:33: error: SSE register return with SSE disabled
9 | #define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
| ~~~~~^~~~~~~~~~~~~~~
<source>:46:21: note: in expansion of macro 'DOUBLE_TO_FIXED'
46 | a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
| ^~~~~~~~~~~~~~~
If you really want to know what's going on with the GCC internals, you could dig into it with -fdump-tree-... options, e.g. on the Godbolt compiler explorer there's a dropdown for GCC Tree / RTL output that would let you look at the GIMPLE or RTL internal representation of your function's logic after various analyzer passes.
But if you just want to know whether there's a way to make this function work, no obviously not, unless you compile a file without -mgeneral-registers-only. All functions in a file compiled that way must only be called by callers that have used kernel_fpu_begin() before the call. (and kernel_fpu_end after).
You can't safely use kernel_fpu_begin inside a function compiled to allow it to use SSE / x87 registers; it might already have corrupted user-space FPU state before calling the function, after optimization. The symptom of getting this wrong is not a fault, it's corrupting user-space state, so don't assume that happens to work = correct. Also, depending on how GCC optimizes, the code-gen might be fine with your version, but might be broken with earlier or later GCC or clang versions. I somewhat expect that kernel_fpu_begin() at the top of this function would get called before the compiler did anything with FP instructions, but that doesn't mean it would be safe and correct.
See also Generate and optimize FP / SIMD code in the Linux Kernel on files which contains kernel_fpu_begin()?
Apparently -msse2 overrides -mgeneral-regs-only, so that's probably just an alias for -mno-mmx -mno-sse and whatever options disables x87. So you might be able to use __attribute__((target("sse2"))) on a function without changing build options for it, but that would be x86-specific. Of course, so is -mgeneral-regs-only. And there isn't a -mno-general-regs-only option to override the kernel's normal CFLAGS.
I don't have a specific suggestion for the best way to set up a build option if you really do think it's worth using kernel_fpu_begin at all, here (rather than using fixed-point the whole way through).
Obviously if you do save/restore the FPU state, you might as well use it for the loop instead of using FP to convert to fixed-point and back.

MacOs OpenCL on HD 530 with clBuildProgram error(-11)

I have the latest mac pro(OS:10.12.2) ,with intel intergrated GPU HD 530(Gen9) which runs the OpenCL code. In my OpenCL code, I use vloadx and atomic_add instruction. change my OpenCL kernel code into bitcode like https://developer.apple.com/library/content/samplecode/OpenCLOfflineCompilation/Introduction/Intro.html#//apple_ref/doc/uid/DTS40011196-Intro-DontLinkElementID_2
. and create the program with clCreateProgramWithBinary. But when clBuildProgram, it returns error with -11 .and build log is "
error: undefined reference to _Z6vload2mPKU3AS1h()'
undefined reference to_Z8atom_addPVU3AS3ii()'
"
But in my mac air with HD 5500(Gen8), the code is ok.
Can someone tell me what should I do?
The problem here is, you cannot use incompatible binaries in different devices. Which means if you compile for Intel, you cannot use the compiled binary for AMD for example. What you need to do is compile the code for the specific device every time from the source.
If you do not want to use the OpenCL codes in different files, what you can do is put them inside your source file by stringifying them. Instead of reading a file, you use the kernel string inside your host code to pass as the kernel string. This will allow you to protect your IP. However, everytime, you need to build the code using clBuildProgram. You can also save the built program as binary, so after the first run, you won't degrade performance by building it everytime. To give an example, lets suppose you have a kernel.cl file as following:
__kernel void foo(__global int* in, __global int* out)
{
int idx = get_global_id(0);
out[idx] = in[idx] * in[idx];
}
You probably get this kernel code by reading the file with something like:
char *source_str;
fp = fopen("kernel.cl", "r");
source_str = (char *)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);
program = clCreateProgramWithSource(context, 1, (const char **)&source_str, (const size_t *)&source_size, &ret);
What you can do instead is something like:
const char* src = "__kernel void foo(__global int* in, __global int* out)\
{\
int idx = get_global_id(0);\
out[idx] = in[idx] * in[idx];\
}";
program = clCreateProgramWithSource(context, 1, (const char **)&src, (const size_t *)&src_size, &ret);
When you compile your C code, this string will be converted into binary, so you protect your source code.

How to use arrays in program (global) scope in OpenCL

AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized,
globally scoped, and in the constant address space (as specified in
section 6.5.3 of the OpenCL specification). If the size of an array is
below 64 kB, it is placed in hardware constant buffers; otherwise, it
uses global memory. An example of this is a lookup table for math
functions.
I want to use this "globally scoped constant array". I have such code in pure C
#define SIZE 101
int *reciprocal_table;
int reciprocal(int number){
return reciprocal_table[number];
}
void kernel(int *output)
{
for(int i=0; i < SIZE; i+)
output[i] = reciprocal(i);
}
I want to port it into OpenCL
__kernel void kernel(__global int *output){
int gid = get_global_id(0);
output[gid] = reciprocal(gid);
}
int reciprocal(int number){
return reciprocal_table[number];
}
What should I do with global variable reciprocal_table? If I try to add __global or __constant to it I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from kernel to reciprocal. Is it possible to initialize global variable somehow? I know that I can write it down into code, but does other way exist?
P.S. I'm using AMD OpenCL
UPD Above code is just an example. I have real much more complex code with a lot of functions. So I want to make array in program scope to use it in all functions.
UPD2 Changed example code and added citation from Programming Guide
#define SIZE 2
int constant array[SIZE] = {0, 1};
kernel void
foo (global int* input,
global int* output)
{
const uint id = get_global_id (0);
output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also works without the initialization of the array but then you would not know what's in the array and since it's in the constant address space, you could not assign any values.
Program global variables have to be in the __constant address space, as stated by section 6.5.3 in the standard.
UPDATE Now, that I fully understood the question:
One thing that worked for me is to define the array in the constant space and then overwrite it by passing a kernel parameter constant int* array which overwrites the array.
That produced correct results only on the GPU Device. The AMD CPU Device and the Intel CPU Device did not overwrite the arrays address. It also is probably not compliant to the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};
int
baz (int i)
{
return foo[i];
}
kernel void
bar (global int* input,
global int* output,
constant int* foo)
{
const uint id = get_global_id (0);
output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1} this produces {2, 4} on my HD 7850 Device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OCL Implementation (AMD, Intel). So I can not stress, how much I personally would NOT do this, because it's only a matter of time, before this breaks.
Another way to achieve this is would be to compute .h files with the host during runtime with the definition of the array (or predefine them) and pass them to the kernel upon compilation via a compiler option. This, of course, requires recompilation of the clProgram/clKernel for every different LUT.
I struggled to get this work in my own program some time ago.
I did not find any way to initialize a constant or global scope array from the host via some clEnqueueWriteBuffer or so. The only way is to write it explicitely in your .cl source file.
So here my trick to initialize it from the host is to use the fact that you are actually compiling your source from the host, which also means you can alter your src.cl file before compiling it.
First my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory).
double func(int idx) {
return(lookup[idx])
}
__kernel void ker1(__global double *in, __global double *out)
{
... do something ...
double t = func(i)
...
}
notice the lookup table is initialized with LOOKUP.
Then, in the host program, before compiling your OpenCL code:
compute the values of my lookup table in host_values[]
on your host, run something like:
char *buf = (char*) malloc( 10000 );
int count = sprintf(buf, "#define LOOKUP "); // actual source generation !
for (int i=0;i<SIZE;i++) count += sprintf(buf+count, "%g, ",host_values[i]);
count += sprintf(buf+count,"\n");
then read the content of your source file src.cl and place it right at buf+count.
you now have a source file with an explicitely defined lookup table that you just computed from the host.
compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, err);
voilĂ  !
It looks like "array" is a look-up table of sorts. You'll need to clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.

Changing function reference in Mac OS Process at runtime

I need to change reference of a function in a Mac OS process at runtime to a custom function defined in my own custom dylib. I kept the new function signature same as the original.
For example I need to change "open" function to "myopen" function.
I tried processing __LINKEDIT segment to get the dynamic symbol table and string table.
I used following pointers,
1. the VMAddrress from __LINKEDIT segment,
2. mach_header and vmaddr_slide from the "_dyld_register_func_for_add_image" callback,
3. symoff and stroff from symtab_command.
But I am unable to get the symbol table and string table mentioned in the __LINKEDIT segment.
Can someone throw some light on this?
Thanks in advance.
If the function in question is a library function, and not statically compiled into the executable, you don't need to do any of that - you can use function interposing, instead. Specifically, add this to your library:
// The attribute creates a Mach-O Section in your library - q.v. libgmalloc.dylib for
// a nice example
static const interpose_t interposing_functions[] \
__attribute__ ((section("__DATA, __interpose"))) = {
{ (void *)my_open, (void *)open },
{ (void *)my_close, (void *)close }, // .. etc
};
int my_open(const char *path, int flags, mode_t mode)
{
int rc;
// Prolog - do something before open
rc = open(path, flags, mode); // call real open
// Epilog - record rc, etc..
return rc;
}
There are several excellent books on OS X internals which can provide you with samples, though apparently according to S.O site policies we can't link you to them. That said, the above code snippet should work. Bear in mind, that this won't work on calls to open performed by other dylibs (though there are more complicated ways to get that, as well)

Overloading conflict with vector types __m128, __m256 in GCC

I've started playing around with AVX instructions on the new Intel's Sandy Bridge processor. I'm using GCC 4.5.2, TDM-GCC 64bit build of MinGW64.
I want to overload operator<< for ostream to be able to print out the vector types __m256, __m128 etc to the console. But I'm running into an overloading conflict. The 2nd function in the following code produces an error "conflicts with previous declaration void f(__vector(8) float)":
void f(__m128 v) {
cout << 4;
}
void f(__m256 v) {
cout << 8;
}
It seems that the compiler cannot distinguish between the two types and consideres them both f(float __vector).
Is there a way around this? I haven't been able to find anything online. Any help is greatly appreciated.
I accidentally stumbled upon the answer when having a similar problem with function templates. In this case, the GCC error message actually suggested a solution:
add -fabi-version=4 compiler option.
This solves my problem, and hopefully doesn't cause any issues when linking the standard libraries.
One can read more about ABI (Application Binary Interface) and GCC at ABI Policy and Guidelines and ABI specification. ABI specifies how the functions names are mangled when the code is compiled into object files. Apparently, ABI version 3 used by GCC by default cannot distinguish between the various vector types.
I was unsatisfied with the solution of changing compiler ABI flags to solve this, so I went looking for a different solution. It seems they encountered this issue in writing the Eigen library - see this source file for details http://eigen.tuxfamily.org/dox-devel/SSE_2PacketMath_8h_source.html
My solution to this is a slightly tweaked version of theirs:
template <typename T, unsigned RegisterSize>
struct Register
{
using ValueType = T;
enum { Size = RegisterSize };
inline operator T&() { return myValue; }
inline operator const T&() const { return myValue; }
inline Register() {}
inline Register(const T & v) : myValue(v) {} // Not explicit
inline Register & operator=(const T & v)
{
myValue = v;
return *this;
}
T myValue;
};
using Register4 = Register<__m128, 4u>;
using Register8 = Register<__m256, 8u>;
// Could provide more declarations for __m128d, __m128i, etc. if needed
Using the above, you can overload on Register4, Register8, etc. or produce template functions taking Registers without running into linking issues and without changing ABI settings.

Resources