How to optimize SYCL kernel - parallel-processing

I'm studying SYCL at university and I have a question about performance of a code.
In particular I have this C/C++ code:
And I need to translate it in a SYCL kernel with parallelization and I do this:
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;
constexpr int size = 131072; // 2^17
int main(int argc, char** argv) {
//Create a vector with size elements and initialize them to 1
std::vector<float> dA(size);
try {
queue gpuQueue{ gpu_selector{} };
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
gpuQueue.submit([&](handler& cgh) {
accessor inA{ bufA,cgh };
cgh.parallel_for(range<1>(size),
[=](id<1> i) { inA[i] = inA[i] + 2; }
);
});
gpuQueue.wait_and_throw();
}
catch (std::exception& e) { throw e; }
So my question is about c value, in this case I use directly the value two but this will impact on the performance when I'll run the code? I need to create a variable or in this way is correct and the performance are good?
Thanks in advance for the help!

Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel - this is as efficient as it gets, I think! There's the slight complication that you have an implicit cast from int to float. My guess is that you'll probably end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or cast at runtime or anything like that, it just lives in the instruction.
Equally, if you had:
constexpr int c = 2;
// the rest of your code
[=](id<1> i) { inA[i] = inA[i] + c; }
// etc
The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.
I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:
%5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
%add.i = fadd float %5, 2.000000e+00
store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
This shows a float load & store to/from the same address, with an 'add 2.0' instruction in between. If I modify to use the variable c like I demonstrated, I get the same LLVM IR.
Conclusion: you've already achieved maximum efficiency, and compilers are smart!

Related

What is causing this error: SSE register return with SSE disabled?

I'm new to kernel development, and I need to write a Linux kernel module that performs several matrix multiplications (I'm working on an x64_64 platform). I'm trying to use fixed-point values for these operations, however during compilation, the compiler encounters this error:
error: SSE register return with SSE disabled
I don't know that much about SSE or this issue in particular, but from what i've found and according to most answers to questions about this problem, it is related to the usage of Floating-Point (FP) arithmetic in kernel space, which seems to be rarely a good idea (hence the utilization of Fixed-Point arithmetics). This error seems weird to me because I'm pretty sure I'm not using any FP values or operations, however it keeps popping up and in some ways that seem weird to me. For instance, I have this block of code:
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
const int scale = 16;
#define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
#define FIXED_TO_DOUBLE(x) ((x) / (1 << scale))
#define MULT(x, y) ((((x) >> 8) * ((y) >> 8)) >> 0)
#define DIV(x, y) (((x) << 8) / (y) << 8)
#define OUTPUT_ROWS 6
#define OUTPUT_COLUMNS 2
struct matrix {
int rows;
int cols;
double *data;
};
double outputlayer_weights[OUTPUT_ROWS * OUTPUT_COLUMNS] =
{
0.7977986, -0.77172316,
-0.43078753, 0.67738613,
-1.04312621, 1.0552227 ,
-0.32619684, 0.14119884,
-0.72325027, 0.64673559,
0.58467862, -0.06229197
};
...
void matmul (struct matrix *A, struct matrix *B, struct matrix *C) {
int i, j, k, a, b, sum, fixed_prod;
if (A->cols != B->rows) {
return;
}
for (i = 0; i < A->rows; i++) {
for (j = 0; j < B->cols; j++) {
sum = 0;
for (k = 0; k < A->cols; k++) {
a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
b = DOUBLE_TO_FIXED(B->data[k * B->rows + j]);
fixed_prod = MULT(a, b);
sum += fixed_prod;
}
/* Commented the following line, causes error */
//C->data[i * C->rows + j] = sum;
}
}
}
...
static int __init insert_matmul_init (void)
{
printk(KERN_INFO "INSERTING MATMUL");
return 0;
}
static void __exit insert_matmul_exit (void)
{
printk(KERN_INFO "REMOVING MATMUL");
}
module_init (insert_matmul_init);
module_exit (insert_matmul_exit);
which compiles with no errors (I left out code that I found irrelevant to the problem). I have made sure to comment any error-prone lines to get to a point where the program can be compiled with no errors, and I am trying to solve each of them one by one. However, when uncommenting this line:
C->data[i * C->rows + j] = sum;
I get this error message in a previous (unmodified) line of code:
error: SSE register return with SSE disabled
sum += fixed_prod;
~~~~^~~~~~~~~~~~~
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error. Maybe my fixed-point implementation is flawed (I'm no expert in that matter either), or maybe I'm missing something obvious. Just in case, I have tested the same logic in a user-space program (using Floating-Point values) and it seems to work fine. In either case, any help in solving this issue would be appreciated. Thanks in advance!
Edit: I have included the definition of matrix and an example matrix. I have been using the default kbuild command for building external modules, here is what my Makefile looks like:
obj-m = matrix_mult.o
KVERSION = $(shell uname -r)
all:
make -C /lib/modules/$(KVERSION)/build M=$(PWD) modules
Linux compiles kernel code with -mgeneral-regs-only on x86, which produces this error in functions that do anything with FP or SIMD. (Except via inline asm, because then the compiler doesn't see the FP instructions, only the assembler does.)
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error.
GCC optimizes whole functions when optimization is enabled, and you are using FP inside that function. You're doing FP multiply and truncating conversion to integer with your macro and assigning the result to an int, since the MCVE you eventually provided shows struct matrix containing double *data.
If you stop the compiler from using FP instructions (like Linux does by building with -mgeneral-regs-only), it refuses to compile your file instead of doing software floating-point.
The only odd thing is that it pins down the error to an integer += instead of one of the statements that compiles to a mulsd and cvttsd2si
If you disable optimization (-O0 -mgeneral-regs-only) you get a more obvious location for the same error (https://godbolt.org/z/Tv5nG6nd4):
<source>: In function 'void matmul(matrix*, matrix*, matrix*)':
<source>:9:33: error: SSE register return with SSE disabled
9 | #define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
| ~~~~~^~~~~~~~~~~~~~~
<source>:46:21: note: in expansion of macro 'DOUBLE_TO_FIXED'
46 | a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
| ^~~~~~~~~~~~~~~
If you really want to know what's going on with the GCC internals, you could dig into it with -fdump-tree-... options, e.g. on the Godbolt compiler explorer there's a dropdown for GCC Tree / RTL output that would let you look at the GIMPLE or RTL internal representation of your function's logic after various analyzer passes.
But if you just want to know whether there's a way to make this function work, no obviously not, unless you compile a file without -mgeneral-registers-only. All functions in a file compiled that way must only be called by callers that have used kernel_fpu_begin() before the call. (and kernel_fpu_end after).
You can't safely use kernel_fpu_begin inside a function compiled to allow it to use SSE / x87 registers; it might already have corrupted user-space FPU state before calling the function, after optimization. The symptom of getting this wrong is not a fault, it's corrupting user-space state, so don't assume that happens to work = correct. Also, depending on how GCC optimizes, the code-gen might be fine with your version, but might be broken with earlier or later GCC or clang versions. I somewhat expect that kernel_fpu_begin() at the top of this function would get called before the compiler did anything with FP instructions, but that doesn't mean it would be safe and correct.
See also Generate and optimize FP / SIMD code in the Linux Kernel on files which contains kernel_fpu_begin()?
Apparently -msse2 overrides -mgeneral-regs-only, so that's probably just an alias for -mno-mmx -mno-sse and whatever options disables x87. So you might be able to use __attribute__((target("sse2"))) on a function without changing build options for it, but that would be x86-specific. Of course, so is -mgeneral-regs-only. And there isn't a -mno-general-regs-only option to override the kernel's normal CFLAGS.
I don't have a specific suggestion for the best way to set up a build option if you really do think it's worth using kernel_fpu_begin at all, here (rather than using fixed-point the whole way through).
Obviously if you do save/restore the FPU state, you might as well use it for the loop instead of using FP to convert to fixed-point and back.

complex and double complex methods not working in visual studio c project

The below code is throwing error in Visual Studio C project. That same code is working in Linux with GCC compiler. Please let me know any solution to execute properly in Windows.
#include<stdio.h>
#include <complex.h>
#include <math.h>
typedef struct {
float r, i;
} complex_;
double c_abs(complex_* z)
{
return (cabs(z->r + I * z->i));
}
int main()
{
complex_ number1 = { 3.0, 4.0 };
double d = c_abs(&number1);
printf("The absolute value of %f + %fi is %f\n", number1.r, number1.i, d);
return 0;
}
The error I am getting was
C2088: '*': illegal for struct
Here what t I observed the I macro not working properly...
So is there any other way we can handle in Windows?
error C2088: '*': illegal for struct
This is the error MSVC returns when compiling the code (as C) for this line.
return (cabs(z->r + I * z->i));
The MSVC C compiler does not have a native complex type, and (quoting the docs) "therefore the Microsoft implementation uses structure types to represent complex numbers".
The imaginary unit I a.k.a. _Complex_I is defined as an _Fcomplex structure in <complex.h>, which explains compile error C2088, since there is no operator * to multiply that structure with a float value.
Even if the multiplication worked out by some magic, the end result would be a float value being passed into the cabs call, but MSVC declares cabs as double cabs(_Dcomplex z); and there is no automatic conversion from float to _Dcomplex so the call would still fail to compile.
What would work with MSVC, however, is replace that line with the following, which constructs a _Dcomplex on the fly from the float real and imaginary parts.
return cabs(_Dcomplex{z->r, z->i});

Directly assigning to a std::vector after reserving does not throw error but does not increase vector size

Let's create a helper class to assist visualizing the issue:
class C
{
int ID = 0;
public:
C(const int newID)
{
ID = newID;
}
int getID()
{
return ID;
}
};
Suppose you create an empty std::vector<C> and then reserve it to hold 10 elements:
std::vector<C> pack;
pack.reserve(10);
printf("pack has %i\n", pack.size()); //will print '0'
Now, you assign a new instance of C into index 4 of the vector:
pack[4] = C(57);
printf("%i\n", pack[4].getID()); //will print '57'
printf("pack has %i\n", pack.size()); //will still print '0'
I found two things to be weird here:
1) shouldn't the assignment make the compiler (Visual Studio 2015, Release Mode) throw an error even in Release mode?
2) since it does not and the element is in fact stored in position 4, shouldn't the vector then have size = 1 instead of zero?
Undefined behavior is still undefined. If we make this a vector of objects, you would see the unexpected behavior more clearly.
#include <iostream>
#include <vector>
struct Foo {
int data_ = 3;
};
int main() {
std::vector<Foo> foos;
foos.reserve(10);
std::cout << foos[4].data_; // This probably doesn't output 3.
}
Here, we can see that because we haven't actually allocated the object yet, the constructor hasn't run.
Another example, since you're using space that the vector hasn't actually started allocating to you, if the vector needed to reallocate it's backing memory, the value that you wrote wouldn't be copied.
#include <iostream>
#include <vector>
int main() {
std::vector<int> foos;
foos.reserve(10);
foos[4] = 100;
foos.reserve(10000000);
std::cout << foos[4]; // Probably doesn't print 100.
}
Short answers:
1) There is no reason to throw an exception since operator[] is not supposed to verify the position you have passed. It might do so in Debug mode, but for sure not in Release (otherwise performance would suffer). In Release mode compiler trusts you that code you provide is error-proof and does everything to make your code fast.
Returns a reference to the element at specified location pos. No
bounds checking is performed.
http://en.cppreference.com/w/cpp/container/vector/operator_at
2) You simply accessed memory you don't own yet (reserve is not resize), anything you do on it is undefined behavior. But, you have never added an element into vector and it has no idea you even modified its buffer. And as #Bill have shown, the vector is allowed to change its buffer without copying your local change.
EDIT:
Also, you can get exception due to boundary checking if you use vector::at function.
That is: pack.at(4) = C(57); throws exception
Example:
https://ideone.com/sXnPzT

How to use arrays in program (global) scope in OpenCL

AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized,
globally scoped, and in the constant address space (as specified in
section 6.5.3 of the OpenCL specification). If the size of an array is
below 64 kB, it is placed in hardware constant buffers; otherwise, it
uses global memory. An example of this is a lookup table for math
functions.
I want to use this "globally scoped constant array". I have such code in pure C
#define SIZE 101
int *reciprocal_table;
int reciprocal(int number){
return reciprocal_table[number];
}
void kernel(int *output)
{
for(int i=0; i < SIZE; i+)
output[i] = reciprocal(i);
}
I want to port it into OpenCL
__kernel void kernel(__global int *output){
int gid = get_global_id(0);
output[gid] = reciprocal(gid);
}
int reciprocal(int number){
return reciprocal_table[number];
}
What should I do with global variable reciprocal_table? If I try to add __global or __constant to it I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from kernel to reciprocal. Is it possible to initialize global variable somehow? I know that I can write it down into code, but does other way exist?
P.S. I'm using AMD OpenCL
UPD Above code is just an example. I have real much more complex code with a lot of functions. So I want to make array in program scope to use it in all functions.
UPD2 Changed example code and added citation from Programming Guide
#define SIZE 2
int constant array[SIZE] = {0, 1};
kernel void
foo (global int* input,
global int* output)
{
const uint id = get_global_id (0);
output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also works without the initialization of the array but then you would not know what's in the array and since it's in the constant address space, you could not assign any values.
Program global variables have to be in the __constant address space, as stated by section 6.5.3 in the standard.
UPDATE Now, that I fully understood the question:
One thing that worked for me is to define the array in the constant space and then overwrite it by passing a kernel parameter constant int* array which overwrites the array.
That produced correct results only on the GPU Device. The AMD CPU Device and the Intel CPU Device did not overwrite the arrays address. It also is probably not compliant to the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};
int
baz (int i)
{
return foo[i];
}
kernel void
bar (global int* input,
global int* output,
constant int* foo)
{
const uint id = get_global_id (0);
output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1} this produces {2, 4} on my HD 7850 Device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OCL Implementation (AMD, Intel). So I can not stress, how much I personally would NOT do this, because it's only a matter of time, before this breaks.
Another way to achieve this is would be to compute .h files with the host during runtime with the definition of the array (or predefine them) and pass them to the kernel upon compilation via a compiler option. This, of course, requires recompilation of the clProgram/clKernel for every different LUT.
I struggled to get this work in my own program some time ago.
I did not find any way to initialize a constant or global scope array from the host via some clEnqueueWriteBuffer or so. The only way is to write it explicitely in your .cl source file.
So here my trick to initialize it from the host is to use the fact that you are actually compiling your source from the host, which also means you can alter your src.cl file before compiling it.
First my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory).
double func(int idx) {
return(lookup[idx])
}
__kernel void ker1(__global double *in, __global double *out)
{
... do something ...
double t = func(i)
...
}
notice the lookup table is initialized with LOOKUP.
Then, in the host program, before compiling your OpenCL code:
compute the values of my lookup table in host_values[]
on your host, run something like:
char *buf = (char*) malloc( 10000 );
int count = sprintf(buf, "#define LOOKUP "); // actual source generation !
for (int i=0;i<SIZE;i++) count += sprintf(buf+count, "%g, ",host_values[i]);
count += sprintf(buf+count,"\n");
then read the content of your source file src.cl and place it right at buf+count.
you now have a source file with an explicitely defined lookup table that you just computed from the host.
compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, err);
voilĂ  !
It looks like "array" is a look-up table of sorts. You'll need to clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.

Race conditions with OpenMP

I need to fill 2D array (tmp[Ny][Nx]) while each cell of the array gets an integral (of some function) as a function of free parameters. Since I deal with a very large arrays (here I simplified my case), I need to use OpenMP parallelism in order to speed my calculations up. Here I use simple #pragma omp parallel for directive.
Without using #pragma omp parallel for, the code executes perfectly. But adding the parallel directive, produces race conditions in the output.
I tried to cure it by making private(i,j,par), it did not help.
P.S. I use VS2008 Professional with OpenMP 2.0 and under WIndows 7 OS
Here is my code: (a short sample)
testfunc(const double* var, const double* par)
{
// here is some simple function to be integrated over
// var[0] and var[1] and two free parameters par[0] and par[1]
return ....
}
#define Nx 10000
#define Ny 10000
static double tmp[Ny][Nx];
int main()
{
double par[2]; // parameters
double xmin[]={0,0} // limits of 2D integration
double xmax[]={1,1};// limits of 2D integration
double val,Tol=1e-7,AbsTol=1e-7;
int i,j,NDim=2,NEval=1e5;
#pragma omp parallel for private(i,j,par,val)
for (i=0;i<Nx;i++)
{
for (j=0;j<Ny;j++)
{
par[0]=i;
par[1]=j*j;
adapt_integrate(testfunc,par, NDim, xmin, xmax,
NEval, Tol, AbsTol, &val, &err);
// adapt_integrate - receives my integrand, performs
// integration and returns a result through "val"
tmp[i][j] = val;
}
}
}
It produces race conditions at the output. I tried to avoid it by making all internal variables (i,j,par and val) private, but it doesn't help.
P.S. Serial version (#threads=1) of this code runs properly.
(Answered in the question. Converted to a community wiki answer. See Question with no answers, but issue solved in the comments (or extended in chat) )
The OP wrote:
The problem Solved!
I defined parameters of integration as global and used #pragma omp threadprivate(parGlob) directive for them. Now it works like a charm. I've been thinking that private() and threadprivate() have the same meaning, just different ways of implementations, but they do not.
So, playing with these directives may give a correct answer. Another thing is that defining iterator i inside the first for loop gives additional 20%-30% speed up in performance. So, the fastest version of the code looks now as:
testfunc(const double* var, const double* par)
{
.......
}
#define Nx 10000
#define Ny 10000
static double tmp[Ny][Nx];
double parGlob[2]; //<- Here are they!!!
#pragma omp threadprivate(parGlob) // <-Magic directive!!!!
int main()
{
// Not here !!!! -> double par[2]; // parameters
double xmin[]={0,0} // limits of 2D integration
double xmax[]={1,1};// limits of 2D integration
double val,Tol=1e-7,AbsTol=1e-7;
int j,NDim=2,NEval=1e5;
#pragma omp parallel for private(j,val) // no `i` inside `private` clause
for (int i=0;i<Nx;i++)
{
for (j=0;j<Ny;j++)
{
parGlob[0]=i;
parGlob[1]=j*j;
adapt_integrate(testfunc,par, NDim, xmin, xmax,
NEval, Tol, AbsTol, &val, &err);
tmp[i][j] = val;
}
}
}

Resources