OpenACC: error "unsupported statement type: opcode=JSRA" - openacc

I'm trying to parallelize this loop but I'm having a problem with the function fRSolver (&sw, ts->cmax);
#pragma acc parallel loop collapse(2) present(d, sw, ts)
for (k = KBEG; k <= KEND; k++){
for (j = JBEG; j <= JEND; j++){
.
.
.
#pragma acc routine(fRSolver) vector
d->fRSolver (&sw, ts->cmax);
}}
I get this error:
139, Accelerator region ignored
141, Accelerator restriction: loop contains unsupported statement type
175, Accelerator restriction: unsupported statement type: opcode=JSRA
d is a type D variable:
typedef struct D_{
.
.
.
void (*fRSolver) (const Sw *, double *);
}D;
fRSolver is a pointer to a function void HSolver (const Sw *sw, double *cmax)
Is there a way I can accelerate this loop without changing the way the function HSolver is called?

Is there a way I can accelerate this loop without changing the way the
function HSolver is called?
No, sorry. Function pointers and indirect function calls are not supported within device code. This requires late binding (i.e. the function resolution is done at runtime) and we currently don't have a way on the device to resolve the functions address. Same issue occurs with C++ virtual functions and Fortran type bound procedures.
It's definitely on our list of things we'd like to support, and hopefully will at some point, but thus far has proven to be a major challenge. You need to modify the code to have fRSolver be a direct call, resolved at link time.

Related

can complex numbers be used in Apple metal code?

I'm trying to make a stand alone binary which uses metal for acceleration. (found a solution in another answer)
I'm trying to compile the metal code and I get an error;
metal_mandel.metal:24:16: error: subscript of pointer to incomplete type 'device complex' (aka 'device __Reserved_Name__Do_not_use_complex')
using namespace metal;
///#include <complex.h>
kernel void point_in_mandel(device complex* inC,
device complex* inZ,
device int* count,
uint index [[thread_position_in_grid]])
{
// the for-loop is replaced with a collection of threads, each of which
// calls this function.
count [index] = 0;
while(cabs(inZ[index]) < 2.0 && count[index] < 256)
{
inZ[index] = (inZ[index] * inZ[index]) + inC;
count[index] ++;
}
}
I've tried both adding and removing the include complex but my head is too sore from looking at this code all day, to work this out further.
Do I just have to implement complex numbers manually?
Thanks
Mark

What is causing this error: SSE register return with SSE disabled?

I'm new to kernel development, and I need to write a Linux kernel module that performs several matrix multiplications (I'm working on an x64_64 platform). I'm trying to use fixed-point values for these operations, however during compilation, the compiler encounters this error:
error: SSE register return with SSE disabled
I don't know that much about SSE or this issue in particular, but from what i've found and according to most answers to questions about this problem, it is related to the usage of Floating-Point (FP) arithmetic in kernel space, which seems to be rarely a good idea (hence the utilization of Fixed-Point arithmetics). This error seems weird to me because I'm pretty sure I'm not using any FP values or operations, however it keeps popping up and in some ways that seem weird to me. For instance, I have this block of code:
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
const int scale = 16;
#define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
#define FIXED_TO_DOUBLE(x) ((x) / (1 << scale))
#define MULT(x, y) ((((x) >> 8) * ((y) >> 8)) >> 0)
#define DIV(x, y) (((x) << 8) / (y) << 8)
#define OUTPUT_ROWS 6
#define OUTPUT_COLUMNS 2
struct matrix {
int rows;
int cols;
double *data;
};
double outputlayer_weights[OUTPUT_ROWS * OUTPUT_COLUMNS] =
{
0.7977986, -0.77172316,
-0.43078753, 0.67738613,
-1.04312621, 1.0552227 ,
-0.32619684, 0.14119884,
-0.72325027, 0.64673559,
0.58467862, -0.06229197
};
...
void matmul (struct matrix *A, struct matrix *B, struct matrix *C) {
int i, j, k, a, b, sum, fixed_prod;
if (A->cols != B->rows) {
return;
}
for (i = 0; i < A->rows; i++) {
for (j = 0; j < B->cols; j++) {
sum = 0;
for (k = 0; k < A->cols; k++) {
a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
b = DOUBLE_TO_FIXED(B->data[k * B->rows + j]);
fixed_prod = MULT(a, b);
sum += fixed_prod;
}
/* Commented the following line, causes error */
//C->data[i * C->rows + j] = sum;
}
}
}
...
static int __init insert_matmul_init (void)
{
printk(KERN_INFO "INSERTING MATMUL");
return 0;
}
static void __exit insert_matmul_exit (void)
{
printk(KERN_INFO "REMOVING MATMUL");
}
module_init (insert_matmul_init);
module_exit (insert_matmul_exit);
which compiles with no errors (I left out code that I found irrelevant to the problem). I have made sure to comment any error-prone lines to get to a point where the program can be compiled with no errors, and I am trying to solve each of them one by one. However, when uncommenting this line:
C->data[i * C->rows + j] = sum;
I get this error message in a previous (unmodified) line of code:
error: SSE register return with SSE disabled
sum += fixed_prod;
~~~~^~~~~~~~~~~~~
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error. Maybe my fixed-point implementation is flawed (I'm no expert in that matter either), or maybe I'm missing something obvious. Just in case, I have tested the same logic in a user-space program (using Floating-Point values) and it seems to work fine. In either case, any help in solving this issue would be appreciated. Thanks in advance!
Edit: I have included the definition of matrix and an example matrix. I have been using the default kbuild command for building external modules, here is what my Makefile looks like:
obj-m = matrix_mult.o
KVERSION = $(shell uname -r)
all:
make -C /lib/modules/$(KVERSION)/build M=$(PWD) modules
Linux compiles kernel code with -mgeneral-regs-only on x86, which produces this error in functions that do anything with FP or SIMD. (Except via inline asm, because then the compiler doesn't see the FP instructions, only the assembler does.)
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error.
GCC optimizes whole functions when optimization is enabled, and you are using FP inside that function. You're doing FP multiply and truncating conversion to integer with your macro and assigning the result to an int, since the MCVE you eventually provided shows struct matrix containing double *data.
If you stop the compiler from using FP instructions (like Linux does by building with -mgeneral-regs-only), it refuses to compile your file instead of doing software floating-point.
The only odd thing is that it pins down the error to an integer += instead of one of the statements that compiles to a mulsd and cvttsd2si
If you disable optimization (-O0 -mgeneral-regs-only) you get a more obvious location for the same error (https://godbolt.org/z/Tv5nG6nd4):
<source>: In function 'void matmul(matrix*, matrix*, matrix*)':
<source>:9:33: error: SSE register return with SSE disabled
9 | #define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
| ~~~~~^~~~~~~~~~~~~~~
<source>:46:21: note: in expansion of macro 'DOUBLE_TO_FIXED'
46 | a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
| ^~~~~~~~~~~~~~~
If you really want to know what's going on with the GCC internals, you could dig into it with -fdump-tree-... options, e.g. on the Godbolt compiler explorer there's a dropdown for GCC Tree / RTL output that would let you look at the GIMPLE or RTL internal representation of your function's logic after various analyzer passes.
But if you just want to know whether there's a way to make this function work, no obviously not, unless you compile a file without -mgeneral-registers-only. All functions in a file compiled that way must only be called by callers that have used kernel_fpu_begin() before the call. (and kernel_fpu_end after).
You can't safely use kernel_fpu_begin inside a function compiled to allow it to use SSE / x87 registers; it might already have corrupted user-space FPU state before calling the function, after optimization. The symptom of getting this wrong is not a fault, it's corrupting user-space state, so don't assume that happens to work = correct. Also, depending on how GCC optimizes, the code-gen might be fine with your version, but might be broken with earlier or later GCC or clang versions. I somewhat expect that kernel_fpu_begin() at the top of this function would get called before the compiler did anything with FP instructions, but that doesn't mean it would be safe and correct.
See also Generate and optimize FP / SIMD code in the Linux Kernel on files which contains kernel_fpu_begin()?
Apparently -msse2 overrides -mgeneral-regs-only, so that's probably just an alias for -mno-mmx -mno-sse and whatever options disables x87. So you might be able to use __attribute__((target("sse2"))) on a function without changing build options for it, but that would be x86-specific. Of course, so is -mgeneral-regs-only. And there isn't a -mno-general-regs-only option to override the kernel's normal CFLAGS.
I don't have a specific suggestion for the best way to set up a build option if you really do think it's worth using kernel_fpu_begin at all, here (rather than using fixed-point the whole way through).
Obviously if you do save/restore the FPU state, you might as well use it for the loop instead of using FP to convert to fixed-point and back.

OPENACC How to handle a library function in a #pragma acc routine

I have to call the <stdlib.h> function exit() inside this routine:
#pragma acc routine(Check) seq
int Check (double **u, char *str)
{
for (int i = beg; i <= end; i++) {
for (int v = 0; v < vend; v++) {
if (isnan(u[i][v])) {
#pragma acc routine(Here) seq
Here (i,NULL);
#pragma acc routine(exit)
exit(1);
}
}}
return 0;
}
I get the error:
nvlink error : Undefined reference to 'exit' in 'tools.o'
Usually I solve this problem by adding the routine #pragma acc routine before the body of the function but in this case I'm dealing with a library function.
All routines called from the device, need a device callable version of the routine. Often system routines do not have device callable versions, including "exit", so can't be used.
Though, you can't exit a host application from device code, so you may want to rethink this portion of the code. Instead of using "exit", you'll want to capture errors and then abort once execution has returned to the host.

Does using unique_ptr mean that I don't have to use the restrict keyword?

When trying to get loops to auto-vectorise, I've seen code like this written:
void addFP(int N, float *in1, float *in2, float * restrict out)
{
for (i = 0 ; i < N; i++)
{
out[i] = in1[i] + in2[i];
}
}
Where the restrict keyword is needed reassure the compiler about pointer aliases, so that it can vectorise the loop.
Would something like this do the same thing?
void addFP(int N, float *in1, float *in2, std::unique_ptr<float> out)
{
for (i = 0 ; i < N; i++)
{
out[i] = in1[i] + in2[i];
}
}
If this does work, which is the more portable choice?
tl;dr Can std::unique_ptr be used to replace the restrict keyword in a loop you're trying to auto-vectorise?
restrict is not part of C++11, instead it is part of C99.
std::unique_ptr<T> foo; is telling your compiler: I only need the memory in this scope. Once this scope ends, free the memory.
restrict tells your compiler: I know that you can't know or proof this, but I pinky swear that this is the only reference to this chunk of memory that occurs in this function.
unique_ptr doesn't stop aliases, nor should the compiler assume they don't exist:
int* pointer = new int[3];
int* alias = pointer;
std::unique_ptr<int> alias2(pointer);
std::unique_ptr<int> alias3(pointer); //compiles, but crashes when deleting
So your first version is not valid in C++11 (though it works on many modern compilers) and the second doesn't do the optimization you are expecting. To still get the behavior concider std::valarray.
I don't think so. Suppose this code:
auto p = std::make_unique<float>(0.1f);
auto raw = p.get();
addFP(1, raw, raw, std::move(p));

OpenACC 2.0 routine: data locality

Take the following code, which illustrates the calling of a simple routine on the accelerator, compiled on the device using OpenACC 2.0's routine directive:
#include <iostream>
#pragma acc routine
int function(int *ARRAY,int multiplier){
int sum=0;
#pragma acc loop reduction(+:sum)
for(int i=0; i<10; ++i){
sum+=multiplier*ARRAY[i];
}
return sum;
}
int main(){
int *ARRAY = new int[10];
int multiplier = 5;
int out;
for(int i=0; i<10; i++){
ARRAY[i] = 1;
}
#pragma acc enter data create(out) copyin(ARRAY[0:10],multiplier)
#pragma acc parallel present(out,ARRAY[0:10],multiplier)
if (function(ARRAY,multiplier) == 50){
out = 1;
}else{
out = 0;
}
#pragma acc exit data copyout(out) delete(ARRAY[0:10],multiplier)
std::cout << out << std::endl;
}
How does function know to use the device copies of ARRAY[0:10] and multiplier when it is called from within a parallel region? How can we enforce the use of the device copies?
When your routine is called within a device region (the parallel in your code), it is being called by the threads on the device, which means those threads will only have access to arrays on the device. The compiler may actually choose to inline that function, or it may be a device-side function call. That means that you can know that when the function is called from the device it will be receiving device copies of the data because the function is essentially inheriting the present data clause from the parallel region. If you still want to convince yourself that you're running on the device once inside the function, you could call acc_on_device, but that only tells you that you're running on the accelerator, not that you received a device pointer.
If you want to enforce the use of device copies more than that, you could make the routine nohost so that it would technically not be valid to call from the host, but that doesn't really do what you're asking, which is to do a check on the GPU that the array really is a device array.
Keep in mind though that any code inside a parallel region that is not inside a loop will be run gang-redundantly, so the write to out is likely a race condition, unless you happen to be running with one gang or you write to it using an atomic.
Basically, when you involved "data" clause, the device will create/copy data to the device memory, then the block of code that defined with "acc routine" will be executed on the device. Notice that the memory between host and device does not share unlike multi-threading (OpenMP). So yes, "function" will be using the device copies of ARRAY and multiplier as long as it is under data segment. Hope this helps! :)
You should assign the function with one parallelism level such as gang/worker/vector. It's a more accurate way.
The routine will use the date in device memory.

Resources