"Integer operation result out of range" in cuda source code - gcc

I'm trying to compile a code written using CUDA 3.2 on RHEL 5.6. The relevant portions are
extern "C"{
#include <stdio.h>
#include <inttypes.h>
static uint64_t size = 0;
...
size = 5000 * 1024 * 1024;
printf("sizeof(size) = %d size = %lu\n", sizeof(size), size);
}
The code is in a .cu file and is compiled with nvcc. For the line size = 5000 * 1024 * 1024; I get the compilation warning "integer operation result is out of range". The output I get is
sizeof(size) = 8 size = 947912704
I don't understand why the variable size can't hold the value 5242880000 if it is 8 bytes wide.
Thank you.

As @Damien commented, the multiplication is done in int. The following code gives the expected result:
size = 5000L * 1024 * 1024;
This is not related to CUDA or the nvcc compiler, which calls a general-purpose C compiler during the 'non-CUDA' compilation phases. See The CUDA Compiler Driver NVCC documentation for more details.
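As a host-only sketch (no CUDA required, variable names are illustrative), the difference between the two expressions:

#include <cstdint>
#include <cstdio>

int main() {
    // 5000 * 1024 * 1024 is evaluated entirely in 32-bit int arithmetic,
    // so it overflows before being assigned to the 64-bit variable
    // (in the question it wrapped to 947912704).
    uint64_t wrapped = 5000 * 1024 * 1024;       // triggers the out-of-range warning
    // Making one operand 64-bit promotes the whole expression to 64 bits.
    uint64_t correct = 5000ULL * 1024 * 1024;    // 5242880000 as expected
    printf("wrapped = %llu, correct = %llu\n",
           (unsigned long long)wrapped, (unsigned long long)correct);
    return 0;
}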

Related

How to optimize SYCL kernel

I'm studying SYCL at university and I have a question about the performance of some code.
In particular, I have this C/C++ code:
and I need to translate it into a SYCL kernel with parallelization, so I did this:
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;
constexpr int size = 131072; // 2^17
int main(int argc, char** argv) {
    // Create a vector with size elements and initialize them to 1
    std::vector<float> dA(size, 1.0f);
    try {
        queue gpuQueue{ gpu_selector{} };
        buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
        gpuQueue.submit([&](handler& cgh) {
            accessor inA{ bufA, cgh };
            cgh.parallel_for(range<1>(size),
                [=](id<1> i) { inA[i] = inA[i] + 2; }
            );
        });
        gpuQueue.wait_and_throw();
    }
    catch (std::exception& e) { throw e; }
    return 0;
}
So my question is about the value c: in this case I use the literal value 2 directly, but will this impact performance when I run the code? Do I need to create a variable, or is this approach correct and the performance fine?
Thanks in advance for the help!
Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel - this is as efficient as it gets, I think! There's the slight complication that you have an implicit conversion from int to float. My guess is that you'll probably end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or convert it at runtime or anything like that; it just lives in the instruction.
Equally, if you had:
constexpr int c = 2;
// the rest of your code
[=](id<1> i) { inA[i] = inA[i] + c; }
// etc
The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.
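For completeness, a minimal self-contained sketch of that constexpr variant (same structure as the code above; declaring c as float here just avoids the int-to-float conversion, and gpu_selector{} is kept from the original code):

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>

constexpr int size = 131072;   // 2^17
constexpr float c = 2.0f;      // compile-time constant

int main() {
    std::vector<float> dA(size, 1.0f);
    {
        sycl::queue gpuQueue{ sycl::gpu_selector{} };
        sycl::buffer<float, 1> bufA(dA.data(), sycl::range<1>(dA.size()));
        gpuQueue.submit([&](sycl::handler& cgh) {
            sycl::accessor inA{ bufA, cgh };
            // c is propagated into the kernel as a literal; there is no
            // runtime load of the constant from memory.
            cgh.parallel_for(sycl::range<1>(size),
                [=](sycl::id<1> i) { inA[i] = inA[i] + c; });
        });
        gpuQueue.wait_and_throw();
    }   // buffer destructor copies the results back into dA
    std::cout << dA[0] << "\n";   // expected: 3
    return 0;
}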
I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:
%5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
%add.i = fadd float %5, 2.000000e+00
store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
This shows a float load and store to/from the same address, with an 'add 2.0' instruction in between. If I modify the code to use the variable c as demonstrated above, I get the same LLVM IR.
Conclusion: you've already achieved maximum efficiency, and compilers are smart!

What is causing this error: SSE register return with SSE disabled?

I'm new to kernel development, and I need to write a Linux kernel module that performs several matrix multiplications (I'm working on an x86_64 platform). I'm trying to use fixed-point values for these operations; however, during compilation the compiler reports this error:
error: SSE register return with SSE disabled
I don't know that much about SSE or this issue in particular, but from what I've found, and according to most answers to questions about this problem, it is related to the use of floating-point (FP) arithmetic in kernel space, which is rarely a good idea (hence the use of fixed-point arithmetic). This error seems weird to me because I'm pretty sure I'm not using any FP values or operations, yet it keeps popping up, and in ways that seem strange to me. For instance, I have this block of code:
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
const int scale = 16;
#define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
#define FIXED_TO_DOUBLE(x) ((x) / (1 << scale))
#define MULT(x, y) ((((x) >> 8) * ((y) >> 8)) >> 0)
#define DIV(x, y) (((x) << 8) / (y) << 8)
#define OUTPUT_ROWS 6
#define OUTPUT_COLUMNS 2
struct matrix {
    int rows;
    int cols;
    double *data;
};
double outputlayer_weights[OUTPUT_ROWS * OUTPUT_COLUMNS] =
{
    0.7977986, -0.77172316,
    -0.43078753, 0.67738613,
    -1.04312621, 1.0552227,
    -0.32619684, 0.14119884,
    -0.72325027, 0.64673559,
    0.58467862, -0.06229197
};
...
void matmul (struct matrix *A, struct matrix *B, struct matrix *C) {
    int i, j, k, a, b, sum, fixed_prod;
    if (A->cols != B->rows) {
        return;
    }
    for (i = 0; i < A->rows; i++) {
        for (j = 0; j < B->cols; j++) {
            sum = 0;
            for (k = 0; k < A->cols; k++) {
                a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
                b = DOUBLE_TO_FIXED(B->data[k * B->rows + j]);
                fixed_prod = MULT(a, b);
                sum += fixed_prod;
            }
            /* Commented the following line, causes error */
            //C->data[i * C->rows + j] = sum;
        }
    }
}
...
static int __init insert_matmul_init (void)
{
    printk(KERN_INFO "INSERTING MATMUL");
    return 0;
}
static void __exit insert_matmul_exit (void)
{
    printk(KERN_INFO "REMOVING MATMUL");
}
module_init (insert_matmul_init);
module_exit (insert_matmul_exit);
which compiles with no errors (I left out code that I found irrelevant to the problem). I have made sure to comment out any error-prone lines to get to a point where the program compiles with no errors, and I am trying to solve them one by one. However, when I uncomment this line:
C->data[i * C->rows + j] = sum;
I get this error message in a previous (unmodified) line of code:
error: SSE register return with SSE disabled
sum += fixed_prod;
~~~~^~~~~~~~~~~~~
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error. Maybe my fixed-point implementation is flawed (I'm no expert in that matter either), or maybe I'm missing something obvious. Just in case, I have tested the same logic in a user-space program (using floating-point values) and it seems to work fine. In any case, any help in solving this issue would be appreciated. Thanks in advance!
Edit: I have included the definition of matrix and an example matrix. I have been using the default kbuild command for building external modules; here is what my Makefile looks like:
obj-m = matrix_mult.o
KVERSION = $(shell uname -r)
all:
	make -C /lib/modules/$(KVERSION)/build M=$(PWD) modules
Linux compiles kernel code with -mgeneral-regs-only on x86, which produces this error in functions that do anything with FP or SIMD (except via inline asm, because then the compiler doesn't see the FP instructions; only the assembler does).
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error.
GCC optimizes whole functions when optimization is enabled, and you are using FP inside that function: your macro does an FP multiply and a truncating conversion to integer, and you assign the result to an int, since the MCVE you eventually provided shows struct matrix containing double *data.
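A minimal user-space illustration of that pattern (identifiers copied from the question, no kernel build required): even though a is an int, the compiler has to emit FP instructions to evaluate the right-hand side.

static const int scale = 16;
#define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))

int to_fixed(double v)
{
    /* double multiply (mulsd) followed by a truncating
       double-to-int conversion (cvttsd2si) on x86-64 */
    int a = DOUBLE_TO_FIXED(v);
    return a;
}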
If you stop the compiler from using FP instructions (like Linux does by building with -mgeneral-regs-only), it refuses to compile your file instead of doing software floating-point.
The only odd thing is that it pins the error to an integer +=, instead of to one of the statements that compile to a mulsd and a cvttsd2si.
If you disable optimization (-O0 -mgeneral-regs-only) you get a more obvious location for the same error (https://godbolt.org/z/Tv5nG6nd4):
<source>: In function 'void matmul(matrix*, matrix*, matrix*)':
<source>:9:33: error: SSE register return with SSE disabled
9 | #define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
| ~~~~~^~~~~~~~~~~~~~~
<source>:46:21: note: in expansion of macro 'DOUBLE_TO_FIXED'
46 | a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
| ^~~~~~~~~~~~~~~
If you really want to know what's going on with the GCC internals, you could dig into it with -fdump-tree-... options, e.g. on the Godbolt compiler explorer there's a dropdown for GCC Tree / RTL output that would let you look at the GIMPLE or RTL internal representation of your function's logic after various analyzer passes.
But if you just want to know whether there's a way to make this function work: no, obviously not, unless you compile its file without -mgeneral-regs-only. All functions in a file compiled that way must only be called by callers that have used kernel_fpu_begin() before the call (and kernel_fpu_end() after).
You can't safely use kernel_fpu_begin() inside a function compiled to allow SSE / x87 register use; after optimization, such a function might already have corrupted user-space FPU state before the call to kernel_fpu_begin(). The symptom of getting this wrong is not a fault; it's silently corrupting user-space state, so don't assume that 'happens to work' means 'correct'. Also, depending on how GCC optimizes, the code-gen might be fine with your version but broken with earlier or later GCC or clang versions. I somewhat expect that a kernel_fpu_begin() at the top of this function would get called before the compiler did anything with FP instructions, but that doesn't mean it would be safe and correct.
See also Generate and optimize FP / SIMD code in the Linux Kernel on files which contains kernel_fpu_begin()?
Apparently -msse2 overrides -mgeneral-regs-only, so the latter is probably just an alias for -mno-mmx -mno-sse and whatever option disables x87. So you might be able to use __attribute__((target("sse2"))) on a function without changing build options for it, but that would be x86-specific. Of course, so is -mgeneral-regs-only. And there isn't a -mno-general-regs-only option to override the kernel's normal CFLAGS.
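For reference, the attribute syntax referred to above might look like the sketch below (untested against the kernel's actual CFLAGS; the function name and body are purely illustrative). Callers would still need kernel_fpu_begin()/kernel_fpu_end() around any call into it:

/* Hypothetical example: ask GCC to allow SSE2 code generation for just this
   function, which the behaviour described above suggests may override
   -mgeneral-regs-only for its body. */
__attribute__((target("sse2")))
static void scale_fp(const double *in, double *out, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * 0.5;   /* scalar double math, compiled as SSE2 */
}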
I don't have a specific suggestion for the best way to set up a build option, if you really do think it's worth using kernel_fpu_begin() at all here (rather than using fixed-point the whole way through).
Obviously if you do save/restore the FPU state, you might as well use it for the loop instead of using FP to convert to fixed-point and back.
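If you do go fixed-point the whole way through, the key change is storing the matrix data as integers (e.g. Q16.16) from the start, so no double ever appears in kernel code. A rough sketch of that idea (illustrative names, not the question's exact module, and untested in a real kernel build):

#include <linux/types.h>

/* Q16.16 fixed-point: value = raw / 65536.0 */
typedef s32 fixed_t;

#define FIXED_SHIFT 16
/* Intended for compile-time literals in static initializers (e.g. weight
   tables); the compiler should fold the FP multiply at compile time, so no
   FP instructions are emitted. */
#define FIXED_CONST(x) ((fixed_t)((x) * (1 << FIXED_SHIFT)))

struct fixed_matrix {
    int rows;
    int cols;
    fixed_t *data;              /* integer storage instead of double */
};

/* Multiply two Q16.16 values using a 64-bit intermediate. */
static inline fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((s64)a * b) >> FIXED_SHIFT);
}

static void fixed_matmul(const struct fixed_matrix *A,
                         const struct fixed_matrix *B,
                         struct fixed_matrix *C)
{
    int i, j, k;

    if (A->cols != B->rows)
        return;

    for (i = 0; i < A->rows; i++) {
        for (j = 0; j < B->cols; j++) {
            fixed_t sum = 0;
            for (k = 0; k < A->cols; k++)
                sum += fixed_mul(A->data[i * A->cols + k],
                                 B->data[k * B->cols + j]);
            C->data[i * C->cols + j] = sum;
        }
    }
}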

memcmp difference between gcc 10.3 and gcc 11.1 for char16_t

I'm converting some tests that use the memcmp function and don't get the expected output. I've been trying to figure out why there is a difference between the Windows and Linux output, and I ended up on godbolt.org. There I played around with different gcc versions, and to my surprise there is a difference between x86-64 gcc 10.3 and x86-64 gcc 11.1. Can you help me figure out what the correct output is?
The code that is used:
#include <string.h>
#include <iostream>
int main()
{
    char16_t const * p10 = u"Same";
    char16_t const * p210 = u"NotSame";
    auto result10 = memcmp(p10, p210, sizeof(p10));
    std::cout << result10 << "\n";

    char16_t const p11[] = u"Same";
    char16_t const p211[] = u"NotSame";
    auto result11 = memcmp(&p11, &p211, sizeof(p11));
    std::cout << result11 << "\n";
}
Gcc 10.3 output
5
5
Gcc 11.1 output
5
1
VS 2019 / MSVC 14.29.30133 output
1
1
It looks like in this example MSVC always returns exactly 1. For gcc this isn't always the case, because it seems to return the difference: between 83 ('S') and 78 ('N') the difference is 5, so that is returned. Now my question is: is this the correct output, or should it just be 1 in this case, to indicate that there is a difference and ptr1 is greater than ptr2? I looked at some documentation but it's a bit vague about what it should be.
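For what it's worth, the C and C++ standards only specify the sign of memcmp's return value (less than, equal to, or greater than zero), not its magnitude, so both 5 and 1 are conforming results. A small sketch (the sign_of helper is hypothetical, not from the question) of normalizing the result if the tests need a stable value:

#include <cstring>
#include <iostream>

// Collapse memcmp's result to -1, 0, or 1 so a test doesn't depend on the
// compiler/library's choice of magnitude.
int sign_of(int v) { return (v > 0) - (v < 0); }

int main()
{
    char16_t const a[] = u"Same";
    char16_t const b[] = u"NotSame";
    // Note: sizeof(a) here is the size of the array in bytes (10), unlike
    // sizeof on a pointer, which is just the pointer size (8 on x86-64).
    int raw = std::memcmp(a, b, sizeof(a));
    std::cout << raw << " -> " << sign_of(raw) << "\n";
    return 0;
}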

CUDA 5.0 "Generate Relocatable Device Code" leads to invalid device symbol error

I am trying to do separate compilation using CUDA 5. For this reason I set "Generate Relocatable Device Code" to "Yes (-rdc=true)" in Visual Studio 2010. The program compiles without errors; however, I get an invalid device symbol error when I try to initialize device constants using cudaMemcpyToSymbol.
i.e. I have the following constant
__constant__ float gdDomainOrigin[2];
and try to initialize it with
cudaMemcpyToSymbol(gdDomainOrigin, mDomainOrigin, 2*sizeof(float));
which leads to the error. The error does not occur when I compile everything as a whole, without the aforementioned option set. Could anybody please help me with that?
I can't reproduce this. If I build an application from two .cu files, one containing a __constant__ symbol and a simple kernel, and the other containing the runtime API incantations to populate that constant memory and call the kernel, it works only when relocatable device code generation is enabled, viz:
__constant__ float gdDomainOrigin[2];

__global__
void kernel(float *inout)
{
    inout[0] = gdDomainOrigin[0];
    inout[1] = gdDomainOrigin[1];
}
and
#include <cstdio>
#include <cstdlib>

extern __constant__ float gdDomainOrigin[2];
extern __global__ void kernel(float *);

inline
void gpuAssert(cudaError_t code, char * file, int line, bool Abort=true)
{
    if (code != 0) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (Abort) exit(code);
    }
}

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

int main(void)
{
    const float mDomainOrigin[2] = { 1.234f, 5.6789f };
    const size_t sz = sizeof(float) * size_t(2);
    float * dbuf, * hbuf;

    gpuErrchk( cudaFree(0) );
    gpuErrchk( cudaMemcpyToSymbol(gdDomainOrigin, mDomainOrigin, sz) );
    gpuErrchk( cudaMalloc((void **)&dbuf, sz) );

    kernel<<<1,1>>>(dbuf);
    gpuErrchk( cudaPeekAtLastError() );

    hbuf = new float[2];
    gpuErrchk( cudaMemcpy(hbuf, dbuf, sz, cudaMemcpyDeviceToHost) );

    fprintf(stdout, "%f %f\n", hbuf[0], hbuf[1]);
    return 0;
}
Compiling and running these in CUDA 5 on a 64 bit linux system with a Kepler GPU produces the following:
$ nvcc -arch=sm_30 -o shared shared.cu shared_dev.cu
$ ./shared
GPUassert: invalid device symbol shared.cu 23
$ nvcc -arch=sm_30 -rdc=true -o shared shared.cu shared_dev.cu
$ ./shared
1.234000 5.678900
You can see that in the first compilation, without relocatable GPU code generation, the symbol isn't found. In the second case, with relocatable GPU code generation, it is found, and the ELF symbol table in the object file looks just as you would expect:
$ nvcc -arch=sm_30 -rdc=true -c shared_dev.cu
$ cuobjdump -symbols shared_dev.o
Fatbin elf code:
================
arch = sm_30
code version = [1,6]
producer = cuda
host = linux
compile_size = 64bit
identifier = shared_dev.cu
symbols:
STT_SECTION STB_LOCAL .text._Z6kernelPf
STT_SECTION STB_LOCAL .nv.constant3
STT_SECTION STB_LOCAL .nv.constant0._Z6kernelPf
STT_CUDA_OBJECT STB_LOCAL _param
STT_SECTION STB_LOCAL .nv.callgraph
STT_FUNC STB_GLOBAL _Z6kernelPf
STT_CUDA_OBJECT STB_GLOBAL gdDomainOrigin
Fatbin ptx code:
================
arch = sm_30
code version = [3,1]
producer = cuda
host = linux
compile_size = 64bit
compressed
identifier = shared_dev.cu
ptxasOptions = --compile-only
Perhaps you could try my code and compilation/diagnostic steps and see what happens with your Windows toolchain.

Limiting memory usage for a single process in OSX /Darwin

I am trying to modify some JNI code to limit the amount of memory that a process can consume. Here is the code that I am using to test setrlimit on Linux and OSX. On Linux it works as expected and buf is null.
This code sets the limit to 32 MB and then tries to malloc a 64 MB buffer; if the buffer is null then setrlimit works.
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>

#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

int main(void) {
    struct rlimit current;
    int memLimit = 32 * 1024 * 1024;

    int result = getrlimit(RLIMIT_AS, &current);
    if (result != 0)
        errExit("Unable to get rlimit");

    current.rlim_cur = memLimit;
    current.rlim_max = memLimit;
    result = setrlimit(RLIMIT_AS, &current);
    if (result != 0)
        errExit("Unable to setrlimit");

    printf("Doing malloc \n");
    int memSize = 64 * 1024 * 1024;
    char *buf = malloc(memSize);
    if (buf == NULL) {
        printf("Your out of memory\n");
    } else {
        printf("Malloc successsful\n");
    }
    free(buf);
    return 0;
}
On a Linux machine this is my result:
memtest]$ ./m200k
Doing malloc
Your out of memory
On OSX 10.8:
./m200k
Doing malloc
Malloc successsful
My question is: if this does not work on OSX, is there a way to accomplish this in the Darwin kernel? The man pages all seem to say it should work, but it does not appear to do so. I have seen that launchctl has some support for limiting memory, but my goal is to add this ability in code. I tried using ulimit as well, but this did not work either, and I'm pretty sure ulimit uses setrlimit to set its limits. Also, is there a signal I can catch when the setrlimit soft or hard limit is exceeded? I haven't been able to find one.
Bonus points if it can be accomplished in windows also.
Thanks for any advice
Update
As pointed out, RLIMIT_AS is explicitly defined in the man page, but RLIMIT_RSS is defined as an alias for it, so as far as the documentation is concerned RLIMIT_RSS and RLIMIT_AS are interchangeable on OSX.
In /usr/include/sys/resource.h on OSX 10.8:
#define RLIMIT_RSS RLIMIT_AS /* source compatibility alias */
I tested trojanfoe's excellent suggestion to use RLIMIT_DATA, which is described here:
The RLIMIT_DATA limit specifies the maximum amount of bytes the process
data segment can occupy. The data segment for a process is the area in which
dynamic memory is located (that is, memory allocated by malloc() in C, or in C++,
with new()). If this limit is exceeded, calls to allocate new memory will fail.
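A minimal, self-contained variant of the test above using RLIMIT_DATA (essentially the question's program with only the resource constant changed):

#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    struct rlimit lim;
    int memLimit = 32 * 1024 * 1024;

    /* Limit the data segment to 32 MB instead of limiting the address space. */
    if (getrlimit(RLIMIT_DATA, &lim) != 0) { perror("getrlimit"); exit(EXIT_FAILURE); }
    lim.rlim_cur = memLimit;
    lim.rlim_max = memLimit;
    if (setrlimit(RLIMIT_DATA, &lim) != 0) { perror("setrlimit"); exit(EXIT_FAILURE); }

    char *buf = malloc(64 * 1024 * 1024);   /* 64 MB, twice the 32 MB limit */
    puts(buf == NULL ? "malloc failed: limit enforced"
                     : "malloc succeeded: limit not enforced");
    free(buf);
    return 0;
}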
The result was the same on Linux and OSX: the malloc was successful on both.
chinshaw@osx$ ./m200k
Doing malloc
Malloc successsful
chinshaw@redhat ./m200k
Doing malloc
Malloc successsful
