I'm new to kernel development, and I need to write a Linux kernel module that performs several matrix multiplications (I'm working on an x86_64 platform). I'm trying to use fixed-point values for these operations; however, during compilation, the compiler reports this error:
error: SSE register return with SSE disabled
I don't know much about SSE or this issue in particular, but from what I've found, and according to most answers to questions about this problem, it is related to the use of floating-point (FP) arithmetic in kernel space, which is rarely a good idea (hence the use of fixed-point arithmetic). This error seems weird to me because I'm fairly sure I'm not using any FP values or operations, yet it keeps popping up in ways that puzzle me. For instance, I have this block of code:
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
const int scale = 16;
#define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
#define FIXED_TO_DOUBLE(x) ((x) / (1 << scale))
#define MULT(x, y) ((((x) >> 8) * ((y) >> 8)) >> 0)
#define DIV(x, y) (((x) << 8) / (y) << 8)
#define OUTPUT_ROWS 6
#define OUTPUT_COLUMNS 2
struct matrix {
    int rows;
    int cols;
    double *data;
};
double outputlayer_weights[OUTPUT_ROWS * OUTPUT_COLUMNS] =
{
     0.7977986, -0.77172316,
    -0.43078753,  0.67738613,
    -1.04312621,  1.0552227,
    -0.32619684,  0.14119884,
    -0.72325027,  0.64673559,
     0.58467862, -0.06229197
};
...
void matmul (struct matrix *A, struct matrix *B, struct matrix *C) {
    int i, j, k, a, b, sum, fixed_prod;
    if (A->cols != B->rows) {
        return;
    }
    for (i = 0; i < A->rows; i++) {
        for (j = 0; j < B->cols; j++) {
            sum = 0;
            for (k = 0; k < A->cols; k++) {
                a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
                b = DOUBLE_TO_FIXED(B->data[k * B->rows + j]);
                fixed_prod = MULT(a, b);
                sum += fixed_prod;
            }
            /* Commented the following line, causes error */
            //C->data[i * C->rows + j] = sum;
        }
    }
}
...
static int __init insert_matmul_init (void)
{
    printk(KERN_INFO "INSERTING MATMUL\n");
    return 0;
}

static void __exit insert_matmul_exit (void)
{
    printk(KERN_INFO "REMOVING MATMUL\n");
}
module_init (insert_matmul_init);
module_exit (insert_matmul_exit);
which compiles with no errors (I left out code that I found irrelevant to the problem). I have commented out any error-prone lines to get to a point where the module compiles cleanly, and I am trying to solve the errors one by one. However, when I uncomment this line:
C->data[i * C->rows + j] = sum;
I get this error message in a previous (unmodified) line of code:
error: SSE register return with SSE disabled
sum += fixed_prod;
~~~~^~~~~~~~~~~~~
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error. Maybe my fixed-point implementation is flawed (I'm no expert in that matter either), or maybe I'm missing something obvious. Just in case, I have tested the same logic in a user-space program (using floating-point values) and it works fine. Either way, any help in solving this issue would be appreciated. Thanks in advance!
Edit: I have included the definition of matrix and an example matrix. I have been using the default kbuild command for building external modules, here is what my Makefile looks like:
obj-m = matrix_mult.o
KVERSION = $(shell uname -r)
all:
	make -C /lib/modules/$(KVERSION)/build M=$(PWD) modules
Linux compiles kernel code with -mgeneral-regs-only on x86, which produces this error in functions that do anything with FP or SIMD. (Except via inline asm, because then the compiler doesn't see the FP instructions, only the assembler does.)
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error.
GCC optimizes whole functions at a time when optimization is enabled, and you are using FP inside that function: your DOUBLE_TO_FIXED macro does an FP multiply and a truncating conversion to integer when the result is assigned to an int, since the MCVE you eventually provided shows struct matrix containing double *data.
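Concretely, this is what the compiler sees after macro expansion (a sketch of one line from your inner loop):

    /* FP multiply, then an implicit truncating double->int conversion in the
     * assignment: on x86-64 that takes a mulsd and a cvttsd2si, both SSE. */
    a = A->data[i * A->rows + k] * (1 << scale);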
If you stop the compiler from using FP instructions (like Linux does by building with -mgeneral-regs-only), it refuses to compile your file instead of doing software floating-point.
The only odd thing is that it pins down the error to an integer +=, rather than to one of the statements that compiles to a mulsd and a cvttsd2si.
If you disable optimization (-O0 -mgeneral-regs-only) you get a more obvious location for the same error (https://godbolt.org/z/Tv5nG6nd4):
<source>: In function 'void matmul(matrix*, matrix*, matrix*)':
<source>:9:33: error: SSE register return with SSE disabled
9 | #define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
| ~~~~~^~~~~~~~~~~~~~~
<source>:46:21: note: in expansion of macro 'DOUBLE_TO_FIXED'
46 | a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
| ^~~~~~~~~~~~~~~
If you really want to know what's going on with the GCC internals, you could dig into it with -fdump-tree-... options, e.g. on the Godbolt compiler explorer there's a dropdown for GCC Tree / RTL output that would let you look at the GIMPLE or RTL internal representation of your function's logic after various analyzer passes.
But if you just want to know whether there's a way to make this function work: no, obviously not, unless you compile its file without -mgeneral-regs-only. All functions in a file compiled that way must only be called by callers that have used kernel_fpu_begin() before the call (and kernel_fpu_end() after).
You can't safely use kernel_fpu_begin() inside a function compiled to allow use of SSE / x87 registers; after optimization, the function might already have corrupted user-space FPU state before that call runs. The symptom of getting this wrong is not a fault but silent corruption of user-space state, so don't assume that "happens to work" means "correct". Also, depending on how GCC optimizes, the code-gen might be fine with your version but broken with earlier or later GCC or clang versions. I somewhat expect that a kernel_fpu_begin() at the top of this function would get called before the compiler did anything with FP instructions, but that doesn't mean it would be safe and correct.
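The only safe shape looks something like this sketch (the file layout and function names are hypothetical): the FP code lives in a separate translation unit built without -mgeneral-regs-only, and the normally-compiled caller brackets the call:

    #include <asm/fpu/api.h>   /* kernel_fpu_begin() / kernel_fpu_end() on x86 */

    /* defined in a separate file compiled without -mgeneral-regs-only */
    void matmul_fp(struct matrix *A, struct matrix *B, struct matrix *C);

    static void do_matmul(struct matrix *A, struct matrix *B, struct matrix *C)
    {
        kernel_fpu_begin();    /* save user-space FPU/SIMD state */
        matmul_fp(A, B, C);    /* FP/SSE is only legal between begin and end */
        kernel_fpu_end();      /* restore user-space state */
    }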
See also Generate and optimize FP / SIMD code in the Linux Kernel on files which contains kernel_fpu_begin()?
Apparently -msse2 overrides -mgeneral-regs-only, so the latter is probably just an alias for -mno-mmx -mno-sse plus whatever option disables x87. So you might be able to use __attribute__((target("sse2"))) on a function without changing build options for it, but that would be x86-specific. Of course, so is -mgeneral-regs-only. And there isn't a -mno-general-regs-only option to override the kernel's normal CFLAGS.
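As a sketch, the matmul_fp() from the earlier sketch could instead be marked up per-function (untested against real kernel build rules; the kernel_fpu_begin() / kernel_fpu_end() caveats above still apply):

    /* x86-specific, GCC: re-enable SSE2 for just this function */
    __attribute__((target("sse2")))
    void matmul_fp(struct matrix *A, struct matrix *B, struct matrix *C)
    {
        /* FP code allowed here; callers must still bracket the call with
         * kernel_fpu_begin() / kernel_fpu_end() */
    }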
I don't have a specific suggestion for the best way to set up a build option for that, if you really do think it's worth using kernel_fpu_begin() here at all (rather than using fixed-point the whole way through).
Obviously if you do save/restore the FPU state, you might as well use it for the loop instead of using FP to convert to fixed-point and back.
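If you do go fixed-point the whole way through instead, a minimal sketch (my own, not code from the question; assumes a Q16.16 format, with the double weights converted ahead of time outside the kernel):

    #include <linux/types.h>

    struct fixed_matrix {
        int rows;
        int cols;
        s32 *data;                 /* Q16.16 fixed-point values */
    };

    /* Q16.16 multiply: widen to 64 bits so the intermediate can't overflow */
    static inline s32 fixed_mul(s32 x, s32 y)
    {
        return (s32)(((s64)x * y) >> 16);
    }

A file containing only integer code like this compiles fine under -mgeneral-regs-only, because no FP types or operations ever appear.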
STM32 chips (and many others) have a hardware random number generator (RNG); it is faster and more reliable than the software RNG provided by libc. The compiler knows nothing about the hardware.
Is there a way to redefine implementation of rand()?
There are other hardware modules too, e.g. a real-time clock (RTC), which can provide data for time().
You simply override them by defining functions with identical signatures. If they are defined WEAK in the standard library, they will be overridden; otherwise they are overridden on a first-resolution basis, so as long as your implementation is passed to the linker before libc is searched, it will override. Moreover, .o / .obj files are used in symbol resolution before .a / .lib files, so if your implementation is included in your project source, it will always override.
You should be careful to get the semantics of your implementation correct. For example, rand() returns a signed integer from 0 to RAND_MAX, which is likely not the same range the RNG hardware produces. Since RAND_MAX is a macro, changing it would require changing the standard header, so your implementation needs to honour the existing RAND_MAX.
Example using STM32 Standard Peripheral Library:
#include <stdlib.h>
#include <stm32xxx.h> // Your processor header here

#if defined __cplusplus
extern "C"
{
#endif

static int rng_running = 0 ;

int rand( void )
{
    if( rng_running == 0 )
    {
        // Lazily enable the RNG peripheral clock and start it on first use
        RCC_AHB2PeriphClockCmd( RCC_AHB2Periph_RNG, ENABLE ) ;
        RNG_Cmd( ENABLE ) ;
        rng_running = 1 ;
    }

    // Wait for a random word to become available
    while( RNG_GetFlagStatus( RNG_FLAG_DRDY ) == RESET ) { }

    // Assumes RAND_MAX is an "all ones" integer value (check)
    return (int)(RNG_GetRandomNumber() & (unsigned)RAND_MAX) ;
}

void srand( unsigned seed )
{
    (void)seed ;  // hardware RNG needs no seeding
}

#if defined __cplusplus
}
#endif
Similar considerations apply to time(); there is an example at Problem with time() function in embedded application with C.
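A hypothetical sketch of the same pattern for time(), assuming an SPL RTC whose counter was initialised to hold a UNIX timestamp (RTC_GetCounter() is the STM32F1 SPL call; adapt for your part):

    #include <time.h>

    time_t time( time_t *t )
    {
        // Assumes the RTC counter holds seconds since the UNIX epoch
        time_t now = (time_t)RTC_GetCounter() ;
        if( t != NULL )
        {
            *t = now ;
        }
        return now ;
    }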
Can chrono be used as a timer/counter in a bare-metal microcontroller (e.g. MSP432 running an RTOS)? Can the high_resolution_clock (and other APIs in chrono) be configured so that it increments based on the given microcontroller's actual timer tick/register?
The Real-Time C++ book (section 16.5) seems to suggest this is possible, but I haven't found any examples of this being applied, especially within bare-metal microcontrollers.
How could this be implemented? Would this be even recommended? If not, where can chrono aid in RTOS-based embedded software?
I would create a clock that implements now() by reading from your timer register:
#include <chrono>
#include <cstdint>

struct clock
{
    using rep        = std::int64_t;
    using period     = std::milli;
    using duration   = std::chrono::duration<rep, period>;
    using time_point = std::chrono::time_point<clock>;
    static constexpr bool is_steady = true;

    static time_point now() noexcept
    {
        rep ticks = 0;  // placeholder: read your hardware timer register here
        return time_point{duration{ticks}};
    }
};
Adjust period to whatever speed your processor ticks at (but it does have to be a compile-time constant). Above I've set it for 1 tick/ms. Here is how it should read for 1 tick == 2ns:
using period = std::ratio<1, 500'000'000>;
Now you can say things like:
auto t = clock::now(); // a chrono::time_point
and
auto d = clock::now() - t; // a chrono::duration
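Where this pays off is that everything downstream can be written in chrono units. For example, a busy-wait delay against the custom clock (my sketch, not part of the answer):

    // Spin until the requested duration has elapsed on the custom clock
    void delay(std::chrono::milliseconds d)
    {
        auto const deadline = clock::now() + d;
        while (clock::now() < deadline)
            ;  // busy-wait
    }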
I am using a std::chrono::system_clock::time_point in my program.
When the application stops, I want to save the time_point to a file and load it again when the application starts.
If it were a UNIX timestamp, I could simply store the value as an integer. Is there a way to similarly store a time_point?
Yes. Choose the precision you desire the timestamp in (seconds, milliseconds, ... nanoseconds). Then cast the system_clock::time_point to that precision, extract its numeric value, and print it:
cout << time_point_cast<seconds>(system_clock::now()).time_since_epoch().count();
Though not specified by the standard, the above line (de facto) portably outputs the number of non-leap seconds since 1970-01-01 00:00:00 UTC. That is, this is a UNIX-Timestamp.
I am attempting to get the above code blessed by the standard to do what it in fact does by all implementations today. And I have the unofficial assurance of the std::chrono implementors, that they will not change their system_clock epochs in the meantime.
Here's a complete roundtrip example:
#include <chrono>
#include <cstdint>
#include <iostream>
#include <sstream>

int
main()
{
    using namespace std;
    using namespace std::chrono;
    stringstream io;
    // serialize: integral seconds since the system_clock epoch
    io << time_point_cast<seconds>(system_clock::now()).time_since_epoch().count();
    // deserialize: read the integer back and rebuild the time_point
    int64_t i;
    system_clock::time_point tp;
    io >> i;
    if (!io.fail())
        tp = system_clock::time_point{seconds{i}};
}
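The same round trip at a finer precision just swaps the unit, e.g. milliseconds (a fragment, same namespaces as above):

    auto ms = time_point_cast<milliseconds>(system_clock::now())
                  .time_since_epoch().count();
    auto tp2 = system_clock::time_point{milliseconds{ms}};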
I can't find any clear indication of if/when the 64-bit value returned by QueryPerformanceCounter() gets reset, or overflows and wraps back to zero. Hopefully it never overflows, since 64 bits give space for decades' worth of counting at gigahertz rates. However... is there anything other than a computer restart that will reset it?
Empirically, QPC is reset at system startup.
Note that you should not depend on this behavior, since Microsoft do not explicitly state what the "zero point" is for QPC, merely that it is a monotonically increasing value (mod 2^64) that can be used for high precision timing.
Hence they are quite within their rights to modify its behavior at any time. They could, for example, make it return values that match the FILETIME values that would be produced by a call to GetSystemTimeAsFileTime(), with the same resolution (100 ns tick rate). Under those circumstances it would never reset. At least not in your or my lifetime.
That said, the following program when run on Windows 10 [Version 6.3.16299] produces pairs of identical values that are the system uptime in seconds.
#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER performanceCount;
    LARGE_INTEGER performanceFrequency;
    QueryPerformanceFrequency(&performanceFrequency);

    for (;;)
    {
        QueryPerformanceCounter(&performanceCount);
        DWORD const systemTicks = timeGetTime();  // requires linking winmm.lib
        DWORD const systemSeconds = systemTicks / 1000;
        __int64 const performanceSeconds =
            performanceCount.QuadPart / performanceFrequency.QuadPart;
        std::cout << systemSeconds << " " << performanceSeconds << std::endl;
        Sleep(1000);
    }

    return 0;
}
Standard disclaimers apply, your actual mileage may vary, etc. etc. etc.
It seems that some Windows running inside VirtualBox may reset QueryPerformanceCounter every 20 minutes or so: see here.
QPC has become more reliable over time, but where portability matters more than resolution, a lower-precision timer such as GetTickCount64() should be used instead.
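For scale, a minimal sketch of the GetTickCount64() alternative (the API is documented to return the number of milliseconds since system start as a 64-bit value, so wrap-around is not a practical concern):

    #include <windows.h>
    #include <iostream>

    int main()
    {
        ULONGLONG const uptimeMs = GetTickCount64();  // milliseconds since boot
        std::cout << "uptime: " << uptimeMs / 1000 << " s\n";
        return 0;
    }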