How to instruct avr-gcc to optimize volatile variables? - gcc

Code for an interrupt service handler:
volatile unsigned char x = 0;
void interruptHandler() __attribute__ ((signal));
void interruptHandler() {
f();
g();
}
Calls:
void f() { x ++; } // could be more complex, could also be in a different file
void g() { x ++; } // as `f()`, this is just a very simple example
Because x is a volatile variable, it is read and written every time it is used. The body of the interrupt handler compiles to (avr-gcc -g -c -Wa,-alh -mmcu=atmega328p -Ofast file.c):
lds r24,x
subi r24,lo8(-(1))
sts x,r24
lds r24,x
subi r24,lo8(-(1))
sts x,r24
Now I can manually inline the functions and employ a temporary variable:
unsigned char y = x;
y ++;
y ++;
x = y;
Or I can just write:
x += 2;
Both examples compile to the much more efficient:
lds r24,x
subi r24,lo8(-(2))
sts x,r24
Is it possible to tell avr-gcc to optimize access to volatile variables inside of interruptHandler, i.e. to do my manual optimization automatically?
After all, while interruptHandler is running, global interrupts are disabled, and it is impossible for x to change. I prefer not having to hand optimize code, thereby possibly creating duplicate code (if f() and g() are needed elsewhere) and introducing errors.

Is it possible to tell avr-gcc to optimize access to volatile variables inside of interruptHandler, i.e. to do my manual optimization automatically?
No, that is not possible in the C language.
After all, while interruptHandler is running, global interrupts are disabled
The compiler does not know this - and you could simply put an sei into the handler to turn them back on.
Also note that hardware registers are declared volatile, too. Some of these - like the UART data register - have side effects even when read. The compiler must not remove any reads or writes for these.

If you declare a variable to be volatile, then all accesses to it are volatile - the compiler will read and write it exactly as many times as the source code says, without combining them or doing similar optimisations.
So if you want combining optimisations, declare the variable without the "volatile" - then you will get what you need inside the interrupt code.
And then from outside the interrupt code, you can force volatile accesses using something like this macro:
#define volatileAccess(v) *((volatile typeof((v)) *) &(v))
Use "volatileAccess(x)" rather than "x" outside the interrupt code.
Just don't forget that "volatile" does not mean "atomic" !

Related

What is causing this error: SSE register return with SSE disabled?

I'm new to kernel development, and I need to write a Linux kernel module that performs several matrix multiplications (I'm working on an x64_64 platform). I'm trying to use fixed-point values for these operations, however during compilation, the compiler encounters this error:
error: SSE register return with SSE disabled
I don't know that much about SSE or this issue in particular, but from what i've found and according to most answers to questions about this problem, it is related to the usage of Floating-Point (FP) arithmetic in kernel space, which seems to be rarely a good idea (hence the utilization of Fixed-Point arithmetics). This error seems weird to me because I'm pretty sure I'm not using any FP values or operations, however it keeps popping up and in some ways that seem weird to me. For instance, I have this block of code:
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
const int scale = 16;
#define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
#define FIXED_TO_DOUBLE(x) ((x) / (1 << scale))
#define MULT(x, y) ((((x) >> 8) * ((y) >> 8)) >> 0)
#define DIV(x, y) (((x) << 8) / (y) << 8)
#define OUTPUT_ROWS 6
#define OUTPUT_COLUMNS 2
struct matrix {
int rows;
int cols;
double *data;
};
double outputlayer_weights[OUTPUT_ROWS * OUTPUT_COLUMNS] =
{
0.7977986, -0.77172316,
-0.43078753, 0.67738613,
-1.04312621, 1.0552227 ,
-0.32619684, 0.14119884,
-0.72325027, 0.64673559,
0.58467862, -0.06229197
};
...
void matmul (struct matrix *A, struct matrix *B, struct matrix *C) {
int i, j, k, a, b, sum, fixed_prod;
if (A->cols != B->rows) {
return;
}
for (i = 0; i < A->rows; i++) {
for (j = 0; j < B->cols; j++) {
sum = 0;
for (k = 0; k < A->cols; k++) {
a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
b = DOUBLE_TO_FIXED(B->data[k * B->rows + j]);
fixed_prod = MULT(a, b);
sum += fixed_prod;
}
/* Commented the following line, causes error */
//C->data[i * C->rows + j] = sum;
}
}
}
...
static int __init insert_matmul_init (void)
{
printk(KERN_INFO "INSERTING MATMUL");
return 0;
}
static void __exit insert_matmul_exit (void)
{
printk(KERN_INFO "REMOVING MATMUL");
}
module_init (insert_matmul_init);
module_exit (insert_matmul_exit);
which compiles with no errors (I left out code that I found irrelevant to the problem). I have made sure to comment any error-prone lines to get to a point where the program can be compiled with no errors, and I am trying to solve each of them one by one. However, when uncommenting this line:
C->data[i * C->rows + j] = sum;
I get this error message in a previous (unmodified) line of code:
error: SSE register return with SSE disabled
sum += fixed_prod;
~~~~^~~~~~~~~~~~~
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error. Maybe my fixed-point implementation is flawed (I'm no expert in that matter either), or maybe I'm missing something obvious. Just in case, I have tested the same logic in a user-space program (using Floating-Point values) and it seems to work fine. In either case, any help in solving this issue would be appreciated. Thanks in advance!
Edit: I have included the definition of matrix and an example matrix. I have been using the default kbuild command for building external modules, here is what my Makefile looks like:
obj-m = matrix_mult.o
KVERSION = $(shell uname -r)
all:
make -C /lib/modules/$(KVERSION)/build M=$(PWD) modules
Linux compiles kernel code with -mgeneral-regs-only on x86, which produces this error in functions that do anything with FP or SIMD. (Except via inline asm, because then the compiler doesn't see the FP instructions, only the assembler does.)
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error.
GCC optimizes whole functions when optimization is enabled, and you are using FP inside that function. You're doing FP multiply and truncating conversion to integer with your macro and assigning the result to an int, since the MCVE you eventually provided shows struct matrix containing double *data.
If you stop the compiler from using FP instructions (like Linux does by building with -mgeneral-regs-only), it refuses to compile your file instead of doing software floating-point.
The only odd thing is that it pins down the error to an integer += instead of one of the statements that compiles to a mulsd and cvttsd2si
If you disable optimization (-O0 -mgeneral-regs-only) you get a more obvious location for the same error (https://godbolt.org/z/Tv5nG6nd4):
<source>: In function 'void matmul(matrix*, matrix*, matrix*)':
<source>:9:33: error: SSE register return with SSE disabled
9 | #define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
| ~~~~~^~~~~~~~~~~~~~~
<source>:46:21: note: in expansion of macro 'DOUBLE_TO_FIXED'
46 | a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
| ^~~~~~~~~~~~~~~
If you really want to know what's going on with the GCC internals, you could dig into it with -fdump-tree-... options, e.g. on the Godbolt compiler explorer there's a dropdown for GCC Tree / RTL output that would let you look at the GIMPLE or RTL internal representation of your function's logic after various analyzer passes.
But if you just want to know whether there's a way to make this function work, no obviously not, unless you compile a file without -mgeneral-registers-only. All functions in a file compiled that way must only be called by callers that have used kernel_fpu_begin() before the call. (and kernel_fpu_end after).
You can't safely use kernel_fpu_begin inside a function compiled to allow it to use SSE / x87 registers; it might already have corrupted user-space FPU state before calling the function, after optimization. The symptom of getting this wrong is not a fault, it's corrupting user-space state, so don't assume that happens to work = correct. Also, depending on how GCC optimizes, the code-gen might be fine with your version, but might be broken with earlier or later GCC or clang versions. I somewhat expect that kernel_fpu_begin() at the top of this function would get called before the compiler did anything with FP instructions, but that doesn't mean it would be safe and correct.
See also Generate and optimize FP / SIMD code in the Linux Kernel on files which contains kernel_fpu_begin()?
Apparently -msse2 overrides -mgeneral-regs-only, so that's probably just an alias for -mno-mmx -mno-sse and whatever options disables x87. So you might be able to use __attribute__((target("sse2"))) on a function without changing build options for it, but that would be x86-specific. Of course, so is -mgeneral-regs-only. And there isn't a -mno-general-regs-only option to override the kernel's normal CFLAGS.
I don't have a specific suggestion for the best way to set up a build option if you really do think it's worth using kernel_fpu_begin at all, here (rather than using fixed-point the whole way through).
Obviously if you do save/restore the FPU state, you might as well use it for the loop instead of using FP to convert to fixed-point and back.

How to call a c-function that takes a c-struct that contains pointers

From a GO program on a Raspberry PI I'm trying to call a function(Matlab function converted to C function) and the input to the function is a pointer to a struct and the struct contains pointer to a double(data) and a pointer to an int(size) and two int(allocatedSize, numDimensions). I have tried several ways but nothing has worked, when I have passed the compilation it usually throws a panic: runtime error: cgo argument has Go pointer to Go pointer when I run the program.
sumArray.c
/*sumArray.C*/
/* Include files */
#include "sumArray.h"
/* Function Definitions */
double sumArray(const emxArray_real_T *A1)
{
double S1;
int vlen;
int k;
vlen = A1->size[0];
if (A1->size[0] == 0) {
S1 = 0.0;
} else {
S1 = A1->data[0];
for (k = 2; k <= vlen; k++) {
S1 += A1->data[k - 1];
}
}
return S1;
}
sumArray.h
#ifndef SUMARRAY_H
#define SUMARRAY_H
/* Include files */
#include <stddef.h>
#include <stdlib.h>
#include "sumArray_types.h"
/* Function Declarations */
extern double sumArray(const emxArray_real_T *A1);
#endif
sumArray_types.h
#ifndef SUMARRAY_TYPES_H
#define SUMARRAY_TYPES_H
/* Include files */
/* Type Definitions */
#ifndef struct_emxArray_real_T
#define struct_emxArray_real_T
struct emxArray_real_T
{
double *data;
int *size;
int allocatedSize;
int numDimensions;
};
#endif /*struct_emxArray_real_T*/
#ifndef typedef_emxArray_real_T
#define typedef_emxArray_real_T
typedef struct emxArray_real_T emxArray_real_T;
#endif /*typedef_emxArray_real_T*/
#endif
/* End of code generation (sumArray_types.h) */
main.go
// #cgo CFLAGS: -g -Wall
// #include <stdlib.h>
// #include "sumArray.h"
import "C"
import (
"fmt"
)
func main() {
a1 := [4]C.Double{1,1,1,1}
a2 := [1]C.int{4}
cstruct := C.emxArray_real_T{data: &a1[0], size: &a2[0]}
cstructArr := [1]C.emxArray_real_T{cstruct}
y := C.sumArray(&cstructArr[0])
fmt.Print(float64(y))
}
With this example I get panic: runtime error: cgo argument has Go pointer to Go pointer when I run the program.
I do not how to make it work or if it is possible to make it work. I hope someone can help me or give some direction on how to solve this.
Too much for a comment, so here's the answer.
First, the original text:
A direct solution is to use C.malloc(4 * C.sizeof(C.double))to allocate the array of double-s. Note that you have to make sure to call C.free() on it when done. The same applies to the second array of a single int.
Now, your comment to the Mattanis' remark, which was, reformatted a bit:
thanks for giving some pointers. I tried with
a1 := [4]C.double{1,1,1,1}
sizeA1 := C.malloc(4 * C.sizeof_double)
cstruct := C.emxArray_real_T{
data: &a1[0],
size: (*C.int)(sizeA1)
}
y := C.sumArray(cstruct)
defer C.free(sizeA1)
but it gave me the same
answer as before cgo argument
has Go pointer to Go pointer when I
tried to run the program
You still seem to miss the crucial point. When you're using cgo, there are two disjoint "memory views":
"The Go memory" is everything allocated by the Go runtime powering your running process—on behalf of that process. This memory (most of the time, barring weird tricks) is known to the GC—which is a part of the runtime.
"The C memory" is memory allocated by the C code—typically by calling the libc's malloc()/realloc().
Now imagine a not-so-far-fetched scenario:
Your program runs, the C "side" gets initialized and
spawns its own thread (or threads), and holds handles on them.
Your Go "side" already uses multiple threads to run your goroutines.
You allocate some Go memory in your Go code and pass it
to the C side.
The C side passes the address of that memory to one or more of its own threads.
Your program continues to chug away, and so do the C-side threads—in parallel with your Go code.
As a result you have a reasonably classical scenario in which you get a super-simple situation for unsynchronized parallel memory access, which is a sure recepy for disaster on today's multi-core multi-socket hardware.
Also consider that Go is considerably a more higher-level programming language than C; at the bare minimum, it has automatic garbage collection, and notice that nothing in the Go spec specifies how exactly the GC must be specified.
This means, a particular implementation of Go (including the reference one—in the future) is free to allow its GC to move arbitrary objects in the memory¹, and this means updating every pointer pointing into the memory block in its original location to point to the same place in the block's new location—after it was moved.
With these considerations in mind, the Go devs postulated that in order to keep cgo-using programs future-proof², it is forbidden to pass to C any memory blocks which contain pointers to other Go memory blocks.
It's okay to pass Go memory blocks which contain pointers to C memory, though.
Going back to the example from your second comment,
you still allocate the array of 4 doubles, a1, in the Go memory.
Then the statement cstruct := C.emxArray_real_T{...} again allocates an instance of C.emxArray_real_T in the Go memory, and so after you initialize its data field with a pointer to Go memory (&a1[0]), and then pass its address to the C side, the runtime performs its dynamic checks before actually calling into the C side and crashes your program.
¹ This is typical behaviour for the so-called "generational" garbage collectors, for one example.
² That is, you recompile your program with a future version of the Go compiler of the same "major" release, and the program continues to work, unmodified.

update integer array elements atomically C++

Given a shared array of integer counters, I am interested to know if a thread can atomically fetch and add an array element without locking the entire array?
Here's an illustration of working model that uses mutex to lock access to the entire array.
// thread-shared class members
std::mutex count_array_mutex_;
std::vector<int> counter_array_( 100ish );
// Thread critical section
int counter_index = ... // unpredictable index
int current_count;
{
std::lock_guard<std::mutex> lock(count_array_mutex_);
current_count = counter_array_[counter_index] ++;
}
// ... do stuff using current_count.
I'd like multiple threads to be able to fetch-add separate array elements simultaneously.
So far, in my research of std::atomic<int> I'm thrown off that constructing the atomic object also constructs the protected member. (And plenty of answers explaining why you can't make a std::vector<std::atomic<int> > )
C++20 / C++2a (or whatever you want to call it) will add std::atomic_ref<T> which lets you do atomic operations on an object that wasn't atomic<T> to start with.
It's not available yet as part of the standard library for most compilers, but there is a working implementation for gcc/clang/ICC / other compilers with GNU extensions.
Previously, atomic access to "plain" data was only available with some platform-specific functions like Microsoft's LONG InterlockedExchange(LONG volatile *Target, LONG Value); or GNU C / C++
type __atomic_add_fetch (type *ptr, type val, int memorder) (the same builtins that C++ libraries for GNU compilers use to implement std::atomic<T>.)
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0019r8.html includes some intro stuff about the motivation. CPUs can easily do this, compilers can already do this, and it's been annoying that C++ didn't expose this capability portably.
So instead of having to wrestle with C++ to get all the non-atomic allocation and init done in a constructor, you can just have every access create an atomic_ref to the element you want to access. (It's free to instantiate as a local, at least when it's lock-free, on any "normal" C++ implementations).
This will even let you do things like resize the std::vector<int> after you've ensured no other threads are accessing the vector elements or the vector control block itself. And then you can signal the other threads to resume.
It's not yet implemented in libstdc++ or libc++ for gcc/clang.
#include <vector>
#include <atomic>
#define Foo std // this atomic_ref.hpp puts it in namespace Foo, not std.
// current raw url for https://github.com/ORNL/cpp-proposals-pub/blob/master/P0019/atomic_ref.hpp
#include "https://raw.githubusercontent.com/ORNL/cpp-proposals-pub/580934e3b8cf886e09accedbb25e8be2d83304ae/P0019/atomic_ref.hpp"
void inc_element(std::vector<int> &v, size_t idx)
{
v[idx]++;
}
void atomic_inc_element(std::vector<int> &v, size_t idx)
{
std::atomic_ref<int> elem(v[idx]);
static_assert(decltype(elem)::is_always_lock_free,
"performance is going to suck without lock-free atomic_ref<T>");
elem.fetch_add(1, std::memory_order_relaxed); // take your pick of memory order here
}
For x86-64, these compile exactly the way we'd hope with GCC,
using the sample implementation (for compilers implementing GNU extensions) linked in the C++ working-group proposal. https://github.com/ORNL/cpp-proposals-pub/blob/master/P0019/atomic_ref.hpp
From the Godbolt compiler explorer with g++8.2 -Wall -O3 -std=gnu++2a:
inc_element(std::vector<int, std::allocator<int> >&, unsigned long):
mov rax, QWORD PTR [rdi] # load the pointer member of std::vector
add DWORD PTR [rax+rsi*4], 1 # and index it as a memory destination
ret
atomic_inc_element(std::vector<int, std::allocator<int> >&, unsigned long):
mov rax, QWORD PTR [rdi]
lock add DWORD PTR [rax+rsi*4], 1 # same but atomic RMW
ret
The atomic version is identical except it uses a lock prefix to make the read-modify-write atomic, by making sure no other core can read or write the cache line while this core is in the middle of atomically modifying it. Just in case you were curious how atomics work in asm.
Most non-x86 ISAs like AArch64 of course require a LL/SC retry loop to implement an atomic RMW, even with relaxed memory order.
The point here is that constructing / destructing the atomic_ref doesn't cost anything. Its member pointer fully optimizes away. So this is exactly as cheap as a vector<atomic<int>>, but without the headache.
As long as you're careful not to create data-race UB by resizing the vector, or accessing an element without going through atomic_ref. (It would potentially manifest as a use-after-free on many real implementations if std::vector reallocated the memory in parallel with another thread indexing into it, and of course you'd be atomically modifying a stale copy.)
This definitely gives you rope to hang yourself if you don't carefully respect the fact that the std::vector object itself is not atomic, and also that the compiler won't stop you from doing non-atomic access to the underlying v[idx] after other threads have started using it.
One way:
// Create.
std::vector<std::atomic<int>> v(100);
// Initialize.
for(auto& e : v)
e.store(0, std::memory_order_relaxed);
// Atomically increment.
auto unpredictable_index = std::rand() % v.size();
int old = v[unpredictable_index].fetch_add(1, std::memory_order_relaxed);
Note that std::atomic<> copy-constructor is deleted, so that the vector cannot be resized and needs to be initialized with the final count of elements.
Since resize functionality of std::vector is lost, instead of std::vector you may as well use std::unique_ptr<std::atomic<int>[]>, e.g.:
// Create.
unsigned const N = 100;
std::unique_ptr<std::atomic<int>[]> p(new std::atomic<int>[N]);
// Initialize.
for(unsigned i = 0; i < N; ++i)
p[i].store(0, std::memory_order_relaxed);
// Atomically increment.
auto unpredictable_index = std::rand() % N;
int old = p[unpredictable_index].fetch_add(1, std::memory_order_relaxed);

How to keep unused function in firmware image use arm-none-eabi-gcc toolchain?

I now try create a firmware image running STM32F0xx MCU. It's like flash algorithm, provide some function call to control STM32F0xx MCU Pins, but it's more complicated than flash algorithm. So it will use STM32 HAL lib and Mbed lib.
The Compiler/linker use "-ffunction-sections" and "-fdata-sections" flags.
So I use "attribute((used))" to try keep function into firmware image, but it's failed.
arm-none-eabi-gcc toolchain version is 4.9.3.
My codes like this:
extern "C" {
__attribute__((__used__)) void writeSPI(uint32_t value)
{
for (int i = 0; i < spiPinsNum; i++) {
spiPins[i] = (((value >> i) & 0x01) != 0) ? 1 : 0;
}
__ASM volatile ("movs r0, #0"); // set R0 to 0 show success
__ASM volatile ("bkpt #0"); // halt MCU
}
}
After build succeed, the writeSPI symbol no in image.
I also try static for function, the "-uXXXXX" flag, create a new section.
Question: How keep writeSPI function code with "-ffunction-sections" and "-fdata-sections" flags?
One way to ensure a wanted function doesn't get garbage collected is to create a function pointer to it within a method that is used. You don't have to do anything with the function pointer, just initialize it.
void(*dummy)(uint32_t)=&writeSPI;
An alternative would be to omit the -ffunction-sections flag from the compilation units that contain functions that should not be stripped, but that may involve significant restructuring of your code base.

Rewrite Intel-style assembly code into GCC inline assembly

How to write this assembly code as inline assembly? Compiler: gcc(i586-elf-gcc). The GAS syntax confuses me. Please give tell me how to write this as inline assembly that works for gcc.
.set_video_mode:
mov ah,00h
mov al,13h
int 10h
.init_mouse:
mov ax,0
int 33h
Similar one I have in assembly. I wrote them separate as assembly routines to call them from my C program. I need to call these and some more interrupts from C itself.
Also I need to put some values in some registers depending on which interrupt routine I'm calling. Please tell me how to do it.
All that I want to do is call interrupt routines from C. It's OK for me even to do it using int86() but i don't have source code of that function.
I want int86() so that i can call interrupts from C.
I am developing my own tiny OS so i got no restrictions for calling interrupts or for any direct hardware access.
I've not tested this, but it should get you started:
void set_video_mode (int x, int y) {
register int ah asm ("ah") = x;
register int al asm ("al") = y;
asm volatile ("int $0x10"
: /* no outputs */
: /* no inputs */
: /* clobbers */ "ah", "al");
}
I've put in two 'clobbers' as an example, but you'll need to set the correct list of clobbers so that the compiler knows you've overwritten register values (maybe none).
First, keep in mind GCC doesn't support 16-bit code yet, so you'll end up compiling 32-bit code in 16-bit mode, which is very inefficient but doable (it is used, for example, by Linux and SeaBIOS). It can be done with the following at the begging of each file:
__asm__ (".code16gcc");
Newer GCC versions (since 4.9 IIRC) support the -m16 flag that does the same thing.
Also, there's no mouse driver available unless you load it previous to your kernel running init_mouse.
You seem to be using an API commonly available in several x86 DOS.
asm can take care of the register assignments, so the code can be reduced to:
void set_video_mode(int mode)
{
mode &= 255;
__asm__ __volatile__ (
"int $0x10"
: "+a" (mode) /* %eax = mode & 255 => %ah = 0, %al = mode */
);
}
void init_mouse(void)
{
/* XXX it is really important to check the IDT entry isn't 0 */
int tmp = 0;
__asm__ __volatile__ (
"int $0x33"
: "+a" (tmp) /* %eax = 0*/
:: "ebx" /* %ebx is also clobbered by DOS mouse drivers */
);
}
The asm statement is documented in the GCC manual, although perhaps not in enough depth and lacks x86 examples. The outputs (after first colon) have a distinctively obscure syntax, while the rest is far easier to understand (the second colon specifies the inputs and the third the clobbered registers, flags and/or memory).
The outputs must be prefixed with =, meaning you don't care the previous value it may have had, or +, meaning you want to use it as an input too. In this context we use that instead of an input because the value is modified by the interrupt and you're not allowed to specify input registers in the clobbered list (because the compiler is forbidden from using them).

Resources