ARM Cortex M4 hard fault - floating point

ARM Cortex M4 hard fault - floating point - gcc

When I run my program, which just calculates a sine wave:
for(i = 0; i < ADS1299_SIGNAL_WINDOW; i++){
TEST[i] = (float32_t)(10.0f * (float32_t)(arm_sin_f32((float32_t)(3.14f * i/ADS1299_SIGNAL_WINDOW))));
}
The compiler generates the following line, which results in a hard fault:
800702a: ed2d 8b04 vpush {d8-d9}
What is happening? For reference, here are my flags for the compiler:
SETTINGS="-g -nostartfiles -mthumb -mthumb-interwork -march=armv7e-m -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -fsingle-precision-constant -fdata-sections -ffunction-sections -O3 -Wl,-T,../STM32F407VG_FLASH.ld"
DECLARE="-DARM_MATH_CM4 -D__FPU_PRESENT=1 -D__FPU_USED"
.... -larm_cortexM4lf_math

The problem is that you're doing both the CPACR enable, and some floating-point operations in the same scope. Because code in main uses floating-point registers, the compiler (being well-behaved and respecting the ABI), will emit code to preserve those registers on entry to main. Before any other code in main executes. Including the write to CAPCR which makes them accessible. Oops.
To avoid that, either enable FP in the CPACR before entry to main in a reset handler (if your toolchain allows), or simply do all FP operations in another function, and ensure main itself doesn't touch any FP registers.
It would also be wise (if you haven't already) to ensure you have a DSB; ISB synchronisation sequence after the CPACR write. Otherwise, you could potentially still get a fault from any stale FP instuctions already in the pipeline.

I think the problem is that the FPU is not enabled. I've got same problem with Nordic Semiconductors SDK examples on Keil 4. In the Keil IDE the check box for FPU enable is marked, but in the SystemInit code it is conditional compilation like this:
void SystemInit(void) {
#if (__FPU_USED == 1)
SCB->CPACR |= (3UL << 20) | (3UL << 22);
__DSB();
__ISB();
#endif
}
But I think the Keil 4 IDE does not set this __FPU_USED to 1 and on the VPUSH instruction I've got a HardFault because the FPU is not enabled.
I think you need to enable the FPU in SystemInit and the problem then will be solved.

If you use the FPU, the stack should be aligned to 8 bytes boundaries. If you are using an RTOS, check the thread stack initialization code. If you are running on pure bare metal, check the startup code for the stack setup.

Related

GCC hidden optimizations

The GCC manual lists all optimization flags being applied for the different levels of optimizations (-O1, -O2, etc.). However when compiling and measuring a benchmark program (e.g. cBench's automotive_bitcount) there is a significant difference when applying an optimization level instead of turning on all the listed optimizations manually. For -O1 with the automotive_bitcount program, I measured a speedup of roughly 100% when compiling with -O1 instead of manually applying all the listed flags. Those "hidden" optimizations seem in fact to be the main part of the optimization work GCC does for -O1. When applying the flags manually, I only get a speedup of about 10% compared to no optimizations.
The same can be observed when applying all enabled flags from gcc -c -Q -O3 --help=optimizers.
In the GCC manual I found this section which would explain this behavior:
Not all optimizations are controlled directly by a flag. Only optimizations that have a flag are listed in this section.
Most optimizations are completely disabled at -O0 or if an -O level is not set on the command line, even if individual optimization flags are specified.
Since I couldn't find any further documentation on those optimizations, I wonder if there is a way of controlling them and what the optimizations are in detail?

Some optimizations are directly gated by -O flags e.g. complete unroller:
{
public:
pass_complete_unrolli (gcc::context *ctxt)
: gimple_opt_pass (pass_data_complete_unrolli, ctxt)
{}
/* opt_pass methods: */
virtual bool gate (function *) { return optimize >= 2; }
virtual unsigned int execute (function *);
}; // class pass_complete_unrolli
and for others -O influences their internal algorithms e.g. in optimization of expressions:
/* If FROM is a SUBREG, put it into a register. Do this
so that we always generate the same set of insns for
better cse'ing; if an intermediate assignment occurred,
we won't be doing the operation directly on the SUBREG. */
if (optimize > 0 && GET_CODE (from) == SUBREG)
from = force_reg (from_mode, from);
There is no way to work around this, you have to use -O.

ld: 32-bit RIP relative reference out of range in simple gcc program

related to "ld: 32-bit RIP relative reference out of range" on Mac OSX but not solved and in a more complex context. the relevant computer(s) have >32GB of RAM.
static const int K=1024;
static const int M=K*K;
static const int G=K*M;
const int MC = G;
void donada(float *rvec, const int MC) { rvec[MC-1]= 1.0; return; }
float notused[1][MC]; // 4GB, ramp up with first index
float used[MC]; // 4GB
int main() {
donada( used, MC );
donada( notused[1], MC );
}
and gcc -Wall -o test test.cc. compiling this program not on osx yields
ld: 32-bit RIP relative reference out of range (4294967395 max is
+/-2GB): from _main (0x100000F92) to _used (0x200001000) in '_main' from /var/folders/yl/8gp3pgbn1l562ywg_q86rk6800\ 00z9/T/test-b3bebf.o
for architecture x86_64
on linux, there is a similar error
test.cc:(.text+0x18): relocation truncated to fit: R_X86_64_32 against symbol `used' defined in .bss section in /tmp/ccqcNh2C.o
I first thought compiler flag -Os would fix this, but it does not. It would be appropriate for gcc or clang to provide a more suggestive error message.

the relevant computer(s) have >32GB of RAM.
That's actually not very relevant. The issue is that 64-bit GCC defaults to -mcmodel=small, and you are trying to access data that is 4GiB away from its base symbol, which is not comatible with the small model.
From documentation:
-mcmodel=small
Generate code for the small code model: the program and its symbols
must be linked in the lower 2 GB of the address space. Pointers are 64 bits.
Programs can be statically or dynamically linked. This is the default code model.
-mcmodel=medium
Generate code for the medium model: The program is linked in the lower 2 GB
of the address space. Small symbols are also placed there.
Symbols with sizes larger than -mlarge-data-threshold are put into large data
or bss sections and can be located above 2GB.
Programs can be statically or dynamically linked.
-mcmodel=large
Generate code for the large model: This model makes no assumptions about addresses
and sizes of sections.
To correctly link your program, you need to use -mcmodel=large.
However note that this is not well tested (almost nobody does that), and that all code you (statically) link into your program will need to be built that way.
It is probably much better to dynamically allocate your arrays instead.
I first thought compiler flag -Os would fix this
It can't: -Os minimizes code size. Your program is that you are forcing the compiler to allocate very large contiguous data array. There is nothing the compiler could optimize for size there.

Stack Guard and Stack Smashing Protection - canaries, memory

I have a few questions about Stack Guard and SSP protections. First question is about Stack Guard and its three types of canaries, if I am correctly - terminator, random and random XOR.
I'd like to know, how to disabled Stack Guard on x86 Linux system? Somewhere I read, it's possible with this command, while compiling with gcc '-disable-stackguard-randomization', it's same like with this command for enable '-enable-stackguard-randomization', both doesn't work. If needed, my gcc version is 4.8.2.
Next question about Stack guard, when I will able to enable/disable it, how can I set, which type of canaries I want to use? What I read, terminator canaries are used by default, for random I have to compiled with '-enable-stackguard-randomization', but how about random XOR? (Or with null 0x00000000)
Now about SSP(ProPolice), I know, for random canary I have to compiled with 'fstack-protector-all', but how about terminator, is it same as in Stack Guard, by default?
Last one, if anyone of you, can tell me, where I can find random canary in memory. For example, I have this scenario - compiled C program, like 'gcc -g example.c -o example -fstack-protector-all', so with random canaries. Let's say, I'm able to get address of canary, after every execution. So expect, I have: Canary = 0x1ae3f900. From a different papers, I get some info, that canary is located in .bss segment. So I get address of .bss segment using readelf: 'readelf -a ./example | grep bss'. It's 080456c9. In gdb I set some breakpoints, to get address of canary, but when I check .bss address x/20x 0x080456c9, all I see are only 0x00000000 addresses, but
canary is nowhere. Plus, I checked __stack_chk_fail's if it isn't there, but with same result, I can't see it there. I get address of stack_chk_fail from PLT/GOT.
Thank in advance for your answer and time.

Stack Smashing Protection (SSP) is an improvement over StackGuard. SSP was first implemented in gcc 4.1.
I'd like to know, how to disabled Stack Guard on x86 Linux system?
Use -fno-stack-protector to disable the userland SSP.
The --disable-stackguard-randomization and --enable-stackguard-randomization are build options for glibc source code.
when I will able to enable/disable it, how can I set, which type of
canaries I want to use?
This is not configurable in gcc as far as I know. Since glibc 2.10, the stack canary is generated in a function called _dl_setup_stack_chk_guard. Here is some part of its code:
if (dl_random == NULL)
{
ret.bytes[sizeof (ret) - 1] = 255;
ret.bytes[sizeof (ret) - 2] = '\n';
}
else
{
memcpy (ret.bytes, dl_random, sizeof (ret));
ret.num &= ~(uintptr_t) 0xff;
}
dl_random holds the address of the auxiliary vector entry for AT_RANDOM, which is a 16-byte random value initialized by the kernel while creating the process. If you are running on a kernel or an emulator that doesn't initialize AT_RANDOM, the check dl_random == NULL would be true and the canary used is the terminator value with the first and second most significant bytes initialized to 255 and \n, respectively. All other bytes are zero. Usually AT_RANDOM is initialized by the kernel and so the least 7 significant bytes of AT_RANDOM are copied. The last byte of canary is set to zero.
So if you want to use a particular method to generate the canary, you can change this code and build you own glibc.
As an alternative method, #PeterCordes have suggested in the comments to write your canary value to memory location %%fs:0x28 (see the code below) at the top of the main function and restore the runtime-generated canary just before returning from main.
Now about SSP(ProPolice), I know, for random canary I have to compiled
with 'fstack-protector-all', but how about terminator, is it same as
in Stack Guard, by default?
All variants of the -fstack-protector option use SSP. These don't affect how the canary is generated.
Last one, if anyone of you, can tell me, where I can find random
canary in memory.
The canary is generated dynamically early at process startup; you can't use readelf to get the canary. According to this article, you can use the following code to get the canary when compiling for i386:
int read_canary()
{
int val = 0;
__asm__("movl %%gs:0x14, %0;"
: "=r"(val)
:
:);
return val;
}
and for x86_64:
long read_canary()
{
long val = 0;
__asm__("movq %%fs:0x28, %0;"
: "=r"(val)
:
:);
return val;
}

Where is the "Zero divide" done in kernel for Arm Cortex A-9

I am looking into kernel source code (2.6.35 ) for Zero divide .
I inserted Zero divide in user space program and all threads stopped.
So I want to know Where is the "Zero divide" done in kernel for Arm Cortex A-9?
I am not able to find any trap for this ....
Thanks

It depends on the architecture. Given the following user space code on an x86 system:
main() {
int x = 42 / 0;
}
the compiler inserts a idivl command into the object code. When this command is executed with a divisor of 0, the CPU generates a division by zero trap (similar to an interrupt). This calls the divide_error trap handler inside the kernel, in case of x86 it is located in arch/x86/kernel/entry_32.S:
ENTRY(divide_error)
RING0_INT_FRAME
pushl_cfi $0 # no error code
pushl_cfi $do_divide_error
jmp error_code
CFI_ENDPROC
END(divide_error)
The error_code target then takes care of all necessary actions to handle the error and finally returns from the trap.
On ARM, things are different: With a few exceptions, ARM CPUs do not have a hardware division instruction (e.g. Arm Cortex A-9 does not have one). Division needs to be implemented as a library function. For the kernel, this is implemented in arch/arm/lib/lib1funcs.S where you also find the division by zero handling. For user space applications, I suppose this is implemented as a library function in the libgcc library.

How to get GCC to use more than two SIMD registers when using intrinsics?

I am writing some code and trying to speed it up using SIMD intrinsics SSE2/3. My code is of such nature that I need to load some data into an XMM register and act on it many times. When I'm looking at the assembler code generated, it seems that GCC keeps flushing the data back to the memory, in order to reload something else in XMM0 and XMM1. I am compiling for x86-64 so I have 15 registers. Why is GCC using only two and what can I do to ask it to use more? Is there any way that I can "pin" some value in a register? I added the "register" keyword to my variable definition, but the generated assembly code is identical.

Yes, you can. Explicit Reg Vars talks about the syntax you need to pin a variable to a specific register.

If you're getting to the point where you're specifying individual registers for each intrinsic, you might as well just write the assembly directory, especially given gcc's nasty habit of pessimizing intrinsics unnecessarily in many cases.

It sounds like you compiled with optimization disabled, so no variables are kept in registers between C statements, not even int.
Compile with gcc -O3 -march=native to let the compiler make non-terrible asm, optimized for your machine. The default is -O0 with a "generic" target ISA and tuning.
See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why "debug" builds in general are like that, and the fact that register int foo; or register __m128 bar; can stay in a register even in a debug build. But it's much better to actually have the compiler optimize, as well as using registers, if you want your code to run fast overall!

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

ARM Cortex M4 hard fault - floating point - gcc

If you use the FPU, the stack should be aligned to 8 bytes boundaries. If you are using an RTOS, check the thread stack initialization code. If you are running on pure bare metal, check the startup code for the stack setup.

Related

GCC hidden optimizations

ld: 32-bit RIP relative reference out of range in simple gcc program

Stack Guard and Stack Smashing Protection - canaries, memory

Where is the "Zero divide" done in kernel for Arm Cortex A-9

How to get GCC to use more than two SIMD registers when using intrinsics?

Categories

Resources