ARM GCC hardfault when using -O2 - gcc

When using ARM GCC g++ compiler with optimization level -O2 (and up) this code:
void foo(void)
{
DBB("#0x%08X: 0x%08X", 1, *((uint32_t *)1));
DBB("#0x%08X: 0x%08X", 0, *((uint32_t *)0));
}
Compiles to:
0800abb0 <_Z3foov>:
800abb0: b508 push {r3, lr}
800abb2: 2301 movs r3, #1
800abb4: 4619 mov r1, r3
800abb6: 681a ldr r2, [r3, #0]
800abb8: 4802 ldr r0, [pc, #8] ; (800abc4 <_Z3foov+0x14>)
800abba: f007 fa83 bl 80120c4 <debug_print_blocking>
800abbe: 2300 movs r3, #0
800abc0: 681b ldr r3, [r3, #0]
800abc2: deff udf #255 ; 0xff
800abc4: 08022704 stmdaeq r2, {r2, r8, r9, sl, sp}
And this gives me a hardfault at the undefined instruction at 0x0800abc2.
Also, if there is more code after that, it is not compiled into the final binary.
The question is: why does the compiler generate it like that, and why an undefined instruction?
By the way, it works fine for stuff like this:
...
uint32_t num = 2;
num -= 2;
DBB("#0x%08X: 0x%08X", 0, *((uint32_t *)num));
...
Compiler version:
arm-none-eabi-g++.exe (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]

You can disable this (and verify this answer) by using -fno-delete-null-pointer-checks.
The pointer you are passing has a value that matches the null pointer, and the compiler can see that from static analysis. Dereferencing a null pointer is undefined behaviour, so at -O2 GCC replaces the dereference with a trap (the udf instruction) and drops the code that follows it.
In your second example, the static analysis doesn't identify a NULL.
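A minimal sketch of the assumption behind that flag, in portable C. The function name deref_then_check is hypothetical and not from the question. Once a pointer has been dereferenced, GCC may assume it cannot be null, which allows it to drop a later null check; -fno-delete-null-pointer-checks turns that assumption off.

```c
#include <stddef.h>

/* Hypothetical example: the pointer is dereferenced before it is checked. */
int deref_then_check(const int *p)
{
    int value = *p;      /* undefined behaviour if p is NULL */

    /* When GCC assumes null pointers are never dereferenced
       (-fdelete-null-pointer-checks), it may treat this test as
       always false and remove it. */
    if (p == NULL)
        return -1;

    return value;
}
```

For a valid pointer the function simply returns the pointed-to value; what happens for a null pointer depends on the optimization flags, which is the situation described in the question.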


Why GCC (ARM Cortex-M0) generates UXTB instruction when it should know that data is already uint8

I'm using a Cortex-M0 MCU from NXP (LPC845) and I'm trying to figure out what GCC is trying to do :)
Basically, the C code (pseudo) is as follows:
volatile uint8_t readb1 = 0x1a; // dummy
readb1 = GpioPadB(GPIO_PIN);
and the macro I wrote is
(*((volatile uint8_t*)(SOME_GPIO_ADDRESS)))
The code works, but the compiler produced an extra UXTB instruction I don't understand:
00000378: ldrb r3, [r3, #0]
0000037a: ldr r2, [pc, #200] ; (0x444 <AppInit+272>)
0000037c: uxtb r3, r3
0000037e: strb r3, [r2, #0]
105 asm("nop");
My explanation is as follows:
load BYTE from address specified in R3, put result in R3 <-- this is load from GPIO register as BYTE
load in R2 address of readb1 variable
UXTB extends the uint8 value ??? But rotate argument is 0, so basically does nothing for uint8 !
store as BYTE to R2's address (my variable) data from R3
Why does it do that?
First of all, it should know that the data in R3 has just a BYTE meaning (it already generates LDRB correctly). Second, STRB already stores only bits 7..0, so why use UXTB?
Thanks for clarifications,
EDITED:
Compiler version:
gcc version 9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599] (GNU Tools for Arm Embedded Processors 9-2019-q4-major)
I use -O3
Looks like an extra instruction left in by the compiler and/or there is some nuance to the cortex-m or newer cores (would love to know what that nuance is).
#define GpioPadB(x) (*((volatile unsigned char *)(x)))
volatile unsigned char readb1;
void fun ( void )
{
readb1 = 0x1A;
readb1 = GpioPadB(0x1234000);
}
With a gcc installed from apt:
arm-none-eabi-gcc --version
arm-none-eabi-gcc (15:4.9.3+svn231177-1) 4.9.3 20150529 (prerelease)
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
arm-none-eabi-gcc -O2 -c -mthumb so.c -o so.o
arm-none-eabi-objdump -d so.o
00000000 <fun>:
0: 231a movs r3, #26
2: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
4: 7013 strb r3, [r2, #0]
6: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
8: 781b ldrb r3, [r3, #0]
a: 7013 strb r3, [r2, #0]
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
as one would expect.
arm-none-eabi-gcc -O2 -c -mthumb -march=armv7-m so.c -o so.o
arm-none-eabi-objdump -d so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <fun>:
0: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
2: 211a movs r1, #26
4: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
6: 7011 strb r1, [r2, #0]
8: 781b ldrb r3, [r3, #0]
a: b2db uxtb r3, r3
c: 7013 strb r3, [r2, #0]
e: 4770 bx lr
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
with the extra uxtb instruction in there.
Something a bit newer
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
for armv6m and armv7m
00000000 <fun>:
0: 231a movs r3, #26
2: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
4: 7013 strb r3, [r2, #0]
6: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
8: 781b ldrb r3, [r3, #0]
a: 7013 strb r3, [r2, #0]
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
for armv4t
00000000 <fun>:
0: 231a movs r3, #26
2: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
4: 7013 strb r3, [r2, #0]
6: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
8: 781b ldrb r3, [r3, #0]
a: 7013 strb r3, [r2, #0]
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
and the uxtb is gone.
I think it is just a missed optimization, peephole or otherwise.
As answered already though, when you use non-GPR-sized variables you can expect and/or tolerate the compiler converting up to the register size. It varies by compiler and target whether they do it on the way in or on the way out (when a variable is read, or just before it is written or used down the road).
For x86, where you can access various portions of the register separately (or use memory-based operands), you will see that gcc does not do this, even in cases where a sign extension or padding is clearly needed; it sorts that out down the road when the value is used.
You can search the gcc sources for uxtb and perhaps see the issue or a comment.
EDIT
Note that clang takes a different path: it burns clocks generating the address but does not do the extension.
00000000 <fun>:
0: f240 0000 movw r0, #0
4: f2c0 0000 movt r0, #0
8: 211a movs r1, #26
a: 7001 strb r1, [r0, #0]
c: f244 0100 movw r1, #16384 ; 0x4000
10: f2c0 1123 movt r1, #291 ; 0x123
14: 7809 ldrb r1, [r1, #0]
16: 7001 strb r1, [r0, #0]
18: 4770 bx lr
clang --version
clang version 11.1.0 (https://github.com/llvm/llvm-project.git 1fdec59bffc11ae37eb51a1b9869f0696bfd5312)
Target: armv7m-none-unknown-eabi
Thread model: posix
InstalledDir: /opt/llvm11armv7m/bin
I think it is simply an optimization problem with gcc/gnu.
The "volatile" qualifier is to blame. On a write, no type extension is emitted, because it would make no sense. On a read, the extension is always emitted, because the data now lives in a register and must be ready for any operation for as long as it stays in scope.
Dropping "volatile" removes the extra operations on the data, but it may also let the compiler remove the use of the variable altogether.
https://godbolt.org/z/cGvc8r6se
"First of all, it should know that data in R3 has just a BYTE meaning"
Registers are only 32 bits wide. They do not have any other "meaning". The register must contain the same value as the loaded byte, thus UXTB. Any later operation (for example, adding something) requires the whole register to contain the correct value.
Generally speaking, using shorter types than 32 bit usually adds some overhead as Cortex-Mx processors do not do operations on the "portions" of the registers.
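A small portable C sketch of that point, with hypothetical function names. Once a byte has been loaded, arithmetic on it happens at full register width, so the upper 24 bits must already be zero; the value is only narrowed again when it is stored back as a byte. (LDRB itself zero-extends, which is why other answers here treat the extra UXTB as a missed optimization.)

```c
#include <stdint.h>

/* Hypothetical helper: the uint8_t operand is promoted to full width
   before the add, so the result can exceed 0xFF. */
uint32_t add_wide(uint8_t byte, uint32_t delta)
{
    return byte + delta;
}

/* Hypothetical helper: converting back to a byte keeps only bits 7..0,
   which is what a byte store (STRB) does in hardware. */
uint8_t add_narrow(uint8_t byte, uint8_t delta)
{
    return (uint8_t)(byte + delta);
}
```

With 0xFF and 1 as inputs, the wide version yields 0x100 while the narrow version wraps to 0, so the two only agree if the register really holds a clean zero-extended byte.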
To get this fixed, you need to file a bug at https://gcc.gnu.org/bugzilla/. But there are two difficulties.
There are a lot of open bugs related to "volatile"; none of them are closed, and most of them are not even confirmed. As far as I understand, the developers are tired of tilting at windmills and no longer react to them.
To actually fix the problem, you need to track down whoever wrote the offending code, authorship and all. You will not be allowed to commit to someone else's branch, and only the most experienced contributors may commit to master.
But even before that, you need to find the cause of this behavior, and here again there are problems.
The GCC code base is huge; you can search endlessly.
My personal opinion: GCC treats the ARM core registers as a kind of fast memory. This memory can be accessed via a physical address, which only adds to the problems. If this is memory and the size does not match, then, by GCC's logic, extension instructions have to be added.
Why does GCC use the correct instructions for a plain access? Because it reads from memory to memory, with the emphasis on "from memory": no matter what happens next, the read has to happen right now.

No FPU support with gcc for ARM Cortex M?

I have the following function from a well known benchmark that I am compiling with gcc-arm-none-eabi-10-2020-q4-major:
#include <unistd.h>
double b[1000], c[1000];
void tuned_STREAM_Scale(double scalar)
{
ssize_t j;
for (j = 0; j < 1000; j++)
b[j] = scalar* c[j];
}
I am using the following compiler options:
arm-none-eabi-gcc -O3 -mcpu=cortex-m7 -mthumb -mfloat-abi=hard -mfpu=fpv5-sp-d16 -c test.c
However, if I check the compiled code, the compiler seems unable to use a basic FPU multiply instruction, and just uses the __aeabi_dmul function from libgcc (we can however see that an FPU vmov is used):
00000000 <tuned_STREAM_Scale>:
0: e92d 41f0 stmdb sp!, {r4, r5, r6, r7, r8, lr}
4: 4c08 ldr r4, [pc, #32] ; (28 <tuned_STREAM_Scale+0x28>)
6: 4d09 ldr r5, [pc, #36] ; (2c <tuned_STREAM_Scale+0x2c>)
8: f504 58fa add.w r8, r4, #8000 ; 0x1f40
c: ec57 6b10 vmov r6, r7, d0
10: e8f4 0102 ldrd r0, r1, [r4], #8
14: 4632 mov r2, r6
16: 463b mov r3, r7
18: f7ff fffe bl 0 <__aeabi_dmul>
1c: 4544 cmp r4, r8
1e: e8e5 0102 strd r0, r1, [r5], #8
22: d1f5 bne.n 10 <tuned_STREAM_Scale+0x10>
24: e8bd 81f0 ldmia.w sp!, {r4, r5, r6, r7, r8, pc}
If I compare with another compiler, the code is incomparably more efficient:
00000000 <tuned_STREAM_Scale>:
0: 4808 ldr r0, [pc, #32] ; (24 <tuned_STREAM_Scale+0x24>)
2: b580 push {r7, lr}
4: 4b06 ldr r3, [pc, #24] ; (20 <tuned_STREAM_Scale+0x20>)
6: 27c8 movs r7, #200 ; 0xc8
8: c806 ldmia r0!, {r1, r2}
a: ec42 1b11 vmov d1, r1, r2
e: ee20 1b01 vmul.f64 d1, d0, d1
12: 1e7f subs r7, r7, #1
14: ec52 1b11 vmov r1, r2, d1
18: c306 stmia r3!, {r1, r2}
1a: d1f5 bne.n 8 <tuned_STREAM_Scale+0x8>
1c: bd80 pop {r7, pc}
If I check the various libgcc object files inside the gcc package, depending on CPU or FPU options, I cannot find any FPU instructions in __aeabi_dmul or any other function.
I find it very strange that gcc is not able to use a basic FPU multiplication, and I could not find this limitation in any documentation or README, so I am wondering whether I am doing something wrong. I have checked older gcc versions and I have the same problem. Could it be due to gcc itself or to the binaries compiled by ARM?
The clue is in the compiler options you already posted:
-mfpu=fpv5-sp-d16
"sp" means single precision.
You told it not to generate hardware double-precision instructions, which is correct for most Cortex-M7 processors because they can't execute them. If you have an M7 that can, you need to set the matching -mfpu argument (fpv5-d16 is the double-precision variant in GCC).
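For comparison, here is a sketch of the same loop in single precision, which the "sp" FPU selected in the question can execute in hardware. The names bf, cf, SCALE_N and scale_float are hypothetical; with the question's flags this version should compile to a hardware multiply rather than a library call, although that is worth confirming in the disassembly.

```c
#include <stddef.h>

#define SCALE_N 1000

/* Hypothetical single-precision counterparts of b[] and c[]. */
float bf[SCALE_N], cf[SCALE_N];

/* Same kernel as tuned_STREAM_Scale, but on float instead of double. */
void scale_float(float scalar)
{
    for (size_t j = 0; j < SCALE_N; j++)
        bf[j] = scalar * cf[j];
}
```

Whether float is acceptable depends on the benchmark: it changes both the precision and the amount of memory traffic being measured.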

How to force gcc generate thumb 32 bit instructions?

Is it possible to force generating thumb 32 bit instructions when possible?
For example I have:
int main(void) {
8000280: b480 push {r7}
8000282: b085 sub sp, #20
8000284: af00 add r7, sp, #0
uint32_t a, b, c;
a = 1;
8000286: 2301 movs r3, #1
8000288: 60fb str r3, [r7, #12]
b = 1;
800028a: 2301 movs r3, #1
800028c: 60bb str r3, [r7, #8]
c = a+b;
800028e: 68fa ldr r2, [r7, #12]
8000290: 68bb ldr r3, [r7, #8]
8000292: 4413 add r3, r2
8000294: 607b str r3, [r7, #4]
while (1) ;
8000296: e7fe b.n 8000296 <main+0x16>
But they are all 16-bit Thumb instructions. For testing reasons I want 32-bit Thumb instructions.
Well, I don't think you can do that directly, but there is a rather roundabout way of achieving it. Just a few days ago I saw a project where the compilation process was not just a simple arm-none-eabi-gcc ... -c file.c -o file.o. The project called arm-none-eabi-gcc to also generate an extended assembly listing, which was then assembled separately with arm-none-eabi-as into an object file. If you did it like that, then between these two steps you could modify the assembly listing to contain wide instructions only. In most (all?) cases you could just use sed to add the .w suffix to the instructions (change add r3, r2 into add.w r3, r2 and so on). Whether that level of build complication is worth it is up to you...

Using GCC's builtin functions in arm

I'm working on a cortex-m3 board with a bare-metal toolchain without libc.
I implemented memcpy which copies data byte-to-byte but it's too slow. In GCC manual, it says it provides __builtin_memcpy and I decided to use it. So here is the implementation with __builtin_memcpy.
#include <stddef.h>
void *memcpy(void *dest, const void *src, size_t n)
{
return __builtin_memcpy(dest,src,n);
}
When I compile this code, it becomes a recursive function which never ends.
$ arm-none-eabi-gcc -march=armv7-m -mcpu=cortex-m3 -mtune=cortex-m3 \
-O2 -ffreestanding -c memcpy.c -o memcpy.o
$ arm-none-eabi-objdump -d memcpy.o
memcpy.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <memcpy>:
0: f7ff bffe b.w 0 <memcpy>
Am I doing something wrong? How can I use the compiler-generated memcpy version?
Builtin functions are not supposed to be used to implement themselves :)
Builtin functions are supposed to be used in application code; the compiler then may or may not generate some special insn sequence or a call to the underlying real function.
Compare:
int a [10], b [20];
void
foo ()
{
__builtin_memcpy (a, b, 10 * sizeof (int));
}
This results in:
foo:
stmfd sp!, {r4, r5}
ldr r4, .L2
ldr r5, .L2+4
ldmia r4!, {r0, r1, r2, r3}
mov ip, r5
stmia ip!, {r0, r1, r2, r3}
ldmia r4!, {r0, r1, r2, r3}
stmia ip!, {r0, r1, r2, r3}
ldmia r4, {r0, r1}
stmia ip, {r0, r1}
ldmfd sp!, {r4, r5}
bx lr
But:
void
bar (int n)
{
__builtin_memcpy (a, b, n * sizeof (int));
}
results in a call to the memcpy function:
bar:
mov r2, r0, asl #2
stmfd sp!, {r3, lr}
ldr r1, .L5
ldr r0, .L5+4
bl memcpy
ldmfd sp!, {r3, lr}
bx lr
Theoretically, the library is not part of the C compiler and not part of the toolchain.
Thus, if you write memcpy(&a,&b,sizeof(a)), the compiler MUST generate a subroutine call.
The idea of __builtin is to inform the compiler that the function is standard and can be optimized. Thus, if you write __builtin_memcpy(&a,&b,sizeof(a)), the compiler MAY generate a subroutine call, but in most cases it will not. For example, if the size is known to be 4 at compile time, only a single move instruction will be generated. (Another advantage: even when a subroutine call is generated, the compiler knows that the library function has no side effects.)
So it's ALWAYS better to use __builtin_memcpy instead of memcpy. In modern libraries this is done with #define memcpy __builtin_memcpy right in string.h.
But you still need to implement memcpy somewhere, because a call will still be generated in the more complicated cases. For string functions on ARM, a 4-byte (word-at-a-time) implementation is strongly recommended.
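A minimal sketch of such a word-at-a-time copy, assuming GCC (the may_alias attribute is a GCC extension). The name my_memcpy is hypothetical so that the sketch can coexist with a hosted libc; in the bare-metal project it would be memcpy itself. Note that GCC can recognize a copy loop and turn it back into a call to memcpy, so when the function really is named memcpy it may be necessary to build this file with -fno-tree-loop-distribute-patterns.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC extension: a 32-bit type that is allowed to alias any object. */
typedef uint32_t __attribute__((may_alias)) word_t;

/* Hypothetical name; stands in for the project's own memcpy. */
void *my_memcpy(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Copy whole words only when both pointers are 4-byte aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3u) == 0) {
        word_t *dw = (word_t *)d;
        const word_t *sw = (const word_t *)s;

        for (; n >= 4; n -= 4)
            *dw++ = *sw++;

        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Copy the remaining bytes (or everything, if unaligned). */
    while (n--)
        *d++ = *s++;

    return dest;
}
```

The sketch falls back to byte copies for unaligned buffers instead of aligning the destination first; that keeps it short, at the cost of speed in the unaligned case.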

Generating %pc relative address of constant data

Is there a way to have gcc generate %pc relative addresses of constants? Even when the string appears in the text segment, arm-elf-gcc will generate a constant pointer to the data, load the address of the pointer via a %pc relative address and then dereference it. For a variety of reasons, I need to skip the middle step. As an example, this simple function:
const char * filename(void)
{
static const char _filename[]
__attribute__((section(".text")))
= "logfile";
return _filename;
}
generates (when compiled with arm-elf-gcc-4.3.2 -nostdlib -c -O3 -W -Wall logfile.c):
00000000 <filename>:
0: e59f0000 ldr r0, [pc, #0] ; 8 <filename+0x8>
4: e12fff1e bx lr
8: 0000000c .word 0x0000000c
0000000c <_filename.1175>:
c: 66676f6c .word 0x66676f6c
10: 00656c69 .word 0x00656c69
I would have expected it to generate something more like:
filename:
add r0, pc, #0
bx lr
_filename.1175:
.ascii "logfile\000"
The code in question needs to be partially position independent since it will be relocated in memory at load time, but also integrate with code that was not compiled -fPIC, so there is no global offset table.
My current workaround is to call a non-inline function (which will be done via a %pc relative address) to find the offset from the compiled location, in a technique similar to how -fPIC code works:
static intptr_t
__attribute__((noinline))
find_offset( void )
{
uintptr_t pc;
asm __volatile__ (
"mov %0, %%pc" : "=&r"(pc)
);
return pc - 8 - (uintptr_t) find_offset;
}
But this technique requires that all data references be fixed up manually, so the filename() function in the above example would become:
const char * filename(void)
{
static const char _filename[]
__attribute__((section(".text")))
= "logfile";
return _filename + find_offset();
}
Hmmm, maybe you have to compile it as -fPIC to get PIC. Or simply write it in assembler, assembler is a lot easier than the C you are writing.
00000000 :
0: e59f300c ldr r3, [pc, #12] ; 14
4: e59f000c ldr r0, [pc, #12] ; 18
8: e08f3003 add r3, pc, r3
c: e0830000 add r0, r3, r0
10: e12fff1e bx lr
14: 00000004 andeq r0, r0, r4
18: 00000000 andeq r0, r0, r0
0000001c :
1c: 66676f6c strbtvs r6, [r7], -ip, ror #30
20: 00656c69 rsbeq r6, r5, r9, ror #24
Are you getting the same warning I am getting?
/tmp/ccySyaUE.s: Assembler messages:
/tmp/ccySyaUE.s:35: Warning: ignoring changed section attributes for .text
