Why GCC (ARM Cortex-M0) generates UXTB instruction when it should know that data is already uint8 - gcc

I'm using a Cortex-M0 MCU from NXP (LPC845) and I'm trying to figure out what GCC is trying to do :)
Basically, the C code (pseudo) is as follows:
volatile uint8_t readb1 = 0x1a; // dummy
readb1 = GpioPadB(GPIO_PIN);
and the macro I wrote is
(*((volatile uint8_t*)(SOME_GPIO_ADDRESS)))
Now the code is working, but it produced some extra UXTB instruction I don't understand
00000378: ldrb r3, [r3, #0]
0000037a: ldr r2, [pc, #200] ; (0x444 <AppInit+272>)
0000037c: uxtb r3, r3
0000037e: strb r3, [r2, #0]
105 asm("nop");
My explanation is as follows:
load BYTE from address specified in R3, put result in R3 <-- this is load from GPIO register as BYTE
load in R2 address of readb1 variable
UXTB extends the uint8 value ??? But rotate argument is 0, so basically does nothing for uint8 !
store as BYTE to R2's address (my variable) data from R3
Why does that?
First of all, it should know that data in R3 has just a BYTE meaning (it already generates LDRB correctly). Second, the STRB will already trim 7..0 LSB so why using UXTB ?
Thanks for clarifications,
EDITED:
Compiler version:
gcc version 9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599] (GNU Tools for Arm Embedded Processors 9-2019-q4-major)
I use -O3

Looks like an extra instruction left in by the compiler and/or there is some nuance to the cortex-m or newer cores (would love to know what that nuance is).
#define GpioPadB(x) (*((volatile unsigned char *)(x)))
volatile unsigned char readb1;
void fun ( void )
{
readb1 = 0x1A;
readb1 = GpioPadB(0x1234000);
}
an apt gotten gcc
arm-none-eabi-gcc --version
arm-none-eabi-gcc (15:4.9.3+svn231177-1) 4.9.3 20150529 (prerelease)
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
arm-none-eabi-gcc -O2 -c -mthumb so.c -o so.o
arm-none-eabi-objdump -d so.o
00000000 <fun>:
0: 231a movs r3, #26
2: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
4: 7013 strb r3, [r2, #0]
6: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
8: 781b ldrb r3, [r3, #0]
a: 7013 strb r3, [r2, #0]
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
as one would expect.
arm-none-eabi-gcc -O2 -c -mthumb -march=armv7-m so.c -o so.o
arm-none-eabi-objdump -d so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <fun>:
0: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
2: 211a movs r1, #26
4: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
6: 7011 strb r1, [r2, #0]
8: 781b ldrb r3, [r3, #0]
a: b2db uxtb r3, r3
c: 7013 strb r3, [r2, #0]
e: 4770 bx lr
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
with the extra utxb instruction in there
Something a bit newer
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
for armv6m and armv7m
00000000 <fun>:
0: 231a movs r3, #26
2: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
4: 7013 strb r3, [r2, #0]
6: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
8: 781b ldrb r3, [r3, #0]
a: 7013 strb r3, [r2, #0]
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
for armv4t
00000000 <fun>:
0: 231a movs r3, #26
2: 4a03 ldr r2, [pc, #12] ; (10 <fun+0x10>)
4: 7013 strb r3, [r2, #0]
6: 4b03 ldr r3, [pc, #12] ; (14 <fun+0x14>)
8: 781b ldrb r3, [r3, #0]
a: 7013 strb r3, [r2, #0]
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
14: 01234000 .word 0x01234000
and the utxb is gone.
I think it is just a missed optimization, peephole or otherwise.
As answered already though, when you use non-gpr-sized variables you can expect and/or tolerate the compiler converting up to the register size. Varies by compiler and target as to whether they do it on the way in or the way out (when a variable is read or just before it is written or used down the road).
For x86 where you can access various portions of the register separately (or use memory based operands) you will see they do not do this (in gcc) even for cases when it clearly needs a sign extension or padding. And sort it out down the road when the value is used.
You can search the gcc sources for utxb and perhaps see the issue or a comment.
EDIT
Note that clang takes a different path, it burns clocks generating the address but does not do the extension
00000000 <fun>:
0: f240 0000 movw r0, #0
4: f2c0 0000 movt r0, #0
8: 211a movs r1, #26
a: 7001 strb r1, [r0, #0]
c: f244 0100 movw r1, #16384 ; 0x4000
10: f2c0 1123 movt r1, #291 ; 0x123
14: 7809 ldrb r1, [r1, #0]
16: 7001 strb r1, [r0, #0]
18: 4770 bx lr
clang --version
clang version 11.1.0 (https://github.com/llvm/llvm-project.git 1fdec59bffc11ae37eb51a1b9869f0696bfd5312)
Target: armv7m-none-unknown-eabi
Thread model: posix
InstalledDir: /opt/llvm11armv7m/bin
I think it is simply an optimization problem with gcc/gnu.

The "volatile" modifier is to blame. It does not call type extensions when written, because it doesn't make sense. But when reading, it always calls the extension. Because now the data is stored in a register, and must be ready for any operations, over the entire range of the visibility limit.
Abandoning "volatile" removes any additional operations on the data, but it can also remove the very fact of using the variable.
https://godbolt.org/z/cGvc8r6se

First of all, it should know that data in R3 has just a BYTE meaning
Registers are only 32 bits. They do not have any other "meaning". The register must contain the same value as the loaded byte - thus UXTB. Any other operation later (for example adding something requires the whole register to contain the correct value.
Generally speaking, using shorter types than 32 bit usually adds some overhead as Cortex-Mx processors do not do operations on the "portions" of the registers.

To fix this problem, you need to file a bug at https://gcc.gnu.org/bugzilla/. But there are two difficult situations.
There are a lot of bugs related to "volatile", and all of them are not closed, and most of them are not even confirmed. As far as I understand, the developers are already tired of fighting windmills, and do not even react to it.
To successfully fix the problem - you need to find the extreme, the very one that wrote the root of evil. Authorship and all. You will not be allowed into someone else's branch, and only the most advanced are allowed into the master.
But even before this moment, you need to find the reason for this behavior, and here again there are problems.
The GCC code is huge, you can search endlessly.
My personal opinion: GCC treats ARM kernel registers as part of fast memory. This memory can be accessed via a physical address, which only adds to the problems. Well, if this is memory, and the dimension does not match, then, according to GCC, you need to add expansion commands.
Why does GCC use the correct commands when simply accessed? - well, he reads from memory to memory. Emphasis - "from memory". No matter what happens next, you need to read it right now.

Related

No FPU support with gcc for ARM Cortex M?

I have the following function from a well known benchmark that I am compiling with gcc-arm-none-eabi-10-2020-q4-major:
#include <unistd.h>
double b[1000], c[1000];
void tuned_STREAM_Scale(double scalar)
{
ssize_t j;
for (j = 0; j < 1000; j++)
b[j] = scalar* c[j];
}
I am using the following compiler options:
arm-none-eabi-gcc -O3 -mcpu=cortex-m7 -mthumb -mfloat-abi=hard -mfpu=fpv5-sp-d16 -c test.c
However, if I check the compiled code, the compiler seems unable to use a basic FPU multiply instruction, and just uses the __aeabi_dmul function from libgcc (we can however see that a FPU vmov is used):
00000000 <tuned_STREAM_Scale>:
0: e92d 41f0 stmdb sp!, {r4, r5, r6, r7, r8, lr}
4: 4c08 ldr r4, [pc, #32] ; (28 <tuned_STREAM_Scale+0x28>)
6: 4d09 ldr r5, [pc, #36] ; (2c <tuned_STREAM_Scale+0x2c>)
8: f504 58fa add.w r8, r4, #8000 ; 0x1f40
c: ec57 6b10 vmov r6, r7, d0
10: e8f4 0102 ldrd r0, r1, [r4], #8
14: 4632 mov r2, r6
16: 463b mov r3, r7
18: f7ff fffe bl 0 <__aeabi_dmul>
1c: 4544 cmp r4, r8
1e: e8e5 0102 strd r0, r1, [r5], #8
22: d1f5 bne.n 10 <tuned_STREAM_Scale+0x10>
24: e8bd 81f0 ldmia.w sp!, {r4, r5, r6, r7, r8, pc}
If I compare with another compiler, the code is incomparably more efficient:
00000000 <tuned_STREAM_Scale>:
0: 4808 ldr r0, [pc, #32] ; (24 <tuned_STREAM_Scale+0x24>)
2: b580 push {r7, lr}
4: 4b06 ldr r3, [pc, #24] ; (20 <tuned_STREAM_Scale+0x20>)
6: 27c8 movs r7, #200 ; 0xc8
8: c806 ldmia r0!, {r1, r2}
a: ec42 1b11 vmov d1, r1, r2
e: ee20 1b01 vmul.f64 d1, d0, d1
12: 1e7f subs r7, r7, #1
14: ec52 1b11 vmov r1, r2, d1
18: c306 stmia r3!, {r1, r2}
1a: d1f5 bne.n 8 <tuned_STREAM_Scale+0x8>
1c: bd80 pop {r7, pc}
If I check inside gcc package the various libgcc object files depending on CPU or FPU options, I cannot find any FPU instructions in __aeabi_dmul or any other function.
I find very strange that gcc is not able to use a basic FPU multiplication, and I could not find in any documentation or README this limitation, so I am wondering if I am not doing anything wrong. I have checked older gcc versions and I still have this problem. Would it be due to gcc or to the compiled binaries from ARM?
The clue is in the compiler options you already posted:
-mfpu=fpv5-sp-d16 "sp" means single precision.
You told it not to generate hardware double instructions, which is correct for most Cortex-M7 processors because they can't execute them. If you have an M7 which can then you need to set the correct fpu argument.

ARM GCC hardfault when using -O2

When using ARM GCC g++ compiler with optimization level -O2 (and up) this code:
void foo(void)
{
DBB("#0x%08X: 0x%08X", 1, *((uint32_t *)1));
DBB("#0x%08X: 0x%08X", 0, *((uint32_t *)0));
}
Compiles to:
0800abb0 <_Z3foov>:
800abb0: b508 push {r3, lr}
800abb2: 2301 movs r3, #1
800abb4: 4619 mov r1, r3
800abb6: 681a ldr r2, [r3, #0]
800abb8: 4802 ldr r0, [pc, #8] ; (800abc4 <_Z3foov+0x14>)
800abba: f007 fa83 bl 80120c4 <debug_print_blocking>
800abbe: 2300 movs r3, #0
800abc0: 681b ldr r3, [r3, #0]
800abc2: deff udf #255 ; 0xff
800abc4: 08022704 stmdaeq r2, {r2, r8, r9, sl, sp}
And this gives me hardfault at undefined instruction #0x0800abc2.
Also, if there is more code after that, it is not compiled into final binary.
The question is why compiler generates it like that, why undefined istruction?
By the way, it works fine for stuff like this:
...
uint32_t num = 2;
num -= 2;
DBB("#0x%08X: 0x%08X", 0, *((uint32_t *)num));
...
Compiler version:
arm-none-eabi-g++.exe (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]
You can disable this (and verify this answer) by using -fno-delete-null-pointer-checks
The pointer you are passing has a value which matches the null pointer, and the compiler can see that from static analysis, so it faults (because that is the defined behaviour).
In your second example, the static analysis doesn't identify a NULL.

How to align to cache line GCC ldr pc-relative

In ARM, GCC uses the PC-relative load is usually used to load constants into registers. The idea is that you store the constant relative to the instruction loading the constant. E.g. the following instruction can be used to load a constant from the address PC+8+offset
ldr r0, [pc, #offset]
As result, the .text segment interleaves instructions and data. The latter usually stored at the end of function's code. E.g.
00010860 <call_weak_fn>:
10860: e59f3014 ldr r3, [pc, #20] ; 1087c <call_weak_fn+0x1c>
10864: e59f2014 ldr r2, [pc, #20] ; 10880 <call_weak_fn+0x20>
10868: e08f3003 add r3, pc, r3
1086c: e7932002 ldr r2, [r3, r2]
10870: e3520000 cmp r2, #0
10874: 012fff1e bxeq lr
10878: e1a00000 nop ; (mov r0, r0)
1087c: 00089790 muleq r8, r0, r7
10880: 00000074 andeq r0, r0, r4, ror r0
For a research project, I would like to ensure that code and constant never reside on the same cache line (i.e. block 64 bytes aligned).
Is it possible to align the constants generated by GCC?

Data Memory Barrier (DMB) in CMSIS libraries for Cortex-M3s

In the CMSIS definitions for gcc you can find something like this:
static __INLINE void __DMB(void) { __ASM volatile ("dmb"); }
My question is: what use does a memory barrier have if it does not declare "memory" in the clobber list?
Is it an error in the core_cm3.h or is there a reason why gcc should behave correctly without any additional help?
I did some testing with gcc 4.5.2 (built with LTO). If I compile this code:
static inline void __DMB(void) { asm volatile ("dmb"); }
static inline void __DMB2(void) { asm volatile ("dmb" ::: "memory"); }
char x;
char test1 (void)
{
x = 15;
return x;
}
char test2 (void)
{
x = 15;
__DMB();
return x;
}
char test3 (void)
{
x = 15;
__DMB2();
return x;
}
using arm-none-eabi-gcc -Os -mcpu=cortex-m3 -mthumb -c dmb.c, then from arm-none-eabi-objdump -d dmb.o I get this:
00000000 <test1>:
0: 4b01 ldr r3, [pc, #4] ; (8 <test1+0x8>)
2: 200f movs r0, #15
4: 7018 strb r0, [r3, #0]
6: 4770 bx lr
8: 00000000 .word 0x00000000
0000000c <test2>:
c: 4b02 ldr r3, [pc, #8] ; (18 <test2+0xc>)
e: 200f movs r0, #15
10: 7018 strb r0, [r3, #0]
12: f3bf 8f5f dmb sy
16: 4770 bx lr
18: 00000000 .word 0x00000000
0000001c <test3>:
1c: 4b03 ldr r3, [pc, #12] ; (2c <test3+0x10>)
1e: 220f movs r2, #15
20: 701a strb r2, [r3, #0]
22: f3bf 8f5f dmb sy
26: 7818 ldrb r0, [r3, #0]
28: 4770 bx lr
2a: bf00 nop
2c: 00000000 .word 0x00000000
It is obvious that __DBM() only inserts the dmb instruction and it takes DMB2() to actually force the compiler to flush the values cached in the registers.
I guess I found a CMSIS bug.
IMHO the CMSIS version is right.
Injecting the barrier instruction without the memory in clobber list achieves exactly what it is supposed to do:
If the previous write on "x" variable was buffered then it is committed. This is useful, for example, if you are going to pass "x" address to as a DMA address, or if you are going to setup MPU.
It has no effect on returning "x" (your program is guaranteed to be correct even if you omit memory barrier).
On the other hand by inserting memory in clobber list, you have no kind of effect in situations like the example before (DMA, MPU..).
The only difference in the latter case is that if you have for example an ISR modifying the value of "x" right after "strb" , then the value that will be returned is the value modified by the ISR, because the clobber caused the compiler to read from memory to register again.
But if you want to obtain this thing then you should use "volatile" variables.
In other words: the barrier forces cache vs memory commit in order to guarantee consistency with other HW resources that might access RAM memory, while clobbering memory causes compiler to stop assuming the memory has not changed and to read again in local registers, that is another thing with different purposes (it does not matter if a memory change is still in cache or already committed on RAM, because an eventual asm load operation is guaranteed to work in both cases without barriers).

Generating %pc relative address of constant data

Is there a way to have gcc generate %pc relative addresses of constants? Even when the string appears in the text segment, arm-elf-gcc will generate a constant pointer to the data, load the address of the pointer via a %pc relative address and then dereference it. For a variety of reasons, I need to skip the middle step. As an example, this simple function:
const char * filename(void)
{
static const char _filename[]
__attribute__((section(".text")))
= "logfile";
return _filename;
}
generates (when compiled with arm-elf-gcc-4.3.2 -nostdlib -c
-O3 -W -Wall logfile.c):
00000000 <filename>:
0: e59f0000 ldr r0, [pc, #0] ; 8 <filename+0x8>
4: e12fff1e bx lr
8: 0000000c .word 0x0000000c
0000000c <_filename.1175>:
c: 66676f6c .word 0x66676f6c
10: 00656c69 .word 0x00656c69
I would have expected it to generate something more like:
filename:
add r0, pc, #0
bx lr
_filename.1175:
.ascii "logfile\000"
The code in question needs to be partially position independent since it will be relocated in memory at load time, but also integrate with code that was not compiled -fPIC, so there is no global offset table.
My current work around is to call a non-inline function (which will be done via a %pc relative address) to find the offset from the compiled location in a technique similar to how -fPIC code works:
static intptr_t
__attribute__((noinline))
find_offset( void )
{
uintptr_t pc;
asm __volatile__ (
"mov %0, %%pc" : "=&r"(pc)
);
return pc - 8 - (uintptr_t) find_offset;
}
But this technique requires that all data references be fixed up manually, so the filename() function in the above example would become:
const char * filename(void)
{
static const char _filename[]
__attribute__((section(".text")))
= "logfile";
return _filename + find_offset();
}
Hmmm, maybe you have to compile it as -fPIC to get PIC. Or simply write it in assembler, assembler is a lot easier than the C you are writing.
00000000 :
0: e59f300c ldr r3, [pc, #12] ; 14
4: e59f000c ldr r0, [pc, #12] ; 18
8: e08f3003 add r3, pc, r3
c: e0830000 add r0, r3, r0
10: e12fff1e bx lr
14: 00000004 andeq r0, r0, r4
18: 00000000 andeq r0, r0, r0
0000001c :
1c: 66676f6c strbtvs r6, [r7], -ip, ror #30
20: 00656c69 rsbeq r6, r5, r9, ror #24
Are you getting the same warning I am getting?
/tmp/ccySyaUE.s: Assembler messages:
/tmp/ccySyaUE.s:35: Warning: ignoring changed section attributes for .text

Resources