Debugger stops inside mathematical functions that have floating point arguments - gcc

I'm using Keil uVision with the gcc compiler (Sourcery CodeBench Lite for ARM EABI) to program the STM32F4 Cortex-M4 chip.
The compiler control strings I have set are:
-march=armv7e-m -mfpu=fpv4-sp-d16 -mfloat-abi=softfp -std=gnu99 -fsingle-precision-constant
When the debugger encounters some mathematical functions (e.g. asinf(), atan2f() etc), it stops.
I have checked that the arguments for these functions are also single-precision.
I think it is because of a missing compiler directive for the use of VFP floating point, but I was unable to identify it.
Is there anything I have missed out?
The disassembly code of an example I did:
The debugger can evaluate atan2f(0.3,0.4), but stops at 0x0803B9CA when it evaluates atan2f(a,b). I don't know why it works with constants but not with variables.
377: float a = 0.3;
0x0803B9BA 4B1E LDR r3,[pc,#120] ; #0x0803BA34
0x0803B9BC 63BB STR r3,[r7,#0x38]
378: float b = 0.4;
379:
0x0803B9BE 4B1E LDR r3,[pc,#120] ; #0x0803BA38
0x0803B9C0 637B STR r3,[r7,#0x34]
380: float c = atan2f(0.3,0.4);
0x0803B9C2 4B1E LDR r3,[pc,#120] ; #0x0803BA3C
0x0803B9C4 633B STR r3,[r7,#0x30]
381: float d = atan2f(a,b);
382:
0x0803B9C6 6BB8 LDR r0,[r7,#0x38]
0x0803B9C8 6B79 LDR r1,[r7,#0x34]
0x0803B9CA F004F993 BL.W atan2f (0x0803FCF4)
0x0803B9CE 62F8 STR r0,[r7,#0x2C]

On the STM32F4 you first need to enable the FPU. Otherwise the first floating-point instruction raises a UsageFault (NOCP, no coprocessor), which escalates to the HardFault_Handler unless the UsageFault handler has been enabled.
You can do it in C/C++ anywhere before you use floating point instructions (maybe at the beginning of main()?). Assuming you use the CMSIS library and have the core_m4.h included (maybe through stm32f4xx.h):
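void cortexm4f_enable_fpu(void) {
/* set CP10 and CP11 Full Access */
SCB->CPACR |= ((3UL << 10*2)|(3UL << 11*2));
__DSB(); /* wait for the CPACR write to complete */
__ISB(); /* flush the pipeline so the following instructions see the FPU enabled */
}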
The alternative is assembler code in the startup file:
/*enable fpu begin*/
ldr r0, =0xe000ed88 /*; enable cp10,cp11 */
ldr r1,[r0]
ldr r2, =0xf00000
orr r1,r1,r2
str r1,[r0]
dsb /* wait for the store to complete */
isb /* reset the pipeline now that the FPU is enabled */
/*enable fpu end*/
(I found the code somewhere on the internet, don't know where though. I used it myself, it works).
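Maybe your problem is located there? It would also explain why constants work: in your disassembly, atan2f(0.3,0.4) never calls the library (the compiler folded it into a constant at source line 380), so only atan2f(a,b) actually runs library code that may execute FPU instructions.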
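To check that the C version and the assembler version set the same bits, here is a small host-side sketch (plain C, no CMSIS) that builds the CPACR mask and applies it to a simulated register. The names cpacr_fpu_mask() and enable_fpu_access() are made up for illustration; on the target you would pass &SCB->CPACR instead of a local variable, and still follow the write with the barriers shown above.

```c
#include <stdint.h>

/* CPACR (0xE000ED88): CP10 occupies bits [21:20], CP11 bits [23:22].
 * The value 0b11 in a field means "full access". */
#define CPACR_CP10_POS    20u
#define CPACR_CP11_POS    22u
#define CPACR_FULL_ACCESS 3UL

/* Hypothetical helper: mask that grants full access to CP10 and CP11 (the FPU). */
static uint32_t cpacr_fpu_mask(void)
{
    return (uint32_t)((CPACR_FULL_ACCESS << CPACR_CP10_POS) |
                      (CPACR_FULL_ACCESS << CPACR_CP11_POS));
}

/* Hypothetical helper: set the FPU access bits without touching other bits.
 * On the target: enable_fpu_access(&SCB->CPACR); */
static void enable_fpu_access(volatile uint32_t *cpacr)
{
    *cpacr |= cpacr_fpu_mask();
}
```

The mask comes out as 0x00F00000, the same constant the assembler version loads into r2.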

Related

arm gcc aarch32 compile longlong constants param why skip r1 register?

using toolchains:
"gcc-arm-none-eabi-9-2020-q2-update"
build cmd:
"arm-none-eabi-gcc -MMD -g -Wno-discarded-qualifiers -O0 -mcpu=cortex-r52 -c -DGCC -mthumb -mfloat-abi=hard -mfpu=fp-armv8 -nostartfiles -ffreestanding -falign-functions=16 -falign-jumps=8 -falign-loops=8 -fomit-frame-pointer -funroll-loops printf.c -o printf.o"
Found that the code:
printf("test hex long number = 0x%lx\n", 0x123456789abcdef0ul);
Compiled as:
401372: a315 add r3, pc, #84 ; (adr r3, 4013c8 <printf_test+0xd8>)
401374: e9d3 2300 ldrd r2, r3, [r3]
401378: f245 201c movw r0, #21020 ; 0x521c
40137c: f2c0 0040 movt r0, #64 ; 0x40
401380: f7ff ff76 bl 401270 <_printf>
Why isn't the r1 register used to pass the parameter?
That makes _printf print unexpected output:
test hex long number = 0x9abcdef000000000
How can I fix or work around this?
I want _printf to print the expected 0x123456789abcdef0.
The use of r2/r3 is correct. The AAPCS ABI specifies that 8-byte objects (or more precisely, objects needing 8-byte alignment) shall be passed in an even/odd register pair. See Section 6.3 stage C3. This is most likely so that ldrd/strd can be used, as they have this same restriction.
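To make the even/odd rule concrete, here is a hypothetical host-side model of how AAPCS assigns core registers r0-r3 to 4-byte and 8-byte integer or pointer arguments. It only models scalars (where splitting an argument between registers and stack cannot arise); aapcs_assign() and AAPCS_STACK are made-up names, not part of any ABI library.

```c
#include <stddef.h>

#define AAPCS_STACK (-1)  /* marker: argument is passed on the stack */

/* sizes[i] is the size in bytes (4 or 8) of argument i; 8-byte scalars need
 * doubleword alignment. first_reg[i] receives the first core register used
 * (0 = r0 ... 3 = r3) or AAPCS_STACK. */
static void aapcs_assign(const int *sizes, size_t n, int *first_reg)
{
    int ncrn = 0;                        /* next core register number */
    for (size_t i = 0; i < n; i++) {
        int words = (sizes[i] + 3) / 4;
        if (sizes[i] == 8 && (ncrn & 1))
            ncrn++;                      /* stage C3: round up to an even register */
        if (ncrn + words <= 4) {
            first_reg[i] = ncrn;         /* fits: rN (and rN+1 for 8 bytes) */
            ncrn += words;
        } else {
            first_reg[i] = AAPCS_STACK;  /* no room left: stack */
            ncrn = 4;                    /* core registers are not back-filled */
        }
    }
}
```

For printf(const char *, unsigned long long) this yields r0 for the format string and r2:r3 for the 64-bit value, leaving r1 unused, which matches the disassembly in the question.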
The bug in your program is that 0x123456789abcdef0ul is of type unsigned long long despite the ul suffix, since it is too large for the 32-bit unsigned long. As such you need to use the %llx format specifier with it. If you do, then printf will correctly find the argument in r2/r3 and everything works fine.
With the code as it is, you ought to get a compiler warning about the format specifier not matching the argument type.
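A minimal sketch of the fix, runnable on a host with the standard snprintf: use %llx for an unsigned long long, or the PRIx64 macro from <inttypes.h> for a uint64_t, which expands to the right length modifier on both 32-bit and 64-bit targets. Whether your own _printf understands the ll modifier depends on its implementation; format_ull() and format_u64() are hypothetical helper names.

```c
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* On 32-bit ARM unsigned long is 32 bits, so this literal has type
 * unsigned long long even when written with a "ul" suffix. */
#define BIG 0x123456789abcdef0ULL

/* %llx matches unsigned long long. */
static int format_ull(char *buf, size_t len, unsigned long long v)
{
    return snprintf(buf, len, "0x%llx", v);
}

/* PRIx64 matches uint64_t portably. */
static int format_u64(char *buf, size_t len, uint64_t v)
{
    return snprintf(buf, len, "0x%" PRIx64, v);
}
```

Both helpers produce the string 0x123456789abcdef0.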

How to have GCC combine "move r10, r3; store r10" into a "store r3"?

I'm working Power9 and utilizing the hardware random number generator instruction called DARN. I have the following inline assembly:
uint64_t val;
__asm__ __volatile__ (
"xor 3,3,3 \n" // r3 = 0
"addi 4,3,-1 \n" // r4 = -1, failure
"1: \n"
".byte 0xe6, 0x05, 0x61, 0x7c \n" // r3 = darn 3, 1
"cmpd 3,4 \n" // r3 == -1?
"beq 1b \n" // retry on failure
"mr %0,3 \n" // val = r3
: "=g" (val) : : "r3", "r4", "cc"
);
I had to add a mr %0,3 with "=g" (val) because I could not get GCC to produce expected code with "=r3" (val). Also see Error: matching constraint not valid in output operand.
A disassembly shows:
(gdb) b darn.cpp : 36
(gdb) r v
...
Breakpoint 1, DARN::GenerateBlock (this=<optimized out>,
output=0x7fffffffd990 "\b", size=0x100) at darn.cpp:77
77 DARN64(output+i*8);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.ppc64le libgcc-4.8.5-28.el7_5.1.ppc64le libstdc++-4.8.5-28.el7_5.1.ppc64le
(gdb) disass
Dump of assembler code for function DARN::GenerateBlock(unsigned char*, unsigned long):
...
0x00000000102442b0 <+48>: addi r10,r8,-8
0x00000000102442b4 <+52>: rldicl r10,r10,61,3
0x00000000102442b8 <+56>: addi r10,r10,1
0x00000000102442bc <+60>: mtctr r10
=> 0x00000000102442c0 <+64>: xor r3,r3,r3
0x00000000102442c4 <+68>: addi r4,r3,-1
0x00000000102442c8 <+72>: darn r3,1
0x00000000102442cc <+76>: cmpd r3,r4
0x00000000102442d0 <+80>: beq 0x102442c8 <DARN::GenerateBlock(unsigned char*, unsigned long)+72>
0x00000000102442d4 <+84>: mr r10,r3
0x00000000102442d8 <+88>: stdu r10,8(r9)
Notice GCC faithfully reproduces the:
0x00000000102442d4 <+84>: mr r10,r3
0x00000000102442d8 <+88>: stdu r10,8(r9)
How do I get GCC to fold the two instructions into:
0x00000000102442d4 <+84>: stdu r3,8(r9)
GCC will never remove text that's part of the asm template; it doesn't even parse it other than substituting in for %operand. It's literally just a text substitution before the asm is sent to the assembler.
You have to leave out the mr from your inline asm template, and tell gcc that your output is in r3 (or use a memory-destination output operand, but don't do that). If your inline-asm template ever starts or ends with mov instructions, you're usually doing it wrong.
Use register uint64_t foo asm("r3"); to force "=r"(foo) to pick r3 on platforms that don't have specific-register constraints.
(Despite ISO C++17 removing the register keyword, this GNU extension still works with -std=c++17. You can also use register uint64_t foo __asm__("r3"); if you want to avoid the asm keyword. You probably still need to treat register as a reserved word in source that uses this extension; that's fine. ISO C++ removing it from the base language doesn't force implementations to not use it as part of an extension.)
Or better, don't hard-code a register number. Use an assembler that supports the DARN instruction. (But apparently it's so new that even up-to-date clang lacks it, and you'd only want this inline asm as a fallback for gcc too old to support the __builtin_darn() intrinsic)
Using these constraints will let you remove the register setup, too, and use foo=0 / bar=-1 before the inline asm statement, and use "+r"(foo).
But note that darn's output register is write-only. There's no need to zero r3 first. I found a copy of IBM's POWER ISA instruction set manual that is new enough to include darn here: https://wiki.raptorcs.com/w/images/c/cb/PowerISA_public.v3.0B.pdf#page=96
In fact, you don't need to loop inside the asm at all, you can leave that to the C and only wrap the one asm instruction, like inline-asm is designed for.
uint64_t random_asm() {
register uint64_t val asm("r3");
do {
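//__asm__ __volatile__ ("darn 3, 1");
// the .byte sequence below is the big-endian encoding of darn 3,1; on little-endian ppc64le
// use ".long 0x7c6105e6" instead, which the assembler emits in the target's byte order
__asm__ __volatile__ (".byte 0x7c, 0x61, 0x05, 0xe6 # gcc asm operand = %0\n" : "=r" (val));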
} while(val == -1ULL);
return val;
}
compiles cleanly (on the Godbolt compiler explorer) to
random_asm():
.L6: # compiler-generated label, no risk of name clashes
.byte 0x7c, 0x61, 0x05, 0xe6 # gcc asm operand = 3
cmpdi 7,3,-1 # compare-immediate
beq 7,.L6
blr
Just as tight as your loop, with less setup. (Are you sure you even need to zero r3 before the asm instruction?)
This function can inline anywhere you want it to, allowing gcc to emit a store instruction that reads r3 directly.
In practice, you'll want to use a retry counter, as advised in the manual: if the hardware RNG is broken, it might give you failure forever so you should have a fallback to a PRNG. (Same for x86's rdrand)
Deliver A Random Number (darn) - Programming Note
When the error value is obtained, software is
expected to repeat the operation. If a non-error
value has not been obtained after several attempts,
a software random number generation method
should be used. The recommended number of
attempts may be implementation specific. In the
absence of other guidance, ten attempts should be
adequate.
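A minimal sketch of that advice, assuming the single darn instruction is wrapped in its own function (for example random_asm() without its do/while loop) and passed in as a function pointer so a host build can substitute test doubles. hw_rng_fn, random_with_retry() and the stubs are hypothetical names, and the xorshift64 fallback is only a placeholder: it is not suitable for cryptographic use.

```c
#include <stdint.h>

#define DARN_ERROR        UINT64_MAX  /* darn signals failure with all-ones (-1) */
#define DARN_MAX_ATTEMPTS 10          /* the manual's default suggestion */

/* Hypothetical hook: one execution of darn. */
typedef uint64_t (*hw_rng_fn)(void);

/* Placeholder software fallback (xorshift64); NOT cryptographically secure. */
static uint64_t fallback_state = 0x9E3779B97F4A7C15ULL;

static uint64_t fallback_prng(void)
{
    uint64_t x = fallback_state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    fallback_state = x;
    return x;
}

/* Bounded retry, then fall back; *used_hw reports which source produced the value. */
static uint64_t random_with_retry(hw_rng_fn hw, int *used_hw)
{
    for (int i = 0; i < DARN_MAX_ATTEMPTS; i++) {
        uint64_t v = hw();
        if (v != DARN_ERROR) {
            *used_hw = 1;
            return v;
        }
    }
    *used_hw = 0;
    return fallback_prng();
}

/* Test doubles standing in for the hardware on a host build. */
static int stub_calls;
static uint64_t stub_broken_rng(void)  { stub_calls++; return DARN_ERROR; }
static uint64_t stub_working_rng(void) { stub_calls++; return 42; }
```

With a broken generator the loop gives up after DARN_MAX_ATTEMPTS calls instead of spinning forever.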
xor-zeroing is not efficient on most fixed-instruction-width ISAs, because a mov-immediate is just as short so there's no need to detect and special-case an xor. (And thus CPU designs don't spend transistors on it). Moreover, dependency rules for the PPC asm equivalent of C++11 std::memory_order_consume require it to carry a dependency on the input register, so it couldn't be dependency-breaking even if the designers wanted it to. xor-zeroing is only a thing on x86 and maybe a few other variable-width ISAs.
Use li r3, 0 like gcc does for int foo(){return 0;} https://godbolt.org/z/-gHI4C.

Really Minimal STM32 Application: linker failure

I'm building a tiny microcontroller application with only the bare essentials, for self-educational purposes. This way, I can refresh my knowledge about topics like the linkerscript, the startup code, ...
EDIT:
I got quite a lot of comments pointing out that the "absolute minimal STM32-application" shown below is no good. You are absolutely right to notice that the vector table is not complete, the .bss-section is not taken care of, the peripheral addresses are not complete, ... Please allow me to explain why.
It has never been the purpose of the author to write a complete and useful application in this particular chapter. His purpose was to explain step-by-step how a linkerscript works, how startup code works, what the boot procedure of an STM32 looks like, ... purely for educational purposes. I can appreciate this approach, and learned a lot.
The example I have put below is taken from the middle of the chapter in question. The chapter keeps adding more parts to the linkerscript and startup code (for example initialization of .bss-section) as it goes forward.
I posted files from the middle of his chapter because I got stuck at a particular error message, and I want to get that fixed before continuing.
The chapter in question is somewhere at the end of his book. It is intended for the more experienced or curious reader who wants to gain deeper knowledge about topics most people don't even consider (most people use the standard linkerscript and startup code given by the manufacturer without ever reading it).
Keeping this in mind, please let us focus on the technical issue at hand (as described below in the error messages). Please also accept my sincere apologies that I didn't clarify the intentions of the writer earlier. But I've done it now, so we can move on ;-)
1. Absolute minimal STM32-application
The tutorial I'm following is chapter 20 of the book "Mastering STM32" (https://leanpub.com/mastering-stm32). The book explains how to make a tiny microcontroller application with two files: main.c and linkerscript.ld. As I'm not using an IDE (like Eclipse), I also added build.bat and clean.bat to run the compilation commands. So my project folder holds main.c, linkerscript.ld, build.bat and clean.bat.
Before I continue, I should perhaps give some more details about my system:
OS: Windows 10, 64-bit
Microcontroller: NUCLEO-F401RE board with STM32F401RE microcontroller.
Compiler: arm-none-eabi-gcc version 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437].
The main file looks like this:
/* ------------------------------------------------------------ */
/* Minimal application */
/* for NUCLEO-F401RE */
/* ------------------------------------------------------------ */
typedef unsigned long uint32_t;
/* Memory and peripheral start addresses (common to all STM32 MCUs) */
#define FLASH_BASE 0x08000000
#define SRAM_BASE 0x20000000
#define PERIPH_BASE 0x40000000
/* Work out end of RAM address as initial stack pointer
* (specific of a given STM32 MCU) */
#define SRAM_SIZE 96*1024 //STM32F401RE has 96 KB of RAM
#define SRAM_END (SRAM_BASE + SRAM_SIZE)
/* RCC peripheral addresses applicable to GPIOA
* (specific of a given STM32 MCU) */
#define RCC_BASE (PERIPH_BASE + 0x23800)
#define RCC_APB1ENR ((uint32_t*)(RCC_BASE + 0x30))
/* GPIOA peripheral addresses
* (specific of a given STM32 MCU) */
#define GPIOA_BASE (PERIPH_BASE + 0x20000)
#define GPIOA_MODER ((uint32_t*)(GPIOA_BASE + 0x00))
#define GPIOA_ODR ((uint32_t*)(GPIOA_BASE + 0x14))
/* Function headers */
int main(void);
void delay(uint32_t count);
/* Minimal vector table */
uint32_t *vector_table[] __attribute__((section(".isr_vector"))) = {
(uint32_t*)SRAM_END, // initial stack pointer (MSP)
(uint32_t*)main // main as Reset_Handler
};
/* Main function */
int main() {
/* Enable clock on GPIOA peripheral */
*RCC_APB1ENR = 0x1;
/* Configure the PA5 as output pull-up */
*GPIOA_MODER |= 0x400; // Sets MODER[11:10] = 0x1
while(1) { // Always true
*GPIOA_ODR = 0x20;
delay(200000);
*GPIOA_ODR = 0x0;
delay(200000);
}
}
void delay(uint32_t count) {
while(count--);
}
The linkerscript looks like this:
/* ------------------------------------------------------------ */
/* Linkerscript */
/* for NUCLEO-F401RE */
/* ------------------------------------------------------------ */
/* Memory layout for STM32F401RE */
MEMORY
{
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
SRAM (xrw) : ORIGIN = 0x20000000, LENGTH = 96K
}
/* The ENTRY(..) directive overrides the default entry point symbol _start.
* Here we define the main-routine as the entry point.
* In fact, the ENTRY(..) directive is meaningless for embedded chips,
* but it is informative for debuggers. */
ENTRY(main)
SECTIONS
{
/* Program code into FLASH */
.text : ALIGN(4)
{
*(.isr_vector) /* Vector table */
*(.text) /* Program code */
*(.text*) /* Merge all .text.* sections inside the .text section */
KEEP(*(.isr_vector)) /* Don't allow other tools to strip this off */
} >FLASH
_sidata = LOADADDR(.data); /* Used by startup code to initialize data */
.data : ALIGN(4)
{
. = ALIGN(4);
_sdata = .; /* Create a global symbol at data start */
*(.data)
*(.data*)
. = ALIGN(4);
_edata = .; /* Define a global symbol at data end */
} >SRAM AT >FLASH
}
The build.bat file calls the compiler on main.c, and next the linker:
@echo off
setlocal EnableDelayedExpansion
echo.
echo ----------------------------------------------------------------
echo. )\ ***************************
echo. ( =_=_=_=^< ^| * build NUCLEO-F401RE *
echo. )( ***************************
echo. ""
echo.
echo.
echo. Call the compiler on main.c
echo.
@arm-none-eabi-gcc main.c -o main.o -c -MMD -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O0 -g3 -Wall -fmessage-length=0 -Werror-implicit-function-declaration -Wno-comment -Wno-unused-function -ffunction-sections -fdata-sections
echo.
echo. Call the linker
echo.
@arm-none-eabi-gcc main.o -o myApp.elf -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -specs=nosys.specs -specs=nano.specs -T linkerscript.ld -Wl,-Map=output.map -Wl,--gc-sections
echo.
echo. Post build
echo.
@arm-none-eabi-objcopy -O binary myApp.elf myApp.bin
arm-none-eabi-size myApp.elf
echo.
echo ----------------------------------------------------------------
The clean.bat file removes all the compiler output:
@echo off
setlocal EnableDelayedExpansion
echo ----------------------------------------------------------------
echo. __ **************
echo. __\ \___ * clean *
echo. \ _ _ _ \ **************
echo. \_`_`_`_\
echo.
del /f /q main.o
del /f /q main.d
del /f /q myApp.bin
del /f /q myApp.elf
del /f /q output.map
echo ----------------------------------------------------------------
Building this works. I get the following output:
C:\Users\Kristof\myProject>build
----------------------------------------------------------------
)\ ***************************
( =_=_=_=< | * build NUCLEO-F401RE *
)( ***************************
""
Call the compiler on main.c
Call the linker
Post build
text data bss dec hex filename
112 0 0 112 70 myApp.elf
----------------------------------------------------------------
2. Proper startup code
Maybe you have noticed that the minimal application didn't have proper startup code to initialize the global variables in the .data-section. Chapter 20.2.2 .data and .bss Sections initialization from the "Mastering STM32" book explains how to do this.
As I follow along, my main.c file now looks like this:
/* ------------------------------------------------------------ */
/* Minimal application */
/* for NUCLEO-F401RE */
/* ------------------------------------------------------------ */
typedef unsigned long uint32_t;
/* Memory and peripheral start addresses (common to all STM32 MCUs) */
#define FLASH_BASE 0x08000000
#define SRAM_BASE 0x20000000
#define PERIPH_BASE 0x40000000
/* Work out end of RAM address as initial stack pointer
* (specific of a given STM32 MCU) */
#define SRAM_SIZE 96*1024 //STM32F401RE has 96 KB of RAM
#define SRAM_END (SRAM_BASE + SRAM_SIZE)
/* RCC peripheral addresses applicable to GPIOA
* (specific of a given STM32 MCU) */
#define RCC_BASE (PERIPH_BASE + 0x23800)
#define RCC_APB1ENR ((uint32_t*)(RCC_BASE + 0x30))
/* GPIOA peripheral addresses
* (specific of a given STM32 MCU) */
#define GPIOA_BASE (PERIPH_BASE + 0x20000)
#define GPIOA_MODER ((uint32_t*)(GPIOA_BASE + 0x00))
#define GPIOA_ODR ((uint32_t*)(GPIOA_BASE + 0x14))
/* Function headers */
void __initialize_data(uint32_t*, uint32_t*, uint32_t*);
void _start (void);
int main(void);
void delay(uint32_t count);
/* Minimal vector table */
uint32_t *vector_table[] __attribute__((section(".isr_vector"))) = {
(uint32_t*)SRAM_END, // initial stack pointer (MSP)
(uint32_t*)_start // _start as Reset_Handler
};
/* Variables defined in linkerscript */
extern uint32_t _sidata;
extern uint32_t _sdata;
extern uint32_t _edata;
volatile uint32_t dataVar = 0x3f;
/* Data initialization */
inline void __initialize_data(uint32_t* flash_begin, uint32_t* data_begin, uint32_t* data_end) {
uint32_t *p = data_begin;
while(p < data_end)
*p++ = *flash_begin++;
}
/* Entry point */
void __attribute__((noreturn,weak)) _start (void) {
__initialize_data(&_sidata, &_sdata, &_edata);
main();
for(;;);
}
/* Main function */
int main() {
/* Enable clock on GPIOA peripheral */
*RCC_APB1ENR = 0x1;
/* Configure the PA5 as output pull-up */
*GPIOA_MODER |= 0x400; // Sets MODER[11:10] = 0x1
while(dataVar == 0x3f) { // Always true
*GPIOA_ODR = 0x20;
delay(200000);
*GPIOA_ODR = 0x0;
delay(200000);
}
}
void delay(uint32_t count) {
while(count--);
}
I've added the initialization code just above the main(..) function. The linkerscript has also some modification:
/* ------------------------------------------------------------ */
/* Linkerscript */
/* for NUCLEO-F401RE */
/* ------------------------------------------------------------ */
/* Memory layout for STM32F401RE */
MEMORY
{
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
SRAM (xrw) : ORIGIN = 0x20000000, LENGTH = 96K
}
/* The ENTRY(..) directive overrides the default entry point symbol _start.
* In fact, the ENTRY(..) directive is meaningless for embedded chips,
* but it is informative for debuggers. */
ENTRY(_start)
SECTIONS
{
/* Program code into FLASH */
.text : ALIGN(4)
{
*(.isr_vector) /* Vector table */
*(.text) /* Program code */
*(.text*) /* Merge all .text.* sections inside the .text section */
KEEP(*(.isr_vector)) /* Don't allow other tools to strip this off */
} >FLASH
_sidata = LOADADDR(.data); /* Used by startup code to initialize data */
.data : ALIGN(4)
{
. = ALIGN(4);
_sdata = .; /* Create a global symbol at data start */
*(.data)
*(.data*)
. = ALIGN(4);
_edata = .; /* Define a global symbol at data end */
} >SRAM AT >FLASH
}
The little application doesn't build anymore. Compiling main.c to main.o still works, but the link step fails:
C:\Users\Kristof\myProject>build
----------------------------------------------------------------
)\ ***************************
( =_=_=_=< | * build NUCLEO-F401RE *
)( ***************************
""
Call the compiler on main.c
Call the linker
c:/gnu_arm_embedded_toolchain/bin/../lib/gcc/arm-none-eabi/6.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m/fpv4-sp/hard/crt0.o: In function `_start':
(.text+0x64): undefined reference to `__bss_start__'
c:/gnu_arm_embedded_toolchain/bin/../lib/gcc/arm-none-eabi/6.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m/fpv4-sp/hard/crt0.o: In function `_start':
(.text+0x68): undefined reference to `__bss_end__'
collect2.exe: error: ld returned 1 exit status
Post build
arm-none-eabi-objcopy: 'myApp.elf': No such file
arm-none-eabi-size: 'myApp.elf': No such file
----------------------------------------------------------------
3. What I've tried
I've omitted this part, otherwise this question gets too long ;-)
4. Solution
@berendi provided the solution. Thank you @berendi! Apparently I need to add the flags -nostdlib and -ffreestanding to gcc and the linker. The build.bat file now looks like this:
@echo off
setlocal EnableDelayedExpansion
echo.
echo ----------------------------------------------------------------
echo. )\ ***************************
echo. ( =_=_=_=^< ^| * build NUCLEO-F401RE *
echo. )( ***************************
echo. ""
echo.
echo.
echo. Call the compiler on main.c
echo.
@arm-none-eabi-gcc main.c -o main.o -c -MMD -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O0 -g3 -Wall -fmessage-length=0 -Werror-implicit-function-declaration -Wno-comment -Wno-unused-function -ffunction-sections -fdata-sections -ffreestanding -nostdlib
echo.
echo. Call the linker
echo.
@arm-none-eabi-gcc main.o -o myApp.elf -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -specs=nosys.specs -specs=nano.specs -T linkerscript.ld -Wl,-Map=output.map -Wl,--gc-sections -ffreestanding -nostdlib
echo.
echo. Post build
echo.
@arm-none-eabi-objcopy -O binary myApp.elf myApp.bin
arm-none-eabi-size myApp.elf
echo.
echo ----------------------------------------------------------------
Now it works!
In his answer, @berendi also makes a few interesting remarks about the main.c file. I've applied most of them:
Missing volatile keyword
Empty loop
Missing Memory Barrier (did I put the memory barrier in the correct place?)
Missing delay after RCC enable
Misleading symbolic name (apparently it should be RCC_AHB1ENR instead of RCC_APB1ENR).
The vector table: this part I've skipped. Right now I don't really need a HardFault_Handler, MemManage_Handler, ... as this is just a tiny test for educational purposes.
Nevertheless, I did notice that @berendi made a few interesting modifications to the way he declares the vector table. But I'm not entirely grasping what he's doing exactly.
The main.c file now looks like this:
/* ------------------------------------------------------------ */
/* Minimal application */
/* for NUCLEO-F401RE */
/* ------------------------------------------------------------ */
typedef unsigned long uint32_t;
/**
\brief Data Synchronization Barrier
\details Acts as a special kind of Data Memory Barrier.
It completes when all explicit memory accesses before this instruction complete.
*/
__attribute__((always_inline)) static inline void __DSB(void)
{
__asm volatile ("dsb 0xF":::"memory");
}
/* Memory and peripheral start addresses (common to all STM32 MCUs) */
#define FLASH_BASE 0x08000000
#define SRAM_BASE 0x20000000
#define PERIPH_BASE 0x40000000
/* Work out end of RAM address as initial stack pointer
* (specific of a given STM32 MCU) */
#define SRAM_SIZE 96*1024 //STM32F401RE has 96 KB of RAM
#define SRAM_END (SRAM_BASE + SRAM_SIZE)
/* RCC peripheral addresses applicable to GPIOA
* (specific of a given STM32 MCU) */
#define RCC_BASE (PERIPH_BASE + 0x23800)
#define RCC_AHB1ENR ((volatile uint32_t*)(RCC_BASE + 0x30))
/* GPIOA peripheral addresses
* (specific of a given STM32 MCU) */
#define GPIOA_BASE (PERIPH_BASE + 0x20000)
#define GPIOA_MODER ((volatile uint32_t*)(GPIOA_BASE + 0x00))
#define GPIOA_ODR ((volatile uint32_t*)(GPIOA_BASE + 0x14))
/* Function headers */
void __initialize_data(uint32_t*, uint32_t*, uint32_t*);
void _start (void);
int main(void);
void delay(uint32_t count);
/* Minimal vector table */
uint32_t *vector_table[] __attribute__((section(".isr_vector"))) = {
(uint32_t*)SRAM_END, // initial stack pointer (MSP)
(uint32_t*)_start // _start as Reset_Handler
};
/* Variables defined in linkerscript */
extern uint32_t _sidata;
extern uint32_t _sdata;
extern uint32_t _edata;
volatile uint32_t dataVar = 0x3f;
/* Data initialization */
inline void __initialize_data(uint32_t* flash_begin, uint32_t* data_begin, uint32_t* data_end) {
uint32_t *p = data_begin;
while(p < data_end)
*p++ = *flash_begin++;
}
/* Entry point */
void __attribute__((noreturn,weak)) _start (void) {
__initialize_data(&_sidata, &_sdata, &_edata);
asm volatile("":::"memory"); // <- Did I put this instruction at the right spot?
main();
for(;;);
}
/* Main function */
int main() {
/* Enable clock on GPIOA peripheral */
*RCC_AHB1ENR = 0x1;
__DSB();
/* Configure the PA5 as output pull-up */
*GPIOA_MODER |= 0x400; // Sets MODER[11:10] = 0x1
while(dataVar == 0x3f) { // Always true
*GPIOA_ODR = 0x20;
delay(200000);
*GPIOA_ODR = 0x0;
delay(200000);
}
}
void delay(uint32_t count) {
while(count--){
asm volatile("");
}
}
PS: The book "Mastering STM32" from Carmine Noviello is an absolute masterpiece. You should read it! => https://leanpub.com/mastering-stm32
You can tell gcc not to use the library.
The Compiler
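By default, gcc assumes that you are using a standard C library, and can emit code that calls some of its functions. For example, when optimizations are enabled, it detects loops that copy a piece of memory, and may substitute them with a call to memcpy(). Disable it with -ffreestanding. (Note that GCC's documentation still expects even a freestanding environment to provide memcpy, memmove, memset and memcmp, so calls to those can still appear.)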
The Linker
The linker assumes as well that you want to link your program with the C library and startup code. The library startup code is responsible for initializing the library and the program execution environment. It has a function named _start() which has to be called after reset. One of its jobs is to fill the .bss segment (see below) with zeros. If the symbols that delimit .bss are not defined, then _start() cannot be linked. Had you named your startup function anything other than _start(), the library startup code would have been silently dropped by the linker as an unused function, and the code could have been linked.
You can tell the linker not to link any standard library or startup code with -nostdlib, then the library supplied startup function name would not conflict with yours, and you would get a linker error every time you accidentally invoked a library function.
Missing volatile
Your register definitions are missing the volatile qualifier. Without it, the compiler may optimize away the repeated writes to *GPIOA_ODR, for example by treating them as loop-invariant code and moving them out of the loop. Changing the type in the register definitions to (volatile uint32_t*) fixes that.
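Here is a host-runnable sketch of volatile register access, using a plain variable in place of GPIOA_MODER. It also clears the 2-bit mode field before setting it, so stale bits cannot combine with the new value (the plain |= 0x400 in the question only works if bits [11:10] are already 00). gpio_set_mode() is a hypothetical helper; the mode encoding (0 input, 1 output, 2 alternate function, 3 analog) is the one from the STM32F4 reference manual.

```c
#include <stdint.h>

/* Set the 2-bit mode field of GPIO pin `pin` in a MODER-style register. */
static void gpio_set_mode(volatile uint32_t *moder, unsigned pin, uint32_t mode)
{
    uint32_t shift = 2u * pin;
    uint32_t v = *moder;        /* exactly one volatile read */
    v &= ~(3u << shift);        /* clear MODERy[1:0] */
    v |= (mode & 3u) << shift;  /* insert the new mode */
    *moder = v;                 /* exactly one volatile write */
}
```

On the target the call would be gpio_set_mode(GPIOA_MODER, 5, 1), which sets MODER[11:10] to 01 (0x400) just like the original code.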
Empty loop
The optimizer can recognize that the delay loop does nothing, and eliminate it completely to speed up execution. Add an empty but non-removable asm volatile(""); instruction to the delay loop.
Missing Memory Barrier
You are initializing the .data section that holds dataVar in a C function. The *p in __initialize_data() is effectively an alias for dataVar, and the compiler has no way to know it. The optimizer could theoretically rearrange the test of dataVar before __initialize_data(). Even if dataVar is volatile, *p is not, therefore ordering is not guaranteed.
After the data initialization loop, you should tell the compiler that program variables are changed by a mechanism unknown to the compiler:
asm volatile("":::"memory");
It's an old-fashioned gcc extension; newer C standards may define a portable way to do this, but older gcc versions would not recognize it.
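A minimal host-side sketch of the startup sequence with the barrier in place: copy the .data image, zero .bss, then the compiler barrier, all before main() would run. On the target the pointers would come from linker-script symbols such as _sidata, _sdata and _edata (plus .bss delimiters, which the linker script in the question does not define yet); here ordinary arrays stand in for flash and RAM, and init_sections() is a hypothetical name.

```c
#include <stdint.h>

/* Copy .data from its load address (flash) to its run address (RAM), then
 * clear .bss. */
static void init_sections(const uint32_t *load, uint32_t *data_begin, uint32_t *data_end,
                          uint32_t *bss_begin, uint32_t *bss_end)
{
    uint32_t *p = data_begin;
    while (p < data_end)
        *p++ = *load++;
    p = bss_begin;
    while (p < bss_end)
        *p++ = 0;
    /* Compiler barrier: memory changed in ways GCC cannot see through the
     * aliases above, so later reads of globals (e.g. dataVar) must not be
     * moved before this point. */
    __asm__ volatile("" ::: "memory");
}
```

Placing the barrier at the end of the initialization (which, once inlined, is just before the call to main()) is equivalent to the spot chosen in the question.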
Missing delay after RCC enable
The Errata is saying,
A delay between an RCC peripheral clock enable and the effective peripheral enabling should be taken into account in order to manage the peripheral read/write to registers.
This delay depends on the peripheral mapping:
• If the peripheral is mapped on AHB: the delay should be equal to 2 AHB cycles.
• If the peripheral is mapped on APB: the delay should be equal to 1 + (AHB/APB prescaler) cycles.
Workarounds
Use the DSB instruction to stall the Cortex®-M4 CPU pipeline until the instruction is completed.
Therefore, insert a
__DSB();
after *RCC_APB1ENR = 0x1; (which should be called something else)
Misleading symbolic name
Although the address for enabling GPIOA in RCC seems to be correct, the register is called RCC_AHB1ENR in the documentation. It will confuse people trying to understand your code.
The Vector Table
Although technically you can get away with having only a stack pointer and a reset handler in it, I'd also recommend having a few more entries, at least the fault handlers, for simple troubleshooting.
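/* provided elsewhere: __stack by the linker script (e.g. __stack = ORIGIN(SRAM) + LENGTH(SRAM);), handlers in C */
extern unsigned long __stack;
void Reset_Handler(void), NMI_Handler(void), HardFault_Handler(void);
void MemManage_Handler(void), BusFault_Handler(void), UsageFault_Handler(void);
/* const: the table lives in flash; used: always emitted (the linker script still needs KEEP() with --gc-sections) */
__attribute__ ((section(".isr_vector"),used))
void (* const _vectors[]) (void) = {
(void (*)(void))(&__stack), /* entry 0: initial MSP value, not code */
Reset_Handler,
NMI_Handler,
HardFault_Handler,
MemManage_Handler,
BusFault_Handler,
UsageFault_Handler
};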
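To show what such a table is, here is a host-runnable sketch: a const array of function pointers whose first entry is really a stack address, plus a common pattern (not taken from the answer) where unimplemented handlers are weak aliases of a single hypothetical Default_Handler, so the table links even before you write them. It assumes GCC or Clang on an ELF target such as Linux or arm-none-eabi, since alias attributes are not supported everywhere. INITIAL_SP is SRAM_END from the question (0x20000000 + 96 KB).

```c
#include <stdint.h>

typedef void (*vector_fn)(void);

/* Fallback for every exception without a dedicated handler. */
void Default_Handler(void)
{
    for (;;) { }  /* park here so a debugger shows where execution ended up */
}

/* Weak aliases: resolve to Default_Handler unless another file defines them. */
void NMI_Handler(void)       __attribute__((weak, alias("Default_Handler")));
void HardFault_Handler(void) __attribute__((weak, alias("Default_Handler")));

void Reset_Handler(void)
{
    /* .data/.bss initialization and the call to main() would go here */
}

#define INITIAL_SP 0x20018000u  /* SRAM_END for the STM32F401RE */

#if defined(__arm__)
#define VECTOR_ATTR __attribute__((section(".isr_vector"), used))
#else
#define VECTOR_ATTR __attribute__((used))  /* host build: no special section */
#endif

/* Entry 0 is loaded into MSP by the core at reset; entry 1 is the reset vector. */
VECTOR_ATTR void (*const vectors[])(void) = {
    (vector_fn)(uintptr_t)INITIAL_SP,
    Reset_Handler,
    NMI_Handler,
    HardFault_Handler,
};
```

Defining a real HardFault_Handler in another file would override the weak alias without touching the table.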
The Linker Script
At the bare minimum, it must define a section for your vector table, and the code. A program must have a start address and some code, static data is optional. The rest depends on what kind of data your program is using. You could technically omit them from the linker script if there are no data of a particular type.
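.rodata: read-only data; const arrays and structs go here. They remain in flash. (Simple const variables are usually folded into the code.)
.data: initialized variables, i.e. globals and statics declared with a nonzero initializer and without const. Their initial values are stored in flash and copied to RAM by the startup code.
.bss: globals and statics that have no initializer or are initialized to zero. C requires them to start out as zero, so the startup code must clear this region.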
As you don't need .rodata or .bss now, it's fine.
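As an illustration, this is where GCC typically places each kind of object by default; it is worth verifying with arm-none-eabi-nm or objdump -h on your own build. With -fdata-sections each object gets its own input section (.rodata.lookup, .data.blink_period, .bss.rx_buffer, ...), which patterns like *(.data*) collect. All names are made up for the example.

```c
#include <stdint.h>

const uint32_t lookup[4] = {1, 2, 4, 8}; /* .rodata: stays in flash                   */
uint32_t blink_period = 200000;          /* .data: image in flash, copied to RAM      */
uint32_t rx_buffer[16];                  /* .bss: no initializer, zeroed at startup   */
static uint32_t ticks = 0;               /* .bss as well: explicit zero initializer   */

uint32_t tick(void)
{
    uint32_t local = lookup[ticks & 3u]; /* locals use the stack or registers, no section */
    ticks++;
    return local;
}
```

If the startup code skipped the .bss clearing, rx_buffer and ticks would start with whatever the RAM happened to contain.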
Linker scripts in general are an art form: they are their own programming language, and GNU's are certainly a bit of a nightmare. Separate the task of figuring out the linker script from the task of making a working binary; once you can see that the linker script is doing what you want, write the bootstrap code that uses it. Take advantage of the toolchain.
The example the author used was derived from code written specifically as bare-metal examples that maximize the chance of success: it avoided common language and toolchain issues while remaining portable across many versions of the toolchain and easy to port to other toolchains (minimal reliance on the toolchain, in particular on the linker script, which in turn drives the bootstrap). The book's author used that code but added risk to it, making it a less reliable example.
Avoiding .data specifically and not relying on .bss to be zeroed when you write baremetal code goes a very long way toward long term success.
It was also modified so that optimization can keep the code from working (or at least from blinking at a rate you can see).
An example somewhat minimal linker script for binutils that you can modify to work toward .data and .bss initialization looks generically like this
test.ld
MEMORY
{
bob : ORIGIN = 0x8000, LENGTH = 0x1000
ted : ORIGIN = 0xA000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > bob
__data_rom_start__ = .;
.data : {
__data_start__ = .;
*(.data*)
} > ted AT > bob
__data_end__ = .;
__data_size__ = __data_end__ - __data_start__;
.bss : {
__bss_start__ = .;
*(.bss*)
} > ted
__bss_end__ = .;
__bss_size__ = __bss_end__ - __bss_start__;
}
(Note that memory region names don't have to be rom, ram, flash, data or anything in particular: here bob is program space and ted is RAM. Change the addresses as desired.)
To see what is going on, link either a simple example or your own code; you need some .data and some .bss (and some .text).
vectors.s
.thumb
.globl _start
_start:
.word 0x20001000
.word reset
.thumb_func
reset:
bl notmain
b .
.globl bss_start
bss_start: .word __bss_start__
.globl bss_end
bss_end: .word __bss_end__
.word __bss_size__
.globl data_rom_start
data_rom_start:
.word __data_rom_start__
.globl data_start
data_start:
.word __data_start__
.globl data_end
data_end:
.word __data_end__
.word __data_size__
so.c
unsigned int a=1;
unsigned int b=2;
unsigned int c;
unsigned int d;
unsigned int e;
unsigned int notmain ( void )
{
return(a+b+c+d+e);
}
build
arm-none-eabi-as vectors.s -o vectors.o
arm-none-eabi-gcc -O2 -c -mthumb so.c -o so.o
arm-none-eabi-ld -T test.ld vectors.o so.o -o vectors.elf
arm-none-eabi-objdump -D vectors.elf
The code so far is not specific to the arm-none-whatever or arm-linux-whatever versions of the toolchain. If/when you need libgcc items, you can link with gcc instead of ld, but you have to be careful when doing that... or provide the path to libgcc and keep using ld.
What we get from this code is linker script debugging on the cheap:
Disassembly of section .text:
00008000 <_start>:
8000: 20001000 andcs r1, r0, r0
8004: 00008009 andeq r8, r0, r9
00008008 <reset>:
8008: f000 f810 bl 802c <notmain>
800c: e7fe b.n 800c <reset+0x4>
0000800e <bss_start>:
800e: 0000a008 andeq sl, r0, r8
00008012 <bss_end>:
8012: 0000a014 andeq sl, r0, r4, lsl r0
8016: 0000000c andeq r0, r0, ip
0000801a <data_rom_start>:
801a: 00008058 andeq r8, r0, r8, asr r0
0000801e <data_start>:
801e: 0000a000 andeq sl, r0, r0
00008022 <data_end>:
8022: 0000a008 andeq sl, r0, r8
8026: 00000008 andeq r0, r0, r8
...
We care about the 32-bit values being created; the andeq disassembly appears only because the disassembler is trying to decode those values as instructions, which they are not. The reset instructions are real; the rest are 32-bit values we are generating. You might be able to use readelf instead, but get used to disassembling: making sure the vector table is correct is step one, and that is easy to see in the disassembly. Using the disassembler as a habit then leads naturally to using it, as above, to show you what the linker generated.
If you don't get the linker script variables right, you won't be able to write a successful bootstrap; if you don't have a good way to see what the linker is producing, you will fail on a regular basis.
Yes, you could have exposed them in C rather than assembly; the toolchain would still help you there.
You can work toward this now that you can see what the linker is doing:
.thumb
.globl _start
_start:
.word 0x20001000
.word reset
.thumb_func
reset:
ldr r0,=__bss_start__
ldr r1,=__bss_size__
# zero this
ldr r0,=__data_rom_start__
ldr r1,=__data_start__
ldr r2,=__data_size__
# copy this
bl notmain
b .
giving something like this
00008000 <_start>:
8000: 20001000 andcs r1, r0, r0
8004: 00008009 andeq r8, r0, r9
00008008 <reset>:
8008: 4803 ldr r0, [pc, #12] ; (8018 <reset+0x10>)
800a: 4904 ldr r1, [pc, #16] ; (801c <reset+0x14>)
800c: 4804 ldr r0, [pc, #16] ; (8020 <reset+0x18>)
800e: 4905 ldr r1, [pc, #20] ; (8024 <reset+0x1c>)
8010: 4a05 ldr r2, [pc, #20] ; (8028 <reset+0x20>)
8012: f000 f80b bl 802c <notmain>
8016: e7fe b.n 8016 <reset+0xe>
8018: 0000a008 andeq sl, r0, r8
801c: 0000000c andeq r0, r0, ip
8020: 00008058 andeq r8, r0, r8, asr r0
8024: 0000a000 andeq sl, r0, r0
8028: 00000008 andeq r0, r0, r8
0000802c <notmain>:
802c: 4b06 ldr r3, [pc, #24] ; (8048 <notmain+0x1c>)
802e: 6818 ldr r0, [r3, #0]
8030: 685b ldr r3, [r3, #4]
8032: 18c0 adds r0, r0, r3
If you then align the items in the linker script, the copy/zero code gets even simpler: you can stick to 1 to N whole registers rather than dealing with bytes or halfwords, use ldr/str, ldrd/strd (if available) or ldm/stm (and not need ldrb/strb or ldrh/strh), and finish the job with tight, simple loops of a few lines.
I highly recommend you do not use C for your bootstrap.
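This is not a C bootstrap, in line with the recommendation above. It is a host-side C model of what the "# zero this" and "# copy this" placeholders must accomplish once everything is word aligned, using the symbol values from the listings (.data image at 0x8058, .data at 0xa000 for 8 bytes, .bss at 0xa008 for 0xc bytes). The fake memory array, MEM_BASE/MEM_SIZE and the helper names are assumptions made so the logic can run off-target.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model only: a fake address space covering 0x8000..0xBFFF so the
   addresses from the listings can be used directly. On the target these
   would simply be dereferenced as pointers. */
#define MEM_BASE 0x8000u
#define MEM_SIZE 0x4000u
static uint32_t mem[MEM_SIZE / 4];

/* Map a word-aligned "target address" onto the fake memory array. */
static uint32_t *word_at(uint32_t addr) {
    assert((addr & 3u) == 0 && addr >= MEM_BASE && addr < MEM_BASE + MEM_SIZE);
    return &mem[(addr - MEM_BASE) / 4];
}

/* Zero .bss one whole word at a time; only valid for aligned start and size. */
static void zero_words(uint32_t start, uint32_t size) {
    assert((size & 3u) == 0);
    for (uint32_t off = 0; off < size; off += 4)
        *word_at(start + off) = 0;
}

/* Copy the .data image from its load address (bob) to its run address (ted). */
static void copy_words(uint32_t src, uint32_t dst, uint32_t size) {
    assert((size & 3u) == 0);
    for (uint32_t off = 0; off < size; off += 4)
        *word_at(dst + off) = *word_at(src + off);
}

/* Mimic power-up and the reset handler using the values from the listing. */
static void simulate_reset(void) {
    *word_at(0x8058u) = 1u;            /* initial value of a in the flash image */
    *word_at(0x805cu) = 2u;            /* initial value of b in the flash image */
    for (uint32_t addr = 0xa000u; addr < 0xa018u; addr += 4)
        *word_at(addr) = 0xdeadbeefu;  /* RAM holds garbage at power-up */
    zero_words(0xa008u, 0xcu);         /* __bss_start__, __bss_size__ */
    copy_words(0x8058u, 0xa000u, 0x8u);/* __data_rom_start__, __data_start__, __data_size__ */
}
```

After simulate_reset(), a and b at 0xa000/0xa004 hold 1 and 2, c, d and e are zero, and the word at 0xa014 just past .bss is untouched. On the target, each loop collapses to a few ldr/str (or ldm/stm) instructions precisely because the addresses and sizes are multiples of 4.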
Note that ld linker script variables are very sensitive to position (inside or outside the curly braces).
The above linker script is somewhat typical of what you will find in stock linker scripts: a defined start and end. Sometimes the size is computed in the linker script, sometimes the bootstrap code computes it, or the bootstrap can just loop until the address equals the end value; which one depends on the overall system design across the two.
Your specific issue, BTW, is that you linked in two bootstraps. At the time I wrote this I don't see your command line(s) in the question; they would tell us more. That is why you are seeing the bss_start, etc. things that you didn't put in your linker script but that are often found in the stock ones that come with a prebuilt toolchain (similar to the above but more complicated).
It could be that you used gcc instead of ld without the various -nostartfiles-style options (so that it pulled in crt0.o); just try ld instead of gcc and see what changes. You would have failed with the original example too had it been something like this, though, so I don't think that is the issue here: if you used the same command lines, the failure should have shown up on both examples, not just the latter.
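The "loop until the address equals the end value" variant needs no size symbols at all. A minimal host-side sketch, where ordinary pointers stand in for the addresses of __bss_start__/__bss_end__ and __data_start__/__data_end__ from the example linker script (the function names are hypothetical):

```c
#include <stdint.h>

/* End-pointer style: run until the current address reaches the end symbol.
   On the target, p/end would be &__bss_start__ and &__bss_end__; here they
   are plain host pointers. */
static void zero_until(uint32_t *p, const uint32_t *end) {
    while (p < end)          /* "<" rather than "!=" guards against overshooting */
        *p++ = 0;
}

/* Copy the load image until the destination reaches its end symbol
   (&__data_end__ on the target); src starts at the .data load address. */
static void copy_until(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src) {
    while (dst < dst_end)
        *dst++ = *src++;
}
```

With this design the linker script only has to export start and end addresses, and the bootstrap never computes a size.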
The book you're reading has led you astray. Discard it and start learning from another source.
I see at least four major problems with what it has told you to do:
The linker script and _start function you included are missing a number of important sections, and will either malfunction or fail to link many executables. Most notably, they lack any handling for BSS (zero-filled) sections.
The vector table in main.c is beyond "minimal"; it lacks the required definitions for even the standard ARM interrupt vectors. Without these, debugging hardfaults will become very difficult, as the microcontroller will treat random code following the vector table as an interrupt vector when a fault occurs, which will probably lead to a secondary fault as it fails to load code from that "address".
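For reference, the architectural part of an ARMv7-M (Cortex-M4) vector table has 16 entries, spelled out in the sketch below. The handler names, the single catch-all Default_Handler and the initial stack value 0x20020000 (assumed top of 128 KiB of SRAM at 0x20000000, as on an STM32F407) are assumptions; check your part's reference manual. Device interrupt vectors follow from entry 16 onward, and on the target the array also needs a section attribute matching your linker script.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*vector_t)(void);

/* Hypothetical handlers; vendor startup files usually provide named,
   weak handlers such as HardFault_Handler instead. */
static void Reset_Handler(void)   { /* copy .data, zero .bss, call main() */ }
static void Default_Handler(void) { for (;;) { } /* park here so a debugger can see the fault */ }

/* Assumed initial main stack pointer; adjust for your part. */
#define INITIAL_SP 0x20020000u

/* The 16 architectural ARMv7-M entries. */
const vector_t vector_table[16] = {
    (vector_t)(uintptr_t)INITIAL_SP, /*  0: initial main stack pointer */
    Reset_Handler,                   /*  1: Reset */
    Default_Handler,                 /*  2: NMI */
    Default_Handler,                 /*  3: HardFault */
    Default_Handler,                 /*  4: MemManage */
    Default_Handler,                 /*  5: BusFault */
    Default_Handler,                 /*  6: UsageFault */
    NULL, NULL, NULL, NULL,          /*  7-10: reserved */
    Default_Handler,                 /* 11: SVCall */
    Default_Handler,                 /* 12: DebugMonitor */
    NULL,                            /* 13: reserved */
    Default_Handler,                 /* 14: PendSV */
    Default_Handler,                 /* 15: SysTick */
};
```

With every fault slot filled, a HardFault lands in a known loop instead of jumping through whatever code happens to follow a truncated table.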
The startup functions given by your book bypass the libc startup functions. This will cause some portions of the standard C library, as well as any C++ code, to fail to work correctly.
You are defining peripheral addresses yourself in main.c. These addresses are all defined in standard ST header files (e.g. <stm32f4xx.h>), so there is no need to define them yourself.
As a starter, I would recommend that you refer to the startup code provided by ST in any of their examples. These will all include a complete linker script and startup code.
As old_timer hinted in the comments, using gcc to link is a problem.
If you change the linker call in your batch file to use ld, it links without error. Try the following:
echo.
echo. Call the linker
echo.
arm-none-eabi-ld main.o -o myApp.elf -T linkerscript.ld

Why do I have an undefined reference to _init in __libc_init_array?

I'm attempting to build a simple project using Yagarto and Eclipse for an ARM microcontroller platform. In my startup code, I have this (which I believe is fairly standard and uninteresting):
void Reset_Handler(void)
{
/* Initialize data and bss */
__Init_Data();
/* Call CTORS of static objects */
__libc_init_array();
/* Call the application's entry point.*/
main();
while(1) { ; }
}
Unless I comment out the call to __libc_init_array(), I get the following error from the linker:
arm-none-eabi-g++ -nostartfiles -mthumb -mcpu=cortex-m4 -TC:/Users/mark/workspace/stm32_cpp_test/STM32F40x_1024k_192k_flash.ld -gc-sections -Wl,-Map=test_rom.map,--cref,--no-warn-mismatch -o stm32_cpp_test "system\\syscalls.o" "system\\startup_stm32f4xx.o" "system\\mini_cpp.o" "system\\cmsis\\system_stm32f4xx.o" main.o
d:/utils/yagarto/bin/../lib/gcc/arm-none-eabi/4.7.2/../../../../arm-none-eabi/lib/thumb/v7m\libg.a(lib_a-init.o): In function `__libc_init_array':
C:\msys\1.0\home\yagarto\newlib-build\arm-none-eabi\thumb\v7m\newlib\libc\misc/../../../../../../../newlib-1.20.0/newlib/libc/misc/init.c:37: undefined reference to `_init'
collect2.exe: error: ld returned 1 exit status
Why am I getting this "undefined reference" error? What am I missing? I assume there's some linker flag that I'm missing, but I can't for the life of me figure out what.
Oldish question, but I encountered a similar issue, and the solution was as Marco van de Voort indicated: if you're going to use __libc_init_array, you should omit the -nostartfiles linker option so that the normal libc init functions are included. (Duplicate answer.)
Secondly, I would suggest including the --specs=nano.specs flag when linking with gcc-arm (I believe Yagarto is a fork of, or even just a precompiled build of, gcc-arm), as it reduces the code size of libc and friends.
I'm no expert, but:
Probably _init (the normal runtime entry point) references the code that executes the ctor and dtor tables.
You use -nostartfiles to avoid the standard startup, and probably that whole startup code is eliminated by --gc-sections. The explicit call adds a reference again.
If omitting --gc-sections doesn't solve it, it might also be a missing KEEP() statement in your (embedded) linker script that keeps the entry code at all times, or your own startup code (startup_*) should reference it.
The __libc_init_array function from the C library (newlib) takes care of calling all initializers and C++ constructors registered in .preinit_array and .init_array. In between the preinit and init arrays, it calls an external _init function. The code is as simple as:
#include <sys/types.h>
/* These magic symbols are provided by the linker. */
extern void (*__preinit_array_start []) (void) __attribute__((weak));
extern void (*__preinit_array_end []) (void) __attribute__((weak));
extern void (*__init_array_start []) (void) __attribute__((weak));
extern void (*__init_array_end []) (void) __attribute__((weak));
extern void _init (void);
void __libc_init_array (void)
{
size_t count;
size_t i;
count = __preinit_array_end - __preinit_array_start;
for (i = 0; i < count; i++)
__preinit_array_start[i] ();
_init ();
count = __init_array_end - __init_array_start;
for (i = 0; i < count; i++)
__init_array_start[i] ();
}
Also see: understanding the __libc_init_array.
If custom startup code is implemented, it must perform this initialization, either by linking in 'init.o' or by implementing something similar to the code snippet above.
At least when building for the arm-none-eabi ARMv7e target with newlib-nano specs, the _init function gets linked in from crti.o and crtn.o, which together provide just empty stubs for _init and _fini. I searched through all the other stdlib objects for arm-none-eabi and found no other objects that append sections to .init, which would be obsolete anyway. Here is some disassembly of crti.o and crtn.o:
$ ./bin/arm-none-eabi-objdump.exe -j .init -D ./lib/gcc/arm-none-eabi/10.2.1/thumb/v7/nofp/crt?.o
./lib/gcc/arm-none-eabi/10.2.1/thumb/v7/nofp/crti.o: file format elf32-littlearm
Disassembly of section .init:
00000000 <_init>:
0: b5f8 push {r3, r4, r5, r6, r7, lr}
2: bf00 nop
./lib/gcc/arm-none-eabi/10.2.1/thumb/v7/nofp/crtn.o: file format elf32-littlearm
Disassembly of section .init:
00000000 <.init>:
0: bcf8 pop {r3, r4, r5, r6, r7}
2: bc08 pop {r3}
4: 469e mov lr, r3
6: 4770 bx lr
If somebody wants to use __libc_init_array in combination with the -nostartfiles linker option for this specific ARM target, it is acceptable to provide their own _init stub so the link succeeds, as long as no initialization code other than that from crti.o and crtn.o is emitted to the .init section. A stub could look like:
extern "C" void _init(void) {;}
The special functions _init and _fini are historic leftovers for controlling constructors and destructors. They are obsolete, and their use can lead to unpredictable results. No modern library should use them anymore; libraries should use the GCC function attributes constructor and destructor instead, which add functions to the .init_array and .fini_array tables respectively.
If it is known that some initialization code was emitted to .init (even though this is obsolete today), then an _init(void) function should be provided that runs this initialization code by calling the start address of the .init section.
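A minimal sketch of the constructor/destructor attributes (supported by GCC and Clang; clock_ready, setup_clock and teardown_clock are hypothetical names). On an ELF target GCC records setup_clock in .init_array, so on a bare-metal build with custom startup code it only runs if that code calls __libc_init_array() or an equivalent loop.

```c
/* Hypothetical state that must be ready before main() runs. */
static int clock_ready;

/* A pointer to this function goes into the constructor table; the C runtime
   (or __libc_init_array on bare metal) calls it before main(). */
__attribute__((constructor))
static void setup_clock(void) {
    clock_ready = 1;
}

/* Recorded in the destructor table and run at exit, if the program ever
   exits (bare-metal firmware usually never does). */
__attribute__((destructor))
static void teardown_clock(void) {
    clock_ready = 0;
}
```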

gcc arm -- ensuring args are retained when inlining functions with inline asm statements

I have a series of functions that are ultimately implemented with an SVC call. For instance:
void func(int arg) {
asm volatile ("svc #123");
}
As you might imagine, the SVC operates on 'arg', which is presumably in a register. If I explicitly add a 'noinline' attribute to the definition, everything works as you'd expect.
But were the function inlined at a higher optimization level, the code that loads 'arg' into a register would be omitted, as there is apparently no reference to 'arg'.
I've tried adding a 'used' attribute to the declaration of 'arg' itself, but gcc apparently yields a warning in this case.
I've also tried adding "dummy" asm statements such as
asm ("" : "=r"(arg));
But this didn't appear to work in general. (Maybe I need to say volatile here as well?)
Anyway, it seems unfortunate to have an explicit function call for a routine whose body essentially consists of one asm statement.
A relevant recipe is in the GCC manual, in the Assembler Instructions with C Expression Operands section, which uses sysint in the same role as your svc instruction. The idea is to define a local register variable bound to a specific register, and then use extended asm syntax to add inputs and outputs to the inline assembly block.
I tried to compile the following code:
#include <stdint.h>
__attribute__((always_inline))
uint32_t func(uint32_t arg) {
register uint32_t r0 asm("r0") = arg;
register uint32_t result asm("r0");
asm volatile ("svc #123":"=r" (result) : "0" (r0));
return result;
}
uint32_t foo(void) {
return func(2);
}
This is the disassembly of the compiled (with -O2 flag) object file:
00000000 <func>:
0: ef00007b svc 0x0000007b
4: e12fff1e bx lr
00000008 <foo>:
8: e3a00002 mov r0, #2
c: ef00007b svc 0x0000007b
10: e12fff1e bx lr
func is expanded inline and the argument is put in r0 correctly. I believe volatile is necessary, because if you don't use the return value of the service call, the compiler might assume that the assembly code is not needed and remove it.
You should use a single asm block: the compiler is still free to treat two asm blocks independently unless told otherwise, which means requirements placed on the second asm block have no effect on the first one.
You are assuming registers will be in their right places because of the calling convention.
What about something like this? (didn't test)
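void func(int arg) {
asm volatile (
"mov r0, %[code]\n\t"   /* move arg into r0 inside the same asm block */
"svc #123"
:
: [code]"r" (arg)        /* the compiler picks any register except r0 */
: "r0", "memory"         /* r0 is overwritten; "memory" in case the SVC reads or writes memory */
);
}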
For more information, see ARM GCC Inline Assembler Cookbook.

Resources