I'm developing a complex and intricate app for STM32F746. I stumbled upon the following hard fault and I'm not sure how to find the origin of the problem :
16:05:51.832 HardFault : ExceptionFrame { r0: 0xffffffff, r1: 0xffffffff, r2: 0xffffffff, r3: 0xffffffff, r12: 0xffffffff, lr: 0xffffffff, pc: 0xffffffff, xpsr: 0xffffffff }
16:05:51.832 UFSR : 0000000000000000
16:05:51.832 BFSR : 00000000
16:05:51.832 MMFSR : 00000001
16:05:51.832 HFSR : 01000000000000000000000000000000
The MMFSR part of the CFSR clearly indicates the error is IACCVIOL. Unfortunately, MMARVALID is not set, so I can't use the MMFAR to find the root of the issue.
Stepping through with GDB takes a huge amount of time for very little progress, as I need to start over every time the fault appears. I couldn't find a way to record/replay the session to quickly track down the issue in GDB.
Is there an approach that could help me pinpoint where the code fails ?
I am trying to use a custom RAM section to be able to pass information across reboot. This section will not be erased at boot and so the variables placed in this section will be kept across reboots (if there is no alimentation loss of course).
I use GNU toolchain and a Cortex-M0 (STM32) MCU
So I added in the linker script a new memory area before RAM :
RAM_PERSIST (xrw) : ORIGIN = 0x20000000, LENGTH = 0x0040
RAM (xrw) : ORIGIN = 0x20000040, LENGTH = 0x0FD0
Then a section to go in there :
.pds :
{
KEEP(*(.pds))
} >RAM_PERSIST
Finally in the C code, I declare some data in this section :
data_t __attribute((section(".pds")) data;
I does compile but I could not upload the generated binary on my target. Using objdump I discovered that my firmware got a new section ".sec2" beginning at 0x20000000 :
> (...)/arm-none-eabi-objdump -s ./obj/firmware.hex | tail
8006d20 f8bc08bc 9e467047 f8b5c046 f8bc08bc .....FpG...F....
8006d30 9e467047 e9000008 c1000008 00127a00 .FpG..........z.
8006d40 19000000 e0930400 409c0000 400d0300 ........#...#...
8006d50 c0c62d00 30750000 ffffffff 01000000 ..-.0u..........
8006d60 04000000 ....
Contents of section .sec2:
20000000 00000000 00000000 00000000 00000000 ................
20000010 00000000 00000000 00000000 00000000 ................
20000020 00000000 00000000 00000000 00000000 ................
20000030 00000000 00000000 00000000 00000000 ................
So I think I have to tell the linker this section is not in the flash so must not be part of the firmware.
Am I right ? If so, how to do that ?
Thanks by advance.
MEMORY
{
rom : ORIGIN = 0x00000000, LENGTH = 0x40000
ram : ORIGIN = 0x20000000, LENGTH = 0x4000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
I had more control/success when I stopped using xrw, etc in the memory definition and instead went with control over .text, .bss, .data, etc. and if you then further want a specific object somewhere you add that. etc...
I did achieve what I wanted by adding NOLOAD attibute to my custom section :
.pds (NOLOAD): { KEEP(*(.pds)) } >RAM
Here is the NOLOAD description (gcc documentation) :
(NOLOAD)
The (NOLOAD) directive will mark a section to not be loaded at run time. The linker will process the section normally, but will mark
it so that a program loader will not load it into memory. For example,
in the script sample below, the ROM section is addressed at memory
location 0 and does not need to be loaded when the program is run.
The contents of the ROM section will appear in the linker output file
as usual.
SECTIONS {
ROM 0 (NOLOAD) : { ... }
...
}
I found a similar post which helped me, I add a link here for reference : GCC (NOLOAD) directive loads memory into section anyway
I am trying to upload this simple assembly program:
.global _start
.text
reset: b _start
undefined: b undefined
software_interrupt: b software_interrupt
prefetch_abort: b prefetch_abort
data_abort: b data_abort
nop
interrupt_request: b interrupt_request
fast_interrupt_request: b fast_interrupt_request
_start:
mov r0, #0
mov r1, #1
increase:
add r0, r0, r1
cmp r0, #10
bne increase
decrease:
sub r0, r0, r1
cmp r0, #0
bne decrease
b increase
stop: b stop
to my LPC4088 (I am using Embedded artists LPC4088 QSB) via SEGGER's JLink so I could later debug it using GDB.
First I compiled my sources with all the debugging symbols using GCC toolchain:
arm-none-eabi-as -g -gdwarf-2 -o program.o program.s
arm-none-eabi-ld -Ttext=0x0 -o program.elf program.o
arm-none-eabi-objcopy -O binary program.elf program.bin
But uploading binary program.bin to LPC4088 was unsuccessful. Then user #old_timer reminded me in the comments that LPC4088's boot ROM does a checksum test after every reset like described on a page 876 of LPC4088 user manual:
So I mad sure my binary would pass a checksum test by following steps described here. So I first created a C source file checksum.c:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int main(int argc, char **argv) {
int fw, count, crc;
char buf[28];
fw = open(argv[1], O_RDWR);
// read fist 28 bytes
read(fw, &buf, 28);
// find 2's complement of entries 0 to 6
for (count=0, crc=0; count < 7; count++) {
crc += *((int*)(buf+count*4));
}
crc = (~crc) + 1;
// write it at offset 0x0000001C
lseek(fw, 0x0000001C, SEEK_SET);
write(fw, &crc, 4);
close(fw);
return 0;
}
compiled it using gcc -o checksum.bin checksum.c and then I fed it the original program.bin as an argument like this ./checksum.bin program.bin. So I got a modified program.bin which really had a value at 0x1C modified! Here is the comparison of the original:
and the modified version:
So the value at 0x1C was modified from 0xFEFFFFEA to 0x0400609D. This is all that was modified as can be seen from the images.
I then opened terminal application JLinkExe which presented a prompt. In the prompt I:
powered on my board using power on,
connected to the LPC4088 using command connect,
halted the MCPU using command h,
erased entire FLASH memory using command erase,
uploaded my modified binary to FLASH loadbin program.bin 0x0,
set the program counter to start at the beginning SetPC 0x4.
started stepping into the program using s.
When I started stepping into the program in first step I got some errors as can be seen at the end of the procedure inside JLinkExe prompt:
SEGGER J-Link Commander V6.30a (Compiled Jan 31 2018 18:14:21)
DLL version V6.30a, compiled Jan 31 2018 18:14:14
Connecting to J-Link via USB...O.K.
Firmware: J-Link V9 compiled Jan 29 2018 15:41:50
Hardware version: V9.30
S/N: 269300437
License(s): FlashBP, GDB
OEM: SEGGER-EDU
VTref = 3.293V
Type "connect" to establish a target connection, '?' for help
J-Link>connect
Please specify device / core. <Default>: LPC4088
Type '?' for selection dialog
Device>
Please specify target interface:
J) JTAG (Default)
S) SWD
TIF>
Device position in JTAG chain (IRPre,DRPre) <Default>: -1,-1 => Auto-detect
JTAGConf>
Specify target interface speed [kHz]. <Default>: 4000 kHz
Speed>
Device "LPC4088" selected.
Connecting to target via JTAG
TotalIRLen = 4, IRPrint = 0x01
JTAG chain detection found 1 devices:
#0 Id: 0x4BA00477, IRLen: 04, CoreSight JTAG-DP
Scanning AP map to find all available APs
AP[1]: Stopped AP scan as end of AP map has been reached
AP[0]: AHB-AP (IDR: 0x24770011)
Iterating through AP map to find AHB-AP to use
AP[0]: Core found
AP[0]: AHB-AP ROM base: 0xE00FF000
CPUID register: 0x410FC241. Implementer code: 0x41 (ARM)
Found Cortex-M4 r0p1, Little endian.
FPUnit: 6 code (BP) slots and 2 literal slots
CoreSight components:
ROMTbl[0] # E00FF000
ROMTbl[0][0]: E000E000, CID: B105E00D, PID: 000BB00C SCS-M7
ROMTbl[0][1]: E0001000, CID: B105E00D, PID: 003BB002 DWT
ROMTbl[0][2]: E0002000, CID: B105E00D, PID: 002BB003 FPB
ROMTbl[0][3]: E0000000, CID: B105E00D, PID: 003BB001 ITM
ROMTbl[0][4]: E0040000, CID: B105900D, PID: 000BB9A1 TPIU
ROMTbl[0][5]: E0041000, CID: B105900D, PID: 000BB925 ETM
Cortex-M4 identified.
J-Link>h
PC = 000001B2, CycleCnt = 825F97DB
R0 = 00000000, R1 = 20098038, R2 = 2009803C, R3 = 000531FB
R4 = 00000000, R5 = 00000000, R6 = 12345678, R7 = 00000000
R8 = 6C2030E3, R9 = 0430DB64, R10= 10000000, R11= 00000000
R12= 899B552C
SP(R13)= 1000FFF0, MSP= 1000FFF0, PSP= 6EBAAC08, R14(LR) = 00000211
XPSR = 21000000: APSR = nzCvq, EPSR = 01000000, IPSR = 000 (NoException)
CFBP = 00000000, CONTROL = 00, FAULTMASK = 00, BASEPRI = 00, PRIMASK = 00
FPS0 = 93310C50, FPS1 = 455D159C, FPS2 = 01BA3FC2, FPS3 = E851BEED
FPS4 = D937E8F4, FPS5 = 82BD7BF6, FPS6 = 8F16D263, FPS7 = B0E8C039
FPS8 = 302C0A38, FPS9 = 8007BC9C, FPS10= 9A1A276F, FPS11= 76C9DCFE
FPS12= B2FFFA20, FPS13= B55786BB, FPS14= 2175F73E, FPS15= 5D35EC5F
FPS16= 98917B32, FPS17= C964EEB6, FPS18= FEDCA529, FPS19= 1703B679
FPS20= 2F378232, FPS21= 973440E3, FPS22= 928C911C, FPS23= 20A1BF55
FPS24= 4AE3AD0C, FPS25= 4F47CC1E, FPS26= C7B418D5, FPS27= 3EAB9244
FPS28= 73C795D0, FPS29= A359C85E, FPS30= 823AEA80, FPS31= EC9CBCD5
FPSCR= 00000000
J-Link>erase
Erasing device (LPC4088)...
J-Link: Flash download: Only internal flash banks will be erased.
To enable erasing of other flash banks like QSPI or CFI, it needs to be enabled via "exec EnableEraseAllFlashBanks"
Comparing flash [100%] Done.
Erasing flash [100%] Done.
Verifying flash [100%] Done.
J-Link: Flash download: Total time needed: 3.357s (Prepare: 0.052s, Compare: 0.000s, Erase: 3.301s, Program: 0.000s, Verify: 0.000s, Restore: 0.002s)
Erasing done.
J-Link>loadbin program.bin 0x0
Downloading file [program.bin]...
Comparing flash [100%] Done.
Erasing flash [100%] Done.
Programming flash [100%] Done.
Verifying flash [100%] Done.
J-Link: Flash download: Bank 0 # 0x00000000: 1 range affected (4096 bytes)
J-Link: Flash download: Total time needed: 0.076s (Prepare: 0.056s, Compare: 0.001s, Erase: 0.000s, Program: 0.005s, Verify: 0.000s, Restore: 0.012s)
O.K.
J-Link>SetPC 0x4
J-Link>s
**************************
WARNING: T-bit of XPSR is 0 but should be 1. Changed to 1.
**************************
J-Link>s
****** Error: Failed to read current instruction.
J-Link>s
****** Error: Failed to read current instruction.
J-Link>s
****** Error: Failed to read current instruction.
J-Link>
So this code must have come from somewhere and it may be the LPC4088's Boot ROM which is remapped to 0x0 at boot time as is stated on page 907 of the LPC4088 user manual:
Do you have any idea on how to overcome this Boot ROM & checksum problem, so I could debug my program normally?
After a while I found out that warning:
**************************
WARNING: T-bit of XPSR is 0 but should be 1. Changed to 1.
**************************
is actually saying that I am trying to execute ARM instruction on a Cortex-M4 which is Thumb only! This T-bit mentioned in the warning is described on page 100 of ARMv7-M architecture reference manual:
And this is exactly what user #old_timer is saying.
You are trying to run arm instructions (0xExxxxxxxx is a big giveaway, not to mention the exception table being a lot of 0xEAxxxxxx instructions) on a cortex-m4. The cortex-m boots differently (vector table rather than executable instructions) and is thumb only (the thumb2 extensions in armv7-m are also...just thumb, dont be confused by that, what thumb2 extensions do matter but the early/original thumb is portable across all of them). So whether or not you need an additional checksum somewhere like older ARM7TDMI based NXP chips in order for the bootloader to allow the user/application code to run, you first need something that will run on the cortex-m4.
start with this, yes I know you have a cortex-m4 use cortex-m0 for now.
so.s
.cpu cortex-m0
.thumb
.thumb_func
.globl _start
_start:
stacktop: .word 0x20001000
.word reset
.word hang
# ...
.thumb_func
hang: b hang
.thumb_func
reset:
mov r1,#0
outer:
mov r0,#0xFF
inner:
nop
nop
add r1,#1
sub r0,#1
bne inner
nop
nop
b outer
build
arm-none-eabi-as so.s -o so.o
arm-none-eabi-ld -Ttext=0 so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy so.elf -O binary so.bin
examine so.list to make sure the vector table is correct.
00000000 <_start>:
0: 20001000 andcs r1, r0, r0
4: 0000000f andeq r0, r0, pc
8: 0000000d andeq r0, r0, sp
0000000c <hang>:
c: e7fe b.n c <hang>
0000000e <reset>:
e: 2100 movs r1, #0
00000010 <outer>:
10: 20ff movs r0, #255 ; 0xff
00000012 <inner>:
12: 46c0 nop ; (mov r8, r8)
14: 46c0 nop ; (mov r8, r8)
16: 3101 adds r1, #1
18: 3801 subs r0, #1
1a: d1fa bne.n 12 <inner>
1c: 46c0 nop ; (mov r8, r8)
1e: 46c0 nop ; (mov r8, r8)
20: e7f6 b.n 10 <outer>
The reset entry point is 0x00E which is correctly indicated in the vector table at offset 0x4 as 0x00F. You can flash it to 0x000 and then reset and see if it works (need a debugger to stop it to see if it is stepping through that code).
To run from sram there is nothing position dependent here, so you can load the .bin as is to 0x20000000 and execute from 0x2000000E (or whatever address your toolchain ends up creating for the reset entry point).
Or you can remove the vector table
.cpu cortex-m0
.thumb
.thumb_func
reset:
mov r1,#0
outer:
mov r0,#0xFF
inner:
nop
nop
add r1,#1
sub r0,#1
bne inner
nop
nop
b outer
And link with -Ttext=0x20000000, then download to sram and start execution with the debugger at 0x20000000.
You should see r0 counting some, r1 should just keep counting forever then roll over and keep counting so if you stop it check the registers, resume, stop, etc you should see that activity.
I worked my way through all of the free Linux training materials created by Free Electrons. In the last lab, we learn to use kgdb to remotely debug a simple crash in a loadable module. The crash is caused by a null pointer dereference in a memzero function call.
I am using Linux kernel 4.9 and a BeagleBone Black as the target, all according to the recommendations for the labs, and I've had no problems up to this point. My host is Ubuntu xenial and I am using standard packages for the ARM toolchain (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4) and gdb (7.11.1-0ubuntu1~16.04) debugger.
gdb is able to read the symbol tables from vmlinux and from the module with the bug in it, which is called drvbroken.ko. The module has a bug in its init function, so it crashes immediately when I insmod it.
gdb output:
(gdb) backtrace
#0 __memzero () at arch/arm/lib/memzero.S:69
#1 0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) list 69
64 ldmeqfd sp!, {pc} # 1/2 quick exit
65 /*
66 * No need to correct the count; we're only testing bits from now on
67 */
68 tst r1, #32 # 1
69 stmneia r0!, {r2, r3, ip, lr} # 4
70 stmneia r0!, {r2, r3, ip, lr} # 4
71 tst r1, #16 # 1 16 bytes or more?
72 stmneia r0!, {r2, r3, ip, lr} # 4
73 ldr lr, [sp], #4 # 1
The result is the same whether I build the kernel with CONFIG_ARM_UNWIND (the default) or disable that and use CONFIG_FRAME_POINTER (the old method recommended by the lab notes).
I tried the same procedure in kdb, and here I see a very long backtrace that includes the calling functions. The caller of memzero is cdev_init.
kdb output:
Entering kdb (current=0xde616240, pid 106) on processor 0 Oops: (null)
due to oops # 0xc04c2be0
CPU: 0 PID: 106 Comm: insmod Tainted: G O 4.9.0-dirty #1
Hardware name: Generic AM33XX (Flattened Device Tree)
task: de616240 task.stack: de676000
PC is at __memzero+0x40/0x7c
LR is at 0x0
pc : [<c04c2be0>] lr : [<00000000>] psr: 00000013
sp : de677da4 ip : 00000000 fp : de677dbc
r10: bf000240 r9 : 219a3868 r8 : 00000000
r7 : de65c7c0 r6 : de6420c0 r5 : bf0000b4 r4 : 00000000
r3 : 00000000 r2 : 00000000 r1 : fffffffc r0 : 00000000
Flags: nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 9e69c019 DAC: 00000051
CPU: 0 PID: 106 Comm: insmod Tainted: G O 4.9.0-dirty #1
Hardware name: Generic AM33XX (Flattened Device Tree)
Backtrace:
... pruned function calls related to kdb itself ...
[<c08326cc>] (do_page_fault) from [<c010138c>] (do_DataAbort+0x3c/0xbc)
r10:bf000240 r9:de676000 r8:de677d50 r7:00000000 r6:c08326cc r5:00000817
r4:c0d0bb2c
[<c0101350>] (do_DataAbort) from [<c0831d04>] (__dabt_svc+0x64/0xa0)
Exception stack(0xde677d50 to 0xde677d98)
7d40: 00000000 fffffffc 00000000 00000000
7d60: 00000000 bf0000b4 de6420c0 de65c7c0 00000000 219a3868 bf000240 de677dbc
7d80: 00000000 de677da4 00000000 c04c2be0 00000013 ffffffff
r8:00000000 r7:de677d84 r6:ffffffff r5:00000013 r4:c04c2be0
[<c02bf44c>] (cdev_init) from [<bf002048>] (init_module+0x48/0xb4 [drvbroken])
r5:bf002000 r4:bf000480
[<bf002000>] (init_module [drvbroken]) from [<c01018d4>] (do_one_initcall+0x44/0x180)
r5:bf002000 r4:ffffe000
[<c0101890>] (do_one_initcall) from [<c024fa2c>] (do_init_module+0x64/0x1d8)
r8:00000001 r7:de65c7c0 r6:de6420c0 r5:c0dbfa84 r4:bf000240
[<c024f9c8>] (do_init_module) from [<c01e10e8>] (load_module+0x1d6c/0x23d8)
r6:c0d0512c r5:c0dbfa84 r4:c0d4c70f
[<c01df37c>] (load_module) from [<c01e18ac>] (SyS_init_module+0x158/0x17c)
r10:00000051 r9:de676000 r8:e0a95100 r7:00000000 r6:000ac118 r5:00004100
It is pretty easy to figure out where to look for the bug with this information, but alas, it is not possible to get a line number or list the source directly from kdb. This is much easier in gdb, assuming that I can get a full backtrace.
The latest version of Cobalt(8.20698) will crash at init time on arm linux platform, the backtrace is as follows, but the old version doesn't has this issue, could anyone help to have a look?
[00000000] *pgd=0dce6831, *pte=00000000, *ppte=00000000
CPU: 0 PID: 4268 Comm: cobalt_qa Tainted: P O 3.10.79 #2
task: cf33b400 ti: d24bc000 task.ti: d24bc000
PC is at 0xb5d12180
LR is at 0x161610
pc : [<b5d12180>] lr : [<00161610>] psr: 600f0010
sp : bed2fc20 ip : b5d12180 fp : 00000000
r10: bed30088 r9 : bed2ff78 r8 : bed2fe84
r7 : 00000002 r6 : 00000000 r5 : 00000000 r4 : 01027e68
r3 : 00000043 r2 : 00000049 r1 : 0000002e r0 : 00000000
Flags: nZCv IRQs on FIQs on Mode USER_32 ISA ARM Segment user
Control: 10c5387d Table: 124d406a DAC: 00000015
CPU: 0 PID: 4268 Comm: cobalt_qa Tainted: P O 3.10.79 #2
[<c0012c20>] (unwind_backtrace+0x0/0xdc) from [<c0010ef8>] (show_stack+0x10/0x14)
[<c0010ef8>] (show_stack+0x10/0x14) from [<c0014204>] (__do_user_fault+0x13c/0x1ac)
[<c0014204>] (__do_user_fault+0x13c/0x1ac) from [<c001449c>] (do_page_fault+0x228/0x268)
[<c001449c>] (do_page_fault+0x228/0x268) from [<c0008328>] (do_DataAbort+0x34/0x120)
[<c0008328>] (do_DataAbort+0x34/0x120) from [<c000dab4>] (__dabt_usr+0x34/0x40)
Exception stack(0xd24bdfb0 to 0xd24bdff8)
dfa0: 00000000 0000002e 00000049 00000043
dfc0: 01027e68 00000000 00000000 00000002 bed2fe84 bed2ff78 bed30088 00000000
dfe0: b5d12180 bed2fc20 00161610 b5d12180 600f0010 ffffffff
Caught signal: SIGSEGV (11)
<unknown> [0xb5d12180]
uprv_getDefaultLocaleID_56 [0x161610]
icu_56::locale_set_default_internal() [0x15a114]
icu_56::Locale::getDefault() [0x159ca0]
locale_get_default_56 [0x159cb0]
EzTimeValueExplode [0xb4d10]
EzTimeTExplode [0xb5048]
EzTimeTExplodeLocal [0xb5838]
logging::LogMessage::Init() [0x7b7cc]
logging::LogMessage::LogMessage() [0x7bcf4]
base::UserLog::IsRegistrationSupported() [0x6b108]
cobalt::browser::Application::RegisterUserLogs() [0x2c608]
cobalt::browser::Application::Application() [0x2d998]
cobalt::browser::CreateApplication() [0x2b278]
SbEventHandle [0x2b0c0]
starboard::shared::starboard::Application::DispatchStart() [0xbadec]
starboard::shared::starboard::Application::Run() [0xbb4e0]
main [0x21c24]
<unknown> [0xb5cb2278]
After tracing the code of Cobalt, the cobalt need to get the posix_id by SbSystemGetLocaledId() in system_get_locale_id.cc, but the system didn't set the clang environment variable yet, and it get null which made the Crash, after setting the LANG environment variable(export LANG="en_US.UTF-8"), it works.
Add CLANG environment variable