I am at my wit's end trying to debug a hard fault on an EFR32BG12 processor. I've been following the instructions in the Silicon Labs knowledge base here:
https://www.silabs.com/community/mcu/32-bit/knowledge-base.entry.html/2014/05/26/debug_a_hardfault-78gc
I've also been using the Keil app note here to fill in some details:
http://www.keil.com/appnotes/files/apnt209.pdf
I've managed to get the hard fault to occur quite consistently in one place. When the hard fault occurs, the code from the knowledge base article gives me the following values (pushed onto the stack by the processor before calling the hard fault handler):
Name Type Value Location
~~~~ ~~~~ ~~~~~ ~~~~~~~~
cfsr uint32_t 0x20000 (Hex) 0x2000078c
hfsr uint32_t 0x40000000 (Hex) 0x20000788
mmfar uint32_t 0xe000ed34 (Hex) 0x20000784
bfar uint32_t 0xe000ed38 (Hex) 0x20000780
r0 uint32_t 0x0 (Hex) 0x2000077c
r1 uint32_t 0x8 (Hex) 0x20000778
r2 uint32_t 0x0 (Hex) 0x20000774
r3 uint32_t 0x0 (Hex) 0x20000770
r12 uint32_t 0x1 (Hex) 0x2000076c
lr uint32_t 0xab61 (Hex) 0x20000768
pc uint32_t 0x38dc8 (Hex) 0x20000764
psr uint32_t 0x0 (Hex) 0x20000760
Looking at the Keil app note, I believe a CFSR value of 0x20000 indicates a Usage Fault with the INVSTATE bit set, i.e.:
INVSTATE: Invalid state: 0 = no invalid state 1 = the processor has
attempted to execute an instruction that makes illegal use of the
Execution Program Status Register (EPSR). When this bit is set, the PC
value stacked for the exception return points to the instruction that
attempted the illegal use of the EPSR. Potential reasons: a) Loading a
branch target address to PC with LSB=0. b) Stacked PSR corrupted
during exception or interrupt handling. c) Vector table contains a
vector address with LSB=0.
The PC value pushed onto the stack by the exception (provided by the code from the knowledge base article) seems to be 0x38dc8. If I go to this address in the Simplicity Studio "Disassembly" window, I see the following:
00038db8: str r5,[r5,#0x14]
00038dba: str r0,[r7,r1]
00038dbc: str r4,[r5,#0x14]
00038dbe: ldr r4,[pc,#0x1e4] ; 0x38fa0
00038dc0: strb r1,[r4,#0x11]
00038dc2: ldr r5,[r4,#0x64]
00038dc4: ldrb r3,[r4,#0x5]
00038dc6: movs r3,r6
00038dc8: strb r1,[r4,#0x15]
00038dca: ldr r4,[r4,#0x14]
00038dcc: cmp r7,#0x6f
00038dce: cmp r6,#0x30
00038dd0: str r7,[r6,#0x14]
00038dd2: lsls r6,r6,#1
00038dd4: movs r5,r0
00038dd6: movs r0,r0
The address appears to be well past the end of my code. If I look at the same address in the "Memory" window, this is what I see:
0x00038DC8 69647561 2E302F6F 00766177 00000005 audio/0.wav.....
0x00038DD8 00000000 000F4240 00000105 00000000 ....#B..........
0x00038DE8 00000000 00000000 00000005 00000000 ................
0x00038DF8 0001C200 00000500 00001000 00000000 .Â..............
0x00038E08 00000000 F00000F0 02F00001 0003F000 ....ð..ð..ð..ð..
0x00038E18 F00004F0 06010005 01020101 01011201 ð..ð............
0x00038E28 35010121 01010D01 6C363025 2E6E6775 !..5....%06lugn.
0x00038E38 00746164 00000001 000008D0 00038400 dat.....Ð.......
Curiously, "audio/0.wav" is a static string which is part of the firmware. If I understand correctly, what I've learned here is that PC somehow gets set to this point in memory, which of course is not a valid instruction and causes the hard fault.
To debug the issue, I need to know how PC came to be set to this incorrect value. I believe the LR register should give me an idea. The LR register pushed onto the stack by the exception seems to be 0xab61. If I look at this location, I see the following in the Disassembly window:
1270 dp->sect = clst2sect(fs, clst);
0000ab58: ldr r0,[r7,#0x10]
0000ab5a: ldr r1,[r7,#0x14]
0000ab5c: bl 0x00009904
0000ab60: mov r2,r0
0000ab62: ldr r3,[r7,#0x4]
0000ab64: str r2,[r3,#0x18]
It looks to me like the problem occurs during this call specifically:
0000ab5c: bl 0x00009904
This makes me think that the problem occurs as a result of a corrupt stack, which causes clst2sect to return to an invalid part of memory rather than to 0xab60. The code for clst2sect is pretty innocuous:
/*-----------------------------------------------------------------------*/
/* Get physical sector number from cluster number */
/*-----------------------------------------------------------------------*/
DWORD clst2sect ( /* !=0:Sector number, 0:Failed (invalid cluster#) */
FATFS* fs, /* Filesystem object */
DWORD clst /* Cluster# to be converted */
)
{
clst -= 2; /* Cluster number is origin from 2 */
if (clst >= fs->n_fatent - 2) return 0; /* Is it invalid cluster number? */
return fs->database + fs->csize * clst; /* Start sector number of the cluster */
}
Does this analysis sound about right?
I suppose the problem I've run into is that I have no idea what might cause this kind of behaviour... I've tried putting breakpoints in all of my interrupt handlers, to see if one of them might be corrupting the stack, but there doesn't seem to be any pattern--sometimes, no interrupt handler is called but the problem still occurs.
In that case, though, it's hard for me to see how a program might try to execute code at a location well past the actual end of the code... I feel like a function pointer might be a likely candidate, but in that case I would expect to see the problem show up, e.g., where a function pointer is used. However, I don't see any function pointers used near where the error is occurring.
Perhaps there is more information I can extract from the debug information I've given above? The problem is quite reproducible, so if there's something I have not tried, but which you think might give some insight, I would love to hear it.
Thanks for any help you can offer!
After about a month of chasing this one, I managed to identify the cause of the problem. I hope I can give enough information here that this will be useful to someone else.
In the end, the problem was caused by passing a pointer to a non-static local variable to a state machine which changed the value at that memory location later on. Because the local variable was no longer in scope, that memory location was a random point in the stack, and changing the value there corrupted the stack.
The problem was difficult to track down for two reasons:
Depending on how the code compiled, the changed memory location could be something non-critical like another local variable, which would cause a much more subtle error. Only when I got lucky would the change affect the PC register and cause a hard fault.
Even when I found a version of the code that consistently generated a hard fault, the actual hard fault typically occurred somewhere up the call stack, when a function returned and popped the stack value into PC. This made it difficult to identify the cause of the problem--all I knew was that something was corrupting the stack before that function return.
A few tools were really helpful in identifying the cause of the problem:
Early on, I had identified a block of code where the hard fault usually occurred using GPIO pins. I would toggle a pin high before entering the block and low when exiting the block. Then I performed many tests, checking if the pin was high or low when the hard fault occurred, and used a sort of binary search to determine the smallest block of code which consistently contained all the hard faults.
The hard fault pushes a number of important registers onto the stack. These helped me confirm where the PC register was becoming corrupt, and also helped me understand that it was becoming corrupt as a result of a stack corruption.
Starting somewhere before that block of code and stepping forward while keeping an eye on local variables, I was able to identify a function call that was corrupting the stack. I could confirm this using Simplicity Studio's memory view.
Finally, stepping through the offending function in detail, I realized that the problem was occurring when I dereferenced a stored pointer and wrote to that memory location. Looking back at where that pointer value was set, I realized it had been set to point to a non-static local variable that was now out of scope.
Thanks to #SeanHoulihane and #cooperised, who helped me eliminate a few possible causes and gave me a little more confidence with the debugging tools.
I am dealing with android boot.img which is a combination of compressed kernel, ramdisk, and dtb. I see from the serial console log from uboot about the booting process and here's the part that triggers my curiosity
CPU: Freescale i.MX6Q rev1.2 at 792 MHz
CPU: Temperature 27 C
Reset cause: POR
Board: MX6-SabreSD
I2C: ready
DRAM: 1 GiB
PMIC: PFUZE100 ID=0x10
MMC: FSL_SDHC: 0, FSL_SDHC: 1, FSL_SDHC: 2
No panel detected: default to Hannstar-XGA
Display: Hannstar-XGA (1024x768)
In: serial
Out: serial
Err: serial
check_and_clean: reg 0, flag_set 0
Fastboot: Normal
flash target is MMC:1
Net: FEC [PRIME]
Normal Boot
Hit any key to stop autoboot: 3 2 1 0
boota mmc1
kernel # 14008000 (7272352)
ramdisk # 15000000 (869937)
fdt # 14f00000 (44072)
## Booting Android Image at 0x12000000 ...
Kernel load addr 0x14008000 size 7102 KiB
Kernel command line: console=ttymxc0,115200 init=/init video=mxcfb0:dev=hdmi,1920x1080M#60,bpp=32 video=mxcfb1:off video=mxcfb:off video=mxcfb3:off vmalloc=256M androidboot.console=ttymxc0 consolebalank=0 androidboot.hardware=freescale cma=384M
## Flattened Device Tree blob at 14f00000
Booting using the fdt blob at 0x14f00000
Loading Kernel Image ... OK
Using Device Tree in place at 14f00000, end 14f0dc27
switch to ldo_bypass mode!
Starting kernel ...
The kernel's address is 14008000, ramdisk 15000000, fdt 14f00000.
I have found out that these values are saved in the boot.img header and when I manually mess with this values, the image will not boot even though the actual contents are not modified.
Why is this address so critical? Why does it have to be these values? why not other values?
One of my own guess is:
these load address values are hardcoded inside the kernel so if I change it in the boot.img header it will cause side-effects. To elaborate my own theory, loading the kernel to the 'modified' load addr won't be a problem. But when actually executing the lines, it will cause severe problems because these codes are fixed to work with the proper 'load addr'.
Is my theory wrong? If so, I'd be grateful if you could correct me.
After the U-boot finished, it needs to handover the further execution to kernel so that kernel takes care for all further processes. Plus Kernel have some set of routines which are essential to initialize the board.
These set of routines have a entry point start_kernel(). Before this some other routines are also called to execute start_kernel. That can be identified by this linker script and this init.S.
To maintain execution in this procedure kernel need to be started from a particular address, moreover its data segments and other memories are also attached with that so to load and run all these in a proper manner, exact memory location need to be provided.
With this kernel needs the device tree blob for probing the devices attached to it and the rootfs to bring up the user space, here to load the device tree and rootfs exact offsets are needed so that any data should not be missed or corrupted. Every single byte is important for all these execution.
So to safely bootup all these values are being stored in bootloader's configuration.
I am trying to learn User mode driver to receive interrupts of my Network Card.
I insmod two kernel components ${KSRC}/drivers/uio/uio.ko and ${KSRC}/drivers/uio/uio_pci_generic.ko.
But I donot see any device getting created which I can then mmap
Typically for UIO I would need something like "/dev/uio0" which I can open then mmap()
So how to go about using UIO framework?
Edit:
My network card is Marvell ethernet controller. My hardware is x86 Ubuntu. Linux kernel 3.13.11.11. So no device tree based.
First of all, the driver has to be compiled into the kernel. Either use menu config or add the following lines to a .cfg file. You can check that the driver has been compiled by looking in /lib/modules/<kernel-name>/modules.builtin.
CONFIG_UIO=y
CONFIG_UIO_PDRV_GENIRQ=y
The next step is to add the following entry to your device tree You can check that the driver has been compiled by looking in /lib/modules/<kernel-name>/modules.builtin.file. Where the middle number is the interrupt you are targeting -32. This means 0x1D == 29 and then add 32 for the interrupt number that is registered in the GIC (Generic Interrupt Controller on ARM systems).
spw0#7aa00000 {
compatible = "generic-uio";
reg = <0x7aa00000 0x10000>;
interrupts = <0x0 0x1D 0x4>;
interrupt-parent = <0x3>;
clocks = <0x1>;
};
and changing the bootargs to console=ttyPS0,115200 root=/dev/mmcblk0p1 rw rootwait earlyprintk uio_pdrv_genirq.of_id=generic-uio.
If everything goes well, you will see the /dev/uio0 device after booting up.
Here's the nm dump of my program.
00000000 T __ctors_end
00000000 T __ctors_start
00000000 T __dtors_end
00000000 T __dtors_start
00000000 a __tmp_reg__
00000000 T __trampolines_end
00000000 T __trampolines_start
00000000 T setup
00000001 a __zero_reg__
0000003d a __SP_L__
0000003e a __SP_H__
0000003f a __SREG__
00000072 T __vector_15
00000086 T main
000000a8 A __data_load_end
000000a8 A __data_load_start
000000a8 T _etext
00800100 D _edata
00800100 T _end
00810000 T __eeprom_end
The architecture is AVR, and I need to get main() back up to 0x00000000 in order for the chip that I'm running this code on to execute properly. It should be as simple as a linker script, shouldn't it?
It doesn't matter where main() is in memory. Simply put a jump instruction to its address at the reset vector, or 0x0000 in application memory.
I used to program for AVR and as I know the only way to change main() entry is fuse bits. But you just can to put in the back of FLASH for bootloader. Depending on chip main starts in different places, I'm not sure but on AVR it should be something like 0x20 to 0x100.
It is because at the beginning there is RESET vector, registers and interrupt vectors.
This structure helps very much, once I had a project on which I wasn't able to use watchdog so the only way to trigger reset was overflow.
Also, I've read your comment. You don't need to put 256 bytes of 0x00 that place is for some registers (AVR registers are divided in to places one is SRAM, other FLASH) and interrupt vectors, so if you use lets say timer or UART and your code start at 0x00 so initialization of these would destroy your code.
It is designed to work, I think redesigning would spoil that. But if you really want this, you can try to add -Ttext=0x0000 this flag. This may compile it as you want but I do not recomend doing that.
AFAIK, /dev/mem presents physical memory to user, and it's usually being used for device read/write through MMIO. In my use case, I want to modify current process' pte so that two ptes will point to the same physical page. In particular, I move a x86_64 binary above 4G virtual space and mmap virtual space below 4G. I want to make 4G above pte and 4G below pte point to the same physical page, so that when I write into 4G above vaddr and read from 4G below pte, I get the same result. Sample code might look like below:
*(unsigned char *)vaddr1 = 7 // write into 4G above vaddr1
val = *(unsigned char *)vaddr2; // read from 4G below vaddr2
printf("val should be 7, %d\n", val);
But after I modify 4G below pte to point to physical page pointed by 4G above pte through /dev/mem, kernel give me message below,
BUG: Bad page map in process mmap pte:8000000007eb2067 pmd:07acb067
page:ffffea00001fac80 count:0 mapcount:-1 mapping: (null) index:0x101b7b
page flags: 0x4000000000000014(referenced|dirty)
addr:0000000101b7b000 vm_flags:00100073 anon_vma:ffff880007ab0708 mapping: (null) index:101b7b
Pid: 609, comm: mmap Tainted: G B 3.5.3 #7
Call Trace:
[<ffffffff8107abcc>] ? print_bad_pte+0x1d2/0x1ea
[<ffffffff8107bf18>] ? unmap_single_vma+0x3a0/0x56d
[<ffffffff8107c745>] ? unmap_vmas+0x2c/0x46
[<ffffffff8108106b>] ? exit_mmap+0x6e/0xdd
[<ffffffff8101cc4f>] ? do_page_fault+0x30f/0x348
[<ffffffff81020ce6>] ? mmput+0x20/0xb4
[<ffffffff810256ae>] ? exit_mm+0x105/0x110
[<ffffffff8103bb6c>] ? hrtimer_try_to_cancel+0x67/0x70
[<ffffffff81026b59>] ? do_exit+0x211/0x711
[<ffffffff810272e0>] ? do_group_exit+0x76/0xa0
[<ffffffff8102731c>] ? sys_exit_group+0x12/0x19
[<ffffffff812f3662>] ? system_call_fastpath+0x16/0x1b
BUG: Bad rss-counter state mm:ffff880007a496c0 idx:0 val:-1
BUG: Bad rss-counter state mm:ffff880007a496c0 idx:1 val:1
I guess kernel will examine if the pte has been modified, and I did something wrong. Here are vaddr1 and vaddr2's pte before/after my pte rewriting.
above 4G pte: 0x8000000007eb2067
below 4G pte: 0x0000000007ea7067
after rewriting pte...
above 4G pte: 0x8000000007eb2067
below 4G pte: 0x8000000007eb2067
Any idea? Thanks.
Note: Now I know I should release the physical page pointed by vaddr2's pte, otherwise kernel will note that physical page isn't pointed by any pte and give those error. But how? I try to use __free_page, but get error below.
BUG: unable to handle kernel paging request at ffffebe00008001c
IP: [<ffffffff8106b908>] __free_pages+0x4/0x2a
PGD 0
Oops: 0000 [#2] PREEMPT SMP
CPU 0