Debugging a hard fault in ARM Cortex-M4

I am at my wit's end trying to debug a hard fault on an EFR32BG12 processor. I've been following the instructions in the Silicon Labs knowledge base here:
https://www.silabs.com/community/mcu/32-bit/knowledge-base.entry.html/2014/05/26/debug_a_hardfault-78gc
I've also been using the Keil app note here to fill in some details:
http://www.keil.com/appnotes/files/apnt209.pdf
I've managed to get the hard fault to occur quite consistently in one place. When the hard fault occurs, the code from the knowledge base article gives me the following values (pushed onto the stack by the processor before calling the hard fault handler):
Name   Type      Value              Location
~~~~   ~~~~      ~~~~~              ~~~~~~~~
cfsr   uint32_t  0x20000 (Hex)      0x2000078c
hfsr   uint32_t  0x40000000 (Hex)   0x20000788
mmfar  uint32_t  0xe000ed34 (Hex)   0x20000784
bfar   uint32_t  0xe000ed38 (Hex)   0x20000780
r0     uint32_t  0x0 (Hex)          0x2000077c
r1     uint32_t  0x8 (Hex)          0x20000778
r2     uint32_t  0x0 (Hex)          0x20000774
r3     uint32_t  0x0 (Hex)          0x20000770
r12    uint32_t  0x1 (Hex)          0x2000076c
lr     uint32_t  0xab61 (Hex)       0x20000768
pc     uint32_t  0x38dc8 (Hex)      0x20000764
psr    uint32_t  0x0 (Hex)          0x20000760
Looking at the Keil app note, I believe a CFSR value of 0x20000 indicates a Usage Fault with the INVSTATE bit set, i.e.:
INVSTATE: Invalid state: 0 = no invalid state 1 = the processor has
attempted to execute an instruction that makes illegal use of the
Execution Program Status Register (EPSR). When this bit is set, the PC
value stacked for the exception return points to the instruction that
attempted the illegal use of the EPSR. Potential reasons: a) Loading a
branch target address to PC with LSB=0. b) Stacked PSR corrupted
during exception or interrupt handling. c) Vector table contains a
vector address with LSB=0.
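As a cross-check of the register values above, here is a minimal C sketch that decodes them on a host machine. The bit positions are the architectural ones (UFSR occupies bits 31:16 of CFSR, INVSTATE is bit 1 of UFSR, FORCED is bit 30 of HFSR, and the Thumb bit is bit 24 of the stacked PSR); the function and macro names are made up for illustration and are not taken from the knowledge base article or the app note.

```c
#include <stdint.h>
#include <stdio.h>

/* CFSR layout: MMFSR in bits 7:0, BFSR in bits 15:8, UFSR in bits 31:16. */
#define CFSR_UFSR_SHIFT 16u
#define UFSR_UNDEFINSTR (1u << 0)
#define UFSR_INVSTATE   (1u << 1)
#define UFSR_INVPC      (1u << 2)
#define UFSR_NOCP       (1u << 3)
#define HFSR_FORCED     (1u << 30)  /* a configurable fault was escalated to HardFault */
#define XPSR_T          (1u << 24)  /* Thumb state bit in the stacked PSR */

/* Hypothetical helper: non-zero if the UsageFault INVSTATE bit is set. */
int is_invstate(uint32_t cfsr) {
    return ((cfsr >> CFSR_UFSR_SHIFT) & UFSR_INVSTATE) != 0;
}

/* Hypothetical helper: non-zero if the hard fault was a forced escalation. */
int is_forced(uint32_t hfsr) {
    return (hfsr & HFSR_FORCED) != 0;
}

/* Hypothetical helper: non-zero if the stacked PSR still has the Thumb bit set. */
int thumb_bit_set(uint32_t psr) {
    return (psr & XPSR_T) != 0;
}

/* Print a short interpretation of the three values. */
void describe_fault(uint32_t cfsr, uint32_t hfsr, uint32_t psr) {
    if (is_forced(hfsr))
        puts("HFSR.FORCED: escalated from a configurable fault");
    if (is_invstate(cfsr))
        puts("UFSR.INVSTATE: illegal use of the EPSR");
    if (!thumb_bit_set(psr))
        puts("stacked PSR has T=0, which is consistent with INVSTATE");
}
```

Calling describe_fault(0x20000, 0x40000000, 0x0) with the values from the table reports all three conditions. A stacked psr of 0x0 has the Thumb bit clear, which fits reason (b) in the quoted text.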
The PC value pushed onto the stack by the exception (provided by the code from the knowledge base article) seems to be 0x38dc8. If I go to this address in the Simplicity Studio "Disassembly" window, I see the following:
00038db8: str r5,[r5,#0x14]
00038dba: str r0,[r7,r1]
00038dbc: str r4,[r5,#0x14]
00038dbe: ldr r4,[pc,#0x1e4] ; 0x38fa0
00038dc0: strb r1,[r4,#0x11]
00038dc2: ldr r5,[r4,#0x64]
00038dc4: ldrb r3,[r4,#0x5]
00038dc6: movs r3,r6
00038dc8: strb r1,[r4,#0x15]
00038dca: ldr r4,[r4,#0x14]
00038dcc: cmp r7,#0x6f
00038dce: cmp r6,#0x30
00038dd0: str r7,[r6,#0x14]
00038dd2: lsls r6,r6,#1
00038dd4: movs r5,r0
00038dd6: movs r0,r0
The address appears to be well past the end of my code. If I look at the same address in the "Memory" window, this is what I see:
0x00038DC8 69647561 2E302F6F 00766177 00000005 audio/0.wav.....
0x00038DD8 00000000 000F4240 00000105 00000000 ....@B..........
0x00038DE8 00000000 00000000 00000005 00000000 ................
0x00038DF8 0001C200 00000500 00001000 00000000 .Â..............
0x00038E08 00000000 F00000F0 02F00001 0003F000 ....ð..ð..ð..ð..
0x00038E18 F00004F0 06010005 01020101 01011201 ð..ð............
0x00038E28 35010121 01010D01 6C363025 2E6E6775 !..5....%06lugn.
0x00038E38 00746164 00000001 000008D0 00038400 dat.....Ð.......
Curiously, "audio/0.wav" is a static string which is part of the firmware. If I understand correctly, what I've learned here is that PC somehow gets set to this point in memory, which of course is not a valid instruction and causes the hard fault.
To debug the issue, I need to know how PC came to be set to this incorrect value. I believe the LR register should give me an idea. The LR register pushed onto the stack by the exception seems to be 0xab61. If I look at this location, I see the following in the Disassembly window:
1270 dp->sect = clst2sect(fs, clst);
0000ab58: ldr r0,[r7,#0x10]
0000ab5a: ldr r1,[r7,#0x14]
0000ab5c: bl 0x00009904
0000ab60: mov r2,r0
0000ab62: ldr r3,[r7,#0x4]
0000ab64: str r2,[r3,#0x18]
It looks to me like the problem occurs during this call specifically:
0000ab5c: bl 0x00009904
This makes me think that the problem occurs as a result of a corrupt stack, which causes clst2sect to return to an invalid part of memory rather than to 0xab60. The code for clst2sect is pretty innocuous:
/*-----------------------------------------------------------------------*/
/* Get physical sector number from cluster number */
/*-----------------------------------------------------------------------*/
DWORD clst2sect ( /* !=0:Sector number, 0:Failed (invalid cluster#) */
    FATFS* fs, /* Filesystem object */
    DWORD clst /* Cluster# to be converted */
)
{
    clst -= 2; /* Cluster number is origin from 2 */
    if (clst >= fs->n_fatent - 2) return 0; /* Is it invalid cluster number? */
    return fs->database + fs->csize * clst; /* Start sector number of the cluster */
}
Does this analysis sound about right?
I suppose the problem I've run into is that I have no idea what might cause this kind of behaviour... I've tried putting breakpoints in all of my interrupt handlers, to see if one of them might be corrupting the stack, but there doesn't seem to be any pattern--sometimes, no interrupt handler is called but the problem still occurs.
In that case, though, it's hard for me to see how a program might try to execute code at a location well past the actual end of the code... I feel like a function pointer might be a likely candidate, but in that case I would expect to see the problem show up, e.g., where a function pointer is used. However, I don't see any function pointers used near where the error is occurring.
Perhaps there is more information I can extract from the debug information I've given above? The problem is quite reproducible, so if there's something I have not tried, but which you think might give some insight, I would love to hear it.
Thanks for any help you can offer!

After about a month of chasing this one, I managed to identify the cause of the problem. I hope I can give enough information here that this will be useful to someone else.
In the end, the problem was caused by passing a pointer to a non-static local variable to a state machine which changed the value at that memory location later on. Because the local variable was no longer in scope, that memory location was a random point in the stack, and changing the value there corrupted the stack.
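The pattern can be sketched roughly as follows. This is not the actual firmware: transfer_sm_t, start_transfer, finish_transfer and the other names are hypothetical stand-ins for the state machine described above. The buggy variant is shown only in a comment, because running it is undefined behaviour; the compiled code shows one possible fix, which is to give the storage static duration.

```c
#include <stdint.h>

/* Hypothetical state machine that remembers where to store a result. */
typedef struct {
    uint32_t *result;   /* written later, long after start_transfer() returns */
} transfer_sm_t;

/* Store the destination pointer; nothing is written yet. */
void start_transfer(transfer_sm_t *sm, uint32_t *result) {
    sm->result = result;
}

/* Called later (e.g. from another state): writes through the stored pointer. */
void finish_transfer(transfer_sm_t *sm, uint32_t value) {
    *sm->result = value;
}

/* BUGGY pattern (do not do this):
 *
 *   void begin(transfer_sm_t *sm) {
 *       uint32_t status;               // lives in begin()'s stack frame
 *       start_transfer(sm, &status);   // the pointer escapes
 *   }                                  // frame released, sm->result dangles
 *
 * A later finish_transfer() then overwrites whatever occupies that stack
 * slot at the time, possibly a saved return address.
 */

/* One possible fix: storage with static duration outlives every call. */
static uint32_t transfer_status;

void begin_fixed(transfer_sm_t *sm) {
    start_transfer(sm, &transfer_status);
}

uint32_t last_status(void) {
    return transfer_status;
}
```

Other fixes would work just as well, for example having the caller own the variable for the lifetime of the state machine, or storing the value inside the state machine struct itself.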
The problem was difficult to track down for two reasons:
Depending on how the code compiled, the changed memory location could be something non-critical like another local variable, which would cause a much more subtle error. Only when I got lucky would the change affect the PC register and cause a hard fault.
Even when I found a version of the code that consistently generated a hard fault, the actual hard fault typically occurred somewhere up the call stack, when a function returned and popped the stack value into PC. This made it difficult to identify the cause of the problem--all I knew was that something was corrupting the stack before that function return.
A few tools were really helpful in identifying the cause of the problem:
Early on, I had identified a block of code where the hard fault usually occurred using GPIO pins. I would toggle a pin high before entering the block and low when exiting the block. Then I performed many tests, checking if the pin was high or low when the hard fault occurred, and used a sort of binary search to determine the smallest block of code which consistently contained all the hard faults.
The hard fault pushes a number of important registers onto the stack. These helped me confirm where the PC register was becoming corrupt, and also helped me understand that it was becoming corrupt as a result of a stack corruption.
Starting somewhere before that block of code and stepping forward while keeping an eye on local variables, I was able to identify a function call that was corrupting the stack. I could confirm this using Simplicity Studio's memory view.
Finally, stepping through the offending function in detail, I realized that the problem was occurring when I dereferenced a stored pointer and wrote to that memory location. Looking back at where that pointer value was set, I realized it had been set to point to a non-static local variable that was now out of scope.
Thanks to @SeanHoulihane and @cooperised, who helped me eliminate a few possible causes and gave me a little more confidence with the debugging tools.

Ask for clarification about "the segment registers continue to point to the same linear addresses as in real address mode" [duplicate]

This question already has an answer here:
How can the x86 processor fetch the instruction just after GDT is loaded by a bootloader?
The question is about persistent validity of code segment selector while switching from real mode to protected mode on intel i386. The switching code is as follows (excerpted from bootasm.S of xv6 x86 version):
9138 # Switch from real to protected mode. Use a bootstrap GDT that makes
9139 # virtual addresses map directly to physical addresses so that the
9140 # effective memory map doesn’t change during the transition.
9141 lgdt gdtdesc
9142 movl %cr0, %eax
9143 orl $CR0_PE, %eax
9144 movl %eax, %cr0
9150 # Complete the transition to 32-bit protected mode by using a long jmp
9151 # to reload %cs and %eip. The segment descriptors are set up with no
9152 # translation, so that the mapping is still the identity mapping.
9153 ljmp $(SEG_KCODE<<3), $start32
The GDT layout is as follows:
9182 gdt:
9183 SEG_NULLASM # null seg
9184 SEG_ASM(STA_X|STA_R, 0x0, 0xffffffff) # code seg
9185 SEG_ASM(STA_W, 0x0, 0xffffffff) # data seg
After executing line 9144, the processor switches to protected mode, in which only segmentation is enabled (paging has not yet been enabled). My understanding is that, since segmentation is now active, fetching the next instruction should follow the protected-mode segmentation rules. At this point (immediately before line 9153), however, the code selector is still 0, which as I understand it selects the zeroth descriptor in the GDT, the null descriptor. So my question is: how can the ljmp instruction be fetched through a null descriptor? I tried to answer this by googling, and one document gives the following explanation: http://www.logix.cz/michal/doc/i386/chp10-03.htm#10-03
The segment registers continue to point to the same linear addresses
as in real address mode
This sentence seems to answer my question: if the segment registers continue to point to the same linear addresses, the next instruction fetched is the same one as in real mode, that is, the ljmp. But it immediately raises new questions. Why can the segment registers "continue to point to the same linear addresses"? Hasn't the processor switched to protected mode? Doesn't the value 0 in %cs select the zeroth descriptor, rather than the first one (set in line 9184), which is the descriptor that should be used to fetch the ljmp instruction? How does the x86 CPU know that the ljmp is the next instruction to execute? Which manual describes this behaviour? I tried to persuade myself that the ljmp had already been prefetched into the processor's instruction queue, but the second paragraph of the same web page says that any prefetched ljmp is invalidated, so the CPU has to fetch the next instruction afresh. Can you please clarify how "the segment registers continue to point to the same linear addresses as in real address mode"? Thank you.
PS, the CPU I am working on is intel i386 compatible.
The modern reference is the Intel Software Developer's Manual, Volume 3A, Section 9.9.1, "Switching to protected mode".
Intel isn't big on explaining how magic works internally. What it says, and all you need to know, is that if your movl %eax, %cr0 is immediately followed by a far jump or far call, then everything will work. If you put any other instruction there, then "random failures can occur" (their wording).
As it says, %cs continues to hold its previous value, and presumably that's the value that would be pushed on the stack if you did a far call as the instruction after movl %eax, %cr0. (Where the stack would be is another interesting question - I think everyone uses the jump instead so it rarely comes up.) But for this one instruction it evidently isn't used as a selector in the usual way.
One guess as to how it might work: we know that in protected mode, there are hidden registers that store the segment attributes, and are reloaded from the descriptor table when you load a segment register. So the movl %eax, %cr0 might cause the hidden register corresponding to %cs to be loaded with attributes of a segment whose base address is the linear address of the current 16-bit segment: e.g. if %cs contained 0x1234 then it could be a segment with base address 0x12340. But the %cs register itself could be left alone, temporarily not matching its hidden counterpart. Then if the high bits of %eip are zeroed, the next instruction would be fetched from the right place. That instruction is required to be the long jump which will reload %cs as well as the hidden segment attribute register.
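The guess in the previous paragraph can be written down as a toy model in C. This is only an illustration of the idea, not a description of how any real CPU is implemented: seg_reg_t, enter_protected_mode, far_jump and fetch_address are invented names, and the selector value 0x08 in the usage note assumes that SEG_KCODE is 1.

```c
#include <stdint.h>

/* Toy model of a segment register: the visible selector plus the hidden
   (cached) base address that is actually used for address translation. */
typedef struct {
    uint16_t selector;     /* what software sees in %cs */
    uint32_t hidden_base;  /* what the fetch unit uses */
} seg_reg_t;

/* The guess from the answer: setting CR0.PE leaves the visible selector
   alone and gives the hidden part the base of the current 16-bit segment. */
void enter_protected_mode(seg_reg_t *cs) {
    cs->hidden_base = (uint32_t)cs->selector << 4;
}

/* A far jump reloads both parts; descriptor_base stands for the base field
   of the GDT entry that the new selector refers to. */
void far_jump(seg_reg_t *cs, uint16_t selector, uint32_t descriptor_base) {
    cs->selector = selector;
    cs->hidden_base = descriptor_base;
}

/* Linear address of the next instruction fetch in this model. */
uint32_t fetch_address(const seg_reg_t *cs, uint32_t ip) {
    return cs->hidden_base + ip;
}
```

With %cs holding 0x1234, enter_protected_mode leaves the selector at 0x1234 and fetch_address(&cs, 0x10) gives 0x12350, exactly as in real mode. Only after far_jump(&cs, 0x08, 0) does the selector match a GDT entry, and fetches then use the flat base of 0.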
It's also possible that it just sets some internal flag that says "even though in protected mode, fetch the next instruction according to real-mode address translation". Then this flag gets cleared when a far jump occurs, or after one instruction has been fetched, or something like that.

Threads spawned by a detoured pthread_create do not execute instructions

I've got a custom implementation of detours on macOS and a test application using it, which is written in C, compiled for macOS x86_64, running on an Intel i9 processor.
The implementation works fine with a multitude of functions. However, if I detour pthread_create, I encounter strange behaviour: threads that have been spawned via a detoured pthread_create do not execute instructions. I can step through instructions one by one, but as soon as I continue, the thread does not progress. There are no mutexes or synchronisation involved and the result of the function is 0 (success). The exact same application with detours turned off works fine, so the application itself is unlikely to be the culprit.
This does not happen all the time: sometimes the threads are fine, but at other times the test application stalls in the following state:
(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff7296f55e libsystem_kernel.dylib`__ulock_wait + 10
frame #1: 0x00007fff72a325c2 libsystem_pthread.dylib`_pthread_join + 347
frame #2: 0x0000000100001186 DetoursTestApp`main + 262
frame #3: 0x00007fff7282ccc9 libdyld.dylib`start + 1
frame #4: 0x00007fff7282ccc9 libdyld.dylib`start + 1
thread #2
frame #0: 0x00007fff72a2cb7c libsystem_pthread.dylib`thread_start
Relevant memory pages have the executable flag set. The detour function that intercepts the thread creation looks like this:
static int pthread_create_detour(pthread_t* thread,
                                 const pthread_attr_t* attr,
                                 void* (*start_routine)(void*),
                                 void* arg)
{
    detour_count++;
    pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
    return original(thread, attr, start_routine, arg);
}
Where detour_original retrieves the pointer to [original function + size of function's prologue].
Tracing through the instructions, everything seems to be working correctly and pthread_create terminates successfully. Tracing the application's system calls via dtruss does show calls to
bsdthread_create(0x10DB964B0, 0x0, 0x7000080DB000) = 29646848 0
With what I have confirmed are the correct arguments.
This behaviour is only observed in release builds - debug works fine but the disassembly and execution of a detoured pthread_create and associated detours code seems to be identical in both cases.
Workarounds
I found a couple of odd workarounds for this issue that don't make much sense. Given the detour function, a number of things can be substituted into the following:
static int pthread_create_detour(pthread_t* thread,
                                 const pthread_attr_t* attr,
                                 void* (*start_routine)(void*),
                                 void* arg)
{
    detour_count++;
    pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
    <...> <== SUBSTITUTE HERE
    return original(thread, attr, start_routine, arg);
}
A cache flush.
__asm__ __volatile__("" ::: "memory");
_mm_clflush(real_pthread_create);
A sleep of any duration - usleep(1)
A printf statement.
A memory allocation larger than 32768 bytes, e.g. void *data = malloc(40000);.
Cache?
All of these seem to point to a stale instruction cache. However, the Intel manual states the following:
A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.
What's even more interesting is that those workarounds have to be executed for every new thread created, with the execution happening on the main thread, so it's very unlikely to be the cache. I have also tried putting in cache flushes at every memory write that writes instructions but that did not help. I've also written a memcpy that bypasses the cache with the use of Intel's intrinsic _mm_stream_si32 and swapped it out for every instruction memory write in my implementation without any success.
Race condition?
The next suspect in line is a race condition. However, it's not clear what would be racing as at first there are no other threads. I have put in a fibonacci sequence calculation for a randomly-generated number and that would still stall the newly-spawned threads.
The question
What is causing this issue? What other mechanisms could be responsible for this?
At this point I have run out of things to check so any suggestions will be welcome.
I found that the reason why the spawned thread was not executing instructions was that the r8 register wasn't being cleared at the right time in the execution of pthread_create due to an issue with my detours implementation.
If we look at the disassembly of the function, it is split up to two parts - the "head" and the "body" that's found in an internal _pthread_create function. The head does two things - zeroes out r8 and jumps to the body:
libsystem_pthread.dylib`pthread_create:
0x7fff72a2e236 <+0>: 45 31 c0 xor r8d, r8d
0x7fff72a2e239 <+3>: e9 40 37 00 00 jmp 0x7fff72a3197e ; _pthread_create
libsystem_pthread.dylib`_pthread_create:
0x7fff72a3197e <+0>: 55 push rbp
0x7fff72a3197f <+1>: 48 89 e5 mov rbp, rsp
0x7fff72a31982 <+4>: 41 57 push r15
<...> // the rest of the 1409 instructions
My implementation would detour the internal _pthread_create function instead of the head containing the actual entry point, which meant that r8 would get cleared at the wrong time (before the detour). Since the detour function contains some code of its own, the execution would go something like:
pthread_create (r8 gets cleared) -> _pthread_create -> chain of jumps -> pthread_create_detour -> trampoline (containing the beginning of _pthread_create) -> _pthread_create + 6
Which meant that depending on the contents of the pthread_create_detour function the r8 would not always end up with a 0 when it returned to the internal function.
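To make the head/body split concrete, here is a small C sketch that computes where a near jmp rel32 (opcode E9) lands. A detours implementation could use a check like this to notice that an entry point is only a short head that tail-jumps elsewhere, and then place the hook on the head rather than on the jump target. The function name jmp_rel32_target is invented for illustration and is not part of my implementation.

```c
#include <stdint.h>

/* If `code` starts with a near `jmp rel32` (opcode 0xE9), return the
   absolute target address; otherwise return 0.
   `address` is the address of the jmp instruction itself. */
uint64_t jmp_rel32_target(const uint8_t *code, uint64_t address) {
    if (code[0] != 0xE9)
        return 0;

    /* The 32-bit displacement is stored little-endian after the opcode. */
    uint32_t raw = (uint32_t)code[1]
                 | ((uint32_t)code[2] << 8)
                 | ((uint32_t)code[3] << 16)
                 | ((uint32_t)code[4] << 24);
    int32_t rel = (int32_t)raw;

    /* The displacement is relative to the end of the 5-byte instruction. */
    return address + 5 + (uint64_t)(int64_t)rel;
}
```

For the bytes e9 40 37 00 00 at 0x7fff72a2e239 from the disassembly above, this gives 0x7fff72a2e239 + 5 + 0x3740 = 0x7fff72a3197e, which is the address of _pthread_create.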
It's not yet clear why having r8 set to something other than 0 before _pthread_create would not crash but instead start up a thread in a locked-up state. An important detail is that the stalled thread had the rflags register set to 0x200, which should never be the case according to Intel's manual (bit 1 of RFLAGS is reserved and always reads as 1). This is what led me to inspect the CPU state more closely, which led to the answer.

Cortex M0 HardFault_Handler and getting the fault address

I'm having a HardFault when executing my program. I've found dozens of ways to get PC's value, but I'm using Keil uVision 5 and none of them has worked.
As far as I know I'm not in a multitasking context, and PSP contains 0xFFFFFFF1, so adding 24 to it would cause overflow.
Here's what I've managed to get working (as in, it compiles and executes):
enum { r0, r1, r2, r3, r12, lr, pc, psr};
extern "C" void HardFault_Handler()
{
    uint32_t *stack;
    __ASM volatile("MRS stack, MSP");
    stack += 0x20;
    pc = stack[pc];
    psr = stack[psr];
    __ASM volatile("BKPT #01");
}
Note the "+= 0x20", which is here to compensate for C function stack.
Whenever I read the PC's value, it's 0.
Would anyone have working code for that?
Otherwise, here's how I do it manually:
Put a breakpoint on HardFault_Handler (the original one)
When it breaks, look at MSP
Add 24 to its value.
Dump memory at that address.
And there it is, 0x00000000.
What am I doing wrong?
A few problems with your code
uint32_t *stack;
__ASM volatile("MRS stack, MSP");
MRS supports register destinations only. Your assembler might be clever enough to transfer it to a temporary register first, but I'd like to see the machine code generated from that.
If you are using some kind of multitasking system, it might use PSP instead of MSP. See the linked code below on how one can distinguish that.
pc = stack[pc];
psr = stack[psr];
It uses the previous values of pc and psr as an index. Should be
pc = stack[6];
psr = stack[7];
Whenever I read the PC's value, it's 0.
Your program might actually have jumped to address 0 (e.g. through a null function pointer), tried to execute the value found there, which was probably not a valid instruction but the initial SP value from the vector table, and faulted on that. This code
void (*f)(void) = 0;
f();
does exactly that, I'm seeing 0x00000000 at offset 24.
Would anyone have working code for that?
This works for me. Note the code choosing between psp and msp, and the __attribute__((naked)) directive. You could try to find some equivalent for your compiler, to prevent the compiler from allocating a stack frame at all.
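Since the linked code is not reproduced here, the following is a minimal sketch of the frame-selection logic only, written so that it can be compiled and checked on a host. It assumes that a small naked assembly shim (not shown) passes in the EXC_RETURN value from LR together with MSP and PSP as they were on handler entry; select_frame, stacked_pc and the FRAME_ names are illustrative, not from any vendor library.

```c
#include <stdint.h>

/* Order in which the core stacks registers on exception entry. */
enum {
    FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
    FRAME_R12, FRAME_LR, FRAME_PC, FRAME_PSR
};

/* Bit 2 of EXC_RETURN (the value in LR on handler entry) tells which stack
   holds the exception frame: 0 = main stack (MSP), 1 = process stack (PSP). */
const uint32_t *select_frame(uint32_t exc_return,
                             const uint32_t *msp,
                             const uint32_t *psp) {
    return (exc_return & (1u << 2)) ? psp : msp;
}

/* Stacked PC: word index 6, i.e. byte offset 24 from the frame pointer. */
uint32_t stacked_pc(const uint32_t *frame) {
    return frame[FRAME_PC];
}

/* Stacked PSR: word index 7. */
uint32_t stacked_psr(const uint32_t *frame) {
    return frame[FRAME_PSR];
}
```

The stack pointers have to be captured before the compiler's own prologue moves SP, which is why a naked handler is suggested above. With that in place no manual adjustment such as the "+= 0x20" is needed.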

Interpreting App Verifier output: Heap corruption or misinterpreting stack address as heap address?

We have a test case that crashes our big MFC-based app with a heap corruption error.
I turned on the page heap using App Verifier for the DLL in question (turning the heap on for the entire process isn't workable for other reasons, unfortunately.) The verifier didn't give us any more information than we already had; it triggered at the same point as the original crash.
Right now I have two competing theories. Which theory do you think is more likely to be correct, and what would your next steps be?
1. This is indeed heap corruption. The verifier isn't catching the original damage because it's happening in another DLL. We should try to activate the verifier for more DLLs and determine what code is damaging the heap.
2. The heap is fine; the problem is that we are treating a stack address as a heap address. We should study the code in this callstack further to figure out what's going wrong.
I'm leaning #2 because the parameter to free() looks like a stack address, but so far nobody has proposed an explanation for how this is possible.
Here's a snippet of the call stack. MyString is a simple wrapper around CString. MyAppDll is the DLL that's set to use the page heap.
msvcr90.dll!free(void * pBlock=0x000000000012d6e8) Line 110
mfc90u.dll!ATL::CStringT > >::~CStringT > >() Line 1011 + 0x1e bytes
MyStringDll.dll!MyString::~MyString() Line 59
MyAppDll.dll!doStuffWithLotsOfStringInlining(MyClass* input=0x000000000012d6d0) Line 863 + 0x26 bytes
Here are the registers inside the free() stack frame:
RAX = 0000000000000000 RBX = 000000000012D6E8 RCX = 0000000000000000
RDX = 0000000000000000 RSI = 000000000012D6D0 RDI = 00000000253C1090
R8 = 0000000000000000 R9 = 0000000000000000 R10 = 0000000000000000
R11 = 0000000000000000 R12 = 000000000012D7D0 R13 = 000007FFFFC04CE0
R14 = 0000000025196600 R15 = 0000000000000000 RIP = 00000000725BC7BC
RSP = 000000000012D570 RBP = 000007FFF3670900 EFL = 00000000
And here's the app verifier message:
VERIFIER STOP 0000000000000010: pid 0x1778: Corrupted start stamp for heap block.
00000000083B1000 : Heap handle used in the call.
000000006DD394E8 : Heap block involved in the operation.
54D32858A8747589 : Size of the heap block.
000000005E33BA8D : Corrupted stamp value.
I think your string or users of it is/are overflowing/underflowing the string's buffer somewhere, probably against a field which is next to the string pointer, which you then try to free.
Your RSP is 12D570, which is 0x178 bytes (94 four-byte ints, or 47 quadwords) below the address you are trying to free, so somewhere in between, something bad is happening with buffers.
Verify that you are not doing any unsafe string ops and that you are correctly reading the documentation for passing buffers/strings into the DLLs you are using.
You probably need more code in your question if you want a more exact answer.

grdb not working variables

I know this is probably a silly question, but I just can't figure it out. I'm debugging this:
xor eax,eax
mov ah,[var1]
mov al,[var2]
call addition
stop: jmp stop
var1: db 5
var2: db 6
addition:
add ah,al
ret
the numbers that I find on addresses var1 and var2 are 0x0E and 0x07. I know it's not segmented, but that ain't reason for it to do such escapades, because the addition call works just fine. Could you please explain to me where is my mistake?
I see the problem, though I don't know how to fix it yet. For some reason the instruction pointer starts at 0x100 and all the segment registers are at 0x1628. Instructions are presumably addressed as [cs:ip] (one of the segment registers plus the instruction pointer, for sure). The offset to var1 is 0x10 (probably because it is the 0x10th byte from the beginning of the code). I tried to examine the memory, and what I got was:
1628:100 8 bytes
1628:108 8 bytes
1628:110 <- wtf? (assume another 8 bytes)
1628:118 ...
Whatever is going on in memory, [cs:var1] points somewhere other than into my code, probably to where a .data label would normally be addressed through ds. I don't know what is supposed to be at 1628:10.
OK, I found out what caused this mess and cost me a whole damn day. The behaviour described above is correct, and the code is fully functional. What I didn't know is that the grdb debugger for some reason sets the beginning address to 0x100. The solution is to put the directive ORG 0x100 on the first line, and that's the whole thing. The code itself ran because the instruction pointer held the right address of the first instruction and simply advanced from there, but the assembler doesn't know at what effective address the program will be stored, so label addresses stay relative to the first line of the code. That means all the variables (unless a label for a data section is used) are addressed as if the program started at 0x0, which of course wouldn't work with DOS, and grdb apparently emulates some DOS features. Sorry for the language, and thanks everyone for the effort; I hope this spares someone some time if they have the same problem.
heheh.. at least now i know the reason why to use .data section :))))
Assuming that is x86 assembly, var1 and var2 must reside in the .data section.
Explanation: I'm not going to explain exactly how the executable file is structured (not to mention this is platform-specific), but here's a general idea as to why what you're doing is not working.
Assembly code must be divided into data sections due to the fact that each data section corresponds directly (or almost directly) to a specific part of the binary/executable file. All global variables must be defined in the .data sections since they have a corresponding location in the binary file which is where all global data resides.
Defining a global variable (or a globally accessed part of the memory) inside the code section will lead to undefined behavior. Some x86 assemblers might even throw an error on this.
