Threads spawned by a detoured pthread_create do not execute instructions - macos

I've got a custom implementation of detours on macOS and a test application using it, which is written in C, compiled for macOS x86_64, running on an Intel i9 processor.
The implemention works fine with a multitude of functions. However, if I detour pthread_create, I encounter strange behaviour: threads that have been spawned via a detoured pthread_create do not execute instructions. I can step through instructions one by one but as soon as I continue it does not progress. There are no mutexes or synchronisation involved and the result of the function is 0 (success). The exact same application with detours turned off works fine so it's unlikely to be the culprit.
This does not happen all the time - sometimes they are fine but at other times the test applications stalls in the following state:
(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff7296f55e libsystem_kernel.dylib`__ulock_wait + 10
frame #1: 0x00007fff72a325c2 libsystem_pthread.dylib`_pthread_join + 347
frame #2: 0x0000000100001186 DetoursTestApp`main + 262
frame #3: 0x00007fff7282ccc9 libdyld.dylib`start + 1
frame #4: 0x00007fff7282ccc9 libdyld.dylib`start + 1
thread #2
frame #0: 0x00007fff72a2cb7c libsystem_pthread.dylib`thread_start
Relevant memory pages have the executable flag set. The detour function that intercepts the thread creation looks like this:
static int pthread_create_detour(pthread_t* thread,
const pthread_attr_t* attr,
void* (*start_routine)(void*),
void* arg)
{
detour_count++;
pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
return original(thread, attr, start_routine, arg);
}
Where detour_original retrieves the pointer to [original function + size of function's prologue].
Tracing through the instructions, everything seems to be working correctly and pthread_create terminates successfully. Tracing the application's system calls via dtruss does show calls to
bsdthread_create(0x10DB964B0, 0x0, 0x7000080DB000) = 29646848 0
With what I have confirmed are the correct arguments.
This behaviour is only observed in release builds - debug works fine but the disassembly and execution of a detoured pthread_create and associated detours code seems to be identical in both cases.
Workarounds
I found a couple of odd workarounds for this issue that don't make much sense. Given the detour function, a number of things can be substituted into the following:
static int pthread_create_detour(pthread_t* thread,
const pthread_attr_t* attr,
void* (*start_routine)(void*),
void* arg)
{
detour_count++;
pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
<...> <== SUBSTITUTE HERE
return original(thread, attr, start_routine, arg);
}
A cache flush.
__asm__ __volatile__("" ::: "memory");
_mm_clflush(real_pthread_create);
A sleep of any duration - usleep(1)
A printf statement.
A memory allocation larger than 32768 bytes, e.g. void *data = malloc(40000);.
Cache?
All of these seem to point to a stale instruction cache. However, the Intel manual states the following:
A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.
What's even more interesting is that those workarounds have to be executed for every new thread created, with the execution happening on the main thread, so it's very unlikely to be the cache. I have also tried putting in cache flushes at every memory write that writes instructions but that did not help. I've also written a memcpy that bypasses the cache with the use of Intel's intrinsic _mm_stream_si32 and swapped it out for every instruction memory write in my implementation without any success.
Race condition?
The next suspect in line is a race condition. However, it's not clear what would be racing as at first there are no other threads. I have put in a fibonacci sequence calculation for a randomly-generated number and that would still stall the newly-spawned threads.
The question
What is causing this issue? What other mechanisms could be responsible for this?
At this point I have run out of things to check so any suggestions will be welcome.

I found that the reason why the spawned thread was not executing instructions was that the r8 register wasn't being cleared at the right time in the execution of pthread_create due to an issue with my detours implementation.
If we look at the disassembly of the function, it is split up to two parts - the "head" and the "body" that's found in an internal _pthread_create function. The head does two things - zeroes out r8 and jumps to the body:
libsystem_pthread.dylib`pthread_create:
0x7fff72a2e236 <+0>: 45 31 c0 xor r8d, r8d
0x7fff72a2e239 <+3>: e9 40 37 00 00 jmp 0x7fff72a3197e ; _pthread_create
libsystem_pthread.dylib`_pthread_create:
0x7fff72a3197e <+0>: 55 push rbp
0x7fff72a3197f <+1>: 48 89 e5 mov rbp, rsp
0x7fff72a31982 <+4>: 41 57 push r15
<...> // the rest of the 1409 instructions
My implementation would detour the internal _pthread_create function instead of the head containing the actual entry point which meant that the r8 would get cleared at the wrong time (before the detour). Since the detour function would contain some could, the execution would go something like:
pthread_create (r8 gets cleared) -> _pthread_create -> chain of jumps -> pthread_create_detour -> trampoline (containing the beginning of _pthread_create) -> _pthread_create + 6
Which meant that depending on the contents of the pthread_create_detour function the r8 would not always end up with a 0 when it returned to the internal function.
It's not yet clear why having r8 set to something other than 0 before _pthread_create would not crash but instead start up a thread in a locked up state. An important detail is that the stalled thread would have the rflags register set to 0x200 which should never be the case according to Intel's manual. This is what lead me to inspecting the CPU state more closely, leading to the answer.

Related

Cortex M3 - Calling a SVC inside a C function, and returning to thread mode

I am writing System Calls, as recommended by Joseph Yiu (M3 Guide), by taking the arguments from the stack. The Assembly SVC Handler is like:
SVC_Handler:
MOV R0, #0
MSR CONTROL, R0
CMP LR, #0xFFFFFFFD
BEQ KernelEntry
B KernelExit
KernelEntry:
<save current user stack>
B svchandler_main
KernelExit:
<code to restore the very same saved user stack saved before>
MOV LR, #0xFFFFFFFFD
BX LR
Well, so the svchandler_main is C function that recovers the immediates (arguments of the the system call) creates the kernel stack and branches to 0xFFFFFFFF9 (MSP privileged). The system call itself is made like:
#define svc(code) asm volatile ("svc %[immediate]"::[immediate] "I" (code))
void SysCall_MyCall(int32_t args)
{
svc(CallBack_Number);
}
That said, the callback function, running in handler mode:
void SysCallBack(void* args)
{
/* <my c routine>*/
asm volatile("svc #0"); //to exit the kernel mode
}
The last SV Call is performed so that the SVC_Handler in assembly will identify it is coming from handler mode (MSP privileged) and will exit the kernel - kind of a cooperative scheduling on the kernel. The problem is the context is saved with PSP pointing inside SysCall_MyCall, and it returns to there, and never exits. If I use inline functions, I will lose the handy svchandler_main. Any ideas? I didnt write the svchandler_main here because it is a classic code found on ARM application notes. Thanks.
Edit, to clarify: I am not branching to the callback function INSIDE SVC Handler. It creates the callback stack, change the LR to 0xFFFFFFF9 and execute a BX LR, exiting the the interruption, and going to the indicated MSP. To exit kernel another SVC is called, and user thread resumed.
It seems that you misunderstand how exception entry and return works on the Cortex-M. When you issue an SVC instruction from thread mode, the CPU transitions to handler mode just as for any other exception.
Handler mode is always privileged, and always uses the main stack (MSP). Thread mode can be either privileged or unprivileged depending on the nPRIV bit (bit 0) in the CONTROL register, and may be configured to use the process stack (PSP) by setting the SPSEL bit (bit 1) in the CONTROL register from thread mode.
On entry to handler mode, r0-r3, r12, lr, pc and xPSR are pushed to the active stack (PSP or MSP, depending on which is in use) and an exception return value is loaded into lr. The stack is switched to MSP. At the end of the handler, a BX lr instruction (or equivalent) causes this value to be used as a branch target, which automatically causes a restoration of the previous mode and stack, and pops r0-r3, r12, lr, pc and xPSR. The pop of pc restores execution from the place where the interrupt occurred.
The important thing about this mechanism is that it is 100% compatible with the ARM ABI. In other words, it is possible to write an ordinary function and use it as an exception handler, just by placing the address of the function in the appropriate place in the interrupt vector table. That's because the return at the end of a function is actioned by BX lr or equivalent, which is exactly the same instruction that triggers a return from handler mode.
So to write an SVC handler that makes use of callbacks, it is necessary to:
Work out which stack was in use when the SVC instruction was issued
Dig out the stacked pc to find the address of the SVC instruction itself, and extract the 8-bit constant from within the SVC instruction
Use the constant to work out which callback to invoke
Branch to (not call) the appropriate callback
The callback can be a perfectly ordinary function. When it returns, it will trigger the return from handler mode because the appropriate exception return code will still be in lr.
A handler that does all of this is presented in the M3 Guide, chapter 10.
If it is required that the callback receives arguments, this is a bit more complex but I can expand my answer if you'd like. Generally handler callbacks execute in handler mode (that's pretty much the point of SVC). If for some reason you need the callback to be executed without privilege, that's more complex still; there is an example in Chapter 23 of the M3 guide, though. You refer in the comments to not wanting to "manage nested interrupts" but really nested interrupts just manage themselves.

Debugging a hard fault in ARM Cortex-M4

I am at my wit's end trying to debug a hard fault on an EFR32BG12 processor. I've been following the instructions in the Silicon Labs knowledge base here:
https://www.silabs.com/community/mcu/32-bit/knowledge-base.entry.html/2014/05/26/debug_a_hardfault-78gc
I've also been using the Keil app note here to fill in some details:
http://www.keil.com/appnotes/files/apnt209.pdf
I've managed to get the hard fault to occur quite consistently in one place. When the hard fault occurs, the code from the knowledge base article gives me the following values (pushed onto the stack by the processor before calling the hard fault handler):
Name Type Value Location
~~~~ ~~~~ ~~~~~ ~~~~~~~~
cfsr uint32_t 0x20000 (Hex) 0x2000078c
hfsr uint32_t 0x40000000 (Hex) 0x20000788
mmfar uint32_t 0xe000ed34 (Hex) 0x20000784
bfar uint32_t 0xe000ed38 (Hex) 0x20000780
r0 uint32_t 0x0 (Hex) 0x2000077c
r1 uint32_t 0x8 (Hex) 0x20000778
r2 uint32_t 0x0 (Hex) 0x20000774
r3 uint32_t 0x0 (Hex) 0x20000770
r12 uint32_t 0x1 (Hex) 0x2000076c
lr uint32_t 0xab61 (Hex) 0x20000768
pc uint32_t 0x38dc8 (Hex) 0x20000764
psr uint32_t 0x0 (Hex) 0x20000760
Looking at the Keil app note, I believe a CFSR value of 0x20000 indicates a Usage Fault with the INVSTATE bit set, i.e.:
INVSTATE: Invalid state: 0 = no invalid state 1 = the processor has
attempted to execute an instruction that makes illegal use of the
Execution Program Status Register (EPSR). When this bit is set, the PC
value stacked for the exception return points to the instruction that
attempted the illegal use of the EPSR. Potential reasons: a) Loading a
branch target address to PC with LSB=0. b) Stacked PSR corrupted
during exception or interrupt handling. c) Vector table contains a
vector address with LSB=0.
The PC value pushed onto the stack by the exception (provided by the code from the knowledge base article) seems to be 0x38dc8. If I go to this address in the Simplicity Studio "Disassembly" window, I see the following:
00038db8: str r5,[r5,#0x14]
00038dba: str r0,[r7,r1]
00038dbc: str r4,[r5,#0x14]
00038dbe: ldr r4,[pc,#0x1e4] ; 0x38fa0
00038dc0: strb r1,[r4,#0x11]
00038dc2: ldr r5,[r4,#0x64]
00038dc4: ldrb r3,[r4,#0x5]
00038dc6: movs r3,r6
00038dc8: strb r1,[r4,#0x15]
00038dca: ldr r4,[r4,#0x14]
00038dcc: cmp r7,#0x6f
00038dce: cmp r6,#0x30
00038dd0: str r7,[r6,#0x14]
00038dd2: lsls r6,r6,#1
00038dd4: movs r5,r0
00038dd6: movs r0,r0
The address appears to be well past the end of my code. If I look at the same address in the "Memory" window, this is what I see:
0x00038DC8 69647561 2E302F6F 00766177 00000005 audio/0.wav.....
0x00038DD8 00000000 000F4240 00000105 00000000 ....#B..........
0x00038DE8 00000000 00000000 00000005 00000000 ................
0x00038DF8 0001C200 00000500 00001000 00000000 .Â..............
0x00038E08 00000000 F00000F0 02F00001 0003F000 ....ð..ð..ð..ð..
0x00038E18 F00004F0 06010005 01020101 01011201 ð..ð............
0x00038E28 35010121 01010D01 6C363025 2E6E6775 !..5....%06lugn.
0x00038E38 00746164 00000001 000008D0 00038400 dat.....Ð.......
Curiously, "audio/0.wav" is a static string which is part of the firmware. If I understand correctly, what I've learned here is that PC somehow gets set to this point in memory, which of course is not a valid instruction and causes the hard fault.
To debug the issue, I need to know how PC came to be set to this incorrect value. I believe the LR register should give me an idea. The LR register pushed onto the stack by the exception seems to be 0xab61. If I look at this location, I see the following in the Disassembly window:
1270 dp->sect = clst2sect(fs, clst);
0000ab58: ldr r0,[r7,#0x10]
0000ab5a: ldr r1,[r7,#0x14]
0000ab5c: bl 0x00009904
0000ab60: mov r2,r0
0000ab62: ldr r3,[r7,#0x4]
0000ab64: str r2,[r3,#0x18]
It looks to me like the problem occurs during this call specifically:
0000ab5c: bl 0x00009904
This makes me think that the problem occurs as a result of a corrupt stack, which causes clst2sect to return to an invalid part of memory rather than to 0xab60. The code for clst2sect is pretty innocuous:
/*-----------------------------------------------------------------------*/
/* Get physical sector number from cluster number */
/*-----------------------------------------------------------------------*/
DWORD clst2sect ( /* !=0:Sector number, 0:Failed (invalid cluster#) */
FATFS* fs, /* Filesystem object */
DWORD clst /* Cluster# to be converted */
)
{
clst -= 2; /* Cluster number is origin from 2 */
if (clst >= fs->n_fatent - 2) return 0; /* Is it invalid cluster number? */
return fs->database + fs->csize * clst; /* Start sector number of the cluster */
}
Does this analysis sound about right?
I suppose the problem I've run into is that I have no idea what might cause this kind of behaviour... I've tried putting breakpoints in all of my interrupt handlers, to see if one of them might be corrupting the stack, but there doesn't seem to be any pattern--sometimes, no interrupt handler is called but the problem still occurs.
In that case, though, it's hard for me to see how a program might try to execute code at a location well past the actual end of the code... I feel like a function pointer might be a likely candidate, but in that case I would expect to see the problem show up, e.g., where a function pointer is used. However, I don't see any function pointers used near where the error is occurring.
Perhaps there is more information I can extract from the debug information I've given above? The problem is quite reproducible, so if there's something I have not tried, but which you think might give some insight, I would love to hear it.
Thanks for any help you can offer!
After about a month of chasing this one, I managed to identify the cause of the problem. I hope I can give enough information here that this will be useful to someone else.
In the end, the problem was caused by passing a pointer to a non-static local variable to a state machine which changed the value at that memory location later on. Because the local variable was no longer in scope, that memory location was a random point in the stack, and changing the value there corrupted the stack.
The problem was difficult to track down for two reasons:
Depending on how the code compiled, the changed memory location could be something non-critical like another local variable, which would cause a much more subtle error. Only when I got lucky would the change affect the PC register and cause a hard fault.
Even when I found a version of the code that consistently generated a hard fault, the actual hard fault typically occurred somewhere up the call stack, when a function returned and popped the stack value into PC. This made it difficult to identify the cause of the problem--all I knew was that something was corrupting the stack before that function return.
A few tools were really helpful in identifying the cause of the problem:
Early on, I had identified a block of code where the hard fault usually occurred using GPIO pins. I would toggle a pin high before entering the block and low when exiting the block. Then I performed many tests, checking if the pin was high or low when the hard fault occurred, and used a sort of binary search to determine the smallest block of code which consistently contained all the hard faults.
The hard fault pushes a number of important registers onto the stack. These helped me confirm where the PC register was becoming corrupt, and also helped me understand that it was becoming corrupt as a result of a stack corruption.
Starting somewhere before that block of code and stepping forward while keeping an eye on local variables, I was able to identify a function call that was corrupting the stack. I could confirm this using Simplicity Studio's memory view.
Finally, stepping through the offending function in detail, I realized that the problem was occurring when I dereferenced a stored pointer and wrote to that memory location. Looking back at where that pointer value was set, I realized it had been set to point to a non-static local variable that was now out of scope.
Thanks to #SeanHoulihane and #cooperised, who helped me eliminate a few possible causes and gave me a little more confidence with the debugging tools.

SysTick Interrupt pending but won't execute, debug interrupt mask issue?

I've been trying to get a SysTick interrupt to work on a TM4C123GH6PM7. It's a cortex m4 based microcontroller. When using the Keil Debugger I can see that the Systick interrupt is pending int NVIC but it won't execute the handler. There are no other exceptions enabled and I have cleared the PRIMASK register. The code below is how I initialise the interrupt:
systck_init LDR R0,=NVIC_ST_CTRL_R
LDR R1,=NVIC_ST_RELOAD_R
LDR R2,=NVIC_ST_CURRENT_R
MOV R3,#0
STR R3,[R0]
STR R3,[R2]
MOV R3,#0x000020
STR R3,[R1]
MOV R3,#7
STR R3,[R0]
LDR R3,=NVIC_EN0_R
LDR R4,[R3]
ORR R4,#0x00008000
STR R4,[R3]
CPSIE I
MOV R3,#0x3
MSR CONTROL,R3
After a lot of searching I found that it may be the debugger masking all interrupts. The bit to control this is in a register called the Debug Halting Status and Control Register. Though I can't seem to view it in the debugger nor read/write to it with debug commands.
I used the Startup.s supplied by Keil and as far as I can tell the vectors/labels are correct.
And yes I know. Why bother doing it all in assembly.
Any ideas would be greatly appreciated. First time posting :)
I can see that the Systick interrupt is pending int NVIC
Systick has neither Enable nor Pending register bits in the NVIC. It is special that way, being tightly coupled to the MCU core itself.
Using 0x20 for the reload value is also dangerously low. You may get "stuck" in the Systick Handler, unable to leave it because the next interrupt triggers too early. Remember that Cortex M4 requires at least 12 clocks to enter and exit an interrupt handler - that consumes 24 out of your 32 cycles.
Additional hint: You last instruction changes the register used for the SP from MSP to PSP, but I don't see your code setting up the PSP first.
Be sure to implement the Hardfault_Handler - your code most likely triggers it.

Setting a random address breakpoint in gdb

I am learning some mechanism of breakpoint and I learned that 'In x86, there exist a instruction called int3 for debugger to interrupt the CPU. And then CPU will interrupt the running program by signal'.
For example:
8048e20: 55 push %ebp
8048e21: 89 e5 mov %esp,%ebp
When the user input
b *0x8048e21
The instruction will be replaced by int3(opcode 0xcc) and become this:
8048e20: 55 push %ebp
8048e21: cc e5 mov %esp,%ebp
And it will stop at the right place.
Then comes the question:
What would happen if I set the breakpoint not at the beginning of a instruction? ie, if I input:
b *0x8048e22
will debugger still replace the e5 with cc? So I write a simple example and run it with gdb.
As you can see above, I set two break points and the second is at the middle of a break points. I Input r and stop at the first breakpoint and input c and run to the end.
So it seems that the gdb ignore the second breakpoint. (For if it really repalce it with a int3 the program would be totally wrong).
Question: What happen to the second breakpoint, more specifically, what does gdb deal with it( or what I learn is wrong?)
Edit: #dbrank already give a great example about altering the data field of a instruction, I will try to make it more comprehensive with a similar example (it seems the register).
(Any reference about mechanism of breakpoint is appreciated!)
Inserting breakpoint in the middle of instruction will alter the instruction.
See this example of a program, where inserting a breakpoint overwrites original value assigned to variable (42 (0x2a)) with breakpoint instruction (0xcc (204)).
You can find more about how breakpoints work here.
You can also look into GDB sources (breakpoint.c & infrun.c mostly).

CPU idle loop without Sleep

I am learning WIN32 ASM right now and I was wondering if there is something like an "idle" infinite loop that doesn't consume any resources at all. Basically I need a running process to experiment with that goes like this:
loop:
; alternative to sleep...
jmp loop
Is there something that may idle the process?
You can't have it both ways. You can either consume the CPU or you can let it do something else. You can conserve power and avoid depriving other cores of resources with rep; nop; (also known as pause), as Vlad Lazarenko suggested. But if you loop without yielding the core to another process, at least that virtual core cannot do anything else.
Note: You should never use empty loops to make the application idle. It will load your processor to 100%
There are several ways to make the application idle, when there is no GUI activity in Win32 environment. The main (documented in MSDN) is to use "GetMessage" function in the main loop in order to extract the messages from the message queue.
When the message queue is empty, this function will idle, consuming very low processor time, waiting for message to arrive in the message queue.
Below is an example, using FASM macro library:
msg_loop:
invoke GetMessage, msg, NULL, 0, 0
cmp eax, 1
jb end_loop
jne msg_loop
invoke TranslateMessage, msg
invoke DispatchMessage, msg
jmp msg_loop
Another approach is used when you want to catch the moment when the application goes to idle state and make some low priority, one-time processing (for example enabling/disabling the buttons on the toolbar, according to the state of the application).
In this case, a combination of PeekMessage and WaitMessage have to be used. PeekMessage function returns immediately, even when the message queue is empty. This way, you can detect this situation and provide some idle tasks to be done, then you have to call WaitMessage in order to idle the process waiting for incoming messages.
Here is an simplified example from my code (using FreshLib macros):
; Main message loop
Run:
invoke PeekMessageA, msg, 0, 0, 0, PM_REMOVE
test eax,eax
jz .empty
cmp [msg.message], WM_QUIT
je .terminate
invoke TranslateMessage, msg
invoke DispatchMessageA, msg
jmp Run
.empty:
call OnIdle
invoke WaitMessage
jmp Run
.terminate:
FinalizeAll
stdcall TerminateAll, 0
What do you mean by "consume resources"?
If you just want a command that does nothing? If so nop will do that, and you can loop as much as you want even: rep; nop. However, the CPU will actually be busy doing work: executing the "no operation" instruction.
If you want an instruction that will cause the CPU itself to stop, then you are sorta-kinda out of luck: although there are ways to do that, you cannot do it from userspace.
With ring 0 access level (like kernel driver), you could use the x86 HLT opcode, but you need to have a system programming skill to really understand how to use it. Using HLT this way requires interrupts to be enabled (not masked) and the guarantee of an interrupt occurring (e.g. system timer), because the return from the interrupt will execute the next instruction after the HLT.
Without ring 0 access you'll never find any x86 opcode to enter an "idle" mode...
You only could find some instructions which consume less power (no memory access, no cache access, no FPU access, low ALU usage...).
Yes, there are some architectures support intrinsic idle state AKA halt.
For example, in x86 hlt, opcode 0xf4. Probably can be called only on privileged mode.
CPU Switches from User mode to Kernel Mode : What exactly does it do? How does it makes this transition?
How to completely suspend the processor?
a Linux's userspace example I found here:
.section .rodata
greeting:
.string "Hello World\n"
.text
_start:
mov $12,%edx /* write(1, "Hello World\n", 12) */
mov $greeting,%ecx
mov $1,%ebx
mov $4,%eax /* write is syscall 4 */
int $0x80
xorl %ebx, %ebx /* Set exit status and exit */
mov $0xfc,%eax
int $0x80
hlt /* Just in case... */

Resources