SysTick Interrupt pending but won't execute, debug interrupt mask issue? - debugging

I've been trying to get a SysTick interrupt to work on a TM4C123GH6PM7. It's a Cortex-M4 based microcontroller. When using the Keil debugger I can see that the SysTick interrupt is pending in the NVIC, but it won't execute the handler. There are no other exceptions enabled and I have cleared the PRIMASK register. The code below is how I initialise the interrupt:
systck_init LDR  R0,=NVIC_ST_CTRL_R      ; SysTick Control and Status
            LDR  R1,=NVIC_ST_RELOAD_R    ; SysTick Reload Value
            LDR  R2,=NVIC_ST_CURRENT_R   ; SysTick Current Value
            MOV  R3,#0
            STR  R3,[R0]                 ; disable SysTick during setup
            STR  R3,[R2]                 ; clear the current count
            MOV  R3,#0x000020
            STR  R3,[R1]                 ; reload value = 0x20 (32 ticks)
            MOV  R3,#7
            STR  R3,[R0]                 ; ENABLE | TICKINT | CLKSOURCE
            LDR  R3,=NVIC_EN0_R
            LDR  R4,[R3]
            ORR  R4,#0x00008000
            STR  R4,[R3]                 ; set bit 15 of NVIC EN0
            CPSIE I                      ; clear PRIMASK, enable interrupts
            MOV  R3,#0x3
            MSR  CONTROL,R3              ; SPSEL=1 (use PSP), nPRIV=1 (unprivileged)
After a lot of searching I found that it may be the debugger masking all interrupts. The bit that controls this is in the Debug Halting Control and Status Register (DHCSR), though I can't seem to view that register in the debugger, nor read/write it with debug commands.
I used the Startup.s supplied by Keil and as far as I can tell the vectors/labels are correct.
And yes I know. Why bother doing it all in assembly.
Any ideas would be greatly appreciated. First time posting :)

"I can see that the SysTick interrupt is pending in the NVIC"
SysTick has neither Enable nor Pending register bits in the NVIC. It is special that way, being tightly coupled to the MCU core itself.
A reload value of 0x20 is also dangerously low. You may get "stuck" in the SysTick handler, unable to leave it because the next interrupt triggers too early. Remember that the Cortex-M4 needs at least 12 clock cycles each to enter and to exit an interrupt handler - that alone consumes 24 of your 32 cycles.
Additional hint: your last instruction changes the stack pointer in use from MSP to PSP, but I don't see your code setting up the PSP first.
Be sure to implement the HardFault_Handler - your code most likely triggers it.
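For reference, here is a minimal C sketch of a SysTick setup that sidesteps both pitfalls: no NVIC enable is touched, the reload value is large, and the MSP stays in use. It assumes the CMSIS device header for the part (the header name may differ per toolchain) and the TM4C123's default 16 MHz internal oscillator; the register and bit names are the standard CMSIS ones, not the code from the question:
/* Minimal sketch, assuming CMSIS definitions (core_cm4.h pulled in by the
 * device header). SysTick is a core exception: it has no NVIC enable bit. */
#include <stdint.h>
#include "TM4C123GH6PM.h"      /* device header name may differ per toolchain */

volatile uint32_t tick_count;

void systick_init(void)
{
    SysTick->CTRL = 0;                          /* stop the timer while configuring */
    SysTick->LOAD = 16000u - 1u;                /* ~1 ms at the 16 MHz PIOSC        */
    SysTick->VAL  = 0;                          /* clear current value and COUNTFLAG */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk  /* run from the core clock          */
                  | SysTick_CTRL_TICKINT_Msk    /* pend the SysTick exception       */
                  | SysTick_CTRL_ENABLE_Msk;    /* start counting                   */
}

void SysTick_Handler(void)                      /* must match the vector table name */
{
    tick_count++;
}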

Related

How is execution resumed after a hardware breakpoint without an infinite loop?

As far as I know, SW breakpoints work as follows:
The instruction the BP is set on gets replaced by an int/trap instruction, and the trap is handled in a trap handler. On continue, the trap is replaced by the original instruction, the instruction is executed in single-step mode, and once the PC points to the next instruction, the original instruction is replaced by the int/trap instruction again.
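For concreteness, that dance might look roughly like the following sketch using Linux ptrace on x86-64 (child, addr and the surrounding debugger loop are hypothetical; error handling omitted):
/* Sketch of the SW-breakpoint mechanics described above: patch in INT3 (0xCC),
 * and when it is hit, restore the original byte, single-step, then re-arm.
 * Assumes the caller has already rewound RIP to `addr` after the INT3 trap,
 * since the trap reports the address just past the 0xCC byte. */
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdint.h>

static long set_sw_breakpoint(pid_t child, uintptr_t addr)
{
    long orig = ptrace(PTRACE_PEEKTEXT, child, (void *)addr, 0);
    long patched = (orig & ~0xFFL) | 0xCC;                       /* INT3 opcode   */
    ptrace(PTRACE_POKETEXT, child, (void *)addr, (void *)patched);
    return orig;                                                 /* keep old word */
}

static void step_over_breakpoint(pid_t child, uintptr_t addr, long orig)
{
    int status;
    ptrace(PTRACE_POKETEXT, child, (void *)addr, (void *)orig);  /* restore byte  */
    ptrace(PTRACE_SINGLESTEP, child, 0, 0);                      /* execute it    */
    waitpid(child, &status, 0);
    set_sw_breakpoint(child, addr);                              /* re-arm INT3   */
}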
HW Breakpoints work as follows according to my understanding:
The address of the instruction the BP is set on is written into a HW-BP register. If the instruction is hit, i.e. the PC matches the address in the HW-BP register, the CPU raises an interrupt which is also handled by a trap handler. Now if the program returns to the original instruction, the HW BP is still active and one is caught in an infinite loop.
How is that problem treated?
Is the HW BP disabled before continuing, and is the original instruction also executed in single-step mode? Or is the original instruction executed before the trap handler is entered, so that the trap handler returns to the instruction after the original one? Or is there another mechanism?
In case of the Intel 64 and IA-32 ("x64/x86") architectures, this is the task of the Resume Flag (RF), bit 16 in EFLAGS. (Other processor architectures that support hardware breakpoints probably have a similar mechanism.)
See section 18.3.1.1 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B:
Because the debug exception for an instruction breakpoint is generated before the instruction is executed, if the instruction breakpoint is not removed by the exception handler, the processor will detect the instruction breakpoint again when the instruction is restarted and generate another debug exception. To prevent looping on an instruction breakpoint, the Intel 64 and IA-32 architectures provide the RF flag (resume flag) in the EFLAGS register (see Section 2.3, “System Flags and Fields in the EFLAGS Register,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). When the RF flag is set, the processor ignores instruction breakpoints.
[...]
The RF Flag is cleared at the start of the instruction after the check for code breakpoint, CS limit violation and FP exceptions.
[...]
If the RF flag in the EFLAGS image is set when the processor returns from the exception handler, it is copied into the RF flag in the EFLAGS register by IRETD/IRETQ or a task switch that causes the return. The processor then ignores instruction breakpoints for the duration of the next instruction. (Note that the POPF, POPFD, and IRET instructions do not transfer the RF image into the EFLAGS register.) Setting the RF flag does not prevent other types of debug-exception conditions (such as, I/O or data breakpoints) from being detected, nor does it prevent non-debug exceptions from being generated.
(Emphasis mine.)
So, the debugger will set RF before returning from the exception handler so that instruction breakpoints are "muted" for one instruction, after which the flag is automatically cleared by the processor.
Note that this is not a concern in the case of data breakpoints because these will fire after the instruction that triggered the read/write operation.
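From a user-mode debugger's point of view, this is just one bit to set in the saved flags before resuming the debuggee. A minimal Win32 sketch, assuming the debugger is already sitting in its WaitForDebugEvent loop and dbg_thread is a handle to the stopped thread (resume_past_hw_breakpoint is just an illustrative name):
/* Sketch: set RF (bit 16 of EFLAGS) in the stopped thread's context so the
 * instruction breakpoint is ignored for exactly one instruction after the
 * debugger calls ContinueDebugEvent. */
#include <windows.h>

BOOL resume_past_hw_breakpoint(HANDLE dbg_thread)
{
    CONTEXT ctx;
    ZeroMemory(&ctx, sizeof ctx);
    ctx.ContextFlags = CONTEXT_CONTROL;         /* EFlags lives in the control part */
    if (!GetThreadContext(dbg_thread, &ctx))
        return FALSE;
    ctx.EFlags |= 0x10000;                      /* RF, the resume flag              */
    return SetThreadContext(dbg_thread, &ctx);
}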
Recommendation: I find the slides of "Intermediate x86 Part 4" by Xeno Kovah to be helpful in understanding these things. He talks about various topics there but starts with debugging. This information in particular can be found on slides 12-13 (image credit: Xeno Kovah, CC BY-SA 3.0).

Threads spawned by a detoured pthread_create do not execute instructions

I've got a custom implementation of detours on macOS and a test application using it, which is written in C, compiled for macOS x86_64, running on an Intel i9 processor.
The implementation works fine with a multitude of functions. However, if I detour pthread_create, I encounter strange behaviour: threads that have been spawned via a detoured pthread_create do not execute instructions. I can step through instructions one by one, but as soon as I continue it does not progress. There are no mutexes or synchronisation involved and the result of the function is 0 (success). The exact same application with detours turned off works fine, so the application itself is unlikely to be the culprit.
This does not happen all the time - sometimes the threads are fine, but at other times the test application stalls in the following state:
(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff7296f55e libsystem_kernel.dylib`__ulock_wait + 10
frame #1: 0x00007fff72a325c2 libsystem_pthread.dylib`_pthread_join + 347
frame #2: 0x0000000100001186 DetoursTestApp`main + 262
frame #3: 0x00007fff7282ccc9 libdyld.dylib`start + 1
frame #4: 0x00007fff7282ccc9 libdyld.dylib`start + 1
thread #2
frame #0: 0x00007fff72a2cb7c libsystem_pthread.dylib`thread_start
Relevant memory pages have the executable flag set. The detour function that intercepts the thread creation looks like this:
static int pthread_create_detour(pthread_t* thread,
                                 const pthread_attr_t* attr,
                                 void* (*start_routine)(void*),
                                 void* arg)
{
    detour_count++;
    pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
    return original(thread, attr, start_routine, arg);
}
Where detour_original retrieves the pointer to [original function + size of function's prologue].
Tracing through the instructions, everything seems to be working correctly and pthread_create terminates successfully. Tracing the application's system calls via dtruss does show calls to
bsdthread_create(0x10DB964B0, 0x0, 0x7000080DB000) = 29646848 0
with what I have confirmed to be the correct arguments.
This behaviour is only observed in release builds - debug builds work fine, but the disassembly and execution of a detoured pthread_create and the associated detours code seem to be identical in both cases.
Workarounds
I found a couple of odd workarounds for this issue that don't make much sense. Given the detour function, a number of things can be substituted into the following:
static int pthread_create_detour(pthread_t* thread,
                                 const pthread_attr_t* attr,
                                 void* (*start_routine)(void*),
                                 void* arg)
{
    detour_count++;
    pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
    <...> <== SUBSTITUTE HERE
    return original(thread, attr, start_routine, arg);
}
A cache flush.
__asm__ __volatile__("" ::: "memory");
_mm_clflush(real_pthread_create);
A sleep of any duration - usleep(1)
A printf statement.
A memory allocation larger than 32768 bytes, e.g. void *data = malloc(40000);.
Cache?
All of these seem to point to a stale instruction cache. However, the Intel manual states the following:
A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.
What's even more interesting is that those workarounds have to be executed for every new thread created, with the execution happening on the main thread, so it's very unlikely to be the cache. I have also tried putting in cache flushes at every memory write that writes instructions but that did not help. I've also written a memcpy that bypasses the cache with the use of Intel's intrinsic _mm_stream_si32 and swapped it out for every instruction memory write in my implementation without any success.
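For reference, the cache-bypassing copy mentioned above might have looked something like this sketch (non-temporal 32-bit stores; it assumes a 4-byte-aligned destination and a length that is a multiple of 4):
/* Sketch of a copy that writes around the cache with _mm_stream_si32,
 * followed by an sfence so the streaming stores become globally visible. */
#include <immintrin.h>
#include <stddef.h>

static void memcpy_nt32(void *dst, const void *src, size_t len)
{
    int *d = (int *)dst;
    const int *s = (const int *)src;

    for (size_t i = 0; i < len / 4; i++)
        _mm_stream_si32(&d[i], s[i]);   /* non-temporal store, bypasses the cache */
    _mm_sfence();                       /* order the streaming stores             */
}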
Race condition?
The next suspect in line is a race condition. However, it's not clear what would be racing, as at first there are no other threads. I have put in a Fibonacci-sequence calculation for a randomly generated number, and the newly spawned threads would still stall.
The question
What is causing this issue? What other mechanisms could be responsible for this?
At this point I have run out of things to check so any suggestions will be welcome.
I found that the reason why the spawned thread was not executing instructions was that the r8 register wasn't being cleared at the right time in the execution of pthread_create due to an issue with my detours implementation.
If we look at the disassembly of the function, it is split into two parts - the "head" and the "body", which lives in an internal _pthread_create function. The head does two things - it zeroes out r8 and jumps to the body:
libsystem_pthread.dylib`pthread_create:
0x7fff72a2e236 <+0>: 45 31 c0 xor r8d, r8d
0x7fff72a2e239 <+3>: e9 40 37 00 00 jmp 0x7fff72a3197e ; _pthread_create
libsystem_pthread.dylib`_pthread_create:
0x7fff72a3197e <+0>: 55 push rbp
0x7fff72a3197f <+1>: 48 89 e5 mov rbp, rsp
0x7fff72a31982 <+4>: 41 57 push r15
<...> // the rest of the 1409 instructions
My implementation would detour the internal _pthread_create function instead of the head containing the actual entry point, which meant that r8 would get cleared at the wrong time (before the detour). Since the detour function would contain some code, the execution would go something like:
pthread_create (r8 gets cleared) -> _pthread_create -> chain of jumps -> pthread_create_detour -> trampoline (containing the beginning of _pthread_create) -> _pthread_create + 6
This meant that, depending on the contents of the pthread_create_detour function, r8 would not always end up as 0 when execution returned to the internal function.
It's not yet clear why having r8 set to something other than 0 before _pthread_create would not crash, but instead start up a thread in a locked-up state. An important detail is that the stalled thread would have the rflags register set to 0x200, which should never be the case according to Intel's manual. This is what led me to inspect the CPU state more closely, leading to the answer.
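As a sanity check for this class of bug, it can help to confirm that the address being detoured really is the exported entry point containing the r8-clearing head, not the internal body. A small sketch, assuming macOS x86_64 and the libsystem_pthread version disassembled above (the head bytes may differ in other versions):
/* Sketch: verify that the exported pthread_create symbol starts with the
 * `xor r8d, r8d` head (45 31 C0), i.e. that a detour installed at this
 * address would carry the r8 clearing into its trampoline. */
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned char head[] = { 0x45, 0x31, 0xC0 };   /* xor r8d, r8d */
    const unsigned char *entry = dlsym(RTLD_DEFAULT, "pthread_create");

    if (!entry) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return 1;
    }
    printf("pthread_create entry %p starts with xor r8d,r8d: %s\n",
           (const void *)entry,
           memcmp(entry, head, sizeof head) == 0 ? "yes" : "no");
    return 0;
}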

Cortex M3 - Calling a SVC inside a C function, and returning to thread mode

I am writing system calls, as recommended by Joseph Yiu (M3 Guide), by taking the arguments from the stack. The assembly SVC handler looks like this:
SVC_Handler:
    MOV R0, #0
    MSR CONTROL, R0
    CMP LR, #0xFFFFFFFD
    BEQ KernelEntry
    B   KernelExit
KernelEntry:
    <save current user stack>
    B   svchandler_main
KernelExit:
    <code to restore the very same user stack saved before>
    MOV LR, #0xFFFFFFFD
    BX  LR
Well, svchandler_main is a C function that recovers the immediates (the arguments of the system call), creates the kernel stack and branches to 0xFFFFFFF9 (MSP, privileged). The system call itself is made like this:
#define svc(code) asm volatile ("svc %[immediate]"::[immediate] "I" (code))
void SysCall_MyCall(int32_t args)
{
    svc(CallBack_Number);
}
That said, the callback function, running in handler mode:
void SysCallBack(void* args)
{
    /* <my c routine> */
    asm volatile("svc #0"); // to exit the kernel mode
}
The last SVC call is performed so that the SVC_Handler in assembly will identify that it is coming from handler mode (MSP, privileged) and will exit the kernel - kind of a cooperative scheduling in the kernel. The problem is that the context is saved with PSP pointing inside SysCall_MyCall, so it returns there and never exits. If I use inline functions, I will lose the handy svchandler_main. Any ideas? I didn't write svchandler_main here because it is classic code found in ARM application notes. Thanks.
Edit, to clarify: I am not branching to the callback function INSIDE the SVC handler. It creates the callback stack, changes the LR to 0xFFFFFFF9 and executes a BX LR, exiting the interrupt and going to the indicated MSP. To exit the kernel another SVC is called, and the user thread is resumed.
It seems that you misunderstand how exception entry and return works on the Cortex-M. When you issue an SVC instruction from thread mode, the CPU transitions to handler mode just as for any other exception.
Handler mode is always privileged, and always uses the main stack (MSP). Thread mode can be either privileged or unprivileged depending on the nPRIV bit (bit 0) in the CONTROL register, and may be configured to use the process stack (PSP) by setting the SPSEL bit (bit 1) in the CONTROL register from thread mode.
On entry to handler mode, r0-r3, r12, lr, pc and xPSR are pushed to the active stack (PSP or MSP, depending on which is in use) and an exception return value is loaded into lr. The stack is switched to MSP. At the end of the handler, a BX lr instruction (or equivalent) causes this value to be used as a branch target, which automatically causes a restoration of the previous mode and stack, and pops r0-r3, r12, lr, pc and xPSR. The pop of pc restores execution from the place where the interrupt occurred.
The important thing about this mechanism is that it is 100% compatible with the ARM ABI. In other words, it is possible to write an ordinary function and use it as an exception handler, just by placing the address of the function in the appropriate place in the interrupt vector table. That's because the return at the end of a function is actioned by BX lr or equivalent, which is exactly the same instruction that triggers a return from handler mode.
So to write an SVC handler that makes use of callbacks, it is necessary to:
Work out which stack was in use when the SVC instruction was issued
Dig out the stacked pc to find the address of the SVC instruction itself, and extract the 8-bit constant from within the SVC instruction
Use the constant to work out which callback to invoke
Branch to (not call) the appropriate callback
The callback can be a perfectly ordinary function. When it returns, it will trigger the return from handler mode because the appropriate exception return code will still be in lr.
A handler that does all of this is presented in the M3 Guide, chapter 10.
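A minimal sketch of that pattern, in GCC syntax for the Cortex-M3 (svc_handler_c and the case labels are placeholders, not the M3 Guide's exact code):
/* The shim works out which stack holds the exception frame; the C part digs
 * the SVC number out of the instruction just before the stacked PC. */
#include <stdint.h>

void svc_handler_c(uint32_t *frame)
{
    /* frame[0..7] = r0, r1, r2, r3, r12, lr, return address (pc), xPSR */
    uint8_t svc_number = ((uint8_t *)frame[6])[-2];  /* low byte of the SVC opcode */

    switch (svc_number) {
    case 0:
        /* invoke the matching callback here; when this function returns,
           lr (still holding EXC_RETURN) triggers the exception return */
        break;
    default:
        break;
    }
}

__attribute__((naked)) void SVC_Handler(void)
{
    __asm volatile(
        "tst   lr, #4         \n"   /* EXC_RETURN bit 2: 0 = frame on MSP */
        "ite   eq             \n"
        "mrseq r0, msp        \n"
        "mrsne r0, psp        \n"   /* 1 = frame on PSP                   */
        "b     svc_handler_c  \n"   /* tail-call, lr left untouched       */
    );
}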
If it is required that the callback receives arguments, this is a bit more complex but I can expand my answer if you'd like. Generally handler callbacks execute in handler mode (that's pretty much the point of SVC). If for some reason you need the callback to be executed without privilege, that's more complex still; there is an example in Chapter 23 of the M3 guide, though. You refer in the comments to not wanting to "manage nested interrupts" but really nested interrupts just manage themselves.

CPU idle loop without Sleep

I am learning WIN32 ASM right now and I was wondering if there is something like an "idle" infinite loop that doesn't consume any resources at all. Basically I need a running process to experiment with that goes like this:
loop:
; alternative to sleep...
jmp loop
Is there something that may idle the process?
You can't have it both ways. You can either consume the CPU or you can let it do something else. You can conserve power and avoid starving the sibling hardware thread of resources with rep; nop; (also known as pause), as Vlad Lazarenko suggested. But if you loop without yielding the core to another process, at least that virtual core cannot do anything else.
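For illustration, here is what that looks like from C (a sketch assuming x86; the flag variable is hypothetical):
/* Sketch: a busy-wait that at least hints the core with PAUSE (rep; nop).
 * It still occupies the logical CPU; it just does so more politely. */
#include <immintrin.h>
#include <stdatomic.h>

void spin_until_set(atomic_int *flag)
{
    while (!atomic_load_explicit(flag, memory_order_acquire))
        _mm_pause();            /* encodes as rep; nop */
}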
Note: you should never use empty loops to make the application idle. They will load your processor to 100%.
There are several ways to make the application idle when there is no GUI activity in a Win32 environment. The main one (documented on MSDN) is to use the GetMessage function in the main loop to extract messages from the message queue.
When the message queue is empty, this function will idle, consuming very little processor time, while waiting for a message to arrive in the message queue.
Below is an example, using the FASM macro library:
msg_loop:
        invoke  GetMessage, msg, NULL, 0, 0
        cmp     eax, 1
        jb      end_loop        ; eax = 0: WM_QUIT, leave the loop
        jne     msg_loop        ; eax <> 1 (e.g. -1 on error): just loop again
        invoke  TranslateMessage, msg
        invoke  DispatchMessage, msg
        jmp     msg_loop
Another approach is used when you want to catch the moment the application goes idle and do some low-priority, one-time processing (for example enabling/disabling the buttons on the toolbar according to the state of the application).
In this case, a combination of PeekMessage and WaitMessage has to be used. The PeekMessage function returns immediately, even when the message queue is empty. This way, you can detect this situation and do some idle tasks, then you have to call WaitMessage in order to idle the process while waiting for incoming messages.
Here is a simplified example from my code (using FreshLib macros):
; Main message loop
Run:
        invoke  PeekMessageA, msg, 0, 0, 0, PM_REMOVE
        test    eax, eax
        jz      .empty
        cmp     [msg.message], WM_QUIT
        je      .terminate
        invoke  TranslateMessage, msg
        invoke  DispatchMessageA, msg
        jmp     Run
.empty:
        call    OnIdle
        invoke  WaitMessage
        jmp     Run
.terminate:
        FinalizeAll
        stdcall TerminateAll, 0
What do you mean by "consume resources"?
Do you just want an instruction that does nothing? If so, nop will do that, and you can even loop it as much as you want: rep; nop. However, the CPU will actually be busy doing work: executing the "no operation" instruction.
If you want an instruction that will cause the CPU itself to stop, then you are sorta-kinda out of luck: although there are ways to do that, you cannot do it from userspace.
With ring 0 access (like a kernel driver), you could use the x86 HLT opcode, but you need some system-programming skill to really understand how to use it. Using HLT this way requires interrupts to be enabled (not masked) and the guarantee of an interrupt occurring (e.g. the system timer), because the return from the interrupt resumes at the next instruction after the HLT.
Without ring 0 access you'll never find an x86 opcode that enters an "idle" mode...
You will only find instructions that consume less power (no memory access, no cache access, no FPU access, low ALU usage...).
Yes, some architectures support an intrinsic idle state, AKA halt.
For example, on x86 it is hlt, opcode 0xF4. It can probably only be executed in privileged mode.
CPU Switches from User mode to Kernel Mode : What exactly does it do? How does it makes this transition?
How to completely suspend the processor?
A Linux userspace example I found here:
        .section .rodata
greeting:
        .string "Hello World\n"

        .text
_start:
        mov     $12,%edx        /* write(1, "Hello World\n", 12) */
        mov     $greeting,%ecx
        mov     $1,%ebx
        mov     $4,%eax         /* write is syscall 4 */
        int     $0x80

        xorl    %ebx, %ebx      /* Set exit status and exit */
        mov     $0xfc,%eax      /* exit_group is syscall 252 */
        int     $0x80

        hlt                     /* Just in case... */

Why does interrupt handler entry code check the Carry flag?

I am trying to generate an interrupt in a VM and have written a simple interrupt handler, but when I try to test the interrupt generation and handling, the kernel crashes because of a page fault. I debugged the issue and found that in the 'entry_64.S' file, where 'error_entry' is called to push registers onto the stack and check GS, there is the following code:
xorl %ebx,%ebx
testl $3,CS+8(%rsp)
je error_kernelspace
error_swapgs:
SWAPGS
When the interrupt is handled, the CPU will push EFLAGS to the (rsp)+CS+8 location. So in the above code, the 'testl' instruction checks whether the flags' Carry flag was set at the time of the interrupt, to detect whether the interrupt happened in kernel mode or in user mode.
Can someone please explain why the Carry flag is checked here?
Actually, I think it's checking whether CS corresponds to a kernel thread; see the comment for a similar construct at ret_from_fork. Nothing here looks at the Carry flag: testl just ANDs the CS value saved at CS+8(%rsp) with 3 (the low two bits of CS hold the privilege level) and sets ZF, so the je is taken when the interrupted code was running in ring 0.
