I am writing an NMI handler in an LKM. I would like to know the mode(user or kernel) of operation during the NMI fire. Is there any kernel flag to denote that? I am running Linux 4.18.0.
You can determine if cpu was in user or kernel mode by value of CS register, which is saved on stack by CPU in addition to RIP, RSP, SS etc.
Stack layout of interrupts is described in Intel® 64 and IA-32 ArchitecturesSoftware Developer’s ManualVolume 3A:System Programming Guide, Part Section 6.12.1
In kernel mode, saved CS value is __KERNEL_CS, in user mode - __USER_CS.
Code of default kernel nmi handler actually does this in /arch/x86/entry/entry_64.S:
ENTRY(nmi)
...
testb $3, CS-RIP+8(%rsp)
jz .Lnmi_from_kernel
Related
As far as I know SW-breakpoints are working as follows:
The instruction the BP is set to gets substituted by a int/trap instruction, than the trap is handled in a trap handler, on continue the trap is replaced by the original instruction, the instruction is executed in single step mode, now the PC points to the next instruction and the original instruction is replaced again by a int/trap instruction.
HW Breakpoints work as follows according to my understanding:
The address of the instruction the BP is set to is written in a HW-BP Register. If the instruction is hit respectively the PC matches the address in the HW-BP Register, the CPU raises an interrupt which is also handled by a trap handler. Now if the program returns to the orignial instruction the HW BP is still active and one is caught in an infinite loop.
How is that problem treated?
Is the HW BP disabled before continuing and is the orignal instruction also getting executed in single step mode? Or is the original instruction executed before the trap handler is entered, so that the trap handler returns to the instruction after the original instruction? Or is there an other mechanism?
In case of the Intel 64 and IA-32 ("x64/x86") architectures, this is the task of the Resume Flag (RF), bit 16 in EFLAGS. (Other processor architectures that support hardware breakpoints probably have a similar mechanism.)
See section 18.3.1.1 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B:
Because the debug exception for an instruction breakpoint is generated before the instruction is executed, if the instruction breakpoint is not removed by the exception handler; the processor will detect the instruction breakpoint again when the instruction is restarted and generate another debug exception. To prevent looping on an instruction breakpoint, the Intel 64 and IA-32 architectures provide the RF flag (resume flag) in the EFLAGS register (see Section 2.3, “System Flags and Fields in the EFLAGS Register,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). When the RF flag is set, the processor ignores instruction breakpoints.
[...]
The RF Flag is cleared at the start of the instruction after the check for code breakpoint, CS limit violation and FP exceptions.
[...]
If the RF flag in the EFLAGS image is set when the processor returns from the exception handler, it is copied into the RF flag in the EFLAGS register by IRETD/IRETQ or a task switch that causes the return. The processor then ignores instruction breakpoints for the duration of the next instruction. (Note that the POPF, POPFD, and IRET instructions do not transfer the RF image into the EFLAGS register.) Setting the RF flag does not prevent other types of debug-exception conditions (such as, I/O or data breakpoints) from being detected, nor does it prevent non-debug exceptions from being generated.
(Emphasis mine.)
So, the debugger will set RF before returning from the exception handler so that instruction breakpoints are "muted" for one instruction, after which the flag is automatically cleared by the processor.
Note that this is not a concern in the case of data breakpoints because these will fire after the instruction that triggered the read/write operation.
Recommendation: I find the slides of "Intermediate x86 Part 4" by Xeno Kovah to be helpful in understanding these things. He talks about various topics there but starts with debugging. This information in particular can be found on slides 12-13:
Image credit: Xeno Kovah, CC BY-SA 3.0
When I read Intel's X86 programmer's manual, see the following for interrupt & interrupt return with stack switching:
interrupt:
If a stack switch does occur, the processor does the following:
Temporarily saves (internally) the current contents of the SS, ESP, EFLAGS, CS, and EIP registers.
Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being called) from the TSS into the SS and ESP registers and switches to the new stack.
Pushes the temporarily saved SS, ESP, EFLAGS, CS, and EIP values for the interrupted procedure’s stack onto the new stack.
Pushes an error code on the new stack (if appropriate).
Loads the segment selector for the new code segment and the new instruction pointer (from the interrupt gate or trap gate) into the CS and EIP registers, respectively.
If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.
Begins execution of the handler procedure at the new privilege level.
On return:
Performs a privilege check.
Restores the CS and EIP registers to their values prior to the interrupt or exception.
Restores the EFLAGS register.
Restores the SS and ESP registers to their values prior to the interrupt or exception, resulting in a stack switch back to the stack of the interrupted procedure.
Resumes execution of the interrupted procedure.
For example, one linux process P:
It's initially in kernel mode
It returns to user mode by iret. But from the manual, there is no change to TSS
It traps into kernel by int. Here it needs to find the kernel stack from ESP & SS in TSS. How is this kernel stack value set up, since they are not stored to TSS in step 2?
Once the kernel returns to user-space for a given task, it's done with that task's kernel stack until the next interrupt / exception. There's no useful data on it, so the TSS can hold a fixed SS:[ER]SP value that points to the top of the virtual page[s] allocated as the kernel stack for the current task.
Kernel state doesn't live on the kernel stack between entries into the kernel; it's kept elsewhere in a process control block. (Context switches between asks actually happen in the kernel, switching kernel stacks to the formerly-sleeping task's kernel stack, so eventually returning to user-space means returning up the call-chain of whatever that task was doing in the kernel first).
BTW, unless the kernel pushes a new CS:EIP / EFLAGS / SS:ESP for iret to pop, the stuff it pops will be the stuff pushed by hardware at the address specified in the TSS. So even if there was some desire to re-enter the kernel with the stack as you left it, that would normally be at the TSS location anyway. But this is irrelevant because Linux doesn't keep stuff on a task's kernel stack while user-space is running, except for a pointer to per-task stuff at the bottom of the region where the kernel can find it with [ER]SP & -16384.
(I think this is right; I've looked at a few bits of Linux kernel code but haven't really gotten my hands dirty experimenting with things. I think this is how Linux works, and a consistent viable design.)
I'm trying to hook SYSENTER dispatch function from the kernel and during the past few days I was studying about what happens when a program executes SYSENTER and wants to enter to kernel then I realized IA32_SYSENTER_EIP and IA32_SYSENTER_ESP are responsible to set the kernel RIP and RSP after SYSENTER.
Yesterday I read Intel Software Developer Manuals about SWAPGS :
SWAPGS exchanges the current GS base register value with the value contained in MSR address C0000102H (IA32_KERNEL_GS_BASE). The SWAPGS instruction is a privileged instruction intended for use by system software.
When using SYSCALL to implement system calls, there is no kernel stack
at the OS entry point. Neither is there a straightforward method to
obtain a pointer to kernel structures from which the kernel stack
pointer could be read. Thus, the kernel cannot save general purpose
registers or reference memory.
From the second paragraph, there is no kernel stack at the OS entry point seems that OS kernel executes SWAPGS to set the GS and then get the kernel stack pointer but as I read, in a SYSENTER kernel RIP(EIP) and RSP (ESP) should set from IA32_SYSENTER_EIP and IA32_SYSENTER_ESP so the kernel has its stack pointer in IA32_SYSENTER_ESP !
My Questions :
If kernel stack address should come from GS then what's the purpose of IA32_SYSENTER_ESP?
What are differences between AMD LSTAR (0xC0000082) and IA32_SYSENTER_EIP? I ask it because I saw Windows set 0xc0000082 on my Intel processor.
Is there any special problem with hooking kernels SYSENTER dispatcher?It's because whenever I put a breakpoint in Windows function which is responsible for dispatching SYSENTER calls (KiSystemCall64Shadow) on a remote debugging machine (Not VM) then it causes BSOD with UNEXPECTED_KERNEL_MODE_TRAP.
I wanted to know how interrupt handling works from the point any device is interrupted.I know of interrupt handling in bits and pieces and would like to have clear end to end picture of interrupt handing.Let me put across what little I know about interrupt handling.
Suppose an FPGA device is interrupted through electrical lines and get some data .Device driver for this FPGA device already had code (Interrupt handler) registered using request_irq function.
So now FPGA device have an IRQ line which it get after to call request_irq ,using this IRQ line device send data to the General Interrupt controller and GIC will do many to one translation of IRQ lines and send the signal to CPU core which then call below minimal code
IRQ_handler
SUB lr, lr, #4 ; modify LR
SRSFD #0x12! ; store SPSR and LR to IRQ mode stack
PUSH {r0-r3, r12} ; store AAPCS registers on to the IRQ mode stack
BL IRQ_handler_to_specific_device
POP {r0-r3, r12} ; restore registers
RFEFD sp! ; and return from the exception using pre-modified LR
IRQ_handler_to_specific_device is nothing is what we registered in Device driver using request_irq() call.
I still don't how CPU core comes to know about the interrupt source?(from which device interrupt is coming)
Also what is role of call like do_irq and shared interrupts works?
Need some help in understanding end to end picture on how interrupts are handled on ARM architecture?
The GIC is divided into two sections. The first is called the distributor. This is global to the system. It has several interrupt sources physically routed to it; although it maybe within an SOC package. The second section is replicated per-CPU and it called the cpu interface. The distributor has logic on how to distribute the shared peripheral interrupts or SPI. These are the type of interrupt your question is asking about. They are global hardware interrupts.
In the context of Linux, this is implemented in irq-gic.c. There is some documentation in gic.txt. Of specific interest,
reg : Specifies base physical address(s) and size of the GIC registers. The
first region is the GIC distributor register base and size. The 2nd region is
the GIC cpu interface register base and size.
The distributor must be accessed globally, so care must be taken to manage it's registers. The CPU interface has the same physical address for each CPU, but each CPU has a separate implementation. The distributor can be set up to route interrupts to specific CPUs (including multiples). See: gic_set_affinity() for example. It is also possible for any CPU to handle the interrupt. The ACK register will allocate IRQ; the first CPU to read it, gets the interrupt. If multiple IRQs pend and there are two ACK reads from different CPUs, then each will get a different interrupt. A third CPU reading would get a spurious IRQ.
As well, each CPU interface has some private interrupt sources, that are used for CPU-to-CPU interrupts as well as private timers and the like. But I believe the focus of the question is how a physical peripheral (unique to a system) gets routed to a CPU in an SMP system.
When and how does CPU Switch from Kernel mode to User Mode On X86 : What exactly does it do? How does it makes this transition?
In x86 protected mode, the current privilege level that the CPU is executing in is controlled by the two least significant bits of the CS register (the RPL field of the segment selector).
So a switch from kernel mode (CPL=0) to user mode (CPL=3) is accomplished by replacing a kernel-mode CS value with a user-mode one. There's many ways to do this, but one typical one is an IRET instruction which pops the EIP, CS and EFLAGS registers from the stack.
iret does this for example. See the code here (INTERRUPT_RETURN macro)