Getting stack pointer x86_64 linux syscall - linux-kernel

I have implemented a syscall on x86_64 Linux 3.0, and would like to know how to get the calling process's stack pointer (%rsp). My syscall is a plain vanilla syscall...
I'm used to using task_pt_regs to get the stack frame of the calling process, but the comments in struct pt_regs in arch/x86/include/asm/ptrace.h note that non-tracing syscalls don't save all registers: ip, cs, flags, sp and ss are not set when the CPU's syscall instruction is invoked and my syscall is called. In other words, in my syscall task_pt_regs(current)->ss is garbage.
For calls like sys_fork, a special macro in arch/x86/kernel/entry_64.S (PTREGSCALL) sets up the sys_fork function to be called with a proper pt_regs stack frame.
How can I extract values like IP and SS in my syscall without forcing an extra argument onto my custom system call like sys_fork with PTREGSCALL?

If I understand correctly, when a syscall is invoked the CPU jumps to kernel code (a privilege-level change), and at that moment the CPU pushes the CS, RIP, RSP and EFLAGS registers onto the stack so that it can return to user code when the handler executes an IRET (interrupt return).
This means that you may find the RSP and RIP of the calling process just looking in the stack when the syscall is executed.
You may get more information in the "AMD64 Architecture, Programmer’s Manual, Volume 2: System Programming", page 292. It's called "Long-Mode Stack After Interrupt—Higher Privilege".
In the answer above I've ignored a few details of the way the Linux kernel handles syscalls, but that doesn't change the answer.
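The stack layout that the manual section describes can be sketched as a C struct. This is only an illustration of the frame pushed in long mode when an interrupt or exception raises the privilege level (without an error code); the struct and helper names are hypothetical and do not come from the kernel. It models the interrupt path only: the `syscall` instruction itself saves RIP in RCX and RFLAGS in R11 and does not push a frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the long-mode interrupt frame (no error code).
 * Fields are listed from the lowest address upward, so the handler's
 * RSP on entry points at `rip`. Every slot is 8 bytes wide. */
struct long_mode_iret_frame {
    uint64_t rip;     /* return instruction pointer */
    uint64_t cs;      /* return code segment selector, in an 8-byte slot */
    uint64_t rflags;  /* saved flags */
    uint64_t rsp;     /* stack pointer of the interrupted code */
    uint64_t ss;      /* stack segment selector, in an 8-byte slot */
};

_Static_assert(sizeof(struct long_mode_iret_frame) == 40,
               "frame is five 8-byte slots");

/* Read the interrupted stack pointer from a frame at `handler_rsp`. */
static uint64_t saved_rsp(const void *handler_rsp)
{
    const struct long_mode_iret_frame *frame = handler_rsp;
    return frame->rsp;
}

/* Read the interrupted instruction pointer from the same frame. */
static uint64_t saved_rip(const void *handler_rsp)
{
    const struct long_mode_iret_frame *frame = handler_rsp;
    return frame->rip;
}
```

With this layout the saved RSP sits 24 bytes above the handler's initial stack pointer, and the saved RIP sits at offset 0.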

Related

When kernel stack's esp is stored to TSS for interrupt return iret?

When I read Intel's X86 programmer's manual, see the following for interrupt & interrupt return with stack switching:
interrupt:
If a stack switch does occur, the processor does the following:
Temporarily saves (internally) the current contents of the SS, ESP, EFLAGS, CS, and EIP registers.
Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being called) from the TSS into the SS and ESP registers and switches to the new stack.
Pushes the temporarily saved SS, ESP, EFLAGS, CS, and EIP values for the interrupted procedure’s stack onto the new stack.
Pushes an error code on the new stack (if appropriate).
Loads the segment selector for the new code segment and the new instruction pointer (from the interrupt gate or trap gate) into the CS and EIP registers, respectively.
If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.
Begins execution of the handler procedure at the new privilege level.
On return:
Performs a privilege check.
Restores the CS and EIP registers to their values prior to the interrupt or exception.
Restores the EFLAGS register.
Restores the SS and ESP registers to their values prior to the interrupt or exception, resulting in a stack switch back to the stack of the interrupted procedure.
Resumes execution of the interrupted procedure.
For example, one linux process P:
It's initially in kernel mode
It returns to user mode by iret. But from the manual, there is no change to TSS
It traps into kernel by int. Here it needs to find the kernel stack from ESP & SS in TSS. How is this kernel stack value set up, since they are not stored to TSS in step 2?
Once the kernel returns to user-space for a given task, it's done with that task's kernel stack until the next interrupt / exception. There's no useful data on it, so the TSS can hold a fixed SS:[ER]SP value that points to the top of the virtual page[s] allocated as the kernel stack for the current task.
Kernel state doesn't live on the kernel stack between entries into the kernel; it's kept elsewhere in a process control block. (Context switches between tasks actually happen in the kernel, switching kernel stacks to the formerly-sleeping task's kernel stack, so eventually returning to user-space means returning up the call-chain of whatever that task was doing in the kernel first).
BTW, unless the kernel pushes a new CS:EIP / EFLAGS / SS:ESP for iret to pop, the stuff it pops will be the stuff pushed by hardware at the address specified in the TSS. So even if there was some desire to re-enter the kernel with the stack as you left it, that would normally be at the TSS location anyway. But this is irrelevant because Linux doesn't keep stuff on a task's kernel stack while user-space is running, except for a pointer to per-task stuff at the bottom of the region where the kernel can find it with [ER]SP & -16384.
(I think this is right; I've looked at a few bits of Linux kernel code but haven't really gotten my hands dirty experimenting with things. I think this is how Linux works, and a consistent viable design.)
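The `[ER]SP & -16384` trick mentioned above can be sketched in plain C. This assumes a 16 KiB kernel stack region that is aligned to its own size, as the mask implies; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stack region size: 16 KiB, a power of two, size-aligned. */
#define KSTACK_SIZE ((uintptr_t)16384)

/* Lowest address of the stack region containing `sp`.
 * `sp & -16384` and `sp & ~(16384 - 1)` compute the same mask. */
static uintptr_t kstack_base(uintptr_t sp)
{
    return sp & ~(KSTACK_SIZE - 1);
}

/* Address just past the region, i.e. the fixed value an empty stack
 * starts from (the kind of value the TSS could hold). */
static uintptr_t kstack_top(uintptr_t sp)
{
    return kstack_base(sp) + KSTACK_SIZE;
}
```

Because the region is size-aligned, any stack pointer inside it masks down to the same base, so per-task data stored at the bottom can be found without any extra bookkeeping.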

In a Linux system call, are system call parameters preserved in registers after the syscall finished (at the sys_exit tracepoint)?

Is it guaranteed to be able to read all the syscall parameters at sys_exit tracepoint?
The sysdig driver is a kernel module that captures syscalls using the kernel's static tracepoints. In this project some system call parameters are read at the sys_enter tracepoint, and others are read at sys_exit (the return value of course, and userspace contents, to avoid page faults).
Why not read all parameters at sys_exit? Is this because some parameters may not be available at sys_exit?
Is it guaranteed to be able to read all the syscall parameters at sys_exit tracepoint?
Yes... and no, we need to distinguish parameters from registers. Linux syscalls should preserve all general purpose userspace registers, except the register used for the return value (and on some architectures also a second register to indicate if an error occurred). However, this does not mean that the input parameters of the syscall cannot change between entry and exit: if a register holds the value of a pointer to some data, while the register itself does not change, the data it points to could very well change.
Looking at the code for the static tracepoint sys_exit, you can see that only the syscall number (id) and its return value (ret) are traced. See note at the bottom of my answer for more.
Why not read all parameters at sys_exit? Is this because some parameters may be not available at sys_exit?
Yes, I would say that ensuring the correctness of the traced parameters is the main reason why tracing only at the exit would be a bad idea. Even if you get the values of the register, you cannot know the real parameters at syscall exit. Even if a syscall per se is guaranteed to save and restore the state of user registers, the syscall itself can alter the data that is being passed as argument. For example, the recvmsg syscall takes a pointer to a struct msghdr in memory which is used both as an input and an output parameter; the poll syscall does the same with a pointer to struct pollfd. Furthermore, another thread or program could have very well modified the memory of the program while it was making a syscall, therefore altering the data.
Under specific circumstances a syscall can also take a very long time before returning (think for example of a sleep, or a blocking read on your terminal, an accept on a listening socket, etc). If you only trace at the exit, you will have very incorrect timing information, and most importantly you will have to wait a lot before any meaningful information can be captured, even though that information is already available at the entry point.
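The point about in/out parameters can be seen from user space with poll. The sketch below uses only standard POSIX calls (pipe, write, poll, close), and the function name is hypothetical. The pointer passed to poll is the same before and after the call, but the revents field it points to is rewritten by the kernel.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Calls poll() on the read end of a pipe that already holds one byte.
 * Stores the `revents` field as it was before and after the call.
 * Returns poll()'s result, or -1 if setting up the pipe failed. */
static int poll_in_out_demo(short *revents_before, short *revents_after)
{
    assert(revents_before && revents_after);

    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int rc = -1;
    if (write(fds[1], "x", 1) == 1) {
        struct pollfd pfd = { .fd = fds[0], .events = POLLIN, .revents = 0 };
        *revents_before = pfd.revents;   /* input state seen at syscall entry */
        rc = poll(&pfd, 1, 1000);        /* same pointer value throughout */
        *revents_after = pfd.revents;    /* output written by the kernel */
    }
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

A tracer that reads the struct only at exit would see the kernel's output in revents rather than the value the program passed in.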
Note on sys_exit tracepoint
Although you could technically extract the values of the saved registers of the current task, I am not entirely sure about the semantics of doing so while in the sys_exit tracepoint. I searched for some documentation on this specific case, but had no luck, and kernel code is well... complex.
The chain of calls to reach the exit hook should be:
Arch specific entry point (e.g. entry_INT80_32 for x86 int 0x80)
Arch specific entry handler (e.g. do_int80_syscall_32() for x86 int 0x80)
syscall_exit_to_user_mode()
syscall_exit_to_user_mode_prepare()
syscall_exit_work()
trace_sys_exit()
If a deadly signal is delivered to a process during a syscall, while the actual process will never reach the exit of the syscall (i.e. no value is ever returned to user space), the tracepoint will still be hit. When a signal delivery of this kind happens, a special internal return value is used, like -ERESTARTSYS (see here). This value is not an actual syscall return value (it is not returned to user space), but rather it is only meant to be used by kernel. So it looks like the sys_exit tracepoint is being hit with the special -ERESTARTSYS if a deadly signal is received by the process. This does not happen for example in the case of SIGSTOP + SIGCONT. Take this with a grain of salt though, since I was not able to find proper documentation for this.

Hooking Windows Kernel Dispatcher for System Calls

I'm trying to hook the SYSENTER dispatch function in the kernel. Over the past few days I have been studying what happens when a program executes SYSENTER to enter the kernel, and I realized that IA32_SYSENTER_EIP and IA32_SYSENTER_ESP are responsible for setting the kernel RIP and RSP after SYSENTER.
Yesterday I read Intel Software Developer Manuals about SWAPGS :
SWAPGS exchanges the current GS base register value with the value contained in MSR address C0000102H (IA32_KERNEL_GS_BASE). The SWAPGS instruction is a privileged instruction intended for use by system software.
When using SYSCALL to implement system calls, there is no kernel stack
at the OS entry point. Neither is there a straightforward method to
obtain a pointer to kernel structures from which the kernel stack
pointer could be read. Thus, the kernel cannot save general purpose
registers or reference memory.
From the second paragraph ("there is no kernel stack at the OS entry point") it seems that the OS kernel executes SWAPGS to set GS and then gets the kernel stack pointer through it. But as I read it, on SYSENTER the kernel RIP (EIP) and RSP (ESP) are set from IA32_SYSENTER_EIP and IA32_SYSENTER_ESP, so the kernel already has its stack pointer in IA32_SYSENTER_ESP!
My Questions :
If kernel stack address should come from GS then what's the purpose of IA32_SYSENTER_ESP?
What are differences between AMD LSTAR (0xC0000082) and IA32_SYSENTER_EIP? I ask it because I saw Windows set 0xc0000082 on my Intel processor.
Is there any special problem with hooking the kernel's SYSENTER dispatcher? I ask because whenever I put a breakpoint in the Windows function that is responsible for dispatching SYSENTER calls (KiSystemCall64Shadow) on a remote debugging machine (not a VM), it causes a BSOD with UNEXPECTED_KERNEL_MODE_TRAP.

Printing a string in x86 Assembly on Mac OS X (NASM)

I'm doing x86 on Mac OS X with NASM. While copying an example and experimenting, I noticed that my print command needed four extra bytes pushed onto the stack after the other parameters, but I can't figure out why line five is necessary:
1 push dword len ;Length of message
2 push dword msg ;Message to write
3 push dword 1 ;STDOUT
4 mov eax,4 ;Command code for 'writing'
5 sub esp,4 ;<<< Effectively 'push' Without this the print breaks
6 int 0x80 ;SYSCALL
7 add esp,16 ;Functionally 'pop' everything off the stack
I am having trouble finding any documentation on this 'push the parameters to the stack' syntax that NASM/OS X seems to require. If anyone can point me to a resource for that in general that would most likely answer this question as well.
(Most of the credit goes to @Michael Petch's comment; I'm repeating it here so that it is an answer, and also in order to further clarify the reason for the additional four bytes on the stack.)
macOS is based on BSD, and, as per FreeBSD's documentation re system calls, by default the kernel uses the C calling conventions (which means arguments are pushed to the stack, from last to first), but assuming four extra bytes pushed to the stack, as "it is assumed the program will call a function that issues int 80h, rather than issuing int 80h directly".
That is, the kernel is not built for direct int 80h calls, but rather for code that looks like this:
kernel: ; subroutine to make system calls
int 80h
ret
.
.
.
; code that makes a system call
call kernel ; instead of invoking int 80h directly
Notice that call kernel would push the return address (used by the kernel subroutine's ret to return to calling code after the system call) onto the stack, accounting for four additional bytes – that's why it's necessary to manually push four bytes to the stack (any four bytes – their actual value doesn't matter, as it is ignored by the kernel – so one way to achieve this is sub esp, 4) when invoking int 80h directly.
The reason the kernel expects this behaviour – of calling a method which invokes the interrupt instead of invoking it directly – is that when writing code that can be run on multiple platforms it's then only needed to provide a different version of the kernel subroutine, rather than of every place where a system call is invoked (more details and examples in the link above).
Note: all the above is for 32-bit; for 64-bit the calling conventions are different – registers are used to pass the arguments rather than the stack (there's also a call convention for 32-bit which uses registers, but even then it's not the same registers), the syscall instruction is used instead of int 80h, and no extra four bytes (which, on 64-bit systems, would actually be eight bytes) need to be pushed.
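The effect of the extra four bytes in the 32-bit convention can be modelled with a small simulated stack in C. This is a sketch under the assumption described above (the kernel skips one 4-byte slot, where the return address would be, before reading arguments); the types and function names are hypothetical, and nothing here talks to a real kernel.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

/* Simulated 32-bit stack: grows downward, `sp` indexes the top word. */
struct stack32 {
    uint32_t mem[STACK_WORDS];
    unsigned sp;
};

static void stack_init(struct stack32 *s)
{
    for (unsigned i = 0; i < STACK_WORDS; i++)
        s->mem[i] = 0;
    s->sp = STACK_WORDS;              /* empty stack */
}

static void push32(struct stack32 *s, uint32_t value)
{
    assert(s->sp > 0);
    s->mem[--s->sp] = value;
}

/* Model of how the kernel finds argument `i`: skip the one slot at the
 * top of the stack (the assumed return address), then index from there. */
static uint32_t bsd_syscall_arg(const struct stack32 *s, unsigned i)
{
    assert(s->sp + 1 + i < STACK_WORDS);
    return s->mem[s->sp + 1 + i];
}
```

Pushing len, msg and the file descriptor, followed by any dummy word, makes argument 0 the file descriptor. Without the dummy word every argument is read one slot too high, which matches the broken print in the question.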

System call uses registers or stack to pass the parameters to kernel?

I am confused about the system call mechanism. On x86, a system call uses eax to pass the system call number to the kernel.
But what does it use to pass the parameters to the kernel? In some places I see that it uses the stack, and in other places that it uses registers such as ebx and ecx.
So can someone confirm which one is correct?
For reference:
this link says it uses stack.
And this link says it uses registers.
Both the links tell that the parameters are passed through registers like EBX, ECX, etc to the kernel space from the user space.
In the first reference page : 35/352, System Call Implementation/wrappers task 1st point, it is given that
the parameters available in the user stack are moved to the processor registers and then this registers are used to pass parameters of the syscall to the kernel space.
I think you must be confused after seeing the word stack in that point about implementing the libc wrappers like write() which are callable from C, to interface between the system-call calling convention (6 regs) and the function-calling convention (stack args since user-space doesn't normally use -mregparm=3)
Both the links are correct.
You can see in , all system calls are declared with the asmlinkage prefix. In fact, when you define your system call using the SYSCALL_DEFINEx macro, it defines your system call function with the asmlinkage directive. The asmlinkage directive tells the compiler that the function should not expect any of its parameters in CPU registers, i.e. all parameters should be read from the stack only.
When a system call is made from user space, each parameter is placed in a CPU register. During the user-to-kernel transition the kernel has to save all the registers onto the stack (in order to restore the environment before returning to user space), so after that the parameters are available on the stack for the kernel-space system call function.
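The user-space half of this (a wrapper that takes ordinary C arguments and places them in the syscall registers) can be sketched as follows. This is a minimal illustration, not libc's actual wrapper. The helper name raw_syscall3 is hypothetical, and the inline assembly path assumes x86-64 Linux, where the number goes in rax and the first three arguments in rdi, rsi and rdx (rather than the ebx, ecx and edx of the 32-bit int 0x80 convention). On other platforms the sketch falls back to the libc syscall() function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issue a system call with up to three arguments.
 * On x86-64 Linux the raw result is returned (a negative errno value on
 * failure); the fallback returns -1 and sets errno instead. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)                 /* result comes back in rax */
                      : "a"(nr),                  /* syscall number in rax */
                        "D"(a1), "S"(a2), "d"(a3) /* rdi, rsi, rdx */
                      : "rcx", "r11", "memory");  /* clobbered by syscall */
    return ret;
#else
    return syscall(nr, a1, a2, a3);
#endif
}
```

For example, raw_syscall3(SYS_getpid, 0, 0, 0) returns the same value as getpid(); the unused arguments are simply ignored by that system call.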

Resources