How to switch from user mode to kernel mode? - linux-kernel

I'm learning about the Linux kernel, but I don't understand how the switch from user mode to kernel mode happens in Linux. How does it work? Could you give me some advice, or point me to some links or books about this?

The only way a user-space application can explicitly initiate a switch to kernel mode during normal operation is by making a system call such as open, read, or write.
Whenever a user application calls one of these system call APIs with appropriate parameters, a software interrupt/exception (SWI) is triggered.
As a result of this SWI, control jumps from the user application to a predefined location in the Interrupt Vector Table (IVT) provided by the OS.
This IVT contains the address of the SWI exception handler routine, which performs all the steps required to switch to kernel mode and start executing kernel instructions on behalf of the user process.
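As a rough illustration of the answer above, the sketch below reaches the same Linux system calls in two ways: through the usual C library wrapper, and through the generic `syscall()` function from `<unistd.h>`, which takes the system call number directly. The helper names `getpid_raw` and `write_raw` are made up for this example. In both cases the CPU enters kernel mode, runs the kernel's handler, and returns to user mode.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <sys/types.h>     /* pid_t, ssize_t */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: ask the kernel for our PID without the getpid() wrapper. */
static pid_t getpid_raw(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Hypothetical helper: write a buffer by system call number instead of write().
   Like the wrapper, syscall() returns -1 and sets errno on failure. */
static ssize_t write_raw(int fd, const void *buf, size_t len)
{
    return (ssize_t)syscall(SYS_write, fd, buf, len);
}
```

Calling `getpid_raw()` should give the same value as `getpid()`, since both end up in the same kernel handler.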

To switch from user mode to kernel mode you need to perform a system call.
If you just want to see what goes on under the hood, TLDP is your new friend: have a look at the code there (it is well documented, so no additional knowledge is needed to understand the assembly).
The part you are interested in is:
movl $len,%edx # third argument: message length
movl $msg,%ecx # second argument: pointer to message to write
movl $1,%ebx # first argument: file handle (stdout)
movl $4,%eax # system call number (sys_write)
int $0x80 # call kernel
As you can see, the system call is just a thin wrapper around assembly code that raises an interrupt (0x80); as a result, the kernel's handler for this system call is invoked.
Let's cheat a bit and let gcc run the C preprocessor and the assembler for us to build an executable (foo.S is the file where you put the code from the TLDP link):
gcc -o foo -nostdlib foo.S
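Run it under strace to check that the message is written. Note that on a 64-bit system strace decodes int $0x80 calls using the 64-bit system call table, so the 32-bit numbers 4 (write) and 1 (exit) can show up as stat and write: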
$ strace -t ./foo
09:38:28 execve("./foo", ["./foo"], 0x7ffeb5b771d8 /* 57 vars */) = 0
09:38:28 stat(NULL, Hello, world!
NULL) = 14
09:38:28 write(0, NULL, 14)
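The assembly above uses the 32-bit `int $0x80` entry with `eax = 4`. As a hedged sketch of the same idea from C on x86-64 Linux, the hypothetical helper `write_direct` below loads the registers itself and executes the `syscall` instruction through GCC-style inline assembly. The register assignment follows the x86-64 Linux system call convention; on other architectures the sketch falls back to the C library's `syscall()` function.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() for the fallback path */

/* Hypothetical helper: write without going through the C library wrapper. */
static long write_direct(int fd, const void *buf, size_t len)
{
#if defined(__x86_64__)
    long ret;
    /* rax = call number, rdi/rsi/rdx = first three arguments.
       The syscall instruction itself overwrites rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"((long)SYS_write), "D"((long)fd), "S"(buf), "d"(len)
                      : "rcx", "r11", "memory");
    return ret;   /* byte count on success, negative errno value on failure */
#else
    /* Other architectures: use the generic libc entry point (-1 on failure). */
    return syscall(SYS_write, fd, buf, len);
#endif
}
```

Unlike the libc wrapper, the raw instruction hands back the kernel's return value unchanged, so errors appear as small negative numbers rather than as -1 plus `errno`.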

I just read through this, and it's a pretty good resource. It explains user mode and kernel mode, why changes happen, how expensive they are, and gives some interesting related reading.
Here's a short excerpt:
Kernel Mode
In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC.
User Mode
In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode.


Trace from user space code to kernel space

I recently set up my system for kernel debugging using qemu+gdb. At present, I can set breakpoints at, for example, __do_page_fault() and trace the call via gdb (with the win command). Now I want to do the following: take a simple C program with a "hello world" printf statement and trace the call sequence starting from userspace down to the write() system call (or anything in kernel space that is invoked during the execution of that particular userspace program). I want to learn how a userspace program traps into a system call, specifically in the Linux kernel.
My question is where to set the breakpoint. We have the kernel code as well as the C code of the program. How should I go about this? Please give an explanation with an example.
The easiest way, in my opinion, is to split this into two pieces:
1. Place a breakpoint in the guest kernel using the host gdb.
2. Place a breakpoint in the user code just before the trap instruction, using gdb running inside the guest. When it is hit, print the stack from that in-guest (in-qemu) gdb; this gives you the user-space stack trace.
3. Continue execution in the guest gdb.
4. The in-kernel breakpoint (set in step 1) will be hit in the host gdb. Print the kernel stack trace.
If your kernel hits the breakpoint continuously (the write syscall, for example, is definitely used widely), you can use a conditional breakpoint that triggers only when certain parameters are passed.
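To make the conditional breakpoint practical, it helps if the traced program issues a write whose arguments are easy to recognise. The sketch below is one way to do that; the function name `emit_marker` and the marker text are invented for this example. The message is 22 bytes long, so a condition on the length argument of whichever kernel function handles write in your tree (for example `ksys_write` on recent kernels, which is an assumption to check against your source) would filter out most unrelated calls.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical marker: its length (22 bytes) makes the call easy to single out. */
static const char MARKER[] = "trace-me: hello world\n";

/* Issue one write() whose arguments are easy to match in a kernel breakpoint.
   This function is also a convenient place for the user-side breakpoint. */
static ssize_t emit_marker(void)
{
    return write(STDOUT_FILENO, MARKER, strlen(MARKER));
}
```

Using write() directly instead of printf also avoids stdio buffering, so the system call happens exactly where the user-space breakpoint suggests.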

CPU in kernel/user mode

A CPU can be either in kernel mode (fully privileged) or in user mode. The kernel requires kernel mode, while applications need to run in user mode. But how can the CPU be in two modes at once?
Processors generally include a mode flag which indicates which mode the processor is in at a given time; that flag need not necessarily do a whole lot. In a simple implementation, the flag might only control whether the processor is allowed to change memory mappings; the processor would include an instruction which simply switches to user mode, and an instruction which simultaneously switches to kernel mode and jumps to a particular address.
If the kernel stores its own code at the aforementioned address and then switches the memory map so that the address in question is write-protected, then user code would be able to ask the kernel to do something by storing its request somewhere and making a call to a "switch to kernel mode and jump" instruction. The kernel code could then enable its private memory areas, examine the request stored by the user-mode code, act upon the request, disable its private memory areas, switch back to user mode, and return to executing user-mode code.
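The scheme described above can be sketched as a toy simulation. Nothing here touches real hardware: `struct toy_cpu`, `toy_trap`, and the other names are hypothetical, and "doubling the request" merely stands in for whatever service a kernel would perform.

```c
#include <stdbool.h>

enum toy_mode { TOY_USER, TOY_KERNEL };

struct toy_cpu {
    enum toy_mode mode;   /* the mode flag */
    int memory_map;       /* stands in for the privileged memory-mapping state */
    int request;          /* slot where user code stores its request */
    int result;           /* slot where the kernel leaves its answer */
};

/* Privileged operation: only allowed while the mode flag says kernel. */
static bool toy_set_memory_map(struct toy_cpu *cpu, int new_map)
{
    if (cpu->mode != TOY_KERNEL)
        return false;                 /* a real CPU would raise a fault here */
    cpu->memory_map = new_map;
    return true;
}

/* The kernel's fixed entry point; user code cannot pick another target. */
static void toy_kernel_entry(struct toy_cpu *cpu)
{
    toy_set_memory_map(cpu, 1);       /* enable the kernel's private memory */
    cpu->result = cpu->request * 2;   /* act upon the stored request */
    toy_set_memory_map(cpu, 0);       /* disable the private memory again */
    cpu->mode = TOY_USER;             /* switch back to user mode */
}

/* "Switch to kernel mode and jump": both happen as one step. */
static void toy_trap(struct toy_cpu *cpu)
{
    cpu->mode = TOY_KERNEL;
    toy_kernel_entry(cpu);
}

/* User-mode code: store a request, trap, then read the result. */
static int toy_user_request(struct toy_cpu *cpu, int value)
{
    cpu->request = value;
    toy_trap(cpu);
    return cpu->result;
}
```

The point of the design is that user code can raise the privilege level only together with a jump to an address the kernel chose, so the CPU is never in kernel mode while running user-controlled code.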

int instruction from user space

I was under the impression that the "int" instruction on x86 is not privileged, so I thought we should be able to execute it from a user-space application. But that does not seem to be the case.
I am trying to execute int from a user application on Windows. I know it may not be right to do so, but I wanted to have some fun. However, Windows is killing my application.
I think the issue is due to the condition cpl <= iopl. Does anyone know how to get around it?
Generally the old dispatcher mechanism for user mode code to transition to kernel mode in order to invoke kernel services was implemented by int 2Eh (now replaced by sysenter). Also int 3 is still reserved for breakpoints to this very day.
Basically the kernel sets up traps for certain interrupts (I don't remember whether for all of them), and depending on the trap code they either perform some service for the user-mode invoker or, if that is not possible, your application gets killed because it attempted a privileged operation.
Details would anyway depend on the exact interrupt you were trying to invoke. The functions DbgBreakPoint (ntdll.dll) and DebugBreak (kernel32.dll) do nothing other than invoking int 3 (or actually the specific opcode int3), for example.
Edit 1: On newer Windows versions (XP SP2 and newer, IIRC) sysenter replaces int 2Eh, as I wrote in my answer. One possible reason why your application gets terminated (although you should be able to catch this via exception handling) is that you don't pass the parameters that it expects on the stack. Basically the user-mode part of the native API places the parameters for the system service you call onto the stack, then places the number of the service (an index into the system service dispatcher table, SSDT, sometimes SDT) into a specific register, and then invokes sysenter on newer systems or int 2Eh on older ones.
The minimum ring level required to invoke a given interrupt vector (which decides whether a given "int" is privileged) is determined by the descriptor privilege level (DPL) of the gate descriptor associated with that vector in the interrupt descriptor table.
In Windows the majority of interrupt vectors are privileged. This prevents user mode from merely calling the double-fault handler to immediately bugcheck the OS.
There are some non-privileged interrupts in Windows. Specifically:
int 1 (both the CD 01 encoding and the debug interrupt that occurs after a single instruction if EFLAGS_TF is set in eflags)
int 3 (both encoding CC and CD 03)
int 2E (Windows system call)
All other interrupts are privileged, and invoking them from user mode causes a general-protection fault (#GP) to be raised instead.
INT is a 'privilege controlled' instruction. It has to be this way for the kernel to protect itself from usermode. INT goes through the exact same trap vectors that hardware interrupts and processor exceptions go through, so if usermode could arbitrarily trigger these exceptions, the interrupt dispatching code would get confused.
If you want to trigger an interrupt on a particular vector that's not already set up by Windows, you have to modify the IDT entry for that interrupt vector with a debugger or a kernel driver. Patchguard won't let you do this from a driver on x64 versions of Windows.
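The privilege check discussed in these answers can be modelled in a few lines. This is only a toy: `toy_gate_dpl` hard-codes the three user-callable vectors listed in the answer above rather than reading a real IDT, and all names are hypothetical. The rule itself is the one in the Intel manuals: a software INT n is delivered only if the current privilege level is numerically less than or equal to the gate's DPL, and otherwise the CPU raises a general-protection fault (#GP).

```c
#include <stdint.h>

enum int_outcome { INT_DELIVERED, INT_GP_FAULT };

/* DPL that the toy IDT assigns to a vector. Vectors 0x01, 0x03 and 0x2E are
   user-callable here only because the answer above lists them. */
static unsigned toy_gate_dpl(uint8_t vector)
{
    switch (vector) {
    case 0x01:
    case 0x03:
    case 0x2E:
        return 3;   /* callable from ring 3 */
    default:
        return 0;   /* kernel only */
    }
}

/* A software INT n is delivered only if CPL <= gate DPL (numerically). */
static enum int_outcome toy_software_int(unsigned cpl, uint8_t vector)
{
    return cpl <= toy_gate_dpl(vector) ? INT_DELIVERED : INT_GP_FAULT;
}
```

Hardware interrupts and processor exceptions do not go through this check, which is why a vector can be unreachable from `int` in user mode and still fire normally.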

How does the kernel know if the CPU is in user mode or kernel mode?

Since the CPU runs in user/kernel mode, I want to know how this is determined by kernel. I mean, if a sys call is invoked, the kernel executes it on behalf of the process, but how does the kernel know that it is executing in kernel mode?
You can tell whether you are in user mode or kernel mode from the privilege level held in the code-segment register (CS). Every instruction fetched from the memory pointed to by RIP or EIP (the instruction pointer on x86_64 or x86 respectively) is read from the segment described in the global descriptor table (GDT) by the descriptor that CS currently selects. The lower two bits of the CS selector give the current privilege level (CPL) that the code is executing at.
When a syscall is made, which is typically done through a software interrupt, the CPU checks the current privilege level and, if it is in user mode, exchanges the current code-segment selector for a kernel-level one as determined by the syscall's software interrupt gate descriptor. It also switches stacks and saves the current flags, the user-level CS value, and the RIP value on the new kernel-level stack. When the syscall is complete, the user-mode CS value, flags, and instruction pointer (EIP or RIP) are restored from the kernel stack, and a stack switch is made back to the currently executing process's stack.
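To make the CS-based check concrete, here is a small sketch that decodes a segment selector value and, on x86 builds, reads the live value of CS using GCC-style inline assembly. The function names are invented for this example. The low two bits of the value in CS give the privilege level, bit 2 selects GDT or LDT, and the remaining bits are the descriptor index. The selector values 0x33 and 0x10 mentioned in the comments are the user and kernel code selectors commonly seen on x86-64 Linux; treat them as examples rather than guarantees.

```c
#include <stdint.h>

/* Privilege level: bits 0-1 of a segment selector (3 = user, 0 = kernel). */
static unsigned selector_privilege(uint16_t selector)
{
    return selector & 0x3u;
}

/* Table indicator: bit 2 (0 = GDT, 1 = LDT). */
static unsigned selector_table(uint16_t selector)
{
    return (selector >> 2) & 0x1u;
}

/* Descriptor index: bits 3-15. For example, 0x33 decodes to index 6. */
static unsigned selector_index(uint16_t selector)
{
    return selector >> 3;
}

/* Privilege level of the calling code, or -1 when not built for x86. */
static int current_privilege_level(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint16_t cs;
    __asm__ ("mov %%cs, %0" : "=r"(cs));   /* reading CS is not privileged */
    return (int)selector_privilege(cs);
#else
    return -1;
#endif
}
```

In an ordinary user process `current_privilege_level()` is expected to report 3; the same read performed inside the kernel would report 0.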
Broadly, if it's running kernel code it's in kernel mode. The transition from user space to kernel mode (say, for a system call) causes a mode switch (often loosely called a context switch) to occur, and the CPU mode is changed as part of it.
Kernel code only executes in kernel mode. There is no way kernel code can execute in user mode. When an application makes a system call, it generates a trap (software interrupt), the mode is switched to kernel mode, and the kernel implementation of the system call is executed. Once it is done, the kernel switches back to user mode and the user application continues processing in user mode.
The term is "Supervisor Mode", which applies to x86, ARM, and many other processors as well.
Read this (which applies only to x86 CPU):
Rings 0 to 3 are the different privilege levels of the x86 CPU. Normally only Ring 0 and Ring 3 are used (kernel and user), but Ring 1 has found some uses (e.g., VMware used it to emulate a guest's execution of ring 0). Only Ring 0 has the full privilege to run certain privileged instructions (such as lgdt or lidt), so a good test at the assembly level is to execute one of these instructions and see whether your program encounters an exception.
Read this to really identify your current privilege level (look for CPL, which is a pictorialization of Jason's answer):
It is a simple question and does not need any expert comment as provided above.
The question is how a CPU comes to know whether it is in kernel mode or user mode.
The answer is the "mode bit".
It is a bit in the status register of the CPU's register set.
When the mode bit is 0, the CPU is in kernel mode (also called monitor mode, privileged mode, and various other names).
When the mode bit is 1, the CPU is in user mode, and the user can run their own applications without any special kernel intervention.
So simple, isn't it?

How does Windows protect transition into kernel mode?

How does Windows prevent a user-mode thread from arbitrarily transitioning the CPU to kernel mode?
I understand these things are true:
User-mode threads DO actually transition to kernel-mode when a system call is made through NTDLL.
The transition to kernel-mode is done through processor-specific instructions.
So what is special about these system calls through NTDLL? Why can't the user-mode thread fake-it and execute the processor-specific instructions to transition to kernel-mode? I know I'm missing some key piece of Windows architecture here...what is it?
You're probably thinking that a thread running in user mode is calling into Ring 0, but that's not what's actually happening. The user-mode thread causes an exception that is caught by the Ring 0 code. The user-mode thread is halted and the CPU switches to a kernel/ring 0 thread, which can then inspect the context (e.g., call stack and registers) of the user-mode thread to figure out what to do. Before syscall/sysenter, it really was an exception rather than a special instruction meant specifically to invoke ring 0 code.
If you take the advice of the other responses and read the Intel manuals, you'll see that syscall/sysenter don't take any parameters: the OS decides what happens. You can't call arbitrary code. WinNT uses function numbers that map to the kernel-mode function that the user-mode code will execute. For example, NtOpenFile is fnc 75h on my Windows XP machine. The numbers change all the time; one of the jobs of NTDLL is to map a function call to a fnc number, put it in EAX, point EDX to the incoming parameters, and then invoke sysenter.
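The "function number" idea can be sketched as a table lookup. This is a toy model, not Windows code: the table, the two services, and `toy_dispatch` are all hypothetical. It shows why user code cannot call arbitrary kernel code, because the only thing it supplies is an index, and the kernel-side dispatcher validates that index before using it.

```c
#include <stddef.h>

typedef long (*toy_service_fn)(long arg);

static long toy_service_double(long arg) { return arg * 2; }
static long toy_service_negate(long arg) { return -arg; }

/* Kernel-owned table: user code supplies only an index into it. */
static const toy_service_fn toy_service_table[] = {
    toy_service_double,   /* service number 0 */
    toy_service_negate,   /* service number 1 */
};

#define TOY_SERVICE_COUNT \
    (sizeof toy_service_table / sizeof toy_service_table[0])

/* Kernel-side dispatcher: validate the number, then call the chosen service.
   Returns 0 and stores the service's result on success, -1 for a bad number. */
static int toy_dispatch(unsigned long number, long arg, long *result)
{
    if (number >= TOY_SERVICE_COUNT || result == NULL)
        return -1;            /* unknown number: refuse, never jump */
    *result = toy_service_table[number](arg);
    return 0;
}
```

The user-mode half of this model would only place a number and an argument in agreed registers and execute the trap instruction; everything in the block above runs on the kernel side.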
Intel CPUs enforce security using what's called 'Protection Rings'.
There are 4 of these, numbered from 0 to 3. Code running in ring 0 has the highest privileges; it can (practically) do whatever it pleases with your computer. The code in ring 3, on the other hand, is always on a tight leash; it has only limited powers to influence things. And rings 1 and 2 are currently not used for any purpose at all.
A thread running in a higher privileged ring (such as ring 0) can transition to lower privilege ring (such as ring 1, 2 or 3) at will. However, the transition the other way around is strictly regulated. This is how the security of high privileged resources (such as memory) etc. is maintained.
Naturally, your user mode code (applications and all) runs in ring 3 while the OS's code runs in ring 0. This ensures that the user mode threads can't mess with the OS's data structures and other critical resources.
For details on how all this is actually implemented you could read this article. In addition, you may also want to go through Intel Manuals, especially Vol 1 and Vol 3A, which you can download here.
This is the story for Intel processors. I'm sure other architectures have something similar going on.
I think (I may be wrong) that the mechanism which it uses for transition is simple:
User-mode code executes a software interrupt
This (interrupt) causes a branch to a location specified in the interrupt descriptor table (IDT)
The thing that prevents user-mode code from usurping this is as follows: you need to be privileged to write to the IDT, so only the kernel is able to specify what happens when an interrupt is executed.
Code running in User Mode (Ring 3) can't arbitrarily change to Kernel Mode (Ring 0). It can only do so using special routes: call gates, interrupts, and sysenter vectors. These routes are highly protected and input is scrubbed so that bad data can't (shouldn't) cause bad behavior.
All of this is set up by the kernel, usually on startup. It can only be configured in Kernel Mode so User-Mode code can't modify it.
It's probably fair to say that it does it in a (relatively) similar way to what Linux does. In both cases it's going to be CPU-specific, but on x86 probably either a software interrupt with the INT instruction, or via SYSENTER instruction.
The advantage of looking at how Linux does it is that you can do so without a Windows source licence.
The userspace source part is here at LXR, and for the kernel-space bit, look at entry_32.S and entry_64.S.
Under Linux on x86 there are three different mechanisms, int 0x80, syscall and sysenter.
A small shared library called the vDSO, which the kernel provides and maps into each process at runtime, is called by the C library to implement the syscall function; it uses a different mechanism depending on the CPU and on which system call it is. The kernel then has handlers for those mechanisms (if they exist on the specific CPU variant).
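One way to see the vDSO from a running process on Linux is to ask for the `AT_SYSINFO_EHDR` entry of the auxiliary vector, which the kernel fills in with the address where it mapped the vDSO. The sketch below assumes glibc's `getauxval()` from `<sys/auxv.h>`; the helper names are made up for this example. It only checks that the mapping starts with the ELF magic bytes and does not attempt to look up symbols.

```c
#include <elf.h>        /* ELFMAG, SELFMAG, AT_SYSINFO_EHDR */
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>   /* getauxval() */

/* Address at which the kernel mapped the vDSO, or NULL if none was reported. */
static const void *vdso_base(void)
{
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* The vDSO is an ordinary ELF image, so it should start with the ELF magic. */
static int looks_like_elf(const void *image)
{
    return image != NULL && memcmp(image, ELFMAG, SELFMAG) == 0;
}
```

If `vdso_base()` returns NULL, the kernel did not report a vDSO for this process, and the C library has to fall back to issuing the system call instruction itself.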
