Which stack does hardware interrupt use? - linux-kernel

I don't quite understand the interrupt stack switch mechanism in x86_64. From Intel's developer's manual, when a hardware interrupt occurs, the current context will be stored on interrupt stack for later iret use. I have the following questions:
Is this stack hardirq_stack in Linux kernel? If it is, this stack is also used by softirq; so how does the stack keep consistency when an interrupt occurs during handling softirq? If it is not, which stack is used?
Only a part of the context is saved on the stack (e.g., RSP, CS, RIP), what about the other part (e.g., registers)?

If you read the following: https://www.kernel.org/doc/Documentation/x86/kernel-stacks, you have information on what Linux does for the different stacks.
Like all other architectures, x86_64 has a kernel stack for every
active thread.
This means that each process/thread has one kernel stack which is used to handle syscalls. The libstdc++/libc (standard C++/C implementations) use the syscall assembly instruction directly to make system calls. The entry point to the kernel is arch/x86/entry/entry_64.S. This assembly file will switch the stack of the user mode process to its kernel counterpart. This is during a syscall not during an external interrupt. This paragraph is just to clear misconceptions.
In addition to the per thread stacks, there are specialized stacks
associated with each CPU. These stacks are only used while the kernel
is in control on that CPU; when a CPU returns to user space the
specialized stacks contain no useful data. The main CPU stacks are:
Interrupt stack. IRQ_STACK_SIZE
Used for external hardware interrupts. If this is the first external
hardware interrupt (i.e. not a nested hardware interrupt) then the
kernel switches from the current task to the interrupt stack. Like
the split thread and interrupt stacks on i386, this gives more room
for kernel interrupt processing without having to increase the size
of every per thread stack.
The interrupt stack is also used when processing a softirq.
Switching to the kernel interrupt stack is done by software based on a
per CPU interrupt nest counter.
This means that each CPU has some specialized stacks which are used to handle some special events. The specialized stacks don't contain valid data when in user mode.
The kernel maintains a nest counter (interrupt of an interrupt). If the nest counter is 0, it will switch the user mode stack to the interrupt kernel stack. Otherwise, it will remain on the same stack.
Is this stack hardirq_stack in Linux kernel? If it is, this stack is also used by softirq; so how does the stack keep consistency when an interrupt occurs during handling softirq? If it is not, which stack is used?
The stack used is the interrupt stack for both the hardirq (hardware irq) and softirq (software irq). The stack remains consistent because the irq handler pushes all registers that could be used by a nested interrupt and maintains a nest counter. The TSS is used for irq going from ring 3 to ring 0 to switch the stacks. This is necessary because the switch to kernel mode (ring 0) cannot be done by software. It must be done by hardware. The TSS is a mechanism which allows safe switch to kernel mode on interrupt.
Only a part of the context is saved on the stack (e.g., RSP, CS, RIP), what about the other part (e.g., registers)?
The registers are also saved on the stack.

Related

Why do we need Interrupt context?

I am having doubts, why exactly we need interrupt context? Everything tells what are the properties but no one explains why we come up with this concept?
Another doubt related to same concept is, If we are not disabling the interrupt in interrupt handler, then what is the use of running this interrupt handler code in interrupt context ?
The interrupt context is fundamentally different from the process context:
It is not associated with a process; a specific process does not serve interrupts, the kernel does. Even if a process will be interrupted, it has no significance over any parameters of the interrupt itself or the routine that will serve it. It follows that at the very least, interrupt context must be different from process context conceptually.
Additionally, if an interrupt were to be serviced in a process context, and (re-) scheduled some work at a later time, what context would that run in? The original process may not even exist at that later time. Ergo, we need some context which is independent from processes for a practical reason.
Interrupt handling must be fast; your interrupt handler has interrupted (d'oh) some other code. Significant work should be pushed outside the interrupt handler, onto the "bottom half". It is unacceptable to block a process for work which is not even remotely its concern, either in user or in kernel space.
Disabling the interrupt is something you can (actually could, before 2.6.36) request to be disabled when registering your ISR. Recall that a handler can serve interrupts on multiple CPUs simultaneously, and can thus race with itself. Non-Maskable Interrupts (NMIs) can not be disabled.
Why do we need Interrupt context?
First, what do we mean by interrupt context? A context is usually a state. There are two separate concepts of state.
CPU context
Every CPU architecture has a mechanism for handling interrupts. There maybe a single interrupt vector called for every system interrupt, or the CPU/hardware may be capable of dispatching the CPU to a particular address based on the interrupt source. There are also mechanisms for masking/unmasking interrupts. Each interrupt maybe masked individually, or there maybe a global mask for the entire CPU(s). Finally, there is an actual CPU state. Some may have separate stacks, register sets, and CPU modes implying some memory and other privileges. Your question is about Linux in general and it must handle all cases.
Linux context
Generally all of the architectures have a separate kernel stack, process context (ala ps) and VM (virtual memory) context for each process. The VM has different privileges for user and kernel modes. In order for the kernel to run all the time, it must remain mapped for all processes on a device. A kernel thread is a special case that doesn't care so much about the VM, because it is privileged and can access all kernel memory. However, it does have a separate stack and process context. User registers are typically stored upon the kernel stack when exceptions happen. Exceptions are at least page faults, system calls and interrupts. These items may nest. Ie, you may call write() from user space and while the kernel is transferring a user buffer, it may page fault to read some swapped out user space data. The page fault may again have to service an interrupt.
Interrupt recursion
Linux general wants you to leave interrupts masked as the VM, the execptions, and process management (context and context switching) have to work together. In order to keep things simple for the VM, the kernel stack and process context are generally rooted in either a single 4k (or 8k) area which is a single VM page. This page is always mapped. Typically, all CPUs will switch from interrupt mode to system mode when servicing an interrupt and use the same kernel stack as all other exceptions. The stack is small so to allow recursion (and large stack allocation) can blow up the stack resulting in stack overflows at the kernel level. This is bad.
Atomicity
Many kernel structures need to stay consistent over multiple bus cycles; Ie, a linked list must update both prev and next node links when adding an element. A typical mechanism to do this maybe to mask interrupts, to ensure the code is atomic. Some CPUs may allow bus locking, but this is not universal. The context switching code must also be atomic. A consequence of an interrupt is typically rescheduling. Ie, a kernel interrupt handler may have acked a disk controller and started a write operation. Then a kernel thread may schedule to write more buffered data from the original user space write().
Interrupts occurring at any time can break some sub-sytem's assumptions of atomic behavior. Instead of allowing interrupt to use the sub-system, they are prohibited from using it.
Summary
Linux must handle three thing. The current process execution context, the current virtual memory layout and hardware requests. They all need to work together. As the interrupts may happen at any time, they occur in any process context. Using sleep(), etc in an interrupt would put random processes to sleep. Allowing large stack allocation in an interrupt could blow up the limited stack. These design choices limit what can happen in a Linux interrupt handler. Various configuration options can allow re-entrant interrupts, but this is often CPU specific.
A benefit of keeping the top half, now the main interrupt handler small is that interrupt latency is reduced. Busy work should be done in a kernel thread. An interrupt service routine that would need to un-mask interrupts is already somewhat anti-social to the Linux eco-system. That work should be put in a kernel thread.
The Linux interrupt context really doesn't exist in some sense. It is only a CPU interrupt which may happen in any process context. The Linux interrupt context is actually a set of coding limitations that happen as a consequence of this.

Performance difference between system call vs function call

I quite often listen to driver developers saying its good to avoid kernel mode switches as much as possible. I couldn't understand the precise reason. To start with my understanding is -
System calls are software interrupts. On x86 they are triggered by using instruction sysenter. Which actually looks like a branch instruction which takes the target from a machine specific register.
System calls don't really have to change the address space or process context.
Though, they do save registers on process stack and and change stack pointer to kernel stack.
Among these operations syscall pretty much works like a normal function call. Though the sysenter could behave like a mis-predicted branch which could lead to ROB flush in processor pipeline. Even that is not really bad, its just like any other mis-predicted branch.
I heard a few people answering on Stack Overflow:
You never know how long syscall takes - [me] yeah, but thats case with any function. Amount of time it takes depends on the function
It is often scheduling spot. - [me] process can get rescheduled, even if it is running all the time in user mode. ex, while(1); doesnt guarantee a no-context switch.
Where is the actual syscall cost coming from?
You don't indicate what OS you are asking about. Let me attempt an answer anyway.
The CPU instructions syscall and sysenter should not be confused with the concept of a system call and its representation in the respective OSs.
The best explanation for the difference in the overhead incurred by each respective instruction is given by reading through the Operation sections of the IntelĀ® 64 and IA-32 Architectures Developer's Manual volume 2A (for int, see page 3-392) and volume 2B (for sysenter see page 4-463). Also don't forget to glance at iretd and sysexit while at it.
A casual counting of the pseudo-code for the operations yields:
408 lines for int
55 lines for sysenter
Note: Although the existing answer is right in that sysenter and syscall are not interrupts or in any way related to interrupts, older kernels in the Linux and the Windows world used interrupts to implement their system call mechanism. On Linux this used to be int 0x80 and on Windows int 0x2E. And consequently on those kernel versions the IDT had to be primed to provide an interrupt handler for the respective interrupt. On newer systems, that's true, the sysenter and syscall instructions have completely replaced the old ways. With sysenter it's the MSR (machine specific register) 0x176 which gets primed with the address of the handler for sysenter (see the reading material linked below).
On Windows ...
A system call on Windows, just like on Linux, results in the switch to kernel mode. The scheduler of NT doesn't provide any guarantees about the time a thread is granted. Also it yanks away time from threads and can even end up starving threads. In general one can say that user mode code can be preempted by kernel mode code (with very few very specific exceptions to which you'll certainly get in the "advanced driver writing class"). This makes perfect sense if we only look at one example. User mode code can be swapped out - or, for that matter, the data it's trying to access. Now the CPU doesn't have the slightest clue how to access pages in the swap/paging file, so an intermediate step is required. And that's also why kernel mode code must be able to preempt user mode code. It is also the reason for one of the most prolific bug-check codes seen on Windows and mostly caused by third-party drivers: IRQL_NOT_LESS_OR_EQUAL. It means that a driver accessed paged memory when it wasn't possible to preempt the code touching that memory.
Further reading
SYSENTER and SYSEXIT in Windows by Geoff Chappell (always worth a read in my experience!)
Sysenter Based System Call Mechanism in Linux 2.6
Windows NT platform specific discussion: How Do Windows NT System Calls REALLY Work?
Windows NT platform specific discussion: System Call Optimization with the SYSENTER Instruction
Windows Internals, 5th ed., by Russinovich et. al. - pages 125 through 132.
ReactOS implementation of KiFastSystemCall
SYSENTER/SYSCALL is not a software interrupt; whole point of those instructions is to avoid overhead caused by issuing IRQ and calling interrupt handler.
Saving registers on stack costs time, this is one place where the syscall cost comes from.
Another place comes from the kernel mode switch itself. It involves changing segment registers - CS, DS, ES, FS, GS, they all have to be changed (it's less costly on x86-64, as segmentation is mostly unused, but you still need to essentially make far jump to kernel code) and also changes CPU ring of execution.
To conclude: function call is (on modern systems, where segmentation is not used) near call, while syscall involves far call and ring switch.

task gate, interrupt gate, call gate

I have been trying to read more about different gates in x86 architecture. If I understand correctly then interrupt and trap gate are used for hw and sw interrupt handling respectively.
Whereas CALL gate is probably no more used, as ppl prefer replaced by SYSENTER and SYSEXIT.
I was wondering how task gates are used (I know they are used for hw task switch). What does that exactly mean? Does hw task refer to OS task/process. Or is it more like switching between two different instances of operating system. (May be on servers.)?
On a side note, can it happen that some of the interrupts are handled in the user mode. (Can we handle divide by zero interrupt in the user mode. If it can be then does that mean IDT handler entry for divide by zero contains address from the user space?)
Thanks
Everything you might want to know about interrupts and gates is in the Intel developer manual, volume 3. In short:
Task gates were originally designed as a CPU-mediated method of performing task switching; the CPU can automatically record the state of the process during the task switching operation. These are not typically used in modern operating systems; the OS usually does the state-saving operations on its own.
At least in Linux, all interrupt handlers are in kernel space and execute at ring 0. If you want to handle a divide-by-zero exception, you register a userspace signal handler for SIGFPE; the kernel-space interrupt handler raises the SIGFPE signal, indirectly triggering the userspace handler code (the userspace code is executed after returning from the interrupt handler).
The state of affairs is that only interrupt and trap gates was actually in use and stay in use now. In theory, both of them can be used as for s/w and for h/w event handling. The only difference between them is that interrupt gate call automatically prohibits future interrupts, that can be useful in some cases of hardware interrupt handling.
By default people try to use trap gates, because unnecessary interrupt disabling is a bad thing, because interrupt disabling increase interrupt handling latencies and increase probability of interrupt lost.
Call gates was never been in actual use. It is inconvenient and not optimal way for system call implementation. Instead call gate, most of the operating systems use trap gate (int 0x80 in Linux and int 0x2E in Windows) or sysenter/sysexit syscall/sysrt instructions.
Task gate was never been in actual use too. It is not optimal, inconvenient and limited feature, if not ugly at all. Instead of it, operating systems usually implements task switching on its own side by kernel mode task stacks switching.
Initially, Intel delivered hardware support of multitasking by introduction of TSS (Task State Segment) and Task Gate. According to that features, processor is able to automatically store the state of one task and restore state of another one in reply to the request came from hw or sw. Sw request can be done by issuing call or jmp instructions with TSS selector or task gate selector used as instruction operand. Hw request can be done by hardware traping into the task gate in appropriate IDT entry. But as I've already mentioned, no one really uses it. Instead of it, operating systems use only one TSS for all tasks (TSS must be used in any case, because during control transfer from the less privileged segment to more privileged segment CPU switch stacks and it capture address of the stack for more privileged segment from the TSS) and make task switch manually.
In theory, interrupts and exceptions can be handled in user mode (ring 3), but in practice it is not useful and operating system handle all such events on the kernel side (in ring 0). The reason is simple, interrupt and exception handlers must always reside in the memory and be accessible from the any address space. Kernel part of address space is shared and the same in all address spaces of all tasks in the system, but the user part of address space is wired to the particular task. If you want to handle exception in user mode you will be forced to reprogram IDT on each task switch that will introduce significant performance penalty. If you want to handle interrupts in the same way you will be forced to share interrupt handlers between all tasks on the same addresses. As unwanted consequence, any task in the system will be able to corrupt handler.

What is the difference between the kernel space and the user space?

What is the difference between the kernel space and the user space? Do kernel space, kernel threads, kernel processes and kernel stack mean the same thing? Also, why do we need this differentiation?
The really simplified answer is that the kernel runs in kernel space, and normal programs run in user space. User space is basically a form of sand-boxing -- it restricts user programs so they can't mess with memory (and other resources) owned by other programs or by the OS kernel. This limits (but usually doesn't entirely eliminate) their ability to do bad things like crashing the machine.
The kernel is the core of the operating system. It normally has full access to all memory and machine hardware (and everything else on the machine). To keep the machine as stable as possible, you normally want only the most trusted, well-tested code to run in kernel mode/kernel space.
The stack is just another part of memory, so naturally it's segregated right along with the rest of memory.
The Random Access Memory (RAM) can be logically divided into two distinct regions namely - the kernel space and the user space.(The Physical Addresses of the RAM are not actually divided only the Virtual Addresses, all this implemented by the MMU)
The kernel runs in the part of memory entitled to it. This part of memory cannot be accessed directly by the processes of the normal users, while the kernel can access all parts of the memory. To access some part of the kernel, the user processes have to use the predefined system calls i.e. open, read, write etc. Also, the C library functions like printf call the system call write in turn.
The system calls act as an interface between the user processes and the kernel processes. The access rights are placed on the kernel space in order to stop the users from messing with the kernel unknowingly.
So, when a system call occurs, a software interrupt is sent to the kernel. The CPU may hand over the control temporarily to the associated interrupt handler routine. The kernel process which was halted by the interrupt resumes after the interrupt handler routine finishes its job.
CPU rings are the most clear distinction
In x86 protected mode, the CPU is always in one of 4 rings. The Linux kernel only uses 0 and 3:
0 for kernel
3 for users
This is the most hard and fast definition of kernel vs userland.
Why Linux does not use rings 1 and 2: CPU Privilege Rings: Why rings 1 and 2 aren't used?
How is the current ring determined?
The current ring is selected by a combination of:
global descriptor table: a in-memory table of GDT entries, and each entry has a field Privl which encodes the ring.
The LGDT instruction sets the address to the current descriptor table.
See also: http://wiki.osdev.org/Global_Descriptor_Table
the segment registers CS, DS, etc., which point to the index of an entry in the GDT.
For example, CS = 0 means the first entry of the GDT is currently active for the executing code.
What can each ring do?
The CPU chip is physically built so that:
ring 0 can do anything
ring 3 cannot run several instructions and write to several registers, most notably:
cannot change its own ring! Otherwise, it could set itself to ring 0 and rings would be useless.
In other words, cannot modify the current segment descriptor, which determines the current ring.
cannot modify the page tables: How does x86 paging work?
In other words, cannot modify the CR3 register, and paging itself prevents modification of the page tables.
This prevents one process from seeing the memory of other processes for security / ease of programming reasons.
cannot register interrupt handlers. Those are configured by writing to memory locations, which is also prevented by paging.
Handlers run in ring 0, and would break the security model.
In other words, cannot use the LGDT and LIDT instructions.
cannot do IO instructions like in and out, and thus have arbitrary hardware accesses.
Otherwise, for example, file permissions would be useless if any program could directly read from disk.
More precisely thanks to Michael Petch: it is actually possible for the OS to allow IO instructions on ring 3, this is actually controlled by the Task state segment.
What is not possible is for ring 3 to give itself permission to do so if it didn't have it in the first place.
Linux always disallows it. See also: Why doesn't Linux use the hardware context switch via the TSS?
How do programs and operating systems transition between rings?
when the CPU is turned on, it starts running the initial program in ring 0 (well kind of, but it is a good approximation). You can think this initial program as being the kernel (but it is normally a bootloader that then calls the kernel still in ring 0).
when a userland process wants the kernel to do something for it like write to a file, it uses an instruction that generates an interrupt such as int 0x80 or syscall to signal the kernel. x86-64 Linux syscall hello world example:
.data
hello_world:
.ascii "hello world\n"
hello_world_len = . - hello_world
.text
.global _start
_start:
/* write */
mov $1, %rax
mov $1, %rdi
mov $hello_world, %rsi
mov $hello_world_len, %rdx
syscall
/* exit */
mov $60, %rax
mov $0, %rdi
syscall
compile and run:
as -o hello_world.o hello_world.S
ld -o hello_world.out hello_world.o
./hello_world.out
GitHub upstream.
When this happens, the CPU calls an interrupt callback handler which the kernel registered at boot time. Here is a concrete baremetal example that registers a handler and uses it.
This handler runs in ring 0, which decides if the kernel will allow this action, do the action, and restart the userland program in ring 3. x86_64
when the exec system call is used (or when the kernel will start /init), the kernel prepares the registers and memory of the new userland process, then it jumps to the entry point and switches the CPU to ring 3
If the program tries to do something naughty like write to a forbidden register or memory address (because of paging), the CPU also calls some kernel callback handler in ring 0.
But since the userland was naughty, the kernel might kill the process this time, or give it a warning with a signal.
When the kernel boots, it setups a hardware clock with some fixed frequency, which generates interrupts periodically.
This hardware clock generates interrupts that run ring 0, and allow it to schedule which userland processes to wake up.
This way, scheduling can happen even if the processes are not making any system calls.
What is the point of having multiple rings?
There are two major advantages of separating kernel and userland:
it is easier to make programs as you are more certain one won't interfere with the other. E.g., one userland process does not have to worry about overwriting the memory of another program because of paging, nor about putting hardware in an invalid state for another process.
it is more secure. E.g. file permissions and memory separation could prevent a hacking app from reading your bank data. This supposes, of course, that you trust the kernel.
How to play around with it?
I've created a bare metal setup that should be a good way to manipulate rings directly: https://github.com/cirosantilli/x86-bare-metal-examples
I didn't have the patience to make a userland example unfortunately, but I did go as far as paging setup, so userland should be feasible. I'd love to see a pull request.
Alternatively, Linux kernel modules run in ring 0, so you can use them to try out privileged operations, e.g. read the control registers: How to access the control registers cr0,cr2,cr3 from a program? Getting segmentation fault
Here is a convenient QEMU + Buildroot setup to try it out without killing your host.
The downside of kernel modules is that other kthreads are running and could interfere with your experiments. But in theory you can take over all interrupt handlers with your kernel module and own the system, that would be an interesting project actually.
Negative rings
While negative rings are not actually referenced in the Intel manual, there are actually CPU modes which have further capabilities than ring 0 itself, and so are a good fit for the "negative ring" name.
One example is the hypervisor mode used in virtualization.
For further details see:
https://security.stackexchange.com/questions/129098/what-is-protection-ring-1
https://security.stackexchange.com/questions/216527/ring-3-exploits-and-existence-of-other-rings
ARM
In ARM, the rings are called Exception Levels instead, but the main ideas remain the same.
There exist 4 exception levels in ARMv8, commonly used as:
EL0: userland
EL1: kernel ("supervisor" in ARM terminology).
Entered with the svc instruction (SuperVisor Call), previously known as swi before unified assembly, which is the instruction used to make Linux system calls. Hello world ARMv8 example:
hello.S
.text
.global _start
_start:
/* write */
mov x0, 1
ldr x1, =msg
ldr x2, =len
mov x8, 64
svc 0
/* exit */
mov x0, 0
mov x8, 93
svc 0
msg:
.ascii "hello syscall v8\n"
len = . - msg
GitHub upstream.
Test it out with QEMU on Ubuntu 16.04:
sudo apt-get install qemu-user gcc-arm-linux-gnueabihf
arm-linux-gnueabihf-as -o hello.o hello.S
arm-linux-gnueabihf-ld -o hello hello.o
qemu-arm hello
Here is a concrete baremetal example that registers an SVC handler and does an SVC call.
EL2: hypervisors, for example Xen.
Entered with the hvc instruction (HyperVisor Call).
A hypervisor is to an OS, what an OS is to userland.
For example, Xen allows you to run multiple OSes such as Linux or Windows on the same system at the same time, and it isolates the OSes from one another for security and ease of debug, just like Linux does for userland programs.
Hypervisors are a key part of today's cloud infrastructure: they allow multiple servers to run on a single hardware, keeping hardware usage always close to 100% and saving a lot of money.
AWS for example used Xen until 2017 when its move to KVM made the news.
EL3: yet another level. TODO example.
Entered with the smc instruction (Secure Mode Call)
The ARMv8 Architecture Reference Model DDI 0487C.a - Chapter D1 - The AArch64 System Level Programmer's Model - Figure D1-1 illustrates this beautifully:
The ARM situation changed a bit with the advent of ARMv8.1 Virtualization Host Extensions (VHE). This extension allows the kernel to run in EL2 efficiently:
VHE was created because in-Linux-kernel virtualization solutions such as KVM have gained ground over Xen (see e.g. AWS' move to KVM mentioned above), because most clients only need Linux VMs, and as you can imagine, being all in a single project, KVM is simpler and potentially more efficient than Xen. So now the host Linux kernel acts as the hypervisor in those cases.
Note how ARM, maybe due to the benefit of hindsight, has a better naming convention for the privilege levels than x86, without the need for negative levels: 0 being the lower and 3 highest. Higher levels tend to be created more often than lower ones.
The current EL can be queried with the MRS instruction: what is the current execution mode/exception level, etc?
ARM does not require all exception levels to be present to allow for implementations that don't need the feature to save chip area. ARMv8 "Exception levels" says:
An implementation might not include all of the Exception levels. All implementations must include EL0 and EL1.
EL2 and EL3 are optional.
QEMU for example defaults to EL1, but EL2 and EL3 can be enabled with command line options: qemu-system-aarch64 entering el1 when emulating a53 power up
Code snippets tested on Ubuntu 18.10.
Kernel space & virtual space are concepts of virtual memory....it doesn't mean Ram(your actual memory) is divided into kernel & User space.
Each process is given virtual memory which is divided into kernel & user space.
So saying
"The random access memory (RAM) can be divided into two distinct regions namely - the kernel space and the user space." is wrong.
& regarding "kernel space vs user space" thing
When a process is created and its virtual memory is divided into user-space and a kernel-space , where user space region contains data, code, stack, heap of the process & kernel-space contains things such as the page table for the process, kernel data structures and kernel code etc.
To run kernel space code, control must shift to kernel mode(using 0x80 software interrupt for system calls) & kernel stack is basically shared among all processes currently executing in kernel space.
Kernel space and user space is the separation of the privileged operating system functions and the restricted user applications. The separation is necessary to prevent user applications from ransacking your computer. It would be a bad thing if any old user program could start writing random data to your hard drive or read memory from another user program's memory space.
User space programs cannot access system resources directly so access is handled on the program's behalf by the operating system kernel. The user space programs typically make such requests of the operating system through system calls.
Kernel threads, processes, stack do not mean the same thing. They are analogous constructs for kernel space as their counterparts in user space.
Each process has its own 4GB of virtual memory which maps to the physical memory through page tables. The virtual memory is mostly split in two parts: 3 GB for the use of the process and 1 GB for the use of the Kernel. Most of the variables you create lie in the first part of the address space. That part is called user space. The last part is where the kernel resides and is common for all the processes. This is called Kernel space and most of this space is mapped to the starting locations of physical memory where the kernel image is loaded at boot time.
The maximum size of address space depends on the length of the address register on the CPU.
On systems with 32-bit address registers, the maximum size of address space is 232 bytes, or 4 GiB.
Similarly, on 64-bit systems, 264 bytes can be addressed.
Such address space is called virtual memory or virtual address space. It is not actually related to physical RAM size.
On Linux platforms, virtual address space is divided into kernel space and user space.
An architecture-specific constant called task size limit, or TASK_SIZE, marks the position where the split occurs:
the address range from 0 up to TASK_SIZE-1 is allotted to user space;
the remainder from TASK_SIZE up to 232-1 (or 264-1) is allotted to kernel space.
On a particular 32-bit system for example, 3 GiB could be occupied for user space and 1 GiB for kernel space.
Each application/program in a Unix-like operating system is a process; each of those has a unique identifier called Process Identifier (or simply Process ID, i.e. PID). Linux provides two mechanisms for creating a process: 1. the fork() system call, or 2. the exec() call.
A kernel thread is a lightweight process and also a program under execution.
A single process may consist of several threads sharing the same data and resources but taking different paths through the program code. Linux provides a clone() system call to generate threads.
Example uses of kernel threads are: data synchronization of RAM, helping the scheduler to distribute processes among CPUs, etc.
Briefly : Kernel runs in Kernel Space, the kernel space has full access to all memory and resources, you can say the memory divide into two parts, part for kernel , and part for user own process, (user space) runs normal programs, user space cannot access directly to kernel space so it request from kernel to use resources. by syscall (predefined system call in glibc)
there is a statement that simplify the different "User Space is Just a test load for the Kernel " ...
To be very clear : processor architecture allow CPU to operate in two mode, Kernel Mode and User Mode, the Hardware instruction allow switching from one mode to the other.
memory can be marked as being part of user space or kernel space.
When CPU running in User Mode, the CPU can access only memory that is being in user space, while cpu attempts to access memory in Kernel space the result is a "hardware exception", when CPU running in Kernel mode, the CPU can access directly to both kernel space and user space ...
The kernel space means a memory space can only be touched by kernel. On 32bit linux it is 1G(from 0xC0000000 to 0xffffffff as virtual memory address).Every process created by kernel is also a kernel thread, So for one process, there are two stacks: one stack in user space for this process and another in kernel space for kernel thread.
the kernel stack occupied 2 pages(8k in 32bit linux), include a task_struct(about 1k) and the real stack(about 7k). The latter is used to store some auto variables or function call params or function address in kernel functions. Here is the code(Processor.h (linux\include\asm-i386)):
#define THREAD_SIZE (2*PAGE_SIZE)
#define alloc_task_struct() ((struct task_struct *) __get_free_pages(GFP_KERNEL,1))
#define free_task_struct(p) free_pages((unsigned long) (p), 1)
__get_free_pages(GFP_KERNEL,1)) means alloc memory as 2^1=2 pages.
But the process stack is another thing, its address is just bellow 0xC0000000(32bit linux), the size of it can be quite bigger, used for the user space function calls.
So here is a question come for system call, it is running in kernel space but was called by process in user space, how does it work? Will linux put its params and function address in kernel stack or process stack? Linux's solution: all system call are triggered by software interruption INT 0x80.
Defined in entry.S (linux\arch\i386\kernel), here is some lines for example:
ENTRY(sys_call_table)
.long SYMBOL_NAME(sys_ni_syscall) /* 0 - old "setup()" system call*/
.long SYMBOL_NAME(sys_exit)
.long SYMBOL_NAME(sys_fork)
.long SYMBOL_NAME(sys_read)
.long SYMBOL_NAME(sys_write)
.long SYMBOL_NAME(sys_open) /* 5 */
.long SYMBOL_NAME(sys_close)
By Sunil Yadav, on Quora:
The Linux Kernel refers to everything that runs in Kernel mode and is
made up of several distinct layers. At the lowest layer, the Kernel
interacts with the hardware via the HAL. At the middle level, the
UNIX Kernel is divided into 4 distinct areas. The first of the four
areas handles character devices, raw and cooked TTY and terminal
handling. The second area handles network device drivers, routing
protocols and sockets. The third area handles disk device drivers,
page and buffer caches, file system, virtual memory, file naming and
mapping. The fourth and last area handles process dispatching,
scheduling, creation and termination as well as signal handling.
Above all this we have the top layer of the Kernel which includes
system calls, interrupts and traps. This level serves as the
interface to each of the lower level functions. A programmer uses
the various system calls and interrupts to interact with the features
of the operating system.
IN short kernel space is the portion of memory where linux kernel runs (top 1 GB virtual space in case of linux) and user space is the portion of memory where user application runs( bottom 3 GB of virtual memory in case of Linux. If you wanna know more the see the link given below :)
http://learnlinuxconcepts.blogspot.in/2014/02/kernel-space-and-user-space.html
Kernel Space and User Space are logical spaces.
Most of the modern processors are designed to run in different privileged mode. x86 machines can run in 4 different privileged modes.
And a particular machine instruction can be executed when in/above particular privileged mode.
Because of this design you are giving a system protection or sand-boxing the execution environment.
Kernel is a piece of code, which manages your hardware and provide system abstraction. So it needs to have access for all the machine instruction. And it is most trusted piece of software. So i should be executed with the highest privilege. And Ring level 0 is the most privileged mode. So Ring Level 0 is also called as Kernel Mode.
User Application are piece of software which comes from any third party vendor, and you can't completely trust them. Someone with malicious intent can write a code to crash your system if he had complete access to all the machine instruction. So application should be provided with access to limited set of instructions. And Ring Level 3 is the least privileged mode. So all your application run in that mode. Hence that Ring Level 3 is also called User Mode.
Note: I am not getting Ring Levels 1 and 2. They are basically modes with intermediate privilege. So may be device driver code are executed with this privilege. AFAIK, linux uses only Ring Level 0 and 3 for kernel code execution and user application respectively.
So any operation happening in kernel mode can be considered as kernel space.
And any operation happening in user mode can be considered as user space.
Trying to give a very simplified explanation
Virtual Memory is divided into kernel space and the user space.
Kernel space is that area of virtual memory where kernel processes will run and user space is that area of virtual memory where user processes will be running.
This division is required for memory access protections.
Whenever a bootloader starts a kernel after loading it to a location in RAM, (on an ARM based controller typically)it needs to make sure that the controller is in supervisor mode with FIQ's and IRQ's disabled.
The correct answer is: There is no such thing as kernel space and user space. The processor instruction set has special permissions to set destructive things like the root of the page table map, or access hardware device memory, etc.
Kernel code has the highest level privileges, and user code the lowest. This prevents user code from crashing the system, modifying other programs, etc.
Generally kernel code is kept under a different memory map than user code (just as user spaces are kept in different memory maps than each other). This is where the "kernel space" and "user space" terms come from. But that is not a hard and fast rule. For example, since the x86 indirectly requires its interrupt/trap handlers to be mapped at all times, part (or some OSes all) of the kernel must be mapped into user space. Again, this does not mean that such code has user privileges.
Why is the kernel/user divide necessary? Some designers disagree that it is, in fact, necessary. Microkernel architecture is based on the idea that the highest privileged sections of code should be as small as possible, with all significant operations done in user privileged code. You would need to study why this might be a good idea, it is not a simple concept (and is famous for both having advantages and drawbacks).
This demarcation need architecture support there are some instructions that are accessed in privileged mode.
In pagetables we have access details if user process try to access address which lies in kernel address range then it will give privilege violation fault.
So to enter privileged mode it is required to run instruction like trap which change CPU mode to privilege and give access to instructions as well as memory regions
In Linux there are two space 1st is user space and another one is kernal space. user space consist of only user application which u want to run. as the kernal service there is process management, file management, signal handling, memory management, thread management, and so many services are present there. if u run the application from the user space that appliction interact with only kernal service. and that service is interact with device driver which is present between hardware and kernal.
the main benefit of kernal space and user space seperation is we can acchive a security by the virus.bcaz of all user application present in user space, and service is present in kernal space. thats why linux doesn,t affect from the virus.

in linux kernel, the data structure thread_struct contains both field esp0 and esp, what is the difference?

This is my guess:
esp0 is initialized with the kernel stack top addr. when the kernel stack is allocated, and it is used, during process switch, to initialize tss->esp0, so that when context switches from user mode to kernel mode, the kernel stack can be located; while esp is used to save the kernel stack top of the process that is to be scheduled out during process switch.
So esp0 in a thread_struct doesn't change once initialized, while esp changes.
Is my guess right?
The thread_struct structure contains two of these ESP fields, those being esp0 and esp. However, they relate to the four fields in the tss_segment_32 structure, those being esp0, esp1, esp2 and esp.
These actually exist in the TSS so it's very much something from Intel rather than something from Linus et al.
As to why the TSS contains them, the numbers are logical if you know how the protection model works under x86. They are, in fact, the ring levels (except for esp which is for ring level 3 despite the fact it's not actually called esp3).
In other words, they contain the stack pointer to be used in the ring that you're executing in. Since Linux only uses ring 0 (kernel mode) and ring 3 (user mode), esp0 and esp are the only ones that need to be saved.
As an aside, I think the only OS I've ever seen use another ring was OS/2 which used ring 2 for certain I/O operations. Processes that were allowed to perform those operations had to be specially marked and the OS would run them in ring 2 to allow unfettered I/O access, without being allowed to bring down the kernel.

Resources