How does the kernel manage user-space threads in Linux? - linux-kernel

I have read this: Linux - Threads and Process.
I understand that every kernel thread has a unique task_struct.
My question is how the kernel manages a user application's threads. Suppose a user application has 12 threads: how does the kernel manage them, and does every thread have a unique task_struct, as kernel threads do?

The kernel manages them when it can, i.e. whenever it is entered from an 'interrupt' that changes the state of the threads.
There are two flavours of interrupt: a syscall from a running thread, or a call from a driver that has been entered from a 'real' hardware interrupt from the keyboard, NIC, disk, timer etc. Either can change the state of threads and initiate a run of the scheduling algorithm, which may change the set of threads running on the available cores.
In between interrupts, the kernel manages nothing because it is not entered.
The task_struct is created when a running thread makes a syscall to create a new thread, so every user thread has its own task_struct, just like a kernel thread. The new thread is created ready, and will run whenever the scheduling algorithm dispatches it onto a core.
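The one-task-per-thread relationship can be observed from user space. The following is a minimal sketch, assuming Linux with glibc: it starts 12 threads (the number from the question) and records the kernel thread ID of each one through the gettid system call. The names count_distinct_tids and record_ids are made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define NTHREADS 12

struct thread_ids {
    pid_t pid;  /* thread group ID: what getpid() reports */
    pid_t tid;  /* kernel task ID: one per schedulable task */
};

static pthread_barrier_t all_started;

static void *record_ids(void *arg)
{
    struct thread_ids *ids = arg;

    ids->pid = getpid();
    ids->tid = (pid_t)syscall(SYS_gettid);
    /* Keep every thread alive until all have recorded their IDs,
     * so no TID can be recycled during the experiment. */
    pthread_barrier_wait(&all_started);
    return NULL;
}

/* Returns the number of distinct TIDs seen among NTHREADS threads,
 * or -1 if some thread reported a different PID than the caller. */
int count_distinct_tids(void)
{
    pthread_t threads[NTHREADS];
    struct thread_ids ids[NTHREADS];
    int distinct = 0;

    pthread_barrier_init(&all_started, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, record_ids, &ids[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&all_started);

    for (int i = 0; i < NTHREADS; i++) {
        int seen_before = 0;

        if (ids[i].pid != getpid())
            return -1;
        for (int j = 0; j < i; j++)
            if (ids[j].tid == ids[i].tid)
                seen_before = 1;
        if (!seen_before)
            distinct++;
    }
    return distinct;
}
```

All threads report the same PID (they belong to one thread group), while every TID is different, which is what you would expect if the kernel keeps a separate task for each thread.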

Related

Linux (or other *nix): Attaching an interrupt to userspace

I'm trying to make sure that a unique user process executes as soon as possible after a particular hardware interrupt occurs.
One mechanism I'm aware of for doing this is to write a small kernel module that exports a device while sleeping inside the read handler. The module also registers an irq handler, which does nothing but wake the process. Then from the user's perspective, reads to that device block until the relevant interrupt occurs.
(1) On a modern CPU with a mainline kernel, can you reliably expect sub-millisecond latency between the kernel seeing the interrupt and the user process regaining control with this approach?
(2) Are there any lower latency mechanisms on a mainline kernel?
Apply the PREEMPT_RT patch to the kernel and compile it with full preemptibility configured through make menuconfig.
This will allow you to have threaded interrupts (i.e., interrupt handlers executed as kernel threads). Then, you can assign maximum priority (i.e., RT prio > 50) to your specific interrupt handler (check its PID using ps aux) and to your specific process, and a lower priority to anything else.
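Raising the priority of the user process can be done from the process itself. Below is a minimal sketch using the POSIX call sched_setscheduler; the helper name make_realtime and the priority value 80 in the usage note are assumptions chosen for illustration, not values from the answer above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Requests the SCHED_FIFO real-time policy for the caller (pid 0).
 * Returns 0 on success, -EINVAL if the priority is out of range, or
 * another negative errno value (typically -EPERM when the caller is
 * not privileged to use real-time scheduling). */
int make_realtime(int priority)
{
    struct sched_param param = { .sched_priority = priority };
    int lowest = sched_get_priority_min(SCHED_FIFO);
    int highest = sched_get_priority_max(SCHED_FIFO);

    if (priority < lowest || priority > highest)
        return -EINVAL;
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1)
        return -errno;
    return 0;
}
```

A call such as make_realtime(80) would place the process above the threaded interrupt handlers that stay at their default priority. For an already running task, such as the interrupt handler thread, the chrt utility (chrt -f -p 80 PID) does the same job from the shell.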

Context switch internals

I want to learn and fill gaps in my knowledge with the help of this question.
So, a user is running a thread (kernel-level) and it now calls yield (a system call I presume).
The scheduler must now save the context of the current thread in the TCB (which is stored in the kernel somewhere) and choose another thread to run and loads its context and jump to its CS:EIP.
To narrow things down, I am working on Linux running on top of x86 architecture. Now, I want to get into the details:
So, first we have a system call:
1) The wrapper function for yield will push the system call arguments onto the stack. Push the return address and raise an interrupt with the system call number pushed onto some register (say EAX).
2) The interrupt changes the CPU mode from user to kernel and jumps to the interrupt vector table and from there to the actual system call in the kernel.
3) I guess the scheduler gets called now and now it must save the current state in the TCB. Here is my dilemma. Since, the scheduler will use the kernel stack and not the user stack for performing its operation (which means the SS and SP have to be changed) how does it store the state of the user without modifying any registers in the process. I have read on forums that there are special hardware instructions for saving state but then how does the scheduler get access to them and who runs these instructions and when?
4) The scheduler now stores the state into the TCB and loads another TCB.
5) When the scheduler runs the original thread, the control gets back to the wrapper function which clears the stack and the thread resumes.
Side questions: Does the scheduler run as a kernel-only thread (i.e. a thread which can run only kernel code)? Is there a separate kernel stack for each kernel-thread or each process?
At a high level, there are two separate mechanisms to understand. The first is the kernel entry/exit mechanism: this switches a single running thread from running usermode code to running kernel code in the context of that thread, and back again. The second is the context switch mechanism itself, which switches in kernel mode from running in the context of one thread to another.
So, when Thread A calls sched_yield() and is replaced by Thread B, what happens is:
Thread A enters the kernel, changing from user mode to kernel mode;
Thread A in the kernel context-switches to Thread B in the kernel;
Thread B exits the kernel, changing from kernel mode back to user mode.
Each user thread has both a user-mode stack and a kernel-mode stack. When a thread enters the kernel, the current values of the user-mode stack pointer (SS:ESP) and instruction pointer (CS:EIP) are saved to the thread's kernel-mode stack, and the CPU switches to the kernel-mode stack; with the int $0x80 syscall mechanism, this is done by the CPU itself. The remaining register values and flags are then also saved to the kernel stack.
When a thread returns from the kernel to user-mode, the register values and flags are popped from the kernel-mode stack, then the user-mode stack and instruction pointer values are restored from the saved values on the kernel-mode stack.
When a thread context-switches, it calls into the scheduler (the scheduler does not run as a separate thread - it always runs in the context of the current thread). The scheduler code selects a process to run next, and calls the switch_to() function. This function essentially just switches the kernel stacks - it saves the current value of the stack pointer into the TCB for the current thread (called struct task_struct in Linux), and loads a previously-saved stack pointer from the TCB for the next thread. At this point it also saves and restores some other thread state that isn't usually used by the kernel - things like floating point/SSE registers. If the threads being switched don't share the same virtual memory space (ie. they're in different processes), the page tables are also switched.
So you can see that the core user-mode state of a thread isn't saved and restored at context-switch time - it's saved and restored to the thread's kernel stack when you enter and leave the kernel. The context-switch code doesn't have to worry about clobbering the user-mode register values - those are already safely saved away in the kernel stack by that point.
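The idea that a context switch is essentially a switch of stacks can be tried out in user space. The sketch below is only an analogy, not kernel code: it uses the glibc ucontext functions (getcontext, makecontext, swapcontext) to give two "threads" their own stacks and to switch between them, much as switch_to() saves one stack pointer and loads another. The names thread_a, thread_b and run_switch_demo are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t ctx_main, ctx_a, ctx_b;
static char trace[8];
static size_t trace_len;

static void note(char step)
{
    if (trace_len + 1 < sizeof trace)
        trace[trace_len++] = step;
}

static void thread_a(void)
{
    note('a');
    swapcontext(&ctx_a, &ctx_b);  /* "yield": save A's state, resume B */
    note('A');                    /* runs after B switches back to A */
}

static void thread_b(void)
{
    note('b');
    swapcontext(&ctx_b, &ctx_a);  /* "yield": save B's state, resume A */
    note('B');                    /* runs after A has finished */
}

static void prepare(ucontext_t *ctx, void *stack, ucontext_t *next,
                    void (*entry)(void))
{
    getcontext(ctx);
    ctx->uc_stack.ss_sp = stack;   /* each context gets a private stack */
    ctx->uc_stack.ss_size = STACK_SIZE;
    ctx->uc_link = next;           /* resumed when entry() returns */
    makecontext(ctx, entry, 0);
}

/* Runs both contexts to completion and returns the recorded order. */
const char *run_switch_demo(void)
{
    void *stack_a = malloc(STACK_SIZE);
    void *stack_b = malloc(STACK_SIZE);

    assert(stack_a != NULL && stack_b != NULL);
    memset(trace, 0, sizeof trace);
    trace_len = 0;

    prepare(&ctx_a, stack_a, &ctx_b, thread_a);
    prepare(&ctx_b, stack_b, &ctx_main, thread_b);
    swapcontext(&ctx_main, &ctx_a);  /* start A; return here when B ends */

    free(stack_a);
    free(stack_b);
    return trace;
}
```

Each swapcontext call stores the registers of the caller, including its stack pointer, and loads those of the target, so each function continues exactly where it last yielded. The recorded order is a, b, A, B.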
What you missed during step 2 is that the stack gets switched from a thread's user-level stack (where you pushed args) to a thread's protected-level stack. The current context of the thread interrupted by the syscall is actually saved on this protected stack. Inside the ISR and just before entering the kernel, this protected stack is again switched to the kernel stack you are talking about. Once inside the kernel, kernel functions such as the scheduler's functions eventually use the kernel stack. Later on, a thread gets elected by the scheduler and the system returns to the ISR; it switches back from the kernel stack to the newly elected thread's protected-level stack (or the former thread's, if no higher-priority thread is active), which eventually contains the new thread's context. The context is then restored from this stack by code automatically (depending on the underlying architecture). Finally, a special instruction restores the last touchy registers, such as the stack pointer and the instruction pointer. Back in userland...
To sum up, a thread has (generally) two stacks, and the kernel itself has one. The kernel stack gets wiped at the end of each kernel entry. It's interesting to point out that since 2.6, the kernel itself is threaded for some processing, so a kernel thread has its own protected-level stack besides the general kernel stack.
Some resources:
3.3.3 Performing the Process Switch of Understanding the Linux Kernel, O'Reilly
5.12.1 Exception- or Interrupt-Handler Procedures of Intel's manual 3A (system programming). The chapter number may vary from one edition to another, so a lookup on "Stack Usage on Transfers to Interrupt and Exception-Handling Routines" should get you to the right one.
Hope this helps!
The kernel itself has no stack at all. The same is true for a process: it also has no stack. Threads are the only system citizens that are considered execution units. Because of this, only threads can be scheduled and only threads have stacks. But there is one point that kernel-mode code exploits heavily: at every moment the system works in the context of the currently active thread. Because of this, the kernel itself can reuse the stack of the currently active thread. Note that only one of them can execute at any given moment: either kernel code or user code. So when the kernel is invoked, it just reuses the thread's stack and performs a cleanup before returning control to the interrupted activity in the thread. The same mechanism works for interrupt handlers, and the same mechanism is exploited by signal handlers.
In turn, the thread stack is divided into two isolated parts: one is called the user stack (because it is used when the thread executes in user mode), and the other is called the kernel stack (because it is used when the thread executes in kernel mode). Once the thread crosses the border between user and kernel mode, the CPU automatically switches it from one stack to the other. The two stacks are tracked differently by the kernel and the CPU. For the kernel stack, the CPU permanently keeps a pointer to the top of the thread's kernel stack. This is easy, because that address is constant for the thread. Each time the thread enters the kernel it finds an empty kernel stack, and each time it returns to user mode it cleans the kernel stack. By contrast, the CPU does not keep a pointer to the top of the user stack while the thread runs in kernel mode. Instead, on entry to the kernel, the CPU creates a special "interrupt" stack frame on top of the kernel stack and stores the value of the user-mode stack pointer in that frame. When the thread exits the kernel, the CPU restores the value of ESP from the previously created "interrupt" stack frame, immediately before cleaning it up. (On legacy x86, the int/iret pair of instructions handles entering and exiting kernel mode.)
On entry to kernel mode, immediately after the CPU has created the "interrupt" stack frame, the kernel pushes the contents of the rest of the CPU registers onto the kernel stack. Note that it saves values only for those registers that can be used by kernel code. For example, the kernel doesn't save the contents of the SSE registers, because it will never touch them. Similarly, just before asking the CPU to return control to user mode, the kernel pops the previously saved contents back into the registers.
Note that systems such as Windows and Linux have a notion of a system thread (frequently called a kernel thread; I know it is confusing). System threads are a special kind of thread: they execute only in kernel mode and therefore have no user part of the stack. The kernel employs them for auxiliary housekeeping tasks.
A thread switch is performed only in kernel mode. That means that both the outgoing and the incoming thread run in kernel mode, both use their own kernel stacks, and both kernel stacks have "interrupt" frames with pointers to the top of the user stacks. The key point of a thread switch is a switch between the kernel stacks of the threads, as simple as:
pushad                              ; save context of outgoing thread on the top of the kernel stack of outgoing thread
; here kernel uses kernel stack of outgoing thread
mov [TCB_of_outgoing_thread], ESP   ; remember where the outgoing thread's kernel stack ends
mov ESP, [TCB_of_incoming_thread]   ; switch to the incoming thread's kernel stack
; here kernel uses kernel stack of incoming thread
popad                               ; restore context of incoming thread from the top of the kernel stack of incoming thread
Note that there is only one function in the kernel that performs the thread switch. Because of this, each time the kernel has switched stacks it can find the context of the incoming thread on the top of the stack, simply because every time before a stack switch the kernel pushes the context of the outgoing thread onto its stack.
Note also that every time after a stack switch and before returning to user mode, the kernel reloads the CPU's record of the top of the kernel stack with the new value. This ensures that when the new active thread tries to enter the kernel in the future, the CPU will switch it to its own kernel stack.
Note also that not all registers are saved on the stack during a thread switch; some registers, like FPU/MMX/SSE, are saved in a specially dedicated area in the TCB of the outgoing thread. The kernel employs a different strategy here for two reasons. First, not every thread in the system uses them, so pushing their contents onto the stack and popping them back for every thread is inefficient. Second, there are special instructions for "fast" saving and loading of their contents, and these instructions don't use the stack.
Note also that the kernel part of the thread stack in fact has a fixed size and is allocated as part of the TCB. (True for Linux, and I believe for Windows too.)
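The kernel half of a thread's stack is not visible from user space, but the user half is, and the claim that stacks belong to threads rather than to the process can be checked there. The following sketch assumes Linux with glibc, since pthread_getattr_np is a GNU extension; the names report_stack and stacks_are_disjoint are made up for the example. Two threads each ask for their own stack range and check that a local variable lies inside it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct stack_range {
    uintptr_t lo, hi;   /* [lo, hi) is the thread's user stack */
    int local_inside;   /* 1 if a local variable lies in that range */
};

static pthread_barrier_t both_running;

static void *report_stack(void *arg)
{
    struct stack_range *range = arg;
    pthread_attr_t attr;
    void *addr = NULL;
    size_t size = 0;
    int local = 0;      /* lives on this thread's user stack */

    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, &addr, &size);
    pthread_attr_destroy(&attr);

    range->lo = (uintptr_t)addr;
    range->hi = range->lo + size;
    range->local_inside = (uintptr_t)&local >= range->lo &&
                          (uintptr_t)&local < range->hi;
    /* Both threads stay alive together, so neither stack is reused. */
    pthread_barrier_wait(&both_running);
    return NULL;
}

/* Returns 1 if two concurrent threads run on non-overlapping stacks
 * and each thread's local variable lies inside its own stack. */
int stacks_are_disjoint(void)
{
    pthread_t t1, t2;
    struct stack_range a, b;
    int rc1, rc2;

    pthread_barrier_init(&both_running, NULL, 2);
    rc1 = pthread_create(&t1, NULL, report_stack, &a);
    rc2 = pthread_create(&t2, NULL, report_stack, &b);
    assert(rc1 == 0 && rc2 == 0);
    (void)rc1;
    (void)rc2;
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_barrier_destroy(&both_running);

    return a.local_inside && b.local_inside &&
           (a.hi <= b.lo || b.hi <= a.lo);
}
```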

who is running kernel if cpu is running processes?

Suppose that, in a two-process environment, one process is scheduled for execution by the kernel and it requests some data that is not available in RAM. The CPU will signal to the kernel that something is not available, and the process will be suspended. The kernel then loads the second process for execution on the CPU, starts looking for the data in secondary storage (say, virtual memory), fetches it, puts it into main memory by swapping out data that is currently inactive, and puts the first process back in the ready queue for execution.
We know that everything in a computer system is done by the CPU. If the CPU is busy continuously executing process code, then who is executing the kernel code to perform the tasks done by the kernel?
Please let me know if I have explained the scenario clearly.
At any point in time, the CPU(s) will be:
Running a process in user mode.
Running on behalf of a process in kernel mode to execute a privileged instruction or access hardware (for example, when a read / write system call is issued).
Running in response to a hardware interrupt, i.e. running in interrupt context (not associated with any process in particular), and yes, in kernel mode.
Running some kernel threads to serve deferred work like soft IRQs (tasklet / softirq).
Running the CPU idle thread if there is nothing to execute.
If you are asking about scheduling in particular, then:
Suppose a process is running and has issued a read call to retrieve data from the hard disk. The read system call results in a switch from user mode to kernel mode. The kernel, running on behalf of the process, prepares the hard disk read operation and then calls the schedule() function, which removes the process from the CPU.
Suppose a hardware interrupt arrives. The currently running process is suspended, and the interrupt service handler for that interrupt begins to execute in kernel mode (obviously).
Basically, the kernel runs in between user processes!
Clear now?
Shash
The kernel runs either as a result of a hardware interrupt, or as a result of being invoked by a process to do something. In both cases the code which was executing at that moment stops running until the kernel finishes its job.
It is similar to a function call: when function A calls function B, function A has to wait until function B is done doing what it does, and returns control to function A. You do not need multiple CPUs, or any kind of magic to accomplish this.
The CPU is not continuously executing process code. The CPU is interrupted to perform various operations. Interrupts can occur for various reasons: a resource becomes available, a previous action completes, or simply a timer goes off.
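The blocking case described above, where a process enters the kernel through a system call and the kernel then schedules something else, leaves a trace that user space can read back. The sketch below is an illustration under assumptions: it uses nanosleep as a stand-in for a blocking read, and the function name voluntary_switches_during_sleep is invented. It reads the voluntary context switch counter from getrusage before and after the blocking call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/resource.h>
#include <time.h>

/* Returns how many voluntary context switches the process performed
 * while blocked in a short sleep. */
long voluntary_switches_during_sleep(void)
{
    struct rusage before, after;
    struct timespec nap = { .tv_sec = 0, .tv_nsec = 5 * 1000 * 1000 };
    int rc;

    rc = getrusage(RUSAGE_SELF, &before);
    assert(rc == 0);

    /* The process enters the kernel here; the kernel, running on the
     * process's behalf, marks it as sleeping and calls the scheduler. */
    nanosleep(&nap, NULL);

    rc = getrusage(RUSAGE_SELF, &after);
    assert(rc == 0);
    (void)rc;

    return after.ru_nvcsw - before.ru_nvcsw;
}
```

On a typical Linux system the result is at least 1, because giving up the CPU inside the sleep counts as a voluntary switch. The exact number depends on the kernel and on whatever else the process does in between.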
I recommend this series of videos for more in-depth information: http://academicearth.org/courses/operating-systems-and-system-programming

Windows: how to spawn threads from (NDIS) kernel driver?

Which function is recommended to spawn a new thread within an NDIS5/6 context? I'm looking for something that is guaranteed to work at IRQL = PASSIVE (e.g. no BSODs out of nothing); a quick examination of the contents of ndis.h turned up nothing.
Also, the plan is to use the newly spawned thread to call the NdisFreeMemory* family. Will it cause any problems to free allocated but unused memory from a different thread?
Threading is outside the scope of NDIS. If you need to start a new thread, use the standard kernel routines (like PsCreateSystemThread). Note that timers and work items are usually sufficient for most miniport needs. It is unusual for an NDIS miniport to create its own thread, although I suppose there are valid cases where it might be a fair design.
It is OK to allocate memory on one thread and free it on another.

Why kernel code/thread executing in interrupt context cannot sleep?

I am reading following article by Robert Love
http://www.linuxjournal.com/article/6916
that says
"...Let's discuss the fact that work queues run in process context. This is in contrast to the other bottom-half mechanisms, which all run in interrupt context. Code running in interrupt context is unable to sleep, or block, because interrupt context does not have a backing process with which to reschedule. Therefore, because interrupt handlers are not associated with a process, there is nothing for the scheduler to put to sleep and, more importantly, nothing for the scheduler to wake up..."
I don't get it. AFAIK, the scheduler in the kernel is O(1), that is, implemented through a bitmap. So what stops the scheduler from putting the interrupt context to sleep, taking the next schedulable process, and passing it control?
So what stops the scheduler from putting the interrupt context to sleep, taking the next schedulable process, and passing it control?
The problem is that the interrupt context is not a process, and therefore cannot be put to sleep.
When an interrupt occurs, the processor saves the registers onto the stack and jumps to the start of the interrupt service routine. This means that when the interrupt handler is running, it is running in the context of the process that was executing when the interrupt occurred. The interrupt is executing on that process's stack, and when the interrupt handler completes, that process will resume executing.
If you tried to sleep or block inside an interrupt handler, you would wind up not only stopping the interrupt handler, but also the process it interrupted. This could be dangerous, as the interrupt handler has no way of knowing what the interrupted process was doing, or even if it is safe for that process to be suspended.
A simple scenario where things could go wrong would be a deadlock between the interrupt handler and the process it interrupts.
Process1 enters kernel mode.
Process1 acquires LockA.
Interrupt occurs.
ISR starts executing using Process1's stack.
ISR tries to acquire LockA.
ISR calls sleep to wait for LockA to be released.
At this point, you have a deadlock. Process1 can't resume execution until the ISR is done with its stack. But the ISR is blocked waiting for Process1 to release LockA.
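A signal handler gives a user-space analogy for this scenario, because it also runs on the stack of whatever code it interrupted. The sketch below is an analogy only, not kernel code, and the names lock_a, on_signal and run_deadlock_demo are invented. The handler only tries the lock and defers its work if the lock is busy; if it waited instead, the interrupted code could never run again to release the lock.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_flag lock_a = ATOMIC_FLAG_INIT;
static volatile sig_atomic_t handled_inline;
static volatile sig_atomic_t deferred;

/* Plays the role of the ISR: it runs on the interrupted code's stack. */
static void on_signal(int signo)
{
    (void)signo;
    if (atomic_flag_test_and_set(&lock_a)) {
        /* The interrupted code holds the lock. Waiting here would
         * never end, because that code cannot run until we return. */
        deferred = 1;
        return;
    }
    handled_inline = 1;
    atomic_flag_clear(&lock_a);
}

/* Returns 1 if the handler found the lock busy and deferred its work. */
int run_deadlock_demo(void)
{
    struct sigaction sa = { 0 };
    int rc;

    handled_inline = 0;
    deferred = 0;
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void)rc;

    atomic_flag_test_and_set(&lock_a);  /* "Process1 acquires LockA" */
    raise(SIGUSR1);                     /* "Interrupt occurs" */
    atomic_flag_clear(&lock_a);         /* reached only after the handler returns */

    return deferred && !handled_inline;
}
```

The try-and-defer pattern corresponds loosely to what kernel interrupt handlers do when they push work to a bottom half instead of waiting.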
I think it's a design idea.
Sure, you can design a system in which you can sleep in an interrupt, but apart from making the system complicated and hard to comprehend (there are many, many situations you have to take into account), that doesn't help with anything. So from a design point of view, declaring that interrupt handlers cannot sleep is very clear and easy to implement.
From Robert Love (a kernel hacker):
http://permalink.gmane.org/gmane.linux.kernel.kernelnewbies/1791
You cannot sleep in an interrupt handler because interrupts do not have
a backing process context, and thus there is nothing to reschedule back
into. In other words, interrupt handlers are not associated with a task,
so there is nothing to "put to sleep" and (more importantly) "nothing to
wake up". They must run atomically.
This is not unlike other operating systems. In most operating systems,
interrupts are not threaded. Bottom halves often are, however.
The reason the page fault handler can sleep is that it is invoked only
by code that is running in process context. Because the kernel's own
memory is not pagable, only user-space memory accesses can result in a
page fault. Thus, only a few certain places (such as calls to
copy_{to,from}_user()) can cause a page fault within the kernel. Those
places must all be made by code that can sleep (i.e., process context,
no locks, et cetera).
Because the thread switching infrastructure is unusable at that point. When servicing an interrupt, only stuff of higher priority can execute - See the Intel Software Developer's Manual on interrupt, task and processor priority. If you did allow another thread to execute (which you imply in your question that it would be easy to do), you wouldn't be able to let it do anything - if it caused a page fault, you'd have to use services in the kernel that are unusable while the interrupt is being serviced (see below for why).
Typically, your only goal in an interrupt routine is to get the device to stop interrupting and queue something at a lower interrupt level (in unix this is typically a non-interrupt level, but for Windows, it's dispatch, apc or passive level) to do the heavy lifting where you have access to more features of the kernel/os. See - Implementing a handler.
It's a property of how O/S's have to work, not something inherent in Linux. An interrupt routine can execute at any point so the state of what you interrupted is inconsistent. If you interrupted the thread scheduling code, its state is inconsistent so you can't be sure you can "sleep" and switch threads. Even if you protect the thread switching code from being interrupted, thread switching is a very high level feature of the O/S and if you protected everything it relies on, an interrupt becomes more of a suggestion than the imperative implied by its name.
So what stops the scheduler from putting the interrupt context to sleep, taking the next schedulable process, and passing it control?
Scheduling happens on timer interrupts. The basic rule is that only one interrupt can be open at a time, so if you go to sleep in the "got data from device X" interrupt, the timer interrupt cannot run to schedule it out.
Interrupts also happen many times and overlap. If you put the "got data" interrupt to sleep, and then get more data, what happens? It's confusing (and fragile) enough that the catch-all rule is: no sleeping in interrupts. You will do it wrong.
Disallowing an interrupt handler from blocking is a design choice. When some data is available on the device, the interrupt handler interrupts the current process, prepares the transfer of the data and re-enables the interrupt; until the handler re-enables the current interrupt, the device has to wait. We want to keep our I/O busy and our system responsive, so we had better not block the interrupt handler.
I don't think the "unstable states" are an essential reason. Processes, whether they are in user mode or kernel mode, should be aware that they may be interrupted. If some kernel-mode data structure is accessed by both the interrupt handler and the current process, and a race condition exists, then the current process should disable local interrupts and, on multi-processor architectures, also use spinlocks during the critical sections.
I also don't think that a blocked interrupt handler could not be woken up. When we say "block", we basically mean that the blocked process is waiting for some event/resource, so it links itself into a wait queue for that event/resource. Whenever the resource is released, the releasing process is responsible for waking up the waiting process(es).
However, the really annoying thing is that the process that was interrupted can do nothing during the blocking time; it did nothing wrong to deserve this punishment, which is unfair. And nobody can reliably predict the blocking time, so the innocent process has to wait for an unclear reason and for an unbounded time.
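The wait-and-wake pattern described in this paragraph has a close user-space relative in condition variables. The sketch below is an analogy under assumptions, not the kernel's wait-queue API, and the names waiter and run_waitqueue_demo are invented. One thread blocks until a resource is marked available, and the thread that releases the resource is the one that wakes it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t resource_free = PTHREAD_COND_INITIALIZER;
static bool resource_available;
static bool waiter_resumed;

/* The blocked party: it queues itself on the condition variable and
 * sleeps until somebody announces that the resource is available. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!resource_available)
        pthread_cond_wait(&resource_free, &lock);
    waiter_resumed = true;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Returns 1 once the waiter has been woken by the releasing thread. */
int run_waitqueue_demo(void)
{
    pthread_t thread;
    int rc;

    resource_available = false;
    waiter_resumed = false;
    rc = pthread_create(&thread, NULL, waiter, NULL);
    assert(rc == 0);
    (void)rc;

    /* The releasing party is responsible for the wake-up. */
    pthread_mutex_lock(&lock);
    resource_available = true;
    pthread_cond_signal(&resource_free);
    pthread_mutex_unlock(&lock);

    pthread_join(thread, NULL);
    return waiter_resumed;
}
```

The waiter here is a schedulable thread with its own stack, which is exactly what an interrupt handler running on a borrowed stack lacks.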
Even if you could put an ISR to sleep, you wouldn't want to do it. You want your ISRs to be as fast as possible to reduce the risk of missing subsequent interrupts.
The Linux kernel has two ways to allocate the interrupt stack. One is on the kernel stack of the interrupted process; the other is a dedicated interrupt stack per CPU. If the interrupt context is saved on the dedicated per-CPU interrupt stack, then the interrupt context is indeed not associated with any process at all. The "current" macro will produce an invalid pointer to the currently running process, since on some architectures "current" is computed from the stack pointer. The stack pointer in interrupt context may point to the dedicated interrupt stack, not to the kernel stack of some process.
By nature, the question is whether an interrupt handler can get a valid "current" (the address of the current process's task_struct). If it can, it is possible to modify the contents there to put the process into the "sleep" state, from which the scheduler can bring it back later if the state changes somehow. The answer may be hardware-dependent.
But on ARM it is impossible, since 'current' is unrelated to the process in interrupt mode. See the code below:
/* linux/arch/arm/include/asm/thread_info.h */
static inline struct thread_info *current_thread_info(void)
{
        register unsigned long sp asm ("sp");
        return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}
The sp values in USER mode and SVC mode are the "same". "Same" here does not mean they are equal: user mode's sp points to the user-space stack, while SVC mode's sp (r13_svc) points to the kernel stack, where the user process's task_struct was updated at the previous task switch. When a system call occurs and the process enters kernel space again, sp (sp_svc) is still unchanged, so the two sp values are associated with each other; in this sense they are the "same". So in SVC mode, kernel code can get a valid 'current'. But in other privileged modes, say interrupt mode, sp is "different": it points to a dedicated address defined in cpu_init(). The 'current' calculated in these modes is unrelated to the interrupted process, and accessing it will result in unexpected behavior. That's why it is always said that a system call can sleep but an interrupt handler can't: a system call works in process context, but an interrupt does not.
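The masking trick in current_thread_info() is plain address arithmetic and can be tried outside the kernel. In the sketch below, THREAD_SIZE is assumed to be 8192 bytes and the addresses in the usage note are made-up values; stack_base and same_stack are hypothetical helper names. It shows that any stack pointer inside one aligned stack block maps to the same base address, while a pointer on another stack maps elsewhere.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE 8192u   /* assumed stack size; must be a power of two */

static_assert((THREAD_SIZE & (THREAD_SIZE - 1)) == 0,
              "THREAD_SIZE must be a power of two");

/* Clears the low bits of sp, giving the start of the THREAD_SIZE-aligned
 * block that contains it. In the kernel code quoted above, that address
 * is where struct thread_info lives. */
static inline uintptr_t stack_base(uintptr_t sp)
{
    return sp & ~((uintptr_t)THREAD_SIZE - 1);
}

/* Returns 1 if both stack pointers fall inside the same stack block. */
int same_stack(uintptr_t sp1, uintptr_t sp2)
{
    return stack_base(sp1) == stack_base(sp2);
}
```

For example, stack_base(0x80004abc) is 0x80004000, and 0x80005ff0 lies in the same block. A pointer such as 0x80010f00, standing in for a dedicated interrupt stack, yields the base 0x80010000, which has nothing to do with the interrupted task.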
High-level interrupt handlers mask the operations of all lower-priority interrupts, including those of the system timer interrupt. Consequently, the interrupt handler must avoid involving itself in an activity that might cause it to sleep. If the handler sleeps, then the system may hang because the timer is masked and incapable of scheduling the sleeping thread.
Does this make sense?
If a higher-level interrupt routine gets to the point where the next thing it must do has to happen after a period of time, then it needs to put a request into the timer queue, asking that another interrupt routine be run (at lower priority level) some time later.
When that interrupt routine runs, it would then raise priority level back to the level of the original interrupt routine, and continue execution. This has the same effect as a sleep.
It is just a design/implementation choice in Linux. The advantage of this design is simplicity, but it may not be good for real-time OS requirements.
Other OSes have other designs/implementations.
For example, in Solaris, interrupts can have different priorities, which allows most device interrupts to be handled in interrupt threads. Interrupt threads are allowed to sleep because each interrupt thread has its own stack in the context of the thread.
The interrupt-thread design is good for real-time threads, which should have higher priorities than interrupts.

Resources