I am working on a kernel module where I need to be "aware" that a given process has crashed.
Right now my approach is to set up a periodic timer interrupt in the kernel module; on every timer interrupt, I check the task_struct.state and task_struct.exitstate values for that process.
I am wondering if there's a way to set up an interrupt in the kernel module that would go off when the process terminates, or, when the process receives a given signal (e.g., SIGINT or SIGHUP).
Thanks!
EDIT: A catch here is that I can't modify the user application. Or at least, it would be a much tougher sell to the customer if I place additional requirements/constraints on s/w from another vendor...
You could have your module create a character device node and then open that node from your userspace process. It's only about a dozen lines of boilerplate to register a simple cdev in your module. Your cdev's open method will get called when the process opens the device node and the release method will be called when the device node is closed. If a process exits, either intentionally or because of a signal, all open file descriptors are closed by the kernel. So you can be certain that release will be called. This avoids any need to poll the process status and you can avoid modifying any kernel code outside of your module.
You could also setup a watchdog style system, where your process must write one byte to the device every so often. Have the write method of the cdev reset a timer. If too much time passes without a write and the timer expires, it is assumed the process has somehow failed, even if it hasn't crashed and terminated. For instance a programming bug that allowed for a mutex deadlock or placed the process into an infinite loop.
There is a point in the kernel code where signals are delivered to user processes. You could patch that, check the process name, and signal a condition variable if it matches. This would just catch signals, not intentional process exits. IMHO, this is much uglier and you'll need to deal with maintaining a kernel patch. But it's not that hard, there's a single point, I don't recall what function, sorry, where one can insert the necessary code and it will catch all signals.
This is purely academic question, I don't really need to know this information for anything, but I would like to understand kernel a bit more :-)
According to kernel documentation http://www.tldp.org/LDP/tlk/kernel/processes.html processes in linux kernel have following states:
Running
The process is either running (it is the current process in the
system) or it is ready to run (it is waiting to be assigned to one of
the system's CPUs).
Waiting
The process is waiting for an event or for a resource. Linux
differentiates between two types of waiting process; interruptible and
uninterruptible. Interruptible waiting processes can be interrupted by
signals whereas uninterruptible waiting processes are waiting directly
on hardware conditions and cannot be interrupted under any
circumstances.
Stopped
The process has been stopped, usually by receiving a signal. A process
that is being debugged can be in a stopped state.
Zombie
This is a halted process which, for some reason, still has a
task_struct data structure in the task vector. It is what it sounds
like, a dead process.
As you can see, when I take a snapshot of processes state, using command like ps, I can see, if it's in Running state, that process either was literally Running or just waiting to be assigned to some CPU by kernel.
In my opinion, these 2 states (that are actually both represented by 1 state in task_struct) are quite different.
Why there is no state like "Ready" that would mean the process is "ready to run" but wasn't assigned to any CPU so far, so that the task_struct would be more clear about the real state? Is it even possible to retrieve this information, or is it secret for whatever reason which process is "literally running" on the CPU?
The struct task_struct contains a long to represent current state:
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
This simply indicates if a process is 'runnable'.
To see the currently executing process you should look at the runqueue. Specifically a struct rq (as defined in kernel/sched/sched.h) contains:
struct task_struct *curr, *idle, *stop;
The pointer *curr is the currently running process on this runqueue (there exists a runqueue per CPU).
You should consult files under kernel/sched/ to see how the Kernel determines which processes should be scheduled according to the different scheduling algorithms if you are interested in exactly how it arrives at the running state.
This is not a linux-kernel answer but a more general about scheduling ^^
A core part of any OS is the Scheduler: http://en.wikipedia.org/wiki/Process_scheduler
Many of them work giving every process a time slice of execution and letting each of them do a little bit of work before switching (referred as a context switch) to another process.
Since the length of a time slice is in the order of milliseconds by the time the information you requested is shown, the state has surely changed so differentiate between "Really Running" and "Ready-but-not-really-running" could result (most of the time) in inaccurate informations.
Suppose in a two process environment, one process is scheduled for execution by the kernel, and it demanded for some data which is not available in the RAM. So the cpu will indicate the kernel that something is not available and the process will be suspended. Then after kernel loads the second process for execution through the CPU and start investigating about the data in secondary memory location (say virtual memory) and gets it, puts it back to main memory by a swap to the memory data which is currently inactive, and puts the process back in the ready queue for execution.
We know that everything in computer system is get manipulated by CPU only and if CPU is busy executing continuously the process code then who is executing the kernel code to perform the tasks done by kernel?
Please let me know if i am able to explain the scenario.
At any point in time, CPU (/s) will be
Running a process in User Mode.
Running on behalf of a process in Kernel Mode to execute previleged instruction or access hardware (for example when system call read / write is issued).
Running in repsonse to a hardware interrupt. i.e. running in interrupt context. (Not associated with any process in particular) and yes in kernel mode.
Running some kernel threads to serve deferred work like soft irq. (Tasklet / Softirq)
Running CPU idle thread if nothing is there to execute.
If you are in particular asking about scheduling, then
Suppose a process is running and now it has issued a read call to retrieve data from hard disk, say, then process is removed from cpu and kernel invokes schedule() functions. So here, first process issues read system call, which results in switching from user mode to kernel mode. The kernel which is running on behalf of the process prepares for the hard disk read operation and then calls schedule() function
Suppose a hardware interrupt has come, then currently running process is removed, and interrupt service handler for that interrupt begins to execute in kernel mode (obviously).
Basically, kernel runs in between user processes !!
Clear now ?
Shash
The kernel runs either as a result of a hardware interrupt, or as a result of being invoked by a process to do something. In both cases the code which was executing at that moment stops running until the kernel finishes its job.
It is similar to a function call: when function A calls function B, function A has to wait until function B is done doing what it does, and returns control to function A. You do not need multiple CPUs, or any kind of magic to accomplish this.
The CPU is not continuously executing process code. The CPU is interrupted to perform various operations. Interrupts can occur for various reasons: a resource becomes available, a previous action completes, or simply a timer goes off.
I recommend this series of videos for more in-depth information: http://academicearth.org/courses/operating-systems-and-system-programming
I am reading following article by Robert Love
http://www.linuxjournal.com/article/6916
that says
"...Let's discuss the fact that work queues run in process context. This is in contrast to the other bottom-half mechanisms, which all run in interrupt context. Code running in interrupt context is unable to sleep, or block, because interrupt context does not have a backing process with which to reschedule. Therefore, because interrupt handlers are not associated with a process, there is nothing for the scheduler to put to sleep and, more importantly, nothing for the scheduler to wake up..."
I don't get it. AFAIK, scheduler in the kernel is O(1), that is implemented through the bitmap. So what stops the scehduler from putting interrupt context to sleep and taking next schedulable process and passing it the control?
So what stops the scehduler from putting interrupt context to sleep and taking next schedulable process and passing it the control?
The problem is that the interrupt context is not a process, and therefore cannot be put to sleep.
When an interrupt occurs, the processor saves the registers onto the stack and jumps to the start of the interrupt service routine. This means that when the interrupt handler is running, it is running in the context of the process that was executing when the interrupt occurred. The interrupt is executing on that process's stack, and when the interrupt handler completes, that process will resume executing.
If you tried to sleep or block inside an interrupt handler, you would wind up not only stopping the interrupt handler, but also the process it interrupted. This could be dangerous, as the interrupt handler has no way of knowing what the interrupted process was doing, or even if it is safe for that process to be suspended.
A simple scenario where things could go wrong would be a deadlock between the interrupt handler and the process it interrupts.
Process1 enters kernel mode.
Process1 acquires LockA.
Interrupt occurs.
ISR starts executing using Process1's stack.
ISR tries to acquire LockA.
ISR calls sleep to wait for LockA to be released.
At this point, you have a deadlock. Process1 can't resume execution until the ISR is done with its stack. But the ISR is blocked waiting for Process1 to release LockA.
I think it's a design idea.
Sure, you can design a system that you can sleep in interrupt, but except to make to the system hard to comprehend and complicated(many many situation you have to take into account), that's does not help anything. So from a design view, declare interrupt handler as can not sleep is very clear and easy to implement.
From Robert Love (a kernel hacker):
http://permalink.gmane.org/gmane.linux.kernel.kernelnewbies/1791
You cannot sleep in an interrupt handler because interrupts do not have
a backing process context, and thus there is nothing to reschedule back
into. In other words, interrupt handlers are not associated with a task,
so there is nothing to "put to sleep" and (more importantly) "nothing to
wake up". They must run atomically.
This is not unlike other operating systems. In most operating systems,
interrupts are not threaded. Bottom halves often are, however.
The reason the page fault handler can sleep is that it is invoked only
by code that is running in process context. Because the kernel's own
memory is not pagable, only user-space memory accesses can result in a
page fault. Thus, only a few certain places (such as calls to
copy_{to,from}_user()) can cause a page fault within the kernel. Those
places must all be made by code that can sleep (i.e., process context,
no locks, et cetera).
Because the thread switching infrastructure is unusable at that point. When servicing an interrupt, only stuff of higher priority can execute - See the Intel Software Developer's Manual on interrupt, task and processor priority. If you did allow another thread to execute (which you imply in your question that it would be easy to do), you wouldn't be able to let it do anything - if it caused a page fault, you'd have to use services in the kernel that are unusable while the interrupt is being serviced (see below for why).
Typically, your only goal in an interrupt routine is to get the device to stop interrupting and queue something at a lower interrupt level (in unix this is typically a non-interrupt level, but for Windows, it's dispatch, apc or passive level) to do the heavy lifting where you have access to more features of the kernel/os. See - Implementing a handler.
It's a property of how O/S's have to work, not something inherent in Linux. An interrupt routine can execute at any point so the state of what you interrupted is inconsistent. If you interrupted the thread scheduling code, its state is inconsistent so you can't be sure you can "sleep" and switch threads. Even if you protect the thread switching code from being interrupted, thread switching is a very high level feature of the O/S and if you protected everything it relies on, an interrupt becomes more of a suggestion than the imperative implied by its name.
So what stops the scehduler from putting interrupt context to sleep and taking next schedulable process and passing it the control?
Scheduling happens on timer interrupts. The basic rule is that only one interrupt can be open at a time, so if you go to sleep in the "got data from device X" interrupt, the timer interrupt cannot run to schedule it out.
Interrupts also happen many times and overlap. If you put the "got data" interrupt to sleep, and then get more data, what happens? It's confusing (and fragile) enough that the catch-all rule is: no sleeping in interrupts. You will do it wrong.
Disallowing an interrupt handler to block is a design choice. When some data is on the device, the interrupt handler intercepts the current process, prepares the transfer of the data and enables the interrupt; before the handler enables the current interrupt, the device has to hang. We want keep our I/O busy and our system responsive, then we had better not block the interrupt handler.
I don't think the "unstable states" are an essential reason. Processes, no matter they are in user-mode or kernel-mode, should be aware that they may be interrupted by interrupts. If some kernel-mode data structure will be accessed by both interrupt handler and the current process, and race condition exists, then the current process should disable local interrupts, and moreover for multi-processor architectures, spinlocks should be used to during the critical sections.
I also don't think if the interrupt handler were blocked, it cannot be waken up. When we say "block", basically it means that the blocked process is waiting for some event/resource, so it links itself into some wait-queue for that event/resource. Whenever the resource is released, the releasing process is responsible for waking up the waiting process(es).
However, the really annoying thing is that the blocked process can do nothing during the blocking time; it did nothing wrong for this punishment, which is unfair. And nobody could surely predict the blocking time, so the innocent process has to wait for unclear reason and for unlimited time.
Even if you could put an ISR to sleep, you wouldn't want to do it. You want your ISRs to be as fast as possible to reduce the risk of missing subsequent interrupts.
The linux kernel has two ways to allocate interrupt stack. One is on the kernel stack of the interrupted process, the other is a dedicated interrupt stack per CPU. If the interrupt context is saved on the dedicated interrupt stack per CPU, then indeed the interrupt context is completely not associated with any process. The "current" macro will produce an invalid pointer to current running process, since the "current" macro with some architecture are computed with the stack pointer. The stack pointer in the interrupt context may point to the dedicated interrupt stack, not the kernel stack of some process.
By nature, the question is whether in interrupt handler you can get a valid "current" (address to the current process task_structure), if yes, it's possible to modify the content there accordingly to make it into "sleep" state, which can be back by scheduler later if the state get changed somehow. The answer may be hardware-dependent.
But in ARM, it's impossible since 'current' is irrelevant to process under interrupt mode. See the code below:
#linux/arch/arm/include/asm/thread_info.h
94 static inline struct thread_info *current_thread_info(void)
95 {
96 register unsigned long sp asm ("sp");
97 return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
98 }
sp in USER mode and SVC mode are the "same" ("same" here not mean they're equal, instead, user mode's sp point to user space stack, while svc mode's sp r13_svc point to the kernel stack, where the user process's task_structure was updated at previous task switch, When a system call occurs, the process enter kernel space again, when the sp (sp_svc) is still not changed, these 2 sp are associated with each other, in this sense, they're 'same'), So under SVC mode, kernel code can get the valid 'current'. But under other privileged modes, say interrupt mode, sp is 'different', point to dedicated address defined in cpu_init(). The 'current' calculated under these mode will be irrelevant to the interrupted process, accessing it will result in unexpected behaviors. That's why it's always said that system call can sleep but interrupt handler can't, system call works on process context but interrupt not.
High-level interrupt handlers mask the operations of all lower-priority interrupts, including those of the system timer interrupt. Consequently, the interrupt handler must avoid involving itself in an activity that might cause it to sleep. If the handler sleeps, then the system may hang because the timer is masked and incapable of scheduling the sleeping thread.
Does this make sense?
If a higher-level interrupt routine gets to the point where the next thing it must do has to happen after a period of time, then it needs to put a request into the timer queue, asking that another interrupt routine be run (at lower priority level) some time later.
When that interrupt routine runs, it would then raise priority level back to the level of the original interrupt routine, and continue execution. This has the same effect as a sleep.
It is just a design/implementation choices in Linux OS. The advantage of this design is simple, but it may not be good for real time OS requirements.
Other OSes have other designs/implementations.
For example, in Solaris, the interrupts could have different priorities, that allows most of devices interrupts are invoked in interrupt threads. The interrupt threads allows sleep because each of interrupt threads has separate stack in the context of the thread.
The interrupt threads design is good for real time threads which should have higher priorities than interrupts.
Do we have any sort of relationship between fork() and CreateThread? Is there anything that
CreateThread internally calls fork()?
In NT, the fundamental working unit is called a thread (ie NT schedules threads, not processes.). User threads run in the context of a process. When you call CreateThread, you request the NT kernel to allocate a working unit within the context of your process (you also have fibres that are basically threads you can schedule yourself but that's beyond the topic of your question).
When you call CreateThread you provide the function with an entry point that is going to be run after the function is called. The code must be within the virtual space of the process and the page must have execution rights. Put simply, you give a function pointer. ;)
fork() is an UNIX function that requests the kernel to create copy of the running process. The parent process gets the pid of the child process and the child process gets 0 (this way you know who you are).
If you wish to create a process in Windows, you call the CreateProcess function, but that doesn't behave like fork(). The reason being that most of the time you will create threads, not processes.
As you can see, there is no relation between CreateThread and fork.
fork() only exists on Unix systems and it creates a new process with the same state as the caller. CreateThread() creates a new thread in the same process.
The Windows and Unix process model is fundamentally very different, so there is no way of directly mapping the API from one on top of the other.
fork() clones the current process into two. In the parent process, fork() returns the pid, and in the child it returns 0. This is typically used like this:
int pid;
if (pid = fork()) {
// this code is executed in the parent
} else {
// this code is executed in the child
}
Cygwin is an emulation layer for building and running Unix applications on Windows which emulates the behavior of fork() using CreateProcess().
CreateThread - is for threads, fork - is for creating duplicate process. And there is no native way to have fork functionality for windows (at least through Win32 ).
You might want to know Microsoft provides fork() in high-end versions of Windows with component called Subsystem for UNIX-based Applications (SUA). You can find details in my answer here.
Found this link which i believe could be helpful in clearing few facts regarding forking/threading.
Sharing over here: http://www.geekride.com/index.php/2010/01/fork-forking-vs-threading-thread-linux-kernel/