In what order does a context switch to the kernel occur - linux-kernel

Out of these three steps, is this the right order, or do I need to switch any?
1) Save current state data
2) Turn on kernel mode
3) Determine cause of interrupt

Let me try to help you figure out the correct order.
Only the kernel can switch a context as only the kernel has access to the necessary data and can for example change the page tables for the other process' address space.
To determine whether to do a context switch or not, the kernel needs to analyse some "inputs". A context switch might be done for example because the timer interrupt fired and the time slice of a process is over or because the process started doing some IO.
Only the kernel can save the state of a user process, because a user process would change its own state while trying to store it. The kernel, however, knows that whenever it is running, the user process is currently suspended (e.g. because of an interrupt, or because the user-space process voluntarily entered the kernel, e.g. for a system call).

The current context of a process is saved first partly by the hardware (the processor) and the rest by the software (the kernel).
Control is then transferred from the user process to the kernel: on 32-bit x86, the hardware loads the kernel stack pointer (ss0:esp0) from the Task State Segment (TSS) and the new cs:eip from the interrupt gate.
Then, based on the interrupt or trap number, the request is dispatched to the appropriate handler.
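The last step, dispatching on the interrupt or trap number, can be pictured as a table lookup. The following is only a user-space sketch of that idea, not kernel code: the table size, the vector numbers 0x20 and 0x80, and all names (vector_table, dispatch, the handler functions) are assumptions chosen for illustration.

```c
#include <stddef.h>

#define NUM_VECTORS 256            /* assumed table size */

/* Hypothetical result codes so the caller can see which handler ran. */
enum handled { HANDLED_UNKNOWN = 0, HANDLED_TIMER, HANDLED_SYSCALL };

typedef enum handled (*handler_fn)(void);

static enum handled handle_timer(void)   { return HANDLED_TIMER; }
static enum handled handle_syscall(void) { return HANDLED_SYSCALL; }
static enum handled handle_unknown(void) { return HANDLED_UNKNOWN; }

static handler_fn vector_table[NUM_VECTORS];

/* Fill every slot with a default, then register the two example vectors. */
static void init_vectors(void)
{
    for (size_t i = 0; i < NUM_VECTORS; i++)
        vector_table[i] = handle_unknown;
    vector_table[0x20] = handle_timer;     /* assumed timer vector */
    vector_table[0x80] = handle_syscall;   /* assumed system-call vector */
}

/* Look up the handler by vector number and call it. */
static enum handled dispatch(unsigned int vector)
{
    if (vector >= NUM_VECTORS)
        return handle_unknown();
    return vector_table[vector]();
}
```

After init_vectors(), dispatch(0x20) runs the timer handler and any unregistered vector falls through to the default handler.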

Related

Silently discard writes to mmap region

I have a Linux device driver which allows a userspace process to mmap() certain regions of the device's MMIO space for writing. The device may at some point decide to revoke access to the region, and will notify the driver when this happens. The driver (asynchronously) notifies the userspace process to stop using this region.
I'd like the driver to immediately zap the PTEs for this mapping so they can be returned to device control, however, the userspace process might still be finishing a write. I'd like to simply discard these writes. The user does not need to know which writes made it to the device and which writes were discarded. What can the driver's fault handler do after zapping the PTEs that can discard writes to the region harmlessly?
For the userspace process to make progress, the PTE needs to end up pointing to a writeable page.
If you don't want it writing to your device MMIO region, this implies you'll need to allocate a page of normal memory for the write to go to, just like the fault handler does for an anonymous VMA.
Alternatively, you could let your userspace task take a SIGBUS when this revocation event occurs, and just specify that a task using this device should expect this to happen and must install a SIGBUS handler that uses longjmp() to cancel its attempt to write to the device. The downside of this approach - apart from the additional complexity it dumps onto userspace - is that it makes using your device difficult from a library, as signal handlers are process-global state.
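A rough sketch of what the SIGBUS approach could look like on the userspace side, assuming a single-threaded caller. The names guarded_write, install_sigbus_handler and the simulate_revoke flag are made up for this example, and raise(SIGBUS) only stands in for the fault a real store to a revoked mapping would take.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf revoke_jmp;
static volatile sig_atomic_t guard_active;

/* Jump back into guarded_write() if the fault happened inside it. */
static void on_sigbus(int sig)
{
    (void)sig;
    if (guard_active)
        siglongjmp(revoke_jmp, 1);
    /* Not our write: restore the default action and re-raise. */
    signal(SIGBUS, SIG_DFL);
    raise(SIGBUS);
}

static int install_sigbus_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Returns 0 if the store completed, -1 if it was cancelled by SIGBUS. */
static int guarded_write(volatile unsigned int *reg, unsigned int value,
                         int simulate_revoke)
{
    /* savemask = 1 so SIGBUS is unblocked again after the jump. */
    if (sigsetjmp(revoke_jmp, 1) != 0) {
        guard_active = 0;
        return -1;                 /* write discarded */
    }
    guard_active = 1;
    if (simulate_revoke)
        raise(SIGBUS);             /* stand-in for the faulting store */
    *reg = value;
    guard_active = 0;
    return 0;
}
```

The static jump buffer is exactly the process-global state mentioned above, which is why this pattern is awkward to hide inside a library.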

context switching in an operating system

Good evening everyone,
I would like to know what happens if, during a context switch, the new context is already in one of the registers, or if it is in memory and all the registers are occupied.
Basically, a context switch is a way of saving the current state of the machine and replacing it with a new one. Steps are vaguely like this:
1) enter privileged mode, where the CPU will have access to system/kernel memory
2) save old program counter (now we know where we were when the task-switch event happened - maybe a system call, maybe an interrupt; basically the running process was forced to yield control)
3) save current register state (either on the stack, or in a specific set of OS-allocated-and-managed memory)
4) save the stack pointer (if the architecture has one)
5) save memory information for the task being suspended by marking all the pages used by this process as eligible for eviction (if the next task or the OS needs the main memory that the old process was using, that will be copied out to page storage and then memory-mapped into the correct address space; if not, they may hang around and be available when the task regains control)
It is now safe for the OS to do anything it pleases, as the transient state of the old process is saved, and its memory is safe. Maybe it handles an interrupt, or executes a system call. We'll skip all that and just do a task switch.
6) set up memory for new task (map main memory to the new process's virtual memory; some may be in main memory already, if there's not a lot of memory in use, or it may have been paged out to external storage, in which case it will be loaded via a "page fault" when the program tries to reference it - the program will suspend in the same way as above, the OS will read in the memory block, and the process will be resumed by the OS)
7) load register state from the new process's OS control block or stack
8) load the stack pointer if required
9) exit privileged mode
10) branch to the last suspend program counter or entry point for new task
The key point is that the OS is in charge of preserving state; it manages this process appropriately for the CPU architecture. Registers are not "busy" because the task switch process saves them and restores them. The process which lost control and then regained it has no idea that it ever lost control; its world state is saved and restored seamlessly.
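To make the "registers are never busy" point concrete, here is a toy model of the save/restore step. It is only a simulation under simplifying assumptions: struct cpu_state, struct task and context_switch are invented names, the register file is four words, and no real privilege or memory management is involved.

```c
#include <stdint.h>

#define NUM_GP_REGS 4              /* assumed, tiny register file */

/* The state that lives in the (simulated) CPU. */
struct cpu_state {
    uint64_t pc;
    uint64_t sp;
    uint64_t gp[NUM_GP_REGS];
};

/* Per-task control block holding the state while the task is not running. */
struct task {
    int id;
    struct cpu_state saved;
};

/*
 * Save the live CPU state into the outgoing task, then load the incoming
 * task's saved state into the CPU. Whatever was in the registers before
 * is preserved in prev->saved, so nothing is lost.
 */
static void context_switch(struct cpu_state *cpu, struct task *prev,
                           const struct task *next)
{
    prev->saved = *cpu;
    *cpu = next->saved;
}
```

Switching from task A to task B and back leaves A's registers exactly as they were, no matter what B did to the CPU in between.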

Interrupt a kernel module when a user process terminates/receives a signal?

I am working on a kernel module where I need to be "aware" that a given process has crashed.
Right now my approach is to set up a periodic timer interrupt in the kernel module; on every timer interrupt, I check the task_struct.state and task_struct.exit_state values for that process.
I am wondering if there's a way to set up an interrupt in the kernel module that would go off when the process terminates, or, when the process receives a given signal (e.g., SIGINT or SIGHUP).
Thanks!
EDIT: A catch here is that I can't modify the user application. Or at least, it would be a much tougher sell to the customer if I place additional requirements/constraints on s/w from another vendor...
You could have your module create a character device node and then open that node from your userspace process. It's only about a dozen lines of boilerplate to register a simple cdev in your module. Your cdev's open method will get called when the process opens the device node and the release method will be called when the device node is closed. If a process exits, either intentionally or because of a signal, all open file descriptors are closed by the kernel. So you can be certain that release will be called. This avoids any need to poll the process status and you can avoid modifying any kernel code outside of your module.
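The property this relies on (the kernel closes every open file descriptor when a process dies, even on SIGKILL) can be observed from user space. The sketch below is only an analogue, not module code: a pipe stands in for the character device, end-of-file on the read side stands in for the release callback, and peer_exit_detected is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the parent observed the child's descriptor being closed. */
static int peer_exit_detected(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: hold the write end open and never close it voluntarily. */
        close(fds[0]);
        for (;;)
            pause();
    }

    /* Parent: drop its own write end so only the child's copy remains. */
    close(fds[1]);
    kill(pid, SIGKILL);            /* the child gets no chance to clean up */

    char byte;
    ssize_t n = read(fds[0], &byte, 1);   /* returns 0 once the kernel
                                             has closed the child's end */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 0 ? 1 : 0;
}
```

In the module, the same event would arrive as a call to the cdev's release method instead of an end-of-file.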
You could also set up a watchdog-style system, where your process must write one byte to the device every so often. Have the write method of the cdev reset a timer. If too much time passes without a write and the timer expires, the process is assumed to have failed somehow, even if it hasn't crashed and terminated, for instance because of a programming bug that caused a mutex deadlock or put the process into an infinite loop.
There is a point in the kernel code where signals are delivered to user processes. You could patch that, check the process name, and signal a condition variable if it matches. This would just catch signals, not intentional process exits. IMHO, this is much uglier and you'll need to deal with maintaining a kernel patch. But it's not that hard, there's a single point, I don't recall what function, sorry, where one can insert the necessary code and it will catch all signals.

How do I write to a __user memory from within the top half of an interrupt handler?

I am working on a proprietary device driver. The driver is implemented as a kernel module. This module is then coupled with a user-space process.
It is essential that each time the device generates an interrupt, the driver updates a set of counters directly in the address space of the user-space process from within the top half of the interrupt handler. The driver knows the PID and the task_struct of the user-process and is also aware of the virtual address where the counters lie in the user-process context. However, I am having trouble in figuring out how code running in the interrupt context could take up the mm context of the user-process and write to it. Let me sum up what I need to do:
Get the address of the physical page and offset corresponding to the virtual address of the counters in the context of the user-process.
Set up mappings in the page table and write to the physical page corresponding to the counter.
For this, I have tried the following:
Try to take up the mm context of the user-task, like below:
use_mm(tsk->mm);
/* write to counters. */
unuse_mm(tsk->mm);
This apparently causes the entire system to hang.
Wait for the interrupt to occur when our user-process was the current process. Then use copy_to_user().
I'm not much of an expert on kernel programming. If there's a good way to do this, please do advise and thank you in advance.
Your driver should be the one that maps kernel memory into the user-space process. E.g., you may implement the .mmap callback of struct file_operations for your device.
The kernel driver may write to a kernel address that it has mapped at any time (even in an interrupt handler). The user-space process will immediately see all modifications on its side of the mapping (using the address obtained with the mmap() system call).
Unix's architecture frowns on interrupt routines accessing user space because a process could (in theory) be swapped out when the interrupt occurs. If the process is running on another CPU, that could be a problem, too.
I suggest that you write an ioctl to synchronize the counters, and then have the process call that ioctl every time it needs to access the counters.
Outside of an interrupt context, your driver will need to check the user memory is accessible (using access_ok), and pin the user memory using get_user_pages or get_user_pages_fast (after determining the page offset of the start of the region to be pinned, and the number of pages spanned by the region to be pinned, including page alignment at both ends). It will also need to map the list of pages to kernel address space using vmap. The return address from vmap, plus the offset of the start of the region within its page, will give you an address that your interrupt handler can access.
At some point, you will want to terminate access to the user memory, which will involve ensuring that your interrupt routine no longer accesses it, a call to vunmap (passing the pointer returned by vmap), and a sequence of calls to put_page for each of the pages pinned by get_user_pages or get_user_pages_fast.
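The page offset and page count mentioned above are plain arithmetic. A small user-space sketch follows, assuming a 4096-byte page; SKETCH_PAGE_SIZE, struct pin_range and compute_pin_range are names invented here, and in a real driver the kernel's own PAGE_SIZE and PAGE_MASK would be used instead.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u     /* assumed page size */

struct pin_range {
    uintptr_t first_page;          /* page-aligned start address */
    size_t    offset;              /* offset of the region in its first page */
    size_t    nr_pages;            /* pages spanned, counting partial ones */
};

/* Work out what would have to be pinned for [uaddr, uaddr + len). */
static struct pin_range compute_pin_range(uintptr_t uaddr, size_t len)
{
    struct pin_range r;

    r.offset = uaddr & (SKETCH_PAGE_SIZE - 1);
    r.first_page = uaddr - r.offset;
    /* Round the end of the region up to the next page boundary. */
    r.nr_pages = (r.offset + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    return r;
}
```

For example, a 16-byte region starting at 0x1ff8 straddles a page boundary, so two pages have to be pinned and the counters start 0xff8 bytes into the mapped area.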
I don't think what you are trying to do is possible. Consider this situation:
(assuming how your device works)
Some function allocates the user-space memory for the counters and supplies its address in PROCESS X.
A switch occurs and PROCESS Y executes.
Your device interrupts.
The address for your counters is inaccessible.
You need to schedule a kernel-mode asynchronous event (bottom half) that will execute when PROCESS X is executing.

use of spin variants in network processing

I have written a kernel module that interacts with netfilter hooks.
The netfilter hooks operate in softirq context.
I am accessing a global data structure, a hash table, from softirq context as well as from process context. The process-context access is due to a sysctl file being used to modify the contents of the hash table.
I am using spin_lock_irqsave.
Is this choice of spinlock API correct, in terms of performance and locking standards?
What would happen if an interrupt is scheduled on another processor while the lock is already held by process-context code on the current processor?
Firstly:
So, with all the above details I concluded that my softirqs can run concurrently on both cores.
Yes, this is correct. Your softirq handler may be executed "simultaneously on more than one CPU".
Your conclusion to use spinlocks sounds correct to me. However, this assumes that the critical section (i.e., the code executed with the spinlock held) has the following properties:
It must not sleep (for example, acquire a blocking mutex)
It should be as short as possible
Generally, if you're just updating your hash table, you should be fine here.
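If an IRQ handler on another CPU tries to acquire a spinlock that is held by process context, that's fine. As long as your process context does not sleep with that lock held, the lock should be released within a short amount of time, allowing the IRQ handler to make forward progress. On the same CPU the situation cannot arise as long as the process context takes the lock with interrupts disabled, which is what spin_lock_irqsave does.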
I think the solution is appropriate. Softirqs run with preemption disabled anyway. To share data with a softirq, the process context must also disable preemption and softirqs (disabling interrupts, as spin_lock_irqsave does, covers this). A timer that only decrements an entry's time stamp can do so atomically, i.e. the time stamp variable must be atomic. If a softirq running on another core wants to acquire the spinlock while it is already held on the first core, it must wait.
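The "must wait" behaviour can be modelled in user space with a C11 atomic flag. This is only a toy model of a spinlock-protected table, not the kernel's spin_lock API: struct sketch_table and the table_* functions are invented for the example, and it says nothing about disabling interrupts or softirqs.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_BUCKETS 16             /* assumed table size */

struct sketch_table {
    atomic_flag  lock;             /* set while someone holds the lock */
    unsigned int hits[NUM_BUCKETS];
};

/* Try once; returns true if the lock was free and is now ours. */
static bool table_trylock(struct sketch_table *t)
{
    return !atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire);
}

/* Spin until the current holder releases the lock. */
static void table_lock(struct sketch_table *t)
{
    while (!table_trylock(t))
        ;                          /* busy-wait, never sleep */
}

static void table_unlock(struct sketch_table *t)
{
    atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Short critical section: one bucket update, no blocking calls. */
static void table_hit(struct sketch_table *t, unsigned int key)
{
    table_lock(t);
    t->hits[key % NUM_BUCKETS]++;
    table_unlock(t);
}
```

A second table_trylock() while the lock is held fails, which is the point at which a softirq on another core would keep spinning until the holder calls table_unlock().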

Resources