How are kernel states duplicated on a fork? - linux-kernel

Suppose I have a character device driver in Linux that allocates some memory in the kernel to store some state against every open file descriptor.
Some process opens a fd on the driver and through some ioctls the process also has provided initialization parameters for this state.
Now the process forks. All the file descriptors will also be created for the child process.
How will the fd specific state be duplicated? AFAIK do_fork only duplicates the data structure the kernel knows about.
Will the child process have to re-initialize the fd or it will end up sharing the state with the parent process?

No open file description state is duplicated on fork or dup. All such state will be shared between parent and child.

Related

Interrupt a kernel module when a user process terminates/receives a signal?

I am working on a kernel module where I need to be "aware" that a given process has crashed.
Right now my approach is to set up a periodic timer interrupt in the kernel module; on every timer interrupt, I check the task_struct.state and task_struct.exitstate values for that process.
I am wondering if there's a way to set up an interrupt in the kernel module that would go off when the process terminates, or, when the process receives a given signal (e.g., SIGINT or SIGHUP).
Thanks!
EDIT: A catch here is that I can't modify the user application. Or at least, it would be a much tougher sell to the customer if I place additional requirements/constraints on s/w from another vendor...
You could have your module create a character device node and then open that node from your userspace process. It's only about a dozen lines of boilerplate to register a simple cdev in your module. Your cdev's open method will get called when the process opens the device node and the release method will be called when the device node is closed. If a process exits, either intentionally or because of a signal, all open file descriptors are closed by the kernel. So you can be certain that release will be called. This avoids any need to poll the process status and you can avoid modifying any kernel code outside of your module.
You could also setup a watchdog style system, where your process must write one byte to the device every so often. Have the write method of the cdev reset a timer. If too much time passes without a write and the timer expires, it is assumed the process has somehow failed, even if it hasn't crashed and terminated. For instance a programming bug that allowed for a mutex deadlock or placed the process into an infinite loop.
There is a point in the kernel code where signals are delivered to user processes. You could patch that, check the process name, and signal a condition variable if it matches. This would just catch signals, not intentional process exits. IMHO, this is much uglier and you'll need to deal with maintaining a kernel patch. But it's not that hard, there's a single point, I don't recall what function, sorry, where one can insert the necessary code and it will catch all signals.

In what order does a context switch to the kernel occur

Out of these three steps, is this the right order, or do I need to switch any?
1) Save current state data
2) Turn on kernel mode
3) Determine cause of interrupt
So, let me try to help you figuring out the correct order.
Only the kernel can switch a context as only the kernel has access to the necessary data and can for example change the page tables for the other process' address space.
To determine whether to do a context switch or not, the kernel needs to analyse some "inputs". A context switch might be done for example because the timer interrupt fired and the time slice of a process is over or because the process started doing some IO.
Only the kernel can save the state of a user process because a user process would change its state when it would try storing it. The kernel however knows that if its running, the user process is currently interrupted (eg because of an interrupt or because the user space process voluntarily entered the kernel eg for a system call)
The current context of a process is first saved partly by the hardware(processor) and rest by the software(kernel).
Then the control is transferred from the user process to the kernel by loading the new eip, esp and other saved context of kernel is loaded by hardware from Task State Segment(TSS).
Then based on the interrupt or trap no. the request is dispatched to the appropriate handler.

how does linux kernel implement shared memory between 2 processes

How does the Linux kernel implement the shared memory mechanism between different processes?
To elaborate further, each process has its own address space. For example, an address of 0x1000 in Process A is a different location when compared to an address of 0x1000 in Process B.
So how does the kernel ensure that a piece of memory is shared between different process, having different address spaces?
Thanks in advance.
Interprocess Communication Mechanisms
Processes communicate with each other and with the kernel to coordinate their activities. Linux supports a number of Inter-Process Communication (IPC) mechanisms. Signals and pipes are two of them but Linux also supports the System V IPC mechanisms named after the Unix TM release in which they first appeared.
Signals
Signals are one of the oldest inter-process communication methods used by Unix TM systems. They are used to signal asynchronous events to one or more processes. A signal could be generated by a keyboard interrupt or an error condition such as the process attempting to access a non-existent location in its virtual memory. Signals are also used by the shells to signal job control commands to their child processes.
There are a set of defined signals that the kernel can generate or that can be generated by other processes in the system, provided that they have the correct privileges. You can list a system's set of signals using the kill command (kill -l).
Pipes
The common Linux shells all allow redirection. For example
$ ls | pr | lpr
pipes the output from the ls command listing the directory's files into the standard input of the pr command which paginates them. Finally the standard output from the pr command is piped into the standard input of the lpr command which prints the results on the default printer. Pipes then are unidirectional byte streams which connect the standard output from one process into the standard input of another process. Neither process is aware of this redirection and behaves just as it would normally. It is the shell which sets up these temporary pipes between the processes.
In Linux, a pipe is implemented using two file data structures which both point at the same temporary VFS inode which itself points at a physical page within memory. Figure shows that each file data structure contains pointers to different file operation routine vectors; one for writing to the pipe, the other for reading from the pipe.
Sockets
Message Queues: Message queues allow one or more processes to write messages, which will be read by one or more reading processes. Linux maintains a list of message queues, the msgque vector; each element of which points to a msqid_ds data structure that fully describes the message queue. When message queues are created a new msqid_ds data structure is allocated from system memory and inserted into the vector.
System V IPC Mechanisms: Linux supports three types of interprocess communication mechanisms that first appeared in Unix TM System V (1983). These are message queues, semaphores and shared memory. These System V IPC mechanisms all share common authentication methods. Processes may access these resources only by passing a unique reference identifier to the kernel via system calls. Access to these System V IPC objects is checked using access permissions, much like accesses to files are checked. The access rights to the System V IPC object is set by the creator of the object via system calls. The object's reference identifier is used by each mechanism as an index into a table of resources. It is not a straight forward index but requires some manipulation to generate the index.
Semaphores: In its simplest form a semaphore is a location in memory whose value can be tested and set by more than one process. The test and set operation is, so far as each process is concerned, uninterruptible or atomic; once started nothing can stop it. The result of the test and set operation is the addition of the current value of the semaphore and the set value, which can be positive or negative. Depending on the result of the test and set operation one process may have to sleep until the semphore's value is changed by another process. Semaphores can be used to implement critical regions, areas of critical code that only one process at a time should be executing.
Say you had many cooperating processes reading records from and writing records to a single data file. You would want that file access to be strictly coordinated. You could use a semaphore with an initial value of 1 and, around the file operating code, put two semaphore operations, the first to test and decrement the semaphore's value and the second to test and increment it. The first process to access the file would try to decrement the semaphore's value and it would succeed, the semaphore's value now being 0. This process can now go ahead and use the data file but if another process wishing to use it now tries to decrement the semaphore's value it would fail as the result would be -1. That process will be suspended until the first process has finished with the data file. When the first process has finished with the data file it will increment the semaphore's value, making it 1 again. Now the waiting process can be woken and this time its attempt to increment the semaphore will succeed.
Shared Memory: Shared memory allows one or more processes to communicate via memory that appears in all of their virtual address spaces. The pages of the virtual memory is referenced by page table entries in each of the sharing processes' page tables. It does not have to be at the same address in all of the processes' virtual memory. As with all System V IPC objects, access to shared memory areas is controlled via keys and access rights checking. Once the memory is being shared, there are no checks on how the processes are using it. They must rely on other mechanisms, for example System V semaphores, to synchronize access to the memory.
Quoted from tldp.org.
There are two kinds of shared memory in Linux.
If A and B are Parent process and Child process respectively, each of them uses their own pte to access the shared memory.The shared memory is shared by the fork mechanism. So every thing is good, right?(More details, please look at the kernel function copy_one_pte() and related functions.)
If A and B are not parent and Child, they use the a public key to access the shared memory.
Let's assume that A creates a shared memory though System V shmget() with a key and, correspondly, the kernel creates a file(file name is "SYSTEMV+key") for process A in a shmem/tmpfs which is an internal RAM-based filesystem. It's mounted by the kenrel(Check shmem_init()). And the shared memory region is handled by shmem/tmpfs. Basically, it's handled by the page fault mechanism when process A accesses the shared memory region.
If process B wants to access that shared memory region created by process A. Process B should use shmget() with the same key used by Process A. Then process B can find the file("SYSTEMV+key") and map the file into Process B's address space.

when using shared memory in unix

When you code a data supplier app in C for Unix that uses shared memory when do you detach the shared memory only when the server exits or when you are finished updating the shared memory ?
AFAIK, keeping it attached will not bother.
However since the attachment tracks the number of processes attached, if that count is >0, then you won't be allowed to destroy your shm until that count get back to 0 (in other words when all process are detached).
If you have a main process attached, I'm not sure you will be able to destroy it from an external "administrative" process for you shm.
In my personnal experience, I don't detach the SHM after write operations, only at process exit.

What is a good portable way to implement a global signalable event in a POSIX environment

The usage case is that one application generates an event and sends out a signal that any application that cares to listen for it will get. E.g. an application updates the contents of a file and signals this. On Linux this could be done by the waiters calling inotify on the file. One portable way would be for listeners to register with a well-known server, but I would prefer something simpler if possible. As portable as possible ideally means using only POSIX features which are also widely available.
Option using lock files
You can do this by locking a file.
Signal emitter initial setup:
Create a file with a well-known name and lock it for writing (fcntl(F_SETLK) with F_WRLCK or flock(LOCK_EX)`).
Signal receiver procedure:
Open the file using the well-known filename and try to obtain a read lock on it (fcntl(F_SETLK) with F_RDLCK or flock(LOCK_SH)).
Receiver blocks because the emitter is holding a conflicting write lock.
Signal emission:
Signal emitter creates a new temporary file
Signal emitter obtains a write lock on the new temporary file
Signal emitter renames the new temporary file to the well-known filename. This clobbers the old lock file but the waiting receivers all retain references to it.
Signal emitter closes the old lock file. This also releases the lock.
Signal receivers all wake up because now they can obtain their read locks.
Signal receivers should close the file they've just obtained a lock on. It won't be used again. If they want to wait for the condition to happen again they should reopen the file.
In the signal emitter, the temporary lock file which has been renamed over top of the original lock file now becomes the new current lock file.
Option using network multicast
Have the receivers join a multicast group and wait for packets. Have the signal emitter send UDP packets to that multicast group.
You can bind both the sending and receiving UDP sockets to the loopback interface if you want it to use only host-local communication.
In the end I used a bound unix domain socket. The owner keeps an array of client FDs and sends each a message when there is an event.

Resources