Implementing blocking syscalls in Linux - linux-kernel

I would like to understand how implementing blocking I/O syscalls differs from implementing non-blocking ones. Googling didn't help much; any links or references would be greatly appreciated.
Thanks.

http://faculty.salina.k-state.edu/tim/ossg/Device/blocking.html
A blocking syscall will put the task (the calling thread) to sleep (take it off the CPU), and the syscall will return only after the event (or a timeout). A non-blocking syscall does not block the thread; it just checks the in-kernel state and returns immediately.
More detailed description: http://www.makelinux.net/ldd3/chp-6-sect-2
one important issue: how does a driver respond if it cannot immediately satisfy the request? A call to read may come when no data is available, but more is expected in the future. Or a process could attempt to write, but your device is not ready to accept the data, because your output buffer is full. The calling process usually does not care about such issues; the programmer simply expects to call read or write and have the call return after the necessary work has been done. So, in such cases, your driver should (by default) block the process, putting it to sleep until the request can proceed. ....
There are several forms of the wait_event kernel functions for blocking the calling thread; check include/linux/wait.h. The thread can be woken up in different ways, for example with wake_up/wake_up_interruptible.
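A minimal sketch of that pattern (a hypothetical driver with invented names, modelled on the LDD3 idiom, not a complete module) might look like this:

    /* Blocking read with a waitqueue; my_read/my_data_arrived/my_wq are
     * hypothetical names for illustration. */
    #include <linux/wait.h>
    #include <linux/sched.h>
    #include <linux/fs.h>
    #include <linux/errno.h>

    static DECLARE_WAIT_QUEUE_HEAD(my_wq);
    static int data_ready;          /* set by the producer/IRQ path */

    static ssize_t my_read(struct file *filp, char __user *buf,
                           size_t count, loff_t *ppos)
    {
            if (!data_ready) {
                    /* Non-blocking caller: report "no data yet" at once. */
                    if (filp->f_flags & O_NONBLOCK)
                            return -EAGAIN;
                    /* Blocking caller: sleep until woken or signalled. */
                    if (wait_event_interruptible(my_wq, data_ready))
                            return -ERESTARTSYS;
            }
            data_ready = 0;
            /* ... copy_to_user() the data here ... */
            return count;
    }

    /* In the interrupt handler / producer path: */
    static void my_data_arrived(void)
    {
            data_ready = 1;
            wake_up_interruptible(&my_wq);
    }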

Related

How does epoll know socket is ready in kernel?

I didn't find any hints in the epoll source code about how epoll knows a socket is ready for read/write.
Does epoll register a callback in the kernel?
Does epoll register a signal in the kernel for read/write?
Or something else?
Many thanks.
Short answer
Not only for epoll but for "blocking I/O" in general (the same mechanism is used by the read() syscall, for example), the kernel uses waitqueues (don't confuse them with workqueues, which are a totally different mechanism). If you check the ep_poll() implementation, it's even documented in the comments.
Some not-so-interesting details
In order to put the current thread to sleep on a waitqueue, one would normally use the wait_event_interruptible() call. epoll_wait does not do that, however. Instead it sort of re-implements what that call would do: it adds itself to the waitqueue with __add_wait_queue_exclusive(), puts itself to sleep with set_current_state(TASK_INTERRUPTIBLE), and checks, in a loop, what caused it to be woken up. The end result is the same: the current thread is put into an interruptible sleep, which may be terminated either by a signal being sent (in which case epoll_wait returns EINTR) or by being woken up by ep_poll_callback through the waitqueue mechanism.
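A rough, simplified sketch of that open-coded loop (modelled on fs/eventpoll.c from older kernels; names and details vary by version, and the locking around the queue manipulation is omitted):

    /* ep is epoll's private state; ep_events_available() is eventpoll's
     * own readiness check. */
    wait_queue_entry_t wait;        /* wait_queue_t on older kernels */
    int res = 0;

    init_waitqueue_entry(&wait, current);
    __add_wait_queue_exclusive(&ep->wq, &wait);

    for (;;) {
            /* Mark ourselves as sleeping *before* checking the condition,
             * so a wakeup between the check and schedule() is not lost. */
            set_current_state(TASK_INTERRUPTIBLE);
            if (ep_events_available(ep))       /* condition became true */
                    break;
            if (signal_pending(current)) {     /* interrupted by a signal */
                    res = -EINTR;
                    break;
            }
            schedule();                        /* actually sleep here */
    }

    __remove_wait_queue(&ep->wq, &wait);
    __set_current_state(TASK_RUNNING);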

What is channel event system?

I am working on a project where I have to deal with the ATxmega128A1 microcontroller, but being a beginner with microcontrollers, I want to know what this event channel system is.
I have referred to http://www.atmel.com/Images/doc8071.pdf but I'm not getting it.
The traditional way to do the kinds of things the event channel system can do is to use interrupts.
In the interrupt model, the CPU runs the code starting with main(), and usually continues with some loop. When a particular event occurs, such as a button being pressed, the CPU is "interrupted". The current processing is stopped, some registers are saved, and execution jumps to the code pointed to by an interrupt vector, called an interrupt handler. This code usually starts with instructions to save register values, which are added automatically by the compiler.
When the interrupting code is finished, the CPU restores the values that the registers previously had and execution jumps back to the point in the main code where it was interrupted.
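For example, with avr-gcc a handler looks like this (the vector name is an illustrative choice; pick the one for your peripheral):

    #include <avr/interrupt.h>

    /* The ISR() macro tells avr-gcc to generate the register save/restore
     * prologue and epilogue and to place the handler in the vector table. */
    ISR(PORTA_INT0_vect)    /* hypothetical: pin-change interrupt 0, port A */
    {
            /* keep it short: e.g. set a flag that the main loop checks */
    }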
But this approach costs valuable CPU cycles, and some interrupt handlers don't do much except trigger some peripheral to take an action. Wouldn't it be great if these kinds of interrupt handlers could be avoided, and the peripherals could talk directly to each other without pausing the CPU?
This is what the event channel system does. It allows peripherals to trigger each other directly without involving the CPU. The CPU continues to execute instructions while the channel system operates in parallel. This doesn't mean you can replace all interrupt handlers, though. If complicated processing is involved, you still need a handler to act. But the channel system does allow you to avoid using very simple interrupt handlers.
The paper you reference describes this in a little more detail (but assumes a lot of knowledge on the reader's part). You have to read the actual datasheet of your mC to find the exact details.
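As a concrete illustration, here is a minimal sketch for an XMEGA with avr-gcc (the register and constant names come from the standard <avr/io.h> headers for XMEGA parts; verify them against your part's datasheet before relying on this):

    /* Route a TCC0 timer-overflow event to the ADC, so that every overflow
     * triggers a conversion with no interrupt handler and no CPU cycles. */
    #include <avr/io.h>

    static void evsys_timer_triggers_adc(void)
    {
            /* Event channel 0 fires on every TCC0 overflow. */
            EVSYS.CH0MUX = EVSYS_CHMUX_TCC0_OVF_gc;

            /* ADC A starts a conversion on its channel 0 whenever event
             * channel 0 fires. */
            ADCA.EVCTRL = ADC_EVSEL_0123_gc | ADC_EVACT_CH0_gc;
    }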

Why is there a call to mdelay(1) when resetting interrupt affinities?

I'm trying to change the code that brings down a cpu, and got into something I don't completely understand:
One of the things that happen after a core is removed from cpu_online_mask, is the resetting of the interrupt affinities.
This is being done in the fixup_irqs() function, found in /arch/x86/kernel/irq.c.
The function resets interrupt affinities, then calls mdelay(1) (which simply waits one millisecond), and finally moves on to handle possibly "lost" interrupts.
My question is: why is the call to mdelay(1) necessary? what can happen without it?
My guess is that it takes time for the rerouting in the APIC to take effect... but I'm sure that there is a more convincing explanation for this.
Thanks!
In a nutshell, there is a race condition in fixup_irqs(): the function starts by going over all the IRQs routed to the CPU being offlined and tells the hardware to route them somewhere else.
The thing is, the process of changing this interrupt routing is not atomic or instantaneous. The transaction that changes the routing on the APIC might race with a transaction that sends an interrupt, and that interrupt might take some cycles to arrive, so you might end up with:
Tell the APIC to send interrupts to some other CPU, not me.
Interrupt!
So what the code does is basically:
Tell the APIC to send interrupts to some other CPU, not me.
Wait a bit, long enough that the interrupt re-route is guaranteed to have taken effect. (How do you know how much time is enough to wait? Maybe it's documented in the APIC spec, maybe it's internal knowledge some Intel VLSI engineer shared with the Linux people - I don't know :-))
Check if an interrupt occurred by reading the register on the APIC (the IRR) that latches when an interrupt has been received, and if you find any, send it to the proper target as an IPI.
Now we know no interrupt will really get to us.
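A heavily simplified sketch of that sequence (the real loop in arch/x86/kernel/irq.c also handles locking and goes through the irq_chip callbacks; this is just the shape of it):

    int vector;

    /* 1. Retarget every IRQ currently routed to this CPU
     *    (chip->irq_set_affinity() per IRQ, elided here). */

    /* 2. Give in-flight interrupt messages time to arrive. */
    mdelay(1);

    /* 3. Anything that still landed on this CPU is latched in the local
     *    APIC's IRR; hand it to the new target by re-triggering it. */
    for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
            unsigned int irr = apic_read(APIC_IRR + (vector / 32) * 0x10);

            if (irr & (1u << (vector % 32))) {
                    /* look up this CPU's vector_irq[vector] and call the
                     * chip's irq_retrigger()/resend path (elided) */
            }
    }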

EINTR and non-blocking calls

As is known, some blocking calls like read and write return -1 and set errno to EINTR, and we need to handle this.
My question is: does this apply to non-blocking calls too, e.g., a socket set to O_NONBLOCK?
Some articles and sources I have read say that non-blocking calls don't need to bother with this, but I have found no authoritative reference for it. If it is true, does it apply across different implementations?
I cannot give you a definitive answer to this question, and the answer may further vary from system to system, but I would expect a non-blocking socket never to fail with EINTR. If you take a look at the man pages of various systems for the socket functions bind(), connect(), send(), and recv(), or look them up in the POSIX standard, you'll notice something interesting: all of these functions except one may return -1 and set errno to EINTR. The one function that is not documented to ever fail with EINTR is bind(), and bind() is also the only function on that list that never blocks by default. So it seems that only blocking functions may fail because of EINTR, including read() and write(); and if these functions never block, they will also never fail with EINTR, and with O_NONBLOCK set they never block.
It would also make no sense from a logical perspective. E.g., suppose you are using blocking I/O, you call read(), and the call has to block; but while it is blocking, a signal is sent to your process and the read request is unblocked. How should the system handle this situation? Claim that read() succeeded? That would be a lie; it did not succeed, because no data was read. Claim it succeeded but zero bytes were read? That wouldn't be correct either, since a zero read result is used to indicate end-of-stream (or end-of-file), so your process would assume that no data was read because the end of a file has been reached (or a socket/pipe has been closed at the other end), which simply isn't the case. The end-of-file (or end-of-stream) has not been reached; if you call read() again, it will be able to return more data. So that would also be a lie. Your expectation is that this read call either succeeds and reads data or fails with an error. Thus the read call has to fail and return -1 in that case, but what errno value should the system set? All the other error values indicate a critical problem with the file descriptor, yet there was no critical problem, and indicating one would also be a lie. That's why errno is set to EINTR, which means: "There was nothing wrong with the stream. Your read call just failed because it was interrupted by a signal. If it hadn't been interrupted, it might have succeeded, so if you still care about the data, please try again."
If you now switch to non-blocking I/O, the situation above never arises. The read call will never block, and if it cannot read data immediately, it fails with the error EAGAIN (POSIX) or EWOULDBLOCK (unofficial; on Linux both are the same error, just alternative names for it), which means: "There is no data available right now, so your read call would have to block and wait for data to arrive, but blocking is not allowed, so it failed instead." So there is an error for every situation that may arise.
Of course, even with non-blocking I/O, the read call may have been temporarily interrupted by a signal, but why would the system have to indicate that? Every function call, whether a system function or one written by the user, may be temporarily interrupted by a signal, really every single one, no exception. If the system had to inform the user whenever that happens, every system function could fail because of EINTR. However, even if there was a signal interruption, the functions usually perform their task all the way to the end, which is why the interruption is irrelevant. The error EINTR is used to tell the caller that the action he requested was not performed because of a signal interruption; but in the case of non-blocking I/O, there is no reason why the function should not perform the read or write request, unless it cannot be performed right now, and then that can be indicated by an appropriate error.
To confirm my theory, I took a look at the kernel of MacOS (10.8), which is still largely based on the FreeBSD kernel, and it seems to confirm the suspicion. If a read call is currently not possible, as no data is available, the kernel checks for the O_NONBLOCK flag in the file descriptor flags. If this flag is set, it fails immediately with EAGAIN. If it is not set, it puts the current thread to sleep by calling a function named msleep(). The function is documented in the FreeBSD man pages (as I said, OS X uses plenty of FreeBSD code in its kernel). This function causes the current thread to sleep until it is explicitly woken up (which is the case when data becomes ready for reading) or a timeout is hit (e.g. you can set a receive timeout on sockets). Yet the thread is also woken up if a signal is delivered, in which case msleep() itself returns EINTR and the next higher layer just passes this error through. So it is msleep() that produces the EINTR error, but if the O_NONBLOCK flag is set, msleep() is never called in the first place, hence this error cannot be returned.
Of course that was MacOS/FreeBSD; other systems may be different. But since most systems try to keep at least a certain level of consistency among these APIs, if a system breaks the assumption that non-blocking I/O calls can never fail because of EINTR, this is probably not intentional and may even get fixed if you report it.
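To make the practical difference concrete, here is a small userspace sketch in plain POSIX C (the function name is invented) showing how each case is typically handled:

    #include <errno.h>
    #include <unistd.h>

    /* Read once, dealing with both interruption and "no data yet". */
    static ssize_t robust_read(int fd, void *buf, size_t len)
    {
            for (;;) {
                    ssize_t n = read(fd, buf, len);

                    if (n >= 0)
                            return n;   /* data read, or 0 == end-of-stream */

                    if (errno == EINTR)
                            continue;   /* blocking fd: a signal interrupted
                                           the call, just retry */

                    if (errno == EAGAIN || errno == EWOULDBLOCK)
                            return -1;  /* non-blocking fd: nothing to read
                                           yet; poll/select/epoll and call
                                           again later */

                    return -1;          /* a real error */
            }
    }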
@Mecki Great explanation. To add to the accepted answer, the book "Unix Network Programming - Volume 1, Third Edition" (Stevens) makes a distinction between slow system calls and others in chapter/section 5.9, "Handling Interrupted System Calls". I am quoting from the book -
We used the term "slow system call" to describe accept, and we use
this term for any system call that can block forever. That is, the
system call need never return.
In the next paragraph of the same section -
The basic rule that applies here is that when a process is blocked in
a slow system call and the process catches a signal and the signal
handler returns, the system call can return an error of EINTR.
Going by this explanation, a read/write on a non-blocking socket is not a slow system call and hence should not return an error of EINTR.
Just to add some evidence to @Mecki's answer, I found this discussion about fixing a bug in Linux where a patch caused non-blocking recvmsg to return EINTR. It was stated:
EINTR always means that you asked for a blocking operation, and a
signal arrived meanwhile.
Once you invert the "blocking" part of that set of conditions, EINTR
becomes an impossible event.
Also:
Look at what we do for AF_INET. We handle this the proper way.
If we are 'interrupted' by a signal while sleeping in lock_sock(),
recvmsg() on a non blocking socket, we return -EAGAIN properly, not
-EINTR.
Fact that we potentially sleep to get the socket lock is hidden for
the user, its an implementation detail of the kernel.
We never return -EINTR, as stated in manpage for non blocking sockets.
Source here: https://patchwork.ozlabs.org/project/netdev/patch/1395798147.12610.196.camel#edumazet-glaptop2.roam.corp.google.com/#741015

Avoiding sleep while holding a spinlock

I've recently read section 5.5.2 (Spinlocks and Atomic Context) of the LDDv3 book:
Avoiding sleep while holding a lock can be more difficult; many kernel functions can sleep, and this behavior is not always well documented. Copying data to or from user space is an obvious example: the required user-space page may need to be swapped in from the disk before the copy can proceed, and that operation clearly requires a sleep. Just about any operation that must allocate memory can sleep; kmalloc can decide to give up the processor, and wait for more memory to become available unless it is explicitly told not to. Sleeps can happen in surprising places; writing code that will execute under a spinlock requires paying attention to every function that you call.
It's clear to me that spinlocks must always be held for the minimum time possible and I think that it's relatively easy to write correct spinlock-using code from scratch.
Suppose, however, that we have a big project where spinlocks are widely used.
How can we make sure that functions called from critical sections protected by spinlocks will never sleep?
Thanks in advance!
What about enabling "Sleep-inside-spinlock checking" for your kernel? It is usually found under Kernel Debugging when you run make config. You might also try to duplicate its behavior in your own code.
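For reference, the relevant .config fragment looks like this (the option name varies by kernel version; newer kernels call the check CONFIG_DEBUG_ATOMIC_SLEEP):

    # Kernel hacking ---> enables "sleep inside atomic section" checking
    CONFIG_DEBUG_ATOMIC_SLEEP=y
    # On older kernels the same check was named:
    # CONFIG_DEBUG_SPINLOCK_SLEEP=y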
One thing I have noticed on a lot of projects is that people seem to misuse spinlocks: they get used in place of the other locking primitives that should have been used.
A Linux spinlock only really exists in multiprocessor builds (in uniprocessor builds the spinlock preprocessor defines are empty); spinlocks are for short-duration locks on a multiprocessor platform.
If code fails to acquire a spinlock, it just spins the processor until the lock is free. So either another process running on a different processor must free the lock, or possibly it could be freed by an interrupt handler; but the wait_event mechanism is a much better way of waiting on an interrupt.
The irqsave spinlock primitive is a tidy way of disabling/enabling interrupts so a driver can lock out an interrupt handler, but it should only be held long enough for the process to update some variables shared with that interrupt handler; if you disable interrupts, you are not going to be scheduled.
If you need to lock out an interrupt handler use a spinlock with irqsave.
For general kernel locking you should be using the mutex/semaphore API, which will sleep on the lock if it needs to.
To lock against code running in other processes, use a mutex/semaphore.
To lock against code running in an interrupt context, use irq save/restore or spin_lock_irqsave/spin_unlock_irqrestore.
To lock against code running on other processors, use spinlocks and avoid holding the lock for long. A sketch of all three cases follows this list.
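A minimal sketch of those three cases (hypothetical driver state, illustrative names):

    #include <linux/mutex.h>
    #include <linux/spinlock.h>
    #include <linux/interrupt.h>

    static DEFINE_MUTEX(cfg_lock);      /* process context only: may sleep */
    static DEFINE_SPINLOCK(fifo_lock);  /* shared with the IRQ handler */

    static void update_config(void)
    {
            mutex_lock(&cfg_lock);      /* fine to sleep while holding this */
            /* ... kmalloc(..., GFP_KERNEL), copy_from_user(), etc. ... */
            mutex_unlock(&cfg_lock);
    }

    static void push_to_fifo(void)
    {
            unsigned long flags;

            /* Also disables local interrupts, so the IRQ handler below
             * cannot deadlock against us on this CPU. Keep the section
             * short, and never call anything that can sleep in here. */
            spin_lock_irqsave(&fifo_lock, flags);
            /* ... touch the shared FIFO ... */
            spin_unlock_irqrestore(&fifo_lock, flags);
    }

    static irqreturn_t my_irq_handler(int irq, void *dev)
    {
            spin_lock(&fifo_lock);      /* already in interrupt context */
            /* ... touch the shared FIFO ... */
            spin_unlock(&fifo_lock);
            return IRQ_HANDLED;
    }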
I hope this helps
