EINTR and non-blocking calls

As is well known, some blocking calls such as read() and write() can return -1 and set errno to EINTR, and we need to handle this.
My question is: does this also apply to non-blocking calls, e.g. on a socket set to O_NONBLOCK?
Some articles and sources I have read say that non-blocking calls don't need to bother with this, but I have found no authoritative reference for it. If that is so, does it hold across different implementations?

I cannot give you a definitive answer to this question, and the answer may vary from system to system, but I would expect a non-blocking socket to never fail with EINTR. If you look at the man pages of various systems for the socket functions bind(), connect(), send(), and recv(), or look them up in the POSIX standard, you'll notice something interesting: all of these functions except one may return -1 and set errno to EINTR. The one function that is not documented to ever fail with EINTR is bind(), and bind() is also the only function in that list that never blocks by default. So it seems that only blocking functions may fail with EINTR, including read() and write(); if these functions never block, they will never fail with EINTR either, and with O_NONBLOCK set they never block.
It would also make no sense from a logical perspective. Consider blocking I/O: you call read() and the call has to block, but while it is blocked, a signal is sent to your process and the read request is unblocked. How should the system handle this situation? Claim that read() succeeded? That would be a lie; it did not succeed, because no data was read. Claim that it succeeded but read zero bytes? That wouldn't be correct either, since a zero-byte read result indicates end-of-stream (or end-of-file), so your process would assume that no data was read because the end of the file has been reached (or the socket/pipe has been closed at the other end), which simply isn't the case. The end-of-file (or end-of-stream) has not been reached; if you call read() again, it will be able to return more data. So that would also be a lie. Your expectation is that the read call either succeeds and reads data or fails with an error. Thus the read call has to fail and return -1 in that case, but which errno value should the system set? All the other error values indicate a critical error with the file descriptor, yet there was no critical error, and indicating one would also be a lie. That's why errno is set to EINTR, which means: "There was nothing wrong with the stream. Your read call just failed because it was interrupted by a signal. If it hadn't been interrupted, it might still have succeeded, so if you still care about the data, please try again."
If you now switch to non-blocking I/O, the situation above never arises. The read call will never block, and if it cannot read data immediately, it will fail with EAGAIN or EWOULDBLOCK (POSIX allows the two names to share one value; on Linux they are the same error under two names), which means: "There is no data available right now, so your read call would have to block and wait for data to arrive, but blocking is not allowed, so it failed instead." So there is an error for every situation that may arise.
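The EAGAIN behaviour described above can be observed with a small experiment. This is only a sketch: it uses a pipe instead of a socket for simplicity, and the helper name errno_of_empty_nonblocking_read is made up for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not from the answer): performs a non-blocking read()
 * on an empty pipe and returns the errno it produced, or 0 if the read
 * unexpectedly did not fail. */
int errno_of_empty_nonblocking_read(void)
{
    int fds[2];
    char buf[16];

    int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;

    /* Switch only the read end to non-blocking mode. */
    int flags = fcntl(fds[0], F_GETFL);
    fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

    /* The pipe is empty and the write end is still open, so a blocking
     * read() would sleep here; the non-blocking one fails immediately. */
    errno = 0;
    ssize_t n = read(fds[0], buf, sizeof buf);
    int saved = (n == -1) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return saved;
}
```

On a POSIX system the helper should report EAGAIN (or EWOULDBLOCK where that is a distinct value), never EINTR, because the call does not sleep at any point.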
Of course, even with non-blocking I/O, the read call may be temporarily interrupted by a signal, but why would the system have to indicate that? Every function call, whether it is a system function or one written by the user, may be temporarily interrupted by a signal; really every single one, no exception. If the system had to inform the user whenever that happens, every system function could fail with EINTR. However, even if there was a signal interruption, functions usually perform their task all the way to the end, which is why this interruption is irrelevant. The error EINTR tells the caller that the requested action was not performed because of a signal interruption, but with non-blocking I/O there is no reason why the function should not perform the read or write request, unless it cannot be performed right now, and that case can be indicated by an appropriate error.
To confirm my theory, I took a look at the kernel of MacOS (10.8), which is still largely based on the FreeBSD kernel, and it seems to confirm the suspicion. If a read call is currently not possible because no data is available, the kernel checks for the O_NONBLOCK flag in the file descriptor flags. If this flag is set, the call fails immediately with EAGAIN. If it is not set, the kernel puts the current thread to sleep by calling a function named msleep(). The function is documented here (as I said, OS X uses plenty of FreeBSD code in its kernel). This function causes the current thread to sleep until it is explicitly woken up (which happens when data becomes ready for reading) or a timeout has been hit (e.g. you can set a receive timeout on sockets). The thread is also woken up if a signal is delivered, in which case msleep() itself returns EINTR and the next higher layer just passes this error through. So it is msleep() that produces the EINTR error, but if the O_NONBLOCK flag is set, msleep() is never called in the first place, hence this error cannot be returned.
Of course that was MacOS/FreeBSD; other systems may be different. But since most systems try to keep at least a certain level of consistency among these APIs, a system that breaks the assumption that non-blocking I/O calls can never fail with EINTR probably does not do so intentionally, and it may even get fixed if you report it.

@Mecki Great explanation. To add to the accepted answer, the book "Unix Network Programming - Volume 1, Third Edition" (Stevens) makes a distinction between slow system calls and others in section 5.9, "Handling Interrupted System Calls". Quoting from the book:
We used the term "slow system call" to describe accept, and we use
this term for any system call that can block forever. That is, the
system call need never return.
In the next paragraph of the same section:
The basic rule that applies here is that when a process is blocked in
a slow system call and the process catches a signal and the signal
handler returns, the system call can return an error of EINTR.
Going by this explanation, a read / write on a non-blocking socket is not a slow system call and hence should not return an error of EINTR.
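For the blocking ("slow") case, the usual way to handle EINTR is to simply restart the call. Here is a minimal sketch of that idea; the names read_retry_eintr and read_retry_selfcheck are hypothetical and not taken from the book.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical wrapper: repeats a blocking read() for as long as it fails
 * only because a signal handler interrupted it. Any other result, including
 * 0 (end-of-file) and other errors, is returned to the caller unchanged. */
ssize_t read_retry_eintr(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n == -1 && errno == EINTR);
    return n;
}

/* Small self-check on a pipe: returns 0 if one byte written to the pipe
 * is read back through the wrapper, -1 otherwise. */
int read_retry_selfcheck(void)
{
    int fds[2];
    char c = 0;

    int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;

    ssize_t written = write(fds[1], "x", 1);
    ssize_t n = (written == 1) ? read_retry_eintr(fds[0], &c, 1) : -1;

    close(fds[0]);
    close(fds[1]);
    return (n == 1 && c == 'x') ? 0 : -1;
}
```

Going by the reasoning above, such a loop should not be needed for a descriptor in O_NONBLOCK mode, although keeping it there does no harm.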

Just to add some evidence to @Mecki's answer, I found this discussion about fixing a bug in Linux where a patch caused non-blocking recvmsg to return EINTR. It was stated:
EINTR always means that you asked for a blocking operation, and a
signal arrived meanwhile.
Once you invert the "blocking" part of that set of conditions, EINTR
becomes an impossible event.
Also:
Look at what we do for AF_INET. We handle this the proper way.
If we are 'interrupted' by a signal while sleeping in lock_sock(),
recvmsg() on a non blocking socket, we return -EAGAIN properly, not
-EINTR.
Fact that we potentially sleep to get the socket lock is hidden for
the user, its an implementation detail of the kernel.
We never return -EINTR, as stated in manpage for non blocking sockets.
Source here: https://patchwork.ozlabs.org/project/netdev/patch/1395798147.12610.196.camel@edumazet-glaptop2.roam.corp.google.com/#741015

Related

Triggering a software event from an interrupt (XMEGA, GCC)

I want to run a periodic "housekeeping" event, triggered regularly by a timer interrupt. The interrupt fires frequently (kHz+), while the function may take a long time to finish, so I can't simply have it executed in line.
In the past, I've done this on an ATMEGA, where an ISR can simply permit other interrupts to fire (including itself again) with sei(). By wrapping the event in a "still executing" flag, it won't pile up on the stack and cause a... you know:
if (!inFunction) { inFunction = true; doFunction(); inFunction = false; }
I don't think this can be done -- at least as easily -- on the XMEGA, due to the PMIC interrupt controller. It appears the interrupt flags can only be reset by executing RETI.
So, I was thinking, it would be convenient if I could convince GCC to produce a tail call out of an interrupt. That would immediately execute the event, while clearing interrupts.
This would be easy enough to do in assembler: just push the address and RETI. (Well, some stack-mangling because ISR, but, yeah.) But I'm guessing it'll be a hack in GCC, possibly a custom ASM wrapper around a "naked" function?
Alternately, I would love to simply set a low priority software interrupt, but I don't see an intentional way to do this.
I could use software to trigger an interrupt from an otherwise unused peripheral. That's fine as a special case, but then, if I ever need to use that device, I have to find another. It's bad for code reuse, too.
Really, this is an X-Y problem and I know it. I think I want to do X, but really I need method Y that I just don't know about.
One better method is to set a flag, then let main() deal with it when it gets around to it. Unfortunately, I have blocking functions in main() (handling user input via serial), so that would take work, and be a mess.
The only "proper" method I know of offhand, is to do a full task switch -- but damned if I'm going to effectively implement an RTOS, or pull one in, just for this. There's got to be a better way.
Have I actually covered all the possibilities, and painted myself into a corner? Do I have to compromise and choose one of these? Am I missing anything better?
There are more possibilities to solve this.
1. Enable your timer interrupt as low priority. In this way the medium and high priority interrupts will be able to interrupt this low priority interrupt, and run unaffected.
This is similar to using sei(); in your interrupt handler in older processors (without PMIC).
2.a Set a flag (variable) in the interrupt. Poll the flag in the main loop. If the flag is set, clear it and do your stuff.
2.b Set up the timer but don't enable its interrupt. Poll the OVF interrupt flag of your timer in the main loop. If the flag is set, clear it and do your stuff.
Both variants of option 2 are timed less accurately, depending on what else the main loop does, so whether they fit depends on how much accuracy you need. For handling more tasks in the main loop without an OS, look into cooperative multitasking and state machines.
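Option 2.a can be sketched as follows. This is a host-side simulation, not XMEGA code: timer_isr_body() stands in for the body of the real timer ISR, and all names here are made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag shared between the (simulated) ISR and the main loop. It is volatile
 * so the compiler re-reads it on every poll; on an 8-bit AVR a single-byte
 * flag is also read and written in one instruction. */
static volatile bool housekeeping_due = false;
static unsigned housekeeping_runs = 0;

/* What the real timer ISR would do: only set the flag and return. */
void timer_isr_body(void)
{
    housekeeping_due = true;
}

/* Placeholder for the slow housekeeping function. */
static void do_housekeeping(void)
{
    housekeeping_runs++;
}

/* One pass of the main loop: clear the flag first, then do the work. */
void main_loop_iteration(void)
{
    if (housekeeping_due) {
        housekeeping_due = false;
        do_housekeeping();
    }
}

/* Simulation: the ISR fires once per tick and the main loop gets to poll
 * only every ticks_per_poll ticks. Returns how often housekeeping ran. */
unsigned simulate(unsigned ticks, unsigned ticks_per_poll)
{
    assert(ticks_per_poll != 0);
    housekeeping_due = false;
    housekeeping_runs = 0;
    for (unsigned t = 1; t <= ticks; t++) {
        timer_isr_body();
        if (t % ticks_per_poll == 0)
            main_loop_iteration();
    }
    return housekeeping_runs;
}
```

The simulation also shows the accuracy trade-off: if the main loop polls less often than the timer fires, several ticks collapse into a single housekeeping run.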

Interrupt a kernel module when a user process terminates/receives a signal?

I am working on a kernel module where I need to be "aware" that a given process has crashed.
Right now my approach is to set up a periodic timer interrupt in the kernel module; on every timer interrupt, I check the task_struct.state and task_struct.exit_state values for that process.
I am wondering if there's a way to set up an interrupt in the kernel module that would go off when the process terminates, or, when the process receives a given signal (e.g., SIGINT or SIGHUP).
Thanks!
EDIT: A catch here is that I can't modify the user application. Or at least, it would be a much tougher sell to the customer if I place additional requirements/constraints on s/w from another vendor...
You could have your module create a character device node and then open that node from your userspace process. It's only about a dozen lines of boilerplate to register a simple cdev in your module. Your cdev's open method will get called when the process opens the device node and the release method will be called when the device node is closed. If a process exits, either intentionally or because of a signal, all open file descriptors are closed by the kernel. So you can be certain that release will be called. This avoids any need to poll the process status and you can avoid modifying any kernel code outside of your module.
You could also set up a watchdog-style system, where your process must write one byte to the device every so often. Have the write method of the cdev reset a timer. If too much time passes without a write and the timer expires, it is assumed that the process has somehow failed, even if it hasn't crashed and terminated, for instance because of a programming bug that caused a mutex deadlock or put the process into an infinite loop.
There is a point in the kernel code where signals are delivered to user processes. You could patch that, check the process name, and signal a condition variable if it matches. This would just catch signals, not intentional process exits. IMHO, this is much uglier and you'll need to deal with maintaining a kernel patch. But it's not that hard, there's a single point, I don't recall what function, sorry, where one can insert the necessary code and it will catch all signals.

How does epoll know socket is ready in kernel?

I didn't find any hints in the epoll source code about how epoll knows a socket is ready for read/write.
Does epoll register a callback in the kernel?
Does epoll register a signal in the kernel for read/write?
Or something else?
Many thanks.
Short answer
Not only for epoll but for blocking I/O in general (the same mechanism is used by the read() syscall, for example), the kernel uses waitqueues (don't confuse them with workqueues, which are a totally different mechanism). If you check the ep_poll() implementation, it's even documented in the comments.
Some not-so-interesting details
In order to put the current thread to sleep on a waitqueue, one would normally use the wait_event_interruptible() call. epoll_wait does not do that, however. Instead it more or less re-implements what this call would do: it adds itself to the waitqueue with __add_wait_queue_exclusive(), puts itself to sleep with set_current_state(TASK_INTERRUPTIBLE), and checks in a loop what caused it to be woken up. The end result is the same: the current thread is put into an interruptible sleep, which may be ended either by a signal (in which case epoll_wait will return EINTR) or by a wake-up from ep_poll_callback through the waitqueue mechanism.
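From user space, the interruptible sleep described above shows up as epoll_wait() failing with EINTR, which callers typically handle by retrying. The following is only a sketch under that assumption (Linux-only; the names epoll_wait_retry and epoll_pipe_selfcheck are invented for the example, and a pipe is used in place of a socket).

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical wrapper: calls epoll_wait() again if it was interrupted by a
 * signal. Note that each retry starts over with the full timeout. */
int epoll_wait_retry(int epfd, struct epoll_event *events, int maxevents,
                     int timeout_ms)
{
    int n;
    do {
        n = epoll_wait(epfd, events, maxevents, timeout_ms);
    } while (n == -1 && errno == EINTR);
    return n;
}

/* Self-check: registers the read end of a pipe and verifies that it is
 * reported as readable only after a byte was written. Returns 0 on success. */
int epoll_pipe_selfcheck(void)
{
    int fds[2];
    int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;

    int ok = -1;
    int epfd = epoll_create1(0);
    if (epfd != -1) {
        struct epoll_event ev = {0};
        struct epoll_event out = {0};
        ev.events = EPOLLIN;
        ev.data.fd = fds[0];

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
            epoll_wait_retry(epfd, &out, 1, 0) == 0 &&      /* nothing ready yet */
            write(fds[1], "x", 1) == 1 &&
            epoll_wait_retry(epfd, &out, 1, 1000) == 1 &&   /* now readable */
            (out.events & EPOLLIN) && out.data.fd == fds[0])
            ok = 0;
        close(epfd);
    }

    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

The write() on the pipe is what triggers the in-kernel wake-up path; by the time the second epoll_wait() is called the event is already pending, so the call returns without sleeping.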

implementing blocking syscalls in Linux

I would like to understand how implementing blocking I/O syscalls differs from implementing non-blocking ones. Googling didn't help much; any links or references would be greatly appreciated.
Thanks.
http://faculty.salina.k-state.edu/tim/ossg/Device/blocking.html
A blocking syscall puts the task (the calling thread) to sleep (blocks it from running on the CPU), and the syscall returns only after an event (or a timeout). A non-blocking syscall does not block the thread; it just checks in-kernel state and returns immediately.
More detailed description: http://www.makelinux.net/ldd3/chp-6-sect-2
one important issue: how does a driver respond if it cannot immediately satisfy the request? A call to read may come when no data is available, but more is expected in the future. Or a process could attempt to write, but your device is not ready to accept the data, because your output buffer is full. The calling process usually does not care about such issues; the programmer simply expects to call read or write and have the call return after the necessary work has been done. So, in such cases, your driver should (by default) block the process, putting it to sleep until the request can proceed. ....
There are several forms of the wait_event kernel functions for blocking the calling thread; check include/linux/wait.h. The thread can be woken up in different ways, for example with wake_up/wake_up_interruptible.

Windows: TCP/IP: force close connection: avoid memleaks in kernel/user-level

A question to windows network programming experts.
When I use pseudo-code like this:
reconnect:
    s = socket(...);
    // more code...
read_reply:
    recv(...);
    // merge received data
    if (high_level_protocol_error) {
        // whoops, there was a deviation from protocol, like overflow
        // need to reset connection and discard data right now!
        closesocket(s);
        goto reconnect;
    }
When I call closesocket(), does the kernel disassociate and free all data physically received from the NIC (it must already be there, in kernel memory, waiting for user level to read it with recv())? Logically it should, since the data is no longer associated with any internal object, right?
I don't want to waste an unknown amount of time on a clean shutdown such as "call recv() until it returns an error". That does not make sense: what if it never returns an error, say because a badly behaved server keeps sending data forever and never closes the connection?
I'm asking because I don't want my application to cause memory leaks anywhere. Is this way of forcibly resetting a connection that is still expected to deliver an unknown amount of data correct?
// optional addition to question: if this method is considered correct for Windows, can it also be considered correct (with closesocket() changed to close()) for UNIX-compliant OSes?
Kernel drivers in Windows (or any OS really), including tcpip.sys, are supposed to avoid memory leaks in all circumstances, regardless of what you do in user mode. I would think that the developers have charted the possible states, including error states, to make sure that resources aren't leaked. As for user mode, I'm not exactly sure but I wouldn't think that resources are leaked in your process either.
Sockets are just file objects in Windows. When you close the last handle to a file, the I/O manager sends an IRP_MJ_CLEANUP request to the driver that owns the file so that it can clean up the resources associated with it. The receive buffers associated with the socket would be freed along with the file object.
It does say in the closesocket documentation that pending operations are canceled but that async operations may complete after the function returns. It sounds like closing the socket while in use is a supported scenario and wouldn't lead to a memory leak.
There will be no leak and you are under no obligation to read the stream to EOS before closing. If the sender is still sending after you close it will eventually get a 'connection reset'.
