How do "special" epoll flags correspond to kqueue ones?

I'm struggling to draw a parallel between epoll and kqueue flags, specifically EPOLLONESHOT, EPOLLET and EPOLLEXCLUSIVE versus EV_CLEAR, EV_DISPATCH and EV_ONESHOT. I'm investigating kqueue for the first time; so far I have only had experience with epoll.
EV_DISPATCH
It feels like the mix of EPOLLEXCLUSIVE and EPOLLONESHOT flags; from the kqueue documentation:
EV_DISPATCH Disable the event source immediately after delivery of an event. See EV_DISABLE above.
EV_DISABLE Disable the event so kevent() will not return it. The filter itself is not disabled.
Do I understand the documentation correctly that the event is signalled and then immediately discarded, if there was at least one kqueue instance which polled for this event? That is, if we poll a socket for EVFILT_READ on two kqueues, only one will receive it, and then, until the same event is re-enabled with EV_ENABLE, there won't be any further events at all, even if new data arrives on the socket?
EV_CLEAR
Looks like it is close to EPOLLET; from the kqueue documentation:
EV_CLEAR After the event is retrieved by the user, its state is reset. This is useful for filters which report state transitions instead of the current state. Note that some filters may automatically set this flag internally.
So, for example, given the same socket with EVFILT_READ, all kqueues that poll it simultaneously will wake up with EVFILT_READ. If, however, not all data is read (i.e. the reader does not read until EAGAIN), no further events are reported. If and only if all the data was read and new data arrives, a new EVFILT_READ event is triggered. Is that correct?
EV_ONESHOT
Looks like it maps to EPOLLONESHOT; from the kqueue documentation:
EV_ONESHOT Causes the event to return only the first occurrence of the filter being triggered. After the user retrieves the event from the kqueue, it is deleted.
Questions
So, the questions:
Is my understanding correct? Did I understand these special flags right, compared to epoll? The documentation seems a bit tricky to me; perhaps the problem is that I've only used epoll before and haven't yet played with kqueue.
Could you please point me to good sources or examples that show kqueue techniques? It would be nice if they were not as complex as Boost.Asio, and it would also be nice if they were written in C.
Can these flags be combined together? For example, EPOLLONESHOT cannot be combined with EPOLLEXCLUSIVE, but EV_DISPATCH seems to be exactly something in the middle between these flags.
Thank you for your help!
References
kqueue(2): FreeBSD System Calls Manual
epoll(7): Linux Programmer's Manual
epoll_ctl(2): Linux Programmer's Manual

EV_CLEAR is not equivalent to EPOLLET. For example, suppose a listening socket has 5 pending connections and you don't consume all of them (i.e. you don't call accept until EAGAIN). With EV_CLEAR, kevent won't report another EVFILT_READ event until a 6th connection arrives.
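EPOLLEXCLUSIVE limits how many of the epoll instances attached to the same fd are woken for an event (it exists to avoid thundering-herd wakeups); it isn't related to EV_DISPATCH.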
EV_ONESHOT deletes the knote after the event is triggered, whereas EV_DISPATCH only disables it.
If one socket fd is registered with several kqueues, the event is broadcast to all of them when it is triggered.
EV_ONESHOT is almost equivalent to EPOLLONESHOT; it is useful when different threads need to call kevent on the same kqueue fd.


wait queues and work queues, do they always go together?

I’ve been trying to refresh my understanding of sleeping in the kernel with regards to wait queues. So started browsing the source code for bcmgenet.c (kernel version 4.4) which is the driver responsible for driving the 7xxx series of Broadcom SoC for their set top box solution.
As part of the probe callback, this driver initializes a work queue which is part of the driver’s private structure and adds itself to the Q. But I do not see any blocking of any kind anywhere. Then it goes on to initialize a work queue with a function to call when woken up.
Now coming to the ISR0 for the driver, within that is an explicit call to the scheduler as part of the ISR (bcmgenet_isr0) if certain conditions are met. Now AFAIK, this call is used to defer work to a later time, much like a tasklet does.
Post this we check some MDIO status flags and if the conditions are met, we wake up the process which was blocked in process context. But where exactly is the process blocked?
Also, most of the time, wait queues seem to be used in conjunction with work queues. Is that the typical way to use them?
As part of the probe callback, this driver initializes a work queue which is part of the driver’s private structure and adds itself to the Q. But I do not see any blocking of any kind anywhere.
I think you meant the wait queue head, not the work queue. I do not see any evidence of the probe adding itself to the queue; it is merely initializing the queue.
The queue is used by the calls to the wait_event_timeout() macro in the bcmgenet_mii_read() and bcmgenet_mii_write() functions in bcmmii.c. These calls will block until either the condition they are waiting for becomes true or the timeout period elapses. They are woken up by the wake_up(&priv->wq); call in the ISR0 interrupt handler.
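The wait-until-condition-or-timeout pattern can be illustrated outside the kernel. The following is only a user-space analogy in Python, not driver code: threading.Condition stands in for the wait queue head, wait_for() for wait_event_timeout(), and notify_all() for wake_up(). The class name, the flag and the timings are made up for the illustration.

```python
import threading


class MdioWaiter:
    """User-space stand-in for a wait queue head plus the condition it guards."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = False  # hypothetical "transaction finished" flag

    def wait_done(self, timeout):
        # Sleep until the predicate holds or the timeout elapses.
        # Returns True if the condition became true, False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._done, timeout=timeout)

    def complete(self):
        # Plays the role of the interrupt handler: make the condition
        # true first, then wake everybody sleeping on it.
        with self._cond:
            self._done = True
            self._cond.notify_all()


def demo():
    waiter = MdioWaiter()
    # Nobody signals completion, so this wait ends by timeout.
    timed_out_result = waiter.wait_done(timeout=0.05)
    # Now let a timer thread signal completion while we are waiting.
    timer = threading.Timer(0.05, waiter.complete)
    timer.start()
    woken_result = waiter.wait_done(timeout=5.0)
    timer.join()
    return timed_out_result, woken_result
```

Note that the waiter always re-checks the condition rather than trusting the wake-up itself, which is why a missing interrupt only costs the timeout and does not hang the caller.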
Then it goes on to initialize a work queue with a function to call when woken up.
It is initializing a work item, not a work queue. The function will be called from a kernel thread as a result of the work item being added to the system work queue.
Now coming to the ISR0 for the driver, within that is an explicit call to the scheduler as part of the ISR (bcmgenet_isr0) if certain conditions are met. Now AFAIK, this call is used to defer work to a later time, much like a tasklet does.
You are referring to the schedule_work(&priv->bcmgenet_irq_work); call in the ISR0 interrupt handler. This adds the previously mentioned work item to the system work queue. It is similar to a tasklet, but tasklets run in softirq context whereas work items run in process context.
Post this we check some MDIO status flags and if the conditions are met, we wake up the process which was blocked in process context. But where exactly is the process blocked?
As mentioned above, the process is blocked in the bcmgenet_mii_read() and bcmgenet_mii_write() functions, although they use a timeout to avoid blocking for long periods. (This timeout is especially important for those versions of GENET that do not support MDIO-related interrupts!)
Also, most of the time, wait queues seem to be used in conjunction with work queues. Is that the typical way to use them?
Not especially. This particular driver uses both a wait queue and a work item, but I wouldn't describe them as being used "in conjunction" since they are being used to handle different interrupt conditions.

Is it OK for a text editor to pass each keypress from one thread to another?

I'm implementing a Win32 Console text editor that has an internal message queue for passing around information about which areas to redraw, messages to/from plugins, etc. I prefer it to be single-threaded by default (if there's nothing happening that requires additional threads). I'm considering 2 message queue implementation strategies:
Use a general-purpose queue and a Win32 Event, so I can use WaitForMultipleObjectsEx to wait for internal messages and user input simultaneously, passing both the console input handle and the Event handle. In this case, the text editor can live entirely within a single thread.
Use an I/O Completion Port. In this case, the text editor will need at least two threads, one of them calling GetQueuedCompletionStatus to get messages, and another reading user input and sending it to the queue via PostQueuedCompletionStatus. The reason is that the console input handle cannot be used for overlapped I/O and the WaitFor* functions don't accept a Completion Port as a waitable handle, so it's not possible to wait on them simultaneously. Just like in the first setup, neither thread wastes CPU time when there's no input or events, but each keypress has to be passed from one thread to another via IOCP.
Which design is overall better?
Is the performance and latency drawback from passing each keypress via IOCP significant for a text editor?
The performance and latency of IOCP are fine for your purpose. I wouldn’t, however, translate each key press into its own PostQueuedCompletionStatus call. I’d rather call PostQueuedCompletionStatus once to enqueue multiple keypresses at a time, whatever count you get from ReadConsoleInput.
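The batching idea is independent of IOCP itself. Here is a minimal sketch in Python, with queue.Queue standing in for the completion port and a list of lists standing in for successive ReadConsoleInput results; the function names and the message tuples are assumptions made for the illustration.

```python
import queue


def pump_input(batches, messages):
    """Reader side: post one message per batch instead of one per keypress.

    Each element of `batches` stands for what a single input read returned.
    """
    for batch in batches:
        if batch:
            messages.put(("keys", list(batch)))  # one post for the whole batch
    messages.put(("quit", None))


def drain(messages):
    """Editor side: returns (all keys in order, number of key posts received)."""
    typed = []
    posts = 0
    while True:
        kind, payload = messages.get()
        if kind == "quit":
            return typed, posts
        posts += 1
        typed.extend(payload)
```

With three reads returning two keys, nothing, and one key, the editor side sees all three keys in order but only two queue posts.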
I think performance difference is minimal, either one can be faster depending on the environment. And WaitForMultipleObjects is much easier to implement.
P.S. Are you sure you need a message queue at all? Why not process these redraw requests and plugin messages in whichever thread fired them, using e.g. a critical section to guard the console output and your shared state? If contention is low, say 100 Hz messages plus 15 Hz keypresses, and you process them quickly, this will get you even lower latency than IOCP or any other queue: with low contention or brief locking, critical sections don’t even switch to the kernel for lock/unlock. This will also simplify the design: your main thread will sleep in a blocking ReadConsoleInput() call, with no need for a WaitFor* call before it.

Coalescing GCD file system events

I have a class that implements a file-monitoring service to detect when a file I am interested in has been changed by something other than my application. I use the standard technique of opening the file (with the O_EVTONLY flag) and binding the file descriptor to a Grand Central Dispatch source of type DISPATCH_SOURCE_TYPE_VNODE. When I get an event, I notify my main thread with NSNotificationCenter's postNotificationName:object:userInfo: which calls an observer in my app delegate. So far so good. It works great. But, in general, if the triggering event is an attributes change (i.e. the DISPATCH_VNODE_ATTRIB flag is set on return from dispatch_source_get_data()) then I usually get two closely-spaced events. The behaviour is easily exhibited if I touch(1) the object I am monitoring. I hypothesise this is due to the file's mtime and atime being set non-atomically although I can't verify this. This can lead to spurious notifications being sent to my observer and this raises the possibility of race conditions etc.
What is the best way of dealing with this? I thought of storing a timestamp for the last event received and only sending a notification if the current event is later than this timestamp by some amount (a few tens of milliseconds?) Does this sound like a reasonable solution?
You can't ever escape the "race condition" in this situation, because the notification of your GCD event source in your process is not synchronous with the other process's modification of the underlying file. So, no matter what, you must always be tolerant of the possibility that the change you're being notified for could already be "gone."
As for coalescing, do whatever makes sense for your app. There are two obvious strategies. You can act immediately on a received event, and then drop subsequent events received in some time window on the floor, or you can delay every event for some time period during which you will drop other events for the same file on the floor. It really just depends on what's more important, acting quickly, or having a higher likelihood of a quiescent state (knowing that you can never be sure things are quiescent.)
The only thing I would add is to suggest that you do all your coalescence before dispatching anything to the main thread. The main thread has things like tracking loops, etc that will make it harder to get time-based coalescing right in certain cases.
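The two coalescing strategies can be sketched as small state machines driven by timestamps. This is an illustrative Python sketch, not GCD code: the class names are made up, and the caller is assumed to pass in the current time in seconds (for example from a monotonic clock) on the queue where the source's event handler runs.

```python
class LeadingEdgeCoalescer:
    """Act on the first event at once, drop events that follow within `window`."""

    def __init__(self, window):
        self.window = window
        self._last_accepted = None

    def accept(self, now):
        # True means "forward this event"; False means "drop it on the floor".
        if self._last_accepted is not None and now - self._last_accepted < self.window:
            return False
        self._last_accepted = now
        return True


class TrailingEdgeCoalescer:
    """Delay the first event by `window`; events arriving meanwhile are dropped."""

    def __init__(self, window):
        self.window = window
        self._deadline = None

    def event(self, now):
        # Only the first event of a burst arms the deadline.
        if self._deadline is None:
            self._deadline = now + self.window

    def poll(self, now):
        # Returns True exactly once per burst, after the delay has elapsed.
        if self._deadline is not None and now >= self._deadline:
            self._deadline = None
            return True
        return False
```

The first class favours acting quickly; the second favours a higher likelihood that the file has settled before the notification goes out.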

Usage of IcmpSendEcho2 with an asynchronous callback

I've been reading the MSDN documentation for IcmpSendEcho2 and it raises more questions than it answers.
I'm familiar with asynchronous callbacks from other Win32 APIs such as ReadFileEx... I provide a buffer which I guarantee will be reserved for the driver's use until the operation completes with any result other than IO_PENDING, I get my callback in case of either success or failure (and call GetCompletionStatus to find out which). Timeouts are my responsibility and I can call CancelIo to abort processing, but the buffer is still reserved until the driver cancels the operation and calls my completion routine with a status of CANCELLED. And there's an OVERLAPPED structure which uniquely identifies the request through all of this.
IcmpSendEcho2 doesn't use an OVERLAPPED context structure for asynchronous requests. And the documentation is unclear and excessively minimalist about what happens if the ping times out or fails (failure would be lack of a network connection, a missing ARP entry for local peers, an ICMP destination unreachable response from an intervening router for remote peers, etc).
Does anyone know whether the callback occurs on timeout and/or failure? And especially, if no response comes, can I reuse the buffer for another call to IcmpSendEcho2 or is it forever reserved in case a reply comes in late?
I'm wanting to use this function from a Win32 service, which means I have to get the error-handling cases right and I can't just leak buffers (or if the API does leak buffers, I have to use a helper process so I have a way to abandon requests).
There's also an ugly incompatibility in the way the callback is made. It looks like the first parameter is consistent between the two signatures, so I should be able to use the newer PIO_APC_ROUTINE as long as I only use the second parameter if an OS version check returns Vista or newer? Although MSDN says "don't do a Windows version check", it seems like I need to, because the set of versions with the new argument aren't the same as the set of versions where the function exists in iphlpapi.dll.
Pointers to additional documentation or working code which uses this function and an APC would be much appreciated.
Please also let me know if this is completely the wrong approach -- i.e. if either using raw sockets or some combination of IcmpCreateFile+WriteFileEx+ReadFileEx would be more robust.
I use IcmpSendEcho2 with an event, not a callback, but I think the flow is the same in both cases. IcmpSendEcho2 uses NtDeviceIoControlFile internally. It detects some ICMP-related errors early on and returns them as error codes in the 12xx range. If (and only if) IcmpSendEcho2 returns ERROR_IO_PENDING, it will eventually call the callback and/or set the event, regardless of whether the ping succeeds, fails or times out. Any buffers you pass in must be preserved until then, but can be reused afterwards.
As for the version check, you can avoid it at a slight cost by using an event with RegisterWaitForSingleObject instead of an APC callback.

What are alternatives to Win32 PulseEvent() function?

The documentation for the Win32 API PulseEvent() function (kernel32.dll) states that this function is “… unreliable and should not be used by new applications. Instead, use condition variables”. However, condition variables cannot be used across process boundaries like (named) events can.
I have a scenario that is cross-process, cross-runtime (native and managed code) in which a single producer occasionally has something interesting to make known to zero or more consumers. Right now, a well-known named event is used (and set to signaled state) by the producer using this PulseEvent function when it needs to make something known. Zero or more consumers wait on that event (WaitForSingleObject()) and perform an action in response. There is no need for two-way communication in my scenario, and the producer does not need to know if the event has any listeners, nor does it need to know if the event was successfully acted upon. On the other hand, I do not want any consumers to ever miss any events. In other words, the system needs to be perfectly reliable – but the producer does not need to know if that is the case or not. The scenario can be thought of as a “clock ticker” – i.e., the producer provides a semi-regular signal for zero or more consumers to count. And all consumers must have the correct count over any given period of time. No polling by consumers is allowed (performance reasons). The ticker is just a few milliseconds (20 or so, but not perfectly regular).
Raymond Chen (The Old New Thing) has a blog post pointing out the “fundamentally flawed” nature of the PulseEvent() function, but I do not see an alternative for my scenario from Chen or the posted comments.
Can anyone please suggest one?
Please keep in mind that the IPC signal must cross process boundaries on the machine, not simply threads. And the solution needs to have high performance in that consumers must be able to act within 10ms of each event.
I think you're going to need something a little more complex to hit your reliability target.
My understanding of your problem is that you have one producer and an unknown number of consumers all of which are different processes. Each consumer can NEVER miss any events.
I'd like more clarification as to what missing an event means.
i) if a consumer started to run and got to just before it waited on your notification method and an event occurred should it process it even though it wasn't quite ready at the point that the notification was sent? (i.e. when is a consumer considered to be active? when it starts or when it processes its first event)
ii) likewise, if the consumer is processing an event and the code that waits on the next notification hasn't yet begun its wait (I'm assuming a Wait -> Process -> Loop to Wait code structure) then should it know that another event occurred whilst it was looping around?
I'd assume that i) is a "not really" as it's a race between process start up and being "ready" and ii) is "yes"; that is notifications are, effectively, queued per consumer once the consumer is present and each consumer gets to consume all events that are produced whilst it's active and doesn't get to skip any.
So, what you're after is the ability to send a stream of notifications to a set of consumers where a consumer is guaranteed to act on all notifications in that stream from the point where it acts on the first to the point where it shuts down. i.e. if the producer produces the following stream of notifications
1 2 3 4 5 6 7 8 9 0
and consumer a) starts up and processes 3, it should also process 4-0
if consumer b) starts up and processes 5 but is shut down after 9 then it should have processed 5,6,7,8,9
if consumer c) was running when the notifications began it should have processed 1-0
etc.
Simply pulsing an event won't work. If a consumer is not actively waiting on the event when the event is pulsed then it will miss the event, so we will fail if events are produced faster than we can loop around to wait on the event again.
Using a semaphore also won't work. If one consumer runs so much faster than another that it can loop back to the semaphore wait before the other has finished processing, and another notification arrives within that time, then one consumer could process an event more than once while another misses one. That is, you may well release 3 threads (if the producer knows there are 3 consumers), but you can't ensure that each consumer is released exactly once.
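The semaphore failure mode can be reproduced deterministically in a single thread. This is a minimal Python sketch in which threading.Semaphore stands in for a named Win32 semaphore and non-blocking acquires stand in for the consumers' waits; the interleaving is hard-coded to the unlucky case described above.

```python
import threading


def semaphore_race():
    """Returns (wake-ups seen by the fast consumer, wake-ups seen by the slow one)."""
    sem = threading.Semaphore(0)

    # One tick: the producer releases one permit per known consumer (two here).
    sem.release()
    sem.release()

    fast = 0
    slow = 0
    # The fast consumer takes a permit, finishes its work and loops back
    # to the semaphore before the slow consumer has even reached its wait.
    if sem.acquire(blocking=False):
        fast += 1
    if sem.acquire(blocking=False):
        fast += 1
    # The slow consumer finally arrives and finds no permit left.
    if sem.acquire(blocking=False):
        slow += 1
    return fast, slow
```

For a single tick, the fast consumer is released twice and the slow one not at all, which is exactly the miscount the scenario cannot tolerate.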
A ring buffer of events (tick counts) in shared memory with each consumer knowing the value of the event it last processed and with consumers alerted via a pulsed event should work at the expense of some of the consumers being out of sync with the ticks sometimes; that is if they miss one they will catch up next time they get pulsed. As long as the ring buffer is big enough so that all consumers can process the events before the producer loops in the buffer you should be OK.
With the example above, if consumer d misses the pulse for event 4 because it wasn't waiting on its event at the time, and it then settles into a wait, it will be woken when event 5 is produced; since its last processed count is 3, it will process 4 and 5 and then loop back to the event...
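The ring-buffer bookkeeping can be sketched as follows. This is a single-process Python simulation only: a plain list stands in for the shared-memory region, the class names are made up, and a real cross-process version would additionally need properly synchronised access to the head index.

```python
class TickRing:
    """Fixed-size ring of tick values with a monotonically growing head index."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0  # total number of ticks published so far

    def publish(self, tick):
        self.slots[self.head % len(self.slots)] = tick
        self.head += 1


class Consumer:
    """Remembers the index of the next tick it has not processed yet."""

    def __init__(self, ring):
        self.ring = ring
        self.next = ring.head  # start from the next tick to be published

    def catch_up(self):
        # Called whenever the consumer is woken: process everything that was
        # published since last time, including ticks whose pulse was missed.
        if self.ring.head - self.next > len(self.ring.slots):
            raise OverflowError("consumer was lapped by the producer")
        ticks = []
        while self.next < self.ring.head:
            ticks.append(self.ring.slots[self.next % len(self.ring.slots)])
            self.next += 1
        return ticks
```

A consumer that joined after tick 3 and slept through the pulse for tick 4 gets both 4 and 5 from its next catch_up() call; if the producer laps it, the overflow is detected rather than silently miscounted.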
If this isn't good enough then I'd suggest something like PGM via sockets to give you a reliable multicast; the advantage of this would be that you could move your consumers off onto different machines...
The reason PulseEvent is "unreliable" is not so much because of anything wrong in the function itself, just that if your consumer doesn't happen to be waiting on the event at the exact moment that PulseEvent is called, it'll miss it.
In your scenario, I think the best solution is to keep the counter yourself. So the producer thread keeps a count of the current "clock tick" and when a consumer thread starts up, it reads the current value of that counter. Then, instead of using PulseEvent, increment the "clock ticks" counter and use SetEvent to wake all threads waiting on the tick. When the consumer thread wakes up, it checks its "clock tick" value against the producer's "clock ticks" and it'll know how many ticks have elapsed. Just before it waits on the event again, it can check to see if another tick has occurred.
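A rough sketch of that counter scheme, in Python rather than Win32: threading.Condition stands in for the shared counter plus the event, and the Ticker class and count_ticks function are names invented for the illustration. The point is that the consumer derives its count from the counter, so a wake-up it misses costs nothing.

```python
import threading


class Ticker:
    """Producer-owned tick counter with a wake-up for waiting consumers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._ticks = 0

    def tick(self):
        with self._cond:
            self._ticks += 1
            self._cond.notify_all()

    def current(self):
        with self._cond:
            return self._ticks

    def wait_past(self, seen, timeout=None):
        # Checks the counter before sleeping, so a tick that happened while
        # the consumer was busy is noticed immediately. Returns the counter.
        with self._cond:
            self._cond.wait_for(lambda: self._ticks > seen, timeout=timeout)
            return self._ticks


def count_ticks(ticker, start, total, out):
    """Consumer loop: counts ticks from `start` until the counter reaches `total`."""
    seen = start
    while seen < total:
        now = ticker.wait_past(seen, timeout=5.0)
        if now == seen:
            break  # timed out without a new tick
        seen = now  # may advance by more than one tick per wake-up
    out.append(seen - start)
```

Even if the producer fires many ticks before the consumer thread gets scheduled, the consumer ends up with the right total because it compares counters instead of counting wake-ups.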
I'm not sure if I described the above very well, but hopefully that gives you an idea :)
There are two inherent problems with PulseEvent:
if it's used with auto-reset events, it releases one waiter only.
threads might never be awakened if they happen to be removed from the wait queue by an APC at the moment of the PulseEvent.
An alternative is to broadcast a window message and have each listener create a top-level message-only window that listens for this particular message.
The main advantage of this approach is that you don't have to block your thread explicitly. The disadvantage of this approach is that your listeners have to be STA (can't have a message queue on an MTA thread).
The biggest problem with that approach would be that the processing of the event by the listener will be delayed by the amount of time it takes the queue to get to that message.
You can also make sure you use manual-reset events (so that all waiting threads are awakened) and do SetEvent/ResetEvent with some small delay (say 150ms) to give a bigger chance for threads temporarily woken by APC to pick up your event.
Of course, whether any of these alternative approaches will work for you depends on how often you need to fire your events and whether you need the listeners to process each event or just the last one they get.
If I understand your question correctly, it seems like you can simply use SetEvent. It will release one thread. Just make sure it is an auto-reset event.
If you need to allow multiple threads, you could use a named semaphore with CreateSemaphore. Each call to ReleaseSemaphore increases the count. If the count is 3, for example, and 3 threads wait on it, they will all run.
Events are more suitable for communication between threads inside one process (unnamed events). As you have described, you have zero or more clients that need to read something interesting. I understand that the number of clients changes dynamically. In this case, the best choice is a named pipe.
Named Pipe is King
If you just need to send data to multiple processes, it’s better to use named pipes than events. Unlike auto-reset events, you don't need to create a separate pipe for each of the client processes. Each named pipe has an associated server process and one or more associated client processes (and even zero). When there are many clients, many instances of the same named pipe are automatically created by the operating system for each of the clients. All instances of a named pipe share the same pipe name, but each instance has its own buffers and handles, and provides a separate conduit for client/server communication. The use of instances enables multiple pipe clients to use the same named pipe simultaneously. Any process can act as both a server for one pipe and a client for another pipe, and vice versa, making peer-to-peer communication possible.
If you use a named pipe, there is no need for events at all in your scenario, and the data will have guaranteed delivery no matter what happens with the processes: each of the processes may experience long delays (e.g. from swapping), but the data will finally be delivered as soon as possible without any special involvement on your part.
On The Events
If you are still interested in the events -- the auto-reset event is king! ☺
The CreateEvent function has the bManualReset argument. If this parameter is TRUE, the function creates a manual-reset event object, which requires the use of the ResetEvent function to set the event state to non-signaled. This is not what you need. If this parameter is FALSE, the function creates an auto-reset event object, and the system automatically resets the event state to non-signaled after a single waiting thread has been released.
These auto-reset events are very reliable and easy to use.
If you wait for an auto-reset event object with WaitForMultipleObjects or WaitForSingleObject, it reliably resets the event upon exit from these wait functions.
So create events the following way:
EventHandle := CreateEvent(nil, FALSE, FALSE, nil);
Wait for the event from one thread and do SetEvent from another thread. This is very simple and very reliable.
Don’t ever call ResetEvent (since the event resets automatically) or PulseEvent (since it is unreliable and deprecated). Even Microsoft has admitted that PulseEvent should not be used. See https://msdn.microsoft.com/en-us/library/windows/desktop/ms684914(v=vs.85).aspx
This function is unreliable and should not be used, because only those threads will be notified that are in the "wait" state at the moment PulseEvent is called. If they are in any other state, they will not be notified, and you may never know for sure what the thread state is. A thread waiting on a synchronization object can be momentarily removed from the wait state by a kernel-mode Asynchronous Procedure Call, and then returned to the wait state after the APC is complete. If the call to PulseEvent occurs during the time when the thread has been removed from the wait state, the thread will not be released because PulseEvent releases only those threads that are waiting at the moment it is called.
You can find out more about the kernel-mode Asynchronous Procedure Calls at the following links:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms681951(v=vs.85).aspx
http://www.drdobbs.com/inside-nts-asynchronous-procedure-call/184416590
http://www.osronline.com/article.cfm?id=75
We have never used PulseEvent in our applications. As for auto-reset events, we have been using them since Windows NT 3.51 (although they appeared in the first 32-bit version of NT, 3.1) and they work very well.
Your Inter-Process Scenario
Unfortunately, your case is a little more complicated. You have multiple threads in multiple processes waiting for an event, and you have to make sure that all the threads did in fact receive the notification. There is no reliable way to do this other than to create a separate event for each consumer, so you will need as many events as there are consumers. Besides that, you will need to keep a list of registered consumers, where each consumer has an associated event name. To notify all the consumers, you call SetEvent in a loop over all the consumer events. This is very fast, reliable and cheap.
Since you are using cross-process communication, the consumers will have to register and de-register their events via some other means of inter-process communication, such as SendMessage. For example, when a consumer process registers itself with your main notifier process, it uses SendMessage to request a unique event name. You just increment a counter, return something like Event1, Event2, etc., and create an event with that name, so that the consumer can open the existing event. When the consumer de-registers, it closes the event handle that it opened before and sends another message to let you know that you should call CloseHandle on your side too, to finally release the event object.
If a consumer process crashes, you will end up with a dummy event, since you will not know that you should call CloseHandle, but this should not be a problem: events are very fast and very cheap, and there is virtually no limit on kernel objects (the per-process limit on kernel handles is 2^24). If you are still concerned, you may do the opposite: the clients create the events and you open them. If an event won’t open, then the client has crashed and you just remove it from the list.
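The registry part of this scheme can be sketched in a few lines. This is an in-process Python illustration only: threading.Event stands in for a named Win32 event (note that it behaves like a manual-reset event, so the consumer clears it itself), and the Notifier class with its method names is invented for the example. The cross-process registration messages are not modelled.

```python
import threading


class Notifier:
    """Keeps one event per registered consumer and signals them all in a loop."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = {}
        self._counter = 0

    def register(self):
        # Hands out a unique name (Event1, Event2, ...) and the matching event.
        with self._lock:
            self._counter += 1
            name = f"Event{self._counter}"
            event = threading.Event()
            self._events[name] = event
            return name, event

    def unregister(self, name):
        with self._lock:
            self._events.pop(name, None)

    def notify_all(self):
        # One set() per consumer, so nobody depends on being in a wait
        # at the exact moment the producer fires.
        with self._lock:
            for event in self._events.values():
                event.set()
```

A consumer would loop on event.wait() followed by event.clear(); a signal sent while it is busy stays latched in its own event until it comes back to wait.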
