Waiting for GUI events and IOCP notifications at once - windows

Can I wait for GUI events — that is, pump a message loop — and on an I/O completion port at the same time? I would like to integrate libuv with the Windows GUI.

There are two solutions that I know of. One works on all versions of Windows, but involves multiple threads. The other is faster, but only supports Windows 10+ (thank you #RbMm for this fact).
Another thread calls GetQueuedCompletionStatusEx in a loop, and sends messages to the main thread with SendMessage. The main thread reads the messages from it's message loop, notes the custom message type, and dispatches the I/O completions.
This solution works on all versions of Windows, but is slower. However, if one is willing to trade latency for throughput, one can increase the GetQueuedCompletionStatusEx receive buffer to recover nearly the same throughput as the second solution. For best performance, both threads should use the same CPU, to avoid playing cache ping-pong with the I/O completions.
The main thread uses MsgWaitForMultipleObjectsEx to wait for the completion port to be signaled or user input to arrive. Once it is signaled, the main thread calls GetQueuedCompletionStatusEx with a zero timeout.
This assumes that an IOCP that is used by only one thread becomes signaled precisely when an I/O completion arrives. This is only true on Windows 10 and up. Otherwise, you will busyloop, since the IOCP will always be signaled. On systems that support this method, it should be faster, since it reduces scheduling overhead.

Related

How to set up a thread pool for an IOCP server with connections that may perform blocking operations

I'm currently working on a server application that's written in the proactor style, using select() + a dynamically sized thread pool (there's a simple mechanism based on keeping track of idle worker threads).
I need to modify it to use IOCP instead of select() on windows, and I'm wondering what the best way to utilize threads is.
For background information, the server has stateful, long-lived connections, and any request may require significant processing, and block. In fact, most requests call into customer-written code, which may block at will.
I've read that the OS can tell when an IOCP thread blocks, and unblock another one, but it doesn't look like there's any support for creating additional threads under heavy load, or if many of the threads are blocked.
I've read one site which suggested that you have a small, fixed-size thread pool which uses IOCP to deal with I/O only, which sends requests which can block to another, dynamically-sized thread pool. This seems non-optimal due to the additional thread synchronization required (although you can use IOCP as well for the tasks for the second thread pool), and the larger number of threads needed (extra context switching).
Is that the best way?
It sounds like what you've read is one of my articles on IOCP (most probably this one). That's likely a bit out of date now as the whole problem that it sort to avoid (that of I/O being cancelled if the thread that issued it exits before the I/O completes) is no longer a problem with any of Microsoft's currently supported OS's (it's only an issue on XP and before).
You're correct in noticing that my design from 2000/2002 was sub optimal from a context switching point of view; but it worked pretty well at the time, given the constraints of the underlying API.
On a modern OS there's no real advantage in having separate thread pools for I/O and blocking work. A more modern solution would probably involve dynamically expanding and reducing the number of I/O threads servicing the IOCP as required.
You'd need to track the number of IOCP threads that are active (i.e. not waiting on GetQueuedCompletionStatus() ) and spawn more when there are "too few". Likewise just as a thread is about to go back and wait on GQCS you could check to see if you have "too many" and if so, let it die instead.
I should probably update those articles.

How are sleeping threads woken, at the lowest level?

I've wondered about this for a very long time.
I understand that GUI programming is event-driven. I understand that most GUI programms will feature an event loop which loops through all events on the message queue. I also understand that it does so by calling some kind of Operating System method like "get_message()", which will block the thread until a message is received. In this sense, when no events are happening, the thread is sleeping peacefully.
My question, however, is: how does the Operating System check for available messages?Somewhere down the stack I assume there must be a loop which is continually checking for new events. Such a loop cannot possibly feature any blocking, because if so, there must be another looping thread which is 'always-awake', ready to wake the first. However, I also appreciate that this cannot be true, because otherwise I would expect to see 100% of at least one processor core in use at all times, checking over and over and over and over....
I have considered that perhaps the checking thread has a small sleep between each iteration. This would certainly explain why an idle system isn't using 100% CPU. But then I recalled how events are usually responded to immediately. Take a mouse movement for example: the cursor is being constantly redrawn, in sync with the physical movements.
Is there something fundamental, perhaps, in CPU architectures that allows threads to be woken at the hardware level, when certain memory addresses change value?
I'm otherwise out of ideas! Could anyone please help to explain what's really happening?
Yes there is: hardware interrupts.
When a key is pressed or the mouse is moved, or a network packet arrives, or data is read from some other device, or a timer elapses, the OS receives a hardware interrupt.
Threads or applications wanting to do I/O have to call a function in the OS, which returns the requested data, or, suspends the calling thread if the data is not available yet. This suspension simply means the thread is not considered for scheduling, until some condition changes - in this case, the requested data must be available. Such threads are said to be 'IO blocked'.
When the OS receives an interrupt indicating some device has some data, it looks through it's list of suspended threads to see if there is one that is suspended because it is waiting for that data, and then removes the suspension,
making it eligible for scheduling again.
In this interrupt-driven way, no CPU time is wasted 'polling' for data.

In windows, what does the CPU do while blocking

One has blocking calls whenever the CPU is waiting for some system to respond, e.g. waiting for an internet request. Is the CPU literally wasting time during these calls (I don't know whether there are machine instructions other than no-op that would correspond to the CPU literally wasting time). If not, what is it doing?
The thread is simply skipped when the operating system scheduler looks for work to hand off to a core. With the very common outcome that nothing needs to be done. The processor core then executes the HLT instruction.
In the HALT state it consumes (almost) no power. An interrupt is required to bring it back alive. Most typically that will be the clock interrupt, it ticks 64 times per second by default. It could be a device interrupt. The scheduler then again looks for work to do. Rinse and repeat.
Basically, the kernel maintains run queues or something similar to schedule threads. Each thread receives a time slice where it gets to execute until it expires or it volontarily yields its slice. When a thread yields or its slice expires, the scheduler decides which thread gets to execute next.
A blocking system call would result in a yield. It would also result in the thread being removed from the run queue and placed in a sleep/suspend queue where it is not eligible to receive time slices. It would remain in the sleep/suspend queue until some critiera is met (e.g. timer tick, data available on socket, etc.). Once the criteria is met, it'd be placed back into the run queue.
Sleep(1); // Yield, install a timer, and place the thread in a sleep queue.
As long as there are tasks in any of the run queues (there may be more than one, commonly one per processor core), the scheduler will keep handing out time slices. Depending on scheduler design and hardware constraints, these time slices may vary in length.
When there are no tasks in the run queue, the core can enter a powersaving state until an interrupt is received.
In essence, the processor never wastes time. Its either executing other threads, servicing interrupts or in a powersaving state (even for very short durations).
While a thread is blocked, especially if it is blocked on an efficient wait object that puts the blocked thread to sleep, the CPU is busy servicing other threads in the system. If there are no application threads running, there is always system threads running. The CPU is never truly idle.

I/O completion port's advantages and disadvantages

Why do many people say I/O completion port is a fast and nice model?
What are the I/O completion port's advantages and disadvantages?
I want to know some points which make the I/O completion port faster than other approaches.
If you can explain it comparing to other models (select, epoll, traditional multithread/multiprocess), it would be better.
I/O completion ports are awesome. There's no better word to describe them. If anything in Windows was done right, it's completion ports.
You can create some number of threads (does not really matter how many) and make them all block on one completion port until an event (either one you post manually, or an event from a timer or asynchronous I/O, or whatever) arrives. Then the completion port will wake one thread to handle the event, up to the limit that you specified. If you didn't specify anything, it will assume "up to number of CPU cores", which is really nice.
If there are already more threads active than the maximum limit, it will wait until one of them is done and then hand the event to the thread as soon as it goes to wait state. Also, it will always wake threads in a LIFO order, so chances are that caches are still warm.
In other words, completion ports are a no-fuss "poll for events" as well as "fill CPU as much as you can" solution.
You can throw file reads and writes at a completion port, sockets, or anything else that's waitable. And, you can post your own events if you want. Each custom event has at least one integer and one pointer worth of data (if you use the default structure), but you are not really limited to that as the system will happily accept any other structure too.
Also, completion ports are fast, really really fast. Once upon a time, I needed to notify one thread from another. As it happened, that thread already had a completion port for file I/O, but it didn't pump messages. So, I wondered if I should just bite the bullet and use the completion port for simplicity, even though posting a thread message would obviously be much more efficient. I was undecided, so I benchmarked. Surprise, it turned out completion ports were about 3 times faster. So... faster and more flexible, the decision was not hard.
by using IOCP, we can overcome the "one-thread-per-client" problem. It is commonly known that the performance decreases heavily if the software does not run on a true multiprocessor machine. Threads are system resources that are neither unlimited nor cheap.
IOCP provides a way to have a few (I/O worker) threads handle multiple clients' input/output "fairly". The threads are suspended, and don't use the CPU cycles until there is something to do.
Also you can read some information in this nice book http://www.amazon.com/Windows-System-Programming-Johnson-Hart/dp/0321256190
I/O completion ports are provided by the O/S as an asynchronous I/O operation, which means that it occurs in the background (usually in hardware). The system does not waste any resources (e.g. threads) waiting for the I/O to complete. When the I/O is complete, the hardware sends an interrupt to the O/S, which then wakes up the relevant process/thread to handle the result. WRONG: IOCP does NOT require hardware support (see comments below)
Typically a single thread can wait on a large number of I/O completions while taking up very little resources when the I/O has not returned.
Other async models that are not based on I/O completion ports usually employ a thread pool and have threads wait for I/O to complete, thereby using more system resources.
The flip side is that I/O completion ports usually require hardware support, and so they are not generally applicable to all async scenarios.

Is it better to poll or wait?

I have seen a question on why "polling is bad". In terms of minimizing the amount of processor time used by one thread, would it be better to do a spin wait (i.e. poll for a required change in a while loop) or wait on a kernel object (e.g. a kernel event object in windows)?
For context, assume that the code would be required to run on any type of processor, single core, hyperthreaded, multicore, etc. Also assume that a thread that would poll or wait can't continue until the polling result is satisfactory if it polled instead of waiting. Finally, the time between when a thread starts waiting (or polling) and when the condition is satisfied can potentially vary from a very short time to a long time.
Since the OS is likely to more efficiently "poll" in the case of "waiting", I don't want to see the "waiting just means someone else does the polling" argument, that's old news, and is not necessarily 100% accurate.
Provided the OS has reasonable implementations of these type of concurrency primitives, it's definitely better to wait on a kernel object.
Among other reasons, this lets the OS know not to schedule the thread in question for additional timeslices until the object being waited-for is in the appropriate state. Otherwise, you have a thread which is constantly getting rescheduled, context-switched-to, and then running for a time.
You specifically asked about minimizing the processor time for a thread: in this example the thread blocking on a kernel object would use ZERO time; the polling thread would use all sorts of time.
Furthermore, the "someone else is polling" argument needn't be true. When a kernel object enters the appropriate state, the kernel can look to see at that instant which threads are waiting for that object...and then schedule one or more of them for execution. There's no need for the kernel (or anybody else) to poll anything in this case.
Waiting is the "nicer" way to behave. When you are waiting on a kernel object your thread won't be granted any CPU time as it is known by the scheduler that there is no work ready. Your thread is only going to be given CPU time when it's wait condition is satisfied. Which means you won't be hogging CPU resources needlessly.
I think a point that hasn't been raised yet is that if your OS has a lot of work to do, blocking yeilds your thread to another process. If all processes use the blocking primitives where they should (such as kernel waits, file/network IO etc.) you're giving the kernel more information to choose which threads should run. As such, it will do more work in the same amount of time. If your application could be doing something useful while waiting for that file to open or the packet to arrive then yeilding will even help you're own app.
Waiting does involve more resources and means an additional context switch. Indeed, some synchronization primitives like CLR Monitors and Win32 critical sections use a two-phase locking protocol - some spin waiting is done fore actually doing a true wait.
I imagine doing the two-phase thing would be very difficult, and would involve lots of testing and research. So, unless you have the time and resources, stick to the windows primitives...they already did the research for you.
There are only few places, usually within the OS low-level things (interrupt handlers/device drivers) where spin-waiting makes sense/is required. General purpose applications are always better off waiting on some synchronization primitives like mutexes/conditional variables/semaphores.
I agree with Darksquid, if your OS has decent concurrency primitives then you shouldn't need to poll. polling usually comes into it's own on realtime systems or restricted hardware that doesn't have an OS, then you need to poll, because you might not have the option to wait(), but also because it gives you finegrain control over exactly how long you want to wait in a particular state, as opposed to being at the mercy of the scheduler.
Waiting (blocking) is almost always the best choice ("best" in the sense of making efficient use of processing resources and minimizing the impact to other code running on the same system). The main exceptions are:
When the expected polling duration is small (similar in magnitude to the cost of the blocking syscall).
Mostly in embedded systems, when the CPU is dedicated to performing a specific task and there is no benefit to having the CPU idle (e.g. some software routers built in the late '90s used this approach.)
Polling is generally not used within OS kernels to implement blocking system calls - instead, events (interrupts, timers, actions on mutexes) result in a blocked process or thread being made runnable.
There are four basic approaches one might take:
Use some OS waiting primitive to wait until the event occurs
Use some OS timer primitive to check at some defined rate whether the event has occurred yet
Repeatedly check whether the event has occurred, but use an OS primitive to yield a time slice for an arbitrary and unknown duration any time it hasn't.
Repeatedly check whether the event has occurred, without yielding the CPU if it hasn't.
When #1 is practical, it is often the best approach unless delaying one's response to the event might be beneficial. For example, if one is expecting to receive a large amount of serial port data over the course of several seconds, and if processing data 100ms after it is sent will be just as good as processing it instantly, periodic polling using one of the latter two approaches might be better than setting up a "data received" event.
Approach #3 is rather crude, but may in many cases be a good one. It will often waste more CPU time and resources than would approach #1, but it will in many cases be simpler to implement and the resource waste will in many cases be small enough not to matter.
Approach #2 is often more complicated than #3, but has the advantage of being able to handle many resources with a single timer and no dedicated thread.
Approach #4 is sometimes necessary in embedded systems, but is generally very bad unless one is directly polling hardware and the won't have anything useful to do until the event in question occurs. In many circumstances, it won't be possible for the condition being waited upon to occur until the thread waiting for it yields the CPU. Yielding the CPU as in approach #3 will in fact allow the waiting thread to see the event sooner than would hogging it.

Resources