Win32 Overlapped I/O - Completion routines or WaitForMultipleObjects? - winapi

I'm wondering which approach is faster and why.
While writing a Win32 server I have read a lot about completion ports and overlapped I/O, but I have not read anything that suggests which set of APIs yields the best results in a server.
Should I use completion routines, or should I use the WaitForMultipleObjects API, and why?

You suggest two methods of doing overlapped I/O and ignore the third (or I'm misunderstanding your question).
When you issue an overlapped operation, a WSARecv() for example, you can specify an OVERLAPPED structure which contains an event and you can wait for that event to be signalled to indicate the overlapped I/O has completed. This, I assume, is your WaitForMultipleObjects() approach and, as previously mentioned, this doesn't scale well as you're limited to the number of handles that you can pass to WaitForMultipleObjects().
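A minimal sketch of that event-per-operation approach (WSAStartup(), socket creation and error handling omitted; link with ws2_32.lib):

    #include <winsock2.h>

    // Issue an overlapped WSARecv() whose OVERLAPPED carries an event, then
    // wait on that event (possibly alongside others) with WaitForMultipleObjects().
    void RecvWithEvent(SOCKET sock)
    {
        char buffer[4096];
        WSABUF wsaBuf;
        wsaBuf.len = sizeof(buffer);
        wsaBuf.buf = buffer;

        WSAOVERLAPPED ov = {};
        ov.hEvent = WSACreateEvent();               // signalled when the I/O completes
        DWORD flags = 0;

        if (WSARecv(sock, &wsaBuf, 1, NULL, &flags, &ov, NULL) == SOCKET_ERROR &&
            WSAGetLastError() != WSA_IO_PENDING)
        {
            /* handle the failure */
        }

        HANDLE events[] = { ov.hEvent /* , ... up to MAXIMUM_WAIT_OBJECTS handles */ };
        if (WaitForMultipleObjects(1, events, FALSE, INFINITE) == WAIT_OBJECT_0)
        {
            DWORD bytes = 0, recvFlags = 0;
            WSAGetOverlappedResult(sock, &ov, &bytes, FALSE, &recvFlags);
            // 'bytes' bytes of data are now in 'buffer'
        }
        WSACloseEvent(ov.hEvent);
    }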
Alternatively you can pass a completion routine which is called when completion occurs. This is known as 'alertable I/O' and requires that the thread that issued the WSARecv() call is in an 'alertable' state for the completion routine to be called. Threads can put themselves in an alertable state in several ways (calling SleepEx() or the various EX versions of the Wait functions, etc). The Richter book that I have open in front of me says "I have worked with alertable I/O quite a bit, and I'll be the first to tell you that alertable I/O is horrible and should be avoided". Enough said IMHO.
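For completeness, the alertable-I/O variant looks roughly like this (a sketch only; names are illustrative, and the issuing thread must sit in SleepEx() or one of the ...Ex waits for the routine to ever run):

    #include <winsock2.h>

    // Runs on the thread that issued the WSARecv(), but only while that thread
    // is in an alertable wait state.
    void CALLBACK OnRecvComplete(DWORD error, DWORD bytesTransferred,
                                 LPWSAOVERLAPPED ov, DWORD flags)
    {
        // process 'bytesTransferred' bytes; the buffer is usually reached via a
        // structure that embeds the WSAOVERLAPPED
    }

    void RecvAlertable(SOCKET sock)
    {
        char buffer[4096];
        WSABUF wsaBuf;
        wsaBuf.len = sizeof(buffer);
        wsaBuf.buf = buffer;

        WSAOVERLAPPED ov = {};                      // no event required here
        DWORD flags = 0;
        WSARecv(sock, &wsaBuf, 1, NULL, &flags, &ov, OnRecvComplete);

        SleepEx(INFINITE, TRUE);                    // alertable wait; returns after the routine runs
    }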
There's a third way: before issuing the call you associate the handle that you want to do overlapped I/O on with a completion port. You then create a pool of threads which service this completion port by calling GetQueuedCompletionStatus() and looping. You issue your WSARecv() with an OVERLAPPED structure WITHOUT an event in it, and when the I/O completes the completion pops out of GetQueuedCompletionStatus() on one of your I/O pool threads and can be handled there.
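A minimal sketch of this third approach (error handling trimmed, names illustrative):

    #include <winsock2.h>

    // Worker thread: loops on GetQueuedCompletionStatus() handling whatever completes.
    DWORD WINAPI IoWorker(LPVOID param)
    {
        HANDLE port = (HANDLE)param;
        for (;;)
        {
            DWORD bytes = 0;
            ULONG_PTR key = 0;           // the per-connection value given at association time
            OVERLAPPED* ov = NULL;       // the OVERLAPPED passed to WSARecv()/WSASend()
            if (!GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE) && ov == NULL)
                break;                   // the port was closed, or the wait failed
            // handle the completion: 'key' identifies the connection,
            // 'ov' identifies the operation, 'bytes' is how much was transferred
        }
        return 0;
    }

    HANDLE StartIocpPool(SOCKET sock)
    {
        HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
        for (int i = 0; i < 4; ++i)                            // small fixed pool
            CreateThread(NULL, 0, IoWorker, iocp, 0, NULL);

        // associate the socket once, then issue WSARecv() calls whose OVERLAPPED has no event
        CreateIoCompletionPort((HANDLE)sock, iocp, (ULONG_PTR)sock, 0);
        return iocp;
    }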
As previously mentioned, Vista/Server 2008 have cleaned up how IOCPs work a little and removed the problem whereby you had to make sure that the thread that issued the overlapped request continued to run until the request completed. A link to a reference on that can be found here. But this problem is easy to work around anyway; you simply marshal the WSARecv over to one of your I/O pool threads using the same IOCP that you use for completions...
Anyway, IMHO using IOCPs is the best way to do overlapped I/O. Yes, getting your head around the overlapped/async nature of the calls can take a little time at the start but it's well worth it as the system scales very well and offers a simple "fire and forget" method of dealing with overlapped operations.
If you need some sample code to get you going then I have several articles on writing IO completion port systems and a heap of free code that provides a real-world framework for high performance servers; see here.
As an aside; IMHO, you really should read "Windows Via C/C++ (PRO-Developer)" by Jeffrey Richter and Christophe Nasarre as it deals with all you need to know about overlapped I/O and most other advanced Windows platform techniques and APIs.

WaitForMultipleObjects is limited to 64 handles; in a highly concurrent application this could become a limitation.
Completion ports fit better with a model of having a pool of threads all of which are capable of handling any event, and you can queue your own (non-IO based) events into the port, whereas with waits you would need to code your own mechanism.
However completion ports, and the event-based programming model, are a more difficult concept to really get to grips with.
I would not expect any significant performance difference, but in the end you can only make your own measurements to reflect your usage. Note that Vista/Server 2008 made a change with completion ports so that the originating thread is no longer needed to complete I/O operations; this may make a bigger difference (see this article by Mark Russinovich).

Table 6-3 in the book Network Programming for Microsoft Windows, 2nd Edition compares the scalability of overlapped I/O via completion ports vs. other techniques. Completion ports blow all the other I/O models out of the water when it comes to throughput, while using far fewer threads.

The difference between WaitForMultipleObjects() and I/O completion ports is that IOCP scales to thousands of objects, whereas WFMO() does not and should not be used for anything more than 64 objects (even though you could).
You can't really compare them for performance, because in the domain of < 64 objects, they will be essentially identical.
WFMO() however does a round-robin on its objects, so busy objects with low index numbers can starve objects with high index numbers. (E.g. if object 0 is going off constantly, it will starve objects 1, 2, 3, etc.) This is obviously undesirable.
I wrote an IOCP library (for sockets) to solve the C10K problem and put it in the public domain. I was able, on a 512 MB W2K machine, to get 4,000 sockets concurrently transferring data. (You can get a lot more sockets if they're idle - a busy socket consumes more non-paged pool, and that's the ultimate limit on how many sockets you can have.)
http://www.45mercystreet.com/computing/libiocp/index.html
The API should give you exactly what you need.

Not sure. But I use WaitForMultipleObjects and/or WaitForSingleObject. It's very convenient.

Either routine works, and I don't really think one is significantly faster than the other.
These two approaches exist to satisfy different programming models.
WaitForMultipleObjects is there to facilitate an async completion pattern (like the UNIX select() function), while completion ports are geared more towards an event-driven model.
I personally think the WaitForMultipleObjects() approach results in cleaner code and is more thread safe.

Related

How to set up a thread pool for an IOCP server with connections that may perform blocking operations

I'm currently working on a server application that's written in the proactor style, using select() + a dynamically sized thread pool (there's a simple mechanism based on keeping track of idle worker threads).
I need to modify it to use IOCP instead of select() on Windows, and I'm wondering what the best way to utilize threads is.
For background information, the server has stateful, long-lived connections, and any request may require significant processing, and block. In fact, most requests call into customer-written code, which may block at will.
I've read that the OS can tell when an IOCP thread blocks, and unblock another one, but it doesn't look like there's any support for creating additional threads under heavy load, or if many of the threads are blocked.
I've read one site which suggested having a small, fixed-size thread pool that uses IOCP to deal with I/O only, and which hands requests that can block off to another, dynamically sized thread pool. This seems non-optimal due to the additional thread synchronization required (although you can use an IOCP for the tasks of the second thread pool as well), and the larger number of threads needed (extra context switching).
Is that the best way?
It sounds like what you've read is one of my articles on IOCP (most probably this one). That's likely a bit out of date now as the whole problem that it sought to avoid (that of I/O being cancelled if the thread that issued it exits before the I/O completes) is no longer a problem with any of Microsoft's currently supported OSes (it's only an issue on XP and before).
You're correct in noticing that my design from 2000/2002 was suboptimal from a context-switching point of view; but it worked pretty well at the time, given the constraints of the underlying API.
On a modern OS there's no real advantage in having separate thread pools for I/O and blocking work. A more modern solution would probably involve dynamically expanding and reducing the number of I/O threads servicing the IOCP as required.
You'd need to track the number of IOCP threads that are active (i.e. not waiting on GetQueuedCompletionStatus()) and spawn more when there are "too few". Likewise, just as a thread is about to go back and wait on GQCS you could check to see if you have "too many" and, if so, let it die instead.
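A rough sketch of that idea, assuming a shared count of busy threads (the threshold and names are made up for illustration):

    #include <windows.h>

    static const LONG MAX_IDLE_THREADS = 2;   // illustrative threshold
    volatile LONG g_busy = 0;                 // threads currently processing a completion
    volatile LONG g_poolSize = 0;             // threads currently alive in the pool

    DWORD WINAPI DynamicIoWorker(LPVOID param)
    {
        HANDLE port = (HANDLE)param;
        InterlockedIncrement(&g_poolSize);
        for (;;)
        {
            DWORD bytes; ULONG_PTR key; OVERLAPPED* ov;
            if (!GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE) && ov == NULL)
                break;

            InterlockedIncrement(&g_busy);
            // ... do the (possibly blocking) work for this completion ...
            LONG stillBusy = InterlockedDecrement(&g_busy);

            // "too many" threads about to sit idle? let this one die instead of waiting again
            if (g_poolSize - stillBusy > MAX_IDLE_THREADS)
                break;
        }
        InterlockedDecrement(&g_poolSize);
        return 0;
    }

    // Elsewhere, whenever g_busy == g_poolSize (nobody left waiting on GQCS),
    // spawn another DynamicIoWorker so the port never goes unserviced.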
I should probably update those articles.

How can I tell Windows XP/7 not to switch threads during a certain segment of my code?

I want to prevent a thread switch by Windows XP/7 in a time-critical part of my code that runs in a background thread. I'm pretty sure I can't create a situation where I can guarantee that won't happen, because of higher priority interrupts from system drivers, etc. However, I'd like to decrease the probability of a thread switch during that part of my code to the minimum that I can. Are there any create-thread flags or Windows API calls that can assist me? General technique tips are appreciated too. If there is a way to get this done without having to raise the thread's priority to real-time-critical that would be great, since I worry about creating system performance issues for the user if I do that.
UPDATE: I am adding this update after seeing the first responses to my original post. The concrete application that motivated the question has to do with real-time audio streaming. I want to eliminate every bit of delay I can. I found after coding up my original design that a thread switch can cause a 70ms or more delay at times. Since my app sits between two sockets acting as a middleman for delivering audio, the instant I receive an audio buffer I want to immediately turn around and push it out to the destination socket. My original design used two cooperating threads and a semaphore, since there was one thread managing the source socket and another thread for the destination socket. This architecture evolved from the fact that the two devices behind the sockets are disparate entities.
I realized that if I combined the two sockets onto the same thread I could write a code block that reacted immediately to the socket-data-received message and turned it around to the destination socket in one shot. Now if I can do my best to avoid an intervening thread switch, that would be the optimal coding architecture for minimizing delay. To repeat, I know I can't guarantee this situation, but I am looking for tips/suggestions on how to write a block of code that does this and minimizes as best as I can the chance of an intervening thread switch.
Note, I am aware that O/S code behind the sockets introduces (potential) delays of its own.
AFAIK there are no such flags in CreateThread or the like (this also doesn't make much sense, IMHO). You may suspend other threads in your process during critical sections (by enumerating them and using SuspendThread), and in theory you could also enumerate and suspend threads in other processes.
OTOH suspending threads is generally not a good idea: eventually you may call some 3rd-party code that implicitly waits for something that should be accomplished by another thread, which you have suspended.
IMHO you should use what's suggested for this case: playing with thread/process priorities (you may also consider SetThreadPriorityBoost). Note that the OS tends to raise the priority of threads that don't use the CPU aggressively. That is, threads that run often but for short durations (before calling one of the waiting functions that suspend them until some condition holds) are considered to behave "nicely", and they get prioritized.
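If you do go down the priority route, the calls involved are SetThreadPriority() and SetThreadPriorityBoost(); a minimal sketch (TIME_CRITICAL deliberately avoided, per the question's concern):

    #include <windows.h>

    // Raise the current thread's priority around a latency-sensitive section
    // and restore it afterwards. SetThreadPriorityBoost() is the knob that
    // enables/disables the automatic boosting mentioned above, if you need it.
    void ForwardAudioWithRaisedPriority()
    {
        HANDLE thread = GetCurrentThread();
        int oldPriority = GetThreadPriority(thread);

        SetThreadPriority(thread, THREAD_PRIORITY_HIGHEST);

        // ... receive from the source socket and push it to the destination socket ...

        SetThreadPriority(thread, oldPriority);
    }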

I/O completion port's advantages and disadvantages

Why do many people say I/O completion port is a fast and nice model?
What are the I/O completion port's advantages and disadvantages?
I want to know some points which make the I/O completion port faster than other approaches.
If you can explain it by comparing it to other models (select, epoll, traditional multithread/multiprocess), even better.
I/O completion ports are awesome. There's no better word to describe them. If anything in Windows was done right, it's completion ports.
You can create some number of threads (does not really matter how many) and make them all block on one completion port until an event (either one you post manually, or an event from a timer or asynchronous I/O, or whatever) arrives. Then the completion port will wake one thread to handle the event, up to the limit that you specified. If you didn't specify anything, it will assume "up to number of CPU cores", which is really nice.
If there are already more threads active than the maximum limit, it will wait until one of them is done and then hand the event to the thread as soon as it goes to wait state. Also, it will always wake threads in a LIFO order, so chances are that caches are still warm.
In other words, completion ports are a no-fuss "poll for events" as well as "fill CPU as much as you can" solution.
You can throw file reads and writes at a completion port, sockets, or anything else that's waitable. And, you can post your own events if you want. Each custom event has at least one integer and one pointer worth of data (if you use the default structure), but you are not really limited to that as the system will happily accept any other structure too.
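Posting your own events is just PostQueuedCompletionStatus(); a sketch of the pattern (names illustrative), where the "integer and pointer worth of data" travel in the completion key and OVERLAPPED parameters:

    #include <windows.h>

    struct MyMessage { int what; void* payload; };

    // Producer side: hand a message to whichever worker thread next comes out
    // of GetQueuedCompletionStatus() on this port.
    void PostToPool(HANDLE iocp, MyMessage* msg)
    {
        // For custom events the bytes/key/overlapped values mean whatever you
        // decide; here the key marks "not real I/O" and the pointer rides in
        // place of an OVERLAPPED.
        PostQueuedCompletionStatus(iocp, 0, /*key*/ 1, (OVERLAPPED*)msg);
    }

    // Consumer side: a worker distinguishes custom events from real I/O by the
    // completion key and casts the OVERLAPPED pointer back to MyMessage*.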
Also, completion ports are fast, really really fast. Once upon a time, I needed to notify one thread from another. As it happened, that thread already had a completion port for file I/O, but it didn't pump messages. So, I wondered if I should just bite the bullet and use the completion port for simplicity, even though posting a thread message would obviously be much more efficient. I was undecided, so I benchmarked. Surprise, it turned out completion ports were about 3 times faster. So... faster and more flexible, the decision was not hard.
By using IOCP, we can overcome the "one thread per client" problem. It is commonly known that performance decreases heavily if the software does not run on a true multiprocessor machine. Threads are system resources that are neither unlimited nor cheap.
IOCP provides a way to have a few (I/O worker) threads handle multiple clients' input/output "fairly". The threads are suspended, and don't use CPU cycles until there is something to do.
You can also read some information in this nice book: http://www.amazon.com/Windows-System-Programming-Johnson-Hart/dp/0321256190
I/O completion ports are provided by the O/S as an asynchronous I/O operation, which means that it occurs in the background (usually in hardware). The system does not waste any resources (e.g. threads) waiting for the I/O to complete. When the I/O is complete, the hardware sends an interrupt to the O/S, which then wakes up the relevant process/thread to handle the result. WRONG: IOCP does NOT require hardware support (see comments below)
Typically a single thread can wait on a large number of I/O completions while taking up very little resources when the I/O has not returned.
Other async models that are not based on I/O completion ports usually employ a thread pool and have threads wait for I/O to complete, thereby using more system resources.
The flip side is that I/O completion ports usually require hardware support, and so they are not generally applicable to all async scenarios.

Efficient Overlapped I/O for a socket server

Which of these two different models would be more efficient (consider thrashing, utilization of the processor cache, overall design, everything, etc.)?
1 IOCP and spinning up X threads (where X is the number of processors the computer has). This would mean that my "server" would only have 1 IOCP (queue) for all requests and X threads to serve/handle them. I have read many articles discussing the efficiency of this design. With this model I would have 1 listener that would also be associated with the IOCP. Let's assume that I could figure out how to keep the packets/requests synchronized.
X IOCPs (where X is the number of processors the computer has), each with 1 thread. This would mean that each processor has its own queue and 1 thread to serve/handle it. With this model I would have a separate listener (not using IOCP) that would handle incoming connections and assign each SOCKET to the proper IOCP (one of the X that were created). Let's assume that I could figure out the load balancing.
Using an overly simplified analogy for the two designs (a bank):
One line with several cashiers to handle the transactions. Each person is in the same line and each cashier takes the next available person in line.
Each cashier has their own line and the people are "placed" into one of those lines.
Between these two designs, which one is more efficient? In each model the overlapped I/O structures would be allocated using VirtualAlloc with MEM_COMMIT (as opposed to "new"), so the swap file should not be an issue (no paging). Based on how it has been described to me, using VirtualAlloc with MEM_COMMIT, the memory is reserved and is not paged out. This would allow the SOCKETs to write the incoming data right to my buffers without going through intermediate layers. So I don't think thrashing should be a factor, but I might be wrong.
Someone was telling me that #2 would be more efficient but I have not heard of this model. Thanks in advance for your comments!
I assume that for #2 you plan to manually associate your sockets with an IOCP that you decide is 'best' based on some measure of 'goodness' at the time the socket is accepted? And that somehow this measure of 'goodness' will persist for the life of the socket?
With IOCP used the 'standard' way, i.e. your option number 1, the kernel works out how best to use the threads you have and allows more to run if any of them block. With your method, assuming you somehow work out how to distribute the work, you are going to end up with more threads running than with option 1.
Your #2 option also prevents you from using AcceptEx() for overlapped accepts and this is more efficient than using a normal accept loop as you remove a thread (and the resulting context switching and potential contention) from the scene.
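For reference, posting an overlapped accept with AcceptEx() looks roughly like this (mswsock.lib; error handling, the IOCP association and the post-completion SO_UPDATE_ACCEPT_CONTEXT step omitted):

    #include <winsock2.h>
    #include <mswsock.h>      // AcceptEx()

    struct AcceptContext
    {
        OVERLAPPED ov;
        SOCKET     acceptSocket;
        char       addresses[2 * (sizeof(sockaddr_in) + 16)];   // AcceptEx address space
    };

    // Post one overlapped accept on an already-listening socket; the completion
    // arrives on whatever IOCP the listening socket is associated with.
    void PostAccept(SOCKET listenSocket, AcceptContext* ctx)
    {
        ctx->acceptSocket = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP,
                                      NULL, 0, WSA_FLAG_OVERLAPPED);
        ZeroMemory(&ctx->ov, sizeof(ctx->ov));

        DWORD bytes = 0;
        AcceptEx(listenSocket, ctx->acceptSocket, ctx->addresses,
                 0,                               // accept without waiting for data
                 sizeof(sockaddr_in) + 16,        // local address slot
                 sizeof(sockaddr_in) + 16,        // remote address slot
                 &bytes, &ctx->ov);
        // WSA_IO_PENDING from WSAGetLastError() is the normal 'in progress' result
    }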
Your analogy breaks down; it's actually more a case of either having 1 queue with X bank tellers where you join the queue and know that you'll be seen in an efficient order as opposed to each teller having their own queue and you having to guess that the queue you join doesn't contain a whole bunch of people who want to open new accounts and the one next to you contains a whole bunch of people who only want to do some paying in. The single queue ensures that you get handled efficiently.
I think you're confused about MEM_COMMIT. It doesn't mean that the memory isn't in the paging file and won't be paged. The usual reason for using VirtualAlloc for overlapped buffers is to ensure alignment on page boundaries and so reduce the number of pages that are locked for I/O (a page-sized buffer can be allocated on a page boundary and so only take one page rather than happening to span two due to the memory manager deciding to use a block that doesn't start on a page boundary).
In general I think you're attempting to optimise something way ahead of schedule. Get an efficient server working using IOCP the normal way first and then profile it. I seriously doubt that you'll even need to worry about building your #2 version... Likewise, use new to allocate your buffers to start with and then switch to the added complexity of VirtualAlloc() when you find that your server fails due to ENOBUFS and you're sure that's caused by the I/O locked-page limit and not lack of non-paged pool (you do realise that you have to allocate in 'allocation granularity' sized chunks for VirtualAlloc()?).
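For what it's worth, the page-aligned allocation being discussed is just this (a sketch):

    #include <windows.h>

    // VirtualAlloc returns memory aligned to the allocation granularity (usually
    // 64KB), so a page-sized buffer is guaranteed not to straddle two pages.
    void* AllocateIoBuffer()
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        return VirtualAlloc(NULL, si.dwPageSize,
                            MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    }
    // release later with VirtualFree(buffer, 0, MEM_RELEASE)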
Anyway, I have a free IOCP server framework that's available here: http://www.serverframework.com/products---the-free-framework.html which might help you get started.
Edited: The complex version that you suggest could be useful in some NUMA architectures where you use NIC teaming to have the switch split your traffic across multiple NICs, bind each NIC to a different physical processor and then bind your IOCP threads to the same processor. You then allocate memory from that NUMA node and effectively have your network switch load-balance your connections across your NUMA nodes. I'd still suggest that it's better, IMHO, to get a working server which you can profile using the "normal" method of using IOCP first, and only once you know that cross-NUMA-node issues are actually affecting your performance move towards the more complex architecture...
Queuing theory tells us that a single queue has better characteristics than multiple queues. You could possibly get around this with work-stealing.
The multiple queues method should have better cache behavior. Whether it is significantly better depends on how many received packets are associated with a single transaction. If a request fits in a single incoming packet, then it'll be associated to a single thread even with the single IOCP approach.

Is it better to poll or wait?

I have seen a question on why "polling is bad". In terms of minimizing the amount of processor time used by one thread, would it be better to do a spin wait (i.e. poll for a required change in a while loop) or wait on a kernel object (e.g. a kernel event object in windows)?
For context, assume that the code would be required to run on any type of processor, single core, hyperthreaded, multicore, etc. Also assume that a thread that would poll or wait can't continue until the polling result is satisfactory if it polled instead of waiting. Finally, the time between when a thread starts waiting (or polling) and when the condition is satisfied can potentially vary from a very short time to a long time.
Since the OS is likely to more efficiently "poll" in the case of "waiting", I don't want to see the "waiting just means someone else does the polling" argument, that's old news, and is not necessarily 100% accurate.
Provided the OS has reasonable implementations of these type of concurrency primitives, it's definitely better to wait on a kernel object.
Among other reasons, this lets the OS know not to schedule the thread in question for additional timeslices until the object being waited-for is in the appropriate state. Otherwise, you have a thread which is constantly getting rescheduled, context-switched-to, and then running for a time.
You specifically asked about minimizing the processor time for a thread: in this example the thread blocking on a kernel object would use ZERO time; the polling thread would use all sorts of time.
Furthermore, the "someone else is polling" argument needn't be true. When a kernel object enters the appropriate state, the kernel can look to see at that instant which threads are waiting for that object...and then schedule one or more of them for execution. There's no need for the kernel (or anybody else) to poll anything in this case.
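To make the contrast concrete, a minimal (illustrative) sketch of waiting on a kernel event rather than polling for a flag:

    #include <windows.h>

    HANDLE g_ready;   // auto-reset event

    DWORD WINAPI Waiter(LPVOID)
    {
        // Blocks using no CPU at all until another thread calls SetEvent();
        // the kernel wakes this thread directly when the object is signalled.
        WaitForSingleObject(g_ready, INFINITE);
        // ... handle the condition ...
        return 0;
    }

    int main()
    {
        g_ready = CreateEvent(NULL, FALSE, FALSE, NULL);
        HANDLE t = CreateThread(NULL, 0, Waiter, NULL, 0, NULL);

        // ... later, when the condition becomes true:
        SetEvent(g_ready);

        WaitForSingleObject(t, INFINITE);
        CloseHandle(t);
        CloseHandle(g_ready);
        return 0;
    }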
Waiting is the "nicer" way to behave. When you are waiting on a kernel object your thread won't be granted any CPU time, as the scheduler knows there is no work ready. Your thread is only going to be given CPU time when its wait condition is satisfied, which means you won't be hogging CPU resources needlessly.
I think a point that hasn't been raised yet is that if your OS has a lot of work to do, blocking yields your thread to another process. If all processes use the blocking primitives where they should (such as kernel waits, file/network I/O, etc.), you're giving the kernel more information to choose which threads should run. As such, it will do more work in the same amount of time. If your application could be doing something useful while waiting for that file to open or the packet to arrive, then yielding will even help your own app.
Waiting does involve more resources and means an additional context switch. Indeed, some synchronization primitives like CLR monitors and Win32 critical sections use a two-phase locking protocol - some spin waiting is done before actually doing a true wait.
I imagine doing the two-phase thing yourself would be very difficult, and would involve lots of testing and research. So, unless you have the time and resources, stick to the Windows primitives... they already did the research for you.
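In fact Win32 exposes the two-phase behaviour directly: a critical section can be given a spin count, so it spins briefly before falling back to a kernel wait. A minimal sketch:

    #include <windows.h>

    CRITICAL_SECTION g_lock;

    void InitLock()
    {
        // Spin up to 4000 times before blocking in the kernel - the same
        // spin-then-wait protocol described above, provided by the API itself.
        InitializeCriticalSectionAndSpinCount(&g_lock, 4000);
    }

    void DoWork()
    {
        EnterCriticalSection(&g_lock);   // phase 1: spin; phase 2: block
        // ... short critical section ...
        LeaveCriticalSection(&g_lock);
    }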
There are only few places, usually within the OS low-level things (interrupt handlers/device drivers) where spin-waiting makes sense/is required. General purpose applications are always better off waiting on some synchronization primitives like mutexes/conditional variables/semaphores.
I agree with Darksquid: if your OS has decent concurrency primitives then you shouldn't need to poll. Polling usually comes into its own on real-time systems or on restricted hardware that doesn't have an OS; there you need to poll, because you might not have the option to wait(), but also because polling gives you fine-grained control over exactly how long you wait in a particular state, as opposed to being at the mercy of the scheduler.
Waiting (blocking) is almost always the best choice ("best" in the sense of making efficient use of processing resources and minimizing the impact to other code running on the same system). The main exceptions are:
When the expected polling duration is small (similar in magnitude to the cost of the blocking syscall).
Mostly in embedded systems, when the CPU is dedicated to performing a specific task and there is no benefit to having the CPU idle (e.g. some software routers built in the late '90s used this approach.)
Polling is generally not used within OS kernels to implement blocking system calls - instead, events (interrupts, timers, actions on mutexes) result in a blocked process or thread being made runnable.
There are four basic approaches one might take:
Use some OS waiting primitive to wait until the event occurs
Use some OS timer primitive to check at some defined rate whether the event has occurred yet
Repeatedly check whether the event has occurred, but use an OS primitive to yield a time slice for an arbitrary and unknown duration any time it hasn't.
Repeatedly check whether the event has occurred, without yielding the CPU if it hasn't.
When #1 is practical, it is often the best approach unless delaying one's response to the event might be beneficial. For example, if one is expecting to receive a large amount of serial port data over the course of several seconds, and if processing data 100ms after it is sent will be just as good as processing it instantly, periodic polling using one of the latter two approaches might be better than setting up a "data received" event.
Approach #3 is rather crude, but may in many cases be a good one. It will often waste more CPU time and resources than would approach #1, but it will in many cases be simpler to implement and the resource waste will in many cases be small enough not to matter.
Approach #2 is often more complicated than #3, but has the advantage of being able to handle many resources with a single timer and no dedicated thread.
Approach #4 is sometimes necessary in embedded systems, but is generally very bad unless one is directly polling hardware and the CPU won't have anything useful to do until the event in question occurs. In many circumstances, it won't be possible for the condition being waited upon to occur until the thread waiting for it yields the CPU. Yielding the CPU as in approach #3 will in fact allow the waiting thread to see the event sooner than hogging it would.
