Is there in general a speed or performance difference in blocking and non-blocking Winsock TCP Sockets?
I can find the differences between the two socket modes, but I haven't found a detailed performance comparison between them.
Because it isn't about speed. The write and read operations are just memory copying in disguise: all they do is copy data to and from the kernel, respectively. That is, they don't actually send or receive anything themselves.
The blocking vs. non-blocking choice asks: do you prefer these operations to block until they can complete, or to return -1 and EAGAIN (SOCKET_ERROR and WSAEWOULDBLOCK, in Winsock terms) when they can't be performed immediately? For example, you read from a socket but there's nothing in the receive buffer. Do you prefer to have recv hang until something arrives, or to return immediately with that error?
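A minimal sketch of what that looks like on the Winsock side (error handling trimmed; WSAStartup() is assumed to have been called already):

```cpp
// Minimal sketch: putting a Winsock socket into non-blocking mode and
// handling the "can't complete right now" case.
#include <winsock2.h>

void ReadNonBlocking(SOCKET s)
{
    u_long nonBlocking = 1;
    ioctlsocket(s, FIONBIO, &nonBlocking);   // switch the socket to non-blocking

    char buf[4096];
    int n = recv(s, buf, sizeof buf, 0);
    if (n == SOCKET_ERROR) {
        if (WSAGetLastError() == WSAEWOULDBLOCK) {
            // The receive buffer is empty. A blocking socket would have
            // parked the thread here; a non-blocking one returns at once,
            // and you retry later (typically when select()/WSAPoll()
            // reports the socket readable).
        }
    } else if (n == 0) {
        // The peer closed the connection gracefully.
    } else {
        // n bytes were copied out of the kernel's receive buffer into buf.
    }
}
```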
In my experience non-blocking Winsock operations are slightly slower but much more scalable. Non-blocking I/O (with IOCP) costs two system calls plus some dispatching at the application level, whereas blocking I/O costs one system call. With many concurrent connections, though, non-blocking I/O comes out much faster, because the architecture scales far better if implemented well.
If you need to transfer data point-to-point at maximum bandwidth, use blocking I/O. If you need to handle many concurrent client connections, use non-blocking I/O. Don't expect too much from either of them.
In general this is more about "event-driven vs. threaded" server architecture than about "blocking vs. non-blocking". There is no universal server architecture that can be used in every situation; it depends on the application.
Related
I am contemplating inter-process sharing of custom objects. My current implementation uses ZeroMQ where the objects are packed into a message and sent from process A to process B.
I am wondering whether it would be faster instead to have a concurrent container implemented using boost::interprocess (where process A will insert into the container and process B will retrieve from it). Not sure if this will be faster than having to serialise the object in process A and then de-serialising it in process B.
Just wondering if anyone has done benchmarking? Is it conceptually right to compare the two?
In principle, ZeroMQ should be slower, because the metaphor it uses is message passing. Libraries of this kind are not intended for sharing regions of memory in place, with different processes able to modify them concurrently.
Specifically, you mentioned "packing". When sharing memory regions, you can, ideally, avoid any packing and just work on the data as-is (with the care necessary for concurrent use of the same data structures, of course: using offsets instead of pointers, and so on).
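To illustrate the "work on data as-is" point, here's a minimal boost::interprocess sketch (the segment name, object name and element type are invented for the example, and synchronisation between the processes is omitted):

```cpp
// Two processes sharing a vector in place, with no packing/serialising step.
#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>

namespace bip = boost::interprocess;

// The shared-memory allocator makes the vector's internal pointers
// shared-memory safe (they are stored as offsets, not raw addresses).
typedef bip::allocator<int, bip::managed_shared_memory::segment_manager> ShmAlloc;
typedef bip::vector<int, ShmAlloc> SharedVec;

// Process A: create the segment and construct the container inside it.
void producer()
{
    bip::managed_shared_memory seg(bip::create_only, "demo_segment", 65536);
    SharedVec* v = seg.construct<SharedVec>("data")(ShmAlloc(seg.get_segment_manager()));
    v->push_back(42);                    // written straight into shared memory
}

// Process B: open the same segment and read the object as-is, no unpacking.
void consumer()
{
    bip::managed_shared_memory seg(bip::open_only, "demo_segment");
    SharedVec* v = seg.find<SharedVec>("data").first;
    int first = (*v)[0];                 // the very bytes process A wrote
    (void)first;
}
```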
Also note that even when the sharing is a one-directional back-and-forth (i.e. only one process at a time accesses any of the data), ZeroMQ can only match IPC shared memory if it supports zero-copy all the way down. This is not clear to me from the FAQ page on zero-copy (but it may be the case anyway).
I agree with Nim, they're too different for easy comparison.
ZeroMQ has inproc which uses shared memory as a byte transport.
Boost.Interprocess seems to be mostly about constructing objects in shared memory, accessible to multiple processes/threads. It does have message queues, but they too are just byte transports requiring objects to be serialised, just as with ZeroMQ. Those queues are not object containers: they are the part most comparable to ZeroMQ, and quite a long way from what Boost.Interprocess as a whole represents.
I have done a ZeroMQ / STL container hybrid. Yeurk. I used a C++ STL queue to store objects, but a ZeroMQ PUSH/PULL socket pair to govern which thread could read from that queue. Reading threads blocked on a ZeroMQ poll, and when one received a message it would lock the queue and read an object out of it. This avoided having to serialise objects, which was handy, so it was pretty fast. It doesn't work for PUB/SUB, though, which implies copying objects between recipients and would therefore need object serialisation.
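A rough reconstruction of that idea, not the original code (the endpoint name is invented, and the Work type is a stand-in):

```cpp
// Objects stay in an ordinary std::queue and are never serialised; an
// inproc PUSH/PULL pair is used purely to signal readers and pick which
// thread gets to read next.
#include <zmq.h>
#include <mutex>
#include <queue>

struct Work { int id; };                 // stand-in for an arbitrary object

static std::queue<Work> g_queue;
static std::mutex       g_queueLock;

// Writer side: push the object, then send an empty token as a wake-up.
void Enqueue(void* pushSocket, Work w)
{
    {
        std::lock_guard<std::mutex> lk(g_queueLock);
        g_queue.push(w);                 // the object itself never leaves memory
    }
    zmq_send(pushSocket, "", 0, 0);      // token: "one item is ready"
}

// Reader side: block on ZeroMQ, then lock the queue and pop one object.
Work Dequeue(void* pullSocket)
{
    char token[1];
    zmq_recv(pullSocket, token, sizeof token, 0);   // blocks until signalled
    std::lock_guard<std::mutex> lk(g_queueLock);
    Work w = g_queue.front();
    g_queue.pop();
    return w;
}

// Setup (one context; the PUSH must bind before the PULL connects, as
// inproc requires on older libzmq versions):
//   void* ctx  = zmq_ctx_new();
//   void* push = zmq_socket(ctx, ZMQ_PUSH);  zmq_bind(push, "inproc://work");
//   void* pull = zmq_socket(ctx, ZMQ_PULL);  zmq_connect(pull, "inproc://work");
```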
ZeroMQ's ipc transport is effective only on Linux (it uses UNIX domain sockets).
Its performance is slower than boost::interprocess shared memory.
I'm currently working on a server application that's written in the proactor style, using select() + a dynamically sized thread pool (there's a simple mechanism based on keeping track of idle worker threads).
I need to modify it to use IOCP instead of select() on windows, and I'm wondering what the best way to utilize threads is.
For background information, the server has stateful, long-lived connections, and any request may require significant processing, and block. In fact, most requests call into customer-written code, which may block at will.
I've read that the OS can tell when an IOCP thread blocks, and unblock another one, but it doesn't look like there's any support for creating additional threads under heavy load, or if many of the threads are blocked.
I've read one site which suggested having a small, fixed-size thread pool that uses IOCP to deal with I/O only, and which hands requests that may block over to a second, dynamically-sized thread pool. This seems non-optimal due to the additional thread synchronization required (although you can use an IOCP for the second pool's task queue as well) and the larger number of threads needed (extra context switching).
Is that the best way?
It sounds like what you've read is one of my articles on IOCP (most probably this one). That's likely a bit out of date now, as the whole problem it sought to avoid (that of I/O being cancelled if the thread that issued it exits before the I/O completes) is no longer a problem with any of Microsoft's currently supported OSes (it's only an issue on XP and earlier).
You're correct in noticing that my design from 2000/2002 was suboptimal from a context-switching point of view; but it worked pretty well at the time, given the constraints of the underlying API.
On a modern OS there's no real advantage in having separate thread pools for I/O and blocking work. A more modern solution would probably involve dynamically expanding and reducing the number of I/O threads servicing the IOCP as required.
You'd need to track the number of IOCP threads that are active (i.e. not waiting on GetQueuedCompletionStatus()) and spawn more when there are "too few". Likewise, just as a thread is about to go back and wait on GQCS, you could check whether you have "too many" and, if so, let it die instead.
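Something like this, in outline (the thresholds and the SpawnIoThread() helper are invented for the example, and the "too many" test is deliberately simplistic):

```cpp
#include <windows.h>

static volatile LONG g_activeThreads = 0;   // threads NOT waiting in GQCS
static volatile LONG g_totalThreads  = 0;

static const LONG MIN_THREADS = 2;
static const LONG MAX_THREADS = 64;

void SpawnIoThread(HANDLE iocp);            // CreateThread wrapper, assumed

DWORD WINAPI IoWorker(LPVOID param)
{
    HANDLE iocp = (HANDLE)param;
    for (;;) {
        // About to wait again: if there are clearly "too many" of us, die.
        if (g_totalThreads > MIN_THREADS &&
            g_totalThreads > g_activeThreads + 2) {     // arbitrary slack
            InterlockedDecrement(&g_totalThreads);
            return 0;
        }

        DWORD bytes; ULONG_PTR key; OVERLAPPED* ov;
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);

        LONG active = InterlockedIncrement(&g_activeThreads);
        if (active >= g_totalThreads && g_totalThreads < MAX_THREADS) {
            // Every thread is busy (possibly blocked in user code), so the
            // port could go unserviced: add one more thread.
            InterlockedIncrement(&g_totalThreads);
            SpawnIoThread(iocp);
        }

        if (ok && ov) { /* dispatch the completion; this work may block */ }

        InterlockedDecrement(&g_activeThreads);
    }
}
```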
I should probably update those articles.
IOCP is great for many connections, but what I'm wondering is: is there a significant benefit to allowing multiple pending receives or multiple pending writes per individual TCP socket, or will I lose no real performance if I just allow one pending receive and one pending send per socket (which really simplifies things, as I don't have to deal with out-of-order completion notifications)?
My general use case is 2 worker threads servicing the IOCP port and handling several connections (more than 2 but fewer than 10), where the transmitted data takes either of two forms: frequent, very small messages (which I combine manually where possible, but which generally need to be sent often enough that the per-send payload is still pretty small), and large file transfers.
Multiple pending recvs tend to be of limited use unless you plan to turn off the network stack's recv buffering, in which case they're essential. Bear in mind that if you DO decide to issue multiple pending recvs then you must do some work to make sure you process them in the correct sequence. Whilst the recvs will complete from the IOCP in the order they were issued, thread scheduling may mean they are processed by different I/O threads in a different order unless you actively work to ensure that this is not the case; see here for details.
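One way to do that work, sketched below (the names are invented, and per-connection locking around OnRecvComplete() is assumed):

```cpp
// Stamp each operation with a per-connection sequence number and buffer
// any completions that arrive "early", consuming them strictly in order.
#include <winsock2.h>
#include <map>

struct RecvOp {
    OVERLAPPED ov;         // first member, so LPOVERLAPPED casts back to RecvOp*
    DWORD      sequence;   // assigned when the WSARecv() is issued
    char       buffer[4096];
    DWORD      bytes;      // filled in from the completion
};

struct Connection {
    DWORD nextToIssue   = 0;               // stamps outgoing recvs
    DWORD nextToProcess = 0;               // next sequence we may consume
    std::map<DWORD, RecvOp*> outOfOrder;   // completions we can't use yet
};

// Called from an I/O thread whenever a recv completes, in whatever order
// the scheduler hands them to us.
void OnRecvComplete(Connection& c, RecvOp* op)
{
    c.outOfOrder[op->sequence] = op;

    // Drain everything that is now contiguous, strictly in issue order.
    std::map<DWORD, RecvOp*>::iterator it;
    while ((it = c.outOfOrder.find(c.nextToProcess)) != c.outOfOrder.end()) {
        RecvOp* ready = it->second;
        c.outOfOrder.erase(it);
        // ... consume ready->buffer / ready->bytes, then re-issue the op ...
        ++c.nextToProcess;
    }
}
```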
Multiple pending sends are more useful for fully utilising the TCP connection's available window (and sending at the maximum rate possible), but only if you have lots of data to send, only if you want to send it as efficiently as you can, and only if you take care to ensure that you don't have too many pending writes. See here for details of the issues you can run into if you don't actively manage the number of pending writes.
For less than 10 connections and TCP, you probably won't feel any difference even at high rates. You may see better performance by simply growing your buffer sizes.
Queuing up I/Os is going to help if your application is bursty and expensive to process. Basically it lets you perform the costly work up front so that when the burst comes in, you're using a little of the CPU on I/O and as much of it on processing as possible.
Why do many people say I/O completion port is a fast and nice model?
What are the I/O completion port's advantages and disadvantages?
I want to know some points which make the I/O completion port faster than other approaches.
If you can explain it comparing to other models (select, epoll, traditional multithread/multiprocess), it would be better.
I/O completion ports are awesome. There's no better word to describe them. If anything in Windows was done right, it's completion ports.
You can create some number of threads (does not really matter how many) and make them all block on one completion port until an event (either one you post manually, or an event from a timer or asynchronous I/O, or whatever) arrives. Then the completion port will wake one thread to handle the event, up to the limit that you specified. If you didn't specify anything, it will assume "up to number of CPU cores", which is really nice.
If there are already more threads active than the maximum limit, it will wait until one of them is done and then hand the event to that thread as soon as it goes back into the wait state. Also, it always wakes threads in LIFO order, so chances are the caches are still warm.
In other words, completion ports are a no-fuss "poll for events" as well as "fill CPU as much as you can" solution.
You can throw file reads and writes at a completion port, sockets, or anything else that's waitable. And you can post your own events if you want. Each custom event carries at least one integer and one pointer worth of data (if you use the default structure), but you are not really limited to that, as the system will happily accept any other structure too.
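For concreteness, a minimal sketch of that: port creation, a worker blocking on it, and a hand-posted event (the "key 0 means shut down" convention is invented for the example):

```cpp
#include <windows.h>

DWORD WINAPI Worker(LPVOID param)
{
    HANDLE iocp = (HANDLE)param;
    DWORD bytes; ULONG_PTR key; OVERLAPPED* ov;
    while (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
        if (key == 0) break;    // invented convention: key 0 means "shut down"
        // For hand-posted events, bytes/key/ov are whatever the poster
        // supplied; for real I/O they're the transfer size and your context.
    }
    return 0;
}

int main()
{
    // Final argument 0 = "at most one running thread per CPU core".
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    HANDLE t = CreateThread(NULL, 0, Worker, iocp, 0, NULL);

    PostQueuedCompletionStatus(iocp, 42, 1, NULL);   // a custom event
    PostQueuedCompletionStatus(iocp, 0, 0, NULL);    // tell the worker to exit

    WaitForSingleObject(t, INFINITE);
    CloseHandle(t);
    CloseHandle(iocp);
    return 0;
}
```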
Also, completion ports are fast, really, really fast. Once upon a time, I needed to notify one thread from another. As it happened, that thread already had a completion port for file I/O, but it didn't pump messages. So I wondered whether I should just bite the bullet and use the completion port for simplicity, even though posting a thread message seemed like it would obviously be much more efficient. I was undecided, so I benchmarked. Surprise: it turned out completion ports were about three times faster. So, faster and more flexible; the decision was not hard.
By using IOCP, we can overcome the "one thread per client" problem. It is commonly known that performance decreases heavily if the software does not run on a true multiprocessor machine. Threads are system resources that are neither unlimited nor cheap.
IOCP provides a way to have a few (I/O worker) threads handle many clients' input/output "fairly". The threads are suspended and don't use CPU cycles until there is something to do.
You can also read some useful background in this nice book: http://www.amazon.com/Windows-System-Programming-Johnson-Hart/dp/0321256190
I/O completion ports are provided by the OS for asynchronous I/O, which means that the operation proceeds in the background and the system does not waste resources (e.g. threads) waiting for the I/O to complete. When the I/O is complete, the OS wakes up the relevant process/thread to handle the result. (Correction: I originally wrote that this usually happens in hardware, with an interrupt notifying the OS; IOCP does NOT require hardware support.)
Typically a single thread can wait on a large number of I/O completions while taking up very little resources when the I/O has not returned.
Other async models that are not based on I/O completion ports usually employ a thread pool and have threads wait for I/O to complete, thereby using more system resources.
The flip side was supposed to be that I/O completion ports require hardware support, making them inapplicable to some async scenarios; as noted in the correction above, that claim was wrong.
I'm wondering which approach is faster and why ?
While writing a Win32 server I have read a lot about completion ports and overlapped I/O, but I have not read anything that suggests which set of APIs yields the best results in a server.
Should I use completion routines, or should I use the WaitForMultipleObjects API and why ?
You suggest two methods of doing overlapped I/O and ignore the third (or I'm misunderstanding your question).
When you issue an overlapped operation, a WSARecv() for example, you can specify an OVERLAPPED structure which contains an event, and you can wait for that event to be signalled to indicate that the overlapped I/O has completed. This, I assume, is your WaitForMultipleObjects() approach and, as previously mentioned, it doesn't scale well because you're limited in the number of handles you can pass to WaitForMultipleObjects().
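In sketch form, assuming a single connected socket (error handling omitted):

```cpp
// Event-per-operation overlapped I/O: the OVERLAPPED carries an event
// handle and a thread waits on the events, capping you at 64 pending
// operations per wait call.
#include <winsock2.h>

void EventBasedRecv(SOCKET s)
{
    char buf[4096];
    WSABUF wsaBuf;
    wsaBuf.len = sizeof buf;
    wsaBuf.buf = buf;

    OVERLAPPED ov = {0};
    ov.hEvent = WSACreateEvent();          // signalled when the recv completes
    DWORD flags = 0;

    WSARecv(s, &wsaBuf, 1, NULL, &flags, &ov, NULL);  // usually returns pending

    HANDLE handles[1] = { ov.hEvent };     // in practice, one per pending op
    WaitForMultipleObjects(1, handles, FALSE, INFINITE);

    DWORD bytes = 0;
    WSAGetOverlappedResult(s, &ov, &bytes, FALSE, &flags);
    WSACloseEvent(ov.hEvent);
    // 'bytes' bytes of data are now in buf.
}
```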
Alternatively you can pass a completion routine which is called when completion occurs. This is known as 'alertable I/O' and requires that the thread that issued the WSARecv() call is in an 'alertable' state for the completion routine to be called. Threads can put themselves into an alertable state in several ways (calling SleepEx(), or the various Ex versions of the Wait functions, etc.). The Richter book that I have open in front of me says "I have worked with alertable I/O quite a bit, and I'll be the first to tell you that alertable I/O is horrible and should be avoided". Enough said, IMHO.
There's a third way: before issuing the call, you associate the handle on which you want to do overlapped I/O with a completion port. You then create a pool of threads which service this completion port by calling GetQueuedCompletionStatus() in a loop. You issue your WSARecv() with an OVERLAPPED structure WITHOUT an event in it, and when the I/O completes the completion pops out of GetQueuedCompletionStatus() on one of your I/O pool threads and can be handled there.
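Roughly like this (the PerOp structure and its fixed buffer are invented for the example):

```cpp
#include <winsock2.h>

struct PerOp {
    OVERLAPPED ov;        // first member, so the LPOVERLAPPED casts back
    char       buf[4096];
};

void StartConnection(HANDLE iocp, SOCKET s)
{
    // Associate once per connection; the completion key identifies it.
    CreateIoCompletionPort((HANDLE)s, iocp, (ULONG_PTR)s, 0);

    PerOp* op = new PerOp();              // value-initialised: OVERLAPPED zeroed
    WSABUF wsaBuf;
    wsaBuf.len = sizeof op->buf;
    wsaBuf.buf = op->buf;
    DWORD flags = 0;
    WSARecv(s, &wsaBuf, 1, NULL, &flags, &op->ov, NULL);  // note: no hEvent
}

DWORD WINAPI PoolThread(LPVOID param)
{
    HANDLE iocp = (HANDLE)param;
    DWORD bytes; ULONG_PTR key; OVERLAPPED* ov;
    while (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
        PerOp* op = (PerOp*)ov;           // recover the per-operation context
        SOCKET s  = (SOCKET)key;          // recover the connection
        // ... handle 'bytes' bytes in op->buf, then typically re-issue ...
        (void)s;
        delete op;
    }
    return 0;
}
```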
As previously mentioned, Vista/Server 2008 cleaned up how IOCPs work a little and removed the problem whereby you had to make sure that the thread which issued the overlapped request continued to run until the request completed. A link to a reference on that can be found here. But that problem is easy to work around anyway: you simply marshal the WSARecv over to one of your I/O pool threads using the same IOCP that you use for completions...
Anyway, IMHO using IOCPs is the best way to do overlapped I/O. Yes, getting your head around the overlapped/async nature of the calls can take a little time at the start but it's well worth it as the system scales very well and offers a simple "fire and forget" method of dealing with overlapped operations.
If you need some sample code to get you going then I have several articles on writing IO completion port systems and a heap of free code that provides a real-world framework for high performance servers; see here.
As an aside: IMHO you really should read "Windows Via C/C++ (PRO-Developer)" by Jeffrey Richter and Christophe Nasarre, as it deals with all you need to know about overlapped I/O and most other advanced Windows platform techniques and APIs.
WaitForMultipleObjects is limited to 64 handles; in a highly concurrent application this could become a limitation.
Completion ports fit better with a model of having a pool of threads all of which are capable of handling any event, and you can queue your own (non-IO based) events into the port, whereas with waits you would need to code your own mechanism.
However, completion ports and the event-driven programming model are a more difficult concept to really work with.
I would not expect any significant performance difference, but in the end you can only make your own measurements to reflect your usage. Note that Vista/Server 2008 made a change to completion ports such that the originating thread is no longer needed for its IO operations to complete; this may make a bigger difference (see this article by Mark Russinovich).
Table 6-3 in the book Network Programming for Microsoft Windows, 2nd Edition compares the scalability of overlapped I/O via completion ports vs. other techniques. Completion ports blow all the other I/O models out of the water when it comes to throughput, while using far fewer threads.
The difference between WaitForMultipleObjects() and I/O completion ports is that IOCP scales to thousands of objects, whereas WFMO() does not and should not be used for anything more than 64 objects (even though you could).
You can't really compare them for performance, because in the domain of < 64 objects, they will be essentially identical.
WFMO(), however, scans its handle array in order and returns the lowest-index signalled object, so busy objects with low index numbers can starve objects with high index numbers. (E.g. if object 0 is going off constantly, it will starve objects 1, 2, 3, etc.) This is obviously undesirable.
I wrote an IOCP library (for sockets) to solve the C10K problem and put it in the public domain. On a 512 MB W2K machine I was able to get 4,000 sockets concurrently transferring data. (You can get a lot more sockets if they're idle; a busy socket consumes more non-paged pool, and that's the ultimate limit on how many sockets you can have.)
http://www.45mercystreet.com/computing/libiocp/index.html
The API should give you exactly what you need.
Not sure, but I use WaitForMultipleObjects and/or WaitForSingleObject. It's very convenient.
Either routine works, and I don't really think one is significantly faster than the other.
These two approaches exist to satisfy different programming models.
WaitForMultipleObjects is there to facilitate the async completion pattern (like the UNIX select() function), while completion ports lean towards an event-driven model.
I personally think the WaitForMultipleObjects() approach results in cleaner code and is more thread-safe.