The background is developing a DBMS kernel, specifically database checkpoint processing. The rules of the game are that we must wait for all outstanding asynchronous IOs on the file to finish before issuing fsync().
The solution we currently deploy is to count in-flight asynchronous IOs manually and wait for this count to go down to 0 before fsync()-ing or FlushFileBuffers()-ing. The question is whether we really have to do that: perhaps kernels/filesystems already do it by themselves?
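For concreteness, here is a minimal sketch of the sort of counting gate we use (illustrative only; the names are made up and our real code differs in details):

#include <atomic>
#include <condition_variable>
#include <mutex>

class inflight_gate
{
public:
    // Call before submitting an asynchronous IO on the file.
    void begin_io() { inflight_.fetch_add(1, std::memory_order_relaxed); }

    // Call from the IO completion path.
    void end_io()
    {
        if (inflight_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            std::lock_guard<std::mutex> lk(m_);
            cv_.notify_all();
        }
    }

    // Call before fsync()/FlushFileBuffers(): blocks until nothing is in flight.
    void wait_idle()
    {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] {
            return inflight_.load(std::memory_order_acquire) == 0;
        });
    }

private:
    std::atomic<int> inflight_{0};
    std::mutex m_;
    std::condition_variable cv_;
};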
The OSes in question are mainly Windows and Linux, although I'm also curious how BSD-based OSes handle this.
On Linux, we're using libaio for asynchronous IO.
On Windows: Yes, for a given HANDLE instance, the current asynchronous i/o queue is drained before FlushFileBuffers() executes. If you are writing a database, you really ought to use NtFlushBuffersFileEx() instead; it offers far finer granularity of synchronisation, which makes a huge difference.
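For illustration, a rough sketch of calling it from user mode. NtFlushBuffersFileEx() lives in ntdll and exists on Windows 8 and later; the flag value below is taken from the driver kit headers, so treat both as assumptions to verify against your target systems:

#include <windows.h>
#include <winternl.h>

typedef NTSTATUS (NTAPI *NtFlushBuffersFileEx_t)(
    HANDLE FileHandle, ULONG Flags,
    PVOID Parameters, ULONG ParametersSize,
    PIO_STATUS_BLOCK IoStatusBlock);

// Flush only the file's data, not its metadata (FLUSH_FLAGS_FILE_DATA_SYNC_ONLY
// per ntifs.h; not supported on older kernels).
static const ULONG kFlushFileDataSyncOnly = 0x00000004;

bool flush_file_data_only(HANDLE file)
{
    auto fn = reinterpret_cast<NtFlushBuffersFileEx_t>(
        GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtFlushBuffersFileEx"));
    if (!fn)
        return false; // older Windows: fall back to FlushFileBuffers()
    IO_STATUS_BLOCK iosb = {};
    return fn(file, kFlushFileDataSyncOnly, nullptr, 0, &iosb) >= 0; // NT_SUCCESS
}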
On FreeBSD: Certainly with ZFS, yes. I can't say I've tested UFS, but I'd be surprised if it were not the same. FreeBSD implements cached async i/o as a kernel thread pool in any case, only uncached async i/o is truly async.
On Mac OS: No idea, and worse, disk i/o semantics have been all over the place over the last few releases. It was once very good, like the BSDs, but recently it's been downhill. Async file i/o was always nearly unusable on Mac OS in any case: the maximum queue depth of 16 plus the requirement to use signals for async i/o completion is very hard to mix well with threaded code.
On Linux: For synchronous i/o, yes fsync() enforces a total ordering, per inode, if your filesystem guarantees that (all the popular ones do). For libaio, which only really works right for O_DIRECT i/o in any case, I believe that the block storage layer does flush all enqueued i/o before telling the device to barrier, unless you have disabled barriers. For io_uring (which you ought to be using instead of libaio), for non-O_DIRECT i/o, the ordering is whatever the filesystem enforces for per-inode i/o once io_uring has processed the submission. For io_uring with O_DIRECT i/o, the block storage layer is a singleton, and should enforce ordering across the whole system, once io_uring has processed the submission.
I keep mentioning "once io_uring has processed the submission" because io_uring works with ring buffered queues. If you add an entry to the submission queue, it will get processed in order of submission by io_uring (i.e. the queue gets drained). From the moment of submission to the moment of io_uring consuming the submission, there is no ordering. But once io_uring has consumed the submission, the destination filesystem has been told of the i/o, and whatever ordering guarantees it implements it will apply to the ordering of completions it emits back to io_uring. So, when using io_uring, do not proceed after i/o submission until io_uring has drained your i/o submission request from the submission queue. This happens naturally using the syscall to tell io_uring to drain the queue, or for polling drains, you can watch the "last drained item" offset the kernel atomically updates as it consumes submission items.
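To make the submission/drain point concrete, here is a minimal sketch using liburing (assuming a reasonably recent kernel and liburing, with fd already open). io_uring_submit() returns once the kernel has consumed the queued SQEs, and the IOSQE_IO_DRAIN flag keeps the fsync from starting until all previously submitted i/o has completed:

#include <liburing.h>

int write_then_fsync(int fd, const void *buf, unsigned len, off_t off)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, len, off);

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);
    sqe->flags |= IOSQE_IO_DRAIN;   // do not start until all prior i/o completes

    io_uring_submit(&ring);         // returns once the SQEs have been consumed

    // Reap both completions: the write and the fsync.
    for (int i = 0; i < 2; ++i) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}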
Source: I am the author of the reference library for the WG21 C++ standardisation of low level i/o. Caveat: all of the above is purely from my memory and experience, and may be bitrotted or wrong.
Related
Can I wait for GUI events — that is, pump a message loop — and on an I/O completion port at the same time? I would like to integrate libuv with the Windows GUI.
There are two solutions that I know of. One works on all versions of Windows, but involves multiple threads. The other is faster, but only supports Windows 10+ (thank you #RbMm for this fact).
Another thread calls GetQueuedCompletionStatusEx in a loop and sends messages to the main thread with SendMessage. The main thread reads the messages from its message loop, notes the custom message type, and dispatches the I/O completions.
This solution works on all versions of Windows, but is slower. However, if one is willing to trade latency for throughput, one can increase the GetQueuedCompletionStatusEx receive buffer to recover nearly the same throughput as the second solution. For best performance, both threads should use the same CPU, to avoid playing cache ping-pong with the I/O completions.
The main thread uses MsgWaitForMultipleObjectsEx to wait for the completion port to be signaled or user input to arrive. Once it is signaled, the main thread calls GetQueuedCompletionStatusEx with a zero timeout.
This assumes that an IOCP that is used by only one thread becomes signaled precisely when an I/O completion arrives. This is only true on Windows 10 and up. Otherwise, you will busyloop, since the IOCP will always be signaled. On systems that support this method, it should be faster, since it reduces scheduling overhead.
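A rough sketch of that second approach (Windows 10+ only, as noted; DispatchIoCompletion is a hypothetical handler, not a real API):

#include <windows.h>

void DispatchIoCompletion(const OVERLAPPED_ENTRY&); // hypothetical handler

void message_and_iocp_loop(HANDLE iocp)
{
    for (;;) {
        DWORD r = MsgWaitForMultipleObjectsEx(1, &iocp, INFINITE,
                                              QS_ALLINPUT, MWMO_INPUTAVAILABLE);
        if (r == WAIT_OBJECT_0) {
            // Completion port signaled: drain it with a zero timeout.
            OVERLAPPED_ENTRY entries[64];
            ULONG n = 0;
            while (GetQueuedCompletionStatusEx(iocp, entries, 64, &n, 0, FALSE)) {
                for (ULONG i = 0; i < n; ++i)
                    DispatchIoCompletion(entries[i]);
            }
        } else if (r == WAIT_OBJECT_0 + 1) {
            // GUI input arrived: pump the message queue dry.
            MSG msg;
            while (PeekMessage(&msg, nullptr, 0, 0, PM_REMOVE)) {
                TranslateMessage(&msg);
                DispatchMessage(&msg);
            }
        }
    }
}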
Is MPI_Bcast() blocking or non-blocking? In other words, when the root sends data, do all processors block until every processor has received it? If not, how can I synchronize (block) all of them so that no one proceeds until all have received the same data?
You need to be a bit careful about terminology here as what MPI means by "blocking" may not be how you have seen it used in other contexts.
In MPI terms, Bcast is blocking. Blocking means that, when the function returns, it has completed the operation it was meant to do. In this case, it means that on return from Bcast it is guaranteed that the receive buffer in every process contains the data you want to broadcast. The non-blocking version is Ibcast.
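To make the distinction concrete, a minimal sketch using the MPI C API:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, data = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) data = 42;

    // Blocking: on return, every rank's 'data' contains the broadcast value.
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Non-blocking: do not touch 'data' until the matching MPI_Wait returns.
    MPI_Request req;
    MPI_Ibcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}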
In MPI terms, what you are asking is whether the operation is synchronous, i.e. implies synchronisation amongst processes. For a point-to-point operation such as Send, this refers to whether or not the sender waits for the receive to be posted before returning from the send call. For collective operations, the question is whether there is a barrier (as pointed out by #Vladimir). Bcast does not necessarily imply a barrier.
However, the reason I am posting is that, in almost all MPI programs written using the standard Send/Recv calls (as opposed to single-sided Put/Get), you do not care whether or not there is a synchronisation. All each process cares about is that it has received the data it needs - why would it matter what the other processes are doing? If you subsequently want to communicate with any other process then the MPI routines are designed so that the required synchronisation happens automatically. If you issue a receive and another process is slow, you wait; if you issue a send and the other process has not issued a receive, everything will still work correctly (this is assuming you don't call Rsend - you should never call Rsend!). Whether or not there is synchronisation affects performance, but rarely affects whether a program is correct.
Unless processes are interacting via some other mechanism (e.g. all accessing the same file) then it is hard to come up with a real example where you care whether or not the Bcast synchronises. Of course you can always construct some edge case, but in real practical applications of MPI it almost never matters.
Many MPI programs are littered with barriers and in my experience they are almost never required for correctness; the only common use case is to ensure meaningful timings for performance measurements.
No, this kind of blocking (waiting for the other processes to finish their part of the job) would be very bad for performance. Every process continues as soon as it has all it needs -- that means that the data it was to receive has arrived, or the data to be sent has at least been copied to some buffer.
You can use an MPI_Barrier to synchronize processes if you need to be sure all processes have finished. As already said, it can slow down the program significantly. I use it only for certain diagnostic logging when initializing my code, not during the actual integration.
In a Windows application I have a class which wraps up a filename and a buffer. You construct it with a filename and you can query the object to see if the buffer is filled yet, returning nullptr if not and the buffer address if so. When the object falls out of scope, the buffer is released:
class file_buffer
{
public:
    file_buffer(const std::string& file_name);
    ~file_buffer();
    void* buffer();
private:
    ...
};
I want to put the data into memory asynchronously, and as far as I see it I have two choices: either create a buffer and use overlapped IO through ReadFileEx, or use MapViewOfFile and touch the address on another thread.
At the moment I'm using ReadFileEx which presents some problems, as requests greater than about 16MB are prone to failure: I can try splitting up the request but then I get synchronisation issues, and if the object falls out of scope before the IO is complete I have buffer-cleanup issues. Also, if multiple instances of the class are created in quick succession things get very fiddly.
Mapping and touching the data on another thread would seem to be considerably easier since I won't have the upper limit issues: also if the client absolutely has to have the data right now, they can simply dereference the address, let the OS worry about page faults and take the blocking hit.
This application needs to support single core machines, so my question is: will page faults on another software thread be any more expensive than overlapped IO on the current thread? Will they stall the process? Does overlapped IO stall the process in the same way or is there some OS magic I don't understand? Are page faults carried out using overlapped IO anyway?
I've had a good read of these topics:
http://msdn.microsoft.com/en-us/library/aa365199(v=vs.85).aspx (IO Concepts in File Management)
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx (File mapping)
but I can't seem to infer how to make a performance tradeoff.
You will definitely want to go with memory-mapped files. Overlapped IO (with FILE_FLAG_NO_BUFFERING) has been advocated as "the fastest way to get data into RAM" by some people for years, but this is only true in very contrived cases with very specific conditions. In the normal, average case, turning off the buffer cache is a serious anti-optimization.
Now, overlapped IO without FILE_FLAG_NO_BUFFERING has all the quirks of overlapped IO, and is about 50% slower (for a reason I still cannot understand).
I did some rather extensive benchmarking a year ago. The bottom line is: memory-mapped files are faster, better, less surprising.
Overlapped IO uses more CPU, is much slower when using the buffer cache, asynchronous reverts to synchronous under some well-documented and some undocumented conditions (e.g. encryption, compression, and... pure chance? request size? number of requests?), stalling your application at unpredictable times.
Submitting requests can sometimes take "funny" amounts of time, and CancelIo sometimes doesn't cancel anything but instead waits for completion. Processes with outstanding requests are unkillable. Managing buffers with outstanding overlapped writes is non-trivial extra work.
File mapping just works. Fullstop. And it works nicely. No surprises, no funny stuff. Touching every page has very little overhead and delivers as fast as the disk is able to deliver, and it takes advantage of the buffer cache. Your concern about a single-core CPU is no problem. If the touch-thread faults, it blocks, and as always when a thread blocks, another thread gets CPU time instead.
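As a concrete illustration, a minimal sketch of the map-and-touch scheme (Win32; names are illustrative and error handling/cleanup is elided):

#include <windows.h>

struct mapped_file { const char* base; SIZE_T size; };

// Background thread: touch one byte per page so the OS faults the data in.
static DWORD WINAPI touch_pages(LPVOID param)
{
    const mapped_file* m = static_cast<const mapped_file*>(param);
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    for (SIZE_T off = 0; off < m->size; off += si.dwPageSize)
        (void)*(volatile const char*)(m->base + off); // may fault; thread blocks
    return 0;
}

bool map_and_prefetch(const wchar_t* path, mapped_file* out)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE)
        return false;
    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    out->base = static_cast<const char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    out->size = static_cast<SIZE_T>(size.QuadPart);
    CreateThread(nullptr, 0, touch_pages, out, 0, nullptr); // 'out' must outlive the thread
    return out->base != nullptr;
}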
I'm even using file mapping for writing now, whenever I have more than a few bytes to write. This is somewhat non-trivial (have to manually grow/preallocate files and mappings, and truncate to actual length when closing), but with some helper classes it's entirely doable. Write 500 MiB of data, and it takes "zero time" (you basically do a memcpy, the actual write happens in the background, any time later, even after your program has finished). It's stunning how well this works, even if you know that it's the natural thing for an operating system to do.
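Roughly like this, with illustrative names and error handling elided; the truncation step only matters if you over-allocated:

#include <windows.h>
#include <cstring>

void write_via_mapping(const wchar_t* path, const void* data, SIZE_T len)
{
    HANDLE file = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    // Preallocate: CreateFileMapping grows the file to the requested size.
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                        static_cast<DWORD>((unsigned long long)len >> 32),
                                        static_cast<DWORD>(len), nullptr);
    void* view = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0);

    std::memcpy(view, data, len); // "zero time": the OS writes the pages out later

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    // If we had over-allocated, we would truncate here with
    // SetFilePointerEx + SetEndOfFile before closing.
    CloseHandle(file);
}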
Of course you had better not have a power failure before the OS has written out all pages, but that's true for any kind of writing. What's not on the disk yet is not on the disk -- there's really not much more to say to it than that. If you must be sure about that, you have to wait for a disk sync to complete, and even then you can't be sure the lights aren't going out while you wait for the sync. That's life.
I don't claim to understand this better than you, as it seems you have done some investigation, and to be totally sure you will need to experiment. But this is my understanding of the issues, in reverse order:
File mapping and overlapped IO in Windows are different implementations, and neither relies on the other under the hood. But both use the asynchronous block device layer. As I imagine it, in the kernel every IO is actually asynchronous, but some user operations wait for it to finish and so create the illusion of synchronicity.
From point 1, if a thread does IO, other threads in the same process will not stall; that is, unless system resources are scarce, or those other threads do IO themselves and face some kind of contention. This is true no matter what kind of IO the first thread does: blocking, non-blocking, overlapped, memory-mapped.
In memory-mapped files, the data is read at least one page at a time, probably more because of read-ahead, but you cannot be sure about that. So the probing thread will have to touch the mapped memory at least once on every page. That will be something like probe/block-probe-probe-probe-probe/block-probe... That might be a bit less efficient than a big overlapped read of several MB. Or maybe the kernel programmers were smart and it is even more efficient. You will have to do a little profiling... Hey, you could even go without the probing thread and see what happens.
Cancelling overlapped operations is a PITA, so my recommendation is to go with memory-mapped files. They are way easier to set up and you get extra functionality:
the mapped memory is usable even before the data is fully in memory
the memory can/will be shared by several instances of the process
if the memory is in the cache, it will be ready instantaneously instead of just quickly.
if the data is read-only, you can protect the memory from writing, catching bugs.
Why do many people say the I/O completion port is a fast and nice model?
What are the I/O completion port's advantages and disadvantages?
I want to know what makes the I/O completion port faster than other approaches.
If you can explain it in comparison to other models (select, epoll, traditional multithread/multiprocess), that would be even better.
I/O completion ports are awesome. There's no better word to describe them. If anything in Windows was done right, it's completion ports.
You can create some number of threads (does not really matter how many) and make them all block on one completion port until an event (either one you post manually, or an event from a timer or asynchronous I/O, or whatever) arrives. Then the completion port will wake one thread to handle the event, up to the limit that you specified. If you didn't specify anything, it will assume "up to number of CPU cores", which is really nice.
If there are already more threads active than the maximum limit, it will wait until one of them is done and then hand the event to the thread as soon as it goes to wait state. Also, it will always wake threads in a LIFO order, so chances are that caches are still warm.
In other words, completion ports are a no-fuss "poll for events" as well as "fill CPU as much as you can" solution.
You can throw file reads and writes at a completion port, sockets, or anything else that's waitable. And, you can post your own events if you want. Each custom event has at least one integer and one pointer worth of data (if you use the default structure), but you are not really limited to that as the system will happily accept any other structure too.
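A minimal sketch of that pattern: a worker pool blocked on one port, plus a manually posted custom event. Names and the shutdown sentinel convention are illustrative:

#include <windows.h>

static DWORD WINAPI worker(LPVOID param)
{
    HANDLE iocp = static_cast<HANDLE>(param);
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;          // the integer of the event
        LPOVERLAPPED ov = nullptr;  // the pointer of the event
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
        if (!ok && ov == nullptr)
            break;                  // the wait itself failed (e.g. port closed)
        if (ok && key == 0 && ov == nullptr)
            break;                  // our shutdown sentinel
        // ... dispatch on key/ov here ...
    }
    return 0;
}

int main()
{
    // Concurrency limit 0 means "number of CPU cores", as described above.
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);
    HANDLE threads[4];
    for (int i = 0; i < 4; ++i)
        threads[i] = CreateThread(nullptr, 0, worker, iocp, 0, nullptr);

    // Post a custom event: one integer (key) plus one pointer (OVERLAPPED*).
    PostQueuedCompletionStatus(iocp, 0, 42, nullptr);

    // Wake every worker with the sentinel, then join them.
    for (int i = 0; i < 4; ++i)
        PostQueuedCompletionStatus(iocp, 0, 0, nullptr);
    WaitForMultipleObjects(4, threads, TRUE, INFINITE);
    CloseHandle(iocp);
    return 0;
}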
Also, completion ports are fast, really really fast. Once upon a time, I needed to notify one thread from another. As it happened, that thread already had a completion port for file I/O, but it didn't pump messages. So, I wondered if I should just bite the bullet and use the completion port for simplicity, even though posting a thread message would obviously be much more efficient. I was undecided, so I benchmarked. Surprise, it turned out completion ports were about 3 times faster. So... faster and more flexible, the decision was not hard.
By using IOCP, we can overcome the "one-thread-per-client" problem. It is commonly known that performance decreases heavily if the software does not run on a true multiprocessor machine. Threads are system resources that are neither unlimited nor cheap.
IOCP provides a way to have a few (I/O worker) threads handle multiple clients' input/output "fairly". The threads are suspended, and don't use CPU cycles until there is something to do.
You can also read some information in this nice book: http://www.amazon.com/Windows-System-Programming-Johnson-Hart/dp/0321256190
I/O completion ports are provided by the O/S as an asynchronous I/O operation, which means that it occurs in the background (usually in hardware). The system does not waste any resources (e.g. threads) waiting for the I/O to complete. When the I/O is complete, the hardware sends an interrupt to the O/S, which then wakes up the relevant process/thread to handle the result. WRONG: IOCP does NOT require hardware support (see comments below)
Typically a single thread can wait on a large number of I/O completions while taking up very little resources when the I/O has not returned.
Other async models that are not based on I/O completion ports usually employ a thread pool and have threads wait for I/O to complete, thereby using more system resources.
The flip side is that I/O completion ports usually require hardware support, and so they are not generally applicable to all async scenarios.
I'm wondering which approach is faster, and why.
While writing a Win32 server I have read a lot about Completion Ports and Overlapped I/O, but I have not read anything to suggest which set of APIs yields the best results in the server.
Should I use completion routines, or should I use the WaitForMultipleObjects API, and why?
You suggest two methods of doing overlapped I/O and ignore the third (or I'm misunderstanding your question).
When you issue an overlapped operation, a WSARecv() for example, you can specify an OVERLAPPED structure which contains an event and you can wait for that event to be signalled to indicate the overlapped I/O has completed. This, I assume, is your WaitForMultipleObjects() approach and, as previously mentioned, this doesn't scale well as you're limited to the number of handles that you can pass to WaitForMultipleObjects().
Alternatively you can pass a completion routine which is called when completion occurs. This is known as 'alertable I/O' and requires that the thread that issued the WSARecv() call is in an 'alertable' state for the completion routine to be called. Threads can put themselves in an alertable state in several ways (calling SleepEx() or the various EX versions of the Wait functions, etc). The Richter book that I have open in front of me says "I have worked with alertable I/O quite a bit, and I'll be the first to tell you that alertable I/O is horrible and should be avoided". Enough said IMHO.
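For reference, the pattern just described looks roughly like this (an illustrative sketch; the completion routine only ever runs while the issuing thread sits in an alertable wait such as SleepEx):

#include <windows.h>

static VOID CALLBACK on_read_done(DWORD error, DWORD bytes, LPOVERLAPPED ov)
{
    // Runs on the issuing thread, but only while it is in an alertable state.
}

void read_alertable(HANDLE file, void* buf, DWORD len)
{
    OVERLAPPED ov = {}; // read from offset 0; no event handle needed
    if (ReadFileEx(file, buf, len, &ov, on_read_done))
        SleepEx(INFINITE, TRUE); // alertable wait: lets the APC run
}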
There's a third way, before issuing the call you should associate the handle that you want to do overlapped I/O on with a completion port. You then create a pool of threads which service this completion port by calling GetQueuedCompletionStatus() and looping. You issue your WSARecv() with an OVERLAPPED structure WITHOUT an event in it and when the I/O completes the completion pops out of GetQueuedCompletionStatus() on one of your I/O pool threads and can be handled there.
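A minimal sketch of that third method (illustrative names, error paths trimmed; note the port association is done once per socket, not per receive):

#include <winsock2.h>
#include <windows.h>

struct io_context {        // hypothetical per-operation context
    OVERLAPPED ov;         // first member, so the OVERLAPPED* maps back to it
    char buffer[4096];
};

bool post_recv(HANDLE iocp, SOCKET s, ULONG_PTR connection_key)
{
    // One-time association of the socket with the completion port.
    CreateIoCompletionPort(reinterpret_cast<HANDLE>(s), iocp, connection_key, 0);

    io_context* ctx = new io_context{}; // ov.hEvent deliberately stays null
    WSABUF wb;
    wb.len = sizeof(ctx->buffer);
    wb.buf = ctx->buffer;
    DWORD flags = 0;
    int rc = WSARecv(s, &wb, 1, nullptr, &flags, &ctx->ov, nullptr);
    if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING) {
        delete ctx;                     // no completion will be queued
        return false;
    }
    // The completion pops out of GetQueuedCompletionStatus() on a pool thread.
    return true;
}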
As previously mentioned, Vista/Server 2008 have cleaned up how IOCPs work a little and removed the problem whereby you had to make sure that the thread that issued the overlapped request continued to run until the request completed. Link to a reference to that can be found here. But this problem is easy to work around anyway; you simply marshal the WSARecv over to one of your I/O pool threads using the same IOCP that you use for completions...
Anyway, IMHO using IOCPs is the best way to do overlapped I/O. Yes, getting your head around the overlapped/async nature of the calls can take a little time at the start but it's well worth it as the system scales very well and offers a simple "fire and forget" method of dealing with overlapped operations.
If you need some sample code to get you going then I have several articles on writing IO completion port systems and a heap of free code that provides a real-world framework for high performance servers; see here.
As an aside; IMHO, you really should read "Windows Via C/C++ (PRO-Developer)" by Jeffrey Richter and Christophe Nasarre as it deals will all you need to know about overlapped I/O and most other advanced windows platform techniques and APIs.
WaitForMultipleObjects is limited to 64 handles; in a highly concurrent application this could become a limitation.
Completion ports fit better with a model of having a pool of threads all of which are capable of handling any event, and you can queue your own (non-IO based) events into the port, whereas with waits you would need to code your own mechanism.
However completion ports, and the event based programming model, are a more difficult concept to really work against.
I would not expect any significant performance difference, but in the end you can only make your own measurements to reflect your usage. Note that Vista/Server 2008 made a change with completion ports such that the originating thread is no longer needed to complete IO operations; this may make a bigger difference (see this article by Mark Russinovich).
Table 6-3 in the book Network Programming for Microsoft Windows, 2nd Edition compares the scalability of overlapped I/O via completion ports vs. other techniques. Completion ports blow all the other I/O models out of the water when it comes to throughput, while using far fewer threads.
The difference between WaitForMultipleObjects() and I/O completion ports is that IOCP scales to thousands of objects, whereas WFMO() does not and should not be used for anything more than 64 objects (even though you could).
You can't really compare them for performance, because in the domain of < 64 objects, they will be essentially identical.
WFMO(), however, checks its objects in index order, so busy objects with low index numbers can starve objects with high index numbers. (E.g. if object 0 is signaled constantly, it will starve objects 1, 2, 3, etc.) This is obviously undesirable.
I wrote an IOCP library (for sockets) to solve the C10K problem and put it in the public domain. On a 512 MB W2K machine I was able to get 4,000 sockets concurrently transferring data. (You can get a lot more sockets if they're idle; a busy socket consumes more non-paged pool, and that's the ultimate limit on how many sockets you can have.)
http://www.45mercystreet.com/computing/libiocp/index.html
The API should give you exactly what you need.
Not sure, but I use WaitForMultipleObjects and/or WaitForSingleObject. It's very convenient.
Either routine works, and I don't really think one is significantly faster than the other.
These two approaches exists to satisfy different programming models.
WaitForMultipleObjects is there to facilitate the async completion pattern (like the UNIX select() function), while completion ports lean more towards an event-driven model.
I personally think the WaitForMultipleObjects() approach results in cleaner code and is more thread-safe.