Overlapped IO or file mapping? - winapi

In a Windows application I have a class which wraps up a filename and a buffer. You construct it with a filename and you can query the object to see if the buffer is filled yet, returning nullptr if not and the buffer addres if so. When the object falls out of scope, the buffer is released:
class file_buffer
{
public:
file_buffer(const std::string& file_name);
~file_buffer();
void* buffer();
private:
...
}
I want to put the data into memory asynchronously, and as far as I see it I have two choices: either create a buffer and use overlapped IO through ReadFileEx, or use MapViewOfFile and touch the address on another thread.
At the moment I'm using ReadFileEx which presents some problems, as requests greater than about 16MB are prone to failure: I can try splitting up the request but then I get synchronisation issues, and if the object falls out of scope before the IO is complete I have buffer-cleanup issues. Also, if multiple instances of the class are created in quick succession things get very fiddly.
Mapping and touching the data on another thread would seem to be considerably easier since I won't have the upper limit issues: also if the client absolutely has to have the data right now, they can simply dereference the address, let the OS worry about page faults and take the blocking hit.
This application needs to support single core machines, so my question is: will page faults on another software thread be any more expensive than overlapped IO on the current thread? Will they stall the process? Does overlapped IO stall the process in the same way or is there some OS magic I don't understand? Are page faults carried out using overlapped IO anyway?
I've had a good read of these topics:
http://msdn.microsoft.com/en-us/library/aa365199(v=vs.85).aspx (IO Concepts in File Management)
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx (File mapping)
but I can't seem to infer how to make a performance tradeoff.

You will definitively want to go with memory-mapped files. Overlapped IO (with FILE_FLAG_NO_BUFFERING) has been advocated as "the fastest way to get data into RAM" by some people for years, but this is only true in very contrieved cases with very specific conditions. In the normal, average case, turning off the buffer cache is a serious anti-optimization.
Now, overlapped IO without FILE_FLAG_NO_BUFFERINGhas all the quirks of overlapped IO, and is about 50% slower (for a reason I still cannot understand).
I've done some rather extensive benchmarking a year ago. The bottom line is: Memory mapped files are faster, better, less surprising.
Overlapped IO uses more CPU, is much slower when using the buffer cache, asynchronous reverts to synchronous under some well-documented and some undocumented conditions (e.g. encryption, compression, and... pure chance? request size? number of requests?), stalling your application at unpredictable times.
Submitting requests can sometimes take "funny" amounts of time, and CancelIO sometimes doesn't cancel anything but waits for completion. Processes with outstanding requests are unkillable. Managing buffers with outstanding overlapped writes is non-trivial extra work.
File mapping just works. Fullstop. And it works nicely. No surprises, no funny stuff. Touching every page has very little overhead and delivers as fast as the disk is able to deliver, and it takes advantage of the buffer cache. Your concern about a single-core CPU is no problem. If the touch-thread faults, it blocks, and as always when a thread blocks, another thread gets CPU time instead.
I'm even using file mapping for writing now, whenever I have more than a few bytes to write. This is somewhat non-trivial (have to manually grow/preallocate files and mappings, and truncate to actual length when closing), but with some helper classes it's entirely doable. Write 500 MiB of data, and it takes "zero time" (you basically do a memcpy, the actual write happens in the background, any time later, even after your program has finished). It's stunning how well this works, even if you know that it's the natural thing for an operating system to do.
Of course you had better not have a power failure before the OS has written out all pages, but that's true for any kind of writing. What's not on the disk yet is not on the disk -- there's really not much more to say to it than that. If you must be sure about that, you have to wait for a disk sync to complete, and even then you can't be sure the lights aren't going out while you wait for the sync. That's life.

I don't claim to understand this better than you, as it seem you made some inventigation. And to be totally sure you will need to experiment. But this is my understanding of the issues, in reverse order:
File mapping and overlapped IO in Windows are different implentations and none of them rely on the other under the hood. But both use the asynchronous block device layer. As I imagine it, in the kernel every IO is actually asynchronous, but some user operations wait for it to finish and so they create the illusion of synchronicity.
From point 1, if a thread does IO, other threads from the same process will not stall. That, unless the system resources are scarce or these other threads do IO themselves and face some kind of contention. This will be true no matter the kind of IO the first thread does: blocking, non-blocking, overlapped, memory-mapped.
In memory-mapped files, the data is read at least one page at a time, probably more because of the read-ahead, but you cannot be sure about that. So the probing thread will have to touch the mapped memory at least one on every page. That will be something like probe/block-probe-probe-probe-probe/block-probe... That might be a bit less efficient than a big overlapped read of several MB. Or maybe the kernel programmers were smart and it is even more efficient. You will have to make a little profiling... Hey, you could even go without the probing thread and see what happens.
Cancelling overlapping operations is a PITA, so my recommendation will be to go with the memory-mapped files. That is way easier to set up and you get extra functionality:
the memory is usable even before it is fully in memory
the memory can/will be shared by several instances of the process
if the memory is in the cache, it will be ready instantaneously instead of just quickly.
if the data is read-only, you can protect the memory from writing, catching bugs.

Related

V8 isolates mapped memory leaks

V8 developer is needed.
I've noticed that the following code leaks mapped memory (mmap, munmap), concretely the amount of mapped regions within cat /proc/<pid>/maps continuously grows and hits the system limit pretty quickly (/proc/sys/vm/max_map_count).
void f() {
auto platform = v8::platform::CreateDefaultPlatform();
v8::Isolate::CreateParams create_params;
create_params.array_buffer_allocator =
v8::ArrayBuffer::Allocator::NewDefaultAllocator();
v8::V8::InitializePlatform(platform);
v8::V8::Initialize();
for (;;) {
std::shared_ptr<v8::Isolate> isolate(v8::Isolate::New(create_params), [](v8::Isolate* i){ i->Dispose(); });
}
v8::V8::Dispose();
v8::V8::ShutdownPlatform();
delete platform;
delete create_params.array_buffer_allocator;
}
I've played a little bit with platform-linux.cc file and have found that UncommitRegion call just remaps region with PROT_NONE, but not release it. Probably thats somehow related to that problem..
There are several reasons why we recreate isolates during the program execution.
The first one is that creating new isolate along with discarding the old one is more predictable in terms of GC. Basically, I found that doing
auto remoteOldIsolate = std::async(
std::launch::async,
[](decltype(this->_isolate) isolateToRemove) { isolateToRemove->Dispose(); },
this->_isolate
);
this->_isolate = v8::Isolate::New(cce::Isolate::_createParams);
//
is more predictable and faster than call to LowMemoryNotification. So we monitor memory consumptions using GetHeapStatistics and recreate isolate when it hits the limit. Turns out we cannot consider GC activity as a part of code execution, this leads to bad user experience.
The second reason is that having isolate per code allows as to run several codes in parallel, otherwise v8::Locker will block second code for that particular isolate.
Looks like at this stage I have no choices and will rewrite application to have a pool of isolates and persistent context per code..of course this way code#1 may affect code#2 by doing many allocations and GC will run on code2 with no allocations at all, but at least it will not leak.
PS. I've mentioned that we use GetHeapStatistics for memory monitoring. I want to clarify a little bit that part.
In our case its a big problem when GC works during code execution. Each code has execution timeout (100-500ms). Having GC activity during code execution locks code and sometimes we have timeouts just for assignment operation. GC callbacks don't give you enough accuracy, so we cannot rely on them.
What we actually do, we specify --max-old-space-size=32000 (32GB). That way GC don't want to run, cuz it should see that a lot of memory exists. And using GetHeapStatistics (along with isolate recreation I've mentioned above) we have manual memory monitoring.
PPS. I also mentioned that sharing isolate between codes may affect users.
Say you have user#1 and user#2. Each of them have their own code, both are unrelated. code#1 has a loop with tremendous memory allocation, code#2 is just an assignment operation. Chances are GC will run during code#2 and user#2 will receive timeout.
V8 developer is needed.
Please file a bug at crbug.com/v8/new. Note that this issue will probably be considered low priority; we generally assume that the number of Isolates per process remains reasonably small (i.e., not thousands or millions).
have a pool of isolates
Yes, that's probably the way to go. In particular, as you already wrote, you will need one Isolate per thread if you want to execute scripts in parallel.
this way code#1 may affect code#2 by doing many allocations and GC will run on code2 with no allocations at all
No, that can't happen. Only allocations trigger GC activity. Allocation-free code will spend zero time doing GC. Also (as we discussed before in your earlier question), GC activity is split into many tiny (typically sub-millisecond) steps (which in turn are triggered by allocations), so in particular a short-running bit of code will not encounter some huge GC pause.
sometimes we have timeouts just for assignment operation
That sounds surprising, and doesn't sound GC-related; I would bet that something else is going on, but I don't have a guess as to what that might be. Do you have a repro?
we specify --max-old-space-size=32000 (32GB). That way GC don't want to run, cuz it should see that a lot of memory exists. And using GetHeapStatistics (along with isolate recreation I've mentioned above) we have manual memory monitoring.
Have you tried not doing any of that? V8's GC is very finely tuned by default, and I would assume that side-stepping it in this way is causing more problems than it solves. Of course you can experiment with whatever you like; but if the resulting behavior isn't what you were hoping for, then my first suggestion is to just let V8 do its thing, and only interfere if you find that the default behavior is somehow unsatisfactory.
code#1 has a loop with tremendous memory allocation, code#2 is just an assignment operation. Chances are GC will run during code#2 and user#2 will receive timeout.
Again: no. Code that doesn't allocate will not be interrupted by GC. And several functions in the same Isolate can never run in parallel; only one thread may be active in one Isolate at the same time.

How many contiuous cycles per process?

I've just read, in multiprocessing, when a waiting process comes into context, the entire cache becomes invalid and we see a lot of cache misses. I'm wondering how long a process runs continuously before it goes to wait state... Is it long enough such that the newly updated cache can be used meaningfully? But then, other processes will be waiting too long ? Appreciate any help. Thanks!
The simple answer is that "yes, it is usually long enough". If it wasn't long enough, they would not put a cache on the chip. The time is controlled by the operating system, you can see a post on it here: Linux Scheduler Time Slice.
However, there are cases where it might not be long enough. This would typically be where you have a lot of I/O and not much computation in between. A good example would be a program like telnet that can read a character, output a character and then go to sleep. However, I don't think that it is correct to say that the entire cache is invalidated. This would only happen if the new process used enough memory to overwrite all of the cache entries from the previous process.
To avoid issues like this you should use buffered file I/O or database access, caches in the application and avoid character by character processing.

Caching OVERLAPPED structures when using IOCPs in Windows

I'm using I/O Completion Ports in Windows, I have an object called 'Stream' that resembles and abstract an HANDLE (so it can be a socket, a file, and so on).
When I call Stream::read() or Stream::write() (so, ReadFile()/WriteFile() in the case of files, and WSARecv()/WSASend() in the case of sockets), I allocate a new OVERLAPPED structure in order to make a pending I/O request that will be completed in the IOCP loop by some other thread.
Then, when the OVERLAPPED structure will be completed by the IOCP loop, it will be destroyed there. If that's the case, Stream::read() or Stream::write() are called again from the IOCP loop, they will instance new OVERLAPPED structures, and it will go forever.
This works just fine. But now I want to improve this by adding caching of OVERLAPPED objects:
when my Stream object does a lot of reads or writes, it absolutely makes sense to cache the OVERLAPPED structures.
But now arise a problem: when I deallocate a Stream object, I must deallocate the cached OVERLAPPED structures, but how I can know if they've been completed or are still pending and one of the IOCP loops will complete that lately? So, an atomic reference count is needed here, but now the problem is that if I use an atomic ref counter, I have to increase that ref counter for each read or write operations, and decrease on each IOCP loop completion of the OVERLAPPED structure or Stream deletion, which in a server are a lot of operations, so I'll end up by increasing/decreasing a lot of atomic counters a lot of times.
Will this impact very negatively the concurrency of multiple threads? This is my only concern that blocks me to put this atomic reference counter for each OVERLAPPED structure.
Are my concerns baseless?
I thought this is an important topic to point out, and a question on SO, to see other's people thoughts on this and methods for caching OVERLAPPED structures with IOCP, is worth it.
I wish to find out a clever solution on this, without using atomic ref counters, if it is possible.
Assuming that you bundle a data buffer with the OVERLAPPED structure as a 'per operation' data object then pooling them to avoid excessive allocation/deallocation and heap fragmentation is a good idea.
If you only ever use this object for I/O operations then there's no need for a ref count, simply pull one from the pool, do your WSASend/WSARecv with it and then release it to the pool once you're done with it in the IOCP completion handler.
If, however, you want to get a bit more complicated and allow these buffers to be passed out to other code then you may want to consider ref counting them if that makes it easier. I do this in my current framework and it allows me to have generic code for the networking side of things and then pass data buffers from read completions out to customer code and they can do what they want with them and when they're done they release them back to the pool. This currently uses a ref count but I'm moving away from that as a minor performance tweak. The ref count is still there but in most situations it only ever goes from 0 -> 1 and then to 0 again, rather than being manipulated at various layers within my framework (this is done by passing the ownership of the buffer out to the user code using a smart pointer).
In most situations I expect that a ref count is unlikely to be your most expensive operation (even on NUMA hardware in situations where your buffers are being used from multiple nodes). More likely the locking involved in putting these things back into a pool will be your bottleneck; I've solved that one so am moving on to the next higher fruit ;)
You also talk about your 'per connection' object and caching your 'per operation' data locally there (which is what I do before pushing them back to the allocator), whilst ref counts aren't strictly required for the 'per operation' data, the 'per connection' data needs, at least, an atomically modifiable 'num operations in progress' count so that you can tell when you can free IT up. Again, due to my framework design, this has become a normal ref count for which customer code can hold refs as well as active I/O operations. I've yet to work a way around the need for this counter in a general purpose framework.

In what applications caching does not give any advantage?

Our professor asked us to think of an embedded system design where caches cannot be used to their full advantage. I have been trying to find such a design but could not find one yet. If you know such a design, can you give a few tips?
Caches exploit the fact data (and code) exhibit locality.
So an embedded system wich does not exhibit locality, will not benefit from a cache.
Example:
An embedded system has 1MB of memory and 1kB of cache.
If this embedded system is accessing memory with short jumps it will stay long in the same 1kB area of memory, which could be successfully cached.
If this embedded system is jumping in different distant places inside this 1MB and does that frequently, then there is no locality and cache will be used badly.
Also note that depending on architecture you can have different caches for data and code, or a single one.
More specific example:
If your embedded system spends most of its time accessing the same data and (e.g.) running in a tight loop that will fit in cache, then you're using cache to a full advantage.
If your system is something like a database that will be fetching random data from any memory range, then cache can not be used to it's full advantage. (Because the application is not exhibiting locality of data/code.)
Another, but weird example
Sometimes if you are building safety-critical or mission-critical system, you will want your system to be highly predictable. Caches makes your code execution being very unpredictable, because you can't predict if a certain memory is cached or not, thus you don't know how long it will take to access this memory. Thus if you disable cache it allows you to judge you program's performance more precisely and calculate worst-case execution time. That is why it is common to disable cache in such systems.
I do not know what you background is but I suggest to read about what the "volatile" keyword does in the c language.
Think about how a cache works. For example if you want to defeat a cache, depending on the cache, you might try having your often accessed data at 0x10000000, 0x20000000, 0x30000000, 0x40000000, etc. It takes very little data at each location to cause cache thrashing and a significant performance loss.
Another one is that caches generally pull in a "cache line" A single instruction fetch may cause 8 or 16 or more bytes or words to be read. Any situation where on average you use a small percentage of the cache line before it is evicted to bring in another cache line, will make your performance with the cache on go down.
In general you have to first understand your cache, then come up with ways to defeat the performance gain, then think about any real world situations that would cause that. Not all caches are created equal so there is no one good or bad habit or attack that will work for all caches. Same goes for the same cache with different memories behind it or a different processor or memory interface or memory cycles in front of it. You also need to think of the system as a whole.
EDIT:
Perhaps I answered the wrong question. not...full advantage. that is a much simpler question. In what situations does the embedded application have to touch memory beyond the cache (after the initial fill)? Going to main memory wipes out the word full in "full advantage". IMO.
Caching does not offer an advantage, and is actually a hindrance, in controlling memory-mapped peripherals. Things like coprocessors, motor controllers, and UARTs often appear as just another memory location in the processor's address space. Instead of simply storing a value, those locations can cause something to happen in the real world when written to or read from.
Cache causes problems for these devices because when software writes to them, the peripheral doesn't immediately see the write. If the cache line never gets flushed, the peripheral may never actually receive a command even after the CPU has sent hundreds of them. If writing 0xf0 to 0x5432 was supposed to cause the #3 spark plug to fire, or the right aileron to tilt down 2 degrees, then the cache will delay or stop that signal and cause the system to fail.
Similarly, the cache can prevent the CPU from getting fresh data from sensors. The CPU reads repeatedly from the address, and cache keeps sending back the value that was there the first time. On the other side of the cache, the sensor waits patiently for a query that will never come, while the software on the CPU frantically adjusts controls that do nothing to correct gauge readings that never change.
In addition to almost complete answer by Halst, I would like to mention one additional case where caches may be far from being an advantage. If you have multiple-core SoC where all cores, of course, have own cache(s) and depending on how program code utilizes these cores - caches can be very ineffective. This may happen if ,for example, due to incorrect design or program specific (e.g. multi-core communication) some data block in RAM is concurrently used by 2 or more cores.

Explanation for tiny reads (overlapped, buffered) outperforming large contiguous reads?

(apologies for the somewhat lengthy intro)
During development of an application which prefaults an entire large file (>400MB) into the buffer cache for speeding up the actual run later, I tested whether reading 4MB at a time still had any noticeable benefits over reading only 1MB chunks at a time. Surprisingly, the smaller requests actually turned out to be faster. This seemed counter-intuitive, so I ran a more extensive test.
The buffer cache was purged before running the tests (just for laughs, I did one run with the file in the buffers, too. The buffer cache delivers upwards of 2GB/s regardless of request size, though with a surprising +/- 30% random variance).
All reads used overlapped ReadFile with the same target buffer (the handle was opened with FILE_FLAG_OVERLAPPED and without FILE_FLAG_NO_BUFFERING). The harddisk used is somewhat elderly but fully functional, NTFS has a cluster size of 8kB. The disk was defragmented after an initial run (6 fragments vs. unfragmented, zero difference). For better figures, I used a larger file too, below numbers are for reading 1GB.
The results were really surprising:
4MB x 256 : 5ms per request, completion 25.8s # ~40 MB/s
1MB x 1024 : 11.7ms per request, completion 23.3s # ~43 MB/s
32kB x 32768 : 12.6ms per request, completion 15.5s # ~66 MB/s
16kB x 65536 : 12.8ms per request, completion 13.5s # ~75 MB/s
So, this suggests that submitting ten thousands of requests two clusters in length is actually better than submitting a few hundred large, contiguous reads. The submit time (time before ReadFile returns) does go up substantially as the number of requests goes up, but asynchronous completion time nearly halves.
Kernel CPU time is around 5-6% in every case (on a quadcore, so one should really say 20-30%) while the asynchronous reads are completing, which is a surprising amount of CPU -- apparently the OS does some non-neglegible amount of busy waiting, too. 30% CPU for 25 seconds at 2.6 GHz, that's quite a few cycles for doing "nothing".
Any idea how this can be explained? Maybe someone here has a deeper insight of the inner workings of Windows overlapped IO? Or, is there something substantially wrong with the idea that you can use ReadFile for reading a megabyte of data?
I can see how an IO scheduler would be able to optimize multiple requests by minimizing seeks, especially when requests are random access (which they aren't!). I can also see how a harddisk would be able to perform a similar optimization given a few requests in the NCQ.
However, we're talking about ridiculous numbers of ridiculously small requests -- which nevertheless outperform what appears to be sensible by a factor of 2.
Sidenote: The clear winner is memory mapping. I'm almost inclined to add "unsurprisingly" because I am a big fan of memory mapping, but in this case, it actually does surprise me, as the "requests" are even smaller and the OS should be even less able to predict and schedule the IO. I didn't test memory mapping at first because it seemed counter-intuitive that it might be able to compete even remotely. So much for your intuition, heh.
Mapping/unmapping a view repeatedly at different offsets takes practically zero time. Using a 16MB view and faulting every page with a simple for() loop reading a single byte per page completes in 9.2 secs # ~111 MB/s. CPU usage is under 3% (one core) at all times. Same computer, same disk, same everything.
It also appears that Windows loads 8 pages into the buffer cache at a time, although only one page is actually created. Faulting every 8th page runs at the same speed and loads the same amount of data from disk, but shows lower "physical memory" and "system cache" metrics and only 1/8 of the page faults. Subsequent reads prove that the pages are nevertheless definitively in the buffer cache (no delay, no disk activity).
(Possibly very, very distantly related to Memory-Mapped File is Faster on Huge Sequential Read?)
To make it a bit more illustrative:
Update:
Using FILE_FLAG_SEQUENTIAL_SCAN seems to somewhat "balance" reads of 128k, improving performance by 100%. On the other hand, it severely impacts reads of 512k and 256k (you have to wonder why?) and has no real effect on anything else. The MB/s graph of the smaller blocks sizes arguably seems a little more "even", but there is no difference in runtime.
I may have found an explanation for smaller block sizes performing better, too. As you know, asynchronous requests may run synchronously if the OS can serve the request immediately, i.e. from the buffers (and for a variety of version-specific technical limitations).
When accounting for actual asynchronous vs. "immediate" asyncronous reads, one notices that upwards of 256k, Windows runs every asynchronous request asynchronously. The smaller the blocksize, the more requests are being served "immediately", even when they are not available immediately (i.e. ReadFile simply runs synchronously). I cannot make out a clear pattern (such as "the first 100 requests" or "more than 1000 requests"), but there seems to be an inverse correlation between request size and synchronicity. At a blocksize of 8k, every asynchronous request is served synchronously.
Buffered synchronous transfers are, for some reason, twice as fast as asynchronous transfers (no idea why), hence the smaller the request sizes, the faster the overall transfer, because more transfers are done synchronously.
For memory mapped prefaulting, FILE_FLAG_SEQUENTIAL_SCAN causes a slightly different shape of the performance graph (there is a "notch" which is moved a bit backwards), but the total time taken is exactly identical (again, this is surprising, but I can't help it).
Update 2:
Unbuffered IO makes the performance graphs for the 1M, 4M, and 512k request testcases somewhat higher and more "spiky" with maximums in the 90s of GB/s, but with harsh minumums too, the overall runtime for 1GB is within +/- 0.5s of the buffered run (the requests with smaller buffer sizes complete significantly faster, however, that is because with more than 2558 in-flight requests, ERROR_WORKING_SET_QUOTA is returned). Measured CPU usage is zero in all unbuffered cases, which is unsurprising, since any IO that happens runs via DMA.
Another very interesting observation with FILE_FLAG_NO_BUFFERING is that it significantly changes API behaviour. CancelIO does not work any more, at least not in a sense of cancelling IO. With unbuffered in-flight requests, CancelIO will simply block until all requests have finished. A lawyer would probably argue that the function cannot be held liable for neglecting its duty, because there are no more in-flight requests left when it returns, so in some way it has done what was asked -- but my understanding of "cancel" is somewhat different.
With buffered, overlapped IO, CancelIO will simply cut the rope, all in-flight operations terminate immediately, as one would expect.
Yet another funny thing is that the process is unkillable until all requests have finished or failed. This kind of makes sense if the OS is doing DMA into that address space, but it's a stunning "feature" nevertheless.
I'm not a filesystem expert but I think there are a couple of things going on here. First off. w.r.t. your comment about memory mapping being the winner. This isn't totally surprising since the NT cache manager is based on memory mapping - by doing the memory mapping yourself, you're duplicating the cache manager's behavior without the additional memory copies.
When you read sequentially from the file, the cache manager attempts to pre-fetch the data for you - so it's likely that you are seeing the effect of readahead in the cache manager. At some point the cache manager stops prefetching reads (or rather at some point the prefetched data isn't sufficient to satisfy your reads and so the cache manager has to stall). That may account for the slowdown on larger I/Os that you're seeing.
Have you tried adding FILE_FLAG_SEQUENTIAL_SCAN to your CreateFile flags? That instructs the prefetcher to be even more aggressive.
This may be counter-intuitive, but traditionally the fastest way to read data off the disk is to use asynchronous I/O and FILE_FLAG_NO_BUFFERING. When you do that, the I/O goes directly from the disk driver into your I/O buffers with nothing to get in the way (assuming that the segments of the file are contiguous - if they're not, the filesystem will have to issue several disk reads to satisfy the application read request). Of course it also means that you lose the built-in prefetch logic and have to roll your own. But with FILE_FLAG_NO_BUFFERING you have complete control of your I/O pipeline.
One other thing to remember: When you're doing asynchronous I/O, it's important to ensure that you always have an I/O request oustanding - otherwise you lose potential time between when the last I/O completes and the next I/O is started.

Resources