How to obtain good concurrent read performance from disk - windows

I'd like to ask a question then follow it up with my own answer, but also see what answers other people have.
We have two large files which we'd like to read from two separate threads concurrently. One thread will sequentially read fileA while the other thread will sequentially read fileB. There is no locking or communication between the threads, both are sequentially reading as fast as they can, and both are immediately discarding the data they read.
Our experience with this setup on Windows is very poor. The combined throughput of the two threads is in the order of 2-3 MiB/sec. The drive seems to be spending most of its time seeking backwards and forwards between the two files, presumably reading very little after each seek.
If we disable one of the threads and temporarily look at the performance of a single thread then we get much better bandwidth (~45 MiB/sec for this machine). So clearly the bad two-thread performance is an artefact of the OS disk scheduler.
Is there anything we can do to improve the concurrent thread read performance? Perhaps by using different APIs or by tweaking the OS disk scheduler parameters in some way.
Some details:
The files are in the order of 2 GiB each on a machine with 2GiB of RAM. For the purpose of this question we consider them not to be cached and perfectly defragmented. We have used defrag tools and rebooted to ensure this is the case.
We are using no special APIs to read these files. The behaviour is repeatable across various bog-standard APIs such as Win32's CreateFile, C's fopen, C++'s std::ifstream, Java's FileInputStream, etc.
Each thread spins in a loop making calls to the read function. We have varied the number of bytes requested from the API each iteration from values between 1KiB up to 128MiB. Varying this has had no effect, so clearly the amount the OS is physically reading after each disk seek is not dictated by this number. This is exactly what should be expected.
The dramatic difference between one-thread and two-thread performance is repeatable across Windows 2000, Windows XP (32-bit and 64-bit), Windows Server 2003, and also with and without hardware RAID5.

The problem seems to be in Windows I/O scheduling policy. According to what I found here there are many ways for an O.S. to schedule disk requests. While Linux and others can choose between different policies, before Vista Windows was locked in a single policy: a FIFO queue, where all requests where splitted in 64 KB blocks. I believe that this policy is the cause for the problem you are experiencing: the scheduler will mix requests from the two threads, causing continuous seek between different areas of the disk.
Now, the good news is that according to here and here, Vista introduced a smarter disk scheduler, where you can set the priority of your requests and also allocate a minimum badwidth for your process.
The bad news is that I found no way to change disk policy or buffers size in previous versions of Windows. Also, even if raising disk I/O priority of your process will boost the performance against the other processes, you still have the problems of your threads competing against each other.
What I can suggest is to modify your software by introducing a self-made disk access policy.
For example, you could use a policy like this in your thread B (similar for Thread A):
if THREAD A is reading from disk then wait for THREAD A to stop reading or wait for X ms
Read for X ms (or Y MB)
Stop reading and check status of thread A again
You could use semaphores for status checking or you could use perfmon counters to get the status of the actual disk queue.
The values of X and/or Y could also be auto-tuned by checking the actual trasfer rates and slowly modify them, thus maximizing the throughtput when the application runs on different machines and/or O.S. You could find that cache, memory or RAID levels affect them in a way or the other, but with auto-tuning you will always get the best performance in every scenario.

I'd like to add some further notes in my response. All other non-Microsoft operating systems we have tested do not suffer from this problem. Linux, FreeBSD, and Mac OS X (this final one on different hardware) all degrade much more gracefully in terms of aggregate bandwidth when moving from one thread to two. Linux for example degraded from ~45 MiB/sec to ~42 MiB/sec. These other operating systems must be reading larger chunks of the file between each seek, and therefor not spending nearly all their time waiting on the disk to seek.
Our solution for Windows is to pass the FILE_FLAG_NO_BUFFERING flag to CreateFile and use large (~16MiB) reads in each call to ReadFile. This is suboptimal for several reasons:
Files don't get cached when read like this, so there are none of the advantages that caching normally gives.
The constraints when working with this flag are much more complicated than normal reading (alignment of read buffers to page boundaries, etc).
(As a final remark. Does this explain why swapping under Windows is so hellish? Ie, Windows is incapable of doing IO to multiple files concurrently with any efficiency, so while swapping all other IO operations are forced to be disproportionately slow.)
Edit to add some further details for Will Dean:
Of course across these different hardware configurations the raw figures did change (sometimes substantially). The problem however is the consistent degradation in performance that only Windows suffers when moving from one thread to two. Here is a summary of the machines tested:
Several Dell workstations (Intel Xeon) of various ages running Windows 2000, Windows XP (32-bit), and Windows XP (64-bit) with single drive.
A Dell 1U server (Intel Xeon) running Windows Server 2003 (64-bit) with RAID 1+0.
An HP workstation (AMD Opteron) with Windows XP (64-bit), and Windows Server 2003, and hardware RAID 5.
My home unbranded PC (AMD Athlon64) running Windows XP (32-bit), FreeBSD (64-bit), and Linux (64-bit) with single drive.
My home MacBook (Intel Core1) running Mac OS X, single SATA drive.
My home Koolu PC running Linux. Vastly underpowered compared to the other systems but I demonstrated that even this machine can outperform a Windows server with RAID5 when doing multi-threaded disk reads.
CPU usage on all of these systems was very low during the tests and anti-virus was disabled.
I forgot to mention before but we also tried the normal Win32 CreateFile API with the FILE_FLAG_SEQUENTIAL_SCAN flag set. This flag didn't fix the problem.

It does seem a little strange that you see no difference across quite a wide range of windows versions and nothing between a single drive and hardware raid-5.
It's only 'gut feel', but that does make me doubtful that this is really a simple seeking problem. Other than the OS X and the Raid5, was all this tried on the same machine - have you tried another machine? Is your CPU usage basically zero during this test?
What's the shortest app you can write which demonstrates this problem? - I would be interested to try it here.

I would create some kind of in memory thread safe lock. Each thread could wait on the lock until it was free. When the lock becomes free, take the lock and read the file for a defined length of time or a defined amount of data, then release the lock for any other waiting threads.

Do you use IOCompletionPorts under Windows? Windows via C++ has an in-depth chapter on this subject and as luck would have it, it is also available on MSDN.

Paul - saw the update. Very interesting.
It would be interesting to try it on Vista or Win2008, as people seem to be reporting some considerable I/O improvements on these in some circumstances.
My only suggestion about a different API would be to try memory mapping the files - have you tried that? Unfortunately at 2GB per file, you're not going to be able to map multiple whole files on a 32-bit machine, which means this isn't quite as trivial as it might be.

Related

Do Operating Systems slow down program execution?

This question is about operating systems in general. Is there any necessary mechanism in implementation of operating systems that impacts flow of instructions my program sends to CPU?
For example if my program was set for maximum priority in OS, would it perform exactly the same when run without OS?
Is there any necessary mechanism in implementation of operating systems that impacts flow of instructions my program sends to CPU?
Not strictly necessary mechanisms (depending on how you define "OS"); but typically there's IRQs, exceptions and task switches.
IRQs are used by devices to ask the OS (their device driver) for attention; and interrupting the flow of instructions your program sends to CPU. The alternative is polling, which wastes a huge amount of CPU time checking if the device needs attention when it probably doesn't. Because applications need to use devices (file IO, keyboard, video, etc) and wasting CPU time is bad; IRQs significantly improve the performance of applications.
Exceptions (like IRQs) also interrupt the normal flow of instructions. They occur when the normal flow of instructions can't continue, either because your program crashed, or because your program needs something. The most common cause of exceptions is virtual memory (e.g. using swap space to let the application have more memory than actually exists so that the application can actually work properly; where the exception tells the OS that your program tried to access memory that has to be fetched from disk first). In general; this also improves performance for multiple reasons (because "can't execute because there's not enough RAM" can be considered "zero performance"; and because various tricks reduce RAM consumption and increase the amount of RAM that can be used for things like caching files which improve file IO speed).
Task switches is the basis of multi-tasking (e.g. being able to run more than one application at a time). If there are more tasks that want CPU time than there are CPUs, then the OS (scheduler) may (depending on task priorities and scheduler design) switch between them so that all the tasks get some CPU time. However; most applications spend most of their time waiting for something to do (e.g. waiting for user to press a key) and don't need CPU time while waiting; and if the OS is only running one task then the scheduler does nothing (no task switches because there's no other task to switch to). In other words, if the OS supports multi-tasking but you're only running one task, then it makes no difference.
Note that in some cases, IRQs and/or tasks are also used to "opportunistically" do work in the background (when hardware has nothing better to do) to improve performance (e.g. pre-fetch, pre-process and/or pre-calculate data before it's needed so that the resulting data is available instantly when it is needed).
For example if my program was set for maximum priority in OS, would it perform exactly the same when run without OS?
It's best to think of it as many layers - hardware & devices (CPU, etc), with kernel and device drivers on top, with applications on top of that. If you remove any of the layers nothing works (e.g. how can an application read and write files when there's no file system and no disk device drivers?).
If you shift all of the functionality that an OS provides into the application (e.g. a statically linked library that can make an application boot on bare metal); then if the functionality is the same the performance will be the same.
You can only improve performance by reducing functionality. For example, if you get rid of security you'll improve performance (temporarily, until your application becomes part of an attacker's botnet and performance becomes significantly worse due to all the bitcoin mining it's doing). In a similar way, you can get rid of flexibility (reboot the computer when you plug in a different USB flash stick), or fault tolerance (trash all of your data without any warning when the storage devices start failing because software assumed hardware is permanently perfect).

Make chrome use the memory and cpu

So, yesterday I opened task manager in Win 8 (64 bit) and noticed that Chrome (32-bit for some reason) didn't use the whole power my PC has got. So I was running an AI JavaScript program and I noticed that my CPU was running at 1% and Memory was only runnning 120 MB, and that forced me to think why would I wait 5 minutes for it to run instead of somehow boosting it to at least 60%. As far as I know Windows automatically distributes the hardware usage to programs so I'm asking what's the problem:
Is it because it's x32?
Is it because I should manually configure windows to give it more power?
Note: I did search Google, but all I got is that people actually complain about High CPU usage and I've got the opposite.
32 bit doesn't make a difference here. Javascript is inherently single-threaded, so by default (not counting web workers) it won't use more than a single core on your machine. It just cannot. Also memory usage doesn't necessarily tell you how hard a program is working. Some need lots of memory, others only little.
It's up to programs to use the resources of the machine most efficiently; if they don't, there is nothing you can do with Windows to make them run better or faster.

Windows 7: poor GUI response in my program while downloading data; is there some way to improve this?

I've written a program that (among other things) downloads multiple large files from a server on the LAN, using TCP. This program runs fine under Linux, MacOS/X, and generally under Windows as well (it uses Qt for the GUI and straight sockets calls for networking), but on certain Windows machines the download appears to be too much for the machine to handle, and I'm wondering if anyone has any ideas as to why that is and what can be done about it.
When downloading files, my program spawns a separate I/O thread that basically just sits in a loop, downloading data over TCP and writing it to a file, writing 128KB per call to QFile:write(). Each file is typically several hundred megabytes long, and a typical download session writes out several dozen of these files. Note that the I/O thread runs independently of the GUI thread, so I wouldn't expect it to affect GUI's performance much if at all -- especially not when running on a multicore PC.
The PC in question is a Core-2Duo Quad Q6600 running at 2.40GHz, with 4GB of RAM. It's running Windows 7 Ultimate SP1, 32-bit. It is receiving data over a Gigabit Ethernet connection and writing it to files on the NTFS-formatted boot partition of the 232GB internal Hitachi ATA drive.
The symptom is that sometimes during a download (seemingly at random) the program's GUI will become non-responsive for 10 to 30 seconds at a time, and often the title bar of the window will have "(not responding)" appended to it. The symptom will then clear up again and the download will proceed normally again. Another symptom is that the desktop is extremely sluggish during the download... for example, if I click on the "Start" button, the Start menu will take ~30 seconds to populate, instead of being populated near-instantaneously as I would expect.
Note that Task Manager shows plenty of free memory, but it does show short spikes of CPU usage to 100% one one of the 4 cores, at the same time the problems are seen.
The data is arriving over Gigabit Ethernet, and if I have my program just receive the data and throw it away (without writing it to the hard drive), the machine can maintain a constant download rate of about 96MB/sec without breaking a sweat. If I write the received data to a file, however, the download rate decreases to about 37MB/sec, and the symptoms described above start to appear.
The interesting thing is that just for curiosity's sake I added this call to my I/O thread's entry function, just before the beginning of its event loop:
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
When I did that, the "(not responding)" symptoms cleared, but then download speed was reduced to only ~25MB/sec.
So my questions are:
Does anyone know what might be causing the sporadic hangups of the GUI when the hard drive is under a heavy write-load?
Why does lowering the I/O thread's priority cause the download rate to drop so much, given that there are three idle cores on the machine? I would think that even a lower-priority thread would have plenty of CPU available in this situation.
Is there any way to get a maximum download rate without causing Windows' desktop responsiveness and/or my app's GUI responsiveness to suffer problems?
Without seeing any code is hard to answer but this seems to be something related to processors and the fact that your download thread is not leaving any space for other threads to performs other operations.
It seems it never waits and that the driver of the network card is not well written.
Are you sure your thread is entering in an idle state when there is no data incoming?
In OS with a single processor a for (;;) {} will consume 100% cpu and if it talks continuously with the kernel it may stops other processes or other threads for doing that, especially if there is a bug or a very bad behaviour in some network card driver in your case.
Probably putting the thread priority below normal you are asking the OS to use your thread less often, this gives by a magical combination of things that allow things to not hang too much.
Check the code, maybe you are forgetting something?
Check if adding a sleep(0) to force the OS to yield to another thread sometime will make things better, but this is a temporary fix, you should find why your thread is consuming 100% cpu, if it is.

What is causing one Vista machine to be 10 times faster than another machine?

We run a Fortran console program we have run for years. Recently we purchased identical new HP server class machines (4 processors, 8 gig ram, 4 hard drives) for everyone in the office. We configured them identically as nearly as we know. We can compile the Fortran program on one machine, pass the executable to the different machines, and on two machines it executes painfully slow, while on two others it has modest performance (but not as good as before we upgraded from XP machines).
It uses almost no console output (about 40 lines) but outputs about 15 megs of files.
We open task manager to see what's going on, and we see that on the slow machines it's loading ONE CPU to about 15%. On the fast machines it's loading ALL CPUs to about 40% (but one of them seems to load more than the others). As I recall, on XP it loaded the CPU to 99%, and ran much faster.
These machines are the employees' general purpose machines, and have lots of company bloatware on them. And there is the possibility they have slightly different directory structures. But what seems totally puzzling to me is why Vista is not giving them more CPU time. If the CPUs were loading up, I might blame the performance variation on different directory structures, but not loading up the CPUs just boggles my mind.
David
if there's a bottleneck in IO, the CPU wouldn't be loaded as much because it's mostly waiting for the IO to take place. One could even imagine this to cause the one CPU vs many CPUs problem if there's just no point in kicking in another CPU because there's plenty of time between while waiting. What if you take an external HD and try out if the differences also take place if you run the same program on that HD on different machines?
Please go into Windows task manager, Performance / - Select in [View] the option: [Kernel Times] and look what's displayed on the bars during program execution.
If its only 15% load on quad+hyperthreading box, that says basically, OpenMP, MPI (or whatever it uses) - isn't properly working - works on 1/8 => 15%. Can you run the MPI-test command for your specific system in order to check for errors in multiprocessing on each box? Therefore, the question would be - why does the multiprocessing environment not work?
Regards
rbo
SWAG, but have you checked your virus scanner configuration? If the scanner isn't set to ignore the type of file you're writing on the slow machine, then each write to those files might be getting intercepted and scanned before being written to the disk. This could lead to the process sitting in I/O wait and not getting scheduled as often.
Vista had a problem with some uncontrollably memory leaks, perhaps this is your error, some conflict in the "bloatware" is causing a memory leak and so your Fortran program is running so much slower?
I assume you have tested this with all programs ended. It seems unlikely that your console program is the issue. Sounds like there's definitely a memory conflict going on though.

Is there any way of throttling CPU/Memory of a process?

Problem: I have a developers machine (read: fast, lots of memory), but the user has a users machine (read: slow, not very much memory).
I can simulate a slow network using Fiddler (http://www.fiddler2.com/fiddler2/)
I can look at how CPU is used over time for a process using Process Explorer (http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx).
Is there any way I can restrict the amount of CPU a process can have, or the amount of memory a process can have in order to simulate a users machine more effectively? (In order to isolate performance problems for instance)
I suppose I could use a VM, but I'm looking for something a bit lighter.
I'm using Windows XP, but a solution for any Windows machine would be welcome. Thanks.
The platform SDK used to come with stress tools for doing just this back in the good old days (STRESS.EXE, CPUSTRESS.EXE in the SDK), but they might still be there (check your platform SDK and/or Visual Studio installation for these two files -- unfortunately I have niether the PSDK nor VS installed on the machine I'm typing from.)
Other tools:
memory: performance & reliability (e.g. handling failed memory allocation): can use EatMem
CPU: performance & reliability (e.g. race conditions): can use CPU Burn, Prime95, etc
handles (GDI, User): reliability (e.g. handling failed GDI resource allocation): ??? may have to write your own, but running out of GDI handles (buggy GTK apps would usually eat them all away until all other apps on the system would start falling dead like flies) is a real test for any Windows app
disk: performance & reliability (e.g. handling disk full): DiskFiller, etc.
AppVerifier has a low-resource simulation feature.
You could also try setting the priority of your process to be very low.
You can run MemAlloc to chew up RAM, possibly a few copies at once.
I found a related question:
Set Windows process (or user) memory limit
The accepted answer for the question has a link to the Windows API's SetProcessWorkingSetSize, so it's not exactly a tool that can limit the amount of memory that a process can use.
In terms of changing the amount of CPU resources a process can use, if you don't mind the granularity of per-core limiting of resources, Task Manager can change the processor affinity of a process.
In Task Manager, right-click a process and select "Set Affinity...", then select the processor cores that the process can be assigned to.
If the development machine has many cores but the user machine only has one, then, rather than allowing the process to run on all the available cores, set the process' processor affinity to only one core.
It has nothing to do with SetProcessWorkingSetSize
Just use internal Win32 kernel apis to restrict CPU Usage

Resources