How do I get Windows to go as fast as Linux for compiling C++? - windows

I know this is not so much a programming question but it is relevant.
I work on a fairly large cross platform project. On Windows I use VC++ 2008. On Linux I use gcc. There are around 40k files in the project. Windows is 10x to 40x slower than Linux at compiling and linking the same project. How can I fix that?
A single-change incremental build takes 20 seconds on Linux and over 3 minutes on Windows. Why? I can even install the 'gold' linker on Linux and get that time down to 7 seconds.
Similarly git is 10x to 40x faster on Linux than Windows.
In the git case it's possible git is not using Windows in the optimal way, but VC++? You'd think Microsoft would want to make their own developers as productive as possible, and faster compilation would go a long way toward that. Maybe they are trying to encourage developers into C#?
As a simple test, find a folder with lots of subfolders and do a simple
dir /s > c:\list.txt
on Windows. Do it twice and time the second run so it runs from the cache. Copy the files to Linux and do the equivalent 2 runs and time the second run.
ls -R > /tmp/list.txt
I have 2 workstations with the exact same specs: HP Z600s with 12 GB of RAM and 8 cores at 3.0 GHz. On a folder with ~400k files, Windows takes 40 seconds and Linux takes less than 1 second.
Is there a registry setting I can set to speed up Windows? What gives?
A few somewhat related links, relevant to compile times, not necessarily I/O:
Apparently there's an issue in Windows 10 (not in Windows 7) that closing a process holds a global lock. When compiling with multiple cores and therefore multiple processes this issue hits.
The /analyze option can adversely affect perf because it loads a web browser. (Not relevant here, but good to know.)

Unless a hardcore Windows systems hacker comes along, you're not going to get more than partisan comments (which I won't do) and speculation (which is what I'm going to try).
File system - You should try the same operations (including the dir) on the same filesystem. I came across this which benchmarks a few filesystems for various parameters.
Caching. I once tried to run a compilation on Linux on a RAM disk and found that it was slower than running it on disk thanks to the way the kernel takes care of caching. This is a solid selling point for Linux and might be the reason why the performance is so different.
Bad dependency specifications on Windows. Maybe the chromium dependency specifications for Windows are not as correct as for Linux. This might result in unnecessary compilations when you make a small change. You might be able to validate this using the same compiler toolchain on Windows.

A few ideas:
Disable 8.3 names. This can be a big factor on drives with a large number of files and a relatively small number of folders: fsutil behavior set disable8dot3 1
Use more folders. In my experience, NTFS starts to slow down with more than about 1000 files per folder.
Enable parallel builds with MSBuild; just add the "/m" switch, and it will automatically start one copy of MSBuild per CPU core (see the sketch after this list).
Put your files on an SSD -- helps hugely for random I/O.
If your average file size is much greater than 4KB, consider rebuilding the filesystem with a larger cluster size that corresponds roughly to your average file size.
Make sure the files have been defragmented. Fragmented files cause lots of disk seeks, which can cost you a factor of 40+ in throughput. Use the "contig" utility from sysinternals, or the built-in Windows defragmenter.
If your average file size is small and the partition you're on is relatively full, it's possible that you are running with a fragmented MFT, which is bad for performance. Also, files smaller than 1K are stored directly in the MFT. The "contig" utility mentioned above can help, or you may need to increase the MFT size. The following command will double it, to 25% of the volume: fsutil behavior set mftzone 2 (change the last number to 3 or 4 to increase the size by additional 12.5% increments). After running the command, reboot and then create the filesystem.
Disable last access time: fsutil behavior set disablelastaccess 1
Disable the indexing service
Disable your anti-virus and anti-spyware software, or at least set the relevant folders to be ignored.
Put your files on a different physical drive from the OS and the paging file. Using a separate physical drive allows Windows to use parallel I/Os to both drives.
Have a look at your compiler flags. The Windows C++ compiler has a ton of options; make sure you're only using the ones you really need.
Try increasing the amount of memory the OS uses for paged-pool buffers (make sure you have enough RAM first): fsutil behavior set memoryusage 2
Check the Windows error log to make sure you aren't experiencing occasional disk errors.
Have a look at Physical Disk related performance counters to see how busy your disks are. High queue lengths or long times per transfer are bad signs.
The first 30% of a disk is much faster than the rest of the disk in terms of raw transfer rate. Narrower partitions also help minimize seek times.
Are you using RAID? If so, you may need to optimize your choice of RAID type (RAID-5 is bad for write-heavy operations like compiling)
Disable any services that you don't need
Defragment folders: copy all the files to another drive (just the files), delete the original files, copy all the folders to another drive (just the empty folders), then delete the original folders, defragment the original drive, copy the folder structure back first, and then copy the files back. When Windows builds large folders one file at a time, the folders end up fragmented and slow. ("contig" should help here, too.)
If you are I/O bound and have CPU cycles to spare, try turning disk compression ON. It can provide some significant speedups for highly compressible files (like source code), with some cost in CPU.
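To illustrate the parallel-build item above, a hedged sketch (the solution name is a placeholder, and this assumes the solution builds through MSBuild); the /m switch parallelises across projects, while the compiler's own /MP switch additionally compiles the listed .cpp files of one invocation in parallel:
msbuild MySolution.sln /m
cl /MP /c file1.cpp file2.cpp file3.cpp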

NTFS updates the file access time on every access. You can try disabling it:
"fsutil behavior set disablelastaccess 1"
(restart)

The issue with Visual C++ is, as far as I can tell, that it is not a priority for the compiler team to optimize this scenario.
Their solution is to use their precompiled header feature. This is what Windows-specific projects have done. It is not portable, but it works.
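For reference, a minimal command-line sketch of the precompiled header mechanism (file names follow the usual VC++ convention and are placeholders; real projects set the equivalent /Yc, /Yu and /Fp options in the project properties):
cl /c /Ycstdafx.h /Fpproject.pch stdafx.cpp
cl /c /Yustdafx.h /Fpproject.pch foo.cpp bar.cpp
Every .cpp compiled with /Yu has to #include "stdafx.h" as its first line so the compiler can substitute the precompiled state.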
Furthermore, on Windows you typically have virus scanners, as well as system restore and search indexing tools, that can ruin your build times completely if they monitor your build folder. The Windows 7 Resource Monitor can help you spot it.
I have a reply here with some further tips for optimizing vc++ build times if you're really interested.

The difficulty is that C++ tends to spread itself and the compilation process over many small, individual files. That's something Linux is good at and Windows is not. If you want to make a really fast C++ compiler for Windows, try to keep everything in RAM and touch the filesystem as little as possible.
That's also how you'll make a faster Linux C++ compile chain, but it is less important in Linux because the file system is already doing a lot of that tuning for you.
The reason for this is due to Unix culture:
Historically file system performance has been a much higher priority in the Unix world than in Windows. Not to say that it hasn't been a priority in Windows, just that in Unix it has been a higher priority.
Access to source code.
You can't change what you can't control. Lack of access to the Windows NTFS source code means that most efforts to improve performance have been through hardware improvements. That is, if performance is slow, you work around the problem by improving the hardware: the bus, the storage medium, and so on. You can only do so much if you have to work around the problem rather than fix it.
Access to Unix source code (even before open source) was more widespread. Therefore, if you wanted to improve performance you would address it in software first (cheaper and easier) and hardware second.
As a result, there are many people in the world that got their PhDs by studying the Unix file system and finding novel ways to improve performance.
Unix tends towards many small files; Windows tends towards a few (or a single) big file.
Unix applications tend to deal with many small files. Think of a software development environment: many small source files, each with their own purpose. The final stage (linking) does create one big file, but that is a small percentage of the work.
As a result, Unix has highly optimized system calls for opening and closing files, scanning directories, and so on. The history of Unix research papers spans decades of file system optimizations that put a lot of thought into improving directory access (lookups and full-directory scans), initial file opening, and so on.
Windows applications tend to open one big file, hold it open for a long time, and close it when done. Think of MS Word. msword.exe (or whatever) opens the file once and appends for hours, updates internal blocks, and so on. Optimizing the opening of the file would be wasted effort.
The history of Windows benchmarking and optimization has been on how fast one can read or write long files. That's what gets optimized.
Sadly software development has trended towards the first situation. Heck, the best word processing system for Unix (TeX/LaTeX) encourages you to put each chapter in a different file and #include them all together.
Unix is focused on high performance; Windows is focused on user experience
Unix started in the server room: no user interface. The only thing users see is speed. Therefore, speed is a priority.
Windows started on the desktop: Users only care about what they see, and they see the UI. Therefore, more energy is spent on improving the UI than performance.
The Windows ecosystem depends on planned obsolescence. Why optimize software when new hardware is just a year or two away?
I don't believe in conspiracy theories, but if I did, I would point out that in the Windows culture there are fewer incentives to improve performance. The Windows business model depends on people buying new machines like clockwork. (That's why the stock price of thousands of companies is affected if MS ships an operating system late or if Intel misses a chip release date.) This means there is an incentive to solve performance problems by telling people to buy new hardware, not by fixing the real problem: slow operating systems. Unix comes from academia, where the budget is tight and you can get your PhD by inventing a new way to make file systems faster; rarely does someone in academia get points for solving a problem by issuing a purchase order. In Windows there is no conspiracy to keep software slow, but the entire ecosystem depends on planned obsolescence.
Also, as Unix is open source (and even when it wasn't, everyone had access to the source), any bored PhD student can read the code and become famous by making it better. That doesn't happen in Windows (MS does have a program that gives academics access to the Windows source code, but it is rarely taken advantage of). Look at this selection of Unix-related performance papers: http://www.eecs.harvard.edu/margo/papers/ or look up the history of papers by Ousterhout, Henry Spencer, or others. Heck, one of the biggest (and most enjoyable to watch) debates in Unix history was the back and forth between Ousterhout and Seltzer: http://www.eecs.harvard.edu/margo/papers/usenix95-lfs/supplement/rebuttal.html
You don't see that kind of thing happening in the Windows world. You might see vendors one-upping each other, but that seems to be much more rare lately since the innovation seems to all be at the standards body level.
That's how I see it.
Update: If you look at the new compiler chains that are coming out of Microsoft, you'll be very optimistic because much of what they are doing makes it easier to keep the entire toolchain in RAM and repeating less work. Very impressive stuff.

I personally found that running a Windows virtual machine on Linux managed to remove a great deal of the I/O slowness in Windows, likely because the Linux VM was doing lots of caching that Windows itself was not.
Doing that I was able to speed up compile times of a large (250Kloc) C++ project I was working on from something like 15 minutes to about 6 minutes.

Incremental linking
If the VC 2008 solution is set up as multiple projects with .lib outputs, you need to set "Use Library Dependency Inputs"; this makes the linker link directly against the .obj files rather than the .lib. (And actually makes it incrementally link.)
Directory traversal performance
It's a bit unfair to compare directory crawling on the original machine with crawling a newly created directory with the same files on another machine. If you want an equivalent test, you should probably make another copy of the directory on the source machine. (It may still be slow, but that could be due to any number of things: disk fragmentation, short file names, background services, etc.) Although I think the perf issues for dir /s have more to do with writing the output than measuring actual file traversal performance. Even dir /s /b > nul is slow on my machine with a huge directory.

I'm pretty sure it's related to the filesystem. I work on a cross-platform project for Linux and Windows where all the code is common except for where platform-dependent code is absolutely necessary. We use Mercurial, not git, so the "Linuxness" of git doesn't apply. Pulling in changes from the central repository takes forever on Windows compared to Linux, but I do have to say that our Windows 7 machines do a lot better than the Windows XP ones. Compiling the code after that is even worse on VS 2008. It's not just hg; CMake runs a lot slower on Windows as well, and both of these tools use the file system more than anything else.
The problem is so bad that most of our developers that work in a Windows environment don't even bother doing incremental builds anymore - they find that doing a unity build instead is faster.
Incidentally, if you want to dramatically decrease compilation times on Windows, I'd suggest the aforementioned unity build. It's a pain to implement correctly in the build system (I did it for our team in CMake), but once done it automagically speeds things up for our continuous integration servers. Depending on how many binaries your build system is spitting out, you can get 1 to 2 orders of magnitude improvement. Your mileage may vary. In our case I think it sped up the Linux builds threefold and the Windows one by about a factor of 10, but we have a lot of shared libraries and executables (which decreases the advantages of a unity build).
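For anyone unfamiliar with the technique, a minimal hand-written sketch of a unity source file (file names are hypothetical; a real setup, like the CMake one mentioned above, generates this file automatically):
// unity.cpp - the build compiles this single file instead of the individual .cpp files
#include "lexer.cpp"
#include "parser.cpp"
#include "codegen.cpp"
The win comes from parsing the shared headers once instead of once per translation unit; the price is that per-file incremental rebuilds are lost.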

How do you build your large cross platform project?
If you are using common makefiles for Linux and Windows, you can easily degrade Windows performance by a factor of 10 if the makefiles are not designed to be fast on Windows.
I just fixed some makefiles of a cross-platform project using common (GNU) makefiles for Linux and Windows. Make was starting a sh.exe process for each line of a recipe, causing the performance difference between Windows and Linux!
According to the GNU make documentation
.ONESHELL:
should solve the issue, but this feature is (currently) not supported by Windows make. So rewriting the recipes to be single logical lines (e.g. by adding ;\ or \ at the end of the recipe lines, as sketched below) worked very well!
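As a sketch (the target and command names are placeholders), a recipe like the following spawns one sh.exe per line:
build/foo.o: foo.c
	do_step_one foo.c
	do_step_two foo.c
whereas joining the commands into a single logical line runs them in one shell invocation:
build/foo.o: foo.c
	do_step_one foo.c ;\
	do_step_two foo.c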

IMHO this is all about disk I/O performance. The order of magnitude suggests a lot of the operations go to disk under Windows whereas they're handled in memory under Linux, i.e. Linux is caching better. Your best option under Windows would be to move your files onto a fast disk, server or filesystem. Consider buying a solid state drive or moving your files to a ramdisk or fast NFS server.
I ran the directory traversal tests and the results are very close to the compilation times reported, suggesting this has nothing to do with CPU processing times or compiler/linker algorithms at all.
Measured times as suggested above traversing the chromium directory tree:
Windows Home Premium 7 (8GB Ram) on NTFS: 32 seconds
Ubuntu 11.04 Linux (2GB Ram) on NTFS: 10 seconds
Ubuntu 11.04 Linux (2GB Ram) on ext4: 0.6 seconds
For the tests I pulled the chromium sources (both under win/linux)
git clone http://github.com/chromium/chromium.git
cd chromium
git checkout remotes/origin/trunk
To measure the time I ran
ls -lR > ../list.txt ; time ls -lR > ../list.txt # bash
dir -Recurse > ../list.txt ; (measure-command { dir -Recurse > ../list.txt }).TotalSeconds #Powershell
I did turn off access timestamps, my virus scanner and increased the cache manager settings under windows (>2Gb RAM) - all without any noticeable improvements. Fact of the matter is, out of the box Linux performed 50x better than Windows with a quarter of the RAM.
For anybody who wants to contend that the numbers are wrong - for whatever reason - please give it a try and post your findings.

Try using jom instead of nmake
Get it here:
https://github.com/qt-labs/jom
The fact is that nmake uses only one of your cores; jom is a clone of nmake that makes use of multicore processors.
GNU make does that out of the box thanks to the -j option, which might be a reason for its speed vs. Microsoft's nmake.
jom works by executing in parallel different make commands on different processors/cores.
Try it yourself and feel the difference!
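Usage is essentially a drop-in replacement for nmake (the makefile name is a placeholder); the GNU make equivalent with an explicit job count is shown for comparison:
jom /F Makefile
make -f Makefile -j8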

I want to add just one observation about using GNU make and other tools from MinGW on Windows: they seem to resolve hostnames even when the tools do not communicate via IP at all. I would guess this is caused by some initialisation routine of the MinGW runtime. Running a local DNS proxy helped me to improve the compilation speed with these tools.
Earlier I got a big headache because the build speed dropped by a factor of 10 or so when I had a VPN connection open in parallel. In that case all these DNS lookups went through the VPN.
This observation might also apply to other build tools, not only MinGW-based ones, and it may have changed in the latest MinGW versions in the meantime.

I recently achieved another speed-up of about 10% for compilation on Windows with GNU make, by replacing the MinGW bash.exe with the version from win-bash.
(win-bash is not very comfortable for interactive editing, though.)

Related

What is causing one Vista machine to be 10 times faster than another machine?

We run a Fortran console program we have run for years. Recently we purchased identical new HP server-class machines (4 processors, 8 GB RAM, 4 hard drives) for everyone in the office. We configured them as nearly identically as we know how. We can compile the Fortran program on one machine and pass the executable to the different machines, and on two machines it executes painfully slowly, while on two others it has modest performance (but not as good as before we upgraded from the XP machines).
It uses almost no console output (about 40 lines) but outputs about 15 megs of files.
We open task manager to see what's going on, and we see that on the slow machines it's loading ONE CPU to about 15%. On the fast machines it's loading ALL CPUs to about 40% (but one of them seems to load more than the others). As I recall, on XP it loaded the CPU to 99%, and ran much faster.
These machines are the employees' general purpose machines, and have lots of company bloatware on them. And there is the possibility they have slightly different directory structures. But what seems totally puzzling to me is why Vista is not giving them more CPU time. If the CPUs were loading up, I might blame the performance variation on different directory structures, but not loading up the CPUs just boggles my mind.
David
If there's a bottleneck in I/O, the CPU wouldn't be loaded as much because it's mostly waiting for the I/O to take place. One could even imagine this causing the one-CPU vs. many-CPUs problem, if there's just no point in kicking in another CPU because there's plenty of idle time while waiting. What if you take an external HD and check whether the differences still occur when you run the same program from that HD on the different machines?
Please go into Windows Task Manager, Performance tab, select the [Kernel Times] option in the [View] menu, and look at what's displayed on the bars during program execution.
If it's only a 15% load on a quad-core + hyperthreading box, that basically says OpenMP, MPI (or whatever it uses) isn't working properly - it is running on 1/8 of the cores => ~15%. Can you run the MPI test command for your specific system to check for errors in multiprocessing on each box? Then the question becomes: why does the multiprocessing environment not work?
Regards
rbo
SWAG, but have you checked your virus scanner configuration? If the scanner isn't set to ignore the type of file you're writing on the slow machine, then each write to those files might be getting intercepted and scanned before being written to the disk. This could lead to the process sitting in I/O wait and not getting scheduled as often.
Vista had a problem with some uncontrollable memory leaks; perhaps this is your issue - some conflict in the "bloatware" is causing a memory leak, and so your Fortran program runs much slower?
I assume you have tested this with all programs ended. It seems unlikely that your console program is the issue. Sounds like there's definitely a memory conflict going on though.

How do I improve Windows Subversion client update performance?

How do I improve Subversion client update performance? It appears to be disk bound on the client.
Details:
CollabNet Windows client version 1.6.2 (r37639)
Windows XP SP2
3 GB RAM with PF Usage around 1 GB and System Cache of 1.1 GB.
Disk has write caching enabled
Update takes 7-15 minutes (when very little to update).
Checkout has 36,083 directories/files (from svn list)
Repository has 58,750 revisions.
Checkout takes about 2.7 GB
Perf monitor shows % Disk Write time stays near 90% during update.
Max Disk Read Bytes/sec got up to 12.8M and write got up to 5.2M
CPU, paging file usage, and network usage are all low.
Watching the server performance seems to show that it isn't a bottleneck.
I'm especially interested in answers besides getting a faster disk (especially configuration changes).
Updates from some of the suggestions:
I need the whole thing so sparse directories won't work.
Another client (TortoiseSVN) takes 7 minutes also
TortoiseSVN icon overlays have been configured so they don't cause the problem.
Anti-virus is configured to skip that directory, so it isn't causing the problem.
I experience exactly the same thing. We recently replaced Perforce with svn, but if we cannot overcome the performance problems on Windows we must consider another tool.
Using svn 1.6.6, Win XP and Vista clients. RedHat server.
My observations matches yours:
Huge disk-write activity.
Antivirus not a bottleneck.
No matter which svn clients are used.
No server or network bottleneck.
Complementary info
More than 3 times faster operations on:
Linux (Ubuntu).
Linux (Ubuntu) running on VirtualBox at Win Vista host.
Win XP running on VMWare at RedHat host.
Do you need every bit of the repository on your working copy? If you truly only care about particular portions of the tree, look into Subversion's Sparse Directories (a.k.a. "Sparse Checkouts") feature. It allows you to manipulate your working copy so it only contains those directories of interest.
Just as an example, you might use this to prune documentation, installer-related files, etc. Depending on what you truly need on your local machine, embracing this approach could make a serious dent in your wait times.
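A hedged sketch of what that looks like on the command line (the repository URL and directory names are placeholders):
svn checkout --depth immediates http://server/repo/trunk myproject
cd myproject
svn update --set-depth infinity src
The first command brings down only the top-level entries; the second pulls the full contents of just the directories you actually work in.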
Try svn client version 1.5.x. It helped me on my Vista laptop. Versions 1.6.x are extremely slow.
This is more likely to be your network and the amount of data moved as well as your client. Are you using Tortoise? I find it to be a bit slow myself when moving that much data!
Are you using TortoiseSVN? If so, the Icon Overlays do slow down operations. If you go to TortoiseSVN Settings/Icon Overlays there are several settings you can tweak to control the level to which you want to use the Overlays, including turning them off completely. See if that affects your performance.
Do you run a virus checker that uses on-access scanning? That can really make it crawl. If so, turn it off and see if that helps. Most scanners will have a way to exclude specific directories if that helps.
Nobody seems to be pointing out the one reason that I often consider a design flaw. Subversion creates a second "pristine" copy of the checkout for offline operations. If you're checking out 4G of files, it's actually writing 8G to disk.
Compare a checkout to an export. That will show you the massive difference when writing those second copies.
There's nothing you can do about that.
Upgrade to svn 1.7
From Discussion of Slow Performance of SVN Update:
The update process in svn 1.6 goes something like this:
search the entire working copy, to see what's there at the moment, and locking it so no one changes the answer during the next steps
tell that to the server
receive from the server whatever new stuff you need, applying the changes to the files as you go
recurse over the entire working copy again, unlocking it
If there are many directories and files, steps 1 and 4 can take up a lot of time. This would be consistent with your observation of long delays with no network traffic.
The working copy format was changed in svn 1.7. All meta information is now stored in an SQLite database in the root folder of the working copy, and there is no longer any need to perform steps 1 and 4, which consumed most of the time during svn update.
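After installing a 1.7 client, each existing working copy needs a one-time conversion to the new format (the path is a placeholder), roughly:
svn --version
cd path\to\working\copy
svn upgrade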

Win32 console processes in VISTA - 10% CPU, but VERY SLOW

I have a Win32 console application which is doing some computations, compiled in Compaq Visual Fortran (which probably doesn't matter).
I need to run a lot of them simultaneously.
In XP, they take around 90-100% CPU together, work very fast.
In Vista, no matter how many of them I run, they take no more than 10% of CPU (together), and work very slow respectively.
There is quite a bit of console output going on, but not VERY much.
I can minimize all the windows, it does not help. CPU is basically doing nothing...
Any ideas?
Update:
1. No, these are different machines, but they have roughly the same hardware. 2. Threads are not used; this is a VERY OLD (20 yrs) plain DOS app, compiled for Win32. It is supposed to compute iterations until they converge, consuming all the CPU it can get. My impression: Vista just does NOT GIVE IT MORE CPU.
Have you tried redirecting the console output to a file?
If your applications are being held up writing to the console (this happens sometimes unfortunately) then redirecting the output should help, as it's much quicker to write to a simple file than write to the console.
You do this like so
c:\temp> dir > output.log
If you really don't care about the output at all, you can throw it away, by redirecting to nul. eg:
c:\temp> dir > nul
There was a known "feature" in Vista that limits certain console applications to 32MB of RAM. I don't know if those compiled by Compaq Visual Fortran are affected by this "feature."
This article appears to have been updated as recently as October 2008, so the problem still exists.
To expound on Daok's post - your XP machine might be CPU bound for this process, whereas the vista machine is bound by some other resource.
To clarify: output to stdout (or elsewhere) can be slowing down the processing, as can context switching, file access, etc.
As Tim hinted, console output (stdout) is EXTREMELY expensive.
I suggest rerunning your test while redirecting the console output to a separate log file for each process. If possible, tune down the verbosity of the output in another test run.
Beyond that, there are other obvious possibilities: is the hardware significantly different, are there other major processes running, is there a shared resource that is under contention?
Other than the obvious, look for a nonobvious resource contention such as a shared file.
But the main area where I would look is whether there is a significant difference in how your code is compiled for the two OS environments--I wonder if your Fortran code is incurring some kind of special penalty when running on Vista, such as a compatibility mode. Look to see how well Vista is supported and whether you can target your compile for Vista specifically. Also look for anyone reporting similar issues, such as in bug reports, feature requests, etc.
Your loops are obviously not simple computations. There is a blocking system call in there somewhere. Just because it worked on XP doesn't mean the app is bug free.
Since you can minimize the console windows and see no improvement, I would not consider that an issue. In my experience console output slows a program down only if the console window is drawing text, not when it's minimized.
Is it the same hardware on your Vista and XP machines? It might use just 10% on Vista because it doesn't require more. Are you using threads? I think we need more information about your project to have a better idea. Have you tried using a profiler to see what's going on?

Compiling code on an external drive

To make things easier when switching between machines (my workstation at the office and my personal laptop) I have thought about trying an external hard drive to store my working directory on. Specifically I am looking at Firewire 800 drives (most are 5400 rpm 8mb cache). What I am wondering is if anyone has experience with doing this with Visual Studio projects and what sort of performance hit they see.
It depends on the size of the project. The throughput is low and the latency is high, so you're going to get hit every which way, but due to the latency you'll be hit harder if you have a lot of little files rather than a few large ones.
Have you considered simply carrying around a Git or other distributed repository and updating the machine repositories as you move around? Then you can compile locally and treat the drive as a roving server. Since only changes will be moved across, it should be faster, and your code will be 'backed up' in more places.
If you forget the drive, it breaks, or is lost/stolen, then you can still sit down at a PC and program with no code missing if you're at the last PC you used, or very little code missing (which will be updated later with a resync anyway).
And it's just a hop skip and a jump away from simply using the network to move the changes between the systems if you don't want to carry the drive around later.
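A minimal sketch of that workflow, with drive letters and paths as placeholders:
git clone --bare C:\work\myproject E:\repos\myproject.git
git remote add usb E:\repos\myproject.git
git push usb master
On the other machine you clone from (or add a remote pointing at) E:\repos\myproject.git, pull before you start, and push when you leave.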
I use vmware and the virtual machines are on an external usb drive. Performance is fine. You might have some issues with the drive name changing - not an issue if you use virtual machines.
Granted, I work in an industry where Personal Information and Intellectual Property are king, but I don't like that idea at all. That hard drive disappears and you have a big problem.
Why not Remote Desktop into the work machine?
EDIT Stipud Spelingg

How to obtain good concurrent read performance from disk

I'd like to ask a question then follow it up with my own answer, but also see what answers other people have.
We have two large files which we'd like to read from two separate threads concurrently. One thread will sequentially read fileA while the other thread will sequentially read fileB. There is no locking or communication between the threads, both are sequentially reading as fast as they can, and both are immediately discarding the data they read.
Our experience with this setup on Windows is very poor. The combined throughput of the two threads is in the order of 2-3 MiB/sec. The drive seems to be spending most of its time seeking backwards and forwards between the two files, presumably reading very little after each seek.
If we disable one of the threads and temporarily look at the performance of a single thread then we get much better bandwidth (~45 MiB/sec for this machine). So clearly the bad two-thread performance is an artefact of the OS disk scheduler.
Is there anything we can do to improve the concurrent thread read performance? Perhaps by using different APIs or by tweaking the OS disk scheduler parameters in some way.
Some details:
The files are in the order of 2 GiB each on a machine with 2GiB of RAM. For the purpose of this question we consider them not to be cached and perfectly defragmented. We have used defrag tools and rebooted to ensure this is the case.
We are using no special APIs to read these files. The behaviour is repeatable across various bog-standard APIs such as Win32's CreateFile, C's fopen, C++'s std::ifstream, Java's FileInputStream, etc.
Each thread spins in a loop making calls to the read function. We have varied the number of bytes requested from the API each iteration from values between 1KiB up to 128MiB. Varying this has had no effect, so clearly the amount the OS is physically reading after each disk seek is not dictated by this number. This is exactly what should be expected.
The dramatic difference between one-thread and two-thread performance is repeatable across Windows 2000, Windows XP (32-bit and 64-bit), Windows Server 2003, and also with and without hardware RAID5.
The problem seems to be in Windows' I/O scheduling policy. According to what I found here, there are many ways for an OS to schedule disk requests. While Linux and others can choose between different policies, before Vista Windows was locked into a single policy: a FIFO queue, where all requests were split into 64 KB blocks. I believe this policy is the cause of the problem you are experiencing: the scheduler mixes requests from the two threads, causing continuous seeking between different areas of the disk.
Now, the good news is that according to here and here, Vista introduced a smarter disk scheduler, where you can set the priority of your requests and also allocate a minimum bandwidth for your process.
The bad news is that I found no way to change the disk policy or buffer sizes in previous versions of Windows. Also, even if raising the disk I/O priority of your process will boost performance against other processes, you still have the problem of your threads competing against each other.
What I can suggest is to modify your software by introducing a self-made disk access policy.
For example, you could use a policy like this in your thread B (similar for Thread A):
if THREAD A is reading from disk then wait for THREAD A to stop reading or wait for X ms
Read for X ms (or Y MB)
Stop reading and check status of thread A again
You could use semaphores for status checking or you could use perfmon counters to get the status of the actual disk queue.
The values of X and/or Y could also be auto-tuned by checking the actual transfer rates and slowly modifying them, thus maximizing the throughput when the application runs on different machines and/or OSes. You could find that cache, memory or RAID levels affect them in one way or another, but with auto-tuning you will always get the best performance in every scenario.
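A minimal C++ sketch of the alternating-access idea above (written with C++11 primitives for brevity, so treat it as illustrative rather than era-accurate; the file names and chunk size are placeholders, and the time-slicing and auto-tuning parts are left out):
#include <fstream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex diskLock;                                  // only one thread touches the disk at a time
const std::size_t CHUNK = 16 * 1024 * 1024;           // read 16 MiB per turn

void readSequentially(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(CHUNK);
    while (in) {
        std::lock_guard<std::mutex> turn(diskLock);   // wait for the other reader's turn to end
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));  // one large chunk, then release
        // the data is discarded, as in the benchmark described in the question
    }
}

int main()
{
    std::thread a(readSequentially, "fileA.bin");     // hypothetical file names
    std::thread b(readSequentially, "fileB.bin");
    a.join();
    b.join();
}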
I'd like to add some further notes in my response. All the non-Microsoft operating systems we have tested do not suffer from this problem. Linux, FreeBSD, and Mac OS X (this last one on different hardware) all degrade much more gracefully in terms of aggregate bandwidth when moving from one thread to two. Linux, for example, degraded from ~45 MiB/sec to ~42 MiB/sec. These other operating systems must be reading larger chunks of the file between each seek, and therefore not spending nearly all their time waiting on the disk to seek.
Our solution for Windows is to pass the FILE_FLAG_NO_BUFFERING flag to CreateFile and use large (~16MiB) reads in each call to ReadFile. This is suboptimal for several reasons:
Files don't get cached when read like this, so there are none of the advantages that caching normally gives.
The constraints when working with this flag are much more complicated than normal reading (alignment of read buffers to page boundaries, etc).
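A rough Win32 sketch of that workaround, with the file name as a placeholder and error handling omitted; with FILE_FLAG_NO_BUFFERING the buffer address and read size must be sector-aligned, which VirtualAlloc (page-aligned memory) and a 16 MiB chunk satisfy:
#include <windows.h>

void drainFileUnbuffered(const wchar_t* path)
{
    const DWORD chunk = 16 * 1024 * 1024;             // ~16 MiB per ReadFile call
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    void* buffer = VirtualAlloc(NULL, chunk, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    DWORD bytesRead = 0;
    while (ReadFile(file, buffer, chunk, &bytesRead, NULL) && bytesRead > 0) {
        // data is discarded, as in the benchmark; a real reader would consume it here
    }
    VirtualFree(buffer, 0, MEM_RELEASE);
    CloseHandle(file);
}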
(As a final remark. Does this explain why swapping under Windows is so hellish? Ie, Windows is incapable of doing IO to multiple files concurrently with any efficiency, so while swapping all other IO operations are forced to be disproportionately slow.)
Edit to add some further details for Will Dean:
Of course across these different hardware configurations the raw figures did change (sometimes substantially). The problem however is the consistent degradation in performance that only Windows suffers when moving from one thread to two. Here is a summary of the machines tested:
Several Dell workstations (Intel Xeon) of various ages running Windows 2000, Windows XP (32-bit), and Windows XP (64-bit) with single drive.
A Dell 1U server (Intel Xeon) running Windows Server 2003 (64-bit) with RAID 1+0.
An HP workstation (AMD Opteron) with Windows XP (64-bit), and Windows Server 2003, and hardware RAID 5.
My home unbranded PC (AMD Athlon64) running Windows XP (32-bit), FreeBSD (64-bit), and Linux (64-bit) with single drive.
My home MacBook (Intel Core1) running Mac OS X, single SATA drive.
My home Koolu PC running Linux. Vastly underpowered compared to the other systems but I demonstrated that even this machine can outperform a Windows server with RAID5 when doing multi-threaded disk reads.
CPU usage on all of these systems was very low during the tests and anti-virus was disabled.
I forgot to mention before but we also tried the normal Win32 CreateFile API with the FILE_FLAG_SEQUENTIAL_SCAN flag set. This flag didn't fix the problem.
It does seem a little strange that you see no difference across quite a wide range of windows versions and nothing between a single drive and hardware raid-5.
It's only 'gut feel', but that does make me doubtful that this is really a simple seeking problem. Other than the OS X and the Raid5, was all this tried on the same machine - have you tried another machine? Is your CPU usage basically zero during this test?
What's the shortest app you can write which demonstrates this problem? - I would be interested to try it here.
I would create some kind of in memory thread safe lock. Each thread could wait on the lock until it was free. When the lock becomes free, take the lock and read the file for a defined length of time or a defined amount of data, then release the lock for any other waiting threads.
Do you use IOCompletionPorts under Windows? Windows via C++ has an in-depth chapter on this subject and as luck would have it, it is also available on MSDN.
Paul - saw the update. Very interesting.
It would be interesting to try it on Vista or Win2008, as people seem to be reporting some considerable I/O improvements on these in some circumstances.
My only suggestion about a different API would be to try memory mapping the files - have you tried that? Unfortunately at 2GB per file, you're not going to be able to map multiple whole files on a 32-bit machine, which means this isn't quite as trivial as it might be.
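In case it helps, a minimal Win32 sketch of reading a file through a memory mapping (the file name is a placeholder, error handling omitted); as noted, a 32-bit process would have to map a sliding window rather than the whole 2 GiB file:
#include <windows.h>

void touchMappedFile(const wchar_t* path)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);
    HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY, 0, 0, NULL);
    const char* view = (const char*)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    volatile char sink = 0;
    for (LONGLONG offset = 0; offset < size.QuadPart; offset += 4096)
        sink += view[offset];                          // touching each page makes the OS page it in
    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
}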
