At work we are using some pretty powerful machines: HP Z600 with dual xeon #2.5GHz, 8-16GB ram. Unfortunately due to ill-fitting company policies we are forced to use 32-bit XP, so I made a PAE ramdrive out of the unused 4GB RAM.
Now, the temporary files are on the ramdrive. I've also tried moving the whole project to the ramdrive, then to an SSD, but there was no noticeable improvement in hosted-mode startup or compilation times.
I've then run SysInternals' Process Monitor to see if there are any bottlenecks that are not visible with task manager / hard-disk activity led, but I have not seen anything notable - except some buffer overflows which I don't understand what they mean.
I can assume that performance for OOMPH startup and GWT compilation are tied so I'm using the compilation times as benchmarking between various changes.
I've activated and deactivated hyper-threading and turbo-boost in BIOS, but again saw no differences. Hyperthreading even seems to make everything slower, I can assume that context switching penalty is higher for 16 cores than for 8 cores. Turbo-boost does not seem to do anything, I can assume it only works under Win7, I have not succeeded in activating the driver. It should boost the core from 2.5Ghz to 2.8Ghz.
Deactivated indexing and timestamping on NTFS drives, changed the performance setting from foreground to background and back, used another instance of Eclipse - no changes.
For compilation I've tried specifying a different number of workers, larger memory and some other options. Everything above two workers increases compilation time.
Older HP machines (XW6600) seem to compile a bit faster, perhaps because of the 2.8GHz clock, but their hosted mode seems to start up slower.
To sum up, memory usage is at about 2.6GB, pagefile usage is zero, harddrive is not signaling much activity, CPU activity is <10% (single core at about 50-70%) but still the computer seems to do nothing for some time while compiling or launching OOMPH GWT.
Ok, so now that I've tried everything I knew and found on the Internet, is there anything else that I can try? Will switching to 64-bit Win7 improve much (this is due next year anyway)? Are there any hardware/software options that I can tweak?
L.E.: also ran RATT (tracer from MS) to see if there are any interrupts taking too long, but everything seems in order. Antivirus does not make a difference. Benchmarked another GWT project against my i7 mobile (2630q) and the i7 is about 70% faster, though it has about the same clock.
In most cases I've encountered the reason for slow compilation/hosted mode refresh is that application is written this way.
For compilation there are few things to watch out for:
Don't keep unused classes and modules in your classpath, remove them if possible because GWT is parsing them anyway at precompile stage
whatch out for GWT-RPC type explosion, this is usually a cause of all problems
Be really carefull with interface ContextWithLokup, always try to minimize the number of methods used in interface which extends ContextWithLookup
Try to check out some other compilation options, like distributed compilation , soft permutations or multi-jvm compilation (run compiler with system property -Dgwt.jjs.permutationWorkerFactory=com.google.gwt.dev.ExternalPermutationWorkerFactory and -localWorkers to specify number of JVMs)
For hosted mode:
Use lazy loading where it is possible. The greatest problem with hosted mode i saw, i that application is trying to initialize too many classes which aren't even used on current screen, this can greatly speedup a hosted mode startup. Again, stuff like RPC type explosion affects hosted mode as well
That's all I can advise. Stuff like ram disk can speed up stuff like -compileReport, since it can generate a huge number of files.
Related
So, yesterday I opened task manager in Win 8 (64 bit) and noticed that Chrome (32-bit for some reason) didn't use the whole power my PC has got. So I was running an AI JavaScript program and I noticed that my CPU was running at 1% and Memory was only runnning 120 MB, and that forced me to think why would I wait 5 minutes for it to run instead of somehow boosting it to at least 60%. As far as I know Windows automatically distributes the hardware usage to programs so I'm asking what's the problem:
Is it because it's x32?
Is it because I should manually configure windows to give it more power?
Note: I did search Google, but all I got is that people actually complain about High CPU usage and I've got the opposite.
32 bit doesn't make a difference here. Javascript is inherently single-threaded, so by default (not counting web workers) it won't use more than a single core on your machine. It just cannot. Also memory usage doesn't necessarily tell you how hard a program is working. Some need lots of memory, others only little.
It's up to programs to use the resources of the machine most efficiently; if they don't, there is nothing you can do with Windows to make them run better or faster.
I know this is not so much a programming question but it is relevant.
I work on a fairly large cross platform project. On Windows I use VC++ 2008. On Linux I use gcc. There are around 40k files in the project. Windows is 10x to 40x slower than Linux at compiling and linking the same project. How can I fix that?
A single change incremental build 20 seconds on Linux and > 3 mins on Windows. Why? I can even install the 'gold' linker in Linux and get that time down to 7 seconds.
Similarly git is 10x to 40x faster on Linux than Windows.
In the git case it's possible git is not using Windows in the optimal way but VC++? You'd think Microsoft would want to make their own developers as productive as possible and faster compilation would go a long way toward that. Maybe they are trying to encourage developers into C#?
As simple test, find a folder with lots of subfolders and do a simple
dir /s > c:\list.txt
on Windows. Do it twice and time the second run so it runs from the cache. Copy the files to Linux and do the equivalent 2 runs and time the second run.
ls -R > /tmp/list.txt
I have 2 workstations with the exact same specs. HP Z600s with 12gig of ram, 8 cores at 3.0ghz. On a folder with ~400k files Windows takes 40seconds, Linux takes < 1 second.
Is there a registry setting I can set to speed up Windows? What gives?
A few slightly relevant links, relevant to compile times, not necessarily i/o.
Apparently there's an issue in Windows 10 (not in Windows 7) that closing a process holds a global lock. When compiling with multiple cores and therefore multiple processes this issue hits.
The /analyse option can adversely affect perf because it loads a web browser. (Not relevant here but good to know)
Unless a hardcore Windows systems hacker comes along, you're not going to get more than partisan comments (which I won't do) and speculation (which is what I'm going to try).
File system - You should try the same operations (including the dir) on the same filesystem. I came across this which benchmarks a few filesystems for various parameters.
Caching. I once tried to run a compilation on Linux on a RAM disk and found that it was slower than running it on disk thanks to the way the kernel takes care of caching. This is a solid selling point for Linux and might be the reason why the performance is so different.
Bad dependency specifications on Windows. Maybe the chromium dependency specifications for Windows are not as correct as for Linux. This might result in unnecessary compilations when you make a small change. You might be able to validate this using the same compiler toolchain on Windows.
A few ideas:
Disable 8.3 names. This can be a big factor on drives with a large number of files and a relatively small number of folders: fsutil behavior set disable8dot3 1
Use more folders. In my experience, NTFS starts to slow down with more than about 1000 files per folder.
Enable parallel builds with MSBuild; just add the "/m" switch, and it will automatically start one copy of MSBuild per CPU core.
Put your files on an SSD -- helps hugely for random I/O.
If your average file size is much greater than 4KB, consider rebuilding the filesystem with a larger cluster size that corresponds roughly to your average file size.
Make sure the files have been defragmented. Fragmented files cause lots of disk seeks, which can cost you a factor of 40+ in throughput. Use the "contig" utility from sysinternals, or the built-in Windows defragmenter.
If your average file size is small, and the partition you're on is relatively full, it's possible that you are running with a fragmented MFT, which is bad for performance. Also, files smaller than 1K are stored directly in the MFT. The "contig" utility mentioned above can help, or you may need to increase the MFT size. The following command will double it, to 25% of the volume: fsutil behavior set mftzone 2 Change the last number to 3 or 4 to increase the size by additional 12.5% increments. After running the command, reboot and then create the filesystem.
Disable last access time: fsutil behavior set disablelastaccess 1
Disable the indexing service
Disable your anti-virus and anti-spyware software, or at least set the relevant folders to be ignored.
Put your files on a different physical drive from the OS and the paging file. Using a separate physical drive allows Windows to use parallel I/Os to both drives.
Have a look at your compiler flags. The Windows C++ compiler has a ton of options; make sure you're only using the ones you really need.
Try increasing the amount of memory the OS uses for paged-pool buffers (make sure you have enough RAM first): fsutil behavior set memoryusage 2
Check the Windows error log to make sure you aren't experiencing occasional disk errors.
Have a look at Physical Disk related performance counters to see how busy your disks are. High queue lengths or long times per transfer are bad signs.
The first 30% of disk partitions is much faster than the rest of the disk in terms of raw transfer time. Narrower partitions also help minimize seek times.
Are you using RAID? If so, you may need to optimize your choice of RAID type (RAID-5 is bad for write-heavy operations like compiling)
Disable any services that you don't need
Defragment folders: copy all files to another drive (just the files), delete the original files, copy all folders to another drive (just the empty folders), then delete the original folders, defragment the original drive, copy the folder structure back first, then copy the files. When Windows builds large folders one file at a time, the folders end up being fragmented and slow. ("contig" should help here, too)
If you are I/O bound and have CPU cycles to spare, try turning disk compression ON. It can provide some significant speedups for highly compressible files (like source code), with some cost in CPU.
NTFS saves file access time everytime. You can try disabling it:
"fsutil behavior set disablelastaccess 1"
(restart)
The issue with visual c++ is, as far I can tell, that it is not a priority for the compiler team to optimize this scenario.
Their solution is that you use their precompiled header feature. This is what windows specific projects have done. It is not portable, but it works.
Furthermore, on windows you typically have virus scanners, as well as system restore and search tools that can ruin your build times completely if they monitor your buid folder for you. windows 7 resouce monitor can help you spot it.
I have a reply here with some further tips for optimizing vc++ build times if you're really interested.
The difficulty in doing that is due to the fact that C++ tends to spread itself and the compilation process over many small, individual, files. That's something Linux is good at and Windows is not. If you want to make a really fast C++ compiler for Windows, try to keep everything in RAM and touch the filesystem as little as possible.
That's also how you'll make a faster Linux C++ compile chain, but it is less important in Linux because the file system is already doing a lot of that tuning for you.
The reason for this is due to Unix culture:
Historically file system performance has been a much higher priority in the Unix world than in Windows. Not to say that it hasn't been a priority in Windows, just that in Unix it has been a higher priority.
Access to source code.
You can't change what you can't control. Lack of access to Windows NTFS source code means that most efforts to improve performance have been though hardware improvements. That is, if performance is slow, you work around the problem by improving the hardware: the bus, the storage medium, and so on. You can only do so much if you have to work around the problem, not fix it.
Access to Unix source code (even before open source) was more widespread. Therefore, if you wanted to improve performance you would address it in software first (cheaper and easier) and hardware second.
As a result, there are many people in the world that got their PhDs by studying the Unix file system and finding novel ways to improve performance.
Unix tends towards many small files; Windows tends towards a few (or a single) big file.
Unix applications tend to deal with many small files. Think of a software development environment: many small source files, each with their own purpose. The final stage (linking) does create one big file but that is an small percentage.
As a result, Unix has highly optimized system calls for opening and closing files, scanning directories, and so on. The history of Unix research papers spans decades of file system optimizations that put a lot of thought into improving directory access (lookups and full-directory scans), initial file opening, and so on.
Windows applications tend to open one big file, hold it open for a long time, close it when done. Think of MS-Word. msword.exe (or whatever) opens the file once and appends for hours, updates internal blocks, and so on. The value of optimizing the opening of the file would be wasted time.
The history of Windows benchmarking and optimization has been on how fast one can read or write long files. That's what gets optimized.
Sadly software development has trended towards the first situation. Heck, the best word processing system for Unix (TeX/LaTeX) encourages you to put each chapter in a different file and #include them all together.
Unix is focused on high performance; Windows is focused on user experience
Unix started in the server room: no user interface. The only thing users see is speed. Therefore, speed is a priority.
Windows started on the desktop: Users only care about what they see, and they see the UI. Therefore, more energy is spent on improving the UI than performance.
The Windows ecosystem depends on planned obsolescence. Why optimize software when new hardware is just a year or two away?
I don't believe in conspiracy theories but if I did, I would point out that in the Windows culture there are fewer incentives to improve performance. Windows business models depends on people buying new machines like clockwork. (That's why the stock price of thousands of companies is affected if MS ships an operating system late or if Intel misses a chip release date.). This means that there is an incentive to solve performance problems by telling people to buy new hardware; not by improving the real problem: slow operating systems. Unix comes from academia where the budget is tight and you can get your PhD by inventing a new way to make file systems faster; rarely does someone in academia get points for solving a problem by issuing a purchase order. In Windows there is no conspiracy to keep software slow but the entire ecosystem depends on planned obsolescence.
Also, as Unix is open source (even when it wasn't, everyone had access to the source) any bored PhD student can read the code and become famous by making it better. That doesn't happen in Windows (MS does have a program that gives academics access to Windows source code, it is rarely taken advantage of). Look at this selection of Unix-related performance papers: http://www.eecs.harvard.edu/margo/papers/ or look up the history of papers by Osterhaus, Henry Spencer, or others. Heck, one of the biggest (and most enjoyable to watch) debates in Unix history was the back and forth between Osterhaus and Selzer http://www.eecs.harvard.edu/margo/papers/usenix95-lfs/supplement/rebuttal.html
You don't see that kind of thing happening in the Windows world. You might see vendors one-uping each other, but that seems to be much more rare lately since the innovation seems to all be at the standards body level.
That's how I see it.
Update: If you look at the new compiler chains that are coming out of Microsoft, you'll be very optimistic because much of what they are doing makes it easier to keep the entire toolchain in RAM and repeating less work. Very impressive stuff.
I personally found running a windows virtual machine on linux managed to remove a great deal of the IO slowness in windows, likely because the linux vm was doing lots of caching that Windows itself was not.
Doing that I was able to speed up compile times of a large (250Kloc) C++ project I was working on from something like 15 minutes to about 6 minutes.
Incremental linking
If the VC 2008 solution is set up as multiple projects with .lib outputs, you need to set "Use Library Dependency Inputs"; this makes the linker link directly against the .obj files rather than the .lib. (And actually makes it incrementally link.)
Directory traversal performance
It's a bit unfair to compare directory crawling on the original machine with crawling a newly created directory with the same files on another machine. If you want an equivalent test, you should probably make another copy of the directory on the source machine. (It may still be slow, but that could be due to any number of things: disk fragmentation, short file names, background services, etc.) Although I think the perf issues for dir /s have more to do with writing the output than measuring actual file traversal performance. Even dir /s /b > nul is slow on my machine with a huge directory.
I'm pretty sure it's related to the filesystem. I work on a cross-platform project for Linux and Windows where all the code is common except for where platform-dependent code is absolutely necessary. We use Mercurial, not git, so the "Linuxness" of git doesn't apply. Pulling in changes from the central repository takes forever on Windows compared to Linux, but I do have to say that our Windows 7 machines do a lot better than the Windows XP ones. Compiling the code after that is even worse on VS 2008. It's not just hg; CMake runs a lot slower on Windows as well, and both of these tools use the file system more than anything else.
The problem is so bad that most of our developers that work in a Windows environment don't even bother doing incremental builds anymore - they find that doing a unity build instead is faster.
Incidentally, if you want to dramatically decrease compilation speed in Windows, I'd suggest the aforementioned unity build. It's a pain to implement correctly in the build system (I did it for our team in CMake), but once done automagically speeds things up for our continuous integration servers. Depending on how many binaries your build system is spitting out, you can get 1 to 2 orders of magnitude improvement. Your mileage may vary. In our case I think it sped up the Linux builds threefold and the Windows one by about a factor of 10, but we have a lot of shared libraries and executables (which decreases the advantages of a unity build).
How do you build your large cross platform project?
If you are using common makefiles for Linux and Windows you could easily degrade windows performance by a factor of 10 if the makefiles are not designed to be fast on Windows.
I just fixed some makefiles of a cross platform project using common (GNU) makefiles for Linux and Windows. Make is starting a sh.exe process for each line of a recipe causing the performance difference between Windows and Linux!
According to the GNU make documentation
.ONESHELL:
should solve the issue, but this feature is (currently) not supported for Windows make. So rewriting the recipes to be on single logical lines (e.g. by adding ;\ or \ at the end of the current editor lines) worked very well!
IMHO this is all about disk I/O performance. The order of magnitude suggests a lot of the operations go to disk under Windows whereas they're handled in memory under Linux, i.e. Linux is caching better. Your best option under windows will be to move your files onto a fast disk, server or filesystem. Consider buying an Solid State Drive or moving your files to a ramdisk or fast NFS server.
I ran the directory traversal tests and the results are very close to the compilation times reported, suggesting this has nothing to do with CPU processing times or compiler/linker algorithms at all.
Measured times as suggested above traversing the chromium directory tree:
Windows Home Premium 7 (8GB Ram) on NTFS: 32 seconds
Ubuntu 11.04 Linux (2GB Ram) on NTFS: 10 seconds
Ubuntu 11.04 Linux (2GB Ram) on ext4: 0.6 seconds
For the tests I pulled the chromium sources (both under win/linux)
git clone http://github.com/chromium/chromium.git
cd chromium
git checkout remotes/origin/trunk
To measure the time I ran
ls -lR > ../list.txt ; time ls -lR > ../list.txt # bash
dir -Recurse > ../list.txt ; (measure-command { dir -Recurse > ../list.txt }).TotalSeconds #Powershell
I did turn off access timestamps, my virus scanner and increased the cache manager settings under windows (>2Gb RAM) - all without any noticeable improvements. Fact of the matter is, out of the box Linux performed 50x better than Windows with a quarter of the RAM.
For anybody who wants to contend that the numbers wrong - for whatever reason - please give it a try and post your findings.
Try using jom instead of nmake
Get it here:
https://github.com/qt-labs/jom
The fact is that nmake is using only one of your cores, jom is a clone of nmake that make uses of multicore processors.
GNU make do that out-of-the-box thanks to the -j option, that might be a reason of its speed vs the Microsoft nmake.
jom works by executing in parallel different make commands on different processors/cores.
Try yourself an feel the difference!
I want to add just one observation using Gnu make and other tools from MinGW tools on Windows: They seem to resolve hostnames even when the tools can not even communicate via IP. I would guess this is caused by some initialisation routine of the MinGW runtime. Running a local DNS proxy helped me to improve the compilation speed with these tools.
Before I got a big headache because the build speed dropped by a factor of 10 or so when I opened a VPN connection in parallel. In this case all these DNS lookups went through the VPN.
This observation might also apply to other build tools, not only MinGW based and it could have changed on the latest MinGW version meanwhile.
I recently could archive an other way to speed up compilation by about 10% on Windows using Gnu make by replacing the mingw bash.exe with the version from win-bash
(The win-bash is not very comfortable regarding interactive editing.)
We run a Fortran console program we have run for years. Recently we purchased identical new HP server class machines (4 processors, 8 gig ram, 4 hard drives) for everyone in the office. We configured them identically as nearly as we know. We can compile the Fortran program on one machine, pass the executable to the different machines, and on two machines it executes painfully slow, while on two others it has modest performance (but not as good as before we upgraded from XP machines).
It uses almost no console output (about 40 lines) but outputs about 15 megs of files.
We open task manager to see what's going on, and we see that on the slow machines it's loading ONE CPU to about 15%. On the fast machines it's loading ALL CPUs to about 40% (but one of them seems to load more than the others). As I recall, on XP it loaded the CPU to 99%, and ran much faster.
These machines are the employees' general purpose machines, and have lots of company bloatware on them. And there is the possibility they have slightly different directory structures. But what seems totally puzzling to me is why Vista is not giving them more CPU time. If the CPUs were loading up, I might blame the performance variation on different directory structures, but not loading up the CPUs just boggles my mind.
David
if there's a bottleneck in IO, the CPU wouldn't be loaded as much because it's mostly waiting for the IO to take place. One could even imagine this to cause the one CPU vs many CPUs problem if there's just no point in kicking in another CPU because there's plenty of time between while waiting. What if you take an external HD and try out if the differences also take place if you run the same program on that HD on different machines?
Please go into Windows task manager, Performance / - Select in [View] the option: [Kernel Times] and look what's displayed on the bars during program execution.
If its only 15% load on quad+hyperthreading box, that says basically, OpenMP, MPI (or whatever it uses) - isn't properly working - works on 1/8 => 15%. Can you run the MPI-test command for your specific system in order to check for errors in multiprocessing on each box? Therefore, the question would be - why does the multiprocessing environment not work?
Regards
rbo
SWAG, but have you checked your virus scanner configuration? If the scanner isn't set to ignore the type of file you're writing on the slow machine, then each write to those files might be getting intercepted and scanned before being written to the disk. This could lead to the process sitting in I/O wait and not getting scheduled as often.
Vista had a problem with some uncontrollably memory leaks, perhaps this is your error, some conflict in the "bloatware" is causing a memory leak and so your Fortran program is running so much slower?
I assume you have tested this with all programs ended. It seems unlikely that your console program is the issue. Sounds like there's definitely a memory conflict going on though.
A couple of times recently I have noticed that 'something' is causing the Windows System Process to sit at 50+% and it will not quit until the PC is rebooted. Happening on Win2k and Win XP so far.
This is particularly troublesome because it currently appears to be triggered by MSVC 2005/Incredibuild and rebooting the build servers is not a nice thing.
At the same time the 'System Idle Process' process is holding the rest of the CPU and the build steps themselves seem to be starved. ie. a module that normally takes <5 minutes to compile is currently taking 20+.
I'd take a few guesses at maybe being virus checker or tortoise svn but would desperatly like some other suggestions.
Edit:
I've been experiencing this as something that is triggered, and the culprit may not be ongoing. Thats not to say that some other ongoing process hasn't done something 'stupid' and is managing an active lock up of System while appearing to be idle itself.
System (100% of 1 core), and System Idle Process are sharing 98-100% of the total CPU.
Occasionaly mt.exe, link.exe, buildservice would get a look in at 1-2%.
I'm running VNC to view the machine, so it's getting a look in on occasion.
Edit 2:
When left the previous evening the build process seemed to be progressing all be it slowly, but after waiting another 13 hours the 1 hour build process hasn't completed. System is still hogging the 1 core.
My understanding is that the "System" process is the time spent in the kernel (so performing disk I/O, network I/O (you did mention Incredibuild) and the like) -- I'd check for disk fragmentation, virus checkers and possibly look at these on other machines in your Incredibuild cluster.
As the System Idle process runs at "Low" priority, it's a red herring that it'd be "taking up CPU time" -- if anything it's just showing that there is available CPU time available. The fact the processing is stuck to a single processor shows that the process is doing something that is not multi-core aware, or someone has set it's thread affinity to 1.
I've noticed the virus checking software that I use can radically slow down compilation but it does not extend beyond the end of the build. Turning off advanced and heuristic checking improves this to the extent that I do not have to disable the scanner entirely. I have changed my scanning strategy such that I use scheduled full scans now more than advanced on the fly scanning, as it hurts the perfromance of a number of apps. (n.b. I am using the latest cut of Kaspersky). I'm also using an automated backup tool (AJCBackup) that also needs to be restrained when compiling.
You may also want to consider disableing the Windows Indexing service on drives that are be used to create a lot of temporary and object files, as it doesn't provide much value in this context for the amount of performance it draws.
Edit: Have checked which processes are actually hogging the CPU core and traced them back to a given app?
We've encountered issues with Kaspersky and Incredibuild in our offices - compiles and sometimes links will just hang and never finish.
Only seems to affect some machines though which is wierd, and only Windows XP (Vista seems immune from what I've seen).
Only solution I've found so far is to turn Kaspersky off entirely - so if you find a solution then let me know!
RE: smacl, work from the Windows Search/Indexing Service (WSearch) won't be attributed to the System process's CPU time, it should come from the SearchIndexer.exe/SearchFilterHost.exe services (Vista+).
The majority of activity from System you will see will be in disk activity from the lazy writer and other disk accesses. CPU activity from System will be because of kernel activity such as drivers (ISRs/DPCs) and other kernel-level filters (which could include AV file and process filters).
Process Explorer (http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx) can aid in viewing CPU usage across processes, including System. You can use the public Microsoft Symbol Server and this resource to get you started.
If you can take a trace with Xperf (http://msdn.microsoft.com/en-us/performance/cc825801.aspx), I can help you analyze where the CPU time is being spent in the System (kernel) context. Xperf isn't officially supported on XP, but you can take a trace on XP and analyze it on other systems.
Xperf and Process Explorer should be able to shine a spotlight on exactly the module(s) that are causing the runaway CPU usage. Symbols may not even be necessary to diagnose the problem; simply the module name can often point to the component in question that is slowing down your system. For example, high CPU usage from ndis.sys can point to network interrupts, or activity from modules such as aavmker4.sys can point to AV software (Avast! in this case).
And as always, check if there are any updated drivers and AV software for your system.
In my office, a conflict between Incredibuild and Spyware Doctor's Immunize feature caused similar issues. Turning off Immunize solved it for us.
What anti-virus/malware do you use?
I'm having same hangs when compiling using IncrediBuild in VS2003, on clean Windows 7 without any anti-virus. It worked fine on same box in XP and Vista.
I'd like to ask a question then follow it up with my own answer, but also see what answers other people have.
We have two large files which we'd like to read from two separate threads concurrently. One thread will sequentially read fileA while the other thread will sequentially read fileB. There is no locking or communication between the threads, both are sequentially reading as fast as they can, and both are immediately discarding the data they read.
Our experience with this setup on Windows is very poor. The combined throughput of the two threads is in the order of 2-3 MiB/sec. The drive seems to be spending most of its time seeking backwards and forwards between the two files, presumably reading very little after each seek.
If we disable one of the threads and temporarily look at the performance of a single thread then we get much better bandwidth (~45 MiB/sec for this machine). So clearly the bad two-thread performance is an artefact of the OS disk scheduler.
Is there anything we can do to improve the concurrent thread read performance? Perhaps by using different APIs or by tweaking the OS disk scheduler parameters in some way.
Some details:
The files are in the order of 2 GiB each on a machine with 2GiB of RAM. For the purpose of this question we consider them not to be cached and perfectly defragmented. We have used defrag tools and rebooted to ensure this is the case.
We are using no special APIs to read these files. The behaviour is repeatable across various bog-standard APIs such as Win32's CreateFile, C's fopen, C++'s std::ifstream, Java's FileInputStream, etc.
Each thread spins in a loop making calls to the read function. We have varied the number of bytes requested from the API each iteration from values between 1KiB up to 128MiB. Varying this has had no effect, so clearly the amount the OS is physically reading after each disk seek is not dictated by this number. This is exactly what should be expected.
The dramatic difference between one-thread and two-thread performance is repeatable across Windows 2000, Windows XP (32-bit and 64-bit), Windows Server 2003, and also with and without hardware RAID5.
The problem seems to be in Windows I/O scheduling policy. According to what I found here there are many ways for an O.S. to schedule disk requests. While Linux and others can choose between different policies, before Vista Windows was locked in a single policy: a FIFO queue, where all requests where splitted in 64 KB blocks. I believe that this policy is the cause for the problem you are experiencing: the scheduler will mix requests from the two threads, causing continuous seek between different areas of the disk.
Now, the good news is that according to here and here, Vista introduced a smarter disk scheduler, where you can set the priority of your requests and also allocate a minimum badwidth for your process.
The bad news is that I found no way to change disk policy or buffers size in previous versions of Windows. Also, even if raising disk I/O priority of your process will boost the performance against the other processes, you still have the problems of your threads competing against each other.
What I can suggest is to modify your software by introducing a self-made disk access policy.
For example, you could use a policy like this in your thread B (similar for Thread A):
if THREAD A is reading from disk then wait for THREAD A to stop reading or wait for X ms
Read for X ms (or Y MB)
Stop reading and check status of thread A again
You could use semaphores for status checking or you could use perfmon counters to get the status of the actual disk queue.
The values of X and/or Y could also be auto-tuned by checking the actual trasfer rates and slowly modify them, thus maximizing the throughtput when the application runs on different machines and/or O.S. You could find that cache, memory or RAID levels affect them in a way or the other, but with auto-tuning you will always get the best performance in every scenario.
I'd like to add some further notes in my response. All other non-Microsoft operating systems we have tested do not suffer from this problem. Linux, FreeBSD, and Mac OS X (this final one on different hardware) all degrade much more gracefully in terms of aggregate bandwidth when moving from one thread to two. Linux for example degraded from ~45 MiB/sec to ~42 MiB/sec. These other operating systems must be reading larger chunks of the file between each seek, and therefor not spending nearly all their time waiting on the disk to seek.
Our solution for Windows is to pass the FILE_FLAG_NO_BUFFERING flag to CreateFile and use large (~16MiB) reads in each call to ReadFile. This is suboptimal for several reasons:
Files don't get cached when read like this, so there are none of the advantages that caching normally gives.
The constraints when working with this flag are much more complicated than normal reading (alignment of read buffers to page boundaries, etc).
(As a final remark. Does this explain why swapping under Windows is so hellish? Ie, Windows is incapable of doing IO to multiple files concurrently with any efficiency, so while swapping all other IO operations are forced to be disproportionately slow.)
Edit to add some further details for Will Dean:
Of course across these different hardware configurations the raw figures did change (sometimes substantially). The problem however is the consistent degradation in performance that only Windows suffers when moving from one thread to two. Here is a summary of the machines tested:
Several Dell workstations (Intel Xeon) of various ages running Windows 2000, Windows XP (32-bit), and Windows XP (64-bit) with single drive.
A Dell 1U server (Intel Xeon) running Windows Server 2003 (64-bit) with RAID 1+0.
An HP workstation (AMD Opteron) with Windows XP (64-bit), and Windows Server 2003, and hardware RAID 5.
My home unbranded PC (AMD Athlon64) running Windows XP (32-bit), FreeBSD (64-bit), and Linux (64-bit) with single drive.
My home MacBook (Intel Core1) running Mac OS X, single SATA drive.
My home Koolu PC running Linux. Vastly underpowered compared to the other systems but I demonstrated that even this machine can outperform a Windows server with RAID5 when doing multi-threaded disk reads.
CPU usage on all of these systems was very low during the tests and anti-virus was disabled.
I forgot to mention before but we also tried the normal Win32 CreateFile API with the FILE_FLAG_SEQUENTIAL_SCAN flag set. This flag didn't fix the problem.
It does seem a little strange that you see no difference across quite a wide range of windows versions and nothing between a single drive and hardware raid-5.
It's only 'gut feel', but that does make me doubtful that this is really a simple seeking problem. Other than the OS X and the Raid5, was all this tried on the same machine - have you tried another machine? Is your CPU usage basically zero during this test?
What's the shortest app you can write which demonstrates this problem? - I would be interested to try it here.
I would create some kind of in memory thread safe lock. Each thread could wait on the lock until it was free. When the lock becomes free, take the lock and read the file for a defined length of time or a defined amount of data, then release the lock for any other waiting threads.
Do you use IOCompletionPorts under Windows? Windows via C++ has an in-depth chapter on this subject and as luck would have it, it is also available on MSDN.
Paul - saw the update. Very interesting.
It would be interesting to try it on Vista or Win2008, as people seem to be reporting some considerable I/O improvements on these in some circumstances.
My only suggestion about a different API would be to try memory mapping the files - have you tried that? Unfortunately at 2GB per file, you're not going to be able to map multiple whole files on a 32-bit machine, which means this isn't quite as trivial as it might be.