How to limit Hard Drive Disk I/O when reading/writing a file on disk?


I have a few Rust programs that read data from a file, do some operations, and write data on another file.
Simple enough, but I've been having a big issue in that my programs saturate the HDD max I/O and can only be executed when no other process is in use.
To be more precise, I'm currently using BufReader and BufWriter with a buffer size of 64 KB, which is fantastic in and of itself for reading/writing a file as quickly as possible. But reading at 250 MB/s while writing at 250 MB/s at the same time tends to exceed what the HDD can manage. Suffice it to say that I'm all for speed, but I realized that those Rust programs are asking for too many resources from the HDD and seem to be stalled by the operating system (Windows) to let other processes work in peace. The files I'm reading/writing are generally a few gigabytes.
Now I know I could just add some form of wait() between each read/write operation on the disk but, I don't know how to find out at which speed I'm currently reading/writing and am looking for a more optimal solution. Plus even after reading the docs, I still can't find an option on BufReader/BufWriter that could limit HDD I/O operations to some arbitrary value (let's say 100MB/s for example).
I looked through the sysinfo crate but it does not seem to help in finding out current and maximum I/O for the HDD.
Am I out of luck and should I delve deeper into systems programming to find a solution? Or is there already something that might teach me how to prioritize my calls to the HDD, or simply limit them to some arbitrary value calculated from the currently available I/O rate of the HDD?

After reading a bit more on the subject, apart from trying to read/write a lot of data and calculate from its performance, it seems like you can't find out HDD max I/O rate during the execution of the program and can only guess a constant at which HDD I/O rate can't go higher. (see https://superuser.com/questions/795483/how-to-limit-hdd-write-speed-for-chosen-programs/795488#795488)
But, you can still monitor disk activity, and with the number guessed earlier, you can use wait() more accurately than always limiting yourself at a constant speed. (here is a crate for Rust : https://github.com/myfreeweb/systemstat).
Prioritizing the process with the OS might be overkill since I'm trying to slip between other processes and share whatever resources are available at that time.
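The "wait between operations" idea can be made rate-based without knowing the disk's real maximum. Below is a minimal sketch, assuming a fixed cap you pick yourself (for example 100 MB/s). The function name `throttled_copy` and its parameters are made up for illustration; nothing like it exists on BufReader/BufWriter. It only tracks bytes moved against elapsed time and sleeps when the average rate would exceed the cap.

```rust
use std::io::{self, Read, Write};
use std::thread;
use std::time::{Duration, Instant};

/// Copies `reader` into `writer` in `chunk_size` pieces, sleeping whenever the
/// average rate since the start would exceed `max_bytes_per_sec`.
/// (Hypothetical helper for illustration, not part of std.)
pub fn throttled_copy<R: Read, W: Write>(
    mut reader: R,
    mut writer: W,
    max_bytes_per_sec: u64,
    chunk_size: usize,
) -> io::Result<u64> {
    assert!(max_bytes_per_sec > 0 && chunk_size > 0);
    let mut buf = vec![0u8; chunk_size];
    let start = Instant::now();
    let mut total: u64 = 0;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of input
        }
        writer.write_all(&buf[..n])?;
        total += n as u64;
        // Earliest point in time at which `total` bytes may have been moved.
        let allowed = Duration::from_secs_f64(total as f64 / max_bytes_per_sec as f64);
        let elapsed = start.elapsed();
        if allowed > elapsed {
            thread::sleep(allowed - elapsed);
        }
    }
    writer.flush()?;
    Ok(total)
}
```

Passing a BufReader over the input file and a BufWriter over the output file should work unchanged, since both implement Read/Write. The cap here is a constant; feeding it from a disk activity monitor would be a separate step.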

Related

Performance Counter for Memory Mapped Files

When using memory mapped files I'm getting into situations where Windows stalls, since new memory is allocated and processed faster than it can be written to disk through the memory mapped files.
The only solution I see is to throttle my processing while the MiMappedPageWriter and the KeBalanceSetManager are doing their jobs. I would be completely fine if the application is running slower instead of a complete OS freeze.
It already helped to use SetWorkingSetSizeEx using a hard limit, because the MiMappedPageWriter is starting earlier to page-out to disk, but still on some drives the data is allocated faster. For example an SSD with 250MB/s does not manage it, but with 500MB/s it is getting better. But I have to support a wide range of hardware and cannot rely on fast drives.
I found that there once was a performance counter, for example: Memory\Mapped File Bytes Written/sec, that I could use to monitor this periodically (see: https://docs.microsoft.com/en-us/windows-server/management/windows-performance-monitor/memory-performance-counter-mapped-file-bytes-written-sec) but it seems that all the links have gone.
I have searched on many places, but couldn't find the performance counters for this.
Is there still a source for this?

Pre-warm disk cache

After some theoretical discussion today I decided to do some research, but I did not find anything conclusive.
Here's the problem:
We have written a tool that reads around 10 GB of image files from a data set of several terabytes. We want to speed up the execution time by minimizing I/O overhead. The idea would be to "pre-warm" the disk cache, as we know beforehand what directory we will be reading from as the tool executes. Is there any API or method to give this hint to Windows so that it can start pre-warming the disk cache, speeding up future disk access as the files are already in RAM (of which there is plenty on the machines we run the tool on)?
I know Windows does readahead on a single file, but what if I have a directory with thousands of files?
I haven't found any direct win32 APIs or command line tools to do this directly.
What if I start a low priority background thread, opening all the files for reading and closing them?
I could of course memory map all the files and pin them in RAM, but that would probably run the risk of starving the main worker thread of I/O.
The general idea here is that the tool "bursts" I/O requests, as each thread will do I/O and CPU processing in sequence, hence we could use the "idle" I/O time to preload the remaining files into RAM.
(I could of course benchmark, and I will, but I would like to understand a bit more of how this works in order to be more scientific and less cargo culty).
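A minimal sketch of the background-thread idea, written in Rust and under a few assumptions. As far as I know, merely opening and closing a file does not pull its contents into the cache, so this version actually reads each file and throws the bytes away. The names `prewarm_dir` and `prewarm_in_background` are made up for illustration. The standard library has no portable way to lower thread priority, so that part is left out, and whether the data stays cached is up to the OS.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::{Path, PathBuf};
use std::thread::{self, JoinHandle};

/// Reads every regular file directly inside `dir` and discards the bytes, giving
/// the OS file cache a chance to keep them. Returns the number of bytes read.
/// (Hypothetical helper; it does not recurse into subdirectories.)
pub fn prewarm_dir(dir: &Path) -> io::Result<u64> {
    let mut buf = vec![0u8; 1 << 20]; // 1 MiB scratch buffer, reused for every file
    let mut total = 0u64;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if !path.is_file() {
            continue;
        }
        let mut file = File::open(&path)?;
        loop {
            let n = file.read(&mut buf)?;
            if n == 0 {
                break;
            }
            total += n as u64;
        }
    }
    Ok(total)
}

/// Runs `prewarm_dir` on its own thread so the main workers are not blocked.
pub fn prewarm_in_background(dir: PathBuf) -> JoinHandle<io::Result<u64>> {
    thread::spawn(move || prewarm_dir(&dir))
}
```

The worker threads can start immediately; joining the handle is only needed if you want the byte count or the error. Benchmarking is still the only way to tell whether this competes with the workers for I/O.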

Constant Write Speed to Disk

I'm writing real-time data to an empty spinning disk sequentially. (EDIT: It doesn't have to be sequential, as long as I can read it back as if it was sequential.) The data arrives at a rate of 100 MB/s and the disks have an average write speed of 120 MB/s.
Sometimes (especially as free space starts to decrease) the disk speed goes under 100 MB/s depending on where on the platter the disk is writing, and I have to drop vital data.
Is there any way to write to disk in a pattern (or some other way) to ensure a constant write speed close to the average rate? Regardless of how much data there currently is on the disk.
EDIT:
Some notes on why I think this should be possible.
When usually writing to the disk, it starts in the fast portion of the platter and then writes towards the slower parts. However, if I could write half the data to the fast part and half the data to the slow part (i.e. for 1 second it could write 50MB to the fast part and 50MB to the slow part), they should meet in the middle. I could possibly achieve a constant rate?
As a programmer, I am not sure how I can decide where on the platter the data is written or even if the OS can achieve something similar.

If I had to do this on a regular Windows system, I would use a device with a higher average write speed to give me more headroom. Expecting 100MB/s average write speed over the entire disk that is rated for 120MB/s is going to cause you trouble. Spinning hard disks don't have a constant write speed over the whole disk.
The usual solution to this problem is to buffer in RAM to cover up infrequent slow downs. The more RAM you use as a buffer, the longer the span of slowness you can handle. These are tradeoffs you have to make. If your problem is the known slowdown on the inside sectors of a rotating disk, then your device just isn't fast enough.
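The RAM-buffer approach can be sketched as a bounded queue between the data source and a dedicated writer thread. This is only an illustration in Rust; `spawn_buffered_writer` and `offer` are hypothetical names, and the queue capacity (in chunks) is the knob that decides how long a slowdown you can ride out.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread::{self, JoinHandle};

/// Spawns a writer thread fed through a queue that holds at most
/// `capacity_chunks` chunks in RAM. The sink is returned when the queue closes.
pub fn spawn_buffered_writer<W: Write + Send + 'static>(
    mut sink: W,
    capacity_chunks: usize,
) -> (SyncSender<Vec<u8>>, JoinHandle<io::Result<W>>) {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity_chunks);
    let handle = thread::spawn(move || -> io::Result<W> {
        // Drains the queue until every sender has been dropped.
        for chunk in rx {
            sink.write_all(&chunk)?;
        }
        sink.flush()?;
        Ok(sink)
    });
    (tx, handle)
}

/// Queues `chunk` without blocking. Returns false when the buffer is full (or the
/// writer is gone), which is the point where real-time data would be dropped.
pub fn offer(tx: &SyncSender<Vec<u8>>, chunk: Vec<u8>) -> bool {
    match tx.try_send(chunk) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}
```

With 1 MB chunks at 100 MB/s, a capacity of 300 chunks corresponds to roughly three seconds of slack. If `offer` starts returning false regularly, the disk is too slow on average and no buffer size will fix that.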
Another thing that might help is to access the disk as directly as possible and ensure it isn't being shared by other parts of the system. Use a separate physical device, don't format it with a filesystem, write directly to the partitioned space. Yes, you'll have to deal with some of the issues a filesystem solves for you, but you also skip a bunch of code you can't control. Even then, your app could run into scheduling issues with Windows. Windows is not an RTOS; there are no guarantees as far as timing goes. Again, this would help more with temporary slowdowns from filesystem cleanup, flushing dirty pages, etc. It probably won't help much with the "last 100GB writes at 80MB/s" problem.
If you really are stuck with a disk that goes from 120MB/s -> 80MB/s outside-to-inside (you should test with your own code and not trust the specs from the manufacturer, so you know what you're dealing with), then you're going to have to play partitioning games like others have suggested. On a mechanical disk, that will introduce some serious head seeking, which may eat up your improvement. To minimize seeks, it would be even more important to ensure it's a dedicated disk the OS isn't using for anything else. Also, use large buffers and write many megabytes at a time before seeking to the end of the disk. Instead of partitioning, you could write directly to the block device and control which blocks you write to. I don't know how to do this in Windows.
To solve this on Linux, I would be tempted to test mdadm's raid0 across two partitions on the same drive and see if that works. If so, then the work is done and you don't have to write and test some complicated write mechanism.

Partition the disk into two equally sized partitions. Write a few seconds worth of data alternating between the partitions. That way you get almost all of the usual sequential speed, nicely averaged. One disk seek every few seconds eats up almost no time. One seek per second reduces the usable time from 1000ms to ~990ms which is a ~1% reduction in throughput. The more RAM you can dedicate to buffering the less you have to seek.
Use more partitions to increase the averaging effect.

I fear this may be more difficult than you realize:
If your average 120 MB/s write speed is the manufacturer's value then it is most likely "optimistic" at best.
Even a benchmarked write speed is usually done on a non-partitioned/formatted drive and will be higher than what you'd typically see in actual use (how much higher is a good question).
A more important value is the drive's minimum write speed. For example, from Tom's Hardware 2013 HDD Benchmarks a drive with a 120 MB/s average has a 76 MB/s minimum.
A drive that is being used by other applications at the same time (e.g., Windows) will have a much lower write speed.
An even more important value is the drive's actual measured performance. I would make a simple application similar to your use case that writes data to the drive as fast as possible until it fills the drive. Do this a few (dozen) times to get a more realistic average/minimum/maximum write speed value... it will likely be lower than you'd expect.
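A rough sketch of such a measuring tool in Rust, with some simplifications: it writes a fixed number of chunks to one file rather than filling the drive, and it calls `sync_data` after every chunk so the OS cache does not hide slow regions (that choice is mine and lowers the numbers compared to buffered writes). `WriteStats` and `measure_writes` are names invented for this example.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;
use std::time::Instant;

/// Result of one measurement run; rates are in MB/s (10^6 bytes per second).
pub struct WriteStats {
    pub bytes: u64,
    pub min_mb_s: f64,
    pub avg_mb_s: f64,
    pub max_mb_s: f64,
}

/// Writes `chunks` chunks of `chunk_size` bytes to `path`, timing each one.
pub fn measure_writes(path: &Path, chunk_size: usize, chunks: usize) -> io::Result<WriteStats> {
    assert!(chunk_size > 0 && chunks > 0);
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(path)?;
    let buf = vec![0xA5u8; chunk_size];
    let (mut min, mut max, mut total_secs) = (f64::INFINITY, 0.0f64, 0.0f64);
    for _ in 0..chunks {
        let t = Instant::now();
        file.write_all(&buf)?;
        file.sync_data()?; // push the chunk to the device before stopping the clock
        let secs = t.elapsed().as_secs_f64().max(1e-9); // avoid dividing by zero
        let rate = chunk_size as f64 / secs / 1e6;
        min = min.min(rate);
        max = max.max(rate);
        total_secs += secs;
    }
    let bytes = (chunk_size * chunks) as u64;
    Ok(WriteStats {
        bytes,
        min_mb_s: min,
        avg_mb_s: bytes as f64 / total_secs / 1e6,
        max_mb_s: max,
    })
}
```

For a realistic picture you would use chunks of many megabytes, keep writing until the target drive is nearly full, and repeat the run several times, as described above.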
As you noted, even if your "real" average write speed is higher than 100 MB/s you run into issues if you run into slow write speeds just before the disk fills up, assuming you don't have somewhere else to write the data to. Using a buffer doesn't help in this case.
I'm not sure if you can actually specify a physical location to write to on the hard drive these days without getting into the drive's firmware. Even if you could this would be my last choice for a solution.
A few specific things I would look at to solve your problem:
Measure the "real" write performance of the drive to see if it's fast enough. This gives you an idea of how far behind you actually are.
Put the OS on a separate drive to ensure the data drive is not being used by anything other than your application.
Get faster drives (either HDD or SSD). It is fine to use the manufacturer's write speeds as an initial guide but test them thoroughly as well.
Get more drives and put them into a RAID0 (or similar) configuration for faster write access. You'll again want to actually test this to confirm it works for you.

You could implement the strategy of alternating writes between the inside and the outside by directly controlling the disk write locations. Under Windows you can open a disk like "\\.\PhysicalDriveX" and control where it writes. For more info see
http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx

First of all, I hope you are using raw disks and not a filesystem. If you're using a filesystem, you must:
Create an empty, non-sparse file that's as large as the filesystem will fit.
Obtain a mapping from the logical file positions to disk blocks.
Reverse this mapping, so that you can map from disk blocks to logical file positions. Of course some blocks are unavailable due to filesystem's own use.
At this point, the disk looks like a raw disk that you access by disk block. It's a valid assumption that this block addressing is mostly monotonic with respect to the physical cylinder number. IOW if you increase the disk block number, the cylinder number will never decrease (or never increase -- depending on the drive's LBA to physical mapping order).
Also, note that a disk's average write speed may be given per cylinder or per unit of storage. How would you know? You need the latter number, and the only sure way to get it is to benchmark it yourself. You need to fill the entire disk with data, by repeatedly writing a zero page to the disk, going block by block, and divide the total amount of data written by the time it took. You need to be accessing the disk or the file in the direct mode. This should disable the OS buffering for the file data, but not for the filesystem metadata (if not using a raw disk).
At this point, all you need to do is to write data blocks of sensible sizes at the two extremes of the block numbers: you need to fill the disk from both ends inwards. The size of the data blocks depends on the bandwidth wastage you can allow for seeks. You should also assume that the hard drive might seek once in a while to update its housekeeping data. Assuming a worst-case seek taking 15ms, you waste 1.5% of per-second bandwidth for each seek. Assuming you can spare no more than 5% of bandwidth, that allows 3 seeks per second; with 1 seek/s on average reserved for the drive itself, you can seek twice per second. Thus your block size needs to be your_bandwidth_per_second/2. This bandwidth is not the disk bandwidth, but the bandwidth of your data source.
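The seek-budget arithmetic from the paragraph above, as a small Rust sketch. The function names are invented for illustration, and the 15 ms, 5% and 1 seek/s figures are the assumptions stated in the text, not measured values.

```rust
/// How many seeks per second the application may perform, given the fraction of
/// bandwidth it can afford to waste, the worst-case seek time, and the seeks per
/// second reserved for the drive's own housekeeping.
pub fn seeks_per_second_budget(spare_fraction: f64, worst_seek_ms: f64, drive_own_seeks: u32) -> u32 {
    // Each seek wastes worst_seek_ms / 1000 of every second.
    let affordable = (spare_fraction / (worst_seek_ms / 1000.0)).floor() as u32;
    affordable.saturating_sub(drive_own_seeks)
}

/// Block size to write between two seeks, based on the data source's bandwidth
/// (not the disk's). Returns None when no seeks can be afforded at all.
pub fn block_size_bytes(source_bytes_per_sec: u64, seeks_per_sec: u32) -> Option<u64> {
    if seeks_per_sec == 0 {
        None
    } else {
        Some(source_bytes_per_sec / seeks_per_sec as u64)
    }
}
```

With the numbers from the text, `seeks_per_second_budget(0.05, 15.0, 1)` gives 2, and a 100 MB/s source then needs blocks of 50 MB.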
Alas, if only things were this easy. It generally turns out that the bandwidth at the middle of the disk is not the average bandwidth. During your benchmark you must also take note of the write speed over smaller sections of the disk, say every 1% of the disk. This way, when writing into each section of the disk, you can figure out how to split the data between the "low" and the "high" section that you're writing to. Suppose that you're starting out at 0% and 99% positions on the disk, and the low position has a bandwidth of mean*1.5, and the high position has a bandwidth of mean*0.8, where mean is your desired mean bandwidth. You'll then need to write 100% * 1.5/(0.8+1.5) of the data into the low position, and the remainder (100% * 0.8/(0.8+1.5)) into the slower high position.
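The split described above is just a proportional share by bandwidth. A short Rust sketch, with made-up function names, using the 1.5 and 0.8 multipliers from the example:

```rust
/// Fractions of the data to send to the low and high positions, proportional to
/// the bandwidth measured at each position.
pub fn split_fractions(low_bw: f64, high_bw: f64) -> (f64, f64) {
    let sum = low_bw + high_bw;
    (low_bw / sum, high_bw / sum)
}

/// Splits `total_bytes` between the two positions; the high position receives
/// whatever is left after rounding the low share.
pub fn split_bytes(total_bytes: u64, low_bw: f64, high_bw: f64) -> (u64, u64) {
    let (low_frac, _) = split_fractions(low_bw, high_bw);
    let low = ((total_bytes as f64) * low_frac).round() as u64;
    (low, total_bytes - low)
}
```

For bandwidths of mean*1.5 and mean*0.8 this gives about 65.2% to the low position and 34.8% to the high one. In practice the two bandwidths would come from the per-1% benchmark table and change as both write positions move inwards.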
The size of your buffer needs to be larger than just the block size, since you must assume some worst-case latency for the hard drive if it hits bad blocks and needs to relocate data, etc. I'd say a 3 second buffer may be reasonable. Optionally it can grow by itself if latencies you measure while your software runs turn out higher. This buffer must be locked ("pinned") to physical memory so that it's not subject to swapping.

Another possible option is to destroke (or short stroke) a hard drive. If you start with a 4TB or larger drive and destroke it to 2TB, only the outer portions of the platters will be used, resulting in a faster throughput rate. The issue would be getting the software that issues vendor unique commands to a hard drive to destroke it.

performance tuning where cpu not pinned and plenty of memory

I'm benchmarking a Windows Server web application that, for argument's sake, has a single method called parseText().
Running a single instance takes less than 10ms; however, when I ramp it up to 10 simultaneous requests, things slow down drastically. Say 1 second per request.
The CPU is not pinned and there's plenty of memory available. So I'm confused as to what the bottleneck is.
One thought was that the memory latency or bus bandwidth could be an issue, but I'm not sure which perfmon counters would best indicate something like this.
Can someone suggest some counters to check that may shed some light on the matter?

My first guess would be either disk IO or mutexes.
For disk, try adding the Physical Disk counters: read bytes/sec and write bytes/sec, and also reads/sec and writes/sec (i.e. both total bytes and actual I/O operation counts for read and write). Make sure they aren't spiking. You could also add queue length if you are keen. You are looking for big shifts like 10 MB/sec or lots of small I/Os.
For mutexes, which might be a side effect of memory allocation (very frequent memory allocation can cause this), try adding "System" and context switches/sec and maybe system calls/sec. These bounce a bit from general load, so get a feel first and then see what happens.
If you think it is caused by memory bandwidth (i.e. exhausting the FSB) then I don't think perfmon can measure that; you would need to switch to something more like VTune, which may or may not be an option for you. An example of exhausting main memory bandwidth would be a program that allocates large amounts of memory and then initialises each byte to some value, and does this LOTS. If you think this is your issue, you might need to isolate a routine using code profilers and other such tools, but this is hard if you are outside the program and just observing.

Are solid-state drives good enough to stop worrying about disk IO bottlenecks?

I've got a proof-of-concept program which is doing some interprocess communication simply by writing and reading from the HD. Yes, I know this is really slow; but it was the easiest way to get things up and running. I had always planned on coming back and swapping out that part of the code with a mechanism that does all the IPC(interprocess communication) in RAM.
With the arrival of solid-state disks, do you think that bottleneck is likely to become negligible?
Notes: It's server software written in C# calling some bare metal number-crunching libraries written in FORTRAN.

The short answer is probably no. A famous researcher named Jim Gray gave a talk about storage and performance which included this great analogy. Think of your brain as the processor: accessing a register takes 1 clock tick, which is roughly equivalent to that information being in your brain. Accessing memory takes 100 clock ticks, so roughly equivalent to getting data somewhere in the city you live in. Accessing a standard disk takes roughly 10^6 ticks, which is the equivalent of the data being on Pluto. Where does solid state fit in? Current SSD technology is somewhere between 10^4 and 10^5 ticks, depending on who you ask. While they can be an order of magnitude faster than disk, there is still a tremendous gap between reading from memory and reading from an SSD. This is why the answer to your question is likely no: as fast as SSDs become, they will still be significantly slower than memory (at least in the foreseeable future).

I think that you will find the bottlenecks are just moved. As we expect higher throughput then we write programs with higher demands.
This pushes bottlenecks to buses, caches and parts other than the read/write mechanism (which is last in the chain anyway).
With a process not bound by disk I/O, then I think you might find it becomes bound by the scheduler which limits the amount of read/write instructions (as with all process instructions).
To take full advantage of limitless I/O speed you would require real-time response and very aggressive management of caches and so on.
When disks get faster then so does RAM and processors and the demand on devices. The bottleneck is the same, the workload just gets bigger.

I don't believe that it will change the way I/O bound applications are written the tiniest bit. Having faster processors did not make people pick bubblesort as a sorting algorithm either.
The external memory hierarchies are an inherent problem of computing.

Joel on Software has an article about his experience upgrading to solid state. Not exactly the same issue you have, but my takeaway was:
Solid state drives can significantly speed up I/O bound operations, but many things (like compiling) are still cpu-bound.

I have a solid-state drive, and no, this won't eliminate I/O as a bottleneck. The SSD is nice, but it's not that nice.
It's actually not hard to master your system's IPC primitives or to build something on top of TCP. But if you want to stick with your disk stuff and make it faster, ramdisk or tmpfs might do the trick.

No. Current SSDs are designed as disk replacements. Every layer, from SATA controller to filesystem driver treats them as storage.
This is not a problem of the underlying technology, NAND flash. When NAND flash is directly mapped into memory, and uses a rotating log storage system instead of a file system based on named files, it can be quite fast. The fundamental problem is that NAND flash only performs well in block updates. File metadata updates cause expensive read-modify-write operations. Also, NAND blocks are much bigger than typical disk blocks, which doesn't help performance either.
For these reasons, the future of SSDs will be better cached SSDs. DRAM will hide the overhead of poor mapping and a small supercap backup will allow the SSD to commit writes faster.

Solid state drives do make one important improvement to IO throughput: on solid state disks, block locality is less of an issue than on rotating media. This means that high performance IO bound applications can shift their focus from structures that arrange data in access order to structures that optimize IO in other ways, such as by keeping data in a single block by means of compression. That said, even solid state drives benefit from linear access patterns because they can prefetch subsequent blocks into a read cache before the application requests them.
A noticeable regression on solid state disks is that writes take longer than reads, although both are still generally faster than rotating drives, and the difference is narrowing with newer, high end solid state disks.

No, sadly not. They do make it more interesting though: SSD drives have very fast reads and no sync time, but their writes are almost as slow as normal hard drives. This means that you will want to read most of the time. However when you do write to the drive you should write as much as possible in the same spot since SSD drives can only write entire blocks at a time.

How about using a RAM drive instead of the disk? You would not have to rewrite anything. Just point it to a different file system. Windows and Linux both have them. Make sure you have lots of memory on the machine and create a virtual disk with enough space for your processing. I did this for a system that listened to multiple protocols on a network tap. I never knew what packet I was going to get and there was too much data to keep it in memory. I would write it to the RAM drive and when something was completed, I would move it and let another process get it off the RAM drive and onto a physical disk. I was able to keep up with really busy server class network cards in this way. Good luck!

Something to keep in mind here:
If the communication involves frequent messages and is on the same system you'll get very good performance because Windows won't actually write the data out in the first place.
I've had to resort to it once and discovered this--the drive light did NOT come on at all so long as the data kept getting written.

but it was the easiest way to get things up and running.
I usually find that it's much cheaper to think good once with your own head, than to make the cpu think millions of times in vain.
