Why is file concatenation on Windows so slow? - windows

I am working on a small utility app to concatenate large video files. The main concatenation step is to run something like this on the command line on Windows 7:
copy /b file1.dv + file2.dv + file3.dv output.dv
The input files are large - typically 7-15GB each. I know that I am dealing with a lot of data here, but the binary concatenation takes a very long time - for a total of around 40GB of data, it can take almost an hour.
Considering that the process is basically just a scan through each file and copying its contents to a new file, why is the binary copy so slow?

The built-in copy command was designed way back in the DOS days, and hasn't really been updated since. It was designed for machines with small disks and very small primary memories, so it uses very small buffers when copying things around. For typical workloads this is no big deal, but it doesn't do so well in the specific case you're dealing with.
That said, I don't think copy is going all that slowly given the scenario you describe. If it takes about an hour for a 40 gigabyte file, that means that you're getting speeds of around 11 MB/s. Commodity Dell laptops like the one you describe in your comment are typically equipped with 5400 RPM consumer hard disks, which achieve something like 30MB/s (end of the disk) to 60MB/s (beginning of the disk) under ideal conditions for sequential reads and writes. However, your workload isn't a sequential workload; it's a constant shift of the read/write heads from the source file(s) to the target file(s). Throw in a 16ms typical latency for such disks and you've got about 60 seeks per second, or 30 copy operations per second, since each operation needs one seek to the source and one to the target. That would mean that copy was using a buffer of around 11MB / 30 = around 375k, which conveniently (after you account for the size of copy's code and a few DOS device drivers) fits under the 640k ceiling that copy was originally designed for. This all assumes that your disk is operating under ideal conditions, and has plenty of leftover space allowing these reads and writes to actually be sequential within a copy operation.
Of course, if you're doing anything else at the same time, this is going to cause more seek operations, and your performance will be worse.
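For anyone who wants to check the arithmetic, here is the same back-of-the-envelope estimate as a few lines of Python. It only uses the figures quoted above (40GB, one hour, 16ms per seek) and, like the estimate itself, ignores the time spent actually transferring data.

```python
# Back-of-the-envelope estimate of copy's buffer size, using the figures above.
total_bytes = 40 * 1024**3      # about 40 GB copied
elapsed_s = 60 * 60             # in about an hour
seek_latency_s = 0.016          # 16 ms per seek (figure from the answer)

throughput_mb_s = total_bytes / elapsed_s / 1024**2
seeks_per_s = 1 / seek_latency_s
# Each copy operation needs two seeks: one to the source, one to the target.
copy_ops_per_s = seeks_per_s / 2
buffer_kb = throughput_mb_s * 1024 / copy_ops_per_s

print(f"throughput: {throughput_mb_s:.1f} MB/s")
print(f"seeks per second: {seeks_per_s:.1f}")
print(f"copy operations per second: {copy_ops_per_s:.1f}")
print(f"implied buffer: {buffer_kb:.0f} KB")
```

The implied buffer comes out in the same few-hundred-KB range as the rough 375k figure.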
You will probably get better results (maybe up to twice as fast) if you use another application which is designed for large copy operations, and as such uses larger buffers. I'm unaware of any such application though; you'll probably need to write one yourself if that's what you need.
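If you do end up writing one yourself, a minimal sketch in Python could look like the following. The function name and the 64MB buffer size are my own assumptions, not measured values; the right buffer size depends on your disk and available RAM, so treat it as a starting point to benchmark.

```python
import shutil

def concatenate(sources, destination, buffer_size=64 * 1024 * 1024):
    """Append each source file to destination, in order, using a large buffer."""
    with open(destination, "wb") as out:
        for path in sources:
            with open(path, "rb") as src:
                # copyfileobj reads and writes buffer_size bytes at a time,
                # so the heads move between source and target far less often.
                shutil.copyfileobj(src, out, buffer_size)
```

For the command in the question this would be called as concatenate(["file1.dv", "file2.dv", "file3.dv"], "output.dv").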

Related

Can improved performance be obtained from parallel reading in Fortran?

I have a Fortran 90 code that spends (by far) most of its time on I/O, because very large data files (at least 1GB and up from there) need to be read. Smaller, but still large, data files with the results of the calculations need to be written. By comparison, some fast Fourier transforms and other calculations are done in no time. I have parallelized (OpenMP) some of these calculations, but the overall gain in performance is minimal given the mentioned I/O issues.
My strategy at the moment is to read the whole file at once:
open(unit=10, file="data", status="old")
do i = 1, verylargenumber
   read(10,*) var1(i), var2(i), var3(i)
end do
close(10)
and then perform operations on var1, etc. My question is whether there is a suitable strategy using (preferably) OpenMP that would allow me to speed up the reading process, especially with the consideration in mind (if it makes any difference) that the data files are quite large.
I have the possibility to run these calculations on Lustre file systems, which in principle offer advantages for parallel I/O, although a general solution for regular file systems would be appreciated.
My intuition is that there is no way around this issue, but I wanted to check to be sure.
I'm not a Fortran guru, but it looks like you are reading the values from the file in very small chunks (three values at a time, at most a few dozen bytes). Reading the file in large chunks (multi-MB at a time) is going to provide a significant improvement in performance, since you will be reducing the number of underlying read() system calls (and corresponding locking overhead) by many orders of magnitude.
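To illustrate the large-chunk idea (in Python rather than Fortran, since I'm more confident of it there), here is a rough sketch. It assumes the data has been stored as raw binary float64 triples instead of the text format the list-directed read in the question expects; that storage format, the function name, and the 8MB chunk size are all assumptions of mine.

```python
from array import array

def read_triples(path, chunk_bytes=8 * 1024 * 1024):
    """Read interleaved float64 triples (var1, var2, var3) in large chunks."""
    record_bytes = 3 * 8
    # Round the chunk down to a whole number of records (at least one).
    chunk_bytes = max(record_bytes, chunk_bytes - chunk_bytes % record_bytes)
    values = array("d")
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_bytes)   # one large read per iteration
            if not block:
                break
            # Raises ValueError if the file is not a whole number of float64s.
            values.frombytes(block)
    # De-interleave into the three variables.
    return values[0::3], values[1::3], values[2::3]
```

The equivalent in Fortran would be unformatted (stream or direct access) reads of whole arrays, which also avoids the cost of parsing text.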
If your large files are written in Lustre with multiple stripes (e.g. in a directory with lfs setstripe -c 8 -S 4M <dir> to set a default stripe count of 8 with a stripe size of 4MB for all new files in that directory) then this may improve the aggregate read performance - assuming that you are reading only a single file at one time, and you are not limited by the client network bandwidth. If your program is running on multiple nodes and/or threads concurrently, and each of those threads is itself reading its own file, then you will already have parallelism above the file level. Even reading from a single file can do quite well (if the reads are large) because the Lustre client will do readahead in the background.
If you have multiple compute threads that are each working on different chunks of the file at one time (e.g. 4MB chunks) then you could read each of the 4MB chunks from a different thread, which may improve performance since you will have more IO requests in flight. However, there is still a limit on how fast a single client can read files over the network. Reading from a multi-striped file from multiple clients concurrently will allow you to aggregate the network and disk bandwidth from multiple clients and servers, which is where Lustre does best.
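As a sketch of the "one chunk per thread" idea, something like the following keeps several 4MB reads in flight at once. It is written in Python for brevity; the function names and the worker count are my own choices, and it joins the whole file in memory, which you would not do for files of the size in the question. Whether it actually helps depends on the file system and striping, so measure before relying on it.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_chunk(path, offset, size):
    # Each task opens its own handle so seeks do not interfere across threads.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def parallel_read(path, chunk_size=4 * 1024 * 1024, workers=4):
    """Read a file as chunk_size pieces issued from several threads."""
    total = os.path.getsize(path)
    offsets = range(0, total, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in offset order regardless of completion order.
        chunks = pool.map(lambda off: read_chunk(path, off, chunk_size), offsets)
        return b"".join(chunks)
```

In a real program each thread would process its own chunk rather than returning it to be joined.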

Constant Write Speed to Disk

I'm writing real-time data to an empty spinning disk sequentially. (EDIT: It doesn't have to be sequential, as long as I can read it back as if it was sequential.) The data arrives at a rate of 100 MB/s and the disks have an average write speed of 120 MB/s.
Sometimes (especially as free space starts to decrease) the disk speed goes under 100 MB/s depending on where on the platter the disk is writing, and I have to drop vital data.
Is there any way to write to disk in a pattern (or some other way) that ensures a constant write speed close to the average rate, regardless of how much data there currently is on the disk?
EDIT:
Some notes on why I think this should be possible.
Usually, when writing to the disk, it starts in the fast portion of the platter and then writes towards the slower parts. However, if I could write half the data to the fast part and half the data to the slow part (i.e. for 1 second it could write 50MB to the fast part and 50MB to the slow part), they should meet in the middle. Could I possibly achieve a constant rate that way?
As a programmer, I am not sure how I can decide where on the platter the data is written or even if the OS can achieve something similar.
If I had to do this on a regular Windows system, I would use a device with a higher average write speed to give me more headroom. Expecting 100MB/s average write speed over the entire disk that is rated for 120MB/s is going to cause you trouble. Spinning hard disks don't have a constant write speed over the whole disk.
The usual solution to this problem is to buffer in RAM to cover up infrequent slow downs. The more RAM you use as a buffer, the longer the span of slowness you can handle. These are tradeoffs you have to make. If your problem is the known slowdown on the inside sectors of a rotating disk, then your device just isn't fast enough.
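To put a number on that tradeoff, here is a tiny Python helper (the name and the example figures are mine) that estimates how long a RAM buffer of a given size can absorb a slowdown before data has to be dropped.

```python
def slowdown_covered_s(buffer_mb, incoming_mb_s, disk_mb_s):
    """Seconds a RAM buffer can absorb while the disk is slower than the source."""
    deficit = incoming_mb_s - disk_mb_s
    if deficit <= 0:
        return float("inf")  # the disk keeps up, so the buffer never fills
    return buffer_mb / deficit

# Example: 4GB of buffer, 100MB/s arriving, disk temporarily at 80MB/s.
print(slowdown_covered_s(4096, 100, 80))
```

With those example figures the buffer lasts a few minutes, which is fine for a temporary hiccup but nowhere near enough to cover a slow region that takes much longer than that to fill.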
Another thing that might help is to access the disk as directly as possible and ensure it isn't being shared by other parts of the system. Use a separate physical device, don't format it with a filesystem, and write directly to the partitioned space. Yes, you'll have to deal with some of the issues a filesystem solves for you, but you also skip a bunch of code you can't control. Even then, your app could run into scheduling issues with Windows. Windows is not an RTOS; there are no guarantees as far as timing goes. Again, this would help more with temporary slowdowns from filesystem cleanup, flushing dirty pages, etc. It probably won't help much with the "last 100GB writes at 80MB/s" problem.
If you really are stuck with a disk that goes from 120MB/s -> 80MB/s outside-to-inside (you should test with your own code and not trust the specs from the manufacturer, so you know what you're dealing with), then you're going to have to play partitioning games like others have suggested. On a mechanical disk, that will introduce some serious head seeking, which may eat up your improvement. To minimize seeks, it would be even more important to ensure it's a dedicated disk the OS isn't using for anything else. Also, use large buffers and write many megabytes at a time before seeking to the end of the disk. Instead of partitioning, you could write directly to the block device and control which blocks you write to. I don't know how to do this in Windows.
To solve this on Linux, I would be tempted to test mdadm's raid0 across two partitions on the same drive and see if that works. If so, then the work is done and you don't have to write and test some complicated write mechanism.
Partition the disk into two equally sized partitions. Write a few seconds worth of data alternating between the partitions. That way you get almost all of the usual sequential speed, nicely averaged. One disk seek every few seconds eats up almost no time. One seek per second reduces the usable time from 1000ms to ~990ms which is a ~1% reduction in throughput. The more RAM you can dedicate to buffering the less you have to seek.
Use more partitions to increase the averaging effect.
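A minimal sketch of the alternating scheme in Python, assuming the partitions have already been opened as writable binary file objects (how you open them is platform-specific and not shown). The function name and parameters are hypothetical.

```python
def alternate_writes(blocks, targets, blocks_per_turn):
    """Write blocks round-robin across targets, blocks_per_turn at a time."""
    for index, block in enumerate(blocks):
        # Stay on one target for blocks_per_turn blocks, then move to the next.
        target = targets[(index // blocks_per_turn) % len(targets)]
        target.write(block)
```

Choose blocks_per_turn so that one turn is a few seconds' worth of data. To read the stream back in order, visit the targets in the same rotation with the same block size.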
I fear this may be more difficult than you realize:
If your average 120 MB/s write speed is the manufacturer's value then it is most likely "optimistic" at best.
Even a benchmarked write speed is usually done on a non-partitioned/formatted drive and will be higher than what you'd typically see in actual use (how much higher is a good question).
A more important value is the drive's minimum write speed. For example, from Tom's Hardware 2013 HDD Benchmarks a drive with a 120 MB/s average has a 76 MB/s minimum.
A drive that is being used by other applications at the same time (e.g., Windows) will have a much lower write speed.
An even more important value is the drive's actual measured performance. I would make a simple application similar to your use case that writes data to the drive as fast as possible until it fills the drive. Do this a few (dozen) times to get a more realistic average/minimum/maximum write speed value...it will likely be lower than you'd expect.
As you noted, even if your "real" average write speed is higher than 100 MB/s you run into issues if you run into slow write speeds just before the disk fills up, assuming you don't have somewhere else to write the data to. Using a buffer doesn't help in this case.
I'm not sure if you can actually specify a physical location to write to on the hard drive these days without getting into the drive's firmware. Even if you could this would be my last choice for a solution.
A few specific things I would look at to solve your problem:
Measure the "real" write performance of the drive to see if it's fast enough. This gives you an idea of how far behind you actually are.
Put the OS on a separate drive to ensure the data drive is not being used by anything other than your application.
Get faster drives (either HDD or SSD). It is fine to use the manufacturer's write speeds as an initial guide but test them thoroughly as well.
Get more drives and put them into a RAID0 (or similar) configuration for faster write access. You'll again want to actually test this to confirm it works for you.
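For the measurement step, a simple benchmark along these lines would do as a starting point. This is only a sketch in Python: the function name, block size, and section count are my own choices, and it writes through the filesystem rather than to the raw device, so the numbers include filesystem overhead.

```python
import os
import time

def benchmark_writes(path, total_bytes, block_size=4 * 1024 * 1024, sections=100):
    """Write total_bytes to path and return the MB/s measured for each section."""
    block = bytes(block_size)                       # a block of zeros
    blocks_total = max(1, total_bytes // block_size)
    blocks_per_section = max(1, blocks_total // sections)
    rates = []
    with open(path, "wb") as f:
        written = 0
        while written < blocks_total:
            count = min(blocks_per_section, blocks_total - written)
            start = time.perf_counter()
            for _ in range(count):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())   # push the data out of the OS cache
            elapsed = max(time.perf_counter() - start, 1e-9)
            rates.append(count * block_size / (1024 * 1024) / elapsed)
            written += count
    return rates
```

Run it with total_bytes close to the drive's capacity and look at min(rates) rather than the average, since the minimum is what decides whether you have to drop data.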
You could implement the strategy of alternating writes between the inside and the outside by directly controlling the disk write locations. Under Windows you can open a disk like "\\.\PhysicalDriveX" and control where it writes. For more info see
http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx
First of all, I hope you are using raw disks and not a filesystem. If you're using a filesystem, you must:
Create an empty, non-sparse file that's as large as the filesystem will fit.
Obtain a mapping from the logical file positions to disk blocks.
Reverse this mapping, so that you can map from disk blocks to logical file positions. Of course some blocks are unavailable due to filesystem's own use.
At this point, the disk looks like a raw disk that you access by disk block. It's a valid assumption that this block addressing is mostly monotonic in the physical cylinder number. IOW if you increase the disk block number, the cylinder number will never decrease (or never increase -- depending on the drive's LBA to physical mapping order).
Also, note that a disk's average write speed may be given per cylinder or per unit of storage. How would you know? You need the latter number, and the only sure way to get it is to benchmark it yourself. You need to fill the entire disk with data by repeatedly writing a zero page to the disk, going block by block, and divide the total amount of data written by the time it took. You need to access the disk or the file in direct mode. This should disable the OS buffering for the file data, but not for the filesystem metadata (if not using a raw disk).
At this point, all you need to do is to write data blocks of sensible sizes at the two extremes of the block numbers: you need to fill the disk from both ends inwards. The size of the data blocks depends on the bandwidth wastage you can allow for seeks. You should also assume that the hard drive might seek once in a while to update its housekeeping data. Assuming a worst-case seek taking 15ms, you waste 1.5% of per-second bandwidth for each seek. Assuming you can spare no more than 5% of bandwidth, with 1 seek/s on average for the drive itself, you can seek twice per second. Thus your block size needs to be your_bandwidth_per_second/2. This bandwidth is not the disk bandwidth, but the bandwidth of your data source.
Alas, if only things were this easy. It generally turns out that the bandwidth at the middle of the disk is not the average bandwidth. During your benchmark you must also take a note of write speed over smaller sections of the disk, say every 1% of the disk. This way, when writing into each section of the disk, you can figure out how to split the data between the "low" and the "high" section that you're writing to. Suppose that you're starting out at 0% and 99% positions on the disk, and the low position has a bandwidth of mean*1.5, and the high position has a bandwidth of mean*0.8, where mean is your desired mean bandwidth. You'll then need to write 100% * 1.5/(0.8+1.5) of the data into the low position, and the remainder (100% * 0.8/(0.8+1.5)) into the slower high position.
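The split and the block size from the last two paragraphs, written out as a small Python sketch. The function names are mine, and the 100MB/s mean is just the figure from the question.

```python
def split_fractions(low_bandwidth, high_bandwidth):
    """Fractions of the incoming data to send to the low and high positions."""
    total = low_bandwidth + high_bandwidth
    return low_bandwidth / total, high_bandwidth / total

def block_size_mb(source_mb_s, seeks_per_second):
    """Data written between seeks so that only seeks_per_second seeks are needed."""
    return source_mb_s / seeks_per_second

mean = 100.0  # desired mean bandwidth in MB/s (figure from the question)
low_frac, high_frac = split_fractions(1.5 * mean, 0.8 * mean)
print(f"low: {low_frac:.1%}, high: {high_frac:.1%}")
print(f"block size: {block_size_mb(mean, 2):.0f} MB")
```

Splitting in proportion to bandwidth means the drive spends equal time at each position, so the combined rate is the plain average of the two bandwidths, (1.5 + 0.8)/2 = 1.15 times the mean in this example, before seek losses.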
The size of your buffer needs to be larger than just the block size, since you must assume some worst-case latency for the hard drive if it hits bad blocks and needs to relocate data, etc. I'd say a 3 second buffer may be reasonable. Optionally it can grow by itself if latencies you measure while your software runs turn out higher. This buffer must be locked ("pinned") to physical memory so that it's not subject to swapping.
Another possible option is to destroke (or short stroke) a hard drive. If you start with a 4TB or larger drive and destroke it to 2TB, only the outer portions of the platters will be used, resulting in a faster throughput rate. The issue would be getting the software that issues vendor unique commands to a hard drive to destroke it.

Writing multiple files Vs. writing one big file [in a solid state drive]

(I was not able to find a clear answer to my question, maybe I used the wrong search term)
I want to record many images from a camera, with no compression or lossless compression, on a not so powerful device with a single solid-state drive.
After investigating, I have decided that, if any, the compression will be simply png image by image (this is not part of the discussion).
Given these constraints, I want to be able to record at the maximum possible frequency from the camera. The bottleneck is the speed of the (single) drive. I want to use the RAM for queuing, and the few available cores for compressing the images in parallel, so that there's less data to write.
Once the data is compressed, do I get any gain in writing speed if I stream all the bytes into one single file? Or, considering that I am working with a solid-state drive, can I just write one file (let's say about 1 or 2 MB) per image and still work at the maximum disk bandwidth (or very close to it, like >90%)?
I don't know if it matters, this will be done using C++ and its libraries.
My question is "simply" if by writing my output on a single file instead of in many 2MB files I can expect a significant benefit, when working with a solid state drive.
There's a benefit, but not a significant one. A file system driver for a solid state drive already knows how to distribute the data of a file across many non-adjacent clusters, so doing it yourself doesn't help. That ability is necessary anyway to fit a large file on a drive that already contains files. By breaking the data up, you also force extra writes to add the directory entries for those segments.
The type of solid state drive matters, but in general this is already done by the driver to implement "wear-leveling". In other words, it intentionally scatters the data across the drive. This avoids wearing out flash memory cells, which can only be written a limited number of times before they physically wear out and fail. Traditionally they were only guaranteed for 10,000 writes, though they've gotten better. You'll exercise this, of course. Notable as well is that flash drives are fast to read but slow to write, which matters in your case.
There's one notable advantage to breaking up the image data into separate files: it is easier to recover from a drive error, whether from a disastrous failure or from the drive just filling up to capacity without you stopping in time. You don't lose the entire shot. But it is inconvenient for whatever program reads the images off the drive, since it has to glue them back together. That is an important design goal as well: if you make it too impractical with a non-standard uncompressed file format, or just too slow to transfer, or just too inconvenient in general, then it will just not get used very often.
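If you'd rather measure the difference on your own device than take anyone's word for it, a rough comparison could look like this. It is sketched in Python rather than C++ to keep it short; the file names and the length-prefixed layout of the single file are my own assumptions. Note that it does not force data to the drive, so with small tests you are partly measuring the OS cache.

```python
import os
import struct
import time

def write_many(directory, images):
    """Write each image to its own file."""
    for i, data in enumerate(images):
        with open(os.path.join(directory, f"img_{i:06d}.bin"), "wb") as f:
            f.write(data)

def write_single(directory, images):
    """Append every image to one file, each preceded by an 8-byte length."""
    with open(os.path.join(directory, "images.bin"), "wb") as f:
        for data in images:
            f.write(struct.pack("<Q", len(data)))  # lets a reader split them again
            f.write(data)

def timed(fn, *args):
    """Return the wall-clock seconds fn(*args) takes."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start
```

Call timed(write_many, some_dir, images) and timed(write_single, other_dir, images) with a realistic number of 1-2MB buffers and compare the two times.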

Multithreaded File Compare Performance

I just stumbled onto this SO question and was wondering if there would be any performance improvement if:
The file was compared in blocks no larger than the hard disk sector size (1/2KB, 2KB, or 4KB)
AND the comparison was done multithreaded (or maybe even with the .NET 4 parallel stuff)
I imagine there being 2 threads: one that reads from the beginning of the file and another that reads from the end until they meet in the middle.
I understand that in this situation the disk IO is going to be the slowest part, but if the reads never have to cross sector boundaries (which in my twisted imagination somehow eliminates any possible fragmentation overhead) then it may potentially reduce head moves, hence resulting in better performance (maybe?).
Of course other factors could play in as well, such as, single vs multiple processors/cores or SSD vs non-SSD, but with those aside; is the disk IO speed + potentially sharing processor time insurmountable? Or perhaps my concept of computer theory is completely off-base...
If you're comparing two files that are on the same drive, the only benefit you could receive from multi-threading is to have one thread reading--populating the next buffers--while another thread is comparing the previously-read buffers.
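A sketch of that reader/comparer split, in Python rather than .NET. The function name, the queue depth, and the 256K chunk size are my own choices and should be tuned by measurement.

```python
import queue
import threading

def files_equal(path_a, path_b, chunk_size=256 * 1024, depth=4):
    """Compare two files while a reader thread prefetches the next buffers."""
    buffers = queue.Queue(maxsize=depth)
    stop = threading.Event()

    def put(item):
        # Retry until there is room, giving up if the comparison already ended.
        while not stop.is_set():
            try:
                buffers.put(item, timeout=0.1)
                return
            except queue.Full:
                pass

    def reader(a, b):
        try:
            while not stop.is_set():
                pair = (a.read(chunk_size), b.read(chunk_size))
                put(pair)
                if not pair[0] and not pair[1]:
                    return  # both files are exhausted
        except OSError as exc:
            put(exc)  # hand read errors to the comparing thread

    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        thread = threading.Thread(target=reader, args=(a, b))
        thread.start()
        try:
            while True:
                item = buffers.get()
                if isinstance(item, Exception):
                    raise item
                block_a, block_b = item
                if block_a != block_b:
                    return False
                if not block_a:
                    return True  # reached the end with no difference
        finally:
            stop.set()
            thread.join()
```

The reader reads both files sequentially, so the only overlap gained is between I/O and the comparison itself.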
If the files you're comparing are on different physical drives, then you can have two asynchronous reads going concurrently--one on each drive.
But your idea of having one thread reading from the beginning and another reading from the end will make things slower because seek time is going to kill you. The disk drive heads will continually be seeking from one end of the file to the other. Think of it this way: do you think it would be faster to read a file sequentially from the start, or would it be faster to read 64K from the front, then read 64K from the end, then seek back to the start of the file to read the next 64K, etc?
Fragmentation is an issue, to be sure, but excessive fragmentation is the exception, not the rule. Most files are going to be unfragmented, or only partially fragmented. Reading alternately from either end of the file would be like reading a file that's pathologically fragmented.
Remember, a typical disk drive can only satisfy one I/O request at a time.
Making single-sector reads will probably slow things down. In my tests of .NET I/O speed, reading 32K at a time was significantly faster (between 10 and 20 percent) than reading 4K at a time. As I recall (it's been some time since I did this), on my machine at the time, the optimum buffer size for sequential reads was 256K. That will undoubtedly differ for each machine, based on processor speed, disk controller, hard drive, and operating system version.

Performance testing. How to increase hdd operations stability

I am trying to simulate application load to measure application performance. Dozens of clients send requests to the server, and a significant part of request processing is loading random data from the HDD (random file, random file offset).
I use 15 GB in 400 files.
The HDD does its best to cache read operations, so overall performance is very unstable from run to run (+/- 5..10%).
In order to minimize HDD-internal optimizations, I am thinking of putting the data on a dedicated physical HDD, creating random files before every test run, using the same random file access sequence (sequence of files and offsets), then running a test and formatting the HDD at the end. I suppose this will clear all internal HDD caches and file access predictions.
What should I do to minimize the dispersion of the performance results? Is there a simpler (or maybe more appropriate) way to get stable performance results?
Thank you in advance!
Essentially all modern hard drives do include caching. It seems to me that results without a cache might be more uniform, but would be uniformly meaningless.
In any case, there are commands to disable caching on most drives (but, if memory serves, they're probably extensions, not part of the standard, so you'd have to implement them specifically for a particular target drive).
OTOH, given that you want to simulate something that isn't how a real hard drive (normally) works, I'd consider writing it as a complete software simulation -- e.g., have some sort of hard-drive class that kept a "current track", with commands to read and write data, seek to another track, etc. The class would keep track of things like the amount of (virtual) time consumed for each operation.
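To make that concrete, here is a toy version of such a class in Python. Every name and timing parameter in it is made up for illustration, and the cost model (a fixed settle time plus a per-track seek cost plus a linear transfer time) is a deliberate oversimplification; it ignores rotational latency, caching, and reads that span several tracks.

```python
class SimulatedDisk:
    """Toy disk model that charges virtual time for seeks and transfers."""

    def __init__(self, tracks=10_000, track_bytes=1_000_000,
                 seek_ms_per_track=0.001, settle_ms=2.0, transfer_mb_s=100.0):
        # All defaults are hypothetical; calibrate them against a real drive.
        self.tracks = tracks
        self.track_bytes = track_bytes
        self.seek_ms_per_track = seek_ms_per_track
        self.settle_ms = settle_ms
        self.transfer_mb_s = transfer_mb_s
        self.current_track = 0
        self.elapsed_ms = 0.0

    def seek(self, track):
        """Move to track, charging settle time plus a per-track cost."""
        if not 0 <= track < self.tracks:
            raise ValueError("track out of range")
        distance = abs(track - self.current_track)
        if distance:
            self.elapsed_ms += self.settle_ms + distance * self.seek_ms_per_track
        self.current_track = track

    def read(self, offset, size):
        """Charge the virtual time for reading size bytes at byte offset."""
        self.seek(offset // self.track_bytes)
        self.elapsed_ms += size / (self.transfer_mb_s * 1_000_000) * 1000
```

Replaying the same sequence of (file, offset) requests against this model gives exactly the same elapsed_ms every run, which is the repeatability the question is after, at the cost of realism.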
