Would keeping say 512 file handles to files sized 3GB+ open for the lifetime of a program, say a week or so, cause issues in 32-bit Linux? Windows?
Potential workaround: How bad is the performance penalty of opening/closing file handles?
The size of the files doesn't matter. The number of file descriptors does, though. On Mac OS X, for example, the default limit is 256 open files per process, so your program would not be able to run.
I don't know about Linux, but on Windows, 512 open files doesn't seem like much to me. As a rule of thumb, though, anything beyond a thousand is too many. (Although I have to say that I haven't seen any program first-hand opening more than, say, 50.)
And the cost of opening/closing handles isn't that big unless you do it every time you want to read or write a small amount of data, in which case it's too high and you should buffer your data.
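On the Linux side you can at least check, and within the hard limit raise, the per-process descriptor limit from inside your own program rather than relying on OS settings. A minimal sketch using the standard POSIX calls (the 1024 target is just an illustrative number):

#include <stdio.h>
#include <sys/resource.h>   /* getrlimit / setrlimit */

int main(void)
{
    struct rlimit rl;

    /* Query the per-process limit on open file descriptors. */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
        printf("soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        /* Raise the soft limit (up to the hard limit) if 512 data files
           plus stdio, logs, and sockets might not fit under the default. */
        if (rl.rlim_cur < 1024 && rl.rlim_max >= 1024) {
            rl.rlim_cur = 1024;
            setrlimit(RLIMIT_NOFILE, &rl);
        }
    }
    return 0;
}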
I am working on a small utility app to concatenate large video files. The main concatenation step is to run something like this on the command line on Windows 7:
copy /b file1.dv + file2.dv + file3.dv output.dv
The input files are large - typically 7-15 GB each. I know that I am dealing with a lot of data here, but the binary concatenation takes a very long time - for a total of around 40 GB of data, it can take almost an hour.
Considering that the process is basically just a scan through each file, copying its contents to a new file, why is the binary copy so slow?
The built-in copy command was designed way back in the DOS days and hasn't really been updated since. It was written for machines with small disks and very little primary memory, so it uses very small buffers when copying things around. For typical workloads this is no big deal, but it doesn't do so well in the specific case you're dealing with.
That said, I don't think copy is going all that slowly given the scenario you describe. If it takes about an hour for 40 gigabytes, that means you're getting speeds of around 11 MB/s. Commodity Dell laptops like the one you describe in your comment are typically equipped with 5400 RPM consumer hard disks, which achieve something like 30 MB/s (end of the disk) to 60 MB/s (beginning of the disk) under ideal conditions for sequential reads and writes. However, your workload isn't sequential; it's a constant shift of the read/write heads between the source file(s) and the target file. Throw in a 16 ms typical latency for such disks and you get about 60 seeks per second, or 30 copy operations (one read seek plus one write seek each) per second. That would mean copy is using a buffer of around 11 MB / 30 = roughly 375 KB, which conveniently (after you account for the size of copy's code and a few DOS device drivers) fits under the 640 KB ceiling that copy was originally designed for. This all assumes that your disk is operating under ideal conditions and has plenty of free space, allowing the reads and writes to actually be sequential within a copy operation.
Of course if you're doing anything else at the same time this is going to cause more seek operations, and your performance will be worse.
You will probably get better results (maybe up to twice as fast) if you use another application which is designed for large copy operations, and as such uses larger buffers. I'm unaware of any such application though; you'll probably need to write one yourself if that's what you need.
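If you do end up writing one yourself, the core is just a read/write loop with a large buffer. The following is only a rough sketch (the 32 MB buffer size, the command-line layout, and the minimal error handling are all arbitrary assumptions, not a tested tool):

#include <windows.h>
#include <stdio.h>

#define BUF_SIZE (32 * 1024 * 1024)   /* large buffer to reduce head seeks */

static int append_file(HANDLE out, const char *path, char *buf)
{
    HANDLE in = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                            OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    DWORD got, put;

    if (in == INVALID_HANDLE_VALUE)
        return -1;

    /* Read a big chunk from the source, append it to the target, repeat. */
    while (ReadFile(in, buf, BUF_SIZE, &got, NULL) && got > 0) {
        if (!WriteFile(out, buf, got, &put, NULL) || put != got) {
            CloseHandle(in);
            return -1;
        }
    }
    CloseHandle(in);
    return 0;
}

int main(int argc, char **argv)
{
    /* Hypothetical usage: bigconcat output.dv file1.dv file2.dv ... */
    char *buf = (char *)VirtualAlloc(NULL, BUF_SIZE, MEM_COMMIT, PAGE_READWRITE);
    HANDLE out;
    int i;

    if (argc < 3 || buf == NULL)
        return 1;

    out = CreateFileA(argv[1], GENERIC_WRITE, 0, NULL,
                      CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (out == INVALID_HANDLE_VALUE)
        return 1;

    for (i = 2; i < argc; i++) {
        if (append_file(out, argv[i], buf) != 0) {
            fprintf(stderr, "failed on %s\n", argv[i]);
            return 1;
        }
    }
    CloseHandle(out);
    return 0;
}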
I have some processing I want to do on thousands of files simultaneously. Grab the first byte of all the files and do something, go to the next byte, etc. The files could be any size, so loading them all into memory could be prohibitive.
I'm concerned that due to limitations in operating system file descriptors, just naively opening thousands of files and reading them in seems like I might run into issues.
But cycling through and opening/closing files would be rather inefficient, I imagine.
Is there some efficient mechanism to handle what I'm trying to do?
NOTE: this function may be distributed to machines that I would have no control over, so I can't just go changing settings on the OS.
I want to do on thousands of files simultaneously. Grab the first byte of all the files and do something, go to the next byte, etc.
Are these files small enough that you could read them all into memory at once? If so, read the files one at a time, then process all the files a byte at a time.
I'm concerned that due to limitations in operating system file descriptors, just naively opening thousands of files and reading them in seems like I might run into issues.
You might. The only way to find out is to try.
But cycling through and opening/closing files would be rather inefficient, I imagine.
Yes it would. But if you can't read all the files into memory, and your operating system can't open thousands of files at a time, then this is your last resort.
What you can do is find out the limit of simultaneous open files that your system can handle. Let's just say for the sake of discussion that your system can open 100 files at a time, and you have 2,500 files to process.
Then your process would look something like this.
Open the first 100 files.
Write an output file that contains the first byte from the 100 files, then the second byte from the 100 files, and so on.
Handle any problems you might encounter if the 100 files are not of the same byte length.
Now, after running this process through all your files, you'll have 25 intermediate files.
Then your second process would look something like this.
Open the 25 intermediate files.
Process the first 100 bytes from each file.
You would determine the actual numbers (simultaneous files open, number of intermediate files) through experimentation or research on your operating system.
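A rough sketch of the first pass, just to make the shape of it concrete (the file names, the wrapping code, and the policy of silently dropping files that run out of bytes are all assumptions; you would still need to record each file's length somewhere for the second pass):

#include <stdio.h>
#include <stdlib.h>

/* Interleave the bytes of `count` input files into `out_path`:
   byte 0 of every file, then byte 1 of every file, and so on.
   Files that run out of data early are dropped from the rotation. */
int interleave_batch(char **in_paths, int count, const char *out_path)
{
    FILE **in = calloc(count, sizeof(FILE *));
    FILE *out = fopen(out_path, "wb");
    int open_count = 0, i;

    if (in == NULL || out == NULL) {
        free(in);
        if (out) fclose(out);
        return -1;
    }

    for (i = 0; i < count; i++) {
        in[i] = fopen(in_paths[i], "rb");
        if (in[i] != NULL)
            open_count++;
    }

    while (open_count > 0) {
        for (i = 0; i < count; i++) {
            int c;
            if (in[i] == NULL)
                continue;
            c = fgetc(in[i]);
            if (c == EOF) {              /* this file is exhausted */
                fclose(in[i]);
                in[i] = NULL;
                open_count--;
            } else {
                fputc(c, out);
            }
        }
    }

    fclose(out);
    free(in);
    return 0;
}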
This relates to some software I've been given to "fix". The easiest and quickest solution would have it open 10 random files out of hundreds, extract some very short strings for processing, and immediately close them. Another process may come along right after and do the same thing to different (or the same) random files, and this may occur hundreds of times in a few seconds.
I know modern operating systems keep those files cached in memory up to a point, so disk thrashing isn't the issue it was in the past, but I'm looking for any articles or discussions about how to determine when all this opening and closing of many random files becomes a problem.
When your working set (the amount of data read by all your processes) exceeds your available RAM, your throughput will tend towards the I/O capacity of your underlying disk.
From your description of the workload, seek times will be more of a problem than data transfer rates.
When your working set size stays below the amount of RAM you have, the OS will keep all data cached and won't need to go to the disk after having its caches filled.
This question follows on from this topic:
creating a huge dummy file in a matter of seconds in c#
I just tried fsutil.exe on XP/Vista/7 to write a huge amount of dummy data to disk, and it takes less time to write such a big file than doing it programmatically.
When I try to do the same thing in .NET, it takes considerably more time than fsutil.exe.
Note: I know that .NET doesn't use native code, so I also tested this with the native API, like the following:
// DiskFree() is the VCL helper; drive 12 is L: ('L' - 64 == 12, with 1 == A:).
// Use __int64 so a large free-space value isn't truncated on 32-bit builds.
__int64 size = DiskFree('L' - 64);
const char* full = "fulldisk.dsk";
__try {
    Application->ProcessMessages();
    HANDLE hf = CreateFile(full,
                           GENERIC_WRITE,
                           0,
                           0,
                           CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL,
                           0);
    if (hf != INVALID_HANDLE_VALUE) {
        // SetFilePointerEx handles offsets over 2GB; plain SetFilePointer
        // would need the high 32 bits passed separately.
        LARGE_INTEGER li;
        li.QuadPart = size;
        SetFilePointerEx(hf, li, NULL, FILE_BEGIN);
        SetEndOfFile(hf);   // extend the file to `size` bytes
        CloseHandle(hf);
    }
}
__finally {
    ShowMessage("Finished");
    exit(0);
}
and the result was roughly equal to the .NET results.
But fsutil.exe takes much less time than either of the approaches above; it is about 2 times faster.
Example:
Writing 400 MB with .NET takes ~40 seconds.
The same amount with fsutil.exe takes around 20 seconds or less.
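(For reference, the documented syntax is fsutil file createnew <filename> <length-in-bytes>, so a 400 MB test file corresponds to something along the lines of:

fsutil file createnew fulldisk.dsk 419430400

where the file name here is just the one used in the snippet above.)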
Is there any explanation for that? Or which function does fsutil.exe use that gives it this significant write speed?
I don't know exactly what fsutil is doing, but I do know of two ways to write a large file that are faster than what you've done above (or seeking to the length you want and writing a zero, which has the same result).
The problem with those approaches is that they zero-fill the file at the time you do the write.
You can avoid the zero-fill by either:
Creating a sparse file. The size is marked where you want it, but the data doesn't actually exist on disk until you write it. All reads of the unwritten areas will return zeros.
Using the SetFileValidData function to set the valid data length without zeroing the file first. However, due to potential security issues, this command requires elevated permissions.
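For what it's worth, both variants boil down to a handful of Win32 calls. This is only a sketch under assumed names and sizes, and it omits the AdjustTokenPrivileges step that enables the SE_MANAGE_VOLUME_NAME privilege SetFileValidData needs:

#include <windows.h>
#include <winioctl.h>

int main(void)
{
    LARGE_INTEGER size;
    DWORD bytes;
    HANDLE h;

    size.QuadPart = 40LL * 1024 * 1024 * 1024;   /* e.g. 40 GB, placeholder */

    /* Option 1: sparse file. The length is set, but clusters are only
       allocated as data is actually written; unwritten areas read as zeros. */
    h = CreateFileA("sparse.dat", GENERIC_WRITE, 0, NULL,
                    CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE) {
        DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL);
        SetFilePointerEx(h, size, NULL, FILE_BEGIN);
        SetEndOfFile(h);
        CloseHandle(h);
    }

    /* Option 2: extend the file, then push the valid data length forward
       so the zero-fill is skipped. Requires the SE_MANAGE_VOLUME_NAME
       privilege to be held and enabled (not shown here). */
    h = CreateFileA("prealloc.dat", GENERIC_WRITE, 0, NULL,
                    CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE) {
        SetFilePointerEx(h, size, NULL, FILE_BEGIN);
        SetEndOfFile(h);
        SetFileValidData(h, size.QuadPart);
        CloseHandle(h);
    }
    return 0;
}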
I agree with the last comment. I was experimenting with a minifilter driver and was capturing IRP_MJ_WRITE IRPs in my callback. When I create or write to a file from the command line or a Win32 application, I can see the writes coming down. But when I created a file using the "fsutil file createnew ..." command, I didn't see any writes. I'm seeing this behavior on Win2k8 R2 on an NTFS volume. And I don't think (not 100% sure though) it is a sparse file either. It is probably setting the size properties in the MFT without allocating any clusters. fsutil does check the available free space, so if the file size is bigger than the free space on disk, you get error 1.
I also ran a program sending FSCTL_GET_RETRIEVAL_POINTERS to the file and got one extent for the entire size of the file. So I believe it is getting all the clusters allocated after all.
This is something that could be:
written in assembler (raw blinding speed),
native C/C++ code, or
possibly an undocumented system call, or some trick that is not documented anywhere.
The above three points could be a hugely significant factor: when .NET code is loaded, it gets JIT-compiled by the runtime. (The time factor would not be noticeable if you have a blazing fast machine; on a lowly low-end Pentium it would show up as sluggish loading.)
More than likely it was written in C/C++. It may surprise you if it was written in assembler.
You can check this for yourself: look at the file size of the executable and compare it to a .NET executable's. You might argue that the file is compressed, but I doubt it would be, and am therefore inclined to rule this out; I reckon Microsoft wouldn't go that far with the compressed-executables business.
Hope this answers your question,
Best regards,
Tom.
fsutil is only fast on NTFS and exFAT, not on FAT32 or FAT16.
This is because some file systems have an "initialized size" concept and thus support fast file initialization. This just reserves the clusters but does not zero them out, because the file system notes that no data has been written to the file, and valid reads of those regions all return zero-filled buffers.
I am using VB6 and the Win32 API to write data to a file. This functionality is for data export, so write performance to disk is the key factor in my considerations. As such I am using the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH options when opening the file with a call to CreateFile.
FILE_FLAG_NO_BUFFERING requires that I use my own buffer and write data to the file in multiples of the disk's sector size. This is generally no problem, apart from the last block of data: if it is not an exact multiple of the sector size, it will include zero bytes padding out the file. How do I set the file size, once the last block is written, so that it does not include these padding zeros?
I can use SetEndOfFile, but this requires me to close the file and re-open it without FILE_FLAG_NO_BUFFERING. I have seen someone mention NtSetInformationFile, but I cannot find out how to declare and use it in VB6. SetFileInformationByHandle can do exactly what I want, but it is only available in Windows Vista, and my application needs to be compatible with previous versions of Windows.
I believe SetEndOfFile is the only way.
And I agree with Mike G. that you should bench your code with and without FILE_FLAG_NO_BUFFERING. Windows file buffering on modern OS's is pretty darn effective.
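If SetEndOfFile is the route you take, the trimming step itself is small: finish the padded, sector-aligned writes on the unbuffered handle, close it, then reopen the file without FILE_FLAG_NO_BUFFERING and truncate it to the real length. A sketch in C of just that last step (the VB6 version is the same sequence of calls via Declare; the names here are placeholders):

#include <windows.h>

/* Truncate a file that was padded out to a sector multiple back to its
   true length. Call this after the unbuffered handle has been closed. */
BOOL trim_to_length(const char *path, LONGLONG trueLength)
{
    /* Reopen WITHOUT FILE_FLAG_NO_BUFFERING so the end of file can be
       set at a non sector-aligned offset. */
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    LARGE_INTEGER li;
    BOOL ok;

    if (h == INVALID_HANDLE_VALUE)
        return FALSE;

    li.QuadPart = trueLength;
    ok = SetFilePointerEx(h, li, NULL, FILE_BEGIN) && SetEndOfFile(h);
    CloseHandle(h);
    return ok;
}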
I'm not sure, but are YOU sure that setting FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH give you maximum performance?
They'll certainly result in your data hitting the disk as soon as possible, but that sort of thing doesn't actually help performance - it just helps reliability for things like journal files that you want to be as complete as possible in the event of a crash.
For a data export routine like you describe, allowing the operating system to buffer your data will probably result in BETTER performance, since the writes will be scheduled in line with other disk activity, rather than forcing the disk to jump back to your file every write.
Why don't you benchmark your code without those options? Leave in the zero-byte padding logic to make it a fair test.
If it turns out that skipping those options is faster, then you can remove the 0-padding logic, and your file size issue fixes itself.
For a one-gigabyte file, Windows buffering will indeed probably be faster, especially if doing many small I/Os. If you're dealing with files which are much larger than available RAM, and doing large-block I/O, the flags you were setting WILL produce much better throughput (up to three times faster for heavily threaded and/or random large-block I/O).