I have some processing I want to do on thousands of files simultaneously. Grab the first byte of all the files and do something, go to the next byte, etc. The files could be any size, so loading them all into memory could be prohibitive.
I'm concerned that, due to limitations on operating system file descriptors, I might run into issues if I just naively open thousands of files and read from them.
But cycling through and opening/closing files would be rather inefficient, I imagine.
Is there some efficient mechanism to handle what I'm trying to do?
NOTE: this functionality may be distributed to machines that I have no control over, so I can't just go changing settings on the OS.
I have some processing I want to do on thousands of files simultaneously. Grab the first byte of all the files and do something, go to the next byte, etc.
Are these files small enough that you could read them all into memory at once? If so, read the files into memory one at a time, then process them all a byte at a time.
I'm concerned that, due to limitations on operating system file descriptors, I might run into issues if I just naively open thousands of files and read from them.
You might. The only way to find out is to try.
But cycling through and opening/closing files would be rather inefficient, I imagine.
Yes it would. But if you can't read all the files into memory, and your operating system can't open thousands of files at a time, then this is your last resort.
What you can do is find out the limit of simultaneous open files that your system can handle. Let's just say for the sake of discussion that your system can open 100 files at a time, and you have 2,500 files to process.
Then your process would look something like this.
Open the first 100 files.
Write an output file that contains the first byte from the 100 files, then the second byte from the 100 files, and so on.
Handle any problems you might encounter if the 100 files are not of the same byte length.
Now, after running this process through all your files, you'll have 25 intermediate files.
Then your second process would look something like this.
Open the 25 intermediate files.
Process the first 100 bytes from each file.
You would determine the actual numbers (simultaneous files open, number of intermediate files) through experimentation or research on your operating system.
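To make the two-pass idea concrete, here is a minimal sketch in Python. It is my own illustration rather than anything from the question: the batch size of 100, the intermediate file names, and helper names like interleave_batch are placeholders, and it assumes equal-length files, so the unequal-length handling mentioned above is left out.

```python
BATCH_SIZE = 100  # hypothetical limit on simultaneously open files; tune for your OS

def interleave_batch(paths, out_path):
    # Pass 1: write byte 0 of every file in the batch, then byte 1, and so on,
    # into a single intermediate file. Assumes equal-length files for simplicity.
    handles = [open(p, "rb") for p in paths]
    try:
        with open(out_path, "wb") as out:
            while True:
                group = [h.read(1) for h in handles]
                if not group[0]:              # first file exhausted; assume all are
                    break
                out.write(b"".join(group))
    finally:
        for h in handles:
            h.close()

def first_pass(all_paths):
    # Split the full list into batches of BATCH_SIZE, one intermediate file per batch.
    # Returns (intermediate_path, batch_size) pairs; the last batch may be smaller.
    batches = []
    for i in range(0, len(all_paths), BATCH_SIZE):
        batch = all_paths[i:i + BATCH_SIZE]
        out_path = "intermediate_%03d.bin" % (i // BATCH_SIZE)
        interleave_batch(batch, out_path)
        batches.append((out_path, len(batch)))
    return batches                            # e.g. 25 intermediate files for 2,500 inputs

def second_pass(batches, process_group):
    # Pass 2: open only the (few) intermediate files; reading batch-size bytes from
    # each one yields byte N of every original file.
    handles = [(open(path, "rb"), size) for path, size in batches]
    try:
        while True:
            groups = [h.read(size) for h, size in handles]
            if not any(groups):
                break
            process_group(groups)             # groups[k] holds byte N of each file in batch k
    finally:
        for h, _ in handles:
            h.close()
```

Reading one byte at a time keeps the sketch close to the description; real code would read and write in larger chunks, and you would pick BATCH_SIZE from the per-process limit you measured.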
Related
I am looking for a program that can download multiple files simultaneously (about 100 files in parallel). The only thing is that this program should be able to handle very big lists of files (like 200 MB of links), and it should work on Windows.
So far I have tested aria2, but when I load my file list I get an out-of-memory exception (aria2 tries to use over 4 GB of memory!). I also tried mulk, but it just isn't working (it has been loading my file list for about two hours now, whereas generating that list and writing it to disk took me about half a minute). I haven't tried wget yet, but as far as I know it cannot download in parallel, am I right?
Is there any software that could handle my requirements?
With aria2, you can use the --deferred-input option to reduce the memory footprint for list input. Setting the --max-download-result option to a low value, such as 100, may reduce memory usage as well.
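For example (untested, with links.txt standing in as a placeholder for your list file), those options could be combined with parallel downloads along these lines:

```
aria2c --input-file=links.txt --deferred-input=true \
       --max-download-result=100 --max-concurrent-downloads=100
```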
One can, of course, use fopen or any of a large number of other APIs available on the Mac to read a file, but what I need to do is open and read every file on the disk, and to do so as efficiently as possible.
So my thought was to use /dev/rdisk* (?) or /dev/(?) and start with the files at the beginning of the device. I would do my best to read the files in the order they appear on the disk, minimize the amount of seeking across the device since files may be fragmented, and read large blocks of data into RAM where they can be processed very quickly.
So, the primary question I have is when reading the data from my device directly, how can I determine exactly what data belongs with what files?
I assume I could start by reading a catalog of the files, and that there would be a way to determine the start and stop locations of files or file fragments on the disk, but I am not sure where to find information on how to obtain that.
I am running Mac OS X 10.6.x and one can assume a standard setup for the drive. I might assume the same information would apply to a standard, read-only, uncompressed .dmg created by Disk Utility as well.
Any information on this topic or articles to read would be of interest.
I don't imagine what I want to do is particularly difficult once the format and layout of the files on disk is understood.
Thank you.
As mentioned in the comments, you need to look at the file system format. However, by reading the raw disk sequentially, (1) you are not guaranteed that consecutive blocks belong to the same file, so you may have to seek anyway, which erodes the advantage of reading directly from /dev/device, and (2) if your disk is only 50% full, you may still end up reading 100% of it, since you will be reading the unallocated space as well as the space allocated to files. So reading directly from /dev/device may be inefficient as well.
That said, fsck and similar tools do perform this kind of operation, but they do it in moderation, based on the possible errors they are looking for when repairing file systems.
Would keeping, say, 512 file handles to files sized 3 GB+ open for the lifetime of a program (say, a week or so) cause issues on 32-bit Linux? On Windows?
Potential workaround: How bad is the performance penalty of opening/closing file handles?
The size of the files doesn't matter. The number of file descriptors does, though. On Mac OS X, for example, the default limit is 256 open files per process, so your program would not be able to keep 512 files open there.
I don't know about Linux, but on Windows, 512 files doesn't seem like that much to me. As a rule of thumb, though, anything more than a thousand is too many. (Although I have to say that I haven't seen any program first-hand open more than, say, 50.)
And the cost of opening/closing handles isn't that big unless you do it every time you want to read or write a small amount of data, in which case it is too high and you should buffer your data.
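If you would rather query the actual limit than guess, Unix-like systems (Linux, Mac OS X) expose it through getrlimit; here is a small Python sketch using the standard resource module (it does not apply on Windows, where handle limits are governed differently):

```python
import resource

# Query the soft and hard limits on open file descriptors for this process
# (RLIMIT_NOFILE); the soft limit is the one that open() calls actually hit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file soft limit: %s, hard limit: %s" % (soft, hard))

# A process may raise its own soft limit up to (but not beyond) the hard
# limit without touching any system-wide settings, e.g.:
#     resource.setrlimit(resource.RLIMIT_NOFILE, (1024, hard))
```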
This relates to some software I've been given to "fix". The easiest and quickest solution would have it open and read 10 random files out of hundreds, extract some very short strings for processing, and immediately close them. Another process may come along right after that and do the same thing to different (or the same) random files, and this may occur hundreds of times in a few seconds.
I know modern operating systems keep those files in memory up to a point, so disk thrashing isn't the issue it was in the past, but I'm looking for any articles or discussions about how to determine when all this opening/closing of many random files becomes a problem.
When your working set (the amount of data read by all your processes) exceeds your available RAM, your throughput will tend towards the I/O capacity of your underlying disk.
From your description of the workload, seek times will be more of a problem than data transfer rates.
When your working set size stays below the amount of RAM you have, the OS will keep all data cached and won't need to go to the disk after having its caches filled.
Using VC++ (Visual Studio 2003).
I'm trying to copy several image files (30 KB or so per file) from another computer's shared folder to a local folder.
The problem is that there can be 2,000 or more files in one transfer, and it seems to take its toll, taking substantially more time to complete.
Is there any alternate method of copying files from another computer that could possibly speed up the copy?
Thanks in advance.
EDIT:
Due to client requirements, it is not possible to change the code base dramatically.
I hate to have to deviate from best practice because of non-technical issues,
but is there a more subtle approach, such as another function call?
I know I'm asking for some magical voodoo; I'm asking just in case somebody knows of such a thing.
A few things to try:
Is copying the files using the OS any faster?
If not, then there may be some inherent limitation in your network or the way it's set up (maybe authentication trouble, or the remote server has hardware issues, or it's too busy, or the network card loses too many packets because of collisions, a faulty switch, bad wiring...).
Run some tests transferring files of various sizes.
Small files are always slower to transfer because there is a lot of overhead to fetch their details, then transfer the data, then create directory entries, etc.
If large files are fast, then your network is OK and you probably won't be able to improve the system much (the bottleneck is elsewhere).
Finally, from code, you could try to open and read the files into a large buffer in one go, then save them to the local drive. This may be faster, as you'll be bypassing a lot of the checks the OS does internally.
You could even do this across a few threads, opening, loading, and writing files concurrently to speed things up a bit.
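As a rough illustration of that buffered, multi-threaded idea (sketched in Python for brevity; the same pattern applies in C++ with a thread pool), with src_dir standing in for the network share and dst_dir for the local target:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def copy_one(src, dst):
    # Read the whole (small) file into memory in one go, then write it out locally.
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(data)

def copy_all(src_dir, dst_dir, workers=8):
    # Copy every file in src_dir to dst_dir with a small pool of threads,
    # so several network reads are in flight at once.
    os.makedirs(dst_dir, exist_ok=True)
    jobs = [(os.path.join(src_dir, name), os.path.join(dst_dir, name))
            for name in os.listdir(src_dir)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so exceptions from failed copies surface here.
        list(pool.map(lambda job: copy_one(*job), jobs))

# Example call (placeholder paths):
# copy_all(r"\\server\share\images", r"C:\local\images")
```

A handful of worker threads is usually enough; beyond that, the network link or the remote disk becomes the bottleneck.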
A couple of references you can check for multi-threaded file copy:
MTCopy: A Multi-threaded Single/Multi file copying tool on CodeProject
Good parallel/multi-thread file copy util? discussion thread on Channel 9.
McTool a command line tool for parallel file copy.
If implementing this yourself in code is too much trouble, you could always simply execute a utility like McTool in the background of your application and let it do the work for you.
Well, for a start, 2,000 is not "several". If it's taking most of the time because you're sending lots of small files, then the solution is to package them at the source into a single file and unpack it at the destination. This will require some code running at the source; you'll have to design your solution to allow for that, since I assume at the moment you're just copying from a network share.
If it's the network speed (unlikely), compress them as well.
My own belief is that it will be the number of files, basically all the repeated start-up costs of each copy. That's because 2,000 30 KB files is only 60 MB, and on a 10 Mbit link the theoretical minimum transfer time would be about a minute.
If your times are substantially above that, then I'd say I'm right.
A solution that uses 7-Zip or similar to compress them all into a single 7z file, transmit it, then unpack it at the other end sounds like what you're looking for.
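As a hedged sketch of that package/transfer/unpack idea, here is the same pattern using Python's standard library for illustration (the paths are placeholders; driving 7-Zip from the command line at each end would achieve the same thing):

```python
import shutil

# At the source: bundle the whole image folder into one archive, so the network
# sees a single large transfer instead of thousands of per-file round trips.
archive_path = shutil.make_archive("images_batch", "zip", root_dir="/data/outgoing/images")

# ... copy archive_path across the network as a single file ...

# At the destination: unpack it back into individual files.
shutil.unpack_archive("images_batch.zip", extract_dir="/data/incoming/images")
```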
But measure, don't guess! Test it out to see if it improves performance. Then make a decision.