Downloading multiple files simultaneously with big file lists on Windows

I am looking for a program that can download multiple files simultaneously (say, about 100 files in parallel). The catch is that this program should be able to handle very big lists of files (around 200MB of links) and should work on Windows.
So far I have tested aria2, but when I load my file list I get an out-of-memory exception (aria2 tries to use over 4GB of memory!). I also tried mulk, but it simply isn't working: it has been loading my file list for about two hours now, while generating that list and writing it to disk took me about half a minute. I haven't tried wget yet, but as far as I know it cannot download in parallel, am I right?
Is there any software that could handle my requirements?

With aria2, you can use the --deferred-input option to reduce the memory footprint for list input. Setting the --max-download-result option to a low value, such as 100, may also reduce memory usage.
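For example, an invocation combining those options might look something like this (links.txt stands in for your list file; --max-concurrent-downloads controls how many downloads run in parallel):

    aria2c --input-file=links.txt --deferred-input=true --max-download-result=100 --max-concurrent-downloads=100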

Related

Pre-warm disk cache

After some theoretical discussion today I decided to do some research, but I did not find anything conclusive.
Here's the problem:
We have written a tool that reads around 10GB of image files from a data set of several terabytes. We want to speed up the execution time by minimizing I/O overhead. The idea would be to "pre-warm" the disk cache, as we know beforehand what directory we will be reading from as the tool executes. Is there any API or method to give this hint to Windows so that it can start pre-warming the disk cache, speeding up future disk access as the files are already in RAM (of which there is plenty on the machines we run the tool on)?
I know Windows does readahead on a single file, but what if I have a directory with thousands of files?
I haven't found any Win32 APIs or command-line tools to do this directly.
What if I start a low priority background thread, opening all the files for reading and closing them?
I could of course memory map all the files and pin them in RAM, but that would probably run the risk of starving the main worker thread of I/O.
The general idea here is that the tool "bursts" I/O requests, as each thread will do I/O and CPU processing in sequence, hence we could use the "idle" I/O time to preload the remaining files into RAM.
(I could of course benchmark, and I will, but I would like to understand a bit more of how this works in order to be more scientific and less cargo culty).
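For what it's worth, a minimal sketch of the low-priority background prefetch thread idea might look like the following (untested; it assumes Win32 plus C++17 std::filesystem, error handling is omitted, and the 1MB chunk size is arbitrary). It simply walks the target directory on a background-priority thread and reads every file once so the data ends up in the Windows file cache:

    #include <windows.h>
    #include <filesystem>
    #include <thread>
    #include <vector>

    // Read every file under 'dir' on a background-priority thread so the
    // OS file cache is warm before the worker threads need the data.
    void PrewarmDirectory(const std::filesystem::path& dir) {
        std::thread([dir] {
            // Lower both CPU and I/O priority for this thread.
            SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_BEGIN);

            std::vector<char> buffer(1 << 20);  // 1MB read chunks (arbitrary)
            for (const auto& entry : std::filesystem::recursive_directory_iterator(dir)) {
                if (!entry.is_regular_file()) continue;
                HANDLE h = CreateFileW(entry.path().c_str(), GENERIC_READ,
                                       FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                                       FILE_FLAG_SEQUENTIAL_SCAN,  // hint for read-ahead
                                       nullptr);
                if (h == INVALID_HANDLE_VALUE) continue;
                DWORD read = 0;
                // Reading (and discarding) the data pulls the file into the cache.
                while (ReadFile(h, buffer.data(), (DWORD)buffer.size(), &read, nullptr) && read > 0) {}
                CloseHandle(h);
            }
            SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_END);
        }).detach();
    }

Whether this actually helps depends on whether the working set fits in RAM and on how much idle I/O time the main threads leave, so it still needs to be benchmarked.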

Trying to understand why freading a file over a network is so much slower over Samba than NFS

We have a program that builds a 3D model from three files hosted on a Linux file server, basically x.bin, y.bin and z.bin. It builds the model one z level at a time, and it reads each file for every "slice".
On Linux machines running this program, the first slice takes around 45 seconds, and then ~2 seconds for every "slice" after that.
On Windows, the exact same program performing the exact same operation running the exact same script and code takes 5 minutes for the first slice, and around a minute and a half each slice after that.
Reading file over network slow due to extra reads
This thread seemed to have a guy with a similar problem, but the truth is that I'm still unclear on how NFS can be faster, as well as how I could suggest a change to the actual developers to improve performance. The code is OS-independent; I believe it's just using C's fread, fseek, etc. to read the file information over the network.
How does NFS transfer/read data such that it can be 60x faster than Samba?
How can I get that performance on Samba?
I'm not 100% sure as I don't know much about Samba, but my guess is that NFS supports fseek and thus can just position at the next slice and return that data, while Samba probably doesn't and has to return the full file from the server and discard the "unused" content.
By the way, it's not the exact same program you're running; you probably recompiled it, right? So it gets translated into different system calls, with each platform having different pros and cons...
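For reference, the kind of per-slice access pattern being discussed looks roughly like the sketch below (the real code isn't shown in the question, so the function name, the three FILE* arguments, and the offset/size handling are guesses). Each slice costs an fseek plus a small fread per file, so over a network file system the latency of each round trip, rather than raw bandwidth, tends to dominate:

    #include <cstddef>
    #include <cstdio>

    // Hypothetical reconstruction of the per-slice read loop: seek to the slice
    // offset in each of the three files and read a small chunk into buf (which
    // must hold at least 3 * slice_bytes). Every fseek+fread pair can become
    // one or more network round trips on SMB or NFS.
    std::size_t read_slice(std::FILE* x, std::FILE* y, std::FILE* z,
                           long offset, std::size_t slice_bytes, char* buf) {
        std::FILE* files[3] = { x, y, z };
        std::size_t total = 0;
        for (int i = 0; i < 3; ++i) {
            if (std::fseek(files[i], offset, SEEK_SET) != 0) return total;
            total += std::fread(buf + total, 1, slice_bytes, files[i]);
        }
        return total;
    }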

Is there a way to efficiently read many files simultaneously?

I have some processing I want to do on thousands of files simultaneously. Grab the first byte of all the files and do something, go to the next byte, etc. The files could be any size, so loading them all into memory could be prohibitive.
I'm concerned that, due to operating system limits on file descriptors, just naively opening thousands of files and reading them in might run into issues.
But cycling through and opening/closing files would be rather inefficient, I imagine.
Is there some efficient mechanism to handle what I'm trying to do?
NOTE: this function may be distributed for use on machines that I have no control over, so I can't just go changing settings on the OS.
I have some processing I want to do on thousands of files simultaneously. Grab the first byte of all the files and do something, go to the next byte, etc.
Are these files small enough that you could read them all into memory at once? If so, then read the files one at a time, then process all the files a byte at a time.
I'm concerned that, due to operating system limits on file descriptors, just naively opening thousands of files and reading them in might run into issues.
You might. The only way to find out is to try.
But cycling through and opening/closing files would be rather inefficient, I imagine.
Yes it would. But if you can't read all the files into memory, and your operating system can't open thousands of files at a time, then this is your last resort.
What you can do is find out the limit of simultaneous open files that your system can handle. Let's just say for the sake of discussion that your system can open 100 files at a time, and you have 2,500 files to process.
Then your process would look something like this.
Open the first 100 files.
Write an output file that contains the first byte from the 100 files, then the second byte from the 100 files, and so on.
Handle any problems you might encounter if the 100 files are not of the same byte length.
Now, after running this process through all your files, you'll have 25 intermediate files.
Then your second process would look something like this.
Open the 25 intermediate files.
Process the first 100 bytes from each file.
You would determine the actual numbers (simultaneous files open, number of intermediate files) through experimentation or research on your operating system.
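A rough sketch of the first pass might look like this (hypothetical names; the 0-byte padding for files of unequal length and the unbuffered fgetc/fputc loop are simplifications you would replace with your own policy):

    #include <cstdio>
    #include <string>
    #include <vector>

    // First pass: open one batch of input files (small enough to stay under the
    // descriptor limit) and write an intermediate file that interleaves them:
    // byte 0 of every file, then byte 1 of every file, and so on.
    bool interleave_batch(const std::vector<std::string>& inputs,
                          const std::string& output_path) {
        std::vector<std::FILE*> files;
        for (const auto& path : inputs) {
            std::FILE* f = std::fopen(path.c_str(), "rb");
            if (!f) { for (std::FILE* g : files) std::fclose(g); return false; }
            files.push_back(f);
        }
        std::FILE* out = std::fopen(output_path.c_str(), "wb");
        if (!out) { for (std::FILE* g : files) std::fclose(g); return false; }

        while (true) {
            std::vector<int> row;
            bool any_data = false;
            for (std::FILE* f : files) {
                int c = std::fgetc(f);
                if (c != EOF) any_data = true;
                row.push_back(c);
            }
            if (!any_data) break;  // every file in the batch is exhausted
            for (int c : row) std::fputc(c == EOF ? 0 : c, out);  // pad short files with 0
        }

        for (std::FILE* f : files) std::fclose(f);
        std::fclose(out);
        return true;
    }

The second pass is the same idea applied to the (far fewer) intermediate files, processing them one row of bytes at a time as described above.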

Random access of multiple files and file caching

This relates to some software I've been given to "fix". The easiest and quickest solution would have it open and read 10 random files out of hundreds, extract some very short strings for processing, and immediately close them. Another process may come along right after and do the same thing to different, or the same, random files, and this may happen hundreds of times in a few seconds.
I know modern operating systems keep those files in memory up to a point, so disk thrashing isn't the issue it was in the past, but I'm looking for any articles or discussions about how to determine when all this opening/closing of many random files becomes a problem.
When your working set (the amount of data read by all your processes) exceeds your available RAM, your throughput will tend towards the I/O capacity of your underlying disk.
From your description of the workload, seek times will be more of a problem than data transfer rates.
When your working set size stays below the amount of RAM you have, the OS will keep all data cached and won't need to go to the disk after having its caches filled.

Performance issues using CopyFile() to copy files from different computers

Using VC++, Visual Studio 2003.
I'm trying to copy several image files (30KB or so per file) from another computer's shared folder to a local folder.
The problem is that there can be more than 2000 or so files in one transfer, and it seems to take its toll, taking substantially more time to complete.
Is there any alternate method of copying files from another computer that could possibly speed up the copy?
Thanks in advance.
EDIT:
Due to a client request, it is not possible to change the code base dramatically. I hate to have to deviate from best practice because of non-technical issues, but is there a more subtle approach, such as another function call?
I know I'm asking for some magical voodoo; I'm asking just in case somebody knows of such a thing.
A few things to try:
is copying files using the OS any faster?
if not, then there may be some inherent limitation in your network or the way it's set up (maybe authentication troubles, or the distant server has some hardware issues, or it's too busy, or the network card loses too many packets because of collisions, a faulty switch, bad wiring...)
run some tests transferring files of various sizes.
Small files are always slower to transfer because there is a lot of overhead to fetch their details, then transfer the data, then create directory entries, etc.
if large files are fast, then your network is OK and you probably won't be able to improve the system much (the bottleneck is elsewhere).
Finally, from code, you could try to open and read the files into a large buffer in one go, then save them on the local drive. This may be faster as you'll be bypassing a lot of checks that the OS does internally.
You could even do this over a few threads to open, load, write files concurrently to speed things up a bit.
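If you want to experiment with that, a minimal sketch might look like the following (untested, hypothetical names and paths; it reads each remote file fully into memory, writes it locally, and spreads the file list over a few threads). It is written with modern C++ for brevity; on VC++ 2003 the same idea maps to CreateFile/ReadFile/WriteFile plus Win32 threads:

    #include <cstddef>
    #include <filesystem>
    #include <fstream>
    #include <iterator>
    #include <thread>
    #include <vector>

    namespace fs = std::filesystem;

    // Copy one file by reading it into a buffer in one go, then writing it locally.
    void copy_whole_file(const fs::path& src, const fs::path& dst) {
        std::ifstream in(src, std::ios::binary);
        std::ofstream out(dst, std::ios::binary);
        std::vector<char> buffer((std::istreambuf_iterator<char>(in)),
                                 std::istreambuf_iterator<char>());
        out.write(buffer.data(), (std::streamsize)buffer.size());
    }

    // Split the file list over a handful of worker threads.
    void copy_files_parallel(const std::vector<fs::path>& sources,
                             const fs::path& dst_dir, unsigned workers = 4) {
        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w) {
            pool.emplace_back([&, w] {
                for (std::size_t i = w; i < sources.size(); i += workers)
                    copy_whole_file(sources[i], dst_dir / sources[i].filename());
            });
        }
        for (auto& t : pool) t.join();
    }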
A couple of references you can check for multi-threaded file copy:
MTCopy: A Multi-threaded Single/Multi file copying tool on CodeProject
Good parallel/multi-thread file copy util? discussion thread on Channel 9.
McTool, a command-line tool for parallel file copy.
If implementing this yourself in code is too much trouble, you could always simply execute a utility like McTool in the background of your application and let it do the work for you.
Well, for a start, 2000 is not several. If it's taking most of the time because you're sending lots of small files, then come up with a solution that packages them at the source into a single file and unpackages them at the destination. This will require some code running at the source - you'll have to design your solution to allow that, since I assume at the moment you're just copying from a network share.
If it's the network speed (unlikely), compress them as well.
My own belief is that it will be the number of files, basically all the repeated startup costs of a copy. That's because 2000 30KB files is only 60MB, and on a 10Mbit link the theoretical minimum time would be about a minute (60MB is roughly 480 megabits, and 480 megabits at 10Mbit/s is about 48 seconds).
If your times are substantially above that, then I'd say I'm right.
A solution that uses 7zip or similar to compress them all into a single 7z file, transmit it, then unzip it at the other end sounds like what you're looking for.
But measure, don't guess! Test it out to see if it improves performance. Then make a decision.

Resources