Performance issues using CopyFile() to copy files from another computer

Using VC++, Visual Studio 2003.
I'm trying to copy several image files (30 KB or so per file) from another computer's shared folder to local files.
The problem is that there can be more than 2000 or so files in one transfer, and that seems
to take its toll, making the copy take substantially more time to complete.
Is there any alternative method of copying files from another computer that could possibly
speed up the copy?
Thanks in advance.
EDIT:
Due to client request, it is not possible to change the code base dramatically.
I hate to deviate from best practice because of non-technical issues,
but is there a more subtle approach, such as another function call?
I know I'm asking for some magical voodoo; asking just in case somebody knows of such a thing.

A few things to try:
Is copying the files using the OS any faster?
If not, then there may be some inherent limitation in your network or the way it's set up (maybe authentication trouble, or the remote server has hardware issues, or it's too busy, or the network card loses too many packets because of collisions, a faulty switch, bad wiring...).
Make some tests transferring files of various sizes.
Small files are always slower to transfer because there is a lot of overhead to fetch their details, then transfer the data, then create directory entries, etc.
If large files are fast, then your network is OK and you probably won't be able to improve the system much (the bottleneck is elsewhere).
Eventually, from code, you could try to open each file, read it into a large buffer in one go, then save it to the local drive. This may be faster, as you'll be bypassing a lot of the checks that the OS does internally.
You could even do this over a few threads to open, load, and write files concurrently and speed things up a bit.
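If you want to experiment with that, here is a minimal sketch of the single-threaded version using the Win32 API (an illustration, not the poster's code; paths and error handling are simplified, and a single ReadFile per file is assumed to be enough for the ~30 KB files in question):

#include <windows.h>
#include <vector>

// Read the whole source file into one buffer, then write it out locally.
bool BufferedCopy(const wchar_t* src, const wchar_t* dst)
{
    HANDLE hSrc = CreateFileW(src, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (hSrc == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER size;
    if (!GetFileSizeEx(hSrc, &size)) { CloseHandle(hSrc); return false; }

    std::vector<char> buffer((size_t)size.QuadPart);
    DWORD bytesRead = 0;
    BOOL ok = ReadFile(hSrc, buffer.empty() ? NULL : &buffer[0],
                       (DWORD)buffer.size(), &bytesRead, NULL);
    CloseHandle(hSrc);
    if (!ok) return false;

    HANDLE hDst = CreateFileW(dst, GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hDst == INVALID_HANDLE_VALUE) return false;

    DWORD bytesWritten = 0;
    ok = WriteFile(hDst, buffer.empty() ? NULL : &buffer[0],
                   bytesRead, &bytesWritten, NULL);
    CloseHandle(hDst);
    return ok && bytesWritten == bytesRead;
}

Usage would be something like BufferedCopy(L"\\\\server\\share\\img001.jpg", L"C:\\local\\img001.jpg"). Whether this actually beats CopyFile() depends on where the time is really going, so measure it first.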
A couple of references you can check for multi-threaded file copy:
MTCopy: A Multi-threaded Single/Multi file copying tool on CodeProject
Good parallel/multi-thread file copy util? discussion thread on Channel 9.
McTool, a command-line tool for parallel file copy.
If implementing this yourself in code is too much trouble, you could always simply execute a utility like McTool in the background of your application and let it do the work for you.
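If you go that route, a minimal sketch of launching such a tool from the application and waiting for it could look like this (the executable name mctool.exe and its arguments are placeholders, not McTool's actual command line):

#include <windows.h>

// Launch an external copy utility and wait for it to finish.
bool RunCopyTool()
{
    // CreateProcess needs a writable command-line buffer.
    wchar_t cmd[] = L"mctool.exe \\\\server\\share\\images C:\\local\\images";

    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    if (!CreateProcessW(NULL, cmd, NULL, NULL, FALSE,
                        CREATE_NO_WINDOW, NULL, NULL, &si, &pi))
        return false;

    WaitForSingleObject(pi.hProcess, INFINITE);   // or poll, to keep the UI responsive
    DWORD exitCode = 0;
    GetExitCodeProcess(pi.hProcess, &exitCode);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return exitCode == 0;
}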

Well, for a start, 2000 is not "several". If the copy is taking most of its time because you're sending lots of small files, then the solution is to package them at the source into a single file and unpackage them at the destination. This will require some code running at the source; you'll have to design your solution to allow that, since I assume at the moment you're just copying from a network share.
If it's the network speed (unlikely), you can compress them as well.
My own belief is that the problem will be the number of files, basically all the repeated per-file startup costs of a copy. That's because 2000 files of 30 KB is only 60 MB, and on a 10 Mbit link the theoretical minimum time would be about a minute.
If your times are substantially above that, then I'd say I'm right.
A solution that uses 7-Zip or similar to compress them all into a single .7z file, transmit it, then unzip it at the other end sounds like what you're looking for.
But measure, don't guess! Test it out to see if it improves performance. Then make a decision.
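As a rough sketch of that workflow with 7-Zip's command-line tool (the paths are placeholders; the archive step runs on the source machine, the extract step on the destination):

7z a images.7z C:\images\*.jpg
copy images.7z \\destination\share\images.7z
7z x images.7z -oC:\incoming

One large transfer replaces thousands of small ones, which is exactly where the per-file overhead was going.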

Related

Downloading simultaneously multiple files with big file lists on windows

I am looking for a program that can download multiple files simultaneously (about 100 files in parallel). The only thing is, this program should be able to handle very big lists of files (around 200 MB of links), and it should work on Windows.
So far I have tested aria2, but when I load my file list I get an out-of-memory exception (aria2 tries to use over 4 GB of memory!). I also tried mulk, but it just isn't working: it has been loading my file list for about two hours now, while generating that list and writing it to disk took me about half a minute. I haven't tried wget yet, but as far as I know it cannot download in parallel, am I right?
Is there any software that can handle my requirements?
With aria2, you can use the --deferred-input option to reduce the memory footprint of list input. Setting the --max-download-result option to a low value, such as 100, may reduce memory usage too.
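A sketch of such an invocation (links.txt is a placeholder for your list file; -i reads the link list and -j sets the number of parallel downloads):

aria2c -i links.txt -j 100 --deferred-input=true --max-download-result=100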

What's the fastest way to copy a folder over the network to multiple servers (Python)

As the title says, what I would like to accomplish is: given a package (usually between 500 MB and 1 GB in size), copy it to around 40 servers at the same time (concurrently). I've been using a script that runs one copy at a time, so I'm considering these possibilities:
1. Use the multiprocessing library and create a single process for each copy so that they can run concurrently;
although I think I might end up with an I/O bottleneck, and processes cannot share the same data.
2. I'm not using a single internet connection, but a huge corporate WAN.
Can anyone tell me whether there is any other more effective (faster) way to achieve the same thing, or some other way to solve it? (I can run this task from a 2-core workstation.)
1) I have no experience with this, but it looks like a fit for your use case:
http://code.google.com/p/pysendfile/
sendfile(2) is a system call which provides a "zero-copy" way of copying data from one file descriptor to another (a socket). The phrase "zero-copy" refers to the fact that all of the copying of data between the two descriptors is done entirely by the kernel, with no copying of data into userspace buffers. This is particularly useful when sending a file over a socket (e.g. FTP).
and
When do you want to use it?
Basically any application sending files over the network can take advantage of sendfile(2).
2) Another option would be to use some torrent library. I recently learned (skip to 31:00 for the torrent stuff) that Facebook distributes its daily software updates via torrent (and updates thousands of servers with 1.5 GB binaries within 15 minutes or so).
Assume your machines have 1 Gbit connections. You'll get 800 Mbit/s if you're lucky and work at it, and it'll take ~10 s to copy each 1 GB, so 6-7 minutes to update those 40 machines. If that's good enough, the only thing you need to do is work on using the 1 Gbit efficiently to hit that target (what are you seeing from your current scripts? OK, 1 Gbit may be ambitious over a WAN, but you can do a similar analysis). Multiprocessing might or might not help here, but it's not going to magically get you more bandwidth.
If that's not good enough, I'd consider either of these:
Going P2P (see miku's answer): as soon as one machine has a piece of the data, it can share it with other machines using its own bandwidth. How much this helps depends to some extent on your network topology (the existence of other bottleneck points).
Looking into multicast, if the network is sufficiently under your control that you can get the traffic routed appropriately (this seems pretty unlikely over a WAN, but maybe one day in an IPv6 wonderland...). Instead of copying the same data 40 times (assuming it is the same each time), you broadcast it once and all the receivers pick it up simultaneously; a toy sender sketch follows. Multicast UDP isn't reliable (it's intended more for IPTV, I think), but there have been attempts to build reliable file-transfer tools on top of multicast, e.g. OpenPGM and MS's own implementation.
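To make the multicast idea concrete, here is a toy sender sketch in C++ (not a reliable file-transfer tool: the group address, port, and payload are placeholders, and a real solution would need chunking plus retransmission or FEC, as noted above):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main()
{
    // A plain UDP socket; sending to a multicast group needs no special setup
    // (the default TTL of 1 keeps the traffic on the local subnet).
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0) { std::perror("socket"); return 1; }

    sockaddr_in group;
    std::memset(&group, 0, sizeof(group));
    group.sin_family = AF_INET;
    group.sin_addr.s_addr = inet_addr("239.1.1.1");  // placeholder group address
    group.sin_port = htons(5000);                    // placeholder port

    const char chunk[] = "one block of the package";  // stand-in for real file data
    if (sendto(s, chunk, sizeof(chunk), 0,
               reinterpret_cast<sockaddr*>(&group), sizeof(group)) < 0)
        std::perror("sendto");

    close(s);
    return 0;
}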

A Linux Kernel Module for Self-Optimizing Hard Drives: Advice?

I am a computer engineering student studying Linux kernel development. My 4-man team was tasked to propose a kernel development project (to be implemented in 6 weeks), and we came up with a tentative "Self-Optimizing Hard Disk Drive Linux Kernel Module". I'm not sure if that title makes sense to the pros.
We based the proposal on this project.
The goal of the project is to minimize hard disk access times. The plan is to create a special partition where the "most commonly used" files are to be placed. An LKM will profile, analyze, plan, and redirect I/O operations to the hard disk. This LKM should primarily be able to predict and redirect all file access (on files with sizes of < 10 MB) with minimal overhead, and lessen average read/write access times to the hard disk. I believe Apple's HFS has this feature.
Can anybody suggest a starting point? I recently found a way to redirect I/O operations by intercepting system calls (by hijacking all the read/write ones). However, I'm not convinced that this is the best way to go. Is there a way to write a driver that redirects these read/write operations? Can we perhaps tap into the read/write cache to achieve the same effect?
Any feedback at all is appreciated.
You may want to take a look at Unionfs. You don't even need an LKM, just a user-space daemon which subscribes to inotify events, keeps statistics, and migrates files between partitions. Unionfs will combine both partitions into a single logical filesystem.
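A minimal sketch of the user-space side, assuming the daemon only counts open events under one watched directory (/data/hot is a placeholder; the statistics-based migration and the unionfs mount itself are left out):

#include <sys/inotify.h>
#include <unistd.h>
#include <cstdio>
#include <map>
#include <string>

int main()
{
    int fd = inotify_init();
    if (fd < 0) { std::perror("inotify_init"); return 1; }

    // Watch one directory for file opens and accesses.
    int wd = inotify_add_watch(fd, "/data/hot", IN_OPEN | IN_ACCESS);
    if (wd < 0) { std::perror("inotify_add_watch"); return 1; }

    std::map<std::string, long> hits;   // per-file access counters
    char buf[4096];

    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0) break;
        for (char* p = buf; p < buf + len; ) {
            inotify_event* ev = reinterpret_cast<inotify_event*>(p);
            if (ev->len > 0)
                ++hits[ev->name];       // a real daemon would decide here whether to migrate the file
            p += sizeof(inotify_event) + ev->len;
        }
    }

    close(fd);
    return 0;
}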
There are many ways in which such optimizations might be useful:
accessing file A implies file B access is imminent. Example: opening an icon file for a media file by a media player
accessing any file in some group G of files means that other files in the group will be accessed shortly. Example: mysql receives a USE somedb command, which implies that all of that database's table files, indexes, etc. will be accessed.
a program which stops reading a sequential file suggests the program has stalled or exited, so predictions of future accesses associated with that file should be abandoned.
having multiple (yet transparent) copies of some frequently referenced files strategically sprinkled about can use the copy nearest the disk heads. Example: uncached directories or small, frequently accessed settings files.
There are so many possibilities that I think at least 50% of an efficient solution would be a sensible, limited specification of which features you will attempt to implement and which you won't. It might be valuable to study how Windows Vista's aggressive file-caching mechanism disappointed.
Another problem you might encounter with a modern Linux distribution is how well the system already does much of what you plan to improve. In fact, measuring the improvement might be a big challenge. I suggest writing a benchmark program which opens and reads a series of files and precisely times the complete sequence. Run it several times with your improvements enabled and disabled. But you'll have to reboot in between for valid timing....
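A minimal sketch of such a benchmark, assuming the files to read are passed on the command line and total wall-clock time is the metric of interest:

#include <cstdio>
#include <time.h>

int main(int argc, char** argv)
{
    timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    char buf[1 << 16];
    for (int i = 1; i < argc; ++i) {
        std::FILE* f = std::fopen(argv[i], "rb");
        if (!f) continue;
        while (std::fread(buf, 1, sizeof(buf), f) > 0) { /* read-only pass */ }
        std::fclose(f);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    std::printf("read %d files in %.3f s\n", argc - 1, secs);
    return 0;
}

As suggested above, reboot (or otherwise clear the page cache) between runs, otherwise the second run mostly measures RAM rather than the disk.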

Why does multithreaded file transfer improve performance?

RichCopy, a better-than-robocopy-with-GUI tool from Microsoft, seems to be the current tool of choice for copying files. One of its main features, highlighted in the TechNet article presenting the tool, is that it copies multiple files in parallel. In its default setting, three files are copied simultaneously, which you can see nicely in the GUI: [Progress: xx% of file A, yy% of file B, ...]. There are a lot of blog entries around praising this tool and claiming that it speeds up the copying process.
My question is: Why does this technique improve performance? As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network. My assumption would be that copying multiple files at once makes the whole process slower, since the HDD needs to jump back and forth between different files rather than just sequentially streaming one file. Since RichCopy is faster, there must be some mistake in my assumptions...
The tool is making use of improvements in hardware that can optimise multiple read and write requests much better.
When copying one file at a time, the hardware isn't going to know that the block of data currently passing under the read head (or nearby) will be needed for a subsequent read, since the software hasn't queued that request yet.
A single file copy these days is not a very taxing task for modern disk subsystems. By giving these hardware systems more work to do at once, the tool is leveraging their improved optimising features.
A naive "copy multiple files" application will copy one file, then wait for that to complete before copying the next one.
This will mean that an individual file CANNOT be copied faster than the network latency allows, even if it is empty (0 bytes). Because each copy probably involves several file-server calls (open, write, close), the cost may be several times the latency.
To efficiently copy files, you want to have a server and client which use a sane protocol which has pipelining; that's to say - the client does NOT wait for the first file to be saved before sending the next, and indeed, several or many files may be "on the wire" at once.
Of course to do that would require a custom server not a SMB (or similar) file server. For example, rsync does this and is very good at copying large numbers of files despite being single threaded.
So my guess is that the multithreading helps because it is a work-around for the fact that the server doesn't support pipelining on a single session.
A single-threaded implementation which used a sensible protocol would be best in my opinion.
It's a network tool, so the bottleneck is the network, not the HDD. Up to a (low) point you can get more throughput out of a TCP link by using a few connections in parallel. This (a) parallelizes the TCP handshakes; (b) can make better use of the bandwidth-delay product if that is high; and (c) doesn't make one arbitrarily slow connection the critical path if for some reason it encounters a high RTT or failure rate.
Another way to do (b) is to use an enormous TCP socket receive buffer but that's not always convenient.
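For reference, requesting a larger receive buffer is a single socket option; a minimal Winsock sketch (not part of the original answer), where the 4 MB figure is an arbitrary example and the OS may clamp it:

#include <winsock2.h>
#pragma comment(lib, "ws2_32.lib")

// Ask for a large TCP receive buffer on an already-created socket.
bool EnlargeReceiveBuffer(SOCKET s)
{
    int bufSize = 4 * 1024 * 1024;   // arbitrary; tune to the bandwidth-delay product
    return setsockopt(s, SOL_SOCKET, SO_RCVBUF,
                      reinterpret_cast<const char*>(&bufSize),
                      (int)sizeof(bufSize)) == 0;
}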
Several of the other answers about HDD are incorrect. Practically any HDD will do some read-ahead on the assumption of sequential access, and any intelligent OS cache will also do that.
My guess is that the HDD read/write heads spend most of their time idle, waiting for the correct block of the disk to appear under them; the more data being copied, the less idle time, and most modern disk schedulers should take care of the jumping around (for a low number of files/fragments).
As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network.
I think those assumptions are overly simplistic.
First, while LANs run at 100 Mbit / 1 Gbit, long-haul networks have a maximum data rate that is less than the max rate of the slowest link.
Second, the effective throughput of a TCP/IP stream over the internet is often dominated by the time taken to round-trip messages and acknowledgments. For example, I have an 8+ Mbit link, but my data rate on downloads is rarely above 1-2 Mbit per second when I'm downloading from the USA. So if you run multiple streams in parallel, one stream can be waiting for an acknowledgment while another is pumping packets. (But if you try to send too much, you start getting congestion, timeouts, back-off, and lower overall transfer rates.)
Finally, operating systems are good at doing a variety of I/O tasks in parallel with other work. If you are downloading 2 or more files in parallel, the O/S may be reading / processing network packets for one download and writing to disc for another one ... at the same time.
Over long distances, networks can write much faster than they can read. With multithreading, having additional "readers" means the data can be transmitted more efficiently and not bogged down in buffers.

Fastest way to move files on a Windows System [closed]

I want to move about 800gb of data from an NTFS storage device to a FAT32 device (both are external hard drives), on a Windows System.
What is the best way to achieve this?
Simply using cut-paste?
Using the command prompt? (move)
Writing a batch file to copy small chunks of data at a given interval?
Use some specific application that does the job for me?
Or any better idea...?
What is the most safe, efficient and fast way to achieve such a time consuming process?
Robocopy
You can restart the command and it'll resume. I use it all the time over the network. Works on large files as well.
I would physically move the hard disk if possible.
I've found FastCopy to be quite good for this sort of thing. It's a GUI tool:
http://www.ipmsg.org/tools/fastcopy.html.en
If you have to move it over a network, you want to use FTP between the servers. Windows file sharing will get bogged down by its chatty protocol.
I've found Teracopy to be pretty fast and handy. Allegedly Fastcopy (as suggested by benlumley) is even faster, but I don't have any experience with it.
Try using WinRAR or a zipping tool. Big files are moved quicker than lots of small ones.
Most zipping tools allow you to split the archive (zip) files into multiple archives.
You might even reduce the size a bit if you turn on compression.
Command Line: xcopy is probably your best bet
Command Reference:
http://www.computerhope.com/xcopyhlp.htm
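A typical invocation might look like this (drive letters and paths are placeholders; /E copies subdirectories including empty ones, /C continues on errors, /H includes hidden and system files, /Y suppresses overwrite prompts):

xcopy E:\data F:\data /E /C /H /Y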
I used Teracopy to copy 50+ GB to a 128 GB flash drive.
It took almost 48 hours, and I had to do it twice because of a power
hiccup. I had to re-format and start over... not my favorite thing
to do...
One of the fastest ways to copy files is to use robocopy, as pointed out by Pyrolistical above. It's very flexible and powerful.
If the command doesn't work from your DOS prompt directly, then try the PowerShell option as in the example below.
Check the documentation for this command before using it: robocopy /?
powershell "robocopy 'Source' 'destination' /E /R:3 /W:10 /FP /MT:25 /V"
/E - Copy subdirectory including empty ones.
/R - Retry 3 times if failed.
/W - wait for 10 seconds between retries.
/FP - include full path name in output.
/MT - Multi thread.
/V - verbose output.
I wanted to comment on a comment about multithreading from #hello_earth (2015-10-13 11:24), but I don't have enough reputation points on Stack Overflow (I've mostly posted on Super User up until now):
Multithreading is typically not efficient when it comes to copying files from one storage device to another, because the fastest throughput is reached with sequential reads, and using multiple threads will make an HDD rattle and grind like crazy as it reads or writes several files at the same time. Since an HDD can only access one file at a time, it must read or write a chunk from one file, then move to a chunk from another file located in a different area, which slows down the process considerably (I don't know how an SSD would behave in such a case). It is both inefficient and potentially harmful: the mechanical stress is considerably higher when the heads are moving repeatedly across the platters to reach several areas in short succession, rather than staying in the same spot to parse a large contiguous file.
I discovered this when batch checking the MD5 checksums of a very large folder full of video files with md5deep : with the default options the analysis was multithreaded, so there were 8 threads with an i7 6700K CPU, and it was excruciatingly slow. Then I added the -j1 option, meaning 1 thread, and it proceeded much faster, since the files were now read sequentially.
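For reference, the single-threaded run was along these lines (the folder path is a placeholder; -r recurses into subdirectories and -j1 restricts md5deep to one thread):

md5deep -r -j1 E:\videos > checksums.txt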
Another consideration that derives from this is that the transfer speed will be significantly higher if the files are not fragmented, and also, more marginally, if they are located at the beginning of the hard disk drive, corresponding to the outermost parts of the platters, where the linear velocity is at its maximum (that aspect is irrelevant with a solid state drive or other flash-memory-based device).
Also, the original poster wanted “the most safe, efficient and fast way to achieve such a time consuming process”. I'd say that one has to choose a compromise favoring either speed/efficiency or safety: if you want safety, you have to check that each file was copied flawlessly (by checking MD5 checksums, or with something like WinMerge); if you don't do that, you can never be 100% sure that there weren't some SNAFUs in the process (hardware or software issues); and if you do, you have to spend twice as much time on the task.
For instance: I relied on a little tool called SynchronizeIt! for my file copying purposes, because it has the huge advantage over most similar tools of preserving all timestamps (including directory timestamps, like Robocopy does with the /DCOPY:T switch), and it has a streamlined interface with just the options I need. But I discovered that some files were always corrupted after a copy, truncated after exactly 25000 bytes (so the copy of a 1 GB video, for instance, had 25000 good bytes and then 1 GB of 00s; the copy process was abnormally fast, taking only a split second, which triggered my suspicion in the first place).
I reported this issue to the author a first time in 2010, but he chalked it up to a hardware malfunction and didn't think twice about it. I still used SI, but started to check files thoroughly every time I made a copy (with WinMerge or Total Commander); when files ended up corrupted I used Robocopy instead (files which were corrupted with SynchronizeIt, when they were copied with Robocopy and then copied again with SynchronizeIt, were copied flawlessly, so there was something in the way they were recorded on the NTFS partition which confused that software, and which Robocopy somehow fixed).
Then in 2015 I reported it again, after having identified more patterns regarding which files were corrupted: they had all been downloaded with particular download managers. That time the author did some digging and found the explanation: it turned out that his tool had trouble copying files with the little-known “sparse” attribute, and that some download managers set this attribute to save space when downloading files in multiple chunks. He provided me with an updated version which correctly copies sparse files, but hasn't released it on his website (the currently available version is 3.5 from 2009; the version I now use is a 3.6 beta from October 2015). So if you want to try that otherwise excellent software, be aware of that bug, and whenever you copy important files, thoroughly verify that each copied file is identical to the source (using a different tool) before deleting them from the source.
