I'd love some help figuring out why a script is running much more slowly than it used to.
The script starts sequential MATLAB simulations and saves each simulation's output to a file in a directory on computer #1. The script runs on computers #2, #3, and #4, which have the C: drive of computer #1 mounted as drive K:, and those computers read and write K: drive files during the simulations. Before starting each simulation, the script saves a 'placeholder' version of that simulation's output file, which is later overwritten with the results once the simulation completes. The output filename is unique to each simulation. The script checks for the output file before starting a simulation; if the file is found, it moves on to the next simulation. The intent is to divide many simulations among the different computers. The directory on computer #1 has many files in it (~4000 files, 6 GB), and computer #1 is an old Windows XP machine. Computers #2-4 are also Windows machines and are 2+ years old.
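To make the scheme concrete, here is a minimal C sketch of the check-then-claim logic described above (the actual script is MATLAB; the path, filename, and function name below are invented for illustration). Note that checking for the file and then creating the placeholder are two separate steps rather than one atomic operation, which matches the scheme as described:

    #include <stdio.h>

    /* Returns 1 if this computer should run the simulation, 0 if another
     * computer has already claimed it (its output/placeholder file exists). */
    static int claim_simulation(const char *outfile)
    {
        FILE *f = fopen(outfile, "r");
        if (f) {                     /* output or placeholder already there */
            fclose(f);
            return 0;                /* skip to the next simulation */
        }
        f = fopen(outfile, "w");     /* create the placeholder on K: */
        if (!f)
            return 0;                /* e.g. the network share is unreachable */
        fputs("placeholder\n", f);
        fclose(f);
        return 1;                    /* run the simulation, then overwrite */
    }

    int main(void)
    {
        if (claim_simulation("K:/results/sim_0001.out"))
            puts("running simulation 0001");
        else
            puts("already claimed, moving on");
        return 0;
    }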
This scheme used to work fine, saving ~3 files per minute. Now it is taking ~15 minutes per file. What might be the leading cause for the slowdown? Could it be the number of files in the directory or the number of computers accessing computer #1? If that is unlikely, I would like to know so I can redirect my troubleshooting.
A large number of items in a single directory absolutely leads to decreased performance. Where the threshold lies depends on the OS, the filesystem, local vs. remote drives... maybe even the phase of the moon.
My personal rule of thumb is that at about 5,000 items per directory performance starts to degrade, and at about 10,000 performance has degraded enough that whatever you are doing will not work correctly anymore.
It turns out the problem was an old network switch that the various computers were plugged into. When we tried a newer switch, the script ran like lightning.
However, everyone's suggestions (subdirectories to reduce the number of files; defragmenting computer #1, which turned out to be badly fragmented) were very helpful, and it was great to have some other eyes on the problem, so thanks.
Related
We have a program building a 3D model from three files hosted on a Linux file server, basically x.bin, y.bin and z.bin. It builds the model one z level at a time, and it reads each file for every "slice".
On Linux machines running this program, the first slice takes around 45 seconds, and then ~2 seconds for every "slice" after that.
On Windows, the exact same program performing the exact same operation, running the exact same script and code, takes 5 minutes for the first slice and around a minute and a half for each slice after that.
Reading file over network slow due to extra reads
That thread seemed to have someone with a similar problem, but the truth is that I'm still unclear on how NFS can be so much faster, and on what change I could suggest to the actual developers to improve performance. The code is OS independent; I believe it just uses C's fread, fseek, etc. to read the file information over the network.
How does NFS transfer/read data that it can be 60x faster than samba?
How can I get that performance on samba?
I'm not 100% sure, as I don't know much about Samba, but my guess is that NFS supports seeking and can therefore just position at the next slice and return only that data, while Samba probably doesn't and has to return the full file from the server, discarding the "unused" content.
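To illustrate that guess, here is a rough C sketch of a seek-based slice read on the client side; the slice size and the assumption that slices are stored contiguously in each .bin file are invented, and whether the seek actually becomes a ranged network read depends on the protocol and its client implementation:

    #include <stdio.h>
    #include <stdlib.h>

    #define SLICE_BYTES (512 * 512 * sizeof(float))   /* assumed slice size */

    /* Read slice number z from one of the .bin files without pulling the
     * whole file over the network: seek to the slice offset, read only it. */
    static float *read_slice(const char *path, long z)
    {
        FILE *f = fopen(path, "rb");
        if (!f)
            return NULL;
        float *buf = malloc(SLICE_BYTES);
        if (!buf) {
            fclose(f);
            return NULL;
        }
        if (fseek(f, z * (long)SLICE_BYTES, SEEK_SET) != 0 ||
            fread(buf, 1, SLICE_BYTES, f) != SLICE_BYTES) {
            free(buf);               /* seek failure or short read */
            buf = NULL;
        }
        fclose(f);
        return buf;                  /* caller frees the slice */
    }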
By the way, it's not the exact same program you're running; you presumably recompile it for each platform, right? So it gets translated into a lot of different system calls, and each platform has different pros and cons...
Currently, I have a 4 GB SD card on which I have a journaling FS partition (ext3 or ext4). I am testing the journaling-recovery aspect of these filesystems, i.e. how well they fix any corruption on the SD card.
The SD card sits in a piece of hardware that simply boots Linux and then runs a copy.sh script I wrote.
I run a script that powers the machine on for 150 seconds and then hard powers it off for 30. This process is repeated for an extended period of time. On the target machine, copy.sh recursively copies a directory back and forth on the journaling FS, deleting the directory it just read from after each copy finishes. I keep track of how many times the directory is copied per boot.
I noticed something interesting in my results. At first, the directory may be copied successfully 20 times per boot, but after hours of running it only copies once or twice.
I was wondering why that is.
The trend is consistent for both ext3 and ext4. I've searched online, but haven't found an answer for why the number of writes would decrease over time.
Does this explanation of how SD cards work help? http://www.anandtech.com/show/2738/8 Read that page and the couple that follow. It explains how deletes and overwrites are handled within the SD card's memory chips themselves, and the implications for systems that don't implement the TRIM command.
I am a computer engineering student studying Linux kernel development. My 4-man team was tasked with proposing a kernel development project (to be implemented in 6 weeks), and we came up with a tentative "Self-Optimizing Hard Disk Drive Linux Kernel Module". I'm not sure whether that title makes sense to the pros.
We based the proposal on this project.
The goal of the project is to minimize hard disk access times. The plan is to create a special partition where the "most commonly used" files are to be placed. An LKM will profile, analyze, plan, and redirect I/O operations to the hard disk. This LKM should primarily be able to predict and redirect all file access (on files with sizes of < 10 MB) with minimal overhead, and lessen average read/write access times to the hard disk. I believe Apple's HFS has this feature.
Can anybody suggest a starting point? I recently found a way to redirect I/O operations by intercepting system calls (by hijacking all the read/write ones). However, I'm not convinced that this is the best way to go. Is there a way to write a driver that redirects these read/write operations? Can we perhaps tap into the read/write cache to achieve the same effect?
Any feedback at all is appreciated.
You may want to take a look at Unionfs. You don't even need an LKM, just a user-space daemon which would subscribe to inotify events, keep statistics, and migrate files between partitions. Unionfs will combine both partitions into a single logical filesystem.
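A minimal sketch of that user-space idea, assuming the files of interest live under a single directory passed on the command line; the statistics and the migration step are only hinted at in comments:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/inotify.h>

    int main(int argc, char **argv)
    {
        const char *dir = argc > 1 ? argv[1] : ".";
        int fd = inotify_init();
        if (fd < 0 || inotify_add_watch(fd, dir, IN_OPEN) < 0) {
            perror("inotify");
            return 1;
        }
        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);   /* blocks until events arrive */
            for (char *p = buf; n > 0 && p < buf + n; ) {
                struct inotify_event *ev = (struct inotify_event *)p;
                if (ev->len)
                    /* bump a per-file counter here; a real daemon would
                     * periodically migrate the hottest files to the fast
                     * partition that Unionfs merges with the slow one */
                    printf("opened: %s/%s\n", dir, ev->name);
                p += sizeof(*ev) + ev->len;
            }
        }
    }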
There are many ways in which such optimizations might be useful:
accessing file A implies that an access to file B is imminent. Example: a media player opening the icon file that goes with a media file
accessing any file in some group G of files means that the other files in the group will be accessed shortly. Example: MySQL receives a use somedb command, which implies that all of that database's files (tables, indexes, etc.) will be accessed.
a program which stops reading a sequential file suggests the program has stalled or exited, so predictions of future accesses associated with that file should be abandoned.
having multiple (yet transparent) copies of some frequently referenced files strategically sprinkled about lets the system use the copy nearest the disk heads. Example: uncached directories or small, frequently accessed settings files.
There are so many possibilities that I think at least 50% of an efficient solution would be a sensible, limited specification of which features you will attempt to implement and which you won't. It might also be valuable to study how the aggressive file-caching mechanism in Microsoft's Vista disappointed.
Another problem you might encounter with a modern Linux distribution is how much of what you plan to improve the system already does well. In fact, measuring the improvement might be a big challenge. I suggest writing a benchmark program which opens and reads a series of files and precisely times the complete sequence. Run it several times with your improvements enabled and disabled. But you'll have to reboot in between for valid timing, so the page cache doesn't distort the second run.
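A rough sketch of the kind of benchmark meant here, assuming the files to read are passed on the command line; it simply drains each file and reports the total elapsed time:

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buf[1 << 16];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 1; i < argc; i++) {
            int fd = open(argv[i], O_RDONLY);
            if (fd < 0)
                continue;                       /* skip unreadable files */
            while (read(fd, buf, sizeof buf) > 0)
                ;                               /* just drain the file */
            close(fd);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%d files read in %.3f s\n", argc - 1,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }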
Using VC++ in Visual Studio 2003.
I'm trying to copy several image files (30 KB or so per file) from another computer's shared folder to a local folder.
The problem is that a single transfer can involve 2000 or more files, and that seems to take its toll, making the copy take substantially more time to complete.
Is there an alternative method of copying files from another computer that could speed up the copy?
Thanks in advance.
EDIT: Due to a client request, it is not possible to change the code base dramatically. I hate having to deviate from best practice because of non-technical issues, but is there a more subtle approach, such as a different function call?
I know I'm asking for some magical voodoo; I'm asking just in case somebody knows of such a thing.
A few things to try:
Is copying the files using the OS (Explorer or the command line) any faster?
If not, then there may be some inherent limitation in your network or the way it's set up (authentication troubles, hardware issues or load on the remote server, a network card losing too many packets because of collisions, a faulty switch, bad wiring...).
Make some tests transferring files of various sizes.
Small files are always slower to transfer because there is a lot of overhead to fetch their details, then transfer the data, then create directory entries, etc.
If large files are fast, then your network is OK and you probably won't be able to improve the system much (the bottleneck is elsewhere).
Alternatively, from code, you could try opening each file, reading it into a large buffer in one go, and then saving it to the local drive. This may be faster because you'll be bypassing a lot of the checks the OS does internally.
You could even do this across a few threads, opening, loading and writing files concurrently, to speed things up a bit.
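For example, here is a rough single-threaded C sketch of that buffered read-then-write approach (the paths in the usage comment are placeholders; a real version would cap the buffer for very large files and report errors properly):

    #include <stdio.h>
    #include <stdlib.h>

    /* Copy one file by reading it into a single large buffer in one go and
     * writing it out in one go, instead of many small round trips to the
     * network share. Returns 0 on success, -1 on failure. */
    static int copy_whole_file(const char *src, const char *dst)
    {
        FILE *in = fopen(src, "rb");
        if (!in)
            return -1;
        fseek(in, 0, SEEK_END);
        long size = ftell(in);           /* fine for files well under 2 GB */
        rewind(in);
        if (size < 0) {
            fclose(in);
            return -1;
        }

        char *buf = malloc(size ? (size_t)size : 1);
        if (!buf || fread(buf, 1, (size_t)size, in) != (size_t)size) {
            free(buf);
            fclose(in);
            return -1;
        }
        fclose(in);

        FILE *out = fopen(dst, "wb");
        int ok = out && fwrite(buf, 1, (size_t)size, out) == (size_t)size;
        if (out)
            fclose(out);
        free(buf);
        return ok ? 0 : -1;
    }

    /* Usage, with made-up paths:
     *   copy_whole_file("\\\\server\\share\\img0001.jpg", "C:\\local\\img0001.jpg");
     */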
A couple of references you can check for multi-threaded file copy:
MTCopy: A Multi-threaded Single/Multi file copying tool on CodeProject
Good parallel/multi-thread file copy util? discussion thread on Channel 9.
McTool, a command-line tool for parallel file copy.
If implementing this yourself in code is too much trouble, you could always simply execute a utility like McTool in the background of your application and let it do the work for you.
Well, for a start, 2000 is not "several". If most of the time is going into sending lots of small files, then the solution is to package them at the source into a single file and unpackage them at the destination. This will require some code running at the source; you'll have to design your solution to allow that, since I assume at the moment you're just copying from a network share.
If it's the network speed (unlikely), compress them as well.
My own belief is that it's the number of files, basically all the repeated per-file startup costs of a copy. That's because 2000 files of 30 KB is only about 60 MB, and on a 10 Mb/s link the theoretical minimum time would be about a minute (60 MB is 480 Mb, so roughly 48 seconds at 10 Mb/s).
If your times are substantially above that, then I'd say I'm right.
A solution that uses 7-Zip or similar to compress them all into a single .7z file, transmits it, and then unpacks it at the other end sounds like what you're looking for.
But measure, don't guess! Test it out to see if it improves performance. Then make a decision.
I want to move about 800 GB of data from an NTFS storage device to a FAT32 device (both are external hard drives), on a Windows system.
What is the best way to achieve this?
Simply using cut and paste?
Using the command prompt (move)?
Writing a batch file that copies small chunks of data at a given interval?
Using some specific application that does the job for me?
Or any better idea?
What is the safest, most efficient and fastest way to carry out such a time-consuming process?
Robocopy
You can restart the command and it'll resume. I use it all the time over the network. Works on large files as well.
I would physically move the hard disk if possible.
I've found FastCopy to be quite good for this sort of thing. It's a GUI tool:
http://www.ipmsg.org/tools/fastcopy.html.en
If you have to move it over a network, you want to use FTP between the servers. Windows file sharing will get bogged down by its chatty protocol.
I've found TeraCopy to be pretty fast and handy. Allegedly FastCopy (as suggested by benlumley) is even faster, but I don't have any experience with it.
Try using WinRAR or another zipping tool. A few big files move faster than lots of small ones.
Most zipping tools let you split the archive (zip) into multiple volumes.
You might even reduce the size a bit if you turn on compression.
Command Line: xcopy is probably your best bet
Command Reference:
http://www.computerhope.com/xcopyhlp.htm
I used TeraCopy and copied 50+ GB to a 128 GB flash drive. It took almost 48 hours... and I had to do it twice because of a power hiccup. I had to re-format and start over. Not my favorite thing to do...
One of the fastest ways to copy files is to use robocopy, as pointed out by Pyrolistical in the post above. It's very flexible and powerful.
If the command doesn't work directly from your command prompt, try running it through PowerShell, as in the example below.
Be sure to check the documentation for this command before using it: robocopy /?
powershell "robocopy 'Source' 'destination' /E /R:3 /W:10 /FP /MT:25 /V"
/E - copy subdirectories, including empty ones.
/R:3 - retry 3 times on failure.
/W:10 - wait 10 seconds between retries.
/FP - include the full path name in the output.
/MT:25 - multi-threaded copy, using 25 threads.
/V - verbose output.
I wanted to comment on a comment about multithreading from #hello_earth (2015-10-13 11:24), but I don't have enough reputation points on Stack Overflow (I've mostly posted on Super User up until now):
Multithreading is typically not efficient when copying files from one storage device to another, because the highest throughput comes from sequential reads, and using multiple threads makes an HDD rattle and grind like crazy as it reads or writes several files at the same time. Since an HDD can only access one location at a time, it must read or write a chunk of one file, then move the heads to a chunk of another file located in a different area, which slows the process considerably (I don't know how an SSD would behave in such a case). It is both inefficient and potentially harmful: the mechanical stress is considerably higher when the heads move repeatedly across the platters to reach several areas in short succession than when they stay in the same spot to read a large contiguous file.
I discovered this when batch-checking the MD5 checksums of a very large folder of video files with md5deep: with the default options the analysis was multithreaded (8 threads on an i7 6700K CPU), and it was excruciatingly slow. Then I added the -j1 option, meaning 1 thread, and it proceeded much faster, since the files were now read sequentially.
Another consideration that derives from this is that the transfer speed will be significantly higher if the files are not fragmented, and also, more marginally, if they are located at the beginning of a hard disk drive, corresponding to the outermost parts of the platters, where the linear velocity is at its maximum (that aspect is irrelevant for a solid state drive or other flash-memory-based device).
Also, the original poster wanted "the most safe, efficient and fast way to achieve such a time consuming process". I'd say that one has to choose a compromise favoring either speed/efficiency or safety. If you want safety, you have to check that each file was copied flawlessly (by comparing MD5 checksums, or with something like WinMerge); if you don't do that, you can never be 100% sure that there weren't some SNAFUs in the process (hardware or software issues); if you do, you have to spend twice as much time on the task.
For instance: I relied on a little tool called SynchronizeIt! for my file copying, because it has the huge advantage over most similar tools of preserving all timestamps (including directory timestamps, like Robocopy does with the /DCOPY:T switch), and it has a streamlined interface with just the options I need. But I discovered that some files were always corrupted after a copy, truncated after exactly 25000 bytes (so the copy of a 1 GB video, for instance, had 25000 good bytes followed by 1 GB of zeroes; the copy process was abnormally fast, taking only a split second, which is what triggered my suspicion in the first place).

I reported this issue to the author a first time in 2010, but he chalked it up to a hardware malfunction and didn't think twice about it. I still used SI, but started to check files thoroughly after every copy (with WinMerge or Total Commander); when files ended up corrupted I used Robocopy instead (files which were corrupted by SynchronizeIt, when copied with Robocopy and then copied again with SynchronizeIt, came out flawless, so there was something in the way they were recorded on the NTFS partition which confused that software, and which Robocopy somehow fixed).

Then in 2015 I reported it again, after having identified a pattern in which files got corrupted: they had all been downloaded with particular download managers. That time the author did some digging and found the explanation: his tool had trouble copying files with the little-known "sparse" attribute, which some download managers set to save space when downloading files in multiple chunks. He provided me with an updated version which correctly copies sparse files, but hasn't released it on his website (the currently available version is 3.5 from 2009; the version I now use is a 3.6 beta from October 2015). So if you want to try that otherwise excellent software, be aware of that bug, and whenever you copy important files, thoroughly verify that each copy is identical to the source (using a different tool) before deleting the originals.