Percentage Difference Between Remote Directories - bash

I am writing a script using rsync that synchronizes a remote directory with a local one without deleting any files. This is fairly simple, but I would like to be able to track progress in simple terms, without flooding a log file with the output of the --progress option in rsync.
To this end, I was wondering whether there is a way to easily (i.e. without consuming a lot of bandwidth) calculate the difference between two directories recursively. One directory is remote and the other is local. I am looking for the amount of data that exists in the remote directory but not in the local directory, keeping in mind that there may be data in the local directory that is not in the remote directory, which should not be factored into the difference. Ideally, it should say something like: Synchronizing /remote/dir -> /local/dir: 12MiB/24MiB (50%) Done
If anyone knows an easy way of accomplishing this with a bash script that would be really helpful.
As always, thanks in advance for your help.

Related

Recursive copy to a flat directory

I have a directory of images, currently at ~117k files for about 200 gig in size. My backup solution vomits on directories of that size, so I wish to split them into subdirectories of 1000. Name sorting or type discrimination is not required. I just want my backups to not go nuts.
From another answer, someone provided a way to move files into the split up configuration. However, that was a move, not a copy. Since this is a backup, I need a copy.
I have three thoughts:
1. Files are added to the large directory with random filenames, so alpha sorts aren't a practical way to figure out deltas. Even using a tool like rsync, adding a couple hundred files at the beginning of the list could cause a significant reshuffle and lots of file movement on the backup side.
2. The solution to this problem is to reverse the process: do an initial file split, add new files to the backup at the newest directory, manually create a new subdir at the 1000 file mark, and then use rsync to pull files from the backup directories to the work area, e.g. rsync -trvh <backupdir>/<subdir>/ <masterdir>.
3. While some answers to similar questions indicate that rsync is a poor choice for this, I may need to do multiple passes, one of which would be via a slower link to an offsite location. The performance hit of using rsync and its startup parsing is far preferable to the time that re-uploading the backup on a daily basis would take.
My question is:
How do I create a script that will recurse into all 117+ subdirectories and dump the contained files into my large working directory, without a lot of unnecessary copying?
My initial research produces something like this:
#!/bin/bash
cd /path/to/backup/tree/root
find . -type d -exec rsync -trvh * /path/to/work/dir/
Am I on the right track here?
It's safe to assume modern versions of bash, find, and rsync.
Thanks!

determine if file is complete

I am trying to write a video ruby transformer script (using ffmpeg) that depends on mov files being ftped to a server.
The problem I've run into is that when a large file is uploaded by a user, the watch script (using rb-inotify) attempts to execute (and run the transcoder) before the mov is completely uploaded.
I'm a complete noob. But I'm trying to discover if there is a way for me to be able to ensure my watch script doesn't run until the file(s) is/are completely uploaded.
My watch script is here:
watch_me = INotify::Notifier.new
watch_me.watch("/directory_to_my/videos", :close_write) do |directories|
load '/directory_to_my/videos/.transcoder.rb'
end
watch_me.run
Thank you for any help you can provide.
Just relying on inotify(7) to tell you when a file has been updated isn't a great fit for telling when an upload is 'complete' -- an FTP session might time out and be re-started, for example, allowing a user to upload a file in chunks over several days as connectivity is cheap or reliable or available. inotify(7) only ever sees file open, close, rename, and access, but never the higher-level event "I'm done modifying this file", as the user would understand it.
There are two mechanisms I can think of: one is to have uploads go initially into one directory and ask the user to move the file into another directory when the upload is complete. The other creates some file meta-data on the client and uses that to "know" when the upload is complete.
Move completed files manually
If your users upload into the directory ftp/incoming/temporary/, they can upload the file in as many connections as are required. Once the file is "complete", they can rename the file (rename ftp/incoming/temporary/hello.mov ftp/incoming/complete/hello.mov), and your rb-inotify interface looks for file renames in the ftp/incoming/complete/ directory and starts the ffmpeg(1) command.
Generate metadata
For a transfer to be "complete", you're really looking for two things:
The file is the same size on both systems.
The file is identical on both systems.
Since "identical" is otherwise difficult to check, most people content themselves with checking that the contents of the file, when run through a cryptographic hash function such as MD5 or SHA-1 (or better, SHA-224, SHA-256, SHA-384, or SHA-512), produce the same digest on both systems. MD5 is quite fine if you're guarding against incomplete transmission, but if you intend to use the output of the function for other purposes, using a stronger function would be wise.
MD5 is really tempting though, since tools to create and validate MD5 hashes are very widespread: md5sum(1) on most Linux systems, md5(1) on most BSD systems (including OS X).
$ md5sum /etc/passwd
c271aa0e11f560af419557ef49a27ac8 /etc/passwd
$ md5sum /etc/passwd > /tmp/sums
$ md5sum -c /tmp/sums
/etc/passwd: OK
The md5sum -c command asks the md5sum(1) program to check the file of hashes and filenames for correctness. It looks a little silly when used on just a single file, but when you've got dozens or hundreds of files, it's nice to let the software do the checking for you. For example: http://releases.mozilla.org/pub/mozilla.org/firefox/releases/3.0.19-real-real/MD5SUMS -- Mozilla has published such files with 860 entries -- checking them by hand would get tiring.
Because checking hashes can take a long time (five minutes on my system to check a high-definition hour-long video that wasn't recently used), it'd be a good idea to only check the hashes when the filesizes match. Modify your upload tool to send along some metadata about how long the file is and what its cryptographic hash is. When your rb-inotify script sees file close requests, check the file size, and if the sizes match, check the cryptographic hash. If the hashes match, then start your ffmpeg(1) command.
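A minimal bash sketch of the size-then-hash check described above. The function name upload_complete is hypothetical, and so is the assumption that the client's expected size and MD5 digest reach the server as plain arguments; how that metadata actually arrives depends on your upload tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when FILE has the expected byte count
# and the expected MD5 digest. The cheap size test runs first, so the
# expensive hash is computed only when the sizes already agree.
upload_complete() {
  local file=$1 want_size=$2 want_md5=$3
  local have_size have_md5

  [ -f "$file" ] || return 1

  # wc -c prints the byte count; strip the padding some implementations add.
  have_size=$(wc -c < "$file" | tr -d '[:space:]')
  [ "$have_size" = "$want_size" ] || return 1

  # md5sum prints "<digest>  <name>"; keep only the digest.
  have_md5=$(md5sum < "$file" | cut -d' ' -f1)
  [ "$have_md5" = "$want_md5" ]
}
```

A watcher could call upload_complete hello.mov "$size_from_client" "$md5_from_client" on each close event and start the transcoder only when it succeeds.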
It seems easier to upload the file to a temporary directory on the server and move it to the location your script is watching once the transfer is completed.

command line wisdom for 2 panel file manager user

Want to upgrade my file management productivity by replacing 2 panel file manager with command line (bash or cygwin). Can commandline give same speed? Please advise a guru way of how to do e.g. copy of some file in directory A to the directory B. Is it heavy use of pushd/popd? Or creation of links to most often used directories? What are the best practices and a day-to-day routine to manage files of a command line master?
Can commandline give same speed?
My experience is that commandline copying is significantly faster (especially in the Windows environment). Of course the basic laws of physics still apply, a file that is 1000 times bigger than a file that copies in 1 second will still take 1000 seconds to copy.
..(howto) copy of some file in directory A to the directory B.
Because I often have 5-10 projects that use similar directory structures, I set up variables for each subdir using a naming convention :
project=NewMatch
NM_scripts=${project}/scripts
NM_data=${project}/data
NM_logs=${project}/logs
NM_cfg=${project}/cfg
proj2=AlternateMatch
altM_scripts=${proj2}/scripts
altM_data=${proj2}/data
altM_logs=${proj2}/logs
altM_cfg=${proj2}/cfg
You can make this sort of thing as spartan or baroque as needed to match your theory of living/programming.
Then you can easily copy the cfg from 1 project to another
cp -p $NM_cfg/*.cfg ${altM_cfg}
Is it heavy use of pushd/popd?
Some people seem to really like that. You can try it and see what you think.
Or creation of links to most often used directories?
Links to dirs are, in my experience, used more for software development, where source code expects a certain set of dir names and your installation has different names. Then making links to supply the expected dir paths is helpful. For production data, it is just one more thing that can get messed up or blow up. That's not always true; maybe you'll have a really good reason to have links, but I wouldn't start out that way just because it is possible to do.
What are the best practices and a day-to-day routine to manage files of a command line master?
Per the above, use a standardized directory structure for all projects.
Have scripts save any small files to a directory your dept keeps under /tmp,
e.g. /tmp/MyDeptsTmpFile (named to fit your local conventions).
It depends. If you're talking about data and logfiles, dated fileNames can save you a lot of time. I recommend dateFmts like YYYYMMDD(_HHMMSS) if you need the extra resolution.
Dated logfiles are very handy, when a current process seems like it is taking a long time, you can look at the log file from a week ago and quantify exactly how long this process took, a week, month, 6 months (up to how much space you can afford). LogFiles should also capture all STDERR messages, so you never have to re-run a bombed program just to see what the error message was.
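As a rough illustration of the dated-logfile advice, here is a small bash sketch. The function name run_logged and its argument order are assumptions made for this example, not an existing tool; it uses the YYYYMMDD_HHMMSS format suggested above and captures STDERR in the same file.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command with stdout and stderr captured in
# LOGDIR/<name>_YYYYMMDD_HHMMSS.log, then print the path of that log.
run_logged() {
  local logdir=$1 name=$2
  shift 2
  mkdir -p "$logdir" || return 1

  local log
  log="$logdir/${name}_$(date +%Y%m%d_%H%M%S).log"

  # 2>&1 sends error messages to the same file as normal output.
  "$@" > "$log" 2>&1
  local status=$?

  printf '%s\n' "$log"
  return "$status"
}
```

For example, run_logged "$NM_logs" nightly_load ./load_data.sh would leave one log per run, where ./load_data.sh stands in for whatever job you actually run.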
This is Linux/Unix you're using, right? Read the man page for the cp command installed on your machine. I recommend using an alias like alias CP='/bin/cp -pi' so you always copy a file with the same permissions and with the original file's timestamp. Then it is easy to use /bin/ls -ltr to see a sorted list of files with the most recent files showing up at the bottom of the list. (No need to scroll back to the top when you sort by time, reversed.) Also, the '-i' option will warn you that you are going to overwrite a file, and this has saved me more than a couple of times.
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.

How to programmatically find the difference between two directories

First off; I am not necessarily looking for Delphi code, spit it out any way you want.
I've been searching around (especially here) and found a bit about people looking for ways to compare two directories (including subdirs), though they were using byte-by-byte methods. Second off, I am not looking for a difftool; I am "just" looking for a way to find files which do not match and, just as important, files which are in one directory but not the other and vice versa.
To be more specific: I have one directory (the backup folder) which I constantly update using FindFirstChangeNotification. The first time, though, I need to copy all files, and I also need to check the backup directory against the original when the application starts (in case something happened when the application wasn't running or FindFirstChangeNotification didn't catch a file change). To solve this I am thinking of creating a CRC list for the backed up files and then running through the original directory computing the CRC for every file and finally comparing the two CRCs. Then somehow look for files which are in one directory and not the other (again; vice versa).
Here's the question: Is this the fastest way? If so, how would one (roughly) get the job done?
You don't necessarily need CRCs for each file, you can just compare the "last modified" date for every file for most normal purposes. It's WAY faster. If you need additional safety, you can also compare the lengths. You get both of these metrics for free with the find functions.
And in your change notification, you should probably add the files to a queue and use a timer object to copy the new queued files every ~30sec or something, so you don't bog down the system with frequent updates/checks.
For additional speed, use the Win32 functions wherever possible, avoid any Delphi find/copy/getfileinfo functions. I'm not familiar with the Delphi framework but for example the C# stuff is WAY WAY WAY slower than the Win32 functions.
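To make the metadata-only comparison concrete, here is a sketch in bash rather than Delphi or Win32, assuming a Unix-like shell (Cygwin would do) with GNU find for -printf. The function names manifest and needs_copy are made up for this example, and file names containing tabs or newlines are not handled.

```shell
#!/usr/bin/env bash
# Print "relative/path<TAB>size<TAB>mtime" for every regular file under DIR.
# Assumes GNU find (for -printf). Only metadata is read, never file contents.
manifest() {
  (cd "$1" && find . -type f -printf '%P\t%s\t%T@\n' | LC_ALL=C sort)
}

# List files in SRC that are missing from BACKUP, or whose size or
# modification time differs there. Files that exist only in BACKUP are
# ignored; swap the arguments to find those.
needs_copy() {
  LC_ALL=C comm -23 <(manifest "$1") <(manifest "$2") | cut -f1
}
```

Running needs_copy /path/to/original /path/to/backup prints one relative path per line, which could feed the copy queue mentioned above.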
Regardless of your "not looking for a difftool", are you opposed to using Cygwin with its "diff" command for the shell? If you are open to this, it's quite easy, particularly using diff with the -r "recursive" option.
The following generates the differences between 2 Rails installs on my machine, and greps out not only information about differences between files but also, specifically by grepping for 'Only', finds files in one directory, but not the other:
$ diff -r pgnindex pgnonrails | egrep '^Only|diff'
Only in pgnindex/app/controllers: openings_controller.rb
Only in pgnindex/app/helpers: openings_helper.rb
Only in pgnindex/app/views: openings
diff -r pgnindex/config/environment.rb pgnonrails/config/environment.rb
diff -r pgnindex/config/initializers/session_store.rb pgnonrails/config/initializers/session_store.rb
diff -r pgnindex/log/development.log pgnonrails/log/development.log
Only in pgnindex/test/functional: openings_controller_test.rb
Only in pgnindex/test/unit: helpers
The fastest way to compare one directory on the local machine to a directory on another machine thousands of miles away is exactly as you propose:
generate a CRC/checksum for every file
send the name, path, and CRC/checksum for each file over the internet to the other machine
compare
Perhaps the easiest way to do that is to use rsync with the "--dry-run" or "--list-only" option.
(Or use one of the many applications that use the rsync algorithm,
or compile the rsync algorithm into your application).
cd some_backup_directory
rsync --dry-run -av myname@remote_host:latest_version_directory/ .
For speed, the default rsync assumes, as Blindy suggested, that two files with the same name and the same path and the same length and the same modification time are the same.
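For extra safety, you can give rsync the "--checksum" option, which replaces the modification-time check with a comparison of (the checksum of) the actual contents of every file whose size matches. Files whose sizes differ are still treated as different without being checksummed.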
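The dry-run approach can also give a rough "how much is already here" figure. The sketch below is one possible way, not a definitive one: it reads the text printed by rsync --dry-run --stats and assumes the "Total file size:" and "Total transferred file size:" lines in the form rsync 3.x prints them by default. The function name sync_summary is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `rsync --dry-run --stats` output on stdin and
# print "<bytes already local>/<total bytes> bytes already local (<pct>%)".
# Digit-grouping commas in the numbers are stripped before the arithmetic.
sync_summary() {
  awk '
    /^Total file size:/             { gsub(/[^0-9]/, "", $4); total = $4 + 0; seen = 1 }
    /^Total transferred file size:/ { gsub(/[^0-9]/, "", $5); pending = $5 + 0 }
    END {
      if (!seen) { print "no rsync stats found" > "/dev/stderr"; exit 1 }
      have = total - pending
      pct = (total > 0) ? int(have * 100 / total) : 100
      printf "%.0f/%.0f bytes already local (%d%%)\n", have, total, pct
    }'
}
```

It would be used as rsync -an --stats myname@remote_host:latest_version_directory/ . | sync_summary. Because rsync only lists the sender's files, anything that exists only in the local directory does not enter the totals. Passing -h to rsync changes how the numbers are formatted, so this sketch assumes it is not given.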

Is file still being uploaded?

I have an app that I'm writing that takes files in a specific directory that have been uploaded via SFTP and moves them to S3.
I have a problem where my cron job starts uploading a file when it's not completely uploaded. I have thought of every way to try and wait until the file is complete, but I have no way of knowing (that I know of).
I'm hoping that the collective genius of SO would be able to shed some light on this!
There are a number of ways to handle this:
1. Change the upload process to upload the data file itself (e.g., data.txt) followed by a sentinel file (e.g., data.txt.sentinel). Then wait for the sentinel before processing the data file and deleting them both. Data files older than N days with no corresponding sentinel - just delete them. This is only good if you can change the uploader.
2. If you can evaluate the content of the file to check completeness, this is another way. For example, if you're only uploading HTML files, you could check that it ends with </html>. Not always possible unless you can control what's being uploaded.
3. The not-been-modified-for-a-while method. Basically, if the file hasn't been modified for N minutes, you can assume the upload has been finished. This may still result in the processing of incomplete files where the transfer has failed partway through.
All these methods have their advantages and drawbacks and you will have to decide which is the best for you. We try to opt for number 1 where we can influence the uploading side.
And remember that N is configurable in the above scenarios. You need to balance the possibility that a too-small N will result in you processing an incomplete file in option 3 but too large a value of N will delay the processing of said file.
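The sentinel check and the not-modified-for-N-minutes check can each be sketched in a few lines of bash. The helper names quiet_for and has_sentinel are hypothetical, and the -maxdepth and -mmin tests assume a find that supports them (GNU and BSD find do).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when FILE is a regular file that was last
# modified more than MINUTES minutes ago (the "not modified for a while" test).
quiet_for() {
  local file=$1 minutes=$2
  # find prints the path only if every test passes, so non-empty output
  # means the file is old enough.
  [ -n "$(find "$file" -maxdepth 0 -type f -mmin +"$minutes" 2>/dev/null)" ]
}

# Hypothetical helper: succeed when the sentinel for FILE has arrived,
# using the data.txt / data.txt.sentinel naming from the list above.
has_sentinel() {
  [ -f "$1.sentinel" ]
}
```

A cron job might process a file when has_sentinel "$f" succeeds, or fall back to quiet_for "$f" "$N" when the uploader cannot be changed.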
Is there any way you can add a step after the SFTP transfer? The idea is to SFTP the files to a temporary directory, then once that's done have the same client execute (via SSH) a script to mv the files over to the directory the cron job is looking at. mv is atomic when the source and destination are on the same filesystem (it is a single rename), so the cron job will only ever see either the old file or the complete new one.
Of course, if you can execute a script after the SFTP transfer you can just have the script do the transfer to S3, without the cron job ;)
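The script run after the transfer could be as small as the following sketch. The function name publish_upload and the directory layout are assumptions for illustration; the important part is that the staging directory and the watched directory sit on the same filesystem so that mv is a rename.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move a finished upload from the staging directory
# into the directory the cron job watches. On the same filesystem mv is a
# rename, so the watcher never sees a partially written file.
publish_upload() {
  local staged=$1 watched_dir=$2
  [ -f "$staged" ] || return 1
  mv -- "$staged" "$watched_dir/"
}
```

The client would invoke this over SSH once its SFTP session has finished, for example with the staged file's path and the watched directory as the two arguments.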
We are using pure-ftpd for a very similar process. Rather than having a cron job do the uploads, we use the upload script option of pure-ftpd, which triggers a script every time an upload is complete. You might consider using a similar mechanism if it is available with your FTP server.

Resources