How to keep two folders automatically synchronized? - bash

I would like to have a synchronized copy of one folder with all its subtree.
It should work automatically in this way: whenever I create, modify, or delete stuff from the original folder those changes should be automatically applied to the sync-folder.
Which is the best approach to this task?
BTW: I'm on Ubuntu 12.04
Final goal is to have a separated real-time backup copy, without the use of symlinks or mount.
I used Ubuntu One to synchronize data between my computers, and after a while something went wrong and all my data was lost during a synchronization.
So I thought to add a step further to keep a backup copy of my data:
I keep my data stored on a "folder A"
I need the answer of my current question to create a one-way sync of "folder A" to "folder B" (cron a script with rsync? could be?). I need it to be one-way only from A to B any changes to B must not be applied to A.
The I simply keep synchronized "folder B" with Ubuntu One
In this manner any change in A will be appled to B, which will be detected from U1 and synchronized to the cloud. If anything goes wrong and U1 delete my data on B, I always have them on A.
Inspired by lanzz's comments, another idea could be to run rsync at startup to backup the content of a folder under Ubuntu One, and start Ubuntu One only after rsync is completed.
What do you think about that?
How to know when rsync ends?

You can use inotifywait (with the modify,create,delete,move flags enabled) and rsync.
while inotifywait -r -e modify,create,delete,move /directory; do
rsync -avz /directory /target
done
If you don't have inotifywait on your system, run sudo apt-get install inotify-tools

You need something like this:
https://github.com/axkibe/lsyncd
It is a tool which combines rsync and inotify - the former is a tool that mirrors, with the correct options set, a directory to the last bit. The latter tells the kernel to notify a program of changes to a directory ot file.
It says:
It aggregates and combines events for a few seconds and then spawns one (or more) process(es) to synchronize the changes.
But - according to Digital Ocean at https://www.digitalocean.com/community/tutorials/how-to-mirror-local-and-remote-directories-on-a-vps-with-lsyncd - it ought to be in the Ubuntu repository!
I have similar requirements, and this tool, which I have yet to try, seems suitable for the task.

Just simple modification of #silgon answer:
while true; do
inotifywait -r -e modify,create,delete /directory
rsync -avz /directory /target
done
(#silgon version sometimes crashes on Ubuntu 16 if you run it in cron)

Using the cross-platform fswatch and rsync:
fswatch -o /src | xargs -n1 -I{} rsync -a /src /dest

You can take advantage of fschange. It’s a Linux filesystem change notification. The source code is downloadable from the above link, you can compile it yourself. fschange can be used to keep track of file changes by reading data from a proc file (/proc/fschange). When data is written to a file, fschange reports the exact interval that has been modified instead of just saying that the file has been changed.
If you are looking for the more advanced solution, I would suggest checking Resilio Connect.
It is cross-platform, provides extended options for use and monitoring. Since it’s BitTorrent-based, it is faster than any other existing sync tool. It was written on their behalf.

I use this free program to synchronize local files and directories: https://github.com/Fitus/Zaloha.sh. The repository contains a simple demo as well.
The good point: It is a bash shell script (one file only). Not a black box like other programs. Documentation is there as well. Also, with some technical talents, you can "bend" and "integrate" it to create the final solution you like.

Related

Retrieving latest file in a directory from a remote server

I was hoping to crack this myself, but it seems I have fallen at the first hurdle because I can't make head nor tale of other options I've read about.
I wish to access a database file hosted as follows (i.e. the hhsuite_dbs is a folder containing several databases)
http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_08Oct15.tgz
Periodically, they update these databases, and so I want to download the lastest version. My plan is to run a bash script via cron, most likely monthly (though I've yet to even tackle the scheduling aspect of the task).
I believe the database is refreshed fortnightly, so if my script runs monthly I can expect there to be a new version. I'll then be running downstream programs that require the database.
My question is then, how do I go about retrieving this (and for a little more finesse I'd perhaps like to be able to check whether the remote file has changed in name or content to avoid a large download if unnecessary)? Is the best approach to query the name of the file, or the file property of date last modified (given that they may change the naming syntax of the file too?). To my naive brain, some kind of globbing of the pdb70 (something I think I can rely on to be in the filename) then pulled down with wget was all I had come up with so far.
EDIT Another confounding issue that has just occurred to me is that the file I want wont necessarily be the newest in the folder (as there are other types of databases there too), but rather, I need the newest version of, in this case, the pdb70 database.
Solutions I've looked at so far have mentioned weex, lftp, curlftpls but all of these seem to suggest logins/passwords for the server which I don't have/need if I was to just download it via the web. I've also seen mention of rsync, but of a cursory read it seems like people are steering clear of it for FTP uses.
Quite a few barriers in your way for this.
My first suggestion is that rather than getting the filename itself, you simply mirror the directory using wget, which should already be installed on your Ubuntu system, and let wget figure out what to download.
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
And new files will be created in the "safe" directory.
But that just gets you your mirror. You're still after is the "newest" file.
Luckily, wget sets the datestamp of files it downloads, if it can. So after mirroring, you might be able to do something like:
newestfile=$(ls -t /some/place/safe/pdb70*gz | head -1)
Note that this fails if ever there are newlines in the filename.
Another possibility might be to check the difference between the current file list and the last one. Something like this:
#!/bin/bash
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
rm index.html* *.gif # remove debris from mirroring an index
ls > /tmp/filelist.txt.$$
if [ -f /tmp/filelist.txt ]; then
echo "Difference since last check:"
diff /tmp/filelist.txt /tmp/filelist.txt.$$
fi
mv /tmp/filelist.txt.$$ /tmp/filelist.txt
You can parse the diff output (man diff for more options) to determine what file has been added.
Of course, with a solution like this, you could run your script every day and hopefully download a new update within a day of it being ready, rather than a fortnight later. Nice thing about --mirror is that it won't download files that are already on-hand.
Oh, and I haven't tested what I've written here. That's one monstrously large file.

Is there a way to move files from one set of directories to another set of corresponding directories

I take delivery of files from multiple places as part of a publishing aggregation service. I need a way to move files that have been delivered to me from one location to another without losing the directory listings for sorting purposes.
Example:
Filepath of delivery: Server/Vendor/To_Company/Customer_Name/**
Filepath of processing: ~/Desktop/MM-DD-YYYY/Returned_Files/Customer_Name/**
I know I can move all of the directories by doing something such as:
find Server/Vendor/To_Company/* -exec mv -n ~/Desktop/MM-DD-YYYY/Returned_Files \;
but using that I can only run the script one time per day and there are times when I might need to run it multiple times.
It seems like ideally I should be able to create a copycat directory in my daily processing folder and then move the files from one to the other.
you can use rsync command with --remove-source-files option. you can run it as many times as needed.
#for trial run, without making any actual transfer.
rsync --dry-run -rv --remove-source-files Server/Vendor/To_Company/ ~/Desktop/MM-DD-YYYY/Returned_Files/
#command
rsync -rv --remove-source-files Server/Vendor/To_Company/ ~/Desktop/MM-DD-YYYY/Returned_Files/
reference:
http://www.cyberciti.biz/faq/linux-unix-bsd-appleosx-rsync-delete-file-after-transfer/
You could use rsync to do this for you:
rsync -a --remove-source-files /Server/Vendor/To_Company/Customer_Name ~/Desktop/$(date +"%y-%m-%d")/Returned_files/
Add -n to do a dry run to make sure it does what you want.
From the manual page:
--remove-source-files
This tells rsync to remove from the sending side the files (meaning non-directories) that are a part of the
transfer and have been successfully duplicated on the receiving side.
Note that you should only use this option on source files that are quiescent. If you are using this to move
files that show up in a particular directory over to another host, make sure that the finished files get renamed
into the source directory, not directly written into it, so that rsync can’t possibly transfer a file that is
not yet fully written. If you can’t first write the files into a different directory, you should use a naming
idiom that lets rsync avoid transferring files that are not yet finished (e.g. name the file "foo.new" when it
is written, rename it to "foo" when it is done, and then use the option --exclude='*.new' for the rsync trans‐
fer).

Faster Alternatives to cp -l for a Whole File Structure?

Okay, so I need to create a copy of a file-structure, however the structure is huge (millions of files) and I'm looking for the fastest way to copy it.
I'm currently using cp -lR "$original" "$copy" however even this is extremely slow (takes several hours).
I'm wondering if there are any faster methods I can use? I know of rsync --link-dest but this isn't any quicker, but really I'd want it to be quicker as I want to create these snap-shots every hour or-so.
The alternative is copying only changes (which I can find quickly) into each folder then "flattening" them when I need to free up space (rsync newer folders into older ones until the last complete snapshot is reached), but I would really rather that each folder be its own complete snapshot.
Why are you discarding link-dest? I use a script with that option and take snapshots pretty often and the performance is pretty good.
In case you reconsider, here's the script I use: https://github.com/alvaroreig/varios/blob/master/incremental_backup.sh
If you have pax installed, you can use it. Think of it as tar or cpio, but standard, per POSIX.
#!/bin/sh
# Move "somedir/tree" to "$odir/tree".
itree=$1
odir=$2
base_tree=$(basename "$itree")
pax -rw "$itree" -s "#$itree#$base_tree#g" "$odir"
The -s replstr is an unfortunate necessity (you'll get $odir/$itree otherwise), but it works nicely, and it has been quicker than cp for large structures thus far.
tar is of course another option if you don't have pax as one person suggested already.
Depending on the files, you may achieve performance gains by compressing:
cd "$original && tar zcf - . | (cd "$copy" && tar zxf -)
This creates a tarball of the "original" directory, sends the data to stdout, then changes to the "copy" directory (which must exist) and untars the incoming stream.
For the extraction, you may want to watch the progress: tar zxvf -

Copying directories recursively using shell script

Should be an easy question for the gurus here, though it's hard to explain it in text so hopefully this is clear. I've got two directories on a box with some flavor of unix on it. I've got a script that I want to use to move all the files and directories from one location to another.
First, an example of how the directories look:
Directory A: final/results/2012/2012-02/2012-02-25/name/files
Directory B: test/results/2012/2012-02/2012-02-24/name/files
So you see they're very similar. What I want to do is move everything from the Directory B 2012 directory, recursively, to the same level of Directory A. So you'd end up with:
someproject/results/2012/2012-02/2012-02-25/name/files
someproject/results/2012/2012-02/2012-02-24/name/files
etc.
I want this script to be future proof though, meaning I don't want the 2012 hardcoded. Also, towards the end of a month you will potentially have data from two different months and both need to be copied into the 2012 directory. So here is the command I used in the shell script file:
CONS="/someproject";
ROOT="/test";
/bin/cp -r ${ROOT}/results/* ${CONS}/results/*
but this resulted in:
/final/results/2012/2012-02/2012-02-25/name/files
and
/final/results/2012/2012/2012-02/2012-02-24/name/files
So as I hope is clear, it started a level below where I wanted it too. Can anyone fill me in on what I'm doing wrong, if they can understand what I'm even trying to explain. My apologies if it's not clear. I'm sure this is a fairly simple fix but I'm not sure what to do. Shell scripting is not a strong point of mine.
One poster suggests rsync, which is overkill.
cp -rp will work fine. if you want to move the files, just mv the directory -- it and everything under it will move too.
The only real problem here is the use of terminating *'s in the command line in the original script. You don't need the *, you're just trying to pass directories to the cp command, you aren't trying to pass it the names of all the files already in the source (and more importantly, the destination).
You could also use a tool like rsync to make sure your source and target are synchronized.
rsync -av ${ROOT}/results/ ${CONS}/results/
You specified that you want to "move" the files, though. Which means deleting the originals after they're copied:
rsync -av --remove-source-files ${ROOT}/results/ ${CONS}/results/
If you start playing around with rsync, be sure to read the man page about how it treats trailing slashes.

Rsync bash script and hard linking files

I am creating a bash script to backup my files with rsync.
Backups all come from a single directory.
I only want new or modified files to be backed up.
Currently, I am telling rsync to backup the dir, and to check the files compared to the last backup.
The way I am doing this is
THE_TIME=`date "+%Y-%m-%dT%H:%M:%S"`
rsync -aP --link-dest=/Backup/Current /usr/home/user/backup /Backup/Backup-$THE_TIME
rm -f /Backup/Current
ln -s /Backup/Backup-$THE_TIME /Backup/Current
I am pretty sure I have the syntax correct for this. Each backup will check against the "Current" folder, and upload only as necesary. It will then delete the Current folder, and re-create the symlink to the newest backup it just did.
I am getting an error when I run the script:
rsync: link "/Backup/Backup-2010-08-04-12:21:15/dgs1200series_manual_310.pdf"
=> /Backup/Current/dgs1200series_manual_310.pdf
failed: Operation not supported (45)
The host OS is running HFS filesystem, which supports hard linking. I am trying to figure out if something else is not supporting this, or if I have a problem in my code.
Thanks for any help
Edit:
I am able to create a hard link on my local machine.
I am also able to create a hard link on the remote server (when logged in locally)
I am NOT able to create a hard link on the remote server when mounted via afp. Even if both files exist on the server.
I am guessing this is a limitation of afp.
Just in case your command line is only an example: Be sure to always specify the link-dest directory with an absolute pathname! That’s something which took me quite some time to figure out …
Two things from the man page stand out that are worth checking:
If file's aren't linking, double-check their attributes. Also
check if some attributes are getting forced outside of rsync's
control, such a mount option that squishes root to a single
user, or mounts a removable drive with generic ownership (such
as OS X's “Ignore ownership on this volume” option).
and
Note that rsync versions prior to 2.6.1 had a bug that could
prevent --link-dest from working properly for a non-super-user
when -o was specified (or implied by -a). You can work-around
this bug by avoiding the -o option when sending to an old rsync.
Do you have the "ignore ownership" option turned on? What version of rsync do you have?
Also, have you tried manually creating a similar hardlink using ln at the command line?
I don't know if this is the same issue, but I know that rsync can't sync a file when the destination is a FAT32 partition and the filename has a ":" (colon) in it. [The source filesystem is ext3, and the destination is FAT32]
Try reconfiguring the date command so that it doesn't use a colon and see if that makes a difference.
e.g.
THE_TIME=`date "+%Y-%m-%dT%H_%_%S"`

Resources