How to compare two directories, and if they're the EXACT SAME, delete the second - bash

I'm trying to set up an automatic backup on a Raspberry Pi system connected to an external hard drive.
Basically, I have shared folders and they're mounted via Samba on the Pi under
/mnt/Comp1
/mnt/Comp2
I will then have the external hard drive plugged in and mounted with two folders under
/media/external/Comp1
/media/external/Comp2
I will then run a recursive copy from /mnt/Comp1/* to /media/external/Comp1/ and the same with Comp2.
What I need help with is the end of the copies (there will be a total of 5 computers): I would like to verify that all the files transferred, and if they did and everything is on the external drive, then delete from the local machine automatically. I understand this is risky, because almost inevitably it will delete things that may not be backed up, but I need help knowing where to start.
I've found a lot of information on checking contents of a folder, and I know I can use the diff command, but I don't know how to use it in this pseudocode
use diff on directories /mnt/Comp1/ and /media/external/Comp1
if no differences, proceed to delete /mnt/Comp1/* recursively
if differences, preferably move the files not saved to /media/external/Comp1
repeat checking for differences, and deleting if necessary

Try something like:
diff -r -q d1/ d2/ >/dev/null 2>&1
Check the return value with $?. diff exits with 0 when the directories are identical and with 1 when they differ.
Remove d2 if the return value is 0 (no differences).
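Fleshed out against the question's layout, a minimal sketch might look like this (it assumes GNU cp for the -n no-clobber flag, uses copy-instead-of-move as a stand-in for the "move the files not saved" step, and should be dry-run tested before pointing rm -rf at real data):
#!/bin/bash
for name in Comp1 Comp2; do                   # extend the list for all 5 machines
    src=/mnt/$name
    dst=/media/external/$name

    if diff -rq "$src" "$dst" >/dev/null 2>&1; then
        # exit status 0: the trees are identical, safe to clear the local copy
        rm -rf "${src:?}"/*
    else
        # non-zero: something differs; copy anything missing across (no overwrite)
        # instead of deleting, then a later run of the check can do the cleanup
        cp -an "$src"/. "$dst"/
    fi
done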

Related

Is it possible to sync multiple clients over a central server using just rsync and POSIX shell scripting?

The scenario
I have a file server that acts as a master storage for the files to sync and I have several clients that have a local copy of the master storage. Each client may alter files from the master storage, add new ones or delete existing ones. I would like all of them to stay in sync as well as possible by regularly performing a sync operation, yet the only tool I have available everywhere for that is rsync and I can only run script code on the clients, not on the server.
The problem
rsync doesn't perform a bi-directional sync, so I have to sync from server to client as well as from client to server. This works okay for files that just changed by running two rsync operations but it fails when files have been added or deleted. If I don't use rsync with a delete option, clients cannot ever delete files as the sync from the server to the client restores them. If I use a delete option, then either the sync from server to client runs first and deletes all new files the client has added or the sync from client to server runs first and deletes all new files other clients have added to the server.
The question
Apparently rsync alone cannot handle that situation, since it is only supposed to bring one location in sync with another location. I surely need to write some code but I can only rely on POSIX shell scripting, which seems to make achieving my goals impossible. So can it even be done with rsync?
What is required for this scenario are three sync operations and awareness of which files the local client has added/deleted since the last sync. This awareness is essential and establishes a state, which rsync doesn't have, as rsync is stateless; when it runs it knows nothing about previous or future operations. And yes, it can be done with some simple POSIX scripting.
We will assume three variables are set:
metaDir is a directory where the client can persistently store files related to the sync operations; the content itself is not synced.
localDir is the local copy of the files to be synced.
remoteStorage is any valid rsync source/target (can be a mounted directory or an rsync protocol endpoint, with or without SSH tunneling).
After every successful sync, we create a file in the meta dir that lists all files in local dir; we need this to track files getting added or deleted in between two syncs. In case no such file exists, we have never run a successful sync. In that case we just sync all files from remote storage, build such a file, and we are done:
filesAfterLastSync="$metaDir/files_after_last_sync.txt"

if [ ! -f "$filesAfterLastSync" ]; then
    rsync -a "$remoteStorage/" "$localDir"
    ( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"
    exit 0
fi
Why ( cd "$localDir" && find . ) | sed "s/^\.//"? Files need to be rooted at $localDir for rsync later on. If a file $localDir/test.txt exists, the generated output file line must be /test.txt and nothing else. Without the cd and an absolute path for the find command, it would contain /..abspath../test.txt, and without the sed it would contain ./test.txt. Why the explicit sort call? See further down.
If that isn't our initial sync, we should create a temporary directory that automatically deletes itself when the script terminates, no matter how it exits:
tmpDir=$( mktemp -d )
trap 'rm -rf "$tmpDir"' EXIT
Then we create a file list of all files currently in local dir:
filesForThisSync="$tmpDir/files_for_this_sync.txt"
( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesForThisSync"
Now why is there that sort call? The reason is that I need the file list to be sorted below. Okay, but then why not tell find to sort the list? That's because find does not guarantee to sort the same way as sort does (that is explicitly documented on the man page) and I need exactly the order that sort produces.
Now we need to create two special file lists, one containing all files that were added since the last sync and one that contains all files that were deleted since the last sync. Doing so is a bit tricky with just POSIX but various possibilities exist. Here's one of them:
newFiles="$tmpDir/files_added_since_last_sync.txt"
join -t "" -v 2 "$filesAfterLastSync" "$filesForThisSync" > "$newFiles"
deletedFiles="$tmpDir/files_removed_since_last_sync.txt"
join -t "" -v 1 "$filesAfterLastSync" "$filesForThisSync" > "$deletedFiles"
By setting the delimiter to an empty string, join compares whole lines. Usually the output would contain all lines that exist in both files, but we instruct join to only output lines of one file that cannot be matched against lines of the other file. Lines that only exist in the second file must belong to files that have been added, and lines that only exist in the first file must belong to files that have been deleted. And that's why I use sort above, as join can only work correctly if the lines were sorted by sort.
Finally we perform three sync operations. First we sync all new files to the remote storage to ensure these are not getting lost when we start working with delete operations:
rsync -aum --files-from="$newFiles" "$localDir/" "$remoteStorage"
What is -aum? -a means archive, which means sync recursively, keep symbolic links, keep file permissions, keep timestamps, try to keep ownership and group, and some other things (it's a shortcut for -rlptgoD). -u means update, which means if a file already exists at the destination, only sync if the source file has a newer last modification date. -m means prune empty directories (you can leave it out if that isn't desired).
Next we sync from remote storage to local with deletion, to get all changes and file deletions performed by other clients, yet we exclude the files that have been deleted locally, as otherwise those would get restored, which we don't want:
rsync -aum --delete --exclude-from="$deletedFiles" "$remoteStorage/" "$localDir"
And finally we sync from local to remote storage with deletion, to update files that were changed locally and delete files that were deleted locally:
rsync -aum --delete "$localDir/" "$remoteStorage"
Some people might think that this is too complicated and it can be done with just two syncs. First sync remote to local with deletion and exclude all files that were either added or deleted locally (that way we also only need to produce a single special file, which is even easier to produce). Then sync local to remote with deletion and exclude nothing. Yet this approach is faulty. It requires a third sync to be correct.
Consider this case: Client A created FileX but hasn't synced yet. Client B also creates FileX a bit later and syncs at once. When client A now performs the two syncs above, FileX on remote storage is newer and should replace FileX on client A, but that won't happen. The first sync explicitly excludes FileX; it was added on client A and thus must be excluded so it isn't deleted by the first sync (client A cannot know that FileX was also added and uploaded to remote storage by client B). And the second sync would only upload to remote storage and skip FileX, as the remote one is newer. After the sync, client A has an outdated FileX, despite the fact that a newer one exists on remote storage.
To fix that, a third sync from remote to local without any exclusion is required. So you would also end up with three sync operations, and compared to the three I presented above, I think those are always at least as fast and sometimes faster, so I would prefer them; however, the choice is yours. Also, if you don't need to support that edge case, you can skip the last sync operation; the problem will then resolve itself automatically on the next sync.
Before the script quits, don't forget to update our file list for the next sync:
( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"
Finally, --delete implies --delete-before or --delete-during, depending on your version of rsync. You may prefer another, explicitly specified delete option.
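For reference, here are the pieces above strung together into one script skeleton. The only thing not shown earlier is the set -e at the top, added here so a failing rsync doesn't silently fall through to the following steps; metaDir, localDir and remoteStorage still have to be set beforehand:
#!/bin/sh
set -e

filesAfterLastSync="$metaDir/files_after_last_sync.txt"

if [ ! -f "$filesAfterLastSync" ]; then
    # initial sync: pull everything and record the resulting file list
    rsync -a "$remoteStorage/" "$localDir"
    ( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"
    exit 0
fi

tmpDir=$( mktemp -d )
trap 'rm -rf "$tmpDir"' EXIT

filesForThisSync="$tmpDir/files_for_this_sync.txt"
( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesForThisSync"

newFiles="$tmpDir/files_added_since_last_sync.txt"
join -t "" -v 2 "$filesAfterLastSync" "$filesForThisSync" > "$newFiles"
deletedFiles="$tmpDir/files_removed_since_last_sync.txt"
join -t "" -v 1 "$filesAfterLastSync" "$filesForThisSync" > "$deletedFiles"

rsync -aum --files-from="$newFiles" "$localDir/" "$remoteStorage"                    # push new local files first
rsync -aum --delete --exclude-from="$deletedFiles" "$remoteStorage/" "$localDir"     # pull remote changes and deletions
rsync -aum --delete "$localDir/" "$remoteStorage"                                    # push local changes and deletions

( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"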

Running program/macro to rename, add files to flash drive

I have a huge batch of flash drives that I need to move files onto. I'd also love to rename the drives (they're all called NO NAME by default). I'd love to plug two drives in, run a terminal script on the computer to accomplish all of that (most importantly the file moving). Then remove the drives, put the next two in, run it again, etc. until I'm done. All of the drives are identically named.
Is batch executing like this possible, and does anyone know how to go about doing it?
I figured it out. Put each one in and run this command to rename the drive and then move the files into it:
diskutil rename /Volumes/OLDNAME "NEWNAME" && cp -r ~/Desktop/sourceFolder/. /Volumes/NEWNAME
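If you'd rather handle both plugged-in drives in one go, a rough wrapper around that same command could look like the following. DRIVE$n and ~/Desktop/sourceFolder are just placeholders, and it assumes macOS mounts the identically named drives as "NO NAME", "NO NAME 1", and so on:
#!/bin/bash
n=1
for vol in "/Volumes/NO NAME"*; do
    [ -d "$vol" ] || continue                  # skip if no matching drive is mounted
    new="DRIVE$n"                              # placeholder naming scheme
    diskutil rename "$vol" "$new" && cp -r ~/Desktop/sourceFolder/. "/Volumes/$new"
    n=$((n + 1))
done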

dynamically scan accessed files, or modifed files with AV

I need to set up McAfee AV for Linux to either dynamically scan accessed files, or to perform daily scans on all modified files.
I know how to make a cron job, and to search for last modified files, but I can't find any documentation anywhere on how to do what I need to do, even from McAfee :(
The problem with scanning modified files is that I can't find any find options that will scan the files modified since the last scan date, only within a time-frame. If I set McAfee to scan modified files daily, and the machine is off for over a day, it won't see those files as being modified within 24 hours, and thus won't scan them. I also cannot figure out how to make McAfee scan a file when it is accessed. I assume I could possibly write a script that just launches a scan when any file is opened, but I am not sure how to do this either.
If possible, I'd like to use bash to do this, only because I haven't learned awk or perl yet. Any help or a point in the right direction would be appreciated. Thanks!
This works for me with ClamAV; replace clamscan with the equivalent command provided by McAfee. This loop looks for files in the /root directory that have been edited in the last 2 days and runs a virus scan on each of them:
find /root -type f -mtime -2 | while read -r i; do
    clamscan "$i"    # quoting keeps paths with spaces intact
done
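For the other part of the problem (scanning everything modified since the last scan rather than within a fixed window), one option is to touch a stamp file after each run and hand it to find -newer on the next one. A sketch, with an arbitrary stamp-file path and clamscan again standing in for the McAfee command:
#!/bin/bash
STAMP=/var/lib/avscan/last_scan              # arbitrary location for the stamp file
mkdir -p "$(dirname "$STAMP")"

if [ -f "$STAMP" ]; then
    # scan only files changed since the previous run, however long ago that was
    find /root -type f -newer "$STAMP" -exec clamscan {} \;
else
    # first run: scan everything
    find /root -type f -exec clamscan {} \;
fi

touch "$STAMP"                               # record this run for the next comparison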

Making checks before rsyncing external drive on OSX

I have the following issue on OSX though I guess this could equally be filed under bash. I have several encrypted portable drives that I use to sync an offsite data store or as an on-the-go data store etc. I keep these updated using rsync with several options including --del and an includes file.
This is currently done very statically i.e.
rsync <options> --include-from=... /Volumes /Volumes/PortableData
where the includes file would read something like
+ /Abc/
+ /Def/
...
- *
I would like to do the following:
Check the correct drive is mounted and find its mount-point
Check that all the + /...../ entries are mounted under /Volumes
rsync
To achieve 1 I was intending to store the UUIDs of the drives in variables in my profile so that I could search for them and find the relevant mount point: a bash function in .bashrc that takes a UUID and returns a mount point. I have seen some web entries for achieving this.
For 2 I am a little more stuck. What is the best way of retrieving only those entries that are both + and top-level folder designations in the include file, then iterating over them to check they are mounted and readable? Again, I'm thinking of trying to put some of this logic in functions for re-usability.
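Roughly, I'm picturing something like this (untested sketch: it assumes diskutil info accepts a volume UUID and prints a "Mount Point:" line, and that the include file only uses the + /Name/ form shown above):
# return the mount point for a volume UUID, empty output if not mounted
find_mount_point() {
    diskutil info "$1" 2>/dev/null | sed -n 's/^ *Mount Point: *//p'
}

# verify every top-level "+ /Name/" entry of the include file is mounted and readable under /Volumes
check_includes_mounted() {
    grep '^+ /' "$1" | sed 's|^+ /||; s|/$||' |
    while read -r d; do
        if [ ! -d "/Volumes/$d" ] || [ ! -r "/Volumes/$d" ]; then
            echo "not mounted or not readable: /Volumes/$d" >&2
            exit 1    # exits the pipeline subshell, so the function returns non-zero
        fi
    done
}
The script would then call find_mount_point with the stored UUID, bail out if it returns nothing, run check_includes_mounted against the include file, and only then start the rsync.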
Is there a better way of achieving this? I have thought of CCC, but like the idea of scripting in bash and using rsync as it is a good way of getting to know the command line.
rsync can read in a file that is a list of exclusions.
I would write a script that dumps to a text file the directories that are NOT + top-level folder designations in the include file.
You are going to want the exclusion file to look like this (you can use wildcards if it helps):
dirtoexclude1
dirtoexclude2
dirtoexclude
Then just direct an rsync to that exclusion file.
Your Rsync command will be something like this:
rsync -aP --exclude-from=rsyncexclusion.txt
-a is archive mode, which is essentially recursive (with some hand waving), and -P shows progress and keeps partially transferred files.
good luck.

Rsync bash script and hard linking files

I am creating a bash script to backup my files with rsync.
Backups all come from a single directory.
I only want new or modified files to be backed up.
Currently, I am telling rsync to back up the dir, and to check the files compared to the last backup.
The way I am doing this is
THE_TIME=`date "+%Y-%m-%dT%H:%M:%S"`
rsync -aP --link-dest=/Backup/Current /usr/home/user/backup /Backup/Backup-$THE_TIME
rm -f /Backup/Current
ln -s /Backup/Backup-$THE_TIME /Backup/Current
I am pretty sure I have the syntax correct for this. Each backup will check against the "Current" folder, and upload only as necessary. It will then delete the Current symlink and re-create it pointing at the newest backup it just did.
I am getting an error when I run the script:
rsync: link "/Backup/Backup-2010-08-04-12:21:15/dgs1200series_manual_310.pdf"
=> /Backup/Current/dgs1200series_manual_310.pdf
failed: Operation not supported (45)
The host OS is running HFS filesystem, which supports hard linking. I am trying to figure out if something else is not supporting this, or if I have a problem in my code.
Thanks for any help
Edit:
I am able to create a hard link on my local machine.
I am also able to create a hard link on the remote server (when logged in locally)
I am NOT able to create a hard link on the remote server when mounted via afp. Even if both files exist on the server.
I am guessing this is a limitation of afp.
Just in case your command line is only an example: Be sure to always specify the link-dest directory with an absolute pathname! That’s something which took me quite some time to figure out …
Two things from the man page stand out that are worth checking:
If files aren't linking, double-check their attributes. Also
check if some attributes are getting forced outside of rsync's
control, such as a mount option that squishes root to a single
user, or mounts a removable drive with generic ownership (such
as OS X's “Ignore ownership on this volume” option).
and
Note that rsync versions prior to 2.6.1 had a bug that could
prevent --link-dest from working properly for a non-super-user
when -o was specified (or implied by -a). You can work-around
this bug by avoiding the -o option when sending to an old rsync.
Do you have the "ignore ownership" option turned on? What version of rsync do you have?
Also, have you tried manually creating a similar hardlink using ln at the command line?
I don't know if this is the same issue, but I know that rsync can't sync a file when the destination is a FAT32 partition and the filename has a ":" (colon) in it. [The source filesystem is ext3, and the destination is FAT32]
Try reconfiguring the date command so that it doesn't use a colon and see if that makes a difference.
e.g.
THE_TIME=`date "+%Y-%m-%dT%H_%M_%S"`
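With that change, the sequence from the question would become something like:
THE_TIME=`date "+%Y-%m-%dT%H_%M_%S"`          # colon-free timestamp
rsync -aP --link-dest=/Backup/Current /usr/home/user/backup "/Backup/Backup-$THE_TIME"
rm -f /Backup/Current
ln -s "/Backup/Backup-$THE_TIME" /Backup/Current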
