How can I prioritize one rsync process over another? - bash

I have two separate cronjobs running on a Red Hat instance. Both are copying log files from a remote server to the instance via rsync. The first cronjob runs rsync-new.sh, which copies any new log files (from today or yesterday) from various directories on the server. The second cronjob runs rsync-backfill.sh, which copies any log files older than yesterday. I separated the rsync processes so that the new files will always be copied quickly, and a large backfill job won't interfere with the copying of new files.
This generally works, except for the following case: if rsync-backfill.sh is already copying the old files from a folder, rsync-new.sh won't copy its files from that folder until rsync-backfill.sh has finished with it.
Is there any way to prioritize the rsync command from rsync-new.sh over the rsync command from rsync-backfill.sh? Or to at least let the rsync commands run in parallel so that the new files are always copied quickly?
Here's the general script structure:
rsync-new.sh
for SUBDIR in $(ls $SOURCEDIR)
do
rsync -zt \
--exclude-from=$TRACKERFILE \
--out-format="%n" \
$SOURCEDIR/$SUBDIR/log-$TODAY*.log $DESTDIR/ | tee -a $TRACKERFILE
done
for SUBDIR in $(ls $SOURCEDIR)
do
rsync -zt \
--exclude-from=$TRACKERFILE \
--out-format="%n" \
$SOURCEDIR/$SUBDIR/log-$YESTERDAY*.log $DESTDIR/ | tee -a $TRACKERFILE
done
rsync-backfill.sh
for SUBDIR in $(ls $SOURCEDIR)
do
rsync -zt \
--exclude-from=$TRACKERFILE \
--exclude="log-$TODAY*.log" \
--exclude="log-$YESTERDAY*.log" \
--out-format="%n" \
$SOURCEDIR/$SUBDIR/log-*.log $DESTDIR/ | tee -a $TRACKERFILE
done

This is a non-issue; it turns out it was just a coincidence that the new log files for one of the directories weren't syncing until after the backfill was complete (the new log files just happened to be much larger than the backfill files).
Cron runs jobs in isolated environments, so the rsync processes weren't interacting with each other.
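For reference, if the backfill job ever did need to yield to the new-file job, one common approach (a sketch, not part of the original scripts) is to run the backfill rsync under nice and ionice so it loses CPU and disk contention to the higher-priority copy:
# hypothetical change inside rsync-backfill.sh: lowest CPU priority, lowest best-effort I/O priority
ionice -c2 -n7 nice -n19 rsync -zt \
    --exclude-from=$TRACKERFILE \
    --exclude="log-$TODAY*.log" \
    --exclude="log-$YESTERDAY*.log" \
    --out-format="%n" \
    $SOURCEDIR/$SUBDIR/log-*.log $DESTDIR/ | tee -a $TRACKERFILE
This only deprioritizes the local side; if network bandwidth is the real contention point, rsync's --bwlimit option can additionally cap the backfill's transfer rate.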

Related

hash method to verify integrity of dir vs dir.tar.gz

I'm working on a Python script that verifies the integrity of some downloaded projects.
On my NAS, I have all my compressed folders: folder1.tar.gz, folder2.tar.gz, …
On my Linux computer, I have the equivalent uncompressed folders: folder1, folder2, …
So, I want to compare the integrity of my files without untarring or downloading anything!
I think I can do it on the NAS with something like (using md5sum):
sshpass -p 'password' ssh login@my.nas.ip tar -xvf /path/to/my/folder.tar.gz | md5sum | awk '{ print $1 }'
This gives me a hash, but I don't know how to get an equivalent hash to compare it against for the normal folder on my computer. Maybe the way I am doing it is wrong.
I need one command for the NAS and one for the Linux computer that output the same hash (if the folders are the same, of course).
If you did that, tar xf would actually extract the files. md5sum would only see the file listing, and not the file content.
However, if you have GNU tar on the server and the standard utility paste, you could create checksums this way:
mksums:
#!/bin/bash
data=/path/to/data.tar.gz
sums=/path/to/data.md5
paste \
<(tar xzf "$data" --to-command=md5sum) \
<(tar tzf "$data" | grep -v '/$') \
| sed 's/-\t//' > "$sums"
Run mksums above on the machine with the tar file.
Copy the sums file it creates to the computer with the folders and run:
cd /top/level/matching/tar/contents
md5sum -c "$sums"
paste joins lines of files given as arguments
<(...) runs a command, making its output appear in a fifo
--to-command is a GNU tar extension which allows running commands which will receive their data from stdin
grep filters out directories from the tar listing
sed removes the extraneous -\t so the checksum file can be understood by md5sum
The above assumes you don't have any very-oddly named files (for example, the names can't contain newlines)
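If copying the sums file around by hand is a nuisance, a hedged variant of the same idea (untested; it assumes GNU tar and sed on the NAS as above, and that the remote login shell is bash, since <(...) is a bashism) generates the list over ssh and checks it in one pipeline:
cd /top/level/matching/tar/contents
ssh login@my.nas.ip 'paste <(tar xzf /path/to/data.tar.gz --to-command=md5sum) <(tar tzf /path/to/data.tar.gz | grep -v "/$") | sed "s/-\t//"' | md5sum -c -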

docker mkdir won't create a directory

I am trying to run a bash script which should load data into Jena. This script comes from a GitHub repository and was allegedly working on the owner's machine, but on mine it won't run, even though I followed the instructions. So let me first describe what the script does based on my understanding: it should load .nt (RDF) data into Jena using Docker, specifically the stain/jena image. Here is the script:
#!/bin/bash
files=$(echo $(pwd)/rawdata-bearb/hour/alldata.IC.nt/*.nt | sed "s%$(pwd)/rawdata-bearb/hour/alldata.IC.nt%/var/data/in%g")
mkdir output # added
for file in $files; do
v=$(echo $file | sed "s/^.*\/\([0-9][0-9]*\)\.nt$/\1-1/" | bc)
echo "$v"
mkdir -p /var/data/out/ic/$v
time docker run \
-it \
--rm \
-v $(pwd)/tdb-bearb-hour/:/var/data/out/ \
-v $(pwd)/rawdata-bearb/hour/alldata.IC.nt/:/var/data/in/ \
stain/jena /jena/bin/tdbloader2 \
--sort-args "-S=16G" \
--loc /var/data/out/ic/$v $file \
> output/load-bearb-hour-ic-$v-.txt
done
However, when I execute the script, I get following message from the saved log file:
13:12:46 INFO -- TDB Bulk Loader Start
mkdir: cannot create directory ‘/var/data/out/ic/0’: No such file or directory
13:12:46 ERROR Failed during data phase
According to the tdbloader2 manual, the --loc parameter should create the directory if it does not exist:
-- loc: Sets the location in which the database should be created.
This location must be a directory and must be empty,
if a non-existent path is specified it will be created as a new directory.
I created the directories /var/data/out/ic/0 through /var/data/out/ic/10 manually and re-executed the script. Still, I got the same error message. My first guess was that tdbloader2 or docker uses the mkdir command without the -p parameter, but since I created the directories manually (so they existed before the execution) and still got the same error, it must be something else. I am kindly asking for your help.

How can I pipe a tar compression operation to aws s3 cp?

I'm writing a custom backup script in bash for personal use. The goal is to compress the contents of a directory via tar/gzip, split the compressed archive, then upload the parts to AWS S3.
On my first try writing this script a few months ago, I was able to get it working via something like:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 - /mnt/SCRATCH/backup.tgz.part
aws s3 sync /mnt/SCRATCH/ s3://backups/ --delete
rm /mnt/SCRATCH/*
This worked well for my purposes, but required /mnt/SCRATCH to have enough disk space to store the compressed directory. Now I wanted to improve this script to not have to rely on having enough space in /mnt/SCRATCH, and did some research. I ended up with something like:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter "aws s3 cp - s3://backups/backup.tgz.part" -
This almost works, but the target filename on my S3 bucket is not dynamic, and it seems to just overwrite the backup.tgz.part file several times while running. The end result is just one 100MB file, vs the intended several 100MB files with endings like .part0001.
Any guidance would be much appreciated. Thanks!
When using split you can use the environment variable $FILE to get the generated file name.
See split man page:
--filter=COMMAND
write to shell COMMAND; file name is $FILE
For your use case you could use something like the following:
--filter 'aws s3 cp - s3://backups/backup.tgz.part$FILE'
(the single quotes are needed, otherwise the environment variable substitution will happen immediately)
Which will generate the following file names on aws:
backup.tgz.partx0000
backup.tgz.partx0001
backup.tgz.partx0002
...
Full example:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter 'aws s3 cp - s3://backups/backup.tgz.part$FILE' -
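For completeness, a rough sketch of how the pieces could later be restored (bucket and prefix names are the illustrative ones from above, and /path/to/restore is an existing directory of your choice; because the suffixes are zero-padded, a shell glob returns them in order):
aws s3 cp s3://backups/ . --recursive --exclude '*' --include 'backup.tgz.part*'
cat backup.tgz.part* | tar -xzf - -C /path/to/restore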
You should be able to get it done quite easily and in parallel using GNU Parallel. It has the --pipe option to split the input data into blocks of size --block and distribute it amongst multiple parallel processes.
So, if you want to use 100MB blocks and use all cores of your CPU in parallel, and append the block number ({#}) to the end of the filename on AWS, your command would look like this:
tar czf - something | parallel --pipe --block 100M --recend '' aws s3 cp - s3://backups/backup.tgz.part{#}
You can use just 4 CPU cores instead of all cores with parallel -j4.
Note that I set the "record end" character to nothing so that it doesn't try to avoid splitting mid-line, which is its default behaviour and is better suited to processing text files than binary data like tarballs.

Split and copy a file from a bucket to another bucket, without downloading it locally

I'd like to split and copy a huge file from a bucket (gs://$SRC_BUCKET/$MY_HUGE_FILE) to another bucket (gs://$DST_BUCKET/), but without downloading the file locally. I expect to do this using only gsutil and shell commands.
I'm looking for something with the same final behaviour as the following commands :
gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE my_huge_file_stored_locally
split -l 1000000 my_huge_file_stored_locally a_split_of_my_file_
gsutil -m mv a_split_of_my_file_* gs://$DST_BUCKET/
But, because I'm executing these actions on a Compute Engine VM with limited disk storage capacity, getting the huge file locally is not possible (and anyway, it seems like a waste of network bandwidth).
The file in this example is split by number of lines (-l 1000000), but I will accept answers if the split is done by number of bytes.
I took a look at the docs about streaming uploads and downloads using gsutil to do something like :
gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE - | split -l 1000000 | ...
But I can't figure out how to upload split files directly to gs://$DST_BUCKET/, without creating them locally (creating temporarily only 1 shard for the transfer is OK though).
This can't be done without downloading, but you could use range reads to build the pieces without downloading the full file at once, e.g.,
gsutil cat -r 0-10000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file1
gsutil cat -r 10001-20000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file2
...
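If you want to script that instead of writing each range by hand, a hedged sketch (byte-based chunks, so pieces won't end on line boundaries; the piece naming is invented) could look like:
CHUNK=$((100 * 1024 * 1024))                                          # 100 MiB per piece
SIZE=$(gsutil du gs://$SRC_BUCKET/$MY_HUGE_FILE | awk '{print $1}')   # object size in bytes
part=0
for ((start = 0; start < SIZE; start += CHUNK)); do
    end=$((start + CHUNK - 1))
    ((end >= SIZE)) && end=$((SIZE - 1))
    gsutil cat -r $start-$end gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/piece_$(printf '%05d' $part)
    part=$((part + 1))
done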

recursively use scp but excluding some folders

Assume there are some folders with these structures
/bench1/1cpu/p_0/image/
/bench1/1cpu/p_0/fl_1/
/bench1/1cpu/p_0/fl_1/
/bench1/1cpu/p_0/fl_1/
/bench1/1cpu/p_0/fl_1/
/bench1/1cpu/p_1/image/
/bench1/1cpu/p_1/fl_1/
/bench1/1cpu/p_1/fl_1/
/bench1/1cpu/p_1/fl_1/
/bench1/1cpu/p_1/fl_1/
/bench1/2cpu/p_0/image/
/bench1/2cpu/p_0/fl_1/
/bench1/2cpu/p_0/fl_1/
/bench1/2cpu/p_0/fl_1/
/bench1/2cpu/p_0/fl_1/
/bench1/2cpu/p_1/image/
/bench1/2cpu/p_1/fl_1/
/bench1/2cpu/p_1/fl_1/
/bench1/2cpu/p_1/fl_1/
/bench1/2cpu/p_1/fl_1/
....
What I want to do is to scp the following folders
/bench1/1cpu/p_0/image/
/bench1/1cpu/p_1/image/
/bench1/2cpu/p_0/image/
/bench1/2cpu/p_1/image/
As you can see, I want to recursively use scp but exclude all folders named "fl_X". It seems that scp has no such option.
UPDATE
scp has no such feature. Instead I used the following command:
rsync -av --exclude 'fl_*' user@server:/my/dir
But it doesn't work. It only transfers the list of folders, something like ls -R.
Although scp supports recursive directory copying with the -r option, it does not support filtering of the files. There are several ways to accomplish your task, but I would probably rely on find, xargs, tar, and ssh instead of scp.
find . -type d -wholename '*bench*/image' \
| xargs tar cf - \
| ssh user@remote tar xf - -C /my/dir
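If any of the directory names could contain spaces, a slightly safer variant of the same pipeline uses null-delimited names (GNU find and xargs assumed):
find . -type d -wholename '*bench*/image' -print0 \
| xargs -0 tar cf - \
| ssh user@remote tar xf - -C /my/dir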
The rsync solution can be made to work, but you are missing some arguments. rsync also needs the r switch to recurse into subdirectories. Also, if you want the same security as scp, you need to do the transfer over ssh. Something like:
rsync -avr -e "ssh -l user" --exclude 'fl_*' ./bench* remote:/my/dir
You can specify GLOBIGNORE and use the pattern *
GLOBIGNORE='ignore1:ignore2' scp -r source/* remoteurl:remoteDir
You may wish to have general rules which you combine or override by using export GLOBIGNORE, but for ad-hoc usage simply the above will do. The : character is used as delimiter for multiple values.
Assuming the simplest option (installing rsync on the remote host) isn't feasible, you can use sshfs to mount the remote locally, and rsync from the mount directory. That way you can use all the options rsync offers, for example --exclude.
Something like this should do:
sshfs user#server: sshfsdir
rsync --recursive --exclude=whatever sshfsdir/path/on/server /where/to/store
Note that the effectiveness of rsync (only transferring changes, not everything) doesn't apply here. This is because for that to work, rsync must read every file's contents to see what has changed. However, as rsync runs only on one host, the whole file must be transferred there (by sshfs). Excluded files should not be transferred, however.
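Applied to the directory layout in the question, an end-to-end sketch (the mount point name is made up, and the paths are illustrative) might be:
mkdir -p sshfsdir
sshfs user@server: sshfsdir
rsync --recursive --exclude='fl_*' sshfsdir/bench1/ /where/to/store/bench1/
fusermount -u sshfsdir    # unmount the sshfs mount when done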
If you use a pem file to authenticate, you can use the following command (which will exclude files with the .something extension):
rsync -Lavz -e "ssh -i <full-path-to-pem> -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" --exclude "*.something" --progress <path inside local host> <user>@<host>:<path inside remote host>
The -L means follow links (copy the files they point to, not the links themselves).
Use the full path to your pem file, not a relative one.
Using sshfs is not recommended since it is slow. Also, the combination of find and scp presented above is a bad idea, since it will open an ssh session per file, which is too expensive.
You can use extended globbing as in the example below:
#Enable extglob
shopt -s extglob
cp -rv !(./excludeme/*.jpg) /var/destination
This one works fine for me, as the directory structure is not important for me.
scp -r USER@HOSTNAME:~/bench1/?cpu/p_?/image/ .
Assuming /bench1 is in the home directory of the current user. Also, change USER and HOSTNAME to the real values.
