Rsync calculate file count before transfer? - bash

When transferring lots of small files in many folders with rsync, it tends to stay around ir-chk=1000/xxxxx, where xxxxx keeps counting up as it discovers new files, and the amount to check stays around 1000 until its on its last few folders.
How can I have it check for the entire file count before copying?
The command I was using to copy was:
rsync -av --progress source dest

rsync -av --progress --dry-run --stats source dest
Option --dry-run doesn't transfer any files but will show how many bytes will be transferred.
Option --stats shows a summary.
a sample output:
...
tests/
tests/__init__.py
tests/test_config.py
Number of files: 5,033 (reg: 2,798, dir: 2,086, link: 149)
Number of created files: 5,032 (reg: 2,798, dir: 2,085, link: 149)
Number of deleted files: 0
Number of regular files transferred: 2,798
Total file size: 26,035,530 bytes
Total transferred file size: 26,032,322 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.004 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 158,821
Total bytes received: 17,284
sent 158,821 bytes received 17,284 bytes 117,403.33 bytes/sec
total size is 26,035,530 speedup is 147.84 (DRY RUN)
Get the number of files will be transferred
rsync -av --progress --dry-run --stats source dest |
fgrep 'Number of files' |
cut -d' ' -f4 |
tr -d ,

Since version 3.0.0, you can use this option to explicitly turn off incremental recursion:
--no-i-r
Details from the rsync man page:
Beginning with rsync 3.0.0, the recursive algorithm used is now an
incremental scan that uses much less memory than before and begins the transfer after the scanning of the first few directories have been completed. This incremental scan only affects our recursion algorithm, and does not change a non-recursive transfer. It is also only possible when both ends of the transfer are at least version 3.0.0.
Some options require rsync to know the full file list, so these options
disable the incremental recursion mode. These include: --delete-before, --delete-after, --prune-empty-dirs, and --delay-updates. Because of this, the default delete mode when you specify --delete is now --delete-during when both ends of the connection are at least 3.0.0 (use --del or --delete-during to request this improved deletion mode explicitly). See also the --delete-delay option that is a better choice than using --delete-after.
Incremental recursion can be disabled using the --no-inc-recursive
option or its shorter --no-i-r alias.

Related

Rsync include or exclude directories using text file

I'm using rsync to backup some data from a remote host.
this is how I'm using the rsync cmd:
rsync --dry-run -avhi -e ssh --include-from=/home/rsync_list/test.txt root#10.10.4.61:/ /mnt/BACKUP/my_BACKUP/
this is the file /home/rsync_list/test.txt
+ /usr/acs/conf/**
+ /usr/acs/bin/**
+ /raid0/opmdps/TEMP_folder/**
- *
I want to copy only the listed folders excluding the remaining files.
I always get
receiving file list ... done
sent 103 bytes received 48 bytes 302.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
Could you tell me what I'm doing wrong? How should I write the rsync command if I would like to sync, for example, only /raid0/opmdps/TEMP_folder/ without its subfolders?
I wonder if you really only tried with the command you posted?
Not only you are using "--dry-run" option, even the output indicates this:
total size is 0 speedup is 0.00 (DRY RUN)
Please consult the manpage:
-n, --dry-run perform a trial run with no changes made
https://linux.die.net/man/1/rsync
May I suggest you give it a run without --dry-run?

How can I compare the file sizes match between duplicate directories?

I need to compare two directories to validate a backup.
Say my directory looks like the following:
Filename Filesize Filename Filesize
user#main_server:~/mydir/ user#backup_server:~/mydir/
file1000.txt 4182410737 file1000.txt 4182410737
file1001.txt 8241410737 - <-- missing on backup_server!
... ...
file9999.txt 2410418737 file9999.txt 1111111111 <-- size != main_server
Is there a quick one liner that would get me close to output like:
Invalid Backup Files:
file1001.txt
file9999.txt
(with the goal to instruct the backup script to refetch these files)
I've tried to get variations of the following to no avail.
[main_server] $ rsync -n ~/mydir/ user#backup_server:~/mydir
I cannot do rsync to backup the directories itself because it takes way too long (8-24hrs). Instead I run multiple threads of scp to fetch files in batches. This completes regularly <1hr. However, occasionally I find a few files that were somehow missed (perhaps dropped connection).
Speed is a priority, so file sizes should be sufficient. But I'm open to including a checksum, provided it doesn't slow the process down like I find with rsync.
Here's my test process:
# Generate Large Files (1GB)
for i in {1..100}; do head -c 1073741824 </dev/urandom >foo-$i ; done
# SCP them from src to dest
for i in {1..100}; do ( scp ~/mydir/foo-$i user#backup_server:~/mydir/ & ) ; sleep 0.1 ; done
# Confirm destination has everything from source
# This is the point of the question. I've tried:
rsync -Sa ~/mydir/ user#backup_server:~/mydir
# Way too slow
What do you recommend?
By default, rsync uses the quick check method which only transfers files that differ in size or last-modified time. As you report that the sizes are unchanged, that would seem to indicate that the timestamps differ. Two options to handlel this are:
Use -p to preserve timestamps when transferring files.
Use --size-only to ignore timestamps and transfer only files that differ in size.

rsync script expands variable incorrectly

I have a script that takes in a unique location number from a file. These are formatted like this 7325-05, 5269-09 and 7479-14, for example. The first four numbers are what the folder is called and the second number is the first two characters of the filename unique within each folder.
So I wrote this script to use locate and find to get the full path of the folder and then use a wildcard to download the specific file using rsync. Here's the script that I have right now:
#!/bin/bash
#IFS='
#'
oIFS=$IFS
IFS=$'\n'
while read line;
do
name=$line;
folder=${line:0:4}
track=${line: -2}
folderlocation="$(locate -r '/'$folder'$')"
filelocation="$(find "$folderlocation" -type f -name "$track*")"
rsync -vazhn --progress "$filelocation" /cygdrive/c/
# mkdir /cygdrive/c/test/"$folder"
# cp -rvi "$filelocation" /cygdrive/c/test/"$folder"
echo "";
done < $1
The code using cp that is commented out works just fine. I would just really prefer to use rsync, mainly due to better feedback and more accurate progress reporting, as far as I can tell.
Using the code as pasted above (with rsync) throws this error:
./filelocator classic-locations.txt
sending incremental file list
rsync: change_dir "/home/emil//\\sandrew-nas/SMMUSIC/MMIMUSIC/7001-8000/7201-7300/7252/Unknown Album (29-12-2012 09-52-02)" failed: No such file or directory (2)
sent 20 bytes received 12 bytes 64.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1165) [sender=3.1.1]
sending incremental file list
rsync: change_dir "/home/emil//\\sandrew-nas/SMMUSIC/MMIMUSIC/7001-8000/7201-7300/7252/Unknown Album (29-12-2012 09-52-02)" failed: No such file or directory (2)
sent 20 bytes received 12 bytes 64.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1165) [sender=3.1.1]
As you can see, my home folder (where I issue the command) is suddenly included in the script, leading me to believe that a variable or wildcard is being expanded in the local shell, but no amount of escape characters seem to accomplish what I want with rsync.

How to zgrep the last line of a gz file without tail

Here is my problem, I have a set of big gz log files, the very first info in the line is a datetime text, e.g.: 2014-03-20 05:32:00.
I need to check what set of log files holds a specific data.
For the init I simply do a:
'-query-data-'
zgrep -m 1 '^20140320-04' 20140320-0{3,4}*gz
BUT HOW to do the same with the last line without process the whole file as would be done with zcat (too heavy):
zcat foo.gz | tail -1
Additional info, those logs are created with the data time of it's initial record, so if I want to query logs at 14:00:00 I have to search, also, in files created BEFORE 14:00:00, as a file would be created at 13:50:00 and closed at 14:10:00.
The easiest solution would be to alter your log rotation to create smaller files.
The second easiest solution would be to use a compression tool that supports random access.
Projects like dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data that allow you to seek to in a program aware of that extra information. While it exists in the standard, the vanilla gzip does not add such markers either by default or by option.
Files compressed by these random-access-friendly utilities are slightly larger (by perhaps 2-20%) due to the markers themselves, but fully support decompression with gzip or another utility that is unaware of these markers.
You can learn more at this question about random access in various compression formats.
There's also a "Blasted Bioinformatics" blog by Peter Cock with several posts on this topic, including:
BGZF - Blocked, Bigger & Better GZIP! – gzip with random access (like dictzip)
Random access to BZIP2? – An investigation (result: can't be done, though I do it below)
Random access to blocked XZ format (BXZF) – xz with improved random access support
Experiments with xz
xz (an LZMA compression format) actually has random access support on a per-block level, but you will only get a single block with the defaults.
File creation
xz can concatenate multiple archives together, in which case each archive would have its own block. The GNU split can do this easily:
split -b 50M --filter 'xz -c' big.log > big.log.sp.xz
This tells split to break big.log into 50MB chunks (before compression) and run each one through xz -c, which outputs the compressed chunk to standard output. We then collect that standard output into a single file named big.log.sp.xz.
To do this without GNU, you'd need a loop:
split -b 50M big.log big.log-part
for p in big.log-part*; do xz -c $p; done > big.log.sp.xz
rm big.log-part*
Parsing
You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:
SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
Side note
Version 5.1.1 introduced support for the --block-size flag:
xz --block-size=50M big.log
However, I have not been able to extract a specific block since it doesn't include full headers between blocks. I suspect this is nontrivial to do from the command line.
Experiments with gzip
gzip also supports concatenation. I (briefly) tried mimicking this process for gzip without any luck. gzip --verbose --list doesn't give enough information and it appears the headers are too variable to find.
This would require adding sync flush points, and since their size varies on the size of the last buffer in the previous compression, that's too hard to do on the command line (use dictzip or another of the previously discussed tools).
I did apt-get install dictzip and played with dictzip, but just a little. It doesn't work without arguments, creating a (massive!) .dz archive that neither dictunzip nor gunzip could understand.
Experiments with bzip2
bzip2 has headers we can find. This is still a bit messy, but it works.
Creation
This is just like the xz procedure above:
split -b 50M --filter 'bzip2 -c' big.log > big.log.sp.bz2
I should note that this is considerably slower than xz (48 min for bzip2 vs 17 min for xz vs 1 min for xz -0) as well as considerably larger (97M for bzip2 vs 25M for xz -0 vs 15M for xz), at least for my test log file.
Parsing
This is a little harder because we don't have the nice index. We have to guess at where to go, and we have to err on the side of scanning too much, but with a massive file, we'd still save I/O.
My guess for this test was 50000000 (out of the original 52428800, a pessimistic guess that isn't pessimistic enough for e.g. an H.264 movie.)
GUESS=50000000
LAST=$(tail -c$GUESS big.log.sp.bz2 \
|grep -abo 'BZh91AY&SY' |awk -F: 'END { print '$GUESS'-$1 }')
tail -c $LAST big.log.sp.bz2 |bunzip2 -c |tail -n1
This takes just the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off of the end of the file. Just that part is decompressed and thrown into tail.
Because this has to query the compressed file twice and has an extra scan (the grep call seeking the header, which examines the whole guessed space), this is a suboptimal solution. See also the below section on how slow bzip2 really is.
Perspective
Given how fast xz is, it's easily the best bet; using its fastest option (xz -0) is quite fast to compress or decompress and creates a smaller file than gzip or bzip2 on the log file I was testing with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.
————— No Random Access —————— ——————— Random Access ———————
FORMAT SIZE RATIO WRITE READ SIZE RATIO WRITE SEEK
————————— ————————————————————————————— —————————————————————————————
(original) 7211M 1.0000 - 0:06 7211M 1.0000 - 0:00
bzip2 96M 0.0133 48:31 3:15 97M 0.0134 47:39 0:00
gzip 79M 0.0109 0:59 0:22
dictzip 605M 0.0839 1:36 (fail)
xz -0 25M 0.0034 1:14 0:12 25M 0.0035 1:08 0:00
xz 14M 0.0019 16:32 0:11 14M 0.0020 16:44 0:00
Timing tests were not comprehensive, I did not average anything and disk caching was in use. Still, they look correct; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it allows an otherwise non-multithreaded utility to consume multiple threads).
Well, you can access randomly a gzipped file if you previously create an index for each file ...
I've developed a command line tool which creates indexes for gzip files which allow for very quick random access inside them:
https://github.com/circulosmeos/gztool
The tool has two options that may be of interest for you:
-S option supervise a still-growing file and creates an index for it as it is growing - this can be useful for gzipped rsyslog files as reduces to zero in the practice the time of index creation.
-t tails a gzip file: this way you can do: $ gztool -t foo.gz | tail -1
Please, note that if the index doesn't exists, this will consume the same time as a complete decompression: but as the index is reusable, next searches will be greatly reduced in time!
This tool is based on zran.c demonstration code from original zlib, so there's no out-of-the-rules magic!

Is rsync really any faster on files that have changed?

Why can't I trust rsync to be minimally as fast as cp? (I'm ignoring negligible differences for overhead.)
It seems to me like rsync is fairly slow on files with no content difference, but a changed timestamp.
If I make a file: cp -a testfile-100M destfile
And then I rsync them, I get what you would expect:
$ rsync -av testfile-100M destfile
sending incremental file list
sent 56 bytes received 12 bytes 8.00 bytes/sec
total size is 104857600 speedup is 1542023.53
But that's just because rsync is checking the size and the timestamp and skipping the file. What if I just change the timestamp?
$ touch testfile-100M
$ rsync -av testfile-100M destfile sending incremental file list
testfile-100M
sent 104870495 bytes received 31 bytes 113804.15 bytes/sec
total size is 104857600 speedup is 1.00
Also note that even though the speedup is 1, the inital copy took about 1/4 the time to complete than the final rsync, even though the contents are exactly the same. So what's going on here? Is it just all the overhead of doing the comparisons?
If that's the case, then when does rsync ever provide a performance advantage? Only when files are exactly the same on both sides?
For local files, if the size or mtime have changed, rsync by default just copies the whole thing without using its delta algorithm. You can turn this off with the --no-whole-file option, but for local copies this will typically be slower.
For the specific case of touching a file without changing it:
If you give the --size-only option, it will assume that files that have the same size are unchanged.
If you give the --checksum option, it will first hash the file to see if anything has changed, before copying it.
When the source and destination are both locally mounted filesystems rsync just copies the file(s) if the timestamps or sizes don't match. Rsync wins where you have large files with small differences and they are on machines separated by a low bandwidth link.
EDIT: Since someone felt the need to downvote this ancient answer... As to why rsync on local files might be slower than cp, there does not seem to be any good reason.
It appears the answer is that rsync does some extra steps in order to keep files in a consistent state, and not in a "partial-transferred" state while operating. Using the --inplace option removes this overhead.
Interestingly, for me rsync is about 4× faster than cp for copying to an external USB drive.

Resources