Excluding Everything in /Library/ but /Library/Mail via rsync - macos

I have an rsync script that backs up a user's home folder using this line of code:
/usr/bin/caffeinate -i /usr/bin/rsync -rlptDn --human-readable --progress --ignore-existing --update $PATH/$NAME/ --exclude=".*" --exclude="Public" --exclude="Library" /Volumes/Backup/Users/$NAME\ -\ $DATE
How do I ignore everything in ~/Library/ but their ~/Library/Mail/? I wanted to add the rsync flag --include="/Library/Mail", but I'm not sure I should depend too much on rsync exclusions and inclusions, as they can become unreliable and vary between the rsync versions shipped with different releases of OS X.
Maybe a command-line regex tool would be more useful? Example:
ls -d1 ~/Library/* | grep -v '/Mail$' > "$ALIST"
exec < "${ALIST}"
while read -r SRC
do
.
.
.
$RSYNC ..etc...
done

rsync's --include and --exclude options obey what is called "precedence": the first rule that matches a path wins, so there is a firm rule you can rely on. Whatever you explicitly include before you exclude is what will be sent.
In your case, add an --include for the Library/Mail path before the first --exclude.
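Here is a minimal sketch of that, adapted to the command in the question (the -n dry-run flag is kept so you can check the file list first; the filter patterns are matched relative to the source root, and the parent Library/ directory has to be matched too, or rsync never descends into it to find Mail):
/usr/bin/caffeinate -i /usr/bin/rsync -rlptDn --human-readable --progress --ignore-existing --update --include="/Library/" --include="/Library/Mail/***" --exclude="/Library/**" --exclude=".*" --exclude="Public" $PATH/$NAME/ /Volumes/Backup/Users/$NAME\ -\ $DATE
If your rsync is too old for the "***" syntax, use two includes instead: --include="/Library/Mail/" and --include="/Library/Mail/**".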

Related

lftp --exclude option excluding more files than we'd like

We use this command in our .gitlab-ci.yml during the deploy stage. It uses lftp to mirror(ish).
But we recently noticed a big problem which indicates we might not understand the syntax for the --exclude option of lftp.
It indeed seems that --exclude .env excludes every file whose path matches .env anywhere, like this one for instance:
application/migrations/20220805084313_rajout_champ_cronjobs_id_environnement.php
- lftp -e "set ftp:ssl-allow no ; mirror -p -Rev ./ public_html/project/ --parallel=10 --exclude-glob .git* --exclude .env" -u $LOGIN,$PWD ftp://$SERVER
Is that normal behavior? If so, how do I exclude only files named ".env"? Or is it a bug?
Thanks.
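For what it's worth, lftp's mirror treats --exclude as a regular expression, so the unescaped dot in .env matches any character, including the _ in _environnement, which would explain what you are seeing. A sketch of one fix, reusing the glob form the command already uses for .git* (untested against this exact pipeline):
- lftp -e "set ftp:ssl-allow no ; mirror -p -Rev ./ public_html/project/ --parallel=10 --exclude-glob .git* --exclude-glob .env" -u $LOGIN,$PWD ftp://$SERVER
An anchored regex such as --exclude '(^|/)\.env$' should also limit the match to files literally named .env.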

recursively use scp but excluding some folders

Assume there are some folders with this structure:
/bench1/1cpu/p_0/image/
/bench1/1cpu/p_0/fl_1/
/bench1/1cpu/p_0/fl_2/
/bench1/1cpu/p_0/fl_3/
/bench1/1cpu/p_0/fl_4/
/bench1/1cpu/p_1/image/
/bench1/1cpu/p_1/fl_1/
/bench1/1cpu/p_1/fl_2/
/bench1/1cpu/p_1/fl_3/
/bench1/1cpu/p_1/fl_4/
/bench1/2cpu/p_0/image/
/bench1/2cpu/p_0/fl_1/
/bench1/2cpu/p_0/fl_2/
/bench1/2cpu/p_0/fl_3/
/bench1/2cpu/p_0/fl_4/
/bench1/2cpu/p_1/image/
/bench1/2cpu/p_1/fl_1/
/bench1/2cpu/p_1/fl_2/
/bench1/2cpu/p_1/fl_3/
/bench1/2cpu/p_1/fl_4/
....
What I want to do is to scp the following folders
/bench1/1cpu/p_0/image/
/bench1/1cpu/p_1/image/
/bench1/2cpu/p_0/image/
/bench1/2cpu/p_1/image/
As you can see, I want to use scp recursively but exclude all folders named "fl_X". It seems that scp has no such option.
UPDATE
scp has no such feature. Instead I used the following command:
rsync -av --exclude 'fl_*' user@server:/my/dir
But it doesn't work. It only prints the list of folders, something like ls -R.
Although scp supports recursive directory copying with the -r option, it does not support filtering of the files. There are several ways to accomplish your task, but I would probably rely on find, xargs, tar, and ssh instead of scp.
find . -type d -wholename '*bench*/image' \
| xargs tar cf - \
| ssh user@remote tar xf - -C /my/dir
The rsync solution can be made to work, but you are missing an argument: with only a source and no destination, rsync just lists the files, which is the ls -R-like output you saw. Also, if you want the same security as scp, you need to do the transfer under ssh. Something like:
rsync -avr -e "ssh -l user" --exclude 'fl_*' ./bench* remote:/my/dir
You can specify GLOBIGNORE and use the pattern *
GLOBIGNORE='ignore1:ignore2' scp -r source/* remoteurl:remoteDir
You may wish to have general rules that you combine or override by using export GLOBIGNORE, but for ad-hoc usage the above will do. The : character is used as the delimiter for multiple values.
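For the layout in this question, an exported sketch might look like the following; the GLOBIGNORE pattern is an assumption about the paths shown above, and note that the expanded glob drops the copied image directories straight into /my/dir:
export GLOBIGNORE='bench1/*cpu/p_*/fl_*'
scp -r bench1/*cpu/p_*/* user@server:/my/dir
unset GLOBIGNORE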
Assuming the simplest option (installing rsync on the remote host) isn't feasible, you can use sshfs to mount the remote locally, and rsync from the mount directory. That way you can use all the options rsync offers, for example --exclude.
Something like this should do:
sshfs user@server: sshfsdir
rsync --recursive --exclude=whatever sshfsdir/path/on/server /where/to/store
Note that rsync's usual efficiency (transferring only changes, not everything) doesn't apply here. For that to work, rsync must read every file's contents to see what has changed, and since rsync runs on only one host, the whole file has to be transferred there (by sshfs) anyway. Excluded files, though, should not be transferred at all.
If you use a .pem file to authenticate, you can use the following command (which excludes files with the .something extension):
rsync -Lavz -e "ssh -i <full-path-to-pem> -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" --exclude "*.something" --progress <path inside local host> <user>@<host>:<path inside remote host>
The -L means follow symlinks (copy the files they point to, not the links themselves).
Use the full path to your .pem file, not a relative one.
Using sshfs is not recommended since it is slow. The combination of find and scp mentioned above is also a bad idea, since it would open an ssh session per file, which is too expensive.
You can use extended globbing as in the example below:
#Enable extglob
shopt -s extglob
cp -rv !(./excludeme/*.jpg) /var/destination
This one works fine for me, as the directory structure is not important to me.
scp -r USER@HOSTNAME:~/bench1/?cpu/p_?/image/ .
Assuming /bench1 is in the home directory of the current user. Also, change USER and HOSTNAME to the real values.

rsync exclude a directory but include a subdirectory

I am trying to copy a project to my server with rsync.
I have project specific install scripts in a subdirectory
project/specs/install/project1
What I am trying to do is exclude everything in the project/specs directory except the project-specific install directory: project/specs/install/project1.
rsync -avz --delete --include=specs/install/project1 \
--exclude=specs/* /srv/http/projects/project/ \
user@server.com:~/projects/project
But like this, the content of the specs directory gets excluded, yet the install/project1 directory does not get included.
I have tried everything, but I just don't seem to get this to work.
Sometimes it's just a detail.
Just change your include pattern by adding a trailing / at the end, and it'll work:
rsync -avz --delete --include=specs/install/project1/ \
--exclude=specs/* /srv/http/projects/project/ \
user@server.com:~/projects/project
Or, alternatively, prepare a filter file like this:
$ cat << EOF >pattern.txt
> + specs/install/project1/
> - specs/*
> EOF
Then use the --filter option:
rsync -avz --delete --filter=". pattern.txt" \
/srv/http/projects/project/ \
user@server.com:~/projects/project
For further info go to the FILTER RULES section in the rsync(1) manual page.
The other solution is not working here.
Reliable way
You have no choice but to manually descend each level of your sub-directory. There is no risk of including unwanted files, as rsync doesn't automatically include the contents of included directories.
1) Create an include filter file, for instance "include_filter.txt":
+ /specs/
+ /specs/install/
+ /specs/install/project1/***
- /specs/**
2) Run it:
rsync -avz --delete --include-from=include_filter.txt \
/srv/http/projects/project/ \
user@server.com:~/projects/project
Don't forget the starting slash "/"; otherwise you may match sub-directories at any depth, as if the pattern were "**/specs/install/project1/".
By choosing an include-type filter (--include-from=FILE), the starting plus "+" signs are actually optional, as include is the default action when no sign is given. (You get the opposite, "-" by default, with --exclude-from=FILE.)
The double star "**" means "any path".
The triple star "***" means "any path, including this very directory".
Easy way
You can start your filter file with a rule matching every directory ("+ /**/"), allowing rsync to descend into all your sub-levels. This is convenient, but:
All directories will be included, even empty ones. This can be fixed with the rsync option -m, but then all empty dirs will be skipped.
1) Create an include filter file, for instance "include_filter.txt":
+ /**/
+ /specs/install/project1/***
- /specs/**
2) Run it:
rsync -avzm --delete --include-from=include_filter.txt \
/srv/http/projects/project/ \
user@server.com:~/projects/project
Note the added option -m.
The order of --include and --exclude affects what gets included or excluded.
When particular subdirectories need to be included, place them first.
A similar post is available here.

How to emulate 'cp --update' behavior on Mac OS X?

The GNU/Linux version of cp has a nice --update flag:
-u, --update
copy only when the SOURCE file is newer than the destination file or when the destination file is missing
The Mac OS X version of cp lacks this flag.
What is the best way to get the behavior of cp --update by using built-in system command line programs? I want to avoid installing any extra tools (including the GNU version of cp).
rsync has an -u/--update option that works just like GNU cp:
$ rsync -u src dest
Also look at rsync's other options, which are probably what you actually want:
-l, --links copy symlinks as symlinks
-H, --hard-links preserve hard links
-p, --perms preserve permissions
--executability preserve executability
-o, --owner preserve owner (super-user only)
-g, --group preserve group
--devices preserve device files (super-user only)
--specials preserve special files
-D same as --devices --specials
-t, --times preserve times
-a, --archive
This is equivalent to -rlptgoD. It is a quick way of saying you want recursion
and want to preserve almost everything (with -H being a notable omission). The
only exception to the above equivalence is when --files-from is specified, in which
case -r is not implied.
Note that -a does not preserve hardlinks, because finding multiply-linked files is
expensive. You must separately specify -H.
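Putting that together, a rough equivalent of a recursive cp --update between two directories would be something like this (directory names are placeholders; the trailing slash on the source makes rsync copy its contents rather than the directory itself):
rsync -au src_dir/ dest_dir/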

Using wget to recursively fetch a directory with arbitrary files in it

I have a web directory where I store some config files. I'd like to use wget to pull those files down and maintain their current structure. For instance, the remote directory looks like:
http://mysite.com/configs/.vim/
.vim holds multiple files and directories. I want to replicate that on the client using wget. Can't seem to find the right combo of wget flags to get this done. Any ideas?
You have to pass the -np/--no-parent option to wget (in addition to -r/--recursive, of course), otherwise it will follow the link in the directory index on my site to the parent directory. So the command would look like this:
wget --recursive --no-parent http://example.com/configs/.vim/
To avoid downloading the auto-generated index.html files, use the -R/--reject option:
wget -r -np -R "index.html*" http://example.com/configs/.vim/
To download a directory recursively, rejecting index.html* files and downloading without the hostname, parent directory, and the rest of the directory structure:
wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data
For anyone else having similar issues: wget respects robots.txt, which might not allow you to grab the site. No worries, you can turn it off:
wget -e robots=off http://www.example.com/
http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html
You should use the -m (mirror) flag, as that takes care to not mess with timestamps and to recurse indefinitely.
wget -m http://example.com/configs/.vim/
If you add the points mentioned by others in this thread, it would be:
wget -m -e robots=off --no-parent http://example.com/configs/.vim/
Here's the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):
wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/
If --no-parent does not help, you might use the --include option.
Directory structure:
http://<host>/downloads/good
http://<host>/downloads/bad
And you want to download the downloads/good directory but not downloads/bad:
wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good
wget -r http://mysite.com/configs/.vim/
works for me.
Perhaps you have a .wgetrc which is interfering with it?
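One way to rule that out, assuming a stray startup file really is the problem, is to point wget at an empty one for a single run:
WGETRC=/dev/null wget -r http://mysite.com/configs/.vim/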
First of all, thanks to everyone who posted their answers. Here is my "ultimate" wget script to download a website recursively:
wget --recursive ${comment# self-explanatory} \
--no-parent ${comment# will not crawl links in folders above the base of the URL} \
--convert-links ${comment# convert links with the domain name to relative and uncrawled to absolute} \
--random-wait --wait 3 --no-http-keep-alive ${comment# do not get banned} \
--no-host-directories ${comment# do not create folders with the domain name} \
--execute robots=off --user-agent=Mozilla/5.0 ${comment# I AM A HUMAN!!!} \
--level=inf --accept '*' ${comment# do not limit to 5 levels or common file formats} \
--reject="index.html*" ${comment# use this option if you need an exact mirror} \
--cut-dirs=0 ${comment# replace 0 with the number of folders in the path, 0 for the whole domain} \
$URL
Afterwards, stripping the query params from URLs like main.css?crc=12324567 and running a local server (e.g. via python3 -m http.server in the dir you just wget'ed) to run JS may be necessary. Please note that the --convert-links option kicks in only after the full crawl was completed.
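A possible cleanup pass for that, assuming the query strings ended up as literal "?..." suffixes on the saved file names:
# renames e.g. main.css?crc=12324567 to main.css; may overwrite an existing file with the same base name
find . -name '*\?*' -exec sh -c 'mv -- "$0" "${0%%\?*}"' {} \;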
Also, if you are trying to wget a website that may go down soon, you should get in touch with the ArchiveTeam and ask them to add your website to their ArchiveBot queue.
To fetch a directory recursively with username and password, use the following command:
wget -r --user=(put username here) --password='(put password here)' --no-parent http://example.com/
This version downloads recursively and doesn't create parent directories.
wgetod() {
    # count the directories in the URL path, then cut all but the last one
    NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
    NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
    wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
}
Usage:
Add to ~/.bashrc or paste into terminal
wgetod "http://example.com/x/"
The following options seem to be the perfect combination when dealing with recursive download:
wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2
Relevant snippets from man pages for convenience:
-nd
--no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the
filenames will get extensions .n).
-np
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
All you need is two flags: "-r" for recursion and "--no-parent" (or -np) so that it doesn't ascend into the parent directory. Like this:
wget -r --no-parent http://example.com/configs/.vim/
That's it. It will download into the following local tree: ./example.com/configs/.vim .
However if you do not want the first two directories, then use the additional flag --cut-dirs=2 as suggested in earlier replies:
wget -r --no-parent --cut-dirs=2 http://example.com/configs/.vim/
And it will download your file tree only into ./.vim/
In fact, I got the first line of this answer precisely from the wget manual; it has a very clean example towards the end of section 4.3.
It sounds like you're trying to get a mirror of your files. While wget has some interesting FTP and SFTP uses, a simple mirror should work. Just a few considerations to make sure you're able to download the files properly.
Respect robots.txt
Ensure that if you have a /robots.txt file in your public_html, www, or configs directory, it does not prevent crawling. If it does, you need to instruct wget to ignore it by adding the following option to your wget command:
wget -e robots=off 'http://your-site.com/configs/.vim/'
Convert remote links to local files.
Additionally, wget must be instructed to convert links so they point to the downloaded files. If you've done everything above correctly, you should be fine here. The easiest way I've found to get all files, provided nothing is hidden behind a non-public directory, is using the mirror command.
Try this:
wget -mpEk 'http://your-site.com/configs/.vim/'
# If robots.txt is present:
wget -mpEk -e robots=off 'http://your-site.com/configs/.vim/'
# Good practice: only deal with the highest-level directory you specify (instead of downloading all of `mysite.com`, you're just mirroring from `.vim`)
wget -mpEk -e robots=off --no-parent 'http://your-site.com/configs/.vim/'
Using -m instead of -r is preferred as it doesn't have a maximum recursion depth and it downloads all assets. Mirror is pretty good at determining the full depth of a site; however, if you have many external links you could end up downloading more than just your site, which is why we use -p, -E, and -k. All prerequisite files needed to render the page, and a preserved directory structure, should be the output. -k converts links to local files.
Since the links should now be set up, you should get your configs folder with the .vim directory inside it.
Mirror mode also works with a directory structure that's served over ftp://.
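For example (hypothetical FTP host and path):
wget -m --no-parent 'ftp://example.com/configs/.vim/'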
General rule of thumb:
Depending on the size of the site you are mirroring, you may be sending many calls to the server. In order to avoid being blacklisted or cut off, use the wait option to rate-limit your downloads.
wget -mpEk --no-parent -e robots=off --random-wait 'http://your-site.com/configs/.vim/'
But if you're simply downloading the ../config/.vim/ directory, you shouldn't have to worry about it, since you're ignoring parent directories and downloading a single directory.
Wget 1.18 may work better, e.g., I got bitten by a version 1.12 bug where...
wget --recursive (...)
...only retrieves index.html instead of all files.
The workaround was to notice some 301 redirects and try the new location; given the new URL, wget got all the files in the directory.
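If you hit the same thing, one way to spot the new location is to inspect the redirect headers first (placeholder URL), then point wget -r -np at whatever URL gets printed:
curl -sI http://mysite.com/configs/.vim/ | grep -i '^location:'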
Recursive wget ignoring robots (for websites)
wget -e robots=off -r -np --page-requisites --convert-links 'http://example.com/folder/'
-e robots=off causes it to ignore robots.txt for that domain
-r makes it recursive
-np = no parents, so it doesn't follow links up to the parent folder
You should be able to do it simply by adding a -r
wget -r http://stackoverflow.com/
