rsync: --include-from vs. --exclude-from: what is the actual difference?

In the documentation, it mentions these as being files containing lists of either patterns to include or patterns to exclude. However, that implies for inclusions, everything is considered an exclusion except where things match patterns. So for example, an include file containing:
/opt/**.cfg
This should include only files named *.cfg that exist anywhere under a directory named opt, anywhere in the tree. So it would match the following:
/opt/etc/myfile.cfg
/some/dir/opt/myfile.cfg
/notopt/opt/some/other/dir/myfile.cfg
I'd therefore expect it to implicitly exclude anything else. But that doesn't seem to be the case, since I am seeing this in the itemized output:
*deleting etc/rc.d/init.d/somescript
So what is the deal with --include-from and --exclude-from? Are they just aliases for --filter-from?

rsync doesn't work like that. Any file whose name does not match any of the include or exclude patterns is considered to be included. In other words, think of an include pattern as a way of overriding an exclude pattern.
From the docs (emphasis mine):
Rsync builds an ordered list of include/exclude options as specified on the command line. Rsync checks each file and directory name against each exclude/include pattern in turn. The first matching pattern is acted on. If it is an exclude pattern, then that file is skipped. If it is an include pattern then that filename is not skipped. If no matching include/exclude pattern is found then the filename is not skipped.
So, if you want to include only specific files, you first need to include those specific files, then exclude all other files:
--include="*/" --include="*.cfg" --exclude="*"
Couple of things to note here:
The include patterns have to come before the excludes, because the first pattern that matches is the one that gets considered. If the file name matches the exclude pattern first, it gets excluded.
You need to either include all subdirectories individually, like --include="/opt" --include="/opt/dir1" etc. for all subdirectories, or use --include="*/" to include all directories (not files). I went with the second option for brevity.
It is quirky and not very intuitive. So read the docs carefully (the "EXCLUDE PATTERNS" section in the link) and use the --dry-run or -n option to make sure it is going to do what you think it should do.
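To see the first-match-wins rule in action without touching real data, here is a minimal sketch that builds a throwaway tree and runs the three options from above against it. The directory layout and the variable names (src, dst, copied) are made up for the illustration, and the rsync call is skipped if rsync is not installed.

```shell
#!/bin/sh
# Throwaway source and destination trees (hypothetical layout).
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/opt/etc" "$src/etc/rc.d"
touch "$src/opt/etc/myfile.cfg" "$src/etc/rc.d/somescript"

if command -v rsync >/dev/null 2>&1; then
  # Order matters: directories first, then *.cfg, then "exclude the rest".
  rsync -a --include='*/' --include='*.cfg' --exclude='*' "$src/" "$dst/"
  # List only regular files; the directories themselves are still created.
  copied=$(cd "$dst" && find . -type f | sort)
else
  copied="rsync not installed"
fi
printf '%s\n' "$copied"   # with rsync available: ./opt/etc/myfile.cfg
```

Because --include="*/" lets every directory through, the destination also ends up with empty directories such as etc/rc.d. If that is unwanted, rsync's -m (--prune-empty-dirs) option is worth trying, again with --dry-run first.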

If you (like me) have a hard time wrapping your head around the FILTER RULES section in the man pages but have a basic understanding of find, you could use that instead.
Say you want to sync everything with a specific date (e.g. 2016-02-01) in either the file name or a directory name from /storage/data to /tmp/rsync_test. Do something like this:
cd /storage/data
find . -name '*2016-02-01*' \
| rsync --dry-run -arv --files-from=- /storage/data /tmp/rsync_test
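A variant of the same idea, as a sketch: if file names might contain spaces or newlines, find can emit NUL-terminated names with -print0, and rsync can read them with --from0. The temporary directories and the variable names (src, dst, synced) below are invented for the demonstration, and the transfer is skipped when rsync is not installed.

```shell
#!/bin/sh
# Throwaway trees standing in for /storage/data and /tmp/rsync_test.
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/logs" "$src/2016-02-01"
touch "$src/logs/app-2016-02-01.log" "$src/logs/app-2016-02-02.log" \
      "$src/2016-02-01/report.txt"

if command -v rsync >/dev/null 2>&1; then
  # find prints paths relative to $src; rsync resolves them against $src.
  # -r is given explicitly because -a does not imply it with --files-from.
  (cd "$src" && find . -name '*2016-02-01*' -print0) |
    rsync -ar --from0 --files-from=- "$src" "$dst"
  synced=$(cd "$dst" && find . -type f | sort)
else
  synced="rsync not installed"
fi
printf '%s\n' "$synced"
```

The file app-2016-02-02.log is not matched by find, so it should not appear in the destination, while report.txt comes along because its parent directory name matches.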


How to refer to the current path in this recursive find and replace?

Disclaimer: (off-topic warning) This is not about outputting the list of ignored files actually detected in the repo. This is about ignored paths, even when no file is in fact matching one of these paths.
Context: I'm attempting to write a git alias to "flatten" all .gitignore patterns recursively and output a list of paths as they're seen from the top level.
What I mean with an example:
├─ .git
├─ .gitignore
└─ dir1
├─ .gitignore
├─ file1.txt
└─ file2.txt
With these contents in .gitignore files:
# (currently pointing at top-level directory)
$ cat .gitignore
some_path
$ cat dir1/.gitignore
yet_another_path
*.txt
I try to have an alias to output something along the lines of
$ git flattened-ignore-list
some_path
dir1/yet_another_path
dir1/*.txt
What do I have so far?
I know I can search for all .gitignore files in the repo with
find . -name ".gitignore"
which in this case would output
.gitignore
dir1/.gitignore
So I've tried to combine this with cat to get their contents (either of these work)
find . -name ".gitignore" | xargs cat
# or
cat $(find . -name ".gitignore")
with this result:
some_path
yet_another_path
*.txt
which is technically expected but unfortunately unhelpful for what I am trying to achieve. So to (at last!) arrive at my actual question:
How can I, for each result of find, refer to the current path? (in order to eventually prepend it to the line)
Note for people suspecting an XY problem : It might be the case, my approach might just be naive here, but maybe not, I'm unsure. For example I didn't consider complex cases where nested .gitignore files could refer to upper-levels, or special syntax with **. I've stuck to very simple structures for now, so in case you see a flaw and/or can suggest a totally different way to achieve the same goal, I'll of course be happy to hear about it also.
I try to have an alias to output something along the lines of
$ git flattened-ignore-list
some_path
dir1/yet_another_path
dir1/*.txt
Unfortunately, this approach is naive (and perhaps doomed, but maybe not) because entries in .gitignore files are a bit complicated.
The simple answer to the simple question you asked is to use something that prepends the directory name, relative to the top level. Since find never outputs unnecessarily-complicated names, you can do this with direct string processing:
.gitignore
dir1/.gitignore
tells you that when reading the first file, prepend nothing, and when reading the second, prepend dir1 to each entry. Doing this in shell is a little tricky, but bash has the tools needed: you just get the line minus the /.gitignore at the end, either using regexp replacement or just removing 11 characters (if I counted right) from anything that has a slash in it or isn't the literal 10-character string .gitignore. Grab the directory off the part before the /.gitignore name and use sed or awk to insert it, and a slash, in front of non-comment entries (and remember to handle ! entries a little differently).
You are probably better off handling the top-level .gitignore separately (you can just copy it straight through, adding a final newline if necessary) and then dealing with subdirectory .gitignore files in a different code path.
Note that a subdirectory .gitignore cannot refer to something above it: nothing in dir1/.gitignore can change whether ./foo or dir2/foo is ignored or not. So that part is not a problem.
The part that is a problem is that, in dir1, the entry:
*.txt
implies that the top level should not only ignore untracked dir1/*.txt files, but also ignore dir1/sub/*.txt files, dir1/sub/sub2/*.txt, and so on. However, a dir1 entry reading:
sub/*.txt
means that the top level should ignore only untracked dir1/sub/*.txt files, without ignoring any dir1/sub/sub2/*.txt files!
You may be able to salvage this with yet more code: while reading a subdirectory .gitignore, check to see if there are embedded slashes in any given line. An embedded slash is one that is not the final slash, because final slashes are removed for this particular differentiation.
If the entry contains an embedded slash, it applies only to the full-path-relative-to-the-subdirectory. You can therefore add dir1/ in front and be done, e.g.:
dir1/foo/*.txt
If the entry does not contain an embedded slash, it applies to the subdirectory and all of its nested sub-subdirectories. You will need to allow for any arbitrary number of subdirectories. This might be correct, but it's quite untested:
dir1/*.txt
dir1/**/*.txt
(In theory **/ should also match the empty list of subdirectories, so only the second line should be needed, but in practice I have seen this not happen for some cases. I do not recall whether this was in other pathspecs, .gitignore files, or both.)
In general, most .gitignore entries seem not to contain embedded slashes, so any successful script you write will probably produce a nearly double-length "flattened" ignore file, compared to its input length.
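The rules described in this answer could be sketched roughly as follows. This is an untested-in-anger illustration, not a complete gitignore implementation: flatten_line and flatten_tree are hypothetical helper names, escaped entries such as \# or \! are not handled, and entries without an embedded slash are emitted twice, as suggested above.

```shell
#!/bin/sh
# Sketch only: flatten_line and flatten_tree are hypothetical helper names.

# flatten_line DIR LINE
# Print LINE (one entry from DIR/.gitignore) as seen from the top level.
flatten_line() {
  fl_dir=$1 fl_line=$2 fl_neg=''
  case $fl_line in
    ''|'#'*) return 0 ;;                 # skip blank lines and comments
  esac
  if [ "$fl_dir" = . ]; then             # top-level file: copy straight through
    printf '%s\n' "$fl_line"
    return 0
  fi
  case $fl_line in
    '!'*) fl_neg='!'; fl_line=${fl_line#?} ;;   # keep the negation in front
  esac
  case ${fl_line%/} in                   # a final slash is not "embedded"
    */*)
      # Embedded slash: the entry is relative to DIR only.
      printf '%s%s/%s\n' "$fl_neg" "$fl_dir" "${fl_line#/}" ;;
    *)
      # No embedded slash: applies in DIR and at any depth below it.
      printf '%s%s/%s\n' "$fl_neg" "$fl_dir" "$fl_line"
      printf '%s%s/**/%s\n' "$fl_neg" "$fl_dir" "$fl_line" ;;
  esac
}

# flatten_tree ROOT: apply flatten_line to every .gitignore under ROOT.
flatten_tree() {
  (
    cd "$1" || exit 1
    find . -name .gitignore | sort | while IFS= read -r ft_file; do
      ft_dir=${ft_file%/.gitignore}      # "./dir1/.gitignore" -> "./dir1"
      ft_dir=${ft_dir#./}                # "./dir1" -> "dir1"; "." stays "."
      while IFS= read -r ft_line || [ -n "$ft_line" ]; do
        flatten_line "$ft_dir" "$ft_line"
      done < "$ft_file"
    done
  )
}
```

Running flatten_tree on the top-level directory of the example repository should give some_path unchanged, followed by two lines each for yet_another_path and *.txt, which is the "nearly double-length" output mentioned above.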
You can produce a complete list of ignore patterns, with directory prefix like this:
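#!/usr/bin/env sh
# Print every pattern from every .gitignore, prefixed with its directory.
# Note: -printf is a GNU find extension.
find \
  . \
  -type f \
  -name '.gitignore' \
  -printf '%h\n' \
  | while IFS= read -r dir_name; do
    # Keep only lines that do not start with '#' or whitespace.
    sed -n '/^[^#[:space:]]/p' "$dir_name/.gitignore" \
      | while IFS= read -r pattern; do
        # Everything is quoted so patterns such as *.txt are not expanded by the shell.
        printf '%s/%s\n' "$dir_name" "$pattern"
      done
  done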
The above code will just list all patterns found in .gitignore files across directories, and add the directory as prefix of each pattern.
It does not reflect gitignore syntax and behavior that is described here in git documentation: https://git-scm.com/docs/gitignore

Bash: find references to filenames in other files

Problem:
I have a list of filenames, filenames.txt:
Eg.
/usr/share/important-library.c
/usr/share/youneedthis-header.h
/lib/delete/this-at-your-peril.c
I need to rename or delete these files and I need to find references to these files in a project directory tree: /home/noob/my-project/ so I can remove or correct them.
My thought is to use bash to extract the filename: basename filename, then grep for it in the project directory using a for loop.
FILELISTING=listing.txt
PROJECTDIR=/home/noob/my-project/
for f in $(cat "$FILELISTING"); do
extension=$(basename ${f##*.})
filename=$(basename ${f%.*})
pattern="$filename"\\."$extension"
grep -r "$pattern" "$PROJECTDIR"
done
I could royally screw up this project -- does anyone see a flaw in my logic; better: do you see a more reliable scalable way to do this over a huge directory tree? Let's assume that revision control is off the table ( it is, in fact ).
A few comments:
Instead of
for f in $(cat "$FILELISTING") ; do
...
done
it's somewhat safer to write
while IFS= read -r f ; do
...
done < "$FILELISTING"
That way, your code will have no problem with spaces, tabs, asterisks, and so on in the filenames (though it still won't support newlines).
Your goal in separating f into extension and filename, and then reassembling them with \., seems to be that you want the filename to be treated as a literal string; right? Like, you're worried that grep will treat the . as meaning "any character" rather than as "one dot". A more general solution is to use grep's -F option, which tells it to treat the pattern as a fixed string rather than a regex:
grep -r -F "$f" "$PROJECTDIR"
Your introduction mentions using basename, but then you don't actually use it. Is that intentional?
If your non-use of basename is intentional, then filenames.txt really just contains a list of patterns to search for; you don't even need to write a loop, in this case, since grep's -f option tells it to take a newline-separated list of patterns from a file:
grep -r -F -f "$FILELISTING" "$PROJECTDIR"
You should back up your project, using something like tar -czf backup.tar.gz "$PROJECTDIR". "Revision control is off the table" doesn't mean you can't have a rollback strategy!
Edited to add:
To pass all your base-names to grep at once, in the hopes that it can do something smarter with them than just looping over them just as though the calls were separate, you can write something like:
grep -r -F "$(sed 's#.*/##g' "$FILELISTING")" "$PROJECTDIR"
(I used sed rather than while+basename for brevity's sake, but you can put an entire loop inside the "$(...)" if you prefer.)
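Putting the pieces of this answer together, here is a small self-contained sketch that can be run against a throwaway tree before trying anything on the real project. The temporary directory, the sample files and the variable names (work, hits) are invented for the demonstration; -l is added so that grep prints only the names of files that contain a reference.

```shell
#!/bin/sh
# Throwaway project and file listing (hypothetical contents).
work=$(mktemp -d)
mkdir -p "$work/project/src"
printf '%s\n' '/usr/share/important-library.c' \
              '/usr/share/youneedthis-header.h' > "$work/filenames.txt"
printf '%s\n' '#include "youneedthis-header.h"' > "$work/project/src/main.c"
# Would match if the dot were treated as a regex wildcard; -F prevents that.
printf '%s\n' '/* youneedthis-headerXh */' > "$work/project/src/other.c"

# Strip the directory part so only base names are searched for.
sed 's#.*/##' "$work/filenames.txt" > "$work/basenames.txt"

# -F: fixed strings, -f: read patterns from a file, -l: list matching files.
hits=$(cd "$work" && grep -r -l -F -f basenames.txt project)
printf '%s\n' "$hits"   # project/src/main.c
```

One thing to watch for: an empty line in the pattern file matches every line, so it may be worth checking the listing for blank lines or entries ending in a slash first.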
This is a job for an IDE.
You're right that this is a perilous task, and unless you know the build process and the search directories and the order of the directories, you really can't say what header is with which file.
Let's take something as simple as this:
# include "sql.h"
You have a file in the project headers/sql.h. Is that file needed? Maybe it is. Maybe not. There's also a /usr/include/sql.h. Maybe that's the one that's actually used. You can't tell without looking at the Makefile and seeing the order of the include directories which is which.
Then, there are the libraries that get included and may need their own header files in order to be able to compile. And, once you get to the C preprocessor, you really will have a hard time.
This is a task for an IDE (Integrated Development Environment). An IDE builds the project and tracks file and other resource dependencies. In the Java world, most people use Eclipse, and there is a C/C++ plugin for those developers. However, there are over 2 dozen listed in Wikipedia and almost all of them are open source. The best one will depend upon your environment.

why doesn't *.abc match a file named .abc?

I thought I understood wildcards, till this happened to me. Essentially, I'm looking for a wild card pattern that would return all files that are not named .gitignore. I came up with this, which seems to work for all cases I could conjure:
ls *[!{gitignore}]
To really validate if this works, I thought I'd negate the expression and see if it returns the file named .gitignore (actually any file that ended with gitignore; so 1.gitignore should also be returned). To that effect, I thought the negated expression would be:
ls *[{gitignore}]
However, this expression doesn't return a file named .gitignore (although it returns a file named 1.gitignore).
Essentially, my question, after simplification, boils down to:
Why doesn't *.abc match a file that is named .abc
I think I can take it from there.
PS:
I am working on Mac OSX Lion (10.7.4)
I wanted to add a clause to .gitignore such that I would ignore every file, except .gitignore in a given folder. So I ended up adding * in the .gitignore file. Result was, git ended up ignoring .gitignore :)
From the numerous searches I've made on google - Use the asterisk character (*) to represent zero or more characters.
I assume you're using Bash. From the Bash manual:
When a pattern is used for filename expansion, the character ‘.’ at the start of a filename or immediately following a slash must be matched explicitly, unless the shell option dotglob is set.
.gitignore patterns, however, are treated differently:
Otherwise, git treats the pattern as a shell glob suitable for consumption by fnmatch(3) with the FNM_PATHNAME flag: wildcards in the pattern will not match a / in the pathname.
According to the fnmatch(3) docs, a leading dot has to be explicitly matched only if the FNM_PERIOD flag is set, so *gitignore as a gitignore pattern would match .gitignore.
There is an easier way to accomplish this, though. To have .gitignore ignore everything except .gitignore:
*
!.gitignore
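The Bash rule quoted above can be checked with a quick experiment. This is a sketch under the assumption that bash is on the PATH; the globs run in a child bash so that the dotglob setting does not leak into the current shell, and the variable names (demo, default_matches, dotglob_matches) are made up.

```shell
#!/bin/sh
# Throwaway directory with one dotfile and one ordinary file.
demo=$(mktemp -d)
touch "$demo/.abc" "$demo/1.abc"

# Default behaviour: a leading dot must be matched explicitly,
# so *.abc finds only 1.abc.
default_matches=$(cd "$demo" && bash -c 'shopt -u dotglob; printf "%s\n" *.abc')

# With dotglob set, * may also match the leading dot,
# so *.abc finds both .abc and 1.abc.
dotglob_matches=$(cd "$demo" && bash -c 'shopt -s dotglob; printf "%s\n" *.abc')

printf 'default:\n%s\n' "$default_matches"
printf 'dotglob:\n%s\n' "$dotglob_matches"
```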
If you want to ignore everything except the gitignore file, use this as the file:
*
!.gitignore
Lines starting with an exclamation point are interpreted as exceptions.

Linux shell file listing: what's the difference between tmp/**/* and tmp/*?

I have run into a problem with file globbing in the shell.
What is the difference between tmp/**/* and tmp/*?
I ran an experiment on my system
with this directory, dir2:
dir2
  -->dir1
     -->xx2
  -->ff.txt
and I run ls dir2/*:
dir2/ff.txt
dir2/dir1:
xx2
then I run ls dir2/**/*:
dir2/dir1/xx2
So does this mean that ** skips a directory level (here, dir1)?
Can someone help me?
I think there's a formatting issue in the question text, but I'll answer based on the question title and examples.
There shouldn't be any difference between a single and double asterisk at any single level of the path. Either expression matches any name, except for hidden ones which start with a dot (this can be changed by shell options). So:
tmp/**/* (equivalent to tmp/*/*) is expanded to all names which are nested two levels deep in tmp. The first asterisk expands only to directories and not files at the first level because it's followed by a slash.
tmp/* expands to anything nested one level deep inside tmp.
On top of this, ls lists the contents of a directory when a directory is given on its command line. This can be overridden by adding the -d option to ls.
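The expansions described above can be reproduced with the layout from the question. This is a sketch that assumes bash is on the PATH; each glob runs in a child bash with default options. The last line goes slightly beyond the answer: in Bash 4 and later, the globstar shell option gives ** a recursive meaning, which is the one case where the two patterns differ. The variable names are invented for the demonstration.

```shell
#!/bin/sh
# Recreate the layout from the question in a throwaway directory.
demo=$(mktemp -d)
mkdir -p "$demo/dir2/dir1"
touch "$demo/dir2/ff.txt" "$demo/dir2/dir1/xx2"

# One level deep: dir2/dir1 and dir2/ff.txt.
one_level=$(cd "$demo" && bash -c 'printf "%s\n" dir2/*')

# With default options, ** behaves like *: both give dir2/dir1/xx2.
double_star=$(cd "$demo" && bash -c 'printf "%s\n" dir2/**/*')
two_levels=$(cd "$demo" && bash -c 'printf "%s\n" dir2/*/*')

# ls -d names the directory itself instead of listing its contents.
ls_d=$(cd "$demo" && ls -d dir2/*)

# Bash 4+ only: with globstar, **/ matches zero or more directory levels.
with_globstar=$(cd "$demo" && bash -c 'shopt -s globstar; printf "%s\n" dir2/**/*')

printf '%s\n' "$with_globstar"
```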

Makefile problem with files beginning with "#"

I have a directory "FS2" that contains the following files:
#ARGH#
this
that
I have a makefile with the following contents.
Template:sh= ls ./FS2/*
#all: $(Template)
echo "Template is: $(Template)"
touch all
When I run "clearmake -C sun" and the file "all" does not exist, I get the following output:
"Template is: ./FS2/#ARGH# ./FS2/that ./FS2/this"
Modifying either "this" or "that" does not cause "all" to be regenerated. When run with "-d" for debug, the "all" target is only dependent on the directory "./FS2", not the three files in the directory. I determined that when it expands "Template", the "#" gets treated as the beginning of a comment and the rest of the line is ignored!
The problem is caused by an editor that when killed leaves around files that begin with "#". If one of those files exists, then no modifications to files in the directory causes "all" to be regenerated.
Although, I do not want to make compilation dependent on whether a temporary file has been modified or not and will remove the file from the "Template" variable, I am still curious as to how to get this to work if I did want to treat the "#ARGH#" as a filename that the rule "all" is dependent on. Is this even possible?
I have a directory "FS2" that contains the following files: #ARGH# ...
Therein lies your problem. In my opinion, it is unwise to use "funny" characters in filenames. Now I know that those characters are allowed, but that doesn't make them a good idea (ASCII control characters like backspace are also allowed, with similarly annoying results).
I don't even like spaces in filenames, preferring instead SomethingLikeThis to show independent words in a file name, but at least the techniques for handling spaces in many UNIX tools are reasonably well known.
My advice would be to rename the file if it was one of yours and save yourself some angst. But, since they're temporary files left around by an editor crash, delete them before your rules start running in the makefile. You probably shouldn't be rebuilding based on an editor temporary file anyway.
Or use a more targeted template like: Template:sh= ls ./FS2/[A-Za-z0-9]* to bypass those files altogether (that's an example only; you should ensure it doesn't falsely exclude files that should be included).
'#' is a valid Makefile comment char, so the second line is ignored by the make program.
Can you filter out (with grep) the files that start with # and process them separately?
I'm not familiar with clearmake, but try replacing your template definition with
Template:sh= ls ./FS2/* | grep -v '#'
so that filenames containing # are not included in $(Template).
If clearmake follows the same rules as GNU make, then you can also re-write your target using something like Template := $(wildcard *.c) which will be a little more intelligent about files with oddball names.
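Before putting the ls ./FS2/* | grep -v '#' pipeline into the makefile, it can be checked on its own in a shell. The sketch below recreates the directory from the question in a temporary location; the variable names (demo, all_files, filtered) are made up for the illustration.

```shell
#!/bin/sh
# Recreate the FS2 directory from the question in a throwaway location.
demo=$(mktemp -d)
mkdir "$demo/FS2"
touch "$demo/FS2/#ARGH#" "$demo/FS2/this" "$demo/FS2/that"

# Unfiltered listing: includes ./FS2/#ARGH#.
all_files=$(cd "$demo" && ls ./FS2/*)

# Filtered listing: drops every name that contains a '#'.
filtered=$(cd "$demo" && ls ./FS2/* | grep -v '#')

printf '%s\n' "$filtered"
```

Note that grep -v '#' removes any path with a # anywhere in it, not only editor leftovers, so a stricter pattern may be needed if legitimate files can contain that character.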
If I really want the file #ARGH# to contribute to whether the target all should be rebuilt as well as be included in the artifacts produced by the rule, the Makefile should be modified so that the line
Template:sh= ls ./FS2/*
is changed to
Template=./FS2/*
Template_files:sh= ls $(Template)
This works because $(Template) is replaced by the literal string ./FS2/* both in the dependency list after all: and in the expansion of $(Template_files).
Clearmake (and GNU make) then treats ./FS2/* as a pathname containing a wildcard when evaluating the dependencies, which expands into the filenames ./FS2/#ARGH# ./FS2/that ./FS2/this, and $(Template_files) can be used in the rules where a list of filenames is needed.
