GNU find: apply -prune to directories which match a pattern in external file - bash

I wonder if there is a more efficient way to obtain directory patterns for use with -prune from an external file:
find . \( -type d -a -exec sh -c "echo \"{}\" | grep -qEx -f patterns.prune" \; \) -prune -o \( <further checks> \)
This works, but it is of course very slow because it spawns a shell and a pipe for every directory tested. So is there a more elegant way than the above, or do I really have to chain the lines of the pattern file as command-line switches for find?
Thanks.

You could try to pipe to grep at the end of the run, to only invoke it once, i.e. something like:
find . <your_other_conditions> | grep -v -f patterns.prune
This may not suit your particular case, since it will now A) traverse everything under the pruned directories as well (though you can still filter those results out by tweaking patterns.prune) and B) take control away from find, so that you can't use find's built-in actions (e.g. -exec) on the results.
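If the lines of patterns.prune are extended regular expressions meant to match the whole path (as grep -Ex does in the question), one possible way to avoid both the per-directory shell and a long chain of switches is to join the lines into a single alternation and pass it to GNU find's -regex. This is only a sketch: the temporary tree, the pattern file contents, and the variable names are made up for illustration, and it assumes GNU find (for -regextype) and patterns without unbalanced parentheses.

```shell
# Hypothetical demo tree and pattern file (not from the original question)
demo=$(mktemp -d)
mkdir -p "$demo/tree/src" "$demo/tree/build/obj" "$demo/tree/cache"
touch "$demo/tree/src/a.c" "$demo/tree/build/obj/a.o" "$demo/tree/cache/x"
printf '%s\n' '\./build' '\./cache' > "$demo/patterns.prune"

# Join all pattern lines into one alternation: \./build|\./cache
regex=$(paste -sd'|' "$demo/patterns.prune")

# -regex matches the whole path, like grep -x; matching directories are pruned
kept=$(cd "$demo/tree" &&
    find . -regextype posix-extended -type d -regex "($regex)" -prune \
        -o -type f -print | LC_ALL=C sort)

printf '%s\n' "$kept"   # prints ./src/a.c
rm -rf "$demo"
```

find is still invoked only once, and further tests or actions such as -exec can replace the final -type f -print.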

Related

Find Command Exclude Hidden files when using empty flag

I am looking for a way to use the find command to tell if a folder has no files in it. I have tried using the -empty flag, but since I am on macOS the system files the OS places in the directory such as .DS_Store cause find to not consider the directory empty. I have tried telling find to ignore .DS_Store but it still considers the directory not empty because that file is present.
Is there a way to have find exclude certain files from what it considers -empty? Also is there a way to have find return a list of directories with no visible files?
The -empty predicate is rather simple: it's true for a directory only if it has no entries other than . and .. (so a single .DS_Store file is enough to make it false).
Kind of an ugly solution, but you can use -exec to run another find in each directory which will implement your criteria for deciding what directories you want to include.
Below:
the outer find will execute sh -c for each directory in /starting/point
sh will execute another find with different criteria.
the inner find will print the first match and then quit
read will consume the output (if any) of the inner find. read will have an exit status of 0 only if the inner find printed at least one line, non-zero otherwise
if there was no output from the inner find, the outer find's -exec predicate will evaluate to false
since -exec is followed by -o, the following -print action will be executed only for those directories which do not match the inner find's criteria
find /starting/point \
-type d \( \
-exec sh -c \
'find "$1" -mindepth 1 -maxdepth 1 ! -name ".*" -print -quit | read' \
sh {} \; \
-o -print \
\)
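To try the nested-find approach safely, here is a small sketch that runs it against a throwaway tree. The directory names are hypothetical, and the inner pipeline uses read -r _ rather than a bare read, on the assumption that some sh implementations (dash, for example) reject read without a variable name.

```shell
# Hypothetical demo tree: one directory with only a hidden file,
# one with a visible file, and one that is truly empty
demo=$(mktemp -d)
mkdir -p "$demo/hidden_only" "$demo/has_file" "$demo/truly_empty"
touch "$demo/hidden_only/.DS_Store" "$demo/has_file/notes.txt"

# Print directories that contain no visible (non-dot) entries
no_visible=$(cd "$demo" && find . -type d \( \
    -exec sh -c \
        'find "$1" -mindepth 1 -maxdepth 1 ! -name ".*" -print -quit | read -r _' \
        sh {} \; \
    -o -print \
    \) | LC_ALL=C sort)

printf '%s\n' "$no_visible"
rm -rf "$demo"
```

The top-level directory itself is not listed, because its subdirectories count as visible entries.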
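Also note that 'find FOLDER -empty' is somewhat tricky: -empty also matches empty regular files, so the command prints any empty files (and empty subdirectories) inside FOLDER even when FOLDER itself is not empty. Output from it therefore does not prove that FOLDER is empty.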
Maybe not exactly what was asked, but I prefer the brute force approach if I want to avoid a no-match error on using FOLDER/*. In tcsh:
ls -d FOLDER/* >& /dev/null
if !($status) COMMANDS FOLDER/* ...
A variation of this might be usable here (like also using
ls -d FOLDER/.* | wc -l
and drawing the desired conclusions from the combined results).

Recursive grep within specific subdirectories

I wish to grep certain files that only exist below a specific subdirectory. Here, I only want .xml files if they exist below the /bar/ subdirectory:
./file0.xml
./.git/file1a.xml
./.git/bar/file1b.xml
./.svn/file2a.xml
./.svn/foo/bar/baz/file2b.xml
./path1/file3.xml
./path1/foo/file4.xml
./path2/foo/bar/file5.xml
./path2/foo/baz/file6.xml
./path3/bar/file7.xml
./path3/foo/bar/baz/file8.xml
I want only the following files to be grepped: file5.xml, file7.xml, file8.xml
To exclude .git and .svn, I came up with:
grep -r --exclude-dir={.git,.svn} --include=\*.xml "pattern"
which still searches file3.xml through file8.xml.
If I filter the output down to the desired directory:
grep -r --exclude-dir={.git,.svn} --include=\*.xml "pattern" | grep /bar/
I get the desired results, but it spends a lot of time parsing the non-/bar/ files.
Using find to find the xml files under /bar/, I get the desired results (and it's much faster than the above result):
find . -type d \( -name .git -o -name .svn \) -prune -o \
-path '*/bar/*.xml' -exec grep "pattern" {} +
I'm trying to avoid this, however, as I use this within a script and don't want to be limited to starting the search in the top ./ directory.
Is there a way to accomplish this using only grep (so it doesn't prevent the user from specifying additional grep options and/or starting search directories)? Perhaps something like:
grep -r --exclude-dir={.git,.svn} --include=\*/bar/\*.xml "pattern"
find+grep is definitely a good approach. You could make it more flexible by defining a function that inserts arguments in strategic places. For example:
search() {
    local dir=$1
    local pattern=$2
    local args=("${@:3}")    # any extra grep options
    find "$dir" -type d \( -name .git -o -name .svn \) -prune -o \
        -path '*/bar/*.xml' -exec grep "${args[@]}" "$pattern" {} +
}

Why is my `find` command giving me errors relating to ignored directories?

I have this find command:
find . -type f -not -path '**/.git/**' -not -path '**/node_modules/**' | xargs sed -i '' s/typescript-library-skeleton/xxx/g;
for some reason it's giving me these warnings/errors:
find: ./.git/objects/3c: No such file or directory
find: ./.git/objects/3f: No such file or directory
find: ./.git/objects/41: No such file or directory
I even tried using:
-not -path '**/.git/objects/**'
and got the same thing. Anybody know why the find is searching in the .git directory? Seems weird.
why is the find searching in the .git directory?
GNU find is clever and supports several optimizations over a naive implementation:
It can flip the order of -size +512b -name '*.txt' and check the name first, because querying the size will require a second syscall.
It can count the hard links of a directory to determine the number of subdirectories, and once it has seen them all it no longer needs to check the remaining entries for -type d or for recursing.
It can even rewrite (-B -or -C) -and -A so that if the checks are equally costly and free of side effects, the -A will be evaluated first, hoping to reject the file after 1 test instead of 2.
However, it is not yet clever enough to realize that -not -path '*/.git/*' means that if you find a directory .git then you don't even need to recurse into it because all files inside will fail to match.
Instead, it dutifully recurses, finds each file and matches it against the pattern as if it was a black box.
To explicitly tell it to skip a directory entirely, you can instead use -prune. See How to exclude a directory in find . command
Both more efficient and more correct would be to avoid the default -print action, change -not -path ... to -prune, and ensure that xargs is only used with NUL-delimited input:
find . -name .git -prune -o \
-name node_modules -prune -o \
-type f -print0 | xargs -0 sed -i '' s/typescript-library-skeleton/xxx/g
Note the following points:
We use -prune to tell find to not even recurse down the undesired directories, rather than -not -path ... to tell it to discard names in those directories after they were found.
We put the -prunes before the -type f, so we're able to match directories for pruning.
We have an explicit action, not depending on the default -print. This is important because the default -print effectively comes with a set of parentheses: if no explicit action is given, find ... behaves like find '(' ... ')' -print, not like find ... -print.
We use xargs only with the -0 argument enabling NUL-delimited input, and the -print0 action on the find side to generate a NUL-delimited list of names. NUL is the only character which cannot be present in an arbitrary file path (yes, newlines can be present) -- and thus the only character which is safe to use to separate paths. (If the -0 extension to xargs and the -print0 extension to find are not guaranteed to be available, use -exec sed -i '' ... {} + instead).
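The points above can be checked on a throwaway tree. The sketch below is an illustration only: the directory layout and file names are invented, and it swaps sed -i for grep -l so that nothing is modified and the GNU/BSD difference in sed -i syntax does not matter.

```shell
# Hypothetical demo tree with content in .git, node_modules and src
demo=$(mktemp -d)
mkdir -p "$demo/.git/objects" "$demo/node_modules/pkg" "$demo/src"
echo 'typescript-library-skeleton' > "$demo/.git/objects/3c"
echo 'typescript-library-skeleton' > "$demo/node_modules/pkg/index.js"
echo 'typescript-library-skeleton' > "$demo/src/my file.ts"   # space in name

# Prune the unwanted directories, then pass NUL-delimited names to xargs
matches=$(cd "$demo" &&
    find . -name .git -prune -o \
        -name node_modules -prune -o \
        -type f -print0 | xargs -0 grep -l 'typescript-library-skeleton')

printf '%s\n' "$matches"   # prints ./src/my file.ts
rm -rf "$demo"
```

Because the names are NUL-delimited, the file name containing a space reaches grep as a single argument.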

Bash - Excluding subdirectories using the find command [duplicate]

This question already has answers here:
How do I exclude a directory when using `find`?
(46 answers)
Closed 7 years ago.
I'm using the find command to get a list of folders where certain files are located. But because of a permission denied error for certain subdirectories, I want to exclude a certain subdirectory name.
I already tried these solutions I found here:
find /path/to/folders -path "*/noDuplicates" -prune -type f -name "fileName.txt"
find /path/to/folders ! -path "*/noDuplicates" -type f -name "fileName.txt"
And some variations for these commands (variations on the path name for example).
In the first case it won't find a folder at all, in the second case I get the error again, so I guess it still tries to access this directory. Does anyone know what I'm doing wrong or does anyone have a different solution for this?
To complement olivm's helpful answer and address the OP's puzzlement at the need for -o:
-prune, like every find primary (action or test, in GNU speak), returns a Boolean, and that Boolean is always true in the case of -prune.
Without explicit operators, primaries are implicitly connected with -a (-and), which, like its brethren -o (-or) performs short-circuiting Boolean logic.
-a has higher precedence than -o.
For a summary of all find concepts, see https://stackoverflow.com/a/29592349/45375
Thus, the accepted answer,
find . -path ./ignored_directory -prune -o -name fileName.txt -print
is equivalent to (parentheses are used to make the evaluation precedence explicit):
find . \( -path ./ignored_directory -a -prune \) \
-o \
\( -name fileName.txt -a -print \)
Since short-circuiting applies, this is evaluated as follows:
an input path matching ./ignored_directory causes -prune to be evaluated; since -prune always returns true, short-circuiting prevents the right side of the -o operator from being evaluated; in effect, nothing happens (the input path is ignored)
an input path NOT matching ./ignored_directory, instantly - again due to short-circuiting - continues evaluation on the right side of -o:
only if the filename part of the input path matches fileName.txt is the -print primary evaluated; in effect, only input paths whose filename matches fileName.txt are printed.
Edit: In spite of what I originally claimed here, -print IS needed on the right-hand side of -o here; without it, the implied -print would apply to the entire expression and thus also print for left-hand side matches; see below for background information.
By contrast, let's consider what mistakenly NOT using -o does:
find . -path ./ignored_directory -prune -name fileName.txt -print
This is equivalent to:
find . -path ./ignored_directory -a -prune -a -name fileName.txt -a -print
This will only print pruned paths (that also match the -name filter), because the -name and -print primaries are (implicitly) connected with logical ANDs;
in this specific case, since ./ignored_directory cannot also match fileName.txt, nothing is printed, but if -path's argument is a glob, it is possible to get output.
A word on find's implicit use of -print:
POSIX mandates that if a find command's expression as a WHOLE does NOT contain either
output-producing primaries, such as -print itself
primaries that execute something, such as -exec and -ok
(the example primaries given are exhaustive for the POSIX spec. of find, but real-world implementations such as GNU find and BSD find add others, such as the output-producing -print0 primary, and the executing -execdir primary)
that -print be applied implicitly, as if the expression had been specified as:
\( expression \) -print
This is convenient, because it allows you to write commands such as find ., without needing to append -print.
However, in certain situations an explicit -print is needed, as is the case here:
Let's say we didn't specify -print at the end of the accepted answer:
find . -path ./ignored_directory -prune -o -name fileName.txt
Since there's now no output-producing or executing primary in the expression, it is evaluated as:
find . \( -path ./ignored_directory -prune -o -name fileName.txt \) -print
This will NOT work as intended, as it will print paths if the entire parenthesized expression evaluates to true, which in this case mistakenly includes the pruned directory.
By contrast, by explicitly appending -print to the -o branch, paths are only printed if the right-hand side of the -o expression evaluates to true; using parentheses to make the logic clearer:
find . -path ./ignored_directory -prune -o \( -name fileName.txt -print \)
If, by contrast, the left-hand side is true, only -prune is executed, which produces no output (and since the overall expression contains a -print, -print is NOT implicitly applied).
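The difference between the implicit and the explicit -print can be seen directly on a small tree. This is a sketch with a made-up layout; only the two find commands come from the discussion above.

```shell
# Hypothetical demo tree: fileName.txt exists both inside and outside
# the directory that should be ignored
demo=$(mktemp -d)
mkdir -p "$demo/ignored_directory" "$demo/other"
touch "$demo/ignored_directory/fileName.txt" "$demo/other/fileName.txt"

# No explicit action: the implied -print wraps the whole expression,
# so the pruned directory itself is printed too
without_print=$(cd "$demo" &&
    find . -path ./ignored_directory -prune -o -name fileName.txt |
    LC_ALL=C sort)

# Explicit -print on the right-hand side of -o: only real matches print
with_print=$(cd "$demo" &&
    find . -path ./ignored_directory -prune -o -name fileName.txt -print)

printf 'without -print:\n%s\n' "$without_print"
printf 'with -print:\n%s\n' "$with_print"
rm -rf "$demo"
```

In both runs the file inside ignored_directory is never visited; only the first run also reports the pruned directory itself.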
Following my previous comment, this works on my Debian :
find . -path ./ignored_directory -prune -o -name fileName.txt -print
or
find /path/to/folder -path "*/ignored_directory" -prune -o -name fileName.txt -print
or
find /path/to/folder -name fileName.txt -not -path "*/ignored_directory/*"
The differences are nicely debated here
Edit (added behavior specification details)
Pruning all permission denied directories in find
Using GNU find.
Specification behavior details - in this solution we want to:
exclude unreadable directories contents (prune them),
avoid "permission denied" errors coming from unreadable directories,
keep the other errors and return states, but
process all files (even unreadable files, if we can read their names)
The basic design pattern is:
find ... \( -readable -o -prune \) ...
Example
find /var/log/ \( -readable -o -prune \) -name "*.1"
\thanks{mklement0}
The problem is in the way find evaluates the pattern you pass to -path: it is matched against the whole path, so "*/noDuplicates" matches only the directory itself, not the files beneath it.
Instead, you should try something like:
find /path/to/folders ! -path "*noDuplicates*" -type f -name "fileName.txt"

Executing grep on multiple find results

My question is rather similar to this one, except that I'm executing a grep search on multiple find queries. (I have to do this because I have to submit my command to the live servers, and I'd like to tinker with them as little as possible.)
Here is my query:
find /c/some/dir/ -iname "*html" -o -iname "*tpl" -exec grep -inH 'search_string' {} \;
With the -o option, the grep search returns all of the instances of "search_string" in the files that end with tpl. It completely ignores the html extensions I passed in...
Has anyone encountered this? How do I tell find to execute the grep on both html and tpl extensions?
(I'm running Cygwin, which has had some Windows translation issues in the past, so that may be a culprit...)
I think you need to group the two -iname clauses, like this:
find /c/some/dir/ \( -iname "*html" -o -iname "*tpl" \) -exec grep -inH 'search_string' {} \;
The logical or (-o) has a lower precedence than the implicit and (-a) that joins adjacent primaries, which means the -exec bits only apply to your -iname "*tpl" clause.
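A quick way to see the precedence effect is to run both forms against two small files. The sketch below uses an invented temporary directory and file names, and grep -l instead of grep -inH so the output is just the list of files that were searched and matched.

```shell
# Hypothetical demo files, both containing the search string
demo=$(mktemp -d)
echo 'search_string' > "$demo/a.html"
echo 'search_string' > "$demo/b.tpl"

# Ungrouped: parsed as  -iname "*html"  -o  ( -iname "*tpl" -a -exec ... )
ungrouped=$(cd "$demo" &&
    find . -iname "*html" -o -iname "*tpl" -exec grep -l 'search_string' {} \;)

# Grouped: -exec applies to files matching either pattern
grouped=$(cd "$demo" &&
    find . \( -iname "*html" -o -iname "*tpl" \) \
        -exec grep -l 'search_string' {} \; | LC_ALL=C sort)

printf 'ungrouped:\n%s\n' "$ungrouped"   # only ./b.tpl
printf 'grouped:\n%s\n' "$grouped"       # ./a.html and ./b.tpl
rm -rf "$demo"
```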
