Find's cost-based optimiser breaks short-circuit evaluation - gnu-findutils

Quoting the man page of find (GNU findutils 4.7.0, emphasis mine):
GNU find searches the directory tree rooted at each given starting-point by evaluating the given expression from left to right, according to the rules of precedence (see section OPERATORS), until the outcome is known (the left hand side is false for and operations, true for or), at which point find moves on to the next file name.
Therefore, when find evaluates <expr1> -and <expr2>, I would expect <expr2> not to be evaluated unless <expr1> is true, and I rely on that to avoid some error messages. Specifically, I do not want find to test whether a non-readable directory is empty. Here is an SSCCE:
mkdir some_dir
chmod 333 some_dir
find * -readable ! -empty -printf "yes" -or -printf "no" -prune
which yields
find: ‘some_dir’: Permission denied
no
Adding, otherwise implicit, -and and parentheses, the expression evaluated by find should be equivalent to
( ( -readable -and (! -empty ) ) -and -printf "yes" ) -or ( -printf "no" -and -prune )
Hence, after realising that some_dir is not readable, find should forgo the emptiness test and the evaluation of -printf "yes". Instead, it should jump to the evaluation of -printf "no" and finally -prune. The "Permission denied" in the output suggests it's evaluating -empty anyway. (Removing ! -empty from the original expression makes the error go away.)
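The grouping itself can be checked on an expression with no permission-sensitive tests. This is a minimal sketch, assuming a scratch directory from mktemp and the hypothetical entry names d1 and f1; -exec echo stands in for -printf so that it is visible which branch ran for each path.

```sh
tmp=$(mktemp -d)
mkdir "$tmp/d1"
: > "$tmp/f1"

# Implicit -and binds tighter than -o ...
implicit=$(cd "$tmp" && find . -type d -exec echo yes {} \; -o -exec echo no {} \; | LC_ALL=C sort)

# ... so this explicit grouping should be the same expression.
explicit=$(cd "$tmp" && find . \( -type d -a -exec echo yes {} \; \) -o \( -exec echo no {} \; \) | LC_ALL=C sort)

printf '%s\n' "$implicit"
rm -rf "$tmp"
```

Directories take the "yes" branch and never reach the right-hand side of -o; everything else falls through to "no". Both spellings give the same lines.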
Using -D tree to inspect the evaluation tree, I see that the optimised form (edited here for the sake of brevity and clarity) is:
( ( ( ! -empty ) -and -readable ) -and -printf "yes" ) -or ( -printf "no" -and -prune )
according to which -empty is indeed evaluated and, worse, prior to -readable which completely screws up the intended logic. I reckon this is a bug. Am I right?
Update: (26-May-2020) A bug report has been submitted and it has been confirmed as a bug by the developers.

In my opinion, this is a bug in findutils' "arm-swapping" optimization, because it fails to consider that -empty and -xtype may have the side effect of causing find to report an error and exit with a non-zero status. I've reported the same issue about -xtype, which the findutils devs agreed was a bug. It's hard to work around this bug too, because findutils doesn't have a way to turn off this optimization. -O0 is equivalent to -O1 which already applies it.
If you need a workaround, I wrote a drop-in replacement for find called bfs: https://github.com/tavianator/bfs. It's fully compatible with all of GNU find's options, and doesn't have this bug.

Related

Sequential BANGs in a bash script

I'm trying to understand what the following bash script snippet is doing.
The sequential bangs ('!') are the main thing tripping me up, and searching online doesn't seem to really yield anything useful.
for file in $(find $pwd/localroot -type f ! -path '*\.git*' ! -path '*README\.md' ! -path "*?scriptname"); do
It means "not". From the find(1) man page:
! expr
True if expr is false. This character will also usually need protection from interpretation by the shell.
There are implicit ands between each of the tests.
Find files: -type f
But not inside .git directories: ! -path '*\.git*'
And ignore README.md: ! -path '*README\.md'
And ignore ?scriptname: ! -path "*?scriptname", where ? is a single character.
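To see the chained negations in action, here is a small sketch on a throwaway tree. The file names (notes.txt, src/main.c, xscriptname and so on) are made up for illustration, and the path is relative so that the scratch directory's own name cannot affect the -path matches.

```sh
tmp=$(mktemp -d)
mkdir -p "$tmp/localroot/.git" "$tmp/localroot/src" "$tmp/localroot/docs"
: > "$tmp/localroot/.git/config"
: > "$tmp/localroot/README.md"
: > "$tmp/localroot/docs/README.md"
: > "$tmp/localroot/xscriptname"
: > "$tmp/localroot/notes.txt"
: > "$tmp/localroot/src/main.c"

# Each "! -path ..." must hold for a file to be kept (implicit -and between them).
kept=$(cd "$tmp" && find localroot -type f \
        ! -path '*\.git*' ! -path '*README\.md' ! -path '*?scriptname' | LC_ALL=C sort)
printf '%s\n' "$kept"
rm -rf "$tmp"
```

Only notes.txt and src/main.c survive all three negated tests.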

Leaving out '-print' from 'find' command when '-prune' is used

I have never been able to fully understand the -prune action of the find command. But in actuality at least some of my misunderstanding stems from the effect of omitting the '-print' expression.
From the 'find' man page..
"If the expression contains no actions other than -prune, -print is performed on all files for which the expression is true."
.. which I have always (for many years) taken to mean I can leave out '-print'.
However, as the following example illustrates, there is a difference between using '-print' and omitting '-print', at least when a '-prune' expression appears.
First of all, I have the following 8 directories under my working directory..
aqua/
aqua/blue/
blue/
blue/orange/
blue/red/
cyan/blue/
green/
green/yellow/
There are a total of 10 files in those 8 directories..
aqua/blue/config.txt
aqua/config.txt
blue/config.txt
blue/orange/config.txt
blue/red/config.txt
cyan/blue/config.txt
green/config.txt
green/test.log
green/yellow/config.txt
green/yellow/test.log
My goal is to use 'find' to display all regular files not having 'blue' as part of the file's path. There are five files matching this requirement.
This works as expected..
% find . -path '*blue*' -prune -o -type f -print
./green/test.log
./green/yellow/config.txt
./green/yellow/test.log
./green/config.txt
./aqua/config.txt
But when I leave out '-print' it returns not only the five desired files, but also any directory whose path name contains 'blue'..
% find . -path '*blue*' -prune -o -type f
./green/test.log
./green/yellow/config.txt
./green/yellow/test.log
./green/config.txt
./cyan/blue
./blue
./aqua/blue
./aqua/config.txt
So why are the three 'blue' directories displayed?
This can be significant because often I'm trying to prune out a directory structure that contains more than 50,000 files. When that path is processed my find command, especially if I'm doing an '-exec grep' to each file, can take a huge amount of time processing files for which I have absolutely no interest. I need to have confidence that find is not going into the pruned structure.
The implicit -print applies to the entire expression, not just the last part of it.
% find . \( -path '*blue*' -prune -o -type f \) -print
./green/test.log
./green/yellow/config.txt
./green/yellow/test.log
./green/config.txt
./cyan/blue
./blue
./aqua/blue
./aqua/config.txt
It's not descending into the pruned directories, but it is printing out the top level.
A slight modification:
$ find . ! \( -path '*blue*' -prune \) -type f
./green/test.log
./green/yellow/config.txt
./green/yellow/test.log
./green/config.txt
./aqua/config.txt
(with implicit -a) would lead to having the same behavior with and without -print.
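The three variants can be compared side by side on a copy of the tree from the question. This is a sketch that rebuilds the directories in a scratch location and sorts the output, since find's traversal order is not guaranteed.

```sh
tmp=$(mktemp -d)
(
    cd "$tmp" || exit 1
    mkdir -p aqua/blue blue/orange blue/red cyan/blue green/yellow
    for f in aqua/blue/config.txt aqua/config.txt blue/config.txt \
             blue/orange/config.txt blue/red/config.txt cyan/blue/config.txt \
             green/config.txt green/test.log green/yellow/config.txt \
             green/yellow/test.log; do
        : > "$f"
    done
)

# Explicit -print on the right-hand side of -o: only the files are printed.
with_print=$(cd "$tmp" && find . -path '*blue*' -prune -o -type f -print | LC_ALL=C sort)

# No action at all: the implicit -print wraps the whole expression,
# so the pruned directories (left-hand side true) are printed as well.
without_print=$(cd "$tmp" && find . -path '*blue*' -prune -o -type f | LC_ALL=C sort)

# Negated prune group: the left-hand side is false for pruned directories.
negated=$(cd "$tmp" && find . ! \( -path '*blue*' -prune \) -type f | LC_ALL=C sort)

printf 'with -print:\n%s\n\nwithout -print:\n%s\n' "$with_print" "$without_print"
rm -rf "$tmp"
```

In all three cases find stops at the top of each 'blue' directory; the only difference is whether that top-level directory itself is printed.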

Bash - Excluding subdirectories using the find command [duplicate]

I'm using the find command to get a list of folders where certain files are located. But because of a permission denied error for certain subdirectories, I want to exclude a certain subdirectory name.
I already tried these solutions I found here:
find /path/to/folders -path "*/noDuplicates" -prune -type f -name "fileName.txt"
find /path/to/folders ! -path "*/noDuplicates" -type f -name "fileName.txt"
And some variations for these commands (variations on the path name for example).
In the first case it won't find a folder at all, in the second case I get the error again, so I guess it still tries to access this directory. Does anyone know what I'm doing wrong or does anyone have a different solution for this?
To complement olivm's helpful answer and address the OP's puzzlement at the need for -o:
-prune, as every find primary (action or test, in GNU speak), returns a Boolean, and that Boolean is always true in the case of -prune.
Without explicit operators, primaries are implicitly connected with -a (-and), which, like its brethren -o (-or) performs short-circuiting Boolean logic.
-a has higher precedence than -o.
For a summary of all find concepts, see https://stackoverflow.com/a/29592349/45375
Thus, the accepted answer,
find . -path ./ignored_directory -prune -o -name fileName.txt -print
is equivalent to (parentheses are used to make the evaluation precedence explicit):
find . \( -path ./ignored_directory -a -prune \) \
-o \
\( -name fileName.txt -a -print \)
Since short-circuiting applies, this is evaluated as follows:
an input path matching ./ignored_directory causes -prune to be evaluated; since -prune always returns true, short-circuiting prevents the right side of the -o operator from being evaluated; in effect, nothing happens (the input path is ignored)
an input path NOT matching ./ignored_directory, instantly - again due to short-circuiting - continues evaluation on the right side of -o:
only if the filename part of the input path matches fileName.txt is the -print primary evaluated; in effect, only input paths whose filename matches fileName.txt are printed.
Edit: In spite of what I originally claimed here, -print IS needed on the right-hand side of -o here; without it, the implied -print would apply to the entire expression and thus also print for left-hand side matches; see below for background information.
By contrast, let's consider what mistakenly NOT using -o does:
find . -path ./ignored_directory -prune -name fileName.txt -print
This is equivalent to:
find . -path ./ignored_directory -a -prune -a -name fileName.txt -a -print
This will only print pruned paths (that also match the -name filter), because the -name and -print primaries are (implicitly) connected with logical ANDs;
in this specific case, since ./ignored_directory cannot also match fileName.txt, nothing is printed, but if -path's argument is a glob, it is possible to get output.
A word on find's implicit use of -print:
POSIX mandates that if a find command's expression as a WHOLE does NOT contain either
output-producing primaries, such as -print itself
primaries that execute something, such as -exec and -ok
(the example primaries given are exhaustive for the POSIX spec. of find, but real-world implementations such as GNU find and BSD find add others, such as the output-producing -print0 primary, and the executing -execdir primary)
that -print be applied implicitly, as if the expression had been specified as:
\( expression \) -print
This is convenient, because it allows you to write commands such as find ., without needing to append -print.
However, in certain situations an explicit -print is needed, as is the case here:
Let's say we didn't specify -print at the end of the accepted answer:
find . -path ./ignored_directory -prune -o -name fileName.txt
Since there's now no output-producing or executing primary in the expression, it is evaluated as:
find . \( -path ./ignored_directory -prune -o -name fileName.txt \) -print
This will NOT work as intended, as it will print paths if the entire parenthesized expression evaluates to true, which in this case mistakenly includes the pruned directory.
By contrast, by explicitly appending -print to the -o branch, paths are only printed if the right-hand side of the -o expression evaluates to true; using parentheses to make the logic clearer:
find . -path ./ignored_directory -prune -o \( -name fileName.txt -print \)
If, by contrast, the left-hand side is true, only -prune is executed, which produces no output (and since the overall expression contains a -print, -print is NOT implicitly applied).
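The three cases discussed above can be reproduced on a scratch tree. This is a sketch; keep/ and other.txt are hypothetical names added for contrast.

```sh
tmp=$(mktemp -d)
mkdir "$tmp/ignored_directory" "$tmp/keep"
: > "$tmp/ignored_directory/fileName.txt"
: > "$tmp/keep/fileName.txt"
: > "$tmp/keep/other.txt"

# Correct form: prune on the left of -o, explicit -print on the right.
correct=$(cd "$tmp" && find . -path ./ignored_directory -prune -o -name fileName.txt -print)

# Missing -o: everything is ANDed, so a path would have to match both tests.
no_or=$(cd "$tmp" && find . -path ./ignored_directory -prune -name fileName.txt -print)

# Missing -print: the implicit -print also fires for the pruned directory.
no_print=$(cd "$tmp" && find . -path ./ignored_directory -prune -o -name fileName.txt | LC_ALL=C sort)

printf 'correct:  %s\nno -o:    %s\nno print: %s\n' "$correct" "$no_or" "$(echo $no_print)"
rm -rf "$tmp"
```

The correct form lists only ./keep/fileName.txt, the version without -o prints nothing, and the version without -print additionally lists ./ignored_directory itself.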
Following my previous comment, this works on my Debian :
find . -path ./ignored_directory -prune -o -name fileName.txt -print
or
find /path/to/folder -path "*/ignored_directory" -prune -o -name fileName.txt -print
or
find /path/to/folder -name fileName.txt -not -path "*/ignored_directory/*"
The differences are nicely debated here
Edit (added behavior specification details)
Pruning all permission denied directories in find
Using GNU find.
Specification behavior details - in this solution we want to:
exclude the contents of unreadable directories (prune them),
avoid "permission denied" errors coming from unreadable directories,
keep the other errors and return states, but
process all files (even unreadable files, if we can read their names)
The basic design pattern is:
find ... \( -readable -o -prune \) ...
Example
find /var/log/ \( -readable -o -prune \) -name "*.1"
Thanks to mklement0.
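A small sketch of the \( -readable -o -prune \) pattern on a scratch tree, assuming GNU find (for -readable) and hypothetical directory names open/ and locked/. When run as an unprivileged user, locked/ should be pruned before find tries to read it; when run as root, -readable is true for every directory, so nothing is pruned and both files are listed.

```sh
tmp=$(mktemp -d)
mkdir "$tmp/open" "$tmp/locked"
: > "$tmp/open/app.1"
: > "$tmp/locked/secret.1"
chmod 000 "$tmp/locked"          # make the directory unreadable

# Readable entries pass straight through; unreadable ones are pruned.
found=$(cd "$tmp" && find . \( -readable -o -prune \) -name '*.1' | LC_ALL=C sort)
printf '%s\n' "$found"

chmod 700 "$tmp/locked"          # restore access so cleanup can succeed
rm -rf "$tmp"
```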
The problem is in the way find evaluates the expression you are passing to the -path option.
Instead, you should try something like:
find /path/to/folders ! -path "*noDuplicates*" -type f -name "fileName.txt"

What are "primaries" in find?

I was reading the manual for the find command. As I was going down the list of options I was reading the following..
PRIMARIES
All primaries which take a numeric argument allow the number to be preceded
by a plus sign (``+'') or a minus sign (``-''). A preceding plus
sign means ``more than n'', a preceding minus sign means ``less than n''
and neither means ``exactly n''.
I was having a hard time understanding what that means. I was also trying to find out what are "Primaries" in Google and couldn't get a good answer.
Can anyone help me understand what this means?
From the man page, this is the list of primaries in OS X find:
-Bmin
-Bnewer
-Btime
-amin
-anewer
-atime
-cmin
-cnewer
-ctime
-d
-delete
-depth
-empty
-exec
-execdir
-flags
-fstype
-gid
-group
-ignore
-ilname
-iname
-inum
-ipath
-iregex
-iwholename
-links
-lname
-ls
-maxdepth
-mindepth
-mmin
-mnewer
-mount
-mtime
-name
-newer
-newerXY
-nogroup
-noignore_readdir_race
-noleaf
-nouser
-ok
-okdir
-path
-perm
-print
-print0
-prune
-regex
-samefile
-size
-type
-uid
-user
-wholename
From the beginning of the same man page (emphasis mine):
DESCRIPTION
The find utility recursively descends the directory tree for each path listed, evaluating an expression (composed
of the ``primaries'' and ``operands'' listed below) in terms of each file in the tree.
"Primary" is the term used by the find documentation for one of the building blocks of an expression used by find to filter its output.
The find command accepts two kinds of parameters, which the authors of find named 'primaries' and 'operators'. Primaries are parameters that filter which files you want find to find, while operators are the parameters that combine the primaries.
In mathematics, a primary is the basic component in an arithmetic or logic expression.
There is also a third class of parameters, which have no name and modify the directory hierarchy traversal behavior of find, and a fourth class that defines what action to take on the found files (print, delete, etc.)
The GNU man page uses the word 'Test' instead of 'Primary'
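As for the numeric-argument rule quoted in the question, a quick sketch with -size shows the three forms. The file names are made up, and the c suffix (bytes) is used so that no block rounding gets in the way.

```sh
tmp=$(mktemp -d)
printf 'abc'     > "$tmp/three"   # 3 bytes
printf 'abcde'   > "$tmp/five"    # 5 bytes
printf 'abcdefg' > "$tmp/seven"   # 7 bytes

exactly=$(cd "$tmp" && find . -type f -size 5c)    # n:  exactly 5 bytes
more=$(cd "$tmp" && find . -type f -size +5c)      # +n: more than 5 bytes
less=$(cd "$tmp" && find . -type f -size -5c)      # -n: less than 5 bytes

printf 'exactly: %s\nmore:    %s\nless:    %s\n' "$exactly" "$more" "$less"
rm -rf "$tmp"
```

The same +n / -n / n convention applies to the other numeric primaries, such as -mtime and -links.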

Building up a command string for find

I'm trying to parse the Android source directory and I need to extract all the directory names, excluding certain patterns. As you can see below, I have included only one directory in the exclude list for now, but I will be adding more.
The find command doesn't exclude the directory named 'docs'.
The commented-out line works, but the other one doesn't. For easy debugging, I included -mindepth and -maxdepth, which I will remove later.
Any comments or hints on why it doesn't work?
#! /bin/bash
ANDROID_PATH=$1
root=/
EXCLUDES=( doc )
cd ${root}
for dir in "${EXCLUDES[@]}"; do
exclude_name_cmd_string=${exclude_name_cmd_string}$(echo \
"-not -name \"${dir}*\" -prune")
done
echo -e ${exclude_name_cmd_string}
custom_find_cmd=$(find ${ANDROID_PATH} -mindepth 1 -maxdepth 1 \
${exclude_name_cmd_string} -type d)
#custom_find_cmd=$(find ${ANDROID_PATH} -mindepth 1 -maxdepth 1 \
# -not -name "doc*" -prune -type d)
echo ${custom_find_cmd}
Building up a command string with possibly-quoted arguments is a bad idea. You get into nested quoting levels and eval and a bunch of other dangerous/confusing syntactic stuff.
Use an array to build the find; you've already got the EXCLUDES in one.
Also, the repeated -not and -prune seems weird to me. I would write your command as something like this:
excludes=()
for dir in "${EXCLUDES[@]}"; do
excludes+=(-name "${dir}*" -prune -o)
done
find "${ANDROID_PATH}" -mindepth 1 -maxdepth 1 "${excludes[@]}" -type d -print
The upshot is, you want the argument to -name to be passed to find as a literal wildcard that find will expand, not a list of files returned by the shell's expansion, nor a string containing literal quotation marks. This is very hard to do if you try to build the command as a string, but trivial if you use an array.
Friends don't let friends build shell commands as strings.
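Here is a runnable sketch of the same idea on a scratch tree. So that it also runs in shells without arrays, it collects the find arguments in the positional parameters ("$@"), which behave like the one array every POSIX shell has; in bash, an array does the same job. The directory names are hypothetical.

```sh
tmp=$(mktemp -d)
mkdir "$tmp/doc" "$tmp/docs" "$tmp/src" "$tmp/build"

set --                          # start with an empty argument list
for dir in doc; do              # hypothetical exclude list
    # Each element stays a separate word; the * reaches find unexpanded.
    set -- "$@" -name "${dir}*" -prune -o
done

dirs=$(find "$tmp" -mindepth 1 -maxdepth 1 "$@" -type d -print | LC_ALL=C sort)
printf '%s\n' "$dirs"
rm -rf "$tmp"
```

Because every argument is stored as its own word, no extra quoting or escaping is needed, and both doc and docs are excluded by the single doc* pattern.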
When I run your script (named fin.sh) as:
bash -x fin.sh $HOME/tmp
one of the lines of trace output is:
find /Users/jleffler/tmp -mindepth 1 -maxdepth 1 -not -name '"doc*"' -prune -type d
Do you see the single quotes around the double quotes? That's bash trying to be helpful. I'm guessing that your "doesn't work" problem is that you still get directories under doc* included in the output; other than that, it seems to work for me.
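The effect in that trace can be reproduced without find. In this sketch, the quotation marks stored inside the string are ordinary characters by the time the variable is expanded, so they end up as part of the pattern; globbing is switched off only to keep the demonstration independent of the current directory.

```sh
dir=doc
cmd_string="-not -name \"${dir}*\" -prune"

set -f                  # disable globbing so only word splitting is shown
set -- $cmd_string      # unquoted on purpose: split the string the way the script does
set +f

word_count=$#
third_word=$3
printf '<%s>\n' "$@"    # one line per resulting argument
```

The third argument comes out as "doc*" including the double quotes, which is why find looks for names that literally start with a quotation mark.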
How to fix that?
...it seems you've found a way to fix that...I'm not sure I'd trust it with a Bourne shell (but the Korn shell seems to agree with Bash), but it looks like it might work with Bash. I'm pretty sure this is something that changed during the last 30 years or so, but it is hard to prove that; getting hands on the old code is not easy.
I also wonder whether you need repeated -prune options if you have repeated excluded directories; I'm not sufficiently familiar with -prune to be sure.
Found the problem. It's with the escape sequence in exclude_name_cmd_string.
The correct syntax should have been
exclude_name_cmd_string=${exclude_name_cmd_string}$(echo \
"-not -name ${dir}* -prune")

Resources