What is the best way to count "find" results? - bash

My current solution would be find <expr> -exec printf '.' \; | wc -c, but this takes far too long when there are more than 10000 results. Is there no faster/better way to do this?

Why not
find <expr> | wc -l
as a simple portable solution? Your original solution spawns a new printf process for every single file found, and that's very expensive (as you've just discovered).
Note that this will overcount if you have filenames with newlines embedded, but if you have that then I suspect your problems run a little deeper.
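If you want to see that overcount for yourself, here is a quick demonstration (a throwaway test directory; assumes a bash-style shell for the $'...' quoting and a find with -print0, e.g. GNU or BSD):
mkdir /tmp/newline-demo && cd /tmp/newline-demo
touch $'bad\nname'                             # one file whose name contains a newline
find . -type f | wc -l                         # prints 2: the single name spans two lines
find . -type f -print0 | tr -dc '\0' | wc -c   # prints 1: one NUL terminator per file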

Try this instead (requires find's -printf support):
find <expr> -type f -printf '.' | wc -c
It will be more reliable and faster than counting the lines.
Note that this uses find's built-in -printf, not an external command.
Let's benchmark a bit:
$ ls -1
a
e
l
ll.sh
r
t
y
z
Benchmarking my snippet:
$ time find -type f -printf '.' | wc -c
8
real 0m0.004s
user 0m0.000s
sys 0m0.007s
With full lines:
$ time find -type f | wc -l
8
real 0m0.006s
user 0m0.003s
sys 0m0.000s
So my solution is faster =) (the important part is the real line)

This solution is certainly slower than some of the other find -> wc solutions here, but if you were inclined to do something else with the file names in addition to counting them, you could read from the find output.
n=0
while read -r -d ''; do
    ((n++)) # count
    # maybe perform another action on the file
done < <(find <expr> -print0)
echo $n
It is just a modification of a solution found in the BashGuide that properly handles files with nonstandard names, by making find's output delimiter a NUL byte with -print0 and reading it back with '' (the NUL byte) as read's delimiter.

This is my countfiles function in my ~/.bashrc (it's reasonably fast, should work for Linux & FreeBSD find, and does not get fooled by file paths containing newline characters; the final wc just counts NUL bytes):
countfiles ()
{
    command find "${1:-.}" -type f -name "${2:-*}" -print0 |
        command tr -dc '\0' | command wc -c;
    return 0
}
countfiles
countfiles ~ '*.txt'

POSIX compliant and newline-proof:
find /path -exec printf %c {} + | wc -c
And, from my tests on /, it was not even twice as slow as the other solutions, which are either not newline-proof or not portable.
Note the + instead of \;. That is crucial for performance, as \; spawns one printf command per file name, whereas + passes as many file names as it can to a single printf command. (And in the case where there are too many arguments, find intelligently spawns new printf invocations on demand to cope with it, so it would be as if
{
printf %c very long argument list1
printf %c very long argument list2
printf %c very long argument list3
} | wc -c
were called.)
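If you want to reproduce that comparison yourself, a rough harness might look like this (/usr is only an example tree; timings vary widely between systems and caches):
time find /usr -type f -exec printf %c {} + 2>/dev/null | wc -c
time find /usr -type f 2>/dev/null | wc -l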

I needed something that wouldn't swallow all of find's output, because the same find invocation also runs other actions that print things.
Without temporary files this is only possible with a big caveat: you might get (far) more than one line of output, because the counting command is executed once for every 800~1600 files or so.
find . -print -exec sh -c 'printf %c "$@" | wc -c' '' '{}' + # just print the numbers
find . -print -exec sh -c 'echo "Processed `printf %c "$@" | wc -c` items."' '' '{}' +
Generates this result:
Processed 1622 items.
Processed 1578 items.
Processed 1587 items.
An alternative is to use a temporary file:
find . -print -fprintf tmp.file .
wc -c <tmp.file # using the file as argument instead causes the file name to be printed after the count
echo "Processed `wc -c <tmp.file` items." # sh variant
echo "Processed $(wc -c <tmp.file) items." # bash variant
The -print in each of these find commands does not influence the count at all.

Related

Counting sum of lines in all .c and .h files

I am trying to write a shell script that will count the sum of all lines in every file in a directory (and its subdirectories) of format .c and .h.
I already have that code but I am not sure how to make it find both file formats.
!/bin/bash
#Program
total=0
find /path -type f -name "*.php" | while read FILE; do
count=$(grep -c ^ < "$FILE")
echo "$FILE has $count lines"
let total=total+count
done
echo TOTAL LINES COUNTED: $total
I am a newbie to shell/bash, and if anything else is wrong I would be grateful for help.
Optimized and fast find + GNU parallel solution:
find /path -type f -name "*.[ch]" -print0 | parallel -q0 -j0 --no-notice wc -l {} \
| awk '{ sum+=$1 }END{ print "TOTAL LINES COUNTED: "sum }'
-print0 - print the full file name on the standard output, followed by a null character (instead of the newline character that -print uses). This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find output.
With parallel, the command wc -l {} will be executed for each file in parallel (that's called parallel processing).
To find .c and .h files instead of .php,
simply change the value of the -name parameter to *.[ch].
There are a few other issues in the script:
It would be safer to read the filenames with IFS= read -r
The first line should be #!/bin/bash instead of !/bin/bash
And some minor improvements are possible:
The summing logic can be written a bit more simply using ((...)) syntax (arithmetic context)
It's not recommended to use uppercase variable names, as that convention is reserved for system variables
Putting it together:
#!/bin/bash
total=0
while IFS= read -r file; do
    count=$(grep -c ^ < "$file")
    echo "$file has $count lines"
    ((total += count))
done < <(find /path -type f -name "*.[ch]")   # process substitution, not a pipe, so $total survives the loop
echo TOTAL LINES COUNTED: $total
Other answers recommend variations of find ... -exec wc -l.
Although they look more elegant,
they will not work exactly the same way as your script:
wc -l counts lines a bit differently from grep -c ^. In particular it doesn't count the last line of a file if it doesn't end with a newline. Try for example printf hello > file; wc -l file; grep -c ^ file -> you'll get 0 and 1.
Getting the line count in the individual files, and the total lines is not so simple. Using find ... -exec wc -l {} + comes quite close (if your implementation of find supports +), but again there will be corner cases that need special treatment. For example if there are too many files, then wc will be invoked multiple times, producing multiple sub-totals that would need to be reconciled.
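If you do go the find ... -exec wc -l {} + route, one way to reconcile those sub-totals is to recompute the grand total yourself and drop wc's intermediate "total" lines (a sketch, assuming wc's usual "count name" output and that no counted path is literally named total):
find /path -type f -name "*.[ch]" -exec wc -l {} + |
awk '$2 != "total" { print; sum += $1 } END { print "TOTAL LINES COUNTED: " sum }'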
Try this:
cat $(find /path -type f \( -name '*.c' -o -name '*.h' \)) |wc -l
It will run cat on every file returned by find and pipe the output into wc. If you need the value in a variable just do this
lines=$(cat ...)
echo counted $lines lines
Cat all files ending in .c or .h and pipe to grep -c:
find -type f -name '*.[ch]' -exec cat {} + | grep -c '^'
For a find without the + option, the alternative is
find -type f -name '*.[ch]' -exec cat {} \; | grep -c '^'
which calls cat once per file instead of as few times as possible, making it a bit slower.
If you know that you won't have a lot of files approaching the command line length limit, you could use just shell globbing:
shopt -s globstar # enable **/* glob
cat **/*.[ch] | grep -c '^'
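If you're not sure how close you are to that limit, you can ask the system for it; getconf reports the kernel's argument-size limit (note that the environment also counts against it):
getconf ARG_MAX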

How to iterate through all files in a directory, ordered by date created, when some filenames have spaces in their names

First I had
for file in `ls -t dir` ; do
#blah
done
but files with spaces are split into two iterations.
I've found tons of variations on this that fix the spaces issue, but they then leave some date info in the $file variable.
Edit: to show one such variation:
for file in `find . -printf "%T# %Tc %p\n" | sort -n` ; do
#blah
done
The problem with this is that all the time info is still in place within the $file variable in the loop. (Also, this doesn't work for me because I happen to be on OS X, whose find utility lacks the -printf option...)
Use find in combination with xargs to pass file names with NUL-byte separation, and use a while read loop for efficiency and space preservation:
find /path/to/dir -type f -print0 | xargs -0 ls -t | while read file
do
    ls "$file" # or whatever you want with $file, which may have spaces,
               # so always enclose it in double quotes
done
find generates the list of files, ls arranges them, by time in this case. To reverse the sort order, replace -t with -tr. If you wanted to sort by size, replace -t with -S.
Example:
$ touch -d '2015-06-17' 'foo foo'
$ touch -d '2016-02-12' 'bar bar'
$ touch -d '2016-05-01' 'baz baz'
$ ls -1
bar bar
baz baz
foo foo
$ find . -type f -print0 | xargs -0 ls -t | while read file
> do
> ls -l "$file"
> done
-rw-rw-r-- 1 bishop bishop 0 May 1 00:00 ./baz baz
-rw-rw-r-- 1 bishop bishop 0 Feb 12 00:00 ./bar bar
-rw-rw-r-- 1 bishop bishop 0 Jun 17 2015 ./foo foo
For completeness, I'll highlight a point from comments to the question: -t sorts by modification time, which is not strictly creation time. The file system on which these files reside dictates whether or not creation time is available. Since your initial attempts used -t, I figured modification time was what you were concerned about, even if that's not pedantically true.
If you want creation time, you'll have to pull it from some source, like stat or the file name if it's encoded there. This basically means replacing the xargs -0 ls -t with a suitable command piped to sort, something like: xargs -0 stat -c '%W' | sort -n
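Putting that together, a sketch might look like this (assumes GNU stat, where %W is the birth time in seconds since the epoch and %n is the file name; %W is 0 on filesystems that don't record creation time, and names containing newlines would still confuse the line-oriented sort and read):
find /path/to/dir -type f -print0 |
xargs -0 stat -c '%W %n' |
sort -n |
while read -r ctime file; do
    printf '%s\n' "$file"   # or whatever you want with $file
done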
Using GNU find and GNU sort, one can do the following:
while IFS='' read -r -d ' ' mtime && IFS='' read -r -d '' filename; do
    printf 'Processing file %q with timestamp of %s\n' "$filename" "$mtime"
done < <(find "$dir" -type f -printf '%T# %p\0' | sort -znr)
This works as follows:
find prints its output in the format <seconds-since-epoch> <filename><NUL>.
sort sorts that numerically -- thus, by modification time, expressed in seconds since epoch.
IFS='' read -r -d ' ' mtime reads everything up to the space into the variable mtime.
IFS='' read -r -d '' filename reads all remaining content up to the NUL into the variable filename.
Because NUL cannot exist in filenames (as compared to newlines, which can), this can't be thrown off by names with surprising contents. See BashFAQ #3 for a detailed discussion.
Moreover, because it doesn't depend on passing names as command-line arguments to ls -t (which, like all other external commands, can only accept a limited number of command-line arguments on each invocation), this approach is not limited in the number of files it can reliably sort. (Using find ... -exec ls -t {} + or ... | xargs ls -t will result in silently incorrect results when the number of filenames being processed grows larger than the number that can be passed to a single ls invocation).
You can temporarily set your IFS variable to avoid the problem with spaces (thanks to http://www.linuxjournal.com/article/10954?page=0,1)
IFS_backup=$IFS
IFS=$(echo -en "\n\b")
for file in `ls -t dir` ; do
#blah
done
IFS=$IFS_backup
Edit: this worked on Ubuntu, but not RHEL6. The alternative suggested by bishop appears to be more portable, for example:
ls -t dir|while read file; do ...; done

How can I get xargs to do something with the input, then do another thing?

I'm in zsh.
I'd like to do something like:
find . -iname *.md | xargs cat && echo "---" > all_slides_with_separators_in_between.md
Of course this cats all the slides, then appends a single "---" at the end instead of after each slide.
Is there an xargs way of doing this? Can I replace cat && echo "---" with some inline function or do block?
Very strangely, when I create a file cat---.sh with the contents
cat $1
echo ---
and run
find . -iname *.md | xargs ./cat---.sh
it only executes for the first result of find.
Replace cat---.sh with cat and it runs on both files.
There's no need to use xargs at all here. Following is a properly paranoid approach (robust against files with spaces, files with newlines, files with literal backslashes in their names, etc):
while IFS= read -r -d '' filename; do
    printf '---\n'
    cat -- "$filename"
done < <(find . -iname '*.md' -print0) >all_slides_with_separators.md
However -- you don't even need that either: find can do all the work itself, both printing the separator and calling cat!
find . -iname '*.md' -printf '---\n' -exec cat -- '{}' ';' >all_slides_with_separators.md
A common usage pattern is xargs sh -c 'command; another' _ where the entire shell script in the quotes will have access to the command-line arguments. The underscore is because the first argument to sh -c will be assigned to $0 (where you'd often see e.g. -sh in a ps listing).
find . -iname '*.md' |
xargs sh -c 'for x; do
cat "$x" && echo "---"
done' _ > all_slides_with_separators_in_between.md
As noted in the comments, you should probably investigate find -print0 and the corresponding xargs -0 option in GNU find (and maybe install it if you don't have it).
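For example, the same pattern with NUL-delimited names might look like this (a sketch assuming GNU find's -print0 and an xargs that supports -0):
find . -iname '*.md' -print0 |
xargs -0 sh -c 'for x; do
    cat "$x" && echo "---"
done' _ > all_slides_with_separators_in_between.md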
You can do something like this, but it can be insecure in some cases (see comments):
find . -iname '*.md' | xargs -I % sh -c '{ cat %; echo "----"; }' > output.txt
You'll rarely need find in zsh; its globbing facilities cover nearly every use case of find.
for f in (#i)**/*.md; do
    cat $f
    print -- "---"
done > all_slides.md
This looks in the current directory hierarchy for every file that matches *.md in a case-insensitive manner.
For even more efficiency, replace cat $f with < $f; zsh itself will read the file and write its contents to standard output.
Using GNU Parallel it looks like this:
parallel cat {}\; print -- --- ::: **/*.md

How do you grep results from 'find'?

I'm trying to find a word/pattern contained within the files that the find command returns.
For instance, I have this command:
find . -name Gruntfile.js that returns several file names.
How do I grep within these for a word pattern?
Was thinking something along the lines of:
find . -name Gruntfile.js | grep -rnw -e 'purifycss'
However, this doesn't work.
Use the -exec {} + option to pass the list of filenames that are found as arguments to grep:
find -name Gruntfile.js -exec grep -nw 'purifycss' {} +
This is the safest and most efficient approach, as it doesn't break when the path to the file isn't "well-behaved" (e.g. contains a space). Like an approach using xargs, it also minimises the number of calls to grep by passing multiple filenames at once.
I have removed the -e and -r switches, as I don't think that they're useful to you here.
An excerpt from man find:
-exec command {} +
This variant of the -exec action runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total number of invocations of the command will be much less than the number of matched files.
While this doesn't strictly answer your question, provided you have globstar turned on (shopt -s globstar), you could filter the results in bash like this:
grep something **/Gruntfile.js
I religiously used the approach suggested by Tom Fenech until I switched to zsh, which handles such things much better. Now all I do is:
grep text **/*(.)
which greps for text in all regular files under the current directory.
I believe this to be much cleaner syntax especially for day-to-day work in shell.
When too many files exist for the * expansion to run:
$ grep -o 'xxmaj\|xxbos\|xxfld' train/* | wc -l
-bash: /bin/grep: Argument list too long
0
Then this code fixes the “too long” problem:
$ find junk -maxdepth 1 -type f | xargs grep -o 'TVDetails\|xxmaj\|xxbos\|xxfld'
junk/gum-.doc.out:TVDetails
junk/Zv0n.doc.out:TVDetails
$ find junk -maxdepth 1 -type f | xargs grep -o 'TVDetails\|xxmaj\|xxbos\|xxfld' | wc -l
2
It runs faster on my system, and maybe yours, when using the -P 0 option:
$ /usr/bin/time -f "%E Elapsed Real Time" find train -maxdepth 1 -type f | xargs -P 0 grep -o 'TVDetails\|xxmaj\|xxbos\|xxfld' | wc -l
0:02.45 Elapsed Real Time
358
$ /usr/bin/time -f "%E Elapsed Real Time" find train -maxdepth 1 -type f | xargs grep -o 'TVDetails\|xxmaj\|xxbos\|xxfld' | wc -l
0:11.96 Elapsed Real Time
358
Hope this helps.

Can xargs execute a subshell command for each argument?

I have a command which is attempting to generate UUIDs for files:
find -printf "%P\n"|sort|xargs -L 1 echo $(uuid)
But in the result, xargs is only executing the $(uuid) subshell once:
8aa9e7cc-d3b2-11e4-83a6-1ff1acc22a7e file1
8aa9e7cc-d3b2-11e4-83a6-1ff1acc22a7e file2
8aa9e7cc-d3b2-11e4-83a6-1ff1acc22a7e file3
Is there a one-liner (i.e not a function) to get xargs to execute a subshell command on each input?
This is because the $(uuid) gets expanded in the current shell. You could explicitly call a shell:
find -printf "%P\n"| sort | xargs -I '{}' bash -c 'echo $(uuid) {}'
Btw, I would use the following command:
find -exec bash -c 'echo "$(uuid) ${1#./}"' -- '{}' \;
without xargs.
hek2mgl's answer explains the problem well and his solution works well; this answer looks at performance.
The accepted answer is a tad slow, because it creates a bash process for every input line.
While xargs is generally preferable to and faster than a shell-code loop, in this particular case the roles are reversed, because shell functionality is needed in each iteration.
The following alternative solution uses a while loop to process the input lines, and, on my machine, is about twice as fast as the xargs solution.
find . -printf "%P\n" | sort | while IFS= read -r f; do echo "$(uuid) $f"; done
Note the use of while rather than for, because for cannot robustly parse command output (in short: filenames with embedded whitespace would break the command - see http://mywiki.wooledge.org/DontReadLinesWithFor).
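To see the kind of breakage that page describes, here is a quick illustration (run in an otherwise empty test directory):
touch 'two words'
for f in $(find . -name '*words*'); do echo "[$f]"; done
# prints [./two] and then [words]: one file, two iterations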
If you're concerned about filenames with embedded newlines (very rare) and use GNU utilities, you could use NUL bytes as separators:
find . -printf "%P\0" | sort -z | while IFS= read -d '' -r f; do echo "$(uuid) $f"; done
Update: The fastest approach is to not use a shell loop at all, as evidenced by ᴳᵁᴵᴰᴼ's clever answer.
See below for a portable version of his answer.
Compatibility note:
The OP's find command implies the use of GNU find (Linux), and uses features (-printf) that may not work on other platforms.
Here's a portable version of ᴳᵁᴵᴰᴼ's answer that uses only POSIX-compliant features of find (and awk).
Note, however, that uuid is not a POSIX utility; since Linux and BSD-like systems (including OSX) have a uuidgen utility, the command uses that instead:
find . -exec printf '%s\t' {} \; -exec uuidgen \; |
awk -F '\t' '{ sub(/.+\//,"", $1); print $2, $1 }' | sort -k2
With a for loop:
for i in $(find -printf "%P\n" | sort) ; do echo "$(uuid) $i"; done
Edit: another way to do this:
find -printf "%P\0" -exec uuid -v 4 \; | sort | awk -F'\0' '{ print $2 " " $1}'
This outputs the filename followed by the uuid (no subshell required) so that the sort happens on the filename, then swaps the two NUL-separated columns with awk.
