Counting sum of lines in all .c and .h files - bash

I am trying to write a shell script that will count the total number of lines across every .c and .h file in a directory (and its subdirectories).
I already have this code, but I am not sure how to make it find both file formats.
!/bin/bash
#Program
total=0
find /path -type f -name "*.php" | while read FILE; do
count=$(grep -c ^ < "$FILE")
echo "$FILE has $count lines"
let total=total+count
done
echo TOTAL LINES COUNTED: $total
I am a newbie to shell/bash, so if anything else is wrong I would be grateful for help.

Optimized and fast find + GNU parallel solution:
find /path -type f -name "*.[ch]" -print0 | parallel -q0 -j0 --no-notice wc -l {} \
| awk '{ sum+=$1 }END{ print "TOTAL LINES COUNTED: "sum }'
-print0 - print the full file name on the standard output, followed by a null character (instead of the newline character that -print uses). This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find output.
With parallel, the command wc -l {} will be executed for each file in parallel (that's called parallel processing).

To find .c and .h files instead of .php,
simply change the value of the -name parameter to *.[ch].
There are a few other issues in the script:
It would be safer to read the filenames with IFS= read -r
The first line should be #!/bin/bash instead of !/bin/bash
Because the pipe into while runs the loop in a subshell, the total computed inside the loop is lost when the loop ends; feeding the loop from process substitution instead keeps the sum in the current shell
And some minor improvements are possible:
The summing logic can be written a bit more simply using ((...)) syntax (arithmetic context)
It's not recommended to use uppercase variable names, as that convention is reserved for system variables
Putting it together:
#!/bin/bash
total=0
while IFS= read -r file; do
count=$(grep -c ^ < "$file")
echo "$file has $count lines"
((total += count))
done < <(find /path -type f -name "*.[ch]")
echo "TOTAL LINES COUNTED: $total"
Other answers recommend variations of find ... -exec wc -l.
Although they look more elegant,
they will not work exactly the same way as your script:
wc -l counts lines a bit differently from grep -c ^. In particular it doesn't count the last line of a file if it doesn't end with a newline. Try for example printf hello > file; wc -l file; grep -c ^ file -> you'll get 0 and 1.
Getting the line count in the individual files, and the total lines is not so simple. Using find ... -exec wc -l {} + comes quite close (if your implementation of find supports +), but again there will be corner cases that need special treatment. For example if there are too many files, then wc will be invoked multiple times, producing multiple sub-totals that would need to be reconciled.
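For illustration only, here is a rough sketch of one way those sub-totals could be reconciled, assuming no file name ends in the word "total":
# drop wc's per-batch "total" lines and recompute the grand total ourselves
find /path -type f -name '*.[ch]' -exec wc -l {} + \
| awk '$NF != "total" { sum += $1; print } END { print "TOTAL LINES COUNTED: " sum }'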

Try this:
cat $(find /path -type f \( -name '*.c' -o -name '*.h' \)) | wc -l
It will run cat on every file returned by find and pipe the output into wc. If you need the value in a variable, just do this:
lines=$(cat ...)
echo counted $lines lines
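Spelled out with the full find command from above (note that this relies on word splitting of the command substitution, so it can misbehave if any file name contains whitespace):
lines=$(cat $(find /path -type f \( -name '*.c' -o -name '*.h' \)) | wc -l)
echo "counted $lines lines"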

Cat all files ending in .c or .h and pipe to grep -c:
find -type f -name '*.[ch]' -exec cat {} + | grep -c '^'
For a find without the + option, the alternative is
find -type f -name '*.[ch]' -exec cat {} \; | grep -c '^'
which calls cat once per file instead of as few times as possible, making it a bit slower.
If you know that you won't have a lot of files approaching the command line length limit, you could use just shell globbing:
shopt -s globstar # enable **/* glob
cat **/*.[ch] | grep -c '^'
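If you also want per-file counts, wc can print them together with a grand total, subject to the same command-line length caveat:
shopt -s globstar
wc -l **/*.[ch]   # per-file line counts plus a final "total" line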

Related

How can I get xargs to do something with the input, then do another thing?

I'm in zsh.
I'd like to do something like:
find . -iname *.md | xargs cat && echo "---" > all_slides_with_separators_in_between.md
Of course this cats all the slides, then appends a single "---" at the end instead of after each slide.
Is there an xargs way of doing this? Can I replace cat && echo "---" with some inline function or do block?
Very strangely, when I create a file cat---.sh with the contents
cat $1
echo ---
and run
find . -iname *.md | xargs ./cat---.sh
it only executes for the first result of find.
Replace cat---.sh with cat and it runs on both files.
There's no need to use xargs at all here. Following is a properly paranoid approach (robust against files with spaces, files with newlines, files with literal backslashes in their names, etc):
while IFS= read -r -d '' filename; do
printf '---\n'
cat -- "$filename"
done < <(find . -iname '*.md' -print0) >all_slides_with_separators.md
However -- you don't even need that either: find can do all the work itself, both printing the separator and calling cat!
find . -iname '*.md' -printf '---\n' -exec cat -- '{}' ';' >all_slides_with_separators.md
A common usage pattern is xargs sh -c 'command; another' _ where the entire shell script in the quotes will have access to the command-line arguments. The underscore is because the first argument to sh -c will be assigned to $0 (where you'd often see e.g. -sh in a ps listing).
find . -iname '*.md' |
xargs sh -c 'for x; do
cat "$x" && echo "---"
done' _ > all_slides_with_separators_in_between.md
As noted in the comments, you should probably investigate find -print0 and the corresponding xargs -0 option in GNU find (and maybe install it if you don't have it).
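Putting those two suggestions together, a sketch of the null-delimited variant (assuming GNU find and xargs) would be:
find . -iname '*.md' -print0 |
xargs -0 sh -c 'for x; do
cat "$x" && echo "---"
done' _ > all_slides_with_separators_in_between.md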
You can do something like this, but it can be insecure in some cases, because the file name is substituted directly into the shell command string:
find . -iname '*.md' | xargs -I % sh -c '{ cat %; echo "----"; }' > output.txt
You'll rarely need find in zsh; its globbing facilities cover nearly every use case of find.
for f in (#i)**/*.md; do
cat $f
print -- "---"
done > all_slides.md
This looks in the current directory hierarchy for every file that matches *.md in a case-insensitive manner.
For even more efficiency, replace cat $f with < $f; zsh itself will read the file and write its contents to standard output.
Using GNU Parallel it looks like this:
parallel cat {}\; print -- --- ::: **/*.md

count number of lines for each file found

I think that I don't understand very well how the find command in Unix works; I have this code for counting the number of files in each folder, but I want to count the number of lines of each file found and save the total in a variable.
find "$d_path" -type d -maxdepth 1 -name R -print0 | while IFS= read -r -d '' file; do
nb_fichier_R="$(find "$file" -type f -maxdepth 1 -iname '*.R' | wc -l)"
nb_ligne_fichier_R= "$(find "$file" -type f -maxdepth 1 -iname '*.R' -exec wc -l {} +)"
echo "$nb_ligne_fichier_R"
done
output:
43 .//system d exploi/r-repos/gbm/R/basehaz.gbm.R
90 .//system d exploi/r-repos/gbm/R/calibrate.plot.R
45 .//system d exploi/r-repos/gbm/R/checks.R
178 total: File name too long
Can I just save the total number of lines in my variable? Here in my example that would be just 178, and I'd like that for each folder in "$d_path".
Many Thanks
Maybe I'm missing something, but wouldn't this do what you want?
wc -l R/*.[Rr]
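If the goal is to store only the grand total in a variable, a sketch building on that command (assuming at least one .R file exists) could be:
total=$(wc -l R/*.[Rr] | awk 'END { print $1 }')   # the last line of wc's output is the grand total
echo "$total"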
Solution:
find "$d_path" -type d -maxdepth 1 -name R | while IFS= read -r file; do
nb_fichier_R="$(find "$file" -type f -maxdepth 1 -iname '*.R' | wc -l)"
echo "$nb_fichier_R" #here is fine
find "$file" -type f -maxdepth 1 -iname '*.R' | while IFS= read -r fille; do
wc -l $fille #here is the problem nothing shown
done
done
Explanation:
By adding -print0, the first find produced no newlines, so you had to use -d '' to tell read not to look for one. Your subsequent finds output newlines, so you can use read without a custom delimiter. I removed -print0 and -d '' from all the calls so everything is consistent and idiomatic. Newlines are fine in the Unix world.
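If you also want just the grand total saved in a variable inside the outer loop, one option (a sketch) is to cat the .R files and count once:
nb_ligne_fichier_R=$(find "$file" -type f -maxdepth 1 -iname '*.R' -exec cat {} + | wc -l)
echo "total lines in $file: $nb_ligne_fichier_R"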
For the command:
find "$d_path" -type d -maxdepth 1 -name R -print0
there can be at most one directory that matches ("$d_path/R"). For that one directory, you want to print:
The number of files matching *.R
For each such file, the number of lines in it.
Allowing for spaces in $d_path and in the file names is most easily handled, I find, with an auxiliary shell script. The auxiliary script processes the directories named on its command line. You then invoke that script from the main find command.
counter.sh
shopt -s nullglob
for dir in "$@"
do
count=0
for file in "$dir"/*.R; do ((count++)); done
echo "$count"
wc -l "$dir"/*.R </dev/null
done
The shopt -s nullglob option means that if there are no .R files (with names that don't start with a .), then the glob expands to nothing rather than expanding to a string containing *.R at the end. It is convenient in this script. The I/O redirection on wc ensures that if there are no files, it reads from /dev/null, reporting 0 lines (rather than sitting around waiting for you to type something).
On the other hand, the find command will find names that start with a . as well as those that do not, whereas the globbing notation will not. The easiest way around that is to use two globs:
for file in "$dir"/*.R "$dir"/.*.R; do ((count++)); done
or use find (rather carefully):
find . -type f -name '*.R' -exec sh -c 'echo $#' arg0 {} +
Using counter.sh
find "$d_path" -type d -maxdepth 1 -name R -exec sh ./counter.sh {} +
This script allows for the possibility of more than one sub-directory (if you remove -maxdepth 1) and invokes counter.sh with all the directories to be examined as arguments. The script itself carefully handles file names so that whether there are spaces, tabs or newlines (or any other character) in the names, it will work correctly. The sh ./counter.sh part of the find command assumes that the counter.sh script is in the current directory. If it can be found on $PATH, then you can drop the sh and the ./.
Discussion
The technique of having find execute a command with the list of file name arguments is powerful. It avoids issues with -print0 and using xargs -0, but gives you the same reliable handling of arbitrary file names, including names with spaces, tabs and newlines. If there isn't already a command that does what you need (but you could write one as a shell script), then do so and use it. If you might need to do the job more than once, you can keep the script. If you're sure you won't, you can delete it after you're done with it. It is generally much easier to handle files with awkward names like this than it is to fiddle with $IFS.
Consider this solution:
# If `"$dir"/*.R` doesn't match anything, yield nothing instead of giving the pattern.
shopt -s nullglob
# Allows matching both `*.r` and `*.R` in one expression. Using them separately would
# give double results.
shopt -s nocaseglob
while IFS= read -ru 4 -d '' dir; do
files=("$dir"/*.R)
echo "${#files[#]}"
for file in "${files[#]}"; do
wc -l "$file"
done
# Use process substitution to prevent going to a subshell. This may not be
# necessary for now but it could be useful to future modifications.
# Let's also use a custom fd to keep troubles isolated.
# It works with `-u 4`.
done 4< <(exec find "$d_path" -type d -maxdepth 1 -name R -print0)
Another form is to use readarray, which reads all found directories at once. The only caveat is that it can only handle normal newline-terminated paths.
shopt -s nullglob
shopt -s nocaseglob
readarray -t dirs < <(exec find "$d_path" -type d -maxdepth 1 -name R)
for dir in "${dirs[#]}"; do
files=("$dir"/*.R)
echo "${#files[#]}"
for file in "${files[#]}"; do
wc -l "$file"
done
done

From UNIX shell, how to find all files containing a specific string, then print the 4th line of each file?

I want to find all files within the current directory that contain a given string, then print just the 4th line of each file.
grep --null -l "$yourstring" * | # List all the files containing your string
xargs -0 sed -n '4p;q' # Print the fourth line of said files.
Different editions of grep have slightly different incantations of --null, but it's usually there in some form. Read your manpage for details.
Update: I believe one of the null file list incantations of grep is a reasonable solution that will cover the vast majority of real-world use cases. But to be entirely portable, if your version of grep does not support any form of null output, it is not perfectly safe to use it with xargs, so you must resort to find.
find . -maxdepth 1 -type f -exec grep -q "$yourstring" {} \; -exec sed -n '4{p;q;}' {} \;
Because find arguments can almost all be used as predicates, the -exec grep -q… part filters the files that are eventually fed to sed down to only those that contain the required string.
From another user:
grep -Frl string . | xargs -n 1 sed -n 4p
Give the following GNU find command a try:
find . -maxdepth 1 -type f -exec grep -l 'yourstring' {} \; | xargs -I {} awk 'NR==4{print; exit}' {}
It finds all files in the current directory that contain the specific string, and prints line 4 of each of them.
This while loop should work:
while read -d '' -r file; do
echo -n "$file: "
sed '4q;d' "$file"
done < <(grep --null -l "some-text" *.txt)

how to grep large number of files?

I am trying to grep 40k files in the current directory and I am getting this error.
for i in $(cat A01/genes.txt); do grep $i *.kaks; done > A01/A01.result.txt
-bash: /usr/bin/grep: Argument list too long
How does one normally grep thousands of files?
Thanks
Upendra
This makes David sad...
Everyone so far is wrong (except for anubhava).
Shell scripting is not like other programming languages, because much of the interpretation of a command line is done by the shell, which expands it before the command is actually executed.
Let's take something simple:
$ set -x
$ ls
+ ls
bar.txt foo.txt fubar.log
$ echo The text files are *.txt
+ echo The text files are bar.txt foo.txt
The text files are bar.txt foo.txt
$ set +x
$
The set -x allows you to see how the shell actually expands the glob and then passes the result to the command as input. The line prefixed with + is the command that is actually executed.
You can see that the echo command isn't interpreting the *. Instead, the shell grabs the * and replaces it with the names of the matching files. Then, and only then, does the echo command actually execute.
When you have 40K plus files, and you do grep *, you're expanding that * to the names of those 40,000 plus files before grep even has a chance to execute, and that's where the error message /usr/bin/grep: Argument list too long is coming from.
Fortunately, Unix has a way around this dilemma:
$ find . -name "*.kaks" -type f -maxdepth 1 | xargs grep -f A01/genes.txt
The find . -name "*.kaks" -type f -maxdepth 1 will find all of your *.kaks files, and the -depth 1 will only include files in the current directory. The -type f makes sure you only pick up files and not a directory.
The find command pipes the names of the files into xargs and xargs will append the names of the file to the grep -f A01/genes.txtcommand. However, xargs has a trick up it sleeve. It knows how long the command line buffer is, and will execute the grep when the command line buffer is full, then pass in another series of file to the grep. This way, grep gets executed maybe three or ten times (depending upon the size of the command line buffer), and all of our files are used.
Unfortunately, xargs uses whitespace as a separator for the file names. If your files contain spaces or tabs, you'll have trouble with xargs. Fortunately, there's another fix:
$ find . -name "*.kaks" -type f -maxdepth 1 -print0 | xargs -0 grep -f A01/genes.txt
The -print0 will cause find to print out the names of the files separated not by newlines but by the NUL character. The -0 parameter tells xargs that the file separator isn't whitespace but the NUL character. This fixes the issue.
You could also do this too:
$ find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} \;
This will execute grep once for each and every file found, instead of what xargs does (running grep once for as many files as it can fit on the command line). The advantage of this is that it avoids shell interference entirely. However, it may or may not be less efficient.
What would be interesting is to experiment and see which one is more efficient. You can use time to see:
$ time find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} \;
This will execute the command and then tell you how long it took. Try it with the -exec and with xargs and see which is faster. Let us know what you find.
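For comparison, the xargs counterpart of that timing test might look like this; redirecting the matches to /dev/null keeps terminal output from dominating the measurement:
$ time find . -name "*.kaks" -type f -maxdepth 1 -print0 | xargs -0 grep -f A01/genes.txt > /dev/null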
You can combine find with grep like this:
find . -maxdepth 1 -name '*.kaks' -exec grep -H -f A01/genes.txt '{}' \; > A01/A01.result.txt
You can use the recursive feature of grep:
for i in $(cat A01/genes.txt); do
grep -r $i .
done > A01/A01.result.txt
though if you want to select only kaks files:
for i in $(cat A01/genes.txt); do
find . -iregex '.*\.kaks$' -exec grep "$i" {} \;
done > A01/A01.result.txt
Put another for loop inside your outer one:
for f in *.kaks; do
grep -H $i "$f"
done
By the way, are you interested in finding EVERY occurrence in each file, or merely whether the search string exists in there one or more times? If it is "good enough" to know the string occurs in there one or more times, you can specify -m 1 to grep and it will not bother reading/searching the rest of the file after finding the first match, which could potentially save lots of time.
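For example, in the inner loop above (assuming a grep that supports -m, as GNU and BSD grep both do):
grep -H -m 1 "$i" "$f"   # report the file and first matching line, then stop reading that file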
The following solution has worked for me:
Problem:
grep -r "example\.com" *
-bash: /bin/grep: Argument list too long
Solution:
grep -r "example\.com" .
["In newer versions of grep you can omit the “.“, as the current directory is implied."]
Source:
Reinlick, J. https://www.saotn.org/bash-grep-through-large-number-files-argument-list-too-long/

What is the best way to count "find" results?

My current solution would be find <expr> -exec printf '.' \; | wc -c, but this takes far too long when there are more than 10000 results. Is there no faster/better way to do this?
Why not
find <expr> | wc -l
as a simple portable solution? Your original solution is spawning a new process printf for every individual file found, and that's very expensive (as you've just found).
Note that this will overcount if you have filenames with newlines embedded, but if you have that then I suspect your problems run a little deeper.
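If embedded newlines are a real concern, a newline-proof variant (assuming your find supports -print0) is to count NUL bytes instead of lines:
find <expr> -print0 | tr -dc '\0' | wc -c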
Try this instead (requires find's -printf support):
find <expr> -type f -printf '.' | wc -c
It will be more reliable and faster than counting the lines.
Note that I use the find's printf, not an external command.
Let's bench a bit:
$ ls -1
a
e
l
ll.sh
r
t
y
z
My snippet benchmark:
$ time find -type f -printf '.' | wc -c
8
real 0m0.004s
user 0m0.000s
sys 0m0.007s
With full lines:
$ time find -type f | wc -l
8
real 0m0.006s
user 0m0.003s
sys 0m0.000s
So my solution is faster =) (the important part is the real line)
This solution is certainly slower than some of the other find -> wc solutions here, but if you were inclined to do something else with the file names in addition to counting them, you could read from the find output.
n=0
while read -r -d ''; do
((n++)) # count
# maybe perform another act on file
done < <(find <expr> -print0)
echo $n
It is just a modification of a solution found in BashGuide that properly handles files with nonstandard names, by making the find output delimiter a NUL byte using -print0, and reading from it using '' (the NUL byte) as the loop delimiter.
This is my countfiles function in my ~/.bashrc (it's reasonably fast, should work for Linux & FreeBSD find, and does not get fooled by file paths containing newline characters; the final wc just counts NUL bytes):
countfiles ()
{
command find "${1:-.}" -type f -name "${2:-*}" -print0 |
command tr -dc '\0' | command wc -c;
return 0
}
countfiles
countfiles ~ '*.txt'
POSIX compliant and newline-proof:
find /path -exec printf %c {} + | wc -c
And, from my tests in /, not even two times slower than the other solutions, which are either not newline-proof or not portable.
Note the + instead of \;. That is crucial for performance, as \; spawns one printf command per file name, whereas + gives as many file names as it can to a single printf command. (And in the possible case where there are too many arguments, find intelligently spawns new printf commands on demand to cope with it, so it would be as if
{
printf %c very long argument list1
printf %c very long argument list2
printf %c very long argument list3
} | wc -c
were called.)
I needed something where I wouldn't capture all of the output from find, as other commands in the run also print stuff.
Without the need for temporary files, this is only possible with a big caveat: you might get (far) more than one line of output, as the counting command is executed once for every ~800-1600 files.
find . -print -exec sh -c 'printf %c "$@" | wc -c' '' '{}' + # just print the numbers
find . -print -exec sh -c 'echo "Processed `printf %c "$@" | wc -c` items."' '' '{}' +
Generates this result:
Processed 1622 items.
Processed 1578 items.
Processed 1587 items.
An alternative is to use a temporary file:
find . -print -fprintf tmp.file .
wc -c <tmp.file # using the file as argument instead causes the file name to be printed after the count
echo "Processed `wc -c <tmp.file` items." # sh variant
echo "Processed $(wc -c <tmp.file) items." # bash variant
The -print in each of the find commands does not influence the count at all.
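If you'd rather not leave a fixed tmp.file behind, a sketch of the same idea using mktemp and an EXIT trap for cleanup:
tmp=$(mktemp) && trap 'rm -f "$tmp"' EXIT
find . -print -fprintf "$tmp" .
echo "Processed $(wc -c <"$tmp") items."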
