How can I count the number of words in a directory recursively? - bash

I'm trying to calculate the number of words written in a project. There are a few levels of folders and lots of text files within them.
Can anyone help me find out a quick way to do this?
bash or vim would be good!
Thanks

Use find to scan the directory tree; wc will do the rest.
$ find path -type f | xargs wc -w | tail -1
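The last line gives the total, provided the file list is short enough for xargs to run wc only once and no file name contains whitespace.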

TL;DR:
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+ | bc
Explanation:
find . -type f -exec wc -w {} + runs wc -w on all the files (recursively) contained in . (the current working directory). find executes wc as few times as possible, but as many times as necessary to comply with ARG_MAX, the system limit on command-line length. When the number of files (and/or the combined length of their names) exceeds ARG_MAX, find invokes wc -w more than once, giving multiple total lines:
$ find . -type f -exec wc -w {} + | awk '/total/{print $0}'
8264577 total
654892 total
1109527 total
149522 total
174922 total
181897 total
1229726 total
2305504 total
1196390 total
5509702 total
9886665 total
Isolate these partial sums by printing only the first whitespace-delimited field of each total line:
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}'
8264577
654892
1109527
149522
174922
181897
1229726
2305504
1196390
5509702
9886665
paste the partial sums with a + delimiter to give an infix summation:
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+
8264577+654892+1109527+149522+174922+181897+1229726+2305504+1196390+5509702+9886665
Evaluate the infix summation using bc, which supports both infix expressions and arbitrary precision:
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+ | bc
30663324
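A variation on the above, as a rough sketch: instead of matching the total lines, sum the per-file counts and skip wc's own total lines. This avoids miscounting when a batch contains a single file (wc prints no total line then) or when a file name happens to contain the word total. The function name count_words is made up for illustration, and file names containing newlines would still confuse it.

```shell
# count_words is a hypothetical helper name, not part of any tool.
# Usage: count_words [DIR]   (DIR defaults to the current directory)
# wc's summary lines are exactly "<number> total"; every other line is
# "<count> <path>", so the awk step adds up only the per-file counts.
count_words() {
    find "${1:-.}" -type f -exec wc -w {} + |
        awk '!(NF == 2 && $2 == "total") { sum += $1 } END { print sum + 0 }'
}
```

Calling count_words with no argument counts the current directory; count_words docs would count a directory named docs.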
References:
https://www.cyberciti.biz/faq/argument-list-too-long-error-solution/
https://www.in-ulm.de/~mascheck/various/argmax/
https://linux.die.net/man/1/find
https://linux.die.net/man/1/wc
https://linux.die.net/man/1/awk
https://linux.die.net/man/1/paste
https://linux.die.net/man/1/bc

You could find and print all the content and pipe to wc:
find path -type f -exec cat {} \; -exec echo \; | wc -w
Note: the -exec echo \; is needed in case a file doesn't end with a newline character; without it, the last word of one file and the first word of the next would not be separated.
Or you could find and wc and use awk to aggregate the counts:
find . -type f -exec wc -w {} \; | awk '{ sum += $1 } END { print sum }'

If there's one thing I've learned from all the bash questions on SO, it's that a filename with a space will mess you up. This script will work even if you have whitespace in the file names.
#!/usr/bin/env bash
shopt -s globstar
count=0
for f in **/*.txt
do
    # reading from stdin makes wc print the count without the file name
    words=$(wc -w < "$f")
    count=$((count + words))
done
echo "$count"
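If globstar is not available (it needs bash 4 or later), roughly the same thing can be done with find. This is only a sketch: count_txt_words is a made-up name, and the *.txt pattern is carried over from the script above.

```shell
# count_txt_words is a hypothetical helper name.
# Usage: count_txt_words [DIR]
count_txt_words() {
    # find hands batches of paths to a small inline sh script as "$@",
    # so no path is ever re-split on whitespace.
    find "${1:-.}" -type f -name '*.txt' -exec sh -c '
        for f in "$@"; do
            wc -w < "$f"      # reading from stdin prints the count only
        done' sh {} + |
        awk '{ sum += $1 } END { print sum + 0 }'
}
```

Because the paths never pass through a text pipeline, spaces (and even newlines) in file names should not matter here.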

If you don't need to recurse and want to include every file in the current directory, a simple approach is wc with a glob. The example below uses wc -l, which counts lines; use wc -w to count words:
wc -l *
10 000292_0
500 000297_0
510 total
To cover only files with a specific extension in the current directory, concatenate them first (again, wc -l counts lines; swap in wc -w for words):
cat *.txt | wc -l

Related

Trying to do total word count on all files recursively but the sum is not right

So I do this:
find . -name '*.md' -type f -exec wc -w {} \; | awk '{ print $1 }'
And get a column of numbers (truncated):
...
2829
3619
828
1195
2406
2857
1480
1846
23
But then when I pipe all of that into a sum, I get an incorrect amount:
find . -name '*.md' -type f -exec wc -w {} \; | awk '{ print $1 }' | sum
9658 2
I thought awk would strip the white space out of wc -w output. But am I missing something?
(End result: I want to take a weekly word count and compare it to previous weeks.)
The issue with your code is that sum does not add up the output of the previous command; it prints a checksum and block count for its input.
Here is the sum help text:
Usage: sum [OPTION]... [FILE]...
Print checksum and block counts for each FILE.
Here is what you can do instead:
find . -name '*.md' -type f -exec wc -w {} \; | awk '{s+=$1} END {printf "%.0f\n", s}'
Here awk adds each value to s and, when done, prints it as an integer (to 0 decimal places).
Alternatively, concatenate all the files and pipe the result to wc -w; this way you don't need to sum the word counts of individual files.
find . -name '*.md' -type f -exec awk 1 {} + | wc -w
awk 1 makes sure each file's content ends with a newline, so it stays separated from the next file's content; if that's not necessary, you can use cat instead.
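To see why the separator matters, here is a tiny throwaway demonstration (the file names a.md and b.md are invented for the example). The first file has no trailing newline, so cat glues its last word to the first word of the next file:

```shell
tmp=$(mktemp -d)
printf 'foo' > "$tmp/a.md"       # no trailing newline
printf 'bar\n' > "$tmp/b.md"

cat_count=$(cat "$tmp/a.md" "$tmp/b.md" | wc -w)     # sees "foobar"
awk_count=$(awk 1 "$tmp/a.md" "$tmp/b.md" | wc -w)   # sees "foo" and "bar"

echo "cat: $((cat_count))  awk 1: $((awk_count))"    # prints: cat: 1  awk 1: 2
rm -r "$tmp"
```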

xargs wc -l reports two totals

I want to count all lines in the directory /usr/local/lib/python3.5/dist-packages/pandas.
cd /usr/local/lib/python3.5/dist-packages/pandas
find -name '*.*' |xargs wc -l
536577 total
Now write the two lines as one line:
find /usr/local/lib/python3.5/dist-packages/pandas -name '*.*' |xargs wc -l
This time bash outputs two totals: one is 495736, the other is 40841.
495736 + 40841 = 536577
Why doesn't bash give a single total of 536577 at the bottom, as find -name '*.*' |xargs wc -l does?
The POSIX xargs specification says:
The generated command line length shall be the sum of the size in bytes of the utility name and each argument treated as strings, including a null byte terminator for each of these strings. The xargs utility shall limit the command line length such that when the command line is invoked, the combined argument and environment lists shall not exceed {ARG_MAX}-2048 bytes.
That means that, in your case, find's output does not fit in ARG_MAX - 2048 bytes, so xargs splits it into 2 sets and invokes wc once for each set.
Take this pipeline, for example. In an ideal world its output would be 1, but it's not:
seq 1000000 | xargs echo | wc -l
seq's output is 6888896 bytes.
$ seq 1000000 | wc -c
6888896
My environment list takes up 558 bytes (ignoring, for the sake of clarity, that _ is dynamic and whether the implementation takes terminating null pointers into consideration).
$ env | wc -c
558
ARG_MAX on my system is 131072 bytes.
$ getconf ARG_MAX
131072
Now xargs has 131072 - 2048 - 558 = 128466 bytes; echo plus its null terminator takes up 5 bytes, so 128461 bytes are left. Therefore xargs will have to invoke echo 6888896/128461 ≈ 53.6 times, which rounds up to 54. Let's see if that's the case:
$ seq 1000000 | xargs echo | wc -l
54
Yes, it is.
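The arithmetic above can be wrapped in a small function for experimenting with other numbers. This is only a back-of-the-envelope sketch: estimate_invocations is a made-up name, and real xargs implementations may use a smaller internal buffer than the POSIX ceiling, so actual counts can differ.

```shell
# estimate_invocations is a hypothetical helper; all four arguments are bytes.
# Usage: estimate_invocations INPUT_BYTES ARG_MAX ENV_BYTES COMMAND_BYTES
estimate_invocations() {
    budget=$(( $2 - 2048 - $3 - $4 ))       # room left for arguments per call
    echo $(( ($1 + budget - 1) / budget ))  # integer ceiling of input/budget
}

estimate_invocations 6888896 131072 558 5   # prints 54
```

To plug in live values, the first three arguments could come from seq 1000000 | wc -c, getconf ARG_MAX and env | wc -c, as measured above.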
You can deal with xargs running the command multiple times by adding an awk bit to the pipeline:
find wherever -name "*.*" -type f -print0 | \
xargs -0 wc -l | \
awk '$2 == "total" { total += $1 } END { print "Overall total", total } 1'
(Assuming GNU find and xargs or other implementations that understand -print0 and -0 respectively; otherwise filenames with spaces etc. in them can cause problems).
GNU find and maybe other implementations can skip the xargs, actually:
find wherever -name "*.*" -type f -exec wc -l '{}' '+'
will have the same effect as using xargs to run wc on multiple files at a time.

How to count files in subdir and filter output in bash

Hi hoping someone can help, I have some directories on disk and I want to count the number of files in them (as well as dir size if possible) and then strip info from the output. So far I have this
find . -type d -name "*,d" -print0 | xargs -0 -I {} sh -c 'echo -e $(find "{}" | wc -l) "{}"' | sort -n
This gets me all the dir's that match my pattern as well as the number of files - great!
This gives me something like
2 ./bob/sourceimages/psd/dzv_body.psd,d
2 ./bob/sourceimages/psd/dzv_body_nrm.psd,d
2 ./bob/sourceimages/psd/dzv_body_prm.psd,d
2 ./bob/sourceimages/psd/dzv_eyeball.psd,d
2 ./bob/sourceimages/psd/t_zbody.psd,d
2 ./bob/sourceimages/psd/t_gear.psd,d
2 ./bob/sourceimages/psd/t_pupil.psd,d
2 ./bob/sourceimages/z_vehicles_diff.tga,d
2 ./bob/sourceimages/zvehiclesa_diff.tga,d
5 ./bob/sourceimages/zvehicleswheel_diff.jpg,d
From that I would like to filter on the number of files, keeping only results with more than 4, for example. I would also like to capture the file type as a variable for each remaining result, e.g. ./bob/sourceimages/zvehicleswheel_diff.jpg,d
I guess I could use awk for this?
Then finally I would like to remove all the remaining results from disk. With find I normally just do something like -exec rm -rf {} \; but I'm not clear how it would work here.
Thanks a lot
EDITED
While this is clearly not the answer, these commands get me the info I want in the form I want it. I just need a way to put it all together and not search multiple times as that's total rubbish
filetype=$(find . -type d -name "*,d" -print0 | awk 'BEGIN { FS = "." }; { print $3 }' | cut -d',' -f1)
filesize=$(find . -type d -name "*,d" -print0 | xargs -0 -I {} sh -c 'du -h {};' | awk '{ print $1 }')
filenumbers=$(find . -type d -name "*,d" -print0 | xargs -0 -I {} sh -c 'echo -e $(find "{}" | wc -l);')
files_count=`ls -keys | nl`
For instance:
ls | nl
nl prints each line of its input with a line number.
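One way to put the pieces from the question together in a single pass, as a rough sketch. list_big_dirs is a made-up name, and the type is guessed from whatever sits between the last dot and the trailing ,d. As in the question, the count includes the directory itself, and names containing newlines would inflate it.

```shell
# list_big_dirs is a hypothetical helper name.
# Usage: list_big_dirs ROOT MIN   (prints "count type path" per match)
list_big_dirs() {
    find "$1" -type d -name '*,d' -exec sh -c '
        min=$1; shift
        for d in "$@"; do
            n=$(( $(find "$d" | wc -l) ))   # entries, including the directory itself
            if [ "$n" -gt "$min" ]; then
                base=${d##*/}               # e.g. zvehicleswheel_diff.jpg,d
                ext=${base%,d}              # drop the trailing ,d
                ext=${ext##*.}              # keep what follows the last dot
                printf "%s %s %s\n" "$n" "$ext" "$d"
            fi
        done' sh "$2" {} +
}
```

For example, list_big_dirs . 4 prints only directories with a count above 4. It is worth reviewing that output before feeding the paths to anything that deletes.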

Script to count totals from commands and output to screen

I am looking for assistance in creating a bash script that will run several similar commands, sum up the totals and output that total to the screen. I want to run the following commands:
find /var/log/audit -xdev -type f -printf '%i\n' | sort -u | wc -l
find /boot -xdev -type f -printf '%i\n' | sort -u | wc -l
find /home -xdev -type f -printf '%i\n' | sort -u | wc -l
And so on. I have a few others. What I am basically doing is counting all of the files in each mount point on my system. I then need the script to sum the output of each command's wc -l and print the grand total to the screen. Any help is greatly appreciated.
This should work:
a=$(find /var/log/audit -xdev -type f -printf '%i\n' | sort -u | wc -l)
b=$(find /boot -xdev -type f -printf '%i\n' | sort -u | wc -l)
c=$(find /home -xdev -type f -printf '%i\n' | sort -u | wc -l)
final=$(($a+$b+$c))
echo $final
This works without naming each variable; replace the echo commands with your own commands:
awk '{sum+=$1} END{print "total: "sum}' < <(echo 4; echo 5; echo 6)
Alternatively, if the individual counts are not required, you can pass more than one path to find:
find path1 path2 path3 ...
This might be a good place for dc:
{
for mnt in /var/log/audit /boot /home; do
find "$mnt" -xdev -type f -printf '%i\n' | sort -u | wc -l
done
echo "+"
echo "+"
echo "p"
} | dc
You need one less "+" than your number of mountpoints.
I would append each command's output to a file:
your_command >> results.txt
and then sum the values:
awk '{ sum += $1 } END { print sum }' results.txt
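For completeness, the same idea as a loop that keeps a running total in shell arithmetic, with no temporary file. This is a sketch only: count_files_on is a made-up name, and -printf is a GNU find feature, as in the commands from the question.

```shell
# count_files_on is a hypothetical helper name.
# Usage: count_files_on MOUNTPOINT...
count_files_on() {
    total=0
    for mnt in "$@"; do
        # Inode numbers are only unique within one filesystem, so
        # de-duplicate per mount point, as the original commands do.
        n=$(find "$mnt" -xdev -type f -printf '%i\n' | sort -u | wc -l)
        total=$(( total + n ))
    done
    echo "$total"
}
```

It would be called as count_files_on /var/log/audit /boot /home, adding further mount points as needed.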

Need a command that will separate the count according to files that are < 1M lines and > 1M lines

Environment: Solaris 9
I have a command that gives me a total count of files. But I need a command that will separately count files that are less than 1M lines and files that are more than 1M lines long. How can I do that?
find . -type f -exec wc -l {} \; | awk '{print $1}' | paste -sd+ | bc
Use the -size option (note that it tests file size, not the number of lines):
echo "Smaller: $(find . -type f -size -1M | wc -l)"
echo "Larger: $(find . -type f -size +1M | wc -l)"
If your find does not support the M suffix, write the size out in full (for example, -size +1048576c for bytes).
EDIT: following rojomoke's comment, here is a version that counts LINES in the files with the wc utility, since that is what you used in your original post.
Code:
# here I am already in the directory with the files so I just use *
# to refer to all files
# the wc -l will return a single column of counts so I use $1 to
# refer to field 1
wc -l * | awk '$1>1e6{bigger++}$1<1e6{smaller++}END{print "Files > 1M lines = ", bigger, "\nFiles < 1M lines = ", smaller}'
Output:
"Files > 1M lines = 454"
"Files < 1M lines = 528"

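A slightly more defensive take on the awk one-liner above, as a sketch. With more than one file, wc -l * also prints a total line, which the one-liner counts as if it were a file, and a file with exactly 1000000 lines lands in neither bucket. The version below reads each file separately and takes the threshold as a parameter. bucket_by_lines is a made-up name, and on older systems such as Solaris the -v option may require nawk or /usr/xpg4/bin/awk rather than the default awk.

```shell
# bucket_by_lines is a hypothetical helper name.
# Usage: bucket_by_lines THRESHOLD FILE...
bucket_by_lines() {
    limit=$1; shift
    for f in "$@"; do
        wc -l < "$f"      # one bare line count per file, no total line
    done |
        awk -v limit="$limit" '
            $1 >  limit { bigger++ }
            $1 <= limit { smaller++ }
            END {
                printf "Files > %d lines = %d\n", limit, bigger
                printf "Files <= %d lines = %d\n", limit, smaller
            }'
}
```

For the question as asked, it would be run as bucket_by_lines 1000000 * from the directory holding the files.
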
Resources