Use find, wc, and sed to count lines - bash

I was trying to use sed to count all the lines in files with a particular extension:
find -name '*.m' -exec wc -l {} \; | sed ...
How would I include sed in this line to get the total?

You may also get the nice formatting from wc with:
wc `find -name '*.m'`

Most of the answers here won't work well for a large number of files. Some will break if the list of file names is too long for a single command line call, others are inefficient because -exec starts a new process for every file. I believe a robust and efficient solution would be:
find . -type f -name "*.m" -print0 | xargs -0 cat | wc -l
Using cat in this way is fine, as its output is piped straight into wc so only a small amount of the files' content is kept in memory at once. If there are too many files for a single invocation of cat, cat will be called multiple times, but all the output will still be piped into a single wc process.
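As a hedged illustration of that last point, you can force xargs to batch the arguments and confirm that the count is still a single total (the -n 100 batch size here is arbitrary):
# Force xargs to pass at most 100 files per cat invocation; wc still reads one
# combined stream, so a single grand total is printed.
find . -type f -name "*.m" -print0 | xargs -0 -n 100 cat | wc -l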

You can cat all files through a single wc instance to get the total number of lines:
find . -name '*.m' -exec cat {} \; | wc -l

On modern GNU platforms, wc and find take -print0 and --files0-from parameters that can be combined into a command that counts lines in files, with a total at the end. Example:
find . -name '*.c' -type f -print0 | wc -l --files0-from=-

You could also use sed for counting lines, in place of wc:
find . -name '*.m' -exec sed -n '$=' {} \;
where '$' addresses the last line and '=' prints the current line number, so this prints each file's line count
EDIT
You could also try something like sloccount.

Hm, the solution with cat may be problematic if you have many files, especially big ones.
The second solution doesn't give a total, just lines per file, as I tested.
I'd prefer something like this:
find . -name '*.m' | xargs wc -l | tail -1
This does the job fast no matter how big the files are, but note that if the file list is long enough for xargs to split it across several wc invocations, tail -1 only reports the total of the last batch.
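A hedged sketch for that edge case: sum the per-file counts that wc prints (every output line except the per-batch "total" lines), so the result stays correct however many times xargs has to run wc; the -print0/-0 pair also protects against odd file names:
# Assumes no file path ends in the literal word "total".
find . -name '*.m' -print0 | xargs -0 wc -l | awk '$NF != "total" { s += $1 } END { print s }'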

sed is not the proper tool for counting. Use awk instead:
find . -name '*.m' -exec awk 'END { print NR }' {} +
Using + instead of \; makes find pass a batch of files to each awk invocation (as xargs does), so awk prints one count per batch rather than one per file.

For big directories we should use:
find . -type f -name '*.m' -exec sed -n '$=' '{}' + 2>/dev/null | awk '{ total+=$1 }END{print total}'
# alternative using awk twice
find . -type f -name '*.m' -exec awk 'END {print NR}' '{}' + 2>/dev/null | awk '{ total+=$1 }END{print total}'

Related

How to grep query all files for two strings

Here we go:
I need to find PHP files which contain both a TODO statement and my name.
Both strings could be anywhere in the file, on any line and at any position within the line.
How to grep query for my name:
find -name '*.php' -exec grep -in "fincken" {} +
output:
./some/file.php:51: ramon fincken
./somefile.php:2: rfincken
How to grep query for the TODOs
find -name '*.php' -exec grep -n "TODO" {} +
output:
./some/file.php:53: // TODO: foobar!
./some/otherfile.php:53: // TODO: foobar?
I need to combine both grep queries (or their results) so I am expecting this as result:
./some/file.php
I have tried combining patterns in a single grep, but that either expected both strings on the same line in a particular order, or returned files matching either string (OR) instead of both (AND).
This line looks ugly, but it should give what you want:
find whatever... | xargs grep -il 'fincken' |
    xargs grep -il 'todo' |
    xargs grep -in -e 'todo' -e 'fincken'
The output would look like:
/foo/bar/file : 100:TODO
/foo/bar/file : 101:fincken
Only files with both TODO and fincken will be listed.
Ask the first grep to return just the file name and then pipe to another grep:
find -name '*.php' -exec grep -li "fincken" {} + | xargs grep -l "TODO"
From man grep, -l (lowercase L) prints only the names of matching files. This way the first grep produces the list of files containing "fincken", which xargs then feeds to the second grep.
Your output will be the list of files which contain both "fincken" and "TODO". You can of course pipe through more xargs grep -l stages if you want to require more words, as in the sketch below.
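For example, a hedged sketch with a hypothetical third term FIXME (each extra stage adds one more AND condition; like the original, it assumes file names without whitespace):
# Each additional "xargs grep -l" narrows the list to files that also match.
find -name '*.php' -exec grep -li "fincken" {} + | xargs grep -l "TODO" | xargs grep -l "FIXME"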
You can also use grep alone like this, using -R to do a recursive search:
grep -Rl --include="*php" "TODO" * | xargs grep -il "fincken"
Note that I moved the TODO grep to the first position, because the case-insensitive -i used for "fincken" is much slower. This way, the grep -i is only run on the already filtered results.
You can pipe the first grep through a second one, get the name of the file and skip repetitions:
find -name '*.php' -exec grep -in "fincken" {} + | grep TODO | cut -d: -f1 | uniq
People are making this more complicated than it needs to be. -exec will take the exit code of the command it runs and use it logically in find. So you can just do
find -name '*.php' -exec grep -iq "fincken" {} \; -exec grep -iq "TODO" {} \; -print
Which will get to the -print only if both -exec blocks return 0.
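A hedged variation on the same idea: because each -exec test short-circuits, putting the cheaper case-sensitive TODO check first avoids running the case-insensitive grep on files that cannot match anyway.
# Same AND logic, reordered so the cheap test runs first.
find -name '*.php' -exec grep -q "TODO" {} \; -exec grep -iq "fincken" {} \; -print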

Displaying the result of find / replace over multiple documents on bash

I love to use the following command to do find / replace across multiple files in bash:
find -wholename "*.txt" -print | xargs sed -i 's/foo/bar/g'
However, the above command processes everything in silence, and sometimes I would like it to print all the changes it makes so I can double-check that I did everything correctly. How should I improve the above command so that it outputs that information? I tried the -v argument with xargs, but it gives me an invalid option error.
You can do something like:
find -wholename "*.txt" | xargs sed -n '/foo/p;s/foo/bar/gp'
This prints each line you wish to substitute, followed on the next line by the result of the substitution. Note that without -i the files themselves are not modified.
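If you also want the files modified in place while still seeing every changed line, a possible sketch with GNU sed on Linux (the s command's w flag writes each substituted line to the named file, here the terminal via /dev/stdout):
# Sketch, assuming GNU sed: edit in place and echo every substituted line.
# File names are not shown, only the changed lines.
find -wholename "*.txt" -print0 | xargs -0 sed -i 's/foo/bar/gw /dev/stdout'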
You can use awk and get filename as well:
find -wholename "*.txt" | xargs awk '/foo/{print FILENAME; gsub(/foo/,"bar");print}'
To print the entire file, remove the trailing print and add 1:
find -wholename "*.txt" | xargs awk '/foo/{print FILENAME; gsub(/foo/,"bar")}1'
The regex will have to be modified to suit your requirements, and in-place editing of files is only available from gawk version 4.1 (the -i inplace extension).
Test:
$ head file*
==> file1 <==
,,"user1","email"
,,"user2","email"
,,"user3","email"
,,"user4","email"
==> file2 <==
,,user2,location2
,,user4,location4
,,user1,location1
,,user3,location3
$ find . -name "file*" -print | xargs awk '/user1/{print FILENAME; gsub(/user1/,"TESTING");print}'
./file1
,,"TESTING","email"
./file2
,,TESTING,location1
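Building on the note above about gawk 4.1, a hedged sketch that applies the changes in place while still printing each changed line (with its file name) to stderr for review:
# Sketch, assuming gawk >= 4.1: the inplace extension writes the (possibly
# modified) lines back to each file, while the diagnostic print goes to stderr.
find -wholename "*.txt" -print0 |
  xargs -0 gawk -i inplace '/foo/ { print FILENAME ": " $0 > "/dev/stderr"; gsub(/foo/, "bar") } 1'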
In order to see the differences you can redirect the output of sed to a new file for every input file and compare it with the original.
for i in `find -wholename "*.txt"`; do
    sed 's/foo/bar/g' ${i} > ${i}.new;
    diff -u ${i} ${i}.new;
done
If the changes seem ok, move the new files to their original names.
for i in `find -wholename "*.new"`; do
    mv ${i} ${i/.new};
done
All of this can be done with find and sed; only a little modification is needed:
find -path "*.txt" -exec sed -i.bak 's/foo/bar/g' {} +
This calls sed with as many files as possible per invocation (note the + at the end of -exec), so xargs is not needed. In sed, -i.bak does the in-place editing and saves the original file with a .bak extension, so you can check the differences later if needed.
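For example, one possible sketch for reviewing those differences afterwards (the ${1%.bak} parameter expansion strips the backup suffix to recover the edited file's name):
# Compare every .bak backup with its edited counterpart.
find -path "*.txt.bak" -exec sh -c 'diff -u "$1" "${1%.bak}"' _ {} \;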
In man find one can read:
-wholename pattern
See -path. This alternative is less portable than -path.

How to output merge to file?

I have been following the directions in this post to concatenate files in a directory: Unix: merge many files, while deleting first line of all files, but it is not working correctly. To summarize, this is the command being used:
find . -name "*.extension" | xargs -n 1 tail -n +2
I attempt to redirect the output to a named file like this:
find . -name "*.extension" | xargs -n 1 tail -n +2 > concatenated.csv
but this only grabs the first file. Am I doing something wrong? I am very unfamiliar with these commands.
The problem is that your file names contain whitespace or quotes, which xargs by default doesn't handle. If you use the GNU versions of find and xargs, you can use
find . -name "*.extension" -print0 | xargs -0 -n 1 tail -n +2 > concatenated.csv
to safely handle all file names.
For a simple command like this, you can skip xargs and just use the -execdir primary:
find . -name "*.extension" -execdir tail -n +2 '{}' \; > concatenated.csv
Inspired by Kevin's answer, you can also use
find . -name "*.extension" -execdir tail -q -n +2 '{}' + > concatenated.csv
which will only run tail once per large batch of files instead of once per file; the -q flag suppresses the ==> file <== headers that tail would otherwise print when given more than one file.
Instead of using tail on each file individually, you may want to use another tool that can process more files at once. e.g. awk:
find . -name '*.extension' -exec awk 'FNR > 1' {} + > concatenated.csv

using pipes with a find command

I have a series of delimited files, some of which have some bad data and can be recognized by doing a column count on them. I can find them with the following command:
find ./ -name 201201*gz -mtime 12
They are all gzipped and I do not want to un-archive them all. So to check the column counts I've been doing I'm running this as a second command on each file:
zcat ./path/to/file.data | awk '{print NF}' | head
I know I can run a command on each file through find with -exec, but how can I also get it to run through the pipes? A couple things I tried, neither of which I expected to work and neither of which did:
find ./ -name 201201*gz -mtime 12 -print -exec zcat {} \; | awk '{print NF}'| head
find ./ -name 201201*gz -mtime 12 -print -exec "zcat {} | awk '{print NF}'| head" \;
I'd use an explicit loop approach:
find . -name '201201*gz' -mtime 12 | while read file; do
    echo "$file: "
    zcat "$file" | awk '{print NF}' | head
done
More or less, you pipe the output of find onwards like this:
find . -name "foo" -print0 | xargs -0 echo
So your command would look like:
find ./ -name "201201*gz" -mtime 12 -print0 | xargs -0 zcat | awk '{print NF}'| head
-print0 and xargs -0 just help to make sure file names with special characters don't break the pipe.
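Note that piping everything through a single zcat loses the per-file boundaries. A hedged alternative sketch that keeps a header and a separate 10-line sample for each file:
# Run the pipeline once per file via an inline shell, so each file gets its own
# header and its own "head" of column counts.
find . -name '201201*gz' -mtime 12 -exec sh -c '
  for f in "$@"; do
    echo "== $f =="
    zcat "$f" | awk "{print NF}" | head
  done
' sh {} +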

total size of group of files selected with 'find'

For instance, I have a large filesystem that is filling up faster than I expected. So I look for what's being added:
find /rapidly_shrinking_drive/ -type f -mtime -1 -ls | less
And I find, well, lots of stuff. Thousands of files of six-seven types. I can single out a type and count them:
find /rapidly_shrinking_drive/ -name "*offender1*" -mtime -1 -ls | wc -l
but what I'd really like is to be able to get the total size on disk of these files:
find /rapidly_shrinking_drive/ -name "*offender1*" -mtime -1 | howmuchspace
I'm open to a Perl one-liner for this, if someone's got one, but I'm not going to use any solution that involves a multi-line script, or File::Find.
The command du tells you about disk usage. Example usage for your specific case:
find rapidly_shrinking_drive/ -name "offender1" -mtime -1 -print0 | du --files0-from=- -hc | tail -n1
(Previously I wrote du -hs, but on my machine that appears to disregard find's input and instead summarises the size of the cwd.)
Darn, Stephan202 is right. I didn't think about du -s (summarize), so instead I used awk:
find rapidly_shrinking_drive/ -name "offender1" -mtime -1 | xargs du | awk '{total+=$1} END{print total}'
I like the other answer better though, and it's almost certainly more efficient.
With GNU find:
find /path -name "offender" -printf "%s\n" | awk '{t+=$1}END{print t}'
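If a human-readable figure is preferred, one possible variation (assuming GNU coreutils, which provides numfmt) is to pass the byte total through numfmt:
# Sum the sizes in bytes, then format the total as e.g. "3.2G".
find /path -name "offender" -printf "%s\n" | awk '{t+=$1} END{print t}' | numfmt --to=iec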
I'd like to promote jason's comment above to the status of answer, because I believe it's the most mnemonic (though not the most generic, if you really gotta have the file list specified by find):
$ du -hs *.nc
6.1M foo.nc
280K foo_region_N2O.nc
8.0K foo_region_PS.nc
844K foo_region_xyz.nc
844K foo_region_z.nc
37M ETOPO1_Ice_g_gmt4.grd_region_zS.nc
$ du -ch *.nc | tail -n 1
45M total
$ du -cb *.nc | tail -n 1
47033368 total
Recently I faced almost the same problem and I came up with this solution.
find $path -type f -printf '%s '
It'll show file sizes in bytes; from man find:
-printf format
True; print format on the standard output, interpreting `\' escapes and `%' directives. Field widths and precisions can be specified as with the `printf' C function. Please note that many of the fields are printed as %s rather than %d, and this may mean that flags don't work as you might expect. This also means that the `-' flag does work (it forces fields to be left-aligned). Unlike -print, -printf does not add a newline at the end of the string.
...
%s File's size in bytes.
...
And to get a total I used this:
echo $[ $(find $path -type f -printf %s+)0] #b
echo $[($(find $path -type f -printf %s+)0)/1024] #Kb
echo $[($(find $path -type f -printf %s+)0)/1024/1024] #Mb
echo $[($(find $path -type f -printf %s+)0)/1024/1024/1024] #Gb
I have tried all these commands, but with no luck.
So I have found this one that gives me an answer:
find . -type f -mtime -30 -exec ls -l {} \; | awk '{ s+=$5 } END { print s }'
Since OP specifically said:
I'm open to a Perl one-liner for this, if someone's got one, but I'm
not going to use any solution that involves a multi-line script, or
File::Find.
...and there's none yet, here is the perl one-liner:
find . -name "*offender1*" | perl -lne '$total += -s $_; END { print $total }'
You could also use ls -l to find their size, then awk to extract and sum the size:
find /rapidly_shrinking_drive/ -name "offender1" -mtime -1 -exec ls -l {} + | awk '{ total += $5 } END { print total }'
