Count how many files contain a string in the last line - bash

I want to count how many files in the current directory have the string "A" in the last line.
First solution: tail -n 1 * | grep \"A\"| wc -l
This works fine, but when there are more files it fails with bash: /usr/bin/tail: Argument list too long.
Is there a way to get around it?
Bonus points if I can also optionally get which files contain it.
EDIT: my folder contains 343729 files
EDIT2: In his comment, @tso usefully pointed to the article I'm getting "Argument list too long". How can I process a large list in chunks?
RESULTS:
@tso solution for f in $(find . -type f); do tail -1 $f|grep \"A\"; done|wc -l takes about 20 minutes
@lars solution grep -P "\"A\"*\Z" -r . | wc -l takes about 20 minutes
@mklement0 solution printf '%s\0' * | xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l takes about 10 minutes
@james solution (in the comments) for i in * ; do awk 'END{if(/a/)print FILENAME}' "$i" ; done takes about 25 minutes
@codeforester solution find . -type f -exec tail -n 1 -- {} + | grep -EB 1 '^[^=]+A' | grep -c '^==>' takes >20 minutes.
The @mklement0 and @codeforester solutions also have the advantage that if I want to change the grep pattern, the second time I run it it takes zero time; I guess it's due to some sort of caching.
I've accepted @mklement0's answer as it seems to be the fastest, but I'd still like to mention @tso and @lars for their contributions, which, based on my personal knowledge, are easier and more adaptable solutions.

xargs is able to overcome the max. command-line length limitation by efficiently batching the invocations into as few calls as possible.
The shell's builtins, such as printf, are not subject to the max. command-line length.
Knowing this, you can use the following approach, which assumes that your xargs implementation supports the -0 option for NUL-terminated input, and that your tail implementation supports multiple file operands and the -q option for suppressing filename headers.
Both assumptions hold for the GNU (Linux) and BSD/macOS implementations of these utilities:
printf '%s\0' * | xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l
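As a rough sketch of the same approach, the snippet below runs the pipeline against a small throwaway directory and also lists the matching file names (the "bonus" from the question). The directory, the file names and the per-file loop are illustrative assumptions, not part of the original answer; the loop starts one tail per file, so it trades some speed for the names.

```shell
# Hypothetical demo directory; file names and contents are made up.
demo=$(mktemp -d)
printf 'x\n"A"\n'   > "$demo/one.txt"         # last line matches
printf '"A"\nx\n'   > "$demo/two.txt"         # only the first line matches
printf 'y\n"A" z\n' > "$demo/three file.txt"  # last line matches, name has a space

# Count: the pipeline from above, run inside the demo directory.
# tr strips the padding that some wc implementations add.
count=$(cd "$demo" && printf '%s\0' * |
  xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l | tr -d '[:space:]')

# Names: check each file separately inside the batched shell.
names=$(cd "$demo" && printf '%s\0' * |
  xargs -0 sh -c 'for f; do
    if tail -n 1 "$f" | grep -q "\"A\""; then printf "%s\n" "$f"; fi
  done' -)

echo "matching files: $count"
echo "$names"
```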

How about using find, tail, and grep this way? This will be more efficient than having to loop through each file. Also, tail -n 1 reads only the last line of each file and is hence very I/O efficient.
find . -maxdepth 1 -type f -exec tail -n 1 -- {} + | grep -EB 1 '^[^=]+A' | grep -c '^==>'
find will invoke tail -n 1 in batches, passing as many file names at a time as fit within the ARG_MAX limit
tail will print the last line of each file, prefixing it with the pattern "==> file_name <=="
grep -EB 1 '^[^=]+A' will look for pattern A and fetch the previous line as well (it will exclude the file_name lines while looking for the match)
grep -c '^==>' will count the number of files with matching pattern
If you don't need to know the name of the files having a match, but just get the count of files, you could do this:
find . -maxdepth 1 -type f -exec tail -q -n 1 -- {} + | grep -c 'A'
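The batching that find -exec ... {} + and xargs perform can be made visible with a small experiment. This is only an illustration under assumptions: the five file names are made up, and -n 2 is used purely to force small batches; in real use you would let both tools fill each command line as far as the system limit allows.

```shell
# Hypothetical demo directory with five empty files.
demo=$(mktemp -d)
(cd "$demo" && touch f1 f2 f3 f4 f5)

# find ... -exec ... {} + passes as many names per invocation as fit in the
# limit; five short names fit in one batch, so the shell reports 5 arguments.
find_batch=$(find "$demo" -type f -exec sh -c 'echo "$#"' sh {} +)

# xargs batches the same way; -n 2 caps each batch at two arguments
# only so that several invocations show up here.
xargs_batches=$(cd "$demo" && printf '%s\0' * | xargs -0 -n 2 sh -c 'echo "$#"' sh)

echo "find batch size:    $find_batch"
echo "xargs batch sizes:" $xargs_batches
```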

Using GNU awk:
$ cat foo
b
a
$ cat bar
b
b
$ awk 'ENDFILE{if(/a/){c++; print FILENAME}}END{print c}' *
foo
1
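ENDFILE is a GNU awk extension. If gawk is not available, a similar result can be sketched in POSIX awk by remembering the previous line and checking it whenever a new file begins. This is an assumption-laden variant, not the answer's own code: the variable names are made up, and empty files are silently skipped.

```shell
# Recreate the foo and bar files from the example in a throwaway directory.
demo=$(mktemp -d)
printf 'b\na\n' > "$demo/foo"
printf 'b\nb\n' > "$demo/bar"

result=$(cd "$demo" && awk '
  # A new file starts: judge the last line of the file that just ended.
  FNR == 1 && NR > 1 { if (last ~ /a/) { c++; print name } }
  # Remember the current line and the file it came from.
  { last = $0; name = FILENAME }
  # The final file has no successor, so check it here.
  END { if (NR > 0 && last ~ /a/) { c++; print name }; print c + 0 }
' *)

echo "$result"
```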

try with find:
for f in $(find . -type f); do tail -1 $f|grep PATTERN; done|wc -l

If grep supports the -P option, this might work:
grep -P "A\Z" -r . | wc -l
See man pcrepattern. In short:
\Z matches at the end of the subject and also matches before a newline at the end of the subject
\z matches only at the end of the subject
Try \Z and \z.
To see which files match, you would use only the grep part without the pipe to wc.

This will return the number of files:
grep -rlP "A\z" | wc -l
If you want to get the names then simply:
grep -rlP "A\Z"

Related

How to grep files in date order

I can list the Python files in a directory from most recently updated to least recently updated with
ls -lt *.py
But how can I grep those files in that order?
I understand one should never try to parse the output of ls as that is a very dangerous thing to do.
You may use this pipeline to achieve this with GNU utilities:
find . -maxdepth 1 -name '*.py' -printf '%T@:%p\0' |
sort -z -t : -rnk1 |
cut -z -d : -f2- |
xargs -0 grep 'pattern'
This will handle filenames with special characters such as space, newline, glob etc.
find finds all *.py files in current directory and prints modification time (epoch value) + : + filename + NUL byte
sort command performs reverse numeric sort on first column that is timestamp
cut command removes 1st column (timestamp) from output
xargs -0 grep command searches pattern in each file
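To see the ordering in action, here is a small sketch that applies the pipeline to three made-up files with fixed modification times. It assumes GNU find, sort and cut, as the answer does; the file names, timestamps and the -l option on grep (to print only names) are illustrative choices.

```shell
# Hypothetical demo files with explicit modification times.
demo=$(mktemp -d)
printf 'pattern\n' > "$demo/old.py"
printf 'nothing\n' > "$demo/mid.py"
printf 'pattern\n' > "$demo/new file.py"
touch -t 202001010000 "$demo/old.py"
touch -t 202101010000 "$demo/mid.py"
touch -t 202201010000 "$demo/new file.py"

# Newest first; grep -l prints only the names of files that contain the pattern.
ordered=$(cd "$demo" && find . -maxdepth 1 -name '*.py' -printf '%T@:%p\0' |
  sort -z -t : -rnk1 |
  cut -z -d : -f2- |
  xargs -0 grep -l 'pattern')

echo "$ordered"
```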
There is a very simple way if you want to get the list of files that contain the pattern in chronological order:
grep -sil <searchpattern> <files-to-grep> | xargs ls -ltr
For example, you grep for "hello world" in *.txt; with -sil you make grep case insensitive (-i), suppress error messages (-s) and just list the matching files (-l). You then pass the list on to ls (| xargs), which sorts it by date (-t), shows the date in the long listing (-l) and reverses the order so that the oldest file comes first (-r).

Handle files with space in filename and output file names

I need to write a Bash script that achieve the following goals:
1) move the newest n pdf files from folder 1 to folder 2;
2) correctly handles files that could have spaces in file names;
3) output each file name in a specific position in a text file. (In my actual usage, I will use sed to put the file names in a specific position of an existing file.)
I tried to make an array of filenames and then move them and do text output in a loop. However, the following array cannot handle files with spaces in filename:
pdfs=($(find -name "$DOWNLOADS/*.pdf" -print0 | xargs -0 ls -1 -t | head -n$NUM))
Suppose a file has name "Filename with Space". What I get from the above array will have "with" and "Space" in separate array entries.
I am not sure how to avoid these words in the same filename being treated separately.
Can someone help me out?
Thanks!
-------------Update------------
Sorry for being vague on the third point as I thought I might be able to figure that out after achieving the first and second goals.
Basically, it is a text file that has a line starting with "%comment" near the end, and I will need to insert the filenames before that line in the format "file=PATH".
The PATH is the folder 2 that I have my pdfs moved to.
You can achieve this using mapfile in conjunction with GNU versions of find | sort | cut | head that have options to operate on NUL-terminated filenames:
mapfile -d '' -t pdfs < <(find "$DOWNLOADS" -name '*.pdf' -printf '%T@:%p\0' |
sort -z -t : -rnk1 | cut -z -d : -f2- | head -z -n "$NUM")
Commands used are:
mapfile -d '': To read array with NUL as delimiter
find: outputs each file's modification stamp in EPOCH + ":" + filename + NUL byte
sort: sorts reverse numerically on 1st field
cut: removes 1st field from output
head: outputs only first $NUM filenames
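Putting the pieces together, a minimal sketch of the whole task might look like the following. It assumes bash 4.4 or later (for mapfile -d) and GNU find, sort, cut and head; the directory layout, file names and the file=PATH output line are illustrative guesses at what the question describes.

```shell
#!/usr/bin/env bash
# Hypothetical source and destination folders with three made-up PDFs.
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dst"
: > "$demo/src/old.pdf"
: > "$demo/src/mid file.pdf"
: > "$demo/src/new file.pdf"
touch -t 202001010000 "$demo/src/old.pdf"
touch -t 202101010000 "$demo/src/mid file.pdf"
touch -t 202201010000 "$demo/src/new file.pdf"

NUM=2
# Read the NUM newest names into an array, NUL-delimited so spaces survive.
mapfile -d '' -t pdfs < <(find "$demo/src" -name '*.pdf' -printf '%T@:%p\0' |
  sort -z -t : -rnk1 | cut -z -d : -f2- | head -z -n "$NUM")

for f in "${pdfs[@]}"; do
  mv -- "$f" "$demo/dst/"
  # ${f##*/} strips the directory part, leaving the bare file name.
  printf 'file=%s\n' "$demo/dst/${f##*/}"
done
```

Quoting "${pdfs[@]}" and "$f" everywhere is what keeps "new file.pdf" as a single array entry.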
find downloads -name "*.pdf" -printf "%T@ %p\0" |
sort -z -t' ' -k1 -n |
cut -z -d' ' -f2- |
tail -z -n 3
find all *.pdf files in downloads
for each file, print its modification date %T with the format specifier @, which means seconds since the epoch with fractional part, then print a space and the filename, and terminate the record with \0
Sort the null-separated stream using space as the field separator, using only the first field, with a numerical sort
Remove the first field from the stream, i.e. the modification date, leaving only filenames.
Get the count of the newest files, in this example the 3 newest files, by using tail. We could also do a reverse sort and use head, no difference.
Don't use ls in scripts. ls is for nicely formatted output. You could do xargs -0 stat --printf "%Y %n\0", which would basically move your script forward, as ls isn't meant to be used in scripts. I just couldn't make stat output the fractional part of the modification date.
As for the second part, we need to save the null-delimited list to a file
find downloads ........ >"$tmp"
and then:
str='%comment'
{
grep -B$((2**32)) -x "$str" "$out" | grep -v "$str"
# I don't know what you expect to do with newlines in filenames, but I guess you don't have those
cat "$tmp" | sed -z 's/^/file=/' | sed 's/\x0/\n/g'
grep -A$((2**32)) -x "$str" "$out"
} | sponge "$out"
where:
the output file name is assumed to be stored in the variable "$out"
the first grep keeps all lines before %comment, and grep -v removes the %comment line itself
the sed commands output each filename with file= at the beginning and replace the NUL bytes with newlines
the last grep then outputs all lines after %comment, including %comment itself
the result is written to the output file. Remember to use a temporary file.
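If sponge (from moreutils) is not installed, the same insertion can be sketched with awk and an explicit temporary file. This swaps in a different technique from the grep/sponge block above, and it assumes the file list is newline-separated with no newlines inside names; all paths and file names below are hypothetical.

```shell
# Hypothetical output file containing a %comment line.
demo=$(mktemp -d)
out="$demo/out.txt"
printf 'header\n%%comment\nfooter\n' > "$out"

# Hypothetical list of already-moved files, one path per line.
list="$demo/list.txt"
printf '%s\n' '/dest/a b.pdf' '/dest/c.pdf' > "$list"

# Before the %comment line, emit one file=PATH line per list entry,
# then print every original line unchanged.
awk -v list="$list" '
  $0 == "%comment" { while ((getline line < list) > 0) print "file=" line }
  { print }
' "$out" > "$out.tmp" && mv "$out.tmp" "$out"

result=$(cat "$out")
echo "$result"
```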
Don't use pdf=$(...) on null separated inputs. You can use mapfile to store that to an array, as other answers provided.
Then, to move the files, do something like
<"$tmp" xargs -0 -i mv {} "$outdir"
or faster, with a single move:
{ cat <"$tmp"; printf "%s\0" "$outdir"; } | xargs -0 mv
or alternatively:
<"$tmp" xargs -0 sh -c 'outdir="$1"; shift; mv "$@" "$outdir"' -- "$outdir"
Live example at tutorialspoint.
I suppose following code will be close to what you want:
IFS=$'\n' pdfs=($(find -name "$DOWNLOADS/*.pdf" -print0 | xargs -0 -I {} ls -lt "{}" | tail -n +1 | head -n$NUM))
Then you can access the output through ${pdfs[0]}, ${pdfs[1]}, ...
Explanations
IFS=$'\n' makes the output of the command substitution be split only on "\n".
-I option for xargs tells xargs to substitute {} with filenames so it can be quoted as "{}".
tail -n +1 is a trick to suppress an error message saying "xargs: 'ls' terminated by signal 13".
Hope this helps.
Bash v4 has an option globstar, after enabling this option, we can use ** to match zero or more subdirectories.
mapfile is a built-in command, which is used for reading lines into an indexed array variable. -t option removes a trailing newline.
shopt -s globstar
mapfile -t pdffiles < <(ls -t1 **/*.pdf | head -n"$NUM")
typeset -p pdffiles
for f in "${pdffiles[@]}"; do
echo "==="
mv "${f}" /dest/path
sed "/^%comment/i${f}=/dest/path" a-text-file.txt
done

echo prints too many spaces

I have code with two variables in echo. I don't know why it prints spaces before $NEXT even though I have just one space in code.
NEXT=$(find "${DIR}" -type f -name "*.$ext" | sed "s/.*\/\.//g" | sed "s/.*\///g" |
sed -n '/.*\..*/p' | wc -l)
echo "Files .$ext: $NEXT"
Files .tar: 1
Your find expression is not doing what you think it is:
NEXT=$(find "${DIR}" -type f -name "*.$ext" | sed "s/.*\/\.//g" | sed "s/.*\///g" |
sed -n '/.*\..*/p' | wc -l)
When you pipe to wc -l you are left with a number. The format of that number depends on your distribution's default compile options for wc. Generally, when input is piped or redirected to wc, the value returned has no leading whitespace, but there is no guarantee that your install of wc works that way. All you can do is test and see what results, e.g.
ls "$HOME" | wc -l
If whitespace is returned before the value -- you have found your problem.
If the last line is the output, then it seems to be the output of something other than the displayed code. When your output looks weird, try putting single quotes around each variable:
echo " Average file size .'$ext': '$AEXT'"
That way, you will know, if the spaces (or tabs) are coming from the variables themselves or from the script.
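If the padding does turn out to come from wc, one possible fix is to strip whitespace from its output before storing it. The sketch below uses a two-line stand-in for the find pipeline; the variable names raw and clean are made up for illustration.

```shell
# raw may carry leading spaces on some systems (for example BSD/macOS wc).
raw=$(printf 'a\nb\n' | wc -l)

# tr -d '[:space:]' removes any padding, leaving only the digits.
clean=$(printf 'a\nb\n' | wc -l | tr -d '[:space:]')

# The single quotes make stray whitespace visible.
echo "raw:   '$raw'"
echo "clean: '$clean'"
```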

moving files that contain part of a line from a file

I have a file that on each line is a string of some numbers such as
1234
2345
...
I need to move files whose names contain one of those numbers, followed by other stuff, to a directory. Examples:
1234_hello_other_stuff_2334.pdf
2345_more_stuff_3343.pdf
I tried using xargs to do this, but my bash scripting isn't the best. Can anyone share the proper command to accomplish what I want to do?
for i in `cat numbers.txt`; do
mv ${i}_* examples
done
or (look ma, no cat!)
while read i; do
mv ${i}_* examples
done < numbers.txt
You could use a for loop, but that could make for a really long command line. If you have 20000 lines in numbers.txt, you might hit shell limits. Instead, you could use a pipe:
cat numbers.txt | while read number; do
mv ${number}_*.pdf /path/to/examples/
done
or:
sed 's/.*/mv -v &_*.pdf/' numbers.txt | sh
You can leave off the | sh for testing. If there are other lines in the file and you only want to match lines with 4 digits, you could restrict your match:
sed -r '/^[0-9]{4}$/s//mv -v &_*.pdf/' numbers.txt | sh
cat numbers.txt | xargs -n1 -I % find . -name '%*.pdf' -exec mv {} /path/to \;
% is your number (-n1 means one at a time), and '%*.pdf' passed to find means it'll match all files whose names begin with that number; then it just moves them to /path/to ({} is the actual file name).
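The loops above can be hardened a little against names with spaces and against numbers that match no file. The following is a sketch under assumptions: the demo files, the 7777 entry with no matching file and the examples directory are invented for illustration.

```shell
# Hypothetical demo directory mirroring the question's file names.
demo=$(mktemp -d)
mkdir "$demo/examples"
: > "$demo/1234_hello_other_stuff_2334.pdf"
: > "$demo/2345_more_stuff_3343.pdf"
: > "$demo/9999_unrelated.pdf"
printf '1234\n2345\n7777\n' > "$demo/numbers.txt"

(
  cd "$demo" || exit 1
  while IFS= read -r number; do
    for f in "$number"_*; do
      # An unmatched glob stays literal (e.g. "7777_*"); skip it.
      [ -e "$f" ] || continue
      mv -- "$f" examples/
    done
  done < numbers.txt
)

printf '%s\n' "$demo"/examples/*
```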

count (non-blank) lines-of-code in bash

In Bash, how do I count the number of non-blank lines of code in a project?
cat foo.c | sed '/^\s*$/d' | wc -l
And if you consider comments blank lines:
cat foo.pl | sed '/^\s*#/d;/^\s*$/d' | wc -l
Although, that's language dependent.
#!/bin/bash
find . -path './pma' -prune -o -path './blog' -prune -o -path './punbb' -prune -o -path './js/3rdparty' -prune -o -print | egrep '\.php|\.as|\.sql|\.css|\.js' | grep -v '\.svn' | xargs cat | sed '/^\s*$/d' | wc -l
The above will give you the total count of lines of code (blank lines removed) for a project (current folder and all subfolders recursively).
In the above "./blog" "./punbb" "./js/3rdparty" and "./pma" are folders I blacklist as I didn't write the code in them. Also .php, .as, .sql, .css, .js are the extensions of the files being looked at. Any files with a different extension are ignored.
There are many ways to do this, using common shell utilities.
My solution is:
grep -cve '^\s*$' <file>
This searches for lines in <file> that do not match (-v) the pattern (-e) '^\s*$', which is the beginning of a line, followed by 0 or more whitespace characters, followed by the end of a line (i.e. no content other than whitespace), and displays a count of matching lines (-c) instead of the matching lines themselves.
An advantage of this method over methods that involve piping into wc, is that you can specify multiple files and get a separate count for each file:
$ grep -cve '^\s*$' *.hh
config.hh:36
exceptions.hh:48
layer.hh:52
main.hh:39
If you want to use something other than a shell script, try CLOC:
cloc counts blank lines, comment lines, and physical lines of source code in many programming languages. It is written entirely in Perl with no dependencies outside the standard distribution of Perl v5.6 and higher (code from some external modules is embedded within cloc) and so is quite portable.
This command counts the number of non-blank lines:
cat fileName | grep -v ^$ | wc -l
The regular expression in grep -v ^$ makes grep ignore blank lines.
'wc' counts lines, words, chars, so to count all lines (including blank ones) use:
wc *.py
To filter out the blank lines, you can use grep:
grep -v '^\s*$' *.py | wc
'-v' tells grep to output all lines except those that match
'^' is the start of a line
'\s*' is zero or more whitespace characters
'$' is the end of a line
*.py is my example for all the files you wish to count (all python files in current dir)
pipe output to wc. Off you go.
I'm answering my own (genuine) question. Couldn't find a Stack Overflow entry that covered this.
cat file.txt | awk 'NF' | wc -l
cat 'filename' | grep '[^ ]' | wc -l
should do the trick just fine
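The approaches in this thread differ in what they treat as "blank". A small comparison on a made-up file (the name foo.c and its contents are assumptions) shows the difference. It uses the POSIX class [[:space:]] where other answers use \s, since \s in grep is a GNU extension.

```shell
# Hypothetical file: two code lines, one empty line, one line of spaces,
# one line holding a single tab.
demo=$(mktemp -d)
printf 'int x;\n\n   \n\t\nint y;\n' > "$demo/foo.c"

# Whitespace-only lines count as blank.
n_grep=$(grep -cv '^[[:space:]]*$' "$demo/foo.c")

# awk NF is true only when a line has at least one field.
n_awk=$(awk 'NF' "$demo/foo.c" | wc -l | tr -d '[:space:]')

# ^$ matches only truly empty lines, so whitespace-only lines are counted.
n_naive=$(grep -vc '^$' "$demo/foo.c")

echo "grep [[:space:]]: $n_grep"
echo "awk NF:           $n_awk"
echo "grep ^\$:          $n_naive"
```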
grep -cvE '(^\s*[/*])|(^\s*$)' foo
-c = count
-v = exclude
-E = extended regex
'(comment lines) OR (empty lines)'
where
^ = beginning of the line
\s = whitespace
* = any number of previous characters or none
[/*] = either / or *
| = OR
$ = end of the line
I post this because other options gave wrong answers for me. This worked with my Java source, where comment lines start with / or * (I use * on every line in a multi-line comment).
awk '!/^[[:space:]]*$/ {++x} END {print x}' "$testfile"
Here's a Bash script that counts the lines of code in a project. It traverses a source tree recursively, and it excludes blank lines and single line comments that use "//".
# $excluded is a regex for paths to exclude from line counting
excluded="spec\|node_modules\|README\|lib\|docs\|csv\|XLS\|json\|png"
countLines(){
# $total is the total lines of code counted
total=0
# -mindepth excludes the current directory (".")
for file in `find . -mindepth 1 -name "*.*" |grep -v "$excluded"`; do
# First sed: only count lines of code that are not commented with //
# Second sed: don't count blank lines
# $numLines is the lines of code
numLines=`cat $file | sed '/\/\//d' | sed '/^\s*$/d' | wc -l`
# To exclude only blank lines and count comment lines, uncomment this:
#numLines=`cat $file | sed '/^\s*$/d' | wc -l`
total=$(($total + $numLines))
echo " " $numLines $file
done
echo " " $total in total
}
echo Source code files:
countLines
echo Unit tests:
cd spec
countLines
Here's what the output looks like for my project:
Source code files:
2 ./buildDocs.sh
24 ./countLines.sh
15 ./css/dashboard.css
53 ./data/un_population/provenance/preprocess.js
19 ./index.html
5 ./server/server.js
2 ./server/startServer.sh
24 ./SpecRunner.html
34 ./src/computeLayout.js
60 ./src/configDiff.js
18 ./src/dashboardMirror.js
37 ./src/dashboardScaffold.js
14 ./src/data.js
68 ./src/dummyVis.js
27 ./src/layout.js
28 ./src/links.js
5 ./src/main.js
52 ./src/processActions.js
86 ./src/timeline.js
73 ./src/udc.js
18 ./src/wire.js
664 in total
Unit tests:
230 ./ComputeLayoutSpec.js
134 ./ConfigDiffSpec.js
134 ./ProcessActionsSpec.js
84 ./UDCSpec.js
149 ./WireSpec.js
731 in total
Enjoy! --Curran
The neatest command is
grep -vc ^$ fileName
with -c option, you don't even need wc -l
It's kinda going to depend on the number of files you have in the project. In theory you could use
grep -c '.' <list of files>
Where you can fill the list of files by using the find utility.
grep -c '.' `find -type f`
Would give you a line count per file.
Script to recursively count all non-blank lines with a certain file extension in the current directory:
#!/usr/bin/env bash
(
echo 0;
for ext in "$@"; do
for i in $(find . -name "*$ext"); do
sed '/^\s*$/d' $i | wc -l ## skip blank lines
#cat $i | wc -l; ## count all lines
echo +;
done
done
echo p q;
) | dc;
Sample usage:
./countlines.sh .py .java .html
If you want the sum of all non-blank lines for all files of a given file extension throughout a project:
while read line
do grep -cve '^\s*$' "$line"
done < <(find $1 -name "*.$2" -print) | awk '{s+=$1} END {print s}'
First arg is the project's base directory, second is the file extension. Sample usage:
./scriptname ~/Dropbox/project/src java
It's little more than a collection of previous solutions.
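A variant of the same idea that avoids the read loop, and copes with spaces in paths, is to let find batch the files into grep and sum the per-file counts with awk. This is only a sketch: the directory tree and file names are hypothetical, and it assumes no file name contains a newline.

```shell
# Hypothetical project tree.
demo=$(mktemp -d)
mkdir -p "$demo/src/sub dir"
printf 'a\n\nb\n'   > "$demo/src/One.java"          # 2 non-blank lines
printf '\n  \nc\n'  > "$demo/src/sub dir/Two.java"  # 1 non-blank line
printf 'ignored\n'  > "$demo/src/notes.txt"         # wrong extension

# /dev/null forces grep to print "name:count" even for a single file;
# awk sums the last colon-separated field of every line.
total=$(find "$demo/src" -type f -name '*.java' \
          -exec grep -cv '^[[:space:]]*$' /dev/null {} + |
        awk -F: '{ s += $NF } END { print s + 0 }')

echo "non-blank lines: $total"
```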
rgrep . | wc -l
gives the count of non blank lines in the current working directory.
grep -v '^\W*$' `find -type f` | grep -c '.' > /path/to/lineCountFile.txt
gives an aggregate count for all files in the current directory and its subdirectories.
HTH!
This gives the count of number of lines without counting the blank lines:
grep -v ^$ filename | wc -l | sed -e 's/ //g'
Try this one:
> grep -cve ^$ -cve '^//' *.java
it's easy to memorize and it also excludes blank lines and commented lines.
There's already a program for this on linux called 'wc'.
Just
wc -l *.c
and it gives you the total lines and the lines for each file.

Resources