count (non-blank) lines-of-code in bash

In Bash, how do I count the number of non-blank lines of code in a project?

cat foo.c | sed '/^\s*$/d' | wc -l
And if you want comment lines to count as blank as well:
cat foo.pl | sed '/^\s*#/d;/^\s*$/d' | wc -l
Although that's language-dependent.

#!/bin/bash
find . -path './pma' -prune -o -path './blog' -prune -o -path './punbb' -prune -o -path './js/3rdparty' -prune -o -print | egrep '\.php|\.as|\.sql|\.css|\.js' | grep -v '\.svn' | xargs cat | sed '/^\s*$/d' | wc -l
The above will give you the total count of lines of code (blank lines removed) for a project (current folder and all subfolders recursively).
In the above, "./blog", "./punbb", "./js/3rdparty", and "./pma" are folders I blacklist because I didn't write the code in them. The extensions examined are .php, .as, .sql, .css, and .js; files with any other extension are ignored.
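If you prefer a single find expression with NUL-delimited names (so paths containing spaces survive the pipeline), a rough equivalent looks like this; the pruned directories and extensions are the same placeholders as above:
find . \( -path './pma' -o -path './blog' -o -path './punbb' -o -path './js/3rdparty' -o -name '.svn' \) -prune -o \
  -type f \( -name '*.php' -o -name '*.as' -o -name '*.sql' -o -name '*.css' -o -name '*.js' \) -print0 |
  xargs -0 cat | sed '/^[[:space:]]*$/d' | wc -l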

There are many ways to do this, using common shell utilities.
My solution is:
grep -cve '^\s*$' <file>
This counts the lines in <file> that do not match (-v) the pattern (-e) '^\s*$': the beginning of a line, followed by zero or more whitespace characters, followed by the end of a line (i.e. no content other than whitespace), and displays a count of such lines (-c) instead of the lines themselves.
An advantage of this method over methods that involve piping into wc, is that you can specify multiple files and get a separate count for each file:
$ grep -cve '^\s*$' *.hh
config.hh:36
exceptions.hh:48
layer.hh:52
main.hh:39
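If you also want a grand total on top of the per-file counts, one way (a small sketch) is to sum grep's file:count output with awk; using $NF keeps it working even when grep omits the filename prefix for a single file:
grep -cve '^\s*$' *.hh | awk -F: '{total += $NF} END {print total}'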

If you want to use something other than a shell script, try CLOC:
cloc counts blank lines, comment lines, and physical lines of source code in many programming languages. It is written entirely in Perl with no dependencies outside the standard distribution of Perl v5.6 and higher (code from some external modules is embedded within cloc) and so is quite portable.
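Typical usage is just pointing it at a project root:
cloc .
It prints a per-language table of file, blank, comment, and code counts.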

This command counts the number of non-blank lines:
cat fileName | grep -v '^$' | wc -l
The grep -v '^$' filters out the empty lines (note it still keeps lines that contain only whitespace).
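If whitespace-only lines should be treated as blank too, a minor variation (a sketch):
grep -vc '^[[:space:]]*$' fileName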

'wc' counts lines, words, chars, so to count all lines (including blank ones) use:
wc *.py
To filter out the blank lines, you can use grep:
grep -v '^\s*$' *.py | wc
'-v' tells grep to output all lines except those that match
'^' is the start of a line
'\s*' is zero or more whitespace characters
'$' is the end of a line
*.py is my example for all the files you wish to count (all python files in current dir)
pipe output to wc. Off you go.
I'm answering my own (genuine) question. I couldn't find a Stack Overflow entry that covered this.

cat file.txt | awk 'NF' | wc -l
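For context: NF is awk's number-of-fields variable, so awk 'NF' selects lines with at least one field, i.e. lines that aren't empty or whitespace-only. The cat is unnecessary, and awk can also do the counting itself; a shorter sketch:
awk 'NF {n++} END {print n+0}' file.txt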

cat 'filename' | grep '[^ ]' | wc -l
should do the trick just fine (though note that [^ ] still counts lines containing only tabs, since a tab is not a space)

grep -cvE '(^\s*[/*])|(^\s*$)' foo
-c = count
-v = exclude
-E = extended regex
'(comment lines) OR (empty lines)'
where
^ = beginning of the line
\s = whitespace
* = zero or more of the preceding element
[/*] = either / or *
| = OR
$ = end of the line
I post this because the other options gave wrong answers for me. It worked with my Java source, where comment lines start with / or * (I use * on every line of a multi-line comment).

awk '!/^[[:space:]]*$/ {++x} END {print x+0}' "$testfile"

Here's a Bash script that counts the lines of code in a project. It traverses a source tree recursively, and it excludes blank lines and single line comments that use "//".
# $excluded is a regex for paths to exclude from line counting
excluded="spec\|node_modules\|README\|lib\|docs\|csv\|XLS\|json\|png"
countLines(){
  # $total is the total lines of code counted
  total=0
  # -mindepth excludes the current directory (".")
  # (note: this for/find combination word-splits file names containing spaces)
  for file in $(find . -mindepth 1 -name "*.*" | grep -v "$excluded"); do
    # First sed: drop lines containing // (a rough single-line-comment filter;
    # it also drops code lines that merely contain //, e.g. URLs in strings)
    # Second sed: drop blank lines
    # $numLines is the resulting lines of code
    numLines=$(sed '/\/\//d;/^\s*$/d' "$file" | wc -l)
    # To exclude only blank lines and still count comment lines, use this instead:
    #numLines=$(sed '/^\s*$/d' "$file" | wc -l)
    total=$(($total + $numLines))
    echo " " $numLines "$file"
  done
  echo " " $total in total
}
echo Source code files:
countLines
echo Unit tests:
cd spec
countLines
Here's what the output looks like for my project:
Source code files:
2 ./buildDocs.sh
24 ./countLines.sh
15 ./css/dashboard.css
53 ./data/un_population/provenance/preprocess.js
19 ./index.html
5 ./server/server.js
2 ./server/startServer.sh
24 ./SpecRunner.html
34 ./src/computeLayout.js
60 ./src/configDiff.js
18 ./src/dashboardMirror.js
37 ./src/dashboardScaffold.js
14 ./src/data.js
68 ./src/dummyVis.js
27 ./src/layout.js
28 ./src/links.js
5 ./src/main.js
52 ./src/processActions.js
86 ./src/timeline.js
73 ./src/udc.js
18 ./src/wire.js
664 in total
Unit tests:
230 ./ComputeLayoutSpec.js
134 ./ConfigDiffSpec.js
134 ./ProcessActionsSpec.js
84 ./UDCSpec.js
149 ./WireSpec.js
731 in total
Enjoy! --Curran

The neatest command is
grep -vc ^$ fileName
With the -c option, you don't even need wc -l.

It's kinda going to depend on the number of files you have in the project. In theory you could use
grep -c '.' <list of files>
Where you can fill the list of files by using the find utility.
grep -c '.' `find -type f`
Would give you a line count per file.
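If file names may contain spaces, a NUL-safe variant of the same idea (a sketch, assuming GNU tools):
find . -type f -print0 | xargs -0 grep -c '.'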

Script to recursively count all non-blank lines with a certain file extension in the current directory:
#!/usr/bin/env bash
(
  echo 0;
  for ext in "$@"; do
    for i in $(find . -name "*$ext"); do
      sed '/^\s*$/d' "$i" | wc -l;  ## skip blank lines
      #cat "$i" | wc -l;            ## count all lines
      echo +;
    done
  done
  echo p q;
) | dc;  ## dc is an RPN calculator: push 0, then each count followed by +, then print and quit
Sample usage:
./countlines.sh .py .java .html

If you want the sum of all non-blank lines for all files of a given file extension throughout a project:
while IFS= read -r line
do grep -cve '^\s*$' "$line"
done < <(find "$1" -name "*.$2" -print) | awk '{s+=$1} END {print s}'
First arg is the project's base directory, second is the file extension. Sample usage:
./scriptname ~/Dropbox/project/src java
It's little more than a collection of previous solutions.

rgrep . | wc -l
gives the count of non-blank (strictly, non-empty) lines in the current working directory.

grep -v '^\W*$' `find -type f` | grep -c '.' > /path/to/lineCountFile.txt
gives an aggregate count for all files in the current directory and its subdirectories. (Note that \W means "any non-word character", so lines containing only punctuation are excluded along with blank ones; use '^\s*$' if you only want whitespace-only lines skipped.)
HTH!

This gives the count of lines, excluding the blank ones, with wc's padding stripped:
grep -v '^$' filename | wc -l | sed -e 's/ //g'

Try this one:
> grep -cve ^$ -cve '^//' *.java
It's easy to memorize, and it excludes both blank lines and commented lines (those whose // starts at the beginning of the line).

There's already a program for this on Linux called wc.
Just
wc -l *.c
and it gives you a count for each file plus the total (note that this includes blank lines).

Related

Handle files with space in filename and output file names

I need to write a Bash script that achieve the following goals:
1) move the newest n pdf files from folder 1 to folder 2;
2) correctly handles files that could have spaces in file names;
3) output each file name in a specific position in a text file. (In my actual usage, I will use sed to put the file names in a specific position of an existing file.)
I tried to make an array of filenames and then move them and do text output in a loop. However, the following array assignment cannot handle files with spaces in their names:
pdfs=($(find -name "$DOWNLOADS/*.pdf" -print0 | xargs -0 ls -1 -t | head -n$NUM))
Suppose a file has name "Filename with Space". What I get from the above array will have "with" and "Space" in separate array entries.
I am not sure how to keep the words of a single filename from being split into separate entries.
Can someone help me out?
Thanks!
-------------Update------------
Sorry for being vague on the third point as I thought I might be able to figure that out after achieving the first and second goals.
Basically, it is a text file that has a line starting with "%comment" near the end, and I will need to insert the filenames before that line in the format "file=PATH".
The PATH is the folder 2 that I have my pdfs moved to.
You can achieve this using mapfile in conjunction with the GNU versions of find | sort | cut | head, which have options to operate on NUL-terminated filenames:
mapfile -d '' -t pdfs < <(find "$DOWNLOADS" -name '*.pdf' -printf '%T@:%p\0' |
  sort -z -t : -rnk1 | cut -z -d : -f2- | head -z -n "$NUM")
Commands used are:
mapfile -d '': To read array with NUL as delimiter
find: outputs each file's modification timestamp (%T@, seconds since the epoch) + ":" + filename + NUL byte
sort: sorts reverse numerically on 1st field
cut: removes 1st field from output
head: outputs only first $NUM filenames
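To sanity-check what landed in the array, a quick sketch:
printf '%s\n' "${pdfs[@]}"
echo "count: ${#pdfs[@]}"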
find downloads -name "*.pdf" -printf "%T@ %p\0" |
sort -z -t' ' -k1 -n |
cut -z -d' ' -f2- |
tail -z -n 3
find all *.pdf files in downloads
for each file, print its modification date: %T with the format specifier @ means seconds since the epoch with a fractional part; then print a space, the filename, and a terminating \0
Sort the NUL-separated stream numerically on the first field, using space as the field separator.
Remove the first field from the stream, i.e. the modification date, leaving only filenames.
Take the newest files, in this example the 3 newest, by using tail. We could also sort in reverse and use head; no difference.
Don't use ls in scripts; ls is for nicely formatted human output. You could do xargs -0 stat --printf "%Y %n\0" instead, which would serve the same purpose. The only catch is that I couldn't make stat output the fractional part of the modification date.
As for the second part, we need to save the NUL-delimited list to a file:
find downloads ........ >"$tmp"
and then:
str='%comment'
{
grep -B$((2**32)) -x "$str" "$out" | grep -v "$str"
# I don't know what you expect to do with newlines in filenames, but I guess you don't have those
cat "$tmp" | sed -z 's/^/file=/' | sed 's/\x0/\n/g'
grep -A$((2**32)) -x "$str" "$out"
} | sponge "$out"
where the output file name is stored in the variable "$out"
filter all lines before the %comment and remove the line %comment itself from the file
output each filename with file= prepended; the second sed then replaces the NUL separators with newlines
then output the %comment line itself and everything after it
write the result back to the output file (sponge soaks up all its input before writing, so the file isn't clobbered while still being read)
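Note that sponge comes from the moreutils package; if it isn't available, redirect to a temporary file and mv it over "$out" instead.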
Don't use pdfs=$(...) on NUL-separated input. Use mapfile to store the names in an array, as other answers show.
Then to move the files, do something like
<"$tmp" xargs -0 -i mv {} "$outdir"
or faster, with a single move:
{ cat <"$tmp"; printf "%s\0" "$outdir"; } | xargs -0 mv
or alternatively:
<"$tmp" xargs -0 sh -c 'outdir="$1"; shift; mv "$#" "$outdir"' -- "$outdir"
I suppose the following code will be close to what you want:
IFS=$'\n' pdfs=($(find "$DOWNLOADS" -name "*.pdf" -print0 | xargs -0 -I{} ls -1t "{}" | tail -n +1 | head -n"$NUM"))
Then you can access the output through ${pdfs[0]}, ${pdfs[1]}, ...
Explanations
IFS=$'\n' makes the resulting output split only on "\n".
-I option for xargs tells xargs to substitute {} with filenames so it can be quoted as "{}".
tail -n +1 is a trick to suppress an error message saying "xargs: 'ls' terminated by signal 13".
Hope this helps.
Bash v4 has an option called globstar; after enabling it, we can use ** to match zero or more subdirectories.
mapfile is a built-in command used for reading lines into an indexed array variable. The -t option removes the trailing newline from each line.
shopt -s globstar
mapfile -t pdffiles < <(ls -t1 **/*.pdf | head -n"$NUM")
typeset -p pdffiles
for f in "${pdffiles[@]}"; do
echo "==="
mv "${f}" /dest/path
sed "/^%comment/i${f}=/dest/path" a-text-file.txt
done

Count how many files contain a string in the last line

I want to count how many files in the current directory have the string "A" in the last line.
First solution: tail -n 1 * | grep \"A\" | wc -l
This works fine, but with more files it fails with bash: /usr/bin/tail: Argument list too long.
Is there a way to get around it?
Bonus points if I can also optionally get which files contains it.
EDIT: my folder contains 343729 files
EDIT2: @tso usefully pointed, in his comment, to the article "I'm getting 'Argument list too long'. How can I process a large list in chunks?".
RESULTS:
@tso's solution for f in $(find . -type f); do tail -1 $f|grep \"A\"; done|wc -l takes about 20 minutes
@lars's solution grep -P "\"A\"*\Z" -r . | wc -l takes about 20 minutes
@mklement0's solution printf '%s\0' * | xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l takes about 10 minutes
@james's solution (in the comments) for i in * ; do awk 'END{if(/a/)print FILENAME}' "$i" ; done takes about 25 minutes
@codeforester's find . -type f -exec tail -n 1 -- {} + | grep -EB 1 '^[^=]+A' | grep -c '^==>' takes >20 minutes.
@mklement0's and @codeforester's solutions also have the advantage that if I change the grep pattern, the second run takes almost no time - I guess due to some sort of caching.
I've accepted @mklement0's answer as it seems to be the fastest, but I'd still like to mention @tso and @lars for their contributions, which to my personal knowledge are easier and more adaptable.
xargs is able to overcome the max. command-line length limitation by efficiently batching the invocations into as few calls as possible.
The shell's builtins, such as printf, are not subject to the max. command-line length.
Knowing this, you can use the following approach (which assumes that your xargs implementation supports the -0 option for NUL-terminated input, and that your tail implementation supports multiple file operands and the -q option for suppressing filename headers; both assumptions hold for the GNU (Linux) and BSD/macOS implementations of these utilities):
printf '%s\0' * | xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l
How about using find, tail, and grep this way? This is more efficient than looping through each file, and tail -1 reads just the last line of each file, which is very I/O-efficient.
find . -maxdepth 1 -type f -exec tail -n 1 -- {} + | grep -EB 1 '^[^=]+A' | grep -c '^==>'
find will invoke tail -1 in batches, passing as many file names at a time as fit within ARG_MAX
tail will print the last line of each file, prefixed with a "==> file_name <==" header
grep -EB 1 '^[^=]+A' will look for pattern A and fetch the previous line as well (it will exclude the file_name lines while looking for the match)
grep -c '^==>' will count the number of files with matching pattern
If you don't need to know the name of the files having a match, but just get the count of files, you could do this:
find . -maxdepth 1 -type f -exec tail -q -n 1 -- {} + | grep -c 'A'
Using GNU awk:
$ cat foo
b
a
$ cat bar
b
b
$ awk 'ENDFILE{if(/a/){c++; print FILENAME}}END{print c}' *
foo
1
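Note that ENDFILE is a gawk extension. Also, when no file matches, c stays unset and END prints an empty line; printing c+0 yields an explicit 0 instead (a minor tweak):
awk 'ENDFILE{if(/a/){c++; print FILENAME}}END{print c+0}' *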
try with find:
for f in $(find . -type f); do tail -1 "$f" | grep PATTERN; done | wc -l
If grep supports the -P option, this might work:
grep -P "A\Z" -r . | wc -l
See man pcrepattern. In short:
\Z matches at the end of the subject, and also before a newline at the end of the subject
\z matches only at the end of the subject
Try \Z and \z.
To see which files match, you would use only the grep part without the pipe to wc.
This will return the number of files:
grep -rlP "A\z" | wc -l
If you want to get the names then simply:
grep -rlP "A\Z"

SHELL printing just right part after . (DOT)

I need to find just the extensions of all files in a directory (if two files share an extension, it should appear only once). I already have it, but the output of my script is like
test.txt
test2.txt
hello.iso
bay.fds
hellllu.pdf
I'm using grep -e '.' and it just highlights the dots
And I need just these extensions given in one variable, like txt,iso,fds,pdf
Is there anyone who could help? I already had this working once, but it used an array. Today I found out it has to work on dash too.
You can use find with awk to get all unique extensions:
find . -type f -name '?*.?*' -print0 |
awk -F. -v RS='\0' '!seen[$NF]++{print $NF}'
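To get the comma-separated variable the question asks for, the same pipeline can feed paste (a sketch; RS='\0' requires GNU awk):
ext=$(find . -type f -name '?*.?*' -print0 | awk -F. -v RS='\0' '!seen[$NF]++{print $NF}' | paste -sd,)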
can be done with find as well, but I think this is easier
for f in *.*; do echo "${f##*.}"; done | sort -u
if you want to assign a comma separated list of the unique extensions, you can follow this
ext=$(for f in *.*; do echo "${f##*.}"; done | sort -u | paste -sd,)
echo $ext
csv,pdf,txt
alternatively with ls
ls -1 *.* | rev | cut -d. -f1 | rev | sort -u | paste -sd,
rev/rev is required if you have more than one dot in the filename, assuming the extension is after the last dot. For any other directory simply change the part *.* to dirpath/*.* in all scripts.
I'm not sure I understand your comment. If you don't assign to a variable, by default it will print to standard output. If you want to pass the directory name as a variable to a script, put the code into a script file and replace dirpath with $1, assuming it will be the first argument to the script:
#!/bin/bash
# print unique extension in the directory passed as an argument, i.e.
ls -1 "$1"/*.* ...
If you have subdirectories whose names contain dots, the scripts above include them as well; to limit the search to regular files, replace the ls with
find . -maxdepth 1 -type f -name "*.*" | ...

echo prints too many spaces

I have code with two variables in an echo. I don't know why it prints spaces before $NEXT even though I have just one space in the code.
NEXT=$(find "${DIR}" -type f -name "*.$ext" | sed "s/.*\/\.//g" | sed "s/.*\///g" |
sed -n '/.*\..*/p' | wc -l)
echo "Files .$ext: $NEXT"
Files .tar:        1
Your find expression is not doing what you think it is:
NEXT=$(find "${DIR}" -type f -name "*.$ext" | sed "s/.*\/\.//g" | sed "s/.*\///g" |
sed -n '/.*\..*/p' | wc -l)
When you pipe to wc -l, you are left with a number. The format of that number depends on your distribution's default compile options for wc. Generally, when input is piped or redirected into wc, the value returned should have no leading whitespace, but there is no guarantee that your install of wc works that way. All you can do is test and see what results, e.g.
ls "$HOME" | wc -l
If whitespace is returned before the value -- you have found your problem.
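If so, one way (a sketch) to normalize the count is arithmetic expansion, which discards any whitespace around the number:
NEXT=$(( NEXT ))
echo "Files .$ext: $NEXT"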
If the last line is the output, then it seems to be the output of something other than the code displayed. When your output looks weird, try putting single quotes around each variable:
echo " Average file size .'$ext': '$AEXT'"
That way, you will know, if the spaces (or tabs) are coming from the variables themselves or from the script.

Bash/Shell - paths with spaces messing things up

I have a bash/shell function that is supposed to find files, then use awk to copy the first file it finds to another directory. Unfortunately, if the directory containing the file has spaces in its name, the whole thing fails, since the path gets truncated for some reason or another. How do I fix it?
If file.txt is in /path/to/search/spaces are bad/ it fails.
dir=/path/to/destination/ | find /path/to/search -name file.txt | head -n 1 | awk -v dir="$dir" '{printf "cp \"%s\" \"%s\"\n", $1, dir}' | sh
cp: /path/to/search/spaces: No such file or directory
*If file.txt is in /path/to/search/spacesarebad/ it works, but notice there are no spaces. :-/
Awk's default field separator is whitespace. Simply change it to something else by doing:
awk -F"\t" ...
Your script should look like:
dir=/path/to/destination/ | find /path/to/search -name file.txt | head -n 1 | awk -F"\t" -v dir="$dir" '{printf "cp \"%s\" \"%s\"\n", $1, dir}' | sh
As pointed out in the comments, you don't really need all those steps; you could simply do (one-liner):
dir=/path/to/destination/ && path="$(find /path/to/search -name file.txt | head -n 1)" && cp "$path" "$dir"
Formatted code (which may look better, in this case ^^):
dir=/path/to/destination/
path="$(find /path/to/search -name file.txt | head -n 1)"
cp "$path" "$dir"
The "" are used to assign the entire content of the string to the variable, causing the separator IFS, which is a white space by default, not to be considered over the string.
If you think spaces are bad, wait till you get into trouble with newlines. Consider for example:
mkdir spaces\ are\ bad
touch spaces\ are\ bad/file.txt
mkdir newlines$'\n'are$'\n'even$'\n'worse
touch newlines$'\n'are$'\n'even$'\n'worse/file.txt
And:
find . -name file.txt
The head command assumes newline-delimited input. You can get around both the space and newline issues with GNU find and GNU grep (maybe others) by using \0 delimiters:
find . -name file.txt -print0 | grep -zm1 . | xargs -0 cp -t "$dir"
You could try this.
awk '{print substr($0, index($0,$9))}'
For example, this is the output of the ls command:
-rw-r--r--. 1 root root 73834496 Dec 6 10:55 File with spaces 2
If you use simple awk like this
# awk '{print $9}'
It returns only
# File
If used with the full command
# awk '{print substr($0, index($0,$9))}'
I get the whole output
File with spaces 2
Here
substr(s, a, b): returns b characters of string s, starting at position a. The parameter b is optional.
For example if the match is addr:192.168.1.133 and you use substr as follows
# awk '{print substr($2,6)}'
You get the IP, i.e. 192.168.1.133. Note that 6 is the position of the first character after addr:, counting from the a in addr.
So in the command above, $0 is the whole line, and index($0,$9) finds the position where field 9 begins, so substr prints everything from field 9 onward. You can change it to index($0,$8) and see that the output changes to
# 10:55 File with spaces 2
index(IN, FIND): searches the string IN for the first occurrence of the string FIND, and returns the position in characters where that occurrence begins in the string IN.
I hope it helps. Moreover, if you are assigning this value to a variable in a script, you need to enclose the variable in double quotes; otherwise you will get errors when doing other operations on the extracted file name.
