Counting the number of lines of many files (only .h, .c and .py files) in a directory using bash - bash

I've been asked to write a bash script that counts the number of lines in files (but only C files (.h and .c) and Python files (.py)) that are grouped in a single directory. I've already tried this code, but my total is always wrong:
let "sum = 0"
let "sum = sum + $(wc -l $1/*.c | tail --lines=1 | tr -dc '0-9')"
let "sum = sum + $(wc -l $1/*.h | tail --lines=1 | tr -dc '0-9')"
let "sum = sum + $(wc -l $1/*.py | tail --lines=1 | tr -dc '0-9')"
echo $sum >> manifest.txt
I must write the total in the "manifest.txt" file and the argument of my script is the path to the directory that contains the files.
If someone has another technique to compute this, I'd be very grateful.
Thank you !

You could also use a loop to aggregate the counts:
extensions=(h c py)
sum=0
for ext in "${extensions[@]}"; do
# cat the matching files so wc prints one number and no filenames
count=$(cat "$1"/*."$ext" 2>/dev/null | wc -l)
sum=$((sum+count))
done
echo "${sum}"

Version 1: step by step
#!/bin/bash
echo "Counting the total number of lines for all .c .h .py files in $1"
sum=0
num_py=$(wc -l $1/*.py | tail -1 | tr -dc '0-9')
num_c=$(wc -l $1/*.c | tail -1 | tr -dc '0-9')
num_h=$(wc -l $1/*.h | tail -1 | tr -dc '0-9')
sum=$(($num_py + $num_c + $num_h))
echo $sum >> manifest.txt
Version 2: concise
#!/bin/bash
echo "Counting the total number of lines for all .c .h .py files in $1"
echo "$(( $(wc -l $1/*.py | tail -1 | tr -dc '0-9') + $(wc -l $1/*.c | tail -1 | tr -dc '0-9') + $(wc -l $1/*.h | tail -1 | tr -dc '0-9') ))" >> manifest.txt
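Version 3: loop over your desired files
#!/bin/bash
echo "Counting the total number of lines for all .c .h .py files in $1"
sum=0
for sfile in "$1"/*.{c,h,py}; do
# skip patterns that matched nothing
[ -f "$sfile" ] || continue
# read via stdin so wc prints only the count, with no filename whose digits could leak into the number
sum=$((sum + $(wc -l < "$sfile")))
done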
echo $sum >> manifest.txt
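This is how arithmetic expansion works: var=$((EXPR)), with no spaces around the = sign.
For example: sum=$((sum + result))
Inside $(( )) the $ sign in front of variable names is optional, but the name being assigned to never takes one. A stray $ or a space on the left-hand side is a very common mistake :)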
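As a small illustration of the assignment form, here is a sketch that accumulates a few made-up numbers (the values 3, 4 and 5 are placeholders, not taken from the question):

```shell
#!/usr/bin/env bash
sum=0
for result in 3 4 5; do
    # No spaces around "=", and no "$" on the name being assigned to.
    sum=$((sum + result))
done
echo "$sum"
```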

This is the scripts that I use (with minor modifications):
files=( $(find . -mindepth 1 -maxdepth 1 -type f \( -iname "*.h" -o -iname "*.c" -o -iname "*.py" \)) )
declare -i total=0
for file in "${files[@]}"; do
lines="$(wc -l < <(cat "$file"))"
echo -e "${lines}\t${file}"
total+="$lines"
done
echo -e "\n$total\ttotal"

Here is my version.
#!/usr/bin/env bash
shopt -s extglob nullglob
files=( "$1"/*.@(c|h|py) )
shopt -u extglob nullglob
while IFS= read -rd '' file_name; do
count=$(wc -l < "$file_name")
((sum+=count))
done< <(printf '%s\0' "${files[@]}")
echo "$sum" > manifest.txt
Needs some error checking, like if the argument is a directory or if it even exists at all, and so on.
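A minimal sketch of what that error checking could look like, assuming the files sit directly inside the directory (no recursion). The function name count_source_lines is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the total number of lines in the .c, .h and .py
# files located directly inside the directory given as $1.
count_source_lines() {
    local dir=$1 file count total=0
    # Reject a missing argument or a path that is not a directory.
    if [[ -z $dir || ! -d $dir ]]; then
        printf 'usage: count_source_lines <directory>\n' >&2
        return 1
    fi
    for file in "$dir"/*.c "$dir"/*.h "$dir"/*.py; do
        # An unmatched pattern stays literal, so skip anything that is
        # not a regular file.
        [[ -f $file ]] || continue
        # Reading from stdin keeps the filename out of wc's output.
        count=$(wc -l < "$file")
        total=$((total + count))
    done
    printf '%d\n' "$total"
}
```

In the script itself this would be called as count_source_lines "$1" > manifest.txt.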

Related

Print top N files by word count in two columns

I would like to make a script that prints the filenames of the top n files from two directories (n being the number of files I give on the command line), ordered by the number of words they contain. My biggest problem, however, is the way they should be displayed.
Say my command line looks like this:
myscript.sh 5 dir1 dir2
The output should have 2 columns: on the left the top 5 files in descending order from dir1, and on the right the top 5 files in descending order from dir2.
This is what I have in terms of code, but I'm missing something. I think that pr -m -t should do what I want, but I couldn't make it work.
#!/bin/bash
dir=$1
dir2=$2
for files in "$dir"
do
find ./reuters-topics/$dir -type f -exec wc -l {} + | sort -rn |head -n 15
done
for files in "$dir2"
do
find ./reuters-topics/$dir2 -type f -exec wc -l {} + | sort -rn | head -n 15
done
This is a solution in fish:
for i in (find . -type f); wc -l $i; end | sort -rn | head -n15 | awk '{print $2 "\t" $1}'
As you can see, the re-ordering (filename first, number of words second) is done by awk. As a separator I use a tab character:
awk '{print $2 "\t" $1}'
The difference between my loop and your find call, btw, is that I do not get the "total" line in the output.
I did not test if this (including awk) also works well for files with spaces in the name.
#!/usr/bin/env bash
_top_files_by_words_usage() {
local usage=""
read -r -d '' usage <<-"EOF"
Usage:
top_files_by_words <show_count> <dir1> <dir2>
EOF
1>&2 printf "%s\n" "$usage"
}
top_files_by_words() {
if (( $# != 3 )) || [[ "$1" != +([0-9]) ]]; then
_top_files_by_words_usage
return 1
fi
local -i showCount=0
local dir1=""
local dir2=""
showCount="$1"
dir1="$2"
dir2="$3"
shopt -s extglob
if [[ ! -d "$dir1" ]]; then
1>&2 printf "directory '%s' does not exist or is not a directory\n" "$dir1"
return 1
fi
if [[ ! -d "$dir2" ]]; then
1>&2 printf "directory '%s' does not exist or is not a directory\n" "$dir2"
return 1
fi
local -a out1=()
local -a out2=()
IFS=$'\n' read -r -d '' -a out1 < <(find "$dir1" -type f -exec wc -w {} \; | sort -k 1gr | head -n "$showCount")
IFS=$'\n' read -r -d '' -a out2 < <(find "$dir2" -type f -exec wc -w {} \; | sort -k 1gr | head -n "$showCount")
local -i i=0
local -i maxLen=0
local -i len=0;
for ((i = 0; i < showCount; ++i)); do
len="${#out1[$i]}"
if (( len > maxLen )); then
maxLen=$len
fi
# len="${#out2[$i]}"
# if (( len > maxLen )); then
# maxLen=$len
# fi
done
for (( i = 0; i < showCount; ++i)); do
printf "%-*.*s %s\n" "$maxLen" "$maxLen" "${out1[$i]}" "${out2[$i]}"
done
return 0
}
top_files_by_words "$@"
$ ~/tmp/count_words.bash 15 tex tikz
2309328 tex/resume.log 9692402 tikz/tikz-Graphics in LaTeX with TikZ.mp4
2242997 tex/resume_cv.log 2208818 tikz/tikz-Tikz-Graphs and Automata.mp4
2242969 tex/cover_letters/resume_cv.log 852631 tikz/tikz-Drawing Automata with TikZ in LaTeX.mp4
73859 tex/pgfplots/plotdata/heightmap.dat 711004 tikz/tikz-tutorial.mp4
49152 tex/pgfplots/lena.dat 300038 tikz/.ipynb_checkpoints/TikZ 11 Design Principles-checkpoint.ipynb
43354 tex/nancy.mp4 300038 tikz/TikZ 11 Design Principles.ipynb
31226 tex/pgfplots/pgfplotstodo.tex 215583 tikz/texample/bridges-of-konigsberg.svg
26000 tex/pgfplots/plotdata/ou.dat 108040 tikz/Visual TikZ.pdf
20481 tex/pgfplots/pgfplotstable.tex 82540 tikz/worldflags.pdf
19571 tex/pgfplots/pgfplots.reference.3dplots.tex 37608 tikz/texample/india-map.tex
19561 tex/pgfplots/plotdata/risingdrop3d_coord.dat 35798 tikz/.ipynb_checkpoints/TikZ-checkpoint.ipynb
19561 tex/pgfplots/plotdata/risingdrop3d_vel.dat 35656 tikz/texample/periodic_table.svg
18207 tex/pgfplots/ChangeLog 35501 tikz/TikZ.ipynb
17710 tex/pgfplots/pgfplots.reference.markers-meta.tex 25677 tikz/tikz-Graphics in LaTeX with TikZ.info.json
13800 tex/pgfplots/pgfplots.reference.axisdescription.tex 14760 tikz/tikz-Tikz-Graphs and Automata.info.json
column can print files side-by-side in columns. You can use process substitution with <(command) to have those "files" be live commands instead of actual files.
#!/bin/bash
top-files() {
local n="$1"
local dir="$2"
find "$dir" -type f -exec wc -l {} + |
head -n -1 | sort -rn | head -n "$n"
}
n="$1"
dir1="$2"
dir2="$3"
column <(top-files "$n" reuters-topics/"$dir1") \
<(top-files "$n" reuters-topics/"$dir2")
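If column does not line the two lists up the way you expect, paste is another option, since it joins corresponding lines of its inputs with a tab. A sketch under a few assumptions: words are counted with wc -w as the question asks, and the helper names top_files and side_by_side are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the n files under dir with the most words,
# largest first, one "count path" pair per line.
top_files() {
    local n=$1 dir=$2
    # Run wc once per file so that no "total" line is ever produced, and
    # strip the left padding that some wc implementations add.
    find "$dir" -type f -exec wc -w {} \; | sed 's/^ *//' | sort -rn | head -n "$n"
}

# Hypothetical helper: show the two rankings next to each other,
# separated by a tab.
side_by_side() {
    local n=$1 dir1=$2 dir2=$3
    paste <(top_files "$n" "$dir1") <(top_files "$n" "$dir2")
}
```

Called as side_by_side 5 dir1 dir2, this prints dir1's ranking on the left and dir2's on the right.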

Need to remove the extra empty lines from the output of shell script

I'm trying to write a script that prints all files larger than min_size (let's say 10G) in a directory. The problem is that the code below prints all files regardless of min_size. Later in the code I will also collect other details such as mtime and owner, but this part by itself doesn't work. What's wrong here?
#!/bin/sh
if (( $# <3 )); then
echo "$0 dirname min_size count"
exit 1
else
dirname="$1";
min_size="$2";
count="$3";
#shift 3
fi
tmpfile=$(mktemp /lawdump/pulkit/files.XXXXXX)
exec 3> "$tmpfile"
find "${dirname}" -type f -print0 2>&1 | grep -v "Permission denied" | xargs -0 -I {} echo "{}" > "$tmpfile"
for i in `cat tmpfile`
do
x="`du -ah $i | awk '{print $1}' | grep G | sort -nr -k 1`"
size=$(echo $x | sed 's/[A-Za-z]*//g')
if [ size > $min_size ];then
echo $size
fi
done
Note: I know this can be done with find or du alone, but I need a shell script so that an email with all the details can be sent out regularly.
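Two things in the script above would explain the symptom. cat tmpfile reads a file literally named tmpfile rather than "$tmpfile". And inside [ ] the > is a redirection, so [ size > $min_size ] only checks that the string "size" is non-empty, which is always true. A minimal sketch of a numeric comparison, assuming min_size is given in bytes; the function name files_larger_than is made up, and sizes are taken from wc -c here, though stat or du could be swapped in:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "size-in-bytes<TAB>path" for every regular file
# under dir whose size is strictly greater than min_bytes.
files_larger_than() {
    local dir=$1 min_bytes=$2 file size
    while IFS= read -r -d '' file; do
        # $(( )) strips any padding wc puts in front of the number.
        size=$(( $(wc -c < "$file") ))
        # -gt compares integers; ">" inside [ ] would be a redirection.
        if [ "$size" -gt "$min_bytes" ]; then
            printf '%d\t%s\n' "$size" "$file"
        fi
    done < <(find "$dir" -type f -print0 2>/dev/null)
}
```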

How to print number of occurrences of a word in a file in unix

This is my shell script.
Given a directory, and a word, search the directory and print the absolute path of the file that has the maximum occurrences of the word and also print the number of occurrences.
I have written the following script
#!/bin/bash
if [[ -n $(find / -type d -name $1 2> /dev/null) ]]
then
echo "Directory exists"
x=` echo " $(find / -type d -name $1 2> /dev/null)"`
echo "$x"
cd $x
y=$(find . -type f | xargs grep -c $2 | grep -v ":0"| grep -o '[^/]*$' | sort -t: -k2,1 -n -r )
echo "$y"
else
echo "Directory does does not exists"
fi
result: scriptname directoryname word
output: /somedirectory/vtb/wordsearch : 4
/foo/bar: 3
Is there any option to replace xargs grep -c $2? grep -c prints the number of lines that contain the word, but I need the exact number of occurrences of the word in each file in the given directory.
Using grep's -c count feature:
grep -c "SEARCH" /path/to/files* | sort -n -r -t : -k 2 | head -n 1
The grep command outputs one line per file in /path/name:count format; sort orders them numerically (-n) by the 2nd field (-k 2), as delimited by a colon (-t :), in reverse order (-r). We then use head to keep the first result (-n 1).
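To see why the numeric flag matters, here is a sketch of just the sorting step, fed with made-up path:count lines. The function name highest_count is hypothetical, and the approach assumes the paths themselves contain no colon:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "path:count" lines on stdin and print the one
# with the highest count.
highest_count() {
    # -n makes 12 sort above 7; a plain text sort would rank "7" first.
    sort -t : -k 2 -n -r | head -n 1
}
```

For example, printf '%s\n' 'a.txt:3' 'b.txt:12' 'c.txt:7' | highest_count prints b.txt:12.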
Try This:
grep -o -w 'foo' bar.txt | wc -w
OR
grep -o -w 'word' /path/to/file/ | wc -w
grep -Fwor "$word" "$dir" | sed "s/:${word}\$//" | sort | uniq -c | sort -n | tail -1
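Putting the pieces together, a sketch that counts every whole-word occurrence (not just matching lines) per file and reports the file with the most. The function name most_occurrences is made up, and the word is treated as a fixed string rather than a regular expression:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "count path" for the file under dir that
# contains the most whole-word occurrences of word.
most_occurrences() {
    local word=$1 dir=$2 file count best=0 best_file=""
    while IFS= read -r -d '' file; do
        # grep -o prints each match on its own line, so wc -l counts
        # matches rather than matching lines.
        count=$(grep -o -w -F -- "$word" "$file" | wc -l)
        if (( count > best )); then
            best=$((count))
            best_file=$file
        fi
    done < <(find "$dir" -type f -print0)
    if [[ -n $best_file ]]; then
        printf '%d %s\n' "$best" "$best_file"
    fi
}
```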

remove files which contain more than 14 lines in a folder

Unix command used
wc -l * | grep -v "14" | rm -rf
However this grouping doesn't seem to do the job. Can anyone point me towards the correct way?
Thanks
wc -l * 2>&1 | while read -r num file; do ((num > 14)) && echo rm "$file"; done
remove "echo" if you're happy with the results.
Here's one way to print out the names of all files with at least 15 lines (assuming you have GNU awk, for the nextfile command):
awk 'FNR==15{print FILENAME;nextfile}' *
That will produce an error for any subdirectory, so it's not ideal.
You don't actually want to print the filenames, though. You want to delete them. You can do that in awk with the system function:
# The following has been defanged in case someone decides to copy&paste
awk 'FNR==15{system("echo rm "FILENAME);nextfile}' *
for f in *; do if [ $(wc -l $f | cut -d' ' -f1) -gt 14 ]; then rm -f $f; fi; done
There are a few problems with your solution: rm doesn't take input from stdin, and your grep only drops lines that contain "14" anywhere, including in the filename. Try this instead:
find . -maxdepth 1 -type f | while IFS= read -r f; do [ "$(wc -l < "$f")" -gt 14 ] && rm -- "$f"; done
Here's how it works:
find . -maxdepth 1 -type f #all files (not directories) in the current directory
[ #start comparison
wc -l < "$f" #get line count of file; reading from stdin keeps the filename out of the output
-gt 14 ] #test if that count was greater than 14
&& rm -- "$f" #if the comparison was true, delete the file
I tried to figure out a solution just using find with -exec, but I couldn't figure out a way to test the line count. Maybe somebody else can come up with a way for it
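One way to do the test inside find is to let -exec run a small sh -c command whose exit status decides whether the file passes. A sketch, assuming a find that supports -maxdepth (GNU and BSD do); the function name files_longer_than is made up, and it only prints the matches instead of deleting them:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the regular files directly inside dir that have
# more than max_lines lines.
files_longer_than() {
    local dir=$1 max_lines=$2
    # The inner sh exits with status 0 only when the file is long enough,
    # which lets -exec act as a test; -print then runs only for those files.
    find "$dir" -maxdepth 1 -type f \
        -exec sh -c '[ "$(( $(wc -l < "$1") ))" -gt "$2" ]' sh {} "$max_lines" \; \
        -print
}
```

Once the printed list looks right, -print could be swapped for -exec rm -- {} \; to actually remove the files.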

Get just the integer from wc in bash

Is there a way to get the integer that wc returns in bash?
Basically I want to write the line numbers and word counts to the screen after the file name.
output: filename linecount wordcount
Here is what I have so far:
files=`ls`
for f in $files;
do
if [ ! -d $f ] #only print out information about files !directories
then
# some way of getting the wc integers into shell variables and then printing them
echo "$f $lines $words"
fi
done
Most simple answer ever:
wc < filename
Just:
wc -l < file_name
will do the job. But this output includes prefixed whitespace as wc right-aligns the number.
You can use the cut command to get just the first word of wc's output (which is the line or word count):
lines=`wc -l $f | cut -f1 -d' '`
words=`wc -w $f | cut -f1 -d' '`
wc "$file" | awk '{print $4, $2, $1}'
Adjust as necessary for your layout.
It's also nicer to use positive logic ("is a file") over negative ("not a directory")
[ -f "$file" ] && wc "$file" | awk '{print $4, $2, $1}'
Sometimes wc outputs in different formats in different platforms. For example:
In OS X:
$ echo aa | wc -l
       1
In Centos:
$ echo aa | wc -l
1
So using only cut may not retrieve the number. Instead try tr to delete space characters:
$ echo aa | wc -l | tr -d ' '
The accepted/popular answers do not work on OSX.
Any of the following should be portable on bsd and linux.
wc -l < "$f" | tr -d ' '
OR
wc -l "$f" | sed 's/^ *//' | cut -d ' ' -f 1
OR
wc -l "$f" | awk '{print $1}'
If you redirect the filename into wc it omits the filename on output.
Bash:
read lines words characters <<< $(wc < filename)
or
read lines words characters <<EOF
$(wc < filename)
EOF
Instead of using for to iterate over the output of ls, do this:
for f in *
which will work if there are filenames that include spaces.
If you can't use globbing, you should pipe into a while read loop:
find ... | while read -r f
or use process substitution
while read -r f
do
something
done < <(find ...)
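A sketch of the process substitution form applied to the original question, using NUL-delimited names so that spaces and newlines in filenames survive. The function name report_counts is made up, and -maxdepth assumes GNU or BSD find:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "path lines words" for each regular file
# directly inside dir, reading NUL-delimited names from find.
report_counts() {
    local dir=$1 f lines words bytes
    while IFS= read -r -d '' f; do
        # wc reads stdin here, so its output holds only the three counts.
        read -r lines words bytes < <(wc < "$f")
        printf '%s %s %s\n' "$f" "$lines" "$words"
    done < <(find "$dir" -maxdepth 1 -type f -print0)
}
```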
If the file is small you can afford calling wc twice, and use something like the following, which avoids piping into an extra process:
lines=$(($(wc -l < "$f")))
words=$(($(wc -w < "$f")))
The $((...)) is the Arithmetic Expansion of bash. It removes any whitespace from the output of wc in this case. The file is redirected into wc so that the output contains only the number and not the filename, which would not be valid arithmetic.
This solution makes more sense if you need only one of the two counts.
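The whitespace-stripping effect can be checked without wc at all. In this sketch the padded value is a made-up stand-in for the right-aligned output some wc implementations produce:

```shell
#!/usr/bin/env bash
# Simulated BSD-style wc output: the count is right-aligned with spaces.
padded='       42'
# Arithmetic expansion evaluates the value as a number, dropping the padding.
trimmed=$((padded))
printf '[%s]\n' "$trimmed"
```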
How about with sed?
wc -l /path/to/file.ext | sed 's/ *\([0-9]* \).*/\1/'
typeset -i a=$(wc -l fileName.dat | xargs echo | cut -d' ' -f1)
Try this for numeric result:
nlines=$( wc -l < $myfile )
Something like this may help:
#!/bin/bash
printf '%-10s %-10s %-10s\n' 'File' 'Lines' 'Words'
for fname in file_name_pattern*; {
[[ -d $fname ]] && continue
lines=0
words=()
while read -r line; do
((lines++))
words+=($line)
done < "$fname"
printf '%-10s %-10s %-10s\n' "$fname" "$lines" "${#words[@]}"
}
To (1) run wc once, and (2) not assign any superfluous variables, use
read lines words <<< $(wc < $f | awk '{ print $1, $2 }')
Full code:
for f in *
do
if [ ! -d $f ]
then
read lines words <<< $(wc < $f | awk '{ print $1, $2 }')
echo "$f $lines $words"
fi
done
Example output:
$ find . -maxdepth 1 -type f -exec wc {} \; # without formatting
1 2 27 ./CNAME
21 169 1065 ./LICENSE
33 130 961 ./README.md
86 215 2997 ./404.html
71 168 2579 ./index.html
21 21 478 ./sitemap.xml
$ # the above code
404.html 86 215
CNAME 1 2
index.html 71 168
LICENSE 21 169
README.md 33 130
sitemap.xml 21 21
The solutions proposed in the other answers don't work on Darwin kernels.
Please consider the following solutions, which work on all UNIX systems:
print exactly the number of lines of a file:
wc -l < file.txt | xargs
print exactly the number of characters of a file:
wc -m < file.txt | xargs
print exactly the number of bytes of a file:
wc -c < file.txt | xargs
print exactly the number of words of a file:
wc -w < file.txt | xargs
There is a great solution with examples elsewhere on Stack Overflow.
I will copy the simplest solution here:
FOO="bar"
echo -n "$FOO" | wc -c | bc # "3"
Maybe these pages should be merged?
Try this:
wc `ls` | awk '$4 != "total" { LINE += $1; WC += $2 } END { print "lines: " LINE " words: " WC }'
It keeps a running line count and word count (LINE and WC), increases them with the values extracted from wc (using $1 for the first column's value and $2 for the second), skips the "total" line that wc appends when given several files so that nothing is counted twice, and finally prints the results.
"Basically I want to write the line numbers and word counts to the screen after the file name."
answer=(`wc $f`)
echo -e "${answer[3]}
lines: ${answer[0]}
words: ${answer[1]}
bytes: ${answer[2]}"
Outputs :
myfile.txt
lines: 10
words: 20
bytes: 120
files=`ls`
echo "$files" | wc -l | perl -pe "s#^\s+##"
You have to use input redirection for wc:
number_of_lines=$(wc -l <myfile.txt)
respectively in your context
echo "$f $(wc -l <"$f") $(wc -w <"$f")"
