Print top N files by word count in two columns - bash

I would like to make a script that prints the filenames for the top n files from two directories (n being the number of files I give in in the command line) in order of number of words they have. My biggest problem however is in the way they should be displayed.
Say my command line looks like this:
myscript.sh 5 dir1 dir2
The output should have 2 columns: on the left the top 5 files in descending order from dir1, and on the right the top 5 files in descending order from dir2.
This is what I have in terms of code, however I'm missing something. I think that pr -m -t should do what i want, but I couldn't make it work.
#!/bin/bash
dir=$1
dir2=$2
for files in "$dir"
do
find ./reuters-topics/$dir -type f -exec wc -l {} + | sort -rn |head -n 15
done
for files in "$dir2"
do
find ./reuters-topics/$dir2 -type f -exec wc -l {} + | sort -rn | head -n 15
done

This is a solution in fish:
for i in (find . -type f); wc -l $i; end | sort -rn | head -n15 | awk '{print $2 "\t" $1}'
As you can see, the re-ordering (filename first, number of words second) is done by awk. As a separator I use a tab character:
awk '{print $2 "\t" $1}'
The difference between my loop and your find call, btw, is that I do not get the "total" line in the output.
I did not test if this (including awk) also works well for files with spaces in the name.

#!/usr/bin/env bash
_top_files_by_words_usage() {
local usage=""
read -r -d '' usage <<-"EOF"
Usage:
top_files_by_words <show_count> <dir1> <dir2>
EOF
1>&2 printf "%s\n" "$usage"
}
top_files_by_words() {
if (( $# != 3 )) || [[ "$1" != +([0-9]) ]]; then
_top_files_by_words_usage
return 1
fi
local -i showCount=0
local dir1=""
local dir2=""
showCount="$1"
dir1="$2"
dir2="$3"
shopt -s extglob
if [[ ! -d "$dir1" ]]; then
1>&2 printf "directory '%s' does not exist or is not a directory\n" "$dir1"
return 1
fi
if [[ ! -d "$dir2" ]]; then
1>&2 printf "directory '%s' does not exist or is not a directory\n" "$dir2"
return 1
fi
local -a out1=()
local -a out2=()
IFS=$'\n' read -r -d '' -a out1 < <(find "$dir1" -type f -exec wc -w {} \; | sort -k 1gr | head -n "$showCount")
IFS=$'\n' read -r -d '' -a out2 < <(find "$dir2" -type f -exec wc -w {} \; | sort -k 1gr | head -n "$showCount")
local -i i=0
local -i maxLen=0
local -i len=0;
for ((i = 0; i < showCount; ++i)); do
len="${#out1[$i]}"
if (( len > maxLen )); then
maxLen=$len
fi
# len="${#out2[$i]}"
# if (( len > maxLen )); then
# maxLen=$len
# fi
done
for (( i = 0; i < showCount; ++i)); do
printf "%-*.*s %s\n" "$maxLen" "$maxLen" "${out1[$i]}" "${out2[$i]}"
done
return 0
}
top_files_by_words "$#"
$ ~/tmp/count_words.bash 15 tex tikz
2309328 tex/resume.log 9692402 tikz/tikz-Graphics in LaTeX with TikZ.mp4
2242997 tex/resume_cv.log 2208818 tikz/tikz-Tikz-Graphs and Automata.mp4
2242969 tex/cover_letters/resume_cv.log 852631 tikz/tikz-Drawing Automata with TikZ in LaTeX.mp4
73859 tex/pgfplots/plotdata/heightmap.dat 711004 tikz/tikz-tutorial.mp4
49152 tex/pgfplots/lena.dat 300038 tikz/.ipynb_checkpoints/TikZ 11 Design Principles-checkpoint.ipynb
43354 tex/nancy.mp4 300038 tikz/TikZ 11 Design Principles.ipynb
31226 tex/pgfplots/pgfplotstodo.tex 215583 tikz/texample/bridges-of-konigsberg.svg
26000 tex/pgfplots/plotdata/ou.dat 108040 tikz/Visual TikZ.pdf
20481 tex/pgfplots/pgfplotstable.tex 82540 tikz/worldflags.pdf
19571 tex/pgfplots/pgfplots.reference.3dplots.tex 37608 tikz/texample/india-map.tex
19561 tex/pgfplots/plotdata/risingdrop3d_coord.dat 35798 tikz/.ipynb_checkpoints/TikZ-checkpoint.ipynb
19561 tex/pgfplots/plotdata/risingdrop3d_vel.dat 35656 tikz/texample/periodic_table.svg
18207 tex/pgfplots/ChangeLog 35501 tikz/TikZ.ipynb
17710 tex/pgfplots/pgfplots.reference.markers-meta.tex 25677 tikz/tikz-Graphics in LaTeX with TikZ.info.json
13800 tex/pgfplots/pgfplots.reference.axisdescription.tex 14760 tikz/tikz-Tikz-Graphs and Automata.info.json

column can print files side-by-side in columns. You can use process substitution with <(command) to have those "files" be live commands instead of actual files.
#!/bin/bash
top-files() {
local n="$1"
local dir="$2"
find "$dir" -type f -exec wc -l {} + |
head -n -1 | sort -rn | head -n "$n"
}
n="$1"
dir1="$2"
dir2="$3"
column <(top-files "$n" reuters-topics/"$dir1") \
<(top-files "$n" reuters-topics/"$dir2")

Related

Counting the number of lines of many files (only .h, .c and .py files) in a directory using bash

I'm asked to write a script (using bash) that count the number of lines in files (but only C files (.h and .c) and python files (.py)) that are regrouped in a single directory. I've already tried with this code but my calculation is always wrong
let "sum = 0"
let "sum = sum + $(wc -l $1/*.c | tail --lines=1 | tr -dc '0-9')"
let "sum = sum + $(wc -l $1/*.h | tail --lines=1 | tr -dc '0-9')"
let "sum = sum + $(wc -l $1/*.py | tail --lines=1 | tr -dc '0-9')"
echo $sum >> manifest.txt
I must write the total in the "manifest.txt" file and the argument of my script is the path to the directory that contains the files.
If someone has another technique to compute this, I'd be very grateful.
Thank you !
You could also use a loop to aggregate the counts:
extensions=("*.h" "*.c" "*.py")
sum=0
for ext in ${extensions[#]} ; do
count=$(wc -l ${1}/${ext} | awk '{ print $1 }')
sum=$((sum+count))
done
echo "${sum}"
Version 1: step by step
#!/bin/bash
echo "Counting the total number of lines for all .c .h .py files in $1"
sum=0
num_py=$(wc -l $1/*.py | tail -1 | tr -dc '0-9')
num_c=$(wc -l $1/*.c | tail -1 | tr -dc '0-9')
num_h=$(wc -l $1/*.h | tail -1 | tr -dc '0-9')
sum=$(($num_py + $num_c + $num_h))
echo $sum >> manifest.txt
version 2: concise
#!/bin/bash
echo "Counting the total number of lines for all .c .h .py files in $1"
echo "$(( $(wc -l $1/*.py | tail -1 | tr -dc '0-9') + $(wc -l $1/*.c | tail -1 | tr -dc '0-9') + $(wc -l $1/*.h | tail -1 | tr -dc '0-9') ))" >> manifest.txt
version 3: loop over your desired files
#!/bin/bash
echo "Counting the total number of lines for all .c .h .py files in $1"
sum=0
for sfile in $1/*.{c,h,py}; do
sum=$(($sum+$(wc -l $sfile|tail -1|tr -dc '0-9')))
done
echo $sum >> manifest.txt
This is how arithmetic operations work: var = $((EXPR))
For example: $sum= $(($sum + $result ))
it is very common to miss the $ sign within the EXPR! Try not to forget them :)
This is the scripts that I use (with minor modifications):
files=( $(find . -mindepth 1 -maxdepth 1 -type f -iname "*.h" -iname "*.c" -iname "*.py") )
declare -i total=0
for file in "${files[#]}"; do
lines="$(wc -l < <(cat "$file"))"
echo -e "${lines}\t${file}"
total+="$lines"
done
echo -e "\n$total\ttotal"
Here is my version.
#!/usr/bin/env bash
shopt -s extglob nullglob
files=( "$1"/*.#(c|h|py) )
shopt -u extglob nullglob
while IFS= read -rd '' file_name; do
count=$(wc -l < "$file_name")
((sum+=count))
done< <(printf '%s\0' "${files[#]}")
echo "$sum" > manifest.txt
Needs some error checking, like if the argument is a directory or if it even exists at all, and so on.

Adding files sizes using bash command "wc"

I ran this command to find each file modified yesterday:
find /eqtynas/ -type f -mtime -1 > /home/writtenToStorage.20171026 &
and then developed this script to add up all the files collected by the script, and sum the sizes. .
#!/bin/bash
ydate=$(date +%Y%m%d --date="yesterday")
file="/home/writtenToStorage.$ydate"
fileSize=0
for line in $(cat $file)
do
if [ -f $line ] && [ -s $line ] ; then
fileSize1=$fileSize
fileSize=$(wc -c < $line)
Total=$(( $fileSize + $fileSize1 ))
fi
done
echo $Total
However when I stat just one of the files in the list It comes out to 18942, where as the total for all the files combined comes out at 34499.
wc -c /eqty/fixed
18942 /eqty/fixed
Is the script ok - because I ran another check and the total size was 314 gigs
find /eqtynas/ -type f -mtime -1 -print0 | du -ch --files0-from=- --total -s > 24hourUsage.20171026 &
Continuing from my comment, you may prefer something similar to:
sum=
while read -r sz; do
sum=$((sum + sz))
done < <(find /eqtynas/ -type f -mtime -1 -exec stat -c %s '{}' \; )
echo "sum: $sum"
There are a number of ways to do this. You can also pipe the result of -exec ls -al '{}' to awk and just sum the 5th field.
If you have already written the filenames to /home/writtenToStorage.20171026, then you can simply redirect the file to your while loop, e.g.
while read -r sz; do
sum=$((sum + sz))
done <"/home/writtenToStorage.20171026"
Look things over and let me know i you have any questions.
You're not adding to Total, you're just setting it to the sum of the sizes of the last two files.
for line in $(cat $file)
do
if [ -f $line ] && [ -s $line ] ; then
fileSize=$(wc -c < $line)
((Total += fileSize))
fi
done

Need to remove the extra empty lines from the output of shell script

i'm trying to write a code which will print all files taking more than min_size (lets say 10G) in a directory. the problem is output off the below code is all files irrespective of the min_size. i will be getting other details like mtime , owner as well later in the code but this part itself doesnt work fine, whats wrong here ?
#!/bin/sh
if (( $# <3 )); then
echo "$0 dirname min_size count"
exit 1
else
dirname="$1";
min_size="$2";
count="$3";
#shift 3
fi
tmpfile=$(mktemp /lawdump/pulkit/files.XXXXXX)
exec 3> "$tmpfile"
find "${dirname}" -type f -print0 2>&1 | grep -v "Permission denied" | xargs -0 -I {} echo "{}" > "$tmpfile"
for i in `cat tmpfile`
do
x="`du -ah $i | awk '{print $1}' | grep G | sort -nr -k 1`"
size=$(echo $x | sed 's/[A-Za-z]*//g')
if [ size > $min_size ];then
echo $size
fi
done
Note : i know this can be done through find or du but i need to write a shell script to have an email sent out regularly with all the details.

Bash command - using IF - FI within a DO - DONE

I'm trying to run a command that should find PHP files that contain "base64_decode" and/or "eval", echo the file name, print the top three lines, if the file contains more than 3 lines, also the bottom 3.
I have the following at the moment:
for file in $(find . -name "*.php" -exec grep -il "base64_decode\|eval" {} \;); do echo $file; head -n 3 $file; if [ wc -l < $file -gt 3 ]; then tail -n 3 $file fi; done | less
This returns the following error:
bash: syntax error near unexpected token `done'
I would to use the following
while read -r file
do
echo ==$file==
head -n 3 "$file"
[[ $(grep -c '' "$file") > 3 ]] && (echo ----last-3-lines--- ; tail -n 3 "$file")
done < <(find . -name \*.php -exec grep -il 'base64_decode\|eval' {} \+)
Using while over the for is better, because the filenames could contain spaces. /probably not in this case, but anyway :)/
using grep -c '' "$file" is sometimes better (when the last line in the file, doesn't contains the \n character (the wc counts the \n characters in the file)
the find with the \+ instead of the \; is more efficient
Problem seems to be here:
if [ wc -l < $file -gt 3 ]; then
Since you need to use command substitution here to make sure wc -l command executes first and then compare the result:
if [[ $(wc -l < "$file") -gt 3 ]]; then
You want to execute your wc, more like:
if [[ $(wc -l < $file) -gt 3 ]]; then
try this:
#!/bin/bash
for file in $(grep -H "base64_decode\|eval" ./*.php | cut -d: -f1);
do
echo $file;
head -n 3 $file;
if [[ $(wc -l < $file) -gt 3 ]];
then
tail -n 3 $file
fi;
done
I tested and seems to work fine.
But, be carefull ... if php has 4 lines, you will see:
line1
line2
line3
line2
line3
line4
EDIT: changed the script above to grep inside files.
cat a.php
asdasd
asd
base64_decode
l
a
and result
./test2.sh
./a.php
asdasd
asd
base64_decode
base64_decode
l
a

remove files which contain more than 14 lines in a folder

Unix command used
wc -l * | grep -v "14" | rm -rf
However this grouping doesn't seem to do the job. Can anyone point me towards the correct way?
Thanks
wc -l * 2>&1 | while read -r num file; do ((num > 14)) && echo rm "$file"; done
remove "echo" if you're happy with the results.
Here's one way to print out the names of all files with at least 15 lines (assuming you have Gnu awk, for the nextfile command):
awk 'FNR==15{print FILENAME;nextfile}' *
That will produce an error for any subdirectory, so it's not ideal.
You don't actually want to print the filenames, though. You want to delete them. You can do that in awk with the system function:
# The following has been defanged in case someone decides to copy&paste
awk 'FNR==15{system("echo rm "FILENAME);nextfile}' *
for f in *; do if [ $(wc -l $f | cut -d' ' -f1) -gt 14 ]; then rm -f $f; fi; done
There's a few problems with your solution: rm doesn't take input from stdin, and your grep only finds files who don't have exactly 14 lines. Try this instead:
find . -type f -maxdepth 1 | while read f; do [ `wc -l $f | tr -s ' ' | cut -d ' ' -f 2` -gt 14 ] && rm $f; done
Here's how it works:
find . -type f -maxdepth 1 #all files (not directories) in the current directory
[ #start comparison
wc -l $f #get line count of file
tr -s ' ' #(on the output of wc) eliminate extra whitespace
cut -d ' ' -f 2 #pick just the line count out of the previous output
-gt 14 ] #test if all that was greater than 14
&& rm $f #if the comparison was true, delete the file
I tried to figure out a solution just using find with -exec, but I couldn't figure out a way to test the line count. Maybe somebody else can come up with a way for it

Resources