Bash: Find file with max lines count - bash

This is my try to do it
Find all *.java files
find . -name '*.java'
Count lines
wc -l
Delete last line
sed '$d'
Use AWK to find max lines-count in wc output
awk 'max=="" || data=="" || $1 > max {max=$1 ; data=$2} END{ print max " " data}'
then merge it to single line
find . -name '*.java' | xargs wc -l | sed '$d' | awk 'max=="" || data=="" || $1 > max {max=$1 ; data=$2} END{ print max " " data}'
Can I somehow implement counting just non-blank lines?

find . -type f -name "*.java" -exec grep -H -c '[^[:space:]]' {} \; | \
sort -nr -t":" -k2 | awk -F: '{print $1; exit;}'
Replace the awk command with head -n1 if you also want to see the number of non-blank lines.
Breakdown of the command:
find . -type f -name "*.java" -exec grep -H -c '[^[:space:]]' {} \;
'---------------------------'       '-----------------------'
              |                                 |
    for each *.java file            Use grep to count non-empty lines
                                    -H includes filenames in the output
                                    (output = ./full/path/to/file.java:count)

| sort -nr -t":" -k2 | awk -F: '{print $1; exit;}'
  '----------------'   '-------------------------'
          |                         |
  Sort the output in       Print filename of the first entry (largest count)
  reverse order using the  then exit immediately
  second column (count)

find . -name "*.java" -type f | xargs wc -l | sort -rn | grep -v ' total$' | head -1

To get the size of all of your files using awk is just:
$ find . -name '*.java' -print0 | xargs -0 awk '
BEGIN { for (i=1;i<ARGC;i++) size[ARGV[i]]=0 }
{ size[FILENAME]++ }
END { for (file in size) print size[file], file }
'
To get the count of the non-empty lines, simply make the line where you increment the size[] conditional:
$ find . -name '*.java' -print0 | xargs -0 awk '
BEGIN { for (i=1;i<ARGC;i++) size[ARGV[i]]=0 }
NF { size[FILENAME]++ }
END { for (file in size) print size[file], file }
'
(NF already treats lines that contain only blanks as empty. If you want such lines to count as non-empty, so that only zero-length lines are skipped, replace NF with /^./.)
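A small sketch of the difference between the two conditions, fed from standard input. The function names count_nf and count_any are made up for this illustration; they are not part of the answer above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: count "non-empty" lines under two definitions.

# NF is the number of whitespace-separated fields, so a line holding
# only spaces or tabs has NF == 0 and is skipped.
count_nf() { awk 'NF { n++ } END { print n+0 }' "$@"; }

# /^./ matches any line with at least one character, so a line holding
# only spaces is counted; only zero-length lines are skipped.
count_any() { awk '/^./ { n++ } END { print n+0 }' "$@"; }

# Usage (four lines: code, blanks only, zero-length, code):
#   printf 'class A {\n   \n\n}\n' | count_nf
#   printf 'class A {\n   \n\n}\n' | count_any
```

With that four-line input, count_nf reports 2 and count_any reports 3.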
To get only the file with the most non-empty lines just tweak again:
$ find . -name '*.java' -print0 | xargs -0 awk '
BEGIN { for (i=1;i<ARGC;i++) size[ARGV[i]]=0 }
NF { size[FILENAME]++ }
END {
for (file in size) {
if (size[file] >= maxSize) {
maxSize = size[file]
maxFile = file
}
}
print maxSize, maxFile
}
'

Something like this might work:
find . -name '*.java' | while IFS= read -r filename; do
    # count the lines that are not empty or whitespace-only
    nlines=$(grep -v -E '^[[:space:]]*$' "$filename" | wc -l)
    echo "$nlines $filename"
done | sort -nr | head -1
(edited as per Ed Morton's comment. I must have had too much coffee :-) )

Related

Trying to do total word count on all files recursively but the sum is not right

So I do this:
find . -name '*.md' -type f -exec wc -w {} \; | awk '{ print $1 }'
And get a column of numbers (truncated):
...
2829
3619
828
1195
2406
2857
1480
1846
23
But then when I pipe all of that into a sum, I get an incorrect amount:
find . -name '*.md' -type f -exec wc -w {} \; | awk '{ print $1 }' | sum
9658 2
I thought awk would strip the white space out of wc -w output. But am I missing something?
(End result: I want to take a weekly word count and compare it to previous weeks.)
The issue with your code is that sum does not add up the numbers it receives from the previous command; it prints a checksum and block count of its input.
Here is the sum help manual
Usage: sum [OPTION]... [FILE]...
Print checksum and block counts for each FILE.
Here is what you can do
find . -name '*.md' -type f -exec wc -w {} \; | awk '{s+=$1} END {printf "%.0f", s}'
Where the awk increments the s on each step with the value and prints it as an integer (to 0 decimal places) when done.
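As a minimal sketch, the same accumulation can be wrapped in a helper and tried on a few of the numbers from the question. The name sum_first_column is an assumption made for this example only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add up the first whitespace-separated field of
# every input line and print the total as an integer.
sum_first_column() {
    awk '{ s += $1 } END { printf "%.0f\n", s }' "$@"
}

# Usage:
#   find . -name '*.md' -type f -exec wc -w {} \; | sum_first_column
#   printf '2829\n3619\n828\n' | sum_first_column    # 2829 + 3619 + 828
```

Because awk only reads the first field, the trailing file names in the wc output can stay in the stream; no intermediate awk '{ print $1 }' step is needed.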
Concatenate all the files and pipe the result to wc -w, this way you don't need to sum word counts of individual files.
find . -name '*.md' -type f -exec awk 1 {} + | wc -w
awk 1 makes sure each file's content is separated from the next by a newline, so the last word of one file cannot fuse with the first word of the following one. If every file already ends with a newline, you can use cat instead.
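A short sketch of why the separator matters when a file lacks a trailing newline. The helper names words_with_cat and words_with_awk are invented for this comparison; tr only strips the padding some wc implementations add.

```shell
#!/usr/bin/env bash
# Hypothetical helpers comparing plain concatenation with awk 1.

# cat joins files byte for byte: if a file does not end in a newline,
# its last word runs into the first word of the next file.
words_with_cat() { cat "$@" | wc -w | tr -d '[:space:]'; }

# awk 1 prints every record followed by a newline, so each file's
# last line is always terminated before the next file starts.
words_with_awk() { awk 1 "$@" | wc -w | tr -d '[:space:]'; }

# Usage:
#   printf 'one two' > a.md          # no trailing newline
#   printf 'three four\n' > b.md
#   words_with_cat a.md b.md         # "two" and "three" fuse into one word
#   words_with_awk a.md b.md
```

For those two sample files, the cat variant counts 3 words and the awk variant counts 4.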

How to return an MD5 and SHA1 value for multiple files in a directory using BASH

I am creating a BASH script to take a directory as an argument and return to std out a list of all files in that directory with both the MD5 and SHA1 value of the files present in that directory. The only files I'm interested in are those between 100 and 500K. So far I have gotten this far (section of script):
cd $1 &&
find . -type f -size +100k -size -500k -printf '%f \t %s \t' -exec md5sum {} \; |
awk '{printf "NAME:" " " $1 "\t" "MD5:" " " $3 "\t" "BYTES:" "\t" $2 "\n"}'
I'm getting a little confused when adding the Sha1 and obviously leaving something out.
Can anybody suggest a way to achieve this.
Ideally I'd like the script to format in the following way
Name Md5 SHA1
(With the relevant fields underneath)
Your awk printf bit is overly complicated. Try this:
find . -type f -printf "%f\t%s\t" -exec md5sum {} \; | awk '{ printf "NAME: %s MD5: %s BYTES: %s\n", $1, $3, $2 }'
Just read line by line the list of files outputted by find:
find . -type f |
while IFS= read -r l; do
echo "$(basename "$l") $(md5sum <"$l" | cut -d" " -f1) $(sha1sum <"$l" | cut -d" " -f1)"
done
It's better to use a zero separated stream:
find . -type f -print0 |
while IFS= read -r -d '' l; do
echo "$(basename "$l") $(md5sum <"$l" | cut -d" " -f1) $(sha1sum <"$l" | cut -d" " -f1)"
done
You could speed things up with xargs; add its -P option (for example -P4) to run multiple processes in parallel:
find . -type f -print0 |
xargs -0 -n1 sh -c 'echo "$(basename "$1") $(md5sum <"$1" | cut -d" " -f1) $(sha1sum <"$1" | cut -d" " -f1)"' --
Consider adding -maxdepth 1 to find if you are not interested in files in subdirectories recursively.
It's easy from xargs to go to -exec:
find . -type f -exec sh -c 'echo "$1 $(md5sum <"$1" | cut -d" " -f1) $(sha1sum <"$1" | cut -d" " -f1)"' -- {} \;
Tested on repl.
Add those -size +100k -size -500k args to find to limit the sizes.
The | cut -d" " -f1 is used to remove the - that is outputted by both md5sum and sha1sum. If there are no spaces in filenames, you could run a single cut process for the whole stream, so it should be slightly faster:
find . -type f -print0 |
xargs -0 -n1 sh -c 'echo "$(basename "$1") $(md5sum <"$1") $(sha1sum <"$1")"' -- |
cut -d" " -f1,2,5
I also think that running a single md5sum and a single sha1sum process over all files may be faster than spawning separate processes for each file, but that method needs the file names stored somewhere. Below a bash array is used:
IFS=$'\n' files=($(find . -type f))
paste -d' ' <(
printf "%s\n" "${files[@]}") <(
md5sum "${files[@]}" | cut -d' ' -f1) <(
sha1sum "${files[@]}" | cut -d' ' -f1)
Your find is fine, you want to join the results of two of those, one for each hash. The command for that is join, which expects sorted inputs.
doit() { find -type f -size +100k -size -500k -exec $1 {} + |sort -k2; }
join -j2 <(doit md5sum) <(doit sha1sum)
and that gets you the raw data in sane environments. If you want pretty data, you can use the column utility:
join -j2 <(doit md5sum) <(doit sha1sum) | column -t
and add nice headers:
(echo Name Md5 SHA1; join -j2 <(doit md5sum) <(doit sha1sum)) | column -t
and if you're in an unclean environment where people put spaces in file names, protect against that by subbing in tabs for the field markers:
doit() { find -type f -size +100k -size -500k -exec $1 {} + \
| sed 's, ,\t,'| sort -k2 -t$'\t' ; }
join -j2 -t$'\t' <(doit md5sum) <(doit sha1sum) | column -ts$'\t'
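Pulling the pieces from the answers above together, here is one possible sketch that prints the requested Name / MD5 / SHA1 layout with tab-separated columns. The function name hash_table and its calling convention are assumptions for illustration; it relies on bash, on find -print0, and on the GNU coreutils md5sum and sha1sum commands.

```shell
#!/usr/bin/env bash
# hash_table DIR [extra find tests...]   (hypothetical helper)
# Prints a header, then one line per regular file under DIR with its
# base name, MD5 and SHA1, separated by tabs.
hash_table() {
    local dir=$1
    shift
    printf 'Name\tMD5\tSHA1\n'
    # -print0 with read -d '' keeps names with spaces or newlines intact.
    find "$dir" -type f "$@" -print0 |
    while IFS= read -r -d '' path; do
        # Reading from stdin makes both tools print "<hash>  -".
        md5=$(md5sum < "$path")
        sha1=$(sha1sum < "$path")
        # ${var%% *} drops everything from the first space onward.
        printf '%s\t%s\t%s\n' "${path##*/}" "${md5%% *}" "${sha1%% *}"
    done
}

# Usage, with the size limits from the question:
#   hash_table "$1" -size +100k -size -500k
#   hash_table "$1" -maxdepth 1 -size +100k -size -500k | column -ts$'\t'
```

Each file is read twice, once per hash, which keeps the loop simple at the cost of some speed on large trees.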

Getting a list of substring based unique filenames in an array

I have a directory my_dir with files having names like:
a_v5.json
a_v5.mapping.json
a_v5.settings.json
f_v39.json
f_v39.mapping.json
f_v39.settings.json
f_v40.json
f_v40.mapping.json
f_v40.settings.json
c_v1.json
c_v1.mapping.json
c_v1.settings.json
I'm looking for a way to get an array [a_v5, f_v40, c_v1] in bash. Here, array of file names with the latest version number is what I need.
Tried this: ls *.json | find . -type f -exec basename "{}" \; | cut -d. -f1, but it returns the results with files which are not of the .json extension.
You can use the following command if your filenames don't contain whitespace and special symbols like * or ?:
array=($(
find . -type f -iname \*.json |
sed -E 's|(.*/)*(.*_v)([0-9]+)\..*|\2 \3|' |
sort -Vr | sort -uk1,1 | tr -d ' '
))
It's ugly and unsafe. The following solution is longer but can handle all file names, even those with linebreaks in them.
maxversions() {
find -type f -iname \*.json -print0 |
gawk 'BEGIN { RS = "\0"; ORS = "\0" }
match($0, /(.*\/)*(.*_v)([0-9]+)\..*/, group) {
prefix = group[2];
version = group[3];
if (version > maxversion[prefix])
maxversion[prefix] = version
}
END {
for (prefix in maxversion)
print prefix maxversion[prefix]
}'
}
mapfile -d '' array < <(maxversions)
In both cases you can check the contents of array with declare -p array.
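If the files all sit in one directory, as in the question, a pure-bash variant is also possible. The following is only a sketch under that assumption: the function name latest_versions is invented here, it needs bash 4 or later for associative arrays, it does not recurse, and it expects names of the form prefix_vNUMBER.something.json.

```shell
#!/usr/bin/env bash
# latest_versions DIR   (hypothetical helper)
# Prints prefix_vN for the highest N seen for each prefix, one per line.
latest_versions() {
    local dir=$1 f base prefix ver
    local -A max=()
    for f in "$dir"/*_v*.json; do
        [ -e "$f" ] || continue              # glob matched nothing
        base=${f##*/}                        # strip directory part
        base=${base%%.*}                     # a_v5.mapping.json -> a_v5
        prefix=${base%_v*}                   # a_v5 -> a
        ver=${base##*_v}                     # a_v5 -> 5
        [[ -n $prefix && $ver =~ ^[0-9]+$ ]] || continue
        # 10# forces base 10 so versions like 08 are not read as octal.
        if [[ -z ${max[$prefix]+set} ]] || (( 10#$ver > 10#${max[$prefix]} )); then
            max[$prefix]=$ver
        fi
    done
    for prefix in "${!max[@]}"; do
        printf '%s_v%s\n' "$prefix" "${max[$prefix]}"
    done | sort
}

# Usage:
#   mapfile -t array < <(latest_versions my_dir)
#   declare -p array
```

The comparison is numeric, so f_v40 beats f_v39 and a v10 would beat a v9, which a plain lexical sort would get wrong.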
Arrays and bash string parsing.
declare -A tmp=()
for f in $SOURCE_DIR/*.json
do f=${f##*/} # strip path
tmp[${f%%.*}]=1 # strip extraneous data after . in filename
done
declare -a c=( $( printf "%s\n" "${!tmp[@]}" | cut -c 1 | sort -u ) ) # get just the first chars
declare -a lst=( $( for f in "${c[@]}"
do printf "%s\n" "${!tmp[@]}" |
grep "^${f}_" |
sort -n |
tail -1; done ) )
echo "[ ${lst[@]} ]"
[ a_v5 c_v1 f_v40 ]
Or, if you'd rather,
declare -a arr=( $(
for f in $SOURCE_DIR/*.json
do d=${f%/*} # get dir path
f=${f##*/} # strip path
g=${f:0:2} # get leading str
( cd $d && printf "%s\n" ${g}*.json |
sort -n | sed -n '$ { s/[.].*//; p; }' )
done | sort -u ) )
echo "[ ${arr[@]} ]"
[ a_v5 c_v1 f_v40 ]
This is one possible way to accomplish this :
arr=( $( { for name in $( ls {f,n,m}*.txt ); do echo ${name:0:1} ; done; } | sort | uniq ) )
Output :
$ echo ${arr[0]}
f
$ echo ${arr[1]}
m
$ echo ${arr[2]}
n
Regards!
AWK SOLUTION
This is not an elegant solution... my knowledge of awk is limited.
You should find this functional.
I've updated this to remove the obsolete uniq as suggested by @socowi.
I've also included the printf version as @socowi suggested.
ls *.json | cut -d. -f1 | sort -rn | awk -v last="xx" '$1 !~ last{ print $1; last=substr($1,1,3) }'
OR
printf %s\\n *.json | cut -d. -f1 | sort -rn | awk -v last="xx" '$1 !~ last{ print $1; last=substr($1,1,3) }'
Old understanding below
Find files with name matching pattern.
Now take the second field since your results will likely be similar to ./
find . -type f -iname "*.json" | cut -d. -f2
To get the unique headings....
find . -type f -iname "*.json" | cut -d. -f2 | sort | uniq

Count the number of files in a directory containing two specific string in bash

I have few files in a directory containing below pattern:
Simulator tool completed simulation at 20:07:18 on 09/28/18.
The situation of the simulation: STATUS PASSED
Now I want to count the number of files which contains both of strings completed simulation & STATUS PASSED anywhere in the file.
This command is working to search for one string STATUS PASSED and count file numbers:
find /directory_path/*.txt -type f -exec grep -l "STATUS PASSED" {} + | wc -l
Sed is also giving 0 as a result:
find /directory_path/*.txt -type f -exec sed -e '/STATUS PASSED/!d' -e '/completed simulation/!d' {} + | wc -l
Any help/suggestion will be much appreciated!
find . -type f -exec \
awk '/completed simulation/{x=1} /STATUS PASSED/{y=1} END{if (x&&y) print FILENAME}' {} \; |
wc -l
I'm printing the matching file names in case that's useful in some other context but piping that to wc will fail if the file names contain newlines - if that's the case just print 1 or anything else from awk.
Since find /directory_path/*.txt -type f is the same as just ls /directory_path/*.txt if all of the ".txt"s are files, though, it sounds like all you actually need is (using GNU awk for nextfile):
awk '
FNR==1 { x=y=0 }
/completed simulation/ { x=1 }
/STATUS PASSED/ { y=1 }
x && y { cnt++; nextfile }
END { print cnt+0 }
' /directory_path/*.txt
or with any awk:
awk '
FNR==1 { x=y=f=0 }
/completed simulation/ { x=1 }
/STATUS PASSED/ { y=1 }
x && y && !f { cnt++; f=1 }
END { print cnt+0 }
' /directory_path/*.txt
Those will work no matter what characters are in your file names.
Using grep and standard utils:
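{ grep -Hm1 'completed simulation' /directory_path/*.txt;
grep -Hm1 'STATUS PASSED' /directory_path/*.txt ; } |
cut -d: -f1 | sort | uniq -d | wc -l
grep -m1 stops at the first match in each file, which saves time if it's a big file. cut -d: -f1 keeps only the file name from each file:line result; the matched text differs between the two patterns, so without it uniq -d would find no duplicates. uniq -d then reports the names that appear in both lists. This assumes the file names contain no colons or newlines.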
The command find /directory_path/*.txt just lists all txt files in /directory_path/ not including subdirectories of /directory_path
find . -name \*.txt -print0 |
while IFS= read -r -d '' file; do
grep -Fq 'completed simulation' "$file" &&
grep -Fq 'STATUS PASSED' "$file" &&
echo "$file"
done |
wc -l
If you ensure no special characters in the filenames
find . -name \*.txt |
while read file; do
grep -Fq 'completed simulation' "$file" &&
grep -Fq 'STATUS PASSED' "$file" &&
echo "$file"
done |
wc -l
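I don't have AIX to test it, but the second version should be POSIX compliant. The first one relies on read -d and find -print0, which older shells and find implementations may lack.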
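The loop above can also be sketched as a function that takes the two strings and the file list as arguments and counts in the shell instead of piping names to wc -l, so unusual file names cannot skew the result. The name count_files_with_both is hypothetical.

```shell
#!/usr/bin/env bash
# count_files_with_both STRING1 STRING2 FILE...   (hypothetical helper)
# Prints how many of the given files contain both fixed strings,
# anywhere in the file and in any order.
count_files_with_both() {
    local first=$1 second=$2 n=0 f
    shift 2
    for f in "$@"; do
        # -F: fixed string, -q: quiet, stop at the first match.
        if grep -Fq -- "$first" "$f" && grep -Fq -- "$second" "$f"; then
            n=$((n + 1))
        fi
    done
    printf '%d\n' "$n"
}

# Usage:
#   count_files_with_both 'completed simulation' 'STATUS PASSED' /directory_path/*.txt
```

Each file is scanned at most twice, and the second grep only runs when the first string was found.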

How to count files in subdir and filter output in bash

Hi hoping someone can help, I have some directories on disk and I want to count the number of files in them (as well as dir size if possible) and then strip info from the output. So far I have this
find . -type d -name "*,d" -print0 | xargs -0 -I {} sh -c 'echo -e $(find "{}" | wc -l) "{}"' | sort -n
This gets me all the dir's that match my pattern as well as the number of files - great!
This gives me something like
2 ./bob/sourceimages/psd/dzv_body.psd,d
2 ./bob/sourceimages/psd/dzv_body_nrm.psd,d
2 ./bob/sourceimages/psd/dzv_body_prm.psd,d
2 ./bob/sourceimages/psd/dzv_eyeball.psd,d
2 ./bob/sourceimages/psd/t_zbody.psd,d
2 ./bob/sourceimages/psd/t_gear.psd,d
2 ./bob/sourceimages/psd/t_pupil.psd,d
2 ./bob/sourceimages/z_vehicles_diff.tga,d
2 ./bob/sourceimages/zvehiclesa_diff.tga,d
5 ./bob/sourceimages/zvehicleswheel_diff.jpg,d
From that I would like to filter on the number of files (more than 4, for example), and I would like to capture the filetype as a variable for each remaining result, e.g. ./bob/sourceimages/zvehicleswheel_diff.jpg,d
I guess I could use awk for this?
Then finally I would like to remove all the results from disk. With find I normally just do something like -exec rm -rf {} \; but I'm not clear how it would work here.
Thanks a lot
EDITED
While this is clearly not the answer, these commands get me the info I want in the form I want it. I just need a way to put it all together and not search multiple times as that's total rubbish
filetype=$(find . -type d -name "*,d" -print0 | awk 'BEGIN { FS = "." }; {
print $3 }' | cut -d',' -f1)
filesize=$(find . -type d -name "*,d" -print0 | xargs -0 -I {} sh -c 'du -h
{};' | awk '{ print $1 }')
filenumbers=$(find . -type d -name "*,d" -print0 | xargs -0 -I {} sh -c
'echo -e $(find "{}" | wc -l);')
files_count=`ls -keys | nl`
For instance:
ls | nl
nl numbers each line of its input, so the last number printed is the number of entries.
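One way the separate searches from the question might be combined into a single pass is sketched below. Everything here is an assumption made for illustration: the function name list_big_dirs, the tab-separated output, and the use of -mindepth 1 (so the directory itself is not counted, unlike the find | wc -l in the question, where a directory holding one file shows 2). Nothing is deleted; the function only reports.

```shell
#!/usr/bin/env bash
# list_big_dirs ROOT MIN   (hypothetical helper)
# For every directory under ROOT named "*,d" that holds more than MIN
# entries, print: count, size, file type, path (tab-separated).
list_big_dirs() {
    local root=$1 min=$2
    find "$root" -type d -name '*,d' -print0 |
    while IFS= read -r -d '' dir; do
        # Entries below the directory; assumes no newlines in their names.
        count=$(( $(find "$dir" -mindepth 1 | wc -l) ))
        (( count > min )) || continue
        name=${dir##*/}            # zvehicleswheel_diff.jpg,d
        name=${name%,d}            # zvehicleswheel_diff.jpg
        ext=${name##*.}            # jpg
        size=$(du -sh "$dir" | cut -f1)
        printf '%s\t%s\t%s\t%s\n' "$count" "$size" "$ext" "$dir"
    done
}

# Usage: report directories with more than 4 entries, smallest first.
#   list_big_dirs . 4 | sort -n
```

Once the report looks right, the fourth column gives the paths that could be handed to a removal step; that step is deliberately left out of the sketch.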

Resources