Bash Script not working when trying to find large ASCII files

So what I'm trying to do is find large ASCII files and then print out each file's name and its line count, but when I run my script it doesn't find anything.
find / -type f -size +2000c -exec file {} \; 2>/dev/null | awk -F':' '/: ASCII text/ {print $1}' | while read FILENAME; do LINES="$(wc -l)"; if [ $LINES > 10000 ]; then echo $FILENAME && echo $LINES; fi; done

What you got wrong:
In if [ $LINES > 10000 ] the > is not a numeric comparison; inside [ ] it is treated as output redirection (it creates a file named 10000), so the numbers are never actually compared. For a numeric comparison, -gt must be used:
if [ "$LINES" -gt 10000 ]
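With that fixed, a corrected sketch of the original pipeline could look like the following; note it also feeds the file to wc -l, since in the original command wc reads from the loop's stdin instead of from $FILENAME:
find / -type f -size +2000c -exec file {} \; 2>/dev/null |
    awk -F':' '/: ASCII text/ {print $1}' |
    while IFS= read -r FILENAME; do
        LINES="$(wc -l < "$FILENAME")"   # count the lines of the file itself
        if [ "$LINES" -gt 10000 ]; then
            echo "$FILENAME"
            echo "$LINES"
        fi
    done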

Please try this:
find / -type f -size +2000c -print0 | xargs -0 grep -Z -L -e '[^[:print:]]' 2>/dev/null | xargs -0 awk 'ENDFILE { if (FNR > 10000) { print FILENAME " " FNR } }'
The idea is to filter out binary files with grep and feed awk the list of remaining files, so that awk can finally discard files whose line count is less than or equal to 10000.
By the way, it handles file names containing whitespace gracefully.

Related

Print top N files by word count in two columns

I would like to make a script that prints the filenames of the top n files from two directories (n being the number of files I give on the command line), ordered by the number of words they contain. My biggest problem, however, is how they should be displayed.
Say my command line looks like this:
myscript.sh 5 dir1 dir2
The output should have 2 columns: on the left the top 5 files in descending order from dir1, and on the right the top 5 files in descending order from dir2.
This is what I have in terms of code; however, I'm missing something. I think that pr -m -t should do what I want, but I couldn't make it work.
#!/bin/bash
dir=$1
dir2=$2

for files in "$dir"
do
    find ./reuters-topics/$dir -type f -exec wc -l {} + | sort -rn | head -n 15
done

for files in "$dir2"
do
    find ./reuters-topics/$dir2 -type f -exec wc -l {} + | sort -rn | head -n 15
done
This is a solution in fish:
for i in (find . -type f); wc -l $i; end | sort -rn | head -n15 | awk '{print $2 "\t" $1}'
As you can see, the re-ordering (filename first, number of words second) is done by awk. As a separator I use a tab character:
awk '{print $2 "\t" $1}'
The difference between my loop and your find call, btw, is that I do not get the "total" line in the output.
I did not test if this (including awk) also works well for files with spaces in the name.
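For what it's worth, a plain-bash sketch along the same lines (one wc call per file, so again no "total" line) that should also tolerate spaces in file names, though not newlines:
find . -type f -exec wc -l {} \; | sort -rn | head -n 15 |
    while read -r count name; do printf '%s\t%s\n' "$name" "$count"; done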
#!/usr/bin/env bash

_top_files_by_words_usage() {
    local usage=""
    read -r -d '' usage <<-"EOF"
Usage:
    top_files_by_words <show_count> <dir1> <dir2>
EOF
    1>&2 printf "%s\n" "$usage"
}

top_files_by_words() {
    if (( $# != 3 )) || [[ "$1" != +([0-9]) ]]; then
        _top_files_by_words_usage
        return 1
    fi
    local -i showCount=0
    local dir1=""
    local dir2=""
    showCount="$1"
    dir1="$2"
    dir2="$3"
    shopt -s extglob
    if [[ ! -d "$dir1" ]]; then
        1>&2 printf "directory '%s' does not exist or is not a directory\n" "$dir1"
        return 1
    fi
    if [[ ! -d "$dir2" ]]; then
        1>&2 printf "directory '%s' does not exist or is not a directory\n" "$dir2"
        return 1
    fi
    local -a out1=()
    local -a out2=()
    IFS=$'\n' read -r -d '' -a out1 < <(find "$dir1" -type f -exec wc -w {} \; | sort -k 1gr | head -n "$showCount")
    IFS=$'\n' read -r -d '' -a out2 < <(find "$dir2" -type f -exec wc -w {} \; | sort -k 1gr | head -n "$showCount")
    local -i i=0
    local -i maxLen=0
    local -i len=0
    for (( i = 0; i < showCount; ++i )); do
        len="${#out1[$i]}"
        if (( len > maxLen )); then
            maxLen=$len
        fi
        # len="${#out2[$i]}"
        # if (( len > maxLen )); then
        #     maxLen=$len
        # fi
    done
    for (( i = 0; i < showCount; ++i )); do
        printf "%-*.*s %s\n" "$maxLen" "$maxLen" "${out1[$i]}" "${out2[$i]}"
    done
    return 0
}

top_files_by_words "$@"
$ ~/tmp/count_words.bash 15 tex tikz
2309328 tex/resume.log 9692402 tikz/tikz-Graphics in LaTeX with TikZ.mp4
2242997 tex/resume_cv.log 2208818 tikz/tikz-Tikz-Graphs and Automata.mp4
2242969 tex/cover_letters/resume_cv.log 852631 tikz/tikz-Drawing Automata with TikZ in LaTeX.mp4
73859 tex/pgfplots/plotdata/heightmap.dat 711004 tikz/tikz-tutorial.mp4
49152 tex/pgfplots/lena.dat 300038 tikz/.ipynb_checkpoints/TikZ 11 Design Principles-checkpoint.ipynb
43354 tex/nancy.mp4 300038 tikz/TikZ 11 Design Principles.ipynb
31226 tex/pgfplots/pgfplotstodo.tex 215583 tikz/texample/bridges-of-konigsberg.svg
26000 tex/pgfplots/plotdata/ou.dat 108040 tikz/Visual TikZ.pdf
20481 tex/pgfplots/pgfplotstable.tex 82540 tikz/worldflags.pdf
19571 tex/pgfplots/pgfplots.reference.3dplots.tex 37608 tikz/texample/india-map.tex
19561 tex/pgfplots/plotdata/risingdrop3d_coord.dat 35798 tikz/.ipynb_checkpoints/TikZ-checkpoint.ipynb
19561 tex/pgfplots/plotdata/risingdrop3d_vel.dat 35656 tikz/texample/periodic_table.svg
18207 tex/pgfplots/ChangeLog 35501 tikz/TikZ.ipynb
17710 tex/pgfplots/pgfplots.reference.markers-meta.tex 25677 tikz/tikz-Graphics in LaTeX with TikZ.info.json
13800 tex/pgfplots/pgfplots.reference.axisdescription.tex 14760 tikz/tikz-Tikz-Graphs and Automata.info.json
column can print files side-by-side in columns. You can use process substitution with <(command) to have those "files" be live commands instead of actual files.
#!/bin/bash

top-files() {
    local n="$1"
    local dir="$2"
    find "$dir" -type f -exec wc -l {} + |
        head -n -1 | sort -rn | head -n "$n"
}

n="$1"
dir1="$2"
dir2="$3"

column <(top-files "$n" reuters-topics/"$dir1") \
       <(top-files "$n" reuters-topics/"$dir2")

Count the number of files in a directory containing two specific strings in bash

I have a few files in a directory containing the pattern below:
Simulator tool completed simulation at 20:07:18 on 09/28/18.
The situation of the simulation: STATUS PASSED
Now I want to count the number of files which contain both of the strings completed simulation & STATUS PASSED anywhere in the file.
This command works for searching for one string, STATUS PASSED, and counting the matching files:
find /directory_path/*.txt -type f -exec grep -l "STATUS PASSED" {} + | wc -l
Sed is also giving 0 as a result:
find /directory_path/*.txt -type f -exec sed -e '/STATUS PASSED/!d' -e '/completed simulation/!d' {} + | wc -l
Any help/suggestion will be much appreciated!
find . -type f -exec \
awk '/completed simulation/{x=1} /STATUS PASSED/{y=1} END{if (x&&y) print FILENAME}' {} \; |
wc -l
I'm printing the matching file names in case that's useful in some other context, but piping them to wc will fail if the file names contain newlines; if that's a concern, just print 1 (or anything else) from awk instead.
Since find /directory_path/*.txt -type f is the same as just ls /directory_path/*.txt if all of the ".txt"s are files, though, it sounds like all you actually need is (using GNU awk for nextfile):
awk '
FNR==1 { x=y=0 }
/completed simulation/ { x=1 }
/STATUS PASSED/ { y=1 }
x && y { cnt++; nextfile }
END { print cnt+0 }
' /directory_path/*.txt
or with any awk:
awk '
FNR==1 { x=y=f=0 }
/completed simulation/ { x=1 }
/STATUS PASSED/ { y=1 }
x && y && !f { cnt++; f=1 }
END { print cnt+0 }
' /directory_path/*.txt
Those will work no matter what characters are in your file names.
Using grep and standard utils:
{ grep -lm1 'completed simulation' /directory_path/*.txt;
  grep -lm1 'STATUS PASSED' /directory_path/*.txt ; } |
sort | uniq -d | wc -l
grep -l prints just the name of each matching file, and -m1 makes grep stop at the first match, which saves time if it's a big file. A file name that shows up in both grep runs appears twice, so sort | uniq -d | wc -l counts exactly the files that contain both strings.
The command find /directory_path/*.txt just lists all the txt files in /directory_path/, not including subdirectories of /directory_path.
find . -name \*.txt -print0 |
while read -d $'\0' file; do
grep -Fq 'completed simulation' "$file" &&
grep -Fq 'STATUS PASSED' "$_" &&
echo "$_"
done |
wc -l
If you can ensure there are no special characters in the filenames:
find . -name \*.txt |
while read file; do
grep -Fq 'completed simulation' "$file" &&
grep -Fq 'STATUS PASSED' "$file" &&
echo "$file"
done |
wc -l
I don't have AIX to test it, but it should be POSIX compliant.

bash script for word count of a particular file in a directory, with the result output to a logfile and then log rotation

A bash script to word count a particular file in a directory; the result is output to a logfile, and then the entry in the log file rotates.
This is what I have tried, something like:
Dir=$1
filename="${Dir#*.}"
auditfile="/tmp/claim"
claims=`wc $filename` # count the number of users
if [ -d $Dir ]; then
    find . -type f -exec wc -l {} \; | awk -F'[;/.]' '{print $1}' | sort -n | uniq -c
    # will need to use awk here to find the correct count.
fi
preclaims="grep $filename $auditfile" # will need to use awk here to find the correct count.
echo $filename:$claims > $auditfile
echo $prevclaims - $claims
Will need to determine when logfile rotates, something like
if $claims > $prevclaims then $prevclaims = 0
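A minimal hedged sketch of the idea might look like the following; the audit-file location and the rotation rule here are assumptions for illustration, not requirements taken from the question:
#!/bin/bash
# hypothetical sketch: count the words in one file, compare with the previous
# count stored in an audit file, and reset when the log appears to have rotated
file="$1"
auditfile="/tmp/claim"

claims=$(wc -w < "$file")                             # current word count
prevclaims=$(cat "$auditfile" 2>/dev/null || echo 0)  # last recorded count
prevclaims=${prevclaims:-0}

# if the count went down, the log file has rotated, so start again from zero
if [ "$claims" -lt "$prevclaims" ]; then
    prevclaims=0
fi

echo "$file: $((claims - prevclaims)) new words since last run"
echo "$claims" > "$auditfile"                         # remember the count for next time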

remove files which contain more than 14 lines in a folder

Unix command used
wc -l * | grep -v "14" | rm -rf
However this grouping doesn't seem to do the job. Can anyone point me towards the correct way?
Thanks
wc -l * 2>/dev/null | while read -r num file; do [ "$file" = "total" ] && continue; ((num > 14)) && echo rm "$file"; done
remove "echo" if you're happy with the results.
Here's one way to print out the names of all files with at least 15 lines (assuming you have GNU awk, for the nextfile command):
awk 'FNR==15{print FILENAME;nextfile}' *
That will produce an error for any subdirectory, so it's not ideal.
You don't actually want to print the filenames, though. You want to delete them. You can do that in awk with the system function:
# The following has been defanged in case someone decides to copy&paste
awk 'FNR==15{system("echo rm "FILENAME);nextfile}' *
for f in *; do if [ $(wc -l $f | cut -d' ' -f1) -gt 14 ]; then rm -f $f; fi; done
There are a few problems with your solution: rm doesn't read file names from stdin, and grep -v "14" only removes lines that contain the string 14 somewhere (in the count or the file name); it does not select files with more than 14 lines. Try this instead:
find . -maxdepth 1 -type f | while read -r f; do [ `wc -l $f | tr -s ' ' | cut -d ' ' -f 2` -gt 14 ] && rm $f; done
Here's how it works:
find . -maxdepth 1 -type f #all files (not directories) in the current directory
[ #start comparison
wc -l $f #get line count of file
tr -s ' ' #(on the output of wc) eliminate extra whitespace
cut -d ' ' -f 2 #pick just the line count out of the previous output
-gt 14 ] #test if all that was greater than 14
&& rm $f #if the comparison was true, delete the file
I tried to figure out a solution using just find with -exec, but I couldn't figure out a way to test the line count. Maybe somebody else can come up with a way to do it.
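For what it's worth, here is a hedged sketch of such a find -exec variant, doing the line-count test in a small sh -c wrapper (echo is left in as a dry run; drop it to actually delete):
find . -maxdepth 1 -type f -exec sh -c '[ $(wc -l < "$1") -gt 14 ] && echo rm -- "$1"' _ {} \;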

Find TXT files and show Total Count of records of each file and Size of each file

I need to find the row count and size of each TXT file.
It needs to search all the directories and just show the result as:
FileName|Cnt|Size
ABC.TXT|230|23MB
Here is some code:
v_DIR=$1
echo "the directory to cd is "$1
x=`ls -l $0 | awk '{print $9 "|" $5}'`
y=`awk 'END {print NR}' $0`
echo $x '|' $y
Try something like:
find . -type f -name '*.txt' -exec bash -c 'lines=$(wc -l "$0" | cut -d " " -f1); size=$(du -h "$0" | cut -f1); echo "$0|$lines|$size"' {} \;
