Count the number of files in a directory containing two specific strings in bash

I have a few files in a directory containing the pattern below:
Simulator tool completed simulation at 20:07:18 on 09/28/18.
The situation of the simulation: STATUS PASSED
Now I want to count the number of files which contain both of the strings completed simulation and STATUS PASSED anywhere in the file.
This command works to search for the single string STATUS PASSED and count the matching files:
find /directory_path/*.txt -type f -exec grep -l "STATUS PASSED" {} + | wc -l
Sed is also giving 0 as a result:
find /directory_path/*.txt -type f -exec sed -e '/STATUS PASSED/!d' -e '/completed simulation/!d' {} + | wc -l
Any help/suggestion will be much appreciated!

find . -type f -exec \
awk '/completed simulation/{x=1} /STATUS PASSED/{y=1} END{if (x&&y) print FILENAME}' {} \; |
wc -l
I'm printing the matching file names in case that's useful in some other context, but piping them to wc will fail if the file names contain newlines; if that's a concern, just print 1 (or anything else) from awk instead.
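If file names with newlines are a concern, a count-safe variant of the same command prints a constant instead of the name:
find . -type f -exec \
awk '/completed simulation/{x=1} /STATUS PASSED/{y=1} END{if (x&&y) print 1}' {} \; |
wc -l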
Since find /directory_path/*.txt -type f is the same as just ls /directory_path/*.txt if all of the ".txt"s are files, though, it sounds like all you actually need is (using GNU awk for nextfile):
awk '
FNR==1 { x=y=0 }
/completed simulation/ { x=1 }
/STATUS PASSED/ { y=1 }
x && y { cnt++; nextfile }
END { print cnt+0 }
' /directory_path/*.txt
or with any awk:
awk '
FNR==1 { x=y=f=0 }
/completed simulation/ { x=1 }
/STATUS PASSED/ { y=1 }
x && y && !f { cnt++; f=1 }
END { print cnt+0 }
' /directory_path/*.txt
Those will work no matter what characters are in your file names.

Using grep and standard utils:
{ grep -lm1 'completed simulation' /directory_path/*.txt;
  grep -lm1 'STATUS PASSED' /directory_path/*.txt ; } |
sort | uniq -d | wc -l
grep -m1 stops when it finds the first match, which saves time on big files, and -l prints just the file name rather than the matching line. A file that contains both strings therefore appears once in each grep's output, and sort | uniq -d keeps exactly the names that appear twice.
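To see why uniq -d is the right filter, consider a hypothetical stream (file names invented for illustration) arriving at sort | uniq -d:
/directory_path/a.txt    <- from the first grep
/directory_path/a.txt    <- from the second grep: a.txt contains both strings
/directory_path/b.txt    <- b.txt matched only one of the greps
After sorting, the two a.txt lines are adjacent; uniq -d prints each duplicated name once, and wc -l counts those names.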

The command find /directory_path/*.txt just lists all .txt files in /directory_path/, not including subdirectories of /directory_path.
find . -name \*.txt -print0 |
while IFS= read -r -d '' file; do
    grep -Fq 'completed simulation' "$file" &&
    grep -Fq 'STATUS PASSED' "$file" &&
    echo "$file"
done |
wc -l
If you can ensure there are no special characters in the filenames:
find . -name \*.txt |
while IFS= read -r file; do
    grep -Fq 'completed simulation' "$file" &&
    grep -Fq 'STATUS PASSED' "$file" &&
    echo "$file"
done |
wc -l
I don't have AIX to test it, but it should be POSIX compliant.


How to find and move all files matching given Beginning Of File (BOF) string?

How to move all files whose content begins with foo to another folder from the command line?
I tried this to echo the filenames that match:
for f in *.txt; do if [ $(head -c5 $f) = "foo" ]; then echo $f; fi; done;
but I'm often getting this error:
-bash: [: too many arguments
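For reference, the error comes from the unquoted command substitution: if head outputs more than one word (or nothing), [ sees the wrong number of arguments. A quoted sketch of the same loop (note it compares the first three bytes, since foo is three characters):
for f in *.txt; do
    if [ "$(head -c3 "$f")" = "foo" ]; then    # quotes keep the output as a single argument
        echo "$f"
    fi
done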
Use a Perl one-liner in combination with find and xargs, like so:
echo foo > 1.txt
echo "foo\nbar" > 2.txt
echo "bar\nfoo" > 3.txt
mkdir foodir
find . -maxdepth 1 -name '[123].txt' -exec perl -lne 'print $ARGV if /^foo/; last;' {} \; | xargs -I{} mv {} foodir
find foodir -type f
# foodir/2.txt
# foodir/1.txt
Find with awk
find . -type f -exec awk 'NR < 6 && /foo/ { fnd=1 } END { if (fnd==1) { split(FILENAME,arr,"/"); print "mv -f " FILENAME " newdir/" arr[length(arr)] } }' {} \;
Use awk to process the first 5 lines (NR < 6). Search for foo and if it exists, set a fnd variable to 1. At the end, if fnd is 1, print the mv command, using split to get the filename without the directories. Check that everything looks as expected and then run with:
find . -type f -exec awk 'NR < 6 && /foo/ { fnd=1 } END { if (fnd==1) { split(FILENAME,arr,"/"); print "mv -f " FILENAME " newdir/" arr[length(arr)] } }' {} \; | bash

Getting a list of substring based unique filenames in an array

I have a directory my_dir with files having names like:
a_v5.json
a_v5.mapping.json
a_v5.settings.json
f_v39.json
f_v39.mapping.json
f_v39.settings.json
f_v40.json
f_v40.mapping.json
f_v40.settings.json
c_v1.json
c_v1.mapping.json
c_v1.settings.json
I'm looking for a way to get an array [a_v5, f_v40, c_v1] in bash. That is, I need an array of the file-name prefixes with the latest version number for each prefix.
I tried this: ls *.json | find . -type f -exec basename "{}" \; | cut -d. -f1, but it returns results including files which do not have the .json extension.
You can use the following command if your filenames don't contain whitespace and special symbols like * or ?:
array=($(
find . -type f -iname \*.json |
sed -E 's|(.*/)*(.*_v)([0-9]+)\..*|\2 \3|' |
sort -Vr | sort -uk1,1 | tr -d ' '
))
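To see how the pipeline works, here is a hypothetical trace for two of the question's file names, f_v39.json and f_v40.mapping.json:
f_v 39      <- after sed: prefix and version separated by a space
f_v 40
f_v 40      <- after sort -Vr: highest version first
f_v 39
f_v 40      <- after sort -uk1,1: first (highest) line kept per prefix
f_v40       <- after tr -d ' ': the finished array element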
It's ugly and unsafe. The following solution is longer but can handle all file names, even those with linebreaks in them.
maxversions() {
find -type f -iname \*.json -print0 |
gawk 'BEGIN { RS = "\0"; ORS = "\0" }
match($0, /(.*\/)*(.*_v)([0-9]+)\..*/, group) {
prefix = group[2];
version = group[3];
if (version > maxversion[prefix])
maxversion[prefix] = version
}
END {
for (prefix in maxversion)
print prefix maxversion[prefix]
}'
}
mapfile -d '' array < <(maxversions)
In both cases you can check the contents of array with declare -p array.
Arrays and bash string parsing.
declare -A tmp=()
for f in "$SOURCE_DIR"/*.json
do  f=${f##*/}           # strip path
    tmp[${f%%.*}]=1      # strip extraneous data after . in filename
done
declare -a c=( $( printf "%s\n" "${!tmp[@]}" | cut -c 1 | sort -u ) )   # get just the first chars
declare -a lst=( $( for f in "${c[@]}"
    do printf "%s\n" "${!tmp[@]}" |
       grep "^${f}_" |
       sort -n |
       tail -1; done ) )
echo "[ ${lst[@]} ]"
[ a_v5 c_v1 f_v40 ]
Or, if you'd rather,
declare -a arr=( $(
    for f in "$SOURCE_DIR"/*.json
    do  d=${f%/*}        # get dir path
        f=${f##*/}       # strip path
        g=${f:0:2}       # get leading str
        ( cd "$d" && printf "%s\n" "${g}"*.json |
          sort -n | sed -n '$ { s/[.].*//; p; }' )
    done | sort -u ) )
echo "[ ${arr[@]} ]"
[ a_v5 c_v1 f_v40 ]
This is one possible way to accomplish this:
arr=( $( { for name in $( ls {f,n,m}*.txt ); do echo ${name:0:1} ; done; } | sort | uniq ) )
Output:
$ echo ${arr[0]}
f
$ echo ${arr[1]}
m
$ echo ${arr[2]}
n
Regards!
AWK SOLUTION
This is not an elegant solution... my knowledge of awk is limited.
You should find this functional.
I've updated this to remove the obsolete uniq as suggested by @socowi.
I've also included the printf version as @socowi suggested.
ls *.json | cut -d. -f1 | sort -rn | awk -v last="xx" '$1 !~ last{ print $1; last=substr($1,1,3) }'
OR
printf %s\\n *.json | cut -d. -f1 | sort -rn | awk -v last="xx" '$1 !~ last{ print $1; last=substr($1,1,3) }'
Old understanding below
Find files with name matching pattern.
Now take the second field, since each result will be prefixed with ./
find . -type f -iname "*.json" | cut -d. -f2
To get the unique headings....
find . -type f -iname "*.json" | cut -d. -f2 | sort | uniq

Bash Script not working when trying to find large ASCII files

So what I'm trying to do is find large ASCII files and then print out each file's name and its line count, but when I run my script it doesn't find anything.
find / -type f -size +2000c -exec file {} \; 2>/dev/null | awk -F':' '/: ASCII text/ {print $1}' | while read FILENAME; do LINES="$(wc -l)"; if [ $LINES > 10000 ]; then echo $FILENAME && echo $LINES; fi; done
What went wrong?
In if [ $LINES > 10000 ] the unquoted > is not a comparison at all: the shell treats it as an output redirection (creating a file named 10000), so the test degenerates to [ $LINES ]. For a numeric comparison, -gt must be used:
if [ "$LINES" -gt 10000 ]
Note also that LINES="$(wc -l)" reads from standard input and so consumes the file names being fed to the while loop; it should read from the file instead: LINES=$(wc -l < "$FILENAME").
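Putting both fixes together, a corrected version of the original pipeline might look like this (a sketch; note wc -l must read from the file itself, not from the loop's stdin):
find / -type f -size +2000c -exec file {} \; 2>/dev/null |
awk -F':' '/: ASCII text/ {print $1}' |
while IFS= read -r FILENAME; do
    LINES=$(wc -l < "$FILENAME")    # read the file, not the while loop's input
    if [ "$LINES" -gt 10000 ]; then
        echo "$FILENAME"
        echo "$LINES"
    fi
done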
Please try this:
find / -type f -size +2000c -print0 | xargs -0 grep -Z -L -e '[^[:print:]]' 2>/dev/null | xargs -0 awk 'ENDFILE { if (FNR > 10000) { print FILENAME " " FNR } }'
The idea is to filter out binary files with grep and feed awk the list of filtered files, finally keeping only the files whose line count exceeds 10000 (ENDFILE requires GNU awk).
By the way, it handles file names containing whitespace gracefully.

Bash: Find file with max lines count

This is my attempt at doing it:
Find all *.java files
find . -name '*.java'
Count lines
wc -l
Delete last line
sed '$d'
Use AWK to find max lines-count in wc output
awk 'max=="" || data=="" || $1 > max {max=$1 ; data=$2} END{ print max " " data}'
then merge it to single line
find . -name '*.java' | xargs wc -l | sed '$d' | awk 'max=="" || data=="" || $1 > max {max=$1 ; data=$2} END{ print max " " data}'
Can I somehow implement counting just non-blank lines?
find . -type f -name "*.java" -exec grep -H -c '[^[:space:]]' {} \; | \
sort -nr -t":" -k2 | awk -F: '{print $1; exit;}'
Replace the awk command with head -n1 if you also want to see the number of non-blank lines.
Breakdown of the command:
find . -type f -name "*.java" -exec grep -H -c '[^[:space:]]' {} \;
'---------------------------' '-----------------------'
| |
for each *.java file Use grep to count non-empty lines
-H includes filenames in the output
(output = ./full/path/to/file.java:count)
| sort -nr -t":" -k2 | awk -F: '{print $1; exit;}'
'----------------' '-------------------------'
| |
Sort the output in Print filename of the first entry (largest count)
reverse order using the then exit immediately
second column (count)
find . -name "*.java" -type f | xargs wc -l | sort -rn | grep -v ' total$' | head -1
To get the line count of each of your files using awk is just:
$ find . -name '*.java' -print0 | xargs -0 awk '
BEGIN { for (i=1;i<ARGC;i++) size[ARGV[i]]=0 }
{ size[FILENAME]++ }
END { for (file in size) print size[file], file }
'
To get the count of the non-empty lines, simply make the line where you increment the size[] conditional:
$ find . -name '*.java' -print0 | xargs -0 awk '
BEGIN { for (i=1;i<ARGC;i++) size[ARGV[i]]=0 }
NF { size[FILENAME]++ }
END { for (file in size) print size[file], file }
'
(NF already treats lines that contain only blanks as "empty"; if you want such lines counted as non-empty instead, replace NF with /^./.)
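A quick way to see the difference between the two conditions:
printf 'a\n\n  \n' | awk 'NF  {n++} END{print n+0}'    # prints 1: empty and blanks-only lines are skipped
printf 'a\n\n  \n' | awk '/^./{n++} END{print n+0}'    # prints 2: the blanks-only line counts as non-empty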
To get only the file with the most non-empty lines just tweak again:
$ find . -name '*.java' -print0 | xargs -0 awk '
BEGIN { for (i=1;i<ARGC;i++) size[ARGV[i]]=0 }
NF { size[FILENAME]++ }
END {
for (file in size) {
if (size[file] >= maxSize) {
maxSize = size[file]
maxFile = file
}
}
print maxSize, maxFile
}
'
Something like this might work:
find . -name '*.java' | while read -r filename; do
    nlines=$(grep -v -E '^[[:space:]]*$' "$filename" | wc -l)
    echo "$nlines $filename"
done | sort -nr | head -1
(edited as per Ed Morton's comment. I must have had too much coffee :-) )

grep for multiple strings in file on different lines (i.e. whole file, not line-based search)?

I want to grep for files containing the words Dansk, Svenska or Norsk on any line, with a usable return code (I really only need the information that the strings are contained; my actual one-liner goes a little further than this).
I have many files with lines in them like this:
Disc Title: unknown
Title: 01, Length: 01:33:37.000 Chapters: 33, Cells: 31, Audio streams: 04, Subpictures: 20
Subtitle: 01, Language: ar - Arabic, Content: Undefined, Stream id: 0x20,
Subtitle: 02, Language: bg - Bulgarian, Content: Undefined, Stream id: 0x21,
Subtitle: 03, Language: cs - Czech, Content: Undefined, Stream id: 0x22,
Subtitle: 04, Language: da - Dansk, Content: Undefined, Stream id: 0x23,
Subtitle: 05, Language: de - Deutsch, Content: Undefined, Stream id: 0x24,
(...)
Here is the pseudocode of what I want:
for all files in directory;
if file contains "Dansk" AND "Norsk" AND "Svenska" then
then echo the filename
end
What is the best way to do this? Can it be done on one line?
You can use:
grep -l Dansk * | xargs grep -l Norsk | xargs grep -l Svenska
If you want also to find in hidden files:
grep -l Dansk .* | xargs grep -l Norsk | xargs grep -l Svenska
Yet another way using just bash and grep:
For a single file 'test.txt':
grep -q Dansk test.txt && grep -q Norsk test.txt && grep -l Svenska test.txt
Will print test.txt iff the file contains all three (in any combination). The first two greps don't print anything (-q) and the last only prints the file if the other two have passed.
If you want to do it for every file in the directory:
for f in *; do grep -q Dansk "$f" && grep -q Norsk "$f" && grep -l Svenska "$f"; done
grep -irl word1 * | grep -il word2 `cat -` | grep -il word3 `cat -`
-i makes search case insensitive
-r makes file search recursive through folders
-l pipes the list of files with the word found
cat - reads the list of file names from the pipe, causing the next grep to look through those files.
You can do this really easily with ack:
ack -l 'cats' | ack -xl 'dogs'
-l: return a list of files
-x: take the files from STDIN (the previous search) and only search those files
And you can just keep piping until you get just the files you want.
How to grep for multiple strings in file on different lines (Use the pipe symbol):
for file in *; do
    test $(grep -E 'Dansk|Norsk|Svenska' "$file" | wc -l) -ge 3 && echo "$file"
done
Notes:
With plain grep (no -E), you will have to escape the pipe like this: \| to search for Dansk, Norsk and Svenska.
Assumes that one line has only one language, and that no single language appears on three or more lines (otherwise the count could reach 3 without all three languages being present).
Walkthrough: http://www.cyberciti.biz/faq/howto-use-grep-command-in-linux-unix/
awk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print "0" }'
You can then capture the printed value with the shell.
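For example, capturing the printed value in a test (a sketch; file is a placeholder name):
if [ "$(awk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print "0" }' file)" = "0" ]; then
    echo "file contains all three"
fi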
If you have Ruby (1.9+):
ruby -0777 -ne 'print if /Dansk/ and /Norsk/ and /Svenska/' file
This searches for multiple words in multiple files (note it matches files containing either word, not files containing both):
egrep 'abc|xyz' file1 file2 ..filen
Simply:
grep 'word1\|word2\|word3' *
Like the previous answer, this matches files containing any of the words rather than requiring all of them; see this post for more info.
This is a blending of glenn jackman's and kurumi's answers which allows an arbitrary number of regexes instead of an arbitrary number of fixed words or a fixed set of regexes.
#!/usr/bin/awk -f
# by Dennis Williamson - 2011-01-25
BEGIN {
    for (i=ARGC-2; i>=1; i--) {
        patterns[ARGV[i]] = 0;
        delete ARGV[i];
    }
}
{
    for (p in patterns)
        if ($0 ~ p)
            matches[p] = 1
    # print    # the matching line could be printed
}
END {
    for (p in patterns) {
        if (matches[p] != 1)
            exit 1
    }
}
Run it like this:
./multigrep.awk Dansk Norsk Svenska 'Language: .. - A.*c' dvdfile.dat
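Since the script reports through its exit status, it can be used directly in a shell conditional, for example:
./multigrep.awk Dansk Norsk Svenska dvdfile.dat && echo "all patterns matched"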
Here's what worked well for me:
find . -path '*/.svn' -prune -o -type f -exec gawk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
./path/to/file1.sh
./another/path/to/file2.txt
./blah/foo.php
If I just wanted to find .sh files with these three, then I could have used:
find . -path '*/.svn' -prune -o -type f -name "*.sh" -exec gawk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
./path/to/file1.sh
Expanding on @kurumi's awk answer, here's a bash function:
all_word_search() {
    gawk '
    BEGIN {
        for (i=ARGC-2; i>=1; i--) {
            search_terms[ARGV[i]] = 0;
            ARGV[i] = ARGV[i+1];
            delete ARGV[i+1];
        }
    }
    {
        for (i=1; i<=NF; i++)
            if ($i in search_terms)
                search_terms[$i] = 1
    }
    END {
        for (word in search_terms)
            if (search_terms[word] == 0)
                exit 1
    }
    ' "$@"
    return $?
}
Usage:
if all_word_search Dansk Norsk Svenska filename; then
echo "all words found"
else
echo "not all words found"
fi
I did it in two steps: make a list of the csv files in one file, then grep through that list. With the help of the comments on this page I got what I needed without writing a script. Just type into the terminal:
$ find /csv/file/dir -name '*.csv' > csv_list.txt
$ grep -q Svenska `cat csv_list.txt` && grep -q Norsk `cat csv_list.txt` && grep -l Dansk `cat csv_list.txt`
It did exactly what I needed: it printed the file names containing all three words.
Also mind the quoting symbols: backtick `, single quote ' and double quote ".
If you only need two search terms, arguably the most readable approach is to run each search and intersect the results:
comm -12 <(grep -rl word1 . | sort) <(grep -rl word2 . | sort)
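For three terms, the same pattern nests (a sketch following the same approach):
comm -12 <(comm -12 <(grep -rl word1 . | sort) <(grep -rl word2 . | sort)) <(grep -rl word3 . | sort)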
If you have git installed
git grep -l --all-match --no-index -e Dansk -e Norsk -e Svenska
The --no-index option searches files in the current directory that are not managed by Git, so this command will work in any directory irrespective of whether it is a git repository or not.
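The question asked for a usable return code; like plain grep, git grep exits 0 only when a match is found, so for example:
git grep -l --all-match --no-index -e Dansk -e Norsk -e Svenska >/dev/null && echo "all three languages found"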
I had this problem today, and all the one-liners here failed for me because the files contained spaces in their names.
This is what I came up with that worked:
grep -ril <WORD1> | sed 's/.*/"&"/' | xargs grep -il <WORD2>
A simple one-liner in bash for an arbitrary list LIST and a file my_file.txt can be:
LIST="Dansk Norsk Svenska"
EVAL=$(echo "$LIST" | sed 's/[^ ]* */grep -q & my_file.txt \&\& /g'); eval "$EVAL echo yes || echo no"
Replacing eval with echo reveals that the following command is evaluated:
grep -q Dansk my_file.txt && grep -q Norsk my_file.txt && grep -q Svenska my_file.txt && echo yes || echo no
