Count number of occurrences of a specific regex on multiple files - bash

I am trying to write up a bash script to count the number of times a specific pattern matches on a list of files.
I've googled for solutions but I've only found solutions for single files.
I know I can use egrep -o PATTERN file, but how do I generalize to a list of files and out the sum at the end?
EDIT: Adding the script I am trying to write:
#! /bin/bash
egrep -o -c "\s*assert.*;" $1 | awk -F: '{sum+=$2} END{print sum}'
Running egrep directly on the command line works fine, but within a bash script it doesn't. Do I have to specially protect the RegEx?

You could use grep -c to count the matches within each file, and then use awk at the end to sum up the counts, e.g.:
grep -c PATTERN * | awk -F: '{sum+=$2} END{print sum}'

grep -o <pattern> file1 [file2 .. | *] |
uniq -c
If you want the total only:
grep -o <pattern> file1 [file2 .. | *] | wc -l
Edit: The sort seems unnecessary.

The accepted answer has a problem in that grep will count as 1 even though the PATTERN may appear more than once on a line. Besides, one command does the job
awk 'BEGIN{RS="\0777";FS="PATTERN"} { print NF-1 } ' file

Related

Counting Python files with bash and awk always returns zero

I want to get a number of python files on my desktop and I have coded a small script for that. But the awk command does not work as is have expected.
script
ls -l | awk '{ if($NF=="*.py") print $NF; }' | wc -l
I know that there is another solution to finding a number of python files on a PC but I just want to know what am i doing wrong here.
ls -l | awk '{ if($NF=="*.py") print $NF; }' | wc -l
Your code does count of files literally named *.py, you should deploy regex matching and use correct GNU AWK syntax, after fixing that, your code becomes
ls -l | awk '{ if($NF~/[.]py$/) print $NF; }' | wc -l
note [.] which denote literal . and $ denoting end of string.
Your code might be further ameloriated, as there is not need to use if here, as pattern-action will do that is
ls -l | awk '$NF~/[.]py$/{ print $NF; }' | wc -l
Morever you might easily implemented counting inside GNU AWK rather than deploying wc -l as follows
ls -l | awk '$NF~/[.]py$/{t+=1}END{print t}'
Here, t is increased by 1 for every describe line, and after all is processed, that is in END it is printed. Observe there is no need to declare t variable in GNU AWK.
Don't try to parse the output of ls, see https://mywiki.wooledge.org/ParsingLs.
Beyond that your awk script is failing because $NF=="*.py" is doing a literal string partial comparison of the last sting of non-spaces against *.py when you probably wanted a regexp comparison such as $NF~/*.py$/ and your print $NF would fail for any file names containing spaces.
If you really want to involve awk in this for some reason then, assuming the list of python files doesn't exceed ARG_MAX, it'd be:
awk 'BEGIN{print ARGC-1; exit}' *.py
but you could just do it in bash:
shopt -s nullglob
files=(*.py)
echo "${#files[#]}"
or if you want to have a pipe to wc -l for some reason and your files can't have newlines in their names then:
printf '%s\n' *.py | wc -l
gfind . -maxdepth 1 -type f -name "*.py" -print0 |
{m,g}awk 'END { print NR }' RS='\0' FS='^$'
or
{m,g}awk 'END { print --NF }' RS='^$' FS='\0'
879

How to extract codes using the grep command?

I have a file with below input lines.
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
I need to extract only the above highlighted text using the grep command.
I tried the below command and didn't get the proper result. Getting the extra 2 unwanted characters in the output. Please suggest if there is any other way to achieve this through grep command.
find ./ -type f -name <FileName> -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\\....' |
grep -o '...\\....' | sort | uniq
Current Output:
123.NNN
456.1 a
Expected output:
123.NNN
456.1
You can use another grep regular expression.
find ./ -type f -name f -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\.[^ ]*' |
grep -o '...\..*' | sort | uniq
. matches any character, [^ ]* matches any sequence of characters until the first space
Output:
123.NNN
456.1
Your regex specifies a fixed character width for strings of variable width. Based on your examples, something like
[0-9]\+\.[A-Z0-9]\+
would seem like a better regex. However, we could probably also simplify this by merging the cut and multiple grep commands into a single Awk script.
find etc etc -exec awk -F '|' '
$4 ~ /Category is not found for local configuration\/code\/[0-9]{3}\.[0-9A-Z]/ {
split($4, a, /\/code\/);
split(a[2], b); print b[1] }' {} + |
sort -u
The two split operations are just a cheap way to pick out the text between /code/ and the next whitespace character; we have already established by way of the regex match that the string after /code/ matches the pattern we're after.
Notice also how sort has a -u option which allows you to replace (trivial cases of) uniq.
The regex variant supported by Awk is slightly different than that supported by POSIX grep; so the backslashed \+ in grep's BRE dialect is plain + in the dialect called ERE which is [more or less] supported by Awk - and grep -E. If you have grep -P you can use a third variant which has a convenient feature;
find etc etc -exec grep -oP '^([^|]*[|]){3}[^|]*Category is not found for local configuration/code/\K[0-9]{3}\.[0-9A-Z]+' {} + |
sort -u
The \K says "match up through here, but forget everything before this" and so only prints the part after this token.
With sed:
sed -E -n 's#.*code/(.*)\s+and.*#\1#p' file.txt | uniq
Output:
123.NNN
456.1
I'd use the -P option:
grep -oP '/code/\K\S+' file | sort -u
You want to extract the non-whitespace characters following /code/
An awk using match():
$ awk 'match($0,/[0-9]+\.[A-Z0-9]+/)&&++a[(b=substr($0,RSTART,RLENGTH))]==1{print b}' file
Output:
123.NNN
456.1
Pretty printed for slightly better readability:
$ awk '
match($0,/[0-9]+\.[A-Z0-9]+/) && ++a[(b=substr($0,RSTART,RLENGTH))]==1 {
print b
}' file
It's not possible just using grep. You should use AWK instead:
awk '{split($7, ar, "/"); print ar[3]}' FILE
Explanation:
The split function splits on a string, here $7, the 7th field, placing the result in an array ar, and using the string / as delimiter.
Then prints the 3rd field of the array.
Note:
I am assuming that all of your input looks like the samples you have given us, i.e.:
aaa|b|c|ddd is not found for local configuration/code/111.nnn and customer nnn
Where aaa and ddd will not contain whitespace.
I also assume you really do have a file FILE containing those lines. It's a bit unclear.
Input:
▶ cat FILE
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
Output:
▶ awk '{split($7, ar, "/"); print ar[3]}' FILE
123.NNN
123.NNN
456.1
Single sed can do the filtering.
(The pattern can be further generalized as suggested by others if that is an option. But be careful to not to over simplify so that it can match with unexpected inputs)
sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' input.txt
To replace your exact command,
find ./ -type f -name <Filename> -exec cat {} \; | sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' | sort | uniq
Simple substitutions on individual lines is the job sed is best suited for. This will work using any sed in any shell on any UNIX box:
$ cat file
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
$ sed -n 's:.*Category is not found for local configuration/code/\([^ ]*\).*:\1:p' file | sort -u
123.NNN
456.1

How to grep and match the first occurrence of a line?

Given the following content:
title="Bar=1; Fizz=2; Foo_Bar=3;"
I'd like to match the first occurrence of Bar value which is 1. Also I don't want to rely on soundings of the word (like double quote in the front), because the pattern could be in the middle of the line.
Here is my attempt:
$ grep -o -m1 'Bar=[ ./0-9a-zA-Z_-]\+' input.txt
Bar=1
Bar=3
I've used -m/--max-count which suppose to stop reading the file after num matches, but it didn't work. Why this option doesn't work as expected?
I could mix with head -n1, but I wondering if it is possible to achieve that with grep?
grep is line-oriented, so it apparently counts matches in terms of lines when using -m[1]
- even if multiple matches are found on the line (and are output individually with -o).
While I wouldn't know to solve the problem with grep alone (except with GNU grep's -P option - see anubhava's helpful answer), awk can do it (in a portable manner):
$ awk -F'Bar=|;' '{ print $2 }' <<<"Bar=1; Fizz=2; Foo_Bar=3;"
1
Use print "Bar=" $2, if the field name should be included.
Also note that the <<< method of providing input via stdin (a so-called here-string) is specific to Bash, Ksh, Zsh; if POSIX compliance is a must, use echo "..." | grep ... instead.
[1] Options -m and -o are not part of the grep POSIX spec., but both GNU and BSD/OSX grep support them and have chosen to implement the line-based logic.
This is consistent with the standard -c option, which counts "selected lines", i.e., the number of matching lines:
grep -o -c 'Bar=[ ./0-9a-zA-Z_-]\+' <<<"Bar=1; Fizz=2; Foo_Bar=3;" yields 1.
Using perl based regex flavor in gnu grep you can use:
grep -oP '^(.(?!Bar=\d+))*Bar=\d+' <<< "Bar=1; Fizz=2; Foo_Bar=3;"
Bar=1
(.(?!Bar=\d+))* will match 0 or more of any characters that don't have Bar=\d+ pattern thus making sure we match first Bar=\d+
If intent is to just print the value after = then use:
grep -oP '^(.(?!Bar=\d+))*Bar=\K\d+' <<< "Bar=1; Fizz=2; Foo_Bar=3;"
1
You can use grep -P (assuming you are on gnu grep) and positive look ahead ((?=.*Bar)) to achieve that in grep:
echo "Bar=1; Fizz=2; Foo_Bar=3;" | grep -oP -m 1 'Bar=[ ./0-9a-zA-Z_-]+(?=.*Bar)'
First use a grep to make the line start with Bar, and then get the Bar at the start of the line:
grep -o "Bar=.*" input.txt | grep -o -m1 "^Bar=[ ./0-9a-zA-Z_-]\+"
When you have a large file, you can optimize with
grep -o -m1 "Bar=.*" input.txt | grep -o -m1 "^Bar=[ ./0-9a-zA-Z_-]\+"

command to count occurrences of word in entire file

I am trying to count the occurrences of a word in a file.
If word occurs multiple times in a line, I will count is a 1.
Following command will give me the output but will fail if line has multiple occurrences of word
grep -c "word" filename.txt
Is there any one liner?
You can use grep -o to show the exact matches and then count them:
grep -o "word" filename.txt | wc -l
Test
$ cat a
hello hello how are you
hello i am fine
but
this is another hello
$ grep -c "hello" a # Normal `grep -c` fails
3
$ grep -o "hello" a
hello
hello
hello
hello
$ grep -o "hello" a | wc -l # grep -o solves it!
4
Set RS in awk for a shorter one.
awk 'END{print NR-1}' RS="word" file
GNU awk allows it to be done in single command with use of multiple piped commands:
awk -v w="word" '$1==w{n++} END{print n}' RS=' |\n' file
cat file | cut -d ' ' | grep -c word
This assumes that all words in the file have spaces between the words. If there's punctuation concatenating the word to itself, or otherwise no spaces on a single line between the word and itself, they'll count as one.
grep word filename.txt | wc -l
grep prints the lines that match, then wc -l prints the number of lines matched

How to grep, excluding some patterns?

I'd like find lines in files with an occurrence of some pattern and an absence of some other pattern. For example, I need find all files/lines including loom except ones with gloom. So, I can find loom with command:
grep -n 'loom' ~/projects/**/trunk/src/**/*.#(h|cpp)
Now, I want to search loom excluding gloom. However, both of following commands failed:
grep -v 'gloom' -n 'loom' ~/projects/**/trunk/src/**/*.#(h|cpp)
grep -n 'loom' -v 'gloom' ~/projects/**/trunk/src/**/*.#(h|cpp)
What should I do to achieve my goal?
EDIT 1: I mean that loom and gloom are the character sequences (not necessarily the words). So, I need, for example, bloomberg in the command output and don't need ungloomy.
EDIT 2: There is sample of my expectations.
Both of following lines are in command output:
I faced the icons that loomed through the veil of incense.
Arty is slooming in a gloomy day.
Both of following lines aren't in command output:
It’s gloomyin’ ower terrible — great muckle doolders o’ cloods.
In the south west round of the heigh pyntit hall
How about just chaining the greps?
grep -n 'loom' ~/projects/**/trunk/src/**/*.#(h|cpp) | grep -v 'gloom'
Another solution without chaining grep:
egrep '(^|[^g])loom' ~/projects/**/trunk/src/**/*.#(h|cpp)
Between brackets, you exclude the character g before any occurrence of loom, unless loom is the first chars of the line.
A bit old, but oh well...
The most up-voted solution from #houbysoft will not work as that will exclude any line with "gloom" in it, even if it has "loom". According to OP's expectations, we need to include lines with "loom", even if they also have "gloom" in them. This line needs to be in the output "Arty is slooming in a gloomy day.", but this will be excluded by a chained grep like
grep -n 'loom' ~/projects/**/trunk/src/**/*.#(h|cpp) | grep -v 'gloom'
Instead, the egrep regex example of Bentoy13 works better
egrep '(^|[^g])loom' ~/projects/**/trunk/src/**/*.#(h|cpp)
as it will include any line with "loom" in it, regardless of whether or not it has "gloom". On the other hand, if it only has gloom, it will not include it, which is precisely the behaviour OP wants.
Just use awk, it's much simpler than grep in letting you clearly express compound conditions.
If you want to skip lines that contains both loom and gloom:
awk '/loom/ && !/gloom/{ print FILENAME, FNR, $0 }' ~/projects/**/trunk/src/**/*.#(h|cpp)
or if you want to print them:
awk '/(^|[^g])loom/{ print FILENAME, FNR, $0 }' ~/projects/**/trunk/src/**/*.#(h|cpp)
and if the reality is you just want lines where loom appears as a word by itself:
awk '/\<loom\>/{ print FILENAME, FNR, $0 }' ~/projects/**/trunk/src/**/*.#(h|cpp)
-v is the "inverted match" flag, so piping is a very good way:
grep "loom" ~/projects/**/trunk/src/**/*.#(h|cpp)| grep -v "gloom"
Simply use! grep -v multiple times.
#Content of file
[root#server]# cat file
1
2
3
4
5
#Exclude the line or match
[root#server]# cat file |grep -v 3
1
2
4
5
#Exclude the line or match multiple
[root#server]# cat file |grep -v "3\|5"
1
2
4
/*You might be looking something like this?
grep -vn "gloom" `grep -l "loom" ~/projects/**/trunk/src/**/*.#(h|cpp)`
The BACKQUOTES are used like brackets for commands, so in this case with -l enabled,
the code in the BACKQUOTES will return you the file names, then with -vn to do what you wanted: have filenames, linenumbers, and also the actual lines.
UPDATE Or with xargs
grep -l "loom" ~/projects/**/trunk/src/**/*.#(h|cpp) | xargs grep -vn "gloom"
Hope that helps.*/
Please ignore what I've written above, it's rubbish.
grep -n "loom" `grep -l "loom" tt4.txt` | grep -v "gloom"
#this part gets the filenames with "loom"
#this part gets the lines with "loom"
#this part gets the linenumber,
#filename and actual line
You can use grep -P (perl regex) supported negative lookbehind:
grep -P '(?<!g)loom\b' ~/projects/**/trunk/src/**/*.#(h|cpp)
I added \b for word boundaries.
grep -n 'loom' ~/projects/**/trunk/src/**/*.#(h|cpp) | grep -v 'gloom'
Question: search for 'loom' excluding 'gloom'.
Answer:
grep -w 'loom' ~/projects/**/trunk/src/**/*.#(h|cpp)

Resources