command to count occurrences of word in entire file - bash

I am trying to count the occurrences of a word in a file.
If the word occurs multiple times in a line, it gets counted as 1.
The following command gives me output but will fail if a line has multiple occurrences of the word:
grep -c "word" filename.txt
Is there any one liner?

You can use grep -o to show the exact matches and then count them:
grep -o "word" filename.txt | wc -l
Test
$ cat a
hello hello how are you
hello i am fine
but
this is another hello
$ grep -c "hello" a # Normal `grep -c` fails
3
$ grep -o "hello" a
hello
hello
hello
hello
$ grep -o "hello" a | wc -l # grep -o solves it!
4

Set RS in awk for a shorter one.
awk 'END{print NR-1}' RS="word" file
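For instance, on a file like the sample above (note that a multi-character RS is a GNU awk extension; mawk and other POSIX awks only honor its first character):

```shell
# Split records on the word itself; the number of records is then
# occurrences + 1, so NR-1 is the count (needs GNU awk for multi-char RS).
printf 'hello hello how are you\nhello i am fine\n' > a
awk 'END{print NR-1}' RS="hello" a    # prints 3 with GNU awk
```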

GNU awk allows it to be done in a single command, with no pipes, by using a regular expression as the record separator:
awk -v w="word" '$1==w{n++} END{print n}' RS=' |\n' file
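A quick sketch of how the regex RS splits the input into one token per record (GNU awk specific; the input here is made up):

```shell
# RS=' |\n' makes every space- or newline-separated token its own record,
# so $1 is the whole token and $1==w counts exact-word matches only.
printf 'word other\nword word words\n' |
  awk -v w="word" '$1==w{n++} END{print n}' RS=' |\n'
# prints 3 with GNU awk ("words" is not counted)
```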

cat file | tr ' ' '\n' | grep -c word
This splits the input into one word per line and assumes that all words in the file are separated by spaces. If punctuation concatenates the word to itself, or there is otherwise no space between the word and itself on a single line, the pair will count as one.

grep word filename.txt | wc -l
grep prints the lines that match, then wc -l prints the number of lines that matched. Note that this counts matching lines, not occurrences, so it has the same limitation as grep -c.
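A quick illustration of the difference (the file name f is just for the demo):

```shell
printf 'hello hello\nhello\nnope\n' > f
grep hello f | wc -l      # 2: matching lines; the double hit counts once
grep -o hello f | wc -l   # 3: one output line per occurrence
```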

Related

Shell Scripting: "grep -w" not to select words separated by "-"

I have 3 words.
abcd-1234
abcd-abcd
abcd
Is it possible to select/print the 3rd word "abcd" with grep -w or a similar command?
This should work:
grep '[a-zA-Z]'
More specific, matching only letters from the beginning:
echo "abcd-1234" | grep -o '^[a-zA-Z]*'
That should be good for the given examples. Regarding your comment, try this:
data.txt
abcd-1234
abcd-4678
abcd
abcd-as334s
abcd-abcd
cat data.txt | grep -ow '^[a-zA-Z]*' | sort -u
And why do you want to achieve this with -w when you can simply achieve it with -v (a.k.a. --invert-match):
grep -v "-" data.txt
Output:
abcd
OK, -w only matches entire words, but it treats a hyphen as a word boundary, so it does not exclude abcd-1234. If you don't want the hyphen, the best thing is to say exactly that (hence -v "-").
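A small demo of the two behaviors, using the sample words from the question:

```shell
printf 'abcd-1234\nabcd-abcd\nabcd\n' > data.txt
grep -cw 'abcd' data.txt   # 3: -w treats "-" as a word boundary, so every line matches
grep -v '-' data.txt       # prints only: abcd
```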

How to grep and match the first occurrence of a line?

Given the following content:
title="Bar=1; Fizz=2; Foo_Bar=3;"
I'd like to match the first occurrence of the Bar value, which is 1. Also, I don't want to rely on the surroundings of the word (like the double quote in front), because the pattern could be in the middle of the line.
Here is my attempt:
$ grep -o -m1 'Bar=[ ./0-9a-zA-Z_-]\+' input.txt
Bar=1
Bar=3
I've used -m/--max-count, which is supposed to stop reading the file after num matches, but it didn't work. Why doesn't this option work as expected?
I could mix with head -n1, but I wondering if it is possible to achieve that with grep?
grep is line-oriented, so it apparently counts matches in terms of lines when using -m[1]
- even if multiple matches are found on the line (and are output individually with -o).
While I wouldn't know how to solve the problem with grep alone (except with GNU grep's -P option - see anubhava's helpful answer), awk can do it (in a portable manner):
$ awk -F'Bar=|;' '{ print $2 }' <<<"Bar=1; Fizz=2; Foo_Bar=3;"
1
Use print "Bar=" $2, if the field name should be included.
Also note that the <<< method of providing input via stdin (a so-called here-string) is specific to Bash, Ksh, and Zsh; if POSIX compliance is a must, use echo "..." | awk ... instead.
[1] Options -m and -o are not part of the grep POSIX spec., but both GNU and BSD/OSX grep support them and have chosen to implement the line-based logic.
This is consistent with the standard -c option, which counts "selected lines", i.e., the number of matching lines:
grep -o -c 'Bar=[ ./0-9a-zA-Z_-]\+' <<<"Bar=1; Fizz=2; Foo_Bar=3;" yields 1.
Using the Perl-compatible regex flavor in GNU grep, you can use:
grep -oP '^(.(?!Bar=\d+))*Bar=\d+' <<< "Bar=1; Fizz=2; Foo_Bar=3;"
Bar=1
(.(?!Bar=\d+))* will match 0 or more of any characters that don't have Bar=\d+ pattern thus making sure we match first Bar=\d+
If intent is to just print the value after = then use:
grep -oP '^(.(?!Bar=\d+))*Bar=\K\d+' <<< "Bar=1; Fizz=2; Foo_Bar=3;"
1
You can use grep -P (assuming you are on GNU grep) and a positive lookahead ((?=.*Bar)) to achieve that in grep:
echo "Bar=1; Fizz=2; Foo_Bar=3;" | grep -oP -m 1 'Bar=[ ./0-9a-zA-Z_-]+(?=.*Bar)'
First use a grep to make the line start with Bar, and then get the Bar at the start of the line:
grep -o "Bar=.*" input.txt | grep -o -m1 "^Bar=[ ./0-9a-zA-Z_-]\+"
When you have a large file, you can optimize with
grep -o -m1 "Bar=.*" input.txt | grep -o -m1 "^Bar=[ ./0-9a-zA-Z_-]\+"
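Sketch of the two-stage pipeline on a made-up input.txt where the pattern sits mid-line:

```shell
printf 'x Bar=1; Fizz=2; Foo_Bar=3;\n' > input.txt
# The first grep trims everything before the first "Bar="; the second,
# now anchored with ^, stops at the first character outside the class.
grep -o "Bar=.*" input.txt | grep -o -m1 "^Bar=[ ./0-9a-zA-Z_-]\+"
# prints: Bar=1
```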

The wc -l gives wrong result

I got a wrong result from the wc -l command. After a long :( session of checking, I found the core of the problem; here is the simulation:
$ echo "line with end" > file
$ echo -n "line without end" >>file
$ wc -l file
1 file
There are two lines here, but the last "\n" is missing. Any easy solution?
For wc, a line is whatever ends with the "\n" character. One solution is grep-ing the lines: grep does not look for the ending NL.
e.g.
$ grep -c . file #count the occurrence of any character
2
the above will not count empty lines. If you want them counted, use this:
$ grep -c '^' file #count the beginnings of the lines
2
from man page of wc
-l, --lines
print the newline counts
from man page of echo
-n do not output the trailing newline
so you have 1 newline in your file and thus wc -l shows 1.
You can use the following awk command to count lines
awk 'END{print NR}' file
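The difference is easy to reproduce: awk counts the unterminated final line as a record, while wc -l only counts "\n" characters.

```shell
printf 'line with end\nline without end' > file
wc -l < file              # 1: only one "\n" in the file
awk 'END{print NR}' file  # 2: the incomplete last line is still a record
```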

bash echo number of lines of file given in a bash variable without the file name

I have the following three constructs in a bash script:
NUMOFLINES=$(wc -l $JAVA_TAGS_FILE)
echo $NUMOFLINES" lines"
echo $(wc -l $JAVA_TAGS_FILE)" lines"
echo "$(wc -l $JAVA_TAGS_FILE) lines"
And they all produce identical output when the script is run:
121711 /home/slash/.java_base.tag lines
121711 /home/slash/.java_base.tag lines
121711 /home/slash/.java_base.tag lines
I.e. the name of the file is also echoed (which I don't want). Why do these scriptlets fail, and how should I output a clean:
121711 lines
?
An Example Using Your Own Data
You can avoid having your filename embedded in the NUMOFLINES variable by using redirection from JAVA_TAGS_FILE, rather than passing the filename as an argument to wc. For example:
NUMOFLINES=$(wc -l < "$JAVA_TAGS_FILE")
Explanation: Use Pipes or Redirection to Avoid Filenames in Output
The wc utility will not print the name of the file in its output if input is taken from a pipe or redirection operator. Consider these various examples:
# wc shows filename when the file is an argument
$ wc -l /etc/passwd
41 /etc/passwd
# filename is ignored when piped in on standard input
$ cat /etc/passwd | wc -l
41
# unusual redirection, but wc still ignores the filename
$ < /etc/passwd wc -l
41
# typical redirection, taking standard input from a file
$ wc -l < /etc/passwd
41
As you can see, the only time wc will print the filename is when it's passed as an argument, rather than as data on standard input. In some cases, you may want the filename to be printed, so it's useful to understand when it will be displayed.
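Putting it together with a throwaway file (the temp file is a stand-in for "$JAVA_TAGS_FILE"):

```shell
f=$(mktemp)                          # stand-in for "$JAVA_TAGS_FILE"
printf 'a\nb\nc\n' > "$f"
wc -l "$f"                           # count plus filename
NUMOFLINES=$(( $(wc -l < "$f") ))    # redirected input: no filename; $(( )) trims BSD wc's padding
echo "$NUMOFLINES lines"             # 3 lines
rm -f "$f"
```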
wc can't get the filename if you don't give it one.
wc -l < "$JAVA_TAGS_FILE"
You can also use awk:
awk 'END {print NR,"lines"}' filename
Or
awk 'END {print NR}' filename
(works on Mac, and probably other Unixes)
Actually there is a problem with the wc approach: it does not count the last line if it does not terminate with the end of line symbol.
Use this instead
nbLines=$(cat -n file.txt | tail -n 1 | cut -f1 | xargs)
or even better (thanks gniourf_gniourf):
nblines=$(grep -c '' file.txt)
Note: The awk approach by chilicuil also works.
It's very simple:
NUMOFLINES=$(cat $JAVA_TAGS_FILE | wc -l )
or
NUMOFLINES=$(wc -l $JAVA_TAGS_FILE | awk '{print $1}')
I normally use the 'back tick' feature of bash
export NUM_LINES=`wc -l filename`
Note the 'tick' is the 'back tick' e.g. ` not the normal single quote

Count number of occurrences of a specific regex on multiple files

I am trying to write up a bash script to count the number of times a specific pattern matches on a list of files.
I've googled for solutions but I've only found solutions for single files.
I know I can use egrep -o PATTERN file, but how do I generalize to a list of files and out the sum at the end?
EDIT: Adding the script I am trying to write:
#! /bin/bash
egrep -o -c "\s*assert.*;" $1 | awk -F: '{sum+=$2} END{print sum}'
Running egrep directly on the command line works fine, but within a bash script it doesn't. Do I have to escape the regex specially?
You could use grep -c to count the matches within each file, and then use awk at the end to sum up the counts, e.g.:
grep -c PATTERN * | awk -F: '{sum+=$2} END{print sum}'
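For example, with two throwaway files (names are illustrative; note grep -c counts matching lines, not occurrences, as a later answer points out):

```shell
d=$(mktemp -d)
printf 'foo\nbar foo\n' > "$d/one"
printf 'foo\n'          > "$d/two"
# grep -c emits "file:count" per file; awk splits on ":" and sums column 2.
grep -c foo "$d"/* | awk -F: '{sum+=$2} END{print sum}'   # prints 3
rm -r "$d"
```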
grep -o <pattern> file1 [file2 .. | *] |
uniq -c
If you want the total only:
grep -o <pattern> file1 [file2 .. | *] | wc -l
Edit: The sort seems unnecessary.
The accepted answer has a problem in that grep -c will count a line as 1 even if the PATTERN appears more than once on it. Besides, one awk command does the whole job:
awk 'BEGIN{FS="PATTERN"} NF{n+=NF-1} END{print n+0}' file
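A portable per-occurrence alternative is to let gsub do the counting: it returns the number of substitutions it performs on each line (the pattern foo and file name here are made up):

```shell
printf 'foo foofoo\nbar\nfoo\n' > f
# gsub replaces every non-overlapping match and returns how many it made;
# n+0 forces a numeric 0 even when there are no matches at all.
awk '{n += gsub(/foo/, "")} END{print n+0}' f   # prints 4
```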
