Count the number of grep occurrences and store it in a variable - bash

I want to do something like this - grep for a string in a particular file, store it in a variable and be able to print just the number of occurrences.
#!/bin/bash
count=$(grep *something* *somefile*| wc -l)
echo $count
This always gives a 0 value, when I know it should be more.
This is what I intend to do, but it's taking forever to finish the script.
if egrep -iq "Android 6.0.1" $filename; then
    count=$(egrep -ic "Android 6.0.1" $filename)
    echo 'Operating System Version leaked number of times: '$count
fi
I have 7 other such if statements and I am running this for around 20 files.
Is there a more efficient way to speed this up?

grep has its own counting flag
-c, --count
Suppress normal output; instead print a count of matching lines for
each input file. With the -v, --invert-match option (see below), count
non-matching lines. (-c is specified by POSIX.)
count=$(grep -c 'match' file)
Note that the match part is quoted as well, so if you use special characters they are not interpreted by the shell.
Also, as stated in the excerpt from the man page, multiple matches on a single line are counted as a single match, since only matching lines are counted:
$ echo "hello hello hello hello
> hello
> bye" | grep -c "hello"
2
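If you need the total number of occurrences rather than the number of matching lines, a common workaround (a sketch, assuming your grep supports the non-POSIX -o option, as GNU grep does) is to print each match on its own line and count those lines:
$ echo "hello hello hello hello
> hello
> bye" | grep -o "hello" | wc -l
5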

A much more efficient approach would be to run Awk once.
awk -v patterns="foo,bar,baz" 'BEGIN { n=split(patterns, pats, ",") }
{ for (i=1; i<=n; ++i) if ($0 ~ pats[i]) ++hits[i] }
END { for (i=1; i<=n; ++i) printf("%8d%s\n", hits[i], pats[i]) }' list of files
For bonus points, format the output in machine-readable format (depending on where it ends up, JSON might be a good choice); and/or add the human-readable explanation for the significance of each hit to the END block.
If that's not what you want, running grep -Eic and ditching any zero value would already improve your run time over grepping the file twice for each match in the worst case. (The pessimal situation would be when the last line and no other line matches your pattern.)
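For example, instead of the egrep -iq test followed by egrep -ic from the question, a single pass per pattern could look like this (a sketch reusing the question's pattern and message):
count=$(grep -Eic "Android 6.0.1" "$filename")
if [ "$count" -gt 0 ]; then
    echo "Operating System Version leaked number of times: $count"
fi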

Related

How can I generate multiple counts from a file without re-reading it multiple times?

I have large files of HTTP access logs and I'm trying to generate hourly counts for a specific query string. Obviously, the correct solution is to dump everything into splunk or graylog or something, but I can't set all that up at the moment for this one-time deal.
The quick-and-dirty is:
for hour in 0{0..9} {10..23}
do
    grep $QUERY $FILE | egrep -c "^\S* $hour:"
    # or, alternately
    # egrep -c "^\S* $hour:.*$QUERY" $FILE
    # not sure which one's better
done
But these files average 15-20M lines, and I really don't want to parse through each file 24 times. It would be far more efficient to parse the file and count each instance of $hour in one go. Is there any way to accomplish this?
You can ask grep to output the matching part of each line with -o and then use uniq -c to count the results:
grep "$QUERY" "$FILE" | grep -o "^\S* [0-2][0-9]:" | sed 's/^\S* //' | uniq -c
The sed command is there to keep only the two digit hour and the colon, which you can also remove with another sed expression if you want.
Caveats: this solution works with GNU grep and GNU sed, and will produce no output, rather than "0", for hours with no log entries. Kudos to @EdMorton for pointing these issues out in the comments, and other issues that were fixed in the answer above.
Assuming the timestamp appears with a space before the 2-digit hour, then a colon after
gawk -v patt="$QUERY" '
    $0 ~ patt && match($0, / ([0-9][0-9]):/, m) {
        print > (m[1] "." FILENAME)
    }
' "$FILE"
This will create 24 files.
Requires GNU awk for the 3-arg form of match()
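Once the per-hour files exist, the hourly counts are just line counts, e.g. (assuming $FILE is a plain filename and the files were created in the current directory):
wc -l [0-9][0-9]."$FILE"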
This is probably what you really need, using GNU awk for the 3rd arg to match() and making assumptions about what your input might look like, what your QUERY variable might contain, and what the output should look like:
awk -v query="$QUERY" '
    match($0, " ([0-9][0-9]):.*" query, a) { cnt[a[1]+0]++ }
    END {
        for (hr=0; hr<=23; hr++) {
            printf "%02d = %d\n", hr, cnt[hr]
        }
    }
' "$FILE"
By the way, don't use all upper case for non-exported shell variables - see Correct Bash and shell script variable capitalization.

How to extract a word from a grep result in shell?

Using the shell, I want to search for a substring and print it together with the word that follows it.
e.g. the logfile has the line "today is monday and this is:1234 so I am in."
if grep -q "this is:" ./logfile; then
    # here I want to print only the substring plus the next word, i.e. "this is:1234"
    # echo ???
fi
You can use sed with \1 to display the matched string in \(..\):
sed 's/.*\(this is:[0-9a-zA-Z]*\).*/\1/' logfile
EDIT: The above command is only fine for 1 line input.
When you have a file with more lines, you only want to print the lines that match:
sed -n 's/.*\(this is:[0-9a-zA-Z]*\).*/\1/p' logfile
When you have a large file and only want to see the first match, you could combine this command with head -1, but it is better to stop scanning/parsing after the first match. You can use q to quit, but you only want to quit after a match:
sed -n '/.*\(this is:[0-9a-zA-Z]*\).*/{s//\1/p;q}' logfile
You can use a regular expression with a look-behind, if you want only the next word:
$ grep --perl-regexp -o '(?<=(this is:))(\S+)' ./logfile
1234
If you want both, then just:
$ grep --perl-regexp -o 'this is:\S+' ./logfile
this is:1234
The -o option instructs grep to return only the matching part.
In the commands above, we assumed that a "word" is a sequence of non-space characters. You can adjust that according to your needs.
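For example, if the value after this is: is always numeric, the pattern can be tightened to digits only (still assuming a grep built with PCRE support):
$ grep -oP '(?<=this is:)[0-9]+' ./logfile
1234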
If you have a system with GNU extensions (but aren't certain it was compiled with optional PCRE support), consider:
if result=$(grep -E -m 1 -o 'this is:[^[:space:]]+' logfile); then
echo "value is: ${result#*:}"
fi
${varname#value} expands to the contents of varname, but with value stripped from the beginning if present. Thus, ${result#*:} strips everything up to the first colon in result.
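As a quick illustration of that expansion on its own (the sample value is made up):
result="this is:1234"
echo "${result#*:}"    # prints 1234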
However, this may not work on systems without the non-POSIX options -o or -m.
If you want to support non-GNU systems, awk is a tool worth considering: unlike answers requiring nonportable extensions (like grep -P), this should work on any modern platform (tested with GNU awk, recent BSD awk, and mawk; also, no warnings with gawk --posix --lint):
# note that the constant 8 is the length of "this is:"
# GNU awk has cleaner syntax, but trying to be portable here.
if value=$(awk '
    BEGIN { matched=0; } # by default, this will trigger END to exit as failure
    /this is:/ {
        match($0, /this is:([^[:space:]]+)/);
        print substr($0, RSTART+8, RLENGTH-8);
        matched=1; # tell END block to use zero exit status
        exit(0);   # stop processing remaining file contents, jump to END
    }
    END { if (matched == 0) { exit(1); } }
' ./logfile); then
    echo "Found value of $value"
else
    echo "Could not find the value in the file"
fi
You can look for everything up to, but not including the next space like this:
grep -Eo "this is:[^[:space:]]+" logfile
The [] introduces the set of characters you are looking for, and the ^ at the start complements the set, so here the match is any character that is not whitespace. The + says there must be at least one or more such characters.
The -E tells grep to use extended regular expressions and the -o means to only print the matched part.

Fast alternative to grepping a file multiple times?

I currently use long piped bash commands to extract data from text files like this, where $f is my file:
result=$(grep "entry t $t " $f | cut -d ' ' -f 5,19 | \
sort -nk2 | tail -n 1 | cut -d ' ' -f 1)
I use a script that might do hundreds of similar searches of $f, sorting selected lines in various ways depending on what I'm pulling out. I like one-line bash strings with a bunch of pipes because it's compact and easy, but it can take forever. Can anyone suggest a faster alternative? Maybe something that loads the whole file into memory first?
Thanks
You might get a boost by doing the whole pipe with gawk, or another awk that has asorti:
contents="$(cat "$f")"
result="$(awk -vpattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
This will read "$f" into a variable then we'll use a single awk command (well, gawk anyway) to do all the rest of the work. Here's how that works:
-vpattern="entry t $t": defines an awk variable named pattern built from the shell variable t
$0 ~ pattern matches the current line against the pattern, if it matches we'll do the part in the braces, otherwise we skip it
matches[$5]=$19 adds an entry to an array (and creates the array if needed) where the key is the 5th field and the value is the 19th
END: do the following block after all the input has been processed
asorti(matches,inds) sort the entries of matches such that the inds is an array holding the order of the keys in matches to get the values in sorted order
print inds[1] prints the index in matches (i.e., a $5 from before) associated with the lowest 19th field
<<<"$contents" have awk work on the value in the shell variable contents as though it were a file it was reading
Then you can just update the pattern for each, not have to read the file from disk each time and not need so many extra processes for all the pipes.
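A rough sketch of such a loop, assuming your searches differ only in the value of t (the values shown are made up):
contents="$(cat "$f")"    # read the file from disk once
for t in 1 2 3; do        # hypothetical values of t
    result="$(awk -vpattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
    echo "t=$t: $result"
done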
You'll have to benchmark to see if it's really faster or not though, and if performance is important you really should think about moving to a "proper" language instead of shell scripting.
Since you haven't provided sample input/output, this is just a guess, and I only post it because the other answers already posted take approaches you should not use. This may be what you want instead of that one-liner:
result=$(awk -v t="$t" '
    BEGIN { regexp = "entry t " t " " }
    $0 ~ regexp {
        if ( ($6 > maxKey) || (maxKey == "") ) {
            maxKey = $6
            maxVal = $5
        }
    }
    END { print maxVal }
' "$f")
I suspect your real performance issue, however, isn't that script but that you are running it and maybe others inside a loop that you haven't shown us. If so, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice and post a better example so we can help you.

get highest number then print next number in new file

I have a pipe-delimited file info.txt. Can you give me an idea how to get the highest suffix and add entries after it based on the pattern?
info="$HOME/info.txt"
echo "Input the pattern: "
read pattern
awk '/pattern/{ print $0 }' $info >> $HOME/temp1.$$
sed 's/MICRO_AU_FILE//g' $HOME/temp1.$$
##then count highest num but i think not good approach
##if got he highest num then print next number
for ACC_NUM in `cat acc`
do
echo "$pattern-FILE$Highestsufix|server|$ACC_NUM*| >> $HOME/tempfile.$$
cat $HOME/tempfile.$$ >> $info
done
fi
info.txt
MICRO_AU-FILE01|serve|12345
MICRO_AU-FILE02|serve|23456
MICRO_AU-FILE04|serve|34534
MICRO_PH-FILE01|serve|56457
MICRO_PH-FILE02|serve|12345
MICRO_BN-FILE01|serve|78564
MICRO_BN-FILE03|serve|45267
acc
11111
22222
output: if my pattern is MICRO_AU
MICRO_AU-FILE01|serve|12345
MICRO_AU-FILE02|serve|23456
MICRO_AU-FILE04|serve|34534
MICRO_PH-FILE01|serve|56457
MICRO_PH-FILE02|serve|12345
MICRO_BN-FILE01|serve|78564
MICRO_BN-FILE03|serve|45267
MICRO_AU-FILE05|serve|11111
MICRO_AU-FILE06|serve|22222
I would extract the suffixes, sort them ascending numerically, and take the highest one. If the input is as regular as in the example, this would be simply
HIGHEST_INDEX=$(grep "^$pattern" "$info" | cut -c 14,15 | sort -nr | head -n 1)
If the structure of the lines can vary, you would have to adapt the number selector (cut -c 14,15) according to your taste.
UPDATE: I just noticed that you have tagged your question with shell and not with bash, zsh, or ksh. If you need your program to also run in the Bourne shell, you have to use
HIGHEST_INDEX=`grep "^$pattern" "$info" | cut -c 14,15 | sort -nr | head -n 1`
In general, with this type of question it is best to state explicitly which shell(s) your program should run on. The more specific you are in this respect, the better the solution we can suggest. For example, getting the next higher number (after HIGHEST_INDEX) is more complicated in the Bourne shell than in the other ones.
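For instance, in bash (not plain Bourne sh), the next suffix and the new entries could be generated along these lines (a sketch based on the sample data, assuming the suffix is always the two digits at columns 14-15 and the account numbers are in the file acc):
highest=$(grep "^$pattern" "$info" | cut -c 14,15 | sort -nr | head -n 1)
next=$((10#$highest + 1))    # force base 10 so suffixes like 08 or 09 don't break
while read -r acc; do
    printf '%s-FILE%02d|serve|%s\n' "$pattern" "$next" "$acc" >> "$info"
    next=$((next + 1))
done < acc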

'grep +A': print everything after a match [duplicate]

This question already has answers here:
How to get the part of a file after the first line that matches a regular expression
(12 answers)
Closed 7 years ago.
I have a file that contains a list of URLs. It looks like below:
file1:
http://www.google.com
http://www.bing.com
http://www.yahoo.com
http://www.baidu.com
http://www.yandex.com
....
I want to get all the records after http://www.yahoo.com; the result should look like below:
file2:
http://www.baidu.com
http://www.yandex.com
....
I know that I could use grep to find the line number of where yahoo.com lies using
grep -n 'http://www.yahoo.com' file1
3:http://www.yahoo.com
But I don't know how to get the part of the file after line number 3. Also, I know grep has an -A flag to print the lines after your match. However, you need to specify how many lines you want after the match. I am wondering whether there is something to get around that issue, like:
Pseudocode:
grep -n 'http://www.yahoo.com' -A all file1 > file2
I know we could use the line number I got and wc -l to get the number of lines after yahoo.com, however... it feels pretty lame.
AWK
If you don't mind using AWK:
awk '/yahoo/{y=1;next}y' data.txt
This script has two parts:
/yahoo/ { y = 1; next }
y
The first part states that if we encounter a line with yahoo, we set the variable y=1 and then skip that line (the next command jumps to the next line, skipping any further processing of the current one). Without the next command, the yahoo line itself would be printed.
The second part is a short hand for:
y != 0 { print }
Which means: for each line, if the variable y is non-zero, we print that line. In AWK, if you refer to a variable, that variable is created and is either zero or the empty string, depending on context. Before encountering yahoo, the variable y is 0, so the script does not print anything. After encountering yahoo, y is 1, so every line after that will be printed.
Sed
Or, using sed, the following will delete everything up to and including the line with yahoo:
sed '1,/yahoo/d' data.txt
This is much easier done with sed than grep. sed can apply any of its one-letter commands to an inclusive range of lines; the general syntax for this is
START , STOP COMMAND
except without any spaces. START and STOP can each be a number (meaning "line number N", starting from 1); a dollar sign (meaning "the end of the file"), or a regexp enclosed in slashes, meaning "the first line that matches this regexp". (The exact rules are slightly more complicated; the GNU sed manual has more detail.)
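For example (the filename is just a placeholder):
sed -n '3,5p' file        # print only lines 3 through 5
sed -n '/foo/,$p' file    # print from the first line matching foo to the end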
So, you can do what you want like so:
sed -n -e '/http:\/\/www\.yahoo\.com/,$p' file1 > file2
The -n means "don't print anything unless specifically told to", and the expression given with -e means "from the first line that matches the regexp /http:\/\/www\.yahoo\.com/ to the end of the file, print."
This will include the line with http://www.yahoo.com/ on it in the output. If you want everything after that point but not that line itself, the easiest way to do that is to invert the operation:
sed -e '1,/http:\/\/www\.yahoo\.com/d' file1 > file2
which means "for line 1 through the first line matching the regexp /http:\/\/www\.yahoo\.com/, delete the line" (and then, implicitly, print everything else; note that -n is not used this time).
awk '/yahoo/ ? c++ : c' file1
Or golfed
awk '/yahoo/?c++:c' file1
Result
http://www.baidu.com
http://www.yandex.com
This is most easily done in Perl:
perl -ne 'print unless 1 .. m(http://www\.yahoo\.com)' file
In other words, print all lines that aren’t between line 1 and the first occurrence of that pattern.
Using this script:
# Get index of the "yahoo" word
index=`grep -n "yahoo" filepath | cut -d':' -f1`
# Get the total number of lines in the file
totallines=`wc -l filepath | cut -d' ' -f1`
# Subtract totallines with index
result=`expr $totallines - $index`
# Gives the desired output
grep -A $result "yahoo" filepath
