How can I generate multiple counts from a file without re-reading it multiple times? - bash

I have large files of HTTP access logs and I'm trying to generate hourly counts for a specific query string. Obviously, the correct solution is to dump everything into Splunk or Graylog or something, but I can't set all that up at the moment for this one-time deal.
The quick-and-dirty is:
for hour in 0{0..9} {10..23}
do
grep $QUERY $FILE | egrep -c "^\S* $hour:"
# or, alternately
# egrep -c "^\S* $hour:.*$QUERY" $FILE
# not sure which one's better
done
But these files average 15-20M lines, and I really don't want to parse through each file 24 times. It would be far more efficient to parse the file and count each instance of $hour in one go. Is there any way to accomplish this?

You can ask grep to output the matching part of each line with -o and then use uniq -c to count the results:
grep "$QUERY" "$FILE" | grep -o "^\S* [0-2][0-9]:" | sed 's/^\S* //' | uniq -c
The sed command is there to keep only the two digit hour and the colon, which you can also remove with another sed expression if you want.
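For illustration, here is how the pipeline behaves on a couple of hypothetical log lines (the timestamp layout is an assumption based on the regex above, not taken from the question):
$ cat access.log
10.0.0.1 13:02:11 GET /search?q=foo
10.0.0.2 13:05:42 GET /search?q=foo
10.0.0.3 14:01:00 GET /search?q=foo
$ QUERY=foo
$ grep "$QUERY" access.log | grep -o "^\S* [0-2][0-9]:" | sed 's/^\S* //' | uniq -c
      2 13:
      1 14:
Because the log is already in time order, uniq -c is enough; unordered input would need a sort in between.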
Caveats: this solution works with GNU grep and GNU sed, and will produce no output, rather than "0", for hours with no log entries. Kudos to @EdMorton for pointing out these issues in the comments, along with other issues that have since been fixed in the answer above.

Assuming the timestamp appears with a space before the 2-digit hour and a colon after it:
gawk -v patt="$QUERY" '
$0 ~ patt && match($0, / ([0-9][0-9]):/, m) {
print > (m[1] "." FILENAME)
}
' "$FILE"
This will create up to 24 files, one per hour that appears in the matching lines.
Requires GNU awk for the 3-arg form of match().
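Once those per-hour files exist, the hourly counts are just line counts. A minimal follow-up sketch, assuming FILE=access.log so the split files are named like 07.access.log:
wc -l [0-2][0-9]."$FILE"
Hours with no matching lines produce no file at all, so they simply won't appear in the wc output.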

This is probably what you really need, using GNU awk for the 3rd arg to match() and making assumptions about what your input might look like, what your QUERY variable might contain, and what the output should look like:
awk -v query="$QUERY" '
match($0, " ([0-9][0-9]):.*"query, a) { cnt[a[1]+0]++ }
END {
for (hr=0; hr<=23; hr++) {
printf "%02d = %d\n", hr, cnt[hr]
}
}
' "$FILE"
By the way, don't use all upper case for non-exported shell variable names - see Correct Bash and shell script variable capitalization.
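With the assumptions above, the output would look something like this (the counts are made up purely for illustration):
00 = 0
01 = 137
...
23 = 12
Every hour from 00 to 23 is printed, including the empty ones, which the grep/uniq approach above does not do.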

Related

How to find files containing a string N times or more often using egrep

I have a folder with about 400-500 SQL files and need the names of
only those that contain the string CREATE TABLE 3 times or more.
While the command
$ egrep -rl "(CREATE TABLE)" ./*.sql
prints me of course all file-names, the command
$ egrep -rl "(CREATE TABLE.*){3}" ./*.sql
does not print any at all ...
Flags:
-r – recursive
-l – files-with-matches | print only names of FILEs containing matches
Your command
egrep -rl "(CREATE TABLE.*){3}" ./*.sql
looks for 3 occurrences of CREATE TABLE on one line.
When they are on different lines, you need to do something different,
and if you have GNU grep, you are lucky: it has the option -z.
# minimal change of your command
egrep -zrl "(CREATE TABLE.*){3}" ./*.sql
# moving option E to the options, as suggested by @anubhava
grep -zErl "(CREATE TABLE.*){3}" ./*.sql
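If GNU grep is not available at all, here is a portable sketch, under the assumption that counting matching lines (rather than every individual occurrence) is good enough and that the file names contain no colons:
grep -c 'CREATE TABLE' ./*.sql | awk -F: '$NF >= 3 { sub(/:[0-9]+$/, ""); print }'
grep -c prints name:count when given more than one file, and the awk filter keeps only the names whose count is at least 3.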
This awk will do the job:
awk 'FNR==1{n=0} /CREATE TABLE/{++n} n>2{print FILENAME; nextfile}' *.sql
Could you please try the following. It also takes care of the number of open files in the background here.
awk 'prev!=FILENAME{n=""}/CREATE TABLE/{++n} n>2{print FILENAME;prev=FILENAME;nextfile}' *.sql
Assuming the possibility of having multiple matches per line (only covered by Walter A's answer), here is its awk version (for an awk that supports nextfile):
awk '(FNR==1){n=0}
{n+=split($0,a,/CREATE TABLE/)-1}
(n>2) {print FILENAME; nextfile}' *.sql
If you don't have GNU grep (needed for Walter A's solution) and you don't have an awk with nextfile either, the following solution can be used (POSIX):
awk '(FNR==1){n=0; p=1}
p {n+=split($0,a,/CREATE TABLE/)-1}
(n>2) && p {print FILENAME; p=0}' *.sql
The difference between the two solutions are:
Solution 1 will not process the full file, as it terminates early per file once the condition is met.
Solution 2 cannot do that, but it reduces the computational cost by skipping the split once the condition has been satisfied.
Try this Perl solution:
perl -le ' BEGIN { for(glob("*.sql")) { $x=qx(cat $_); $r++ for($x=~m/CREATE TABLE/g); print $_ if $r > 2 ; $r=0 } } '

How to find the nth multiline block of text using sed

So I have a file which contains blocks that looks as follows:
menuentry ... {
....
....
}
....
menuentry ... {
....
....
}
I need to look at the contents of each menu entry in a bash script. I have very limited experience with sed but through a very exhaustive search I was able to construct the following:
cat $file | sed '/^menuentry.*{/!d;x;s/^/x/;/x\{1\}/!{x;d};x;:a;n;/^}/!ba;q'
and I can replace the \{1\} with whatever number I want to get the nth block. This works fine like that, but the problem is I need to iterate through an arbitrary number of times:
numEntries=$( egrep -c "^menuentry.*{" $file )
for i in $( seq 1 $numEntries); do
i=$( echo $i | tr -d '\n' ) # A google search indicated sed encounters problems
# when a substituted variable has a trailing return char
# Get the nth entry block of text with the sed statement from before,
# but replace with variable $i
entry=$( cat $file | sed '/^menuentry.*{/!d;x;s/^/x/;/x\{$i\}/!{x;d};x;:a;n;/^}/!ba;q')
# Do some stuff with $entry #
done
I've tried every combination of quotes/double quotes and braces around the variable and around the sed statement, and every which way I do it I get some sort of error. As I said, I don't really know much about sed, and that Frankenstein of a statement is just what I managed to mish-mosh together from various Google searches, so any help or explanation would be much appreciated!
TIA
sed is for simple substitutions on individual lines, that is all, and shell loops to manipulate text are immensely slow and difficult to write robustly. The folks who invented sed and shell also invented awk for tasks like this:
awk -v RS= 'NR==3' file
would print the 3rd blank-line-separated block of text as shown in your question. This would print every block containing the string "foobar":
awk -v RS= '/foobar/' file
and anything else you might want to do is equally trivial.
The above will work efficiently, robustly and portably using any awk in any shell on any UNIX box. For example:
$ cat file
menuentry first {
Now is the Winter
of our discontent
}
menuentry second {
Wee sleekit cowrin
timrous beastie,
oh whit a panic's in
thy breastie.
}
menuentry third {
Twas the best of times
Twas the worst of times
Make up your damn mind
}
$ awk -v RS= 'NR==3' file
menuentry third {
Twas the best of times
Twas the worst of times
Make up your damn mind
}
$ awk -v RS= 'NR==2' file
menuentry second {
Wee sleekit cowrin
timrous beastie,
oh whit a panic's in
thy breastie.
}
$ awk -v RS= '/beastie/' file
menuentry second {
Wee sleekit cowrin
timrous beastie,
oh whit a panic's in
thy breastie.
}
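Since the question loops over every entry, here is a small sketch of driving that same awk from a loop (the variable names follow the question; nothing here is GNU-specific):
numEntries=$( awk -v RS= 'END{print NR}' "$file" )
for i in $( seq 1 "$numEntries" ); do
    entry=$( awk -v RS= -v n="$i" 'NR==n' "$file" )
    # Do some stuff with "$entry" #
done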
If you find yourself trying to do anything other than s/old/new with sed and/or using sed commands other than s, g and p (with -n) then you are using constructs that became obsolete in the mid-1970s when awk was invented.
If the above doesn't work for you then edit your question to provide more truly representative sample input and expected output.
How about splitting this work up?
First search for the lines where a menu entry starts (for personal reasons I use grub here, not grub2; adjust to your needs):
entrystarts=($(sed -n '/^menuentry.*/=' /boot/grub/grub.cfg))
Then, in a second step, choose a starting value from the array - ${entrystarts[$n]} for the n-th entry - and proceed from there. AFAIK the end of an entry is easy to detect by a single closing curly brace.
for i in "${entrystarts[@]}"
do
# your code here, proof of concept (note grub/grub2):
sed -n "$i,/}/p" /boot/grub/grub.cfg
done
I see the problem: you are trying to use $i inside single quotes, so it is left as the literal string $i. As in the script below, the fix is to close the single quotes around "$i" so that bash sees it as three strings:
'one string'"$twostrings"'three strings'
#!/bin/bash
#filename:grubtest.sh
numEntries=$( egrep -c "^menuentry.*{" "$1" )
i=0
while [ $i -lt $numEntries ]; do
i=$(($i+1))
# Get the nth entry block of text with the sed statement from before,
entry=$( cat "$1" | sed '/^menuentry.*{/!d;x;s/^/x/;/x\{'"$i"'\}/!{x;d};x;:a;n;/^}/!ba;q')
# Do some stuff with $entry #
echo $entry|cut -d\' -f2
done
This returned my grub items when run like so:
./grubtest.sh /etc/grub2.cfg

fast alternative to grep file multiple times?

I currently use long piped bash commands to extract data from text files like this, where $f is my file:
result=$(grep "entry t $t " $f | cut -d ' ' -f 5,19 | \
sort -nk2 | tail -n 1 | cut -d ' ' -f 1)
I use a script that might do hundreds of similar searches of $f, sorting selected lines in various ways depending on what I'm pulling out. I like one-line bash strings with a bunch of pipes because it's compact and easy, but it can take forever. Can anyone suggest a faster alternative? Maybe something that loads the whole file into memory first?
Thanks
You might get a boost by doing the whole pipeline with gawk, or another awk that has asorti:
contents="$(cat "$f")"
result="$(awk -vpattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
This reads "$f" into a variable, then uses a single awk command (well, gawk anyway) to do all the rest of the work. Here's how that works:
-vpattern="entry t $t": defines an awk variable named pattern that contains the shell variable t
$0 ~ pattern matches the current line against the pattern, if it matches we'll do the part in the braces, otherwise we skip it
matches[$5]=$19 adds an entry to an array (and creates the array if needed) where the key is the 5th field and the value is the 19th
END do the following block after all the input has been processed
asorti(matches,inds) sort the entries of matches such that the inds is an array holding the order of the keys in matches to get the values in sorted order
print inds[1] prints the index in matches (i.e., a $5 from before) associated with the lowest 19th field
<<<"$contents" have awk work on the value in the shell variable contents as though it were a file it was reading
Then you can just update the pattern for each, not have to read the file from disk each time and not need so many extra processes for all the pipes.
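For example, a minimal sketch of that reuse (the second pattern is hypothetical; only the first comes from the question):
contents="$(cat "$f")"
result1="$(awk -vpattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
result2="$(awk -vpattern="entry u $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
Each extra search costs one more awk process, but the file itself is read from disk only once.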
You'll have to benchmark to see if it's really faster or not though, and if performance is important you really should think about moving to a "proper" language instead of shell scripting.
Since you haven't provided sample input/output this is just a guess, and I'm only posting it because other answers have already been posted that you should not use - this may be what you want instead of that one-liner:
result=$(awk -v t="$t" '
BEGIN { regexp = "entry t " t " " }
$0 ~ regexp {
if ( ($6 > maxKey) || (maxKey == "") ) {
maxKey = $6
maxVal = $5
}
}
END { print maxVal }
' "$f")
I suspect your real performance issue, however, isn't that script but that you are running it and maybe others inside a loop that you haven't shown us. If so, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice and post a better example so we can help you.

Count number of grep occurrences and store it a variable

I want to do something like this - grep for a string in a particular file, store it in a variable and be able to print just the number of occurrences.
#!/bin/bash
count=$(grep *something* *somefile*| wc -l)
echo $count
This always gives a 0 value, when I know it should be more.
This is what I intend to do, but it's taking forever for the script to finish.
if egrep -iq "Android 6.0.1" $filename; then
count=$(egrep -ic "Android 6.0.1" $filename)
echo 'Operating System Version leaked number of times: '$count
fi
I have 7 other such if statements and I am running this for around 20 files.
Is there a more efficient way to make it faster?
grep has its own counting flag
-c, --count
Suppress normal output; instead print a count of matching lines for
each input file. With the -v, --invert-match option (see below), count
non-matching lines. (-c is specified by POSIX.)
count=$( grep -c 'match' file)
Note that the match part is quoted as well so if you use special characters they are not interpreted by the shell.
Also as stated in the excerpt from that man page multiple matches on a single line will be counted as a single match as it only counts matching lines:
$ echo "hello hello hello hello
> hello
> bye" | grep -c "hello"
2
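If what you actually need is the total number of occurrences rather than the number of matching lines, a common idiom (not part of the man page excerpt above) is grep -o, which prints each match on its own line, piped to wc -l:
$ echo "hello hello hello hello
> hello
> bye" | grep -o "hello" | wc -l
5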
A much more efficient approach would be to run Awk once.
awk -v patterns="foo,bar,baz" 'BEGIN { n=split(patterns, pats, ",") }
{ for (i=1; i<=n; ++i) if ($0 ~ pats[i]) ++hits[i] }
END { for (i=1; i<=n; ++i) printf("%8d%s\n", hits[i], pats[i]) }' list of files
For bonus points, format the output in machine-readable format (depending on where it ends up, JSON might be a good choice); and/or add the human-readable explanation for the significance of each hit to the END block.
If that's not what you want, just running grep -Eic once and discarding any zero counts would already improve your run time over grepping the file twice per pattern, which is your current worst case. (The pessimal situation is when only the last line of the file matches your pattern.)

getting the last opened file

input file:
wtf.txt|/Users/jaro/documents/inc/face/|
lol.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/twitter/|
lol.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/face/|
omg.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/linked/|
wtf.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/twitter/|
wtf.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/face/|
omg.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/linked/|
omg.txt|/Users/jaro/documents/inc/linked/|
The input file is the list of opened files (each opened file adds one line to the list). I want to get the last opened file in a given directory,
e.g.: get the last opened file in dir /Users/jaro/documents/inc/face/
output:
wtf.txt
This fetches the last line in the file whose second field is the desired folder name, and prints the first field.
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { f=$1 }
END { print f }' file
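If the directory needs to be a parameter rather than hard-coded, here is a small sketch of the same idea (the variable name dir is mine, not from the answer):
awk -F '\|' -v dir="/Users/jaro/documents/inc/face/" '$2 == dir { f=$1 } END { print f }' file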
To test whether the most recent file is also an existing file, I would use the shell to reverse the order with tac and perform the logic; skip the files in the wrong path, and the ones which don't exist, then print the first success and quit.
tac file |
while IFS='|' read -r basename path _; do
case $path in "/Users/jaro/documents/inc/face/") ;; *) continue;; esac
test -e "$path/$basename" || continue
echo "$basename"
break
done |
grep .
The final grep . is to produce an exit code which reflects whether or not the command was successful -- if it printed a file, it's okay; if none of the extracted files existed, return error.
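A sketch of how that exit code might then be consumed (wrapping the pipeline above in a function is just for readability and is not part of the original answer):
last_opened() {
    tac file |
    while IFS='|' read -r basename path _; do
        case $path in "/Users/jaro/documents/inc/face/") ;; *) continue;; esac
        test -e "$path/$basename" || continue
        echo "$basename"
        break
    done |
    grep .
}
if last=$(last_opened); then
    echo "last opened existing file: $last"
else
    echo "no matching file still exists" >&2
fi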
Below is my original answer, based on a plausible but apparently incorrect interpretation of your question.
Here is a quick attempt at finding the file with the newest modification time from the list. I avoid parsing ls, preferring instead to use properly machine-parseable output from stat. Since your input file is line-oriented, I assume no file names contain newlines, which simplifies things quite a bit.
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { print $2 $1 }' file |
sort -u |
xargs stat -f '%m %N' |
sort -rn |
awk -F '/' '{ print $NF; exit(0) }'
The first sort is to remove any duplicates, to avoid running stat more times than necessary (premature optimization, perhaps). The stat prefixes each line with the file's modification time expressed as seconds since the epoch, which facilitates easy numerical sorting by age, and the final Awk script neatly combines head -n 1 | rev | cut -d / -f1 | rev, i.e. it extracts just the basename from the first line of output, then quits.
If there is any way to use a less wacky input format, that would be an improvement (probably of your life in general as well).
The output format from stat is not properly standardized, but your question is tagged linux and osx so I assume GNU coreutils or BSD stat. If portability is desired, maybe look at find (which however may be overkill and/or not much better standardized across diverse platforms) or write a small Perl or Python script instead. (Well, Ruby too, I suppose, but personally, I'd go with Perl.)
perl -F'\|' -lane '{ $t{$F[0]} = (stat($F[1].$F[0]))[10]
if !defined $t{$F[0]} and $F[1] eq "/Users/jaro/documents/inc/face/" }
END { print ((sort { $t{$a} <=> $t{$b} } keys %t)[-1]) }' file
atime – The atime (access time) is the time when the data of a file was last accessed. Displaying the contents of a file or executing a shell script will update a file’s atime, for example. You can view the atime with the ls -lu command
http://www.techtrunch.com/linux/ctime-mtime-atime-linux-timestamps
So in your case, this will do the trick:
ls -lu /Users/jaro/documents/inc/face/
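Note that ls -lu on its own still sorts by name; to order by access time you would combine it with -t, for example (a sketch, not from the original answer):
ls -ltu /Users/jaro/documents/inc/face/ | head -n 2
With -t and -u together, ls sorts by access time, newest first, so the first entry after the "total" line is the most recently accessed file in that directory.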
