Parsing java logs for multiline entries using bash - bash

I have loads of Java logs on a Linux machine and I'm trying to find a grep expression or something else (perl, awk) that gives me the entire log entry when the match is somewhere in its body. Logstash looks like it could do the job, but something using only on-board tools would be way better.
An example should help best. Here is a sample log with 5 different entries:
25 Aug 2016 14:00:46,435 DEBUG [User][IP][rsc] An error occurred
java.Exception: Foo1
at xyz
25 Aug 2016 14:00:46,436 Foo2 [User][IP][rsc] Some error occured
25 Aug 2016 14:00:46,436 DEBUG [User][IP][rsc] Somethin occured Foo3
25 Aug 2016 14:18:18,224 XYZ [User][IP][rsc] Some problems
More: bla1
More: bla2
USER.bla.bla: Blala::123 - 456
More: Could not open something
at 567
at 890
Caused by: Foo4: Could not open another thing
at 123
at 456
... 127 more
Caused by: gaga
at a1a2a3
at b3b3b3
... 146 more
25 Aug 2016 14:18:20,118 SSO [User][IP][rsc] Process: error -
Could not Foo5
<here is a blank line>
When I search for "Foo1", I need:
25 Aug 2016 14:00:46,435 DEBUG [User][IP][rsc] An error occurred
java.Exception: Foo1
at xyz
When I search for "Foo2":
25 Aug 2016 14:00:46,436 Foo2 [User][IP][rsc] Some error occured
For "Foo3":
25 Aug 2016 14:00:46,436 DEBUG [User][IP][rsc] Somethin occured Foo3
For "Foo4":
25 Aug 2016 14:18:18,224 XYZ [User][IP][rsc] Some problems
More: bla1
More: bla2
USER.bla.bla: Blala::123 - 456
More: Could not open something
at 567
at 890
Caused by: Foo4: Could not open another thing
at 123
at 456
... 127 more
Caused by: gaga
at a1a2a3
at b3b3b3
... 146 more
And finally for "Foo5":
25 Aug 2016 14:18:20,118 SSO [User][IP][rsc] Process: error -
Could not Foo5
When I search for "Foo", everything should be returned.
Is something like this possible? Maybe even as a one-liner?
I would like to use it in a Webmin Custom Commands module where I supply the expression via a variable.
The only basic idea I have at the moment is to search for the expression and use the "[" as a pattern to identify where a new entry begins.
Thanks in advance to anybody who has an idea!

A sed solution, good for environments where awk is not available. The same sed command is shown in one-liner and multiline forms:
pat=$1
# oneliner form
#sed -nr '/^[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} /!{H; $!b}; x; /'"$pat"'/p; ${g; /^[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} /!q; /'"$pat"'/p }'
# multiline form
sed -nr '
  /^[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} /!{H; $!b}
  x
  /'"$pat"'/p
  ${
    g
    /^[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} /!q
    /'"$pat"'/p
  }'
The script uses a timestamp at the beginning of a line as the record start. Non-timestamp lines, i.e. the record body, are accumulated in the hold space; the hold space and pattern space are swapped on each record start, and the record is printed if the pattern matches. A special case handles a record that starts on the last line: it has to be re-fetched from the hold space and tested separately for a pattern match. The shell quoting is needed to splice the bash variable pat into the sed command.
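A quick self-contained check of the one-liner against a shortened two-entry sample (the file name app.log is only for the demo):

```shell
# Build a two-entry sample log and search it for Foo1
cat > app.log <<'EOF'
25 Aug 2016 14:00:46,435 DEBUG [User][IP][rsc] An error occurred
java.Exception: Foo1
at xyz
25 Aug 2016 14:00:46,436 Foo2 [User][IP][rsc] Some error occured
EOF

pat=Foo1
# Same sed program as above; prints the whole multiline entry containing Foo1
sed -nr '/^[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} /!{H; $!b}; x; /'"$pat"'/p; ${g; /^[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} /!q; /'"$pat"'/p }' app.log
```

This prints the three-line Foo1 entry; searching for Foo2 instead prints just the single-line entry.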

I set the awk record separator RS to the timestamp pattern, so each multiline log entry becomes a single record:
pat=$1
awk -vpat="$pat" '
BEGIN{
RS="[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} "
}
$0 ~ pat {printf("%s%s", prt, $0)}
{prt=RT}
'
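Note that a regular-expression RS and the RT variable are GNU awk extensions, so this needs gawk rather than mawk or BusyBox awk. A quick check against a shortened sample (the file name app.log is only for the demo):

```shell
cat > app.log <<'EOF'
25 Aug 2016 14:00:46,435 DEBUG [User][IP][rsc] An error occurred
java.Exception: Foo1
at xyz
25 Aug 2016 14:00:46,436 Foo2 [User][IP][rsc] Some error occured
EOF

# RT holds the text that matched RS, i.e. the timestamp that opens the record,
# so each record is printed together with the timestamp that preceded it
gawk -v pat=Foo2 '
BEGIN { RS = "[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} " }
$0 ~ pat { printf("%s%s", prt, $0) }
{ prt = RT }
' app.log
```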

Related

Cut Mod_Security ID with sed/awk

I would like to cut out the numbers between the quotes of a Mod_sec ID: [id "31231"]. Generally it is not difficult at all, but I am trying to extract all IDs from multiple reports such as:
[Wed Oct 19 15:31:33.460342 2016] [:error] [pid 16526] [client 67.22.202.121] ModSecurity: Access denied with code 400 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/usr/local/apache/conf/includes/mod_security2.conf"] [line "4968"] [id "000784"] [hostname "example.org"] [uri "/"] [unique_id "WAfYJU1ol#MAAECO#HQAAAAI"]
[Wed Mar 19 15:31:33.460342 2016] [:error] [pid 16526] [client 67.22.202.121] ModSecurity: Access denied with code 400 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/usr/local/apache/conf/includes/mod_security2.conf"] [line "4968"] [id "9"] [hostname "example.org"] [uri "/"] [unique_id "WAfYJU1ol#MAAECO#HQAAAAI"]
[Wed Mar 19 15:31:33.460342 2016] [:error] [pid 16526] [client 67.22.202.121] ModSecurity: Access denied with code 400 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/usr/local/apache/conf/includes/mod_security2.conf"] [line "4968"] [id "00263"] [hostname "example.org"] [uri "/"] [unique_id "WAfYJU1ol#MAAECO#HQAAAAI"]
I have attempted several commands such as:
cat asd | awk '/\[id\ "/,/"]/{print}'
cat asd | sed -n '/[id "/,/"]/p'
and many others, but they do not print just the required IDs; they include additional output because the pattern is matched several times. Generally I am able to do something like:
cat asd | egrep -o "\"[0-9][0-9][0-9][0-9][0-9][0-9]\""
and then cut the output again, but this does not work in cases where the ID does not contain 6 digits.
I am not familiar with all the options of awk, sed and egrep and cannot seem to find a solution.
What I would like to be printed from above history is:
000784
9
00263
Could someone please help. Thank you in advance.
With sed:
sed -n 's/.*\[id "\([^"]*\)"].*/\1/p'
You need to consume everything before [id and after your token.
The opening square bracket has to be escaped.
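For instance, fed a sample line (shortened here for readability):

```shell
# The greedy .* eats everything up to [id ", the group keeps the digits
echo '[line "4968"] [id "000784"] [hostname "example.org"]' \
  | sed -n 's/.*\[id "\([^"]*\)"].*/\1/p'
```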
With grep, if the PCRE option (-P) is available:
$ grep -oP 'id "\K\d+' asd
000784
9
00263
id "\K matches id " but excludes it from the output (\K resets the start of the reported match)
\d+ the digits following id "
With sed
$ sed -nE 's/.*id "([0-9]+).*/\1/p' asd
000784
9
00263
.*id " matches up to id "
([0-9]+) capture group saving the needed digits
.* the rest of the line
\1 the entire line is replaced with only the captured string
The ids are accessible in the 6th awk field when the double quote is used as a custom field separator (the separator regex "|" is just " twice, so -F '"' works equally well):
$ awk -F '"|"' '{print $6}' file
000784
9
00263

Collect info from multiple lines

I need to extract certain info from multiple lines (5 lines every transaction) and make the output as csv file. These lines are coming from a maillog wherein every transaction has its own transaction id. Here's one sample transaction:
Nov 17 00:15:19 server01 sm-mta[14107]: tAGGFJla014107: from=<sender#domain>, size=2447, class=0, nrcpts=1, msgid=<201511161615.tAGGFJla014107#server01>, proto=ESMTP, daemon=MTA, tls_verify=NONE, auth=NONE, relay=[100.24.134.19]
Nov 17 00:15:19 server01 flow-control[6033]: tAGGFJla014107 accepted
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - virus.McAfee: CLEAN - Declaration for Shared Parental Leave Allocation System
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - mtaqid=tAGGFJla014107, msgid=<201511161615.tAGGFJla014107#server01>, from=<sender#domain>, size=2488, to=<recipient#domain>, relay=[100.24.134.19], disposition=Deliver
Nov 17 00:15:20 server01 sm-mta[14240]: tAGGFJla014107: to=<recipient#domain>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=122447, relay=relayserver.domain. [100.91.20.1], dsn=2.0.0, stat=Sent (tAGGFJlR021747 Message accepted for delivery)
What I tried: I joined these 5 lines into one line and used awk to parse each column; unfortunately, the column count is not uniform.
I'm looking to get the date/time (line 1, columns 1-3), sender, recipient, and subject (line 3, the words after "CLEAN -" to the end of the line).
Preferably sed or awk in bash.
Thanks!
Explanation: file is your file.
The script initializes id and block to empty strings. On the first line, id takes the value of field 6 (the queue id, with any trailing colon stripped so it also matches the lines that lack the colon) and block starts with that line. Every following line that matches id is appended to block; as soon as a line doesn't match, block is printed and both variables are reinitialized. The END rule flushes the last block.
awk '{if (id=="") {id=$6; sub(/:$/,"",id); block=$0} else if ($0~id) block=block ORS $0; else {print block; id=$6; sub(/:$/,"",id); block=$0}} END {if (block!="") print block}' file
Then you're going to have to process each line of the output.
There are many ways to approach this. Here is one example calling a simple script and passing the log filename as the first argument. It will parse the requested data and save the data separated into individual variables. It simply prints the results at the end.
#!/bin/bash
[ -r "$1" ] || {  ## validate input file readable
    printf "error: invalid argument, file not readable '%s'\n" "$1"
    exit 1
}

while read -r line; do
    ## set date and sender from line containing from=<...>
    if grep -q 'from=<' <<<"$line"; then
        dt=$(cut -c -15 <<<"$line")
        from=$(grep -o 'from=<[a-zA-Z0-9]*#[a-zA-Z0-9]*>' <<<"$line")
        sender=${from##*<}
        sender=${sender%>*}
    fi
    ## set subject from line containing CLEAN
    if grep -q 'CLEAN' <<<"$line"; then
        subject=$(grep -o 'CLEAN.*$' <<<"$line")
        subject="${subject#*CLEAN - }"
    fi
    ## set recipient from line containing to=<...>
    if grep -q 'to=<' <<<"$line"; then
        to=$(grep -o 'to=<[a-zA-Z0-9]*#[a-zA-Z0-9]*>' <<<"$line")
        to=${to##*<}
        to=${to%>*}
    fi
done < "$1"

printf " date : %s\n from : %s\n to : %s\n subject: \"%s\"\n" \
    "$dt" "$sender" "$to" "$subject"
Input
$ cat dat/mail.log
Nov 17 00:15:19 server01 sm-mta[14107]: tAGGFJla014107: from=<sender#domain>, size=2447, class=0, nrcpts=1, msgid=<201511161615.tAGGFJla014107#server01>, proto=ESMTP, daemon=MTA, tls_verify=NONE, auth=NONE, relay=[100.24.134.19]
Nov 17 00:15:19 server01 flow-control[6033]: tAGGFJla014107 accepted
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - virus.McAfee: CLEAN - Declaration for Shared Parental Leave Allocation System
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - mtaqid=tAGGFJla014107, msgid=<201511161615.tAGGFJla014107#server01>, from=<sender#domain>, size=2488, to=<recipient#domain>, relay=[100.24.134.19], disposition=Deliver
Nov 17 00:15:20 server01 sm-mta[14240]: tAGGFJla014107: to=<recipient#domain>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=122447, relay=relayserver.domain. [100.91.20.1], dsn=2.0.0, stat=Sent (tAGGFJlR021747 Message accepted for delivery)
Output
$ bash parsemail.sh dat/mail.log
date : Nov 17 00:15:19
from : sender#domain
to : recipient#domain
subject: "Declaration for Shared Parental Leave Allocation System"
Note: if your from/sender is not always going to be in the first line, you can simply move those lines out from under the test clause. Let me know if you have any questions.
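Since the question asks for a csv file and prefers awk, here is a one-pass awk sketch along the same lines. The markers (from=<, to=<, CLEAN - ) are taken from the sample log; the column order and the quoting of the subject are assumptions, and one transaction per file is assumed, as in the sample:

```shell
# mail.log is created here only to make the demo self-contained
cat > mail.log <<'EOF'
Nov 17 00:15:19 server01 sm-mta[14107]: tAGGFJla014107: from=<sender#domain>, size=2447, class=0, nrcpts=1, msgid=<201511161615.tAGGFJla014107#server01>, proto=ESMTP, daemon=MTA, tls_verify=NONE, auth=NONE, relay=[100.24.134.19]
Nov 17 00:15:19 server01 flow-control[6033]: tAGGFJla014107 accepted
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - virus.McAfee: CLEAN - Declaration for Shared Parental Leave Allocation System
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - mtaqid=tAGGFJla014107, msgid=<201511161615.tAGGFJla014107#server01>, from=<sender#domain>, size=2488, to=<recipient#domain>, relay=[100.24.134.19], disposition=Deliver
Nov 17 00:15:20 server01 sm-mta[14240]: tAGGFJla014107: to=<recipient#domain>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=122447, relay=relayserver.domain. [100.91.20.1], dsn=2.0.0, stat=Sent (tAGGFJlR021747 Message accepted for delivery)
EOF

awk '
/from=</ { dt = substr($0, 1, 15)                  # date/time: first 15 chars
           if (match($0, /from=<[^>]+>/))
               from = substr($0, RSTART + 6, RLENGTH - 7) }
/CLEAN - / { subject = $0; sub(/.*CLEAN - /, "", subject) }
/to=</   { if (match($0, /to=<[^>]+>/))
               to = substr($0, RSTART + 4, RLENGTH - 5) }
END { printf "%s,%s,%s,\"%s\"\n", dt, from, to, subject }
' mail.log
```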

Using sed to extract a substring in curly brackets

I've currently got a string as below:
integration#{Wed Nov 19 14:17:32 2014} branch: thebranch
This is contained in a file, and I parse the string. However, I want the value between the braces: {Wed Nov 19 14:17:32 2014}
I have zero experience with sed, and to be honest I find it a little cryptic.
So far I've managed to use the following command, however the output is still the entire string.
What am I doing wrong?
sed -e 's/[^/{]*"\([^/}]*\).*/\1/'
To get the value between { and }:
$ sed 's/^[^{]*{\([^{}]*\)}.*/\1/' file
Wed Nov 19 14:17:32 2014
This is very simple to do with awk; no complicated regex needed.
awk -F"{|}" '{print $2}' file
Wed Nov 19 14:17:32 2014
It sets the field separator to { or }, then your data will be in the second field.
FS could also be set like this:
awk -F"[{}]" '{print $2}' file
To see all field:
awk -F"{|}" '{print "field#1="$1"\nfield#2="$2"\nfield#3="$3}' file
field#1=integration#
field#2=Wed Nov 19 14:17:32 2014
field#3= branch: thebranch
This might work
sed -e 's/[^{]*{\([^}]*\)}.*/\1/g'
Test
$ echo "integration#{Wed Nov 19 14:17:32 2014} branch: thebranch" | sed -e 's/[^{]*{\([^}]*\)}.*/\1/g'
Wed Nov 19 14:17:32 2014
Regex
[^{]* matches anything other than {, that is integration#
{ matches the literal {
\([^}]*\) capture group 1: anything other than }, that is Wed Nov 19 14:17:32 2014
} matches the literal }
.* matches the rest of the line
More simply, the command below also extracts the date:
echo "integration#{Wed Nov 19 14:17:32 2014} branch: thebranch" | sed 's/.*{\(.*\)}.*/\1/g'
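If the line is already in a shell variable, plain bash parameter expansion can also do it, without sed at all; a sketch assuming exactly one {...} group per line:

```shell
str='integration#{Wed Nov 19 14:17:32 2014} branch: thebranch'
tmp=${str#*\{}       # drop everything through the first {
echo "${tmp%%\}*}"   # drop everything from the first } on
```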

BASH grep with multiple parameters + n lines after one of the matches

I have a bunch of text as the output of a command. I need to display only specific matching lines, plus some additional lines after the match "Message:" (the message text is obviously longer than 1 line).
what I tried was:
grep -e 'Subject:' -e 'Date:' -A50 -e 'Message:'
but it included 50 lines after EACH match, and I need that to apply to only a single pattern. How would I do that?
code with output command:
<...> | telnet <mailserver> 110 | grep -e 'Subject:' -e 'Date:' -A50 -e 'Message:'
Part of the telnet output:
Date: Tue, 10 Sep 2013 16
Message-ID: <00fb01ceae25$
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_NextPart_000_00FC_01CEAE3E.DE32CE40"
X-Mailer: Microsoft Office Outlook 12.0
Thread-Index: Ac6uJWYdA3lUzs1cT8....
Content-Language: lt
X-Mailman-Approved-At: Tue, 10 Sep 2013 16:0 ....
Subject: ...
X-BeenThere: ...
Precedence: list
Try the following; you will need to dump the telnet output to a file first:
... | telnet ... > <file>
grep -e 'Subject:' -e 'Date:' <file> && grep -A50 -e 'Message:' <file>
This can be done with awk as well, without the need for dumping output to a file.
... | telnet ... | awk '/Date:/ {print}; /Subject:/ {print}; /Message:/ {c=50} c && c--'
With grep alone it would be hard to do; better to use awk for this:
awk '/Subject:|Date:/; /Message:/ {l=0; while (l<50) {print; l++; if (getline <= 0) break}}'
Here awk prints the Message: line and the lines below it, while only one line is printed for each of the other patterns.
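The counter idiom is easy to verify on a small sample; here the counter is set to 3 instead of 50, so the Message: line plus the two lines after it are printed (the file name is arbitrary):

```shell
cat > headers.txt <<'EOF'
Date: Tue, 10 Sep 2013 16
Subject: hello
Message: body starts
line one
line two
line three
EOF

# c is set on the Message: line; "c && c--" prints while the counter is positive
awk '/Date:/ {print}; /Subject:/ {print}; /Message:/ {c=3} c && c--' headers.txt
```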

Bash, retrieving two sets of particular strings from across a text file

Consider the example:
Feb 14 26:00:01 randomtext here mail from user10#mailbox.com more random text
Feb 15 25:08:82 randomtext random text mail from user8#mailbox.com more random text
Jan 20 26:23:89 randomtext iortest test test mail from user6#mailbox.com more random
Mar 15 18:23:01 randomtext here mail from user4#mailbox.com more random text
Jun 15 20:04:01 randomtext here mail from user10#mailbox.com more random text
Using bash I am trying to retrieve the first part of the timestamp, for example '26' and '25', and the email address of the user, for example 'user10#mailbox.com'.
The output would then roughly look like:
26 user10#mailbox.com
25 user8#mailbox.com
26 user6#mailbox.com
18 user4#mailbox.com
20 user10#mailbox.com
I have tried using:
cat myfile | grep -o '[0-9][0-9].*.com'
but it gives me excess text in the middle.
How would I go about retrieving just the two strings I need?
Use sed with capture groups to select the parts you want.
sed 's/^.* \([0-9][0-9]\):.* mail from \(.*#.*\.com\).*/\1 \2/' myfile
^ = beginning of line
.* = any sequence of characters followed by space
\([0-9][0-9]\): = 2 digits followed by a colon. The digits will be saved in capture group #1
.* mail from = any sequence up to a space followed by mail from and another space
\(.*#.*\.com\) = any sequence followed by # followed by any sequence up to .com. This will be saved in capture group #2
.* = any sequence; this will match the rest of the line
Everything this matches (the whole line) will be replaced by capture group #1, a space, and capture group #2.
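Run against a sample file built from the question's data, this prints the requested pairs:

```shell
cat > myfile <<'EOF'
Feb 14 26:00:01 randomtext here mail from user10#mailbox.com more random text
Feb 15 25:08:82 randomtext random text mail from user8#mailbox.com more random text
EOF

sed 's/^.* \([0-9][0-9]\):.* mail from \(.*#.*\.com\).*/\1 \2/' myfile
```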
Try
cat myfile | awk '{for (i=1; i<=NF; i++) if ($i ~ /#/) print $3, $i}' | sed 's/:[0-9][0-9]//g'
Disclaimer: my awk skills are rusty - there should be a way to do this solely in awk without resorting to sed.
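Indeed it can be done solely in awk, using POSIX match() and substr(); a sketch assuming the address always follows the literal text "mail from":

```shell
cat > myfile <<'EOF'
Feb 14 26:00:01 randomtext here mail from user10#mailbox.com more random text
Feb 15 25:08:82 randomtext random text mail from user8#mailbox.com more random text
EOF

# $3 is the timestamp; match() sets RSTART/RLENGTH, and "mail from " is 10 chars
awk 'match($0, /mail from [^ ]+/) {
    print substr($3, 1, 2), substr($0, RSTART + 10, RLENGTH - 10)
}' myfile
```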
If all your email addresses have the .com domain, the previous answer using sed is perfect.
But if the domain can vary, it's better to adjust that sed:
sed 's/^.* \([0-9][0-9]\):.* mail from \(.*#.*\..*\)\ more.*/\1 \2/' file
With perl :
$ perl -lne '
print "$1 $2" if /^\w+\s+\d+\s+(\d+):\d+:\d+\s+.*?([-\w\.]+#\S+)/
' file.txt
Output :
26 user10#mailbox.com
25 user8#mailbox.com
26 user6#mailbox.com
18 user4#mailbox.com
20 user10#mailbox.com
