How to concatenate lines from the same paragraph in bash? - bash

I have a log file with the following lines:
vi test.log
23 Jan 01:29:33.498/GLOBAL/ser: RECEIVED message from 91.x.x.x:33583:
INVITE sip:39329172xxxx#sip.x SIP/2.0^M
Supported: ^M
Allow: INVITE, ACK, OPTIONS, CANCEL, BYE^M
Contact: sip:131400xxxx#91.x.x.x:33583^M
Via: SIP/2.0/UDP 91.x.x.x:33583;branch=z9hG4bKe65d47e555749b753faaf095c3256ec569bde77d37de66f62ff18bc40d492496^M
Call-id: ac755ea7e10821aa8174b2e5cd51d9e6^M
Cseq: 1 INVITE^M
From: sip:131400xxxx#sip.x;tag=5a541f1b2fd279cd0b8af3be3f67c7cf^M
ax-forwards: 70^M
To: sip:39329172xxxx#sip.x^M
Content-type: application/sdp^M
Content-length: 127^M
^M
v=0^M
o=anonymous 1327282173 1327282173 IN IP4 91.x.x.x^M
s=session^M
c=IN IP4 91.x.x.x^M
t=0 0^M
m=audio 5856 RTP/AVP 0^M
23 Jan 01:29:33.499/GLOBAL/ser: SENDING message to 91.x.x.x:33583:
SIP/2.0 100 trying -- your call is important to us^M
Via: SIP/2.0/UDP 91.x.x.x:33583;branch=z9hG4bKe65d47e555749b753faaf095c3256ec569bde77d37de66f62ff18bc40d492496^M
Call-id: ac755ea7e10821aa8174b2e5cd51d9e6^M
Cseq: 1 INVITE^M
From: sip:131400xxxx#sip.x;tag=5a541f1b2fd279cd0b8af3be3f67c7cf^M
To: sip:39329172xxxx#sip.x^M
Server: SSP v2.0.84^M
Content-Length: 0^M
^M
What I'd like to achieve is :
23 Jan 01:29:33.498/GLOBAL/ser: RECEIVED message from 91.x.x.x:33583:|INVITE sip:39329172xxxx#sip.x SIP/2.0|Supported:|Allow: INVITE, ACK, OPTIONS, CANCEL, BYE|Contact: sip:1314007008#91.x.x.x:33583|Via: SIP/2.0/UDP 91.x.x.x:33583;branch=z9hG4bKe65d47e555749b753faaf095c3256ec569bde77d37de66f62ff18bc40d492496|Call-id: ac755ea7e10821aa8174b2e5cd51d9e6|Cseq: 1 INVITE|From: sip:131400xxxx#sip.x;tag=5a541f1b2fd279cd0b8af3be3f67c7cf|Max-forwards: 70|To: sip:39329172xxxx#sip.x|Content-type: application/sdp|Content-length: 127|v=0|o=anonymous 1327282173 1327282173 IN IP4 91.x.x.x|s=session|c=IN IP4 91.x.x.x|t=0 0|m=audio 5856 RTP/AVP 0
23 Jan 01:29:33.499/GLOBAL/ser: SENDING message to 91.x.x.x:33583:|SIP/2.0 100 trying -- your call is important to us|Via: SIP/2.0/UDP 91.x.x.x:33583;branch=z9hG4bKe65d47e555749b753faaf095c3256ec569bde77d37de66f62ff18bc40d492496|Call-id: ac755ea7e10821aa8174b2e5cd51d9e6|Cseq: 1 INVITE|From: sip:131400xxxx#sip.x;tag=5a541f1b2fd279cd0b8af3be3f67c7cf|To: sip:39329172xxxx#sip.x|Server: SSP v2.0.84|Content-Length: 0
Basically, all lines from the same paragraph (session) should be concatenated with "|". A line break should then be added, the next paragraph concatenated, and so on. Note that every new paragraph starts with a date and time.
So far I have only been able to concatenate all the lines, but I am unable to add the line break between paragraphs. Any help would be much appreciated. Thank you.

You can use following awk script to do that:
awk '{if ($0 ~ /^\s*$/) {print line; line="";} else line=line $0 "|"}' file.txt
This assumes that the end of every paragraph is always followed by a blank line, as in your example. Note that each joined line keeps a trailing "|", and that \s is not part of POSIX awk (GNU awk supports it).
Explanation:
$0 ~ /^\s*$/ checks whether the line is completely blank or contains only whitespace.
The if block runs when a blank line appears: it prints the line variable and resets it to "".
The else block appends the current line of the file and a pipe to the line variable.
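The sample as shown has no blank line between the first and second paragraphs, so a variant that starts a new record at each timestamp line may be more robust. The following is a minimal sketch, not the method from the answer above: the function name join_records and the timestamp regex are assumptions derived from the sample lines, and it also strips the trailing carriage returns (^M) and trailing spaces.

```shell
# join_records (hypothetical helper): read the log on stdin and
# write one "|"-joined line per record.
join_records() {
  awk '
    { sub(/\r$/, "") }            # drop a trailing carriage return (^M)
    /^[ \t]*$/ { next }           # ignore blank lines
    # A line starting like "23 Jan 01:29:33." opens a new record (assumed format).
    /^[0-9][0-9]? [A-Z][a-z][a-z] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]\./ {
      if (rec != "") print rec    # flush the previous record
      rec = $0
      next
    }
    {
      sub(/[ \t]+$/, "")          # "Supported: " becomes "Supported:"
      rec = (rec == "" ? $0 : rec "|" $0)
    }
    END { if (rec != "") print rec }   # flush the last record
  '
}
```

Usage: join_records < test.log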

This might work for you with GNU sed (although your data is not clear):
sed '1{h;d};/^[123]\?[0-9] [JFMASOND].. ..:..:..\..../{:a;x;s/\s*\n\+/|/g;s/.$//p;d};H;$ba;d' file
23 Jan 01:29:33.498/GLOBAL/ser: RECEIVED message from 91.x.x.x:33583:|INVITE sip:39329172xxxx#sip.x SIP/2.0|Supported:|Allow: INVITE, ACK, OPTIONS, CANCEL, BYE|Contact: sip:131400xxxx#91.x.x.x:33583|Via: SIP/2.0/UDP 91.x.x.x:33583;branch=z9hG4bKe65d47e555749b753faaf095c3256ec569bde77d37de66f62ff18bc40d492496|Call-id: ac755ea7e10821aa8174b2e5cd51d9e6|Cseq: 1 INVITE|From: sip:131400xxxx#sip.x;tag=5a541f1b2fd279cd0b8af3be3f67c7cf|ax-forwards: 70|To: sip:39329172xxxx#sip.x|Content-type: application/sdp|Content-length: 127|v=0|o=anonymous 1327282173 1327282173 IN IP4 91.x.x.x|s=session|c=IN IP4 91.x.x.x|t=0 0|m=audio 5856 RTP/AVP 0
23 Jan 01:29:33.499/GLOBAL/ser: SENDING message to 91.x.x.x:33583:|SIP/2.0 100 trying -- your call is important to us|Via: SIP/2.0/UDP 91.x.x.x:33583;branch=z9hG4bKe65d47e555749b753faaf095c3256ec569bde77d37de66f62ff18bc40d492496|Call-id: ac755ea7e10821aa8174b2e5cd51d9e6|Cseq: 1 INVITE|From: sip:131400xxxx#sip.x;tag=5a541f1b2fd279cd0b8af3be3f67c7cf|To: sip:39329172xxxx#sip.x|Server: SSP v2.0.84|Content-Length: 0

Related

Is there any reason why my AWK functions work only on a shortened version of my file

I have a simple AWK function:
awk '
BEGIN { FS=" "; RS="\n\n" ; OFS="\n"; ORS="\n" }
/ms Response/ { print $0 }
' $FILE
The FILE is a large log that holds sections like this:
2021-10-13 12:15:12 CDT 526ms Request
POST / HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Host: xxxxxxxxxxxxxxxxxxx
Content-Length: 279
<query xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><product><name>drill</name><price>99</price><stock>5</stock></product>/query>
2021-10-13 12:15:12 CDT 880ms Received
2021-10-13 12:15:12 CDT 896ms Response
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 472
<?xml version="1.0"?>
<query type="c" xmlns="xxxxxxxxxxxxxx">
<product>
<name>screwdriver</name>
<price>5</price>
<stock>51</stock>
</product>
</query>
2021-10-13 12:15:12 CDT 947ms Request
POST / HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Host: xxxxxxxxxxxxxxx
Content-Length: 515
Expect: 100-continue
The above is just a snippet, the file continues for over 14000 lines, repeating the same pattern.
Now when I run my AWK function on the whole file, it just returns the whole file back. But when I run it on a file that was created with (cat $FILE | head -200), It works as expected by returning:
2021-10-13 12:15:12 CDT 896ms Response
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 472
2021-10-13 12:15:13 CDT 075ms Response
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 3207
2021-10-13 12:15:13 CDT 208ms Response
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 4220
Why does this work on a shortened file but not on the longer version, even though it's the same data in the file?
I am working on Ubuntu 18.04 LTS in Bash.
Thank you!
@markp-fuso's comment helped me. My input file had Windows line endings, and I just needed to run the command below before executing the AWK:
tr -d '\15\32' < OGfile.txt > unixFile.txt
Then it ran as expected.
I received additional syntax help from the following question: Convert line endings
You can use this:
awk -v RS= -v ORS='\n\n' '/ms Response/'
Or this, to avoid a trailing blank line:
awk -v RS= '/ms Response/ && c++ {printf "\n"} /ms Response/'
If RS is an empty string, records are separated by one or more blank lines (that is, two or more contiguous newlines).
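The two fixes above (removing the Windows line endings and using paragraph mode) can be combined in one pipeline. This is a small sketch under the assumption that records are separated by blank lines once the carriage returns are gone; the function name responses is made up for illustration.

```shell
# responses (hypothetical helper): read a log on stdin and print only the
# blank-line-separated records that mention "ms Response".
responses() {
  # tr removes the carriage returns so that a "\r\n\r\n" sequence becomes a
  # real blank line; RS= then switches awk into paragraph mode.
  tr -d '\r' | awk -v RS= -v ORS='\n\n' '/ms Response/'
}
```

Usage: responses < "$FILE"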

Cut Mod_Security ID with sed/awk

I would like to extract the number between the quotes of a Mod_sec ID: [id "31231"]. Generally it is not difficult at all, but I run into trouble when I try to extract all IDs from multiple reports such as:
[Wed Oct 19 15:31:33.460342 2016] [:error] [pid 16526] [client 67.22.202.121] ModSecurity: Access denied with code 400 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/usr/local/apache/conf/includes/mod_security2.conf"] [line "4968"] [id "000784"] [hostname "example.org"] [uri "/"] [unique_id "WAfYJU1ol#MAAECO#HQAAAAI"]
[Wed Mar 19 15:31:33.460342 2016] [:error] [pid 16526] [client 67.22.202.121] ModSecurity: Access denied with code 400 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/usr/local/apache/conf/includes/mod_security2.conf"] [line "4968"] [id "9"] [hostname "example.org"] [uri "/"] [unique_id "WAfYJU1ol#MAAECO#HQAAAAI"]
[Wed Mar 19 15:31:33.460342 2016] [:error] [pid 16526] [client 67.22.202.121] ModSecurity: Access denied with code 400 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/usr/local/apache/conf/includes/mod_security2.conf"] [line "4968"] [id "00263"] [hostname "example.org"] [uri "/"] [unique_id "WAfYJU1ol#MAAECO#HQAAAAI"]
I have attempted several commands such as:
cat asd | awk '/\[id\ "/,/"]/{print}'
cat asd | sed -n '/[id "/,/"]/p'
and many others, but they print more than the required IDs because the pattern is matched several times. Generally I am able to do something like:
cat asd | egrep -o "\"[0-9][0-9][0-9][0-9][0-9][0-9]\""
and then cut the output again, but this does not work when the ID does not contain exactly 6 digits.
I am not familiar with all the options of awk, sed and egrep, and cannot seem to find a solution.
What I would like to be printed from above history is:
000784
9
00263
Could someone please help. Thank you in advance.
With sed:
sed -n 's/.*\[id "\([^"]*\)"].*/\1/p'
You need to consume everything before [id and everything after your token.
You need to escape the opening square bracket.
With grep if pcre option is available:
$ grep -oP 'id "\K\d+' asd
000784
9
00263
id "\K matches id " and then \K drops it from the reported match, so it is not part of the output (similar in effect to a lookbehind)
\d+ the digits following id "
With sed
$ sed -nE 's/.*id "([0-9]+).*/\1/p' asd
000784
9
00263
.*id " match up to id "
([0-9]+) capture group to save digits needed
.* rest of line
\1 entire line replaced only with required string
The IDs are in the 6th awk field when the double quote is used as the field separator (this relies on the [file "..."] and [line "..."] items always preceding [id "..."]):
$ awk -F'"' '{print $6}' file
000784
9
00263
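For an awk variant that does not depend on the position of the [id "..."] item within the line, match() with RSTART and RLENGTH can be used. This is only a sketch; the function name extract_ids is hypothetical.

```shell
# extract_ids (hypothetical helper): print the text between the quotes of the
# first [id "..."] item on each line; lines without one are skipped.
extract_ids() {
  awk 'match($0, /\[id "[^"]*"]/) {
         # The match starts with the 5 characters [id " and ends with the
         # 2 characters "] so the ID itself is RLENGTH - 7 characters long.
         print substr($0, RSTART + 5, RLENGTH - 7)
       }' "$@"
}
```

Usage: extract_ids asd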

Parsing java logs for multiline entries using bash

I have loads of java logs on a Linux machine and I'm trying to find a grep expression or something else (perl, awk) that gives me the entire log entry on a match somewhere in its body. Logstash looks like it could do the job, but something with onboard tools would be way better.
An example should help. Here is a sample log with 5 different entries:
25 Aug 2016 14:00:46,435 DEBUG [User][IP][rsc] An error occurred
java.Exception: Foo1
at xyz
25 Aug 2016 14:00:46,436 Foo2 [User][IP][rsc] Some error occured
25 Aug 2016 14:00:46,436 DEBUG [User][IP][rsc] Somethin occured Foo3
25 Aug 2016 14:18:18,224 XYZ [User][IP][rsc] Some problems
More: bla1
More: bla2
USER.bla.bla: Blala::123 - 456
More: Could not open something
at 567
at 890
Caused by: Foo4: Could not open another thing
at 123
at 456
... 127 more
Caused by: gaga
at a1a2a3
at b3b3b3
... 146 more
25 Aug 2016 14:18:20,118 SSO [User][IP][rsc] Process: error -
Could not Foo5
<here is a blank line>
When I search for "Foo1", I need:
25 Aug 2016 14:00:46,435 DEBUG [User][IP][rsc] An error occurred
java.Exception: Foo1
at xyz
When I search for "Foo2":
25 Aug 2016 14:00:46,436 Foo2 [User][IP][rsc] Some error occured
For "Foo3":
25 Aug 2016 14:00:46,436 DEBUG [User][IP][rsc] Somethin occured Foo3
For "Foo4":
25 Aug 2016 14:18:18,224 XYZ [User][IP][rsc] Some problems
More: bla1
More: bla2
USER.bla.bla: Blala::123 - 456
More: Could not open something
at 567
at 890
Caused by: Foo4: Could not open another thing
at 123
at 456
... 127 more
Caused by: gaga
at a1a2a3
at b3b3b3
... 146 more
And finally for "Foo5":
25 Aug 2016 14:18:20,118 SSO [User][IP][rsc] Process: error -
Could not Foo5
When I search for "Foo", everything should be returned.
Is something like this possible? Maybe even as a one liner?
I would like to use it in a Webmin Custom Commands module where I supply the expression via variable.
The only basic idea I have at the moment is search for the expression and use the "[" as pattern to identify where a new entry begins.
Thanks in advance for anybody who has an idea!
A sed solution, good for environments where awk is not allowed. The same sed command is shown in one-liner and multiline forms:
pat=$1
# oneliner form
#sed -nr '/^[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} /!{H; $!b}; x; /'"$pat"'/p; ${g; /^[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} /!q; /'"$pat"'/p }'
# multiline form
sed -nr '
/^[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} /!{H; $!b}
x
/'"$pat"'/p
${
g
/^[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} /!q
/'"$pat"'/p
}'
It uses the timestamp at the beginning of a line as the record start: non-timestamp lines (the record body) are accumulated in the hold space, the hold space and pattern space are swapped at each record start, and the record is printed if it matches the pattern.
The last line is a special case when it is itself a record start: it has to be fetched back from the hold space and tested separately for a pattern match.
The shell quoting is needed to build the sed command with the pat bash variable.
I set awk RS to the timestamp pattern for multiline records (this needs GNU awk, because RT is a gawk extension):
pat=$1
awk -vpat="$pat" '
BEGIN{
RS="[0-9]{2} [a-zA-Z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} "
}
$0 ~ pat {printf("%s%s", prt, $0)}
{prt=RT}
'
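Where GNU awk is not available, the same idea can be sketched in portable awk by collecting lines until the next timestamp line. This is an illustrative variant, not either answer's method; the function name find_entries is hypothetical, and the timestamp regex is spelled out without interval expressions as an assumption about older awk implementations. The pattern is treated as an awk regular expression and is tested against the whole entry, including its timestamp line.

```shell
# find_entries PATTERN [FILE...] (hypothetical helper): print every complete
# log entry that matches PATTERN somewhere in its header or body.
find_entries() {
  pat=$1
  shift
  awk -v pat="$pat" '
    function emit() {
      sub(/\n+$/, "", rec)                 # drop trailing blank lines
      if (rec != "" && rec ~ pat) print rec
    }
    # A line such as "25 Aug 2016 14:00:46,435 " starts a new entry.
    /^[0-9][0-9] [A-Za-z][A-Za-z][A-Za-z] [0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] / {
      emit()
      rec = $0
      next
    }
    { rec = (rec == "" ? $0 : rec "\n" $0) }   # body line: append to entry
    END { emit() }
  ' "$@"
}
```

Usage: find_entries Foo4 /path/to/java.log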

print last occurrence of each unique line by IP in file

I need to parse a log file so that the following entries like this:
Jul 23 17:38:06 192.168.1.100 638 "this message will always be the same"
Jul 23 17:56:11 192.168.1.100 648 "this message will always be the same."
Jul 23 18:14:17 192.168.1.101 "this message will always be the same."
Jul 23 18:58:17 192.168.1.101 "this message will always be the same."
Look like this:
Jul 23 17:56:11 192.168.1.100 648 "this message will always be the same."
Jul 23 18:58:17 192.168.1.101 "this message will always be the same."
Basically what I am doing is taking a file that has duplicate IP addresses but with different timestamps, and finding the last occurrence (or most recent by time) of each IP address, and printing that to the screen or directing it into another file.
What I have tried:
I have written a bash script that I thought would allow me to do this but it is not working.
#!/bin/bash
/bin/grep 'common pattern to all lines' /var/log/file | awk '{print $4}' | sort -u > /home/user/iplist
while IFS='' read -r line || [[ -n "$line" ]]; do
echo "$line"
done < "/home/user/iplist"
awk '/'$line'/ {a=$0}END{print a} ' /var/log/logfile
The script runs and outputs each IP address, but it does not print the whole line except for the last one.
ex..
192.168.100.101
192.168.100.102
192.168.100.103
Jul 23 20:20:55 192.168.100.104 "this message will always be the same."
The first command in the script takes all unique occurrences of an IP and sends them to a file. The while loop assigns each line to the "$line" variable, which is then passed to awk. I thought awk would take each IP, search the actual file, and print the last occurrence of each one. How can I get this to work, either with a script or perhaps an awk one-liner?
$ tac file | awk '!seen[$4]++' | tac
Jul 23 17:56:11 192.168.1.100 648 "this message will always be the same."
Jul 23 18:58:17 192.168.1.101 "this message will always be the same."
You can use this awk command:
awk 'NF{a[$4]=$0} NF && !seen[$4]++{ips[++numIps]=$4} END {
for (i=1;i<=numIps;i++) print a[ips[i]] }' file
Jul 23 17:56:11 192.168.1.100 648 "this message will always be the same."
Jul 23 18:58:17 192.168.1.101 "this message will always be the same."
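As for why the script in the question prints only one full line: awk is called once, after the loop has finished, and by then "$line" is most likely empty, so the empty pattern matches every line and awk prints the last line of the file. Below is a hedged sketch of the loop with awk moved inside it; the function name last_per_ip is hypothetical, the IP is assumed to be the 4th field, and the output is ordered by IP rather than by position in the file. The single-pass answers above are more efficient, since this version rereads the file once per IP.

```shell
# last_per_ip FILE (hypothetical helper): for each distinct 4th field (the IP),
# print the last line of FILE that carries it.
last_per_ip() {
  awk '{ print $4 }' "$1" | sort -u |
  while IFS= read -r ip; do
    # Compare the field as a string instead of splicing the IP into a regex,
    # so the dots in the address are not treated as wildcards.
    awk -v ip="$ip" '$4 == ip { last = $0 } END { if (last != "") print last }' "$1"
  done
}
```

Usage: last_per_ip /var/log/logfile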

Collect info from multiple lines

I need to extract certain info from multiple lines (5 lines every transaction) and make the output as csv file. These lines are coming from a maillog wherein every transaction has its own transaction id. Here's one sample transaction:
Nov 17 00:15:19 server01 sm-mta[14107]: tAGGFJla014107: from=<sender#domain>, size=2447, class=0, nrcpts=1, msgid=<201511161615.tAGGFJla014107#server01>, proto=ESMTP, daemon=MTA, tls_verify=NONE, auth=NONE, relay=[100.24.134.19]
Nov 17 00:15:19 server01 flow-control[6033]: tAGGFJla014107 accepted
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - virus.McAfee: CLEAN - Declaration for Shared Parental Leave Allocation System
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - mtaqid=tAGGFJla014107, msgid=<201511161615.tAGGFJla014107#server01>, from=<sender#domain>, size=2488, to=<recipient#domain>, relay=[100.24.134.19], disposition=Deliver
Nov 17 00:15:20 server01 sm-mta[14240]: tAGGFJla014107: to=<recipient#domain>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=122447, relay=relayserver.domain. [100.91.20.1], dsn=2.0.0, stat=Sent (tAGGFJlR021747 Message accepted for delivery)
What I tried is, I made these 5 lines into 1 line and used awk to parse each column - unfortunately, the column count is not uniform.
I'm looking into getting the date/time (line 1, columns 1-3), sender, recipient, and subject (line 3, words after "CLEAN -" to the end of line)
Preferably sed or awk in bash.
Thanks!
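Explanation: file is your file.
The script starts with id and block empty. On the first line, id takes the value of field 6 (the transaction id, with its trailing colon removed) and block starts with that line. After that, every line containing id is appended to block until a line doesn't match id. At that point block is printed, and block and id are reinitialized from the new line. The last block is printed at the end of the input.
awk '{if (id=="" || !index($0,id)) {if (block!="") print block; id=$6; sub(/:$/,"",id); block=$0} else block=block " " $0} END{if (block!="") print block}' file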
Then you're going to have to process each line of the output.
There are many ways to approach this. Here is one example calling a simple script and passing the log filename as the first argument. It will parse the requested data and save the data separated into individual variables. It simply prints the results at the end.
#!/bin/bash
[ -r "$1" ] || { ## validate input file readable
    printf "error: invalid argument, file not readable '%s'\n" "$1"
    exit 1
}
while read -r line; do
    ## set date from line containing from/sender
    if grep -q -o 'from=<' <<<"$line" &>/dev/null; then
        dt=$(cut -c -15 <<<"$line")
        from=$(grep -o 'from=<[a-zA-Z0-9]*#[a-zA-Z0-9]*>' <<<"$line")
        sender=${from##*<}
        sender=${sender%>*}
    fi
    ## search each line for CLEAN
    if grep -q -o 'CLEAN.*$' <<<"$line" &>/dev/null; then
        subject=$(grep -o 'CLEAN.*$' <<<"$line")
        subject="${subject#*CLEAN - }"
    fi
    ## search line for to
    if grep -q -o 'to=<' <<<"$line" &>/dev/null; then
        to=$(grep -o 'to=<[a-zA-Z0-9]*#[a-zA-Z0-9]*>' <<<"$line")
        to=${to##*<}
        to=${to%>*}
    fi
done < "$1"
printf " date : %s\n from : %s\n to : %s\n subject: \"%s\"\n" \
    "$dt" "$sender" "$to" "$subject"
Input
$ cat dat/mail.log
Nov 17 00:15:19 server01 sm-mta[14107]: tAGGFJla014107: from=<sender#domain>, size=2447, class=0, nrcpts=1, msgid=<201511161615.tAGGFJla014107#server01>, proto=ESMTP, daemon=MTA, tls_verify=NONE, auth=NONE, relay=[100.24.134.19]
Nov 17 00:15:19 server01 flow-control[6033]: tAGGFJla014107 accepted
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - virus.McAfee: CLEAN - Declaration for Shared Parental Leave Allocation System
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - mtaqid=tAGGFJla014107, msgid=<201511161615.tAGGFJla014107#server01>, from=<sender#domain>, size=2488, to=<recipient#domain>, relay=[100.24.134.19], disposition=Deliver
Nov 17 00:15:20 server01 sm-mta[14240]: tAGGFJla014107: to=<recipient#domain>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=122447, relay=relayserver.domain. [100.91.20.1], dsn=2.0.0, stat=Sent (tAGGFJlR021747 Message accepted for delivery)
Output
$ bash parsemail.sh dat/mail.log
date : Nov 17 00:15:19
from : sender#domain
to : recipient#domain
subject: "Declaration for Shared Parental Leave Allocation System"
Note: if your from/sender is not always going to be in the first line, you can simply move those lines out from under the test clause. Let me know if you have any questions.
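Since the question asks for CSV output with one row per transaction, the same extraction can also be sketched as a single awk pass. This is a minimal sketch under several assumptions taken from the sample only: a transaction starts at the sm-mta line containing ": from=<", the subject follows "CLEAN - ", the sm-mta line containing ": to=<" closes the transaction, and transactions are not interleaved. The function name maillog_to_csv is hypothetical.

```shell
# maillog_to_csv (hypothetical helper): read a maillog on stdin and print
# one CSV row per transaction: date,sender,recipient,"subject".
maillog_to_csv() {
  awk -v q='"' '
    # Return the text between the first occurrence of key and the next ">".
    function field(key,    i, s) {
      i = index($0, key)
      if (i == 0) return ""
      s = substr($0, i + length(key))
      return substr(s, 1, index(s, ">") - 1)
    }
    index($0, ": from=<") {               # first line of a transaction
      date = $1 " " $2 " " $3
      sender = field("from=<")
    }
    index($0, "CLEAN - ") {               # subject runs to the end of the line
      subject = substr($0, index($0, "CLEAN - ") + 8)
    }
    index($0, ": to=<") {                 # delivery line closes the transaction
      rcpt = field("to=<")
      gsub(q, q q, subject)               # double any quotes for CSV
      printf "%s,%s,%s,%s%s%s\n", date, sender, rcpt, q, subject, q
      date = sender = rcpt = subject = ""
    }
  '
}
```

Usage: maillog_to_csv < dat/mail.log > transactions.csv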
