I have a log file that looks something like this:
Client connected with ID 8127641241
< multiple lines of unimportant log here>
Client not responding
Total duration: 154.23583
Sent: 14
Received: 9732
Client lost
Client connected with ID 2521598735
< multiple lines of unimportant log here>
Client not responding
Total duration: 12.33792
Sent: 2874
Received: 1244
Client lost
The log contains lots of these blocks starting with Client connected with ID 1234 and ending with Client lost. They are never mixed up (only 1 client at a time).
How would I parse this file and generate per-client statistics (ID, duration, sent, received)?
I'm mainly asking about the parsing process, not the formatting.
I guess I could loop over all the lines, set a flag when finding a Client connected line and save the ID in a variable. Then grep the lines, save the values until I find the Client lost line. Is this a good approach? Is there a better one?
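In awk terms I imagine something like this (an untested sketch of that flag idea; logfile is the file name):
awk '
BEGIN { print "ID Duration Sent Received" }
/^Client connected with ID/ { id = $NF; f = 1 }      # set the flag, remember the ID
f && /^Total duration:/     { dur  = $NF }
f && /^Sent:/               { sent = $NF }
f && /^Received:/           { recv = $NF }
f && /^Client lost/         { print id, dur, sent, recv; f = 0 }
' logfile | column -t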
Here's a quick way using awk:
awk 'BEGIN { print "ID Duration Sent Received" } /^(Client connected|Total duration:|Sent:)/ { printf "%s ", $NF } /^Received:/ { print $NF }' file | column -t
Results:
ID Duration Sent Received
8127641241 154.23583 14 9732
2521598735 12.33792 2874 1244
A solution in Perl:
#!/usr/bin/perl
use warnings;
use strict;

print "\tID\tDuration\tSent\tReceived\n";
while (<>) {
    chomp;
    if (/Client connected with ID (\d+)/) {
        print "$1\t";
    }
    if (/Total duration: ([\d\.]+)/) {
        print "$1\t";
    }
    if (/Sent: (\d+)/) {
        print "$1\t";
    }
    if (/Received: (\d+)/) {
        print "$1\n";
    }
}
Sample output:
ID Duration Sent Received
8127641241 154.23583 14 9732
2521598735 12.33792 2874 1244
If you're sure that the logfile can't have errors, and if the fields are always in the same order, you can use something like the following:
#!/bin/bash
ids=()
declare -A duration
declare -A sent
declare -A received
while read _ _ _ _ id; do
    ids+=( "$id" )
    read _ _ duration[$id]
    read _ sent[$id]
    read _ received[$id]
done < <(grep '\(^Client connected with ID\|^Total duration:\|^Sent:\|Received:\)' logfile)
# printing the data out, for control purposes only
for id in "${ids[@]}"; do
    printf "ID=%s\n\tDuration=%s\n\tSent=%s\n\tReceived=%s\n" "$id" "${duration[$id]}" "${sent[$id]}" "${received[$id]}"
done
Output is:
$ ./parsefile
ID=8127641241
    Duration=154.23583
    Sent=14
    Received=9732
ID=2521598735
    Duration=12.33792
    Sent=2874
    Received=1244
but the data is stored in the corresponding associative arrays. It's fairly efficient. It would probably be slightly more efficient in another programming language (e.g., perl), but since you only tagged your post with bash, sed and grep, I guess I fully answered your question.
Explanation: grep only filters the lines we're interested in, and bash only reads the fields we're interested in, assuming they always come in the same order. The script should be easy to understand and modify to your needs.
awk:
awk 'BEGIN{print "ID Duration Sent Received"}
/with ID/ && !f{f=1}
f && /Client lost/{print a[1],a[2],a[3],a[4]; f=0}
f{for(i=1;i<=NF;i++){
    if($i=="ID")a[1]=$(i+1)
    if($i=="duration:")a[2]=$(i+1)
    if($i=="Sent:")a[3]=$(i+1)
    if($i=="Received:")a[4]=$(i+1)
}}' log
if there is always an empty line between your data blocks, the awk script above could be simplified to:
awk -vRS="" 'BEGIN{print "ID Duration Sent Received"}
{for(i=1;i<=NF;i++){
if($i=="ID")a[1]=$(i+1)
if($i=="duration:")a[2]=$(i+1)
if($i=="Sent:")a[3]=$(i+1)
if($i=="Received:")a[4]=$(i+1)
}print a[1],a[2],a[3],a[4];}' log
output:
ID Duration Sent Received
8127641241 154.23583 14 9732
2521598735 12.33792 2874 1244
If you want nicer formatting, pipe the output to column -t; you get:
ID Duration Sent Received
8127641241 154.23583 14 9732
2521598735 12.33792 2874 1244
Use Paragraph Mode to Slurp Files
Using Perl or AWK, you can slurp in records using a special paragraph mode that uses blank lines between records as a separator. In Perl, use -00 to use paragraph mode; in AWK, you set the RS variable to the empty string (e.g. "") to do the same thing. Then you can parse fields within each record.
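For instance, on the log above, Perl's paragraph mode could look something like this (a rough sketch; it assumes a blank line separates the blocks and that the file is named logfile):
perl -00 -ne '
    # each read is one whole block; /s lets . span the newlines inside it
    print "$1 $2 $3 $4\n"
        if /ID (\d+).*duration: ([\d.]+).*Sent: (\d+).*Received: (\d+)/s;
' logfile | column -t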
Use Line-Oriented Statements
Alternatively, you can use a shell while-loop to read each line at a time, and then use grep or sed to parse each line. You may even be able to use a case statement, depending on the complexity of your parsing.
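A case statement version could look something like this (a rough sketch; it assumes the log is in a file named logfile):
printf 'ID Duration Sent Received\n'
while read -r line; do
    case $line in
        "Client connected with ID "*) printf '%s ' "${line##* }" ;;   # keep only the last word (the ID)
        "Total duration: "*)          printf '%s ' "${line##* }" ;;
        "Sent: "*)                    printf '%s ' "${line##* }" ;;
        "Received: "*)                printf '%s\n' "${line##* }" ;;
    esac
done < logfile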
For example, assuming you always have 5 matching fields in a record, you could do something like this:
while read; do
    grep -Eo '[[:digit:]]+'
done < /tmp/foo | xargs -n5 | sed 's/ /\t/g'
The loop would yield:
23583 14 9732 2521598735 33792
2874 1244 8127641241 23583 14
9732 2521598735 33792 2874 1244
You can certainly play with the formatting, and add header lines, and so forth. The point is that you have to know your data.
AWK, Perl, or even Ruby are better options for parsing record-oriented formats, but the shell is certainly an option if your needs are basic.
A short snippet of Perl:
perl -ne '
BEGIN {print "ID Duration Sent Received\n";}
print "$1 " if /(?:ID|duration:|Sent:|Received:) (.+)$/;
print "\n" if /^Client lost/;
' filename | column -t
awk -v RS= -F'\n' '
BEGIN{ printf "%15s%15s%15s%15s\n","ID","Duration","Sent","Received" }
{
    for (i=1;i<=NF;i++) {
        n = split($i,f,/ /)
        if ( $i ~ /^(Client connected|Total duration:|Sent:|Received:)/ ) {
            printf "%15s",f[n]
        }
    }
    print ""
}' file
I have a file like this
InputFile.txt
JOB JOB_A
    Source C://files/InputFile
    Resource 0 AC
    User Guest
    ExitCode 0 Success
EndJob
JOB JOB_B
    Source C://files/
    Resource 1 AD
    User Current
    ExitCode 1 Fail
EndJob
JOB JOB_C
    Source C://files/Input/
    Resource 3 AE
    User Guest2
    ExitCode 0 Success
EndJob
I have to convert the above file to a CSV file with one row per job (columns Source, Resource, User, ExitCode).
How to convert it using shell scripting?
I used awk.
The separator is a tab, which is more common than a comma for this kind of tabular output.
If you want a comma, you can simply change each \t to ,.
cat InputFile.txt | \
awk '
BEGIN{print "Source\tResource\tUser\tExitCode"}
/^JOB/{i=0}
/^\s/{
i++;
match($0,/\s*[a-zA-Z]* /);
a[i]=substr($0,RSTART+RLENGTH)}
/^EndJob/{for(i=1;i<5;i++) printf "%s\t",a[i];print ""}'
The BEGIN block writes the header.
The /^JOB/ rule just resets the counter i to zero.
The /^\s/ rule matches lines that start with whitespace and fills the array a with the values (it relies on a fixed number and order of rows).
The /^EndJob/ rule prints the stored values.
Output:
Source	Resource	User	ExitCode
C://files/InputFile	0 AC	Guest	0 Success
C://files/	1 AD	Current	1 Fail
C://files/Input/	3 AE	Guest2	0 Success
Script using associative array:
You can change the script so that it keys on the Source, Resource, User, and ExitCode names taken from $1 (the first field) of each line. It is a little longer, and this input file doesn't strictly need it.
cat InputFile.txt | \
awk '
BEGIN{
h[1]="Source";
h[2]="Resource";
h[3]="User";
h[4]="ExitCode";
for(i=1;i<5;i++) printf "%s\t",h[i];print ""}
/^\s/{
i++;
match($0,/\s*[a-zA-Z]* /);
a[$1]=substr($0,RSTART+RLENGTH)}
/^EndJob/{for(i=1;i<5;i++) printf "%s\t",a[h[i]];print ""}'
With sed ... I don't know if the order in InputFile.txt is always the same (Source, Resource, User, ExitCode), but if it is:
declare delimiter=";"
sed -Ez "s/[^\n]*(Source|Resource|User) ([^\n]*)\n/\2${delimiter}/g;s/[ \t]*ExitCode //g;s/[^\n]*JOB[^\n]*\n//gi;s/^/Source${delimiter}Resource${delimiter}User${delimiter}ExitCode\n/" < InputFile.txt > output.csv
I have a tab separated text file, call it input.txt
cat input.txt
Begin Annotation Diff End Begin,End
6436687 >ENST00000422706.5|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-205|APOL1|2901|protein_coding| 50 6436736 6436687,6436736
6436737 >ENST00000426053.5|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-206|APOL1|2808|protein_coding| 48 6436784 6436737,6436784
6436785 >ENST00000319136.8|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000075315.5|APOL1-201|APOL1|3000|protein_coding| 51 6436835 6436785,6436835
6436836 >ENST00000422471.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319151.1|APOL1-204|APOL1|561|nonsense_mediated_decay| 11 6436846 6436836,6436846
6436847 >ENST00000475519.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319153.1|APOL1-212|APOL1|600|retained_intron| 11 6436857 6436847,6436857
6436858 >ENST00000438034.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319152.2|APOL1-210|APOL1|566|protein_coding| 11 6436868 6436858,6436868
6436869 >ENST00000439680.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319252.1|APOL1-211|APOL1|531|nonsense_mediated_decay| 10 6436878 6436869,6436878
6436879 >ENST00000427990.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319154.2|APOL1-207|APOL1|624|protein_coding| 12 6436890 6436879,6436890
6436891 >ENST00000397278.8|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319100.4|APOL1-202|APOL1|2795|protein_coding| 48 6436938 6436891,6436938
6436939 >ENST00000397279.8|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-203|APOL1|1564|protein_coding| 28 6436966 6436939,6436966
6436967 >ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding| 11 6436977 6436967,6436977
6436978 >ENST00000431184.1|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319254.1|APOL1-208|APOL1|550|nonsense_mediated_decay| 11 6436988 6436978,6436988
Using the information in input.txt I want to obtain information from a file called Other_File.fa. This file is an annotation file filled with ENST#'s (transcript IDs) and sequences of A's, T's, C's, and G's. I want to store the sequence in a file called Output.log (see example below), and I want to store the command used to retrieve the text in a file called Input.log (see example below).
I have tried to do this using awk and cut so far using a for loop. This is the code I have tried.
for line in `awk -F "\\t" 'NR != 1 {print substr($2,2,17)"#"$5}' input.txt`
do
transcript=`cut -d "#" -f 1 $line`
range=`cut -d "#" -f 2 $line` #Range is the string location in Other_File.fa
echo "Our transcript is ${transcript} and our range is ${range}" >> Input.log
sed -n '${range}' Other_File.fa >> Output.log
done
Here is an example of the 11 lines between ENST00000433768.5 and ENST00000431184.1 in Other_File.fa.
grep -A 11 ENST00000433768.5 Other_File.fa
>ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding|
ATCCACACAGCTCAGAACAGCTGGATCTTGCTCAGTCTCTGCCAGGGGAAGATTCCTTGG
AGGAGCACACTGTCTCAACCCCTCTTTTCCTGCTCAAGGAGGAGGCCCTGCAGCGACATG
GAGGGAGCTGCTTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTT
GGTGTGGGAGTGAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGG
ACAGATACTGGAGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGAC
CCAGGCCCAGCTGGGTCCAGAGGTGACAGTGGAGAGCCGTGTACCCTGAGACCAGCCTGC
AGAGGACAGAGGCAACATGGAGGTGCCTCAAGGATCAGTGCTGAGGGTCCCGCCCCCATG
CCCCGTCGAAGAACCCCCTCCACTGCCCATCTGAGAGTGCCCAAGACCAGCAGGAGGAAT
CTCCTTTGCATGAGAGCAGTATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAG
T
>ENST00000431184.1|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319254.1|APOL1-208|APOL1|550|nonsense_mediated_decay|
The range value in input.txt for this transcript is 6436967,6436977. In my file Input.log for this transcript I hope to get
Our transcript is ENST00000433768.5 and our range is 6436967,6436977
And in Output.log for this transcript I hope to get
>ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding|
ATCCACACAGCTCAGAACAGCTGGATCTTGCTCAGTCTCTGCCAGGGGAAGATTCCTTGG
AGGAGCACACTGTCTCAACCCCTCTTTTCCTGCTCAAGGAGGAGGCCCTGCAGCGACATG
GAGGGAGCTGCTTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTT
GGTGTGGGAGTGAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGG
ACAGATACTGGAGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGAC
CCAGGCCCAGCTGGGTCCAGAGGTGACAGTGGAGAGCCGTGTACCCTGAGACCAGCCTGC
AGAGGACAGAGGCAACATGGAGGTGCCTCAAGGATCAGTGCTGAGGGTCCCGCCCCCATG
CCCCGTCGAAGAACCCCCTCCACTGCCCATCTGAGAGTGCCCAAGACCAGCAGGAGGAAT
CTCCTTTGCATGAGAGCAGTATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAG
T
But I am getting the following error, and I am unsure as to why or how to fix it.
cut: ENST00000433768.5#6436967,6436977: No such file or directory
cut: ENST00000433768.5#6436967,6436977: No such file or directory
Our transcript is and our range is
My thought was that each line from the awk would be read as a string, and then cut could split the string on the "#" symbol I added, but cut is treating each line as a file name and throwing an error when it can't locate that file in my directory.
Thanks.
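As an aside on that error: cut expects file names, not strings, so one fix is to feed each value on stdin with a here-string, or to drop cut in favour of parameter expansion. A minimal sketch using one value from the error message:
line='ENST00000433768.5#6436967,6436977'   # one value produced by the awk command
transcript=$(cut -d '#' -f 1 <<< "$line")  # here-string: cut reads the string instead of looking for a file
range=$(cut -d '#' -f 2 <<< "$line")
# or, without cut at all:
transcript=${line%%#*}
range=${line##*#}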
EDIT2: This is a generic solution which compares the 2 files (input and other_file.fa) and prints whichever range is found, on whichever line it is found. E.g. if the range numbers are found on line 300 but the range says to print lines 1 to 20, it will work in that case too. Also note that this calls the system command, which in turn calls sed (like you were using the range within sed); there are other ways too, such as loading the whole Input_file into an array and then printing, but I am going with this one here. Fair warning: this is not tested with huge files.
awk -F'[>| ]' '
FNR==NR{
arr[$2]=$NF
next
}
($2 in arr){
split(arr[$2],lineNum,",")
print arr[$2]
start=lineNum[1]
end=lineNum[2]
print "sed -n \047" start","end"p \047 " FILENAME
system("sed -n \047" start","end"p\047 " FILENAME)
start=end=0
}
' file1 FS="[>|]" other_file.fa
EDIT: With OP's edited samples, please try the following to print lines based on the other file. It assumes that the range values always refer to lines after the line on which they are found (e.g. the range values are found on the 3rd line and the range is 4 to 10).
awk -F'[>| ]' '
FNR==NR{
arr[$2]=$NF
next
}
($2 in arr){
split(arr[$2],lineNum," ")
start=lineNum[1]
end=lineNum[2]
}
FNR>=start && FNR<=end{
print
if(FNR==end){
start=end=0
}
}
' file1 FS="[>|]" other_file.fa
You don't need to do this with a for loop that calls the awk program once for every line. It can be done in a single awk, given that you only have to print the values. Written and tested with your shown samples.
awk -F'[>| ]' 'FNR>1{print "Our transcript is:"$3" and our range is:"$NF}' Input_file
NOTE: This prints the transcript and range values for each line of your Input_file; if you want to perform some further operation with those values, please mention it.
I have a file that looks like this:
user1,135.4,MATLAB,server1,14:53:59,15:54:28
user2,3432,Solver_HF+,server1,14:52:01,14:54:28
user3,3432,Solver_HF+,server1,14:52:01,15:54:14
user4,3432,Solver_HF+,server1,14:52:01,14:54:36
I want to run a comparison between the last two columns, and if the difference is greater than an hour (such as on lines 1 and 3), it should trigger something like this:
echo "individual line from file" | mail -s "subject" email#site.com
I was trying to come up with a possible solution using awk, but I'm still fairly new to linux and couldn't quite figure out something that worked.
The following awk script may be what you want:
awk 'BEGIN{FS=","}
{a="2019 01 01 " gensub(":"," ","g",$5);
 b="2019 01 01 " gensub(":"," ","g",$6);
 c = int((mktime(b)-mktime(a))/60)}
{if (c >= 60){system("echo \"" $0 "\" | mail -s \"subject\" email#site.com")}}' your_filename
Then put the script into crontab or another trigger, for example:
*/5 * * * * awk_scripts.sh
If you only want to check new lines, using tail -n on the file may be more useful than cat.
Here you go: (using gnu awk due to mktime)
awk -F, '{
    split($(NF-1),t1,":");
    split($NF,t2,":");
    d1=mktime("2019 01 01 "t1[1]" "t1[2]" "t1[3]" 0");
    d2=mktime("2019 01 01 "t2[1]" "t2[2]" "t2[3]" 0");
    if (d2-d1>3600) print $0}' file
user1,135.4,MATLAB,server1,14:53:59,15:54:28
user3,3432,Solver_HF+,server1,14:52:01,15:54:14
The field separator is a comma, to get the second-to-last and the last field.
split puts the two fields into arrays t1 and t2 to get hour, minute and second.
mktime converts these to seconds.
Do the math and print only the lines that differ by more than 3600 seconds.
This can then be piped to other commands.
See how the time functions are used in GNU awk: https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html
I've a log file that contains some lines I need to grab:
Jul 2 06:42:00 myhostname error proc[12345]: 01310001:3: event code xxxx Slow transactions attack detected - account id: (20), number of dropped slow transactions: (3)
Jul 2 06:51:00 myhostname error proc[12345]: 01310001:3: event code xxxx Slow transactions attack detected - account id: (20), number of dropped slow transactions: (2)
Account id (xx) gives me the name of an object, which I am able to gather through a mysql query.
The following command (which is for sure not optimized at all, but working) gives me the number of matching lines per account id:
grep "Slow transactions" logfile| awk '{print $18}' | awk -F '[^0-9]+' '{OFS=" ";for(i=1; i<=NF; i++) if ($i != "") print($i)}' | sort | uniq -c
14 20
The output (14 20) means the account id 20 was observed 14 times (14 lines in the logfile).
Then I also have the number of dropped slow transactions: (2) part.
This gives the real number of dropped transactions that were logged. In other words, a single log entry can mean one or more dropped transactions.
I do have a small command to count the number of dropped transactions:
grep "Slow transactions" logfile | awk '{print $24}' | sed 's/(//g' | sed 's/)//g' | awk '{s+=$1} END {print s}'
73
That means 73 transactions were dropped.
These two work, but I am stuck when it comes to merging them. I really don't see how to combine them; I am pretty sure awk can do it (and probably in a better way than I did), but I would appreciate it if an expert from the community could give me some guidance.
update
Since the above was too easy for some of our awk experts on SO, I'm introducing an optional feature :)
As previously mentioned, I can convert an account ID into a name by issuing a mysql query. So the idea is now to include the ID => name conversion in the awk command.
The mySQL query looks like this (XX being the account ID):
mysql -Bs -u root -p$(perl -MF5::GenUtils -e "print get_mysql_password.qq{\n}") -e "SELECT name FROM myTABLE where account_id= 'XX'"
I found the post below, which deals with passing command output into awk, but I am facing syntax errors...
How can I pass variables from awk to a shell command?
This uses parentheses as your field separator, so it's easier to grab the account number and the number of slow connections.
awk -F '[()]' '
/Slow transactions/ {
acct[$2]++
dropped[$2] += $4
}
END {
PROCINFO["sorted_in"] = "#ind_num_asc" # https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html
for (acctnum in acct)
print acctnum, acct[acctnum], dropped[acctnum]
}
' logfile
Given your sample input, this outputs
20 2 5
Requires GNU awk for the "sorted_in" method of sorting array traversal by index.
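If you also want the optional ID => name conversion, one possibility (an untested sketch; it assumes the mysql invocation from the question works as shown, with the password lookup omitted for brevity) is to let gawk run the query itself and read the result with getline:
awk -F '[()]' '
/Slow transactions/ {
    acct[$2]++
    dropped[$2] += $4
}
END {
    for (acctnum in acct) {
        # hypothetical lookup: table and column names are taken from the question
        cmd = "mysql -Bs -u root -e \"SELECT name FROM myTABLE WHERE account_id=" acctnum "\""
        name = ""
        if ((cmd | getline name) <= 0) name = acctnum   # fall back to the id if the query returns nothing
        close(cmd)
        print acctnum, name, acct[acctnum], dropped[acctnum]
    }
}
' logfile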
I need a little help from the community:
I have these two lines in a large text file:
Connected clients: 42
4 ACTIVE CLIENTS IN LAST 20 SECONDS
How can I find, extract, and assign the numbers to variables?
clients=42
active=4
SED, AWK, GREP? Which one should I use?
clients=$(grep -Po '(?<=^Connected clients: )[0-9]+$' filename)
active=$(grep -Po '^([0-9]+)(?= ACTIVE CLIENTS IN LAST [0-9]+ SECONDS$)' filename)
or
clients=$(sed -n 's/^Connected clients: \([0-9]\+\)$/\1/p' filename)
active=$(sed -n 's/^\([0-9]\+\) ACTIVE CLIENTS IN LAST [0-9]\+ SECONDS$/\1/p' filename)
str='Connected clients: 42 4 ACTIVE CLIENTS IN LAST 20 SECONDS'
set -- $str
clients=$3
active=$4
If they're on two separate lines, that's fine too:
str1='Connected clients: 42'
str2='4 ACTIVE CLIENTS IN LAST 20 SECONDS'
set -- $str1
clients=$3
set -- $str2
active=$1
Reading two lines from a file may be done by
{ read str1; read str2; } < file
Alternatively, do the reading and writing in AWK, and slurp the results into Bash.
eval "$(awk '/^Connected clients: / { print "clients=" $3 }
/[0-9]+ ACTIVE CLIENTS/ { print "active=" $1 }
' filename)"
you can use awk
$ set -- $(awk '/Connected/{c=$NF}/ACTIVE/{a=$1}END{print c,a}' file)
$ echo $1
42
$ echo $2
4
assign $1, $2 to appropriate variable names as desired
or you can directly assign using declare:
$ declare $(awk '/Connected/{c=$NF}/ACTIVE/{a=$1}END{print "client="c;print "active="a}' file)
$ echo $client
42
$ echo $active
4