Carving data from log file - bash

I have a log file containing the data below:
time=1460196536.247325 latency=3:6:7:9:16:(8)ms latency95=11ms latency99=13ms requests=517 option1=0 option2=0 errors=0 throughput=480rps ql=1 rr=0.00% cr=0.00% accRequests=101468 accOption1=0 accOption2=0 accLatency=2:6:7:8:3998:(31)ms accLatency95=11ms accLatency99=649ms accOpenQueuing=1664 accErrors=278
I am trying to write a bash script that carves out these values for each line in the log file and writes them to a second file:
time (converted to local time GMT+2)
latency99
requests
errors
Desired output in second file:
time latency99 requests errors
12:08:56 13 517 0
Is the easiest way to use regex for this?

Here's a Bash solution for version 4 and above, using an associative array:
#!/bin/bash

# Assoc array to hold data.
declare -A data

# Log file (the input file).
logfile=$1

# Output file.
output_file=$2

# Print column names for required values.
printf '%-20s %-10s %-10s %-10s\n' time latency99 requests errors > "$output_file"

# Iterate over each line in $logfile
while read -ra arr; do
    # Insert keys and values into 'data' array.
    for i in "${arr[@]}"; do
        data["${i%=*}"]="${i#*=}"
    done

    # Convert time to GMT+2
    gmt2_time=$(TZ=GMT+2 date -d "@${data[time]}" '+%T')

    # Append results to the output file.
    printf '%-20s %-10s %-10s %-10s\n' "$gmt2_time" "${data[latency99]%ms}" "${data[requests]}" "${data[errors]}" >> "$output_file"
done < "$logfile"
As you can see, the script accepts two arguments. The first one is the file name of the logfile, and the second is the output file to which parsed data will be inserted line by line for each row in the logfile.
Please note that I used GMT+2 as the value of the TZ variable. Keep in mind that the sign is inverted in the POSIX TZ format, so GMT+2 actually denotes a zone two hours behind GMT.
Use your exact area as the value instead, for example TZ="Europe/Berlin".
You might want to use the tool tzselect to find the correct string value for your area.
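For example, with a proper zone name, the first timestamp from the question converts like this (assuming GNU date, which accepts the @epoch form):
$ TZ="Europe/Berlin" date -d "@1460196536.247325" '+%T'
12:08:56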
In order to test it, I created the following logfile, containing 3 different rows of input:
time=1260196536.242325 latency=3:6:7:9:16:(8)ms latency95=11ms latency99=10ms requests=100 option1=0 option2=0 errors=1 throughput=480rps ql=1 rr=0.00% cr=0.00% accRequests=101468 accOption1=0 accOption2=0 accLatency=2:6:7:8:3998:(31)ms accLatency95=11ms accLatency99=649ms accOpenQueuing=1664 accErrors=278
time=1460246536.244325 latency=3:6:7:9:16:(8)ms latency95=11ms latency99=20ms requests=200 option1=0 option2=0 errors=2 throughput=480rps ql=1 rr=0.00% cr=0.00% accRequests=101468 accOption1=0 accOption2=0 accLatency=2:6:7:8:3998:(31)ms accLatency95=11ms accLatency99=649ms accOpenQueuing=1664 accErrors=278
time=1260236536.147325 latency=3:6:7:9:16:(8)ms latency95=11ms latency99=30ms requests=300 option1=0 option2=0 errors=3 throughput=480rps ql=1 rr=0.00% cr=0.00% accRequests=101468 accOption1=0 accOption2=0 accLatency=2:6:7:8:3998:(31)ms accLatency95=11ms accLatency99=649ms accOpenQueuing=1664 accErrors=278
Let's run the test (script name is sof):
$ ./sof logfile parsed_logfile
$ cat parsed_logfile
time                 latency99  requests   errors
12:35:36             10         100        1
22:02:16             20         200        2
23:42:16             30         300        3
EDIT:
According to the OP's request in the comments, and as discussed further in chat, I edited the script to include the following features:
- Remove the ms suffix from latency99's value.
- Read input from a logfile, line by line, parse it, and output the results to a selected file.
- Include column names only in the first row of output.
- Convert the time value to GMT+2.

Here is an awk script for you. Say the log file is mc.log and the script is saved as mc.awk; you would run it like this with GNU awk: awk -f mc.awk mc.log
mc.awk:
BEGIN {
    OFS="\t"
    # some "" to align header and values in output
    print "time", "", "latency99", "requests", "errors"
}

function getVal(str) {
    # strip leading "key=" and trailing "ms" from str
    gsub(/^.*=/, "", str)
    gsub(/ms$/, "", str)
    return str
}

function fmtTime(timeStamp) {
    val = getVal(timeStamp)
    return strftime("%H:%M:%S", val)
}

{
    # some "" to align header and values in output
    print fmtTime($1), getVal($4), "", getVal($5), "", getVal($8)
}
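Since strftime() formats the epoch in your local time zone, you can force a specific zone when running the script and capture the result in a file (just a usage sketch; the zone name is only an example):
TZ="Europe/Berlin" awk -f mc.awk mc.log > parsed_logfile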

Here's an awk version (not GNU). Converting the date would require a call to an external program:
#!/usr/bin/awk -f
BEGIN {
    FS="([[:alpha:]]+)?[[:blank:]]*[[:alnum:]]+="
    OFS="\t"
    print "time", "latency99", "requests", "errors"
}
{
    print $2, $5, $6, $9
}
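If you do need the human-readable time with a non-GNU awk, one possible approach (an assumption on my part, and it still relies on GNU date for the -d flag) is to shell out once per line via getline:
#!/usr/bin/awk -f
BEGIN {
    FS="([[:alpha:]]+)?[[:blank:]]*[[:alnum:]]+="
    OFS="\t"
    print "time", "latency99", "requests", "errors"
}
{
    # Ask date(1) to turn the epoch in $2 into HH:MM:SS (GNU date assumed; pick your own TZ)
    cmd = "TZ=\"Europe/Berlin\" date -d @" $2 " +%T"
    cmd | getline ts
    close(cmd)
    print ts, $5, $6, $9
}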

Related

shell script subtract fields from pairs of lines

Suppose I have the following file:
stub-foo-start: 10
stub-foo-stop: 15
stub-bar-start: 3
stub-bar-stop: 7
stub-car-start: 21
stub-car-stop: 51
# ...
# EOF at the end
with the goal of writing a script which would append to it like so:
stub-foo-start: 10
stub-foo-stop: 15
stub-bar-start: 3
stub-bar-stop: 7
stub-car-start: 21
stub-car-stop: 51
# ...
# appended:
stub-foo: 5 # 5 = stop(15) - start(10)
stub-bar: 4 # and so on...
stub-car: 30
# ...
# new EOF
The format is exactly this sequential pairing of start and stop tags (stop being the closing one) and no nesting in between.
What is the recommended approach to writing such a script using awk and/or sed? Mostly, what I've tried is grepping lines and storing them in variables, but that seemed to overcomplicate things and trail off.
Any advice or helpful links welcome. (Most tutorials I found on shell scripting were illustrative at best)
A naive implementation in plain bash:
#!/bin/bash
while read -r start && read -r stop; do
    printf '%s: %d\n' "${start%-*}" $(( ${stop##*:} - ${start##*:} ))
done < file
This assumes pairs are contiguous and there are no interlaced or nested pairs.
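If the file also contains comment lines like # ... (as in the example), a slight variant of the same idea filters them out first (a sketch, assuming the stub- prefix shown above):
#!/bin/bash
# Keep only the stub-* lines before pairing them up.
grep '^stub-' file |
while read -r start && read -r stop; do
    printf '%s: %d\n' "${start%-*}" $(( ${stop##*:} - ${start##*:} ))
done
Given the sample data, this prints stub-foo: 5, stub-bar: 4 and stub-car: 30.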
Using GNU awk:
awk -F '[ -]' '{ map[$2][$3]=$4; print } END { for (i in map) { print i": "(map[i]["stop:"]-map[i]["start:"])" // ("map[i]["stop:"]"-"map[i]["start:"]")" } }' file
Explanation:
awk -F '[ -]' '{            # Set the field delimiter to space or "-"
    map[$2][$3]=$4          # Create a two-dimensional array with the second and third fields as indexes and the fourth field as the value
    print                   # Print the line
}
END {
    for (i in map) {
        # Loop through the array and print the data in the required format
        print i": "(map[i]["stop:"]-map[i]["start:"])" // ("map[i]["stop:"]"-"map[i]["start:"]")"
    }
}' file
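Since the awk program reprints every input line before the END summary, the "append to the original file" behaviour the question asks for can be had by writing to a temporary file and moving it back (a sketch; it assumes only stub-* lines are present, as comment lines would add stray entries to the map):
awk -F '[ -]' '{ map[$2][$3]=$4; print } END { for (i in map) print i": "(map[i]["stop:"]-map[i]["start:"]) }' file > file.new && mv file.new file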

Adding constant values using awk

I have a requirement to add a constant value to the 4th column if its value is less than 240000. The constant value is 010000. I have written a command but it does not give any output. Below are the sample data and the script. Please help me with this. Thanks in advance.
Command:
awk '{
    if($4 -lt 240000)
        $4= $4+010000;
}' Test.txt
Sample Data :
1039,1018,20180915,000000,0,0,A
1039,1018,20180915,010000,0,0,A
1039,1018,20180915,020000,0,0,A
1039,1018,20180915,030000,0,0,A
1039,1018,20180915,240000,0,0,A
1039,1018,20180915,050000,0,0,A
1039,1018,20180915,060000,0,0,A
1039,1018,20180915,070000,1,0,A
1039,1018,20180915,080000,0,1,A
1039,1018,20180915,090000,2,0,A
1039,1018,20180915,241000,0,0,A
1039,1018,20180915,240500,0,0,A
$ awk '
BEGIN { FS=OFS="," }                  # input and output field separators
{
    if($4<240000)                     # if comparison
        $4=sprintf("%06d",$4+10000)   # I assume 10000 not 010000, also zero-padded to 6 chars
        # $4+=10000                   # if zero-padding is not required
    print                             # output
}' file
Output:
1039,1018,20180915,010000,0,0,A
1039,1018,20180915,020000,0,0,A
1039,1018,20180915,030000,0,0,A
1039,1018,20180915,040000,0,0,A
1039,1018,20180915,240000,0,0,A
1039,1018,20180915,060000,0,0,A
1039,1018,20180915,070000,0,0,A
1039,1018,20180915,080000,1,0,A
1039,1018,20180915,090000,0,1,A
1039,1018,20180915,100000,2,0,A
1039,1018,20180915,241000,0,0,A
1039,1018,20180915,240500,0,0,A
$4+10000, not 010000, since awk 'BEGIN{ print 010000+0 }' outputs 4096, as 010000 is the octal representation of that value.
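Note that awk prints to standard output, so to keep the corrected data you would normally redirect it to a new file (just a usage note; the output file name here is arbitrary):
awk 'BEGIN{ FS=OFS="," } { if ($4<240000) $4=sprintf("%06d",$4+10000); print }' Test.txt > Test_fixed.txt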

Sorting a column's value from a large csv (more than a million records) using awk or bash

I am new to shell scripting.
I have a huge CSV file which contains more than 100k rows. I need to find a column, sort it, and write the result to another file; later I need to process this new file.
Below is the sample data:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","MAS,CA.ON.OSC,ASIC*,AAAA","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","POU,ABC,MAS,CA.QC.OSC,CA.ON.OSC","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","BVC,AZX,CA.SK.FCAA,CA.NL.DSS","QQQQQQQQQRRCGHDKLKSLS"
As you can see, field 4 contains commas as well. I need the data with field 4 sorted, as below:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA","QQQQQQQQQRRCGHDKLKSLS"
To get this I have written the script below, but it does not seem to be efficient: for 100k records it took 20 minutes, so I am trying to find a more efficient solution.
#this command replaces the comma inside "" with | so that I can split the line based on ','(comma)
awk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/,/, "|", $i)}} {print $0}' $FEED_FILE > temp.csv
count=0;
while read line
do
#break the line on comma ',' and get the array of strings.
IFS=',' read -ra data <<< "$line" #'data' is the array of the record of full line.
#take the 8th column, which is the reportable jurisdiction.
echo "REPORTABLE_JURISDICTION is : " ${data[4]}
#break the data based on pipe '|' and sort the data
IFS='|' read -ra REPORTABLE_JURISDICTION_ARR <<< "${data[4]}"
#Sort this array
IFS=$'\n' sorted=($(sort <<<"${REPORTABLE_JURISDICTION_ARR[*]}"))
#printf "[%s]\n" "${sorted[#]}"
separator="|" # e.g. constructing regex, pray it does not contain %s
regex="$( printf "${separator}%s" "${sorted[#]}" )"
regex="${regex:${#separator}}" # remove leading separator
echo "${regex}"
data[4]=${regex}
echo "$data[68]"
#here we are building the whole line which will be written to the output file.
separator="," # e.g. constructing regex, pray it does not contain %s
regex="$( printf "${separator}%s" "${data[#]}" )"
regex="${regex:${#separator}}" # remove leading separator
echo "${regex}" >> temp2.csv
echo $count
((count++))
done < temp.csv
#remove the '|' and put the comma back
awk -F\| 'BEGIN{OFS=","} {$1=$1; print}' temp2.csv > temp3.csv
# to remove the trailing , if any
sed 's/,$//' temp3.csv > $OUT_FILE
How to make it faster?
You're using the wrong tools for the task. CSV seems simple enough that you can easily process it with shell tools, but your code will break for cells that contain newlines. Also, bash isn't very fast when processing lots of data.
Try a tool which understands CSV directly like http://csvkit.rtfd.org/ or use a programming language like Python. That allows you to do the task without starting external processes, the syntax is much more readable and the result will be much more maintainable. Note: I'm suggesting Python because of the low initial cost.
With Python and the csv module, the code above would look like this:
import csv

FEED_FILE = '...'
OUT_FILE = '...'

with open(FEED_FILE, newline='') as infile, open(OUT_FILE, 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=',', quotechar='"')
    # quote every field on output, matching the sample data
    writer = csv.writer(outfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    for row in reader:
        # sort the comma-separated entries inside the 4th field and join them back
        row[3] = ','.join(sorted(row[3].split(',')))
        writer.writerow(row)
That said, there is nothing obviously wrong with your code. There is not much that you can do to speed up awk and sed and the main bash loop doesn't spawn many external processes as far as I can see.
With single awk:
awk 'BEGIN{ FS=OFS="\042,\042" }{ n=split($4,a,","); asort(a); sf=a[1];
for(i=2;i<=n;i++) { sf=sf","a[i] } $4=sf; print $0 }' file > output.csv
output.csv contents:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS,","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA,","QQQQQQQQQRRCGHDKLKSLS"
FS=OFS="\042,\042" - treat "," (quote, comma, quote) as the field separator
n=split($4,a,",") - split the 4th field into array a on "," and remember the element count in n
asort(a) - sort the array values (GNU awk's asort)
Try pandas in python3. Only limitation: the data needs to fit into memory, and that can be a bit larger than your actual data. I sorted CSV files with 30,000,000 rows without any problem using this script, which I quickly wrote:
import pandas as pd
import os, datetime, traceback

L1_DIR = '/mnt/ssd/ASCII/'
suffix = '.csv'

for fname in sorted(os.listdir(L1_DIR)):
    if not fname.endswith(suffix):
        continue
    print("Start processing %s" % fname)
    s = datetime.datetime.now()
    fin_path = os.path.join(L1_DIR, fname)
    fname_out = fname.split('.')[0] + '.csv_sorted'
    fpath_out = os.path.join(L1_DIR, fname_out)
    df = pd.read_csv(fin_path)
    e = datetime.datetime.now()
    print("Read %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
    s = datetime.datetime.now()
    df.set_index('ts', inplace=True)
    e = datetime.datetime.now()
    print("set_index %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
    s = datetime.datetime.now()
    df.sort_index(inplace=True)
    e = datetime.datetime.now()
    print("sort_index %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e-s)))
    s = datetime.datetime.now()
    df.reset_index(inplace=True)
    # This one saves at ~10MB per second to disk. One day is 7.5GB --> 750 seconds or 12.5 minutes
    df.to_csv(fpath_out, index=False)
    e = datetime.datetime.now()
    print("to_csv %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e - s)))

How to compare a field of a file with current timestamp and print the greater and lesser data?

How do I compare the current timestamp with a field of a file and print the matched and unmatched data? I have 2 columns in a file (see below):
oac.bat 09:09
klm.txt 9:00
I want to compare the timestamp (2nd column) with the current time, say 10:00, and print the output as follows.
At 10:00
greater.txt
xyz.txt 10:32
mnp.csv 23:54
Lesser.txt
oac.bat 09:09
klm.txt 9:00
Could anyone help me with this, please?
I used awk '$0 > "10:00"', which gives me only the 2nd column's details, but I want both columns. I am taking the timestamp directly from the system with a variable like:
d=`date +%H:%M`
With GNU awk you can just use its built-in time functions:
awk 'BEGIN{ now = strftime("%H:%M") } {
    split($NF,t,/:/)
    cur=sprintf("%02d:%02d",t[1],t[2])
    print > ((cur > now ? "greater" : "lesser") ".txt")
}' file
With other awks just set now using -v and date up front, e.g.:
awk -v now="$(date +"%H:%M")" '{
    split($NF,t,/:/)
    cur = sprintf("%02d:%02d",t[1],t[2])
    print > ((cur > now ? "greater" : "lesser") ".txt")
}' file
The above is untested since you didn't provide input/output we could test against.
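For instance, with the two sample lines from the question and the comparison time forced to 10:00, both rows would be expected to land in lesser.txt (a quick sanity check on my part, not output from the original answer):
$ awk -v now="10:00" '{ split($NF,t,/:/); cur=sprintf("%02d:%02d",t[1],t[2])
    print > ((cur > now ? "greater" : "lesser") ".txt") }' file
$ cat lesser.txt
oac.bat 09:09
klm.txt 9:00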
Pure Bash
The script can be implemented in pure Bash with the help of the date command:
# Current Unix timestamp
let cmp_seconds=$(date +%s)

# Read file line by line
while IFS= read -r line; do
    let line_seconds=$(date -d "${line##* }" +%s) || continue
    (( line_seconds <= cmp_seconds )) && \
        outfile=lesser || outfile=greater

    # Append the line to the file chosen above
    printf "%s\n" "$line" >> "${outfile}.txt"
done < file
In this script, ${line##* } removes the longest match of the '* ' pattern (anything followed by a space) from the front of $line, thus fetching the last column (the time). The time column is supposed to be in one of the following formats: HH:MM or H:MM. Actually, date's -d option argument
can be in almost any common format. It can contain month names, time zones, ‘am’ and ‘pm’, ‘yesterday’, etc.
We use the flexibility of this option to convert the time (HH:MM, or H:MM) to Unix timestamp.
The let builtin allows arithmetic to be performed on shell variables. If the last let expression fails, or evaluates to zero, let returns 1 (error code), otherwise 0 (success). Thus, if for some reason the time column is in invalid format, the iteration for such line will be skipped with the help of continue.
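A quick illustration of the parameter expansion used above (the sample line is just made up):
$ line='oac.bat 09:09'
$ echo "${line##* }"
09:09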
Perl
Here is a Perl version I have written just for fun. You may use it instead of the Bash version, if you like.
# For current date
#cmp_seconds=$(date +%s)
# For specific hours and minutes
cmp_seconds=$(date -d '10:05' +%s)
perl -e '
    my @t = localtime('$cmp_seconds');
    my $minutes = $t[2] * 60 + $t[1];
    while (<>) {
        / (\d?\d):(\d\d)$/ or next;
        my $fh = ($1 * 60 + $2) > $minutes ? STDOUT : STDERR;
        printf $fh "%s", $_;
    }' < file >greater.txt 2>lesser.txt
The script computes the number of minutes in the following way:
HH:MM = HH * 60 + MM minutes
If the number of minutes from the file is greater than the number of minutes for the current time, it prints the line to standard output, otherwise to standard error. Finally, standard output is redirected to greater.txt, and standard error is redirected to lesser.txt.
I have written this script for demonstration of another approach (algorithm), which can be implemented in different languages, including Bash.

Find nth row using AWK and assign them to a variable

Okay, I have two files: one is a baseline and the other is a generated report. I have to validate that a specific string in both files matches; it is not just a single word, see the example below:
.
.
name os ksd
56633223223
some text..................
some text..................
My search criterion here is to find a unique number such as "56633223223" and retrieve 1 line above and 3 lines below it. I can do that on both the base file and the report, and then compare whether they match. On the whole, I need a shell script for this.
Since the strings above and below are unique but the line count varies, I put them in a file called "actlist":
56633223223 1 5
56633223224 1 6
56633223225 1 3
.
.
Now from "Rcount" below I get how many iterations are to be performed, and in each iteration I have to get the ith row and see if the word count is 3; if it is, then take those values into variables and use something like this.
I'm stuck on which command to use below. I'm thinking of using awk, but if there is anything better, please advise. Here's some pseudo-code showing what I'm trying to do:
xxxxx=/root/xxx/xxxxxxx
Rcount=`wc -l $xxxxx | awk -F " " '{print $1}'`
i=1
while ((i <= Rcount))
do
record=_________________'(Awk command to retrieve ith(1st) record (of $xxxx),
wcount=_________________'(Awk command to count the number of words in $record)
(( i=i+1 ))
done
Note: record, wcount values are later printed to a log file.
Sounds like you're looking for something like this:
#!/bin/bash
while read -r word1 word2 word3 junk; do
    if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
        echo "all good"
    else
        echo "error"
    fi
done < /root/shravan/actlist
This will go through each line of your input file, assigning the three columns to word1, word2 and word3. The -n tests check that read hasn't assigned an empty value to any of those variables. The -z check ensures that there are only three columns, so $junk is empty.
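For example, against the actlist sample shown earlier (calling the script check.sh is my assumption for the demo; it reads /root/shravan/actlist as written):
$ cat /root/shravan/actlist
56633223223 1 5
56633223224 1 6
56633223225 1 3
$ ./check.sh
all good
all good
all good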
I PROMISE you you are going about this all wrong. To find words in file1 and search for those words in file2 and file3 is just:
awk '
NR==FNR{ for (i=1;i<=NF;i++) words[$i]; next }
{ for (word in words) if ($0 ~ word) print FILENAME, word }
' file1 file2 file3
or similar (assuming a simple grep -f file1 file2 file3 isn't adequate). It DOES NOT involve shell loops to call awk to pull out strings to save in shell variables to pass to other shell commands, etc, etc.
So far all you're doing is asking us to help you implement part of what you think is the solution to your problem, but we're struggling to do that because what you're asking for doesn't make sense as part of any kind of reasonable solution to what it sounds like your problem is, so it's hard to suggest anything sensible.
If you tell us what you are trying to do AS A WHOLE, with sample input and expected output for your whole process, then we can help you.
We don't seem to be getting anywhere so let's try a stab at the kind of solution I think you might want and then take it from there.
Look at these 2 files "old" and "new" side by side (line numbers added by the cat -n):
$ paste old new | cat -n
     1  a            b
     2  b            56633223223
     3  56633223223  c
     4  c            d
     5  d            h
     6  e            56633223225
     7  f            i
     8  g            Z
     9  h            k
    10  56633223225  l
    11  i
    12  j
    13  k
    14  l
Now let's take this "actlist":
$ cat actlist
56633223223 1 2
56633223225 1 3
and run this awk command on all 3 of the above files (yes, I know it could be briefer, more efficient, etc. but favoring simplicity and clarity for now):
$ cat tst.awk
ARGIND==1 {
    numPre[$1] = $2
    numSuc[$1] = $3
}
ARGIND==2 {
    oldLine[FNR] = $0
    if ($0 in numPre) {
        oldHitFnr[$0] = FNR
    }
}
ARGIND==3 {
    newLine[FNR] = $0
    if ($0 in numPre) {
        newHitFnr[$0] = FNR
    }
}
END {
    for (str in numPre) {
        if ( str in oldHitFnr ) {
            if ( str in newHitFnr ) {
                for (i=-numPre[str]; i<=numSuc[str]; i++) {
                    oldFnr = oldHitFnr[str] + i
                    newFnr = newHitFnr[str] + i
                    if (oldLine[oldFnr] != newLine[newFnr]) {
                        print str, "mismatch at old line", oldFnr, "new line", newFnr
                        print "\t" oldLine[oldFnr], "vs", newLine[newFnr]
                    }
                }
            }
            else {
                print str, "is present in old file but not new file"
            }
        }
        else if (str in newHitFnr) {
            print str, "is present in new file but not old file"
        }
    }
}
$ awk -f tst.awk actlist old new
56633223225 mismatch at old line 12 new line 8
j vs Z
It's outputting that result because the 2nd line after 56633223225 is j in file "old" but Z in file "new", and the "actlist" file said the 2 files had to be common from one line before until 3 lines after that pattern.
Is that what you're trying to do? The above uses GNU awk for ARGIND but the workaround is trivial for other awks.
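The usual workaround (a common idiom, not something from the original answer) is to count input files as they start instead of relying on ARGIND:
# Portable stand-in for gawk's ARGIND: bump a counter whenever a new file begins.
# Unlike ARGIND this simple form skips empty files, which doesn't matter here.
FNR == 1 { argind++ }
argind == 1 { numPre[$1] = $2; numSuc[$1] = $3 }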
Use the below code:
awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}' filename.txt
I have given the solution as per the requirement.
awk '{                              # awk starts from here and reads the file line by line
    if (NF == 3)                    # Check whether the current line has exactly 3 fields; NF is the number of fields in the current line
    {
        word1=$1;                   # If so, the 1st field is assigned to the word1 variable
        word2=$2;                   # 2nd field is assigned to the word2 variable
        word3=$3;                   # 3rd field is assigned to the word3 variable
        print word1, word2, word3   # Print all 3 fields
    }
}' filename.txt >> output.txt       # These 3 fields are redirected to a file which can be used for further processing.
This is as per the requirement; there are many other ways of doing it, but awk was asked for.
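For instance, on a small made-up file:
$ printf 'name os ksd\n56633223223\nsome text here\n' > filename.txt
$ awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}' filename.txt
Words are:name os ksd
Line 2 is having 1 Words
Words are:some text here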
