awk: data missed while parsing file - shell

I have written a script to parse hourly log files to extract "CustomerId, Marketplace, StartTime, and DealIdClicked" data. The log file structure is like so:
------------------------------------------------------------------------
Size=0 bytes
scheme=https
StatusCode=302
RequestId=request_Id_X07
CustomerId=XYZCustomerId
Marketplace=MarketPlace
StartTime=1592931599.986
Program=Unknown
Info=sub-page-type=desktop:Deals_Content_DealIdClicked_0002,sub-page-CSMTags=UTF-8
Counters=sub-page-type=desktop:Deals_Content_DealIdClicked_0002=3,sub-page-CSMTags=Encoding:UTF-8
EOE
------------------------------------------------------------------------
Here is the script I have written to parse the log.
function readServiceLog() {
    local _logfile="$1"
    local _csvFile="$2"
    local _logFileName=$(getLogFileName "$_logfile")
    parseLogFile "$_logfile" "$_csvFile"
    echo "$_logFileName" >>"$SCRIPT_PATH/excludeFile.txt"
}
# Function to match regex and extract required data.
function parseLogFile() {
    local _logfile=$1
    local _csvFile=$2
    zcat <"$_logfile" | awk -v csvFilePath="$_csvFile" '
    BEGIN {
        customerIdRegex="^CustomerId="
        marketplaceIdRegex="^MarketplaceId="
        startTimeRegex="^StartTime="
        InfoRegex="^Info="
        dealIdRegex="Deals_Content_DealIdClicked_"
        EOERegex="^EOE$"
        delete RECORD
    }
    {
        logLine=$0
        if (match(logLine,InfoRegex)) {
            after = substr(logLine,RSTART+RLENGTH);
            if (match(after, dealIdRegex)) {
                afterDeal = substr(after,RSTART+RLENGTH);
                dealId = substr(afterDeal, 1, index(afterDeal,",")-1)
                RECORD[0] = dealId
            }
        }
        if (match(logLine,customerIdRegex)) {
            after = substr(logLine,RSTART+RLENGTH);
            customerid = substr(after, 1, length(after))
            RECORD[1] = customerid
        }
        if (match(logLine,startTimeRegex)) {
            after = substr(logLine,RSTART+RLENGTH);
            startTime = substr(after, 1, length(after))
            RECORD[2] = startTime
        }
        if (match(logLine,marketplaceIdRegex)) {
            after = substr(logLine,RSTART+RLENGTH);
            marketplaceId = substr(after, 1, length(after))
            RECORD[3] = marketplaceId
        }
        if (match(logLine,EOERegex)) {
            if (length(RECORD) == 4) {
                printf("%s,%s,%s,%s\n", RECORD[0],RECORD[1],RECORD[2],RECORD[3]) >> csvFilePath
            }
            delete RECORD
        }
    }'
}
function processHourlyFile() {
    local _currentProcessingFolder=$1
    local _outputFolder=$(getOutputFolderName)  # getOutputFolderName function is from a util class.
    mkdir -p "$_outputFolder"
    local _csvFileName="$_outputFolder/${_currentProcessingFolder##*/}.csv"
    for entry in "$_currentProcessingFolder"/*; do
        if [[ "$entry" == *"$SERVICE_LOG"* ]]; then
            readServiceLog "$entry" "$_csvFileName"
        fi
    done
}
# Main execution to spawn new processes for parallel parsing.
function main() {
    local _processCount=1
    for entry in $INPUT_LOG_PATH/*; do
        processHourlyFile $entry &
        pids[${_processCount}]=$!
    done
    printInfo
    # wait for all pids
    for pid in ${pids[*]}; do
        wait $pid
    done
}
main
printf '\nFinished!\n'
Expected output:
A comma-separated file.
0002,XYZCustomerId,1592931599.986,MarketPlace
Problem
The script spawns 24 processes to parse the 24 hourly logs for an entire day. After parsing the files, I verified the record counts, and sometimes they don't match the record counts of the original log files.
I have been stuck on this for the last two days with no luck. Any help would be appreciated.
Thanks in advance.

Try:
awk -F= '
{
    a[$1] = $2
}
/^Info/ {
    sub(/.*DealIdClicked_/, "")
    sub(/,.*/, "")
    print $0, a["CustomerId"], a["StartTime"], a["Marketplace"]
    delete a
}' OFS=, filename
When run on your input file, the above produces the desired output:
0002,XYZCustomerId,1592931599.986,MarketPlace
How it works
-F= tells awk to use = as the field separator on input.
{ a[$1]=$2 } tells awk to save the second field, $2, in associative array a under the key $1.
/^Info/ { ... } tells awk to perform the commands in curly braces whenever the line starts with Info. Those commands are:
sub(/.*DealIdClicked_/, "") removes all parts of the line up to and including DealIdClicked_.
sub(/,.*/, "") tells awk to remove from what's left of the line everything from the first comma to the end of the line.
The remainder of the line, still called $0, is the "DealId" that we want.
print $0, a["CustomerId"], a["StartTime"], a["Marketplace"] tells awk to print the output that we want.
delete a deletes array a so that we start over clean with the next record.
OFS=, tells awk to use a comma as the field separator on output.
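If you want to slot this back into the original script, here is a minimal sketch of parseLogFile using the same awk program (function and variable names are taken from the question; zcat is kept for the gzipped input):
function parseLogFile() {
    local _logfile=$1
    local _csvFile=$2
    # same awk as above; append the CSV rows to the output file
    zcat <"$_logfile" | awk -F= '
    {
        a[$1] = $2
    }
    /^Info/ {
        sub(/.*DealIdClicked_/, "")
        sub(/,.*/, "")
        print $0, a["CustomerId"], a["StartTime"], a["Marketplace"]
        delete a
    }' OFS=, >>"$_csvFile"
}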

Related

Retrieve specific values from file

I have a file test.cf containing:
process {
    withName : teq {
        file = "/path/to/teq-0.20.9.txt"
    }
}
process {
    withName : cad {
        file = "/path/to/cad-4.0.txt"
    }
}
process {
    withName : sik {
        file = "/path/to/sik-20.0.txt"
    }
}
I would like to retrieve the value associated with teq, cad and sik at the end of each block.
I was first thinking about something like
grep -E 'teq' test.cf
and get only the matching row, and then remove the repeated part of the line.
But it may be easier to do something like:
for a in test.cf
do
    line=$(sed -n '{$a}p' test.cf)
    if line=teq
        #next line using sed -n?
        do print nextline &> teq.txt
    else if line=cad
        do print nextline &> cad.txt
    else if line=sik
        do print nextline &> sik.txt
done
(obviously it doesn't work)
EDIT:
output wanted:
teq.txt containing teq-0.20.9, cad.txt containing cad-4.0 and sik.txt containing sik-20.0
Is there a good way to do that? Thank you for your comments
Based on your given sample:
awk '/withName/{close(f); f=$3 ".txt"}
     /file/{sub(/.*\//, ""); sub(/\.txt".*/, ""); print > f}' ip.txt
/withName/{close(f); f=$3 ".txt"} if line contains withName, save filename in f using the third field. close() will close any previous file handle
/file/{sub(/.*\//, ""); sub(/\.txt".*/, ""); if line contains file, remove everything except the value required
print > f print the modified line and redirect to filename in f
if you can have multiple entries, use >> instead of >
Here is another solution in awk:
awk '/withName/{name=$3} /file =/{v=$3; sub(/.*\//, "", v); sub(/\.txt".*/, "", v); print v > (name ".txt")}' test.cf
/withName/{name=$3}: when I see the line containing "withName", I save that name
When I see the line with "file =", I strip the third field down to the wanted value and print it to a file named after the saved name. (The parentheses around name ".txt" are needed for portable redirection.)
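Either way, the result can be checked against the sample test.cf (expected contents taken from the question; retrieve.awk is a hypothetical name for whichever script you saved):
$ awk -f retrieve.awk test.cf
$ cat teq.txt cad.txt sik.txt
teq-0.20.9
cad-4.0
sik-20.0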

awk or other shell to convert delimited list into a table

So what I have is a huge csv akin to this:
Pool1,Shard1,Event1,10
Pool1,Shard1,Event2,20
Pool1,Shard2,Event1,30
Pool1,Shard2,Event4,40
Pool2,Shard1,Event3,50
etc
Which is not easily readable. With there being only 4 types of events, I'm using spreadsheets to convert this into the following:
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
Pool2,Shard1,,,50,
Only the events are limited to 4; pools and shards can be indefinite, really. And events may be missing from some lines - not all pools/shards have all 4 events every day.
So I tried doing this within an awk in the shell script that gathers the csv in the first place, but I'm failing spectacularly - I can't even show working code, since it produces zero results.
Basically, I tried sorting the CSV, reading the first two fields of a row and comparing them to the previous row; if they match, comparing the third field to a set array of event strings and storing the fourth field in a variable for that event; once the first two fields no longer match, finally printing the whole line including the variables.
Sorry for the one-liner; I've been testing and experimenting directly in the command line. It's embarrassing - it does nothing.
awk -F, '{if (a==$1&&b==$2) {if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4}} else {printf $a","$b","$r","$d","$p","$t"\n"; a=$1 ; b=$2 ; if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4} ; a=$1; b=$2}} END {printf "\n"}'
You could simply use an assoc array: awk -F, -f parse.awk input.csv with parse.awk being:
{
    sub(/Event/, "", $3);
    res[$1","$2][$3] = $4;
}
END {
    for (name in res) {
        printf("%s,%s,%s,%s,%s\n", name, res[name][1], res[name][2], res[name][3], res[name][4])
    }
}
awk may shuffle the output order, but my test output is:
Pool2,Shard1,,,50,
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
PS: Please use an editor to write awk source code. Your one-liner is really hard to read. Since I used a different approach, I did not even try to get it "right"... ;)
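Note that res[$1","$2][$3] is an array of arrays, a GNU awk (gawk 4+) feature. If the shuffled order bothers you, gawk can also walk the array in sorted key order; here is a minimal sketch of the same END block, assuming gawk:
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # iterate keys in ascending string order
    for (name in res) {
        printf("%s,%s,%s,%s,%s\n", name, res[name][1], res[name][2], res[name][3], res[name][4])
    }
}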
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $1 OFS $2 }
key != prev {
    if ( NR>1 ) {
        print prev, f["Event1"], f["Event2"], f["Event3"], f["Event4"]
        delete f
    }
    prev = key
}
{ f[$3] = $4 }
END { print key, f["Event1"], f["Event2"], f["Event3"], f["Event4"] }
The input must be sorted so that all rows for a given Pool/Shard key are adjacent; tst.awk then prints the buffered event fields each time the key changes:
$ sort file | awk -f tst.awk
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
Pool2,Shard1,,,50,

AWK print block that does NOT contain specific text

I have the following data file:
variable "ARM_CLIENT_ID" {
description = "Client ID for Service Principal"
}
variable "ARM_CLIENT_SECRET" {
description = "Client Secret for Service Principal"
}
# [.....loads of code]
variable "logging_settings" {
description = "Logging settings from TFVARs"
}
variable "azure_firewall_nat_rule_collections" {
default = {}
}
variable "azure_firewall_network_rule_collections" {
default = {}
}
variable "azure_firewall_application_rule_collections" {
default = {}
}
variable "build_route_tables" {
description = "List of Route Table keys that need direct internet prior to Egress FW build"
default = [
"shared_services",
"sub_to_afw"
]
}
There are 2 things I wish to do:
print the variable names without the inverted commas
ONLY print the variable names if the code block does NOT contain default
I know I can print the variable names like so: awk '{ gsub("\"", "") }; (/variable/ && $2 !~ /^ARM_/) { print $2}'
I know I can print the code blocks with: awk '/variable/,/^}/', which results in:
# [.....loads of code output before this]
variable "logging_settings" {
  description = "Logging settings from TFVARs"
}
variable "azure_firewall_nat_rule_collections" {
  default = {}
}
variable "azure_firewall_network_rule_collections" {
  default = {}
}
variable "azure_firewall_application_rule_collections" {
  default = {}
}
variable "build_route_tables" {
  description = "List of Route Table keys that need direct internet prior to Egress FW build"
  default = [
    "shared_services",
    "sub_to_afw"
  ]
}
However, I cannot figure out how to print the code blocks only if they don't contain default. I know I will need an if statement, and perhaps some variables, but I am unsure how.
This code block, for example, should NOT appear in the output from which I grab the variable names:
variable "build_route_tables" {
description = "List of Route Table keys that need direct internet prior to Egress FW build"
default = [
"shared_services",
"sub_to_afw"
]
}
The end output should NOT contain those that had default:
# [.....loads of code output before this]
expressroute_settings
firewall_settings
global_settings
peering_settings
vnet_transit_object
vnet_shared_services_object
route_tables
logging_settings
Preferably, I would like to keep this to a single AWK command or file, with no piping; I have uses for this where piping is not wanted.
EDIT: updated the ideal outputs (I had missed some examples of those with default)
Assumptions and collection of notes from OP's question and comments:
all variable definition blocks end with a right brace (}) in the first column of a new line
we only display variable names (sans the double quotes)
we do not display the variable names if the body of the variable definition contains the string default
we do not display the variable name if it starts with the string ARM_
One (somewhat verbose) awk solution:
NOTE: I've copied the sample input data into my local file variables.dat
awk -F'"' ' # use double quotes as the input field separator
/^variable / && $2 !~ "^ARM_" { varname = $2 # if line starts with "^variable ", and field #2 is not like "^ARM_", save field #2 for later display
printme = 1 # enable our print flag
}
/variable/,/^}/ { if ( $0 ~ "default" ) # within the range of a variable definition, if we find the string "default" ...
printme = 0 # disable the print flag
next # skip to next line
}
printme { print varname # if the print flag is enabled then print the variable name and then ...
printme = 0 # disable the print flag
}
' variables.dat
This generates:
logging_settings
$ awk -v RS= '!/default =/{gsub(/"/,"",$2); print $2}' file
ARM_CLIENT_ID
ARM_CLIENT_SECRET
[.....loads
logging_settings
-v RS= puts awk into paragraph mode, so each blank-line-separated block is one record; any record that does not contain default = has the quotes stripped from its second field, which is then printed. Of course the output doesn't match yours, since your expected output is inconsistent with the posted input data.
Using GNU awk:
awk -v RS="}" '/variable/ && !/default/ && !/ARN/ { var=gensub(/(^.*variable ")(.*)(".*{.*)/,"\\2",$0);print var }' file
Set the record separator to "}" and then check for records that contain "variable", don't contain default and don't contain "ARM". Use gensub to split the string into three sections based on regular expressions and set the variable var to the second section. Print the var variable.
Output:
logging_settings
Another variation on awk, using a skip variable to control the array index that holds the variable names:
awk '
/^[[:blank:]]*#/ { next }
$1=="variable" { gsub(/["]/,"",$2); vars[skip?n:++n]=$2; skip=0 }
$1=="default" { skip=1 }
END { if (skip) n--; for(i=1; i<=n; i++) print vars[i] }
' code
The first rule just skips comment lines. If you want to skip "ARM_" variables, then you can add a test on $2.
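For instance, the test could be added like this (a sketch; note that gsub has already stripped the quotes by the time $2 is tested):
$1=="variable" { gsub(/["]/,"",$2); if ($2 ~ /^ARM_/) next; vars[skip?n:++n]=$2; skip=0 }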
Example Use/Output
With your example code in code, all variables without default are:
$ awk '
> /^[[:blank:]]*#/ { next }
> $1=="variable" { gsub(/["]/,"",$2); vars[skip?n:++n]=$2; skip=0 }
> $1=="default" { skip=1 }
> END { if (skip) n--; for(i=1; i<=n; i++) print vars[i] }
> ' code
ARM_CLIENT_ID
ARM_CLIENT_SECRET
logging_settings
Here's another maybe shorter solution.
$ awk -F'"' '/^variable/&&$2!~/^ARM_/{v=$2} /default =/{v=0} /}/&&v{print v; v=0}' file
logging_settings

bash scripting: how to convert a log with keys into csv

I have a log formatted like a table (continuation lines are indented):
ge-1/0/0.0     up   down inet     10.100.100.1/24
                                  multiservice
ge-1/0/2.107   up   up   inet     10.187.132.193/27
                                  10.187.132.194/27
                                  multiservice
ge-1/1/4       up   up
ge-1/1/5.0     up   up   inet     10.164.69.209/30
                                  iso
                                  mpls
                                  multiservice
How can we convert it to csv format like below?
ge-1/0/0.0,up,down,inet|multiservice,10.100.100.1/24
ge-1/0/2.107,up,up,inet|multiservice,"10.187.132.193/27,10.187.132.194/27"
ge-1/1/4,up,up
ge-1/1/5.0,up,up,inet|iso|mpls|multiservice,10.164.69.209/30
I've tried grep interfacename -A4, but it displays other interfaces' information as well.
#!/bin/bash
show() {
    [ "$ge" ] || return
    [ "$add_quotes" ] && iprange="\"$iprange\""
    out="$ge,$upd1,$upd2,$service,$iprange"
    out="${out%%,}"
    echo "${out%%,}"
}

while read line
do
    case "$line" in
        ge*)
            show
            read ge upd1 upd2 service iprange < <( echo "$line" )
            add_quotes=
            ;;
        [0-9]*)
            iprange="$iprange,$line"
            add_quotes=Y
            ;;
        *)
            service="$service|$line"
            ;;
    esac
done

# Show last record
show
With your sample data provided as stdin, this script returns:
ge-1/0/0.0,up,down,inet|multiservice,10.100.100.1/24
ge-1/0/2.107,up,up,inet|multiservice,"10.187.132.193/27,10.187.132.194/27"
ge-1/1/4,up,up
ge-1/1/5.0,up,up,inet|iso|mpls|multiservice,10.164.69.209/30
How it works: This script reads from stdin line by line (while read line). Each line is then classified into one of three types: (a) a new record (i.e. a line that starts with "ge-"), (b) a continuation record that provides another IP range (i.e. a record that starts with a number), or (c) a continuation line that provides another service (i.e. a record that starts with a letter). Taking these cases in turn:
(a) When the line contains the start of a new record, that means that the previous record has ended, so we print it out with the show function. Then we read from the new line the five columns that I have named: ge upd1 upd2 service iprange. And, we reset the add_quotes variable to empty.
(b) When the line contains just another IP range, we add that to the current IP range. As per the example in the question, combinations of two or more IP ranges are separated by a comma and enclosed in quotes. Thus, we set add_quotes to "Y".
(c) When the line contains an additional service, we add that to the service variable. As per the example in the question, two services are separated by a vertical bar "|" and no quotes are used.
The show function first checks that there is a record to show by testing that the ge variable is non-empty; if it is empty, the return statement exits the function without processing any further statements. If $ge was non-empty, the function proceeds to the next statement, which adds quotes around the IP range variable if they are needed. It then joins the variables with commas separating them, removes trailing commas (as per the example in the question), and sends the result to stdout.
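Assuming the script is saved as convert.sh and the log is in log.txt (both hypothetical names), a typical invocation would be:
$ bash convert.sh < log.txt > interfaces.csv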
parselog.awk
#!/usr/bin/gawk -f
BEGIN {
    # each record is one unindented line plus any following space-indented continuation lines
    RS = "[^\n]*\n( [^\n]*\n)*"
    OFS = ","
}

length(RT) > 0 {
    $0 = RT  # See: http://stackoverflow.com/a/11917783/27581
    opts = ""
    ips = ""
    for (i = 4; i <= NF; ++i) {
        if (isIP($i)) {
            ips = append(ips, $i, ",")
        } else {
            opts = append(opts, $i, "|")
        }
    }
    print $1, $2, $3, opts, "\"" ips "\""
}

function isIP(str) {
    return str ~ /^[0-9]/
}

function append(list, val, separator) {
    if (length(list) > 0) {
        list = list separator
    }
    return list val
}
Usage
$ ./parselog.awk < log.txt
ge-1/0/0.0,up,down,inet|multiservice,"10.100.100.1/24"
ge-1/0/2.107,up,up,inet|multiservice,"10.187.132.193/27,10.187.132.194/27"
ge-1/1/4,up,up,,""
ge-1/1/5.0,up,up,inet|iso|mpls|multiservice,"10.164.69.209/30"

Shell script to combine three files using AWK

I have three files G_P_map.txt, G_S_map.txt and S_P_map.txt. I have to combine these three files using awk. The example contents are the following -
(G_P_map.txt contains)
test21g|A-CZ|1mos
test21g|A-CZ|2mos
...
(G_S_map.txt contains)
nwtestn5|A-CZ
nwtestn6|A-CZ
...
(S_P_map.txt contains)
3mos|nwtestn5
4mos|nwtestn6
Expected Output :
1mos, 3mos
2mos, 4mos
Here is the code I tried. I was able to combine the first two files, but I couldn't combine the result with the third one.
awk -F"|" 'NR==FNR {file1[$1]=$1; next} {$2=file[$1]; print}' G_S_map.txt S_P_map.txt
Any ideas/help is much appreciated. Thanks in advance!
I would look at a combination of join and cut.
GNU AWK (gawk) 4 has BEGINFILE and ENDFILE which would be perfect for this. However, the gawk manual includes a function that will provide this functionality for most versions of AWK.
#!/usr/bin/awk -f
BEGIN {
    FS = "|"
    OFS = ", "   # match the expected "1mos, 3mos" output format
}

function beginfile(ignoreme) {
    files++
}

function endfile(ignoreme) {
    # endfile() would be defined here if we were using it
}

FILENAME != _oldfilename {
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}

END { endfile(FILENAME) }

files == 1 { # save all the key, value pairs from file 1
    file1[$2] = $3
    next
}

files == 2 { # save all the key, value pairs from file 2
    file2[$1] = $2
    next
}

files == 3 { # perform the lookup and output
    print file1[file2[$2]], $1
}

# Place the regular END block here, if needed. It would be in addition to the one above (there can be more than one).
Call the script like this:
./scriptname G_P_map.txt G_S_map.txt S_P_map.txt
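If gawk 4+ is guaranteed to be available, here is a minimal sketch of the same idea using BEGINFILE directly (same file order and lookup logic as above; the ", " literal matches the question's expected output format):
gawk -F'|' '
    BEGINFILE { files++ }
    files == 1 { file1[$2] = $3; next }
    files == 2 { file2[$1] = $2; next }
    files == 3 { print file1[file2[$2]] ", " $1 }
' G_P_map.txt G_S_map.txt S_P_map.txt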
