bash scripting: how to convert a log with keys into csv - bash

I have a log file with a table-like format:
ge-1/0/0.0 up down inet 10.100.100.1/24
  multiservice
ge-1/0/2.107 up up inet 10.187.132.193/27
  10.187.132.194/27
  multiservice
ge-1/1/4 up up
ge-1/1/5.0 up up inet 10.164.69.209/30
  iso
  mpls
  multiservice
How can I convert it to a csv format like below?
ge-1/0/0.0,up,down,inet|multiservice,10.100.100.1/24
ge-1/0/2.107,up,up,inet|multiservice,"10.187.132.193/27,10.187.132.194/27"
ge-1/1/4,up,up
ge-1/1/5.0,up,up,inet|iso|mpls|multiservice,10.164.69.209/30
I've tried grep interfacename -A4, but it displays other interfaces' information.

#!/bin/bash

# Print the record accumulated so far (if any).
show() {
    [ "$ge" ] || return
    [ "$add_quotes" ] && iprange="\"$iprange\""
    out="$ge,$upd1,$upd2,$service,$iprange"
    out="${out%%,}"      # strip a trailing comma (no IP range)
    echo "${out%%,}"     # strip another (no service either)
}

while read -r line     # read trims the indentation from continuation lines
do
    case "$line" in
        ge*)           # a new interface record begins...
            show       # ...so print the previous one first
            read -r ge upd1 upd2 service iprange <<< "$line"
            add_quotes=
            ;;
        [0-9]*)        # a continuation line with another IP range
            iprange="$iprange,$line"
            add_quotes=Y
            ;;
        *)             # a continuation line with another service
            service="$service|$line"
            ;;
    esac
done

# Show the last record
show
With your sample data provided as stdin, this script returns:
ge-1/0/0.0,up,down,inet|multiservice,10.100.100.1/24
ge-1/0/2.107,up,up,inet|multiservice,"10.187.132.193/27,10.187.132.194/27"
ge-1/1/4,up,up
ge-1/1/5.0,up,up,inet|iso|mpls|multiservice,10.164.69.209/30
How it works: This script reads from stdin line by line (while read line). Each line is then classified into one of three types: (a) a new record (i.e. a line that starts with "ge-"), (b) a continuation record that provides another IP range (i.e. a record that starts with a number), or (c) a continuation line that provides another service (i.e. a record that starts with a letter). Taking these cases in turn:
(a) When the line contains the start of a new record, that means that the previous record has ended, so we print it out with the show function. Then we read from the new line the five columns that I have named: ge upd1 upd2 service iprange. And, we reset the add_quotes variable to empty.
(b) When the line contains just another IP range, we add that to the current IP range. As per the example in the question, combinations of two or more IP ranges are separated by a comma and enclosed in quotes. Thus, we set add_quotes to "Y".
(c) When the line contains an additional service, we add that to the service variable. As per the example in the question, two services are separated by a vertical bar "|" and no quotes are used.
The function show first checks to make sure that there is a record to show by checking that the ge variable is non-empty. If it is empty, then the return statement is executed so that the function exits (returns) without processing any of its further statements. If $ge was non-empty, the function proceeds to the next statement which adds quotes around the IP range variable if they are needed. It then combines the variables with commas separating them, removes trailing commas (as per the example in the question), and sends the result to stdout.
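If you save this script as, say, parse_interfaces.sh (a name chosen here just for illustration), a typical invocation would be:
chmod +x parse_interfaces.sh
./parse_interfaces.sh < interfaces.log > interfaces.csv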

parselog.awk
#!/usr/bin/gawk -f

BEGIN {
    RS = "[^\n]*\n( [^\n]*\n)*"
    OFS = ","
}

length(RT) > 0 {
    $0 = RT   # See: http://stackoverflow.com/a/11917783/27581
    opts = ""
    ips = ""
    for (i = 4; i <= NF; ++i) {
        if (isIP($i)) {
            ips = append(ips, $i, ",")
        } else {
            opts = append(opts, $i, "|")
        }
    }
    print $1, $2, $3, opts, "\"" ips "\""
}

function isIP(str) {
    return str ~ /^[0-9]/
}

function append(list, val, separator) {
    if (length(list) > 0) {
        list = list separator
    }
    return list val
}
Usage
$ ./parselog.awk < log.txt
ge-1/0/0.0,up,down,inet|multiservice,"10.100.100.1/24"
ge-1/0/2.107,up,up,inet|multiservice,"10.187.132.193/27,10.187.132.194/27"
ge-1/1/4,up,up,,""
ge-1/1/5.0,up,up,inet|iso|mpls|multiservice,"10.164.69.209/30"

Related

Reverse complement SOME sequences in fasta file

I've been reading lots of helpful posts about reverse complementing sequences, but I've got what seems to be an unusual request. I'm working in bash and I have DNA sequences in fasta format on stdout that I'd like to pass on down the pipe. The seemingly unusual bit is that I'm trying to reverse complement SOME of those sequences, so that the output has all the sequences in the same direction (for multiple sequence alignment later).
My fasta headers end in either "C" or "+". I'd like to reverse complement the ones that end in "C". Here's a little subset:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
I know there are lots of ways to reverse complement out there, like:
echo ACCTTGAAA | tr ACGTacgt TGCAtgca | rev
and
seqtk seq -r in.fa > out.fa
But I'm not sure how to do this for only those sequences that have a C at the end of the header. I think awk or sed is probably the ticket, but I'm at a loss as to how to actually code it. I can get the sequence headers with awk, like:
awk '/^>/ { print $0 }'
>chr1:84518073-84524089C
>chr1:86214203-86220231+
But if someone could help me figure out how to turn that awk statement into one that asks "if the last character in the header has a C, do this!" that would be great!
Edited to add:
I was so tired when I made this post, I apologize for not including my desired output. Here is what I'd like the output to look like, using my little example:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
You can see the sequence that ends in + is unchanged, but the sequence with a header that ends in C is reverse complemented.
Thanks!
An earlier answer (by Ed Morton) uses a self-contained awk procedure to selectively reverse-complement sequences following a comment line ending with "C". Although I think that is the best approach, I will offer an alternative that might have wider applicability.
The procedure here uses awk's system() function to send data extracted from the fasta file in awk to the shell where the sequence can be processed by any of the many shell applications existing for sequence manipulation.
I have defined an awk user function to pass the isolated sequence from awk to the shell. It can be called from any part of the awk procedure:
function processSeq(s) {
    system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev")
}
The argument of the system function is a string containing the command you would type into terminal to achieve the desired outcome (in this case I've used one of the example reverse-complement routines mentioned in the question). The parts to note are the correct escaping of quote marks that are to appear in the shell command, and the variable s that will be substituted for the sequence string assigned to it when the function is called. The value of s is concatenated with the strings quoted before and after it in the argument to system() shown above.
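As a throwaway illustration of that quoting (not part of the procedure below; the sequence AACC is an arbitrary example), this one-liner passes an awk variable through the shell pipeline:
awk 'BEGIN { s = "AACC"; system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev") }'
It prints GGTT, the reverse complement of AACC.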
isolating the required sequences
The rest of the procedure addresses how to achieve:
"if the last character in the header has a C, do this"
Before making use of shell applications, awk needs to isolate the part(s) of the file to process. In general terms, awk employs one or more pattern/action blocks where only records (lines by default) that match a given pattern are processed by the subsequent action commands. For example, the following illustrative procedure performs the action of printing the whole line (print $0) if the pattern /^>/ && /C$/ is true for that line (where /^>/ looks for ">" at the start of a line and /C$/ looks for "C" at the end of the same line):
/^>/ && /C$/{ print $0 }
For the current needs, the sequence begins on the next record (line) after any record beginning with > and ending with C. One way of referencing that next line is to set a variable (named line in my example) when the C line is encountered, and then to establish a later pattern for the record whose number is one more than the line variable.
Because fasta sequences may extend over several lines, we have to accumulate several successive lines following a C title line. I have achieved this by concatenating each line following the C title line until a record beginning with > is encountered again (or until the end of the file is reached, using the END block).
In order that sequence lines following a non-C title line are ignored, I have used a variable named flag with values of either "do" or "ignore" set when a title record is encountered.
The call to the custom function processSeq(), which employs the system() command, is made at the beginning of a C title action block if the variable seq holds an accumulated sequence (and in the END block for relevant sequences that occur at the end of the file, where there will be no following title line).
Test file and procedure
A modified version of your example fasta was used to test the procedure. It contains an extra relevant C record with three and-a-bit lines instead of two, and an extra irrelevant + record.
seq.fasta:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chranotherC
aatgaagtatattcagaatgtagaacattaactgacccgccatgttaatc
aatatctataagaccttttaaaaagcaccttagagattcaataaagtcag
gaagtatattcagaatgtagaacattaactgactaagaccttttaacatg
gcattgact
procedure
awk '
/^>/ && /C$/ {
    if (length(seq) > 0) { processSeq(seq); seq = "" }
    line = NR; print $0; flag = "do"; next
}
/^>/ { line = NR; flag = "ignore" }
NR > 1 && NR == (line+1) && (flag == "do") { seq = seq $0; line = NR; next }
function processSeq(s) {
    system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev")
}
END { if (length(seq) > 0) processSeq(seq) }
' seq.fasta
output
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagttgtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
>chranotherC
agtcaatgccatgttaaaaggtcttagtcagttaatgttctacattctgaatatacttcctgactttattgaatctctaaggtgctttttaaaaggtcttatagatattgattaacatggcgggtcagttaatgttctacattctgaatatacttcatt
Tested using GNU Awk 5.1.0 on a Raspberry Pi 400.
performance note
Because calling system() creates a subshell, this process will be slower than a self-contained awk procedure. It might be useful where existing shell routines are already available, or are tricky to reproduce with custom awk routines.
Edit: modification to include unaltered + records
This version repeats some earlier blocks, with minor changes, to handle printing of the lines that are not to be reverse-complemented (the changes should be self-explanatory if the main explanations were understood).
awk '
/^>/ && /C$/ {
    if (length(seq) > 0 && flag == "do") { processSeq(seq) } else { print seq }
    seq = ""; line = NR; print $0; flag = "do"; next
}
/^>/ {
    if (length(seq) > 0 && flag == "do") { processSeq(seq) } else { print seq }
    seq = ""; print $0; line = NR; flag = "ignore"
}
NR > 1 && NR == (line+1) { seq = seq $0; line = NR; next }
function processSeq(s) {
    system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev")
}
END { if (length(seq) > 0 && flag == "do") { processSeq(seq) } else { print seq } }
' seq.fasta
Using any awk:
$ cat tst.awk
/^>/ {
    if ( NR > 1 ) {
        prt()
    }
    head = $0
    tail = ""
    next
}
{ tail = ( tail == "" ? "" : tail ORS ) $0 }
END { prt() }

function prt(   type) {
    type = substr(head, length(head), 1)
    tail = ( type == "C" ? rev( tr( tail, "ACGTacgt TGCAtgca" ) ) : tail )
    print head ORS tail
}

function tr(oldStr, trStr,   i, lgth, char, newStr) {
    if ( !_trSeen[trStr]++ ) {
        lgth = (length(trStr) - 1) / 2
        for ( i=1; i<=lgth; i++ ) {
            _trMap[trStr, substr(trStr,i,1)] = substr(trStr, lgth+1+i, 1)
        }
    }
    lgth = length(oldStr)
    for ( i=1; i<=lgth; i++ ) {
        char = substr(oldStr, i, 1)
        newStr = newStr ( (trStr, char) in _trMap ? _trMap[trStr, char] : char )
    }
    return newStr
}

function rev(oldStr,   i, lgth, char, newStr) {
    lgth = length(oldStr)
    for ( i=1; i<=lgth; i++ ) {
        char = substr(oldStr, i, 1)
        newStr = char newStr
    }
    return newStr
}
$ awk -f tst.awk file
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
This might work for you (GNU sed):
sed -nE ':a;p;/^>.*C$/!b
:b;n;/^>/ba;s/^/\n/;y/ACGTacgt/TGCAtgca/
:c;tc;/\n$/{s///p;bb};s/(.*)\n(.)/\2\1\n/;tc' file
Print the current line and then inspect it.
If the line does not begin with > and end with C, bail out and repeat.
Otherwise, fetch the next line and if it begins with >, repeat the above line.
Otherwise, insert a newline (to use as a pivot point when reversing the line), complement the code of the line using a translation command. Then set about reversing the line, character by character until the inserted newline makes its way to the end of the line.
Remove the newline, print the result and repeat the line above.
N.B. The n command will terminate the script when it is executed after the last line has been read.
Since the OP has amended the output, the desired result is now one where the whole of the sequence is complemented and then reversed. Here is another solution that I believe follows these criteria.
sed -nE ':a;p;/^>.*C$/!b
:b;n;/^>/!{H;$!bb};x;y/ACGTacgt\n/TGCAtgca%/;s/%/\n/
:c;tc;s/\n$//;td;s/(.*)\n(.)/\2\1\n/;tc
:d;y/%/\n/;p;z;x;$!ba' file

(sed/awk) extract values from text file and write to csv (no pattern)

I have (several) large text files from which I want to extract some values to create a csv file with all of these values.
My current solution is to have a few different calls to sed, from which I save the values, and then a python script in which I combine the data from the different files into a single csv file. However, this is quite slow and I want to speed it up.
The file, let's call it my_file_1.txt, has a structure that looks something like this:
lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...
and I would like to construct something like
file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...
How can I get the results I want? It doesn't have to be sed or awk, as long as I don't need to install something new and it is reasonably fast.
I don't really have any experience with awk. With sed my best guess would be
filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
h
$!N
/.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
D
T
G
P
' $filename | sed -z 's/,\n/,/' >> my_data.csv
and then deal with not getting the run number. Furthermore, this is not quite correct, as the N will gobble up some "start value" lines, leading to wrong results. It feels like it could be done more easily with awk.
It is similar to 8992158 but I can't use that pattern and I know too little awk to rewrite it.
Solution (Edit)
I was not general enough in my description of the problem, so I changed it up a bit and fixed some inconsistencies.
Awk (Rusty Lemur's answer)
Here I generalised from knowing that the numbers were at the end of the line to using gensub. For this I should have specified the version of awk, as gensub is not available in all awks (it is GNU awk specific).
BEGIN {
    counter = 1
    OFS = ","   # This is the output field separator used by the print statement
    print "file", "start", "stop", "epoch", "run"   # Print the header line
}
/start value/ {
    startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0)
}
/epoch/ {
    epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0)
}
/stop value/ {
    stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0)
    # we have everything to print our line
    print FILENAME, startValue, stopValue, epoch, counter
    counter = counter + 1
    startValue = ""   # clear variables so they aren't maintained through the next iteration
    epoch = ""
}
I accepted this answer because it is the most understandable.
Sed (potong's answer)
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^.*start value/{:a;N;/\n.*stop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt | sed '1!N;s/\n//'
It's not clear how you'd get exactly the output you provided from the input you provided but this may be what you're trying to do (using any awk in any shell on every Unix box):
$ cat tst.awk
BEGIN {
    OFS = ","
    print "file", "start", "stop", "epoch", "run"
}
{ f[$1] = $NF }
$1 == "stop" {
    print FILENAME, f["start"], f["stop"], f["epoch"], ++run
    delete f
}
$ awk -f tst.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,N,1
my_file_1.txt,345,456,M,2
awk's basic structure is:
read a record from the input (by default a record is a line)
evaluate conditions
apply actions
The record is split into fields (by default based on whitespace as the separator).
The fields are referenced by their position, starting at 1. $1 is the first field, $2 is the second.
The last field is referenced by a variable named NF for "number of fields." $NF is the last field, $(NF-1) is the second-to-last field, etc.
A "BEGIN" section will be executed before any input file is read, and it can be used to initialize variables (which are implicitly initialized to 0).
BEGIN {
    counter = 1
    OFS = ","   # This is the output field separator used by the print statement
    print "file", "start", "stop", "epoch", "run"   # Print the header line
}
/start value/ {
    startValue = $NF   # when a line contains "start value" store the last field as startValue
}
/epoch/ {
    epoch = $NF
}
/stop value/ {
    stopValue = $NF
    # we have everything to print our line
    print FILENAME, startValue, stopValue, epoch, counter
    counter = counter + 1
    startValue = ""   # clear variables so they aren't maintained through the next iteration
    epoch = ""
}
Save that as processor.awk and invoke as:
awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv
This might work for you (GNU sed):
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^start value/{:a;N;/\nstop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^start value (\S+).*\nepoch (\S+)\nstop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' file |
sed '1!N;s/\n//'
The solution contains two invocations of sed, the first to format all but the file name and the second to embed the file name into the csv file.
Format the header line on the first line and prime the run number.
Gather up lines between start value and stop value.
Increment the run number, append it to the current line and output the file name. This prints two lines per record, the first is the file name and the second the remainder of the csv file.
In the second sed invocation read two lines at a time (except for the first line) and remove the newline between them, formatting the csv file.

awk: data missed while parsing file

I have written a script to parse hourly log files to extract "CustomerId, Marketplace, StartTime, and DealIdClicked" data. The log file structure is like so:
------------------------------------------------------------------------
Size=0 bytes
scheme=https
StatusCode=302
RequestId=request_Id_X07
CustomerId=XYZCustomerId
Marketplace=MarketPlace
StartTime=1592931599.986
Program=Unknown
Info=sub-page-type=desktop:Deals_Content_DealIdClicked_0002,sub-page-CSMTags=UTF-8
Counters=sub-page-type=desktop:Deals_Content_DealIdClicked_0002=3,sub-page-CSMTags=Encoding:UTF-8
EOE
------------------------------------------------------------------------
Here is the script I have written to parse the log.
function readServiceLog() {
    local _logfile="$1"
    local _csvFile="$2"
    local _logFileName=$(getLogFileName "$_logfile")
    parseLogFile "$_logfile" "$_csvFile"
    echo "$_logFileName" >>"$SCRIPT_PATH/excludeFile.txt"
}

# Function to match regex and extract required data.
function parseLogFile() {
    local _logfile=$1
    local _csvFile=$2
    zcat <"$_logfile" | awk -v csvFilePath="$_csvFile" '
        BEGIN {
            customerIdRegex = "^CustomerId="
            marketplaceIdRegex = "^MarketplaceId="
            startTimeRegex = "^StartTime="
            InfoRegex = "^Info="
            dealIdRegex = "Deals_Content_DealIdClicked_"
            EOERegex = "^EOE$"
            delete RECORD
        }
        {
            logLine = $0
            if (match(logLine, InfoRegex)) {
                after = substr(logLine, RSTART+RLENGTH)
                if (match(after, dealIdRegex)) {
                    afterDeal = substr(after, RSTART+RLENGTH)
                    dealId = substr(afterDeal, 1, index(afterDeal, ",")-1)
                    RECORD[0] = dealId
                }
            }
            if (match(logLine, customerIdRegex)) {
                after = substr(logLine, RSTART+RLENGTH)
                customerid = substr(after, 1, length(after))
                RECORD[1] = customerid
            }
            if (match(logLine, startTimeRegex)) {
                after = substr(logLine, RSTART+RLENGTH)
                startTime = substr(after, 1, length(after))
                RECORD[2] = startTime
            }
            if (match(logLine, marketplaceIdRegex)) {
                after = substr(logLine, RSTART+RLENGTH)
                marketplaceId = substr(after, 1, length(after))
                RECORD[3] = marketplaceId
            }
            if (match(logLine, EOERegex)) {
                if (length(RECORD) == 4) {
                    printf("%s,%s,%s,%s\n", RECORD[0], RECORD[1], RECORD[2], RECORD[3]) >> csvFilePath
                }
                delete RECORD
            }
        }'
}

function processHourlyFile() {
    local _currentProcessingFolder=$1
    local _outputFolder=$(getOutputFolderName)  # getOutputFolderName function is from util class.
    mkdir -p "$_outputFolder"
    local _csvFileName="$_outputFolder/${_currentProcessingFolder##*/}.csv"
    for entry in "$_currentProcessingFolder"/*; do
        if [[ "$entry" == *"$SERVICE_LOG"* ]]; then
            readServiceLog "$entry" "$_csvFileName"
        fi
    done
}

# Main execution to spawn new processes for parallel parsing.
function main() {
    local _processCount=1
    for entry in $INPUT_LOG_PATH/*; do
        processHourlyFile $entry &
        pids[${_processCount}]=$!
    done
    printInfo
    # wait for all pids
    for pid in ${pids[*]}; do
        wait $pid
    done
}

main
printf '\nFinished!\n'
Expected output:
A comma separated file.
0002,XYZCustomerId,1592931599.986,MarketPlace
Problem
The script spawns 24 processes to parse 24 hourly logs for an entire day. After parsing the files, I verified the record counts, and sometimes they don't match the original log files' record counts.
I have been stuck on this for the last two days with no luck. Any help would be appreciated.
Thanks in advance.
Try:
awk -F= '
{
    a[$1] = $2
}
/^Info/ {
    sub(/.*DealIdClicked_/, "")
    sub(/,.*/, "")
    print $0, a["CustomerId"], a["StartTime"], a["Marketplace"]
    delete a
}' OFS=, filename
When run on your input file, the above produces the desired output:
0002,XYZCustomerId,1592931599.986,MarketPlace
How it works
-F= tells awk to use = as the field separator on input.
{ a[$1]=$2 } tells awk to save the second field, $2, in associative array a under the key $1.
/^Info/ { ... } tells awk to perform the commands in curly braces whenever the line starts with Info. Those commands are:
sub(/.*DealIdClicked_/, "") removes all parts of the line up to and including DealIdClicked_.
sub(/,.*/, "") tells awk to remove from what's left of the line everything from the first comma to the end of the line.
The remainder of the line, still called $0, is the "DealId" that we want.
print $0, a["CustomerId"], a["StartTime"], a["Marketplace"] tells awk to print the output that we want.
delete a deletes the array a so we start over clean on the next record.
OFS=, tells awk to use a comma as the field separator on output.
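To slot this into the question's pipeline (a sketch; the variables _logfile and _csvFile are the names used in the question's script):
zcat <"$_logfile" | awk -F= -v OFS=, '
    { a[$1] = $2 }
    /^Info/ {
        sub(/.*DealIdClicked_/, "")   # strip everything up to the deal id
        sub(/,.*/, "")                # strip everything after it
        print $0, a["CustomerId"], a["StartTime"], a["Marketplace"]
        delete a                      # start clean for the next record
    }' >> "$_csvFile"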

Parse out key=value pairs into variables

I have a bunch of different kinds of files I need to look at periodically, and what they have in common is that the lines have a bunch of key=value type strings. So something like:
Version=2 Len=17 Hello Var=Howdy Other
I would like to be able to reference the names directly from awk... so something like:
cat some_file | ... | awk '{print Var, $5}' # prints Howdy Other
How can I go about doing that?
The closest you can get is to parse the variables into an associative array first thing every line. That is to say,
awk '{ delete vars; for(i = 1; i <= NF; ++i) { n = index($i, "="); if(n) { vars[substr($i, 1, n - 1)] = substr($i, n + 1) } } Var = vars["Var"] } { print Var, $5 }'
More readably:
{
    delete vars    # clean up previous variable values
    for (i = 1; i <= NF; ++i) {    # walk through fields
        n = index($i, "=")         # search for =
        if (n) {                   # if there is one:
            # remember value by name. The reason I use
            # substr over split is the possibility of
            # something like Var=foo=bar=baz (that will
            # be parsed into a variable Var with the
            # value "foo=bar=baz" this way).
            vars[substr($i, 1, n - 1)] = substr($i, n + 1)
        }
    }
    # if you know precisely what variable names you expect to get, you can
    # assign to them here:
    Var = vars["Var"]
    Version = vars["Version"]
    Len = vars["Len"]
}
{
    print Var, $5    # then use them in the rest of the code
}
$ cat file | sed -r 's/[[:alnum:]]+=/\n&/g' | awk -F= '$1=="Var"{print $2}'
Howdy Other
Or, avoiding the useless use of cat:
$ sed -r 's/[[:alnum:]]+=/\n&/g' file | awk -F= '$1=="Var"{print $2}'
Howdy Other
How it works
sed -r 's/[[:alnum:]]+=/\n&/g'
This places each key,value pair on its own line.
awk -F= '$1=="Var"{print $2}'
This reads the key-value pairs. Since the field separator is chosen to be =, the key ends up as field 1 and the value as field 2. Thus, we just look for lines whose first field is Var and print the corresponding value.
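To see the intermediate stream the sed step produces, run it on the sample line from the question:
$ echo 'Version=2 Len=17 Hello Var=Howdy Other' | sed -r 's/[[:alnum:]]+=/\n&/g'

Version=2
Len=17 Hello
Var=Howdy Other
(The first output line is empty because a newline is inserted before the very first key as well.)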
Since discussion in commentary has made it clear that a pure-bash solution would also be acceptable:
#!/bin/bash
case $BASH_VERSION in
    ''|[0-3].*) echo "ERROR: Bash 4.0 required" >&2; exit 1;;
esac

while read -r -a words; do          # iterate over lines of input
    declare -A vars=( )             # refresh variables for each line
    set -- "${words[@]}"            # update positional parameters
    for word; do
        if [[ $word = *"="* ]]; then          # if a word contains an "="...
            vars[${word%%=*}]=${word#*=}      # ...then set it as an associative-array key
        fi
    done
    echo "${vars[Var]} $5"          # Here, we use content read from that line.
done <<<"Version=2 Len=17 Hello Var=Howdy Other"
The <<<"Input Here" could also be <file.txt, in which case lines in the file would be iterated over.
If you wanted to use $Var instead of ${vars[Var]}, then substitute printf -v "${word%%=*}" %s "${word#*=}" in place of vars[${word%%=*}]=${word#*=}, and remove references to vars elsewhere. Note that this doesn't allow for a good way to clean up variables between lines of input, as the associative-array approach does.
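For reference, a minimal sketch of that printf -v variant (same test input as above; note the lack of per-line cleanup):
#!/bin/bash
while read -r -a words; do
    set -- "${words[@]}"
    for word; do
        if [[ $word = *"="* ]]; then
            printf -v "${word%%=*}" %s "${word#*=}"   # e.g. assigns $Var directly
        fi
    done
    echo "$Var $5"
done <<<"Version=2 Len=17 Hello Var=Howdy Other"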
I will try to explain a very generic way to do this, which you can adapt easily if you want to print out other stuff.
Assume you have a string which has a format like this:
key1=value1 key2=value2 key3=value3
or more generic
key1_fs2_value1_fs1_key2_fs2_value2_fs1_key3_fs2_value3
With fs1 and fs2 two different field separators.
You would like to make a selection or some operations with these values. To do this, the easiest is to store these in an associative array:
array["key1"] => value1
array["key2"] => value2
array["key3"] => value3
array["key1","full"] => "key1=value1"
array["key2","full"] => "key2=value2"
array["key3","full"] => "key3=value3"
This can be done with the following function in awk:
function str2map(str, fs1, fs2, map,    n, tmp) {
    n = split(str, map, fs1)
    for (; n > 0; n--) {
        split(map[n], tmp, fs2)
        map[tmp[1]] = tmp[2]; map[tmp[1], "full"] = map[n]
        delete map[n]
    }
}
}
So, after processing the string, you have the full flexibility to do operations in any way you like:
awk '
function str2map(str, fs1, fs2, map,    n, tmp) {
    n = split(str, map, fs1)
    for (; n > 0; n--) {
        split(map[n], tmp, fs2)
        map[tmp[1]] = tmp[2]; map[tmp[1], "full"] = map[n]
        delete map[n]
    }
}
{ str2map($0, " ", "=", map) }
{ print map["Var", "full"] }
' file
The advantage of this method is that you can easily adapt your code to print any other key you are interested in, or even make selections based on it, for example:
(map["Version"] < 3) { print map["Var"]/map["Len"] }
The simplest and easiest way is to use string substitution, like this:
property='my.password.is=1234567890=='
name=${property%%=*}
value=${property#*=}
echo "'$name' : '$value'"
The output is:
'my.password.is' : '1234567890=='
Using bash's set command, we can split the line into positional parameters like awk.
For each word, we'll try to read a name value pair delimited by =.
When we find a value, assign it to the variable named $key using bash's printf -v feature.
#!/usr/bin/env bash

line='Version=2 Len=17 Hello Var=Howdy Other'

set -- $line    # split the line into the positional parameters
for word in "$@"; do
    IFS='=' read -r key val <<< "$word"
    test -n "$val" && printf -v "$key" '%s' "$val"    # e.g. creates $Var
done

echo "$Var $5"
output
Howdy Other
SYNOPSIS
an awk-based solution that doesn't require manually checking the fields to locate the desired key pair:
the approach avoids splitting unnecessary fields or arrays - a regex match is performed via function call only when needed
only the FIRST occurrence of the input key value is returned; subsequent matches along the row are NOT returned
I just called it S() cuz it's the closest letter to $
I only included an array (_) of the 3 test values for demo purposes. Those aren't needed. In fact, no state information is being kept at all
the caveat being: key-match must be exact - this version of the code isn't for case-insensitive or fuzzy/agile matching
Tested and confirmed working on
- gawk 5.1.1
- mawk 1.3.4
- mawk-2/1.9.9.6
- macos nawk
CODE
# gawk profile, created Fri May 27 02:07:53 2022
gawk '   # originally written as {m,n,g}awk: mawk, nawk, and gawk all work
function S(__, _) {
    return \
    ! match($(_=_<_), "(^|["(_="[:blank:]]")")"(__)"[=][^"(_)"*") \
        ? "^$" \
        : substr(__=substr($-_, RSTART, RLENGTH), index(__, "=")+_^!_)
}
BEGIN { OFS = "\f"                     # This array is only for testing
    _["Version"] _["Len"] _["Var"]     # purposes. Feel free to discard at will
} {
    for (__ in _) {
        print __, S(__)
    }
}'
OUTPUT
Var
Howdy
Len
17
Version
2
So either call the fields in BAU fashion ($5, $0, $NF, etc.), or call S(QUOTED_KEY_VALUE), case-sensitive, e.g. S("Version") to get back 2. As a safeguard, to prevent mis-interpreting null strings or invalid inputs as $0, a non-match returns ^$ instead of an empty string.
As a bonus, it can safely handle multibyte unicode, both in values and even in keys, regardless of whether your awk is UTF-8-aware or not:
1 ✜
🤡
2 Version
2
3 Var
Howdy
4 Len
17
5 ✜=🤡 Version=2 Len=17 Hello Var=Howdy Other
I know this question is specifically about awk, but I'm mentioning this because many people come here for solutions to break down name=value pairs (with or without awk as such).
I found the approach below simple, straightforward, and very effective at handling multiple spaces/commas as well -
Source: http://jayconrod.com/posts/35/parsing-keyvalue-pairs-in-bash
change="foo=red bar=green baz=blue"
#use below if var is in CSV (instead of space as delim)
change=`echo $change | tr ',' ' '`
for change in $changes; do
set -- `echo $change | tr '=' ' '`
echo "variable name == $1 and variable value == $2"
#can assign value to a variable like below
eval my_var_$1=$2;
done
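As an aside, for simple variable names declare can replace the eval (a sketch; the my_var_ prefix is kept from the snippet above):
changes="foo=red bar=green baz=blue"
for change in $changes; do
    key=${change%%=*}
    value=${change#*=}
    declare "my_var_$key=$value"    # avoids the quoting pitfalls of eval
    echo "variable name == $key and variable value == $value"
done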

How to get specific data from block of data based on condition

I have a file like this:
[group]
enable = 0
name = green
test = more

[group]
name = blue
test = home

[group]
value = 48
name = orange
test = out
There may be one or more spaces/tabs between the label, the =, and the value.
The number of lines may vary in every block.
I'd like to get the name, but only when enable = 0 is not present in the block.
So output should be:
blue
orange
Here is what I have managed to create:
awk -v RS="group" '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
blue
orange
There are several faults with this:
I am not able to set RS to [group]; both RS="[group]" and RS="\[group\]" fail. Splitting on plain "group" will also fail if name or other labels contain "group".
I would prefer not to use an RS with multiple characters, since that is gnu awk only.
Does anyone have another suggestion? sed or awk preferred, and not a long chain of commands.
If you know that groups are always separated by empty lines, set RS to the empty string:
$ awk -v RS="" '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
blue
orange
@devnull explained in his answer that GNU awk also accepts regular expressions in RS, so you could only split at [group] if it is on its own line:
gawk -v RS='(^|\n)[[]group]($|\n)' '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
This makes sure we're not splitting at evil names like
[group]
enable = 0
name = [group]
name = evil
test = more
Your problem seems to be:
I am not able to set RS to [group], both this fails RS="[group]" and
RS="\[group\]".
Saying:
RS="[[]group[]]"
should yield the desired result.
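For example, combining that RS with the substitution from your own attempt (a sketch; the added /name/ guard skips the text before the first [group]):
awk -v RS='[[]group[]]' '!/enable = 0/ && /name/ {
    sub(/.*name[[:blank:]]+=[[:blank:]]+/, x)
    print $1
}' file
blue
orange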
In these situations where there's clearly name = value statements within a record, I like to first populate an array with those mappings, e.g.:
map["<name>"] = <value>
and then just use the names to reference the values I want. In this case:
$ awk -v RS= -F'\n' '
{
    delete map
    for (i=1; i<=NF; i++) {
        split($i, tmp, / *= */)
        map[tmp[1]] = tmp[2]
    }
}
map["enable"] !~ /^0$/ {
    print map["name"]
}
' file
blue
orange
If your version of awk doesn't support deleting a whole array then change delete map to split("",map).
Compared to using REs and/or sub()s., etc., it makes the solution much more robust and extensible in case you want to compare and/or print the values of other fields in future.
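For instance, also printing each group's test value is a one-line change to the final block:
map["enable"] !~ /^0$/ {
    print map["name"], map["test"]
}
which, with the sample input, prints blue home and orange out.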
Since your records are separated by blank lines, you should consider putting awk in paragraph mode. If you must test for the [group] identifier, simply add code to handle that. Here's some example code that should fulfill your requirements. Run like:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
    RS = ""
}
{
    for (i=2; i<=NF; i+=3) {
        if ($i == "enable" && $(i+2) == 0) {
            f = 1
        }
        if ($i == "name") {
            r = $(i+2)
        }
    }
}
!(f) && r {
    print r
}
{
    f = 0
    r = ""
}
Results:
blue
orange
This might work for you (GNU sed):
sed -n '/\[group\]/{:a;$!{N;/\n$/!ba};/enable\s*=\s*0/!s/.*name\s*=\s*\(\S\+\).*/\1/p;d}' file
Read the [group] block into the pattern space then substitute out the colour if the enable variable is not set to 0.
sed -n '...' sets sed to run in silent mode, no output unless specified, i.e. by a p or P command.
/\[group\]/{...} when we have a line which contains [group], do what is found inside the curly braces.
:a;$!{N;/\n$/!ba} to do a loop we need a place to loop to; :a is the place to loop to. $ is the end-of-file address and $! means not-the-end-of-file, so $!{...} means do what is found inside the curly braces when it is not the end of the file. N means append a newline and the next line to the current line, and /\n$/!ba means branch (b) back to a while the pattern space does not yet end with an empty line. So this collects all lines from a line that contains [group] to an empty line (or the end of the file).
/enable\s*=\s*0/!s/.*name\s*=\s*\(\S\+\).*/\1/p if the lines collected contain enable = 0 then do not substitute out the colour. Or to put it another way, if the lines collected so far do not contain enable = 0, do substitute out the colour.
If you don't want to use the record separator, you could use a dummy variable like this:
#!/usr/bin/awk -f

function endgroup() {
    if (e == 1) {
        print n
    }
}
$1 == "name" {
    n = $3
}
$1 == "enable" && $3 == 0 {
    e = 0
}
$0 == "[group]" {
    endgroup()
    e = 1
}
END {
    endgroup()
}
You could actually use Bash for this.
while read -r line; do
    if [[ $line == "enable = 0" ]]; then
        n=1
    else
        n=0
    fi
    if [ $n -eq 0 ] && [[ $line =~ name[[:space:]]+=[[:space:]]([a-z]+) ]]; then
        echo "${BASH_REMATCH[1]}"
    fi
done < file
This will only work however if enable = 0 is always only one line above the line with name.
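If that ordering cannot be guaranteed, the same idea can track state per [group] block (a sketch, still pure Bash; flush is a helper name of my choosing):
#!/usr/bin/env bash
enabled=1
name=""
flush() { [[ $enabled -eq 1 && -n $name ]] && echo "$name"; }
while read -r line; do
    if [[ $line == "[group]" ]]; then
        flush               # emit the previous block's name, if allowed
        enabled=1
        name=""
    elif [[ $line =~ ^enable[[:space:]]*=[[:space:]]*0$ ]]; then
        enabled=0
    elif [[ $line =~ ^name[[:space:]]*=[[:space:]]*(.+)$ ]]; then
        name=${BASH_REMATCH[1]}
    fi
done < file
flush                       # don't forget the final block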
