I'm trying to parse a log file that will have lines like this:
aaa bbb ccc: [DDD] efg oi
aaa bbb ccc: lll [DDD] efg oo
aaa bbb ccc: [DDD]
where [DDD] can be at any place in line.
Only one thing will be between [ and ] in any line
Using awk and space as a delimiter, how can I print 1st, 3rd and all data (whole string) between [ and ]?
Expected output: aaa ccc: DDD
gawk(GNU awk) approach:
Let's say we have a file with the following line:
aaa bbb ccc: ddd [fff] ggg hhh
The command:
awk '{match($0,/\[([^]]+)\]/, a); print $1,$3,a[1]}' file
The output:
aaa ccc: fff
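match(string, regexp [, array]) searches string for the longest, leftmost substring matched by the regular expression regexp and returns the character position (index) at which that substring begins (one, if it starts at the beginning of string). If no match is found, it returns zero. When the optional array argument is given (a gawk extension), array[0] holds the whole matched text and array[1] holds the text matched by the first parenthesized group, which is why a[1] above is the content between [ and ].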
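For an awk without the gawk-only third argument to match(), the same extraction can probably be sketched with RSTART and RLENGTH, which POSIX awk sets after a successful match. The function name extract_bracket is only an illustrative wrapper and is not part of the answer above.

```shell
# Hypothetical wrapper: prints field 1, field 3 and the text between [ and ].
extract_bracket() {
    awk '{
        inner = ""
        # match() sets RSTART/RLENGTH to the position and length of "[...]"
        if (match($0, /\[[^]]+\]/))
            inner = substr($0, RSTART + 1, RLENGTH - 2)   # drop both brackets
        print $1, $3, inner
    }' "$@"
}

printf '%s\n' 'aaa bbb ccc: [DDD] efg oi' 'aaa bbb ccc: lll [DDD] efg oo' |
    extract_bracket
```

Lines without a bracketed part still print fields 1 and 3, followed by an empty third value.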
Given:
$ cat file
aaa bbb ccc: [DDD] efg oi
aaa bbb [ccc:] lll DDD efg oo
aaa [bbb] ccc: DDD
(note -- changed from the OP's example)
In POSIX awk:
awk 'BEGIN{fields[1]; fields[3]}
{s=""
for (i=1;i<=NF;i++)
if ($i~/^\[/ || i in fields)
s=i>1 ? s OFS $i : $i
gsub(/\[|\]/,"",s)
print s
}' file
Prints:
aaa ccc: DDD
aaa ccc:
aaa bbb ccc:
This does not print the field twice if it is both enclosed in [] and in the selected fields array. (i.e., [aaa] bbb ccc: does not print aaa twice) It will also print in correct field order if you have aaa [bbb] ccc ...
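The two claims above can be checked with a small sketch that feeds exactly those edge cases to the same POSIX awk program. The wrapper name pick_fields is an assumption made for illustration; the awk body is the program from this answer with explicit parentheses added.

```shell
# Hypothetical wrapper around the POSIX awk program from this answer.
pick_fields() {
    awk 'BEGIN { fields[1]; fields[3] }      # always-selected field numbers
    {
        s = ""
        for (i = 1; i <= NF; i++)
            # keep a field if it starts with "[" or its number is selected
            if (($i ~ /^\[/) || (i in fields))
                s = (i > 1 ? s OFS $i : $i)
        gsub(/\[|\]/, "", s)                  # strip the brackets
        print s
    }' "$@"
}

printf '%s\n' '[aaa] bbb ccc:' 'aaa [bbb] ccc' | pick_fields
```

The first line yields aaa only once, and the second keeps bbb between aaa and ccc.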
awk '$5=="[DDD]"{gsub("[\\[\\]]","");print $1,$3,$5}' file
or
awk '$5=="[DDD]"{print $1,$3, substr($5,2,3)}' file
aaa ccc: DDD
I have a file in this format:
aaa bbb ccc ddd eee|fff|ggg|hhh|iii|lll|mmm|nnn|ooo|ppp
aaa1 bbb1 ccc1 ddd1 eee1|fff1|ggg1|hhh1|iii1|lll1|mmm1|nnn1|ooo1|ppp1
aaa2 bbb2 ccc2 ddd2 eee2|fff2|ggg2|hhh2|iii2|lll2|mmm2|nnn2|ooo2|ppp2
As you can see, the first three fields are separated by a space, while the other ones are separated by the | sign.
I would like to select the first 3 fields, and then the 8th and 9th fields.
I would like to have the following output:
aaa bbb ccc hhh iii
aaa1 bbb1 ccc1 hhh1 iii1
aaa2 bbb2 ccc2 hhh2 iii2
As you can see, I should filter on two delimiters: space and pipe.
How can I do it in bash?
I tried with awk but I was unable to run it with two different delimiters.
If your code isn't so performance-sensitive as to make awk a better choice, the below does the parsing in question in native bash, and does so in such a way as to have correct results even if pipe-separated fields other than the first contain spaces:
while IFS='|' read -r -a psep_fields; do # read into pipe-separated fields
read -r -a space_fields <<<"${psep_fields[0]}" # read 1st field & parse by spaces
printf '%s %s %s %s %s\n' \
"${space_fields[0]}" "${space_fields[1]}" "${space_fields[2]}" \
"${psep_fields[3]}" "${psep_fields[4]}"
done
See this running on your input at https://ideone.com/zCjpDP, returning as output:
aaa bbb ccc hhh iii
aaa1 bbb1 ccc1 hhh1 iii1
aaa2 bbb2 ccc2 hhh2 iii2
If your input may have pipes in the first 4 fields or spaces in the pipe-separated part, it is better to use this awk command, which strips the first 4 space-separated fields and then splits the rest using | as the delimiter:
awk 'NF>3{s = $1 OFS $2 OFS $3; sub(/^[ \t]*([^ \t]+[ \t]+){4}/, "");
if (split($0, a, "|") > 4) s = s OFS a[4] OFS a[5]; print s}' file
aaa bbb ccc hhh iii
aaa1 bbb1 ccc1 hhh1 iii1
aaa2 bbb2 ccc2 hhh2 iii2
This will do exactly what you asked for regardless of whether fields in the head (space-separated) section contain |s or fields in the tail (|-separated) section contain spaces.
With GNU awk for the 3rd arg to match() and \S/\s shorthand:
$ cat tst.awk
match($0,/^((\S+\s+){3})(.*)/,a) {
split(a[1],h,/\s+/)
split(a[3],t,/[|]/)
print h[1], h[2], h[3], t[4], t[5]
}
$ awk -f tst.awk file
aaa bbb ccc hhh iii
aaa1 bbb1 ccc1 hhh1 iii1
aaa2 bbb2 ccc2 hhh2 iii2
and with any awk:
$ cat tst.awk
match($0,/^([^[:space:]]+[[:space:]]+){3}/) {
split(substr($0,RSTART,RLENGTH),h,/[[:space:]]+/)
split(substr($0,RSTART+RLENGTH),t,/[|]/)
print h[1], h[2], h[3], t[4], t[5]
}
$ awk -f tst.awk file
aaa bbb ccc hhh iii
aaa1 bbb1 ccc1 hhh1 iii1
aaa2 bbb2 ccc2 hhh2 iii2
The above is assuming you're correct and it's only the first 3 fields that are separated by spaces, hence the {3} in the regexp. If you're mistaken and it's actually 4 (as it appears like it might be in your posted sample input) then obviously just change {3} to {4}. It will only matter if you want to access a 4th space-separated field.
A slightly different approach -
while read -r a b c d e; do
IFS="|" read -r -a f <<< "$e"
echo "$a $b $c ${f[3]} ${f[4]}"
done < input.txt
aaa bbb ccc hhh iii
aaa b|b|b ccc hhh "i i i"
aaa1 bbb1 ccc1 hhh1 iii1
aaa1 bbb1 c|c|c|1 hhh1 " i i i 1"
aaa2 bbb2 ccc2 hhh2 iii2
aaa2 bbb2 ccc2 "h h h 2" iii2
The read loads fields splitting on the usual $IFS characters, which puts all the last batch separated by pipes into e. This preserves any pipe characters embedded in a-d. Since e is the last variable, the rest of the line is stored there, even if it has embedded spaces.
e is split explicitly on pipes only into the array named f. This preserves any space characters embedded in the fields of e.
It's not much different from Charles' solution below, though.
If your data is in a file named d, try GNU awk:
awk -F'[ |]' '{print $1,$2,$3,$8,$9 } ' d
awk 'BEGIN{FPAT="\\w{3,}"}{print $1,$2,$3,$8,$9 } ' d
The second is preferable because FPAT gives far greater control over what counts as a field.
Here is one awk solution. Too simple so I am not sure what edge cases I am missing but I get the desired output
awk -v FS="[ |]" '{print $1 OFS $2 OFS $3 OFS $8 OFS $9}' inputFile
result
aaa bbb ccc hhh iii
aaa1 bbb1 ccc1 hhh1 iii1
aaa2 bbb2 ccc2 hhh2 iii2
Explanation:
I set the field separator to the regex [ |], which matches either a space or a pipe, and printed the requested fields.
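To see why the wanted values land in $8 and $9 with this separator, a small sketch can print every field with its number. The helper name show_fields is made up for this illustration.

```shell
# Hypothetical helper: list each field as number=value, splitting on space or pipe.
show_fields() {
    awk -F'[ |]' '{
        for (i = 1; i <= NF; i++)
            printf "%d=%s%s", i, $i, (i < NF ? " " : "\n")
    }' "$@"
}

printf '%s\n' 'aaa bbb ccc ddd eee|fff|ggg|hhh|iii|lll|mmm|nnn|ooo|ppp' | show_fields
```

Because every space and every pipe starts a new field, this numbering shifts as soon as any value itself contains a space or a pipe, which is the edge case the other answers guard against.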
Using sed, AWK (or Perl), how do you print all lines between (the first instance of) two patterns, exclusive of the patterns? [1]
That is, given as input:
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
Or possibly even:
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
fff
PATTERN1
ggg
hhh
iii
PATTERN2
jjj
I would expect, in both cases:
bbb
ccc
ddd
[1] A number of users voted to close this question as a duplicate of this one. In the end, I provided a gist that proves they are different. The question is also superficially similar to a number of others, but there is no exact match, and none of them are of high quality, and, as I believe that this specific problem is the one most commonly faced, it deserves a clear formulation, and a set of correct, clear answers.
If you have GNU sed (tested using version 4.7 on Mac OS X), the simplest solution could be:
sed '0,/PATTERN1/d;/PATTERN2/Q'
Explanation:
The d command deletes from line 1 to the line matching /PATTERN1/ inclusive.
The Q command then exits without printing on the first line matching /PATTERN2/.
If the file has only one instance of the pattern, or if you don't mind extracting all of them, and you want a solution that doesn't depend on a GNU extension, this works:
sed -n '/PATTERN1/,/PATTERN2/{//!p}'
Explanation:
Note that the empty regular expression // repeats the last regular expression match.
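If only the first block is wanted without GNU extensions, one possible variation is to spell out both patterns inside the range and quit at the first PATTERN2. This is a sketch rather than part of the answer above, and the helper name first_block is hypothetical.

```shell
# Hypothetical helper: print the lines strictly between the first
# PATTERN1/PATTERN2 pair, using only POSIX sed features.
first_block() {
    sed -n '/PATTERN1/,/PATTERN2/{
        /PATTERN1/d
        /PATTERN2/q
        p
    }' "$@"
}

printf '%s\n' aaa PATTERN1 bbb ccc ddd PATTERN2 eee fff PATTERN1 ggg PATTERN2 jjj |
    first_block
```

Inside the range, the d command drops the opening line, q stops at the closing line (with -n nothing is printed on exit), and p prints everything else.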
With awk (assumes that PATTERN1 and PATTERN2 always occur in pairs and that neither of them occurs inside a pair):
$ cat ip.txt
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
fff
PATTERN1
ggg
hhh
iii
PATTERN2
jjj
$ awk '/PATTERN2/{exit} f; /PATTERN1/{f=1}' ip.txt
bbb
ccc
ddd
/PATTERN1/{f=1} set flag if /PATTERN1/ is matched
/PATTERN2/{exit} exit if /PATTERN2/ is matched
f; print input line if flag is set
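The same flag logic can be sketched with the two patterns passed in as awk variables instead of being hard-coded. The function name between and the variable names from and upto are assumptions for this example. Unlike the one-liner above, this variant only exits on the end pattern once the start pattern has been seen.

```shell
# Hypothetical helper. Usage: between START_REGEX END_REGEX [file...]
between() {
    from=$1 upto=$2
    shift 2
    awk -v from="$from" -v upto="$upto" '
        f && $0 ~ upto { exit }     # first end marker after the start: stop
        f                           # inside the block: print the line
        $0 ~ from      { f = 1 }    # start marker seen: print from the next line on
    ' "$@"
}

printf '%s\n' aaa PATTERN1 bbb ccc ddd PATTERN2 eee | between PATTERN1 PATTERN2
```

Note that awk processes backslash escapes in -v values, so regexes containing backslashes need them doubled.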
Generic solution, where the block required can be specified
$ awk -v b=1 '/PATTERN2/ && c==b{exit} c==b; /PATTERN1/{c++}' ip.txt
bbb
ccc
ddd
$ awk -v b=2 '/PATTERN2/ && c==b{exit} c==b; /PATTERN1/{c++}' ip.txt
ggg
hhh
iii
This might work for you (GNU sed):
sed -n '/PATTERN1/{:a;n;/PATTERN2/q;p;$!ba}' file
This prints only the lines between the first set of delimiters, or if the second delimiter does not exist, to the end of the file.
I attempted twice to answer, but the questions switched hold/duplicate statuses.
Borrowing input from @Sundeep and adding the answer which I shared in the question comments.
Using awk
awk -v x=0 -v y=1 ' /PATTERN1/&&y { x=1;next } /PATTERN2/&&y { x=0;y=0; next } x ' file
with Perl
perl -0777 -ne ' while( /PATTERN1.*?\n(.+?)^[^\n]*?PATTERN2/msg ) { print $1 if $x++ <1 } '
Results:
$ cat ip.txt
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
PATTERN1
2
46
PATTERN2
xyz
$
$ awk -v x=0 -v y=1 ' /PATTERN1/&&y { x=1;next } /PATTERN2/&&y { x=0;y=0; next } x ' ip.txt
bbb
ccc
ddd
$ perl -0777 -ne ' while( /PATTERN1.*?\n(.+?)^[^\n]*?PATTERN2/msg ) { print $1 if $x++ <1 } ' ip.txt
bbb
ccc
ddd
$
To make it generic
With awk, where y is the number of the block to print:
awk -v x=0 -v y=2 ' /PATTERN1/ { x++;next } /PATTERN2/ { if(x==y) exit } x==y ' ip.txt
2
46
With Perl, compare ++$x against the occurrence wanted, here 2:
perl -0777 -ne ' while( /PATTERN1.*?\n(.+?)^[^\n]*?PATTERN2/msg ) { print $1 if ++$x==2 } ' ip.txt
2
46
Adding more solutions (possible ways, for fun :), not claiming that these are better than the usual ones). All were written and tested in GNU awk, and only with the given examples.
1st Solution:
awk -v RS="" -v FS="PATTERN2" -v ORS="" '$1 ~ /\nPATTERN1\n/{sub(/.*PATTERN1\n/,"",$1);print $1}' Input_file
2nd solution:
awk -v RS="" -v ORS="" 'match($0,/PATTERN1[^(PATTERN2)]*/){val=substr($0,RSTART,RLENGTH);gsub(/^PATTERN1\n|^$\n/,"",val);print val}' Input_file
3rd solution:
awk -v RS="" -v OFS="\n" -v ORS="" 'sub(/PATTERN2.*/,"") && sub(/.*PATTERN1/,"PATTERN1"){$1=$1;sub(/^PATTERN1\n/,"")} 1' Input_file
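Note that [^(PATTERN2)] in the 2nd solution is a bracket expression that excludes individual characters rather than the string PATTERN2, so that solution relies on the lines in between containing none of those characters. With the sample input, all of the above produce the following output: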
bbb
ccc
ddd
Using GNU sed:
sed -nE '/PATTERN1/{:s n;/PATTERN2/q;p;bs}'
-n suppresses automatic printing, so only lines explicitly printed by the p command are output.
An address applies only to the single command that follows it, so the {} grouping is required to run several commands on the matching line.
The n command (next) discards the PATTERN1 line by reading the next line. If that line matches PATTERN2, sed quits immediately; otherwise it prints the line and branches back to label s to handle the following line.
I want to print the output of file1 to first column in new file and file 2 to the second column in the new file.
Something like this.
file1
AAA
BBB
CCC
file2
XXX
YYY
ZZZ
file3
AAA XXX
BBB YYY
CCC ZZZ
paste command will do this job out-of-the-box:
paste file1 file2 > file3
AAA XXX
BBB YYY
CCC ZZZ
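By default paste separates the columns with a tab. If a single space is wanted instead, as in the file3 shown in the question, the -d option can set the delimiter. The sketch below uses a hypothetical helper name join_columns and throwaway files in a temporary directory.

```shell
# Hypothetical helper: join two files line by line with a single space.
join_columns() {
    paste -d ' ' "$1" "$2"
}

dir=$(mktemp -d)                       # scratch directory for the demo files
printf '%s\n' AAA BBB CCC > "$dir/file1"
printf '%s\n' XXX YYY ZZZ > "$dir/file2"
join_columns "$dir/file1" "$dir/file2"
rm -r "$dir"
```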
You can use paste and format using cut to remove leading and trailing spaces
Here is my sample list:
AAA BBB CCC1
DDD EEE FFF1
GGG HHH III1 <----- I want to remove this
GGG HHH III3 >>updated <----- I want to keep this
JJJ KKK LLL7
As I'm traversing the list using a For Loop, I want to take note of every row that has a ">>updated" in it and go back one row and remove the older row (not updated) and then move forward to the next row after the ">>updated" row. So basically my final output would be:
AAA BBB CCC1
DDD EEE FFF1
GGG HHH III3
JJJ KKK LLL7
I am using awk to parse the values of the other fields from a shell script, but I'm just not quite sure how to do this backwards and forwards step. Any help would be greatly appreciated.
awk '/>>updated/{print $1,$2,$3; held=0; next} held{print prev} {prev=$0; held=1} END{if(held) print prev}' file
AAA BBB CCC1
DDD EEE FFF1
GGG HHH III3
JJJ KKK LLL7
This might work for you (GNU sed):
sed -r '$!N;s/.*\n(.*)\s+>>updated\s*$/\1/;P;D' file
Keep two lines in the pattern space and delete the first when the second matches your requirements.
An awk solution might be:
awk 'sub(/ *>>updated.*/,""){l=$0;next};NR>1{print l};{l=$0};END{print l}' file
tac is nice but not default for all distributions. In case you don't have it available, here is an awk single process one-liner:
awk -F' >>' 'p{if($2~/updated/){p=$1;next}print p}{p=$0}END{print p}' file
perl -lne 'if(/\>\>updated/){pop @a;s/\>\>updated//g;push @a,$_}else{push @a,$_}END{print join "\n",@a}' your_file
tested:
> cat temp
AAA BBB CCC1
DDD EEE FFF1
GGG HHH III1
GGG HHH III3 >>updated
JJJ KKK LLL7
> perl -lne 'if(/\>\>updated/){pop @a;s/\>\>updated//g;push @a,$_}else{push @a,$_}END{print join "\n",@a}' temp
AAA BBB CCC1
DDD EEE FFF1
GGG HHH III3
JJJ KKK LLL7
>
The simplest way is to build up an array of the lines in your input file but only increase the array index when >>updated is absent so that lines that do contain >>updated overwrite the previous entry in the array and then just print the contents of the array when you get to the end of file:
$ cat file
AAA BBB CCC1
DDD EEE FFF1
GGG HHH III1 <----- I want to remove this
GGG HHH III3 >>updated <----- I want to keep this
JJJ KKK LLL7
$ awk '!/>>updated/{++numLines} {line[numLines]=$0} END {for (nr=1;nr<=numLines;nr++) print line[nr]}' file
AAA BBB CCC1
DDD EEE FFF1
GGG HHH III3 >>updated <----- I want to keep this
JJJ KKK LLL7
If you want to get rid of the >>updated and subsequent text on that line, you can change the test for its existence to a test for an attempt to remove it:
$ awk '!sub(/ *>>updated.*/,""){++numLines} {line[numLines]=$0} END{for (nr=1;nr<=numLines;nr++) print line[nr]}' file
AAA BBB CCC1
DDD EEE FFF1
GGG HHH III3
JJJ KKK LLL7
If >>updated was present then sub() removes it and returns 1 (the number of substitutions made), so you know that >>updated was present; otherwise sub() does nothing and returns 0, so you know that >>updated was absent.
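The return value of sub() can be made visible with a short sketch that prints it in front of each (possibly modified) line. The helper name show_sub_result is only for illustration.

```shell
# Hypothetical helper: show what sub() returns for each input line.
show_sub_result() {
    awk '{
        n = sub(/ *>>updated.*/, "")   # n is 1 if a replacement was made, else 0
        print n ": " $0
    }' "$@"
}

printf '%s\n' 'GGG HHH III1' 'GGG HHH III3 >>updated' | show_sub_result
```

The first line is reported with 0 and left unchanged, while the second is reported with 1 and has the marker stripped.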
I want to find a line in a txt file and then insert string 3 lines above the found line
Input:
aaa
bbb
ccc
ddd
eee
fff
I want to look for "eee" and then print "WWW" 3 lines above it. Output:
aaa
WWW
bbb
ccc
ddd
eee
fff
I'm using awk and can only print "WWW" 1 line above "eee", and not 3:
awk '/eee/{print "WWW"} 4' file.txt
any ideas?
One way:
awk '{a[NR]=$0;}/eee/{a[NR-3]="WWW\n" a[NR-3];}END{for(i=1;i<=NR;i++)print a[i];}' file
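The same buffering idea can be sketched with the pattern, the inserted text and the offset passed as parameters. The function name insert_above and its argument order are assumptions made for this example. A guard skips the insertion when the match is too close to the top of the file.

```shell
# Hypothetical helper. Usage: insert_above REGEX TEXT N [file...]
insert_above() {
    pat=$1 text=$2 n=$3
    shift 3
    awk -v pat="$pat" -v text="$text" -v n="$n" '
        { line[NR] = $0 }                        # buffer every line
        $0 ~ pat && NR > n {                     # match with at least n lines before it
            line[NR - n] = text "\n" line[NR - n]
        }
        END { for (i = 1; i <= NR; i++) print line[i] }
    ' "$@"
}

printf '%s\n' aaa bbb ccc ddd eee fff | insert_above eee WWW 3
```

Like the one-liner above, this holds the whole file in memory, which is fine for small inputs.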