Using awk to format text - bash

I'm getting hard times understanding how to achieve what I want using awk and after searching for quite some time, I couldn't find the solution I'm looking for.
I have an input text that looks like this:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line
I want to properly format the weird lines between ' (' and ')'. The expected output is as follow:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Looking up on stack overflow I found this :
How to select lines between two marker patterns which may occur multiple times with awk/sed
So what I'm using now is echo $text | awk '/ \(/{flag=1;next}/\)/{flag=0}flag'
Which almost works except it filters out the non-matching lines, here's the output produced by this very last command:
(Element 4)
(Element 1, span 1 to Element 5, span 4)
Anyone knows how-to do this? I'm open to any suggestion, including not-using awk if you know better.
Bonus point if you teach me how to remove syntaxic coloration on my question code blocks :)
Thanks a billion times
Edit: Ok, so I accepted #EdMorton's solution as he provided something using awk (well, GNU awk). However, I'm currently using #aaron's sed voodoo incantations with great success and will probably continue doing so until I hit anything new on that specific usecase.
I strongly suggest reading EdMorton's explanation, last paragraph made my day. If anyone passing by has good ressources regarding awk/sed they can share, feel free to do so in the comments.

Here's how I would do it with GNU sed :
s/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}
Which, for those who don't speak gibberish, means :
remove the leading spaces from lines that start with spaces and an opening bracket
test if the line now start with an opening bracket. If that's the case, do the following :
mark this spot as the label l, which denotes the start of a loop
add a line from the input to the pattern space
test if you now have a closing bracket in your pattern space
if so, jump to the label e
(if not) jump to the label l
mark this spot as the label e, which denotes the end of the code
remove the linefeeds from the pattern space
(implicitly print the pattern space, whether it has been modified or not)
This can probably be refined, but it does the trick :
$ echo """Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line """ | sed 's/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}'
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Edit : if you can disable history expansion (set +H), this sed command is nicer : s/^\s*(/(/;/^(/{:l N;/)/!b l;s/\n//g}

sed is for simple substitutions on individual lines, that is all. If you try to do anything else with it then you are using constructs that became obsolete in the mid-1970s when awk was invented, are almost certainly non-portable and inefficient, are always just a pile of indecipherable arcane runes, and are used today just for mental exercise.
The following uses GNU awk for multi-char RS, RT and the \s shorthand for [[:space:]] and works by simply isolating the (...) strings and then doing whatever you want with them:
$ cat tst.awk
BEGIN {
RS="[(][^)]+[)]" # a regexp for the string you want to isolate in RT
ORS="" # disable appending of newlines so we print as-is
}
{
gsub(/\n[[:blank:]]+$/,"\n") # remove any blanks before RT at the start of each line
sub(/\(\s+/,"(",RT) # remove spaces after ( in RT
sub(/\s+\)/,")",RT) # remove spaces before ) in RT
gsub(/\s+/," ",RT) # compress each chain of spaces to one blank char in RT
print $0 RT # print the result
}
$ awk -f tst.awk file
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
If you're considering using a sed solution for this also consider how you would enhance it if/when you have the slightest requirements change. Any change to the above awk code would be trivial and obvious while a change to the equivalent sed code would require first sacrificing a goat under a blood moon then breaking out your copy of the Rosetta Stone...

It's doable in awk, and maybe there's a slicker way than this. It looks for lines between and including those containing only blanks and either an open or close parenthesis, and processes them specially. Everything else it just prints:
awk '/^ *\( *$/,/^ *\) *$/ {
sub(/^ */, "");
sub(/ *$/, "");
if ($1 ~ /[()]/) hold = hold $1; else hold = hold " " $0
if ($0 ~ /\)/) {
sub(/\( /, "(", hold)
sub(/ \)/, ")", hold)
print hold
hold = ""
}
next
}
{ print }' data
The variable hold is initially empty.
The first pair of sub calls strip leading and trailing blanks (copying the data from the question, there's a blank after span 1 to). The if adds the ( or ) to hold without a space, or the line to hold after a space. If the close parenthesis is present, remove the space after the open parenthesis and before the close parenthesis, print hold, and reset hold to empty. Always skip the rest of the script with next. The rest of the script is { print } — print unconditionally, often written 1 by minimalists.
The file data is copy'n'paste from the data in the question.
Output:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
The 'Another Line' (with capital L) has a trailing blank because the data in the question does.

With awk
$ cat fmt.awk
function rem_wsp(s) { # remove white spaces
gsub(/[\t ]/, "", s)
return s
}
function beg() {return rem_wsp($0)=="("}
function end() {return rem_wsp($0)==")"}
function dump_block() {
print "(" block ")"
}
beg() {
in_block = 1
next
}
end() {
dump_block()
in_block = block = ""
next
}
in_block {
if (length(block)>0) sep = " "
block = block sep $0
next
}
{
print
}
END {
if (in_block) dump_block()
}
Usage:
$ awk -f fmt.awk fime.dat

Related

Replace string with text only when a given text precedes it

I have about one hundred Markdown files that contain snippets of Latex like this:
<div latex="true" class="task" id="Task">
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</div>
I'd like to replace the <div> tags with pseudotags that are easier to read, like this:
<task>
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</task>
This would be trivial if all my <div> tags were marking 'tasks', but I have similar divs for 'journal' and 'highlight'. I need a process that will change the </div> to </task> only when the preceding <div> has the class or id 'task', and likewise for 'journal' and 'highlight'.
Having looked around Stack Overflow for a while, I find many examples of multiline search and replace that do almost what I want to do, but the syntax (particularly for sed) is so difficult to untangle I can't adapt it for the above case. My next option is to write a bash script to loop through line by line, but I have a feeling this might be too fragile.
Cheers
Ian
The following awk command works generically, under the following assumptions:
All opening and closing div tags are on their own lines.
Attributes all use "-quoting.
The new tag name is derived from the value of the class attribute only (this could be generalized if the rules were clearer).
awk -F ' class="' '
/^<div / && NF > 1 { tag=$2; sub("\".*", "", tag); printf "<%s>\n", tag; next }
/^<\/div>/ && tag != "" { printf "</%s>\n", tag; tag=""; next }
1
' file
-F ' class="' effectively splits each line into before (field 1, $1) and after (field 2, $2) the class attribute, if present. Only lines that have such an attribute will therefore have more than 1 field (NF > 1).
Processing the opening div tag:
Pattern /^<div / && NF > 1 therefor only matches lines that start with (^) <div and (&&) contain a class attribute (NF > 1)
tag=$2; sub("\".*", "", tag) extracts the class attribute value from the 2nd field, by replacing everything from the first " (the closing " of the attribute value) with the empty string, effectively retaining the attribute value only in variable tag.
printf "<%s>\n", tag prints the attribute value as the replacement opening tag.
next skips the rest of the script and moves to the next input line.
Processing the closing div tag:
/^<\/div>/ && tag != "" matches the closing div tag, assuming that a class attribute value was found in the previous opening tag (tag != "").
printf "</%s>\n", tag prints the new closing tag.
tag="" resets the most recent replacement tag so that any subsequent div elements that do not have class attributes don't accidentally get renamed too.
next skips the rest of the script and moves to the next input line.
All other lines:
1 simply prints all other lines as-is. (1 is a common Awk shorthand for { print }: Pattern 1, interpreted as a Boolean, is by definition true, and a pattern without associated action { ... } prints the input line by default).
No loop needed. Just pipe the files though this...
sed '/Task/s/<div.*>/<task>/g;s/<\/div>/<\/task>/g'
/Task at the beginning makes sed edit lines with the name Task in it only.
With s/NAME/NEWNAME/ you replace some text one by one.
Adding .* will replace all text starting at this point.
Last but not least, g stands for global and will edit all entries this way.
Second command (after ;) will replace </div> with </task>. Its a part of the same command like before. The difference this time is that a / (slash) will be used by sed it self, if not declared other wise! This can be archived via a \ (backslash).
Here you go. The output of your file will look like this....
<task>
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</task>
This might work for you (GNU sed):
v='task|journal|highlight'
sed -ri '/^<div/{:a;N;/^<\/div/M!ba;s/^<.*class="('$v')"[^>]*(.*<\/)div/<\1\2\1/}' file1 file2 file3 ...
This stores the div statements in the pattern space and then substitutes (or not) the required values depending on the shell variable set beforehand.
N.B. the alternatives are stored in the shell variable v separated by |
This should do the trick:
$msys\bin\sed -En "s/<div latex=\"true\" class=\"task\" id=\"Task\">/<task>/;T;{:a;N;s/<\/div>/<\/task>/;Ta;p;}" input.txt
These are the building blocks, in case you want to adapt it:
make a loop:{:a;
it ends when the second replacement triggers: s/<\/div>/<\/task>/;Ta;
only start it, if the first replacement triggered:
s/<div latex=\"true\" class=\"task\" id=\"Task\">/<task>/;T;
inside the loop just collect lines into pattern space:N;
at the end of the loop just print:p;}
called with extended regular expressions and without default-printing
(mine is a windows/msys sed, just so you know):$msys\bin\sed -En

Find lines that have partial matches

So I have a text file that contains a large number of lines. Each line is one long string with no spacing, however, the line contains several pieces of information. The program knows how to differentiate the important information in each line. The program identifies that the first 4 numbers/letters of the line coincide to a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which coincides with a specific instrument. However these lines are not identical. The program I'm using can not do its calculation if there are duplicates of the same station (ie, there are two 1435* stations). I need to find a way to search through my text files and identify if there are any duplicates of the partial strings that represent the stations within the file so that I can delete one or both of the duplicates. If I could have BASH script output the number of the lines containing the duplicates and what the duplicates lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples of this. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
a[substr($0,1,4)]++ # put prefixes to array and count them
}
END { # in the end
for (i in a) { # go thru all indexes
if(a[i]>1) print i": "a[i] # and print out the duplicate prefixes and their counts
}
}
Slightly roundabout but this should work-
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
echo -n "$i "
grep -c ^"$i" file.txt #This tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
Use following Python script(syntax of python 2.7 version used)
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name,'r')
device = {}
line_count = 0
for line in f1:
line_count += 1
if device.has_key(line[:4]):
device[line[:4]] = device[line[:4]] + "," + str(line_count)
else:
device[line[:4]] = str(line_count)
f1.close()
print device
here the script reads each line and initial 4 character of each line are considered as device name and creates a key value pair device with key representing device name and value as line numbers where we find the string(device name)
following would be output
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
this might help you out!!

sed to get string between two patterns

I am working on a latex file from which I need to pick out the references marked by \citep{}. This is what I am doing using sed.
cat file.tex | grep citep | sed 's/.*citep{\(.*\)}.*/\1/g'
Now this one works if there is only one pattern in a line. If there are more than one patterns i.e. \citep in a line, it fails. It fails even when there is only one pattern but more than one closing bracket }. What should I do, so that it works for all the patterns in a line and also for the exclusive bracket I am looking for?
I am working on bash. And a part of the file looks like this:
of the Asian crust further north \citep{TapponnierM76, WangLiu2009}. This has led to widespread deformation both within and
\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic
across the Dawki fault and Naga Hills, increasing eastwards from $\sim$3~mm/yr to $\sim$13~mm/yr \citep{Vernantetal2014}.
GPS velocity vectors \citep{TapponnierM76, WangLiu2009}. Sikkim Himalaya lies at the transition between this relatively simple
this transition includes deviation of the Himalaya from a perfect arc beyond 89\deg\ longitude \citep{BendickB2001}, reduction
\citep{BhattacharyaM2009, Mitraetal2010}. Rivers Tista, Rangit and Rangli run through Sikkim eroding the MCT and Ramgarh
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography,
\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study
field results corroborate well with seismic studies in this region \citep{Actonetal2011, Arunetal2010}. From studies of
On one line, I get answer like this
BilhamE01, TapponnierM76} and by distributed seismicity across the region (Fig. \ref{fig1_2
whereas I am looking for
BilhamE01, TapponnierM76
Another example with more than one /citep patterns gives output like this
Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study
whereas I am looking for
Pauletal2015 Mitraetal2005
Can anyone please help?
it's a greedy match change the regex match the first closing brace
.*citep{\([^}]*\)}
test
$ echo "\citep{string} xyz {abc}" | sed 's/.*citep{\([^}]*\)}.*/\1/'
string
note that it will only match one instance per line.
If you are using grep anyway, you can as well stick with it (assuming GNU grep):
$ echo $str | grep -oP '(?<=\\citep{)[^}]+(?=})'
BilhamE01, TapponierM76
For what it's worth, this can be done with sed:
echo "\citep{string} xyz {abc} \citep{string2},foo" | \
sed 's/\\citep{\([^}]*\)}/\n\1\n\n/g; s/^[^\n]*\n//; s/\n\n[^\n]*\n/, /g; s/\n.*//g'
output:
string, string2
But wow, is that ugly. The sed script is more easily understood in this form, which happens to be suitable to be fed to sed via a -f argument:
# change every \citep{string} to <newline>string<newline><newline>
s/\\citep{\([^}]*\)}/\n\1\n\n/g
# remove any leading text before the first wanted string
s/^[^\n]*\n//
# replace text between wanted strings with comma + space
s/\n\n[^\n]*\n/, /g
# remove any trailing unwanted text
s/\n.*//
This makes use of the fact that sed can match and sub the newline character, even though reading a new line of input will not result in a newline initially appearing in the pattern space. The newline is the one character that we can be certain will appear in the pattern space (or in the hold space) only if sed puts it there intentionally.
The initial substitution is purely to make the problem manageable by simplifying the target delimiters. In principle, the remaining steps could be performed without that simplification, but the regular expressions involved would be horrendous.
This does assume that the string in every \citep{string} contains at least one character; if the empty string must be accommodated, too, then this approach needs a bit more refinement.
Of course, I can't imagine why anyone would prefer this to #Lev's straight grep approach, but the question does ask specifically for a sed solution.
f.awk
BEGIN {
pat = "\\citep"
latex_tok = "\\\\[A-Za-z_][A-Za-z_]*" # match \aBcD
}
{
f = f $0 # store content of input file as a sting
}
function store(args, n, k, i) { # store `keys' in `d'
gsub("[ \t]", "", args) # remove spaces
n = split(args, keys, ",")
for (i=1; i<=n; i++) {
k = keys[i]
d[k]
}
}
function ntok() { # next token
if (match(f, latex_tok)) {
tok = substr(f, RSTART ,RLENGTH)
f = substr(f, RSTART+RLENGTH-1 )
return 1
}
return 0
}
function parse( i, rc, args) {
for (;;) { # infinite loop
while ( (rc = ntok()) && tok != pat ) ;
if (!rc) return
i = index(f, "{")
if (!i) return # see `pat' but no '{'
f = substr(f, i+1)
i = index(f, "}")
if (!i) return # unmatched '}'
# extract `args' from \citep{`args'}
args = substr(f, 1, i-1)
store(args)
}
}
END {
parse()
for (k in d)
print k
}
f.example
of the Asian crust further north \citep{TapponnierM76, WangLiu2009}. This has led to widespread deformation both within and
\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic
across the Dawki fault and Naga Hills, increasing eastwards from $\sim$3~mm/yr to $\sim$13~mm/yr \citep{Vernantetal2014}.
GPS velocity vectors \citep{TapponnierM76, WangLiu2009}. Sikkim Himalaya lies at the transition between this relatively simple
this transition includes deviation of the Himalaya from a perfect arc beyond 89\deg\ longitude \citep{BendickB2001}, reduction
\citep{BhattacharyaM2009, Mitraetal2010}. Rivers Tista, Rangit and Rangli run through Sikkim eroding the MCT and Ramgarh
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography,
\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study
field results corroborate well with seismic studies in this region \citep{Actonetal2011, Arunetal2010}. From studies of
Usage:
awk -f f.awk f.example
Expected ouput:
BendickB2001
Arunetal2010
Pauletal2015
Mitraetal2005
BilhamE01
Mukuletal2009
TapponnierM76
WangLiu2009
BhattacharyaM2009
Mitraetal2010
Actonetal2011
Vernantetal2014

Using awk or sed to print column of CSV file enclosed in double quotes

I'm working on a csv file like the one below, comma delimited, each cell is enclosed in double quotes, but some of them contain double quote and/or comma inside double quote enclosure. The actual file contain around 300 columns and 200,000 rows.
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, "cde" here","cde","cde","cde"
I'll need to remove some unless columns, and merge last few columns, instead of having "," in between them, I need </br>. and move second column to the end. Anything within the cells should be the same, with double quotes and commas as the original file. Below is an example of the output that I need.
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, "cde" here","cde</br>cde</br>cde","cde"
In this example I want to remove column3 and merge column 5, 6, 7.
Below is the code that I tried to use, but it is reading either double quote and/or comma, which is end of the row to be different than what I expected.
awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4,$5"</br>"$6"</br>"$7",$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
sed is used to remove beginning and ending double quote of a cell.
The output file that I'm getting right now, if previous field contains a double quote, it will consider that is the beginning of a cell, so the following values are often pushed up a column.
Other code that I have used consider every comma as beginning of a cell, so that won't work as well.
awk -F',' 'BEGIN{OFS=",";} {print $1,$4,$5"</br>"$6"</br>"$7",$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
Any help is greatly appreciated. thanks!
CSV is a loose format. There may be subtle variations in formatting. Your particular format may or may not be expressible with a regular grammar/regular expression. (See this question for a discussion about this.) Even if your particular formatting can be expressed with regular expressions, it may be easier to just whip out a parser from an existing library.
It is not a bash/awk/sed solution as you may have wanted or needed, but Python has a csv module for parsing CSV files. There are a number of options to tweak the formatting. Try something like this:
#!/usr/bin/python
import csv
with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
inreader = csv.reader(infile)
outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
for row in inreader:
# Merge fields 5,6,7 (indexes 4,5,6) into one
row[4] = "</br>".join(row[4:7])
del row[5:7]
# Copy second field to the end
row.append(row[1])
# Remove second and third fields
del row[1:3]
# Write manipulated row
outwriter.writerow(row)
Note that in Python, indexes start with 0 (e.g. row[1] is the second field). The first index of a slice is inclusive, the last is exclusive (row[1:3] is row[1] and row[2] only). Your formatting seems to require quotes around every field, hence the quoting=csv.QUOTE_ALL. There are more options at Dialects and Formatting Parameters.
The above code produces the following output:
"Column1","Column4","Column5</br>Column6</br>Column7","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, cde"" here""","cde</br>cde</br>cde","cde"
There are two issues with this:
It doesn't treat the first row any differently, so the headers of columns 5, 6, and 7 are merged like the other rows.
Your input CSV contains "some other, "cde" here" (third row, fourth column) with unescaped quotes around the cde. There is another case of this on line two, but it was removed since it is in column 3. The result contains incorrect quotes.
If these quotes are properly escaped, your sample input CSV file becomes
infile.csv (escaped quotes):
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"
Now consider this modified Python script that doesn't merge columns on the first row:
#!/usr/bin/python
import csv
with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
inreader = csv.reader(infile)
outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
first_row = True
for row in inreader:
if first_row:
first_row = False
else:
# Merge fields 5,6,7 (indexes 4,5,6) into one
row[4] = "</br>".join(row[4:7])
del row[5:7]
# Copy second field (index 1) to the end
row.append(row[1])
# Remove second and third fields
del row[1:3]
# Write manipulated row
outwriter.writerow(row)
The output outfile.csv is
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, ""cde"" here","cde</br>cde</br>cde","cde"
This is your sample output, but with properly escaped "some other, ""cde"" here".
This may not be precisely what you wanted, not being a sed or awk solution, but I hope it is still useful. Processing more complicated formats may justify more complicated tools. Using an existing library also removes a few opportunities to make mistakes.
This might be an oversimplification of the problem but this has worked for me with your test data:
cat /tmp/inputfile.csv | sed 's#\"\,\"#|#g' | sed 's#"</br>"#</br>#g' | awk 'BEGIN {FS="|"} {print $1 "," $4 "," $5 "</br>" $6 "</br>" $7 "," $2}'
Please not that I am on Mac probably that's why I had to wrap the commas in the AWK script in quotation marks.

awk script: removing line previous to pattern match and after, until a blank line

I began learning awk yesterday in attempt to solve this problem (and learn a useful new language). At first I tried using sed, but soon realized it was not the correct tool to access/manipulate lines previous to a pattern match.
I need to:
Remove all lines containing "foo" (trivial on it's own, but not whilst keeping track of previous lines)
Find lines containing "bar"
Remove the line previous to the one containing "bar"
Remove all lines after and including the line containing "bar" until we reach a blank line
Example input:
This is foo stuff
I like food!
It is tasty!
stuff
something
stuff
stuff
This is bar
Hello everybody
I'm Dr. Nick
things
things
things
Desired output:
It is tasty!
stuff
something
stuff
things
things
things
My attempt:
{
valid=1; #boolean variable to keep track if x is valid and should be printed
if ($x ~ /foo/){ #x is valid unless it contains foo
valid=0; #invalidate x so that is doesn't get printed at the end
next;
}
if ($0 ~ /bar/){ #if the current line contains bar
valid = 0; #x is invalid (don't print the previous line)
while (NF == 0){ #don't print until we reach an empty line
next;
}
}
if (valid == 1){ #x was a valid line
print x;
}
x=$0; #x is a reference to the previous line
}
Super bonus points (not needed to solve my problem but I'm interesting in learning how this would be done):
Ability to remove n lines before pattern match
Option to include/disclude the blank line in output
Below is an alternative awk script using patterns & functions to trigger state changes and manage output, which produces the same result.
function show_last() {
if (!skip && !empty) {
print last
}
last = $0
empty = 0
}
function set_skip_empty(n) {
skip = n
last = $0
empty = NR <= 0
}
BEGIN { set_skip_empty(0) }
END { show_last() ; }
/foo/ { next; }
/bar/ { set_skip_empty(1) ; next }
/^ *$/ { if (skip > 0) { set_skip_empty(0); next } else show_last() }
!/^ *$/{ if (skip > 0) { next } else show_last() }
This works by retaining the "current" line in a variable last, which is either
ignored or output, depending on other events, such as the occurrence of foo and bar.
The empty variable keeps track of whether or not the last variable is really
a blank line, or simple empty from inception (e.g., BEGIN).
To accomplish the "bonus points", replace last with an array of lines which could then accumulate N number of lines as desired.
To exclude blank lines (such as the one that terminates the bar filter), replace the empty test with a test on the length of the last variable. In awk, empty lines have no length (but, lines with blanks or tabs *do* have a length).
function show_last() {
if (!skip && length(last) > 0) {
print last
}
last = $0
}
will result in no blank lines of output.
Read each blank-lines-separated paragraph in as a string, then do a gsub() removing the strings that match the RE for the pattern(s) you care about:
$ awk -v RS= -v ORS="\n\n" '{ gsub(/[^\n]*foo[^\n]*\n|\n[^\n]*\n[^\n]*bar.*/,"") }1' file
It is tasty!
stuff
something
stuff
things
things
things
To remove N lines, change [^\n]*\n to ([^\n]*\n){N}.
To not remove part of the RE use GNU awk and use gensub() instead of gsub().
To remove the blank lines, change the value of ORS.
Play with it...
This awk should work without storing full file in memory:
awk '/bar/{skip=1;next} skip && p~/^$/ {skip=0} NR>1 && !skip && !(p~/foo/){print p} {p=$0}
END{if (!skip && !(p~/foo/)) print p}' file
It is tasty!
stuff
something
stuff
things
things
things
One way:
awk '
/foo/ { next }
flag && NF { next }
flag && !NF { flag = 0 }
/bar/ { delete line[NR-1]; idx-=1; flag = 1; next }
{ line[++idx] = $0 }
END {
for (x=1; x<=idx; x++) print line[x]
}' file
It is tasty!
stuff
something
stuff
things
things
things
If line contains foo skip it.
If flag is enabled and line is not blank skip it.
If flag is enabled and line is blank disable the flag.
If line contains bar delete the previous line, reset the counter, enable the flag and skip it
Store all lines that manages through in array indexed at incrementing number
In the END block print the lines.
Side Notes:
To remove n number of lines before a pattern match, you can create a loop. Start with current line number and using a reverse for loop you can remove lines from your temporary cache (array). You can then subtract n from your self defined counter variable.
To include or exclude blank lines you can use the NF variable. For a typical line, NF variable is set to number of fields based on your field separator. For blank lines this variable is 0. For example, if you modify the line above END block to NF { line[++idx] = $0 } in the answer above you will see we have bypassed all blank lines from output.

Resources