How to print both the grep pattern and the resulting matched line on the same line? - bash

I have two files File01 and File02.
File01, looks like this:
BU24DRAFT_430534
BU24DRAFT_488391
BU24DRAFT_488386
BU24DRAFT_417707
BU24DRAFT_417704
BU24DRAFT_488335
BU24DRAFT_429509
BU24DRAFT_210092
BU24DRAFT_229465
BU24DRAFT_498094
BU24DRAFT_416051
BU24DRAFT_482795
BU24DRAFT_4305
BU24DRAFT_10621
BU24DRAFT_4883
File02, looks like this:
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79
Using the strings from File01, via grep, I would like to identify the lines in File02 that match, and with this information generate a file that looks like this:
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488391
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488386
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417707
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417704
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488335
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
I tried generating such a file using the following code:
while read r;do CMD01=$(echo $r);CMD02=$(grep $r File01); echo "$CMD02 $CMD01";done < File02 | awk '(NR>1) && ($2 > 2 ) '
The problem I run into is that I obtain extra matching lines:
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_4305
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_4883
Here, for example, the string BU24DRAFT_4305 is wrongly matching the line: XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79
This result is incorrect: a string from File01 should only match a line in File02 that contains the complete string.
Any ideas that could help me will be appreciated.

For the updated sample input and full-matching requirement, and assuming you never have any regexp metacharacters in file1 and that the matching strings in file2 are never at the start or end of the line:
$ awk 'NR==FNR{strs[$0]; next} {for (str in strs) if ($0 ~ ("[^[:alnum:]]"str"[^[:alnum:]]")) print $0, str}' file1 file2
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_430534
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488391
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488386
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417707
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417704
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488335
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
Original answer doing partial matching:
The correct approach is 1 call to awk:
$ awk 'NR==FNR{strs[$0]; next} {for (str in strs) if (index($0,str)) print $0, str}' file1 file2
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice and https://mywiki.wooledge.org/Quotes for some of the issues with the script in your question.
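If File01 could ever contain regexp metacharacters (the answer above assumes it does not), a hedged variation on the same idea keeps the literal index() search and checks the boundary characters by hand instead of building a regexp. Note it only inspects the first occurrence of each string per line, which is enough for data like the sample:
awk '
NR==FNR { strs[$0]; next }
{
    for (str in strs) {
        p = index($0, str)                      # literal search, no regexp involved
        if (p) {
            before = (p > 1) ? substr($0, p-1, 1) : ""
            after  = substr($0, p + length(str), 1)
            if (before !~ /[[:alnum:]]/ && after !~ /[[:alnum:]]/)
                print $0, str
        }
    }
}' file1 file2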

So, it looks like yours mostly works. A lot of what you are doing here is unnecessary. Here is your script broken into multiple lines for readability:
while read r; do
CMD01=$(echo $r)
CMD02=$(grep $r zztest01)
echo "$CMD02 $CMD01"
done < <(head zztest) | awk '(NR>1) && ($2 > 2 ) '
First, CMD01=$(echo $r): This is really the same (or intended to be) as CMD01="$r" so kind of useless.
Then, < <(head zztest): You are using head to output the contents of the file. This actually works just as well with a simple redirection like this: < zztest.
Last, | awk '(NR>1) && ($2 > 2 ) ': This appears to just be some sort of logic on whether we are going to print anything or not.
Here is a simplified version:
while read r; do
CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r"
done < zztest
Explanation
CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r": The main part of this is really two commands separated by &&, which means: execute the second command only if the first one succeeded. grep returns a "failure" code if it does not find what it is looking for. So, if grep does not find a match, echo will not run.
The output of grep will be stored in the variable $CMD02. Then, you will echo that along with $r for each match.
If you really want to keep this on one line like the original:
while read r; do CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r"; done < zztest
Update
If you want to avoid partial matches as Ed asked, you can change the grep to grep "$r[^0-9]" zztest01. This will avoid a match if there is a trailing digit after the initial match string (which is really an assumption given the sample).
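Putting the simplified loop and that partial-match guard together (keeping the zztest/zztest01 test-file names used above; a sketch, not a tested drop-in):
while read -r r; do                               # -r keeps backslashes literal
    CMD02=$(grep "$r[^0-9]" zztest01) && echo "$CMD02 $r"
done < zztest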

While not explicit in the question, it seems that each pattern should only match a single line in the input file (File02).
Based on this observation, it is possible to improve the performance of Ed Morton's solution:
awk '
NR==FNR{strs[$0]; next}
{ for (str in strs) if (index($0,str)) { print $0, str ; delete strs[str]; next } }
' file1 file2
For large files with many patterns, this can reduce the runtime by a factor of about 4.
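If the pattern list can be exhausted before the end of file2, a further small extension (again assuming each pattern matches at most one line) stops reading as soon as every pattern has been seen:
awk '
NR==FNR { strs[$0]; n++; next }
{
    for (str in strs)
        if (index($0, str)) {
            print $0, str
            delete strs[str]
            if (--n == 0) exit          # all patterns matched; stop reading file2
            next
        }
}' file1 file2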

Related

How to print all matching names given as arguments?

I want to write a script that, for any name given as an argument, prints the list of paths
to the home directories of people with that name.
I am new to scripting. Is there any simple way to do this with the awk or egrep command?
Example:
$ show names jakub anna (names given as arguments)
/home/users/jakubo
/home/students/j_luczka
/home/students/kubeusz
/home/students/jakub5z
/home/students/qwertinx
/home/users/lazinska
/home/students/annalaz
Here is my friend's code, but I have to write it in a different way, and it has to be as simple as this code:
#!/bin/bash
for name in $#
do
awk -v n="$name" -F ':' 'BEGIN{IGNORECASE=1};$5~n{print $6}' /etc/passwd | while read line
do
echo $line
done
done
It is possible to use a simple awk script to look for matching names.
The list of names can be passed as a space separated list to awk, which will construct (in the BEGIN section) a combined pattern (e.g. '(names|jakub|anna)'). The pattern is used for testing the user name column ($5) of the password file.
#! /bin/sh
awk -v "L=$*" -F: '
BEGIN {
name_pat = "(" gensub(" ", "|", "g", L) ")"
}
$5 ~ name_pat { print $6 }
' /etc/passwd
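Note that gensub() is specific to GNU awk; a sketch of the same idea with plain gsub(), which should work in any POSIX awk:
#! /bin/sh
awk -v "L=$*" -F: '
BEGIN {
    pat = L
    gsub(/ /, "|", pat)                 # turn the space-separated list into an alternation
    name_pat = "(" pat ")"
}
$5 ~ name_pat { print $6 }
' /etc/passwd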
Since at present the question as a whole is unclear, this is more of a long comment, and only a partial answer.
There is one easy simplification, since the sample code includes:
... | while read line
do
echo $line
done
All of the code shown above from the | onward is needless and does nothing (much like a UUoC, a useless use of cat), and should therefore be removed. (Strictly speaking, echo $line with an unquoted $line would collapse formatting and repeated spaces, but that's not relevant to the task at hand, so we can say the code above does nothing.)
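For illustration, a sketch of the friend's script with that needless pipeline removed (and "$@" used to iterate over the arguments; $# would only give their count):
#!/bin/bash
for name in "$@"
do
    # IGNORECASE only has an effect in GNU awk
    awk -v n="$name" -F ':' 'BEGIN{IGNORECASE=1} $5~n{print $6}' /etc/passwd
done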

Grep list (file) from another file

I'm new to bash and trying to extract a list of patterns from a file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
The desired output is something like:
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from File1.txt.
The code I tried was:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
for q in "$i"; # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But the output is only the filename and then some list of strings without the pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each string belongs to, and I have to check and assign them manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a loop which looks for one pattern at a time:
while read -r pat; do
echo "$pat"
grep "$pat" *.txt
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
if($0 ~ a[j]) {
print FILENAME ":" FNR ":" $0 >> ("output." a[j])   # one output file per pattern
next }
}' File1.txt base.csv
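As an aside (not part of the answer above): if fixed-string, word-boundary matching is all that is needed and per-pattern grouping is not, grep (GNU or BSD) can do it directly:
grep -wFf File1.txt base.csv > output.txt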
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
while read -r q # read file line by line
do
echo "$q" >>"output.${i}"
grep -f "${q}" base.csv >>"output.${i}"
echo "\n"
done < "${i}"
done
Here is one that splits (with split(), on commas, with quotes and spaces stripped off) each record of file2 into words, and for each word stores the names of the records it occurs in (line 1 etc.), comma-separated, along with the full records:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
I tried both variants above but kept getting various errors ("do" expected) or misbehavior (it got the names of the pattern blocks, e.g. ABC, BDF, but no lines).
I gave up for a while and then eventually tried another way.
While the base goal was to cycle through the pattern list files, search for the patterns in a huge file, and write out specific columns from the lines found, I simply wrote:
for i in *.txt # cycle through files with patterns
do
grep -F -f "$i" bigfile.csv >> "${i}.out1" # greps all patterns from the current file
cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2" # cuts the columns of interest and writes them out to another file
done
I'm aware that this code should be improved using some fancy pipeline features, but it works perfectly as is; I hope it'll help somebody in a similar situation. You can easily add some echoes to write out the pattern list names as I initially requested.
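The intermediate .out1 file can be avoided by piping grep straight into cut; a sketch of the same logic:
for i in *.txt                                   # cycle through files with patterns
do
    grep -F -f "$i" bigfile.csv | cut -f 2,3,4,7 > "${i}.out"
done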

Replace some lines in fasta file with appended text using while loop and if/else statement

I am working with a fasta file and need to add line-specific text to each of the headers. So for example if my file is:
>TER1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
I want a while loop that will read through each line; for those with a > at the start, I want to append |population: plus the first three characters after the >. So line one would be:
>TER1|population:TER
etc.
I can't figure out how to make this work. Here is my best attempt so far:
filename="testfasta.fa"
while read -r line
do
if [[ "$line" == ">"* ]]; then
id=$(cut -c2-4<<<"$line")
printf $line"|population:"$id"\n" >>outfile
else
printf $line"\n">>outfile
fi
done <"$filename"
This produces a file with the original headers and following line each on a single line.
Can someone tell me where I'm going wrong? My if/else isn't working at all!
Thanks!
You could use a while loop if you really want,
but sed would be simpler:
sed -e 's/^>\(...\).*/&|population:\1/' "$filename"
That is, for lines starting with > (pattern: ^>),
capture the next 3 characters (with \(...\)),
and match the rest of the line (.*),
replace with the line as it was (&),
and the fixed string |population:,
and finally the captured 3 characters (\1).
This will produce for your input:
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Or you can use this awk, also producing the same output:
awk '{sub(/^>.*/, $0 "|population:" substr($0, 2, 3))}1' "$filename"
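As for where the original loop goes wrong: one fragile spot is using the data itself as printf's format string. A minimal corrected sketch of the loop (an aside, not from the answer above), changing only the printf calls to use a fixed format string and redirecting the output once at the end:
filename="testfasta.fa"
while read -r line
do
    if [[ "$line" == ">"* ]]; then
        id=$(cut -c2-4 <<<"$line")               # the three characters after ">"
        printf '%s|population:%s\n' "$line" "$id"
    else
        printf '%s\n' "$line"
    fi
done < "$filename" > outfile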
You can do this quickly in awk:
awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' infile.txt > outfile.txt
$ awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' testfile
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Here awk will:
Test if the record starts with a >. The $1 looks at the first field, but $0 for the entire record would work just as well in this case. The ~ performs a regex test, and ^> means "starts with >", making the test: ($1~/^>/)
If so, it will set the first field to the output you are looking for (using substr() to get the bits of the string you want): {$1=$1"|population:"substr($1,2,3)}
Finally, it will print out the entire record (with the changes if applicable): the trailing 1 is shorthand for {print $0}, i.e. print the entire record (the empty {} before it does nothing).

Bash - listing files neatly

I have a non-determinate list of file names that I would like to output to the user in a script. I don't mind if it's a paragraph or in columns (like the output of ls. How does ls manage it?). In fact I only have the following requirements:
file names need to stay on the same line (yes, that even means files with a space in their name. If someone is dumb enough to use a newline in a filename, though, they deserve what they get.)
If the output is formatted as a paragraph, I'd like to see it indented on the left and right to separate it from other text. Sort of like the way apt-get upgrade handles the list of packages to install.
I would love not to write my own function for this - at least not a complicated one. There are so many text formatting utilities in linux!
The utility should be available in the default Ubuntu install.
It should handle relatively large input, just in case. Something like 2000 characters or so?
It seems like a simple proposition, but I can't seem to get it to work. The column command is out simply because it can't handle large chunks of data. fmt and fold both don't care about delimiters. printf looks like it would work... if I wrote a script for it.
Is there a more flexible tool I've overlooked, or a simple way to do this?
Here is a simple formatter that, it seems to me, is good enough:
% ls | awk '
NR==1 {for(i=1;i<9;i++)printf "----+----%d", i; print ""
line=" " $0;l=2+length($0);next}
{if(l+1+length($0)>80){
print line; line = " " $0 ; l = 2+length($0) ; next}
{l=l+length($0)+1; line=line " " $0}}'
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
3inarow.py 5s.py a.csv a.not1.pdf a.pdf as de.1 asde.1 asdef.txt asde.py a.sh
a.tex auto a.wk bizarre.py board.py cc2012xyz2_5_5dp.csv cc2012xyz2_5_5dp.html
cc.py col.pdf col.sh col.sh~ col.tex com.py data data.a datazip datidisk
datizip.py dd.py doc1.pdf doc1.tex doc2 doc2.pdf doc2.tex doc3.pdf doc3.tex
e.awk Exit file file1 file2 geomedian.py group_by_1st group_by_1st.2
group_by_1st.mawk integers its.py join.py light.py listluatexfonts mask.py
mat.rix my_data nostop.py numerize.py pepp.py pepp.pyc pi.pdf pippo muore
pippo.py pi.py pi.tex pizza.py plocol.py points.csv points.py puddu puffo
%
I had to simulate input using ls because you didn't show how to access your list of files. The window width is arbitrary as well, but it's easy to provide a value via a -v width=... option of awk.
Edit
I added an (unrequested) header line to my awk script because I wanted to test the effectiveness of the (very simple) algorithm.
Addendum
I'd like to stress that the simple formatter above doesn't split "file names" like this across lines, as in the following example:
% ls -d1 b*
bia nconodi
bianconodi.pdf
bianconodi.ppt
bin
b.txt
% ls | awk '
NR==1 {for(i=1;i<9;i++)printf "----+----%d", i; print ""
line=" " $0;l=2+length($0);next}
{if(l+1+length($0)>80){
print line; line = " " $0 ; l = 2+length($0) ; next}
{l=l+length($0)+1; line=line " " $0}}'
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
04_Tower.pdf 2plots.py 2.txt a.csv aiuole asdefff a.txt a.txt~ auto
bia nconodi bianconodi.pdf bianconodi.ppt bin Borsa Ferna.jpg b.txt
...
%
As you can see, in the first line there is enough space to print bia but not enough for the real filename bia nconodi, which is hence printed on the second line.
Addendum 2
This is the formatter the OP eventually went with:
local margin=4
local max=10
echo -e "$filenames" | awk -v width=$(tput cols) -v margin=$margin '
NR==1 {
for (i=0; i<margin; i++) {
line = line " "
}
line = line $0;
l = margin + margin + length($0);
next;
}
{
if (l+1+length($0) > width) {
print line;
line = ""
for (i=0; i<margin; i++) line=line " "
line = line $0 ;
l = margin + margin + length($0) ;
next;
}
{
l = l + length($0) + 1;
line = line " " $0;
}
}
END {
print line;
}'
Perhaps you're looking for /usr/bin/fold?
printf '%s ' * | fold -w 77 | sed -e 's/^/ /'
Replace the * with your list, of course; if your files are in an array (they should be; storing filenames in scalar variables is lossy), that'd be "${your_array[@]}".
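A sketch of that array form (the -s flag is an addition here: it asks fold to break at spaces rather than mid-name):
files=( * )                                      # your list of filenames
printf '%s ' "${files[@]}" | fold -s -w 77 | sed -e 's/^/ /'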
If you have your filenames in a variable, this will create 3 columns; you can change -3 to whatever number of columns you want:
echo "$var" | pr -3 -t
or if you need to get them from the filesystem:
find . -printf "%f\n" 2>/dev/null | pr -3 -t
From what you stated in the comments, I think this may be what you are looking for. The find command prints the file or directory name along with a newline, and you can do additional filtering of the filenames by piping through grep or sed prior to pr. The pr command is for print; -3 specifies 3 columns, and -t omits headers and trailers. You can adjust it to your preferences.
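For example (an illustrative variation, not from the answer above), listing only regular files, sorted, in four columns:
find . -maxdepth 1 -type f -printf "%f\n" 2>/dev/null | sort | pr -4 -t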

Multi-line grep with positive and negative filtering

I need to grep for a multi-line string that doesn't include one string, but does include others. This is what I'm searching for in some HTML files:
<not-this>
<this> . . . </this>
</not-this>
In other words, I want to find files that contain <this> and </this> on the same line, but should not be surrounded by html tags <not-this> on the lines before and/or after. Here is some shorthand logic for what I want to do:
grep 'this' && '/this' && !('not-this')
I've seen answers with the following...
grep -Er -C 2 '.*this.*this.*' . | grep -Ev 'not-this'
...but this just erases the line(s) containing the "not" portion, and displays the other lines. What I'd like is for it to not pull those results at all if "not-this" is found within a line or two of "this".
Is there a way to accomplish this?
P.S. I'm using Ubuntu and gnome-terminal.
It sounds like an awk script might work better here:
$ cat input.txt
<not-this>
<this>BAD! DO NOT PRINT!</this>
</not-this>
<yes-this>
<this>YES! PRINT ME!</this>
</yes-this>
$ cat not-this.awk
BEGIN {
notThis=0
}
/<not-this>/ {notThis=1}
/<\/not-this>/ {notThis=0}
/<this>.*<\/this>/ {if (notThis==0) print}
$ awk -f not-this.awk input.txt
<this>YES! PRINT ME!</this>
Or, if you'd prefer, you can squeeze this awk script onto one long line:
$ awk 'BEGIN {notThis=0} /<not-this>/ {notThis=1} /<\/not-this>/ {notThis=0} /<this>.*<\/this>/ {if (notThis==0) print}' input.txt
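Since the question was about finding which files contain such a match, the same script extends naturally to several HTML files, with FILENAME tagging each hit (a sketch under that assumption):
awk '
FNR == 1            { notThis = 0 }              # reset the state for each new file
/<not-this>/        { notThis = 1 }
/<\/not-this>/      { notThis = 0 }
/<this>.*<\/this>/ && !notThis { print FILENAME ": " $0 }
' *.html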
