How to extract a fragment of text from file? - shell

I have a file with the following text:
<div>
<b>a:</b> <a class='a' href='/a/1'>a1</a><br>
<b>b:</b> <a class='b' href='/b/2'>b2</a><br>
<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>
<b>d:</b> "ef"<br><br><div class='start'>123
<br>ghij.
<br>klmn
<br><br><b>end</b>
</div>
</div>
I want to do the following:
Whenever a line starts with <b>a:</b> <a class='a', I want to copy the text between the > symbol after <a class='a' and </a> — it must be stored in a[1];
Similarly, whenever a line starts with <b>b:</b> <a class='b', I want to copy the text between the > symbol after <a class='b' and </a> — it must be stored in b[1];
Whenever a line contains <div class='start'>, I want to create the variable t whose value starts with the text that occurs between <div class='start'> and the end of this line, then set flag to 1;
If the value of flag is already 1 and the current line does not start with <br><br><b>end</b>, I want to append the current line to the current value of the variable t (using the space symbol as separator);
If the value of flag is already 1 and the current line starts with <br><br><b>end</b>, I want to concatenate three current values of a[1], b[1] and t (using ; as separator) and print the result to the output file, then set flag to 0, then clear the variable t.
I used the following code (for gawk 4.0.1):
gawk 'BEGIN {flag = 0; t = ""; }
{
if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ ) {
match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/, a) };
if ($0 ~ /^<b>b:<\/b> <a class=\x27b\x27/ ) {
match($0, /^<b>b:<\/b> <a class=\x27b\x27 href=\x27\/b\/[0-9]{1,}\x27>(.*)<\/a>/, b) };
if ($0 ~ /<div class=\x27start\x27>/ ) {
match($0, /^.*<div class=\x27start\x27>(.*)$/, s);
t = s[1];
flag = 1 };
if (flag == 1) {
if ($0 ~ /^<br><br><b>end<\/b>/) {
str = a[1] ";" b[1] ";" t;
print(str) > "output.txt";
flag = 0; str = ""; t = "" }
else {
t = t " " $0 }
}
}' input.txt
I was expecting the following output:
a1;b2;123 <br>ghij. <br>klmn
But the output is:
;;123 <b>d:</b> "ef"<br><br><div class='start'>123 <br>ghij. <br>klmn
Why are a[1] and b[1] empty? Why does <b>d:</b> "ef"<br><br><div class='start'> occur in the output? How to fix the code to obtain the expected output?

Here's the answers to your specific questions:
Q) Why are a[1] and b[1] empty?
A) They aren't when I try your script with gawk 5.1.1 so most likely either there's a bug in your awk version or some of the white space in your input isn't blanks as your script requires (maybe it's tabs), or you have some control chars or your awk version doesn't like using \x27 instead of \047 for 's.
Q) Why does <b>d:</b> "ef"<br><br><div class='start'> occur in the output?
A) Because you forgot a next in the block that matches on div so the next block is also executing and saving $0 from the div line.
Q) How to fix the code to obtain the expected output?
A) Here's how I'd approach your problem, using GNU awk for the 3rd arg to match() and \s shorthand for [:space:]:
$ cat tst.sh
#!/usr/bin/env bash
gawk '
BEGIN { OFS=";" }
match($0, /^<b>(.):<\/b>\s+<a\s+class=\047.\047\s+href=\047\/.\/[0-9]+\/?\047>(.*)<\/a>/, arr) {
vals[arr[1]] = arr[2]
}
match($0, /^.*<div\s+class=\047start\047>(.*)/, arr) {
vals["div"] = arr[1]
inDiv = 1
next
}
inDiv {
if ( /^<br><br><b>end<\/b>/ ) {
print vals["a"], vals["b"], vals["div"]
delete vals
inDiv = 0
}
else {
vals["div"] = vals["div"] " " $0
}
}
' 'input.txt' > 'output.txt'
$ ./tst.sh
$ cat output.txt
a1;b2;123 <br>ghij. <br>klmn
So
I'm using a single match() to capture all values for lines that look like your a, b, c lines for consistency, conciseness, and maintainability.
I'm always saving the match results in an array named arr rather than different arrays per occurrence so I don't have to remember to keep deleting those arrays and the code that uses the matches can all be homogenized.
I'm using a single associative array vals[] to hold all values indexed by the letter after <b> so we don't need to test those letters and create separate variables, it's easy to clear the data by just deleting the array rather than having to set multiple variables to null, and it's easy to add the c or any other similar values to the output later if desired.
I'm using \s+ instead of a single blank char for every space in the input to be agnostic about the actual space char(s) and number of spaces used.
I'm using \047 instead of \x27 to match 's for portability and robustness, see http://awk.freeshell.org/PrintASingleQuote.
I'm letting the shell handle all input/output rather than including output redirection in the awk script for consistency and improved robustness in error scenarios like files that can't be opened.
I named my flag variable inDiv rather than flag so it tells us what it means, i.e. that we're in the div block of the input, for improved clarity and easy of future maintenance. Naming a flag variable flag is like naming a numeric variable number instead of sum, count, ave, tot, diff or something else meaningful that'd improve your script. When you see people use f for the name of a flag variable, that f is shorthand for found, not for flag.

Demonstrating that gawk's regexes don't match perl's
perl:
$ echo aaaab | perl -nE '/a*(a+b)/ && say $1'
ab
$ echo aaaab | perl -nE '/a*?(a+b)/ && say $1'
aaaab
a*? matched the shortest sequence of zero or more a's, and the greedy a+ consumed the rest.
gawk
$ echo aaaab | gawk 'match($0, /a*(a+b)/, m) {print m[1]}'
ab
$ echo aaaab | gawk 'match($0, /a*?(a+b)/, m) {print m[1]}'
ab
Not the same behaviour: a*? is still greedy.

Why(...)a[1](...)empty?
match function does return 0 if not match was found, which allows to easy check if this is case, I selected part pertaining to filling a-array and altered it a bit
{
if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ )
{
print NR, match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}>(.*)<\/a>/, a);
}
}
then used it again
<div>
<b>a:</b> <a class='a' href='/a/1'>a1</a><br>
<b>b:</b> <a class='b' href='/b/2'>b2</a><br>
<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>
<b>d:</b> "ef"<br><br><div class='start'>123
<br>ghij.
<br>klmn
<br><br><b>end</b>
</div>
</div>
and got output
2 0
so condition in if worked as expected as line with <b>a:</b>... is 2nd line, however match was not found. This mean your regular expression is wrong, after examining, your regular expression is missing one single quote, it should be
/^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/
then
{
if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ )
{
print NR, match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/, a);
print a[1];
}
}
does give output
2 1
a1
(tested in gawk 4.2.1)

Related

Reverse complement SOME sequences in fasta file

I've been reading lots of helpful posts about reverse complementing sequences, but I've got what seems to be an unusual request. I'm working in bash and I have DNA sequences in fasta format in my stdout that I'd like to pass on down the pipe. The seemingly unusual bit is that I'm trying to reverse complement SOME of those sequences, so that the output has all the sequences in the same direction (for multiple sequence alignment later).
My fasta headers end in either "C" or "+". I'd like to reverse complement the ones that end in "C". Here's a little subset:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
I know there are lots of ways to reverse complement out there, like:
echo ACCTTGAAA | tr ACGTacgt TGCAtgca | rev
and
seqtk seq -r in.fa > out.fa
But I'm not sure how to do this for only those sequences that have a C at the end of the header. I think awk or sed is probably the ticket, but I'm at a loss as to how to actually code it. I can get the sequence headers with awk, like:
awk '/^>/ { print $0 }'
>chr1:84518073-84524089C
>chr1:86214203-86220231+
But if someone could help me figure out how to turn that awk statement into one that asks "if the last character in the header has a C, do this!" that would be great!
Edited to add:
I was so tired when I made this post, I apologize for not including my desired output. Here is what I'd like to output to look like, using my little example:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
You can see the sequence that ends in + is unchanged, but the sequence with a header that ends in C is reverse complemented.
Thanks!
An earlier answer (by Ed Morton) uses a self-contained awk procedure to selectively reverse-complement sequences following a comment line ending with "C". Although I think that to be the best approach, I will offer an alternative approach that might have wider applicability.
The procedure here uses awk's system() function to send data extracted from the fasta file in awk to the shell where the sequence can be processed by any of the many shell applications existing for sequence manipulation.
I have defined an awk user function to pass the isolated sequence from awk to the shell. It can be called from any part of the awk procedure:
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
The argument of the system function is a string containing the command you would type into terminal to achieve the desired outcome (in this case I've used one of the example reverse-complement routines mentioned in the question). The parts to note are the correct escaping of quote marks that are to appear in the shell command, and the variable s that will be substituted for the sequence string assigned to it when the function is called. The value of s is concatenated with the strings quoted before and after it in the argument to system() shown above.
isolating the required sequences
The rest of the procedure addresses how to achieve:
"if the last character in the header has a C, do this"
Before making use of shell applications, awk needs to isolate the part(s) of the file to process. In general terms, awk employs one or more pattern/action blocks where only records (lines by default) that match a given pattern are processed by the subsequent action commands. For example, the following illustrative procedure performs the action of printing the whole line print $0 if the pattern /^>/ && /C$/ is true for that line (where /^>/ looks for ">" at the start of a line and /C$/ looks for "C" at the end of the same line.:
/^>/ && /C$/{ print $0 }
For the current needs, the sequence begins on the next record (line) after any record beginning with > and ending with C. One way of referencing that next line is to set a variable (named line in my example) when the C line is encountered and establishing a later pattern for the record with numerical value one more than line variable.
Because fasta sequences may extend over several lines, we have to accumulate several successive lines following a C title line. I have achieved this by concatenating each line following the C title line until a record beginning with > is encountered again (or until the end of the file is reached, using the END block).
In order that sequence lines following a non-C title line are ignored, I have used a variable named flag with values of either "do" or "ignore" set when a title record is encountered.
The call to a the custom function processSeq() that employs the system() command, is made at the beginning of a C title action block if the variable seq holds an accumulated sequence (and in the END block for relevant sequences that occur at the end of the file where there will be no title line).
Test file and procedure
A modified version of your example fasta was used to test the procedure. It contains an extra relevant C record with three and-a-bit lines instead of two, and an extra irrelevant + record.
seq.fasta:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chranotherC
aatgaagtatattcagaatgtagaacattaactgacccgccatgttaatc
aatatctataagaccttttaaaaagcaccttagagattcaataaagtcag
gaagtatattcagaatgtagaacattaactgactaagaccttttaacatg
gcattgact
procedure
awk '
/^>/ && /C$/{
if (length(seq)>0) {processSeq(seq); seq="";}
line=NR; print $0; flag="do"; next;
}
/^>/ {line=NR; flag="ignore"}
NR>1 && NR==(line+1) && (flag=="do"){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0) processSeq(seq);}
' seq.fasta
output
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagttgtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
>chranotherC
agtcaatgccatgttaaaaggtcttagtcagttaatgttctacattctgaatatacttcctgactttattgaatctctaaggtgctttttaaaaggtcttatagatattgattaacatggcgggtcagttaatgttctacattctgaatatacttcatt
Tested using GNU Awk 5.1.0 on a Raspberry Pi 400.
performance note
Because calling sytstem() creates a sub shell, this process will be slower than a self-contained awk procedure. It might be useful where existing shell routines are available or tricky to reproduce with custom awk routines.
Edit: modification to include unaltered + records
This version has some repetition of earlier blocks, with minor changes, to handle printing of the lines that are not to be reverse-complemented (the changes should be self-explanatory if the main explanations were understood)
awk '
/^>/ && /C$/{
if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq} seq="";line=NR; print $0; flag="do"; next;
}
/^>/ {if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq} seq=""; print $0; line=NR; flag="ignore"}
NR>1 && NR==(line+1){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq}}
' seq.fasta
Using any awk:
$ cat tst.awk
/^>/ {
if ( NR > 1 ) {
prt()
}
head = $0
tail = ""
next
}
{ tail = ( tail == "" ? "" : tail ORS ) $0 }
END { prt() }
function prt( type) {
type = substr(head,length(head),1)
tail = ( type == "C" ? rev( tr( tail, "ACGTacgt TGCAtgca" ) ) : tail )
print head ORS tail
}
function tr(oldStr,trStr, i,lgth,char,newStr) {
if ( !_trSeen[trStr]++ ) {
lgth = (length(trStr) - 1) / 2
for ( i=1; i<=lgth; i++ ) {
_trMap[trStr,substr(trStr,i,1)] = substr(trStr,lgth+1+i,1)
}
}
lgth = length(oldStr)
for (i=1; i<=lgth; i++) {
char = substr(oldStr,i,1)
newStr = newStr ( (trStr,char) in _trMap ? _trMap[trStr,char] : char )
}
return newStr
}
function rev(oldStr, i,lgth,char,newStr) {
lgth = length(oldStr)
for ( i=1; i<=lgth; i++ ) {
char = substr(oldStr,i,1)
newStr = char newStr
}
return newStr
}
$ awk -f tst.awk file
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
This might work for you (GNU sed):
sed -nE ':a;p;/^>.*C$/!b
:b;n;/^>/ba;s/^/\n/;y/ACGTacgt/TGCAtgca/
:c;tc;/\n$/{s///p;bb};s/(.*)\n(.)/\2\1\n/;tc' file
Print the current line and then inspect it.
If the line does not begin with > and end with C, bail out and repeat.
Otherwise, fetch the next line and if it begins with >, repeat the above line.
Otherwise, insert a newline (to use as a pivot point when reversing the line), complement the code of the line using a translation command. Then set about reversing the line, character by character until the inserted newline makes its way to the end of the line.
Remove the newline, print the result and repeat the line above.
N.B. The n command will terminate the script when it is executed after the last line has been read.
Since the OP has amended the ouput, another solution is when the whole of the sequence is complemented and then reversed. Here is another solution that I believe follows these criteria.
sed -nE ':a;p;/^>.*C$/!b
:b;n;/^>/!{H;$!bb};x;y/ACGTacgt\n/TGCAtgca%/;s/%/\n/
:c;tc;s/\n$//;td;s/(.*)\n(.)/\2\1\n/;tc
:d;y/%/\n/;p;z;x;$!ba' file

How to do operations depending on the presence of a specific string in bash?

I am working with a csv file, so imagine I have this column:
5;10;>11;20;<14
My desired output would be:
5;10;12;20;13
So I would like to add +1 to those values who have the greater than (>) symbol and to subtract 1 to those values with a lesser than (<) symbol with bash language. I have tried something weird with sed but given that it interprets those changes as strings it didn't work out.
Any suggestions?
With awk (tested with GNU awk):
$ awk -F\; -v OFS=\; '
{
for(i = 1; i <= NF; i++) {
if($i ~ /^<[[:digit:]]+$/) {
sub(/^</,"",$i)
$i--
}
else if($i ~ /^>[[:digit:]]+$/) {
sub(/^>/,"",$i)
$i++
}
}
} 1' <<< "5;10;>11;20;<14"
5;10;12;20;13
Warning: use the following if and only if you trust your input file and you are 100% sure it does not contains malicious fields (see the final note).
With GNU sed (and assuming your shell is bash), a bit shorter but also a bit more difficult to understand (as usual with sed):
$ sed -E '
s/<([[:digit:]]+)/$((\1-1))/g
s/>([[:digit:]]+)/$((\1+1))/g
s/.*/printf "%s\n" "&"/e
' <<< "5;10;>11;20;<14"
5;10;12;20;13
That is (where N is a string of digits), substitute all <N with $((N-1)), all >N with $((N+1)), substitute the resulting string S with printf "%s\n" "S", execute it with bash and replace with the output (this is what the e modifier of the substitute command does). In your example the input string successively becomes:
5;10;>11;20;$((14-1))
5;10;$((11+1));20;$((14-1))
printf "%s\n" "5;10;$((11+1));20;$((14-1))"
5;10;12;20;13
The reason why there is a serious security issue here is that if one of your fields is, for instance, $(rm -rf ~/*) it will simply and recursively delete your entire home directory... So, if you do not control the input prefer the awk version.
5;10;>11;20;<14
|
{m,g}awk '
BEGIN {
_*=(OFS= "") (__-=_^= FS ="("(\
___="\31\17")"|"(____="\16\24")")+"
} {
gsub(";[<>][0-9]+",____ "&" ___)
gsub(____ ";[<>]", "&" ___)
NF
for(_+=(_^=($_=$_)<"")+_;_<=NF;_++) {
if ($_~"^[0-9]+$") {
$_+=__^($(_+__)~"[<]$")
}
} print $(_=_<_) }'
=
5;10;>12;20;<13

How to select text in a file until a certain string using grep, sed or awk?

I have a huge file (this is just a sample) and I would like to select all lines with "Ph_gUFAC1083" and all after until reach one that doesn't have the code (in this example Ph_gUFAC1139)
>uce_353_Ph_gUFAC1083 |uce_353
TTTAGCCATAGAAATGCAGAAATAATTAGAAGTGCCATTGTGTACAGTGCCTTCTGGACT
GGGCTGAAGGTGAAGGAGAAAGTATCATACTATCCTTGTCAGCTGCAAGGGTAATTACTG
CTGGCTGAAATTACTCAACATTTGTTTATAAGCTCCCCAGAGCATGCTGTAAATAGATTG
TCTGTTATAGTCCAATCACATTAAAACGCTGCTCCTTGCAAACTGCTACCTCCTGTTTTC
TGTAAGCTAGACAGAGAAAGCCTGCTGCTCACTTACTGAGCACCAAGCACTGAAGAGCTA
TGTTTAATGTGATTGTTTTCATTAGCTCTTCTCTGTCTGATATTACATTTATAATTTGCT
GGGCTTGAAGACTGGCATGTTGCATTGCTTTCATTTACTGTAGTAAGAGTGAATAGCTCT
AT
>uce_101_Ph_gUFAC1083 |uce_101
TTGGGCTTTATTTCCACCTTAAAATCTTTACCTGGCCGTGATCTGTTGTTCCATTACTGG
AGGGCAAAAATGGGAGGAATTGTCTGGGCTAAATTGCAATTAGGCAGCCCTGAGAGAGGC
TGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGT
AGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGAAGAG
GAGAGTTAATTGCATGTTACAGTGAGTGTAATGCCTAGATAACCTTGCATTTAATGCTAT
TCTTAGCCCTGCTGCCAAGACTTCTACAGAGCCTCTCTCTGCAGGAAGTCATTAAAGCTG
TGAGTAGATAATGCAGGCTCAGTGAAACCTAAGTGGCAACAATATA
>uce_171_Ph_gUFAC1083 |uce_171
CATGGAAAACGAGGAAAAGCCATATCTTCCAGGCCATTAATATTACTACGGAGACGTCTT
CATATCGCCGTAATTACAGCAGATCTCAAAGTGGCACAACCAAGACCAGCACCAAAGCTA
AAATAACTCGCAGGAGCAGGCGAGCTGCTTTTGCAGCCCTCAGTCCCAGAAATGCTCGGT
AGCTTTTCTTAAAATAGACAGCCTGTAAATAAGGTCTGTGAACTCAATTGAAGGTGGCTG
TTTCTGAATTAGTCAGCCCTCACAAGGCTCTCGGCCTACATGCTAGTACATAAATTGTCC
ACTTTACCACCAGACAAGAAAGATTAGAGTAATAAACACGGGGCATTAGCTCAGCTAGAG
AAACACACCAGCCGTTACGCACACGCGGGATTGCCAAGAACTGTTAACCCCACTCTCCAG
AAACGCACACAAAAAAACAAGTTAAAGCCATGACATCATGGGAA
>uce_4300_Ph_gUFAC1139 |uce_4300
ATTAAAAATACAATCCTCATGTTTGCATTTTGCAGTCGTCAACAAGAAATTGAAGAGAAA
CTCATAGAGGAAGAAACTGCTCGAAGGGTGGAAGAACTTGTAGCTAAACGCGTGGAAGAA
GAGCTGGAGAAAAGAAAGGATGAGATTGAGCGAGAGGTTCTCCGCAGGGTGGAGGAGGCT
AAGCGCATCATGGAAAAACAGTTGCTCGAAGAACTCGAGCGACAGCGACAAGCTGAACTT
GCAGCACAAAAAGCCAGAGAGGTAACGCTCGGTCGTTTGGAAAGTAGAGACAGTCCATGG
CAAAACTTTCAGTGTCGGTTTGTGCCTCCTGTTCGGTTCAGAAAGAGATGGAATACAGCA
AATCTAATTCCCTTCTCATATAAACTTGCATTGCTGCGAAACTTAATTTCTAGCCTATTC
AGAGGAGCTCACTGATATTTAAACAGTTACTCTCCTAAAACCTGAACAAGGATACTTGAT
TCTTAATGGAACTGACCTACATATTTCAGAATTGTTTGAAACTTTTGCCATGGCTGCAGG
ATTATTCAGCAGTCCTTTCATTTT
>uce_1039_Ph_gUFAC1139 |uce_1039
ATTAGTGGAATACAAATATGCAAAAACCAAACAGTTTGGTGCTATAATGTGAAAAGAAAT
TTACACCAATCTTATTTTTAATTTGTATGGGAACATTTTTACCACAAATTCCATATTTTA
ATAATACTATCCCAACTCTATTTTTTAGACTCATTTTGTCACTGTTTTGTAACAGAAACA
CTGTAAATATTATAGATGTGGTAAACTATTATACTTGTTTTCTTATAAATGAAATGATCT
GTGCCAACACTGACAAAATGAATTAATGTGTTACTAAGGCAACAGTCACATTATATGCTT
TCTCTTTCACAGTATGCGGTAGAGCATATGGTTTACTCTTAATGGAACACTAGCTTCTCA
TTAACATACCAGTAGCAATGTCAGAACTTACAAACCAGCATAACAGAGAAATGGAAAAAC
TTATAAATTAGACCCTTTCAGTATTATTGAGTAGAAAATGACTGATGTTCCAAGGTACAA
TATTTAGCTAATACAGTGCCCTTTTCTGCATCTTTCTTCTCAAAGGAAAAAAAAATCCTC
AAAAAAAACCAGAGCAAGAAACCTAACTTTTTCTTGT
I already tried several alternatives without success, the closest I reached was
sed -n '/Ph_gUFAC1083/, />/p' file.txt
that gave me that:
>uce_2347_Ph_gUFAC1083 |uce_2347
GCTTTTCTATGCAGATTTTTTCTAATTCTCTCCCTCCCCTTGCTTCTGTCAGTGTGAAGC
CCACACTAAGCATTAACAGTATTAAAAAGAGTGTTATCTATTAGTTCAATTAGACATCAG
ACATTTACTTTCCAATGTATTTGAAGACTGATTTGATTTGGGTCCAATCATTTAAAAATA
AGAGAGCAGAACTGTGTACAGAGCTGTGTACAGATATCTGTAGCTCTGAAGTCTTAATTG
CAAATTCAGATAAGGATTAGAAGGGGCTGTATCTCTGTAGACCAAAGGTATTTGCTAATA
CCTGAGATATAAAAGTGGTTAAATTCAATATTTACTAATTTAGGATTTCCACTTTGGATT
TTGATTAAGCTTTTTGGTTGAAAACCCCACATTATTAAGCTGTGATGAGGGAAAAAGCAA
CTCTTTCATAAGCCTCACTTTAACGCTTTATTTCAAATAATTTATTTTGGACCTTCTAAA
G
>uce_353_Ph_gUFAC1083 |uce_353
>uce_101_Ph_gUFAC1083 |uce_101
TTGGGCTTTATTTCCACCTTAAAATCTTTACCTGGCCGTGATCTGTTGTTCCATTACTGG
AGGGCAAAAATGGGAGGAATTGTCTGGGCTAAATTGCAATTAGGCAGCCCTGAGAGAGGC
TGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGT
AGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGAAGAG
GAGAGTTAATTGCATGTTACAGTGAGTGTAATGCCTAGATAACCTTGCATTTAATGCTAT
TCTTAGCCCTGCTGCCAAGACTTCTACAGAGCCTCTCTCTGCAGGAAGTCATTAAAGCTG
TGAGTAGATAATGCAGGCTCAGTGAAACCTAAGTGGCAACAATATA
>uce_171_Ph_gUFAC1083 |uce_171
Do you know how to do it using grep, sed or awk?
Thx
$ awk '/^>/{if(match($0,"Ph_gUFAC1083")){s=1} else s=0}s' file
I made a simple criteria for your request,
If the the start of the line is >, we're going to judge if "Ph_gUFAC1083" existed, if yes, set s=1, set s=0 otherwise.
For the line that doesn't start with >, the value of s would be retained.
The final s in the awk command decide if the line to be printed (s=1) or not (s=0).
If what you want is every line with Ph_gUFAC1139 plus block of lines after that line until the next line starting with >, then the following awk snippet might do:
$ awk 'BEGIN {RS=ORS=">"} /Ph_gUFAC1139/' file.txt
This uses the > character as a record separator, then simply displays records that contain the text you're interested in.
If you wanted to be able to provide the search string using a variable, you'd do it something like this:
$ val="Ph_gUFAC1139"
$ awk -v s="$val" 'BEGIN {RS=ORS=">"} $0 ~ s' file.txt
UPDATE
A comment mentions that the solution above shows trailing record separators rather than leading ones. You can adapt your output to match your input by reversing this order manually:
awk 'BEGIN { RS=ORS=">" } /Ph_gUFAC1139/ { printf "%s%s",ORS,$0 }' file.txt
Note that in the initial examples, a "match" of the regex would invoke awk's default "action", which is to print the line. The default action is invoked if no action is specified within the script. The code (immediately) above includes an action .. which prints the record, preceded by the separator.
This might work for you (GNU sed):
sed '/^>/h;G;/Ph_gUFAC1083/P;d' file
Store each line beginning with > in the hold space (HS) and then append the HS to every line. If any line contains the string Ph_gUFAC1083 print the first line in the pattern space (PS) and discard the everything else.
N.B. the regexp for the match may be amended to /\n.*Ph_gUFAC1083/ if the string match may occur in any line.
This program is used to find the block which starts with Ph_gUFAC1083 and ends with any statement other than Ph_gUFAC1139
cat inp.txt |
awk '
BEGIN{begin=0}
{
# Ignore blank lines
if( $0 ~ /^$/ )
{
print $0
next
}
# mark the line that contains Ph_gUFAC1083 and print it
if( $0 ~ /Ph_gUFAC1083/ )
{
begin=1
print $0
}
else
{
# if the line contains Ph_gUFAC1083 and Ph_gUFAC1139 was found before it, print it
if( begin == 1 && ( $0 ~ /Ph_gUFAC1139/ ) )
{
print $0
}
else
{
# found a line which doesnt contain Ph_gUFAC1139 , mark the end of the block.
begin = 0
}
}
}'

awk print first occurrence after match

I'm trying to print a portion of a text file between two patterns, then return only the first occurrence. Should be simple but I can't seem to find a solution.
cat test.html
if (var == "Option_1"){
document.write("<td>head1</td>")
document.write("<td>text1</td>")
}
if (var == "Option_2"){
document.write("<td>head2</td>")
document.write("<td>text2</td>")
}
if (var == "Option_1"){
document.write("<td>head3</td>")
document.write("<td>text3</td>")
}
This prints all matches:
awk '/Option_1/,/}/' test.txt
I need it to return only the first, i.e.:
if (var == "Option_1"){
document.write("<td>head1</td>")
document.write("<td>text1</td>")
}
Thanks!
Never use range expressions as they make trivial jobs very slightly briefer but then require a complete rewrite or duplicate conditions for even slightly more interesting tasks. Always use a flag:
$ awk '/Option_1/{f=1} f{print; if (/}/) exit}' file
if (var == "Option_1"){
document.write("<td>head1</td>")
document.write("<td>text1</td>")
}
I assumed that there are no } inside the if blocks.
Using GNU sed :
sed -n '/Option_1/{:a N;s/}/}/;Ta;p;q}' file
Here's how it works :
/Option_1/{ #search for Option_1
:a #create label a
N; #append next line to pattern space
s/}/}/; #substitute } with }
Ta; #if substitution failed, jump to label a
p; #print pattern space
q #exit
}
Adding somewhat to Ed Morton's answer, you can write it again to work for some nested if condition or if there exist any other pair of braces inside the if statement (eg. braces for for loop).
awk '/Option_1/{f=1} f{ if(/{/){count++}; print; if(/}/){count--; if(count==0) exit}}' filename
output for:
if (var == "Option_1"){
document.write("<td>head1</td>")
if (condition){
//code
}
document.write("<td>text1</td>")
}
if (var == "Option_2"){
document.write("<td>head2</td>")
document.write("<td>text2</td>")
}
if (var == "Option_1"){
document.write("<td>head3</td>")
document.write("<td>text3</td>")
}
is:
if (var == "Option_1"){
document.write("<td>head1</td>")
if (condition){
//code
}
document.write("<td>text1</td>")
}
count will keep count on number of starting braces and will print the statement until the count reaches 0 again.
My input might be different from question but the information may be useful.
sed '/Option_1/,/}/ !d;/}/q' YourFile
delete everything not inside your delimiter and quit after last line of it (so 1 section only)
for non GNU sed, replace the ; after d by a real new line
You can do,
awk '/Option_1/,/}/{print; if ($0 ~ /}/) exit}' test.txt
This exits after printing the first match

awk substitution ascii table rules bash

I want to perform a hierarchical set of (non-recursive) substitutions in a text file.
I want to define the rules in an ascii file "table.txt" which contains lines of blank space tabulated pairs of strings:
aaa 3
aa 2
a 1
I have tried to solve it with an awk script "substitute.awk":
BEGIN { while (getline < file) { subs[$1]=$2; } }
{ line=$0; for(i in subs)
{ gsub(i,subs[i],line); }
print line;
}
When I call the script giving it the string "aaa":
echo aaa | awk -v file="table.txt" -f substitute.awk
I get
21
instead of the desired "3". Permuting the lines in "table.txt" doesn't help. Who can explain what the problem is here, and how to circumvent it? (This is a simplified version of my actual task. Where I have a large file containing ascii encoded phonetic symbols which I want to convert into Latex code. The ascii encoding of the symbols contains {$,&,-,%,[a-z],[0-9],...)).
Any comments and suggestions!
PS:
Of course in this application for a substitution table.txt:
aa ab
a 1
a original string: "aa" should be converted into "ab" and not "1b". That means a string which was yielded by applying a rule must be left untouched.
How to account for that?
The order of the loop for (i in subs) is undefined by default.
In newer versions of awk you can use PROCINFO["sorted_in"] to control the sort order. See section 12.2.1 Controlling Array Traversal and (the linked) section 8.1.6 Using Predefined Array Scanning Orders for details about that.
Alternatively, if you can't or don't want to do that you could store the replacements in numerically indexed entries in subs and walk the array in order manually.
To do that you will need to store both the pattern and the replacement in the value of the array and that will require some care to combine. You can consider using SUBSEP or any other character that cannot be in the pattern or replacement and then split the value to get the pattern and replacement in the loop.
Also note the caveats/etcץ with getline listed on http://awk.info/?tip/getline and consider not using that manually but instead using NR==1{...} and just listing table.txt as the first file argument to awk.
Edit: Actually, for the manual loop version you could also just keep two arrays one mapping input file line number to the patterns to match and another mapping patterns to replacements. Then looping over the line number array will get you the pattern and the pattern can be used in the second array to get the replacement (for gsub).
Instead of storing the replacements in an associative array, put them in two arrays indexed by integer (one array for the strings to replace, one for the replacements) and iterate over the arrays in order:
BEGIN {i=0; while (getline < file) { subs[i]=$1; repl[i++]=$2}
n = i}
{ for(i=0;i<n;i++) { gsub(subs[i],repl[i]); }
print tolower($0);
}
It seems like perl's zero-width word boundary is what you want. It's a pretty straightforward conversion from the awk:
#!/usr/bin/env perl
use strict;
use warnings;
my %subs;
BEGIN{
open my $f, '<', 'table.txt' or die "table.txt:$!";
while(<$f>) {
my ($k,$v) = split;
$subs{$k}=$v;
}
}
while(<>) {
while(my($k, $v) = each %subs) {
s/\b$k\b/$v/g;
}
print;
}
Here's an answer pulled from another StackExchange site, from a fairly similar question: Replace multiple strings in a single pass.
It's slightly different in that it does the replacements in inverse order by length of target string (i.e. longest target first), but that is the only sensible order for targets which are literal strings, as appears to be the case in this question as well.
If you have tcc installed, you can use the following shell function, which process the file of substitutions into a lex-generated scanner which it then compiles and runs using tcc's compile-and-run option.
# Call this as: substitute replacements.txt < text_to_be_substituted.txt
# Requires GNU sed because I was too lazy to write a BRE
substitute () {
tcc -run <(
{
printf %s\\n "%option 8bit noyywrap nounput" "%%"
sed -r 's/((\\\\)*)(\\?)$/\1\3\3/;
s/((\\\\)*)\\?"/\1\\"/g;
s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
"$1"
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
} | lex -t)
}
With gcc or clang, you can use something similar to compile a substitution program from the replacement list, and then execute that program on the given text. Posix-standard c99 does not allow input from stdin, but gcc and clang are happy to do so provided you tell them explicitly that it is a C program (-x c). In order to avoid excess compilations, we use make (which needs to be gmake, Gnu make).
The following requires that the list of replacements be in a file with a .txt extension; the cached compiled executable will have the same name with a .exe extension. If the makefile were in the current directory with the name Makefile, you could invoke it as make repl (where repl is the name of the replacement file without a text extension), but since that's unlikely to be the case, we'll use a shell function to actually invoke make.
Note that in the following file, the whitespace at the beginning of each line starts with a tab character:
substitute.mak
.SECONDARY:
%: %.exe
#$(<D)/$(<F)
%.exe: %.txt
#{ printf %s\\n "%option 8bit noyywrap nounput" "%%"; \
sed -r \
's/((\\\\)*)(\\?)$$/\1\3\3/; #\
s/((\\\\)*)\\?"/\1\\"/g; #\
s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
"$<"; \
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"; \
} | lex -t | c99 -D_POSIX_C_SOURCE=200809L -O2 -x c -o "$#" -
Shell function to invoke the above:
substitute() {
gmake -f/path/to/substitute.mak "${1%.txt}"
}
You can invoke the above command with:
substitute file
where file is the name of the replacements file. (The filename must end with .txt but you don't have to type the file extension.)
The format of the input file is a series of lines consisting of a target string and a replacement string. The two strings are separated by whitespace. You can use any valid C escape sequence in the strings; you can also \-escape a space character to include it in the target. If you want to include a literal \, you'll need to double it.
If you don't want C escape sequences and would prefer to have backslashes not be metacharacters, you can replace the sed program with a much simpler one:
sed -r 's/([\\"])/\\\1/g' "$<"; \
(The ; \ is necessary because of the way make works.)
a) Don't use getline unless you have a very specific need and fully understand all the caveats, see http://awk.info/?tip/getline
b) Don't use regexps when you want strings (yes, this means you cannot use sed).
c) The while loop needs to constantly move beyond the part of the line you've already changed or you could end up in an infinite loop.
You need something like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
while ( sstart = index(tail,old) ) {
$0 = $0 substr(tail,1,sstart-1) new
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
print
}
$ echo aaa | awk -f substitute.awk table.txt -
3
$ echo aaaa | awk -f substitute.awk table.txt -
31
and adding some RE metacharacters to table.txt to show they are treated just like every other character and showing how to run it when the target text is stored in a file instead of being piped:
$ cat table.txt
aaa 3
aa 2
a 1
. 7
\ 4
* 9
$ cat foo
a.a\aa*a
$ awk -f substitute.awk table.txt foo
1714291
Your new requirement requires a solution like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
delete news
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
charPos = 0
while ( sstart = index(tail,old) ) {
charPos += sstart
news[charPos] = new
$0 = $0 substr(tail,1,sstart-1) RS
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
numChars = split($0, olds, "")
$0 = ""
for (charPos=1; charPos <= numChars; charPos++) {
$0 = $0 (charPos in news ? news[charPos] : olds[charPos])
}
print
}
.
$ cat table.txt
1 a
2 b
$ echo "121212" | awk -f substitute.awk table.txt -
ababab

Resources