Find, Replace, Remove - with in file - bash

I'm currently using this code:
awk 'BEGIN { s = \"{$CNEW}\" } /WORD_MATCH/ { $0 = s; n = 1 } 1; END { if(!n) print s }' filename > new_filename
To find a match on WORD_MATCH and then replace that line with $CNEW in a file called filename the results are written to new_filename
This all works well. But I have an issue where I may want to DELETE the line instead of replace it.
So I set $CNEW = '' which works in that I get a blank line in the file, but not actually removing the line.
Is there anyway to adapt the AWK command to allow the removal of the line ?
The total aim is :
If there isn't a line in the file containing WORD_MATCH add one, based on $CNEW
If there is a line in the file containing WORD_MATCH update that line with the new value from $CNEW
If $CNEW ='' then delete the line contain WORD_MATCH.
There will only be one line in he file containing WORD_MATCH
Thanks

awk -v s="$CNEW" '/WORD_MATCH/ { n=1; if (s) $0=s; else next; } 1; END { if(s && !n) print s }' file
How it works
-v s="$CNEW"
This creates s as an awk variable with the value $CNEW. Note that the use of -v neatly eliminates the quoting problems that can occur by trying to define s in a BEGIN block.
/WORD_MATCH/ { n=1; if (s) $0=s; else next; }
If the current line matches WORD_MATCH, then set n to 1. If s is non-empty, then set the current line to s. If not, skip the rest of the commands and start over on the next line.
1
This is cryptic shorthand for print the line.
END { if(s && !n) print s }
At the end of the file, if n is still not 1 and s is non-empty, then print s.

Related

Reverse complement SOME sequences in fasta file

I've been reading lots of helpful posts about reverse complementing sequences, but I've got what seems to be an unusual request. I'm working in bash and I have DNA sequences in fasta format in my stdout that I'd like to pass on down the pipe. The seemingly unusual bit is that I'm trying to reverse complement SOME of those sequences, so that the output has all the sequences in the same direction (for multiple sequence alignment later).
My fasta headers end in either "C" or "+". I'd like to reverse complement the ones that end in "C". Here's a little subset:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
I know there are lots of ways to reverse complement out there, like:
echo ACCTTGAAA | tr ACGTacgt TGCAtgca | rev
and
seqtk seq -r in.fa > out.fa
But I'm not sure how to do this for only those sequences that have a C at the end of the header. I think awk or sed is probably the ticket, but I'm at a loss as to how to actually code it. I can get the sequence headers with awk, like:
awk '/^>/ { print $0 }'
>chr1:84518073-84524089C
>chr1:86214203-86220231+
But if someone could help me figure out how to turn that awk statement into one that asks "if the last character in the header has a C, do this!" that would be great!
Edited to add:
I was so tired when I made this post, I apologize for not including my desired output. Here is what I'd like to output to look like, using my little example:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
You can see the sequence that ends in + is unchanged, but the sequence with a header that ends in C is reverse complemented.
Thanks!
An earlier answer (by Ed Morton) uses a self-contained awk procedure to selectively reverse-complement sequences following a comment line ending with "C". Although I think that to be the best approach, I will offer an alternative approach that might have wider applicability.
The procedure here uses awk's system() function to send data extracted from the fasta file in awk to the shell where the sequence can be processed by any of the many shell applications existing for sequence manipulation.
I have defined an awk user function to pass the isolated sequence from awk to the shell. It can be called from any part of the awk procedure:
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
The argument of the system function is a string containing the command you would type into terminal to achieve the desired outcome (in this case I've used one of the example reverse-complement routines mentioned in the question). The parts to note are the correct escaping of quote marks that are to appear in the shell command, and the variable s that will be substituted for the sequence string assigned to it when the function is called. The value of s is concatenated with the strings quoted before and after it in the argument to system() shown above.
isolating the required sequences
The rest of the procedure addresses how to achieve:
"if the last character in the header has a C, do this"
Before making use of shell applications, awk needs to isolate the part(s) of the file to process. In general terms, awk employs one or more pattern/action blocks where only records (lines by default) that match a given pattern are processed by the subsequent action commands. For example, the following illustrative procedure performs the action of printing the whole line print $0 if the pattern /^>/ && /C$/ is true for that line (where /^>/ looks for ">" at the start of a line and /C$/ looks for "C" at the end of the same line.:
/^>/ && /C$/{ print $0 }
For the current needs, the sequence begins on the next record (line) after any record beginning with > and ending with C. One way of referencing that next line is to set a variable (named line in my example) when the C line is encountered and establishing a later pattern for the record with numerical value one more than line variable.
Because fasta sequences may extend over several lines, we have to accumulate several successive lines following a C title line. I have achieved this by concatenating each line following the C title line until a record beginning with > is encountered again (or until the end of the file is reached, using the END block).
In order that sequence lines following a non-C title line are ignored, I have used a variable named flag with values of either "do" or "ignore" set when a title record is encountered.
The call to a the custom function processSeq() that employs the system() command, is made at the beginning of a C title action block if the variable seq holds an accumulated sequence (and in the END block for relevant sequences that occur at the end of the file where there will be no title line).
Test file and procedure
A modified version of your example fasta was used to test the procedure. It contains an extra relevant C record with three and-a-bit lines instead of two, and an extra irrelevant + record.
seq.fasta:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chranotherC
aatgaagtatattcagaatgtagaacattaactgacccgccatgttaatc
aatatctataagaccttttaaaaagcaccttagagattcaataaagtcag
gaagtatattcagaatgtagaacattaactgactaagaccttttaacatg
gcattgact
procedure
awk '
/^>/ && /C$/{
if (length(seq)>0) {processSeq(seq); seq="";}
line=NR; print $0; flag="do"; next;
}
/^>/ {line=NR; flag="ignore"}
NR>1 && NR==(line+1) && (flag=="do"){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0) processSeq(seq);}
' seq.fasta
output
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagttgtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
>chranotherC
agtcaatgccatgttaaaaggtcttagtcagttaatgttctacattctgaatatacttcctgactttattgaatctctaaggtgctttttaaaaggtcttatagatattgattaacatggcgggtcagttaatgttctacattctgaatatacttcatt
Tested using GNU Awk 5.1.0 on a Raspberry Pi 400.
performance note
Because calling sytstem() creates a sub shell, this process will be slower than a self-contained awk procedure. It might be useful where existing shell routines are available or tricky to reproduce with custom awk routines.
Edit: modification to include unaltered + records
This version has some repetition of earlier blocks, with minor changes, to handle printing of the lines that are not to be reverse-complemented (the changes should be self-explanatory if the main explanations were understood)
awk '
/^>/ && /C$/{
if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq} seq="";line=NR; print $0; flag="do"; next;
}
/^>/ {if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq} seq=""; print $0; line=NR; flag="ignore"}
NR>1 && NR==(line+1){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq}}
' seq.fasta
Using any awk:
$ cat tst.awk
/^>/ {
if ( NR > 1 ) {
prt()
}
head = $0
tail = ""
next
}
{ tail = ( tail == "" ? "" : tail ORS ) $0 }
END { prt() }
function prt( type) {
type = substr(head,length(head),1)
tail = ( type == "C" ? rev( tr( tail, "ACGTacgt TGCAtgca" ) ) : tail )
print head ORS tail
}
function tr(oldStr,trStr, i,lgth,char,newStr) {
if ( !_trSeen[trStr]++ ) {
lgth = (length(trStr) - 1) / 2
for ( i=1; i<=lgth; i++ ) {
_trMap[trStr,substr(trStr,i,1)] = substr(trStr,lgth+1+i,1)
}
}
lgth = length(oldStr)
for (i=1; i<=lgth; i++) {
char = substr(oldStr,i,1)
newStr = newStr ( (trStr,char) in _trMap ? _trMap[trStr,char] : char )
}
return newStr
}
function rev(oldStr, i,lgth,char,newStr) {
lgth = length(oldStr)
for ( i=1; i<=lgth; i++ ) {
char = substr(oldStr,i,1)
newStr = char newStr
}
return newStr
}
$ awk -f tst.awk file
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
This might work for you (GNU sed):
sed -nE ':a;p;/^>.*C$/!b
:b;n;/^>/ba;s/^/\n/;y/ACGTacgt/TGCAtgca/
:c;tc;/\n$/{s///p;bb};s/(.*)\n(.)/\2\1\n/;tc' file
Print the current line and then inspect it.
If the line does not begin with > and end with C, bail out and repeat.
Otherwise, fetch the next line and if it begins with >, repeat the above line.
Otherwise, insert a newline (to use as a pivot point when reversing the line), complement the code of the line using a translation command. Then set about reversing the line, character by character until the inserted newline makes its way to the end of the line.
Remove the newline, print the result and repeat the line above.
N.B. The n command will terminate the script when it is executed after the last line has been read.
Since the OP has amended the ouput, another solution is when the whole of the sequence is complemented and then reversed. Here is another solution that I believe follows these criteria.
sed -nE ':a;p;/^>.*C$/!b
:b;n;/^>/!{H;$!bb};x;y/ACGTacgt\n/TGCAtgca%/;s/%/\n/
:c;tc;s/\n$//;td;s/(.*)\n(.)/\2\1\n/;tc
:d;y/%/\n/;p;z;x;$!ba' file

Retreive specific values from file

I have a file test.cf containing:
process {
withName : teq {
file = "/path/to/teq-0.20.9.txt"
}
}
process {
withName : cad {
file = "/path/to/cad-4.0.txt"
}
}
process {
withName : sik {
file = "/path/to/sik-20.0.txt"
}
}
I would like to retreive value associated at the end of the file for teq, cad and sik
I was first thinking about something like
grep -E 'teq' test.cf
and get only second raw and then remove part of recurrence in line
But it may be easier to do something like:
for a in test.cf
do
line=$(sed -n '{$a}p' test.cf)
if line=teq
#next line using sed -n?
do print nextline &> teq.txt
else if line=cad
do print nextline &> cad.txt
else if line=sik
do print nextline &> sik.txt
done
(obviously it doesn't work)
EDIT:
output wanted:
teq.txt containing teq-0.20.9, cad.txt containing cad-4.0 and sik.txt containing sik-20.0
Is there a good way to do that? Thank you for your comments
Based on your given sample:
awk '/withName/{close(f); f=$3 ".txt"}
/file/{sub(/.*\//, ""); sub(/\.txt".*/, "");
print > f}' ip.txt
/withName/{close(f); f=$3 ".txt"} if line contains withName, save filename in f using the third field. close() will close any previous file handle
/file/{sub(/.*\//, ""); sub(/\.txt".*/, ""); if line contains file, remove everything except the value required
print > f print the modified line and redirect to filename in f
if you can have multiple entries, use >> instead of >
Here is a solution in awk:
awk '/withName/{name=$3} /file =/{print $3 > name ".txt"}' test.cf
/withName/{name=$3}: when I see the line containing "withName", I save that name
When I see the line with "file =", I print

How to select text in a file until a certain string using grep, sed or awk?

I have a huge file (this is just a sample) and I would like to select all lines with "Ph_gUFAC1083" and all after until reach one that doesn't have the code (in this example Ph_gUFAC1139)
>uce_353_Ph_gUFAC1083 |uce_353
TTTAGCCATAGAAATGCAGAAATAATTAGAAGTGCCATTGTGTACAGTGCCTTCTGGACT
GGGCTGAAGGTGAAGGAGAAAGTATCATACTATCCTTGTCAGCTGCAAGGGTAATTACTG
CTGGCTGAAATTACTCAACATTTGTTTATAAGCTCCCCAGAGCATGCTGTAAATAGATTG
TCTGTTATAGTCCAATCACATTAAAACGCTGCTCCTTGCAAACTGCTACCTCCTGTTTTC
TGTAAGCTAGACAGAGAAAGCCTGCTGCTCACTTACTGAGCACCAAGCACTGAAGAGCTA
TGTTTAATGTGATTGTTTTCATTAGCTCTTCTCTGTCTGATATTACATTTATAATTTGCT
GGGCTTGAAGACTGGCATGTTGCATTGCTTTCATTTACTGTAGTAAGAGTGAATAGCTCT
AT
>uce_101_Ph_gUFAC1083 |uce_101
TTGGGCTTTATTTCCACCTTAAAATCTTTACCTGGCCGTGATCTGTTGTTCCATTACTGG
AGGGCAAAAATGGGAGGAATTGTCTGGGCTAAATTGCAATTAGGCAGCCCTGAGAGAGGC
TGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGT
AGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGAAGAG
GAGAGTTAATTGCATGTTACAGTGAGTGTAATGCCTAGATAACCTTGCATTTAATGCTAT
TCTTAGCCCTGCTGCCAAGACTTCTACAGAGCCTCTCTCTGCAGGAAGTCATTAAAGCTG
TGAGTAGATAATGCAGGCTCAGTGAAACCTAAGTGGCAACAATATA
>uce_171_Ph_gUFAC1083 |uce_171
CATGGAAAACGAGGAAAAGCCATATCTTCCAGGCCATTAATATTACTACGGAGACGTCTT
CATATCGCCGTAATTACAGCAGATCTCAAAGTGGCACAACCAAGACCAGCACCAAAGCTA
AAATAACTCGCAGGAGCAGGCGAGCTGCTTTTGCAGCCCTCAGTCCCAGAAATGCTCGGT
AGCTTTTCTTAAAATAGACAGCCTGTAAATAAGGTCTGTGAACTCAATTGAAGGTGGCTG
TTTCTGAATTAGTCAGCCCTCACAAGGCTCTCGGCCTACATGCTAGTACATAAATTGTCC
ACTTTACCACCAGACAAGAAAGATTAGAGTAATAAACACGGGGCATTAGCTCAGCTAGAG
AAACACACCAGCCGTTACGCACACGCGGGATTGCCAAGAACTGTTAACCCCACTCTCCAG
AAACGCACACAAAAAAACAAGTTAAAGCCATGACATCATGGGAA
>uce_4300_Ph_gUFAC1139 |uce_4300
ATTAAAAATACAATCCTCATGTTTGCATTTTGCAGTCGTCAACAAGAAATTGAAGAGAAA
CTCATAGAGGAAGAAACTGCTCGAAGGGTGGAAGAACTTGTAGCTAAACGCGTGGAAGAA
GAGCTGGAGAAAAGAAAGGATGAGATTGAGCGAGAGGTTCTCCGCAGGGTGGAGGAGGCT
AAGCGCATCATGGAAAAACAGTTGCTCGAAGAACTCGAGCGACAGCGACAAGCTGAACTT
GCAGCACAAAAAGCCAGAGAGGTAACGCTCGGTCGTTTGGAAAGTAGAGACAGTCCATGG
CAAAACTTTCAGTGTCGGTTTGTGCCTCCTGTTCGGTTCAGAAAGAGATGGAATACAGCA
AATCTAATTCCCTTCTCATATAAACTTGCATTGCTGCGAAACTTAATTTCTAGCCTATTC
AGAGGAGCTCACTGATATTTAAACAGTTACTCTCCTAAAACCTGAACAAGGATACTTGAT
TCTTAATGGAACTGACCTACATATTTCAGAATTGTTTGAAACTTTTGCCATGGCTGCAGG
ATTATTCAGCAGTCCTTTCATTTT
>uce_1039_Ph_gUFAC1139 |uce_1039
ATTAGTGGAATACAAATATGCAAAAACCAAACAGTTTGGTGCTATAATGTGAAAAGAAAT
TTACACCAATCTTATTTTTAATTTGTATGGGAACATTTTTACCACAAATTCCATATTTTA
ATAATACTATCCCAACTCTATTTTTTAGACTCATTTTGTCACTGTTTTGTAACAGAAACA
CTGTAAATATTATAGATGTGGTAAACTATTATACTTGTTTTCTTATAAATGAAATGATCT
GTGCCAACACTGACAAAATGAATTAATGTGTTACTAAGGCAACAGTCACATTATATGCTT
TCTCTTTCACAGTATGCGGTAGAGCATATGGTTTACTCTTAATGGAACACTAGCTTCTCA
TTAACATACCAGTAGCAATGTCAGAACTTACAAACCAGCATAACAGAGAAATGGAAAAAC
TTATAAATTAGACCCTTTCAGTATTATTGAGTAGAAAATGACTGATGTTCCAAGGTACAA
TATTTAGCTAATACAGTGCCCTTTTCTGCATCTTTCTTCTCAAAGGAAAAAAAAATCCTC
AAAAAAAACCAGAGCAAGAAACCTAACTTTTTCTTGT
I already tried several alternatives without success, the closest I reached was
sed -n '/Ph_gUFAC1083/, />/p' file.txt
that gave me that:
>uce_2347_Ph_gUFAC1083 |uce_2347
GCTTTTCTATGCAGATTTTTTCTAATTCTCTCCCTCCCCTTGCTTCTGTCAGTGTGAAGC
CCACACTAAGCATTAACAGTATTAAAAAGAGTGTTATCTATTAGTTCAATTAGACATCAG
ACATTTACTTTCCAATGTATTTGAAGACTGATTTGATTTGGGTCCAATCATTTAAAAATA
AGAGAGCAGAACTGTGTACAGAGCTGTGTACAGATATCTGTAGCTCTGAAGTCTTAATTG
CAAATTCAGATAAGGATTAGAAGGGGCTGTATCTCTGTAGACCAAAGGTATTTGCTAATA
CCTGAGATATAAAAGTGGTTAAATTCAATATTTACTAATTTAGGATTTCCACTTTGGATT
TTGATTAAGCTTTTTGGTTGAAAACCCCACATTATTAAGCTGTGATGAGGGAAAAAGCAA
CTCTTTCATAAGCCTCACTTTAACGCTTTATTTCAAATAATTTATTTTGGACCTTCTAAA
G
>uce_353_Ph_gUFAC1083 |uce_353
>uce_101_Ph_gUFAC1083 |uce_101
TTGGGCTTTATTTCCACCTTAAAATCTTTACCTGGCCGTGATCTGTTGTTCCATTACTGG
AGGGCAAAAATGGGAGGAATTGTCTGGGCTAAATTGCAATTAGGCAGCCCTGAGAGAGGC
TGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGT
AGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGAAGAG
GAGAGTTAATTGCATGTTACAGTGAGTGTAATGCCTAGATAACCTTGCATTTAATGCTAT
TCTTAGCCCTGCTGCCAAGACTTCTACAGAGCCTCTCTCTGCAGGAAGTCATTAAAGCTG
TGAGTAGATAATGCAGGCTCAGTGAAACCTAAGTGGCAACAATATA
>uce_171_Ph_gUFAC1083 |uce_171
Do you know how to do it using grep, sed or awk?
Thx
$ awk '/^>/{if(match($0,"Ph_gUFAC1083")){s=1} else s=0}s' file
I made a simple criteria for your request,
If the the start of the line is >, we're going to judge if "Ph_gUFAC1083" existed, if yes, set s=1, set s=0 otherwise.
For the line that doesn't start with >, the value of s would be retained.
The final s in the awk command decide if the line to be printed (s=1) or not (s=0).
If what you want is every line with Ph_gUFAC1139 plus block of lines after that line until the next line starting with >, then the following awk snippet might do:
$ awk 'BEGIN {RS=ORS=">"} /Ph_gUFAC1139/' file.txt
This uses the > character as a record separator, then simply displays records that contain the text you're interested in.
If you wanted to be able to provide the search string using a variable, you'd do it something like this:
$ val="Ph_gUFAC1139"
$ awk -v s="$val" 'BEGIN {RS=ORS=">"} $0 ~ s' file.txt
UPDATE
A comment mentions that the solution above shows trailing record separators rather than leading ones. You can adapt your output to match your input by reversing this order manually:
awk 'BEGIN { RS=ORS=">" } /Ph_gUFAC1139/ { printf "%s%s",ORS,$0 }' file.txt
Note that in the initial examples, a "match" of the regex would invoke awk's default "action", which is to print the line. The default action is invoked if no action is specified within the script. The code (immediately) above includes an action .. which prints the record, preceded by the separator.
This might work for you (GNU sed):
sed '/^>/h;G;/Ph_gUFAC1083/P;d' file
Store each line beginning with > in the hold space (HS) and then append the HS to every line. If any line contains the string Ph_gUFAC1083 print the first line in the pattern space (PS) and discard the everything else.
N.B. the regexp for the match may be amended to /\n.*Ph_gUFAC1083/ if the string match may occur in any line.
This program is used to find the block which starts with Ph_gUFAC1083 and ends with any statement other than Ph_gUFAC1139
cat inp.txt |
awk '
BEGIN{begin=0}
{
# Ignore blank lines
if( $0 ~ /^$/ )
{
print $0
next
}
# mark the line that contains Ph_gUFAC1083 and print it
if( $0 ~ /Ph_gUFAC1083/ )
{
begin=1
print $0
}
else
{
# if the line contains Ph_gUFAC1083 and Ph_gUFAC1139 was found before it, print it
if( begin == 1 && ( $0 ~ /Ph_gUFAC1139/ ) )
{
print $0
}
else
{
# found a line which doesnt contain Ph_gUFAC1139 , mark the end of the block.
begin = 0
}
}
}'

How to add an input file name to multiple output files in awk?

The question might be trivial. I'm trying to figure out a way to add a part of my input file name to multiple outputs generated by the following awk script.
Script:
zcat $1 | BEGIN {
# the number of sequences per file
if (!N) N=10000;
# file prefix
if (!prefix) prefix = "seq";
# file suffix
if (!suffix) suffix = "fa";
# this keeps track of the sequences
count = 0
}
# skip empty lines at the beginning
/^$/ { next; }
# act on fasta header
/^>/ {
if (count % N == 0) {
if (output) close(output)
output = sprintf("%s%07d.%s", prefix, count, suffix)
}
print > output
count ++
next
}
# write the fasta body into the file
{
print >> output
}
The input in $1 variable is 30_C_283_1_5.9.fa.gz
The output files generated by the script are
myseq0000000.fa, myseq1000000.fa and so on....
I would like the output to be
30_C_283_1_5.9_myseq000000.fa, 30_C_283_1_5.9_myseq100000.fa....
Looking forward for some inputs in this regard.
There's a way to direct the output from inside the Awk script:
https://www.gnu.org/software/gawk/manual/html_node/Redirection.html

Joining lines that matches specific conditions in bash

I need a command that will join lines if:
-following line starts with more than 5 spaces
-length of the joined lines won't be greater than 79 characters
-those lines are not between lines with pattern1 and pattern2
-same as above but with another set of patterns, like pattern3 and pattern4
It will work on a file like this:
Long line that contains too much text for combining it with following one
That line cannot be attached to the previous becouse of the length
This one also
becouse it doesn't start with spaces
This one
could be
expanded
pattern1
here are lines
that shouldn't be
changed
pattern2
Another line
to grow
After running the command, output should be:
Long line that contains too much text for combining it with following one
That line cannot be attached to the previous becouse of the length
This one also
becouse that one doesn't start with spaces
This one could be expanded
pattern1
here are lines
that shouldn't be
changed
pattern2
Another line to grow
It can't move part of the line.
I'm using bash 2.05 sed 3.02 awk 3.1.1 and grep 2.5.1 and i don't know how to solve this problem :)
Here's a start for you:
#!/usr/bin/awk -f
BEGIN {
TRUE = printflag1 = printflag2 = 1
FALSE = 0
}
# using two different flags prevents premature enabling when blocks are
# nested or intermingled
/pattern1/ {
printflag1 = FALSE
}
/pattern2/ {
printflag1 = TRUE
}
/pattern3/ {
printflag2 = FALSE
}
/pattern4/ {
printflag2 = TRUE
}
{
line = $0
sub(/^ +/, " ", line)
sub(/ +$/, "", line)
}
/^ / &&
length(accum line) <= 79 &&
printflag1 &&
printflag2 {
accum = accum line
next
}
{
print accum
accum = line
}

Resources