How to match multiple patterns and print a different number of lines after each one using awk - bash

I have a big file with thousands of lines that looks like:
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00002235.4
TTACGCAT
TAGGCCAG
>ENST00005546.9
TTTATCGC
TTAGGGTAT
I want to grep specific ids (after the > sign), for example ENST00001234.1, and then get the lines after the match until the next > (regardless of the number of lines). I want to grep about 63 ids this way at once.
If I grep the ids ENST00001234.1 and ENST00005546.9, the ideal output should be:
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00005546.9
TTTATCGC
TTAGGGTAT
I tried awk '/ENST00001234.1/ENST00005546.9/{print}' but it did not help.

You can set > as the record separator:
$ awk -F'\n' -v RS='>' -v ORS= '$1=="ENST00001234.1"{print RS $0}' ip.txt
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
-F'\n' to make it easier to compare the search term with first line
-v RS='>' set > as input record separator
-v ORS= clear the output record separator, otherwise you'd get an extra newline in the output
$1=="ENST00001234.1" this does a string comparison against the entire first line; otherwise you'd have to escape regex metacharacters like . and add anchors (see the sketch after this list)
print RS $0 if match is found, print > and the record content
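A minimal sketch of that regex variant, for comparison (note the escaped dot and the anchors):
$ awk -F'\n' -v RS='>' -v ORS= '$1 ~ /^ENST00001234\.1$/{print RS $0}' ip.txt
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC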
If you want to match more than one search term, put them in a file:
$ cat f1
ENST00001234.1
ENST00005546.9
$ awk 'BEGIN{FS="\n"; ORS=""}
NR==FNR{a[$0]; next}
$1 in a{print RS $0}' f1 RS='>' ip.txt
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00005546.9
TTTATCGC
TTAGGGTAT
Here, the contents of f1 are used to build the keys for array a. Once the first file has been read, RS='>' changes the record separator for the second file.
$1 in a will check if the first line matches a key in array a
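If you'd rather not create a separate ids file, bash process substitution can feed the list inline; the same idea as a sketch:
$ awk 'BEGIN{FS="\n"; ORS=""}
       NR==FNR{a[$0]; next}
       $1 in a{print RS $0}' <(printf '%s\n' ENST00001234.1 ENST00005546.9) RS='>' ip.txt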

EDIT (generic solution): In case one has to look for multiple strings in Input_file, mention all of them in the awk variable search, separated by , (comma); this will print all the matched records.
awk -v search="ENST00001234.1,ENST00002235.4" '
BEGIN{
  num=split(search,arr,",")
  for(i=1;i<=num;i++){
    look[">"arr[i]]
  }
}
/^>/{
  if($0 in look){ found=1 }
  else { found="" }
}
found
' Input_file
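With the samples shown at the top, this should print both requested records:
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00002235.4
TTACGCAT
TAGGCCAG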
In case you want to read the ids (which need to be searched for in Input_file) from another file, then try the following, where look_file is the file which has all the ids to be searched and Input_file is the actual content file.
awk '
FNR==NR{
  look[">"$0]
  next
}
/^>/{
  if($0 in look){ found=1 }
  else { found="" }
}
found
' look_file Input_file
For a single text search: you could try the following, written and tested with the shown samples in GNU awk. Give the string to be searched for in the variable search, as per your requirement.
awk -v search="ENST00001234.1" '
/^>/{
  if($0==">"search){ found=1 }
  else { found="" }
}
found
' Input_file
Explanation: a detailed explanation of the above.
awk -v search="ENST00001234.1" '  ##Start the awk program and set the variable search to the value we need to look for.
/^>/{                             ##If a line starts with > then do the following.
  if($0==">"search){ found=1 }    ##If the current line equals > followed by the search value, set found to 1.
  else { found="" }               ##Else set found to NULL.
}
found                             ##If found is set, print the current line.
' Input_file                      ##Mention the Input_file name here.

There is no need to reinvent the wheel. There are several bioinformatics tools for this task (extract fasta sequences using a list of sequence ids). For example, seqtk subseq:
Extract sequences with names in file name.lst, one sequence name per line:
seqtk subseq in.fq name.lst > out.fq
It works with fasta files as well.
Use conda install seqtk or conda create --name seqtk seqtk to install the seqtk package, which has other useful functionalities, and is very fast.
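Applied to this question's data, with the ids (without the leading >) in f1 as shown earlier, the call would be something like:
seqtk subseq ip.txt f1 > out.fa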
SEE ALSO:
Retrieve FASTA sequences using sequence ids
Extract fasta sequences from a file using a list in another file
How To Extract A Sequence From A Big (6Gb) Multifasta File?
extract sequences from multifasta file by ID in file using awk

Related

Splitting file based on pattern '\r\n00' in korn shell

My file temp.txt looks like below
00ABC
PQR123400
00XYZ001234
012345
0012233
I want to split the file based on the pattern '\r\n00'. In this case temp.txt should be split into 3 files:
first.txt:
00ABC
PQR123400
second.txt:
00XYZ001234
012345
third.txt:
0012233
I am trying to use csplit to match the pattern '\r\n00' but it reports an invalid pattern. Can someone please help me match this exact pattern using csplit?
With your shown samples, please try the following awk code, written and tested in GNU awk.
This code will create files named 1.txt, 2.txt and so on. It also takes care of closing each output file after writing, so we don't run into the infamous "too many open files" error.
awk -v RS='\r?\n00' -v count="1" '
{
  outputFile=(count++".txt")
  rt=RT
  sub(/\r?\n/,"",rt)
  if(!rt){
    sub(/\n+/,"")
    rt=prevRT
  }
  printf("%s%s\n",(count>2?rt:""),$0) > outputFile
  close(outputFile)
  prevRT=rt
}
' Input_file
Explanation: a detailed explanation of the above code.
awk -v RS='\r?\n00' -v count="1" '  ##Start the awk program, setting RS to \r?\n00 along with setting count to 1.
{
  outputFile=(count++".txt")        ##Build outputFile from the value of count (incremented on each record) followed by .txt.
  rt=RT                             ##Copy RT (the matched record terminator) into rt.
  sub(/\r?\n/,"",rt)                ##Strip the \r?\n part from rt, leaving only the 00 prefix of the next record.
  if(!rt){                          ##If rt is NULL (last record, no terminator), do the following.
    sub(/\n+/,"")                   ##Strip trailing newlines (1 or more) from the record.
    rt=prevRT                       ##Reuse the previous record terminator as rt.
  }
  printf("%s%s\n",(count>2?rt:""),$0) > outputFile  ##Print the 00 prefix (for all but the first record) and the record into outputFile.
  close(outputFile)                 ##Close outputFile in the backend.
  prevRT=rt                         ##Remember rt in prevRT for the next record.
}
' Input_file                        ##Mention the Input_file name here.
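As a quick check with the shown sample saved as temp.txt (note the output names are 1.txt, 2.txt, 3.txt rather than first.txt and so on):
$ head 1.txt 2.txt 3.txt
==> 1.txt <==
00ABC
PQR123400

==> 2.txt <==
00XYZ001234
012345

==> 3.txt <==
0012233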

How to increase value of a text variable in a file

file1.txt contains the data below.
VARIABLE=00
RATE=14
PRICE=100
I need to increment the value by 1, but only for the line below, whenever I want (file name: file1.txt):
VARIABLE=00
The output should be incremented by 1 every time, like below:
VARIABLE=01
and in the next run VARIABLE=02, and so on....
You could try the following, written and tested with the shown samples in GNU awk.
awk 'BEGIN{FS=OFS="="} /^VARIABLE/{$NF=sprintf("%02d",$NF+1)} 1' Input_file > temp && mv temp Input_file
Explanation: a detailed explanation of the above.
awk '                        ##Start the awk program.
BEGIN{                       ##Start the BEGIN section of this program.
  FS=OFS="="                 ##Set FS and OFS to = here.
}
/^VARIABLE/{                 ##If a line starts with VARIABLE then do the following.
  $NF=sprintf("%02d",$NF+1)  ##Add 1 to the last field and save it back as a zero-padded 2-digit value.
}
1                            ##1 will print the current line.
' Input_file > temp && mv temp Input_file  ##Write the result to a temp file, then replace Input_file with it.
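A quick run on the sample, printing to stdout without the temp-file step:
$ awk 'BEGIN{FS=OFS="="} /^VARIABLE/{$NF=sprintf("%02d",$NF+1)} 1' file1.txt
VARIABLE=01
RATE=14
PRICE=100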
You can do it quite simply as a one-liner in Perl:
perl -i -pe '/^VARIABLE=/ && s/(\d+)/sprintf("%02d",$1+1)/e' file
In case you are unfamiliar with Perl, that says...
Run Perl and modify the file in place. If you come to any line containing VARIABLE=, substitute the digits on that line with an expression calculated as "whatever the number was + 1", zero-padded to two digits.
Note that Perl is a standard part of macOS - i.e. automatically included with all versions.
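A quick check of the one-liner, assuming the data above is in file1.txt:
$ perl -i -pe '/^VARIABLE=/ && s/(\d+)/sprintf("%02d",$1+1)/e' file1.txt
$ grep VARIABLE file1.txt
VARIABLE=01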

replace names in fasta

I want to change the sequence names in a fasta file according to a text file containing the new names. I found several approaches, and seqkit made a good impression; anyway, I can't get its "Replace key with value by key-value file" feature running.
The fasta file seq.fa looks like
>BC1
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>BC2
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>BC3
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
and the tab-delimited text file ref.txt looks like
BC1 1234
BC2 1235
BC3 1236
Using seqkit in Git Bash runs through the file but doesn't change the names.
seqkit replace -p' (.+)$' -r' {kv}' -k ref.txt seq.fa --keep-key
I'm used to R and new to bash, and I can't find the bug, but I guess I need to adjust for the tab and the _?
As in example 7, Replace key with value by key-value file, at https://bioinf.shenwei.me/seqkit/usage/#replace, the sequence name is tab delimited and only the second part is replaced.
Any advice on how to adjust the code?
The desired outcome should look like this, replacing BC1 with the number from the text file, 1234:
>1234
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>1235
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>1236
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
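For what it's worth, the likely problem with the seqkit attempt is the leading space in the pattern: the headers here are just >BC1 with no description after them, so ' (.+)$' never matches and nothing gets replaced. A sketch of an adjusted command (untested, assuming the whole name should be the lookup key):
seqkit replace -p '^(.+)$' -r '{kv}' -k ref.txt --keep-key seq.fa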
Could you please try the following.
awk '
FNR==NR{
  a[$1]=$2
  next
}
($2 in a) && /^>/{
  print ">"a[$2]
  next
}
1
' ref.txt FS="[> ]" seq.fa
Explanation: a detailed explanation of the above code.
awk '                ##Start the awk program.
FNR==NR{             ##FNR==NR is TRUE while the 1st Input_file, ref.txt, is being read.
  a[$1]=$2           ##Create an array named a whose index is $1 and whose value is $2 of the current line.
  next               ##next skips all further statements from here.
}                    ##Close the block for the FNR==NR condition.
($2 in a) && /^>/{   ##If $2 of the current line is present in array a and the line starts with >, do the following.
  print ">"a[$2]     ##Print > and the value of array a whose index is $2.
  next               ##next skips all further statements from here.
}
1                    ##1 prints the remaining lines (those NOT starting with > in seq.fa).
' ref.txt FS="[> ]" seq.fa   ##Mention the Input_file names here, setting FS to either space or > for seq.fa.
EDIT: As per the OP's comment, an occurrence number is needed in the output too (e.g. >1234_1), so here is the adjusted code.
awk '
FNR==NR{
  a[$1]=$2
  b[$1]=++c[$2]
  next
}
($2 in a) && /^>/{
  print ">"a[$2]"_"b[$2]
  next
}
1
' ref.txt FS="[> ]" seq.fa
An awk solution that doesn't require GNU awk:
awk 'NR==FNR{a[$1]=$2;next}
NF==2{$2=a[$2]; print ">" $2;next}
1' FS='\t' ref.txt FS='>' seq.fa
The first statement fills the array a with the content of the tab-delimited file ref.txt.
The second statement prints the lines of the second file, seq.fa, that have 2 fields given > as the field delimiter (the header lines), after replacing the name.
The last statement prints all other lines of that same file.

Replace header of one column by file name

I have about 100 comma-separated text files with eight columns.
Example of two file names:
sample1_sorted_count_clean.csv
sample2_sorted_count_clean.csv
Example of file content:
Domain,Phylum,Class,Order,Family,Genus,Species,Count
Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Zymomonas,Zymomonas mobilis,0.0
Bacteria,Bacteroidetes,Flavobacteria,Flavobacteriales,Flavobacteriaceae,Zunongwangia,Zunongwangia profunda,0.0
For each file, I would like to replace the column header "Count" with the sample ID, which is contained in the first part of the file name (sample1, sample2).
In the end, the header should then look like this:
Domain,Phylum,Class,Order,Family,Genus,Species,sample1
If I use my code, the header looks like this:
Domain,Phylum,Class,Order,Family,Genus,Species,${f%_clean.csv}
for f in *_clean.csv; do echo ${f}; sed -e "1s/Domain,Phylum,Class,Order,Family,Genus,Species,RPMM/Domain,Phylum,Class,Order,Family,Genus,Species,${f%_clean.csv}/" ${f} > ${f%_clean.csv}_clean2.csv; done
I also tried:
for f in *_clean.csv; do gawk -F"," '{$NF=","FILENAME}1' ${f} > t && mv t ${f%_clean.csv}_clean2.csv; done
In this case, "Count" is replaced by the entire file name, but now each row of the column contains the file name and the count values are no longer present. This is not what I want.
Do you have any ideas on what else I may try?
Thank you very much in advance!
Anna
If you are ok with awk, you could try the following.
awk 'BEGIN{FS=OFS=","} FNR==1{var=FILENAME;sub(/_.*/,"",var);$NF=var} 1' *.csv
EDIT: Since the OP asks that everything after the 2nd underscore in the file's name be removed, try the following.
awk 'BEGIN{FS=OFS=","} FNR==1{split(FILENAME,array,"_");$NF=array[1]"_"array[2]} 1' *.csv
Explanation: an explanation of the above code.
awk '                        ##Start the awk program.
BEGIN{                       ##Start the BEGIN section, which is executed before the Input_file(s) are read.
  FS=OFS=","                 ##Set FS and OFS to comma for all lines of all files.
}                            ##Close the BEGIN section.
FNR==1{                      ##If FNR==1, i.e. the very first line of an Input_file is being read, do the following.
  split(FILENAME,array,"_")  ##Use awk's built-in split function to split FILENAME (the current file's name) into an array named array on the delimiter _.
  $NF=array[1]"_"array[2]    ##Set the last field to the array's 1st element, an underscore, then the array's 2nd element.
}                            ##Close the FNR==1 block.
1                            ##1 prints the rest of the lines of the current Input_file.
' *.csv                      ##Pass all *.csv files to the awk program.
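Note that this prints the modified content to stdout rather than editing the files. Since gawk is already in use, one option is GNU awk's inplace extension (available in gawk 4.1+); a sketch combining it with the first command:
gawk -i inplace 'BEGIN{FS=OFS=","} FNR==1{var=FILENAME;sub(/_.*/,"",var);$NF=var} 1' *_clean.csv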

How to Compare two files line by line and output the whole line if different

I have two sorted files in question:
1) a control file (ctrl.txt), which is generated by an external process
2) a line count file (count.txt), which I generate using `wc -l`
$more ctrl.txt
Thunderbird|1000
Mustang|2000
Hurricane|3000
$more count.txt
Thunder_bird|1000
MUSTANG|2000
Hurricane|3001
I want to compare these two files ignoring wrinkles in column 1 (the file names), such as the "_" (in Thunder_bird) or the upper case (in MUSTANG), so that my output shows only the file below as the only real difference, where the counts don't match.
Hurricane|3000
My idea is to compare only the second column from both files and output the whole line if the values differ.
I have seen other examples in awk but I could not get anything to work.
Could you please try the following awk and let me know if this helps you.
awk -F"|" 'FNR==NR{gsub(/_/,"");a[tolower($1)]=$2;next} {gsub(/_/,"")} ((tolower($1) in a) && $2!=a[tolower($1)])' cntrl.txt count.txt
Here is a non-one-liner form of the same solution:
awk -F"|" '
FNR==NR{
gsub(/_/,"");
a[tolower($1)]=$2;
next}
{ gsub(/_/,"") }
((tolower($1) in a) && $2!=a[tolower($1)])
' cntrl.txt count.txt
Explanation: an explanation of the above code.
awk -F"|" '          ##Set the field separator to | (pipe) for all lines of the Input_file(s).
FNR==NR{             ##FNR==NR is TRUE while the first Input_file (ctrl.txt) is being read; the following instructions run only then.
  gsub(/_/,"")       ##Use awk's gsub to globally substitute _ with NULL in the current line.
  a[tolower($1)]=$2  ##Create an array named a whose index is the first field in LOWER CASE (to avoid case confusion) and whose value is $2 of the current line.
  next               ##next skips all further instructions (so the ones below run only while the 2nd Input_file, count.txt, is being read).
}
{ gsub(/_/,"") }     ##Statements from here run while the 2nd Input_file is read; remove all occurrences of _ from the line.
((tolower($1) in a) && $2!=a[tolower($1)])  ##If the lower-cased $1 is present in array a and $2 of the current line is NOT equal to the stored value, the condition is TRUE; since no action is given, the current line from count.txt is printed by default.
' ctrl.txt count.txt ##Mention the Input_file names to pass to awk.
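With the shown samples, the mismatching record (taken from count.txt, after the underscores have been removed) is printed:
$ awk -F"|" 'FNR==NR{gsub(/_/,"");a[tolower($1)]=$2;next} {gsub(/_/,"")} ((tolower($1) in a) && $2!=a[tolower($1)])' ctrl.txt count.txt
Hurricane|3001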
