Remove multiple sequences from fasta file - bash

I have a text file of character sequences that consist of two lines: a header, and the sequence itself in the following line. The structure of the file is as follow:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In an other file I have a list of headers of sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, so all these headers+the following line. I did it using sed like the following,
while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt
It works but takes quite long since I am loading the whole file several times with sed, and it is quite big. Any idea on how I could speed up this process?

The question you have is easy to answer but will not help you when you handle generic fasta files. Fasta files have a sequence header followed by one or multiple lines which can be concatenated to represent the sequence. The Fasta file-format roughly obeys the following rules:
The description line (defline) or header/identifier line, which begins with <greater-then> character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character would be ignored (including spaces, tabulators, asterisks, etc...).
The sequence can span multiple lines.
A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.
Most of the presented methods will fail on a multi-fasta with multi-line sequences
The following will work always:
awk '(NR==FNR) { toRemove[$1]; next }
/^>/ { p=1; for(h in toRemove) if ( h ~ $0) p=0 }
p' headers.txt file.fasta
This is very similar to the answers of EdMorton and Anubahuva but the difference here is that the file headers.txt could contain only a part of the header.

$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001.
Alternatively:
$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.
The first script relies on you knowing how many lines each record is long while the 2nd one relies on every record starting with >. If you know both then which one you use is a style choice.

You may use this awk:
awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt

Create a script with the delete commands from the second file:
sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed
Then apply that file to the first
sed -f commands.sed firstFile.txt

This awk might work for you:
awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1

One option is to create a long sed expression:
sedcmd=
while read line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed $sedcmd first_file.txt
This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123...)
Using a file (as #daniu suggests) might be better if you have thousands of files, as you risk hitting the command-line maximum count with this method.

try gnu sed,
sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f - first_file.txt
prepend time command to both scripts to compare the speed,
look time while read line;do... and time sed -.... result in my test this is done in less than half time of OP's

This can easily be done with bbtools. The seqs2remove.txt file should be one header per line exactly as they appear in the large.fasta file.
filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt

Related

How do I read into a .txt and extract a certain string corresponding to a found string?

A folder contains a README.txt and several dicom files named emr_000x.sx (where x are numerical values). In the README.txt are different lines, one of which contains the characters "xyz" and a corresponding emr_000x.sx in the line.
I would like to: read into the .txt, identify which line contains "xyz", and extract the emr_000x.sx from that line only.
For reference, the line in the .txt is formatted in this way:
A:emr_000x.sx, B:00001, C:number, D(characters)string_string_number_**xyz**_number_number
I think using grep might be helpful, but am not familiar enough to bash coding myself. Does anyone know how to solve this? Many thanks!
You can use awk to match fields on you csv:
awk -F, '$4 ~ "xyz" {sub(/^A:/, "", $1); print $1}'
I like sed for this sort of thing.
sed -nE '/xyz/{ s/^.*A:([^,]+),.*/\1/; p; }' README.txt
This says, "On lines where you see xyz replace the whole line with the non-commas between A: and a comma, then print the line."
-n is no printing unless I say so. (p means print.)
-E just means to use Extended regexes.
/xyz/{...} means "on lines where you see xyz do the stuff between the curlies."
s/^.*A:([^,]+),.*/\1/ will substitute the matched part (which should be the whole line) with just the part between the parens.

Use grep only on specific columns in many files?

Basically, I have one file with patterns and I want every line to be searched in all text files in a certain directory. I also only want exact matches. The many files are zipped.
However, I have one more condition. I need the first two columns of a line in the pattern file to match the first two columns of a line in any given text file that is searched. If they match, the output I want is the pattern(the entire line) followed by all the names of the text files that a match was found in with their entire match lines (not just first two columns).
An output such as:
pattern1
file23:"text from entire line in file 23 here"
file37:"text from entire line in file 37 here"
file156:"text from entire line in file 156 here"
pattern2
file12:"text from entire line in file 12 here"
file67:"text from entire line in file 67 here"
file200:"text from entire line in file 200 here"
I know that grep can take an input file, but the problem is that it takes every pattern in the pattern file and searches for them in a given text file before moving onto the next file, which makes the above output more difficult. So I thought it would be better to loop through each line in a file, print the line, and then search for the line in the many files, seeing if the first two columns match.
I thought about this:
cat pattern_file.txt | while read line
do
echo $line >> output.txt
zgrep -w -l $line many_files/*txt >> output.txt
done
But with this code, it doesn't search by the first two columns only. Is there a way so specify the first two columns for both the pattern line and for the lines that grep searches through?
What is the best way to do this? Would something other than grep, like awk, be better to use? There were other questions like this, but none that used columns for both the search pattern and the searched file.
Few lines from pattern file:
1 5390182 . A C 40.0 PASS DP=21164;EFF=missense_variant(MODERATE|MISSENSE|Aag/Cag|p.Lys22Gln/c.64A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390200 . G T 40.0 PASS DP=21237;EFF=missense_variant(MODERATE|MISSENSE|Gcc/Tcc|p.Ala28Ser/c.82G>T|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390228 . A C 40.0 PASS DP=21317;EFF=missense_variant(MODERATE|MISSENSE|gAa/gCa|p.Glu37Ala/c.110A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
Few lines from a file in searched files:
1 10699576 . G A 36 PASS DP=4 GT:GQ:DP 1|1:36:4
1 10699790 . T C 40 PASS DP=6 GT:GQ:DP 1|1:40:6
1 10699808 . G A 40 PASS DP=7 GT:GQ:DP 1|1:40:7
They both in reality are much larger.
It sounds like this might be what you want:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile anyfile
If it's not then update your question to provide a clear, simple statement of your requirements and concise, testable sample input and expected output that demonstrates your problem and that we could test a potential solution against.
if anyfile is actually a zip file then you'd do something like:
zcat anyfile | awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile -
Replace zcat with whatever command you use to produce text from your zip file if that's not what you use.
Per the question in the comments, if both input files are compressed and your shell supports it (e.g. bash) you could do:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' <(zcat patternfile) <(zcat anyfile)
otherwise just uncompress patternfile to a tmp file first and use that in the awk command.
Use read to parse the pattern file's columns and add an anchor to the zgrep pattern :
while read -r column1 column2 rest_of_the_line
do
echo "$column1 $column2 $rest_of_the_line"
zgrep -w -l "^$column1\s*$column2" many_files/*txt
done < pattern_file.txt >> output.txt
read is able to parse lines into multiple variables passed as parameters, the last of which getting the rest of the line. It will separate fields around characters of the $IFS Internal Field Separator (by default tabulations, spaces and linefeeds, can be overriden for the read command by using while IFS='...' read ...).
Using -r avoids unwanted escapes and makes the parsing more reliable, and while ... do ... done < file performs a bit better since it avoids an useless use of cat. Since the output of all the commands inside the while is redirected I also put the redirection on the while rather than on each individual commands.

How to detect some pattern with grep -f on a file in terminal, and extract those lines without the pattern

I'm on mac terminal.
I have a txt file with one column with 9 IDs, allofthem.txt, where every ID starts with ¨rs¨:
rs382216
rs11168036
rs9296559
rs9349407
rs10948363
rs9271192
rs11771145
rs11767557
rs11
Also, I have another txt file, useful.txt, with those IDs that were useful in an analysis I did. It looks the same, one column with several rows of IDs, but with less IDS, only 5.
rs9349407
rs10948363
rs9271192
rs11
Problem:I want to generate a new txt file with the non-useful ones (the ones that appear in allofthem.txt but not in useful.txt).
I want to do the inverse of:
grep -f useful.txt allofthem.txt
I want to use some systematic way of deleting all the IDs in useful and obtain a file with the remaining ones. Maybe with awk or sed, but I can´t see it. Can you help me, please? Thanks in advance!
Desired output:
rs382216
rs11168036
rs9296559
rs11771145
rs11767557
-v option does the inverse for you:
grep -vxf useful.txt allofthem.txt > remaining.txt
-x option matches the whole line in allofthem.txt, not parts.
As #hek2mgl rightly pointed out, you need -F if you want to treat the content of useful.txt as strings and not patterns:
grep -vxFf useful.txt allofthem.txt > remaining.txt
Make sure your files have no leading or trailing white spaces - they could affect the results.
I recommend to use awk:
awk 'FNR==NR{patterns[$0];next} $0 in patterns' useful.txt allofthem.txt
Explanation:
FNR==NR is true as long as we are reading useful.txt. We create an index in patterns for every line of useful.txt. next stops further processing.
$0 in patterns runs, because of the previous next statement, on every line of allofthem.txt. It checks for every line of that file if it is a key in patterns. If that checks evaluates to true awk will print that line.

Remove a header from a file during parsing

My script gets every .csv file in a dir and writes them into a new file together. It also edits the files such that certain information is written into every row for a all of a file's entries. For instance this file called "trap10c_7C000000395C1641_160110.csv":
"",1/10/2016
"Timezone",-6
"Serial No.","7C000000395C1641"
"Location:","LS_trap_10c"
"High temperature limit (�C)",20.04
"Low temperature limit (�C)",-0.02
"Date - Time","Temperature (�C)"
"8/10/2015 16:00",30.0
"8/10/2015 18:00",26.0
"8/10/2015 20:00",24.5
"8/10/2015 22:00",24.0
Is converted into this format
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Location:,LS_trap_10c
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,High,temperature,limit,(�C),20.04
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Low,temperature,limit,(�C),-0.02
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Date,-,Time,Temperature,(�C)
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,16:00,30.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,18:00,26.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,20:00,24.5
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,22:00,24.0
I use this script to do this:
dos2unix *.csv
gawk '{print FILENAME, $0}' *.csv>>all_master.erin
sed -i 's/Serial No./SerialNo./g' all_master.erin
sed -i 's/ /,/g' all_master.erin
gawk -F, '/"SerialNo."/ {sn = $3}
/"Location:"/ {loc = $3}
/"([0-9]{1,2}\/){2}[0-9]{4} [0-9]{2}:[0-9]{2}"/ {lin = $0}
{$0 =loc FS sn FS $0}1' all_master.erin > formatted_log.csv
sed -i 's/\"//g' formatted_log.csv
sed -i '/^,/ d' formatted_log.csv
rm all_master.erin
printf "\nDone\n"
I want to remove the messy header from the formatted_log.csv file. I've tried and failed to use a sed, as it seems to remove things that I don't want to remove. Is sed the best way to approach this problem? The current sed fixes some problems with the header, but I want the header gone entirely. Any lines that say "serial no." and "location" are important and require information. The other lines can be removed entirely.
I suppose you edited your script before posting; as it stands, it will not produce the posted output (all_master.erin should be $(<all_master.erin) except in the first occurrence).
You don’t specify many vital details of the format of your input files, so we must guess them. Here are my guesses:
You ignore the first two lines and the subsequent empty third line.
The 4th and 5th lines are useful, since they provide the serial number and location you want to use in all lines of that file
The 6th, 7th and 8th lines are useless.
For each file, you want to discard the first four lines of the posted output.
With these assumptions, this is how I would modify your script:
#!/bin/bash
dos2unix *.csv
awk -vFS=, -vOFS=, \
'{gsub("\"","")}
FNR==4{s=$2}
FNR==5{l=$2}
FNR>8{gsub(" ",OFS);print l,s,FILENAME,$0}' \
*.csv > formatted_log.CSV
printf "\nDone\n"
Explanation of the awk script:
First we delete all double quotes with gsub("\"",""). Then, if the line number is 4, we set the variable s to the second field, which is the serial number. If the line number is 5, we set the variable l to the second field, which is the location. If the line number is greater than 8, we do two things. First, we execute gsub(" ",OFS) to replace all spaces with the value of the output field separator: this is needed because the intended output makes two separate fields of date and time, which were only one field in the input. Second, we print the line preceded by the values of l, s and FILENAME as requested.
Note that I’m using the (questionable) Unix trick of naming the output file with an all-caps extension .CSV to avoid it being wrongly matched by a subsequent *.csv. A better solution would be to put it in another directory, but I don’t know anything about your directory tree so I suggest you modify the output file name yourself.
You could use awk to remove anything
with less than 3 columns in your final file:
awk 'NF>=3' file

Merging specific lines in a CSV file if specific trend is found out

I have been stuck with creating the script for below mentioned scenario:
I have a file a.csv with content as
123,fsfs,4124124,412412
1314,fasfwe,42145,rwr
1234,fwtrwqt,twt
wqrfsdgaseg
12424,23532,fafwe,gewgt
14214,wet,wertwtw,wet
What happens is, due to some application, the CSV content of one line gets printed on the second line.
My task is to find such occurrence and merge the such lines in a new file.
so new files will contains only required CSV records I tried few things using sed, but couldn't succeed.
$ awk -F, '!length $4 && length $3 {printf "%s,", $0;next}1' file
123,fsfs,4124124,412412
1314,fasfwe,42145,rwr
1234,fwtrwqt,twt,wqrfsdgaseg
12424,23532,fafwe,gewgt
14214,wet,wertwtw,wet
All previous answers seem great, but I wanted to add a sed answer as well because sed is awesome! (And sed was added as a tag so we were missing a sed answer.)
This answer should work on multiple lines, provided that the cut always happen on a separator and that that separator is omitted (see input example for these assumptions).
sed ':l;/\([^,]*,\)\{3\}[^,]*/!{;N;s/\n/,/g;bl;}' <file_in >file_out
What it does is :
defines a label (:l)
tests if there are four fields (/\([^,]*,\)\{3\}[^,]*/)
If there isn't (!), execute the block ({;N;s/\n/,/g;bl;})
The block :
reads the next line into the buffer (N)
replaces the newline with the separator (s/\n/,/g)
loops by branching to our :llabel (bl)
Proof :
$ sed ':l;/\([^,]*,\)\{3\}[^,]*/!{;N;s/\n/,/g;bl;}' <<EOF
> 123,fsfs,4124124,412412
> 1314,fasfwe,42145,rwr
> 1234,fwtrwqt,twt
> wqrfsdgaseg
> 12424,23532,fafwe,gewgt
> 14214,wet,wertwtw,wet
> EOF
123,fsfs,4124124,412412
1314,fasfwe,42145,rwr
1234,fwtrwqt,twt,wqrfsdgaseg
12424,23532,fafwe,gewgt
14214,wet,wertwtw,wet

Resources