How can I eliminate duplicated sequences in a FASTA file?

I'm trying to build a database for a bacterial genus using all the published sequences, so that I can calculate the coverage of my reads against this database using bowtie2 for mapping. To do that, I merged all the genome sequences I downloaded from NCBI into one FASTA library (I merged 74 files into one FASTA file). The problem is that this FASTA file (the library I created) contains a lot of duplicated sequences, and that affects the coverage in a big way. So I'm asking: is there any way to eliminate the duplicates in my library file, or to merge the sequences without introducing duplicates, or alternatively another way to calculate the coverage of my reads against the reference sequences?
I hope I'm clear enough; please tell me if anything is unclear.

If you have control over your setup, then you could install seqkit and run the following on your FASTA file:
$ seqkit rmdup -s < in.fa > out.fa
If you have multiple files, you can concatenate them and feed them in as standard input:
$ seqkit rmdup -s < <(cat inA.fa ... inN.fa) > out.fa
The rmdup command removes duplicates, and the -s option calls duplicates on the basis of sequence, ignoring differences in headers. I'm not sure which header is kept in the output, but that may be something to think about.
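If the surviving header matters, one quick way to check is to compare the headers before and after deduplication (a simple sanity-check sketch using standard tools, with in.fa and out.fa as above):
$ grep '^>' in.fa | sort > headers_before.txt
$ grep '^>' out.fa | sort > headers_after.txt
$ comm -23 headers_before.txt headers_after.txt
The last command prints the headers that were dropped, so you can see which duplicate of each pair was kept.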
To avoid third-party dependencies and understand how dups are being removed, one can use awk.
The idea is to read all FASTA records one by one into an associative array (or hash table, also called a "dictionary" in Python), only if the sequence is not already in the array.
For example, starting with a single-line FASTA file in.fa that looks like this:
>test1
ATAT
>test2
CGCG
>test3
ATAT
>test4
GCCT
We can remove duplicates, preserving the first header, like so:
$ awk 'BEGIN {i = 1;} { if ($1 ~ /^>/) { tmp = h[i]; h[i] = $1; } else if (!a[$1]) { s[i] = $1; a[$1] = "1"; i++; } else { h[i] = tmp; } } END { for (j = 1; j < i; j++) { print h[j]; print s[j]; } }' < in.fa > out.fa
$ cat out.fa
>test1
ATAT
>test2
CGCG
>test4
GCCT
It requires a little knowledge about awk if you need modifications. This approach also depends on how your FASTA files are structured (records with sequences on one line or multiple lines, etc.), though it is usually pretty easy to modify FASTA files into the above structure (one line each for header and sequence).
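For instance, a multi-line FASTA can usually be flattened into that one-line-per-sequence layout with a small awk pass first (a rough sketch; multi.fa and single.fa are placeholder names, and records are assumed to be well-formed):
$ awk '/^>/ { if (seq != "") print seq; print; seq = ""; next } { seq = seq $0 } END { if (seq != "") print seq }' multi.fa > single.fa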
Any hash table approach also uses a fair bit of memory (I imagine that seqkit probably makes the same compromise for this particular task, but I haven't looked at the source). This could be an issue for very large FASTA files.
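If memory does become a problem, one alternative (my suggestion, not part of the approach above) is to lean on sort, which spills to disk instead of holding everything in RAM: put each record on one tab-separated line, deduplicate on the sequence column, then restore the FASTA layout. Note that this reorders the records and does not guarantee that the first header seen is the one that survives:
awk '/^>/ { if (h != "") print h "\t" s; h = $0; s = ""; next } { s = s $0 } END { if (h != "") print h "\t" s }' in.fa |
    sort -t $'\t' -k2,2 -u |
    awk -F '\t' '{ print $1; print $2 }' > out.fa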
It's probably better to use seqkit if you have a local environment on which you can install software. If you have an IT-locked-down setup, then awk would work for this task, as well, as it comes with most Unixes out of the box.

Related

Is there a faster way to combine files in an ordered fashion than a for loop?

For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (where xxx starts at 001 and increases by 1) into a single file (denoted $COMBINED_FILE), and then replace a number of lines of text in $COMBINED_FILE with values taken from another file (named $ACTFILE). I have two for loops that do this, and they work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time, so I am wondering if anyone has ideas on how to speed it up.
Step 1:
for i in {001..999}; do
[[ ! -f ${FILENAME}.${i}.xyz ]] && break
cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
let "n = 2 + (${j} * ${LINES_PER_CONF})"
let "m = ${j} + 1"
ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Line 3-Number of atoms: Molecular coordinates
Line (number of atoms +1): same as line 1
Line (number of atoms +2): Title 2
... continues on (where line 1 through Number of atoms is associated with conformer 1, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make the energy of each conformer the title line for that conformer in the combined file (i.e. each energy must become the title of its specific conformer).
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files to renamed names in another directory does require a loop, or one of the less-than-highly portable pattern-based renaming utilities.
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
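For illustration, a minimal sketch of that idea (untested against your data; it assumes the layouts described above and that LINES_PER_CONF is the number of lines in one conformer block, as in your script):
awk -v lpc="$LINES_PER_CONF" '
    NR == FNR { energy[NR] = $2; next }       # first file: the ACT file, one energy per conformer
    FNR > 1 && (FNR - 2) % lpc == 0 {         # title lines sit at 2, 2 + lpc, 2 + 2*lpc, ...
        $0 = energy[(FNR - 2) / lpc + 1]      # overwrite the title with the matching energy
    }
    { print }
' "$ACTFILE" "$COMBINED_FILE" > tmp && mv tmp "$COMBINED_FILE"
This makes a single pass over each file instead of one awk plus one sed invocation per conformer.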
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before concatenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number in the current file; condition/action pairs can tell when processing has moved to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
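As a toy illustration of that pattern (GNU Awk only; the per-file summary printed here is just an example, not something your workflow needs):
gawk '
    BEGINFILE { natoms = 0 }    # reset the accumulator for each new file
    FNR == 1  { natoms = $1 }   # the first line of an .xyz file holds the atom count
    ENDFILE   { print FILENAME ": " natoms " atoms, " FNR " lines" }
' ${FILENAME}.[0-9][0-9][0-9].xyz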
If you really want to materialize all the file names without globbing, you can always use jot (it's like seq, but it keeps more integer digits in its default mode before switching to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' \_=17 __=91   # extracting fixed-interval samples without modulo (%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926

Remove line in CSV file if string found (from another text file) in bash

Due to a power failure issue, I am having to clean up jobs which are run based on text files. So the problem is, I have a text file with strings like so (they are uuids):
out_file.txt (~300k entries)
<some_uuidX>
<some_uuidY>
<some_uuidZ>
...
and a csv like so:
in_file.csv (~500k entries)
/path/to/some/location1/,<some_uuidK>.json.<some_string1>
/path/to/some/location2/,<some_uuidJ>.json.<some_string2>
/path/to/some/location3/,<some_uuidX>.json.<some_string3>
/path/to/some/location4/,<some_uuidY>.json.<some_string4>
/path/to/some/location5/,<some_uuidN>.json.<some_string5>
/path/to/some/location6/,<some_uuidZ>.json.<some_string6>
...
I would like to remove the lines from in_file.csv whose uuid matches an entry in out_file.txt.
The end result:
/path/to/some/location1/,<some_uuidK>.json.<some_string1>
/path/to/some/location2/,<some_uuidJ>.json.<some_string2>
/path/to/some/location5/,<some_uuidN>.json.<some_string5>
...
Since the file sizes are fairly large, I was wondering if there is an efficient way to do this in bash.
Any tips would be great.
Here is a potential grep solution:
grep -vFwf out_file.txt in_file.csv
And a potential awk solution (likely faster):
awk -F"[,.]" 'FNR==NR { a[$1]; next } !($2 in a)' out_file.txt in_file.csv
NB there are caveats to each of these approaches. Although they both appear to be suitable for your intended purpose (as indicated by your comment "the numbers add up correctly"), posting a minimal, reproducible example in future questions is the best way to help us help you.
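One caveat worth spelling out: the grep version matches the fixed strings anywhere on the line, whereas the awk version compares only the second ,/.-delimited field, so they can disagree on unusual input. A quick way to compare the two on a sample (a sketch; sample.csv, by_grep.txt and by_awk.txt are made-up names):
head -n 1000 in_file.csv > sample.csv
grep -vFwf out_file.txt sample.csv > by_grep.txt
awk -F"[,.]" 'FNR==NR { a[$1]; next } !($2 in a)' out_file.txt sample.csv > by_awk.txt
diff by_grep.txt by_awk.txt && echo "identical on this sample"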

Looking up and extracting a line from a big file matching the lines of another big file

I have taken the liberty of creating a new question, as some parameters have changed dramatically compared to my first question about optimising my bash script (Optimising my script which lookups into a big compressed file).
In short: I want to look up and extract all the lines where the value in the first column of a file (1) (a BAM file) matches the first column of a text file (2). For bioinformaticians, it amounts to extracting the matching read IDs from two files.
File 1 is a binary compressed 130GB file
File 2 is a tsv file of 1 billion lines
Recently a user came up with a very elegant one-liner combining the decompression of the file and the lookup with awk, and it worked very well. However, with the size of the files, it has now been running for more than 200 hours (multithreaded).
Does this "problem" have a name in algorithmics?
What could be a good way to tackle this challenge? (If possible with simple solutions such as sed, awk, bash...)
Thank you a lot.
Edit: Sorry for leaving out the code; as it was behind the link, I thought including it would be a duplicate. Here is the one-liner used:
#!/bin/bash
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
Think of this as a long comment rather than an answer. The "merge sort" method can be summarised as: if two records don't match, advance one record in the file with the smaller record; if they do match, record the match and advance one record in the big file.
In pseudocode, this looks something like:
currentSmall <- readFirstRecord(smallFile)
currentLarge <- readFirstRecord(largeFile)
searching <- true
while (searching)
    if (currentLarge < currentSmall)
        currentLarge <- readNextRecord(largeFile)
    else if (currentLarge = currentSmall)
        // Bingo!
        saveMatchData(currentLarge, currentSmall)
        currentLarge <- readNextRecord(largeFile)
    else if (currentLarge > currentSmall)
        currentSmall <- readNextRecord(smallFile)
    endif
    if (largeFile.EOF or smallFile.EOF)
        searching <- false
    endif
endwhile
Quite how you translate that into awk or bash is beyond my meagre knowledge of either.
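In awk, one possible rendering of that merge-join idea is sketched below. It assumes both inputs are plain text already sorted on their first column in byte order (e.g. LC_ALL=C sort); the file names are placeholders, and in the original use case the big file would be the sorted, decompressed stream from samtools view:
# small_sorted.tsv and big_sorted.txt are placeholder names
LC_ALL=C awk -v small="small_sorted.tsv" '
    BEGIN {
        more = (getline line < small) > 0        # read the first record of the small file
        if (more) { split(line, f); key = f[1] }
    }
    {
        while (more && key < $1) {               # small key lags behind: advance the small file
            more = (getline line < small) > 0
            if (more) { split(line, f); key = f[1] }
        }
        if (more && key == $1) print             # keys match: keep this line of the big file
    }
' big_sorted.txt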

awk - skip lines of subdomains if domain already matched

Let's assume there is an already ordered list of domains like:
tld.aa.
tld.aa.do.notshowup.0
tld.aa.do.notshowup.0.1
tld.aa.do.notshowup.0.1.1
tld.aa.do.notshowup.too
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.xxxxx.donotshowup
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
which later acts as a blacklist.
Per a specific requirement, all lines with a trailing '.' indicate that all deeper subdomains of that specific domain should not appear in the blacklist itself... so the desired output of the example above would/should be:
tld.aa.
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
I currently run this in a loop (pure bash + heavy use of bash builtins to speed things up)... but as the list grows it now takes quite a long time to process around 562k entries.
Shouldn't it be easy for awk (or maybe sed) to do this? Any help is really appreciated (I already tried some things in awk but somehow couldn't get it to display what I want...).
Thank you!
If the . lines always come before the lines to ignore, this awk should do:
$ awk '{for (i in a) if (index($0,i) == 1) next}/\.$/{a[$0]=1}1' file
tld.aa.
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
/\.$/{a[$0]=1} adds lines with trailing dot to an array.
{for (i in a) if (index($0,i) == 1) next} searches for the current line in one of these indexed entries and skips further processing if found (next).
If the file is sorted alphabetically and no subdomains end with a dot, you don't even need an array, as @Corentin Limier suggests:
awk 'a{if (index($0,a) == 1) next}/\.$/{a=$0}1' file

One-line program to delete files with few header lines

This is the next part of my earlier question, "perl one-liner to keep only desired lines". Here I have many *.fa files in a folder.
Suppose for three files: 1.fa, 2.fa, 3.fa
The contents of them are as follows:
1.fa
>djhnk_9
abfgdddcfdafaf
ygdugidg
>kjvk.80
jdsfkdbfdkfadf
>jnck_q2
fdgsdfjghsjhsfddf
>7ytiu98
ihdlfwdfjdlfl]ol
2.fa
>cj76
dkjfhkdjcfhdjk
>67q32
nscvsdkvklsflplsad
>kbvbk
cbjfdikjbfadkjfbka
3.fa
>1290.5
mnzmnvjbsdjb
The lines that start with a > are the headers and the rest are the feature lines.
I want to delete those files that have 3 or fewer header lines. Here, file 2.fa and file 3.fa should be deleted.
As I am working on a Windows system, I would prefer a one-line Perl script that I can call like:
for %%F in ("*.fa") do perl ...
Is there a one-line program for that?
Use a program. "One-liners" are inscrutable, non-portable, and very hard to debug.
This does as you ask. I hope it's clear that I have commented out the unlink call for testing purposes: it would be a pain to regenerate the *.fa files each time.
You will probably want to change '[0-9].fa' to just *.fa. I had other files in my own directory that I didn't want to be considered.
use strict;
use warnings 'all';

while ( my $file = glob '[0-9].fa' ) {
    open my $fh, '<', $file;
    my $headers = grep /^>/, <$fh>;
    #unlink $file if $headers <= 3;
    print qq{deleting "$file"\n} if $headers <= 3;
}
output
deleting "2.fa"
deleting "3.fa"
Next time, please try to write some code by yourself to solve the problem, and only after come ask for help. You will learn more if you do that, and we won't feel like you're just asking us to write your code.
The problem is very simple though, so here's a solution.
Note that this solution should be considered a quick fix. Borodin suggested a cleaner, easier-to-understand and more portable way to do this here.
I would suggest doing this with perl, like this:
perl -nE "$count{$ARGV}++ if /^>/; END { unlink grep { $count{$_} <= 3 } keys %count }" *.fa
(For the record, I'm using double quotes " as the delimiters of the string since you are on Windows, but if anyone wishes to use this on a unix system, just change the double quotes " to single quotes '.)
Explanations:
-n surrounds the code with while(<>){...}, which reads the files one by one.
With $count{$ARGV}++ if /^>/ we count the number of headers in each file: $ARGV holds the name of the file being read, and /^>/ is true only if the line starts with >, i.e. it's a header line.
Finally (the END { ... } part), we delete (with the function unlink) the files that have 3 headers or fewer: keys %count gives all the file names, and grep { $count{$_} <= 3 } retains only the files with 3 or fewer header lines, so that they can be deleted.
