BWA fails to locate the index files

I'm currently working on analyzing a dataset. I'm new to the field of bioinformatics and am trying to use the BWA tools; however, as soon as I reach the bwa mem step, I keep running into the same error:
Input:
mirues-macbook:sra ipmiruek$ bwa mem -t 8 Homo_sapiens.GRCh38.dna.chromosome.17.fa ERR3841737/ERR3841737_trimmed.fq.gz > ERR3841737/ERR3841737_mapped.sam
Output:
[E::bwa_idx_load_from_disk] fail to locate the index files
I've already indexed the reference chromosome like this:
bwa index Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz
Is there anything I could do to fix this problem? Thank you.
I tried changing the dataset I was using, along with the corresponding reference chromosome, but it still yielded the same result. Is this an issue with the code or with the dataset I'm working with?

It looks like you indexed a gzip-compressed FASTA file, but are supplying an index base (idxbase) without the .gz extension. What you want is:
$ bwa mem \
-t 8 \
Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz \
ERR3841737/ERR3841737_trimmed.fq.gz \
> ERR3841737/ERR3841737_mapped.sam
Alternatively, gunzip the reference FASTA file and index it. For example:
$ gunzip Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz
$ bwa index Homo_sapiens.GRCh38.dna.chromosome.17.fa
Note that BWA packs the reference sequences (into the .pac file), so you don't even need the FASTA file to run BWA MEM after it's been indexed.
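To check which index base bwa mem will accept, you can list the files that bwa index actually produced; the prefix in front of .amb/.ann/.bwt/.pac/.sa is exactly the idxbase to pass on the command line. A minimal sketch, assuming the compressed reference from the question was the file indexed:
$ ls Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz.*
This should list five files ending in .amb, .ann, .bwt, .pac and .sa, all sharing the Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz prefix, which is therefore the name to give bwa mem.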

Related

Cannot sort VCF with bcftools due to invalid input

I am trying to compress & index a VCF file and am facing several issues.
When I use bgzip/tabix, it throws an error saying it cannot be indexed due to some unsorted values.
# code used to bgzip and tabix
bgzip -c fn.vcf > fn.vcf.gz
tabix -p vcf fn.vcf.gz
# below is the error returned
[E::hts_idx_push] Unsorted positions on sequence #1: 115352924 followed by 115352606
tbx_index_build failed: fn.vcf.gz
When I use bcftools sort to sort this VCF to tackle #1, it throws an error due to invalid entries.
# code used to sort
bcftools sort -O z --output-file fn.vcf.gz fn.vcf
# below is the error returned
Writing to /tmp/bcftools-sort.YSrhjT
[W::vcf_parse_format] Extreme FORMAT/AD value encountered and set to missing at chr12:115350908
[E::vcf_parse_format] Invalid character '\x0F' in 'GT' FORMAT field at chr12:115352482
Error encountered while parsing the input
Cleaning
I've tried sorting using Linux commands to get around #2. However, when I run the code below, the size of fout.vcf is almost half that of fin.vcf, indicating something might be going wrong.
grep "^#" fin.vcf > fout.vcf
grep -v "^#" fin.vcf | sort -k1,1V -k2,2n >> fout.vcf
Please let me know if you have any advice regarding:
How I could sort/fix the problematic inputs in my VCF in a safe & feasible way. (The file is 340G, so I cannot simply open the file and edit it.)
Why my Linux sort might be behaving in an odd way (i.e. returning a file much smaller than the original).
Any comments or suggestions are appreciated!
Try this:
mkdir tmp                                   ## 1. create a tmp folder in your working directory
tmp=./tmp                                   ## 2. assign the tmp folder (adjust the path if it lives elsewhere)
bcftools sort file.vcf -T "$tmp" -Oz -o file.vcf.gz
You can index your file after sorting it:
bcftools index file.vcf.gz
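Since the -Oz output of bcftools sort is already bgzip-compressed, the sorted file can also be indexed with tabix as originally attempted. A minimal sketch using the file names from the question; note that the sort will still stop on the invalid '\x0F' character reported at chr12:115352482 until that record is repaired or removed:
bcftools sort -T ./tmp -Oz --output-file fn.sorted.vcf.gz fn.vcf
tabix -p vcf fn.sorted.vcf.gz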

How to download URLs in a CSV and name outputs based on a column value

1. OS: Linux / Ubuntu x86/x64
2. Task:
Write a Bash shell script to download the URLs in a (large) CSV (as fast/as simultaneously as possible) and name each output based on a column value.
2.1 Example Input:
A CSV file containing lines like:
001,http://farm6.staticflickr.com/5342/a.jpg
002,http://farm8.staticflickr.com/7413/b.jpg
003,http://farm4.staticflickr.com/3742/c.jpg
2.2 Example outputs:
Files in a folder, outputs, containing files like:
001.jpg
002.jpg
003.jpg
3. My Try:
I tried mainly two approaches.
1. Using the download tool's built-in support
Take aria2c as an example: it supports an -i option to import a file of URLs to download, and (I think) it will process them in parallel for maximum speed. It does have a --force-sequential option to force downloads in the order of the lines, but I failed to find a way to make the naming part happen.
2. Splitting first
Split the file into pieces and run a script like the following to process each piece:
#!/bin/bash
INPUT=$1
while IFS=, read serino url
do
aria2c -c "$url" --dir=outputs --out="$serino.jpg"
done < "$INPUT"
However, this means aria2c is restarted for each line, which seems to cost time and lower the speed.
One can run the script multiple times to get 'shell-level' parallelism, but that doesn't seem to be the best way.
Any suggestions?
Thank you,
aria2c supports so-called option lines in input files. From man aria2c:
-i, --input-file=
Downloads the URIs listed in FILE. You can specify multiple sources for a single entity by putting multiple URIs on a single line separated by the TAB character. Additionally, options can be specified after each URI line. Option lines must start with one or more white space characters (SPACE or TAB) and must only contain one option per line.
and later on
These options have exactly same meaning of the ones in the command-line options, but it just applies to the URIs it belongs to. Please note that for options in input file -- prefix must be stripped.
You can convert your csv file into an aria2c input file:
sed -E 's/([^,]*),(.*)/\2\n out=\1/' file.csv | aria2c -i -
This will convert your file into the following format and run aria2c on it.
http://farm6.staticflickr.com/5342/a.jpg
out=001
http://farm8.staticflickr.com/7413/b.jpg
out=002
http://farm4.staticflickr.com/3742/c.jpg
out=003
However, this won't create files 001.jpg, 002.jpg, … but 001, 002, …, since that's what you specified. Either specify file names with extensions or guess the extensions from the URLs.
If the extension is always jpg you can use
sed -E 's/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i -
To extract extensions from the URLs use
sed -E 's/([^,]*),(.*)(\..*)/\2\3\n out=\1\3/' file.csv | aria2c -i -
Warning: This works if and only if every URL ends with an extension. For instance, due to the missing extension the line 001,domain.tld/abc would not be converted at all, causing aria2c to fail on the "URL" 001,domain.tld/abc.
Using all standard utilities you can do this to download in parallel:
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -
The -P 0 option lets xargs run the commands in parallel; with GNU xargs, 0 means it runs as many processes as possible at a time.
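If unbounded parallelism is too aggressive for the remote server, the same pipeline can cap the number of concurrent downloads and write into the outputs folder; a minimal sketch, assuming GNU xargs and curl:
# capped-parallelism variant: at most 8 downloads at a time, saved into outputs/
mkdir -p outputs
tr '\n' ',' < file.csv |
xargs -P 8 -d , -n 2 bash -c 'curl -s "$2" -o "outputs/$1.jpg"' -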

Using the split command in a shell, how to know the number of files generated

I am using the split command on a large file to generate smaller files, which are put in a folder; my problem is that the folder contains other files besides the ones from my split.
I would like to know if there is a way to find out how many files were generated only by my split, not the number of all files in my folder.
My command is split a 2 d. Is there any option I can add to this command to find out?
I know that ls -Al | wc -l will give me the number of files in the folder, but that is not what I'm after.
The simplest solution here is to split into a fresh directory.
Assuming that's not possible and you aren't worried about other processes operating on the directory in question, you can just count the files before and after. Something like this:
$ before=(*)
$ split a 2 d
$ after=(*)
$ echo "Split files: $(( ${#after[@]} - ${#before[@]} ))"
If the other files in the directory can't have the same format as the split files (and presumably they can't, or split would fail or overwrite them) then you could use an appropriate glob to get just the files that match the pattern. Something like splitfiles=(d??).
Failing that, you could see whether the --verbose option to split allows you to use split_count=$(split --verbose a 2 d | wc -l) or similar.
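For example, a minimal sketch of the glob-based count, assuming a hypothetical input file data.txt and the d prefix from the question:
$ split -a 2 data.txt d        # generates daa, dab, dac, ...
$ splitfiles=(d??)             # glob matching only the generated two-character suffixes
$ echo "Split files: ${#splitfiles[@]}"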
To be different, I will count the lines with grep, utilizing the --verbose option:
split --verbose other_options file|grep -c ""
Example:
$ split --verbose -b 2 file|grep -c ""
60
# yeah, my file is pretty small, splitting on 2 bytes to produce numerous files
You can also call split with the -l option to control the lines per output file, an explicit prefix (its last argument), and the -a option to control the suffix length, so that the generated files follow a predictable pattern you can count.

construct DNA sequence based on variation and human reference

The 1000 Genomes Project provides information about the "variation" of thousands of people's DNA sequences against the human reference DNA sequence. The variation is stored in VCF file format. Basically, for each person in that project, we can get his/her DNA variation information from the VCF file, for example the type of variation (e.g. insertion/deletion or SNP) and the position of the variation relative to the reference. The reference is in FASTA format. By combining the variation information of one person from the VCF file with the human reference in the FASTA file, I want to construct the DNA sequence for that person.
My question is: do tools that can perform this task well already exist, or do I have to write the scripts myself?
The Perl script vcf-consensus from VCFtools seems close to what you are looking for:
vcf-consensus
Apply VCF variants to a fasta file to create consensus sequence.
Usage: cat ref.fa | vcf-consensus [OPTIONS] in.vcf.gz > out.fa
Options:
-h, -?, --help This help message.
-H, --haplotype <int> Apply only variants for the given haplotype (1,2)
-s, --sample <name> If not given, all variants are applied
Examples:
samtools faidx ref.fa 8:11870-11890 | vcf-consensus in.vcf.gz > out.fa
The answers to the question New fasta sequence from reference fasta and variant calls file? posted on Biostar might also help.
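Note that vcf-consensus expects the VCF to be bgzip-compressed and tabix-indexed (hence the in.vcf.gz in the usage above); a minimal sketch of the preparation steps, with hypothetical file and sample names:
bgzip in.vcf                      # compress with bgzip (not plain gzip)
tabix -p vcf in.vcf.gz            # build the tabix index the script relies on
cat ref.fa | vcf-consensus -s SAMPLE_ID in.vcf.gz > out.fa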
You can use bcftools (https://github.com/samtools/bcftools) to perform this task:
bcftools consensus <file.vcf> \
--fasta-ref <file> \
--iupac-codes \
--output <file> \
--sample <name>
To install bcftools:
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
sudo cp bcftools/bcftools /usr/local/bin/
You can also combine bcftools consensus with samtools faidx (http://www.htslib.org/) to extract specific intervals from the fasta file. See bcftools consensus for more information:
About: Create consensus sequence by applying VCF variants to a reference
fasta file.
Usage: bcftools consensus [OPTIONS] <file.vcf>
Options:
-f, --fasta-ref <file> reference sequence in fasta format
-H, --haplotype <1|2> apply variants for the given haplotype
-i, --iupac-codes output variants in the form of IUPAC ambiguity codes
-m, --mask <file> replace regions with N
-o, --output <file> write output to a file [standard output]
-c, --chain <file> write a chain file for liftover
-s, --sample <name> apply variants of the given sample
Examples:
# Get the consensus for one region. The fasta header lines are then expected
# in the form ">chr:from-to".
samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa
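Like vcf-consensus, bcftools consensus requires the VCF to be bgzip-compressed and indexed before variants can be applied; a minimal sketch with hypothetical file and sample names:
bgzip file.vcf                       # compress with bgzip
bcftools index file.vcf.gz           # or: tabix -p vcf file.vcf.gz
bcftools consensus file.vcf.gz \
    --fasta-ref ref.fa \
    --sample SAMPLE_ID \
    --output consensus.fa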
For anyone still coming to this page: if you have a FASTA reference genome and a BAM file that you want to turn into a consensus sequence by applying SNPs and N's, you may try this one-liner using samtools, bcftools and vcfutils.pl (P.S. for beginners: both samtools and bcftools can be compiled on a computing cluster or on Linux; if so, just add the location of each before the program name; vcfutils.pl is a Perl script that already ships with bcftools):
samtools mpileup -d8000 -q 20 -Q 10 -uf REFERENCE.fasta Your_File.bam | bcftools call -c | vcfutils.pl vcf2fq > OUTPUT.fastq
-d, --max-depth: maximum number of reads considered per position and input file
-q, --min-MQ: minimum mapping quality for an alignment to be used
-Q, --min-BQ: minimum base quality for a base to be considered
(You can use different values of course; see http://www.htslib.org/doc/samtools.html)
This generates a weird format that looks like FASTQ but isn't, so you can't convert it with a regular converter, but you can use the following sed command, which I wrote specifically for this output:
sed -i -e '/^+/,/^\#/{/^+/!{/^\#/!d}}; /^+/ d; s/#/>/g' OUTPUT.fastq
In the end, make sure to compare your new fasta files to your reference to be sure that everything is fine.
EDIT: BE CAREFUL WITH THE SED COMMAND, IT MAY DELETE SOME OF YOUR READS IF YOUR QUALITY SCORING DIFFERS FROM MINE
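As a possible alternative to the sed command, the seqtk utility (an assumption here, it is not part of the original answer) is often used to convert the FASTQ-like vcf2fq output into FASTA while masking low-quality bases:
# convert the consensus to FASTA, masking bases with quality below 20 as N
seqtk seq -aQ64 -q20 -n N OUTPUT.fastq > OUTPUT.fasta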

Append to the top of a large file: bash

I have a nearly 3 GB file that I would like to add two lines to the top of. Every time I try to manually add these lines, vim and vi freeze up on the save (I let them try to save for about 10 minutes each). I was hoping that there would be a way to just append to the top, in the same way you would append to the bottom of the file. The only things I have seen so far however include a temporary file, which I feel would be slow due to the file size.
I was hoping something like:
grep -top lineIwant >> fileIwant
Does anyone know a good way to append to the top of the file?
Try
cat file_with_new_lines file > newfile
I did some benchmarking to compare using sed with in-place edit (as suggested here) to cat (as suggested here).
~3GB bigfile filled with dots:
$ head -n3 bigfile
................................................................................
................................................................................
................................................................................
$ du -b bigfile
3025635308 bigfile
File newlines with two lines to insert on top of bigfile:
$ cat newlines
some data
some other data
$ du -b newlines
26 newlines
Benchmark results using dumbbench v0.08:
cat:
$ dumbbench -- sh -c "cat newlines bigfile > bigfile.new"
cmd: Ran 21 iterations (0 outliers).
cmd: Rounded run time per iteration: 2.2107e+01 +/- 5.9e-02 (0.3%)
sed with redirection:
$ dumbbench -- sh -c "sed '1i some data\nsome other data' bigfile > bigfile.new"
cmd: Ran 23 iterations (3 outliers).
cmd: Rounded run time per iteration: 2.4714e+01 +/- 5.3e-02 (0.2%)
sed with in-place edit:
$ dumbbench -- sh -c "sed -i '1i some data\nsome other data' bigfile"
cmd: Ran 27 iterations (7 outliers).
cmd: Rounded run time per iteration: 4.464e+01 +/- 1.9e-01 (0.4%)
So sed seems to be way slower (80.6%) when doing an in-place edit on large files, probably because it moves the intermediate temp file to the location of the original file afterwards. Using I/O redirection, sed is only 11.8% slower than cat.
Based on these results I would use cat as suggested in this answer.
Try doing this:
Using sed:
sed -i '1i NewLine' file
Or using ed :
ed -s file <<EOF
1i
NewLine
.
w
q
EOF
The speed of such an operation depends greatly on the underlying file system. To my knowledge there isn't an FS optimized for this particular operation. Most FSs organize files using full disk blocks, except for the last one, which may be only partially used by the end of the file. Indeed, a file of size N would take N/S blocks, where S is the block size, plus one more block for the remaining part of the file (of size N%S, % being the remainder operator) if N is not divisible by S.
Usually, these blocks are referenced by their indices on the disk (or partition), and these indices are stored within the FS metadata, attached to the file entry which allocates them.
From this description, you can see that it could be possible to prepend content whose size would be a multiple of the block size, by just updating the metadata with the new list of blocks used by the file. However, if that prepended content doesn't fill exactly a number of blocks, then the existing data would have to be shifted by that exceeding amount.
Some FS may implement the possibility of having partially used blocks within the list (and not only as the last entry) of used ones for files, but this is not a trivial thing to do.
See these other SO questions for further details:
Prepending Data to a file
Is there a file system with a low level prepend operation
At a higher level, even if that operation is supported by the FS driver, it is still possible that programs don't use the feature.
For the instance of that problem you are trying to solve, the best way is probably a program capable of concatenating the new content and the existing one into a new file.
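For example, a minimal sketch of that concatenate-into-a-new-file approach, using hypothetical file names (the original is only replaced once the concatenation has finished):
printf 'first new line\nsecond new line\n' > header.txt
cat header.txt bigfile > bigfile.new && mv bigfile.new bigfile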
cat file
Unix
linux
This appends the two lines after the first line of the file, both at the same time, using the command:
sed -i '1a C\njava' file
cat file
Unix
C
java
linux
If you want to INSERT before the line, use i instead of a; to replace the line, use c.
