construct DNA sequence based on variation and human reference - bioinformatics

The 1000 Genomes Project provides information about the "variation" of thousands of people's DNA sequences against the human reference DNA sequence. The variation is stored in VCF file format. Basically, for each person in that project, we can get his/her DNA variation information from the VCF file, for example the type of variation (e.g. insertion/deletion or SNP) and the position of the variation relative to the reference. The reference is in FASTA format. By combining the variation information of one person from the VCF file with the human reference in the FASTA file, I want to construct the DNA sequence for that person.
My question is: do tools already exist that can perform this task well, or do I have to write the scripts myself?

The perl script vcf-consensus from VCFtools seems close to what you are looking for:
vcf-consensus
Apply VCF variants to a fasta file to create consensus sequence.
Usage: cat ref.fa | vcf-consensus [OPTIONS] in.vcf.gz > out.fa
Options:
-h, -?, --help This help message.
-H, --haplotype <int> Apply only variants for the given haplotype (1,2)
-s, --sample <name> If not given, all variants are applied
Examples:
samtools faidx ref.fa 8:11870-11890 | vcf-consensus in.vcf.gz > out.fa
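For a whole-chromosome or whole-genome consensus of a single individual, the call would look something like this (a sketch: NA12878 is a placeholder for whatever sample name appears in your VCF, and in.vcf.gz must be bgzip-compressed and tabix-indexed for vcf-consensus to read it):
# consensus of haplotype 1 for one sample; the sample name is a placeholder
cat ref.fa | vcf-consensus -s NA12878 -H 1 in.vcf.gz > NA12878_hap1.fa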
The answers to the question New fasta sequence from reference fasta and variant calls file? posted on Biostar might also help.

You can use bcftools (https://github.com/samtools/bcftools) to perform this task:
bcftools consensus <file.vcf> \
--fasta-ref <file> \
--iupac-codes \
--output <file> \
--sample <name>
To install bcftools:
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
sudo cp bcftools/bcftools /usr/local/bin/
You can also combine bcftools consensus with samtools faidx (http://www.htslib.org/) to extract specific intervals from the fasta file. See bcftools consensus for more information:
About: Create consensus sequence by applying VCF variants to a reference
fasta file.
Usage: bcftools consensus [OPTIONS] <file.vcf>
Options:
-f, --fasta-ref <file> reference sequence in fasta format
-H, --haplotype <1|2> apply variants for the given haplotype
-i, --iupac-codes output variants in the form of IUPAC ambiguity codes
-m, --mask <file> replace regions with N
-o, --output <file> write output to a file [standard output]
-c, --chain <file> write a chain file for liftover
-s, --sample <name> apply variants of the given sample
Examples:
# Get the consensus for one region. The fasta header lines are then expected
# in the form ">chr:from-to".
samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa
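To build the consensus for one sample over the whole reference rather than a single region, something along these lines should work (a sketch: the sample name NA12878 and the file names are placeholders, and the VCF must be bgzip-compressed and indexed before bcftools consensus can read it):
bgzip in.vcf                  # compress with bgzip (part of htslib)
tabix -p vcf in.vcf.gz        # index the compressed VCF
bcftools consensus -f ref.fa -s NA12878 -o NA12878_consensus.fa in.vcf.gz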

For anyone still coming to this page: if you have a FASTA reference genome and a BAM file that you want to turn into a consensus of that reference (with SNPs applied and low-quality or uncovered positions shown as N), you can try this one-liner using samtools, bcftools and vcfutils.pl. (A note for beginners: both samtools and bcftools can be compiled on a computing cluster or under Linux; if so, just prepend their install locations to the command names. vcfutils.pl is a Perl script that ships with bcftools.)
samtools mpileup -d8000 -q 20 -Q 10 -uf REFERENCE.fasta Your_File.bam | bcftools call -c | vcfutils.pl vcf2fq > OUTPUT.fastq
-d, --max-depth   maximum number of reads considered per position
-q, --min-MQ      minimum mapping quality for an alignment to be used
-Q, --min-BQ      minimum base quality for a base to be considered
(You can use different values of course; see http://www.htslib.org/doc/samtools.html)
This generates an odd format that looks like FASTQ but isn't quite, so you can't convert it with a standard converter; instead you can use the following sed command, which I wrote specifically for this output:
sed -i -e '/^+/,/^\#/{/^+/!{/^\#/!d}}; /^+/ d; s/#/>/g' OUTPUT.fastq
In the end, make sure to compare your new fasta files to your reference to be sure that everything is fine.
EDIT: Be careful with the sed command; it may delete some of your reads if your quality scoring differs from the case I had.
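If seqtk is installed and the vcf2fq output on your system is standard FASTQ, its seq subcommand is another way to get a FASTA out of it (a sketch; compare the result against the sed approach above before relying on it):
seqtk seq -a OUTPUT.fastq > OUTPUT.fasta    # -a forces FASTA output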

Related

How to download URLs in a csv and name outputs based on a column value

1. OS: Linux / Ubuntu x86/x64
2. Task:
Write a Bash shell script to download the URLs in a (large) csv (as fast/simultaneously as possible) and name each output based on a column value.
2.1 Example Input:
A CSV file containing lines like:
001,http://farm6.staticflickr.com/5342/a.jpg
002,http://farm8.staticflickr.com/7413/b.jpg
003,http://farm4.staticflickr.com/3742/c.jpg
2.2 Example outputs:
Files in a folder, outputs, containing files like:
001.jpg
002.jpg
003.jpg
3. My Try:
I tried mainly two approaches.
1. Using the download tool's built-in support
Take aria2c as an example: it supports the -i option to import a file of URLs to download, and (I think) it will process them in parallel at maximum speed. It does have a --force-sequential option to force downloads in the order of the lines, but I failed to find a way to make the naming part happen.
2. Splitting first
split the file into files and run a script like the following to process it:
#!/bin/bash
INPUT=$1
while IFS=, read serino url
do
aria2c -c "$url" --dir=outputs --out="$serino.jpg"
done < "$INPUT"
However, this means aria2c is restarted for each line, which seems to cost time and lower the speed.
One can run the script multiple times to get 'shell-level' parallelism, but that doesn't seem to be the best way.
Any suggestion ?
Thank you,
aria2c supports so-called option lines in input files. From man aria2c:
-i, --input-file=
Downloads the URIs listed in FILE. You can specify multiple sources for a single entity by putting multiple URIs on a single line separated by the TAB character. Additionally, options can be specified after each URI line. Option lines must start with one or more white space characters (SPACE or TAB) and must only contain one option per line.
and later on
These options have exactly same meaning of the ones in the command-line options, but it just applies to the URIs it belongs to. Please note that for options in input file -- prefix must be stripped.
You can convert your csv file into an aria2c input file:
sed -E 's/([^,]*),(.*)/\2\n out=\1/' file.csv | aria2c -i -
This will convert your file into the following format and run aria2c on it.
http://farm6.staticflickr.com/5342/a.jpg
out=001
http://farm8.staticflickr.com/7413/b.jpg
out=002
http://farm4.staticflickr.com/3742/c.jpg
out=003
However this won't create files 001.jpg, 002.jpg, … but 001, 002, … since that's what you specified. Either specify file names with extensions or guess the extensions from the URLs.
If the extension is always jpg you can use
sed -E 's/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i -
To extract extensions from the URLs use
sed -E 's/([^,]*),(.*)(\..*)/\2\3\n out=\1\3/' file.csv | aria2c -i -
Warning: This works if and only if every URL ends with an extension. For instance, due to the missing extension the line 001,domain.tld/abc would not be converted at all, causing aria2c to fail on the "URL" 001,domain.tld/abc.
Using all standard utilities you can do this to download in parallel:
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -
The -P 0 option tells xargs to run as many commands in parallel as possible; pass a positive number instead (e.g. -P 8) to cap the number of simultaneous downloads.
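A variant of the same idea that caps the parallelism and keeps whatever extension the URL already has might look like this (a sketch assuming GNU xargs, curl, and URLs whose last path component ends in an extension):
# 8 downloads at a time; "${2##*.}" takes the extension from the URL
tr '\n' ',' < file.csv |
xargs -P 8 -d , -n 2 bash -c 'curl -s "$2" -o "$1.${2##*.}"' -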

sort files before converting to pdf in imagemagick

I have a folder full of image files I need to convert to a pdf. I used wget to download them. The problem is that the ordering Linux gives the files isn't the actual order of the pages; this is an example of the file ordering:
100-52b69f4490.jpg
101-689eb36688.jpg
10-1bf275d638.jpg
102-6f7dc2def9.jpg
103-2da8842faf.jpg
104-9b01a64111.jpg
105-1d5e3862d8.jpg
106-221412a767.jpg
...
I can convert these images to a pdf using imagemagick, with the command
convert *.jpg output.pdf
but it'll put the pages into the pdf in the above order, not in human-readable numerical order (1-blahblahblah.jpg, 2-blahblahblah.jpg, 3-blahblahblah.jpg, etc.).
Is the easiest way to do this to pipe the output of sort into convert, or to pipe my wget so that each file is added to a pdf as I download it?
convert $(ls -1v *.jpg) book.pdf
worked for me
There are several options:
The simplest is as follows, but may overflow your command-line length if you have too many pages:
convert $(ls *jpg | sort -n) result.pdf
Next up is feeding the list of files on stdin like this:
ls *jpg | sort -n | convert @- result.pdf
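A third variant, if the list is long or you want to keep it around, is to write it to a file first and hand that to convert with the same @ syntax (a sketch assuming an ImageMagick build that accepts @file lists):
printf '%s\n' *.jpg | sort -n > pages.txt   # one filename per line, numerically sorted
convert @pages.txt result.pdf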
Here is a bash script that does it:
#!/bin/bash
sort -n < list.txt > sorted_list.tmp     # numerically sort the file list
readarray -t list < sorted_list.tmp      # read the sorted names into an array
convert "${list[@]}" output.pdf
rm sorted_list.tmp
exit
You can get list.txt by first listing your directory with ls > list.txt.
The sort -n (numerical sort) "normalizes" your entries.
The sorted list is saved in the .tmp file and deleted at the end.
Greetings,

XARGS: Nesting Utilities within Utility call

I am trying to build a YAML file for a large database by piping a list of names into printf with xargs.
I would like to call ls in the printf command to get files specific to each name in my list; however, calls to ls nested within the printf command don't seem to work.
The following command
cat w1.nextract.list | awk '{print $1}' | xargs -I {} printf "'{}':\n\tw1:\n\t\tanatomical_scan:\n\t\t\tAnat: $(ls $(pwd)/{})\n"
Just provides the following error
ls: cannot access '/data/raw/long/{}': No such file or directory
Followed by an output that looks like:
'149959':
w1:
anatomical_scan:
Anat:
I'd like the standard input to xargs to be usable within the nested utility call, so that it gives me the full path to the necessary files, i.e.:
'149959':
w1:
anatomical_scan:
Anat: /data/raw/long/149959/test-1/test9393.txt
Anyone have any ideas?
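For what it's worth, the $(ls ...) in the command above is expanded by the shell before xargs ever runs, so {} is still literal by the time ls sees it. One way around this is to defer the lookup until each name is known, for example with a plain read loop. A minimal sketch, assuming each line of w1.nextract.list starts with a directory name under the current directory:
while read -r name _; do
    # locate the file for this entry at run time (adjust if several files can match)
    anat=$(find "$PWD/$name" -type f | head -n 1)
    printf "'%s':\n  w1:\n    anatomical_scan:\n      Anat: %s\n" "$name" "$anat"
done < w1.nextract.list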
A safer way than the original xargs/printf approach, though one with several caveats, would be:
% cat w1.nextract.list | \
sed -e 's#^\(^[^/]*\)/\(.*\)$#'\''\1'\'':\n w1:\nANA-SCAN\nANAT-PWD\1/\2#' \
-e "s#ANAT-PWD# Anat: `pwd`/#" \
-e 's/ANA-SCAN/ anatomical_scan:/'
There are restrictions on the contents of the w1.nextract.list file:
None of the lines may contain a hash ('#') character.
Any other special characters on a line may be unsafe.
For testing, I created the w1.nextract.list file with one entry:
149959/test-1/test9393.txt
The resulting output is here:
'149959':
w1:
anatomical_scan:
Anat: /data/raw/long/149959/test-1/test9393.txt
Can you explain in more detail? What makes this so fragile?
Using xargs to feed printf can lead to unexpected results if the input file contains special characters or escape sequences. A bad actor could then modify your input file to exploit this issue. Best practice is to avoid it.
Fragility comes from maintaining the w1.nextract.list file. You could auto-generate the file to reduce this issue:
cd /data/raw/long/; find * -type f -print
What is a real YAML implementation?
The yq command is an example YAML implementation. You could use it to craft the .yaml file.
I haven't worked with these kinds of Python packages before, so this would be a first-time approach for me.
Using python, perl, or even php would allow you to craft the file without worrying about unsafe characters in the filenames.

Using the split command in shell, how to know the number of files generated

I am using the split command on a large file to generate smaller files, which are put in a folder; my problem is that the folder also contains other files that did not come from my split.
I would like to know if there is a way to find out how many files were generated by my split alone, not the total number of files in my folder.
My command is split a 2 d. Is there an option I can add to this command to find that out?
I know that ls -Al | wc -l will give me the number of files in the folder, which is not what I'm interested in.
The simplest solution here is to split into a fresh directory.
Assuming that's not possible, and you aren't worried about other processes operating on the directory in question, you can just count the files before and after. Something like this:
$ before=(*)
$ split a 2 d
$ after=(*)
$ echo "Split files: $(( ${#after[@]} - ${#before[@]} ))"
If the other files in the directory can't have the same format as the split files (and presumably they can't, or split would fail or overwrite them), then you could use an appropriate glob to get just the files that match the pattern. Something like splitfiles=(d??).
Failing that, you could see whether the --verbose option to split allows you to use split_count=$(split --verbose a 2 d | wc -l) or similar.
To be different, I will count the lines with grep, utilizing the --verbose option:
split --verbose other_options file|grep -c ""
Example:
$ split --verbose -b 2 file|grep -c ""
60
# yeah, my file is pretty small, splitting on 2 bytes to produce numerous files
You can also control the names of the generated files: the PREFIX is given as a positional argument to split and -a sets the suffix length (-l sets the number of lines per output file), so giving your pieces a distinctive prefix makes them easy to count afterwards, as sketched below.
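A minimal sketch of that idea, assuming bash and a prefix (part_ here) that no other file in the directory uses:
split -a 2 bigfile part_        # produces part_aa, part_ab, ...
parts=(part_??)                 # glob matches only the pieces just created
echo "Split produced ${#parts[@]} files"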

How to remove the path part from a list of files and copy it into another file?

I need to accomplish the following things with bash scripting in FreeBSD:
Create a directory.
Generate 1000 unique files whose names are taken from other random files in the system.
Each file must contain information about the original file whose name it has taken - name and size without the original contents of the file.
The script must show information about the speed of its execution in ms.
What I could accomplish was to take the names and paths of 1000 unique files with the commands find and grep and put them in a list. Then I just can't imagine how to remove the path part and create the files in the other directory with names taken from the list of random files. I tried a for loop with the basename command in it, but somehow I can't get it to work, and I don't know how to do the other tasks either...
[Update: I've wanted to come back to this question to try to make my response more useful and portable across platforms (OS X is a Unix!) and $SHELLs, even though the original question specified bash and zsh. Other responses assumed a temporary file listing of "random" file names since the question did not show how the list was constructed or how the selection was made. I show one method for constructing the list in my response using a temporary file. I'm not sure how one could randomize the find operation "inline" and hope someone else can show how this might be done (portably). I also hope this attracts some comments and critique: you never can know too many $SHELL tricks. I removed the perl reference, but I hereby challenge myself to do this again in perl and - because perl is pretty portable - make it run on Windows. I will wait a while for comments and then shorten and clean up this answer. Thanks.]
Creating the file listing
You can do a lot with GNU find(1). The following would create a single file with the file names and three, tab-separated columns of the data you want (name of file, location, size in kilobytes).
find / -type f -fprintf tmp.txt '%f\t%h/%f\t%k \n'
I'm assuming that you want to be random across all filenames (i.e. no links) so you'll grab the entries from the whole file system. I have 800000 files on my workstation but a lot of RAM, so this doesn't take too long to do. My laptop has ~ 300K files and not much memory, but creating the complete listing still only took a couple minutes or so. You'll want to adjust by excluding or pruning certain directories from the search.
A nice thing about the -fprintf flag is that it seems to take care of spaces in file names. By examining the file with vim and sed (i.e. looking for lines with spaces) and comparing the output of wc -l and uniq you can get a sense of your output and whether the resulting listing is sane or not. You could then pipe this through cut, grep or sed, awk and friends in order to create the files in the way you want. For example, from the shell prompt:
~/# touch `cat tmp.txt |cut -f1`
~/# for i in `cat tmp.txt|cut -f1`; do cat tmp.txt | grep $i > $i.dat ; done
I'm giving the files we create a .dat extension here to distinguish them from the files to which they refer, and to make it easier to move them around or delete them; you don't have to do that: just leave off the extension ($i > $i).
The bad thing about the -fprintf flag is that it is only available with GNU find and is not a POSIX standard flag so it won't be available on OS X or BSD find(1) (though GNU find may be installed on your Unix as gfind or gnufind). A more portable way to do this is to create a straight up list of files with find / -type f > tmp.txt (this takes about 15 seconds on my system with 800k files and many slow drives in a ZFS pool. Coming up with something more efficient should be easy for people to do in the comments!). From there you can create the data values you want using standard utilities to process the file listing as Florin Stingaciu shows above.
#!/bin/sh
# portably get a random number (OS X, BSD, Linux and $SHELLs w/o $RANDOM)
randnum=`od -An -N 4 -D < /dev/urandom` ; echo $randnum
for file in `cat tmp.txt`
do
name=`basename $file`
size=`wc -c $file |awk '{print $1}'`
# Uncomment the next line to see the values on STDOUT
# printf "Location: $name \nSize: $size \n"
# Uncomment the next line to put data into the respective .dat files
# printf "Location: $file \nSize: $size \n" > $name.dat
done
# vim: ft=sh
If you've been following this far you'll realize that this will create a lot of files - on my workstation this would create 800k .dat files, which is not what we want! So, how to randomly select 1000 files from our listing of 800k for processing? There are several ways to go about it.
Randomly selecting from the file listing
We have a listing of all the files on the system (!). Now in order to select 1000 files we just need to randomly select 1000 lines from our listing file (tmp.txt). We can set an upper limit of the line number to select by generating a random number using the cool od technique you saw above - it's so cool and cross-platform that I have this aliased in my shell ;-) - then performing modulo division (%) on it using the number of lines in the file as the divisor. Then we just take that number and select the line in the file to which it corresponds with awk or sed (e.g. sed -n <$RANDOMNUMBER>p filelist), iterate 1000 times and presto! We have a new list of 1000 random files. Or not ... it's really slow! While looking for a way to speed up awk and sed I came across an excellent trick using dd from Alex Lines that searches the file by bytes (instead of lines) and translates the result into a line using sed or awk.
See Alex's blog for the details. My only problems with his technique came with setting the count= switch to a high enough number. For mysterious reasons (which I hope someone will explain) - perhaps because my locale is LC_ALL=en_US.UTF-8 - dd would spit incomplete lines into randlist.txt unless I set count= to a much higher number than the actual maximum line length. I think I was probably mixing up characters and bytes. Any explanations?
So after the above caveats and hoping it works on more than two platforms, here's my attempt at solving the problem:
#!/bin/sh
IFS='
'
# We create tmp.txt with
# find / -type f > tmp.txt # tweak as needed.
#
files="tmp.txt"
# Get the number of lines and maximum line length for later
bytesize=`wc -c < $files`
# wc -L is not POSIX and we need to multiply so:
linelenx10=`awk '{if(length > x) {x=length; y = $0} }END{print x*10}' $files`
# A function to generate a random number modulo the
# number of bytes in the file. We'll use this to find a
# random location in our file where we can grab a line
# using dd and sed.
genrand () {
echo `od -An -N 4 -D < /dev/urandom` ' % ' $bytesize | bc
}
rm -f randlist.txt
i=1
while [ $i -le 1000 ]
do
# This probably works but is way too slow: sed -n `genrand`p $files
# Instead, use Alex Lines' dd seek method:
dd if=$files skip=`genrand` ibs=1 count=$linelenx10 2>/dev/null |awk 'NR==2 {print;exit}'>> randlist.txt
true $((i=i+1)) # Bourne shell equivalent of $i++ iteration
done
for file in `cat randlist.txt`
do
name=`basename $file`
size=`wc -c <"$file"`
echo -e "Location: $file \n\n Size: $size" > $name.dat
done
# vim: ft=sh
What I could accomplish was to take the names and paths of 1000 unique files with the commands "find" and "grep" and put them in a list
I'm going to assume that there is a file that holds on each line a full path to each file (FULL_PATH_TO_LIST_FILE). Considering there aren't many statistics associated with this process, I omitted that part. You can add your own, however.
cd WHEREVER_YOU_WANT_TO_CREATE_NEW_FILES
for file_path in `cat FULL_PATH_TO_LIST_FILE`
do
## This extracts only the file name from the path
file_name=`basename $file_path`
## This grabs the files size in bytes
file_size=`wc -c < $file_path`
## Create the file and place info regarding original file within new file
echo -e "$file_name \nThis file is $file_size bytes "> $file_name
done
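Since the original task also asks for the execution time in ms (the statistics the answer above leaves out), one way to add it is to bracket the loop with timestamps. A sketch assuming a date that supports %N for nanoseconds (GNU date; stock FreeBSD date does not, so you may need gdate from coreutils or another timer there):
start=$(date +%s%N)                              # nanoseconds since the epoch
# ... the for loop above goes here ...
end=$(date +%s%N)
echo "Execution time: $(( (end - start) / 1000000 )) ms"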
