How to combine all chromosomes into a single file - bioinformatics

I downloaded 1000 Genomes data (chromosomes 1-22), which is in VCF format. How can I combine all chromosomes into a single file? Should I first convert each chromosome into PLINK binary files and then use --bmerge / --merge-list? Or is there any other way to combine them? Any suggestions?

You could use PLINK as you suggest. You can also use BCFtools:
https://samtools.github.io/bcftools/bcftools.html
Specifically, the concat command:
bcftools concat ALL.chr{1..22}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -Oz -o ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
If you use PLINK, you will likely run into issues with 1000 Genomes data: it contains multi-allelic SNPs, which PLINK cannot represent, and it also contains SNPs that share the same RS identifier, which PLINK rejects.
To get PLINK to work, you will need to resolve these issues by splitting multi-allelic SNPs into multiple records and by removing records with duplicate RS identifiers (or giving them new, unique identifiers).
Moreover, PLINK binary PED does not support genotype probabilities. I do not recall whether 1000 Genomes includes this type of information, but if it does and you want to retain it, you cannot convert to binary PED, as the genotype probabilities will be hard-called; see:
https://www.cog-genomics.org/plink2/input
Specifically:
Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are normally treated as missing, and the rest are treated as hard calls.
So, if you plan to retain VCF format for the output, I recommend against using PLINK.
EDIT
Here is a method to convert VCF to PLINK:
To build PLINK-compatible files from the VCF files, duplicate positions and SNP IDs need to be merged or removed. Here I opt to remove all duplicate entries. First, catalogue the duplicate SNP IDs:
grep -v '^#' <(zcat ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) | cut -f 3 | sort | uniq -d > ${chrom}.dups
Using BCFtools, split the multi-allelic SNPs, and using PLINK, remove the duplicate SNP IDs found in the previous step:
bcftools norm -d both -m -any -Ob ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | plink --bcf /dev/stdin --make-bed --out ${chrom} --allow-extra-chr 0 --memory 6000 --exclude ${chrom}.dups
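Once each chromosome has been converted this way (for example by running the two commands above inside a for chrom in {1..22} loop), the per-chromosome filesets can be combined with PLINK's --merge-list, which answers the original question. A minimal sketch, assuming the filesets are named 1 through 22 as produced by --out ${chrom}:
printf '%s\n' {2..22} > merge.list   # filesets to merge into chromosome 1
plink --bfile 1 --merge-list merge.list --make-bed --out all_autosomes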
Importantly, this is not the only way to resolve these issues when converting VCF to PLINK. For instance, you can assign unique identifiers to the duplicate RS IDs instead of removing them.
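As a hedged sketch of that alternative (it uses bcftools annotate and placeholder file names, and is not part of the recipe above), you can overwrite the ID column with a unique CHROM:POS:REF:ALT key before conversion:
# run this after splitting multi-allelics so %ALT is a single allele per record
bcftools annotate --set-id '%CHROM:%POS:%REF:%ALT' -Oz -o ${chrom}.uniq_ids.vcf.gz ${chrom}.split.vcf.gz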

picard GatherVcfs https://broadinstitute.github.io/picard/command-line-overview.html
Gathers multiple VCF files from a scatter operation into a single VCF file. Input files must be supplied in genomic order and must not have events at overlapping positions.

Related

Is there any faster way to grep billions of mismatch patterns in more than one file?

I wrote a script that calculates all possible mismatch patterns (depending on the case), like the two below (please look at the grep command), and writes them to a shell script with billions of lines like this one:
LC_ALL=C grep -ch "AAAAAAAC[A-Z][A-Z][A-Z][A-Z]CGA[A-Z][A-Z]G\|C[A-Z][A-Z]TCG[A-Z][A-Z][A-Z][A-Z]GTTTTTTT" regions_A regions_B
The next step is to execute all these billions of grep lines and write the output.
To run it as fast as I can, I restrict matching to plain ASCII (all my characters are ASCII) by setting LC_ALL=C. Moreover, I split the huge file of grep commands into 16 parts and run them separately on 16 threads.
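For illustration, here is a sketch of that splitting setup, assuming GNU split is available and the generated file of grep commands is called all_greps.sh (the name is hypothetical):
split -n l/16 all_greps.sh chunk_                   # 16 pieces, never breaking a line (GNU split)
for f in chunk_*; do bash "$f" > "$f.out" & done    # run the 16 pieces in parallel
wait                                                # wait for all background jobs to finish
cat chunk_*.out > all_counts.txt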
Does anybody know any faster method to grep my patterns?
Any help would be appreciated.
Thank you in advance!

column-wise merging of multiple files in specific order

I want to perform column-wise merging of multiple files, taking the file names in increasing numeric order. To be specific, I have renamed 163 files as 1.lrr, 2.lrr, 3.lrr, ..., 163.lrr, and I used the following command to merge them:
paste -d "\t" *.lrr > all_samples.lrr
However, it combined the columns in an unexpected order of file names: it started merging with the file 100.lrr instead of 1.lrr, and then took the columns from files 101.lrr through 109.lrr, and so on. Is it possible to modify this command so that it sorts the file names numerically while merging the columns?
Try this:
paste $(ls | grep -E '\.lrr$' | sort -n) > all_samples.lrr
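An equivalent approach, assuming GNU coreutils, is to let ls sort the names numerically itself:
# -v sorts 1.lrr, 2.lrr, ..., 10.lrr, ... in natural (version) order
paste $(ls -v *.lrr) > all_samples.lrr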

Using the split command in a shell, how to know the number of files generated

I am using the split command on a large file to generate smaller files, which are put into a folder. My problem is that the folder already contains other files besides the ones produced by my split.
I would like to know whether there is a way to count only the files generated by my split, not all the files in the folder.
My command is split a 2 d. Is there an option I can add to this command to find that out?
I know that ls -Al | wc -l gives me the number of files in the folder, but that is not what I want.
The simplest solution here is to split into a fresh directory.
Assuming that's not possible, and you aren't worried about other processes operating on the directory in question, you can just count the files before and after. Something like this:
$ before=(*)
$ split a 2 d
$ after=(*)
$ echo "Split files: $(( ${#after[@]} - ${#before[@]} ))"
If the other files in the directory can't have the same format as the split files (and presumably they can't, or split would fail or overwrite them), then you could use an appropriate glob to get just the files that match the pattern, something like splitfiles=(d??), and count them with ${#splitfiles[@]}.
Failing that, you could see whether the --verbose option to split lets you use split_count=$(split --verbose a 2 d | wc -l) or similar.
To be different, I will count the lines with grep, utilizing the --verbose option:
split --verbose other_options file|grep -c ""
Example:
$ split --verbose -b 2 file|grep -c ""
60
# yeah, my file is pretty small, splitting on 2 bytes to produce numerous files
You can also control how the generated files are named: split's second operand sets the prefix, -a sets the suffix length, and -d switches to numeric suffixes, which makes the new files easy to match with a glob and count.
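For instance, a minimal sketch with GNU split (bigfile.txt and part_ are placeholder names):
split -l 1000 -d -a 2 bigfile.txt part_   # 1000-line chunks named part_00, part_01, ...
ls part_* | wc -l                         # counts only the files this split produced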

Is it possible to split a huge text file (based on number of lines) while unpacking a .tar.gz archive, if I cannot extract that file as a whole?

I have a .tar.gz file. It contains one 20 GB text file with 20.5 million lines. I cannot extract this file as a whole and save it to disk. I must do one of the following:
Specify a number of lines per output file, say 1 million, and get 21 files. This would be the preferred option.
Extract a part of that file based on line numbers, say from 1000001 to 2000000, to get a file with 1M lines. I would have to repeat this step 21 times with different parameters, which is very bad.
Is it possible at all?
This answer - bash: extract only part of tar.gz archive - describes a different problem.
To extract a file from f.tar.gz and split it into files, each with no more than 1 million lines, use:
tar Oxzf f.tar.gz | split -l1000000
The above will name the output files using split's default scheme (xaa, xab, ...). If you prefer the output files to be named prefix.nn, where nn is a sequence number, then use:
tar Oxzf f.tar.gz |split -dl1000000 - prefix.
Under this approach:
The original file is never written to disk. tar reads from the .tar.gz file and pipes its contents to split which divides it up into pieces before writing the pieces to disk.
The .tar.gz file is read only once.
split, through its many options, has a great deal of flexibility.
Explanation
For the tar command:
O tells tar to send the output to stdout. This way we can pipe it to split without ever having to save the original file on disk.
x tells tar to extract the file (as opposed to, say, creating an archive).
z tells tar that the archive is in gzip format. On modern tars, this is optional.
f tells tar to use, as input, the file name specified.
For the split command:
-l tells split to split files limited by number of lines (as opposed to, say, bytes).
-d tells split to use numeric suffixes for the output files.
- tells split to read its input from stdin.
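As a quick sanity check after the split (assuming the prefix. naming from the second command above):
ls prefix.* | wc -l    # number of pieces produced
cat prefix.* | wc -l   # total line count; should match the original file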
You can use the --to-stdout (or -O) option in tar to send the output to stdout.
Then use sed to specify which set of lines you want.
#!/bin/bash
l=1
inc=1000000
p=1
while test $l -lt 21000000; do
    e=$(($l + $inc - 1))
    # re-extract the member to stdout and keep only lines $l..$e of it
    tar --to-stdout -xzf myfile.tar.gz file-to-extract.txt |
        sed -n -e "$l,$e p" > part$p.txt
    l=$(($l + $inc))
    p=$(($p + 1))
done
Here's a pure Bash solution for option #1, automatically splitting lines into multiple output files.
#!/usr/bin/env bash
set -eu
mkdir -p out           # directory for the chunk files
filenum=1
chunksize=1000000
ii=0
while IFS= read -r line
do
    if [ $ii -ge $chunksize ]
    then
        ii=0
        filenum=$(($filenum + 1))
        > out/file.$filenum          # start the next chunk file empty
    fi
    printf '%s\n' "$line" >> out/file.$filenum
    ii=$(($ii + 1))
done
This will take any lines from stdin and create files like out/file.1 with the first million lines, out/file.2 with the second million lines, etc. Then all you need is to feed the input to the above script, like this:
tar xfzO big.tar.gz | ./split.sh
This will never save any intermediate file on disk, or even in memory. It is entirely a streaming solution. It's somewhat wasteful of time, but very efficient in terms of space. It's also very portable, and should work in shells other than Bash, and on ancient systems with little change.
You can use
sed -n 1,20p /Your/file/Path
where you give the first line number and the last line number. For example, this could look like
sed -n 1,20p /Your/file/Path >> file1
You can also keep the start and end line numbers in variables and use them accordingly.
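For example, a sketch of that variable-driven version combined with the tar streaming from the other answers (archive and member names are placeholders; the final q stops sed once the range has been printed, so the rest of the stream is not scanned):
start=1000001
end=2000000
tar -xzO -f archive.tar.gz file.txt | sed -n "${start},${end}p;${end}q" > part2.txt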

construct DNA sequence based on variation and human reference

The 1000 Genomes Project provides information about the variation of thousands of people's DNA sequences relative to the human reference sequence. The variation is stored in VCF format. Basically, for each person in that project, we can get his or her DNA variation information from the VCF file, for example the type of variation (e.g. insertion/deletion or SNP) and the position of the variation relative to the reference. The reference is in FASTA format. By combining one person's variation information from the VCF file with the human reference in the FASTA file, I want to construct the DNA sequence for that person.
My question is: do tools already exist that perform this task well, or do I have to write the scripts myself?
The Perl script vcf-consensus from VCFtools seems close to what you are looking for:
vcf-consensus
Apply VCF variants to a fasta file to create consensus sequence.
Usage: cat ref.fa | vcf-consensus [OPTIONS] in.vcf.gz > out.fa
Options:
-h, -?, --help This help message.
-H, --haplotype <int> Apply only variants for the given haplotype (1,2)
-s, --sample <name> If not given, all variants are applied
Examples:
samtools faidx ref.fa 8:11870-11890 | vcf-consensus in.vcf.gz > out.fa
The answers to the question New fasta sequence from reference fasta and variant calls file? posted on Biostar might also help.
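For a whole-genome consensus of a single sample, a minimal sketch (assuming the VCF is bgzip-compressed and tabix-indexed first; SAMPLE_ID is a placeholder for a sample name from the VCF header):
bgzip in.vcf && tabix -p vcf in.vcf.gz                          # compress and index the VCF
cat ref.fa | vcf-consensus -s SAMPLE_ID in.vcf.gz > SAMPLE_ID_consensus.fa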
You can use bcftools (https://github.com/samtools/bcftools) to perform this task:
bcftools consensus <file.vcf> \
--fasta-ref <file> \
--iupac-codes \
--output <file> \
--sample <name>
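Note that bcftools consensus reads a bgzip-compressed and indexed VCF, so you may need to prepare the file first. A minimal sketch, with file.vcf and NA12878 as placeholder names:
bgzip file.vcf                # produces file.vcf.gz
bcftools index file.vcf.gz    # index required by bcftools consensus
bcftools consensus --fasta-ref ref.fa --sample NA12878 file.vcf.gz > NA12878_consensus.fa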
To install bcftools:
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
sudo cp bcftools/bcftools /usr/local/bin/
You can also combine bcftools consensus with samtools faidx (http://www.htslib.org/) to extract specific intervals from the fasta file. See bcftools consensus for more information:
About: Create consensus sequence by applying VCF variants to a reference
fasta file.
Usage: bcftools consensus [OPTIONS] <file.vcf>
Options:
-f, --fasta-ref <file> reference sequence in fasta format
-H, --haplotype <1|2> apply variants for the given haplotype
-i, --iupac-codes output variants in the form of IUPAC ambiguity codes
-m, --mask <file> replace regions with N
-o, --output <file> write output to a file [standard output]
-c, --chain <file> write a chain file for liftover
-s, --sample <name> apply variants of the given sample
Examples:
# Get the consensus for one region. The fasta header lines are then expected
# in the form ">chr:from-to".
samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa
For anyone still coming to this page: if you have a FASTA reference genome and a BAM file that you want to turn into a reference-like consensus sequence (applying SNPs and Ns), you may try this one-liner using samtools, bcftools and vcfutils.pl. (A note for beginners: both samtools and bcftools can be compiled on a computing cluster or under Linux; if so, just prepend the install location to each program name. vcfutils.pl is a Perl script that ships with bcftools.)
samtools mpileup -d8000 -q 20 -Q 10 -uf REFERENCE.fasta Your_File.bam | bcftools call -c | vcfutils.pl vcf2fq > OUTPUT.fastq
-d, --max-depth: maximum number of reads considered per position (here 8000)
-q, --min-MQ: minimum mapping quality for an alignment to be used
-Q, --min-BQ: minimum base quality for a base to be considered
(You can use different values of course; see http://www.htslib.org/doc/samtools.html)
This generates a format that looks like FASTQ but isn't quite, so you can't convert it with a standard converter; instead you can use the following sed command, which I wrote specifically for this output:
sed -i -e '/^+/,/^\#/{/^+/!{/^\#/!d}}; /^+/ d; s/#/>/g' OUTPUT.fastq
In the end, make sure to compare your new fasta files to your reference to be sure that everything is fine.
EDIT: BE CAREFUL WITH THE SED COMMAND; IT MAY DELETE SOME OF YOUR READS IF YOUR QUALITY SCORING DIFFERS FROM THE CASE I HAD.
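As an alternative to the sed command (seqtk is my suggestion here, not part of the original recipe), seqtk can convert the vcf2fq output to FASTA, optionally masking low-quality bases; a hedged sketch:
# -a converts FASTQ to FASTA; -q and -n can additionally mask low-quality bases as N
seqtk seq -a OUTPUT.fastq > OUTPUT.fasta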
