Snakemake rules - bioinformatics

I want to use Snakemake to build a bioinformatics pipeline. I have googled and read the documentation and other material, but I still don't know how to get it to work.
Here are some of my raw data files.
Rawdata/010_0_bua_1.fq.gz, Rawdata/010_0_bua_2.fq.gz
Rawdata/11_15_ap_1.fq.gz, Rawdata/11_15_ap_2.fq.gz
(They are all paired files.)
Here is my align.snakemake
from os.path import join

STAR_INDEX = "/app/ref/ensembl/human/idx/"
SAMPLE_DIR = "Rawdata"
SAMPLES, = glob_wildcards(SAMPLE_DIR + "/{sample}_1.fq.gz")
R1 = '{sample}_1.fq.gz'
R2 = '{sample}_2.fq.gz'

rule alignment:
    input:
        r1 = join(SAMPLE_DIR, R1),
        r2 = join(SAMPLE_DIR, R2)
    params:
        STAR_INDEX = STAR_INDEX
    output:
        "Align/{sample}.bam"
    message:
        "--- mapping STAR ---"
    shell:
        """
        mkdir -p Align/{wildcards.sample}
        STAR --genomeDir {params.STAR_INDEX} --readFilesCommand zcat --readFilesIn {input.r1} {input.r2} --outFileNamePrefix Align/{wildcards.sample}/log
        """
This is it. I run this file with "snakemake -np -s align.snakemake" and I get this error:
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
I am sorry to ask this question when so many people are using the tool without trouble. Any help would be really appreciated. Sorry for my English.
P.S. I read the official documentation and tutorial but still had no idea.

Oh, I solved it. Here is my answer to my own question, for anyone who might want some help:
from os.path import join

STAR_INDEX = "/app/ref/ensembl/human/idx/"
SAMPLE_DIR = "Rawdata"
SAMPLES, = glob_wildcards(SAMPLE_DIR + "/{sample}_1.fq.gz")
R1 = '{sample}_1.fq.gz'
R2 = '{sample}_2.fq.gz'

rule all:
    input:
        expand("Align/{sample}/Aligned.toTranscriptome.out.bam", sample=SAMPLES)

rule alignment:
    input:
        r1 = join(SAMPLE_DIR, R1),
        r2 = join(SAMPLE_DIR, R2)
    params:
        STAR_INDEX = STAR_INDEX
    output:
        "Align/{sample}/Aligned.toTranscriptome.out.bam"
    threads:
        8
    message:
        "--- Mapping STAR ---"
    shell:
        """
        mkdir -p Align/{wildcards.sample}
        STAR --genomeDir {params.STAR_INDEX} --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN {threads} --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD @HD VN:1.4 SO:unsorted --readFilesCommand zcat --readFilesIn {input.r1} {input.r2} --outFileNamePrefix Align/{wildcards.sample}/
        """

Related

Accessing Snakemake Config Samples

I have a rule that needs to take 2 samples and combine them.
This is how my samples look in my config file:
samples:
  group1:
    sra1:
      sample: "SRR14724462"
      cell_line: "NA24385"
      exome_bedfile: "/bedfiles/truseq.sorted.bed"
    sra2:
      sample: "SRR14724472"
      cell_line: "NA24385"
      exome_bedfile: "/bedfiles/idt.sorted.bed"
  group2:
    sra1:
      sample: "SRR14724463"
      cell_line: "NA12878"
      exome_bedfile: "/bedfiles/truseq.sorted.bed"
    sra2:
      sample: "SRR14724473"
      cell_line: "NA12878"
      exome_bedfile: "/bedfiles/idt.sorted.bed"
Essentially I want to combine the sra1 samples from the two groups together, and the sra2 samples from the two groups together, into these combinations:
SRR14724462 and SRR14724463
SRR14724472 and SRR14724473
This is my rule and rule all:
rule combine:
    output:
        r1 = TRIMMED_DIR + "/{sample1}_{sample2}_R1.fastq",
        r2 = TRIMMED_DIR + "/{sample1}_{sample2}_R2.fastq"
    params:
        trimmed_dir = TRIMMED_DIR,
        a = "{sample1}",
        b = "{sample2}"
    shell:
        """
        cd {params.trimmed_dir}
        /combine.sh {params.a}_R1_trimmed.fastq {params.a}_R2_trimmed.fastq {params.b}_R1_trimmed.fastq {params.b}_R2_trimmed.fastq
        """

rule all:
    input:
        expand(TRIMMED_DIR + "/{sample1}_{sample2}_R1.fastq", sample1=list_a, sample2=list_b),
        expand(TRIMMED_DIR + "/{sample1}_{sample2}_R2.fastq", sample1=list_a, sample2=list_b)
This works EXCEPT it does these combinations:
SRR14724462 and SRR14724463
SRR14724462 and SRR14724473
SRR14724472 and SRR14724463
SRR14724472 and SRR14724473
I only want these combinations:
SRR14724462 and SRR14724463
SRR14724472 and SRR14724473
Note: Not shown is how I got list_a and list_b, but essentially they are:
list_a = ['SRR14724462', 'SRR14724472']
list_b = ['SRR14724463', 'SRR14724473']
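The cross-product comes from expand() itself: by default it combines the given lists with the product function, but it accepts a combinator as its second positional argument, and passing the built-in zip pairs the lists element-wise instead. A minimal sketch of rule all using the question's list_a and list_b:

rule all:
    input:
        # zip pairs list_a[0] with list_b[0], list_a[1] with list_b[1], ...
        expand(TRIMMED_DIR + "/{sample1}_{sample2}_R1.fastq", zip, sample1=list_a, sample2=list_b),
        expand(TRIMMED_DIR + "/{sample1}_{sample2}_R2.fastq", zip, sample1=list_a, sample2=list_b)

This requests only the SRR14724462_SRR14724463 and SRR14724472_SRR14724473 targets.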

WildcardError in Snakefile

I've been trying to run the following bioinformatics script:
configfile: "config.yaml"

WORK_TRIM = config["WORK_TRIM"]
WORK_KALL = config["WORK_KALL"]

rule all:
    input:
        expand(WORK_KALL + "quant_result_{condition}", condition=config["conditions"])

rule kallisto_quant:
    input:
        fq1 = WORK_TRIM + "{sample}_1_trim.fastq.gz",
        fq2 = WORK_TRIM + "{sample}_2_trim.fastq.gz",
        idx = WORK_KALL + "Homo_sapiens.GRCh38.cdna.all.fa.index"
    output:
        WORK_KALL + "quant_result_{condition}"
    shell:
        "kallisto quant -i {input.idx} -o {output} {input.fq1} {input.fq2}"
However, I keep obtaining an error like this:
WildcardError in line 13 of /home/user/directory/Snakefile:
Wildcards in input files cannot be determined from output files:
'sample'
Just to explain briefly: kallisto quant will produce 3 outputs: abundance.h5, abundance.tsv and run_info.json. Each of those files needs to be sent to its own newly created condition directory. I am not getting exactly what is going wrong. I'd appreciate any help on this.
If you think about it, you are not giving snakemake enough information.
Say "condition" is either "control" or "treated" with samples "C" and "T", respectively. You need to tell snakemake about the association control: C, treated: T. You could do this using functions-as-input files or lambda functions. For example:
cond2samp = {'control': 'C', 'treated': 'T'}

rule all:
    input:
        expand("quant_result_{condition}", condition=cond2samp.keys())

rule kallisto_quant:
    input:
        fq1 = lambda wc: "%s_1_trim.fastq.gz" % cond2samp[wc.condition],
        fq2 = lambda wc: "%s_2_trim.fastq.gz" % cond2samp[wc.condition],
        idx = "Homo_sapiens.GRCh38.cdna.all.fa.index"
    output:
        "quant_result_{condition}"
    shell:
        "kallisto quant -i {input.idx} -o {output} {input.fq1} {input.fq2}"

Julia, plotting histograms from a series of files in a loop (list)

Some time ago, I moved some files from a directory and copied them to a new one, renaming each file while copying. I used bash for that, something like this:
for i in 8 12 16 20; do
  for j in 0.00 0.01; do
    cp tera.dat new/tera$i$j.dat
...
Inside each tera$i$j.dat there is numeric data in 2 columns. I would like to plot the data as a histogram, and save each image (if possible in a loop) for each file tera$i$j.dat. But unfortunately, I do not have much idea how to do this.
I managed to load the data into several tables using de = readtable.(filter(r"tera", readdir()), separator=' ', header=false). But I cannot create a plot for each dataset :(
So far I have managed to read the file and create a histogram plot for one file at a time; how do I do it in a loop? This is the code I am using:
using Plots, StatPlots, Distributions, DataFrames, PlotlyJS, LaTeXStrings
theme(:sand)
chp = pwd()
pe = readtable("tera120.00.dat", separator=' ', header=false)
gr()
histogram(pe[2], nbins=1000)
...
Thanks in advance :)
Untested, but it should be something like this:
using StatPlots, DataFrames
theme(:sand)
gr()
for i in [8, 12, 16, 20], j in [0, 0.01]
    # Build a filename like "tera80.00" or "tera120.01", matching your naming scheme
    fn = @sprintf "tera%d%.2f" i j
    pe = readtable(fn * ".dat", separator=' ', header=false)
    p = histogram(pe[2], bins=1000)
    savefig(p, fn * ".png")
end
Note that when you plot with StatPlots, you don't need using Plots or using PlotlyJS; I also don't see why you need Distributions or LaTeXStrings.

Improve genbank feature addition

I am trying to add more than 70000 new features to a genbank file using biopython.
I have this code:
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation

fi = "myoriginal.gbk"
fo = "mynewfile.gbk"

for result in results:
    start = 0
    end = 0
    result = result.split("\t")
    start = int(result[0])
    end = int(result[1])
    for record in SeqIO.parse(fi, "gb"):
        record.features.append(SeqFeature(FeatureLocation(start, end), type="misc_feat"))
        SeqIO.write(record, fo, "gb")
results is just a list of tab-separated strings containing the start and end of each of the features I need to add to the original gbk file.
This solution is extremely costly for my computer and I do not know how to improve the performance. Any good ideas?
You should parse the genbank file just once. Setting aside what results contains (I do not know exactly, because some pieces of code are missing from your example), I would guess something like this modification of your code would improve performance:
fi = "myoriginal.gbk"
fo = "mynewfile.gbk"
original_records = list(SeqIO.parse(fi, "gb"))
for result in results:
result = result.split("\t")
start = int(result[0])
end = int(result[1])
for record in original_records:
record.features.append(SeqFeature(FeatureLocation(start, end), type = "misc_feat"))
SeqIO.write(record, fo, "gb")
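If performance still matters, one further tweak along the same lines (untested, same assumptions about results as above): since the same features go to every record, the SeqFeature objects can be built once and shared, and list.extend() then adds all of them in a single call per record:

from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation

fi = "myoriginal.gbk"
fo = "mynewfile.gbk"

# Build the list of new SeqFeature objects once, outside the per-record loop.
new_features = []
for result in results:
    start, end = result.split("\t")[:2]
    new_features.append(SeqFeature(FeatureLocation(int(start), int(end)), type="misc_feat"))

records = list(SeqIO.parse(fi, "gb"))
for record in records:
    # One extend() call per record instead of 70000 append() calls.
    record.features.extend(new_features)
SeqIO.write(records, fo, "gb")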

Extracting plain text output from binary file

I am working with Graphchi's pagerank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy
The example app writes a binary file with vertex information that I would like to read/convert to a plain text file (to later load into R or some other language).
The documentation states that:
"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."
The 'easy to handle' part is what I'm struggling with - I have experience with high-level languages but not with C++ or with binary files. I have found a few things by searching Stack Overflow but no luck yet in reading this file. Ideally this would be done in bash or Python.
Thanks very much for your help on this.
Update: hexdump graph-name.4B.vout | head -5 gives:
0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code showing how you can use GraphChi to write the output out as strings:
https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators
But the file is just a simple byte array. Here is an example of how to read it in Python:
import struct
import sys

# Read the whole binary file as raw bytes.
inputfile = sys.argv[1]
with open(inputfile, "rb") as f:
    data = f.read()

print("%d bytes" % len(data))

# The file is a flat array of 4-byte floats, one value per vertex.
n = len(data) // 4
for i in range(n):
    x = struct.unpack_from("f", data, i * 4)[0]
    print("%d %f" % (i, x))
I was having the same trouble. Luckily I work with a bunch of network engineers who helped me out! On Mac/Linux, the following command prints the 4B.vout data one line per node, with the integer values matching those given in the summary file. If your file is called e.g. filename.4B.vout, then some command-line perl gets you:
cat filename.4B.vout | LANG= perl -0777 -e '$,="\n"; print unpack("L*",<>),"";'
Edited to add: this is for the assignments of connected-component ID and community ID, written implicitly: the 1st line is the value for the node labeled 0, the 2nd line is for the node labeled 1, etc. But I am copy-pasting here, so I'm not sure how it would need to change for floats. It works great for the integer values per node.
