NameError in Snakemake dryrun mode - bioinformatics

I am new to Snakemake and am trying to develop some pipelines. I run into problems when I use wildcards to automate my bioinformatics analyses as much as possible, especially once the pipeline becomes more complex (as shown below). It looks like Snakemake does not resolve the wildcards correctly. During a dry run of the Snakefile, the wildcard values look correct for some rules. However, the same wildcards lead to an error in a different step (rule) of the pipeline, and I cannot figure out why. Below I provide the code and the output of a dry run.
num=["327905-LR-41624_normal","327907-LR-41624_tumor"]
num_normal=["327905-LR-41624"]
num_tumor=["327907-LR-41624"]
path="/path/to/Snakemake/"
genome="/path/to/references_genome/Mus_musculus.GRCm38.dna_rm.toplevel.fa"
rule all:
    input:
        expand("/path/to/Snakemake/AS-{num_tum}_tumor_no_dupl_sort_RG_LB.bam",num_tum=num_tumor),
        expand("/path/to/Snakemake/AS-{num_norm}_normal_no_dupl_sort_RG_LB.bam",num_norm=num_normal)
ruleorder: samtools_sort > remove_duplicates > samtools_index #> add_readgroup_tumor > add_readgroup_normal
rule trim_galore:
    input:
        r1="/path/to/Snakemake/AS-{num}_R1.fastq",
        r2="/path/to/Snakemake/AS-{num}_R2.fastq"
    output:
        "/path/to/Snakemake/AS-{num }_R1_val_1.fq",
        "/path/to/Snakemake/AS-{num }_R2_val_2.fq"
    shell:
        "module load trim-galore/0.5.0 ; module load pypy/2.7-6.0.0 ; trim_galore --output_dir /path/to/Snakemake/ --paired {input.r1} {input.r2} "
rule bwa_mem:
    input:
        R1="/path/to/Snakemake/AS-{num}_R1_val_1.fq",
        R2="/path/to/Snakemake/AS-{num}_R2_val_2.fq"
    output:
        "/path/to/Snakemake/AS-{num}.bam"
    shell:
        "module load samtools/default ; module load bwa/0.7.8 ; bwa mem {genome} {input.R1} {input.R2} | samtools view -h -b > {output} "
rule samtools_sort:
    input:
        "/path/to/Snakemake/AS-{num}.bam"
    output:
        "/path/to/Snakemake/AS-{num}_sort.bam"
    shell:
        "module load samtools/default ; samtools sort -n -O BAM {input} > {output} "
rule remove_duplicates:
    input:
        "/path/to/Snakemake/AS-{num}_sort.bam"
    output:
        outbam="/path/to/Snakemake/AS-{num}_no_dupl_sort.bam",
        metrics="/path/to/Snakemake/AS-{num}_dupl_metrics.txt"
    shell:
        "module load gatk/4.0.9.0 ; gatk MarkDuplicates -I {input} -O {output.outbam} -M {output.metrics} --REMOVE_DUPLICATES=true "
rule samtools_index:
    input:
        "/path/to/Snakemake/AS-{num}_no_dupl_sort.bam"
    output:
        "/path/to/Snakemake/AS-{num}_no_dupl_sort.bam.bai"
    shell:
        "module load samtools/default ; samtools index {input} "
rule add_readgroup_normal:
    input:
        "/path/to/Snakemake/AS-{num_normal}_normal_no_dupl_sort.bam"
    output:
        "/path/to/Snakemake/AS-{num_normal}_normal_no_dupl_sort_RG_LB.bam"
    shell:
        "module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { num_normal } -PU { num_normal } -SM NORMAL -I { input } -O {output} "
rule add_readgroup_tumor:
    input:
        "/path/to/Snakemake/AS-{num_tumor}_tumor_no_dupl_sort.bam"
    output:
        "/path/to/Snakemake/AS-{num_tumor}_tumor_no_dupl_sort_RG_LB.bam"
    shell:
        "module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { num_tumor } -PU { num_tumor } -SM TUMOR -I { input } -O {output} "
When I test the Snakefile with the command:
.local/bin/snakemake -s Snakefile_pipeline --dryrun
I get the following:
Building DAG of jobs...
Job counts:
    count   jobs
    1       add_readgroup_normal
    1       add_readgroup_tumor
    1       all
    2       bwa_mem
    2       remove_duplicates
    2       samtools_sort
    2       trim_galore
    11
[Mon Apr 8 16:14:27 2019]
rule trim_galore:
    input: /path/to/Snakemake/AS-327907-LR-41624_tumor_R1.fastq, /path/to/Snakemake/AS-327907-LR-41624_tumor_R2.fastq
    output: /path/to/Snakemake/AS-327907-LR-41624_tumor_R1_val_1.fq, /path/to/Snakemake/AS-327907-LR-41624_tumor_R2_val_2.fq
    jobid: 9
    wildcards: num=327907-LR-41624_tumor
[Mon Apr 8 16:14:27 2019]
rule trim_galore:
    input: /path/to/Snakemake/AS-327905-LR-41624_normal_R1.fastq, /path/to/Snakemake/AS-327905-LR-41624_normal_R2.fastq
    output: /path/to/Snakemake/AS-327905-LR-41624_normal_R1_val_1.fq, /path/to/Snakemake/AS-327905-LR-41624_normal_R2_val_2.fq
    jobid: 10
    wildcards: num=327905-LR-41624_normal
[Mon Apr 8 16:14:27 2019]
rule bwa_mem:
    input: /path/to/Snakemake/AS-327905-LR-41624_normal_R1_val_1.fq, /path/to/Snakemake/AS-327905-LR-41624_normal_R2_val_2.fq
    output: /path/to/Snakemake/AS-327905-LR-41624_normal.bam
    jobid: 8
    wildcards: num=327905-LR-41624_normal
[Mon Apr 8 16:14:27 2019]
rule bwa_mem:
    input: /path/to/Snakemake/AS-327907-LR-41624_tumor_R1_val_1.fq, /path/to/Snakemake/AS-327907-LR-41624_tumor_R2_val_2.fq
    output: /path/to/Snakemake/AS-327907-LR-41624_tumor.bam
    jobid: 7
    wildcards: num=327907-LR-41624_tumor
[Mon Apr 8 16:14:27 2019]
rule samtools_sort:
    input: /path/to/Snakemake/AS-327907-LR-41624_tumor.bam
    output: /path/to/Snakemake/AS-327907-LR-41624_tumor_sort.bam
    jobid: 5
    wildcards: num=327907-LR-41624_tumor
[Mon Apr 8 16:14:27 2019]
rule samtools_sort:
    input: /path/to/Snakemake/AS-327905-LR-41624_normal.bam
    output: /path/to/Snakemake/AS-327905-LR-41624_normal_sort.bam
    jobid: 6
    wildcards: num=327905-LR-41624_normal
[Mon Apr 8 16:14:27 2019]
rule remove_duplicates:
    input: /path/to/Snakemake/AS-327907-LR-41624_tumor_sort.bam
    output: /path/to/Snakemake/AS-327907-LR-41624_tumor_no_dupl_sort.bam, /path/to/Snakemake/AS-327907-LR-41624_tumor_dupl_metrics.txt
    jobid: 3
    wildcards: num=327907-LR-41624_tumor
[Mon Apr 8 16:14:27 2019]
rule remove_duplicates:
    input: /path/to/Snakemake/AS-327905-LR-41624_normal_sort.bam
    output: /path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort.bam, /path/to/Snakemake/AS-327905-LR-41624_normal_dupl_metrics.txt
    jobid: 4
    wildcards: num=327905-LR-41624_normal
[Mon Apr 8 16:14:27 2019]
rule add_readgroup_normal:
    input: /path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort.bam
    output: /path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort_RG_LB.bam
    jobid: 2
    wildcards: num_normal=327905-LR-41624
RuleException in line 93 of /home/l136n/Snakefile_mapping_snv_call_pipeline2:
NameError: The name ' num_normal ' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
I have googled the error but found little help. I also double-checked the pipeline for any inconsistency. What I expect as output is indicated in the rule "all". The rules "add_readgroup_normal" and "add_readgroup_tumor" are supposed to take different subsets of input files, generated by the previous steps, which are run on all files. I wonder if the problem arises somehow because of this separation into 2 subsets.
I repeat that I am quite new to Snakemake, so I might be missing something silly somewhere! Any help would be really appreciated, as I am completely stuck!
Thank you so much in advance!
num=["327905-LR-41624_normal","327907-LR-41624_tumor"]
normal=["327905-LR-41624_normal"]
num_tumor=["327907-LR-41624_tumor"]
path="/path/to/Snakemake/"
genome="/icgc/dkfzlsdf/analysis/B210/references_genome/Mus_musculus.GRCm38.dna_rm.toplevel.fa"
rule all:
    input:
        "/path/to/Snakemake/AS-327905-LR-41624_normal_R1_val_1.fq",
        "/path/to/Snakemake/AS-327905-LR-41624_normal_R2_val_2.fq",
        "/path/to/Snakemake/AS-327907-LR-41624_tumor_R1_val_1.fq",
        "/path/to/Snakemake/AS-327907-LR-41624_tumor_R2_val_2.fq",
        "/path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort.bam.bai",
        "/path/to/Snakemake/AS-327907-LR-41624_tumor_no_dupl_sort.bam.bai",
        "/path/to/Snakemake/AS-327905-LR-41624_normal_RG.bam"
        "/path/to/Snakemake/AS-327907-LR-41624_tumor_RG.bam"
rule trim_galore:
    input:
        r1="/path/to/Snakemake/AS-{num}_R1.fastq",
        r2="/path/to/Snakemake/AS-{num}_R2.fastq"
    output:
        "/path/to/Snakemake/AS-{num }_R1_val_1.fq",
        "/path/to/Snakemake/AS-{num }_R2_val_2.fq"
    shell:
        "module load trim-galore/0.5.0 ; module load pypy/2.7-6.0.0 ; trim_galore --output_dir /path/to/Snakemake/ --paired {input.r1} {input.r2} "
rule bwa_mem:
    input:
        R1="/path/to/Snakemake/AS-{num}_R1_val_1.fq",
        R2="/path/to/Snakemake/AS-{num}_R2_val_2.fq"
    output:
        "/path/to/Snakemake/AS-{num}.bam"
    shell:
        "module load samtools/default ; module load bwa/0.7.8 ; bwa mem {genome} {input.R1} {input.R2} | samtools view -h -b > {output} "
rule samtools_sort:
    input:
        "/path/to/Snakemake/AS-{num}.bam"
    output:
        "/path/to/Snakemake/AS-{num}_sort.bam"
    shell:
        "module load samtools/default ; samtools sort -n -O BAM {input} > {output} "
rule remove_duplicates:
    input:
        "/path/to/Snakemake/AS-{num}_sort.bam"
    output:
        outbam="/path/to/Snakemake/AS-{num}_no_dupl_sort.bam",
        metrics="/path/to/Snakemake/AS-{num}_dupl_metrics.txt"
    shell:
        "module load gatk/4.0.9.0 ; gatk MarkDuplicates -I {input} -O {output.outbam} -M {output.metrics} --REMOVE_DUPLICATES=true "
rule samtools_index:
    input:
        "/path/to/Snakemake/AS-{num}_no_dupl_sort.bam"
    output:
        "/path/to/Snakemake/AS-{num}_no_dupl_sort.bam.bai"
    shell:
        "module load samtools/default ; samtools index {input} "
rule add_readgroup_normal:
    input:
        "/path/to/Snakemake/AS-{normal}_no_dupl_sort.bam"
    output:
        "/path/to/Snakemake/AS-{normal}_RG.bam"
    shell:
        "module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { wildcards.normal } -PU { wildcards.normal } -SM NORMAL -I { input } -O {output} "
rule add_readgroup_tumor:
    input:
        "/path/to/Snakemake/AS-{num}_no_dupl_sort.bam"
    output:
        "/path/to/Snakemake/AS-{num_,'.*tumor.*'}_RG.bam"
    shell:
        "module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { wildcards.num } -PU { wildcards.num } -SM TUMOR -I { input } -O {output} "
Error:
Building DAG of jobs...
MissingInputException in line 37 of /home/l136n/Snakefile_mapping_snv_call_pipeline2b1:
Missing input files for rule trim_galore:
/path/to/Luca/Snakemake/AS-327905-LR-41624_normal_RG.bam/path/to/Luca/Snakemake/AS-327907-LR-41624_tumor_RG_R1.fastq
/path/to/Snakemake/AS-327905-LR-41624_normal_RG.bam/path/to/Luca/Snakemake/AS-327907-LR-41624_tumor_RG_R2.fastq
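One detail of the updated Snakefile is consistent with the run-together file names in this message: in rule all there is no comma after the "..._normal_RG.bam" entry. Python joins adjacent string literals into a single string, so the two final targets would become one fused path. A minimal Python sketch of that behavior, using the two paths from the post (the variable names are made up for illustration):

```python
# Adjacent string literals without a comma are concatenated by Python.
targets_missing_comma = [
    "/path/to/Snakemake/AS-327905-LR-41624_normal_RG.bam"  # <- no comma here
    "/path/to/Snakemake/AS-327907-LR-41624_tumor_RG.bam",
]

targets_with_comma = [
    "/path/to/Snakemake/AS-327905-LR-41624_normal_RG.bam",
    "/path/to/Snakemake/AS-327907-LR-41624_tumor_RG.bam",
]

print(len(targets_missing_comma))  # number of entries in the fused list
print(targets_missing_comma[0])    # the single run-together path
print(len(targets_with_comma))     # number of entries with the comma in place
```

If that is what happened here, Snakemake would try to build the fused path as one target, which would explain why a rule such as trim_galore is asked for an input file that cannot exist.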

Wildcards are accessible in shell using the syntax {wildcards.var}, not {var}. You have the latter in rule add_readgroup_normal.
Source.

I thought I would provide the solution, even if the post is a bit old now. The error was simply due to the presence of spaces inside "{ wildcards.var }".
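Both points above can be illustrated with Python's built-in str.format, which treats the text between braces as a literal field name, spaces included. This is only an analogy: it is an assumption that Snakemake's shell formatting behaves the same way in this respect, although the quoted name ' num_normal ' in the error message is consistent with it. The SimpleNamespace object and the try_format helper are hypothetical stand-ins, not part of Snakemake.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the wildcards object of add_readgroup_normal.
wildcards = SimpleNamespace(num_normal="327905-LR-41624")

# No spaces inside the braces, and the value is reached via "wildcards.".
ok = "-LB {wildcards.num_normal} -SM NORMAL".format(wildcards=wildcards)
print(ok)


def try_format(template):
    """Return the formatted string, or the name of the exception raised."""
    try:
        return template.format(wildcards=wildcards)
    except (KeyError, AttributeError) as err:
        return type(err).__name__


# Bare wildcard name: not a known field, so the lookup fails.
bare_name = try_format("-LB {num_normal}")
# Spaces inside the braces become part of the field name, so this fails too.
with_spaces = try_format("-LB { wildcards.num_normal }")
print(bare_name, with_spaces)
```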

Related

Snakemake Ambiguity between 2 rules with similar output file type

I am trying to map 3 samples: SRR14724459, SRR14724473, and a combination of both SRR14724459_SRR14724473.
I have 2 rules that share a similar output file type (.bam), and even though I name their wildcards differently, I still get an ambiguity error:
Building DAG of jobs...
AmbiguousRuleException:
Rules map_hybrid and map are ambiguous for the file /gpfs/scratch/hpadre/snakemake_outputs/mapped_dir/SRR14724459_SRR14724473__0.9.bam.
Expected input files:
map_hybrid: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R1_trimmed_SRR14724473_R1_trimmed_0.9.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq
map: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473__0.9/SRR14724459_SRR14724473__0.9_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473__0.9/SRR14724459_SRR14724473__0.9_R2_trimmed.fastq
From my Snakefile, here are my variables:
all_samples: ['SRR14724459', 'SRR14724473']
sample1: ['SRR14724459']
sample2: ['SRR14724473']
titration:[0.9]
This is my rule all:
rule all:
    expand(MAPPED_DIR + "/{sample}.bam", sample=all_samples),
    expand(MAPPED_DIR + "/{sample1}_{sample2}_{titration}.bam", zip, sample1=list_a_titrations, sample2=list_b_titrations, titration=tit_list)
This is my map rule:
rule map:
    input:
        r1 = TRIMMED_DIR + "/{sample}/{sample}_R1_trimmed.fastq",
        r2 = TRIMMED_DIR + "/{sample}/{sample}_R2_trimmed.fastq"
    output:
        MAPPED_DIR + "/{sample}.bam"
    threads: 28
    params:
        genome = HUMAN_GENOME_DIR
    log:
        LOG_DIR + "/map/{sample}_map.log"
    benchmark:
        BENCHMARK_DIR + "/map/{sample}_bwa_benchmark.txt"
    wildcard_constraints:
        word='[^0-9]*'
    shell:
        """
        bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
        """
This is my map_hybrid:
rule map_hybrid:
    input:
        r1 = TRIMMED_DIR + "/{sample1}_{sample2}_{titration}/{sample1}_R1_trimmed_{sample2}_R1_trimmed_{titration}.fastq",
        r2 = TRIMMED_DIR + "/{sample1}_{sample2}_{titration}/{sample1}_R2_trimmed_{sample2}_R2_trimmed_{titration}.fastq"
    output:
        MAPPED_DIR + "/{sample1}_{sample2}_{titration}.bam"
    threads: 28
    params:
        genome = HUMAN_GENOME_DIR
    log:
        LOG_DIR + "/map/{sample1}_{sample2}_{titration}_map.log"
    benchmark:
        BENCHMARK_DIR + "/map/{sample1}_{sample2}_{titration}_bwa_benchmark.txt"
    shell:
        """
        set +e
        bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
        exitcode=$?
        if [ $exitcode -eq 1 ]
        then
            exit 1
        else
            exit 0
        fi
        """
The expected input files SHOULD BE as so:
Expected input files:
map_hybrid: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R1_trimmed_SRR14724473_R1_trimmed_0.9.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq
map: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_R2_trimmed.fastq
and also
/home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724473_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724473_R2_trimmed.fastq
Your rules map and map_hybrid can both produce your desired files, so for Snakemake they are ambiguous rules.
The names of the wildcards are irrelevant; what matters is whether the wildcards in both rules can match the same output file path.
That is the case here.
Rule map_hybrid can produce an output file such as SRR14724459_SRR14724473_0.9.bam, where the wildcard matches are
sample1=SRR14724459
sample2=SRR14724473
titration=0.9
but the rule map can also produce this output, with the wildcard match
sample=SRR14724459_SRR14724473_0.9
To prevent the ambiguity you can use wildcard_constraints, so that the {sample} wildcard only matches strings made of SRR followed by digits:
sample='SRR\d+'
Integrated into your rule map:
rule map:
    input:
        r1 = TRIMMED_DIR + "/{sample}/{sample}_R1_trimmed.fastq",
        r2 = TRIMMED_DIR + "/{sample}/{sample}_R2_trimmed.fastq"
    output:
        MAPPED_DIR + "/{sample}.bam"
    threads: 28
    params:
        genome = HUMAN_GENOME_DIR
    log:
        LOG_DIR + "/map/{sample}_map.log"
    benchmark:
        BENCHMARK_DIR + "/map/{sample}_bwa_benchmark.txt"
    wildcard_constraints:
        word='[^0-9]*',
        sample='SRR\d+'
    shell:
        """
        bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
        """
This should resolve the ambiguity.
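Why the constraint helps can be sketched with plain regular expressions. The sketch assumes that an unconstrained wildcard behaves like the pattern .+ and that a constraint replaces that pattern; the target name is a hypothetical hybrid file of the form requested in rule all, and the compiled pattern names are made up for illustration.

```python
import re

target = "SRR14724459_SRR14724473_0.9.bam"

# Output patterns of the two rules, with each wildcard written as a named group.
map_unconstrained = re.compile(r"(?P<sample>.+)\.bam")
map_constrained = re.compile(r"(?P<sample>SRR\d+)\.bam")
map_hybrid = re.compile(r"(?P<sample1>.+)_(?P<sample2>.+)_(?P<titration>.+)\.bam")

# Without a constraint, both rules can claim the hybrid target.
print(map_unconstrained.fullmatch(target).group("sample"))
print(map_hybrid.fullmatch(target).groupdict())

# With the SRR-plus-digits constraint, rule map only matches single-sample files.
print(map_constrained.fullmatch(target))
print(map_constrained.fullmatch("SRR14724459.bam").group("sample"))
```

Once only one pattern matches a given file name, there is a single candidate rule left and nothing to disambiguate.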

Snakemake can't identify the rule

I'm writing a pipeline with Snakemake and the program can't identify the rule stringtie. I can't find what I'm doing wrong. I have already run the fastp and star rules; the problem is specific to the stringtie rule.
include:
    'config.py'
rule all:
    input:
        expand(FASTP_DIR + "{sample}R{read_no}.fastq",sample=SAMPLES ,read_no=['1', '2']), #fastp
        expand(STAR_DIR + STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam",sample=SAMPLES), #STAR
        expand(STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf", sample=SAMPLES),
        GTF_DIR + "path_samplesGTF.txt"
rule fastp:
    input:
        R1= DATA_DIR + "{sample}R1_001.fastq.gz",
        R2= DATA_DIR + "{sample}R2_001.fastq.gz"
    output:
        R1out= FASTP_DIR + "{sample}R1.fastq",
        R2out= FASTP_DIR + "{sample}R2.fastq"
    params:
        data_dir = DATA_DIR,
        name_sample = "{sample}"
    log: FASTP_LOG + "{sample}.html"
    message: "Executando o programa FASTP"
    run:
        shell('fastp -i {input.R1} -I {input.R2} -o {output.R1out} -O {output.R2out} \
        -h {log} -j {log}')
        shell("find {params.data_dir} -type f -name '{params.name_sample}*' -delete ")
rule star:
    input:
        idx_star = IDX_DIR,
        R1 = FASTP_DIR + "{sample}R1.fastq",
        R2 = FASTP_DIR + "{sample}R2.fastq",
        parameters = "parameters.txt",
    params:
        outdir = STAR_DIR + "output/{sample}/{sample}",
        star_dir = STAR_DIR,
        star_sample = '{sample}'
    # threads: 18
    output:
        out = STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam"
        #run_time = STAR + "log/star_run.time"
    # log: STAR_LOG
    # benchmark: BENCHMARK + "star/{sample_star}"
    run:
        shell("STAR --runThreadN 12 --genomeDir {input.idx_star} \
        --readFilesIn {input.R1} {input.R2} --outFileNamePrefix {params.outdir}\
        --parametersFiles {input.parameters} \
        --quantMode TranscriptomeSAM GeneCounts \
        --genomeChrBinNbits 12")
        # shell("find {params.star_dir} -type f ! -name '{params.star_sample}Aligned.sortedByCoord.out.bam' -delete")
rule stringtie:
    input:
        star_output = STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam"
    output:
        stringtie_output = STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf"
    run:
        shell("stringtie {input.star_output} -o {output.stringtie_output} \
        -v -p 12 ")
rule grep_gtf:
    input:
        list_gtf = STRINGTIE_DIR
    output:
        paths = GTF_DIR + "path_samplesGTF.txt"
    shell:
        "find {input.list_gtf} | grep .gtf > {output.paths}"
This is the output I get with the option dry-run (flag -n)
Building DAG of jobs...
Job counts:
count jobs
1 all
1 grep_gtf
2
[Fri Apr 17 15:59:24 2020]
rule grep_gtf:
input: /homelocal/boralli/workdir/pipeline_v4/STRINGTIE/
output: /homelocal/boralli/workdir/pipeline_v4/GTF/path_samplesGTF.txt
jobid: 1
find /homelocal/boralli/workdir/pipeline_v4/STRINGTIE/ | grep .gtf >
/homelocal/boralli/workdir/pipeline_v4/GTF/path_samplesGTF.txt
[Fri Apr 17 15:59:24 2020]
localrule all:
input: /homelocal/boralli/workdir/pipeline_v4/GTF/path_samplesGTF.txt
jobid: 0
Job counts:
count jobs
1 all
1 grep_gtf
2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
I really don't know what's going on. The same pipeline worked before.
In addition to Troy's comment:
You specify a directory as the input of your rule grep_gtf. Since that directory probably already exists, the rule stringtie does not need to be executed before running grep_gtf.
Using a directory as input isn't really a good idea. If you need the outputs of rule stringtie before executing rule grep_gtf, I suggest you specify the output files of rule stringtie as the input of rule grep_gtf.
So your rule grep_gtf should be something like:
rule grep_gtf:
    input:
        expand(STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf", sample=SAMPLES)
    output:
        paths = GTF_DIR + "path_samplesGTF.txt"
    shell:
        "find {STRINGTIE_DIR} | grep .gtf > {output.paths}"
EDIT:
I think there's a bad copy/paste in rule all, where STAR_DIR appears twice:
expand(STAR_DIR + STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam",sample=SAMPLES), #STAR
I also think there is a misunderstanding of the Snakemake "workflow" concept. You do not need to specify the outputs of all rules in rule all. You only need to specify the final files of the workflow. Snakemake will decide which rules need to be run in order to create them. I don't really understand why your Snakemake does not want to build the gtf files, since you ask for them in rule all, but I do see why rule grep_gtf does not need the output of rule stringtie to run.
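As a rough illustration of the suggested input for grep_gtf, the list that expand() would return can be written out in plain Python. The STRINGTIE_DIR and SAMPLES values here are hypothetical placeholders, since config.py is not shown in the question.

```python
# Hypothetical values; the real ones live in the asker's config.py.
STRINGTIE_DIR = "/path/to/STRINGTIE"
SAMPLES = ["sampleA", "sampleB"]

# Plain-Python equivalent of
# expand(STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf", sample=SAMPLES)
gtf_files = [
    STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf".format(sample=s)
    for s in SAMPLES
]

for path in gtf_files:
    print(path)
```

Each of these paths is an output of rule stringtie, so listing them as input makes grep_gtf depend on stringtie, whereas an already existing directory gives Snakemake nothing to build.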

Snakemake Checkpoint Throws (exited with non-zero exit code) even after correct completion

I'm currently running a snakemake checkpoint that appears to be throwing a non-zero exit code even after correct completion of the command, and am unsure how to resolve the problem.
The purpose of the below script is to parse a file of coordinates, the bed_file, extract all regions from a bam file rna_file and eventually assemble these regions. The code is below, and my snakemake version is 5.6.0.
#Pull coordinates from a BAM file, and use the command samtools view to extract the corresponding
#data, naming the output as the coordinate file, here named "6:25274434-25278245.bam". There are
#an unknown number of output files
checkpoint pull_reads_for_BAM:
    input:
        bed_file = get_lncRNA_file,
        rna_file = get_RNA_file
    conda:
        "envs/pydev_1.yml"
    params:
        "01.pulled_reads"
    output:
        directory("01.pulled_reads/{tissue}")
    shell:"""
        mkdir 01.pulled_reads/{wildcards.tissue}
        store_regions=$(cat {input.bed_file} | awk -F'\t' '{{ print $1 ":" $2 "-" $3 }}')
        for i in $store_regions ; do
            samtools view -b -h {input.rna_file} ${{i}} > 01.pulled_reads/{wildcards.tissue}/${{i}}.bam ;
        done
        echo "This completed fine"
        """
rule samtools_sort:
    input:
        "01.pulled_reads/{tissue}/{i}.bam"
    params:
        "{i}"
    output:
        "01.pulled_reads/{tissue}/{i}.sorted.bam"
    shell:
        "samtools sort -T sorted_reads/{params}.tmp {input} > {output}"
rule samtools_index:
    input:
        "01.pulled_reads/{tissue}/{i}.sorted.bam"
    output:
        "01.pulled_reads/{tissue}/{i}.sorted.bam.bai"
    shell:
        "samtools index {input}"
rule string_tie_assembly:
    input:
        "01.pulled_reads/{tissue}/{i}.sorted.bam"
    output:
        "02.string_tie_assembly/{tissue}/{i}_assembly.gtf"
    shell:
        "stringtie {input} -f 0.0 -a 0 -m 50 -c 3.0 -f 0.0 -o {output}"
def trigger_aggregate(wildcards):
    checkpoint_output = checkpoints.pull_reads_for_BAM.get(**wildcards).output[0]
    x = expand("02.string_tie_assembly/{tissue}/{i}_assembly.merged.gtf",
        tissue = wildcards.tissue,
        i=glob_wildcards(os.path.join(checkpoint_output, "{i}.bam")).i)
    return x
#Aggregate function that triggers rule
rule combine_all_gtf_things:
    input:
        trigger_aggregate
    output:
        "03.final_stuff/{tissue}.merged.gtf"
    shell:"""
        cat {input} > {output}
        """
After the command has run to completion, snakemake returns (exited with non-zero exit code) for some mysterious reason. I can watch the output be generated in the file and it appears to be correct, so I'm unsure why it's throwing this error.
The checkpoint I have generated is modeled after this:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
Related Questions that have gone unanswered:
Snakemake checkpoint (exited with non-zero exit code)
It appears that this issue was somehow caused by the wildcards in {tissue} being set as a directory. As to why this throws a non-zero exit status I am unsure. This was fixed by simply appending {tissue}_dir onto the path above.
More on the issue can be found here:
https://bitbucket.org/snakemake/snakemake/issues/1303/snakemake-checkpoint-throws-exited-with
Not sure if this is the problem, but mkdir 01.pulled_reads/{wildcards.tissue} will fail if the directory already exists, or if 01.pulled_reads does not exist before mkdir is executed.
Try adding the -p option to mkdir, i.e. mkdir -p 01.pulled_reads/{wildcards.tissue}
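The two failure modes described above can be reproduced in Python, taking os.mkdir as a stand-in for plain mkdir and os.makedirs(..., exist_ok=True) as a stand-in for mkdir -p. This is an analogy under those assumptions, not a claim about what happened inside the checkpoint; the tissue name is hypothetical.

```python
import os
import tempfile

base = tempfile.mkdtemp()  # scratch directory so the sketch touches nothing real
target = os.path.join(base, "01.pulled_reads", "tissue_x")  # hypothetical tissue name

# Like plain `mkdir`: fails because the parent 01.pulled_reads does not exist yet.
try:
    os.mkdir(target)
    plain_mkdir_ok = True
except FileNotFoundError:
    plain_mkdir_ok = False

# Like `mkdir -p`: creates missing parents and tolerates an existing directory.
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)  # second call is a no-op, not an error

# Plain mkdir on an already existing directory fails as well.
try:
    os.mkdir(target)
    second_mkdir_ok = True
except FileExistsError:
    second_mkdir_ok = False

print(plain_mkdir_ok, second_mkdir_ok, os.path.isdir(target))
```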

Use wildcard on params

I am trying to use a tool that needs a wildcard present in the input.
This is an example:
aDict = {"120":"121" } #tumor : normal
rule all:
    input: expand("{case}.mutect2.vcf",case=aDict.keys())
def get_files_somatic(wildcards):
    case = wildcards.case
    control = aDict[wildcards.case]
    return [case + ".sorted.bam", control + ".sorted.bam"]
rule gatk_Mutect2:
    input:
        get_files_somatic,
    output:
        "{case}.mutect2.vcf"
    params:
        genome="ref/hg19.fa",
        target= "chr12",
        name_tumor='{case}'
    log:
        "logs/{case}.mutect2.log"
    threads: 8
    shell:
        " gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} -I {input[1]} -normal {wildcards.control}"
        " -L {params.target} -O {output}"
I get this error:
'Wildcards' object has no attribute 'control'
So I have a function that works out both case and control, but I'm not able to get the control name into the shell command.
The wildcards are derived from the output file/pattern. That is why you only have the wildcard called case. You have to derive the control from that. Try replacing your shell statement with this:
run:
    control = aDict[wildcards.case]
    shell(
        "gatk-launch Mutect2 -R {params.genome} -I {input[0]} "
        "-tumor {params.name_tumor} -I {input[1]} -normal {control} "
        "-L {input.target2} -O {output}"
    )
You could define control in params. Also, {input.target2} in the shell command would result in an error. Maybe it's supposed to be params.target?
rule gatk_Mutect2:
    input:
        get_files_somatic,
    output:
        "{case}.mutect2.vcf"
    params:
        genome="ref/hg19.fa",
        target= "chr12",
        name_tumor='{case}',
        control = lambda wildcards: aDict[wildcards.case]
    shell:
        """
        gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} \\
        -I {input[1]} -normal {params.control} -L {params.target} -O {output}
        """

Snakemake: how to encode paired analysis

I want to run GATK recalibration on paired samples (tumor and normal). I need to parse the data using pandas. This is what I wrote:
expand("mapped_reads/merged_samples/{sample[1][tumor]}/{sample[1][tumor]}_{sample[1][normal]}.bam", sample=read_table(config["conditions"], ",").iterrows())
this is the condition file:
432,433
434,435
I wrote this rule:
rule gatk_RealignerTargetCreator:
    input:
        "mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
        "mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
    output:
        "mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"
    params:
        genome=config['reference']['genome_fasta'],
        mills= config['mills'],
        ph1_indels= config['know_phy'],
    log:
        "mapped_reads/merged_samples/logs/{tumor}_{normal}.realign_info.log"
    threads: 8
    shell:
        "gatk -T RealignerTargetCreator -R {params.genome} {params.custom} "
        "-nt {threads} "
        "-I {wildcard.tumor} -I {wildcard.normal} -known {params.ph1_indels} "
        "-o {output} >& {log}"
I have this error:
InputFunctionException in line 17 of /home/maurizio/Desktop/TEST_exome/rules/samfiles.rules:
KeyError: '432/432_433'
Wildcards:
sample=432/432_433
this is the samfiles.rules:
rule samtools_merge_bam:
    """
    Merge bam files for multiple units into one for the given sample.
    If the sample has only one unit, files will be copied.
    """
    input:
        lambda wildcards: expand("mapped_reads/bam/{unit}_sorted.bam",unit=config["samples"][wildcards.sample])
    output:
        "mapped_reads/merged_samples/{sample}.bam"
    benchmark:
        "benchmarks/samtools/merge/{sample}.txt"
    run:
        if len(input) > 1:
            shell("/illumina/software/PROG2/samtools-1.3.1/samtools merge {output} {input}")
        else:
            shell("cp {input} {output} && touch -h {output}")
I can only guess because you don't show all the relevant rules, but I would say the error occurs because the rule samtools_merge_bam also applies to some later bam file where you have the pattern {tumor}/{tumor}_{normal}...
As a solution, you have to resolve this ambiguity (see the Snakemake tutorial). For example, you can constrain the wildcard of samtools_merge_bam so that it cannot contain any slashes.
wildcard_constraints:
    sample="[^/]+"
You can put the constraint either globally or inside your samtools_merge_bam rule.
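The effect of this constraint can be sketched with regular expressions, assuming an unconstrained wildcard behaves like .+ and the constraint swaps in [^/]+. The requested path is the one implied by the error message (sample=432/432_433); the pattern names are made up for illustration.

```python
import re

requested = "mapped_reads/merged_samples/432/432_433.bam"

# Output pattern of samtools_merge_bam with the default and the constrained wildcard.
default_pattern = re.compile(r"mapped_reads/merged_samples/(?P<sample>.+)\.bam")
constrained_pattern = re.compile(r"mapped_reads/merged_samples/(?P<sample>[^/]+)\.bam")

# Default: the wildcard swallows the subdirectory, giving the key from the KeyError.
match = default_pattern.fullmatch(requested)
print(match.group("sample"))

# Constrained: the nested path no longer matches, but a plain sample file still does.
print(constrained_pattern.fullmatch(requested))
print(constrained_pattern.fullmatch("mapped_reads/merged_samples/432.bam").group("sample"))
```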
