Snakemake Ambiguity between 2 rules with similar output file type

I am trying to map 3 samples: SRR14724459, SRR14724473, and a combination of both SRR14724459_SRR14724473.
I have 2 rules that produce the same output file type (.bam), and even though I give their wildcards different names, I still get an ambiguity error:
Building DAG of jobs...
AmbiguousRuleException:
Rules map_hybrid and map are ambiguous for the file /gpfs/scratch/hpadre/snakemake_outputs/mapped_dir/SRR14724459_SRR14724473__0.9.bam.
Expected input files:
map_hybrid: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R1_trimmed_SRR14724473_R1_trimmed_0.9.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq
map: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473__0.9/SRR14724459_SRR14724473__0.9_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473__0.9/SRR14724459_SRR14724473__0.9_R2_trimmed.fastq
From my Snakefile, here are my variables:
all_samples: ['SRR14724459', 'SRR14724473']
sample1: ['SRR14724459']
sample2: ['SRR14724473']
titration:[0.9]
This is my rule all:
rule all:
expand(MAPPED_DIR + "/{sample}.bam", sample=all_samples),
expand(MAPPED_DIR + "/{sample1}_{sample2}_{titration}.bam", zip, sample1=list_a_titrations, sample2=list_b_titrations, titration=tit_list)
This is my map rule:
rule map:
input:
r1 = TRIMMED_DIR + "/{sample}/{sample}_R1_trimmed.fastq",
r2 = TRIMMED_DIR + "/{sample}/{sample}_R2_trimmed.fastq"
output:
MAPPED_DIR + "/{sample}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample}_bwa_benchmark.txt"
wildcard_constraints:
word='[^0-9]*'
shell:
"""
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
"""
This is my map_hybrid:
rule map_hybrid:
input:
r1 = TRIMMED_DIR + "/{sample1}_{sample2}_{titration}/{sample1}_R1_trimmed_{sample2}_R1_trimmed_{titration}.fastq",
r2 = TRIMMED_DIR + "/{sample1}_{sample2}_{titration}/{sample1}_R2_trimmed_{sample2}_R2_trimmed_{titration}.fastq"
output:
MAPPED_DIR + "/{sample1}_{sample2}_{titration}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample1}_{sample2}_{titration}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample1}_{sample2}_{titration}_bwa_benchmark.txt"
shell:
"""
set +e
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
exitcode=$?
if [ $exitcode -eq 1 ]
then
exit 1
else
exit 0
fi
"""
The expected input files SHOULD BE as so:
Expected input files:
map_hybrid: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R1_trimmed_SRR14724473_R1_trimmed_0.9.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq
map: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_R2_trimmed.fastq
and also
/home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724473_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724473_R2_trimmed.fastq

Both of your rules, map and map_hybrid, can produce the requested file, so for Snakemake they are ambiguous.
The names of the wildcards are irrelevant; what matters is whether the output patterns of both rules can match the same file path.
That is the case here.
Rule map_hybrid can produce an output file such as SRR14724459_SRR14724473_0.9.bam, where the wildcards match as
sample1=SRR14724459
sample2=SRR14724473
titration=0.9
but rule map can also produce this file, with the wildcard match
sample=SRR14724459_SRR14724473_0.9
To prevent the ambiguity, you can use wildcard_constraints so that the {sample} wildcard only matches strings that start with SRR followed by digits:
sample=r'SRR\d+'
Integrated into your rule map:
rule map:
    input:
        r1 = TRIMMED_DIR + "/{sample}/{sample}_R1_trimmed.fastq",
        r2 = TRIMMED_DIR + "/{sample}/{sample}_R2_trimmed.fastq"
    output:
        MAPPED_DIR + "/{sample}.bam"
    threads: 28
    params:
        genome = HUMAN_GENOME_DIR
    log:
        LOG_DIR + "/map/{sample}_map.log"
    benchmark:
        BENCHMARK_DIR + "/map/{sample}_bwa_benchmark.txt"
    wildcard_constraints:
        word='[^0-9]*',
        sample=r'SRR\d+'
    shell:
        """
        bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
        """
This should resolve the ambiguity.
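To see why the two rules collide, it can help to model how an output pattern is matched against a requested file. The sketch below is a simplified stand-in in plain Python, not Snakemake's actual implementation: it assumes each wildcard becomes a named regex group that matches `.+` unless a constraint is given. The helper name `pattern_to_regex` and the shortened `mapped_dir/` prefix are made up for this illustration.

```python
import re


def pattern_to_regex(pattern, constraints=None):
    """Translate a Snakemake-style output pattern into a compiled regex.

    Simplified model (an assumption, not Snakemake's real code): every
    {wildcard} becomes a named group matching '.+' unless a constraint
    is supplied. Each wildcard name may appear only once in the pattern.
    """
    constraints = constraints or {}
    parts = []
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal text
        name = m.group(1)
        parts.append("(?P<{}>{})".format(name, constraints.get(name, ".+")))
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))  # trailing literal text
    return re.compile("".join(parts))


target = "mapped_dir/SRR14724459_SRR14724473_0.9.bam"

map_rule = pattern_to_regex("mapped_dir/{sample}.bam")
hybrid_rule = pattern_to_regex("mapped_dir/{sample1}_{sample2}_{titration}.bam")

# Without constraints, both patterns match the same file: ambiguity.
map_match = map_rule.fullmatch(target)
hybrid_match = hybrid_rule.fullmatch(target)

# With the constraint, rule map no longer matches the hybrid file,
# but it still matches a plain sample name.
constrained_map = pattern_to_regex("mapped_dir/{sample}.bam", {"sample": r"SRR\d+"})
constrained_hybrid = constrained_map.fullmatch(target)
constrained_single = constrained_map.fullmatch("mapped_dir/SRR14724459.bam")
```

In this model the constraint removes rule map as a candidate for the combined file, leaving map_hybrid as the only producer.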

Related

Nextflow: How to process multiple samples

I have a few fq.gz files for a few samples. I am trying to process all samples at once using Nextflow, but so far I can only process a single sample at a time. Here is my code for processing a single sample.
My nextflow code
params.sampleName="sample1"
params.fastq_path = "data/${params.sampleName}/*{1,2}.fq.gz"
fastq_files = Channel.fromFilePairs(params.fastq_path)
params.ref = "ab.fa"
ref = file(params.ref)
process foo {
input:
set pairId, file(reads) from fastq_files
output:
file("${pairId}.bam") into bamFiles_ch
script:
"""
echo ${reads[0].toRealPath().getParent().baseName}
bwa-mem2 mem -t 8 ${ref} ${reads[0].toRealPath()} ${reads[1].toRealPath()} | samtools sort -@8 -o ${pairId}.bam
samtools index -@8 ${pairId}.bam
"""
}
process samToolsMerge {
publishDir "./aligned_minimap/", mode: 'copy', overwrite: 'false'
input:
file bamFile from bamFiles_ch.collect()
output:
file("**")
script:
"""
samtools merge ${params.sampleName}.bam ${bamFile}
samtools index -@ 8 ${params.sampleName}.bam
"""
}
I need help getting this to work for all samples. Thanks in advance.

It looks like you've already built in a way to set your target sample name using:
params.sampleName="sample1"
params.fastq_path = "data/${params.sampleName}/*{1,2}.fq.gz"
To have the glob pattern match all samples, you could simply set the wildcard on the command line using:
nextflow run main.nf --sampleName '*'
Note the quotation marks above. If they are omitted, the glob star will be expanded by your shell before it is passed to your Nextflow command.
The short answer is that you need an easy way to extract the sample name from the parent directory, and then a way to group the coordinate-sorted BAMs by sample name. Below, I've used the new Nextflow DSL 2. It's not strictly necessary; I just find DSL 2 code a lot easier to read and debug. This is only an example, and you'll need to adapt it to suit your exact use case, but it should do very similar things.
It uses a special groupKey so that we can dynamically specify the expected number of elements in each tuple before calling the groupTuple operator. This lets us stream the collected values as soon as possible, so that each sample can be merged as soon as all of its readgroups have been aligned. Without this, all input readgroups would need to finish alignment before any merge could begin.
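The idea behind a size-aware group key can be sketched outside Nextflow. The plain-Python generator below is only an analogy, not Nextflow's implementation: it assumes the expected group sizes are known up front and emits each group the moment it is complete. The function name `stream_groups` and the toy readgroup labels are invented for this illustration.

```python
from collections import defaultdict


def stream_groups(items, expected_sizes):
    """Yield (key, values) as soon as a key has collected its expected count.

    items: iterable of (key, value) pairs arriving in any order.
    expected_sizes: dict mapping each key to how many values it will receive.
    """
    pending = defaultdict(list)
    for key, value in items:
        pending[key].append(value)
        if len(pending[key]) == expected_sizes[key]:
            # The group is complete: release it without waiting for other keys.
            yield key, pending.pop(key)


# Toy stream of (sample, aligned readgroup) pairs in arrival order.
arrivals = [
    ("UHR", "UHR_Rep1"),
    ("HBR", "HBR_Rep1"),
    ("UHR", "UHR_Rep2"),
    ("HBR", "HBR_Rep2"),
    ("UHR", "UHR_Rep3"),
]
sizes = {"HBR": 2, "UHR": 3}

finished = list(stream_groups(arrivals, sizes))
```

Here the HBR group is released after the fourth arrival, before UHR has finished, which mirrors how each sample can start merging independently.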
Contents of nextflow.config:
process {
shell = [ '/bin/bash', '-euo', 'pipefail' ]
}
Contents of main.nf:
nextflow.enable.dsl=2
params.ref_fasta = "GRCh38.primary_assembly.genome.chr22.fa.gz"
params.fastq_files = "data/*/*.read{1,2}.fastq.gz"
process bwa_index {
conda 'bwa-mem2'
input:
path fasta
output:
path "${fasta}.{0123,amb,ann,bwt.2bit.64,pac}"
"""
bwa-mem2 index "${fasta}"
"""
}
process bwa_mem2 {
tag { [sample, readgroup].join(':') }
conda 'bwa-mem2 samtools'
input:
tuple val(sample), val(readgroup), path(reads)
path bwa_index
output:
tuple val(sample), val(readgroup), path("${readgroup}.bam{,.bai}")
script:
def idxbase = bwa_index.first().baseName
def out_files = [ "${readgroup}.bam", "${readgroup}.bam.bai" ].join('##idx##')
def (r1, r2) = reads
"""
    bwa-mem2 mem \\
        -R '@RG\\tID:${readgroup}\\tSM:${sample}' \\
        -t ${task.cpus} \\
        "${idxbase}" \\
        "${r1}" \\
        "${r2}" |
    samtools sort \\
        --write-index \\
        -@ ${task.cpus} \\
        -o "${out_files}"
"""
}
process samtools_merge {
tag { sample }
conda 'samtools'
input:
tuple val(sample), path(indexed_bam_files)
output:
tuple val(sample), path("${sample}.bam{,.bai}")
script:
def out_files = [ "${sample}.bam", "${sample}.bam.bai" ].join('##idx##')
def input_bam_files = indexed_bam_files
.findAll { it.name.endsWith('.bam') }
.collect { /"${it}"/ }
.join(' \\\n'+' '*8)
"""
samtools merge \\
--write-index \\
-o "${out_files}" \\
${input_bam_files}
"""
}
workflow {
ref_fasta = file( params.ref_fasta )
bwa_index( ref_fasta )
Channel.fromFilePairs( params.fastq_files ) \
| map { readgroup, reads ->
def (sample_name) = reads*.parent.baseName as Set
tuple( sample_name, readgroup, reads )
} \
| groupTuple() \
| map { sample, readgroups, reads ->
tuple( groupKey(sample, readgroups.size()), readgroups, reads )
} \
| transpose() \
| set { sample_readgroups }
bwa_mem2( sample_readgroups, bwa_index.out )
sample_readgroups \
| join( bwa_mem2.out, by: [0,1] ) \
| map { sample_key, readgroup, reads, indexed_bam ->
tuple( sample_key, indexed_bam )
} \
| groupTuple() \
| map { sample_key, indexed_bam_files ->
tuple( sample_key.toString(), indexed_bam_files.flatten() )
} \
| samtools_merge
}
Run Like:
nextflow run -ansi-log false main.nf
Results:
N E X T F L O W ~ version 21.04.3
Launching `main.nf` [zen_gautier] - revision: dcde9efc8a
Creating Conda env: bwa-mem2 [cache /home/steve/working/stackoverflow/69702077/work/conda/env-8cc153b2eb20a5374bf435019a61c21a]
[63/73c96b] Submitted process > bwa_index
Creating Conda env: bwa-mem2 samtools [cache /home/steve/working/stackoverflow/69702077/work/conda/env-5c358e413a5318c53a45382790eecbd4]
[52/6a92d3] Submitted process > bwa_mem2 (HBR:HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22)
[8b/535b21] Submitted process > bwa_mem2 (UHR:UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[dc/03d949] Submitted process > bwa_mem2 (UHR:UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[e4/bfd08b] Submitted process > bwa_mem2 (HBR:HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22)
[d5/e2aa27] Submitted process > bwa_mem2 (UHR:UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[c2/23ce8a] Submitted process > bwa_mem2 (HBR:HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22)
Creating Conda env: samtools [cache /home/steve/working/stackoverflow/69702077/work/conda/env-912cee20caec78e112a5718bb0c00e1c]
[28/006c03] Submitted process > samtools_merge (HBR)
[3b/51311c] Submitted process > samtools_merge (UHR)

Snakemake can't identify the rule

I'm writing a pipeline with Snakemake, and it does not pick up the rule stringtie. I can't find what I'm doing wrong. I have already run the rules fastp and star; the problem is specific to the stringtie rule.
include:
'config.py'
rule all:
input:
expand(FASTP_DIR + "{sample}R{read_no}.fastq",sample=SAMPLES ,read_no=['1', '2']), #fastp
expand(STAR_DIR + STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam",sample=SAMPLES), #STAR
expand(STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf", sample=SAMPLES),
GTF_DIR + "path_samplesGTF.txt"
rule fastp:
input:
R1= DATA_DIR + "{sample}R1_001.fastq.gz",
R2= DATA_DIR + "{sample}R2_001.fastq.gz"
output:
R1out= FASTP_DIR + "{sample}R1.fastq",
R2out= FASTP_DIR + "{sample}R2.fastq"
params:
data_dir = DATA_DIR,
name_sample = "{sample}"
log: FASTP_LOG + "{sample}.html"
message: "Executando o programa FASTP"
run:
shell('fastp -i {input.R1} -I {input.R2} -o {output.R1out} -O {output.R2out} \
-h {log} -j {log}')
shell("find {params.data_dir} -type f -name '{params.name_sample}*' -delete ")
rule star:
input:
idx_star = IDX_DIR,
R1 = FASTP_DIR + "{sample}R1.fastq",
R2 = FASTP_DIR + "{sample}R2.fastq",
parameters = "parameters.txt",
params:
outdir = STAR_DIR + "output/{sample}/{sample}",
star_dir = STAR_DIR,
star_sample = '{sample}'
# threads: 18
output:
out = STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam"
#run_time = STAR + "log/star_run.time"
# log: STAR_LOG
# benchmark: BENCHMARK + "star/{sample_star}"
run:
shell("STAR --runThreadN 12 --genomeDir {input.idx_star} \
--readFilesIn {input.R1} {input.R2} --outFileNamePrefix {params.outdir}\
--parametersFiles {input.parameters} \
--quantMode TranscriptomeSAM GeneCounts \
--genomeChrBinNbits 12")
# shell("find {params.star_dir} -type f ! -name '{params.star_sample}Aligned.sortedByCoord.out.bam' -delete")
rule stringtie:
input:
star_output = STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam"
output:
stringtie_output = STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf"
run:
shell("stringtie {input.star_output} -o {output.stringtie_output} \
-v -p 12 ")
rule grep_gtf:
input:
list_gtf = STRINGTIE_DIR
output:
paths = GTF_DIR + "path_samplesGTF.txt"
shell:
"find {input.list_gtf} | grep .gtf > {output.paths}"
This is the output I get with the option dry-run (flag -n)
Building DAG of jobs...
Job counts:
count jobs
1 all
1 grep_gtf
2
[Fri Apr 17 15:59:24 2020]
rule grep_gtf:
input: /homelocal/boralli/workdir/pipeline_v4/STRINGTIE/
output: /homelocal/boralli/workdir/pipeline_v4/GTF/path_samplesGTF.txt
jobid: 1
find /homelocal/boralli/workdir/pipeline_v4/STRINGTIE/ | grep .gtf >
/homelocal/boralli/workdir/pipeline_v4/GTF/path_samplesGTF.txt
[Fri Apr 17 15:59:24 2020]
localrule all:
input: /homelocal/boralli/workdir/pipeline_v4/GTF/path_samplesGTF.txt
jobid: 0
Job counts:
count jobs
1 all
1 grep_gtf
2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
I really don't know what's going on. The same pipeline worked before.

In addition to Troy's comment:
You specify a directory as the input of your rule grep_gtf. Since that directory probably already exists, Snakemake does not need to run rule stringtie before running grep_gtf.
Using a directory as input isn't really a good idea. If you need the outputs of rule stringtie before executing rule grep_gtf, I suggest you specify the output files of rule stringtie as the input of rule grep_gtf.
So your rule grep_gtf should be something like:
rule grep_gtf:
    input:
        expand(STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf", sample=SAMPLES)
    output:
        paths = GTF_DIR + "path_samplesGTF.txt"
    shell:
        "find {STRINGTIE_DIR} | grep .gtf > {output.paths}"
EDIT:
I think there's a bad copy/paste in rule all, where STAR_DIR appears twice:
expand(STAR_DIR + STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam",sample=SAMPLES), #STAR
I also think there is a misunderstanding of the Snakemake "workflow" concept. You do not need to specify the outputs of all rules in rule all; you only need to specify the final files of the workflow. Snakemake will decide which rules need to be run in order to create them. I don't really understand why Snakemake does not build the gtf files, since you ask for them in rule all, but I do see why rule grep_gtf does not need the output of rule stringtie to run.
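The expand(...) call in the suggested rule yields one concrete .gtf path per sample. Unlike the directory, these files can only come from rule stringtie, and that is what creates the dependency. As a rough illustration in plain Python (expand_paths is a hypothetical stand-in, not Snakemake's own expand, and the directory and sample names are invented):

```python
from itertools import product


def expand_paths(pattern, **wildcards):
    """Fill a pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(wildcards[n] for n in names))
    ]


STRINGTIE_DIR = "STRINGTIE"        # hypothetical directory name
SAMPLES = ["sampleA", "sampleB"]   # hypothetical sample names

# One explicit file per sample, instead of the bare directory.
gtf_files = expand_paths(
    STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf",
    sample=SAMPLES,
)
```

Each entry in gtf_files matches the output pattern of rule stringtie, so requesting them as input forces that rule to run for any sample whose file is missing.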

NameError in Snakemake dryrun mode

I am new to Snakemake and I am trying to develop some pipelines. I am encountering some problems when I use wildcards, trying to automate my bioinformatic analyses as much as possible. I run into troubles when the pipeline becomes more complex (as shown below). It looks like Snakemake does not resolve the wildcards correctly. During a dry run of the Snakefile, the wildcards values look correct in the executions of some rules. However, the same wildcards lead to an error in a different step(rule) of the pipeline, and I cannot figure out why. Below I provide the code and the output message of a dry run.
num=["327905-LR-41624_normal","327907-LR-41624_tumor"]
num_normal=["327905-LR-41624"]
num_tumor=["327907-LR-41624"]
path="/path/to/Snakemake/"
genome="/path/to/references_genome/Mus_musculus.GRCm38.dna_rm.toplevel.fa"
rule all:
input:
expand("/path/to/Snakemake/AS-{num_tum}_tumor_no_dupl_sort_RG_LB.bam",num_tum=num_tumor),
expand("/path/to/Snakemake/AS-{num_norm}_normal_no_dupl_sort_RG_LB.bam",num_norm=num_normal)
ruleorder: samtools_sort > remove_duplicates > samtools_index #> add_readgroup_tumor > add_readgroup_normal
rule trim_galore:
input:
r1="/path/to/Snakemake/AS-{num}_R1.fastq",
r2="/path/to/Snakemake/AS-{num}_R2.fastq"
output:
"/path/to/Snakemake/AS-{num }_R1_val_1.fq",
"/path/to/Snakemake/AS-{num }_R2_val_2.fq"
shell:
"module load trim-galore/0.5.0 ; module load pypy/2.7-6.0.0 ; trim_galore --output_dir /path/to/Snakemake/ --paired {input.r1} {input.r2} "
rule bwa_mem:
input:
R1="/path/to/Snakemake/AS-{num}_R1_val_1.fq",
R2="/path/to/Snakemake/AS-{num}_R2_val_2.fq"
output:
"/path/to/Snakemake/AS-{num}.bam"
shell:
"module load samtools/default ; module load bwa/0.7.8 ; bwa mem {genome} {input.R1} {input.R2} | samtools view -h -b > {output} "
rule samtools_sort:
input:
"/path/to/Snakemake/AS-{num}.bam"
output:
"/path/to/Snakemake/AS-{num}_sort.bam"
shell:
"module load samtools/default ; samtools sort -n -O BAM {input} > {output} "
rule remove_duplicates:
input:
"/path/to/Snakemake/AS-{num}_sort.bam"
output:
outbam="/path/to/Snakemake/AS-{num}_no_dupl_sort.bam",
metrics="/path/to/Snakemake/AS-{num}_dupl_metrics.txt"
shell:
"module load gatk/4.0.9.0 ; gatk MarkDuplicates -I {input} -O {output.outbam} -M {output.metrics} --REMOVE_DUPLICATES=true "
rule samtools_index:
input:
"/path/to/Snakemake/AS-{num}_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{num}_no_dupl_sort.bam.bai"
shell:
"module load samtools/default ; samtools index {input} "
rule add_readgroup_normal:
input:
"/path/to/Snakemake/AS-{num_normal}_normal_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{num_normal}_normal_no_dupl_sort_RG_LB.bam"
shell:
"module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { num_normal } -PU { num_normal } -SM NORMAL -I { input } -O {output} "
rule add_readgroup_tumor:
input:
"/path/to/Snakemake/AS-{num_tumor}_tumor_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{num_tumor}_tumor_no_dupl_sort_RG_LB.bam"
shell:
"module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { num_tumor } -PU { num_tumor } -SM TUMOR -I { input } -O {output} "
When I test the Snakefile with the command:
.local/bin/snakemake -s Snakefile_pipeline --dryrun
I get the following:
Building DAG of jobs...
Job counts:
count jobs
1 add_readgroup_normal
1 add_readgroup_tumor
1 all
2 bwa_mem
2 remove_duplicates
2 samtools_sort
2 trim_galore
11
[Mon Apr 8 16:14:27 2019]
rule trim_galore:
input: /path/to/Snakemake/AS-327907-LR-41624_tumor_R1.fastq, /path/to/Snakemake/AS-327907-LR-41624_tumor_R2.fastq
output: /path/to/Snakemake/AS-327907-LR-41624_tumor_R1_val_1.fq, /path/to/Snakemake/AS-327907-LR-41624_tumor_R2_val_2.fq
jobid: 9
wildcards: num=327907-LR-41624_tumor
[Mon Apr 8 16:14:27 2019]
rule trim_galore:
input: /path/to/Snakemake/AS-327905-LR-41624_normal_R1.fastq, /path/to/Snakemake/AS-327905-LR-41624_normal_R2.fastq
output: /path/to/Snakemake/AS-327905-LR-41624_normal_R1_val_1.fq, /path/to/Snakemake/AS-327905-LR-41624_normal_R2_val_2.fq
jobid: 10
wildcards: num=327905-LR-41624_normal
[Mon Apr 8 16:14:27 2019]
rule bwa_mem:
input: /path/to/Snakemake/AS-327905-LR-41624_normal_R1_val_1.fq, /path/to/Snakemake/AS-327905-LR-41624_normal_R2_val_2.fq
output: /path/to/Snakemake/AS-327905-LR-41624_normal.bam
jobid: 8
wildcards: num=327905-LR-41624_normal
[Mon Apr 8 16:14:27 2019]
rule bwa_mem:
input: /path/to/Snakemake/AS-327907-LR-41624_tumor_R1_val_1.fq, /path/to/Snakemake/AS-327907-LR-41624_tumor_R2_val_2.fq
output: /path/to/Snakemake/AS-327907-LR-41624_tumor.bam
jobid: 7
wildcards: num=327907-LR-41624_tumor
[Mon Apr 8 16:14:27 2019]
rule samtools_sort:
input: /path/to/Snakemake/AS-327907-LR-41624_tumor.bam
output: /path/to/Snakemake/AS-327907-LR-41624_tumor_sort.bam
jobid: 5
wildcards: num=327907-LR-41624_tumor
[Mon Apr 8 16:14:27 2019]
rule samtools_sort:
input: /path/to/Snakemake/AS-327905-LR-41624_normal.bam
output: /path/to/Snakemake/AS-327905-LR-41624_normal_sort.bam
jobid: 6
wildcards: num=327905-LR-41624_normal
[Mon Apr 8 16:14:27 2019]
rule remove_duplicates:
input: /path/to/Snakemake/AS-327907-LR-41624_tumor_sort.bam
output: /path/to/Snakemake/AS-327907-LR-41624_tumor_no_dupl_sort.bam, /path/to/Snakemake/AS-327907-LR-41624_tumor_dupl_metrics.txt
jobid: 3
wildcards: num=327907-LR-41624_tumor
[Mon Apr 8 16:14:27 2019]
rule remove_duplicates:
input: /path/to/Snakemake/AS-327905-LR-41624_normal_sort.bam
output: /path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort.bam, /path/to/Snakemake/AS-327905-LR-41624_normal_dupl_metrics.txt
jobid: 4
wildcards: num=327905-LR-41624_normal
[Mon Apr 8 16:14:27 2019]
rule add_readgroup_normal:
input: /path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort.bam
output: /path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort_RG_LB.bam
jobid: 2
wildcards: num_normal=327905-LR-41624
RuleException in line 93 of /home/l136n/Snakefile_mapping_snv_call_pipeline2:
NameError: The name ' num_normal ' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
I have googled the error but found little help. I have also double-checked the pipeline for any inconsistency. What I expect as output is indicated in the rule "all". The rules "add_readgroup_normal" and "add_readgroup_tumor" are supposed to take different subsets of input files, generated by the previous steps, which are run on all files. I wonder if the problem arises somehow because of this separation into 2 subsets.
I repeat that I am quite new to Snakemake, so I might be missing something silly somewhere! Any help would be really appreciated, as I am completely stuck!
Thank you so much in advance!
num=["327905-LR-41624_normal","327907-LR-41624_tumor"]
normal=["327905-LR-41624_normal"]
num_tumor=["327907-LR-41624_tumor"]
path="/path/to/Snakemake/"
genome="/icgc/dkfzlsdf/analysis/B210/references_genome/Mus_musculus.GRCm38.dna_rm.toplevel.fa"
rule all:
input:
"/path/to/Snakemake/AS-327905-LR-41624_normal_R1_val_1.fq",
"/path/to/Snakemake/AS-327905-LR-41624_normal_R2_val_2.fq",
"/path/to/Snakemake/AS-327907-LR-41624_tumor_R1_val_1.fq",
"/path/to/Snakemake/AS-327907-LR-41624_tumor_R2_val_2.fq",
"/path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort.bam.bai",
"/path/to/Snakemake/AS-327907-LR-41624_tumor_no_dupl_sort.bam.bai",
"/path/to/Snakemake/AS-327905-LR-41624_normal_RG.bam"
"/path/to/Snakemake/AS-327907-LR-41624_tumor_RG.bam"
rule trim_galore:
input:
r1="/path/to/Snakemake/AS-{num}_R1.fastq",
r2="/path/to/Snakemake/AS-{num}_R2.fastq"
output:
"/path/to/Snakemake/AS-{num }_R1_val_1.fq",
"/path/to/Snakemake/AS-{num }_R2_val_2.fq"
shell:
"module load trim-galore/0.5.0 ; module load pypy/2.7-6.0.0 ; trim_galore --output_dir /path/to/Snakemake/ --paired {input.r1} {input.r2} "
rule bwa_mem:
input:
R1="/path/to/Snakemake/AS-{num}_R1_val_1.fq",
R2="/path/to/Snakemake/AS-{num}_R2_val_2.fq"
output:
"/path/to/Snakemake/AS-{num}.bam"
shell:
"module load samtools/default ; module load bwa/0.7.8 ; bwa mem {genome} {input.R1} {input.R2} | samtools view -h -b > {output} "
rule samtools_sort:
input:
"/path/to/Snakemake/AS-{num}.bam"
output:
"/path/to/Snakemake/AS-{num}_sort.bam"
shell:
"module load samtools/default ; samtools sort -n -O BAM {input} > {output} "
rule remove_duplicates:
input:
"/path/to/Snakemake/AS-{num}_sort.bam"
output:
outbam="/path/to/Snakemake/AS-{num}_no_dupl_sort.bam",
metrics="/path/to/Snakemake/AS-{num}_dupl_metrics.txt"
shell:
"module load gatk/4.0.9.0 ; gatk MarkDuplicates -I {input} -O {output.outbam} -M {output.metrics} --REMOVE_DUPLICATES=true "
rule samtools_index:
input:
"/path/to/Snakemake/AS-{num}_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{num}_no_dupl_sort.bam.bai"
shell:
"module load samtools/default ; samtools index {input} "
rule add_readgroup_normal:
input:
"/path/to/Snakemake/AS-{normal}_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{normal}_RG.bam"
shell:
"module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { wildcards.normal } -PU { wildcards.normal } -SM NORMAL -I { input } -O {output} "
rule add_readgroup_tumor:
input:
"/path/to/Snakemake/AS-{num}_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{num_,'.*tumor.*'}_RG.bam"
shell:
"module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { wildcards.num } -PU { wildcards.num } -SM TUMOR -I { input } -O {output} "
Error:
Building DAG of jobs...
MissingInputException in line 37 of /home/l136n/Snakefile_mapping_snv_call_pipeline2b1:
Missing input files for rule trim_galore:
/path/to/Luca/Snakemake/AS-327905-LR-41624_normal_RG.bam/path/to/Luca/Snakemake/AS-327907-LR-41624_tumor_RG_R1.fastq
/path/to/Snakemake/AS-327905-LR-41624_normal_RG.bam/path/to/Luca/Snakemake/AS-327907-LR-41624_tumor_RG_R2.fastq

Wildcards are accessible in shell using the syntax {wildcards.var}, not {var}. You have the latter in rule add_readgroup_normal.
Source.

I thought I would provide the solution, even if the post is a bit old now. The error was simply due to the presence of spaces inside "{ wildcards.var }".
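Snakemake's shell command formatting follows Python's format-string syntax, where everything between the braces, spaces included, is taken as the field name. That is consistent with the NameError above reporting the name ' num_normal ' with its surrounding spaces. The plain-Python sketch below reproduces the same effect with str.format; it is an analogy rather than Snakemake's own code, and `render` is a made-up helper.

```python
def render(template, **values):
    """Fill a template using Python's standard str.format rules."""
    return template.format(**values)


# Without spaces, the field name is 'num_normal' and the lookup succeeds.
ok = render("-LB {num_normal}", num_normal="327905-LR-41624")

# With spaces, the field name is ' num_normal ' (spaces included),
# so the lookup fails even though 'num_normal' was supplied.
try:
    render("-LB { num_normal }", num_normal="327905-LR-41624")
    missing = None
except KeyError as err:
    missing = err.args[0]
```

Removing the spaces, and using the wildcards. prefix as the other answer notes, gives the formatter a name it can actually resolve.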

Use wildcard on params

I am trying to use a tool and I need to use a wildcard that is present in the input.
This is an example:
aDict = {"120":"121" } #tumor : normal
rule all:
input: expand("{case}.mutect2.vcf",case=aDict.keys())
def get_files_somatic(wildcards):
case = wildcards.case
control = aDict[wildcards.case]
return [case + ".sorted.bam", control + ".sorted.bam"]
rule gatk_Mutect2:
input:
get_files_somatic,
output:
"{case}.mutect2.vcf"
params:
genome="ref/hg19.fa",
target= "chr12",
name_tumor='{case}'
log:
"logs/{case}.mutect2.log"
threads: 8
shell:
" gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} -I {input[1]} -normal {wildcards.control}"
" -L {params.target} -O {output}"
I get this error:
'Wildcards' object has no attribute 'control'
So I have a function that returns the case and control files, but I'm not able to extract the control name for the shell command.

The wildcards are derived from the output file/pattern. That is why you only have the wildcard called case. You have to derive the control from that. Try replacing your shell statement with this:
    run:
        control = aDict[wildcards.case]
        shell(
            "gatk-launch Mutect2 -R {params.genome} -I {input[0]} "
            "-tumor {params.name_tumor} -I {input[1]} -normal {control} "
            "-L {input.target2} -O {output}"
        )

You could define control in params. Also, {input.target2} in the shell command would result in an error. Maybe it is supposed to be params.target?
rule gatk_Mutect2:
    input:
        get_files_somatic,
    output:
        "{case}.mutect2.vcf"
    params:
        genome="ref/hg19.fa",
        target= "chr12",
        name_tumor='{case}',
        control = lambda wildcards: aDict[wildcards.case]
    shell:
        """
        gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} \\
        -I {input[1]} -normal {params.control} -L {params.target} -O {output}
        """

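How a function-valued param gets its value can be sketched in plain Python. This is a simplified model under stated assumptions, not Snakemake's internals: `resolve_params` is a hypothetical helper, and a SimpleNamespace stands in for the wildcards object that Snakemake passes to the lambda for each job.

```python
from types import SimpleNamespace

aDict = {"120": "121"}  # tumor : normal, as in the question

params = {
    "name_tumor": "{case}",                              # plain string pattern
    "control": lambda wildcards: aDict[wildcards.case],  # resolved per job
}


def resolve_params(params, wildcards):
    """Call function-valued params with wildcards; format string params."""
    resolved = {}
    for key, value in params.items():
        if callable(value):
            resolved[key] = value(wildcards)
        else:
            resolved[key] = value.format(**vars(wildcards))
    return resolved


# Stand-in for the wildcards derived from the output "120.mutect2.vcf".
wildcards = SimpleNamespace(case="120")
resolved = resolve_params(params, wildcards)
```

The control name is derived from the only wildcard that exists, case, which is why it belongs in params rather than in wildcards.
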
snakemake how to encode pair analisys

I want to run GATK recalibration on paired samples (tumor and normal). I need to parse the data using pandas. This is what I wrote:
expand("mapped_reads/merged_samples/{sample[1][tumor]}/{sample[1][tumor]}_{sample[1][normal]}.bam", sample=read_table(config["conditions"], ",").iterrows())
this is the condition file:
432,433
434,435
I wrote this rule:
rule gatk_RealignerTargetCreator:
input:
"mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
"mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
output:
"mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"
params:
genome=config['reference']['genome_fasta'],
mills= config['mills'],
ph1_indels= config['know_phy'],
log:
"mapped_reads/merged_samples/logs/{tumor}_{normal}.realign_info.log"
threads: 8
shell:
"gatk -T RealignerTargetCreator -R {params.genome} {params.custom} "
"-nt {threads} "
"-I {wildcard.tumor} -I {wildcard.normal} -known {params.ph1_indels} "
"-o {output} >& {log}"
I have this error:
InputFunctionException in line 17 of /home/maurizio/Desktop/TEST_exome/rules/samfiles.rules:
KeyError: '432/432_433'
Wildcards:
sample=432/432_433
this is the samfiles.rules:
rule samtools_merge_bam:
"""
Merge bam files for multiple units into one for the given sample.
If the sample has only one unit, files will be copied.
"""
input:
lambda wildcards: expand("mapped_reads/bam/{unit}_sorted.bam",unit=config["samples"][wildcards.sample])
output:
"mapped_reads/merged_samples/{sample}.bam"
benchmark:
"benchmarks/samtools/merge/{sample}.txt"
run:
if len(input) > 1:
shell("/illumina/software/PROG2/samtools-1.3.1/samtools merge {output} {input}")
else:
shell("cp {input} {output} && touch -h {output}")

I can only guess because you don't show all the relevant rules, but I would say the error occurs because the rule samtools_merge_bam also matches some later BAM file that follows the pattern {tumor}/{tumor}_{normal}...
As a solution, you have to resolve this ambiguity (see the Snakemake tutorial). For example, you can constrain the wildcard of samtools_merge_bam so that it does not contain any slashes:
wildcard_constraints:
    sample="[^/]+"
You can put the constraint either globally or inside your samtools_merge_bam rule.
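The effect of the constraint can be checked with ordinary regular expressions. The sketch below is a simplified model in plain Python, assuming an unconstrained wildcard behaves like `.+`; it is not Snakemake's own matching code.

```python
import re

requested = "mapped_reads/merged_samples/432/432_433.bam"

# Unconstrained: the wildcard may contain slashes, so the paired file
# is (wrongly) claimed with sample="432/432_433".
unconstrained = re.fullmatch(
    r"mapped_reads/merged_samples/(?P<sample>.+)\.bam", requested
)

# Constrained with [^/]+: the paired file no longer matches...
constrained = re.fullmatch(
    r"mapped_reads/merged_samples/(?P<sample>[^/]+)\.bam", requested
)

# ...while a regular per-sample file still does.
plain = re.fullmatch(
    r"mapped_reads/merged_samples/(?P<sample>[^/]+)\.bam",
    "mapped_reads/merged_samples/432.bam",
)
```

The unconstrained match reproduces the sample=432/432_433 value from the KeyError above, which is the lookup that fails in config["samples"].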
