I'm writing a pipeline with Snakemake and the program can't identify the rule stringtie. I can't find what I'm doing wrong. I already runned the rule fastp and star, the problem is specific with the stringtie rule.
include:
'config.py'
rule all:
input:
expand(FASTP_DIR + "{sample}R{read_no}.fastq",sample=SAMPLES ,read_no=['1', '2']), #fastp
expand(STAR_DIR + STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam",sample=SAMPLES), #STAR
expand(STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf", sample=SAMPLES),
GTF_DIR + "path_samplesGTF.txt"
rule fastp:
input:
R1= DATA_DIR + "{sample}R1_001.fastq.gz",
R2= DATA_DIR + "{sample}R2_001.fastq.gz"
output:
R1out= FASTP_DIR + "{sample}R1.fastq",
R2out= FASTP_DIR + "{sample}R2.fastq"
params:
data_dir = DATA_DIR,
name_sample = "{sample}"
log: FASTP_LOG + "{sample}.html"
message: "Executando o programa FASTP"
run:
shell('fastp -i {input.R1} -I {input.R2} -o {output.R1out} -O {output.R2out} \
-h {log} -j {log}')
shell("find {params.data_dir} -type f -name '{params.name_sample}*' -delete ")
rule star:
input:
idx_star = IDX_DIR,
R1 = FASTP_DIR + "{sample}R1.fastq",
R2 = FASTP_DIR + "{sample}R2.fastq",
parameters = "parameters.txt",
params:
outdir = STAR_DIR + "output/{sample}/{sample}",
star_dir = STAR_DIR,
star_sample = '{sample}'
# threads: 18
output:
out = STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam"
#run_time = STAR + "log/star_run.time"
# log: STAR_LOG
# benchmark: BENCHMARK + "star/{sample_star}"
run:
shell("STAR --runThreadN 12 --genomeDir {input.idx_star} \
--readFilesIn {input.R1} {input.R2} --outFileNamePrefix {params.outdir}\
--parametersFiles {input.parameters} \
--quantMode TranscriptomeSAM GeneCounts \
--genomeChrBinNbits 12")
# shell("find {params.star_dir} -type f ! -name
'{params.star_sample}Aligned.sortedByCoord.out.bam' -delete")
rule stringtie:
input:
star_output = STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam"
output:
stringtie_output = STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf"
run:
shell("stringtie {input.star_output} -o {output.stringtie_output} \
-v -p 12 ")
rule grep_gtf:
input:
list_gtf = STRINGTIE_DIR
output:
paths = GTF_DIR + "path_samplesGTF.txt"
shell:
"find {input.list_gtf} | grep .gtf > {output.paths}"
This is the output I get with the option dry-run (flag -n)
Building DAG of jobs...
Job counts:
count jobs
1 all
1 grep_gtf
2
[Fri Apr 17 15:59:24 2020]
rule grep_gtf:
input: /homelocal/boralli/workdir/pipeline_v4/STRINGTIE/
output: /homelocal/boralli/workdir/pipeline_v4/GTF/path_samplesGTF.txt
jobid: 1
find /homelocal/boralli/workdir/pipeline_v4/STRINGTIE/ | grep .gtf >
/homelocal/boralli/workdir/pipeline_v4/GTF/path_samplesGTF.txt
[Fri Apr 17 15:59:24 2020]
localrule all:
input: /homelocal/boralli/workdir/pipeline_v4/GTF/path_samplesGTF.txt
jobid: 0
Job counts:
count jobs
1 all
1 grep_gtf
2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
I really don't know whats going on. The same pipeline worked before.
In addition to Troy's comment:
You specify as input of your rule grep_gtf a directory. Since that directory probably already exists, the rule stringtie does not need to be executed before running grep_gtf.
Using a directory as input isn't really a good idea. If you need the outputs of rule stringtie before executing rule grep_gtf, i suggest you specify the output files of rule stringtie as input of rule grep_gtf.
So your rule grep_gtf should be something like:
rule grep_gtf:
input:
expand(STRINGTIE_DIR + "/{sample}/{sample}Aligned.sortedByCoord.out.gtf", sample=SAMPLES)
output:
paths = GTF_DIR + "path_samplesGTF.txt"
shell:
"find {STRINGTIE_DIR} | grep .gtf > {output.paths}"
EDIT:
I think there's a bad copy/paste in rule all where there is twice STAR_DIR:
expand(STAR_DIR + STAR_DIR + "output/{sample}/{sample}Aligned.sortedByCoord.out.bam",sample=SAMPLES), #STAR
I also think there is a misunderstanding on the snakemake "workflow" concept. You do not need to specify the outputs of all rules in rule all. You only need to specify the last file of the workflow. Snakemake will decide which rules need to be run in order to achieve the creation of the final file. I don't really understand why your snakemake does not want to build the gtf files since you ask for them in rule all but I do see why rule grep_gtf does not need the output of rule stringtie to run.
Related
I am trying to map 3 samples: SRR14724459, SRR14724473, and a combination of both SRR14724459_SRR14724473.
I have 2 rules that share a similar file type output (.bam), and even tho I am naming their wildcards different, I still get an ambiguity error:
Building DAG of jobs...
AmbiguousRuleException:
Rules map_hybrid and map are ambiguous for the file /gpfs/scratch/hpadre/snakemake_outputs/mapped_dir/SRR14724459_SRR14724473__0.9.bam.
Expected input files:
map_hybrid: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R1_trimmed_SRR14724473_R1_trimmed_0.9.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq
map: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473__0.9/SRR14724459_SRR14724473__0.9_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473__0.9/SRR14724459_SRR14724473__0.9_R2_trimmed.fastq
From my Snakefile, here are my variables:
all_samples: ['SRR14724459', 'SRR14724473']
sample1: ['SRR14724459']
sample2: ['SRR14724473']
titration:[0.9]
This is my rule all:
rule all:
expand(MAPPED_DIR + "/{sample}.bam", sample=all_samples),
expand(MAPPED_DIR + "/{sample1}_{sample2}_{titration}.bam", zip, sample1=list_a_titrations, sample2=list_b_titrations, titration=tit_list)
This is my map rule:
rule map:
input:
r1 = TRIMMED_DIR + "/{sample}/{sample}_R1_trimmed.fastq",
r2 = TRIMMED_DIR + "/{sample}/{sample}_R2_trimmed.fastq"
output:
MAPPED_DIR + "/{sample}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample}_bwa_benchmark.txt"
wildcard_constraints:
word='[^0-9]*'
shell:
"""
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
"""
This is my map_hybrid:
rule map_hybrid:
input:
r1 = TRIMMED_DIR + "/{sample1}_{sample2}_{titration}/{sample1}_R1_trimmed_{sample2}_R1_trimmed_{titration}.fastq",
r2 = TRIMMED_DIR + "/{sample1}_{sample2}_{titration}/{sample1}_R2_trimmed_{sample2}_R2_trimmed_{titration}.fastq"
output:
MAPPED_DIR + "/{sample1}_{sample2}_{titration}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample1}_{sample2}_{titration}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample1}_{sample2}_{titration}_bwa_benchmark.txt"
shell:
"""
set +e
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
exitcode=$?
if [ $exitcode -eq 1 ]
then
exit 1
else
exit 0
fi
"""
The expected input files SHOULD BE as so:
Expected input files:
map_hybrid: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R1_trimmed_SRR14724473_R1_trimmed_0.9.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq
map: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_R2_trimmed.fastq
and also
/home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724473_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724473_R2_trimmed.fastq
Your rules map and map_hybrid can both produce your desired files, for snakemake they are ambigious rules.
The names of the wildcards is irrelevant, what is relevant is whether the wildcards in both rules can match the same output filepath.
That is the case here.
While rule map_hybrid can produce the output file SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq where the wildcard matches are
sample1=SRR14724459
sample2=SRR14724473
the rule map can also produce this output with the wildcard match
sample=SRR14724459_R2_trimmed_SRR14724473
To prevent ambiguity you can use the wildcard_constraint, so that the {sample} wildcard only matches strings starting with SRR followed by numbers:
sample='SRR\d+'
Integrated into your rule map:
rule map:
input:
r1 = TRIMMED_DIR + "/{sample}/{sample}_R1_trimmed.fastq",
r2 = TRIMMED_DIR + "/{sample}/{sample}_R2_trimmed.fastq"
output:
MAPPED_DIR + "/{sample}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample}_bwa_benchmark.txt"
wildcard_constraints:
word='[^0-9]*',
sample='SRR\d+'
shell:
"""
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
"""
it should resolve the ambiguity.
I am new to Snakemake and I am trying to develop some pipelines. I am encountering some problems when I use wildcards, trying to automate my bioinformatic analyses as much as possible. I run into troubles when the pipeline becomes more complex (as shown below). It looks like Snakemake does not resolve the wildcards correctly. During a dry run of the Snakefile, the wildcards values look correct in the executions of some rules. However, the same wildcards lead to an error in a different step(rule) of the pipeline, and I cannot figure out why. Below I provide the code and the output message of a dry run.
num=["327905-LR-41624_normal","327907-LR-41624_tumor"]
num_normal=["327905-LR-41624"]
num_tumor=["327907-LR-41624"]
path="/path/to/Snakemake/"
genome="/path/to/references_genome/Mus_musculus.GRCm38.dna_rm.toplevel.fa"
rule all:
input:
expand("/path/to/Snakemake/AS-{num_tum}_tumor_no_dupl_sort_RG_LB.bam",num_tum=num_tumor),
expand("/path/to/Snakemake/AS-{num_norm}_normal_no_dupl_sort_RG_LB.bam",num_norm=num_normal)
ruleorder: samtools_sort > remove_duplicates > samtools_index #> add_readgroup_tumor > add_readgroup_normal
rule trim_galore:
input:
r1="/path/to/Snakemake/AS-{num}_R1.fastq",
r2="/path/to/Snakemake/AS-{num}_R2.fastq"
output:
"/path/to/Snakemake/AS-{num }_R1_val_1.fq",
"/path/to/Snakemake/AS-{num }_R2_val_2.fq"
shell:
"module load trim-galore/0.5.0 ; module load pypy/2.7-6.0.0 ; trim_galore --output_dir /path/to/Snakemake/ --paired {input.r1} {input.r2} "
rule bwa_mem:
input:
R1="/path/to/Snakemake/AS-{num}_R1_val_1.fq",
R2="/path/to/Snakemake/AS-{num}_R2_val_2.fq"
output:
"/path/to/Snakemake/AS-{num}.bam"
shell:
"module load samtools/default ; module load bwa/0.7.8 ; bwa mem {genome} {input.R1} {input.R2} | samtools view -h -b > {output} "
rule samtools_sort:
input:
"/path/to/Snakemake/AS-{num}.bam"
output:
"/path/to/Snakemake/AS-{num}_sort.bam"
shell:
"module load samtools/default ; samtools sort -n -O BAM {input} > {output} "
rule remove_duplicates:
input:
"/path/to/Snakemake/AS-{num}_sort.bam"
output:
outbam="/path/to/Snakemake/AS-{num}_no_dupl_sort.bam",
metrics="/path/to/Snakemake/AS-{num}_dupl_metrics.txt"
shell:
"module load gatk/4.0.9.0 ; gatk MarkDuplicates -I {input} -O {output.outbam} -M {output.metrics} --REMOVE_DUPLICATES=true "
rule samtools_index:
input:
"/path/to/Snakemake/AS-{num}_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{num}_no_dupl_sort.bam.bai"
shell:
"module load samtools/default ; samtools index {input} "
rule add_readgroup_normal:
input:
"/path/to/Snakemake/AS-{num_normal}_normal_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{num_normal}_normal_no_dupl_sort_RG_LB.bam"
shell:
"module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { num_normal } -PU { num_normal } -SM NORMAL -I { input } -O {output} "
rule add_readgroup_tumor:
input:
"/path/to/Snakemake/AS-{num_tumor}_tumor_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{num_tumor}_tumor_no_dupl_sort_RG_LB.bam"
shell:
"module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { num_tumor } -PU { num_tumor } -SM TUMOR -I { input } -O {output} "
When I test the Snakefile with the command:
.local/bin/snakemake -s Snakefile_pipeline --dryrun
I get the following:
**Building DAG of jobs...**
**Job counts:**
**count jobs
1 add_readgroup_normal
1 add_readgroup_tumor
1 all
2 bwa_mem
2 remove_duplicates
2 samtools_sort
2 trim_galore
11**
**[Mon Apr 8 16:14:27 2019]
rule trim_galore:
input: /path/to/Snakemake/AS-327907-LR-41624_tumor_R1.fastq, /path/to/Snakemake/AS-327907-LR-41624_tumor_R2.fastq
output: /path/to/Snakemake/AS-327907-LR-41624_tumor_R1_val_1.fq, /path/to/Snakemake/AS-327907-LR-41624_tumor_R2_val_2.fq
jobid: 9
wildcards: num=327907-LR-41624_tumor**
**[Mon Apr 8 16:14:27 2019]
rule trim_galore:
input: /path/to/Snakemake/AS-327905-LR-41624_normal_R1.fastq, /path/to/Snakemake/AS-327905-LR-41624_normal_R2.fastq
output: /path/to/Snakemake/AS-327905-LR-41624_normal_R1_val_1.fq, /path/to/Snakemake/AS-327905-LR-41624_normal_R2_val_2.fq
jobid: 10
wildcards: num=327905-LR-41624_normal**
**[Mon Apr 8 16:14:27 2019]
rule bwa_mem:
input: /path/to/Snakemake/AS-327905-LR-41624_normal_R1_val_1.fq, /path/to/Snakemake/AS-327905-LR-41624_normal_R2_val_2.fq
output: /path/to/Snakemake/AS-327905-LR-41624_normal.bam
jobid: 8
wildcards: num=327905-LR-41624_normal**
**[Mon Apr 8 16:14:27 2019]
rule bwa_mem:
input: /path/to/Snakemake/AS-327907-LR-41624_tumor_R1_val_1.fq, /path/to/Snakemake/AS-327907-LR-41624_tumor_R2_val_2.fq
output: /path/to/Snakemake/AS-327907-LR-41624_tumor.bam
jobid: 7
wildcards: num=327907-LR-41624_tumor**
**[Mon Apr 8 16:14:27 2019]
rule samtools_sort:
input: /path/to/Snakemake/AS-327907-LR-41624_tumor.bam
output: /path/to/Snakemake/AS-327907-LR-41624_tumor_sort.bam
jobid: 5
wildcards: num=327907-LR-41624_tumor**
**[Mon Apr 8 16:14:27 2019]
rule samtools_sort:
input: /path/to/Snakemake/AS-327905-LR-41624_normal.bam
output: /path/to/Snakemake/AS-327905-LR-41624_normal_sort.bam
jobid: 6
wildcards: num=327905-LR-41624_normal**
**[Mon Apr 8 16:14:27 2019]
rule remove_duplicates:
input: /path/to/Snakemake/AS-327907-LR-41624_tumor_sort.bam
output: /path/to/Snakemake/AS-327907-LR-41624_tumor_no_dupl_sort.bam, /path/to/Snakemake/AS-327907-LR-41624_tumor_dupl_metrics.txt
jobid: 3
wildcards: num=327907-LR-41624_tumor**
**[Mon Apr 8 16:14:27 2019]
rule remove_duplicates:
input: /path/to/Snakemake/AS-327905-LR-41624_normal_sort.bam
output: /path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort.bam, /path/to/Snakemake/AS-327905-LR-41624_normal_dupl_metrics.txt
jobid: 4
wildcards: num=327905-LR-41624_normal**
**[Mon Apr 8 16:14:27 2019]
rule add_readgroup_normal:
input: /path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort.bam
output: /path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort_RG_LB.bam
jobid: 2
wildcards: num_normal=327905-LR-41624**
**RuleException in line 93 of /home/l136n/Snakefile_mapping_snv_call_pipeline2:
NameError: The name ' num_normal ' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}**
I have googled the error but found little help. Also, I double checked the pipeline for any incosistency. What I expect as output is indicated in the rule "all". The rules "add_readgroup_normal" and "add_readgroup_tumor" are supposed to take different subsets of input files, generated by the previous steps, which are run on all files. I wonder if the problem arises somehow because of this separation into 2 subsets.
I repeat that I am quite new to Snakemake, so I might be missing something silly somewhere! Any help would be really appreciated, as I am completely stuck!
Thank you so much in advance!
num=["327905-LR-41624_normal","327907-LR-41624_tumor"]
normal=["327905-LR-41624_normal"]
num_tumor=["327907-LR-41624_tumor"]
path="/path/to/Snakemake/"
genome="/icgc/dkfzlsdf/analysis/B210/references_genome/Mus_musculus.GRCm38.dna_rm.toplevel.fa"
rule all:
input:
"/path/to/Snakemake/AS-327905-LR-41624_normal_R1_val_1.fq",
"/path/to/Snakemake/AS-327905-LR-41624_normal_R2_val_2.fq",
"/path/to/Snakemake/AS-327907-LR-41624_tumor_R1_val_1.fq",
"/path/to/Snakemake/AS-327907-LR-41624_tumor_R2_val_2.fq",
"/path/to/Snakemake/AS-327905-LR-41624_normal_no_dupl_sort.bam.bai",
"/path/to/Snakemake/AS-327907-LR-41624_tumor_no_dupl_sort.bam.bai",
"/path/to/Snakemake/AS-327905-LR-41624_normal_RG.bam"
"/path/to/Snakemake/AS-327907-LR-41624_tumor_RG.bam"
rule trim_galore:
input:
r1="/path/to/Snakemake/AS-{num}_R1.fastq",
r2="/path/to/Snakemake/AS-{num}_R2.fastq"
output:
"/path/to/Snakemake/AS-{num }_R1_val_1.fq",
"/path/to/Snakemake/AS-{num }_R2_val_2.fq"
shell:
"module load trim-galore/0.5.0 ; module load pypy/2.7-6.0.0 ; trim_galore --output_dir /path/to/Snakemake/ --paired {input.r1} {input.r2} "
rule bwa_mem:
input:
R1="/path/to/Snakemake/AS-{num}_R1_val_1.fq",
R2="/path/to/Snakemake/AS-{num}_R2_val_2.fq"
output:
"/path/to/Snakemake/AS-{num}.bam"
shell:
"module load samtools/default ; module load bwa/0.7.8 ; bwa mem {genome} {input.R1} {input.R2} | samtools view -h -b > {output} "
rule samtools_sort:
input:
"/path/to/Snakemake/AS-{num}.bam"
output:
"/path/to/Snakemake/AS-{num}_sort.bam"
shell:
"module load samtools/default ; samtools sort -n -O BAM {input} > {output} "
rule remove_duplicates:
input:
"/path/to/Snakemake/AS-{num}_sort.bam"
output:
outbam="/path/to/Snakemake/AS-{num}_no_dupl_sort.bam",
metrics="/path/to/Snakemake/AS-{num}_dupl_metrics.txt"
shell:
"module load gatk/4.0.9.0 ; gatk MarkDuplicates -I {input} -O {output.outbam} -M {output.metrics} --REMOVE_DUPLICATES=true "
rule samtools_index:
input:
"/path/to/Snakemake/AS-{num}_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{num}_no_dupl_sort.bam.bai"
shell:
"module load samtools/default ; samtools index {input} "
rule add_readgroup_normal:
input:
"/path/to/Snakemake/AS-{normal}_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{normal}_RG.bam"
shell:
"module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { wildcards.normal } -PU { wildcards.normal } -SM NORMAL -I { input } -O {output} "
rule add_readgroup_tumor:
input:
"/path/to/Snakemake/AS-{num}_no_dupl_sort.bam"
output:
"/path/to/Snakemake/AS-{num_,'.*tumor.*'}_RG.bam"
shell:
"module load gatk/4.0.9.0 ; gatk AddOrReplaceReadGroups -PL Illumina -LB { wildcards.num } -PU { wildcards.num } -SM TUMOR -I { input } -O {output} "
Error:
Building DAG of jobs...
MissingInputException in line 37 of /home/l136n/Snakefile_mapping_snv_call_pipeline2b1:
Missing input files for rule trim_galore:
/path/to/Luca/Snakemake/AS-327905-LR-41624_normal_RG.bam/path/to/Luca/Snakemake/AS-327907-LR-41624_tumor_RG_R1.fastq
/path/to/Snakemake/AS-327905-LR-41624_normal_RG.bam/path/to/Luca/Snakemake/AS-327907-LR-41624_tumor_RG_R2.fastq
Wildcards are accessible in shell using syntax {wilcards.var}, not {var}. You have the latter in rule add_readgroup_normal.
Source.
I thought I would provide the solution, even if the post is a bit old now. The error was simply due to the presence of spaces inside "{ wildcards.var }".
I try to use one tool and I need to use a wildcard present on input.
This is an example:
aDict = {"120":"121" } #tumor : normal
rule all:
input: expand("{case}.mutect2.vcf",case=aDict.keys())
def get_files_somatic(wildcards):
case = wildcards.case
control = aDict[wildcards.case]
return [case + ".sorted.bam", control + ".sorted.bam"]
rule gatk_Mutect2:
input:
get_files_somatic,
output:
"{case}.mutect2.vcf"
params:
genome="ref/hg19.fa",
target= "chr12",
name_tumor='{case}'
log:
"logs/{case}.mutect2.log"
threads: 8
shell:
" gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} -I {input[1]} -normal {wildcards.control}"
" -L {params.target} -O {output}"
I Have this error:
'Wildcards' object has no attribute 'control'
So I have a function with case and control. I'm not able to extract code.
The wildcards are derived from the output file/pattern. That is why you only have the wildcard called case. You have to derive the control from that. Try replacing your shell statement with this:
run:
control = aDict[wildcards.case]
shell(
"gatk-launch Mutect2 -R {params.genome} -I {input[0]} "
"-tumor {params.name_tumor} -I {input[1]} -normal {control} "
"-L {input.target2} -O {output}"
)
You could define control in params. Also {input.target2} in shell command would result in error. May be it's supposed to be params.target?
rule gatk_Mutect2:
input:
get_files_somatic,
output:
"{case}.mutect2.vcf"
params:
genome="ref/hg19.fa",
target= "chr12",
name_tumor='{case}',
control = lambda wildcards: aDict[wildcards.case]
shell:
"""
gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} \\
-I {input[1]} -normal {params.control} -L {params.target} -O {output}
"""
I want to use gatk recalibration using pair sample ( tumor and normal). I need to parse the data using pandas. That is what I wroted.
expand("mapped_reads/merged_samples/{sample[1][tumor]}/{sample[1][tumor]}_{sample[1][normal]}.bam", sample=read_table(config["conditions"], ",").iterrows())
this is the condition file:
432,433
434,435
I wrote this rule:
rule gatk_RealignerTargetCreator:
input:
"mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
"mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
output:
"mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"
params:
genome=config['reference']['genome_fasta'],
mills= config['mills'],
ph1_indels= config['know_phy'],
log:
"mapped_reads/merged_samples/logs/{tumor}_{normal}.realign_info.log"
threads: 8
shell:
"gatk -T RealignerTargetCreator -R {params.genome} {params.custom} "
"-nt {threads} "
"-I {wildcard.tumor} -I {wildcard.normal} -known {params.ph1_indels} "
"-o {output} >& {log}"
I have this error:
InputFunctionException in line 17 of /home/maurizio/Desktop/TEST_exome/rules/samfiles.rules:
KeyError: '432/432_433'
Wildcards:
sample=432/432_433
this is the samfiles.rules:
rule samtools_merge_bam:
"""
Merge bam files for multiple units into one for the given sample.
If the sample has only one unit, files will be copied.
"""
input:
lambda wildcards: expand("mapped_reads/bam/{unit}_sorted.bam",unit=config["samples"][wildcards.sample])
output:
"mapped_reads/merged_samples/{sample}.bam"
benchmark:
"benchmarks/samtools/merge/{sample}.txt"
run:
if len(input) > 1:
shell("/illumina/software/PROG2/samtools-1.3.1/samtools merge {output} {input}")
else:
shell("cp {input} {output} && touch -h {output}")
I can only guess because you don't show all relevant rule, but I would say the error occurs because the rule samtools_merge_bam also applies to some later bam file where you have the pattern {tumor}/{tumor}_{normal}...
As a solution, you have to resolve this ambiguity (see the snakemake tutorial). For example, you can constrain the wildcard of samtools_merge_bam to not contain any slashes.
wildcard_constraints:
sample="[^/]+"
You can put the constraint either globally or inside your samtools_merge_bam rule.
I need to Validate the ID with pattern (Abbbbb-yyy)
Example :
ID := A12345-789 B98765-123 C58730-417
VARIANT := test1 test2 test3
Build and post processing will generate files depends up on VARIANTS :
`sw_main_test1.hex ,sw_main_test1.hex and sw_main_test1.hex `
.PHONY : SW_TEST
SW_TEST :
if <ID is correct>
cp sw_main_test1.hex --> A12345-789.hex
cp sw_main_test2.hex --> B98765-123.hex
cp sw_main_test3.hex --> C58730-417.hex
I am facing issue in validating the ID with pattern
`Abbbbb-yyy.txt`
Where : A=[A-Z]; b=[0-9]; y=[0-9]
Please let me know how to verify ID is correct using regular expressions inside the Makefile using any tool or utility
In this script, I assume, you get your ID from a file (I called it here someidcontent.txt). Then you could write a script like this (assuming, you only working on Linux).
getID = $(shell cat someidcontent.txt)
all:
if [ "$(getID)" == "1234567890" ]; then \
cp -v output.txt ./delivery/$(getID).txt; \
fi
.PHONY: all
Edit
I made a mistake in my previous script. I did not check, if the ID is correct. Now my newer script does this: I read from a file the ID and check it for correctness. If ID is correct, then some file will be copied into target dir with ID number.
# get ID from a file
getID := $(shell cat someidcontent.txt)
# need a hack for successful checking
idToCheck := $(getID)
# check procedure
checkID := $(shell echo $(idToCheck) | grep "[A-Z][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9]$$")
all:
ifeq "$(checkID)" "$(idToCheck)"
echo found
cp -v output.txt ./delivery/$(idToCheck).txt;
endif
.PHONY: all
Edit 2
Ok, this was a little bit challenging, but I solved it somehow. Maybe there are also other ways to solve this better. In my solution, I assume that the file with IDs and source filenames look like this (in other words, this is the content of my someidcontent.txt):
A2345-678:output1.txt
B3456-123:output.txt
C0987-987:thirdfile.txt
And this is my makefile with comments for additional explanation. I hope, they are sufficient
# retrieve id and filename data from other file
listContent := $(shell cat someidcontent.txt)
# extract only IDs from other files
checkIDs = $(shell echo $(listContent) | grep -o "[A-Z][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9]")
all:
# iterate only over IDs
# first, give me the ID
# second retrieve the filename part for successful copy procedure
# and copy the file to the target dir with ID as filename
#$(foreach x,$(checkIDs), \
echo $(x); \
cp -v $(shell echo $(listContent) | grep -o "$(x):[A-Z0-9a-z\.]*" | sed "s/[-A-Z0-9]*://g") ./delivery/$(x).t$
)
.PHONY: all
You can check simple string patterns quite ok (don't want to say "nicely") from within make:
[A-F] := A B C D E F#
[a-f] := a b c d e f#
[A-Z] := $([A-F]) G H I J K L M N O P Q R S T U V W X Y Z#
[a-z] := $([a-f]) g h i j k l m n o p q r s t u v w x y z#
[0-9] := 0 1 2 3 4 5 6 7 8 9#
######################################################################
##### $(call explode,_stringlist_,_string_)
## Insert a blank after every occurrence of the strings from _stringlist_ in _string_.
## This function serves mainly to convert a string into a list.
## Example: `$(call explode,0 1 2 3 4 5 6 7 8 9,0xl337c0de)` --> `0 xl3 3 7 c0 de`
explode = $(if $1,$(subst $(firstword $1),$(firstword $1) ,$(call explode,$(wordlist 2,255,$1),$2)),$2)
ID := A12345-789 B98765-123 C58730-417 123456+328
############################################################
# $(call check-id,_id-string_)
# Return 'malformed' or the given id
check-id = $(if $(call check-id-1,$(call explode,- $([A-Z]) $([0-9]),$1)),malformed,$1)
check-id-1 = $(strip $(filter-out $([A-Z]),$(wordlist 1,1,$1)) $(filter-out $([0-9]),$(wordlist 2,6,$1)) $(filter-out -,$(word 7,$1)) $(filter-out $([0-9]),$(wordlist 8,10,$1)) )
$(info $(foreach w,$(ID),$(call check-id,$(w))))