Snakemake tabular config - bioinformatics

I'm using Snakemake with a tabular configuration. The table consists of rows that I first extract from a very large sample overview; the resulting samples.tsv has a lot of columns. I read the samples file like this, somewhere in my Snakefile:
samples = pd.read_table('samples.tsv').set_index('samples', drop=False)
When I run snakemake:
snakemake --cluster-config cluster.json --cluster "qsub -l nodes={cluster.nodes}:ppn={cluster.ppn}" --jobs 256
I get the following error:
16:51 nlv24077#kiato ~/temp/test_snakemake > run_snakemake.sh
Building DAG of jobs...
MissingInputException in line 6 of /home/nlv24077/temp/test_snakemake/rseqc.smk:
Missing input files for rule bam_stat:
analyzed/I7_index.STAR.genome.sorted.bam
The strange thing is that I7_index is one of the column names in my samples table. Why is Snakemake using the column names as sample names?
Below I can show you part of the samples table (I can't show all data publicly):
Edit:
I was calling the samples like this:
rule bcl2fastq:
    input:
        config['bcl_dir']
    output:
        expand([os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
                os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')],
               sample=samples)
    threads: 6
    shell:
        '''
        # Run bcl2fastq
        ...
Whereas I should have used:
rule bcl2fastq:
    input:
        config['bcl_dir']
    output:
        expand([os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
                os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')],
               sample=samples['samples'])
    threads: 6
    shell:
        '''
        # Run bcl2fastq
        ...
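For reference, the cause: iterating over a pandas DataFrame yields its column labels, so expand(..., sample=samples) substituted the column names (such as I7_index) instead of the sample IDs. A minimal sketch (with made-up column values, not my real table):

import pandas as pd

# Hypothetical stand-in for samples.tsv
samples = pd.DataFrame(
    {'samples': ['sampleA', 'sampleB'], 'I7_index': ['ATCACG', 'CGATGT']}
).set_index('samples', drop=False)

print(list(samples))             # ['samples', 'I7_index'] -> what the wrong expand() iterated over
print(list(samples['samples']))  # ['sampleA', 'sampleB']  -> the intended sample names
print(list(samples.index))       # ['sampleA', 'sampleB']  -> an equivalent alternative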
Thanx JeeYem.

Related

Snakemake on cluster error: 'Wildcards' object has no attribute 'output'

When I submit Snakemake to my cluster, I run into the error 'Wildcards' object has no attribute 'output', similar to this earlier question: 'Wildcards' object has no attribute 'output'. I'm wondering if you have any suggestions for how to make this work on the cluster.
While my rule annotate_snps works when I test it locally, I get the following error on the cluster:
input: results/CI226380_S4/vars/CI226380_S4_bwa_H37Rv_gatk.vcf.gz
output: results/CI226380_S4/vars/CI226380_S4_bwa_H37Rv_gatk_rename.vcf.gz, results/CI226380_S4/vars/CI226380_S4_bwa_H37Rv_gatk_tmp.vcf.gz, results/CI226380_S4/vars/CI226380_S4_bwa_H37Rv_gatk_ann.vcf.gz
log: results/CI226380_S4/vars/CI226380_S4_bwa_H37Rv_annotate_snps.log
jobid: 1139
wildcards: samp=CI226380_S4, mapper=bwa, ref=H37Rv
WorkflowError in line 173 of /oak/stanford/scg/lab_jandr/walter/tb/mtb/workflow/Snakefile:
'Wildcards' object has no attribute 'output'
My rule is defined as:
rule annotate_snps:
    input:
        vcf='results/{samp}/vars/{samp}_{mapper}_{ref}_gatk.vcf.gz'
    log:
        'results/{samp}/vars/{samp}_{mapper}_{ref}_annotate_snps.log'
    output:
        rename_vcf=temp('results/{samp}/vars/{samp}_{mapper}_{ref}_gatk_rename.vcf.gz'),
        tmp_vcf=temp('results/{samp}/vars/{samp}_{mapper}_{ref}_gatk_tmp.vcf.gz'),
        ann_vcf='results/{samp}/vars/{samp}_{mapper}_{ref}_gatk_ann.vcf.gz'
    params:
        bed=config['bed_path'],
        vcf_header=config['vcf_header']
    shell:
        '''
        # Rename Chromosome to be consistent with snpEff/Ensembl genomes.
        zcat {input.vcf} | sed 's/NC_000962.3/Chromosome/g' | bgzip > {output.rename_vcf}
        tabix {output.rename_vcf}
        # Run snpEff
        java -jar -Xmx8g {config[snpeff]} eff {config[snpeff_db]} {output.rename_vcf} -dataDir {config[snpeff_datapath]} -noStats -no-downstream -no-upstream -canon > {output.tmp_vcf}
        # Also use bed file to annotate vcf
        bcftools annotate -a {params.bed} -h {params.vcf_header} -c CHROM,FROM,TO,FORMAT/PPE {output.tmp_vcf} > {output.ann_vcf}
        '''
Thank you so much in advance!
The rule definition itself looks consistent; the only thing that stands out is the multiple references to the contents of config, e.g. config[snpeff].
One thing to check is whether the config definition on the single machine and on the cluster is the same. If it isn't, some entry might be confusing Snakemake, e.g. if somehow config[snpeff] == "wildcards.output" (or something similar).
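As a minimal debugging sketch (assuming config is already populated via --configfile or a configfile: directive), you could dump the resolved config near the top of the Snakefile and diff the output from the local machine against the output from the cluster:
import json
# Debugging sketch: print the fully resolved config so it can be diffed
# between environments ('config' is the global Snakemake config dict).
print(json.dumps(dict(config), indent=2, sort_keys=True, default=str))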

Taskwarrior: How do I find the tasks that depend on a specific task?

How do I find out which task(s) depend on a specific task without reading the information of all tasks?
Reproduction
System
Version
$ task --version
2.5.1
.taskrc
# Taskwarrior program configuration file.
# Files
data.location=~/.task
alias.cal=calendar
rc.date.iso=Y-M-D
default.command=ready
journal.info=no
rc.regex=on
Here are the tasks that I created for testing purposes:
$ task list
ID Age Description Urg
1 2min Something to do 0
2 1min first do this 0
3 1min do this whenever you feel like it 0
3 tasks
Create the dependency from task#1 to task#2:
$ task 1 modify depends:2
Modifying task 1 'something to do'.
Modified 1 task.
$ task list
ID Age D Description Urg
2 4min first do this 8
3 4min do this whenever you feel like it 0
1 4min D Something to do -5
3 tasks
Goal
Now I want to find the tasks that are dependent on task#2, which should be task#1.
Trials
Unfortunately, this does not result in any matches:
$ task list depends:2
No matches.
$ # I can filter by blocked tasks
$ task blocked
ID Deps Age Description
1 2 18min Something to do
1 task
$ # But when I want to list only the tasks that are blocked
$ # by task#2, task#3 is also returned
$ task blocked:2
[task ready ( blocked:2 )]
ID Age Description Urg
2 20min first do this 8
3 19min do this whenever you feel like it 0
2 tasks
Suggestions?
How would you approach this?
Parsing the Taskwarrior output with a script seems like overkill.
You have the right command but have actually run into a bug: the depends attribute does not work with the short ID; internally it is a comma-delimited string of UUIDs.
It will work if you use the UUID instead. Use task <id> _uuid to resolve an ID to its UUID.
$ task --version
2.5.1
# Create tasks
$ task rc.data.location: add -- Something to do
$ task rc.data.location: add -- first do this
$ task rc.data.location: add -- do this whenever you feel like it
$ task rc.data.location: list
ID Age Description Urg
1 - Something to do 1.8
2 - first do this 1.8
3 - do this whenever you feel like it 1.8
3 tasks
# Set up dependency
$ task rc.data.location: 1 modify depends:2
Modifying task 1 'Something to do'.
Modified 1 task.
# Query using depends:UUID
$ task rc.data.location: list "depends.has:$(task rc.data.location: _get 2.uuid)"
ID Age D Description Urg
1 - D Something to do -3.2
1 task
# Query using depends:SHORT ID
# This does not work, despite documentation. Likely a bug
$ task rc.data.location: list "depends.has:$(task rc.data.location: _get 2.id)"
No matches.
A small correction regarding your attempt to find blocked tasks: there is no blocked attribute, and you're using the ready report.
$ task blocked:2
[task ready ( blocked:2 )]
The ready report filters out exactly what we're looking for; the blocked report is what we need. To unmagickify this: these are simply useful default reports that have preset filters applied on top of task all.
$ task show filter | grep -e 'blocked' -e 'ready'
report.blocked.filter status:pending +BLOCKED
report.ready.filter +READY
report.unblocked.filter status:pending -BLOCKED
Blocked tasks will have the virtual tag +BLOCKED, which is mutually exclusive to +READY.
The blocked attribute doesn't exist; use task _columns to show the available attributes (e.g. depends). Unfortunately, the CLI parser is probably attempting to apply the filter blocked:2 and ends up ignoring it. For your workflow, the useful command is task blocked "depends.has:$(task _get 2.uuid)". It's advisable to write a shell function to make it easier to use:
#!/bin/bash
# Untested but gets the point across
function task_blocked {
    blocker=$1
    shift
    # Pass any remaining arguments through as extra filters
    task blocked "depends.has:$(task _get ${blocker}.uuid)" "$@"
}
# Find tasks of project "foo" that are blocked on task 2
task_blocked 2 project:foo
# Another project that is also impacted
task_blocked 2 project:bar
You could use this taskwarrior hook script that adds a "blocks" attribute to the tasks: https://gist.github.com/wbsch/a2f7264c6302918dfb30

'Wildcards' object has no attribute 'output'

I get an error for a rather simple rule. I have to write a task file for another program, which expects a TSV file. I read a number of parameters from my config file and write them to a file with a shell command.
Code:
rule create_tasks:
    output:
        temp("tasks_{sample}.tsv")
    params:
        ID="{sample}",
        file=lambda wc: samples["path"][wc.sample],
        bigwig=lambda wc: samples["bigwig"][wc.sample],
        ambig=lambda wc: samples["ambig"][wc.sample]
    shell:
        'echo -e "{params.ID}\t{params.file}" > {output}'
When I execute the workflow, I get the following error:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Job counts:
count jobs
1 create_tasks
1
[Mon Oct 12 14:48:15 2020]
rule create_tasks:
output: tasks_sampleA.tsv
jobid: 0
wildcards: sample=sampleA
echo -e "sampleA /Path/To/sampleA.bed " > tasks_sampleA.tsv
WorkflowError in line 23 of /path/to/workflow.snakefile:
'Wildcards' object has no attribute 'output'
File "/path/to/miniconda/envs/snakemake_submit/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 111, in run_jobs
File "/path/to/miniconda/envs/snakemake_submit/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 1233, in run
I should mention that two of the variables are empty and that the tabs/whitespace in the echo command are intended.
Does anybody have an explanation for why Snakemake is trying to find output in the wildcards? I am especially confused because it prints the correct command.
I've run into this same problem.
The issue is probably in how you invoked Snakemake from the command line.
For example, this was my Snakefile rule:
rule sort:
    input:
        "{file}.bam",
    output:
        "{file}.sorted.bam",
        "{file}.sorted.bai",
    shell:
        "sambamba sort {input}"
I don't even have params or wildcards explicitly anywhere in there.
But when I run it on my Slurm HPC I get the same error:
snakemake -j 10 -c "sbatch {cluster.params}" -u cluster.yaml
The Wildcards (note the capital "W") and params objects weren't from the rule.
They came from the cluster execution of the rule, and the error was thrown while parsing the cluster.yaml file: there was no cluster parameter specification in my cluster.yaml for the sort rule.
I fixed this by adding
sort:
    params: "..."
to my cluster.yaml file.
In your case, add cluster submission options under a create_tasks: ... entry.
You can also add a __default__: ... entry that supplies default submission parameters for any job that doesn't match another rule.
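For illustration, a cluster.yaml along those lines might look like the sketch below; the rule names come from the rules above, but the sbatch options are placeholder assumptions, not values from the question.
# Hypothetical cluster.yaml sketch - the sbatch options are placeholders
__default__:
    params: "--partition=normal --mem=2G --time=01:00:00"
create_tasks:
    params: "--partition=normal --mem=500M --time=00:10:00"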

snakemake running nanopolish and making it wait until previous rule is done

Hi, I can run the different steps of nanopolish with Snakemake. But when I run it, it gives an error that the index file created in the bwa rule isn't available yet. After it gives this error, it creates the very file the error was about. If I run Snakemake again without removing files, it works because the file is there. How can I tell Snakemake to wait with the next step until the previous one is done? I have googled for ways to solve this problem and all I could find was priority and ruleorder; I have used those, but it still doesn't work. Here is the script that I use:
ruleorder: bwa > nanopolish

rule bwa:
    input:
        "nanopolish/assembly.fasta"
    output:
        "nanopolish/draft.fa"
    conda:
        "envs/nanopolish.yaml"
    priority: 50
    shell:
        "bwa index {input} - > {output}"

rule nanopolish:
    input:
        "nanopolish/assembly.fasta",
        "zipped/zipped.gz"
    output:
        "nanopolish/reads.sorted.bam"
    conda:
        "envs/nanopolish.yaml"
    shell:
        "bwa mem -x ont2d {input} | samtools sort -o {output} -T reads.tmp"
You should take another look at the docs to properly understand the idea of Snakemake:
Rules describe how to create output files from input files.
A rule is not executed until all of its inputs exist, so all you have to do is add the output of the bwa rule to the input of nanopolish:
rule nanopolish:
    input:
        "nanopolish/assembly.fasta",
        "nanopolish/draft.fa",  # <-- output of bwa
        "zipped/zipped.gz"
Ruleorder and priority are not relevant solutions for your problem.
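As a hedged sketch of the fixed rule (assuming the reads in zipped/zipped.gz are meant to be mapped against the draft; adjust the shell line to your actual command), naming the inputs also keeps {input} from expanding to all three paths at once:
rule nanopolish:
    input:
        assembly = "nanopolish/assembly.fasta",
        draft = "nanopolish/draft.fa",   # output of bwa, so bwa must finish first
        reads = "zipped/zipped.gz"
    output:
        "nanopolish/reads.sorted.bam"
    conda:
        "envs/nanopolish.yaml"
    shell:
        "bwa mem -x ont2d {input.draft} {input.reads} | samtools sort -o {output} -T reads.tmp"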

Can a snakemake input rule be defined with different paths/wildcards

I want to know whether one can define an input rule that has dependencies on different wildcards.
To elaborate, I am running this Snakemake pipeline on different fastq files using qsub which submits each job to a different node:
1. fastqc on original fastq - no downstream dependency on other jobs
2. adapter/quality trimming to generate trimmed fastq
3. fastqc_after on trimmed fastq (output from step 2) - no downstream dependency
4. star-rsem pipeline on trimmed fastq (output from step 2 above)
5. rsem and tximport (output from step 4)
6. Run multiqc
MultiQC - https://multiqc.info/ - runs on the results folder, which holds the results from fastqc, star, rsem, etc. However, because each job runs on a different node, sometimes Step 3 (fastqc and/or fastqc_after) is still running on the nodes while the other steps (Steps 2, 4 and 5) have finished, or vice versa.
Currently, I can create a MultiQC rule which waits on the results from Steps 2, 4 and 5, because those rules are linked to each other by their inputs/outputs.
I have attached my pipeline as png to this post. Any suggestions would help.
What I need: I want to create a "collating" step where MultiQC waits until all steps (1 to 5) have finished. In other words, using my attached png as a guide, I want to define multiple inputs for MultiQC that also wait on the results from fastqc.
Thanks in advance.
Note: Based on comments I received from 'colin' and 'bli' after my original post, I have shared the code for the different rules here.
Step 1 - fastqc
rule fastqc:
    input: "raw_fastq/{sample}.fastq"
    output: "results/fastqc/{sample}_fastqc.zip"
    log: "results/logs/fq_before/{sample}.fastqc.log"
    params: ...
    shell: ...
Step 2 - bbduk
rule bbduk:
    input: R1 = "raw_fastq/{sample}.fastq"
    output: R1 = "results/bbduk/{sample}_trimmed.fastq",
    params: ...
    log: "results/logs/bbduk/{sample}.bbduk.log"
    priority: 95
    shell: ....
Step 3 - fastqc_after
rule fastqc_after:
    input: "results/bbduk/{sample}_trimmed.fastq"
    output: "results/bbduk/{sample}_trimmed_fastqc.zip"
    log: "results/logs/fq_after/{sample}_trimmed.fastqc.log"
    priority: 70
    params: ...
    shell: ...
Step 4 - star_align
rule star_align:
    input: R1 = "results/bbduk/{sample}_trimmed.fastq"
    output:
        out_1 = "results/bam/{sample}_Aligned.toTranscriptome.out.bam",
        out_2 = "results/bam/{sample}_ReadsPerGene.out.tab"
    params: ...
    log: "results/logs/star/{sample}.star.log"
    priority: 90
    shell: ...
Step 5 - rsem_norm
rule rsem_norm:
    input:
        bam = "results/bam/{sample}_Aligned.toTranscriptome.out.bam"
    output:
        genes = "results/quant/{sample}.genes.results"
    params: ...
    threads: 16
    priority: 85
    shell: ...
Step 6 - rsem_model
rule rsem_model:
    input: "results/quant/{sample}.genes.results"
    output: "results/quant/{sample}_diagnostic.pdf"
    params: ...
    shell: ...
Step 7 - tximport_rsem
rule tximport_rsem:
    input: expand("results/quant/{sample}_diagnostic.pdf", sample=samples)
    output: "results/rsem_tximport/RSEM_GeneLevel_Summarization.csv"
    shell: ...
Step 8 - multiqc
rule multiqc:
    input: expand("results/quant/{sample}.genes.results", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: ...
If you want rule multiqc to run only after fastqc has completed, you can add the output of fastqc to the input of multiqc:
rule multiqc:
    input:
        expand("results/quant/{sample}.genes.results", sample=samples),
        expand("results/fastqc/{sample}_fastqc.zip", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: ...
Or, if you need to be able to refer to the output of rsem_norm in your shell section:
rule multiqc:
    input:
        rsem_out = expand("results/quant/{sample}.genes.results", sample=samples),
        fastqc_out = expand("results/fastqc/{sample}_fastqc.zip", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: "... {input.rsem_out} ..."
In one of your comments, you wrote:
MultiQC needs directory as input - I give it the 'results' directory in my shell command.
If I understand correctly, this means that results/quant/{sample}.genes.results are directories, and not plain files. If this is the case, you should make sure no downstream rule writes files inside those directories. Otherwise, the directories will be considered as having been updated after the output of multiqc, and multiqc will be re-run every time you run the pipeline.
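For completeness, a hedged sketch of the collating multiqc rule along these lines; the fastqc_after path comes from Step 3 above, and the multiqc -n/-o flags are assumptions about the MultiQC command line rather than something taken from the question:
rule multiqc:
    input:
        rsem_out = expand("results/quant/{sample}.genes.results", sample=samples),
        fastqc_out = expand("results/fastqc/{sample}_fastqc.zip", sample=samples),
        fastqc_after_out = expand("results/bbduk/{sample}_trimmed_fastqc.zip", sample=samples)
    output:
        "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log:
        "results/log/multiqc"
    shell:
        "multiqc results/ -n project_QS_STAR_RSEM_trial.html -o results/multiqc &> {log}"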
