Can a Snakemake input rule be defined with different paths/wildcards?

I want to know if one can define an input rule that has dependencies on different wildcards.
To elaborate, I am running this Snakemake pipeline on different fastq files using qsub, which submits each job to a different node:
1. fastqc on the original fastq - no downstream dependency on other jobs
2. adapter/quality trimming to generate a trimmed fastq
3. fastqc_after on the trimmed fastq (output from step 2) - no downstream dependency
4. star-rsem pipeline on the trimmed fastq (output from step 2 above)
5. rsem and tximport (output from step 4)
6. Run MultiQC
MultiQC - https://multiqc.info/ - runs on the results folder, which contains the results from fastqc, star, rsem, etc. However, because each job runs on a different node, sometimes the fastqc jobs (Steps 1 and/or 3) are still running on their nodes while the other steps (2, 4 and 5) have finished, or vice versa.
Currently, I can create a MultiQC rule that waits on the results from Steps 2, 4 and 5, because those rules are linked to each other by their inputs/outputs.
I have attached my pipeline as a png to this post. Any suggestions would help.
What I need: I want to create a "collating" step where MultiQC waits until all steps (1 to 5) have finished. In other words, using my attached png as a guide, I want to define multiple inputs for MultiQC that also wait on the results from fastqc.
Thanks in advance.
Note: Based on comments I received from 'colin' and 'bli' after my original post, I have shared the code for the different rules here.
Step 1 - fastqc
rule fastqc:
    input: "raw_fastq/{sample}.fastq"
    output: "results/fastqc/{sample}_fastqc.zip"
    log: "results/logs/fq_before/{sample}.fastqc.log"
    params: ...
    shell: ...
Step 2 - bbduk
rule bbduk:
    input: R1 = "raw_fastq/{sample}.fastq"
    output: R1 = "results/bbduk/{sample}_trimmed.fastq"
    params: ...
    log: "results/logs/bbduk/{sample}.bbduk.log"
    priority: 95
    shell: ...
Step 3 - fastqc_after
rule fastqc_after:
    input: "results/bbduk/{sample}_trimmed.fastq"
    output: "results/bbduk/{sample}_trimmed_fastqc.zip"
    log: "results/logs/fq_after/{sample}_trimmed.fastqc.log"
    priority: 70
    params: ...
    shell: ...
Step 4 - star_align
rule star_align:
    input: R1 = "results/bbduk/{sample}_trimmed.fastq"
    output:
        out_1 = "results/bam/{sample}_Aligned.toTranscriptome.out.bam",
        out_2 = "results/bam/{sample}_ReadsPerGene.out.tab"
    params: ...
    log: "results/logs/star/{sample}.star.log"
    priority: 90
    shell: ...
Step 5 - rsem_norm
rule rsem_norm:
    input:
        bam = "results/bam/{sample}_Aligned.toTranscriptome.out.bam"
    output:
        genes = "results/quant/{sample}.genes.results"
    params: ...
    threads: 16
    priority: 85
    shell: ...
Step 6 - rsem_model
rule rsem_model:
    input: "results/quant/{sample}.genes.results"
    output: "results/quant/{sample}_diagnostic.pdf"
    params: ...
    shell: ...
Step 7 - tximport_rsem
rule tximport_rsem:
    input: expand("results/quant/{sample}_diagnostic.pdf", sample=samples)
    output: "results/rsem_tximport/RSEM_GeneLevel_Summarization.csv"
    shell: ...
Step 8 - multiqc
rule multiqc:
    input: expand("results/quant/{sample}.genes.results", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: ...

If you want rule multiqc to run only after fastqc has completed, you can add the output of fastqc to the input of multiqc:
rule multiqc:
    input:
        expand("results/quant/{sample}.genes.results", sample=samples),
        expand("results/fastqc/{sample}_fastqc.zip", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: ...
Or, if you need to be able to refer to the output of rsem_norm in your shell section:
rule multiqc:
    input:
        rsem_out = expand("results/quant/{sample}.genes.results", sample=samples),
        fastqc_out = expand("results/fastqc/{sample}_fastqc.zip", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: "... {input.rsem_out} ..."
In one of your comments, you wrote:
MultiQC needs directory as input - I give it the 'results' directory in my shell command.
If I understand correctly, this means that results/quant/{sample}.genes.results are directories, and not plain files. If this is the case, you should make sure no downstream rule writes files inside those directories. Otherwise, the directories will be considered as having been updated after the output of multiqc, and multiqc will be re-run every time you run the pipeline.
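Putting the two points together, here is a minimal sketch (untested; paths are taken from the rules above, and the exact multiqc command line is an assumption) that makes multiqc wait on fastqc, fastqc_after and rsem_norm while still pointing MultiQC at the results directory in the shell:
rule multiqc:
    input:
        rsem_out = expand("results/quant/{sample}.genes.results", sample=samples),
        fastqc_out = expand("results/fastqc/{sample}_fastqc.zip", sample=samples),
        fastqc_after_out = expand("results/bbduk/{sample}_trimmed_fastqc.zip", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    # the explicit inputs only control scheduling; multiqc itself scans the results directory
    shell: "multiqc results -o results/multiqc -n project_QS_STAR_RSEM_trial &> {log}"
With this, the DAG forces multiqc to start only after every per-sample QC and quantification job has produced its output.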

Taskwarrior: How do I find the tasks that depend on a specific task?

How do I find out which task(s) depend on a specific task without reading the information of all tasks?
Reproduction
System
Version
$ task --version
2.5.1
.taskrc
# Taskwarrior program configuration file.
# Files
data.location=~/.task
alias.cal=calendar
rc.date.iso=Y-M-D
default.command=ready
journal.info=no
rc.regex=on
Here are the tasks that I created for testing purposes:
$ task list
ID Age Description Urg
1 2min Something to do 0
2 1min first do this 0
3 1min do this whenever you feel like it 0
3 tasks
Create the dependency from task#1 to task#2:
$ task 1 modify depends:2
Modifying task 1 'something to do'.
Modified 1 task.
$ task list
ID Age D Description Urg
2 4min first do this 8
3 4min do this whenever you feel like it 0
1 4min D Something to do -5
3 tasks
Goal
Now I want to find the tasks that are dependent on task#2, which should be task#1.
Trials
Unfortunately, this does not result in any matches:
$ task list depends:2
No matches.
$ # I can filter by blocked tasks
$ task blocked
ID Deps Age Description
1 2 18min Something to do
1 task
$ # But when I want to list only the tasks that are blocked by task#2,
$ # task#3 is also returned
$ task blocked:2
[task ready ( blocked:2 )]
ID Age Description Urg
2 20min first do this 8
3 19min do this whenever you feel like it 0
2 tasks
Suggestions?
How would you approach this?
Parsing the taskwarrior output through a script seems like a bit of overkill.
You have the right command but have actually encountered a bug: the depends attribute does not work with a "short id"; it is stored as a comma-delimited string of UUIDs.
It will work if you use the UUID instead. Use task <id> _uuid to resolve an id to a UUID.
$ task --version
2.5.1
# Create tasks
$ task rc.data.location: add -- Something to do
$ task rc.data.location: add -- first do this
$ task rc.data.location: add -- do this whenever you feel like it
$ task rc.data.location: list
ID Age Description Urg
1 - Something to do 1.8
2 - first do this 1.8
3 - do this whenever you feel like it 1.8
3 tasks
# Set up dependency
$ task rc.data.location: 1 modify depends:2
Modifying task 1 'Something to do'.
Modified 1 task.
# Query using depends:UUID
$ task rc.data.location: list "depends.has:$(task rc.data.location: _get 2.uuid)"
ID Age D Description Urg
1 - D Something to do -3.2
1 task
# Query using depends:SHORT ID
# This does not work, despite documentation. Likely a bug
$ task rc.data.location: list "depends.has:$(task rc.data.location: _get 2.id)"
No matches.
A small correction to your trial for finding blocked tasks:
There is no blocked attribute, and you're using the ready report.
$ task blocked:2
[task ready ( blocked:2 )]
The ready report filters out exactly what we're looking for; the blocked report is what we need. To demystify this: these are simply useful default reports that apply preset filters on top of task all.
$ task show filter | grep -e 'blocked' -e 'ready'
report.blocked.filter status:pending +BLOCKED
report.ready.filter +READY
report.unblocked.filter status:pending -BLOCKED
Blocked tasks will have the virtual tag +BLOCKED, which is mutually exclusive to +READY.
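As an illustration of that mechanism, a custom report could be defined in .taskrc that lists blocked tasks together with the tasks they depend on; the report name and column choice here are hypothetical, and this is roughly how the built-in blocked report is put together:
# Hypothetical custom report: blocked tasks plus their blockers
report.blockedby.description=Blocked tasks and their blockers
report.blockedby.columns=id,depends,description
report.blockedby.labels=ID,Deps,Description
report.blockedby.filter=status:pending +BLOCKED
Running task blockedby would then list pending blocked tasks with the Deps column showing which task is doing the blocking.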
The blocked attribute doesn't exist; use task _columns to show the available attributes (e.g. depends). Unfortunately, the CLI parser probably attempts to apply the filter blocked:2 and ends up ignoring it. For your workflow, the useful command is task blocked "depends.has:$(task _get 2.uuid)". It's advisable to wrap this in a shell function to make it easier to use:
#!/bin/bash
# Untested but gets the point across
function task_blocked {
    local blocker=$1
    shift
    task blocked "depends.has:$(task _get "${blocker}".uuid)" "$@"
}
# Find tasks of project "foo" that are blocked on task 2
task_blocked 2 project:foo
# What about another project that is also impacted?
task_blocked 2 project:bar
You could use this taskwarrior hook script that adds a "blocks" attribute to the tasks: https://gist.github.com/wbsch/a2f7264c6302918dfb30

'Wildcards' object has no attribute 'output'

I get an error for a rather simple rule. I have to write a task file for another program, which expects a tsv file. I read a number of parameters from my config file and write them to a file with a shell command.
Code:
rule create_tasks:
    output:
        temp("tasks_{sample}.tsv")
    params:
        ID="{sample}",
        file=lambda wc: samples["path"][wc.sample],
        bigwig=lambda wc: samples["bigwig"][wc.sample],
        ambig=lambda wc: samples["ambig"][wc.sample]
    shell:
        'echo -e "{params.ID}\t{params.file}" > {output}'
When I execute the workflow, I get the following error:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Job counts:
count jobs
1 create_tasks
1
[Mon Oct 12 14:48:15 2020]
rule create_tasks:
output: tasks_sampleA.tsv
jobid: 0
wildcards: sample=sampleA
echo -e "sampleA /Path/To/sampleA.bed " > tasks_sampleA.tsv
WorkflowError in line 23 of /path/to/workflow.snakefile:
'Wildcards' object has no attribute 'output'
File "/path/to/miniconda/envs/snakemake_submit/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 111, in run_jobs
File "/path/to/miniconda/envs/snakemake_submit/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 1233, in run
I should mention that two of the variables are empty and that I expect the tabs/whitespace in the echo command.
Does anybody have an explanation for why Snakemake is trying to find output in the wildcards? I am especially confused because it is printing the correct command.
I've run into this same problem.
The issue is probably in how you invoked Snakemake from the command line.
For example, this was my Snakefile rule:
rule sort:
    input:
        "{file}.bam",
    output:
        "{file}.sorted.bam",
        "{file}.sorted.bai",
    shell:
        "sambamba sort {input}"
I don't even have params or wildcards explicitly anywhere in there.
But when I run it on my Slurm HPC I get the same error:
snakemake -j 10 -c "sbatch {cluster.params}" -u cluster.yaml
The Wildcards (note the capital "W") and params objects weren't from the rule.
They came from the cluster execution of the rule, and the error was thrown when trying to parse the cluster.yaml file.
There was no cluster parameter specification in my cluster.yaml file for the sort rule, so the error was thrown.
I fixed this by adding
sort:
    params: "..."
to my cluster.yaml file.
In your case, add cluster submission options under a create_tasks: ... entry. You can also add a __default__: ... entry that provides the default submission parameters for any rule that doesn't have its own entry.
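A minimal sketch of what such a cluster.yaml could look like (the sbatch options themselves are placeholders, not taken from either question):
# cluster.yaml - one entry per rule, plus a fallback
__default__:
    params: "--partition=normal --time=01:00:00"
sort:
    params: "--time=04:00:00 --mem=16G"
create_tasks:
    params: "--time=00:10:00"
Each top-level key matches a rule name, and {cluster.params} in the cluster submission command is filled in from the matching entry, falling back to __default__.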

snakemake running nanopolish and making it wait until previous rule is done

Hi, I can run the different steps of nanopolish with Snakemake. But when I run it, it gives an error that the index file created in the bwa rule isn't available yet. After it gives this error, it creates the file that the error was about. If I run Snakemake again without removing files, it works because the file is there. How can I tell Snakemake to wait with the next step until the first one is done? I have googled for ways to solve this problem and all I could find was priority and ruleorder; I have used those but it still doesn't work. Here is the script that I use.
ruleorder: bwa > nanopolish

rule bwa:
    input:
        "nanopolish/assembly.fasta"
    output:
        "nanopolish/draft.fa"
    conda:
        "envs/nanopolish.yaml"
    priority:
        50
    shell:
        "bwa index {input} - > {output}"

rule nanopolish:
    input:
        "nanopolish/assembly.fasta",
        "zipped/zipped.gz"
    output:
        "nanopolish/reads.sorted.bam"
    conda:
        "envs/nanopolish.yaml"
    shell:
        "bwa mem -x ont2d {input} | samtools sort -o {output} -T reads.tmp"
You should take another look at the docs to properly understand the core idea of Snakemake:
Rules describe how to create output files from input files
A rule is not executed until all of its input files exist, so all you have to do is add the output of the bwa rule to the input of nanopolish:
rule nanopolish:
    input:
        "nanopolish/assembly.fasta",
        "nanopolish/draft.fa", # <-- output of bwa
        "zipped/zipped.gz"
Ruleorder and priority are not relevant solutions for your problem.
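For clarity, a minimal sketch of the complete rule with named inputs (untested; which files the bwa mem command actually needs here is an assumption, so adjust the shell line to your data):
rule nanopolish:
    input:
        draft = "nanopolish/draft.fa",          # output of bwa - this creates the dependency
        assembly = "nanopolish/assembly.fasta",
        reads = "zipped/zipped.gz"
    output:
        "nanopolish/reads.sorted.bam"
    conda:
        "envs/nanopolish.yaml"
    shell:
        "bwa mem -x ont2d {input.assembly} {input.reads} | samtools sort -o {output} -T reads.tmp"
Named inputs keep the shell command unambiguous, and the dependency on nanopolish/draft.fa guarantees that the bwa rule has finished before this rule starts.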

Snakemake tabular config

I'm using Snakemake with a tabular configuration. This table is a bunch of rows that I first take out of a very large sample overview. The resulting sample.tsv has a lot of columns. I read the samples file like this, somewhere in my Snakefile:
samples = pd.read_table('samples.tsv').set_index('samples', drop=False)
When I run snakemake:
snakemake --cluster-config cluster.json --cluster "qsub -l nodes={cluster.nodes}:ppn={cluster.ppn}" --jobs 256
I get the following error:
16:51 nlv24077#kiato ~/temp/test_snakemake > run_snakemake.sh
Building DAG of jobs...
MissingInputException in line 6 of /home/nlv24077/temp/test_snakemake/rseqc.smk:
Missing input files for rule bam_stat:
analyzed/I7_index.STAR.genome.sorted.bam
The strange thing is that I7_index is one of the column names in my samples table. Why is it using the column names as sample names?
Below I can show you part of the samples table (I can't show all data publicly):
Edit:
I was calling the samples like this:
rule bcl2fastq:
    input:
        config['bcl_dir']
    output:
        expand([os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
                os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')],
               sample=samples)
    threads: 6
    shell:
        '''
        # Run bcl2fastq
        ...
Whereas I should have used:
rule bcl2fastq:
    input:
        config['bcl_dir']
    output:
        expand([os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
                os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')],
               sample=samples['samples'])
    threads: 6
    shell:
        '''
        # Run bcl2fastq
        ...
Thanx JeeYem.
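The underlying reason: iterating over a pandas DataFrame yields its column labels, not its rows, so expand(..., sample=samples) substitutes the column names. A small illustration with made-up data (not the real sample sheet):
import pandas as pd

# hypothetical two-column sample sheet
samples = pd.DataFrame({'samples': ['S1', 'S2'],
                        'I7_index': ['A01', 'B01']}).set_index('samples', drop=False)

print(list(samples))             # ['samples', 'I7_index']  <- column names, what expand() got
print(list(samples['samples']))  # ['S1', 'S2']             <- the actual sample names
print(list(samples.index))       # ['S1', 'S2']             <- equivalent alternative
Passing samples['samples'] (or samples.index) to expand() therefore produces the intended per-sample file names.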

Running parallel simulations (using the Command Line)

How can I run the simulation with different configurations? I am using omnet++ version 4.6.
My omnetpp.ini file looks as below :
[General]
[Config Dcn2]
network = Dcn2
# leaf switch
#**.down_port = 2
**.up_port = 16 #12 # 4
# spine switch
**.port = 28 # 20 #2048
# crossconnect
**.cross_down_port = 28 # 20 #2048
**.cross_up_port = 28 # 20 #2048
# to set destination of packet
**.number_leaf_switch = 28 # 20 #2048
# link speed
#**.switch_switch_link_speed = 40 Mbps
**.interArrivalTime = ${exponential(.0001),exponential(0.0002),exponential(0.0003)}
**.batch_length = 10
**.buffer_length = 10
sim-time-limit = 1000s
I want to run the code with different values of interArrivalTime. But I can neither run the different configs one after another, nor run the individual runs in parallel on separate cores.
I have tried the Cmdenv option in the run configurations, but the runs other than the first one don't show up. When I set the number of processes to more than one, still only the first run gets simulated. I really cannot find out the reason.
Config Examination
In your case you can perform config examination. OMNeT++ offers different options for that. They are explained under the Parameter Studies section of the OMNeT++ manual.
So you can try one of the following options to examine your configs and thus config file:
./run -a - will show all the configurations in the omnetpp.ini
./run -x <config_name> - will give more info about a specific config
./run -x <config_name> -g - see all the combinations of configs
First you will have to navigate to your example folder, and there execute one of the aforementioned commands.
I executed ./run -x Dcn2 -g and got the following results:
OMNeT++ Discrete Event Simulation (C) 1992-2014 Andras Varga, OpenSim Ltd.
Version: 4.6, build: 141202-f785492, edition: Academic Public License -- NOT FOR COMMERCIAL USE
See the license for distribution terms and warranty disclaimer
Setting up Tkenv...
Config: Dcn2
Number of runs: 3
Run 0: $0=exponential(.0001), $repetition=0
Run 1: $0=exponential(0.0002), $repetition=0
Run 2: $0=exponential(0.0003), $repetition=0
End.
This confirms that you indeed have 3 different runs for the simulation parameter you are trying to modify. However, the variable you are using for the interArrivalTime parameter is assigned to $0 by default, because you have not given it a name.
If you change the following line in your config:
**.interArrivalTime = ${exponential(.0001),exponential(0.0002),exponential(0.0003)}
to
**.interArrivalTime = ${interArrivalTime = exponential(0.0001),exponential(0.0002),exponential(0.0003)}
you will get a more descriptive output for ./run -x Dcn2 -g
Running different runs of a config:
Next step for you would be to run the different runs for your config. You can do that by navigating to your example directory and execute:
./run -c <config-name> -r <run-number> -u Cmdenv
Note that the <config-name> would be Dcn2 for you, and the -r specifies which of the runs given above you would like to execute.
In other words you can open three terminal windows and navigate to your example directory and do:
./run -c Dcn2 -r 0 -u Cmdenv - for interArrivalTime = exponential(0.0001)
./run -c Dcn2 -r 1 -u Cmdenv - for interArrivalTime = exponential(0.0002)
./run -c Dcn2 -r 2 -u Cmdenv - for interArrivalTime = exponential(0.0003)
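As a convenience, the same three runs can also be launched in parallel from a single shell (a sketch, not from the original answer; run it from the example directory):
# launch all three runs in the background and wait for them to finish
for r in 0 1 2; do
    ./run -c Dcn2 -r "$r" -u Cmdenv > "Dcn2_run${r}.log" 2>&1 &
done
wait
Each run writes its own log file, and the result files are kept apart by ${runnumber} as described below.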
Distinguishing Different run results
To be able to distinguish between the output result files of the different runs for your given config you can modify the default name of the output file.
The "how-to" is given in the 12.2.3 Result File Names section of the OMNeT++ manual.
output-vector-file = "${resultdir}/${configname}-${runnumber}.vec"
output-scalar-file = "${resultdir}/${configname}-${runnumber}.sca"
As you can see by default your output files will be distinguished by the ${runnumber} variable. You can further improve it by adding the interArrivalTime to the output file name.
Example:
output-scalar-file = "${resultdir}/${configname}-${runnumber}-IAtime=${interArrivalTime}.sca/vec"
I have not tested this final approach, so you might run into some errors along the way.
