problem with snakemake submitting jobs with multiple wildcards on SGE - cluster-computing

I used snakemake on an LSF cluster before and everything worked just fine. However, I recently migrated to an SGE cluster, and I am getting a very strange error when I try to run a job with more than one wildcard.
When I try to submit a job based on this rule:
rule download_reads:
    threads: 1
    output: "data/{sp}/raw_reads/{accesion}_1.fastq.gz"
    shell: "scripts/download_reads.sh {wildcards.sp} {wildcards.accesion} data/{wildcards.sp}/raw_reads/{wildcards.accesion}"
I get the following error (snakemake_clust.sh details below):
./snakemake_clust.sh data/Ecol1/raw_reads/SRA123456_1.fastq.gz
Building DAG of jobs...
Using shell: /bin/bash
Provided cluster nodes: 10
Job counts:
count jobs
1 download_reads
1
[Thu Jul 30 12:08:57 2020]
rule download_reads:
output: data/Ecol1/raw_reads/SRA123456_1.fastq.gz
jobid: 0
wildcards: sp=Ecol1, accesion=SRA123456
scripts/download_reads.sh Ecol1 SRA123456 data/Ecol1/raw_reads/SRA123456
Unable to run job: ERROR! two files are specified for the same host
ERROR! two files are specified for the same host
Exiting.
Error submitting jobscript (exit code 1):
Shutting down, this might take some time.
When I replace the sp wildcard with a constant, it works as expected:
rule download_reads:
    threads: 1
    output: "data/Ecol1/raw_reads/{accesion}_1.fastq.gz"
    shell: "scripts/download_reads.sh Ecol1 {wildcards.accesion} data/Ecol1/raw_reads/{wildcards.accesion}"
I.e. I get
Submitted job 1 with external jobid 'Your job 50731 ("download_reads") has been submitted'.
I wonder why I have this problem; I am sure I used exactly the same rule on the LSF-based cluster before without any problem.
some details
The snakemake submitting script looks like this
#!/usr/bin/env bash
mkdir -p logs
snakemake "$@" -p --jobs 10 --latency-wait 120 --cluster "qsub \
    -N {rule} \
    -pe smp64 \
    {threads} \
    -cwd \
    -b y \
    -o \"logs/{rule}.{wildcards}.out\" \
    -e \"logs/{rule}.{wildcards}.err\""
-b y makes the command execute as it is; -cwd changes the working directory on the computing node to the working directory from which the job was submitted. The other flags / specifications are clear, I hope.
Also, I am aware of the --drmaa flag, but I think our cluster is not well configured for that. Until now, --cluster has been the more robust solution.
-- edit 1 --
When I execute exactly the same snakefile locally (on the frontend, without the --cluster flag), the script gets executed as expected. It seems to be a problem with the interaction between snakemake and the scheduler.

-o \"logs/{rule}.{wildcards}.out\" \
-e \"logs/{rule}.{wildcards}.err\""
This is a random guess... When there is more than one wildcard, their values are joined with spaces before being substituted into logs/{rule}.{wildcards}.err. So even though you use double quotes, SGE treats the resulting string as two files and throws the error. What if you use single quotes instead? Like:
-o 'logs/{rule}.{wildcards}.out' \
-e 'logs/{rule}.{wildcards}.err'
Alternatively, you could concatenate the wildcards in the rule and use the result on the command line. E.g.:
rule one:
    params:
        wc=lambda wc: '_'.join(wc)
    output: ...
Then use:
-o 'logs/{rule}.{params.wc}.out' \
-e 'logs/{rule}.{params.wc}.err'
(This second solution, if it works, kind of sucks though)
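The guess above can be illustrated outside Snakemake and SGE. The sketch below is only a model of shell-style word splitting using Python's shlex, not a reproduction of what either tool does internally. The wildcard values come from the log in the question; the helper name submit_tokens and the premise that the quotes are lost before the path is parsed are assumptions made for illustration.

```python
import shlex

# Wildcard values from the failing job in the question.
wildcards = ["Ecol1", "SRA123456"]


def submit_tokens(wc_string, quoted):
    """Split a qsub-like command the way a POSIX shell would (hypothetical helper)."""
    path = f"logs/download_reads.{wc_string}.out"
    if quoted:
        path = shlex.quote(path)
    return shlex.split(f"qsub -o {path}")


space_joined = " ".join(wildcards)       # how the answer guesses {wildcards} is rendered
underscore_joined = "_".join(wildcards)  # what the params.wc lambda would produce

# Unquoted, the space splits the log path into two separate arguments.
print(submit_tokens(space_joined, quoted=False))
# Quoting, or joining with "_", keeps the path as a single argument.
print(submit_tokens(space_joined, quoted=True))
print(submit_tokens(underscore_joined, quoted=False))
```

If the scheduler really receives the unquoted form, it sees two path-like arguments after -o, which would be consistent with the "two files are specified" message.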

Related

Variant Calling Pipeline Parallelization: ERROR - "sbatch: not found Error submitting jobscript (exit code 127)"

I developed a pipeline written in Snakemake for genome variant calling analysis.
I'm trying to parallelize it now so that it can run on an HPC cluster with multiple nodes.
I have a configuration file (yaml) where the paths of the files are defined.
When I execute the pipeline with the command:
shifter --volume=/home/ubuntu/vcall_docker_gatk4_bottle/:/mnt/ \
    --image=docker:ray2g/vcall_biodata:1.5.1 \
    snakemake --snakefile /mnt/vcall-pipe3_cluster.snake \
    -p /mnt/genome/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta.sa \
    -j 24 \
    --cluster 'sbatch -p {params.partition} --mem {resources.mem_mb}mb --cpus-per-task {resources.cpus}' \
    --forceall
I'm getting this ERROR:
Building DAG of jobs...
Using shell: /bin/bash
Provided cluster nodes: 24
Unlimited resources: cpus, mem_mb
Job counts:
count jobs
1 bwa_index
1
[Wed Jan 27 14:30:40 2021]
Job 0: Building index -> /mnt/genome/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta.sa
bwa index /mnt/genome/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
/bin/sh: 1: sbatch: not found
Error submitting jobscript (exit code 127):
Shutting down, this might take some time.
Could anyone help me?
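Exit code 127 is the status a POSIX shell returns when it cannot find a command, which matches the "sbatch: not found" line in the log. A minimal check, assuming it is run in the same environment that launches snakemake (presumably inside the shifter container, which the log does not confirm), is whether sbatch resolves on PATH. The helper name is_on_path is hypothetical.

```python
import shutil


def is_on_path(command):
    """Return True if `command` resolves to an executable on the current PATH."""
    return shutil.which(command) is not None


# Run this where snakemake itself is started; the result depends on that environment.
if not is_on_path("sbatch"):
    print("sbatch is not visible on PATH in this environment")
```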

'Wildcards' object has no attribute 'output'

I get an error for a rather simple rule. I have to write a task file for another program, expecting a tsv file. I read a certain number of parameters from my config file and write them to a file with a shell command.
Code:
rule create_tasks:
    output:
        temp("tasks_{sample}.tsv")
    params:
        ID="{sample}",
        file=lambda wc: samples["path"][wc.sample],
        bigwig=lambda wc: samples["bigwig"][wc.sample],
        ambig=lambda wc: samples["ambig"][wc.sample]
    shell:
        'echo -e "{params.ID}\t{params.file}" > {output}'
When I execute the workflow, I get the following error:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Job counts:
count jobs
1 create_tasks
1
[Mon Oct 12 14:48:15 2020]
rule create_tasks:
output: tasks_sampleA.tsv
jobid: 0
wildcards: sample=sampleA
echo -e "sampleA /Path/To/sampleA.bed " > tasks_sampleA.tsv
WorkflowError in line 23 of /path/to/workflow.snakefile:
'Wildcards' object has no attribute 'output'
File "/path/to/miniconda/envs/snakemake_submit/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 111, in run_jobs
File "/path/to/miniconda/envs/snakemake_submit/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 1233, in run
I should mention that two of the variables are empty and that I expect the tabs/whitespace in the echo command.
Does anybody have an explanation for why snakemake is trying to find output in the wildcards? I am especially confused because it prints the correct command.
I've run into this same problem.
The issue is probably in how you invoked Snakemake from the command line.
For example, this was my Snakefile rule:
rule sort:
    input:
        "{file}.bam",
    output:
        "{file}.sorted.bam",
        "{file}.sorted.bai",
    shell:
        "sambamba sort {input}"
I don't even have params or wildcards explicitly anywhere in there.
But when I run it on my Slurm HPC I get the same error:
snakemake -j 10 -c "sbatch {cluster.params}" -u cluster.yaml
The Wildcards (note the capital "W") and params objects weren't from the rule.
They came from the cluster execution of the rule, and the error was thrown when trying to parse the cluster.yaml file.
There was no cluster parameter specification in my cluster.yaml file for the sort rule, so the error was thrown.
I fixed this by adding
sort:
    params: "..."
to my cluster.yaml file.
In your case, add cluster submission options under a create_tasks: ... entry.
You can also add a __default__: ... entry, whose submission parameters apply to any job whose rule has no entry of its own.
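The fallback described above can be sketched as a plain dictionary lookup. This is a simplified model written for illustration, not Snakemake's actual code; the function name cluster_settings, the rule name align, and the --mem values are all hypothetical.

```python
def cluster_settings(cluster_config, rule_name):
    """Start from the __default__ entry, then let the rule's own entry override it."""
    settings = dict(cluster_config.get("__default__", {}))
    settings.update(cluster_config.get(rule_name, {}))
    return settings


# Hypothetical contents of a cluster.yaml after loading it into a dict.
without_default = {"align": {"params": "--mem 8G"}}
with_default = {
    "__default__": {"params": "--mem 2G"},
    "align": {"params": "--mem 8G"},
}

# No entry for "sort" and no default: nothing to fill {cluster.params} with.
print(cluster_settings(without_default, "sort"))
# With a default, "sort" falls back to it while "align" keeps its own value.
print(cluster_settings(with_default, "sort"))
print(cluster_settings(with_default, "align"))
```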

snakemake running nanopolish and making it wait until previous rule is done

Hi, I can run the different steps of nanopolish with snakemake. But when I run the workflow, it gives an error saying that the index file created in the bwa rule isn't available yet. After giving this error, it creates the file that the error was about. If I run snakemake again without removing any files, it works because the file is there. How can I tell snakemake to wait with the next step until the first one is done? I have searched for ways to solve this problem, and all I could find was priority and ruleorder; I have used both, but it still doesn't work. Here is the script that I use.
ruleorder: bwa > nanopolish

rule bwa:
    input:
        "nanopolish/assembly.fasta"
    output:
        "nanopolish/draft.fa"
    conda:
        "envs/nanopolish.yaml"
    priority:
        50
    shell:
        "bwa index {input} - > {output}"

rule nanopolish:
    input:
        "nanopolish/assembly.fasta",
        "zipped/zipped.gz"
    output:
        "nanopolish/reads.sorted.bam"
    conda:
        "envs/nanopolish.yaml"
    shell:
        "bwa mem -x ont2d {input} | samtools sort -o {output} -T reads.tmp"
You should take another look at the docs to properly understand the idea of Snakemake.
Rules describe how to create output files from input files.
A rule is not executed until all its input files exist, so all you have to do is add the output of the bwa rule to the input of the nanopolish rule:
rule nanopolish:
    input:
        "nanopolish/assembly.fasta",
        "nanopolish/draft.fa",  # <-- output of bwa
        "zipped/zipped.gz"
Ruleorder and priority are not relevant solutions for your problem.
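To see why declaring the file as an input is what enforces the order, here is a toy model of the idea. It is a deliberate simplification: Snakemake works backwards from the requested targets, whereas this sketch just groups rules into batches whose inputs are all available. The function name execution_batches is hypothetical; the file names are the ones from the question.

```python
def execution_batches(rules, existing_files):
    """Group rules into batches; a rule is ready once all of its inputs are available."""
    available = set(existing_files)
    pending = dict(rules)
    batches = []
    while pending:
        ready = [name for name, r in pending.items() if set(r["input"]) <= available]
        if not ready:
            raise ValueError(f"missing inputs for: {sorted(pending)}")
        batches.append(ready)
        for name in ready:
            available.update(pending.pop(name)["output"])
    return batches


existing = ["nanopolish/assembly.fasta", "zipped/zipped.gz"]
bwa = {"input": ["nanopolish/assembly.fasta"], "output": ["nanopolish/draft.fa"]}

# Original workflow: nanopolish does not list draft.fa, so nothing makes it wait.
original = {
    "bwa": bwa,
    "nanopolish": {"input": existing, "output": ["nanopolish/reads.sorted.bam"]},
}
# Fixed workflow: draft.fa is an input, so nanopolish can only follow bwa.
fixed = {
    "bwa": bwa,
    "nanopolish": {
        "input": existing + ["nanopolish/draft.fa"],
        "output": ["nanopolish/reads.sorted.bam"],
    },
}

print(execution_batches(original, existing))
print(execution_batches(fixed, existing))
```

In the original workflow both rules land in the same batch, so they may run in any order or at the same time; in the fixed one, nanopolish is pushed into a later batch.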

Slowdown when using Mutect2 container inside Nextflow

I'm trying to run MuTect2 on a sample, which on my machine using java takes about 27 minutes to run.
If I use virtually the same code, but inside Nextflow and using the GATK3:3.6 docker container to run Mutect, it takes 7 minutes longer, for no apparent reason.
Running on Ubuntu 18.04, the tumor and normal samples are from an Oncomine panel. Tumor is 4.1G, normal is 1.1G. I thought the time might be spent copying data into the container, but 7-8 minutes seems far too long for that. Could it be from copying in the reference files too?
bai_ch is the channel that brings in the tumor and normal index files.
process MuTect2 {
    label 'mutect'
    stageInMode 'copy'
    publishDir './output', mode : 'copy', overwrite : true

    input:
    file tumor_bam_mu from tumor_mu
    file normal_bam_mu from normal_mu
    file "*" from bai_ch
    file mutect2_ref
    file ref_index from ref_fasta_i_m
    file ref_dict from Channel.fromPath(params.ref_fast_dict)
    file regions_file from Channel.fromPath(params.regions)
    file cosmic_vcf from Channel.fromPath(params.cosmic_vcf)
    file dbsnp_vcf from Channel.fromPath(params.dbsnp_vcf)
    file normal_vcf from Channel.fromPath(params.normal_vcf)

    output:
    file '*' into mutect_ch

    script:
    """
    ls
    echo MuTect2 task path: \$PWD
    java -jar /usr/GenomeAnalysisTK.jar \
        --analysis_type MuTect2 \
        --reference_sequence hg19.fa \
        -L designed.bed \
        --normal_panel normal_panel.vcf \
        --cosmic Cosmic.vcf \
        --dbsnp dbsnp.vcf \
        --input_file:tumor $tumor_bam_mu \
        -o mutect2.somatic.unfiltered.vcf \
        --input_file:normal $normal_bam_mu \
        --max_alt_allele_in_normal_fraction 0.1 \
        --minPruning 10 \
        --kmerSize 60
    """
}
My only thought is to create my own docker image that has the reference files handy, which would probably save the time spent copying them in. I'd expect the nextflow+container version to run only slightly slower than the CLI version.
Check the task's Bash wrapper script (.command.run) in the task work dir to assess the performance issue.

How can I figure out how many threads cut needs in Snakemake rule?

I use cut in one rule of my pipeline, and it always throws an error, but without any error description.
When I try this command in a simple bash script, it works without any errors.
Here is the rule:
rule convert_bamheader:
input: bam/SERUM-ACT/exon_tagged_trimmed_mapped_cleaned.bam, stats/SERUM-ACT/good_barcodes_clean_filter.txt
output: bam/SERUM-ACT/exon_tagged_trimmed_mapped_cleaned_header.txt, bam/SERUM-ACT/exon_tagged_trimmed_mapped_cleaned_header_filtered.tsv
jobid: 15
wildcards: sample=SERUM-ACT
threads: 4
mkdir -p stats/SERUM-ACT
mkdir -p log/SERUM-ACT
samtools view bam/SERUM-ACT/exon_tagged_trimmed_mapped_cleaned.bam > bam/SERUM-ACT/exon_tagged_trimmed_mapped_cleaned_header.txt
cut -f 12,13,18,20-24 bam/SERUM-ACT/exon_tagged_trimmed_mapped_cleaned_header.txt | grep -f stats/SERUM-ACT/good_barcodes_clean_filter.txt > bam/SERUM-ACT/exon_tagged_trimmed_mapped_cleaned_header_filtered.tsv
Submitted DRMAA job 15 with external jobid 7027806.
Error in rule convert_bamheader:
jobid: 15
output: bam/SERUM-ACT/exon_tagged_trimmed_mapped_cleaned_header.txt, bam/SERUM-ACT/exon_tagged_trimmed_mapped_cleaned_header_filtered.tsv
ClusterJobException in line 256 of */pipeline.snake:
Error executing rule convert_bamheader on cluster (jobid: 15, external: 7027806, jobscript: */.snakemake/tmp.ewej7q4e/snakejob.convert_bamheader.15.sh). For detailed error see the cluster log.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
Complete log: */.snakemake/log/2018-12-18T104741.092698.snakemake.log
I thought that it has something to do with the number of threads provided and the number of threads needed for the cut step, but I am not sure.
Perhaps someone can help me?
Cheers!
