Slowdown when using Mutect2 container inside Nextflow

I'm trying to run MuTect2 on a sample; on my machine, using plain java, it takes about 27 minutes to run.
If I use virtually the same code, but inside Nextflow and using the GATK3:3.6 Docker container to run MuTect2, it takes 7 minutes longer, for no apparent reason.
I'm running on Ubuntu 18.04, and the tumor and normal samples are from an Oncomine panel. The tumor BAM is 4.1G, the normal 1.1G. I thought the time might be spent copying data into the container, but 7-8 minutes seems far too long for that. Could it be from copying in the reference files too?
bai_ch is the channel that brings in the tumor and normal index files
process MuTect2 {
    label 'mutect'
    stageInMode 'copy'
    publishDir './output', mode: 'copy', overwrite: true

    input:
    file tumor_bam_mu from tumor_mu
    file normal_bam_mu from normal_mu
    file "*" from bai_ch
    file mutect2_ref
    file ref_index from ref_fasta_i_m
    file ref_dict from Channel.fromPath(params.ref_fast_dict)
    file regions_file from Channel.fromPath(params.regions)
    file cosmic_vcf from Channel.fromPath(params.cosmic_vcf)
    file dbsnp_vcf from Channel.fromPath(params.dbsnp_vcf)
    file normal_vcf from Channel.fromPath(params.normal_vcf)

    output:
    file '*' into mutect_ch

    script:
    """
    ls
    echo MuTect2 task path: \$PWD
    java -jar /usr/GenomeAnalysisTK.jar \
        --analysis_type MuTect2 \
        --reference_sequence hg19.fa \
        -L designed.bed \
        --normal_panel normal_panel.vcf \
        --cosmic Cosmic.vcf \
        --dbsnp dbsnp.vcf \
        --input_file:tumor $tumor_bam_mu \
        -o mutect2.somatic.unfiltered.vcf \
        --input_file:normal $normal_bam_mu \
        --max_alt_allele_in_normal_fraction 0.1 \
        --minPruning 10 \
        --kmerSize 60
    """
}
My only thought is to build my own Docker image with the reference files already inside, which would presumably save the time spent copying them in. I'd expect the Nextflow + container version to run only slightly slower than the plain CLI version.

Check the task Bash wrapper in the task work dir to assess where the time is actually being spent.
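For example, a minimal sketch of that kind of inspection (the run name, work-dir hash, and BAM path below are placeholders, and the nextflow log field names are from memory):

# Locate the task work directory (the hash is also printed in the console output);
# `nextflow log` lists the runs, and the workdir field should give the task paths.
nextflow log
nextflow log <run_name> -f name,workdir

cd work/ab/cdef1234567890    # placeholder hash -- use the MuTect2 task's directory
cat .command.sh              # the exact shell script Nextflow executed
less .command.log            # the task's combined stdout/stderr
less .command.run            # the wrapper that stages inputs and launches the container

# With stageInMode 'copy' the ~5 GB of BAM files plus the reference are copied into the
# work dir before java even starts; timing a manual copy gives a rough baseline to
# compare against the 7-8 minute gap.
time cp /path/to/tumor.bam /tmp/    # placeholder path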

Related

Problem with snakemake submitting jobs with multiple wildcards on SGE

I used snakemake on an LSF cluster before and everything worked just fine. However, I recently migrated to an SGE cluster and I am getting a very strange error when I try to run a job with more than one wildcard.
When I try to submit a job based on this rule
rule download_reads:
    threads: 1
    output: "data/{sp}/raw_reads/{accesion}_1.fastq.gz"
    shell: "scripts/download_reads.sh {wildcards.sp} {wildcards.accesion} data/{wildcards.sp}/raw_reads/{wildcards.accesion}"
I get the following error (snakemake_clust.sh details below):
./snakemake_clust.sh data/Ecol1/raw_reads/SRA123456_1.fastq.gz
Building DAG of jobs...
Using shell: /bin/bash
Provided cluster nodes: 10
Job counts:
count jobs
1 download_reads
1
[Thu Jul 30 12:08:57 2020]
rule download_reads:
output: data/Ecol1/raw_reads/SRA123456_1.fastq.gz
jobid: 0
wildcards: sp=Ecol1, accesion=SRA123456
scripts/download_reads.sh Ecol1 SRA123456 data/Ecol1/raw_reads/SRA123456
Unable to run job: ERROR! two files are specified for the same host
ERROR! two files are specified for the same host
Exiting.
Error submitting jobscript (exit code 1):
Shutting down, this might take some time.
When I replace the sp wildcard with a constant, it works as expected:
rule download_reads:
    threads: 1
    output: "data/Ecol1/raw_reads/{accesion}_1.fastq.gz"
    shell: "scripts/download_reads.sh Ecol1 {wildcards.accesion} data/Ecol1/raw_reads/{wildcards.accesion}"
I.e. I get
Submitted job 1 with external jobid 'Your job 50731 ("download_reads") has been submitted'.
I wonder why I have this problem; I am sure I used exactly the same rule on the LSF-based cluster before without any problems.
some details
The snakemake submitting script looks like this
#!/usr/bin/env bash
mkdir -p logs
snakemake "$@" -p --jobs 10 --latency-wait 120 --cluster "qsub \
-N {rule} \
-pe smp64 {threads} \
-cwd \
-b y \
-o \"logs/{rule}.{wildcards}.out\" \
-e \"logs/{rule}.{wildcards}.err\""
-b y makes the command be executed as it is, -cwd changes the working directory on the compute node to the working directory from which the job was submitted. The other flags / specifications are clear, I hope.
Also, I am aware of the --drmaa flag, but I think our cluster is not well configured for that; --cluster has so far been the more robust solution.
-- edit 1 --
When I execute exactly the same snakefile locally (on the frontend, without the --cluster flag), the script gets executed as expected. It seems to be a problem with the interaction between snakemake and the scheduler.
-o \"logs/{rule}.{wildcards}.out\" \
-e \"logs/{rule}.{wildcards}.err\""
This is a random guess... More than one wildcard gets concatenated with a space before being substituted into logs/{rule}.{wildcards}.err. So even though you use double quotes, SGE treats the resulting string as two files and throws the error. What if you use single quotes instead? Like:
-o 'logs/{rule}.{wildcards}.out' \
-e 'logs/{rule}.{wildcards}.err'
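In the submit script above, that would look roughly like this (a sketch: same flags as before, only the quoting of the log paths changed):

snakemake "$@" -p --jobs 10 --latency-wait 120 --cluster "qsub \
-N {rule} \
-pe smp64 {threads} \
-cwd \
-b y \
-o 'logs/{rule}.{wildcards}.out' \
-e 'logs/{rule}.{wildcards}.err'"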
Alternatively, you could concatenate the wildcards in the rule and use the result on the command line. E.g.:
rule one:
    params:
        wc=lambda wc: '_'.join(wc)
    output: ...
Then use:
-o 'logs/{rule}.{params.wc}.out' \
-e 'logs/{rule}.{params.wc}.err'
(This second solution, if it works, kind of sucks though)

How to add extensions when running NetLogo headlessly on a cluster?

I am using a common NetLogo extension, "CSV", to read a table. The job fails because it cannot find the extension (although I am sure the extension file is present).
How do I specify that I want to use an extension when working with NetLogo headlessly?
Here is my script:
#!/bin/bash
module load jdk-13.0.2
java -Xmx1024m -Dfile.encoding=UTF-8 -cp \
/opt/software/uoa/2019/apps/netlogo/netlogo-6.1.0/app/netlogo-6.1.0.jar \
org.nlogo.headless.Main \
--model /uoa/home/s11as6/Desktop/SABM-v.8.4-NL6.1.0.nlogo \
--experiment dqi_stability_exp \
--table SABM-table-results.csv \
--threads 1
Here is the error log:
Exception in thread "main" Can't find extension: csv at position 12 in
at org.nlogo.core.ErrorSource.signalError(ErrorSource.scala:11)
at org.nlogo.workspace.ExtensionManager.importExtension(ExtensionManager.scala:178)
at org.nlogo.parse.StructureParser$.$anonfun$parsingWithExtensions$1(StructureParser.scala:74)
at org.nlogo.parse.StructureParser$.$anonfun$parsingWithExtensions$1$adapted(StructureParser.scala:68)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.nlogo.parse.StructureParser$.parsingWithExtensions(StructureParser.scala:68)
at org.nlogo.parse.StructureParser$.parseSources(StructureParser.scala:33)
at org.nlogo.parse.NetLogoParser.basicParse(NetLogoParser.scala:17)
at org.nlogo.parse.NetLogoParser.basicParse$(NetLogoParser.scala:15)
at org.nlogo.parse.FrontEnd$.basicParse(FrontEnd.scala:10)
at org.nlogo.parse.FrontEndMain.frontEnd(FrontEnd.scala:26)
at org.nlogo.parse.FrontEndMain.frontEnd$(FrontEnd.scala:25)
at org.nlogo.parse.FrontEnd$.frontEnd(FrontEnd.scala:10)
at org.nlogo.compile.CompilerMain$.compile(CompilerMain.scala:43)
at org.nlogo.compile.Compiler.compileProgram(Compiler.scala:54)
at org.nlogo.headless.HeadlessModelOpener.openFromModel(HeadlessModelOpener.scala:50)
at org.nlogo.headless.HeadlessWorkspace.openModel(HeadlessWorkspace.scala:539)
at org.nlogo.headless.HeadlessWorkspace.open(HeadlessWorkspace.scala:506)
at org.nlogo.headless.Main$.newWorkspace$1(Main.scala:18)
at org.nlogo.headless.Main$.runExperiment(Main.scala:21)
at org.nlogo.headless.Main$.$anonfun$main$1(Main.scala:12)
at org.nlogo.headless.Main$.$anonfun$main$1$adapted(Main.scala:12)
at scala.Option.foreach(Option.scala:274)
at org.nlogo.headless.Main$.main(Main.scala:12)
at org.nlogo.headless.Main.main(Main.scala)
slurmstepd: error: *** JOB 414910 ON hmem05 CANCELLED AT 2020-04-16T18:15:09 DUE TO TIME LIMIT ***
The simplest solution was to place the CSV extension's folder (along with its files) in the same directory as the model.
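A minimal sketch of that fix; the source path is an assumption based on the installation used above, since the location of the bundled csv extension can differ between NetLogo versions:

# Copy the csv extension folder next to the model so the headless parser can find it.
# The source path below is a guess -- adjust it to where your installation keeps extensions.
cp -r /opt/software/uoa/2019/apps/netlogo/netlogo-6.1.0/app/extensions/csv \
      /uoa/home/s11as6/Desktop/csv
ls /uoa/home/s11as6/Desktop/csv    # should now contain the extension's jar file(s)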

Snakemake conda env parameter is not taken from config.yaml file

I use a conda env that I create manually, not automatically using Snakemake. I do this to keep tighter version control.
Anyway, in my config.yaml I have the following line:
conda_env: '/rst1/2017-0205_illuminaseq/scratch/swo-406/snakemake'
Then, at the start of my Snakefile I read that variable (reading variables from config in your shell part does not seem to work, am I right?):
conda_env = config['conda_env']
Then, in a shell section, I use said parameter like this:
rule rsem_quantify:
    input:
        os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
        os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')
    output:
        os.path.join(analyzed_dir, '{sample}.genes.results'),
        os.path.join(analyzed_dir, '{sample}.STAR.genome.bam')
    threads: 8
    shell:
        '''
        #!/bin/bash
        source activate {conda_env}
        rsem-calculate-expression \
            --paired-end \
            {input} \
            {rsem_ref_base} \
            {analyzed_dir}/{wildcards.sample} \
            --strandedness reverse \
            --num-threads {threads} \
            --star \
            --star-gzipped-read-file \
            --star-output-genome-bam
        '''
Notice the {conda_env}. Now this gives me the following error:
Could not find conda environment: None
You can list all discoverable environments with `conda info --envs`.
Now, if I replace {conda_env} with its value directly, /rst1/2017-0205_illuminaseq/scratch/swo-406/snakemake, it does work! I don't have any trouble reading other parameters using this method (like rsem_ref_base and analyzed_dir in the example rule above).
What could be wrong here?
Highest regards,
Freek.
The pattern I use is to load variables into params, so something along the lines of
rule rsem_quantify:
    input:
        os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
        os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')
    output:
        os.path.join(analyzed_dir, '{sample}.genes.results'),
        os.path.join(analyzed_dir, '{sample}.STAR.genome.bam')
    params:
        conda_env=config['conda_env']
    threads: 8
    shell:
        '''
        #!/bin/bash
        source activate {params.conda_env}
        rsem-calculate-expression \
            ...
        '''
That said, I'd never do this with a conda environment, because Snakemake has conda environment management built in. See the section on Integrated Package Management in the docs for details; it makes reproducibility much more manageable.
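As a rough sketch of that built-in approach (envs/rsem.yaml is a hypothetical environment file pinning the rsem and STAR versions): the rule gets a conda: directive pointing at that file instead of the source activate line, and the workflow is launched with --use-conda:

# Sketch only: assumes the rule declares `conda: "envs/rsem.yaml"` (a hypothetical
# environment file pinning rsem and STAR) in place of the `source activate` line.
# Snakemake then creates and activates that environment itself:
snakemake --use-conda --cores 8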

JMeter distributed testing and command line parameters

I have been using JMeter parameters to specify test attributes like test duration, ramp-up period, etc. for a load test. I specify these parameters in a shell script, which looks like this:
JMETER_PATH="/home/<user>/apache-jmeter-2.13/bin/jmeter.sh"
${JMETER_PATH} \
-Jjmeter.save.saveservice.output_format=csv \
-Jjmeter.save.saveservice.response_data.on_error=true \
-Jjmeter.save.saveservice.print_field_names=true \
-JCUSTOMERS_THREADS=1 \
-JGTI_THREADS=1 \
// Some more properties
Everything works fine here.
Now I have added distributed testing and modified the above script with the JMeter server related info. The new script looks like this:
JMETER_PATH="/home/<user>/apache-jmeter-2.13/bin/jmeter.sh"
${JMETER_PATH} \
-Jjmeter.save.saveservice.output_format=csv \
-Jjmeter.save.saveservice.response_data.on_error=true \
-Jjmeter.save.saveservice.print_field_names=true \
-Jsample_variables=counter,accessToken \
-JCUSTOMERS_THREADS=1 \
-JGTI_THREADS=1 \
// Some more properties
-n \
-R 127.0.0.1:24001,127.0.0.1:24002,127.0.0.1:24003,127.0.0.1:24004,127.0.0.1:24005,127.0.0.1:24006,127.0.0.1:24007,127.0.0.1:24008,127.0.0.1:24009,12$
-Djava.rmi.server.hostname=127.0.0.1 \
The distributed test runs well, but it does not take the parameters specified in the script above into consideration; instead it takes the default values from the JMeter test plan.
Did I mess up any configuration?
Use -G instead of -J for properties to be sent to remote machines as well. -J is local only.
-D[prop_name]=[value] - defines a java system property value.
-J[prop name]=[value] - defines a local JMeter property.
-G[prop name]=[value] - defines a JMeter property to be sent to all remote servers.
-G[propertyfile] - defines a file containing JMeter properties to be sent to all remote servers.
From here
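Applied to the script above, it would look roughly like this (a sketch: only the properties the remote servers need to see are switched to -G, and the server list is abbreviated):

JMETER_PATH="/home/<user>/apache-jmeter-2.13/bin/jmeter.sh"

# -J properties stay local to the client (e.g. how the results file is written);
# -G properties are pushed to every remote server listed with -R.
${JMETER_PATH} \
-Jjmeter.save.saveservice.output_format=csv \
-Jjmeter.save.saveservice.response_data.on_error=true \
-Jjmeter.save.saveservice.print_field_names=true \
-Gsample_variables=counter,accessToken \
-GCUSTOMERS_THREADS=1 \
-GGTI_THREADS=1 \
-n \
-R 127.0.0.1:24001,127.0.0.1:24002 \
-Djava.rmi.server.hostname=127.0.0.1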
Replace -J with -G; for more details, go to the link below or see the image attached.
If you want to run your load test in distributed mode, refer to the URL linked here and search for Server Mode (1.4.5).

Elasticsearch standalone JDBC river feeder missing main class

I'm trying to set up the feeder following these instructions: https://github.com/jprante/elasticsearch-jdbc#installation
I downloaded and unzipped the feeder.
I don't quite understand this step:
run script with a command that starts org.xbib.tools.JDBCImporter with the lib directory on the classpath
What am I supposed to do?
If I try to run a sample script from bin, I get:
Bad substitution
Error: Could not find or load main class org.xbib.elasticsearch.plugin.jdbc.feeder.Runner
Where do I get the Java classes org.xbib.elasticsearch.plugin.jdbc.feeder.Runner and org.xbib.elasticsearch.plugin.jdbc.feeder.JDBCFeeder?
Figured out the solution: it was to set the installation folder in the script (not the elasticsearch folder but the jdbc folder!).
#!/bin/bash
#JDBC Directory -> important, change accordingly!
export JDBC_IMPORTER_HOME=~/Downloads/elasticsearch-jdbc-1.6.0.0
bin=$JDBC_IMPORTER_HOME/bin
lib=$JDBC_IMPORTER_HOME/lib
echo '{
...
...
}
}' | java \
-cp "${lib}/*" \
-Dlog4j.configurationFile=${bin}/log4j2.xml \
org.xbib.tools.Runner \
org.xbib.tools.JDBCImporter
