How to add extensions when running NetLogo headless on a cluster? - cluster-computing

I am using a common NetLogo extension, "csv", to read a table. The job fails because it cannot find the extension (although I am sure the extension file is present).
How do I specify that I want to use an extension when running NetLogo headless?
Here is my script:
#!/bin/bash
module load jdk-13.0.2
java -Xmx1024m -Dfile.encoding=UTF-8 -cp \
/opt/software/uoa/2019/apps/netlogo/netlogo-6.1.0/app/netlogo-6.1.0.jar \
org.nlogo.headless.Main \
--model /uoa/home/s11as6/Desktop/SABM-v.8.4-NL6.1.0.nlogo \
--experiment dqi_stability_exp \
--table SABM-table-results.csv \
--threads 1
Here is the error log:
Exception in thread "main" Can't find extension: csv at position 12 in
at org.nlogo.core.ErrorSource.signalError(ErrorSource.scala:11)
at org.nlogo.workspace.ExtensionManager.importExtension(ExtensionManager.scala:178)
at org.nlogo.parse.StructureParser$.$anonfun$parsingWithExtensions$1(StructureParser.scala:74)
at org.nlogo.parse.StructureParser$.$anonfun$parsingWithExtensions$1$adapted(StructureParser.scala:68)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.nlogo.parse.StructureParser$.parsingWithExtensions(StructureParser.scala:68)
at org.nlogo.parse.StructureParser$.parseSources(StructureParser.scala:33)
at org.nlogo.parse.NetLogoParser.basicParse(NetLogoParser.scala:17)
at org.nlogo.parse.NetLogoParser.basicParse$(NetLogoParser.scala:15)
at org.nlogo.parse.FrontEnd$.basicParse(FrontEnd.scala:10)
at org.nlogo.parse.FrontEndMain.frontEnd(FrontEnd.scala:26)
at org.nlogo.parse.FrontEndMain.frontEnd$(FrontEnd.scala:25)
at org.nlogo.parse.FrontEnd$.frontEnd(FrontEnd.scala:10)
at org.nlogo.compile.CompilerMain$.compile(CompilerMain.scala:43)
at org.nlogo.compile.Compiler.compileProgram(Compiler.scala:54)
at org.nlogo.headless.HeadlessModelOpener.openFromModel(HeadlessModelOpener.scala:50)
at org.nlogo.headless.HeadlessWorkspace.openModel(HeadlessWorkspace.scala:539)
at org.nlogo.headless.HeadlessWorkspace.open(HeadlessWorkspace.scala:506)
at org.nlogo.headless.Main$.newWorkspace$1(Main.scala:18)
at org.nlogo.headless.Main$.runExperiment(Main.scala:21)
at org.nlogo.headless.Main$.$anonfun$main$1(Main.scala:12)
at org.nlogo.headless.Main$.$anonfun$main$1$adapted(Main.scala:12)
at scala.Option.foreach(Option.scala:274)
at org.nlogo.headless.Main$.main(Main.scala:12)
at org.nlogo.headless.Main.main(Main.scala)
slurmstepd: error: *** JOB 414910 ON hmem05 CANCELLED AT 2020-04-16T18:15:09 DUE TO TIME LIMIT ***

The simplest solution was to place the CSV extension's folder (with all of its files) in the same directory as the model.
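For example, a minimal sketch (assuming the bundled extensions of this install live under app/extensions; adjust the path to wherever the csv extension folder actually is in your installation):
# Copy the bundled csv extension next to the .nlogo model so the headless
# compiler can resolve `extensions [csv]` relative to the model file.
NETLOGO_HOME=/opt/software/uoa/2019/apps/netlogo/netlogo-6.1.0
MODEL_DIR=/uoa/home/s11as6/Desktop
cp -r "$NETLOGO_HOME/app/extensions/csv" "$MODEL_DIR/"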

Related

Slowdown when using Mutect2 container inside Nextflow

I'm trying to run MuTect2 on a sample, which on my machine using java takes about 27 minutes.
If I use virtually the same code inside Nextflow, using the GATK3:3.6 docker container to run MuTect2, it takes about 7 minutes longer, for no apparent reason.
I'm running on Ubuntu 18.04; the tumor and normal samples are from an Oncomine panel. The tumor is 4.1G, the normal 1.1G. I thought the time might be spent copying data into the container, but 7-8 minutes seems far too long for that. Could it be from copying in the reference files too?
bai_ch is the channel that brings in the tumor and normal index files
process MuTect2 {
    label 'mutect'
    stageInMode 'copy'
    publishDir './output', mode : 'copy', overwrite : true

    input:
    file tumor_bam_mu from tumor_mu
    file normal_bam_mu from normal_mu
    file "*" from bai_ch
    file mutect2_ref
    file ref_index from ref_fasta_i_m
    file ref_dict from Channel.fromPath(params.ref_fast_dict)
    file regions_file from Channel.fromPath(params.regions)
    file cosmic_vcf from Channel.fromPath(params.cosmic_vcf)
    file dbsnp_vcf from Channel.fromPath(params.dbsnp_vcf)
    file normal_vcf from Channel.fromPath(params.normal_vcf)

    output:
    file '*' into mutect_ch

    script:
    """
    ls
    echo MuTect2 task path: \$PWD
    java -jar /usr/GenomeAnalysisTK.jar \
        --analysis_type MuTect2 \
        --reference_sequence hg19.fa \
        -L designed.bed \
        --normal_panel normal_panel.vcf \
        --cosmic Cosmic.vcf \
        --dbsnp dbsnp.vcf \
        --input_file:tumor $tumor_bam_mu \
        --input_file:normal $normal_bam_mu \
        -o mutect2.somatic.unfiltered.vcf \
        --max_alt_allele_in_normal_fraction 0.1 \
        --minPruning 10 \
        --kmerSize 60
    """
}
My only thought is to build my own Docker image with the reference files already inside, which should save the time spent copying them in. I'd expect the Nextflow+container version to run only slightly slower than the CLI version.
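A minimal sketch of that idea (the base image tag and reference paths are assumptions; substitute your own):
# Bake the reference bundle into a derived image so it is not copied
# into every task's work directory (image name and paths are hypothetical).
mkdir ref-image && cd ref-image
cp /path/to/hg19.fa /path/to/hg19.fa.fai /path/to/hg19.dict .
cat > Dockerfile <<'EOF'
FROM broadinstitute/gatk3:3.6-0
COPY hg19.fa hg19.fa.fai hg19.dict /ref/
EOF
docker build -t gatk3-with-ref .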
Check the task's Bash wrapper in the task work directory to assess where the time goes.
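For example (a sketch; the work-directory hash below is hypothetical, and Nextflow prints the real one per task):
# Each task runs in work/<hash>/ alongside its wrapper and logs.
cd work/3f/9a1b2c...        # hypothetical hash; list real ones with: nextflow log <run> -f name,workdir
cat .command.sh             # the actual MuTect2 command that was executed
cat .command.run            # the Bash wrapper: input staging, docker run, cleanup
time bash .command.run      # re-run the task standalone to see where the time goes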

java.lang.IllegalArgumentException: Both source file listing and source paths present

I am trying to copy files from S3 to HDFS using distcp by executing the following command:
hadoop distcp -fs.s3a.access.key=AccessKey -fs.s3a.secret.key=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
But I am getting the following error:
17/09/05 02:59:30 ERROR tools.DistCp: Invalid arguments:
java.lang.IllegalArgumentException: Both source file listing and source paths present
at org.apache.hadoop.tools.OptionsParser.parseSourceAndTargetPaths(OptionsParser.java:341)
at org.apache.hadoop.tools.OptionsParser.parse(OptionsParser.java:89)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:112)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:436)
Invalid arguments: Both source file listing and source paths present
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and
append new data to them if possible
-async Should distcp execution be blocking
Without the -D prefix, distcp parses -fs.s3a... as its -f (source file listing) option, which is why it complains that both a source file listing and source paths are present. To pass configuration parameters, you have to prefix them with -D:
hadoop distcp -Dfs.s3a.access.key=AccessKey -Dfs.s3a.secret.key=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
Note also that the URI scheme is s3n, so the credentials must be set with the matching s3n property names rather than the s3a ones.
Old command:
hadoop distcp -Dfs.s3a.access.key=AccessKey -Dfs.s3a.secret.key=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
Rectified command:
hadoop distcp -Dfs.s3n.awsAccessKeyId=AccessKey -Dfs.s3n.awsSecretAccessKey=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
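To confirm the transfer afterwards, a quick listing of the target (paths as above):
hadoop fs -ls hdfs://hostname:portnumber/tmp/test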

Reading file in hadoop streaming

I am trying to read an auxiliary file in my mapper; here are my code and commands.
mapper code:
#!/usr/bin/env python
from itertools import combinations
from operator import itemgetter
import sys
storage = {}
with open('inputData', 'r') as inputFile:
    for line in inputFile:
        first, second = line.split()
        storage[(first, second)] = 0

for line in sys.stdin:
    do_something()
And here is my command:
hadoop jar hadoop-streaming-2.7.1.jar \
-D stream.num.map.output.key.fields=2 \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options='-k1,1 -k2,2' \
-D mapred.map.tasks=20 \
-D mapred.reduce.tasks=10 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper mapper.py -file mapper.py \
-reducer reducer.py -file reducer.py \
-file inputData \
-input /data \
-output /result
But I keep getting the error below, which indicates that my mapper fails to read from stdin. After deleting the file-reading part, my code works, so I have pinpointed where the error occurs, but I don't know the correct way of reading the auxiliary file. Can anyone help?
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
The error you are getting means your mapper failed to write to its stdout stream for too long.
For example, a common reason for this error is that your do_something() function contains a for loop with a continue statement under certain conditions. When that condition occurs too often in your input data, the script hits continue many times in a row without producing any output on stdout. Hadoop waits too long without seeing anything, so the task is considered failed.
Another possibility is that your input data file is too large and takes too long to read. But I think that counts as setup time, since it happens before the first line of output; I am not sure though.
There are two relatively easy ways to solve this:
(developer side) Modify your code to output something every now and then. In the case of continue, write a short dummy symbol like '\n' to let Hadoop know your script is alive (sketches of both options follow this list).
(system side) I believe you can set the following parameter with the -D option, which controls the timeout in milliseconds:
mapreduce.reduce.shuffle.read.timeout
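Sketches of both options (bash is used for illustration even though the mapper above is Python, and the 1800000 ms value is an arbitrary 30 minutes). For option 1, instead of printing dummy output, Hadoop streaming also treats stderr lines of the form reporter:status:<msg> as task progress:
#!/bin/bash
# Option 1 as a bash mapper: emit a status line to stderr periodically
# so Hadoop counts the task as making progress.
n=0
while read -r line; do
  # ... real per-line mapper work here ...
  n=$((n + 1))
  if [ $((n % 100000)) -eq 0 ]; then
    echo "reporter:status:processed $n lines" >&2
  fi
done
Option 2 passes the property suggested above straight on the command line:
hadoop jar hadoop-streaming-2.7.1.jar \
-D mapreduce.reduce.shuffle.read.timeout=1800000 \
-mapper mapper.py -file mapper.py \
-reducer reducer.py -file reducer.py \
-file inputData \
-input /data \
-output /result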
I have never tried option 2. Usually I'd avoid streaming on data that requires filtering. Streaming, especially with a scripting language like Python, should do as little work as possible. My use cases are mostly post-processing output data from Apache Pig, where filtering is already done in the Pig scripts and I need something that is not available in Jython.

JMeter distributed testing and command line parameters

I have been using JMeter properties to specify test attributes like test duration, ramp-up period, etc. for a load test. I specify these parameters in a shell script, which looks like this:
JMETER_PATH="/home/<user>/apache-jmeter-2.13/bin/jmeter.sh"
${JMETER_PATH} \
-Jjmeter.save.saveservice.output_format=csv \
-Jjmeter.save.saveservice.response_data.on_error=true \
-Jjmeter.save.saveservice.print_field_names=true \
-JCUSTOMERS_THREADS=1 \
-JGTI_THREADS=1 \
# Some more properties
Everything goes good here.
Now I added distributed testing and modified the above script with the JMeter server information, so the new script looks like this:
JMETER_PATH="/home/<user>/apache-jmeter-2.13/bin/jmeter.sh"
${JMETER_PATH} \
-Jjmeter.save.saveservice.output_format=csv \
-Jjmeter.save.saveservice.response_data.on_error=true \
-Jjmeter.save.saveservice.print_field_names=true \
-Jsample_variables=counter,accessToken \
-JCUSTOMERS_THREADS=1 \
-JGTI_THREADS=1 \
# Some more properties
-n \
-R 127.0.0.1:24001,127.0.0.1:24002,127.0.0.1:24003,127.0.0.1:24004,127.0.0.1:24005,127.0.0.1:24006,127.0.0.1:24007,127.0.0.1:24008,127.0.0.1:24009,12$
-Djava.rmi.server.hostname=127.0.0.1 \
The distributed test runs well, but it does not take the parameters specified in the script above into consideration; instead it takes the default values from the JMeter test plan.
Did I mess up any configuration?
Use -G instead of -J for properties to be sent to remote machines as well. -J is local only.
-D[prop_name]=[value] - defines a java system property value.
-J[prop name]=[value] - defines a local JMeter property.
-G[prop name]=[value] - defines a JMeter property to be sent to all remote servers.
-G[propertyfile] - defines a file containing JMeter properties to be sent to all remote servers.
(from the JMeter documentation)
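So the fix is a sketch like this, assuming the save-service settings are only needed locally while the thread counts must reach the remote servers:
${JMETER_PATH} \
-Jjmeter.save.saveservice.output_format=csv \
-Jjmeter.save.saveservice.response_data.on_error=true \
-Jjmeter.save.saveservice.print_field_names=true \
-GCUSTOMERS_THREADS=1 \
-GGTI_THREADS=1 \
-n \
-R 127.0.0.1:24001,127.0.0.1:24002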
Replace -J with -G. For more details, see the JMeter documentation on remote testing, in particular the Server Mode section (1.4.5).

Elasticsearch standalone JDBC river feeder missing main class

I'm trying to set up the feeder following these instructions: https://github.com/jprante/elasticsearch-jdbc#installation
I downloaded and unzipped the feeder.
I don't quite understand this step:
run script with a command that starts org.xbib.tools.JDBCImporter with the lib directory on the classpath
What am I supposed to do?
If I try to run a sample script from bin, I get:
Bad substitution
Error: Could not find or load main class org.xbib.elasticsearch.plugin.jdbc.feeder.Runner
Where do I get the java classes org.xbib.elasticsearch.plugin.jdbc.feeder.Runner and org.xbib.elasticsearch.plugin.jdbc.feeder.JDBCFeeder?
I figured out the solution: it was to set the installation folder in the script (not the elasticsearch folder but the jdbc folder!). The -cp "${lib}/*" line below is what puts the lib directory on the classpath, which is the step the instructions describe.
#!/bin/bash
#JDBC Directory -> important, change accordingly!
export JDBC_IMPORTER_HOME=~/Downloads/elasticsearch-jdbc-1.6.0.0
bin=$JDBC_IMPORTER_HOME/bin
lib=$JDBC_IMPORTER_HOME/lib
echo '{
...
...
}
}' | java \
-cp "${lib}/*" \
-Dlog4j.configurationFile=${bin}/log4j2.xml \
org.xbib.tools.Runner \
org.xbib.tools.JDBCImporter
