How to get the training metrics in a file? - stanford-nlp

I have trained my own NER model. I would like to know whether I can retrieve the training metrics from a file after training. As console output alone they are unfortunately not usable for me.
I used the following command:
java -cp /content/stanford-ner-tagger/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /content/ner-model1.ser.gz -testFile /content/test1.tsv
Does anyone have an idea how I can get the output as a file?

You can keep all of the output from training time by redirecting it to a file, e.g. > asdf.txt or > asdf.txt 2>&1
You can recreate the confusion matrix with
java edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier your_classifier.ser.gz -testFile your_test_file.txt
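For example, the evaluation command from the question could be rerun with its console output captured in a file (ner-eval.txt is just an example name; any path works):
java -cp /content/stanford-ner-tagger/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier /content/ner-model1.ser.gz -testFile /content/test1.tsv > ner-eval.txt 2>&1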

Related

JMeter MergeResults is not handling timeStamp label correctly (millis)

Created two dummy sample projects (dummy1.jmx and dummy2.jmx) and executed the commands below with default settings (JMeter 5.3 default installation with all required plugins installed).
#> jmeter.bat -n -t dummy1.jmx -l dummy1.csv -j dummy1-jmeter.log to execute load
Generated report and timestamps look perfect both in dashboard and graphs
#> jmeter.bat -g dummy1.csv -o dummy1 -j dummy1-report-jmeter.log
#> jmeter.bat -n -t dummy2.jmx -l dummy2.csv -j dummy2-jmeter.log to execute load
Generated report and timestamps look perfect both in dashboard and graphs
#> jmeter.bat -g dummy2.csv -o dummy2 -j dummy2-report-jmeter.log
Used MergeResults plugin to merge the above CSV files to a single file and generated HTML report
#> JMeterPluginsCMD.bat --generate-csv dummy1-dummy2.csv --input-jtl merge.properties --plugin-type MergeResults
Found that the merged timeStamp values are not valid, and the generated report also shows invalid date/time values.
#> jmeter.bat -g dummy1-dummy2.csv -o merged -j merged-report-jmeter.log
Is this a bug, or am I missing some configuration? Even adding jmeter.save.saveservice.timestamp_format=yyyy/MM/dd HH:mm:ss.SSS to user.properties didn't help.
merge.properties
inputJtl1=dummy1.csv
prefixLabel1=TEST1:
includeLabels1=.*
excludeLabels1=
includeLabelRegex1=true
excludeLabelRegex1=
startOffset1=
endOffset1=
inputJtl2=dummy2.csv
prefixLabel2=TEST2:
includeLabels2=.*
excludeLabels2=
includeLabelRegex2=true
excludeLabelRegex2=
startOffset2=
endOffset2=
Unfortunately we cannot help without:
Seeing your merge.properties file contents
Knowing what you expect
In the meantime I can only tell you where this 2000-01-01 date came from:
It's declared here:
private static final long REF_START_TIME = 946681200000L;
And being added to the original SampleResult timestamp here:
res.setTimeStamp(res.getTimeStamp() - startTimeRef + REF_START_TIME);
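As a quick sanity check of what that constant corresponds to (a sketch assuming a shell with GNU date available):
# 946681200000 ms since the epoch is the REF_START_TIME constant above
date -u -d @946681200
# prints: Fri Dec 31 23:00:00 UTC 1999, i.e. 2000-01-01 00:00 in CET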
I don't know whether it is a bug or whether it is designed to work like this (the logic of subtracting the sampler start time from its timestamp is beyond my limited understanding); it's better to check at the JMeter Plugins support forum.
In the meantime you can use services like BM.Sense for comparing the results of different test runs.

MapReduceIndexerTool output dir error "Cannot write parent of file"

I want to use Cloudera's MapReduceIndexerTool to understand how morphlines work. I created a basic morphline that just reads lines from the input file, and I tried to run the tool using this command:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphline.conf \
--output-dir hdfs:///hostname/dir/ \
--dry-run true
Hadoop is installed on the same machine where I run this command.
The error I'm getting is the following:
net.sourceforge.argparse4j.inf.ArgumentParserException: Cannot write parent of file: hdfs:/hostname/dir
at org.apache.solr.hadoop.PathArgumentType.verifyCanWriteParent(PathArgumentType.java:200)
The /dir directory has 777 permissions on it, so it is definitely allowed to write into it. I don't know what I should do to allow it to write into that output directory.
I'm new to HDFS and I don't know how I should approach this problem. Logs don't offer me any info about that.
What I have tried so far (with no result):
created a hierarchy of 2 directories (/dir/dir2) and put 777 permissions on both of them
changed the output-dir scheme from hdfs:///... to hdfs://..., because all the examples in the --help menu are built that way, but this leads to an invalid scheme error
Thank you.
It states 'cannot write parent of file', and the parent in your case is /. Take a look at the source:
private void verifyCanWriteParent(ArgumentParser parser, Path file) throws ArgumentParserException, IOException {
  Path parent = file.getParent();
  if (parent == null || !fs.exists(parent) || !fs.getFileStatus(parent).getPermission().getUserAction().implies(FsAction.WRITE)) {
    throw new ArgumentParserException("Cannot write parent of file: " + file, parser);
  }
}
What is printed in the message is file, in your case hdfs:/hostname/dir, so file.getParent() will be /.
Additionally, you can check the permissions with the hadoop fs command; for example, you can try to create a zero-length file in the path:
hadoop fs -touchz /test-file
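To also inspect the permission bits of the parent directory itself (/ in your case), a quick check, assuming the hadoop client on that machine points at the same cluster, could be:
# -d lists the directory entry itself instead of its contents
hadoop fs -ls -d /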
I solved that problem after days of working on it.
The problem is with that line --output-dir hdfs:///hostname/dir/.
First of all, there should not be 3 slashes at the beginning, as I had put in my repeated attempts to make this work; there are only 2 (as in any valid HDFS URI). I had actually put 3 slashes because otherwise the tool throws an invalid scheme exception! You can easily see in this code that the scheme check is done before the verifyCanWriteParent check.
I tried to get the hostname by simply running the hostname command on the CentOS machine that I was running the tool on. This was the main issue. I analyzed the /etc/hosts file and saw that there are 2 hostnames for the same local IP. I took the second one and it worked. (I also attached the port to the hostname, so the final format is: --output-dir hdfs://correct_hostname:8020/path/to/file/from/hdfs)
This error is very confusing because everywhere you look for the namenode hostname, you will see the same thing that the hostname command returns. Moreover, the errors are not structured in a way that you can diagnose the problem and take a logical path to solve it.
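One way to cross-check the NameNode host and port, assuming the HDFS client configuration on the machine points at the cluster you want to write to, is to ask the config directly:
# prints the default filesystem URI, e.g. hdfs://correct_hostname:8020
hdfs getconf -confKey fs.defaultFS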
Additional information regarding this tool and debugging it
If you want to see the actual code that runs behind it, check the Cloudera version that you are running and select the corresponding branch on the official repository. The master branch is not up to date.
If you just want to run this tool to play with morphlines (using the --dry-run option) without connecting to Solr, you can't. You have to specify a ZooKeeper endpoint and a Solr collection or a Solr config directory, which involves additional research work. This is something that could be improved in this tool.
You don't need to run the tool with -u hdfs, it works with a regular user.

ESQL performance tools

I would like to analyze individual ESQL modules for performance on IBM Integration Bus, not the whole application with PerfHarness. I know that there is a list of good practices for writing ESQL (for example, this one: ESQL code tips).
But is there a tool for performance analysis of just one ESQL module?
You can check through your Broker's Web User Interface. Just turn on statistics for your flow (with your ESQL code) and it will show how much time the processing took in each node.
I know this is rather old but it still covers the basics: https://www.ibm.com/developerworks/websphere/library/techarticles/0406_dunn/0406_dunn.html. The section on "Isolate the problem using accounting and statistics" should answer your question, and the part on using trace should help you profile the statements within an ESQL module.
The trace file generated at the debug level shows you how long each statement took to execute, down to microsecond precision, helping you to find the problematic statement or loop.
To get a trace file, do the following:
Step 1: Start a user trace using the command below
mqsichangetrace <Node> -u -e <Server> -f <MessageFlowName> -l debug -r
Step 2: Send a message through the message flow.
Step 3: Stop the trace using the MQSI command below
mqsichangetrace <Node> -u -e <Server> -f "<Message Flow Name>" -l none
Step 4: Read the trace content into a file:
mqsireadlog <Node> -u -e <Server> -f -o flowtrace.xml
Step 5: Format the XML trace file into a user-readable format:
mqsiformatlog -i flowtrace.xml -o flowtrace.txt
Examine the text file.
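Putting the steps together as a single sequence, with hypothetical names (an integration node MYNODE, an integration server default, a flow MyFlow; substitute your own):
mqsichangetrace MYNODE -u -e default -f MyFlow -l debug -r
# ...send a test message through the flow here...
mqsichangetrace MYNODE -u -e default -f MyFlow -l none
mqsireadlog MYNODE -u -e default -f -o flowtrace.xml
mqsiformatlog -i flowtrace.xml -o flowtrace.txt
less flowtrace.txt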

Hive output consists of these 2 warnings at the end. How do I suppress these 2 warnings?

The output of a Hive query that uses UDFs includes these 2 warnings at the end. How do I suppress these 2 warnings? Please note that they come right after the query output, as part of the output.
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
hadoop version
Hadoop 2.6.0-cdh5.4.0
hive --version
Hive 1.1.0-cdh5.4.0
If you use beeline instead of Hive the error goes away. Not the best solution, but I'm planning to post to the CDH user group asking the same question to see if it's a bug that can be fixed.
This error occurs due to the added assembly jar, which contains classes from jcl-over-slf4j.jar (which is causing the stdout messages) and slf4j-log4j12.jar.
You can try a couple of things to begin with:
Try removing the assembly jar, if you are using one.
Look at the following link: https://issues.apache.org/jira/browse/HIVE-12179
This suggests that there is a flag in Hive so that the spark-assembly is loaded only if HIVE_ADD_SPARK_ASSEMBLY = "true".
https://community.hortonworks.com/questions/34311/warning-message-in-hive-output-after-upgrading-to.html :
There is also a workaround that avoids any of these changes: manually remove the 2 lines from the end of the output files using a shell script.
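A minimal sketch of that workaround, assuming the query output has already been written to out.txt and that GNU head is available (the file name is just an example):
# head -n -2 prints everything except the last 2 lines, i.e. the 2 WARN lines
head -n -2 out.txt > out.clean && mv out.clean out.txt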
I have tried to set HIVE_ADD_SPARK_ASSEMBLY=false, but it didn't work.
Finally, I found a question posted on the Cloudera community. See: https://community.cloudera.com/t5/Support-Questions/Warning-message-in-Hive-output-after-upgrading-to-hive/td-p/157141
You could try the following command; it works for me!
hive -S -d ns=$hiveDB -d tab=$t -d dunsCol=$c1 -d phase="$ph1" -d error=$c2 -d ts=$eColumnArray -d reporting_window=$rDate -f $dir'select_count.hsql' | grep -v "^WARN" > $gOutPut 2> /dev/null
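Stripped down to the essential pattern (query.hql and output.txt are placeholder names), the idea is to filter the WARN lines out of stdout and discard stderr:
hive -S -f query.hql 2>/dev/null | grep -v "^WARN" > output.txt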

Passing directories to hadoop streaming : some help needed

The context is that I am trying to run a streaming job on Amazon EMR (via the web UI) with a bash script as the mapper; the job arguments look like this:
-input s3://emrdata/test_data/input -output s3://emrdata/test_data/output \
-mapper s3://emrdata/test_data/scripts/mapperScript.sh -reducer NONE
The input directory has sub-directories in it and these sub-directories have gzipped data files.
The relevant part of mapperScript.sh that fails is:
for filename in "$input"/*; do
    dir_name=`dirname "$filename"`
    fname=`basename "$filename"`
    echo "$fname" >/dev/stderr
    modelname=${fname}.model
    modelfile=$model_location/$modelname
    echo "$modelfile" >/dev/stderr
    inputfile=$dir_name/$fname
    echo "$inputfile" >/dev/stderr
    outputfile=$output/$fname
    echo "$outputfile" >/dev/stderr
    # Will do some processing on the files in the sub-directories here
done # this is the loop for getting input from all sub-directories
Basically, I need to read the sub-directories in streaming mode, and when I run this, Hadoop complains:
2013-03-01 10:41:26,226 ERROR org.apache.hadoop.security.UserGroupInformation (main): PriviledgedActionException as:hadoop cause:java.io.IOException: Not a file: s3://emrdata/test_data/input/data1
2013-03-01 10:41:26,226 ERROR org.apache.hadoop.streaming.StreamJob (main): Error Launching job : Not a file: s3://emrdata/test_data/input/data1
I am aware that a similar question has been asked here.
The suggestion there was to write one's own InputFormat. I am wondering if I am missing something else in the way my script is written / the EMR inputs are given, or whether writing my own InputFormat in Java is my only choice.
I have tried giving EMR my input as "input/*" as well, but no luck.
It seems that while there may be some temporary workarounds, Hadoop inherently doesn't support this yet, as you can see from the open ticket on it here.
So inputpath/*/* may work for 2 levels of subdirectories, but it may fail for deeper nesting.
The best thing you can do for now is to recursively list the files/folders that have no further subdirectories and pass them in as a comma-separated list of input paths. You can use simple tools like s3cmd for this, as in the sketch below.
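A rough sketch of that approach, assuming s3cmd is installed and configured (the bucket and prefix are the ones from the question): list every object under the input prefix, take the S3 URI column, and join the paths into one comma-separated string that can be passed as -input.
INPUTS=$(s3cmd ls --recursive s3://emrdata/test_data/input/ | awk '{print $NF}' | paste -sd, -)
echo "$INPUTS"
# then pass it to the streaming job: ... -input "$INPUTS" -output s3://emrdata/test_data/output -mapper ... -reducer NONE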

Resources