About the Stanford CoreNLP Chinese model - stanford-nlp

How do I use the Chinese model? I downloaded "stanford-corenlp-3.5.2-models-chinese.jar" into my classpath and copied
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.2</version>
    <classifier>models-chinese</classifier>
</dependency>
into my pom.xml file. In addition, my input.txt is
因出席中國大陸閱兵引發爭議的國民黨前主席連戰今晚金婚宴,立法院長王金平說,已向連戰恭喜,等一下回南部。
連戰夫婦今晚的50週年金婚紀念宴,正值連戰赴陸出席閱兵引發爭議之際,社會關注會否受到影響。
包括國民黨主席朱立倫、副主席郝龍斌等人已分別對外表示另有行程,無法出席。
then I run the pipeline with the command
java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt
and the result is as follows, but the output is garbled as shown in the transcript below. How do I solve this problem?
C:\stanford-corenlp-full-2015-04-20>java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt
Registering annotator segment with class edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
Adding annotator segment
Loading Segmentation Model ... Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... Loading Chinese dictionaries from 1 file:
edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
Done. Unique words in ChineseDictionary is: 423200.
done [22.9 sec].
Ready to process: 1 files, skipped 0, total 1
Processing file C:\stanford-corenlp-full-2015-04-20\input.txt ... writing to C:\stanford-corenlp-full-2015-04-20\input.txt.xml {
Annotating file C:\stanford-corenlp-full-2015-04-20\input.txt Adding Segmentation annotation ... INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
?]?X?u????j???\?L??o??????????e?D?u?s???????B?b?A??k?|???????????A?w?V?s?????A?
??#?U?^?n???C
?s?????????50?g?~???B?????b?A????s??u???X?u?\?L??o???????A???|???`?|?_????v?T?C
?]?A?????D?u?????B??D?u?q?s?y???H?w???O??~???t????{?A?L?k?X?u?C
--->
[?, ], ?, X, ?u????j???, \, ?L??o??????????e?D?u?s???????B?b?A??k?|???????????A?
w?V?s?????A???#?U?^?n???C, , , , ?s?????????, 50, ?, g?, ~, ???B?????b?A????s??u
???X?u?, \, ?L??o???????A???, |, ???, `, ?, |, ?_????v?T?C, , , , ?, ], ?, A????
?D?u???, ??, B??D?u?q, ?, s?y???H?w???O??, ~, ???t????, {, ?, A?L?k?X?u?C]
}
Processed 1 documents
Skipped 0 documents, error annotating 0 documents
Annotation pipeline timing information:
ChineseSegmenterAnnotator: 0.1 sec.
TOTAL: 0.1 sec. for 34 tokens at 485.7 tokens/sec.
Pipeline setup: 0.0 sec.
Total time for StanfordCoreNLP pipeline: 0.1 sec.

I edited your question to change the command to the one that you actually used to produce the output shown. It looks like you worked out that the former command:
java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt input.xml
ran the English analysis pipeline, and that didn't work very well for Chinese text....
CoreNLP's support for Chinese in v3.5.2 is still a little rough, and will hopefully be a bit smoother in the next release. But from here you need to:
Specify a properties file for Chinese, giving appropriate models. (If no properties file is specified, CoreNLP defaults to English): -props StanfordCoreNLP-chinese.properties
At present, word segmentation of Chinese is not the annotator tokenize, but segment, specified as a custom annotator in StanfordCoreNLP-chinese.properties. (Maybe we'll unify the two in a future release...)
The current dcoref annotator only works for English. There is Chinese coreference, but it is not fully integrated into the pipeline. If you want to use it, you currently have to write some code, as explained here. So let's delete it. (Again, this should be better integrated in the future).
At that point, things run, but the ugly stderr output you show appears because the segmenter has VERBOSE turned on by default and your console's character encoding is not right for our Chinese output. We should have VERBOSE off by default, but you can turn it off with: -segment.verbose false
We have no Chinese lemmatizer, so you may as well delete that annotator.
Also, CoreNLP needs more than 1GB of RAM. Try 2GB.
At this point, all should be good! With the command:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit,pos,ner,parse -segment.verbose false -file input.txt
you get the output in input.txt.xml. (I'm not posting it, since it's a couple of thousand lines long....)
Update for CoreNLP v3.8.0: If using the (current in 2017) CoreNLP v3.8.0, then there are some changes/progress: (i) we now use the annotator tokenize for all languages, and it doesn't require loading a custom annotator for Chinese; (ii) verbose segmentation is correctly turned off by default; (iii) [negative progress] the annotator requirements now demand the lemma annotator prior to ner, even though it is a no-op for Chinese; and (iv) coreference is now available for Chinese, invoked as coref, which requires the prior annotator mention, and its statistical models require considerable memory. Put that all together, and you're now good with this command:
java -cp "*" -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators tokenize,ssplit,pos,lemma,ner,parse,mention,coref -file input.txt

Related

How to get the training metrics in a file?

I have trained my own NER model. I would be interested to know if I can retrieve the metrics from a file somewhere after training. As console output alone, they are unfortunately not usable for me.
I used the following command:
java -cp /content/stanford-ner-tagger/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /content/ner-model1.ser.gz -testFile /content/test1.tsv
Does anyone have an idea how I can get the output as a file?
You can keep all the output at training time by redirecting it to a file, > asdf.txt or > asdf.txt 2>&1
You can recreate the confusion matrix with
java edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier your_classifier.ser.gz -testFile your_test_file.txt
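Combining the two, something along these lines should capture the evaluation output (including precision/recall and the confusion matrix) in a file, reusing the paths from the question; metrics.txt is just an example name:
java -cp /content/stanford-ner-tagger/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /content/ner-model1.ser.gz -testFile /content/test1.tsv > metrics.txt 2>&1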

How to speed up the processing time of long article with StanfordCoreNLP (v3.9.2)

I have an article with 8226 chars, and what I want is to extract NERs (see the original article Here).
Using the command below costs 8.0 sec at NERCombinerAnnotator:
java -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.model edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz -ner.nthreads 4 -file longArticleSample.txt -outputFormat json
Also, I have tried another article with 1973 chars in the same way. It takes 4.2 sec to get NERs.
java -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.model edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz -ner.nthreads 4 -file mediumArticle.txt -outputFormat json
This result is much less efficient than the authors' published result (both use Token+SS+PoS+L+NER).
[My Result]
MediumLengthArticle: 4.5 sec. for 356 tokens at 78.4 tokens/sec.
LongArticle: 8.4 sec. for 1683 tokens at 200.3 tokens/sec.
[Stanford Result]
More than 10,000 tokens/sec.
originalSite

Setting queue name in pig v0.15

I am getting the below exception while trying to execute a Pig script via the shell.
JobId Alias Feature Message Outputs
job_1520637789949_340250 A,B,D,top_rec GROUP_BY Message: java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1520637789949_340250 to YARN : Application rejected by queue placement policy
I understand that it is due to not setting the correct queue name for MR execution. To find out how to set a queue name for the MapReduce job, I searched through the help (pig --help), which listed the options below:
Apache Pig version 0.15.0-mapr-1611 (rexported)
compiled Dec 06 2016, 05:50:07
USAGE: Pig [options] [-] : Run interactively in grunt shell.
Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
Pig [options] [-f[ile]] file : Run cmds found in file.
options include:
-4, -log4jconf - Log4j configuration file, overrides log conf
-b, -brief - Brief logging (no timestamps)
-c, -check - Syntax check
-d, -debug - Debug level, INFO is default
-e, -execute - Commands to execute (within quotes)
-f, -file - Path to the script to execute
-g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
-h, -help - Display this message. You can specify topic to get help for that topic.
properties is the only topic currently supported: -h properties.
-i, -version - Display version information
-l, -logfile - Path to client side log file; default is current working directory.
-m, -param_file - Path to the parameter file
-p, -param - Key value pair of the form param=val
-r, -dryrun - Produces script with substituted parameters. Script is not executed.
-t, -optimizer_off - Turn optimizations off. The following values are supported:
ConstantCalculator - Calculate constants at compile time
SplitFilter - Split filter conditions
PushUpFilter - Filter as early as possible
MergeFilter - Merge filter conditions
PushDownForeachFlatten - Join or explode as late as possible
LimitOptimizer - Limit as early as possible
ColumnMapKeyPrune - Remove unused data
AddForEach - Add ForEach to remove unneeded columns
MergeForEach - Merge adjacent ForEach
GroupByConstParallelSetter - Force parallel 1 for "group all" statement
PartitionFilterOptimizer - Pushdown partition filter conditions to loader implementing LoadMetaData
PredicatePushdownOptimizer - Pushdown filter predicates to loader implementing LoadPredicatePushDown
All - Disable all optimizations
All optimizations listed here are enabled by default. Optimization values are case insensitive.
-v, -verbose - Print all error messages to screen
-w, -warning - Turn warning logging on; also turns warning aggregation off
-x, -exectype - Set execution mode: local|mapreduce|tez, default is mapreduce.
-F, -stop_on_failure - Aborts execution on the first failed job; default is off
-M, -no_multiquery - Turn multiquery optimization off; default is on
-N, -no_fetch - Turn fetch optimization off; default is on
-P, -propertyFile - Path to property file
-printCmdDebug - Overrides anything else and prints the actual command used to run Pig, including
any environment variables that are set by the pig command.
18/03/30 13:03:05 INFO pig.Main: Pig script completed in 163 milliseconds (163 ms)
I tried pig -p mapreduce.job.queuename=my_queue; and was able to log into grunt without any error.
However, on the first command itself, it threw the error below:
ERROR 2997: Encountered IOException. org.apache.pig.tools.parameters.ParseException: Encountered " <OTHER> ".job.queuename=my_queue "" at line 1, column 10.
Was expecting:
"=" ...
I am not sure if I am doing this right.
To set the queue name in Pig 0.15, I found the options below (they may work for other versions too):
1) Pig comes with an option to start the Pig session using a queue name. Simply use the command below:
pig -Dmapreduce.job.queuename=my_queue
2) Another option is to set the same in the grunt shell or in the pig script itself.
set mapreduce.job.queuename my_queue;
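Note that -D sets a Java/Hadoop property and has to come before any other Pig arguments, whereas -p/-param only substitutes script parameters, which is why the earlier -p attempt threw a ParseException. For example, to run a script non-interactively on a given queue (myscript.pig is just a placeholder):
pig -Dmapreduce.job.queuename=my_queue -f myscript.pig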

sematext logagent debugging patterns

I have installed sematext logagent https://sematext.github.io/logagent-js/installation/
I configured it to output to Elasticsearch and all is good, except for one thing which I spent all day trying to do.
There is zero, null, none information on how to debug parsers. I start logagent with "logagent --config logagent.yml -v -j"; my yml file is below:
options:
  printStats: 30
  # don't write parsed logs to stdout
  suppress: false
  # Enable/disable GeoIP lookups
  # Startup of logagent might be slower, when downloading the GeoIP database
  geoipEnabled: false
  # Directory to store Logagent status and temporary files
  diskBufferDir: ./tmp
input:
  files:
    - '/var/log/messages'
    - '/var/log/test'
patterns:
  sourceName: !!js/regexp /test/
  match:
    - type: mysyslog
      regex: !!js/regexp /([a-z]){2}(.*)/
      fields: [message,severity]
      dateFormat: MMM DD HH:mm:ss
output:
  elasticsearch:
    module: elasticsearch
    url: http://host:9200
    index: mysyslog
  stdout: yaml # use 'pretty' for pretty json and 'ldjson' for line delimited json (default)
I would expect (based on the scarce documentation) that this would split each line of the test file into two; for example, for 'ggff', 'gg' would be message and 'ff' would be severity. But all I can see in my Kibana is that 'ggff' is the message and severity is defaulted (?) to info. The problem is, I don't know where the problem is. Does it skip my pattern? Does the match in my pattern fail? Any help would be VERY appreciated.
Setting 'debug: true' in patterns.yml prints detailed info about matched patterns.
https://github.com/sematext/logagent-js/blob/master/patterns.yml#L36
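As a sketch (assuming the debug flag goes at the top level of the patterns file, as in the linked example, and reusing the mysyslog pattern from the question):
# patterns.yml -- print detailed information about which pattern matched each line
debug: true
patterns:
  - sourceName: !!js/regexp /test/
    match:
      - type: mysyslog
        regex: !!js/regexp /([a-z]){2}(.*)/
        fields: [message,severity]
        dateFormat: MMM DD HH:mm:ss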
Watch Logagent issue #69 (https://github.com/sematext/logagent-js/issues/69) for additional improvements.
The docs moved to http://sematext.com/docs/logagent/ . I recommend www.regex101.com to test regular expressions (please use JavaScript regex syntax).
Examples of Syslog messages in /var/log are in the default pattern library:
https://github.com/sematext/logagent-js/blob/master/patterns.yml#L498

Write multi-line string in Spring boot .conf file

For my Spring Boot application, I have a .conf file that is used to run the application.
In this file, I put some JVM options.
Currently it contains this:
JAVA_OPTS="-Xms256m -Xmx512m -Dvisualvm.display.name=ApplicationWs -Dcom.sun.management.jmxremote.port=3333 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
In the future I will certainly add other options and the line will increase in size.
I want to make it more readable by writing one or two options per line, but I can't find the proper syntax for this.
I want to do something like this:
# Heap Size
JAVA_OPTS="-Xms256m -Xmx512m"
# JVisualVM Name in VisualVM
JAVA_OPTS="$JAVA_OPTS -Dvisualvm.display.name=ApplicationWs"
# Jmx Configuration
JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote.port=3333 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
I already tried:
JAVA_OPTS="-Xms256m -Xmx512m"
JAVA_OPTS="$JAVA_OPTS -Dvisualvm.display.name=ApplicationWs"
export JAVA_OPTS
JAVA_OPTS="-Xms256m -Xmx512m"
JAVA_OPTS="${JAVA_OPTS} -Dvisualvm.display.name=ApplicationWs"
export JAVA_OPTS
JAVA_OPTS="-Xms256m -Xmx512m
-Dvisualvm.display.name=ApplicationWs"
JAVA_OPTS="-Xms256m -Xmx512m "
+ " -Dvisualvm.display.name=ApplicationWs"
What is the proper syntax for a multi-line string in a Spring Boot .conf file?
The Spring Boot launch script uses the shell to source the .conf file, so you can use any shell-script syntax to write the configuration. In your case I would prefer to use variables to format the options, such as the following:
MEM_OPTS='-Xms256m -Xmx512m'
DISPLAY_NAME='visualvm.display.name=ApplicationWs'
JMXREMOTE_PORT='com.sun.management.jmxremote.port=3333'
JMXREMOTE_SSL='com.sun.management.jmxremote.ssl=false'
JMXREMOTE_AUTH='com.sun.management.jmxremote.authenticate=false'
JAVA_OPTS="${MEM_OPTS} -D${DISPLAY_NAME} -D${JMXREMOTE_PORT} -D${JMXREMOTE_SSL} -D${JMXREMOTE_AUTH}"
see here
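As a quick sanity check you can source the file the same way the launch script does and print the result (a sketch; the file name here is hypothetical, by convention it matches the name of your executable jar):
# source the .conf as the launch script would, then inspect the assembled options
. ./ApplicationWs.conf
echo "$JAVA_OPTS"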
Try multiple lines like this:
primes = 2,\
3,\
5,\
7,\
11
from: https://stackoverflow.com/a/8978515/404145
The only way that actually works is to pass a one-line command; note the semicolons and backslashes at the end:
MEMORY_PARAMS=' -Xms512M -Xmx512M '; \
JMX_MONITORING='-Dcom.sun.management.jmxremote.port=8890 -Dcom.sun.management.jmxremote.rmi.port=8890 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote -Djava.rmi.server.hostname=13.55.666.7777'; \
REMOTE_DEBUG='-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:8889'; \
JAVA_OPTS=" -Dfile.encoding=UTF-8 ${MEMORY_PARAMS} ${REMOTE_DEBUG} ${JMX_MONITORING} "
