NullPointerException with Stanford NLP Spanish POS tagging

All -
Running Stanford CoreNLP 3.4.1, plus the Spanish models. I have a directory of approximately 100 raw Spanish text documents, UTF-8 encoded. For each one, I execute the following command line:
java -cp stanford-corenlp-3.4.1.jar:stanford-spanish-corenlp-2014-08-26-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-0.23.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props <propsfile> -file <txtfile>
The props file looks like this:
annotators = tokenize, ssplit, pos
tokenize.language = es
pos.model = edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger
For almost every file, I get the following error:
Exception in thread "main" java.lang.RuntimeException: Error annotating :
at edu.stanford.nlp.pipeline.StanfordCoreNLP$15.run(StanfordCoreNLP.java:1287)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1347)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1389)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1459)
Caused by: java.lang.NullPointerException
at edu.stanford.nlp.tagger.maxent.ExtractorSpanishStrippedVerb.extract(ExtractorFramesRare.java:1626)
at edu.stanford.nlp.tagger.maxent.Extractor.extract(Extractor.java:153)
at edu.stanford.nlp.tagger.maxent.TestSentence.getExactHistories(TestSentence.java:465)
at edu.stanford.nlp.tagger.maxent.TestSentence.getHistories(TestSentence.java:440)
at edu.stanford.nlp.tagger.maxent.TestSentence.getHistories(TestSentence.java:428)
at edu.stanford.nlp.tagger.maxent.TestSentence.getExactScores(TestSentence.java:377)
at edu.stanford.nlp.tagger.maxent.TestSentence.getScores(TestSentence.java:372)
at edu.stanford.nlp.tagger.maxent.TestSentence.scoresOf(TestSentence.java:713)
at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:91)
at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:31)
at edu.stanford.nlp.tagger.maxent.TestSentence.runTagInference(TestSentence.java:322)
at edu.stanford.nlp.tagger.maxent.TestSentence.testTagInference(TestSentence.java:312)
at edu.stanford.nlp.tagger.maxent.TestSentence.tagSentence(TestSentence.java:135)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagSentence(MaxentTagger.java:998)
at edu.stanford.nlp.pipeline.POSTaggerAnnotator.doOneSentence(POSTaggerAnnotator.java:147)
at edu.stanford.nlp.pipeline.POSTaggerAnnotator.annotate(POSTaggerAnnotator.java:110)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:67)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:847)
at edu.stanford.nlp.pipeline.StanfordCoreNLP$15.run(StanfordCoreNLP.java:1275)
Any ideas? I haven't even begun to track this down. I'm certain the problem is in the POS tagger; tokenize and ssplit run just fine (one way to confirm is to run the tagger standalone, as sketched below).
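A sketch of that standalone check, assuming the same jars and model path (MaxentTagger has its own main class):
java -cp stanford-corenlp-3.4.1.jar:stanford-spanish-corenlp-2014-08-26-models.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger -textFile <txtfile>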
P.S. Please don't say "Upgrade to 3.5.0"; I don't currently have Java 8 installed and don't want to install it yet.
Thanks in advance.

Yes, it seems like there's a bug in the 3.4.1 Spanish models.
The Spanish 3.5.0 models actually seem to be compatible with Java 7. You can download the models used in 3.5 (stanford-spanish-corenlp-2014-10-23-models.jar) and put that on your classpath instead. This fixed the problem for me running Java 7 locally.
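For example, the command from the question would then become (a sketch: only the models jar is swapped, assuming the model paths inside the new jar are unchanged):
java -cp stanford-corenlp-3.4.1.jar:stanford-spanish-corenlp-2014-10-23-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-0.23.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props <propsfile> -file <txtfile>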

Related

Warnings on Spanish text processing with Stanford CoreNLP: "Number in types column for ... is probably priority"

I've downloaded Stanford CoreNLP from https://stanfordnlp.github.io/CoreNLP/index.html, current version 3.9.2.
Downloaded the Spanish language JAR:
http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-10-05-models.jar
Put that in the application root folder.
Fired up server with:
C:\Stanford>java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
Loaded http://localhost:9000
Entered the text "Sí, sabes que ya llevo un rato mirándote" and selected "Spanish" and Submit.
In the console readout there are lots of warnings like:
[pool-1-thread-1] WARN edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - Number in types column for [ejecución] is probably priority: 1
The output suggests the defaults are working OK, but what misconfiguration is causing these warnings?
This issue should be resolved in future versions. The output doesn't mean anything for performance; the rules files for Spanish were simply missing a column. We've fixed the files, so from 4.0.0 on those warnings should go away.
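For context, a hedged sketch of what the warning is about (the TITLE type below is invented for illustration, not taken from the shipped Spanish rules files): TokensRegexNER mapping files are tab-separated, as pattern, NER type, optional overwrite-types, optional numeric priority. A three-column line like
ejecución	TITLE	1
puts the "1" where overwrite-types belong, which is exactly what the warning flags; the four-column form with an empty overwrite-types column,
ejecución	TITLE		1
makes the "1" unambiguously a priority.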

Starting Payara 5 fails with "Launch process failed with exit code 1"

I have built a very simple hello-world project with
Payara 5 (5.181)
JSF 2.3
JDK 1.8
CDI 2.0
Maven
and encountered a problem
Unable to start server due following issues: Launch process failed with exit code 1
In the console it throws this error:
Error: Could not find or load main class server\payara5\glassfish.lib.grizzly-npn-bootstrap.jar
[PIC] Payara 5 Error
It seems that the Payara Tools for Eclipse suffer from several bugs that may cause this. In my case, the following workarounds helped:
The Payara installation path should not contain spaces (e.g. Program Files\Payara).
It seems that only Java 8 is supported at the time.
Open the domain.xml configuration file for the domain you are trying to start (typically payara_install_path/glassfish/domains/domain1/config/domain.xml) and search for "Xbootclasspath". You should find a couple of lines like:
<jvm-options>[1.8.0|1.8.0u120]-Xbootclasspath/p:${com.sun.aas.installRoot}/lib/grizzly-npn-bootstrap-1.6.jar</jvm-options>
<jvm-options>[1.8.0u121|1.8.0u160]-Xbootclasspath/p:${com.sun.aas.installRoot}/lib/grizzly-npn-bootstrap-1.7.jar</jvm-options>
<jvm-options>[1.8.0u161|1.8.0u190]-Xbootclasspath/p:${com.sun.aas.installRoot}/lib/grizzly-npn-bootstrap-1.8.jar</jvm-options>
<jvm-options>[1.8.0u191|1.8.0u500]-Xbootclasspath/p:${com.sun.aas.installRoot}/lib/grizzly-npn-bootstrap-1.8.1.jar</jvm-options>
Check your installed Java version (try running java -version) and choose the appropriate line (most likely the last one). Remove the remaining lines, and remove the [...] part at the beginning of the chosen line, so you get something like:
<jvm-options>-Xbootclasspath/p:${com.sun.aas.installRoot}/lib/grizzly-npn-bootstrap-1.8.1.jar</jvm-options>
After this, the tools seem to start normally.
The problem is with the Java version. The grizzly-npn-bootstrap-1.8.1.jar is placed on the boot classpath, which is why it requires a matching Java version to start the Payara server. So remove the unnecessary bootstrap jar entries from domain.xml.
In Windows:
1) Go to C:\Users\xxxx\payara5\glassfish\domains\domain1\config\domain.xml
2) According to my Java version (java version "1.8.0_191") I deleted the following lines from domain.xml. Delete the ones that do not match your Java version:
<jvm-options>[1.8.0|1.8.0u120]-Xbootclasspath/p:${com.sun.aas.installRoot}/lib/grizzly-npn-bootstrap-1.6.jar</jvm-options>
<jvm-options>[1.8.0u121|1.8.0u160]-Xbootclasspath/p:${com.sun.aas.installRoot}/lib/grizzly-npn-bootstrap-1.7.jar</jvm-options>
<jvm-options>[1.8.0u161|1.8.0u190]-Xbootclasspath/p:${com.sun.aas.installRoot}/lib/grizzly-npn-bootstrap-1.8.jar</jvm-options>
3) Remove the [1.8.0u191|1.8.0u500] part from the jvm-options and edit the line in your domain.xml (to match your java -version) as shown below:
<jvm-options>-Xbootclasspath/p:${com.sun.aas.installRoot}/lib/grizzly-npn-bootstrap-1.8.1.jar</jvm-options>
4) Restart your server.
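(To check the domain outside the IDE after editing domain.xml, you can also start it directly with the asadmin tool that ships with Payara; a sketch, assuming the default domain1 and the install path from step 1:
C:\Users\xxxx\payara5\bin\asadmin start-domain domain1
)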
As Radkovo said, "The Payara installation path should not contain spaces (e.g. Program Files\Payara)", so I moved Payara to the Documents folder.
Problem solved!

Running the Stanford CoreNLP server with French models

I am trying to analyse some French text with the Stanford CoreNLP tool (it's my first time trying to use any Stanford NLP software).
To do so, I have downloaded the v3.6.0 jar and the corresponding French models.
Then I run the server with:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
As described in this answer, I call the API with:
wget --post-data 'Bonjour le monde.' 'localhost:9000/?properties={"parse.model":"edu/stanford/nlp/models/parser/nndep/UD_French.gz", "annotators": "parse", "outputFormat": "json"}' -O -
but I get the following log + error:
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-1] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/parser/nndep/UD_French.gz ...
edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header: 64696374
at edu.stanford.nlp.parser.common.ParserGrammar.loadModel(ParserGrammar.java:188)
at edu.stanford.nlp.pipeline.ParserAnnotator.loadModel(ParserAnnotator.java:212)
at edu.stanford.nlp.pipeline.ParserAnnotator.<init>(ParserAnnotator.java:115)
...
The solutions proposed here suggest the code and model versions differ, but I downloaded them from the same page (and they both have the same version number in their name), so I am pretty sure they are the same.
Any other hint on what I am doing wrong?
(I should also mention that I am not a Java expert, so maybe I forgot a stupid step...)
Ok, after a lot of reading and unsuccessful tries, I found a way to make it work (for v3.6.0). Part of the explanation: edu/stanford/nlp/models/parser/nndep/UD_French.gz is a model for the neural dependency parser (the depparse annotator), not a serialized constituency parser, so the parse annotator fails to deserialize it, hence the StreamCorruptedException. Here are the details, if they can be of any interest to someone else:
Download the code and French models from http://stanfordnlp.github.io/CoreNLP/index.html#download. Unzip the code .zip and copy the French model .jar to that directory (do not remove the English models; they have different names anyway).
cd to that directory and then run the server with:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
(it's a pity that the -prop flag doesn't help here)
Call the API, repeating the properties listed in StanfordCoreNLP-french.properties:
wget --header="Content-Type: text/plain; charset=UTF-8" \
  --post-data 'Bonjour le monde.' \
  'localhost:9000/?properties={"annotators": "tokenize,ssplit,pos,parse", "parse.model": "edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz", "pos.model": "edu/stanford/nlp/models/pos-tagger/french/french.tagger", "tokenize.language": "fr", "outputFormat": "json"}' \
  -O -
which finally gives a 200 response using the French models!
(NB: I don't know how to make it work with the web UI; the same goes for UTF-8 support.)
As a potentially useful addition for some, this is what the complete properties file for German looks like:
# annotators
annotators = tokenize, ssplit, mwt, pos, ner, depparse
# tokenize
tokenize.language = de
tokenize.postProcessor = edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor
# mwt
mwt.mappingFile = edu/stanford/nlp/models/mwt/german/german-mwt.tsv
# pos
pos.model = edu/stanford/nlp/models/pos-tagger/german-ud.tagger
# ner
ner.model = edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz
ner.applyNumericClassifiers = false
ner.applyFineGrained = false
ner.useSUTime = false
# parse
parse.model = edu/stanford/nlp/models/srparser/germanSR.beam.ser.gz
# depparse
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_German.gz
The complete properties files for Arabic, Chinese, French, German and Spanish can all be found in the CoreNLP GitHub repository.
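With a file like this saved locally, recent CoreNLP servers can load it as their default properties via the -serverProperties flag, so callers don't have to repeat everything in the request URL (a sketch; german.properties is a hypothetical local file name holding the properties above):
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties german.properties -port 9000 -timeout 15000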

Stanford NER CRF model gives different results on the command line and in code

I have made my own CRF model by following this link.
http://nlp.stanford.edu/software/crf-faq.shtml
Command for training is:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
Then I test this model from the command line with the following command:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier training-ner-model.ser.gz -testFile testing_data.txt
This produces results on the console.
Then I use this model programmatically in my code, giving it the same sentence that is in the test file. It does not produce the same results as the testing command.
I have searched a lot for why this happens but have found nothing so far. Can anybody help me with this? I would be thankful.
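No accepted answer is recorded here, but a frequent cause of such mismatches is input handling: -testFile consumes pre-tokenized, tab-separated token/label lines (as built per the CRF FAQ), while code often passes raw strings that the classifier tokenizes itself. A minimal sketch of loading the model in code the same way the command line does (file name and sentence are placeholders):

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NerCheck {
    public static void main(String[] args) throws Exception {
        // Load the same serialized model used with -loadClassifier on the command line.
        CRFClassifier<CoreLabel> crf = CRFClassifier.getClassifier("training-ner-model.ser.gz");
        // classifyToString applies the classifier's own tokenization to raw text;
        // if its output differs from the -testFile run, compare the tokenization
        // and any properties set programmatically against the training .prop file.
        System.out.println(crf.classifyToString("Your test sentence here."));
    }
}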

Too small initial heap error - stanford parser

I am trying out the Stanford dependency parser. I tried running the parser from the command line on Windows to extract dependencies, using this command:
java -mx100m -cp "stanford-parser.jar" edu.stanford.nlp.trees.EnglishGrammaticalStructure -sentFile english-onesent.txt -collapsedTree -CCprocessed -parserFile englishPCFG.ser.gz
I am getting the following error:
Error occurred during initialization of VM
Too small initial heap
I changed the memory size to -mx1024, -mx2048 as well as -mx4096. It didn't change anything and the error persists.
What am I missing?
Type -Xmx1024m instead of -mx1024.
See https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html
It should be -mx1024m; I had skipped the m. (Without the m suffix, the size is read as bytes, and a 1024-byte heap is below the JVM minimum, hence "Too small initial heap".)
One more thing: in the -cp, the model jar should also be included:
... -cp "stanford-parser.jar;stanford-parser-3.5.2-models.jar" ...
(assuming you are using the latest version). Otherwise an IOException will be thrown.
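Putting both fixes together, the full command would look like this (a sketch, using the Windows ; classpath separator and the jar names above):
java -mx1024m -cp "stanford-parser.jar;stanford-parser-3.5.2-models.jar" edu.stanford.nlp.trees.EnglishGrammaticalStructure -sentFile english-onesent.txt -collapsedTree -CCprocessed -parserFile englishPCFG.ser.gz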
There may be some arguments preexisting in the IDE.
In Eclipse:
Go to -> Run As -> Run Configurations -> Arguments,
then delete the arguments that were used previously.
Restart Eclipse.
Worked for me!
