I am trying to analyse some French text with the Stanford CoreNLP tool (it's my first time using any Stanford NLP software).
To do so, I have downloaded the v3.6.0 jar and the corresponding French models.
Then I run the server with:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
As described in this answer, I call the API with:
wget --post-data 'Bonjour le monde.' 'localhost:9000/?properties={"parse.model":"edu/stanford/nlp/models/parser/nndep/UD_French.gz", "annotators": "parse", "outputFormat": "json"}' -O -
but I get the following log + error:
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-1] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/parser/nndep/UD_French.gz ...
edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header: 64696374
at edu.stanford.nlp.parser.common.ParserGrammar.loadModel(ParserGrammar.java:188)
at edu.stanford.nlp.pipeline.ParserAnnotator.loadModel(ParserAnnotator.java:212)
at edu.stanford.nlp.pipeline.ParserAnnotator.<init>(ParserAnnotator.java:115)
...
The solutions proposed here suggest that the code and model versions differ, but I have downloaded them from the same page (and they both have the same version number in their name), so I am pretty sure they match.
Any other hint on what I am doing wrong?
(I should also mention that I am not a Java expert, so maybe I forgot a stupid step... )
Ok, after a lot of reading and unsuccessful tries, I found a way to make it work (for v3.6.0). Here are the details, in case they are of any interest to someone else:
Download the code and French models from http://stanfordnlp.github.io/CoreNLP/index.html#download. Unzip the code .zip and copy the French models .jar into that directory (do not remove the English models; they have different names anyway).
cd to that directory and then run the server with:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
(it's a pity that the -prop flag doesn't help here)
Call the API, repeating the properties listed in StanfordCoreNLP-french.properties (in hindsight, the model I pointed to originally, UD_French.gz, is a neural dependency model for the depparse annotator, which is why the parse annotator failed to deserialize it):
wget --header="Content-Type: text/plain; charset=UTF-8" \
  --post-data 'Bonjour le monde.' \
  'localhost:9000/?properties={"annotators": "tokenize,ssplit,pos,parse", "parse.model": "edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz", "pos.model": "edu/stanford/nlp/models/pos-tagger/french/french.tagger", "tokenize.language": "fr", "outputFormat": "json"}' \
  -O -
which finally gives a 200 response using the French models!
(NB: I don't know how to make this work with the web UI, nor how to get UTF-8 support there.)
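For completeness, the same configuration can be driven from the Java API instead of the HTTP server. A minimal sketch, using the same property values as the wget call above (the class name is mine; it needs the code and French models jars on the classpath):

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class FrenchParseDemo {
  public static void main(String[] args) {
    // same properties as in the server call above
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,parse");
    props.setProperty("tokenize.language", "fr");
    props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/french/french.tagger");
    props.setProperty("parse.model", "edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation doc = new Annotation("Bonjour le monde.");
    pipeline.annotate(doc);
    pipeline.prettyPrint(doc, System.out); // human-readable dump of the annotations
  }
}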
As a potentially useful addition for some, this is what the complete properties file for German looks like:
# annotators
annotators = tokenize, ssplit, mwt, pos, ner, depparse
# tokenize
tokenize.language = de
tokenize.postProcessor = edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor
# mwt
mwt.mappingFile = edu/stanford/nlp/models/mwt/german/german-mwt.tsv
# pos
pos.model = edu/stanford/nlp/models/pos-tagger/german-ud.tagger
# ner
ner.model = edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz
ner.applyNumericClassifiers = false
ner.applyFineGrained = false
ner.useSUTime = false
# parse
parse.model = edu/stanford/nlp/models/srparser/germanSR.beam.ser.gz
# depparse
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_German.gz
The complete property files for Arabic, Chinese, French, German, and Spanish can all be found in the CoreNLP GitHub repository.
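If you'd rather not copy those values by hand, here is a minimal sketch that loads the bundled properties file straight from the classpath (I'm assuming the German models jar, which ships StanfordCoreNLP-german.properties at its root, is on the classpath):

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.io.InputStream;
import java.util.Properties;

public class GermanPipelineDemo {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // the properties file ships inside the German models jar
    try (InputStream in = GermanPipelineDemo.class.getClassLoader()
        .getResourceAsStream("StanfordCoreNLP-german.properties")) {
      props.load(in);
    }
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation doc = new Annotation("Hallo Welt.");
    pipeline.annotate(doc);
    pipeline.prettyPrint(doc, System.out);
  }
}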
Related
I've downloaded Stanford CoreNLP from https://stanfordnlp.github.io/CoreNLP/index.html (current version 3.9.2).
Downloaded the Spanish language JAR:
http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-10-05-models.jar
Put that in the application root folder.
Fired up the server with:
C:\Stanford>java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
Loaded http://localhost:9000
Entered the text "Sí, sabes que ya llevo un rato mirándote" and selected "Spanish" and Submit.
In the console readout there are lots of warnings like:
[pool-1-thread-1] WARN edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - Number in types column for [ejecución] is probably priority: 1
The output suggests the defaults are working OK, but what misconfiguration is causing this warning?
This issue should be resolved in future versions. The output doesn't mean anything regarding performance; it's just that the rules files for Spanish were missing a column. We've fixed the files, so in 4.0.0 and onward those warnings should go away.
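To make the warning concrete: TokensRegexNER mapping files are tab-separated, with the fields pattern, type, overwritable types, and priority. My reading of the message (the entries below are illustrative, with tabs written as <TAB>) is that the priority value ended up in the types column:

ejecución <TAB> MISC <TAB> 1          (the "1" lands in the types column, so the annotator warns it is probably a priority)
ejecución <TAB> MISC <TAB> <TAB> 1    (empty types column, priority in its own column: no warning)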
I've got a particular bug in my environment such that I cannot consistently knit Rmarkdown files to HTML, Word documents, or PDFs. The ability to knit varies by file type.
HTML - Sometimes it knits
Word document - Sometimes it knits
PDF - Never knits. Always throws the same error
The error is:
Segmentation fault/access violation in generated code
Error: pandoc document conversion failed with error 1
Up until a week ago, I was able to knit to all three file types in the same environment so this is a relatively recent bug. The high-level specs of my environment are:
OS: Windows 10
R: 3.4.3
RStudio: 1.1.419
knitr: 1.19
rmarkdown: 1.8
pandoc: 1.19.2.1
MiKTeX : 2.9
As an example of the inconsistency, I created a new .Rmd file in RStudio, which generates the familiar standard template of summary(cars) and plot(pressure) etc. I didn't change anything about it. I was able to knit it successfully to HTML once and to Word .docx once, and then I tried to knit to PDF and got the error above. I then went back and attempted to knit to both HTML and Word and received the error even though nothing had changed since the original knit. This is just an example of the type of behavior; different documents behave differently.
To troubleshoot, I've reinstalled R and RStudio (I believe pandoc comes with the RStudio install, so I've effectively reinstalled pandoc too). The reinstallation did not help. I've tried clearing the knitr cache. That didn't work either. I've browsed SO and the internet and found a few similar errors related to the Glasgow Haskell Compiler, but I have no idea what that is or whether it's the source of the issue.
Here is an example of the .Rmd file that throws the error:
---
title: "Untitled"
author: "Vypa"
date: "February 8, 2018"
output:
  html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
## Including Plots
You can also embed plots, for example:
```{r pressure, echo=FALSE}
plot(pressure)
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
And here is the full console error:
Segmentation fault/access violation in generated code
Error: pandoc document conversion failed with error 1
In addition: Warning message:
running command '"C:/Users/Vypa/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS tst.utf8.md --to html --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output tst.html --smart --email-obfuscation none --self-contained --standalone --section-divs --template "C:\Users\Vypa\Documents\R\R-3.4.3\library\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable "theme:bootstrap" --include-in-header "C:\Users\Vypa\AppData\Local\Temp\RtmpI9Gxsx\rmarkdown-str5db029544a23.html" --mathjax --variable "mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"' had status 1
Execution halted
Any help or thoughts would be appreciated.
Encountered the same problem after a recent install of R & RStudio on Win10. The pandoc problem showed up on editing and knitting a previously error-free .Rmd file.
Resolved after installing the latest preview version of RStudio from
https://www.rstudio.com/products/rstudio/download/preview/
Updating pandoc fixed this for me.
https://pandoc.org/installing.html
Thanks #EuGENE. Adding here as an answer for clarity.
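A quick sanity check that may help: the "Segmentation fault/access violation in generated code" line is emitted by pandoc itself (a Haskell program), not by R or knitr, which is why replacing the pandoc binary rather than reinstalling R packages is what fixes it. You can confirm which copy and version RStudio invokes by running the binary shown in the warning above directly:

"C:/Users/Vypa/RStudio/bin/pandoc/pandoc" --version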
I have made my own CRF model by following this link:
http://nlp.stanford.edu/software/crf-faq.shtml
Command for training is:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
Then I test this model on the command line with the following command:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier training-ner-model.ser.gz -testFile testing_data.txt
This produces results on the console.
Then I use this model programmatically in my code and give it the same sentence that is in the test file. This does not give the same results as the testing command produces (an illustration of the programmatic route follows below).
I have searched for why it behaves like this but have found nothing so far. Can anybody help me in this regard? I would be thankful.
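One common source of such a mismatch: -testFile consumes pre-tokenized, one-token-per-line column data, whereas classifying a raw string in code makes the classifier tokenize the text itself, so the two runs may not see identical tokens. For reference, a minimal sketch of loading the same serialized model programmatically (file name as in the commands above):

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class CrfDemo {
  public static void main(String[] args) throws Exception {
    // load the model produced by the training command above
    CRFClassifier<CoreLabel> classifier =
        CRFClassifier.getClassifier("training-ner-model.ser.gz");
    // classifying a raw string; the classifier tokenizes it first
    System.out.println(classifier.classifyToString("Your test sentence here."));
  }
}

If the tokens differ from those in testing_data.txt (punctuation splitting is a frequent culprit), the tag sequences can differ too.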
Does anyone know where the following files are located:
trainFileList = /u/nlp/data/ner/column_data/muc6.ptb.train,
/u/nlp/data/ner/column_data/muc7.ptb.train
I am following the FAQ link http://nlp.stanford.edu/software/crf-faq.shtml#a
If all I need to do is provide a file with two columns consisting of tokens and classes, then that will work (an illustration follows below). But I am curious about the train files listed in the classifier property files, e.g. the property file whose output model is:
serializeTo = english.muc.7class.caseless.distsim.crf.ser.gz
For reference, this is how I run the models:
java -mx1g -cp "$CLASSPATH" edu.stanford.nlp.ie.NERClassifierCombiner -textFile sample.txt -ner.model classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz -outputFormat tabbedEntities > sample2.tsv
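For anyone following the same FAQ, a hypothetical two-column training file looks like this (token, a tab, then the class; one token per line; the tokens and classes here are my own illustration):

Joe	PERSON
Smith	PERSON
lives	O
in	O
California	LOCATION
.	O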
Those files are the training data for the MUC-6 and MUC-7 tasks:
http://cs.nyu.edu/faculty/grishman/muc6.html
They are not distributed by Stanford. I will see if I can figure out where they are distributed and update this answer.
UPDATE: The LDC distributes those files if you want to get a copy. They have copyright issues, so you have to purchase them from the LDC; that is why we don't distribute them. Here are some links with more info:
http://www-nlpir.nist.gov/related_projects/muc/muc_data/muc_data_index.html
https://catalog.ldc.upenn.edu/LDC2003T13
https://catalog.ldc.upenn.edu/LDC2001T02
All -
Running Stanford CoreNLP 3.4.1, plus the Spanish models. I have a directory of approximately 100 Spanish raw-text documents, UTF-8 encoded. For each one, I execute the following command line:
java -cp stanford-corenlp-3.4.1.jar:stanford-spanish-corenlp-2014-08-26-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-0.23.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props <propsfile> -file <txtfile>
The props file looks like this:
annotators = tokenize, ssplit, pos
tokenize.language = es
pos.model = edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger
For almost every file, I get the following error:
Exception in thread "main" java.lang.RuntimeException: Error annotating :
at edu.stanford.nlp.pipeline.StanfordCoreNLP$15.run(StanfordCoreNLP.java:1287)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1347)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1389)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1459)
Caused by: java.lang.NullPointerException
at edu.stanford.nlp.tagger.maxent.ExtractorSpanishStrippedVerb.extract(ExtractorFramesRare.java:1626)
at edu.stanford.nlp.tagger.maxent.Extractor.extract(Extractor.java:153)
at edu.stanford.nlp.tagger.maxent.TestSentence.getExactHistories(TestSentence.java:465)
at edu.stanford.nlp.tagger.maxent.TestSentence.getHistories(TestSentence.java:440)
at edu.stanford.nlp.tagger.maxent.TestSentence.getHistories(TestSentence.java:428)
at edu.stanford.nlp.tagger.maxent.TestSentence.getExactScores(TestSentence.java:377)
at edu.stanford.nlp.tagger.maxent.TestSentence.getScores(TestSentence.java:372)
at edu.stanford.nlp.tagger.maxent.TestSentence.scoresOf(TestSentence.java:713)
at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:91)
at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:31)
at edu.stanford.nlp.tagger.maxent.TestSentence.runTagInference(TestSentence.java:322)
at edu.stanford.nlp.tagger.maxent.TestSentence.testTagInference(TestSentence.java:312)
at edu.stanford.nlp.tagger.maxent.TestSentence.tagSentence(TestSentence.java:135)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagSentence(MaxentTagger.java:998)
at edu.stanford.nlp.pipeline.POSTaggerAnnotator.doOneSentence(POSTaggerAnnotator.java:147)
at edu.stanford.nlp.pipeline.POSTaggerAnnotator.annotate(POSTaggerAnnotator.java:110)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:67)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:847)
at edu.stanford.nlp.pipeline.StanfordCoreNLP$15.run(StanfordCoreNLP.java:1275)
Any ideas? I haven't even begun to track this down. I'm certain the problem is in the POS tagger; tokenize and ssplit run just fine.
P.S. Please don't say "Upgrade to 3.5.0"; I don't currently have Java 8 installed and don't want to install it yet.
Thanks in advance.
Yes, it seems like there's a bug in the 3.4.1 Spanish models.
The Spanish 3.5.0 models actually seem to be compatible with Java 7. You can download the models used in 3.5 (stanford-spanish-corenlp-2014-10-23-models.jar) and put that on your classpath instead. This fixed the problem for me running Java 7 locally.
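For example, keeping everything else from the command above the same and only swapping the models jar (jar name as distributed for the 3.5.0 release):

java -cp stanford-corenlp-3.4.1.jar:stanford-spanish-corenlp-2014-10-23-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-0.23.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props <propsfile> -file <txtfile>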