I'm starting to use the CoreNLP library (version 3.3.1) to analyze Italian text documents. Has anybody tried to run it on a language other than English? Did you find the models needed to train the algorithms?
Thanks
Carlo
At the moment, beyond English, we only package models for Chinese (see http://nlp.stanford.edu/software/corenlp.shtml#History), but people have also successfully used the German and French models that we distribute with the Stanford Parser, Stanford NER, or the Stanford POS Tagger inside CoreNLP. For Italian, you'd need annotated data available to train your own models. There are some treebanks available for Italian and the Stanford Parser has been trained for Italian. For info on resources for Italian, see: http://aclweb.org/aclwiki/index.php?title=Resources_for_Italian#Treebanks.
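To make the configuration part concrete, here is a minimal sketch of how you might point CoreNLP at the French models shipped with the Stanford Parser and Stanford POS Tagger, assuming those jars are on your classpath. The model resource paths are illustrative and differ between releases, so check the contents of the jars you actually downloaded.

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class FrenchInsideCoreNlpSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, parse");
    // Resource paths are examples only; confirm them against the parser/tagger jars you downloaded.
    props.put("pos.model", "edu/stanford/nlp/models/pos-tagger/french/french.tagger");
    props.put("parse.model", "edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Note: this sketch still uses the default (English) tokenization rules.
    Annotation annotation = new Annotation("Le chat dort sur le canapé.");
    pipeline.annotate(annotation);
  }
}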
Related
I could not see any Russian pre-trained tokenizer in Stanford NLP or Stanford CoreNLP. Are there any models for Russian yet?
Unfortunately I don't know of any extensions that handle that for Stanford CoreNLP.
You can use Stanza (https://stanfordnlp.github.io/stanza/) which is our Python package to get Russian tokenization and sentence splitting.
You could theoretically tokenize and sentence split with Stanza, and then use the Stanford CoreNLP Server (which you can also use via Stanza) if you had any CoreNLP-specific components you wanted to work with.
A group a while back submitted some models for Russian, but I don't see anything for tokenization.
The link to their resources is here: https://stanfordnlp.github.io/CoreNLP/model-zoo.html
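To sketch only the CoreNLP (Java) side of that workflow: once Stanza, in Python, has produced tokenized and sentence-split text, you can hand it over with one sentence per line and tokens separated by spaces, and tell CoreNLP to trust that segmentation via the tokenize.whitespace and ssplit.eolonly properties. This is a minimal illustration under those assumptions; there are still no Russian-specific CoreNLP models, so it only helps for language-independent components you run downstream.

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class PreTokenizedRussianSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Trust the externally produced tokenization (e.g. from Stanza):
    // tokens separated by spaces, one sentence per line.
    props.put("annotators", "tokenize, ssplit");
    props.put("tokenize.whitespace", "true");
    props.put("ssplit.eolonly", "true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Text already tokenized and sentence-split upstream.
    Annotation ann = new Annotation("Привет , мир !\nКак дела ?");
    pipeline.annotate(ann);
    for (CoreMap sentence : ann.get(CoreAnnotations.SentencesAnnotation.class)) {
      System.out.println(sentence.get(CoreAnnotations.TextAnnotation.class));
    }
  }
}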
As per the title of this post, I would like as much information as possible about the datasets used to train the Stanford CoreNLP French models made available on this page (https://stanfordnlp.github.io/CoreNLP/history.html). My ultimate aim is to know the set of tags I can expect to be output by the Stanford CoreNLP tool when using it to analyze a text written in French. I was told that a model is trained using a treebank. For French, there are six of them (http://universaldependencies.org/, section for the French language):
- FTB
- Original
- Sequoia
- ParTUT
- PUD
- Spoken
So I would like to know which of them was used to train which French model.
I first asked this question on the mailing list dedicated to Java NLP users (java-nlp-user@lists.stanford.edu), but to no avail so far.
So, again, assuming it is one of the treebanks described above that was indeed used to train the Stanford CoreNLP French models available at the link posted above, which one is it? Alternatively, who (name and surname) would know the answer to this question, if no one here knows?
For all who are curious about this, here is some info about the datasets used for French in Stanford CoreNLP:
- French POS tagger: CC (Crabbé and Candito) modified French Treebank
- French POS tagger (UD version): UD 1.3
- French Constituency Parser: CC modified French Treebank
- French NN Dependency Parser: UD 1.3
Also note that the French constituency parser cannot translate constituency parses into dependency parses the way the English constituency parser can.
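If you want to see which tagset those models emit in practice, a quick check is to run the bundled French pipeline and print the POS tags. The sketch below assumes the French models jar for your CoreNLP version is on the classpath and contains the usual StanfordCoreNLP-french.properties file.

import java.util.Properties;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class FrenchTagsSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Default French configuration shipped inside the French models jar.
    props.load(IOUtils.readerFromString("StanfordCoreNLP-french.properties"));
    props.setProperty("annotators", "tokenize, ssplit, pos");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation ann = new Annotation("Le chat mange la souris.");
    pipeline.annotate(ann);
    for (CoreLabel token : ann.get(CoreAnnotations.TokensAnnotation.class)) {
      System.out.println(token.word() + "\t" + token.tag());
    }
  }
}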
I am using Stanford CoreNLP for a task. There are two model packages, "stanford-corenlp-3.6.0-models" and "stanford-english-corenlp-2016-01-10-models", on Stanford's website. I want to know what the difference between these two is.
According to the "Human languages supported" section of the CoreNLP Overview, the basic distribution provides model files for the analysis of well-edited English; that is the stanford-corenlp-3.6.0-models package you mentioned.
But the CoreNLP team also provides a jar that contains all of their English models, which includes various variant models, and in particular has one optimized for working with uncased English (e.g., text that is mostly or entirely uppercase or lowercase). The newest one is stanford-english-corenlp-2016-10-31-models and the previous one is the stanford-english-corenlp-2016-01-10-models you mentioned.
Reference:
http://stanfordnlp.github.io/CoreNLP/index.html#programming-languages-and-operating-systems
(the Stanford CoreNLP Overview page)
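As a usage sketch: with both the standard models jar and the big English models jar on the classpath, you can select a caseless variant by overriding the relevant model properties. The resource paths below are examples only and vary between releases, so list the jar contents to confirm the exact names for your version.

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CaselessModelsSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    // Example paths for caseless models from the big English models jar; verify them for your release.
    props.put("pos.model",
        "edu/stanford/nlp/models/pos-tagger/english-left3words/english-caseless-left3words-distsim.tagger");
    props.put("ner.model",
        "edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Useful for text with unreliable casing, e.g. all lowercase.
    Annotation ann = new Annotation("barack obama visited new york city last week.");
    pipeline.annotate(ann);
  }
}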
I'm using Stanford NLP to do POS tagging for Spanish texts. I can get a POS tag for each word, but I notice that I am only given the first four sections of the Ancora tag; it is missing the last three sections for person, number and gender.
Why does Stanford NLP only use a reduced version of the Ancora tag?
Is it possible to get the entire tag using Stanford NLP?
Here is my code (please excuse the jruby...):
require 'java'
# the CoreNLP jar and the models jars must already be on the JRuby classpath
java_import 'edu.stanford.nlp.pipeline.StanfordCoreNLP'
java_import 'edu.stanford.nlp.pipeline.Annotation'

props = java.util.Properties.new()
props.put("tokenize.language", "es")
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
props.put("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
props.put("pos.model", "/stanford-postagger-full-2015-01-30/models/spanish-distsim.tagger")
props.put("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")
pipeline = StanfordCoreNLP.new(props)
annotation = Annotation.new("No sé qué estoy haciendo. Me pregunto si esto va a funcionar.")
pipeline.annotate(annotation)
pipeline.prettyPrint(annotation, java.lang.System.out)
I am getting this as the output:
[Text=No CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=rn
Lemma=no NamedEntityTag=O] [Text=sé CharacterOffsetBegin=3
CharacterOffsetEnd=5 PartOfSpeech=vmip000 Lemma=sé NamedEntityTag=O]
[Text=qué CharacterOffsetBegin=6 CharacterOffsetEnd=9
PartOfSpeech=pt000000 Lemma=qué NamedEntityTag=O] [Text=estoy
CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=vmip000
Lemma=estoy NamedEntityTag=O] [Text=haciendo CharacterOffsetBegin=16
CharacterOffsetEnd=24 PartOfSpeech=vmg0000 Lemma=haciendo
NamedEntityTag=O] [Text=. CharacterOffsetBegin=24
CharacterOffsetEnd=25 PartOfSpeech=fp Lemma=. NamedEntityTag=O]
(I notice that the lemmas are incorrect also, but that's probably an issue for a separate question. Never mind, I see that Stanford NLP does not support Spanish lemmatization.)
Why does Stanford NLP only use a reduced version of the Ancora tag?
This was a practical decision made to ensure high tagging accuracy. (Retaining morphological information on tags caused the entire tagger to suffer from data sparsity, and do worse not only on morphological annotation but across the board.)
Is it possible to get the entire tag using Stanford NLP?
No. You could get quite far doing this with a simple rule-based system, though, or use the Stanford Classifier to train your own morphological annotator. (Feel free to share your code if you pick either path!)
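To give a flavour of the rule-based route (a purely hypothetical sketch, not part of CoreNLP): you can look at the word form itself and fill in the gender/number slots of an AnCora-style common-noun tag from typical Spanish endings. Real morphology is messier than this, so treat it only as a starting point.

public class AncoraExpansionSketch {
  // Hypothetical helper: guess gender ('m'/'f'/'0') and number ('s'/'p') from the surface form.
  static String guessGenderNumber(String word) {
    String w = word.toLowerCase();
    if (w.endsWith("os")) return "mp";
    if (w.endsWith("as")) return "fp";
    if (w.endsWith("o"))  return "ms";
    if (w.endsWith("a"))  return "fs";
    if (w.endsWith("s"))  return "0p";
    return "00";
  }

  public static void main(String[] args) {
    String word = "gatas";
    // Suppose the tagger returned a common-noun tag with the gender/number slots left unspecified;
    // for common nouns, the third and fourth characters of the AnCora tag are gender and number.
    String reducedTag = "nc00000";
    String expanded = reducedTag.substring(0, 2) + guessGenderNumber(word) + reducedTag.substring(4);
    System.out.println(word + "\t" + reducedTag + " -> " + expanded);  // gatas  nc00000 -> ncfp000
  }
}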
If you are not strictly limited to the Stanford POS tagger, you might want to try the POS and morphological tagging toolkit RDRPOSTagger. RDRPOSTagger provides pre-trained POS and morphological tagging for 40 different languages, including Spanish.
For Spanish POS and morphological tagging, RDRPOSTagger was trained on the IULA Spanish LSP Treebank. It obtained a tagging accuracy of 97.95%, with a tagging speed of 200K words/second for the Java implementation (10K words/second for the Python implementation) on a 64-bit Windows 7 machine with a Core i5 2.50 GHz CPU and 6 GB of memory.
Is it possible to train OpenNLP for languages other than English, such as Slavic languages written in Cyrillic, using the OpenNLP API?
Yes, it is. The OpenNLP documentation provides instructions on how to use and train each one of the modules.
For named entity recognition specifically, please see here.
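For instance, training a name finder for a Cyrillic-script language with the OpenNLP API could look roughly like the sketch below. The corpus path and language code are placeholders; the training file must be in OpenNLP's NER format (one sentence per line, entities wrapped in <START:...> ... <END> markers).

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainNameFinderSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder corpus: one sentence per line, entities marked with <START:person> ... <END>.
    MarkableFileInputStreamFactory in =
        new MarkableFileInputStreamFactory(new File("ru-ner.train"));
    ObjectStream<NameSample> samples =
        new NameSampleDataStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));

    // Train a maxent name finder for the chosen language code.
    TokenNameFinderModel model = NameFinderME.train(
        "ru", null, samples, TrainingParameters.defaultParams(), new TokenNameFinderFactory());

    try (OutputStream out = new FileOutputStream("ru-ner.bin")) {
      model.serialize(out);
    }
  }
}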