I was wondering which corpus the English statistical coreference resolution system of Stanford NLP was trained on. Would it be effective if used on novels?
The coreference model is trained on the CoNLL 2012 coreference data set, which is based on the OntoNotes 5.0 data set.
Here is the link to the data:
http://conll.cemantix.org/2012/data.html
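For reference, here is a minimal sketch of running the statistical coreference system through the CoreNLP Java API. The sample text is arbitrary, and depending on your CoreNLP version you may also need a "mention" annotator before "coref"; treat this as an illustration rather than a definitive recipe:

```java
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class CorefExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // The statistical coreference model needs tokens, POS tags, NER and parses as input.
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
    props.setProperty("coref.algorithm", "statistical");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation document = new Annotation("Barack Obama was born in Hawaii. He was elected president in 2008.");
    pipeline.annotate(document);

    // Print every coreference chain the model found.
    for (CorefChain chain : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
      System.out.println(chain);
    }
  }
}
```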
I'm seeing several posts about training the Stanford NER for other languages.
e.g. https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486
However, the Stanford CRF-Classifier uses some language-dependent features (such as part-of-speech tags).
Can we really train non-English models using the same Jar file?
https://nlp.stanford.edu/software/crf-faq.html
Training a NER classifier is language independent. You have to provide high-quality training data and create meaningful features. The point is that not all features are equally useful for every language. Capitalization, for instance, is a good indicator for a named entity in English. But in German all nouns are capitalized, which makes this feature less useful.
In Stanford NER you can decide which features the classifier uses, so you can disable POS tags (in fact, they are disabled by default). Of course, you could also provide your own POS tags in your desired language.
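As an illustration, a training run that relies only on language-independent surface features (no POS tags) could be configured roughly like this through the Java API. The file names are placeholders, and the feature flags are the usual ones shown in the NER FAQ; tune them for your language:

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.Properties;

public class TrainNer {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.setProperty("trainFile", "train.tsv");            // one token per line: token <TAB> label
    props.setProperty("map", "word=0,answer=1");            // column layout of the training file
    props.setProperty("serializeTo", "my-lang-ner.ser.gz"); // where the trained model is written
    // Language-independent surface features; no POS-based features are enabled.
    props.setProperty("useClassFeature", "true");
    props.setProperty("useWord", "true");
    props.setProperty("useNGrams", "true");
    props.setProperty("maxNGramLeng", "6");
    props.setProperty("usePrev", "true");
    props.setProperty("useNext", "true");
    props.setProperty("useSequences", "true");
    props.setProperty("usePrevSequences", "true");
    props.setProperty("maxLeft", "1");
    props.setProperty("wordShape", "chris2useLC");          // word-shape features; adjust per language

    CRFClassifier<CoreLabel> classifier = new CRFClassifier<>(props);
    classifier.train();
    classifier.serializeClassifier(props.getProperty("serializeTo"));
  }
}
```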
I hope I could clarify some things.
I agree with the previous answer that the NER classification model is language independent.
If you have trouble finding training data, I can suggest this link, which has a large number of labeled datasets for different languages.
If you would like to try another toolkit, I suggest ESTNLTK, a library for the Estonian language, which also supports training language-independent NER models (documentation).
Also, here you can find an example of how to train a NER model using spaCy.
I hope it helps. Good luck!
The Stanford CoreNLP software has a sentiment annotator, but it only supports English. I want to create a sentiment annotator for Chinese. What should I do? Can someone give me some advice on this? Thank you very much!
Unfortunately, we do not have any trained model for Chinese sentiment analysis. To train a Chinese model, you'd need to construct a sentiment treebank similar to the Stanford Sentiment Treebank and then retrain the sentiment model, but this is not a small task.
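If you did build such a treebank and retrain the model, you could then point the sentiment annotator at your own model file. The sketch below assumes a hypothetical retrained model ("chinese-sentiment.ser.gz" does not exist) and that a Chinese-capable pipeline (segmenter, tagger, parser) is configured; it only illustrates where a custom model would plug in:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class ChineseSentimentSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Hypothetical setup: the sentiment annotator is pointed at a retrained model
    // produced from your own Chinese sentiment treebank.
    props.setProperty("annotators", "tokenize,ssplit,pos,parse,sentiment");
    props.setProperty("sentiment.model", "chinese-sentiment.ser.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("这部电影非常好看。");
    pipeline.annotate(doc);
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      // Print the predicted sentiment class for each sentence.
      System.out.println(sentence.get(SentimentCoreAnnotations.SentimentClass.class));
    }
  }
}
```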
I've searched the docs and the FAQs but have yet to find which corpus the Spanish parser was trained on. Was the IULA treebank from Pompeu Fabra University used? https://www.iula.upf.edu/recurs01_tbk_uk.htm
Thanks.
The parser was trained on a preprocessed version of the AnCora Spanish 3.0 corpus.
You can find more information about the training data and the preprocessing at http://nlp.stanford.edu/software/spanish-faq.html.
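For completeness, a rough sketch of using the Spanish models from Java, assuming the Spanish models jar is on the classpath so the bundled properties file can be loaded; the sample sentence is arbitrary:

```java
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class SpanishParseExample {
  public static void main(String[] args) {
    // "StanfordCoreNLP-spanish.properties" ships inside the Spanish models jar and
    // selects the Spanish tokenizer, tagger, and the AnCora-trained parser.
    StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-spanish.properties");
    Annotation doc = new Annotation("El parser fue entrenado con el corpus AnCora.");
    pipeline.annotate(doc);
    System.out.println(doc.toShorterString());
  }
}
```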
I saw that Stanford NLP sentiment analysis first splits text into sentences and tokenizes them. Can I use this step on its own (i.e., given some text, split and tokenize it with the same functions that the Stanford NLP sentiment analysis uses)?
Both of these tools (sentence splitting and tokenization) ship as part of the Stanford CoreNLP API. See http://stanfordnlp.github.io/CoreNLP/cmdline.html for basic usage examples.
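A minimal sketch of using just those two annotators from the Java API (the sample text is arbitrary):

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class TokenizeExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Only sentence splitting and tokenization; no tagging or parsing.
    props.setProperty("annotators", "tokenize,ssplit");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("Stanford is great. It has good NLP tools.");
    pipeline.annotate(doc);

    // Print each sentence as a space-separated list of its tokens.
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.print(token.word() + " ");
      }
      System.out.println();
    }
  }
}
```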
I understand that Stanford NER only supports training through a file... is there a way to add more training data at a later stage to update the NER model once it is already trained?
I understand that I can keep all the training datasets from the past and re-train the model, but, I am wondering if there is a way to update the NER model rather than retrain it from scratch.
For the larger audience: Stanford NER does not support online training. Marking this question as closed.
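As a workaround, retraining from scratch on the old plus new data can be scripted. A sketch with placeholder file names, which simply concatenates the labeled files and retrains:

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Properties;

public class RetrainNer {
  public static void main(String[] args) throws Exception {
    // Concatenate the old and the newly labeled data into a single training file
    // (file names here are placeholders), then retrain from scratch.
    Path combined = Paths.get("combined-train.tsv");
    Files.write(combined, Files.readAllBytes(Paths.get("old-train.tsv")));
    Files.write(combined, Files.readAllBytes(Paths.get("new-train.tsv")), StandardOpenOption.APPEND);

    Properties props = new Properties();
    props.setProperty("trainFile", combined.toString());
    props.setProperty("map", "word=0,answer=1");
    props.setProperty("serializeTo", "updated-ner-model.ser.gz");

    CRFClassifier<CoreLabel> classifier = new CRFClassifier<>(props);
    classifier.train();
    classifier.serializeClassifier("updated-ner-model.ser.gz");
  }
}
```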