How to train OpenNLP for non-English languages?

Is it possible to train OpenNLP, using the OpenNLP API, for languages other than English, such as Slavic languages written in Cyrillic?

Yes, it is. The OpenNLP documentation provides instructions on how to use and train each of the modules.
For named entity recognition specifically, see the name finder training section of that documentation.
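As a rough illustration of what training data for a Cyrillic-script language could look like: the OpenNLP name finder trainer reads one whitespace-tokenized sentence per line, with entities wrapped in `<START:type> ... <END>` markers. The sketch below (Python, standard library only) builds such lines from token lists and writes them as UTF-8. The function names, the sample sentence, and the file name are hypothetical, and the resulting file would still have to be passed to the OpenNLP trainer separately.

```python
from pathlib import Path


def to_training_line(tokens, spans):
    """Return one sentence in OpenNLP name finder training format.

    tokens: list of already tokenized words.
    spans:  list of (start, end, entity_type) tuples, end exclusive,
            assumed to be non-overlapping.
    """
    starts = {start: entity_type for start, _end, entity_type in spans}
    ends = {end for _start, end, _entity_type in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")          # close the entity that stops before token i
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")              # entity that runs to the end of the sentence
    return " ".join(out)


def write_training_file(path, sentences):
    """Write one annotated sentence per line, encoded as UTF-8."""
    lines = [to_training_line(tokens, spans) for tokens, spans in sentences]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Hypothetical sample sentence: "Ivan Petrov lives in Moscow."
sample_tokens = ["Иван", "Петров", "живёт", "в", "Москве", "."]
sample_spans = [(0, 2, "person"), (4, 5, "location")]
print(to_training_line(sample_tokens, sample_spans))
```

Calling `write_training_file("ru-ner.train", [(sample_tokens, sample_spans)])` would produce a file in this format. Writing it explicitly as UTF-8 matters for Cyrillic text, since the trainer has to read the file with the same encoding.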

Related

Does Stanford Core NLP support Russian sentence and word tokenization?

I could not find any pre-trained Russian tokenizer in Stanford NLP or Stanford CoreNLP. Are there any models for Russian yet?
Unfortunately I don't know of any extensions that handle that for Stanford CoreNLP.
You can use Stanza (https://stanfordnlp.github.io/stanza/) which is our Python package to get Russian tokenization and sentence splitting.
You could theoretically tokenize and sentence split with Stanza, and then use the Stanford CoreNLP Server (which you can also use via Stanza) if you had any CoreNLP specific components you wanted to work with.
A group a while back submitted some models for Russian, but I don't see anything for tokenization.
The link to their resources is here: https://stanfordnlp.github.io/CoreNLP/model-zoo.html
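The two-step idea above (tokenize and sentence split elsewhere, then hand the result to the CoreNLP Server) could be sketched roughly as follows. This is only a sketch under assumptions: it assumes a CoreNLP server is already running at `http://localhost:9000/`, that the input sentences were tokenized beforehand (for example with Stanza), and that the `tokenize.whitespace` and `ssplit.eolonly` properties are what you want for pre-tokenized input. The function names are hypothetical, and any annotators beyond `tokenize,ssplit` would need matching Russian models on the server.

```python
import json
import urllib.parse
import urllib.request


def build_request(sentences, annotators="tokenize,ssplit",
                  base_url="http://localhost:9000/"):
    """Build a POST request for a CoreNLP server from pre-tokenized sentences.

    sentences: list of token lists, one list per sentence.
    """
    props = {
        "annotators": annotators,
        "tokenize.whitespace": "true",  # treat whitespace as the only token boundary
        "ssplit.eolonly": "true",       # treat each line as one sentence
        "outputFormat": "json",
    }
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    # One sentence per line, tokens separated by single spaces.
    body = "\n".join(" ".join(tokens) for tokens in sentences).encode("utf-8")
    return urllib.request.Request(
        base_url + "?" + query,
        data=body,
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def annotate_pretokenized(sentences, **kwargs):
    """Send the request and return the parsed JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(sentences, **kwargs)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Only `annotate_pretokenized` touches the network, so `build_request` can be inspected or tested without a server.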

AutoML Translation Supported languages

I have a question about AutoML Translation.
We did not find our language in the list of supported languages. Can we add the Kazakh language in order to create our dataset? Example: translation from Russian to Kazakh.
It seems that only the languages in the published list of supported languages are available.
Note that AutoML Translation is in Beta, so the product and its list of supported languages may change.

With which treebank are the available StanfordCoreNLP French models trained?

As per the title of this post, I would like as much information as possible about the datasets used to train the Stanford CoreNLP French models that are made available on this page (https://stanfordnlp.github.io/CoreNLP/history.html). My ultimate aim is to know the set of tags that I can expect Stanford CoreNLP to output when it processes a text written in French. I was told that a model is trained using a treebank. For the French language, there are six of them (http://universaldependencies.org/, section for the French language):
- FTB
- Original
- Sequoia
- ParTUT
- PUD
- Spoken
So I would like to know which of them was used to train which French model.
I have first asked this question on the mailing list dedicated to the java nlp users (java-nlp-user#lists.stanford.edu), but to no avail up until now.
So, again, assuming one of the treebanks listed above was indeed used to train the Stanford CoreNLP French models available at the link posted above, which one is it? Alternatively, if no one here knows, who (name and surname) would know the answer to this question?
For all who are curious about this, here is some info about the datasets used for French in Stanford CoreNLP:
French POS tagger: CC (Crabbe and Candito) modified French Treebank
French POS tagger (UD version): UD 1.3
French Constituency Parser: CC modified French Treebank
French NN Dependency Parser: UD 1.3
Also note that the French constituency parser cannot translate constituency parses into dependency parses the way the English constituency parser can.

Stanford core NLP models for English language

I am using Stanford CoreNLP for a task. There are two model files, "stanford-corenlp-3.6.0-models" and "stanford-english-corenlp-2016-01-10-models", on Stanford's website. I want to know the difference between them.
According to the "Human languages supported" section of the CoreNLP Overview, the basic distribution provides model files for the analysis of well-edited English. That is the stanford-corenlp-3.6.0-models file you mentioned.
However, the CoreNLP team also provides a jar that contains all of their English models. It includes various variant models, and in particular one optimized for working with uncased English (e.g., text that is mostly or entirely uppercase or lowercase). The newest one is stanford-english-corenlp-2016-10-31-models, and the previous one is the stanford-english-corenlp-2016-01-10-models file you mentioned.
Reference:
http://stanfordnlp.github.io/CoreNLP/index.html#programming-languages-and-operating-systems
(the Stanford CoreNLP Overview page)

stanford corenlp 3.3.1 language support

I'm starting to use the CoreNLP library 3.3.1 to analyze Italian text documents. Has anybody tried to run it on a language other than English? Did you find the models needed to train the algorithms?
Thanks
Carlo
At the moment, beyond English, we only package models for Chinese (see http://nlp.stanford.edu/software/corenlp.shtml#History), but people have also successfully used the German and French models that we distribute with the Stanford Parser, Stanford NER, or the Stanford POS Tagger inside CoreNLP. For Italian, you'd need annotated data available to train your own models. There are some treebanks available for Italian and the Stanford Parser has been trained for Italian. For info on resources for Italian, see: http://aclweb.org/aclwiki/index.php?title=Resources_for_Italian#Treebanks.
