There seems to be a tagger for French, but I can't find any lemmatizer.
Thank you!
To the best of my knowledge, there isn't a French lemmatizer in Stanford NLP. You can take a look at Ahmet Aker's lemmatizer instead.
Related
My dataset is in French. I am wondering if there is a French equivalent of an embedding file like glove.6B.100d.txt?
Thank you
GloVe doesn't provide pretrained word embeddings for any language other than English.
However, spaCy supports French; see their trained French models here: https://spacy.io/models/fr
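For what it's worth, vector files like glove.6B.100d.txt are plain text, one token per line followed by its floats, and the French fastText vectors distributed at fasttext.cc ship in the same format; so a loader takes only a few lines of standard-library Python. A minimal sketch (the file name in the usage note is a placeholder for whichever vector file you download):

```python
# Minimal loader for GloVe/fastText-style text vector files,
# where each line is "<token> <f1> <f2> ... <fN>".
import math

def load_vectors(path, limit=None):
    """Read up to `limit` lines of a whitespace-separated vector file."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(' ')
            if len(parts) == 2:
                continue  # fastText files start with a "count dim" header line
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Usage would look like `vecs = load_vectors('cc.fr.300.vec', limit=50000)` followed by `cosine(vecs['chat'], vecs['chien'])`.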
I could not see any Russian pre-trained tokenizer in Stanford NLP or Stanford CoreNLP. Are there any models for Russian yet?
Unfortunately I don't know of any extensions that handle that for Stanford CoreNLP.
You can use Stanza (https://stanfordnlp.github.io/stanza/) which is our Python package to get Russian tokenization and sentence splitting.
You could theoretically tokenize and sentence split with Stanza, and then use the Stanford CoreNLP Server (which you can also use via Stanza) if you had any CoreNLP specific components you wanted to work with.
A group submitted some models for Russian a while back, but I don't see anything for tokenization.
The link to their resources is here: https://stanfordnlp.github.io/CoreNLP/model-zoo.html
As per the title of this post, I would like as much information as possible about the datasets used to train the Stanford CoreNLP French models made available on this page (https://stanfordnlp.github.io/CoreNLP/history.html). My ultimate aim is to know the set of tags I can expect Stanford CoreNLP to output when using it to characterize a text written in French. I was told that a model is trained using a treebank, and for French there are six of them (http://universaldependencies.org/, section for the French language):
- FTB
- Original
- Sequoia
- ParTUT
- PUD
- Spoken
So I would like to know which of them was used to train which French model.
I first asked this question on the mailing list dedicated to Java NLP users (java-nlp-user#lists.stanford.edu), but to no avail so far.
So, again, assuming one of the treebanks described above was indeed used to train the Stanford CoreNLP French models available at the link posted above, which one is it? Alternatively, who (by name) would know the answer to this question, if no one here does?
For all who are curious about this, here is some info about the datasets used for French in Stanford CoreNLP:
French POS tagger: CC (Crabbe and Candito) modified French Treebank
French POS tagger (UD version): UD 1.3
French Constituency Parser: CC modified French Treebank
French NN Dependency Parser: UD 1.3
Also note that the French constituency parser cannot convert constituency parses into dependency parses the way the English constituency parser can.
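Since the UD-trained models above follow the Universal Dependencies tagset, the tag inventory you can expect from the UD 1.3 French POS model should be the 17 universal POS tags of the UD v1 guidelines (note UD v1 used CONJ where UD v2 later renamed it CCONJ). A small sanity-check helper, with a hypothetical tagger output as input:

```python
# Universal POS tags as defined by the UD v1 guidelines (the UD 1.3 era).
UD_V1_UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CONJ", "DET", "INTJ", "NOUN",
    "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def unexpected_tags(tagged):
    """Return any tags in (word, tag) pairs outside the UD v1 inventory."""
    return {tag for _, tag in tagged if tag not in UD_V1_UPOS}

# Hypothetical tagger output for "le chat dort":
print(unexpected_tags([("le", "DET"), ("chat", "NOUN"), ("dort", "VERB")]))  # → set()
```

The CC-treebank-trained models, on the other hand, use the French Treebank's own tagset rather than this one.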
I'm starting to use the CoreNLP library 3.3.1 to analyze Italian text documents. Has anybody tried running it on a language other than English? Did you find the models needed to train the algorithms?
Thanks
Carlo
At the moment, beyond English, we only package models for Chinese (see http://nlp.stanford.edu/software/corenlp.shtml#History), but people have also successfully used the German and French models that we distribute with the Stanford Parser, Stanford NER, or the Stanford POS Tagger inside CoreNLP. For Italian, you'd need annotated data available to train your own models. There are some treebanks available for Italian and the Stanford Parser has been trained for Italian. For info on resources for Italian, see: http://aclweb.org/aclwiki/index.php?title=Resources_for_Italian#Treebanks.
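To illustrate the "train your own models" route: the Stanford POS Tagger, for instance, can be retrained from the command line on column-formatted annotated data. A minimal, hypothetical props file (the file names and feature architecture below are placeholders you would adapt to your Italian treebank) might look roughly like:

```properties
## Hypothetical training config for an Italian Stanford POS Tagger model.
model = models/italian.tagger
trainFile = format=TSV,wordColumn=0,tagColumn=1,it-train.tsv
encoding = UTF-8
tagSeparator = _
arch = words(-1,1),unicodeshapes(-1,1),order(2),suffix(4)
```

Training would then be invoked with something like `java -mx4g edu.stanford.nlp.tagger.maxent.MaxentTagger -props italian.props`; see the tagger's documentation for the full set of properties.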
Is it possible to train OpenNLP for languages other than English, such as Slavic languages written in Cyrillic, using the OpenNLP API?
Yes, it is. The OpenNLP documentation provides instructions on how to use and train each of its modules.
For named entity recognition specifically, please see here.
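For context, OpenNLP's name-finder trainer expects plain-text training data with one sentence per line and entities wrapped in `<START:type> ... <END>` markup; the format is encoding-agnostic, so Cyrillic works fine as long as the file is read as UTF-8. An invented Russian example line:

```
<START:person> Лев Толстой <END> родился в <START:location> Ясной Поляне <END> .
```

Training would then use the CLI, roughly `opennlp TokenNameFinderTrainer -lang ru -model ru-ner.bin -data ru-ner.train -encoding UTF-8` (check the OpenNLP manual for the exact flags of your version).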