Does Stanford Core NLP support Russian sentence and word tokenization? - stanford-nlp

I could not find any pre-trained Russian tokenizer in stanfordnlp or Stanford CoreNLP. Are there any models for Russian yet?

Unfortunately I don't know of any extensions that handle that for Stanford CoreNLP.
You can use Stanza (https://stanfordnlp.github.io/stanza/) which is our Python package to get Russian tokenization and sentence splitting.
You could theoretically tokenize and sentence split with Stanza, and then use the Stanford CoreNLP Server (which you can also use via Stanza) if you had any CoreNLP-specific components you wanted to work with.
A group submitted some models for Russian a while back, but I don't see anything for tokenization.
The link to their resources is here: https://stanfordnlp.github.io/CoreNLP/model-zoo.html
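For reference, a minimal sketch of Russian tokenization and sentence splitting with Stanza (assuming Stanza is installed via pip; the first run downloads the Russian models):

import stanza

stanza.download('ru')  # one-time download of the Russian models
nlp = stanza.Pipeline('ru', processors='tokenize')
doc = nlp('Казнить нельзя помиловать. Это второе предложение.')
for i, sentence in enumerate(doc.sentences):
    print(i, [token.text for token in sentence.tokens])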

Related

Stanford core NLP models for English language

I am using Stanford CoreNLP for a task. There are two models, "stanford-corenlp-3.6.0-models" and "stanford-english-corenlp-2016-01-10-models", on Stanford's website. I want to know what the difference between these two models is.
According to the "Human languages supported" section of the CoreNLP Overview, the basic distribution provides model files for the analysis of well-edited English; that is the stanford-corenlp-3.6.0-models jar you mentioned.
But the CoreNLP team also provides a jar that contains all of their English models, which includes various variant models, and in particular has one optimized for working with uncased English (e.g., text that is mostly or all uppercase or lowercase). The newest one is stanford-english-corenlp-2016-10-31-models, and the previous one is the stanford-english-corenlp-2016-01-10-models you mentioned.
Reference:
http://stanfordnlp.github.io/CoreNLP/index.html#programming-languages-and-operating-systems
(the Stanford CoreNLP Overview page)
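If you want to see the difference for yourself, a models jar is just a zip archive, so you can list the bundled model files in each download. A small sketch, assuming both jars sit in the current directory:

import zipfile

for jar in ('stanford-corenlp-3.6.0-models.jar',
            'stanford-english-corenlp-2016-01-10-models.jar'):
    with zipfile.ZipFile(jar) as z:
        # count tagger and serialized model files packed into the jar
        models = [n for n in z.namelist() if n.endswith(('.tagger', '.ser.gz'))]
        print(jar, '->', len(models), 'model files')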

inconsistency between nltk stanford ner tagger and stanford ner tagger online demo

I am using Python's NLTK library to set up the Stanford NER tagger API, but I am seeing inconsistency between how this API tags words and how the online demo on Stanford's NER tagger website tags them. Some words are tagged in the online demo but not by the API in Python, and some words are tagged differently. I have used the same classifiers as mentioned on the website. Can anyone tell me why this problem occurs and what the solution is?
I was running into the same issue and determined that my code and the online demo were applying different formatting rules for the text.
https://github.com/dat/pyner/blob/master/ner/client.py
for s in ('\f', '\n', '\r', '\t', '\v'):  # strip whitespace control characters
    text = text.replace(s, '')
text += '\n'  # ensure end-of-line
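Building on that, a sketch of applying the same normalization before calling NLTK's StanfordNERTagger, so both sides see the same input (the classifier and jar paths below are assumptions about your local setup):

from nltk.tag import StanfordNERTagger

tagger = StanfordNERTagger(
    'english.all.3class.distsim.crf.ser.gz',  # same classifier as the demo
    'stanford-ner.jar')

def normalize(text):
    # mirror the demo's preprocessing from client.py above
    for s in ('\f', '\n', '\r', '\t', '\v'):
        text = text.replace(s, '')
    return text + '\n'

tokens = normalize('Barack Obama visited\tParis.').split()
print(tagger.tag(tokens))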

NLP Postagger can't grok imperatives?

The Stanford NLP POS tagger claims imperative verbs were added to a recent version. I've fed it lots of text with abundant and obvious imperatives, but there seems to be no tag for them in the output. Must one, after all, train it for this POS?
There is no special tag for imperatives; they are simply tagged as VB.
The info on the website refers to the fact that we added a bunch of manually annotated imperative sentences to our training data so that the POS tagger gets more of them right, i.e., tags the verb as VB.
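A quick way to check this yourself, sketched with Stanza's English tagger (not the Java tagger asked about, but both emit Penn Treebank tags):

import stanza

nlp = stanza.Pipeline('en', processors='tokenize,pos')
doc = nlp('Close the door.')
print([(w.text, w.xpos) for w in doc.sentences[0].words])
# expected: [('Close', 'VB'), ('the', 'DT'), ('door', 'NN'), ('.', '.')]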

Stanford NLP: Sentence splitting without tokenization?

Can I detect sentences via the command line interface of Stanford NLP like Apache OpenNLP?
https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect
Based on the docs, Stanford NLP requires tokenization as per http://nlp.stanford.edu/software/corenlp.shtml
Our pipeline requires that you tokenize first; we use these tokens in the sentence-splitting algorithm. If your text is pre-tokenized, you can use DocumentPreprocessor and request whitespace-only tokenization.
Let me know if I misunderstood your question.
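For completeness, a sketch of running just tokenization and sentence splitting against the CoreNLP server from Python via Stanza (this assumes the CORENLP_HOME environment variable points at an unpacked CoreNLP distribution; for pre-tokenized text you could additionally pass properties={'tokenize.whitespace': 'true'}):

from stanza.server import CoreNLPClient

text = 'Dr. Smith arrived at 9 a.m. He sat down.'
with CoreNLPClient(annotators=['tokenize', 'ssplit']) as client:
    ann = client.annotate(text)
    for i, sent in enumerate(ann.sentence):
        print(i, ' '.join(tok.word for tok in sent.token))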

stanford corenlp 3.3.1 language support

I'm starting to use the CoreNLP library 3.3.1 to analyze Italian text documents. Has anybody tried to run it on a language other than English? Did you find the models needed to train the algorithms?
Thanks
Carlo
At the moment, beyond English, we only package models for Chinese (see http://nlp.stanford.edu/software/corenlp.shtml#History), but people have also successfully used the German and French models that we distribute with the Stanford Parser, Stanford NER, or the Stanford POS Tagger inside CoreNLP. For Italian, you'd need annotated data available to train your own models. There are some treebanks available for Italian and the Stanford Parser has been trained for Italian. For info on resources for Italian, see: http://aclweb.org/aclwiki/index.php?title=Resources_for_Italian#Treebanks.
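To illustrate the mechanics of swapping in a non-English model with a current CoreNLP, here is a sketch that overrides the pos.model property from Python via Stanza's client; french.tagger ships with the Stanford POS Tagger distribution, and its path below is an assumption about your setup. The default English tokenizer is only a rough fit for French, but the same property-override approach would apply to a model you train yourself for Italian:

from stanza.server import CoreNLPClient

with CoreNLPClient(
        annotators=['tokenize', 'ssplit', 'pos'],
        properties={'pos.model': '/path/to/french.tagger'}) as client:
    ann = client.annotate('Le chat dort sur le canapé.')
    for sent in ann.sentence:
        print([(tok.word, tok.pos) for tok in sent.token])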
