I'm new here and wanted to know if anyone can help me with the following question.
I'm doing sentiment analysis of Spanish text using Stanford CoreNLP, but I cannot get a positive result.
That is, if I analyze an English text it is analyzed perfectly, but as soon as I switch to Spanish the result is always negative.
I've been looking at how to configure the parser and tokenizer for Spanish, and everything I found was useless for sentiment analysis.
Can someone tell me whether only the tokenizer works for Spanish and sentiment does not?
This is the properties file I managed to put together:
annotators = tokenize, ssplit, pos, ner, parse, sentiment
tokenize.language = en
pos.model = edu/stanford/nlp/models/pos-tagger/english/spanish-distsim.tagger
ner.model = edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz
ner.applyNumericClassifiers = false
ner.useSUTime = false
parse.model = edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz
The code I use to perform the sentiment analysis is the typical code you can find in any tutorial.
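For reference, here is a minimal sketch of that typical pipeline code (the properties file name is just a placeholder for the file shown above):

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class SentimentDemo {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // load the properties shown above (annotators, pos.model, ner.model, parse.model, ...)
    props.load(new java.io.FileReader("spanish.properties"));
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation annotation = new Annotation("Me encanta esta película.");
    pipeline.annotate(annotation);

    // one sentiment label per sentence, e.g. Very negative / Negative / Neutral / Positive / Very positive
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      String sentiment = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
      System.out.println(sentiment + "\t" + sentence);
    }
  }
}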
Thank you very much!!
Unfortunately there is no Stanford sentiment model available for Spanish. At the moment all the Spanish words are likely being treated as generic "unknown words" by the sentiment analysis algorithm, which is why you're seeing consistently bad performance.
You can certainly train your own model (this is documented elsewhere on the Internet, I believe), but you'll need Spanish training data to accomplish this.
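For what it's worth, the training entry point is the edu.stanford.nlp.sentiment.SentimentTraining class; if I remember the documentation correctly, the invocation looks roughly like this, where train.txt and dev.txt stand for your own PTB-style sentiment treebank files (flag names may vary by version):

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

The resulting model.ser.gz can then, as far as I know, be plugged into the pipeline via the sentiment.model property.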
Related
I could not see any pre-trained Russian tokenizer in stanfordnlp or Stanford CoreNLP. Are there any models for Russian yet?
Unfortunately I don't know of any extensions that handle that for Stanford CoreNLP.
You can use Stanza (https://stanfordnlp.github.io/stanza/) which is our Python package to get Russian tokenization and sentence splitting.
You could theoretically tokenize and sentence split with Stanza, and then use the Stanford CoreNLP Server (which you can also use via Stanza) if you had any CoreNLP specific components you wanted to work with.
A group a while back submitted some models for Russian, but I don't see anything for tokenization.
The link to their resources is here: https://stanfordnlp.github.io/CoreNLP/model-zoo.html
As per the title of this post, I would like as much information as possible about the dataset used to train the Stanford CoreNLP French models made available on this page (https://stanfordnlp.github.io/CoreNLP/history.html). My ultimate aim is to know the set of tags I can expect the Stanford CoreNLP tool to output when using it to characterize a text written in French. I was told that a model is trained using a treebank. For the French language, there are 6 of them (http://universaldependencies.org/, section for the French language):
- FTB
- Original
- Sequoia
- ParTUT
- PUD
- Spoken
So I would like to know which of them was used to train which French model.
I first asked this question on the mailing list dedicated to Java NLP users (java-nlp-user@lists.stanford.edu), but to no avail up until now.
So, again, assuming it is one of the treebanks described above that was indeed used to train the Stanford CoreNLP French models available at the link posted above, which one is it? Alternatively, who (name and surname) would know the answer to this question, if no one here knows?
For all who are curious about this, here is some info about the datasets used for French in Stanford CoreNLP:
French POS tagger: CC (Crabbe and Candito) modified French Treebank
French POS tagger (UD version): UD 1.3
French Constituency Parser: CC modified French Treebank
French NN Dependency Parser: UD 1.3
Also note that the French constituency parser cannot convert its constituency parses into dependency parses the way the English constituency parser can.
I am using the Python library NLTK to set up the Stanford NER tagger API, but I am seeing inconsistencies between the tagging of words by this API and the online demo on Stanford's NER tagger website. Some words are tagged in the online demo but not by the API in Python, and similarly some words are tagged differently. I have used the same classifiers as mentioned on the website. Can anyone tell me why this problem occurs and what the solution is?
I was running into the same issue and determined that my code and the online demo were applying different formatting rules for the text.
https://github.com/dat/pyner/blob/master/ner/client.py
for s in ('\f', '\n', '\r', '\t', '\v'):  # strip whitespace control characters
    text = text.replace(s, '')
text += '\n'  # ensure end-of-line
I'm using Stanford NLP to do POS tagging for Spanish texts. I can get a POS tag for each word, but I notice that I am only given the first four sections of the Ancora tag; it is missing the last three sections for person, number, and gender.
Why does Stanford NLP only use a reduced version of the Ancora tag?
Is it possible to get the entire tag using Stanford NLP?
Here is my code (please excuse the jruby...):
props = java.util.Properties.new()
props.put("tokenize.language", "es")
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
props.put("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
props.put("pos.model", "/stanford-postagger-full-2015-01-30/models/spanish-distsim.tagger")
props.put("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")
pipeline = StanfordCoreNLP.new(props)
annotation = Annotation.new("No sé qué estoy haciendo. Me pregunto si esto va a funcionar.")
I am getting this as the output:
[Text=No CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=rn Lemma=no NamedEntityTag=O]
[Text=sé CharacterOffsetBegin=3 CharacterOffsetEnd=5 PartOfSpeech=vmip000 Lemma=sé NamedEntityTag=O]
[Text=qué CharacterOffsetBegin=6 CharacterOffsetEnd=9 PartOfSpeech=pt000000 Lemma=qué NamedEntityTag=O]
[Text=estoy CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=vmip000 Lemma=estoy NamedEntityTag=O]
[Text=haciendo CharacterOffsetBegin=16 CharacterOffsetEnd=24 PartOfSpeech=vmg0000 Lemma=haciendo NamedEntityTag=O]
[Text=. CharacterOffsetBegin=24 CharacterOffsetEnd=25 PartOfSpeech=fp Lemma=. NamedEntityTag=O]
(I notice that the lemmas are incorrect also, but that's probably an issue for a separate question. Nevermind, I see that Stanford NLP does not support Spanish lemmatization.)
Why does Stanford NLP only use a reduced version of the Ancora tag?
This was a practical decision made to ensure high tagging accuracy. (Retaining morphological information on tags caused the entire tagger to suffer from data sparsity, and do worse not only on morphological annotation but all over the board.)
Is it possible to get the entire tag using Stanford NLP?
No. You could get quite far doing this with a simple rule-based system, though, or use the Stanford Classifier to train your own morphological annotator. (Feel free to share your code if you pick either path!)
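To make the rule-based idea concrete, here is a purely hypothetical Java sketch; the AncoraTagExpander name and the suffix heuristics are mine, not part of CoreNLP, and real Spanish morphology would need many more rules:

/** Hypothetical illustration: fill the gender/number slots of a reduced Ancora noun tag from the word form. */
public class AncoraTagExpander {

  public static String expand(String word, String reducedTag) {
    // only attempt common nouns here; other categories would need their own rules
    if (!reducedTag.startsWith("nc")) {
      return reducedTag;
    }
    String lower = word.toLowerCase();
    char number = lower.endsWith("s") ? 'p' : 's';                  // very rough plural test
    char gender = (lower.endsWith("a") || lower.endsWith("as")) ? 'f'
                : (lower.endsWith("o") || lower.endsWith("os")) ? 'm'
                : 'c';                                               // 'c' = common/unknown
    // Ancora noun tags are category + type + gender + number + ...; rewrite positions 3 and 4
    char[] tag = reducedTag.toCharArray();
    if (tag.length >= 4) {
      tag[2] = gender;
      tag[3] = number;
    }
    return new String(tag);
  }

  public static void main(String[] args) {
    // example reduced tag; prints ncfp000 (feminine plural, by these toy rules)
    System.out.println(expand("casas", "nc0p000"));
  }
}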
If you are not strictly required to use the Stanford POS tagger, you might want to try the POS and morphological tagging toolkit RDRPOSTagger. RDRPOSTagger provides pre-trained POS and morphological tagging models for 40 different languages, including Spanish.
For Spanish POS and morphological tagging, RDRPOSTagger was trained on the IULA Spanish LSP Treebank. It obtained a tagging accuracy of 97.95%, with a tagging speed of 200K words/second for the Java implementation (10K words/second for the Python implementation), on a computer with a Windows 7 64-bit OS, a Core i5 2.50 GHz CPU, and 6 GB of memory.
I'm starting to use the CoreNLP library 3.3.1 to analyze Italian text documents. Has anybody tried to run it on a language other than English? Did you find the models needed to train the algorithms?
Thanks
Carlo
At the moment, beyond English, we only package models for Chinese (see http://nlp.stanford.edu/software/corenlp.shtml#History), but people have also successfully used the German and French models that we distribute with the Stanford Parser, Stanford NER, or the Stanford POS Tagger inside CoreNLP. For Italian, you'd need annotated data available to train your own models. There are some treebanks available for Italian and the Stanford Parser has been trained for Italian. For info on resources for Italian, see: http://aclweb.org/aclwiki/index.php?title=Resources_for_Italian#Treebanks.
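If you do train Italian models (tagger, NER, parser), you can point CoreNLP at them with the same kind of properties used for Spanish earlier in this thread; here is a sketch in which all file paths are hypothetical placeholders for models you would have to train yourself:

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ItalianPipeline {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, ner, parse");
    // hypothetical paths: CoreNLP does not ship Italian models, so these stand in for models trained on an Italian treebank
    props.setProperty("pos.model", "/path/to/your/italian.tagger");
    props.setProperty("ner.model", "/path/to/your/italian.crf.ser.gz");
    props.setProperty("parse.model", "/path/to/your/italianPCFG.ser.gz");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation("Questa è una frase di prova.");
    pipeline.annotate(annotation);
    System.out.println(annotation);
  }
}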