I am trying to retrieve the Stanford Dependencies representation of a sentence. Something like this:
nsubj(makes-8, Bell-1)
nsubj(distributes-10, Bell-1)
vmod(Bell-1, based-3)
I am using a CoreNLP pipeline with the depparse annotator. The following code produces output in a similar, but less friendly, format.
private void printDepParse(Annotation document) {
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    IoUtils.prln("---Dependencies---");
    for (CoreMap sentence : sentences) {
        // this is the Stanford dependency graph of the current sentence
        SemanticGraph dependencies = sentence.get(BasicDependenciesAnnotation.class);
        IoUtils.prln(dependencies.toString());
    }
    IoUtils.prln("---End Dependencies---");
}
Produces something like:
-> fed/VBD (root)
-> Jimmy/NNP (nsubj)
-> dog/NN (xcomp)
-> Billy/NNP (nsubj)
-> the/DT (det)
-> ./. (punct)
Is there a simple way of producing the former, more generally accepted format?
To get the Stanford Dependencies representation you need to use the GrammaticalStructure class. This should work (note that it reads the constituency tree, so the parse annotator must be in your pipeline):
for (CoreMap sentence : sentences) {
    Tree tree = sentence.get(TreeAnnotation.class);
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(tree);
    Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
}
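Alternatively, since the depparse annotator already gives you a SemanticGraph, printing that graph in list format may be enough. A minimal sketch reusing the variables from the question (in recent CoreNLP versions, SemanticGraph.OutputFormat.LIST produces one typed dependency per line):

SemanticGraph dependencies = sentence.get(BasicDependenciesAnnotation.class);
// prints one typed dependency per line, e.g. "nsubj(makes-8, Bell-1)"
System.out.println(dependencies.toString(SemanticGraph.OutputFormat.LIST));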
I downloaded the latest Stanford POS Tagger from https://nlp.stanford.edu/software/tagger.html, then used the following code to get the POS tags of a sentence:
import nltk
from nltk.tag import StanfordPOSTagger

jar = './stanford_postagger/stanford-postagger-3.9.1.jar'
model = './stanford_postagger/models/english-left3words-distsim.tagger'
pos_tagger = StanfordPOSTagger(model, jar)
text = nltk.word_tokenize('How much did the Dow rise?')
stanford_pos = pos_tagger.tag(text)
The output for this sentence is:
[('How', 'WRB'), ('much', 'RB'), ('did', 'VBD'), ('the', 'DT'), ('Dow', 'NNP'), ('rise', 'NN'), ('?', '.')]
which is wrong -- the final verb is tagged as a noun.
But the online parser at http://nlp.stanford.edu:8080/parser/index.jsp gives the right POS tags:
How/WRB
much/JJ
did/VBD
the/DT
dow/NN
rise/VB
?/.
Could someone tell me why these two give different results?
How is the polarity of a word in a sentence calculated using the PatternAnalyzer of TextBlob?
TextBlob internally uses a Naive Bayes classifier for sentiment analysis; the classifier it uses is the one provided by NLTK.
See the TextBlob sentiment analyzer code here.
@requires_nltk_corpus
def train(self):
    """Train the Naive Bayes classifier on the movie review corpus."""
    super(NaiveBayesAnalyzer, self).train()
    neg_ids = nltk.corpus.movie_reviews.fileids('neg')
    pos_ids = nltk.corpus.movie_reviews.fileids('pos')
    neg_feats = [(self.feature_extractor(
        nltk.corpus.movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]
    pos_feats = [(self.feature_extractor(
        nltk.corpus.movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]
    train_data = neg_feats + pos_feats
    #### THE CLASSIFIER USED IS NLTK's NAIVE BAYES #####
    self._classifier = nltk.classify.NaiveBayesClassifier.train(train_data)

def analyze(self, text):
    """Return the sentiment as a named tuple of the form:
    ``Sentiment(classification, p_pos, p_neg)``
    """
    # Lazily train the classifier
    super(NaiveBayesAnalyzer, self).analyze(text)
    tokens = word_tokenize(text, include_punc=False)
    filtered = (t.lower() for t in tokens if len(t) >= 3)
    feats = self.feature_extractor(filtered)
    #### USE PROB_CLASSIFY METHOD OF NLTK CLASSIFIER #####
    prob_dist = self._classifier.prob_classify(feats)
    return self.RETURN_TYPE(
        classification=prob_dist.max(),
        p_pos=prob_dist.prob('pos'),
        p_neg=prob_dist.prob("neg")
    )
The source for NLTK's NaiveBayes classifier is here. Its prob_classify method returns the probability distribution that is used for the result returned by TextBlob's sentiment analyzer.
def prob_classify(self, featureset):
I have set up the CoreNLP server on my Ubuntu instance and it works OK. I am more interested in the Sentiment module, and currently all I get is:
{
sentimentValue: "2",
sentiment: "Neutral"
}
What I need is the score distribution, as you can see here: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
"scoreDistr": [0.1685, 0.7187, 0.0903, 0.0157, 0.0068]
What am I missing, or how do I get such data?
Thanks
You need to get the tree object via SentimentCoreAnnotations.SentimentAnnotatedTree.class from your annotated sentence. Then you can get the predictions through the RNNCoreAnnotations class. The self-contained demo code below shows how to get the scores for each label of a CoreNLP sentiment prediction.
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.ejml.simple.SimpleMatrix;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

public class DemoSentiment {
    public static void main(String[] args) {
        final List<String> texts = Arrays.asList("I am happy.", "This is a neutral sentence.", "I am very angry.");

        final Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        final StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        for (String text : texts) {
            final Annotation doc = new Annotation(text);
            pipeline.annotate(doc);

            for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
                final Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
                final SimpleMatrix sm = RNNCoreAnnotations.getPredictions(tree);
                final String sentiment = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
                System.out.println("sentence: " + sentence);
                System.out.println("sentiment: " + sentiment);
                System.out.println("matrix: " + sm);
            }
        }
    }
}
The output will be similar to what is below (floating point rounding or updated models might change the scores slightly).
For the first sentence, I am happy., the sentiment is Positive and the highest value in the returned matrix is 0.618, at the fourth position when reading the matrix as an ordered list (the five positions correspond to the labels Very Negative, Negative, Neutral, Positive, Very Positive).
The second sentence, This is a neutral sentence., has its highest score in the middle, at 0.952, hence the Neutral sentiment.
The last sentence correspondingly has a Negative sentiment, with its highest score of 0.652 at the second position.
sentence: I am happy.
sentiment: Positive
matrix: Type = dense , numRows = 5 , numCols = 1
0.016
0.037
0.132
0.618
0.196
sentence: This is a neutral sentence.
sentiment: Neutral
matrix: Type = dense , numRows = 5 , numCols = 1
0.001
0.007
0.952
0.039
0.001
sentence: I am very angry.
sentiment: Negative
matrix: Type = dense , numRows = 5 , numCols = 1
0.166
0.652
0.142
0.028
0.012
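If you want the score distribution as a plain array, like the scoreDistr field shown by the online demo, you can read the values out of the SimpleMatrix yourself. A minimal sketch, assuming the 5x1 prediction matrix sm from the demo above:

double[] scoreDistr = new double[sm.numRows()];
for (int i = 0; i < sm.numRows(); i++) {
    // row i holds the probability of label i (Very Negative .. Very Positive)
    scoreDistr[i] = sm.get(i, 0);
}
System.out.println(Arrays.toString(scoreDistr));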
I am getting to know the openie annotator of Stanford NLP. However, the option openie.resolve_coref doesn't work on my input text.
I want to use openie to generate triples with coreference resolved. How can I do this?
This code was copied from the Stanford site, and I added the line:
props.setProperty("openie.resolve_coref", "true");
Properties props = new Properties();
props.setProperty("openie.resolve_coref", "true");
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,parse,natlog,ner,coref,openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// Annotate an example document.
String text = "Obama was born in Hawaii. He is our president.";
Annotation doc = new Annotation(text);
pipeline.annotate(doc);

// Loop over sentences in the document
int sentNo = 0;
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    System.out.println("Sentence #" + ++sentNo + ": " + sentence.get(CoreAnnotations.TextAnnotation.class));

    // Print SemanticGraph
    System.out.println(sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class).toString(SemanticGraph.OutputFormat.LIST));

    // Get the OpenIE triples for the sentence
    Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);

    // Print the triples
    for (RelationTriple triple : triples) {
        System.out.println(triple.confidence + "\t CON=" +
                triple.subjectLemmaGloss() + "\t REL=" +
                triple.relationLemmaGloss() + "\t CON=" +
                triple.objectLemmaGloss());
    }

    // Alternately, to only run e.g., the clause splitter:
    List<SentenceFragment> clauses = new OpenIE(props).clausesInSentence(sentence);
    for (SentenceFragment clause : clauses) {
        System.out.println(clause.parseTree.toString(SemanticGraph.OutputFormat.LIST));
    }
    System.out.println();
}
The process results in these triples:
1.0 Obama be bear in Hawaii
1.0 Obama be bear
1.0 he be we president -> Should be -> Obama be we president
EDIT: this bug is also fixed in versions 3.7.0+
This is a bug in the 3.6.0 release which is fixed in the GitHub version. It'll be fixed in the next release, or you can manually update the code and model jars from the CoreNLP GitHub page -- you can download the latest models, and build the code jar with ant jar.
My output is:
Sentence #1: Obama was born in Hawaii.
root(ROOT-0, born-3)
nsubjpass(born-3, Obama-1)
auxpass(born-3, was-2)
case(Hawaii-5, in-4)
nmod:in(born-3, Hawaii-5)
punct(born-3, .-6)
1.0 CON=Obama REL=be bear in CON=Hawaii
1.0 CON=Obama REL=be CON=bear
[main] INFO edu.stanford.nlp.naturalli.ClauseSplitter - Loading clause splitter from edu/stanford/nlp/models/naturalli/clauseSearcherModel.ser.gz ... done [0.43 seconds]
root(ROOT-0, born-3)
nsubjpass(born-3, Obama-1)
auxpass(born-3, was-2)
case(Hawaii-5, in-4)
nmod:in(born-3, Hawaii-5)
Sentence #2: He is our president.
root(ROOT-0, president-4)
nsubj(president-4, He-1)
cop(president-4, is-2)
nmod:poss(president-4, our-3)
punct(president-4, .-5)
1.0 CON=Obama REL=be CON=we president
[main] INFO edu.stanford.nlp.naturalli.ClauseSplitter - Loading clause splitter from edu/stanford/nlp/models/naturalli/clauseSearcherModel.ser.gz ... done [0.45 seconds]
root(ROOT-0, president-4)
nsubj(president-4, He-1)
cop(president-4, is-2)
nmod:poss(president-4, our-3)
I'm using Stanford Log-linear Part-Of-Speech Tagger and here is the sample sentence that I tag:
He can't do that
When tagged I get this result:
He_PRP ca_MD n't_RB do_VB that_DT
As you can see, can't is split into two words: ca is marked as a modal (MD) and n't is marked as an adverb (RB).
I actually get the same result if I use can and not separately: can is MD and not is RB. So is this way of splitting expected, rather than, say, splitting it as can_MD and 't_RB?
Note: This is not the perfect answer.
I think the problem originates in the tokenizer used by the Stanford POS Tagger, not in the tagger itself. The tokenizer (PTBTokenizer) cannot handle the apostrophe properly:
1- Stanford PTBTokenizer token's split delimiter.
2- Stanford coreNLP - split words ignoring apostrophe.
As mentioned here (Stanford Tokenizer), the PTBTokenizer will tokenize the sentence:
"Oh, no," she's saying, "our $400 blender can't handle something this
hard!"
to:
...... our $ 400 blender ca n't handle something
Try to find a suitable tokenization method and apply it to the tagger as follows:
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class Test {
    public static void main(String[] args) throws Exception {
        String model = "F:/code/stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger";
        MaxentTagger tagger = new MaxentTagger(model);

        List<HasWord> sent;
        sent = Sentence.toWordList("He", "can", "'t", "do", "that", ".");
        //sent = Sentence.toWordList("He", "can't", "do", "that", ".");

        List<TaggedWord> taggedSent = tagger.tagSentence(sent);
        for (TaggedWord tw : taggedSent) {
            System.out.print(tw.word() + "=" + tw.tag() + " , ");
        }
    }
}
output:
He=PRP , can=MD , 't=VB , do=VB , that=DT , .=. ,
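For comparison, a minimal sketch of letting the tagger do its own PTB-style tokenization, which should reproduce the ca_MD n't_RB split from the question (reusing the tagger instance from above; tagString tokenizes the raw string internally before tagging):

// tagString runs the default tokenizer on the raw string and tags the result
System.out.println(tagger.tagString("He can't do that."));
// expected to print something like: He_PRP ca_MD n't_RB do_VB that_DT ._.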