Why do the property settings for the StanfordCoreNLP pipeline matter when building a dependency parse tree?

I have been playing around with Stanford CoreNLP and found that building a dependency parse tree with the following code
String text = "Are depparse and parse equivalent properties for building dependency parse tree?"
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST ));
}
outputs
root(ROOT-0, tree-11)
cop(tree-11, Are-1)
amod(properties-6, depparse-2)
cc(depparse-2, and-3)
conj(depparse-2, parse-4)
compound(properties-6, equivalent-5)
nsubj(tree-11, properties-6)
case(dependency-9, for-7)
compound(dependency-9, building-8)
nmod(properties-6, dependency-9)
amod(tree-11, parse-10)
punct(tree-11, ?-12)
However, this code
String text = "Are depparse and parse equivalent properties for building dependency parse tree?"
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, lemma, ner, depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST ));
}
outputs
root(ROOT-0, properties-6)
cop(properties-6, Are-1)
compound(properties-6, depparse-2)
cc(depparse-2, and-3)
conj(depparse-2, parse-4)
amod(properties-6, equivalent-5)
case(tree-11, for-7)
amod(tree-11, building-8)
compound(tree-11, dependency-9)
amod(tree-11, parse-10)
nmod(properties-6, tree-11)
punct(properties-6, ?-12)
So why am I not getting the same output from these two methods? Is it possible to change the latter code to be equivalent to the first, since loading the constituency parser as well makes parsing so slow? And how would you recommend setting the properties to get the most accurate dependency parse tree?

The constituency parser (parse annotator) and dependency parser (depparse annotator) are actually completely different models and code paths. In one case, we're predicting a constituency tree and converting it to a dependency graph. In the other case, we are running a dependency parser directly. In general, depparse is expected to be faster (O(n) vs O(n^3)) and more accurate at producing dependency trees, but will not produce constituency trees.
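If all you need is the dependency graph, here is a minimal sketch of the faster setup (depparse requires POS tags, so the pos annotator is included; otherwise the code mirrors the question's snippets):
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    // same annotation key as above: the basic dependency graph
    SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
    System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
}
This skips the O(n^3) constituency model entirely, but expect the two pipelines to keep producing somewhat different trees, since they are different models.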

Related

How to extract subject, verb, and object for every sentence using NLP in Java?

I want to find the subject, verb, and object of each sentence, which will then be passed to the natural language generation library SimpleNLG to form a sentence.
I tried multiple libraries like CoreNLP, OpenNLP, and the Stanford parsers, but I cannot extract them accurately.
In the worst case I will have to write a long set of if-else rules to find the subject, verb, and object of each sentence, which is not always accurate for SimpleNLG, e.g.
NN and nsubj go to the subject; VB and VBZ go to the verb.
I tried the lexicalized parser:
LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
String[] sent = { "This", "is", "an", "easy", "sentence", "." };
Tree parse = (Tree) lp.apply(Arrays.asList(sent));
parse.pennPrint();
System.out.println();
TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
tp.print(parse);
which gives this output,
nsubj(use-2, I-1)
root(ROOT-0, use-2)
det(parser-4, a-3)
dobj(use-2, parser-4)
And I want something like this
subject = I
verb = use
det = a
object = parser
Is there a simpler way to find this in Java, or should I go with if-else? Please help me with it.
You can use the openie annotator to get triples. You can run this at the command line or build a pipeline with these annotators.
command:
java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie -file example.txt
Java:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation result = pipeline.process("...");
input:
Joe ate some pizza.
output:
Extracted the following Open IE triples:
1.0 Joe ate pizza
More details here: https://stanfordnlp.github.io/CoreNLP/openie.html
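To pull the triples out of the annotation programmatically, here is a sketch following the usage shown in that documentation (check the class names against your CoreNLP version; the relevant imports are edu.stanford.nlp.ie.util.RelationTriple and edu.stanford.nlp.naturalli.NaturalLogicAnnotations):
for (CoreMap sentence : result.get(CoreAnnotations.SentencesAnnotation.class)) {
    // openie stores its output per sentence under RelationTriplesAnnotation
    Collection<RelationTriple> triples =
            sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
    for (RelationTriple triple : triples) {
        // subject/relation/object glosses are the SVO parts the question asks for
        System.out.println(triple.confidence + "\t" +
                triple.subjectGloss() + "\t" +
                triple.relationGloss() + "\t" +
                triple.objectGloss());
    }
}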

Forcing POS tags in Stanford CoreNLP

Is there a way to process an already POS-tagged text using Stanford CoreNLP?
For example, I have the sentence in this format
They_PRP are_VBP hunting_VBG dogs_NNS ._.
and I'd like to annotate with lemma, ner, parse, etc. by forcing the given POS annotation.
Update. I tried this code, but it's not working.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String sentText = "They_PRP are_VBP hunting_VBG dogs_NNS ._.";
List<CoreLabel> sentence = new ArrayList<>();
String[] parts = sentText.split("\\s");
for (String p : parts) {
    String[] split = p.split("_");
    CoreLabel clToken = new CoreLabel();
    clToken.setValue(split[0]);
    clToken.setWord(split[0]);
    clToken.setOriginalText(split[0]);
    clToken.set(CoreAnnotations.PartOfSpeechAnnotation.class, split[1]);
    sentence.add(clToken);
}
Annotation s = new Annotation(sentText);
s.set(CoreAnnotations.TokensAnnotation.class, sentence);
Annotation document = new Annotation(s);
pipeline.annotate(document);
The POS annotations will certainly be replaced if you include the pos annotator in the pipeline.
Instead, remove the pos annotator and add the option -enforceRequirements false. This allows the pipeline to run even though an annotator that lemma and others depend on (the pos annotator) is not present. Add the following line before instantiating the pipeline:
props.setProperty("enforceRequirements", "false");
Of course, behavior is undefined if you venture into this area without setting the proper annotations, so make sure you match the annotations made by the relevant annotator (POSTaggerAnnotator in this case).
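Putting that together, a minimal sketch (assuming the tokens already carry PartOfSpeechAnnotation values, as set in the question's loop above):
Properties props = new Properties();
// no pos annotator, so the pre-set tags are kept instead of being overwritten
props.setProperty("annotators", "tokenize, ssplit, lemma");
props.setProperty("enforceRequirements", "false");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);  // document built from the pre-tagged tokens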

How to convince the tokenizer to work correctly with single sentence

I have the following sentence (only one). When I use the following code to tokenize it and get the index of each word, the tokenizer treats it as two sentences because of the full stop after "approx". How can I solve this problem?
String sentence = "09-Aug-2003 -- On Saturday, 9th August 2003, Daniel and I start with our Enduros approx. 100 kilometers from the confluence point."
Annotation document = new Annotation(sentence);
pipeline.annotate(document);
for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
String word = token.get(CoreAnnotations.TextAnnotation.class);
System.out.println(token.index(), word);
}
e.g. the true index of "kilometers" is 20, but according to this code it is 2.
If you add the following to the Properties object that you pass in to the pipeline,
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
props.setProperty("ssplit.isOneSentence", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
then it won't split the text into different sentences.
(search 'ssplit' on this page to see all the other options http://nlp.stanford.edu/software/corenlp.shtml)

Formatting NER output from Stanford Corenlp

I am working with Stanford CoreNLP and using it for NER. But when I extract organization names, I see that each word is tagged separately, so if the entity is "NEW YORK TIMES", it gets recorded as three different entities: "NEW", "YORK" and "TIMES". Is there a property we can set in Stanford CoreNLP so that we get the combined output as one entity?
In Stanford NER's command line utility we can choose our output format, e.g. inlineXML. Can we somehow set a property to select the output format in Stanford CoreNLP?
If you just want the complete strings of each named entity found by Stanford NER, try this:
String text = "<INSERT YOUR INPUT TEXT HERE>";
AbstractSequenceClassifier<CoreLabel> ner = CRFClassifier.getDefaultClassifier();
List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> entity : entities)
    System.out.println(text.substring(entity.second, entity.third) + " (at offset " + entity.second + ")");
In case you're wondering, the entity class is indicated by entity.first.
Alternatively, you can use ner.classifyWithInlineXML(text) to get output that looks like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
No, CoreNLP 3.5.0 has no utility to merge the NER labels. The next release (coming sometime next week) has a new MentionsAnnotator which handles this merging for you. For now, you can (a) use the MentionsAnnotator, available on the CoreNLP master branch, or (b) merge manually.
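For option (b), a hand-rolled merge is straightforward; here is an illustrative sketch of my own (not library code) that groups consecutive tokens sharing the same non-O NER label into one entity string:
List<String> entities = new ArrayList<>();
StringBuilder span = new StringBuilder();
String prevTag = "O";
for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
    String tag = token.ner();  // per-token NER label, e.g. ORGANIZATION or O
    if (!"O".equals(tag) && tag.equals(prevTag)) {
        span.append(' ').append(token.word());  // continue the current entity
    } else {
        if (span.length() > 0) {  // close out the previous entity
            entities.add(span.toString());
            span.setLength(0);
        }
        if (!"O".equals(tag)) {
            span.append(token.word());  // start a new entity
        }
    }
    prevTag = tag;
}
if (span.length() > 0) entities.add(span.toString());
// "NEW YORK TIMES" now comes out as one string instead of three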
Use the -outputFormat xml option to have CoreNLP output XML. (Is this what you want?)
You can set any property in the properties file, including the outputFormat property. Stanford CoreNLP supports several different formats, such as json, xml, and text. However, the xml option is not an inlineXML format; the xml output gives per-token annotations for NER.
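For example, a typical command-line invocation (assuming the CoreNLP jars and models are on the classpath and your text is in input.txt):
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -outputFormat xml -file input.txt
For "New York Times" the token section of the XML looks like this: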
<tokens>
  <token id="1">
    <word>New</word>
    <lemma>New</lemma>
    <CharacterOffsetBegin>0</CharacterOffsetBegin>
    <CharacterOffsetEnd>3</CharacterOffsetEnd>
    <POS>NNP</POS>
    <NER>ORGANIZATION</NER>
    <Speaker>PER0</Speaker>
  </token>
  <token id="2">
    <word>York</word>
    <lemma>York</lemma>
    <CharacterOffsetBegin>4</CharacterOffsetBegin>
    <CharacterOffsetEnd>8</CharacterOffsetEnd>
    <POS>NNP</POS>
    <NER>ORGANIZATION</NER>
    <Speaker>PER0</Speaker>
  </token>
  <token id="3">
    <word>Times</word>
    <lemma>Times</lemma>
    <CharacterOffsetBegin>9</CharacterOffsetBegin>
    <CharacterOffsetEnd>14</CharacterOffsetEnd>
    <POS>NNP</POS>
    <NER>ORGANIZATION</NER>
    <Speaker>PER0</Speaker>
  </token>
</tokens>
From Stanford CoreNLP 3.6 onwards, you can use the entitymentions annotator in the pipeline to get a list of all entity mentions. I have shown an example here; it works.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, entitymentions");
props.put("regexner.mapping", "jg-regexner.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String inputText = "I have done Bachelor of Arts and Bachelor of Laws so that I can work at British Broadcasting Corporation";
Annotation annotation = new Annotation(inputText);
pipeline.annotate(annotation);

List<CoreMap> multiWordsExp = annotation.get(MentionsAnnotation.class);
for (CoreMap multiWord : multiWordsExp) {
    String custNERClass = multiWord.get(NamedEntityTagAnnotation.class);
    System.out.println(multiWord + " : " + custNERClass);
}

How to parse a list of sentences?

I want to parse a list of sentences with the Stanford NLP parser.
My list is an ArrayList; how can I parse the whole list with LexicalizedParser?
I want to get from each sentence this form:
Tree parse = (Tree) lp1.apply(sentence);
Although one can dig into the documentation, I am going to provide code here on SO, especially since links move and/or die. This particular answer uses the whole pipeline. If you are not interested in the whole pipeline, I will provide an alternative answer in just a second.
The example below is the complete way of using the Stanford pipeline. If you are not interested in coreference resolution, remove dcoref from the 3rd line of code. In the example below, the pipeline does the sentence splitting for you (the ssplit annotator) if you just feed it a body of text (the text variable). Have just one sentence? That is OK, you can feed that in as the text variable.
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
    }

    // this is the parse tree of the current sentence
    Tree tree = sentence.get(TreeAnnotation.class);

    // this is the Stanford dependency graph of the current sentence
    SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}

// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph =
    document.get(CorefChainAnnotation.class);
Actually, the documentation from Stanford NLP provides a sample of how to parse sentences.
You can find the documentation here.
So as promised, if you don't want to access the full Stanford pipeline (although I believe that is the recommended approach), you can work with the LexicalizedParser class directly. In this case, you would download the latest version of Stanford Parser (whereas the other would use CoreNLP tools). Make sure that in addition to the parser jar, you have the model file for the appropriate parser you want to work with. Example code:
LexicalizedParser lp1 = new LexicalizedParser("englishPCFG.ser.gz", new Options());
String sentence = "It is a fine day today";
Tree parse = lp1.parse(sentence);
Note this works for version 3.3.1 of the parser.
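And to close the loop on the original question, parsing every sentence in an ArrayList<String> is just a loop over the same call (the example sentences are my own):
List<String> mySentences = new ArrayList<>();
mySentences.add("It is a fine day today.");
mySentences.add("Each sentence in the list is parsed in turn.");
for (String s : mySentences) {
    Tree parse = lp1.parse(s);  // same parser instance as above
    parse.pennPrint();
}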
