How to extract subject, verb, and object for every sentence using NLP in Java? (stanford-nlp)

I want to find the subject, verb, and object of each sentence, and then pass them to the natural language generation library SimpleNLG to form a sentence.
I tried multiple libraries like CoreNLP, OpenNLP, and the Stanford parsers, but I cannot extract them accurately.
In the worst case, I will have to write a long set of if-else rules to find the subject, verb, and object from each sentence, which is not always accurate enough for SimpleNLG,
for example:
NN, nsubj, etc. go to the subject; VB, VBZ go to the verb.
I tried the lexicalized parser:
LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
String[] sent = { "This", "is", "an", "easy", "sentence", "." };
Tree parse = (Tree) lp.apply(Arrays.asList(sent));
parse.pennPrint();
System.out.println();
TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
tp.printTree(parse);
which gives this output,
nsubj(use-2, I-1)
root(ROOT-0, use-2)
det(parser-4, a-3)
dobj(use-2, parser-4)
And I want something like this
subject = I
verb = use
det = a
object = parser
Is there a simpler way to find this in Java, or should I go with if-else? Please help me with it.

You can use the openie annotator to get triples. You can run this at the command line or build a pipeline with these annotators.
command:
java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie -file example.txt
Java:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation result = pipeline.process("...");
input:
Joe ate some pizza.
output:
Extracted the following Open IE triples:
1.0 Joe ate pizza
More details here: https://stanfordnlp.github.io/CoreNLP/openie.html
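If you want the triples programmatically rather than from the command line, you can continue from the pipeline and result objects in the Java snippet above (processing, say, "Joe ate some pizza.") and read each sentence's RelationTriplesAnnotation. A minimal sketch based on the OpenIE documentation (RelationTriple lives in edu.stanford.nlp.ie.util, the annotation key in edu.stanford.nlp.naturalli.NaturalLogicAnnotations):
for (CoreMap sentence : result.get(CoreAnnotations.SentencesAnnotation.class)) {
    // Each sentence carries a collection of OpenIE (subject, relation, object) triples.
    Collection<RelationTriple> triples =
            sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
    for (RelationTriple triple : triples) {
        System.out.println("subject = " + triple.subjectGloss());
        System.out.println("verb    = " + triple.relationGloss());
        System.out.println("object  = " + triple.objectGloss());
    }
}
This prints the triples in roughly the subject/verb/object form asked for in the question.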

Related

Why do the property settings for the StanfordCoreNLP pipeline matter when building a dependency parse tree?

I have been playing around with Stanford-CoreNLP and I figured out that building a dependency parse tree with the following code
String text = "Are depparse and parse equivalent properties for building dependency parse tree?";
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
    System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
}
is outputting
root(ROOT-0, tree-11)
cop(tree-11, Are-1)
amod(properties-6, depparse-2)
cc(depparse-2, and-3)
conj(depparse-2, parse-4)
compound(properties-6, equivalent-5)
nsubj(tree-11, properties-6)
case(dependency-9, for-7)
compound(dependency-9, building-8)
nmod(properties-6, dependency-9)
amod(tree-11, parse-10)
punct(tree-11, ?-12)
However, this code
String text = "Are depparse and parse equivalent properties for building dependency parse tree?";
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, lemma, ner, depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
    System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
}
outputs
root(ROOT-0, properties-6)
cop(properties-6, Are-1)
compound(properties-6, depparse-2)
cc(depparse-2, and-3)
conj(depparse-2, parse-4)
amod(properties-6, equivalent-5)
case(tree-11, for-7)
amod(tree-11, building-8)
compound(tree-11, dependency-9)
amod(tree-11, parse-10)
nmod(properties-6, tree-11)
punct(properties-6, ?-12)
So why am I not getting the same output with those two methods? Is it possible to change the latter code to be equivalent to the first, since loading the constituency parser as well makes the parsing very slow? And how would you recommend setting the properties to get the most accurate dependency parse tree?
The constituency parser (parse annotator) and dependency parser (depparse annotator) are actually completely different models and code paths. In one case, we're predicting a constituency tree and converting it to a dependency graph. In the other case, we are running a dependency parser directly. In general, depparse is expected to be faster (O(n) vs O(n^3)) and more accurate at producing dependency trees, but will not produce constituency trees.
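So if you only need the dependency graph, a depparse-only pipeline along the lines of the second snippet is the way to go. A minimal sketch (the annotator list is an assumption chosen to satisfy depparse's requirements, which include pos):
Properties props = new Properties();
// depparse needs tokenize, ssplit and pos; the constituency parse annotator is not required
props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation(text);
pipeline.annotate(document);
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
    System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
}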

How to convince the tokenizer to work correctly with a single sentence

I have the following sentence (only one). When I use the following code to tokenize it and get the index of each word, the tokenizer treats it as two sentences because of the full stop after "approx". How can I solve this problem?
String sentence = "09-Aug-2003 -- On Saturday, 9th August 2003, Daniel and I start with our Enduros approx. 100 kilometers from the confluence point.";
Annotation document = new Annotation(sentence);
pipeline.annotate(document);
for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
    String word = token.get(CoreAnnotations.TextAnnotation.class);
    System.out.println(token.index() + " " + word);
}
For example, the true index of "kilometers" is 20, but according to this code it is 2.
If you add the following to the Properties object that you pass in to the pipeline
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
props.setProperty("ssplit.isOneSentence", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
then it won't split the text into different sentences.
(search 'ssplit' on this page to see all the other options http://nlp.stanford.edu/software/corenlp.shtml)
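Put together with the loop from the question, the fix looks roughly like this (a sketch reusing the sentence variable from above). Since token.index() is 1-based within its sentence, keeping the text as one sentence should make "kilometers" come out as 20:
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
props.setProperty("ssplit.isOneSentence", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation(sentence);
pipeline.annotate(document);
// With one sentence, token.index() runs over the whole text instead of restarting after "approx."
for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
    System.out.println(token.index() + " " + token.get(CoreAnnotations.TextAnnotation.class));
}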

Ignore words for lemmatizer

I would like to use Stanford CoreNLP for lemmatization, but I have some words that should not be lemmatized. Is there a way to provide this ignore list to the tool? I am following this code, and once the program calls this.pipeline.annotate(document), that's it; it would be hard to replace the occurrences afterwards. One solution is to create a mapping in which each word to be ignored is paired with lemmatize(word) (i.e., d = {(w1, lemmatize(w1)), (w2, lemmatize(w2)), ...}) and do the post-processing with this mapping. But it should be easier than this, I guess.
Thanks for the help.
I think I found the solution with my friend's help.
for (CoreMap sentence : sentences) {
    // Iterate over all tokens in a sentence
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        System.out.print(token.get(OriginalTextAnnotation.class) + "\t");
        System.out.println(token.get(LemmaAnnotation.class));
    }
}
You can get the original form of the word by calling token.get(OriginalTextAnnotation.class).
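If you do go the post-processing route from the question instead, a minimal sketch could keep the original text for ignored words and the lemma for everything else (the ignoreWords set and its contents are hypothetical, not part of CoreNLP):
// Hypothetical ignore list: any word in it keeps its original surface form.
Set<String> ignoreWords = new HashSet<>(Arrays.asList("data", "media"));

for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String original = token.get(OriginalTextAnnotation.class);
        String lemma = token.get(LemmaAnnotation.class);
        // Use the original form for ignored words, the lemma otherwise.
        String output = ignoreWords.contains(original.toLowerCase()) ? original : lemma;
        System.out.print(output + " ");
    }
    System.out.println();
}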

Generate HTML documentation for a FreeMarker FTL library

I have a FreeMarker library that I want to ship with my product, and I'm looking for a way to generate HTML documentation for it based on the comments in the FTL file (in a Javadoc fashion).
For example, a typical function in my library is written like:
<#--
MyMacro: Does stuff with param1 and param2.
- param1: The first param, mandatory.
- param2: The second param, 42 if not specified.
-->
<#macro MyMacro param1 param2=42>
...
</#macro>
I didn't find anything on that subject, probably because there is no standard way of writing comments in FreeMarker (such as @param or @return in Javadoc).
I don't mind rolling my own solution for that, but I'm keen on using an existing system like Doxia (since I'm using Maven to build the project) or Doxygen maybe, instead of writing something from scratch.
Ideally I'd like to write the comment parsing code only, and rely on something else to detect the macros and generate the doc structure.
I'm open to changing the format of my comments if that helps.
In case you decide to write your own doc generator or to write an FTL-specific front-end for an existing documentation generator, you can reuse some of FreeMarker's parsing infrastructure:
You can use Template.getRootTreeNode() to retrieve the template's top-level AST node. Because macros and the corresponding comments should be direct children of this top-level node (IIRC), iterating over its children and casting them to the right AST node subclass should give you almost everything you need with respect to FTL syntax. To illustrate the approach, I hacked together a little "demo" (cfg is a normal FreeMarker Configuration object):
Template t = cfg.getTemplate("foo.ftl");
TemplateElement te = t.getRootTreeNode();
Enumeration e = te.children();
while (e.hasMoreElements()) {
    Object child = e.nextElement();
    if (child instanceof Comment) {
        Comment comment = (Comment) child;
        System.out.println("COMMENT: " + comment.getText());
    } else if (child instanceof Macro) {
        Macro macro = (Macro) child;
        System.out.println("MACRO: " + macro.getName());
        for (String argumentName : macro.getArgumentNames()) {
            System.out.println("- PARAM: " + argumentName);
        }
    }
}
produces for your given example macro:
COMMENT:
MyMacro: Does stuff with param1 and param2.
- param1: The first param, mandatory.
- param2: The second param, 42 if not specified.
MACRO: MyMacro
- PARAM: param1
- PARAM: param2
How you parse the comment is then up to you ;-)
Update: Found something called ftldoc in my backups and uploaded it to GitHub. Maybe this is what you are looking for...
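If you do roll your own generator, the comment-parsing step left open above can be as simple as a regex over comment.getText(). A rough sketch (the "- name: description" format it expects is just the convention from the question, not anything FreeMarker mandates):
// Splits a comment into a free-text description and "- param: description" lines.
Pattern paramLine = Pattern.compile("^\\s*-\\s*(\\w+):\\s*(.*)$");
Map<String, String> paramDocs = new LinkedHashMap<>();
StringBuilder description = new StringBuilder();
for (String line : comment.getText().split("\\r?\\n")) {
    Matcher m = paramLine.matcher(line);
    if (m.matches()) {
        paramDocs.put(m.group(1), m.group(2)); // parameter name -> its documentation
    } else if (!line.trim().isEmpty()) {
        description.append(line.trim()).append(' ');
    }
}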

How to parse a list of sentences?

I want to parse a list of sentences with the Stanford NLP parser.
My list is an ArrayList; how can I parse the whole list with LexicalizedParser?
I want to get from each sentence this form:
Tree parse = (Tree) lp1.apply(sentence);
Although one can dig into the documentation, I am going to provide code here on SO, especially since links move and/or die. This particular answer uses the whole pipeline. If you are not interested in the whole pipeline, I will provide an alternative answer in just a second.
The example below is the complete way of using the Stanford pipeline. If you are not interested in coreference resolution, remove dcoref from the 3rd line of code. In the example below, the pipeline does the sentence splitting for you (the ssplit annotator) if you just feed it a body of text (the text variable). Have just one sentence? That is fine; you can feed that in as the text variable.
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// read some text in the text variable
String text = ... // Add your text here!
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);
// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
    }
    // this is the parse tree of the current sentence
    Tree tree = sentence.get(TreeAnnotation.class);
    // this is the Stanford dependency graph of the current sentence
    SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}
// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph =
document.get(CorefChainAnnotation.class);
Actually, the Stanford NLP documentation provides a sample of how to parse sentences.
You can find the documentation here.
So as promised, if you don't want to access the full Stanford pipeline (although I believe that is the recommended approach), you can work with the LexicalizedParser class directly. In this case, you would download the latest version of Stanford Parser (whereas the other would use CoreNLP tools). Make sure that in addition to the parser jar, you have the model file for the appropriate parser you want to work with. Example code:
LexicalizedParser lp1 = LexicalizedParser.loadModel("englishPCFG.ser.gz");
String sentence = "It is a fine day today";
Tree parse = lp1.parse(sentence);
Note this works for version 3.3.1 of the parser.
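To get back to the actual question (an ArrayList of sentences), you can simply loop over the list and parse each string. A minimal sketch building on the snippet above, with illustrative sentences:
List<String> sentences = new ArrayList<>(Arrays.asList(
        "It is a fine day today.",
        "I want to parse every sentence in this list."));

List<Tree> parses = new ArrayList<>();
for (String s : sentences) {
    // parse(String) runs the parser's default tokenizer on the sentence first
    Tree parse = lp1.parse(s);
    parses.add(parse);
    parse.pennPrint();
}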
