Generating a Stanford SemanticGraph with nodes storing lemmas - stanford-nlp

I am trying to generate a SemanticGraph and use Semgrex to find specific nodes.
I would like to use the lemma as one of the node attributes in Semgrex. I saw a relevant question and answer here:
CoreNLP SemanticGraph - search for edges with specific lemmas
It mentions:
Make sure that the nodes are storing lemmas -- see the lemma annotator of CoreNLP (currently available for English, only).
I can currently use a pipeline to generate the desired annotations and build the semantic graph.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
However, after searching for relevant information, I could only find a deprecated function here:
https://github.com/chbrown/nlp/blob/master/src/main/java/edu/stanford/nlp/pipeline/ParserAnnotatorUtils.java
public static SemanticGraph generateDependencies(Tree tree,
                                                 boolean collapse,
                                                 boolean ccProcess,
                                                 boolean includeExtras,
                                                 boolean lemmatize,
                                                 boolean threadSafe) {
  SemanticGraph deps = SemanticGraphFactory.makeFromTree(tree, collapse, ccProcess, includeExtras, lemmatize, threadSafe);
  return deps;
}
which seems to have been removed from the newest CoreNLP.
Could anyone give me a hint on how to generate the semantic graph with nodes that store the lemmas?

The function as you have it should work. The lemmas will be stored in the SemanticGraph produced by the parse annotator. You can retrieve the graph after running your text through pipeline.annotate(Annotation) with:
sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class)
Or, the equivalent for Collapsed and CollapsedCCProcessed dependencies. See http://www-nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/SemanticGraphCoreAnnotations.html. Note that these are attached to a sentence (CoreMap), not a document (Annotation).
You can then run Semgrex over the graph as usual; e.g., {lemma:/foo/} >arctype {lemma:/bar/}.
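For concreteness, here is a minimal sketch of that retrieval-plus-Semgrex flow, assuming the pipeline from the question (tokenize, ssplit, pos, lemma, parse). The example sentence, the dobj relation, and the lemmas in the pattern are only illustrative; the relation name in particular depends on which dependency representation your CoreNLP version produces.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher;
import edu.stanford.nlp.semgraph.semgrex.SemgrexPattern;
import edu.stanford.nlp.util.CoreMap;

Annotation annotation = new Annotation("The cats were chasing the mice.");
pipeline.annotate(annotation);
for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
  // the nodes carry lemmas because the lemma annotator ran before parse
  SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
  SemgrexPattern pattern = SemgrexPattern.compile("{lemma:chase} >dobj {lemma:mouse}");
  SemgrexMatcher matcher = pattern.matcher(graph);
  while (matcher.find()) {
    // prints the lemma of the matched governor node
    System.out.println(matcher.getMatch().lemma());
  }
}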

Related

Can I get an entityMention from the result of a TokensRegex match in Stanford CoreNLP?

I want to add addresses (and possibly other rule-based entities) to an NER pipeline, and TokensRegex seems like a terribly useful DSL for doing so. Following https://stackoverflow.com/a/42604225, I've created this rules file:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "address") }
Here's a scala repl session showing how I'm trying to set up an annotation pipeline.
# import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}
# import edu.stanford.nlp.util.PropertiesUtils.asProperties
# val pipe = new StanfordCoreNLP(asProperties(
    "customAnnotatorClass.tokensregex", "edu.stanford.nlp.pipeline.TokensRegexAnnotator",
    "annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex",
    "ner.combinationMode", "HIGH_RECALL",
    "tokensregex.rules", "addresses.tregx"))
pipe: StanfordCoreNLP = edu.stanford.nlp.pipeline.StanfordCoreNLP@2ce6a051
# val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
doc: CoreDocument = Adam Smith lived at 123 noun street in Glasgow, Scotland
# pipe.annotate(doc)
# doc.sentences.get(0).nerTags
res5: java.util.List[String] = [PERSON, PERSON, O, O, address, address, address, O, CITY, O, COUNTRY]
# doc.entityMentions
res6: java.util.List[edu.stanford.nlp.pipeline.CoreEntityMention] = [Adam Smith, 123, Glasgow, Scotland]
As you can see, the address gets correctly tagged in the nerTags for the sentence, but it doesn't show up in the document's entityMentions. Is there a way to do this?
Also, is there a way from the document to discern two adjacent matches of the tokensregex from a single match (assuming I have a more complicated set of regexes; in the current example I only match exactly 3 tokens, so I could just count tokens)?
I tried approaching it using the regexner with a tokens regex described here https://stanfordnlp.github.io/CoreNLP/regexner.html, but I couldn't seem to get that working.
Since I'm working in scala I'll be happy to dive into the Java API to get this to work, rather than fiddle with properties and resource files, if that's necessary.
Yes, I've recently added some changes (in the GitHub version) to make this easier! Make sure to download the latest version from GitHub. Though we are aiming to release Stanford CoreNLP 3.9.2 fairly soon and it will have these changes.
If you read this page you can get an understanding of the full NER pipeline run by the NERCombinerAnnotator.
https://stanfordnlp.github.io/CoreNLP/ner.html
Furthermore, there is a lot of write-up on TokensRegex here:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
Basically what you want to do is run the ner annotator and use its TokensRegex sub-annotator. Imagine you have some named entity rules in a file called my_ner.rules.
You could run a command like this:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules my_ner.rules -outputFormat text -file example.txt
This will run a TokensRegex sub-annotator during the full named entity recognition process. Then, when the final entity-mention step runs, it will operate on the rule-extracted named entities and create entity mentions from them.
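If you would rather do this from the Java/Scala API than the command line, the same property can be set programmatically. A minimal sketch follows; the file name my_ner.rules and the example sentence come from above, and the property name assumes the GitHub version described earlier.
import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
// run the TokensRegex rules as a sub-annotator of ner, so the entity-mention
// step sees the rule-extracted tags and builds mentions from them
props.setProperty("ner.additional.tokensregex.rules", "my_ner.rules");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

CoreDocument doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland");
pipeline.annotate(doc);
System.out.println(doc.entityMentions());  // should now include the address span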

Stanford CoreNLP: output in CONLL format from Java

I want to parse some German text with Stanford CoreNLP and obtain a CONLL output, so that I can pass the latter to CorZu for coreference resolution.
How can I do that programmatically?
Here is my code so far (which only outputs dependency trees):
Annotation germanAnnotation = new Annotation("Gestern habe ich eine blonde Frau getroffen");
Properties germanProperties = StringUtils.argsToProperties("-props", "StanfordCoreNLP-german.properties");
StanfordCoreNLP pipeline = new StanfordCoreNLP(germanProperties);
pipeline.annotate(germanAnnotation);

StringBuilder trees = new StringBuilder("");
for (CoreMap sentence : germanAnnotation.get(CoreAnnotations.SentencesAnnotation.class)) {
  Tree sentenceTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
  trees.append(sentenceTree).append("\n");
}
With the following code I managed to save the parsing output in CONLL format.
OutputStream outputStream = new FileOutputStream(new File("./target/", OUTPUT_FILE_NAME));
CoNLLOutputter.conllPrint(germanAnnotation, outputStream, pipeline);
However, the HEAD field was 0 for all words. I am not sure whether there is a problem in parsing or only in the CONLLOutputter. Honestly, I was too annoyed by CoreNLP to investigate further.
I decided to use ParZu instead, and I suggest you do the same. ParZu and CorZu are made to work together seamlessly - and they do.
In my case, I had an already tokenized and POS tagged text. This makes things easier, since you will not need:
A POS-Tagger using the STTS tagset
A tool for morphological analysis
Once you have ParZu and CorZu installed, you will only need to run corzu.sh (included in the CorZu download folder). If your text is tokenized and POS tagged, you can edit the script accordingly:
parzu_cmd="/YourPath/ParZu/parzu -i tagged"
Last note: make sure to convert your tagged text to the following format, with empty lines signifying sentence boundaries: word [tab] tag [newline]
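For illustration, the example sentence from the question would look roughly like this in that format (the STTS tags here are my own guess, not tagger output; columns are separated by a tab, and the empty line marks the sentence boundary):
Gestern	ADV
habe	VAFIN
ich	PPER
eine	ART
blonde	ADJA
Frau	NN
getroffen	VVPP
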

How to get CoreAnnotations.CoNLLDepAnnotation and CoreAnnotations.GovernorAnnotation as annotators through StanfordCorenlp pipeline?

There are standard names for annotators (like tokenize, ssplit, pos), but I am not sure what name should be specified for the CoNLLDepAnnotations and GovernorAnnotations, or what other annotators these depend on.
The dependency annotations require the parse annotator. All the dependency annotations you mentioned are produced by that annotator. The code below prints the semantic graph of each sentence in a list format.
for (CoreMap sentenceAnnotation : art.sentences) {
  SemanticGraph deps = sentenceAnnotation.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
  System.out.println(deps.toList());
}
For example, the output for the sentence "Apple even went as far to make an electric guitar version." will be:
root(ROOT-0, went-3)
nsubj(went-3, Apple-1)
advmod(went-3, even-2)
advmod(far-5, as-4)
advmod(went-3, far-5)
mark(make-7, to-6)
xcomp(went-3, make-7)
det(version-11, an-8)
amod(version-11, electric-9)
compound(version-11, guitar-10)
dobj(make-7, version-11)
punct(went-3, .-12)
where for the first token Apple the relation is nsubj and the governor is the third token went.
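If you need the governor and relation per token rather than the flat list, one option (a sketch, reusing the same parse-annotated sentences and the art.sentences loop from above) is to walk the graph's edges directly:
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;

for (CoreMap sentenceAnnotation : art.sentences) {
  SemanticGraph deps = sentenceAnnotation.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
  for (SemanticGraphEdge edge : deps.edgeIterable()) {
    // e.g. "Apple <- nsubj <- went (governor index 3)"
    System.out.println(edge.getDependent().word()
        + " <- " + edge.getRelation()
        + " <- " + edge.getGovernor().word()
        + " (governor index " + edge.getGovernor().index() + ")");
  }
}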

Resolving how square brackets vs parenthesis are conceptually handled in pipeline

When I parse a sentence that contains left and right square brackets and keep the default normalizeOtherBrackets value of true, the parse is much, much slower than (and different from) the parse of the same sentence with left and right parentheses (I would say 20 seconds vs 3 seconds for the ParserAnnotator to run). If this property is instead set to false, then the parse times for brackets vs parentheses are pretty comparable, but the parse trees are still very different. With a value of true, the POS of the bracket tokens is -LRB-, whereas with false it is CD, but in each case the general substructure of the tree is the same.
In the case of my corpus, the brackets are overwhelmingly meant to "clarify the antecedent" as described on this site. However, the PRN phrase-level label exists for parentheses and not for square brackets, so the formation of the tree is inherently different even though they serve close to the same function in the sentence.
So, can someone explain why the parse times are so different, and what can be done to get a proper parse? Obviously, a simplistic approach would be to replace brackets with parens, but that does not seem like a satisfying solution. Are there any settings that can provide me with some relief? Here is my code:
private void execute() {
  Properties props = new Properties();
  props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
  props.setProperty("tokenize.options", "normalizeOtherBrackets=false");
  StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

  // create an empty Annotation just with the given text
  Annotation document = new Annotation(text);

  // run all Annotators on this text
  pipeline.annotate(document);

  // these are all the sentences in this document
  // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
  List<CoreMap> sentences = document.get(SentencesAnnotation.class);

  long start = System.nanoTime();
  ParserAnnotator pa = new ParserAnnotator(DefaultPaths.DEFAULT_PARSER_MODEL, false, 60, new String[0]);
  pa.annotate(document);
  long end = System.nanoTime();
  System.out.println((end - start) / 1000000 + " ms");

  for (CoreMap sentence : sentences) {
    System.out.println(sentence);
    Tree cm = sentence.get(TreeAnnotation.class);
    cm.pennPrint();
    List<CoreLabel> cm2 = sentence.get(TokensAnnotation.class);
    for (CoreLabel c : cm2) {
      System.out.println(c);
    }
  }
}
Yes, this is a known (and very tricky) issue, and unfortunately there is no perfect solution for it.
The problem with such sentences is mainly caused by the fact that these parenthetical constructions don't appear in the training data of the parser and by the way the part-of-speech tagger and the parser interact.
If you run the tokenizer with the option normalizeOtherBrackets=true, then the square brackets will be replaced with -LSB- and -RSB-. Now in most (or maybe even all) of the training examples for the part-of-speech tagger, -LSB- has the POS tag -LRB- and -RSB- has the tag -RRB-. Therefore the POS tagger, despite taking context into account, basically always assigns these tags to square brackets.
Now if you run the POS-tagger before the parser, then the parser operates in delexicalized mode which means it tries to parse the sentence based on the POS tags of the individual words instead of based on the words themselves. Assuming that your sentence will have a POS sequence similar to -LRB- NN -RRB- the parser tries to find production rules so that it can parse this sequence. However, the parser's training data doesn't contain such sequences and therefore there aren't any such production rules, and consequently the parser is unable to parse a sentence with this POS sequence.
When this happens, the parser goes into fallback mode which relaxes the constraints on the part-of-speech tags by assigning a very low probability to all the other possible POS tags of every word so that it can eventually parse the sentence according to the production rules of the parser. This step, however, requires a lot of computation and therefore it takes so much longer to parse such sentences.
On the other hand, if you set normalizeOtherBrackets to false, then the tokenizer won't translate square brackets to -LSB- and -RSB-, and the POS tagger will assign tags to the original [ and ]. However, our training data was preprocessed such that all square brackets were replaced, so [ and ] are treated as unknown words and the POS tagger assigns tags purely based on context, which most likely results in a POS sequence that is part of the training data of our parser and tagger. This in turn means that the parser is able to parse the sentence without entering recovery mode, which is a lot faster.
If your data contains a lot of sentences with such constructions, I would recommend that you don't run the POS tagger before the parser (by removing pos from the list of annotators). This should give you faster but still reliable parses. Otherwise, if you have access to a treebank, you could also manually add a few trees with these constructions and train a new parsing model. I would not set normalizeOtherBrackets to false, because then your tokenized input will differ from the training data of the parser and the tagger, which will potentially give you worse results.
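As a rough illustration of that suggestion (my own sketch, not part of the answer above), dropping pos from the annotator list so that the parser assigns its own tags might look like this; lemma and ner are omitted here since they normally depend on the pos annotator:
Properties props = new Properties();
// no explicit pos annotator: the parse annotator tags while it parses,
// so the bracket tokens are not forced into -LRB-/-RRB- tags up front
props.setProperty("annotators", "tokenize, ssplit, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);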

How do I define a SemgrexPattern in Stanford API, to match nodes' text without using {lemma:...}?

I am working with the edu.stanford.nlp.semgrex and edu.stanford.nlp.trees.semgraph packages and am looking for a way to match nodes on their text using something other than the lemma: directive.
I couldn't find all possible attribute names in the javadoc for SemgrexPattern, only those for lemma, tag, and the relational operators - is there a comprehensive list available?
For example, in the following sentence
My take-home pay is $20.
extracting the 'take-home' node is not possible using
(SemgrexPattern.compile("{lemma:take-home}"))
    .matcher("My take-home pay is $20.").find()
which yields false, because take-home is deemed not to be a lemma.
What do I need to do to match nodes with non-lemma, arbitrary text?
Thanks for any advice or comment.
Sorry - I realize that {word:take-home} would work in the example above.
Thanks..
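For anyone landing here later, a minimal sketch of the working variant, assuming a pipeline with tokenize, ssplit, pos, lemma, parse has already produced the sentence's SemanticGraph (note that matcher() takes a SemanticGraph, not a raw String):
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher;
import edu.stanford.nlp.semgraph.semgrex.SemgrexPattern;

SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
SemgrexPattern pattern = SemgrexPattern.compile("{word:/take-home/}");
SemgrexMatcher matcher = pattern.matcher(graph);
while (matcher.find()) {
  // prints "take-home" if the tokenizer kept it as a single token/node
  System.out.println(matcher.getMatch().word());
}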