how to give flag option in stanford-nlp program? - stanford-nlp

The site suggest that i can use several flags
https://nlp.stanford.edu/software/openie.html
But how to use it, I tried doing it this way
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Collection;
import java.util.Properties;
/**
* A demo illustrating how to call the OpenIE system programmatically.
*/
public class OpenIEDemo {
public static void main(String[] args) throws Exception {
// Create the Stanford CoreNLP pipeline
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
props.setProperty("openieformat","ollie");
props.setProperty("openieresolve_coref","1");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// Annotate an example document.
Annotation doc = new Annotation("Obama was born in Hawaii. He is our president.");
pipeline.annotate(doc);
// Loop over sentences in the document
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
// Get the OpenIE triples for the sentence
Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
// Print the triples
for (RelationTriple triple : triples) {
System.out.println(triple.confidence + "\t" +
triple.subjectLemmaGloss() + "\t" +
triple.relationLemmaGloss() + "\t" +
triple.objectLemmaGloss());
}
}
}
}
I have added
props.setProperty("openieformat","ollie");
props.setProperty("openieresolve_coref","1");
But its not working

For StanfordCoreNLP, flags/properties for individual annotators are set with an annotator.flag name. And boolean flags have value "false" or "true". So, what you have is close to right, but needs to be:
props.setProperty("openie.format","ollie");
props.setProperty("openie.resolve_coref","true");

Related

Stanford CoreNLP Chinese coreference resolution

I want to use stanford coreNLP to process some Chinese Coreference resolution.My code is below:
import java.util.Properties;
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.coref.data.Mention;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
public class CorefTest {
public static void main(String[] args) throws Exception {
StanfordCoreNLP pipline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
Annotation document = new Annotation("奥巴马出生在夏威夷,他是美国总统,他在2008年当选");
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
System.out.println("---");
System.out.println("coref chains");
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
System.out.println("\t" + cc);
}
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
System.out.println("---");
System.out.println("mentions");
for (Mention m : sentence.get(CorefCoreAnnotations.CorefMentionsAnnotation.class)) {
System.out.println("\t" + m);
}
}
}
}
And it give the result :
---
coref chains
---
mentions
奥巴马出生在夏威夷 , 他是美国总统 , 他在2008年当选
Where nothing in the coref chains,I do set the environment right, and it does support chinese , but how can I get coref chains right?
You are making the pipeline twice in your code.
Note: make sure to add this to your code:
import edu.stanford.nlp.util.StringUtils;
You code should be like this:
Properties props = StringUtils.argsToProperties("-props", "StanfordCoreNLP-chinese.properties");
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

How to get protobuf extension field in ProtobufAnnotationSerializer

I am a new to protocol-buffers and try to figure out how to extend a message type in the Stanford CoreNLP library as described here: https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/ProtobufAnnotationSerializer.html
The problem: I can set the extension field but i can't get it. I boiled the problem down to the code below. In the original message the field name is [edu.stanford.nlp.pipeline.myNewField] but is replaced by the field number 101 in the deserialized message.
How can i get the value of myNewField?
PS: This post https://stackoverflow.com/questions/28815214/how-to-set-get-protobufs-extension-field-in-go suggests that it should be as easy as calling getExtension(MyAppProtos.myNewField)
custom.proto
syntax = "proto2";
package edu.stanford.nlp.pipeline;
option java_package = "com.example.my.awesome.nlp.app";
option java_outer_classname = "MyAppProtos";
import "CoreNLP.proto";
extend Sentence {
optional uint32 myNewField = 101;
}
ProtoTest.java
import com.example.my.awesome.nlp.app.MyAppProtos;
import com.google.protobuf.ExtensionRegistry;
import com.google.protobuf.InvalidProtocolBufferException;
import edu.stanford.nlp.pipeline.CoreNLPProtos;
import edu.stanford.nlp.pipeline.CoreNLPProtos.Sentence;
public class ProtoTest {
static {
ExtensionRegistry registry = ExtensionRegistry.newInstance();
registry.add(MyAppProtos.myNewField);
CoreNLPProtos.registerAllExtensions(registry);
}
public static void main(String[] args) throws InvalidProtocolBufferException {
Sentence originalSentence = Sentence.newBuilder()
.setText("Hello world!")
.setTokenOffsetBegin(0)
.setTokenOffsetEnd(12)
.setExtension(MyAppProtos.myNewField, 13)
.build();
System.out.println("Original:\n" + originalSentence);
byte[] serialized = originalSentence.toByteArray();
Sentence deserializedSentence = Sentence.parseFrom(serialized);
System.out.println("Deserialized:\n" + deserializedSentence);
Integer myNewField = deserializedSentence.getExtension(MyAppProtos.myNewField);
System.out.println("MyNewField: " + myNewField);
}
}
Output:
Original:
tokenOffsetBegin: 0
tokenOffsetEnd: 12
text: "Hello world!"
[edu.stanford.nlp.pipeline.myNewField]: 13
Deserialized:
tokenOffsetBegin: 0
tokenOffsetEnd: 12
text: "Hello world!"
101: 13
MyNewField: 0
Update
Because this question was about extending CoreNLP message types and using them with the ProtobufAnnotationSerializer, here is what my extended serializer looks like:
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;
import com.example.my.awesome.nlp.app.MyAppProtos;
import com.google.protobuf.ExtensionRegistry;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.CoreNLPProtos;
import edu.stanford.nlp.pipeline.CoreNLPProtos.Sentence;
import edu.stanford.nlp.pipeline.CoreNLPProtos.Sentence.Builder;
import edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.Pair;
public class MySerializer extends ProtobufAnnotationSerializer {
private static ExtensionRegistry registry;
static {
registry = ExtensionRegistry.newInstance();
registry.add(MyAppProtos.myNewField);
CoreNLPProtos.registerAllExtensions(registry);
}
#Override
protected Builder toProtoBuilder(CoreMap sentence, Set<Class<?>> keysToSerialize) {
keysToSerialize.remove(MyAnnotation.class);
Builder builder = super.toProtoBuilder(sentence, keysToSerialize);
builder.setExtension(MyAppProtos.myNewField, 13);
return builder;
}
#Override
public Pair<Annotation, InputStream> read(InputStream is)
throws IOException, ClassNotFoundException, ClassCastException {
CoreNLPProtos.Document doc = CoreNLPProtos.Document.parseDelimitedFrom(is, registry);
return Pair.makePair(fromProto(doc), is);
}
#Override
protected CoreMap fromProtoNoTokens(Sentence proto) {
CoreMap result = super.fromProtoNoTokens(proto);
result.set(MyAnnotation.class, proto.getExtension(MyAppProtos.myNewField));
return result;
}
}
The mistake was that i didn't provide the parseFrom call with the extension registry.
Changing Sentence deserializedSentence = Sentence.parseFrom(serialized); to Sentence deserializedSentence = Sentence.parseFrom(serialized, registry); did the job!

How to use StanfordNLP Chinese segmentor in Java?

I have tried the following code, however the code does not work and only outputs null.
String text = "我爱北京天安门。";
StanfordCoreNLP pipeline = new StanfordCoreNLP();
Annotation annotation = pipeline.process(text);
String result = annotation.get(CoreAnnotations.ChineseSegAnnotation.class);
System.out.println(result);
The result:
...
done [0.6 sec].
Using mention detector type: rule
null
How to use StanfordNLP Chinese segmentor correctly?
Some sample code:
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;
import java.util.*;
public class ChineseSegmenter {
public static void main (String[] args) {
// set the properties to the standard Chinese pipeline properties
Properties props = StringUtils.argsToProperties("-props", "StanfordCoreNLP-chinese.properties");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String text = "...";
Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);
List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
for (CoreLabel token : tokens)
System.out.println(token);
}
}
Note: Make sure the Chinese models jar is on your CLASSPATH. That file is available here: http://stanfordnlp.github.io/CoreNLP/download.html
The above code should print out the tokens created after the Chinese segmenter is run.

How to split the result of PTBTokenizer into sentences?

I know I could use DocumentPreprocessor to split a text into sentence. But it does not provide enough information if one wants to convert the tokenized text back to the original text. So I have to use PTBTokenizer, which has an invertible option.
However, PTBTokenizer simply returns an iterator of all the tokens (CoreLabels) in a document. It does not split the document into sentences.
The documentation says:
The output of PTBTokenizer can be post-processed to divide a text into sentences.
But this is obviously not trivial.
Is there a class in the Stanford NLP library that can take as input a sequence of CoreLabels, and output sentences? Here's what I mean exactly:
List<List<CoreLabel>> split(List<CoreLabel> documentTokens);
I would suggest you use the StanfordCoreNLP class. Here is some sample code:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.*;
public class PipelineExample {
public static void main (String[] args) throws IOException {
// build pipeline
Properties props = new Properties();
props.setProperty("annotators","tokenize, ssplit, pos");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String text = " I am a sentence. I am another sentence.";
Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);
System.out.println(annotation.get(TextAnnotation.class));
List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
System.out.println(sentence.get(TokensAnnotation.class));
for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
System.out.println(token.after() != null);
System.out.println(token.before() != null);
System.out.println(token.beginPosition());
System.out.println(token.endPosition());
}
}
}
}

incompatible types: Object cannot be converted to CoreLabel

I'm trying to use the Stanford tokenizer with the following example from their website:
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
public class TokenizerDemo {
public static void main(String[] args) throws IOException {
for (String arg : args) {
// option #1: By sentence.
DocumentPreprocessor dp = new DocumentPreprocessor(arg);
for (List sentence : dp) {
System.out.println(sentence);
}
// option #2: By token
PTBTokenizer ptbt = new PTBTokenizer(new FileReader(arg),
new CoreLabelTokenFactory(), "");
for (CoreLabel label; ptbt.hasNext(); ) {
label = ptbt.next();
System.out.println(label);
}
}
}
}
and I get the following error when I try to compile it:
TokenizerDemo.java:24: error: incompatible types: Object cannot be converted to CoreLabel
label = ptbt.next();
Does anyone know what the reason might be? In case you are interested, I'm using Java 1.8 and made sure that CLASSPATH contains the jar file.
Try parameterizing the PTBTokenizer class. For example:
PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(new FileReader(arg),
new CoreLabelTokenFactory(), "");

Resources