I know I could use DocumentPreprocessor to split a text into sentence. But it does not provide enough information if one wants to convert the tokenized text back to the original text. So I have to use PTBTokenizer, which has an invertible option.
However, PTBTokenizer simply returns an iterator of all the tokens (CoreLabels) in a document. It does not split the document into sentences.
The documentation says:
The output of PTBTokenizer can be post-processed to divide a text into sentences.
But this is obviously not trivial.
Is there a class in the Stanford NLP library that can take as input a sequence of CoreLabels, and output sentences? Here's what I mean exactly:
List<List<CoreLabel>> split(List<CoreLabel> documentTokens);
I would suggest you use the StanfordCoreNLP class. Here is some sample code:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.*;
public class PipelineExample {
public static void main (String[] args) throws IOException {
// build pipeline
Properties props = new Properties();
props.setProperty("annotators","tokenize, ssplit, pos");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String text = " I am a sentence. I am another sentence.";
Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);
System.out.println(annotation.get(TextAnnotation.class));
List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
System.out.println(sentence.get(TokensAnnotation.class));
for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
System.out.println(token.after() != null);
System.out.println(token.before() != null);
System.out.println(token.beginPosition());
System.out.println(token.endPosition());
}
}
}
}
Related
The site suggest that i can use several flags
https://nlp.stanford.edu/software/openie.html
But how to use it, I tried doing it this way
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Collection;
import java.util.Properties;
/**
* A demo illustrating how to call the OpenIE system programmatically.
*/
public class OpenIEDemo {
public static void main(String[] args) throws Exception {
// Create the Stanford CoreNLP pipeline
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
props.setProperty("openieformat","ollie");
props.setProperty("openieresolve_coref","1");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// Annotate an example document.
Annotation doc = new Annotation("Obama was born in Hawaii. He is our president.");
pipeline.annotate(doc);
// Loop over sentences in the document
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
// Get the OpenIE triples for the sentence
Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
// Print the triples
for (RelationTriple triple : triples) {
System.out.println(triple.confidence + "\t" +
triple.subjectLemmaGloss() + "\t" +
triple.relationLemmaGloss() + "\t" +
triple.objectLemmaGloss());
}
}
}
}
I have added
props.setProperty("openieformat","ollie");
props.setProperty("openieresolve_coref","1");
But its not working
For StanfordCoreNLP, flags/properties for individual annotators are set with an annotator.flag name. And boolean flags have value "false" or "true". So, what you have is close to right, but needs to be:
props.setProperty("openie.format","ollie");
props.setProperty("openie.resolve_coref","true");
I want to use stanford coreNLP to process some Chinese Coreference resolution.My code is below:
import java.util.Properties;
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.coref.data.Mention;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
public class CorefTest {
public static void main(String[] args) throws Exception {
StanfordCoreNLP pipline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
Annotation document = new Annotation("奥巴马出生在夏威夷,他是美国总统,他在2008年当选");
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
System.out.println("---");
System.out.println("coref chains");
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
System.out.println("\t" + cc);
}
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
System.out.println("---");
System.out.println("mentions");
for (Mention m : sentence.get(CorefCoreAnnotations.CorefMentionsAnnotation.class)) {
System.out.println("\t" + m);
}
}
}
}
And it give the result :
---
coref chains
---
mentions
奥巴马出生在夏威夷 , 他是美国总统 , 他在2008年当选
Where nothing in the coref chains,I do set the environment right, and it does support chinese , but how can I get coref chains right?
You are making the pipeline twice in your code.
Note: make sure to add this to your code:
import edu.stanford.nlp.util.StringUtils;
You code should be like this:
Properties props = StringUtils.argsToProperties("-props", "StanfordCoreNLP-chinese.properties");
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
In the following text:
John said, "There's an elephant outside the window."
Is there a simple way to figure out that the quote "There's an elephant outside the window." belongs to John?
We've just added a module for handling this.
You'll need to get the latest code from GitHub.
Here is some sample code:
package edu.stanford.nlp.examples;
import edu.stanford.nlp.coref.*;
import edu.stanford.nlp.coref.data.*;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.util.*;
import edu.stanford.nlp.pipeline.*;
import java.util.*;
public class QuoteAttributionExample {
public static void main(String[] args) {
Annotation document =
new Annotation("John said, \"There's an elephant outside the window.\"");
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions,quote,quoteattribution");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
for (CoreMap quote : document.get(CoreAnnotations.QuotationsAnnotation.class)) {
System.out.println(quote);
System.out.println(quote.get(QuoteAttributionAnnotator.MentionAnnotation.class));
}
}
}
This is still under development, we will probably add some code to make it easier to get the actual text span that links to the quote soon.
I'm experimenting with Stanford Core NLP for named entity recognition.
Some named entities consist of more than one token, for example, Person: "Bill Smith". I can't figure out what API calls to use to determine when "Bill" and "Smith" should be considered a single entity, and when they should be two different entities.
Is there some decent documentation somewhere which explains this?
Here's my current code:
InputStream is = getClass().getResourceAsStream(MODEL_NAME);
if (MODEL_NAME.endsWith(".gz")) {
is = new GZIPInputStream(is);
}
is = new BufferedInputStream(is);
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(is);
is.close();
String text = "Hello, Bill Smith, how are you?";
List<List<CoreLabel>> sentences = classifier.classify(text);
for (List<CoreLabel> sentence: sentences) {
for (CoreLabel word: sentence) {
String type = word.get(CoreAnnotations.AnswerAnnotation.class);
System.out.println(word + " is of type " + type);
}
}
Also, it isn't clear to me why the "PERSON" annotation is coming back as AnswerAnnotation, instead of CoreAnnotations.EntityClassAnnotation, EntityTypeAnnotation, or something else.
You should use the "entitymentions" annotator, which will mark continuous sequences of tokens with the same ner tag as an entity. The list of entities for each sentence will be stored under the CoreAnnotations.MentionsAnnotation.class key. Each entity mention itself will be a CoreMap.
Looking over this code could help:
https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/EntityMentionsAnnotator.java
some sample code:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;
public class EntityMentionsExample {
public static void main (String[] args) throws IOException {
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String text = "Joe Smith is from Florida.";
Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);
System.out.println("---");
System.out.println("text: " + text);
for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
System.out.print(entityMention.get(CoreAnnotations.TextAnnotation.class));
System.out.print("\t");
System.out.print(
entityMention.get(CoreAnnotations.NamedEntityTagAnnotation.class));
System.out.println();
}
}
}
}
I have tried the following code, however the code does not work and only outputs null.
String text = "我爱北京天安门。";
StanfordCoreNLP pipeline = new StanfordCoreNLP();
Annotation annotation = pipeline.process(text);
String result = annotation.get(CoreAnnotations.ChineseSegAnnotation.class);
System.out.println(result);
The result:
...
done [0.6 sec].
Using mention detector type: rule
null
How to use StanfordNLP Chinese segmentor correctly?
Some sample code:
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;
import java.util.*;
public class ChineseSegmenter {
public static void main (String[] args) {
// set the properties to the standard Chinese pipeline properties
Properties props = StringUtils.argsToProperties("-props", "StanfordCoreNLP-chinese.properties");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String text = "...";
Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);
List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
for (CoreLabel token : tokens)
System.out.println(token);
}
}
Note: Make sure the Chinese models jar is on your CLASSPATH. That file is available here: http://stanfordnlp.github.io/CoreNLP/download.html
The above code should print out the tokens created after the Chinese segmenter is run.