I'm beginning to learn the Stanford CoreNLP Java API, and am trying to print the syntax tree of a sentence. The syntax tree is supposed to be generated by the ParserAnnotator. In my code (posted below), the ParserAnnotator runs without errors but doesn't generate anything. The error only shows up when the code tries to get the label of the tree's root node, and the tree is revealed to be null. The components that run before it generate their annotations without any problems.
There was one other person on SO who had a problem with the ParserAnnotator, but the issue was with memory. I've increased the memory that I allow Eclipse to use, but the behavior is the same. Running the code in the debugger also did not yield any errors.
Some background information: The sentence I used was "This is a random sentence." I recently upgraded from Windows 8.1 to Windows 10.
public static void main(String[] args){
String sentence = "This is a random sentence.";
Annotation doc = initStanford(sentence);
Tree syntaxTree = doc.get(TreeAnnotation.class);
printTreePreorder(syntaxTree);
}
private static Annotation initStanford(String sentence){
StanfordCoreNLP pipeline = pipeline("tokenize, ssplit, parse");
Annotation document = new Annotation(sentence);
pipeline.annotate(document);
return document;
}
private static StanfordCoreNLP pipeline(String components){
Properties props = new Properties();
props.put("annotators", components);
return new StanfordCoreNLP(props);
}
public static void printTreePreorder(Tree tree){
System.out.println(tree.label());
for(int i = 0;i < tree.numChildren();i++){
printTreePreorder(tree.getChild(i));
}
}
You're trying to get the tree off of the document (Annotation), rather than off the sentences (CoreMap). You can get the tree of the first sentence with:
Tree tree = doc.get(SentencesAnnotation.class).get(0).get(TreeAnnotation.class)
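Putting that together with your helper methods, a minimal sketch of the corrected main (assuming the same imports as in your snippet, plus CoreMap) would be:
public static void main(String[] args){
    String sentence = "This is a random sentence.";
    Annotation doc = initStanford(sentence);
    // The parse tree is stored on each sentence (CoreMap), not on the document Annotation
    CoreMap firstSentence = doc.get(SentencesAnnotation.class).get(0);
    Tree syntaxTree = firstSentence.get(TreeAnnotation.class);
    printTreePreorder(syntaxTree);
}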
I can also shamelessly plug the Simple CoreNLP API:
Tree tree = new Sentence("this is a sentence").parse()
Ok, so this is what's going on. I'm trying to learn how to use VS Code (switching over from jGRASP), and I'm trying to run an old school assignment that requires the use of outside .txt files. The .txt files, as well as the other classes that I have written, are in the same folder. When I try to run this program in jGRASP, it works fine, but in VS Code I get an exception. Not sure what is going wrong here. Thanks. Here is an example:
import java.io.*;
import java.util.*; // for Scanner, List and ArrayList
public class HangmanMain {
public static final String DICTIONARY_FILE = "dictionary.txt";
public static final boolean SHOW_COUNT = true; // show # of choices left
public static void main(String[] args) throws FileNotFoundException {
System.out.println("Welcome to the cse143 hangman game.");
System.out.println();
// open the dictionary file and read dictionary into an ArrayList
Scanner input = new Scanner(new File(DICTIONARY_FILE));
List<String> dictionary = new ArrayList<String>();
while (input.hasNext()) {
dictionary.add(input.next().toLowerCase());
}
// set basic parameters
Scanner console = new Scanner(System.in);
System.out.print("What length word do you want to use? ");
int length = console.nextInt();
System.out.print("How many wrong answers allowed? ");
int max = console.nextInt();
System.out.println();
//The rest of the program is not shown. This was included just so you guys could see a little bit of it.
If you're not using a project, jGRASP makes the working directory for your program the same one that contains the source file. You are creating the file with a relative path, so it is assumed to be in the working directory. You can print new File(DICTIONARY_FILE).getAbsolutePath() to see where VSCode is looking (probably a separate "classes" directory) and move your data file there, or use an absolute path.
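For example, two throwaway lines at the top of main will show where the relative path resolves (purely diagnostic, using nothing beyond java.io):
// Diagnostic only: print the working directory and the resolved dictionary path
System.out.println("Working directory: " + System.getProperty("user.dir"));
System.out.println("Dictionary resolves to: " + new File(DICTIONARY_FILE).getAbsolutePath());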
I am currently writing my bachelor thesis and I am using the POI Event API from Apache. In short, my work is about a more efficient way to read data from Excel.
Developers keep asking me what exactly is meant by the Event API, and unfortunately I can't find anything on the Apache page about the principle behind it.
Here is the code showing how I use the POI Event API (it is taken from the Apache example for XSSF and SAX):
import java.io.InputStream;
import java.util.Iterator;
import org.apache.poi.ooxml.util.SAXHelper;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.ParserConfigurationException;
public class ExampleEventUserModel {
public void processOneSheet(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
// To look up the Sheet Name / Sheet Order / rID,
// you need to process the core Workbook stream.
// Normally it's of the form rId# or rSheet#
InputStream sheet2 = r.getSheet("rId2");
InputSource sheetSource = new InputSource(sheet2);
parser.parse(sheetSource);
sheet2.close();
}
public void processAllSheets(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
Iterator<InputStream> sheets = r.getSheetsData();
while(sheets.hasNext()) {
System.out.println("Processing new sheet:\n");
InputStream sheet = sheets.next();
InputSource sheetSource = new InputSource(sheet);
parser.parse(sheetSource);
sheet.close();
System.out.println("");
}
}
public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException, ParserConfigurationException {
XMLReader parser = SAXHelper.newXMLReader();
ContentHandler handler = new SheetHandler(sst);
parser.setContentHandler(handler);
return parser;
}
/**
* See org.xml.sax.helpers.DefaultHandler javadocs
*/
private static class SheetHandler extends DefaultHandler {
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
if(name.equals("c")) {
// Print the cell reference
System.out.print(attributes.getValue("r") + " - ");
// Figure out if the value is an index in the SST
String cellType = attributes.getValue("t");
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
int idx = Integer.parseInt(lastContents);
lastContents = sst.getItemAt(idx).getString();
nextIsString = false;
}
// v => contents of a cell
// Output after we've seen the string contents
if(name.equals("v")) {
System.out.println(lastContents);
}
}
public void characters(char[] ch, int start, int length) {
lastContents += new String(ch, start, length);
}
}
public static void main(String[] args) throws Exception {
ExampleEventUserModel example = new ExampleEventUserModel();
example.processOneSheet(args[0]);
example.processAllSheets(args[0]);
}
}
Can someone please explain to me how the Event API works? Is it the same as the event-based architecture or is it something else?
A *.xlsx file, which is an Excel workbook stored in Office Open XML format and is what Apache POI handles as XSSF, is a ZIP archive containing the data as XML files within a directory structure. So we can unzip the *.xlsx file and get the data directly from those XML files.
There is /xl/sharedStrings.xml, which holds all the string cell values; /xl/workbook.xml, which describes the workbook structure; /xl/worksheets/sheet1.xml, /xl/worksheets/sheet2.xml, ..., which store the sheets' data; and /xl/styles.xml, which holds the style settings for all cells in the sheets.
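To see that structure for yourself, a small sketch with plain java.util.zip is enough (the file name ExcelExample.xlsx is just a placeholder):
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
public class ListXlsxParts {
    public static void main(String[] args) throws Exception {
        // A *.xlsx file is an ordinary ZIP archive; list its entries
        try (ZipFile zip = new ZipFile("ExcelExample.xlsx")) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                System.out.println(entries.nextElement().getName());
            }
        }
    }
}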
By default, when creating a XSSFWorkbook, all those parts of the *.xlsx file become object representations in memory: XSSFWorkbook, XSSFSheet, XSSFRow, XSSFCell, ... and further objects of org.apache.poi.xssf.*.*.
To get an impression of how memory-consuming XSSFSheet, XSSFRow and XSSFCell are, a look into the sources is instructive. Each of those objects contains multiple Lists and Maps as internal members, and of course multiple methods too. Now imagine a sheet having hundreds of thousands of rows, each containing up to hundreds of cells. Each of those rows and cells will be represented by a XSSFRow or a XSSFCell in memory. This is not a criticism of Apache POI, because those objects are necessary if you actually need to work with them. But if the need is really only to get the content out of the Excel sheet, then those objects are not all necessary. That's why there is the XSSF and SAX (Event API) approach.
So if the need is only reading data from sheets, one can simply parse the XML of all the /xl/worksheets/sheet[n].xml files without creating memory-consuming objects for every sheet, every row and every cell in those sheets.
Parsing XML in event-based mode means that the code runs top down through the XML and has callback methods defined which get called when it detects the start of an element, the end of an element, or character content within an element. The appropriate callback methods then handle what to do at the start or end of an element or with its character content. So reading the XML file only means running through it once from top to bottom, handling the events (start, end, character content of an element), and thereby getting all the needed content out of it. Memory consumption is reduced to storing the text data taken from the XML.
The XSSF and SAX (Event API) example uses the class SheetHandler, which extends DefaultHandler, for exactly this.
But if we are already at the level of getting at the underlying XML data and processing it, then we could go one step further. Native Java is able to handle ZIP archives and parse XML, so we would not even need additional libraries at all. See "how read excel file having more than 100000 row in java?", where I have shown this. My code there uses the package javax.xml.stream, which also provides an event-based XMLEventReader, but with linear code instead of callbacks. Maybe that code is simpler to understand because it is all in one place.
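As a rough illustration of that plain-Java idea (not the full code from the linked answer), the following sketch uses XMLEventReader to print the raw <v> values of the first sheet; the file name is a placeholder, and shared-string cells would still need to be resolved against /xl/sharedStrings.xml:
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;
public class StaxSheetReader {
    public static void main(String[] args) throws Exception {
        try (ZipFile zip = new ZipFile("ExcelExample.xlsx")) {
            ZipEntry sheet = zip.getEntry("xl/worksheets/sheet1.xml");
            XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(zip.getInputStream(sheet));
            boolean inValue = false;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "v".equals(event.asStartElement().getName().getLocalPart())) {
                    inValue = true; // <v> holds the raw cell value (or a shared-string index)
                } else if (event.isEndElement()
                        && "v".equals(event.asEndElement().getName().getLocalPart())) {
                    inValue = false;
                } else if (inValue && event.isCharacters()) {
                    System.out.println(event.asCharacters().getData());
                }
            }
        }
    }
}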
For detecting whether a number format is a date format, and so whether the formatted cell contains a date/time value, a single Apache POI class, org.apache.poi.ss.usermodel.DateUtil, is used. This is done to simplify the code; of course, even this class we could have coded ourselves.
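For example, given a cell style's numFmtId and number format string read from /xl/styles.xml, the check could look roughly like this (the method and variable names are only illustrative):
// Hypothetical helper: numFmtId and formatString come from the cell's style in /xl/styles.xml
static boolean isDateFormatted(int numFmtId, String formatString) {
    return org.apache.poi.ss.usermodel.DateUtil.isADateFormat(numFmtId, formatString);
}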
I'm trying to run the code found here: https://stanfordnlp.github.io/CoreNLP/coref.html
public class CorefExample {
public static void main(String[] args) throws Exception {
Annotation document = new Annotation("Barack Obama was born in Hawaii. He is the president. Obama was elected in 2008.");
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
System.out.println("---");
System.out.println("coref chains");
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
System.out.println("\t" + cc);
}
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
System.out.println("---");
System.out.println("mentions");
for (Mention m : sentence.get(CorefCoreAnnotations.CorefMentionsAnnotation.class)) {
System.out.println("\t" + m);
}
}
}
}
However, these three required imports aren't found for me:
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.coref.data.Mention;
I could use these imports instead:
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.Mention;
But then an annotation is missing, specifically:
CorefCoreAnnotations.CorefMentionsAnnotation.class
Additionally, document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values() returns null...
I think the problem is that I am using CoreNLP version 3.6.0. This tutorial is for 3.7.0 I believe. Is there a similar example that uses version 3.6.0? If not, what changes do I need to make? I have a large pipeline set up and I'm not sure how hard it would be to upgrade.
Thanks for any help!
Hi, I would recommend just upgrading to Stanford CoreNLP 3.7.0; it should not cause too many things to break.
One of the main changes is that we created a new package named edu.stanford.nlp.coref and put the code from edu.stanford.nlp.hcoref into it.
For the most part things should be the same though if you upgrade.
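If you do need to stay on 3.6.0 for now, then given that rename the same classes should be available under the old package name, i.e. something like the following imports (this is inferred from the rename above, so double-check against your 3.6.0 jar):
import edu.stanford.nlp.hcoref.CorefCoreAnnotations;
import edu.stanford.nlp.hcoref.data.CorefChain;
import edu.stanford.nlp.hcoref.data.Mention;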
I use the Stanford CoreNLP API; my code is below:
public class StanfordCoreNLPTool {
public static StanfordCoreNLPTool instance;
private Annotation annotation = new Annotation();
private Properties props = new Properties();
private PrintWriter out = new PrintWriter(System.out);
private StanfordCoreNLPTool(){}
public void startPipeLine(String question){
props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse,mention,
dcoref, sentiment");
annotation = new Annotation(question);
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// run all the selected Annotators on this text
pipeline.annotate(annotation);
pipeline.prettyPrint(annotation, out);
}
public static StanfordCoreNLPTool getInstance() {
if(instance == null){
synchronized (StanfordCoreNLPTool.class){
if(instance == null){
instance = new StanfordCoreNLPTool();
}
}
}
return instance;
}
}
It works fine, but it takes a lot of time. Consider that we are using it in a question answering system, so for every new input the pipeline annotation must be run.
As you know, each time some rules have to be fetched, some models loaded, and so on, before a sentence can be tagged with NLP labels such as POS, NER and the like.
First of all, I wanted to solve the problem with RMI and EJB, but that failed because, regardless of the Java architecture, the pipeline annotation is still set up from scratch for every new sentence.
Look at this log that is printed in my IntelliJ console:
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [6.1 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [8.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [8.7 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [5.0 sec].
INFO: Read 25 rules
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [4.1 sec].
Please help me find a solution to make the program faster.
The big problem here is that your startPipeLine() method combines two things that should be separated:
Constructing an annotation pipeline, including loading large annotator models
Annotating a particular piece of text (a question) and printing the result
For making things fast, these two things must be separated, as loading an annotation pipeline is slow but should be done only once, while annotating each piece of text is reasonably fast.
The rest is all minor details but FWIW:
All the fanciness of the double-checked locking used to construct a StanfordCoreNLPTool singleton achieves nothing if you then build a new StanfordCoreNLP pipeline for every piece of text you annotate. You should construct an annotation pipeline only once, and it might be reasonable to do that as a singleton, but it's probably sufficient to do the initialization in the constructor once you distinguish pipeline construction from text annotation.
The annotation variable can and should be private to the method that annotates one piece of text.
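A minimal sketch of what that separation could look like, keeping the singleton but building the pipeline only once in the constructor (the method name annotateAndPrint is just illustrative):
import java.io.PrintWriter;
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
public class StanfordCoreNLPTool {
    private static StanfordCoreNLPTool instance;
    // Built exactly once: loading the models here is the slow part
    private final StanfordCoreNLP pipeline;
    private StanfordCoreNLPTool() {
        Properties props = new Properties();
        props.put("annotators",
                "tokenize, ssplit, pos, lemma, ner, parse, mention, dcoref, sentiment");
        pipeline = new StanfordCoreNLP(props);
    }
    public static synchronized StanfordCoreNLPTool getInstance() {
        if (instance == null) {
            instance = new StanfordCoreNLPTool();
        }
        return instance;
    }
    // Cheap per call: annotates one piece of text with the already-loaded pipeline
    public void annotateAndPrint(String question) {
        Annotation annotation = new Annotation(question);
        pipeline.annotate(annotation);
        pipeline.prettyPrint(annotation, new PrintWriter(System.out, true));
    }
}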
If after these changes you still want model loading to be faster, you could buy an SSD! (My 2012 MBP with an SSD loads models more than 5 times faster than you are reporting.)
If you want to further speed annotation, the main tool is to cut out annotators you don't need or to choose faster versions. E.g., if you're not using coreference, you could delete mention and dcoref.
Is there a way to call the Stanford parser from the command line so that it parses one sentence at a time and, in case of trouble with a specific sentence, just moves on to the next one?
UPDATE:
I have been adapting the script posted by StanfordNLP Help below. However, I noticed that with the latest version of CoreNLP (2015-04-20) there are problems with the CCprocessed dependencies: collapsing just does not appear to take place (if I grep for prep_ in the output, I find nothing).
Collapsing does work with the 2015-04-20 release and the PCFG parser, for example, so I assume the issue is model-specific.
If I use the very same Java class with CoreNLP 2015-01-29 (with depparse.model changed to parse.model and the original-dependencies part removed), collapsing works just fine. Maybe I am just using the parser in the wrong way; that's why I am re-posting here rather than starting a new post. Here is the updated code of the class:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.util.*;
public class StanfordSafeLineExample {
public static void main (String[] args) throws IOException {
// build pipeline
Properties props = new Properties();
props.setProperty("annotators","tokenize, ssplit, pos, lemma, depparse");
props.setProperty("ssplit.eolonly","true");
props.setProperty("tokenize.whitespace","false");
props.setProperty("depparse.model", "edu/stanford/nlp/models/parser/nndep/english_SD.gz");
props.setProperty("parse.originalDependencies", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// open file
BufferedReader br = new BufferedReader(new FileReader(args[0]));
// go through each sentence
for (String line = br.readLine() ; line != null ; line = br.readLine()) {
try {
Annotation annotation = new Annotation(line);
pipeline.annotate(annotation);
ArrayList<String> edges = new ArrayList<String>();
CoreMap sentence = annotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);
System.out.println("sentence: "+line);
for (CoreLabel token: annotation.get(CoreAnnotations.TokensAnnotation.class)) {
Integer identifier = token.get(CoreAnnotations.IndexAnnotation.class);
String word = token.get(CoreAnnotations.TextAnnotation.class);
String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
System.out.println(identifier+"\t"+word+"\t"+pos+"\t"+lemma);
}
SemanticGraph tree = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
SemanticGraph tree2 = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
System.out.println("---BASIC");
System.out.println(tree.toString(SemanticGraph.OutputFormat.READABLE));
System.out.println("---CCPROCESSED---");
System.out.println(tree2.toString(SemanticGraph.OutputFormat.READABLE)+"</s>");
} catch (Exception e) {
System.out.println("Error with this sentence: "+line);
System.out.println("");
}
}
}
}
There are many ways to handle this.
The way I'd do it is to run the Stanford CoreNLP pipeline.
Here is where you can get the appropriate jar:
http://nlp.stanford.edu/software/corenlp.shtml
After you cd into the directory stanford-corenlp-full-2015-04-20, you can issue this command:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse -ssplit.eolonly -outputFormat text -file sample_sentences.txt
sample_sentences.txt would contain the sentences you want to parse, one sentence per line.
This will put the results in sample_sentences.txt.out, which you can extract with some light scripting.
If you change -outputFormat to json instead of text, you will get JSON which you can easily load and get the parses from.
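For example, only the -outputFormat flag changes:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse -ssplit.eolonly -outputFormat json -file sample_sentences.txt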
If you have any issues with this approach let me know and I can modify the answer to further assist you/clarify!
UPDATE:
I am not sure the exact way you are running things, but these options could be helpful.
If you use -fileList to run the pipeline on a list of files rather than on a single file, and then add the flag -continueOnAnnotateError, it should just skip a bad file, which is progress, though admittedly not quite skipping just the bad sentence.
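For instance, assuming file_list.txt contains one input file path per line (the file name is just an example), the invocation could look like this:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse -ssplit.eolonly -continueOnAnnotateError -outputFormat text -fileList file_list.txt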
I wrote some Java for doing exactly what you need, so I'll try to post that in the next 24 hours if you just want to use my whipped-together Java code; I'm still looking it over...
Here is some sample code for your needs:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.util.*;
public class StanfordSafeLineExample {
public static void main (String[] args) throws IOException {
// build pipeline
Properties props = new Properties();
props.setProperty("annotators","tokenize, ssplit, pos, depparse");
props.setProperty("ssplit.eolonly","true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// open file
BufferedReader br = new BufferedReader(new FileReader(args[0]));
// go through each sentence
for (String line = br.readLine() ; line != null ; line = br.readLine()) {
try {
Annotation annotation = new Annotation(line);
pipeline.annotate(annotation);
ArrayList<String> edges = new ArrayList<String>();
CoreMap sentence = annotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);
SemanticGraph tree = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
System.out.println("---");
System.out.println("sentence: "+line);
System.out.println(tree.toString(SemanticGraph.OutputFormat.READABLE));
} catch (Exception e) {
System.out.println("---");
System.out.println("Error with this sentence: "+line);
}
}
}
}
instructions:
Cut and paste this into StanfordSafeLineExample.java
put that file in the directory stanford-corenlp-full-2015-04-20
javac -cp "*:." StanfordSafeLineExample.java
add your sentences one sentence per line to a file called sample_sentences.txt
java -cp "*:." StanfordSafeLineExample sample_sentences.txt