Stanford-NLP German dependency parsing not working properly - stanford-nlp

I recently discovered (reading the question below) that I could obtain German dependencies with the Stanford parser, using the NNDependencyParser.
Dependencies are null with the German Parser from Stanford CoreNLP
My problem is that my parsed dependencies are always just adjacent words of the sentence, with no real tree structure. Parsing "Die Sonne scheint am Himmel." gives me pairs such as ("Die", "Sonne"), ("Sonne", "scheint"), ("scheint", "am") etc. as dependencies, even when using collapsed dependencies.
String modelPath = "edu/stanford/nlp/models/parser/nndep/UD_German.gz";
String taggerPath = "edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger";
String text = "Ich sehe den Mann mit dem Fernglas.";
MaxentTagger tagger = new MaxentTagger(taggerPath);
DependencyParser parser = DependencyParser.loadFromModelFile(modelPath);
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(text));
for (List<HasWord> sentence : tokenizer) {
    List<TaggedWord> tagged = tagger.tagSentence(sentence);
    GrammaticalStructure gs = parser.predict(tagged);
    for (TypedDependency td : gs.typedDependenciesCollapsed()) {
        System.out.println(td.toString());
    }
}

Yes, our German dependency parsing model is currently broken (somehow the French model was included in the release and we currently don't seem to have a working German model).
However, you could train your own model using the data from the Universal Dependencies project. You can find some information on how to train the parser on its project page.
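For reference, the neural dependency parser is trained from the command line via its main class. A rough sketch (the file names are placeholders; you would point it at the German UD treebank in the CoNLL format the parser expects, plus German word embeddings):
java -cp stanford-corenlp.jar edu.stanford.nlp.parser.nndep.DependencyParser -trainFile de-ud-train.conll -devFile de-ud-dev.conll -embedFile de-embeddings.txt -model UD_German_retrained.gz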

Related

How to extract Subject, Verb and Object for every sentence using NLP in Java?

I want to find the subject, verb, and object of each sentence, which will then be passed to the natural language generation library SimpleNLG to form a sentence.
I tried multiple libraries like CoreNLP, OpenNLP, and the Stanford parsers, but I cannot extract them accurately.
Now, in the worst case, I will have to write a long set of if-else rules to find the subject, verb, and object of each sentence, which is not always accurate enough for SimpleNLG, like:
NN, nsubj etc. go to the subject; VB, VBZ go to the verb.
I tried the lexicalized parser,
LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
String[] sent = { "This", "is", "an", "easy", "sentence", "." };
Tree parse = (Tree) lp.apply(Arrays.asList(sent));
parse.pennPrint();
System.out.println();
TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
tp.print(parse);
which gives this output,
nsubj(use-2, I-1)
root(ROOT-0, use-2)
det(parser-4, a-3)
dobj(use-2, parser-4)
And I want something like this
subject = I
verb = use
det = a
object = parser
Is there a simpler way to find this in Java, or should I go with if-else? Please help me with it.
You can use the openie annotator to get triples. You can run this at the command line or build a pipeline with these annotators.
command:
java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie -file example.txt
Java:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation result = pipeline.process("...");
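To read the triples from that Annotation in Java, get the RelationTriplesAnnotation of each sentence. A minimal sketch following the OpenIE demo (the classes are in edu.stanford.nlp.naturalli, edu.stanford.nlp.ie.util, and edu.stanford.nlp.util; the input text is just an example):
Annotation doc = pipeline.process("Joe ate some pizza.");
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
    for (RelationTriple triple : triples) {
        // confidence, subject, relation, object
        System.out.println(triple.confidence + "\t" + triple.subjectLemmaGloss() + "\t" + triple.relationLemmaGloss() + "\t" + triple.objectLemmaGloss());
    }
}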
input:
Joe ate some pizza.
output:
Extracted the following Open IE triples:
1.0 Joe ate pizza
More details here: https://stanfordnlp.github.io/CoreNLP/openie.html

Why do the property settings for the StanfordCoreNLP pipeline matter when building a dependency parse tree?

I have been playing around with Stanford-CoreNLP and I figured out that building a dependency parse tree with the following code
String text = "Are depparse and parse equivalent properties for building dependency parse tree?";
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST ));
}
is outputting
root(ROOT-0, tree-11)
cop(tree-11, Are-1)
amod(properties-6, depparse-2)
cc(depparse-2, and-3)
conj(depparse-2, parse-4)
compound(properties-6, equivalent-5)
nsubj(tree-11, properties-6)
case(dependency-9, for-7)
compound(dependency-9, building-8)
nmod(properties-6, dependency-9)
amod(tree-11, parse-10)
punct(tree-11, ?-12)
However this code
String text = "Are depparse and parse equivalent properties for building dependency parse tree?";
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, lemma, ner, depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST ));
}
outputs
root(ROOT-0, properties-6)
cop(properties-6, Are-1)
compound(properties-6, depparse-2)
cc(depparse-2, and-3)
conj(depparse-2, parse-4)
amod(properties-6, equivalent-5)
case(tree-11, for-7)
amod(tree-11, building-8)
compound(tree-11, dependency-9)
amod(tree-11, parse-10)
nmod(properties-6, tree-11)
punct(properties-6, ?-12)
So why am I not getting the same output from these two methods? Is it possible to change the latter code to be equivalent to the first, since loading the constituency parser as well makes parsing so slow? And how would you recommend setting the properties to get the most accurate dependency parse tree?
The constituency parser (parse annotator) and dependency parser (depparse annotator) are actually completely different models and code paths. In one case, we're predicting a constituency tree and converting it to a dependency graph. In the other case, we are running a dependency parser directly. In general, depparse is expected to be faster (O(n) vs O(n^3)) and more accurate at producing dependency trees, but will not produce constituency trees.
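If you only need the dependency graph, a depparse-only pipeline is the fast path. Note that depparse needs POS tags, so include pos in the annotator list. A minimal sketch (this annotator list is my suggestion; since depparse is a different model than parse, its output will still not match the first snippet exactly):
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    // basic dependencies produced directly by the neural dependency parser
    SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
    System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
}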

Formatting NER output from Stanford Corenlp

I am working with Stanford CoreNLP and using it for NER. But when I extract organization names, I see that each word is tagged separately. So, if the entity is "NEW YORK TIMES", it gets recorded as three different entities: "NEW", "YORK" and "TIMES". Is there a property we can set in Stanford CoreNLP so that we get the combined output as one entity?
Just like in Stanford NER, where the command line utility lets us choose an output format such as inlineXML, can we somehow set a property to select the output format in Stanford CoreNLP?
If you just want the complete strings of each named entity found by Stanford NER, try this:
String text = "<INSERT YOUR INPUT TEXT HERE>";
AbstractSequenceClassifier<CoreMap> ner = CRFClassifier.getDefaultClassifier();
List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> entity : entities)
System.out.println(text.substring(entity.second, entity.third));
In case you're wondering, the entity class is indicated by entity.first.
Alternatively, you can use ner.classifyWithInlineXML(text) to get output that looks like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
No, CoreNLP 3.5.0 has no utility to merge the NER labels. The next release (coming sometime next week) has a new MentionsAnnotator which handles this merging for you. For now, you can (a) use the MentionsAnnotator, available on the CoreNLP master branch, or (b) merge manually.
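If you do merge manually, one simple approach is to walk the token list of each sentence and concatenate consecutive tokens that carry the same (non-O) NER label. A rough sketch (variable names are illustrative, and sentence is a CoreMap from the pipeline):
List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
List<String> entities = new ArrayList<>();
StringBuilder current = new StringBuilder();
String currentTag = "O";
for (CoreLabel token : tokens) {
    String tag = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
    if (tag.equals(currentTag) && !"O".equals(tag)) {
        // same label as the previous token: extend the current entity
        current.append(" ").append(token.word());
    } else {
        if (!"O".equals(currentTag)) entities.add(current.toString());
        current = new StringBuilder(token.word());
        currentTag = tag;
    }
}
if (!"O".equals(currentTag)) entities.add(current.toString());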
Use the -outputFormat xml option to have CoreNLP output XML. (Is this what you want?)
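For example, from the command line (input.txt is a placeholder for your file):
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -outputFormat xml -file input.txt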
You can set any property in the properties file, including the "outputFormat" property. Stanford CoreNLP supports several different formats such as json, xml, and text. However, the xml option is not an inlineXML format; it gives per-token annotations for NER:
<tokens>
<token id="1">
<word>New</word>
<lemma>New</lemma>
<CharacterOffsetBegin>0</CharacterOffsetBegin>
<CharacterOffsetEnd>3</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>ORGANIZATION</NER>
<Speaker>PER0</Speaker>
</token>
<token id="2">
<word>York</word>
<lemma>York</lemma>
<CharacterOffsetBegin>4</CharacterOffsetBegin>
<CharacterOffsetEnd>8</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>ORGANIZATION</NER>
<Speaker>PER0</Speaker>
</token>
<token id="3">
<word>Times</word>
<lemma>Times</lemma>
<CharacterOffsetBegin>9</CharacterOffsetBegin>
<CharacterOffsetEnd>14</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>ORGANIZATION</NER>
<Speaker>PER0</Speaker>
</token>
</tokens>
From Stanford CoreNLP 3.6 onwards, you can use the entitymentions annotator in the pipeline and get a list of all entities. I have shown an example here. It works.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner,entitymentions");
props.put("regexner.mapping", "jg-regexner.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String inputText = "I have done Bachelor of Arts and Bachelor of Laws so that I can work at British Broadcasting Corporation";
Annotation annotation = new Annotation(inputText);
pipeline.annotate(annotation);
List<CoreMap> multiWordsExp = annotation.get(MentionsAnnotation.class);
for (CoreMap multiWord : multiWordsExp) {
String custNERClass = multiWord.get(NamedEntityTagAnnotation.class);
System.out.println(multiWord +" : " +custNERClass);
}

Generating PDF from OpenXml

I am trying to find an SDK that can generate PDF from OpenXml. I have used the Open XML Power Tools to convert the Open XML to HTML and then used iTextSharp to parse the HTML to PDF, but the result is a very terrible-looking PDF.
I have not yet tried iText's RTF parser. If I go in this direction, I will end up needing an RTF converter, turning a simple conversion into a two-step nightmare.
It almost looks like I might end up writing a custom converter based on the Power Tools OpenXml-to-HTML converter. Any advice is appreciated. At this time I really can't go for a professional converter, as the licenses are too expensive (Aspose Word/TxText).
I thought I would put some more effort into my investigation. I went back to the conversion utility "http://msdn.microsoft.com/en-us/library/ff628051.aspx" and looked through its code. The biggest thing it missed was reading the underlying styles and generating a style attribute. After adding that, the PDF looked much better, with the limitation of not handling custom TrueType fonts. More investigation tomorrow. I am hoping someone has done something like this / faced similar weird issues and can shed some light.
private static StringDictionary GetStyle(XElement el)
{
IEnumerable<XElement> jcL = el.Elements(W.jc);
IEnumerable<XElement> spacingL = el.Elements(W.spacing);
IEnumerable<XElement> rPL = el.Elements(W.rPr);
StringDictionary sd = new StringDictionary();
if (HasAttribute(jcL, W.val)) sd.Add("text-align", GetAttribute(jcL, W.val));
// run prop exists
if (rPL.Count() > 0)
{
XElement r = rPL.First();
IEnumerable<XElement> ftL = el.Elements(W.rFonts);
if (r.Element(W.b) != null) sd.Add("font-weight", "bolder");
if (r.Element(W.i) != null) sd.Add("font-style", "italic");
if (r.Element(W.u) != null) sd.Add("text-decoration", "underline");
if (r.Element(W.color) != null && HasAttribute(r.Element(W.color), W.val)) sd.Add("color", "#" + GetAttribute(r.Element(W.color), W.val));
if (r.Element(W.rFonts) != null )
{
//
if(HasAttribute(r.Element(W.rFonts), W.cs)) sd.Add("font-family", GetAttribute(r.Element(W.rFonts), W.cs));
else if (HasAttribute(r.Element(W.rFonts), W.hAnsi)) sd.Add("font-family", GetAttribute(r.Element(W.rFonts), W.hAnsi));
}
if (r.Element(W.sz) != null && HasAttribute(r.Element(W.sz), W.val)) sd.Add("font-size", GetAttribute(r.Element(W.sz), W.val) + "pt");
}
return sd.Keys.Count > 0 ? sd : null;
}
I don't know of any direct converter with source code available, but yes, my thought is that you may need to build a converter from scratch. Luckily (I guess), Word's WordprocessingML is the simplest of the Open XML formats, and you can look to other projects for inspiration, such as:
TextGlow - Word to Silverlight converter
Word to XAML Converter - Word to XAML converter (probably very similar to TextGlow above)
OpenXML-DAISY - conversion to Daisy
ODF Converter - convert from/to OpenOffice formats and OpenXML
The XHTML solution by Eric White you already referenced.
For commercial & server-side solutions, you can use either Word Automation Services (requires SharePoint) or Aspose.Words for .NET.

How to parse a list of sentences?

I want to parse a list of sentences with the Stanford NLP parser.
My list is an ArrayList; how can I parse the whole list with the LexicalizedParser?
I want to get this form from each sentence:
Tree parse = (Tree) lp1.apply(sentence);
Although one can dig into the documentation, I am going to provide code here on SO, especially since links move and/or die. This particular answer uses the whole pipeline. If you are not interested in the whole pipeline, I will provide an alternative answer in just a second.
The example below is the complete way of using the Stanford pipeline. If you are not interested in coreference resolution, remove dcoref from the 3rd line of code. So in the example below, the pipeline does the sentence splitting for you (the ssplit annotator) if you just feed it a body of text (the text variable). Have just one sentence? That is fine, you can feed that in as the text variable.
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// read some text in the text variable
String text = ... // Add your text here!
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);
// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for(CoreMap sentence: sentences) {
// traversing the words in the current sentence
// a CoreLabel is a CoreMap with additional token-specific methods
for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
// this is the text of the token
String word = token.get(TextAnnotation.class);
// this is the POS tag of the token
String pos = token.get(PartOfSpeechAnnotation.class);
// this is the NER label of the token
String ne = token.get(NamedEntityTagAnnotation.class);
}
// this is the parse tree of the current sentence
Tree tree = sentence.get(TreeAnnotation.class);
// this is the Stanford dependency graph of the current sentence
SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}
// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph =
document.get(CorefChainAnnotation.class);
Actually, the Stanford NLP documentation provides a sample of how to parse sentences.
You can find the documentation here
So, as promised, if you don't want to access the full Stanford pipeline (although I believe that is the recommended approach), you can work with the LexicalizedParser class directly. In this case, you would download the latest version of the Stanford Parser (whereas the other answer uses the CoreNLP tools). Make sure that, in addition to the parser jar, you have the model file for the appropriate parser you want to work with. Example code:
LexicalizedParser lp = LexicalizedParser.loadModel("englishPCFG.ser.gz");
String sentence = "It is a fine day today";
Tree parse = lp.parse(sentence);
Note this works for version 3.3.1 of the parser.
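If your sentences are already in an ArrayList of strings, you can just loop over the list and parse each element, collecting the trees. A small sketch building on the snippet above (the list contents are just examples):
List<String> sentenceList = new ArrayList<>(Arrays.asList(
        "It is a fine day today.",
        "The parser handles each sentence separately."));
List<Tree> parses = new ArrayList<>();
for (String s : sentenceList) {
    parses.add(lp.parse(s));
}
for (Tree t : parses) {
    t.pennPrint();
}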

Resources