I have just tried out the CoreNLP example code, included as StanfordCoreNlpDemo.java with the download. When I try to parse a chapter of a book, an exception is thrown from the SemanticGraph class:
Exception in thread "main" java.lang.NullPointerException
at edu.stanford.nlp.semgraph.SemanticGraph.removeEdge(SemanticGraph.java:122)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.expandPPConjunction(UniversalEnglishGrammaticalStructure.java:553)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.expandPPConjunctions(UniversalEnglishGrammaticalStructure.java:508)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.collapseDependencies(UniversalEnglishGrammaticalStructure.java:807)
at edu.stanford.nlp.trees.GrammaticalStructure.typedDependenciesCollapsed(GrammaticalStructure.java:877)
at edu.stanford.nlp.semgraph.SemanticGraphFactory.makeFromTree(SemanticGraphFactory.java:188)
at edu.stanford.nlp.semgraph.SemanticGraphFactory.generateCollapsedDependencies(SemanticGraphFactory.java:90)
at edu.stanford.nlp.pipeline.ParserAnnotatorUtils.fillInParseAnnotations(ParserAnnotatorUtils.java:51)
at edu.stanford.nlp.pipeline.ParserAnnotator.finishSentence(ParserAnnotator.java:266)
at edu.stanford.nlp.pipeline.ParserAnnotator.doOneSentence(ParserAnnotator.java:245)
at edu.stanford.nlp.pipeline.SentenceAnnotator.annotate(SentenceAnnotator.java:96)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:68)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:412)
at epubReader.Test.main(Test.java:68)
I have narrowed it down to one sentence being responsible, even when parsed alone:
"Generally speaking, he was a scout, and rarely stood at a watch nearer than the four hundred and fiftieth metre, and then only as a cordon commander."
Is there anything special about the syntax or semantics of the sentence that I am missing, causing the error?
No, the sentence is fine and this is unfortunately a bug in our dependency converter.
The part-of-speech tagger outputs a really weird POS sequence, which causes the parser to produce a completely wrong parse tree, and that in turn triggers this exception in the constituency-to-dependency converter.
I fixed the bug in the converter, but unless you clone the code from GitHub and compile it yourself, this won't help you until the next release.
But you can still get a parse for this sentence either by disabling the POS tagger (see the parser FAQ for details on how to do this) or by using the neural network dependency parser.
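As a rough sketch of the second workaround, the pipeline can be configured to run the neural network dependency parser instead of the constituency parser. This assumes a CoreNLP release that ships the depparse annotator; the class and method names below are made up for illustration, and the call that builds the pipeline is left as a comment so the snippet compiles without the library.

```java
import java.util.Properties;

class DepparseConfig {
    // Builds pipeline properties that use the neural network dependency
    // parser ("depparse") in place of the constituency parser ("parse").
    static Properties neuralDependencyProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
        return props;
    }

    public static void main(String[] args) {
        Properties props = neuralDependencyProps();
        // With CoreNLP on the classpath, these properties would be handed to
        // new StanfordCoreNLP(props); that step is omitted here on purpose.
        System.out.println(props.getProperty("annotators"));
    }
}
```

Because no constituency tree is built in this configuration, the constituency-to-dependency converter that throws the exception is not involved.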
I am trying to write a simple parser to read MusicXML and play it back. I am using JFugue 5.0.3. The library simply hangs in the middle half the time.
My code:-
public void play() throws ParserConfigurationException, ParsingException, IOException {
    MusicXmlParser_J parser_j = new MusicXmlParser_J();
    StaccatoParserListener listener = new StaccatoParserListener();
    parser_j.addParserListener(listener);
    parser_j.parse(musicXMLFile);
    Player player = new Player();
    final Pattern musicXMLPattern = listener.getPattern();
    player.play(musicXMLPattern);
}
This code fails after attempting to build the MusicXML file, with a useless error message:
attempting to build file
Oops something went wrong. The error isConnection timed out
It works sometimes, and when it does, it works like a charm and within milliseconds. Why is JFugue trying to download stuff over the network? A little debugging revealed it works by downloading
A few questions for the people who work on this:
Why doesn't this code work with the MusicXMLParserListener class? This class fails with an arbitrary null pointer exception because it expects a Staccato parser to be defined. Why? And what is the deal with two different listeners for MusicXML, MusicXMLParserListener_J and MusicXMLParserListener_R? Don't expose broken stuff to consumers please.
I had read that JFugue supports MusicXML and I expected this to work without an issue.
Are there examples of JFugue working with MusicXML?
The two different parsers you see are a result of MusicXML capability in JFugue 5.0+ not being fully completed yet (which is mentioned on the JFugue download page). These were two implementations contributed by two developers in the open source community. Based on your issue, I have just updated the JFugue distribution: I moved one of the parsers out, and renamed the other one simply MusicXmlParser.
It turns out that the other parser, MusicXmlParser_R, is the better of the two. Of course, no user would have had any idea about that. My apologies for exposing an incomplete capability. For the record, "Oops, something went wrong" is not a JFugue error message, but "attempting to build file" was in MusicXmlParser_J, which is now out of the code.
The following code, which is from test/org.jfugue.musicxml, demonstrates the use of MusicXmlParser.
MusicXmlParser musicXmlParser = new MusicXmlParser();
StaccatoParserListener spl = new StaccatoParserListener();
musicXmlParser.addParserListener(spl);
musicXmlParser.parse(new File("src/test/resources/HelloWorldMusicXml.xml"));
Please note that there is no implementation of MusicXmlParserListener for JFugue 5.0 at the moment. I have updated the download page to make this clear.
The MusicXML capabilities within JFugue are entirely community-contributed.
I am trying to use the Stanford Shift Reduce Parser with the Spanish model supplied. I am noticing, however, that unlike with the Lexicalized Parser, I cannot get the TypedDependencies, despite passing the appropriate flag -outputFormat typedDependencies, as can be seen in lexparser.bat/sh.
Just in case, this is the Java code I'm using to pass the flags and create the parser.
ShiftReduceParser model = ShiftReduceParser.loadModel(modelPath);
model.setOptionFlags("-factored", "-outputFormat", "penn,typedDependencies");
ArrayList<TaggedWord> taggedWords = new ArrayList<TaggedWord>();
Thank you
The problem here is not the ShiftReduceParser, but simply that we don't currently support typed dependencies for Spanish; we only have them for English and Chinese.
(Looking ahead, the most likely thing to appear first is support for Universal Dependencies in the Neural Network Dependency Parser. Indeed, you could probably train such a model yourself now.)
I've been experimenting with Treetop lately to create a simple parser for a CFG DSL for one of my clients. I managed to implement all the features he required, but working with Treetop turned out to be quite a painful experience.
The problem is that I was not able to get any usable error message from Treetop. The only output I am getting is
parser.rb:22:in `parse': Parser error at offset: 0 (Exception)
Error:
#<TranLanParser:0x007f960c852f60>
from parser.rb:28:in `<class:Parser>'
from parser.rb:10:in `<main>'
which always points to the first character in the file. This makes it really hard to find any error in the parsed language. How should I incrementally develop my parser if I can't find out what's wrong at all?
I tried to change my grammar to contain recursive rules, because I thought that this would help the parser to create AST nodes as soon as possible, but it didn't help.
My question is:
Am I doing something wrong? Is there any good example of how to create PEG grammars for Treetop that provide meaningful error messages on partially derived trees? Or is it a bug/error in the Treetop library?
Thanks for any opinion.
Did you try printing parser.failure_reason? This prints the list of terminals that would have allowed advancing beyond the right-most position the parser reached (before it back-tracked).
Did you try a single token or ultra-simple grammar, working up as you go?
Did you try setting parser.consume_all_input = false, to see whether it was parsing correctly but not to the end of the input?
There are a few more "traps for young players" but you haven't given us enough information to go on. Once you "get it", developing in Treetop is a breeze, but it can take a little while to get to that point.
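To make the idea behind failure_reason concrete, here is a minimal Java sketch of a backtracking parser that remembers the right-most offset at which a terminal failed, together with the terminals that would have matched there. This is not Treetop's implementation; the class, the toy grammar, and the method names are all hypothetical and only illustrate the bookkeeping.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class FurthestFailure {
    private final String input;
    private int furthest = -1;  // right-most offset where a match failed
    private final Set<String> expected = new LinkedHashSet<>();  // what would have matched there

    FurthestFailure(String input) { this.input = input; }

    // Records a failure, keeping only the failures at the right-most offset.
    private void fail(String what, int pos) {
        if (pos > furthest) { furthest = pos; expected.clear(); }
        if (pos == furthest) expected.add(what);
    }

    // Tries to match a literal at pos; returns the new position or -1.
    private int terminal(String literal, int pos) {
        if (input.startsWith(literal, pos)) return pos + literal.length();
        fail(literal, pos);
        return -1;
    }

    // greeting <- "hello " ("world" / "there") end-of-input
    boolean parse() {
        int pos = terminal("hello ", 0);
        if (pos < 0) return false;
        int alt = terminal("world", pos);
        if (alt < 0) alt = terminal("there", pos);  // ordered choice: retry from pos
        if (alt < 0) return false;
        if (alt != input.length()) { fail("end of input", alt); return false; }
        return true;
    }

    String failureReason() {
        return "Expected one of " + expected + " at offset " + furthest;
    }

    public static void main(String[] args) {
        FurthestFailure parser = new FurthestFailure("hello wurld");
        if (!parser.parse()) System.out.println(parser.failureReason());
    }
}
```

For the input "hello wurld" the overall parse fails, but the recorded reason points at offset 6 and lists both alternatives, which is far more useful than an error at offset 0.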
First, what should I read about parsing and building an AST?
How do I create a parser for a language (like SQL) that builds an AST and tolerates syntax errors?
For example, for "3+4*5":
  +
 / \
3   *
   / \
  4   5
And for "3+4*+" with syntax error, parser would guess that the user meant:
  +
 / \
3   *
   / \
  4   +
     / \
    ?   ?
Where to start?
SQL:
   SELECT_________________
  /         \             \
 .            FROM         JOIN
/ \            |          /    \
a city_name  people address    ON
                               |
                               =______________
                              /               \
                             .____             .
                            / \               / \
                           p   address_id    a   id
The standard answer to the question of how to build parsers (that build ASTs), is to read the standard texts on compiling. Aho and Ullman's "Dragon" Compiler book is pretty classic. If you haven't got the patience to get the best reference materials, you're going to have more trouble, because they provide theory and investigate subtleties. But here is my answer for people in a hurry, building recursive descent parsers.
One can build parsers with built-in error recovery. There are many papers on this sort of thing; it was a hot topic in the 1980s. Check out Google Scholar and hunt for "syntax error repair". The basic idea is that the parser, on encountering a parsing error, either skips to some well-known beacon (";" as a statement delimiter is pretty popular for C-like languages, which is why you got asked in a comment if your language has statement terminators), or proposes various input stream deletions or insertions to climb over the point of the syntax error. The sheer variety of such schemes is surprising. The key idea is generally to take into account as much information around the point of error as possible. One of the most intriguing ideas I ever saw had two parsers, one running N tokens ahead of the other, looking for syntax-error land-mines, and the second parser being fed error repairs based on the N tokens available before it encounters the syntax error. This lets the second parser choose to act differently before arriving at the syntax error. If you don't have this, most parsers throw away left context and thus lose the ability to repair. (I never implemented such a scheme.)
The choice of things to insert can often be derived from information used to build the parser (often First and Follow sets) in the first place. This is relatively easy to do with L(AL)R parsers, because the parse tables contain the necessary information and are available to the parser at the point where it encounters an error. If you want to understand how to do this, you need to understand the theory (oops, there's that compiler book again) of how the parsers are constructed. (I have implemented this scheme successfully several times).
Regardless of what you do, syntax error repair doesn't help much, because it is almost impossible to guess what the writer of the parsed document actually intended. This suggests fancy schemes won't be really helpful. I stick to simple ones; people are happy to get an error report and some semi-graceful continuation of parsing.
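As an illustration of the simple "skip to a beacon" style of recovery, here is a minimal Java sketch for a made-up language of statements of the form `name = number ;`. The language, the whitespace-separated tokenization, and all names are assumptions chosen for brevity; on an error the parser records a message, discards tokens up to and including the next ";", and carries on.

```java
import java.util.ArrayList;
import java.util.List;

class PanicModeParser {
    final List<String> errors = new ArrayList<>();    // one message per bad statement
    final List<String> accepted = new ArrayList<>();  // statements that parsed cleanly
    private final String[] tokens;
    private int pos = 0;

    PanicModeParser(String source) {
        String trimmed = source.trim();
        // Tokens are assumed to be separated by whitespace.
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // program <- (identifier '=' number ';')*
    void parseProgram() {
        while (pos < tokens.length) {
            try {
                String name = expect("[a-z]+", "identifier");
                expect("=", "'='");
                String value = expect("[0-9]+", "number");
                expect(";", "';'");
                accepted.add(name + "=" + value);
            } catch (IllegalStateException e) {
                errors.add(e.getMessage());
                // Panic mode: discard tokens up to and including the next ';'.
                while (pos < tokens.length && !tokens[pos].equals(";")) pos++;
                pos++;  // step over the ';' (harmless at end of input)
            }
        }
    }

    // Consumes the current token if it matches, otherwise reports what was expected.
    private String expect(String regex, String what) {
        if (pos < tokens.length && tokens[pos].matches(regex)) return tokens[pos++];
        String found = pos < tokens.length ? "'" + tokens[pos] + "'" : "end of input";
        throw new IllegalStateException("token " + pos + ": expected " + what + " but found " + found);
    }

    public static void main(String[] args) {
        PanicModeParser parser = new PanicModeParser("a = 1 ; b = = 2 ; c = 3 ;");
        parser.parseProgram();
        System.out.println(parser.accepted);
        System.out.println(parser.errors);
    }
}
```

With the input in main, the malformed middle statement produces a single error and the statements on either side are still accepted, which is the semi-graceful continuation described above.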
A real problem with rolling your own parser for a real language is that real languages are nasty, messy things; people building real implementations get it wrong and frozen in stone because of existing code bases, or insist on bending/improving the language (standards are for wimps, goodies are for marketing) because it's cool. Expect to spend a lot of time re-calibrating what you think the grammar is against the ground truth of real code. As a general rule, if you want a working parser, it is better to get one that has a track record than to roll it yourself.
A lesson most people who are hell-bent on building a parser don't get is that if they want to do anything useful with the parse result or tree, they'll need a lot more basic machinery than just the parser. Check my bio for "Life After Parsing".
There are two things the parser could do:
Report the error and have the user try again.
Repair the error and proceed.
Generally speaking the first one is easier (and safer). There may not always be enough information for the parser to infer the intent when the syntax is wrong. Depending on the circumstances, it may be dangerous to proceed with a repair that makes the input syntactically correct but semantically wrong.
I've written a few hand-rolled recursive descent parsers for little languages. When writing code to interpret the grammar rules explicitly (as opposed to using a parser-generator), it's easy to detect errors, because the next token doesn't fit the production rule. Generated parsers tend to spit out a simplistic "expected $(TOKEN_TYPE) here" message, which isn't always useful to the user. With a hand-written parser, it's often easy to give a more specific diagnostic message, but it can be time consuming to cover every case.
If your goal is to report the problem but keep parsing (so that you can see if there are additional problems), you can put a special AST node in the tree at the point of the error. This keeps the tree from falling apart.
You then have to resync to some point beyond the error in order to continue parsing. As Ira Baxter mentioned in his answer, you might look for a token, like ';', that separates statements. The correct token(s) to look for depends on the language you're parsing. Another possibility is to guess what the user meant (e.g., infer an extra token or a different token at the point the error was detected) and then continue. If you encounter another syntax error within the next few tokens, you could backtrack, make a different guess, and try again.
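The placeholder-node idea can be sketched with a small hand-written recursive descent parser for sums and products of integers. This is only an illustration under simplifying assumptions: the class name is hypothetical, the tree is rendered directly as an S-expression string to keep it short, whitespace is not handled, and a missing operand simply becomes the node "?".

```java
class ErrorNodeParser {
    private final String src;
    private int pos = 0;

    ErrorNodeParser(String src) { this.src = src; }

    // expr <- term ('+' term)*
    String expr() {
        String left = term();
        while (peek() == '+') {
            pos++;
            left = "(+ " + left + " " + term() + ")";
        }
        return left;
    }

    // term <- factor ('*' factor)*
    String term() {
        String left = factor();
        while (peek() == '*') {
            pos++;
            left = "(* " + left + " " + factor() + ")";
        }
        return left;
    }

    // factor <- digit+ ; anything else yields the placeholder node "?"
    // without consuming input, so the caller can continue parsing.
    String factor() {
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        return pos > start ? src.substring(start, pos) : "?";
    }

    // Returns the current character, or '\0' at end of input.
    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    public static void main(String[] args) {
        System.out.println(new ErrorNodeParser("3+4*5").expr());  // (+ 3 (* 4 5))
        System.out.println(new ErrorNodeParser("3+4*+").expr());  // (+ (+ 3 (* 4 ?)) ?)
    }
}
```

The tree for "3+4*+" is not the same guess as the one drawn in the question; it just shows that each missing operand gets its own placeholder and that the rest of the tree stays intact. A fuller version would also attach a position and message to each placeholder so the errors can be reported afterwards.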
I'm just starting to use antlr, with antlr for ruby. The version is 3.2.1
I'm trying to create a parser for the chef language, and the grammar is giving me a real headache :P I'm sure I'm missing some fundamental concept, but I just couldn't figure it out.
I created 3 grammars. The main one is the recipe parser, which (of course) parses the recipes. Once a recipe is parsed, I used the other 2 grammars, that parse ingredients and instructions (the method section).
My problem is with the last one, the one that parses the instructions, such as "put ... into the mixing bowl", "liquefy ...", etc. Everything works great except for a few rules. I've posted the Instructions.g source here, at paste.bin because of its length.
Here's what's happening:
When I uncomment the rules combine_ingredient_into_mixing_bowl or divide_ingredient_into_mixing_bowl, the parser stops recognizing almost all of the other rules (such as put_ingredient_into_mixing_bowl). This seems strange to me, because they don't seem to override each other (of course they are, somehow). I get the error: "line 0:-1 mismatched input "" expecting WS"
stir_mixing_bowl does not match anything, but it's really no different from the other rules that do work ok. I get the error: "line 0:-1 mismatched input "" expecting set nil"
Is it possible to include the rules verb_the_ingredient and liquefy_ingredient without making them conflict with the other rules? The former will actually conflict with everything else I guess, and the latter will conflict with liquefy_mixing_bowl. What would be the best way to deal with such a nasty grammar?
By the way, I haven't sent the WS token (whitespace such as space and tab) to the ignore channel: since an ingredient can consist of one or more words (such as dijon mustard or just zucchinis), I found it easier to specify the grammar by using the WS token as a separator.
Also, running the antlr4ruby command to generate the parsers/lexers code shows no warnings at all.
Any tips, hints, or enlightenment would be really appreciated here :)
Thanks in advance.