Can I get ANTLR to generate output from an AST that I create by hand? - antlr3

I understand how to generate an AST from character streams using ANTLR.
I'd like to be able to create an AST programmatically and have ANTLR apply the rules in the grammar to produce valid output, for example adding in the syntactic elements that aren't stored in the AST, such as quotes and whitespace.
To use a tree walker you seem to need a TokenStream attached to the TreeNodeStream, but if the tree was created programmatically there is no TokenStream. Failing to call setTokenStream(...) on a CommonTreeNodeStream causes NullPointerExceptions at runtime.
Example:
TokenStream tokens = new CommonTokenStream(new MyLexer(somestream));
// parse an AST 'ast' from this stream
CommonTreeNodeStream nodes = new CommonTreeNodeStream(ast);
// required, or a NullPointerException is thrown at runtime
nodes.setTokenStream(tokens);
new MyWalker(nodes).start();
So: can you create an AST on the fly, without any character stream input, and have some generated ANTLR class emit a character stream according to the rules defined in a grammar?
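For reference, building such an AST by hand in ANTLR 3 might look like the sketch below. The token type constants come from a hypothetical generated parser (MyParser), and MyWalker stands in for a generated tree walker as in the example above:
import org.antlr.runtime.CommonToken;
import org.antlr.runtime.tree.CommonTree;
import org.antlr.runtime.tree.CommonTreeNodeStream;

// build the tree for "1 + 2" by hand: (+ 1 2)
CommonTree ast = new CommonTree(new CommonToken(MyParser.PLUS, "+"));
ast.addChild(new CommonTree(new CommonToken(MyParser.INT, "1")));
ast.addChild(new CommonTree(new CommonToken(MyParser.INT, "2")));

// no TokenStream exists here, which is exactly where the
// NullPointerException described above comes from
CommonTreeNodeStream nodes = new CommonTreeNodeStream(ast);
new MyWalker(nodes).start();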

Related

Xtext - Multiline String like in YAML

I'm trying to model a YAML-like DSL in Xtext. In this DSL, I need multiline strings as in YAML.
description: |
  Line 1
  line 2
  ...
My first try was this:
terminal BEGIN:
    'synthetic:BEGIN'; // increase indentation
terminal END:
    'synthetic:END'; // decrease indentation
terminal MULTI_LINE_STRING:
    "|"
    BEGIN ANY_OTHER END;
and my second try was
terminal MULTI_LINE_STRING:
    "|"
    BEGIN
    ((!('\n'))+ '\n')+
    END;
but neither of them succeeded. Is there any way to do this in Xtext?
UPDATE 1:
I've tried this alternative as well.
terminal MULTI_LINE_STRING:
    "|"
    BEGIN -> END
When I triggered the "Generate Xtext Artifacts" process, I got this error:
3492 [main] INFO nerator.ecore.EMFGeneratorFragment2 - Generating EMF model code
3523 [main] INFO clipse.emf.mwe.utils.GenModelHelper - Registered GenModel 'http://...' from 'platform:/resource/.../model/generated/....genmodel'
error(201): ../.../src-gen/.../parser/antlr/lexer/Internal..Lexer.g:236:71: The following alternatives can never be matched: 1
error(3): cannot find tokens file ../.../src-gen/.../parser/antlr/internal/Internal...Lexer.tokens
error(201): ../....idea/src-gen/.../idea/parser/antlr/internal/PsiInternal....g:4521:71: The following alternatives can never be matched: 1
This slide deck shows how we implemented whitespace block scoping in an Xtext DSL.
We used synthetic tokens called BEGIN, corresponding to an indent, and END, corresponding to an outdent.
(Note: the language was subsequently renamed to RAPID-ML, included as a feature of RepreZen API Studio.)
I think your main problem is that you have not defined when your multiline token ends. Before you can come to a solution, you have to make clear in your mind how an algorithm should determine the end of the token. No tool can take this mental burden from you.
Issue: there is no end-marking character. Either you have to define such a character (unlike YAML) or define the end of the token in another way, for example through some sort of semantic whitespace (I think YAML does it like that).
The first approach would make things very easy: just read content until you find the closing character. The second approach would probably be manageable using a custom lexer. Basically, you replace the generated lexer with your own implementation that is able to count leading blanks or similar.
Here are some starting points for how this could be done (several approaches are conceivable); a rough sketch of the blank-counting idea follows after these links:
Writing a custom Xtext/ANTLR lexer without a grammar file
http://consoliii.blogspot.de/2013/04/xtext-is-incredibly-powerful-framework.html
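To make the second approach concrete, here is a minimal, self-contained sketch of the indentation-counting idea. It is not tied to the real Xtext lexer API; the class name and the printed BEGIN/END markers are made up for illustration:
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Emits synthetic BEGIN/END markers based on leading whitespace,
// the same idea a custom Xtext/ANTLR lexer would implement internally.
public class IndentScanner {
    public static void scan(List<String> lines) {
        Deque<Integer> indents = new ArrayDeque<>();
        indents.push(0);
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines do not change block structure
            }
            int indent = 0;
            while (indent < line.length() && line.charAt(indent) == ' ') {
                indent++;
            }
            if (indent > indents.peek()) {
                indents.push(indent);
                System.out.println("BEGIN"); // synthetic indent token
            }
            while (indent < indents.peek()) {
                indents.pop();
                System.out.println("END"); // synthetic outdent token
            }
            System.out.println("LINE: " + line.trim());
        }
        while (indents.peek() > 0) { // close any blocks still open at EOF
            indents.pop();
            System.out.println("END");
        }
    }
}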

Stanford CoreNLP: output in CoNLL format from Java

I want to parse some German text with Stanford CoreNLP and obtain CoNLL output, so that I can pass the latter to CorZu for coreference resolution.
How can I do that programmatically?
Here is my code so far (which only outputs dependency trees):
Annotation germanAnnotation = new Annotation("Gestern habe ich eine blonde Frau getroffen");
Properties germanProperties = StringUtils.argsToProperties("-props", "StanfordCoreNLP-german.properties");
StanfordCoreNLP pipeline = new StanfordCoreNLP(germanProperties);
pipeline.annotate(germanAnnotation);

StringBuilder trees = new StringBuilder();
for (CoreMap sentence : germanAnnotation.get(CoreAnnotations.SentencesAnnotation.class)) {
    Tree sentenceTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
    trees.append(sentenceTree).append("\n");
}
With the following code I managed to save the parsing output in CoNLL format.
OutputStream outputStream = new FileOutputStream(new File("./target/", OUTPUT_FILE_NAME));
CoNLLOutputter.conllPrint(germanAnnotation, outputStream, pipeline);
However, the HEAD field was 0 for all words. I am not sure whether the problem is in the parsing or only in the CoNLLOutputter. Honestly, I was too annoyed by CoreNLP to investigate further.
In the end I decided to use ParZu instead, and I suggest you do the same. ParZu and CorZu are made to work together seamlessly, and they do.
In my case, I had an already tokenized and POS tagged text. This makes things easier, since you will not need:
A POS-Tagger using the STTS tagset
A tool for morphological analysis
Once you have ParZu and CorZu installed, you will only need to run corzu.sh (included in the CorZu download folder). If your text is tokenized and POS tagged, you can edit the script accordingly:
parzu_cmd="/YourPath/ParZu/parzu -i tagged"
One last note: make sure to convert your tagged text to the following format, with empty lines signifying sentence boundaries: word [tab] tag [newline]. A small conversion sketch follows below.
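For illustration, a minimal sketch of such a conversion, assuming the tagged input is already available in memory as sentences of word/tag pairs (this helper is mine, not part of ParZu or CorZu):
import java.io.PrintWriter;
import java.util.List;

// Writes "word<TAB>tag" lines, with an empty line after each sentence.
static void writeTaggedText(List<List<String[]>> sentences, PrintWriter out) {
    for (List<String[]> sentence : sentences) {
        for (String[] wordAndTag : sentence) {
            out.println(wordAndTag[0] + "\t" + wordAndTag[1]); // word [tab] tag [newline]
        }
        out.println(); // empty line marks the sentence boundary
    }
    out.flush();
}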

Resolving how square brackets vs. parentheses are conceptually handled in the pipeline

When I parse a sentence that contains left and right square brackets, the parse is much, much slower than, and different from, the parse of the same sentence with left and right parentheses, when the default normalizeOtherBrackets value of true is kept (roughly 20 seconds vs. 3 seconds for the ParserAnnotator to run). If this property is instead set to false, then the parse times for brackets vs. parentheses are pretty comparable; however, the parse trees are still very different. With a value of true for text with brackets, the POS is -LRB-, whereas the POS is CD for false, but in each case the general substructure of the tree is the same.
In the case of my corpus, the brackets are overwhelmingly meant to "clarify the antecedent", as described on this site. However, the PRN phrase-level label exists for parentheses and not for square brackets, so the formation of the tree is inherently different even though they have close to the same function in the sentence.
So, please explain why the parse times are so different, and what can be done to get a proper parse? Obviously, a simplistic approach would be to replace brackets with parentheses, but that does not seem like a satisfying solution. Are there any settings that can provide me with some relief? Here is my code:
private void execute() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    props.setProperty("tokenize.options", "normalizeOtherBrackets=false");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);

    // these are all the sentences in this document
    // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);

    long start = System.nanoTime();
    ParserAnnotator pa = new ParserAnnotator(DefaultPaths.DEFAULT_PARSER_MODEL, false, 60, new String[0]);
    pa.annotate(document);
    long end = System.nanoTime();
    System.out.println((end - start) / 1000000 + " ms");

    for (CoreMap sentence : sentences) {
        System.out.println(sentence);
        Tree cm = sentence.get(TreeAnnotation.class);
        cm.pennPrint();
        List<CoreLabel> cm2 = sentence.get(TokensAnnotation.class);
        for (CoreLabel c : cm2) {
            System.out.println(c);
        }
    }
}
Yes, this is a known (and very tricky) issue, and unfortunately there is no perfect solution for it.
The problem with such sentences is mainly caused by the fact that these parenthetical constructions don't appear in the training data of the parser and by the way the part-of-speech tagger and the parser interact.
If you run the tokenizer with the option normalizeOtherBrackets=true, then the square brackets will be replaced with -LSB- and -RSB-. Now in most (or maybe even all) of the training examples for the part-of-speech tagger, -LSB- has the POS tag -LRB-, and -RSB- has the tag -RRB-. Therefore the POS tagger, despite taking context into account, basically always assigns these tags to square brackets.
Now if you run the POS-tagger before the parser, then the parser operates in delexicalized mode which means it tries to parse the sentence based on the POS tags of the individual words instead of based on the words themselves. Assuming that your sentence will have a POS sequence similar to -LRB- NN -RRB- the parser tries to find production rules so that it can parse this sequence. However, the parser's training data doesn't contain such sequences and therefore there aren't any such production rules, and consequently the parser is unable to parse a sentence with this POS sequence.
When this happens, the parser goes into fallback mode which relaxes the constraints on the part-of-speech tags by assigning a very low probability to all the other possible POS tags of every word so that it can eventually parse the sentence according to the production rules of the parser. This step, however, requires a lot of computation and therefore it takes so much longer to parse such sentences.
On the other hand, if you set normalizeOtherBrackets to false, then square brackets won't be translated to -LSB- and -RSB-, and the POS tagger will assign tags to the original [ and ]. However, our training data was preprocessed such that all square brackets were replaced, and therefore [ and ] are treated as unknown words; the POS tagger then assigns tags purely based on the context, which most likely results in a POS sequence that is part of the training data of our parser and tagger. This in turn means that the parser is able to parse the sentence without entering recovery mode, which is a lot faster.
If your data contains a lot of sentences with such constructions, I would recommend that you don't run the POS tagger before the parser (by removing pos from the list of annotators). This should give you faster but still reliable parses; a sketch of this setup follows below. Otherwise, if you have access to a treebank, you could also manually add a few trees with these constructions and train a new parsing model. I would not set normalizeOtherBrackets to false, because then your tokenized input will differ from the training data of the parser and the tagger, which will potentially give you worse results.
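A minimal sketch of the first suggestion, mirroring the question's setup (note that downstream annotators such as lemma and ner require POS tags, so only tokenize and ssplit are kept in the pipeline here):
Properties props = new Properties();
// no "pos" annotator: the parser assigns its own tags,
// so it never has to enter the expensive fallback mode
props.setProperty("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation(text);
pipeline.annotate(document);

// run the parser on the untagged tokens, as in the question's code
ParserAnnotator pa = new ParserAnnotator(DefaultPaths.DEFAULT_PARSER_MODEL, false, 60, new String[0]);
pa.annotate(document);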

antlr4 and string template groups and rewrite option in grammar

In ANTLR 3 I could specify the template group from the parser. The grammar itself had the following options:
options {
    output=template;
    rewrite=true;
    language=CSharp4;
}
parser.TemplateGroup = template;
How is this done in ANTLR 4 and ST4?
ANTLR 4 does not have an output option (or rewrite option). In ANTLR 4, you would implement a listener for the parse tree automatically generated by the grammar, and perform all of the StringTemplate operations within the listener.
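A minimal sketch of that pattern, assuming a grammar named Expr with a labeled rule add; the grammar, rule, and template names are hypothetical, while the ST4 classes (STGroupFile, ST) are real:
import org.stringtemplate.v4.ST;
import org.stringtemplate.v4.STGroup;
import org.stringtemplate.v4.STGroupFile;

// Walks the parse tree and renders output with StringTemplate,
// replacing ANTLR 3's output=template option.
public class TemplateEmitter extends ExprBaseListener {
    private final STGroup templates = new STGroupFile("CSharp.stg");

    @Override
    public void exitAdd(ExprParser.AddContext ctx) {
        ST st = templates.getInstanceOf("add");
        st.add("left", ctx.expr(0).getText());
        st.add("right", ctx.expr(1).getText());
        System.out.println(st.render());
    }
}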

Getting First and Follow metadata from an ANTLR4 parser

Is it possible to extract the FIRST and FOLLOW sets from a rule using ANTLR4? I played around with this a little bit in ANTLR3 and did not find a satisfactory solution, but if anyone has info for either version, it would be appreciated.
I would like to parse user input up to the user's cursor location and then provide a list of possible choices for auto-completion. At the moment, I am not interested in auto-completing tokens which are partially entered; I want to display all possible following tokens at some point mid-parse.
For example:
sentence:
    subjects verb (adverb)? '.' ;
subjects:
    firstSubject (otherSubjects)* ;
firstSubject:
    'The' (adjective)? noun ;
otherSubjects:
    'and the' (adjective)? noun ;
adjective:
    'small' | 'orange' ;
noun:
    CAT | DOG ;
verb:
    'slept' | 'ate' | 'walked' ;
adverb:
    'quietly' | 'noisily' ;
CAT : 'cat';
DOG : 'dog';
Given the grammar above...
If the user had not typed anything yet, the auto-complete list would be ['The'] (note that I would have to retrieve the FIRST set and not the FOLLOW set of rule sentence, since the FOLLOW of the base rule is always EOF).
If the input was "The", the auto-complete list would be ['small', 'orange', 'cat', 'dog'].
If the input was "The cat slept", the auto-complete list would be ['quietly', 'noisily', '.'].
So ANTLR3 provides a way to get the follow set, like this:
BitSet followSet = state.following[state._fsp];
This works well. I can embed some logic into my parser so that when the parser calls the rule at which the user is positioned, it retrieves the follow set of that rule and then provides it to the user, for example as sketched below. However, this does not work as well for nested rules (for instance, the base rule), because the follow set ignores any sub-rule follows, as it should.
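For illustration, such an embedded action in an ANTLR 3 parser might look like this (the surrounding rule is hypothetical; state is the generated parser's RecognizerSharedState, and BitSet is org.antlr.runtime.BitSet):
// inside a rule action of the generated ANTLR 3 parser
BitSet followSet = state.following[state._fsp];
List<String> suggestions = new ArrayList<String>();
for (int tokenType : followSet.toArray()) {
    suggestions.add(getTokenNames()[tokenType]); // map token types to display names
}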
I think I need to provide the FIRST set if the user has completed a rule (this could be hard to determine), as well as the FOLLOW set, in order to cover all valid options. I also think I will need to structure my grammar such that two tokens are never subsequent at the rule level.
I would have to break the above firstSubject rule into some sub-rules...
from
firstSubject:
    'The' (adjective)? CAT | DOG;
to
firstSubject:
    the (adjective)? CAT | DOG;
the:
    'the';
I have yet to find any information on retrieving the FIRST set from a rule.
ANTLR4 appears to have drastically changed the way it works with follows at the level of the generated parser, so at this point I'm not really sure if I should continue with ANTLR3 or make the jump to ANTLR4.
Any suggestions would be greatly appreciated.
ANTLRWorks 2 (AW2) performs a similar operation, which I'll describe here. If you reference the source code for AW2, keep in mind that it is only released under an LGPL license.
1. Create a special token which represents the location of interest for code completion.
   In some ways, this token behaves like the EOF token. In particular, the ParserATNSimulator never consumes this token; a decision is always made at or before it is reached.
   In other ways, this token is very unique. In particular, if the token is located at an identifier or keyword, it is treated as though the token type was "fuzzy", and allowed to match any identifier or keyword for the language. For ANTLR 4 grammars, if the caret token is located at a position where the user has typed g, the parser will allow that token to match a rule name or the keyword grammar.
2. Create a specialized ATN interpreter that can return all possible parse trees which lead to the caret token, without looking past the caret for any decision, and without constraining the exact token type of the caret token.
3. For each possible parse tree, evaluate your code completion in the context of whatever the caret token matched in a parser rule.
4. The union of all the results found in step 3 is a superset of the complete set of valid code completion results, and can be presented in the IDE.
The following describes AW2's implementation of the above steps.
In AW2, the token from step 1 is the CaretToken, and it always has the token type CARET_TOKEN_TYPE.
In AW2, the specialized interpreter from step 2 is represented by the ForestParser<TParser> interface, with most of the reusable implementation in AbstractForestParser<TParser>, specialized for parsing ANTLR 4 grammars for code completion in GrammarForestParser.
In AW2, the analysis from step 3 is performed primarily by GrammarCompletionQuery.TaskImpl.runImpl(BaseDocument).
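Separately, if you only need the set of token types that are viable at the parser's current position (rather than AW2's full parse-forest analysis), the ANTLR 4 runtime exposes this directly through the ATN. A minimal sketch using Parser.getATN() and ATN.getExpectedTokens(int, RuleContext), with the surrounding setup assumed:
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.misc.IntervalSet;

// Asks the ATN which token types can follow the parser's current state,
// e.g. from within a listener or an embedded action.
static void printExpectedTokens(Parser parser) {
    IntervalSet expected = parser.getATN().getExpectedTokens(parser.getState(), parser.getContext());
    for (int tokenType : expected.toArray()) {
        System.out.println(parser.getVocabulary().getDisplayName(tokenType));
    }
}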