Java Grammar To AST - antlr3

In java grammar I have a parser rule,
name
: Identifier ('.' Identifier)* ';'
;
How to get all the identifiers under a single AST tree node?

It seems impossible to me only with your lexer-parser.
For this, you will need the called: tree-walker.This third part of the parsing process will make you able to go through the generated AST and, with a counter, print the number of occurrences.
I let you a reference here in case you decide to implement it.
https://theantlrguy.atlassian.net/wiki/display/ANTLR3/Tree+construction
I hope this would help you!

Related

Algorithm to tell when we've processed a complex variable path expression while parsing?

I am working on a compiler for a homemade programming language and I am stuck on how to convert the lexical token stream into a tree of commands for constructing a DOM-like tree. The "tree of commands" will still be a list, essentially emmitting events in a way that describes how to create a tree, from partial information provided by the lexer. (This language is like CoffeeScript in a way, indentation based, or like XML with indentation focus).
I am stuck on how to tell when a variable path has been discovered. A variable path can be simple, or complex, as these examples demonstrate:
foo
foo.bar
foo.bar[baz].hello[and][goodday].there
this[is[even[more.complicated].wouldnt.you[say]]]
They could get more complicated still, if we handled dynamic interpolation of strings, such as:
foo[`bar${x}abc`].baz
But in my simple lang, there are two relevant things, "paths", and "terms". Terms are anything /a-z/ for now, and paths are chaining together and nesting, like the first examples.
For demonstration purposes, everything else is a simple "term" of 1 word, so you might have this:
abc foo.bar[baz].hello[and][goodday].there, one foo.bar
It forms a simple tree.
Right now I have a lexer which spits out the tokens, so basically:
abc
[SPACE]
foo
.
bar
[
baz
]
.
hello
[
and
]
[
goodday
]
.
there
,
[SPACE]
one
[SPACE]
foo
.
bar
That is at least how I broke it up initially.
So given that sequence of strings, how can you generate messages to tell the parser how to build a tree?
term
nest-down
term
period
term
open-square
and
close-square
...
That is the stream of tokens with a name now, but it is not a tree yet. I would like this:
term-start
term # value: abc
term-end
nest-down
term-path-start
term-start
term # value: foo
term-end
period
term-start
term # value: bar
term-end
term-nest-start
term-start
term # value: and
term-and
term-nest-end
...
I have been struggling with this example for several days now (boiled down from a complex real-world scenario). I cant seem to figure out how to keep track of all the information you need to make a decision on when to say "this structure is done now, close it out" sort of thing. Wondering if you know how to get past this.
Note, I don't need the last tree to actually be a tree structure visually, I just need it to generate those messages which can be interpreted on the other end and used to construct a tree at runtime.
There is no way to construct a tree from a list without having the description of the tree in some form. Often, in relation to parsing, the description of this tree is given by a context-free grammar (CFG).
Then you create a parser on the basis of this given CFG. The lexical token stream is given as an input to the parser. The parser organizes the lexical tokens into a tree by using some parsing algorithm.
The parser emits commands for syntax tree construction based on the rules it uses during parsing. On entering into a rule a command "rule X enter" is emitted, on exiting a rule a command "exit X rule" is emitted. When you accept a lexical token then a "token forward" is emitted with its lexeme characters. Some grammars, namely these in ABNF format, support repetitions of elements. Depending from these repetitions the syntax tree might be represented as lists or arrays.
Then a builder module receives this commands and builds a tree, or uses the commands for the specific task with the listener pattern.
I have co-authored (2021) a paper describing a list of commands for building a concrete/abstract syntax trees, depending on CFG's structure, that are used in the parsers generated by parser generator Tunnel Grammar Studio.
The paper is named "Тhe Expressive Power of the Statically Typed Concrete Syntax Trees". It is in an open-access journal (intentionally). The commands are in section "4.3 Syntax Structure Construction Commands". The article is bit "compressed", due to space limitations, and it is not really intended to be a software development guide, but to note the taken approach. It might give you some ideas.
Another co-authored paper of mine, from 2021 named "A Parsing Machine Architecture Encapsulating Different Parsing Approaches" (also in open-access journal) describes a general form of parsing machine and its modules. There Fig.1, p.33, will give you a quick description.
Disclaimer: I have made the parser generator.

Datalog code not working in DrRacket

I am trying to run this prolog code in DrRacket: http://www.anselm.edu/homepage/mmalita/culpro/graf1.html
#lang datalog
arc(a,b).
arc(b,c).
arc(a,c).
arc(a,d).
arc(b,e).
arc(e,f).
arc(b,f).
arc(f,g).
pathall(X,X,[]).
pathall(X,Y,[X,Z|L]):- arc(X,Z),pathall(Z,Y,L). % error on this line;
pathall(a,g)?
However, it is giving following error:
read: expected a `]' to close `['
I suspect '|' symbol is not being read as head-tail separator of the list. Additionally, [] is also giving error (if subsequent line is removed):
#%app: missing procedure expression;
probably originally (), which is an illegal empty application in: (#%app)
How can these be corrected so that the code works and searches for paths between a and g ?
The Datalog module in DrRacket is not an implementation of Prolog, and the syntax that you have used is not allowed (see the manual for the syntax allowed).
In particular terms cannot be data structures like lists ([]). To run a program like that of above you need a Prolog interpreter with data structures.
What you can do is define for instance a predicate path, like in the example that you have linked:
path(X,Y):- arc(X,Y).
path(X,Y):- arc(X,Z),path(Z,Y).
and, for instance, ask if a path exists or not, as in:
path(a,g)?
or print all the paths to a certain node with
path(X,g)?
etc.

Reusing assigned labels in rule definition in ANTLR

I have a rule to match 'FOR "hi" FOR'
rule : id1=ELEMENT STRING id1
{
// action
}
-> ^(Tree rule)
but it fails saying reference to undefined rule: id1
How can I reuse a label to ensure the start and end of the rule are the same identifier
The recommended way to handle this is assume the values match while parsing, and then examine the AST after parsing is complete, issuing error messages at that time for any mismatched elements.
This approach results in a more robust parser and much much understandable error messages in the case of an wrote.

In XPath is it possible to use the OR operator with node names?

With XPath, I know that you can use the union operator in the following way:
//*[#id="Catalog"]/thead | //*[#id="Catalog"]/tbody
This seems to be a little awkward to me though. Is there a way to do something similar to one of these instead?
//*[#id="Catalog"]/(thead|tbody)
//*[#id="Catalog"]/(thead or tbody)
//*[#id="Catalog"]/*[nodename="thead" or nodename="tbody"]
That seems a lot more readable and intuitive to me...
While the expression:
//*[#id="Catalog"]/*[name()="thead" or name()="tbody"]
is correct
This expression is more efficient:
//*[#id="Catalog"]/*[self::thead or self::tbody]
There is yet a third way to check if the name of an element is one of a specified sequence of strings:
//*[#id="Catalog"]/*[contains('|thead|tbody|',concat('|',name(),'|'))]
Using this last technique can be especially practical in case the number of possible names is very long (of unlimited and unknown length). The pipe-delimited string of possible names can even be passed as an external parameter to the transformation, which greatly increases its generality, re-usability and "DRY"-ness.
You are looking for the name() function:
//*[#id="Catalog"]/*[name()="thead" or name()="tbody"]
Note that with XPath 2.0 your attempt //*[#id="Catalog"]/(thead|tbody) is correct. That approach does not work however with XPath 1.0.
Yes you can use:
//*[#id="Catalog"]/[nodename='thead' or nodename='tbody']
EDIT:
Just re-read your original post and realised what you were asking. The above syntax wouldn't be correct for this situation. Not sure how to get the name of the node to use but nodename isn't right...

Odd behavior of parser when trying sample grammar

I'm trying to get the feel for antlr3, and i pasted the Expression evaluator into an ANTLRWorks window (latest version) and compiled it. It compiled successfully and started, but two problems:
Attempting to use a input of 1+2*4/3; resulted in the actual input for the parser being 1+2*43.
One of the errors it shows in it's graphical parser tree is MissingTokenException(0!=0).
As i'm new to antlr, can someone help?
The example you linked to doesn't support division (just look at the code, you'll notice there's no division here:
expr returns [int value]
: e=multExpr {$value = $e.value;}
( '+' e=multExpr {$value += $e.value;}
| '-' e=multExpr {$value -= $e.value;}
)*
We often get
MissingTokenException(0!=0)
when we make mistakes. I think it means that it cannot find a token it's looking for, and could be produced by an incorrect token. It's possible for the parser to "recover" sometimes depending on the grammar.
Remember also that the LEXER operates before the parser and your should check what tokens are actually passed to the parser. The AntlrWorks debugger can be very helpful here.

Resources