ANTLR3 rule rewriting in ANTLR4

I was upgrading my ANTLR3 grammar to ANTLR4 but found that rule rewriting is not supported in ANTLR4. I'd appreciate any advice on how to make the grammar below work in ANTLR4.
fragment date
: DATE (MINUS DATE)* -> ^(TO DATE+)
;
fragment simpleExpression
: expr (OR expr)* -> expr+
;
fragment simpleExpressionWithLiteral
: exprWithLiteral (OR exprWithLiteral)* -> exprWithLiteral+
;
fragment conditionalExpression
: orExpression -> ^(COND orExpression)?
;
fragment orExpression
: andExpression (OR^ andExpression)*
;
fragment andExpression
: atom (AND^ atom)*
;
fragment atom
: exprWithLiteral
| NOT exprWithLiteral -> ^(NOT exprWithLiteral)
| NOT LPAREN orExpression RPAREN-> ^(NOT orExpression)
| LPAREN orExpression RPAREN -> orExpression
;
fragment exprWithLiteral
: expr
| StringLiteral
;
fragment expr
: WORD
| NUMBER
;

The part after -> is not rule rewriting but tree rewriting. ANTLR3 produced an AST, which you could manually shape using this tree rewriting syntax. ANTLR4 no longer produces ASTs but parse trees, which you cannot change (as they represent the path taken through the grammar).
So the simple solution is to remove everything from -> up to the end of the alternative, along with the inline ^ operators (as in OR^ and AND^), which ANTLR4 does not support either. For example:
fragment date
: DATE (MINUS DATE)* -> ^(TO DATE+)
;
becomes
fragment date
: DATE (MINUS DATE)*
;
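If the AST shape that ^(TO DATE+) used to produce is still needed, the usual ANTLR4 route is to build it yourself while walking the parse tree with a listener or visitor. Below is a minimal Python sketch of that idea only. It does not use the ANTLR runtime: the Node class and the (kind, text) token pairs are hypothetical stand-ins for what a visitor would see among the children of the date rule.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical AST node: a label plus child nodes."""
    label: str
    children: List["Node"] = field(default_factory=list)


def date_to_ast(tokens: List[Tuple[str, str]]) -> Node:
    """Mimic the ANTLR3 rewrite '-> ^(TO DATE+)' for 'DATE (MINUS DATE)*'.

    Keeps every DATE token as a child of a TO root and drops the MINUS tokens.
    """
    dates = [Node(text) for kind, text in tokens if kind == "DATE"]
    return Node("TO", dates)


# Children of a 'date' parse tree node, flattened to (token type, text) pairs.
children = [("DATE", "2020-01-01"), ("MINUS", "-"), ("DATE", "2020-12-31")]
ast = date_to_ast(children)
print(ast.label, [c.label for c in ast.children])
```

In a real project the same logic would live in a visit method for the date rule, so the grammar stays free of tree-shaping concerns.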

Related

OCaml guards syntax after a value

I can't quite understand the syntax used here:
let rec lex = parser
(* Skip any whitespace. *)
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
Firstly, I don't understand what it means to use a guard (vertical line) followed by parser.
And secondly, I can't seem to find the relevant syntax for the condition surrounded by [< and >]
Got the code from here. Thanks in advance!
|
means: "or" (does the stream match this char, or this char, or ...?)
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
means:
IF the stream (one char, in this clause, but it can be a sequence of
several chars) matches "space" or "new line" or "carriage return" or
"tabulation".
THEN consume the ("white") matching character and call lex with the
rest of the stream.
ELSE use the next clause (in your example: "filtering A to Z and a to
z chars" for identifiers). As the matched character has been consumed
by this clause,
(btw, adding '\n\r', which is "newline + carriage return" would be better to address this historical case; you can do it as an exercise).
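The IF/THEN/ELSE reading above can be mimicked without any syntax extension. Here is a rough Python sketch (not OCaml, and not what camlp4 generates): the function lex and the iterator-based stream are made up for illustration, and only the whitespace clause is modelled.

```python
from itertools import chain

WHITESPACE = {" ", "\n", "\r", "\t"}


def lex(stream):
    """Consume leading whitespace, then hand back the rest of the stream.

    'stream' is any iterator over characters. Each whitespace character is
    consumed (the THEN branch); the first other character ends the skipping
    (the ELSE branch) and is put back in front of the remaining stream.
    """
    for ch in stream:
        if ch in WHITESPACE:
            continue  # THEN: character consumed, keep lexing the rest
        return chain([ch], stream)  # ELSE: a later clause would handle ch
    return iter(())  # empty stream: nothing left to lex


rest = "".join(lex(iter(" \t\nfib(x)")))
print(rest)
```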
To be able to parse streams in OCaml with this syntax, you need the modules from OCaml stdlib (at least Stream and Buffer) and you need the camlp4 or camlp5 syntax extension system that knows the meaning of the keywords parser, [<', etc.
In your toplevel, you can do as follows:
#use "topfind";; (* useless if already in your ~/.ocamlinit file *)
#camlp4o;; (* Topfind directive to load camlp4o in the Toplevel *)
# let st = Stream.of_string "OCaml";;
val st : char Stream.t = <abstr>
# Stream.next st;;
- : char = 'O'
# Stream.next st;;
- : char = 'C'
(* btw, the exception Stdlib.Stream.Failure (empty stream) must be handled *)
(* just the beginning of the parser definition *)
# let rec lex = parser
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
| [< >] -> [< >];;
val lex : char Stream.t -> 'a = <fun>
Now you are up and running to deal with streams and LL(1) stream parsers.
The example you mentioned works well. If you play within the Toplevel, you can evaluate the token.ml and lexer.ml files with the #use directive to respect the module names (#use "token.ml"). Or you can directly evaluate the expressions of lexer.ml if you nest the type token in a module Token.
# let rec lex = parser (* complete definition *)
val lex : char Stream.t -> Token.token Stream.t = <fun>
val lex_number : Buffer.t -> char Stream.t -> Token.token Stream.t = <fun>
val lex_ident : Buffer.t -> char Stream.t -> Token.token Stream.t = <fun>
val lex_comment : char Stream.t -> Token.token Stream.t = <fun>
# let pgm =
"def fib(x) \
if x < 3 then \
1 \
else \
fib(x-1)+fib(x-2)";;
val pgm : string = "def fib(x) if x < 3 then 1 else fib(x-1)+fib(x-2)"
# let cs' = lex (Stream.of_string pgm);;
val cs' : Token.token Stream.t = <abstr>
# Stream.next cs';;
- : Token.token = Token.Def
# Stream.next cs';;
- : Token.token = Token.Ident "fib"
# Stream.next cs';;
- : Token.token = Token.Kwd '('
# Stream.next cs';;
- : Token.token = Token.Ident "x"
# Stream.next cs';;
- : Token.token = Token.Kwd ')'
You get the expected stream of tokens.
Now a few technical words about camlp4 and camlp5.
It's indeed recommended not to use the so-called "camlp4", which is being deprecated, and to use "camlp5" instead, which is in fact the "genuine camlp4" (see below), assuming you want to use an LL(1) parser.
For that, you can use the following camlp5 Toplevel directives instead of the camlp4 one:
#require "camlp5";; (* add the path + loads the module (topfind directive) *)
#load "camlp5o.cma";; (* patch: manually loads the camlp5o module,
because #require forgets to do it (why?)
"o" in "camlp5o" stands for "original syntax" *)
# let rec lex = parser
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
| [< >] -> [< >];;
val lex : char Stream.t -> 'a = <fun>
More history about camlp4 and camlp5.
Disclaimer: while I try to be as neutral and factual as possible, this all-too-short explanation may also reflect my personal opinion. Of course, discussion is welcome.
As an OCaml beginner, I found camlp4 very attractive and powerful, but it was not easy to work out what exactly camlp4 was and to find its most recent documentation.
Very briefly:
It's an old and confusing story, mainly because of the naming of "camlp4". camlp4 is a/the historical syntax extension system for OCaml. Someone decided to improve/retrofit camlp4 around 2006, but it seems that some design decisions turned it into something that some people consider a "beast" (often, less is more). So, it works, but "there is a lot of stuff under the hood" (its signature became very large).
Its original author, Daniel de Rauglaudre, decided to keep developing camlp4 his way and renamed it "camlp5" to differentiate it from the "new camlp4" (which kept the name camlp4). Even if camlp5 is not widely used, it's still maintained and operational, and it is used, for example, by Coq, which has recently integrated a part of camlp5 instead of depending on the whole camlp5 library (which doesn't mean that "Coq doesn't use camlp5 anymore", as you may read).
ppx has become a mainstream syntax extension technology in the OCaml world (it seems to be dedicated to making "limited and reliable" OCaml syntax extensions, mainly for small and very useful code generation such as helper functions; that's a side discussion). It doesn't mean that camlp5 is "deprecated". camlp5 is certainly misunderstood. I had a hard time at the beginning, mainly because of its documentation. I wish I could have read this post at that time! Anyway, when programming in OCaml, I believe it's a good thing to explore all kinds of technology. It's up to you to form your own opinion.
So, today's so-called "camlp4" is in fact the "old camlp4" (or the "new camlp4 of the past"; I know, it's complicated).
LR-family parser generators such as ocamlyacc (LALR(1)) or menhir (LR(1)) are or have been made mainstream. They take a bottom-up approach (define .mll and .mly files, then compile them to OCaml code).
LL(1) parsers, such as camlp4/camlp5, take a top-down approach, very close to functional style.
The best thing is to compare them yourself. Implementing a lexer/parser for your language is perfect for that: with ocamllex/menhir and with ocamllex/camlp5, or even with only camlp5 because it's also a lexer (with pros/cons).
I hope you'll enjoy your LLVM tutorial.
All technical and historical complementary comments are very welcome.
As @glennsl says, this page uses the camlp4 preprocessor, which is considered obsolete by many in the OCaml community.
Here is a forum message from August 2019 that describes how to move from camlp4 to the more recent ppx:
The end of Camlp4
Unfortunately that doesn't really help you learn what that LLVM page is trying to teach you, which has little to do with OCaml it seems.
This is one reason I find the use of syntax extensions to be problematic. They don't have the staying power of the base language.
(On the other hand, OCaml really is a fantastic language for writing compilers and other language tools.)

Grammar LaTeX like with mixed whitespace utf and commands

I've tried to implement a LaTeX like grammar that could allow me to parse this kind of sentence :
\title{Un pré é"'§è" \VAR state \draw( 200\if{expression kjlkjé} ) bis tèr }
As you can see, \title{ } can contain several kinds of items:
a UTF-8 string without quotes and with whitespace, which I'd like to keep in one token
a variable call such as: \variable_name
some \keyword followed by parentheses, or others followed by braces: for instance \draw( utf8 \var \if{ } ... ) or \if{ idem }.
These items can be nested.
I took inspiration from the XML parser presented in the ANTLR 4 book and tried to use modes. I have a problem with the recognition of closing braces and closing parentheses. I also have a problem with some whitespace, for instance the space that follows \variable_name (I get: extraneous input ' ').
Here is my lexer grammar:
lexer grammar OEFLexer;
// Default mode rules (the SEA)
SEA_WS : (' '|'\t'|'\r'? '\n')+ ;
TITLE : '\\title';
OB : '{';
OP : '(';
BSLASH : '\\' -> mode(CALLREFERENCE) ;
TEXT : ~[\\({]+; // clump all text together
// ----------------- Everything Callreference ---------------------
mode CALLREFERENCE;
CLOSECALLVAR : ' ' -> mode(DEFAULT_MODE) ; // back to SEA mode
CB : '}' -> mode(DEFAULT_MODE) ; // back to SEA mode
CP : ')' -> mode(DEFAULT_MODE) ; // back to SEA mode
DRAW : 'draw' OP;
IF : 'if' OB;
ID : [a-zA-Z]+ ; // match/send ID in tag to parser
Here is my parser grammar:
parser grammar OEFParser;
options { tokenVocab=OEFLexer; }
document: TITLE OB ( callreference | string )* CB;
string : TEXT;
var : ID;
commandDraw : DRAW ( callreference | string )* CP ;
commandIf : IF ( callreference | string )* CB ;
callreference : BSLASH ID | BSLASH commandDraw CP | BSLASH commandIf CP;
When I try to parse the \title code mentioned at the beginning, I get:
line 1:25 extraneous input ' ' expecting {'\', TEXT, '}'}
line 1:37 extraneous input ' ' expecting {'\', TEXT, ')'}
line 1:45 mismatched input 'expression' expecting {'\', TEXT, '}'}
line 1:75 extraneous input '<EOF>' expecting {'\', TEXT, ')'}
With this parse tree generated by grun
Thanks for helping me tackle this issue.
Chris
The problem is the space after expression (marked with ^):
\title{Un pré é"'§è" \VAR state \draw( 200\if{expression kjlkjé} ) bis tèr }
                                                        ^
which causes the mode to go back to the DEFAULT_MODE:
CLOSECALLVAR : ' ' -> mode(DEFAULT_MODE) ;
Something that you don't want because you're (obviously) still in the CALLREFERENCE context.
One way to handle this is to use the -> pushMode(...) and -> popMode commands, which cause a stack of CALLREFERENCE modes to be built. Whenever you stumble upon a \... ( or a \... { you push a new CALLREFERENCE onto this stack, and then pop one off when you see a ) or }.
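The push/pop bookkeeping described above can be seen in isolation with a small Python sketch. It is a simplification and not how ANTLR is implemented: the function simulate_modes is hypothetical, and it treats every { or ( as an opener, whereas a real lexer grammar would only push when the delimiter follows a command such as \draw or \if.

```python
def simulate_modes(text):
    """Track a lexer mode stack while scanning 'text'.

    Returns the stack depth recorded after every push or pop, plus the
    final stack. Spaces never touch the stack, which is the whole point.
    """
    stack = ["DEFAULT_MODE"]
    depths = []
    for ch in text:
        if ch in "{(":
            stack.append("CALLREFERENCE")  # like -> pushMode(CALLREFERENCE)
            depths.append(len(stack) - 1)
        elif ch in "})":
            stack.pop()                    # like -> popMode
            depths.append(len(stack) - 1)
    return depths, stack


depths, final_stack = simulate_modes(r"\title{a \VAR b \draw( 200\if{expression x} ) c }")
print(depths)
print(final_stack)
```

Because the space after expression leaves the stack alone, the } that follows still closes the \if{ that opened it, and the outer ) and } are matched in turn.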
A quick lexer grammar demo:
lexer grammar OEFLexer;
TITLE : '\\title' S? OB -> pushMode(CALLREFERENCE);
fragment OB : '{';
fragment OP : '(';
fragment S : [ \t\r\n]+;
mode CALLREFERENCE;
CB : '}' -> popMode;
CP : ')' -> popMode;
DRAW : '\\draw' S? OP -> pushMode(CALLREFERENCE);
IF : '\\if' S? OB -> pushMode(CALLREFERENCE);
BSLASH : '\\';
ID : [a-zA-Z]+;
CR_OTHER : .;
and the parser grammar:
parser grammar OEFParser;
options { tokenVocab=OEFLexer; }
document
: TITLE ( callreference | string )* CB EOF
;
string
: CR_OTHER+
| ID
;
commandDraw
: DRAW ( callreference | string )* CP
;
commandIf
: IF ( callreference | string )* CB
;
callreference
: BSLASH ID
| commandDraw
| commandIf
;
Parsing your example input will result in the following parse tree:

Antlr4 Spaces within assignment

I'm trying to write a simple parser in ANTLR 4 that'll be able to handle stuff like this:
java.lang.String dataSourceName=FOO
java.lang.Long dataLoadTimeout=30000
This is what I put in my .g4 file:
cfg : (paramAssign NEWLINE)* ;
paramAssign : paramDecl '=' paramVal ;
paramDecl : javaType paramName ;
paramName : SIMPLEID ;
paramVal : PARAMVAL ;
javaType : JAVATYPE ;
SIMPLEID : [a-zA-Z_][a-zA-Z0-9_]* ;
PARAMVAL : [0-9a-zA-Z_]+ ;
JAVATYPE : SIMPLEID ('.' SIMPLEID)* ;
NEWLINE : '\n' ;
When I run on inputs above, I get:
line 1:16 token recognition error at: ' '
line 2:14 token recognition error at: ' '
line 1:32 mismatched input 'FOO' expecting PARAMVAL
I know that there are precedence rules that ANTLR's lexer & parser follow but it's not clear to me how I'm violating them. For some reason it doesn't like the string FOO although FOO clearly conforms to the PARAMVAL rule. Also, when I put spaces before & after equals signs I get:
token recognition error at: ' '
for each space I've added. Sorry, but I'm really baffled.
FOO is matched as a SIMPLEID token, not a PARAMVAL token. That is just how ANTLR works: the lexer picks the rule that matches the most characters, and whenever 2 (or more) lexer rules match the same number of characters, the rule defined first wins (SIMPLEID in your case).
So if you let paramVal also match a SIMPLEID, the error would go away:
paramVal : SIMPLEID | PARAMVAL ;
For the recognition error at: ' ' to disappear, you'd have to match space chars as well:
SPACE : [ \t]+ -> skip ;
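The rule selection described in this answer (longest match, ties going to the rule defined first) can be reproduced with a short Python sketch. This only imitates the selection logic with regular expressions; it is not ANTLR's implementation, and the function name token_type is made up. The three patterns are transcriptions of the lexer rules from the question.

```python
import re

# Lexer rules in the order they appear in the grammar.
RULES = [
    ("SIMPLEID", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("PARAMVAL", r"[0-9a-zA-Z_]+"),
    ("JAVATYPE", r"[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*"),
]


def token_type(text, pos=0):
    """Return the name of the rule that wins at 'pos', or None.

    The longest match wins; the strict '>' comparison keeps the earlier
    rule when two rules match the same number of characters. For these
    simple patterns, re's greedy match is also the longest match.
    """
    best = None
    for name, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best[0] if best else None


for sample in ("FOO", "30000", "java.lang.String"):
    print(sample, "->", token_type(sample))
```

FOO ties between all three rules and goes to SIMPLEID because it is listed first, while 30000 can only be a PARAMVAL, which is why the second input line parsed without a mismatch.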

Xtext grammar QualifiedName ambiguity

I have the following problem. Part of my grammar looks like this
RExpr
: SetOp
;
SetOp returns RExpr
: PrimaryExpr (({Union.left=current} '+'|{Difference.left=current} '-'|{Intersection.left=current} '&') right = PrimaryExpr)*
;
PrimaryExpr returns RExpr
: '(' RExpr ')'
| (this = 'this.')? slot = [Slot | QualifiedName]
| (this = 'this' | ensName = [Ensemble | QualifiedName])
| 'All'
;
When generating the Xtext artifacts, ANTLR reports that, due to an ambiguity, it disables an alternative (3). The ambiguity arises because slot and ensemble share QualifiedName. How do I refactor this kind of case? I guess a syntactic predicate won't help here, since it would force only one of them (Slot/Ensemble) to be resolved.
Thanks.
Xtext can't choose between your two references slot and ensemble.
You can merge these references into one reference by adding this rule to your grammar:
SlotOrEnsemble:
Slot | Ensemble
;
Then your PrimaryExpr rule will be something like:
PrimaryExpr returns RExpr
: '(' RExpr ')'
| ((this = 'this.')? ref= [SlotOrEnsemble | QualifiedName])
| this = 'this'
| 'All'
;

ANTLR: field access and evaluation

I'm trying to write a piece of grammar to express field access for a hierarchical structure, something like a.b.c where c is a field of a.b and b is a field of a.
To evaluate the value of a.b.c.d.e we need to evaluate the value of a.b.c.d and then get the value of e.
To evaluate the value of a.b.c.d we need to evaluate the value of a.b.c and then get the value of d and so on...
If you have a tree like this (the arrow means "lhs is parent of rhs"):
Node(e) -> Node(d) -> Node(c) -> Node(b) -> Node(a)
the evaluation is quite simple. Using recursion, we just need to resolve the value of the child and then access the correct field.
The problem is: I have these 3 rules in my ANTLR grammar file:
tokens {
LBRACE = '{' ;
RBRACE = '}' ;
LBRACK = '[' ;
RBRACK = ']' ;
DOT = '.' ;
....
}
reference
: DOLLAR LBRACE selector RBRACE -> ^(NODE_VAR_REFERENCE selector)
;
selector
: IDENT access -> ^(IDENT access)
;
access
: DOT IDENT access? -> ^(IDENT<node=com.at.cson.ast.FieldAccessTree> access?)
| LBRACK IDENT RBRACK access? -> ^(IDENT<node=com.at.cson.ast.FieldAccessTree> access?)
| LBRACK INTEGER RBRACK access? -> ^(INTEGER<node=com.at.cson.ast.ArrayAccessTree> access?)
;
As expected, my tree has this form:
ReferenceTree
IdentTree[a]
FieldAccessTree[b]
FieldAccessTree[c]
FieldAccessTree[d]
FieldAccessTree[e]
The evaluation is not as easy as in the other case because I need to get the value of the current node and then give it to the child and so on...
Is there any way to reverse the order of the tree using ANTLR, or do I need to do it manually?
You can only do this by using the inline tree operator ^ (see footnote 1) instead of a rewrite rule.
A demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
LBRACK = '[' ;
RBRACK = ']' ;
DOT = '.' ;
}
parse
: selector+ EOF -> ^(ROOT selector+)
;
selector
: IDENT (access^)*
;
access
: DOT IDENT -> IDENT
| LBRACK IDENT RBRACK -> IDENT
| LBRACK INTEGER RBRACK -> INTEGER
;
IDENT : 'a'..'z'+;
INTEGER : '0'..'9'+;
SPACE : ' ' {skip();};
Parsing the input:
a.b.c a[1][2][3]
will produce the following AST:
1. For more info about inline tree operators and rewrite rules, see: How to output the AST built using ANTLR?
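To see why the tree shape produced by (access^)* makes evaluation easy, here is a small Python sketch of the same idea outside ANTLR. The tuple-based tree and the functions nest and evaluate are hypothetical; the sketch only assumes that every access becomes the new root with the tree built so far as its single child, and it looks fields up in nested dictionaries.

```python
from functools import reduce


def nest(names):
    """Fold ['a', 'b', 'c'] into ('c', ('b', ('a', None))).

    Each name becomes the new root and the tree built so far becomes its
    child, mirroring what the inline ^ operator does for every access.
    """
    return reduce(lambda child, name: (name, child), names, None)


def evaluate(tree, env):
    """Resolve the child first, then access this node's field on the result."""
    name, child = tree
    if child is None:
        return env[name]  # innermost node: a plain variable lookup
    return evaluate(child, env)[name]


env = {"a": {"b": {"c": 42}}}
tree = nest(["a", "b", "c"])
print(tree)
print(evaluate(tree, env))
```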

Resources