LaTeX-like grammar with mixed whitespace, UTF-8 and commands - whitespace

I've tried to implement a LaTeX-like grammar that would let me parse this kind of sentence:
\title{Un pré é"'§è" \VAR state \draw( 200\if{expression kjlkjé} ) bis tèr }
As you can see, \title{ } can contain several kinds of items:
a UTF-8 string without quotes and with whitespace, which I'd like to keep in one token
a variable call such as \variable_name
a \keyword followed by parentheses or by braces, for instance \draw( utf8 \var \if{ } ... ) or \if{ idem }.
These items can be nested.
I took inspiration from the XML parser presented in the ANTLR 4 book and tried to use lexer modes. I have a problem with the recognition of closing braces and closing parentheses. I also have a problem with some whitespace, for instance the space that follows \variable_name (I get: extraneous input ' ').
Here is my lexer grammar:
lexer grammar OEFLexer;
// Default mode rules (the SEA)
SEA_WS : (' '|'\t'|'\r'? '\n')+ ;
TITLE : '\\title';
OB : '{';
OP : '(';
BSLASH : '\\' -> mode(CALLREFERENCE) ;
TEXT : ~[\\({]+; // clump all text together
// ----------------- Everything Callreference ---------------------
mode CALLREFERENCE;
CLOSECALLVAR : ' ' -> mode(DEFAULT_MODE) ; // back to SEA mode
CB : '}' -> mode(DEFAULT_MODE) ; // back to SEA mode
CP : ')' -> mode(DEFAULT_MODE) ; // back to SEA mode
DRAW : 'draw' OP;
IF : 'if' OB;
ID : [a-zA-Z]+ ; // match/send ID in tag to parser
Here is my parser grammar:
parser grammar OEFParser;
options { tokenVocab=OEFLexer; }
document: TITLE OB ( callreference | string )* CB;
string : TEXT;
var : ID;
commandDraw : DRAW ( callreference | string )* CP ;
commandIf : IF ( callreference | string )* CB ;
callreference : BSLASH ID | BSLASH commandDraw CP | BSLASH commandIf CP;
When I try to parse the \title code mentioned at the beginning, I obtain:
line 1:25 extraneous input ' ' expecting {'\', TEXT, '}'}
line 1:37 extraneous input ' ' expecting {'\', TEXT, ')'}
line 1:45 mismatched input 'expression' expecting {'\', TEXT, '}'}
line 1:75 extraneous input '<EOF>' expecting {'\', TEXT, ')'}
[Figure: parse tree generated by grun]
Thanks for helping me tackle this issue.
Chris

The problem is the space after expression (the one right before kjlkjé):
\title{Un pré é"'§è" \VAR state \draw( 200\if{expression kjlkjé} ) bis tèr }
                                                        ^
which causes the mode to go back to the DEFAULT_MODE:
CLOSECALLVAR : ' ' -> mode(DEFAULT_MODE) ;
Something that you don't want because you're (obviously) still in the CALLREFERENCE context.
One way to handle this is to use the -> pushMode(...) and -> popMode directives, which cause a stack of CALLREFERENCE modes to be created. Whenever you encounter a \... ( or a \... { you push a new CALLREFERENCE onto this stack, and you pop one off when you see a ) or }.
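To make the stack idea concrete, here is a minimal Java sketch that only mimics the bookkeeping; it is not ANTLR's lexer. The class name ModeStackSketch and the method depths are made up for illustration. It assumes every { or ( pushes a mode, every } or ) pops one, and a space changes nothing:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: imitates pushMode/popMode bookkeeping only, not a real lexer. */
final class ModeStackSketch {

    /**
     * Returns the mode-stack depth after each character of the input.
     * Assumption: '{' and '(' stand in for tokens carrying "-> pushMode(CALLREFERENCE)",
     * '}' and ')' for tokens carrying "-> popMode". Every other character is ignored.
     */
    static int[] depths(String input) {
        Deque<String> modes = new ArrayDeque<>();
        int[] result = new int[input.length()];
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{' || c == '(') {
                modes.push("CALLREFERENCE");
            } else if ((c == '}' || c == ')') && !modes.isEmpty()) {
                modes.pop();
            }
            result[i] = modes.size();
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "\\title{a \\draw( 200\\if{expression x} ) b }";
        int[] d = depths(input);
        int spaceAfterExpression = input.indexOf("expression") + "expression".length();
        // Still three levels deep at the space: nothing was popped.
        System.out.println(d[spaceAfterExpression]); // prints 3
        // Every opener has been closed by the end of the input.
        System.out.println(d[d.length - 1]);         // prints 0
    }
}
```

The point of the sketch is that the space after expression leaves the stack untouched, whereas a plain -> mode(DEFAULT_MODE) on a space forgets how deeply nested you are.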
A quick lexer grammar demo:
lexer grammar OEFLexer;
TITLE : '\\title' S? OB -> pushMode(CALLREFERENCE);
fragment OB : '{';
fragment OP : '(';
fragment S : [ \t\r\n]+;
mode CALLREFERENCE;
CB : '}' -> popMode;
CP : ')' -> popMode;
DRAW : '\\draw' S? OP -> pushMode(CALLREFERENCE);
IF : '\\if' S? OB -> pushMode(CALLREFERENCE);
BSLASH : '\\';
ID : [a-zA-Z]+;
CR_OTHER : .;
and the parser grammar:
parser grammar OEFParser;
options { tokenVocab=OEFLexer; }
document
: TITLE ( callreference | string )* CB EOF
;
string
: CR_OTHER+
| ID
;
commandDraw
: DRAW ( callreference | string )* CP
;
commandIf
: IF ( callreference | string )* CB
;
callreference
: BSLASH ID
| commandDraw
| commandIf
;
Parsing your example input will result in the following parse tree:

Related

ANTLR3 rule rewriting in ANTLR4

I was upgrading my ANTLR3 grammar to ANTLR4 but found that rule rewriting is not supported in ANTLR4. I'd appreciate any advice on how to make the grammar below work in ANTLR4.
fragment date
: DATE (MINUS DATE)* -> ^(TO DATE+)
;
fragment simpleExpression
: expr (OR expr)* -> expr+
;
fragment simpleExpressionWithLiteral
: exprWithLiteral (OR exprWithLiteral)* -> exprWithLiteral+
;
fragment conditionalExpression
: orExpression -> ^(COND orExpression)?
;
fragment orExpression
: andExpression (OR^ andExpression)*
;
fragment andExpression
: atom (AND^ atom)*
;
fragment atom
: exprWithLiteral
| NOT exprWithLiteral -> ^(NOT exprWithLiteral)
| NOT LPAREN orExpression RPAREN-> ^(NOT orExpression)
| LPAREN orExpression RPAREN -> orExpression
;
fragment exprWithLiteral
: expr
| StringLiteral
;
fragment expr
: WORD
| NUMBER
;
The part after -> is not rule rewriting but tree rewriting. ANTLR3 produced an AST which you could manually change using this tree rewriting syntax. ANTLR4 no longer produces ASTs but parse trees, which you cannot change (as they represent the path taken through the grammar).
So the simple solution is to remove the -> and everything after it in each alternative. The inline ^ operators (as in OR^ and AND^) are AST syntax too and have to be removed as well. For example:
fragment date
: DATE (MINUS DATE)* -> ^(TO DATE+)
;
becomes
fragment date
: DATE (MINUS DATE)*
;

Antlr4 Spaces within assignment

I'm trying to write a simple parser in ANTLR 4 that'll be able to handle stuff like this:
java.lang.String dataSourceName=FOO
java.lang.Long dataLoadTimeout=30000
This is what I put in my .g4 file:
cfg : (paramAssign NEWLINE)* ;
paramAssign : paramDecl '=' paramVal ;
paramDecl : javaType paramName ;
paramName : SIMPLEID ;
paramVal : PARAMVAL ;
javaType : JAVATYPE ;
SIMPLEID : [a-zA-Z_][a-zA-Z0-9_]* ;
PARAMVAL : [0-9a-zA-Z_]+ ;
JAVATYPE : SIMPLEID ('.' SIMPLEID)* ;
NEWLINE : '\n' ;
When I run it on the inputs above, I get:
line 1:16 token recognition error at: ' '
line 2:14 token recognition error at: ' '
line 1:32 mismatched input 'FOO' expecting PARAMVAL
I know that there are precedence rules that ANTLR's lexer & parser follow but it's not clear to me how I'm violating them. For some reason it doesn't like the string FOO although FOO clearly conforms to the PARAMVAL rule. Also, when I put spaces before & after equals signs I get:
token recognition error at: ' '
for each space I've added. Sorry, but I'm really baffled.
FOO is matched as a SIMPLEID token, not a PARAMVAL token. That is just how ANTLR works: the lexer prefers the longest match, and whenever two (or more) lexer rules match the same number of characters, the rule defined first wins (SIMPLEID in your case).
So if you let paramVal also match a SIMPLEID, the error would go away:
paramVal : SIMPLEID | PARAMVAL ;
For the recognition error at: ' ' to disappear, you'd have to match space chars as well:
SPACE : [ \t]+ -> skip ;
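The tie-breaking described above can be imitated in a few lines of Java. This is only a sketch built on java.util.regex, not ANTLR's actual lexer; the class LongestMatchSketch, the method tokenize and the token name EQUALS (standing in for the '=' literal) are assumptions made for illustration. Rules are tried in grammar order, the longest match wins, and on a tie the earlier rule is kept:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: longest match first, earliest rule on a tie. */
final class LongestMatchSketch {

    // Insertion order mirrors the order of the lexer rules in the question.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        String id = "[a-zA-Z_][a-zA-Z0-9_]*";
        RULES.put("EQUALS", Pattern.compile("="));
        RULES.put("SIMPLEID", Pattern.compile(id));
        RULES.put("PARAMVAL", Pattern.compile("[0-9a-zA-Z_]+"));
        RULES.put("JAVATYPE", Pattern.compile(id + "(\\." + id + ")*"));
        RULES.put("SPACE", Pattern.compile("[ \\t]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly greater: a later rule only wins if it matches MORE characters.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = rule.getKey();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("token recognition error at index " + pos);
            }
            if (!bestRule.equals("SPACE")) { // SPACE is skipped, like "-> skip"
                tokens.add(bestRule + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("java.lang.String dataSourceName=FOO"));
        System.out.println(tokenize("java.lang.Long dataLoadTimeout=30000"));
    }
}
```

With these rules FOO comes out as SIMPLEID (tie with PARAMVAL, earlier rule wins) while 30000 comes out as PARAMVAL (SIMPLEID cannot start with a digit), which is why paramVal has to accept both token types.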

stuck making simple grammar for filter language

I have tried and gotten close, but I keep getting stuck. The input language looks like this:
('aaa' eq '42') and ('bbb' gt 'zzz') or (....) and (....)
i.e. a set of clauses of the form left op right, joined by 'and' or 'or'. There can be one or more clauses.
This seemed simple to me, but I am sure I have started making it too complicated.
grammar filter;
options {
language=CSharp2;
output=AST;
}
tokens {
ROOT;
}
OPEN_PAREN
: '(';
CLOSE_PAREN
: ')';
SINGLE_QUOTE
: '\'' ;
AND : 'and';
OR : 'or';
GT : 'gt';
GE : 'ge';
EQ : 'eq';
LT : 'lt';
LE : 'le';
fragment
ID : ('a'..'z' | 'A'..'Z' )+;
STRING : SINGLE_QUOTE ID SINGLE_QUOTE;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = Hidden; } ;
//public root : filter -> ^(ROOT filter);
public filter
: clause^
| lc=clause join rc=clause ->^(join $lc $rc)
;
left : STRING;
right : STRING;
clause
: OPEN_PAREN left op right CLOSE_PAREN //-> ^(op left right)
;
join : AND
| OR
;
op : GT|GE|LT|LE|EQ;
When I run this in C# I get 'more than one node as root'.
Also, I am not sure how I can do the N joins.

antlr3: Java heap space when testing parser

I'm trying to build a simple config-file reader to read files of this format:
A .-
B -...
C -.-.
D -..
E .
This is the grammar I have so far:
grammar def;
@header {
package mypackage.parser;
}
@lexer::header { package mypackage.parser; }
file
: line+;
line : ID WS* CODE NEWLINE;
ID : ('A'..'Z')*
;
CODE : ('-'|'.')*;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
) {$channel=HIDDEN;}
;
NEWLINE:'\r'? '\n' ;
And this is my test rig (junit4)
@Test
public void BasicGrammarCheckGood() {
String CorrectlyFormedLine="A .-;\n";
ANTLRStringStream input;
defLexer lexer;
defParser parser;
input = new ANTLRStringStream(CorrectlyFormedLine);
lexer = new defLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
parser = new defParser(tokens);
try {
parser.line();
}
catch(RecognitionException re) { fail(re.getMessage()); }
}
If I run this test with a correctly formatted string, the code exits without any exception or output.
However, if I feed the parser an invalid string such as "xA .-;\n", the code spins for a while and then exits with a "Java heap space" error.
(If I start my test with the top-level rule 'file', I get the same result, with the additional (repeated) output of "line 1:0 mismatched input '' expecting CODE".)
What's going wrong here? I never seem to get the RecognitionException for the invalid input.
EDIT: Here's my grammar file (fragment) after applying the advice given here; this avoids the 'Java heap space' issue.
file
: line+ EOF;
line : ID WS* CODE NEWLINE;
ID : ('A'..'Z')('A'..'Z')*
;
CODE : ('-'|'.')('-'|'.')*;
Some of your lexer rules match zero characters (an empty string):
ID : ('A'..'Z')*
;
CODE : ('-'|'.')*;
There is, of course, an infinite number of empty strings in your input, so your lexer keeps producing tokens, which results in a heap space error after a while.
Always let lexer rules match at least 1 character.
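A rough Java sketch of why an empty match is fatal: the loop below applies one rule at the current position over and over, the way a lexer loop would. It is an illustration using java.util.regex, not the code ANTLR generates; the class EmptyMatchSketch, the method lex and the maxTokens guard are made up for this demo (a real lexer has no such guard, hence the heap error).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: a one-rule lexer loop with a safety cap. */
final class EmptyMatchSketch {

    /**
     * Collects tokens until the input is consumed, no match is possible,
     * or maxTokens tokens were produced (guard for this demo only).
     */
    static List<String> lex(String rule, String input, int maxTokens) {
        Pattern p = Pattern.compile(rule);
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length() && tokens.size() < maxTokens) {
            Matcher m = p.matcher(input).region(pos, input.length());
            if (!m.lookingAt()) {
                break; // nothing matches here: a real lexer would report an error
            }
            tokens.add(m.group());
            pos = m.end(); // after an empty match pos does not move
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "[A-Z]*" matches the empty string in front of 'x', so the loop never advances.
        System.out.println(lex("[A-Z]*", "xA", 1000).size()); // prints 1000
        // "[A-Z]+" needs at least one character, so it simply fails on 'x'.
        System.out.println(lex("[A-Z]+", "xA", 1000).size()); // prints 0
    }
}
```

Without the cap, the first call would keep appending empty tokens until memory runs out, which is the behaviour described above.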
EDIT
Two (small) remarks:
since you put the WS token on the hidden channel, you don't need to add them in your parser rules. So line becomes line : ID CODE NEWLINE;
something like ('A'..'Z')('A'..'Z')* can be written like this: ('A'..'Z')+

ANTLR: field access and evaluation

I'm trying to write a piece of grammar to express field access for a hierarchical structure, something like a.b.c where c is a field of a.b and b is a field of a.
To evaluate the value of a.b.c.d.e we need to evaluate the value of a.b.c.d and then get the value of e.
To evaluate the value of a.b.c.d we need to evaluate the value of a.b.c and then get the value of d, and so on...
If you have a tree like this (the arrow means "lhs is parent of rhs"):
Node(e) -> Node(d) -> Node(c) -> Node(b) -> Node(a)
the evaluation is quite simple. Using recursion, we just need to resolve the value of the child and then access the correct field.
The problem is: I have these 3 rules in my ANTLR grammar file:
tokens {
LBRACE = '{' ;
RBRACE = '}' ;
LBRACK = '[' ;
RBRACK = ']' ;
DOT = '.' ;
....
}
reference
: DOLLAR LBRACE selector RBRACE -> ^(NODE_VAR_REFERENCE selector)
;
selector
: IDENT access -> ^(IDENT access)
;
access
: DOT IDENT access? -> ^(IDENT<node=com.at.cson.ast.FieldAccessTree> access?)
| LBRACK IDENT RBRACK access? -> ^(IDENT<node=com.at.cson.ast.FieldAccessTree> access?)
| LBRACK INTEGER RBRACK access? -> ^(INTEGER<node=com.at.cson.ast.ArrayAccessTree> access?)
;
As expected, my tree has this form:
ReferenceTree
  IdentTree[a]
    FieldAccessTree[b]
      FieldAccessTree[c]
        FieldAccessTree[d]
          FieldAccessTree[e]
The evaluation is not as easy as in the other case, because I need to get the value of the current node and then pass it to the child, and so on...
Is there any way to reverse the order of the tree using ANTLR, or do I need to do it manually?
You can only do this by using the inline tree operator [1], ^, instead of a rewrite rule.
A demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
LBRACK = '[' ;
RBRACK = ']' ;
DOT = '.' ;
}
parse
: selector+ EOF -> ^(ROOT selector+)
;
selector
: IDENT (access^)*
;
access
: DOT IDENT -> IDENT
| LBRACK IDENT RBRACK -> IDENT
| LBRACK INTEGER RBRACK -> INTEGER
;
IDENT : 'a'..'z'+;
INTEGER : '0'..'9'+;
SPACE : ' ' {skip();};
Parsing the input:
a.b.c a[1][2][3]
will produce the following AST:
[1] For more info about inline tree operators and rewrite rules, see: How to output the AST built using ANTLR?
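Once the last accessed field sits at the root, the recursive evaluation described in the question becomes straightforward. Here is a minimal Java sketch under stated assumptions: Node is a made-up stand-in for the AST node classes, values live in nested java.util.Map instances, and array access is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical evaluator for a tree shaped like Node(c) -> Node(b) -> Node(a). */
final class FieldAccessSketch {

    /** A field name plus the optional child that has to be resolved first. */
    static final class Node {
        final String name;
        final Node child;

        Node(String name, Node child) {
            this.name = name;
            this.child = child;
        }
    }

    /** Resolves the child first, then looks up this node's field in the result. */
    static Object evaluate(Node node, Map<String, Object> scope) {
        if (node.child == null) {
            return scope.get(node.name); // leaf: a top-level variable such as "a"
        }
        Object parent = evaluate(node.child, scope);
        if (!(parent instanceof Map)) {
            throw new IllegalStateException("cannot access field '" + node.name + "'");
        }
        return ((Map<?, ?>) parent).get(node.name);
    }

    public static void main(String[] args) {
        Map<String, Object> b = new HashMap<>();
        b.put("c", 42);
        Map<String, Object> a = new HashMap<>();
        a.put("b", b);
        Map<String, Object> scope = new HashMap<>();
        scope.put("a", a);

        // a.b.c with the last access at the root: c -> b -> a
        Node tree = new Node("c", new Node("b", new Node("a", null)));
        System.out.println(evaluate(tree, scope)); // prints 42
    }
}
```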
