Problem with antlr grammar (lexical) - antlr3

I get a mismatched set exception when I try to parse "abc" (the quote marks are part of the input)
Here is the (simplified) grammar - pretty much verbatim from the Java.g example and basically the same from other example grammars. Is there some bug in the latest version? Using 3.2 in the context of eclipse.
Thanks in advance.
grammar String;
options {
language = C;
}
rule: literal EOF;
literal
: CHARLITERAL
| STRINGLITERAL
;
CHARLITERAL
: '\''
( EscapeSequence
| ~( '\'' | '\\' | '\r' | '\n' )
)
'\''
;
STRINGLITERAL
: '"'
( EscapeSequence
| ~( '\\' | '"' | '\r' | '\n' )
)*
'"'
;
fragment
EscapeSequence
: '\\' (
'b'
| 't'
| 'n'
| 'f'
| 'r'
| '\"'
| '\''
| '\\'
|
('0'..'3') ('0'..'7') ('0'..'7')
|
('0'..'7') ('0'..'7')
|
('0'..'7')
)
;

I'm confused by these last edits, but the problem is with the interpreter and is a known problem. Reported in 09.
If the code is generated for the grammar, it works like a charm.
It seems hard to believe that this bug has gone unanswered so long given it's frequency of occurrence.

Related

JISON errors occuring with nonterminals

I am writing a JISON file for a class and trying to use nonterminals in place of declaring associativity for operators but am utterly lost on what the errors really mean, as this is a one time assignment for a class and I haven't found amazing examples of using nonterminals for this use case.
My JISON code:
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
"*" return '*'
"/" return '/'
"-" return '-'
"+" return '+'
"^" return '^'
"!" return '!'
"%" return '%'
"(" return '('
")" return ')'
"PI" return 'PI'
"E" return 'E'
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start expressions
%% /* language grammar */
expressions
: e EOF
{ typeof console !== 'undefined' ? console.log($1) : print($1);
return $1; }
;
e
: NegExp
{$$ = $1;}
| MulExp
{$$ = $1;}
| PowExp
{$$ = $1;}
| UnaryExp
{$$ = $1;}
| RootExp
{$$ = $1;}
;
RootExp
: ’(’ RootExp ’)’
{$$ = ’(’ + $2 + ’)’;}
| NUMBER
{$$ = Number(yytext);}
| E
{$$ = ’E’;}
| PI
{$$ = ’PI’;}
;
UnaryExp
: UnaryExp '!'
{$$ = '(' + $1 + '!' + ')';}
| UnaryExp '%'
{$$ = '(' + $1 + '%' + ')';}
| '-' UnaryExp
{$$ = '(-' + $2 + ')';}
;
NegExp
: NegExp '+' e
{$$ = '(' + $1 + ' + ' + $3 + ')';}
| NegExp '-' e
{$$ = '(' + $1 + ' - ' + $3 + ')';}
;
MulExp
: MulExp '*' e
{$$ = '(' + $1 + ' * ' + $3 + ')';}
| MulExp '/' e
{$$ = '(' + $1 + ' / ' + $3 + ')';}
;
PowExp
: e '^' PowExp
{$$ = '(' + $1 + ' ^ ' + $3 + ')';}
;
And when I run jison filename.jison I get a slew of errors like:
Conflict in grammar: multiple actions possible when lookahead token is ^ in state 26
- reduce by rule: MulExp -> MulExp / e
- shift token (then go to state 13)
and:
States with conflicts:
State 3
e -> NegExp . #lookaheads= EOF ^ + - * /
NegExp -> NegExp .+ e #lookaheads= EOF + - ^ / *
NegExp -> NegExp .- e #lookaheads= EOF + - ^ / *
Again, I am not looking for someone to do my homework for me, but pointers on where to go or what to do to help debug will be greatly appreciated.
It's true; it is not easy to find examples of expression grammars which resolve ambiguity without using precedence declarations. That's probably because precedence declarations, in this particular use case, are extremely convenient and probably more readable than writing out an unambiguous grammar. The resulting parser is usually slightly more efficient, as well, because it avoids the chains of unit reductions imposed by the usual unambiguous grammar style.
The flip side of this convenience is that it does not help the student achieve an understanding of how grammars actually work, and without that understanding it is very difficult to apply precedence declarations to less clear-cut applications. So the exercise which gave rise to this question is certainly worthwhile.
One place you will find unambiguous expression grammars is in the specifications of (some) programming languages, because the unambiguous grammar does not depend on the precise nature of the algorithm used by parser generators to resolve parsing conflicts. These examples tend to be rather complicated, though, because real programming languages usually have a lot of operators. Even so, the sample C grammar in the jison examples directory does show the standard model for arithmetic expression grammars. The following extract is dramatically simplified, but most of the productions were simply copy-and-pasted from the original. (I removed many operators, most of the precedence levels, and some of the complications dealing with things like cast expressions and the idiosyncratic comma operator, which are surely not relevant here.)
primary_expression
: IDENTIFIER
| CONSTANT
| '(' expression ')'
;
postfix_expression
: primary_expression
| postfix_expression INC_OP
| postfix_expression DEC_OP
;
unary_expression
: postfix_expression
| '-' unary_expression
| INC_OP unary_expression
| DEC_OP unary_expression
;
/* I added this for explanatory purposes. DO NOT USE IT. See the text. */
exponential_expression
: unary_expression
| unary_expression '^' exponential_expression
multiplicative_expression
: exponential_expression
| multiplicative_expression '*' exponential_expression
| multiplicative_expression '/' exponential_expression
| multiplicative_expression '%' exponential_expression
;
additive_expression
: multiplicative_expression
| additive_expression '+' multiplicative_expression
| additive_expression '-' multiplicative_expression
;
expression
: additive_expression
;
C does not have an exponentiation operator, so I added one with right associativity and higher precedence than multiplication, which will serve for explanatory purposes. However, your assignment probably wants it to have higher precedence than unary negation as well, which I didn't do. So I don't recommend using the above directly.
One thing to note in the above model is that every precedence level corresponds to a non-terminal. These non-terminals are linked into an ordered chain using unit productions. We can see this sequence:
expression ⇒ additive_expression ⇒ multiplicative_expression ⇒ exponential_expression ⇒ unary_expression ⇒ postfix_expression ⇒ primary_expression
which indeeds corresponds to the precedence ordering of this grammar.
The other interesting aspect of the grammar is that the left-associative operators (all of them except exponentiation) are implemented with left-recursive productions, while the right-associative operator is implemented with a right-recursive production. This is not a coincidence.
That's the basic model, but it's worth spending a few minutes to try to understand how it actually works, because it turns out to be pretty simple. Let's take a look at one production, for multiplication, and see if we can understand why it implies that exponentiation binds more tightly and addition binds less tightly:
multiplicative_expression: multiplicative_expression '*' exponential_expression
This production is saying that a multiplicative_expression consists of a * with a multiplicative_expression on the left and an exponential_expression on the right.
Now, what does that mean for 2 + 3 * 4 ^ 2? 2 + 3 is an additive_expression, but we can see from the chain of unit productions that multiplicative_expression does not produce additive_expression. So the grammar does not include the possibility that 2 + 3 is the phrase matched on the left-hand side of the *. However, it is perfectly legal for 3 (a CONSTANT, which is a primary_expression) to be parsed as the left-hand operand of the multiplication.
Meanwhile, 4 ^ 2 is an exponential_expression, and our production clearly indicates that an exponential_expression can be matched on the right of the *.
A similar argument, examining the addition and exponential expression productions, would show that 3 * 4 ^ 2 (a multiplicative_expression) can be on the right-hand side of the + operator, while neither 2 + 3 * 4 (an additive_expression) nor 3 * 4 (a multiplicative_expression) can be on the left-hand side of the exponentiation operator.
In other words, this simple grammar defines precisely and unambiguously how the expression must be decomposed. There is only one possible parse tree:
expression
|
add
|
+--------+----------------+
| | |
| | mult
| | |
| | +-------+---------------+
| | | | |
| | | | power
| | | | |
| | | | +-------+-------+
| | | | | | |
add | mult | unary | power
... | ... | ... | ...
| | | | | | |
primary | primary | primary | primary
| | | | | | |
2 + 3 * 4 ^ 2
I hope that helps somewhat.

Parsing iOS/macOS localizable strings file with antlr4

I am trying to parse the localized "strings" files of macOS/iOS.
The format of this file is based on key/value pairs, with optional comments. An example follows:
/* This is a comment */
// This is also a comment
"key1" = "value1";
"key2" = "value2";
and so on. NOTE inside the "" could be absolutely any kind of text.
EDIT Original errorneus grammar removed
I tried to write this simple grammar, but unfortunately it doesn't work.
Since the contents inside the quotes could be quite tricky, not to mention the comments, I feel that usual regex has no real power there.
EDIT based on the comments by #GRosenberg I've created a new grammar. Now I have the problem that I can't include "Symbols" as a Char, or else parsing will break.
grammar LProj;
Esc : '\\';
Spaces : [ \t\r\n\f]+;
BlockComment : '/*' .*? ('*/' | EOF) ;
LineComment : '//' ~[\r\n]* ( '\r'? '\n' [ \t]* '//' ~[\r\n]* )* ;
MLN_COMMENT: BlockComment -> channel(HIDDEN) ;
SLN_COMMENT: LineComment -> channel(HIDDEN) ;
doc : expression*;
expression
: BlockComment
| LineComment
| Spaces
| entry
;
entry : '"' key=VALUE '"' Spaces? '=' Spaces? '"' value=VALUE '"' Spaces? ';' ;
VALUE : ( EscSeq | Val )+ ;
fragment Val : Char ( EscSeq | Char )* ;
fragment Symbol
: '*'
| '/'
| ';'
| '='
;
fragment Char
: Spaces
| '!' // skip "
| '#'..')' // skip *
| '+'..'.' // skip /
| '0'..':' // skip ;
| '<' // skip =
| '>'..'[' // skip \
| ']'..'~'
| '\u00B7'..'\ufffd'
; // ignores | ['\u10000-'\uEFFFF] ;
fragment UnicodeEsc
: 'u' (Hex (Hex (Hex Hex?)?)?)?
;
fragment Hex : [0-9a-fA-F] ;
fragment EscSeq
: Esc
( [btnfr"\\] // standard escaped character set
| UnicodeEsc // standard Unicode escape sequence
| . // Invalid escape character
| EOF // Incomplete at EOF
)
;
The Antlr grammar repository, provides good examples of how to achieve the stated goal. Just define the ID terminal to allow for inclusion of escape sequences.
Thus (with obvious details omitted),
id : QUOTE key=ID EQ val=ID QUOTE ;
DOC_COMMENT: DocComment -> channel(HIDDEN) ;
MLN_COMMENT: BlockComment -> channel(HIDDEN) ;
SLN_COMMENT: LineComment -> channel(HIDDEN) ;
NAME : NameStartChar NameChar* ;
VALUE : ( EsqSeq | Val )+ ;
fragment Val : NameStartChar ( EsqSeq | NameChar )* ;
fragment Hws : [ \t] ;
fragment Vws : [\r\n\f] ;
fragment DocComment : '/**' .*? ('*/' | EOF) ;
fragment BlockComment : '/*' .*? ('*/' | EOF) ;
fragment LineComment : '//' ~[\r\n]* ( '\r'? '\n' Hws* '//' ~[\r\n]* )* ;
// escaped short-cut character or Unicode literal
fragment EscSeq
: Esc
( [btnfr"\\] // standard escaped character set
| UnicodeEsc // standard Unicode escape sequence
| . // Invalid escape character
| EOF // Incomplete at EOF
)
;
fragment Esc : '\\' ;
fragment UnicodeEsc
: 'u' (Hex (Hex (Hex Hex?)?)?)?
;
// A valid hex digit
fragment Hex : [0-9a-fA-F] ;
fragment NameChar
: NameStartChar
| '0'..'9'
| '_'
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment NameStartChar
: 'A'..'Z'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
; // ignores | ['\u10000-'\uEFFFF] ;

ANTLRWorks v1.4.3 Debugger random behaviour (Can't connect to debugger)

If I debug this grammar:
grammar CDBFile;
options {
language=Java;
TokenLabelType=CommonToken;
output=AST;
k=1;
ASTLabelType=CommonTree;
}
tokens {
IMAG_COMPILE_UNIT;
MODULE;
}
//#lexer::namespace{Parser}
//#parser::namespace{Parser}
#lexer::header {
}
#lexer::members {
}
#parser::header {
}
#parser::members {
}
/*
* Lexer Rules
*/
fragment LETTER :
'a'..'z'
| 'A'..'Z';
MODULE_NAME
:
(LETTER)*
;
COLON
:
':'
;
/*
* Parser Rules
*/
public
compileUnit
:
(basic_record)* EOF
;
basic_record
:
(
'M' COLON module_record
| 'F' COLON function_record
) ('\n')?
;
module_record
:
MODULE_NAME
;
function_record
:
function_scope MODULE_NAME '$'
;
function_scope
:
('G$' | 'F$' | 'L$')
;
With just this input:
M:divide
the debugger does simply not start saying
"Cannot launch the debuggerTab. Time-out waiting to connect to the remote parser".
But using this grammar here:
grammar Calculator;
options {
//DO NOT CHANGE THESE!
backtrack = false;
k = 1;
output = AST;
ASTLabelType = CommonTree;
//SERIOUSLY, DO NOT CHANGE THESE!
}
tokens {
// Imaginary tokens
// Root
PROGRAM;
// function top level
FUNCTION_DECLARATION;
FUNCTION_HEAD;
FUNCTION_BODY;
DECL;
FUN;
// if-else-statement
IF_STATEMENT;
IF_CONDITION;
IF_BODY;
ELSE_BODY;
// for-loop
FOR_STATEMENT;
FOR_INITIALIZE;
FOR_CONDITION;
FOR_INCREMENT;
FOR_BODY;
// Non-imaginary tokens
}
#lexer::header {
package at.tugraz.ist.cc;
}
#lexer::members {
}
#parser::header {
package at.tugraz.ist.cc;
}
#parser::members {
}
//Lexer rules
ASSIGNOP :
'=';
OR :
'||';
AND :
'&&';
RELOP :
'<'
| '<='
| '>'
| '>='
| '=='
| '!=';
SIGN :
'+'
| '-';
MULOP :
'*'
| '/'
| '%';
NOT :
'!';
fragment OPERATORS :
'<'
| '>'
| '='
| '+'
| '-'
| '/'
| '%'
| '*'
| '|'
| '&';
INT :
'0'
| DIGIT DIGIT0*;
fragment DIGIT :
'1'..'9';
fragment DIGIT0 :
'0'..'9';
BOOLEAN :
'true'
| 'false';
ID :
LETTER
(
LETTER
| DIGIT0
| '_'
)*;
fragment LETTER :
'a'..'z'
| 'A'..'Z';
PUNCT :
'.'
| ','
| ';'
| ':'
| '!';
WS :
(
' '
| '\t'
| '\r'
| '\n'
)
{
$channel = HIDDEN;
};
LITERAL :
'"'
(
LETTER
| DIGIT
| '_'
| '\\'
| OPERATORS
| PUNCT
| WS
)*
'"';
// parse rules
program :
functions -> ^(PROGRAM functions)
;
functions :
(function_declaration functions)?
;
function_declaration :
head=function_head '{' declarations optional_stmt return_stmt rc='}' -> ^(FUNCTION_DECLARATION[$head.start, $head.text] function_head ^( FUNCTION_BODY[rc,"FUNCTION_BODY"] declarations optional_stmt? return_stmt))
;
function_head :
typeInfo=type ID arguments -> ^(FUNCTION_HEAD[$typeInfo.start, "FUNCTION_HEAD"] type ID arguments?)
;
type :
'int'
| 'boolean'
| 'String'
;
arguments :
'(' ! argument_optional ')' !;
argument_optional :
parameter_list ? -> ^(DECL parameter_list)? ;
parameter_list :
type ID parameter_list2 -> ^(type ID) parameter_list2
;
parameter_list2 :
(',' type ID)* -> ^(type ID)*;
declarations :
( type idlist ';' )* -> ^(DECL ( ^(type idlist))*) ;
idlist :
( ID idlist2 );
idlist2 :
( ',' ! idlist ) ?;
optional_stmt :
( stmt_list ) ?;
stmt_list :
statement statement2;
statement2 :
stmt_list ?;
return_stmt :
'return' ^ expression ';' ! ;
statement :
(
compound_stmt
| ifThenElse
| forLoop
| assignment ';' !
) ;
ifThenElse :
(
'if' '(' ifCondition=expression ')' ifBody=statement 'else' elseBody=statement -> ^(IF_STATEMENT ^(IF_CONDITION $ifCondition) ^(IF_BODY $ifBody) ^(ELSE_BODY $elseBody))
)
;
forLoop :
(
'for' '(' forInitialization=assignment ';' forCondition=expression ';' forIncrement=assignment ')' forBody=statement ->
^(FOR_STATEMENT ^(FOR_INITIALIZE $forInitialization) ^(FOR_CONDITION $forCondition) ^(FOR_INCREMENT $forIncrement) ^(FOR_BODY $forBody))
)
;
compound_stmt :
'{'! optional_stmt '}' !;
assignment :
ID ASSIGNOP ^ expression;
expression: andExpression (OR ^ andExpression)*;
andExpression: relOPExpression (AND ^ relOPExpression)*;
relOPExpression: signExpression (RELOP ^ signExpression)*;
signExpression : mulExpression (SIGN ^ mulExpression)*;
mulExpression : factor (MULOP ^ factor)*;
factor :
(
factorID
| INT
| BOOLEAN
| LITERAL
| NOT ^ factor
| SIGN ^ factor
| '('! expression ')' !
);
factorID: ID
( function_call -> ^(FUN ID function_call)
| -> ID
)
;
function_call :
'('! function_call_opt ')' !;
function_call_opt :
extend_assign_expr_list ? ;
extend_assign_expr_list :
(
expression
extend_assign_expr_list1
) ;
extend_assign_expr_list1 :
( ',' ! extend_assign_expr_list ) ? ;
parsing an Input like
int main()
{
return 0;
}
works just fine!
The internet has a lot of suggestions regarding this issue but none of them seem to work. The thing is that the debugger DOES work. Assuming that not the input is the problem here, the grammar has to be it. But if there is a problem with the grammar why would the Interpreter work for both examples?
Any ideas?
Edit:
I have noticed that for some reason in __Test__.java just contains:
M:divide
F:G0
I also get this output while Interpreting M:asd:
[13:47:52] Interpreting...
[13:47:52] problem matching token at 1:3 NoViableAltException('a'#[1:1: Tokens : ( T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | COLON );])
[13:47:52] problem matching token at 1:4 NoViableAltException('s'#[1:1: Tokens : ( T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | COLON );])
[13:47:52] problem matching token at 1:5 NoViableAltException('d'#[1:1: Tokens : ( T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | COLON );])
(even thought the tree is correct)
AFAIK, the debugger only works with the Java target. Since you have C# specific code in your first grammar:
#lexer::namespace{Parser}
#parser::namespace{Parser}
there are no .java classes generated (or at least, none that will compile), and the debugger hangs (and times out).
EDIT
I see you're using fragment rules in your parser rules: you can't. Fragment rules will never become a token on their own, they're only there for other lexer rules.
I've tested the grammar without the C# code in ANTLRWorks 1.4.3, and had no issues.
You could try the following:
restarting ANTLRWorks
changing the port the debugger listens on (perhaps the port is used by another service, or another debug-run of ANTLRWorks)
use the most recent version of ANTLRWorks

ANTLR resolving non-LL(*) problems and syntactic predicates

consider following rules in the parser:
expression
: IDENTIFIER
| (...)
| procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
procedure_call
: '(' expression expression* ')'
;
macro_use
: '(' IDENTIFIER datum* ')'
;
and
// Note that any string that parses as an <expression> will also parse as a <datum>.
datum
: simple_datum
| compound_datum
;
simple_datum
: BOOLEAN
| NUMBER
| CHARACTER
| STRING
| IDENTIFIER
;
compound_datum
: list
| vector
;
list
: '(' (datum+ ( '.' datum)?)? ')'
| ABBREV_PREFIX datum
;
fragment ABBREV_PREFIX
: ('\'' | '`' | ',' | ',#')
;
vector
: '#(' datum* ')'
;
the procedure_call and macro_rule alternative in the expression rule generate an non-LL(*) structure error. I can see the problem, since (IDENTIFIER) will parse as both. but even when i define both with + instead of *, it generates the error, even though above example shouldn't be parsing anymore.
i came up with the usage of syntactic predicates, but i can't figure out how to use them to do the trick here.
something like
expression
: IDENTIFIER
| (...)
| (procedure_call)=>procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
or
expression
: IDENTIFIER
| (...)
| ('(' IDENTIFIER expression)=>procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
doesnt work either, since none but the first rule will match anything. is there a proper way to solve that?
I found a JavaCC grammar of R5RS which I used to (quickly!) write an ANTLR equivalent:
/*
* Copyright (C) 2011 by Bart Kiers, based on the work done by Håkan L. Younes'
* JavaCC R5RS grammar, available at: http://mindprod.com/javacc/R5RS.jj
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
grammar R5RS;
parse
: commandOrDefinition* EOF
;
commandOrDefinition
: (syntaxDefinition)=> syntaxDefinition
| (definition)=> definition
| ('(' BEGIN commandOrDefinition)=> '(' BEGIN commandOrDefinition+ ')'
| command
;
syntaxDefinition
: '(' DEFINE_SYNTAX keyword transformerSpec ')'
;
definition
: '(' DEFINE ( variable expression ')'
| '(' variable defFormals ')' body ')'
)
| '(' BEGIN definition* ')'
;
defFormals
: variable* ('.' variable)?
;
keyword
: identifier
;
transformerSpec
: '(' SYNTAX_RULES '(' identifier* ')' syntaxRule* ')'
;
syntaxRule
: '(' pattern template ')'
;
pattern
: patternIdentifier
| '(' (pattern+ ('.' pattern | ELLIPSIS)?)? ')'
| '#(' (pattern+ ELLIPSIS? )? ')'
| patternDatum
;
patternIdentifier
: syntacticKeyword
| VARIABLE
;
patternDatum
: STRING
| CHARACTER
| bool
| number
;
template
: patternIdentifier
| '(' (templateElement+ ('.' templateElement)?)? ')'
| '#(' templateElement* ')'
| templateDatum
;
templateElement
: template ELLIPSIS?
;
templateDatum
: patternDatum
;
command
: expression
;
identifier
: syntacticKeyword
| variable
;
syntacticKeyword
: expressionKeyword
| ELSE
| ARROW
| DEFINE
| UNQUOTE
| UNQUOTE_SPLICING
;
expressionKeyword
: QUOTE
| LAMBDA
| IF
| SET
| BEGIN
| COND
| AND
| OR
| CASE
| LET
| LETSTAR
| LETREC
| DO
| DELAY
| QUASIQUOTE
;
expression
: (variable)=> variable
| (literal)=> literal
| (lambdaExpression)=> lambdaExpression
| (conditional)=> conditional
| (assignment)=> assignment
| (derivedExpression)=> derivedExpression
| (procedureCall)=> procedureCall
| (macroUse)=> macroUse
| macroBlock
;
variable
: VARIABLE
| ELLIPSIS
;
literal
: quotation
| selfEvaluating
;
quotation
: '\'' datum
| '(' QUOTE datum ')'
;
selfEvaluating
: bool
| number
| CHARACTER
| STRING
;
lambdaExpression
: '(' LAMBDA formals body ')'
;
formals
: '(' (variable+ ('.' variable)?)? ')'
| variable
;
conditional
: '(' IF test consequent alternate? ')'
;
test
: expression
;
consequent
: expression
;
alternate
: expression
;
assignment
: '(' SET variable expression ')'
;
derivedExpression
: quasiquotation
| '(' ( COND ( '(' ELSE sequence ')'
| condClause+ ('(' ELSE sequence ')')?
)
| CASE expression ( '(' ELSE sequence ')'
| caseClause+ ('(' ELSE sequence ')')?
)
| AND test*
| OR test*
| LET variable? '(' bindingSpec* ')' body
| LETSTAR '(' bindingSpec* ')' body
| LETREC '(' bindingSpec* ')' body
| BEGIN sequence
| DO '(' iterationSpec* ')' '(' test doResult? ')' command*
| DELAY expression
)
')'
;
condClause
: '(' test (sequence | ARROW recipient)? ')'
;
recipient
: expression
;
caseClause
: '(' '(' datum* ')' sequence ')'
;
bindingSpec
: '(' variable expression ')'
;
iterationSpec
: '(' variable init step? ')'
;
init
: expression
;
step
: expression
;
doResult
: sequence
;
procedureCall
: '(' operator operand* ')'
;
operator
: expression
;
operand
: expression
;
macroUse
: '(' keyword datum* ')'
;
macroBlock
: '(' (LET_SYNTAX | LETREC_SYNTAX) '(' syntaxSpec* ')' body ')'
;
syntaxSpec
: '(' keyword transformerSpec ')'
;
body
: ((definition)=> definition)* sequence
;
//sequence
// : ((command)=> command)* expression
// ;
sequence
: expression+
;
datum
: simpleDatum
| compoundDatum
;
simpleDatum
: bool
| number
| CHARACTER
| STRING
| identifier
;
compoundDatum
: list
| vector
;
list
: '(' (datum+ ('.' datum)?)? ')'
| abbreviation
;
abbreviation
: abbrevPrefix datum
;
abbrevPrefix
: '\'' | '`' | ',#' | ','
;
vector
: '#(' datum* ')'
;
number
: NUM_2
| NUM_8
| NUM_10
| NUM_16
;
bool
: TRUE
| FALSE
;
quasiquotation
: quasiquotationD[1]
;
quasiquotationD[int d]
: '`' qqTemplate[d]
| '(' QUASIQUOTE qqTemplate[d] ')'
;
qqTemplate[int d]
: (expression)=> expression
| ('(' UNQUOTE)=> unquotation[d]
| simpleDatum
| vectorQQTemplate[d]
| listQQTemplate[d]
;
vectorQQTemplate[int d]
: '#(' qqTemplateOrSplice[d]* ')'
;
listQQTemplate[int d]
: '\'' qqTemplate[d]
| ('(' QUASIQUOTE)=> quasiquotationD[d+1]
| '(' (qqTemplateOrSplice[d]+ ('.' qqTemplate[d])?)? ')'
;
unquotation[int d]
: ',' qqTemplate[d-1]
| '(' UNQUOTE qqTemplate[d-1] ')'
;
qqTemplateOrSplice[int d]
: ('(' UNQUOTE_SPLICING)=> splicingUnquotation[d]
| qqTemplate[d]
;
splicingUnquotation[int d]
: ',#' qqTemplate[d-1]
| '(' UNQUOTE_SPLICING qqTemplate[d-1] ')'
;
// macro keywords
LET_SYNTAX : 'let-syntax';
LETREC_SYNTAX : 'letrec-syntax';
SYNTAX_RULES : 'syntax-rules';
DEFINE_SYNTAX : 'define-syntax';
// syntactic keywords
ELSE : 'else';
ARROW : '=>';
DEFINE : 'define';
UNQUOTE_SPLICING : 'unquote-splicing';
UNQUOTE : 'unquote';
// expression keywords
QUOTE : 'quote';
LAMBDA : 'lambda';
IF : 'if';
SET : 'set!';
BEGIN : 'begin';
COND : 'cond';
AND : 'and';
OR : 'or';
CASE : 'case';
LET : 'let';
LETSTAR : 'let*';
LETREC : 'letrec';
DO : 'do';
DELAY : 'delay';
QUASIQUOTE : 'quasiquote';
NUM_2 : PREFIX_2 COMPLEX_2;
NUM_8 : PREFIX_8 COMPLEX_8;
NUM_10 : PREFIX_10? COMPLEX_10;
NUM_16 : PREFIX_16 COMPLEX_16;
ELLIPSIS : '...';
VARIABLE
: INITIAL SUBSEQUENT*
| PECULIAR_IDENTIFIER
;
STRING : '"' STRING_ELEMENT* '"';
CHARACTER : '#\\' (~(' ' | '\n') | CHARACTER_NAME);
TRUE : '#' ('t' | 'T');
FALSE : '#' ('f' | 'F');
// to ignore
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
COMMENT : ';' ~('\r' | '\n')* {$channel=HIDDEN;};
// fragments
fragment INITIAL : LETTER | SPECIAL_INITIAL;
fragment LETTER : 'a'..'z' | 'A'..'Z';
fragment SPECIAL_INITIAL : '!' | '$' | '%' | '&' | '*' | '/' | ':' | '<' | '=' | '>' | '?' | '^' | '_' | '~';
fragment SUBSEQUENT : INITIAL | DIGIT | SPECIAL_SUBSEQUENT;
fragment DIGIT : '0'..'9';
fragment SPECIAL_SUBSEQUENT : '.' | '+' | '-' | '#';
fragment PECULIAR_IDENTIFIER : '+' | '-';
fragment STRING_ELEMENT : ~('"' | '\\') | '\\' ('"' | '\\');
fragment CHARACTER_NAME : 'space' | 'newline';
fragment COMPLEX_2
: REAL_2 ('#' REAL_2)?
| REAL_2? SIGN UREAL_2? ('i' | 'I')
;
fragment COMPLEX_8
: REAL_8 ('#' REAL_8)?
| REAL_8? SIGN UREAL_8? ('i' | 'I')
;
fragment COMPLEX_10
: REAL_10 ('#' REAL_10)?
| REAL_10? SIGN UREAL_10? ('i' | 'I')
;
fragment COMPLEX_16
: REAL_16 ('#' REAL_16)?
| REAL_16? SIGN UREAL_16? ('i' | 'I')
;
fragment REAL_2 : SIGN? UREAL_2;
fragment REAL_8 : SIGN? UREAL_8;
fragment REAL_10 : SIGN? UREAL_10;
fragment REAL_16 : SIGN? UREAL_16;
fragment UREAL_2 : UINTEGER_2 ('/' UINTEGER_2)?;
fragment UREAL_8 : UINTEGER_8 ('/' UINTEGER_8)?;
fragment UREAL_10 : UINTEGER_10 ('/' UINTEGER_10)? | DECIMAL_10;
fragment UREAL_16 : UINTEGER_16 ('/' UINTEGER_16)?;
fragment DECIMAL_10
: UINTEGER_10 SUFFIX
| '.' DIGIT+ '#'* SUFFIX?
| DIGIT+ '.' DIGIT* '#'* SUFFIX?
| DIGIT+ '#'+ '.' '#'* SUFFIX?
;
fragment UINTEGER_2 : DIGIT_2+ '#'*;
fragment UINTEGER_8 : DIGIT_8+ '#'*;
fragment UINTEGER_10 : DIGIT+ '#'*;
fragment UINTEGER_16 : DIGIT_16+ '#'*;
fragment PREFIX_2 : RADIX_2 EXACTNESS? | EXACTNESS RADIX_2;
fragment PREFIX_8 : RADIX_8 EXACTNESS? | EXACTNESS RADIX_8;
fragment PREFIX_10 : RADIX_10 EXACTNESS? | EXACTNESS RADIX_10;
fragment PREFIX_16 : RADIX_16 EXACTNESS? | EXACTNESS RADIX_16;
fragment SUFFIX : EXPONENT_MARKER SIGN? DIGIT+;
fragment EXPONENT_MARKER : 'e' | 's' | 'f' | 'd' | 'l' | 'E' | 'S' | 'F' | 'D' | 'L';
fragment SIGN : '+' | '-';
fragment EXACTNESS : '#' ('i' | 'e' | 'I' | 'E');
fragment RADIX_2 : '#' ('b' | 'B');
fragment RADIX_8 : '#' ('o' | 'O');
fragment RADIX_10 : '#' ('d' | 'D');
fragment RADIX_16 : '#' ('x' | 'X');
fragment DIGIT_2 : '0' | '1';
fragment DIGIT_8 : '0'..'7';
fragment DIGIT_16 : DIGIT | 'a'..'f' | 'A'..'F';
which can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"(define sum-iter \n" +
" (lambda(n acc i) \n" +
" (if (> i n) \n" +
" acc \n" +
" (sum-iter n (+ acc i) (+ i 1))))) ";
R5RSLexer lexer = new R5RSLexer(new ANTLRStringStream(source));
R5RSParser parser = new R5RSParser(new CommonTokenStream(lexer));
parser.parse();
}
}
and to generate a lexer & parser, compile all Java source files and run the main class, do:
bart#hades:~/Programming/ANTLR/Demos/R5RS$ java -cp antlr-3.3.jar org.antlr.Tool R5RS.g
bart#hades:~/Programming/ANTLR/Demos/R5RS$ javac -cp antlr-3.3.jar *.java
bart#hades:~/Programming/ANTLR/Demos/R5RS$ java -cp .:antlr-3.3.jar Main
bart#hades:~/Programming/ANTLR/Demos/R5RS$
The fact that nothing is being printed on the console means the parser (and lexer) didn't find any errors with the provided source.
Note that I have no Unit tests and have only tested the single Scheme source inside the Main class. If you find errors in the ANTLR grammar, I'd appreciate to hear about them so I can fix the grammar. In due time, I'll probably commit the grammar to the official ANTLR Wiki.

ANTLR IDL Grammar

Using ANTLR I am trying to create a very simple IDL-style grammar. Here is what I have so far.
grammar idl;
data_type
: 'DataType' ID LCURLY attribute_list RCURLY
;
modifier
: 'public'
;
primitive
: 'byte'
| 'short'
| 'int'
| 'float'
| 'double'
;
attribute
: modifier primitive ID END
;
attribute_list
: attribute+
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
LCURLY : '{'
;
RCURLY : '}'
;
END : ';'
;
This does not seem to work when I run the debugger over 'data_type' however. It just halts when it reaches the 'attribute_list'. Changing 'attribute_list' to just 'attribute' works fine but obviously I want one-or-more attributes, not just one.
Thanks

Resources