EBNF : missing the purpose of two production rules - ebnf

I have this grammar in EBNF for a sub-language with arithmetic & logical expressions, variables assignment and printing.
start ::= (print | assign)*
print ::= print expr ;
assign ::= ID = expr ;
expr ::= andExpr (|| andExpr)*
andExpr ::= relExpr (&& relExpr)*
relExpr ::= addExpr ( == addExpr | != addExpr | <= addExpr | >= addExpr | < addExpr | > addExpr)?
addExpr ::= mulExpr (+ mulExpr | - mulExpr)*
mulExpr ::= unExpr (* hunExpri | / hunExpr)*
unExpr ::= + unExpr | - unExpr | ! unExpr | primary
primary ::= ( expr ) | ID | NUM | true | false
unfortunately I just can't figure out what these two rules:
unExpr ::= + unExpr
unExpr ::= - unExpr
actually do, or why I should need them, since I seem to be able to derive every phrase of the language without applying them. Any idea?
thanks a lot :-)

If you are planning no expressions like:
a=-1
(where "a" is an ID and "1" is a NUM)
in your language than you don't need those two rules.
Otherwise you have to implement them.

Related

JISON errors occuring with nonterminals

I am writing a JISON file for a class and trying to use nonterminals in place of declaring associativity for operators but am utterly lost on what the errors really mean, as this is a one time assignment for a class and I haven't found amazing examples of using nonterminals for this use case.
My JISON code:
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
"*" return '*'
"/" return '/'
"-" return '-'
"+" return '+'
"^" return '^'
"!" return '!'
"%" return '%'
"(" return '('
")" return ')'
"PI" return 'PI'
"E" return 'E'
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start expressions
%% /* language grammar */
expressions
: e EOF
{ typeof console !== 'undefined' ? console.log($1) : print($1);
return $1; }
;
e
: NegExp
{$$ = $1;}
| MulExp
{$$ = $1;}
| PowExp
{$$ = $1;}
| UnaryExp
{$$ = $1;}
| RootExp
{$$ = $1;}
;
RootExp
: ’(’ RootExp ’)’
{$$ = ’(’ + $2 + ’)’;}
| NUMBER
{$$ = Number(yytext);}
| E
{$$ = ’E’;}
| PI
{$$ = ’PI’;}
;
UnaryExp
: UnaryExp '!'
{$$ = '(' + $1 + '!' + ')';}
| UnaryExp '%'
{$$ = '(' + $1 + '%' + ')';}
| '-' UnaryExp
{$$ = '(-' + $2 + ')';}
;
NegExp
: NegExp '+' e
{$$ = '(' + $1 + ' + ' + $3 + ')';}
| NegExp '-' e
{$$ = '(' + $1 + ' - ' + $3 + ')';}
;
MulExp
: MulExp '*' e
{$$ = '(' + $1 + ' * ' + $3 + ')';}
| MulExp '/' e
{$$ = '(' + $1 + ' / ' + $3 + ')';}
;
PowExp
: e '^' PowExp
{$$ = '(' + $1 + ' ^ ' + $3 + ')';}
;
And when I run jison filename.jison I get a slew of errors like:
Conflict in grammar: multiple actions possible when lookahead token is ^ in state 26
- reduce by rule: MulExp -> MulExp / e
- shift token (then go to state 13)
and:
States with conflicts:
State 3
e -> NegExp . #lookaheads= EOF ^ + - * /
NegExp -> NegExp .+ e #lookaheads= EOF + - ^ / *
NegExp -> NegExp .- e #lookaheads= EOF + - ^ / *
Again, I am not looking for someone to do my homework for me, but pointers on where to go or what to do to help debug will be greatly appreciated.
It's true; it is not easy to find examples of expression grammars which resolve ambiguity without using precedence declarations. That's probably because precedence declarations, in this particular use case, are extremely convenient and probably more readable than writing out an unambiguous grammar. The resulting parser is usually slightly more efficient, as well, because it avoids the chains of unit reductions imposed by the usual unambiguous grammar style.
The flip side of this convenience is that it does not help the student achieve an understanding of how grammars actually work, and without that understanding it is very difficult to apply precedence declarations to less clear-cut applications. So the exercise which gave rise to this question is certainly worthwhile.
One place you will find unambiguous expression grammars is in the specifications of (some) programming languages, because the unambiguous grammar does not depend on the precise nature of the algorithm used by parser generators to resolve parsing conflicts. These examples tend to be rather complicated, though, because real programming languages usually have a lot of operators. Even so, the sample C grammar in the jison examples directory does show the standard model for arithmetic expression grammars. The following extract is dramatically simplified, but most of the productions were simply copy-and-pasted from the original. (I removed many operators, most of the precedence levels, and some of the complications dealing with things like cast expressions and the idiosyncratic comma operator, which are surely not relevant here.)
primary_expression
: IDENTIFIER
| CONSTANT
| '(' expression ')'
;
postfix_expression
: primary_expression
| postfix_expression INC_OP
| postfix_expression DEC_OP
;
unary_expression
: postfix_expression
| '-' unary_expression
| INC_OP unary_expression
| DEC_OP unary_expression
;
/* I added this for explanatory purposes. DO NOT USE IT. See the text. */
exponential_expression
: unary_expression
| unary_expression '^' exponential_expression
multiplicative_expression
: exponential_expression
| multiplicative_expression '*' exponential_expression
| multiplicative_expression '/' exponential_expression
| multiplicative_expression '%' exponential_expression
;
additive_expression
: multiplicative_expression
| additive_expression '+' multiplicative_expression
| additive_expression '-' multiplicative_expression
;
expression
: additive_expression
;
C does not have an exponentiation operator, so I added one with right associativity and higher precedence than multiplication, which will serve for explanatory purposes. However, your assignment probably wants it to have higher precedence than unary negation as well, which I didn't do. So I don't recommend using the above directly.
One thing to note in the above model is that every precedence level corresponds to a non-terminal. These non-terminals are linked into an ordered chain using unit productions. We can see this sequence:
expression ⇒ additive_expression ⇒ multiplicative_expression ⇒ exponential_expression ⇒ unary_expression ⇒ postfix_expression ⇒ primary_expression
which indeeds corresponds to the precedence ordering of this grammar.
The other interesting aspect of the grammar is that the left-associative operators (all of them except exponentiation) are implemented with left-recursive productions, while the right-associative operator is implemented with a right-recursive production. This is not a coincidence.
That's the basic model, but it's worth spending a few minutes to try to understand how it actually works, because it turns out to be pretty simple. Let's take a look at one production, for multiplication, and see if we can understand why it implies that exponentiation binds more tightly and addition binds less tightly:
multiplicative_expression: multiplicative_expression '*' exponential_expression
This production is saying that a multiplicative_expression consists of a * with a multiplicative_expression on the left and an exponential_expression on the right.
Now, what does that mean for 2 + 3 * 4 ^ 2? 2 + 3 is an additive_expression, but we can see from the chain of unit productions that multiplicative_expression does not produce additive_expression. So the grammar does not include the possibility that 2 + 3 is the phrase matched on the left-hand side of the *. However, it is perfectly legal for 3 (a CONSTANT, which is a primary_expression) to be parsed as the left-hand operand of the multiplication.
Meanwhile, 4 ^ 2 is an exponential_expression, and our production clearly indicates that an exponential_expression can be matched on the right of the *.
A similar argument, examining the addition and exponential expression productions, would show that 3 * 4 ^ 2 (a multiplicative_expression) can be on the right-hand side of the + operator, while neither 2 + 3 * 4 (an additive_expression) nor 3 * 4 (a multiplicative_expression) can be on the left-hand side of the exponentiation operator.
In other words, this simple grammar defines precisely and unambiguously how the expression must be decomposed. There is only one possible parse tree:
expression
|
add
|
+--------+----------------+
| | |
| | mult
| | |
| | +-------+---------------+
| | | | |
| | | | power
| | | | |
| | | | +-------+-------+
| | | | | | |
add | mult | unary | power
... | ... | ... | ...
| | | | | | |
primary | primary | primary | primary
| | | | | | |
2 + 3 * 4 ^ 2
I hope that helps somewhat.

LEX: code to read a specific grammar error

Very new to lex. This is for project for Prog. Langs. class
Consider a language built over the following grammar:
<program> ::= <statement> | <program> <statement>
<statement> ::= <assignStmt> | <ifStmt> | <whileStmt> | <printStmt>
<assignStmt> ::= <id> = <expr> ;
<ifStmt> ::= if ( <expr> ) then <stmt>
<whileStmt> ::= while ( <expr> ) do <stmt>
<printStmt> ::= print <expr> ;
<expr> ::= <term> | <expr> <addOp> <term>
<term> ::= <factor> | <term> <multOp> <factor>
<factor> ::= <id> | <number> | - <factor> | ( <expr> )
<id> ::= <letter> | <id> <letter>
<letter> ::= a | b | c | d | e | f | g | h | i | j
| k | l | m | n | o | p | r | s | t
| u | v | w | x | y | z
<number> ::= <digit> | <number> <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<addOp> ::= + | -
<multOp> ::= * | / | %
Implement a lex-based C program that scans for all the tokens of the language (keywords, identifiers, numbers, operators, and so on).
My problem is I get "l7t2.l:32: unrecognized rule" error. I believe it stems from the declaration of "word" above but not sure how to fix it.
Heres my lex file, l7t2.l
%option noyywrap
%{
#include "l7t2.h"
int totDol = 0;
int *outword;
%}
digit [0-9]
number {digit}*
letter [a-zA-Z]
word ({letter}{[a-zA-Z0-9]}+)
%%
"if" {return IF;}
"then" {return THEN;}
"while" {return WHILE;}
"do" {return DO;}
"+" {return PLUSOP;}
"-" {return MINUSOP;}
"*" {return MULTOP;}
"/" {return DIVOP;}
"%" {return MODOP;}
";" {return SEMICOLON;}
"=" {return EQUAL;}
"print" {return PRINT;}
[ \t\n]+ ;
{word} {strcpy(outword, yytext);}
\${number} {totDol = 0; totDol += strtod(yytext+1, NULL); return totDol;}
%%
word ({letter}{[a-zA-Z0-9]}+)
The problem is here. {} is used to introduce prior definitions only. It should be:
word ({letter}[a-zA-Z0-9]+)
On line 32, surely you should be returning a value from that rule?
NB You can get rid of all the single-special-character rules and have a final cover-all rule:
. return yytext[0];
This also means you can use the special characters directly in the grammar, e.g. '+' instead of PLUSOP. It also saves you from having to handle illegal characters at all in the lexer: the parser does it.

Modifying a Grammar in ANTLR to manage expressions correctly

I´m working on a simple expression evaluator with ANTLR
this is the grammar I created:
expression : relationExpr (cond_op expression)? ;
relationExpr: addExpr (rel_op relationExpr)? ;
addExpr: multExpr (add_op addExpr)? ;
multExpr: unaryExpr (mult_op multExpr)? ;
unaryExpr: '-' value | '!' value | value ;
value: literal | '('expression')' ;
mult_op : '*' | '/' | '%' ;
add_op : '+'|'-' ;
rel_op : '<' | '>' | '<=' | '>='| eq_op ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal | char_literal | bool_literal ;
int_literal : NUM ;
char_literal : CHAR ;
bool_literal : TRUE | FALSE ;
The problem I´m having is that association of operand is not being to the left.
For example if I evaluate: 10+20*2/10
I get this tree:
As you see the / operand is being evaluated first and the correct way should be to the left.
Can you give me help in modifying the grammar to get the asociation right?
If you do not really need your parse tree nodes to be binary operations, the grammar below yields a single multExpr node which you can walk left-to-right which I believe is your goal.
NUM : [0-9]+;
expression : relationExpr (cond_op relationExpr)* ;
relationExpr: addExpr (rel_op addExpr)* ;
addExpr: multExpr (add_op multExpr)* ;
multExpr: unaryExpr (mult_op unaryExpr)* ;
unaryExpr: '-' value | '!' value | value ;
value: literal | '('expression')' ;
mult_op : '*' | '/' | '%' ;
add_op : '+'|'-' ;
rel_op : '<' | '>' | '<=' | '>='| eq_op ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal ;
int_literal : NUM ;
Parsing you example expression 10+20*2/10<8 produces the parse tree below.
'Hope this helps.

EBNF Maximum Token

lets say i have the following EBNF:
ProductNo ::= Digitgroup "-" Lettergroup;
Digitgroup ::= Digit Digit? Digit? Digit?;
Digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
Lettergroup ::= Letter Letter? Letter? Letter? Letter?;
Letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G"
| "H" | "I" | "J" | "K" | "L" | "M" | "N"
| "O" | "P" | "Q" | "R" | "S" | "T" | "U"
| "V" | "W" | "X" | "Y" | "Z";
now i want to set the maximum of Tokens for ProductNo = 5
Example:
Input : 1-A (EBNF valid and Token < 5)
Input : 023-A (EBNF valid and Token < 5)
Input : 0231-ABI (currently EBNF valid but Token = 8 > 5 so this should not be valid)
Input : 022-ABCDE(currently EBNF valid but Token = 9 > 5 so this should not be valid)
as you can see in this example input, the combination of Digits and Letters can vary as long as its EBNF conform (min 1 Digit max 4 Digit), (min 1 Letter max 5 Letter) but the sum of the Tokens has to be <= 5 including the "-".
Question : Is there a way other than writing every valid combination of Letter and Digit down?
My current solution:
ProductNo ::= Token Token Token Token? Token?;
Token ::= Digit | Letter | "-";
Digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
Letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G"
| "H" | "I" | "J" | "K" | "L" | "M" | "N"
| "O" | "P" | "Q" | "R" | "S" | "T" | "U"
| "V" | "W" | "X" | "Y" | "Z";
Problem : The composition of ProductNo (Digitgroup, "-", Lettergroup) is not reproduced. So i need to combine the two EBNF into one, but i really cant figure a way out how to do this.
I'm assuming you are using the W3C notation: http://www.w3.org/TR/REC-xml/#sec-notation , not the standard ISO notation: http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form .
If I'm wrong, then please specify which EBNF you're using!
In the W3C notation you can use this:
Digit ::= [0-9]
Letter ::= [A-Z]
GoodFormat ::= Digit+ "-" Letter+
Token ::= Digit | Letter | "-"
TooLong ::= Token Token Token Token Token Token+
ProductNo ::= GoodFormat - TooLong
I think that there is a smarter solution than writing every valid combination down:
ProductNo ::= Case1 | Case2 | Case3
Case1 ::= Digit Digit? Digit? "-" Letter;
Case2 ::= Digit "-" Letter Letter? Letter?;
Case3 ::= Digit Digit? "-" Letter Letter?;
Digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
Letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G"
| "H" | "I" | "J" | "K" | "L" | "M" | "N"
| "O" | "P" | "Q" | "R" | "S" | "T" | "U"
| "V" | "W" | "X" | "Y" | "Z";
But I don't know if there is any smarter why to do this. I hope this solution helps a bit.

ANTLR resolving non-LL(*) problems and syntactic predicates

consider following rules in the parser:
expression
: IDENTIFIER
| (...)
| procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
procedure_call
: '(' expression expression* ')'
;
macro_use
: '(' IDENTIFIER datum* ')'
;
and
// Note that any string that parses as an <expression> will also parse as a <datum>.
datum
: simple_datum
| compound_datum
;
simple_datum
: BOOLEAN
| NUMBER
| CHARACTER
| STRING
| IDENTIFIER
;
compound_datum
: list
| vector
;
list
: '(' (datum+ ( '.' datum)?)? ')'
| ABBREV_PREFIX datum
;
fragment ABBREV_PREFIX
: ('\'' | '`' | ',' | ',#')
;
vector
: '#(' datum* ')'
;
the procedure_call and macro_rule alternative in the expression rule generate an non-LL(*) structure error. I can see the problem, since (IDENTIFIER) will parse as both. but even when i define both with + instead of *, it generates the error, even though above example shouldn't be parsing anymore.
i came up with the usage of syntactic predicates, but i can't figure out how to use them to do the trick here.
something like
expression
: IDENTIFIER
| (...)
| (procedure_call)=>procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
or
expression
: IDENTIFIER
| (...)
| ('(' IDENTIFIER expression)=>procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
doesnt work either, since none but the first rule will match anything. is there a proper way to solve that?
I found a JavaCC grammar of R5RS which I used to (quickly!) write an ANTLR equivalent:
/*
* Copyright (C) 2011 by Bart Kiers, based on the work done by Håkan L. Younes'
* JavaCC R5RS grammar, available at: http://mindprod.com/javacc/R5RS.jj
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
grammar R5RS;
parse
: commandOrDefinition* EOF
;
commandOrDefinition
: (syntaxDefinition)=> syntaxDefinition
| (definition)=> definition
| ('(' BEGIN commandOrDefinition)=> '(' BEGIN commandOrDefinition+ ')'
| command
;
syntaxDefinition
: '(' DEFINE_SYNTAX keyword transformerSpec ')'
;
definition
: '(' DEFINE ( variable expression ')'
| '(' variable defFormals ')' body ')'
)
| '(' BEGIN definition* ')'
;
defFormals
: variable* ('.' variable)?
;
keyword
: identifier
;
transformerSpec
: '(' SYNTAX_RULES '(' identifier* ')' syntaxRule* ')'
;
syntaxRule
: '(' pattern template ')'
;
pattern
: patternIdentifier
| '(' (pattern+ ('.' pattern | ELLIPSIS)?)? ')'
| '#(' (pattern+ ELLIPSIS? )? ')'
| patternDatum
;
patternIdentifier
: syntacticKeyword
| VARIABLE
;
patternDatum
: STRING
| CHARACTER
| bool
| number
;
template
: patternIdentifier
| '(' (templateElement+ ('.' templateElement)?)? ')'
| '#(' templateElement* ')'
| templateDatum
;
templateElement
: template ELLIPSIS?
;
templateDatum
: patternDatum
;
command
: expression
;
identifier
: syntacticKeyword
| variable
;
syntacticKeyword
: expressionKeyword
| ELSE
| ARROW
| DEFINE
| UNQUOTE
| UNQUOTE_SPLICING
;
expressionKeyword
: QUOTE
| LAMBDA
| IF
| SET
| BEGIN
| COND
| AND
| OR
| CASE
| LET
| LETSTAR
| LETREC
| DO
| DELAY
| QUASIQUOTE
;
expression
: (variable)=> variable
| (literal)=> literal
| (lambdaExpression)=> lambdaExpression
| (conditional)=> conditional
| (assignment)=> assignment
| (derivedExpression)=> derivedExpression
| (procedureCall)=> procedureCall
| (macroUse)=> macroUse
| macroBlock
;
variable
: VARIABLE
| ELLIPSIS
;
literal
: quotation
| selfEvaluating
;
quotation
: '\'' datum
| '(' QUOTE datum ')'
;
selfEvaluating
: bool
| number
| CHARACTER
| STRING
;
lambdaExpression
: '(' LAMBDA formals body ')'
;
formals
: '(' (variable+ ('.' variable)?)? ')'
| variable
;
conditional
: '(' IF test consequent alternate? ')'
;
test
: expression
;
consequent
: expression
;
alternate
: expression
;
assignment
: '(' SET variable expression ')'
;
derivedExpression
: quasiquotation
| '(' ( COND ( '(' ELSE sequence ')'
| condClause+ ('(' ELSE sequence ')')?
)
| CASE expression ( '(' ELSE sequence ')'
| caseClause+ ('(' ELSE sequence ')')?
)
| AND test*
| OR test*
| LET variable? '(' bindingSpec* ')' body
| LETSTAR '(' bindingSpec* ')' body
| LETREC '(' bindingSpec* ')' body
| BEGIN sequence
| DO '(' iterationSpec* ')' '(' test doResult? ')' command*
| DELAY expression
)
')'
;
condClause
: '(' test (sequence | ARROW recipient)? ')'
;
recipient
: expression
;
caseClause
: '(' '(' datum* ')' sequence ')'
;
bindingSpec
: '(' variable expression ')'
;
iterationSpec
: '(' variable init step? ')'
;
init
: expression
;
step
: expression
;
doResult
: sequence
;
procedureCall
: '(' operator operand* ')'
;
operator
: expression
;
operand
: expression
;
macroUse
: '(' keyword datum* ')'
;
macroBlock
: '(' (LET_SYNTAX | LETREC_SYNTAX) '(' syntaxSpec* ')' body ')'
;
syntaxSpec
: '(' keyword transformerSpec ')'
;
body
: ((definition)=> definition)* sequence
;
//sequence
// : ((command)=> command)* expression
// ;
sequence
: expression+
;
datum
: simpleDatum
| compoundDatum
;
simpleDatum
: bool
| number
| CHARACTER
| STRING
| identifier
;
compoundDatum
: list
| vector
;
list
: '(' (datum+ ('.' datum)?)? ')'
| abbreviation
;
abbreviation
: abbrevPrefix datum
;
abbrevPrefix
: '\'' | '`' | ',#' | ','
;
vector
: '#(' datum* ')'
;
number
: NUM_2
| NUM_8
| NUM_10
| NUM_16
;
bool
: TRUE
| FALSE
;
quasiquotation
: quasiquotationD[1]
;
quasiquotationD[int d]
: '`' qqTemplate[d]
| '(' QUASIQUOTE qqTemplate[d] ')'
;
qqTemplate[int d]
: (expression)=> expression
| ('(' UNQUOTE)=> unquotation[d]
| simpleDatum
| vectorQQTemplate[d]
| listQQTemplate[d]
;
vectorQQTemplate[int d]
: '#(' qqTemplateOrSplice[d]* ')'
;
listQQTemplate[int d]
: '\'' qqTemplate[d]
| ('(' QUASIQUOTE)=> quasiquotationD[d+1]
| '(' (qqTemplateOrSplice[d]+ ('.' qqTemplate[d])?)? ')'
;
unquotation[int d]
: ',' qqTemplate[d-1]
| '(' UNQUOTE qqTemplate[d-1] ')'
;
qqTemplateOrSplice[int d]
: ('(' UNQUOTE_SPLICING)=> splicingUnquotation[d]
| qqTemplate[d]
;
splicingUnquotation[int d]
: ',#' qqTemplate[d-1]
| '(' UNQUOTE_SPLICING qqTemplate[d-1] ')'
;
// macro keywords
LET_SYNTAX : 'let-syntax';
LETREC_SYNTAX : 'letrec-syntax';
SYNTAX_RULES : 'syntax-rules';
DEFINE_SYNTAX : 'define-syntax';
// syntactic keywords
ELSE : 'else';
ARROW : '=>';
DEFINE : 'define';
UNQUOTE_SPLICING : 'unquote-splicing';
UNQUOTE : 'unquote';
// expression keywords
QUOTE : 'quote';
LAMBDA : 'lambda';
IF : 'if';
SET : 'set!';
BEGIN : 'begin';
COND : 'cond';
AND : 'and';
OR : 'or';
CASE : 'case';
LET : 'let';
LETSTAR : 'let*';
LETREC : 'letrec';
DO : 'do';
DELAY : 'delay';
QUASIQUOTE : 'quasiquote';
NUM_2 : PREFIX_2 COMPLEX_2;
NUM_8 : PREFIX_8 COMPLEX_8;
NUM_10 : PREFIX_10? COMPLEX_10;
NUM_16 : PREFIX_16 COMPLEX_16;
ELLIPSIS : '...';
VARIABLE
: INITIAL SUBSEQUENT*
| PECULIAR_IDENTIFIER
;
STRING : '"' STRING_ELEMENT* '"';
CHARACTER : '#\\' (~(' ' | '\n') | CHARACTER_NAME);
TRUE : '#' ('t' | 'T');
FALSE : '#' ('f' | 'F');
// to ignore
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
COMMENT : ';' ~('\r' | '\n')* {$channel=HIDDEN;};
// fragments
fragment INITIAL : LETTER | SPECIAL_INITIAL;
fragment LETTER : 'a'..'z' | 'A'..'Z';
fragment SPECIAL_INITIAL : '!' | '$' | '%' | '&' | '*' | '/' | ':' | '<' | '=' | '>' | '?' | '^' | '_' | '~';
fragment SUBSEQUENT : INITIAL | DIGIT | SPECIAL_SUBSEQUENT;
fragment DIGIT : '0'..'9';
fragment SPECIAL_SUBSEQUENT : '.' | '+' | '-' | '#';
fragment PECULIAR_IDENTIFIER : '+' | '-';
fragment STRING_ELEMENT : ~('"' | '\\') | '\\' ('"' | '\\');
fragment CHARACTER_NAME : 'space' | 'newline';
fragment COMPLEX_2
: REAL_2 ('#' REAL_2)?
| REAL_2? SIGN UREAL_2? ('i' | 'I')
;
fragment COMPLEX_8
: REAL_8 ('#' REAL_8)?
| REAL_8? SIGN UREAL_8? ('i' | 'I')
;
fragment COMPLEX_10
: REAL_10 ('#' REAL_10)?
| REAL_10? SIGN UREAL_10? ('i' | 'I')
;
fragment COMPLEX_16
: REAL_16 ('#' REAL_16)?
| REAL_16? SIGN UREAL_16? ('i' | 'I')
;
fragment REAL_2 : SIGN? UREAL_2;
fragment REAL_8 : SIGN? UREAL_8;
fragment REAL_10 : SIGN? UREAL_10;
fragment REAL_16 : SIGN? UREAL_16;
fragment UREAL_2 : UINTEGER_2 ('/' UINTEGER_2)?;
fragment UREAL_8 : UINTEGER_8 ('/' UINTEGER_8)?;
fragment UREAL_10 : UINTEGER_10 ('/' UINTEGER_10)? | DECIMAL_10;
fragment UREAL_16 : UINTEGER_16 ('/' UINTEGER_16)?;
fragment DECIMAL_10
: UINTEGER_10 SUFFIX
| '.' DIGIT+ '#'* SUFFIX?
| DIGIT+ '.' DIGIT* '#'* SUFFIX?
| DIGIT+ '#'+ '.' '#'* SUFFIX?
;
fragment UINTEGER_2 : DIGIT_2+ '#'*;
fragment UINTEGER_8 : DIGIT_8+ '#'*;
fragment UINTEGER_10 : DIGIT+ '#'*;
fragment UINTEGER_16 : DIGIT_16+ '#'*;
fragment PREFIX_2 : RADIX_2 EXACTNESS? | EXACTNESS RADIX_2;
fragment PREFIX_8 : RADIX_8 EXACTNESS? | EXACTNESS RADIX_8;
fragment PREFIX_10 : RADIX_10 EXACTNESS? | EXACTNESS RADIX_10;
fragment PREFIX_16 : RADIX_16 EXACTNESS? | EXACTNESS RADIX_16;
fragment SUFFIX : EXPONENT_MARKER SIGN? DIGIT+;
fragment EXPONENT_MARKER : 'e' | 's' | 'f' | 'd' | 'l' | 'E' | 'S' | 'F' | 'D' | 'L';
fragment SIGN : '+' | '-';
fragment EXACTNESS : '#' ('i' | 'e' | 'I' | 'E');
fragment RADIX_2 : '#' ('b' | 'B');
fragment RADIX_8 : '#' ('o' | 'O');
fragment RADIX_10 : '#' ('d' | 'D');
fragment RADIX_16 : '#' ('x' | 'X');
fragment DIGIT_2 : '0' | '1';
fragment DIGIT_8 : '0'..'7';
fragment DIGIT_16 : DIGIT | 'a'..'f' | 'A'..'F';
which can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"(define sum-iter \n" +
" (lambda(n acc i) \n" +
" (if (> i n) \n" +
" acc \n" +
" (sum-iter n (+ acc i) (+ i 1))))) ";
R5RSLexer lexer = new R5RSLexer(new ANTLRStringStream(source));
R5RSParser parser = new R5RSParser(new CommonTokenStream(lexer));
parser.parse();
}
}
and to generate a lexer & parser, compile all Java source files and run the main class, do:
bart#hades:~/Programming/ANTLR/Demos/R5RS$ java -cp antlr-3.3.jar org.antlr.Tool R5RS.g
bart#hades:~/Programming/ANTLR/Demos/R5RS$ javac -cp antlr-3.3.jar *.java
bart#hades:~/Programming/ANTLR/Demos/R5RS$ java -cp .:antlr-3.3.jar Main
bart#hades:~/Programming/ANTLR/Demos/R5RS$
The fact that nothing is being printed on the console means the parser (and lexer) didn't find any errors with the provided source.
Note that I have no Unit tests and have only tested the single Scheme source inside the Main class. If you find errors in the ANTLR grammar, I'd appreciate to hear about them so I can fix the grammar. In due time, I'll probably commit the grammar to the official ANTLR Wiki.

Resources