how to resolve ambiguity for phone number in Boolean expression grammar - antlr3

I am trying to write a Boolean expression grammar that can treat WHITE_SPACE as an implicit logic AND. e.g., "A B" means "A AND B".
However, I would also like to treat the US-formatted phone number as a single toke, e.g., (123) 456-7890. My grammar can cover most cases, but is still facing grammar ambiguity on the AREA_CODE.
Here is my grammar:
grammar myBooleanExpr;
options
{
language = Java;
output = AST;
}
tokens {
AND;
}
fragment DIGIT : '0'..'9';
fragment AREA_CODE : LPAREN DIGIT+ RPAREN;
fragment NUMBER : ( DIGIT | '-' )+;
LPAREN : '(' ;
RPAREN : ')' ;
WS : ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; };
L_AND: 'AND'| 'And' | 'and';
OR : 'OR' | 'Or' | 'or';
NOT : 'NOT' | 'Not' | 'not';
NAME : (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' | '"') )*;
PHONE : AREA_CODE ' '? NUMBER?;
QUOTED_NAME : '"'.*'"';
expression : orexpression;
orexpression : andexpression (OR^ andexpression)*;
andexpression : notexpression (L_AND? notexpression)* -> ^(AND notexpression+);
notexpression : NOT^ atom | atom;
atom : NAME | PHONE | QUOTED_NAME | LPAREN! orexpression RPAREN!;
Input vs. Expected Output:
(123) 456-7890 -> (123) 456-7890 // single token
(123) abc -> 123 AND abc // two tokens
(123456) 789 -> 123456 AND 789 // two tokens ### currently
failed
(12 34) -> 12 AND 34 // two tokens ### currently
failed
(123) 456-aaaa -> 123 AND 456-aaaa // two tokens ### currently
failed
abc efg AND hij -> abc AND efg AND hij // three tokens
It is very difficult for me to understand the usage of input.LA(1) or so. Very appreciated if someone could jump in helping me on this issue.

I think you are trying to put too much into lexer rules. Parsing telephone numbers like that needs more flexibility, e.g. a single space char might not be enough and what about tabs? Instead you should lex all individual tokens (numbers, punctuation etc.) as usual and do a semantic check once you have the syntax tree from the parser run.
It's up to you to decide if a the space between two tokens is just that or can be interpreted as logical operation (here AND). Neither the parser nor the lexer can know that, it depends on the context. This is why you cannot make that grammar free of ambiquities.

Related

JISON errors occuring with nonterminals

I am writing a JISON file for a class and trying to use nonterminals in place of declaring associativity for operators but am utterly lost on what the errors really mean, as this is a one time assignment for a class and I haven't found amazing examples of using nonterminals for this use case.
My JISON code:
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
"*" return '*'
"/" return '/'
"-" return '-'
"+" return '+'
"^" return '^'
"!" return '!'
"%" return '%'
"(" return '('
")" return ')'
"PI" return 'PI'
"E" return 'E'
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start expressions
%% /* language grammar */
expressions
: e EOF
{ typeof console !== 'undefined' ? console.log($1) : print($1);
return $1; }
;
e
: NegExp
{$$ = $1;}
| MulExp
{$$ = $1;}
| PowExp
{$$ = $1;}
| UnaryExp
{$$ = $1;}
| RootExp
{$$ = $1;}
;
RootExp
: ’(’ RootExp ’)’
{$$ = ’(’ + $2 + ’)’;}
| NUMBER
{$$ = Number(yytext);}
| E
{$$ = ’E’;}
| PI
{$$ = ’PI’;}
;
UnaryExp
: UnaryExp '!'
{$$ = '(' + $1 + '!' + ')';}
| UnaryExp '%'
{$$ = '(' + $1 + '%' + ')';}
| '-' UnaryExp
{$$ = '(-' + $2 + ')';}
;
NegExp
: NegExp '+' e
{$$ = '(' + $1 + ' + ' + $3 + ')';}
| NegExp '-' e
{$$ = '(' + $1 + ' - ' + $3 + ')';}
;
MulExp
: MulExp '*' e
{$$ = '(' + $1 + ' * ' + $3 + ')';}
| MulExp '/' e
{$$ = '(' + $1 + ' / ' + $3 + ')';}
;
PowExp
: e '^' PowExp
{$$ = '(' + $1 + ' ^ ' + $3 + ')';}
;
And when I run jison filename.jison I get a slew of errors like:
Conflict in grammar: multiple actions possible when lookahead token is ^ in state 26
- reduce by rule: MulExp -> MulExp / e
- shift token (then go to state 13)
and:
States with conflicts:
State 3
e -> NegExp . #lookaheads= EOF ^ + - * /
NegExp -> NegExp .+ e #lookaheads= EOF + - ^ / *
NegExp -> NegExp .- e #lookaheads= EOF + - ^ / *
Again, I am not looking for someone to do my homework for me, but pointers on where to go or what to do to help debug will be greatly appreciated.
It's true; it is not easy to find examples of expression grammars which resolve ambiguity without using precedence declarations. That's probably because precedence declarations, in this particular use case, are extremely convenient and probably more readable than writing out an unambiguous grammar. The resulting parser is usually slightly more efficient, as well, because it avoids the chains of unit reductions imposed by the usual unambiguous grammar style.
The flip side of this convenience is that it does not help the student achieve an understanding of how grammars actually work, and without that understanding it is very difficult to apply precedence declarations to less clear-cut applications. So the exercise which gave rise to this question is certainly worthwhile.
One place you will find unambiguous expression grammars is in the specifications of (some) programming languages, because the unambiguous grammar does not depend on the precise nature of the algorithm used by parser generators to resolve parsing conflicts. These examples tend to be rather complicated, though, because real programming languages usually have a lot of operators. Even so, the sample C grammar in the jison examples directory does show the standard model for arithmetic expression grammars. The following extract is dramatically simplified, but most of the productions were simply copy-and-pasted from the original. (I removed many operators, most of the precedence levels, and some of the complications dealing with things like cast expressions and the idiosyncratic comma operator, which are surely not relevant here.)
primary_expression
: IDENTIFIER
| CONSTANT
| '(' expression ')'
;
postfix_expression
: primary_expression
| postfix_expression INC_OP
| postfix_expression DEC_OP
;
unary_expression
: postfix_expression
| '-' unary_expression
| INC_OP unary_expression
| DEC_OP unary_expression
;
/* I added this for explanatory purposes. DO NOT USE IT. See the text. */
exponential_expression
: unary_expression
| unary_expression '^' exponential_expression
multiplicative_expression
: exponential_expression
| multiplicative_expression '*' exponential_expression
| multiplicative_expression '/' exponential_expression
| multiplicative_expression '%' exponential_expression
;
additive_expression
: multiplicative_expression
| additive_expression '+' multiplicative_expression
| additive_expression '-' multiplicative_expression
;
expression
: additive_expression
;
C does not have an exponentiation operator, so I added one with right associativity and higher precedence than multiplication, which will serve for explanatory purposes. However, your assignment probably wants it to have higher precedence than unary negation as well, which I didn't do. So I don't recommend using the above directly.
One thing to note in the above model is that every precedence level corresponds to a non-terminal. These non-terminals are linked into an ordered chain using unit productions. We can see this sequence:
expression ⇒ additive_expression ⇒ multiplicative_expression ⇒ exponential_expression ⇒ unary_expression ⇒ postfix_expression ⇒ primary_expression
which indeeds corresponds to the precedence ordering of this grammar.
The other interesting aspect of the grammar is that the left-associative operators (all of them except exponentiation) are implemented with left-recursive productions, while the right-associative operator is implemented with a right-recursive production. This is not a coincidence.
That's the basic model, but it's worth spending a few minutes to try to understand how it actually works, because it turns out to be pretty simple. Let's take a look at one production, for multiplication, and see if we can understand why it implies that exponentiation binds more tightly and addition binds less tightly:
multiplicative_expression: multiplicative_expression '*' exponential_expression
This production is saying that a multiplicative_expression consists of a * with a multiplicative_expression on the left and an exponential_expression on the right.
Now, what does that mean for 2 + 3 * 4 ^ 2? 2 + 3 is an additive_expression, but we can see from the chain of unit productions that multiplicative_expression does not produce additive_expression. So the grammar does not include the possibility that 2 + 3 is the phrase matched on the left-hand side of the *. However, it is perfectly legal for 3 (a CONSTANT, which is a primary_expression) to be parsed as the left-hand operand of the multiplication.
Meanwhile, 4 ^ 2 is an exponential_expression, and our production clearly indicates that an exponential_expression can be matched on the right of the *.
A similar argument, examining the addition and exponential expression productions, would show that 3 * 4 ^ 2 (a multiplicative_expression) can be on the right-hand side of the + operator, while neither 2 + 3 * 4 (an additive_expression) nor 3 * 4 (a multiplicative_expression) can be on the left-hand side of the exponentiation operator.
In other words, this simple grammar defines precisely and unambiguously how the expression must be decomposed. There is only one possible parse tree:
expression
|
add
|
+--------+----------------+
| | |
| | mult
| | |
| | +-------+---------------+
| | | | |
| | | | power
| | | | |
| | | | +-------+-------+
| | | | | | |
add | mult | unary | power
... | ... | ... | ...
| | | | | | |
primary | primary | primary | primary
| | | | | | |
2 + 3 * 4 ^ 2
I hope that helps somewhat.

Parsing iOS/macOS localizable strings file with antlr4

I am trying to parse the localized "strings" files of macOS/iOS.
The format of this file is based on key/value pairs, with optional comments. An example follows:
/* This is a comment */
// This is also a comment
"key1" = "value1";
"key2" = "value2";
and so on. NOTE inside the "" could be absolutely any kind of text.
EDIT Original errorneus grammar removed
I tried to write this simple grammar, but unfortunately it doesn't work.
Since the contents inside the quotes could be quite tricky, not to mention the comments, I feel that usual regex has no real power there.
EDIT based on the comments by #GRosenberg I've created a new grammar. Now I have the problem that I can't include "Symbols" as a Char, or else parsing will break.
grammar LProj;
Esc : '\\';
Spaces : [ \t\r\n\f]+;
BlockComment : '/*' .*? ('*/' | EOF) ;
LineComment : '//' ~[\r\n]* ( '\r'? '\n' [ \t]* '//' ~[\r\n]* )* ;
MLN_COMMENT: BlockComment -> channel(HIDDEN) ;
SLN_COMMENT: LineComment -> channel(HIDDEN) ;
doc : expression*;
expression
: BlockComment
| LineComment
| Spaces
| entry
;
entry : '"' key=VALUE '"' Spaces? '=' Spaces? '"' value=VALUE '"' Spaces? ';' ;
VALUE : ( EscSeq | Val )+ ;
fragment Val : Char ( EscSeq | Char )* ;
fragment Symbol
: '*'
| '/'
| ';'
| '='
;
fragment Char
: Spaces
| '!' // skip "
| '#'..')' // skip *
| '+'..'.' // skip /
| '0'..':' // skip ;
| '<' // skip =
| '>'..'[' // skip \
| ']'..'~'
| '\u00B7'..'\ufffd'
; // ignores | ['\u10000-'\uEFFFF] ;
fragment UnicodeEsc
: 'u' (Hex (Hex (Hex Hex?)?)?)?
;
fragment Hex : [0-9a-fA-F] ;
fragment EscSeq
: Esc
( [btnfr"\\] // standard escaped character set
| UnicodeEsc // standard Unicode escape sequence
| . // Invalid escape character
| EOF // Incomplete at EOF
)
;
The Antlr grammar repository, provides good examples of how to achieve the stated goal. Just define the ID terminal to allow for inclusion of escape sequences.
Thus (with obvious details omitted),
id : QUOTE key=ID EQ val=ID QUOTE ;
DOC_COMMENT: DocComment -> channel(HIDDEN) ;
MLN_COMMENT: BlockComment -> channel(HIDDEN) ;
SLN_COMMENT: LineComment -> channel(HIDDEN) ;
NAME : NameStartChar NameChar* ;
VALUE : ( EsqSeq | Val )+ ;
fragment Val : NameStartChar ( EsqSeq | NameChar )* ;
fragment Hws : [ \t] ;
fragment Vws : [\r\n\f] ;
fragment DocComment : '/**' .*? ('*/' | EOF) ;
fragment BlockComment : '/*' .*? ('*/' | EOF) ;
fragment LineComment : '//' ~[\r\n]* ( '\r'? '\n' Hws* '//' ~[\r\n]* )* ;
// escaped short-cut character or Unicode literal
fragment EscSeq
: Esc
( [btnfr"\\] // standard escaped character set
| UnicodeEsc // standard Unicode escape sequence
| . // Invalid escape character
| EOF // Incomplete at EOF
)
;
fragment Esc : '\\' ;
fragment UnicodeEsc
: 'u' (Hex (Hex (Hex Hex?)?)?)?
;
// A valid hex digit
fragment Hex : [0-9a-fA-F] ;
fragment NameChar
: NameStartChar
| '0'..'9'
| '_'
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment NameStartChar
: 'A'..'Z'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
; // ignores | ['\u10000-'\uEFFFF] ;

ANTLRWorks v1.4.3 Debugger random behaviour (Can't connect to debugger)

If I debug this grammar:
grammar CDBFile;
options {
language=Java;
TokenLabelType=CommonToken;
output=AST;
k=1;
ASTLabelType=CommonTree;
}
tokens {
IMAG_COMPILE_UNIT;
MODULE;
}
//#lexer::namespace{Parser}
//#parser::namespace{Parser}
#lexer::header {
}
#lexer::members {
}
#parser::header {
}
#parser::members {
}
/*
* Lexer Rules
*/
fragment LETTER :
'a'..'z'
| 'A'..'Z';
MODULE_NAME
:
(LETTER)*
;
COLON
:
':'
;
/*
* Parser Rules
*/
public
compileUnit
:
(basic_record)* EOF
;
basic_record
:
(
'M' COLON module_record
| 'F' COLON function_record
) ('\n')?
;
module_record
:
MODULE_NAME
;
function_record
:
function_scope MODULE_NAME '$'
;
function_scope
:
('G$' | 'F$' | 'L$')
;
With just this input:
M:divide
the debugger does simply not start saying
"Cannot launch the debuggerTab. Time-out waiting to connect to the remote parser".
But using this grammar here:
grammar Calculator;
options {
//DO NOT CHANGE THESE!
backtrack = false;
k = 1;
output = AST;
ASTLabelType = CommonTree;
//SERIOUSLY, DO NOT CHANGE THESE!
}
tokens {
// Imaginary tokens
// Root
PROGRAM;
// function top level
FUNCTION_DECLARATION;
FUNCTION_HEAD;
FUNCTION_BODY;
DECL;
FUN;
// if-else-statement
IF_STATEMENT;
IF_CONDITION;
IF_BODY;
ELSE_BODY;
// for-loop
FOR_STATEMENT;
FOR_INITIALIZE;
FOR_CONDITION;
FOR_INCREMENT;
FOR_BODY;
// Non-imaginary tokens
}
#lexer::header {
package at.tugraz.ist.cc;
}
#lexer::members {
}
#parser::header {
package at.tugraz.ist.cc;
}
#parser::members {
}
//Lexer rules
ASSIGNOP :
'=';
OR :
'||';
AND :
'&&';
RELOP :
'<'
| '<='
| '>'
| '>='
| '=='
| '!=';
SIGN :
'+'
| '-';
MULOP :
'*'
| '/'
| '%';
NOT :
'!';
fragment OPERATORS :
'<'
| '>'
| '='
| '+'
| '-'
| '/'
| '%'
| '*'
| '|'
| '&';
INT :
'0'
| DIGIT DIGIT0*;
fragment DIGIT :
'1'..'9';
fragment DIGIT0 :
'0'..'9';
BOOLEAN :
'true'
| 'false';
ID :
LETTER
(
LETTER
| DIGIT0
| '_'
)*;
fragment LETTER :
'a'..'z'
| 'A'..'Z';
PUNCT :
'.'
| ','
| ';'
| ':'
| '!';
WS :
(
' '
| '\t'
| '\r'
| '\n'
)
{
$channel = HIDDEN;
};
LITERAL :
'"'
(
LETTER
| DIGIT
| '_'
| '\\'
| OPERATORS
| PUNCT
| WS
)*
'"';
// parse rules
program :
functions -> ^(PROGRAM functions)
;
functions :
(function_declaration functions)?
;
function_declaration :
head=function_head '{' declarations optional_stmt return_stmt rc='}' -> ^(FUNCTION_DECLARATION[$head.start, $head.text] function_head ^( FUNCTION_BODY[rc,"FUNCTION_BODY"] declarations optional_stmt? return_stmt))
;
function_head :
typeInfo=type ID arguments -> ^(FUNCTION_HEAD[$typeInfo.start, "FUNCTION_HEAD"] type ID arguments?)
;
type :
'int'
| 'boolean'
| 'String'
;
arguments :
'(' ! argument_optional ')' !;
argument_optional :
parameter_list ? -> ^(DECL parameter_list)? ;
parameter_list :
type ID parameter_list2 -> ^(type ID) parameter_list2
;
parameter_list2 :
(',' type ID)* -> ^(type ID)*;
declarations :
( type idlist ';' )* -> ^(DECL ( ^(type idlist))*) ;
idlist :
( ID idlist2 );
idlist2 :
( ',' ! idlist ) ?;
optional_stmt :
( stmt_list ) ?;
stmt_list :
statement statement2;
statement2 :
stmt_list ?;
return_stmt :
'return' ^ expression ';' ! ;
statement :
(
compound_stmt
| ifThenElse
| forLoop
| assignment ';' !
) ;
ifThenElse :
(
'if' '(' ifCondition=expression ')' ifBody=statement 'else' elseBody=statement -> ^(IF_STATEMENT ^(IF_CONDITION $ifCondition) ^(IF_BODY $ifBody) ^(ELSE_BODY $elseBody))
)
;
forLoop :
(
'for' '(' forInitialization=assignment ';' forCondition=expression ';' forIncrement=assignment ')' forBody=statement ->
^(FOR_STATEMENT ^(FOR_INITIALIZE $forInitialization) ^(FOR_CONDITION $forCondition) ^(FOR_INCREMENT $forIncrement) ^(FOR_BODY $forBody))
)
;
compound_stmt :
'{'! optional_stmt '}' !;
assignment :
ID ASSIGNOP ^ expression;
expression: andExpression (OR ^ andExpression)*;
andExpression: relOPExpression (AND ^ relOPExpression)*;
relOPExpression: signExpression (RELOP ^ signExpression)*;
signExpression : mulExpression (SIGN ^ mulExpression)*;
mulExpression : factor (MULOP ^ factor)*;
factor :
(
factorID
| INT
| BOOLEAN
| LITERAL
| NOT ^ factor
| SIGN ^ factor
| '('! expression ')' !
);
factorID: ID
( function_call -> ^(FUN ID function_call)
| -> ID
)
;
function_call :
'('! function_call_opt ')' !;
function_call_opt :
extend_assign_expr_list ? ;
extend_assign_expr_list :
(
expression
extend_assign_expr_list1
) ;
extend_assign_expr_list1 :
( ',' ! extend_assign_expr_list ) ? ;
parsing an Input like
int main()
{
return 0;
}
works just fine!
The internet has a lot of suggestions regarding this issue but none of them seem to work. The thing is that the debugger DOES work. Assuming that not the input is the problem here, the grammar has to be it. But if there is a problem with the grammar why would the Interpreter work for both examples?
Any ideas?
Edit:
I have noticed that for some reason in __Test__.java just contains:
M:divide
F:G0
I also get this output while Interpreting M:asd:
[13:47:52] Interpreting...
[13:47:52] problem matching token at 1:3 NoViableAltException('a'#[1:1: Tokens : ( T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | COLON );])
[13:47:52] problem matching token at 1:4 NoViableAltException('s'#[1:1: Tokens : ( T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | COLON );])
[13:47:52] problem matching token at 1:5 NoViableAltException('d'#[1:1: Tokens : ( T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | COLON );])
(even thought the tree is correct)
AFAIK, the debugger only works with the Java target. Since you have C# specific code in your first grammar:
#lexer::namespace{Parser}
#parser::namespace{Parser}
there are no .java classes generated (or at least, none that will compile), and the debugger hangs (and times out).
EDIT
I see you're using fragment rules in your parser rules: you can't. Fragment rules will never become a token on their own, they're only there for other lexer rules.
I've tested the grammar without the C# code in ANTLRWorks 1.4.3, and had no issues.
You could try the following:
restarting ANTLRWorks
changing the port the debugger listens on (perhaps the port is used by another service, or another debug-run of ANTLRWorks)
use the most recent version of ANTLRWorks

Regexp issue involving reverse polish calculator

I'm trying to use a regular expression to solve a reverse polish calculator problem, but I'm having issues with converting the mathematical expressions into conventional form.
I wrote:
puts '35 29 1 - 5 + *'.gsub(/(\d*) (\d*) (\W)/, '(\1\3\2)')
which prints:
35 (29-1)(+5) *
expected
(35*((29-1)+5))
but I'm getting a different result. What am I doing wrong?
I'm assuming you meant you tried
puts '35 29 1 - 5 + *'.gsub(/(\d*) (\d*) (\W)/, '(\1\3\2)')
^ ^
Anyway, you have to use the quantifier + instead of *, since otherwise you will match an empty string for \d* as one of your captures, hence the (+5):
/(\d+) (\d+) (\W)/
I would further extend/constrain the expression to something like:
/([\d+*\/()-]+)\s+([\d+*\/()-]+)\s+([+*\/-])/
| | | | |
| | | | Valid operators, +, -, *, and /.
| | | |
| | | Whitespace.
| | |
| | Arbitrary atom, e.g. "35", "(29-1)", "((29-1)+5)".
| |
| Whitepsace.
|
Arbitrary atom, e.g. "35", "(29-1)", "((29-1)+5)".
...and instead of using gsub, use sub in a while loop that quits when it detects that no more substitutions can be made. This is very important because otherwise, you will violate the order of operations. For example, take a look at this Rubular demo. You can see that by using gsub, you might potentially replace the second triad of atoms, "5 + *", when really a second iteration should substitute an "earlier" triad after substituting the first triad!
WARNING: The - (minus) character must appear first or last in a character class, since otherwise it will specify a range! (Thanks to #JoshuaCheek.)

ANTLR resolving non-LL(*) problems and syntactic predicates

consider following rules in the parser:
expression
: IDENTIFIER
| (...)
| procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
procedure_call
: '(' expression expression* ')'
;
macro_use
: '(' IDENTIFIER datum* ')'
;
and
// Note that any string that parses as an <expression> will also parse as a <datum>.
datum
: simple_datum
| compound_datum
;
simple_datum
: BOOLEAN
| NUMBER
| CHARACTER
| STRING
| IDENTIFIER
;
compound_datum
: list
| vector
;
list
: '(' (datum+ ( '.' datum)?)? ')'
| ABBREV_PREFIX datum
;
fragment ABBREV_PREFIX
: ('\'' | '`' | ',' | ',#')
;
vector
: '#(' datum* ')'
;
the procedure_call and macro_rule alternative in the expression rule generate an non-LL(*) structure error. I can see the problem, since (IDENTIFIER) will parse as both. but even when i define both with + instead of *, it generates the error, even though above example shouldn't be parsing anymore.
i came up with the usage of syntactic predicates, but i can't figure out how to use them to do the trick here.
something like
expression
: IDENTIFIER
| (...)
| (procedure_call)=>procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
or
expression
: IDENTIFIER
| (...)
| ('(' IDENTIFIER expression)=>procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
doesnt work either, since none but the first rule will match anything. is there a proper way to solve that?
I found a JavaCC grammar of R5RS which I used to (quickly!) write an ANTLR equivalent:
/*
* Copyright (C) 2011 by Bart Kiers, based on the work done by Håkan L. Younes'
* JavaCC R5RS grammar, available at: http://mindprod.com/javacc/R5RS.jj
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
grammar R5RS;
parse
: commandOrDefinition* EOF
;
commandOrDefinition
: (syntaxDefinition)=> syntaxDefinition
| (definition)=> definition
| ('(' BEGIN commandOrDefinition)=> '(' BEGIN commandOrDefinition+ ')'
| command
;
syntaxDefinition
: '(' DEFINE_SYNTAX keyword transformerSpec ')'
;
definition
: '(' DEFINE ( variable expression ')'
| '(' variable defFormals ')' body ')'
)
| '(' BEGIN definition* ')'
;
defFormals
: variable* ('.' variable)?
;
keyword
: identifier
;
transformerSpec
: '(' SYNTAX_RULES '(' identifier* ')' syntaxRule* ')'
;
syntaxRule
: '(' pattern template ')'
;
pattern
: patternIdentifier
| '(' (pattern+ ('.' pattern | ELLIPSIS)?)? ')'
| '#(' (pattern+ ELLIPSIS? )? ')'
| patternDatum
;
patternIdentifier
: syntacticKeyword
| VARIABLE
;
patternDatum
: STRING
| CHARACTER
| bool
| number
;
template
: patternIdentifier
| '(' (templateElement+ ('.' templateElement)?)? ')'
| '#(' templateElement* ')'
| templateDatum
;
templateElement
: template ELLIPSIS?
;
templateDatum
: patternDatum
;
command
: expression
;
identifier
: syntacticKeyword
| variable
;
syntacticKeyword
: expressionKeyword
| ELSE
| ARROW
| DEFINE
| UNQUOTE
| UNQUOTE_SPLICING
;
expressionKeyword
: QUOTE
| LAMBDA
| IF
| SET
| BEGIN
| COND
| AND
| OR
| CASE
| LET
| LETSTAR
| LETREC
| DO
| DELAY
| QUASIQUOTE
;
expression
: (variable)=> variable
| (literal)=> literal
| (lambdaExpression)=> lambdaExpression
| (conditional)=> conditional
| (assignment)=> assignment
| (derivedExpression)=> derivedExpression
| (procedureCall)=> procedureCall
| (macroUse)=> macroUse
| macroBlock
;
variable
: VARIABLE
| ELLIPSIS
;
literal
: quotation
| selfEvaluating
;
quotation
: '\'' datum
| '(' QUOTE datum ')'
;
selfEvaluating
: bool
| number
| CHARACTER
| STRING
;
lambdaExpression
: '(' LAMBDA formals body ')'
;
formals
: '(' (variable+ ('.' variable)?)? ')'
| variable
;
conditional
: '(' IF test consequent alternate? ')'
;
test
: expression
;
consequent
: expression
;
alternate
: expression
;
assignment
: '(' SET variable expression ')'
;
derivedExpression
: quasiquotation
| '(' ( COND ( '(' ELSE sequence ')'
| condClause+ ('(' ELSE sequence ')')?
)
| CASE expression ( '(' ELSE sequence ')'
| caseClause+ ('(' ELSE sequence ')')?
)
| AND test*
| OR test*
| LET variable? '(' bindingSpec* ')' body
| LETSTAR '(' bindingSpec* ')' body
| LETREC '(' bindingSpec* ')' body
| BEGIN sequence
| DO '(' iterationSpec* ')' '(' test doResult? ')' command*
| DELAY expression
)
')'
;
condClause
: '(' test (sequence | ARROW recipient)? ')'
;
recipient
: expression
;
caseClause
: '(' '(' datum* ')' sequence ')'
;
bindingSpec
: '(' variable expression ')'
;
iterationSpec
: '(' variable init step? ')'
;
init
: expression
;
step
: expression
;
doResult
: sequence
;
procedureCall
: '(' operator operand* ')'
;
operator
: expression
;
operand
: expression
;
macroUse
: '(' keyword datum* ')'
;
macroBlock
: '(' (LET_SYNTAX | LETREC_SYNTAX) '(' syntaxSpec* ')' body ')'
;
syntaxSpec
: '(' keyword transformerSpec ')'
;
body
: ((definition)=> definition)* sequence
;
//sequence
// : ((command)=> command)* expression
// ;
sequence
: expression+
;
datum
: simpleDatum
| compoundDatum
;
simpleDatum
: bool
| number
| CHARACTER
| STRING
| identifier
;
compoundDatum
: list
| vector
;
list
: '(' (datum+ ('.' datum)?)? ')'
| abbreviation
;
abbreviation
: abbrevPrefix datum
;
abbrevPrefix
: '\'' | '`' | ',#' | ','
;
vector
: '#(' datum* ')'
;
number
: NUM_2
| NUM_8
| NUM_10
| NUM_16
;
bool
: TRUE
| FALSE
;
quasiquotation
: quasiquotationD[1]
;
quasiquotationD[int d]
: '`' qqTemplate[d]
| '(' QUASIQUOTE qqTemplate[d] ')'
;
qqTemplate[int d]
: (expression)=> expression
| ('(' UNQUOTE)=> unquotation[d]
| simpleDatum
| vectorQQTemplate[d]
| listQQTemplate[d]
;
vectorQQTemplate[int d]
: '#(' qqTemplateOrSplice[d]* ')'
;
listQQTemplate[int d]
: '\'' qqTemplate[d]
| ('(' QUASIQUOTE)=> quasiquotationD[d+1]
| '(' (qqTemplateOrSplice[d]+ ('.' qqTemplate[d])?)? ')'
;
unquotation[int d]
: ',' qqTemplate[d-1]
| '(' UNQUOTE qqTemplate[d-1] ')'
;
qqTemplateOrSplice[int d]
: ('(' UNQUOTE_SPLICING)=> splicingUnquotation[d]
| qqTemplate[d]
;
splicingUnquotation[int d]
: ',#' qqTemplate[d-1]
| '(' UNQUOTE_SPLICING qqTemplate[d-1] ')'
;
// macro keywords
LET_SYNTAX : 'let-syntax';
LETREC_SYNTAX : 'letrec-syntax';
SYNTAX_RULES : 'syntax-rules';
DEFINE_SYNTAX : 'define-syntax';
// syntactic keywords
ELSE : 'else';
ARROW : '=>';
DEFINE : 'define';
UNQUOTE_SPLICING : 'unquote-splicing';
UNQUOTE : 'unquote';
// expression keywords
QUOTE : 'quote';
LAMBDA : 'lambda';
IF : 'if';
SET : 'set!';
BEGIN : 'begin';
COND : 'cond';
AND : 'and';
OR : 'or';
CASE : 'case';
LET : 'let';
LETSTAR : 'let*';
LETREC : 'letrec';
DO : 'do';
DELAY : 'delay';
QUASIQUOTE : 'quasiquote';
NUM_2 : PREFIX_2 COMPLEX_2;
NUM_8 : PREFIX_8 COMPLEX_8;
NUM_10 : PREFIX_10? COMPLEX_10;
NUM_16 : PREFIX_16 COMPLEX_16;
ELLIPSIS : '...';
VARIABLE
: INITIAL SUBSEQUENT*
| PECULIAR_IDENTIFIER
;
STRING : '"' STRING_ELEMENT* '"';
CHARACTER : '#\\' (~(' ' | '\n') | CHARACTER_NAME);
TRUE : '#' ('t' | 'T');
FALSE : '#' ('f' | 'F');
// to ignore
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
COMMENT : ';' ~('\r' | '\n')* {$channel=HIDDEN;};
// fragments
fragment INITIAL : LETTER | SPECIAL_INITIAL;
fragment LETTER : 'a'..'z' | 'A'..'Z';
fragment SPECIAL_INITIAL : '!' | '$' | '%' | '&' | '*' | '/' | ':' | '<' | '=' | '>' | '?' | '^' | '_' | '~';
fragment SUBSEQUENT : INITIAL | DIGIT | SPECIAL_SUBSEQUENT;
fragment DIGIT : '0'..'9';
fragment SPECIAL_SUBSEQUENT : '.' | '+' | '-' | '#';
fragment PECULIAR_IDENTIFIER : '+' | '-';
fragment STRING_ELEMENT : ~('"' | '\\') | '\\' ('"' | '\\');
fragment CHARACTER_NAME : 'space' | 'newline';
fragment COMPLEX_2
: REAL_2 ('#' REAL_2)?
| REAL_2? SIGN UREAL_2? ('i' | 'I')
;
fragment COMPLEX_8
: REAL_8 ('#' REAL_8)?
| REAL_8? SIGN UREAL_8? ('i' | 'I')
;
fragment COMPLEX_10
: REAL_10 ('#' REAL_10)?
| REAL_10? SIGN UREAL_10? ('i' | 'I')
;
fragment COMPLEX_16
: REAL_16 ('#' REAL_16)?
| REAL_16? SIGN UREAL_16? ('i' | 'I')
;
fragment REAL_2 : SIGN? UREAL_2;
fragment REAL_8 : SIGN? UREAL_8;
fragment REAL_10 : SIGN? UREAL_10;
fragment REAL_16 : SIGN? UREAL_16;
fragment UREAL_2 : UINTEGER_2 ('/' UINTEGER_2)?;
fragment UREAL_8 : UINTEGER_8 ('/' UINTEGER_8)?;
fragment UREAL_10 : UINTEGER_10 ('/' UINTEGER_10)? | DECIMAL_10;
fragment UREAL_16 : UINTEGER_16 ('/' UINTEGER_16)?;
fragment DECIMAL_10
: UINTEGER_10 SUFFIX
| '.' DIGIT+ '#'* SUFFIX?
| DIGIT+ '.' DIGIT* '#'* SUFFIX?
| DIGIT+ '#'+ '.' '#'* SUFFIX?
;
fragment UINTEGER_2 : DIGIT_2+ '#'*;
fragment UINTEGER_8 : DIGIT_8+ '#'*;
fragment UINTEGER_10 : DIGIT+ '#'*;
fragment UINTEGER_16 : DIGIT_16+ '#'*;
fragment PREFIX_2 : RADIX_2 EXACTNESS? | EXACTNESS RADIX_2;
fragment PREFIX_8 : RADIX_8 EXACTNESS? | EXACTNESS RADIX_8;
fragment PREFIX_10 : RADIX_10 EXACTNESS? | EXACTNESS RADIX_10;
fragment PREFIX_16 : RADIX_16 EXACTNESS? | EXACTNESS RADIX_16;
fragment SUFFIX : EXPONENT_MARKER SIGN? DIGIT+;
fragment EXPONENT_MARKER : 'e' | 's' | 'f' | 'd' | 'l' | 'E' | 'S' | 'F' | 'D' | 'L';
fragment SIGN : '+' | '-';
fragment EXACTNESS : '#' ('i' | 'e' | 'I' | 'E');
fragment RADIX_2 : '#' ('b' | 'B');
fragment RADIX_8 : '#' ('o' | 'O');
fragment RADIX_10 : '#' ('d' | 'D');
fragment RADIX_16 : '#' ('x' | 'X');
fragment DIGIT_2 : '0' | '1';
fragment DIGIT_8 : '0'..'7';
fragment DIGIT_16 : DIGIT | 'a'..'f' | 'A'..'F';
which can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"(define sum-iter \n" +
" (lambda(n acc i) \n" +
" (if (> i n) \n" +
" acc \n" +
" (sum-iter n (+ acc i) (+ i 1))))) ";
R5RSLexer lexer = new R5RSLexer(new ANTLRStringStream(source));
R5RSParser parser = new R5RSParser(new CommonTokenStream(lexer));
parser.parse();
}
}
and to generate a lexer & parser, compile all Java source files and run the main class, do:
bart#hades:~/Programming/ANTLR/Demos/R5RS$ java -cp antlr-3.3.jar org.antlr.Tool R5RS.g
bart#hades:~/Programming/ANTLR/Demos/R5RS$ javac -cp antlr-3.3.jar *.java
bart#hades:~/Programming/ANTLR/Demos/R5RS$ java -cp .:antlr-3.3.jar Main
bart#hades:~/Programming/ANTLR/Demos/R5RS$
The fact that nothing is being printed on the console means the parser (and lexer) didn't find any errors with the provided source.
Note that I have no Unit tests and have only tested the single Scheme source inside the Main class. If you find errors in the ANTLR grammar, I'd appreciate to hear about them so I can fix the grammar. In due time, I'll probably commit the grammar to the official ANTLR Wiki.

Resources