Parsing iOS/macOS localizable strings file with antlr4

I am trying to parse the localized "strings" files of macOS/iOS.
The format of this file is based on key/value pairs, with optional comments. An example follows:
/* This is a comment */
// This is also a comment
"key1" = "value1";
"key2" = "value2";
and so on. Note that the text inside the quotes can be absolutely anything.
EDIT: The original, erroneous grammar has been removed.
I tried to write this simple grammar, but unfortunately it doesn't work.
Since the contents inside the quotes can be quite tricky, not to mention the comments, I feel that an ordinary regex is not powerful enough here.
EDIT: Based on the comments by @GRosenberg I've created a new grammar. Now I have the problem that I can't include "Symbols" as a Char, or else parsing will break.
grammar LProj;
Esc : '\\';
Spaces : [ \t\r\n\f]+;
BlockComment : '/*' .*? ('*/' | EOF) ;
LineComment : '//' ~[\r\n]* ( '\r'? '\n' [ \t]* '//' ~[\r\n]* )* ;
MLN_COMMENT: BlockComment -> channel(HIDDEN) ;
SLN_COMMENT: LineComment -> channel(HIDDEN) ;
doc : expression*;
expression
: BlockComment
| LineComment
| Spaces
| entry
;
entry : '"' key=VALUE '"' Spaces? '=' Spaces? '"' value=VALUE '"' Spaces? ';' ;
VALUE : ( EscSeq | Val )+ ;
fragment Val : Char ( EscSeq | Char )* ;
fragment Symbol
: '*'
| '/'
| ';'
| '='
;
fragment Char
: Spaces
| '!' // skip "
| '#'..')' // skip *
| '+'..'.' // skip /
| '0'..':' // skip ;
| '<' // skip =
| '>'..'[' // skip \
| ']'..'~'
| '\u00B7'..'\ufffd'
; // ignores | ['\u10000-'\uEFFFF] ;
fragment UnicodeEsc
: 'u' (Hex (Hex (Hex Hex?)?)?)?
;
fragment Hex : [0-9a-fA-F] ;
fragment EscSeq
: Esc
( [btnfr"\\] // standard escaped character set
| UnicodeEsc // standard Unicode escape sequence
| . // Invalid escape character
| EOF // Incomplete at EOF
)
;

The ANTLR grammar repository provides good examples of how to achieve the stated goal. Just define the ID terminal to allow for inclusion of escape sequences.
Thus (with obvious details omitted),
id : QUOTE key=ID EQ val=ID QUOTE ;
DOC_COMMENT: DocComment -> channel(HIDDEN) ;
MLN_COMMENT: BlockComment -> channel(HIDDEN) ;
SLN_COMMENT: LineComment -> channel(HIDDEN) ;
NAME : NameStartChar NameChar* ;
VALUE : ( EscSeq | Val )+ ;
fragment Val : NameStartChar ( EscSeq | NameChar )* ;
fragment Hws : [ \t] ;
fragment Vws : [\r\n\f] ;
fragment DocComment : '/**' .*? ('*/' | EOF) ;
fragment BlockComment : '/*' .*? ('*/' | EOF) ;
fragment LineComment : '//' ~[\r\n]* ( '\r'? '\n' Hws* '//' ~[\r\n]* )* ;
// escaped short-cut character or Unicode literal
fragment EscSeq
: Esc
( [btnfr"\\] // standard escaped character set
| UnicodeEsc // standard Unicode escape sequence
| . // Invalid escape character
| EOF // Incomplete at EOF
)
;
fragment Esc : '\\' ;
fragment UnicodeEsc
: 'u' (Hex (Hex (Hex Hex?)?)?)?
;
// A valid hex digit
fragment Hex : [0-9a-fA-F] ;
fragment NameChar
: NameStartChar
| '0'..'9'
| '_'
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment NameStartChar
: 'A'..'Z'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
; // ignores | ['\u10000-'\uEFFFF] ;
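To make the escape handling concrete, here is a small Java sketch of what a consumer might do with the text matched between the quotes once a rule like EscSeq has accepted it. The class name StringsEscapes and the method unescape are made up for this illustration and are not part of ANTLR. Keeping the character after an unknown escape, and keeping a lone trailing backslash, are assumptions of this sketch rather than anything the grammar prescribes.

```java
final class StringsEscapes {
    /** Decodes backslash escapes in the text found between the quotes. */
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i++);
            if (c != '\\' || i == raw.length()) {
                out.append(c); // plain character, or a lone backslash at the end
                continue;
            }
            char e = raw.charAt(i++);
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case 'u': {
                    // Consume up to four hex digits, mirroring the UnicodeEsc fragment.
                    int start = i;
                    while (i < raw.length() && i - start < 4
                            && Character.digit(raw.charAt(i), 16) >= 0) {
                        i++;
                    }
                    if (i == start) {
                        out.append('u'); // assumption: no digits, keep the letter
                    } else {
                        out.append((char) Integer.parseInt(raw.substring(start, i), 16));
                    }
                    break;
                }
                default:
                    out.append(e); // covers \" and \\ as well as unknown escapes
            }
        }
        return out.toString();
    }
}
```

For example, applying unescape to the eight characters \u0041BC yields ABC, because at most four hex digits are consumed after the u.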


Modifying a Grammar in ANTLR to manage expressions correctly

I'm working on a simple expression evaluator with ANTLR.
This is the grammar I created:
expression : relationExpr (cond_op expression)? ;
relationExpr: addExpr (rel_op relationExpr)? ;
addExpr: multExpr (add_op addExpr)? ;
multExpr: unaryExpr (mult_op multExpr)? ;
unaryExpr: '-' value | '!' value | value ;
value: literal | '('expression')' ;
mult_op : '*' | '/' | '%' ;
add_op : '+'|'-' ;
rel_op : '<' | '>' | '<=' | '>='| eq_op ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal | char_literal | bool_literal ;
int_literal : NUM ;
char_literal : CHAR ;
bool_literal : TRUE | FALSE ;
The problem I'm having is that the operators do not associate to the left.
For example, if I evaluate: 10+20*2/10
I get this tree:
As you can see, the / operator is evaluated first, whereas evaluation should proceed from the left.
Can you help me modify the grammar to get the association right?
If you do not really need your parse tree nodes to be binary operations, the grammar below yields a single multExpr node, which you can walk left to right; I believe that is your goal.
NUM : [0-9]+;
expression : relationExpr (cond_op relationExpr)* ;
relationExpr: addExpr (rel_op addExpr)* ;
addExpr: multExpr (add_op multExpr)* ;
multExpr: unaryExpr (mult_op unaryExpr)* ;
unaryExpr: '-' value | '!' value | value ;
value: literal | '('expression')' ;
mult_op : '*' | '/' | '%' ;
add_op : '+'|'-' ;
rel_op : '<' | '>' | '<=' | '>='| eq_op ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal ;
int_literal : NUM ;
Parsing your example expression 10+20*2/10<8 produces the parse tree below.
Hope this helps.
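Walking such a flat node left to right might look like the following Java sketch. The list of alternating operands and operators stands in for the children of a multExpr node. LeftFold and evalMult are hypothetical names, not ANTLR API, and integer arithmetic is assumed.

```java
import java.util.List;

final class LeftFold {
    /** Folds [operand, op, operand, op, ...] strictly from left to right. */
    static int evalMult(List<String> children) {
        int acc = Integer.parseInt(children.get(0));
        for (int i = 1; i + 1 < children.size(); i += 2) {
            String op = children.get(i);
            int rhs = Integer.parseInt(children.get(i + 1));
            switch (op) {
                case "*": acc *= rhs; break;
                case "/": acc /= rhs; break;
                case "%": acc %= rhs; break;
                default:
                    throw new IllegalArgumentException("unexpected operator: " + op);
            }
        }
        return acc;
    }
}
```

For 20*2/10 this folds as (20*2)/10 = 4, whereas the right-nested tree from the original grammar would compute 20*(2/10) = 0 in integer arithmetic.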

ANTLRWorks v1.4.3 Debugger random behaviour (Can't connect to debugger)

If I debug this grammar:
grammar CDBFile;
options {
language=Java;
TokenLabelType=CommonToken;
output=AST;
k=1;
ASTLabelType=CommonTree;
}
tokens {
IMAG_COMPILE_UNIT;
MODULE;
}
//@lexer::namespace{Parser}
//@parser::namespace{Parser}
@lexer::header {
}
@lexer::members {
}
@parser::header {
}
@parser::members {
}
/*
* Lexer Rules
*/
fragment LETTER :
'a'..'z'
| 'A'..'Z';
MODULE_NAME
:
(LETTER)*
;
COLON
:
':'
;
/*
* Parser Rules
*/
public
compileUnit
:
(basic_record)* EOF
;
basic_record
:
(
'M' COLON module_record
| 'F' COLON function_record
) ('\n')?
;
module_record
:
MODULE_NAME
;
function_record
:
function_scope MODULE_NAME '$'
;
function_scope
:
('G$' | 'F$' | 'L$')
;
With just this input:
M:divide
the debugger simply does not start, saying
"Cannot launch the debuggerTab. Time-out waiting to connect to the remote parser".
But using this grammar here:
grammar Calculator;
options {
//DO NOT CHANGE THESE!
backtrack = false;
k = 1;
output = AST;
ASTLabelType = CommonTree;
//SERIOUSLY, DO NOT CHANGE THESE!
}
tokens {
// Imaginary tokens
// Root
PROGRAM;
// function top level
FUNCTION_DECLARATION;
FUNCTION_HEAD;
FUNCTION_BODY;
DECL;
FUN;
// if-else-statement
IF_STATEMENT;
IF_CONDITION;
IF_BODY;
ELSE_BODY;
// for-loop
FOR_STATEMENT;
FOR_INITIALIZE;
FOR_CONDITION;
FOR_INCREMENT;
FOR_BODY;
// Non-imaginary tokens
}
@lexer::header {
package at.tugraz.ist.cc;
}
@lexer::members {
}
@parser::header {
package at.tugraz.ist.cc;
}
@parser::members {
}
//Lexer rules
ASSIGNOP :
'=';
OR :
'||';
AND :
'&&';
RELOP :
'<'
| '<='
| '>'
| '>='
| '=='
| '!=';
SIGN :
'+'
| '-';
MULOP :
'*'
| '/'
| '%';
NOT :
'!';
fragment OPERATORS :
'<'
| '>'
| '='
| '+'
| '-'
| '/'
| '%'
| '*'
| '|'
| '&';
INT :
'0'
| DIGIT DIGIT0*;
fragment DIGIT :
'1'..'9';
fragment DIGIT0 :
'0'..'9';
BOOLEAN :
'true'
| 'false';
ID :
LETTER
(
LETTER
| DIGIT0
| '_'
)*;
fragment LETTER :
'a'..'z'
| 'A'..'Z';
PUNCT :
'.'
| ','
| ';'
| ':'
| '!';
WS :
(
' '
| '\t'
| '\r'
| '\n'
)
{
$channel = HIDDEN;
};
LITERAL :
'"'
(
LETTER
| DIGIT
| '_'
| '\\'
| OPERATORS
| PUNCT
| WS
)*
'"';
// parse rules
program :
functions -> ^(PROGRAM functions)
;
functions :
(function_declaration functions)?
;
function_declaration :
head=function_head '{' declarations optional_stmt return_stmt rc='}' -> ^(FUNCTION_DECLARATION[$head.start, $head.text] function_head ^( FUNCTION_BODY[rc,"FUNCTION_BODY"] declarations optional_stmt? return_stmt))
;
function_head :
typeInfo=type ID arguments -> ^(FUNCTION_HEAD[$typeInfo.start, "FUNCTION_HEAD"] type ID arguments?)
;
type :
'int'
| 'boolean'
| 'String'
;
arguments :
'(' ! argument_optional ')' !;
argument_optional :
parameter_list ? -> ^(DECL parameter_list)? ;
parameter_list :
type ID parameter_list2 -> ^(type ID) parameter_list2
;
parameter_list2 :
(',' type ID)* -> ^(type ID)*;
declarations :
( type idlist ';' )* -> ^(DECL ( ^(type idlist))*) ;
idlist :
( ID idlist2 );
idlist2 :
( ',' ! idlist ) ?;
optional_stmt :
( stmt_list ) ?;
stmt_list :
statement statement2;
statement2 :
stmt_list ?;
return_stmt :
'return' ^ expression ';' ! ;
statement :
(
compound_stmt
| ifThenElse
| forLoop
| assignment ';' !
) ;
ifThenElse :
(
'if' '(' ifCondition=expression ')' ifBody=statement 'else' elseBody=statement -> ^(IF_STATEMENT ^(IF_CONDITION $ifCondition) ^(IF_BODY $ifBody) ^(ELSE_BODY $elseBody))
)
;
forLoop :
(
'for' '(' forInitialization=assignment ';' forCondition=expression ';' forIncrement=assignment ')' forBody=statement ->
^(FOR_STATEMENT ^(FOR_INITIALIZE $forInitialization) ^(FOR_CONDITION $forCondition) ^(FOR_INCREMENT $forIncrement) ^(FOR_BODY $forBody))
)
;
compound_stmt :
'{'! optional_stmt '}' !;
assignment :
ID ASSIGNOP ^ expression;
expression: andExpression (OR ^ andExpression)*;
andExpression: relOPExpression (AND ^ relOPExpression)*;
relOPExpression: signExpression (RELOP ^ signExpression)*;
signExpression : mulExpression (SIGN ^ mulExpression)*;
mulExpression : factor (MULOP ^ factor)*;
factor :
(
factorID
| INT
| BOOLEAN
| LITERAL
| NOT ^ factor
| SIGN ^ factor
| '('! expression ')' !
);
factorID: ID
( function_call -> ^(FUN ID function_call)
| -> ID
)
;
function_call :
'('! function_call_opt ')' !;
function_call_opt :
extend_assign_expr_list ? ;
extend_assign_expr_list :
(
expression
extend_assign_expr_list1
) ;
extend_assign_expr_list1 :
( ',' ! extend_assign_expr_list ) ? ;
Parsing an input like
int main()
{
return 0;
}
works just fine!
The internet has a lot of suggestions regarding this issue, but none of them seem to work. The thing is that the debugger DOES work. Assuming that the input is not the problem here, the grammar has to be it. But if there is a problem with the grammar, why would the Interpreter work for both examples?
Any ideas?
Edit:
I have noticed that, for some reason, __Test__.java just contains:
M:divide
F:G0
I also get this output while Interpreting M:asd:
[13:47:52] Interpreting...
[13:47:52] problem matching token at 1:3 NoViableAltException('a'@[1:1: Tokens : ( T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | COLON );])
[13:47:52] problem matching token at 1:4 NoViableAltException('s'@[1:1: Tokens : ( T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | COLON );])
[13:47:52] problem matching token at 1:5 NoViableAltException('d'@[1:1: Tokens : ( T__8 | T__9 | T__10 | T__11 | T__12 | T__13 | T__14 | COLON );])
(even though the tree is correct)
AFAIK, the debugger only works with the Java target. Since you have C# specific code in your first grammar:
@lexer::namespace{Parser}
@parser::namespace{Parser}
there are no .java classes generated (or at least, none that will compile), and the debugger hangs (and times out).
EDIT
I see you're using fragment rules in your parser rules: you can't. Fragment rules will never become a token on their own, they're only there for other lexer rules.
I've tested the grammar without the C# code in ANTLRWorks 1.4.3, and had no issues.
You could try the following:
restarting ANTLRWorks
changing the port the debugger listens on (perhaps the port is used by another service, or another debug-run of ANTLRWorks)
using the most recent version of ANTLRWorks
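To rule out a port clash before changing any settings, a quick check such as the following Java sketch can tell whether something is already bound to a given port. PortCheck and isFree are hypothetical names. Which port ANTLRWorks is configured to use is something to look up in its preferences and is not assumed here.

```java
import java.io.IOException;
import java.net.ServerSocket;

final class PortCheck {
    /** Returns true if a server socket can currently be bound to the port. */
    static boolean isFree(int port) {
        try (ServerSocket probe = new ServerSocket(port)) {
            return true; // bound successfully; the probe is closed on exit
        } catch (IOException | IllegalArgumentException e) {
            // IOException: typically "address already in use";
            // IllegalArgumentException: port outside 0..65535
            return false;
        }
    }
}
```

If isFree returns false for the configured debugger port, another process (possibly an earlier debug run) is probably still holding it.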

ANTLR resolving non-LL(*) problems and syntactic predicates

Consider the following rules in the parser:
expression
: IDENTIFIER
| (...)
| procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
procedure_call
: '(' expression expression* ')'
;
macro_use
: '(' IDENTIFIER datum* ')'
;
and
// Note that any string that parses as an <expression> will also parse as a <datum>.
datum
: simple_datum
| compound_datum
;
simple_datum
: BOOLEAN
| NUMBER
| CHARACTER
| STRING
| IDENTIFIER
;
compound_datum
: list
| vector
;
list
: '(' (datum+ ( '.' datum)?)? ')'
| ABBREV_PREFIX datum
;
fragment ABBREV_PREFIX
: ('\'' | '`' | ',' | ',@')
;
vector
: '#(' datum* ')'
;
The procedure_call and macro_use alternatives in the expression rule generate a non-LL(*) structure error. I can see the problem, since (IDENTIFIER) will parse as both. But even when I define both with + instead of *, it generates the error, even though the example above should no longer parse.
I came up with the idea of using syntactic predicates, but I can't figure out how to use them to do the trick here.
Something like
expression
: IDENTIFIER
| (...)
| (procedure_call)=>procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
or
expression
: IDENTIFIER
| (...)
| ('(' IDENTIFIER expression)=>procedure_call // e.g. (foo 1 2 3)
| macro_use // e.g. (xyz (some datum))
;
doesn't work either, since none but the first rule will match anything. Is there a proper way to solve that?
I found a JavaCC grammar of R5RS which I used to (quickly!) write an ANTLR equivalent:
/*
* Copyright (C) 2011 by Bart Kiers, based on the work done by Håkan L. Younes'
* JavaCC R5RS grammar, available at: http://mindprod.com/javacc/R5RS.jj
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
grammar R5RS;
parse
: commandOrDefinition* EOF
;
commandOrDefinition
: (syntaxDefinition)=> syntaxDefinition
| (definition)=> definition
| ('(' BEGIN commandOrDefinition)=> '(' BEGIN commandOrDefinition+ ')'
| command
;
syntaxDefinition
: '(' DEFINE_SYNTAX keyword transformerSpec ')'
;
definition
: '(' DEFINE ( variable expression ')'
| '(' variable defFormals ')' body ')'
)
| '(' BEGIN definition* ')'
;
defFormals
: variable* ('.' variable)?
;
keyword
: identifier
;
transformerSpec
: '(' SYNTAX_RULES '(' identifier* ')' syntaxRule* ')'
;
syntaxRule
: '(' pattern template ')'
;
pattern
: patternIdentifier
| '(' (pattern+ ('.' pattern | ELLIPSIS)?)? ')'
| '#(' (pattern+ ELLIPSIS? )? ')'
| patternDatum
;
patternIdentifier
: syntacticKeyword
| VARIABLE
;
patternDatum
: STRING
| CHARACTER
| bool
| number
;
template
: patternIdentifier
| '(' (templateElement+ ('.' templateElement)?)? ')'
| '#(' templateElement* ')'
| templateDatum
;
templateElement
: template ELLIPSIS?
;
templateDatum
: patternDatum
;
command
: expression
;
identifier
: syntacticKeyword
| variable
;
syntacticKeyword
: expressionKeyword
| ELSE
| ARROW
| DEFINE
| UNQUOTE
| UNQUOTE_SPLICING
;
expressionKeyword
: QUOTE
| LAMBDA
| IF
| SET
| BEGIN
| COND
| AND
| OR
| CASE
| LET
| LETSTAR
| LETREC
| DO
| DELAY
| QUASIQUOTE
;
expression
: (variable)=> variable
| (literal)=> literal
| (lambdaExpression)=> lambdaExpression
| (conditional)=> conditional
| (assignment)=> assignment
| (derivedExpression)=> derivedExpression
| (procedureCall)=> procedureCall
| (macroUse)=> macroUse
| macroBlock
;
variable
: VARIABLE
| ELLIPSIS
;
literal
: quotation
| selfEvaluating
;
quotation
: '\'' datum
| '(' QUOTE datum ')'
;
selfEvaluating
: bool
| number
| CHARACTER
| STRING
;
lambdaExpression
: '(' LAMBDA formals body ')'
;
formals
: '(' (variable+ ('.' variable)?)? ')'
| variable
;
conditional
: '(' IF test consequent alternate? ')'
;
test
: expression
;
consequent
: expression
;
alternate
: expression
;
assignment
: '(' SET variable expression ')'
;
derivedExpression
: quasiquotation
| '(' ( COND ( '(' ELSE sequence ')'
| condClause+ ('(' ELSE sequence ')')?
)
| CASE expression ( '(' ELSE sequence ')'
| caseClause+ ('(' ELSE sequence ')')?
)
| AND test*
| OR test*
| LET variable? '(' bindingSpec* ')' body
| LETSTAR '(' bindingSpec* ')' body
| LETREC '(' bindingSpec* ')' body
| BEGIN sequence
| DO '(' iterationSpec* ')' '(' test doResult? ')' command*
| DELAY expression
)
')'
;
condClause
: '(' test (sequence | ARROW recipient)? ')'
;
recipient
: expression
;
caseClause
: '(' '(' datum* ')' sequence ')'
;
bindingSpec
: '(' variable expression ')'
;
iterationSpec
: '(' variable init step? ')'
;
init
: expression
;
step
: expression
;
doResult
: sequence
;
procedureCall
: '(' operator operand* ')'
;
operator
: expression
;
operand
: expression
;
macroUse
: '(' keyword datum* ')'
;
macroBlock
: '(' (LET_SYNTAX | LETREC_SYNTAX) '(' syntaxSpec* ')' body ')'
;
syntaxSpec
: '(' keyword transformerSpec ')'
;
body
: ((definition)=> definition)* sequence
;
//sequence
// : ((command)=> command)* expression
// ;
sequence
: expression+
;
datum
: simpleDatum
| compoundDatum
;
simpleDatum
: bool
| number
| CHARACTER
| STRING
| identifier
;
compoundDatum
: list
| vector
;
list
: '(' (datum+ ('.' datum)?)? ')'
| abbreviation
;
abbreviation
: abbrevPrefix datum
;
abbrevPrefix
: '\'' | '`' | ',@' | ','
;
vector
: '#(' datum* ')'
;
number
: NUM_2
| NUM_8
| NUM_10
| NUM_16
;
bool
: TRUE
| FALSE
;
quasiquotation
: quasiquotationD[1]
;
quasiquotationD[int d]
: '`' qqTemplate[d]
| '(' QUASIQUOTE qqTemplate[d] ')'
;
qqTemplate[int d]
: (expression)=> expression
| ('(' UNQUOTE)=> unquotation[d]
| simpleDatum
| vectorQQTemplate[d]
| listQQTemplate[d]
;
vectorQQTemplate[int d]
: '#(' qqTemplateOrSplice[d]* ')'
;
listQQTemplate[int d]
: '\'' qqTemplate[d]
| ('(' QUASIQUOTE)=> quasiquotationD[d+1]
| '(' (qqTemplateOrSplice[d]+ ('.' qqTemplate[d])?)? ')'
;
unquotation[int d]
: ',' qqTemplate[d-1]
| '(' UNQUOTE qqTemplate[d-1] ')'
;
qqTemplateOrSplice[int d]
: ('(' UNQUOTE_SPLICING)=> splicingUnquotation[d]
| qqTemplate[d]
;
splicingUnquotation[int d]
: ',@' qqTemplate[d-1]
| '(' UNQUOTE_SPLICING qqTemplate[d-1] ')'
;
// macro keywords
LET_SYNTAX : 'let-syntax';
LETREC_SYNTAX : 'letrec-syntax';
SYNTAX_RULES : 'syntax-rules';
DEFINE_SYNTAX : 'define-syntax';
// syntactic keywords
ELSE : 'else';
ARROW : '=>';
DEFINE : 'define';
UNQUOTE_SPLICING : 'unquote-splicing';
UNQUOTE : 'unquote';
// expression keywords
QUOTE : 'quote';
LAMBDA : 'lambda';
IF : 'if';
SET : 'set!';
BEGIN : 'begin';
COND : 'cond';
AND : 'and';
OR : 'or';
CASE : 'case';
LET : 'let';
LETSTAR : 'let*';
LETREC : 'letrec';
DO : 'do';
DELAY : 'delay';
QUASIQUOTE : 'quasiquote';
NUM_2 : PREFIX_2 COMPLEX_2;
NUM_8 : PREFIX_8 COMPLEX_8;
NUM_10 : PREFIX_10? COMPLEX_10;
NUM_16 : PREFIX_16 COMPLEX_16;
ELLIPSIS : '...';
VARIABLE
: INITIAL SUBSEQUENT*
| PECULIAR_IDENTIFIER
;
STRING : '"' STRING_ELEMENT* '"';
CHARACTER : '#\\' (~(' ' | '\n') | CHARACTER_NAME);
TRUE : '#' ('t' | 'T');
FALSE : '#' ('f' | 'F');
// to ignore
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
COMMENT : ';' ~('\r' | '\n')* {$channel=HIDDEN;};
// fragments
fragment INITIAL : LETTER | SPECIAL_INITIAL;
fragment LETTER : 'a'..'z' | 'A'..'Z';
fragment SPECIAL_INITIAL : '!' | '$' | '%' | '&' | '*' | '/' | ':' | '<' | '=' | '>' | '?' | '^' | '_' | '~';
fragment SUBSEQUENT : INITIAL | DIGIT | SPECIAL_SUBSEQUENT;
fragment DIGIT : '0'..'9';
fragment SPECIAL_SUBSEQUENT : '.' | '+' | '-' | '@';
fragment PECULIAR_IDENTIFIER : '+' | '-';
fragment STRING_ELEMENT : ~('"' | '\\') | '\\' ('"' | '\\');
fragment CHARACTER_NAME : 'space' | 'newline';
fragment COMPLEX_2
: REAL_2 ('@' REAL_2)?
| REAL_2? SIGN UREAL_2? ('i' | 'I')
;
fragment COMPLEX_8
: REAL_8 ('@' REAL_8)?
| REAL_8? SIGN UREAL_8? ('i' | 'I')
;
fragment COMPLEX_10
: REAL_10 ('@' REAL_10)?
| REAL_10? SIGN UREAL_10? ('i' | 'I')
;
fragment COMPLEX_16
: REAL_16 ('@' REAL_16)?
| REAL_16? SIGN UREAL_16? ('i' | 'I')
;
fragment REAL_2 : SIGN? UREAL_2;
fragment REAL_8 : SIGN? UREAL_8;
fragment REAL_10 : SIGN? UREAL_10;
fragment REAL_16 : SIGN? UREAL_16;
fragment UREAL_2 : UINTEGER_2 ('/' UINTEGER_2)?;
fragment UREAL_8 : UINTEGER_8 ('/' UINTEGER_8)?;
fragment UREAL_10 : UINTEGER_10 ('/' UINTEGER_10)? | DECIMAL_10;
fragment UREAL_16 : UINTEGER_16 ('/' UINTEGER_16)?;
fragment DECIMAL_10
: UINTEGER_10 SUFFIX
| '.' DIGIT+ '#'* SUFFIX?
| DIGIT+ '.' DIGIT* '#'* SUFFIX?
| DIGIT+ '#'+ '.' '#'* SUFFIX?
;
fragment UINTEGER_2 : DIGIT_2+ '#'*;
fragment UINTEGER_8 : DIGIT_8+ '#'*;
fragment UINTEGER_10 : DIGIT+ '#'*;
fragment UINTEGER_16 : DIGIT_16+ '#'*;
fragment PREFIX_2 : RADIX_2 EXACTNESS? | EXACTNESS RADIX_2;
fragment PREFIX_8 : RADIX_8 EXACTNESS? | EXACTNESS RADIX_8;
fragment PREFIX_10 : RADIX_10 EXACTNESS? | EXACTNESS RADIX_10;
fragment PREFIX_16 : RADIX_16 EXACTNESS? | EXACTNESS RADIX_16;
fragment SUFFIX : EXPONENT_MARKER SIGN? DIGIT+;
fragment EXPONENT_MARKER : 'e' | 's' | 'f' | 'd' | 'l' | 'E' | 'S' | 'F' | 'D' | 'L';
fragment SIGN : '+' | '-';
fragment EXACTNESS : '#' ('i' | 'e' | 'I' | 'E');
fragment RADIX_2 : '#' ('b' | 'B');
fragment RADIX_8 : '#' ('o' | 'O');
fragment RADIX_10 : '#' ('d' | 'D');
fragment RADIX_16 : '#' ('x' | 'X');
fragment DIGIT_2 : '0' | '1';
fragment DIGIT_8 : '0'..'7';
fragment DIGIT_16 : DIGIT | 'a'..'f' | 'A'..'F';
which can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"(define sum-iter \n" +
" (lambda(n acc i) \n" +
" (if (> i n) \n" +
" acc \n" +
" (sum-iter n (+ acc i) (+ i 1))))) ";
R5RSLexer lexer = new R5RSLexer(new ANTLRStringStream(source));
R5RSParser parser = new R5RSParser(new CommonTokenStream(lexer));
parser.parse();
}
}
and to generate a lexer & parser, compile all Java source files and run the main class, do:
bart@hades:~/Programming/ANTLR/Demos/R5RS$ java -cp antlr-3.3.jar org.antlr.Tool R5RS.g
bart@hades:~/Programming/ANTLR/Demos/R5RS$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/R5RS$ java -cp .:antlr-3.3.jar Main
bart@hades:~/Programming/ANTLR/Demos/R5RS$
The fact that nothing is being printed on the console means the parser (and lexer) didn't find any errors with the provided source.
Note that I have no unit tests and have only tested the single Scheme source inside the Main class. If you find errors in the ANTLR grammar, I'd appreciate hearing about them so I can fix the grammar. In due time, I'll probably commit the grammar to the official ANTLR Wiki.
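The (x)=> x syntactic predicates used throughout the grammar amount to: parse x speculatively, rewind the input, and commit to that alternative only if the speculation succeeded. The Java sketch below imitates that by hand for the procedure_call / macro_use choice from the question. It is a toy: the token format, the reduced rules (expressions without literals) and the names PredicateSketch and form are all made up for illustration, and this is not the code ANTLR generates.

```java
import java.util.Arrays;
import java.util.List;

final class PredicateSketch {
    private final List<String> tokens;
    private int pos = 0;

    PredicateSketch(String... tokens) { this.tokens = Arrays.asList(tokens); }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "<EOF>"; }
    private boolean match(String t) { if (peek().equals(t)) { pos++; return true; } return false; }
    private boolean ident() { if (peek().matches("[a-z]+")) { pos++; return true; } return false; }
    private boolean number() { if (peek().matches("[0-9]+")) { pos++; return true; } return false; }

    // expression : IDENT | procedureCall   (toy subset without literals)
    private boolean expression() { return ident() || procedureCall(); }

    // procedureCall : '(' expression+ ')'
    private boolean procedureCall() {
        int start = pos;
        if (match("(") && expression()) {
            while (expression()) { /* consume remaining operands */ }
            if (match(")")) return true;
        }
        pos = start; // leave the input untouched on failure
        return false;
    }

    // datum : IDENT | NUMBER | '(' datum* ')'
    private boolean datum() {
        if (ident() || number()) return true;
        int start = pos;
        if (match("(")) {
            while (datum()) { /* consume nested data */ }
            if (match(")")) return true;
        }
        pos = start;
        return false;
    }

    // macroUse : '(' IDENT datum* ')'
    private boolean macroUse() {
        int start = pos;
        if (match("(") && ident()) {
            while (datum()) { /* consume macro arguments */ }
            if (match(")")) return true;
        }
        pos = start;
        return false;
    }

    /** form : (procedureCall)=> procedureCall | macroUse */
    String form() {
        int mark = pos;                      // remember the input position
        boolean predicate = procedureCall(); // speculative parse
        pos = mark;                          // always rewind after speculating
        if (predicate) {
            procedureCall();                 // commit: parse again for real
            return "procedureCall";
        }
        return macroUse() ? "macroUse" : "no viable alternative";
    }
}
```

With the tokens ( foo bar ) both rules would match, and the predicate on the first alternative decides in favour of procedureCall. With ( foo 1 2 ) the speculation fails in this toy subset, so the parser falls through to macroUse. The order of the predicated alternatives therefore determines how ambiguous input is resolved.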

ANTLR IDL Grammar

Using ANTLR I am trying to create a very simple IDL-style grammar. Here is what I have so far.
grammar idl;
data_type
: 'DataType' ID LCURLY attribute_list RCURLY
;
modifier
: 'public'
;
primitive
: 'byte'
| 'short'
| 'int'
| 'float'
| 'double'
;
attribute
: modifier primitive ID END
;
attribute_list
: attribute+
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
LCURLY : '{'
;
RCURLY : '}'
;
END : ';'
;
This does not seem to work when I run the debugger over 'data_type' however. It just halts when it reaches the 'attribute_list'. Changing 'attribute_list' to just 'attribute' works fine but obviously I want one-or-more attributes, not just one.
Thanks

Problem with antlr grammar (lexical)

I get a mismatched set exception when I try to parse "abc" (the quote marks are part of the input).
Here is the (simplified) grammar, pretty much verbatim from the Java.g example and basically the same as in other example grammars. Is there some bug in the latest version? I am using 3.2 in the context of Eclipse.
Thanks in advance.
grammar String;
options {
language = C;
}
rule: literal EOF;
literal
: CHARLITERAL
| STRINGLITERAL
;
CHARLITERAL
: '\''
( EscapeSequence
| ~( '\'' | '\\' | '\r' | '\n' )
)
'\''
;
STRINGLITERAL
: '"'
( EscapeSequence
| ~( '\\' | '"' | '\r' | '\n' )
)*
'"'
;
fragment
EscapeSequence
: '\\' (
'b'
| 't'
| 'n'
| 'f'
| 'r'
| '\"'
| '\''
| '\\'
|
('0'..'3') ('0'..'7') ('0'..'7')
|
('0'..'7') ('0'..'7')
|
('0'..'7')
)
;
I'm confused by these last edits, but the problem is with the interpreter and is a known issue, reported in '09.
If the code is generated for the grammar, it works like a charm.
It seems hard to believe that this bug has gone unanswered for so long given its frequency of occurrence.
