Antlr parsing conflict with tokens and identifiers - antlr3

i am attempting to parse freeform strings (ANTLR 3) and am running into string/token issues. Googling has not yet helped me. Here is my grammar:
grammar TestHeader;
options {
language = Java;
output = AST;
}
tokens {
STOKEN_1 = 'SENDTO';
STOKEN_2 = 'SRCSYS';
STOKEN_3 = 'SOFTERROR';
}
#lexer::header {
package com.sample.parser;
}
#parser::header {
package com.sample.parser;
}
fragment TAG : /* empty rule: only used to change the 'type' */;
fragment CR: '\r';
fragment LF: '\n';
fragment SLASH: '/';
fragment HASH: '#';
fragment DIGIT: '0'..'9';
fragment SEMI_COLON: ';';
COLON: ':';
COMMA: ',';
DOT: '.';
LETTER: 'A'..'Z' | 'a'..'z';
HYPHEN: '-';
BRANCH_CHARS: '$' | '#' | HASH;
SPACE: (' ');
EOL: (CR? LF) | CR | SEMI_COLON;
DATE: ('0'..'1')? DIGIT SLASH ('0'..'3')? DIGIT SLASH DIGIT DIGIT;
TIME: ('0'..'2')? DIGIT COLON ('0'..'5')? DIGIT (COLON '0'..'5' DIGIT)?;
DECIMAL
:
DIGIT* ('.' DIGIT*)
{
if ($text.contains("/")) {
$type = DATE;
} else if ($text.contains(":")) {
$type = TIME;
}
}
;
NUMBER
:
'0'..'9' DIGIT*
{
if ($text.contains("/")) {
$type = DATE;
} else if ($text.contains(":")) {
$type = TIME;
}
}
;
SINGLELINE_COMMENT
:
'//-' ~('\r' | '\n')*
;
header
:
(word SPACE+ sequenceNumber SPACE+ type) EOL+
(SPACE* trailer) EOL+
;
word
:
(data += ~(SPACE))+
;
sequenceNumber
:
nn=NUMBER
;
type
:
LETTER
;
trailer
:
// (tt += ~(EOL ))*
SINGLELINE_COMMENT
;
With a test input
SY 260 O
//-$CJ******1
i get the following
0 null
-- 16 S
-- 16 Y
-- 22
-- 18 260
-- 22
-- 16 O
-- 13
-- 20 //-$CJ******1
-- 13
However with the following input
SR 260 O
//-$CJ******1
I get
line 1:2 mismatched character ' ' expecting 'C'
line 1:7 mismatched input 'O' expecting NUMBER
0 260 O
//-$CJ******1
Where am i going wrong? Any tips/help to get past this issues?

Related

how to build a parser with antlr4 target golang without visitors and walkers

I am trying to write a small parser with golang target, but not using visitors or walkers, but I am not able to find any sample code to build my parser upon.
For example, the following is the grammar code which I am trying to replicate with golang:
# Expr.g4:
grammar Expr;
#header {
}
#parser::members {
def eval(self, left, op, right):
if ExprParser.MUL == op.type:
return left * right
elif ExprParser.DIV == op.type:
return left / right
elif ExprParser.ADD == op.type:
return left + right
elif ExprParser.SUB == op.type:
return left - right
else:
return 0
}
stat: e NEWLINE {print($e.v);}
| ID '=' e NEWLINE {self.memory[$ID.text] = $e.v}
| NEWLINE
;
e returns [int v]
: a=e op=('*'|'/') b=e {$v = self.eval($a.v, $op, $b.v)}
| a=e op=('+'|'-') b=e {$v = self.eval($a.v, $op, $b.v)}
| INT {$v = $INT.int}
| ID
{
id = $ID.text
$v = self.memory.get(id, 0)
}
| '(' e ')' {$v = $e.v}
;
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
ID : [a-zA-Z]+ ; // match identifiers
INT : [0-9]+ ; // match integers
NEWLINE:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
And this is the python tester code for it:
# test_expr.py:
import sys
from antlr4 import *
from antlr4.InputStream import InputStream
from ExprLexer import ExprLexer
from ExprParser import ExprParser
if __name__ == '__main__':
parser = ExprParser(None)
parser.buildParseTrees = False
parser.memory = {} # how to add this to generated constructor?
line = sys.stdin.readline()
lineno = 1
while line != '':
line = line.strip()
istream = InputStream(line + "\n")
lexer = ExprLexer(istream)
lexer.line = lineno
lexer.column = 0
token_stream = CommonTokenStream(lexer)
parser.setInputStream(token_stream)
parser.stat()
line = sys.stdin.readline()
lineno += 1
Can anybody please post a sample golang code which is equivalent to the above python and inlined code?

Flex & Bison: Printing a parse tree

Basically, I have an assignment where I need to make a compiler for C-, but we're doing it in 5 steps. One of the steps was to turn the BNF grammar to bison and then print a tree with what has been compiled. Let me explain:
BNF Grammar
1. program→declaration-list
2. declaration-list→declaration-list declaration | declaration
3. var-declaration| fun-declaration
4. var-declaration→type-specifierID;| type-specifierID[NUM];
5. type-specifier→int | void
6. fun-declaration→type-specifierID(params)compound-stmt
7. params→param-list| void
8. param-list→param-list,param | param
9. param→type-specifierID | type-specifierID[]
10. compound-stmt→{local-declarations statement-list}
11. local-declarations→local-declarations var-declaration| empty
12. statement-list→statement-list statement| empty
13. statement→expression-stmt| compound-stmt| selection-stmt | iteration-stmt | return-stmt
14. expession-stmt→expression;| ;
15. selection-stmt→if(expression)statement| if(expression) statement else statement
16. iteration-stmt→while(expression)statement
17. return-stmt→return; | return expression;
18. expression→var=expression| simple-expression
19. var→ID| ID[expression]
20. simple-expression→additive-expression relop additive-expression| additive-expression
21. relop→<=| <| >| >=| ==| !=
22. additive-expression→additive-expression addop term| term
23. addop→+| -
24. term→term mulop factor| factor
25. mulop→*| /
26. factor→(expression)| var| call| NUM
27. call→ID(args)
28. args→arg-list| empty
29. arg-list→arg-list,expression| expression
File: Project.fl
%option noyywrap
%{
/* Definitions and statements */
#include <stdio.h>
#include "project.tab.h"
int nlines = 1;
char filename[50];
%}
ID {letter}{letter}*
NUM {digit}{digit}*
letter [a-zA-Z]
digit [0-9]
%%
"if" { return T_IF; }
"else" { return T_ELSE; }
"int" { return T_INT; }
"return" { return T_RETURN; }
"void" { return T_VOID; }
"while" { return T_WHILE; }
"+" { return yytext[0]; }
"-" { return yytext[0]; }
"*" { return yytext[0]; }
"/" { return yytext[0]; }
">" { return T_GREAT; }
">=" { return T_GREATEQ; }
"<" { return T_SMALL; }
"<=" { return T_SMALLEQ; }
"==" { return T_COMPARE; }
"!=" { return T_NOTEQ; }
"=" { return yytext[0]; }
";" { return yytext[0]; }
"," { return yytext[0]; }
"(" { return yytext[0]; }
")" { return yytext[0]; }
"[" { return yytext[0]; }
"]" { return yytext[0]; }
"{" { return yytext[0]; }
"}" { return yytext[0]; }
(\/\*(ID)\*\/) { return T_COMM; }
{ID} { return T_ID; }
{NUM} { return T_NUM; }
\n { ++nlines; }
%%
File: project.y
%{
#include <stdio.h>
#include <stdlib.h>
extern int yylex();
extern int yyparse();
void yyerror(const char* s);
%}
%token T_IF T_ELSE T_INT T_RETURN T_VOID T_WHILE
T_GREAT T_GREATEQ T_SMALL T_SMALLEQ T_COMPARE T_NOTEQ
T_COMM T_ID T_NUM
%%
program: declaration-list { printf("program"); }
;
declaration-list: declaration-list declaration
| declaration
;
declaration: var-declaration
| fun-declaration
;
var-declaration: type-specifier T_ID ';'
| type-specifier T_ID'['T_NUM']' ';'
;
type-specifier: T_INT
| T_VOID
;
fun-declaration: type-specifier T_ID '('params')' compound-stmt
;
params: param-list
| T_VOID
;
param-list: param-list',' param
| param
;
param: type-specifier T_ID
| type-specifier T_ID'['']'
;
compound-stmt: '{' local-declarations statement-list '}'
;
local-declarations: local-declarations var-declaration
|
;
statement-list: statement-list statement
|
;
statement: expression-stmt
| compound-stmt
| selection-stmt
| iteration-stmt
| return-stmt
;
expression-stmt: expression ';'
| ';'
;
selection-stmt: T_IF '('expression')' statement
| T_IF '('expression')' statement T_ELSE statement
;
iteration-stmt: T_WHILE '('expression')' statement
;
return-stmt: T_RETURN ';'
| T_RETURN expression ';'
;
expression: var '=' expression
| simple-expression
;
var: T_ID { printf("\nterm\nfactor_var\nvar(x)"); }
| T_ID '['expression']'
;
simple-expression: additive-expression relop additive-expression
| additive-expression
;
relop: T_SMALLEQ
| T_SMALL
| T_GREAT
| T_GREATEQ
| T_COMPARE
| T_NOTEQ
;
additive-expression: additive-expression addop term
| term
;
addop: '+' { printf("\naddop(+)"); }
| '-' { printf("\naddop(-)"); }
;
term: term mulop factor
| factor
;
mulop: '*' { printf("\nmulop(*)"); }
| '/' { printf("\nmulop(/)"); }
;
factor: '('expression')' { printf("\nfactor1"); }
| var
| call
| T_NUM { printf("\nterm\nfactor(5)"); }
;
call: T_ID '('args')' { printf("\ncall(input)"); }
;
args: arg-list
| { printf("\nargs(empty)"); }
;
arg-list: arg-list',' expression
| expression
;
%%
int main(void) {
return yyparse();
}
void yyerror(const char* s) {
fprintf(stderr, "Parse error: %s\n", s);
exit(1);
}
And finally the tree that's asked to be replicated:
program
declaration_list
declaration
fun_definition(VOID-main)
params_VOID-compound
params(VOID)
compound_stmt
local_declarations
local_declarations
local_declarations(empty)
var_declaration(x)
type_specifier(INT)
var_declaration(y)
type_specifier(INT)
statement_list
statement_list
statement_list(empty)
statement
expression_stmt
expression
var(x)
expression
simple_expression
additive_expression
term
factor
call(input)
args(empty)
statement
expression_stmt
expression
var(y)
expression
simple_expression
additive_expression(ADDOP)
additive_expression
term
factor_var
var(x)
addop(+)
term
factor(5)
Sample code in which the tree is based off
/* A program */
void main(void)
{
int x; int y;
x = input();
y = x + 5;
}
I've turned the BNF grammar to the actual .y file, but I'm having problems printing out where exactly the messages should go. Usually, a grammar would finish THEN print.
The desired output you present is the result of a pre-order walk of the parse tree.
However, bison generates a bottom-up parser, which performs semantic actions for a node in the parse tree when the node's subtree is complete. Printing the node in the semantic action therefore produces a post-order walk. I suppose that is what you mean by your last sentence.
While there are a variety of possible solutions, the simplest one is probably to construct a parse tree during the parse and then print it out at the end of the parse. (You could print the tree in the semantic action for the start production, but that will sometimes result in a parse tree being printed for an erroneous input. Better is to return the root of the parse tree and print it from the main program after verifying that the parse was successful.)
I don't know where "construct a parse tree" fits in the expected progression of your project. Parse trees are of little use in most applications. Much more common is the construction of an abstract syntax tree (AST) which omits many of the irrelevant details from the parse (such as unit productions). You can construct an AST from a parse tree, but it is generally simpler to construct it directly in the parse actions: the code looks very similar but there is less of it precisely because tree nodes don't have to be built for unit productions.

what is the binary-to-text encoding used by protoc --decode?

I am looking at the output of the protoc --decode command and I cannot fathom the encoding used when it encounters bytes :
data {
image: "\377\330\377\340\000\020JFIF\000\001[…]\242\2634G\377\331"
}
The […] was added by me to shorten the output.
What encoding is this?
Edit
So based on Bruce's answer I wrote my own utility in order to generate sample data from a shell script :
public static void main(String[] parameters) throws IOException {
File binaryInput = new File(parameters[0]);
System.out.println("\""+TextFormat.escapeBytes(ByteString.readFrom(new FileInputStream(binaryInput)))+"\"");
}
}
that way I can call serialize my binaries and insert them in a text serialization of a protobuf before calling protoc --encode on it :
IMAGE=$(mktemp)
OUTPUT=$(mktemp)
BIN_INSTANCE=$(mktemp)
echo -n 'capture: ' > $IMAGE
java -cp "$HOME/.m2/repository/com/google/protobuf/protobuf-java/3.0.0/protobuf-java-3.0.0.jar:target/protobuf-generator-1.0.0-SNAPSHOT.jar" protobuf.BinarySerializer image.jpg >> $IMAGE
sed -e 's/{UUID}/'$(uuidgen)'/' template.protobuf > $OUTPUT
sed -i '/{IMAGE}/ {
r '$IMAGE'
d
}' $OUTPUT
cat $OUTPUT | protoc --encode=prototypesEvent.proto> $BIN_INSTANCE
with template.protobuf being :
uuid: "{UUID}"
image {
capture: "{IMAGE}"
}
I am presuming it is the samer as produced by java.
basically:
* between space (0x20) and tilde (0x7e) treat it as an ascii character
* if there is a shortcut (e.g. \n, \r, \ etc) use the shortcut
* otherwise escape the character (octal)
so in the above \377 is 1 byte: 377 octal or 255 in decimal.
"\377\330\377\340 = 255 216 255 224
You should be able to copy the string into a Java/C program and convert it to bytes
The Java code looks to be:
static String escapeBytes(final ByteSequence input) {
final StringBuilder builder = new StringBuilder(input.size());
for (int i = 0; i < input.size(); i++) {
final byte b = input.byteAt(i);
switch (b) {
// Java does not recognize \a or \v, apparently.
case 0x07: builder.append("\\a"); break;
case '\b': builder.append("\\b"); break;
case '\f': builder.append("\\f"); break;
case '\n': builder.append("\\n"); break;
case '\r': builder.append("\\r"); break;
case '\t': builder.append("\\t"); break;
case 0x0b: builder.append("\\v"); break;
case '\\': builder.append("\\\\"); break;
case '\'': builder.append("\\\'"); break;
case '"' : builder.append("\\\""); break;
default:
// Only ASCII characters between 0x20 (space) and 0x7e (tilde) are
// printable. Other byte values must be escaped.
if (b >= 0x20 && b <= 0x7e) {
builder.append((char) b);
} else {
builder.append('\\');
builder.append((char) ('0' + ((b >>> 6) & 3)));
builder.append((char) ('0' + ((b >>> 3) & 7)));
builder.append((char) ('0' + (b & 7)));
}
break;
}
}
return builder.toString();
}
taken from com.google.protobuf.TextFormatEscaper

error not all control paths return a value

My compiler says that not all control paths return a value and it points to a overloaded>> variable function. I am not exactly sure what is causing the problem and any help would be appreciated. I am trying to overload the stream extraction operator to define if input is valid. if it is not it should set a failbit to indicate improper input.
std::istream &operator >> (std::istream &input, ComplexClass &c)//overloading extraction operator
{
int number;
int multiplier;
char temp;//temp variable used to store input
input >> number;//gets real number
// test if character is a space
if (input.peek() == ' ' /* Write a call to the peek member function to
test if the next character is a space ' ' */) // case a + bi
{
c.real = number;
input >> temp;
multiplier = (temp == '+') ? 1 : -1;
}
// set failbit if character not a space
if (input.peek() != ' ')
{
/* Write a call to the clear member function with
ios::failbit as the argument to set input's fail bit */
input.clear(ios::failbit);
}
else
// set imaginary part if data is valid
if (input.peek() == ' ')
{
input >> c.imaginary;
c.imaginary *= multiplier;
input >> temp;
}
if (input.peek() == '\n' /* Write a call to member function peek to test if the next
character is a newline \n */) // character not a newline
{
input.clear(ios::failbit); // set bad bit
} // end if
else
{
input.clear(ios::failbit); // set bad bit
} // end else
// end if
if (input.peek() == 'i' /* Write a call to member function peek to test if
the next character is 'i' */) // test for i of imaginary number
{
input >> temp;
// test for newline character entered
if (input.peek() == '\n')
{
c.real = 0;
c.imaginary = number;
} // end if
else if (input.peek() == '\n') // set real number if it is valid
{
c.real = number;
c.imaginary = 0;
} // end else if
else
{
input.clear(ios::failbit); // set bad bit
}
return input;
}
}
The return statement is only executed inside the last if:
if (input.peek() == 'i' ) {
...
return input;
}
It should be changed to:
if (input.peek() == 'i' ) {
...
}
return input;

ANTLR White Space Question (and not the typical one)

Consider this short SmallC program:
#include "lib"
main() {
int bob;
}
My ANTLR grammar picks it up fine if I specify, in ANTLWorks and when using the Interpreter, line endings -> "Mac (CR)". If I set the line endings option to Unix (LF), the grammar throws a NoViableAltException and does not recognize anything after the end of the include statement. This error disappears if I add a newline at the end of include. The computer I'm using for this is a Mac, so I figured that it made sense to have to set the line endings to Mac format. So instead, I switch to a Linux box - and get the same thing. If I type anything in the ANTLRWorks Interpreter box, and if I don't select line endings Mac (CR), I get issues about insufficient blank lines as was the case above and, in addition, the last statement of each statement block requires an extra space following the semicolon (ie. after bob; above).
These bugs show up again when I run a Java version of my grammar on a code input file that I want to parse...
What could possibly be the issue? I'd understand if the issue was the presence of TOO many new lines, in a format that perhaps the parser didn't understand / weren't caught by my whitespace rule. But in this case, it's an issue of lacking new lines.
My white space declaration is as follows:
WS : ( '\t' | ' ' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
Alternatively, could this be due to an ambiguity issue?
Here is the full grammar file (feel free to ignore the first few blocks, which override ANTLR's default error handling mechanisms:
grammar SmallC;
options {
output = AST ; // Set output mode to AST
}
tokens {
DIV = '/' ;
MINUS = '-' ;
MOD = '%' ;
MULT = '*' ;
PLUS = '+' ;
RETURN = 'return' ;
WHILE = 'while' ;
// The following are empty tokens used in AST generation
ARGS ;
CHAR ;
DECLS ;
ELSE ;
EXPR ;
IF ;
INT ;
INCLUDES ;
MAIN ;
PROCEDURES ;
PROGRAM ;
RETURNTYPE ;
STMTS ;
TYPEIDENT ;
}
#members {
// Force error throwing, and make sure we don't try to recover from invalid input.
// The exceptions are handled in the FrontEnd class, and gracefully end the
// compilation routine after displaying an error message.
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MismatchedTokenException(ttype, input);
}
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)throws RecognitionException {
throw e;
}
protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MissingTokenException(ttype, input, null);
}
// We override getErrorMessage() to include information about the specific
// grammar rule in which the error happened, using a stack of nested rules.
Stack paraphrases = new Stack();
public String getErrorMessage(RecognitionException e, String[] tokenNames) {
String msg = super.getErrorMessage(e, tokenNames);
if ( paraphrases.size()>0 ) {
String paraphrase = (String)paraphrases.peek();
msg = msg+" "+paraphrase;
}
return msg;
}
// We override displayRecognitionError() to specify a clearer error message,
// and to include the error type (ie. class of the exception that was thrown)
// for the user's reference. The idea here is to come as close as possible
// to Java's exception output.
public void displayRecognitionError(String[] tokenNames, RecognitionException e)
{
String exType;
String hdr;
if (e instanceof UnwantedTokenException) {
exType = "UnwantedTokenException";
} else if (e instanceof MissingTokenException) {
exType = "MissingTokenException";
} else if (e instanceof MismatchedTokenException) {
exType = "MismatchedTokenException";
} else if (e instanceof MismatchedTreeNodeException) {
exType = "MismatchedTreeNodeException";
} else if (e instanceof NoViableAltException) {
exType = "NoViableAltException";
} else if (e instanceof EarlyExitException) {
exType = "EarlyExitException";
} else if (e instanceof MismatchedSetException) {
exType = "MismatchedSetException";
} else if (e instanceof MismatchedNotSetException) {
exType = "MismatchedNotSetException";
} else if (e instanceof FailedPredicateException) {
exType = "FailedPredicateException";
} else {
exType = "Unknown";
}
if ( getSourceName()!=null ) {
hdr = "Exception of type " + exType + " encountered in " + getSourceName() + " at line " + e.line + ", char " + e.charPositionInLine + ": ";
} else {
hdr = "Exception of type " + exType + " encountered at line " + e.line + ", char " + e.charPositionInLine + ": ";
}
String msg = getErrorMessage(e, tokenNames);
emitErrorMessage(hdr + msg + ".");
}
}
// Force the parser not to try to guess tokens and resume on faulty input,
// but rather display the error, and throw an exception for the program
// to quit gracefully.
#rulecatch {
catch (RecognitionException e) {
reportError(e);
throw e;
}
}
/*------------------------------------------------------------------
* PARSER RULES
*
* Many of these make use of ANTLR's rewrite rules to allow us to
* specify the roots of AST sub-trees, and to allow us to do away
* with certain insignificant literals (like parantheses and commas
* in lists) and to add empty tokens to disambiguate the tree
* construction
*
* The #init and #after definitions populate the paraphrase
* stack to allow us to specify which grammar rule we are in when
* errors are found.
*------------------------------------------------------------------*/
args
#init { paraphrases.push("in these procedure arguments"); }
#after { paraphrases.pop(); }
: ( typeident ( ',' typeident )* )? -> ^( ARGS ( typeident ( typeident )* )? )? ;
body
#init { paraphrases.push("in this procedure body"); }
#after { paraphrases.pop(); }
: '{'! decls stmtlist '}'! ;
decls
#init { paraphrases.push("in these declarations"); }
#after { paraphrases.pop(); }
: ( typeident ';' )* -> ^( DECLS ( typeident )* )? ;
exp
#init { paraphrases.push("in this expression"); }
#after { paraphrases.pop(); }
: lexp ( ( '>' | '<' | '>=' | '<=' | '!=' | '==' )^ lexp )? ;
factor : '(' lexp ')'
| ( MINUS )? ( IDENT | NUMBER )
| CHARACTER
| IDENT '(' ( IDENT ( ',' IDENT )* )? ')' ;
lexp : term ( ( PLUS | MINUS )^ term )* ;
includes
#init { paraphrases.push("in the include statements"); }
#after { paraphrases.pop(); }
: ( '#include' STRING )* -> ^( INCLUDES ( STRING )* )? ;
main
#init { paraphrases.push("in the main method"); }
#after { paraphrases.pop(); }
: 'main' '(' ')' body -> ^( MAIN body ) ;
procedure
#init { paraphrases.push("in this procedure"); }
#after { paraphrases.pop(); }
: ( proc_return_char | proc_return_int )? IDENT^ '('! args ')'! body ;
procedures : ( procedure )* -> ^( PROCEDURES ( procedure)* )? ;
proc_return_char
: 'char' -> ^( RETURNTYPE CHAR ) ;
proc_return_int : 'int' -> ^( RETURNTYPE INT ) ;
// We hard-code the regex (\n)* to fix a bug whereby a program would be accepted
// if it had 0 or more than 1 new lines before EOF but not if it had exactly 1,
// and not if it had 0 new lines between components of the following rule.
program : includes decls procedures main EOF ;
stmt
#init { paraphrases.push("in this statement"); }
#after { paraphrases.pop(); }
: '{'! stmtlist '}'!
| WHILE '(' exp ')' s=stmt -> ^( WHILE ^( EXPR exp ) $s )
| 'if' '(' exp ')' s=stmt ( options {greedy=true;} : 'else' s2=stmt )? -> ^( IF ^( EXPR exp ) $s ^( ELSE $s2 )? )
| IDENT '='^ lexp ';'!
| ( 'read' | 'output' | 'readc' | 'outputc' )^ '('! IDENT ')'! ';'!
| 'print'^ '('! STRING ( options {greedy=true;} : ')'! ';'! )
| RETURN ( lexp )? ';' -> ^( RETURN ( lexp )? )
| IDENT^ '('! ( IDENT ( ','! IDENT )* )? ')'! ';'!;
stmtlist : ( stmt )* -> ^( STMTS ( stmt )* )? ;
term : factor ( ( MULT | DIV | MOD )^ factor )* ;
// We divide typeident into two grammar rules depending on whether the
// ident is of type 'char' or 'int', to allow us to implement different
// rewrite rules in each case.
typeident : typeident_char | typeident_int ;
typeident_char : 'char' s2=IDENT -> ^( CHAR $s2 ) ;
typeident_int : 'int' s2=IDENT -> ^( INT $s2 ) ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// Must come before CHARACTER to avoid ambiguity ('i' matches both IDENT and CHARACTER)
IDENT : ( LCASE_ALPHA | UCASE_ALPHA | '_' ) ( LCASE_ALPHA | UCASE_ALPHA | DIGIT | '_' )* ;
CHARACTER : PRINTABLE_CHAR
| '\n' | '\t' | EOF ;
NUMBER : ( DIGIT )+ ;
STRING : '\"' ( ~( '"' | '\n' | '\r' | 't' ) )* '\"' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
fragment
DIGIT : '0'..'9' ;
fragment
LCASE_ALPHA : 'a'..'z' ;
fragment
NONALPHA_CHAR : '`' | '~' | '!' | '#' | '#' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-'
| '_' | '+' | '=' | '{' | '[' | '}' | ']' | '|' | '\\' | ';' | ':' | '\''
| '\\"' | '<' | ',' | '>' | '.' | '?' | '/' ;
fragment
PRINTABLE_CHAR : LCASE_ALPHA | UCASE_ALPHA | DIGIT | NONALPHA_CHAR ;
fragment
UCASE_ALPHA : 'A'..'Z' ;
From the command line, I do get a warning:
java -cp antlr-3.2.jar org.antlr.Tool SmallC.g
warning(200): SmallC.g:182:37: Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
but that won't stop the lexer/parser from being generated.
Anyway, the problem: ANTLR's lexer tries to match the first lexer rule it encounters in the file, and if it can't match said token, it trickles down to the next lexer rule. Now you have defined the CHARACTER rule before the WS rule, which both match the character \n. That is why it didn't work under Linux since the \n was tokenized as a CHARACTER. If you define the WS rule before the CHARACTER rule, it all works properly:
// other rules ...
WS
: ('\t' | ' ' | '\r' | '\n' | '\u000C')+ { $channel = HIDDEN; }
;
CHARACTER
: PRINTABLE_CHAR | '\n' | '\t' | EOF
;
// other rules ...
Running the test class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"#include \"lib\"\n" +
"main() {\n" +
" int bob;\n" +
"}\n";
ANTLRStringStream in = new ANTLRStringStream(source);
SmallCLexer lexer = new SmallCLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SmallCParser parser = new SmallCParser(tokens);
SmallCParser.program_return returnValue = parser.program();
CommonTree tree = (CommonTree)returnValue.getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
produces the following AST:
without any error messages.
But you should fix the grammar warning, and remove \n from the CHARACTER rule since it can never be matched in the CHARACTER rule.
One other thing: you've mixed quite a few keywords inside your parser rules without defining them in your lexer rules explicitly. That is tricky because of the first-come-first-serve lexer rules: you don't want 'if' to be accidentally being tokenized as an IDENT. Better do it like this:
IF : 'if';
IDENT : 'a'..'z' ... ; // After the `IF` rule!

Resources