Basically, I have an assignment where I need to make a compiler for C-, but we're doing it in 5 steps. One of the steps was to turn the BNF grammar into a bison grammar and then print a tree of what has been parsed. Let me explain:
BNF Grammar
1. program → declaration-list
2. declaration-list → declaration-list declaration | declaration
3. declaration → var-declaration | fun-declaration
4. var-declaration → type-specifier ID ; | type-specifier ID [ NUM ] ;
5. type-specifier → int | void
6. fun-declaration → type-specifier ID ( params ) compound-stmt
7. params → param-list | void
8. param-list → param-list , param | param
9. param → type-specifier ID | type-specifier ID [ ]
10. compound-stmt → { local-declarations statement-list }
11. local-declarations → local-declarations var-declaration | empty
12. statement-list → statement-list statement | empty
13. statement → expression-stmt | compound-stmt | selection-stmt | iteration-stmt | return-stmt
14. expression-stmt → expression ; | ;
15. selection-stmt → if ( expression ) statement | if ( expression ) statement else statement
16. iteration-stmt → while ( expression ) statement
17. return-stmt → return ; | return expression ;
18. expression → var = expression | simple-expression
19. var → ID | ID [ expression ]
20. simple-expression → additive-expression relop additive-expression | additive-expression
21. relop → <= | < | > | >= | == | !=
22. additive-expression → additive-expression addop term | term
23. addop → + | -
24. term → term mulop factor | factor
25. mulop → * | /
26. factor → ( expression ) | var | call | NUM
27. call → ID ( args )
28. args → arg-list | empty
29. arg-list → arg-list , expression | expression
File: Project.fl
%option noyywrap
%{
/* Definitions and statements */
#include <stdio.h>
#include "project.tab.h"
int nlines = 1;
char filename[50];
%}
ID {letter}{letter}*
NUM {digit}{digit}*
letter [a-zA-Z]
digit [0-9]
%%
"if" { return T_IF; }
"else" { return T_ELSE; }
"int" { return T_INT; }
"return" { return T_RETURN; }
"void" { return T_VOID; }
"while" { return T_WHILE; }
"+" { return yytext[0]; }
"-" { return yytext[0]; }
"*" { return yytext[0]; }
"/" { return yytext[0]; }
">" { return T_GREAT; }
">=" { return T_GREATEQ; }
"<" { return T_SMALL; }
"<=" { return T_SMALLEQ; }
"==" { return T_COMPARE; }
"!=" { return T_NOTEQ; }
"=" { return yytext[0]; }
";" { return yytext[0]; }
"," { return yytext[0]; }
"(" { return yytext[0]; }
")" { return yytext[0]; }
"[" { return yytext[0]; }
"]" { return yytext[0]; }
"{" { return yytext[0]; }
"}" { return yytext[0]; }
(\/\*(ID)\*\/) { return T_COMM; }
{ID} { return T_ID; }
{NUM} { return T_NUM; }
\n { ++nlines; }
%%
File: project.y
%{
#include <stdio.h>
#include <stdlib.h>
extern int yylex();
extern int yyparse();
void yyerror(const char* s);
%}
%token T_IF T_ELSE T_INT T_RETURN T_VOID T_WHILE
T_GREAT T_GREATEQ T_SMALL T_SMALLEQ T_COMPARE T_NOTEQ
T_COMM T_ID T_NUM
%%
program: declaration-list { printf("program"); }
;
declaration-list: declaration-list declaration
| declaration
;
declaration: var-declaration
| fun-declaration
;
var-declaration: type-specifier T_ID ';'
| type-specifier T_ID'['T_NUM']' ';'
;
type-specifier: T_INT
| T_VOID
;
fun-declaration: type-specifier T_ID '('params')' compound-stmt
;
params: param-list
| T_VOID
;
param-list: param-list',' param
| param
;
param: type-specifier T_ID
| type-specifier T_ID'['']'
;
compound-stmt: '{' local-declarations statement-list '}'
;
local-declarations: local-declarations var-declaration
|
;
statement-list: statement-list statement
|
;
statement: expression-stmt
| compound-stmt
| selection-stmt
| iteration-stmt
| return-stmt
;
expression-stmt: expression ';'
| ';'
;
selection-stmt: T_IF '('expression')' statement
| T_IF '('expression')' statement T_ELSE statement
;
iteration-stmt: T_WHILE '('expression')' statement
;
return-stmt: T_RETURN ';'
| T_RETURN expression ';'
;
expression: var '=' expression
| simple-expression
;
var: T_ID { printf("\nterm\nfactor_var\nvar(x)"); }
| T_ID '['expression']'
;
simple-expression: additive-expression relop additive-expression
| additive-expression
;
relop: T_SMALLEQ
| T_SMALL
| T_GREAT
| T_GREATEQ
| T_COMPARE
| T_NOTEQ
;
additive-expression: additive-expression addop term
| term
;
addop: '+' { printf("\naddop(+)"); }
| '-' { printf("\naddop(-)"); }
;
term: term mulop factor
| factor
;
mulop: '*' { printf("\nmulop(*)"); }
| '/' { printf("\nmulop(/)"); }
;
factor: '('expression')' { printf("\nfactor1"); }
| var
| call
| T_NUM { printf("\nterm\nfactor(5)"); }
;
call: T_ID '('args')' { printf("\ncall(input)"); }
;
args: arg-list
| { printf("\nargs(empty)"); }
;
arg-list: arg-list',' expression
| expression
;
%%
int main(void) {
return yyparse();
}
void yyerror(const char* s) {
fprintf(stderr, "Parse error: %s\n", s);
exit(1);
}
And finally the tree that's asked to be replicated:
program
declaration_list
declaration
fun_definition(VOID-main)
params_VOID-compound
params(VOID)
compound_stmt
local_declarations
local_declarations
local_declarations(empty)
var_declaration(x)
type_specifier(INT)
var_declaration(y)
type_specifier(INT)
statement_list
statement_list
statement_list(empty)
statement
expression_stmt
expression
var(x)
expression
simple_expression
additive_expression
term
factor
call(input)
args(empty)
statement
expression_stmt
expression
var(y)
expression
simple_expression
additive_expression(ADDOP)
additive_expression
term
factor_var
var(x)
addop(+)
term
factor(5)
Sample code on which the tree is based:
/* A program */
void main(void)
{
int x; int y;
x = input();
y = x + 5;
}
I've translated the BNF grammar into the actual .y file, but I'm having trouble working out where exactly the print statements should go. Usually, a rule would finish matching and only THEN print.
The desired output you present is the result of a pre-order walk of the parse tree.
However, bison generates a bottom-up parser, which performs semantic actions for a node in the parse tree when the node's subtree is complete. Printing the node in the semantic action therefore produces a post-order walk. I suppose that is what you mean by your last sentence.
While there are a variety of possible solutions, the simplest one is probably to construct a parse tree during the parse and then print it out at the end of the parse. (You could print the tree in the semantic action for the start production, but that will sometimes result in a parse tree being printed for an erroneous input. Better is to return the root of the parse tree and print it from the main program after verifying that the parse was successful.)
I don't know where "construct a parse tree" fits in the expected progression of your project. Parse trees are of little use in most applications. Much more common is the construction of an abstract syntax tree (AST) which omits many of the irrelevant details from the parse (such as unit productions). You can construct an AST from a parse tree, but it is generally simpler to construct it directly in the parse actions: the code looks very similar but there is less of it precisely because tree nodes don't have to be built for unit productions.
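To make the pre-order versus post-order distinction concrete, here is a minimal sketch in plain C, independent of bison. The names `Node`, `mknode`, `preorder` and `postorder` are hypothetical helpers invented for this illustration, not part of bison or of the question's code; in a real grammar each action would do something like `$$ = mknode("term", 1, $1);` after declaring a suitable `%union` member. The walkers append labels to a caller-supplied buffer so the result can be printed only after the parse has succeeded.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_KIDS 4   /* assumption: no production here needs more children */

/* Hypothetical parse-tree node: a label plus a few children. */
typedef struct Node {
    const char *label;
    int nkids;
    struct Node *kids[MAX_KIDS];
} Node;

/* Allocate a node with nkids children passed as Node* varargs. */
Node *mknode(const char *label, int nkids, ...)
{
    Node *n = malloc(sizeof *n);
    va_list ap;

    assert(nkids >= 0 && nkids <= MAX_KIDS);
    if (n == NULL) {
        perror("malloc");
        exit(1);
    }
    n->label = label;
    n->nkids = nkids;
    va_start(ap, nkids);
    for (int i = 0; i < nkids; i++)
        n->kids[i] = va_arg(ap, Node *);
    va_end(ap);
    return n;
}

/* Append "label\n" to out, truncating if the buffer is full. */
static void append(char *out, size_t cap, const char *label)
{
    size_t used = strlen(out);
    if (used + 1 < cap)
        snprintf(out + used, cap - used, "%s\n", label);
}

/* Pre-order: the node first, then its children (the order the assignment wants). */
void preorder(const Node *n, char *out, size_t cap)
{
    if (n == NULL)
        return;
    append(out, cap, n->label);
    for (int i = 0; i < n->nkids; i++)
        preorder(n->kids[i], out, cap);
}

/* Post-order: children first, then the node (the order bottom-up actions run in). */
void postorder(const Node *n, char *out, size_t cap)
{
    if (n == NULL)
        return;
    for (int i = 0; i < n->nkids; i++)
        postorder(n->kids[i], out, cap);
    append(out, cap, n->label);
}
```

Building the subtree for `x + 5` by hand and calling `preorder` on it yields the labels in the order of the expected output (`additive_expression(ADDOP)`, `additive_expression`, `term`, `factor_var`, `var(x)`, `addop(+)`, `term`, `factor(5)`), whereas `postorder` starts with `var(x)`, which is the order you get when printing directly from the actions.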
I am trying to build a three-address code generator that would produce:
input:x=a+3*(b/7)
output: t1=b/7
t2=3*t1
t3=a+t2
x=t3
No matter what I give as input, the output is "syntax error".
I'm using Windows 10.
Yacc code:
%{
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#define YYDEBUG 1
int yylex(void);
int t_count = 1;
void yyerror(char *s)
{
fprintf(stderr,"%s\n",s);
return;
}
char * generateToken(int i)
{
char* ch=(char*)malloc(sizeof(char)*5);
sprintf(ch,"t%d",i++);
return ch;
}
%}
%union { double dval; char ivar[50]; }
%token <ivar> NUMBER
%token <ivar> NAME
%type <ivar> expr
%type <ivar> term
%left '+' '-'
%left '*' '/'
%left '(' ')'
%right '='
%%
program:
line {
}
| program line {
}
;
line:
expr '\n' {
t_count =1;
}
| NAME '=' expr '\n' {
printf("%s = %s", $3,$1);
t_count=1;
}
;
expr:
expr '+' expr {
strcpy($$,generateToken(t_count));
printf("%s = %s + %s",$$,$1,$3);
}
| expr '-' expr {
strcpy($$,generateToken(t_count));
printf("%s = %s - %s",$$,$1,$3);
}
| expr '*' expr {
strcpy($$,generateToken(t_count));
printf("%s = %s * %s",$$,$1,$3);
}
| expr '/' expr {
strcpy($$,generateToken(t_count));
printf("%s = %s / %s",$$,$1,$3);
}
| term {
strcpy($$, $1);
}
| '(' expr ')' {
strcpy($$,generateToken(t_count));
printf("%s =( %s )" ,$$,$2);
}
;
term:
NAME {
strcpy($$, $1);
}
| NUMBER {
strcpy($$, $1);
}
;
%%
int main(void)
{
if (getenv("YYDEBUG")) yydebug = 1;
yyparse();
return 0;
}
Lex code:
%option noyywrap
%{
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "threeAdd.tab.h"
void yyerror(char*);
extern YYSTYPE yylval;
%}
NAME [a-zA-Z]
DIGIT [0-9]+
NUMBER [-]?{DIGIT}+(\.{DIGIT}+)?
%%
[ \t]+ { }
{NUMBER}
{
strcpy(yylval.ivar,yytext);
return *yylval.ivar;
}
"+" {
return *yytext;
}
"-" {
return *yytext;
}
"*" {
return *yytext;
}
"/" {
return *yytext;
}
"=" {
return *yytext;
}
"(" {
return *yytext;
}
")" {
return *yytext;
}
{NAME} {
strcpy(yylval.ivar,yytext);
return *yylval.ivar;
}
"\n" {
return *yytext;
}
exit {
return 0;
}
. {
char msg[25];
sprintf(msg," <%s>","invalid character",yytext);
yyerror(msg);
}
%%
Sample build & run:
C:\Users\USER\OneDrive\Desktop\Compiler\ICG>flex file.l
C:\Users\USER\OneDrive\Desktop\Compiler\ICG>bison -d file.y
C:\Users\USER\OneDrive\Desktop\Compiler\ICG>gcc lex.yy.c file.tab.c -o ICG.exe
C:\Users\USER\OneDrive\Desktop\Compiler\ICG>ICG.exe
3+9
syntax error
The basic problem is that you are using double-quoted strings ("...") for tokens in your yacc file (without defining any codes for them, so they're useless), while returning single-character tokens from your lex file. As a result, none of the tokens will be recognized by your parser.
Replace all the " characters with ' characters on all the single character tokens in your yacc file (so "+" becomes '+' and "\n" becomes '\n').
Once you fix that, you have another problem: your lex rules for {DIGITS}+ and {NAME} don't return a token, so the token will be ignored (leading to syntax errors).
For debugging parser problems in general, it is often worth compiling with -DYYDEBUG and sticking yydebug = 1; into main before calling yyparse, which will cause the parser to print a trace of tokens seen and states visited. I often put
if (getenv("YYDEBUG")) yydebug = 1;
into main and just leave it there. That way debugging normally won't be enabled, but if you set the environment variable YYDEBUG=1 before running your program, you'll see the debug trace (no need to recompile).
In order to return a token, your lexer rule needs to return the token. So your lexer rule for NUMBER should be:
{NUMBER} {
strcpy(yylval.ivar,yytext);
return NUMBER;
}
and similar for NAME. Note that the opening { of the code block must be on the same line as the pattern -- if it is on a separate line it will not be associated with the pattern.
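The difference between returning a token code and returning a character can be seen in a small hand-written scanner. This is only a sketch in plain C, not flex output: `next_token` is a hypothetical function, the values 258 and 259 are illustrative stand-ins for whatever codes bison writes into the generated header, and unlike the question's single-letter NAME pattern it accepts multi-letter names. The point is that named tokens get codes above the range of character codes, so `return *yylval.ivar;` hands the parser the first character of the text (for example '3') rather than NUMBER.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Illustrative token codes; the real ones come from the bison-generated header. */
enum { NUMBER = 258, NAME = 259 };

/*
 * Scan one token starting at *p, copy its text into text (capacity cap),
 * and advance *p past it. Returns NUMBER or NAME for multi-character
 * tokens, the character itself for single-character tokens, 0 at end.
 */
int next_token(const char **p, char *text, size_t cap)
{
    const char *s = *p;
    size_t n = 0;
    int code;

    assert(cap > 1);
    while (*s == ' ' || *s == '\t')      /* skip blanks, like the [ \t]+ rule */
        s++;
    if (*s == '\0') {
        *p = s;
        text[0] = '\0';
        return 0;                        /* end of input */
    }
    if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s)) {
            if (n + 1 < cap)
                text[n++] = *s;
            s++;
        }
        code = NUMBER;                   /* the token code, not text[0] */
    } else if (isalpha((unsigned char)*s)) {
        while (isalpha((unsigned char)*s)) {
            if (n + 1 < cap)
                text[n++] = *s;
            s++;
        }
        code = NAME;
    } else {
        text[n++] = *s;
        code = (unsigned char)*s;        /* '+', '=', '\n', ... stand for themselves */
        s++;
    }
    text[n] = '\0';
    *p = s;
    return code;
}
```

For the input `3+9` followed by a newline, this returns NUMBER, '+', NUMBER, '\n' and then 0, which is the sequence the `expr '\n'` rule expects.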
I am currently trying to create a basic grammar that can take in and recognize shell commands. However, I am getting syntax errors that escape my understanding. I have drawn out a tree and, from what I can tell, I cover all the bases. Here is my lex file
%{
#include <cstring>
#include "y.tab.hh"
static void yyunput (int c,char *buf_ptr );
void myunputc(int c) {
unput(c);
}
%}
%option noyywrap
%%
"|" {
return PIPE;
}
"<" {
return LESS;
}
"&" {
return AMP;
}
\n {
return NEWLINE;
}
[ \t] {
/* Discard spaces and tabs */
}
">" {
return GREAT;
}
"2>" {
return TGREAT;
}
">&" {
return GREATAMP;
}
">>" {
return DUBGREAT;
}
">>&" {
return DUBGREATAMP;
}
[^ \t\n][^ \t\n]* {
/* Assume that file names have only alpha chars */
yylval.cpp_string = new std::string(yytext);
return WORD;
}
And here is my yacc file
%code requires
{
#include <string>
#if __cplusplus > 199711L
#define register // Deprecated in C++11 so remove the keyword
#endif
}
%union
{
char *string_val;
// Example of using a c++ type in yacc
std::string *cpp_string;
}
%token <cpp_string> WORD
%token NOTOKEN GREAT NEWLINE PIPE LESS AMP TGREAT GREATAMP DUBGREAT DUBGREATAMP
%{
//#define yylex yylex
#include <cstdio>
#include "shell.hh"
void yyerror(const char * s);
int yylex();
%}
%%
goal:
command_list
;
command_list:
command_line
| command_list command_line
;
command_line:
pipe_list io_modifier_list background_optional NEWLINE
| NEWLINE
| error NEWLINE{yyerrok;}
;
pipe_list:
pipe_list PIPE command_and_args
| command_and_args
;
command_and_args:
command_word argument_list {
Shell::_currentCommand.
insertSimpleCommand( Command::_currentSimpleCommand );
}
;
argument_list:
argument_list argument
| /* can be empty */
;
io_modifier_list:
io_modifier_list iomodifier_opt
| /*empty*/
;
iomodifier_opt:
GREAT WORD {
printf(" Yacc: insert output \"%s\"\n", $2->c_str());
Shell::_currentCommand._outFile = $2;
}
| DUBGREAT WORD {
printf(" Yacc: insert output \"%s\"\n", $2->c_str());
Shell::_currentCommand._outFile = $2;
}
| DUBGREATAMP WORD {
printf(" Yacc: insert output \"%s\"\n", $2->c_str());
Shell::_currentCommand._outFile = $2;
Shell::_currentCommand._background = true;
}
| GREATAMP WORD {
printf(" Yacc: insert output \"%s\"\n", $2->c_str());
Shell::_currentCommand._outFile = $2;
Shell::_currentCommand._background = true;
}
| LESS WORD {
printf(" Yacc: insert input \"%s\"\n", $2->c_str());
Shell::_currentCommand._inFile = $2;
}
| /* can be empty */
;
background_optional:
AMP {
Shell::_currentCommand._background = true;
}
| /*empty*/
;
argument:
WORD {
printf(" Yacc: insert argument \"%s\"\n", $1->c_str());
Command::_currentSimpleCommand->insertArgument( $1 );
}
;
command_word:
WORD {
printf(" Yacc: insert command \"%s\"\n", $1->c_str());
Command::_currentSimpleCommand = new SimpleCommand();
Command::_currentSimpleCommand->insertArgument( $1 );
}
;
%%
void
yyerror(const char * s)
{
fprintf(stderr,"%s", s);
}
#if 0
main()
{
yyparse();
}
#endif
In the simplest use case, I tried ls -al. In my understanding, ls gets recognized as a command_word. Then -al gets recognized as an argument. This creates a new argument_list containing only -al. From there, command_and_args is created from the command_word ls and the argument_list -al. This creates a pipe_list that only contains this command_and_args. From there, a command_line is created with this new pipe_list, an empty io_modifier_list, an empty background_optional, and the newline character when I hit enter. This creates a command_list, which is the goal. However, my understanding is apparently incorrect because I am getting a syntax error and I was hoping someone could help me fix that lack of understanding.
I have never used JFlex before, and I have no idea how it works. Basically, I've built a runtime for a scheme-esque language in Java, and the parser I have for it was generated using Antlr 4, so I have a .g4 file, which looks like this:
program: expression EOF;
expression: lambdaAbstraction | application| literal;
lambdaAbstraction: LP WS? LAMBDA (WS IDENTIFIER)* WS expression WS? RP;
application: LP WS? expression (WS expression)* WS? RP;
literal: FLOAT_LITERAL | INTEGER_LITERAL | IDENTIFIER | STRING_LITERAL;
STRING_LITERAL: '"' ~["]* '"' | '\'' ~[']* '\'' ;
FLOAT_LITERAL: [0-9]+ '.' [0-9]+;
INTEGER_LITERAL: [0-9]+;
IDENTIFIER: [a-zA-Z+/*-=?#]+;
LAMBDA: '\\';
LP: '(';
RP: ')';
WS: [ \n\r\t]+;
Some examples (so you can get a feel for the grammar).
(+ 1 2) = 3
((\ a b (+ a b)) 1 2) = 3
(+ "Hello, " "World") = "Hello, World"
What I'm trying to do is build a syntax-highlighter plugin for IntelliJ-IDEA for this language using this guide: http://www.jetbrains.org/intellij/sdk/docs/tutorials/custom_language_support_tutorial.html.
Here is my .flex file
package com.michaelsnowden.yalcil_plugin;
import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;
import com.michaelsnowden.yalcil_plugin.psi.YalcilTypes;
import com.intellij.psi.TokenType;
%%
%class YalcilLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType
%eof{ return;
%eof}
STRING_LITERAL= "\"" [^\"&]* "\"";
FLOAT_LITERAL=[0-9]+ '.' [0-9]+;
INTEGER_LITERAL=[0-9]+;
IDENTIFIER=[a-zA-Z+/*-=?#]+;
LAMBDA="\\";
LP="(";
RP=")";
WS=[ \n\r\t]+;
%%
<YYINITIAL> {
{LAMBDA} { yybegin(YYINITIAL); return YalcilTypes.LAMBDA; }
{FLOAT_LITERAL} { yybegin(YYINITIAL); return YalcilTypes.LAMBDA; }
{INTEGER_LITERAL} { yybegin(YYINITIAL); return YalcilTypes.INTEGER_LITERAL; }
{IDENTIFIER} { yybegin(YYINITIAL); return YalcilTypes.IDENTIFIER; }
{LP} { yybegin(YYINITIAL); return YalcilTypes.LP; }
{RP} { yybegin(YYINITIAL); return YalcilTypes.RP; }
{WS} { yybegin(YYINITIAL); return YalcilTypes.WS; }
{STRING_LITERAL} { yybegin(YYINITIAL); return YalcilTypes.STRING_LITERAL; }
}
. { return TokenType.BAD_CHARACTER; }
And here is my .bnf file
{
parserClass="com.michaelsnowden.yalcil_plugin.YalcilParser"
extends="com.intellij.extapi.psi.ASTWrapperPsiElement"
psiClassPrefix="Yalcil"
psiImplClassSuffix="Impl"
psiPackage="com.michaelsnowden.yalcil_plugin.psi"
psiImplPackage="com.michaelsnowden.yalcil_plugin.psi.impl"
elementTypeHolderClass="com.michaelsnowden.yalcil_plugin.psi.YalcilTypes"
elementTypeClass="com.michaelsnowden.yalcil_plugin.psi.YalcilElementType"
tokenTypeClass="com.michaelsnowden.yalcil_plugin.psi.YalcilTokenType"
}
program ::= expression;
expression ::= lambdaAbstraction | application| literal;
lambdaAbstraction ::= LP WS? LAMBDA (WS IDENTIFIER)* WS expression WS? RP;
application ::= LP WS? expression (WS expression)* WS? RP;
literal ::= FLOAT_LITERAL | INTEGER_LITERAL | IDENTIFIER | STRING_LITERAL;
I get basically nothing but errors when I run this.
For some reason, only strings are highlighted.
The PsiViewer looks like this:
Does anyone see any explanation for why this isn't working?
I am attempting to parse freeform strings (ANTLR 3) and am running into string/token issues. Googling has not yet helped me. Here is my grammar:
grammar TestHeader;
options {
language = Java;
output = AST;
}
tokens {
STOKEN_1 = 'SENDTO';
STOKEN_2 = 'SRCSYS';
STOKEN_3 = 'SOFTERROR';
}
@lexer::header {
package com.sample.parser;
}
@parser::header {
package com.sample.parser;
}
fragment TAG : /* empty rule: only used to change the 'type' */;
fragment CR: '\r';
fragment LF: '\n';
fragment SLASH: '/';
fragment HASH: '#';
fragment DIGIT: '0'..'9';
fragment SEMI_COLON: ';';
COLON: ':';
COMMA: ',';
DOT: '.';
LETTER: 'A'..'Z' | 'a'..'z';
HYPHEN: '-';
BRANCH_CHARS: '$' | '@' | HASH;
SPACE: (' ');
EOL: (CR? LF) | CR | SEMI_COLON;
DATE: ('0'..'1')? DIGIT SLASH ('0'..'3')? DIGIT SLASH DIGIT DIGIT;
TIME: ('0'..'2')? DIGIT COLON ('0'..'5')? DIGIT (COLON '0'..'5' DIGIT)?;
DECIMAL
:
DIGIT* ('.' DIGIT*)
{
if ($text.contains("/")) {
$type = DATE;
} else if ($text.contains(":")) {
$type = TIME;
}
}
;
NUMBER
:
'0'..'9' DIGIT*
{
if ($text.contains("/")) {
$type = DATE;
} else if ($text.contains(":")) {
$type = TIME;
}
}
;
SINGLELINE_COMMENT
:
'//-' ~('\r' | '\n')*
;
header
:
(word SPACE+ sequenceNumber SPACE+ type) EOL+
(SPACE* trailer) EOL+
;
word
:
(data += ~(SPACE))+
;
sequenceNumber
:
nn=NUMBER
;
type
:
LETTER
;
trailer
:
// (tt += ~(EOL ))*
SINGLELINE_COMMENT
;
With a test input
SY 260 O
//-$CJ******1
I get the following
0 null
-- 16 S
-- 16 Y
-- 22
-- 18 260
-- 22
-- 16 O
-- 13
-- 20 //-$CJ******1
-- 13
However with the following input
SR 260 O
//-$CJ******1
I get
line 1:2 mismatched character ' ' expecting 'C'
line 1:7 mismatched input 'O' expecting NUMBER
0 260 O
//-$CJ******1
Where am I going wrong? Any tips or help to get past this issue?
Consider this short SmallC program:
#include "lib"
main() {
int bob;
}
My ANTLR grammar picks it up fine if I specify, in ANTLRWorks and when using the Interpreter, line endings -> "Mac (CR)". If I set the line endings option to Unix (LF), the grammar throws a NoViableAltException and does not recognize anything after the end of the include statement. This error disappears if I add a newline at the end of include. The computer I'm using for this is a Mac, so I figured that it made sense to have to set the line endings to Mac format. So instead, I switch to a Linux box - and get the same thing. If I type anything in the ANTLRWorks Interpreter box, and if I don't select line endings Mac (CR), I get issues about insufficient blank lines as was the case above and, in addition, the last statement of each statement block requires an extra space following the semicolon (i.e. after bob; above).
These bugs show up again when I run a Java version of my grammar on a code input file that I want to parse...
What could possibly be the issue? I'd understand if the issue was the presence of TOO many new lines, in a format that perhaps the parser didn't understand / weren't caught by my whitespace rule. But in this case, it's an issue of lacking new lines.
My white space declaration is as follows:
WS : ( '\t' | ' ' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
Alternatively, could this be due to an ambiguity issue?
Here is the full grammar file (feel free to ignore the first few blocks, which override ANTLR's default error handling mechanisms):
grammar SmallC;
options {
output = AST ; // Set output mode to AST
}
tokens {
DIV = '/' ;
MINUS = '-' ;
MOD = '%' ;
MULT = '*' ;
PLUS = '+' ;
RETURN = 'return' ;
WHILE = 'while' ;
// The following are empty tokens used in AST generation
ARGS ;
CHAR ;
DECLS ;
ELSE ;
EXPR ;
IF ;
INT ;
INCLUDES ;
MAIN ;
PROCEDURES ;
PROGRAM ;
RETURNTYPE ;
STMTS ;
TYPEIDENT ;
}
@members {
// Force error throwing, and make sure we don't try to recover from invalid input.
// The exceptions are handled in the FrontEnd class, and gracefully end the
// compilation routine after displaying an error message.
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MismatchedTokenException(ttype, input);
}
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)throws RecognitionException {
throw e;
}
protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MissingTokenException(ttype, input, null);
}
// We override getErrorMessage() to include information about the specific
// grammar rule in which the error happened, using a stack of nested rules.
Stack paraphrases = new Stack();
public String getErrorMessage(RecognitionException e, String[] tokenNames) {
String msg = super.getErrorMessage(e, tokenNames);
if ( paraphrases.size()>0 ) {
String paraphrase = (String)paraphrases.peek();
msg = msg+" "+paraphrase;
}
return msg;
}
// We override displayRecognitionError() to specify a clearer error message,
// and to include the error type (ie. class of the exception that was thrown)
// for the user's reference. The idea here is to come as close as possible
// to Java's exception output.
public void displayRecognitionError(String[] tokenNames, RecognitionException e)
{
String exType;
String hdr;
if (e instanceof UnwantedTokenException) {
exType = "UnwantedTokenException";
} else if (e instanceof MissingTokenException) {
exType = "MissingTokenException";
} else if (e instanceof MismatchedTokenException) {
exType = "MismatchedTokenException";
} else if (e instanceof MismatchedTreeNodeException) {
exType = "MismatchedTreeNodeException";
} else if (e instanceof NoViableAltException) {
exType = "NoViableAltException";
} else if (e instanceof EarlyExitException) {
exType = "EarlyExitException";
} else if (e instanceof MismatchedSetException) {
exType = "MismatchedSetException";
} else if (e instanceof MismatchedNotSetException) {
exType = "MismatchedNotSetException";
} else if (e instanceof FailedPredicateException) {
exType = "FailedPredicateException";
} else {
exType = "Unknown";
}
if ( getSourceName()!=null ) {
hdr = "Exception of type " + exType + " encountered in " + getSourceName() + " at line " + e.line + ", char " + e.charPositionInLine + ": ";
} else {
hdr = "Exception of type " + exType + " encountered at line " + e.line + ", char " + e.charPositionInLine + ": ";
}
String msg = getErrorMessage(e, tokenNames);
emitErrorMessage(hdr + msg + ".");
}
}
// Force the parser not to try to guess tokens and resume on faulty input,
// but rather display the error, and throw an exception for the program
// to quit gracefully.
@rulecatch {
catch (RecognitionException e) {
reportError(e);
throw e;
}
}
/*------------------------------------------------------------------
* PARSER RULES
*
* Many of these make use of ANTLR's rewrite rules to allow us to
* specify the roots of AST sub-trees, and to allow us to do away
* with certain insignificant literals (like parentheses and commas
* in lists) and to add empty tokens to disambiguate the tree
* construction
*
* The @init and @after definitions populate the paraphrase
* stack to allow us to specify which grammar rule we are in when
* errors are found.
*------------------------------------------------------------------*/
args
@init { paraphrases.push("in these procedure arguments"); }
@after { paraphrases.pop(); }
: ( typeident ( ',' typeident )* )? -> ^( ARGS ( typeident ( typeident )* )? )? ;
body
@init { paraphrases.push("in this procedure body"); }
@after { paraphrases.pop(); }
: '{'! decls stmtlist '}'! ;
decls
@init { paraphrases.push("in these declarations"); }
@after { paraphrases.pop(); }
: ( typeident ';' )* -> ^( DECLS ( typeident )* )? ;
exp
@init { paraphrases.push("in this expression"); }
@after { paraphrases.pop(); }
: lexp ( ( '>' | '<' | '>=' | '<=' | '!=' | '==' )^ lexp )? ;
factor : '(' lexp ')'
| ( MINUS )? ( IDENT | NUMBER )
| CHARACTER
| IDENT '(' ( IDENT ( ',' IDENT )* )? ')' ;
lexp : term ( ( PLUS | MINUS )^ term )* ;
includes
@init { paraphrases.push("in the include statements"); }
@after { paraphrases.pop(); }
: ( '#include' STRING )* -> ^( INCLUDES ( STRING )* )? ;
main
@init { paraphrases.push("in the main method"); }
@after { paraphrases.pop(); }
: 'main' '(' ')' body -> ^( MAIN body ) ;
procedure
@init { paraphrases.push("in this procedure"); }
@after { paraphrases.pop(); }
: ( proc_return_char | proc_return_int )? IDENT^ '('! args ')'! body ;
procedures : ( procedure )* -> ^( PROCEDURES ( procedure)* )? ;
proc_return_char
: 'char' -> ^( RETURNTYPE CHAR ) ;
proc_return_int : 'int' -> ^( RETURNTYPE INT ) ;
// We hard-code the regex (\n)* to fix a bug whereby a program would be accepted
// if it had 0 or more than 1 new lines before EOF but not if it had exactly 1,
// and not if it had 0 new lines between components of the following rule.
program : includes decls procedures main EOF ;
stmt
@init { paraphrases.push("in this statement"); }
@after { paraphrases.pop(); }
: '{'! stmtlist '}'!
| WHILE '(' exp ')' s=stmt -> ^( WHILE ^( EXPR exp ) $s )
| 'if' '(' exp ')' s=stmt ( options {greedy=true;} : 'else' s2=stmt )? -> ^( IF ^( EXPR exp ) $s ^( ELSE $s2 )? )
| IDENT '='^ lexp ';'!
| ( 'read' | 'output' | 'readc' | 'outputc' )^ '('! IDENT ')'! ';'!
| 'print'^ '('! STRING ( options {greedy=true;} : ')'! ';'! )
| RETURN ( lexp )? ';' -> ^( RETURN ( lexp )? )
| IDENT^ '('! ( IDENT ( ','! IDENT )* )? ')'! ';'!;
stmtlist : ( stmt )* -> ^( STMTS ( stmt )* )? ;
term : factor ( ( MULT | DIV | MOD )^ factor )* ;
// We divide typeident into two grammar rules depending on whether the
// ident is of type 'char' or 'int', to allow us to implement different
// rewrite rules in each case.
typeident : typeident_char | typeident_int ;
typeident_char : 'char' s2=IDENT -> ^( CHAR $s2 ) ;
typeident_int : 'int' s2=IDENT -> ^( INT $s2 ) ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// Must come before CHARACTER to avoid ambiguity ('i' matches both IDENT and CHARACTER)
IDENT : ( LCASE_ALPHA | UCASE_ALPHA | '_' ) ( LCASE_ALPHA | UCASE_ALPHA | DIGIT | '_' )* ;
CHARACTER : PRINTABLE_CHAR
| '\n' | '\t' | EOF ;
NUMBER : ( DIGIT )+ ;
STRING : '\"' ( ~( '"' | '\n' | '\r' | 't' ) )* '\"' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
fragment
DIGIT : '0'..'9' ;
fragment
LCASE_ALPHA : 'a'..'z' ;
fragment
NONALPHA_CHAR : '`' | '~' | '!' | '@' | '#' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-'
| '_' | '+' | '=' | '{' | '[' | '}' | ']' | '|' | '\\' | ';' | ':' | '\''
| '\\"' | '<' | ',' | '>' | '.' | '?' | '/' ;
fragment
PRINTABLE_CHAR : LCASE_ALPHA | UCASE_ALPHA | DIGIT | NONALPHA_CHAR ;
fragment
UCASE_ALPHA : 'A'..'Z' ;
From the command line, I do get a warning:
java -cp antlr-3.2.jar org.antlr.Tool SmallC.g
warning(200): SmallC.g:182:37: Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
but that won't stop the lexer/parser from being generated.
Anyway, the problem: ANTLR's lexer tries to match the first lexer rule it encounters in the file, and if it can't match said token, it trickles down to the next lexer rule. Now you have defined the CHARACTER rule before the WS rule, which both match the character \n. That is why it didn't work under Linux since the \n was tokenized as a CHARACTER. If you define the WS rule before the CHARACTER rule, it all works properly:
// other rules ...
WS
: ('\t' | ' ' | '\r' | '\n' | '\u000C')+ { $channel = HIDDEN; }
;
CHARACTER
: PRINTABLE_CHAR | '\n' | '\t' | EOF
;
// other rules ...
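The effect of rule order can be mimicked with a toy classifier in C. This is a deliberately simplified model, not how ANTLR's generated lexer is actually implemented: `Rule` and `classify` are hypothetical names, each rule is reduced to a set of single characters, and the first rule in the array that accepts the character wins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a lexer rule: a name and the characters it accepts. */
typedef struct {
    const char *name;
    const char *chars;
} Rule;

/*
 * Return the name of the first rule (in array order) that accepts c,
 * or NULL if no rule does. Earlier rules shadow later ones.
 */
const char *classify(char c, const Rule *rules, size_t nrules)
{
    assert(rules != NULL);
    for (size_t i = 0; i < nrules; i++) {
        /* strchr would also match the terminating '\0', so exclude it */
        if (c != '\0' && strchr(rules[i].chars, c) != NULL)
            return rules[i].name;
    }
    return NULL;
}
```

With the rules ordered as CHARACTER then WS, where both sets contain `'\n'`, `classify('\n', ...)` reports CHARACTER; swapping the two entries makes it report WS, which is the behaviour the reordered grammar relies on.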
Running the test class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"#include \"lib\"\n" +
"main() {\n" +
" int bob;\n" +
"}\n";
ANTLRStringStream in = new ANTLRStringStream(source);
SmallCLexer lexer = new SmallCLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SmallCParser parser = new SmallCParser(tokens);
SmallCParser.program_return returnValue = parser.program();
CommonTree tree = (CommonTree)returnValue.getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
produces the following AST:
without any error messages.
But you should fix the grammar warning, and remove \n from the CHARACTER rule since it can never be matched in the CHARACTER rule.
One other thing: you've mixed quite a few keywords into your parser rules without defining them explicitly in your lexer rules. That is tricky because lexer rules are matched on a first-come, first-served basis: you don't want 'if' to be accidentally tokenized as an IDENT. Better do it like this:
IF : 'if';
IDENT : 'a'..'z' ... ; // After the `IF` rule!