As part of a school project I have to recognize a .dot file and produce the corresponding parse tree. To accomplish this, I have to use ocamllex and ocamlyacc which whom I've difficulties...
This is my ocaml .mli types file :
type id = string
and node = id
and attr = id * id
and attr_list = attr list list
and attr_stmt =
Graphs of (attr_list)
| Nodes of (attr_list)
| Edge of (attr_list)
and node_stmt = node * attr_list
and node_subgraph =
Node of (node)
| Subgraphs of (graph)
and edge_stmt = node_subgraph * (node_subgraph list) * attr_list
and stmt =
Node_stmt of (node_stmt)
| Edge_stmt of (edge_stmt)
| Attr_stmt of (attr_stmt)
| Attr of (attr)
| Subgraph of graph
and graph =
Graph_node of (node * stmt list)
| Graph of (stmt list);;
this is my lexer file :
{
open Parser
}
rule token = parse
(* Espacements *)
| [ ' ' '\t' '\n']+ {token lexbuf}
(* *)
| "graph" {GRAPH}
| "subgraph" {SUBGRAPH}
| "--" {EDGE}
(* Délimiteurs *)
| "{" {LEFT_ACC}
| "}" {RIGHT_ACC}
| "(" {LEFT_PAR}
| ")" {RIGHT_PAR}
| "[" {LEFT_BRA}
| "]" {RIGHT_BRA}
| "," {COMMA}
| ";" {SEMICOLON}
| "=" {EQUAL}
| ":" {TWOPOINT}
| ['a'-'z''A'-'Z''0'-'9']* as id {ID (id)}
| eof { raise End_of_file }
and this is my not finished yacc file :
%{
open Types
%}
%token <string> ID
%token <string> STR
%token GRAPH SUBGRAPH EDGE
%token LEFT_ACC RIGHT_ACC LEFT_PAR RIGHT_PAR LEFT_BRA RIGHT_BRA
%token COMME SEMICOLON EQUAL TWOPOINT EOF
%start main
%type <graph> main
%%
main:
graph EOF { $1 }
graph:
GRAPH ID LEFT_ACC content RIGHT_ACC {$4}
| GRAPH LEFT_ACC content RIGHT_ACC {$3}
subgraph:
SUBGRAPH ID LEFT_ACC content RIGHT_ACC {$4}
| SUBGRAPH LEFT_ACC content RIGHT_ACC {$3}
content:
| ID EDGE ID SEMICOLON {[($1,$3)]}
It seems enough for recognizing a simple dot file like
graph D {
A -- B ;
}
But when I try to compile my parser interface I get this error :
This expression has type 'a list
but an expression was expected of type Types.graph (refers to ID EDGE ID SEMICOLON {[$1,$3)]} line )
I don't understand because {[$1,$3]} has a (string * string) list type. And if we are looking to types.mli that can be a graph.
Otherwise, have I correctly understand the running of ocamllex and ocamlyacc ?
A value of type graph has to start with either Graph or Graph_node. This is not at all the same as (string * string) list.
Related
In the Protocol Buffers Version 3 Language Specification
The EBNF syntax for an option is
option = "option" optionName "=" constant ";"
optionName = ( ident | "(" fullIdent ")" ) { "." ident }
constant = fullIdent | ( [ "-" | "+" ] intLit ) | ( [ "-" | "+" ] floatLit ) | strLit | boolLit
ident = letter { letter | decimalDigit | "_" }
fullIdent = ident { "." ident }
strLit = ( "'" { charValue } "'" ) | ( '"' { charValue } '"' )
charValue = hexEscape | octEscape | charEscape | /[^\0\n\\]/
hexEscape = '\' ( "x" | "X" ) hexDigit hexDigit
octEscape = '\' octalDigit octalDigit octalDigit
charEscape = '\' ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | '\' | "'" | '"' )
Or in plain English, an option may be assigned a dotted.notation.identifier, an integer, a float, a boolean, or a single- or double-quoted string, which MUST NOT have "raw" newline characters.
And yet, I'm encountering .proto files in various projects such as grpc-gateway and googleapis, where the rhs of the assignment is not quoted and spans multiple lines. For example in googleapis/google/api/http.proto there is this service definition in a comment block:
// service Messaging {
// rpc UpdateMessage(Message) returns (Message) {
// option (google.api.http) = {
// patch: "/v1/messages/{message_id}"
// body: "*"
// };
// }
// }
In other files, the use of semicolons (and occasionally commas) as separators seems somewhat arbitrary, and I have also seen keys repeated, which in JSON or JavaScript would result in loss of data due to overwriting.
Are there any canonical extensions to the language specification, or are people just Microsofting? (Yes, that's a verb now.)
I posted a similar question on the Protocol Buffers Google Group, and received a private message from a fellow at Google stating the following
This syntax is correct and valid for setting fields on a proto option field which is itself a field referencing a message type. This form is based on the TextFormat spec which I'm unclear if its super well documented, but here's an implementation of it: https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.text_format
When I have time, I will try to unpack what I learn from analyzing TextFormat.
update
I received an answer on the Groups forum
I think for better or worse, "what protoc implements" takes precedence over whatever the spec says. The spec came later and as far as I know we have not put a lot of effort into ensuring that it comprehensively matches the format that protoc expects. I believe the syntax you are looking at is missing from the .proto file format spec but is mentioned here as the "aggregate syntax."
The link above is to a section titled Custom Options in the Language Guide (proto2) page. If you scroll all the way to the end of that section, there is the following snippet that mentions TextFormat:
message FooOptions {
optional int32 opt1 = 1;
optional string opt2 = 2;
}
extend google.protobuf.FieldOptions {
optional FooOptions foo_options = 1234;
}
// usage:
message Bar {
optional int32 a = 1 [(foo_options).opt1 = 123, (foo_options).opt2 = "baz"];
// alternative aggregate syntax (uses TextFormat):
optional int32 b = 2 [(foo_options) = { opt1: 123 opt2: "baz" }];
}
Basically, I have an assignment where I need to make a compiler for C-, but we're doing it in 5 steps. One of the steps was to turn the BNF grammar to bison and then print a tree with what has been compiled. Let me explain:
BNF Grammar
1. program→declaration-list
2. declaration-list→declaration-list declaration | declaration
3. var-declaration| fun-declaration
4. var-declaration→type-specifierID;| type-specifierID[NUM];
5. type-specifier→int | void
6. fun-declaration→type-specifierID(params)compound-stmt
7. params→param-list| void
8. param-list→param-list,param | param
9. param→type-specifierID | type-specifierID[]
10. compound-stmt→{local-declarations statement-list}
11. local-declarations→local-declarations var-declaration| empty
12. statement-list→statement-list statement| empty
13. statement→expression-stmt| compound-stmt| selection-stmt | iteration-stmt | return-stmt
14. expession-stmt→expression;| ;
15. selection-stmt→if(expression)statement| if(expression) statement else statement
16. iteration-stmt→while(expression)statement
17. return-stmt→return; | return expression;
18. expression→var=expression| simple-expression
19. var→ID| ID[expression]
20. simple-expression→additive-expression relop additive-expression| additive-expression
21. relop→<=| <| >| >=| ==| !=
22. additive-expression→additive-expression addop term| term
23. addop→+| -
24. term→term mulop factor| factor
25. mulop→*| /
26. factor→(expression)| var| call| NUM
27. call→ID(args)
28. args→arg-list| empty
29. arg-list→arg-list,expression| expression
File: Project.fl
%option noyywrap
%{
/* Definitions and statements */
#include <stdio.h>
#include "project.tab.h"
int nlines = 1;
char filename[50];
%}
ID {letter}{letter}*
NUM {digit}{digit}*
letter [a-zA-Z]
digit [0-9]
%%
"if" { return T_IF; }
"else" { return T_ELSE; }
"int" { return T_INT; }
"return" { return T_RETURN; }
"void" { return T_VOID; }
"while" { return T_WHILE; }
"+" { return yytext[0]; }
"-" { return yytext[0]; }
"*" { return yytext[0]; }
"/" { return yytext[0]; }
">" { return T_GREAT; }
">=" { return T_GREATEQ; }
"<" { return T_SMALL; }
"<=" { return T_SMALLEQ; }
"==" { return T_COMPARE; }
"!=" { return T_NOTEQ; }
"=" { return yytext[0]; }
";" { return yytext[0]; }
"," { return yytext[0]; }
"(" { return yytext[0]; }
")" { return yytext[0]; }
"[" { return yytext[0]; }
"]" { return yytext[0]; }
"{" { return yytext[0]; }
"}" { return yytext[0]; }
(\/\*(ID)\*\/) { return T_COMM; }
{ID} { return T_ID; }
{NUM} { return T_NUM; }
\n { ++nlines; }
%%
File: project.y
%{
#include <stdio.h>
#include <stdlib.h>
extern int yylex();
extern int yyparse();
void yyerror(const char* s);
%}
%token T_IF T_ELSE T_INT T_RETURN T_VOID T_WHILE
T_GREAT T_GREATEQ T_SMALL T_SMALLEQ T_COMPARE T_NOTEQ
T_COMM T_ID T_NUM
%%
program: declaration-list { printf("program"); }
;
declaration-list: declaration-list declaration
| declaration
;
declaration: var-declaration
| fun-declaration
;
var-declaration: type-specifier T_ID ';'
| type-specifier T_ID'['T_NUM']' ';'
;
type-specifier: T_INT
| T_VOID
;
fun-declaration: type-specifier T_ID '('params')' compound-stmt
;
params: param-list
| T_VOID
;
param-list: param-list',' param
| param
;
param: type-specifier T_ID
| type-specifier T_ID'['']'
;
compound-stmt: '{' local-declarations statement-list '}'
;
local-declarations: local-declarations var-declaration
|
;
statement-list: statement-list statement
|
;
statement: expression-stmt
| compound-stmt
| selection-stmt
| iteration-stmt
| return-stmt
;
expression-stmt: expression ';'
| ';'
;
selection-stmt: T_IF '('expression')' statement
| T_IF '('expression')' statement T_ELSE statement
;
iteration-stmt: T_WHILE '('expression')' statement
;
return-stmt: T_RETURN ';'
| T_RETURN expression ';'
;
expression: var '=' expression
| simple-expression
;
var: T_ID { printf("\nterm\nfactor_var\nvar(x)"); }
| T_ID '['expression']'
;
simple-expression: additive-expression relop additive-expression
| additive-expression
;
relop: T_SMALLEQ
| T_SMALL
| T_GREAT
| T_GREATEQ
| T_COMPARE
| T_NOTEQ
;
additive-expression: additive-expression addop term
| term
;
addop: '+' { printf("\naddop(+)"); }
| '-' { printf("\naddop(-)"); }
;
term: term mulop factor
| factor
;
mulop: '*' { printf("\nmulop(*)"); }
| '/' { printf("\nmulop(/)"); }
;
factor: '('expression')' { printf("\nfactor1"); }
| var
| call
| T_NUM { printf("\nterm\nfactor(5)"); }
;
call: T_ID '('args')' { printf("\ncall(input)"); }
;
args: arg-list
| { printf("\nargs(empty)"); }
;
arg-list: arg-list',' expression
| expression
;
%%
int main(void) {
return yyparse();
}
void yyerror(const char* s) {
fprintf(stderr, "Parse error: %s\n", s);
exit(1);
}
And finally the tree that's asked to be replicated:
program
declaration_list
declaration
fun_definition(VOID-main)
params_VOID-compound
params(VOID)
compound_stmt
local_declarations
local_declarations
local_declarations(empty)
var_declaration(x)
type_specifier(INT)
var_declaration(y)
type_specifier(INT)
statement_list
statement_list
statement_list(empty)
statement
expression_stmt
expression
var(x)
expression
simple_expression
additive_expression
term
factor
call(input)
args(empty)
statement
expression_stmt
expression
var(y)
expression
simple_expression
additive_expression(ADDOP)
additive_expression
term
factor_var
var(x)
addop(+)
term
factor(5)
Sample code in which the tree is based off
/* A program */
void main(void)
{
int x; int y;
x = input();
y = x + 5;
}
I've turned the BNF grammar to the actual .y file, but I'm having problems printing out where exactly the messages should go. Usually, a grammar would finish THEN print.
The desired output you present is the result of a pre-order walk of the parse tree.
However, bison generates a bottom-up parser, which performs semantic actions for a node in the parse tree when the node's subtree is complete. Printing the node in the semantic action therefore produces a post-order walk. I suppose that is what you mean by your last sentence.
While there are a variety of possible solutions, the simplest one is probably to construct a parse tree during the parse and then print it out at the end of the parse. (You could print the tree in the semantic action for the start production, but that will sometimes result in a parse tree being printed for an erroneous input. Better is to return the root of the parse tree and print it from the main program after verifying that the parse was successful.)
I don't know where "construct a parse tree" fits in the expected progression of your project. Parse trees are of little use in most applications. Much more common is the construction of an abstract syntax tree (AST) which omits many of the irrelevant details from the parse (such as unit productions). You can construct an AST from a parse tree, but it is generally simpler to construct it directly in the parse actions: the code looks very similar but there is less of it precisely because tree nodes don't have to be built for unit productions.
The Official documentation about map type says:
map<key_type, value_type> map_field = N;
...where the key_type can be any integral or string type (so, any
scalar type except for floating point types and bytes). The value_type
can be any type.
I want to define a map<string, repeated string> field, but it seems illegal on my libprotoc 3.0.0, which complains Expected ">". So I wonder if there is any way to put repeated string into map.
A Possible workaround could be:
message ListOfString {
repeated string value = 1;
}
// Then define:
map<string, ListOfString> mapToRepeatedString = 1;
But ListOfString here looks redundant.
I had the same need, and got the same error. I do not believe this is possible. Here is the relevant BNF definitions from the language specification.
https://developers.google.com/protocol-buffers/docs/reference/proto3-spec
messageType = [ "." ] { ident "." } messageName
mapField = "map" "<" keyType "," type ">" mapName "=" fieldNumber [ "["fieldOptions "]" ] ";"
type = "double" | "float" | "int32" | "int64" | "uint32" | "uint64"
| "sint32" | "sint64" | "fixed32" | "fixed64" | "sfixed32" | "sfixed64"
| "bool" | "string" | "bytes" | messageType | enumType
messageName = ident
ident = letter { letter | decimalDigit | "_" }
field = [ "repeated" ] type fieldName "=" fieldNumber [ "[" fieldOptions "]" ] ";"
"repeated" keyword only appears in the field definition. The map definition requires a "type", which does not include the repeated keyword.
That means there are a few options.
You could create a wrapper around the repeated value as you indicated.
There is the older way people defined define maps, which is more burdensome but is equivalent. This is the backwards compatible example from the language guide.
https://developers.google.com/protocol-buffers/docs/proto3#maps
message MapFieldEntry {
key_type key = 1;
repeated value_type value = 2;
}
repeated MapFieldEntry map_field = N;
You would need to convert the data to a map yourself, but this should be fairly trivial in most languages. In Java:
List<MapFieldEntry> map_field = // Existing list from protobuf.
Map<key_type, List<value_type>> = map_field.stream()
.collect(Collectors.toMap(kv -> kv.key, kv -> kv.value));
Use google.protobuf.ListValue
https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue
This is an untyped list collection from their well known types.
I think it should be as follows.
message ListOfString {
repeated string what_ever_name = 1;
}
// Then define:
map<string, ListOfString> what_ever_name = 1;
Remember what_ever_name should be same in both places.
The Official documentation about map type says:
map<key_type, value_type> map_field = N;
...where the key_type can be any integral or string type (so, any
scalar type except for floating point types and bytes). The value_type
can be any type.
I want to define a map<string, repeated string> field, but it seems illegal on my libprotoc 3.0.0, which complains Expected ">". So I wonder if there is any way to put repeated string into map.
A Possible workaround could be:
message ListOfString {
repeated string value = 1;
}
// Then define:
map<string, ListOfString> mapToRepeatedString = 1;
But ListOfString here looks redundant.
I had the same need, and got the same error. I do not believe this is possible. Here is the relevant BNF definitions from the language specification.
https://developers.google.com/protocol-buffers/docs/reference/proto3-spec
messageType = [ "." ] { ident "." } messageName
mapField = "map" "<" keyType "," type ">" mapName "=" fieldNumber [ "["fieldOptions "]" ] ";"
type = "double" | "float" | "int32" | "int64" | "uint32" | "uint64"
| "sint32" | "sint64" | "fixed32" | "fixed64" | "sfixed32" | "sfixed64"
| "bool" | "string" | "bytes" | messageType | enumType
messageName = ident
ident = letter { letter | decimalDigit | "_" }
field = [ "repeated" ] type fieldName "=" fieldNumber [ "[" fieldOptions "]" ] ";"
"repeated" keyword only appears in the field definition. The map definition requires a "type", which does not include the repeated keyword.
That means there are a few options.
You could create a wrapper around the repeated value as you indicated.
There is the older way people defined define maps, which is more burdensome but is equivalent. This is the backwards compatible example from the language guide.
https://developers.google.com/protocol-buffers/docs/proto3#maps
message MapFieldEntry {
key_type key = 1;
repeated value_type value = 2;
}
repeated MapFieldEntry map_field = N;
You would need to convert the data to a map yourself, but this should be fairly trivial in most languages. In Java:
List<MapFieldEntry> map_field = // Existing list from protobuf.
Map<key_type, List<value_type>> = map_field.stream()
.collect(Collectors.toMap(kv -> kv.key, kv -> kv.value));
Use google.protobuf.ListValue
https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue
This is an untyped list collection from their well known types.
I think it should be as follows.
message ListOfString {
repeated string what_ever_name = 1;
}
// Then define:
map<string, ListOfString> what_ever_name = 1;
Remember what_ever_name should be same in both places.
Consider this short SmallC program:
#include "lib"
main() {
int bob;
}
My ANTLR grammar picks it up fine if I specify, in ANTLWorks and when using the Interpreter, line endings -> "Mac (CR)". If I set the line endings option to Unix (LF), the grammar throws a NoViableAltException and does not recognize anything after the end of the include statement. This error disappears if I add a newline at the end of include. The computer I'm using for this is a Mac, so I figured that it made sense to have to set the line endings to Mac format. So instead, I switch to a Linux box - and get the same thing. If I type anything in the ANTLRWorks Interpreter box, and if I don't select line endings Mac (CR), I get issues about insufficient blank lines as was the case above and, in addition, the last statement of each statement block requires an extra space following the semicolon (ie. after bob; above).
These bugs show up again when I run a Java version of my grammar on a code input file that I want to parse...
What could possibly be the issue? I'd understand if the issue was the presence of TOO many new lines, in a format that perhaps the parser didn't understand / weren't caught by my whitespace rule. But in this case, it's an issue of lacking new lines.
My white space declaration is as follows:
WS : ( '\t' | ' ' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
Alternatively, could this be due to an ambiguity issue?
Here is the full grammar file (feel free to ignore the first few blocks, which override ANTLR's default error handling mechanisms:
grammar SmallC;
options {
output = AST ; // Set output mode to AST
}
tokens {
DIV = '/' ;
MINUS = '-' ;
MOD = '%' ;
MULT = '*' ;
PLUS = '+' ;
RETURN = 'return' ;
WHILE = 'while' ;
// The following are empty tokens used in AST generation
ARGS ;
CHAR ;
DECLS ;
ELSE ;
EXPR ;
IF ;
INT ;
INCLUDES ;
MAIN ;
PROCEDURES ;
PROGRAM ;
RETURNTYPE ;
STMTS ;
TYPEIDENT ;
}
#members {
// Force error throwing, and make sure we don't try to recover from invalid input.
// The exceptions are handled in the FrontEnd class, and gracefully end the
// compilation routine after displaying an error message.
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MismatchedTokenException(ttype, input);
}
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)throws RecognitionException {
throw e;
}
protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MissingTokenException(ttype, input, null);
}
// We override getErrorMessage() to include information about the specific
// grammar rule in which the error happened, using a stack of nested rules.
Stack paraphrases = new Stack();
public String getErrorMessage(RecognitionException e, String[] tokenNames) {
String msg = super.getErrorMessage(e, tokenNames);
if ( paraphrases.size()>0 ) {
String paraphrase = (String)paraphrases.peek();
msg = msg+" "+paraphrase;
}
return msg;
}
// We override displayRecognitionError() to specify a clearer error message,
// and to include the error type (ie. class of the exception that was thrown)
// for the user's reference. The idea here is to come as close as possible
// to Java's exception output.
public void displayRecognitionError(String[] tokenNames, RecognitionException e)
{
String exType;
String hdr;
if (e instanceof UnwantedTokenException) {
exType = "UnwantedTokenException";
} else if (e instanceof MissingTokenException) {
exType = "MissingTokenException";
} else if (e instanceof MismatchedTokenException) {
exType = "MismatchedTokenException";
} else if (e instanceof MismatchedTreeNodeException) {
exType = "MismatchedTreeNodeException";
} else if (e instanceof NoViableAltException) {
exType = "NoViableAltException";
} else if (e instanceof EarlyExitException) {
exType = "EarlyExitException";
} else if (e instanceof MismatchedSetException) {
exType = "MismatchedSetException";
} else if (e instanceof MismatchedNotSetException) {
exType = "MismatchedNotSetException";
} else if (e instanceof FailedPredicateException) {
exType = "FailedPredicateException";
} else {
exType = "Unknown";
}
if ( getSourceName()!=null ) {
hdr = "Exception of type " + exType + " encountered in " + getSourceName() + " at line " + e.line + ", char " + e.charPositionInLine + ": ";
} else {
hdr = "Exception of type " + exType + " encountered at line " + e.line + ", char " + e.charPositionInLine + ": ";
}
String msg = getErrorMessage(e, tokenNames);
emitErrorMessage(hdr + msg + ".");
}
}
// Force the parser not to try to guess tokens and resume on faulty input,
// but rather display the error, and throw an exception for the program
// to quit gracefully.
#rulecatch {
catch (RecognitionException e) {
reportError(e);
throw e;
}
}
/*------------------------------------------------------------------
* PARSER RULES
*
* Many of these make use of ANTLR's rewrite rules to allow us to
* specify the roots of AST sub-trees, and to allow us to do away
* with certain insignificant literals (like parantheses and commas
* in lists) and to add empty tokens to disambiguate the tree
* construction
*
* The #init and #after definitions populate the paraphrase
* stack to allow us to specify which grammar rule we are in when
* errors are found.
*------------------------------------------------------------------*/
args
#init { paraphrases.push("in these procedure arguments"); }
#after { paraphrases.pop(); }
: ( typeident ( ',' typeident )* )? -> ^( ARGS ( typeident ( typeident )* )? )? ;
body
#init { paraphrases.push("in this procedure body"); }
#after { paraphrases.pop(); }
: '{'! decls stmtlist '}'! ;
decls
#init { paraphrases.push("in these declarations"); }
#after { paraphrases.pop(); }
: ( typeident ';' )* -> ^( DECLS ( typeident )* )? ;
exp
#init { paraphrases.push("in this expression"); }
#after { paraphrases.pop(); }
: lexp ( ( '>' | '<' | '>=' | '<=' | '!=' | '==' )^ lexp )? ;
factor : '(' lexp ')'
| ( MINUS )? ( IDENT | NUMBER )
| CHARACTER
| IDENT '(' ( IDENT ( ',' IDENT )* )? ')' ;
lexp : term ( ( PLUS | MINUS )^ term )* ;
includes
#init { paraphrases.push("in the include statements"); }
#after { paraphrases.pop(); }
: ( '#include' STRING )* -> ^( INCLUDES ( STRING )* )? ;
main
#init { paraphrases.push("in the main method"); }
#after { paraphrases.pop(); }
: 'main' '(' ')' body -> ^( MAIN body ) ;
procedure
#init { paraphrases.push("in this procedure"); }
#after { paraphrases.pop(); }
: ( proc_return_char | proc_return_int )? IDENT^ '('! args ')'! body ;
procedures : ( procedure )* -> ^( PROCEDURES ( procedure)* )? ;
proc_return_char
: 'char' -> ^( RETURNTYPE CHAR ) ;
proc_return_int : 'int' -> ^( RETURNTYPE INT ) ;
// We hard-code the regex (\n)* to fix a bug whereby a program would be accepted
// if it had 0 or more than 1 new lines before EOF but not if it had exactly 1,
// and not if it had 0 new lines between components of the following rule.
program : includes decls procedures main EOF ;
stmt
#init { paraphrases.push("in this statement"); }
#after { paraphrases.pop(); }
: '{'! stmtlist '}'!
| WHILE '(' exp ')' s=stmt -> ^( WHILE ^( EXPR exp ) $s )
| 'if' '(' exp ')' s=stmt ( options {greedy=true;} : 'else' s2=stmt )? -> ^( IF ^( EXPR exp ) $s ^( ELSE $s2 )? )
| IDENT '='^ lexp ';'!
| ( 'read' | 'output' | 'readc' | 'outputc' )^ '('! IDENT ')'! ';'!
| 'print'^ '('! STRING ( options {greedy=true;} : ')'! ';'! )
| RETURN ( lexp )? ';' -> ^( RETURN ( lexp )? )
| IDENT^ '('! ( IDENT ( ','! IDENT )* )? ')'! ';'!;
stmtlist : ( stmt )* -> ^( STMTS ( stmt )* )? ;
term : factor ( ( MULT | DIV | MOD )^ factor )* ;
// We divide typeident into two grammar rules depending on whether the
// ident is of type 'char' or 'int', to allow us to implement different
// rewrite rules in each case.
typeident : typeident_char | typeident_int ;
typeident_char : 'char' s2=IDENT -> ^( CHAR $s2 ) ;
typeident_int : 'int' s2=IDENT -> ^( INT $s2 ) ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// Must come before CHARACTER to avoid ambiguity ('i' matches both IDENT and CHARACTER)
IDENT : ( LCASE_ALPHA | UCASE_ALPHA | '_' ) ( LCASE_ALPHA | UCASE_ALPHA | DIGIT | '_' )* ;
CHARACTER : PRINTABLE_CHAR
| '\n' | '\t' | EOF ;
NUMBER : ( DIGIT )+ ;
STRING : '\"' ( ~( '"' | '\n' | '\r' | 't' ) )* '\"' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
fragment
DIGIT : '0'..'9' ;
fragment
LCASE_ALPHA : 'a'..'z' ;
fragment
NONALPHA_CHAR : '`' | '~' | '!' | '#' | '#' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-'
| '_' | '+' | '=' | '{' | '[' | '}' | ']' | '|' | '\\' | ';' | ':' | '\''
| '\\"' | '<' | ',' | '>' | '.' | '?' | '/' ;
fragment
PRINTABLE_CHAR : LCASE_ALPHA | UCASE_ALPHA | DIGIT | NONALPHA_CHAR ;
fragment
UCASE_ALPHA : 'A'..'Z' ;
From the command line, I do get a warning:
java -cp antlr-3.2.jar org.antlr.Tool SmallC.g
warning(200): SmallC.g:182:37: Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
but that won't stop the lexer/parser from being generated.
Anyway, the problem: ANTLR's lexer tries to match the first lexer rule it encounters in the file, and if it can't match said token, it trickles down to the next lexer rule. Now you have defined the CHARACTER rule before the WS rule, which both match the character \n. That is why it didn't work under Linux since the \n was tokenized as a CHARACTER. If you define the WS rule before the CHARACTER rule, it all works properly:
// other rules ...
WS
: ('\t' | ' ' | '\r' | '\n' | '\u000C')+ { $channel = HIDDEN; }
;
CHARACTER
: PRINTABLE_CHAR | '\n' | '\t' | EOF
;
// other rules ...
Running the test class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"#include \"lib\"\n" +
"main() {\n" +
" int bob;\n" +
"}\n";
ANTLRStringStream in = new ANTLRStringStream(source);
SmallCLexer lexer = new SmallCLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SmallCParser parser = new SmallCParser(tokens);
SmallCParser.program_return returnValue = parser.program();
CommonTree tree = (CommonTree)returnValue.getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
produces the following AST:
without any error messages.
But you should fix the grammar warning, and remove \n from the CHARACTER rule since it can never be matched in the CHARACTER rule.
One other thing: you've mixed quite a few keywords inside your parser rules without defining them in your lexer rules explicitly. That is tricky because of the first-come-first-serve lexer rules: you don't want 'if' to be accidentally being tokenized as an IDENT. Better do it like this:
IF : 'if';
IDENT : 'a'..'z' ... ; // After the `IF` rule!