ANTLR White Space Question (and not the typical one) - whitespace

Consider this short SmallC program:
#include "lib"
main() {
int bob;
}
My ANTLR grammar parses it fine if, in ANTLRWorks's Interpreter, I set the line endings option to "Mac (CR)". If I set it to "Unix (LF)", the grammar throws a NoViableAltException and does not recognize anything after the end of the include statement. The error disappears if I add a newline at the end of the include. The computer I'm using is a Mac, so I figured it made sense to have to select the Mac format. But when I switch to a Linux box, I get the same thing: whatever I type in the ANTLRWorks Interpreter box, unless I select "Mac (CR)" line endings I get the errors about insufficient blank lines described above and, in addition, the last statement of each statement block requires an extra space after the semicolon (i.e. after bob; above).
These bugs show up again when I run a Java version of my grammar on a code input file that I want to parse...
What could possibly be the issue? I'd understand if the problem were TOO many newlines, in a format that perhaps the parser didn't understand or that my whitespace rule didn't catch. But in this case, the problem is too few newlines.
My white space declaration is as follows:
WS : ( '\t' | ' ' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
Alternatively, could this be due to an ambiguity issue?
Here is the full grammar file (feel free to ignore the first few blocks, which override ANTLR's default error handling mechanisms):
grammar SmallC;
options {
output = AST ; // Set output mode to AST
}
tokens {
DIV = '/' ;
MINUS = '-' ;
MOD = '%' ;
MULT = '*' ;
PLUS = '+' ;
RETURN = 'return' ;
WHILE = 'while' ;
// The following are empty tokens used in AST generation
ARGS ;
CHAR ;
DECLS ;
ELSE ;
EXPR ;
IF ;
INT ;
INCLUDES ;
MAIN ;
PROCEDURES ;
PROGRAM ;
RETURNTYPE ;
STMTS ;
TYPEIDENT ;
}
@members {
// Force error throwing, and make sure we don't try to recover from invalid input.
// The exceptions are handled in the FrontEnd class, and gracefully end the
// compilation routine after displaying an error message.
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MismatchedTokenException(ttype, input);
}
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)throws RecognitionException {
throw e;
}
protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MissingTokenException(ttype, input, null);
}
// We override getErrorMessage() to include information about the specific
// grammar rule in which the error happened, using a stack of nested rules.
Stack paraphrases = new Stack();
public String getErrorMessage(RecognitionException e, String[] tokenNames) {
String msg = super.getErrorMessage(e, tokenNames);
if ( paraphrases.size()>0 ) {
String paraphrase = (String)paraphrases.peek();
msg = msg+" "+paraphrase;
}
return msg;
}
// We override displayRecognitionError() to specify a clearer error message,
// and to include the error type (ie. class of the exception that was thrown)
// for the user's reference. The idea here is to come as close as possible
// to Java's exception output.
public void displayRecognitionError(String[] tokenNames, RecognitionException e)
{
String exType;
String hdr;
if (e instanceof UnwantedTokenException) {
exType = "UnwantedTokenException";
} else if (e instanceof MissingTokenException) {
exType = "MissingTokenException";
} else if (e instanceof MismatchedTokenException) {
exType = "MismatchedTokenException";
} else if (e instanceof MismatchedTreeNodeException) {
exType = "MismatchedTreeNodeException";
} else if (e instanceof NoViableAltException) {
exType = "NoViableAltException";
} else if (e instanceof EarlyExitException) {
exType = "EarlyExitException";
} else if (e instanceof MismatchedSetException) {
exType = "MismatchedSetException";
} else if (e instanceof MismatchedNotSetException) {
exType = "MismatchedNotSetException";
} else if (e instanceof FailedPredicateException) {
exType = "FailedPredicateException";
} else {
exType = "Unknown";
}
if ( getSourceName()!=null ) {
hdr = "Exception of type " + exType + " encountered in " + getSourceName() + " at line " + e.line + ", char " + e.charPositionInLine + ": ";
} else {
hdr = "Exception of type " + exType + " encountered at line " + e.line + ", char " + e.charPositionInLine + ": ";
}
String msg = getErrorMessage(e, tokenNames);
emitErrorMessage(hdr + msg + ".");
}
}
// Force the parser not to try to guess tokens and resume on faulty input,
// but rather display the error, and throw an exception for the program
// to quit gracefully.
@rulecatch {
catch (RecognitionException e) {
reportError(e);
throw e;
}
}
/*------------------------------------------------------------------
* PARSER RULES
*
* Many of these make use of ANTLR's rewrite rules to allow us to
* specify the roots of AST sub-trees, and to allow us to do away
* with certain insignificant literals (like parentheses and commas
* in lists) and to add empty tokens to disambiguate the tree
* construction
*
* The @init and @after definitions populate the paraphrase
* stack to allow us to specify which grammar rule we are in when
* errors are found.
*------------------------------------------------------------------*/
args
@init { paraphrases.push("in these procedure arguments"); }
@after { paraphrases.pop(); }
: ( typeident ( ',' typeident )* )? -> ^( ARGS ( typeident ( typeident )* )? )? ;
body
@init { paraphrases.push("in this procedure body"); }
@after { paraphrases.pop(); }
: '{'! decls stmtlist '}'! ;
decls
@init { paraphrases.push("in these declarations"); }
@after { paraphrases.pop(); }
: ( typeident ';' )* -> ^( DECLS ( typeident )* )? ;
exp
@init { paraphrases.push("in this expression"); }
@after { paraphrases.pop(); }
: lexp ( ( '>' | '<' | '>=' | '<=' | '!=' | '==' )^ lexp )? ;
factor : '(' lexp ')'
| ( MINUS )? ( IDENT | NUMBER )
| CHARACTER
| IDENT '(' ( IDENT ( ',' IDENT )* )? ')' ;
lexp : term ( ( PLUS | MINUS )^ term )* ;
includes
@init { paraphrases.push("in the include statements"); }
@after { paraphrases.pop(); }
: ( '#include' STRING )* -> ^( INCLUDES ( STRING )* )? ;
main
@init { paraphrases.push("in the main method"); }
@after { paraphrases.pop(); }
: 'main' '(' ')' body -> ^( MAIN body ) ;
procedure
@init { paraphrases.push("in this procedure"); }
@after { paraphrases.pop(); }
: ( proc_return_char | proc_return_int )? IDENT^ '('! args ')'! body ;
procedures : ( procedure )* -> ^( PROCEDURES ( procedure)* )? ;
proc_return_char
: 'char' -> ^( RETURNTYPE CHAR ) ;
proc_return_int : 'int' -> ^( RETURNTYPE INT ) ;
// We hard-code the regex (\n)* to fix a bug whereby a program would be accepted
// if it had 0 or more than 1 new lines before EOF but not if it had exactly 1,
// and not if it had 0 new lines between components of the following rule.
program : includes decls procedures main EOF ;
stmt
@init { paraphrases.push("in this statement"); }
@after { paraphrases.pop(); }
: '{'! stmtlist '}'!
| WHILE '(' exp ')' s=stmt -> ^( WHILE ^( EXPR exp ) $s )
| 'if' '(' exp ')' s=stmt ( options {greedy=true;} : 'else' s2=stmt )? -> ^( IF ^( EXPR exp ) $s ^( ELSE $s2 )? )
| IDENT '='^ lexp ';'!
| ( 'read' | 'output' | 'readc' | 'outputc' )^ '('! IDENT ')'! ';'!
| 'print'^ '('! STRING ( options {greedy=true;} : ')'! ';'! )
| RETURN ( lexp )? ';' -> ^( RETURN ( lexp )? )
| IDENT^ '('! ( IDENT ( ','! IDENT )* )? ')'! ';'!;
stmtlist : ( stmt )* -> ^( STMTS ( stmt )* )? ;
term : factor ( ( MULT | DIV | MOD )^ factor )* ;
// We divide typeident into two grammar rules depending on whether the
// ident is of type 'char' or 'int', to allow us to implement different
// rewrite rules in each case.
typeident : typeident_char | typeident_int ;
typeident_char : 'char' s2=IDENT -> ^( CHAR $s2 ) ;
typeident_int : 'int' s2=IDENT -> ^( INT $s2 ) ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// Must come before CHARACTER to avoid ambiguity ('i' matches both IDENT and CHARACTER)
IDENT : ( LCASE_ALPHA | UCASE_ALPHA | '_' ) ( LCASE_ALPHA | UCASE_ALPHA | DIGIT | '_' )* ;
CHARACTER : PRINTABLE_CHAR
| '\n' | '\t' | EOF ;
NUMBER : ( DIGIT )+ ;
STRING : '\"' ( ~( '"' | '\n' | '\r' | 't' ) )* '\"' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
fragment
DIGIT : '0'..'9' ;
fragment
LCASE_ALPHA : 'a'..'z' ;
fragment
NONALPHA_CHAR : '`' | '~' | '!' | '@' | '#' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-'
| '_' | '+' | '=' | '{' | '[' | '}' | ']' | '|' | '\\' | ';' | ':' | '\''
| '\\"' | '<' | ',' | '>' | '.' | '?' | '/' ;
fragment
PRINTABLE_CHAR : LCASE_ALPHA | UCASE_ALPHA | DIGIT | NONALPHA_CHAR ;
fragment
UCASE_ALPHA : 'A'..'Z' ;

From the command line, I do get a warning:
java -cp antlr-3.2.jar org.antlr.Tool SmallC.g
warning(200): SmallC.g:182:37: Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
but that won't stop the lexer/parser from being generated.
Anyway, the problem: when more than one lexer rule can match the same input, ANTLR's lexer gives precedence to the rule that is defined first in the grammar. You have defined the CHARACTER rule before the WS rule, and both match the character \n. That is why it didn't work with Unix (LF) line endings: the \n was tokenized as a CHARACTER instead of being hidden as whitespace. If you define the WS rule before the CHARACTER rule, it all works properly:
// other rules ...
WS
: ('\t' | ' ' | '\r' | '\n' | '\u000C')+ { $channel = HIDDEN; }
;
CHARACTER
: PRINTABLE_CHAR | '\n' | '\t' | EOF
;
// other rules ...
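To make the effect of rule order concrete, here is a toy tokenizer in Python. It is only a sketch of "longest match wins, ties go to the rule listed first" and not ANTLR's actual lexer algorithm; the function `tokenize` and the regular expressions standing in for the grammar's rules are assumptions made for illustration.

```python
import re

def tokenize(text, rules):
    """Toy lexer: longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Strict '>' means a later rule must match MORE text to take over.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Simplified, hypothetical stand-ins for the grammar's lexer rules.
STRING = ("STRING", r'"[^"\n\r]*"')
IDENT = ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*")
CHARACTER = ("CHARACTER", r"[\n\t]")
WS = ("WS", r"[ \t\r\n]+")

character_first = [STRING, IDENT, CHARACTER, WS]  # order in the question
ws_first = [STRING, IDENT, WS, CHARACTER]         # order suggested here

source = '"lib"\nmain'
print(tokenize(source, character_first))
# [('STRING', '"lib"'), ('CHARACTER', '\n'), ('IDENT', 'main')]
print(tokenize(source, ws_first))
# [('STRING', '"lib"'), ('WS', '\n'), ('IDENT', 'main')]
```

In this toy model a lone \n is a tie between CHARACTER and WS, so the rule listed first decides. A second newline or a trailing space makes the WS match longer, which is at least consistent with the report that adding a newline or a space made the error go away.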
Running the test class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source =
            "#include \"lib\"\n" +
            "main() {\n" +
            " int bob;\n" +
            "}\n";
        ANTLRStringStream in = new ANTLRStringStream(source);
        SmallCLexer lexer = new SmallCLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        SmallCParser parser = new SmallCParser(tokens);
        SmallCParser.program_return returnValue = parser.program();
        CommonTree tree = (CommonTree)returnValue.getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}
produces an AST (the image of the tree is not preserved here) without any error messages.
But you should fix the grammar warning, and remove \n from the CHARACTER rule since it can never be matched in the CHARACTER rule.
One other thing: you've used quite a few keywords inside your parser rules without defining them explicitly as lexer rules. That is tricky because lexer rules are matched first-come, first-served: you don't want 'if' to be accidentally tokenized as an IDENT. Better do it like this:
IF : 'if';
IDENT : 'a'..'z' ... ; // After the `IF` rule!

Related

Where is the canonical specification for proto3 that allows JavaScript-like object assignment to an option?

In the Protocol Buffers Version 3 Language Specification, the EBNF syntax for an option is:
option = "option" optionName "=" constant ";"
optionName = ( ident | "(" fullIdent ")" ) { "." ident }
constant = fullIdent | ( [ "-" | "+" ] intLit ) | ( [ "-" | "+" ] floatLit ) | strLit | boolLit
ident = letter { letter | decimalDigit | "_" }
fullIdent = ident { "." ident }
strLit = ( "'" { charValue } "'" ) | ( '"' { charValue } '"' )
charValue = hexEscape | octEscape | charEscape | /[^\0\n\\]/
hexEscape = '\' ( "x" | "X" ) hexDigit hexDigit
octEscape = '\' octalDigit octalDigit octalDigit
charEscape = '\' ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | '\' | "'" | '"' )
Or in plain English, an option may be assigned a dotted.notation.identifier, an integer, a float, a boolean, or a single- or double-quoted string, which MUST NOT have "raw" newline characters.
And yet, I'm encountering .proto files in various projects such as grpc-gateway and googleapis, where the rhs of the assignment is not quoted and spans multiple lines. For example in googleapis/google/api/http.proto there is this service definition in a comment block:
// service Messaging {
// rpc UpdateMessage(Message) returns (Message) {
// option (google.api.http) = {
// patch: "/v1/messages/{message_id}"
// body: "*"
// };
// }
// }
In other files, the use of semicolons (and occasionally commas) as separators seems somewhat arbitrary, and I have also seen keys repeated, which in JSON or JavaScript would result in loss of data due to overwriting.
Are there any canonical extensions to the language specification, or are people just Microsofting? (Yes, that's a verb now.)
I posted a similar question on the Protocol Buffers Google Group, and received a private message from a fellow at Google stating the following
This syntax is correct and valid for setting fields on a proto option field which is itself a field referencing a message type. This form is based on the TextFormat spec which I'm unclear if its super well documented, but here's an implementation of it: https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.text_format
When I have time, I will try to unpack what I learn from analyzing TextFormat.
Update:
I received an answer on the Groups forum
I think for better or worse, "what protoc implements" takes precedence over whatever the spec says. The spec came later and as far as I know we have not put a lot of effort into ensuring that it comprehensively matches the format that protoc expects. I believe the syntax you are looking at is missing from the .proto file format spec but is mentioned here as the "aggregate syntax."
The link above is to a section titled Custom Options in the Language Guide (proto2) page. If you scroll all the way to the end of that section, there is the following snippet that mentions TextFormat:
message FooOptions {
optional int32 opt1 = 1;
optional string opt2 = 2;
}
extend google.protobuf.FieldOptions {
optional FooOptions foo_options = 1234;
}
// usage:
message Bar {
optional int32 a = 1 [(foo_options).opt1 = 123, (foo_options).opt2 = "baz"];
// alternative aggregate syntax (uses TextFormat):
optional int32 b = 2 [(foo_options) = { opt1: 123 opt2: "baz" }];
}
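To see how the aggregate syntax differs from a JavaScript object, here is a small Python sketch that reads a `{ key: value ... }` block into an ordered list of pairs. It is a deliberately reduced subset written for illustration, not protoc's TextFormat parser: the names `tokenize`, `parse_message`, and `parse_aggregate` are hypothetical, only strings, integers, bare names, and nested blocks are handled, and the treatment of `,` and `;` as optional separators and of repeated keys as separate entries is an assumption based on the files described in the question.

```python
import re

# Token kinds: quoted string, integer, bare name, or one punctuation character.
TOKEN = re.compile(
    r'\s*(?:(?P<string>"(?:[^"\\\n]|\\.)*")'
    r'|(?P<number>-?\d+)'
    r'|(?P<name>[A-Za-z_]\w*)'
    r'|(?P<punct>[{}:;,]))'
)

def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    tokens.append(("eof", ""))  # sentinel so lookahead never runs off the end
    return tokens

def parse_message(tokens, i):
    """Parse '{ field* }' starting at tokens[i]; return (pairs, next_index)."""
    if tokens[i] != ("punct", "{"):
        raise ValueError("expected '{'")
    i += 1
    pairs = []
    while tokens[i] != ("punct", "}"):
        kind, key = tokens[i]
        if kind != "name":
            raise ValueError(f"expected a field name, got {key!r}")
        i += 1
        if tokens[i] == ("punct", ":"):
            i += 1
        if tokens[i] == ("punct", "{"):
            value, i = parse_message(tokens, i)  # nested message
        else:
            kind, raw = tokens[i]
            if kind == "string":
                value = raw[1:-1]  # escape sequences are left undecoded
            elif kind == "number":
                value = int(raw)
            elif kind == "name":
                value = raw  # e.g. an enum name, true, or false
            else:
                raise ValueError(f"expected a value, got {raw!r}")
            i += 1
        pairs.append((key, value))  # a list, so repeated keys are all kept
        if tokens[i] in (("punct", ";"), ("punct", ",")):
            i += 1  # separators are treated as optional
    return pairs, i + 1

def parse_aggregate(text):
    tokens = tokenize(text)
    pairs, end = parse_message(tokens, 0)
    if tokens[end] != ("eof", ""):
        raise ValueError("trailing input after '}'")
    return pairs

print(parse_aggregate('{ opt1: 123 opt2: "baz" }'))
# [('opt1', 123), ('opt2', 'baz')]
```

Keeping the result as a list of pairs rather than a dictionary is the point of the sketch: a repeated key adds another entry instead of overwriting the previous one, which is how the repeated keys in those files can carry data.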

how to build a parser with antlr4 target golang without visitors and walkers

I am trying to write a small parser using ANTLR4's Go target without visitors or walkers, but I cannot find any sample code to build my parser upon.
For example, the following is the grammar code which I am trying to replicate with golang:
# Expr.g4:
grammar Expr;
@header {
}
@parser::members {
def eval(self, left, op, right):
    if ExprParser.MUL == op.type:
        return left * right
    elif ExprParser.DIV == op.type:
        return left / right
    elif ExprParser.ADD == op.type:
        return left + right
    elif ExprParser.SUB == op.type:
        return left - right
    else:
        return 0
}
stat: e NEWLINE {print($e.v);}
| ID '=' e NEWLINE {self.memory[$ID.text] = $e.v}
| NEWLINE
;
e returns [int v]
: a=e op=('*'|'/') b=e {$v = self.eval($a.v, $op, $b.v)}
| a=e op=('+'|'-') b=e {$v = self.eval($a.v, $op, $b.v)}
| INT {$v = $INT.int}
| ID
{
id = $ID.text
$v = self.memory.get(id, 0)
}
| '(' e ')' {$v = $e.v}
;
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
ID : [a-zA-Z]+ ; // match identifiers
INT : [0-9]+ ; // match integers
NEWLINE:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
And this is the python tester code for it:
# test_expr.py:
import sys
from antlr4 import *
from antlr4.InputStream import InputStream
from ExprLexer import ExprLexer
from ExprParser import ExprParser
if __name__ == '__main__':
    parser = ExprParser(None)
    parser.buildParseTrees = False
    parser.memory = {} # how to add this to generated constructor?
    line = sys.stdin.readline()
    lineno = 1
    while line != '':
        line = line.strip()
        istream = InputStream(line + "\n")
        lexer = ExprLexer(istream)
        lexer.line = lineno
        lexer.column = 0
        token_stream = CommonTokenStream(lexer)
        parser.setInputStream(token_stream)
        parser.stat()
        line = sys.stdin.readline()
        lineno += 1
Can anybody please post a sample golang code which is equivalent to the above python and inlined code?

Flex & Bison: Printing a parse tree

Basically, I have an assignment where I need to make a compiler for C-, but we're doing it in 5 steps. One of the steps was to turn the BNF grammar to bison and then print a tree with what has been compiled. Let me explain:
BNF Grammar
1. program → declaration-list
2. declaration-list → declaration-list declaration | declaration
3. declaration → var-declaration | fun-declaration
4. var-declaration → type-specifier ID ; | type-specifier ID [ NUM ] ;
5. type-specifier → int | void
6. fun-declaration → type-specifier ID ( params ) compound-stmt
7. params → param-list | void
8. param-list → param-list , param | param
9. param → type-specifier ID | type-specifier ID [ ]
10. compound-stmt → { local-declarations statement-list }
11. local-declarations → local-declarations var-declaration | empty
12. statement-list → statement-list statement | empty
13. statement → expression-stmt | compound-stmt | selection-stmt | iteration-stmt | return-stmt
14. expression-stmt → expression ; | ;
15. selection-stmt → if ( expression ) statement | if ( expression ) statement else statement
16. iteration-stmt → while ( expression ) statement
17. return-stmt → return ; | return expression ;
18. expression → var = expression | simple-expression
19. var → ID | ID [ expression ]
20. simple-expression → additive-expression relop additive-expression | additive-expression
21. relop → <= | < | > | >= | == | !=
22. additive-expression → additive-expression addop term | term
23. addop → + | -
24. term → term mulop factor | factor
25. mulop → * | /
26. factor → ( expression ) | var | call | NUM
27. call → ID ( args )
28. args → arg-list | empty
29. arg-list → arg-list , expression | expression
File: Project.fl
%option noyywrap
%{
/* Definitions and statements */
#include <stdio.h>
#include "project.tab.h"
int nlines = 1;
char filename[50];
%}
ID {letter}{letter}*
NUM {digit}{digit}*
letter [a-zA-Z]
digit [0-9]
%%
"if" { return T_IF; }
"else" { return T_ELSE; }
"int" { return T_INT; }
"return" { return T_RETURN; }
"void" { return T_VOID; }
"while" { return T_WHILE; }
"+" { return yytext[0]; }
"-" { return yytext[0]; }
"*" { return yytext[0]; }
"/" { return yytext[0]; }
">" { return T_GREAT; }
">=" { return T_GREATEQ; }
"<" { return T_SMALL; }
"<=" { return T_SMALLEQ; }
"==" { return T_COMPARE; }
"!=" { return T_NOTEQ; }
"=" { return yytext[0]; }
";" { return yytext[0]; }
"," { return yytext[0]; }
"(" { return yytext[0]; }
")" { return yytext[0]; }
"[" { return yytext[0]; }
"]" { return yytext[0]; }
"{" { return yytext[0]; }
"}" { return yytext[0]; }
(\/\*(ID)\*\/) { return T_COMM; }
{ID} { return T_ID; }
{NUM} { return T_NUM; }
\n { ++nlines; }
%%
File: project.y
%{
#include <stdio.h>
#include <stdlib.h>
extern int yylex();
extern int yyparse();
void yyerror(const char* s);
%}
%token T_IF T_ELSE T_INT T_RETURN T_VOID T_WHILE
T_GREAT T_GREATEQ T_SMALL T_SMALLEQ T_COMPARE T_NOTEQ
T_COMM T_ID T_NUM
%%
program: declaration-list { printf("program"); }
;
declaration-list: declaration-list declaration
| declaration
;
declaration: var-declaration
| fun-declaration
;
var-declaration: type-specifier T_ID ';'
| type-specifier T_ID'['T_NUM']' ';'
;
type-specifier: T_INT
| T_VOID
;
fun-declaration: type-specifier T_ID '('params')' compound-stmt
;
params: param-list
| T_VOID
;
param-list: param-list',' param
| param
;
param: type-specifier T_ID
| type-specifier T_ID'['']'
;
compound-stmt: '{' local-declarations statement-list '}'
;
local-declarations: local-declarations var-declaration
|
;
statement-list: statement-list statement
|
;
statement: expression-stmt
| compound-stmt
| selection-stmt
| iteration-stmt
| return-stmt
;
expression-stmt: expression ';'
| ';'
;
selection-stmt: T_IF '('expression')' statement
| T_IF '('expression')' statement T_ELSE statement
;
iteration-stmt: T_WHILE '('expression')' statement
;
return-stmt: T_RETURN ';'
| T_RETURN expression ';'
;
expression: var '=' expression
| simple-expression
;
var: T_ID { printf("\nterm\nfactor_var\nvar(x)"); }
| T_ID '['expression']'
;
simple-expression: additive-expression relop additive-expression
| additive-expression
;
relop: T_SMALLEQ
| T_SMALL
| T_GREAT
| T_GREATEQ
| T_COMPARE
| T_NOTEQ
;
additive-expression: additive-expression addop term
| term
;
addop: '+' { printf("\naddop(+)"); }
| '-' { printf("\naddop(-)"); }
;
term: term mulop factor
| factor
;
mulop: '*' { printf("\nmulop(*)"); }
| '/' { printf("\nmulop(/)"); }
;
factor: '('expression')' { printf("\nfactor1"); }
| var
| call
| T_NUM { printf("\nterm\nfactor(5)"); }
;
call: T_ID '('args')' { printf("\ncall(input)"); }
;
args: arg-list
| { printf("\nargs(empty)"); }
;
arg-list: arg-list',' expression
| expression
;
%%
int main(void) {
return yyparse();
}
void yyerror(const char* s) {
fprintf(stderr, "Parse error: %s\n", s);
exit(1);
}
And finally the tree that's asked to be replicated:
program
declaration_list
declaration
fun_definition(VOID-main)
params_VOID-compound
params(VOID)
compound_stmt
local_declarations
local_declarations
local_declarations(empty)
var_declaration(x)
type_specifier(INT)
var_declaration(y)
type_specifier(INT)
statement_list
statement_list
statement_list(empty)
statement
expression_stmt
expression
var(x)
expression
simple_expression
additive_expression
term
factor
call(input)
args(empty)
statement
expression_stmt
expression
var(y)
expression
simple_expression
additive_expression(ADDOP)
additive_expression
term
factor_var
var(x)
addop(+)
term
factor(5)
Sample code on which the tree is based:
/* A program */
void main(void)
{
int x; int y;
x = input();
y = x + 5;
}
I've translated the BNF grammar into the actual .y file, but I'm having trouble working out where exactly the print statements should go. As far as I can tell, a rule has to finish matching and only THEN does its message print.
The desired output you present is the result of a pre-order walk of the parse tree.
However, bison generates a bottom-up parser, which performs semantic actions for a node in the parse tree when the node's subtree is complete. Printing the node in the semantic action therefore produces a post-order walk. I suppose that is what you mean by your last sentence.
While there are a variety of possible solutions, the simplest one is probably to construct a parse tree during the parse and then print it out at the end of the parse. (You could print the tree in the semantic action for the start production, but that will sometimes result in a parse tree being printed for an erroneous input. Better is to return the root of the parse tree and print it from the main program after verifying that the parse was successful.)
I don't know where "construct a parse tree" fits in the expected progression of your project. Parse trees are of little use in most applications. Much more common is the construction of an abstract syntax tree (AST) which omits many of the irrelevant details from the parse (such as unit productions). You can construct an AST from a parse tree, but it is generally simpler to construct it directly in the parse actions: the code looks very similar but there is less of it precisely because tree nodes don't have to be built for unit productions.
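The difference between the two orders can be sketched in a few lines of Python. This is a hand simulation under stated assumptions, not bison output: `Node`, `reduce_action`, and `preorder` are hypothetical names, and the reductions for `x + 5` are written out by hand in the order a bottom-up parser would perform them, using the labels from the expected tree.

```python
class Node:
    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)

reductions = []  # labels in the order the simulated semantic actions run

def reduce_action(label, *children):
    """Stand-in for a bison action: it runs only after its children exist."""
    reductions.append(label)
    return Node(label, *children)

def preorder(node, depth=0):
    """Return the tree as indented lines, parent before children."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(preorder(child, depth + 1))
    return lines

# Hand-simulated bottom-up reductions for the expression `x + 5`.
var = reduce_action("var(x)")
left_factor = reduce_action("factor_var", var)
left_term = reduce_action("term", left_factor)
left = reduce_action("additive_expression", left_term)
addop = reduce_action("addop(+)")
factor = reduce_action("factor(5)")
right_term = reduce_action("term", factor)
root = reduce_action("additive_expression(ADDOP)", left, addop, right_term)

print(reductions)                  # post-order: leaves first, root last
print("\n".join(preorder(root)))   # pre-order: root first, as in the expected tree
```

Printing inside `reduce_action` would give the first listing, which starts at `var(x)` and ends at `additive_expression(ADDOP)`. Building the nodes and walking them afterwards gives the second, which matches the order of the last eight lines of the tree in the question.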

Convert my .g4 to .flex and .bnf for IDEA syntax highlighter

I have never used JFlex before, and I have no idea how it works. Basically, I've built a runtime for a scheme-esque language in Java, and the parser I have for it was generated using Antlr 4, so I have a .g4 file, which looks like this:
program: expression EOF;
expression: lambdaAbstraction | application| literal;
lambdaAbstraction: LP WS? LAMBDA (WS IDENTIFIER)* WS expression WS? RP;
application: LP WS? expression (WS expression)* WS? RP;
literal: FLOAT_LITERAL | INTEGER_LITERAL | IDENTIFIER | STRING_LITERAL;
STRING_LITERAL: '"' ~["]* '"' | '\'' ~[']* '\'' ;
FLOAT_LITERAL: [0-9]+ '.' [0-9]+;
INTEGER_LITERAL: [0-9]+;
IDENTIFIER: [a-zA-Z+/*-=?#]+;
LAMBDA: '\\';
LP: '(';
RP: ')';
WS: [ \n\r\t]+;
Some examples (so you can get a feel for the grammar).
(+ 1 2) = 3
((\ a b (+ a b)) 1 2) = 3
(+ "Hello, " "World") = "Hello, World"
What I'm trying to do is build a syntax-highlighter plugin for IntelliJ-IDEA for this language using this guide: http://www.jetbrains.org/intellij/sdk/docs/tutorials/custom_language_support_tutorial.html.
Here is my .flex file
package com.michaelsnowden.yalcil_plugin;
import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;
import com.michaelsnowden.yalcil_plugin.psi.YalcilTypes;
import com.intellij.psi.TokenType;
%%
%class YalcilLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType
%eof{ return;
%eof}
STRING_LITERAL= "\"" [^\"&]* "\"";
FLOAT_LITERAL=[0-9]+ '.' [0-9]+;
INTEGER_LITERAL=[0-9]+;
IDENTIFIER=[a-zA-Z+/*-=?#]+;
LAMBDA="\\";
LP="(";
RP=")";
WS=[ \n\r\t]+;
%%
<YYINITIAL> {
{LAMBDA} { yybegin(YYINITIAL); return YalcilTypes.LAMBDA; }
{FLOAT_LITERAL} { yybegin(YYINITIAL); return YalcilTypes.LAMBDA; }
{INTEGER_LITERAL} { yybegin(YYINITIAL); return YalcilTypes.INTEGER_LITERAL; }
{IDENTIFIER} { yybegin(YYINITIAL); return YalcilTypes.IDENTIFIER; }
{LP} { yybegin(YYINITIAL); return YalcilTypes.LP; }
{RP} { yybegin(YYINITIAL); return YalcilTypes.RP; }
{WS} { yybegin(YYINITIAL); return YalcilTypes.WS; }
{STRING_LITERAL} { yybegin(YYINITIAL); return YalcilTypes.STRING_LITERAL; }
}
. { return TokenType.BAD_CHARACTER; }
And here is my .bnf file
{
parserClass="com.michaelsnowden.yalcil_plugin.YalcilParser"
extends="com.intellij.extapi.psi.ASTWrapperPsiElement"
psiClassPrefix="Yalcil"
psiImplClassSuffix="Impl"
psiPackage="com.michaelsnowden.yalcil_plugin.psi"
psiImplPackage="com.michaelsnowden.yalcil_plugin.psi.impl"
elementTypeHolderClass="com.michaelsnowden.yalcil_plugin.psi.YalcilTypes"
elementTypeClass="com.michaelsnowden.yalcil_plugin.psi.YalcilElementType"
tokenTypeClass="com.michaelsnowden.yalcil_plugin.psi.YalcilTokenType"
}
program ::= expression;
expression ::= lambdaAbstraction | application| literal;
lambdaAbstraction ::= LP WS? LAMBDA (WS IDENTIFIER)* WS expression WS? RP;
application ::= LP WS? expression (WS expression)* WS? RP;
literal ::= FLOAT_LITERAL | INTEGER_LITERAL | IDENTIFIER | STRING_LITERAL;
I get basically nothing but errors when I run this.
For some reason, only strings are highlighted.
The PsiViewer looks like this (screenshot not preserved):
Does anyone see any explanation for why this isn't working?

Antlr parsing conflict with tokens and identifiers

I am attempting to parse freeform strings (ANTLR 3) and am running into string/token issues. Googling has not helped so far. Here is my grammar:
grammar TestHeader;
options {
language = Java;
output = AST;
}
tokens {
STOKEN_1 = 'SENDTO';
STOKEN_2 = 'SRCSYS';
STOKEN_3 = 'SOFTERROR';
}
@lexer::header {
package com.sample.parser;
}
@parser::header {
package com.sample.parser;
}
fragment TAG : /* empty rule: only used to change the 'type' */;
fragment CR: '\r';
fragment LF: '\n';
fragment SLASH: '/';
fragment HASH: '#';
fragment DIGIT: '0'..'9';
fragment SEMI_COLON: ';';
COLON: ':';
COMMA: ',';
DOT: '.';
LETTER: 'A'..'Z' | 'a'..'z';
HYPHEN: '-';
BRANCH_CHARS: '$' | '@' | HASH;
SPACE: (' ');
EOL: (CR? LF) | CR | SEMI_COLON;
DATE: ('0'..'1')? DIGIT SLASH ('0'..'3')? DIGIT SLASH DIGIT DIGIT;
TIME: ('0'..'2')? DIGIT COLON ('0'..'5')? DIGIT (COLON '0'..'5' DIGIT)?;
DECIMAL
:
DIGIT* ('.' DIGIT*)
{
if ($text.contains("/")) {
$type = DATE;
} else if ($text.contains(":")) {
$type = TIME;
}
}
;
NUMBER
:
'0'..'9' DIGIT*
{
if ($text.contains("/")) {
$type = DATE;
} else if ($text.contains(":")) {
$type = TIME;
}
}
;
SINGLELINE_COMMENT
:
'//-' ~('\r' | '\n')*
;
header
:
(word SPACE+ sequenceNumber SPACE+ type) EOL+
(SPACE* trailer) EOL+
;
word
:
(data += ~(SPACE))+
;
sequenceNumber
:
nn=NUMBER
;
type
:
LETTER
;
trailer
:
// (tt += ~(EOL ))*
SINGLELINE_COMMENT
;
With a test input
SY 260 O
//-$CJ******1
I get the following:
0 null
-- 16 S
-- 16 Y
-- 22
-- 18 260
-- 22
-- 16 O
-- 13
-- 20 //-$CJ******1
-- 13
However with the following input
SR 260 O
//-$CJ******1
I get
line 1:2 mismatched character ' ' expecting 'C'
line 1:7 mismatched input 'O' expecting NUMBER
0 260 O
//-$CJ******1
Where am I going wrong? Any tips or help to get past this issue?
