Strange behaviour for comments in Antlr4 grammar - comments

Adding a comment line under ID is fine; however, adding one under WS causes an error to be raised. The entire file Hello.g4 is listed below.
/**
* Define a grammar called Hello
*/
grammar Hello;
r : 'hello' ID ; // match keyword hello followed by an identifier
ID : [a-z]+ ; // match lower-case identifiers
/**********************************************************************************************/
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
/**********************************************************************************************/
The output I get in the console is as follows:
ANTLR Tool v4.4 (/tmp/antlr-4.4-complete.jar)
Hello.g4 -o /home/me/workspace/TestComment/target/generated-sources/antlr4 -listener -no-visitor -encoding UTF-8
error(50): Hello.g4:13:0: syntax error: '<EOF>' came as a complete surprise to me
1 error(s)
BUILD FAIL
Total time: 168 millisecond(s)
Running Eclipse Version: Neon.3 Release (4.6.3), Default ANTLR4 project.
Why should ANTLR4 care about a trailing comment line?

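The ANTLR 4 grammar syntax treats comments that open with /** as JavaDoc comments, which are optionally allowed as a header and before each rule. Your separator lines start with /**, so they count as JavaDoc comments. No rule follows the last 'comment line', so it is interpreted as an invalid beginning of a rule.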
Change your comment line to /*----*/ to avoid the error.
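To check a grammar file for this situation, a rough Python sketch like the one below can flag /** ... */ comments that have no rule text after them. The function name and the regular expressions are assumptions made for illustration; this only approximates what the ANTLR 4.4 tool does and ignores comment-like text inside string literals.

```python
import re

# Matches any /* ... */ block comment, including /** ... */ doc comments.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT = re.compile(r"//[^\n]*")


def dangling_doc_comments(grammar_text):
    """Return /** ... */ comments that are not followed by any grammar text.

    Hypothetical helper: a rough approximation, not ANTLR's own lexer.
    """
    dangling = []
    for match in BLOCK_COMMENT.finditer(grammar_text):
        comment = match.group(0)
        if not comment.startswith("/**"):
            continue  # plain block comment, allowed anywhere
        rest = grammar_text[match.end():]
        # Strip the remaining comments; anything left over is rule text.
        rest = BLOCK_COMMENT.sub("", rest)
        rest = LINE_COMMENT.sub("", rest)
        if not rest.strip():
            dangling.append(comment)
    return dangling
```

Run against the contents of Hello.g4, this should report only the final separator line; after switching the separators to the /*----*/ form it should report nothing.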

JISON: How do I avoid "dog" being parsed as "do"?

I have the following JISON file (lite version of my actual file, but reproduces my problem):
%lex
%%
"do" return 'DO';
[a-zA-Z_][a-zA-Z0-9_]* return 'ID';
"::" return 'DOUBLECOLON'
<<EOF>> return 'ENDOFFILE';
/lex
%%
start
: ID DOUBLECOLON ID ENDOFFILE
{$$ = {type: "enumval", enum: $1, val: $3}}
;
It is for parsing something like "AnimalTypes::cat". It works fine for things like "AnimalTypes::cat", but when it sees dog instead of cat, it assumes it's a DO instead of an ID. I can see why it does that, but how do I get around it? I've been looking at other JISON documents, but can't seem to spot the difference that (I assume) makes those work.
This is the error I get:
JisonParserError: Parse error on line 1:
PetTypes::dog
----------^
Expecting "ID", "enumstr", "id", got unexpected "DO"
Repro steps:
Install jison-gho globally from npm (or modify code to use local version). I use Node v14.6.0.
Save the JISON above as minimal-repro.jison
Run: jison -m es -o ./minimal.mjs ./minimal-repro.jison to create parser
Create a file named test.mjs with code like:
import Parser from "./minimal.mjs";
Parser.parser.parse("PetTypes::dog")
Run node test.mjs
Edit: Updated with a reproducible example.
Edit2: Simpler JISON
Unlike (f)lex, the jison lexer accepts the first matching pattern, even if it is not the longest matching pattern. You can get the (f)lex behaviour by using
%option flex
However, that significantly slows down the scanner.
The original jison automatically added \b to the end of patterns which ended with a literal string matching an alphabetic character, to make it easier to match keywords without incurring this overhead. In jison-gho, this feature was turned off unless you specify
%option easy_keyword_rules
See https://github.com/zaach/jison/wiki/Deviations-From-Flex-Bison#user-content-literal-tokens.
So either of those options will achieve the behaviour you expect.
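The difference between the two matching strategies can be sketched in Python. This is a toy tokenizer written for illustration, not jison's implementation; the names tokenize, RULES and BOUNDARY_RULES are made up here, and BOUNDARY_RULES only imitates the effect of appending \b to the keyword pattern.

```python
import re

# Rules in the same order as the lexer section of the question.
RULES = [
    ("DO", r"do"),
    ("ID", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("DOUBLECOLON", r"::"),
]

# Same rules, but the keyword must end at a word boundary.
BOUNDARY_RULES = [("DO", r"do\b")] + RULES[1:]


def tokenize(text, rules, longest=False):
    """Tokenize with first-match (default) or longest-match rule selection."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, regex in compiled:
            match = regex.match(text, pos)
            if match is None or match.end() == pos:
                continue
            if best is None or (longest and match.end() > best[1].end()):
                best = (name, match)
            if not longest:
                break  # first matching rule wins, as in the jison default
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], best[1].group(0)))
        pos = best[1].end()
    return tokens
```

With first-match selection, tokenize("PetTypes::dog", RULES) splits dog into a DO token followed by an ID token for g. Passing longest=True, or using BOUNDARY_RULES, yields a single ID token for dog in this sketch.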

XText Validator shows Parse Error in wrong line

I am currently developing a small DSL with the following (shortened) grammar:
grammar mydsl with org.eclipse.xtext.common.Terminals hidden(WS, SL_COMMENT)
generate mydsl "uri::mydsl"
CommandSet:
(commands+=Command)*
;
Command:
(commandName=CommandName LBRACKET (args=ArgumentList)? RBRACKET EOL ) |
;
terminal LBRACKET:
'('
;
terminal RBRACKET:
')'
;
terminal EOL:
';'
;
As you can see, I use a semicolon as an EOL separator, and it works just fine for me. The problem occurs with the built-in syntax validator when working with the DSL in Eclipse. When I omit a semicolon, the validator reports a syntax error on the wrong line.
Is there an error with my grammar? Thanks ;)
Here is a small DSL loosely based on your example. Basically, I no longer treat linebreaks as "hidden" (i.e. the parser no longer ignores them), only whitespace. Note the new terminals MY_WS and MY_NL as well as the modified hidden statement in the grammar header (I also added some comments in the relevant places). This approach just gives you the general idea, and you can experiment with it to achieve what you want. Note that if linebreaks are no longer hidden, you will need to account for them in your grammar rules.
grammar org.xtext.example.mydsl.MyDsl
with org.eclipse.xtext.common.Terminals
hidden( MY_WS, SL_COMMENT ) // ---> hide whitespaces and comments only, not linebreaks!
generate mydsl "uri::mydsl"
CommandSet:
(commands+=Command)*
;
CommandName:
name=ID
;
ArgumentList:
arguments += STRING (',' arguments += STRING)*
;
Command:
(commandName=CommandName LBRACKET (args=ArgumentList)? RBRACKET EOL);
terminal LBRACKET:
'('
;
terminal RBRACKET:
')'
;
terminal EOL:
';' MY_NL? // ---> now an optional linebreak at the end!
;
terminal MY_WS: (' '|'\t')+; // ---> whitespace characters (formerly part of WS)
terminal MY_NL: ('\r'|'\n')+; // ---> linebreak characters (no longer hidden)
Here is an image demonstrating the resulting behavior.
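One plausible way to see why the marker moves is the toy model below, written in Python. It is not the Xtext parser: the function name, the token regex and the command shape name(); are assumptions for illustration. The idea is that when linebreaks are skipped, a missing ';' is only noticed at the next visible token, which sits on the following line.

```python
import re

# Names, brackets and the semicolon; everything else is ignored.
TOKEN = re.compile(r"[A-Za-z_]\w*|[();]")


def first_error_line(source, newlines_visible):
    """Return the 1-based line where a syntax error is noticed, or None.

    Toy model of commands shaped like `name();`, not the Xtext parser.
    """
    tokens = []  # (text, line number) pairs
    for line_no, line in enumerate(source.splitlines(), start=1):
        tokens.extend((m.group(0), line_no) for m in TOKEN.finditer(line))
        if newlines_visible:
            tokens.append(("\n", line_no))
    pos = 0
    while pos < len(tokens):
        if tokens[pos][0] == "\n":
            pos += 1  # linebreak between commands
            continue
        for expected in ("name", "(", ")", ";"):
            if pos >= len(tokens):
                return tokens[-1][1]  # input ended in the middle of a command
            text, line_no = tokens[pos]
            if expected == "name":
                ok = text[0].isalpha() or text[0] == "_"
            else:
                ok = text == expected
            if not ok:
                return line_no  # error is reported at the offending token
            pos += 1
    return None
```

For the input "first()\nsecond();\n" the sketch reports line 2 when linebreaks are skipped and line 1 when they are kept as tokens.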

Unexpected behavior with ANTLR3

I am experiencing an unexpected behavior with ANTLR3. This is my grammar:
grammar Onto;
// parser rules
predicate
: VERB
;
// lexer rules
VERB
: 'VB' WS
;
PREPOSITION
: 'TO' WS
;
WS
: (' ' | '\t' | '\r'| '\n')
;
When I parse the string "VB TO", ANTLR3 exits without flagging an error. This is unexpected because the given string does not match any rule in the grammar.
However, when I retry the same after removing the PREPOSITION rule from the grammar, ANTLR3 flags the following error, which is the expected result:
line 1:3 no viable alternative at character 'T'
line 1:4 no viable alternative at character 'O'
You made a classic mistake: your main rule has no EOF at the end, so your parser matches only part of your input and treats that as valid. In your case it matches VERB and then expects nothing more. The fact that PREPOSITION matches your "TO" input is part of this behavior, as the lexer returns the PREPOSITION token to the parser. But since the parser is already satisfied with the VERB input, it considers the parse successful.
Without the PREPOSITION lexer rule, however, the lexer returns an error token because it cannot match that input, which is what the error above is about.
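In the grammar, the fix would be to write the main rule as predicate : VERB EOF ; so that the whole input must be consumed. The effect can be sketched with a toy parser in Python; parse_predicate and the token list are illustrative assumptions, not code generated by ANTLR3.

```python
def parse_predicate(tokens, require_eof):
    """Toy parser for `predicate : VERB ;` or `predicate : VERB EOF ;`.

    Returns the number of tokens consumed, or raises SyntaxError.
    """
    if not tokens or tokens[0] != "VERB":
        raise SyntaxError("expected VERB")
    consumed = 1
    # Without EOF in the rule, leftover tokens are silently ignored.
    if require_eof and consumed != len(tokens):
        raise SyntaxError(f"expected EOF, found {tokens[consumed]}")
    return consumed
```

Called with ["VERB", "PREPOSITION"], the sketch succeeds after one token when require_eof is False and raises SyntaxError when it is True.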

XTEXT: Controlling when whitespace is allowed

I have a custom scripting language for which I am trying to use Xtext for syntax checking. It boils down to single-line commands in the format
COMMAND:PARAMETERS
For the most part, xtext is working great. The only problem I have currently run into is how to handle wanted (or unwanted) white spaces. The language cannot have a space to begin a line, and there cannot be a space following the colon. As well, I need to allow white space in the parameters, as it could be a string of text, or something similar.
I have used a datatype to allow white space in the parameter:
UNQUOTED_STRING:
(ID | INT | WS | '.' )+
;
This works, but has the side effect of allowing spaces throughout the line.
Does anyone know a way to limit where white spaces are allowed?
Thanks in advance for any advice!
You can disallow whitespace globally for your grammar by using an empty set of hidden tokens, e.g.
grammar org.xyz.MyDsl with org.eclipse.xtext.common.Terminals hidden()
Then you can enable it at specific rules, e.g.
XParameter hidden(WS):
'x' '=' value=ID
;
Note that this would allow linebreaks as well. If you don't want that, you can either pass a custom terminal rule or override the default WS rule.
Here is a more complete example (not perfect):
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals hidden()
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
(commands+=Command '\r'? '\n')+
;
Command:
SampleCommand
;
SampleCommand:
command='get' ':' parameter=Parameter
;
Parameter:
'{' x=XParameter '}'
;
XParameter hidden(WS):
'x' '=' value=ID
;
This will parse commands such as:
get:{x=TEST}
get:{ x = TEST}
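But will reject the following, because of the leading space in the first line and the space after the colon in the second:
 get:{x=TEST}
get: {x=TEST}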
Hope that gives you an idea. You can also do this the other way around by limiting the whitespace only for certain rules, e.g.
CommandList hidden():
(commands+=Command '\r'? '\n')+
;
If that works better for your grammar.
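The intended effect for a single command line can be approximated with a regular expression, sketched here in Python. COMMAND and parse_command are hypothetical names; the pattern only mirrors the examples in this answer, is not derived from Xtext, and does not model whether whitespace directly before '}' would be accepted.

```python
import re

# 'get' ':' '{' 'x' '=' ID '}' with whitespace tolerated only before and
# between the tokens of XParameter, nowhere else on the line.
COMMAND = re.compile(r"get:\{\s*x\s*=\s*([A-Za-z_][A-Za-z0-9_]*)\}")


def parse_command(line):
    """Return the ID value of a command line, or None if it is rejected."""
    match = COMMAND.fullmatch(line)
    return match.group(1) if match else None
```

In this sketch, "get:{x=TEST}" and "get:{ x = TEST}" both yield TEST, while a line with a leading space or with a space after the colon yields None.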

Comment lexer rule

I'm new to ANTLR and I've come up with this lexer rule to parse out comments. Will it work?
COMMENT_LINE : (COMMENT (. - LINE_ENDING)* LINE_ENDING){$channel=hidden};
(I couldn't find anything regarding syntax such as this in the docs)
Your rule doesn't compile at all. If you use ANTLRWorks to create a new lexer grammar, you can check a box to have it generate a lexer rule that matches single line comments. It generates this:
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
;
Alternatively, you can use something like this to match single line comments:
COMMENT_LINE
: COMMENT (options{greedy=false;}: .)* LINE_ENDING {$channel=HIDDEN;}
;
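To see what text the generated COMMENT rule covers, a rough Python equivalent of '//' ~('\n'|'\r')* '\r'? '\n' is sketched below. LINE_COMMENT and strip_line_comments are names invented for this example; unlike a real lexer, the sketch does not know about string literals, so a // inside a string would be stripped too.

```python
import re

# Rough regex equivalent of: '//' ~('\n'|'\r')* '\r'? '\n'
LINE_COMMENT = re.compile(r"//[^\r\n]*\r?\n")


def strip_line_comments(source):
    """Drop every // comment together with its line ending."""
    return LINE_COMMENT.sub("", source)
```

Because the pattern requires a terminating linebreak, a comment on the very last line of the input with no trailing newline is left untouched by this sketch.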
