Comment lexer rule - comments

I'm new to ANTLR and I've come up with this lexer rule to parse out comments. Will it work?
COMMENT_LINE : (COMMENT (. - LINE_ENDING)* LINE_ENDING){$channel=hidden};
(I couldn't find anything regarding syntax such as this in the docs)

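Your rule doesn't compile at all: ANTLR has no - operator for excluding characters (negation is written with ~), and the action must assign the constant HIDDEN and end with a semicolon. If you use ANTLRWorks to create a new lexer grammar, you can check a box to have it generate a lexer rule that matches single-line comments. It generates this: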
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
;
Alternatively, you can use something like this to match single-line comments (assuming COMMENT and LINE_ENDING are defined elsewhere in your lexer):
COMMENT_LINE
: COMMENT (options{greedy=false;}: .)* LINE_ENDING {$channel=HIDDEN;}
;
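To see what the ANTLRWorks-generated rule accepts, here is a rough Python sketch that transcribes it into a regular expression. It is only an approximation for experimenting outside ANTLR, and the names COMMENT and match_comment are made up for this illustration. One thing it shows is that the rule requires a line ending, so a comment on the very last line of a file with no trailing newline does not match.

```python
import re

# Transcription of:  '//' ~('\n'|'\r')* '\r'? '\n'
# (hypothetical helper, not part of ANTLR)
COMMENT = re.compile(r"//[^\r\n]*\r?\n")


def match_comment(text, pos=0):
    """Return the comment text matched at pos, or None if the rule does not match."""
    m = COMMENT.match(text, pos)
    return m.group(0) if m else None
```

Calling match_comment("// note\nx = 1") returns the comment including its newline, while match_comment("// note") returns None because no line ending follows.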

Related

Strange behaviour for comments in Antlr4 grammar

Adding a comment line under ID is fine; however, adding one under WS causes an error to be raised. The entire file Hello.g4 is listed below.
/**
* Define a grammar called Hello
*/
grammar Hello;
r : 'hello' ID ; // match keyword hello followed by an identifier
ID : [a-z]+ ; // match lower-case identifiers
/**********************************************************************************************/
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
/**********************************************************************************************/
The output I get in the console is as follows:
ANTLR Tool v4.4 (/tmp/antlr-4.4-complete.jar)
Hello.g4 -o /home/me/workspace/TestComment/target/generated-sources/antlr4 -listener -no-visitor -encoding UTF-8
error(50): Hello.g4:13:0: syntax error: '<EOF>' came as a complete surprise to me
1 error(s)
BUILD FAIL
Total time: 168 millisecond(s)
Running Eclipse Version: Neon.3 Release (4.6.3), Default ANTLR4 project.
Why should ANTLR4 care about a trailing comment line?
The ANTLR 4 grammar for grammar files treats JavaDoc comments (anything starting with /**) as optionally allowed as a file header and before each rule. Your line of asterisks starts with /**, so it counts as one. No rule follows the last comment line, so it is interpreted as the invalid beginning of a rule.
Change your comment line to /*----*/ to avoid the error.

XText Validator shows Parse Error in wrong line

I am currently developing a small DSL with the following (shortened) grammar:
grammar mydsl with org.eclipse.xtext.common.Terminals hidden(WS, SL_COMMENT)
generate mydsl "uri::mydsl"
CommandSet:
(commands+=Command)*
;
Command:
(commandName=CommandName LBRACKET (args=ArgumentList)? RBRACKET EOL ) |
;
terminal LBRACKET:
'('
;
terminal RBRACKET:
')'
;
terminal EOL:
';'
;
As you can see, I use a semicolon as an EOL separator, and it works just fine for me. The problem occurs with the built-in syntax validator when working with the DSL in Eclipse. When I miss a semicolon, the validator reports the syntax error on the wrong line.
Is there an error with my grammar? Thanks ;)
Here is a small DSL loosely based on your example. Basically, linebreaks are no longer treated as hidden (i.e. the parser no longer ignores them); only spaces and tabs are. Note the new terminals MY_WS and MY_NL and the modified hidden statement in the grammar header (I also added comments at the relevant places). This approach only gives you a general idea, and you can experiment with it to achieve what you want. Note that once linebreaks are no longer hidden, you need to account for them in your grammar rules.
grammar org.xtext.example.mydsl.MyDsl
with org.eclipse.xtext.common.Terminals
hidden( MY_WS, SL_COMMENT ) // ---> hide whitespaces and comments only, not linebreaks!
generate mydsl "uri::mydsl"
CommandSet:
(commands+=Command)*
;
CommandName:
name=ID
;
ArgumentList:
arguments += STRING (',' STRING)*
;
Command:
(commandName=CommandName LBRACKET (args=ArgumentList)? RBRACKET EOL);
terminal LBRACKET:
'('
;
terminal RBRACKET:
')'
;
terminal EOL:
';' MY_NL? // ---> now an optional linebreak at the end!
;
terminal MY_WS: (' '|'\t')+; // ---> whitespace characters (formerly part of WS)
terminal MY_NL: ('\r'|'\n')+; // ---> linebreak characters (no longer hidden)
Here is an image demonstrating the resulting behavior.

Unexpected behavior with ANTLR3

I am experiencing an unexpected behavior with ANTLR3. This is my grammar:
grammar Onto;
// parser rules
predicate
: VERB
;
// lexer rules
VERB
: 'VB' WS
;
PREPOSITION
: 'TO' WS
;
WS
: (' ' | '\t' | '\r'| '\n')
;
When I parse the string "VB TO", ANTLR3 exits without flagging an error. This is unexpected because the given string does not match any rule in the grammar.
However when I retry the same after removing the PREPOSITION rule from the grammar, ANTLR3 flags the following error which is the expected result:
line 1:3 no viable alternative at character 'T'
line 1:4 no viable alternative at character 'O'
You made the classic mistake: your main rule has no EOF at the end, so your parser matches only part of your input and treats that as valid. Ending the rule with EOF (predicate : VERB EOF ;) forces it to consume the whole input. In your case the parser matches VERB and then expects nothing more. The PREPOSITION rule matters only because it lets the lexer turn your "TO" input into a PREPOSITION token without complaint; since the parser is already satisfied with VERB, it considers the parse successful.
Without the PREPOSITION lexer rule, however, the lexer cannot match that input and reports an error, which is the error shown above.
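The effect of the missing EOF can be mimicked with a toy model in Python. This is a sketch of the idea only, not ANTLR's actual algorithm; tokenize, parse_predicate and the require_eof flag are hypothetical names, and the input is assumed to end with whitespace so that both token rules can match.

```python
import re

# Toy lexer with the two token rules from the question:
# VERB : 'VB' WS ;   PREPOSITION : 'TO' WS ;
TOKEN_RE = re.compile(r"(?P<VERB>VB\s)|(?P<PREPOSITION>TO\s)")


def tokenize(text):
    """Split text into token names and append an EOF marker."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"no viable alternative at character {text[pos]!r}")
        tokens.append(m.lastgroup)
        pos = m.end()
    tokens.append("EOF")
    return tokens


def parse_predicate(tokens, require_eof):
    """Model 'predicate : VERB ;' or, with require_eof, 'predicate : VERB EOF ;'."""
    if tokens[0] != "VERB":
        return False
    if require_eof:
        # With EOF in the rule, leftover tokens make the parse fail.
        return tokens[1] == "EOF"
    # Without EOF, the parser stops after VERB and ignores the rest.
    return True
```

With the input "VB TO ", parse_predicate(tokenize("VB TO "), require_eof=False) succeeds even though a PREPOSITION token is left over, while the same call with require_eof=True fails.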

XTEXT: Controlling when whitespace is allowed

I have a custom scripting language, that I am attempting to use XTEXT for syntax checking. It boils down to single line commands in the format
COMMAND:PARAMETERS
For the most part, xtext is working great. The only problem I have currently run into is how to handle wanted (or unwanted) white spaces. The language cannot have a space to begin a line, and there cannot be a space following the colon. As well, I need to allow white space in the parameters, as it could be a string of text, or something similar.
I have used a datatype to allow white space in the parameter:
UNQUOTED_STRING:
(ID | INT | WS | '.' )+
;
This works, but has the side effect of allowing spaces throughout the line.
Does anyone know a way to limit where white spaces are allowed?
Thanks in advance for any advice!
You can disallow whitespace globally for your grammar by using an empty set of hidden tokens, e.g.
grammar org.xyz.MyDsl with org.eclipse.xtext.common.Terminals hidden()
Then you can enable it at specific rules, e.g.
XParameter hidden(WS):
'x' '=' value=ID
;
Note that this would allow linebreaks as well. If you don't want that, you can either pass a custom terminal rule or override the default WS rule.
Here is a more complete example (not perfect):
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals hidden()
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
(commands+=Command '\r'? '\n')+
;
Command:
SampleCommand
;
SampleCommand:
command='get' ':' parameter=Parameter
;
Parameter:
'{' x=XParameter '}'
;
XParameter hidden(WS):
'x' '=' value=ID
;
This will parse commands such as:
get:{x=TEST}
get:{ x = TEST}
But will reject (the first has a leading space, the second a space after the colon):
 get:{x=TEST}
get: {x=TEST}
Hope that gives you an idea. You can also do this the other way around by limiting the whitespace only for certain rules, e.g.
CommandList hidden():
(commands+=Command '\r'? '\n')+
;
Use whichever approach works better for your grammar.

Can't make ANTLR4 grammar skip comments

I am trying to write an ANTLR4 grammar to parse actionscript3. I've decided to start with something fairly coarse grained:
grammar actionscriptGrammar;
OBRACE:'{';
CBRACE:'}';
STRING_DELIM:'"';
BLOCK_COMMENT : '/*' .*? '*/' -> skip;
EOL_COMMENT : '//' .*? '/n' -> skip;
WS: [ \n\t\r]+ -> skip;
TEXT: ~[{} \n\t\r"]+;
thing
: TEXT
| string_literal
| OBRACE thing+? CBRACE;
string_literal : STRING_DELIM .+? STRING_DELIM;
start_rule
: thing+?;
Basically, I want a tree of things grouped by their lexical scope. I want comments to be ignored, and string literals be their own things so that any braces they may include do not affect lexical scope. The string_literal rule works fine (such as it is) but the two comment rules don't appear to have any effect. (i.e. comments aren't being ignored).
What am I missing?
This is from a simplified Java grammar I wrote in ANTLR v4.
WS
: [ \t\r\n]+ -> channel(HIDDEN)
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
Maybe this will help you out.
Also, try rearranging your code. Write the Parser Rules first and Lexer Rules last. Follow a Top-Down approach. I find it much more helpful in debugging. It will also look nice when you create an HTML export of your grammar from ANTLR 4 Eclipse Plugin.
Good Luck!
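The answer is that your TEXT rule is consuming your comments. The lexer picks whichever rule matches the longest stretch of input, and your negated set also matches / and *, so TEXT can swallow a comment along with the text around it. Rather than using a negated set, use something like:
TEXT: [a-zA-Z0-9_][/a-zA-Z0-9.;()\[\]_-]+ ;
That way, TEXT cannot start with /, so it cannot match the start of a comment. Separately, the '/n' at the end of your EOL_COMMENT rule matches the two characters / and n, not a newline; it should be '\n'.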
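The interaction between TEXT and the comment rules can be reproduced with a small longest-match lexer in Python. This is a simplified model rather than ANTLR itself: lex and the rule tuples are hypothetical names, the patterns are regex transcriptions of the rules above, and only BLOCK_COMMENT, WS and TEXT are modelled.

```python
import re

# Regex transcriptions of the grammar rules (assumed equivalents).
BLOCK_COMMENT = ("BLOCK_COMMENT", re.compile(r"/\*.*?\*/", re.S))
WS = ("WS", re.compile(r"[ \n\t\r]+"))
NEGATED_TEXT = ("TEXT", re.compile(r'[^{} \n\t\r"]+'))
LISTED_TEXT = ("TEXT", re.compile(r"[a-zA-Z0-9_][/a-zA-Z0-9.;()\[\]_-]+"))
SKIPPED = {"BLOCK_COMMENT", "WS"}  # rules marked '-> skip'


def lex(text, rules):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

For the input "/*x*/yz", the negated-set TEXT yields a single TEXT token containing the whole string, comment included, whereas the listed-set TEXT yields only ("TEXT", "yz"). A comment surrounded by whitespace, as in "/* a */ yz", is skipped under both variants, which is why the problem may show up only for some comments.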

Resources