ANTLR comment problem - comments

I am trying to write a comment matching rule in ANTLR, which is currently the following:
: '--' (options{greedy=false;}: .)* NEWLINE {Skip();}
NEWLINE : '\r'|'\n'|'\r\n' {Skip();};
This code works fine except when a comment is the last thing in a file, in which case it throws a NoViableAlt exception. How can I fix this?

Why not:
fragment NEWLINE : '\r' '\n'? | '\n' ;
If you haven't come across this yet, lexical rules (all uppercase) can only consist of constants and tokens, not other lexemes. You need a parser rule for that.

I'd go for:
: '--' ~( '\r' | '\n' )* {Skip();}
: ( '\r'? '\n' | '\r' ) {Skip();}
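To see why dropping the mandatory newline at the end of the comment rule helps, here is a rough analogy using Python's re module rather than ANTLR itself. The names WITH_NEWLINE and WITHOUT_NEWLINE and the sample text are made up for illustration. A pattern that insists on a line terminator cannot match a comment that runs to the end of the input, while one that only consumes non-newline characters can:

```python
import re

# Analogy only: these are Python regexes, not ANTLR lexer rules.
# Mirrors the original rule: '--', anything (lazily), then a required newline.
WITH_NEWLINE = re.compile(r"--.*?(?:\r\n|\r|\n)")
# Mirrors the suggested rule: '--', then any run of non-newline characters.
WITHOUT_NEWLINE = re.compile(r"--[^\r\n]*")

# Hypothetical input whose last comment is not followed by a newline.
source = "select 1 -- first\nselect 2 -- last"

with_newline = [m.group() for m in WITH_NEWLINE.finditer(source)]
without_newline = [m.group() for m in WITHOUT_NEWLINE.finditer(source)]

print(with_newline)     # the trailing comment is missed
print(without_newline)  # both comments are found
```

The newline itself is then left for a separate rule to skip, which is what the second rule in the answer does.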


Gocc to ignore things in lexical parser

Is there a way to tell gocc to ignore things in the lexical part? E.g., for
2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM
I want to tell gocc to ignore everything from [21] all the way to "until". Here is what I've been trying:
/* Lexical part */
_letter : 'A'-'Z' | 'a'-'z' | '_' ;
_digit : '0'-'9' ;
_timestamp1 : _digit | ' ' | ':' | '-' | '.' ;
_timestamp2 : _digit | ' ' | ':' | '/' | 'A' | 'P' | 'M' ;
_ignore : '[' { . } ' ' '-' ' ' 'M' 'Y' 'K' 'W' ' ' '-' ' ' ;
_lineend : [ '\r' ] '\n' ;
timestamp : _timestamp1 { _timestamp1 } _ignore ;
taskLogStart : 'S' 't' 'a' 'r' 't' ' ' ;
jobName : { . } _timestamp2 { _timestamp2 } _lineend ;
/* Syntax part */
: timestamp taskLogStart jobName ;
However, the parser failed at:
error: expected timestamp; got: unknown/invalid token "2022-01-18 11:33:21.9885 [21] T"
The reason I think it should work is that the following ignore rules work perfectly fine for comments:
!lineComment : '/' '/' { . } '\n' ;
!blockComment : '/' '*' { . | '*' } '*' '/' ;
and I'm just applying the same idea to my normal text parsing.
It doesn't work that way.
The EBNF looks very much like regular expressions, but it does not work like regular expressions at all. What I mean is this.
Take the line,
2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM
To match it with a regular expression, it can simply be:
([0-9.: -]+).*? - MYKW - Start ([^:]+):.*$
However, that cannot be translated directly into an EBNF definition. The regular expression relies on the context between its elements to pinpoint a match (e.g., the .*? matching rule is a local rule that only works because of the context it is in), whereas gocc generates an LR parser for a context-free grammar.
Basically, in a context-free grammar each lexical symbol can be considered a global rule that is not affected by the context it is in, so every time a { . } match is attempted it is tried against all existing lexical symbols. I cannot quite describe it, but neither the preceding context nor the symbol following it is involved in the next match. That is why the OP's grammar fails.
For a real sample of how the '{.}' can be used, see
How to describe this event log in formal BNF?
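For comparison, here is how the regular expression quoted above behaves on the sample line when run with Python's re module. This only illustrates the regex side of the argument, not gocc; the variable names are made up for the example:

```python
import re

# The sample log line from the question.
line = ("2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, "
        "until - MYKW - Start Active One: 1/18/2022 11:33:21 AM")

# The regular expression from the answer, unchanged.
pattern = re.compile(r"([0-9.: -]+).*? - MYKW - Start ([^:]+):.*$")

m = pattern.match(line)
# Group 1 also captures the space before '[', so strip it.
timestamp = m.group(1).strip()
job_name = m.group(2)

print(timestamp)
print(job_name)
```

The lazy .*? only stops where it does because the literal " - MYKW - Start " follows it, which is the kind of local context the answer says cannot be carried over into the lexical rules as-is.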

Why does a comma "," get counted in a [.] type expression in the ANTLR lexer

I am making a grammar for bash scripts. I am facing a problem while tokenising the "," symbol. The following grammar tokenises it as <BLOB> while I expect it to be tokenised as <OTHER>.
grammar newgram;
code : KEY (BLOB)+ (EOF | '\n')+;
KEY : 'wget';
BLOB : [a-zA-Z0-9#!$^%*&+-.]+?;
OTHER : .;
However, if I make BLOB to be [a-zA-Z0-9#!$^%*&+.-]+?;, then it is tokenised as <OTHER>.
I cannot understand why this happens.
In the former case, the characters : and / are also tokenised as <OTHER>, so I do not see a reason for , to be marked <BLOB>.
Input I am tokenising, wget -o --quiet https,://
The output I am receiving with the mentioned grammar,
[#1,4:4=' ',<OTHER>,1:4]
[#4,7:7=' ',<OTHER>,1:7]
[#12,15:15=' ',<OTHER>,1:15]
line 1:4 extraneous input ' ' expecting BLOB
line 1:7 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:15 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:22 extraneous input ':' expecting {<EOF>, '
', BLOB}
As already mentioned in a comment, the - in +-. inside your character class is interpreted as a range operator. And the , is inside that range. Escape it like this: [a-zA-Z0-9#!$^%*&+\-.]+?
Also, a trailing [ ... ]+? at the end of a lexer rule will always match a single character. So [a-zA-Z0-9#!$^%*&+\-.]+? can just as well be written as [a-zA-Z0-9#!$^%*&+\-.]
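Both points can be checked quickly outside ANTLR. The sketch below uses Python's re module as an analogy (its character classes treat an unescaped - between two characters as a range in the same way); the variable names are only for illustration:

```python
import re

# Analogy only: Python character classes, not ANTLR lexer sets.
unescaped = re.compile(r"[+-.]")   # range from '+' (0x2B) to '.' (0x2E)
escaped = re.compile(r"[+\-.]")    # the three literals '+', '-', '.'

# Every character covered by the range '+'-'.'.
in_range = [chr(c) for c in range(ord("+"), ord(".") + 1)]

comma_unescaped = bool(unescaped.fullmatch(","))
comma_escaped = bool(escaped.fullmatch(","))

# A lazy '+?' with nothing after it stops after a single character.
lazy_match = re.match(r"[a-z]+?", "quiet").group()

print(in_range)
print(comma_unescaped, comma_escaped)
print(lazy_match)
```

The comma sits between + and - in ASCII, which is why it falls inside the accidental range.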

ANTLR mismatched token in simple grammar

I am currently debugging my grammar in ANTLRworks, and reduced it far more than is reasonable to this:
grammar DebugInternalGrammar;
'"' (
('\\' .) |
(~ (
'\\' |
)* '"'
Which, when testing in the interpreter against the String
just yields
What does work is matching "", also reducing the grammar to:
grammar DebugInternalGrammar;
'"' (
(~ (
'\\' |
)* '"'
matches "L" (I assume this is what it means when the parse tree in ANTLRworks shows <epsilon> as leaf).
What is wrong here? This is not the part of the grammar which caused me trouble before, so I am scratching my head as to what the problem could be and what ANTLRworks is trying to tell me.

ANTLR3 not ignoring comments that begin at the first character of a file

Sorry if any terminology is off, just started using antlr recently.
Here's the antlr grammar that ignores multi-line comments:
COMMENT : '/*' .* '*/';
SPACE : (' ' | '\t' | '\r' | '\n' | COMMENT)+ {$channel = HIDDEN;} ;
Here's a comment beginning at the first character of a file I'd like to compile:
/* This is a comment */
Here's the error I get:
[filename] line 252:0 no viable alternative at character '<EOF>'
[filename] line 1:1 no viable alternative at input '*'
However, if I put a space in front of the comment, like so:
 /* This is a comment */
It compiles fine. Any ideas?
To ignore multi-line comments:
: '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
Maybe the problem is not in your lexer but in your parser. With $channel=HIDDEN, the lexer is told not to pass these elements to the parser. That is why the parser finds EOF first: you are sending it nothing!
If you put a whitespace character first, the parser receives something and is able to process the input.
That should be your issue. I hope this helps!
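As a rough illustration of what "hidden" means here, the toy tokenizer below (plain Python, not ANTLR; the TOKEN pattern and the tokens and visible helpers are hypothetical) tags comments and whitespace as HIDDEN and only hands the remaining tokens to a would-be parser. The comment text is assumed to be the one from the question:

```python
import re

# Toy stand-in for a lexer: comments and whitespace go on a hidden channel.
TOKEN = re.compile(
    r"(?P<COMMENT>/\*.*?\*/)|(?P<SPACE>\s+)|(?P<OTHER>.)", re.DOTALL
)

def tokens(text):
    """Yield (kind, channel, text) for every token in the input."""
    for m in TOKEN.finditer(text):
        kind = m.lastgroup
        channel = "HIDDEN" if kind in ("COMMENT", "SPACE") else "DEFAULT"
        yield kind, channel, m.group()

def visible(text):
    """Return only the tokens a parser would get to see."""
    return [(kind, tok) for kind, channel, tok in tokens(text)
            if channel == "DEFAULT"]

only_comment = visible("/* This is a comment */")
with_code = visible("/* c */ x")

print(only_comment)  # nothing reaches the parser
print(with_code)
```

If the whole input consists of hidden tokens, the parser side sees an empty stream, i.e. EOF straight away.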

How to disallow nested comments in antlr

I currently have a multiline comment lexer rule in antlr which looks like:
MULTILINE: '/*' .* '*/' {$channel=HIDDEN;} ;
However, this currently allows things like:
/* /* hello */ */
Is there any possible way to disable nesting comments in antlr? I've tried various things like
MULTILINE: '/*' (~(MULTILINE)|.*) '*/' {$channel=HIDDEN;} ;
But that doesn't work. Any help would be much appreciated!
No, that is not correct: .* and .+ are non-greedy by default.
Given the parser generated by the following grammar:
grammar T;
: (t=. {System.out.printf("\%-15s'\%s'\n", tokenNames[$t.type], $t.text);} )* EOF
: '/*' .* '*/' {$channel=HIDDEN;}
: .
the input "/* /* hello */ */" would produce the following on your command line:
I.e., "/* /* hello */" is being put on the HIDDEN channel, and 3 OTHER tokens are constructed.
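The same split can be reproduced with a lazy match in Python's re module. This is an analogy to the non-greedy .* described above, not ANTLR output; the variable names are made up:

```python
import re

text = "/* /* hello */ */"

# Lazy match: stop at the first '*/' after the opening '/*'.
m = re.match(r"/\*.*?\*/", text, re.DOTALL)

hidden = m.group()        # what the comment rule would swallow
rest = text[m.end():]     # what is left for the catch-all rule
leftover = list(rest)     # one single-character token each

print(repr(hidden))
print(leftover)
```

The three leftover characters correspond to the 3 OTHER tokens mentioned above.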
Try this:
Neither the prefix nor the suffix can be recognized inside the comment body. Nesting is also not allowed.
: '/*'
( ('/'|'*'+)? ~[*/] )*?
