ANTLR comment problem - comments

I am trying to write a comment matching rule in ANTLR, which is currently the following:
LINE_COMMENT
: '--' (options{greedy=false;}: .)* NEWLINE {Skip();}
;
NEWLINE : '\r'|'\n'|'\r\n' {Skip();};
This code works fine except when a comment is the last thing in a file, in which case it throws a NoViableAlt exception. How can I fix this?

Why not:
LINE_COMMENT : '--' (~ NEWLINE)* ;
fragment NEWLINE : '\r' '\n'? | '\n' ;
If you haven't come across this yet, lexical rules (all uppercase) can only consist of constants and tokens, not other lexemes. You need a parser rule for that.

I'd go for:
LINE_COMMENT
: '--' ~( '\r' | '\n' )* {Skip();}
;
NEWLINE
: ( '\r'? '\n' | '\r' ) {Skip();}
;
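To see why requiring a trailing NEWLINE breaks on the last line of a file, here is a rough analogy using Python regular expressions rather than ANTLR itself. The two patterns are my own approximations of the original rule and the rule suggested above, and the sample input is made up:

```python
import re

# Analogue of the original rule: the comment must be terminated by a line break.
NEEDS_NEWLINE = re.compile(r"--.*?(?:\r\n|\r|\n)")
# Analogue of the suggested rule: consume everything that is not a line break.
UNTIL_LINE_BREAK = re.compile(r"--[^\r\n]*")

# Hypothetical input whose last line is a comment with no newline after it.
source = "select 1 -- trailing comment"

print(NEEDS_NEWLINE.search(source))             # None
print(UNTIL_LINE_BREAK.search(source).group())  # -- trailing comment
```

The second pattern never needs to see a line break, so it still matches when the input simply ends.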

Related

Gocc to ignore things in lexical parser

Is there a way to tell gocc to ignore things in the lexical parser? E.g., for
2022-01-18 11:33:21.9885 [21] These are strings that I need to egnore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM
I want to tell gocc to ignore everything from [21] all the way to "until". Here is what I've been trying:
/* Lexical part */
_letter : 'A'-'Z' | 'a'-'z' | '_' ;
_digit : '0'-'9' ;
_timestamp1 : _digit | ' ' | ':' | '-' | '.' ;
_timestamp2 : _digit | ' ' | ':' | '/' | 'A' | 'P' | 'M' ;
_ignore : '[' { . } ' ' '-' ' ' 'M' 'Y' 'K' 'W' ' ' '-' ' ' ;
_lineend : [ '\r' ] '\n' ;
timestamp : _timestamp1 { _timestamp1 } _ignore ;
taskLogStart : 'S' 't' 'a' 'r' 't' ' ' ;
jobName : { . } _timestamp2 { _timestamp2 } _lineend ;
/* Syntax part */
Log
: timestamp taskLogStart jobName ;
However, the parser failed at:
error: expected timestamp; got: unknown/invalid token "2022-01-18 11:33:21.9885 [21] T"
The reason I think it should work is that the following ignore rules work perfectly fine for comments:
!lineComment : '/' '/' { . } '\n' ;
!blockComment : '/' '*' { . | '*' } '*' '/' ;
and I'm just applying the same idea to my normal text parsing.
It doesn't work that way.
The EBNF looks very much like a regular expression, but it does not behave like one at all. What I mean is this.
Take the line:
2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM
As a regular expression, it can simply be matched with:
([0-9.: -]+).*? - MYKW - Start ([^:]+):.*$
However, that cannot be translated directly into an EBNF definition, because the regular expression relies on the context between its elements to pinpoint a match (e.g., the .*? is a local rule that only works because of the context it is in), whereas gocc generates an LR parser for a context-free grammar.
Basically, a context-free grammar means that each lexical symbol can be considered a global rule that is not affected by the context it is in, so a { . } is tried against the input on its own. I cannot describe it precisely, but neither the previous context nor the symbol following it is involved in the next match. That's the reason why the OP's grammar fails.
For a real sample of how the '{.}' can be used, see
How to describe this event log in formal BNF?
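For comparison, here is a small Python sketch of the regular-expression approach mentioned above, applied to the sample log line. It only illustrates how the regex uses the surrounding literals to bound the lazy .*?; it is not gocc code, and the variable names are my own:

```python
import re

# The sample log line from the question.
LINE = (
    "2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, "
    "until - MYKW - Start Active One: 1/18/2022 11:33:21 AM"
)

# The regular expression from the answer, unchanged.
PATTERN = re.compile(r"([0-9.: -]+).*? - MYKW - Start ([^:]+):.*$")

match = PATTERN.match(LINE)
timestamp = match.group(1).strip()  # the first group also captures the trailing space
job_name = match.group(2)

print(timestamp)  # 2022-01-18 11:33:21.9885
print(job_name)   # Active One
```

The lazy .*? only stops where it does because the literal " - MYKW - Start " follows it in the same pattern, which is exactly the kind of context a standalone lexical symbol does not have.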

why does a comma "," get counted in [.] type expression in antlr lexer

I am making a grammar for bash scripts. I am facing a problem while tokenising the "," symbol. The following grammar tokenises it as <BLOB> while I expect it to be tokenised as <OTHER>.
grammar newgram;
code : KEY (BLOB)+ (EOF | '\n')+;
KEY : 'wget';
BLOB : [a-zA-Z0-9#!$^%*&+-.]+?;
OTHER : .;
However, if I change BLOB to [a-zA-Z0-9#!$^%*&+.-]+?;, then it is tokenised as <OTHER>.
I cannot understand why this is happening.
In the former case, the characters : and / are also tokenised as <OTHER>, so I do not see a reason for , to be marked as <BLOB>.
The input I am tokenising: wget -o --quiet https,://www.google.com
The output I am receiving with the grammar above:
[#0,0:3='wget',<'wget'>,1:0]
[#1,4:4=' ',<OTHER>,1:4]
[#2,5:5='-',<BLOB>,1:5]
[#3,6:6='o',<BLOB>,1:6]
[#4,7:7=' ',<OTHER>,1:7]
[#5,8:8='-',<BLOB>,1:8]
[#6,9:9='-',<BLOB>,1:9]
[#7,10:10='q',<BLOB>,1:10]
[#8,11:11='u',<BLOB>,1:11]
[#9,12:12='i',<BLOB>,1:12]
[#10,13:13='e',<BLOB>,1:13]
[#11,14:14='t',<BLOB>,1:14]
[#12,15:15=' ',<OTHER>,1:15]
[#13,16:16='h',<BLOB>,1:16]
[#14,17:17='t',<BLOB>,1:17]
[#15,18:18='t',<BLOB>,1:18]
[#16,19:19='p',<BLOB>,1:19]
[#17,20:20='s',<BLOB>,1:20]
[#18,21:21=',',<BLOB>,1:21]
[#19,22:22=':',<OTHER>,1:22]
[#20,23:23='/',<OTHER>,1:23]
[#21,24:24='/',<OTHER>,1:24]
[#22,25:25='w',<BLOB>,1:25]
[#23,26:26='w',<BLOB>,1:26]
[#24,27:27='w',<BLOB>,1:27]
[#25,28:28='.',<BLOB>,1:28]
[#26,29:29='g',<BLOB>,1:29]
[#27,30:30='o',<BLOB>,1:30]
[#28,31:31='o',<BLOB>,1:31]
[#29,32:32='g',<BLOB>,1:32]
[#30,33:33='l',<BLOB>,1:33]
[#31,34:34='e',<BLOB>,1:34]
[#32,35:35='.',<BLOB>,1:35]
[#33,36:36='c',<BLOB>,1:36]
[#34,37:37='o',<BLOB>,1:37]
[#35,38:38='m',<BLOB>,1:38]
[#36,39:39='\n',<'
'>,1:39]
[#37,40:39='<EOF>',<EOF>,2:0]
line 1:4 extraneous input ' ' expecting BLOB
line 1:7 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:15 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:22 extraneous input ':' expecting {<EOF>, '
', BLOB}
As already mentioned in a comment, the - in +-. inside your character class is interpreted as a range operator. And the , is inside that range. Escape it like this: [a-zA-Z0-9#!$^%*&+\-.]+?
Also, a trailing [ ... ]+? at the end of a lexer rule will always match a single character. So [a-zA-Z0-9#!$^%*&+\-.]+? can just as well be written as [a-zA-Z0-9#!$^%*&+\-.]
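The range can be checked quickly outside ANTLR. The sketch below uses Python, whose character classes treat an unescaped - between two characters as a range in the same way; it is only meant to show which characters fall between + and ., not to reproduce the ANTLR lexer:

```python
import re

# Characters covered by the range '+'..'.' (code points 0x2B through 0x2E).
range_chars = [chr(c) for c in range(ord("+"), ord(".") + 1)]
print(range_chars)  # ['+', ',', '-', '.']

# Same character class as in the question: "+-." is read as a range.
as_range = re.compile(r"[a-zA-Z0-9#!$^%*&+-.]")
# Escaped hyphen: '+', '-' and '.' are three separate literals.
escaped = re.compile(r"[a-zA-Z0-9#!$^%*&+\-.]")

print(bool(as_range.fullmatch(",")))  # True
print(bool(escaped.fullmatch(",")))   # False
```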

ANTLR mismatched token in simple grammar

I am currently debugging my grammar in ANTLRworks, and reduced it far more than is reasonable to this:
grammar DebugInternalGrammar;
RULE_STRING :
'"' (
('\\' .) |
(~ (
'\\' |
'"'
))
)* '"'
;
Which, when testing in the interpreter against the String
"L"
just yields
MismatchedTokenException(76!=34)
What does work is matching "". Also, reducing the grammar to:
grammar DebugInternalGrammar;
RULE_STRING :
'"' (
(~ (
'\\' |
'"'
))
)* '"'
;
matches "L" (I assume this is what it means when the parse tree in ANTLRworks shows <epsilon> as leaf).
What is wrong here? This is not the part of the grammar which caused me trouble before, so I am scratching my head as to what the problem could be and what ANTLRworks is trying to tell me.

ANTLR3 not ignoring comments that begin at the first character of a file

Sorry if any terminology is off, just started using antlr recently.
Here's the antlr grammar that ignores multi-line comments:
COMMENT : '/*' .* '*/';
SPACE : (' ' | '\t' | '\r' | '\n' | COMMENT)+ {$channel = HIDDEN;} ;
Here's a comment beginning at the first character of a file I'd like to compile:
/*
This is a comment
*/
Here's the error I get:
[filename] line 252:0 no viable alternative at character '<EOF>'
[filename] line 1:1 no viable alternative at input '*'
However, if I put a space in front of the comment, like so:
 /*
This is a comment
*/
It compiles fine. Any ideas?
To ignore multi-line comments:
ML_COMMENT
: '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
EDIT:
Maybe the problem is not in your lexer but in your parser. With $channel=HIDDEN, the lexer does not pass these tokens on to the parser. That is why the parser finds EOF first: it receives nothing.
If you put a whitespace character first, the parser receives something and is able to process the input.
That is probably your issue. I hope this helps.

How to disallow nested comments in antlr

I currently have a multiline comment lexer rule in antlr which looks like:
MULTILINE: '/*' .* '*/' {$channel=HIDDEN;} ;
However, this currently allows things like:
/* /* hello */ */
Is there any possible way to disable nesting comments in antlr? I've tried various things like
MULTILINE: '/*' (~(MULTILINE)|.*) '*/' {$channel=HIDDEN;} ;
But that doesn't work. Any help would be much appreciated!
No, that is not correct: in ANTLR 3, .* and .+ are not greedy, so your rule does not actually match nested comments.
Given the parser generated by the following grammar:
grammar T;
parse
: (t=. {System.out.printf("\%-15s'\%s'\n", tokenNames[$t.type], $t.text);} )* EOF
;
MULTILINE
: '/*' .* '*/' {$channel=HIDDEN;}
;
OTHER
: .
;
the input "/* /* hello */ */" would produce the following on your command line:
OTHER ' '
OTHER '*'
OTHER '/'
I.e., "/* /* hello */" is being put on the HIDDEN channel, and 3 OTHER tokens are constructed.
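The same behaviour can be reproduced with a lazy regular expression. This is only an analogy in Python, not ANTLR output; the pattern /\*.*?\*/ is my stand-in for the non-greedy MULTILINE rule:

```python
import re

source = "/* /* hello */ */"

# Lazy match: stop at the first "*/", like the non-greedy .* in the lexer rule.
first_comment = re.match(r"/\*.*?\*/", source, re.DOTALL)
hidden = first_comment.group()
leftover = source[first_comment.end():]

print(repr(hidden))    # '/* /* hello */'
print(list(leftover))  # [' ', '*', '/']
```

The three leftover characters correspond to the three OTHER tokens in the output above.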
Try this:
With this rule, neither the prefix nor the suffix can be recognized inside the comment body, so nesting is not allowed.
COMMENT_NON_NEST
: '/*'
( ('/'|'*'+)? ~[*/] )*?
('/'|'*'+?)?
'*/'
{$channel=HIDDEN;}
;
