Is there a way to tell gocc to ignore things in the lexical part? E.g., for
2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM
I want to tell gocc to ignore everything from [21] all the way to "until". Here is what I've been trying:
/* Lexical part */
_letter : 'A'-'Z' | 'a'-'z' | '_' ;
_digit : '0'-'9' ;
_timestamp1 : _digit | ' ' | ':' | '-' | '.' ;
_timestamp2 : _digit | ' ' | ':' | '/' | 'A' | 'P' | 'M' ;
_ignore : '[' { . } ' ' '-' ' ' 'M' 'Y' 'K' 'W' ' ' '-' ' ' ;
_lineend : [ '\r' ] '\n' ;
timestamp : _timestamp1 { _timestamp1 } _ignore ;
taskLogStart : 'S' 't' 'a' 'r' 't' ' ' ;
jobName : { . } _timestamp2 { _timestamp2 } _lineend ;
/* Syntax part */
Log
: timestamp taskLogStart jobName ;
However, the parser failed at:
error: expected timestamp; got: unknown/invalid token "2022-01-18 11:33:21.9885 [21] T"
The reason I think it should work is that the following ignore rules work perfectly fine for white space:
!lineComment : '/' '/' { . } '\n' ;
!blockComment : '/' '*' { . | '*' } '*' '/' ;
and I'm just applying the same idea to my normal text parsing.
It doesn't work that way.
The EBNF looks very much like regular expressions, but it does not behave like a regular expression at all. Take the line:
2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM
With a regular expression, it can be matched simply with:
([0-9.: -]+).*? - MYKW - Start ([^:]+):.*$
However, that cannot be translated directly into an EBNF definition, because the regular expression relies on the context between its elements to pinpoint a match (e.g., the .*? is a local rule that only works because of the context it sits in). gocc, on the other hand, generates an LR parser for a context-free grammar.
Basically, context-free here means that each lexical symbol can be considered a global rule that is not affected by the context it appears in, so a { . } is tried against the input on its own terms. I cannot describe it precisely, but neither the preceding context nor the symbol that follows is involved in the next match. That is why the OP's grammar fails.
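To make the contrast concrete, here is a minimal Python sketch that runs the regular expression quoted earlier in this answer against the sample line. Python is used purely for illustration and has nothing to do with gocc; the variable names are mine. The lazy .*? only stops where the literal " - MYKW - Start " can follow, which is the "local, context-dependent" behaviour described above.

```python
import re

# Sample log line from the question.
line = ("2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, "
        "until - MYKW - Start Active One: 1/18/2022 11:33:21 AM")

# The regular expression quoted in the answer, unchanged.
pattern = re.compile(r"([0-9.: -]+).*? - MYKW - Start ([^:]+):.*$")

m = pattern.match(line)
# Group 1 also swallows the space before "[21]", so strip it.
timestamp = m.group(1).strip()
job_name = m.group(2)

print(timestamp)
print(job_name)
```

Everything between the timestamp and " - MYKW - Start " is consumed by .*? and simply never captured, which is the "ignore" the question is after.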
For a real sample of how the '{.}' can be used, see
How to describe this event log in formal BNF?
I am making a grammar for bash scripts. I am facing a problem while tokenising the "," symbol. The following grammar tokenises it as <BLOB> while I expect it to be tokenised as <OTHER>.
grammar newgram;
code : KEY (BLOB)+ (EOF | '\n')+;
KEY : 'wget';
BLOB : [a-zA-Z0-9#!$^%*&+-.]+?;
OTHER : .;
However, if I change BLOB to [a-zA-Z0-9#!$^%*&+.-]+?;, then it is tokenised as <OTHER>.
I cannot understand why this happens.
In the former case, the characters : and / are also tokenised as <OTHER>, so I do not see a reason for , to be marked <BLOB>.
The input I am tokenising: wget -o --quiet https,://www.google.com
The output I am receiving with the grammar above:
[#0,0:3='wget',<'wget'>,1:0]
[#1,4:4=' ',<OTHER>,1:4]
[#2,5:5='-',<BLOB>,1:5]
[#3,6:6='o',<BLOB>,1:6]
[#4,7:7=' ',<OTHER>,1:7]
[#5,8:8='-',<BLOB>,1:8]
[#6,9:9='-',<BLOB>,1:9]
[#7,10:10='q',<BLOB>,1:10]
[#8,11:11='u',<BLOB>,1:11]
[#9,12:12='i',<BLOB>,1:12]
[#10,13:13='e',<BLOB>,1:13]
[#11,14:14='t',<BLOB>,1:14]
[#12,15:15=' ',<OTHER>,1:15]
[#13,16:16='h',<BLOB>,1:16]
[#14,17:17='t',<BLOB>,1:17]
[#15,18:18='t',<BLOB>,1:18]
[#16,19:19='p',<BLOB>,1:19]
[#17,20:20='s',<BLOB>,1:20]
[#18,21:21=',',<BLOB>,1:21]
[#19,22:22=':',<OTHER>,1:22]
[#20,23:23='/',<OTHER>,1:23]
[#21,24:24='/',<OTHER>,1:24]
[#22,25:25='w',<BLOB>,1:25]
[#23,26:26='w',<BLOB>,1:26]
[#24,27:27='w',<BLOB>,1:27]
[#25,28:28='.',<BLOB>,1:28]
[#26,29:29='g',<BLOB>,1:29]
[#27,30:30='o',<BLOB>,1:30]
[#28,31:31='o',<BLOB>,1:31]
[#29,32:32='g',<BLOB>,1:32]
[#30,33:33='l',<BLOB>,1:33]
[#31,34:34='e',<BLOB>,1:34]
[#32,35:35='.',<BLOB>,1:35]
[#33,36:36='c',<BLOB>,1:36]
[#34,37:37='o',<BLOB>,1:37]
[#35,38:38='m',<BLOB>,1:38]
[#36,39:39='\n',<'
'>,1:39]
[#37,40:39='<EOF>',<EOF>,2:0]
line 1:4 extraneous input ' ' expecting BLOB
line 1:7 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:15 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:22 extraneous input ':' expecting {<EOF>, '
', BLOB}
As already mentioned in a comment, the - in +-. inside your character class is interpreted as a range operator. And the , is inside that range. Escape it like this: [a-zA-Z0-9#!$^%*&+\-.]+?
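The range effect is easy to reproduce outside ANTLR. The sketch below uses Python's re module, which, as far as I know, treats an unescaped - between two characters in a character class the same way; it is only an illustration, not ANTLR itself, and the variable names are mine.

```python
import re

# '+' is 0x2B and '.' is 0x2E, so the range "+-." spans four characters.
in_range = [chr(c) for c in range(ord("+"), ord(".") + 1)]
print(in_range)  # ['+', ',', '-', '.']

unescaped = re.compile(r"[a-zA-Z0-9#!$^%*&+-.]")  # "+-." is read as a range
escaped = re.compile(r"[a-zA-Z0-9#!$^%*&+\-.]")   # "\-" is a literal hyphen

comma_unescaped = unescaped.fullmatch(",") is not None
comma_escaped = escaped.fullmatch(",") is not None
hyphen_escaped = escaped.fullmatch("-") is not None

print(comma_unescaped, comma_escaped, hyphen_escaped)  # True False True
```

Putting the - last in the class, as in the question's second variant, has the same effect as escaping it, because there is nothing after it to form a range with.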
Also, a trailing [ ... ]+? at the end of a lexer rule will always match a single character. So [a-zA-Z0-9#!$^%*&+\-.]+? can just as well be written as [a-zA-Z0-9#!$^%*&+\-.]
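The same "one character only" effect shows up with a non-greedy quantifier at the very end of an ordinary regular expression. This is a Python analogy rather than ANTLR's lexer, so treat it as a hint about the behaviour, not a proof; it does line up with the token dump in the question, where every letter of "quiet" came out as its own BLOB.

```python
import re

lazy = re.compile(r"[a-z]+?")   # non-greedy, with nothing after it
greedy = re.compile(r"[a-z]+")  # greedy

# With nothing following the quantifier, the lazy version is satisfied
# after the first character; the greedy one takes the whole run.
lazy_match = lazy.match("quiet").group()
greedy_match = greedy.match("quiet").group()

print(lazy_match, greedy_match)  # q quiet
```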
I am currently debugging my grammar in ANTLRworks, and reduced it far more than is reasonable to this:
grammar DebugInternalGrammar;
RULE_STRING :
'"' (
('\\' .) |
(~ (
'\\' |
'"'
))
)* '"'
;
Which, when testing in the interpreter against the String
"L"
just yields
MismatchedTokenException(76!=34)
What does work is matching "", also reducing the grammar to:
grammar DebugInternalGrammar;
RULE_STRING :
'"' (
(~ (
'\\' |
'"'
))
)* '"'
;
matches "L" (I assume this is what it means when the parse tree in ANTLRworks shows <epsilon> as leaf).
What is wrong here? This is not the part of the grammar which caused me trouble before, so I am scratching my head as to what the problem could be and what ANTLRworks is trying to tell me.
Sorry if any terminology is off, just started using antlr recently.
Here's the antlr grammar that ignores multi-line comments:
COMMENT : '/*' .* '*/';
SPACE : (' ' | '\t' | '\r' | '\n' | COMMENT)+ {$channel = HIDDEN;} ;
Here's a comment beginning at the first character of a file I'd like to compile:
/*
This is a comment
*/
Here's the error I get:
[filename] line 252:0 no viable alternative at character '<EOF>'
[filename] line 1:1 no viable alternative at input '*'
However, if I put a space in front of the comment, like so:
 /*
This is a comment
*/
It compiles fine. Any ideas?
To ignore multi-line comments:
ML_COMMENT
: '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
EDIT:
Maybe the problem is not in your lexer but in your parser. With $channel=HIDDEN in the lexer, you are telling it not to pass these tokens to the parser. That is why the parser finds EOF first: you are sending it nothing!
If you put a whitespace character first, the parser receives something and is able to process the input...
That should be your issue!
I hope this helps!
I currently have a multiline comment lexer rule in antlr which looks like:
MULTILINE: '/*' .* '*/' {$channel=HIDDEN;} ;
However, this currently allows things like:
/* /* hello */ */
Is there any possible way to disable nesting comments in antlr? I've tried various things like
MULTILINE: '/*' (~(MULTILINE)|.*) '*/' {$channel=HIDDEN;} ;
But that doesn't work. Any help would be much appreciated!
No, that is not correct: .* and .+ are not greedy.
Given the parser generated by the following grammar:
grammar T;
parse
: (t=. {System.out.printf("\%-15s'\%s'\n", tokenNames[$t.type], $t.text);} )* EOF
;
MULTILINE
: '/*' .* '*/' {$channel=HIDDEN;}
;
OTHER
: .
;
the input "/* /* hello */ */" would produce the following on your command line:
OTHER ' '
OTHER '*'
OTHER '/'
I.e., "/* /* hello */" is being put on the HIDDEN channel, and 3 OTHER tokens are constructed.
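The same split can be reproduced with an explicitly non-greedy regular expression. This is a Python sketch for illustration only (not ANTLR), with names of my own choosing: the match stops at the first */, and the three leftover characters correspond to the three OTHER tokens above.

```python
import re

source = "/* /* hello */ */"

# Explicitly non-greedy body: stop at the first "*/".
non_greedy_comment = re.compile(r"/\*.*?\*/", re.DOTALL)

m = non_greedy_comment.match(source)
hidden = m.group()                 # what would go on the HIDDEN channel
leftover = list(source[m.end():])  # what remains, one character per OTHER token

print(repr(hidden))
print(leftover)
```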
Try this:
With this rule, neither the prefix nor the suffix can be recognized inside the comment body, so nesting is not allowed.
COMMENT_NON_NEST
: '/*'
( ('/'|'*'+)? ~[*/] )*?
('/'|'*'+?)?
'*/'
{$channel=HIDDEN;}
;
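To get a feel for what this rule accepts, here is a rough Python translation. It assumes each sub-expression maps one-to-one onto regex syntax (~[*/] becomes [^*/]), and the name comment_non_nest is mine. fullmatch is used so the check is about which strings the pattern describes, not about how ANTLR's lexer picks tokens.

```python
import re

# '/*' ( ('/'|'*'+)? ~[*/] )*? ('/'|'*'+?)? '*/'
comment_non_nest = re.compile(r"/\*(?:(?:/|\*+)?[^*/])*?(?:/|\*+?)?\*/")

def is_comment(text: str) -> bool:
    """True if the whole string is one comment according to the pattern."""
    return comment_non_nest.fullmatch(text) is not None

plain = is_comment("/* hello */")
stars = is_comment("/* a\n** b */")
nested_open = is_comment("/* /* hello */")
nested_both = is_comment("/* /* hello */ */")
early_close = is_comment("/* a */ */")

print(plain, stars, nested_open, nested_both, early_close)
```

In this sketch the plain and multi-line comments are accepted, while the inputs containing a second /* or an early */ are rejected as a whole.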