Is there a way to tell gocc to ignore things in the lexical part? E.g., for
2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM
I want to tell gocc to ignore everything from [21] all the way through "until - MYKW - ". Here is what I've been trying:
/* Lexical part */
_letter : 'A'-'Z' | 'a'-'z' | '_' ;
_digit : '0'-'9' ;
_timestamp1 : _digit | ' ' | ':' | '-' | '.' ;
_timestamp2 : _digit | ' ' | ':' | '/' | 'A' | 'P' | 'M' ;
_ignore : '[' { . } ' ' '-' ' ' 'M' 'Y' 'K' 'W' ' ' '-' ' ' ;
_lineend : [ '\r' ] '\n' ;
timestamp : _timestamp1 { _timestamp1 } _ignore ;
taskLogStart : 'S' 't' 'a' 'r' 't' ' ' ;
jobName : { . } _timestamp2 { _timestamp2 } _lineend ;
/* Syntax part */
Log
: timestamp taskLogStart jobName ;
However, the parser failed at:
error: expected timestamp; got: unknown/invalid token "2022-01-18 11:33:21.9885 [21] T"
The reason I think it should work is that the following ignore rules work perfectly fine for comments:
!lineComment : '/' '/' { . } '\n' ;
!blockComment : '/' '*' { . | '*' } '*' '/' ;
and I'm just applying the same idea to my normal text parsing.
It doesn't work that way.
The EBNF looks very much like a regular expression, but it does not work like one at all. What I mean is this.
Take the line
2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM
With a regular expression, it can be matched simply with:
([0-9.: -]+).*? - MYKW - Start ([^:]+):.*$
However, that cannot be translated directly into an EBNF definition, because the regular expression relies on the context between its elements to pinpoint a match (e.g., the .*? rule is a local rule that only works because of the context it is in), whereas gocc generates an LR parser for a context-free grammar.
Basically, context-free here means that each lexical symbol acts as a global rule that is not affected by the context it appears in, so a .* style match is tried against all existing lexical symbols every time. I cannot describe it precisely, but neither the preceding context nor the symbol that follows is involved in the next match. That is why the OP's grammar fails.
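To make the contrast concrete, here is a minimal sketch that applies the regular expression above to the sample line. Python's re module is used purely for illustration (it has nothing to do with gocc), and the function name parse_log_line is made up for this example. The lazy .*? only finds its stopping point because the regex engine backtracks against the literal that follows it.

```python
import re

LINE = ("2022-01-18 11:33:21.9885 [21] These are strings that I need to "
        "ignore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM")

# Same pattern as in the answer: timestamp, lazily skipped text, job name.
PATTERN = re.compile(r"([0-9.: -]+).*? - MYKW - Start ([^:]+):.*$")


def parse_log_line(line):
    """Return (timestamp, job_name), or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    # The character class also swallows the space before "[21]", so strip it.
    return m.group(1).strip(), m.group(2)


if __name__ == "__main__":
    print(parse_log_line(LINE))
```

A gocc grammar has to express the same split as separate tokens, each of which must be recognisable on its own.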
For a real sample of how the '{.}' can be used, see
How to describe this event log in formal BNF?
I am making a grammar for bash scripts. I am facing a problem while tokenising the "," symbol. The following grammar tokenises it as <BLOB> while I expect it to be tokenised as <OTHER>.
grammar newgram;
code : KEY (BLOB)+ (EOF | '\n')+;
KEY : 'wget';
BLOB : [a-zA-Z0-9#!$^%*&+-.]+?;
OTHER : .;
However, if I change BLOB to [a-zA-Z0-9#!$^%*&+.-]+?;, then it is tokenised as <OTHER>.
I cannot understand why this happens.
In the former case, the characters : and / are also tokenised as <OTHER>, so I do not see a reason for , to be marked <BLOB>.
The input I am tokenising: wget -o --quiet https,://www.google.com
The output I receive with the grammar above:
[#0,0:3='wget',<'wget'>,1:0]
[#1,4:4=' ',<OTHER>,1:4]
[#2,5:5='-',<BLOB>,1:5]
[#3,6:6='o',<BLOB>,1:6]
[#4,7:7=' ',<OTHER>,1:7]
[#5,8:8='-',<BLOB>,1:8]
[#6,9:9='-',<BLOB>,1:9]
[#7,10:10='q',<BLOB>,1:10]
[#8,11:11='u',<BLOB>,1:11]
[#9,12:12='i',<BLOB>,1:12]
[#10,13:13='e',<BLOB>,1:13]
[#11,14:14='t',<BLOB>,1:14]
[#12,15:15=' ',<OTHER>,1:15]
[#13,16:16='h',<BLOB>,1:16]
[#14,17:17='t',<BLOB>,1:17]
[#15,18:18='t',<BLOB>,1:18]
[#16,19:19='p',<BLOB>,1:19]
[#17,20:20='s',<BLOB>,1:20]
[#18,21:21=',',<BLOB>,1:21]
[#19,22:22=':',<OTHER>,1:22]
[#20,23:23='/',<OTHER>,1:23]
[#21,24:24='/',<OTHER>,1:24]
[#22,25:25='w',<BLOB>,1:25]
[#23,26:26='w',<BLOB>,1:26]
[#24,27:27='w',<BLOB>,1:27]
[#25,28:28='.',<BLOB>,1:28]
[#26,29:29='g',<BLOB>,1:29]
[#27,30:30='o',<BLOB>,1:30]
[#28,31:31='o',<BLOB>,1:31]
[#29,32:32='g',<BLOB>,1:32]
[#30,33:33='l',<BLOB>,1:33]
[#31,34:34='e',<BLOB>,1:34]
[#32,35:35='.',<BLOB>,1:35]
[#33,36:36='c',<BLOB>,1:36]
[#34,37:37='o',<BLOB>,1:37]
[#35,38:38='m',<BLOB>,1:38]
[#36,39:39='\n',<'
'>,1:39]
[#37,40:39='<EOF>',<EOF>,2:0]
line 1:4 extraneous input ' ' expecting BLOB
line 1:7 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:15 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:22 extraneous input ':' expecting {<EOF>, '
', BLOB}
As already mentioned in a comment, the - in +-. inside your character class is interpreted as a range operator, so +-. means every character from + (0x2B) to . (0x2E), and , (0x2C) is inside that range. Escape it like this: [a-zA-Z0-9#!$^%*&+\-.]+?
Also, a trailing [ ... ]+? at the end of a lexer rule will always match a single character. So [a-zA-Z0-9#!$^%*&+\-.]+? can just as well be written as [a-zA-Z0-9#!$^%*&+\-.]
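Both points can be checked outside ANTLR. The sketch below uses Python's re module, which happens to treat an unescaped - inside a character class as a range in the same way; it is only an analogy, not ANTLR itself, and the helper name chars_in_range is made up for this example.

```python
import re

# Unescaped: "+-." is a range from '+' (0x2B) to '.' (0x2E).
UNESCAPED = re.compile(r"[a-zA-Z0-9#!$^%*&+-.]")
# Escaped: '+', '-' and '.' are three separate literal characters.
ESCAPED = re.compile(r"[a-zA-Z0-9#!$^%*&+\-.]")


def chars_in_range(first, last):
    """List every character between first and last, inclusive."""
    return [chr(c) for c in range(ord(first), ord(last) + 1)]


if __name__ == "__main__":
    print(chars_in_range("+", "."))           # what the range covers
    print(bool(UNESCAPED.fullmatch(",")))     # comma is accepted
    print(bool(ESCAPED.fullmatch(",")))       # comma is rejected
    # A lazy quantifier with nothing after it stops after one character.
    print(re.match(r"[a-z]+?", "https").group())
```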
The POSIX shell standard
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html
says
The format for the case construct is as follows:
case word in
[(]pattern1) compound-list;;
[[(]pattern[ | pattern] ... ) compound-list;;] ...
[[(]pattern[ | pattern] ... ) compound-list]
esac
which seems to indicate the first compound list pattern is special - there can be only one, no alternatives denoted by | as for the other ones (and reading the POSIX standards elsewhere, pattern itself does not support alternatives).
I tried it with the latest dash, and it seems to work:
$ case foobar in
( foo* | *bar ) echo OK
esac
OK
There is no mention of "behavior is unspecified". So if the shell did not support that, it should emit an error message.
It cannot be a typo in the standard - too many characters involved.
So clearly I am not understanding something. Does POSIX shell support alternatives among patterns for the first compound list of a case construct, and how is it documented?
The canonical grammar is given in the standard:
case_clause  : Case WORD linebreak in linebreak case_list    Esac
             | Case WORD linebreak in linebreak case_list_ns Esac
             | Case WORD linebreak in linebreak              Esac
             ;
case_list_ns : case_list case_item_ns
             |           case_item_ns
             ;
case_list    : case_list case_item
             |           case_item
             ;
case_item_ns :     pattern ')'               linebreak
             |     pattern ')' compound_list linebreak
             | '(' pattern ')'               linebreak
             | '(' pattern ')' compound_list linebreak
             ;
case_item    :     pattern ')' linebreak     DSEMI linebreak
             |     pattern ')' compound_list DSEMI linebreak
             | '(' pattern ')' linebreak     DSEMI linebreak
             | '(' pattern ')' compound_list DSEMI linebreak
             ;
pattern      :             WORD         /* Apply rule 4 */
             | pattern '|' WORD         /* Do not apply rule 4 */
             ;
Note that pattern, which allows |s, is used in every position.
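If you want to check a particular shell against this, a small script can feed it a case statement whose first item uses alternatives. This is only a sketch: it assumes a POSIX-style sh is on the PATH, and the helper name run_sh is made up for this example.

```python
import subprocess

# The first (and only) case item uses an alternation of two patterns.
SCRIPT = """case foobar in
  ( foo* | *bar ) echo OK ;;
esac
"""


def run_sh(script):
    """Run script with the system sh and return (exit_status, stdout)."""
    result = subprocess.run(["sh", "-c", script],
                            capture_output=True, text=True)
    return result.returncode, result.stdout


if __name__ == "__main__":
    print(run_sh(SCRIPT))
```

A shell that follows the grammar accepts the script and prints OK, since the first item goes through the same pattern production as every other item.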
Sorry if any terminology is off; I just started using ANTLR recently.
Here's the antlr grammar that ignores multi-line comments:
COMMENT : '/*' .* '*/';
SPACE : (' ' | '\t' | '\r' | '\n' | COMMENT)+ {$channel = HIDDEN;} ;
Here's a comment beginning at the first character of a file I'd like to compile:
/*
This is a comment
*/
Here's the error I get:
[filename] line 252:0 no viable alternative at character '<EOF>'
[filename] line 1:1 no viable alternative at input '*'
However, if I put a space in front of the comment, like so:
 /*
This is a comment
*/
It compiles fine. Any ideas?
To ignore multi-line comments:
ML_COMMENT
: '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
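The effect of greedy=false can be seen in any regex engine. The sketch below uses Python's re module as a stand-in (an analogy only, not ANTLR), with a made-up helper find_comments: the greedy pattern runs to the last */ in the input, while the non-greedy one stops at the first.

```python
import re

SOURCE = "/* first */ int x; /* second */"

# Greedy: .* runs to the LAST "*/", swallowing the code in between.
GREEDY = re.compile(r"/\*.*\*/", re.DOTALL)
# Non-greedy: .*? stops at the FIRST "*/", like greedy=false in the rule.
LAZY = re.compile(r"/\*.*?\*/", re.DOTALL)


def find_comments(pattern, text):
    """Return every substring of text that pattern matches."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(find_comments(GREEDY, SOURCE))
    print(find_comments(LAZY, SOURCE))
```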
EDIT:
Maybe the problem is not in your lexer but in your parser. With $channel=HIDDEN the lexer is told not to pass these elements on to the parser. That is why the parser finds EOF first: nothing is being sent to it.
If you put a whitespace character first, the parser receives something and is able to process the input.
That should be your issue.
I hope this helps!
I am trying to write a comment matching rule in ANTLR, which is currently the following:
LINE_COMMENT
: '--' (options{greedy=false;}: .)* NEWLINE {Skip();}
;
NEWLINE : '\r'|'\n'|'\r\n' {Skip();};
This code works fine except when a comment is the last thing in a file, in which case it throws a NoViableAlt exception. How can I fix this?
Why not:
LINE_COMMENT : '--' (~ NEWLINE)* ;
fragment NEWLINE : '\r' '\n'? | '\n' ;
If you haven't come across this yet, lexical rules (all uppercase) can only consist of constants and tokens, not other lexemes. You need a parser rule for that.
I'd go for:
LINE_COMMENT
: '--' ~( '\r' | '\n' )* {Skip();}
;
NEWLINE
: ( '\r'? '\n' | '\r' ) {Skip();}
;
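The reason this version copes with a comment at the end of the file is that it no longer requires a line terminator to finish the match. A rough analogy in Python's re module (not ANTLR; the sample text and names are made up for this example):

```python
import re

TEXT = "a = 1 -- first\nb = 2 -- last, no newline after this"

# Requires a line terminator, like the original LINE_COMMENT rule.
NEEDS_NEWLINE = re.compile(r"--.*?(?:\r\n|\r|\n)")
# Stops before a line terminator but does not require one.
UP_TO_NEWLINE = re.compile(r"--[^\r\n]*")


def find_comments(pattern, text):
    """Return every comment in text that pattern recognises."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(find_comments(NEEDS_NEWLINE, TEXT))
    print(find_comments(UP_TO_NEWLINE, TEXT))
```

The first pattern misses the final comment because there is no newline left to consume; the second finds both.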