ANTLR mismatched token in simple grammar - antlr3

I am currently debugging my grammar in ANTLRworks, and reduced it far more than is reasonable to this:
grammar DebugInternalGrammar;
RULE_STRING :
'"' (
('\\' .) |
(~ (
'\\' |
'"'
))
)* '"'
;
Which, when testing in the interpreter against the String
"L"
just yields
MismatchedTokenException(76!=34)
Matching "" does work. Also, reducing the grammar to:
grammar DebugInternalGrammar;
RULE_STRING :
'"' (
(~ (
'\\' |
'"'
))
)* '"'
;
matches "L" (I assume this is what it means when the parse tree in ANTLRworks shows <epsilon> as leaf).
What is wrong here? This is not the part of the grammar which caused me trouble before, so I am scratching my head as to what the problem could be and what ANTLRworks is trying to tell me.
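A small aside on reading the exception text. Assuming the two numbers in MismatchedTokenException(76!=34) are character codes, read as "found != expected" (an assumption that is consistent with the input but is not stated by the message itself), they can be decoded with a few lines of Python. The helper name decode_mismatch is made up for this sketch.

```python
import re

def decode_mismatch(message):
    """Decode 'MismatchedTokenException(A!=B)' into two characters.

    Assumption: A and B are character codes, A being the character found
    and B the character the lexer expected.
    """
    found, expected = map(int, re.search(r"\((\d+)!=(\d+)\)", message).groups())
    return chr(found), chr(expected)

print(decode_mismatch("MismatchedTokenException(76!=34)"))
```

Under that reading, the lexer saw the L at a point where it only accepted a double quote.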

Related

Gocc to ignore things in lexical parser

Are there ways to tell gocc to ignore things in the lexical parser? E.g., for
2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM
I want to tell gocc to ignore everything from [21] all the way through "until". Here is what I've been trying:
/* Lexical part */
_letter : 'A'-'Z' | 'a'-'z' | '_' ;
_digit : '0'-'9' ;
_timestamp1 : _digit | ' ' | ':' | '-' | '.' ;
_timestamp2 : _digit | ' ' | ':' | '/' | 'A' | 'P' | 'M' ;
_ignore : '[' { . } ' ' '-' ' ' 'M' 'Y' 'K' 'W' ' ' '-' ' ' ;
_lineend : [ '\r' ] '\n' ;
timestamp : _timestamp1 { _timestamp1 } _ignore ;
taskLogStart : 'S' 't' 'a' 'r' 't' ' ' ;
jobName : { . } _timestamp2 { _timestamp2 } _lineend ;
/* Syntax part */
Log
: timestamp taskLogStart jobName ;
However, the parser failed with:
error: expected timestamp; got: unknown/invalid token "2022-01-18 11:33:21.9885 [21] T"
I think it should work because the following ignore rules work perfectly fine for skipping comments:
!lineComment : '/' '/' { . } '\n' ;
!blockComment : '/' '*' { . | '*' } '*' '/' ;
and I'm just applying the same approach to my normal text parsing.
It doesn't work that way.
The EBNF looks very much like a regular expression, but it does not work like one at all. Take the line
2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, until - MYKW - Start Active One: 1/18/2022 11:33:21 AM
With a regular expression, it can simply be matched by:
([0-9.: -]+).*? - MYKW - Start ([^:]+):.*$
However, that cannot be translated directly into an EBNF definition, because the regular expression relies on the context between its elements to pinpoint a match (e.g., the .*? is a local rule that only works based on the context it is in), whereas gocc is an LR parser generator, which works with context-free grammars.
Basically, context-free here means that each lexical symbol can be considered a global rule that is not affected by the context it is in. A { . } inside a token therefore just keeps matching any character; neither the preceding context nor the symbol following it is involved in the next match. That's the reason why the OP's grammar fails.
For a real sample of how the '{.}' can be used, see
How to describe this event log in formal BNF?
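For comparison, the regular expression quoted in this answer can be tried directly. This is a minimal sketch in Python rather than gocc, and the function name parse_log_line is made up for illustration. It shows the context-dependent behaviour the answer describes: the lazy .*? stops exactly where " - MYKW - Start " begins.

```python
import re

LINE = ("2022-01-18 11:33:21.9885 [21] These are strings that I need to ignore, "
        "until - MYKW - Start Active One: 1/18/2022 11:33:21 AM")

# The pattern quoted in the answer, unchanged.
PATTERN = re.compile(r"([0-9.: -]+).*? - MYKW - Start ([^:]+):.*$")

def parse_log_line(line):
    """Return (timestamp, job_name), or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    # The first group also swallows the space before "[21]", so strip it.
    return m.group(1).strip(), m.group(2)

print(parse_log_line(LINE))
```

The lazy quantifier only works because the regex engine backtracks against what follows it, which is the part that has no direct counterpart in a gocc token definition.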

why does a comma "," get counted in [.] type expression in antlr lexer

I am making a grammar for bash scripts. I am facing a problem while tokenising the "," symbol. The following grammar tokenises it as <BLOB> while I expect it to be tokenised as <OTHER>.
grammar newgram;
code : KEY (BLOB)+ (EOF | '\n')+;
KEY : 'wget';
BLOB : [a-zA-Z0-9#!$^%*&+-.]+?;
OTHER : .;
However, if I change BLOB to [a-zA-Z0-9#!$^%*&+.-]+?;, then it is tokenised as <OTHER>.
I cannot understand why this happens.
In the former case, the characters : and / are also tokenised as <OTHER>, so I do not see a reason for , to be marked <BLOB>.
The input I am tokenising: wget -o --quiet https,://www.google.com
The output I am receiving with the grammar above:
[#0,0:3='wget',<'wget'>,1:0]
[#1,4:4=' ',<OTHER>,1:4]
[#2,5:5='-',<BLOB>,1:5]
[#3,6:6='o',<BLOB>,1:6]
[#4,7:7=' ',<OTHER>,1:7]
[#5,8:8='-',<BLOB>,1:8]
[#6,9:9='-',<BLOB>,1:9]
[#7,10:10='q',<BLOB>,1:10]
[#8,11:11='u',<BLOB>,1:11]
[#9,12:12='i',<BLOB>,1:12]
[#10,13:13='e',<BLOB>,1:13]
[#11,14:14='t',<BLOB>,1:14]
[#12,15:15=' ',<OTHER>,1:15]
[#13,16:16='h',<BLOB>,1:16]
[#14,17:17='t',<BLOB>,1:17]
[#15,18:18='t',<BLOB>,1:18]
[#16,19:19='p',<BLOB>,1:19]
[#17,20:20='s',<BLOB>,1:20]
[#18,21:21=',',<BLOB>,1:21]
[#19,22:22=':',<OTHER>,1:22]
[#20,23:23='/',<OTHER>,1:23]
[#21,24:24='/',<OTHER>,1:24]
[#22,25:25='w',<BLOB>,1:25]
[#23,26:26='w',<BLOB>,1:26]
[#24,27:27='w',<BLOB>,1:27]
[#25,28:28='.',<BLOB>,1:28]
[#26,29:29='g',<BLOB>,1:29]
[#27,30:30='o',<BLOB>,1:30]
[#28,31:31='o',<BLOB>,1:31]
[#29,32:32='g',<BLOB>,1:32]
[#30,33:33='l',<BLOB>,1:33]
[#31,34:34='e',<BLOB>,1:34]
[#32,35:35='.',<BLOB>,1:35]
[#33,36:36='c',<BLOB>,1:36]
[#34,37:37='o',<BLOB>,1:37]
[#35,38:38='m',<BLOB>,1:38]
[#36,39:39='\n',<'
'>,1:39]
[#37,40:39='<EOF>',<EOF>,2:0]
line 1:4 extraneous input ' ' expecting BLOB
line 1:7 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:15 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:22 extraneous input ':' expecting {<EOF>, '
', BLOB}
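As already mentioned in a comment, the - in +-. inside your character class is interpreted as a range operator. The range from + (0x2B) to . (0x2E) covers the four characters + , - . and so the , is inside that range. In your second version the - comes last in the class, where it cannot form a range and is taken literally. Escape it like this: [a-zA-Z0-9#!$^%*&+\-.]+?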
Also, a trailing [ ... ]+? at the end of a lexer rule will always match a single character. So [a-zA-Z0-9#!$^%*&+\-.]+? can just as well be written as [a-zA-Z0-9#!$^%*&+\-.]
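Both points can be checked outside ANTLR. The sketch below uses Python's re module, whose character classes treat - the same way in this respect; it is an analogy, not ANTLR's lexer, and the variable names are made up for illustration.

```python
import re

unescaped = re.compile(r"[a-zA-Z0-9#!$^%*&+-.]")  # '+-.' is read as a range
escaped = re.compile(r"[a-zA-Z0-9#!$^%*&+\-.]")   # '\-' is a literal hyphen
trailing = re.compile(r"[a-zA-Z0-9#!$^%*&+.-]")   # '-' placed last is literal too

# Characters covered by the accidental range '+' .. '.'
in_range = [chr(c) for c in range(ord("+"), ord(".") + 1)]
print(in_range)

for name, pattern in [("unescaped", unescaped),
                      ("escaped", escaped),
                      ("trailing", trailing)]:
    print(name, "matches ',':", bool(pattern.fullmatch(",")))

# A lazy '+?' with nothing after it is satisfied by a single character.
first = re.match(r"[a-z]+?", "wget").group()
print(first)
```

Only the unescaped class accepts the comma, and the lazy quantifier stops after one character, which matches the one-character BLOB tokens in the output above.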

Does the compile error for the 'for' syntax in golang use incorrect terminology?

I tried to use something like
for i := 0; i < len(bytes); ++i {
...
}
This is not valid, and I got the error
syntax error: unexpected ++, expecting expression
I thought this was because ++i is not an expression.
Then I found out that, according to the documentation, i++ (which does work in a for loop) is not an expression either.
I also learned that in some cases (I now think in all cases) a statement cannot be used in place of an expression.
Coming back to the error, it says that the for loop requires an expression, which confused me. I checked another part of the documentation, and it turns out that for requires a statement:
For statements with for clause
A "for" statement with a ForClause is also controlled by its
condition, but additionally it may specify an init and a post
statement
I started with this question (which I liked more than the final one, because I thought it was about my unfamiliarity with the language):
Is it a special case of the for loop syntax that statements are accepted as expressions, or are there other such cases in golang?
While writing the question and checking the documentation, I ended up with this question:
Does the error description use incorrect terminology that should be fixed to avoid confusion? Or is it normal in some cases to use terms such as statement and expression interchangeably?
The Go Programming Language Specification
Primary expressions
Primary expressions are the operands for unary and binary expressions.
PrimaryExpr =
Operand |
Conversion |
PrimaryExpr Selector |
PrimaryExpr Index |
PrimaryExpr Slice |
PrimaryExpr TypeAssertion |
PrimaryExpr Arguments .
Selector = "." identifier .
Index = "[" Expression "]" .
Slice = "[" [ Expression ] ":" [ Expression ] "]" |
"[" [ Expression ] ":" Expression ":" Expression "]" .
TypeAssertion = "." "(" Type ")" .
Arguments = "(" [ ( ExpressionList | Type [ "," ExpressionList ] ) [ "..." ] [ "," ] ] ")" .
Operators and punctuation
The following character sequences represent operators:
++
--
Operators
Operators combine operands into expressions.
Expression = UnaryExpr | Expression binary_op Expression .
UnaryExpr = PrimaryExpr | unary_op UnaryExpr .
binary_op = "||" | "&&" | rel_op | add_op | mul_op .
rel_op = "==" | "!=" | "<" | "<=" | ">" | ">=" .
add_op = "+" | "-" | "|" | "^" .
mul_op = "*" | "/" | "%" | "<<" | ">>" | "&" | "&^" .
unary_op = "+" | "-" | "!" | "^" | "*" | "&" | "<-" .
Operator precedence
The ++ and -- operators form statements, not expressions.
IncDec statements
The "++" and "--" statements increment or decrement their operands by
the untyped constant 1. As with an assignment, the operand must be
addressable or a map index expression.
IncDecStmt = Expression ( "++" | "--" ) .
++ and -- are operators. The ++ and -- operators form statements, not expressions.
IncDecStmt = Expression ( "++" | "--" ) .
When the compiler encounters a ++ operator, it expects it to be immediately preceded by an expression.
For example,
package main
func main() {
// syntax error: unexpected ++, expecting expression
for i := 0; i < 1; ++i {}
}
Playground: https://play.golang.org/p/y2d9ijeMdw
Output:
main.go:6:21: syntax error: unexpected ++, expecting expression
The compiler complains about the syntax. It found a ++ operator without an immediately preceding expression: syntax error: unexpected ++, expecting expression.
The Go Spec says the post statement of a for clause accepts (among other things) an IncDec statement.
The IncDec statement is defined as: IncDecStmt = Expression ( "++" | "--" ) .
The parser finds an IncDec statement but an empty expression and thus spits out the error "expecting expression".
Edit: this probably fails because the fallback node to parse for a SimpleStmt is an expression. The IncDecStmt failed, so it moves on to the default. The error accurately reflects the latest error that is bubbled up.
While the error message is correct, it is a little bit misleading. However, fixing it would involve passing more context about the current tree being parsed, e.g.: bad ForClause: bad PostStmt: bad SimpleStmt: expected expression.
There's still the problem that the expected expression is the last error encountered. Before that, it failed to parse the IncDecStmt but that error is swallowed because it falls back on an expression. The same applies at higher levels of the tree.
Even without that problem it would be rather heavy-handed and probably even more confusing than the current error messages. You may want to ask for input from the Go folks though.
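One way to picture why the message mentions an expression is a toy recursive-descent parser. The sketch below is written in Python and is purely hypothetical: it is not the Go compiler's parser, the function names are invented, and an "expression" is reduced to a single identifier. It only mirrors the grammar rule IncDecStmt = Expression ( "++" | "--" ), in which the expression comes first.

```python
class ParseError(Exception):
    pass

def parse_expression(tokens, pos):
    """Toy expression: one identifier. Returns the position after it."""
    if pos < len(tokens) and tokens[pos].isidentifier():
        return pos + 1
    found = tokens[pos] if pos < len(tokens) else "EOF"
    raise ParseError(f"syntax error: unexpected {found}, expecting expression")

def parse_simple_stmt(tokens, pos=0):
    """Toy SimpleStmt: Expression [ '++' | '--' ]."""
    # Every alternative handled here starts with an expression, so that is
    # what the parser asks for before it ever looks for '++' or '--'.
    pos = parse_expression(tokens, pos)
    if pos < len(tokens) and tokens[pos] in ("++", "--"):
        return "IncDecStmt", pos + 1
    return "ExpressionStmt", pos

print(parse_simple_stmt(["i", "++"]))
try:
    parse_simple_stmt(["++", "i"])
except ParseError as err:
    print(err)
```

In this toy, the leading ++ is rejected by the expression step, so the reported problem is a missing expression rather than a malformed statement, which is the shape of the message discussed above.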

ANTLR3 not ignoring comments that begin at the first character of a file

Sorry if any terminology is off, just started using antlr recently.
Here's the antlr grammar that ignores multi-line comments:
COMMENT : '/*' .* '*/';
SPACE : (' ' | '\t' | '\r' | '\n' | COMMENT)+ {$channel = HIDDEN;} ;
Here's a comment beginning at the first character of a file I'd like to compile:
/*
This is a comment
*/
Here's the error I get:
[filename] line 252:0 no viable alternative at character '<EOF>'
[filename] line 1:1 no viable alternative at input '*'
However, if I put a space in front of the comment, like so:
 /*
This is a comment
*/
It compiles fine. Any ideas?
To ignore multi-line comments:
ML_COMMENT
: '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
EDIT:
Maybe the cause is not your lexer but your parser. With $channel=HIDDEN, the lexer is told not to pass these elements on to the parser. This is why the parser finds EOF first: you are sending it nothing!
If you write a whitespace as the first character, the parser receives something and is able to process the input...
This should be your issue!
I hope this helps!
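The effect of options {greedy=false;} in the ML_COMMENT rule can be illustrated with an ordinary regex. The sketch below uses Python's re module and is only an analogy for greedy versus non-greedy matching in general; it does not reproduce how the ANTLR 3 lexer resolves the rule. The names SOURCE and strip_comments are made up for this example.

```python
import re

SOURCE = "/* first */ a = 1; /* second */ b = 2;"

greedy = re.compile(r"/\*.*\*/", re.DOTALL)   # runs on to the LAST '*/'
lazy = re.compile(r"/\*.*?\*/", re.DOTALL)    # stops at the FIRST '*/'

def strip_comments(pattern, text):
    """Remove everything the comment pattern matches."""
    return pattern.sub("", text)

print(repr(strip_comments(greedy, SOURCE)))
print(repr(strip_comments(lazy, SOURCE)))
```

The greedy pattern swallows the code between the two comments, while the non-greedy one removes each comment separately, including the one that starts at the first character of the input.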

ANTLR comment problem

I am trying to write a comment matching rule in ANTLR, which is currently the following:
LINE_COMMENT
: '--' (options{greedy=false;}: .)* NEWLINE {Skip();}
;
NEWLINE : '\r'|'\n'|'\r\n' {Skip();};
This code works fine except when a comment is the last thing in a file, in which case it throws a NoViableAlt exception. How can I fix this?
Why not:
LINE_COMMENT : '--' (~ NEWLINE)* ;
fragment NEWLINE : '\r' '\n'? | '\n' ;
If you haven't come across this yet, lexical rules (all uppercase) can only consist of constants and tokens, not other lexemes. You need a parser rule for that.
I'd go for:
LINE_COMMENT
: '--' ~( '\r' | '\n' )* {Skip();}
;
NEWLINE
: ( '\r'? '\n' | '\r' ) {Skip();}
;
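The difference between "match up to a newline" and "match everything that is not a newline" shows up exactly at end of file. The sketch below models both approaches as Python regular expressions; it is an analogy for the two rule shapes, not ANTLR itself, and the pattern names are invented.

```python
import re

# Shape of the original rule: the comment must be closed by a newline.
needs_newline = re.compile(r"--.*?(?:\r\n|\r|\n)")
# Shape of the suggested rule: consume everything that is not a newline.
stops_before_newline = re.compile(r"--[^\r\n]*")

with_newline = "SELECT 1 -- trailing comment\n"
at_eof = "SELECT 1 -- trailing comment"

for text in (with_newline, at_eof):
    print(bool(needs_newline.search(text)),
          repr(stops_before_newline.search(text).group()))
```

The first pattern finds nothing when the input ends right after the comment, while the second matches in both cases and leaves the newline, if any, to a separate rule.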

Resources