Unexpected behavior with ANTLR3 - antlr3

I am experiencing an unexpected behavior with ANTLR3. This is my grammar:
grammar Onto;
// parser rules
predicate
: VERB
;
// lexer rules
VERB
: 'VB' WS
;
PREPOSITION
: 'TO' WS
;
WS
: (' ' | '\t' | '\r'| '\n')
;
When I parse the string "VB TO", ANTLR3 exits without flagging an error. This is unexpected because the given string does not match any rule in the grammar.
However, when I retry after removing the PREPOSITION rule from the grammar, ANTLR3 reports the following errors, which is the expected result:
line 1:3 no viable alternative at character 'T'
line 1:4 no viable alternative at character 'O'

This is a classic mistake: your main rule has no EOF at the end, so the parser is allowed to match only a prefix of the input and treat that as a successful parse. In your case it matches VERB and then expects nothing more. The PREPOSITION rule is part of the behavior because it lets the lexer turn your "TO" input into a PREPOSITION token and hand it to the parser. But since the parser is already satisfied by the VERB input, it considers the parse finished successfully. To force the parser to consume the whole input, end the rule with EOF, i.e. predicate : VERB EOF ;
Without the PREPOSITION lexer rule, however, the lexer cannot match that input and reports an error, which is what the "no viable alternative at character" messages above are about.


shell case pattern [...] disallowed chars

Bracket expressions within case patterns seem to disallow [() &;]. However I can't seem to find any such restrictions (or escaping workarounds) in the POSIX shell spec, or in the bash manual for that matter.
case '&' in
# *[&]*) echo y ;; # won't parse
*[\&]*) echo y ;; # will parse & work
esac
# similar for ';', ' ', '(', ')'
# not a problem for ${var#[&; ()]}
This is in a sh shell script function that can't afford to call external utilities (but I'm curious about bash too). So... is there any spec that describes backslash-ing these characters within a bracket expression pattern?

No, I don't think it is explicitly documented anywhere.
But it can be deduced that Token Recognition Rule 6 is applied while the pattern list is being parsed. That is, unless quoted, control operators, redirection operators, and end of input are recognized as operators, and delimit a pattern. The shell expects | (indicates that another pattern follows) or ) (marks the end of the pattern list) to do that; and anything else causes a parse error.
As square brackets have no special meaning to the parser during tokenization, whether an operator occurs between them is irrelevant. And ${var#[&; ()]} is a different case; covered in Token Recognition Rule 5 and Parameter Expansion.
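As a rough illustration of the point above, the sketch below shows three ways to keep operator characters from delimiting the pattern: backslash-escaping them, quoting them inside the brackets, and supplying the pattern through a variable (like ${var#...}, the expansion happens after tokenization). The function names are hypothetical and chosen only for this example; it assumes a POSIX-style sh.

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical, not from the question.

# Variant 1: backslash-escape each operator character (and the space)
# inside the bracket expression so the parser never sees an operator.
has_special_bs() {
    case $1 in
        *[\&\;\ \(\)]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Variant 2: quote the characters inside the brackets instead.
has_special_sq() {
    case $1 in
        *['&; ()']*) return 0 ;;
        *) return 1 ;;
    esac
}

# Variant 3: keep the pattern in a variable. The pattern then comes from a
# parameter expansion, which takes place after tokenization.
# Note: plain sh has no local variables, so "pat" is global here.
has_special_var() {
    pat='*[&; ()]*'
    case $1 in
        $pat) return 0 ;;
        *) return 1 ;;
    esac
}

for s in 'a&b' 'a b' 'a(b' 'plain'; do
    if has_special_bs "$s"; then
        echo "match: $s"
    else
        echo "no match: $s"
    fi
done
```

All three variants use only shell built-ins, so they fit the constraint of not calling external utilities.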

Strange behaviour for comments in Antlr4 grammar

Adding a comment line under ID is fine, but adding one under WS causes an error to be raised. The entire file Hello.g4 is listed below.
/**
* Define a grammar called Hello
*/
grammar Hello;
r : 'hello' ID ; // match keyword hello followed by an identifier
ID : [a-z]+ ; // match lower-case identifiers
/**********************************************************************************************/
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
/**********************************************************************************************/
The output I get in the console is as below:
ANTLR Tool v4.4 (/tmp/antlr-4.4-complete.jar)
Hello.g4 -o /home/me/workspace/TestComment/target/generated-sources/antlr4 -listener -no-visitor -encoding UTF-8
error(50): Hello.g4:13:0: syntax error: '<EOF>' came as a complete surprise to me
1 error(s)
BUILD FAIL
Total time: 168 millisecond(s)
Running Eclipse Version: Neon.3 Release (4.6.3), Default ANTLR4 project.
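Why should ANTLR4 care about a trailing comment line?

The ANTLR 4 grammar syntax treats a comment that starts with /** as a JavaDoc-style comment, which is optionally allowed as a header and directly before each rule. The comment line under ID is accepted because the WS rule follows it. No rule follows the last comment line, so it is interpreted as the beginning of a rule that never arrives, hence the complaint about '<EOF>'.
Change the trailing comment line to /*----*/ to avoid the error.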

Strange behaviour of ANTLR3

Why does the grammar presented in this answer, https://stackoverflow.com/a/1932664/5613768, accept an expression like 2(38)? I understand why 12*(5-6) is accepted and why 12*(5-6 is not, but I can't explain this behaviour.

It doesn't accept the entire input. It stops parsing after the 2 because the eval rule:
eval
: additionExp
;
matches 2 as an additionExp and then stops, since the rest of the input cannot be matched.
If you "anchor" the eval rule so that it must consume the entire token stream like this:
eval
: additionExp EOF
;
you will see an error on your console.

XText Validator shows Parse Error in wrong line

I am currently developing a small DSL with the following (shortened) grammar:
grammar mydsl with org.eclipse.xtext.common.Terminals hidden(WS, SL_COMMENT)
generate mydsl "uri::mydsl"
CommandSet:
(commands+=Command)*
;
Command:
(commandName=CommandName LBRACKET (args=ArgumentList)? RBRACKET EOL ) |
;
terminal LBRACKET:
'('
;
terminal RBRACKET:
')'
;
terminal EOL:
';'
;
As you can see, I use a semicolon as an EOL separator, and it works just fine for me. The problem occurs with the built-in syntax validator when working with the DSL in Eclipse. When I omit a semicolon, the validator reports a syntax error on the wrong line.
Is there an error with my grammar? Thanks ;)

Here is a small DSL loosely based on your example. Basically, I no longer treat linebreaks as "hidden" (i.e. they are no longer ignored by the parser); only spaces and tabs are hidden. Note the new terminals MY_WS and MY_NL as well as the modified hidden statement in the grammar header (I also added some comments at the relevant places). This approach just gives you a general idea, and you can experiment with it to achieve what you want. Note that if linebreaks are no longer hidden, you will need to take them into account in your grammar rules.
grammar org.xtext.example.mydsl.MyDsl
with org.eclipse.xtext.common.Terminals
hidden( MY_WS, SL_COMMENT ) // ---> hide whitespaces and comments only, not linebreaks!
generate mydsl "uri::mydsl"
CommandSet:
(commands+=Command)*
;
CommandName:
name=ID
;
ArgumentList:
arguments+=STRING (',' arguments+=STRING)*
;
Command:
(commandName=CommandName LBRACKET (args=ArgumentList)? RBRACKET EOL);
terminal LBRACKET:
'('
;
terminal RBRACKET:
')'
;
terminal EOL:
';' MY_NL? // ---> now an optional linebreak at the end!
;
terminal MY_WS: (' '|'\t')+; // ---> whitespace characters (formerly part of WS)
terminal MY_NL: ('\r'|'\n')+; // ---> linebreak characters (no longer hidden)
Here is an image demonstrating the resulting behavior.

XTEXT: Controlling when whitespace is allowed

I have a custom scripting language, that I am attempting to use XTEXT for syntax checking. It boils down to single line commands in the format
COMMAND:PARAMETERS
For the most part, xtext is working great. The only problem I have currently run into is how to handle wanted (or unwanted) white spaces. The language cannot have a space to begin a line, and there cannot be a space following the colon. As well, I need to allow white space in the parameters, as it could be a string of text, or something similar.
I have used a datatype to allow white space in the parameter:
UNQUOTED_STRING:
(ID | INT | WS | '.' )+
;
This works, but has the side effect of allowing spaces throughout the line.
Does anyone know a way to limit where white spaces are allowed?
Thanks in advance for any advice!

You can disallow whitespace globally for your grammar by using an empty set of hidden tokens, e.g.
grammar org.xyz.MyDsl with org.eclipse.xtext.common.Terminals hidden()
Then you can enable it at specific rules, e.g.
XParameter hidden(WS):
'x' '=' value=ID
;
Note that this would allow linebreaks as well. If you don't want that, you can either pass a custom terminal rule or override the default WS rule.
Here is a more complete example (not perfect):
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals hidden()
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
(commands+=Command '\r'? '\n')+
;
Command:
SampleCommand
;
SampleCommand:
command='get' ':' parameter=Parameter
;
Parameter:
'{' x=XParameter '}'
;
XParameter hidden(WS):
'x' '=' value=ID
;
This will parse commands such as:
get:{x=TEST}
get:{ x = TEST}
But will reject (the first line starts with a space, the second has a space after the colon):
 get:{x=TEST}
get: {x=TEST}
Hope that gives you an idea. You can also do this the other way around and disable whitespace only for certain rules, e.g.
CommandList hidden():
(commands+=Command '\r'? '\n')+
;
Use whichever approach works better for your grammar.
