How to test my ANTLR4 grammar successfully? [duplicate] - java-8

I have started using ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly, if I change TITLE to be TITLE: 'x' ; it still fails, this time giving an error message saying "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace the usage of TITLE in test with FILEPATH, the whole thing works (although FILEPATH will match more than I am looking to match, so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.

This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser, and the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
- all rules whose name starts with an uppercase character are lexer rules
- the lexer starts at the beginning of the input and tries to find the rule that best matches the current input
- a best match is a match of maximum length, i.e. no lexer rule matches a longer stretch of input starting at the current position
- tokens are generated from matches:
  - if only one rule produces the maximum-length match, the corresponding token is pushed into the token stream
  - if multiple rules produce the maximum-length match, the token defined first in the grammar is pushed into the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every string that TITLE matches is also matched by FILEPATH, and FILEPATH is defined before TITLE, so every token that you expect to be a TITLE becomes a FILEPATH.
There are three hints for that:
- keep your lexer rules disjoint (no token should match a superset of another).
- if your tokens intentionally match the same strings, put them in the right order (in your case this will be sufficient).
- if you need a parser-driven lexer, you have to change to another parser generator: PEG parsers or GLR parsers will do that (but of course this can produce other problems).
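To see the longest-match and first-rule-wins behaviour in isolation, here is a minimal Python sketch that imitates it for the rules from the question. It is not ANTLR's implementation; the tokenize helper and the regex translations of the three lexer rules are assumptions made for illustration only.

```python
import re

def tokenize(text, rules):
    """Maximal munch: the longest match wins, the earliest rule breaks ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater, so an earlier rule keeps the token on a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Regex translations of the lexer rules in the question (assumed equivalent).
FILEPATH = r"[A-Za-z0-9:\\/ \-_.]+"
NEWLINE = r"\r?\n"
TITLE = r"[A-Za-z ]+"

original = [("FILEPATH", FILEPATH), ("NEWLINE", NEWLINE), ("TITLE", TITLE)]
reordered = [("TITLE", TITLE), ("NEWLINE", NEWLINE), ("FILEPATH", FILEPATH)]

source = "c:\\test.txt\nx"
print([name for name, _ in tokenize(source, original)])
# ['FILEPATH', 'NEWLINE', 'FILEPATH']
print([name for name, _ in tokenize(source, reordered)])
# ['FILEPATH', 'NEWLINE', 'TITLE']
```

With the original order the final x is a FILEPATH, which the parser rule test cannot accept where it wants a TITLE. With TITLE listed first, the path is still a FILEPATH because that match is longer, and x becomes a TITLE.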

This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. The reason for me was that I had placed the new keyword after my VARNAME lexer rule, which tokenized it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

Related

Jflex ambiguity

I have these two rules from a jflex code:
Bool = true
Ident = [:letter:][:letterdigit:]*
If I try, for example, to analyse the word "trueStat", it gets recognized as an Ident expression and not Bool.
How can I avoid this type of ambiguity in Jflex?
In almost all languages, a keyword is only recognised as such if it is a complete word. Otherwise, you would end up banning identifiers like format, downtime and endurance (which would instead start with the keywords for, do and end, respectively). That's quite confusing for programmers, although it's not unheard-of. Lexical scanner generators like Flex and JFlex generally try to make the common case easy; hence the snippet you provide recognises trueStat as an identifier. But if you really want to recognise it as a keyword followed by an identifier, you can accomplish that by adding trailing context to all your keywords:
Bool = true/[:letterdigit:]*
Ident = [:letter:][:letterdigit:]*
With that pair of patterns, true will match the Bool rule, even if it occurs as trueStat. The pattern matches true and any alphanumeric string immediately following it, and then rewinds the input cursor so that the token matched is just true.
Note that like Lex and Flex, JFlex accepts the longest match at the current input position; if more than one rule accepts this match, the action corresponding to the first such rule is executed. (See the manual section "How the Input is Matched" for a slightly longer explanation of the matching algorithm.) Trailing context is considered part of the match for the purposes of this rule (but, as noted above, is then removed from the match).
The consequence of this rule is that you should always place more specific patterns before the general patterns they might override, whether or not the specific pattern uses trailing context. So the Bool rule must precede the Ident rule.
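The interaction between trailing context, longest match and rule order can be imitated with a small Python sketch. This is not JFlex itself: the tokenize helper is hypothetical, and [A-Za-z0-9] is only an ASCII stand-in for [:letterdigit:].

```python
import re

LETTERDIGIT = "[A-Za-z0-9]"  # ASCII stand-in for JFlex's [:letterdigit:]

def tokenize(text, rules):
    """Longest match wins (trailing context included); first rule breaks ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None  # (length incl. trailing context, token length, rule name)
        for name, pattern, trailing in rules:
            m = re.compile(f"({pattern})({trailing})").match(text, pos)
            if m and m.group(1) and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), len(m.group(1)), name)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        _, token_len, name = best
        tokens.append((name, text[pos:pos + token_len]))
        pos += token_len  # the trailing context is handed back to the input
    return tokens

ident = "[A-Za-z]" + LETTERDIGIT + "*"
plain = [("Bool", "true", ""), ("Ident", ident, "")]
with_context = [("Bool", "true", LETTERDIGIT + "*"), ("Ident", ident, "")]

print(tokenize("trueStat", plain))         # [('Ident', 'trueStat')]
print(tokenize("trueStat", with_context))  # [('Bool', 'true'), ('Ident', 'Stat')]
```

In the second call both rules cover all eight characters, so the tie goes to Bool because it is listed first, and only true is consumed.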

writing an antlr grammar where whitespace is sometimes significant

This is a dummy example, my actual language is more complicated:
grammar wordasnumber;
WS: [ \t\n] -> skip;
AS: [Aa] [Ss];
ID: [A-Za-z]+;
NUMBER: [0-9]+;
wordAsNumber: (ID AS NUMBER)* EOF;
In this language, these two strings are legal:
seven as 7 eight as 8
seven as 7eight as8
Which is exactly what I told it to do, but not what I want. Because ID and AS are both strings of letters, white space is required between them. I would like that second phrase to be a syntax error. I could add some other rule to try and match these mashed-up things ...
fragment LETTER: [A-Za-z];
fragment DIGIT: [0-9];
BAD_THING: ( LETTER+ DIGIT (LETTER|DIGIT)* ) | ( DIGIT+ LETTER (LETTER|DIGIT)* );
ID: LETTER+;
NUMBER: DIGIT+;
... to make the lexer return a different token for these smashed-up things, but this feels like a weird band-aid that I sort of found the need for accidentally, and maybe there are more such cases that I would find if I stared at my lexer very carefully.
Is there a better way to do this? My actual grammar is much larger, so, for example, making WS NOT be skipped and placing it explicitly between the tokens where it is required is a non-starter.
There was an older question on this list, which I could not find, which I think is the same question. In that case someone who was parsing white-space-separated numbers was surprised that 1.2.3 was parsing as 1.2 and .3 and not as a syntax error.
Add another rule for the wrong input, but don't use that in your parser. It will then cause a syntax error when matched:
INVALID: (ID | NUMBER)+;
This additional rule will change the parse tree output, for the input in the question, to:
This trick works because ANTLR4's lexing approach tries to match the longest input in one go, and the INVALID rule matches more than ID and NUMBER alone. But you have to place it after these two rules to make use of another lexing rule: "If two lexer rules would match the same input, pick the first one." This way, you get the correct tokens for single appearances of ID and NUMBER.
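The effect of the extra rule on the token stream can be checked with a rough Python imitation of the lexer. The token_names helper and the regex translations are assumptions for illustration; this is not ANTLR-generated code.

```python
import re

# Rules in grammar order; INVALID is deliberately listed after ID and NUMBER.
RULES = [
    ("WS", r"[ \t\n]"),
    ("AS", r"[Aa][Ss]"),
    ("ID", r"[A-Za-z]+"),
    ("NUMBER", r"[0-9]+"),
    ("INVALID", r"[A-Za-z0-9]+"),  # matches the same strings as (ID | NUMBER)+
]

def token_names(text):
    """Return token names using longest match, first rule on ties, WS skipped."""
    names, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mimic "-> skip"
            names.append(best_name)
        pos += best_len
    return names

print(token_names("seven as 7 eight as 8"))
# ['ID', 'AS', 'NUMBER', 'ID', 'AS', 'NUMBER']
print(token_names("seven as 7eight as8"))
# ['ID', 'AS', 'INVALID', 'INVALID']
```

Since no parser rule mentions INVALID, the second token stream cannot be parsed by wordAsNumber, which is the desired syntax error.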

Macro contains a cycle

So I'm trying to make a lexical analyzer for Scheme, and when I run JFlex to convert the lexer.flex file I get an error similar to this one:
Reading "lexer.flex"
Macro definition contains a cycle.
1 error, 0 warnings.
The macro it's referring to is this one:
definition = {variable_definition}
| {syntax_definition}
| \(begin {definition}*\)
| \(let-syntax \({syntax_binding}*\){definition}*\)
| \(letrec-syntax \({syntax_binding}*\){definition}*\)
All of the macros defined here have been implemented, but for some reason I can't get rid of this error and I don't know why it's happening.
A lex/flex/JFlex style "definition" is a macro expansion, as that error message indicates. Recursive macro expansions are impossible, since macro expansion is not conditional; trying to expand
definition = ... \(begin {definition}*\) ...
will result in an infinitely long regular expression.
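Why a self-referencing macro cannot be expanded can be sketched in a few lines of Python. This is only an illustration of textual macro expansion, not JFlex's code; the expand function, the sample macros and the wording of the error are hypothetical.

```python
import re

def expand(name, macros, active=()):
    """Textually expand {macro} references, failing when a macro reaches itself."""
    if name in active:
        chain = " -> ".join(active + (name,))
        raise ValueError(f"macro definition contains a cycle: {chain}")
    return re.sub(
        r"\{([A-Za-z_]\w*)\}",
        lambda m: "(" + expand(m.group(1), macros, active + (name,)) + ")",
        macros[name],
    )

macros = {
    "digit": "[0-9]",
    "number": "{digit}+",
    # Refers to itself, like the definition macro in the question.
    "definition": r"\(begin {definition}*\)",
}

print(expand("number", macros))  # ([0-9])+
try:
    expand("definition", macros)
except ValueError as err:
    print(err)  # macro definition contains a cycle: definition -> definition
```

Without the cycle check, the expansion of definition would never terminate, because every expansion step introduces another {definition} reference.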
Do not mistake a lexical analyser for a general-purpose parser. A lexical analyser does no more than split an input into individual tokens (or "lexemes"), using regular expressions to identify each token. Tokens do not have structure (at least for the purposes of parsing); once a token is identified, it is a single indivisible object. If you find yourself writing lexical descriptions which match structured text, you have almost certainly pushed the lexical analysis beyond its limits.
Parsers use an algorithm which does allow recursive descriptions (but which has very limited forward lookahead) and which can create a recursive description of the input (such as a parse tree).

Elasticsearch standard tokenizer behaviour and word boundaries

I am not sure why the standard tokenizer (used by the default standard analyzer) behaves like this in this scenario:
- If I use the word system.exe it generates the token system.exe. I understand . is not a word breaker.
- If I use the word system32.exe it generates the tokens system and exe. I don't understand this: why does it break the word when it finds a number + a . ?
- If I use the word system32tm.exe it generates the token system32tm.exe. As in the first example, it works as expected, not breaking the word into different tokens.
I have read http://unicode.org/reports/tr29/#Word_Boundaries but I still don't understand why a number + dot (.) is a word boundary.
As mentioned in the question, the standard tokenizer provides grammar-based tokenization based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
The rule http://unicode.org/reports/tr29/#Word_Boundaries is to not break if you have letter + dot + letter, see WB6 in the above spec. So tm.exe is preserved and system32.exe is split.
The spec says that it always splits, except for the listed exceptions. Exceptions WB6 and WB7 say that it never splits on letter, then punctuation, then letter. Rules WB11 and WB12 say that it never splits on number, then punctuation, then number. However, there is no such rule for number, then punctuation, then letter, so the default rule applies and system32.exe gets split.
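The decision described above can be written down as a heavily simplified Python sketch. It only looks at the characters on either side of a single dot in an ASCII word and ignores every other part of UAX #29; it is not the tokenizer's implementation, and splits_at_dot is a hypothetical helper.

```python
def splits_at_dot(word):
    """Return True if a word break falls at the single interior '.' in word."""
    i = word.index(".")
    if i == 0 or i == len(word) - 1:
        raise ValueError("expected a character on both sides of the dot")
    before, after = word[i - 1], word[i + 1]
    letter_dot_letter = before.isalpha() and after.isalpha()  # WB6 / WB7
    digit_dot_digit = before.isdigit() and after.isdigit()    # WB11 / WB12
    # No exception applies, so the default "break everywhere" rule wins.
    return not (letter_dot_letter or digit_dot_digit)

for word in ["system.exe", "system32.exe", "system32tm.exe", "1.2"]:
    print(word, splits_at_dot(word))
# system.exe False
# system32.exe True
# system32tm.exe False
# 1.2 False
```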

Salesforce Validation Rule ensuring field contains no line breaks

I'm trying to write a data validation rule that ensures Shipping Street is one line (doesn't contain line breaks)
I've tried things like
CONTAINS( ShippingStreet , BR() ), and
CONTAINS( ShippingStreet , "\n" ),
but I can't get the rule to trigger.
Any help?
This will do it:
REGEX(ShippingStreet,'.*\\n.*')
There are 2 things to learn from this question about SFDC REGEX parsing:
(1) As per Java SE 6 Pattern syntax, you need to double-escape the new-line character (\n), along with various other special characters, when it is used in a string that gets compiled to a regular expression; that is, use '\\n'.
(2) The Salesforce Regular Expression parser matches the entire phrase by default. To match on just part of the phrase, you have to surround your pattern with .*
Examples:
1. REGEX('Marc Benioff','Marc Benioff') -> TRUE
2. REGEX('Marc Benioff is a CEO','Marc Benioff') -> FALSE
3. REGEX('Marc Benioff','.*Marc Benioff.*') -> TRUE
4. REGEX('Marc Benioff is a CEO','.*Marc Benioff.*') -> TRUE
For more info, read the 'Tips' section of the SFDC REGEX Help docs.
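The whole-phrase matching behaviour can be reproduced outside Salesforce with Python's re.fullmatch, which likewise requires the pattern to cover the entire string. This is only an analogy under that assumption: sf_regex is a hypothetical helper, not a Salesforce API, and the sample addresses are made up.

```python
import re

def sf_regex(text, pattern):
    """Mimic REGEX(text, pattern) under the whole-string matching described above."""
    return re.fullmatch(pattern, text) is not None

print(sf_regex("Marc Benioff", "Marc Benioff"))               # True
print(sf_regex("Marc Benioff is a CEO", "Marc Benioff"))      # False
print(sf_regex("Marc Benioff", ".*Marc Benioff.*"))           # True
print(sf_regex("Marc Benioff is a CEO", ".*Marc Benioff.*"))  # True

# The raw string r".*\n.*" hands the regex engine the two characters \n,
# which is what the formula string '.*\\n.*' does after unescaping.
print(sf_regex("1 Main St\nSuite 5", r".*\n.*"))  # True
print(sf_regex("1 Main St", r".*\n.*"))           # False
```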
The following in a validation rule should mean that only ShippingStreet values containing characters only (and thereby no line breaks and the like) are accepted.
NOT(REGEX(ShippingStreet, '.*'))
