antlr v3 lexer predicts wrong - antlr3

consider the following (combined) grammar
grammar CastModifier;
tokens{
E='=';
C='=()';
Lp='(';
Rp=')';
I = 'int';
S=';';
}
compilationUnit
: assign+ EOF
;
assign
: '=' Int ';'
| '=' '(' 'int' ')' Int ';'
| '=()' Int ';'
;
Int
: ('1'..'9') ('0'..'9')*
| '0'
;
Whitespace
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
Unfortunately the lexer does not always predict the next token correctly. For instance for the following code
=(int) 1;
The Lexer predicts it must be the '=()' token. It detects the correct token for the following code
= (int) 1;
I figured this problem should be solvable by ANTLR if I provide the following option for the rule "assign":
options{k=3;}
But apparently it does not help and neither if I define this option for the whole grammar. How can I resolve this problem? My workaround at the moment is to built the '=()' token out of '='('')' but that allows the user to write
= ()
Which, well, is kind of ok but I am just wondering why ANTLR is not able to predict it correctly.

The options{k=3;} causes the parser to look ahead at most 3 tokens, not the lexer. ANTLR3's lexer is not that smart: once it matches a character, it won't give up on that match. So, in your case, from the input =(int) 1;, =( is matched but then there is an i in the char stream and no token that matches this.
My workaround at the moment is to built the '=()' token out of '='('')'
I'd call that a proper solution: '=' '(' and ')' are separate tokens and should be handled as such (to be glued together in the parser, not the lexer).

Related

Antlr weird parentheses syntax

Cant understand this round bracket meaning.
Its not necessary to write it, but sometimes it can produce left-recursion error. Where should we use it in grammar rules?
Its not necessary to write it,
That is correct, it is not necessary. Just remove them.
but sometimes it can produce left-recursion error.
If that really is the case, you can open an issue here: https://github.com/antlr/antlr4/issues
EDIT
Seeing kaby76's comment, just to make sure: you cannot just remove them from a grammar file regardless. They can be removed from your example rule.
When used like this:
rule
: ID '=' ( NUMBER | STRING ) // match either `ID '=' NUMBER`
// or `ID '=' STRING`
;
they cannot be removed because removing them wold result in:
rule
: ID '=' NUMBER | STRING // match either `ID '=' NUMBER`
// or `STRING`
;
Or with repetition:
rule
: ( ID STRING )+ // match: `ID STRING ID STRING ID STRING ...`
;
and this:
rule
: ID STRING+ // match: `ID STRING STRING STRING ...`
;

antlr grammar: Allow whitespace matching only in template string

I want to parse template strings:
`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`
Here is my grammar:
varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)* ')' ;
WS : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;
When the input for the grammar is parsed, the template string has no whitespaces anymore because of the WS -> skip. When I put the TemplateStringLiteral before WS, I get the error:
extraneous input ' ' expecting {'`'}
How can I allow whitespaces to be parsed and not skipped only inside the template string?
What is currently happening
When testing your example against your current grammar displaying the generated tokens, the lexer gives this:
[#0,0:0='`',<'`'>,1:0]
[#1,1:4='Some',<VAR>,1:1]
[#2,6:9='text',<VAR>,1:6]
[#3,11:12='${',<'${'>,1:11]
[#4,13:20='variable',<VAR>,1:13]
[#5,21:21='.',<'.'>,1:21]
[#6,22:25='name',<VAR>,1:22]
[#7,26:26='}',<'}'>,1:26]
... shortened ...
[#26,85:84='<EOF>',<EOF>,2:0]
This tells you, that Some which you intended to be TemplateStringLiteral* was actually lexed to be VAR. Why is this happening?
As mentioned in this answer, antlr uses the longest possible match to create a token. Since your TemplateStringLiteral rule only matches single characters, but your VAR rule matches infinitely many, the lexer obviously uses the latter to match Some.
What you could try (Spoiler: won't work)
You could try to modify the rule like this:
TemplateStringLiteral: ('\\`' | ~'`')+ ;
so that it captures more than one character and therefore will be preferred. This has two reasons why it does not work:
How would the lexer match anything to the VAR rule, ever?
The TemplateStringLiteral rule now also matches ${ therefore prohibiting the correct recognition of the start of a template chunk.
How to achieve what you actually want
There might be another solution, but this one works:
File MartinCup.g4:
parser grammar MartinCup;
options { tokenVocab=MartinCupLexer; }
templateString
: BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
;
template
: TemplateStart variable TemplateEnd
;
variable
: varname funParameter? (Dot variable)*
;
varname
: VAR
;
funParameter
: OpenPar variable? (Comma variable)* ClosedPar
;
File MartinCupLexer.g4:
lexer grammar MartinCupLexer;
BackTick : '`' ;
TemplateStart
: '${' -> pushMode(templateMode)
;
TemplateStringLiteral
: '\\`'
| ~'`'
;
mode templateMode;
VAR
: [$]?[a-zA-Z0-9_]+
| [$]
;
OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;
TemplateEnd
: '}' -> popMode;
This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and only stays active until } is read. It thereby does not catch non-template text like Some.
Notice that the use of lexer modes requires a split grammar (separate files for parser and lexer grammars). Since no lexer rules are allowed in a parser grammar, I had to introduce tokens for the parentheses, comma, dot and backticks.
About the whitespaces
I assume you want to keep whitespaces inside the "normal text", but not allow whitespace inside the templates. Therefore I simply removed the WS rule. You can always re-add it if you like.
I tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:
line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}
The reason for this is the same as above, Some is lexed to VAR.

Lexer rule is recognized where it wasn't needed

trying to use ANTLR 4 to create a simple grammar for some Select statements in Oracle DB. And faced a small problem. I have the following grammar:
Grammar & Lexer
column
: (tableAlias '.')? IDENT ((AS)? colAlias)?
| expression ((AS)? colAlias)?
| caseWhenClause ((AS)? colAlias)?
| rankAggregate ((AS)? colAlias)?
| rankAnalytic colAlias
;
colAlias
: '"' IDENT '"'
| IDENT
;
rankAnalytic
: RANK '(' ')' OVER '(' queryPartitionClause orderByClause ')'
;
RANK: R A N K;
fragment A:('a'|'A');
fragment N:('n'|'N');
fragment R:('r'|'R');
fragment K:('k'|'K');
The most important part there is in COLUMN declaration rankAnalytic part. I declared that after Rank statement should be colAlias, but in case this colAlias is called like "rank" (without quotes) it's recognized as a RANK lexer rule, but not as colAlias.
So for example in case I have the following text:
SELECT fulfillment_bundle_id, SKU, SKU_ACTIVE, PARENT_SKU, SKU_NAME, LAST_MODIFIED_DATE,
RANK() over (PARTITION BY fulfillment_bundle_id, SKU, PARENT_SKU
order by ACTIVE DESC NULLS LAST,SKU_NAME) rank
"rank" alias will be underlined and marked as an mistake with the following error:
mismatched input 'rank' expecting {'"', IDENT}
But the point is that I don't want it to be recognized as a RANK lexer word, but only rank as an alias for Column. Open for your suggestions :)
The RANK rule apparently appears above the IDENT rule, so the string "rank" will never be emitted by the lexer as an IDENT token.
A simple fix is to change the colAlias rule:
colAlias
: '"' ( IDENT | RANK ) '"'
| ( IDENT | RANK )
;
OP added:
Ok but in case I have not only RANK as a lexer rule but the whole list
(>100) of such key words... What am I supposed to do?
If colAlias can be literally anything, then let it:
colAlias
: '"' .+? '"' // must quote if multiple
| . // one token
;
If that definition would incur ambiguities, a predicate is needed to qualify the match:
colAlias
: '"' m+=.+? '"' { check($m) }? // multiple
| o=. { check($o) }? // one
;
Functionally, the predicate is just another element in the subrule.

ANTLR 3 - how do I make unique tokens with NOT across special chars

I have a short question:
// Lexer
LOOP_NAME : (LETTER|DIGIT)+;
OTHERCHARS : ~('>' | '}')+;
LETTER : ('A'..'Z')|('a'..'z');
DIGIT : ('0'..'9');
A_ELEMENT
: (LETTER|'_')*(LETTER|DIGIT|'_'|'.');
// Parser-Konfiguration
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
My problem is that this is impossible due to:
As a result, alternative(s) 2 were disabled for that input [14:55:32]
error(208): ltxt2.g:61:1: The following token definitions can never be
matched because prior tokens match the same input:
LETTER,DIGIT,A_ELEMENT,WS
My issue is that I also need to catch UTF8 with OTHERCHARS... and I cannot put all special UTF8 chars into a Lexer rule since I cannot range like ("!".."?").
So I need the NOT (~). The OTHERCHARS here can be everything but ">" or "}". These two close a literal context and are forbidden within.
It doesn't seem such cases are referenced very well, so I'd be happy if someone knew a workaround. The NOT operator here creates the ambivalence I need to solve.
Thanks in advance.
Best,
wishi
Move OTHERCHARS to the very end of the lexer and define it like this:
OTHERCHARS : . ;
In the Java target, this will match a single UTF-16 code point which is not matched by a previous rule. I typically name the rule ANY_CHAR and treat it as a fall-back. By using . instead of .+, the lexer will only use this rule if no other rule matches.
If another rule matches more than one character, that rule will have priority over ANY_CHAR due to matching a larger number of characters from the input.
If another rule matches exactly one character, that rule will have priority over ANY_CHAR due to appearing earlier in the grammar.
Edit: To exclude } and > from the ANY_CHAR rule, you'll want to create rules for them so they are covered under point 2.
RBRACE : '}' ;
GT : '>' ;
ANY_CHAR : . ;

ANTLR parse problem

I need to be able to match a certain string ('[' then any number of equals signs or none then '['), then i need to match a matching close bracket (']' then the same number of equals signs then ']') after some other match rules. ((options{greedy=false;}:.)* if you must know). I have no clue how to do this in ANTLR, how can i do it?
An example: I need to match [===[whatever arbitrary text ]===] but not [===[whatever arbitrary text ]==].
I need to do it for an arbitrary number of equals signs as well, so therein lies the problem: how do i get it to match an equal number of equals signs in the open as in the close? The supplied parser rules so far dont seem to make sense as far as helping.
You can't easely write a lexer for it, you need parsing rules. Two rules should be sufficient. One is responsible for matching the braces, one for matching the equal signs.
Something like this:
braces : '[' ']'
| '[' equals ']'
;
equals : '=' equals '='
| '=' braces '='
;
This should cover the use case you described. Not absolute shure but maybe you have to use a predicate in the first rule of 'equals' to avoid ambiguous interpretations.
Edit:
It is hard to integrate your greedy rule and at the same time avoid a lexer context switch or something similar (hard in ANTLR). But if you are willing to integrate a little bit of java in your grammer you can write an lexer rule.
The following example grammar shows how:
grammar TestLexer;
SPECIAL : '[' { int counter = 0; } ('=' { counter++; } )+ '[' (options{greedy=false;}:.)* ']' ('=' { counter--; } )+ { if(counter != 0) throw new RecognitionException(input); } ']';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
rule : ID
| SPECIAL
;
Your tags mention lexing, but your question itself doesn't. What you're trying to do is non-regular, so I don't think it can be done as part of lexing (though I don't remember if ANTLR's lexer is strictly regular -- it's been a couple of years since I last used ANTLR).
What you describe should be possible in parsing, however. Here's the grammar for what you described:
thingy : LBRACKET middle RBRACKET;
middle : EQUAL middle EQUAL
| LBRACKET RBRACKET;

Resources