Lexer rule is recognized where it wasn't needed - oracle

I'm trying to use ANTLR 4 to create a simple grammar for some SELECT statements in Oracle DB, and I've run into a small problem. I have the following grammar:
Grammar & Lexer
column
: (tableAlias '.')? IDENT ((AS)? colAlias)?
| expression ((AS)? colAlias)?
| caseWhenClause ((AS)? colAlias)?
| rankAggregate ((AS)? colAlias)?
| rankAnalytic colAlias
;
colAlias
: '"' IDENT '"'
| IDENT
;
rankAnalytic
: RANK '(' ')' OVER '(' queryPartitionClause orderByClause ')'
;
RANK: R A N K;
fragment A:('a'|'A');
fragment N:('n'|'N');
fragment R:('r'|'R');
fragment K:('k'|'K');
The most important part is the rankAnalytic alternative in the column rule. I declared that a colAlias must follow the rank expression, but when that alias is named rank (without quotes), it is recognized as the RANK lexer token rather than as a colAlias.
So for example in case I have the following text:
SELECT fulfillment_bundle_id, SKU, SKU_ACTIVE, PARENT_SKU, SKU_NAME, LAST_MODIFIED_DATE,
RANK() over (PARTITION BY fulfillment_bundle_id, SKU, PARENT_SKU
order by ACTIVE DESC NULLS LAST,SKU_NAME) rank
the "rank" alias will be underlined and marked as a mistake with the following error:
mismatched input 'rank' expecting {'"', IDENT}
The point is that I don't want it to be recognized as the RANK keyword here, only as rank, an alias for the column. I'm open to suggestions :)

The RANK rule apparently appears above the IDENT rule, so the string "rank" will never be emitted by the lexer as an IDENT token.
A simple fix is to change the colAlias rule:
colAlias
: '"' ( IDENT | RANK ) '"'
| ( IDENT | RANK )
;
OP added:
Ok but in case I have not only RANK as a lexer rule but the whole list
(>100) of such key words... What am I supposed to do?
If colAlias can be literally anything, then let it:
colAlias
: '"' .+? '"' // must quote if multiple
| . // one token
;
If that definition would incur ambiguities, a predicate is needed to qualify the match:
colAlias
: '"' m+=.+? '"' { check($m) }? // multiple
| o=. { check($o) }? // one
;
Functionally, the predicate is just another element in the subrule.

Related

Grammar Rule for Math Expressions (No Left-Recursion)

I'm trying to figure out a grammar rule(s) for any mathematical expression.
I'm using EBNF (wiki article linked below) for deriving syntax rules.
I've managed to come up with one that worked for a while, but the grammar rule fails with onScreenTime + (((count) - 1) * 0.9).
The rule is as follows:
math ::= MINUS? LPAREN math RPAREN
| mathOperand (mathRhs)+
mathRhs ::= mathOperator mathRhsGroup
| mathOperator mathOperand mathRhs?
mathRhsGroup ::= MINUS? LPAREN mathOperand (mathRhs | (mathOperator mathOperand))+ RPAREN
You can safely assume mathOperand are positive or negative numbers, or variables.
You can also assume mathOperator denotes any mathematical operator like + or -.
Also, LPAREN and RPAREN are '(' and ')' respectively.
EBNF:
https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form
EDIT
Forgot to mention that it fails on (count) - 1. It says RPAREN expected instead of - 1.
EDIT 2 My revised EBNF now looks like this:
number ::= NUMBER_LITERAL //positive integer
mathExp ::= term_ ((PLUS | MINUS) term_)* // * is zero-or-more.
private term_ ::= factor_ ((ASTERISK | FSLASH) factor_)*
private factor_ ::= PLUS factor_
| MINUS factor_
| primary_
private primary_ ::= number
| IDENTIFIER
| LPAREN mathExp RPAREN
Have a look at the expression grammar of any programming language:
expression
: term
| expression '+' term
| expression '-' term
;
term
: factor
| term '*' factor
| term '/' factor
| term '%' factor
;
factor
: primary
| '-' factor
| '+' factor
;
primary
: IDENTIFIER
| INTEGER
| FLOATING_POINT_LITERAL
| '(' expression ')'
;
Exponentiation left as an exercise for the reader: note that the exponentiation operator is right-associative. This is in yacc notation. NB You are using EBNF, not BNF.
EDIT My non-left-recursive EBNF is not as strong as my yacc, but to factor out the left-recursions you need a scheme like for example:
expression
::= term ((PLUS|MINUS) term)*
term
::= factor ((FSLASH|ASTERISK) factor)*
etc., where * means 'zero or more'. My comments on this below are mostly incorrect and should be ignored.
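The iterative scheme above maps directly onto a hand-written recursive descent parser: each rule becomes a function and each ( ... )* becomes a while loop. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: the names tokenize, Parser and parse, the regex tokenizer, and the tuple-based tree are all hypothetical and do not come from the posts above.

```python
import re

# Hypothetical token pattern: numbers, identifiers, operators and parentheses.
TOKEN_RE = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|([-+*/()]))")

def tokenize(text):
    """Split text into (kind, value) pairs; raise on unknown characters."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        number, ident, op = m.groups()
        if number is not None:
            tokens.append(("NUM", float(number)))
        elif ident is not None:
            tokens.append(("ID", ident))
        else:
            tokens.append(("OP", op))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("EOF", None)

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def at_op(self, *ops):
        kind, value = self.peek()
        return kind == "OP" and value in ops

    # expression ::= term ((PLUS | MINUS) term)*
    def expression(self):
        node = self.term()
        while self.at_op("+", "-"):
            op = self.take()[1]
            node = (op, node, self.term())  # loop keeps it left-associative
        return node

    # term ::= factor ((ASTERISK | FSLASH) factor)*
    def term(self):
        node = self.factor()
        while self.at_op("*", "/"):
            op = self.take()[1]
            node = (op, node, self.factor())
        return node

    # factor ::= PLUS factor | MINUS factor | primary
    def factor(self):
        if self.at_op("+", "-"):
            op = self.take()[1]
            return ("unary" + op, self.factor())
        return self.primary()

    # primary ::= NUMBER | IDENTIFIER | LPAREN expression RPAREN
    def primary(self):
        kind, value = self.take()
        if kind in ("NUM", "ID"):
            return value
        if kind == "OP" and value == "(":
            node = self.expression()
            if not self.at_op(")"):
                raise SyntaxError("RPAREN expected")
            self.take()
            return node
        raise SyntaxError(f"unexpected token {value!r}")

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.peek()[0] != "EOF":
        raise SyntaxError("trailing input")
    return tree
```

Calling parse("onScreenTime + (((count) - 1) * 0.9)") accepts the input that broke the original rule, because a parenthesized group is just another primary and may be followed by further operators.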
You may want to look at the expression grammars of languages that are typically implemented with recursive descent parsers. Such parsers need LL(1) grammars, which do not allow left recursion. Most if not all of Wirth's languages fall into this group. Below is an example from the grammar of classic Modula-2. EBNF links are shown next to each rule.
http://modula-2.info/m2pim/pmwiki.php/SyntaxDiagrams/PIM4NonTerminals#expression

antlr v3 lexer predicts wrong

Consider the following (combined) grammar:
grammar CastModifier;
tokens{
E='=';
C='=()';
Lp='(';
Rp=')';
I = 'int';
S=';';
}
compilationUnit
: assign+ EOF
;
assign
: '=' Int ';'
| '=' '(' 'int' ')' Int ';'
| '=()' Int ';'
;
Int
: ('1'..'9') ('0'..'9')*
| '0'
;
Whitespace
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
Unfortunately the lexer does not always predict the next token correctly. For instance for the following code
=(int) 1;
The Lexer predicts it must be the '=()' token. It detects the correct token for the following code
= (int) 1;
I figured this problem should be solvable by ANTLR if I provide the following option for the rule "assign":
options{k=3;}
But apparently that does not help, and neither does defining this option for the whole grammar. How can I resolve this problem? My workaround at the moment is to build the '=()' token out of '=' '(' ')', but that allows the user to write
= ()
Which, well, is kind of ok but I am just wondering why ANTLR is not able to predict it correctly.
The options{k=3;} setting makes the parser, not the lexer, look ahead at most 3 tokens. ANTLR3's lexer is not that smart: once it matches a character, it won't give up on that match. So, in your case, for the input =(int) 1; the lexer matches =( but then finds an i in the char stream, and no token continues that way.
My workaround at the moment is to build the '=()' token out of '=' '(' ')'
I'd call that a proper solution: '=' '(' and ')' are separate tokens and should be handled as such (to be glued together in the parser, not the lexer).
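The idea of keeping '=', '(' and ')' as separate tokens and combining them at the parser level can be illustrated outside ANTLR. The following is a minimal Python sketch, not generated ANTLR code; tokenize, parse_assign and the returned labels are hypothetical names chosen for this illustration.

```python
import re

# One token per match: single-character punctuation, the 'int' keyword,
# or an integer literal shaped like the Int rule ('0' or no leading zero).
TOKEN_RE = re.compile(r"\s*(=|\(|\)|;|int|0|[1-9][0-9]*)")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"cannot tokenize at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_assign(tokens):
    """Classify one assign statement by its token sequence."""
    def is_int(tok):
        return tok.isdigit()

    if len(tokens) == 3 and tokens[0] == "=" and is_int(tokens[1]) and tokens[2] == ";":
        return "plain"
    if (len(tokens) == 6 and tokens[:4] == ["=", "(", "int", ")"]
            and is_int(tokens[4]) and tokens[5] == ";"):
        return "cast"
    if (len(tokens) == 5 and tokens[:3] == ["=", "(", ")"]
            and is_int(tokens[3]) and tokens[4] == ";"):
        return "empty-parens"
    raise SyntaxError("not an assign statement")
```

With this split, =(int) 1; and = (int) 1; produce the same token sequence, so the choice between alternatives is made from whole tokens rather than from characters the lexer has already committed to. The cost is the one noted in the question: = () 7; is accepted as well.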

ANTLR 3 - how do I make unique tokens with NOT across special chars

I have a short question:
// Lexer
LOOP_NAME : (LETTER|DIGIT)+;
OTHERCHARS : ~('>' | '}')+;
LETTER : ('A'..'Z')|('a'..'z');
DIGIT : ('0'..'9');
A_ELEMENT
: (LETTER|'_')*(LETTER|DIGIT|'_'|'.');
// Parser configuration
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
My problem is that this is impossible due to:
As a result, alternative(s) 2 were disabled for that input [14:55:32]
error(208): ltxt2.g:61:1: The following token definitions can never be
matched because prior tokens match the same input:
LETTER,DIGIT,A_ELEMENT,WS
My issue is that I also need to catch UTF8 with OTHERCHARS... and I cannot put all special UTF8 chars into a Lexer rule since I cannot range like ("!".."?").
So I need the NOT (~). The OTHERCHARS here can be everything but ">" or "}". These two close a literal context and are forbidden within.
Such cases don't seem to be documented very well, so I'd be happy if someone knew a workaround. The NOT operator here creates the ambiguity I need to resolve.
Thanks in advance.
Best,
wishi
Move OTHERCHARS to the very end of the lexer and define it like this:
OTHERCHARS : . ;
In the Java target, this will match a single UTF-16 code unit which is not matched by a previous rule. I typically name the rule ANY_CHAR and treat it as a fall-back. By using . instead of .+, the lexer will only use this rule if no other rule matches.
1. If another rule matches more than one character, that rule will have priority over ANY_CHAR due to matching a larger number of characters from the input.
2. If another rule matches exactly one character, that rule will have priority over ANY_CHAR due to appearing earlier in the grammar.
Edit: To exclude } and > from the ANY_CHAR rule, you'll want to create rules for them so they are covered under point 2.
RBRACE : '}' ;
GT : '>' ;
ANY_CHAR : . ;
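The priority policy described above (longest match first, then the rule that appears earliest) can be modelled in a few lines. The following Python sketch only imitates that policy and is not how the generated ANTLR lexer is implemented; the WORD rule and the tokenize function are hypothetical, and Python's . matches one code point rather than one UTF-16 code unit.

```python
import re

# Rules in grammar order; ANY_CHAR comes last and matches exactly one character.
RULES = [
    ("WORD", re.compile(r"[A-Za-z0-9]+")),
    ("RBRACE", re.compile(r"\}")),
    ("GT", re.compile(r">")),
    ("ANY_CHAR", re.compile(r".", re.DOTALL)),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        # ANY_CHAR always matches one character, so best_len is at least 1.
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

For the input ab€}> this yields WORD, ANY_CHAR, RBRACE, GT: the euro sign falls through to the fall-back rule, while } and > are claimed by their own one-character rules because those are listed before ANY_CHAR.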

ANTLR trying to match token within longer token

I'm new to ANTLR, and I'm trying the following grammar in ANTLRWorks 1.4.3.
command
: 'go' SPACE+ 'to' SPACE+ destination
;
destination
: (UPPER | LOWER) (UPPER | LOWER | DIGIT)*
;
SPACE
: ' '
;
UPPER
: 'A'..'Z'
;
LOWER
: 'a'..'z'
;
DIGIT
: '0'..'9'
;
This seems to work OK, except when the destination contains the characters of the keywords 'go' or 'to'.
For instance, if I give the following command:
go to Glasgo
the node-tree is displayed as follows:
I was expecting it to match the full word as destination.
I even tried changing the keyword, for example 'travel' instead of 'go'. In that case, if there is 'tr' in the destination, ANTLR complains.
Any idea why this happens, and how to fix it?
Thanks in advance.
ANTLR lexer and parser are strictly separated. Your input is first tokenized, after which the parser rules operate on said tokens.
In your case, the input go to Glasgo is tokenized into the following 9 tokens:
'go'
' ' (SPACE)
'to'
' ' (SPACE)
'G' (UPPER)
'l' (LOWER)
'a' (LOWER)
's' (LOWER)
'go'
which leaves a "dangling" 'go' keyword. This is simply how ANTLR's lexer works: you cannot change this.
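The tokenization described above can be reproduced with a small model of the lexer. This is a Python sketch under assumptions, not ANTLR itself: it uses a plain "longest match, earlier rule wins ties" policy, and the names RULES, WORD_RULES and tokenize are hypothetical.

```python
import re

# Token rules in priority order, mirroring the question's grammar: the
# literals 'go' and 'to' become tokens next to the one-character rules.
RULES = [
    ("'go'", r"go"),
    ("'to'", r"to"),
    ("SPACE", r" "),
    ("UPPER", r"[A-Z]"),
    ("LOWER", r"[a-z]"),
    ("DIGIT", r"[0-9]"),
]

# Alternative rule set in which a whole word is a single token.
WORD_RULES = [
    ("'go'", r"go"),
    ("'to'", r"to"),
    ("DESTINATION", r"[A-Za-z][A-Za-z0-9]*"),
    ("SPACE", r" "),
]

def tokenize(text, rules=RULES):
    tokens, pos = [], 0
    while pos < len(text):
        candidates = []
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m:
                candidates.append((name, m.group()))
        if not candidates:
            raise ValueError(f"no rule matches at position {pos}")
        # max() returns the first maximal element, so earlier rules win ties.
        name, lexeme = max(candidates, key=lambda item: len(item[1]))
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens
```

With RULES, the tail of Glasgo comes out as the keyword token 'go', because no single-character rule can compete with a two-character literal. With WORD_RULES, the six-character word match is longer than anything else that starts at G, so Glasgo stays in one piece, while a standalone go still becomes the keyword because the literal is listed first.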
A possible solution in your case would be to make destination a lexer rule instead of a parser rule:
command
: 'go' 'to' DESTINATION
;
DESTINATION
: (UPPER | LOWER) (UPPER | LOWER | DIGIT)*
;
SPACE
: ' ' {skip();}
;
fragment UPPER
: 'A'..'Z'
;
fragment LOWER
: 'a'..'z'
;
fragment DIGIT
: '0'..'9'
;
resulting in:
If you're not entirely sure what the difference between the two is, see: Practical difference between parser rules and lexer rules in ANTLR?
More about fragments: What does "fragment" mean in ANTLR?
PS. Glasgow?

ANTLR parse problem

I need to be able to match a certain string ('[', then any number of equals signs or none, then '['), and then I need to match a matching close bracket (']', then the same number of equals signs, then ']') after some other match rules ((options{greedy=false;}:.)* if you must know). I have no clue how to do this in ANTLR. How can I do it?
An example: I need to match [===[whatever arbitrary text ]===] but not [===[whatever arbitrary text ]==].
I need to do it for an arbitrary number of equals signs as well, so therein lies the problem: how do I get it to match the same number of equals signs in the close as in the open? The parser rules supplied so far don't seem to help.
You can't easily write a lexer for it; you need parsing rules. Two rules should be sufficient: one is responsible for matching the braces, the other for matching the equals signs.
Something like this:
braces : '[' ']'
| '[' equals ']'
;
equals : '=' equals '='
| '=' braces '='
;
This should cover the use case you described. I'm not absolutely sure, but you may have to use a predicate in the first alternative of 'equals' to avoid ambiguous interpretations.
Edit:
It is hard to integrate your greedy rule and at the same time avoid a lexer context switch or something similar (hard in ANTLR). But if you are willing to integrate a little bit of Java into your grammar, you can write a lexer rule.
The following example grammar shows how:
grammar TestLexer;
SPECIAL
: '[' { int counter = 0; } ('=' { counter++; } )+ '['
(options{greedy=false;}:.)*
']' ('=' { counter--; } )+ { if(counter != 0) throw new RecognitionException(input); } ']'
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
rule : ID
| SPECIAL
;
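The counting idea behind the SPECIAL rule (count the equals signs in the opener, then require the same count in the closer) can be shown outside ANTLR as well. The following is a minimal Python sketch, assuming a hypothetical helper named match_long_bracket; unlike the rule above, it also accepts zero equals signs, as the question asks for, and it looks for the first closer with the right count instead of failing on the first ']'.

```python
def match_long_bracket(text, start=0):
    """Match '[', n equals signs, '[' ... ']', n equals signs, ']' at start.

    Returns (level, body, end_index) on success and None otherwise.
    """
    if start >= len(text) or text[start] != "[":
        return None
    pos = start + 1
    # Count the equals signs of the opening delimiter (possibly zero).
    while pos < len(text) and text[pos] == "=":
        pos += 1
    level = pos - start - 1
    if pos >= len(text) or text[pos] != "[":
        return None
    body_start = pos + 1
    # The closer must carry exactly the same number of equals signs.
    closer = "]" + "=" * level + "]"
    close_at = text.find(closer, body_start)
    if close_at == -1:
        return None
    return level, text[body_start:close_at], close_at + len(closer)
```

For [===[whatever arbitrary text ]===] the function reports level 3 and the enclosed text, while [===[whatever arbitrary text ]==] returns None because no closer with three equals signs exists.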
Your tags mention lexing, but your question itself doesn't. What you're trying to do is non-regular, so I don't think it can be done as part of lexing (though I don't remember if ANTLR's lexer is strictly regular -- it's been a couple of years since I last used ANTLR).
What you describe should be possible in parsing, however. Here's the grammar for what you described:
thingy : LBRACKET middle RBRACKET;
middle : EQUAL middle EQUAL
| LBRACKET RBRACKET;

Resources