Is it possible to change lexical state (aka "start condition") from within the grammar rules of Jison?
I am parsing a computer language where the lexical state clearly changes (at least to my human mindset) when certain grammar rules are satisfied, even though there's no single token I can point to in the lexer.
(The reason I think this is that certain keywords are reserved/reservable in one state but not the other.)
It's definitely possible to change the lexical state from within the lexer, e.g.:
%lex
%x expression
%%
{id} { return 'ID'; }
"=" { this.begin('expression'); return '='; }
<expression>";" { this.popState(); return ';'; }
But is there a way to change lexical state when certain grammar rules are matched?
%% /* language grammar */
something : pattern1 pattern2 { this.beginState('expression'); $$ = [$1,$2]; };
pattern1 : some stuff { $$ = [$1, $2]; }
pattern2 : other stuff { $$ = [$1, $2]; }
If I try this, I get
TypeError: this.popState is not a function
at Object.anonymous (eval at createParser (/Users/me/Exp/stats/node_modules/jison/lib/jison.js:1327:23), <anonymous>:47:67)
at Object.parse (eval at createParser (/Users/me/Exp/stats/node_modules/jison/lib/jison.js:1327:23), <anonymous>:329:36)
I'm not sure whether what I'm asking for is theoretically impossible or conceptually naive (e.g. is this the very meaning of a context-free grammar?), or whether it's there and I'm just not reading the docs right.
The lexer object is available in a parser action as yy.lexer, so you can change the start condition with yy.lexer.begin('expression'); and go back to the old one with yy.lexer.popState(). That part is not problematic.
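For example, the rule from the question can be written almost exactly as shown (a sketch that only swaps this for yy.lexer; the rest of the rule is unchanged):
something : pattern1 pattern2 { yy.lexer.begin('expression'); $$ = [$1,$2]; };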
However, you need to think about when the new start condition will take effect. An LALR(1) parser, such as the one implemented by jison (or bison), uses a single lookahead token to decide what action to take. (The "1" in LALR(1) is the length of the possible lookahead.) That means that when a parser action is executed -- when the rule it is attached to is reduced -- the next token has probably already been read.
This will not always be the case; both jison and bison will sometimes be able to do a reduction without using the lookahead token, in which case they will not yet have read it.
In short, a change to the lexer state in an action might take effect before the next token is read, but most of the time it will take effect when the second next token is read. Because of this ambiguity, it is usually best to make lexer state changes prior to a token which is not affected by the lexer state change.
Consider, for example, the standard calculator. The following example is adapted from the jison manual:
%lex
%%
\s+ /* skip whitespace */
[0-9]+\b yytext=parseInt(yytext); return 'NUMBER'
[*/+%()-] return yytext[0]
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%left '+' '-'
%left '*' '/' '%'
%left UMINUS
%start expressions
%% /* language grammar */
expressions: e EOF {return $1;};
e : e '+' e {$$ = $1+$3;}
| e '-' e {$$ = $1-$3;}
| e '*' e {$$ = $1*$3;}
| e '/' e {$$ = $1/$3;}
| e '%' e {$$ = $1%$3;}
| '-' e %prec UMINUS {$$ = -$2;}
| '(' e ')' {$$ = $2;}
| NUMBER {$$ = $1;}
;
Now, let's modify it so that between [ and ] all numbers are interpreted as hexadecimal. We use a non-exclusive start condition called HEX; when it is enabled, hexadecimal numbers are recognized and converted accordingly.
%lex
%s HEX
%%
\s+ /* skip whitespace */
<INITIAL>[0-9]+("."[0-9]+)?\b yytext=parseFloat(yytext); return 'NUMBER'
<HEX>[0-9a-fA-F]+\b yytext=parseInt(yytext, 16); return 'NUMBER'
[*/+%()[\]-] return yytext[0]
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%left '+' '-'
%left '*' '/' '%'
%left UMINUS
%start expressions
%% /* language grammar */
expressions: e EOF {return $1;};
e : e '+' e {$$ = $1+$3;}
| e '-' e {$$ = $1-$3;}
| e '*' e {$$ = $1*$3;}
| e '/' e {$$ = $1/$3;}
| e '%' e {$$ = $1%$3;}
| '-' e %prec UMINUS {$$ = -$2;}
| '(' e ')' {$$ = $2;}
| hex '[' e unhex ']' {$$ = $3;}
| NUMBER {$$ = $1;}
;
hex : { yy.lexer.begin('HEX'); } ;
unhex: { yy.lexer.popState(); } ;
Here, we use the empty non-terminals hex and unhex to change lexer state. (In bison, I would have used a mid-rule action, which is very similar, but jison doesn't seem to implement them.) The key is that the state changes are done before the [ and ] tokens, which are not affected by the state change. Consequently, it doesn't matter whether the state change takes place before or after the current lookahead token, since we don't need it to take effect until the second next token, which might be a number.
This grammar will correctly output 26 given the input [10+a]. If we move the hex marker non-terminal to be inside the brackets:
/* NOT CORRECT */
| '[' hex e unhex ']' {$$ = $3;}
then the start condition change happens after the lookahead token, so that [10+a] produces 20.
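For reference, either variant can be tried with a small driver script (a minimal sketch; it assumes the grammar above is saved as hexcalc.jison next to the script and that the jison npm package is installed):
var fs = require("fs");
var jison = require("jison");
// Build a parser directly from the grammar source shown above.
var grammar = fs.readFileSync("hexcalc.jison", "utf8");
var parser = new jison.Parser(grammar);
console.log(parser.parse("[10+a]"));  // 26: 'a' is read as hexadecimal inside the brackets
console.log(parser.parse("2*(3+4)")); // 14: plain decimal arithmetic outside the brackets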
Related
I'm trying to use ANTLR 4 to create a simple grammar for some SELECT statements in Oracle DB, and I've run into a small problem. I have the following grammar:
Grammar & Lexer
column
: (tableAlias '.')? IDENT ((AS)? colAlias)?
| expression ((AS)? colAlias)?
| caseWhenClause ((AS)? colAlias)?
| rankAggregate ((AS)? colAlias)?
| rankAnalytic colAlias
;
colAlias
: '"' IDENT '"'
| IDENT
;
rankAnalytic
: RANK '(' ')' OVER '(' queryPartitionClause orderByClause ')'
;
RANK: R A N K;
fragment A:('a'|'A');
fragment N:('n'|'N');
fragment R:('r'|'R');
fragment K:('k'|'K');
The most important part is the rankAnalytic alternative in the column rule. I declared that a colAlias should follow the RANK expression, but if that alias is literally "rank" (without quotes), it is recognized as the RANK lexer token rather than as a colAlias.
So for example in case I have the following text:
SELECT fulfillment_bundle_id, SKU, SKU_ACTIVE, PARENT_SKU, SKU_NAME, LAST_MODIFIED_DATE,
RANK() over (PARTITION BY fulfillment_bundle_id, SKU, PARENT_SKU
order by ACTIVE DESC NULLS LAST,SKU_NAME) rank
"rank" alias will be underlined and marked as an mistake with the following error:
mismatched input 'rank' expecting {'"', IDENT}
But the point is that I don't want it to be recognized as the RANK lexer token here; I want rank to be treated as an alias for the column. Open to your suggestions :)
The RANK rule apparently appears above the IDENT rule, so the string "rank" will never be emitted by the lexer as an IDENT token.
A simple fix is to change the colAlias rule:
colAlias
: '"' ( IDENT | RANK ) '"'
| ( IDENT | RANK )
;
OP added:
Ok, but what if I have not only RANK as a lexer rule but a whole list
(>100) of such keywords... What am I supposed to do?
If colAlias can be literally anything, then let it:
colAlias
: '"' .+? '"' // must quote if multiple
| . // one token
;
If that definition would incur ambiguities, a predicate is needed to qualify the match:
colAlias
: '"' m+=.+? '"' { check($m) }? // multiple
| o=. { check($o) }? // one
;
Functionally, the predicate is just another element in the subrule.
In Python we can use the pass statement as a placeholder.
What is the equivalent statement in Golang?
A bare ;, or something else?
The Go Programming Language Specification
Empty statements
The empty statement does nothing.
EmptyStmt = .
Notation
The syntax is specified using Extended Backus-Naur Form (EBNF):
Production = production_name "=" [ Expression ] "." .
Expression = Alternative { "|" Alternative } .
Alternative = Term { Term } .
Term = production_name | token [ "…" token ] | Group | Option | Repetition .
Group = "(" Expression ")" .
Option = "[" Expression "]" .
Repetition = "{" Expression "}" .
Productions are expressions constructed from terms and the following
operators, in increasing precedence:
| alternation
() grouping
[] option (0 or 1 times)
{} repetition (0 to n times)
Lower-case production names are used to identify lexical tokens.
Non-terminals are in CamelCase. Lexical tokens are enclosed in double
quotes "" or back quotes ``.
The form a … b represents the set of characters from a through b as
alternatives. The horizontal ellipsis … is also used elsewhere in the
spec to informally denote various enumerations or code snippets that
are not further specified. The character … (as opposed to the three
characters ...) is not a token of the Go language.
The empty statement is, literally, empty. Its EBNF (Extended Backus–Naur Form) production is EmptyStmt = . , i.e. the right-hand side is the empty string.
For example,
for {
}
var no bool
if true {
} else {
no = true
}
While parsing VBScript code with my ANTLR3 parser, I found it processes everything except
x = y &htmlTag
This code is obviously meant as "x = y & htmlTag". (Me, I put spaces around operators in any language, but the code I am parsing is not mine.) The lexer should find the longest string that is a valid token, right? So that should work fine here: since '&h' is not followed by text that forms a hex literal, the lexer should decide that this is not a hex literal and that the longest valid token is the operator '&', followed by an identifier.
But if my grammar says:
HexOrOctalLiteral :
( ( AMPERSAND H HexDigit ) => AMPERSAND H HexDigit+
| ( AMPERSAND O OctalDigit ) => AMPERSAND O OctalDigit+
)
AMPERSAND?
;
ConcatenationOperator: AMPERSAND;
fragment AMPERSAND : '&';
fragment HexDigit : Digit | A | B | C | D | E | F;
fragment OctalDigit : '0' .. '7';
fragment H : 'h' | 'H';
My parser complains: required (...)+ loop did not match anything at character 'h' when processing '&htmlTag'. It appears the lexer has already decided that it has found a HexOrOctalLiteral and will no longer consider a concat operator. My grammar has k=1; I'm not sure if that is relevant here, because setting it higher for this rule using 'options' seems to make no difference.
What am I missing?
Consider the following (combined) grammar:
grammar CastModifier;
tokens{
E='=';
C='=()';
Lp='(';
Rp=')';
I = 'int';
S=';';
}
compilationUnit
: assign+ EOF
;
assign
: '=' Int ';'
| '=' '(' 'int' ')' Int ';'
| '=()' Int ';'
;
Int
: ('1'..'9') ('0'..'9')*
| '0'
;
Whitespace
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
Unfortunately, the lexer does not always predict the next token correctly. For instance, for the following code
=(int) 1;
The lexer predicts it must be the '=()' token. It detects the correct tokens for the following code:
= (int) 1;
I figured this problem should be solvable by ANTLR if I provide the following option for the rule "assign":
options{k=3;}
But apparently it does not help, and neither does defining this option for the whole grammar. How can I resolve this problem? My workaround at the moment is to build the '=()' token out of '=' '(' ')', but that allows the user to write
= ()
Which, well, is kind of OK, but I am just wondering why ANTLR is not able to predict it correctly.
The options{k=3;} causes the parser to look ahead at most 3 tokens, not the lexer. ANTLR3's lexer is not that smart: once it matches a character, it won't give up on that match. So, in your case, from the input =(int) 1;, =( is matched but then there is an i in the char stream and no token that matches this.
My workaround at the moment is to build the '=()' token out of '=' '(' ')'
I'd call that a proper solution: '=', '(' and ')' are separate tokens and should be handled as such (to be glued together in the parser, not in the lexer).
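In the grammar from the question that just means dropping the '=()' token and spelling the empty parentheses out as one more alternative (a sketch; only the assign rule changes):
assign
    : '=' Int ';'
    | '=' '(' 'int' ')' Int ';'
    | '=' '(' ')' Int ';'   // what used to be the single '=()' token
    ;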
I need to be able to match a certain string ('[' then any number of equals signs or none, then '['), then I need to match a matching close bracket (']' then the same number of equals signs, then ']') after some other match rules ((options{greedy=false;}:.)*, if you must know). I have no clue how to do this in ANTLR. How can I do it?
An example: I need to match [===[whatever arbitrary text ]===] but not [===[whatever arbitrary text ]==].
I need to do it for an arbitrary number of equals signs as well, so therein lies the problem: how do I get it to match the same number of equals signs in the close as in the open? The parser rules supplied so far don't seem to help.
You can't easily write a lexer rule for it; you need parser rules. Two rules should be sufficient: one responsible for matching the braces, one for matching the equals signs.
Something like this:
braces : '[' ']'
| '[' equals ']'
;
equals : '=' equals '='
| '=' braces '='
;
This should cover the use case you described. I'm not absolutely sure, but you may have to use a predicate in the first alternative of 'equals' to avoid ambiguous interpretations.
Edit:
It is hard to integrate your greedy rule and at the same time avoid a lexer context switch or something similar (hard in ANTLR). But if you are willing to embed a little bit of Java in your grammar, you can write a lexer rule.
The following example grammar shows how:
grammar TestLexer;
SPECIAL : '[' { int counter = 0; }
          ('=' { counter++; } )+              // count the '=' signs in the opening delimiter
          '['
          (options{greedy=false;}:.)*         // arbitrary content, matched non-greedily
          ']'
          ('=' { counter--; } )+              // count down the '=' signs in the closing delimiter
          { if(counter != 0) throw new RecognitionException(input); }
          ']'
        ;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
rule : ID
| SPECIAL
;
Your tags mention lexing, but your question itself doesn't. What you're trying to do is non-regular, so I don't think it can be done as part of lexing (though I don't remember if ANTLR's lexer is strictly regular -- it's been a couple of years since I last used ANTLR).
What you describe should be possible in parsing, however. Here's the grammar for what you described:
thingy : LBRACKET middle RBRACKET;
middle : EQUAL middle EQUAL
| LBRACKET RBRACKET;