I'm trying to write a RACC parser, a part of which can be represented by the regular expression a[b][c][d].
I came up with the following productions (each token represents the lower case character of the token name):
expr
: A
| A b
;
b
: B
| B c
| c
;
c
: C
| C D
| D
;
Is this the simplest form, or am I missing something?
If B, C, and D can be distinguished from each other by their first token (which is certainly the case if they are tokens rather than being the simplification of a more complicated grammar), then an alternative is to define non-terminals representing optionality.
optional_B: /* empty */ | B
optional_C: /* empty */ | C
optional_D: /* empty */ | D
expression: A optional_B optional_C optional_D
This gets more complicated when the optional subexpressions cannot be immediately distinguished, because the parser needs to correctly identify that the optional non-terminal matches the empty string based only on the following token. But it sounds like that's not the case with your grammar.
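As a rough illustration of why single tokens make the optional non-terminals easy, here is a minimal hand-written recognizer in Python (not RACC). The function name `matches` and the representation of tokens as a list of names are assumptions made for this sketch. Each optional part is decided by looking only at the next token.

```python
def matches(tokens):
    """Recognize: expression -> A optional_B optional_C optional_D.

    `tokens` is a list of token names such as ["A", "C", "D"].
    (Hypothetical helper for illustration; not generated by RACC.)
    """
    if not tokens or tokens[0] != "A":
        return False
    pos = 1
    for name in ("B", "C", "D"):
        # optional_X: take X if it is the next token, otherwise match empty.
        if pos < len(tokens) and tokens[pos] == name:
            pos += 1
    # Every token must have been consumed for the input to be accepted.
    return pos == len(tokens)
```

For example, `matches(["A", "D"])` is accepted and `matches(["A", "C", "B"])` is rejected, because B may only appear before C.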
I have a very basic question about parsing a fragment that contains comment.
First we import my favorite language, Pico:
import lang::pico::\syntax::Main;
Then we execute the following:
parse(#Id,"a");
gives, as expected:
Id: (Id) `a`
However,
parse(#Id,"a\n%% some comment\n");
gives a parse error.
What do I do wrong here?
There are multiple problems.
Id is a lexical, so layout (which includes comments) is never inserted inside it.
Layout is only inserted between elements in a production and the Id lexical has only a character class, so no place to insert layout.
Even if Id was a syntax non terminal with multiple elements, it would parse comments between them not before or after.
For more on the difference between syntax, lexical, and layout see: Rascal Syntax Definitions.
If you want to parse comments around a non-terminal, there is the start modifier for the non-terminal. Normally, layout is only inserted between elements in the production; with start it is also inserted before and after it.
For example, take this grammar:
layout L = [\t\ ]* !>> [\t\ ];
lexical AB = "A" "B"+;
syntax CD = "C" "D"+;
start syntax EF = "E" "F"+;
This will be transformed into this grammar:
AB = "A" "B"+;
CD' = "C" L "D"+;
EF' = L "E" L "F"+ L;
"B"+ = "B"+ "B" | "B";
"D"+ = "D"+ L "D" | "D";
"F"+ = "F"+ L "F" | "F";
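The insertion rule can be mimicked with a few lines of Python. This is only a sketch of the rule as described above, not how Rascal implements it; the function `insert_layout` and the string values for `kind` are made up for illustration.

```python
def insert_layout(kind, symbols, layout="L"):
    """Return the production body after layout insertion.

    kind is "lexical", "syntax" or "start" (hypothetical labels for this sketch).
    """
    if kind == "lexical":
        # Lexicals never get layout between their elements.
        return list(symbols)
    body = []
    for index, symbol in enumerate(symbols):
        if index > 0:
            body.append(layout)  # layout goes between adjacent elements
        body.append(symbol)
    if kind == "start":
        # A start non-terminal also accepts layout before and after.
        body = [layout] + body + [layout]
    return body
```

Applied to the three productions above, it yields the bodies of AB, CD' and EF' respectively.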
So, in particular if you'd want to parse a string with layout around it, you could write this:
lexical Id = [a-z]+;
start syntax P = Id i;
layout L = [\ \n\t]*;
parse(#start[P], "\naap\n").top // parses and returns the P node
parse(#start[P], "\naap\n").top.i // parses and returns the Id node
parse(#P, "\naap"); // parse error at 0 because start wrapper is not around P
I'm trying to figure out a grammar rule(s) for any mathematical expression.
I'm using EBNF (wiki article linked below) for deriving syntax rules.
I've managed to come up with one that worked for a while, but the grammar rule fails with onScreenTime + (((count) - 1) * 0.9).
The rule is as follows:
math ::= MINUS? LPAREN math RPAREN
| mathOperand (mathRhs)+
mathRhs ::= mathOperator mathRhsGroup
| mathOperator mathOperand mathRhs?
mathRhsGroup ::= MINUS? LPAREN mathOperand (mathRhs | (mathOperator mathOperand))+ RPAREN
You can safely assume mathOperand are positive or negative numbers, or variables.
You can also assume mathOperator denotes any mathematical operator like + or -.
Also, LPAREN and RPAREN are '(' and ')' respectively.
EBNF:
https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form
EDIT
Forgot to mention that it fails on (count) - 1. It says RPAREN expected instead of - 1.
EDIT 2 My revised EBNF now looks like this:
number ::= NUMBER_LITERAL //positive integer
mathExp ::= term_ ((PLUS | MINUS) term_)* // * is zero-or-more.
private term_ ::= factor_ ((ASTERISK | FSLASH) factor_)*
private factor_ ::= PLUS factor_
| MINUS factor_
| primary_
private primary_ ::= number
| IDENTIFIER
| LPAREN mathExp RPAREN
Have a look at the expression grammar of any programming language:
expression
: term
| expression '+' term
| expression '-' term
;
term
: factor
| term '*' factor
| term '/' factor
| term '%' factor
;
factor
: primary
| '-' factor
| '+' factor
;
primary
: IDENTIFIER
| INTEGER
| FLOATING_POINT_LITERAL
| '(' expression ')'
;
Exponentiation left as an exercise for the reader: note that the exponentiation operator is right-associative. This is in yacc notation. NB You are using EBNF, not BNF.
EDIT My non-left-recursive EBNF is not as strong as my yacc, but to factor out the left-recursions you need a scheme like for example:
expression
::= term ((PLUS|MINUS) term)*
term
::= factor ((FSLASH|ASTERISK) factor)*
etc., where * means 'zero or more'. My comments on this below are mostly incorrect and should be ignored.
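To make the scheme concrete, here is a small recursive descent evaluator in Python that follows the non-left-recursive rules, one function per rule. It is a sketch under assumptions: the tokenizer regex, the `variables` dictionary, and all function names are invented for this example, and the tokenizer silently skips characters it does not recognize.

```python
import re

# Numbers, identifiers, and the single-character operators and parentheses.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*/()]")


def evaluate(text, variables):
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, found {token!r}")
        pos += 1
        return token

    def expression():  # expression ::= term ((PLUS|MINUS) term)*
        value = term()
        while peek() in ("+", "-"):
            if take() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():  # term ::= factor ((ASTERISK|FSLASH) factor)*
        value = factor()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():  # factor ::= PLUS factor | MINUS factor | primary
        if peek() == "+":
            take()
            return factor()
        if peek() == "-":
            take()
            return -factor()
        return primary()

    def primary():  # primary ::= number | IDENTIFIER | LPAREN expression RPAREN
        token = take()
        if token == "(":
            value = expression()
            take(")")
            return value
        if token[0].isdigit():
            return float(token)
        if token[0].isalpha() or token[0] == "_":
            return variables[token]
        raise SyntaxError(f"unexpected token {token!r}")

    result = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

The loops in `expression` and `term` fold operands from left to right, so `2 - 3 - 4` is evaluated as `(2 - 3) - 4` even though the grammar has no left recursion. An input such as `onScreenTime + (((count) - 1) * 0.9)` parses because `(count)` is just a `primary` that may be followed by further operators.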
You may want to take a look at the expression grammars of languages that are typically implemented with recursive descent parsers. These need LL(1) grammars, which do not allow left recursion. Most if not all of Wirth's languages fall into this group. Below is an example from the grammar of classic Modula-2. EBNF links are shown next to each rule.
http://modula-2.info/m2pim/pmwiki.php/SyntaxDiagrams/PIM4NonTerminals#expression
Is it possible to change lexical state (aka "start condition") from within the grammar rules of Jison?
I am parsing a computer language where lexical state clearly changes (at least to my human mindset) when certain grammar rules are satisfied, even though there's no token I can exactly point to in the lexer.
(The reason I think this is that certain keywords are reserved/reservable in one state but not the other.)
It's definitely possible to change the lexical state from within the lexer, e.g.:
%lex
%x expression
%%
{id} { return 'ID'; }
"=" { this.begin('expression'); return '='; }
<expression>";" { this.popState(); return ';'; }
But is there a way to change lexical state when certain grammar rules are matched?
%% /* language grammar */
something : pattern1 pattern2 { this.beginState('expression'); $$ = [$1,$2]; };
pattern1 : some stuff { $$ = [$1, $2]; }
pattern2 : other stuff { $$ = [$1, $2]; }
If I try this, I get
TypeError: this.popState is not a function
at Object.anonymous (eval at createParser (/Users/me/Exp/stats/node_modules/jison/lib/jison.js:1327:23), <anonymous>:47:67)
at Object.parse (eval at createParser (/Users/me/Exp/stats/node_modules/jison/lib/jison.js:1327:23), <anonymous>:329:36)
I'm not sure if what I'm asking for is theoretically impossible or conceptually naive (e.g. is this the very meaning of context free grammar?), or it's there and I'm just not reading the docs right.
The lexer object is available in a parser action as yy.lexer, so you can change the start condition with yy.lexer.begin('expression'); and go back to the old one with yy.lexer.popState(). That part is not problematic.
However, you need to think about when the new start condition will take effect. An LALR(1) parser, such as the one implemented by jison (or bison), uses a single lookahead token to decide what action to take. (The "1" in LALR(1) is the length of the possible lookahead.) That means that when a parser action is executed -- when the rule it is attached to is reduced -- the next token has probably already been read.
This will not always be the case; both jison and bison will sometimes be able to do a reduction without using the lookahead token, in which case they will not yet have read it.
In short, a change to the lexer state in an action might take effect before the next token is read, but most of the time it will take effect when the second next token is read. Because of this ambiguity, it is usually best to make lexer state changes prior to a token which is not affected by the lexer state change.
Consider, for example, the standard calculator. The following example is adapted from the jison manual:
%lex
%%
\s+ /* skip whitespace */
[0-9]+\b yytext=parseInt(yytext); return 'NUMBER'
[*/+%()-] return yytext[0]
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%left '+' '-'
%left '*' '/' '%'
%left UMINUS
%start expressions
%% /* language grammar */
expressions: e EOF {return $1;};
e : e '+' e {$$ = $1+$3;}
| e '-' e {$$ = $1-$3;}
| e '*' e {$$ = $1*$3;}
| e '/' e {$$ = $1/$3;}
| e '%' e {$$ = $1%$3;}
| '-' e %prec UMINUS {$$ = -$2;}
| '(' e ')' {$$ = $2;}
| NUMBER {$$ = $1;}
;
Now, let's modify it so that between [ and ] all numbers are interpreted as hexadecimal. We use a non-exclusive start condition called HEX; when it is enabled, hexadecimal numbers are recognized and converted accordingly.
%lex
%s HEX
%%
\s+ /* skip whitespace */
<INITIAL>[0-9]+("."[0-9]+)?\b yytext=parseInt(yytext); return 'NUMBER'
<HEX>[0-9a-fA-F]+\b yytext=parseInt(yytext, 16); return 'NUMBER'
[*/+%()[\]-] return yytext[0]
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%left '+' '-'
%left '*' '/' '%'
%left UMINUS
%start expressions
%% /* language grammar */
expressions: e EOF {return $1;};
e : e '+' e {$$ = $1+$3;}
| e '-' e {$$ = $1-$3;}
| e '*' e {$$ = $1*$3;}
| e '/' e {$$ = $1/$3;}
| e '%' e {$$ = $1%$3;}
| '-' e %prec UMINUS {$$ = -$2;}
| '(' e ')' {$$ = $2;}
| hex '[' e unhex ']' {$$ = $3;}
| NUMBER {$$ = $1;}
;
hex : { yy.lexer.begin('HEX'); } ;
unhex: { yy.lexer.popState(); } ;
Here, we use the empty non-terminals hex and unhex to change lexer state. (In bison, I would have used a mid-rule action, which is very similar, but jison doesn't seem to implement them.) The key is that the state changes are done before the [ and ] tokens, which are not affected by the state change. Consequently, it doesn't matter whether the state change takes place before or after the current lookahead token, since we don't need it to take effect until the second next token, which might be a number.
This grammar will correctly output 26 given the input [10+a]. If we move the hex marker non-terminal to be inside the brackets:
/* NOT CORRECT */
| '[' hex e unhex ']' {$$ = $3;}
then the start condition change happens after the lookahead token, so that [10+a] produces 20.
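The timing effect can be reproduced without jison. The following Python sketch is a hand-written simulation, not jison's actual machinery: the `Lexer` class, its `hex` flag, and `evaluate_sum` are all invented here. It assumes a parser that always holds exactly one token of lookahead, and it only handles inputs of the form `[n+n+...]`.

```python
import re


class Lexer:
    """Lazy tokenizer whose NUMBER rule depends on a switchable mode."""

    def __init__(self, text):
        self.text, self.pos, self.hex = text, 0, False

    def next(self):
        if self.pos >= len(self.text):
            return ("EOF", None)
        pattern = r"[0-9a-fA-F]+" if self.hex else r"[0-9]+"
        match = re.compile(pattern).match(self.text, self.pos)
        if match:
            self.pos = match.end()
            return ("NUMBER", int(match.group(), 16 if self.hex else 10))
        char = self.text[self.pos]
        self.pos += 1
        return (char, char)


def evaluate_sum(text, switch_before_bracket):
    lexer = Lexer(text)
    lookahead = lexer.next()  # the parser has already read '['

    def advance():
        nonlocal lookahead
        token = lookahead
        lookahead = lexer.next()  # reading ahead happens in the current mode
        return token

    if switch_before_bracket:
        lexer.hex = True  # safe: the pending lookahead '[' is mode-independent
    advance()  # consume '[', which reads the first number as lookahead
    if not switch_before_bracket:
        lexer.hex = True  # too late: the first number was lexed as decimal
    total = advance()[1]
    while lookahead[0] == "+":
        advance()
        total += advance()[1]
    return total
```

With the switch made before the bracket, `evaluate_sum("[10+a]", True)` gives 26; with the switch made after it, `evaluate_sum("[10+a]", False)` gives 20, matching the two outcomes described above.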
While parsing VBScript code with my ANTLR3 parser, I found it processes everything except
x = y &htmlTag
This code is obviously meant as "x = y & htmlTag". (Me, I put spaces around operators in any language, but the code I am parsing is not mine.) The lexer should find the longest string that is a valid token, right? So that should work fine here: as '&h' is not followed by text that results in a hex literal, the lexer should decide that this is not a hex literal and that the longest valid token is the operator '&', followed by an identifier.
But if my grammar says:
HexOrOctalLiteral :
( ( AMPERSAND H HexDigit ) => AMPERSAND H HexDigit+
| ( AMPERSAND O OctalDigit ) => AMPERSAND O OctalDigit+
)
AMPERSAND?
;
ConcatenationOperator: AMPERSAND;
fragment AMPERSAND : '&';
fragment HexDigit : Digit | A | B | C | D | E | F;
fragment OctalDigit : '0' .. '7';
fragment H : 'h' | 'H';
My parser complains: required (...)+ loop did not match anything at character 'h' when processing the '&htmlTag'. It appears the lexer has already decided that it has found a HexOrOctalLiteral and will no longer consider a concat operator. My grammar has k=1, not sure if that is relevant here because setting it higher for this rule using 'options' seems to make no difference.
What am I missing?
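For reference, the tokenization expected here can be written down as a small Python tokenizer. This only illustrates the desired fallback from literal to operator; it is not ANTLR, it does not explain the ANTLR3 behaviour, and the token names and regular expressions are assumptions for this sketch.

```python
import re

# The literal alternative is tried first; when no digit follows "&h" or "&o",
# the regex engine falls back to the plain "&" operator.
TOKEN = re.compile(r"""
    (?P<HEX_OR_OCTAL>&[hH][0-9a-fA-F]+&?|&[oO][0-7]+&?)
  | (?P<CONCAT>&)
  | (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<ASSIGN>=)
  | (?P<SPACE>\s+)
""", re.VERBOSE)


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "SPACE":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

With these rules, `x = y &htmlTag` yields an identifier, an assignment, an identifier, a concatenation operator, and another identifier, while `x = &hFF` yields a single literal token for `&hFF`.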
I am trying to write a parser for a simple language that recognizes integer and float expressions using ocamlyacc. However, I want to introduce the possibility of having variables. So I defined the token VAR in my lexer.mll file, which allows it to be any alphanumeric string starting with a capital letter.
expr:
| INT { $1 }
| VAR { /*Some action */}
| expr PLUS expr { $1 + $3 }
| expr MINUS expr { $1 - $3 }
/* and similar rules below for real expressions differently */
Now I have a similar definition for real numbers. However, when I run this file, I get 2 reduce/reduce conflicts: if I just enter a random string (identified as token VAR), the parser does not know whether it is a real or an integer variable, because the token VAR appears in both the int and the real expression rules of my grammar.
Var + 12 /*means that Var has to be an integer variable*/
Var /*Is a valid expression according to my grammar but can be of any type*/
How do I eliminate this reduce/reduce conflict without losing the generality of variable declaration, while maintaining the 2 data types available to me?