I'm trying to work out grammar rules for arbitrary mathematical expressions.
I'm using EBNF (wiki article linked below) for deriving syntax rules.
I've managed to come up with one that worked for a while, but the grammar rule fails with onScreenTime + (((count) - 1) * 0.9).
The rule is as follows:
math ::= MINUS? LPAREN math RPAREN
| mathOperand (mathRhs)+
mathRhs ::= mathOperator mathRhsGroup
| mathOperator mathOperand mathRhs?
mathRhsGroup ::= MINUS? LPAREN mathOperand (mathRhs | (mathOperator mathOperand))+ RPAREN
You can safely assume a mathOperand is a positive or negative number, or a variable.
You can also assume mathOperator denotes any mathematical operator like + or -.
Also, LPAREN and RPAREN are '(' and ')' respectively.
EBNF:
https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form
EDIT
Forgot to mention that it fails on (count) - 1. It says RPAREN expected instead of - 1.
EDIT 2 My revised EBNF now looks like this:
number ::= NUMBER_LITERAL //positive integer
mathExp ::= term_ ((PLUS | MINUS) term_)* // * is zero-or-more.
private term_ ::= factor_ ((ASTERISK | FSLASH) factor_)*
private factor_ ::= PLUS factor_
| MINUS factor_
| primary_
private primary_ ::= number
| IDENTIFIER
| LPAREN mathExp RPAREN
Have a look at the expression grammar of any programming language:
expression
: term
| expression '+' term
| expression '-' term
;
term
: factor
| term '*' factor
| term '/' factor
| term '%' factor
;
factor
: primary
| '-' factor
| '+' factor
;
primary
: IDENTIFIER
| INTEGER
| FLOATING_POINT_LITERAL
| '(' expression ')'
;
Exponentiation left as an exercise for the reader: note that the exponentiation operator is right-associative. This is in yacc notation. NB You are using EBNF, not BNF.
EDIT My non-left-recursive EBNF is not as strong as my yacc, but to factor out the left-recursions you need a scheme like for example:
expression
::= term ((PLUS|MINUS) term)*
term
::= factor ((FSLASH|ASTERISK) factor)*
etc., where * means 'zero or more'. My comments on this below are mostly incorrect and should be ignored.
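To make the non-left-recursive scheme concrete, here is a minimal Python sketch of a recursive descent evaluator with one method per rule, where each `( ... )*` repetition becomes a `while` loop. The names `tokenize`, `Parser`, `evaluate` and the `env` dictionary of variable values are assumptions made for this illustration; they do not come from the question.

```python
import re

# One token per match: a number, an identifier, or a single operator/parenthesis.
TOKEN_RE = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|([-+*/()]))")

def tokenize(text):
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input near position {pos}")
        tokens.append(m.group(m.lastindex))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens, env):
        self.tokens, self.pos, self.env = tokens, 0, env

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    # expression ::= term ((PLUS | MINUS) term)*
    def expression(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    # term ::= factor ((ASTERISK | FSLASH) factor)*
    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    # factor ::= PLUS factor | MINUS factor | primary
    def factor(self):
        if self.peek() == "+":
            self.take()
            return self.factor()
        if self.peek() == "-":
            self.take()
            return -self.factor()
        return self.primary()

    # primary ::= number | IDENTIFIER | LPAREN expression RPAREN
    def primary(self):
        tok = self.take()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "(":
            value = self.expression()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok[0].isdigit():
            return float(tok)
        if tok[0].isalpha() or tok[0] == "_":
            return self.env[tok]  # variables are looked up in the assumed env
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text, env):
    parser = Parser(tokenize(text), env)
    value = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

With this structure the expression from the question, onScreenTime + (((count) - 1) * 0.9), parses without trouble, because a parenthesized group is just another primary.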
You may want to look at the expression grammars of languages that are typically implemented with recursive descent parsers. Such parsers need LL(1) grammars, which do not allow left recursion. Most if not all of Wirth's languages fall into this group. Below is an example from the grammar of classic Modula-2. EBNF links are shown next to each rule.
http://modula-2.info/m2pim/pmwiki.php/SyntaxDiagrams/PIM4NonTerminals#expression
In OCaml, when I use pattern matching, I can't do the following:
let rec example = function
| ... -> ...
| ... || ... -> ... (* here I get a syntax error because I use ||*)
Instead I need to do:
let rec example1 = function
|... -> ...
|... | ... -> ...
I know that || means or in OCaml, but why do we need to use only one 'pipe' : | to specify 'or' in pattern matching?
Why don't the usual || work?
|| doesn't really mean "or" generally, it means "boolean or", or rather it's the boolean or operator. Operators operate on values resulting from the evaluation of expressions, their operands. Operations and operands together also form expressions, which can then be used as operands with other operators to form further expressions, and so on.
Pattern matching, on the other hand, evaluates patterns, which are neither booleans nor expressions. Although patterns do in a sense evaluate to true or false when applied to, or rather matched against, a value, they do not evaluate to anything on their own. They are in that sense more like operators than operands. Furthermore, the result of matching against a pattern is not just a boolean value, but also a set of bindings.
Using || instead of | with patterns would overload its meaning and, I think, serve more to confuse than to clarify.
I am trying to write some BNF (not EBNF) to describe the various elements of the following code fragment which is in no particular programming language but would be syntactically correct in VBA.
If Temperature <=0 Then
Description = "Freezing"
End If
So far I have come up with the BNF at the bottom of this post (I have not yet described string, number or identifier).
What perplexes me is the second line of code, Description = "Freezing", in which I am assigning a string literal to an identifier. How should I deal with this in my BNF?
I am tempted to simply adjust my definition of a factor like this...
<factor> ::= <identifier> | <number> | <string_literal> | (<expression>)
...after all, in VBA an arithmetic expression containing a string or a string variable would be syntactically correct and not picked up until run time. For example (4+3)*(6-"hi") would not be picked up as a syntax error. Is this the right approach?
Or should I leave the production for a factor as it is and redefine the assignment like this...?
<assignment> ::= <identifier> = <expression> | <identifier> = <string_literal>
I am not trying to define a whole language in my BNF, rather, I just want to cover most of the productions that describe the code fragment. Suggestions would be much appreciated.
BNF so far...
<string> ::= …
<number> ::= …
<identifier> ::= …
<assignment> ::= <identifier> = <expression>
<statements> ::= <statement> | <statement> <statements>
<statement> ::= <assignment> | <if_statement> | <for_statement> | <while_statement> | …
<expression> ::= <expression> + <term> | <expression> - <term> | <term>
<term> ::= <term> * <factor> | <term> / <factor> | <factor>
<factor> ::= <identifier> | <number> | (<expression>)
<relational_operator> ::= < | > | <= | >= | =
<condition> ::= <expression> <relational_operator> <expression>
<if_statement> ::= If <condition> Then <statement>
| If <condition> Then <statements> End If
| If <condition> Then <statements> Else <statements> End If
Consider the code sample:
X = "hi"
Y = 6 - X
The 6 - X expression is an error, but you can't make it a syntax error using just a context-free grammar. Similarly for:
If Temperature <= X Then ...
Instead of catching such type errors via the grammar, you'll have to catch them later, either statically or dynamically. And given that you have to do that analysis anyway, there's not much point trying to catch any type errors (express any type constraints) in the grammar.
So go with your first solution, adding <string_literal> to <factor>.
While you don't provide any details about your language, it seems reasonable to believe that a language which has string literals and string variables also has some operations on strings, at least function calls taking strings as arguments and probably certain operators. (In VB, as I understand it, both + and & function as string concatenation operators.)
In that case, assignment to a string variable is not limited to assigning a string literal, and the grammar would be expected to allow expressions including string literals.
It is always tempting to attempt to enforce type coherency in a grammar, on the basis that some type errors (such as 6 - "hi") can be detected immediately. But there are many other very similar errors (6 - HiStringVariable) which cannot be detected until type deduction (or even until runtime, for dynamic languages). The contortions necessary to do partial type checking during the parse are almost never worth the trouble.
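As a rough illustration of checking types after the parse rather than in the grammar, here is a minimal Python sketch of a checker that walks an already-parsed expression tree. The tuple-based node shapes, the `infer_type` name and the rule that every arithmetic operator needs two numbers are assumptions made for this example; they are not VBA's actual semantics.

```python
ARITHMETIC_OPS = {"+", "-", "*", "/"}

def infer_type(node, var_types):
    """Return 'number' or 'string' for an expression node, or raise TypeError.

    Assumed node shapes (hypothetical):
      ("num", 6), ("str", "hi"), ("var", "X"), ("binop", "-", left, right)
    """
    kind = node[0]
    if kind == "num":
        return "number"
    if kind == "str":
        return "string"
    if kind == "var":
        # Variable types come from earlier analysis, e.g. of assignments.
        return var_types[node[1]]
    if kind == "binop":
        _, op, left, right = node
        if op not in ARITHMETIC_OPS:
            raise ValueError(f"unknown operator: {op!r}")
        left_type = infer_type(left, var_types)
        right_type = infer_type(right, var_types)
        # Hypothetical rule: arithmetic operators require two numbers.
        if (left_type, right_type) != ("number", "number"):
            raise TypeError(f"cannot apply {op!r} to {left_type} and {right_type}")
        return "number"
    raise ValueError(f"unknown node kind: {kind!r}")
```

Both 6 - "hi" and 6 - X (with X known to hold a string) are accepted by the grammar and rejected by the same check here, which is why splitting the literal case out in the grammar gains little.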
I'm trying to use ANTLR 4 to create a simple grammar for some SELECT statements in Oracle DB, and I've run into a small problem. I have the following grammar:
Grammar & Lexer
column
: (tableAlias '.')? IDENT ((AS)? colAlias)?
| expression ((AS)? colAlias)?
| caseWhenClause ((AS)? colAlias)?
| rankAggregate ((AS)? colAlias)?
| rankAnalytic colAlias
;
colAlias
: '"' IDENT '"'
| IDENT
;
rankAnalytic
: RANK '(' ')' OVER '(' queryPartitionClause orderByClause ')'
;
RANK: R A N K;
fragment A:('a'|'A');
fragment N:('n'|'N');
fragment R:('r'|'R');
fragment K:('k'|'K');
The important part is the rankAnalytic alternative in the column rule. I declared that a colAlias must follow the rank clause, but when the alias is itself named rank (without quotes), the lexer recognizes it as the RANK token rather than as an IDENT, so colAlias does not match.
So for example in case I have the following text:
SELECT fulfillment_bundle_id, SKU, SKU_ACTIVE, PARENT_SKU, SKU_NAME, LAST_MODIFIED_DATE,
RANK() over (PARTITION BY fulfillment_bundle_id, SKU, PARENT_SKU
order by ACTIVE DESC NULLS LAST,SKU_NAME) rank
the "rank" alias will be underlined and marked as a mistake with the following error:
mismatched input 'rank' expecting {'"', IDENT}
The point is that I don't want rank to be recognized as the RANK keyword here, only as an alias for the column. I'm open to suggestions :)
The RANK rule apparently appears above the IDENT rule, so the string "rank" will never be emitted by the lexer as an IDENT token.
A simple fix is to change the colAlias rule:
colAlias
: '"' ( IDENT | RANK ) '"'
| ( IDENT | RANK )
;
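The effect of rule order can be sketched outside ANTLR. The following Python snippet is only a model of the "longest match wins, ties go to the earlier rule" behaviour of the lexer; `LEXER_RULES`, `classify` and `is_col_alias` are hypothetical names invented for this illustration, and the IDENT pattern is a guess because the original IDENT rule is not shown.

```python
import re

# Rules in declaration order: RANK is listed before IDENT, as in the grammar.
LEXER_RULES = [
    ("RANK", re.compile(r"rank", re.IGNORECASE)),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),  # assumed IDENT shape
]

def classify(text):
    """Return the token type for the start of text: longest match, earliest rule on ties."""
    best = None
    for name, pattern in LEXER_RULES:
        m = pattern.match(text)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best[0] if best else None

def is_col_alias(token_type):
    """Model of the fixed colAlias rule: accept IDENT or the RANK keyword token."""
    return token_type in ("IDENT", "RANK")
```

In this model `classify("rank")` yields RANK, never IDENT, while a longer word such as ranking is still an IDENT; listing RANK as an extra alternative in colAlias is what lets the alias through.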
OP added:
Ok but in case I have not only RANK as a lexer rule but the whole list
(>100) of such key words... What am I supposed to do?
If colAlias can be literally anything, then let it:
colAlias
: '"' .+? '"' // must quote if multiple
| . // one token
;
If that definition would incur ambiguities, a predicate is needed to qualify the match:
colAlias
: '"' m+=.+? '"' { check($m) }? // multiple
| o=. { check($o) }? // one
;
Functionally, the predicate is just another element in the subrule.
While parsing VBScript code with my ANTLR3 parser, I found it processes everything except
x = y &htmlTag
This code is obviously meant as "x = y & htmlTag". (Me, I put spaces around operators in any language, but the code I am parsing is not mine.) The lexer should find the longest string that is a valid token, right? So that should work fine here: As '&h' is not followed by text that results in a hex literal, the lexer should decide that this is not a hex literal and the longest valid token is the operator '&'. Followed by an identifier.
But if my grammar says:
HexOrOctalLiteral :
( ( AMPERSAND H HexDigit ) => AMPERSAND H HexDigit+
| ( AMPERSAND O OctalDigit ) => AMPERSAND O OctalDigit+
)
AMPERSAND?
;
ConcatenationOperator: AMPERSAND;
fragment AMPERSAND : '&';
fragment HexDigit : Digit | A | B | C | D | E | F;
fragment OctalDigit : '0' .. '7';
fragment H : 'h' | 'H';
My parser complains: required (...)+ loop did not match anything at character 'h' when processing the '&htmlTag'. It appears the lexer has already decided that it has found a HexOrOctalLiteral and will no longer consider a concat operator. My grammar has k=1, not sure if that is relevant here because setting it higher for this rule using 'options' seems to make no difference.
What am I missing?
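For comparison, the tokenization the question expects can be sketched by hand in Python: a literal is only started when `&h` or `&o` is actually followed by a digit of the right kind, and otherwise the `&` falls back to the concatenation operator. This is a hedged illustration of the desired behaviour, not ANTLR 3 output; the `tokenize` function and the token names are assumptions.

```python
import re

# A literal needs at least one digit after the prefix; a trailing '&' is optional.
HEX_LITERAL = re.compile(r"&[hH][0-9A-Fa-f]+&?")
OCTAL_LITERAL = re.compile(r"&[oO][0-7]+&?")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        for name, pattern in (("HEX", HEX_LITERAL), ("OCT", OCTAL_LITERAL), ("ID", IDENTIFIER)):
            m = pattern.match(source, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # No multi-character rule matched: emit a single-character token.
            ch = source[pos]
            tokens.append(("CONCAT" if ch == "&" else "OP", ch))
            pos += 1
    return tokens
```

Note that in this sketch an input such as &hello would still begin a hex literal, because e is a hex digit; the fallback only helps when the character after `&h` cannot start a literal, as in &htmlTag.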
I need to match an opening bracket ('[', then zero or more equals signs, then '[') and then, after some other match rules, the matching close bracket (']', then the same number of equals signs, then ']'). (The rules in between are (options{greedy=false;}:.)* if you must know.) I have no clue how to do this in ANTLR. How can I do it?
An example: I need to match [===[whatever arbitrary text ]===] but not [===[whatever arbitrary text ]==].
It has to work for an arbitrary number of equals signs, and therein lies the problem: how do I get it to match the same number of equals signs in the close as in the open? The parser rules suggested so far don't seem to help.
You can't easily write a lexer rule for it; you need parser rules. Two rules should be sufficient: one responsible for matching the braces, one for matching the equals signs.
Something like this:
braces : '[' ']'
| '[' equals ']'
;
equals : '=' equals '='
| '=' braces '='
;
This should cover the use case you described. I'm not absolutely sure, but you may have to use a predicate in the first alternative of 'equals' to avoid ambiguous interpretations.
Edit:
It is hard to integrate your greedy=false rule and at the same time avoid a lexer context switch or something similar (which is hard in ANTLR). But if you are willing to embed a little Java in your grammar, you can write a lexer rule.
The following example grammar shows how:
grammar TestLexer;
SPECIAL : '[' { int counter = 0; } ('=' { counter++; } )+ '[' (options{greedy=false;}:.)* ']' ('=' { counter--; } )+ { if(counter != 0) throw new RecognitionException(input); } ']';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
rule : ID
| SPECIAL
;
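The counting idea in the SPECIAL rule can also be sketched as plain code. The Python function below is a hand-written illustration under my own assumptions, not a translation of the ANTLR rule: the name `match_long_bracket` is hypothetical, it allows zero equals signs, and instead of throwing on the first mismatched closer it searches for the first closer with the same number of equals signs as the opener.

```python
def match_long_bracket(text, start=0):
    """Match '[' '='* '[' body ']' '='* ']' with equal counts of '='.

    Returns (body, end_index) on success, or None if text[start:] does not
    begin with an opener or no closer with the same '=' count follows.
    """
    if start >= len(text) or text[start] != "[":
        return None
    pos = start + 1
    level = 0
    # Count the equals signs in the opener.
    while pos < len(text) and text[pos] == "=":
        level += 1
        pos += 1
    if pos >= len(text) or text[pos] != "[":
        return None
    body_start = pos + 1
    # The closer must repeat exactly the same number of equals signs.
    closer = "]" + "=" * level + "]"
    end = text.find(closer, body_start)
    if end == -1:
        return None
    return text[body_start:end], end + len(closer)
```

Because the closer is built from the counted level, [===[whatever arbitrary text ]===] matches while [===[whatever arbitrary text ]==] does not.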
Your tags mention lexing, but your question itself doesn't. What you're trying to do is non-regular, so I don't think it can be done as part of lexing (though I don't remember if ANTLR's lexer is strictly regular -- it's been a couple of years since I last used ANTLR).
What you describe should be possible in parsing, however. Here's the grammar for what you described:
thingy : LBRACKET middle RBRACKET;
middle : EQUAL middle EQUAL
| LBRACKET RBRACKET;