I have implemented the CYK algorithm in Java to check whether a string belongs to a grammar given in CNF. For example, with the following grammar,
S->AB|BC
A->BA|a
B->CC|b
C->AB|a
I can validate a string such as abba. But now I have the following grammar, in BNF form:
<query> ::= <definestream>|<executionquery>
<definestream>::= define stream <streamname><attributename><type> {<attributename><type>}
<executionquery>::=<input> <output> [<projection>]
<input> ::= from <streams>
<output> ::= ((insert [<outputtype>]into <streamname>)| (return [<outputtype>]))
<streams> ::= <stream>[#<window>]| <stream>#<window> [unidirectional]<join> [unidirectional] <stream>#<window>
on <condition>within <time>| [every] <stream> ><stream> … <stream>within <time>
| <stream>, <stream>, <stream>within <time>
<stream> ::= <streamname><conditionlist>
<projection> ::= (<externalcall><attributelist>)|<attributelist>
[group by <attributename>][having <condition>]
<externalcall>::= call <name> ( <paramlist>)
<conditionlist>::= {‘[’<condition>’]’}
<attributelist>::=(<attributename>[as <referencename>])| ( <function>(<paramlist>)as <referencename>)
<outputtype>::= expiredevents| currentevents| allevents
<paramlist>::= {<expression>}
<condition> ::= ( <condition> (and|or) <condition> )|(not <condition>)
|( <expression> (==|!=|>=|<=|>|<|contains) <expression> )
<expression> ::= ( <expression> (+||/|*|%)<expression> )|<attributename>|<int>|<long>|<double>|<float>|<string>
I need to convert this to CNF. From what I have found, converting any CFG to CNF involves the following steps:
1. Remove null (ε) productions, and then remove unit productions. But the examples I have seen do this only for grammars like the following:
S→ASA|aB
A→B|S
B→b|ε
But the BNF form uses a lot of other syntax that I have no idea how to convert, and it is also hard to identify the terminals of the BNF grammar above. Can anyone please explain the procedure, so I can get on with this? Thanks in advance.
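For reference, the CYK check mentioned at the top of this question can be sketched as follows. This is a minimal illustration in Python rather than the Java implementation described above; the names BINARY, TERMINAL and cyk are made up for the sketch, and the grammar is the small CNF grammar from the start of the question.

```python
from itertools import product

# Small CNF grammar from the question (S->AB|BC, A->BA|a, B->CC|b, C->AB|a).
# Each right-hand side maps to the set of nonterminals that produce it.
BINARY = {
    ("A", "B"): {"S", "C"},
    ("B", "C"): {"S"},
    ("B", "A"): {"A"},
    ("C", "C"): {"B"},
}
TERMINAL = {
    "a": {"A", "C"},
    "b": {"B"},
}


def cyk(word, start="S"):
    """Return True if `word` is derivable from `start` (CYK recognition)."""
    n = len(word)
    if n == 0:
        return False  # this grammar has no epsilon production
    # table[i][l] holds the nonterminals deriving word[i:i + l + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = set(TERMINAL.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for pair in product(left, right):
                    table[i][length - 1] |= BINARY.get(pair, set())
    return start in table[0][n - 1]


print(cyk("baaba"))
print(cyk("abba"))
```

With this grammar, cyk("baaba") returns True, while cyk("abba") returns False, since abba has no derivation from S here. The table only works because every rule has either two nonterminals or a single terminal on the right-hand side, which is why the BNF grammar has to be brought into CNF first.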
I need help defining some rules for a grammar in CUP. The rules in question belong to the declaration block, which consists of the declaration of 0 or more constants, 0 or more record types, and 0 or more variables. An example of the code to parse:
x: constant := True;
y: constant := 32
type Tpersona is record
dni: Integer;
edad : Integer;
casado : Boolean;
end record;
type Tfecha is record
dia: Integer;
mes : Integer;
anyo : Integer;
end record;
type Tcita is record
usuario:Tpersona;
fecha:Tfecha;
end record;
a: Integer;
x,y: Boolean;
x,y: Boolean;
x,y: Boolean;
The order between them must be respected, but any of them may be absent. This last property is what generates a shift/reduce conflict with the following rules.
declaration_block ::= const_block types_block var_block;
// Constant declaration
const_block ::= dec_const const_block | ;
dec_const ::= IDEN TWOPOINT CONSTANT ASSIGN const_values SEMICOLON;
//Types declaration
types_block ::= dec_type types_block | ;
dec_type ::= TYPE IDEN IS RECORD
reg_list
END RECORD SEMICOLON;
reg_list ::= dec_reg reg_list | dec_reg;
dec_reg ::= IDEN TWOPOINT valid_types SEMICOLON;
//Variable declaration
var_block ::= dec_var var_block | ;
dec_variable ::= iden_list TWOPOINT valid_types SEMICOLON;
iden_list ::= IDEN | IDEN COMMA iden_list;
// common use
const_values ::= INT | booleans;
booleans ::= TRUE | FALSE;
valid_types ::= primitive_types | IDEN;
primitive_types ::= INTEGER | BOOLEAN;
The idea is that any X_block can be empty. I understand the shift-reduce conflict: when the parser starts and receives an identifier (IDEN), it doesn't know whether to reduce const_block ::= <empty> and take IDEN as part of dec_variable, or to shift and take the IDEN token as part of const_block. If I remove the empty/epsilon production in const_block or in types_block, the conflict disappears, although the grammar would then be incorrect, because const_block would be an infinite list of constants and the parser would report a syntax error at the reserved word "type".
So I may have an ambiguity caused because both constants and variables can go at the beginning and start with "id:" and either block can appear first. How can I rewrite the rules to resolve the ambiguities and the shift/reduce conflict they cause?
I tried to do something like:
declaration_block ::= const_block types_block var_block | const_block types_block | const_block var_block | types_block var_block | types_block | var_decl | ;
but I have the same problem.
Another attempt was to create new rules to identify whether it is a constant or a variable, but the ambiguity of the empty rule in const_block does not disappear.
dec_const ::= start_const ASSIGN valor_constantes SEMICOLON;
start_const ::= IDEN TWOPOINT CONSTANT;
// dec_var ::= start_variables SEMICOLON;
// start_var ::= lista_iden TWOPOINT tipos_validos;
If I reduce the problem to something simpler, without taking into account types and only allowing one declaration of a constant or a variable, the fact that these blocks can be empty produces the problem:
dec_var ::= iden_list TWOPOINT valid_types SEMICOLON | ;
iden_list ::= IDEN | IDEN COMMA lista_iden;
I hope to rewrite the rules in some way that solves this conflict, and to learn how to deal with similar problems in the future.
Thanks so much
To start with, your grammar is not ambiguous. But it does have a shift-reduce conflict (in fact, two of them), which indicates that it cannot be parsed deterministically with only one lookahead token.
As it happens, you could solve the problem (more or less) by just increasing the lookahead, if you had a parser generator which allowed you to do that. However, such parser generators are pretty rare, and CUP isn't one of them. There are parser generators which allow arbitrary lookahead, either by backtracking (possibly with memoisation, such as ANTLR4), or by using an algorithm which allows multiple alternatives to be explored in parallel (GLR, for example). But I don't know of a parser generator which can produce a deterministic transition table that uses two lookahead tokens (which would suffice in this case).
So the solution is to add some apparent redundancy to the grammar in order to factor out the cases which require more than one lookahead token.
The fundamental problem is the following set of possible inputs:
...; a : constant 3 ; ...
...; a : Integer ; ...
There's no ambiguity here whatsoever. The first one can only be a constant declaration; the second can only be a variable declaration. But observe that we don't discover that fact until we see either the keyword constant (as in the first case), or an identifier which could be a type (as in the second case).
What that means is that we need to avoid forcing the parser to make any decision involving the a and the : until the next token is available. In particular, we cannot force it to decide whether the a is just an IDEN, or the first (or only) element in an iden_list.
iden_list is needed to parse
...; a , b : Integer ; ...
but that's not a problem since the , is a definite sign that we have a list. So the resolution has to include handling a : Integer without reducing a to an iden_list. And that requires an (apparently) redundant production:
var_block::=
| dec_var var_block
dec_var : iden_list ':' type ';'
| IDEN ':' type ';'
iden_list : IDEN ',' IDEN
| iden_list ',' IDEN
(Note: I changed valid_types to type because valid is redundant -- only valid syntaxes are parsed -- and because I think you should never use a plural name for a singular object; it confuses the reader.)
That's not quite enough, though, because we also need to avoid forcing the parser to decide whether the const_block needs to be reduced before the variable declaration. For that, we need something like the attempt you already made to remove the empty block definitions, and instead provide eight different declaration_block productions, one for each of the eight possible combinations of empty and non-empty clauses. That will work fine, as long as you change the block definitions to be left-recursive rather than right-recursive. The right-recursive definition forces the parser to perform a reduction at the end of const_block, which means that it needs to know exactly where const_block ends with only one lookahead token.
On the whole, if you're going to use a bottom-up parser like CUP, you should make it a habit to use left-recursion unless you have a good reason not to (like defining a right-associative operator). There are a few exceptions, but on the whole left-recursion will produce fewer surprises, and in addition it will not burn through the parser stack on long inputs.
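The stack-usage point can be made concrete with a hand-written simulation. The sketch below is not a generated parser: it mimics by hand the shift and reduce steps a bottom-up parser would take for a list of items under the two styles of rule, and reports the deepest the stack gets. The function names and the symbols list and item are invented for the illustration.

```python
def max_stack_left_recursive(tokens):
    """Simulate  list ::= item | list item  and return the peak stack depth."""
    stack, peak = [], 0
    for _ in tokens:
        stack.append("item")              # shift
        peak = max(peak, len(stack))
        if stack[-2:] == ["list", "item"]:
            stack[-2:] = ["list"]         # reduce list ::= list item
        else:
            stack[-1:] = ["list"]         # reduce list ::= item
    return peak


def max_stack_right_recursive(tokens):
    """Simulate  list ::= item list | item  and return the peak stack depth."""
    stack, peak = [], 0
    for _ in tokens:
        stack.append("item")              # shift; nothing can be reduced yet
        peak = max(peak, len(stack))
    if stack:
        stack[-1:] = ["list"]             # reduce list ::= item at end of input
    while stack[-2:] == ["item", "list"]:
        stack[-2:] = ["list"]             # reduce list ::= item list
    return peak


print(max_stack_left_recursive(range(1000)))
print(max_stack_right_recursive(range(1000)))
```

For 1000 items the left-recursive simulation never holds more than two symbols, while the right-recursive one holds all 1000 before its first reduction. That first reduction is also the one that needs to know where the list ends.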
Making all those changes, we end up with something like this, where:
The block definitions were changed to left-recursive definitions with a non-empty base case;
iden_list was forced to have at least two elements, and a "redundant" production was added for the one-identifier case;
The start production was divided into eight possible combinations in order to allow each of the three subclauses to be empty;
A few minor name changes were made.
declaration_block ::=
| var_block
| types_block
| types_block var_block
| const_block
| const_block var_block
| const_block types_block
| const_block types_block var_block
;
// Constant declaration
const_block ::= dec_const
| const_block dec_const ;
dec_const ::= IDEN TWOPOINT CONSTANT ASSIGN const_value SEMICOLON;
//Types declaration
types_block ::= dec_type
| types_block dec_type ;
dec_type ::= TYPE IDEN IS RECORD
reg_list
END RECORD SEMICOLON;
reg_list ::= dec_reg
| reg_list dec_reg;
dec_reg ::= IDEN TWOPOINT type SEMICOLON;
//Variable declaration
var_block ::= dec_var
| var_block dec_var;
dec_var ::= iden_list TWOPOINT type SEMICOLON
| IDEN TWOPOINT type SEMICOLON ;
iden_list ::= IDEN COMMA IDEN
| iden_list COMMA IDEN;
// common use
const_value ::= INT | boolean;
boolean ::= TRUE | FALSE;
type ::= primitive_type | IDEN;
primitive_type ::= INTEGER | BOOLEAN;
I have the following BNFC code:
GFDefC. GoalForm ::= Constraint ;
GFDefT. GoalForm ::= True ;
GFDefA. GoalForm ::= GoalForm "," GoalForm ;
GFDefO. GoalForm ::= GoalForm ";" GoalForm ;
ConFr. Constraint ::= Var "#" Term ;
TVar. Term ::= UnVar;
TFun. Term ::= Fun ;
FDef. Fun ::= FunId "(" [Arg] ")" ;
ADecl. Arg ::= Term ;
separator Arg "," ;
...
However, the following is not parsed
fun(X)
while it parses the one below
x # fun(Y)
so to sum up, it parses the function as a part of constraints, but not individually.
It should parse both of them.
Could anyone point out why?
You should set your entrypoints properly.
As you're parsing x # fun(Y) successfully, I assume you have set your entrypoints to Constraint and are using the generated pConstraint function to parse your expressions. Then, you can change your rules of Constraint to
ConNoVar. Constraint ::= Term ;
ConFr. Constraint ::= Var "#" Term ;
Alternatively, you can add Term to your entrypoints and invoke pTerm to parse your function terms.
In OCaml when I do a pattern matching I can't do the following:
let rec example = function
| ... -> ...
| ... || ... -> ... (* here I get a syntax error because I use ||*)
Instead I need to do:
let rec example1 = function
|... -> ...
|... | ... -> ...
I know that || means or in OCaml, but why do we need to use only one 'pipe' : | to specify 'or' in pattern matching?
Why don't the usual || work?
|| doesn't really mean "or" generally, it means "boolean or", or rather it's the boolean or operator. An operator operates on values resulting from the evaluation of expressions, its operands. Operations and operands together also form expressions which can then be used as operands with other operators to form further expressions and so on.
Pattern matching, on the other hand, evaluates patterns, which are neither booleans nor expressions. Although patterns do in a sense evaluate to true or false if applied to, or rather matched against, a value, they do not evaluate to anything on their own. They are in that sense more like operators than operands. Furthermore, the result of matching against a pattern is not just a boolean value, but also a set of bindings.
Using || instead of | with patterns would overload its meaning and serve more to confuse than to clarify I think.
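To illustrate the point about bindings, here is a toy pattern matcher. It is written in Python rather than OCaml and is only a sketch: the tuple encoding of patterns and the function name match are invented for this example and are not how OCaml implements matching.

```python
def match(pattern, value):
    """Return a dict of bindings if `value` matches `pattern`, else None."""
    kind = pattern[0]
    if kind == "any":                      # wildcard, like _ in OCaml
        return {}
    if kind == "var":                      # binds the value to a name
        return {pattern[1]: value}
    if kind == "lit":                      # matches one constant
        return {} if value == pattern[1] else None
    if kind == "or":                       # like p1 | p2: first alternative that matches
        for alternative in pattern[1:]:
            bindings = match(alternative, value)
            if bindings is not None:
                return bindings
        return None
    if kind == "tuple":                    # matches component by component
        parts = pattern[1]
        if not isinstance(value, tuple) or len(value) != len(parts):
            return None
        bindings = {}
        for sub_pattern, sub_value in zip(parts, value):
            sub_bindings = match(sub_pattern, sub_value)
            if sub_bindings is None:
                return None
            bindings.update(sub_bindings)
        return bindings
    raise ValueError(f"unknown pattern kind: {kind!r}")


# Roughly the OCaml pattern  (0, x) | (x, 0)
zero_pair = ("or",
             ("tuple", [("lit", 0), ("var", "x")]),
             ("tuple", [("var", "x"), ("lit", 0)]))
print(match(zero_pair, (0, 5)))
print(match(zero_pair, (7, 0)))
print(match(zero_pair, (1, 2)))
```

The "or" case returns whichever alternative's bindings succeeded, not True or False, which is something a boolean operator applied to two already-evaluated operands could not express.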
I am trying to write some BNF (not EBNF) to describe the various elements of the following code fragment which is in no particular programming language but would be syntactically correct in VBA.
If Temperature <=0 Then
Description = "Freezing"
End If
So far I have come up with the BNF at the bottom of this post (I have not yet described string, number or identifier).
What perplexes me is the second line of code, Description = "Freezing", in which I am assigning a string literal to an identifier. How should I deal with this in my BNF?
I am tempted to simply adjust my definition of a factor like this...
<factor> ::= <identifier> | <number> | <string_literal> | (<expression>)
...after all, in VBA an arithmetic expression containing a string or a string variable would be syntactically correct and not picked up until run time. For example (4+3)*(6-"hi") would not be picked up as a syntax error. Is this the right approach?
Or should I leave the production for a factor as it is and redefine the assignment like this...?
<assignment> ::= <identifier> = <expression> | <identifier> = <string_literal>
I am not trying to define a whole language in my BNF, rather, I just want to cover most of the productions that describe the code fragment. Suggestions would be much appreciated.
BNF so far...
<string> ::= …
<number> ::= …
<identifier> ::= …
<assignment> ::= <identifier> = <expression>
<statements> ::= <statement> | <statement> <statements>
<statement> ::= <assignment> | <if_statement> | <for_statement> | <while_statement> | …
<expression> ::= <expression> + <term> | <expression> - <term> | <term>
<term> ::= <term> * <factor> | <term> / <factor> | <factor>
<factor> ::= <identifier> | <number> | (<expression>)
<relational_operator> ::= < | > | <= | >= | =
<condition> ::= <expression> <relational_operator> <expression>
<if_statement> ::= If <condition> Then <statement>
| If <condition> Then <statements> End If
| If <condition> Then <statements> Else <statements> End If
Consider the code sample:
X = "hi"
Y = 6 - X
The 6 - X expression is an error, but you can't make it a syntax error using just a context-free grammar. Similarly for:
If Temperature <= X Then ...
Instead of catching such type errors via the grammar, you'll have to catch them later, either statically or dynamically. And given that you have to do that analysis anyway, there's not much point trying to catch any type errors (express any type constraints) in the grammar.
So go with your first solution, adding <string_literal> to <factor>.
While you don't provide any details about your language, it seems reasonable to believe that a language which has string literals and string variables also has some operations on strings, at least function calls taking strings as arguments and probably certain operators. (In VB, as I understand it, both + and & function as string concatenation operators.)
In that case, assignment to a string variable is not limited to assigning a string literal, and the grammar would be expected to allow expressions including string literals.
It is always tempting to attempt to enforce type coherency in a grammar, on the basis that some type errors (such as 6 - "hi") can be detected immediately. But there are many other very similar errors (6 - HiStringVariable) which cannot be detected until type deduction (or even until runtime, for dynamic languages). The contortions necessary to do partial type checking during the parse are almost never worth the trouble.
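A separate checking pass over the parse tree might look like the sketch below. Everything here is an assumption made for illustration: the tuple encoding of the tree, the type_of function, the variables table, and the simplification that every binary operator requires two numbers (which ignores string concatenation with + or &).

```python
def type_of(node, variables):
    """Return "number" or "string" for an expression tree, or raise TypeError."""
    kind = node[0]
    if kind == "num":
        return "number"
    if kind == "str":
        return "string"
    if kind == "var":
        return variables[node[1]]          # declared type, invisible to the grammar
    if kind == "binop":
        _, op, left, right = node
        left_type = type_of(left, variables)
        right_type = type_of(right, variables)
        if left_type == right_type == "number":
            return "number"
        raise TypeError(f"operator {op!r} applied to {left_type} and {right_type}")
    raise ValueError(f"unknown node kind: {kind!r}")


variables = {"HiStringVariable": "string", "Temperature": "number"}

# All three trees are syntactically valid expressions.
literal_case = ("binop", "-", ("num", 6), ("str", "hi"))
variable_case = ("binop", "-", ("num", 6), ("var", "HiStringVariable"))
valid_case = ("binop", "-", ("num", 6), ("var", "Temperature"))

for tree in (literal_case, variable_case, valid_case):
    try:
        print(type_of(tree, variables))
    except TypeError as error:
        print("type error:", error)
```

The same pass reports both 6 - "hi" and 6 - HiStringVariable, so nothing is gained by making the grammar reject only the first.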
I'm trying to figure out a grammar rule(s) for any mathematical expression.
I'm using EBNF (wiki article linked below) for deriving syntax rules.
I've managed to come up with one that worked for a while, but the grammar rule fails with onScreenTime + (((count) - 1) * 0.9).
The rule is as follows:
math ::= MINUS? LPAREN math RPAREN
| mathOperand (mathRhs)+
mathRhs ::= mathOperator mathRhsGroup
| mathOperator mathOperand mathRhs?
mathRhsGroup ::= MINUS? LPAREN mathOperand (mathRhs | (mathOperator mathOperand))+ RPAREN
You can safely assume mathOperand are positive or negative numbers, or variables.
You can also assume mathOperator denotes any mathematical operator like + or -.
Also, LPAREN and RPAREN are '(' and ')' respectively.
EBNF:
https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form
EDIT
Forgot to mention that it fails on (count) - 1. It says RPAREN expected instead of - 1.
EDIT 2 My revised EBNF now looks like this:
number ::= NUMBER_LITERAL //positive integer
mathExp ::= term_ ((PLUS | MINUS) term_)* // * is zero-or-more.
private term_ ::= factor_ ((ASTERISK | FSLASH) factor_)*
private factor_ ::= PLUS factor_
| MINUS factor_
| primary_
private primary_ ::= number
| IDENTIFIER
| LPAREN mathExp RPAREN
Have a look at the expression grammar of any programming language:
expression
: term
| expression '+' term
| expression '-' term
;
term
: factor
| term '*' factor
| term '/' factor
| term '%' factor
;
factor
: primary
| '-' factor
| '+' factor
;
primary
: IDENTIFIER
| INTEGER
| FLOATING_POINT_LITERAL
| '(' expression ')'
;
Exponentiation left as an exercise for the reader: note that the exponentiation operator is right-associative. This is in yacc notation. NB You are using EBNF, not BNF.
EDIT My non-left-recursive EBNF is not as strong as my yacc, but to factor out the left-recursions you need a scheme like for example:
expression
::= term ((PLUS|MINUS) term)*
term
::= factor ((FSLASH|ASTERISK) factor)*
etc., where * means 'zero or more'. My comments on this below are mostly incorrect and should be ignored.
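A recursive descent parser for this non-left-recursive form can be sketched as follows, in Python. The tokenizer, the Parser class and the tuple-shaped tree are all invented for the illustration; only the shape of the rules comes from the grammar above.

```python
import re

TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|(.))")


def tokenize(text):
    """Split text into ("num", value), ("id", name) and ("op", char) tokens."""
    tokens = []
    for number, name, op in TOKEN.findall(text):
        if number:
            tokens.append(("num", float(number)))
        elif name:
            tokens.append(("id", name))
        elif op.strip():
            tokens.append(("op", op))
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("end", None)

    def accept(self, *ops):
        """Consume and return the next token if it is one of the given operators."""
        kind, value = self.peek()
        if kind == "op" and value in ops:
            self.pos += 1
            return value
        return None

    def expression(self):                  # expression ::= term ((PLUS|MINUS) term)*
        node = self.term()
        while True:
            op = self.accept("+", "-")
            if op is None:
                return node
            node = (op, node, self.term())

    def term(self):                        # term ::= factor ((FSLASH|ASTERISK) factor)*
        node = self.factor()
        while True:
            op = self.accept("*", "/")
            if op is None:
                return node
            node = (op, node, self.factor())

    def factor(self):                      # factor ::= (PLUS|MINUS) factor | primary
        op = self.accept("+", "-")
        if op is not None:
            return ("unary" + op, self.factor())
        return self.primary()

    def primary(self):                     # primary ::= number | IDENTIFIER | LPAREN expression RPAREN
        kind, value = self.peek()
        if kind in ("num", "id"):
            self.pos += 1
            return (kind, value)
        if self.accept("("):
            node = self.expression()
            if not self.accept(")"):
                raise SyntaxError("RPAREN expected")
            return node
        raise SyntaxError(f"unexpected token: {value!r}")


def parse(text):
    parser = Parser(text)
    node = parser.expression()
    if parser.peek()[0] != "end":
        raise SyntaxError(f"unexpected token: {parser.peek()[1]!r}")
    return node


print(parse("onScreenTime + (((count) - 1) * 0.9)"))
```

The loops in expression and term build the tree leftwards, so a - b - c still groups as (a - b) - c even though the rules themselves are not left-recursive.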
You may want to take a look at the expression grammars of languages that are typically implemented with recursive descent parsers. These need LL(1) grammars, which do not allow left recursion. Most if not all of Wirth's languages fall into this group. Below is an example from the grammar of classic Modula-2. EBNF links are shown next to each rule.
http://modula-2.info/m2pim/pmwiki.php/SyntaxDiagrams/PIM4NonTerminals#expression