how to generate all legal statements from ebnf? - ebnf

I have an EBNF that defines a grammar. The rules of the grammar allow around 8000 permutations that are legal statements.
Is there any utility that can automatically generate legal statements from an EBNF?

Related

How can a programming language that is specified using a context-free grammar, be capable of expressing a Turing Machine?

I've been getting into Automata theory, compilers and the fundamentals of CS, but there is something fundamental that I don't understand.
I have seen the Chomsky Hierarchy of languages where different classes of languages that have different expressive power are "associated" with an equivalently powerful automaton.
From Wikipedia :
GRAMMAR LANGUAGE AUTOMATON
Type-0 Recursively enumerable Turing machine
Type-1 Context-sensitive Linear-bounded non-deterministic Turing machine-
Type-2 Context-free Non-deterministic pushdown automaton
Type-3 Regular Finite state automaton
I've seen that every programming language are Turing Complete and that the grammar specifications of programming languages (formalised in BNF, etc..) can be expressed as a Context-free Grammar.
Context-free grammars dont have an "associated" Turing Machine as equivalent.
During interpretation / compilation, the string of the source code of a program written in a programming language (like C, python, etc..) is parsed/translated into an Abstract Syntax Tree.
(As I understand, this is like extracting an array from a string when matching the string against a regular expression, except that the pattern here is not a regular expression, it is a context-free grammar, which is more powerful, hence the tree structure extracted which contain more information that a linear array (coming from capture groups of a regex).)
So the program written, potentially implementing a Turing Machine, is converted into an Abstract Syntax Tree, and all the information contained into the original program is now incorporated into the tree. And later, during execution, the program will accompished some computation that can be as complex as a Turing Machine.
My question is :
How can a string expressed within the confines of the rules dictated by what a Context-free Grammar can be, be implementing a Turing Machine while the equivalence grammar/language/automata and the Chomsky Hierarchy say a Context-free Grammar isn't expressive enough to do so ?
Is one of my assumptions wrong ?
Or is the fact that memory plays a role in this, and that there is a theorem that says something like :
a Turing Machine can be implemented "using" a Tree + a Stack ?
This is really bugging me.
Anything that can enlighten me is really appreciated !
EDIT :
Here's a DUPLICATE of my question :
chomsky hierarchy and programming languages
Why I mistakenly thought that the syntax specification of a programming language defines its semantics ?
Because of what YACC does : (syntax-directed translation)
https://en.wikipedia.org/wiki/Syntax-directed_translation
which associates the rules of the context-free grammar used to parse the programming language (which is used to make the abstract syntax tree) with an action.
This is the source of my confusion.
For example, here's a copy paste of an extract of the source code of the perl5 interpreter. This is the file perly.y which is used to by yacc to make the first pass of compilation.
/* Binary operators between terms */
termbinop: term[lhs] ASSIGNOP term[rhs] /* $x = $y, $x += $y */
{ $$ = newASSIGNOP(OPf_STACKED, $lhs, $ASSIGNOP, $rhs); }
| term[lhs] POWOP term[rhs] /* $x ** $y */
{ $$ = newBINOP($POWOP, 0, scalar($lhs), scalar($rhs)); }
| term[lhs] MULOP term[rhs] /* $x * $y, $x x $y */
{ if ($MULOP != OP_REPEAT)
scalar($lhs);
$$ = newBINOP($MULOP, 0, $lhs, scalar($rhs));
}
This shows a one to one correspondence between a derivation rule and its associated action (what is inside curly brackets).
The 'level' of grammar you use to define a language determines the automaton required to recognize (parse) that language, but it is unrelated to the "power" of that language.
E.g., if you use a Type 2 grammar (CFG) to define a language, the Chomsky hierarchy tells you that you'll need a pushdown automaton to recognize it, but the language might be a Turing-complete programming language, or it might be a language for regular expressions, or it might be a language with no computational "power" at all.
For a more extreme example, you can imagine using a Type 3 grammar (regular expression) to define a language for 'programming' a Turing machine.
The power of a language (in particular, whether it's Turing-complete) depends on its semantics, not its syntax.

Are operators a subset of statements?

Basically in all high-level languages (as I know) we have two main categories of language mechanisms to create a program: statements and expressions.
Usually statements are represented by some subset of language's keywords: if/else/switch, for/foreach/while, {} (or BEGIN/END), etc.
Expressions are represented by literals (which represent some data) and operators: literals: 1, 2, -100, testTest, etc; operators: +, -, /, *, ==, ===, etc.
If we think deeper, we can notice that statements usually answer on question "what?" and expressions -- on question "how?". Statements represent actions, expressions represent the context of actions.
Then we may look in expressions' parts again: literals and operators. Operators are actions too.
And here is my question again: are operators a subset of statement?
P.S. Generally, I understand that statements and expression are used together to aim some programming target. Separation of this categories is mostly theoretical.
In general, "operator" describes a kind of syntactic form, which can be used to produce an expression, a statement, or some other class of language entity. So, technically, the answer to your question is, "no".
For example, Haskell uses a | operator to generate an algebraic type spec, which is neither an expression nor a statement:
data Maybe a = Just a | Nothing

Any tools can randomly generate the source code according to a language grammar?

A C program source code can be parsed according to the C grammar(described in CFG) and eventually turned into many ASTs. I am considering if such tool exists: it can do the reverse thing by firstly randomly generating many ASTs, which include tokens that don't have the concrete string values, just the types of the tokens, according to the CFG, then generating the concrete tokens according to the tokens' definitions in the regular expression.
I can imagine the first step looks like an iterative non-terminals replacement, which is randomly and can be limited by certain number of iteration times. The second step is just generating randomly strings according to regular expressions.
Is there any tool that can do this?
The "Data Generation Language" DGL does this, with the added ability to weight the probabilities of productions in the grammar being output.
In general, a recursive descent parser can be quite directly rewritten into a set of recursive procedures to generate, instead of parse / recognise, the language.
Given a context-free grammar of a language, it is possible to generate a random string that matches the grammar.
For example, the nearley parser generator includes an implementation of an "unparser" that can generate strings from a grammar.
The same task can be accomplished using definite clause grammars in Prolog. An example of a sentence generator using definite clause grammars is given here.
If you have a model of the grammar in a normalized form (all rules like this):
LHS = RHS1 RHS2 ... RHSn ;
and language prettyprinter (e.g., AST to text conversion tool), you can build one of these pretty easily.
Simply start with the goal symbol as a unit tree.
Repeat until no nonterminals are left:
Pick a nonterminal N in the tree;
Expand by adding children for the right hand side of any rule
whose left-hand side matches the nonterminal N
For terminals that carry values (e.g., variable names, numbers, strings, ...) you'll have to generate random content.
A complication with the above algorithm is that it doesn't clearly terminate. What you actually want to do is pick some limit on the size of your tree, and run the algorithm until the all nonterminals are gone or you exceed the limit. In the latter case, backtrack, undo the last replacement, and try something else. This gets you a bounded depth-first search for an AST of your determined size.
Then prettyprint the result. Its the prettyprinter part that is hard to get right.
[You can build all this stuff yourself including the prettyprinter, but it is a fair amount of work. I build tools that include all this machinery directly in a language-parameterized way; see my bio].
A nasty problem even with well formed ASTs is that they may be nonsensical; you might produce a declaration of an integer X, and assign a string literal value to it, for a language that doesn't allow that. You can probably eliminate some simple problems, but language semantics can be incredibly complex, consider C++ as an example. Ensuring that you end up with a semantically meaningful program is extremely hard; in essence, you have to parse the resulting text, and perform name and type resolution/checking on it. For C++, you need a complete C++ front end.
the problem with random generation is that for many CFGs, the expected length of the output string is infinite (there is an easy computation of the expected length using generating functions corresponding to the non-terminal symbols and equations corresponding to the rules of the grammar); you have to control the relative probabilities of the productions in certain ways to guarantee convergence; for example, sometimes, weighting each production rule for a non-terminal symbol inversely to the length of its RHS suffices
there is lot more on this subject in:
Noam Chomsky, Marcel-Paul Sch\"{u}tzenberger, ``The Algebraic Theory of Context-Free Languages'', pp.\ 118-161 in P. Braffort and D. Hirschberg (eds.), Computer Programming and Formal Systems, North-Holland (1963)
(see Wikipedia entry on Chomsky–Schützenberger enumeration theorem)

Prolog loops and conditional statements?

Is there anything in Prolog that works like a for loop and if then condition?
if/then/else can be obtained with (->)/2 and (;)/2:
( If ->
Then
; Else
)
Sometimes this is useful. In general though (when the condition contains variables), it will make your programs unsound and incomplete. Whenever it is possible to describe the conditions with pattern matching, you should use pattern matching instead. You can then not only check but also generate solutions.
If you are looking for such kinds of statements then you are not thinking in Prolog :)
Just kidding, by the way there aren't plain translation or for and if/else, but you can think about how they should be in prolog:
an if/else statement can be obtained by just having two rules that match over different conditions
a for loop can be done with two recursive rules, one is the base case and it doesn't depends upon itself to keepon while the other does what you intend to do inside the loop and follows itself..

What programming languages are context-free?

Or, to be a little more precise: which programming languages are defined by a context-free grammar?
From what I gather C++ is not context-free due to things like macros and templates. My gut tells me that functional languages might be context free, but I don't have any hard data to back that up with.
Extra rep for concise examples :-)
What programming languages are context-free? [...]
My gut tells me that functional languages might be context-free [...]
The short version: There are hardly any real-world programming languages that are context-free in any meaning of the word. Whether a language is context-free or not has nothing to do with it being functional. It is simply a matter of how complex the syntax is.
Here's a CFG for the imperative language Brainfuck:
Program → Instr Program | ε
Instr → '+' | '-' | '>' | '<' | ',' | '.' | '[' Program ']'
And here's a CFG for the functional SKI combinator calculus:
Program → E
E → 'S' E E E
E → 'K' E E
E → 'I'
E → '(' E ')'
These CFGs recognize all valid programs of the two languages because they're so simple.
The longer version: Usually, context-free grammars (CFGs) are only used to roughly specify the syntax of a language. One must distinguish between syntactically correct programs and programs that compile/evaluate correctly. Most commonly, compilers split language analysis into syntax analysis that builds and verifies the general structure of a piece of code, and semantic analysis that verifies the meaning of the program.
If by "context-free language" you mean "... for which all programs compile", then the answer is: hardly any. Languages that fit this bill hardly have any rules or complicated features, like the existence of variables, whitespace-sensitivity, a type system, or any other context: Information defined in one place and relied upon in another.
If, on the other hand, "context-free language" only means "... for which all programs pass syntax analysis", the answer is a matter of how complex the syntax alone is. There are many syntactic features that are hard or impossible to describe with a CFG alone. Some of these are overcome by adding additional state to parsers for keeping track of counters, lookup tables, and so on.
Examples of syntactic features that are not possible to express with a CFG:
Indentation- and whitespace-sensitive languages like Python and Haskell. Keeping track of arbitrarily nested indentation levels is essentially context-sensitive and requires separate counters for the indentation level; both how many spaces that are used for each level and how many levels there are.
Allowing only a fixed level of indentation using a fixed amount of spaces would work by duplicating the grammar for each level of indentation, but in practice this is inconvenient.
The C Typedef Parsing Problem says that C programs are ambiguous during lexical analysis because it cannot know from the grammar alone if something is a regular identifier or a typedef alias for an existing type.
The example is:
typedef int my_int;
my_int x;
At the semicolon, the type environment needs to be updated with an entry for my_int. But if the lexer has already looked ahead to my_int, it will have lexed it as an identifier rather than a type name.
In context-free grammar terms, the X → ... rule that would trigger on my_int is ambiguous: It could be either one that produces an identifier, or one that produces a typedef'ed type; knowing which one relies on a lookup table (context) beyond the grammar itself.
Macro- and template-based languages like Lisp, C++, Template Haskell, Nim, and so on. Since the syntax changes as it is being parsed, one solution is to make the parser into a self-modifying program. See also Is C++ context-free or context-sensitive?
Often, operator precedence and associativity are not expressed directly in CFGs even though it is possible. For example, a CFG for a small expression grammar where ^ binds tighter than ×, and × binds tighter than +, might look like this:
E → E ^ E
E → E × E
E → E + E
E → (E)
E → num
This CFG is ambiguous, however, and is often accompanied by a precedence / associativity table saying e.g. that ^ binds tightest, × binds tighter than +, that ^ is right-associative, and that × and + are left-associative.
Precedence and associativity can be encoded into a CFG in a mechanical way such that it is unambiguous and only produces syntax trees where the operators behave correctly. An example of this for the grammar above:
E₀ → EA E₁
EA → E₁ + EA
EA → ε
E₁ → EM E₂
EM → E₂ × EM
EM → ε
E₂ → E₃ EP
EP → ^ E₃ EP
E₃ → num
E₃ → (E₀)
But ambiguous CFGs + precedence / associativity tables are common because they're more readable and because various types of LR parser generator libraries can produce more efficient parsers by eliminating shift/reduce conflicts instead of dealing with an unambiguous, transformed grammar of a larger size.
In theory, all finite sets of strings are regular languages, and so all legal programs of bounded size are regular. Since regular languages are a subset of context-free languages, all programs of bounded size are context-free. The argument continues,
While it can be argued that it would be an acceptable limitation for a language to allow only programs of less than a million lines, it is not practical to describe a programming language as a regular language: The description would be far too large.
     — Torben Morgensen's Basics of Compiler Design, ch. 2.10.2
The same goes for CFGs. To address your sub-question a little differently,
Which programming languages are defined by a context-free grammar?
Most real-world programming languages are defined by their implementations, and most parsers for real-world programming languages are either hand-written or uses a parser generator that extends context-free parsing. It is unfortunately not that common to find an exact CFG for your favourite language. When you do, it's usually in Backus-Naur form (BNF), or a parser specification that most likely isn't purely context-free.
Examples of grammar specifications from the wild:
BNF for Standard ML
BNF-like for Haskell
BNF for SQL
Yacc grammar for PHP
The set of programs that are syntactically correct is context-free for almost all languages.
The set of programs that compile is not context-free for almost all languages. For example, if the set of all compiling C programs were context free, then by intersecting with a regular language (also known as a regex), the set of all compiling C programs that match
^int main\(void\) { int a+; a+ = a+; return 0; }$
would be context-free, but this is clearly isomorphic to the language a^kba^kba^k, which is well-known not to be context-free.
Depending on how you understand the question, the answer changes. But IMNSHO, the proper answer is that all modern programming languages are in fact context sensitive. For example there is no context free grammar that accepts only syntactically correct C programs. People who point to yacc/bison context free grammars for C are missing the point.
To go for the most dramatic example of a non-context-free grammar, Perl's grammar is, as I understand it, turing-complete.
If I understand your question, you are looking for programming languages which can be described by context free grammars (cfg) so that the cfg generates all valid programs and only valid programs.
I believe that most (if not all) modern programming languages are therefore not context free. For example, once you have user defined types (very common in modern languages) you are automatically context sensitive.
There is a difference between verifying syntax and verifying semantic correctness of a program. Checking syntax is context free, whereas checking semantic correctness isn't (again, in most languages).
This, however, does not mean that such a language cannot exist. Untyped lambda calculus, for example, can be described using a context free grammar, and is, of course, Turing complete.
Most of the modern programming languages are not context-free languages. As a proof, if I delve into the root of CFL its corresponding machine PDA can't process string matchings like {ww | w is a string}. So most programming languages require that.
Example:
int fa; // w
fa=1; // ww as parser treat it like this
VHDL is somewhat context sensitive:
VHDL is context-sensitive in a mean way. Consider this statement inside a
process:
jinx := foo(1);
Well, depending on the objects defined in the scope of the process (and its
enclosing scopes), this can be either:
A function call
Indexing an array
Indexing an array returned by a parameter-less function call
To parse this correctly, a parser has to carry a hierarchical symbol table
(with enclosing scopes), and the current file isn't even enough. foo can be a
function defined in a package. So the parser should first analyze the packages
imported by the file it's parsing, and figure out the symbols defined in them.
This is just an example. The VHDL type/subtype system is a similarly
context-sensitive mess that's very difficult to parse.
(Eli Bendersky, “Parsing VHDL is [very] hard”, 2009)
Let's take Swift, where the user can define operators including operator precedence and associativity. For example, the operators + and * are actually defined in the standard library.
A context free grammar and a lexer may be able to parse a + b - c * d + e, but the semantics is "five operands a, b, c, d and e, separated by the operators +, -, * and +". That's what a parser can achieve without knowing about operators. A context free grammar and a lexer may also be able to parse a +-+ b -+- c, which is three operands a, b and c separated by operators +-+ and -+-.
A parser can "parse" a source file according to a context-free Swift grammar, but that's nowhere near the job done. Another step would be collecting knowledge about operators, and then change the semantics of a + b - c * d + e to be the same as operator+ (operator- (operator+ (a, b), operator* (c, d)), e).
So there is (or maybe there is, I havent checked to closely) a context free grammar, but it only gets you so far to parsing a program.
I think Haskell and ML are supporting context free. See this link for Haskell.

Resources