What programming languages are context-free? - compiler-theory

Or, to be a little more precise: which programming languages are defined by a context-free grammar?
From what I gather C++ is not context-free due to things like macros and templates. My gut tells me that functional languages might be context free, but I don't have any hard data to back that up with.
Extra rep for concise examples :-)

What programming languages are context-free? [...]
My gut tells me that functional languages might be context-free [...]
The short version: There are hardly any real-world programming languages that are context-free in any meaning of the word. Whether a language is context-free or not has nothing to do with it being functional. It is simply a matter of how complex the syntax is.
Here's a CFG for the imperative language Brainfuck:
Program → Instr Program | ε
Instr → '+' | '-' | '>' | '<' | ',' | '.' | '[' Program ']'
And here's a CFG for the functional SKI combinator calculus:
Program → E
E → 'S' E E E
E → 'K' E E
E → 'I'
E → '(' E ')'
These CFGs generate all valid programs of the two languages because the languages are so simple.
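To make this concrete, here is a minimal sketch (in Python; the function names are my own) of a recursive-descent recognizer for the Brainfuck grammar above - the recursion on '[' Program ']' is the only nesting the grammar has:

def parse_program(src, i=0):
    """Consume a Program (Instr Program | ε) starting at i; return the index after it."""
    while i < len(src):
        if src[i] in '+-><,.':
            i += 1
        elif src[i] == '[':
            i = parse_program(src, i + 1)         # nested Program
            if i == -1 or i >= len(src) or src[i] != ']':
                return -1                         # unmatched '['
            i += 1
        else:
            break    # ']' closes an enclosing Program; anything else is left over
    return i

def is_valid_brainfuck(src):
    return parse_program(src) == len(src)

# is_valid_brainfuck("+[->+<]") -> True
# is_valid_brainfuck("+[->+<")  -> False (unmatched '[')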
The longer version: Usually, context-free grammars (CFGs) are only used to roughly specify the syntax of a language. One must distinguish between syntactically correct programs and programs that compile/evaluate correctly. Most commonly, compilers split language analysis into syntax analysis that builds and verifies the general structure of a piece of code, and semantic analysis that verifies the meaning of the program.
If by "context-free language" you mean "... for which all programs compile", then the answer is: hardly any. Languages that fit this bill hardly have any rules or complicated features, like the existence of variables, whitespace-sensitivity, a type system, or any other context: Information defined in one place and relied upon in another.
If, on the other hand, "context-free language" only means "... for which all programs pass syntax analysis", the answer is a matter of how complex the syntax alone is. There are many syntactic features that are hard or impossible to describe with a CFG alone. Some of these are overcome by adding additional state to parsers for keeping track of counters, lookup tables, and so on.
Examples of syntactic features that are not possible to express with a CFG:
Indentation- and whitespace-sensitive languages like Python and Haskell. Keeping track of arbitrarily nested indentation levels is essentially context-sensitive and requires separate counters for the indentation: both how many spaces are used for each level and how many levels there are.
Allowing only a fixed level of indentation using a fixed amount of spaces would work by duplicating the grammar for each level of indentation, but in practice this is inconvenient.
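A sketch of the usual workaround (Python-style; the token names and details are my own simplification, error handling omitted): a pre-lexer keeps a stack of indentation widths and emits INDENT/DEDENT tokens, which is exactly the extra state a CFG cannot carry:

def indent_tokens(lines):
    stack = [0]                                   # stack of indentation widths
    for line in lines:
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:                     # deeper: open a block
            stack.append(width)
            yield 'INDENT'
        while width < stack[-1]:                  # shallower: close blocks
            stack.pop()
            yield 'DEDENT'
        yield line.strip()
    while len(stack) > 1:                         # close blocks left open at EOF
        stack.pop()
        yield 'DEDENT'

# list(indent_tokens(["if x:", "    y", "z"]))
# -> ['if x:', 'INDENT', 'y', 'DEDENT', 'z']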
The C Typedef Parsing Problem says that C programs are ambiguous during lexical analysis because the lexer cannot know from the grammar alone whether something is a regular identifier or a typedef alias for an existing type.
The example is:
typedef int my_int;
my_int x;
At the semicolon, the type environment needs to be updated with an entry for my_int. But if the lexer has already looked ahead to my_int, it will have lexed it as an identifier rather than a type name.
In context-free grammar terms, the X → ... rule that would trigger on my_int is ambiguous: It could be either one that produces an identifier, or one that produces a typedef'ed type; knowing which one relies on a lookup table (context) beyond the grammar itself.
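A sketch of the classic workaround, the "lexer hack" (Python; names are my own): the lexer consults a table of typedef names that the parser fills in as declarations are processed - precisely the lookup table (context) a CFG cannot express:

typedef_names = {'int', 'char', 'void'}   # seeded with built-ins

def classify(token):
    """Decide the token kind using context beyond the grammar."""
    return 'TYPE_NAME' if token in typedef_names else 'IDENTIFIER'

assert classify('my_int') == 'IDENTIFIER'   # before the typedef is processed

typedef_names.add('my_int')                 # parser handled "typedef int my_int;"
assert classify('my_int') == 'TYPE_NAME'    # "my_int x;" now parses as a declaration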
Macro- and template-based languages like Lisp, C++, Template Haskell, Nim, and so on. Since the syntax changes as it is being parsed, one solution is to make the parser into a self-modifying program. See also Is C++ context-free or context-sensitive?
Often, operator precedence and associativity are not expressed directly in CFGs even though it is possible. For example, a CFG for a small expression grammar where ^ binds tighter than ×, and × binds tighter than +, might look like this:
E → E ^ E
E → E × E
E → E + E
E → (E)
E → num
This CFG is ambiguous, however, and is often accompanied by a precedence / associativity table saying e.g. that ^ binds tightest, × binds tighter than +, that ^ is right-associative, and that × and + are left-associative.
Precedence and associativity can be encoded into a CFG in a mechanical way such that it is unambiguous and only produces syntax trees where the operators behave correctly. An example of this for the grammar above:
E₀ → EA E₁
EA → E₁ + EA
EA → ε
E₁ → EM E₂
EM → E₂ × EM
EM → ε
E₂ → E₃ EP
EP → ^ E₃ EP
EP → ε
E₃ → num
E₃ → (E₀)
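As an illustration, here is a minimal sketch (Python; names my own) of a recursive-descent parser realizing the stratified grammar above. The while loops play the role of the EA and EM repetitions and build left-associative trees, while parse_pow recurses on the right so that ^ is right-associative:

import re

def tokenize(s):
    return re.findall(r'\d+|[+×^()]', s)

def parse(s):
    tree, rest = parse_sum(tokenize(s))
    assert not rest, "trailing input"
    return tree

def parse_sum(toks):              # E₀: '+'-separated E₁s, left-associative
    left, toks = parse_prod(toks)
    while toks and toks[0] == '+':
        right, toks = parse_prod(toks[1:])
        left = ('+', left, right)
    return left, toks

def parse_prod(toks):             # E₁: '×'-separated E₂s, left-associative
    left, toks = parse_pow(toks)
    while toks and toks[0] == '×':
        right, toks = parse_pow(toks[1:])
        left = ('×', left, right)
    return left, toks

def parse_pow(toks):              # E₂ → E₃ EP, '^' right-associative
    base, toks = parse_atom(toks)
    if toks and toks[0] == '^':
        exponent, toks = parse_pow(toks[1:])
        return ('^', base, exponent), toks
    return base, toks

def parse_atom(toks):             # E₃ → num | (E₀)
    if toks[0] == '(':
        tree, toks = parse_sum(toks[1:])
        assert toks and toks[0] == ')', "missing ')'"
        return tree, toks[1:]
    return int(toks[0]), toks[1:]

# parse("1+2×3^2^2") == ('+', 1, ('×', 2, ('^', 3, ('^', 2, 2))))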
But ambiguous CFGs + precedence / associativity tables are common because they're more readable and because various types of LR parser generator libraries can produce more efficient parsers by eliminating shift/reduce conflicts instead of dealing with an unambiguous, transformed grammar of a larger size.
In theory, all finite sets of strings are regular languages, and so all legal programs of bounded size are regular. Since regular languages are a subset of context-free languages, all programs of bounded size are context-free. The argument continues,
While it can be argued that it would be an acceptable limitation for a language to allow only programs of less than a million lines, it is not practical to describe a programming language as a regular language: The description would be far too large.
     — Torben Mogensen's Basics of Compiler Design, ch. 2.10.2
The same goes for CFGs. To address your sub-question a little differently,
Which programming languages are defined by a context-free grammar?
Most real-world programming languages are defined by their implementations, and most parsers for real-world programming languages are either hand-written or use a parser generator that extends context-free parsing. It is unfortunately not that common to find an exact CFG for your favourite language. When you do, it's usually in Backus-Naur form (BNF), or as a parser specification that most likely isn't purely context-free.
Examples of grammar specifications from the wild:
BNF for Standard ML
BNF-like for Haskell
BNF for SQL
Yacc grammar for PHP

The set of programs that are syntactically correct is context-free for almost all languages.
The set of programs that compile is not context-free for almost all languages. For example, if the set of all compiling C programs were context-free, then by intersecting it with a regular language (also known as a regex), the set of all compiling C programs that match
^int main\(void\) { int a+; a+ = a+; return 0; }$
would also be context-free. But this language is clearly isomorphic to aᵏbaᵏbaᵏ, which is well known not to be context-free.

Depending on how you understand the question, the answer changes. But IMNSHO, the proper answer is that all modern programming languages are in fact context sensitive. For example there is no context free grammar that accepts only syntactically correct C programs. People who point to yacc/bison context free grammars for C are missing the point.

To go for the most dramatic example of a non-context-free grammar: Perl's grammar is, as I understand it, Turing-complete.

If I understand your question, you are looking for programming languages which can be described by a context-free grammar (CFG), so that the CFG generates all valid programs and only valid programs.
I believe that most (if not all) modern programming languages are therefore not context free. For example, once you have user defined types (very common in modern languages) you are automatically context sensitive.
There is a difference between verifying syntax and verifying semantic correctness of a program. Checking syntax is context free, whereas checking semantic correctness isn't (again, in most languages).
This, however, does not mean that such a language cannot exist. Untyped lambda calculus, for example, can be described using a context free grammar, and is, of course, Turing complete.
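For instance, a CFG for untyped lambda calculus, in the same style as the grammars above, could look like this (a sketch; the variable rule is abbreviated):

E → V
E → λ V . E
E → E E
E → ( E )
V → 'x' | 'y' | 'z' | ...

Every string this grammar generates is a well-formed term; there is no extra context to check.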

Most modern programming languages are not context-free languages. As a proof sketch: the machine corresponding to CFLs, the pushdown automaton (PDA), can't process string matchings like {ww | w is a string}, and most programming languages require exactly that kind of matching.
Example:
int fa; // w
fa = 1; // ww, as the parser has to match the second occurrence against the first

VHDL is somewhat context sensitive:
VHDL is context-sensitive in a mean way. Consider this statement inside a process:
jinx := foo(1);
Well, depending on the objects defined in the scope of the process (and its enclosing scopes), this can be either:
A function call
Indexing an array
Indexing an array returned by a parameter-less function call
To parse this correctly, a parser has to carry a hierarchical symbol table (with enclosing scopes), and the current file isn't even enough. foo can be a function defined in a package. So the parser should first analyze the packages imported by the file it's parsing, and figure out the symbols defined in them.
This is just an example. The VHDL type/subtype system is a similarly context-sensitive mess that's very difficult to parse.
(Eli Bendersky, “Parsing VHDL is [very] hard”, 2009)

Let's take Swift, where the user can define operators including operator precedence and associativity. For example, the operators + and * are actually defined in the standard library.
A context free grammar and a lexer may be able to parse a + b - c * d + e, but the semantics is "five operands a, b, c, d and e, separated by the operators +, -, * and +". That's what a parser can achieve without knowing about operators. A context free grammar and a lexer may also be able to parse a +-+ b -+- c, which is three operands a, b and c separated by operators +-+ and -+-.
A parser can "parse" a source file according to a context-free Swift grammar, but that's nowhere near the job done. Another step would be collecting knowledge about operators, and then change the semantics of a + b - c * d + e to be the same as operator+ (operator- (operator+ (a, b), operator* (c, d)), e).
So there is (or maybe there is, I haven't checked too closely) a context-free grammar, but it only takes you part of the way towards parsing a program.

I think Haskell and ML support context-free grammars. See this link for Haskell.

Related

How can a programming language that is specified using a context-free grammar be capable of expressing a Turing Machine?

I've been getting into Automata theory, compilers and the fundamentals of CS, but there is something fundamental that I don't understand.
I have seen the Chomsky Hierarchy of languages where different classes of languages that have different expressive power are "associated" with an equivalently powerful automaton.
From Wikipedia:
GRAMMAR   LANGUAGE                  AUTOMATON
Type-0    Recursively enumerable    Turing machine
Type-1    Context-sensitive         Linear-bounded non-deterministic Turing machine
Type-2    Context-free              Non-deterministic pushdown automaton
Type-3    Regular                   Finite-state automaton
I've seen that every programming language is Turing-complete and that the grammar specifications of programming languages (formalised in BNF, etc.) can be expressed as a context-free grammar.
Context-free grammars don't have an "associated" Turing machine as their equivalent automaton.
During interpretation / compilation, the string of the source code of a program written in a programming language (like C, python, etc..) is parsed/translated into an Abstract Syntax Tree.
(As I understand it, this is like extracting an array from a string by matching the string against a regular expression, except that the pattern here is not a regular expression but a context-free grammar, which is more powerful; hence the extracted tree structure contains more information than the linear array coming from the capture groups of a regex.)
So the program written, potentially implementing a Turing Machine, is converted into an Abstract Syntax Tree, and all the information contained in the original program is now incorporated into the tree. And later, during execution, the program will accomplish some computation that can be as complex as a Turing Machine.
My question is :
How can a string expressed within the confines of the rules of a context-free grammar implement a Turing Machine, when the grammar/language/automaton equivalence of the Chomsky Hierarchy says a context-free grammar isn't expressive enough to do so?
Is one of my assumptions wrong ?
Or does memory play a role in this, and is there a theorem that says something like:
a Turing Machine can be implemented "using" a Tree + a Stack?
This is really bugging me.
Anything that can enlighten me is really appreciated !
EDIT:
Here's a DUPLICATE of my question:
chomsky hierarchy and programming languages
Why did I mistakenly think that the syntax specification of a programming language defines its semantics?
Because of what YACC does: (syntax-directed translation)
https://en.wikipedia.org/wiki/Syntax-directed_translation
which associates the rules of the context-free grammar used to parse the programming language (which is used to make the abstract syntax tree) with an action.
This is the source of my confusion.
For example, here's a copy-paste of an extract of the source code of the perl5 interpreter. This is the file perly.y, which is used by yacc to make the first pass of compilation.
/* Binary operators between terms */
termbinop: term[lhs] ASSIGNOP term[rhs]   /* $x = $y, $x += $y */
            { $$ = newASSIGNOP(OPf_STACKED, $lhs, $ASSIGNOP, $rhs); }
    |      term[lhs] POWOP term[rhs]      /* $x ** $y */
            { $$ = newBINOP($POWOP, 0, scalar($lhs), scalar($rhs)); }
    |      term[lhs] MULOP term[rhs]      /* $x * $y, $x x $y */
            {
              if ($MULOP != OP_REPEAT)
                  scalar($lhs);
              $$ = newBINOP($MULOP, 0, $lhs, scalar($rhs));
            }
This shows a one-to-one correspondence between a derivation rule and its associated action (what is inside the curly brackets).
The 'level' of grammar you use to define a language determines the automaton required to recognize (parse) that language, but it is unrelated to the "power" of that language.
E.g., if you use a Type 2 grammar (CFG) to define a language, the Chomsky hierarchy tells you that you'll need a pushdown automaton to recognize it, but the language might be a Turing-complete programming language, or it might be a language for regular expressions, or it might be a language with no computational "power" at all.
For a more extreme example, you can imagine using a Type 3 grammar (regular expression) to define a language for 'programming' a Turing machine.
The power of a language (in particular, whether it's Turing-complete) depends on its semantics, not its syntax.
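To illustrate, here is a sketch (Python; the toy rule language is invented for this example) of a language whose syntax can be checked by a single regular expression, yet whose semantics is a Turing machine - each ';'-terminated rule means "in this state, reading this symbol, write, move, switch state":

import re

SYNTAX = re.compile(r'(?:[a-z]+,[01_],[01_],[LR],[a-z]+;)+')   # a Type-3 check

def run(program, tape_input):
    assert SYNTAX.fullmatch(program), "syntactically invalid program"
    rules = {}
    for rule in program.rstrip(';').split(';'):
        state, read, write, move, nxt = rule.split(',')
        rules[(state, read)] = (write, move, nxt)
    tape, state, pos = dict(enumerate(tape_input)), 'init', 0
    while (state, tape.get(pos, '_')) in rules:    # halt when no rule applies
        write, move, state = rules[(state, tape.get(pos, '_'))]
        tape[pos] = write
        pos += 1 if move == 'R' else -1
    return ''.join(tape[i] for i in sorted(tape))

# A program in the toy language that flips every bit:
# run("init,0,1,R,init;init,1,0,R,init;", "0110") -> "1001"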

Equality vs. Bicondition in Z3

I'm trying to understand the difference in Z3 between equality testing and biconditional. My understanding is that = is used to express biconditional, but how is equality tested?
For example, I am trying to write something similar to the following (toy) statement in z3:
on_table(o, a) ↔ (in_hand(o) ∧ a != pickup(o)) ∨ a = put_on_table(o)
Note: I am aware the above statement can be factored into a set of implications, but I am interested in expressing it as a single biconditional.
For the Bool type, equality and biconditional are the same operations. For any other type, biconditional doesn't really make sense.
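For example, with the Z3 Python bindings (flattening the predicates of the toy statement into Bool constants is my own simplification):

from z3 import Bools, Solver, And, Or, Not

# Atoms of the toy statement, each flattened to a Bool constant:
on_table, in_hand, a_is_pickup, a_is_put = Bools('on_table in_hand a_is_pickup a_is_put')

s = Solver()
# on_table <-> (in_hand and a != pickup) or a = put_on_table;
# for Bools, == is the biconditional:
s.add(on_table == Or(And(in_hand, Not(a_is_pickup)), a_is_put))

print(s.check())   # sat
print(s.model())   # one assignment satisfying the biconditional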
All logics in SMT come equipped with the notion of equality, which is essentially term-level equality of objects. The standard explicitly states:
Version 2.6 of the SMT-LIB format adopts as its underlying logic a version of many-sorted first-order logic with equality [Man93, Gal86, End01].
See Section 2.2 of http://smtlib.cs.uiowa.edu/papers/smt-lib-reference-v2.6-r2017-07-18.pdf
The same document also says (Section 3.7.1):
Note the absence of a symbol for double implication. Such a connective is superfluous because the equality symbol = can be used in its place.
I suspect though, perhaps, you are trying to ask for something else. Some further examples would definitely help.

Determine that the word is from language with the help of stack

What is the algorithm for determining that the word is from a specific language with the help of the stack?
I know that I can push the word onto the stack symbol by symbol, and while doing that I can record any needed info about the symbols, but that would be no different from just iterating over the word.
If the language is defined by a context-free grammar, membership of a specific word can be determined efficiently with the so-called CYK algorithm.
The language given in the example above can be represented by the following context-free grammar where epsilon denotes the empty string.
S -> epsilon | aSb | ab
Update
For the CYK algorithm to be applicable, the grammar needs to be in Chomsky normal form; for the grammar above, this can be done as follows.
S -> epsilon | AT | AB
T -> SB
A -> a
B -> b
In this formulation, A and B are artificial nonterminal symbols for the terminal symbols a and b; T is an artificial variable introduced because each right-hand side may contain at most two nonterminal symbols.
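A minimal sketch of CYK for exactly this grammar (Python; the table layout is my own):

terminal_rules = {'a': {'A'}, 'b': {'B'}}
binary_rules = {('A', 'T'): {'S'}, ('A', 'B'): {'S'}, ('S', 'B'): {'T'}}

def cyk(word, start='S'):
    n = len(word)
    if n == 0:
        return True                      # S -> epsilon is in the grammar
    # table[length][i] = set of nonterminals deriving word[i:i+length]
    table = [[set() for _ in range(n)] for _ in range(n + 1)]
    for i, ch in enumerate(word):
        table[1][i] = set(terminal_rules.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for (x, y), parents in binary_rules.items():
                    if x in table[split][i] and y in table[length - split][i + split]:
                        table[length][i] |= parents
    return start in table[n][0]

# cyk("aabb") -> True, cyk("abab") -> False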
Maybe this helps for a start
LanguageIdentifier
Rosette Language Identifier
Other than that, you could count the frequency of the characters composing the word and compare it to frequency tables of different languages (this might not work for a single word, but for a bunch of sentences it should).

What is Eta abstraction in lambda calculus used for?

Eta abstraction in lambda calculus means the following:
A function f can be written as \x -> f x
Is Eta abstraction of any use while reducing lambda expressions?
Is it only an alternate way of writing certain expressions?
Practical use cases would be appreciated.
The eta reduction/expansion is just a consequence of the law that says that given
f = g
it must be, that for any x
f x = g x
and vice versa.
Hence given:
f x = (\y -> f y) x
we get, by beta reducing the right hand side
f x = f x
which must be true. Thus we can conclude
f = \y -> f y
First, to clarify the terminology, paraphrasing a quote from the Eta conversion article in the Haskell wiki (also incorporating Will Ness' comment above):
Converting from \x -> f x to f would constitute an eta reduction, and moving in the opposite way would be an eta abstraction or expansion. The term eta conversion can refer to the process in either direction.
Extensive use of η-reduction can lead to Pointfree programming.
It is also typically used in certain compile-time optimisations.
Summary of the use cases found:
Point-free (style of) programming
Allow lazy evaluation in languages using strict/eager evaluation strategies
Compile-time optimizations
Extensionality
1. Point-free (style of) programming
From the Tacit programming Wikipedia article:
Tacit programming, also called point-free style, is a programming paradigm in which function definitions do not identify the arguments (or "points") on which they operate. Instead the definitions merely compose other functions.
Borrowing a Haskell example from sth's answer (which also shows composition that I chose to ignore here):
inc x = x + 1
can be rewritten as
inc = (+) 1
This is because (following yatima2975's reasoning) inc x = x + 1 is just syntactic sugar for \x -> (+) 1 x so
\x -> f x => f
\x -> ((+) 1) x => (+) 1
(Check Ingo's answer for the full proof.)
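Loosely the same step in Python terms (a sketch; operator.add plays the role of (+)):

from functools import partial
import operator

inc_expanded = lambda x: operator.add(1, x)   # \x -> (+) 1 x
inc_reduced = partial(operator.add, 1)        # (+) 1, the eta-reduced form

assert all(inc_expanded(n) == inc_reduced(n) for n in range(10))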
There is a good thread on Stackoverflow on its usage. (See also this repl.it snippet.)
2. Allow lazy evaluation in languages using strict/eager evaluation strategies
Makes it possible to use lazy evaluation in eager/strict languages.
Paraphrasing from the MLton documentation on Eta Expansion:
Eta expansion delays the evaluation of f until the surrounding function/lambda is applied, and will re-evaluate f each time the function/lambda is applied.
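A sketch of this effect in Python, an eager language (make_adder is a hypothetical expensive call):

def make_adder():
    print("building adder")            # side effect makes evaluation visible
    return lambda x: x + 1

h = make_adder()                       # eager: "building adder" prints right here
h_eta = lambda x: make_adder()(x)      # eta-expanded: nothing happens yet

h_eta(1)                               # prints "building adder" now...
h_eta(2)                               # ...and again on every application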
Interesting Stackoverflow thread: Can every functional language be lazy?
2.1 Thunks
I could be wrong, but I think the notion of thunking, or thunks, belongs here. From the Wikipedia article on thunks:
In computer programming, a thunk is a subroutine used to inject an additional calculation into another subroutine. Thunks are primarily used to delay a calculation until its result is needed, or to insert operations at the beginning or end of the other subroutine.
Section 4.2 Variations on a Scheme — Lazy Evaluation of Structure and Interpretation of Computer Programs (pdf) has a very detailed introduction to thunks (and even though it contains not one occurrence of the phrase "lambda calculus", it is worth reading).
(This paper also seemed interesting but didn't have the time to look into it yet: Thunks and the λ-Calculus.)
3. Compile-time optimizations
Completely ignorant on this topic, therefore just presenting sources:
From Georg P. Loczewski's The Lambda Calculus:
In 'lazy' languages like Lambda Calculus, A++, SML, Haskell, Miranda etc., eta conversion, abstraction and reduction alike, are mainly used within compilers. (See [Jon87] page 22.)
where [Jon87] expands to
Simon L. Peyton Jones
The Implementation of Functional Programming Languages
Prentice Hall International, Hertfordshire,HP2 7EZ, 1987.
ISBN 0 13 453325 9.
search results for "eta" reduction abstraction expansion conversion "compiler" optimization
4. Extensionality
This is another topic that I know little about, and this is more theoretical, so here it goes:
From the Lambda calculus wikipedia article:
η-reduction expresses the idea of extensionality, which in this context is that two functions are the same if and only if they give the same result for all arguments.
Some other sources:
nLab entry on Eta-conversion that goes deeper into its connection with extensionality, and its relationship with beta-conversion
a ton of info in What's the point of η-conversion in lambda calculus? on the Theoretical Computer Science Stack Exchange (but beware: the author of the accepted answer seems to have a beef with the commonly held belief about the relationship between eta reduction and extensionality, so make sure to read the entire page. Most of it was over my head, so I have no opinions.)
The question above has been cross-posted to Math Exchange as well
Speaking of "over my head" stuff: here's Conor McBride's take; the only thing I understood were that eta conversions can be controversial in certain context, but reading his reply was that of trying to figure out an alien language (couldn't resist)
Saved this page recursively in the Internet Archive, so if any of the links are no longer live, that snapshot may have captured them too.

What is a regular language?

I'm trying to understand the concept of languages levels (regular, context free, context sensitive, etc.).
I can look this up easily, but all explanations I find are a load of symbols and talk about sets. I have two questions:
Can you describe in words what a regular language is, and how the languages differ?
Where do people learn to understand this stuff? As I understand it, it is formal mathematics. I had a couple of courses at uni which used it, and barely anyone understood it, as the tutors just assumed we knew it. Where can I learn it, and why are people "expected" to know it in so many sources? It's like there's a gap in education.
Here's an example:
Any language belonging to this set is a regular language over the alphabet.
How can a language be "over" anything?
In the context of computer science, a word is the concatenation of symbols. The used symbols are called the alphabet. For example, some words formed out of the alphabet {0,1,2,3,4,5,6,7,8,9} would be 1, 2, 12, 543, 1000, and 002.
A language is then a subset of all possible words. For example, we might want to define a language that captures all elite MI6 agents. Those all start with double-0, so words in the language would be 007, 001, 005, and 0012, but not 07 or 15. For simplicity's sake, we say a language is "over an alphabet" instead of "a subset of words formed by concatenation of symbols in an alphabet".
In computer science, we now want to classify languages. We call a language regular if it can be decided if a word is in the language with an algorithm/a machine with constant (finite) memory by examining all symbols in the word one after another. The language consisting just of the word 42 is regular, as you can decide whether a word is in it without requiring arbitrary amounts of memory; you just check whether the first symbol is 4, whether the second is 2, and whether any more numbers follow.
All languages with a finite number of words are regular, because we can (in theory) just build a control flow tree of constant size (you can visualize it as a bunch of nested if-statements that examine one digit after the other). For example, we can test whether a word is in the "prime numbers between 10 and 99" language with the following construct, requiring no memory except the one to encode at which code line we're currently at:
if word[0] == '1':
    if word[1] == '1':  # 11
        return True     # "accept" word, i.e. it's in the language
    if word[1] == '3':  # 13
        return True
...
return False
Note that all finite languages are regular, but not all regular languages are finite; our double-0 language contains an infinite number of words (007, 008, but also 004242 and 0012345), but can be tested with constant memory: To test whether a word belongs in it, check whether the first symbol is 0, and whether the second symbol is 0. If that's the case, accept it. If the word is shorter than three or does not start with 00, it's not an MI6 code name.
Formally, the construct of a finite-state machine or a regular grammar is used to prove that a language is regular. These are similar to the if-statements above, but allow for arbitrarily long words. If there's a finite-state machine, there is also a regular grammar, and vice versa, so it's sufficient to show either. For example, the finite state machine for our double-0 language is:
start state: if input = 0 then goto state 2
start state: if input = 1 then fail
start state: if input = 2 then fail
...
state 2: if input = 0 then accept
state 2: if input != 0 then fail
accept: for any input, accept
The equivalent regular grammar is:
start → 0 B
B → 0 accept
accept → 0 accept
accept → 1 accept
...
accept → ε
The equivalent regular expression is:
00[0-9]*
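A sketch of that machine in Python: the only memory is a single state variable, independent of the word's length:

def is_double_0(word):
    state = 'start'
    for ch in word:
        if state == 'start':
            state = 'second' if ch == '0' else 'fail'
        elif state == 'second':
            state = 'accept' if ch == '0' else 'fail'
        # in 'accept' every further symbol keeps us accepting; 'fail' is a trap
    return state == 'accept'

# is_double_0("007") -> True, is_double_0("15") -> False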
Some languages are not regular. For example, the language of any number of 1s, followed by the same number of 2s (often written as 1ⁿ2ⁿ, for an arbitrary n) is not regular - you need more than a constant amount of memory (= a constant number of states) to store the number of 1s to decide whether a word is in the language.
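By contrast, a recognizer for 1ⁿ2ⁿ needs a counter that can grow with the input (a sketch in Python):

def is_1n2n(word):
    count = i = 0
    while i < len(word) and word[i] == '1':
        count += 1                # this counter needs unbounded memory
        i += 1
    while i < len(word) and word[i] == '2':
        count -= 1
        i += 1
    return i == len(word) and count == 0

# is_1n2n("1122") -> True, is_1n2n("112") -> False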
This should usually be explained in the theoretical computer science course. Luckily, Wikipedia explains both formal and regular languages quite nicely.
Here are some of the equivalent definitions from Wikipedia:
[...] a regular language is a formal language (i.e., a possibly infinite set of finite sequences of symbols from a finite alphabet) that satisfies the following equivalent properties:
it can be accepted by a deterministic finite state machine.
it can be accepted by a nondeterministic finite state machine
it can be described by a formal regular expression.
Note that the "regular expression" features provided with many programming languages
are augmented with features that make them capable of recognizing
languages which are not regular, and are therefore not strictly
equivalent to formal regular expressions.
The first thing to note is that a regular language is a formal language, with some restrictions. A formal language is essentially a (possibly infinite) collection of strings. For example, the formal language Java is the collection of all possible Java files, which is a subset of the collection of all possible text files.
One of the most important characteristics is that unlike the context-free languages, a regular language does not support arbitrary nesting/recursion, but you do have arbitrary repetition.
A language always has an underlying alphabet which is the set of allowed symbols. For example, the alphabet of a programming language would usually either be ASCII or Unicode, but in formal language theory it's also fine to talk about languages over other alphabets, for example the binary alphabet where the only allowed characters are 0 and 1.
In my university, we were taught some formal language theory in the Compilers class, but this is probably different between different schools.
