Parsing context-free languages in a stream of tokens - algorithm

The problem
Given a context-free grammar with arbitrary rules and a stream of tokens, how can stream fragments that match the grammar be identified effectively?
Example:
Grammar
S -> ASB | AB
A -> a
B -> b
(So essentially, a number of as followed by an equal number of bs)
Stream:
aabaaabbc...
Expected result:
Match starting at position 1: ab
Match starting at position 4: aabb
Of course the key is "effectively". without testing too many hopeless candidates for too long. The only thing I know about my data is that although the grammar is arbitrary, in practice matching sequences will be relatively short (<20 terminals) while the stream itself will be quite long (>10000 terminals).
Ideally I'd also want a syntax tree but that's not too important, because once the fragment is identified, I can run an ordinary parser over it to obtain the tree.
Where should I start? Which type of parser can be adapted to this type of work?

"Arbitrary grammar" makes me suggest you look at wberry's comment.
How complex are these grammars? Is there a manual intervention step?
I'll make an attempt. If I modified your example grammar from:
S -> ASB | AB
A -> a
B -> b
to include:
S' -> S | GS' | S'GS' | S'G
G -> sigma*
So that G = garbage and S' is many S fragments with garbage in between (I may have been careless with my production rules. You get the idea), I think we can solve your problem. You just need a parser that will match other rules before G. You may have to modify these production rules based on the parser. I almost guarantee that there will be rule ordering changes depending on the parser. Since most parser libraries separate lexing from parsing, you'll probably need a catch-all lexeme followed by modifying G to include all possible lexemes. Depending on your specifics, this might not be any better (efficiency-wise) than just starting each attempt at each spot in the stream.
But... Assuming my production rules are fixed (both for correctness and for the particular flavor of parser), this should not only match fragments in the stream, but it should give you a parse tree for the whole stream. You are only interested in subtrees rooted in nodes of type S.

Related

Why are epsilon transitions used in NFA?

I'm trying to understand how to create NFA-s from regular expressions, but I am really confused from epsilon transitions. I have this example in my textbook , but I don't understand why epsilon transitions are used and how does one know when to use them.
In general, espilon-transitions are used when they are convenient. For example, when constructing an NFA from a regular expression, you start by constructing small parts of the automaton corresponding to parts of the expression. To connect them, you need to put a transition. But if there is no symbol to be read there, an epsilon transition is a simple way to do this. They are, however never necessary, you can always find a solution without them.
In your example, just apply the algorithm described in your textbook. It tells you when to use them.
The epsilon transitions
from 1 to 2 probably connects the parts for (a|b)* and for ac
1->5 and 8->1 probably result from the *
5->6 and 5->7 probably result from the alternative in |
Epsilon-transitions in NFAs are a natural representation of choice or disjunction or union in regular expressions. That is, a regular expression like r + s (or r | s or r U s depending on your preferred notation) is naturally represented as an NFA consisting of two independent NFAs, one for r and one for s, joined using e-transitions as follows:
e
----->q0----->(r)
|
| e
|
V
(s)
When used to connect states in more complicated ways, the effect may not be as easy or natural to describe, but essentially these transitions let you choose unconditionally among multiple options. So, if I have seen a part of the input already and there are a few different ways the string could end, I can represent that by using e-transitions to states that handle the different possibilities.
In your example, the e-transitions are not really serving any very useful function and are merely artifacts of the conversion algorithm you have used. That algorithm includes them because, in the general case, they may be useful or necessary. In your specific case this was not true, so they look out of place.

Determine that the word is from language with the help of stack

What is the algorithm for determining that the word is from a specific language with the help of the stack?
I know that I can put the word into stack symbol by symbol and while doing that I can record any needed info about symbols, but it will be no different from just iterating the word.
If the language is defined by a context-free grammar, membership of a specific word can be determined efficiently by the so-called CYK-Algorithm.
The language given in the example above can be represented by the following context-free grammar where epsilon denotes the empty string.
S -> epsilon | aSb | ab
Update
For the CYK-algorithm to be applicable, the grammar needs to beinChomsky normal form; for the grammar above, this can be done as follows.
S -> epsilon | AT | AB
T -> SB
A -> a
B -> b
In this formulation, A and B are artificial nonterminal symbols for the terminal symbols a and b; T is an artificial variable introduced because each right-hand side may contain at most two nonterminal symbols.
Maybe this helps for a start
LanguageIdentifier
Rosette Language Identifier
Other than that you could count the frequency of the characters composing the word and compare it to frequency tables of different languages to check (maybe this won't work for a single word but for a bunch of sentences it should work though)

What does syntax directed translation mean?

Can anyone, in simple terms, explain what does "Syntax Directed Translation" mean? I started to read the topic from Dragon Book but couldn't understand. The Wiki article didn't help either.
In simplest terms, 'Syntax Directed Translation' means driving the entire compilation (translation) process with the syntax recognizer (the parser).
Conceptually, the process of compiling a program (translating it from source code to machine code) starts with a parser that produces a parse tree, and then transforms that parse tree through a sequence of tree or graph transformations, each of which is largely independent, resulting in a final simplified tree or graph that is traversed to produce machine code.
This view, while nice in theory, has a drawback that if you try to implement it directly, enough memory to hold at least two copies of the entire tree or graph is needed. Back when the Dragon Book was written (and when a lot of this theory was hashed out), computer memories were measured in kilobytes, and 64K was a lot. So compiling large programs could be tricky.
With Syntax Directed Translation, you organize all of the graph transformations around the order in which the parser recognizes the parse tree. Instead of producing a complete parse tree, your parser builds little bits of it, and then feeds those bits to the subsequent passes of the compiler, ultimately producing a small piece of machine code, before continuing the parsing process to build the next piece of parse tree. Since only small amounts of the parse tree (or the subsequent graphs) exist at any time, much less memory is required. Since the syntax recognizer is the master sequencer controlling all of this (deciding the order in which things happen), this is called Syntax Directed Translation.
Since this is such an effective way of keeping down memory use, people even redesigned languages to make it easier to do -- the ideal being to have a "Single Pass" compiler that could in fact do the entire process from parsing to machine code generation in a single pass.
Nowadays, memory is not at such a premium, so there's less pressure to force everything into a single pass. Instead you generally use Syntax Direct Translation just for the front end, parsing the syntax, doing typechecking and other semantic checks, and a few simple transformations all from the parser and producing some internal form (three address code, trees, or dags of some kind) and then having separate optimization and back end passes that are independent (and so not syntax directed). Even in this case you might claim that these later passes are at least partly syntax directed, as the compiler may be organized to operate on large pieces of the input (such as entire functions or modules), pushing through all the passes before continuing with the next piece of input.
Tools like yacc are designed around the idea of Syntax Directed Translation -- the tool produces a syntax recognizer that directly runs fragments of code ('actions' in the tool parlance) as productions (fragments of the parse tree) are recognized, without ever creating an actual 'tree'. These actions can directly invoke what are logically later passes in the compiler, and then return to continue parsing. The imperative main loop that drives all of this is the parser's token reading state machine.
Actually No. Historically before the Dragon Book there were syntax directed compilers. Attending ACM SEGPlan meeting in the late 1960's I learned of several types of directed translation. Tree directed and graph directed translation were also discussed. I think these got muddled together in the Dragon Book though I have never owned the Dragon Book. My favorite book was Programming Systems and Languages by Saul Rosen. It is a collection of papers on compilers, operating systems and computer systems. I'll try to explain the early syntax directed compiler parser programming languages. The later ones producing trees were combined with tree directed code generating languages.
Early syntax directed compilers, translated source directly to stack machine code. The Borrows B5000 ALGOL compiler is an example.
A*(B+C) -> A,B,C,ADD,MPY
Schorre's META II domain specific parser programming language, compiler compiler, developed in the 1960s is an example of a syntax directed compiler. You can find the original META II paper in the ACM archive. META II avoids left recursion using $ postfix zero or more sequence operator and ( ) grouping.
EXPR = TERM $('+' TERM .OUT 'ADD'|'-' TERM .OUT 'SUB');
Later Schorre based metalanguage compilers translated to trees using stack based tree transformation operators :<node name> and !<number>.
EXPR = TERM $(('+':ADD|'-':SUB) TERM!2);
Except for TREEMETA that used [<number>] instead of !<number>. The above EXPR formula is basically the same as the META II EXPR except we have factored operators + and - recognition creating corresponding nodes and pushing the node onto the node stack. Then on recognizing the right TERM the tree constructor !2 creates a tree popping the top 2 parse stack <TERM>s and top node from the node stack to form a tree:
ADD or SUB
/ \ / \
TERM TERM TERM TERM
Tokens were recognized by supplied recognizers .ID .NUMBER and .STRING. Later replaced by token ".." and character class ":" formula in CWIC:
id .. let $(leter|dgt|+'_');
Tree directed compiler languages were combined with the syntax directed compilers to generate code. The CWIC compiler compiler developed at Systems Development Corporation included a LISP 2 based tree directed generator language. A short paper in CWIC can be found in the ACM archives.
In the parser programming languages you are programming a type of recursive decent parser. When you get to CWIC all the problems that today are attributed to recursive decent parsers were eliminated. There is no left recursion problem as the $ zero or more construct and programed tree construction eliminated the need of left recursion. You control the tree construction. A loop construct is used to produces a left handed tree and tail recursion a right handed tree. Though parsing formulas may generate no tree at all:
program = $declarations;
In the above the $ zero or more loop operator preceding declarations specifies that declarations is to be repeatably called as long as it returns success. The input source code being compiled is made up of any positive number of declarations. The declarations formula would then define the types of declarations. You might need external linkages declarations, data declarations, function or procedure code declarations.
declarations = linkage_decl | data_decl | code_decl;
The types of declarations each being a separate formula. The syntax language controls when semantic processing and code generation occurs. The program and declarations formulas above do not produce trees. They are simply controlling when and what language structure are parsed. These are neither LL oe LR parser sears. The provide unlimited (limited only by available memory) programed backtracking. They provide programed look ahead and peak ahead tests.
As a last example the following example including token and character class formula illustrates producing both left and right handed trees. Specifically exponentiation using tail recursion.
assign = id '=' expr ';' :ASSIGN!2 arith_gen[*1];
expr = term $(('+':ADD | '-':SUB) term !2);
term = factor $(('*':MPY | '//' :REM | '/':DIV) factor!2);
factor = ( id ('(' +[ arg $(',' arg ]+ ')' :CALL!2 | .EMPTY)
| number
| '(' expr ')'
) ('^' factor:EXP!2 | .EMPTY);
bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';
upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';
alpha: upr|lwr;
alphanum: alpha|dgt;
number .. dgt $dgt MAKENUM[];
id .. alpha $(alphanum|+'_');

Any tools can randomly generate the source code according to a language grammar?

A C program source code can be parsed according to the C grammar(described in CFG) and eventually turned into many ASTs. I am considering if such tool exists: it can do the reverse thing by firstly randomly generating many ASTs, which include tokens that don't have the concrete string values, just the types of the tokens, according to the CFG, then generating the concrete tokens according to the tokens' definitions in the regular expression.
I can imagine the first step looks like an iterative non-terminals replacement, which is randomly and can be limited by certain number of iteration times. The second step is just generating randomly strings according to regular expressions.
Is there any tool that can do this?
The "Data Generation Language" DGL does this, with the added ability to weight the probabilities of productions in the grammar being output.
In general, a recursive descent parser can be quite directly rewritten into a set of recursive procedures to generate, instead of parse / recognise, the language.
Given a context-free grammar of a language, it is possible to generate a random string that matches the grammar.
For example, the nearley parser generator includes an implementation of an "unparser" that can generate strings from a grammar.
The same task can be accomplished using definite clause grammars in Prolog. An example of a sentence generator using definite clause grammars is given here.
If you have a model of the grammar in a normalized form (all rules like this):
LHS = RHS1 RHS2 ... RHSn ;
and language prettyprinter (e.g., AST to text conversion tool), you can build one of these pretty easily.
Simply start with the goal symbol as a unit tree.
Repeat until no nonterminals are left:
Pick a nonterminal N in the tree;
Expand by adding children for the right hand side of any rule
whose left-hand side matches the nonterminal N
For terminals that carry values (e.g., variable names, numbers, strings, ...) you'll have to generate random content.
A complication with the above algorithm is that it doesn't clearly terminate. What you actually want to do is pick some limit on the size of your tree, and run the algorithm until the all nonterminals are gone or you exceed the limit. In the latter case, backtrack, undo the last replacement, and try something else. This gets you a bounded depth-first search for an AST of your determined size.
Then prettyprint the result. Its the prettyprinter part that is hard to get right.
[You can build all this stuff yourself including the prettyprinter, but it is a fair amount of work. I build tools that include all this machinery directly in a language-parameterized way; see my bio].
A nasty problem even with well formed ASTs is that they may be nonsensical; you might produce a declaration of an integer X, and assign a string literal value to it, for a language that doesn't allow that. You can probably eliminate some simple problems, but language semantics can be incredibly complex, consider C++ as an example. Ensuring that you end up with a semantically meaningful program is extremely hard; in essence, you have to parse the resulting text, and perform name and type resolution/checking on it. For C++, you need a complete C++ front end.
the problem with random generation is that for many CFGs, the expected length of the output string is infinite (there is an easy computation of the expected length using generating functions corresponding to the non-terminal symbols and equations corresponding to the rules of the grammar); you have to control the relative probabilities of the productions in certain ways to guarantee convergence; for example, sometimes, weighting each production rule for a non-terminal symbol inversely to the length of its RHS suffices
there is lot more on this subject in:
Noam Chomsky, Marcel-Paul Sch\"{u}tzenberger, ``The Algebraic Theory of Context-Free Languages'', pp.\ 118-161 in P. Braffort and D. Hirschberg (eds.), Computer Programming and Formal Systems, North-Holland (1963)
(see Wikipedia entry on Chomsky–Schützenberger enumeration theorem)

What is packrat parsing?

I know and use bison/yacc. But in parsing world, there's a lot of buzz around packrat parsing.
What is it? Is it worth studing?
Packrat parsing is a way of providing asymptotically better performance for parsing expression grammars (PEGs); specifically for PEGs, linear time parsing can be guaranteed.
Essentially, Packrat parsing just means caching whether sub-expressions match at the current position in the string when they are tested -- this means that if the current attempt to fit the string into an expression fails then attempts to fit other possible expressions can benefit from the known pass/fail of subexpressions at the points in the string where they have already been tested.
At a high level:
Packrat parsers make use of parsing expression grammars (PEGs) rather than traditional context-free grammars (CFGs).
Through their use of PEGs rather than CFGs, it's typically easier to set up and maintain a packrat parser than a traditional LR parser.
Due to how they use memoization, packrat parsers typically use more memory at runtime than "classical" parsers like LALR(1) and LR(1) parsers.
Like classical LR parsers, packrat parsers run in linear time.
In that sense, you can think of a packrat parser as a simplicity/memory tradeoff with LR-family parsers. Packrat parsers require less theoretical understanding of the parser's inner workings than LR-family parsers, but use more resources at runtime. If you're in an environment where memory is plentiful and you just want to throw a simple parser together, packrat parsing might be a good choice. If you're on a memory-constrained system or want to get maximum performance, it's probably worth investing in an LR-family parser.
The rest of this answer gives a slightly more detailed overview of packrat parsers and PEGs.
On CFGs and PEGs
Many traditional parsers (and many modern parsers) make use of context-free grammars. A context-free grammar consists of a series of rules like the ones shown here:
E -> E * E | E + E | (E) | N
N -> D | DN
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
For example, the top line says that the nonterminal E can be replaced either with E * E, or E + E, or (E), or with N. The second line says that N can be replaced with either D or DN. The last line says that D can be replaced with any single digit.
If you start with the string E and follow the rules from the above grammar, you can generate any mathematical expression using +, *, parentheses, and single digits.
Context-free grammars are a compact way to represent a collection of strings. They have a rich and well-understood theory. However, they have two main drawbacks. The first one is that, by itself, a CFG defines a collection of strings, but doesn't tell you how to check whether a particular string is generated by the grammar. This means that whether a particular CFG will lend itself to a nice parser depends on the particulars of how the parser works, meaning that the grammar author may need to familiarize themselves with the internal workings of their parser generator to understand what restrictions are placed on the sorts of grammar structures can arise. For example, LL(1) parsers don't allow for left-recursion and require left-factoring, while LALR(1) parsers require some understanding of the parsing algorithm to eliminate shift/reduce and reduce/reduce conflicts.
The second, larger problem is that grammars can be ambiguous. For example, the above grammar generates the string 2 + 3 * 4, but does so in two ways. In one way, we essentially get the grouping 2 + (3 * 4), which is what's intended. The other one gives us (2 + 3) * 4, which is not what's meant. This means that grammar authors either need to ensure that the grammar is unambiguous or need to introduce precedence declarations auxiliary to the grammar to tell the parser how to resolve the conflicts. This can be a bit of a hassle.
Packrat parsers make use of an alternative to context-free grammars called parsing expression grammars (PEGs). Parsing expression grammars in some ways resemble CFGs - they describe a collection of strings by saying how to assemble those strings from (potentially recursive) smaller parts. In other ways, they're like regular expressions: they involve simpler statements combined together by a small collection of operations that describe larger structures.
For example, here's a simple PEG for the same sort of arithmetic expressions given above:
E -> F + E / F
F -> T * F / T
T -> D* / (E)
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
To see what this says, let's look at the first line. Like a CFG, this line expresses a choice between two options: you can either replace E with F + E or with F. However, unlike a regular CFG, there is a specific ordering to these choices. Specifically, this PEG can be read as "first, try replacing E with F + E. If that works, great! And if that doesn't work, try replacing E with F. And if that works, great! And otherwise, we tried everything and it didn't work, so give up."
In that sense, PEGs directly encode into the grammar structure itself how the parsing is to be done. Whereas a CFG more abstractly says "an E may be replaced with any of the following," a PEG specifically says "to parse an E, first try this, then this, then this, etc." As a result, for any given string that a PEG can parse, the PEG can parse it exactly one way, since it stops trying options once the first parse is found.
PEGs, like CFGs, can take some time to get the hang of. For example, CFGs in the abstract - and many CFG parsing techniques - have no problem with left recursion. For example, this CFG can be parsed with an LR(1) parser:
E -> E + F | F
F -> F * T | T
T -> (E) | N
N -> ND | D
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
However, the following PEG can't be parsed by a packrat parser (though later improvements to PEG parsing can correct this):
E -> E + F / F
F -> F * T / T
T -> (E) / D*
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Let's take a look at that first line. The first line says "to parse an E, first try reading an E, then a +, then an F. And if that fails, try reading an F." So how would it then go about trying out that first option? The first step would be to try parsing an E, which would work by first trying to parse an E, and now we're caught in an infinite loop. Oops. This is called left recursion and also shows up in CFGs when working with LL-family parsers.
Another issue that comes up when designing PEGs is the need to get the ordered choices right. If you're coming from the Land of Context-Free Grammars, where choices are unordered, it's really easy to accidentally mess up a PEG. For example, consider this PEG:
E -> F / F + E
F -> T / T * F
T -> D+ / (E)
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Now, what happens if you try to parse the string 2 * 3 + 4? Well:
We try parsing an E, which first tries parsing an F.
We trying parsing an F, which first tries parsing a T.
We try parsing a T, which first tries reading a series of digits. This succeeds in reading 2.
We've successfully read an F.
So we've successfully read an E, so we should be done here, but there are leftover tokens and the parse fails.
The issue here is that we first tried parsing F before F + E, and similarly first tried parsing T before parsing T * F. As a result, we essentially bit off less than we could check, because we tried reading a shorter expression before a longer one.
Whether you find CFGs, with attending ambiguities and precedence declarations, easier or harder than PEGs, with attending choice orderings, is mostly a matter of personal preference. But many people report finding PEGs a bit easier to work with than CFGs because they more mechanically map onto what the parser should do. Rather than saying "here's an abstract description of the strings I want," you get to say "here's the order in which I'd like you to try things," which is a bit closer to how parsing often works.
The Packrat Parsing Algorithm
Compared with the algorithms to build LR or LL parsing tables, the algorithm used by a packrat parsing is conceptually quite simple. At a high level, a packrat parser begins with the start symbol, then tries the ordered choices, one at a time, in sequence until it finds one that works. As it works through those choices, it may find that it needs to match another nonterminal, in which case it recursively tries matching that nonterminal on the rest of the string. If a particular choice fails, the parser backtracks and then tries the next production.
Matching any one individual production isn't that hard. If you see a terminal, either it matches the next available terminal or it doesn't. If it does, great! Match it and move on. If not, report an error. If you see a nonterminal, then (recursively) match that nonterminal, and if it succeeds pick up with the rest of the search at the point after where the nonterminal finished matching.
This means that, more generally, the packrat parser works by trying to solve problems of the following form:
Given some position in the string and a nonterminal, determine how much of the string that nonterminal matches starting at that position (or report that it doesn't match at all.)
Here, notice that there's no ambiguity about what's meant by "how much of the string the nonterminal matches." Unlike a traditional CFG where a nonterminal might match at a given position in several different lengths, the ordered choices used in PEGs ensure that if there's some match starting at a given point, then there's exactly one match starting at that point.
If you've studied dynamic programming, you might realize that these subproblems might overlap one another. In fact, in a PEG with k nonterminals and a string of length n, there are only Θ(kn) possible distinct subproblems: one for each combination of a starting position and a nonterminal. This means that, in principle, you could use dynamic programming to precompute a table of all possible position/nonterminal parse matches and have a very fast parser. Packrat parsing essentially does this, but using memoization rather than dynamic programming. This means that it won't necessarily try filling all table entries, just the ones that it actually encounters in the course of parsing the grammar.
Since each table entry can be filled in in constant time (for each nonterminal, there are only finitely many productions to try for a fixed PEG), the parser ends up running in linear time, matching the speed of an LR parser.
The drawback with this approach is the amount of memory used. Specifically, the memoization table may record multiple entries per position in the input string, requiring memory usage proportional to both the size of the PEG and the length of the input string. Contrast this with LL or LR parsing, which only needs memory proportional to the size of the parsing stack, which is typically much smaller than the length of the full string.
That being said, the tradeoff here in worse memory performance is offset by not needing to learn the internal workings of how the packrat parser works. You can just read up on PEGs and take things from there.
Hope this helps!
Pyparsing is a pure-Python parsing library that supports packrat parsing, so you can see how it is implemented. Pyparsing uses a memoizing technique to save previous parse attempts for a particular grammar expression at a particular location in the input text. If the grammar involves retrying that same expression at that location, it skips the expensive parsing logic and just returns the results or exception from the memoizing cache.
There is more info here at the FAQ page of the pyparsing wiki, which also includes links back to Bryan Ford's original thesis on packrat parsing.

Resources