What is packrat parsing? - algorithm

I know and use bison/yacc. But in parsing world, there's a lot of buzz around packrat parsing.
What is it? Is it worth studying?

Packrat parsing is a way of providing asymptotically better performance for parsing expression grammars (PEGs); specifically for PEGs, linear time parsing can be guaranteed.
Essentially, Packrat parsing just means caching whether sub-expressions match at the current position in the string when they are tested -- this means that if the current attempt to fit the string into an expression fails then attempts to fit other possible expressions can benefit from the known pass/fail of subexpressions at the points in the string where they have already been tested.

At a high level:
Packrat parsers make use of parsing expression grammars (PEGs) rather than traditional context-free grammars (CFGs).
Through their use of PEGs rather than CFGs, it's typically easier to set up and maintain a packrat parser than a traditional LR parser.
Due to how they use memoization, packrat parsers typically use more memory at runtime than "classical" parsers like LALR(1) and LR(1) parsers.
Like classical LR parsers, packrat parsers run in linear time.
In that sense, you can think of a packrat parser as a simplicity/memory tradeoff with LR-family parsers. Packrat parsers require less theoretical understanding of the parser's inner workings than LR-family parsers, but use more resources at runtime. If you're in an environment where memory is plentiful and you just want to throw a simple parser together, packrat parsing might be a good choice. If you're on a memory-constrained system or want to get maximum performance, it's probably worth investing in an LR-family parser.
The rest of this answer gives a slightly more detailed overview of packrat parsers and PEGs.
On CFGs and PEGs
Many traditional parsers (and many modern parsers) make use of context-free grammars. A context-free grammar consists of a series of rules like the ones shown here:
E -> E * E | E + E | (E) | N
N -> D | DN
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
For example, the top line says that the nonterminal E can be replaced either with E * E, or E + E, or (E), or with N. The second line says that N can be replaced with either D or DN. The last line says that D can be replaced with any single digit.
If you start with the string E and follow the rules from the above grammar, you can generate any mathematical expression using +, *, parentheses, and numbers built from the digits 0 through 9.
Context-free grammars are a compact way to represent a collection of strings. They have a rich and well-understood theory. However, they have two main drawbacks. The first one is that, by itself, a CFG defines a collection of strings, but doesn't tell you how to check whether a particular string is generated by the grammar. This means that whether a particular CFG will lend itself to a nice parser depends on the particulars of how the parser works, meaning that the grammar author may need to familiarize themselves with the internal workings of their parser generator to understand what restrictions are placed on the sorts of grammar structures that can arise. For example, LL(1) parsers don't allow for left-recursion and require left-factoring, while LALR(1) parsers require some understanding of the parsing algorithm to eliminate shift/reduce and reduce/reduce conflicts.
The second, larger problem is that grammars can be ambiguous. For example, the above grammar generates the string 2 + 3 * 4, but does so in two ways. In one way, we essentially get the grouping 2 + (3 * 4), which is what's intended. The other one gives us (2 + 3) * 4, which is not what's meant. This means that grammar authors either need to ensure that the grammar is unambiguous or need to introduce precedence declarations auxiliary to the grammar to tell the parser how to resolve the conflicts. This can be a bit of a hassle.
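To make the ambiguity concrete, here is a minimal Python sketch that enumerates every way the rules E -> E * E | E + E | N can group a flat token list and collects the resulting values. The function name `all_values` is made up for this illustration, and parentheses are left out to keep it short.

```python
def all_values(tokens):
    """Return the set of values over every parse tree of E -> E*E | E+E | N.

    `tokens` alternates numbers and operators, e.g. ["2", "+", "3", "*", "4"].
    Parenthesized input is deliberately not handled in this sketch.
    """
    if len(tokens) == 1:
        return {int(tokens[0])}
    values = set()
    # Operators sit at the odd indices; each one may be the root of a parse tree.
    for i in range(1, len(tokens), 2):
        for left in all_values(tokens[:i]):
            for right in all_values(tokens[i + 1:]):
                values.add(left + right if tokens[i] == "+" else left * right)
    return values


print(sorted(all_values(["2", "+", "3", "*", "4"])))  # [14, 20]
```

The two results correspond to the groupings 2 + (3 * 4) and (2 + 3) * 4.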
Packrat parsers make use of an alternative to context-free grammars called parsing expression grammars (PEGs). Parsing expression grammars in some ways resemble CFGs - they describe a collection of strings by saying how to assemble those strings from (potentially recursive) smaller parts. In other ways, they're like regular expressions: they involve simpler statements combined together by a small collection of operations that describe larger structures.
For example, here's a simple PEG for the same sort of arithmetic expressions given above:
E -> F + E / F
F -> T * F / T
T -> D+ / (E)
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
To see what this says, let's look at the first line. Like a CFG, this line expresses a choice between two options: you can either replace E with F + E or with F. However, unlike a regular CFG, there is a specific ordering to these choices. Specifically, this PEG can be read as "first, try replacing E with F + E. If that works, great! And if that doesn't work, try replacing E with F. And if that works, great! And otherwise, we tried everything and it didn't work, so give up."
In that sense, PEGs directly encode into the grammar structure itself how the parsing is to be done. Whereas a CFG more abstractly says "an E may be replaced with any of the following," a PEG specifically says "to parse an E, first try this, then this, then this, etc." As a result, for any given string that a PEG can parse, the PEG can parse it exactly one way, since it stops trying options once the first parse is found.
PEGs, like CFGs, can take some time to get the hang of. For example, CFGs in the abstract - and many CFG parsing techniques - have no problem with left recursion. For example, this CFG can be parsed with an LR(1) parser:
E -> E + F | F
F -> F * T | T
T -> (E) | N
N -> ND | D
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
However, the following PEG can't be parsed by a packrat parser (though later improvements to PEG parsing can correct this):
E -> E + F / F
F -> F * T / T
T -> (E) / D+
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Let's take a look at that first line. The first line says "to parse an E, first try reading an E, then a +, then an F. And if that fails, try reading an F." So how would it then go about trying out that first option? The first step would be to try parsing an E, which would work by first trying to parse an E, and now we're caught in an infinite loop. Oops. This is called left recursion and also shows up in CFGs when working with LL-family parsers.
Another issue that comes up when designing PEGs is the need to get the ordered choices right. If you're coming from the Land of Context-Free Grammars, where choices are unordered, it's really easy to accidentally mess up a PEG. For example, consider this PEG:
E -> F / F + E
F -> T / T * F
T -> D+ / (E)
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Now, what happens if you try to parse the string 2 * 3 + 4? Well:
We try parsing an E, which first tries parsing an F.
We try parsing an F, which first tries parsing a T.
We try parsing a T, which first tries reading a series of digits. This succeeds in reading 2.
We've successfully read an F.
So we've successfully read an E, so we should be done here, but there are leftover tokens and the parse fails.
The issue here is that we first tried parsing F before F + E, and similarly first tried parsing T before parsing T * F. As a result, we essentially bit off less than we could check, because we tried reading a shorter expression before a longer one.
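The effect of choice ordering can be reproduced with a tiny backtracking PEG matcher. This is only a sketch under a few assumptions: grammars are plain dicts, terminals are single characters, and D+ is spelled out as a helper rule N -> D N / D. The names `match`, `GOOD`, and `BAD` are invented for the example.

```python
DIGITS = [[d] for d in "0123456789"]

# Longer alternatives first.
GOOD = {
    "E": [["F", "+", "E"], ["F"]],
    "F": [["T", "*", "F"], ["T"]],
    "T": [["N"], ["(", "E", ")"]],
    "N": [["D", "N"], ["D"]],   # stands in for D+
    "D": DIGITS,
}

# Same rules with the shorter alternatives tried first.
BAD = dict(GOOD, E=[["F"], ["F", "+", "E"]], F=[["T"], ["T", "*", "F"]])


def match(grammar, symbol, text, pos):
    """Return the end position of the first alternative that matches, else None."""
    if symbol not in grammar:  # terminal: a single literal character
        return pos + 1 if text.startswith(symbol, pos) else None
    for alternative in grammar[symbol]:  # ordered choice: first success wins
        cur = pos
        for item in alternative:
            cur = match(grammar, item, text, cur)
            if cur is None:
                break
        else:
            return cur
    return None


text = "2*3+4"
print(match(GOOD, "E", text, 0))  # 5: the whole string is consumed
print(match(BAD, "E", text, 0))   # 1: only "2" is consumed, so the parse fails
```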
Whether you find CFGs, with attending ambiguities and precedence declarations, easier or harder than PEGs, with attending choice orderings, is mostly a matter of personal preference. But many people report finding PEGs a bit easier to work with than CFGs because they more mechanically map onto what the parser should do. Rather than saying "here's an abstract description of the strings I want," you get to say "here's the order in which I'd like you to try things," which is a bit closer to how parsing often works.
The Packrat Parsing Algorithm
Compared with the algorithms to build LR or LL parsing tables, the algorithm used by a packrat parser is conceptually quite simple. At a high level, a packrat parser begins with the start symbol, then tries the ordered choices, one at a time, in sequence until it finds one that works. As it works through those choices, it may find that it needs to match another nonterminal, in which case it recursively tries matching that nonterminal on the rest of the string. If a particular choice fails, the parser backtracks and then tries the next production.
Matching any one individual production isn't that hard. If you see a terminal, either it matches the next available terminal or it doesn't. If it does, great! Match it and move on. If not, report an error. If you see a nonterminal, then (recursively) match that nonterminal, and if it succeeds pick up with the rest of the search at the point after where the nonterminal finished matching.
This means that, more generally, the packrat parser works by trying to solve problems of the following form:
Given some position in the string and a nonterminal, determine how much of the string that nonterminal matches starting at that position (or report that it doesn't match at all.)
Here, notice that there's no ambiguity about what's meant by "how much of the string the nonterminal matches." Unlike a traditional CFG where a nonterminal might match at a given position in several different lengths, the ordered choices used in PEGs ensure that if there's some match starting at a given point, then there's exactly one match starting at that point.
If you've studied dynamic programming, you might realize that these subproblems might overlap one another. In fact, in a PEG with k nonterminals and a string of length n, there are only Θ(kn) possible distinct subproblems: one for each combination of a starting position and a nonterminal. This means that, in principle, you could use dynamic programming to precompute a table of all possible position/nonterminal parse matches and have a very fast parser. Packrat parsing essentially does this, but using top-down memoization rather than filling in the whole table bottom-up. This means that it won't necessarily try filling all table entries, just the ones that it actually encounters in the course of parsing the input.
Since each table entry can be filled in in constant time (for each nonterminal, there are only finitely many productions to try for a fixed PEG), the parser ends up running in linear time, matching the speed of an LR parser.
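As an illustration of the memo table, here is a minimal packrat-style matcher in Python. It is a sketch, not a reference implementation: the grammar encoding (a dict of ordered alternatives, single-character terminals, N -> D N / D standing in for D+) and the names `packrat_parse` and `PEG` are assumptions made for this example.

```python
PEG = {
    "E": [["F", "+", "E"], ["F"]],
    "F": [["T", "*", "F"], ["T"]],
    "T": [["N"], ["(", "E", ")"]],
    "N": [["D", "N"], ["D"]],
    "D": [[d] for d in "0123456789"],
}


def packrat_parse(grammar, start, text):
    """Return (matched_whole_input, stats) using a (nonterminal, position) memo."""
    memo = {}
    stats = {"computed": 0, "hits": 0}

    def match(symbol, pos):
        if symbol not in grammar:  # terminal: a single literal character
            return pos + 1 if text.startswith(symbol, pos) else None
        key = (symbol, pos)
        if key in memo:  # this subproblem was already solved (pass or fail)
            stats["hits"] += 1
            return memo[key]
        stats["computed"] += 1
        result = None
        for alternative in grammar[symbol]:  # ordered choice
            cur = pos
            for item in alternative:
                cur = match(item, cur)
                if cur is None:
                    break
            else:
                result = cur
                break
        memo[key] = result
        return result

    end = match(start, 0)
    return end == len(text), stats


ok, stats = packrat_parse(PEG, "E", "2*3+4")
print(ok, stats)
```

Each (nonterminal, position) pair is computed at most once, so the number of computed entries is bounded by the number of nonterminals times (n + 1). The hits counter shows how often a retry, such as F after F + E fails, is answered straight from the table.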
The drawback with this approach is the amount of memory used. Specifically, the memoization table may record multiple entries per position in the input string, requiring memory usage proportional to both the size of the PEG and the length of the input string. Contrast this with LL or LR parsing, which only needs memory proportional to the size of the parsing stack, which is typically much smaller than the length of the full string.
That being said, the tradeoff here in worse memory performance is offset by not needing to learn the internal workings of how the packrat parser works. You can just read up on PEGs and take things from there.
Hope this helps!

Pyparsing is a pure-Python parsing library that supports packrat parsing, so you can see how it is implemented. Pyparsing uses a memoizing technique to save previous parse attempts for a particular grammar expression at a particular location in the input text. If the grammar involves retrying that same expression at that location, it skips the expensive parsing logic and just returns the results or exception from the memoizing cache.
There is more info here at the FAQ page of the pyparsing wiki, which also includes links back to Bryan Ford's original thesis on packrat parsing.

Related

Why are epsilon transitions used in NFA?

I'm trying to understand how to create NFAs from regular expressions, but I am really confused by epsilon transitions. I have this example in my textbook, but I don't understand why epsilon transitions are used and how one knows when to use them.
In general, epsilon-transitions are used when they are convenient. For example, when constructing an NFA from a regular expression, you start by constructing small parts of the automaton corresponding to parts of the expression. To connect them, you need to put a transition. But if there is no symbol to be read there, an epsilon transition is a simple way to do this. They are, however, never necessary; you can always find a solution without them.
In your example, just apply the algorithm described in your textbook. It tells you when to use them.
The epsilon transitions
from 1 to 2 probably connects the parts for (a|b)* and for ac
1->5 and 8->1 probably result from the *
5->6 and 5->7 probably result from the alternative in |
Epsilon-transitions in NFAs are a natural representation of choice or disjunction or union in regular expressions. That is, a regular expression like r + s (or r | s or r U s depending on your preferred notation) is naturally represented as an NFA consisting of two independent NFAs, one for r and one for s, joined using e-transitions as follows:
        e
----->q0----->(r)
      |
      | e
      |
      V
     (s)
When used to connect states in more complicated ways, the effect may not be as easy or natural to describe, but essentially these transitions let you choose unconditionally among multiple options. So, if I have seen a part of the input already and there are a few different ways the string could end, I can represent that by using e-transitions to states that handle the different possibilities.
In your example, the e-transitions are not really serving any very useful function and are merely artifacts of the conversion algorithm you have used. That algorithm includes them because, in the general case, they may be useful or necessary. In your specific case this was not true, so they look out of place.
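To see how e-transitions behave when an NFA is actually run, here is a small Python sketch that simulates an NFA with epsilon closure. The automaton is a made-up one for the expression ab|c, with a start state q0 that has e-transitions into the two sub-automata. The state names and the helper names are assumptions for the illustration, not the automaton from the textbook example.

```python
EPS = ""  # label used here for epsilon transitions

# (state, symbol) -> set of next states, for the expression ab|c
NFA = {
    ("q0", EPS): {"r0", "s0"},  # choose either branch without reading input
    ("r0", "a"): {"r1"},
    ("r1", "b"): {"r2"},
    ("s0", "c"): {"s1"},
}
ACCEPTING = {"r2", "s1"}


def epsilon_closure(states, transitions):
    """All states reachable from `states` using only epsilon transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in transitions.get((state, EPS), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def accepts(text, transitions=NFA, start="q0", accepting=ACCEPTING):
    current = epsilon_closure({start}, transitions)
    for ch in text:
        moved = set()
        for state in current:
            moved |= set(transitions.get((state, ch), ()))
        current = epsilon_closure(moved, transitions)
    return bool(current & accepting)


print(accepts("ab"), accepts("c"), accepts("a"))  # True True False
```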

On complexity of recursive descent parsers

It's known that recursive descent parsers may require exponential time in some cases; could anyone point me to the samples, where this happens? Especially interested in cases for PEG (i.e. with prioritized choices).
Any top down parser, including recursive descent, can theoretically become exponential if the combination of input and grammar are such that large numbers of backtracks are necessary. This happens if the grammar is such that determinative choices are placed at the end of long sequences. For example, if you have a symbol like & meaning "all previous minuses are actually plusses" and then have data like "((((a - b) - c) - d) - e &)" then the parser has to go backwards and change all the minuses to plusses. If you start making nested expressions along these lines you can create an effectively non-terminating set of input.
You have to realize you are stepping into a political issue here. The reality is that most normal grammars and data sets are not like this; however, there are a LOT of people who systematically badmouth recursive descent because it is not easy to make RD automatically. All early parsers are LALR because they are MUCH easier to make automatically than RD. So what happened was that everyone just wrote LALR and badmouthed RD, because in the old days the only way to make an RD was to code it by hand. For example, if you read the dragon book you will find that Aho & Ullman write just one paragraph on RD, and it is basically just an ideological takedown saying "RD is bad, don't do it".
Of course, if you start hand coding RDs (as I have) you will find that they are much better than LALRs for a variety of reasons. In the old days you could always tell a compiler that had a hand-coded RD, because it had meaningful error messages with locational accuracy, whereas compilers with LALRs would show the error occurring like 50 lines away from where it really was. Things have changed a lot since the old days, but you should realize that when you start reading the FUD on RD, that it is coming from a long, long tradition of verbally trashing RD in "certain circles".
It's because you can end up parsing the same things (check the same rule at the same position) many times in different recursion branches. It's kind of like calculating the n-th Fibonacci number using recursion.
Grammar:
A -> xA | xB | x
B -> yA | xA | y | A
S -> A
Input:
xxyxyy
Parsing:
xA(xxyxyy)
xA(xyxyy)
xA(yxyy) fail
xB(yxyy) fail
x(yxyy) fail
xB(xyxyy)
yA(yxyy)
xA(xyy)
xA(yy) fail
xB(yy) fail
x(yy) fail
xB(xyy)
yA(yy)
xA(y) fail
xB(y) fail
x(y) fail
xA(yy) fail *
x(xyy) fail
xA(yxyy) fail *
y(yxyy) fail
A(yxyy)
xA(yxyy) fail *
xB(yxyy) fail *
x(yxyy) fail *
x(xyxyy) fail
xB(xxyxyy)
yA(xyxyy) fail
xA(xyxyy) *
xA(yxyy) fail *
xB(yxyy) fail *
...
* - where we parse a rule at the same position where we have already parsed it in a different branch. If we had saved the results - which rules fail at which positions - we'd know xA(xyxyy) fails the second time around and we wouldn't go through its whole subtree again. I didn't want to write out the whole thing, but you can see it will repeat the same subtrees many times.
When does this happen? When you have many overlapping transformations. Prioritized choice doesn't change things: if the lowest priority rule ends up being the only correct one (or none are correct), you have to check all the rules anyway.
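The repeated work can be measured directly. The following Python sketch runs a full backtracking recognizer for the grammar above twice, once without and once with a cache of (rule, position) results, and counts how many times a rule is actually evaluated. The names `GRAMMAR` and `count_calls` are invented here, and the recognizer tracks the set of all possible end positions rather than following the exact order of the trace.

```python
GRAMMAR = {
    "A": ["xA", "xB", "x"],
    "B": ["yA", "xA", "y", "A"],
}


def count_calls(text, memoize):
    """Return (matched_whole_input, number_of_rule_evaluations) starting from A."""
    calls = 0
    cache = {}

    def ends(symbol, pos):
        # All positions where `symbol` can finish when started at `pos`.
        nonlocal calls
        if memoize and (symbol, pos) in cache:
            return cache[(symbol, pos)]
        calls += 1
        result = set()
        for alternative in GRAMMAR[symbol]:
            positions = {pos}
            for item in alternative:
                nxt = set()
                for p in positions:
                    if item in GRAMMAR:
                        nxt |= ends(item, p)
                    elif p < len(text) and text[p] == item:
                        nxt.add(p + 1)
                positions = nxt
            result |= positions
        if memoize:
            cache[(symbol, pos)] = result
        return result

    matched = len(text) in ends("A", 0)
    return matched, calls


print(count_calls("xxyxyy", memoize=False))
print(count_calls("xxyxyy", memoize=True))
```

With the cache, the evaluation count can never exceed the number of rules times the number of positions; without it, the same (rule, position) pairs are recomputed in every branch that reaches them.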

What does syntax directed translation mean?

Can anyone explain, in simple terms, what "Syntax Directed Translation" means? I started to read about the topic in the Dragon Book but couldn't understand it. The Wiki article didn't help either.
In simplest terms, 'Syntax Directed Translation' means driving the entire compilation (translation) process with the syntax recognizer (the parser).
Conceptually, the process of compiling a program (translating it from source code to machine code) starts with a parser that produces a parse tree, and then transforms that parse tree through a sequence of tree or graph transformations, each of which is largely independent, resulting in a final simplified tree or graph that is traversed to produce machine code.
This view, while nice in theory, has a drawback that if you try to implement it directly, enough memory to hold at least two copies of the entire tree or graph is needed. Back when the Dragon Book was written (and when a lot of this theory was hashed out), computer memories were measured in kilobytes, and 64K was a lot. So compiling large programs could be tricky.
With Syntax Directed Translation, you organize all of the graph transformations around the order in which the parser recognizes the parse tree. Instead of producing a complete parse tree, your parser builds little bits of it, and then feeds those bits to the subsequent passes of the compiler, ultimately producing a small piece of machine code, before continuing the parsing process to build the next piece of parse tree. Since only small amounts of the parse tree (or the subsequent graphs) exist at any time, much less memory is required. Since the syntax recognizer is the master sequencer controlling all of this (deciding the order in which things happen), this is called Syntax Directed Translation.
Since this is such an effective way of keeping down memory use, people even redesigned languages to make it easier to do -- the ideal being to have a "Single Pass" compiler that could in fact do the entire process from parsing to machine code generation in a single pass.
Nowadays, memory is not at such a premium, so there's less pressure to force everything into a single pass. Instead you generally use Syntax Directed Translation just for the front end, parsing the syntax, doing typechecking and other semantic checks, and a few simple transformations all from the parser and producing some internal form (three address code, trees, or dags of some kind) and then having separate optimization and back end passes that are independent (and so not syntax directed). Even in this case you might claim that these later passes are at least partly syntax directed, as the compiler may be organized to operate on large pieces of the input (such as entire functions or modules), pushing through all the passes before continuing with the next piece of input.
Tools like yacc are designed around the idea of Syntax Directed Translation -- the tool produces a syntax recognizer that directly runs fragments of code ('actions' in the tool parlance) as productions (fragments of the parse tree) are recognized, without ever creating an actual 'tree'. These actions can directly invoke what are logically later passes in the compiler, and then return to continue parsing. The imperative main loop that drives all of this is the parser's token reading state machine.
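The idea of running actions as productions are recognized, without building a tree, can be sketched in a few lines of Python. This is a hand-written recursive descent translator rather than yacc output, and the instruction names (ADD, SUB, MPY, DIV) and the function name `translate` are assumptions chosen for the illustration. Operands are single letters or digits.

```python
def translate(source):
    """Emit stack-machine code for an arithmetic expression while parsing it."""
    tokens = list(source.replace(" ", ""))
    pos = 0
    output = []  # code is appended the moment a construct is recognized

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():  # expr = term (('+' | '-') term)*
        term()
        while peek() in ("+", "-"):
            op = peek()
            advance()
            term()
            output.append("ADD" if op == "+" else "SUB")  # action for this production

    def term():  # term = factor (('*' | '/') factor)*
        factor()
        while peek() in ("*", "/"):
            op = peek()
            advance()
            factor()
            output.append("MPY" if op == "*" else "DIV")

    def factor():  # factor = operand | '(' expr ')'
        tok = peek()
        if tok == "(":
            advance()
            expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
        elif tok is not None and tok.isalnum():
            output.append(tok)  # push the operand
            advance()
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return output


print(",".join(translate("A*(B+C)")))  # A,B,C,ADD,MPY
```

No parse tree is ever stored; the order in which the parser recognizes constructs is what sequences the output.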
Actually, no. Historically, there were syntax directed compilers before the Dragon Book. Attending ACM SIGPLAN meetings in the late 1960s, I learned of several types of directed translation. Tree directed and graph directed translation were also discussed. I think these got muddled together in the Dragon Book, though I have never owned the Dragon Book. My favorite book was Programming Systems and Languages by Saul Rosen. It is a collection of papers on compilers, operating systems and computer systems. I'll try to explain the early syntax directed compiler parser programming languages. The later ones producing trees were combined with tree directed code generating languages.
Early syntax directed compilers translated source directly to stack machine code. The Burroughs B5000 ALGOL compiler is an example.
A*(B+C) -> A,B,C,ADD,MPY
Schorre's META II domain specific parser programming language, compiler compiler, developed in the 1960s is an example of a syntax directed compiler. You can find the original META II paper in the ACM archive. META II avoids left recursion using $ postfix zero or more sequence operator and ( ) grouping.
EXPR = TERM $('+' TERM .OUT 'ADD'|'-' TERM .OUT 'SUB');
Later Schorre based metalanguage compilers translated to trees using stack based tree transformation operators :<node name> and !<number>.
EXPR = TERM $(('+':ADD|'-':SUB) TERM!2);
Except for TREEMETA that used [<number>] instead of !<number>. The above EXPR formula is basically the same as the META II EXPR except we have factored operators + and - recognition creating corresponding nodes and pushing the node onto the node stack. Then on recognizing the right TERM the tree constructor !2 creates a tree popping the top 2 parse stack <TERM>s and top node from the node stack to form a tree:
ADD or SUB
/ \ / \
TERM TERM TERM TERM
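The two-stack mechanism described above can be mimicked in Python. This is only a rough sketch of the idea behind :ADD, :SUB and !2, not a reconstruction of any actual META or CWIC implementation. The function name `parse_expr` is hypothetical, and each TERM is assumed to be a single token.

```python
def parse_expr(tokens):
    """Mimic EXPR = TERM $(('+':ADD|'-':SUB) TERM!2) with a parse stack and a node stack."""
    parse_stack = []  # recognized operands and finished subtrees
    node_stack = []   # node names pushed by :ADD / :SUB
    pos = 0

    def term():
        nonlocal pos
        parse_stack.append(tokens[pos])  # a TERM is a single token in this sketch
        pos += 1

    term()
    while pos < len(tokens) and tokens[pos] in ("+", "-"):  # the $ loop
        node_stack.append("ADD" if tokens[pos] == "+" else "SUB")  # ':ADD' or ':SUB'
        pos += 1
        term()
        # '!2': pop the top two parse stack entries and the top node to form a tree
        right = parse_stack.pop()
        left = parse_stack.pop()
        parse_stack.append((node_stack.pop(), left, right))
    return parse_stack.pop()


print(parse_expr(["a", "-", "b", "+", "c"]))  # ('ADD', ('SUB', 'a', 'b'), 'c')
```

Because the tree is built inside the loop, each new operator takes the tree so far as its left operand, which is how a loop yields a left handed tree.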
Tokens were recognized by supplied recognizers .ID .NUMBER and .STRING. Later replaced by token ".." and character class ":" formula in CWIC:
id .. let $(leter|dgt|+'_');
Tree directed compiler languages were combined with the syntax directed compilers to generate code. The CWIC compiler compiler developed at System Development Corporation included a LISP 2 based tree directed generator language. A short paper on CWIC can be found in the ACM archives.
In the parser programming languages you are programming a type of recursive descent parser. When you get to CWIC all the problems that today are attributed to recursive descent parsers were eliminated. There is no left recursion problem, as the $ zero or more construct and programmed tree construction eliminated the need for left recursion. You control the tree construction. A loop construct is used to produce a left handed tree and tail recursion a right handed tree. Though parsing formulas may generate no tree at all:
program = $declarations;
In the above, the $ zero or more loop operator preceding declarations specifies that declarations is to be called repeatedly as long as it returns success. The input source code being compiled is made up of any number of declarations. The declarations formula would then define the types of declarations. You might need external linkage declarations, data declarations, and function or procedure code declarations.
declarations = linkage_decl | data_decl | code_decl;
The types of declarations are each a separate formula. The syntax language controls when semantic processing and code generation occur. The program and declarations formulas above do not produce trees. They simply control when and which language structures are parsed. These are neither LL nor LR parsers. They provide unlimited (limited only by available memory) programmed backtracking. They also provide programmed look ahead and peek ahead tests.
As a last example, the following, which includes token and character class formulas, illustrates producing both left and right handed trees. Specifically, exponentiation uses tail recursion.
assign = id '=' expr ';' :ASSIGN!2 arith_gen[*1];
expr = term $(('+':ADD | '-':SUB) term !2);
term = factor $(('*':MPY | '//' :REM | '/':DIV) factor!2);
factor = ( id ('(' +[ arg $(',' arg) ]+ ')' :CALL!2 | .EMPTY)
| number
| '(' expr ')'
) ('^' factor:EXP!2 | .EMPTY);
bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';
upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';
alpha: upr|lwr;
alphanum: alpha|dgt;
number .. dgt $dgt MAKENUM[];
id .. alpha $(alphanum|+'_');

Parsing context-free languages in a stream of tokens

The problem
Given a context-free grammar with arbitrary rules and a stream of tokens, how can stream fragments that match the grammar be identified effectively?
Example:
Grammar
S -> ASB | AB
A -> a
B -> b
(So essentially, a number of as followed by an equal number of bs)
Stream:
aabaaabbc...
Expected result:
Match starting at position 1: ab
Match starting at position 4: aabb
Of course the key is "effectively": without testing too many hopeless candidates for too long. The only thing I know about my data is that although the grammar is arbitrary, in practice matching sequences will be relatively short (<20 terminals) while the stream itself will be quite long (>10000 terminals).
Ideally I'd also want a syntax tree but that's not too important, because once the fragment is identified, I can run an ordinary parser over it to obtain the tree.
Where should I start? Which type of parser can be adapted to this type of work?
"Arbitrary grammar" makes me suggest you look at wberry's comment.
How complex are these grammars? Is there a manual intervention step?
I'll make an attempt. If I modified your example grammar from:
S -> ASB | AB
A -> a
B -> b
to include:
S' -> S | GS' | S'GS' | S'G
G -> sigma*
So that G = garbage and S' is many S fragments with garbage in between (I may have been careless with my production rules. You get the idea), I think we can solve your problem. You just need a parser that will match other rules before G. You may have to modify these production rules based on the parser. I almost guarantee that there will be rule ordering changes depending on the parser. Since most parser libraries separate lexing from parsing, you'll probably need a catch-all lexeme followed by modifying G to include all possible lexemes. Depending on your specifics, this might not be any better (efficiency-wise) than just starting each attempt at each spot in the stream.
But... Assuming my production rules are fixed (both for correctness and for the particular flavor of parser), this should not only match fragments in the stream, but it should give you a parse tree for the whole stream. You are only interested in subtrees rooted in nodes of type S.
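For comparison, the baseline mentioned above, starting an attempt at each spot in the stream, can be sketched in Python as follows. It is not the garbage-rule grammar from this answer. It assumes single-character tokens and a grammar without left recursion, it reports the longest match at the leftmost position and then resumes after it, and the names `match_ends` and `find_fragments` are invented for the example. Results for a (symbol, position) pair are cached so that overlapping attempts do not redo work.

```python
GRAMMAR = {
    "S": [["A", "S", "B"], ["A", "B"]],
    "A": [["a"]],
    "B": [["b"]],
}


def match_ends(symbol, tokens, pos, memo):
    """All end positions where `symbol` can finish when started at `pos`."""
    key = (symbol, pos)
    if key in memo:
        return memo[key]
    ends = set()
    for alternative in GRAMMAR[symbol]:
        positions = {pos}
        for item in alternative:
            nxt = set()
            for p in positions:
                if item in GRAMMAR:
                    nxt |= match_ends(item, tokens, p, memo)
                elif p < len(tokens) and tokens[p] == item:
                    nxt.add(p + 1)
            positions = nxt
        ends |= positions
    memo[key] = ends
    return ends


def find_fragments(tokens, start_symbol="S"):
    """Return (position, fragment) pairs for non-overlapping leftmost-longest matches."""
    memo = {}
    found = []
    pos = 0
    while pos < len(tokens):
        ends = match_ends(start_symbol, tokens, pos, memo)
        if ends:
            end = max(ends)
            found.append((pos, tokens[pos:end]))
            pos = end  # resume after the match
        else:
            pos += 1   # treat this token as garbage
    return found


print(find_fragments("aabaaabbc"))  # [(1, 'ab'), (4, 'aabb')]
```

Positions are counted from zero, which matches the expected result in the question. The recursion depth grows with the length of a candidate match, so very long runs would need an iterative formulation.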

What is the difference between LR, SLR, and LALR parsers?

What is the actual difference between LR, SLR, and LALR parsers? I know that SLR and LALR are types of LR parsers, but what is the actual difference as far as their parsing tables are concerned?
And how to show whether a grammar is LR, SLR, or LALR? For an LL grammar we just have to show that any cell of the parsing table should not contain multiple production rules. Any similar rules for LALR, SLR, and LR?
For example, how can we show that the grammar
S --> Aa | bAc | dc | bda
A --> d
is LALR(1) but not SLR(1)?
EDIT (ybungalobill): I didn't get a satisfactory answer for what's the difference between LALR and LR. So LALR's tables are smaller in size but it can recognize only a subset of LR grammars. Can someone elaborate more on the difference between LALR and LR please? LALR(1) and LR(1) will be sufficient for an answer. Both of them use 1 token look-ahead and both are table driven! How they are different?
SLR, LALR and LR parsers can all be implemented using exactly the same table-driven machinery.
Fundamentally, the parsing algorithm collects the next input token T, and consults the current state S (and associated lookahead, GOTO, and reduction tables) to decide what to do:
SHIFT: If the current table says to SHIFT on the token T, the pair (S,T) is pushed onto the parse stack, the state is changed according to what the GOTO table says for the current token (e.g, GOTO(T)), another input token T' is fetched, and the process repeats
REDUCE: Every state has 0, 1, or many possible reductions that might occur in the state. If the parser is LR or LALR, the token is checked against lookahead sets for all valid reductions for the state. If the token matches a lookahead set for a reduction for grammar rule G = R1 R2 .. Rn, a stack reduction and shift occurs: the semantic action for G is called, the stack is popped n (from Rn) times, the pair (S,G) is pushed onto the stack, the new state S' is set to GOTO(G), and the cycle repeats with the same token T. If the parser is an SLR parser, there is at most one reduction rule for the state and so the reduction action can be done blindly without searching to see which reduction applies. It is useful for an SLR parser to know if there is a reduction or not; this is easy to tell if each state explicitly records the number of reductions associated with it, and that count is needed for the L(AL)R versions in practice anyway.
ERROR: If neither SHIFT nor REDUCE is possible, a syntax error is declared.
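The shared machinery can be sketched as a small table-driven loop in Python. The tables below were built by hand for the toy grammar S -> ( S ) | x and are an assumption of this example, not the output of any particular generator. For brevity the shift target state is stored in the ACTION table, only state numbers are kept on the stack, and semantic actions are omitted.

```python
# Productions: number -> (left-hand side, length of right-hand side)
PRODUCTIONS = {1: ("S", 3), 2: ("S", 1)}  # 1: S -> ( S )    2: S -> x

# (state, lookahead token) -> action; "$" marks end of input
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}

# (state, nonterminal) -> state entered after a reduction
GOTO = {(0, "S"): 1, (2, "S"): 4}


def lr_parse(text):
    """Return True if `text` is accepted by the hand-built tables above."""
    tokens = list(text) + ["$"]
    stack = [0]  # stack of states
    pos = 0
    while True:
        action = ACTION.get((stack[-1], tokens[pos]))
        if action is None:       # ERROR: neither shift nor reduce applies
            return False
        kind, arg = action
        if kind == "shift":      # SHIFT: push the new state, fetch the next token
            stack.append(arg)
            pos += 1
        elif kind == "reduce":   # REDUCE: pop the rule's right-hand side, then GOTO
            lhs, length = PRODUCTIONS[arg]
            del stack[-length:]
            stack.append(GOTO[(stack[-1], lhs)])
        else:                    # accept
            return True


print(lr_parse("((x))"), lr_parse("(x"))  # True False
```

Only the contents of ACTION and GOTO would differ between SLR, LALR and LR tables for a given grammar; the loop stays the same.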
So, if they all the use the same machinery, what's the point?
The purported value in SLR is its simplicity in implementation; you don't have to scan through the possible reductions checking lookahead sets because there is at most one, and this is the only viable action if there are no SHIFT exits from the state. Which reduction applies can be attached specifically to the state, so the SLR parsing machinery doesn't have to hunt for it. In practice L(AL)R parsers handle a usefully larger set of langauges, and is so little extra work to implement that nobody implements SLR except as an academic exercise.
The difference between LALR and LR has to do with the table generator. LR parser generators keep track of all possible reductions from specific states and their precise lookahead set; you end up with states in which every reduction is associated with its exact lookahead set from its left context. This tends to build rather large sets of states. LALR parser generators are willing to combine states if the GOTO tables and lookhead sets for reductions are compatible and don't conflict; this produces considerably smaller numbers of states, at the price of not be able to distinguish certain symbol sequences that LR can distinguish. So, LR parsers can parse a larger set of languages than LALR parsers, but have very much bigger parser tables. In practice, one can find LALR grammars which are close enough to the target langauges that the size of the state machine is worth optimizing; the places where the LR parser would be better is handled by ad hoc checking outside the parser.
So: All three use the same machinery. SLR is "easy" in the sense that you can ignore a tiny bit of the machinery but it is just not worth the trouble. LR parses a broader set of langauges but the state tables tend to be pretty big. That leaves LALR as the practical choice.
Having said all this, it is worth knowing that GLR parsers can parse any context free language, using more complicated machinery but exactly the same tables (including the smaller version used by LALR). This means that GLR is strictly more powerful than LR, LALR and SLR; pretty much if you can write a standard BNF grammar, GLR will parse according to it. The difference in the machinery is that GLR is willing to try multiple parses when there are conflicts between the GOTO table and/or lookahead sets. (How GLR does this efficiently is sheer genius [not mine] but won't fit in this SO post).
That for me is an enormously useful fact. I build program analyzers and code transformers, and parsers are necessary but "uninteresting"; the interesting work is what you do with the parsed result, so the focus is on doing the post-parsing work. Using GLR means I can relatively easily build working grammars, compared to hacking a grammar to get it into LALR-usable form. This matters a lot when trying to deal with non-academic languages such as C++ or Fortran, where you literally need thousands of rules to handle the entire language well, and you don't want to spend your life trying to hack the grammar rules to meet the limitations of LALR (or even LR).
As a sort of famous example, C++ is considered to be extremely hard to parse... by guys doing LALR parsing. C++ is straightforward to parse using GLR machinery using pretty much the rules provided in the back of the C++ reference manual. (I have precisely such a parser, and it handles not only vanilla C++, but also a variety of vendor dialects as well. This is only possible in practice because we are using a GLR parser, IMHO).
[EDIT November 2011: We've extended our parser to handle all of C++11. GLR made that a lot easier to do. EDIT Aug 2014: Now handling all of C++17. Nothing broke or got worse, GLR is still the cat's meow.]
LALR parsers merge similar states within an LR grammar to produce parser state tables that are exactly the same size as the equivalent SLR grammar, which are usually an order of magnitude smaller than pure LR parsing tables. However, for LR grammars that are too complex to be LALR, these merged states result in parser conflicts, or produce a parser that does not fully recognize the original LR grammar.
BTW, I mention a few things about this in my MLR(k) parsing table algorithm here.
Addendum
The short answer is that the LALR parsing tables are smaller, but the parser machinery is the same. A given LALR grammar will produce much larger parsing tables if all of the LR states are generated, with a lot of redundant (near-identical) states.
The LALR tables are smaller because the similar (redundant) states are merged together, effectively throwing away context/lookahead info that the separate states encode. The advantage is that you get much smaller parsing tables for the same grammar.
The drawback is that not all LR grammars can be encoded as LALR tables because more complex grammars have more complicated lookaheads, resulting in two or more states instead of a single merged state.
The main difference is that the algorithm to produce LR tables carries more info around between the transitions from state to state while the LALR algorithm does not. So the LALR algorithm cannot tell if a given merged state should really be left as two or more separate states.
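The merging step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an LR(1) state is modelled as a set of (item, lookahead) pairs, items are plain strings, and the helper names core and merge_by_core are hypothetical. A real LALR generator also has to redirect the GOTO transitions to the merged states, which is omitted here. The two sample states are the textbook pair from the grammar S → C C, C → c C | d.

```python
from collections import defaultdict


def core(state):
    """The LR(0) core of an LR(1) state: its items with lookaheads stripped."""
    return frozenset(item for item, _lookahead in state)


def merge_by_core(lr1_states):
    """Merge LR(1) states that share a core, taking the union of their lookaheads."""
    merged = defaultdict(set)
    for state in lr1_states:
        merged[core(state)].update(state)
    return [frozenset(items) for items in merged.values()]


# Two canonical LR(1) states that differ only in their lookaheads.
I4 = frozenset({('C -> d .', 'c'), ('C -> d .', 'd')})   # reached before the second C
I7 = frozenset({('C -> d .', '$')})                      # reached at the very end

lalr_states = merge_by_core([I4, I7])
print(len(lalr_states))   # -> 1
```

After merging, the single state reduces C → d on any of c, d or $. The information about which of the two contexts the parser was in is exactly what gets thrown away.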
Yet another answer (YAA).
The parsing algorithms for SLR(1), LALR(1) and LR(1) are identical, as Ira Baxter said;
however, the parser tables may differ because of the parser-generation algorithm.
An SLR parser generator creates an LR(0) state machine and computes the look-aheads from the grammar (FIRST and FOLLOW sets). This is a simplified approach and may report conflicts that do not really exist in the LR(0) state machine.
An LALR parser generator creates an LR(0) state machine and computes the look-aheads from the LR(0) state machine (via the terminal transitions). This is a correct approach, but occasionally reports conflicts that would not exist in an LR(1) state machine.
A Canonical LR parser generator computes an LR(1) state machine and the look-aheads are already part of the LR(1) state machine. These parser tables can be very large.
A Minimal LR parser generator computes an LR(1) state machine, but merges compatible states during the process, and then computes the look-aheads from the minimal LR(1) state machine. These parser tables are the same size or slightly larger than LALR parser tables, giving the best solution.
LRSTAR 10.0 can generate LALR(1), LR(1), CLR(1) or LR(*) parsers in C++, whatever is needed for your grammar. See this diagram which shows the difference among LR parsers.
[Full disclosure: LRSTAR is my product]
The basic difference between the parser tables generated with SLR vs LR is that reduce actions are based on the Follow set for SLR tables. This can be overly permissive about when to reduce, ultimately causing a shift-reduce conflict.
An LR parser, on the other hand, bases reduce decisions only on the set of terminals which can actually follow the non-terminal being reduced in that particular context. This set of terminals is often a proper subset of the Follow set of such a non-terminal, and therefore has less chance of conflicting with shift actions.
LR parsers are more powerful for this reason. LR parsing tables can be extremely large, however.
An LALR parser starts with the idea of building an LR parsing table, but combines generated states in a way that results in a significantly smaller table. The downside is that, for some grammars, this introduces a small chance of conflicts that an LR table would have avoided.
LALR parsers are slightly less powerful than LR parsers, but still more powerful than SLR parsers. YACC and other such parser generators tend to use LALR for this reason.
P.S. For brevity, SLR, LALR and LR above really mean SLR(1), LALR(1), and LR(1), so one token lookahead is implied.
SLR parsers recognize a proper subset of grammars recognizable by LALR(1) parsers, which in turn recognize a proper subset of grammars recognizable by LR(1) parsers.
Each of these is constructed as a state machine, with each state representing some set of the grammar's production rules (and position in each) as it's parsing the input.
The Dragon Book example of an LALR(1) grammar that is not SLR is this:
S → L = R | R
L → * R | id
R → L
Here is one of the states for this grammar:
S → L•= R
R → L•
The • indicates the position of the parser in each of the possible productions. It doesn't know which of the productions it's actually in until it reaches the end and tries to reduce.
Here, the parser could either shift an = or reduce R → L.
An SLR(1) parser (which is built on the LR(0) state machine) would determine whether it could reduce by checking if the next input symbol is in the follow set of R (i.e., the set of all terminals in the grammar that can follow R). Since = is also in this set, the SLR parser encounters a shift-reduce conflict.
However, an LALR(1) parser would use the set of all terminals that can follow this particular production of R, which is only $ (ie, end of input). Thus, no conflict.
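The SLR side of this claim can be checked mechanically. Below is a small sketch that computes FIRST and FOLLOW sets by fixed-point iteration; it assumes a grammar with no ε-productions (true for this example), and the function names first_sets and follow_sets are illustrative choices.

```python
GRAMMAR = {
    'S': [['L', '=', 'R'], ['R']],
    'L': [['*', 'R'], ['id']],
    'R': [['L']],
}


def first_sets(grammar):
    """FIRST sets for a grammar without epsilon productions."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                head = prod[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def follow_sets(grammar, start):
    """FOLLOW sets for a grammar without epsilon productions."""
    first = first_sets(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add('$')                      # end of input follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            for prod in productions:
                for i, sym in enumerate(prod):
                    if sym not in grammar:      # only non-terminals have FOLLOW sets
                        continue
                    if i + 1 < len(prod):       # something comes after sym
                        nxt = prod[i + 1]
                        new = first[nxt] if nxt in grammar else {nxt}
                    else:                       # sym ends the production
                        new = follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


follow = follow_sets(GRAMMAR, 'S')
print(sorted(follow['R']))   # -> ['$', '=']
```

Because = is in FOLLOW(R), an SLR(1) table has both "shift =" and "reduce R → L" in the state shown above, which is the conflict being described.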
As previous commenters have noted, LALR(1) parsers have the same number of states as SLR parsers. A lookahead propagation algorithm is used to tack lookaheads on to SLR state productions from corresponding LR(1) states. The resulting LALR(1) parser can introduce reduce-reduce conflicts not present in the LR(1) parser, but it cannot introduce shift-reduce conflicts.
In your example, the following LALR(1) state causes a shift-reduce conflict in an SLR implementation:
S → b d•a / $
A → d• / c
The symbol after / is the follow set for each production in the LALR(1) parser. In SLR, follow(A) includes a, which could also be shifted.
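As a tiny illustration of why the two methods disagree on this state, the check below intersects the terminals that can be shifted with the lookahead set used for the reduction. The function name shift_reduce_conflict is a hypothetical helper, and the sets are written out by hand from the grammar in the question (S → Aa | bAc | dc | bda, A → d).

```python
def shift_reduce_conflict(shiftable, reduce_lookahead):
    """Terminals on which the state would both shift and reduce."""
    return shiftable & reduce_lookahead


shiftable = {'a'}             # from the item S -> b d . a
follow_A = {'a', 'c'}         # SLR: every terminal that can follow A anywhere
lalr_lookahead_A = {'c'}      # LALR: what can follow A -> d . in this state

print(shift_reduce_conflict(shiftable, follow_A))          # -> {'a'}
print(shift_reduce_conflict(shiftable, lalr_lookahead_A))  # -> set()
```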
Suppose a parser without a lookahead is happily parsing strings for your grammar.
Using your given example it comes across a string dc, what does it do? Does it reduce it to S, because dc is a valid string produced by this grammar? OR maybe we were trying to parse bdc because even that is an acceptable string?
As humans we know the answer is simple, we just need to remember if we had just parsed b or not. But computers are stupid :)
An SLR(1) parser has one extra power over LR(0), a one-token lookahead, but no amount of lookahead can tell us what to do in this case; instead, we need to look back at our past. Thus comes the canonical LR parser to the rescue. It remembers the past context.
It remembers this context by discipline: whenever it encounters a b, it starts walking a path towards reading bdc as one possibility. So when it sees a d, it knows whether it is already on such a path.
Thus a CLR(1) parser can do things an SLR(1) parser cannot!
But now, since we had to define so many paths, the number of states in the machine gets very large!
So we merge similar-looking paths, but as expected this can cause confusion. However, we are willing to take that risk in exchange for the smaller size.
This is your LALR(1) parser.
Now how to do it algorithmically.
When you draw the configuring sets for the above language, you will see a shift-reduce conflict in two states. To remove them you might consider SLR(1), which makes decisions by looking at the follow set, but you will observe that it still cannot resolve them. So you draw the configuring sets again, this time with the restriction that whenever you compute the closure, each production being added must carry its own strict follow set (its lookahead). Refer to any textbook for how these follow sets are computed.
In addition to the answers above, this diagram demonstrates how different parsers relate:
Adding to the above answers, the difference between the individual parsers in the class of bottom-up LR parsers is whether shift/reduce or reduce/reduce conflicts arise when generating the parsing tables. The fewer conflicts a method produces, the more grammars it can handle, i.e. the more powerful it is (LR(0) < SLR(1) < LALR(1) < CLR(1)).
For example, consider the following expression grammar:
E → E + T
E → T
T → F
T → T * F
F → ( E )
F → id
It's not LR(0), but it is SLR(1). Using the following code, we can construct the LR(0) automaton and build the parsing table (we need to augment the grammar, compute the DFA with closure, and compute the action and goto sets):
from copy import deepcopy
import pandas as pd

def update_items(I, C):
    # merge the items in C into the item set I (both map non-terminal -> list of rules)
    if len(I) == 0:
        return C
    for nt in C:
        Int = I.get(nt, [])
        for r in C.get(nt, []):
            if not r in Int:
                Int.append(r)
        I[nt] = Int
    return I

def compute_action_goto(I, I0, sym, NTs):
    #I0 = deepcopy(I0)
    I1 = {}
    for NT in I:
        C = {}
        for r in I[NT]:
            r = r.copy()
            ix = r.index('.')
            #if ix == len(r)-1: # reduce step
            if ix >= len(r)-1 or r[ix+1] != sym:
                continue
            r[ix:ix+2] = r[ix:ix+2][::-1]  # read the next symbol sym
            C = compute_closure(r, I0, NTs)
            cnt = C.get(NT, [])
            if not r in cnt:
                cnt.append(r)
            C[NT] = cnt
        I1 = update_items(I1, C)
    return I1

def construct_LR0_automaton(G, NTs, Ts):
    # get_start_state and to_str are helper functions that are not shown here
    I0 = get_start_state(G, NTs, Ts)
    I = deepcopy(I0)
    queue = [0]
    states2items = {0: I}
    items2states = {str(to_str(I)): 0}
    parse_table = {}
    cur = 0
    while len(queue) > 0:
        id = queue.pop(0)
        I = states2items[id]
        # compute goto set for non-terminals
        for NT in NTs:
            I1 = compute_action_goto(I, I0, NT, NTs)
            if len(I1) > 0:
                state = str(to_str(I1))
                if not state in items2states:
                    # a new state: number it and queue it for expansion
                    cur += 1
                    queue.append(cur)
                    states2items[cur] = I1
                    items2states[state] = cur
                    parse_table[id, NT] = cur
                else:
                    parse_table[id, NT] = items2states[state]
        # compute actions for terminals similarly
        # ... ... ...
    return states2items, items2states, parse_table

states, statess, parse_table = construct_LR0_automaton(G, NTs, Ts)
where the grammar G, non-terminal and terminal symbols are defined as below
G = {}
NTs = ['E', 'T', 'F']
Ts = {'+', '*', '(', ')', 'id'}
G['E'] = [['E', '+', 'T'], ['T']]
G['T'] = [['T', '*', 'F'], ['F']]
G['F'] = [['(', 'E', ')'], ['id']]
Here are a few more useful functions I implemented along with the above ones for LR(0) parsing table generation:
def augment(G, S):  # start symbol S
    # add a new start rule S1 -> S $ (NTs is the global list of non-terminals)
    G[S + '1'] = [[S, '$']]
    NTs.append(S + '1')
    return G, NTs

def compute_closure(r, G, NTs):
    S = {}
    queue = [r]
    seen = []
    while len(queue) > 0:
        r = queue.pop(0)
        seen.append(r)
        ix = r.index('.') + 1
        # if the symbol after the dot is a non-terminal, pull in its rules
        if ix < len(r) and r[ix] in NTs:
            S[r[ix]] = G[r[ix]]
            for rr in G[r[ix]]:
                if not rr in seen:
                    queue.append(rr)
    return S
The following figure shows the LR(0) DFA constructed for the grammar using the above code:
The following table shows the LR(0) parsing table generated as a pandas dataframe. Notice that there are a couple of shift/reduce conflicts, indicating that the grammar is not LR(0).
An SLR(1) parser avoids the above shift/reduce conflicts by reducing only if the next input token is a member of the Follow set of the nonterminal being reduced. The following parse table is generated by SLR:
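For readers who want something self-contained to run, here is a compact sketch of LR(0) item-set construction for the same expression grammar, with a check for LR(0) conflicts. It is independent of the code above and rests on a few assumptions of its own: items are (production index, dot position) pairs, the grammar is augmented with E' → E (no explicit $ rule, so the states will not match the figure one-to-one), and the names closure, goto, lr0_states and lr0_conflicts are illustrative.

```python
GRAMMAR = [
    ("E'", ('E',)),            # production 0: augmented start rule
    ('E', ('E', '+', 'T')),
    ('E', ('T',)),
    ('T', ('T', '*', 'F')),
    ('T', ('F',)),
    ('F', ('(', 'E', ')')),
    ('F', ('id',)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def closure(items):
    """Add X -> . rhs for every non-terminal X that appears right after a dot."""
    result, work = set(items), list(items)
    while work:
        p, dot = work.pop()
        rhs = GRAMMAR[p][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for q, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (q, 0) not in result:
                    result.add((q, 0))
                    work.append((q, 0))
    return frozenset(result)


def goto(state, symbol):
    """Move the dot over `symbol` in every item where that is possible."""
    moved = {(p, dot + 1) for p, dot in state
             if dot < len(GRAMMAR[p][1]) and GRAMMAR[p][1][dot] == symbol}
    return closure(moved)


def lr0_states():
    start = closure({(0, 0)})
    states, work = [start], [start]
    while work:
        state = work.pop()
        symbols = {GRAMMAR[p][1][dot] for p, dot in state if dot < len(GRAMMAR[p][1])}
        for sym in symbols:
            nxt = goto(state, sym)
            if nxt not in states:
                states.append(nxt)
                work.append(nxt)
    return states


def lr0_conflicts(states):
    """States where a reduce competes with a terminal shift or with another reduce."""
    bad = []
    for state in states:
        # production 0 (accept) is excluded: it only fires at end of input
        complete = [p for p, dot in state if dot == len(GRAMMAR[p][1]) and p != 0]
        can_shift = any(dot < len(GRAMMAR[p][1]) and GRAMMAR[p][1][dot] not in NONTERMINALS
                        for p, dot in state)
        if complete and (can_shift or len(complete) > 1):
            bad.append(state)
    return bad


states = lr0_states()
print(len(states), len(lr0_conflicts(states)))   # -> 12 2
```

The two conflicting states are the ones containing E → T • and E → E + T •, each alongside T → T • * F: without lookahead the parser cannot tell whether to reduce or to shift the *.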
The following animation shows how an input expression is parsed by the above SLR(1) grammar:
The grammar from the question is not LR(0) either:
#S --> Aa | bAc | dc | bda
#A --> d
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c', 'd'}
G['S'] = [['A', 'a'], ['b', 'A', 'c'], ['d', 'c'], ['b', 'd', 'a']]
G['A'] = [['d']]
as can be seen from the next LR0 DFA and the parsing table:
there is a shift / reduce conflict again:
But, the following grammar which accepts the strings of the form a^ncb^n, n >= 1 is LR(0):
A → a A b
A → c
S → A
# S --> A
# A --> a A b | c
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c'}
G['S'] = [['A']]
G['A'] = [['a', 'A', 'b'], ['c']]
As can be seen from the following figure, there is no conflict in the parsing table generated.
Here is how the input string a^2cb^2 can be parsed using the above LR(0) parse table, using the following code:
def parse(input, parse_table, rules):
    # input must end with the end marker '$'
    stack = [0]
    df = pd.DataFrame(columns=['stack', 'input', 'action'])
    i, accepted = 0, False
    while i < len(input):
        state = stack[-1]
        char = input[i]
        action = parse_table.loc[parse_table.states == state, char].values[0]
        if action[0] == 's':    # shift (assumes single-digit state numbers)
            stack.append(char)
            stack.append(int(action[-1]))
            i += 1
        elif action[0] == 'r':  # reduce
            r = rules[int(action[-1])]
            l, r = r['l'], r['r']
            char = ''
            # pop one symbol and one state per symbol on the right-hand side
            for j in range(2*len(r)):
                s = stack.pop()
                if type(s) != int:
                    char = s + char
            if char == r:
                goto = parse_table.loc[parse_table.states == stack[-1], l].values[0]
                stack.append(l)
                stack.append(int(goto[-1]))
        elif action == 'acc':   # accept
            accepted = True
        df2 = {'stack': ''.join(map(str, stack)), 'input': input[i:], 'action': action}
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, pd.DataFrame([df2])], ignore_index=True)
        if accepted:
            break
    return df

parse('aaacbbb$', parse_table, rules)
The next animation shows how the input string a^2cb^2 is parsed with LR(0) parser using the above code:
One simple answer is that all LALR(1) grammars are LR(1) grammars, but not the other way around.
Compared to LALR(1), LR(1) has more states in the associated finite-state machine (often several times as many). Because LALR(1) merges states, an LALR(1) parser may perform a few extra reductions before it detects a syntax error that an LR(1) parser would report immediately.
One more important thing to know about these two classes is that an LR(1) parser may have fewer reduce/reduce conflicts, whereas in LALR(1) there is more possibility of reduce/reduce conflicts.

Resources