Formal Reason Why LL Grammar excludes Left-Recursion

I am currently reading the Dragon Book in my free time. The book states that a grammar is LL if and only if, for any pair of productions A -> a | b, the following two conditions hold.
1) FIRST(a) and FIRST(b) are disjoint. This implies that they cannot both derive EMPTY.
2) If 'b' can derive EMPTY, then 'a' cannot derive any string that begins with a terminal in FOLLOW(A).
I know that LL parsers cannot handle left recursion in general, but if I make a grammar
S -> S(S) | EMPTY,
then FIRST(S) = {'('} and FOLLOW(S) = {EOF}. This does not seem to contradict either of the two rules; am I missing something?
Thank you in advance,
Michael

It has been a while, but I think FOLLOW(S) = {EOF, ')', '('}: in S -> S(S), the first S is followed by '(' and the inner S is followed by ')'. Since S(S) can derive a string beginning with '(', which is in FOLLOW(S), condition 2 is violated after all.
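To double-check, here is a rough fixed-point sketch of the FIRST/FOLLOW computation (in Python; the grammar encoding and helper names are my own, not from the book) for the grammar S -> S(S) | EMPTY:

EPS, EOF = 'EPS', '$'          # stand-ins for EMPTY and EOF from the question
grammar = {'S': [['S', '(', 'S', ')'], [EPS]]}

def first_of_seq(seq, first):
    # FIRST of a sequence of symbols, given the FIRST sets computed so far
    out = set()
    for sym in seq:
        syms = first.get(sym, {sym})   # a terminal's FIRST set is itself
        out |= syms - {EPS}
        if EPS not in syms:
            return out
    out.add(EPS)
    return out

def first_follow(grammar, start):
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add(EOF)
    changed = True
    while changed:                     # iterate to a fixed point
        changed = False
        for nt, rules in grammar.items():
            for rule in rules:
                rhs = [s for s in rule if s != EPS]
                f = first_of_seq(rhs, first) if rhs else {EPS}
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym not in grammar:         # only nonterminals get FOLLOW sets
                        continue
                    rest = rhs[i + 1:]
                    tail = first_of_seq(rest, first) if rest else {EPS}
                    add = (tail - {EPS}) | (follow[nt] if EPS in tail else set())
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return first, follow

first, follow = first_follow(grammar, 'S')
print(first['S'])    # {'(', 'EPS'}  -- '(' plus the empty string
print(follow['S'])   # {'$', '(', ')'}

It prints {'$', '(', ')'} for FOLLOW(S), which agrees with the set above ($ standing in for EOF).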

Related

Where to find resources/information about proof control in Prolog

As part of an assignment I've been asked to check if proofs in natural deduction are either correct or incorrect, using Prolog. An example text file called "valid.txt" containing a proof looks like this:
[imp(p, q), p].
q.
[
[1, imp(p,q), premise],
[2, p, premise],
[3, q, impel(2,1)]
].
This would be the input into my program, which should respond "yes" or "true" to a correct proof (which the above is), and "no" or "false" to an incorrect proof.
I have no idea where to begin. So my question is whether there are some resources out there where I could learn about verifying/controlling proofs in Prolog. I have a small amount of experience programming in Prolog, but I feel like I need some specific instruction on how to construct a program that can verify proofs. I've looked for textbooks and websites, but have been unable to find anything that could help me.
In the end my program should probably resemble something like this: Checking whether or not a logical sequence that has assumptions in it is valid
As this is my first time asking a question on here I apologize if I missed something.
In part I totally understand the close votes, comments and such, but on the other hand I can also see your side of the story.
Realize that I do view your question as just barely on the wrong side of what is allowed here, but not so far over the line that other newcomers don't get away with doing the same thing.
Book recommendation questions are not allowed here. Normally a book recommendation would be seen as subjective, but when only one or two books meet the requirements, how can it be subjective?
"Handbook of Practical Logic and Automated Reasoning" by John Harrison (WorldCat) (Amazon) (Webpage)
Note: The book uses the programming language OCaml, but implements a subset of Prolog. Also the book uses a domain specific language and implements a parser.
The book clearly goes way beyond what you need so you won't need to get or read the entire book.
You might only need this, which is found in prop.ml:
let rec eval fm v =
  match fm with
    False -> false
  | True -> true
  | Atom(x) -> v(x)
  | Not(p) -> not(eval p v)
  | And(p,q) -> (eval p v) & (eval q v)
  | Or(p,q) -> (eval p v) or (eval q v)
  | Imp(p,q) -> not(eval p v) or (eval q v)
  | Iff(p,q) -> (eval p v) = (eval q v);;
Example query
tautology <<p \/ q ==> q \/ (p <=> q)>>;;
The book is more of a detailed introduction to the source code behind a more feature-rich system, HOL Light, which itself is a simpler version of HOL.
The links should put you in the ballpark of working applications, and these all fall under the broader topic of automated theorem proving, which is different from proof assistants.
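Not Prolog, but to make the shape of the task concrete, here is a rough Python sketch of the kind of line-by-line check the assignment seems to ask for. It only knows the two rules used in the question's valid.txt (premise and impel, read here as implication elimination, with impel(I, J) assumed to mean "line I holds A and line J holds imp(A, B)"); the representation and names are my own.

# Toy checker for proofs in the question's format, sketched in Python.
# A proof line is (line_no, formula, rule); formulas are tuples such as
# ('imp', 'p', 'q') or plain atoms such as 'p'. Only the two rules used in
# valid.txt are handled: 'premise' and impel (implication elimination).
def check_proof(premises, goal, proof):
    derived = {}                          # line number -> formula proven there
    for line_no, formula, rule in proof:
        if rule == 'premise':
            ok = formula in premises
        elif isinstance(rule, tuple) and rule[0] == 'impel':
            i, j = rule[1], rule[2]       # impel(i, j): line i is A, line j is imp(A, B)
            a, imp = derived.get(i), derived.get(j)
            ok = (imp is not None and imp[0] == 'imp'
                  and imp[1] == a and imp[2] == formula)
        else:
            ok = False                    # unknown rule
        if not ok:
            return False
        derived[line_no] = formula
    return bool(proof) and proof[-1][1] == goal   # the last line must prove the goal

# The proof from valid.txt: premises [imp(p,q), p], goal q.
premises = [('imp', 'p', 'q'), 'p']
proof = [(1, ('imp', 'p', 'q'), 'premise'),
         (2, 'p', 'premise'),
         (3, 'q', ('impel', 2, 1))]
print(check_proof(premises, 'q', proof))  # True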

Representing syntactically different terms in TPTP

I am having a look at first-order logic theorem provers such as Vampire and E-Prover, and the TPTP syntax seems to be the way to go. I am more familiar with Logic Programming syntaxes such as Answer Set Programming and Prolog, and although I have tried referring to a detailed description of the TPTP syntax, I still don't seem to grasp how to properly distinguish between interpreted and non-interpreted functors (and I might be using the terminology wrong).
Essentially, I am trying to prove a theorem by showing that no model acts as a counter-example. My first difficulty was that I did not expect the following logic program to be satisfiable.
fof(all_foo, axiom, ![X] : (pred(X) => (X = foo))).
fof(exists_bar, axiom, pred(bar)).
It is indeed satisfiable because nothing prevents bar from being equal to foo. So a first solution would be to insist that these two terms are distinct and we obtain the following unsatisfiable program.
fof(all_foo, axiom, ![X] : pred(X) => (X = foo)).
fof(exists_bar, axiom, pred(bar)).
fof(foo_not_bar, axiom, foo != bar).
The Technical Report clarifies that different double-quoted strings are indeed different objects, so another solution is to put quotes here and there, so as to obtain the following unsatisfiable program.
fof(all_foo, axiom, ![X] : (pred(X) => (X = "foo"))).
fof(exists_bar, axiom, pred("bar")).
I am happy not to have to manually specify the inequality, as that would obviously not scale to a more realistic scenario. Moving closer to my real situation, I actually have to handle composed terms, and the following program is unfortunately satisfiable.
fof(all_foo, axiom, ![X] : (pred(X) => (X = f("foo")))).
fof(exists_bar, axiom, pred(g("bar"))).
I guess f("foo") is not a term but the function f applied to the object "foo". So it could potentially coincide with function g. Although a manual specification that f and g never coincide does the trick (the following program is unsatisfiable), I feel like I'm doing it wrong. And it probably wouldn't scale to my real setting, with plenty of terms that should all be interpreted as distinct whenever they are syntactically distinct.
fof(all_foo, axiom, ![X] : (pred(X) => (X = f("foo")))).
fof(exists_bar, axiom, pred(g("bar"))).
fof(f_not_g, axiom, ![X, Y] : f(X) != g(Y)).
I have tried throwing single quotes around, but I didn't find the proper way to do it.
How do I make syntactically different (composed) terms and test for syntactical equality?
Subsidiary question: the following program is satisfiable, because the automated theorem prover understands f as a function rather than an uninterpreted functor.
fof(exists_f_g, axiom, (?[I] : ((f(foo) = f(I)) & pred(g(I))))).
fof(not_g_foo, axiom, ~pred(g(foo))).
To make it unsatisfiable, I need to manually specify that f is injective. What would be the natural way to obtain this behaviour without specifying injectivity of all functors that occur in my program?
fof(exists_f_g, axiom, (?[I] : ((f(foo) = f(I)) & pred(g(I))))).
fof(not_g_foo, axiom, ~pred(g(foo))).
fof(f_injective, axiom, ![X,Y] : (f(X) = f(Y) => (X = Y))).
First of all let me point you to the Syntax BNF of TPTP. In principle, you have Prolog terms with some predefined infix/prefix operators of appropriate precedences. This means variables are written in upper case and constants are written in lower case. Also like Prolog, escaping with single quotes allows us to write a constant starting with a capital letter, i.e. 'X'. I have never seen double-quoted atoms so far, so you might want to look up the instructions of the prover on how to interpret them.
But even though the syntax is Prolog-ish, automated theorem proving is a different kind of beast. There is no closed world assumption nor are different constants assumed to be different - that's why you cannot find a proof for:
fof(c1, conjecture, a=b ).
and neither for:
fof(c1, conjecture, ~(a=b) ).
So if you want to have syntactic dis-equality, you need to axiomatize it. Now, assuming a is different from b trivially shows that they are different, so I claimed at least: "Suppose there are two different constants a and b; then there exists some variable which is not b."
fof(a1, axiom, ~(a=b)).
fof(c1, conjecture, ?[X]: ~(X=b)).
Since functions in first-order logic are not necessarily injective, you also don't get around adding that assumption yourself.
Please also note the different roles of input formulas: so far you have only stated axioms and no conjectures, i.e. you ask the prover to show your axiom set to be inconsistent. Some provers might even give up because they use resolution refinements (e.g. set of support) which restrict resolution between axioms.[1] In any case, you need to be aware that the formula you are trying to prove is of the form A1 ∧ ... ∧ An → C1 ∨ ... ∨ Cm where the A are axioms and the C are conjectures.[2]
I hope that at least the syntax is a bit clearer now - unfortunately the answer to your questions is more that automated theorem provers don't make the same assumptions as you expect, so you have to axiomatize them. These axiomatizations are also often inefficient and you might get better performance from specialized tools.
[1] As you already noticed, advanced provers like Vampire or E Prover tell you about (counter-)satisfiability instead.
[2] A resolution based theorem prover will first negate that formula and perform a CNF transformation, but even though most TPTP accepting provers are resolution based, that's not a requirement.

Validity and satisfiability

I am having problems understanding the difference between validity and satisfiability.
Given the following:
(i) For all F, F is satisfiable or ~F is satisfiable.
(ii) For all F, F is valid or ~F is valid.
How do I prove which is true and which is false?
Statement (i) is true, as for all F, F will either be satisfiable, or ~F will be satisfiable (truth table). However, how do I go about solving for statement (ii)?
Any help is highly appreciated!
Aprilrocks92,
I don't blame you for being confused, because actually logicians, mathematicians, heck even those philosopher types sometimes use the word validity differently.
Trying not to overcomplicate, I'll give you a thin definition: a conclusion is valid when it is true whenever the premises are true. We also say, given a suitably defined logic, that the conclusion follows as a "logical consequence" of the premises.
On the other hand, satisfiability means that there exists a valuation of the non-logical symbols in the formula F that makes the formula true in the logic.
So I should probably mention the difference between semantics and syntax to explain. The syntax of your logic is all those logical and non-logical symbols, and the deductive rules that enable you to make "steps" towards a proof in the logic. My definition of satisfiability above mentioned the word "valuation" - now how does that fit? Well, the answer is that you need to supply a semantics: in short, this is the structure that the formulas of the logic are expressions about, usually given in set theory, and a valuation of a given F is a function that maps all the non-logical symbols in F to sets and members of sets, which a given semantics for the logic composes into a truth value.
Hmm. I'm not sure that's the best explanation, but I don't have much time.
Either way, that should help you understand the difference. To answer your question about the difference between (i) and (ii) without giving too much away, think: what's the relationship between the two? Given that, as above, a formula F is true under a valuation that sends it to true, you could "rewrite" my definition of validity as: a conclusion is valid iff whenever the premises are satisfiable the conclusion is satisfiable.
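If it helps to see the two notions side by side in the propositional case, here is a small truth-table sketch (Python, with an ad hoc formula encoding of my own): is_valid checks every valuation, is_satisfiable checks for at least one.

from itertools import product

# Formulas: a string is an atom; ('not', F), ('and', F, G), ('or', F, G) and
# ('imp', F, G) build bigger formulas. Encoding and names are ad hoc.
def atoms(f):
    return {f} if isinstance(f, str) else set().union(*(atoms(g) for g in f[1:]))

def holds(f, v):                           # truth value of f under valuation v
    if isinstance(f, str):
        return v[f]
    op = f[0]
    if op == 'not':
        return not holds(f[1], v)
    if op == 'and':
        return holds(f[1], v) and holds(f[2], v)
    if op == 'or':
        return holds(f[1], v) or holds(f[2], v)
    if op == 'imp':
        return (not holds(f[1], v)) or holds(f[2], v)

def valuations(f):
    names = sorted(atoms(f))
    for bits in product([False, True], repeat=len(names)):
        yield dict(zip(names, bits))

def is_satisfiable(f):
    return any(holds(f, v) for v in valuations(f))   # some valuation makes f true

def is_valid(f):
    return all(holds(f, v) for v in valuations(f))   # every valuation makes f true

print(is_valid(('or', 'p', ('not', 'p'))))   # True: a tautology is valid
print(is_valid('p'), is_satisfiable('p'))    # False True: satisfiable but not valid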
Now, with regard to your requirement to prove these things, I strongly suspect you've got a lot more context about your logic that you're not telling us, and your teacher or textbook has intimated a context in which to answer these, as actually, taken in the general sense, your question doesn't make complete sense.

Parsing context-free languages in a stream of tokens

The problem
Given a context-free grammar with arbitrary rules and a stream of tokens, how can stream fragments that match the grammar be identified effectively?
Example:
Grammar
S -> ASB | AB
A -> a
B -> b
(So essentially, a number of as followed by an equal number of bs)
Stream:
aabaaabbc...
Expected result:
Match starting at position 1: ab
Match starting at position 4: aabb
Of course the key is "effectively": without testing too many hopeless candidates for too long. The only thing I know about my data is that although the grammar is arbitrary, in practice matching sequences will be relatively short (<20 terminals) while the stream itself will be quite long (>10000 terminals).
Ideally I'd also want a syntax tree but that's not too important, because once the fragment is identified, I can run an ordinary parser over it to obtain the tree.
Where should I start? Which type of parser can be adapted to this type of work?
"Arbitrary grammar" makes me suggest you look at wberry's comment.
How complex are these grammars? Is there a manual intervention step?
I'll make an attempt. If I modified your example grammar from:
S -> ASB | AB
A -> a
B -> b
to include:
S' -> S | GS' | S'GS' | S'G
G -> sigma*
So that G = garbage and S' is many S fragments with garbage in between (I may have been careless with my production rules. You get the idea), I think we can solve your problem. You just need a parser that will match other rules before G. You may have to modify these production rules based on the parser. I almost guarantee that there will be rule ordering changes depending on the parser. Since most parser libraries separate lexing from parsing, you'll probably need a catch-all lexeme followed by modifying G to include all possible lexemes. Depending on your specifics, this might not be any better (efficiency-wise) than just starting each attempt at each spot in the stream.
But... Assuming my production rules are fixed (both for correctness and for the particular flavor of parser), this should not only match fragments in the stream, but it should give you a parse tree for the whole stream. You are only interested in subtrees rooted in nodes of type S.
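For comparison, the brute-force baseline mentioned above (start an attempt at every stream position) is easy to sketch for the toy grammar. The helper below is hard-wired to S -> ASB | AB, i.e. a^n b^n with n >= 1, and reports non-overlapping maximal matches with 0-based positions, as in the expected result; it is only an illustration, not a general solution.

# Brute-force baseline for S -> ASB | AB, i.e. a^n b^n with n >= 1:
# at each position, check whether a run of a's is immediately followed by an
# at-least-as-long run of b's; report the fragment and skip past it.
def find_anbn_fragments(stream):
    matches = []
    i = 0
    while i < len(stream):
        ra = 0
        while i + ra < len(stream) and stream[i + ra] == 'a':
            ra += 1                        # length of the run of a's at position i
        rb = 0
        while i + ra + rb < len(stream) and stream[i + ra + rb] == 'b':
            rb += 1                        # length of the run of b's right after it
        if ra >= 1 and rb >= ra:
            matches.append((i, stream[i:i + 2 * ra]))
            i += 2 * ra                    # continue after the reported fragment
        else:
            i += 1
    return matches

print(find_anbn_fragments('aabaaabbc'))    # [(1, 'ab'), (4, 'aabb')]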

What is the difference between LR, SLR, and LALR parsers?

What is the actual difference between LR, SLR, and LALR parsers? I know that SLR and LALR are types of LR parsers, but what is the actual difference as far as their parsing tables are concerned?
And how to show whether a grammar is LR, SLR, or LALR? For an LL grammar we just have to show that any cell of the parsing table should not contain multiple production rules. Any similar rules for LALR, SLR, and LR?
For example, how can we show that the grammar
S --> Aa | bAc | dc | bda
A --> d
is LALR(1) but not SLR(1)?
EDIT (ybungalobill): I didn't get a satisfactory answer for what the difference between LALR and LR is. So LALR's tables are smaller in size, but it can recognize only a subset of LR grammars. Can someone elaborate more on the difference between LALR and LR, please? LALR(1) and LR(1) will be sufficient for an answer. Both of them use 1 token of look-ahead and both are table driven! How are they different?
SLR, LALR and LR parsers can all be implemented using exactly the same table-driven machinery.
Fundamentally, the parsing algorithm collects the next input token T, and consults the current state S (and associated lookahead, GOTO, and reduction tables) to decide what to do:
SHIFT: If the current table says to SHIFT on the token T, the pair (S,T) is pushed onto the parse stack, the state is changed according to what the GOTO table says for the current token (e.g., GOTO(T)), another input token T' is fetched, and the process repeats.
REDUCE: Every state has 0, 1, or many possible reductions that might occur in the state. If the parser is LR or LALR, the token is checked against lookahead sets for all valid reductions for the state. If the token matches a lookahead set for a reduction for grammar rule G = R1 R2 .. Rn, a stack reduction and shift occurs: the semantic action for G is called, the stack is popped n (from Rn) times, the pair (S,G) is pushed onto the stack, the new state S' is set to GOTO(G), and the cycle repeats with the same token T. If the parser is an SLR parser, there is at most one reduction rule for the state and so the reduction action can be done blindly without searching to see which reduction applies. It is useful for an SLR parser to know if there is a reduction or not; this is easy to tell if each state explicitly records the number of reductions associated with it, and that count is needed for the L(AL)R versions in practice anyway.
ERROR: If neither SHIFT nor REDUCE is possible, a syntax error is declared.
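To make the SHIFT/REDUCE/ERROR loop concrete, here is a minimal table-driven sketch (Python). The tables are hand-built for the toy grammar S -> a S b | c and the encoding is my own; unlike the description above, the sketch pushes only states, leaving the symbol half of each (state, symbol) pair implicit.

# Table-driven LR driver for the toy grammar S -> a S b | c (augmented with S' -> S).
# ACTION maps (state, terminal) to ('s', next_state), ('r', production) or 'acc';
# GOTO maps (state, nonterminal) to a state. The tables were built by hand.
PRODS = {1: ('S', 3), 2: ('S', 1)}         # production number -> (lhs, rhs length)
ACTION = {
    (0, 'a'): ('s', 2), (0, 'c'): ('s', 3),
    (1, '$'): 'acc',
    (2, 'a'): ('s', 2), (2, 'c'): ('s', 3),
    (3, 'b'): ('r', 2), (3, '$'): ('r', 2),
    (4, 'b'): ('s', 5),
    (5, 'b'): ('r', 1), (5, '$'): ('r', 1),
}
GOTO = {(0, 'S'): 1, (2, 'S'): 4}

def lr_parse(tokens):
    stack = [0]                            # stack of states
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act == 'acc':
            return True                    # accept
        if act is None:
            return False                   # ERROR: neither SHIFT nor REDUCE applies
        kind, arg = act
        if kind == 's':                    # SHIFT: push the next state, consume the token
            stack.append(arg)
            i += 1
        else:                              # REDUCE: pop |rhs| states, then GOTO on the lhs
            lhs, n = PRODS[arg]
            del stack[len(stack) - n:]
            stack.append(GOTO[(stack[-1], lhs)])

print(lr_parse(list('aacbb') + ['$']))     # True
print(lr_parse(list('aacb') + ['$']))      # False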
So, if they all use the same machinery, what's the point?
The purported value of SLR is its simplicity of implementation; you don't have to scan through the possible reductions checking lookahead sets because there is at most one, and this is the only viable action if there are no SHIFT exits from the state. Which reduction applies can be attached specifically to the state, so the SLR parsing machinery doesn't have to hunt for it. In practice L(AL)R parsers handle a usefully larger set of languages, and it is so little extra work to implement that nobody implements SLR except as an academic exercise.
The difference between LALR and LR has to do with the table generator. LR parser generators keep track of all possible reductions from specific states and their precise lookahead set; you end up with states in which every reduction is associated with its exact lookahead set from its left context. This tends to build rather large sets of states. LALR parser generators are willing to combine states if the GOTO tables and lookahead sets for reductions are compatible and don't conflict; this produces considerably smaller numbers of states, at the price of not being able to distinguish certain symbol sequences that LR can distinguish. So, LR parsers can parse a larger set of languages than LALR parsers, but have very much bigger parser tables. In practice, one can find LALR grammars which are close enough to the target languages that the size of the state machine is worth optimizing; the places where the LR parser would be better are handled by ad hoc checking outside the parser.
So: All three use the same machinery. SLR is "easy" in the sense that you can ignore a tiny bit of the machinery, but it is just not worth the trouble. LR parses a broader set of languages but the state tables tend to be pretty big. That leaves LALR as the practical choice.
Having said all this, it is worth knowing that GLR parsers can parse any context-free language, using more complicated machinery but exactly the same tables (including the smaller version used by LALR). This means that GLR is strictly more powerful than LR, LALR and SLR; pretty much if you can write a standard BNF grammar, GLR will parse according to it. The difference in the machinery is that GLR is willing to try multiple parses when there are conflicts between the GOTO table and/or lookahead sets. (How GLR does this efficiently is sheer genius [not mine], but it won't fit in this SO post.)
That for me is an enormously useful fact. I build program analyzers and code transformers, and parsers are necessary but "uninteresting"; the interesting work is what you do with the parsed result, and so the focus is on doing the post-parsing work. Using GLR means I can relatively easily build working grammars, compared to hacking a grammar to get into LALR-usable form. This matters a lot when trying to deal with non-academic languages such as C++ or Fortran, where you literally need thousands of rules to handle the entire language well, and you don't want to spend your life trying to hack the grammar rules to meet the limitations of LALR (or even LR).
As a sort of famous example, C++ is considered to be extremely hard to parse... by guys doing LALR parsing. C++ is straightforward to parse using GLR machinery, with pretty much the rules provided in the back of the C++ reference manual. (I have precisely such a parser, and it handles not only vanilla C++, but also a variety of vendor dialects as well. This is only possible in practice because we are using a GLR parser, IMHO.)
[EDIT November 2011: We've extended our parser to handle all of C++11. GLR made that a lot easier to do. EDIT Aug 2014: Now handling all of C++17. Nothing broke or got worse, GLR is still the cat's meow.]
LALR parsers merge similar states within an LR parser to produce parser state tables that are exactly the same size as those of the equivalent SLR grammar, which are usually an order of magnitude smaller than pure LR parsing tables. However, for LR grammars that are too complex to be LALR, these merged states result in parser conflicts, or produce a parser that does not fully recognize the original LR grammar.
BTW, I mention a few things about this in my MLR(k) parsing table algorithm here.
Addendum
The short answer is that the LALR parsing tables are smaller, but the parser machinery is the same. A given LALR grammar will produce much larger parsing tables if all of the LR states are generated, with a lot of redundant (near-identical) states.
The LALR tables are smaller because the similar (redundant) states are merged together, effectively throwing away context/lookahead info that the separate states encode. The advantage is that you get much smaller parsing tables for the same grammar.
The drawback is that not all LR grammars can be encoded as LALR tables because more complex grammars have more complicated lookaheads, resulting in two or more states instead of a single merged state.
The main difference is that the algorithm to produce LR tables carries more info around between the transitions from state to state while the LALR algorithm does not. So the LALR algorithm cannot tell if a given merged state should really be left as two or more separate states.
Yet another answer (YAA).
The parsing algorithms for SLR(1), LALR(1) and LR(1) are identical, as Ira Baxter said; however, the parser tables may be different because of the parser-generation algorithm.
An SLR parser generator creates an LR(0) state machine and computes the look-aheads from the grammar (FIRST and FOLLOW sets). This is a simplified approach and may report conflicts that do not really exist in the LR(0) state machine.
An LALR parser generator creates an LR(0) state machine and computes the look-aheads from the LR(0) state machine (via the terminal transitions). This is a correct approach, but occasionally reports conflicts that would not exist in an LR(1) state machine.
A Canonical LR parser generator computes an LR(1) state machine and the look-aheads are already part of the LR(1) state machine. These parser tables can be very large.
A Minimal LR parser generator computes an LR(1) state machine, but merges compatible states during the process, and then computes the look-aheads from the minimal LR(1) state machine. These parser tables are the same size or slightly larger than LALR parser tables, giving the best solution.
LRSTAR 10.0 can generate LALR(1), LR(1), CLR(1) or LR(*) parsers in C++, whatever is needed for your grammar. See this diagram which shows the difference among LR parsers.
[Full disclosure: LRSTAR is my product]
The basic difference between the parser tables generated with SLR vs LR is that reduce actions are based on the Follow set for SLR tables. This set can be overly broad, ultimately causing shift-reduce conflicts.
An LR parser, on the other hand, bases reduce decisions only on the set of terminals which can actually follow the non-terminal being reduced. This set of terminals is often a proper subset of the Follow set of such a non-terminal, and therefore has less chance of conflicting with shift actions.
LR parsers are more powerful for this reason. LR parsing tables can be extremely large, however.
An LALR parser starts with the idea of building an LR parsing table, but combines generated states in a way that results in significantly less table size. The downside is that a small chance of conflicts would be introduced for some grammars that an LR table would otherwise have avoided.
LALR parsers are slightly less powerful than LR parsers, but still more powerful than SLR parsers. YACC and other such parser generators tend to use LALR for this reason.
P.S. For brevity, SLR, LALR and LR above really mean SLR(1), LALR(1), and LR(1), so one token lookahead is implied.
SLR parsers recognize a proper subset of grammars recognizable by LALR(1) parsers, which in turn recognize a proper subset of grammars recognizable by LR(1) parsers.
Each of these is constructed as a state machine, with each state representing some set of the grammar's production rules (and position in each) as it's parsing the input.
The Dragon Book example of an LALR(1) grammar that is not SLR is this:
S → L = R | R
L → * R | id
R → L
Here is one of the states for this grammar:
S → L•= R
R → L•
The • indicates the position of the parser in each of the possible productions. It doesn't know which of the productions it's actually in until it reaches the end and tries to reduce.
Here, the parser could either shift an = or reduce R → L.
An SLR parser would determine whether it could reduce by checking whether the next input symbol is in the follow set of R (i.e., the set of all terminals in the grammar that can follow R anywhere). Since = is also in this set (through L → * R, whatever can follow L can also follow R, and = follows L in S → L = R), the SLR parser encounters a shift-reduce conflict.
However, an LALR(1) parser would use the set of all terminals that can follow this particular production of R, which is only $ (i.e., end of input). Thus, no conflict.
As previous commenters have noted, LALR(1) parsers have the same number of states as SLR parsers. A lookahead propagation algorithm is used to tack lookaheads on to SLR state productions from corresponding LR(1) states. The resulting LALR(1) parser can introduce reduce-reduce conflicts not present in the LR(1) parser, but it cannot introduce shift-reduce conflicts.
In your example, the following LALR(1) state causes a shift-reduce conflict in an SLR implementation:
S → b d•a / $
A → d• / c
The symbol after / is the follow set for each production in the LALR(1) parser. In SLR, follow(A) includes a, which could also be shifted.
Suppose a parser without a lookahead is happily parsing strings for your grammar.
Using your given example, suppose it comes across the string dc: what does it do? Does it reduce it to S, because dc is a valid string produced by this grammar? Or maybe we were trying to parse bdc, because even that is an acceptable string?
As humans we know the answer is simple, we just need to remember if we had just parsed b or not. But computers are stupid :)
An SLR(1) parser has the additional power over LR(0) of performing a lookahead, but we know that any amount of lookahead cannot tell us what to do in this case; instead, we need to look back into our past. Thus the canonical LR parser comes to the rescue. It remembers the past context.
The way it remembers this context is that it disciplines itself: whenever it encounters a b, it starts walking on a path towards reading bdc, as one possibility. So when it sees a d it knows whether it is already walking that path.
Thus a CLR(1) parser can do things an SLR(1) parser cannot!
But now, since we had to define so many paths, the number of states of the machine gets very large!
So we merge similar-looking paths, but as expected this can give rise to problems of confusion (conflicts). However, we are willing to take that risk in exchange for reducing the size.
This is your LALR(1) parser.
Now how to do it algorithmically.
When you draw the configuring sets for the above grammar, you will see a shift-reduce conflict in two states. To remove it you might want to consider SLR(1), which takes decisions by looking at the follow sets, but you would observe that it still won't be able to resolve the conflict. Thus you would draw the configuring sets again, but this time with the restriction that whenever you calculate the closure, the additional productions being added must carry their exact (context-dependent) follows. Refer to any textbook on what these follows should be.
In addition to the answers above, this diagram demonstrates how different parsers relate:
Adding on top of the above answers, the difference between the individual parsers in the class of bottom-up LR parsers is whether they result in shift/reduce or reduce/reduce conflicts when generating the parsing tables. The fewer conflicts a method produces, the more powerful it is (LR(0) < SLR(1) < LALR(1) < CLR(1)).
For example, consider the following expression grammar:
E → E + T
E → T
T → F
T → T * F
F → ( E )
F → id
It's not LR(0) but SLR(1). Using the following code, we can construct the LR0 automaton and build the parsing table (we need to augment the grammar, compute the DFA with closure, compute the action and goto sets):
from copy import deepcopy
import pandas as pd

def update_items(I, C):
    if len(I) == 0:
        return C
    for nt in C:
        Int = I.get(nt, [])
        for r in C.get(nt, []):
            if not r in Int:
                Int.append(r)
        I[nt] = Int
    return I

def compute_action_goto(I, I0, sym, NTs):
    #I0 = deepcopy(I0)
    I1 = {}
    for NT in I:
        C = {}
        for r in I[NT]:
            r = r.copy()
            ix = r.index('.')
            #if ix == len(r)-1: # reduce step
            if ix >= len(r)-1 or r[ix+1] != sym:
                continue
            r[ix:ix+2] = r[ix:ix+2][::-1] # read the next symbol sym
            C = compute_closure(r, I0, NTs)
            cnt = C.get(NT, [])
            if not r in cnt:
                cnt.append(r)
            C[NT] = cnt
            I1 = update_items(I1, C)
    return I1
def construct_LR0_automaton(G, NTs, Ts):
    I0 = get_start_state(G, NTs, Ts)
    I = deepcopy(I0)
    queue = [0]
    states2items = {0: I}
    items2states = {str(to_str(I)): 0}
    parse_table = {}
    cur = 0
    while len(queue) > 0:
        id = queue.pop(0)
        I = states2items[id]
        # compute the goto set for non-terminals
        for NT in NTs:
            I1 = compute_action_goto(I, I0, NT, NTs)
            if len(I1) > 0:
                state = str(to_str(I1))
                if not state in items2states:
                    cur += 1
                    queue.append(cur)
                    states2items[cur] = I1
                    items2states[state] = cur
                    parse_table[id, NT] = cur
                else:
                    parse_table[id, NT] = items2states[state]
        # compute actions for terminals similarly
        # ... ... ...
    return states2items, items2states, parse_table

states, statess, parse_table = construct_LR0_automaton(G, NTs, Ts)
where the grammar G and the non-terminal and terminal symbols are defined as below:
G = {}
NTs = ['E', 'T', 'F']
Ts = {'+', '*', '(', ')', 'id'}
G['E'] = [['E', '+', 'T'], ['T']]
G['T'] = [['T', '*', 'F'], ['F']]
G['F'] = [['(', 'E', ')'], ['id']]
Here are a few more useful functions I implemented along with the above ones for LR(0) parsing table generation:
def augment(G, S): # start symbol S; NTs is the global list of non-terminals
    G[S + '1'] = [[S, '$']]
    NTs.append(S + '1')
    return G, NTs

def compute_closure(r, G, NTs):
    S = {}
    queue = [r]
    seen = []
    while len(queue) > 0:
        r = queue.pop(0)
        seen.append(r)
        ix = r.index('.') + 1
        if ix < len(r) and r[ix] in NTs:
            S[r[ix]] = G[r[ix]]
            for rr in G[r[ix]]:
                if not rr in seen:
                    queue.append(rr)
    return S
The following figure shows the LR0 DFA constructed for the grammar using the above code:
The following table shows the LR(0) parsing table generated as a pandas DataFrame. Notice that there are a couple of shift/reduce conflicts, indicating that the grammar is not LR(0).
The SLR(1) parser avoids the above shift/reduce conflicts by reducing only if the next input token is a member of the Follow set of the nonterminal being reduced. The following parse table is generated by SLR:
The following animation shows how an input expression is parsed by the above SLR(1) parser:
The grammar from the question is not LR(0) either:
#S --> Aa | bAc | dc | bda
#A --> d
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c', 'd'}
G['S'] = [['A', 'a'], ['b', 'A', 'c'], ['d', 'c'], ['b', 'd', 'a']]
G['A'] = [['d']]
as can be seen from the next LR0 DFA and the parsing table:
there is a shift / reduce conflict again:
But the following grammar, which accepts the strings of the form a^ncb^n, n >= 0, is LR(0):
A → a A b
A → c
S → A
# S --> A
# A --> a A b | c
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c'}
G['S'] = [['A']]
G['A'] = [['a', 'A', 'b'], ['c']]
As can be seen from the following figure, there is no conflict in the parsing table generated.
Here is how the input string a^2cb^2 can be parsed using the above LR(0) parse table, using the following code:
def parse(input, parse_table, rules):
    stack = [0]
    df = pd.DataFrame(columns=['stack', 'input', 'action'])
    i, accepted = 0, False
    while i < len(input):
        state = stack[-1]
        char = input[i]
        action = parse_table.loc[parse_table.states == state, char].values[0]
        if action[0] == 's': # shift
            stack.append(char)
            stack.append(int(action[-1]))
            i += 1
        elif action[0] == 'r': # reduce
            r = rules[int(action[-1])]
            l, r = r['l'], r['r']
            char = ''
            for j in range(2*len(r)):
                s = stack.pop()
                if type(s) != int:
                    char = s + char
            if char == r:
                goto = parse_table.loc[parse_table.states == stack[-1], l].values[0]
                stack.append(l)
                stack.append(int(goto[-1]))
        elif action == 'acc': # accept
            accepted = True
        df2 = {'stack': ''.join(map(str, stack)), 'input': input[i:], 'action': action}
        df = df.append(df2, ignore_index=True)
        if accepted:
            break
    return df

parse('aacbb$', parse_table, rules) # parse the input a^2cb^2 with the LR(0) table
The next animation shows how the input string a^2cb^2 is parsed with the LR(0) parser using the above code:
One simple answer is that all LALR(1) grammars are LR(1) grammars.
Compared to LALR(1), LR(1) has more states in the associated finite-state machine (often more than double the states). And that is the main reason LALR(1) grammars require more code to detect syntax errors than LR(1) grammars.
One more important thing to know regarding these two classes is that with LR(1) parsers we may have fewer reduce/reduce conflicts, whereas with LALR(1) the merging of states makes reduce/reduce conflicts more likely.
