I'm working on a simple LL(1) parser generator, and I've run into an issue with PREDICT/PREDICT conflicts given certain input grammars. For example, given an input grammar like:
E → E + E
| P
P → 1
I can factor out the left recursion from E, replacing it with a roughly equivalent right-recursive rule, thus arriving at the grammar:
E → P E'
E' → + E E'
| ε
P → 1
Next, I can compute the relevant FIRST and FOLLOW sets for the grammar, and end up with the following:
FIRST(E) = { 1 }
FIRST(E') = { +, ε }
FIRST(P) = { 1 }
FOLLOW(E) = { +, EOF }
FOLLOW(E') = { +, EOF }
FOLLOW(P) = { +, EOF }
And finally, using PREDICT(A → α) = { FIRST(α) - ε } ∪ (FOLLOW(A) if ε ∈ FIRST(α) else ∅) to construct the PREDICT sets for the grammar, the resulting sets are as follows.
PREDICT(1. E → P E') = { 1 }
PREDICT(2. E' → + E E') = { + }
PREDICT(3. E' → ε) = { +, EOF }
PREDICT(4. P → 1) = { 1 }
So this is where I run into the conflict: PREDICT(2) and PREDICT(3) both contain +, and thus I cannot produce a parse table, as the grammar is not LL(1). On seeing +, the parser wouldn't be able to choose which rule should be applied.
What I'm really wondering is whether it's possible to resolve the conflict or factor the grammar such that the conflict can be avoided, and produce a legal LL(1) grammar, without having to directly modify the original input grammar.
The problem here is that your original grammar is ambiguous.
E → E + E
E → P
means that P + P + P can be parsed either as (P + P) + P or P + (P + P). Eliminating left recursion doesn't fix the ambiguity, so the modified grammar is also ambiguous. And ambiguous grammars can't be LL(k) (or, for that matter, LR(k)).
So you need to make the grammar unambiguous:
E → E + P
E → P
(That's the common left-associative version.) Once you eliminate left recursion, you end up with:
E → P E'
E' → + P E'
| ε
Now + is not in FOLLOW(E').
(The example is drawn straight from the Dragon book, but simplified; it's example 4.8 in the rather battered old copy I have.)
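To check the sets by machine, here is a minimal sketch of the usual fixed-point computation of FIRST, FOLLOW and PREDICT. The function names and the dictionary encoding of the grammar (an empty tuple standing for ε) are my own assumptions for illustration, not part of any particular parser generator.

```python
EPS, EOF = "ε", "EOF"

def first_of(seq, first, nonterminals):
    """FIRST of a symbol sequence, given the FIRST sets computed so far."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if sym in nonterminals else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol was nullable, or seq is empty
    return out

def predict_sets(grammar, start):
    """grammar maps each nonterminal to a list of right-hand sides (tuples)."""
    nts = set(grammar)
    first = {a: set() for a in nts}
    follow = {a: set() for a in nts}
    follow[start].add(EOF)
    changed = True
    while changed:  # iterate until no set grows any more
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:
                new_first = first_of(rhs, first, nts)
                if not new_first <= first[a]:
                    first[a] |= new_first
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym not in nts:
                        continue
                    tail = first_of(rhs[i + 1:], first, nts)
                    new_follow = tail - {EPS}
                    if EPS in tail:
                        new_follow |= follow[a]
                    if not new_follow <= follow[sym]:
                        follow[sym] |= new_follow
                        changed = True
    predict = {}
    for a, rhss in grammar.items():
        for rhs in rhss:
            f = first_of(rhs, first, nts)
            predict[(a, rhs)] = (f - {EPS}) | (follow[a] if EPS in f else set())
    return first, follow, predict

# The grammar from the question (ambiguous) and the left-associative fix.
ambiguous = {"E": [("P", "E'")], "E'": [("+", "E", "E'"), ()], "P": [("1",)]}
fixed = {"E": [("P", "E'")], "E'": [("+", "P", "E'"), ()], "P": [("1",)]}

for name, g in (("ambiguous", ambiguous), ("fixed", fixed)):
    _, follow, predict = predict_sets(g, "E")
    overlap = predict[("E'", g["E'"][0])] & predict[("E'", ())]
    print(name, "FOLLOW(E') =", sorted(follow["E'"]), "overlap =", sorted(overlap))
```

For the question's grammar the two E' alternatives overlap on +, while for the corrected grammar the overlap is empty, which is what makes the table constructible.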
It's worth noting that the transformation used here preserves the set of strings derived by the grammar, but not the derivation. The parse tree which results from the modified grammar is effectively right-associative, so it will need to be reprocessed to recover the desired parse. This fact is rather briefly mentioned by the Dragon book authors:
Although left-recursion elimination and left factoring are easy to do, they make the resulting grammar hard to read and difficult to use for translation purposes. (My emphasis)
They go on to suggest that operator precedence parsing can be used for expressions, and then mention that if an LR parser generator is available, dividing the grammar into a predictive part and an operator-precedence part is no longer necessary.
I'm currently trying to familiarize myself with packrat parsing. So I've read the PDF paper from 2002 linked here and in section 2.3 it describes packrat caching as a preliminary process (which occurs before the actual parsing) in which a full caching table is pre-constructed by reading the input from right to left. Only then, the actual linear parsing from left to right can start.
But in every PEG parser implementation I found, the "cache" option is usually a caching process that occurs during the actual left to right parsing. For example here.
Is there any difference between both approaches?
Thank you.
I recently worked on similar research, ran into the exact same confusion, and resolved it. Regardless of whether you are still working on this topic, here's my answer.
Your understanding is correct:
A packrat parser scans the input string from left to right
A packrat parser constructs the cache from right to left
But there's just one approach, not two. Let's use one simple example Parsing Expression Grammar (PEG) without left-recursion: E -> Num + E | Num
(Note that a left-recursive example would require another long explanation; you can refer to CPython's implementation for details.)
The Syntax Directed Translation (SDT) will be something like:
E -> a=Num + b=E { a + b }
E -> Num { Num }
And we can write a parse_E function as below:
def parse_E(idx):
    # tokens and cache are assumed to be module-level: the token list
    # and a dict of per-rule memo tables.
    if idx in cache['parse_E']:
        # This position was already parsed: reuse the memoized result.
        return cache['parse_E'][idx]
    lval, nidx = parse_Char(idx)
    if nidx < len(tokens):
        operator, nnidx = parse_Char(nidx)
        if operator == '+':
            # E -> Num + E
            rval, nnnidx = parse_E(nnidx)
            cache['parse_E'][idx] = lval + rval, nnnidx
            return cache['parse_E'][idx]
    # E -> Num
    cache['parse_E'][idx] = lval, nidx
    return cache['parse_E'][idx]
According to Bryan Ford's paper, the parser needs to scan the input string from left to right and construct the cache at every position:
for idx in range(len(input_string)):
    parse_E(idx)
    parse_Char(idx)
So, let's check the cache construction under the hood, initially, we have an empty cache and input string:
cache: {'parse_E': {}, 'parse_Char': {}}
input string: `2 + 3 + 4`
The function calls happen in the following order when idx=0. Clearly, we construct the cache from right to left at position 0 (not to mention idx=1 or above).
parse_Char(Y) happens earlier than parse_Char(X) (Y > X)
parse_Char(X) must happen earlier than parse_E(X)
parse_E(0) --- (E -> Num + E) (pending)
    -> parse_Char(0) --- 2 (pending)
    -> parse_Char(1) --- + (pending)
    -> parse_E(2) --- E (E -> Num + E) (pending)
        -> parse_Char(2) --- 3 (pending)
        -> parse_Char(3) --- + (pending)
        -> parse_E(4) --- E (E -> Num) (pending)
            -> parse_Char(4) --- 4 (acc)
# Only after parse_Char(4) succeeds and is stored in the cache can parse_E(4) succeed, and so on.
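To make the fill order observable, here is a small self-contained sketch of the same idea. The token list, the fill_order log and this version of parse_Char are assumptions added for illustration (tokens are single strings, and only parse_E is memoized); it is not the code from the repository mentioned in this answer.

```python
tokens = ["2", "+", "3", "+", "4"]
cache = {"parse_E": {}}
fill_order = []  # positions, in the order their parse_E result is stored

def parse_Char(idx):
    """Hypothetical helper: return the token at idx and the next index."""
    return tokens[idx], idx + 1

def parse_E(idx):
    if idx in cache["parse_E"]:
        return cache["parse_E"][idx]
    tok, nidx = parse_Char(idx)
    lval = int(tok)
    result = lval, nidx  # E -> Num
    if nidx < len(tokens) and tokens[nidx] == "+":
        rval, end = parse_E(nidx + 1)  # E -> Num + E
        result = lval + rval, end
    # The entry for idx is written only after the recursive call returned,
    # so entries for larger positions are stored first.
    cache["parse_E"][idx] = result
    fill_order.append(idx)
    return result

value, end = parse_E(0)
print(value, end, fill_order)
```

The call is made once, at position 0, and scanning proceeds left to right, yet fill_order ends up listing the positions from right to left.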
If you want to read the full Python example of Packrat parser implementation, you can check my repository. It contains a handmade Packrat parser and a CPython PEG generated Packrat parser based on a simple meta grammar.
I am given 2 DFAs. * denotes final states and -> denotes the initial state, defined over the alphabet {a, b}.
1) ->A with a goes to A. -> A with b goes to *B. *B with a goes to *B. *B with b goes to ->A.
The regular expression for this is clearly:
E = a* b(a* + (a* ba* ba*)*)
And the language that it accepts is L1 = {w over {a,b} | w is a b preceded by any number of a's and followed by any number of a's, or w is a b preceded by any number of a's and followed by any number of pairs of b's with any number of a's before, between, or after the two b's of each pair}.
2) ->* A with b goes to ->* A. ->*A with a goes to *B. B with b goes to -> A. *B with a goes to C. C with a goes to C. C with b goes to C.
Note: A is both final and initial state. B is final state.
Now the regular expression that I get for this is:
E = b* ((ab) * + a(b b* a)*)
Finally the language that this DFA accepts is:
L2 = {w over {a, b} | w is n b's followed by either k ab's or an a followed by m blocks of the form b b^r a, where n, k, m, r >= 0}
Now the question is, is there a cleaner way to represent the languages L1 and L2 because it does seem ugly. Thanks in advance.
E = a* b(a* + (a* ba* ba*)*)
= a*ba* + a*b(a* ba* ba*)*
= a*ba* + a*b(a*ba*ba*)*a*
= a*b(a*ba*ba*)*a*
= a*b(a*ba*b)*a*
This is the language of all strings of a and b containing an odd number of bs. This might be most compactly denoted symbolically as {w in {a,b}* | #b(w) = 1 (mod 2)}.
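As a sanity check, one can compare the first DFA, the simplified expression and the "odd number of b's" description on all short strings. This is only a brute-force sketch: the transition table is my transcription of the DFA in the question, the helper names are made up, and the expression is written in Python's re syntax.

```python
from itertools import product
import re

# First DFA: start state A, accepting state B.
DELTA1 = {("A", "a"): "A", ("A", "b"): "B", ("B", "a"): "B", ("B", "b"): "A"}

def accepts1(word):
    state = "A"
    for ch in word:
        state = DELTA1[(state, ch)]
    return state == "B"

SIMPLIFIED = re.compile(r"a*b(a*ba*b)*a*")

def all_words(max_len):
    """Yield every string over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)

# Strings on which the three descriptions disagree (expected to be none).
mismatches = [
    w for w in all_words(8)
    if not (accepts1(w) == (w.count("b") % 2 == 1) == bool(SIMPLIFIED.fullmatch(w)))
]
print(mismatches)
```

An exhaustive check up to a fixed length is not a proof, but it catches most algebra slips in a derivation like the one above.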
For the second one: the only way to get to state B is to see an a in A, and the only way to get to C from outside C is to see an a in B. C is a dead state and the only way to get to it is to see aa starting in A. That is: if you ever see two as in a row, the string is not in the language; the language is the set of all strings over a and b not containing the substring aa. This might be most compactly denoted symbolically as {(a+b)*aa(a+b)*}^c where ^c means "complement".
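The same kind of brute-force check works for the second DFA. Again, the table below is my reading of the question (A and B accepting, C the dead state) and the names are hypothetical.

```python
from itertools import product

# Second DFA: start state A; A and B accept; C is the dead state.
DELTA2 = {
    ("A", "a"): "B", ("A", "b"): "A",
    ("B", "a"): "C", ("B", "b"): "A",
    ("C", "a"): "C", ("C", "b"): "C",
}
ACCEPTING2 = {"A", "B"}

def accepts2(word):
    state = "A"
    for ch in word:
        state = DELTA2[(state, ch)]
    return state in ACCEPTING2

def words_up_to(max_len):
    """Yield every string over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)

# Strings where the DFA and "does not contain aa" disagree (expected: none).
counterexamples = [w for w in words_up_to(8) if accepts2(w) != ("aa" not in w)]
print(counterexamples)
```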
[] = always
O = next
! = negation
<> = eventually
I'm wondering whether []<> is equivalent to just [].
Also having a hard time understanding how to distribute temporal logic.
[][] (a OR !b)
!<>(!a AND b)
[]([] a ==> <> b)
I'll use the following notations:
F = eventually
G = always
X = next
U = until
In my model-checking course, we defined LTL the following way:
LTL: p | φ ∧ ψ | ¬φ | Xφ | φ U ψ
With F being syntactic sugar for:
F (future)
Fφ = True U φ
and G:
G (global)
Gφ = ¬F¬φ
With that, your question is :
Is it true that : Gφ ?= GFφ
GFφ <=> G (True U φ)
Knowing that :
P ⊧ φ U ψ <=> exists i >= 0: P_(>= i) ⊧ ψ AND forall 0 <= j < i : P_(>= j) ⊧ φ
From that, we can see that GFφ requires that, at every point in time, φ is verified at that time or at some later time i; before that (j before i) only True must be verified, which is trivial.
But Gφ requires that φ is true at every point in time, "from now to forever", and not just at some time i again and again.
G p indicates that at all times p holds. GF p indicates that at all times, eventually p will hold. So while the infinite trace pppppp... satisfies both of the specifications, an infinite trace of the form p(!p)(!p)p(!p)p... satisfies only GF p but not G p.
To be clear, both of these example traces need to contain infinitely many locations where p holds. But in the case of GF p, and only in this case, it is acceptable for there to be locations in between where p does not hold.
So the short answer to the above question by counterexample is: no, those two specifications aren't the same.
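The counterexample can be checked mechanically on ultimately periodic traces, i.e. a finite prefix followed by a non-empty loop that repeats forever. The representation (lists of booleans giving the value of p at each position) and the function names are assumptions made for this sketch.

```python
def always(prefix, loop):
    """G p on the infinite trace prefix + loop + loop + ..."""
    return all(prefix) and all(loop)

def infinitely_often(prefix, loop):
    """GF p: p must keep recurring, so it must hold somewhere in the loop."""
    return any(loop)

all_p = ([], [True])               # p p p p ...
alternating = ([], [True, False])  # p !p p !p ...
finally_never = ([True], [False])  # p !p !p !p ...

for name, (prefix, loop) in [
    ("all_p", all_p),
    ("alternating", alternating),
    ("finally_never", finally_never),
]:
    print(name, "G p:", always(prefix, loop), "GF p:", infinitely_often(prefix, loop))
```

The alternating trace satisfies GF p but not G p, so the two formulas are not equivalent. Every trace that satisfies G p also satisfies GF p, so the implication only holds in one direction.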
Click here for the answer. Turing Machine
The question is to construct a Turing Machine which accepts the regular expression,
L = {a^n b^n | n>= 1}.
I am not sure if my answer is correct or wrong. Thank you in advance for your reply.
You cannot "accept the regular expression", only the language it describes. And what you provide is not a regular expression, but a set description. In fact, the language is not regular and therefore cannot be described by standard regular expressions.
The machine from your answer accepts the language described by the regular expression a^+ b^+.
A TM could mark the first a (e.g. by converting it to A), then delete the first b, and repeat this loop once for each a. If you end up with a string consisting only of A's, then accept.
As stated before, the language L = {a^nb^n; n >= 1} cannot be described by regular expressions; it is not a regular language. This language in particular is an example of a context-free language, and thus it can be described by a context-free grammar and recognized by a pushdown automaton (an automaton with LIFO memory, a stack).
Grammar for this language would look something like this:
G = (V, S, R, P)
Where:
V is finite set of non-terminal characters, V = { S }
S is finite set of terminal characters, S = { a, b }
R is relation that describes "rewrites" from non-terminal characters to non-terminals and terminals, in this case R = { S -> aSb, S -> ab }
P is starting non-terminal character, P = S
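To see the two rewrite rules at work, here is a minimal recursive recognizer that undoes one derivation step at a time. The function name is made up; it is a sketch that follows the rules S -> aSb and S -> ab literally, not a general CFG parser.

```python
def derives(word):
    """True if S =>* word using only S -> aSb and S -> ab."""
    if word == "ab":
        return True  # S -> ab
    if len(word) >= 4 and word[0] == "a" and word[-1] == "b":
        return derives(word[1:-1])  # undo one application of S -> aSb
    return False

for w in ["ab", "aaabbb", "", "aab", "abab"]:
    print(repr(w), derives(w))
```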
A pushdown automaton recognizing this language would be more complex, as it is a 7-tuple M = (Q, S, G, D, q0, Z, F)
Q is set of states
S is input alphabet
G is stack alphabet
D is the transition relation
q0 is start state
Z is initial stack symbol
F is set of accepting states
For our case, it would be:
Q = { q0, q1, qF }
S = { a, b }
G = { z0, X }
D will take the form of a relation (current state, input character, top of stack) -> (output state, top of stack), meaning you can move to a different state and rewrite the top of the stack (erase it, rewrite it or let it be):
(q0, a, z0) -> (q0, Xz0) - reading the first a
(q0, a, X) -> (q0, XX) - reading consecutive a's
(q0, b, X) -> (q1, e) - reading first b
(q1, b, X) -> (q1, e) - reading consecutive b's
(q1, e, z0) -> (qF, e) - after the last b, with only z0 left on the stack
where e is empty word (sometimes called epsilon)
q0 = q0
Z = z0
F = { qF }
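The transition relation above can be run directly. The sketch below is a simplified simulator written for this particular automaton: it assumes the only epsilon move is the final one, so it tries it once after the input has been consumed. The dictionary encoding and the name pda_accepts are my own.

```python
# (state, input symbol or "" for an epsilon move, stack top)
#   -> (next state, symbols that replace the stack top)
DELTA = {
    ("q0", "a", "z0"): ("q0", ["X", "z0"]),
    ("q0", "a", "X"): ("q0", ["X", "X"]),
    ("q0", "b", "X"): ("q1", []),
    ("q1", "b", "X"): ("q1", []),
    ("q1", "", "z0"): ("qF", []),
}

def pda_accepts(word):
    state, stack = "q0", ["z0"]  # stack[0] is the top of the stack
    for ch in word:
        if not stack or (state, ch, stack[0]) not in DELTA:
            return False  # no applicable transition: reject
        state, pushed = DELTA[(state, ch, stack[0])]
        stack = pushed + stack[1:]
    # Input consumed: take the epsilon move if it applies.
    if stack and (state, "", stack[0]) in DELTA:
        state, pushed = DELTA[(state, "", stack[0])]
        stack = pushed + stack[1:]
    return state == "qF"

for w in ["ab", "aabb", "", "aab", "abb", "abab"]:
    print(repr(w), pda_accepts(w))
```

A general PDA is nondeterministic and would need a search over configurations; this shortcut is only valid because qF has no outgoing transitions, so taking the epsilon move early could never lead to acceptance.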
The language L = {a^n b^n | n≥1} uses only 2 characters, i.e., a and b. Each string begins with some number of a's followed by an equal number of b's. Any string of this form belongs to the language. The beginning and end of the string are marked by a $ sign.
Step-1:
Replace a by X and move right, Go to state Q1.
Step-2:
Replace a by a and move right, Remain on same state
Replace Y by Y and move right, Remain on same state
Replace b by Y and move left, go to state Q2.
Step-3:
Replace b by b and move left, Remain on same state
Replace a by a and move left, Remain on same state
Replace Y by Y and move left, Remain on same state
Replace X by X and move right, go to state Q0.
Step-4:
If symbol is Y replace it by Y and move right and Go to state Q3
Else go to step 1
Step-5:
Replace Y by Y and move right, Remain on same state
If symbol is $ replace it by $ and move left, STRING IS ACCEPTED, GO TO FINAL STATE Q4
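The steps can be written as a transition table and simulated. The sketch below rests on two assumptions that I want to state explicitly: the head moves left after a b is replaced by Y (otherwise it could land on $ in state Q2, for which no rule is given), and the state that checks the remaining Y's is a separate state Q3, with Q4 being the final state. The table layout and the name tm_accepts are made up for illustration.

```python
# (state, symbol read) -> (symbol to write, head move, next state)
RULES = {
    ("Q0", "a"): ("X", +1, "Q1"),  # mark the next unmatched a
    ("Q0", "Y"): ("Y", +1, "Q3"),  # no a left: check that only Y's remain
    ("Q1", "a"): ("a", +1, "Q1"),
    ("Q1", "Y"): ("Y", +1, "Q1"),
    ("Q1", "b"): ("Y", -1, "Q2"),  # mark the matching b (assumed: move left)
    ("Q2", "b"): ("b", -1, "Q2"),
    ("Q2", "a"): ("a", -1, "Q2"),
    ("Q2", "Y"): ("Y", -1, "Q2"),
    ("Q2", "X"): ("X", +1, "Q0"),  # back at the last marked a
    ("Q3", "Y"): ("Y", +1, "Q3"),
    ("Q3", "$"): ("$", -1, "Q4"),  # reached the end marker: accept
}

def tm_accepts(word):
    tape = list("$" + word + "$")  # $ marks both ends of the input
    head, state = 1, "Q0"
    while state != "Q4":
        rule = RULES.get((state, tape[head]))
        if rule is None:
            return False  # no applicable rule: the machine halts and rejects
        tape[head], move, state = rule
        head += move
    return True

for w in ["ab", "aabb", "", "aab", "abb", "abab", "ba"]:
    print(repr(w), tm_accepts(w))
```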
Given the following grammar, I have to find the appropriate semantic actions to calculate, for each string of the language, the number of pairs of parentheses in the string.
S -> (L)
S -> a
L -> L, S
L -> S
Usually, to perform this type of exercise, I build a derivation tree of a sample string and then I add the attributes. After that it is easier to find the semantic rules.
So I built this derivation tree for the string "((a, (a), a))", but I can't proceed with the resolution of the exercise. How do I count the pairs of parentheses? I'm not able to do that...
I don't want the solution but I'd like someone to help me with the reasoning to be made in these cases.
(I'm sorry for the bad tree...)
The OP wrote:
These might be the correct semantic actions for this grammar?
S -> (L) {S.p = counter + 1}
S -> a {do nothing}
L -> L, S {L.p = S.p}
L -> S {L.p = S.p}
.p is a synthesized attribute.
S -> (S1) { S.count = S1.count + 1 }
S -> S1 S2 { S.count = S1.count + S2.count }
S -> ϵ { S.count = 0 }
This should make things clear
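The same bottom-up reasoning carries over to the grammar in the question. Below is a minimal sketch, assuming a synthesized attribute that is 0 for a, one more than the list's value for ( L ), and the sum of both sides for L , S. It evaluates the attribute with a small hand-written recursive-descent parser in which the left-recursive rule L -> L , S is handled by a loop. The function names are hypothetical.

```python
def count_pairs(text):
    """Count parenthesis pairs for S -> ( L ) | a and L -> L , S | S."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def parse_S():
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "a":
            pos += 1
            return 0  # S -> a contributes no parentheses
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1
            inner = parse_L()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return inner + 1  # S -> ( L ): one more pair than the list inside
        raise SyntaxError("expected 'a' or '('")

    def parse_L():
        nonlocal pos
        total = parse_S()  # L -> S
        while pos < len(tokens) and tokens[pos] == ",":
            pos += 1
            total += parse_S()  # L -> L , S: add the counts of both sides
        return total

    result = parse_S()
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return result

print(count_pairs("((a, (a), a))"))
```

Each return value plays the role of the synthesized attribute of the node that was just parsed, so the count flows from the leaves up to the root exactly as it would in an annotated derivation tree.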