Computing the Follow Set - compiler-theory

OK, I've understood how to compute the Follow_k(N) set (N is a nonterminal): for every production rule of the form A -> aBc you add First_k(First_k(c)Follow_k(A)) to Follow_k(B) (a and c are arbitrary strings of terminals and nonterminals, possibly even lambda) ...and you repeat this until there's nothing left to add.
But what happens for production rules like S -> ABCD (A, B, C, D are all nonterminals)?
Should I
add First_k(First_k(BCD)Follow_k(S)) to Follow_k(A) or
add First_k(First_k(CD)Follow_k(S)) to Follow_k(B) or
add First_k(First_k(D)Follow_k(S)) to Follow_k(C) or
add First_k(First_k(lambda)Follow_k(S)) to Follow_k(D) or
do all of the above?
UPDATE:
Let's take the following grammar for example:
S -> ABC
A -> a
B -> b
C -> c
Intuitively, Follow_1(S) = {} because nothing follows after S
Follow_1(A) = {b} because b follows after A,
Follow_1(B) = {c} because c follows after B,
Follow_1(C) = {} because nothing follows after C.
In order to get this result using the algorithm you must consider all cases for S -> ABC.
But my judgement or example may not be right so the question still remains open...
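For concreteness, writing out what each case adds for S -> ABC with k = 1 (seeding Follow_1(S) with {lambda}, as some definitions do; others use $):
add First_1(First_1(BC)Follow_1(S)) = {b} to Follow_1(A)
add First_1(First_1(C)Follow_1(S)) = {c} to Follow_1(B)
add First_1(First_1(lambda)Follow_1(S)) = Follow_1(S) to Follow_1(C)
So doing all of the above reproduces the intuitive sets, with lambda (or $) standing for "nothing follows".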

If you run into trouble on other grammar problems like this, give this online first, follow & predict set finder a shot. It's automatic, and you can compare your answers to its output to get a feel for how to work through these.
Here are the rules for finding follow sets.
1. Put $ (the end-of-input marker) in FOLLOW(S) (S is the start symbol).
2. If there is a production A → aBb (where a and b can be whole strings), then everything in FIRST(b) except ε is placed in FOLLOW(B).
3. If there is a production A → aB, then everything in FOLLOW(A) is in FOLLOW(B).
4. If there is a production A → aBb, where FIRST(b) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
Let's use your example grammar:
S -> ABC
A -> a
B -> b
C -> c
Rule 1 says that FOLLOW(S) contains $.
Rule 2 gives us: FOLLOW(A) contains FIRST(B); also, FOLLOW(B) contains FIRST(C).
Rule 3 says that FOLLOW(C) contains FOLLOW(S).
None of your productions are nullable, so we don't care about rule 4. A symbol is nullable if it derives ε or if it derives a nullable nonterminal symbol.
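If you want to check results like these mechanically, here is a minimal fixed-point sketch in Python (the grammar encoding and helper names are my own, just for illustration):

EPS = ""            # stands for "derives the empty string" inside FIRST sets
END = "$"           # end-of-input marker

grammar = {
    "S": [["A", "B", "C"]],
    "A": [["a"]],
    "B": [["b"]],
    "C": [["c"]],
}
NT = set(grammar)

def first_of(seq, first):
    # FIRST of a symbol sequence; contains EPS iff the whole sequence is nullable
    result = set()
    for sym in seq:
        f = first[sym] if sym in NT else {sym}
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)   # every symbol was nullable (also covers the empty sequence)
    return result

first = {n: set() for n in NT}
follow = {n: set() for n in NT}
follow["S"].add(END)                              # rule 1

changed = True
while changed:                                    # repeat until nothing is added
    changed = False
    for head, bodies in grammar.items():
        for body in bodies:
            for sym in first_of(body, first) - first[head]:
                first[head].add(sym)
                changed = True
            for i, b in enumerate(body):          # rules 2-4, for each nonterminal b
                if b not in NT:
                    continue
                tail = first_of(body[i + 1:], first)
                new = tail - {EPS}
                if EPS in tail:                   # rest of the body is nullable
                    new |= follow[head]
                for sym in new - follow[b]:
                    follow[b].add(sym)
                    changed = True

print(follow)   # e.g. {'S': {'$'}, 'A': {'b'}, 'B': {'c'}, 'C': {'$'}}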
Nullability's transitivity can trip people up. Consider this grammar:
S -> A
A -> B
B -> ε
Since B derives ε, B's nullable. Since A derives B, which derives ε, A's nullable too. S derives A, which derives B, which derives ε, so S is nullable as well.
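The same transitivity falls out of a short fixed point, sketched here in Python (encoding as above, with an empty body standing for an ε production):

grammar = {"S": [["A"]], "A": [["B"]], "B": [[]]}   # [] is the epsilon body
nullable = set()
changed = True
while changed:    # each pass lets nullability climb one level: B, then A, then S
    changed = False
    for head, bodies in grammar.items():
        if head not in nullable and any(all(s in nullable for s in b) for b in bodies):
            nullable.add(head)
            changed = True
print(nullable)   # {'S', 'A', 'B'}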
Granted, you didn't bring that up, but it's a common source of confusion in compiler courses, so I figured I'd lay it out.
Also, if you need some sample grammars to work through, http://faculty.stedwards.edu/laurab/cosc4342/g1answers.txt might be handy.

Related

Best way to ensure something happens more than once in an overall sequence, but only once for each subsequence

I have a scenario where a starting action branches out and triggers multiple actions, such as:
A -> B -> D -> F
       -> E -> H
  -> C -> E -> H
       -> F -> G
B and C both started from A, and "DEEF" (D and E from B, E and F from C) started from B and C, and so on.
Today, I only allow "E" to run once in the overall sequence. However, there is now a requirement to allow "E" to run more than once in the overall sequence, but only for unique originators (so as to avoid any looping). I.e. "E" (or "F" or "G") in the above example can run once in the sequence C -> E -> H and once in A -> B -> E -> H, but never in A -> B -> E -> H -> E. E can also only ever emit H, B can only emit D and E, etc., so that set is immutable.
Hopefully I was able to explain the problem.
My initial thought is to have each action output a nonce value, and then record that an action has already run for a nonce value (originator) so it can't run again for that same originator.
In the above example A would create a nonce value "foobar". B and C would not have run for the nonce "foobar" in the flow yet, so they would run the first time.
B would output nonce "boofar" and C would output nonce "oofbar". The next set of actions would check, if they have run for either of these nonces - "E" in particular would now be able to run for each nonce, instead of running only once for the sequence as in the current single per sequence lookup.
I think this might work, but wondering if I'm missing anything. Would appreciate more interesting thoughts.
EDIT: Saw Thomas's comment below: a nonce alone would not help me solve the loop issue. I might consider adding a nonce vector that keeps accumulating, i.e. foobar.boofar.oofbar, and then check that a module ran only once for each nonce vector?
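A rough sketch of the nonce-vector idea in Python (all names are invented for illustration, and the action name stands in for a generated nonce):

emits = {"A": ["B", "C"], "B": ["D", "E"], "C": ["E", "F"],
         "D": ["F"], "E": ["H"], "F": ["G"]}    # the immutable "who emits whom" set

seen = set()    # (action, nonce_vector) pairs that have already run

def run(action, nonce_vector):
    if (action, nonce_vector) in seen:          # already ran for this originator chain
        return
    if action in nonce_vector.split("."):       # action is its own ancestor: a loop
        return
    seen.add((action, nonce_vector))
    print("running", action, "for", nonce_vector)
    child_vector = nonce_vector + "." + action  # extend the vector with this action
    for nxt in emits.get(action, []):
        run(nxt, child_vector)

run("A", "root")
# E and H run twice (once via B, once via C) because their vectors differ,
# but a repeated E inside the same chain would be skipped as a loop.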

How to easily prove the following in Coq using only assumptions?

Is there an easy way to prove the following in Coq, e.g. using only the assumptions?
(P -> (Q /\ R)) -> (~Q) -> ~P
The question is a bit vague... Do you wonder if it is possible (yes), what the answer is (see Arthur's comment above), or how to think about solving these problems?
In the latter case, remember that the goal is to create a "lambda-term" with the specified type. You can either use "tactics", which help you construct the term from the outside in, or write the term by hand. It is good to do it by hand a couple of times to understand what is going on and what the tactics really do, which I think is why you are given this exercise.
If you look at your example,
(P -> (Q /\ R)) -> (~Q) -> ~P
you can see that it is a function of three (!) arguments. This is because the last type, ~P, really means P -> False, so the types of the arguments to the function that you need to create are
P -> (Q /\ R)
Q -> False
P
and the function should construct a term of type
False
You can create a term fun A B C => _ where A, B, and C have the types above (this is what the tactic intros does), and you need to come up with a term to fill the hole _ by combining the terms A, B, C and the raw Gallina constructs.
In this case, once you have managed to create a term of type Q /\ R, you will have to "destruct" it to get the term of type Q. (Hint: for that you will have to use the match construct.)
Hope this helps without spoiling the fun!

Chomsky Normal form removing epsilon transitions

I'm working on converting a CFG to Chomsky Normal Form but I'm having some difficulty.
I have this CFG
A -> BAB | B | epsilon
B -> 00 | epsilon
OK, I add a new start symbol:
S -> A
A -> BAB | B | epsilon
B -> 00 | epsilon
Then I have to remove epsilon productions, so I start with B:
S -> A
A -> BAB | B | AB | BA | A | epsilon
B -> 00
How do I then remove the epsilon from A? Can the start symbol have an epsilon production? And how do I convert A -> A?
You can't convert this grammar to an equivalent one without ε, and therefore it cannot be written in Chomsky Normal Form. This is because S can derive ε, so ε is a valid sentence in the language.
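For reference, here is a rough Python sketch of the standard ε-elimination step (my own encoding); it produces the ε-free productions for the language minus the empty string, which is as close as you can get for this grammar:

from itertools import product

grammar = {
    "S": [["A"]],
    "A": [["B", "A", "B"], ["B"], []],   # [] stands for an epsilon production
    "B": [["0", "0"], []],
}

# Step 1: find the nullable nonterminals by fixed point.
nullable = set()
changed = True
while changed:
    changed = False
    for head, bodies in grammar.items():
        if head not in nullable and any(all(s in nullable for s in b) for b in bodies):
            nullable.add(head)
            changed = True
print(nullable)   # {'S', 'A', 'B'}: everything is nullable, so eps is in the language

# Step 2: replace each body by every variant with nullable symbols kept or dropped.
def variants(body):
    options = [[sym, None] if sym in nullable else [sym] for sym in body]
    for pick in product(*options):
        kept = tuple(s for s in pick if s is not None)
        if kept:                          # the all-dropped variant is the epsilon
            yield kept

new_grammar = {head: sorted({v for body in bodies for v in variants(body)})
               for head, bodies in grammar.items()}
print(new_grammar)
# S -> A ;  A -> A | B | AB | BA | BB | BAB ;  B -> 00   (unit productions like
# S -> A and A -> A still have to be removed in the next CNF step)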

What is the difference between a trivial FD and two cyclic FDs

In the Complete Book by Ullman and Widom I've read that with two attributes (A and B) we have four cases for FDs. The second and third are A -> B and B -> A, so they are easier. But I don't understand the difference between the trivial dependency «B is a subset of A» and the cyclic FDs A -> B and B -> A. Aren't they the same?
With two attributes you have four cases:
1. A -> B (this means you also have the trivial FDs A -> A, B -> B)
2. B -> A (with trivial FDs as above)
3. A -> B, B -> A (with trivial FDs as above)
4. No non-trivial FDs. This means you only have the trivial FDs A -> A, B -> B, i.e. the two attributes are independent.
A "real-world" example of case 3 could be two attributes: SSN (social security number of a person) and passport_number of a person. Each one is the consequence of the other.
An example of case 4 could be two attributes: SSN (social security number of a person) and book_title. The two attributes are completely independent. One does not imply the other.
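To make the cases concrete, here is a small hypothetical checker in Python that tests whether an FD holds in a set of rows:

def fd_holds(rows, lhs, rhs):
    # True iff every two rows that agree on lhs also agree on rhs
    seen = {}
    for row in rows:
        if row[lhs] in seen and seen[row[lhs]] != row[rhs]:
            return False
        seen[row[lhs]] = row[rhs]
    return True

people = [
    {"ssn": "123", "passport": "P-9", "book_title": "Dracula"},
    {"ssn": "456", "passport": "P-7", "book_title": "Dracula"},
]
print(fd_holds(people, "ssn", "passport"))     # True, and passport -> ssn holds too: case 3
print(fd_holds(people, "book_title", "ssn"))   # False: book_title does not determine ssn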

Parsing expressions with an undefined number of arguments

I'm trying to parse a string in a self-made language into a sort of tree, e.g.:
# a * b1 b2 -> c * d1 d2 -> e # f1 f2 * g
should result in:
# a
    * b1 b2
        -> c
    * d1 d2
        -> e
# f1 f2
    * g
#, * and -> are symbols. a, b1, etc. are texts.
For the moment I know only the RPN method of evaluating expressions, so my current solution is as follows. If I allow only a single text token after each symbol, I can easily convert the expression into RPN notation first (b = b1 b2; d = d1 d2; f = f1 f2) and parse it from there:
a b c -> * d e -> * # f g * #
However, merging text tokens with whatever else comes along seems to be problematic. My idea was to create marker tokens (M), so the RPN looks like:
a M b2 b1 M c -> * M d2 d1 M e -> * # f2 f1 M g * #
which is also parseable and seems to solve the problem.
That said:
Does anyone have experience with something like this and can say whether or not it is a viable solution for the future?
Are there better methods for parsing expressions with undefined arity of operators?
Can you point me at some good resources?
Note. Yes, I know this example very much resembles Lisp prefix notation and maybe the way to go would be to add some brackets, but I don't have any experience here. However, the source text must not contain any artificial brackets and also I'm not sure what to do about potential infix mixins like # a * b -> [if value1 = value2] c -> d.
Thanks for any help.
EDIT: It seems that what I'm looking for are sources on postfix notation with a variable number of arguments.
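For what it's worth, here is one way the marker idea could work, sketched in Python (the token stream is my own variant of the scheme above, with M opening each operator's argument list):

OPS = {"#", "*", "->"}
M = "M"

def parse_marked_rpn(tokens):
    stack = []
    for tok in tokens:
        if tok in OPS:
            args = []
            while stack[-1] != M:          # pop back to the nearest marker
                args.append(stack.pop())
            stack.pop()                    # drop the marker itself
            args.reverse()
            stack.append((tok, args))      # a node is (operator, children)
        else:
            stack.append(tok)              # a text token or a marker
    return stack                           # what remains is the forest of roots

def show(node, depth=0):
    op, args = node
    texts = " ".join(a for a in args if isinstance(a, str))
    print("    " * depth + op + " " + texts)
    for child in args:
        if isinstance(child, tuple):
            show(child, depth + 1)

tokens = "M a M b1 b2 M c -> * M d1 d2 M e -> * # M f1 f2 M g * #".split()
for root in parse_marked_rpn(tokens):
    show(root)    # prints the indented tree from the question

The marker makes each operator's arity explicit, which is exactly what plain postfix lacks.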
I couldn't fully understand your question, but it seems what you want is a grammar definition and a parser generator. I suggest you take a look at ANTLR, it should be pretty straightforward with it to define a grammar for either your original syntax or the RPN.
Edit: (After exercising self-criticism, and making some effort to understand the question details.) Actually, the language grammar is unclear from your example. However, it seems to me that the advantages of the prefix/postfix notations (i.e. that you need neither parentheses nor a precedence-aware parser) stem from the fact that you know the number of arguments every time you encounter an operator, so you know exactly how many elements to read (for prefix notation) or to pop from the stack (for postfix notation). OTOH, I believe that having operators which can take a variable number of arguments makes prefix/postfix notations not simply difficult to parse but outright ambiguous. Take the following expression as an example:
# a * b c d
Which of the following three is the canonical form?
#(a, *(b, c, d))
#(a, *(b, c), d)
#(a, *(b), c, d)
Without knowing more about the operators, it is impossible to tell. Of course you could define some sort of greediness for the operators, e.g. * is greedier than #, so it gobbles up all the arguments. But this would defeat the purpose of a prefix notation, because you simply wouldn't be able to write down the second variant of the above three; not without additional syntactic elements.
Now that I think of it, it is probably not by sheer chance that none of the programming languages I know support operators with a variable number of arguments, only functions/procedures.
