What is the difference between LR, SLR, and LALR parsers?

What is the actual difference between LR, SLR, and LALR parsers? I know that SLR and LALR are types of LR parsers, but what is the actual difference as far as their parsing tables are concerned?
And how do we show whether a grammar is LR, SLR, or LALR? For an LL grammar we just have to show that no cell of the parsing table contains multiple production rules. Are there any similar rules for LALR, SLR, and LR?
For example, how can we show that the grammar
S --> Aa | bAc | dc | bda
A --> d
is LALR(1) but not SLR(1)?
EDIT (ybungalobill): I didn't get a satisfactory answer for what's the difference between LALR and LR. So LALR's tables are smaller in size, but it can recognize only a subset of LR grammars. Can someone elaborate more on the difference between LALR and LR please? LALR(1) and LR(1) will be sufficient for an answer. Both of them use 1 token of look-ahead and both are table driven! How are they different?

SLR, LALR and LR parsers can all be implemented using exactly the same table-driven machinery.
Fundamentally, the parsing algorithm collects the next input token T, and consults the current state S (and associated lookahead, GOTO, and reduction tables) to decide what to do:
SHIFT: If the current table says to SHIFT on the token T, the pair (S,T) is pushed onto the parse stack, the state is changed according to what the GOTO table says for the current token (e.g., GOTO(T)), another input token T' is fetched, and the process repeats.
REDUCE: Every state has 0, 1, or many possible reductions that might occur in the state. If the parser is LR or LALR, the token is checked against lookahead sets for all valid reductions for the state. If the token matches a lookahead set for a reduction for grammar rule G = R1 R2 .. Rn, a stack reduction and shift occurs: the semantic action for G is called, the stack is popped n (from Rn) times, the pair (S,G) is pushed onto the stack, the new state S' is set to GOTO(G), and the cycle repeats with the same token T. If the parser is an SLR parser, there is at most one reduction rule for the state and so the reduction action can be done blindly without searching to see which reduction applies. It is useful for an SLR parser to know if there is a reduction or not; this is easy to tell if each state explicitly records the number of reductions associated with it, and that count is needed for the L(AL)R versions in practice anyway.
ERROR: If neither SHIFT nor REDUCE is possible, a syntax error is declared.
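To make the shared machinery concrete, here is a minimal sketch of such a table-driven driver loop in Python. The ACTION/GOTO encoding, helper names, and rule format are illustrative assumptions, not any particular generator's output; only the contents of the two tables would change between SLR, LALR, and canonical LR.
# Hypothetical sketch of the driver loop shared by SLR/LALR/LR parsers.
# ACTION[(state, token)] holds ('shift', next_state), ('reduce', rule_index)
# or ('accept', None); GOTO[(state, nonterminal)] gives the state entered
# after a reduction. Only these tables differ between the three methods.
def lr_parse(tokens, ACTION, GOTO, rules):
    tokens = list(tokens) + ['$']      # end-of-input marker
    stack = [0]                        # stack of states (symbols omitted for brevity)
    i = 0
    while True:
        state, tok = stack[-1], tokens[i]
        act = ACTION.get((state, tok))
        if act is None:                # ERROR: neither shift nor reduce applies
            raise SyntaxError(f"unexpected {tok!r} in state {state}")
        kind, arg = act
        if kind == 'shift':            # push the successor state, advance the input
            stack.append(arg)
            i += 1
        elif kind == 'reduce':         # rule arg is (lhs, rhs): pop |rhs| states, then GOTO on lhs
            lhs, rhs = rules[arg]
            del stack[len(stack) - len(rhs):]
            stack.append(GOTO[(stack[-1], lhs)])
        elif kind == 'accept':
            return True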
So, if they all use the same machinery, what's the point?
The purported value of SLR is its simplicity of implementation; you don't have to scan through the possible reductions checking lookahead sets because there is at most one, and this is the only viable action if there are no SHIFT exits from the state. Which reduction applies can be attached specifically to the state, so the SLR parsing machinery doesn't have to hunt for it. In practice L(AL)R parsers handle a usefully larger set of languages, and are so little extra work to implement that nobody implements SLR except as an academic exercise.
The difference between LALR and LR has to do with the table generator. LR parser generators keep track of all possible reductions from specific states and their precise lookahead set; you end up with states in which every reduction is associated with its exact lookahead set from its left context. This tends to build rather large sets of states. LALR parser generators are willing to combine states if the GOTO tables and lookahead sets for reductions are compatible and don't conflict; this produces considerably smaller numbers of states, at the price of not being able to distinguish certain symbol sequences that LR can distinguish. So, LR parsers can parse a larger set of languages than LALR parsers, but have very much bigger parser tables. In practice, one can find LALR grammars which are close enough to the target languages that the size of the state machine is worth optimizing; the places where the LR parser would be better are handled by ad hoc checking outside the parser.
So: all three use the same machinery. SLR is "easy" in the sense that you can ignore a tiny bit of the machinery, but it is just not worth the trouble. LR parses a broader set of languages but the state tables tend to be pretty big. That leaves LALR as the practical choice.
Having said all this, it is worth knowing that GLR parsers can parse any context free language, using more complicated machinery but exactly the same tables (including the smaller version used by LALR). This means that GLR is strictly more powerful than LR, LALR and SLR; pretty much if you can write a standard BNF grammar, GLR will parse according to it. The difference in the machinery is that GLR is willing to try multiple parses when there are conflicts between the GOTO table and/or lookahead sets. (How GLR does this efficiently is sheer genius [not mine] but won't fit in this SO post).
That for me is an enormously useful fact. I build program analyzers and code transformers and parsers are necessary but "uninteresting"; the interesting work is what you do with the parsed result and so the focus is on doing the post-parsing work. Using GLR means I can relatively easily build working grammars, compared to hacking a grammar to get it into LALR-usable form. This matters a lot when trying to deal with non-academic languages such as C++ or Fortran, where you literally need thousands of rules to handle the entire language well, and you don't want to spend your life trying to hack the grammar rules to meet the limitations of LALR (or even LR).
As a sort of famous example, C++ is considered to be extremely hard to parse... by guys doing LALR parsing. C++ is straightforward to parse using GLR machinery using pretty much the rules provided in the back of the C++ reference manual. (I have precisely such a parser, and it handles not only vanilla C++, but also a variety of vendor dialects as well. This is only possible in practice because we are using a GLR parser, IMHO).
[EDIT November 2011: We've extended our parser to handle all of C++11. GLR made that a lot easier to do. EDIT Aug 2014: Now handling all of C++17. Nothing broke or got worse, GLR is still the cat's meow.]

LALR parsers merge similar states within an LR grammar to produce parser state tables that are exactly the same size as those for the equivalent SLR grammar, which are usually an order of magnitude smaller than pure LR parsing tables. However, for LR grammars that are too complex to be LALR, these merged states result in parser conflicts, or produce a parser that does not fully recognize the original LR grammar.
BTW, I mention a few things about this in my MLR(k) parsing table algorithm here.
Addendum
The short answer is that the LALR parsing tables are smaller, but the parser machinery is the same. A given LALR grammar will produce much larger parsing tables if all of the LR states are generated, with a lot of redundant (near-identical) states.
The LALR tables are smaller because the similar (redundant) states are merged together, effectively throwing away context/lookahead info that the separate states encode. The advantage is that you get much smaller parsing tables for the same grammar.
The drawback is that not all LR grammars can be encoded as LALR tables because more complex grammars have more complicated lookaheads, resulting in two or more states instead of a single merged state.
The main difference is that the algorithm to produce LR tables carries more info around between the transitions from state to state while the LALR algorithm does not. So the LALR algorithm cannot tell if a given merged state should really be left as two or more separate states.

Yet another answer (YAA).
The parsing algorithms for SLR(1), LALR(1) and LR(1) are identical, like Ira Baxter said; however, the parser tables may be different because of the parser-generation algorithm.
An SLR parser generator creates an LR(0) state machine and computes the look-aheads from the grammar (FIRST and FOLLOW sets). This is a simplified approach and may report conflicts that do not really exist in the LR(0) state machine (i.e. conflicts an LALR generator would not report).
An LALR parser generator creates an LR(0) state machine and computes the look-aheads from the LR(0) state machine (via the terminal transitions). This is a correct approach, but occasionally reports conflicts that would not exist in an LR(1) state machine.
A Canonical LR parser generator computes an LR(1) state machine and the look-aheads are already part of the LR(1) state machine. These parser tables can be very large.
A Minimal LR parser generator computes an LR(1) state machine, but merges compatible states during the process, and then computes the look-aheads from the minimal LR(1) state machine. These parser tables are the same size or slightly larger than LALR parser tables, giving the best solution.
LRSTAR 10.0 can generate LALR(1), LR(1), CLR(1) or LR(*) parsers in C++, whatever is needed for your grammar. See this diagram which shows the difference among LR parsers.
[Full disclosure: LRSTAR is my product]

The basic difference between the parser tables generated with SLR vs. LR is that reduce actions in SLR tables are based on the Follow set of the nonterminal being reduced. This can be overly restrictive, ultimately causing shift-reduce conflicts.
An LR parser, on the other hand, bases reduce decisions only on the set of terminals which can actually follow the non-terminal being reduced. This set of terminals is often a proper subset of the Follows set of such a non-terminal, and therefore has less chance of conflicting with shift actions.
LR parsers are more powerful for this reason. LR parsing tables can be extremely large, however.
An LALR parser starts with the idea of building an LR parsing table, but combines generated states in a way that results in significantly less table size. The downside is that a small chance of conflicts would be introduced for some grammars that an LR table would otherwise have avoided.
LALR parsers are slightly less powerful than LR parsers, but still more powerful than SLR parsers. YACC and other such parser generators tend to use LALR for this reason.
P.S. For brevity, SLR, LALR and LR above really mean SLR(1), LALR(1), and LR(1), so one token lookahead is implied.

SLR parsers recognize a proper subset of grammars recognizable by LALR(1) parsers, which in turn recognize a proper subset of grammars recognizable by LR(1) parsers.
Each of these is constructed as a state machine, with each state representing some set of the grammar's production rules (and position in each) as it's parsing the input.
The Dragon Book example of an LALR(1) grammar that is not SLR is this:
S → L = R | R
L → * R | id
R → L
Here is one of the states for this grammar:
S → L•= R
R → L•
The • indicates the position of the parser in each of the possible productions. It doesn't know which of the productions it's actually in until it reaches the end and tries to reduce.
Here, the parser could either shift an = or reduce R → L.
An SLR parser (which is built on the LR(0) state machine) would determine whether it could reduce by checking whether the next input symbol is in the follow set of R (i.e., the set of all terminals in the grammar that can follow R). Since = is also in this set, the SLR parser encounters a shift-reduce conflict.
However, an LALR(1) parser would use the set of all terminals that can follow this particular production of R, which is only $ (ie, end of input). Thus, no conflict.
As previous commenters have noted, LALR(1) parsers have the same number of states as SLR parsers. A lookahead propagation algorithm is used to tack lookaheads on to SLR state productions from corresponding LR(1) states. The resulting LALR(1) parser can introduce reduce-reduce conflicts not present in the LR(1) parser, but it cannot introduce shift-reduce conflicts.
In your example, the following LALR(1) state causes a shift-reduce conflict in an SLR implementation:
S → b d•a / $
A → d• / c
The symbol after / is the follow set for each production in the LALR(1) parser. In SLR, follow(A) includes a, which could also be shifted.
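To make the SLR conflict concrete, here is a small sketch that computes the follow sets for the question's grammar. It is Python, hard-codes this particular grammar, and skips epsilon handling (which this grammar does not need), so it is not a general FOLLOW algorithm.
# Minimal FOLLOW-set computation for S -> Aa | bAc | dc | bda, A -> d.
grammar = {
    'S': [['A', 'a'], ['b', 'A', 'c'], ['d', 'c'], ['b', 'd', 'a']],
    'A': [['d']],
}
nonterminals = set(grammar)

def first(sym):
    # FIRST of a single symbol; a terminal is its own FIRST set
    if sym not in nonterminals:
        return {sym}
    return {f for rhs in grammar[sym] for f in first(rhs[0])}

follow = {nt: set() for nt in nonterminals}
follow['S'].add('$')               # end-of-input follows the start symbol
changed = True
while changed:                     # iterate to a fixed point
    changed = False
    for lhs, rhss in grammar.items():
        for rhs in rhss:
            for i, sym in enumerate(rhs):
                if sym not in nonterminals:
                    continue
                new = first(rhs[i + 1]) if i + 1 < len(rhs) else follow[lhs]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True

print(follow['A'])                 # {'a', 'c'}
Since FOLLOW(A) = {a, c}, an SLR parser must reduce A → d on both a and c, which clashes with shifting a in the state containing S → bd•a. The LALR(1) lookahead for A → d in that state is just {c}, so no conflict arises there.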

Suppose a parser without a lookahead is happily parsing strings for your grammar.
Using your given example, it comes across the string dc. What does it do? Does it reduce it to S, because dc is a valid string produced by this grammar? Or maybe we were trying to parse bdc, because even that is an acceptable string?
As humans we know the answer is simple, we just need to remember if we had just parsed b or not. But computers are stupid :)
An SLR(1) parser has the additional power over LR(0) to perform a lookahead, but here no amount of lookahead can tell us what to do; instead, we need to look back into our past. Thus the canonical LR parser comes to the rescue: it remembers the past context.
The way it remembers this context is that it disciplines itself, that whenever it will encounter a b, it will start walking on a path towards reading bdc, as one possibility. So when it sees a d it knows whether it is already walking a path.
Thus a CLR(1) parser can do things an SLR(1) parser cannot!
But now, since we had to define so many paths, the states of the machine get very large!
So we merge similar-looking paths. As expected, this can give rise to confusion, but we are willing to take that risk in exchange for the reduced size.
This is your LALR(1) parser.
Now how to do it algorithmically.
When you draw the configurating sets (the LR item sets) for the above language, you will see a shift-reduce conflict in two states. To remove them you might consider SLR(1), which makes its decisions by looking at follow sets, but you will observe that it still can't. So you would draw the configurating sets again, but this time with the restriction that whenever you calculate the closure, the additional productions being added must carry precise follow(s). Refer to any textbook on what these follows should be.

In addition to the answers above, this diagram demonstrates how different parsers relate:

Adding on top of the above answers, the difference between the individual parsers in the class of bottom-up LR parsers is whether they result in shift/reduce or reduce/reduce conflicts when generating the parsing tables. The fewer conflicts a method produces, the more powerful it is (LR(0) < SLR(1) < LALR(1) < CLR(1)).
For example, consider the following expression grammar:
E → E + T
E → T
T → F
T → T * F
F → ( E )
F → id
It's not LR(0) but SLR(1). Using the following code, we can construct the LR0 automaton and build the parsing table (we need to augment the grammar, compute the DFA with closure, compute the action and goto sets):
from copy import deepcopy
import pandas as pd

def update_items(I, C):
    # merge the items in C into the item set I
    if len(I) == 0:
        return C
    for nt in C:
        Int = I.get(nt, [])
        for r in C.get(nt, []):
            if not r in Int:
                Int.append(r)
        I[nt] = Int
    return I

def compute_action_goto(I, I0, sym, NTs):
    # compute the item set reached from I on the grammar symbol sym
    I1 = {}
    for NT in I:
        C = {}
        for r in I[NT]:
            r = r.copy()
            ix = r.index('.')
            if ix >= len(r)-1 or r[ix+1] != sym:   # the dot must be right in front of sym
                continue
            r[ix:ix+2] = r[ix:ix+2][::-1]          # read the next symbol sym (move the dot past it)
            C = compute_closure(r, I0, NTs)
            cnt = C.get(NT, [])
            if not r in cnt:
                cnt.append(r)
            C[NT] = cnt
        I1 = update_items(I1, C)
    return I1

def construct_LR0_automaton(G, NTs, Ts):
    I0 = get_start_state(G, NTs, Ts)   # closure of the start items
    I = deepcopy(I0)
    queue = [0]
    states2items = {0: I}
    items2states = {str(to_str(I)): 0}
    parse_table = {}
    cur = 0
    while len(queue) > 0:
        id = queue.pop(0)
        I = states2items[id]
        # compute the goto set for the non-terminals
        for NT in NTs:
            I1 = compute_action_goto(I, I0, NT, NTs)
            if len(I1) > 0:
                state = str(to_str(I1))
                if not state in items2states:      # a new state is discovered
                    cur += 1
                    queue.append(cur)
                    states2items[cur] = I1
                    items2states[state] = cur
                    parse_table[id, NT] = cur
                else:
                    parse_table[id, NT] = items2states[state]
        # compute the shift/reduce actions for the terminals similarly
        # ... ... ...
    return states2items, items2states, parse_table

states2items, items2states, parse_table = construct_LR0_automaton(G, NTs, Ts)
where the grammar G, non-terminal and terminal symbols are defined as below
G = {}
NTs = ['E', 'T', 'F']
Ts = {'+', '*', '(', ')', 'id'}
G['E'] = [['E', '+', 'T'], ['T']]
G['T'] = [['T', '*', 'F'], ['F']]
G['F'] = [['(', 'E', ')'], ['id']]
Here are a few more useful functions I implemented, along with the ones above, for LR(0) parsing table generation:
def augment(G, S): # S: the start symbol
    G[S + '1'] = [[S, '$']]      # add the augmented start production S1 -> S $
    NTs.append(S + '1')
    return G, NTs

def compute_closure(r, G, NTs):
    # as called above, G receives the start item set I0, whose items already carry the dot marker '.'
    S = {}
    queue = [r]
    seen = []
    while len(queue) > 0:
        r = queue.pop(0)
        seen.append(r)
        ix = r.index('.') + 1
        if ix < len(r) and r[ix] in NTs:    # a non-terminal right after the dot
            S[r[ix]] = G[r[ix]]
            for rr in G[r[ix]]:
                if not rr in seen:
                    queue.append(rr)
    return S
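The helper functions get_start_state and to_str are used above but not shown. Here is a minimal sketch of what they might look like, consistent with the item representation used above; these are my assumptions, not the original implementation:
def get_start_state(G, NTs, Ts):
    # hypothetical sketch: for grammars like the ones above, the closure of the
    # start item pulls in every production, so the start state simply contains
    # all productions with the dot marker '.' at the front
    return {nt: [['.'] + rhs for rhs in G[nt]] for nt in NTs}

def to_str(I):
    # hypothetical sketch: a canonical (order-independent) representation of an
    # item set, used to detect states that have already been generated
    return sorted((nt, tuple(map(tuple, sorted(rs)))) for nt, rs in I.items())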
The following figure shows the LR(0) DFA constructed for the grammar using the above code:
The following table shows the LR(0) parsing table generated as a pandas DataFrame; notice that there are a couple of shift/reduce conflicts, indicating that the grammar is not LR(0).
An SLR(1) parser avoids the above shift/reduce conflicts by reducing only if the next input token is a member of the follow set of the nonterminal being reduced. The following parse table is generated by SLR:
The following animation shows how an input expression is parsed with the above SLR(1) parse table:
The grammar from the question is not LR(0) either:
#S --> Aa | bAc | dc | bda
#A --> d
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c', 'd'}
G['S'] = [['A', 'a'], ['b', 'A', 'c'], ['d', 'c'], ['b', 'd', 'a']]
G['A'] = [['d']]
as can be seen from the next LR(0) DFA and the parsing table:
there is a shift/reduce conflict again:
But the following grammar, which accepts the strings of the form a^ncb^n, n >= 0, is LR(0):
A → a A b
A → c
S → A
# S --> A
# A --> a A b | c
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c'}
G['S'] = [['A']]
G['A'] = [['a', 'A', 'b'], ['c']]
As can be seen from the following figure, there is no conflict in the parsing table generated.
Here is how the input string a^2cb^2 can be parsed using the above LR(0) parse table, using the following code:
def parse(input, parse_table, rules):
    # note: parse_table here is the pandas DataFrame version (with a 'states' column),
    # and rules is a list of {'l': lhs, 'r': rhs-string} entries (not shown above);
    # the input argument is overridden with a fixed example string
    input = 'aacbb$'                     # a^2 c b^2, followed by the end marker $
    stack = [0]
    df = pd.DataFrame(columns=['stack', 'input', 'action'])
    i, accepted = 0, False
    while i < len(input):
        state = stack[-1]
        char = input[i]
        action = parse_table.loc[parse_table.states == state, char].values[0]
        if action[0] == 's':             # shift: push the symbol and the next state
            stack.append(char)
            stack.append(int(action[-1]))
            i += 1
        elif action[0] == 'r':           # reduce by the rule indexed in the action
            r = rules[int(action[-1])]
            l, r = r['l'], r['r']
            char = ''
            for j in range(2*len(r)):    # pop symbol/state pairs for the handle
                s = stack.pop()
                if type(s) != int:
                    char = s + char
            if char == r:
                goto = parse_table.loc[parse_table.states == stack[-1], l].values[0]
                stack.append(l)
                stack.append(int(goto[-1]))
        elif action == 'acc':            # accept
            accepted = True
        df2 = {'stack': ''.join(map(str, stack)), 'input': input[i:], 'action': action}
        df = df.append(df2, ignore_index = True)
        if accepted:
            break
    return df

parse(input, parse_table, rules)
The next animation shows how the input string a^2cb^2 is parsed with LR(0) parser using the above code:

One simple answer is that all LALR(1) grammars are LR(1) grammars.
Compared to LALR(1), LR(1) has more states in the associated finite-state machine (often many times more). And that is the main reason LALR(1) parsers require more code to detect syntax errors than LR(1) parsers.
One more important thing to know regarding these two: merging states in LALR(1) can introduce reduce/reduce conflicts that would not occur in the corresponding LR(1) automaton.

Related

Conditional Dependencies in Compiler Semantic Analysis Passes

Imagine that we have been given an Excel spreadsheet with three columns, labeled COND, X and Y.
COND = TRUE or FALSE (user input)
X = if(COND == TRUE) then 0 else Y
Y = if(COND == TRUE) then X else 1;
These formulas evaluate perfectly fine in Excel, and Excel does not generate a Circular Dependency error.
I am writing a compiler that tries to convert these Excel formulas to C code. In my compiler, these formulas do generate a circular dependency error. The issue is that (naïvely) the expression of X depends on Y and the expression for Y depends on X and my compiler is unable to logically continue.
Excel is able to accomplish this feat because it is a lazy, interpreted language. Excel will just lazily evaluate the formulas at run-time (with user inputs), and since no circular dependency occurs at run-time Excel has no problem evaluating such logic.
Unfortunately, I need to convert these formulas to a compiled language (not an interpreted one). The actual formulas, in the actual spreadsheets, have more complicated dependencies between multiple cells/variables (involving up to over half a dozen different cells). This means that my compiler has to perform some kind of sophisticated static, semantic analysis of the formulas and be smart enough to detect that there are no circular references if we "look inside" the conditional branches. The compiler would then have to generate the following C code from the above Excel formulas:
bool COND;
int X, Y;
if(COND) { X = 0; Y = X; } else { Y = 1; X = Y; }
Notice that the order of the assignment instructions is different in each branch of the if-statement in C.
My question is, is there any established algorithm or literature on compilers that explains how to implement this type of analysis in a compiler? Do functional programming language compilers have to solve this problem?
Why aren't standard optimization techniques adequate?
Presumably, the Excel formulas form a DAG with the leaves being primitive values and the nodes being computations/assignments. (If the Excel computation forms a cycle, then you need some kind of iterative solver, assuming you want a fixpoint.)
If you simply propagate the conditional by lifting it (a classic compiler optimization), we start with your original equations, where each computation may be evaluated in any order with respect to the others, such that the result still computes DAG-like ("anyorder" is an operator intended to model that):
X = if(COND == TRUE) then 0 else Y;
anyorder
Y = if(COND == TRUE) then X else 1;
then lifting the conditional:
if (COND) { X=0; } else { X = 1; }
anyorder
if (COND) { Y=X; } else { Y = 1; }
then
if (COND) { X=0; anyorder Y=X; } else { X = Y; anyorder Y = 1; }
Each of the arms must be dag-like.
The first arm is daglike evaluating the X=0 assignment first.
The second arm is daglike evaluating Y=1 first. So, we get the answer you wanted:
if (COND) { X=0; Y=X; } else { Y = 1; X = Y; }
So conventional transformations, plus knowing when an anyorder-if is DAG-like, seem to give the right effect.
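Here is a small sketch of where that lifting ends up, in Python; the representation of the formulas is my own assumption (each cell is a triple of condition name, then-expression, else-expression), purely for illustration. It lifts the condition and then orders the assignments inside each branch so that every variable is assigned before it is used.
# Hypothetical sketch: each cell formula is (condition, then_expr, else_expr),
# where an expr is either a constant or the name of another cell.
formulas = {'X': ('COND', 0, 'Y'),    # X = if COND then 0 else Y
            'Y': ('COND', 'X', 1)}    # Y = if COND then X else 1

def branch_assignments(formulas, cond_is_true):
    # pick the expression of each cell for one arm of the lifted conditional
    exprs = {cell: (t if cond_is_true else e) for cell, (c, t, e) in formulas.items()}
    ordered, emitted = [], set()
    # repeatedly emit assignments whose dependencies are already available
    while len(emitted) < len(exprs):
        progress = False
        for cell, expr in exprs.items():
            if cell in emitted:
                continue
            if not isinstance(expr, str) or expr in emitted:
                ordered.append(f"{cell} = {expr};")
                emitted.add(cell)
                progress = True
        if not progress:
            raise ValueError("genuine circular dependency in this branch")
    return ordered

print("if (COND) {", " ".join(branch_assignments(formulas, True)),
      "} else {", " ".join(branch_assignments(formulas, False)), "}")
Running it prints if (COND) { X = 0; Y = X; } else { Y = 1; X = Y; }, and it raises an error only if a branch contains a genuine cycle.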
I'm not sure what you do if COND is computed as a function of the cells.
I suspect the way to do this is to generate a dependency graph of computations, with conditionals on the dependencies. You probably have to propagate/group those conditionals over the arcs, more or less as I did over the syntax.
Yes, literature exists; sorry, I cannot quote any, I simply don't remember and would just google it up, as you can too.
Basic algorithms for dependency and cycle analysis are really simple: detect the symbols in each expression and build a set of expressions and dependencies of the form:
inps expr outs
cell_A6, cell_B7 -> expr3 -> cell_A7
cell_A1, cell_B4 -> expr1 -> cell_A5
cell_A1, cell_A5 -> expr2 -> cell_A6
and then by comparing and iteratively expanding/replacing sets of inputs/outputs:
step0:
cell_A6, cell_B7 -> expr3 -> cell_A7
cell_A1, cell_B4 -> expr1 -> cell_A5 <--1 note that cell_A5 ~ (A1,B4)
cell_A1, cell_A5 -> expr2 -> cell_A6 <--1 apply that knowledge here
so dependency
cell_A1, cell_A5 -> expr2 -> cell_A6
morphs into
cell_A1, cell_B4 -> expr2 -> cell_A6 <--2 note that cell_A6 ~ (A1,B4) and so on
Finally, you will get either a set of full dependencies, where you can easily detect circular dependencies, like for example:
cell_A1, cell_D6, cell_F7 -> exprN -> cell_D6
or, if none found - you will be able to determine a safe, incremental order of the execution.
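A minimal sketch of that iterative expansion in Python (the cell names and representation are illustrative assumptions): it computes the full, transitive set of inputs for every cell and flags any cell that ends up depending on itself.
# Hypothetical sketch: map each output cell to the set of cells its expression reads.
deps = {'cell_A5': {'cell_A1', 'cell_B4'},   # expr1
        'cell_A6': {'cell_A1', 'cell_A5'},   # expr2
        'cell_A7': {'cell_A6', 'cell_B7'}}   # expr3

def full_dependencies(deps):
    # iteratively add the inputs of intermediate cells until nothing changes
    full = {out: set(ins) for out, ins in deps.items()}
    changed = True
    while changed:
        changed = False
        for out in full:
            for cell in list(full[out]):
                extra = full.get(cell, set()) - full[out]
                if extra:
                    full[out] |= extra
                    changed = True
    return full

full = full_dependencies(deps)
print(full['cell_A7'])                        # transitively: A1, A5, A6, B4, B7
print([c for c in full if c in full[c]])      # non-empty only if there is a circular dependency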
If the expressions contain branches or side effects other than the 'returned value', you can apply various transformations to reduce/expand the expressions into new ones, or into groups of new expressions that will be of the form above. For example:
B5 = { if(A5 + A3 > 0) A3-1 else A5+1 }
so
inps ... outs
A3, A5 -> theExpr -> B5
the condition can be 'lifted' and form two conditional rules:
A5 + A3 > 0 : A3 -> reducedexpr "A3-1" -> B5
A5 + A3 <= 0 : A5 -> reducedexpr "A5+1" -> B5
but now, your execution/analysis must also take care of the conditions before applying the rules. Lifting is only one of the possible transformations.
However, you still need something more than that, at least some 'extension' to it. The hard part of your problem is that your expressions are complex, have branches, and you need to include the user-supplied input to resolve the branches, eliminate the dead ones and break dead dependencies.
Since the key is elimination of dead dependencies, you have to somehow detect dead branches. Conditions can be of arbitrary complexity, and the user input is random, so you cannot work it out completely statically, really. After playing with transformations, you would still have to analyze the conditions and generate code accordingly. To do so, you would need to generate code for all possible combinations of the outcomes of the conditions, and all the resulting branching and rule combinations, which is simply infeasible except for some trivial cases. With N unknowns the number of leaves can grow exponentially (2^N), which is a huge bloat after crossing some threshold.
Of course while analyzing conditions based on Bools, you can analyze, group and eliminate conflicting conditions like (a & b & !a)..
..but if your input values and conditions include non-boolean data, like integers, floats or strings, just imagine that you have a condition which executes some external weird statistical function and checks its result. Ignore the 'weird' part and focus on 'external'. If you meet expressions that use complex functions like AVG or MAX, you cannot chew through something like that statically (*). Even simple arithmetic is hard to analyze: (a+b)*(c+d) - you could derive the fact that c+d can be ignored when a+b==0, but this is a really tough task to cover fully..
IIRC, doing a satisfiability analysis (SAT) for boolean expressions with basic operators is an NP-hard problem, not to mention integers or floating points with all their math.. Calculating the result of an expression is much easier than telling which values it really depends on!!
So, since input values may be either hardcoded (cool) or user-supplied at runtime (doh!), your compiler most probably will not be able to fully analyze it up front. Now link it with the fact marked as (*) and it's quite obvious that you can include some static analysis and try to eliminate some branches at 'compilation time', but there still might be some parts that must be delayed until the user provides the actual input.
So, if part of the analysis must be done at runtime, all the branch elimination is just an optional optimisation and I think you should focus on the runtime part now.
In a minimal, unoptimized version, your generated program could simply remember all the Excel expressions and wait for the input data. Once the program is run and input is given, the program has to substitute the input into the expressions, and then try to iteratively reduce them to output values.
Writing such an algorithm in an imperative language is completely possible. Actually, you'd need to write it once, and later you'd just merge it with different sets of rules derived from the cell formulas and be done. The runtime part of the program would stay the same; only the formulas would change.
You could then expand the 'compiler' side to try to help, e.g. by preliminarily, partially analyzing the dependencies and trying to reorder the rules so that later they will be checked in a "better order", or by precalculating constants, or by inlining some expressions and so on, but as I said, it's all optimizations, not the core feature.
Sadly, I cannot really tell you anything serious about "functional languages", but since usually their runtimes are 'very dynamic' and sometimes they even execute the code in terms of symbols and transformations, that could reduce the complexity of your 'compiler' and 'engine' parts. The most valuable asset here is the dynamism. So even Ruby would do much better than C - but it is in no way a "compiled" language, as you'd say.
For example, you could try to transform excel rules directly into functions:
def cell_A5 = expr1(cell_A1, cell_B4)
def cell_A7 = expr3(cell_A6, cell_B7)
def cell_A6 = expr2(cell_A1, cell_A5)
write that down as part of the program; then, at runtime, when the user provides some values, those would just redefine some parts of the program
cell_B7 = 11.2 // filling up undefined variable
cell_A1 = 23 // filling up undefined variable
cell_A5 = 13 // overwriting the function with a value
That's the power of dynamic platforms, nothing very 'functional' here. Dynamic platforms make it easy to fill/override bits. But then, once the user provided some bits and once the program has been "corrected on the fly", which one function would you call first?
The answer is somewhat sad.. You don't know.
If your dynamic language has some rule-engine built into it, you can try generating rules instead of functions and later rely on that engine to "fill up" everything that is possible to calculate.
But if it doesn't have rule engine, you are back to point one..
afterthought:
Hm.. sorry, I think I just wrote too much and too vaguely/chattily. If you think it's helpful, please drop me a comment. Otherwise I'll delete it after a few days or a week.

Merging duplicate path nodes

Consider the following trivial data structure:
data Step =
    Match Char |
    Options [Pattern]

type Pattern = [Step]
This is used together with a small function
match :: Pattern -> String -> Bool
match [] _ = True
match _ "" = False
match (s:ss) (c:cs) =
    case s of
        Match c0   -> (c == c0) && (match ss cs)
        Options ps -> any (\ p -> match (p ++ ss) (c:cs)) ps
It should be fairly obvious what is going on here; a Pattern either does or does not match a given String based on the steps it contains. Each Step either matches a single character (Match), or it consists of a list of possible sub-patterns. (Note well: sub-patterns are not necessarily of equal length!)
Suppose we have a pattern such as this:
[
  Match '*',
  Options
    [
      [Match 'F', Match 'o', Match 'o'],
      [Match 'F', Match 'o', Match 'b']
    ],
  Match '*'
]
This pattern matches two possible strings, *Foo* and *Fob*. Clearly we can "optimise" this into
[Match '*', Match 'F', Match 'o', Options [[Match 'o'], [Match 'b']], Match '*']
My question: How do I write the function to do this?
More generally, a given Options constructor may have an arbitrary number of sub-paths, of wildly different lengths, some with common prefixes and suffixes, and some without. It's even possible to have empty sub-paths, or even to do something like Options [] (which is of course no-op). I'm struggling to write a function which will reduce every possible input correctly...
On cursory inspection this looks like you've defined a nondeterministic finite state automaton. NFAs were first defined by Michael O. Rabin and, of all people, Dana Scott, who has brought us much else as well!
This is an automaton because it is built out of steps, with transitions between them, and acceptance states. At each step you have many possible transitions. Hence your automaton is nondeterministic. Now you want to optimize this. One way to optimize it (not the way you're asking for, but related) is to eliminate backtracking. You can do this by taking every combination of how you get to a state along with the state itself. This is known as the powerset construction: http://en.wikipedia.org/wiki/Powerset_construction
The wikipedia article is actually pretty good -- and in a language like Haskell we can first define the full powerset DFA, then lazily traverse all genuine paths to "strip out" most of the unreachable cruft. That gets us to a decent DFA, but not necessarily a minimal one.
As described at the bottom of that article, we can use Brzozowski's algorithm, flipping all the arrows and getting a new NFA that describes going from the end states to the initial states. Now if we were minimizing a DFA, we'd need to go from there back to the DFA again, then flip the arrows and do it all again. This isn't necessarily the fastest approach, but it's straightforward and works well enough for plenty of cases. There are plenty of better algorithms available as well: http://en.wikipedia.org/wiki/DFA_minimization
For minimizing an NFA, there are a variety of approaches, but the problem is in general np-hard, so you'll have to pick some poison :-)
Of course all this is assuming you have a full NFA. If you have mutually recursive definitions, then you can put a pattern "inside" itself, and you sure do. That said, you'll then need to use clever tricks to recover the explicit shared structure in order to even begin working with an NFA in this form -- otherwise you'll loop forever.
If you insert a "no sharing" rule -- i.e. the directed graph of your NFA is not only acyclic, but branches never 'merge back' except when you exit an 'options' set, then I'd imagine that simplification is a much more straightforward affair, just 'factoring out' common characters. Since this involves thinking and not just providing references, I'll leave it there for now, just noting that this article might somehow be of interest: http://matt.might.net/articles/parsing-with-derivatives/
p.s.
A stab at the "factoring" solution is a function with the following type:
factor :: [Pattern] -> (Maybe Step, [Pattern])
factor = -- pulls out a common element of the pattern head, should one exist. shallow.
factorTail = -- same, but pulling out of the pattern tail
simplify :: [Pattern] -> [Pattern]
simplify = -- remove redundant constructs, such as options composed only of other options, which can be flattened out, options with no elements that are the "only" option, etc. should run "deep" all levels down.
Now you can start at the lowest level and cycle (simplify . factor) until you have no new factors. Then do so with (simplify . factorTail). Then go one level up, do the same thing. I wouldn't be shocked if you couldn't "trick" this into a nonminimal solution, but I think for most cases it will work very well.
Update: What this solution doesn't address is something where you have e.g. Options ["--DD--", "++DD++"] (reading strings as list of matches), and so you have uncommon structure in both the head and tail but not in the middle. A more general solution in such an instance would be to pull out the least common substring between all matches in your list, and use that as the "frame" with options inserted in the sections where things differ.
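As a concrete illustration of the "factoring out common characters" idea, here is a tiny sketch in Python rather than Haskell, treating each alternative as a plain string of Match characters instead of a [Step] list, so it only covers the simplest case:
import os

def factor(options):
    # pull the longest common prefix and suffix out of a list of alternatives
    prefix = os.path.commonprefix(options)
    trimmed = [o[len(prefix):] for o in options]
    suffix = os.path.commonprefix([t[::-1] for t in trimmed])[::-1]
    middles = [t[:len(t) - len(suffix)] if suffix else t for t in trimmed]
    return prefix, middles, suffix

print(factor(["*Foo*", "*Fob*"]))   # ('*Fo', ['o', 'b'], '*')
A real solution would operate on [Pattern] values, recurse into nested Options, and cope with sub-patterns of different lengths, which is exactly the hard part the answer leaves open.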

What does syntax directed translation mean?

Can anyone, in simple terms, explain what does "Syntax Directed Translation" mean? I started to read the topic from Dragon Book but couldn't understand. The Wiki article didn't help either.
In simplest terms, 'Syntax Directed Translation' means driving the entire compilation (translation) process with the syntax recognizer (the parser).
Conceptually, the process of compiling a program (translating it from source code to machine code) starts with a parser that produces a parse tree, and then transforms that parse tree through a sequence of tree or graph transformations, each of which is largely independent, resulting in a final simplified tree or graph that is traversed to produce machine code.
This view, while nice in theory, has a drawback that if you try to implement it directly, enough memory to hold at least two copies of the entire tree or graph is needed. Back when the Dragon Book was written (and when a lot of this theory was hashed out), computer memories were measured in kilobytes, and 64K was a lot. So compiling large programs could be tricky.
With Syntax Directed Translation, you organize all of the graph transformations around the order in which the parser recognizes the parse tree. Instead of producing a complete parse tree, your parser builds little bits of it, and then feeds those bits to the subsequent passes of the compiler, ultimately producing a small piece of machine code, before continuing the parsing process to build the next piece of parse tree. Since only small amounts of the parse tree (or the subsequent graphs) exist at any time, much less memory is required. Since the syntax recognizer is the master sequencer controlling all of this (deciding the order in which things happen), this is called Syntax Directed Translation.
Since this is such an effective way of keeping down memory use, people even redesigned languages to make it easier to do -- the ideal being to have a "Single Pass" compiler that could in fact do the entire process from parsing to machine code generation in a single pass.
Nowadays, memory is not at such a premium, so there's less pressure to force everything into a single pass. Instead you generally use Syntax Directed Translation just for the front end, parsing the syntax, doing typechecking and other semantic checks, and a few simple transformations, all from the parser, and producing some internal form (three address code, trees, or dags of some kind), and then having separate optimization and back end passes that are independent (and so not syntax directed). Even in this case you might claim that these later passes are at least partly syntax directed, as the compiler may be organized to operate on large pieces of the input (such as entire functions or modules), pushing through all the passes before continuing with the next piece of input.
Tools like yacc are designed around the idea of Syntax Directed Translation -- the tool produces a syntax recognizer that directly runs fragments of code ('actions' in the tool parlance) as productions (fragments of the parse tree) are recognized, without ever creating an actual 'tree'. These actions can directly invoke what are logically later passes in the compiler, and then return to continue parsing. The imperative main loop that drives all of this is the parser's token reading state machine.
Actually no. Historically, before the Dragon Book there were syntax directed compilers. Attending ACM SIGPLAN meetings in the late 1960s I learned of several types of directed translation. Tree directed and graph directed translation were also discussed. I think these got muddled together in the Dragon Book, though I have never owned the Dragon Book. My favorite book was Programming Systems and Languages by Saul Rosen. It is a collection of papers on compilers, operating systems and computer systems. I'll try to explain the early syntax directed compiler parser programming languages. The later ones producing trees were combined with tree directed code generating languages.
Early syntax directed compilers translated source directly to stack machine code. The Burroughs B5000 ALGOL compiler is an example.
A*(B+C) -> A,B,C,ADD,MPY
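To make "translating directly to stack machine code while parsing" concrete, here is a small recursive descent sketch in Python. It is my own illustration, not META II or the B5000 compiler, and the operand/operator set is deliberately tiny; the point is that the code is emitted by semantic actions as each phrase is recognized, with no parse tree ever being built.
# Hypothetical sketch: a syntax directed translator for expressions over
# single-letter operands, '+' and '*', with parentheses, emitting stack-machine
# code from the semantic actions while the input is being parsed.
def translate(src):
    pos = [0]                # current position, boxed so the nested functions can update it
    out = []                 # the emitted stack-machine code

    def peek():
        return src[pos[0]] if pos[0] < len(src) else ''

    def eat(ch):
        assert peek() == ch, f"expected {ch!r} at position {pos[0]}"
        pos[0] += 1

    def factor():            # factor = letter | '(' expr ')'
        if peek() == '(':
            eat('('); expr(); eat(')')
        else:
            out.append(peek())   # semantic action: emit a push of the operand
            pos[0] += 1

    def term():              # term = factor ('*' factor)*
        factor()
        while peek() == '*':
            eat('*'); factor()
            out.append('MPY')    # action runs as soon as the phrase is recognized

    def expr():              # expr = term ('+' term)*
        term()
        while peek() == '+':
            eat('+'); term()
            out.append('ADD')

    expr()
    return out

print(translate('A*(B+C)'))  # ['A', 'B', 'C', 'ADD', 'MPY']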
Schorre's META II domain specific parser programming language, compiler compiler, developed in the 1960s is an example of a syntax directed compiler. You can find the original META II paper in the ACM archive. META II avoids left recursion using $ postfix zero or more sequence operator and ( ) grouping.
EXPR = TERM $('+' TERM .OUT 'ADD'|'-' TERM .OUT 'SUB');
Later Schorre based metalanguage compilers translated to trees using stack based tree transformation operators :<node name> and !<number>.
EXPR = TERM $(('+':ADD|'-':SUB) TERM!2);
Except for TREEMETA, which used [<number>] instead of !<number>. The above EXPR formula is basically the same as the META II EXPR, except that we have factored out the recognition of the + and - operators, creating the corresponding node and pushing it onto the node stack. Then, on recognizing the right TERM, the tree constructor !2 creates a tree by popping the top 2 <TERM>s from the parse stack and the top node from the node stack to form a tree:

     ADD        or        SUB
    /   \                /   \
 TERM   TERM          TERM   TERM
Tokens were recognized by the supplied recognizers .ID, .NUMBER and .STRING. These were later replaced by token ".." and character class ":" formulae in CWIC:
id .. let $(let|dgt|+'_');
Tree directed compiler languages were combined with the syntax directed compilers to generate code. The CWIC compiler compiler developed at Systems Development Corporation included a LISP 2 based tree directed generator language. A short paper on CWIC can be found in the ACM archives.
In the parser programming languages you are programming a type of recursive descent parser. When you get to CWIC, all the problems that today are attributed to recursive descent parsers were eliminated. There is no left recursion problem, as the $ zero-or-more construct and programmed tree construction eliminate the need for left recursion. You control the tree construction. A loop construct is used to produce a left-handed tree, and tail recursion a right-handed tree. Though a parsing formula may generate no tree at all:
program = $declarations;
In the above the $ zero or more loop operator preceding declarations specifies that declarations is to be repeatably called as long as it returns success. The input source code being compiled is made up of any positive number of declarations. The declarations formula would then define the types of declarations. You might need external linkages declarations, data declarations, function or procedure code declarations.
declarations = linkage_decl | data_decl | code_decl;
The types of declarations each being a separate formula. The syntax language controls when semantic processing and code generation occurs. The program and declarations formulas above do not produce trees. They simply control when and what language structures are parsed. These are neither LL nor LR parsers. They provide unlimited (limited only by available memory) programmed backtracking. They provide programmed look-ahead and peek-ahead tests.
As a last example, the following example, including token and character class formulae, illustrates producing both left- and right-handed trees, specifically exponentiation using tail recursion.
assign = id '=' expr ';' :ASSIGN!2 arith_gen[*1];
expr = term $(('+':ADD | '-':SUB) term !2);
term = factor $(('*':MPY | '//' :REM | '/':DIV) factor!2);
factor = ( id ('(' +[ arg $(',' arg) ]+ ')' :CALL!2 | .EMPTY)
| number
| '(' expr ')'
) ('^' factor:EXP!2 | .EMPTY);
bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';
upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';
alpha: upr|lwr;
alphanum: alpha|dgt;
number .. dgt $dgt MAKENUM[];
id .. alpha $(alphanum|+'_');

Mathematica Notation and syntax mods

I am experimenting with syntax mods in Mathematica, using the Notation package.
I am not interested in mathematical notation for a specific field, but general purpose syntax modifications and extensions, especially notations that reduce the verbosity of Mathematica's VeryLongFunctionNames, clean up unwieldy constructs, or extend the language in a pleasing way.
An example modification is defining Fold[f, x] to evaluate as Fold[f, First@x, Rest@x]
This works well, and is quite convenient.
Another would be defining *{1,2} to evaluate as Sequence @@ {1,2}, as inspired by Python; this may or may not work in Mathematica.
Please provide information or links addressing:
Limits of notation and syntax modification
Tips and tricks for implementation
Existing packages, examples or experiments
Why this is a good or bad idea
Not a really constructive answer, just a couple of thoughts. First, a disclaimer - I don't suggest any of the methods described below as good practices (perhaps generally they are not), they are just some possibilities which seem to address your specific question. Regarding the stated goal - I support the idea very much, being able to reduce verbosity is great (for personal needs of a solo developer, at least). As for the tools: I have very little experience with Notation package, but, whether or not one uses it or writes some custom box-manipulation preprocessor, my feeling is that the whole fact that the input expression must be parsed into boxes by Mathematica parser severely limits a number of things that can be done. Additionally, there will likely be difficulties with using it in packages, as was mentioned in the other reply already.
It would be easiest if there would be some hook like $PreRead, which would allow the user to intercept the input string and process it into another string before it is fed to the parser. That would allow one to write a custom preprocessor which operates on the string level - or you can call it a compiler if you wish - which will take a string of whatever syntax you design and generate Mathematica code from it. I am not aware of such hook (it may be my ignorance of course). Lacking that, one can use for example the program style cells and perhaps program some buttons which read the string from those cells and call such preprocessor to generate the Mathematica code and paste it into the cell next to the one where the original code is.
Such preprocessor approach would work best if the language you want is some simple language (in terms of its syntax and grammar, at least), so that it is easy to lexically analyze and parse. If you want the Mathematica language (with its full syntax modulo just a few elements that you want to change), in this approach you are out of luck in the sense that, regardless of how few and "lightweight" your changes are, you'd need to re-implement pretty much completely the Mathematica parser, just to make those changes, if you want them to work reliably. In other words, what I am saying is that IMO it is much easier to write a preprocessor that would generate Mathematica code from some Lisp-like language with little or no syntax, than try to implement a few syntactic modifications to otherwise the standard mma.
Technically, one way to write such a preprocessor is to use standard tools like Lex(Flex) and Yacc(Bison) to define your grammar and generate the parser (say in C). Such parser can be plugged back to Mathematica either through MathLink or LibraryLink (in the case of C). Its end result would be a string, which, when parsed, would become a valid Mathematica expression. This expression would represent the abstract syntax tree of your parsed code. For example, code like this (new syntax for Fold is introduced here)
"((1|+|{2,3,4,5}))"
could be parsed into something like
"functionCall[fold,{plus,1,{2,3,4,5}}]"
The second component for such a preprocessor would be written in Mathematica, perhaps in a rule-based style, to generate Mathematica code from the AST. The resulting code must be somehow held unevaluated. For the above code, the result might look like
Hold[Fold[Plus,1,{2,3,4,5}]]
It would be best if analogs of tools like Lex(Flex)/Yacc(Bison) were available within Mathematica ( I mean bindings, which would require one to only write code in Mathematica, and generate say C parser from that automatically, plugging it back to the kernel either through MathLink or LibraryLink). I may only hope that they will become available in some future versions. Lacking that, the approach I described would require a lot of low-level work (C, or Java if your prefer). I think it is still doable however. If you can write C (or Java), you may try to do some fairly simple (in terms of the syntax / grammar) language - this may be an interesting project and will give an idea of what it will be like for a more complex one. I'd start with a very basic calculator example, and perhaps change the standard arithmetic operators there to some more weird ones that Mathematica can not parse properly itself, to make it more interesting. To avoid MathLink / LibraryLink complexity at first and just test, you can call the resulting executable from Mathematica with Run, passing the code as one of the command line arguments, and write the result to a temporary file, that you will then import into Mathematica. For the calculator example, the entire thing can be done in a few hours.
Of course, if you only want to abbreviate certain long function names, there is a much simpler alternative - you can use With to do that. Here is a practical example of that - my port of Peter Norvig's spelling corrector, where I cheated in this way to reduce the line count:
Clear[makeCorrector];
makeCorrector[corrector_Symbol, trainingText_String] :=
  Module[{model, listOr, keys, words, edits1, train, max, known, knownEdits2},
   (* Proxies for some commands - just to play with syntax a bit*)
   With[{fn = Function, join = StringJoin, lower = ToLowerCase,
     rev = Reverse, smatches = StringCases, seq = Sequence, chars = Characters,
     inter = Intersection, dv = DownValues, len = Length, ins = Insert,
     flat = Flatten, clr = Clear, rep = ReplacePart, hp = HoldPattern},
    (* body *)
    listOr = fn[Null, Scan[If[# =!= {}, Return[#]] &, Hold[##]], HoldAll];
    keys[hash_] := keys[hash] = Union[Most[dv[hash][[All, 1, 1, 1]]]];
    words[text_] := lower[smatches[text, LetterCharacter ..]];
    With[{m = model},
     train[feats_] := (clr[m]; m[_] = 1; m[#]++ & /@ feats; m)];
    With[{nwords = train[words[trainingText]],
      alphabet = CharacterRange["a", "z"]},
     edits1[word_] := With[{c = chars[word]}, join @@@ Join[
         Table[
          rep[c, c, #, rev[#]] &@{{i}, {i + 1}}, {i, len[c] - 1}],
         Table[Delete[c, i], {i, len[c]}],
         flat[Outer[#1[c, ##2] &, {ins[#1, #2, #3 + 1] &, rep},
           alphabet, Range[len[c]], 1], 2]]];
     max[set_] := Sort[Map[{nwords[#], #} &, set]][[-1, -1]];
     known[words_] := inter[words, keys[nwords]]];
    knownEdits2[word_] := known[flat[Nest[Map[edits1, #, {-1}] &, word, 2]]];
    corrector[word_] := max[listOr[known[{word}], known[edits1[word]],
       knownEdits2[word], {word}]];]];
You need some training text with a large number of words as a string to pass as a second argument, and the first argument is the function name for a corrector. Here is the one that Norvig used:
text = Import["http://norvig.com/big.txt", "Text"];
You call it once, say
In[7]:= makeCorrector[correct, text]
And then use it any number of times on some words
In[8]:= correct["coputer"] // Timing
Out[8]= {0.125, "computer"}
You can make your custom With-like control structure, where you hard-code the short names for some long mma names that annoy you the most, and then wrap that around your piece of code ( you'll lose the code highlighting however). Note, that I don't generally advocate this method - I did it just for fun and to reduce the line count a bit. But at least, this is universal in the sense that it will work both interactively and in packages. Can not do infix operators, can not change precedences, etc, etc, but almost zero work.
(my first reply/post.... be gentle)
From my experience, the functionality appears to be a bit of a programming cul-de-sac. The ability to define custom notations seems heavily dependent on using the 'notation palette' to define and clear each custom notation. ('everything is an expression'... well, except for some obscure cases, like Notations, where you have to use a palette.) Bummer.
The Notation package documentation mentions this explicitly, so I can't complain too much.
If you just want to define custom notations in a particular notebook, Notations might be useful to you. On the other hand, if your goal is to implement custom notations in YourOwnPackage.m and distribute them to others, you are likely to encounter issues. (unless you're extremely fluent in Box structures?)
If someone can correct my ignorance on this, you'd make my month!! :)
(I was hoping to use Notations to force MMA to treat subscripted variables as symbols.)
Not a full answer, but just to show a trick I learned here (more related to symbol redefinition than to Notation, I reckon):
Unprotect[Fold];
Fold[f_, x_] :=
    Block[{$inMsg = True, result},
        result = Fold[f, First@x, Rest@x];
        result] /; ! TrueQ[$inMsg];
Protect[Fold];
Fold[f, {a, b, c, d}]
(*
--> f[f[f[a, b], c], d]
*)
Edit
Thanks to @rcollyer for the following (see comments below).
You can switch the definition on or off as you please by using the $inMsg variable:
$inMsg = False;
Fold[f, {a, b, c, d}]
(*
->f[f[f[a,b],c],d]
*)
$inMsg = True;
Fold[f, {a, b, c, d}]
(*
->Fold::argrx: Fold called with 2 arguments; 3 arguments are expected.
*)
Fold[f, {a, b, c, d}]
That's invaluable while testing

What is packrat parsing?

I know and use bison/yacc. But in parsing world, there's a lot of buzz around packrat parsing.
What is it? Is it worth studying?
Packrat parsing is a way of providing asymptotically better performance for parsing expression grammars (PEGs); specifically for PEGs, linear time parsing can be guaranteed.
Essentially, Packrat parsing just means caching whether sub-expressions match at the current position in the string when they are tested -- this means that if the current attempt to fit the string into an expression fails then attempts to fit other possible expressions can benefit from the known pass/fail of subexpressions at the points in the string where they have already been tested.
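Here is a minimal sketch of that caching idea in Python. It is a toy PEG with literals, sequences, and ordered choice; the grammar and helper names are my own, and only the top rule is memoized, so it illustrates the mechanism rather than a full packrat parser.
# Toy PEG combinators: a parser takes (text, pos) and returns the new position
# on success, or None on failure.
def lit(s):
    def p(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return p

def seq(*parsers):
    def p(text, pos):
        for q in parsers:
            pos = q(text, pos)
            if pos is None:
                return None
        return pos
    return p

def choice(*parsers):                  # ordered choice: the first alternative that matches wins
    def p(text, pos):
        for q in parsers:
            r = q(text, pos)
            if r is not None:
                return r
        return None
    return p

def packrat(rule):                     # the "packrat" part: memoize the result per position
    cache = {}                         # assumes this parser is used on a single input string
    def p(text, pos):
        if pos not in cache:
            cache[pos] = rule(text, pos)
        return cache[pos]
    return p

# A tiny grammar of my own:  E -> F '+' E / F,   F -> digit / '(' E ')'
def E(text, pos): return E_impl(text, pos)       # forward reference so E can recurse
D = choice(*[lit(d) for d in "0123456789"])
F = choice(D, seq(lit('('), E, lit(')')))
E_impl = packrat(choice(seq(F, lit('+'), E), F))

text = "1+(2+3)"
print(E(text, 0) == len(text))                   # True: the whole input matches
A full packrat parser memoizes every rule at every position, which is what gives the linear-time guarantee at the cost of the extra memory discussed below.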
At a high level:
Packrat parsers make use of parsing expression grammars (PEGs) rather than traditional context-free grammars (CFGs).
Through their use of PEGs rather than CFGs, it's typically easier to set up and maintain a packrat parser than a traditional LR parser.
Due to how they use memoization, packrat parsers typically use more memory at runtime than "classical" parsers like LALR(1) and LR(1) parsers.
Like classical LR parsers, packrat parsers run in linear time.
In that sense, you can think of a packrat parser as a simplicity/memory tradeoff with LR-family parsers. Packrat parsers require less theoretical understanding of the parser's inner workings than LR-family parsers, but use more resources at runtime. If you're in an environment where memory is plentiful and you just want to throw a simple parser together, packrat parsing might be a good choice. If you're on a memory-constrained system or want to get maximum performance, it's probably worth investing in an LR-family parser.
The rest of this answer gives a slightly more detailed overview of packrat parsers and PEGs.
On CFGs and PEGs
Many traditional parsers (and many modern parsers) make use of context-free grammars. A context-free grammar consists of a series of rules like the ones shown here:
E -> E * E | E + E | (E) | N
N -> D | DN
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
For example, the top line says that the nonterminal E can be replaced either with E * E, or E + E, or (E), or with N. The second line says that N can be replaced with either D or DN. The last line says that D can be replaced with any single digit.
If you start with the string E and follow the rules from the above grammar, you can generate any mathematical expression using +, *, parentheses, and single digits.
Context-free grammars are a compact way to represent a collection of strings. They have a rich and well-understood theory. However, they have two main drawbacks. The first one is that, by itself, a CFG defines a collection of strings, but doesn't tell you how to check whether a particular string is generated by the grammar. This means that whether a particular CFG will lend itself to a nice parser depends on the particulars of how the parser works, meaning that the grammar author may need to familiarize themselves with the internal workings of their parser generator to understand what restrictions are placed on the sorts of grammar structures that can arise. For example, LL(1) parsers don't allow for left-recursion and require left-factoring, while LALR(1) parsers require some understanding of the parsing algorithm to eliminate shift/reduce and reduce/reduce conflicts.
The second, larger problem is that grammars can be ambiguous. For example, the above grammar generates the string 2 + 3 * 4, but does so in two ways. In one way, we essentially get the grouping 2 + (3 * 4), which is what's intended. The other one gives us (2 + 3) * 4, which is not what's meant. This means that grammar authors either need to ensure that the grammar is unambiguous or need to introduce precedence declarations auxiliary to the grammar to tell the parser how to resolve the conflicts. This can be a bit of a hassle.
Packrat parsers make use of an alternative to context-free grammars called parsing expression grammars (PEGs). Parsing expression grammars in some ways resemble CFGs - they describe a collection of strings by saying how to assemble those strings from (potentially recursive) smaller parts. In other ways, they're like regular expressions: they involve simpler statements combined together by a small collection of operations that describe larger structures.
For example, here's a simple PEG for the same sort of arithmetic expressions given above:
E -> F + E / F
F -> T * F / T
T -> D* / (E)
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
To see what this says, let's look at the first line. Like a CFG, this line expresses a choice between two options: you can either replace E with F + E or with F. However, unlike a regular CFG, there is a specific ordering to these choices. Specifically, this PEG can be read as "first, try replacing E with F + E. If that works, great! And if that doesn't work, try replacing E with F. And if that works, great! And otherwise, we tried everything and it didn't work, so give up."
In that sense, PEGs directly encode into the grammar structure itself how the parsing is to be done. Whereas a CFG more abstractly says "an E may be replaced with any of the following," a PEG specifically says "to parse an E, first try this, then this, then this, etc." As a result, for any given string that a PEG can parse, the PEG can parse it exactly one way, since it stops trying options once the first parse is found.
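To make the ordered choices concrete, here is a minimal recursive-descent sketch of the PEG above in Python (a sketch only: the parse_X names are made up for illustration, whitespace handling is omitted, and D* is treated as requiring at least one digit). Each function returns the position just past what it matched, or None if its rule doesn't match at that position:

def parse_E(s, pos):
    # E -> F "+" E / F: try the first alternative, fall back to plain F.
    p = parse_F(s, pos)
    if p is not None and p < len(s) and s[p] == '+':
        q = parse_E(s, p + 1)
        if q is not None:
            return q
    return parse_F(s, pos)

def parse_F(s, pos):
    # F -> T "*" F / T
    p = parse_T(s, pos)
    if p is not None and p < len(s) and s[p] == '*':
        q = parse_F(s, p + 1)
        if q is not None:
            return q
    return parse_T(s, pos)

def parse_T(s, pos):
    # T -> D* / "(" E ")"   (digits handled as "one or more" here)
    if pos < len(s) and s[pos].isdigit():
        while pos < len(s) and s[pos].isdigit():
            pos += 1
        return pos
    if pos < len(s) and s[pos] == '(':
        q = parse_E(s, pos + 1)
        if q is not None and q < len(s) and s[q] == ')':
            return q + 1
    return None

print(parse_E("2+3*4", 0))   # prints 5: the whole string matched

Notice that when a rule's first alternative fails partway through, the work it did (for example, the call to parse_F at the start of parse_E) is simply redone for the second alternative; the memoization described later exists precisely to avoid that repeated work.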
PEGs, like CFGs, can take some time to get the hang of. For example, CFGs in the abstract - and many CFG parsing techniques - have no problem with left recursion. For instance, this CFG can be parsed with an LR(1) parser:
E -> E + F | F
F -> F * T | T
T -> (E) | N
N -> ND | D
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
However, the following PEG can't be parsed by a packrat parser (though later improvements to PEG parsing can correct this):
E -> E + F / F
F -> F * T / T
T -> (E) / D*
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Let's take a look at that first line. It says "to parse an E, first try reading an E, then a +, then an F. And if that fails, try reading an F." So how would it go about trying out that first option? The first step would be to try parsing an E, which would work by first trying to parse an E, and now we're caught in an infinite loop. Oops. This is called left recursion, and it also causes problems for CFGs when working with LL-family parsers.
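Written in the style of the sketch above (with parse_F as before), that first line would come out something like this, and the problem is immediate: parse_E calls itself at the same position before consuming any input.

def parse_E(s, pos):
    # E -> E "+" F / F: left-recursive, so this call re-enters parse_E
    # at the same position and the function never terminates.
    p = parse_E(s, pos)
    if p is not None and p < len(s) and s[p] == '+':
        q = parse_F(s, p + 1)
        if q is not None:
            return q
    return parse_F(s, pos)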
Another issue that comes up when designing PEGs is the need to get the ordered choices right. If you're coming from the Land of Context-Free Grammars, where choices are unordered, it's really easy to accidentally mess up a PEG. For example, consider this PEG:
E -> F / F + E
F -> T / T * F
T -> D+ / (E)
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Now, what happens if you try to parse the string 2 * 3 + 4? Well:
We try parsing an E, which first tries parsing an F.
We try parsing an F, which first tries parsing a T.
We try parsing a T, which first tries reading a series of digits. This succeeds in reading 2.
We've successfully read an F.
That means we've successfully read an E and should be done here, but there are leftover tokens, so the parse fails.
The issue here is that we tried parsing F before F + E, and similarly tried parsing T before T * F. As a result, we essentially bit off less than we could have, because we tried reading a shorter expression when a longer one was available.
Whether you find CFGs, with their attendant ambiguities and precedence declarations, easier or harder than PEGs, with their attendant choice orderings, is mostly a matter of personal preference. But many people report finding PEGs a bit easier to work with than CFGs because they map more mechanically onto what the parser should do. Rather than saying "here's an abstract description of the strings I want," you get to say "here's the order in which I'd like you to try things," which is a bit closer to how parsing often works.
The Packrat Parsing Algorithm
Compared with the algorithms used to build LR or LL parsing tables, the algorithm used by a packrat parser is conceptually quite simple. At a high level, a packrat parser begins with the start symbol, then tries the ordered choices, one at a time, in sequence until it finds one that works. As it works through those choices, it may find that it needs to match another nonterminal, in which case it recursively tries matching that nonterminal on the rest of the string. If a particular choice fails, the parser backtracks and then tries the next production.
Matching any one individual production isn't that hard. If you see a terminal, either it matches the next available terminal or it doesn't. If it does, great! Match it and move on. If not, report an error. If you see a nonterminal, then (recursively) match that nonterminal, and if it succeeds pick up with the rest of the search at the point after where the nonterminal finished matching.
This means that, more generally, the packrat parser works by trying to solve problems of the following form:
Given some position in the string and a nonterminal, determine how much of the string that nonterminal matches starting at that position (or report that it doesn't match at all.)
Here, notice that there's no ambiguity about what's meant by "how much of the string the nonterminal matches." Unlike a traditional CFG where a nonterminal might match at a given position in several different lengths, the ordered choices used in PEGs ensure that if there's some match starting at a given point, then there's exactly one match starting at that point.
If you've studied dynamic programming, you might realize that these subproblems can overlap one another. In fact, in a PEG with k nonterminals and a string of length n, there are only Θ(kn) possible distinct subproblems: one for each combination of a starting position and a nonterminal. This means that, in principle, you could use dynamic programming to precompute a table of all possible position/nonterminal parse matches and have a very fast parser. Packrat parsing essentially does this, but using memoization rather than dynamic programming. This means that it won't necessarily fill in all of the table entries, just the ones it actually encounters in the course of parsing the input.
Since each table entry can be filled in in constant time (for each nonterminal, there are only finitely many productions to try for a fixed PEG), the parser ends up running in linear time, matching the speed of an LR parser.
The drawback with this approach is the amount of memory used. Specifically, the memoization table may record multiple entries per position in the input string, requiring memory proportional to the size of the PEG times the length of the input string. Contrast this with LL or LR parsing, which only needs memory proportional to the size of the parsing stack, which is typically much smaller than the length of the full string.
That being said, the cost in memory is offset by not needing to learn the internal workings of the packrat parser. You can just read up on PEGs and take things from there.
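As an illustration, here is a minimal sketch of that memoization applied to the recursive-descent functions from the earlier sketch (the packrat decorator and its name are made up for illustration):

def packrat(parse_fn):
    # Memo table mapping (input string, position) -> result of this rule
    # at that position: the end position of the match, or None on failure.
    memo = {}
    def memoized(s, pos):
        key = (s, pos)
        if key not in memo:
            memo[key] = parse_fn(s, pos)
        return memo[key]
    return memoized

@packrat
def parse_E(s, pos):
    # E -> F "+" E / F, same body as before
    p = parse_F(s, pos)
    if p is not None and p < len(s) and s[p] == '+':
        q = parse_E(s, p + 1)
        if q is not None:
            return q
    return parse_F(s, pos)

# parse_F and parse_T would be wrapped the same way.

Each (rule, position) pair is now parsed at most once, which is where the linear running time comes from, and the memo tables are exactly where the extra memory goes.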
Hope this helps!
Pyparsing is a pure-Python parsing library that supports packrat parsing, so you can see how it is implemented. Pyparsing uses a memoizing technique to save previous parse attempts for a particular grammar expression at a particular location in the input text. If the grammar involves retrying that same expression at that location, it skips the expensive parsing logic and just returns the results or exception from the memoizing cache.
There is more info on the FAQ page of the pyparsing wiki, which also includes links back to Bryan Ford's original thesis on packrat parsing.
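For example, turning on packrat mode in pyparsing is a one-line call. The grammar below is just a small illustrative example; the call has historically been spelled enablePackrat, so check the docs for your installed version:

import pyparsing as pp

pp.ParserElement.enablePackrat()   # turn on packrat memoization globally

# A small arithmetic grammar; infixNotation handles precedence and parens.
expr = pp.infixNotation(pp.Word(pp.nums), [
    ('*', 2, pp.opAssoc.LEFT),
    ('+', 2, pp.opAssoc.LEFT),
])
print(expr.parseString("2 + 3 * 4"))   # '*' groups more tightly than '+'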
