I have really tried to get my head around the first steps of understanding Stratego/XT. I've googled a lot and all the web resources I have found seem to make a large enough leap at the beginning that I just can't make the connection. Let me explain.
I understand Abstract Syntax Trees like this:
Minus(Call(Var("f"),[Plus(Var("a"),Int("10"))]),Int("3"))
But then it seems (in the very next sentence even) the documents make this leap to this:
LetSplit :
Let([d1, d2 | d*], e*) ->
Let([d1], Let([d2 | d*], e*))
This makes no sense to me. Could someone explain what is going on here with LetSplit?
Also, is there a good resource for furthering a solid understanding of Stratego/XT that is easier to read that the garganutan and complex official "tutorial" on the Stratego/XT website?
Thanks!
LetSplit :
Let([d1, d2 | d*], e*) ->
Let([d1], Let([d2 | d*], e*))
This is a rewrite rule with the name LetSplit.
It is equivalent (syntactic sugar) to the strategy:
LetSplit =
?Let([d1, d2 | d*], e*) ; // match
!Let([d1], Let([d2 | d*], e*)) // build
When invoked, then, when the left hand side Let([d1, d2 | d*], e*) (the match part) matches the current term, the current term is replaced by the right hand side Let([d1], Let([d2 | d*], e*)) (the build part). When the left hand side does not match, the rule fails and the current term remains unchanged.
d1, d2, d*, e* are term variables bound to the sub-terms found at their respective positions during the match. The names are then used in the build part, where they expand to the sub-tree they were bound to before. Note that indeed, * and ' may appear at the end of term variable names. The single quote has no special meaning, while * has a special meaning in list build operations (not the case here).
The syntax [d1, d2 | d*] in the match part matches any list with at least two elements. These elements will be bound to d1 and d2 and the remaining elements in the list will be bound to d* (so d* will be a list, and may be the empty list []).
Also, is there a good resource for furthering a solid understanding of
Stratego/XT that is easier to read that the garganutan and complex
official "tutorial" on the Stratego/XT website?
Research papers. Though admittedly they aren't really easier to read, but arguably they are the only place where some of the more advanced concepts are explained.
Stratego/XT 0.17. A language and toolset for program transformation (may be a good starting point to find keywords to use in e.g. google scholar)
Program Transformation with Scoped Dynamic Rewrite Rules (scary, but contains a wealth of information about dynamic rewrite rules that is hard to find elsewhere)
more papers
Anyway feel free to ask more questions here on stackoverflow, I will try to answer them :-)
Related
Would anyone be able to explain to me what the forward slash '/' means in the context of this Prolog predicate. I've tried Googling it, reviewing other questions but I can't find a definitive answer, or at least one that makes sense to me. I'm aware of what arity is but I'm not sure this is related.
move_astar([Square | Path] / G / _, [NextSquare, Square | Path] / SumG / NewH) :-
square(Square, NextSquare, Distance),
not(member(NextSquare, Path)),
SumG is G + Distance,
heuristic(NextSquare, NewH).
It has no implicit meaning, and it is not the same as arity. Prolog terms are name(Arg1, Arg2) and / can be a name. /(Arg1, Arg2).
There is a syntax sugar which allows some names to be written inline, such as *(X,Y) as X * Y and /(X,Y) as X / Y which is useful so they look like arithmetic (but NB. this does not do arithmetic). Your code is using this syntax to keep three things together:
?- write_canonical([Square | Path] / G / _).
/(/([_|_],_),_)
That is, it has no more semantic meaning than xIIPANIKIIx(X,Y) would have, it's Ls/G/H kept together.
The / has no meaning here, except for being a structure and a left associative operator. So it is unrelated to division. Sometimes people like to add such decoration to their code. It is hard to tell if this serves anything from that single clause, but think of someone who just wants to refer to the list and the two values like in:
?- start(A), move_astar(A,A).
So here it would be much more compact to ask that question than to hand over each parameter manually.
Another use would be:
?- start(A), closure(move_astar, A,B).
Using closure/2. That is, existing predicates may expect a single argument.
It's an outdated style. It's bad because:
It conveys no useful information, apart from being a weird-looking delimiter
There's a slight performance hit for Prolog having to assemble and re-assemble those slash delimiters for parsing
It's better to either:
Keep parameters individual (and therefore fast to use)
Group parameters in a reasonably-named term, or e.g. v if brevity is more appropriate than classification of the term
I'm trying to understand how to create NFA-s from regular expressions, but I am really confused from epsilon transitions. I have this example in my textbook , but I don't understand why epsilon transitions are used and how does one know when to use them.
In general, espilon-transitions are used when they are convenient. For example, when constructing an NFA from a regular expression, you start by constructing small parts of the automaton corresponding to parts of the expression. To connect them, you need to put a transition. But if there is no symbol to be read there, an epsilon transition is a simple way to do this. They are, however never necessary, you can always find a solution without them.
In your example, just apply the algorithm described in your textbook. It tells you when to use them.
The epsilon transitions
from 1 to 2 probably connects the parts for (a|b)* and for ac
1->5 and 8->1 probably result from the *
5->6 and 5->7 probably result from the alternative in |
Epsilon-transitions in NFAs are a natural representation of choice or disjunction or union in regular expressions. That is, a regular expression like r + s (or r | s or r U s depending on your preferred notation) is naturally represented as an NFA consisting of two independent NFAs, one for r and one for s, joined using e-transitions as follows:
e
----->q0----->(r)
|
| e
|
V
(s)
When used to connect states in more complicated ways, the effect may not be as easy or natural to describe, but essentially these transitions let you choose unconditionally among multiple options. So, if I have seen a part of the input already and there are a few different ways the string could end, I can represent that by using e-transitions to states that handle the different possibilities.
In your example, the e-transitions are not really serving any very useful function and are merely artifacts of the conversion algorithm you have used. That algorithm includes them because, in the general case, they may be useful or necessary. In your specific case this was not true, so they look out of place.
Consider the following trivial data structure:
data Step =
Match Char |
Options [Pattern]
type Pattern = [Step]
This is used together with a small function
match :: Pattern -> String -> Bool
match [] _ = True
match _ "" = False
match (s:ss) (c:cs) =
case s of
Match c0 -> (c == c0) && (match ss cs)
Options ps -> any (\ p -> match (p ++ ss) (c:cs)) ps
It should be fairly obvious what is going on here; a Pattern either does or does not match a given String based on the steps it contains. Each Step either matches a single character (Match), or it consists of a list of possible sub-patterns. (Note well: sub-patterns are not necessarily of equal length!)
Suppose we have a pattern such as this:
[
Match '*',
Options
[
[Match 'F', Match 'o', Match 'o'],
[Match 'F', Match 'o', Match 'b']
],
Match '*'
]
This pattern matches two possible strings, *Foo* and *Fob*. Clearly we can "optimise" this into
[Match '*', Match 'F', Match 'o', Options [[Match 'o'], [Match 'b']], Match '*']
My question: How do I write the function to do this?
More generally, a given Options constructor may have an arbitrary number of sub-paths, of wildly different lengths, some with common prefixes and suffixes, and some without. It's even possible to have empty sub-paths, or even to do something like Options [] (which is of course no-op). I'm struggling to write a function which will reduce every possible input correctly...
On cursory inspection this looks like you've defined a nondeterministic finite state automata. NFA were first defined by Michael O. Rabin, and of all peope—Dana Scott, who has brought us much else as well!
This is an automata because it is built out of steps, with transitions between them, based on acceptance states. At each step you have many possible transitions. Hence your automata is nondeterministic. Now you want to optimize this. One way to optimize it (not the way you're asking for, but related) is to eliminate backtracking. You can do this by taking every combination of how you get to a state along with the state itself. This is known as the powerset construction: http://en.wikipedia.org/wiki/Powerset_construction
The wikipedia article is actually pretty good -- and in a language like Haskell we can first define the full powerset DFA, then lazily traverse all genuine paths to "strip out" most of the unreachable cruft. That gets us to a decent DFA, but not necessarily a minimal one.
As described at the bottom of that article, we can use Brzozowski's algorithm, flipping all the arrows and getting a new NFA that describes going from end to initial states. Now if we were minimizing a DFA, we'd need to go from there back to the DFA again, then flip the arrows and do it all again. This isn't necessarily the fastest approach, but its straightforward and works well enough for plenty of cases. There are plenty of better algorithms available as well: http://en.wikipedia.org/wiki/DFA_minimization
For minimizing an NFA, there are a variety of approaches, but the problem is in general np-hard, so you'll have to pick some poison :-)
Of course all this is assuming you have a full NFA. If you have mutually recursive definitions, then you can put a pattern "inside" itself, and you sure do. That said, you'll then need to use clever tricks to recover the explicit shared structure in order to even begin working with an NFA in this form -- otherwise you'll loop forever.
If you insert a "no sharing" rule -- i.e. the directed graph of your NFA is not only acyclic, but branches never 'merge back' except when you exit an 'options' set, then I'd imagine that simplification is a much more straightforward affair, just 'factoring out' common characters. Since this involves thinking and not just providing references, I'll leave it there for now, just noting that this article might somehow be of interest: http://matt.might.net/articles/parsing-with-derivatives/
p.s.
A stab at the "factoring" solution is a function with the following type:
factor :: [Pattern] -> (Maybe Step, [Pattern])
factor = -- pulls out a common element of the pattern head, should one exist. shallow.
factorTail = -- same, but pulling out of the pattern tail
simplify :: [Pattern] -> [Pattern]
simplify = -- remove redundant constructs, such as options composed only of other options, which can be flattened out, options with no elements that are the "only" option, etc. should run "deep" all levels down.
Now you can start at the lowest level and cycle (simplify . factor) until you have no new factors. Then do so with (simplify . factorTail). Then go one level up, do the same thing. I wouldn't be shocked if you couldn't "trick" this into a nonminimal solution, but I think for most cases it will work very well.
Update: What this solution doesn't address is something where you have e.g. Options ["--DD--", "++DD++"] (reading strings as list of matches), and so you have uncommon structure in both the head and tail but not in the middle. A more general solution in such an instance would be to pull out the least common substring between all matches in your list, and use that as the "frame" with options inserted in the sections where things differ.
The problem
Given a context-free grammar with arbitrary rules and a stream of tokens, how can stream fragments that match the grammar be identified effectively?
Example:
Grammar
S -> ASB | AB
A -> a
B -> b
(So essentially, a number of as followed by an equal number of bs)
Stream:
aabaaabbc...
Expected result:
Match starting at position 1: ab
Match starting at position 4: aabb
Of course the key is "effectively". without testing too many hopeless candidates for too long. The only thing I know about my data is that although the grammar is arbitrary, in practice matching sequences will be relatively short (<20 terminals) while the stream itself will be quite long (>10000 terminals).
Ideally I'd also want a syntax tree but that's not too important, because once the fragment is identified, I can run an ordinary parser over it to obtain the tree.
Where should I start? Which type of parser can be adapted to this type of work?
"Arbitrary grammar" makes me suggest you look at wberry's comment.
How complex are these grammars? Is there a manual intervention step?
I'll make an attempt. If I modified your example grammar from:
S -> ASB | AB
A -> a
B -> b
to include:
S' -> S | GS' | S'GS' | S'G
G -> sigma*
So that G = garbage and S' is many S fragments with garbage in between (I may have been careless with my production rules. You get the idea), I think we can solve your problem. You just need a parser that will match other rules before G. You may have to modify these production rules based on the parser. I almost guarantee that there will be rule ordering changes depending on the parser. Since most parser libraries separate lexing from parsing, you'll probably need a catch-all lexeme followed by modifying G to include all possible lexemes. Depending on your specifics, this might not be any better (efficiency-wise) than just starting each attempt at each spot in the stream.
But... Assuming my production rules are fixed (both for correctness and for the particular flavor of parser), this should not only match fragments in the stream, but it should give you a parse tree for the whole stream. You are only interested in subtrees rooted in nodes of type S.
I know and use bison/yacc. But in parsing world, there's a lot of buzz around packrat parsing.
What is it? Is it worth studing?
Packrat parsing is a way of providing asymptotically better performance for parsing expression grammars (PEGs); specifically for PEGs, linear time parsing can be guaranteed.
Essentially, Packrat parsing just means caching whether sub-expressions match at the current position in the string when they are tested -- this means that if the current attempt to fit the string into an expression fails then attempts to fit other possible expressions can benefit from the known pass/fail of subexpressions at the points in the string where they have already been tested.
At a high level:
Packrat parsers make use of parsing expression grammars (PEGs) rather than traditional context-free grammars (CFGs).
Through their use of PEGs rather than CFGs, it's typically easier to set up and maintain a packrat parser than a traditional LR parser.
Due to how they use memoization, packrat parsers typically use more memory at runtime than "classical" parsers like LALR(1) and LR(1) parsers.
Like classical LR parsers, packrat parsers run in linear time.
In that sense, you can think of a packrat parser as a simplicity/memory tradeoff with LR-family parsers. Packrat parsers require less theoretical understanding of the parser's inner workings than LR-family parsers, but use more resources at runtime. If you're in an environment where memory is plentiful and you just want to throw a simple parser together, packrat parsing might be a good choice. If you're on a memory-constrained system or want to get maximum performance, it's probably worth investing in an LR-family parser.
The rest of this answer gives a slightly more detailed overview of packrat parsers and PEGs.
On CFGs and PEGs
Many traditional parsers (and many modern parsers) make use of context-free grammars. A context-free grammar consists of a series of rules like the ones shown here:
E -> E * E | E + E | (E) | N
N -> D | DN
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
For example, the top line says that the nonterminal E can be replaced either with E * E, or E + E, or (E), or with N. The second line says that N can be replaced with either D or DN. The last line says that D can be replaced with any single digit.
If you start with the string E and follow the rules from the above grammar, you can generate any mathematical expression using +, *, parentheses, and single digits.
Context-free grammars are a compact way to represent a collection of strings. They have a rich and well-understood theory. However, they have two main drawbacks. The first one is that, by itself, a CFG defines a collection of strings, but doesn't tell you how to check whether a particular string is generated by the grammar. This means that whether a particular CFG will lend itself to a nice parser depends on the particulars of how the parser works, meaning that the grammar author may need to familiarize themselves with the internal workings of their parser generator to understand what restrictions are placed on the sorts of grammar structures can arise. For example, LL(1) parsers don't allow for left-recursion and require left-factoring, while LALR(1) parsers require some understanding of the parsing algorithm to eliminate shift/reduce and reduce/reduce conflicts.
The second, larger problem is that grammars can be ambiguous. For example, the above grammar generates the string 2 + 3 * 4, but does so in two ways. In one way, we essentially get the grouping 2 + (3 * 4), which is what's intended. The other one gives us (2 + 3) * 4, which is not what's meant. This means that grammar authors either need to ensure that the grammar is unambiguous or need to introduce precedence declarations auxiliary to the grammar to tell the parser how to resolve the conflicts. This can be a bit of a hassle.
Packrat parsers make use of an alternative to context-free grammars called parsing expression grammars (PEGs). Parsing expression grammars in some ways resemble CFGs - they describe a collection of strings by saying how to assemble those strings from (potentially recursive) smaller parts. In other ways, they're like regular expressions: they involve simpler statements combined together by a small collection of operations that describe larger structures.
For example, here's a simple PEG for the same sort of arithmetic expressions given above:
E -> F + E / F
F -> T * F / T
T -> D* / (E)
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
To see what this says, let's look at the first line. Like a CFG, this line expresses a choice between two options: you can either replace E with F + E or with F. However, unlike a regular CFG, there is a specific ordering to these choices. Specifically, this PEG can be read as "first, try replacing E with F + E. If that works, great! And if that doesn't work, try replacing E with F. And if that works, great! And otherwise, we tried everything and it didn't work, so give up."
In that sense, PEGs directly encode into the grammar structure itself how the parsing is to be done. Whereas a CFG more abstractly says "an E may be replaced with any of the following," a PEG specifically says "to parse an E, first try this, then this, then this, etc." As a result, for any given string that a PEG can parse, the PEG can parse it exactly one way, since it stops trying options once the first parse is found.
PEGs, like CFGs, can take some time to get the hang of. For example, CFGs in the abstract - and many CFG parsing techniques - have no problem with left recursion. For example, this CFG can be parsed with an LR(1) parser:
E -> E + F | F
F -> F * T | T
T -> (E) | N
N -> ND | D
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
However, the following PEG can't be parsed by a packrat parser (though later improvements to PEG parsing can correct this):
E -> E + F / F
F -> F * T / T
T -> (E) / D*
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Let's take a look at that first line. The first line says "to parse an E, first try reading an E, then a +, then an F. And if that fails, try reading an F." So how would it then go about trying out that first option? The first step would be to try parsing an E, which would work by first trying to parse an E, and now we're caught in an infinite loop. Oops. This is called left recursion and also shows up in CFGs when working with LL-family parsers.
Another issue that comes up when designing PEGs is the need to get the ordered choices right. If you're coming from the Land of Context-Free Grammars, where choices are unordered, it's really easy to accidentally mess up a PEG. For example, consider this PEG:
E -> F / F + E
F -> T / T * F
T -> D+ / (E)
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Now, what happens if you try to parse the string 2 * 3 + 4? Well:
We try parsing an E, which first tries parsing an F.
We trying parsing an F, which first tries parsing a T.
We try parsing a T, which first tries reading a series of digits. This succeeds in reading 2.
We've successfully read an F.
So we've successfully read an E, so we should be done here, but there are leftover tokens and the parse fails.
The issue here is that we first tried parsing F before F + E, and similarly first tried parsing T before parsing T * F. As a result, we essentially bit off less than we could check, because we tried reading a shorter expression before a longer one.
Whether you find CFGs, with attending ambiguities and precedence declarations, easier or harder than PEGs, with attending choice orderings, is mostly a matter of personal preference. But many people report finding PEGs a bit easier to work with than CFGs because they more mechanically map onto what the parser should do. Rather than saying "here's an abstract description of the strings I want," you get to say "here's the order in which I'd like you to try things," which is a bit closer to how parsing often works.
The Packrat Parsing Algorithm
Compared with the algorithms to build LR or LL parsing tables, the algorithm used by a packrat parsing is conceptually quite simple. At a high level, a packrat parser begins with the start symbol, then tries the ordered choices, one at a time, in sequence until it finds one that works. As it works through those choices, it may find that it needs to match another nonterminal, in which case it recursively tries matching that nonterminal on the rest of the string. If a particular choice fails, the parser backtracks and then tries the next production.
Matching any one individual production isn't that hard. If you see a terminal, either it matches the next available terminal or it doesn't. If it does, great! Match it and move on. If not, report an error. If you see a nonterminal, then (recursively) match that nonterminal, and if it succeeds pick up with the rest of the search at the point after where the nonterminal finished matching.
This means that, more generally, the packrat parser works by trying to solve problems of the following form:
Given some position in the string and a nonterminal, determine how much of the string that nonterminal matches starting at that position (or report that it doesn't match at all.)
Here, notice that there's no ambiguity about what's meant by "how much of the string the nonterminal matches." Unlike a traditional CFG where a nonterminal might match at a given position in several different lengths, the ordered choices used in PEGs ensure that if there's some match starting at a given point, then there's exactly one match starting at that point.
If you've studied dynamic programming, you might realize that these subproblems might overlap one another. In fact, in a PEG with k nonterminals and a string of length n, there are only Θ(kn) possible distinct subproblems: one for each combination of a starting position and a nonterminal. This means that, in principle, you could use dynamic programming to precompute a table of all possible position/nonterminal parse matches and have a very fast parser. Packrat parsing essentially does this, but using memoization rather than dynamic programming. This means that it won't necessarily try filling all table entries, just the ones that it actually encounters in the course of parsing the grammar.
Since each table entry can be filled in in constant time (for each nonterminal, there are only finitely many productions to try for a fixed PEG), the parser ends up running in linear time, matching the speed of an LR parser.
The drawback with this approach is the amount of memory used. Specifically, the memoization table may record multiple entries per position in the input string, requiring memory usage proportional to both the size of the PEG and the length of the input string. Contrast this with LL or LR parsing, which only needs memory proportional to the size of the parsing stack, which is typically much smaller than the length of the full string.
That being said, the tradeoff here in worse memory performance is offset by not needing to learn the internal workings of how the packrat parser works. You can just read up on PEGs and take things from there.
Hope this helps!
Pyparsing is a pure-Python parsing library that supports packrat parsing, so you can see how it is implemented. Pyparsing uses a memoizing technique to save previous parse attempts for a particular grammar expression at a particular location in the input text. If the grammar involves retrying that same expression at that location, it skips the expensive parsing logic and just returns the results or exception from the memoizing cache.
There is more info here at the FAQ page of the pyparsing wiki, which also includes links back to Bryan Ford's original thesis on packrat parsing.