Merging duplicate path nodes - algorithm

Consider the following trivial data structure:
data Step = Match Char
          | Options [Pattern]

type Pattern = [Step]
This is used together with a small function
match :: Pattern -> String -> Bool
match [] _ = True
match _ "" = False
match (s:ss) (c:cs) =
  case s of
    Match c0   -> (c == c0) && match ss cs
    Options ps -> any (\p -> match (p ++ ss) (c:cs)) ps
It should be fairly obvious what is going on here; a Pattern either does or does not match a given String based on the steps it contains. Each Step either matches a single character (Match), or it consists of a list of possible sub-patterns. (Note well: sub-patterns are not necessarily of equal length!)
Suppose we have a pattern such as this:
[ Match '*',
  Options
    [ [Match 'F', Match 'o', Match 'o'],
      [Match 'F', Match 'o', Match 'b'] ],
  Match '*'
]
This pattern matches two possible strings, *Foo* and *Fob*. Clearly we can "optimise" this into
[Match '*', Match 'F', Match 'o', Options [[Match 'o'], [Match 'b']], Match '*']
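For instance, with the definitions above loaded in GHCi (the names pattern1 and pattern2 below are just for illustration), both forms accept exactly the same strings:

pattern1, pattern2 :: Pattern
pattern1 = [ Match '*'
           , Options [ [Match 'F', Match 'o', Match 'o']
                     , [Match 'F', Match 'o', Match 'b'] ]
           , Match '*' ]
pattern2 = [ Match '*', Match 'F', Match 'o'
           , Options [[Match 'o'], [Match 'b']]
           , Match '*' ]

-- map (match pattern1) ["*Foo*", "*Fob*", "*Faz*"]  ==  [True, True, False]
-- map (match pattern2) ["*Foo*", "*Fob*", "*Faz*"]  ==  [True, True, False]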
My question: How do I write the function to do this?
More generally, a given Options constructor may have an arbitrary number of sub-paths, of wildly different lengths, some with common prefixes and suffixes, and some without. It's even possible to have empty sub-paths, or even to do something like Options [] (which is of course a no-op). I'm struggling to write a function which will reduce every possible input correctly...

On cursory inspection this looks like you've defined a nondeterministic finite automaton (NFA). NFAs were first defined by Michael O. Rabin and, of all people, Dana Scott, who has brought us much else as well!
It is an automaton because it is built out of states with transitions between them, plus designated acceptance states. At each step you may have several possible transitions, hence your automaton is nondeterministic. Now you want to optimize this. One way to optimize it (not the way you're asking for, but related) is to eliminate backtracking. You can do this by tracking, at each point in the input, the set of states you could possibly be in. This is known as the powerset construction: http://en.wikipedia.org/wiki/Powerset_construction
The wikipedia article is actually pretty good -- and in a language like Haskell we can first define the full powerset DFA, then lazily traverse all genuine paths to "strip out" most of the unreachable cruft. That gets us to a decent DFA, but not necessarily a minimal one.
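For concreteness, here is a minimal sketch of the subset construction, assuming the NFA is given simply as a transition function from a state and an input symbol to a set of states (the representation and names are mine, not the question's):

import qualified Data.Set as Set
import Data.Set (Set)

-- An NFA transition function: from one state, on one symbol, to a set of states.
type NFATrans q c = q -> c -> Set q

-- One step of the subset-construction DFA: a DFA state is a set of NFA states.
dfaStep :: Ord q => NFATrans q c -> Set q -> c -> Set q
dfaStep delta qs c = Set.unions [ delta q c | q <- Set.toList qs ]

-- Run the DFA over a word from the set of initial NFA states; accept if the
-- final state set intersects the NFA's accepting states.
acceptsDFA :: Ord q => NFATrans q c -> Set q -> Set q -> [c] -> Bool
acceptsDFA delta accepting starts w =
  not (Set.null (accepting `Set.intersection` foldl (dfaStep delta) starts w))

Here the full DFA is never materialized at all; each input word just drives the current state set forward, which is one way of avoiding most of the unreachable parts of the powerset automaton.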
As described at the bottom of that article, we can use Brzozowski's algorithm, flipping all the arrows to get a new NFA that describes going from the end states back to the initial state. Now if we were minimizing a DFA, we'd need to go from there back to a DFA again, then flip the arrows and do it all again. This isn't necessarily the fastest approach, but it's straightforward and works well enough for plenty of cases. There are plenty of better algorithms available as well: http://en.wikipedia.org/wiki/DFA_minimization
For minimizing an NFA there are a variety of approaches, but the problem is NP-hard in general, so you'll have to pick your poison :-)
Of course, all this assumes you have a finite NFA to start with. If you have mutually recursive definitions, then you can put a pattern "inside" itself, and with this representation you certainly can. In that case you'll need to use clever tricks to recover the explicit shared structure in order to even begin working with the NFA in this form -- otherwise you'll loop forever.
If you insert a "no sharing" rule -- i.e. the directed graph of your NFA is not only acyclic, but branches never 'merge back' except when you exit an 'options' set -- then I'd imagine that simplification is a much more straightforward affair: just 'factoring out' common characters. Since this involves thinking and not just providing references, I'll leave it there for now, just noting that this article might somehow be of interest: http://matt.might.net/articles/parsing-with-derivatives/
p.s.
A stab at the "factoring" solution starts with functions of the following types:
factor :: [Pattern] -> (Maybe Step, [Pattern])
factor = ...      -- pulls out a common element of the pattern heads, should one exist; shallow

factorTail :: [Pattern] -> (Maybe Step, [Pattern])
factorTail = ...  -- the same, but pulling out of the pattern tails

simplify :: [Pattern] -> [Pattern]
simplify = ...    -- removes redundant constructs, such as options composed only of other options
                  -- (which can be flattened out), options with no elements that are the "only"
                  -- option, etc.; should run "deep", all levels down
Now you can start at the lowest level and cycle (simplify . factor) until there are no new factors. Then do the same with (simplify . factorTail). Then go one level up and repeat. I wouldn't be shocked if you could "trick" this into producing a nonminimal solution, but I think for most cases it will work very well.
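A rough, untested sketch of those three under the types above (it assumes a deriving Eq on Step so that heads can be compared, and only shows one shallow layer of simplification):

factor :: [Pattern] -> (Maybe Step, [Pattern])
factor ps =
  case ps of
    ((s:_):_) | all (headIs s) ps -> (Just s, map tail ps)
    _                             -> (Nothing, ps)
  where
    headIs s (x:_) = x == s
    headIs _ []    = False

factorTail :: [Pattern] -> (Maybe Step, [Pattern])
factorTail ps =
  case factor (map reverse ps) of
    (Just s, ps') -> (Just s, map reverse ps')
    (Nothing, _)  -> (Nothing, ps)

-- One layer of simplification: inline single-alternative Options and recurse
-- into the alternatives; a fuller version would also handle the other cases
-- listed above (nested Options, empty alternatives, and so on).
simplify :: [Pattern] -> [Pattern]
simplify = map (concatMap flatten)
  where
    flatten (Options [p]) = concatMap flatten p
    flatten (Options ps)  = [Options (map (concatMap flatten) ps)]
    flatten s             = [s]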
Update: What this solution doesn't address is something where you have e.g. Options ["--DD--", "++DD++"] (reading strings as lists of matches), so that you have uncommon structure in the head and tail but common structure in the middle. A more general solution in such a case would be to pull out the longest common substring among all the alternatives in your list and use that as the "frame", with options inserted in the sections where they differ.

Why are epsilon transitions used in NFA?

I'm trying to understand how to create NFAs from regular expressions, but I am really confused by epsilon transitions. I have this example in my textbook, but I don't understand why epsilon transitions are used or how one knows when to use them.
In general, epsilon-transitions are used when they are convenient. For example, when constructing an NFA from a regular expression, you start by constructing small parts of the automaton corresponding to parts of the expression. To connect them, you need to put in a transition. But if there is no symbol to be read there, an epsilon transition is a simple way to do this. They are, however, never necessary; you can always find a solution without them.
In your example, just apply the algorithm described in your textbook. It tells you when to use them.
The epsilon transitions:
- from 1 to 2 probably connects the parts for (a|b)* and for ac
- 1->5 and 8->1 probably result from the *
- 5->6 and 5->7 probably result from the alternative in |
Epsilon-transitions in NFAs are a natural representation of choice or disjunction or union in regular expressions. That is, a regular expression like r + s (or r | s or r U s depending on your preferred notation) is naturally represented as an NFA consisting of two independent NFAs, one for r and one for s, joined using e-transitions as follows:
             e
   ----->q0----->(r)
          |
          | e
          |
          v
         (s)
When used to connect states in more complicated ways, the effect may not be as easy or natural to describe, but essentially these transitions let you choose unconditionally among multiple options. So, if I have seen a part of the input already and there are a few different ways the string could end, I can represent that by using e-transitions to states that handle the different possibilities.
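To make the union construction concrete, here is a minimal Haskell sketch, assuming an NFA carries an explicit epsilon-transition function (the representation and names are illustrative, not from the answer):

import qualified Data.Set as Set
import Data.Set (Set)

data NFA q c = NFA
  { start     :: q
  , accepting :: Set q
  , delta     :: q -> c -> Set q   -- ordinary transitions
  , epsilon   :: q -> Set q        -- epsilon transitions
  }

-- Union of two NFAs: a fresh start state (Nothing) with epsilon transitions
-- into the start states of both machines, exactly as in the diagram above.
union :: (Ord q1, Ord q2) => NFA q1 c -> NFA q2 c -> NFA (Maybe (Either q1 q2)) c
union m1 m2 = NFA
  { start     = Nothing
  , accepting = Set.map (Just . Left)  (accepting m1)
                `Set.union` Set.map (Just . Right) (accepting m2)
  , delta     = d
  , epsilon   = e
  }
  where
    d Nothing          _ = Set.empty
    d (Just (Left q))  c = Set.map (Just . Left)  (delta m1 q c)
    d (Just (Right q)) c = Set.map (Just . Right) (delta m2 q c)
    e Nothing          = Set.fromList [Just (Left (start m1)), Just (Right (start m2))]
    e (Just (Left q))  = Set.map (Just . Left)  (epsilon m1 q)
    e (Just (Right q)) = Set.map (Just . Right) (epsilon m2 q)

Running such a machine just requires taking the epsilon-closure of the current state set around each ordinary step, as with any NFA containing e-transitions.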
In your example, the e-transitions are not really serving any very useful function and are merely artifacts of the conversion algorithm you have used. That algorithm includes them because, in the general case, they may be useful or necessary. In your specific case this was not true, so they look out of place.

Stem comparison algorithm

I'm writing a program that performs word declension for the Polish language. In this language, stems can vary in some cases (because of palatalization, mobile/fleeting e, and other effects).
For example, we have the word "karzeł", which is the basic dictionary form. Its stem is also 'karzeł'. But the genitive form of this word is "karła", and its stem is "karł". We can see here that the 'e' disappeared and 'rz' changed to 'r'.
Another example:
'uzda' -> stem 'uzd'
'uździe' -> stem 'uździ'
Alternation: 'zd' -> 'ździ'
I'd like to store only the basic form of the stem in the dictionary ('karzeł' and 'uzd'), so that when my program gets the stem 'karł' or 'uździ' it will find the proper basic stem. Alternations take place only at the end of the stem and involve at most 4 of its letters.
Is there any algorithm that could do that? Levenshtein distance treats all letters equally, so if I type the word 'barzeł' then the distance to the stem 'karzeł' will be less than to the stem 'karł'.
I have also thought about neural networks, but I'm not sure how to encode the words (give each stem variation a different id?).
Another idea is to write an algorithm that does something like reversed alternation, creating a set of possible stems, and then tries to find them in the dictionary.
I would like to highlight that I only want to store the basic form of the stem and compute everything else on the fly.
First of all, I remember seeing a number of projects on Polish morphology around. So I would look at them first, before starting one of your own.
Regarding Levenshtein, as Pierre correctly noted in the comment, the distance function can be customized. And it should be. Let me put it this way: think of Levenshtein not as an algorithm in and of itself, but as a solution to a specific error model. The model says that when you are typing a word, every letter can be either dropped or replaced by another one due to some random process (fingers not pressing the right keys). The algorithm is then just a generator of maximum-likelihood solutions under this model: the more errors you allow, the smaller the probability of that sequence of errors actually happening, and the bigger the score.
You (implicitly) state a very different hypothesis, though: that Polish stems may have a certain flexibility at the end (some linguistic process that you do not fully understand within this framework). Then, when you strip your suffix (or something that looks like one), there are three options:
1) there is a chance that what you have here is just a different form of a stem you have stored in your dictionary, or
2) it is a completely different stem, or
3) you've stripped your suffix improperly and what you have is not a stem at all.
You can heuristically estimate these probabilities by looking at how many letters in the beginning of the supposed stem match some dictionary entries, for example (how to find these entries is a related but different question). And then you can pick the guess that is the most plausible according to your metric/heuristic.
Now, note that you can use any algorithm to find the candidates in the dictionary. Including the Levenshtein algorithm - as long as you are reasonably sure that the right ones will be picked up. But obviously you are better off writing your own dictionary search algorithm that follows your own metric or emulates it. For example, by giving the biggest/prohibitive cost to the change of letters in the beginning of the word and reducing it as you go towards the end.
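As one concrete illustration of that last idea (a sketch under my own assumptions, not the answer's algorithm), here is a position-weighted edit distance in Haskell in which edits near the beginning of a word cost the most and edits near the end cost the least:

import Data.Array

-- Weighted edit distance: the cost of touching position i of an n-letter word
-- falls off linearly towards the end, so stem-final alternations are cheap and
-- stem-initial differences are expensive. The weighting is illustrative only.
weightedDistance :: String -> String -> Double
weightedDistance a b = table ! (la, lb)
  where
    la = length a
    lb = length b
    xs = listArray (1, la) a
    ys = listArray (1, lb) b
    w n i = fromIntegral (n - i + 1) / fromIntegral n
    table = array ((0, 0), (la, lb))
              [ ((i, j), cell i j) | i <- [0 .. la], j <- [0 .. lb] ]
    cell 0 j = sum [ w lb k | k <- [1 .. j] ]          -- insert first j letters of b
    cell i 0 = sum [ w la k | k <- [1 .. i] ]          -- delete first i letters of a
    cell i j
      | xs ! i == ys ! j = table ! (i - 1, j - 1)
      | otherwise =
          minimum [ table ! (i - 1, j - 1) + w la i    -- substitution
                  , table ! (i - 1, j)     + w la i    -- deletion
                  , table ! (i, j - 1)     + w lb j ]  -- insertion

With a weighting like this, 'karł' should come out closer to 'karzeł' than a stem-initial corruption such as 'barzeł' does, which is the behaviour the question asks for, at least for this particular choice of weights.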

Left-recursive Grammar Identification

Often we would like to refactor a context-free grammar to remove left-recursion. There are numerous algorithms to implement such a transformation; for example here or here.
Such algorithms will restructure a grammar regardless of the presence of left-recursion. This has negative side-effects, such as producing different parse trees from the original grammar, possibly with different associativity. Ideally a grammar would only be transformed if it was absolutely necessary.
Is there an algorithm or tool to identify the presence of left recursion within a grammar? Ideally this might also classify subsets of production rules which contain left recursion.
There is a standard algorithm for identifying nullable non-terminals, which runs in time linear in the size of the grammar (see below). Once you've done that, you can construct the relation A potentially-starts-with B over all non-terminals A, B. (In fact, it's more normal to construct that relationship over all grammatical symbols, since it is also used to construct FIRST sets, but in this case we only need the projection onto non-terminals.)
Having done that, left-recursive non-terminals are all A such that A potentially-starts-with+ A, where potentially-starts-with+ is:
potentially-starts-with ∘ potentially-starts-with*
You can use any transitive closure algorithm to compute that relation.
For reference, here is how to detect the nullable non-terminals:
1) Remove all useless symbols.
2) Attach a pointer to every production, initially at the first position.
3) Put all the productions into a workqueue.
4) While possible, find a production to which one of the following applies:
   - If the left-hand side of the production has been marked as an ε-non-terminal, discard the production.
   - If the token immediately to the right of the pointer is a terminal, discard the production.
   - If there is no token immediately to the right of the pointer (i.e., the pointer is at the end), mark the left-hand side of the production as an ε-non-terminal and discard the production.
   - If the token immediately to the right of the pointer is a non-terminal which has been marked as an ε-non-terminal, advance the pointer one token to the right and return the production to the workqueue.
Once it is no longer possible to select a production from the workqueue, all ε-non-terminals have been identified.
Just for fun, a trivial modification of the above algorithm can be used to do step 1. I'll leave it as an exercise (it's also an exercise in the dragon book). Also left as an exercise is the way to make sure the above algorithm executes in linear time.
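Putting the pieces together, here is a small Haskell sketch of the whole detection. It uses a simple fixed-point computation for nullability instead of the linear-time workqueue above, and it skips step 1 (removing useless symbols); the grammar representation is mine, not the answer's:

import qualified Data.Set as Set
import Data.Set (Set)

data Symbol = T String | N String deriving (Eq, Ord, Show)
type Production = (String, [Symbol])     -- left-hand side, right-hand side

-- Nullable non-terminals, by iterating to a fixed point.
nullables :: [Production] -> Set String
nullables prods = go Set.empty
  where
    go known
      | next == known = known
      | otherwise     = go next
      where
        next = Set.fromList
                 [ lhs | (lhs, rhs) <- prods, all (nullableSym known) rhs ]
    nullableSym k (N x) = x `Set.member` k
    nullableSym _ (T _) = False

-- A potentially-starts-with B: B is a non-terminal at the front of some
-- right-hand side of A, possibly preceded only by nullable non-terminals.
startsWith :: [Production] -> Set (String, String)
startsWith prods = Set.fromList
    [ (lhs, b) | (lhs, rhs) <- prods, b <- leading rhs ]
  where
    nulls = nullables prods
    leading (N x : rest)
      | x `Set.member` nulls = x : leading rest
      | otherwise            = [x]
    leading _ = []

-- Naive transitive closure; any closure algorithm will do.
closure :: Ord a => Set (a, a) -> Set (a, a)
closure rel
  | rel' == rel = rel
  | otherwise   = closure rel'
  where
    rel' = rel `Set.union` Set.fromList
             [ (a, c) | (a, b) <- Set.toList rel, (b', c) <- Set.toList rel, b == b' ]

-- Left-recursive non-terminals: those that potentially-start-with+ themselves.
leftRecursive :: [Production] -> Set String
leftRecursive prods =
  Set.fromList [ a | (a, b) <- Set.toList (closure (startsWith prods)), a == b ]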

Mathematica's pattern matching poorly optimized?

I recently inquired about why PatternTest was causing a multitude of needless evaluations: PatternTest not optimized? Leonid replied that it is necessary for what seems to me to be a rather questionable method. I can accept that, though I would prefer a more efficient alternative.
I now realize, which I believe Leonid has been saying for some time, that this problem runs much deeper in Mathematica, and I am troubled. I cannot understand why this is not or cannot be better optimized.
Consider this example:
list = RandomReal[9, 20000];
Head /@ list; // Timing
MatchQ[list, {x__Integer, y__}] // Timing
{0., Null}
{1.014, False}
Checking the heads of the list is essentially instantaneous, yet checking the pattern takes over a second. Surely Mathematica could recognize that since the first element of the list is not an Integer, the pattern cannot match, and unlike the case with PatternTest I cannot see how there is any mutability in the pattern. What is the explanation for this?
There appears to be some confusion regarding packed arrays, which as far as I can tell have no bearing on this question. Rather, I am concerned with the O(n^2) time complexity on all lists, packed or unpacked.
MatchQ unpacks for these kinds of tests. The reason is that no special case for this has been implemented. In principle the list could contain anything.
On["Packing"]
MatchQ[list, {x_Integer, y__}] // Timing
MatchQ[list, {x__Integer, y__}] // Timing
Improving this is very tricky - if you break the pattern matcher you have a serious problem.
Edit 1:
It is true that the unpacking is not the cause of the O(n^2) complexity. It does, however, show that for the MatchQ[list, {x__Integer, y__}] part the code goes to another part of the algorithm (which needs the lists to be unpacked). Some other things to note: this complexity arises only if both patterns are __; if either one of them is _, the algorithm has better complexity.
The algorithm then goes through all n*n potential matches and there seems to be no early bailout, presumably because other patterns could be constructed that would need this complexity. The issue is that the above pattern forces the matcher into a very general algorithm.
I then was hoping for MatchQ[list, {Shortest[x__Integer], __}] and friends but to no avail.
So, my two cents: either use a different pattern (and use On["Packing"] to see whether it goes to the general matcher) or do a pre-check, Developer`PackedArrayQ[expr] && Head[expr[[1]]] === Integer or some such.
@the author of the first answer: As far as I know from reverse-engineering and from reading the available information, it may be due to the different ways the patterns are checked. In fact, as they say, a special hash code is used for pattern matching. This hash (basically an FNV-1 round) makes it very easy to check for particular patterns related to the type of expression involved (a matter of a few xor operations). The hashing algorithm cycles inside the expression, and each subpart is xorred with the output of the previous one. Special xor values are used for each atom expression - machineInts, machineReals, bigNums, Rationals and so on. Hence, for example, _Integer is easy to check because the hash of any integer is formed with the integer's xor value, so all we need to do is apply the inverse operation and see if it matches, i.e. whether we get some particular value or something like that (sorry if I'm vague on the actual implementation details; it's WIP). For general or uncommon patterns the check may not take advantage of this hashing and requires something different.
@the OP: Head[] simply acts on the internal expression, taking the value of the first pointer of the expression (expressions are implemented as arrays of pointers). So doing it is as easy as copying and printing a string: very, very fast. The pattern-matching engine is not even called in this case.

Are there any tools that can randomly generate source code according to a language grammar?

A C program's source code can be parsed according to the C grammar (described as a CFG) and eventually turned into an AST. I am wondering whether a tool exists that can do the reverse: first randomly generate ASTs according to the CFG, containing tokens that have no concrete string values, only token types, and then generate the concrete tokens according to the tokens' definitions as regular expressions.
I imagine the first step would look like an iterative replacement of non-terminals, done randomly and limited to a certain number of iterations. The second step is just generating random strings according to the regular expressions.
Is there any tool that can do this?
The "Data Generation Language" DGL does this, with the added ability to weight the probabilities of productions in the grammar being output.
In general, a recursive descent parser can be quite directly rewritten into a set of recursive procedures to generate, instead of parse / recognise, the language.
Given a context-free grammar of a language, it is possible to generate a random string that matches the grammar.
For example, the nearley parser generator includes an implementation of an "unparser" that can generate strings from a grammar.
The same task can be accomplished using definite clause grammars in Prolog. An example of a sentence generator using definite clause grammars is given here.
If you have a model of the grammar in a normalized form (all rules like this):
LHS = RHS1 RHS2 ... RHSn ;
and a language prettyprinter (e.g., an AST-to-text conversion tool), you can build one of these pretty easily.
Simply start with the goal symbol as a unit tree.
Repeat until no nonterminals are left:
   Pick a nonterminal N in the tree;
   Expand it by adding children for the right-hand side of any rule whose left-hand side matches the nonterminal N.
For terminals that carry values (e.g., variable names, numbers, strings, ...) you'll have to generate random content.
A complication with the above algorithm is that it doesn't clearly terminate. What you actually want to do is pick some limit on the size of your tree, and run the algorithm until all the nonterminals are gone or you exceed the limit. In the latter case, backtrack, undo the last replacement, and try something else. This gets you a bounded depth-first search for an AST of your determined size.
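Here is a minimal Haskell sketch of the expansion step under my own simplifying assumptions: it uses a depth budget instead of backtracking against a size limit, falls back to the shortest right-hand side once the budget is exhausted, and flattens directly to a token list rather than building an AST first (all names are illustrative):

import Data.List (sortOn)
import System.Random (randomRIO)

data Sym = Term String | NonTerm String
type Rule = (String, [Sym])              -- a nonterminal and one right-hand side

-- Expand a symbol into a list of terminal strings. Assumes every nonterminal
-- has at least one rule; near the depth limit we take the shortest right-hand
-- side so the expansion terminates (a crude stand-in for real backtracking).
generate :: [Rule] -> Int -> Sym -> IO [String]
generate _     _     (Term t)    = return [t]
generate rules depth (NonTerm n) = do
  let candidates = [ rhs | (lhs, rhs) <- rules, lhs == n ]
      usable = if depth <= 0
                 then take 1 (sortOn length candidates)
                 else candidates
  i    <- randomRIO (0, length usable - 1)
  subs <- mapM (generate rules (depth - 1)) (usable !! i)
  return (concat subs)

Something like generate grammar 30 (NonTerm "program") >>= putStrLn . unwords then prints one random sentence; terminals that carry values (identifiers, numbers, strings) would still need the extra random-content step mentioned above.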
Then prettyprint the result. It's the prettyprinter part that is hard to get right.
[You can build all this stuff yourself including the prettyprinter, but it is a fair amount of work. I build tools that include all this machinery directly in a language-parameterized way; see my bio].
A nasty problem even with well formed ASTs is that they may be nonsensical; you might produce a declaration of an integer X, and assign a string literal value to it, for a language that doesn't allow that. You can probably eliminate some simple problems, but language semantics can be incredibly complex, consider C++ as an example. Ensuring that you end up with a semantically meaningful program is extremely hard; in essence, you have to parse the resulting text, and perform name and type resolution/checking on it. For C++, you need a complete C++ front end.
The problem with random generation is that for many CFGs the expected length of the output string is infinite (there is an easy computation of the expected length using generating functions corresponding to the non-terminal symbols and equations corresponding to the rules of the grammar). You have to control the relative probabilities of the productions in certain ways to guarantee convergence; for example, sometimes weighting each production rule for a non-terminal symbol inversely to the length of its RHS suffices.
There is a lot more on this subject in:
Noam Chomsky and Marcel-Paul Schützenberger, "The Algebraic Theory of Context-Free Languages", pp. 118–161 in P. Braffort and D. Hirschberg (eds.), Computer Programming and Formal Systems, North-Holland (1963)
(see Wikipedia entry on Chomsky–Schützenberger enumeration theorem)

Resources