The complement of a context-free language is not always context-free. One cannot simply swap the final and non-final states of an NPDA and assume that the result accepts the complement of the language. Could someone give an example where this goes wrong?
And why does the above described procedure work for regular languages given a DFA? Maybe because DFA and NFA are equivalent and DPDA and NPDA are not?
Well, swapping the final vs non-final states of an NFA doesn't even guarantee you'll get the complement of the language. Consider this rather curious NFA:
----->q0--a-->[q1]
      |
      a
      |
      V
      q2
This NFA accepts the language {a}. Swapping the final and non-final states, the accepted language becomes {e, a}. These languages are not complementary since they have an intersection.
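To make this concrete, here is a small Python sketch (the representation and names are just illustrative) that simulates this NFA and its finality-swapped version on a few strings:

    # A tiny NFA simulator. States: q0 (start), q1, q2; alphabet {a}.
    # delta maps (state, symbol) -> set of successor states.
    delta = {("q0", "a"): {"q1", "q2"}}

    def step(states, symbol):
        nxt = set()
        for q in states:
            nxt |= delta.get((q, symbol), set())
        return nxt

    def accepts(final_states, word):
        current = {"q0"}
        for symbol in word:
            current = step(current, symbol)
        return bool(current & final_states)

    original_final = {"q1"}          # accepts exactly {"a"}
    swapped_final = {"q0", "q2"}     # finality of every state toggled

    for w in ["", "a", "aa"]:
        print(repr(w), accepts(original_final, w), accepts(swapped_final, w))
    # ''   False True
    # 'a'  True  True   <- "a" is accepted by BOTH machines,
    # 'aa' False False     so the swapped NFA is not the complement.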
In exactly the same way, swapping the states of an NPDA is not guaranteed to work either. The difference, as you point out, is that for any NFA there is some equivalent DFA (indeed, there are many), and toggling the finality of states does work for a DFA, so the regular languages are guaranteed to be closed under complementation.
For NPDAs, though, we do not necessarily have equivalent DPDAs (for which complementation can be carried out, with some care about ε-moves and non-halting computations). Thus, it is possible that the complement of a language accepted only by NPDAs is not context-free.
Indeed, the context-free language {a^i b^j c^k | i != j or j != k} is accepted only by NPDAs, and its complement {strings not of the form a^i b^j c^k, or strings of that form with i = j = k} is not context-free.
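For concreteness, here is one way to write a grammar for {a^i b^j c^k | i != j or j != k} (a sketch; many equivalent grammars exist; ε denotes the empty string):

    S -> U C | A V          (either i != j, or j != k)
    U -> M | N               (a^i b^j with i != j)
    M -> a M b | a M | a     (more a's than b's)
    N -> a N b | N b | b     (fewer a's than b's)
    V -> P | Q               (b^j c^k with j != k)
    P -> b P c | b P | b     (more b's than c's)
    Q -> b Q c | Q c | c     (fewer b's than c's)
    A -> a A | ε
    C -> c C | ε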
A grammar that does not specify a unique move for at least one element of Σ (the input alphabet).
From a given state, on a single input symbol, we cannot determine which state we will reach; a grammar that gives rise to this kind of situation is called a non-deterministic grammar.
(image: https://i.stack.imgur.com/U6vaJ.jpg)
For a particular input, the machine may give different outputs on different executions.
The problem may not be solvable in polynomial time.
We cannot determine the next step of execution, because there is more than one path the algorithm can take.
The DFA describing the intersection of two regular expressions can be exponentially large compared to the regular expressions themselves. (Here's a nice Python library for computing it.) Is there a way to compute the size of the DFA for the intersection without needing exponential resources?
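For reference, by "the DFA for the intersection" I mean the result of the standard product construction applied to DFAs for the two expressions, roughly as in this sketch (the tuple representation and names are only illustrative, not the linked library's API):

    from itertools import product

    def intersection_dfa(dfa1, dfa2):
        """Product construction: a DFA accepting L(dfa1) ∩ L(dfa2).

        Each DFA is a tuple (states, alphabet, delta, start, finals),
        where delta maps (state, symbol) -> state and is total.
        The product has at most |states1| * |states2| states.
        """
        s1, alphabet, d1, q1, f1 = dfa1
        s2, _, d2, q2, f2 = dfa2
        states = set(product(s1, s2))
        delta = {((p, q), a): (d1[(p, a)], d2[(q, a)])
                 for (p, q) in states for a in alphabet}
        start = (q1, q2)
        finals = {(p, q) for (p, q) in states if p in f1 and q in f2}
        return states, alphabet, delta, start, finals

In practice one would only build the pairs reachable from (q1, q2), but that does not change the worst case.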
From Wikipedia:
Universality: is L_A = Σ*? […] For regular expressions, the universality problem is NP-complete already for a singleton alphabet.
If I'm reading that right, it says that the problem of determining whether a regular expression generates all strings is known to be NP-complete.
Now, for your problem: consider the case where the two input regular expressions are known to generate the same regular language (perhaps the expressions are identical). Then your problem reduces to this: what is the size of the DFA for this RE? It is relatively straightforward to tell whether a RE generates at least some strings (i.e., whether the language is empty). If the language is not empty, then the minimal DFA corresponding to the RE has one state if and only if the RE generates all strings.
So, if your problem had a general polynomial-time solution, you'd be able to decide universality for regular expressions in polynomial time, which the quote above says is NP-complete, and hence is not believed to be possible.
(If you're not asking about minimal DFAs, but the DFAs produced by a specific minimization technique, I think you'd have to specify the minimization technique).
I'm looking for algorithms or data structures specifically for dealing with ambiguities.
In my particular current field of interest I'm looking into ambiguous parses of natural languages, but I assume there must be many fields in computing where ambiguity plays a part.
I can find a lot out there on trying to avoid ambiguity but very little on how to embrace ambiguity and analyse ambiguous data.
Say a parser generates these alternative token streams or interpretations:
A B1 C
A B2 C
A B3 B4 C
It can be seen that some parts of the stream are shared between interpretations (A and C) and other parts branch into alternative interpretations and often meet back with the main stream.
Of course there may be many more interpretations, nesting of alternatives, and interpretations which have no main stream.
This is obviously some kind of graph with nodes. I don't know if it has an established name.
Are there extant algorithms or data structures I can study that are intended to deal with just this kind of ambiguous graph?
Ambiguity and sharing in Natural Language Parsing
Ambiguity and sharing in general
Given the generality of your question, I am trying to match that
generality.
The concept of ambiguity arises as soon as you consider a mapping or function f: A -> B which is not injective.
An injective function (also called one-to-one function) is one such
that when a≠a' then f(a) ≠ f(a'). Given a function f, you are often
interested in reversing it: given an element b of the codomain B of f,
you want to know what element a of the domain A is such that f(a)=b.
Note that there may be none if the function is not surjective
(i.e. onto).
When the function is not injective, there may be several values a in A
such that f(a)=b. In other words, if you use values in B to actually
represent values in A through the mapping f, you have an ambiguous
representation b that may not determine the value a uniquely.
From this you realize that the concept of ambiguity is so general that
it is unlikely that there is a unified body of knowledge about it,
even when limiting this to computer science and programming.
However, if you wish to consider reversing a function creating such
ambiguities, for example to compute the set f'(b)={a∈A | f(a)=b}, or
the best element(s) in that set according to some optimality criterion,
there are indeed some techniques that may help you in situations where
the problem can be decomposed into subproblems that often re-occur
with the same arguments. Then, if you memorize the result(s) for the
various combinations of arguments encountered, you never compute the
same thing twice (the subproblem is said to be memoized). Note that
ambiguity may exist for subproblems too, so that there may be several
answers for some subproblem instances, or optimal answers among
several others.
This amounts to sharing a single copy of a subproblem between all the
situations that require solving it with this set of parameters. The
whole technique is called dynamic programming, and the difficulty is
often to find the right decomposition into subproblems. Dynamic
programming is primarily a way to share the repeated subcomputation for a
solution, so as to reduce complexity. However, if each subcomputation
produces a fragment of a structure that is reused recursively in
larger structures to find an answer which is a structured object (a
graph for example), then sharing a subcomputation step may also result
in sharing the corresponding substructure in all the places where it is
needed. When many answers are to be found (because of ambiguity for
example), these answers can share subparts.
Rather than finding all the answers, dynamic programming can be used
to find those satisfying some optimality criterion. This requires that
an optimal solution of a problem uses optimal solutions of
subproblems.
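As a toy illustration of this kind of memoized decomposition (the lexicon and function names below are mine, purely for illustration), here is a Python sketch that enumerates all the ways of splitting a string into dictionary words, sharing the suffix subproblems:

    from functools import lru_cache

    WORDS = {"a", "ab", "b", "ba"}   # a tiny ambiguous "lexicon"

    def splits(s):
        """All ways to write s as a concatenation of WORDS entries.

        Each suffix of s is a subproblem; memoization ensures each
        suffix is analysed only once, even though it may be reached
        along many different prefixes (this is the sharing discussed
        above).
        """
        @lru_cache(maxsize=None)
        def solve(i):
            if i == len(s):
                return [[]]
            results = []
            for j in range(i + 1, len(s) + 1):
                if s[i:j] in WORDS:
                    results.extend([s[i:j]] + rest for rest in solve(j))
            return results
        return solve(0)

    print(splits("aba"))   # [['a', 'b', 'a'], ['a', 'ba'], ['ab', 'a']]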
The case of linguistic processing
Things can be more specific in the case of linguistics and language
processing. For that purpose, you have to identify the domains you are
dealing with, and the kind of functions you use with these domains.
The purpose of language is to exchange information, concepts, ideas
that reside in our brains, with the very approximate assumption that
our brains use the same functions to represent these ideas
linguistically. I must also simplify things considerably (sorry about
it) because this is not exactly the place for a full theory of
language, which would be disputed anyway. And I cannot even consider
all types of syntactic theories.
So linguistic exchange of information, of an idea, from a person P to a person Q
goes as follows:
idea in P ---f--> syntactic tree ---g--> lexical sequence ---h--> sound sequence
                                                                        |
                                                                        s
                                                                        |
                                                                        V
idea in Q <--f'-- syntactic tree <--g'-- lexical sequence <--h'-- sound sequence
The first line is about sentence generation taking place in person P,
and the second line is about sentence analysis taking place in person
Q. The function s stands for speech transmission, and should be the
identity function. The functions f', g' and h' are supposed to be the
inverse of the functions f,g, and h that compute the successive
representations down to the spoken representation of the idea. But
each of these functions may be non-injective (and usually is), so that
ambiguities are introduced at each level, making it difficult for Q to
invert them, i.e. to retrieve the original meaning from the sound sequence
it receives (I am deliberately using the word sound to avoid getting
into details). The same diagram holds, with some variations in details,
for written communication.
We ignore f and f' since they are concerned with semantics, which may
be less formalized, and for which I do not have competence. Syntax
trees are often defined by grammatical formalisms (here I am skipping
over important refinements such as feature structures, but they can be
taken into account).
Both the function g and the function h are usually not injective, and
thus are sources of ambiguity. There are actually other sources of
ambiguity due to all kinds of errors inherent to the speech chain, but
we will ignore them for simplicity, as they do not much change the
nature of the problems. The presence of errors, due to sentence generation
or transmission, or to language specification mismatch between the
speaker and the listener, is an extra source of ambiguity since the
listener attempts to correct potential errors without knowing what
they may have been or whether they exist at all.
We assume that the listener does not make mistakes, and that he
attempts to best "decode" the sentence according to his own linguistic
standards and knowledge, including knowledge of error sources and
statistics.
Lexical ambiguity
Given a sound sequence, the listening system has to inverse the effect
of the lexical generation function g with a function g'. A first
problem is that several different words may give the same sound
sequence, which is a first source of ambiguity. The second problem is
that the listening system actually receives the sequence corresponding
to a string of words, and there may be no indication of where words
begin or end. So there may be different ways of cutting the sound
sequence into subsequences corresponding to recognizable words.
This problem may be worsened when noise creates more confusion
between words.
An example is the following pair of holorime verses, taken from the web,
which are pronounced more or less the same:
Ms Stephen, without a first-rate stakeholder sum or deal,
Must, even with outer fur straight, stay colder - some ordeal.
The analysis of the sound sequence can be performed by a finite state
non-deterministic automaton, interpreted in a dynamic programming
mode, which produces a directed acyclic graph where the nodes
correspond to word separations and the edges to recognized words.
Any path through the graph from its initial node to its final node
corresponds to a possible way of analyzing the sound sequence as a
sequence of words.
The above example gives the (fairly simple) word lattice (oriented
left to right):
the-*-fun
/ \
Ms -*-- Stephen \ without --*-- a first -*- ...
/ \ / \ /
* * *
\ / \ / \
must --*-- even with -*- outer fur -*- ...
So that the sound sequence could also correspond to the following word
sequences (among several others):
Ms Stephen, with outer first-rate ...
Must, even with outer first-rate ...
This makes the lexical analysis ambiguous.
Probabilities may be used to choose a best sequence. But it is also
possible to keep the ambiguity of all possible readings and use it
as is in the next stage of sentence analysis.
Note that the word lattice may be seen as a finite state automaton
that generates or recognizes all the possible lexical readings of the
sound sequence.
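Here is a small Python sketch of such a lattice construction (the toy "sound" string and lexicon are of my own choosing): the nodes are positions in the input and each edge is a recognized word, which is exactly a finite state automaton over positions:

    def word_lattice(sounds, lexicon):
        """Build a word lattice for the 'sound' string `sounds`.

        Returns a dict: edges[i] is a list of (word, j) pairs meaning
        that positions i..j of `sounds` can be read as `word`.  The
        nodes are the integer positions 0..len(sounds); every complete
        reading is a path from node 0 to node len(sounds).
        """
        edges = {i: [] for i in range(len(sounds) + 1)}
        for i in range(len(sounds)):
            for j in range(i + 1, len(sounds) + 1):
                if sounds[i:j] in lexicon:
                    edges[i].append((sounds[i:j], j))
        return edges

    lattice = word_lattice("iscream", {"i", "is", "ice", "scream", "cream"})
    print(lattice[0])   # [('i', 1), ('is', 2)]
    print(lattice[1])   # [('scream', 7)]
    # The two complete paths 0 -> 7 read "i scream" and "is cream".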
Syntactic ambiguity
Syntactic structure is often based on a context-free grammar
skeleton. The problem of ambiguity of context-free languages is well
known and analyzed. A number of general CF parsers have been devised
to parse ambiguous sentences, and produce a structure (which varies
somewhat) from which all parses can be extracted. Such structures have
come to be known as parse forests, or shared parse forests.
What is known is that the structure is at worst cubic in size with
respect to the length of the analyzed sentence, on the condition that
the language grammar is binarized, i.e. has no more than 2
non-terminals in each rule's right-hand side (or, more simply, no more
than 2 symbols in each rule's right-hand side).
Actually, all these general CF parsing algorithms are more or less
sophisticated variations around a simple concept: the intersection of
the language L(A) of a finite state automaton A and the language L(G)
of a CF grammar G. Construction of such an intersection goes back to
the early papers on context-free languages (Bar-Hillel, Perles and
Shamir 1961), and was intended as a proof of a closure property. It took
some thirty years before a 1995 paper recognized that it is also a very
general parsing algorithm.
This classical cross-product construction yields a CF grammar for the
intersection of the two languages L(A) and L(G). If you consider a
sentence w to be parsed, represented as a sequence of lexical elements,
it can also be viewed as a finite state automaton W that generates only
the sentence w. For example:
       this     is    a     finite     state     automaton
=> (1)------(2)----(3)---(4)--------(5)-------(6)-----------((7))
is a finite state automaton W accepting only the sentence
w="this is a finite state automaton". So L(W)={w}.
If the grammar of the language is G, then the intersection
construction gives a grammar G_w for the language
L(G_w)=L(W)∩L(G).
If the sentence w is not in L(G), then L(G_w) is empty, and the
sentence is not recognized. Else L(G_w)={w}. Furthermore, it is then
easily proved that the grammar G_w generates the sentence w with
exactly the same parse-trees (hence the same ambiguity) as the grammar
G, up to a simple renaming of the non-terminals.
The grammar G_w is the (shared) parse forest for w, and the set of
parse trees of w is precisely the set of derivations with this
grammar. So this gives a very simple view organizing the concepts, and
explaining the structure of shared parse forests and general CF parsers.
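As a minimal illustration of this view (the toy grammar, sentence and data representation below are mine, not from the cited papers), a CYK-style recognizer can be written so that its chart items (A, i, j) are literally the non-terminals of G_w and the recorded expansions are its rules, i.e. the shared parse forest:

    from collections import defaultdict

    def cyk_forest(words, binary_rules, lexical_rules, start="S"):
        """CYK recognition producing a shared parse forest.

        binary_rules:  list of (A, B, C) for rules A -> B C
        lexical_rules: list of (A, word) for rules A -> word
        The forest maps each specialized non-terminal (A, i, j) to the
        list of its possible expansions, i.e. the rules of G_w.
        """
        n = len(words)
        forest = defaultdict(list)
        # Lexical level: A -> words[i] becomes (A, i, i+1) -> words[i].
        for i, w in enumerate(words):
            for A, word in lexical_rules:
                if word == w:
                    forest[(A, i, i + 1)].append((w,))
        # Binary level: A -> B C becomes (A,i,j) -> (B,i,k) (C,k,j).
        for span in range(2, n + 1):
            for i in range(0, n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for A, B, C in binary_rules:
                        if (B, i, k) in forest and (C, k, j) in forest:
                            forest[(A, i, j)].append(((B, i, k), (C, k, j)))
        return forest if (start, 0, n) in forest else {}

    # Toy ambiguous grammar: S -> S S | 'a', parsing "a a a".
    forest = cyk_forest(["a", "a", "a"],
                        binary_rules=[("S", "S", "S")],
                        lexical_rules=[("S", "a")])
    print(forest[("S", 0, 3)])
    # [(('S', 0, 1), ('S', 1, 3)), (('S', 0, 2), ('S', 2, 3))]
    # Two expansions: the two bracketings, with the sub-spans shared.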
But there is more to it, because it shows how to generalize to
different grammars and to different structures to be parsed.
Constructive closure of intersection with regular sets by
cross-product constructions is common to a lot of grammatical
formalisms that extend the power of CF grammars somewhat into the
context-sensitive realm. This includes tree-adjoining grammars, and
linear context-free rewriting systems. Hence this is a guideline on
how to build, for these more powerful formalisms, general parsers that
can handle ambiguity and produce shared parse forests, which are simply
specialized grammars of the same type.
The other generalization is that, when there is lexical ambiguity so
that lexical analysis produces many candidate sentences represented
with sharing by a word lattice, this word lattice can be read as a
finite state automaton recognizing all these sentences. Then, the same
intersection construction will eliminate all sentences that are not in
the language (not grammatical), and produce a CF grammar that is a
shared parse forest for all possible parses of all admissible
(grammatical) sentences from the word lattice.
As requested in the question, all possible ambiguous readings are
preserved as long as they are compatible with the available linguistic
or utterance information.
The handling of noise and ill-formed sentences is usually also modelled
with finite state devices, and can thus be addressed by the same
techniques.
There are actually many other issues to be considered. For example,
there are many ways of building the shared forest, with more or less
sharing. The techniques used to precompile pushdown automata to be
used for general context-free parsing may have an effect on the quality
of the sharing. Being too smart is not always very smart.
See also other answers I made on SE on this topic:
https://cs.stackexchange.com/questions/27937/how-do-i-reconstruct-the-forest-of-syntax-trees-from-the-earley-vector/27952#27952
https://cstheory.stackexchange.com/questions/7374/recovering-a-parse-forest-from-an-earley-parser/18006#18006
I'm experimenting with PFGs -- Parse-Forest Grammars built using Marpa::R2 ASF.
The approach is to represent ambiguous parses as a grammar, design a criterion to prune unneeded rules, apply it, and then remove unproductive and inaccessible symbols from the PFG, thus arriving at a parse tree.
This test case is an illustration: it parses arithmetic expressions with highly ambiguous grammar, then prunes the PFG rules based on associativity and precedence, cleans up the grammar, converts it to abstract syntax tree (Problem 3.10 from the cited source — Grune and Jacobs).
I'd call this data structure a lattice, see for instance Lexicalized Parsing (PDF).
I have a polynomial form:
w(x) = a + b(x-x1) + c(x-x1)^2(x-x2) + d(x-x1)^2(x-x2)^2 + ...
Does anybody know a fast algorithm to evaluate this polynomial?
I want to plot this polynomial, but first I have to compute its values, and I can't find any fast method for that.
I don't think there is a faster algorithm than iterating over the whole polynomial. As it is not obvious from your description what exactly the rule is by which you form the terms, I cannot offer a complete solution, but it is even better if you come up with it yourself.
From what I see, the x-dependent part of each consecutive term is formed by multiplying what you have so far by another monomial. If that is so, keep that running value between iterations of the loop, as sketched below.
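Here is that idea in Python, assuming the term pattern really is the one guessed above (adjust the extra_factors argument to whatever your actual rule is):

    def eval_poly(x, coeffs, extra_factors):
        """Evaluate a polynomial given in the incremental form

            w(x) = coeffs[0] + coeffs[1]*B1 + coeffs[2]*B2 + ...

        where each basis B(k) is the previous basis multiplied by the
        linear factors (x - node) listed in extra_factors[k-1].  The
        running basis is kept between iterations, so each factor of
        the polynomial is computed exactly once.
        """
        value = coeffs[0]
        basis = 1.0
        for c, nodes in zip(coeffs[1:], extra_factors):
            for node in nodes:
                basis *= (x - node)      # extend the running product
            value += c * basis
        return value

    # For w(x) = a + b(x-x1) + c(x-x1)^2(x-x2) + d(x-x1)^2(x-x2)^2 + ...
    # the call for the first four terms would be, e.g.:
    #   eval_poly(x, [a, b, c, d], [[x1], [x1, x2], [x2]])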
Write a parser to get the individual elements of the expression such as a, b, x, x1, etc and 2, *, +, ^, (, ), etc.
Then use the Shunting-yard algorithm to transform the expression into Reverse Polish Notation.
Then evaluate it using a stack or a tree.
If you intend to evaluate the same expression many times, you may want to eliminate common subexpressions (e.g. x - x1 repeats multiple times and you may calculate it just once). There are ways to do that as well. But before you go there, first see if what you get without such optimizations is really not sufficient.
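For the evaluation step, here is a rough sketch (assuming the expression has already been converted to Reverse Polish Notation by the shunting-yard stage; the tokens and variable values below are just an illustration):

    import operator

    OPS = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "^": operator.pow}

    def eval_rpn(tokens, variables):
        """Evaluate a list of RPN tokens with a stack.

        Numbers and variable names push a value; a binary operator
        pops two operands and pushes the result.
        """
        stack = []
        for tok in tokens:
            if tok in OPS:
                right = stack.pop()
                left = stack.pop()
                stack.append(OPS[tok](left, right))
            elif tok in variables:
                stack.append(variables[tok])
            else:
                stack.append(float(tok))
        return stack[0]

    # (x - x1)^2 * c in RPN, with illustrative values:
    print(eval_rpn(["x", "x1", "-", "2", "^", "c", "*"],
                   {"x": 3.0, "x1": 1.0, "c": 5.0}))   # 20.0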
I need to construct a DFA using the McNaughton-Yamada algorithm for a CS class. The problem is that the algorithm is supplemental material and I am not clear on what exactly it is. Is it a method for finding a DFA given a RegEx, or is it finding the DFA plus minimizing it? I can't seem to find any info on the subject.
I am confused because the minimization routine my instructor showed after we found the DFA in class doesn't seem any different than the 'mark' minimization described in our book.
There is a description of the algorithm (for regular expression to NFA and NFA to DFA) at http://swtch.com/~rsc/regexp/regexp1.html; they show Thompson's version, and claim that the McNaughton-Yamada algorithm is basically the same, but generating a DFA directly from the regular expression.
"...the McNaughton-Yamada analysis algorithm, whereby a regular expression is found describing the words accepted by a finite state machine whose transition table is given. Unmodified the algorithm will produce 4n terms representing an n-state machine. This number could be reduced by eliminating duplicate calculations and rejecting ona high level expressions corresponding to no possible path in the same state diagram. The remaining expressions present a serious simplification problem, since empty expressions and null words are generated liberally by the algorithm."
Source
Their algorithm isn't usually brought up separately from Thompson's work. Thompson's original construction went from a regular expression to a simulated NFA in memory. The Thompson-McNaughton-Yamada algorithm is an extension that turns the regex into an actual NFA stored in memory, as opposed to a transient simulation.
Converting the NFA to a DFA (determinization) is not part of McNaughton or Yamada's extension. Rather, it's done via the subset construction (aka powerset construction) algorithm.
You can see Thompson-McNaughton-Yamada algorithm and subset construction algorithm in action on arbitrary regular expressions by using the Compiler Construction Toolkit's regular expression to NFA and DFA tool.
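For reference, the subset construction itself is short. Here is a minimal sketch (the NFA is represented as a dict of transition sets, with None standing for ε; all names are illustrative):

    from collections import deque

    def epsilon_closure(states, delta):
        """All NFA states reachable from `states` by ε-moves (label None)."""
        closure, todo = set(states), deque(states)
        while todo:
            q = todo.popleft()
            for r in delta.get((q, None), set()):
                if r not in closure:
                    closure.add(r)
                    todo.append(r)
        return frozenset(closure)

    def subset_construction(nfa_start, nfa_finals, delta, alphabet):
        """Determinize an NFA: each DFA state is a set of NFA states."""
        start = epsilon_closure({nfa_start}, delta)
        dfa_delta, finals, todo, seen = {}, set(), deque([start]), {start}
        while todo:
            S = todo.popleft()
            if S & nfa_finals:
                finals.add(S)
            for a in alphabet:
                target = set()
                for q in S:
                    target |= delta.get((q, a), set())
                T = epsilon_closure(target, delta)
                dfa_delta[(S, a)] = T
                if T not in seen:
                    seen.add(T)
                    todo.append(T)
        return start, finals, dfa_delta

    # Example: an NFA accepting {"a"} with one ε-move, determinized.
    delta = {(0, None): {1}, (1, "a"): {2}}
    start, finals, d = subset_construction(0, {2}, delta, {"a"})
    print(start, finals)   # e.g. frozenset({0, 1}) {frozenset({2})}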