{ x ∈ {0,1}* | x ∉ {01,10}* }: prove whether or not it is a regular language - computation-theory

In my computational theory class we have an assignment of proving whether a language is regular. The language is defined as:
{ x ∈ {0,1}* | x ∉ {01,10}* }
I don't know how to solve it, but I would appreciate it if someone could push me in the right direction toward proving whether this is a regular language.

Your language is the language of all strings over {0, 1} that are not in the language {01, 10}*. By closure properties of regular languages, we know that the complement of a regular language is also a regular language. Your language is the complement of the regular language (01 + 10)* with respect to the universe (0+1)*, so it must be regular. To see this concretely, make a DFA for the regular language (01 + 10)* and then make all non-accepting states into accepting states, and vice versa.
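For concreteness, here is a small simulation of that construction (my sketch, not part of the original answer): the DFA for (01 + 10)* needs only four states, and flipping the acceptance test gives the complement.

# Transition table of the DFA for (01 + 10)*:
# "start" is both the initial and (for the original language) the only
# accepting state; "dead" is the trap state reached after 00 or 11.
DELTA = {
    ("start", "0"): "seen0",   # first symbol of a potential "01" block
    ("start", "1"): "seen1",   # first symbol of a potential "10" block
    ("seen0", "1"): "start",   # completed a "01" block
    ("seen0", "0"): "dead",    # "00" can never recover
    ("seen1", "0"): "start",   # completed a "10" block
    ("seen1", "1"): "dead",    # "11" can never recover
    ("dead", "0"): "dead",
    ("dead", "1"): "dead",
}

def in_complement(x: str) -> bool:
    """True iff x is over {0,1} but NOT in (01 + 10)*."""
    state = "start"
    for ch in x:
        state = DELTA[(state, ch)]
    # (01+10)* accepts exactly in "start"; the complement flips that.
    return state != "start"

assert not in_complement("0110")   # 01 10 is in (01+10)*
assert in_complement("00")         # 00 is not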

Related

Recursive descent Parser: infix to RPN

This is the continuation of my attempt to make a recursive descent parser (LL(1)) that takes in infix expressions and outputs RPN. Here is a link to my first question, to which @rici did an amazing job of responding, and I hope I do his answer justice with this revised implementation.
My new grammar is as follows (without support for unary operators):
expr -> term (+|-) term | term
term -> exponent (*|/) exponent | exponent
exponent -> factor ^ factor | factor
factor -> number | ( expr )
In his answer, @rici points out, with respect to Norwell's grammar:
We normally put the unary negation operator between multiplication and exponentiation
and I have tried to incorporate that here:
expr -> term (+|-) term | term
term -> exponent1 (*|/) exponent1 | exponent1
exponent1 -> (+|-) exponent | exponent
exponent -> factor ^ factor | factor
factor -> number | ( expr )
Coding the first grammar meant that unary (+/-) numbers could not be accepted; only binary +/- operators were accepted. That solution works well for the problems I have tried (it could be wrong, and I hope to learn more). However, on closer inspection, the second grammar fails, and I am forced to fall back on the same "hack" I used in my first question. As @rici points out:
By the way, your output is not Reverse Polish Notation (and nor is it unambiguous without parentheses) because you output unary operators before their operands.
To be fair, he does point out adding the extra 0 operand, which is fine, and I think it is going to work. However, say I enter 13/-5: its equivalent infix would be 13/0-5, and its RPN 13 0 / 5 -. Or perhaps I am misunderstanding his point.
And finally, to put the nail in the coffin, @rici also points out:
left-recursion elimination would have deleted the distinction between left-associative and right-associative operators
and hence it is pretty much impossible to determine the associativity of any of the operators: all of them end up treated the same. Moreover, that would imply that trying to support many right- and left-associative operators is going to be very difficult, if not impossible, for simple LL(1) parsers.
Here is my C code implementation of the grammar:
#include <ctype.h>   /* for isdigit() */
#include <stdio.h>
#include <stdlib.h>

void error();
void factor();
void expr();
void term();
void exponent1();
void exponent();
void parseNumber();
void match(int t);

int lookahead;   /* int, not char: getchar() returns int so EOF can be detected */
int position = 0;

int main() {
    lookahead = getchar();
    expr();
    return 0;
}

void error() {
    printf("\nSyntax error at lookahead %c pos: %d\n", lookahead, position);
    exit(1);
}

void factor() {
    if (isdigit(lookahead)) {
        parseNumber();
        // printf("lookahead at %c",lookahead);
    } else if (lookahead == '(') {
        match('(');
        expr();
        match(')');
    } else {
        error();
    }
}

void expr() {
    term();
    while (1) {
        if (lookahead == EOF || lookahead == '\n') break;
        if (lookahead == '+' || lookahead == '-') {
            char token = lookahead;
            match(lookahead);
            term();
            printf(" %c ", token);
        } else {
            break;
        }
    }
}

void term() {
    exponent1();
    while (1) {
        if (lookahead == EOF || lookahead == '\n') break;
        if (lookahead == '/' || lookahead == '*') {
            char token = lookahead;
            match(lookahead);
            exponent1();
            printf(" %c ", token);
        } else {
            break;
        }
    }
}

void exponent1() {
    if (lookahead == '-' || lookahead == '+') {
        char token = lookahead;
        match(lookahead);
        // having the printf here:
        printf("%c", token);
        // passes this:
        //   2+6*2--5/3       := 2.00 6.00 2.00 * + 5.00 3.00 / -
        //   -1+((-2-1)+3)*-2 := -1.00 -2.00 1.00 - 3.00 + -2.00 * + (not actual RPN, as #rici mentions)
        // but fails at:
        //   -(3/2) := -3.00 2.00 /
        //   -3/2   := -3.00 2.00 /
        exponent();
        // but having the printf here
        // printf("%c ", token);
        // fails this: -1+((-2-1)+3)*-2 := 1.00 - 2.00 - 1.00 - 3.00 + 2.00 - * +
        // since it is supposed to be
        //   1.00 - -2.00 1.00 - 3.00 + -2.00 * +
        // but satisfies this:
        //   -(3/2) := 3.00 2.00 / -
        //   (-3/2) := 3.00 - 2.00 /
    } else {
        exponent();
        // error();
    }
}

void exponent() {
    factor();
    while (1) {
        if (lookahead == EOF || lookahead == '\n') break;
        if (lookahead == '^') {
            match('^');
            factor();
            printf(" ^ ");
        } else {
            break;
        }
    }
}

void parseNumber() {
    double number = 0;
    if (lookahead == EOF || lookahead == '\n') return;
    while (lookahead >= '0' && lookahead <= '9') {
        number = number * 10 + lookahead - '0';
        match(lookahead);
    }
    if (lookahead == '.') {
        match(lookahead);
        double weight = 1;
        while (lookahead >= '0' && lookahead <= '9') {
            weight /= 10;
            number = number + (lookahead - '0') * weight;
            match(lookahead);
        }
    }
    printf("%.2f ", number);
    // printf("\ncurrent look ahead after exiting parseNumber %c\n", lookahead);
}

void match(int t) {
    if (lookahead == t) {
        lookahead = getchar();
        position++;
    } else {
        error();
    }
}
So does that mean I should give up on LL(1) parsers and perhaps look at LR parsers instead? Or can increasing the amount of lookahead help? If there are many possible paths, perhaps a longer lookahead could narrow things down. For instance:
-(5 ;; looks weird
-( 5 ;; could be - ( exp )
or
--5 ;; could be many things
-- 5 ;; ought to be the -- operator and output say #
EDITs:
I think having a larger lookahead is going to be difficult to coordinate. So perhaps I should have something like the shunting yard algorithm, where I peek at the next operator and, based on its precedence, the algorithm determines which function call to make. Something like using the actual call stack of the running program, where a pop would be a return and a push would be a function call. I am not sure how I could coordinate that with recursive descent.
Perhaps the precedence of the peeked token should determine the lookahead length?
Increasing lookahead doesn't help.
Here is the usual LALR(1) grammar for arithmetical expressions, including exponentiation:
expr -> sum
sum -> sum (+|-) product | product
product -> product (*|/) prefix | prefix
prefix -> (+|-) prefix | exponent
exponent -> atom ^ exponent | atom
atom -> number | ( expr )
You can find examples of that model of constructing a grammar all over the internet, although you will also find many examples where the same non-terminal is used throughout and the resulting ambiguities are dealt with using precedence declarations.
Note the structural difference between exponent and the other binary operators. exponent is right-recursive (because exponentiation is right-associative); the others are left-recursive (because the other binary operators are left-associative).
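(To see what that difference buys you: under this grammar, 2^3^2 can only derive as 2^(3^2), which evaluates to 512 rather than 64, while 8-3-2 can only derive as (8-3)-2.)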
When I said that you could fix the ambiguity of the prefix operator characters by adding an explicit 0, I didn't mean that you should edit your input to insert the 0. That won't work, because (as you note) it gets the precedence of the unary operators wrong. What I meant was that the recursive descent RPN converter should look something like this:
void parsePrefix(void) {
    if (lookahead == '-' || lookahead == '+') {
        char token = lookahead;
        match(lookahead);
        fputs("0 ", stdout);
        parsePrefix();
        printf("%c ", token);
    }
    else {
        parseExponent();
    }
}
That outputs the 0 precisely where it needs to go.
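(Tracing it by hand to check: for -3/2, parsePrefix emits 0, the recursive call emits 3, the - is emitted after its operands, and the enclosing multiplicative rule then emits 2 and /, giving 0 3 - 2 /, i.e. (0-3)/2. Likewise 13/-5 comes out as 13 0 5 - /, i.e. 13/(0-5), with the precedence intact.)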
Warning: The following paragraph is unadulterated opinion which does not conform to StackOverflow's policy. If that will offend you, please don't read it. (And perhaps you should just skip the rest of this answer, in that case.)
IMHO, this is really a hack, but so is the use of RPN. If you were building an AST, you would simply build a unaryOperator AST node with a single operand. There would be no issue of ambiguity since there would be no need to interpret the token again during evaluation.
For whatever reason, students who go through the usual compiler theory classes seem to come out of them believing that ASTs are somehow a sophisticated technique which should be avoided until necessary, that left-recursion must be avoided at all costs, and that there is moral value in coding your own LL parser instead of just using a standard available LALR parser generator. I disagree with all of those things. In particular, I recommend that you start by creating an AST, because it will make almost everything else easier. Also, if you want to learn about parsing, start by using a standard tool and focus on writing a clear, self-documenting grammar, and using the information about the syntactic structure of the input you are trying to parse. Later on you can learn how the parser generator works, if you really find that interesting.
Similarly, I would never teach trigonometry by starting with the accurate evaluation of the Taylor expansion of the sin() function. That does not provide the student any insights whatsoever about how to use the trigonometric functions (for example, to rotate by an angle), which is surely the most important part of trigonometry. Once the student has a firm understanding of trigonometry, how to use it, and particularly what the demands of precise calculation are in typical problem domains, then they might want to look at Taylor expansions and other calculation techniques. But most will be content to just call sin(), and I think that's just perfect.
If you really want to use a recursive descent parser, go for it. They can certainly be made to work.
What will happen as you code your grammar into an executable program is that you will slowly start to diverge from a representation of the grammar which could be used in other programs, like syntax colorizers and static analysers. In part, that will be because the grammar you are using has lost important aspects of the syntax, including associativity; these features are instead coded directly into your parser. The ultimate result is often that only the parser code is maintained, and the theoretical grammar is left to rot. When the code itself is the "grammar", it is no longer usable as practical documentation of your language's syntax.
But I'm not saying that it can't be done. It most certainly can be done, and there are lots and lots of parsers in practical use which do it that way.
The shunting yard algorithm (and operator-precedence parsing in general) is a bottom-up technique, like LR parsing, and neither of them requires a recursive parser. If for some reason you really want to use recursion, you could use a Pratt parser instead, but there is a huge practical advantage to bottom-up parsing over recursive descent: it eliminates the possibility of uncontrollable stack overflow. So it's hard to recommend the use of recursive descent in production parsers unless the input text is strictly controlled to avoid possible attacks through stack overflow. That might not apply to a compiler which is not used with unverified inputs. But is that true of your compiler? Have you never downloaded a source tarball from a foreign site and then typed ./configure && make all? :-)
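For comparison, here is a minimal shunting-yard sketch in Python (my illustration, not code from this answer) that converts infix tokens to RPN; unary operators are omitted. Note that the associativity of ^ lives in a one-line table rather than in the shape of a grammar.

PREC  = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT = {"^"}  # right-associative operators

def to_rpn(tokens):
    out, ops = [], []
    for t in tokens:
        if t in PREC:
            # pop operators that bind at least as tightly (strictly
            # tighter for right-associative ones), then push t
            while ops and ops[-1] in PREC and (
                PREC[ops[-1]] > PREC[t]
                or (PREC[ops[-1]] == PREC[t] and t not in RIGHT)
            ):
                out.append(ops.pop())
            ops.append(t)
        elif t == "(":
            ops.append(t)
        elif t == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()  # discard the "("
        else:
            out.append(t)  # a number
    while ops:
        out.append(ops.pop())
    return out

For example, to_rpn("2 + 6 * 2".split()) returns ['2', '6', '2', '*', '+'], and to_rpn("2 ^ 3 ^ 2".split()) returns ['2', '3', '2', '^', '^'], keeping ^ right-associative.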

Data structure for an Algebra equation?

I'm trying to make an application that is fed an algebra equation and solves for a given variable of the user's choosing.
Pseudocode below
enum Variable
    x, pi, y, z; // .. etc
class Value
    double constant;
    Variable var;
class Term
    Value val;            // Might be a variable or a constant
    Expression exponent;  // The exponent of this term
    boolean sign;         // Negative flag
class Expression
    LinkedList<Term>;     // All the terms in this expression
^ This is what I need help on.
For example, the average equation might be:
y = x + (x - 5)^z
Here y and x are terms, + is an operator, and (x - 5)^z is an expression raised to a term.
I need to store this information in some sort of data structure in order to parse through it. As you see above, where I wrote LinkedList<Term>, it works, but there is no way for me to represent operators.
Using the above example, this is how I want my data structure to look like:
// Left side of the equals sign
{ NULL <-> y <-> NULL }
// Right side of the equals sign
{ NULL <-> x <-> Operator.ADD <-> Expression: (x - 5) <-> NULL }
I can't do this, though, because a LinkedList needs to be of one data type, which would need to be Expression. How should I represent operators?
It is significantly easier to work with expressions when you have them represented as abstract syntax trees, tree structures that show the underlying structures of formulas. I would strongly recommend investigating how to use ASTs here; you typically build them with a parsing algorithm (Dijkstra's shunting-yard algorithm might work really well for you based on your setup) and then use either abstract methods or the visitor pattern to traverse the ASTs to perform the computations you need.
ASTs are often represented by having either an interface or an abstract class representing a node in the tree, then having subclasses for each operator you'd encounter (they represent internal nodes) and subclasses for concepts like "number" or "variable" (typically, they're leaves).
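As a sketch of what such a hierarchy might look like (hypothetical names, and in Python rather than Java-style pseudocode):

from abc import ABC, abstractmethod

class Node(ABC):
    @abstractmethod
    def evaluate(self, env): ...   # env maps variable names to values

class Num(Node):                   # leaf: a constant
    def __init__(self, value): self.value = value
    def evaluate(self, env): return self.value

class Var(Node):                   # leaf: a variable
    def __init__(self, name): self.name = name
    def evaluate(self, env): return env[self.name]

class Add(Node):                   # internal node: one subclass per operator
    def __init__(self, left, right): self.left, self.right = left, right
    def evaluate(self, env):
        return self.left.evaluate(env) + self.right.evaluate(env)

class Sub(Node):
    def __init__(self, left, right): self.left, self.right = left, right
    def evaluate(self, env):
        return self.left.evaluate(env) - self.right.evaluate(env)

class Pow(Node):
    def __init__(self, base, exp): self.base, self.exp = base, exp
    def evaluate(self, env):
        return self.base.evaluate(env) ** self.exp.evaluate(env)

# x + (x - 5)^z with x=7, z=2 evaluates to 7 + 2^2 = 11
tree = Add(Var("x"), Pow(Sub(Var("x"), Num(5)), Var("z")))
assert tree.evaluate({"x": 7, "z": 2}) == 11

Because operators are ordinary nodes, the LinkedList-plus-operator problem from the question disappears: an expression is just a Node.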
If you'd like to get a sense of what this might look like, I implemented a tool to generate truth tables for propositional logic formulas using these techniques. The JavaScript source shows off how to use ASTs and the shunting-yard algorithm.

Finding all subsequences from dictionary

In a program I need to efficiently answer queries of the following form:
Given a set of strings A and a query string q, return all s ∈ A such that s is a subsequence of q
For example, given A = {"abc", "aaa", "abd"} and q = "abcd", "abc" and "abd" should be returned.
Is there any better way than iterating each element of A and checking if it is a subsequence of q?
NOTE: I have a STRIPS planner (automated planner) in mind. Each state in a STRIPS planner is a set of propositions like {"(room rooma)", "(at-robby rooma)", "(at ball1 rooma)"}. I want to find all ground actions applicable to a given state. Actions in a STRIPS planner basically consist of two parts, preconditions and effects (the effects are not really relevant here). Preconditions are the set of propositions that need to be true to apply an action to a state. For example, to apply an action "(move rooma roomb)", its preconditions, {"(room rooma)", "(room roomb)", "(at-robby rooma)"}, must all be true in the state.
If your set A is large and you have many queries, you could implement a trie-like structure, where level n refers to character n in a string. In your example:
trie = {
    a: {
        a: {
            a: { value: "aaa" }
        },
        b: {
            c: { value: "abc" },
            d: { value: "abd" }
        }
    }
}
That would enable you to look up matches in a forked path through the trie:
function query(trie, q) {
    s = Set();
    if (q.isEmpty()) {
        if (trie.value) s.add(trie.value);
    } else {
        // either skip the first character of q ...
        s = s.union(query(trie, q[1:]));
        // ... or consume it, if the trie has a matching child
        c = q[0];
        if (trie[c]) {
            s = s.union(query(trie[c], q[1:]));
        }
    }
    return s;
}
Effectively, you will generate all 2^m subsets of the query string of m characters, but in practice the trie is sparse, and you end up checking far fewer paths.
The speed payoff comes with many lookups. Building the trie is more costly than doing a single brute-force lookup, but if you build the trie only once, or have a means to update it when you update the set A, you will get good lookup performance.
The actual data structure for the trie nodes depends on how many possible elements the items can have. In your example, only four letters are used. If you have a limited range of "letters", you can use an array. Otherwise you might need a sort of dictionary, which might make the tree quite big in memory.
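For reference, here is a small runnable Python version of this scheme (my sketch; the nested-dict layout and the reserved VALUE key are my own choices):

VALUE = object()  # reserved key marking "a stored string ends here"

def build_trie(strings):
    trie = {}
    for s in strings:
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})
        node[VALUE] = s
    return trie

def query(node, q):
    """All stored strings that are subsequences of q."""
    found = set()
    if VALUE in node:
        found.add(node[VALUE])
    if q:
        # either skip q[0] ...
        found |= query(node, q[1:])
        # ... or consume it, if this node has a matching child
        if q[0] in node:
            found |= query(node[q[0]], q[1:])
    return found

assert query(build_trie({"abc", "aaa", "abd"}), "abcd") == {"abc", "abd"}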

longest common consecutive subsequence

I know how to find the LCS (longest common subsequence) of two sequences/strings, but the LCS doesn't impose the restriction that the subsequence needs to be consecutive. I've tried it as follows:
function lccs(a, b)
    if a.length == 0 or b.length == 0
        return ""
    possible = []
    if a[0] == b[0]
        possible.push(lccs(a[1:], b[1:]))
    possible.push(lccs(a[1:], b))
    possible.push(lccs(a, b[1:]))
    return longest_string(possible)
where longest_string returns the longest string in an array, and s[1:] means a slice of s that drops the first character.
I've run this both inside a browser in JavaScript and in Go on a remote server, where I put each call to lccs in its own goroutine, although I have no idea about the server's hardware specs, so I have no idea how well those routines were parallelized.
In both cases it ran way too slowly for my needs. Is there a way to speed this up?
I believe the basic idea would be to use dynamic programming, something like this:
for i in 1:length(a) {
    for j in 1:length(b) {
        if (a[i] == b[j]) then {
            result[i,j] = result[i-1,j-1] + 1  # remember to initialize the borders with zeros
            # track the maximum of the matrix
        } else {
            result[i,j] = 0
        }
    }
}
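Here is a runnable Python version of that recurrence (my sketch; result[i][j] holds the length of the longest common suffix of a[:i] and b[:j], and the maximum over the matrix marks the answer):

def longest_common_substring(a, b):
    # result[i][j] = length of the longest common suffix of a[:i] and b[:j]
    result = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best_len, best_end = 0, 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                result[i][j] = result[i - 1][j - 1] + 1
                if result[i][j] > best_len:
                    best_len, best_end = result[i][j], i
            # else: result[i][j] stays 0 (the borders are already zeros)
    return a[best_end - best_len:best_end]

assert longest_common_substring("xabcy", "zabcw") == "abc"

This runs in O(len(a) * len(b)) time, in contrast to the exponential branching of the recursive attempt above.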
This question is basically similar to the problem of sequence alignment, which is common in bioinformatics. In fact, you should be able to use existing sequence alignment algorithms for your purpose (such as BLAST, etc.) by setting the "gap" penalties to very high values, practically disallowing gaps in the alignment.

Spell checker with fused spelling error correction algorithm

Recently I've looked through several spell checker algorithms, including simple ones (like Peter Norvig's) and much more complex ones (like Brill and Moore's). But there's a type of error which none of them can handle. If, for example, I type stackoverflow instead of stack overflow, these spell checkers will fail to correct the mistype (unless stack overflow is in the dictionary of terms). Storing all pairs of words is too expensive (and it would not help if the error is 3 single words without spaces between them).
Is there an algorithm which can correct (despite usual mistypes) this type of errors?
Some examples of what I need:
spel checker -> spell checker
spellchecker -> spell checker
spelcheker -> spell checker
I hacked up Norvig's spell corrector to do this. I had to cheat a bit and add the word 'checker' to Norvig's data file because it never appears. Without that cheating, the problem is really hard.
expertsexchange expert exchange
spel checker spell checker
spellchecker spell checker
spelchecker she checker # can't win them all
baseball base all # baseball isn't in the dictionary either :(
hewent he went
Basically you need to change the code so that:
you add space to the alphabet to automatically explore the word breaks.
you first check that all of the words that make up a phrase are in the dictionary to consider the phrase valid, rather than just dictionary membership directly (the dict contains no phrases).
you need a way to score a phrase against plain words.
The latter is the trickiest, and I use a braindead independence assumption for phrase composition that the probability of two adjacent words is the product of their individual probabilities (here done with sum in log prob space), with a small penalty. I am sure that in practice, you'll want to keep some bigram stats to do that splitting well.
import re, collections, math

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    counts = collections.defaultdict(lambda: 1.0)
    for f in features:
        counts[f] += 1.0
    tot = float(sum(counts.values()))
    model = collections.defaultdict(lambda: math.log(.1 / tot))
    for f in counts:
        model[f] = math.log(counts[f] / tot)
    return model

NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz '

def valid(w):
    return all(s in NWORDS for s in w.split())

def score(w):
    return sum(NWORDS[s] for s in w.split()) - w.count(' ')

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if valid(e2))

def known(words): return set(w for w in words if valid(w))

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=score)

def t(w):
    print w, correct(w)

t('expertsexchange')
t('spel checker')
t('spellchecker')
t('spelchecker')
t('baseball')
t('hewent')
This problem is very similar to the problem of compound splitting as applied to German or Dutch, but also to noisy English data. See Monz & De Rijke for a very simple algorithm (which, I think, can be implemented as a finite state transducer for efficiency), and Google for "compound splitting" and "decompounding".
I sometimes get such suggestions when spell-checking in Kate, so there certainly is an algorithm that can correct some of these errors. I am sure one can do better, but one idea is to split the candidate at likely places and check whether close matches for the components exist. The hard part is to decide which places are likely. In the languages I'm somewhat familiar with, there are letter combinations that occur rarely within words. For example, the combinations dk or lh are, as far as I'm aware, rare in English words. Other combinations often occur at the start of words (e.g. un, ch), so those would be good guesses for splitting too. In the example spelcheker, the lc combination is not too widespread, and ch is a common start of words, so a split into spel and cheker is a prime candidate, and any decent algorithm would then find spell and checker (but it would probably also find spiel, so don't auto-correct, just give suggestions).
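A rough sketch of that idea (my illustration: it leans on Python's difflib.get_close_matches as the "close match" test, and the dictionary is assumed to be a plain list of words):

import difflib

def split_suggestions(candidate, dictionary):
    # Try each split point; keep splits where both halves closely match
    # some dictionary word. A real implementation would rank split points
    # by how unusual the letter pair at the boundary is.
    suggestions = []
    for i in range(2, len(candidate) - 1):  # skip one-letter halves
        left = difflib.get_close_matches(candidate[:i], dictionary, n=1, cutoff=0.8)
        right = difflib.get_close_matches(candidate[i:], dictionary, n=1, cutoff=0.8)
        if left and right:
            suggestions.append(left[0] + " " + right[0])
    return suggestions

# split_suggestions("spelcheker", ["spell", "checker", "spiel"])
# might yield ["spell checker"], from the split between "spel" and "cheker"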
