Parsing expressions with an undefined number of arguments - algorithm

I'm trying to parse a string in a self-made language into a sort of tree, e.g.:
# a * b1 b2 -> c * d1 d2 -> e # f1 f2 * g
should result in:
# a
* b1 b2
-> c
* d1 d2
-> e
# f1 f2
* g
#, * and -> are symbols. a, b1, etc. are texts.
At the moment, RPN is the only method I know for evaluating expressions, so my current solution is as follows. If I allow only a single text token after each symbol, I can easily convert the expression into RPN notation first (b = b1 b2; d = d1 d2; f = f1 f2) and parse it from there:
a b c -> * d e -> * # f g * #
However, merging text tokens with whatever else comes along seems problematic. My idea was to create marker tokens (M), so the RPN looks like:
a M b2 b1 M c -> * M d2 d1 M e -> * # M f2 f1 M g * #
which is also parseable and seems to solve the problem.
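For illustration, here is a minimal stack evaluator for the marked RPN (a Python sketch; the function name and the treatment of M as a toggling group delimiter are my own assumptions, and every symbol is treated as binary):

MARK = object()          # sentinel pushed onto the stack by an opening M
OPS = {"#", "*", "->"}   # every symbol treated as a binary operator here

def parse_marked_rpn(tokens):
    stack = []
    for tok in tokens:
        if tok == "M":
            if MARK in stack:                      # closing marker:
                group = []                         # pop back to the sentinel
                while stack[-1] is not MARK:
                    group.append(stack.pop())
                stack.pop()                        # drop the sentinel
                stack.append(" ".join(group))      # e.g. "b1 b2"
            else:                                  # opening marker
                stack.append(MARK)
        elif tok in OPS:
            right, left = stack.pop(), stack.pop()
            stack.append((tok, left, right))
        else:
            stack.append(tok)
    return stack.pop()

print(parse_marked_rpn("a M b2 b1 M *".split()))
# ('*', 'a', 'b1 b2')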
That said:
Does anyone have experience with something like this, and can you say whether or not it is a viable solution for the future?
Are there better methods for parsing expressions with undefined arity of operators?
Can you point me at some good resources?
Note: yes, I know this example strongly resembles Lisp prefix notation, and maybe the way to go would be to add some brackets, but I don't have any experience here. However, the source text must not contain any artificial brackets, and I'm also not sure what to do about potential infix mixins like # a * b -> [if value1 = value2] c -> d.
Thanks for any help.
EDIT: It seems that what I'm looking for are sources on postfix notation with a variable number of arguments.

I couldn't fully understand your question, but it seems what you want is a grammar definition and a parser generator. I suggest you take a look at ANTLR; with it, it should be pretty straightforward to define a grammar for either your original syntax or the RPN.
Edit: (After exercising self-criticism and making some effort to understand the question details.) Actually, the language grammar is unclear from your example. However, it seems to me that the advantage of prefix/postfix notations (i.e. that you need neither parentheses nor a precedence-aware parser) stems from the fact that you know the number of arguments every time you encounter an operator, so you know exactly how many elements to read (for prefix notation) or to pop from the stack (for postfix notation). OTOH, I believe that having operators which can take a variable number of arguments makes prefix/postfix notations not merely difficult to parse but outright ambiguous. Take the following expression as an example:
# a * b c d
Which of the following three is the canonical form?
(a, *(b, c, d))
(a, *(b, c), d)
(a, *(b), c, d)
Without knowing more about the operators, it is impossible to tell. Of course you could define some sort of greediness for the operators, e.g. * is greedier than #, so it gobbles up all the arguments. But this would defeat the purpose of a prefix notation, because you simply wouldn't be able to write down the second of the three variants above; not without additional syntactic elements.
Now that I think of it, it is probably not by sheer chance that none of the programming languages I know support operators with a variable number of arguments, only functions/procedures.

Expanding all definitions in Isabelle lemma

How can I tell Isabelle to expand all my definitions? With them expanded the proof would be trivial. Unfortunately, no expansion or simplification happens by default, and I basically get back the original expression as the subgoal.
Example:
theory Test
imports Main
begin
definition b0 :: "nat⇒nat"
where "b0 n ≡ (n mod 2)"
definition b1 :: "nat⇒nat"
where "b1 n ≡ (n div 2)"
lemma "(a::nat)≤3 ∧ (b::nat)≤3 ⟶
2*(b1 a)+(b0 a)+2*(b1 b)+(b0 b) = a+b"
apply auto
oops
end
Response before oops:
proof (prove)
goal (1 subgoal):
1. a ≤ 3 ⟹
b ≤ 3 ⟹ 2 * b1 a + b0 a + 2 * b1 b + b0 b = a + b
My recommendation: unfolding
There is a special keyword unfolding for unpacking definitions at the start of proofs. For your example this would read:
unfolding b0_def b1_def by simp
I consider unfolding the most elegant way. It also helps while writing the proofs. Internally, this is (mostly?) equivalent to using the unfold-method:
apply (unfold b0_def b1_def) by simp
This will recursively (!) use the set of equalities you supply to rewrite the proof goal. (Due to the recursion, you should rather not supply a set of equalities that could generate cycles...)
Alternative: Using the simplifier
In cases with possible loops, the simplifier might be able to reach a nice unfolding without running into these cycles, maybe by interleaving with other simplifications. In such cases, by (simp add: b0_def b1_def), as you've suggested, is great!
Alternative definition: Maybe it's just an abbreviation (and not a definition)?
If you find yourself unfolding a definition in every single instance, you could consider using abbreviation instead of definition. Then some Isabelle magic will do the packing/unpacking for you without further hints. abbreviation only affects how the user communicates with Isabelle. It does not introduce new symbols at the object level, and consequently there would be no b1_def facts and the like.
abbreviation b0 :: "nat⇒nat"
where "b0 n ≡ (n mod 2)"
Usually not recommended: Building something like an abbreviation using the simplifier
If you (for whatever reason) want to have a defined name at the object level, but unfold it in almost every instance, you can also feed the defining equality directly into the simplifier.
definition b0 :: "nat⇒nat"
where [simp]: "b0 n ≡ (n mod 2)"
(Usually there should be little reason for the last option.)
Yes, I keep forgetting that definitions are not used in simplifications by default.
Adding the definitions explicitly to the simplification rules solves this problem:
lemma "(a::nat)≤3 ∧ (b::nat)≤3 ⟶
2*(b1 a)+(b0 a)+2*(b1 b)+(b0 b) = a+b"
by (simp add: b0_def b1_def)
This way the definitions (b0, b1) are correctly used.

Is it possible to represent a context-free grammar with first-order logic?

Briefly, I have an EBNF grammar, and hence a parse tree, but I do not know whether there is a procedure for translating it into first-order logic.
For example:
DR ::= E and P
P ::= B | (and P)* | (or P)*
B ::= L | P (and L P)
L ::= a
Yes, there is. The general pattern for translating a production of the form
A ::= B C ... D
is to paraphrase it declaratively as saying:
A sequence of terminals s is an A (or: A generates the sequence s, if you prefer that formulation) if:
s is the concatenation of s_1, s_2, ... s_n, and
s_1 is a B / B generates the sequence s_1, and
s_2 is a C / C generates the sequence s_2, and
...
s_n is a D / D generates the sequence s_n.
Assuming we write these in the obvious way using a generates predicate, and that we can write concatenation using a || operator, your first rule becomes (if I am right to guess that E and P are non-terminals and "and" is a terminal symbol) something like
generates(E, s1)
∧ generates(and, s2)
∧ generates(P, s3)
∧ s = s1 || s2 || s3
⊃ generates(DR, s)
To establish the consequent (i.e. prove that s is an A), prove the antecedents. As long as the grammar does actually generate some sentences, and as long as you have some premises defining the "generates" relation for terminal symbols, the proof will be straightforward.
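The same pattern can be read operationally. Here is a minimal sketch (Python; the toy grammar encoding and all names are mine, not part of the question) of the generates relation, which tries every way of splitting s into s_1 || ... || s_n; note that a left-recursive grammar would make this naive version loop:

GRAMMAR = {                       # hypothetical toy grammar
    "A": [["B", "C"]],
    "B": [["b"]],
    "C": [["c"], ["c", "C"]],
}

def generates(symbol, s):
    if symbol not in GRAMMAR:                 # terminal: generates only itself
        return [symbol] == s
    return any(concat_split(rhs, s) for rhs in GRAMMAR[symbol])

def concat_split(rhs, s):
    # s must be a concatenation s_1 || ... || s_n matching the right-hand side
    if not rhs:
        return s == []
    head, rest = rhs[0], rhs[1:]
    return any(generates(head, s[:i]) and concat_split(rest, s[i:])
               for i in range(len(s) + 1))

print(generates("A", ["b", "c", "c"]))        # True
print(generates("A", ["c", "b"]))             # False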
Prolog definite-clause grammars are a beautiful instantiation of this pattern. It takes some of us a while to understand and appreciate the use of difference lists in DCGs, but they handle the partitioning of s into subsequences and the association of the subsequences with the different parts of the right hand side much more elegantly than the simple translation into logic given above.

Multi-character substitution cipher algorithm

My problem is the following. I have a list of substitutions, including one substitution for each letter of the alphabet, but also some substitutions for groups of more than one letter. For example, in my cipher p becomes b, l becomes w, e becomes i, but le becomes by, and ple becomes memi.
So, while I can think of a few simple/naïve ways of implementing this cipher, it's not very efficient, and I was wondering what the most efficient way to do it would be. The answer doesn't have to be in any particular language, a general structured English algorithm would be fine, but if it must be in some language I'd prefer C++ or Java or similar.
EDIT: I don't need this cipher to be decipherable, an algorithm that mapped all single letters to the letter 'w' but mapped the string 'had' to the string 'jon' instead should be ok, too (then the string "Mary had a little lamb." would become "Wwww jon w wwwwww wwww.").
I'd like the algorithm to be fully general.
One possible approach is to use a deterministic automaton. The closest commonly used relative of your problem is the Aho–Corasick string-matching algorithm. The difference is that, instead of just matching, you emit cipher text on certain transitions. Generally, at each transition you either emit cipher text or you don't.
In your example
p -> b
l -> w
e -> i
le -> by
ple -> memi
The automaton (in Erlang-like pseudocode):
start(p) -> p(next());
start(l) -> l(next());
start(e) -> e(next());
...
p(l) -> pl(next());
p(X) -> emit(b), start(X).
l(e) -> emit(by), start(next());
l(X) -> emit(w), start(X).
e(X) -> emit(i), start(X).
pl(e) -> emit(memi), start(next());
pl(X) -> emit(b), l(X).
If you are not familiar with Erlang: start(), p(), etc. are functions, one per state. Each line with -> is one transition, and the actions follow the ->. emit() is the function that emits cipher text and next() is the function that returns the next input character. X is a variable that matches any other character.
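If you only need the substitution behaviour and not strictly linear time, a naive greedy longest-match scan produces the same output for this rule set (a Python sketch; the names are mine):

RULES = {"ple": "memi", "le": "by", "p": "b", "l": "w", "e": "i"}
MAX_LEN = max(map(len, RULES))    # length of the longest left-hand side

def encipher(text):
    out, i = [], 0
    while i < len(text):
        # try the longest possible rule first, then shorter ones
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += n
                break
        else:                     # no rule matched: copy the character
            out.append(text[i])
            i += 1
    return "".join(out)

print(encipher("ple"))    # memi
print(encipher("apple"))  # abmemi (a copied, p -> b, ple -> memi)

The automaton earns its keep when there are many overlapping patterns: the longest-match scan may re-examine characters, while Aho–Corasick visits each input character exactly once.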

Computing the Follow Set

Ok, I've understood how to compute the Follow_k(N) set (N is a nonterminal): for every production rule of the form A -> aBc you add First_k(First_k(c)Follow_k(A)) to Follow_k(B) (a, c are any group of terminals and nonterminals, or even lambda). ...and you repeat this until there's nothing left to add.
But what happens for production rules like S -> ABCD (A, B, C, D are all nonterminals)?
Should I
add First_k(First_k(BCD)Follow_k(S)) to Follow_k(A) or
add First_k(First_k(CD)Follow_k(S)) to Follow_k(B) or
add First_k(First_k(D)Follow_k(S)) to Follow_k(C) or
add First_k(First_k(lambda)Follow_k(S)) to Follow_k(D) or
do all of the above?
UPDATE:
Let's take the following grammar for example:
S -> ABC
A -> a
B -> b
C -> c
Intuitively, Follow_1(S) = {} because nothing follows after S
Follow_1(A) = {b} because b follows after A,
Follow_1(B) = {c} because c follows after B,
Follow_1(C) = {} because nothing follows after C.
In order to get this result using the algorithm you must consider all cases for S -> ABC.
But my judgement or my example may not be right, so the question still remains open...
If you run into trouble on other grammar problems like this, give this online first, follow, & predict set finder a shot. It's automatic and you can compare answers to its output to get a feel for how to work through these.
But what happens for production rules like: S -> ABCD (A, B, C, D are all nonterminals)?
Here are the rules for finding follow sets.
First put $ (the end of input marker) in Follow(S) (S is the start symbol)
If there is a production A → aBb (where a and b can be whole strings), then everything in FIRST(b) except ε is placed in FOLLOW(B).
If there is a production A → aB, then everything in FOLLOW(A) is in FOLLOW(B)
If there is production A → aBb, where FIRST(b) contains ε, then everything in FOLLOW(A) is in FOLLOW(B)
Let's use your example grammar:
S -> ABC
A -> a
B -> b
C -> c
Rule 1 says that follow(S) contains $.
Rule 2 gives us: follow(A) contains first(B); also, follow(B) contains first(C).
Rule 3 says that follow(C) contains follow(S).
None of your productions are nullable, so we don't care about rule #4. A symbol is nullable if it derives ε or if it derives a nullable non-terminal symbol.
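Here's a small fixed-point sketch (Python, k = 1; the encoding is mine) that implements rules 1-4 for the example grammar; ε is represented by the empty string:

GRAMMAR = {
    "S": [["A", "B", "C"]],
    "A": [["a"]],
    "B": [["b"]],
    "C": [["c"]],
}
START = "S"

first = {n: set() for n in GRAMMAR}
follow = {n: set() for n in GRAMMAR}
follow[START].add("$")                        # rule 1

def first_of_seq(seq):
    # FIRST of a sequence of symbols; "" stands for ε (nullable)
    out = set()
    for sym in seq:
        f = first[sym] if sym in GRAMMAR else {sym}
        out |= f - {""}
        if "" not in f:
            return out
    out.add("")                               # every symbol was nullable
    return out

changed = True
while changed:                                # iterate to a fixed point
    changed = False
    for lhs, rhss in GRAMMAR.items():
        for rhs in rhss:
            f = first_of_seq(rhs)
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
            for i, sym in enumerate(rhs):
                if sym not in GRAMMAR:        # only nonterminals get FOLLOW
                    continue
                tail = first_of_seq(rhs[i + 1:])
                new = tail - {""}             # rule 2
                if "" in tail:                # rules 3 and 4
                    new |= follow[lhs]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True

print(follow)  # {'S': {'$'}, 'A': {'b'}, 'B': {'c'}, 'C': {'$'}}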
Nullability's transitivity can trip people up. Consider this grammar:
S -> A
A -> B
B -> ε
Since B derives ε, B's nullable. Since A derives B, which derives ε, A's nullable too. S derives A, which derives B, which derives ε, so S is nullable as well.
Granted, you didn't bring that up, but it's a common source of confusion in compiler courses, so I figured I'd lay it out.
Also, if you need some sample grammars to work through, http://faculty.stedwards.edu/laurab/cosc4342/g1answers.txt might be handy.

complex logical operation

Given this logical operation:
(A AND B) OR (C AND D)
Is there a way to write a similar expression without using any parentheses that gives the same result? Use of the logical operators AND, OR, and NOT is allowed.
Yes:
A and B or C and D
In most programming languages, and is taken to have higher precedence than or (this stems from the equivalence of and and or to * and +, respectively).
Of course, if your original expression had been:
(A or B) and (C or D)
you couldn't simply remove the parentheses. In this instance, you'd have to "multiply out" the factors:
A and C or B and C or A and D or B and D
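With only four Boolean variables there are just 16 cases, so both claims are easy to verify exhaustively; a quick Python check relying on and binding tighter than or:

from itertools import product

for A, B, C, D in product([False, True], repeat=4):
    # removing parentheses is safe when "and" already binds tighter
    assert ((A and B) or (C and D)) == (A and B or C and D)
    # the (A or B) and (C or D) case has to be multiplied out
    assert ((A or B) and (C or D)) == (A and C or B and C or A and D or B and D)

print("all 16 cases agree")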
How about A AND B OR C AND D? It's the same because AND takes precedence over OR.
Just don't put any parentheses; it is the same...
It can be written as:
A & B | C & D
That is, type it exactly as in the question with the parentheses removed; it will give the same result. Here & stands for AND (which behaves like multiplication) and | stands for OR (which behaves like addition), so & binds tighter and no parentheses are needed.
