I have the following simple expression parser:
expr(+(T,E))-->term(T),"+",expr(E).
expr(T)-->term(T).
term(*(F,T))-->factor(F),"*",term(T).
term(F)-->factor(F).
factor(N)-->nat(N).
factor(E)-->"(",expr(E),")".
nat(0)-->"0".
nat(1)-->"1".
nat(2)-->"2".
nat(3)-->"3".
nat(4)-->"4".
nat(5)-->"5".
nat(6)-->"6".
nat(7)-->"7".
nat(8)-->"8".
nat(9)-->"9".
However, this only supports one-digit numbers. How can I parse multi-digit numbers in this case?
Use accumulator variables, and pass those in recursive calls. In the following, A and A1 are the accumulator.
digit(0) --> "0".
digit(1) --> "1".
% ...
digit(9) --> "9".
nat(N) --> digit(D), nat(D,N).
nat(N,N) --> [].
nat(A,N) --> digit(D), { A1 is A*10 + D }, nat(A1,N).
Note that the first nat clause initializes the accumulator by consuming a digit, because you don't want to match the empty string.
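To make the accumulator version easy to try, here is a minimal self-contained sketch. It assumes the input is a list of character codes; the helper name nat_rest//2 and the range check inside digit//1 are illustrative choices, not part of the answer above.

```prolog
% digit(D): consume one code between "0" and "9" and return its numeric value.
digit(D) --> [C], { C >= 0'0, C =< 0'9, D is C - 0'0 }.

% nat(N): one or more digits; the first digit seeds the accumulator.
nat(N) --> digit(D), nat_rest(D, N).

% nat_rest(A, N): A is the value read so far, N is the final value.
nat_rest(A, N) --> digit(D), { A1 is A*10 + D }, nat_rest(A1, N).
nat_rest(N, N) --> [].
```

With this loaded, ?- atom_codes('123', Cs), phrase(nat(N), Cs). should answer N = 123, while phrase(nat(N), []) fails because at least one digit is required.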
nat(0).
nat(N):-nat(N-1).
But you use a syntax that I don't know (see my comment above).
Can you provide a sample input?
I think this might work (in SWI-Prolog, number//1 is provided by library(dcg/basics)):
nat(N)-->number(N).
If that fails try:
nat(N)-->number(N),!.
The ! is a cut: it commits to the choices made so far, so Prolog will not backtrack into number//1 to look for alternative parses. You can read about it in books/tutorials.
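For reference, a small sketch of the library-based route, assuming SWI-Prolog and its library(dcg/basics). It uses digits//1 from that library rather than number//1, so that only unsigned integers are accepted; that substitution is my own choice.

```prolog
:- use_module(library(dcg/basics)).   % SWI-Prolog library, assumed to be available

% nat(N): a non-empty run of decimal digits, converted to an integer.
% The [D|Ds] pattern rejects the empty run that digits//1 would otherwise allow.
nat(N) --> digits([D|Ds]), { number_codes(N, [D|Ds]) }.
```

For example, ?- atom_codes('2024', Cs), phrase(nat(N), Cs). should answer N = 2024.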
As I said in the title, I am trying to do an exercise where I need to write a DCG capable of reading propositional logic expressions, in which variables are represented by lowercase letters and the operators are not, and, and or, with the tokens separated by whitespace. So the expression:
a or not b and c
is parsed as
a or ( not b and c )
producing a parse tree that looks like:
or(
a,
and(
not(b),
c
)
)
To be completely honest I have been having a hard time understanding how to effectively use DCGs, but this is what I've got so far:
bexpr([T]) --> [T].
bexpr(not,R1) --> oper(not), bexpr(R1).
bexpr(R1,or,R2) --> bexpr(R1),oper(or), bexpr(R2).
bexpr(R1, and ,R2) --> bexpr(R1),oper(and), bexpr(R2).
oper(X) --> X.
I would appreciate any suggestions, either on the exercise itself, or on how to better understand DCGs.
The key to understanding DCGs is that they are syntactic sugar over writing a recursive descent parser. You need to think about operator precedence (how tightly do your operators bind?). Here, the operator precedence, from tightest to loosest is
not
and
or
so a or not b and c is evaluated as a or ( (not b) and c ).
And we can say this (I've included parenthetical expressions as well, because they're pretty trivial to do):
% the infix OR operator is the lowest priority, so we start with that.
expr --> infix_OR.
% and an infix OR expression is either the next highest priority operator (AND),
% or... it's an actual OR expression.
infix_OR --> infix_AND.
infix_OR --> infix_AND, [or], infix_OR.
% and an infix AND expression is either next highest priority operator (NOT)
% or... it's an actual AND expression.
infix_AND --> unary_NOT.
infix_AND --> unary_NOT, [and], infix_AND.
% and the unary NOT expression is either a primary expression
% or... an actual unary NOT expression
unary_NOT --> primary.
unary_NOT --> [not], primary.
% and a primary expression is either an identifier
% or... it's a parenthetical expression.
%
% NOTE that the body of the parenthetical expression starts parsing at the root level.
primary --> identifier.
primary --> ['('], expr, [')'].
identifier --> [X], {id(X)}. % the stuff in '{...}' is evaluated as normal prolog code.
id(a).
id(b).
id(c).
id(d).
id(e).
id(f).
id(g).
id(h).
id(i).
id(j).
id(k).
id(l).
id(m).
id(n).
id(o).
id(p).
id(q).
id(r).
id(s).
id(t).
id(u).
id(v).
id(w).
id(x).
id(y).
id(z).
But note that all this does is to recognize sentences of the grammar (pro tip: if you write your grammar correctly, it should also be able to generate all possible valid sentences of the grammar). Note that this might take a while to do, depending on your grammar.
So, to actually DO something with the parse, you need to add a little extra. We do this by adding extra arguments to the DCG, viz:
expr( T ) --> infix_OR(T).
infix_OR( T ) --> infix_AND(T).
infix_OR( or(X,Y) ) --> infix_AND(X), [or], infix_OR(Y).
infix_AND( T ) --> unary_NOT(T).
infix_AND( and(X,Y) ) --> unary_NOT(X), [and], infix_AND(Y).
unary_NOT( T ) --> primary(T).
unary_NOT( not(X) ) --> [not], primary(X).
primary( ID ) --> identifier(ID).
primary( T ) --> ['('], expr(T), [')'].
identifier( ID ) --> [X], { id(X), ID = X }.
id(a).
id(b).
id(c).
id(d).
id(e).
id(f).
id(g).
id(h).
id(i).
id(j).
id(k).
id(l).
id(m).
id(n).
id(o).
id(p).
id(q).
id(r).
id(s).
id(t).
id(u).
id(v).
id(w).
id(x).
id(y).
id(z).
And that is where the parse tree is constructed. One might note that one could just as easily evaluate the expression instead of building the parse tree... and then you're on your way to writing an interpreted language.
You can fiddle with it at this fiddle: https://swish.swi-prolog.org/p/gyFsAeAz.pl
where you'll notice that executing the goal phrase(expr(T),[a, or, not, b, and, c]). yields the desired parse T = or(a, and(not(b), c)).
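As a sketch of the "evaluate instead of building the parse tree" remark, the grammar below has the same shape as the one above but returns a truth value. The value/2 facts are a hypothetical truth assignment, and the eval_* and bool_* names are illustrative only.

```prolog
% Hypothetical truth assignment for the identifiers used in the example.
value(a, false).
value(b, false).
value(c, true).

% Same precedence layering as the tree-building grammar: or < and < not.
eval_expr(V) --> eval_or(V).

eval_or(V)  --> eval_and(X), [or], eval_or(Y), { bool_or(X, Y, V) }.
eval_or(V)  --> eval_and(V).

eval_and(V) --> eval_not(X), [and], eval_and(Y), { bool_and(X, Y, V) }.
eval_and(V) --> eval_not(V).

eval_not(V) --> [not], eval_primary(X), { bool_not(X, V) }.
eval_not(V) --> eval_primary(V).

eval_primary(V) --> [X], { value(X, V) }.
eval_primary(V) --> ['('], eval_expr(V), [')'].

% Truth tables.
bool_not(true, false).
bool_not(false, true).

bool_and(true,  true,  true).
bool_and(true,  false, false).
bool_and(false, true,  false).
bool_and(false, false, false).

bool_or(true,  true,  true).
bool_or(true,  false, true).
bool_or(false, true,  true).
bool_or(false, false, false).
```

With that assignment, ?- phrase(eval_expr(V), [a, or, not, b, and, c]). should answer V = true, since not b is true, (not b) and c is true, and a or true is true.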
Let's say I want to tokenize a string of words (symbols) and numbers separated by whitespaces. For example, the expected result of tokenizing "aa 11" would be [tkSym("aa"), tkNum(11)].
My first attempt was the code below:
whitespace --> [Ws], { code_type(Ws, space) }, whitespace.
whitespace --> [].
letter(Let) --> [Let], { code_type(Let, alpha) }.
symbol([Sym|T]) --> letter(Sym), symbol(T).
symbol([Sym]) --> letter(Sym).
digit(Dg) --> [Dg], { code_type(Dg, digit) }.
digits([Dg|Dgs]) --> digit(Dg), digits(Dgs).
digits([Dg]) --> digit(Dg).
token(tkSym(Token)) --> symbol(Token).
token(tkNum(Token)) --> digits(Digits), { number_chars(Token, Digits) }.
tokenize([Token|Tokens]) --> whitespace, token(Token), tokenize(Tokens).
tokenize([]) --> whitespace, [].
Calling tokenize on "aa bb" leaves me with several possible responses:
?- tokenize(X, "aa bb", []).
X = [tkSym([97|97]), tkSym([98|98])] ;
X = [tkSym([97|97]), tkSym(98), tkSym(98)] ;
X = [tkSym(97), tkSym(97), tkSym([98|98])] ;
X = [tkSym(97), tkSym(97), tkSym(98), tkSym(98)] ;
false.
In this case, however, it seems appropriate to expect only one correct answer. Here's another, more deterministic approach:
whitespace --> [Space], { char_type(Space, space) }, whitespace.
whitespace --> [].
symbol([Sym|T]) --> letter(Sym), !, symbol(T).
symbol([]) --> [].
letter(Let) --> [Let], { code_type(Let, alpha) }.
% similarly for numbers
token(tkSym(Token)) --> symbol(Token).
tokenize([Token|Tokens]) --> whitespace, token(Token), !, tokenize(Tokens).
tokenize([]) --> whitespace, [].
But there is a problem: although the single answer to token called on "aa" is now a nice list, the tokenize predicate ends up in an infinite recursion:
?- token(X, "aa", []).
X = tkSym([97, 97]).
?- tokenize(X, "aa", []).
ERROR: Out of global stack
What am I missing? How is the problem usually solved in Prolog?
The underlying problem is that in your second version, token//1 also succeeds for the "empty" token:
?- phrase(token(T), "").
T = tkSym([]).
Therefore, unintentionally, the following succeeds too, as does an arbitrary number of tokens:
?- phrase((token(T1),token(T2)), "").
T1 = T2, T2 = tkSym([]).
To fix this, I recommend you adjust the definitions so that a token must consist of at least one lexical element, as is also typical. A good way to ensure that at least one element is described is to split the DCG rules into two sets. For example, shown for symbol//1:
symbol([L|Ls]) --> letter(L), symbol_r(Ls).
symbol_r([L|Ls]) --> letter(L), symbol_r(Ls).
symbol_r([]) --> [].
This way, you avoid an unbounded recursion that can endlessly consume empty tokens.
Other points:
Always use phrase/2 to access DCGs in a portable way, i.e., independent of the actual implementation method used by any particular Prolog system.
The [] in the final DCG clause is superfluous, you can simply remove it.
Also, avoid using so many !/0. It is OK to commit to the first matching tokenization, but do it only at a single place, like via a once/1 wrapped around the phrase/2 call.
For naming, see my comment above. I recommend using tokens//1 to make this more declarative. Sample queries, using the above definition of symbol//1:
?- phrase(tokens(Ts), "").
Ts = [].
?- phrase(tokens(Ls), "a").
Ls = [tkSym([97])].
?- phrase(tokens(Ls), "a b").
Ls = [tkSym([97]), tkSym([98])].
I'm trying to teach myself prolog and implementing an interpreter for a simple arithmetic cfg:
<expression> --> number
<expression> --> ( <expression> )
<expression> --> <expression> + <expression>
<expression> --> <expression> - <expression>
<expression> --> <expression> * <expression>
<expression> --> <expression> / <expression>
So far, I've written this in SWI-Prolog, which hits a number of bugs described below:
expression(N) --> number(Cs), { number_codes(N, Cs) }.
expression(N) --> "(", expression(N), ")".
expression(N) --> expression(X), "+", expression(Y), { N is X + Y }.
expression(N) --> expression(X), "-", expression(Y), { N is X - Y }.
number([D|Ds]) --> digit(D), number(Ds).
number([D]) --> digit(D).
digit(D) --> [D], { code_type(D, digit) }.
Testing with
phrase(expression(X), "12+4").
gives X = 16 which is good. Also
phrase(expression(X), "(12+4)").
works and phrase(expression(X), "12+4+5"). is ok.
But trying
phrase(expression(X), "12-4").
causes "ERROR: Out of local stack" unless I comment out the "+" rule. And while I can add more than two numbers, brackets don't work recursively (ie "(1+2)+3" hangs).
I'm sure the solution is simple, but I haven't been able to figure it out from the online tutorials I've found.
Everything you did is correct in principle. And you're right: the answer is simple.
But.
Left recursion is fatal in definite-clause grammars; the symptom is precisely the behavior you are seeing.
If you set a spy point on expression and use the trace facility, you can watch your stack grow and grow and grow while the parser makes no progress at all.
gtrace.
spy(expression).
phrase(expression(N),"12-4").
If you think carefully about the Prolog execution model, you can see what is happening.
1. We try to parse "12-4" as an expression.
Our call stack contains this call to expression from step 1, which I will write expression(1).
2. We succeed in parsing "12" as an expression, by the first clause for "expression", and we record a choice point in case we need to backtrack later. In fact we need to backtrack immediately, because the parent request involving phrase says we want to parse the entire string, and we haven't: we still have "-4" to go. So we fail and go back to the choice point. We have shown that the first clause of "expression" doesn't succeed so we retry against the second clause.
Call stack: expression(1).
3. We try to parse "12-4" using the second clause for "expression", but fail immediately (the initial character is not "("). So we fail and retry against the third clause.
Call stack: expression(1).
4. The third clause asks us to parse an expression off the beginning of the input and then find a "+" and another expression. So we must try now to parse the beginning of the input as an expression.
Call stack: expression(4) expression(1).
5. We try to parse the beginning of "12-4" as an expression, and succeed with "12", just as in step 2. We record a choice point in case we need to backtrack later.
Call stack: expression(4) expression(1).
6. We now resume the attempt begun in step 4 to parse "12-4" as an expression against clause 3 of "expression". We've done the first bit; now we must try to parse "-4" as the remainder of the right-hand side of clause 3 of "expression", namely "+", expression(Y). But "-" is not "+", so we fail immediately, and go back to the most recently recorded choice point, the one recorded in step 5. The next thing is to try to find a different way of parsing the beginning of the input as an expression. We resume this search with the second clause of "expression".
Call stack: expression(4) expression(1).
7. Once again the second clause fails. So we continue with the third clause of "expression". This asks us to look for an expression at the beginning of the input (as part of figuring out whether our current two calls to "expression" can succeed or will fail). So we call "expression" again.
Call stack: expression(7) expression(4) expression(1).
You can see that each time we add a call to expression to the stack, we are going to succeed, look for a plus, fail, and try again, eventually reaching the third clause, at which point we will push another call on the stack and try again.
Short answer: left recursion is fatal in DCGs.
It's also fatal in recursive-descent parsers, and the solution is much the same: don't recur to the left.
A non-left-recursive version of your grammar would be:
expression(N) --> term(N).
expression(N) --> term(X), "+", expression(Y), { N is X + Y }.
expression(N) --> term(X), "-", expression(Y), { N is X - Y }.
term(N) --> number(Cs), { number_codes(N, Cs) }.
term(N) --> "(", expression(N), ")".
However, this makes "-" right associative, and requires the initial term to be reparsed repeatedly in many cases, so a common approach in code intended for production is to do something less like the BNF you started with and more like the following EBNF version:
expression = term {("+"|"-") term}
term = number | "(" expression ")".
The way I learned to write it (long enough ago that I no longer remember whom to credit for it) is something like this (I found it ugly at first, but it grows on you):
expression(N) --> term(X), add_op_sequence(X,N).
add_op_sequence(LHS0, Result) -->
    "+", term(Y),
    {LHS1 is LHS0 + Y},
    add_op_sequence(LHS1,Result).
add_op_sequence(LHS0, Result) -->
    "-", term(Y),
    {LHS1 is LHS0 - Y},
    add_op_sequence(LHS1,Result).
add_op_sequence(N,N) --> [].
term(N) --> number(Cs), { number_codes(N, Cs) }.
term(N) --> "(", expression(N), ")".
The value accumulated so far is passed down in the left-hand argument of add_op_sequence and eventually (when the sequence ends with the empty production) passed back up as a result.
The parsing strategy known as 'left-corner parsing' is a way of dealing with this problem; books on the use of Prolog in natural-language processing will almost invariably discuss it.
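The question's grammar also lists "*" and "/", which the code above does not cover. A possible extension in the same accumulator style is sketched below: multiplicative operators get their own layer so that they bind tighter than "+" and "-". The names add_ops//2, mul_ops//2, factor//1 and digits_rest//1 are my own, the double_quotes flag is set so that "+" and friends are code lists, and whitespace is not handled.

```prolog
:- set_prolog_flag(double_quotes, codes).

% expression: a term followed by any number of +/- terms (left-associative).
expression(N) --> term(X), add_ops(X, N).

add_ops(A, N) --> "+", term(Y), { A1 is A + Y }, add_ops(A1, N).
add_ops(A, N) --> "-", term(Y), { A1 is A - Y }, add_ops(A1, N).
add_ops(N, N) --> [].

% term: a factor followed by any number of * or / factors (left-associative).
term(N) --> factor(X), mul_ops(X, N).

mul_ops(A, N) --> "*", factor(Y), { A1 is A * Y }, mul_ops(A1, N).
mul_ops(A, N) --> "/", factor(Y), { A1 is A / Y }, mul_ops(A1, N).
mul_ops(N, N) --> [].

% factor: a number or a parenthesized expression.
factor(N) --> digits(Cs), { number_codes(N, Cs) }.
factor(N) --> "(", expression(N), ")".

digits([D|Ds])      --> digit(D), digits_rest(Ds).
digits_rest([D|Ds]) --> digit(D), digits_rest(Ds).
digits_rest([])     --> [].

digit(D) --> [D], { D >= 0'0, D =< 0'9 }.
```

For instance, ?- atom_codes('12-4-3', Cs), phrase(expression(N), Cs). should answer N = 5 (left-associative subtraction), and '(1+2)*3' should evaluate to 9.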
I have simplified a more complex problem to the following: there are three houses on a street with three different colours (no colour repeated); red, blue, green. Write a program using DCGs to simulate all permutations/possibilities. My code won't run and I'm struggling to see why. Any corrections would really help.
s --> h(X), h(Y), h(Z), X\=Y, X\=Z, Y\=Z.
h(X) --> Col(X).
Col(X) --> [red].
Col(X) --> [blue].
Col(X) --> [green].
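First, rename Col to col (s/Col/col/): a name starting with an uppercase letter is a variable in Prolog, so it cannot be used as the name of a non-terminal.
And then, within s//0 you are using Prolog goals in the place of non-terminals. That does not work; you need to "escape" them with {}//1 like so: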
s -->h(X),h(Y),h(Z),{X\=Y,X\=Z,Y\=Z}.
But I would rather write:
s --> {dif(X,Y), dif(Y,Z), dif(X,Z)}, h(X),h(Y),h(Z).
In this manner Prolog performs all the bookkeeping for you.
While we're at it: don't forget to call the non-terminal via phrase/2. Thus:
?- phrase(s, L).
You are (also) forgetting to 'return' the value from the leaves:
...
col(red)-->[red].
...
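Putting the corrections from both answers together gives the following sketch. It assumes a Prolog system that provides dif/2 (SWI-Prolog does), and each col//1 clause "returns" its colour through the argument.

```prolog
% s: three houses whose colours are pairwise different.
% The dif/2 constraints are posted first and checked as colours get bound.
s --> { dif(X, Y), dif(Y, Z), dif(X, Z) }, h(X), h(Y), h(Z).

h(X) --> col(X).

col(red)   --> [red].
col(blue)  --> [blue].
col(green) --> [green].
```

Calling ?- phrase(s, L). should then enumerate the six permutations on backtracking, starting with L = [red, blue, green].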
With such a small dataset, it's tempting to hardcode the permutation:
s --> r,g,b ; r,b,g ; g,r,b ; b,r,g ; g,b,r ; b,g,r.
r --> [red].
g --> [green].
b --> [blue].
So basically I use this code to check the substring:
substring(X,S) :- append(_,T,S), append(X,_,T), X \= [].
and my input is this:
substring("cmp", Ins) % Ins is "cmp(eax, 4)"
But when I use swi-prolog to trace this code, I find this:
substring([99, 109, 112], cmp(eax, 4))
and obviously it failed...
So could anyone give me some help?
Starting from version 7, SWI-Prolog has changed the traditional representation of string literals as 'lists of codes' to a more memory-efficient string type.
As a consequence (among others that are more difficult to explain), append/3 no longer works for your task unless you convert explicitly to a list of codes.
At the same time, many built-ins have been introduced, like sub_string/5: for instance, try
?- sub_string("cmp(eax, 4)", Start,Len,Stop, "eax").
Start = Stop, Stop = 4,
Len = 3
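If you would rather keep the append/3-based definition, the explicit conversion mentioned above could look like the following sketch. It assumes SWI-Prolog 7 or later for string_codes/2; text_contains/2 is a hypothetical wrapper name, and both arguments must be text (a compound term such as cmp(eax, 4) would first have to be converted to text).

```prolog
:- use_module(library(lists)).   % append/3

% substring(X, S): X is a non-empty contiguous sublist of the list S.
substring(X, S) :- append(_, T, S), append(X, _, T), X \= [].

% text_contains(Sub, Text): convert both texts to code lists,
% then reuse the list-based substring/2 and keep only the first match.
text_contains(Sub, Text) :-
    string_codes(Sub, SubCodes),
    string_codes(Text, TextCodes),
    once(substring(SubCodes, TextCodes)).
```

With this, ?- text_contains("cmp", "cmp(eax, 4)"). should succeed, while ?- text_contains("mov", "cmp(eax, 4)"). should fail.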
Make this string a term of the form cmp(eax, 4). Here, in Prolog lingo, you have:
the term cmp(eax, 4)
with a functor cmp/2
with a first argument the atom eax
and a second argument the integer 4
Now that you have a term, you can use pattern matching in the head of your predicate (unification) to write predicates like:
apply_instruction(cmp(Reg, Operand) /*, other arguments as needed */) :-
/* do the comparison of the contents of _Reg_ and the values in _Operand_ */
apply_instruction(add(Reg, Addend) /*, other arguments */) :-
/* add _Addend_ to _Reg_ */
% and so on
How to make a term out of your input: there are many ways, the easiest would be to read one full line (depends on the Prolog implementation you are using, in SWI-Prolog, assuming you have your input stream in In):
read_line_to_codes(In, Line).
and then use a DCG to parse it. A DCG would look maybe something like:
instruction(cmp(Op1, Op2)) -->
    "cmp",
    ops(Op1, Op2).
instruction(add(Op1, Op2)) -->
    "add",
    ops(Op1, Op2).
ops(Op1, Op2) -->
    space,
    op1(Op1), optional_space,
    ",", optional_space,
    op2(Op2),
    space_to_eol.
% and so on
You can then use phrase/2 to apply the DCG to the line you have read:
phrase(instruction(Instr), Line).
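The DCG above leaves op1//1, op2//1 and the whitespace non-terminals undefined. One way to fill them in is sketched below, under assumptions of my own: the line looks like "cmp eax, 4", register names are lowercase letters, immediates are unsigned integers, and all helper names (mnemonic//1, operand//1, blanks//0 and so on) are hypothetical.

```prolog
:- set_prolog_flag(double_quotes, codes).

% instruction(Instr): mnemonic, blanks, two comma-separated operands.
instruction(Instr) -->
    mnemonic(Name), blanks1,
    operand(Op1), blanks, ",", blanks,
    operand(Op2), blanks,
    { Instr =.. [Name, Op1, Op2] }.

mnemonic(cmp) --> "cmp".
mnemonic(add) --> "add".

% An operand is either an unsigned integer or a lowercase register name.
operand(N) --> digits(Ds),  { number_codes(N, Ds) }.
operand(R) --> letters(Ls), { atom_codes(R, Ls) }.

digits([D|Ds])    --> digit(D), digits_r(Ds).
digits_r([D|Ds])  --> digit(D), digits_r(Ds).
digits_r([])      --> [].

letters([L|Ls])   --> letter(L), letters_r(Ls).
letters_r([L|Ls]) --> letter(L), letters_r(Ls).
letters_r([])     --> [].

digit(D)  --> [D], { D >= 0'0, D =< 0'9 }.
letter(L) --> [L], { L >= 0'a, L =< 0'z }.

% blanks1: at least one whitespace code; blanks: zero or more.
blanks1 --> blank, blanks.
blanks  --> blank, blanks.
blanks  --> [].
blank   --> [C], { code_type(C, space) }.
```

For example, ?- atom_codes('cmp eax, 4', Line), phrase(instruction(I), Line). should answer I = cmp(eax, 4), which can then be passed to a predicate such as apply_instruction.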