Tokenizing a string in Prolog using DCG - prolog

Let's say I want to tokenize a string of words (symbols) and numbers separated by whitespace. For example, the expected result of tokenizing "aa 11" would be [tkSym("aa"), tkNum(11)].
My first attempt was the code below:
whitespace --> [Ws], { code_type(Ws, space) }, whitespace.
whitespace --> [].
letter(Let) --> [Let], { code_type(Let, alpha) }.
symbol([Sym|T]) --> letter(Sym), symbol(T).
symbol([Sym]) --> letter(Sym).
digit(Dg) --> [Dg], { code_type(Dg, digit) }.
digits([Dg|Dgs]) --> digit(Dg), digits(Dgs).
digits([Dg]) --> digit(Dg).
token(tkSym(Token)) --> symbol(Token).
token(tkNum(Token)) --> digits(Digits), { number_chars(Token, Digits) }.
tokenize([Token|Tokens]) --> whitespace, token(Token), tokenize(Tokens).
tokenize([]) --> whitespace, [].
Calling tokenize on "aa bb" leaves me with several possible responses:
?- tokenize(X, "aa bb", []).
X = [tkSym([97|97]), tkSym([98|98])] ;
X = [tkSym([97|97]), tkSym(98), tkSym(98)] ;
X = [tkSym(97), tkSym(97), tkSym([98|98])] ;
X = [tkSym(97), tkSym(97), tkSym(98), tkSym(98)] ;
false.
In this case, however, it seems appropriate to expect only one correct answer. Here's another, more deterministic approach:
whitespace --> [Space], { char_type(Space, space) }, whitespace.
whitespace --> [].
symbol([Sym|T]) --> letter(Sym), !, symbol(T).
symbol([]) --> [].
letter(Let) --> [Let], { code_type(Let, alpha) }.
% similarly for numbers
token(tkSym(Token)) --> symbol(Token).
tokenize([Token|Tokens]) --> whitespace, token(Token), !, tokenize(Tokens).
tokenize([]) --> whitespace, [].
But there is a problem: although the single answer to token called on "aa" is now a nice list, the tokenize predicate ends up in an infinite recursion:
?- token(X, "aa", []).
X = tkSym([97, 97]).
?- tokenize(X, "aa", []).
ERROR: Out of global stack
What am I missing? How is the problem usually solved in Prolog?

The underlying problem is that in your second version, token//1 also succeeds for the "empty" token:
?- phrase(token(T), "").
T = tkSym([]).
Therefore, unintentionally, the following succeeds too, as does an arbitrary number of tokens:
?- phrase((token(T1),token(T2)), "").
T1 = T2, T2 = tkSym([]).
To fix this, I recommend adjusting the definitions so that a token must consist of at least one lexical element, as is typical. A good way to ensure that at least one element is described is to split the DCG rules into two sets. For example, for symbol//1:
symbol([L|Ls]) --> letter(L), symbol_r(Ls).
symbol_r([L|Ls]) --> letter(L), symbol_r(Ls).
symbol_r([]) --> [].
This way, you avoid an unbounded recursion that can endlessly consume empty tokens.
Other points:
Always use phrase/2 to access DCGs in a portable way, i.e., independent of the actual implementation method used by any particular Prolog system.
The [] in the final DCG clause is superfluous; you can simply remove it.
Also, avoid using so many !/0. It is OK to commit to the first matching tokenization, but do so in a single place, for example via a once/1 wrapped around the phrase/2 call.
For naming, see my comment above. I recommend using tokens//1 to make this more declarative. Sample queries, using the above definition of symbol//1:
?- phrase(tokens(Ts), "").
Ts = [].
?- phrase(tokens(Ls), "a").
Ls = [tkSym([97])].
?- phrase(tokens(Ls), "a b").
Ls = [tkSym([97]), tkSym([98])].
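Putting the advice above together, a complete tokenizer might look like the following. This is only a sketch under some assumptions: the names ws//0, symbol_r//1, digits_r//1 and string_tokens/2 are made up for illustration, number_codes/2 replaces number_chars/2 because the input is a list of codes, and the single once/1 commits to the first (longest-match) tokenization.

```prolog
% tokens(-Tokens)//: zero or more tokens, each preceded by optional whitespace.
tokens([T|Ts]) --> ws, token(T), tokens(Ts).
tokens([])     --> ws.

% ws//0: zero or more whitespace codes.
ws --> [C], { code_type(C, space) }, ws.
ws --> [].

token(tkSym(Cs)) --> symbol(Cs).
token(tkNum(N))  --> digits(Ds), { number_codes(N, Ds) }.

% A symbol is at least one letter, so the empty token is never described.
symbol([L|Ls])   --> letter(L), symbol_r(Ls).
symbol_r([L|Ls]) --> letter(L), symbol_r(Ls).
symbol_r([])     --> [].

letter(L) --> [L], { code_type(L, alpha) }.

% Same two-rule-set pattern for digit sequences.
digits([D|Ds])   --> digit(D), digits_r(Ds).
digits_r([D|Ds]) --> digit(D), digits_r(Ds).
digits_r([])     --> [].

digit(D) --> [D], { code_type(D, digit(_)) }.

% string_tokens(+Codes, -Tokens): commit to the first tokenization in one place.
string_tokens(Codes, Tokens) :-
    once(phrase(tokens(Tokens), Codes)).
```

With this, a query such as ?- atom_codes('aa 11', Cs), string_tokens(Cs, Ts). should describe a symbol token followed by a number token, and no cut is needed inside the grammar itself.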

Related

DCG capable of reading a propositional logic expression

As I said in the title, I am trying to do an exercise where I need to write a DCG capable of reading propositional logic expressions. Variables are represented by lowercase letters, the operators are not, and, and or, and the tokens are separated by whitespace. So the expression:
a or not b and c
is parsed as
a or ( not b and c )
producing a parse tree that looks like:
or(
a,
and(
not(b),
c
)
)
To be completely honest I have been having a hard time understanding how to effectively use DCGs, but this is what I've got so far:
bexpr([T]) --> [T].
bexpr(not,R1) --> oper(not), bexpr(R1).
bexpr(R1,or,R2) --> bexpr(R1),oper(or), bexpr(R2).
bexpr(R1, and ,R2) --> bexpr(R1),oper(and), bexpr(R2).
oper(X) --> X.
I would appreciate any suggestions, either on the exercise itself, or on how to better understand DCGs.
The key to understanding DCGs is that they are syntactic sugar over writing a recursive descent parser. You need to think about operator precedence (how tightly do your operators bind?). Here, the operator precedence, from tightest to loosest is
not
and
or
so a or not b and c is evaluated as a or ( (not b) and c ).
And we can say this (I've included parenthetical expressions as well, because they're pretty trivial to do):
% the infix OR operator is the lowest priority, so we start with that.
expr --> infix_OR.
% and an infix OR expression is either the next highest priority operator (AND),
% or... it's an actual OR expression.
infix_OR --> infix_AND.
infix_OR --> infix_AND, [or], infix_OR.
% and an infix AND expression is either next highest priority operator (NOT)
% or... it's an actual AND expression.
infix_AND --> unary_NOT.
infix_AND --> unary_NOT, [and], infix_AND.
% and the unary NOT expression is either a primary expression
% or... an actual unary NOT expression
unary_NOT --> primary.
unary_NOT --> [not], primary.
% and a primary expression is either an identifier
% or... it's a parenthetical expression.
%
% NOTE that the body of the parenthetical expression starts parsing at the root level.
primary --> identifier.
primary --> ['('], expr, [')'].
identifier --> [X], {id(X)}. % the stuff in '{...}' is evaluated as normal prolog code.
id(a).
id(b).
id(c).
id(d).
id(e).
id(f).
id(g).
id(h).
id(i).
id(j).
id(k).
id(l).
id(m).
id(n).
id(o).
id(p).
id(q).
id(r).
id(s).
id(t).
id(u).
id(v).
id(w).
id(x).
id(y).
id(z).
But note that all this does is to recognize sentences of the grammar (pro tip: if you write your grammar correctly, it should also be able to generate all possible valid sentences of the grammar). Note that this might take a while to do, depending on your grammar.
So, to actually DO something with the parse, you need to add a little extra. We do this by adding extra arguments to the DCG, viz:
expr( T ) --> infix_OR(T).
infix_OR( T ) --> infix_AND(T).
infix_OR( or(X,Y) ) --> infix_AND(X), [or], infix_OR(Y).
infix_AND( T ) --> unary_NOT(T).
infix_AND( and(X,Y) ) --> unary_NOT(X), [and], infix_AND(Y).
unary_NOT( T ) --> primary(T).
unary_NOT( not(X) ) --> [not], primary(X).
primary( ID ) --> identifier(ID).
primary( T ) --> ['('], expr(T), [')'].
identifier( ID ) --> [X], { id(X), ID = X }.
id(a).
id(b).
id(c).
id(d).
id(e).
id(f).
id(g).
id(h).
id(i).
id(j).
id(k).
id(l).
id(m).
id(n).
id(o).
id(p).
id(q).
id(r).
id(s).
id(t).
id(u).
id(v).
id(w).
id(x).
id(y).
id(z).
And that is where the parse tree is constructed. One might note that one could just as easily evaluate the expression instead of building the parse tree... and then you're on your way to writing an interpreted language.
You can fiddle with it at this fiddle: https://swish.swi-prolog.org/p/gyFsAeAz.pl
where you'll notice that executing the goal phrase(expr(T),[a, or, not, b, and, c]). yields the desired parse T = or(a, and(not(b), c)).
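To illustrate the remark about evaluating instead of only building a tree, here is a minimal sketch of an evaluator over the parse trees produced above. The predicate names eval/3, bool_or/3, bool_and/3 and bool_not/2, and the representation of the environment as a list of Name-Value pairs, are assumptions made for this example and are not part of the original answer.

```prolog
% eval(+Tree, +Env, -Value)
% Tree is a term built from or/2, and/2, not/1 and identifier atoms.
% Env is a list of Name-Value pairs, Value being true or false (assumed format).
eval(or(X, Y), Env, V) :-
    eval(X, Env, VX),
    eval(Y, Env, VY),
    bool_or(VX, VY, V).
eval(and(X, Y), Env, V) :-
    eval(X, Env, VX),
    eval(Y, Env, VY),
    bool_and(VX, VY, V).
eval(not(X), Env, V) :-
    eval(X, Env, VX),
    bool_not(VX, V).
eval(ID, Env, V) :-
    atom(ID),                 % identifiers are plain atoms such as a, b, c
    memberchk(ID-V, Env).

% Truth tables, written as facts.
bool_or(true,  _, true).
bool_or(false, V, V).

bool_and(false, _, false).
bool_and(true,  V, V).

bool_not(true,  false).
bool_not(false, true).
```

For example, ?- eval(or(a, and(not(b), c)), [a-false, b-false, c-true], V). should give V = true, since not b and c holds under that assignment.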

What am I doing wrong in declaring a predicate for use with a DCG in Prolog?

I'm using the following to check if a string is valid for use with my grammar:
id(ID) :-
atom_chars(ID, [H|T]),
is_alpha(H),
ensure_valid_char(T).
ensure_valid_char([H|T]) :-
H == '_';
is_alpha(H);
atom_number(H, _),
ensure_valid_char(T).
It basically just checks that it starts with an alphabetic character, and after that it can be alphanumeric or an underscore.
I cannot seem to figure out how to get this to work with my DCG/grammar though.
This is its current structure where the predicate would be used:
typeID(Result) --> ['int'], id(ID), {
Result = ['int', ID]
}.
Where basically I'm saying a typeID is an integer type declaration followed by an identifier (int foo would be an example), and then I format it into a list and "give it back".
But in this case it's saying "id" is an undefined predicate. How do I use it so that I'm still able to access what ID holds to be able to format it, and still ensure that it's an ID using the predicate?
If I try:
id(ID) --> {
atom_chars(ID, [H|T]),
is_alpha(H),
ensure_valid_char(T),
ID = ID
}.
I get the error that:
atom_chars/2: Arguments are not sufficiently instantiated
Please use more readable names. The Prolog convention is to use underscores for readability.
This is_because_using_underscores_makes_even_long_names_readable, butUsingMixedCapsDoesNotAndMakesYourCodeAsUnreadableAsJava.
Second, please avoid unnecessary goals. A goal like ID=ID always holds, so you can as well remove it.
Third, a common pattern when describing the longest match in DCGs is to use clauses like the following:
symbol([A|As]) -->
[A],
{ memberchk(A, "+/-*><=") ; code_type(A, alpha) },
symbolr(As).
symbolr([A|As]) -->
[A],
{ memberchk(A, "+/-*><=") ; code_type(A, alnum) },
symbolr(As).
symbolr([]) --> [].
You can use this in a DCG like this:
id(Atom) --> symbol(Codes), { atom_codes(Atom, Codes) }.
The longest match of symbol//1 will be the first solution.
All of this requires that you have the Prolog flag double_quotes set to codes.
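Applied to the identifier rule from the question (a letter first, then letters, digits or underscores), the same longest-match pattern could look like the sketch below. It works on a list of character codes rather than on a list of atoms, and the helper names id_start//1, id_rest//1, id_code/1, blanks1//0 and blanks_r//0 are invented for this example.

```prolog
:- set_prolog_flag(double_quotes, codes).

% id(-Atom)//: one start code, then as many identifier codes as possible.
id(Atom) -->
    id_start(C),
    id_rest(Cs),
    { atom_codes(Atom, [C|Cs]) }.

id_start(C) --> [C], { code_type(C, alpha) }.

id_rest([C|Cs]) --> [C], { id_code(C) }, id_rest(Cs).
id_rest([])     --> [].

% id_code(+Code): underscore, letter or digit.
id_code(0'_) :- !.
id_code(C)   :- code_type(C, alnum).

% type_id(-Result)//: the keyword int, some whitespace, then an identifier.
type_id([int, ID]) --> "int", blanks1, id(ID).

% blanks1//0: one or more whitespace codes.
blanks1  --> [C], { code_type(C, space) }, blanks_r.
blanks_r --> [C], { code_type(C, space) }, blanks_r.
blanks_r --> [].
```

A query such as ?- atom_codes('int foo_1', Cs), once(phrase(type_id(R), Cs)). should then yield R = [int, foo_1]. Note that ID is only turned into an atom after its codes have been read, which avoids the instantiation error from the question.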
The semicolon has a higher operator priority number than the comma, which means it binds less tightly:
?- (N=(',') ; N=(';')), current_op(P,T,N).
N = (','),
P = 1000,
T = xfy ;
N = (;),
P = 1100,
T = xfy.
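So a ; b ; c , d is read as a ; b ; (c , d), and ensure_valid_char/1 doesn't have the structure you expect: the recursive call belongs only to the last alternative. It should be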
ensure_valid_char([H|T]) :-
( H == '_' ;
is_alpha(H) ;
atom_number(H, _) % ugly
),
ensure_valid_char(T).
and a simpler issue: you're missing the base case of the recursion
ensure_valid_char([]).
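If the input is already a list of atoms, as in the question, the plain predicate can stay a plain predicate: the grammar first consumes one token with [ID] and only then checks it inside {}/1, so ID is bound when atom_chars/2 runs. The following is a sketch along those lines; id_char/1 and the snake_case name type_id//1 are hypothetical, and char_type/2 is used here instead of is_alpha/1 and atom_number/2.

```prolog
:- use_module(library(apply)).   % maplist/2

% id(+Atom): Atom starts with a letter, followed by letters, digits or underscores.
id(ID) :-
    atom(ID),
    atom_chars(ID, [H|T]),
    char_type(H, alpha),
    maplist(id_char, T).

% id_char(+Char): underscore, letter or digit.
id_char('_') :- !.
id_char(C)   :- char_type(C, alnum).

% type_id(-Result)//: the token int followed by one identifier token.
% The terminal [ID] consumes a token first; the check then runs on a bound ID.
type_id([int, ID]) --> [int], [ID], { id(ID) }.
```

For instance, ?- phrase(type_id(R), [int, foo_1]). should give R = [int, foo_1], while a non-atom token such as 42 in place of the identifier is rejected.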

Prolog build rules from atoms

I'm currently trying to interpret user-entered strings via Prolog. I'm using code I've found on the internet, which converts a string into a list of atoms.
"Men are stupid." => [men,are,stupid,'.'] % Example
From this I would like to create a rule, which then can be used in the Prolog command-line.
% everyone is a keyword for a rule. If the list doesn't contain 'everyone'
% it's a fact.
% [men,are,stupid]
% should become ...
stupid(men).
% [everyone,who,is,stupid,is,tall]
% should become ...
tall(X) :- stupid(X).
% [everyone,who,is,not,tall,is,green]
% should become ...
green(X) :- not(tall(X)).
% Therefore, this query should return true/yes:
?- green(women).
true.
I don't need anything super fancy for this as my input will always follow a couple of rules and therefore just needs to be analyzed according to these rules.
I've been thinking about this for probably an hour now, but didn't come to anything even considerable, so I can't provide you with what I've tried so far. Can anyone push me into the right direction?
Consider using a DCG. For example:
list_clause(List, Clause) :-
phrase(clause_(Clause), List).
clause_(Fact) --> [X,are,Y], { Fact =.. [Y,X] }.
clause_((Head :- Body)) --> [everyone,who,is,B,is,A],
{ Head =.. [A,X], Body =.. [B,X] }.
Examples:
?- list_clause([men,are,stupid], Clause).
Clause = stupid(men).
?- list_clause([everyone,who,is,stupid,is,tall], Clause).
Clause = tall(_G2763):-stupid(_G2763).
I leave the remaining example as an easy exercise.
You can use assertz/1 to assert such clauses dynamically:
?- List = <your list>, list_clause(List, Clause), assertz(Clause).
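For readers who want to see how the negated form could fit into the same grammar, here is one possible sketch. It factors the condition into a helper nonterminal, condition//2, which is a name made up for this example, and it uses \+ rather than not/1 in the generated body. Whether this matches what the answer had in mind is an assumption.

```prolog
list_clause(List, Clause) :-
    phrase(clause_(Clause), List).

% Facts: [men,are,stupid] describes stupid(men).
clause_(Fact) -->
    [X, are, Y],
    { Fact =.. [Y, X] }.
% Rules: [everyone,who,is,...,is,A] describes A(X) :- Condition.
clause_((Head :- Body)) -->
    [everyone, who, is],
    condition(X, Body),
    [is, A],
    { Head =.. [A, X] }.

% condition(?X, -Goal)//: a property of X, optionally negated.
condition(X, \+ Goal) -->
    [not, B],
    { Goal =.. [B, X] }.
condition(X, Goal) -->
    [B],
    { B \== not, Goal =.. [B, X] }.
```

If such clauses are later asserted and queried, predicates that appear only in rule bodies (tall/1 in the example) would need a dynamic declaration or at least one clause, otherwise calling them raises an existence error.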
First of all, you could already during the tokenization step make terms instead of lists, and even directly assert rules into the database. Let's take the "men are stupid" example.
You want to write down something like:
?- assert_rule_from_sentence("Men are stupid.").
and end up with a rule of the form stupid(men).
assert_rule_from_sentence(Sentence) :-
phrase(sentence_to_database, Sentence).
sentence_to_database -->
subject(Subject), " ",
"are", " ",
object(Object), ".",
{ Rule =.. [Object, Subject],
assertz(Rule)
}.
(let's assume you know how to write the DCGs for subject and object)
This is it! Of course, your sentence_to_database//0 will need to have more clauses, or use helper clauses and predicates, but this is at least a start.
As @mat says, it is cleaner to first tokenize and then deal with the tokenized sentence. But then, it would go something like this:
tokenize_sentence(be(Subject, Object)) -->
subject(Subject), space,
be, !,
object(Object), end.
(now you also need to probably define what a space and an end of sentence is...)
be -->
"is".
be -->
"are".
assert_tokenized(be(Subject, Object)) :-
Fact =.. [Object, Subject],
assertz(Fact).
The main reason for doing it this way is that you know during the tokenization what sort of sentence you have: subject - verb - object, or subject - modifier - object - modifier etc, and you can use this information to write your assert_tokenized/1 in a more explicit way.
Definite Clause Grammars are Prolog's go-to tool for translating from strings (such as your English sentences) to Prolog terms (such as the Prolog clauses you want to generate), or the other way around. Here are two introductions I'd recommend:
http://www.learnprolognow.org/lpnpage.php?pagetype=html&pageid=lpn-htmlse29
http://www.pathwayslms.com/swipltuts/dcg/

Determine the type of characters

I would like to determine in Prolog the type of a string of characters, if it is alphabetic, alphanumeric or numeric.
For example:
"I use this page" alphabetic
"0c0d24e" alphanumeric
How can I do this?
The predicate to use is char_type/2 or, better, code_type/2.
To apply it to each code in the string, use maplist/2. The only problem is the argument order of code_type/2, which takes the code first and the type second. A helper predicate is therefore needed (or install lambda, if you're using SWI-Prolog, with ?- pack_install(lambda).).
Without lambda:
code_type_(X,Y) :- code_type(Y,X).
?- maplist(code_type_(alpha), "abc").
true.
With lambda:
?- [library(lambda)].
?- maplist(\C^code_type(C,alpha), "abc").
true.
Edit: after the comments, it's apparent that more flexible parsing is required. A DCG is the recommended way to go: library(dcg/basics) offers some prebuilt 'categorizers' and shows the proper way to write your own in combination with code_type/2. For instance, here is a recently added rule:
%% prolog_var_name(-Name:atom)// is semidet.
%
% Matches a Prolog variable name. Primarily intended to deal with
% quasi quotations that embed Prolog variables.
prolog_var_name(Name) -->
[C0], { code_type(C0, prolog_var_start) }, !,
prolog_id_cont(CL),
{ atom_codes(Name, [C0|CL]) }.
prolog_id_cont([H|T]) -->
[H], { code_type(H, prolog_identifier_continue) }, !,
prolog_id_cont(T).
prolog_id_cont([]) --> "".
see how code_type/2 is used to qualify single characters...
more edit - note: untested
qualify_atom(Atom, Type) :-
atom_codes(Atom, Codes),
qualify_codes(Codes, Type).
qualify_codes(Codes, Type) :-
% test alpha first: every alphabetic list is also alphanumeric
( maplist(code_type_(alpha), Codes)
-> Type = alpha
; maplist(code_type_(alnum), Codes)
-> Type = alnum
; Type = unknown
).
then, to work on a list
?- maplist(qualify_atom, Atoms, Types).
edit
An update of this answer is mandatory: since library(yall) has been released in SWI-Prolog, and is autoloaded, we can now write:
?- maplist([C]>>code_type(C,alpha), `abc`).
Also, note the change in literal representation: in SWI-Prolog version 7 and later, double quotes no longer denote a list of character codes by default, hence the backquotes.
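To cover the three categories asked about in the question (numeric, alphabetic, alphanumeric), the classification can be sketched as follows. The names atom_kind/2, codes_kind/2 and all_of/2 are hypothetical, and the order of the tests matters because every numeric or alphabetic list is also alphanumeric. An input containing spaces, such as the first example in the question, would come out as unknown with this sketch.

```prolog
:- use_module(library(lists)).   % member/2

% atom_kind(+Atom, -Kind): Kind is numeric, alphabetic, alphanumeric or unknown.
atom_kind(Atom, Kind) :-
    atom_codes(Atom, Codes),
    codes_kind(Codes, Kind).

% codes_kind(+Codes, -Kind): most specific category first.
codes_kind(Codes, Kind) :-
    Codes = [_|_],                       % the empty list has no category here
    (   all_of(digit(_), Codes) -> Kind = numeric
    ;   all_of(alpha, Codes)    -> Kind = alphabetic
    ;   all_of(alnum, Codes)    -> Kind = alphanumeric
    ;   Kind = unknown
    ).

% all_of(+Type, +Codes): every code satisfies code_type/2 for Type.
all_of(Type, Codes) :-
    forall(member(C, Codes), code_type(C, Type)).
```

For example, ?- atom_kind('0c0d24e', K). should give K = alphanumeric, and ?- maplist(atom_kind, [abc, '123'], Ks). classifies a whole list.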

Parsing numbers with multiple digits in Prolog

I have the following simple expression parser:
expr(+(T,E))-->term(T),"+",expr(E).
expr(T)-->term(T).
term(*(F,T))-->factor(F),"*",term(T).
term(F)-->factor(F).
factor(N)-->nat(N).
factor(E)-->"(",expr(E),")".
nat(0)-->"0".
nat(1)-->"1".
nat(2)-->"2".
nat(3)-->"3".
nat(4)-->"4".
nat(5)-->"5".
nat(6)-->"6".
nat(7)-->"7".
nat(8)-->"8".
nat(9)-->"9".
However this only supports 1-digit numbers. How can I parse numbers with multiple digits in this case?
Use accumulator variables, and pass those in recursive calls. In the following, A and A1 are the accumulator.
digit(0) --> "0".
digit(1) --> "1".
% ...
digit(9) --> "9".
nat(N) --> digit(D), nat(D,N).
nat(N,N) --> [].
nat(A,N) --> digit(D), { A1 is A*10 + D }, nat(A1,N).
Note that the first nat clause initializes the accumulator by consuming a digit, because you don't want to match the empty string.
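Combined with the expression grammar from the question, the accumulator version might look like this. It is a sketch with two assumptions: digit//1 is written with code_type/2 and its digit(Weight) type instead of ten separate clauses, and the consuming clause of nat//2 is placed first so that the longest number is the first solution.

```prolog
:- set_prolog_flag(double_quotes, codes).

expr(T+E) --> term(T), "+", expr(E).
expr(T)   --> term(T).

term(F*T) --> factor(F), "*", term(T).
term(F)   --> factor(F).

factor(N) --> nat(N).
factor(E) --> "(", expr(E), ")".

% nat(-N)//: at least one digit; the accumulator starts with its value.
nat(N) --> digit(D), nat(D, N).

% nat(+Acc, -N)//: consume further digits, shifting the accumulator by 10.
nat(A, N) --> digit(D), { A1 is A*10 + D }, nat(A1, N).
nat(N, N) --> [].

% digit(-D)//: one decimal digit code, D is its numeric value.
digit(D) --> [C], { code_type(C, digit(D)) }.
```

A query such as ?- atom_codes('12+3*(45+6)', Cs), once(phrase(expr(E), Cs)), X is E. should build the term 12+3*(45+6) and evaluate it to 165.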
nat(0).
nat(N):-nat(N-1).
But you use a syntax that I don't know (see my comment above).
Can you provide a sample input?
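I think this might work, assuming number//1 from library(dcg/basics) is loaded (note that it also accepts signs and floats):
nat(N)-->number(N).
If that does not behave as expected, try:
nat(N)-->number(N),!.
The ! is a cut: it commits to the choices made so far and prevents backtracking into them. You can read about it in books/tutorials.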
