I am trying to understand the basic syntax of Prolog and DCGs, but it's really hard to get ahold of proper information on the really basic stuff. Take a look at the code below. I basically just want to achieve something like this:
Output = te(a, st).
Code:
test(te(X,Y)) --> [X], test2(Y).
test2(st(_X)) --> [bonk].
?- test(Output, [a, bonk],[]).
Output = te(a, st(_G6369)).
Simply what I want to do is to add the word 'st' at the end, and the closest I've managed is this, but unfortunately st is followed by a bunch of nonsense, most likely because of the singleton _X. I simply want my Output to look like te(a, st).
If you want to accept input of the form [Term, bonk] and obtain te(Term,st), you should change test2//1 to accept bonk and return st:
test(te(X,Y)) --> [X], test2(Y).
test2(st) --> [bonk].
?- test(Output, [a, bonk],[]).
Output = te(a, st).
As you said, st is followed by "a bunch of nonsense" because of _X. Basically, _G6369 is the internal 'name' of the variable, and since the variable remains uninstantiated, Prolog displays it; try print(X), X=3, print(X).
Anyway, you can simply remove (_X) since you can have anything you want as an argument:
test(te(X,Y)) --> [X], test2(Y).
test2(st) --> [bonk].
Of course, if you don't actually have bonk's in your input and you simply want to add a st at the end you can simplify it even more:
test(te(X,st)) --> [X].
Or if you have bonk's:
test(te(X,st)) --> [X,bonk].
Finally, it is generally suggested to use phrase/3 or phrase/2 instead of adding the arguments manually.
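As a minimal sketch of that last suggestion, the grammar can be called through phrase/2 instead of passing the two list arguments by hand. The wrapper name parse_test/2 is made up for this illustration and is not part of the original code.

```prolog
% Grammar from the answer above: one arbitrary token followed by bonk.
test(te(X, st)) --> [X, bonk].

% parse_test(+Tokens, -Output): hypothetical wrapper around phrase/2.
% phrase/2 requires the whole token list to be consumed.
parse_test(Tokens, Output) :-
    phrase(test(Output), Tokens).
```

With this loaded, ?- parse_test([a, bonk], Output). should answer Output = te(a, st).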
What's the preferred way to ignore rest of input? I found one somewhat verbose way:
ignore_rest --> [].
ignore_rest --> [_|_].
And it works:
?- phrase(ignore_rest, "foo schmoo").
true ;
But when I try to collapse these two rules into:
ignore_rest2 --> _.
Then it doesn't:
?- phrase(ignore_rest2, "foo schmoo").
ERROR: phrase/3: Arguments are not sufficiently instantiated
What you want is to state that there is a sequence of arbitrarily many characters. The easiest way to describe this is:
... -->
[].
... -->
[_],
... .
Using [_|_] as a non-terminal as you did, is an SWI-Prolog specific extension which is highly problematic. In fact, in the past, there were several different extensions to/interpretations of [_|_]. Most notably Quintus Prolog did permit to define a user-defined '.'/4 to be called when [_|_] was used as a non-terminal. Note that [_|[]] was still considered a terminal! Actually, this was rather an implementation error. But nevertheless, it was exploited. See for such an example:
David B. Searls, Investigating the Linguistics of DNA with Definite Clause Grammars. NACLP 1989.
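A small sketch of how ...//0 might be used to ignore the rest of the input, assuming a system that reads ... as an ordinary atom (SWI-Prolog does). The non-terminal starts_with_foo//0 is a made-up name for illustration only.

```prolog
% ...//0 as defined above: describes any sequence of terminals.
... --> [].
... --> [_], ... .

% starts_with_foo//0 (hypothetical): the input starts with f, o, o;
% whatever follows is skipped by ...//0.
starts_with_foo --> [f, o, o], ... .
```

Here ?- phrase(starts_with_foo, [f,o,o,b,a,r]). succeeds, while ?- phrase(starts_with_foo, [b,a,r]). fails.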
Why not simply use phrase/3 instead of phrase/2? For example, assuming that you have a prefix//0 non-terminal that consumes only part of the input:
?- phrase(prefix, Input, _).
The third argument of phrase/3 returns the non-consumed terminals, which you can simply ignore.
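To make this concrete, here is a minimal sketch with a stand-in definition of prefix//0. Both that definition and the helper rest_after_prefix/2 are assumptions for illustration; the real prefix//0 would be whatever your grammar defines.

```prolog
% Stand-in prefix//0: consumes exactly the tokens f, o, o.
prefix --> [f, o, o].

% rest_after_prefix(+Input, -Rest): Rest is what prefix//0 left unconsumed.
% Replace Rest by _ in the phrase/3 call to ignore the remainder entirely.
rest_after_prefix(Input, Rest) :-
    phrase(prefix, Input, Rest).
```

For example, ?- rest_after_prefix([f,o,o,b,a,r], Rest). should answer Rest = [b, a, r].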
I'm trying for the moment to keep my lexer and parser separate, based on the vague advice of the book Prolog and Natural Language Analysis, which really doesn't go into any detail about lexing/tokenizing. So I am giving it a shot and seeing several little issues that indicate to me that there is something obvious I'm missing.
All my little token parsers seem to be working alright; at the moment this is a snippet of my code:
:- use_module(library(dcg/basics)).
operator('(') --> "(". operator(')') --> ")".
operator('[') --> "[". operator(']') --> "]".
% ... etc.
keyword(array) --> "array".
keyword(break) --> "break".
% ... etc.
It's a bit repetitive but it seems to work. Then I have some stuff I don't completely love and would welcome suggestions on, but does seem to work:
id(id(Id)) -->
[C],
{
char_type(C, alpha)
},
idRest(Rest),
{
atom_chars(Id, [C|Rest])
}.
idRest([C|Rest]) -->
[C],
{
char_type(C, alpha) ; char_type(C, digit) ; C = '_'
},
idRest(Rest).
idRest([]) --> [].
int(int(Int)) --> integer(Int).
string(str(String)) -->
"\"",
stringContent(Codes),
"\"",
{
string_chars(String, Codes)
}.
stringContent([C|Chars]) -->
stringChar(C), stringContent(Chars).
stringContent([]) --> [].
stringChar(0'\n) --> "\\n".
stringChar(0'\t) --> "\\t".
stringChar(0'\") --> "\\\"".
stringChar(0'\\) --> "\\\\".
stringChar(C) --> [C].
The main rule for my tokenizer is this:
token(X) --> whites, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).
It's not perfect; I will see int get parsed into in,id(t) because keyword(X) comes before id(X). So I guess that's the first question.
The bigger question I have is that I do not see how to properly integrate comments into this situation. I have tried the following:
skipAhead --> [].
skipAhead --> (comment ; whites), skipAhead.
comment --> "/*", anything, "*/".
anything --> [].
anything --> [_], anything.
token(X) --> skipAhead, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).
This does not seem to work; the parses that return (and I get many parses) do not seem to have the comment removed. I'm nervous that my comment rule is needlessly inefficient and probably induces a lot of unnecessary backtracking. I'm also nervous that whites//0 from dcg/basics is deterministic; however, that part of the equation seems to work, it's just integrating it with the comment skipping that doesn't seem to.
As a final note, I don't see how to handle propagating parse errors back to the user with line/column information from here. It feels like I'd have to track and thread through some kind of current line/column info and write it into the tokens and then maybe try to rebuild the line if I wanted to do something similar to how llvm does it. Is that fair or is there a "recommended practice" there?
The whole code can be found in this haste.
It currently still looks a bit strange (unreadableNamesLikeInJavaAnyone?), but in its core it is quite solid, so I have only a few comments about some aspects of the code and the questions:
Separating lexing from parsing makes perfect sense. It is also a perfectly acceptable solution to store line and column information along with each token, leaving tokens (for example) of the form l_c_t(Line,Column,Token) or Token-lc(Line,Column) for the parser to process.
Comments are always nasty, or should I say, often not-nesty? A useful pattern in DCGs is often to go for the longest match, which you are already using in some cases, but not yet for anything//0. So, reordering the two rules may help you to skip everything that is meant to be commented away.
Regarding the determinism: It is OK to commit to the first parse that matches, but do it only once, and resist the temptation to mess up the declarative grammar.
In DCGs, it is elegant to use | instead of ;.
tokenize//1? Come on! That's just tokens//1. It makes sense in all directions.
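The suggestion above to store line and column information with each token could be sketched roughly as follows. This is not the asker's lexer: tokens_lc//3, letters//1 and tokenize_lc/2 are made-up names, the input is assumed to be a list of characters, and the only tokens are runs of letters separated by spaces or newlines.

```prolog
% tokens_lc(+Line, +Col, -Tokens)//: thread the current line and column
% through the grammar and attach the start position to each token.
tokens_lc(_, _, []) --> [].
tokens_lc(L, _, Ts) --> ['\n'], { L1 is L + 1 }, tokens_lc(L1, 1, Ts).
tokens_lc(L, C, Ts) --> [' '], { C1 is C + 1 }, tokens_lc(L, C1, Ts).
tokens_lc(L, C, [Word-lc(L, C)|Ts]) -->
    letters(Cs),
    { Cs = [_|_],                 % a word needs at least one letter
      atom_chars(Word, Cs),
      length(Cs, N),
      C1 is C + N },
    tokens_lc(L, C1, Ts).

% letters(-Chars)//: tries the longest run of letters first.
letters([Ch|Cs]) --> [Ch], { char_type(Ch, alpha) }, letters(Cs).
letters([]) --> [].

% tokenize_lc(+Atom, -Tokens): commit to the first parse, once, at the top.
tokenize_lc(Atom, Tokens) :-
    atom_chars(Atom, Chars),
    once(phrase(tokens_lc(1, 1, Tokens), Chars)).
```

For example, ?- tokenize_lc('ab cd\nef', Ts). should answer Ts = [ab-lc(1,1), cd-lc(1,4), ef-lc(2,1)]. The single once/1 at the top follows the advice above to commit to the first parse only once, leaving the grammar itself free of cuts.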
I have this code to support error reporting, which itself must be handled with care, sprinkling meaningful messages and 'skip rules' around the code. But there is no ready-to-use alternative: a DCG is a nice computation engine, but it cannot compete out of the box with specialized parsing engines that are able to emit error messages automatically, exploiting the theoretical properties of the targeted grammars...
:- dynamic text_length/1.
parse_conf_cs(Cs, AST) :-
length(Cs, TL),
retractall(text_length(_)),
assert(text_length(TL)),
phrase(cfg(AST), Cs).
....
%% tag(?T, -X, -Y)// is det.
%
% Start/Stop tokens for XML like entries.
% Maybe this should restrict somewhat the allowed text.
%
tag(T, X, Y) -->
pos(X), unquoted(T), pos(Y).
....
%% pos(-C, +P, -P) is det.
%
% capture offset from end of stream
%
pos(C, P, P) :- text_length(L), length(P, Q), C is L - Q.
tag//3 is just an example usage; in this parser I'm building an editable AST, so I store the positions to be able to properly attribute each nested part in an editor...
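A tiny self-contained sketch of how pos//1 reports offsets, under the assumption that the input is a plain list of tokens. The driver span_demo/2 and its four-token input are made up for illustration.

```prolog
:- dynamic text_length/1.

% pos//1 as defined above, written directly as a 3-argument predicate:
% the offset is the total length minus the length of the remaining input.
pos(C, P, P) :- text_length(L), length(P, Q), C is L - Q.

% span_demo(-X, -Y): hypothetical driver that records where the
% sub-sequence a, b starts and ends inside [x, a, b, y].
span_demo(X, Y) :-
    Cs = [x, a, b, y],
    length(Cs, TL),
    retractall(text_length(_)),
    assertz(text_length(TL)),
    phrase(([x], pos(X), [a, b], pos(Y), [y]), Cs).
```

Here ?- span_demo(X, Y). should answer X = 1, Y = 3: one token has been consumed before a, b and three after it.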
edit
a small enhancement for id//1: SWI-Prolog has specialized code_type/2 for that:
1 ?- code_type(0'a, csymf).
true.
2 ?- code_type(0'1, csymf).
false.
so (glossing over literal transformation)
id([C|Cs]) --> [C], {code_type(C, csymf)}, id_rest(Cs).
id_rest([C|Cs]) --> [C], {code_type(C, csym)}, id_rest(Cs).
id_rest([]) --> [].
depending on your attitude to generalize small snippets, and the actual grammar details, id_rest//1 could be written in reusable fashion, and made deterministic
id([C|Cs]) --> [C], {code_type(C, csymf)}, codes(csym, Cs).
% greedy and deterministic
codes(Kind, [C|Cs]) --> [C], {code_type(C, Kind)}, !, codes(Kind, Cs).
codes(Kind, []), [C] --> [C], {\+code_type(C, Kind)}, !.
codes(_, []) --> [].
this stricter definition of id//1 would also remove some ambiguity with respect to identifiers that have a keyword prefix: recoding keyword//1 like
keyword(K) --> id(id(K)), {memberchk(K, [
array,
break,
...
])}.
will correctly identify
?- phrase(tokenize(Ts), `if1*2`).
Ts = [id(if1), *, int(2)] ;
Your string//1 (OT: what an unfortunate clash with library(dcg/basics):string//1) is an easy candidate for implementing a simple 'error recovery strategy':
stringChar(0'\\) --> "\\\\".
stringChar(0'") --> pos(X), "\n", {format('unclosed string at ~d~n', [X])}.
It's an example of 'report error and insert missing token', so the parsing can go on...
I always seem to struggle to write DCGs to parse input files. But it seems it should be simple? Are there any tips or tricks to think about this problem?
For a concrete example, let's say I want to parse a FASTA file (https://en.wikipedia.org/wiki/FASTA_format). I want to read each description and each sequence on backtracking.
:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- portray_text(true).
:- set_prolog_flag(double_quotes, codes).
:- set_prolog_flag(back_quotes,string).
fasta_file([]) -->[].
fasta_file([Section|Sections]) -->
fasta_section(Section),
fasta_file(Sections).
fasta_section(Section) -->
fasta_description(Description),
fasta_seq(Sequence),
{Section =.. [section,Description,Sequence]}.
fasta_description(Description) -->
">",
string(Description),
{no_gt(Description),
no_nl(Description)}.
fasta_seq([]) --> [].
fasta_seq(Seq) -->
nt([S]),
fasta_seq(Ss),
{S="X"->Seq =Ss;Seq=[S|Ss]}.
nt("A") --> "A".
nt("C") --> "C".
nt("G") --> "G".
nt("T") --> "T".
nt("X") --> "\n".
no_gt([]).
no_gt([E|Es]):-
dif([E],">"),
no_gt(Es).
no_nl([]).
no_nl([E|Es]):-
dif([E],"\n"),
no_nl(Es).
Now this is clearly wrong. The behaviour I would like is
?-phrase(fasta_section(S),">frog\nACGGGGTACG\n>duck\nACGTTAG").
S = section("frog","ACGGGGTACG");
S = section("duck","ACGTTAG");
false.
But if I call phrase(fasta_file(Sections),">frog\nACGGGGTACG\n>duck\nACGTTAG"), Sections is unified with a list of section/2 terms, which is what I want. Still, my current code seems quite hacky, for example in how I have handled the newline character.
for sure, there are 'small' typing problems:
nt("A") -->"A",
nt("C") -->"C",
nt("G") -->"G",
nt("T") -->"T".
should be
nt("A") -->"A".
nt("C") -->"C".
nt("G") -->"G".
nt("T") -->"T".
anyway, I also had my problems debugging DCGs. I wrote a parser to load a MySQL dump (plain SQL, really) into Prolog, and it was a pain whenever something unexpected turned up, like escaped strings or weird UTF8 (?) encodings.
I would suggest using phrase/3 to see if there is an unparsable tail. It could also help to place some debug output after known, well-behaved sequences.
Of course, I assume you already tried to use the SWI-Prolog debugger.
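As a rough sketch of the phrase/3 suggestion: the grammar below is a simplified stand-in (nucleotides as plain atoms, no descriptions or newlines), and parsed_and_tail/3 is a made-up helper. It only shows how the third argument exposes the point where parsing stopped.

```prolog
% Simplified stand-in for the nucleotide grammar.
nt(a) --> [a].
nt(c) --> [c].
nt(g) --> [g].
nt(t) --> [t].

% nts(-Ns)//: as many nucleotides as possible first.
nts([N|Ns]) --> nt(N), nts(Ns).
nts([]) --> [].

% parsed_and_tail(+Input, -Parsed, -Tail): Tail is the part of Input that
% the grammar could not consume on its first (longest) parse.
parsed_and_tail(Input, Parsed, Tail) :-
    once(phrase(nts(Parsed), Input, Tail)).
```

For example, ?- parsed_and_tail([a,c,x,g], P, T). should answer P = [a, c], T = [x, g], which points directly at the offending x.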
Also, beware of
...
dif([E],">"),
...
Did you set the appropriate flag about double quotes? In DCG bodies, the rewrite machinery takes care of matching, but a sequence of codes in SWI-Prolog by default doesn't match double-quoted strings...
edit
I think this will not solve your doubt about a general strategy... anyway, it's how I would handle the problem...
fasta_file([]) -->[].
fasta_file([Section|Sections]) -->
fasta_section(Section),
fasta_file(Sections).
fasta_section(section(Description,Sequence)) -->
fasta_description(Description),
fasta_seq(SequenceCs), {atom_codes(Sequence, SequenceCs)}, !.
fasta_description(Description) -->
">", string(DescriptionCs), "\n", {atom_codes(Description, DescriptionCs)}.
fasta_seq([S|Seq]) --> nt(S), fasta_seq(Seq).
fasta_seq([]) --> "\n" ; []. % optional \n at EOF
nt(0'A) --> "A".
nt(0'C) --> "C".
nt(0'G) --> "G".
nt(0'T) --> "T".
now
?- phrase(fasta_file(S), `>frog\nACGGGGTACG\n>duck\nACGTTAG`).
S = [section(frog, 'ACGGGGTACG'), section(duck, 'ACGTTAG')] ;
false.
note: the order of clauses fasta_seq//1 is important, since it implements an 'eager' parsing - mainly for efficiency. As I said, I had to parse SQL, several MBs was common.
edit
?- phrase((string(_),fasta_section(S)), `>frog\nACGGGGTACG\n>duck\nACGTTAG`,_).
S = section(frog, 'ACGGGGTACG') ;
S = section(duck, 'ACGTTAG') ;
false.
fasta_section//1 is meant to match a definite sequence. To get all sections on backtracking we must provide a backtrack point. In this case, string//1 from library(dcg/basics) does the job.
In the following tutorial: http://www.csupomona.edu/~jrfisher/www/prolog_tutorial/7_3.html
There is the part:
test_parser :- repeat,
write('?? '),
read_line(X),
( c(F,X,[]) | q(F,X,[]) ),
nl, write(X), nl, write(F), nl, fail.
Now I'm extremely confused about the c(F,X,[]) and q(F,X,[]) part because it doesn't seem to match any thing that I have seen, c only takes one parameter from what I can tell and these parameters don't seem to make sense for q. Please help me understand what is going on here.
c//1 and q//1 are entry points (aka top-level productions) of the Definite Clause Grammar defined below, where you find
c(F) --> ....
q(F) --> ....
Calling a DCG entry point directly in this style is discouraged; it is usually better to go through phrase(Grammar, TextToAnalyze, TextAfterAnalysis), in this case phrase((c(F) ; q(F)), "some text", [])...
The --> operator is usually rewritten by adding 2 arguments, which are the cause of your confusion.
EDIT
I.e.
c(L) --> lead_in,arrange(L),end.
is rewritten to
c(L,X,Y) :- lead_in(X,X1),arrange(L,X1,X2),end(X2,Y).
c is defined with -->, which actually adds two hidden arguments to it. The first of these is a list to be parsed by the grammar rule; the second is "what's left" after the parse. c(F,X,[]) calls c on the list X to obtain a result F, expecting [] to be left, i.e. the parser should consume the entire list X.
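A minimal sketch of the two calling styles side by side, using a made-up one-rule grammar (greeting//1, via_phrase/1 and via_direct/1 are not from the tutorial). It assumes the usual translation that adds the two list arguments at the end, as SWI-Prolog does.

```prolog
% greeting//1 (hypothetical): the token hello followed by any single token.
greeting(hello(Name)) --> [hello], [Name].

% Recommended: go through phrase/2, which hides the two extra arguments.
via_phrase(F) :-
    phrase(greeting(F), [hello, world]).

% Direct call with the hidden arguments spelled out: the list to parse
% and the expected remainder ([] means everything must be consumed).
via_direct(F) :-
    greeting(F, [hello, world], []).
```

Both ?- via_phrase(F). and ?- via_direct(F). should answer F = hello(world).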
I was trying to define a functor and print each individual item of a list in Prolog, but Prolog is not printing it in the correct format.
rint(L):-
write(H).
the output is like
rint([a, s,v ,c]).
_L139
true.
This is what I expect to achieve by calling the functor, any help or thought is appreciated, I'm new to Prolog and learning it.
?- rint([a,b,c,d]).
.(a, .(b, .(c, .(d, []))))
I think it should be
rint(L) :- write(L).
Also if you want .(a, .(b, .(c, .(d, [])))) and not [a, b, c, d] in output, use display:
rint(L) :- display(L).
The problem is an error in your rule for rint.
Your definition says that rint(L) succeeds if write(H) succeeds. At that point, the interpreter knows nothing about H. So it writes a value it doesn't know, which is why you see the _L139, the internal representation of an uninitialised variable.
Having done that, write(H) has succeeded, so rint(L) is true. The interpreter tells you that: true.
To define your own rint/1 without relying on built-ins such as display/1, you would need to do something like
rint([]) :-
write([]).
rint([H|T]) :-
write('.('),
write(H),
write(', '),
rint(T),
write(')').
If you're trying to display an empty list, just write it. If you're trying to display any other list, write the opening period and parenthesis, write the Head, write the following comma and space, then call itself for the Tail of the list, then write the closing parenthesis.