Prolog DCG: Writing programming language lexer - prolog

I'm trying for the moment to keep my lexer and parser separate, based on the vague advice of the book Prolog and Natural Language Analysis, which really doesn't go into any detail about lexing/tokenizing. So I am giving it a shot and seeing several little issues that indicate to me that there is something obvious I'm missing.
All my little token parsers seem to be working alright; at the moment this is a snippet of my code:
:- use_module(library(dcg/basics)).
operator('(') --> "(". operator(')') --> ")".
operator('[') --> "[". operator(']') --> "]".
% ... etc.
keyword(array) --> "array".
keyword(break) --> "break".
% ... etc.
It's a bit repetitive but it seems to work. Then I have some stuff I don't completely love and would welcome suggestions on, but does seem to work:
id(id(Id)) -->
    [C],
    { char_type(C, alpha) },
    idRest(Rest),
    { atom_chars(Id, [C|Rest]) }.
idRest([C|Rest]) -->
    [C],
    { char_type(C, alpha) ; char_type(C, digit) ; C = '_' },
    idRest(Rest).
idRest([]) --> [].
int(int(Int)) --> integer(Int).
string(str(String)) -->
    "\"",
    stringContent(Codes),
    "\"",
    { string_chars(String, Codes) }.
stringContent([C|Chars]) -->
    stringChar(C), stringContent(Chars).
stringContent([]) --> [].
stringChar(0'\n) --> "\\n".
stringChar(0'\t) --> "\\t".
stringChar(0'\") --> "\\\"".
stringChar(0'\\) --> "\\\\".
stringChar(C) --> [C].
The main rule for my tokenizer is this:
token(X) --> whites, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).
It's not perfect; I will see int get parsed into in,id(t) because keyword(X) comes before id(X). So I guess that's the first question.
The bigger question I have is that I do not see how to properly integrate comments into this situation. I have tried the following:
skipAhead --> [].
skipAhead --> (comment ; whites), skipAhead.
comment --> "/*", anything, "*/".
anything --> [].
anything --> [_], anything.
token(X) --> skipAhead, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).
This does not seem to work; the parses that return (and I get many parses) do not seem to have the comment removed. I'm nervous that my comment rule is needlessly inefficient and probably induces a lot of unnecessary backtracking. I'm also nervous that whites//0 from dcg/basics is deterministic; however, that part of the equation seems to work, it's just integrating it with the comment skipping that doesn't seem to.
As a final note, I don't see how to handle propagating parse errors back to the user with line/column information from here. It feels like I'd have to track and thread through some kind of current line/column info and write it into the tokens and then maybe try to rebuild the line if I wanted to do something similar to how llvm does it. Is that fair or is there a "recommended practice" there?
The whole code can be found in this haste.

It currently still looks a bit strange (unreadableNamesLikeInJavaAnyone?), but in its core it is quite solid, so I have only a few comments about some aspects of the code and the questions:
Separating lexing from parsing makes perfect sense. It is also a perfectly acceptable solution to store line and column information along with each token, leaving tokens (for example) of the form l_c_t(Line,Column,Token) or Token-lc(Line,Column) for the parser to process.
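To make the Token-lc(Line,Column) idea concrete, here is a minimal sketch. It is not the asker's lexer: it assumes a toy language of alphabetic words separated by white space, and the names tokens//3, word//1, word_rest//1 and step/5 are made up for the illustration. The current line and column are threaded through as two extra arguments.

```prolog
% Hypothetical sketch: only alphabetic words and white space are recognised.
% Every word becomes Word-lc(Line, Column), with both counters starting at 1.
tokens(Ts) --> tokens(1, 1, Ts).

% White space: update the position, emit nothing.
tokens(L0, C0, Ts) -->
    [Code], { code_type(Code, space) }, !,
    { step(Code, L0, C0, L, C) },
    tokens(L, C, Ts).
% A word: record where it started, then advance the column by its length.
tokens(L, C0, [Word-lc(L, C0)|Ts]) -->
    word(Codes), !,
    { atom_codes(Word, Codes),
      length(Codes, N),
      C is C0 + N },
    tokens(L, C, Ts).
tokens(_, _, []) --> [].

word([C|Cs]) --> [C], { code_type(C, alpha) }, word_rest(Cs).

word_rest([C|Cs]) --> [C], { code_type(C, alpha) }, !, word_rest(Cs).
word_rest([]) --> [].

% step(+Code, +Line0, +Col0, -Line, -Col): a newline starts a new line,
% any other code moves one column to the right.
step(0'\n, L0, _, L, 1) :- !, L is L0 + 1.
step(_, L, C0, L, C) :- C is C0 + 1.

% ?- atom_codes('ab cd\nef', Cs), phrase(tokens(Ts), Cs).
% Ts = [ab-lc(1,1), cd-lc(1,4), ef-lc(2,1)].
```

The parser can then pick the lc(Line,Column) term off whichever token it fails on and put it into the error message.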
Comments are always nasty, or should I say, often not-nesty? A useful pattern in DCGs is often to go for the longest match, which you are already using in some cases, but not yet for anything//0. So, reordering the two rules may help you to skip everything that is meant to be commented away.
Regarding the determinism: It is OK to commit to the first parse that matches, but do it only once, and resist the temptation to mess up the declarative grammar.
In DCGs, it is elegant to use | instead of ;.
tokenize//1? Come on! That's just tokens//1. It makes sense in all directions.
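A sketch of how the comment skipping could look with these points applied, under some assumptions of mine: white//0 is a made-up helper that consumes exactly one white space code (whites//0 from dcg/basics also succeeds without consuming anything, so it is not a good fit inside a recursive skipper), the recursive clause of skip_ahead//0 comes first so that the first solution skips as much as possible, and anything//0 is left as in the question so that a comment ends at the first */. The grammar stays free of cuts; the caller commits with once/1.

```prolog
:- set_prolog_flag(double_quotes, codes).

% skip_ahead//0: any mix of white space and /* ... */ comments.
% Recursive clause first, so the first solution is the longest skip.
skip_ahead --> ( comment | white ), skip_ahead.
skip_ahead --> [].

% white//0 (hypothetical helper): exactly one white space code.
white --> [C], { code_type(C, space) }.

comment --> "/*", anything, "*/".

% Shortest match first, so the comment stops at the first "*/".
anything --> [].
anything --> [_], anything.

% Commit once, at the call site:
% ?- atom_codes(' /* a */ x', Cs), once(phrase(skip_ahead, Cs, Rest)),
%    atom_codes(A, Rest).
% A = x.
```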

I have this code to support error reporting, which itself must be handled with care, sprinkling meaningful messages and 'skip rules' around the code. But there is no ready-to-use alternative: a DCG is a nice computation engine, but it cannot compete out of the box with specialized parsing engines that are able to emit error messages automatically, exploiting the theoretical properties of the targeted grammars...
:- dynamic text_length/1.
parse_conf_cs(Cs, AST) :-
    length(Cs, TL),
    retractall(text_length(_)),
    assert(text_length(TL)),
    phrase(cfg(AST), Cs).
....
%% tag(?T, -X, -Y)// is det.
%
% Start/Stop tokens for XML like entries.
% Maybe this should restrict somewhat the allowed text.
%
tag(T, X, Y) -->
    pos(X), unquoted(T), pos(Y).
....
%% pos(-C, +P, -P) is det.
%
% capture offset from end of stream
%
pos(C, P, P) :- text_length(L), length(P, Q), C is L - Q.
tag//3 is just an example usage, in this parser I'm building an editable AST, so I store the positions to be able to properly attribute each nested part in an editor...
edit
a small enhancement for id//1: SWI-Prolog has specialized code_type/2 for that:
1 ?- code_type(0'a, csymf).
true.
2 ?- code_type(0'1, csymf).
false.
so (glossing over literal transformation)
id([C|Cs]) --> [C], {code_type(C, csymf)}, id_rest(Cs).
id_rest([C|Cs]) --> [C], {code_type(C, csym)}, id_rest(Cs).
id_rest([]) --> [].
depending on your attitude to generalize small snippets, and the actual grammar details, id_rest//1 could be written in reusable fashion, and made deterministic
id([C|Cs]) --> [C], {code_type(C, csymf)}, codes(csym, Cs).
% greedy and deterministic
codes(Kind, [C|Cs]) --> [C], {code_type(C, Kind)}, !, codes(Kind, Cs).
codes(Kind, []), [C] --> [C], {\+code_type(C, Kind)}, !.
codes(_, []) --> [].
this stricter definition of id//1 would also allow to remove some ambiguity wrt identifiers with keyword prefix: recoding keyword//1 like
keyword(K) --> id(id(K)), {memberchk(K, [
    array,
    break,
    ...
])}.
will correctly identify
?- phrase(tokenize(Ts), `if1*2`).
Ts = [id(if1), *, int(2)] ;
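A self-contained sketch of the same idea, in case it helps: lex a whole identifier greedily first, then decide whether it is a keyword. The names ident//1, csyms//1, word//1 and is_keyword/1 are mine, and the keyword table is only a sample (array and break come from the question, if is assumed from the query above).

```prolog
% ident(-Atom)//: one csymf code followed by as many csym codes as possible.
ident(Atom) -->
    [C], { code_type(C, csymf) },
    csyms(Cs),
    { atom_codes(Atom, [C|Cs]) }.

% csyms(-Codes)//: greedy and deterministic run of csym codes.
csyms([C|Cs]) --> [C], { code_type(C, csym) }, !, csyms(Cs).
csyms([]) --> [].

% word(-Token)//: a keyword yields the bare atom, anything else id(Atom).
word(Token) -->
    ident(Atom),
    { (   is_keyword(Atom)
      ->  Token = Atom
      ;   Token = id(Atom)
      ) }.

% Sample keyword table (an assumption, extend as needed).
is_keyword(array).
is_keyword(break).
is_keyword(if).

% ?- atom_codes(if1, Cs), phrase(word(T), Cs).
% T = id(if1).
% ?- atom_codes(break, Cs), phrase(word(T), Cs).
% T = break.
```

Because the identifier is consumed whole before the keyword check, int can no longer be split into in followed by id(t).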
Your string//1 (OT: what unfortunate clash with library(dcg/basics):string//1) is an easy candidate for implementing a simple 'error recovery strategy':
stringChar(0'\\) --> "\\\\".
stringChar(0'") --> pos(X), "\n", {format('unclosed string at ~d~n', [X])}.
It's an example of 'report error and insert missing token', so the parsing can go on...

Related

DCG: hidden argument across all Rules?

In a DCG, is there a way to have a hidden argument, i.e. an argument that is passed in the top rule and that I don't mention in the rest of the rules, but still have access to?
S(Ctx,A,B) --> ...
R1(A) --> ....
R2(A) --> ..R5(A), { write(Ctx) }
R3(A) --> ..add2ctx(abc,Ctx), remove4ctx(bcd,Ctx)..
In the same way that DCG is syntactic sugar over difference lists, I just want to skip declaring the variable in the rule heads and when I call another rule.
No, there is not. Whenever data has to be passed to a clause, it must be done explicitly. You cannot define a piece of "context information" that is implicitly visible to all DCG rules.
But there is this note in the SWI-Prolog manual:
phrase/3
A portable solution for threading state through a DCG can be implemented
by wrapping the state in a list and use the DCG semicontext facility.
Subsequently, the following predicates may be used to access and modify the state.
state(S), [S] --> [S].
state(S0, S), [S] --> [S0].
So the idea here is that you have a term that describes a "current state", which you hot-potato from one DCG rule to the next by
Getting it from the input list
Transforming it from a state S0 to a state S
Then putting it back onto the list so that it is available for the next rule.
For example
state(S), [S] --> [S].
does not modify the state and just pushes it back on the list.
But
state(S0, S), [S] --> [S0].
grabs the state S0, maps it to S and puts it back onto the list. That should be the idea, I think. But in that example, there should probably be something more in the body, namely a call to some p(S,S0)...
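A small sketch of how the two state predicates can be used, assuming the state is a plain counter. increment//0 is a made-up helper; it supplies the "something more" by relating the old and the new state in a {}/1 goal placed after the call to state//2.

```prolog
% From the manual note: read the state, or replace it.
state(S), [S] --> [S].
state(S0, S), [S] --> [S0].

% increment//0 (hypothetical): bump a counter that is held as the state.
increment --> state(N0, N), { N is N0 + 1 }.

% The list passed to phrase/3 carries only the state, wrapped in a list:
% ?- phrase((increment, increment, state(S)), [0], Rest).
% S = 2, Rest = [2].
```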
It is not completely clear what you want to accomplish. The implicit arguments of a grammar rule are used for threading state, as described e.g. in David's answer. It is, however, also possible to share implicit logical variables with all grammar rules (but note that these cannot be used for threading state). This can easily be accomplished by encapsulating your grammar rules in a Logtalk parametric object. For example:
:- object(grammar(_Ctx_)).
    :- public(test/2).
    test(L, Z) :-
        phrase((a(Z); b(Z)), L).
    a(Y) --> [aa, X], {atom_concat(X, _Ctx_, Y)}.
    b(Y) --> [bb, X], {atom_concat(_Ctx_, X, Y)}.
:- end_object.
Some sample queries:
?- {grammar}.
% [ /Users/pmoura/grammar.lgt loaded ]
% (0 warnings)
true.
?- grammar(foo)::test([aa,cc], Z).
Z = ccfoo .
?- grammar(foo)::test([bb,cc], Z).
Z = foocc.
Would this work in your case? You can run this example with all Logtalk supported Prolog systems. You can also read more about parametric objects and parameter variables at https://logtalk.org/manuals/userman/objects.html#parametric-objects
In SWI-Prolog there is a ready-to-use and battle-proven pack, that builds on the DCG concept.
It does require a bit of learning, but don't fear, it's well supported.

Ignore rest of input

What's the preferred way to ignore rest of input? I found one somewhat verbose way:
ignore_rest --> [].
ignore_rest --> [_|_].
And it works:
?- phrase(ignore_rest, "foo schmoo").
true ;
But when I try to collapse these two rules into:
ignore_rest2 --> _.
Then it doesn't:
?- phrase(ignore_rest2, "foo schmoo").
ERROR: phrase/3: Arguments are not sufficiently instantiated
What you want is to state that there is a sequence of arbitrarily many characters. The easiest way to describe this is:
... -->
[].
... -->
[_],
... .
Using [_|_] as a non-terminal as you did, is an SWI-Prolog specific extension which is highly problematic. In fact, in the past, there were several different extensions to/interpretations of [_|_]. Most notably Quintus Prolog did permit to define a user-defined '.'/4 to be called when [_|_] was used as a non-terminal. Note that [_|[]] was still considered a terminal! Actually, this was rather an implementation error. But nevertheless, it was exploited. See for such an example:
David B. Searls, Investigating the Linguistics of DNA with Definite Clause Grammars. NACLP 1989.
Why not simply use phrase/3 instead of phrase/2? For example, assuming that you have a prefix//0 non-terminal that consumes only part of the input:
?- phrase(prefix, Input, _).
The third argument of phrase/3 returns the non-consumed terminals, which you can simply ignore.
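Spelled out as a tiny sketch, where prefix//0 is a hypothetical non-terminal that accepts just "foo" and starts_with_prefix/1 is a name I made up for the wrapper:

```prolog
:- set_prolog_flag(double_quotes, codes).

% prefix//0 (hypothetical): consumes only the codes of "foo".
prefix --> "foo".

% starts_with_prefix(+Codes): true if Codes begins with something that
% prefix//0 accepts. The unconsumed rest is bound to _ and dropped.
starts_with_prefix(Codes) :-
    phrase(prefix, Codes, _).

% ?- atom_codes('foo schmoo', Cs), starts_with_prefix(Cs).
% true.
```

No extra grammar rule is needed to swallow the rest of the input, and no choice point is left behind for it either.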

What is the general pattern for creating a dcg for file input?

I always seem to struggle to write DCG's to parse input files. But it seems it should be simple? Are there any tips or tricks to think about this problem?
For a concrete example, lets say I want to parse a fasta file. (https://en.wikipedia.org/wiki/FASTA_format). I want to read each description and each sequence on back tracking.
:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- portray_text(true).
:- set_prolog_flag(double_quotes, codes).
:- set_prolog_flag(back_quotes,string).
fasta_file([]) -->[].
fasta_file([Section|Sections]) -->
fasta_section(Section),
fasta_file(Sections).
fasta_section(Section) -->
fasta_description(Description),
fasta_seq(Sequence),
{Section =.. [section,Description,Sequence]}.
fasta_description(Description) -->
">",
string(Description),
{no_gt(Description),
no_nl(Description)}.
fasta_seq([]) --> [].
fasta_seq(Seq) -->
nt([S]),
fasta_seq(Ss),
{S="X"->Seq =Ss;Seq=[S|Ss]}.
nt("A") --> "A".
nt("C") --> "C".
nt("G") --> "G".
nt("T") --> "T".
nt("X") --> "\n".
no_gt([]).
no_gt([E|Es]):-
dif([E],">"),
no_gt(Es).
no_nl([]).
no_nl([E|Es]):-
dif([E],"\n"),
no_nl(Es).
Now this is clearly wrong. The behaviour I would like is
?-phrase(fasta_section(S),">frog\nACGGGGTACG\n>duck\nACGTTAG").
S = section("frog","ACGGGGTACG");
S = section("duck","ACGTTAG");
false.
But if I run phrase(fasta_file(Sections),">frog\nACGGGGTACG\n>duck\nACGTTAG"), Sections is unified with a list of section/2 terms, which is what I want. Still, my current code seems quite hacky, for example in how I have handled the newline character.
for sure, there are 'small' typing problems:
nt("A") -->"A",
nt("C") -->"C",
nt("G") -->"G",
nt("T") -->"T".
should be
nt("A") -->"A".
nt("C") -->"C".
nt("G") -->"G".
nt("T") -->"T".
anyway, I also had my problems debugging DCG, I wrote a parser to load in Prolog a MySQL dump (plain SQL, really), and was a pain when something unexpected, like escaped strings, or UTF8 (?) weird encodings were found.
I would suggest to use phrase/3, to see if there is an unparsable tail. Also, could help to place some debug output after known, well behaved sequences.
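For instance, a rough sketch of the phrase/3 trick on a cut-down grammar. nts//1 and parsed_and_tail/3 are names I made up here, and the grammar only knows the four nucleotide letters; the point is just that the third argument shows where parsing stopped.

```prolog
% nts(-Codes)//: greedily collect nucleotide codes (A, C, G, T) and stop
% at the first code that is anything else.
nts([C|Cs]) --> [C], { memberchk(C, [0'A, 0'C, 0'G, 0'T]) }, !, nts(Cs).
nts([]) --> [].

% parsed_and_tail(+Input, -Parsed, -Tail): hypothetical debugging helper.
% Parsed is the part of the atom Input that nts//1 accepted,
% Tail is the part that was left unparsed.
parsed_and_tail(Input, Parsed, Tail) :-
    atom_codes(Input, Codes),
    phrase(nts(ParsedCodes), Codes, TailCodes),
    atom_codes(Parsed, ParsedCodes),
    atom_codes(Tail, TailCodes).

% ?- parsed_and_tail('ACGxTT', P, T).
% P = 'ACG', T = xTT.
```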
Of course, I assume you already tried to use the SWI-Prolog debugger.
Also, beware of
...
dif([E],">"),
...
did you set the appropriate flag about double quotes ? In DCG bodies, the rewrite machinery takes care of matching, but a sequence of codes in SWI-Prolog by default doesn't match double quoted strings...
edit
I think this will not solve your doubt about a general strategy... anyway, it's how I would handle the problem...
fasta_file([]) --> [].
fasta_file([Section|Sections]) -->
    fasta_section(Section),
    fasta_file(Sections).
fasta_section(section(Description,Sequence)) -->
    fasta_description(Description),
    fasta_seq(SequenceCs), {atom_codes(Sequence, SequenceCs)}, !.
fasta_description(Description) -->
    ">", string(DescriptionCs), "\n", {atom_codes(Description, DescriptionCs)}.
fasta_seq([S|Seq]) --> nt(S), fasta_seq(Seq).
fasta_seq([]) --> "\n" ; []. % optional \n at EOF
nt(0'A) --> "A".
nt(0'C) --> "C".
nt(0'G) --> "G".
nt(0'T) --> "T".
now
?- phrase(fasta_file(S), `>frog\nACGGGGTACG\n>duck\nACGTTAG`).
S = [section(frog, 'ACGGGGTACG'), section(duck, 'ACGTTAG')] ;
false.
note: the order of clauses fasta_seq//1 is important, since it implements an 'eager' parsing - mainly for efficiency. As I said, I had to parse SQL, several MBs was common.
edit
?- phrase((string(_),fasta_section(S)), `>frog\nACGGGGTACG\n>duck\nACGTTAG`,_).
S = section(frog, 'ACGGGGTACG') ;
S = section(duck, 'ACGTTAG') ;
false.
fasta_section//1 is meant to match a definite sequence. To get all sections on backtracking, we must provide a backtrack point. In this case, string//1 from library(dcg/basics) does the job.

prolog,very simple dcg syntax

I am trying to understand the basic syntax of prolog and dcg but it's really hard to get ahold of proper information on the really basic stuff. Take a look at the code below, I basically just want to achieve something like this:
Output = te(a, st).
Code:
test(te(X,Y)) --> [X], test2(Y).
test2(st(_X)) --> [bonk].
?- test(Output, [a, bonk],[]).
Output = te(a, st(_G6369)).
What I simply want to do is add the word 'st' at the end. The closest I've managed is the code above, but unfortunately st is followed by a bunch of nonsense, most likely because of the singleton _X. I simply want my Output to be te(a, st).
If you want to accept input of the form [Term, bonk] and obtain te(Term,st), you should change test2//1 to accept bonk and return st:
test(te(X,Y)) --> [X], test2(Y).
test2(st) --> [bonk].
?- test(Output, [a, bonk],[]).
Output = te(a, st).
As you said, st is followed by "a bunch of nonsense" because of _X (basically, _G6369 is the internal 'name' of the variable, and since the variable remains uninstantiated Prolog displays it; try print(X), X=3, print(X)).
Anyway, you can simply remove (_X) since you can have anything you want as an argument:
test(te(X,Y)) --> [X], test2(Y).
test2(st) --> [bonk].
Of course, if you don't actually have bonk's in your input and you simply want to add a st at the end you can simplify it even more:
test(te(X,st)) --> [X].
Or if you have bonk's:
test(te(X,st)) --> [X,bonk].
Finally, it is generally suggested to use phrase/3 or phrase/2 instead of adding the arguments manually.
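With phrase/2 the call looks like this. parse_test/2 is just a hypothetical wrapper name, and the grammar is the last variant from above:

```prolog
% The variant that expects bonk in the input.
test(te(X, st)) --> [X, bonk].

% parse_test(+Input, -Output): hypothetical wrapper. phrase/2 supplies the
% two list arguments and requires the whole of Input to be consumed.
parse_test(Input, Output) :-
    phrase(test(Output), Input).

% ?- parse_test([a, bonk], Output).
% Output = te(a, st).
```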

Prolog predicate calling

In the following tutorial: http://www.csupomona.edu/~jrfisher/www/prolog_tutorial/7_3.html
There is the part:
test_parser :- repeat,
    write('?? '),
    read_line(X),
    ( c(F,X,[]) | q(F,X,[]) ),
    nl, write(X), nl, write(F), nl, fail.
Now I'm extremely confused about the c(F,X,[]) and q(F,X,[]) part because it doesn't seem to match any thing that I have seen, c only takes one parameter from what I can tell and these parameters don't seem to make sense for q. Please help me understand what is going on here.
c//1 and q//1 are entry points (aka top-level productions) of the Definite Clause Grammar defined below, where you find
c(F) --> ....
q(F) --> ....
This style of 'call' on a DCG entry point is discouraged. It is usually better to invoke phrase(Grammar, TextToAnalyze, TextAfterAnalysis), in this case phrase((c(F) ; q(F)), "some text", "")...
The --> operator is usually rewritten adding 2 arguments, that are cause of your concern.
EDIT
I.e.
c(L) --> lead_in,arrange(L),end.
is rewritten to
c(L,X,Y) :- lead_in(X,X1),arrange(L,X1,X2),end(X2,Y).
c is defined with -->, which actually adds two hidden arguments to it. The first of these is a list to be parsed by the grammar rule; the second is "what's left" after the parse. c(F,X,[]) calls c on the list X to obtain a result F, expecting [] to be left, i.e. the parser should consume the entire list X.
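To see the two extra arguments at work without the tutorial's grammar, here is a toy example. greeting//1 and same_result/2 are made up for the illustration and are not part of the tutorial.

```prolog
% greeting(-Name)//: the token hello followed by any single token.
% After translation this is an ordinary predicate greeting/3.
greeting(Name) --> [hello], [Name].

% same_result(+Input, -Name): hypothetical check that going through
% phrase/2 and calling the translated predicate directly, with [] as
% "what's left", give the same answer.
same_result(Input, Name) :-
    phrase(greeting(Name), Input),
    greeting(Name, Input, []).

% ?- same_result([hello, world], Name).
% Name = world.
```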
