Parsing grammar using DCG [SICStus] - prolog

I have a fairly large syntax to parse (in order to build a syntax analyzer). The problem is that there is some redundancy somewhere in my code, but I don't know where it is.
Part of the grammar:
Grammar_types
Type ::= Basic_Type
| "PP" "(" Type ")"
| Type "*" Type
| "struct" "(" (Ident ":" Type)+"," ")"
| "(" Type ")" .
Basic_Type ::= "ZZ" | "BOOL" | "STRING" | Ident .
I first tried to analyze this grammar without DCG, for example to parse Id ::= Id ((Id) * ",") *
Example_1
"id","id_0(id1,id2,..)"
Code_1
entete_(ID, Id, Ids) :- atom_concat(XY, ')', ID),
atom_concat(XX,Ids, XY),check_ids(Ids),
atom_concat(Id,'(',XX),check_id(Id) ,!.
...
While searching, I found that DCGs are one of the most effective ways to write a parser in Prolog, so I went back and wrote the code below:
Code_2
type(Type) --> "struct(", idents_types(Type),")"
| "PP(",ident(Type),")"
| "(",type(Type),")"
| type(Type),"*",type(Type)
| basic_type(Type)
| "error1_type".
...
Example_Syntax
"ZZ" ; "PP(ZZ*STRING)" ; "struct(x:personne,struct(y:PP(PP))" ; "ZZ*ZZ" ...
Test
| ?- phrase(type(L),"struct(aa:struct())").
! Resource error: insufficient memory
% source_info
I think the problem is here (idents_types):
| ?- phrase(idents_types(L),"struct(aa:STRING)").
! Resource error: insufficient memory
Expected result
| ?- type('ZZ*struct(p1:STRING,p2:PP(STRING),p3:(BOOL*PP(STRING)),p4:PP(personne*BOOL))').
p1-STRING
STRING
p2-PP(STRING)
STRING
p3-(BOOL*PP(STRING))
STRING
BOOL
p4-PP(personne*BOOL)
BOOL
personne
ZZ
yes
So my question is: why am I getting this error, which I attribute to redundancy, and how can I fix it?

You have a left recursion on type//1.
type(Type) --> ... | type(Type),"*",type(Type) | ...
You can look into this question for further information.
Top-down parsers, from which DCGs borrow, need a way to look ahead at a symbol that drives the analysis in the right direction. A left-recursive rule calls itself before consuming any input, so it loops without ever reaching that symbol.
The usual solution to this problem, as indicated in the question linked above, is to introduce an auxiliary nonterminal that either left-associates the recursive applications of the culprit rule or is epsilon (which terminates the recursion).
An epsilon rule is written like
rule --> [].
The transformation can require a fair bit of thinking... In this answer I suggest a bottom-up alternative implementation, which could be worthwhile if the grammar cannot be transformed, for practical or theoretical reasons (LR grammars are more general than LL).
You may want to try this simple-minded transformation, though it certainly leaves several details to be resolved.
type([Type,T]) --> "struct(", idents_types(Type),")", type_1(T)
| "PP(",ident(Type),")", type_1(T)
| "(",type(Type),")", type_1(T)
| basic_type(Type), type_1(T)
| "error1_type", type_1(T).
type_1([Type1,Type2,T]) --> type(Type1),"*",type(Type2), type_1(T).
type_1([]) --> [].
edit
I fixed several problems, in both your code and mine. Now it parses the example...
type([Type,T]) --> "struct(", idents_types(Type), ")", type_1(T)
| "PP(", type(Type), ")", type_1(T)
| "(", type(Type), ")", type_1(T)
| basic_type(Type), type_1(T)
| "error1_type", type_1(T).
% my mistake here...
type_1([]) --> [].
type_1([Type,T]) --> "*",type(Type), type_1(T).
% the output Type was unbound on ZZ,etc
basic_type('ZZ') --> "ZZ".
basic_type('BOOL') --> "BOOL".
basic_type('STRING') --> "STRING".
basic_type(Type) --> ident(Type).
% here would be better to factorize ident(Ident),":",type(Type)
idents_types([Ident,Type|Ids]) --> ident(Ident),":",type(Type),",",
idents_types(Ids).
idents_types([Ident,Type]) --> ident(Ident),":",type(Type).
idents_types([]) --> [].
% ident//1 forgot to 'eat' a character
ident(Id) --> [C], { between(0'a,0'z,C),C\=0'_},ident_1(Cs),{ atom_codes(Id,[C|Cs]),last(Cs,L),L\=0'_}.
ident_1([C|Cs]) --> [C], { between(0'a,0'z,C);between(0'0,0'9,C);C=0'_ },
ident_1(Cs).
ident_1([]) --> [].
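For comparison, the same idea (replace the left-recursive alternative Type "*" Type with a loop over a non-recursive "primary" rule) can be sketched as a hand-written recursive-descent parser in Python. This is only an illustration under assumptions: the tokenizer, the tuple-shaped result, and the names Parser, parse, primary, fields are hypothetical and not part of the code above, and unlike the DCG it also accepts identifiers containing uppercase letters.

```python
import re

# A token is an identifier/keyword or one of the punctuation marks ( ) * : ,
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|[()*:,])")
PUNCT = set("()*:,")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.i = 0

    def peek(self, offset=0):
        j = self.i + offset
        return self.toks[j] if j < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def ident(self):
        tok = self.peek()
        if tok is None or tok in PUNCT:
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return self.eat()

    def type_(self):
        # Type ::= Primary ('*' Primary)*
        # The loop plays the role of type_1//1: it either consumes "*" and
        # another primary, or stops (the epsilon case).
        node = self.primary()
        while self.peek() == "*":
            self.eat("*")
            node = ("prod", node, self.primary())
        return node

    def primary(self):
        tok = self.peek()
        if tok == "(":
            self.eat("(")
            node = self.type_()
            self.eat(")")
            return node
        # "PP" and "struct" are constructors only when followed by "(";
        # otherwise they are treated as plain identifiers.
        if tok in ("PP", "struct") and self.peek(1) == "(":
            self.eat()
            self.eat("(")
            node = ("PP", self.type_()) if tok == "PP" else ("struct", self.fields())
            self.eat(")")
            return node
        return self.ident()

    def fields(self):
        # One or more "Ident : Type" pairs separated by commas.
        result = [self.field()]
        while self.peek() == ",":
            self.eat(",")
            result.append(self.field())
        return result

    def field(self):
        name = self.ident()
        self.eat(":")
        return (name, self.type_())


def parse(text):
    parser = Parser(text)
    node = parser.type_()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input: {parser.peek()!r}")
    return node
```

Because every call to primary consumes at least one token before type_ can be re-entered, an input such as "struct(aa:struct())" ends with a SyntaxError instead of exhausting memory.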

Related

Adding a parsing constraint to a DCG

Graphic tokens can serve as Prolog operators that don't require single quotes.
A translation of ISO/IEC 13211-1:1995, 6.4.2 "Syntax.Tokens.Names" is:
graphic_token --> kleene_plus(graphic_token_char).
graphic_token_char --> member("#$&*+-./:<=>?@^~\\").
% some auxiliary code
kleene_plus(NT) --> NT, kleene_star(NT).
kleene_star(NT) --> "" | kleene_plus(NT).
member(Xs) --> [X], { member(X,Xs) }.
Subsection 6.4.1 "Syntax.Tokens.Layout Text" adds the following constraint:
A graphic token shall not begin with the character sequence comment open (i.e., "/*").
Enforcing that restriction in the DCG is no big deal...
graphic_token --> graphic_token_char. % 1 char
graphic_token --> % 2+ chars
[C1,C2],
{ phrase((graphic_token_char,graphic_token_char), [C1,C2]) },
{ dif([C1,C2], "/*") },
kleene_star(graphic_token_char).
... but quite ugly!
How do I make it pretty again (and keep it bidirectional)?
I'm not sure this is prettier, but maybe something like this:
graphic_token --> kleene_plus_member("#$&*+-.:<=>?@^~\\",0'/).
graphic_token --> "/", kleene_star_member("#$&+-./:<=>?@^~\\", 0'*).
kleene_plus_member(Xs, Code) --> member(Xs), kleene_star(member([Code|Xs])).
kleene_star_member(Xs, Code) --> "" | member(Xs), kleene_star(member([Code|Xs])).
The first clause of graphic_token parses a graphic token that does not begin with /, and the second clause parses one that does.
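As a quick cross-check outside Prolog, the same two-case split can be written as a plain recognizer in Python. This is a rough sketch, not a replacement for the DCG (it is not bidirectional); the name is_graphic_token is hypothetical, and the character set assumes the ISO graphic characters plus backslash, including "@".

```python
GRAPHIC_CHARS = set("#$&*+-./:<=>?@^~\\")


def is_graphic_token(text):
    """True for a non-empty run of graphic characters that does not start with '/*'."""
    if not text or not set(text) <= GRAPHIC_CHARS:
        return False
    if text[0] != "/":
        # Case 1: the first character is not "/", anything may follow.
        return True
    # Case 2: the token starts with "/", so the second character must not be "*".
    return len(text) == 1 or text[1] != "*"
```

Strings such as ":-", "/" and "*/" are accepted, while "/*" and "/**/" are rejected.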

convert EBNF to BNF and use it as DCG format on Prolog

As part of my project I am supposed to convert EBNF to BNF and use DCG to program BNF in SWI-Prolog.
EBNF is as follows:
program -> int main ( ) { declarations statements }
declarations -> { declaration }
declaration -> type identifier [ [digit] ] ;
type -> int | bool | float | char
statements -> { statement }
statement -> ; | block | assignment | if_statement | while_statement
block -> { statements }
assignment -> identifier [ [digit] ] = expression ;
if_statement -> if ( expression ) statement
while_statement -> while ( expression ) statement
expression -> conjunction { || conjunction }
conjunction -> equality { && equality }
equality -> relation [ equ_op relation ]
equ_op -> == | !=
relation -> addition [ rel_op addition ]
rel_op -> < | <= | > | >=
addition -> term { add_op term }
add_op -> + | -
term -> factor { mul_op factor }
mul_op -> * | / | %
factor -> [ unary_op ] primary
unary_op -> - | !
primary -> identifier [ [digit] ] | literal | ( expression ) | type ( expression )
literal -> digit | boolean
identifier -> A | ... | Z
boolean -> true | false
digit -> 0 | ... | 9
My program should take the source file as input and print a message which says the program is syntactically correct or not.
Since I don't have any experience in Prolog, and the many YouTube videos, tutorials, and blogs I have gone through were not helpful (at least for me, because of my lack of experience), I need some help with how to do it. Can anybody help, please?
I solved this question. It was kind of easy:
program --> ["int"], ["main"], ["("], [")"], ["{"], declarations,
statements, ["}"].
declarations --> declaration.
declarations --> declaration, declarations.
declarations --> [].
declaration --> type, identifier, [";"].
declaration --> type, identifier, ["["], digit, ["]"], [";"].
type --> ["int"].
type --> ["bool"].
type --> ["float"].
type --> ["char"].
statements --> statement.
statements --> statement, statements.
statements --> [].
statement --> [";"].
statement --> block.
statement --> assignment.
statement --> if_statement.
statement --> while_statement.
block --> ["{"], statements, ["}"].
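% the index is optional in the EBNF: identifier [ [digit] ] = expression ;
assignment --> identifier, ["="], expression, [";"].
assignment --> identifier, ["["], digit, ["]"], ["="], expression, [";"].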
if_statement --> ["if"], ["("], expression, [")"], statement.
while_statement --> ["while"], ["("], expression, [")"], statement.
expression --> conjunction, conjunctions.
conjunctions --> ["||"], conjunction.
conjunctions --> ["||"], conjunction, conjunctions.
conjunctions --> [].
conjunction --> equality, equalities.
equalities --> ["&&"], equality.
equalities --> ["&&"], equality, equalities.
equalities --> [].
equality --> relation.
equality --> relation, equ_op, relation.
equ_op --> ["=="].
equ_op --> ["!="].
relation --> addition.
relation --> addition, rel_op, addition.
rel_op --> ["<"].
rel_op --> ["<="].
rel_op --> [">"].
rel_op --> [">="].
addition --> term, terms.
terms --> add_op, term.
terms --> add_op, term, terms.
terms --> [].
add_op --> ["+"].
add_op --> ["-"].
term --> factor, factors.
factors --> mul_op, factor.
factors --> mul_op, factor, factors.
factors --> [].
mul_op --> ["*"].
mul_op --> ["/"].
mul_op --> ["%"].
factor --> primary.
factor --> unary_op, primary.
unary_op --> ["-"].
unary_op --> ["!"].
primary --> identifier.
primary --> identifier, ["["], digit, ["]"].
primary --> literal.
primary --> ["("], expression, [")"].
primary --> type, ["("], expression, [")"].
literal --> digit.
literal --> boolean.
identifier --> ["A"].
identifier --> ["B"].
identifier --> ["C"].
identifier --> ["D"].
identifier --> ["E"].
identifier --> ["F"].
identifier --> ["G"].
identifier --> ["H"].
identifier --> ["I"].
identifier --> ["J"].
identifier --> ["K"].
identifier --> ["L"].
identifier --> ["M"].
identifier --> ["N"].
identifier --> ["O"].
identifier --> ["P"].
identifier --> ["Q"].
identifier --> ["R"].
identifier --> ["S"].
identifier --> ["T"].
identifier --> ["U"].
identifier --> ["V"].
identifier --> ["W"].
identifier --> ["X"].
identifier --> ["Y"].
identifier --> ["Z"].
boolean --> ["true"].
boolean --> ["false"].
digit --> ["0"].
digit --> ["1"].
digit --> ["2"].
digit --> ["3"].
digit --> ["4"].
digit --> ["5"].
digit --> ["6"].
digit --> ["7"].
digit --> ["8"].
digit --> ["9"].

How to capture part of a sentence that starts with a verb and finishes with nouns

I am trying to use NLTK package to capture the following chunk in a sentence:
verb + smth + noun
or it may be
verb + smth + noun + and + noun
I honestly spent an entire day messing with regexes, but still couldn't produce anything that works.
I was looking at this tutorial, which wasn't much help.
If you have an idea of what might come in between, there is a relatively easy method using NLTK's CFG. This is almost certainly not the most efficient way. For a comprehensive treatment, consult chapter 8 of the NLTK book.
We have two patterns as you mentioned:
<verb> ... <noun>
<verb> ... <noun> "and" <noun>
We should assemble a list of VPs and NPs and also the range of possible words that could happen in between. As a silly little example:
import nltk

grammar = nltk.CFG.fromstring("""
% start S
S -> VP SOMETHING NP
VP -> V
SOMETHING -> WORDS SOMETHING
SOMETHING ->
NP -> N 'and' N
NP -> N
V -> 'told' | 'scolded' | 'loved' | 'respected' | 'nominated' | 'rescued' | 'included'
N -> 'this' | 'us' | 'them' | 'you' | 'I' | 'me' | 'him'|'her'
WORDS -> 'among' | 'others' | 'not' | 'all' | 'of'| 'uhm' | '...' | 'let'| 'finish' | 'certainly' | 'maybe' | 'even' | 'me'
""")
Now suppose this is the list of the sentences we want to use our filter against:
sentences = ['scolded me and you', 'included certainly uhm maybe even her and I', 'loved me and maybe many others','nominated others not even him', 'told certainly among others uhm let me finish ... us and them', 'rescued all of us','rescued me and somebody else']
As you can see, the third and the last phrases don't pass the filter. We can check whether the rest match the pattern:
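def sentence_filter(sent, grammar):
    rd_parser = nltk.RecursiveDescentParser(grammar)
    try:
        # parse() yields one tree per analysis; a single tree is enough to accept
        matched = any(True for _ in rd_parser.parse(sent))
    except ValueError:
        # raised when the sentence contains a word the grammar does not cover
        matched = False
    print("SUCCESS!" if matched else "Doesn't match the filter...")

for s in sentences:
    sentence_filter(s.split(), grammar)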
When we run this, we get this result:
>>>
SUCCESS!
SUCCESS!
Doesn't match the filter...
SUCCESS!
SUCCESS!
SUCCESS!
Doesn't match the filter...
>>>
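If only a yes/no answer is needed and no parse tree, the same pattern can also be checked without NLTK. The following is a minimal sketch under that assumption: the word sets are copied from the toy grammar above, and the function name matches is hypothetical.

```python
V = {"told", "scolded", "loved", "respected", "nominated", "rescued", "included"}
N = {"this", "us", "them", "you", "I", "me", "him", "her"}
WORDS = {"among", "others", "not", "all", "of", "uhm", "...", "let",
         "finish", "certainly", "maybe", "even", "me"}


def matches(tokens):
    """Accept the shape: V WORDS* (N | N 'and' N)."""
    if len(tokens) < 2 or tokens[0] not in V:
        return False
    rest = tokens[1:]
    # Try both noun-phrase shapes at the end: one token, or three tokens.
    for np_len in (1, 3):
        if len(rest) < np_len:
            continue
        middle, np = rest[:-np_len], rest[-np_len:]
        if np_len == 1:
            np_ok = np[0] in N
        else:
            np_ok = np[0] in N and np[1] == "and" and np[2] in N
        if np_ok and all(word in WORDS for word in middle):
            return True
    return False


sentences = ['scolded me and you',
             'included certainly uhm maybe even her and I',
             'loved me and maybe many others',
             'nominated others not even him',
             'told certainly among others uhm let me finish ... us and them',
             'rescued all of us',
             'rescued me and somebody else']
results = [matches(s.split()) for s in sentences]
```

For these seven sentences, results is True everywhere except the third and the last entry, in line with the output above. Unlike the CFG, this version cannot be extended to nested structure, which is the point where a real grammar pays off.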

Xtext grammar QualifiedName ambiguity

I have the following problem. Part of my grammar looks like this
RExpr
: SetOp
;
SetOp returns RExpr
: PrimaryExpr (({Union.left=current} '+'|{Difference.left=current} '-'|{Intersection.left=current} '&') right = PrimaryExpr)*
;
PrimaryExpr returns RExpr
: '(' RExpr ')'
| (this = 'this.')? slot = [Slot | QualifiedName]
| (this = 'this' | ensName = [Ensemble | QualifiedName])
| 'All'
;
When generating the Xtext artifacts, ANTLR reports that, due to an ambiguity, it disables an option (3). The ambiguity arises because slot and ensemble share QualifiedName. How do I refactor cases like this? I guess a syntactic predicate won't help here, since it would force only one of them (Slot or Ensemble) to be resolved.
Thanks.
Xtext can't choose between your two references slot and ensemble.
You can merge these references into one reference by adding this rule to your grammar:
SlotOrEnsemble:
Slot | Ensemble
;
Then your PrimaryExpr rule will be something like:
PrimaryExpr returns RExpr
: '(' RExpr ')'
| ((this = 'this.')? ref= [SlotOrEnsemble | QualifiedName])
| this = 'this'
| 'All'
;

String tokenization in prolog

I have the following context free grammar in a text file 'grammar.txt'
S ::= a S b
S ::= []
I'm opening this file and am able to read each line in Prolog.
Now I want to tokenize each line and generate a list such as
L=[['S','::=','a','S','b'],['S','::=','#']] ('#' represents empty)
How can I do this?
Write the specification as a DCG. Here is the basic idea (untested); you'll need to refine it.
parse_grammar([Rule|Rules]) -->
parse_rule(Rule),
parse_grammar(Rules).
parse_grammar([]) --> [].
parse_rule([NT, '::=' | Body]) -->
parse_symbol(NT),
skip_space,
"::=",
skip_space,
parse_symbols(Body),
skip_space, !. % the cut is required if you use findall/3 (see below)
parse_symbols([S|Rest]) -->
parse_symbol(S),
skip_space,
parse_symbols(Rest).
parse_symbols([]) --> [].
parse_symbol(S) -->
[C], {code_type(C, alpha), atom_codes(S, [C])}.
skip_space -->
[C], {code_type(C, space)}, skip_space.
skip_space --> [].
This parses the whole file, using this toplevel:
...,
read_file_to_codes('grammar.txt', Codes),
phrase(parse_grammar(Grammar), Codes, []).
You say you read the file one line at a time; in that case use
...
findall(R, (get_line(L), phrase(parse_rule(R), L, [])), Grammar).
HTH
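To make the target data shape concrete, here is a small Python sketch that produces the list asked for in the question. It assumes that symbols are separated by whitespace and that an empty body is written as [] in the file; the function names and the empty_marker parameter are hypothetical.

```python
def tokenize_rule(line, empty_marker="#"):
    """Split one grammar line into symbols, mapping '[]' to the empty marker."""
    return [empty_marker if tok == "[]" else tok for tok in line.split()]


def tokenize_grammar(text):
    """Tokenize every non-blank line of the grammar text."""
    return [tokenize_rule(line) for line in text.splitlines() if line.strip()]


grammar_text = "S ::= a S b\nS ::= []\n"
rules = tokenize_grammar(grammar_text)
```

With the two-line grammar from the question, rules is [['S', '::=', 'a', 'S', 'b'], ['S', '::=', '#']].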

Resources