Using a Prolog DCG to split a string - prolog

I'm trying to use a DCG to split a string into two parts separated by spaces. E.g. 'abc def' should give me back "abc" & "def". The program & DCG are below.
main :-
    prompt(_, ''),
    repeat,
    read_line_to_codes(current_input, Codes),
    (   Codes = end_of_file
    ->  true
    ;   processData(Codes),
        fail
    ).

processData(Codes) :-
    (   phrase(data(Part1, Part2), Codes)
    ->  format('~s, ~s\n', [ Part1, Part2 ])
    ;   format('Didn''t recognize data.\n')
    ).
data([ P1 | Part1 ], [ P2 | Part2 ]) --> [ P1 | Part1 ], spaces(_), [ P2 | Part2 ].
spaces([ S | S1 ]) --> [ S ], { code_type(S, space) }, (spaces(S1); "").
This works correctly. But I found that having to type [ P1 | Part1 ] & [ P2 | Part2 ] was really verbose. So, I tried replacing all instances of [ P1 | Part1 ] w/ Part1 & likewise w/ [ P2 | Part2 ] in the definition of data, i.e. the following.
data(Part1, Part2) --> Part1, spaces(_), Part2.
That's much easier to type, but that gave me an Arguments are not sufficiently instantiated error. So it looks like an unbound variable isn't automatically interpreted as a list of codes in a DCG. Is there any other way to make this less verbose? My intent is to use DCG's where I would use regular expressions in other programming languages.

Your intuition is correct; the term-expansion procedure for DCGs (at least in SWI-Prolog, but should apply to others) with your modified version of data gives the following:
?- listing(data).
data(A, D, B, F) :-
phrase(A, B, C),
spaces(_, C, E),
phrase(D, E, F).
As you can see, the variables Part1 and Part2 in the body of your DCG rule have been translated into calls to phrase/3 rather than treated as lists. Calling phrase/3 with an unbound first argument is what raises the instantiation error. You need to state explicitly that they are lists for them to be treated as such.
I can suggest an alternative version which is more general. Consider the following bunch of DCG rules:
data([A|As]) -->
    spaces(_),
    chars([X|Xs]),
    {atom_codes(A, [X|Xs])},
    spaces(_),
    data(As).
data([]) --> [].

chars([X|Xs]) --> char(X), !, chars(Xs).
chars([]) --> [].

spaces([X|Xs]) --> space(X), !, spaces(Xs).
spaces([]) --> [].

space(X) --> [X], {code_type(X, space)}.
char(X) --> [X], {\+ code_type(X, space)}.
Take a look at the first clause at the top; the data rule now attempts to match 0-to-many spaces (as many as possible, because of the cut), then one-to-many non-space characters to construct an atom (A) from the codes, then 0-to-many spaces again, then recurses to find more atoms in the string (As). What you end up with is a list of atoms which appeared in the input string without any spaces. You can incorporate this version into your code with the following:
processData(Codes) :-
    % convert the list of codes to a list of atoms, one per word
    (   phrase(data(AtomList), Codes)
    ->  % concatenate the atoms into a single one delimited by commas
        % (atomic_list_concat/3 replaces the deprecated concat_atom/3)
        atomic_list_concat(AtomList, ', ', Atoms),
        writeln(Atoms)
    ;   format('Didn''t recognize data.\n')
    ).
This version breaks a string apart with any number of spaces between words, even if they appear at the start and end of the string.

Related

Program in SWI-Prolog

Task: Write to the new file all the lines of the source file that contain the specified string as a fragment, which is entered from the keyboard.
I can't figure out how to compare the lines from the file, line by line, with the fragment that I entered from the keyboard, and output the matching lines to a new file. I will be glad to receive any advice or direction. I can't think straight in Prolog.
f:-
write('Enter the name of the source file:'),
read(SOURFILE),
check_exist(SOURFILE),
open(SOURFILE,read,FROM),
read_line_to_string(FROM,X),writef(" "),
writef(X),
writeln(" "),
write('Enter a substring:'),
read(WR),
close(FROM),
write('Enter the name of the new file:'),
read(NEWFILE),
check_exist(NEWFILE),
name(S,X),
write_to_file(NEWFILE,S).
check_exist(Filename):-exists_file(Filename),!.
check_exist(_):-writeln('There is no such file'),
fail.
write_to_file(Filename,TEXT) :-
open(Filename, write, File),
write(File, TEXT),nl,
writeln('Data recorded successfully'),
close(File).
In SWI-Prolog, you can use predicate sub_string/5 to verify whether a string contains a substring. Thus, to solve your problem, you can do something like this:
copy(From, To, Substring) :-
    open(From, read, Input),
    open(To, write, Output),
    repeat,
    read_line_to_string(Input, String),
    (   String = end_of_file
    ->  !                            % stop reading lines
    ;   % once/1 keeps a line with several occurrences of the
        % substring from being written once per occurrence
        once(sub_string(String, _, _, _, Substring)),
        writeln(Output, String),
        fail                         % backtracks to read next line
    ),
    close(Input),
    close(Output).

antlr grammar: Allow whitespace matching only in template string

I want to parse template strings:
`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`
Here is my grammar:
varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)* ')' ;
WS : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;
When the input for the grammar is parsed, the template string has no whitespaces anymore because of the WS -> skip. When I put the TemplateStringLiteral before WS, I get the error:
extraneous input ' ' expecting {'`'}
How can I allow whitespaces to be parsed and not skipped only inside the template string?
What is currently happening
When testing your example against your current grammar displaying the generated tokens, the lexer gives this:
[#0,0:0='`',<'`'>,1:0]
[#1,1:4='Some',<VAR>,1:1]
[#2,6:9='text',<VAR>,1:6]
[#3,11:12='${',<'${'>,1:11]
[#4,13:20='variable',<VAR>,1:13]
[#5,21:21='.',<'.'>,1:21]
[#6,22:25='name',<VAR>,1:22]
[#7,26:26='}',<'}'>,1:26]
... shortened ...
[#26,85:84='<EOF>',<EOF>,2:0]
This tells you that Some, which you intended to be matched by TemplateStringLiteral*, was actually lexed as VAR. Why is this happening?
As mentioned in this answer, ANTLR uses the longest possible match to create a token. Since your TemplateStringLiteral rule matches only a single character, while your VAR rule matches arbitrarily many, the lexer uses the latter to match Some.
What you could try (Spoiler: won't work)
You could try to modify the rule like this:
TemplateStringLiteral: ('\\`' | ~'`')+ ;
so that it captures more than one character and is therefore preferred. This does not work, for two reasons:
1. How would the lexer match anything to the VAR rule, ever?
2. The TemplateStringLiteral rule now also matches ${, which prevents the start of a template chunk from being recognized correctly.
How to achieve what you actually want
There might be another solution, but this one works:
File MartinCup.g4:
parser grammar MartinCup;
options { tokenVocab=MartinCupLexer; }
templateString
: BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
;
template
: TemplateStart variable TemplateEnd
;
variable
: varname funParameter? (Dot variable)*
;
varname
: VAR
;
funParameter
: OpenPar variable? (Comma variable)* ClosedPar
;
File MartinCupLexer.g4:
lexer grammar MartinCupLexer;
BackTick : '`' ;
TemplateStart
: '${' -> pushMode(templateMode)
;
TemplateStringLiteral
: '\\`'
| ~'`'
;
mode templateMode;
VAR
: [$]?[a-zA-Z0-9_]+
| [$]
;
OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;
TemplateEnd
: '}' -> popMode;
This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and only stays active until } is read. It thereby does not catch non-template text like Some.
Notice that the use of lexer modes requires a split grammar (separate files for parser and lexer grammars). Since no lexer rules are allowed in a parser grammar, I had to introduce tokens for the parentheses, comma, dot and backticks.
About the whitespaces
I assume you want to keep whitespaces inside the "normal text", but not allow whitespace inside the templates. Therefore I simply removed the WS rule. You can always re-add it if you like.
I tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:
line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}
The reason for this is the same as above: Some is lexed as VAR.

Prolog wildcard for completing a string

I am currently stuck on a prolog problem.
So far I have:
film(Title) :- movie(Title,_,_). (where movie(T,_,_) is a reference to my database)
namesearch(Title, Firstword) :- film(Title), contains_term(Firstword, Title).
It is hard to explain what I need help on, but basically is there a wildcard I can use to search for all films starting with a specific word, for example, if I were to search for all films beginning with the word "The".
Is there a wildcard which would allow me to input as such: namesearch(X,'The*') ?
I have tried using the asterisk like this, and it does not work.
Thank you for your help.
It all depends how the title is represented.
Atom
If it is represented as an atom, you need sub_atom/5, called as sub_atom(Atom, Before, Length, After, Sub_atom):
?- Title = 'The Third Man', sub_atom(Title, 0, _, _, 'The').
Title = 'The Third Man'.
List of codes
If it is a list of codes, which is what Prolog systems in the Edinburgh tradition call a string, you can either "hard code" it with append/3 or use Definite Clause Grammars for general patterns.
?- set_prolog_flag(double_quotes,codes).
true.
?- append("The",_, Pattern), Title = "The Third Man", Pattern = Title.
Pattern = Title, Title = [84,104,101,32,84,104,105,114,100|...].
?- Title = "The Third Man", phrase(("The",...), Title).
Title = [84,104,101,32,84,104,105,114,100|...]
; false.
Note that 84 is the character code of T etc.
phrase/2 is "the entry" to grammars. See dcg for more. The queries above use the following definition:
... --> [] | [_], ... .
List of characters
Similar to a list of codes, a list of characters provides a more readable representation that still has the advantage of being compatible with list predicates and Definite Clause Grammars:
?- set_prolog_flag(double_quotes,chars).
true.
?- append("The",_, Pattern), Title = "The Third Man", Pattern = Title.
Pattern = Title, Title = ['T',h,e,' ','T',h,i,r,d|...].
?- Title = "The Third Man", phrase(("The",...), Title).
Title = ['T',h,e,' ','T',h,i,r,d|...]
; false.
See also this answer.

XPath: replace every other whitespace

I'd like to replace every other (odd?) space with x. The result should be:
axb axb axb axb axb
I tried something like:
replace ("a b a b a b a b" , " " , "x")[position() mod 2 = 0]
-- but with no result.
First of all: fn:replace requires an XPath 2.0 (or XQuery) compatible query processor.
You cannot use fn:replace with a predicate like this. There is no array-like access to characters in XPath (as you may be used to from, e.g., C). You could probably also solve this using fn:tokenize and a for loop, but that gets rather complicated.
Your query did not return any result because replace returns exactly one item (a sequence holding a single string), and the predicate [position() mod 2 = 0] keeps only the items at even positions, so the item at position 1 is dropped.
Use a regular expression instead. This expression matches a run of non-space characters (\S+), a run of spaces (\s+), and another run of non-space characters with optional trailing spaces (\S+\s*), and replaces each match with the two groups joined by x. The star quantifier at the end is important: it lets the last pair, which has no trailing space, still match (as in your example).
replace("a b a b a b a b" , "(\S+)\s+(\S+\s*)", "$1x$2")

Regular expression to match my pattern of words, wild chars

can you help me with this:
I want a regular expression for my Ruby program to match a word with the below pattern
Pattern has
List of letters ( For example. ABCC => 1 A, 1 B, 2 C )
N Wild Card Characters ( N can be 0 or 1 or 2)
A fixed word (for example “XY”).
Rules:
Regarding the List of letters, it should match words with
a. 0 or 1 A
b. 0 or 1 B
c. 0 or 1 or 2 C
Based on the value of N, there can be 0 or 1 or 2 wild chars
Fixed word is always in the order it is given.
The combination of all these can be in any order and should match words like below
ABWXY ( if wild char = 1)
BAXY
CXYCB
But not words with 2 A’s or 2 B’s
I am using the pattern like ^[ABCC]*.XY$
But it matches words with more than 1 A, more than 1 B, or more than 2 C's, and it only matches words which end with XY. I want all words which have XY in any place, with the letters and wild chars in any position.
If it HAS to be a regex, the following could be used:
if subject =~
  /^                            # start of string
  (?!(?:[^A]*A){2})             # assert that there are less than two As
  (?!(?:[^B]*B){2})             # and less than two Bs
  (?!(?:[^C]*C){3})             # and less than three Cs
  (?!(?:[ABCXY]*[^ABCXY]){3})   # and less than three non-ABCXY characters
  (?=.*XY)                      # and that XY is contained in the string.
  /x
  # Successful match
else
  # Match attempt failed
end
This assumes that none of the characters A, B, C, X, or Y are allowed as wildcards.
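To see how the lookaheads behave, the pattern can be run against the sample words from the question. This is a small sketch: the constant name PATTERN and the extra test words AAXY, ABC and ABWWWXY are made up for illustration, and the pattern itself is copied unchanged from the answer (so it always allows up to two wildcards).

```ruby
# The pattern from the answer above, stored in a constant (the name is illustrative).
PATTERN = /^
  (?!(?:[^A]*A){2})             # fewer than two As
  (?!(?:[^B]*B){2})             # fewer than two Bs
  (?!(?:[^C]*C){3})             # fewer than three Cs
  (?!(?:[ABCXY]*[^ABCXY]){3})   # fewer than three non-ABCXY characters
  (?=.*XY)                      # XY appears somewhere
/x

# Words from the question, plus three hypothetical ones that should be rejected.
%w[ABWXY BAXY CXYCB AAXY ABC ABWWWXY].each do |word|
  puts "#{word}: #{PATTERN.match?(word)}"
end
```

The first three words are accepted. AAXY is rejected for its second A, ABC for the missing XY, and ABWWWXY for its third wildcard.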
I consider myself to be fairly good with regular expressions and I can't think of a way to do what you're asking. Regular expressions look for patterns and what you seem to want is quite a few different patterns. It might be more appropriate in your case to write a function which splits the string into characters and counts what you have, so you can check your criteria.
Just to give an example of your problem, a regex like /[abc]/ will match every single occurrence of a, b and c regardless of how many times those letters appear in the string. You can try /c{1,2}/ and it will match "c", "cc", and also "ccc". It matches the last case because the pattern is not anchored, so it finds "cc" inside "ccc".
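The quantifier behaviour is easy to confirm in Ruby. In this short sketch the variable names are illustrative, and the anchored variant is an addition for contrast rather than something the answer proposes.

```ruby
unanchored = /c{1,2}/       # may match anywhere inside the string
anchored   = /\Ac{1,2}\z/   # must cover the whole string

# String#[] with a regex returns the matched portion (or nil).
puts "ccc"[unanchored]          # prints "cc": the first two characters match
puts anchored.match?("ccc")     # prints false: three c's exceed the {1,2} limit
puts anchored.match?("cc")      # prints true
```

Anchoring limits the count for a string made only of c's, but it does not help once other letters may be mixed in, which is the situation in the question.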
One thing I have found invaluable when developing and debugging regular expressions is rubular.com. Try some examples and I think you'll see what you're up against.
I don't know if this is really any help but it might help you choose a direction.
You need to break out your pattern properly. In regexp terms, [ABCC] means "any one of A, B or C" where the duplicate C is ignored. It's a set operator, not a grouping operator like () is.
What you seem to be describing is creating a regexp based on parameters. You can do this by passing a string to Regexp.new and using the result.
An example is roughly:
def match_for_options(options)
  pattern = '^'
  pattern << 'A' * options[:a] if (options[:a])
  pattern << 'B' * options[:b] if (options[:b])
  pattern << 'C' * options[:c] if (options[:c])
  Regexp.new(pattern)
end
You'd use it something like this:
if (match_for_options(:a => 1, :c => 2).match('ACC'))
  # ...
end
Since you want to allow these "elements" to appear in any order, you might be better off writing a bit of Ruby code that goes through the string from beginning to end, counts the number of As, Bs, and Cs, and checks whether it contains your desired substring. If the numbers of As, Bs, and Cs are within your desired limits, the string contains the desired substring, and its length (i.e. the number of characters) is at most the length of the desired substring, plus the number of As, Bs, and Cs, plus N, then the string is good; otherwise it is bad. To be careful, you should first search for your desired substring and remove it from the original string, and only then count the As, Bs, and Cs, because otherwise you may unintentionally count As, Bs, and Cs that appear in the desired substring, if there are any.
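The counting approach described above could look roughly like this. It is a sketch under stated assumptions, not a definitive implementation: the method name word_ok? and its keyword arguments are hypothetical, any character other than the limited letters is treated as a wildcard, and only the first occurrence of the fixed word is removed before counting.

```ruby
# Hypothetical helper; the name and keyword arguments are illustrative.
def word_ok?(word, fixed: "XY", limits: { "A" => 1, "B" => 1, "C" => 2 }, max_wild: 2)
  # The fixed word must be present. Remove its first occurrence so that
  # its own letters are not counted below.
  return false unless word.include?(fixed)
  rest = word.sub(fixed, "")

  # Every limited letter must occur no more often than its limit.
  within_limits = limits.all? do |letter, max|
    rest.each_char.count { |ch| ch == letter } <= max
  end
  return false unless within_limits

  # Assumption: anything that is not a limited letter counts as a wildcard.
  wild = rest.each_char.count { |ch| !limits.key?(ch) }
  wild <= max_wild
end

puts word_ok?("ABWXY", max_wild: 1)  # prints true
puts word_ok?("CXYCB")               # prints true
puts word_ok?("AAXY")                # prints false (two As)
```

Keeping the limits in a hash means that a different letter list only changes the arguments, not the method body.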
You can do what you want with a regular expression, but it would be a long ugly regular expression. Why? Because you would need a separate "case" in the regular expression for each of the possible orders of the elements. For example, the regular expression "^ABC..XY$" will match any string beginning with "ABC" and ending with "XY" and having two wild card characters in the middle. But only in that order. If you want a regular expression for all possible orders, you'd need to list all of those orders in the regular expression, e.g. it would begin something like "^(ABC..XY|ACB..XY|BAC..XY|BCA..XY|" and go on from there, with about 5! = 120 different orders for that list of 5 elements, then you'd need more for the cases where there was no A, then more for cases where there was no B, etc. I think a regular expression is the wrong tool for the job here.

Resources