In parsing a large 3 gigabyte file with DCG, efficiency is of importance.
The current version of my lexer is using mostly the or predicate ;/2 but I read that indexing can help.
Indexing is a technique used to quickly select candidate clauses of a
predicate for a specific goal. In most Prolog systems, indexing is
done (only) on the first argument of the head. If this argument is
instantiated to an atom, integer, float or compound term with functor,
hashing is used to quickly select all clauses where the first argument
may unify with the first argument of the goal. SWI-Prolog supports
just-in-time and multi-argument indexing. See section 2.18.
Can someone give an example of using indexing for lexing and possibly explain how it improves efficiency?
Details
Note: I changed some of the names before coping the source code into this question. If you find a mistake feel free to edit it here or leave me a comment and I will gladly fix it.
Currently my lexer/tokenizer (based on mzapotoczny/prolog-interpreter parser.pl) is this
% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If double_quotes flag is set to `code`, the the values with "" will not be matched.
:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).
lexer(Tokens) -->
white_space,
(
( ":", !, { Token = tokColon }
; "(", !, { Token = tokLParen }
; ")", !, { Token = tokRParen }
; "{", !, { Token = tokLMusta}
; "}", !, { Token = tokRMusta}
; "\\", !, { Token = tokSlash}
; "->", !, { Token = tokImpl}
; "+", !, { Token = tokPlus }
; "-", !, { Token = tokMinus }
; "*", !, { Token = tokTimes }
; "=", !, { Token = tokEqual }
; "<", !, { Token = tokLt }
; ">", !, { Token = tokGt }
; "_", !, { Token = tokUnderscore }
; ".", !, { Token = tokPeriod }
; "/", !, { Token = tokForwardSlash }
; ",", !, { Token = tokComma }
; ";", !, { Token = tokSemicolon }
; digit(D), !,
number(D, N),
{ Token = tokNumber(N) }
; letter(L), !, identifier(L, Id),
{ member((Id, Token), [ (div, tokDiv),
(mod, tokMod),
(where, tokWhere)]),
!
; Token = tokVar(Id)
}
; [_],
{ Token = tokUnknown }
),
!,
{ Tokens = [Token | TokList] },
lexer(TokList)
; [],
{ Tokens = [] }
).
white_space -->
[Char], { code_type(Char, space) }, !, white_space.
white_space -->
"--", whole_line, !, white_space.
white_space -->
[].
whole_line --> "\n", !.
whole_line --> [_], whole_line.
digit(D) -->
[D],
{ code_type(D, digit) }.
digits([D|T]) -->
digit(D),
!,
digits(T).
digits([]) -->
[].
number(D, N) -->
digits(Ds),
{ number_chars(N, [D|Ds]) }.
letter(L) -->
[L], { code_type(L, alpha) }.
alphanum([A|T]) -->
[A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
[].
alphanum([]).
alphanum([H|T]) :- code_type(H, alpha), alphanum(T).
identifier(L, Id) -->
alphanum(As),
{ atom_codes(Id, [L|As]) }.
Here are some helper predicates used for development and testing.
read_file_for_lexing_and_user_review(Path) :-
open(Path,read,Input),
read_input_for_user_review(Input), !,
close(Input).
read_file_for_lexing_and_performance(Path,Limit) :-
open(Path,read,Input),
read_input_for_performance(Input,0,Limit), !,
close(Input).
read_input(Input) :-
at_end_of_stream(Input).
read_input(Input) :-
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line(Line),
read_input(Input).
read_input_for_user_review(Input) :-
at_end_of_stream(Input).
read_input_for_user_review(Input) :-
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line_for_user_review(Line),
nl,
print('Press spacebar to continue or any other key to exit: '),
get_single_char(Key),
process_user_continue_or_exit_key(Key,Input).
read_input_for_performance(Input,Count,Limit) :-
Count >= Limit.
read_input_for_performance(Input,_,_) :-
at_end_of_stream(Input).
read_input_for_performance(Input,Count0,Limit) :-
% print(Count0),
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line(Line),
Count is Count0 + 1,
read_input_for_performance(Input,Count,Limit).
process_user_continue_or_exit_key(32,Input) :- % space bar
nl, nl,
read_input_for_user_review(Input).
process_user_continue_or_exit_key(Key) :-
Key \= 32.
lex_line_for_user_review(Line) :-
lex_line(Line,TokList),
print(Line),
nl,
print(TokList),
nl.
lex_line(Line,TokList) :-
string_chars(Line,Code_line),
phrase(lexer(TokList),Code_line).
lex_line(Line) :-
string_chars(Line,Code_line),
phrase(lexer(TokList),Code_line).
read_user_input_for_lexing_and_user_review :-
print('Enter a line to parse or just Enter to exit: '),
nl,
read_string(user, "\n", "\r", _, String),
nl,
lex_line_for_user_review(String),
nl,
continue_user_input_for_lexing_and_user_review(String).
continue_user_input_for_lexing_and_user_review(String) :-
string_length(String,N),
N > 0,
read_user_input_for_lexing_and_user_review.
continue_user_input_for_lexing_and_user_review(String) :-
string_length(String,0).
read_user_input_for_lexing_and_user_review/0 allows a user to enter a string at the terminal for lexing and review the tokens.
read_file_for_lexing_and_user_review/1 Reads a file for lexing and review the tokens for each line one line at a time.
read_file_for_lexing_and_performance/2 Reads a file for lexing with a limit on the number of lines to lex. This is for use with gathering basic performance statistics to measure efficiency. Meant to be used with time/1.
One thing it means is that this is silly code:
token(T) -->
( "1", !, { T = one }
; "2", !, { T = two }
; "3", !, { T = three }
)
This is less silly code:
token(T) --> one_two_three(T).
one_two_three(one) --> "1".
one_two_three(two) --> "2".
one_two_three(three) --> "3".
But still not so good. Maybe better:
token(T) --> [X], { one_two_three(X, T) }.
one_two_three(0'1, one).
one_two_three(0'2, two).
one_two_three(0'3, three).
Last example also starts to look silly but remember that now you have indexing on first argument. You read once, no choice point, no backtrack.
But if you want to really know how to write efficient you need to measure where the time and space goes. Have you measured?
But if you really want to know how to fix you maybe read "Craft of Prolog", I do not understand all of this book but I remember it had big section on DCG.
But if you really want to parse such formats large files maybe find existing libraries in other languages, it might be much faster than fastest Prolog.
Solution:
You should replace the following:
lexer(Tokens) -->
white_space,
(
( ":", !, { Token = tokColon }
; "(", !, { Token = tokLParen }
; ")", !, { Token = tokRParen }
; "{", !, { Token = tokLMusta}
; "}", !, { Token = tokRMusta}
; "\\", !, { Token = tokSlash}
; "->", !, { Token = tokImpl}
; "+", !, { Token = tokPlus }
; "-", !, { Token = tokMinus }
; "*", !, { Token = tokTimes }
; "=", !, { Token = tokEqual }
; "<", !, { Token = tokLt }
; ">", !, { Token = tokGt }
; "_", !, { Token = tokUnderscore }
; ".", !, { Token = tokPeriod }
; "/", !, { Token = tokForwardSlash }
; ",", !, { Token = tokComma }
; ";", !, { Token = tokSemicolon }
; digit(D), !,
number(D, N),
{ Token = tokNumber(N) }
; letter(L), !, identifier(L, Id),
{ member((Id, Token), [ (div, tokDiv),
(mod, tokMod),
(where, tokWhere)]),
!
; Token = tokVar(Id)
}
; [_],
{ Token = tokUnknown }
),
!,
{ Tokens = [Token | TokList] },
lexer(TokList)
; [],
{ Tokens = [] }
).
with
lexer(Tokens) -->
white_space,
(
(
op_token(Token), ! % replace ;/2 long chain searched blindly with call to new predicate op_token//1 which clauses have indexed access by first arg in Prolog standard way
;
digit(D), !, number(D, N),
{ Token = tokNumber(N) }
; letter(L), !, identifier(L, Id),
{ member((Id, Token), [ (div, tokDiv),
(mod, tokMod),
(where, tokWhere)]),
!
; Token = tokVar(Id)
}
; [_],
{ Token = tokUnknown }
),
!,
{ Tokens = [Token | TokList] },
lexer(TokList)
;
[],
{ Tokens = [] }
).
%%%
op_token(tokColon) --> ";".
op_token(tokLParen) --> "(".
op_token(tokRParen) --> ")".
op_token(tokLMusta) --> "{".
op_token(tokRMusta) --> "}".
op_token(tokBackSlash) --> "\\".
op_token(tokImpl) --> "->".
op_token(tokPlus) --> "+".
op_token(tokMinus) --> "-".
op_token(tokTimes) --> "*".
op_token(tokEqual) --> "=".
op_token(tokLt) --> "<".
op_token(tokGt) --> ">".
op_token(tokUnderscore) --> "_".
op_token(tokPeriod) --> ".".
op_token(tokSlash) --> "/".
op_token(tokComma) --> ",".
op_token(tokSemicolon) --> ";".
Edit by Guy Coder
I ran a test using the example data posted in the question into a list where each item in the list was a line in the data converted to character codes. Then with time/1 called lexer on each item in the list and repeated the test for the list 10000 times. The reason the data was loaded into a list and converted to characters codes before time/1 was so that those processes did not skew the results. Each of these runs was repeated 5 times to get a consistency of data.
In the following runs below, for all of the different versions the lexer was extended to cover all of the 7-bit ASCII characters which significantly increased the number of cases for special characters.
The version of Prolog used for the following was SWI-Prolog 8.0.
For the version in the question.
Version: 1
:- set_prolog_flag(double_quotes,chars).
% 694,080,002 inferences, 151.141 CPU in 151.394 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 150.813 CPU in 151.059 seconds (100% CPU, 4602271 Lips)
% 694,080,001 inferences, 152.063 CPU in 152.326 seconds (100% CPU, 4564439 Lips)
% 694,080,001 inferences, 151.141 CPU in 151.334 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 151.875 CPU in 152.139 seconds (100% CPU, 4570074 Lips)
For the version as posted above in this answer
Version: 2
:- set_prolog_flag(double_quotes,chars).
% 773,260,002 inferences, 77.469 CPU in 77.543 seconds (100% CPU, 9981573 Lips)
% 773,260,001 inferences, 77.344 CPU in 77.560 seconds (100% CPU, 9997705 Lips)
% 773,260,001 inferences, 77.406 CPU in 77.629 seconds (100% CPU, 9989633 Lips)
% 773,260,001 inferences, 77.891 CPU in 77.967 seconds (100% CPU, 9927511 Lips)
% 773,260,001 inferences, 78.422 CPU in 78.644 seconds (100% CPU, 9860259 Lips)
Version 2 gives a dramatic improvement by using indexing from Version 1.
In doing further research on the code, upon looking at op_token which is DCG and has two hidden variables for implicitly passing around a state representation, using listing/1 showed:
op_token(tokUnderscore,['_'|A], A).
Notice that the first parameter is not the character being searched and that in this answer the indexing code is written as
c_digit(0'0,0).
where the first parameter is the character being searched and the second parameter is the result.
So change this
op_token(Token), !
to this
[S], { special_character_indexed(S,Token) }
with indexed clauses as
special_character_indexed( ';' ,tokSemicolon).
Version: 3
:- set_prolog_flag(double_quotes,chars).
% 765,800,002 inferences, 74.125 CPU in 74.348 seconds (100% CPU, 10331197 Lips)
% 765,800,001 inferences, 74.766 CPU in 74.958 seconds (100% CPU, 10242675 Lips)
% 765,800,001 inferences, 74.734 CPU in 74.943 seconds (100% CPU, 10246958 Lips)
% 765,800,001 inferences, 74.828 CPU in 75.036 seconds (100% CPU, 10234120 Lips)
% 765,800,001 inferences, 74.547 CPU in 74.625 seconds (100% CPU, 10272731 Lips)
Version 3 gives a slightly better but consistently better result than Version 2.
Lastly just changing double_quotes flag to atom as noted in a comment by AntonDanilov
Version: 4
:- set_prolog_flag(double_quotes,atom).
% 765,800,003 inferences, 84.234 CPU in 84.539 seconds (100% CPU, 9091300 Lips)
% 765,800,001 inferences, 74.797 CPU in 74.930 seconds (100% CPU, 10238396 Lips)
% 765,800,001 inferences, 75.125 CPU in 75.303 seconds (100% CPU, 10193677 Lips)
% 765,800,001 inferences, 75.078 CPU in 75.218 seconds (100% CPU, 10200042 Lips)
% 765,800,001 inferences, 75.031 CPU in 75.281 seconds (100% CPU, 10206414 Lips)
Version 4 is almost the same as Version 3.
Just looking at CPU numbers, using indexing is faster, e.g. (Version: 1) 151.875 vs (Version: 3) 74.547
Related
Using regular expressions makes it quite easy to capture sub strings, e.g. the string "Jaco was an American bassist" matches this regular expression (PCRE2 syntax):
(?sm)^([Jj]aco).+(was|is).+?(American|famous).+(dancer|bassist|singer|programmer|dueller)
and captures these strings
Jaco
was
American
bassist.
Here is a DCG that matches the string as well as generating all the possible strings. But it doesn't capture the specific sub strings.
jaco_bassist --> ("J" ; "j"), "aco", space, ("was" ; "is"), space, ("a" ; "an"), space,
("American" ; "famous"), space,
("dancer" ; "bassist" ; "singer" ; "programmer" ; "dueller").
space --> " ".
What would be the best - or at last a good - way of getting the same captures using Prolog's DCGs. Preferably an approach that also generates the possible strings.
For simple problems like this one can use member/2 to enumerate all the alternatives:
jaco_bassist2([Name,WasIs,Adj,Noun]) --> who(Name), space, was_is(WasIs), space,
("a" ; "an"), space, adj(Adj), space,
noun(Noun).
who(Who) --> [Who], {member(Who,["Jaco","jaco"])}.
was_is(WasIs) --> [WasIs], {member(WasIs,["was","is"])}.
adj(Adj) --> [Adj], {member(Adj,["American","famous"])}.
noun(Noun) --> [Noun], {member(Noun,["dancer","bassist","singer","programmer","dueller"])}.
To get the captures:
% ...
phrase(jaco_bassist2,[Who,WasIs,Adj,Noun], String)
A major drawback of this approach is that for more complex structures the enumeration can be a little tricky, for example if the name in the subject string instead of "[Jj]aco" would be one of the 48 spellings of my last name (kjellerstrand):
kjellerstrand --> "k", ("je" ; "ä"), "ll", ("" ; "er" ; "ar"),
("st" ; "b"), ("" ; "r"), "a", (""; "n"), "d".
Please note that I'm looking for "basic" DCG, for example those supported by e.g. B-Prolog (i.e. not requiring SWI-Prolog's fancy DCG stuff).
Let me re-phrase that: Given a goal phrase(NT__0, Cs0,Cs), capture the sequence described by NT__0.
First of all we need to restrict ourselves to DCGs without semicontext. For a (non-empty) semicontext may be represented with two variables (which in that context do not form a difference) but cannot be captured with a single list.
append(Capture, Cs, Cs0) should be it. At least declaratively when considering only ground terms.
as --> "" | "a", as.
?- Cs = [], phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
Cs = [], Cs0 = [], Capture = []
; Cs = [], Cs0 = "a", Capture = "a"
; Cs = [], Cs0 = "aa", Capture = "aa"
; Cs = [], Cs0 = "aaa", Capture = "aaa"
; Cs = [], Cs0 = "aaaa", Capture = "aaaa"
; ... .
?- phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
Cs0 = Cs, Capture = []
; Cs0 = [_A|Cs0], Cs = [_A|Cs0], Capture = [_A], unexpected
; Cs0 = [_A,_B|Cs0], Cs = [_A,_B|Cs0], Capture = [_A,_B], unexpected
; ... .
?- set_prolog_flag(occurs_check,true).
true.
?- phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
Cs0 = Cs, Capture = []
; loops, unexpected.
So far, the procedural reality of Prolog is a bit different. append/3 only works for lists but not for partial lists. There infinite, rational trees show up. And the occurs-check does not help that much, it just prevents the display of such answers, but keeps non-termination.
Time for a new version of append/3, append2u/3
?- set_prolog_flag(occurs_check,false).
true.
?- phrase(as, Cs0,Cs), append2u(Capture, Cs, Cs0).
Cs0 = Cs, Capture = []
; Cs0 = [a|Cs0], Cs = [a|Cs0], Capture = [], unexpected
; Cs0 = [a|Cs], Capture = "a", dif:dif(Cs,[a|Cs])
; Cs0 = [a,a|Cs0], Cs = [a,a|Cs0], Capture = [], unexpected
; Cs0 = [a,a|Cs], Capture = "aa", dif:dif(Cs,[a,a|Cs])
; Cs0 = [a,a,a|Cs0], Cs = [a,a,a|Cs0], Capture = [], unexpected
; ... .
?- set_prolog_flag(occurs_check,true).
true.
?- phrase(as, Cs0,Cs), append2u(Capture, Cs, Cs0).
Cs0 = Cs, Capture = []
; Cs0 = [a|Cs], Capture = "a"
; Cs0 = [a,a|Cs], Capture = "aa"
; Cs0 = [a,a,a|Cs], Capture = "aaa"
; ... .
So with the help of the occurs-check it is possible to get this right, also for the more general case. A new non-terminal phrase_capture//2 now uses the following internal definition:
phrase_capture(NT__0, Capture, S0,S) :-
phrase(NT__0, S0,S1),
append2u(Capture, S1, S0),
S1 = S.
For systems without a built-in occurs-check like B, rewrite append2u/3 using unify_with_occurs_check/2 explicitly. That is, also for (\=)/2.
Some further optimizations may be done to avoid costs that depend on the size of Cs0+Cs instead of the length of Capture. Like special casing for var(Cs), Cs == [], and partial strings. If Cs is a list constructor, an internal implementation may also just skip through Cs0 to find that very address of Cs first, and only resort to more costly means otherwise. Care must be given to ensure that this is always terminating, thus using mechanisms similar to '$skip_max_list'/4.
Also, what to do if Cs0 and Cs do not fit, that is, if they are not the result of a valid grammar. Such a case may happen with generalizations to explain unexpected failure.
Usage:
jaco_bassist([Name,WasIs,Adj,Noun]) -->
phrase_capture( (("J" ; "j"), "aco"), Name),
space,
phrase_capture( ("was" ; "is"), WasIs),
space,
("a" ; "an"),
space,
phrase_capture( ("American" ; "famous"), Adj),
space,
phrase_capture( ("dancer" ; "bassist" ; "singer" ; "programmer" ; "dueller"), Noun).
?- phrase(jaco_bassist(D), Ys).
D = ["Jaco","was","American","dancer"], Ys = "Jaco was a American ..."
; D = ["Jaco","was","American","bassist"], Ys = "Jaco was a American ..."
; D = ["Jaco","was","American","singer"], Ys = "Jaco was a American ..."
; ...
; D = ["jaco","is","famous","dueller"], Ys = "jaco is an famous d ...".
So this version terminates also when generating strings. And it has the potential to incur costs that are in many cases only depending on the length of the captured string. The original version using append/3 will always visit the entire string.
Lest I forget, there will always be some oddities should you be into infinite lists. Think of:
?- phrase("abc",L0,L0).
L0 = [a,b,c|L0].
?- phrase("abc",L0,L0), phrase(phrase_capture("abc",Capture),L0,L).
L0 = [a,b,c|L0], Capture = [], L = [a,b,c|L0], unexpected.
L0 = [a,b,c|L0], Capture = "abc", L = [a,b,c|L0]. % expected
These are all typical paradoxa that infinite lists ensue. First luring people into them only to smother them.
The following version of phrase_capture//2 does not rely on internal details. It uses the ^s of library(lambda) which are responsible for parameter passing only. (The other lambda-related construct \ is for renaming.)
phrase_capture(NT__0, Capture) -->
call(S0^S0^true),
NT__0,
call(S1^S1^true),
{append2u(Capture, S1, S0)}.
Isn't this as simple as:
% Show lists of codes as text (if 3 chars or longer)
:- portray_text(true).
sentence([P, T]) --> person(P), space, tense(T).
person(N, DL, T) :-
member(N, [`Jaco`, `jaco`]),
list_to_dl(N, DL, T).
tense(N, DL, T) :-
member(N, [`was`, `is`]),
list_to_dl(N, DL, T).
space --> " ".
list_to_dl([], T, T).
list_to_dl([H|T], [H|T2], Tail) :-
list_to_dl(T, T2, Tail).
Results in swi-prolog (so you'll have to tweak the quoting to suit your Prolog system):
?- time(phrase(sentence(S), `Jaco is`)).
% 25 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 797728 Lips)
S = [`Jaco`,[105,115]] ;
% 7 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 311360 Lips)
false.
... and it can generate:
?- time(phrase(sentence(S), L)).
% 24 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 767043 Lips)
S = [`Jaco`,`was`],
L = `Jaco was` ;
% 7 inferences, 0.000 CPU in 0.000 seconds (75% CPU, 392971 Lips)
S = [`Jaco`,[105,115]],
L = `Jaco is` ;
% 17 inferences, 0.000 CPU in 0.000 seconds (82% CPU, 667504 Lips)
S = [`jaco`,`was`],
L = `jaco was` ;
% 8 inferences, 0.000 CPU in 0.000 seconds (62% CPU, 460750 Lips)
S = [`jaco`,[105,115]],
L = `jaco is`.
To handle the surname - can use term expansion to automate the string duplication:
sentence([P, T, SN]) -->
dcg(person, P), space, dcg(tense, T), space, surname(SN).
space --> " ".
surname(SN) -->
dcg(sn1, SN1), dcg(sn2, SN2), dcg(sn3, SN3),
dcg(sn4, SN4), dcg(sn5, SN5), dcg(sn6, SN6),
dcg(sn7, SN7), dcg(sn8, SN8), dcg(sn9, SN9),
{ append([SN1, SN2, SN3, SN4, SN5, SN6, SN7, SN8, SN9], SN) }.
term_expansion(expand(Name, Codes), [dcg(Name, Codes) --> Codes]).
expand(person, `Jaco`).
expand(person, `jaco`).
expand(tense, `was`).
expand(tense, `is`).
expand(sn1, `k`).
expand(sn2, `je`).
expand(sn2, `ä`).
expand(sn3, `ll`).
expand(sn4, ``).
expand(sn4, `er`).
expand(sn4, `ar`).
expand(sn5, `st`).
expand(sn5, `b`).
expand(sn6, ``).
expand(sn6, `r`).
expand(sn7, `a`).
expand(sn8, ``).
expand(sn8, `n`).
expand(sn9, `d`).
... which can both parse and generate - results in swi-prolog:
?- time(phrase(sentence(S), `jaco is kjellerstrand`)).
% 61 inferences, 0.000 CPU in 0.000 seconds (89% CPU, 1618037 Lips)
S = [`jaco`,[105,115],`kjellerstrand`] ;
% 5 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 295299 Lips)
false.
?- time(phrase(sentence(S), L)).
% 54 inferences, 0.000 CPU in 0.000 seconds (90% CPU, 1390570 Lips)
S = [`Jaco`,`was`,`kjellstad`],
L = `Jaco was kjellstad` ;
% 37 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 1141236 Lips)
S = [`Jaco`,`was`,`kjellstand`],
L = `Jaco was kjellstand` ;
% 39 inferences, 0.000 CPU in 0.000 seconds (87% CPU, 1291519 Lips)
S = [`Jaco`,`was`,`kjellstrad`],
L = `Jaco was kjellstrad` ;
% 38 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 1573173 Lips)
S = [`Jaco`,`was`,`kjellstrand`],
L = `Jaco was kjellstrand` ;
% 38 inferences, 0.000 CPU in 0.000 seconds (86% CPU, 1382774 Lips)
S = [`Jaco`,`was`,`kjellbad`],
L = `Jaco was kjellbad`
etc.
I would like to replace a sequence of tokens in a DCG grammar in Prolog. In other words replace the sequence or sublist A: [a,a,a,a] by the sublist B: [b].
chain --> chain_where_sublist_A_is_replaced_by_sublist_B but entirely using the DCG formalism.
For example: [c,a,a,a,a,d] gives [c,b,d]
First solution:
eos_([], []).
transform --> call(eos_).
transform, [b] --> [a,a,a], transform.
transform, [c] --> [c], transform.
transform, [d] --> [d], transform.
Then the query:
?- phrase(transform, "caaad", Cs).
Cs = "cbd"
Second solution:
step(b) --> [a,a,a].
step(C) --> [C].
transform([]) --> [].
transform([C|Cs]) --> step(C), transform(Cs).
Then the query:
?- phrase(transform(Cs), "caaad").
Cs = "cbd"
As at Prolog - substitute substring of string by letter not used in string itself
replace(Find, Replace), Replace --> Find, !, replace(Find, Replace).
% Otherwise accept char-by-char
replace(Find, Replace), [C] --> [C], !, replace(Find, Replace).
% Accept success when reached end
replace(_Find, _Replace) --> [].
substitute(Find, Replace, Request, Result):-
phrase(replace(Find, Replace), Request, Result).
In swi-prolog:
?- substitute([a,a,a,a], [b], [c,a,a,a,a,d], S).
S = [c,b,d].
Using difference lists rather than DCG, to be faster (due in part to tail-end recursion) and also more widely compatible with Prolog systems:
find_replace_list(Find, Replace, Lst, Result) :-
find_replace_list_(Lst, Find, Replace, Result).
find_replace_list_([], _Find, _Replace, []).
find_replace_list_([H|T], Find, Replace, Result) :-
list_begins_dl(Find, [H|T], Tail),
!,
append(Replace, Result0, Result),
find_replace_list_(Tail, Find, Replace, Result0).
find_replace_list_([H|T], Find, Replace, [H|Result]) :-
find_replace_list_(T, Find, Replace, Result).
list_begins_dl([], T, T).
list_begins_dl([H|TShort], [H|TLong], Tail) :-
list_begins_dl(TShort, TLong, Tail).
Performance comparison in swi-prolog:
?- numlist(1, 1_000_000, L), time(find_replace_list([a,a,a,a], [b], L, S)).
% 2,000,002 inferences, 0.231 CPU in 0.231 seconds (100% CPU, 8649354 Lips)
?- numlist(1, 1_000_000, L), time(substitute([a,a,a,a], [b], L, S)).
% 16,000,024 inferences, 2.081 CPU in 2.081 seconds (100% CPU, 7689631 Lips)
?- length(L, 1_000_000), maplist(=(1), L), time(find_replace_list([1], [2], L, S)).
% 5,000,002 inferences, 0.421 CPU in 0.421 seconds (100% CPU, 11865099 Lips)
?- length(L, 1_000_000), maplist(=(1), L), time(substitute([1], [2], L, S)).
% 25,000,021 inferences, 2.962 CPU in 2.962 seconds (100% CPU, 8439159 Lips)
Result:
?- find_replace_list([a,a,a,a], [b], [c,a,a,a,a,d], S).
S = [c,b,d].
I have a rule that matches bc. When I encounter that in a string, I don't want to parse that string, otherwise parse anything else.
% Prolog
bc(B, C) --> [B, C], {
B = "b",
C = "c"
}.
not_bc(O) --> [O], % ?! bc(O, C).
% ?- phrase(not_bc(O), "bcdefg").
% false.
% ?- phrase(not_bc(O), "abcdefg").
% O = "a".
% ?- phrase(not_bc(O), "wxcybgz")
% O = "w".
% ?- phrase(not_bc(O), "wxybgz")
% O = "w".
Simplified version of my problem, hopefully solutions are isomorphic.
Similar to this question:
Translation to DCG Semicontext not working - follow on
An alternative:
process_bc(_) --> "bc", !, { fail }.
process_bc(C) --> [C].
This differs from my other solution in accepting:
?- time(phrase(process_bc(C), `b`, _)).
% 8 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 387053 Lips)
C = 98.
In swi-prolog:
process_text(C1) --> [C1, C2], { dif([C1, C2], `bc`) }.
Results:
?- time(phrase(process_text(C), `bca`, _)).
% 11 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 376790 Lips)
false.
?- time(phrase(process_text(C), `bd`, _)).
% 10 inferences, 0.000 CPU in 0.000 seconds (80% CPU, 353819 Lips)
C = 98.
?- time(phrase(process_text(C), `zbcagri4gj40w9tu4tu34ty3ty3478t348t`, _)).
% 10 inferences, 0.000 CPU in 0.000 seconds (80% CPU, 372717 Lips)
C = 122.
A single character, or no characters, are both presumably meant to be failures.
This is nicely efficient, only having to check the first 2 characters.
I am new to this language and am having trouble coming up with a solution to this problem. The program must implement the following cases.
Both variables are instantiated:
pivot( [1,2,3,4,5,6,7], [5,6,7,4,1,2,3] ).`
yields a true/yes result.
Only Before is instantiated:
pivot( [1,2,3,4,5,6], R ).
unifies R = [4,5,6,1,2,3] as its one result.
Only After is instantiated:
pivot(L, [1,2]).
unifies L = [2,1] as its one result.
Neither variable is instantiated:
pivot(L, R).
is undefined (since results are generated arbitrarily).
If by pivot, you mean to split the list in 2 and swap the halves, then something like this would work.
First, consider the normal case: If you have an instantiated list, pivoting it is trivial. You just need to
figure out half the length of the list
break it up into
a prefix, consisting of that many items, and
a suffix, consisting of whatever is left over
concatenate those two lists in reverse order
Once you have that, everything else is just a matter of deciding which variable is bound and using that as the source list.
It is a common Prolog idiom to have a single "public" predicate that invokes a "private" worker predicate that does the actual work.
Given that the problem statement requires that at least one of the two variable in your pivot/2 must be instantiated, we can define our public predicate along these lines:
pivot( Ls , Rs ) :- nonvar(Ls), !, pivot0(Ls,Rs) .
pivot( Ls , Rs ) :- nonvar(Rs), !, pivot0(Rs,Ls) .
If Ls is bound, we invoke the worker, pivot0/2 with the arguments as-is. But if Ls is unbound, and Rs is bound, we invoke it with the arguments reversed. The cuts (!) are there to prevent the predicate from succeeding twice if invoked with both arguments bound (pivot([a,b,c],[a,b,c]).).
Our private helper, pivot0/2 is simple, because it knows that the 1st argument will always be bound:
pivot0( Ls , Rs ) :- % to divide a list in half and exchange the halves...
length(Ls,N0) , % get the length of the source list
N is N0 // 2 , % divide it by 2 using integer division
length(Pfx,N) , % construct a unbound list of the desired length
append(Pfx,Sfx,Ls) , % break the source list up into its two halves
append(Sfx,Pfx,Rs) % put the two halves back together in the desired order
. % Easy!
In swi-prolog:
:- use_module(library(dcg/basics)).
pivot_using_dcg3(Lst, LstPivot) :-
list_first(Lst, LstPivot, L1, L2, IsList),
phrase(piv3_up(L1), L1, L2),
% Improve determinism
(IsList = true -> ! ; true).
piv3_up(L), string(Ri), string(M), string(Le) --> piv3(L, Le, M, Ri).
piv3([], [], [], Ri) --> [], remainder(Ri).
piv3([_], [], [H], Ri) --> [H], remainder(Ri).
piv3([_, _|Lst], [H|T], M, Ri) --> [H], piv3(Lst, T, M, Ri).
% From 2 potential lists, rearrange them in order of usefulness
list_first(V1, V2, L1, L2, IsList) :-
( is_list(V1) ->
L1 = V1, L2 = V2,
IsList = true
; L1 = V2, L2 = V1,
(is_list(L1) -> IsList = true ; IsList = false)
).
Is general and deterministic, with good performance:
?- time(pivot_using_dcg3(L, P)).
% 18 inferences, 0.000 CPU in 0.000 seconds (88% CPU, 402441 Lips)
L = P, P = [] ;
% 8 inferences, 0.000 CPU in 0.000 seconds (86% CPU, 238251 Lips)
L = P, P = [_] ;
% 10 inferences, 0.000 CPU in 0.000 seconds (87% CPU, 275073 Lips)
L = [_A,_B],
P = [_B,_A] ;
% 10 inferences, 0.000 CPU in 0.000 seconds (94% CPU, 313391 Lips)
L = [_A,_B,_C],
P = [_C,_B,_A] ;
% 12 inferences, 0.000 CPU in 0.000 seconds (87% CPU, 321940 Lips)
L = [_A,_B,_C,_D],
P = [_C,_D,_A,_B] ;
% 12 inferences, 0.000 CPU in 0.000 seconds (86% CPU, 345752 Lips)
L = [_A,_B,_C,_D,_E],
P = [_D,_E,_C,_A,_B] ;
% 14 inferences, 0.000 CPU in 0.000 seconds (88% CPU, 371589 Lips)
L = [_A,_B,_C,_D,_E,_F],
P = [_D,_E,_F,_A,_B,_C] ;
?- numlist(1, 5000000, P), time(pivot_using_dcg3(L, P)).
% 7,500,018 inferences, 1.109 CPU in 1.098 seconds (101% CPU, 6759831 Lips)
The performance could be improved further, using difference lists for the final left-middle-right append, and cuts (sacrificing generality).
I was given a task to write a Prolog program that would put given domino pieces in a loop. I figured I would first put them in order and then check for answers where the last number and the first number would match.
However, the predicate responsible for the latter task (domino_order2) is giving me some trouble. Any ideas on how I could implement this last check?
Thank you in advance!
domino_order2(L1, L2):-
end_match(L2,L2),
domino_order(L1, L2).
end_match([X-Y | _],L2):-
%last(L2,X).
append(_,[Y-X],L2).
domino_order(L1, L2) :-
domino_order(L1, _, L2).
domino_order([], _, []) :- !.
domino_order(In, X, [X-Y | Out]) :-
select(Piece, In, Remaining),
swap_or_not(Piece, X-Y),
domino_order(Remaining, Y, Out).
swap_or_not(X-Y, X-Y).
swap_or_not(X-Y, Y-X).
Desired outcome:
?- domino_order2([4-3,3-5,5-8,8-4],Out).
Out = [4-3, 3-5, 5-8, 8-4] ;
Out = [3-4, 4-8, 8-5, 5-3] ;
Out = [3-5, 5-8, 8-4, 4-3] ;
Out = [5-3, 3-4, 4-8, 8-5] ;
Out = [5-8, 8-4, 4-3, 3-5] ;
Out = [8-5, 5-3, 3-4, 4-8] ;
Out = [8-4, 4-3, 3-5, 5-8] ;
Out = [4-8, 8-5, 5-3, 3-4].
?- domino_order2([4-3,3-5,5-8,8-5],Out).
false.
Here is a rewrite which matches the worded spec and desired output (but not the precise order of the output):
domino_order(Lst, Ord) :-
(Lst1 = Lst ; reverse_dominos(Lst, Lst1)),
domino_order_left_to_right(Lst1, Ord).
reverse_dominos(Lst, RevLst) :-
reverse_dominos_(Lst, [], RevLst).
reverse_dominos_([], RevLst, RevLst).
reverse_dominos_([X-Y|Tail], SoFar, RevLst) :-
% Reverse the numbers and their position
reverse_dominos_(Tail, [Y-X|SoFar], RevLst).
domino_order_left_to_right(Lst, Lst1) :-
% Get a starting point in Lst
append([Before, [Start], After], Lst),
% Rearrange the list
append([[Start], After, Before], Lst1).
Result in swi-prolog:
?- time(findall(Ord, domino_order([4-3,3-5,5-8,8-4], Ord), Ords)).
% 244 inferences, 0.000 CPU in 0.000 seconds (97% CPU, 1475890 Lips)
Ords = [[4-3,3-5,5-8,8-4],[3-5,5-8,8-4,4-3],[5-8,8-4,4-3,3-5],[8-4,4-3,3-5,5-8],[4-8,8-5,5-3,3-4],[8-5,5-3,3-4,4-8],[5-3,3-4,4-8,8-5],[3-4,4-8,8-5,5-3]].