Using regular expressions makes it quite easy to capture substrings, e.g. the string "Jaco was an American bassist" matches this regular expression (PCRE2 syntax):
(?sm)^([Jj]aco).+(was|is).+?(American|famous).+(dancer|bassist|singer|programmer|dueller)
and captures these strings
Jaco
was
American
bassist.
Here is a DCG that matches the string and also generates all the possible strings. But it doesn't capture the specific substrings.
jaco_bassist --> ("J" ; "j"), "aco", space, ("was" ; "is"), space, ("a" ; "an"), space,
("American" ; "famous"), space,
("dancer" ; "bassist" ; "singer" ; "programmer" ; "dueller").
space --> " ".
What would be the best - or at least a good - way of getting the same captures using Prolog's DCGs? Preferably an approach that also generates the possible strings.
For simple problems like this one can use member/2 to enumerate all the alternatives:
jaco_bassist2([Name,WasIs,Adj,Noun]) --> who(Name), space, was_is(WasIs), space,
("a" ; "an"), space, adj(Adj), space,
noun(Noun).
who(Who) --> [Who], {member(Who,["Jaco","jaco"])}.
was_is(WasIs) --> [WasIs], {member(WasIs,["was","is"])}.
adj(Adj) --> [Adj], {member(Adj,["American","famous"])}.
noun(Noun) --> [Noun], {member(Noun,["dancer","bassist","singer","programmer","dueller"])}.
To get the captures:
% ...
phrase(jaco_bassist2([Who,WasIs,Adj,Noun]), String)
A major drawback of this approach is that for more complex structures the enumeration can be a little tricky, for example if the name in the subject string, instead of "[Jj]aco", were one of the 48 spellings of my last name (kjellerstrand):
kjellerstrand --> "k", ("je" ; "ä"), "ll", ("" ; "er" ; "ar"),
("st" ; "b"), ("" ; "r"), "a", (""; "n"), "d".
Please note that I'm looking for "basic" DCG, for example those supported by e.g. B-Prolog (i.e. not requiring SWI-Prolog's fancy DCG stuff).
Let me re-phrase that: Given a goal phrase(NT__0, Cs0,Cs), capture the sequence described by NT__0.
First of all we need to restrict ourselves to DCGs without semicontext. For a (non-empty) semicontext may be represented with two variables (which in that context do not form a difference) but cannot be captured with a single list.
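To make the restriction concrete, here is a minimal sketch of a rule with a semicontext (pushback). The name peek//1 is a hypothetical example and not part of the original answer; the syntax `Head, Pushback --> Body` is assumed to be supported by your system, as it is in SWI-Prolog.

```prolog
% peek//1 is a hypothetical illustration, not from the original answer.
% The list [C] after the comma in the head is the semicontext:
% the rule reads one element and then puts it back in front of the rest.
peek(C), [C] --> [C].

% ?- phrase(peek(C), [a,b], Rest).
% C = a, Rest = [a,b].
```

Here the rule describes the element a, yet the list before and the list after are the same, so there is no single prefix list that could serve as the capture.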
append(Capture, Cs, Cs0) should be it. At least declaratively when considering only ground terms.
as --> "" | "a", as.
?- Cs = [], phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
Cs = [], Cs0 = [], Capture = []
; Cs = [], Cs0 = "a", Capture = "a"
; Cs = [], Cs0 = "aa", Capture = "aa"
; Cs = [], Cs0 = "aaa", Capture = "aaa"
; Cs = [], Cs0 = "aaaa", Capture = "aaaa"
; ... .
?- phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
Cs0 = Cs, Capture = []
; Cs0 = [_A|Cs0], Cs = [_A|Cs0], Capture = [_A], unexpected
; Cs0 = [_A,_B|Cs0], Cs = [_A,_B|Cs0], Capture = [_A,_B], unexpected
; ... .
?- set_prolog_flag(occurs_check,true).
true.
?- phrase(as, Cs0,Cs), append(Capture, Cs, Cs0).
Cs0 = Cs, Capture = []
; loops, unexpected.
So far, the procedural reality of Prolog is a bit different. append/3 only works for lists but not for partial lists. With partial lists, infinite rational trees show up. And the occurs-check does not help that much: it just prevents the display of such answers, but non-termination remains.
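For the well-behaved case only, where the input is a ground list, the plain append/3 idea can be sketched as follows. The names capture_ground//2, name_nt//0, verb_nt//0 and sentence//2 are hypothetical and only serve as illustration; this sketch deliberately does not address the partial-list problems shown above.

```prolog
% Hypothetical helper: capture via plain append/3.
% Assumption: S0 is a ground, proper list (parsing, not generating),
% so that append/3 terminates.
capture_ground(NT, Capture, S0, S) :-
    phrase(NT, S0, S),
    append(Capture, S, S0).

name_nt --> ("J" ; "j"), "aco".
verb_nt --> "was" ; "is".

sentence(Name, Verb) -->
    capture_ground(name_nt, Name), " ", capture_ground(verb_nt, Verb).

% ?- atom_codes('jaco is', Cs), phrase(sentence(N, V), Cs),
%    atom_codes(NA, N), atom_codes(VA, V).
% NA = jaco, VA = is (possibly leaving a choice point).
```

Note that append/3 walks the whole remaining input here, which is exactly the cost the text below tries to avoid.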
Time for a new version of append/3, append2u/3
?- set_prolog_flag(occurs_check,false).
true.
?- phrase(as, Cs0,Cs), append2u(Capture, Cs, Cs0).
Cs0 = Cs, Capture = []
; Cs0 = [a|Cs0], Cs = [a|Cs0], Capture = [], unexpected
; Cs0 = [a|Cs], Capture = "a", dif:dif(Cs,[a|Cs])
; Cs0 = [a,a|Cs0], Cs = [a,a|Cs0], Capture = [], unexpected
; Cs0 = [a,a|Cs], Capture = "aa", dif:dif(Cs,[a,a|Cs])
; Cs0 = [a,a,a|Cs0], Cs = [a,a,a|Cs0], Capture = [], unexpected
; ... .
?- set_prolog_flag(occurs_check,true).
true.
?- phrase(as, Cs0,Cs), append2u(Capture, Cs, Cs0).
Cs0 = Cs, Capture = []
; Cs0 = [a|Cs], Capture = "a"
; Cs0 = [a,a|Cs], Capture = "aa"
; Cs0 = [a,a,a|Cs], Capture = "aaa"
; ... .
So with the help of the occurs-check it is possible to get this right, also for the more general case. A new non-terminal phrase_capture//2 now uses the following internal definition:
phrase_capture(NT__0, Capture, S0,S) :-
phrase(NT__0, S0,S1),
append2u(Capture, S1, S0),
S1 = S.
For systems without a built-in occurs-check like B, rewrite append2u/3 using unify_with_occurs_check/2 explicitly. That is, also for (\=)/2.
Some further optimizations may be done to avoid costs that depend on the size of Cs0+Cs instead of the length of Capture. Like special casing for var(Cs), Cs == [], and partial strings. If Cs is a list constructor, an internal implementation may also just skip through Cs0 to find that very address of Cs first, and only resort to more costly means otherwise. Care must be given to ensure that this is always terminating, thus using mechanisms similar to '$skip_max_list'/4.
Also, what should happen if Cs0 and Cs do not fit, that is, if they are not the result of a valid grammar? Such a case may happen with generalizations to explain unexpected failure.
Usage:
jaco_bassist([Name,WasIs,Adj,Noun]) -->
phrase_capture( (("J" ; "j"), "aco"), Name),
space,
phrase_capture( ("was" ; "is"), WasIs),
space,
("a" ; "an"),
space,
phrase_capture( ("American" ; "famous"), Adj),
space,
phrase_capture( ("dancer" ; "bassist" ; "singer" ; "programmer" ; "dueller"), Noun).
?- phrase(jaco_bassist(D), Ys).
D = ["Jaco","was","American","dancer"], Ys = "Jaco was a American ..."
; D = ["Jaco","was","American","bassist"], Ys = "Jaco was a American ..."
; D = ["Jaco","was","American","singer"], Ys = "Jaco was a American ..."
; ...
; D = ["jaco","is","famous","dueller"], Ys = "jaco is an famous d ...".
So this version terminates also when generating strings. And it has the potential to incur costs that are in many cases only depending on the length of the captured string. The original version using append/3 will always visit the entire string.
Lest I forget, there will always be some oddities should you be into infinite lists. Think of:
?- phrase("abc",L0,L0).
L0 = [a,b,c|L0].
?- phrase("abc",L0,L0), phrase(phrase_capture("abc",Capture),L0,L).
L0 = [a,b,c|L0], Capture = [], L = [a,b,c|L0], unexpected.
L0 = [a,b,c|L0], Capture = "abc", L = [a,b,c|L0]. % expected
These are all typical paradoxes that infinite lists entail. First they lure people in, only to smother them.
The following version of phrase_capture//2 does not rely on internal details. It uses the ^s of library(lambda) which are responsible for parameter passing only. (The other lambda-related construct \ is for renaming.)
phrase_capture(NT__0, Capture) -->
call(S0^S0^true),
NT__0,
call(S1^S1^true),
{append2u(Capture, S1, S0)}.
Isn't this as simple as:
% Show lists of codes as text (if 3 chars or longer)
:- portray_text(true).
sentence([P, T]) --> person(P), space, tense(T).
person(N, DL, T) :-
member(N, [`Jaco`, `jaco`]),
list_to_dl(N, DL, T).
tense(N, DL, T) :-
member(N, [`was`, `is`]),
list_to_dl(N, DL, T).
space --> " ".
list_to_dl([], T, T).
list_to_dl([H|T], [H|T2], Tail) :-
list_to_dl(T, T2, Tail).
Results in swi-prolog (so you'll have to tweak the quoting to suit your Prolog system):
?- time(phrase(sentence(S), `Jaco is`)).
% 25 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 797728 Lips)
S = [`Jaco`,[105,115]] ;
% 7 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 311360 Lips)
false.
... and it can generate:
?- time(phrase(sentence(S), L)).
% 24 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 767043 Lips)
S = [`Jaco`,`was`],
L = `Jaco was` ;
% 7 inferences, 0.000 CPU in 0.000 seconds (75% CPU, 392971 Lips)
S = [`Jaco`,[105,115]],
L = `Jaco is` ;
% 17 inferences, 0.000 CPU in 0.000 seconds (82% CPU, 667504 Lips)
S = [`jaco`,`was`],
L = `jaco was` ;
% 8 inferences, 0.000 CPU in 0.000 seconds (62% CPU, 460750 Lips)
S = [`jaco`,[105,115]],
L = `jaco is`.
To handle the surname, term expansion can be used to automate the string duplication:
sentence([P, T, SN]) -->
dcg(person, P), space, dcg(tense, T), space, surname(SN).
space --> " ".
surname(SN) -->
dcg(sn1, SN1), dcg(sn2, SN2), dcg(sn3, SN3),
dcg(sn4, SN4), dcg(sn5, SN5), dcg(sn6, SN6),
dcg(sn7, SN7), dcg(sn8, SN8), dcg(sn9, SN9),
{ append([SN1, SN2, SN3, SN4, SN5, SN6, SN7, SN8, SN9], SN) }.
term_expansion(expand(Name, Codes), [dcg(Name, Codes) --> Codes]).
expand(person, `Jaco`).
expand(person, `jaco`).
expand(tense, `was`).
expand(tense, `is`).
expand(sn1, `k`).
expand(sn2, `je`).
expand(sn2, `ä`).
expand(sn3, `ll`).
expand(sn4, ``).
expand(sn4, `er`).
expand(sn4, `ar`).
expand(sn5, `st`).
expand(sn5, `b`).
expand(sn6, ``).
expand(sn6, `r`).
expand(sn7, `a`).
expand(sn8, ``).
expand(sn8, `n`).
expand(sn9, `d`).
... which can both parse and generate - results in swi-prolog:
?- time(phrase(sentence(S), `jaco is kjellerstrand`)).
% 61 inferences, 0.000 CPU in 0.000 seconds (89% CPU, 1618037 Lips)
S = [`jaco`,[105,115],`kjellerstrand`] ;
% 5 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 295299 Lips)
false.
?- time(phrase(sentence(S), L)).
% 54 inferences, 0.000 CPU in 0.000 seconds (90% CPU, 1390570 Lips)
S = [`Jaco`,`was`,`kjellstad`],
L = `Jaco was kjellstad` ;
% 37 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 1141236 Lips)
S = [`Jaco`,`was`,`kjellstand`],
L = `Jaco was kjellstand` ;
% 39 inferences, 0.000 CPU in 0.000 seconds (87% CPU, 1291519 Lips)
S = [`Jaco`,`was`,`kjellstrad`],
L = `Jaco was kjellstrad` ;
% 38 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 1573173 Lips)
S = [`Jaco`,`was`,`kjellstrand`],
L = `Jaco was kjellstrand` ;
% 38 inferences, 0.000 CPU in 0.000 seconds (86% CPU, 1382774 Lips)
S = [`Jaco`,`was`,`kjellbad`],
L = `Jaco was kjellbad`
etc.
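To see what the term_expansion/2 rule above does, it may help to write out by hand what two of the expand/2 facts become. This is a sketch under the assumption of SWI-Prolog 7 or later, where backquoted text denotes a list of character codes; only the tense entries are shown.

```prolog
% Hand-written equivalent of
%   expand(tense, `was`).
%   expand(tense, `is`).
% after term expansion: the same code list appears once as the
% captured argument and once as the terminal list to match.
dcg(tense, `was`) --> `was`.
dcg(tense, `is`)  --> `is`.

% ?- phrase(dcg(tense, T), `is`).
% T = `is` (shown as [105,115] unless portray_text is active).
```

The duplication of the code list is what gives the capture for free, and term expansion merely saves typing it twice.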
I have a rule that matches bc. When I encounter that in a string, I don't want to parse that string, otherwise parse anything else.
% Prolog
bc(B, C) --> [B, C], {
B = "b",
C = "c"
}.
not_bc(O) --> [O], % ?! bc(O, C).
% ?- phrase(not_bc(O), "bcdefg").
% false.
% ?- phrase(not_bc(O), "abcdefg").
% O = "a".
% ?- phrase(not_bc(O), "wxcybgz")
% O = "w".
% ?- phrase(not_bc(O), "wxybgz")
% O = "w".
Simplified version of my problem, hopefully solutions are isomorphic.
Similar to this question:
Translation to DCG Semicontext not working - follow on
An alternative:
process_bc(_) --> "bc", !, { fail }.
process_bc(C) --> [C].
This differs from my other solution in accepting:
?- time(phrase(process_bc(C), `b`, _)).
% 8 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 387053 Lips)
C = 98.
In swi-prolog:
process_text(C1) --> [C1, C2], { dif([C1, C2], `bc`) }.
Results:
?- time(phrase(process_text(C), `bca`, _)).
% 11 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 376790 Lips)
false.
?- time(phrase(process_text(C), `bd`, _)).
% 10 inferences, 0.000 CPU in 0.000 seconds (80% CPU, 353819 Lips)
C = 98.
?- time(phrase(process_text(C), `zbcagri4gj40w9tu4tu34ty3ty3478t348t`, _)).
% 10 inferences, 0.000 CPU in 0.000 seconds (80% CPU, 372717 Lips)
C = 122.
A single character, or no characters, are both presumably meant to be failures.
This is nicely efficient, only having to check the first 2 characters.
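Another way to express "not bc" is a negative lookahead with \+ inside the DCG body. The following is a minimal sketch; it assumes your system supports \+ in DCG bodies (SWI-Prolog does), and the names bc//0 and not_bc//1 are reused here only for illustration.

```prolog
bc --> "bc".

% \+ bc consumes nothing: it only checks that the input does not
% start with "bc". Then exactly one element C is read.
not_bc(C) --> \+ bc, [C].

% ?- atom_codes(bcdefg, Cs), phrase(not_bc(C), Cs, _).
% false.
% ?- atom_codes(abcdefg, Cs), phrase(not_bc(C), Cs, _).
% C = 97.
```

Like process_bc//1, this accepts a lone `b`, because the lookahead only rejects input that actually starts with both characters.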
We want to count the correspondences between two (possibly long) strings which happen to represent DNA sequences. The sequences are lists-of-chars where the char is taken from a,c,t,g,'_', with the '_' a "don't know" placeholder which never corresponds to anything, even itself. In this case, we employ library(aggregate) (thanks to CapelliC for the idea):
match(Seq1,Seq2,Count) :-
aggregate_all(count,
(
nth1(Pos,Seq1,X),
nth1(Pos,Seq2,X),
memberchk(X,[a,c,g,t])
),
Count).
This approach can be compared to a "straightforward" approach where one would set up a (tail-recursive) recursion that just walks down both sequences in tandem and compares elements pairwise, counting as it goes.
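For reference, the "straightforward" approach could look like the following sketch. The name match_linear/3 and its helper are hypothetical; the sketch assumes both sequences are proper lists and stops at the end of the shorter one.

```prolog
% Hypothetical tail-recursive counterpart of match/3.
match_linear(Seq1, Seq2, Count) :-
    match_linear_(Seq1, Seq2, 0, Count).

% Stop as soon as either sequence is exhausted.
match_linear_([], _, Count, Count) :- !.
match_linear_(_, [], Count, Count) :- !.
match_linear_([X|Xs], [Y|Ys], Acc0, Count) :-
    (   X == Y, memberchk(X, [a,c,g,t])   % '_' never matches, even itself
    ->  Acc is Acc0 + 1
    ;   Acc = Acc0
    ),
    match_linear_(Xs, Ys, Acc, Count).
```

Each position is visited exactly once, so the number of steps grows linearly with the length of the sequences.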
As the sequences can be very large, algorithmic complexity becomes of some interest.
One would expect, with n = length(sequence) and both sequences the same length:
Straightforward approach: complexity is O(n)
aggregation approach: complexity is O(n²)
What is the (time and maybe space) complexity of the above algorithm and why?
Test code
To complement the above, an SWI-Prolog based plunit test code block:
:- begin_tests(atcg).
wrap_match(String1,String2,Count) :-
   atom_chars(String1,Seq1),
   atom_chars(String2,Seq2),
   match(Seq1,Seq2,Count).   % compare both sequences using match/3
test("string 1 empty",nondet) :-
   wrap_match("","atcg",Count),
   assertion(Count == 0).
test("string 2 empty") :-
   wrap_match("atcg","",Count),
   assertion(Count == 0).
test("both strings empty") :-
   wrap_match("","",Count),
   assertion(Count == 0).
test("both strings match, 1 char only") :-
   wrap_match("a","a",Count),
   assertion(Count == 1).
test("both strings match") :-
   wrap_match("atcgatcgatcg","atcgatcgatcg",Count),
   assertion(Count == 12).
test("both strings match with underscores") :-
   % lowercase, because match/3 only counts a, c, g, t
   wrap_match("_tc_atcg_tcg","_tc_atcg_tcg",Count),
   assertion(Count == 9).
test("various mismatches 1") :-
   % second string chosen for illustration: it differs in 4 positions
   wrap_match("atcgatcgatcg","aacgaacgaagg",Count),
   assertion(Count == 8).
test("various mismatches with underscores") :-
wrap_match("at_ga_cg__cg","atcgatcgatcg",Count),
assertion(Count == 8).
:- end_tests(atcg).
And so:
?- run_tests.
% PL-Unit: atcg ........ done
% All 8 tests passed
true.
Empirical info
After some manual data collection (something that cries out for automatization) using the code below, which outputs time elapsed and number of inferences made to the console:
gimme_random_sequence(Length,Seq) :-
length(Seq,Length),
maplist(
[E]>>(random_between(0,3,Ix),nth0(Ix,[a,t,c,g],E)),
Seq).
how_fast(Length) :-
gimme_random_sequence(Length,Seq1),
gimme_random_sequence(Length,Seq2),
time(match(Seq1,Seq2,_)).
... and a bit of graph fumbling in LibreOffice Calc (my ggplot skills are rusty), we have empirical data that this algorithm's cost is
O((length(sequence))²).
Count,Inferences,Seconds,milliseconds,megainferences
1000,171179,0.039,39,0.171179
2000,675661,0.097,97,0.675661
3000,1513436,0.186,186,1.513436
4000,2684639,0.327,327,2.684639
5000,4189172,0.502,502,4.189172
6000,6027056,0.722,722,6.027056
7000,8198103,1.002,1002,8.198103
8000,10702603,1.304,1304,10.702603
9000,13540531,1.677,1677,13.540531
10000,16711607,2.062,2062,16.711607
11000,20216119,2.449,2449,20.216119
20000,66756619,8.091,8091,66.756619
30000,150134731,17.907,17907,150.134731
40000,266846773,32.012,32012,266.846773
50000,416891749,52.942,52942,416.891749
60000,600269907,74.103,74103,600.269907
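The manual data collection could be automated along these lines. This is only a sketch: count_inferences/2 and report/2 are hypothetical helper names, and it assumes SWI-Prolog, where statistics(inferences, N) reports the inference counter.

```prolog
% Hypothetical helper: number of inferences spent on Goal.
% The figure includes a small constant overhead of the helper itself.
count_inferences(Goal, Inferences) :-
    statistics(inferences, I0),
    ignore(Goal),
    statistics(inferences, I1),
    Inferences is I1 - I0.

% Print one "Length,Inferences" line per length.
% Pred is called as call(Pred, Length).
report(Pred, Lengths) :-
    forall(member(N, Lengths),
           ( count_inferences(call(Pred, N), I),
             format("~d,~d~n", [N, I]) )).
```

A call such as `?- report(how_fast, [1000,2000,4000]).` would then print CSV lines directly; note that with how_fast/1 the count also includes the cost of generating the random sequences.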
Never ever use functional programming idioms in Prolog that avoid backtracking, like maplist/4. This here, pair_member/4 and match3/3, should be a tick faster.
match2(Seq1, Seq2, Count) :-
( maplist([X,Y,X-Y]>>true, Seq1, Seq2, Seq3)
-> aggregate_all(count, (member(X-X, Seq3), X\='_'), Count)
; Count = 0 ).
pair_member(X, Y, [X|_], [Y|_]).
pair_member(X, Y, [_|L], [_|R]) :-
pair_member(X, Y, L, R).
match3(Seq1, Seq2, Count) :-
aggregate_all(count,
(pair_member(X, X, Seq1, Seq2), X \= '_'), Count).
gimme_random_sequence(Length, Seq) :-
length(Seq, Length),
maplist([E]>>(random_between(0,3,Ix), nth0(Ix, [a,t,c,g], E)), Seq).
test(N) :-
gimme_random_sequence(N, S1),
gimme_random_sequence(N, S2),
time(match2(S1, S2, Count)),
time(match3(S1, S2, Count)).
Whoa! It's 10x faster! Thanks to the genius of SWI-Prolog in how it compiles the tail recursion in pair_member/4:
/* SWI-Prolog 8.3.21, MacBook Air 2019 */
?- set_prolog_flag(double_quotes, chars).
true.
?- X = "abc".
X = [a, b, c].
?- match2("_TC_ATCG_TCG","_TC_ATCG_TCG",Count).
Count = 9.
?- match3("_TC_ATCG_TCG","_TC_ATCG_TCG",Count).
Count = 9.
?- test(100000).
% 1,575,520 inferences, 0.186 CPU in 0.190 seconds (98% CPU, 8465031 Lips)
% 175,519 inferences, 0.018 CPU in 0.019 seconds (98% CPU, 9577595 Lips)
true.
Edit 29.04.2021:
Oh the irony, bifurcation backtracking is nevertheless challenging.
After fixing a misuse of library(apply_macros), I get:
?- test(100000).
% 374,146 inferences, 0.019 CPU in 0.019 seconds (99% CPU, 19379778 Lips)
% 174,145 inferences, 0.014 CPU in 0.014 seconds (99% CPU, 12400840 Lips)
true.
Does native member/2 contribute to the good maplist solution performance?
But I should take a better measurement, with longer durations.
Open Source:
Sequence Match Problem
https://gist.github.com/jburse/9fd22e8c3e8de6148fbd341817538ef6#file-sequence-pl
I think it is interesting to observe that the O(n²) complexity is not due to the aggregation approach itself, but to the fact that the subgoal nth1(Pos,Seq1,X), nth1(Pos,Seq2,X) behaves as a "nested loop" (in the size n of the sequences).
Thus, it should be possible to create another algorithm that, even using aggregation, can have complexity O(n), as long as the "nested loop" is eliminated.
Algorithms to compare
% Original algorithm: Complexity O(n²)
match1(Seq1, Seq2, Count) :-
aggregate_all(count,
( nth1(Pos, Seq1, X),
nth1(Pos, Seq2, X),
memberchk(X, [a,c,g,t]) ),
Count).
% Proposed algorithm: Complexity O(n)
match2(Seq1, Seq2, Count) :-
( maplist([X,Y,X-Y]>>true, Seq1, Seq2, Seq3)
-> aggregate_all(count, (member(X-X, Seq3), X\='_'), Count)
; Count = 0 ).
gimme_random_sequence(Length, Seq) :-
length(Seq, Length),
maplist([E]>>(random_between(0,3,Ix), nth0(Ix, [a,t,c,g], E)), Seq).
test(N) :-
gimme_random_sequence(N, S1),
gimme_random_sequence(N, S2),
time(match1(S1, S2, Count)),
time(match2(S1, S2, Count)).
Simple empirical results
?- test(10000).
% 16,714,057 inferences, 1.156 CPU in 1.156 seconds (100% CPU, 14455401 Lips)
% 39,858 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
true.
?- test(20000).
% 66,761,535 inferences, 4.594 CPU in 4.593 seconds (100% CPU, 14533123 Lips)
% 79,826 inferences, 0.016 CPU in 0.016 seconds (100% CPU, 5108864 Lips)
true.
?- test(40000).
% 266,856,213 inferences, 19.734 CPU in 19.841 seconds (99% CPU, 13522405 Lips)
% 159,398 inferences, 0.016 CPU in 0.015 seconds (104% CPU, 10201472 Lips)
true.
?- test(80000).
% 1,067,046,835 inferences, 77.203 CPU in 77.493 seconds (100% CPU, 13821291 Lips)
% 320,226 inferences, 0.047 CPU in 0.047 seconds (100% CPU, 6831488 Lips)
true.
Edit 30/04/2021:
Does nth1(I,S,X), nth1(I,S,X) really work as a nested loop?
To see that the answer to this question is yes, consider the following simple implementation of nth/3, that counts the number of rounds needed to find each solution, using a global flag:
nth(Index, List, Item) :-
( var(Index)
-> nth_nondet(1, Index, List, Item)
; integer(Index)
-> nth_det(Index, List, Item)
).
nth_det(1, [Item|_], Item) :- !.
nth_det(Index, [_|Rest], Item) :-
flag(rounds, Rounds, Rounds+1),
Index1 is Index - 1,
nth_det(Index1, Rest, Item).
nth_nondet(Index, Index, [Item|_], Item).
nth_nondet(Acc, Index, [_|Rest], Item) :-
flag(rounds, Rounds, Rounds+1),
Acc1 is Acc + 1,
nth_nondet(Acc1, Index, Rest, Item).
To get the number of rounds, you can ask:
?- flag(rounds,_,0), nth(5,[a,b,c,d,e],X), flag(rounds,Rounds,Rounds).
X = e,
Rounds = 4.
Now, using this predicate, we can create a predicate to count the number of rounds of the goal nth(I,L,X), nth(I,L,X), for lists of different lengths:
count_rounds :-
forall(between(1, 10, N),
( Length is 10*N,
count_rounds(Length, Rounds),
writeln(rounds(Length) = Rounds)
)).
count_rounds(Length, _) :-
numlist(1, Length, List),
flag(rounds, _, 0),
nth(Index, List, Item),
nth(Index, List, Item),
fail.
count_rounds(_, Rounds) :-
flag(rounds, Rounds, Rounds).
Empirical results:
?- count_rounds.
rounds(10) = 55
rounds(20) = 210
rounds(30) = 465
rounds(40) = 820
rounds(50) = 1275
rounds(60) = 1830
rounds(70) = 2485
rounds(80) = 3240
rounds(90) = 4095
rounds(100) = 5050
As we can see, the goal nth(I,L,X), nth(I,L,X) computes half of a square matrix of order n (including its diagonal). Thus, the number of rounds for a list of length n is rounds(n) = (n² + n)/2. Hence, the time complexity of this goal is O(n²).
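The closed form can be checked against the measured values with a few lines of Prolog. The predicate names rounds_formula/2 and check_rounds/0 are hypothetical; the expected values are taken from the table above.

```prolog
% rounds(n) = (n^2 + n) / 2, using integer arithmetic.
rounds_formula(N, R) :-
    R is (N*N + N) // 2.

% Succeeds if the formula reproduces some of the measured rounds.
check_rounds :-
    forall(member(N-Expected, [10-55, 20-210, 30-465, 100-5050]),
           ( rounds_formula(N, R), R =:= Expected )).
```

For example, n = 10 gives (100 + 10)/2 = 55, which matches rounds(10) = 55.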
Remark: The implementation of the library predicate nth1/3 is a little more efficient than that of the predicate nth/3 considered for this experiment. Nevertheless, the time complexity of the goal nth1(I,S,X), nth1(I,S,X) still is O(n²).
This is a follow-up to MostowskiCollapse's answer, where I have applied to pair_member/4 the same optimization that Gertjan van Noord provided for member/2, but I have renamed it to member/4.
member(X, Y, [XH|XT], [YH|YT]) :-
member_(XT, YT, X, Y, XH, YH).
member_(_, _, X,Y, X,Y).
member_([XH|XT],[YH|YT], X,Y, _,_) :-
member_(XT,YT, X,Y, XH,YH).
match4(Seq1, Seq2, Count) :-
aggregate_all(count,
(member(X, X, Seq1, Seq2), X \= '_'), Count).
test(N) :-
gimme_random_sequence(N, S1),
gimme_random_sequence(N, S2),
%time(match2(S1, S2, Count)),
time(match3(S1, S2, Count)),
time(match4(S1, S2, Count)).
...
with lists of length 1.000.000 I get
% 1,751,758 inferences, 0.835 CPU in 0.835 seconds (100% CPU, 2098841 Lips)
% 1,751,757 inferences, 0.637 CPU in 0.637 seconds (100% CPU, 2751198 Lips)
that is, a gain of about 25%...
I'm trying to find the most common list item: common([b,a,a,a,c,d,b,f,s,f,s,f,s,f,s,f,f],R) should give R=f.
My idea was: take the list, go to the end and take el=b, num1=1; then go back to the beginning and compare: if b=b then num1=num1+1, else (a!=b) num2=num2+1; if num1>num2 recurse, else el=a, or something like this. But I had some difficulty transforming it into Prolog.
insert_sort sorts the list, but for some reason, if I use las(X,Y) (my own replacement for the original last/2) I get 4-a, whereas if I use last(X,Y) I get just a...
most_common([X|Y],J):-
insert_sort([X|Y],[R|Rs]),
count_runs([R|Rs],G),
las(G,J).
las([N-Y],Y).
las([_|T],Y):- las(T,Y).
las([_|Tail], Y) :- las(Tail, Y).
insert_sort(List,Sorted):-
i_sort(List,[],Sorted).
i_sort([],Acc,Acc).
i_sort([H|T],Acc,Sorted):-
insert(H,Acc,NAcc),
i_sort(T,NAcc,Sorted).
insert(X,[],[X]).
insert(X,[Y|T],[Y|NT]):- X #> Y, insert(X,T,NT).
insert(X,[Y|T],[X,Y|T]):- X #=< Y.
This looks like homework, so I'm not going to give you a full answer, but will suggest how you could solve it in one particular way, which isn't necessarily the best way:
Sort the list into sorted order (by standard order of terms if this is good enough): look at sort/2 routines. e.g., [b,a,a,a,c,d,b] becomes [a,a,a,b,b,c,d].
Take the sorted list and count the size of 'runs', perhaps to convert [a,a,a,b,b,c,d] into [3-a,2-b,1-c,1-d] (where -/2 is simply another term). e.g., consider the following code:
count_runs([E|Es], C) :-
% defer to count_runs/3 with an initial count of element E
count_runs(Es, 1-E, C).
% return the final count for Y elements if none remain (base case)
count_runs([], N-Y, [N-Y]).
count_runs([X|Es], N-Y, [N-Y|Rest]) :-
% if X is not equal to Y, record the count and continue next run
X \== Y, !,
count_runs([X|Es], Rest).
count_runs([_X|Es], N-Y, Rest) :-
% else X equals Y; increment the counter and continue
NPlusOne is N + 1,
count_runs(Es, NPlusOne-Y, Rest).
Perform something like keysort/2 to order the terms by the value of their keys (i.e., the numbers which are the counts, turning [3-a,2-b,1-c,1-d] into [1-c,1-d,2-b,3-a]). Then, the most-occurring elements of the list are the values at the end of the list with the same key value (i.e., here, this is the a in the last term 3-a). In general, there may be more than one element that occurs the most (equally with another).
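Put together, the three steps could be sketched like this. All predicate names here (most_common_sketch/2, runs/2, take_run/5) are hypothetical, the sketch uses msort/2 so that duplicates are kept, and in case of a tie it returns only one of the most common elements.

```prolog
:- use_module(library(lists)).   % last/2

most_common_sketch(List, Elem) :-
    msort(List, Sorted),         % step 1: sort, keeping duplicates
    runs(Sorted, Pairs),         % step 2: [a,a,b] becomes [2-a,1-b]
    keysort(Pairs, ByCount),     % step 3: order by count, ascending
    last(ByCount, _-Elem).       % the highest count is at the end

runs([], []).
runs([X|Xs], [N-X|Rest]) :-
    take_run(Xs, X, 1, N, Tail),
    runs(Tail, Rest).

% Consume further copies of X, counting them.
take_run([Y|Ys], X, N0, N, Tail) :-
    Y == X, !,
    N1 is N0 + 1,
    take_run(Ys, X, N1, N, Tail).
take_run(Tail, _, N, N, Tail).
```

For the list from the question this yields f, which occurs six times.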
Good luck.
Based on Prolog lambdas, we use the meta-predicates tcount/3 and reduce/3, as well as the reified term equality predicate (=)/3:
:- use_module(library(lambda)).
mostcommon_in(E,Xs) :-
tcount(=(E),Xs,M),
maplist(Xs+\X^N^(tcount(=(X),Xs,N)),Xs,Counts),
reduce(\C0^C1^C^(C is max(C0,C1)),Counts,M).
Sample query:
?- mostcommon_in(X,[a,b,c,d,a,b,c,a,b]).
X = a ;
X = b ;
false.
Note that this is monotone (unlike its earlier quick-hack version). Look!
?- mostcommon_in(X,[A,B,C,D,A,B,C,A,B]), A=a,B=b,C=c,D=d.
X = a, A = a, B = b, C = c, D = d ;
X = b, A = a, B = b, C = c, D = d ;
false.
Preserve logical-purity by
using list_counts/2 for defining mostcommonitem_in/2 as follows:
mostcommonitem_in(E,Xs) :-
list_counts(Xs,Cs), % tag items with multiplicity
maplist(\ (X-N)^(M-X)^(M is -N),Cs,Ps), % prepare keysorting
keysort(Ps,[Max-_|_]), % sort ascending by negated count
member(Max-E,Ps). % pick most common ones
Let's run a query!
?- mostcommonitem_in(X,[a,b,c,d,a,b,c,a,b]).
X = a ;
X = b ;
false. % OK
But, is it still monotone?
?- mostcommonitem_in(X,[A,B,C,D,A,B,C,A,B]), A=a,B=b,C=c,D=d.
X = A, A = a, B = b, C = c, D = d ;
X = B, B = b, A = a, C = c, D = d ;
false. % OK: monotone
Got speed? (compared to the pure answer I showed in my previous answer to this question)
% OLD
?- length(Xs,5), time(findall(t,mostcommon_in(E,Xs),Ts)), length(Ts,N_sols).
% 854,636 inferences, 0.115 CPU in 0.115 seconds (100% CPU, 7447635 Lips)
N_sols = 71, Xs = [_,_,_,_,_], Ts = [t,t,t|...].
?- length(Xs,6), time(findall(t,mostcommon_in(E,Xs),Ts)), length(Ts,N_sols).
% 4,407,975 inferences, 0.449 CPU in 0.449 seconds (100% CPU, 9813808 Lips)
N_sols = 293, Xs = [_,_,_,_,_,_], Ts = [t,t,t|...].
?- length(Xs,7), time(findall(t,mostcommon_in(E,Xs),Ts)), length(Ts,N_sols).
% 24,240,240 inferences, 2.385 CPU in 2.384 seconds (100% CPU, 10162591 Lips)
N_sols = 1268, Xs = [_,_,_,_,_,_,_], Ts = [t,t,t|...].
% NEW
?- length(Xs,5), time(findall(t,mostcommonitem_in(E,Xs),Ts)), length(Ts,N_sols).
% 4,031 inferences, 0.001 CPU in 0.002 seconds (93% CPU, 2785423 Lips)
N_sols = 71, Xs = [_,_,_,_,_], Ts = [t,t,t|...].
?- length(Xs,6), time(findall(t,mostcommonitem_in(E,Xs),Ts)), length(Ts,N_sols).
% 17,632 inferences, 0.002 CPU in 0.002 seconds (100% CPU, 9194323 Lips)
N_sols = 293, Xs = [_,_,_,_,_,_], Ts = [t,t,t|...].
?- length(Xs,7), time(findall(t,mostcommonitem_in(E,Xs),Ts)), length(Ts,N_sols).
% 82,263 inferences, 0.023 CPU in 0.023 seconds (100% CPU, 3540609 Lips)
N_sols = 1268, Xs = [_,_,_,_,_,_,_], Ts = [t,t,t|...].
I could give you a high-level answer: You could sort the list and then it's relatively easy to count the items, one after another, and update what so far is the most common item.
While reading SICP I came across logic programming chapter 4.4. Then I started looking into the Prolog programming language and tried to understand some simple assignments in Prolog. I found that Prolog seems to have troubles with numerical calculations.
Here is the computation of a factorial in standard Prolog:
f(0, 1).
f(A, B) :- A > 0, C is A-1, f(C, D), B is A*D.
The issues I find are that I need to introduce two auxiliary variables (C and D) and a new syntax (is), and that the problem is non-reversible (i.e., f(5,X) works as expected, but f(X,120) does not).
Naively, I expect that at the very least C is A-1, f(C, D) above may be replaced by f(A-1,D), but even that does not work.
My question is: Why do I need to do this extra "stuff" in numerical calculations but not in other queries?
I do understand (and SICP is quite clear about it) that in general information on "what to do" is insufficient to answer the question of "how to do it". So the declarative knowledge in (at least some) math problems is insufficient to actually solve these problems. But that begs the next question: How does this extra "stuff" in Prolog help me to restrict the formulation to just those problems where "what to do" is sufficient to answer "how to do it"?
is/2 is very low-level and limited. As you correctly observe, it cannot be used in all directions and is therefore not a true relation.
For reversible arithmetic, use your Prolog system's constraint solvers.
For example, SWI-Prolog's CLP(FD) manual contains the following definition of n_factorial/2:
:- use_module(library(clpfd)).
n_factorial(0, 1).
n_factorial(N, F) :- N #> 0, N1 #= N - 1, F #= N * F1, n_factorial(N1, F1).
The following example queries show that it can be used in all directions:
?- n_factorial(47, F).
F = 258623241511168180642964355153611979969197632389120000000000 ;
false.
?- n_factorial(N, 1).
N = 0 ;
N = 1 ;
false.
?- n_factorial(N, 3).
false.
Of course, this definition still relies on unification, and you can therefore not plug in arbitrary integer expressions. A term like 2-2 (which is -(2,2) in prefix notation) does not unify with 0. But you can easily allow this if you rewrite this to:
:- use_module(library(clpfd)).
n_factorial(N, F) :- N #= 0, F #= 1.
n_factorial(N, F) :- N #> 0, N1 #= N - 1, F #= N * F1, n_factorial(N1, F1).
Example query and its result:
?- n_factorial(2-2, -4+5).
true .
Forget about variables for a moment and think of A and B as just names for values that can be placed into a clause (X :- Y). to make it provable. Think of X = (2 + (3 * 4)) as a data structure that represents a mathematical expression. If you ask Prolog to prove the goal f(A-1, B), it will look for a fact f(A-1,B). or a rule (f(A-1,B) :- Z) whose body Z succeeds; the term A-1 is never evaluated along the way.
is/2 evaluates its second argument as an arithmetic expression and tries to unify the result with its first argument. Consider eval/2 as a variant of is/2:
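The difference between a term and its value can be seen in a few small goals. The predicate names below are hypothetical and exist only to make the three observations checkable; each of them succeeds.

```prolog
% 5-1 is just the compound term -(5,1); nothing is computed.
minus_is_a_term :-
    X = 5-1,
    X = -(A, B),
    A == 5, B == 1.

% Unification is purely structural, so 5-1 does not unify with 4.
no_unify_with_4 :-
    \+ (5-1 = 4).

% Only is/2 (and the arithmetic comparisons) evaluate the term.
is_evaluates :-
    X is 5-1,
    X =:= 4.
```

This is why f(A-1, D) does not behave like C is A-1, f(C, D): the clause f(0, 1) can never match an unevaluated term such as 1-1.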
eval(0, 1-1). eval(0, 2-2). eval(1,2-1).
eval(Y, X-0):- eval(Y, X).
eval(Y, A+B):- eval(ValA, A), eval(ValB, B), eval(Y, ValA + ValB).
eval(4, 2*2).
eval(0, 0*_). eval(0, _*0).
eval(Y, X*1):- eval(Y, X).
eval(Y, 1*X):- eval(Y, X).
eval(Y, A*B):- eval(ValA, A), eval(ValB, B), eval(Y, ValA * ValB).
The reason why f(X,120) doesn't work is simple: >/2 works only when its arguments are bound (i.e., you can't compare something not yet defined, like X, with anything else). To fix that, you have to split the rule into:
f(A,B) :- nonvar(A), A > 0, C is A-1, f(C, D), B is A*D.
f(A,B) :- nonvar(B), f_rev(A, B, 1, 1).
% f_rev/4 - only first argument is unbound.
f_rev(A, B, A, B). % solution
f_rev(A, B, N, C):- C < B, NextN is (N+1), NextC is (C*NextN), f_rev(A, B, NextN, NextC).
Update: (fixed f_rev/4)
You may be interested in finite-domain solvers. There was a question about using such things. With #>/2 and #=/2 you can state formulas and restrictions and then have them solved. But these predicates rely on special abilities of some Prolog systems, which allow attributes to be associated with a variable; this helps to narrow the set of possible values by intersecting the restrictions. Some other systems (usually the same ones) also allow the order in which goals are processed to be changed ("suspend").
Also, member(X,[1,2,3,4,5,6,7]), f(X, 120) probably does the same thing that your "other queries" do.
If you are interested in logic languages in general, you may also look at the Curry language (there, all non-pure functions are "suspended" until the not-yet-defined value is unified).
In this answer we use clpfd, just like this previous answer did.
:- use_module(library(clpfd)).
For easy head-to-head comparison (later on), we call the predicate presented here n_fac/2:
n_fac(N_expr,F_expr) :-
N #= N_expr, % eval arith expr
F #= F_expr, % eval arith expr
n_facAux(N,F).
Like in this previous answer, n_fac/2 admits the use of arithmetic expressions.
n_facAux(0,1). % 0! = 1
n_facAux(1,1). % 1! = 1
n_facAux(2,2). % 2! = 2
n_facAux(N,F) :-
N #> 2,
F #> N, % redundant constraint
% to help `n_fac(N,N)` terminate
n0_n_fac0_fac(3,N,6,F). % general case starts with "3! = 6"
The helper predicate n_facAux/2 delegates any "real" work to n0_n_fac0_fac/4:
n0_n_fac0_fac(N ,N,F ,F).
n0_n_fac0_fac(N0,N,F0,F) :-
N0 #< N,
N1 #= N0+1, % count "up", not "down"
F1 #= F0*N1, % calc `1*2*...*N`, not `N*(N-1)*...*2*1`
F1 #=< F, % enforce redundant constraint
n0_n_fac0_fac(N1,N,F1,F).
Let's compare n_fac/2 and n_factorial/2!
?- n_factorial(47,F).
F = 258623241511168180642964355153611979969197632389120000000000
; false.
?- n_fac(47,F).
F = 258623241511168180642964355153611979969197632389120000000000
; false.
?- n_factorial(N,1).
N = 0
; N = 1
; false.
?- n_fac(N,1).
N = 0
; N = 1
; false.
?- member(F,[3,1_000_000]), ( n_factorial(N,F) ; n_fac(N,F) ).
false. % both predicates agree
OK! Identical, so far... Why not do a little brute-force testing?
?- time((F1 #\= F2,n_factorial(N,F1),n_fac(N,F2))).
% 57,739,784 inferences, 6.415 CPU in 7.112 seconds (90% CPU, 9001245 Lips)
% Execution Aborted
?- time((F1 #\= F2,n_fac(N,F2),n_factorial(N,F1))).
% 52,815,182 inferences, 5.942 CPU in 6.631 seconds (90% CPU, 8888423 Lips)
% Execution Aborted
?- time((N1 #> 1,N2 #> 1,N1 #\= N2,n_fac(N1,F),n_factorial(N2,F))).
% 99,463,654 inferences, 15.767 CPU in 16.575 seconds (95% CPU, 6308401 Lips)
% Execution Aborted
?- time((N1 #> 1,N2 #> 1,N1 #\= N2,n_factorial(N2,F),n_fac(N1,F))).
% 187,621,733 inferences, 17.192 CPU in 18.232 seconds (94% CPU, 10913552 Lips)
% Execution Aborted
No differences for the first few hundred values of N in 2..sup... Good!
Moving on: How about the following (suggested in a comment to this answer)?
?- n_factorial(N,N), false.
false.
?- n_fac(N,N), false.
false.
Doing fine! Identical termination behaviour... More?
?- N #< 5, n_factorial(N,_), false.
false.
?- N #< 5, n_fac(N,_), false.
false.
?- F in 10..100, n_factorial(_,F), false.
false.
?- F in 10..100, n_fac(_,F), false.
false.
Alright! Still identical termination properties! Let's dig a little deeper! How about the following?
?- F in inf..10, n_factorial(_,F), false.
... % Execution Aborted % does not terminate universally
?- F in inf..10, n_fac(_,F), false.
false. % terminates universally
D'oh! The first query does not terminate, the second does.
What a speedup! :)
Let's do some empirical runtime measurements!
?- member(Exp,[6,7,8,9]), F #= 10^Exp, time(n_factorial(N,F)) ; true.
% 328,700 inferences, 0.043 CPU in 0.043 seconds (100% CPU, 7660054 Lips)
% 1,027,296 inferences, 0.153 CPU in 0.153 seconds (100% CPU, 6735634 Lips)
% 5,759,864 inferences, 1.967 CPU in 1.967 seconds (100% CPU, 2927658 Lips)
% 22,795,694 inferences, 23.911 CPU in 23.908 seconds (100% CPU, 953351 Lips)
true.
?- member(Exp,[6,7,8,9]), F #= 10^Exp, time(n_fac(N,F)) ; true.
% 1,340 inferences, 0.000 CPU in 0.000 seconds ( 99% CPU, 3793262 Lips)
% 1,479 inferences, 0.000 CPU in 0.000 seconds (100% CPU, 6253673 Lips)
% 1,618 inferences, 0.000 CPU in 0.000 seconds (100% CPU, 5129994 Lips)
% 1,757 inferences, 0.000 CPU in 0.000 seconds (100% CPU, 5044792 Lips)
true.
Wow! Some more?
?- member(U,[10,100,1000]), time((N in 1..U,n_factorial(N,_),false)) ; true.
% 34,511 inferences, 0.004 CPU in 0.004 seconds (100% CPU, 9591041 Lips)
% 3,091,271 inferences, 0.322 CPU in 0.322 seconds (100% CPU, 9589264 Lips)
% 305,413,871 inferences, 90.732 CPU in 90.721 seconds (100% CPU, 3366116 Lips)
true.
?- member(U,[10,100,1000]), time((N in 1..U,n_fac(N,_),false)) ; true.
% 3,729 inferences, 0.001 CPU in 0.001 seconds (100% CPU, 2973653 Lips)
% 36,369 inferences, 0.004 CPU in 0.004 seconds (100% CPU, 10309784 Lips)
% 362,471 inferences, 0.036 CPU in 0.036 seconds (100% CPU, 9979610 Lips)
true.
The bottom line?
The code presented in this answer is as low-level as you should go: Forget is/2!
Redundant constraints can and do pay off.
The order of arithmetic operations (counting "up" vs "down") can make quite a difference, too.
If you want to calculate the factorial of some "large" N, consider using a different approach.
Use clpfd!
There are some things which you must remember when looking at Prolog:
There is no implicit return value when you call a predicate. If you want to get a value out of a call, you need to add extra arguments which can be used to "return" values, like the second argument in your f/2 predicate. While this is more verbose, it has the benefit of making it easy to return many values.
This means that automatically "evaluating" arguments in a call would be quite meaningless, as there is no value to return, and so it is not done. There are no nested calls; in this respect Prolog is flat. So when you call f(A-1, D), the first argument to f/2 is the structure A-1, or really -(A, 1), as - is an infix operator. If you want to get the value from a call to foo into a call to bar, you have to explicitly use a variable, like this:
foo(..., X), bar(X, ...),
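A small sketch can show what a predicate actually receives when you write an expression in argument position. The helper describe_arg/2 is a hypothetical name introduced only for this illustration; it uses the standard built-ins compound/1, atomic/1 and (=..)/2.

```prolog
% describe_arg(+Term, -Description): hypothetical helper that reports
% what a predicate actually receives as its argument.
describe_arg(Term, compound(Name, Args)) :-
   compound(Term),
   Term =.. [Name|Args].              % split a structure into name and arguments
describe_arg(Term, atomic(Term)) :-
   atomic(Term).

% ?- A = 5, describe_arg(A-1, D).
%    D = compound(-, [5,1])           % the structure -(5,1), not the number 4
% ?- A = 5, C is A-1, describe_arg(C, D).
%    D = atomic(4)                    % C carries the evaluated value
```

The second query passes the value through an explicit variable, which is the foo/bar pattern shown above.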
So you need a special predicate which forces arithmetic evaluation: is/2. Its second argument is a structure representing an arithmetic expression, which it interprets and evaluates; it then unifies the result with its first argument, which can be either a variable or a numerical value.
While in principle you can run things backwards, with most things you can't. Usually it is possible only for simple predicates working on structures, though there are some very useful cases where it works. is/2 doesn't work backwards; it would be exceptional if it did.
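The one-directional nature of is/2 can be seen in a short sketch. It assumes SWI-Prolog and its library(clpfd), the constraint library used elsewhere in this thread, for contrast. Both predicate names are hypothetical.

```prolog
:- use_module(library(clpfd)).

% plus_one_is(+X, -Y): arithmetic via is/2; X must be bound when called.
plus_one_is(X, Y) :- Y is X + 1.

% plus_one_fd(?X, ?Y): the same relation stated as a clpfd constraint.
plus_one_fd(X, Y) :- Y #= X + 1.

% ?- plus_one_is(2, Y).    % Y = 3
% ?- plus_one_is(X, 3).    % raises an instantiation error
% ?- plus_one_fd(X, 3).    % X = 2
```

is/2 must evaluate its right-hand side before unifying, so an unbound variable there is an error and does not produce a solution.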
This is why you need the extra variables C and D and can't replace C is A-1, f(C, D) by f(A-1,D).
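The question's f/2 is not reproduced here, so the following is a hypothetical reconstruction of a factorial predicate in that style, with f_bad/2 as an equally hypothetical variant that passes A-1 directly. It is a sketch of the point being made, not the original code.

```prolog
% f(+A, -B): B is the factorial of A (hypothetical reconstruction).
f(0, 1).
f(A, B) :-
   A > 0,
   C is A - 1,                        % force evaluation of A-1
   f(C, D),
   B is A * D.

% f_bad(+A, -B): same idea, but A-1 is passed as an unevaluated structure.
f_bad(0, 1).
f_bad(A, B) :-
   A > 0,
   f_bad(A-1, D),                     % the first argument is the term -(A,1)
   B is A * D.

% ?- f(5, B).        % B = 120
% ?- f_bad(2, B).    % fails: the term 2-1-1 never unifies with 0
```

Note that f_bad/2 fails quietly and raises no error: (>)/2 evaluates its arguments, so the recursion stops once the expression evaluates to 0, but the base case needs the integer 0 itself.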
(Yes, I know you don't make calls in Prolog but evaluate goals; we were starting from a functional viewpoint here.)