How to count coincidences on each character of two large strings without triggering the Out of Local Stack exception?

I need a predicate that counts character coincidences between two large strings, ignoring '_' coincidences. I have this code:
fit(GEN1, GEN2, N, N) :-
length(GEN1, L1),
length(GEN2, L2),
0 is L1*L2.
fit([P1|R1], [P2|R2], N, TOTAL) :-
member(P1, ['_',a,c,t,g]),
member(P2, ['_',a,c,t,g]),
append([P1],[P2],T),
( member(T,[[a,a],[c,c],[t,t],[g,g]])
-> X is N+1
; X is N
),
fit(R1,R2,X,TOTAL).
where GEN1 and GEN2 are lists containing all the characters of the large strings.
I've tried increasing the stack limit to avoid the Out of Local Stack exception, with little success.
The issue is that fit is called often, and from deeply recursive clauses. Is there a better way to do this?
EDIT
The clause needs to stop when one or both lists are empty.
EDIT 2
It is worth saying that all the answers below were tested using 64-bit Prolog with the --stack-limit=32g option, as my code isn't well optimized and the fit clause is a small part of a larger process, but it was the main problem with my code.
EDIT 3
CapelliC's code used the fewest resources.
false's code, using library(reif) v2, ran the fastest.
See "Complexity of counting matching elements in two sequences using library(aggregate)" for more proposed solutions.

There seems to be no point in insisting that every letter comes from "_actg". A generalized definition seems to be sufficient. Using library(reif):
fit([], _, N,N).
fit([_|_], [], N,N).
fit([P1|R1], [P2|R2], N,TOTAL) :-
if_( ( P1 = P2, dif(P1, '_') ), X is N+1, X = N ),
fit(R1, R2, X,TOTAL).
Update: please make sure to use v2 of library(reif). The original version did not compile dif/3.
And here is a version for systems that can only index on one argument at a time:
fit([], _, N,N).
fit([P1|R1], L2, N,TOTAL) :-
ifit(L2, [P1|R1], N,TOTAL).
ifit([], _, N,N).
ifit([P2|R2], [P1|R1], N,TOTAL) :-
if_( ( P1 = P2, dif(P1, '_') ), X is N+1, X = N ),
fit(R1, R2, X,TOTAL).

If your Prolog has library(aggregate), you can do
fit(GEN1, GEN2, N) :-
aggregate_all(count, (nth1(P,GEN1,S),nth1(P,GEN2,S),memberchk(S,[a,c,g,t])), N).
edit
Depending on the statistics of the data, a noticeable improvement can be obtained just by swapping the last two calls, i.e. ...(nth1(P,GEN1,S),memberchk(S,[a,c,g,t]),nth1(P,GEN2,S))...
edit
Of course, a tight loop is better than a doubly indexed scan. For performance, I would write it like
fit_cc(GEN1, GEN2, N) :-
fit_cc(GEN1, GEN2, 0, N).
fit_cc([X|GEN1], [Y|GEN2], C, N) :-
( X\='_' /*memberchk(X, [a,c,g,t])*/, X=Y
-> D is C+1 ; D=C
),
fit_cc(GEN1, GEN2, D, N).
fit_cc(_, _, N, N).
but the generality and correctness allowed by library(reif) v2, as seen in @false's answer and comments, seem to be well worth the (pretty small) overhead.
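A quick way to sanity-check the tight-loop version is to run it on two short sample strings. The sketch below restates fit_cc/3 with two small changes that are my assumptions rather than part of the answer: a cut after head unification, so the catch-all clause is not left behind as a choice point, and ==/2 in place of =/2, so the comparison never binds anything. The sample atoms and fit_cc_demo/1 are hypothetical.

```prolog
% fit_cc(+Gen1, +Gen2, -N): N is the number of positions where both
% lists hold the same character, not counting '_'.
fit_cc(Gen1, Gen2, N) :-
    fit_cc(Gen1, Gen2, 0, N).

fit_cc([X|Gen1], [Y|Gen2], C, N) :-
    !,                          % assumption: commit, so the last clause is not retried
    (   X \== '_', X == Y
    ->  D is C+1
    ;   D = C
    ),
    fit_cc(Gen1, Gen2, D, N).
fit_cc(_, _, N, N).             % one or both lists are exhausted

% Hypothetical sample data: atom_chars/2 yields the character lists.
fit_cc_demo(N) :-
    atom_chars('ac_tg', Gen1),
    atom_chars('ac_tat', Gen2),
    fit_cc(Gen1, Gen2, N).
```

With these inputs, ?- fit_cc_demo(N). gives N = 3: a, c and t match, the '_' pair is skipped, g and a differ, and counting stops when the shorter list ends.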

In case you always call your predicate with the first two arguments already fully instantiated, so that you use it as a function rather than as a relation (which it seems you do), I suspect that just adding !, at the start of your very last line of code should be enough to remove the stack overflow.
To do a little bit better, we'd use memberchk instead of member and notice that append([A],[B],C) is exactly the same thing as C = [A,B]; so after a little bit of reshuffling we end up with something like
fit( [], [], N, N).
fit( [P1|R1], [P2|R2], N, TOTAL) :-
( memberchk( P1, [a,c,t,g]), P2 == P1 % test inside the condition, so a '_' does not make the clause fail
-> X is N+1
; X is N
),
%% !, %% might need the cut
fit( R1, R2, X, TOTAL).
and we might not even need that cut since memberchk is already deterministic.
(not tested, though)

Related

Find the minimum in a mixed list in Prolog

I am new to Prolog, I am just learning about lists, and I came across this question. The answer works perfectly for a list of integers.
minimo([X], X) :- !.
minimo([X,Y|Tail], N):-
( X > Y ->
minimo([Y|Tail], N)
;
minimo([X|Tail], N)
).
How can I change this code to get the smallest int from a mixed list?
This
sint([a,b,3,2,1],S)
should give an answer:
S=1
You could just ignore the problem by changing the comparison operator (>)/2 (a binary builtin predicate, actually) to the more general (@>)/2, which compares arbitrary terms by the standard order of terms:
minimo([X], X) :- !.
minimo([X,Y|Tail], N):-
( X @> Y ->
minimo([Y|Tail], N)
;
minimo([X|Tail], N)
).
?- minimo([a,b,3,2,1],S).
S = 1.
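The reason this works is the standard order of terms, in which numbers sort before atoms. A minimal sketch of the same idea using msort/2, which sorts by that order without removing duplicates; smallest_term/2 is a hypothetical name:

```prolog
% smallest_term(+List, -Min): Min is the first element of List in the
% standard order of terms (Var < Number < Atom < String < Compound).
smallest_term(List, Min) :-
    msort(List, [Min|_]).
```

?- smallest_term([a,b,3,2,1], S). gives S = 1. The same caveat applies as to the comparison-based version: if the list contains no number at all, the smallest atom is returned instead of the query failing.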
First of all, I don't think the proposed implementation is very elegant: it passes along the minimum found so far by constructing a new list each time. Using an additional parameter (called an accumulator) is usually the way to go (and is probably more efficient as well).
In order to solve the problem, we first have to find an integer. We can do this like:
sint([H|T],R) :-
integer(H),
!,
sint(T,H,R).
sint([_|T],R) :-
sint(T,R).
So here we check if the head H is an integer/1. If that is the case, we call a predicate sint/3 (not to be confused with sint/2). Otherwise we call recursively sint/2 with the tail T of the list.
Now we still need to define sint/3. In case we have reached the end of the list [], we simply return the minimum found thus far:
sint([],R,R).
Otherwise there are two cases:
the head H is an integer and smaller than the element found thus far, in that case we perform recursion with the head as new current minimum:
sint([H|T],M,R) :-
integer(H),
H < M,
!,
sint(T,H,R).
otherwise, we simply ignore the head, and perform recursion with the tail T.
sint([_|T],M,R) :-
sint(T,M,R).
We can put the recursive clauses in an if-then-else structure. Together with the earlier defined predicate, the full program then is:
sint([H|T],R) :-
integer(H),
!,
sint(T,H,R).
sint([_|T],R) :-
sint(T,R).
sint([],R,R).
sint([H|T],M,R) :-
(
(integer(H),H < M)
-> sint(T,H,R)
; sint(T,M,R)
).
The advantage of this approach is that filtering and comparing (to obtain the minimum) is done at the same time, so we only iterate once over the list. This will usually result in a performance boost since the "control structures" are only executed once: more is done in an iteration, but we only iterate once.
We can generalize the approach by making the filter generic:
filter_minimum(Filter,[H|T],R) :-
Goal =.. [Filter,H],
call(Goal),
!,
filter_minimum(Filter,T,H,R).
filter_minimum(Filter,[_|T],R) :-
filter_minimum(Filter,T,R).
filter_minimum(_,[],R,R).
filter_minimum(Filter,[H|T],M,R) :-
Goal =.. [Filter,H],
(
(call(Goal),H < M)
-> filter_minimum(Filter,T,H,R)
; filter_minimum(Filter,T,M,R)
).
You can then call it with:
filter_minimum(integer,[a,b,3,2,1],R).
to filter with integer/1 and calculate the minimum.
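Building the goal with =.. only works when Filter is a plain atom. A sketch of the same one-pass idea using call/2, which also accepts closures that already carry arguments; filter_min/3 and its helper filter_min_/4 are hypothetical names, and the helper takes the list first so that first-argument indexing applies to it:

```prolog
% filter_min(:Filter, +List, -Min): Min is the smallest element of List
% that satisfies Filter. Fails if no element satisfies Filter.
filter_min(Filter, [H|T], Min) :-
    (   call(Filter, H)
    ->  filter_min_(T, Filter, H, Min)      % first hit seeds the accumulator
    ;   filter_min(Filter, T, Min)
    ).

filter_min_([], _, Min, Min).
filter_min_([H|T], Filter, Min0, Min) :-
    (   call(Filter, H), H < Min0           % filter first, so < only sees numbers
    ->  filter_min_(T, Filter, H, Min)
    ;   filter_min_(T, Filter, Min0, Min)
    ).
```

For example, ?- filter_min(integer, [a,b,3,2,1], R). gives R = 1.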
You could just write a predicate that returns a list with the numbers and then use the above minimo/2 predicate:
only_numbers([],[]).
only_numbers([H|T],[H|T1]):-integer(H),only_numbers(T,T1).
only_numbers([H|T],L):- \+integer(H),only_numbers(T,L).
sint(L,S):-only_numbers(L,L1),minimo(L1,S).
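In SWI-Prolog the same filter-then-minimise idea can be sketched with library predicates: include/3 from library(apply) keeps the elements that satisfy a test, and min_list/2 from library(lists) returns the smallest number. The name sint_lib/2 is hypothetical; like the version above, it walks the list twice.

```prolog
:- use_module(library(apply)).   % include/3
:- use_module(library(lists)).   % min_list/2

% sint_lib(+List, -Min): Min is the smallest integer in List.
% Fails if List contains no integer.
sint_lib(List, Min) :-
    include(integer, List, Ints),
    min_list(Ints, Min).
```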

How can I verify if a coordinate is in a list

I'm generating random coordinates and adding them to my list, but first I need to verify whether the coordinate already exists. I'm trying to use member, but while debugging I saw that it isn't working:
My code is basically this:
% L is a list and Q is a count that define the number of coordinate
% X and Y are the coordinate members
% check if the coordinate already exists
% if exists, R is 0 and if not, R is 1
createCoordinates(L,Q) :-
random(1,10,X),
random(1,10,Y),
convertNumber(X,Z),
checkCoordinate([Z,Y],L,R),
(R is 0 -> print('member'), createCoordinates(L,Q); print('not member'),createCoordinates(L,Q-1)).
checkCoordinate(C,L,R) :-
(member(C,L) -> R is 0; R is 1).
% transforms the number N in a letter L
convertNumber(N,L) :-
N is 1, L = 'A';
N is 2, L = 'B';
...
N is 10, L = 'J'.
%call createCoordinates
createCoordinates(L,20).
When I was debugging, this was the output:
In this picture I'm in the first iteration and L is empty, so R should be 1, but it is always 0: the coordinate is always reported as part of the list.
I have the impression that the member clause is adding the coordinate to my list, which doesn't make sense.
First off, I would recommend breaking your problem down into smaller pieces. You should have a procedure for making a random coordinate:
random_coordinate([X,Y]) :-
random(1, 10, XN), convertNumber(XN, X),
random(1, 10, Y).
Second, your checkCoordinate/3 is converting Prolog's success/failure into an integer, which is just busy work for Prolog and not really improving life for you. memberchk/2 is completely sufficient to your task (member/2 would work too but is more powerful than necessary). The real problem here is not that member/2 didn't work, it's that you are trying to build up this list parameter on the way out, but you need it to exist on the way in to examine it.
We usually solve this kind of problem in Prolog by adding a third parameter and prepending values to the list on the way through. The base case then equates that list with the outbound list and we protect the whole thing with a lower-arity procedure. In other words, we do this:
random_coordinates(N, Coordinates) :- random_coordinates(N, [], Coordinates).
random_coordinates(0, Result, Result).
random_coordinates(N, CoordinatesSoFar, FinalResult) :- ...
Now that we have two things, memberchk/2 should work the way we need it to:
random_coordinates(N, CoordinatesSoFar, FinalResult) :-
N > 0, succ(N0, N), % count down, will need for recursive call
random_coordinate(Coord),
(memberchk(Coord, CoordinatesSoFar) ->
random_coordinates(N, CoordinatesSoFar, FinalResult)
;
random_coordinates(N0, [Coord|CoordinatesSoFar], FinalResult)
).
And this seems to do what we want:
?- random_coordinates(10, L), write(L), nl.
[[G,7],[G,3],[H,9],[H,8],[A,4],[G,1],[I,9],[H,6],[E,5],[G,8]]
?- random_coordinates(10, L), write(L), nl.
[[F,1],[I,8],[H,4],[I,1],[D,3],[I,6],[E,9],[D,1],[C,5],[F,8]]
Finally, I note you continue to use this syntax: N is 1, .... I caution you that this looks like an error to me because there is no distinction between this and N = 1, and your predicate could be stated somewhat tiresomely just with this:
convertNumber(1, 'A').
convertNumber(2, 'B').
...
My inclination would be to do it computationally with char_code/2 but this construction is actually probably better.
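For comparison, the computational variant mentioned above might look like the following sketch. It assumes the letters are numbered contiguously from 'A' (true for ASCII and Unicode) and, unlike the table of facts, it only works from number to letter; convert_number/2 is a hypothetical name.

```prolog
% convert_number(+N, -Letter): 1 -> 'A', 2 -> 'B', ..., 10 -> 'J'.
convert_number(N, Letter) :-
    integer(N),
    between(1, 10, N),
    Code is 0'A + N - 1,        % 0'A is the character code of 'A'
    char_code(Letter, Code).
```

The table of facts stays preferable when the mapping should also run backwards, from letter to number.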
Another hint that you are doing something wrong is that the parameter L to createCoordinates/2 gets passed along in all cases and is not examined in any of them. In Prolog, we often have variables that appear to just be passed around meaninglessly, but they usually change positions or are used multiple times, as in random_coordinates(0, Result, Result); while nothing appears to be happening there, what's actually happening is plumbing: the built-up parameter becomes the result value. Nothing interesting is happening to the variable directly there, but it is being plumbed around. But nothing is happening at all to L in your code, except it is supposedly being checked for a new coordinate. But you're never actually appending anything to it, so there's no reason to expect that anything would wind up in L.
Edit: Notice that @lambda.xy.x solves the problem in their answer by prepending the new coordinate in the head of the clause and examining the list only after the recursive call in the body, obviating the need for the second list parameter.
Edit 2: Also take a look at @lambda.xy.x's other solution, as it has better time complexity as N approaches 100.
Since I had already written it, here is an alternative solution. The building block is gen_coord_notin/2, which guarantees a fresh solution C with regard to an exclusion list Excl.
gen_coord_notin(C, Excl) :-
random(1,10,X),
random(1,10,Y),
( memberchk(X-Y, Excl) ->
gen_coord_notin(C, Excl)
;
C = X-Y
).
The trick is that we only unify C with the new result, if it is fresh.
Then we only have to fold the generations into N iterations:
gen_coords([], 0).
gen_coords([X|Xs], N) :-
N > 0,
M is N - 1,
gen_coords(Xs, M),
gen_coord_notin(X, Xs).
Remark 1: since coordinates are always 2-tuples, a list representation invites unwanted errors (e.g. writing [X|Y] instead of [X,Y]). Traditionally, an infix operator like - is used to separate the components of a tuple, but it's not any different from using coord(X,Y).
Remark 2: this predicate is inherently non-logical (i.e. calling gen_coords(X, 20) twice will result in different substitutions for X). You might use the meta-level predicates var/1, nonvar/1, ground/1, integer/1, etc. to guard against nonsensical calls like gen_coord_notin(1-2, [1-1]).
Remark 3: it is also important that the conditional does not have multiple solutions (compare member(X,[A,B]) and memberchk(X,[A,B])). In general, this can be achieved by calling once/1 but there is a specialized predicate memberchk/2 which I used here.
I just realized that the performance of my other solutions is very bad for N close to 100. The reason is that with diminishing possible coordinates, the generate and test approach will take longer and longer. There's an alternative solution which generates all coordinates and picks N random ones:
all_pairs(Ls) :-
findall(X-Y, (between(1,10,X), between(1,10,Y)), Ls).
remove_index(X,[X|Xs],Xs,0).
remove_index(I,[X|Xs],[X|Rest],N) :-
N > 0,
M is N - 1,
remove_index(I,Xs,Rest,M).
n_from_pool(_Pool, [], 0).
n_from_pool(Pool, [C|Cs], N) :-
N > 0,
M is N - 1,
length(Pool, L),
random(0,L,R),
remove_index(C,Pool,NPool,R),
n_from_pool(NPool, Cs, M).
gen_coords2(Xs, N) :-
all_pairs(Pool),
n_from_pool(Pool, Xs, N).
Now the query
?- gen_coords2(Xs, 100).
Xs = [4-6, 5-6, 5-8, 9-6, 3-1, 1-3, 9-4, 6-1, ... - ...|...] ;
false.
succeeds as expected. The error message
?- gen_coords2(Xs, 101).
ERROR: random/1: Domain error: `not_less_than_one' expected, found `0'
when we try to generate more distinct elements than possible is not nice, but better than non-termination.
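If SWI-Prolog's library(random) is available, the same pick-from-a-pool idea can be sketched more compactly by shuffling the pool once and taking a prefix. gen_coords3/2 is a hypothetical name, and random_permutation/2 is assumed to come from library(random).

```prolog
:- use_module(library(random)).  % random_permutation/2
:- use_module(library(lists)).   % append/3

% gen_coords3(-Xs, +N): Xs is a list of N distinct random coordinates
% X-Y with X and Y in 1..10.
gen_coords3(Xs, N) :-
    findall(X-Y, (between(1,10,X), between(1,10,Y)), Pool),
    random_permutation(Pool, Shuffled),
    length(Xs, N),
    append(Xs, _, Shuffled).     % Xs is the first N shuffled pairs
```

With this formulation, asking for more than 100 coordinates simply fails, because no prefix of that length exists.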

Fold over a partial list

This is a question provoked by an already deleted answer to this question. The issue could be summarized as follows:
Is it possible to fold over a list, with the tail of the list generated while folding?
Here is what I mean. Say I want to calculate the factorial (this is a silly example but it is just for demonstration), and decide to do it like this:
fac_a(N, F) :-
must_be(nonneg, N),
( N =< 1
-> F = 1
; numlist(2, N, [H|T]),
foldl(multiplication, T, H, F)
).
multiplication(X, Y, Z) :-
Z is Y * X.
Here, I need to generate the list that I give to foldl. However, I could do the same in constant memory (without generating the list and without using foldl):
fac_b(N, F) :-
must_be(nonneg, N),
( N =< 1
-> F = 1
; fac_b_1(2, N, 2, F)
).
fac_b_1(X, N, Acc, F) :-
( X < N
-> succ(X, X1),
Acc1 is X1 * Acc,
fac_b_1(X1, N, Acc1, F)
; Acc = F
).
The point here is that unlike the solution that uses foldl, this uses constant memory: no need for generating a list with all values!
Calculating a factorial is not the best example, but it is easier to follow for the stupidity that comes next.
Let's say that I am really afraid of loops (and recursion), and insist on calculating the factorial using a fold. I still would need a list, though. So here is what I might try:
fac_c(N, F) :-
must_be(nonneg, N),
( N =< 1
-> F = 1
; foldl(fac_foldl(N), [2|Back], 2-Back, F-[])
).
fac_foldl(N, X, Acc-Back, F-Rest) :-
( X < N
-> succ(X, X1),
F is Acc * X1,
Back = [X1|Rest]
; Acc = F,
Back = []
).
To my surprise, this works as intended. I can "seed" the fold with an initial value at the head of a partial list, and keep on adding the next element as I consume the current head. The definition of fac_foldl/4 is almost identical to the definition of fac_b_1/4 above: the only difference is that the state is maintained differently. My assumption here is that this should use constant memory: is that assumption wrong?
I know this is silly, but it could however be useful for folding over a list that cannot be known when the fold starts. In the original question we had to find a connected region, given a list of x-y coordinates. It is not enough to fold over the list of x-y coordinates once (you can however do it in two passes; note that there is at least one better way to do it, referenced in the same Wikipedia article, but this also uses multiple passes; altogether, the multiple-pass algorithms assume constant-time access to neighboring pixels!).
My own solution to the original "regions" question looks something like this:
set_region_rest([A|As], Region, Rest) :-
sort([A|As], [B|Bs]),
open_set_closed_rest([B], Bs, Region0, Rest),
sort(Region0, Region).
open_set_closed_rest([], Rest, [], Rest).
open_set_closed_rest([X-Y|As], Set, [X-Y|Closed0], Rest) :-
X0 is X-1, X1 is X + 1,
Y0 is Y-1, Y1 is Y + 1,
ord_intersection([X0-Y,X-Y0,X-Y1,X1-Y], Set, New, Set0),
append(New, As, Open),
open_set_closed_rest(Open, Set0, Closed0, Rest).
Using the same "technique" as above, we can twist this into a fold:
set_region_rest_foldl([A|As], Region, Rest) :-
sort([A|As], [B|Bs]),
foldl(region_foldl, [B|Back],
closed_rest(Region0, Bs)-Back,
closed_rest([], Rest)-[]),
!,
sort(Region0, Region).
region_foldl(X-Y,
closed_rest([X-Y|Closed0], Set)-Back,
closed_rest(Closed0, Set0)-Back0) :-
X0 is X-1, X1 is X + 1,
Y0 is Y-1, Y1 is Y + 1,
ord_intersection([X0-Y,X-Y0,X-Y1,X1-Y], Set, New, Set0),
append(New, Back0, Back).
This also "works". The fold leaves behind a choice point, because I haven't articulated the end condition as in fac_foldl/4 above, so I need a cut right after it (ugly).
The Questions
Is there a clean way of closing the list and removing the cut? In the factorial example, we know when to stop because we have additional information; however, in the second example, how do we notice that the back of the list should be the empty list?
Is there a hidden problem I am missing?
This looks somehow similar to the Implicit State with DCGs, but I have to admit I never quite got how that works; are these connected?
You are touching on several extremely interesting aspects of Prolog, each well worth several separate questions on its own. I will provide a high-level answer to your actual questions, and hope that you post follow-up questions on the points that are most interesting to you.
First, I will trim down the fragment to its essence:
essence(N) :-
foldl(essence_(N), [2|Back], Back, _).
essence_(N, X0, Back, Rest) :-
( X0 #< N ->
X1 #= X0 + 1,
Back = [X1|Rest]
; Back = []
).
Note that this prevents the creation of extremely large integers, so that we can really study the memory behaviour of this pattern.
To your first question: Yes, this runs in O(1) space (assuming constant space for arising integers).
Why? Because although you continuously create lists in Back = [X1|Rest], these lists can all be readily garbage collected because you are not referencing them anywhere.
To test memory aspects of your program, consider for example the following query, and limit the global stack of your Prolog system so that you can quickly detect growing memory by running out of (global) stack:
?- length(_, E),
N #= 2^E,
portray_clause(N),
essence(N),
false.
This yields:
1.
2.
...
8388608.
16777216.
etc.
It would be completely different if you referenced the list somewhere. For example:
essence(N) :-
foldl(essence_(N), [2|Back], Back, _),
Back = [].
With this very small change, the above query yields:
?- length(_, E),
N #= 2^E,
portray_clause(N),
essence(N),
false.
1.
2.
...
1048576.
ERROR: Out of global stack
Thus, whether a term is referenced somewhere can significantly influence the memory requirements of your program. This sounds quite frightening, but really is hardly an issue in practice: You either need the term, in which case you need to represent it in memory anyway, or you don't need the term, in which case it is simply no longer referenced in your program and becomes amenable to garbage collection. In fact, the amazing thing is rather that GC works so well in Prolog also for quite complex programs that not much needs to be said about it in many situations.
On to your second question: Clearly, using (->)/2 is almost always highly problematic in that it limits you to a particular direction of use, destroying the generality we expect from logical relations.
There are several solutions for this. If your CLP(FD) system supports zcompare/3 or a similar feature, you can write essence_/4 as follows:
essence_(N, X0, Back, Rest) :-
zcompare(C, X0, N),
closing(C, X0, Back, Rest).
closing(<, X0, [X1|Rest], Rest) :- X1 #= X0 + 1.
closing(=, _, [], _).
Another very nice meta-predicate called if_/3 was recently introduced in Indexing dif/2 by Ulrich Neumerkel and Stefan Kral. I leave implementing this with if_/3 as a very worthwhile and instructive exercise. Discussing this is well worth its own question!
On to the third question: How do states with DCGs relate to this? DCG notation is definitely useful if you want to pass around a global state to several predicates, where only a few of them need to access or modify the state, and most of them simply pass the state through. This is completely analogous to monads in Haskell.
The "normal" Prolog solution would be to extend each predicate with 2 arguments to describe the relation between the state before the call of the predicate, and the state after it. DCG notation lets you avoid this hassle.
Importantly, using DCG notation, you can copy imperative algorithms almost verbatim to Prolog, without the hassle of introducing many auxiliary arguments, even if you need global states. As an example for this, consider a fragment of Tarjan's strongly connected components algorithm in imperative terms:
function strongconnect(v)
// Set the depth index for v to the smallest unused index
v.index := index
v.lowlink := index
index := index + 1
S.push(v)
This clearly makes use of a global stack and index, which ordinarily would become new arguments that you need to pass around in all your predicates. Not so with DCG notation! For the moment, assume that the global entities are simply easily accessible, and so you can code the whole fragment in Prolog as:
scc_(V) -->
vindex_is_index(V),
vlowlink_is_index(V),
index_plus_one,
s_push(V),
This is a very good candidate for its own question, so consider this a teaser.
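One common way to make the global entities "easily accessible" is to keep the state as the single element of the list that the DCG threads through, and to read or replace it with a pair of helper nonterminals. The sketch below is an assumption about how such helpers could look, not the code behind the fragment above: state//1, state//2, the state(Index, Stack) term and dcg_state_demo/1 are hypothetical, and only index_plus_one//0 and s_push//1 reuse names from the fragment.

```prolog
% state(S): S is the current state (the state is left in place).
state(S), [S] --> [S].
% state(S0, S): replace the current state S0 by S.
state(S0, S), [S] --> [S0].

% The state is a term state(Index, Stack).
index_plus_one -->
    state(state(I0, Stack), state(I, Stack)),
    { I is I0 + 1 }.

s_push(V) -->
    state(state(I, Stack), state(I, [V|Stack])).

% Run a few steps, starting from index 0 and an empty stack.
dcg_state_demo(Final) :-
    phrase(( s_push(a), index_plus_one, s_push(b) ),
           [state(0, [])], [Final]).
```

?- dcg_state_demo(F). gives F = state(1, [b,a]). None of the three steps mentions the state explicitly in the body of dcg_state_demo/1; it is passed along implicitly by the DCG translation.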
At last, I have a general remark: In my view, we are only at the beginning of finding a series of very powerful and general meta-predicates, and the solution space is still largely unexplored. call/N, maplist/[3,4], foldl/4 and other meta-predicates are definitely a good start. if_/3 has the potential to combine good performance with the generality we expect from Prolog predicates.
If your Prolog implementation supports freeze/2 or a similar predicate (e.g. SWI-Prolog), then you can use the following approach:
fac_list(L, N, Max) :-
(N >= Max, L = [Max], !)
;
freeze(L, (
L = [N|Rest],
N2 is N + 1,
fac_list(Rest, N2, Max)
)).
multiplication(X, Y, Z) :-
Z is Y * X.
factorial(N, Factorial) :-
fac_list(L, 1, N),
foldl(multiplication, L, 1, Factorial).
The example above first defines a predicate (fac_list) which creates a "lazy" list of increasing integer values starting from N up to a maximum value (Max), where the next list element is generated only after the previous one was "accessed" (more on that below). Then, factorial just folds multiplication over the lazy list, resulting in constant memory usage.
The key to understanding how this example works is remembering that Prolog lists are, in fact, just terms of arity 2 with the name '.' (actually, in SWI-Prolog 7 the name was changed, but this is not important for this discussion), where the first argument represents the list item and the second argument represents the tail (or the terminating element, the empty list []). For example, [1, 2, 3] can be represented as:
.(1, .(2, .(3, [])))
Then, freeze is defined as follows:
freeze(+Var, :Goal)
Delay the execution of Goal until Var is bound
This means if we call:
freeze(L, L=[1|Tail]), L = [A|Rest].
then the following steps will happen:
freeze(L, L=[1|Tail]) is called
Prolog "remembers" that when L will be unified with "anything", it needs to call L=[1|Tail]
L = [A|Rest] is called
Prolog unifies L with .(A, Rest)
This unification triggers execution of L=[1|Tail]
This, obviously, unifies L, which at this point is bound to .(A, Rest), with .(1, Tail)
As a result, A gets unified with 1.
We can extend this example as follows:
freeze(L1, L1=[1|L2]),
freeze(L2, L2=[2|L3]),
freeze(L3, L3=[3]),
L1 = [A|R2], % L1=[1|L2] is called at this point
R2 = [B|R3], % L2=[2|L3] is called at this point
R3 = [C]. % L3=[3] is called at this point
This works exactly like the previous example, except that it gradually generates 3 elements, instead of 1.
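The same mechanism scales from three hand-written suspensions to an unbounded list. A minimal sketch, assuming a system with freeze/2 such as SWI-Prolog; lazy_from/2, take/3 and lazy_demo/1 are hypothetical helpers:

```prolog
% lazy_from(+N, -L): L behaves like [N, N+1, N+2, ...]; each cell is
% created only when something unifies the current tail.
lazy_from(N, L) :-
    freeze(L, ( L = [N|T],
                N1 is N + 1,
                lazy_from(N1, T) )).

% take(+K, +List, -Prefix): Prefix holds the first K elements of List.
take(0, _, []) :- !.
take(K, [X|Xs], [X|Ys]) :-
    K > 0,
    K1 is K - 1,
    take(K1, Xs, Ys).

lazy_demo(Xs) :-
    lazy_from(1, L),
    take(3, L, Xs).
```

?- lazy_demo(Xs). gives Xs = [1,2,3]. Only three cells are ever built, because take/3 stops touching the tail once its counter reaches 0; the remaining tail stays an unbound variable with a suspended goal attached.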
As per Boris's request, here is the second example implemented using freeze. Honestly, I'm not quite sure whether this answers the question, as the code (and, IMO, the problem) is rather contrived, but here it is. At least I hope this will give other people an idea of what freeze might be useful for. For simplicity, I am using a 1D problem instead of a 2D one, but changing the code to use 2 coordinates should be rather trivial.
The general idea is to have (1) a function that generates a new Open/Closed/Rest/etc. state based on the previous one, (2) an "infinite" list generator which can be told to "stop" generating new elements from the "outside", and (3) a fold_step function which folds over the "infinite" list, generating a new state on each list item and, if that state is considered to be the last one, telling the generator to halt.
It is worth noting that the list's elements are used for no other reason than to tell the generator to stop. All calculation state is stored inside the accumulator.
Boris, please clarify whether this gives a solution to your problem. More precisely, what kind of data were you trying to pass to the fold step handler (Item, Accumulator, Next Accumulator)?
adjacent(X, Y) :-
succ(X, Y) ;
succ(Y, X).
state_seq(State, L) :-
(State == halt -> L = [], !)
;
freeze(L, (
L = [H|T],
freeze(H, state_seq(H, T))
)).
fold_step(Item, Acc, NewAcc) :-
next_state(Acc, NewAcc),
NewAcc = _:_:_:NewRest,
(var(NewRest) ->
Item = next ;
Item = halt
).
next_state(Open:Set:Region:_Rest, NewOpen:NewSet:NewRegion:NewRest) :-
Open = [],
NewOpen = Open,
NewSet = Set,
NewRegion = Region,
NewRest = Set.
next_state(Open:Set:Region:Rest, NewOpen:NewSet:NewRegion:NewRest) :-
Open = [H|T],
partition(adjacent(H), Set, Adjacent, NotAdjacent),
append(Adjacent, T, NewOpen),
NewSet = NotAdjacent,
NewRegion = [H|Region],
NewRest = Rest.
set_region_rest(Ns, Region, Rest) :-
Ns = [H|T],
state_seq(next, L),
foldl(fold_step, L, [H]:T:[]:_, _:_:Region:Rest).
One fine improvement to the code above would be making fold_step a higher order function, passing it next_state as the first argument.

Collect all "minimum" solutions from a predicate

Given the following facts in a database:
foo(a, 3).
foo(b, 2).
foo(c, 4).
foo(d, 3).
foo(e, 2).
foo(f, 6).
foo(g, 3).
foo(h, 2).
I want to collect all first arguments that have the smallest second argument, plus the value of the second argument. First try:
find_min_1(Min, As) :-
setof(B-A, foo(A, B), [Min-_|_]),
findall(A, foo(A, Min), As).
?- find_min_1(Min, As).
Min = 2,
As = [b, e, h].
Instead of setof/3, I could use aggregate/3:
find_min_2(Min, As) :-
aggregate(min(B), A^foo(A, B), Min),
findall(A, foo(A, Min), As).
?- find_min_2(Min, As).
Min = 2,
As = [b, e, h].
NB
This only gives the same results if I am looking for the minimum of a number. If an arithmetic expression is involved, the results might be different. If a non-number is involved, aggregate(min(...), ...) will throw an error!
Or, instead, I can use the full key-sorted list:
find_min_3(Min, As) :-
setof(B-A, foo(A, B), [Min-First|Rest]),
min_prefix([Min-First|Rest], Min, As).
min_prefix([Min-First|Rest], Min, [First|As]) :-
!,
min_prefix(Rest, Min, As).
min_prefix(_, _, []).
?- find_min_3(Min, As).
Min = 2,
As = [b, e, h].
Finally, to the question(s):
Can I do this directly with library(aggregate)? It feels like it should be possible....
Or is there a predicate like std::partition_point from the C++ standard library?
Or is there some easier way to do this?
EDIT:
To be more descriptive. Say there was a (library) predicate partition_point/4:
partition_point(Pred_1, List, Before, After) :-
partition_point_1(List, Pred_1, Before, After).
partition_point_1([], _, [], []).
partition_point_1([H|T], Pred_1, Before, After) :-
( call(Pred_1, H)
-> Before = [H|B],
partition_point_1(T, Pred_1, B, After)
; Before = [],
After = [H|T]
).
(I don't like the name but we can live with it for now)
Then:
find_min_4(Min, As) :-
setof(B-A, foo(A, B), [Min-X|Rest]),
partition_point(is_min(Min), [Min-X|Rest], Min_pairs, _),
pairs_values(Min_pairs, As).
is_min(Min, Min-_).
?- find_min_4(Min, As).
Min = 2,
As = [b, e, h].
What is the idiomatic approach to this class of problems?
Is there a way to simplify the problem?
Many of the following remarks could be added to many programs here on SO.
Imperative names
Every time you write an imperative name for something that is a relation, you reduce your understanding of relations. Not much, just a little bit. Many common Prolog idioms like append/3 do not set a good example. Think of append(As,As,AsAs). The first argument of find_min(Min, As) is the minimum. So minimum_with_nodes/2 might be a better name.
findall/3
Do not use findall/3 unless its uses are rigorously checked; essentially, everything must be ground. In your case it happens to work. But once you generalize foo/2 a bit, you will lose. And that is frequently a problem: you write a tiny program, and it seems to work.
Once you move to bigger ones, the same approach no longer works. findall/3 is (compared to setof/3) like a bull in a china shop, smashing the fine fabric of shared variables and quantification. Another problem is that accidental failure does not lead to failure of findall/3, which often leads to bizarre, hard-to-imagine corner cases.
Untestable, too specific program
Another problem is somewhat related to findall/3, too. Your program is so specific that it is quite improbable that you will ever test it. And marginal changes will invalidate your tests. So you will soon give up testing. Let's see what is specific: primarily the foo/2 relation. Yes, it is only an example. Think of how to set up a test configuration where foo/2 may change. After each change (writing a new file) you will have to reload the program. This is so complex that chances are you will never do it. I presume you do not have a test harness for that. Plunit, for one, does not cover such testing.
As a rule of thumb: If you cannot test a predicate on the top level you never will. Consider instead
minimum_with(Rel_2, Min, Els)
With such a relation, you can now have a generalized xfoo/3 with an additional parameter, say:
xfoo(o, A,B) :-
foo(A,B).
xfoo(n, A,B) :-
newfoo(A,B).
and you most naturally get two answers for minimum_with(xfoo(X), Min, Els). Had you used findall/3 instead of setof/3, you would already have serious problems. Or just in general: minimum_with(\A^B^member(A-B, [x-10,y-20]), Min, Els). So you can play around on the top level and produce lots of interesting test cases.
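To make the suggestion concrete, here is one possible sketch of minimum_with/3 built on setof/3, in the spirit of version 3 from the question. It is an assumption about what such a relation could look like, not the answerer's definition, and it does not address the border cases discussed next (for instance, answers that contain variables). The foo/2 facts are repeated so the block is self-contained.

```prolog
foo(a, 3). foo(b, 2). foo(c, 4). foo(d, 3).
foo(e, 2). foo(f, 6). foo(g, 3). foo(h, 2).

% minimum_with(:Rel_2, -Min, -Els): Els are all E with call(Rel_2, E, Min),
% where Min is the smallest second argument in the standard order of terms.
minimum_with(Rel_2, Min, Els) :-
    setof(V-E, call(Rel_2, E, V), [Min-E0|Pairs]),
    min_prefix(Pairs, Min, Els0),
    Els = [E0|Els0].

% Collect the values of the leading pairs whose key is identical to Min.
min_prefix([V-E|Pairs], Min, Els) :-
    V == Min,
    !,
    Els = [E|Els0],
    min_prefix(Pairs, Min, Els0).
min_prefix(_, _, []).
```

?- minimum_with(foo, Min, Els). gives Min = 2, Els = [b,e,h]. Because setof/3 backtracks over free variables of the goal, a call such as minimum_with(xfoo(X), Min, Els) yields one answer per value of X.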
Unchecked border cases
Your version 3 is clearly my preferred approach, however there are still some parts that can be improved. In particular, if there are answers that contain variables as a minimum. These should be checked.
And certainly, also setof/3 has its limits. And ideally you would test them. Answers should not contain constraints, in particular not in the relevant variables. This shows how setof/3 itself has certain limits. After the pioneering phase, SICStus produced many errors for constraints in such cases (mid 1990s), later changed to consequently ignoring constraints in built-ins that cannot handle them. SWI on the other hand does entirely undefined things here. Sometimes things are copied, sometimes not. As an example take:
setof(A, ( A in 1..3 ; A in 3..5 ), _) and setof(t, ( A in 1..3 ; A in 3.. 5 ), _).
By wrapping the goal this can be avoided.
call_unconstrained(Goal_0) :-
call_residue_vars(Goal_0, Vs),
( Vs = [] -> true ; throw(error(representation_error(constraint),_)) ).
Beware, however, that SWI has spurious constraints:
?- call_residue_vars(all_different([]), Xs).
Xs = [_A].
Not clear if this is a feature in the meantime. It has been there since the introduction of call_residue_vars/2 about 5 years ago.
I don't think that library(aggregate) covers your use case. aggregate(min) allows for one witness:
min(Expr, Witness)
A term min(Min, Witness), where Min is the minimal version of Expr over all solutions, and Witness is any other template applied to solutions that produced Min. If multiple solutions provide the same minimum, Witness corresponds to the first solution.
Some time ago, I wrote a small 'library', lag.pl, with predicates to aggregate with low overhead - hence the name (LAG = Linear AGgregate). I've added a snippet, that handles your use case:
integrate(min_list_associated, Goal, Min-Ws) :-
State = term(_, [], _),
forall(call(Goal, V, W), % W stands for witness
( arg(1, State, C), % C is current min
arg(2, State, CW), % CW are current min witnesses
( ( var(C) ; V @< C )
-> U = V, Ws = [W]
; U = C,
( C == V
-> Ws = [W|CW]
; Ws = CW
)
),
nb_setarg(1, State, U),
nb_setarg(2, State, Ws)
)),
arg(1, State, Min), arg(2, State, Ws).
It's a simple minded extension of integrate(min)...
The comparison method is surely questionable (it uses a less general operator for equality); it could be worth adopting a conventional call instead, like the one used by predsort/3. Efficiency-wise, it would be better still to encode the comparison method as an option in the 'function selector' (min_list_associated in this case).
Edit: thanks @false and @Boris for correcting the bug relative to the state representation. Calling nb_setarg(2, State, Ws) actually changes the term's shape when State = (_,[],_) was used. Will update the GitHub repo accordingly...
Using library(pairs) and sort/4, this can be simply written as:
?- bagof(B-A, foo(A, B), Ps),
sort(1, @=<, Ps, Ss), % or keysort(Ps, Ss)
group_pairs_by_key(Ss, [Min-As|_]).
Min = 2,
As = [b, e, h].
This call to sort/4 can be replaced with keysort/2, but with sort/4 one can also find, for example, the first arguments associated with the largest second argument: just use @>= as the second argument.
This solution is probably not as time and space efficient as the other ones, but may be easier to grok.
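The variation with the largest second argument can be sketched as follows. This assumes SWI-Prolog's sort/4 and library(pairs); the facts sample_foo/2 and the predicate name maximum_with/2 are hypothetical and only serve as illustration.

```prolog
:- use_module(library(pairs)).   % group_pairs_by_key/2

% Hypothetical sample data: sample_foo(Element, Value).
sample_foo(a, 5).
sample_foo(b, 2).
sample_foo(c, 3).
sample_foo(d, 5).

% maximum_with(Max, As): As are the first arguments of sample_foo/2
% that are associated with the largest second argument, Max.
maximum_with(Max, As) :-
    bagof(B-A, sample_foo(A, B), Ps),
    sort(1, @>=, Ps, Ss),                  % descending on the key, duplicates kept
    group_pairs_by_key(Ss, [Max-As|_]).    % first group holds the largest key
```

With these facts, ?- maximum_with(Max, As). should yield Max = 5 with As containing a and d.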
But there is another way to do it altogether:
?- bagof(A, ( foo(A, Min), \+ ( foo(_, Y), Y @< Min ) ), As).
Min = 2,
As = [b, e, h].

What is the bottleneck in this primes related predicate?

So here it is: I'm trying to calculate the sum of all primes below two million (for this problem), but my program is very slow. I do know that the algorithm itself is terribly bad, a brute-force one, but it seems way slower than it should be.
Here I limit the search to 20,000 so that the result doesn't take too long.
I don't think that this predicate is difficult to understand, but I'll explain it anyway: I calculate the list of all the primes below 20,000 and then sum them. The sum part is fine; the primes part is really slow.
problem_010(R) :-
p010(3, [], Primes),
sumlist([2|Primes], R).
p010(20001, Primes, Primes) :- !.
p010(Current, Primes, Result) :-
(
prime(Current, Primes)
-> append([Primes, [Current]], NewPrimes)
; NewPrimes = Primes
),
NewCurrent is Current + 2,
p010(NewCurrent, NewPrimes, Result).
prime(_, []) :- !.
prime(N, [Prime|_Primes]) :- 0 is N mod Prime, !, fail.
prime(ToTest, [_|Primes]) :- prime(ToTest, Primes).
I'd like some insight about why it is so slow. Is it a good implementation of the stupid brute force algorithm, or is there some reason that makes Prolog fall?
EDIT: I already found something: by appending new primes at the end instead of leaving them at the head of the list, the primes that divide most often come first, so it's ~3 times faster. Still need some insight though :)
First, Prolog does not fail here.
There are very smart ways to generate prime numbers. But as a cheap start, simply accumulate the primes in reversed order! (7.9s -> 2.6s) In this manner the smaller ones are tested sooner. Then, consider testing only against primes up to 141. A composite number below 20,000 always has a prime factor of at most 141 (since 141^2 < 20,000 < 142^2), so larger primes need not be tried.
Then, instead of stepping only through numbers not divisible by 2, you might add 3, 5, 7.
There are people writing papers on this "problem". See, for example this paper, although it's a bit of a sophistic discussion what the "genuine" algorithm actually was, 22 centuries ago when the latest release of the abacus was celebrated as Salamis tablets.
Consider using for example a sieve method ("Sieve of Eratosthenes"): First create a list [2,3,4,5,6,....N], using for example numlist/3. The first number in the list is a prime, keep it. Eliminate its multiples from the rest of the list. The next number in the remaining list is again a prime. Again eliminate its multiples. And so on. The list will shrink quite rapidly, and you end up with only primes remaining.
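The list-based elimination described above could be sketched like this. It assumes SWI-Prolog's numlist/3 and exclude/3; the predicate names primes_upto/2, sieve/2 and multiple_of/2 are made up for this example. Note that it finds the multiples by testing each remaining number with mod rather than by stepping through them, so it is a simple illustration and not a fast sieve.

```prolog
:- use_module(library(lists)).   % numlist/3
:- use_module(library(apply)).   % exclude/3

% primes_upto(+N, -Primes): Primes are all primes in 2..N, ascending.
primes_upto(N, Primes) :-
    N >= 2,
    numlist(2, N, Candidates),
    sieve(Candidates, Primes).

% The first remaining number is a prime: keep it, then drop its
% multiples from the rest of the list and continue with what is left.
sieve([], []).
sieve([P|Xs], [P|Ps]) :-
    exclude(multiple_of(P), Xs, Ys),
    sieve(Ys, Ps).

% multiple_of(+P, +X): X is a multiple of P.
multiple_of(P, X) :-
    X mod P =:= 0.
```

For example, ?- primes_upto(30, Ps). gives Ps = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29].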
First of all, appending at the end of a list using append/3 is quite slow. If you must, then use difference lists instead. (Personally, I try to avoid append/3 as much as possible)
Secondly, your prime/2 always iterates over the whole list when checking a prime. This is unnecessarily slow. You can instead just check if you can find an integral factor up to the square root of the number you want to check.
problem_010(R) :-
p010(3, 2, R).
p010(2000001, Primes, Primes) :- !.
p010(Current, In, Result) :-
( prime(Current) -> Out is In+Current ; Out=In ),
NewCurrent is Current + 2,
p010(NewCurrent, Out, Result).
prime(2).
prime(3).
prime(X) :-
integer(X),
X > 3,
X mod 2 =\= 0,
\+is_composite(X, 3). % was: has_factor(X, 3)
is_composite(X, F) :- % was: has_factor(X, F)
X mod F =:= 0, !.
is_composite(X, F) :-
F * F < X,
F2 is F + 2,
is_composite(X, F2).
Disclaimer: I found this implementation of prime/1 and has_factor/2 by googling.
This code gives:
?- problem_010(R).
R = 142913828922
Yes (12.87s cpu)
Here is even faster code:
problem_010(R) :-
Max = 2000001,
functor(Bools, [], Max),
Sqrt is integer(floor(sqrt(Max))),
remove_multiples(2, Sqrt, Max, Bools),
compute_sum(2, Max, 0, R, Bools).
% up to square root of Max, remove multiples by setting bool to 0
remove_multiples(I, Sqrt, _, _) :- I > Sqrt, !.
remove_multiples(I, Sqrt, Max, Bools) :-
arg(I, Bools, B),
(
B == 0
->
true % already removed: do nothing
;
J is 2*I, % start at next multiple of I
remove(J, I, Max, Bools)
),
I1 is I+1,
remove_multiples(I1, Sqrt, Max, Bools).
remove(I, _, Max, _) :- I > Max, !.
remove(I, Add, Max, Bools) :-
arg(I, Bools, 0), % remove multiple by setting bool to 0
J is I+Add,
remove(J, Add, Max, Bools).
% sum up places that are not zero
compute_sum(Max, Max, R, R, _) :- !.
compute_sum(I, Max, RI, R, Bools) :-
arg(I, Bools, B),
(B == 0 -> RO = RI ; RO is RI + I ),
I1 is I+1,
compute_sum(I1, Max, RO, R, Bools).
This runs an order of magnitude faster than the code I gave above:
?- problem_010(R).
R = 142913828922
Yes (0.82s cpu)
OK, before the edit the problem was just the algorithm (imho).
As you noticed, it's more efficient to check if the number is divisible by the smaller primes first; in a finite set, there are more numbers divisible by 3 than by 32147.
Another algorithm improvement is to stop checking when the primes are greater than the square root of the number.
Now, after your change there are indeed some Prolog issues:
You use append/3. append/3 is quite slow since you have to traverse the whole list to place the element at the end.
Instead, you should use difference lists, which make placing the element at the tail really fast.
Now, what is a difference list? Instead of creating a normal list [1,2,3] you create this one [1,2,3|T]. Notice that we leave the tail uninstantiated. Then, if we want to add one element (or more) at the end of the list we can simply say T=[4|NT]. awesome?
The following solution (accumulate primes in reverse order, stop when prime>sqrt(N), difference lists to append) takes 0.063 sec for the 20k limit and 17 sec for the 2m limit, while your original code took 3.7 sec for 20k and the append/3 version 1.3 sec.
problem_010(R) :-
p010(3, Primes, Primes),
sumlist([2|Primes], R).
p010(2000001, _Primes,[]) :- !. %checking for primes till 2mil
p010(Current, Primes,PrimesTail) :-
R is sqrt(Current),
(
prime(R,Current, Primes)
-> PrimesTail = [Current|NewPrimesTail]
; NewPrimesTail = PrimesTail
),
NewCurrent is Current + 2,
p010(NewCurrent, Primes,NewPrimesTail).
prime(_,_, Tail) :- var(Tail),!.
prime(R,_N, [Prime|_Primes]):-
Prime>R.
prime(_R,N, [Prime|_Primes]) :-0 is N mod Prime, !, fail.
prime(R,ToTest, [_|Primes]) :- prime(R,ToTest, Primes).
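The difference-list idea used in this solution can also be tried in isolation. The following is a minimal sketch; dl_add_last/3 and dl_demo/1 are hypothetical names introduced only for this illustration.

```prolog
% dl_add_last(+X, ?Tail0, -Tail): put X at the end of an open list by
% binding its uninstantiated tail Tail0 to [X|Tail]. No traversal of
% the list is needed, so this takes constant time.
dl_add_last(X, [X|Tail], Tail).

% Start with the open list [1,2,3|T0], add 4 and 5 at the end,
% then close the list by binding the last tail to [].
dl_demo(List) :-
    List = [1,2,3|T0],
    dl_add_last(4, T0, T1),
    dl_add_last(5, T1, T2),
    T2 = [].
```

Here ?- dl_demo(L). gives L = [1, 2, 3, 4, 5]. In p010/3 above, PrimesTail plays the role of the open tail and the base clause closes the list with [].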
Also, consider adding the numbers while you generate them, to avoid the extra O(n) pass caused by sumlist/2.
In the end, you can always implement the AKS algorithm that runs in polynomial time (XD)

Resources