In my Prolog-inspired language Brachylog, it is possible to label CLP(FD)-equivalent variables that have potentially infinite domains. The code that does this labeling can be found here (thanks to Markus Triska, @mat).
This predicate requires the existence of a predicate positive_integer/1, which must have the following behavior:
?- positive_integer(X).
X = 1 ;
X = 2 ;
X = 3 ;
X = 4 ;
…
In our current solution, it is implemented like this:
positive_integer(N) :- length([_|_], N).
This has two problems that I can see:
This becomes slow fairly quickly:
?- time(positive_integer(100000)).
% 5 inferences, 0.000 CPU in 0.001 seconds (0% CPU, Infinite Lips)
?- time(positive_integer(1000000)).
% 5 inferences, 0.000 CPU in 0.008 seconds (0% CPU, Infinite Lips)
?- time(positive_integer(10000000)).
% 5 inferences, 0.062 CPU in 0.075 seconds (83% CPU, 80 Lips)
This ultimately returns an Out of global stack error for numbers that are a bit too big:
?- positive_integer(100000000).
ERROR: Out of global stack
This is obviously due to the fact that Prolog needs to instantiate the list, which is bad if its length is big.
How can we improve this predicate such that this works even for very big numbers, with the same behavior?
There are already many good ideas posted, and they work to various degrees.
Additional test case
@vmg has the right intuition: between/3 does not mix well with constraints. To see this, I would like to use the following query as an additional benchmark:
?- X #> 10^30, positive_integer(X).
Solution
With the test case in mind, I suggest the following solution:
positive_integer(I) :-
I #> 0,
( var(I) ->
fd_inf(I, Inf),
( I #= Inf
; I #\= Inf,
positive_integer(I)
)
; true
).
The key idea is to use the CLP(FD) reflection predicate fd_inf/2 to reason about the smallest element in the domain of a variable. This is the only predicate you will need to change when you port the solution to further Prolog systems. For example, in SICStus Prolog, the predicate is called fd_min/2.
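As a rough analogue only (this is not the CLP(FD) mechanism itself), the enumeration strategy can be sketched in Python. The sketch assumes the domain is summarised by a single lower bound, which plays the role of the value returned by fd_inf/2; the function name and parameter are hypothetical.

```python
from itertools import islice

def positive_integers(lower_bound=1):
    """Yield the integers of the domain in increasing order.

    lower_bound stands in for the smallest domain element (what fd_inf/2
    reports in the Prolog version); it is an assumption of this sketch.
    """
    current = max(1, lower_bound)  # I #> 0 combined with any existing bound
    while True:
        yield current              # branch "I #= Inf"
        current += 1               # branch "I #\= Inf": drop the smallest element, repeat

# X #> 10^30 means the smallest admissible value is 10^30 + 1.
first_three = list(islice(positive_integers(10**30 + 1), 3))
from_nothing = list(islice(positive_integers(), 3))
print(first_three)
print(from_nothing)
```

No list of length N is ever built, which is why the size of the number does not matter here.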
Main features
portable to several Prolog systems with minimal changes
fast in the shown cases
works also in the test case above
advertises and uses the full power of CLP(FD) constraints.
It is of course very clear which of these points is most important.
Sample queries
creatio ex nihilo:
?- positive_integer(X).
X = 1 ;
X = 2 ;
X = 3 .
fixed integer:
?- X #= 12^42, time(positive_integer(X)).
% 4 inferences, 0.000 CPU in 0.000 seconds (68% CPU, 363636 Lips)
X = 2116471057875484488839167999221661362284396544.
constrained integer:
?- X #> 10^30, time(positive_integer(X)).
% 124 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 3647059 Lips)
X = 1000000000000000000000000000001 ;
% 206 inferences, 0.000 CPU in 0.000 seconds (93% CPU, 2367816 Lips)
X = 1000000000000000000000000000002 ;
% 204 inferences, 0.000 CPU in 0.000 seconds (92% CPU, 2428571 Lips)
X = 1000000000000000000000000000003 .
Other comments
1. First, make sure to check out Brachylog and the latest Brachylog solutions on Code Golf. Thanks to Julien's efforts, a language inspired by Prolog is now increasingly often hosting some of the most concise and elegant programs that are posted there. Awesome work Julien!
2. Please refrain from using implementation-specific anomalies of between/3: These destroy important semantic properties of the predicate and are not portable to other systems.
3. If you ignore (2), please use infinite instead of inf. In the context of CLP(FD), inf denotes the infimum of the set of integers, which is the exact opposite of positive infinity.
4. In the context of CLP(FD), I recommend to use CLP(FD) constraints instead of between/3 and other predicates that don't take constraints into account.
5. In fact, I recommend to use CLP(FD) constraints instead of all low-level predicates that reason over integers. This can at most make your programs more general, never more specific.
Many thanks for your interest in this question and the posted solutions! I hope you find the test case above useful for your variants, and find ways to take CLP(FD) constraints into account in your versions so that they run faster and we can all upvote them!
Since "Brachylog's interpreter is entirely written in Prolog" meaning SWI-Prolog, you can use between/3 with the second argument bound to inf.
Comparing your positive_integer with
positive_integer_b(X):- between(1,inf,X).
Tests on my machine:
?- time(positive_integer(10000000)).
% 5 inferences, 0.062 CPU in 0.072 seconds (87% CPU, 80 Lips)
true.
9 ?- time(positive_integer_b(10000000)).
% 2 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
true.
And showing "Out of global stack":
13 ?- time(positive_integer(100000000)).
% 5 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
ERROR: Out of global stack
14 ?- time(positive_integer_b(100000000)).
% 2 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
true.
I don't think between is pure prolog though.
In case you want your code to be runnable also on other systems, consider:
positive_integer(N) :-
( nonvar(N), % strictly not needed, but clearer
integer(N),
N > 0
-> true
; length([_|_], N)
).
This version produces exactly the same errors as your first try.
You have indeed spotted a weakness in current length/2 implementations. Ideally, a goal like length([_|_], 1000000000000000) would take some time, but would at least not consume more than constant memory. On the other hand, I am not too sure if this is worth optimizing. After all, I do not see an easy way to solve the runtime problem for such cases.
Note that the version of between/3 in SWI-Prolog is highly specific to SWI. It makes termination arguments much more complex. In other systems like SICStus, you know for sure that between/3 is terminating, regardless of the arguments. In SWI you would have to prove that the atom inf will not be encountered, which raises the burden of proof.
Without between/3, and ISO compliant (I think):
positive_integer(1).
positive_integer(X) :-
var(X),
positive_integer(Y),
X is Y + 1.
positive_integer(X) :-
integer(X),
X > 0.
I am using LogisticRegression for a classification problem with a large number of sparse features (tfidf vectors for documents to be specific) as well as a large number of classes. I noticed recently that performance seems to have dramatically worsened when upgrading to newer versions of scikit-learn. While it's hard to trace the exact origin of the performance problem, I did notice when profiling that ravel is called twice, which is taking up a large amount of the time at inference. What's interesting though, is that if I change the coef_ matrix to column-major order with np.asfortranarray, I recover the performance I am expecting. I also noticed that the problem only occurs when the input is sparse, as it is in my case.
Is there a way to change inference so that it is fastest with row-major ordering? I suspect you couldn't do this without having to transpose the input matrix to predict_proba, which would be worse, since the time taken by the raveling would then be unbounded. Or is there some flag to tell scikit-learn to use column-major ordering in order to avoid these calls during inference?
Example code below:
import scipy.sparse
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.random.rand(10_000, 10_000)
y = np.random.randint(0, 500, size=10_000)
clf = LogisticRegression(max_iter=10).fit(X, y)
%timeit clf.predict_proba(scipy.sparse.rand(1, 10_000))
# 21.9 ms ± 973 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%prun clf.predict_proba(scipy.sparse.rand(1, 10_000))
# ncalls tottime percall cumtime percall filename:lineno(function)
# 2 0.019 0.010 0.019 0.010 {method 'ravel' of 'numpy.ndarray' objects}
# 1 0.003 0.003 0.022 0.022 _compressed.py:493(_mul_multivector)
clf.coef_ = np.asfortranarray(clf.coef_)
%timeit clf.predict_proba(scipy.sparse.rand(1, 10_000))
# 467 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%prun clf.predict_proba(scipy.sparse.rand(1, 10_000))
# ncalls tottime percall cumtime percall filename:lineno(function)
# 1 0.000 0.000 0.000 0.000 {built-in method scipy.sparse._sparsetools.csr_matvecs}
# 1 0.000 0.000 0.000 0.000 {method 'choice' of 'numpy.random.mtrand.RandomState' objects}
As you can see, converting the matrix to column-major order reduced the runtime of the ravel calls by a large margin.
Sparse matmul is handled by scipy as a.dot(b), and it needs b to be in row-major order. In this case, when you call clf.predict_proba() you're calculating p @ clf.coef_.T, and clf.coef_.T is done by switching between row-major and column-major order (because doing it that way doesn't require a copy).
If clf.coef_ is row-major order (which it will be after the model is fit), clf.coef_.T is column-major order and calling clf.predict_proba() requires it to be fully copied in memory (in this case, by .ravel()) to return it to row-major order.
When you turn clf.coef_ to column-major order with clf.coef_ = np.asfortranarray(clf.coef_), you make it so clf.coef_.T is row-major order, and .ravel() is basically a noop as it makes a view into the existing C array which doesn't have to be copied.
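The layout argument can be checked directly in NumPy. The following is a small sketch that uses a made-up 3x4 array as a stand-in for clf.coef_ (the variable names are illustrative, not taken from scikit-learn):

```python
import numpy as np

# Stand-in for clf.coef_ as produced by fit(): row-major (C order).
coef = np.arange(12, dtype=float).reshape(3, 4)
# Same values, stored column-major (Fortran order).
coef_f = np.asfortranarray(coef)

# Transposing only swaps the strides, so it flips the memory order.
row_major_T_is_c = coef.T.flags["C_CONTIGUOUS"]    # transpose of C order is not C order
col_major_T_is_c = coef_f.T.flags["C_CONTIGUOUS"]  # transpose of F order is C order

# ravel() defaults to C order: it must copy when the data is not C-contiguous,
# and can return a view when it is.
ravel_copies = not np.shares_memory(coef.T.ravel(), coef)
ravel_is_view = np.shares_memory(coef_f.T.ravel(), coef_f)

print(row_major_T_is_c, col_major_T_is_c, ravel_copies, ravel_is_view)
```

With the row-major array, every call pays for a full copy of the transposed matrix; with the column-major array, ravel() only creates a view.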
You have already found the most efficient workaround for this, so I don't know that there's anything else to be done. You could also just make p dense with p.A; the scipy.sparse matmul isn't terribly efficient and doesn't handle edge conditions well. This isn't a new thing, and I don't know why you would not have seen it with older sklearn.
Are there any up-to-date Prolog implementation benchmarks (with results)?
I found this on the mercury web site. Surprisingly, it shows a 20-fold gap between swi-prolog and Aquarius. I suspect that these results are pretty old. Does this gap still hold? Personally, I'd also like to see some comparisons with the occurs check turned on, since it has a major impact on performance, and some compilers might be better than others at optimizing it away.
Of more recent comparisons, I found this claim that gnu-prolog is 2x faster than SWI, and YAP is 4x faster than SWI on one specific code base.
Edit:
a specific case where the occurs check is needed for a real world problem
Sure: type inference in Haskell, OCaml, Swift or theorem provers such as this one. I also think the burden is on the programmer to prove that his code doesn't need the occurs check. Tests can only prove that you do need it, not that you don't need it.
I have some benchmark results published at:
https://logtalk.org/performance.html
Be sure to read and understand the notes at the end of that page, however.
Regarding running benchmarks with GNU Prolog, note that you cannot use the top-level interpreter as code loaded from it is interpreted, not compiled (see GNU Prolog documentation on gplc). In general, it is not uncommon to see people running benchmarks from the top-level interpreter, forgetting what the word interpreter means, and publishing bogus stats where compilation/term-expansion/... steps mistakenly end up mixed with what's supposed to be benchmarked.
There's also a classical set of Prolog benchmarks that can be used for comparing Prolog implementations. Some Prolog systems include them (e.g. SWI-Prolog). They are also included in the Logtalk distribution, which allows running them with the supported backends:
https://github.com/LogtalkDotOrg/logtalk3/tree/master/examples/bench
In the current Logtalk git version, you can start it with the backend you want to benchmark and use the queries:
?- {bench(loader)}.
...
?- run.
These will run each benchmark 1000 times and report the total time. Use run/1 for a different number of repetitions. For example, on my macOS system using SWI-Prolog 8.3.15 I get:
?- run.
boyer: 20.897818 seconds
chat_parser: 7.962188999999999 seconds
crypt: 0.14653999999999812 seconds
derive: 0.004462999999997663 seconds
divide10: 0.002300000000001745 seconds
log10: 0.0011489999999980682 seconds
meta_qsort: 0.2729539999999986 seconds
mu: 0.04534600000000211 seconds
nreverse: 0.016964000000001533 seconds
ops8: 0.0016230000000021505 seconds
poly_10: 1.9540520000000008 seconds
prover: 0.05286200000000463 seconds
qsort: 0.030829000000004214 seconds
queens_8: 2.2245050000000077 seconds
query: 0.11675499999999772 seconds
reducer: 0.00044199999999960937 seconds
sendmore: 3.048624999999994 seconds
serialise: 0.0003770000000073992 seconds
simple_analyzer: 0.8428750000000065 seconds
tak: 5.495768999999996 seconds
times10: 0.0019139999999993051 seconds
unify: 0.11229400000000567 seconds
zebra: 1.595203000000005 seconds
browse: 31.000829000000003 seconds
fast_mu: 0.04102400000000728 seconds
flatten: 0.028527999999994336 seconds
nand: 0.9632950000000022 seconds
perfect: 0.36678499999999303 seconds
true.
For SICStus Prolog 4.6.0 I get:
| ?- run.
boyer: 3.638 seconds
chat_parser: 0.7650000000000006 seconds
crypt: 0.029000000000000803 seconds
derive: 0.0009999999999994458 seconds
divide10: 0.001000000000000334 seconds
log10: 0.0009999999999994458 seconds
meta_qsort: 0.025000000000000355 seconds
mu: 0.004999999999999893 seconds
nreverse: 0.0019999999999997797 seconds
ops8: 0.001000000000000334 seconds
poly_10: 0.20500000000000007 seconds
prover: 0.005999999999999339 seconds
qsort: 0.0030000000000001137 seconds
queens_8: 0.2549999999999999 seconds
query: 0.024999999999999467 seconds
reducer: 0.001000000000000334 seconds
sendmore: 0.6079999999999997 seconds
serialise: 0.0019999999999997797 seconds
simple_analyzer: 0.09299999999999997 seconds
tak: 0.5869999999999997 seconds
times10: 0.001000000000000334 seconds
unify: 0.013000000000000789 seconds
zebra: 0.33999999999999986 seconds
browse: 4.137 seconds
fast_mu: 0.0070000000000014495 seconds
nand: 0.1280000000000001 seconds
perfect: 0.07199999999999918 seconds
yes
For GNU Prolog 1.4.5, I use the sample embedding script in logtalk3/scripts/embedding/gprolog to create an executable that includes the bench example fully compiled:
| ?- run.
boyer: 9.3459999999999983 seconds
chat_parser: 1.9610000000000003 seconds
crypt: 0.048000000000000043 seconds
derive: 0.0020000000000006679 seconds
divide10: 0.00099999999999944578 seconds
log10: 0.00099999999999944578 seconds
meta_qsort: 0.099000000000000199 seconds
mu: 0.012999999999999901 seconds
nreverse: 0.0060000000000002274 seconds
ops8: 0.00099999999999944578 seconds
poly_10: 0.72000000000000064 seconds
prover: 0.016000000000000014 seconds
qsort: 0.0080000000000008953 seconds
queens_8: 0.68599999999999994 seconds
query: 0.041999999999999815 seconds
reducer: 0.0 seconds
sendmore: 1.1070000000000011 seconds
serialise: 0.0060000000000002274 seconds
simple_analyzer: 0.25 seconds
tak: 1.3899999999999988 seconds
times10: 0.0010000000000012221 seconds
unify: 0.089999999999999858 seconds
zebra: 0.63499999999999979 seconds
browse: 10.923999999999999 seconds
fast_mu: 0.015000000000000568 seconds
(27352 ms) yes
I suggest you try these benchmarks, running them on your computer, with the Prolog systems that you want to compare. In doing that, remember that this is a limited set of benchmarks, not necessarily reflecting the actual relative performance in non-trivial applications.
Ratios:
Benchmark          SICStus/SWI    GNU/SWI
boyer                    17.4%      44.7%
browse                   13.3%      35.2%
chat_parser               9.6%      24.6%
crypt                    19.8%      32.8%
derive                   22.4%      44.8%
divide10                 43.5%      43.5%
fast_mu                  17.1%      36.6%
flatten                      -          -
log10                    87.0%      87.0%
meta_qsort                9.2%      36.3%
mu                       11.0%      28.7%
nand                     13.3%          -
nreverse                 11.8%      35.4%
ops8                     61.6%      61.6%
perfect                  19.6%          -
poly_10                  10.5%      36.8%
prover                   11.4%      30.3%
qsort                     9.7%      25.9%
queens_8                 11.5%      30.8%
query                    21.4%      36.0%
reducer                 226.2%       0.0%
sendmore                 19.9%      36.3%
serialise               530.5%    1591.5%
simple_analyzer          11.0%      29.7%
tak                      10.7%      25.3%
times10                  52.2%      52.2%
unify                    11.6%      80.1%
zebra                    21.3%      39.8%
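For anyone repeating the measurements, the ratios are simply each system's total time divided by the SWI-Prolog time for the same benchmark. A minimal Python sketch, using three of the timings listed above (rounded to a few digits for brevity; the dictionary and function names are only illustrative):

```python
# Total times in seconds for 1000 runs, copied (rounded) from the listings above.
swi = {"boyer": 20.897818, "chat_parser": 7.962189, "serialise": 0.000377}
sicstus = {"boyer": 3.638, "chat_parser": 0.765, "serialise": 0.002}
gnu = {"boyer": 9.346, "chat_parser": 1.961, "serialise": 0.006}

def ratio_percent(other, baseline):
    """Time of the other system as a percentage of the baseline time."""
    return round(100 * other / baseline, 1)

ratios = {
    name: (ratio_percent(sicstus[name], swi[name]), ratio_percent(gnu[name], swi[name]))
    for name in swi
}

for name, (sicstus_vs_swi, gnu_vs_swi) in ratios.items():
    print(f"{name:12s} {sicstus_vs_swi:7.1f}% {gnu_vs_swi:7.1f}%")
```

Values below 100% mean the other system was faster than SWI-Prolog on that benchmark. The very large percentages for serialise come from timings close to the clock resolution, so they say little about real performance.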
P.S. Be sure to use Logtalk 3.43.0 or later as it includes portability fixes for the bench example, including for GNU Prolog, and a set of basic unit tests.
I stumbled upon this comparison from 2008 in the Internet archive:
https://web.archive.org/web/20100227050426/http://www.probp.com/performance.htm
Firstly, I have read all other posts on SO regarding the usage of cuts in Prolog and definitely see the issues related to using them. However, some things are still unclear to me and I'd like to settle this once and for all.
In the trivial example below, we recursively iterate through a list and check whether every 2nd element is equal to one. When doing so, the recursive process may end up in either one of following base cases: either an empty list or a list with a single element remains.
base([]).
base([_]).
base([_,H|T]) :- H =:= 1, base(T).
When executed:
?- time(base([1])).
% 0 inferences, 0.000 CPU in 0.000 seconds (74% CPU, 0 Lips)
true ;
% 2 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 99502 Lips)
false.
?- time(base([3,1,3])).
% 2 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 304044 Lips)
true ;
% 2 inferences, 0.000 CPU in 0.000 seconds (84% CPU, 122632 Lips)
false.
In such situations, I always used an explicit cut operator in the 2nd base case (i.e. the one representing one element left in the list) like below to do away with the redundant choice point.
base([]).
base([_]) :- !.
base([_,H|T]) :- H =:= 1, base(T).
Now we get:
?- time(base([1])).
% 1 inferences, 0.000 CPU in 0.000 seconds (81% CPU, 49419 Lips)
true.
?- time(base([3,1,3])).
% 3 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 388500 Lips)
true.
I understand that the behaviour of this cut is specific to the position of the rule and can be considered as bad practice.
Moving on however, one could reposition the cases as following:
base([_,H|T]) :- H =:= 1, base(T).
base([_]).
base([]).
which would also do away with the redundant choice point without using a cut, but of course, we would just shift the choice point to queries with lists with an even number of elements like below:
?- time(base([3,1])).
% 2 inferences, 0.000 CPU in 0.000 seconds (82% CPU, 99157 Lips)
true ;
% 2 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 96632 Lips)
false.
So this is obviously no solution either. We could however adapt this order of rules with a cut as below:
base([_,H|T]) :- H =:= 1, base(T), !.
base([_]).
base([]).
as this would in fact leave no choice points. Looking at some queries:
?- time(base([3])).
% 1 inferences, 0.000 CPU in 0.000 seconds (81% CPU, 157679 Lips)
true.
?- time(base([3,1])).
% 3 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 138447 Lips)
true.
?- time(base([3,1,3])).
% 3 inferences, 0.000 CPU in 0.000 seconds (82% CPU, 393649 Lips)
true.
However, once again, this cut only behaves correctly because of the ordering of the rules. If someone repositioned the base cases back to the original form as shown below:
base([]).
base([_]).
base([_,H|T]) :- H =:= 1, base(T), !.
we would still get the unwanted behaviour:
?- time(base([1])).
% 0 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 0 Lips)
true ;
% 2 inferences, 0.000 CPU in 0.000 seconds (84% CPU, 119546 Lips)
false.
In these sorts of scenarios, I always used the single cut in the second base case as I'm the only one ever going through my code and I got kind of used to it. However, I've been told in one of my answers on another SO post that this is not recommended usage of the cut operator and that I should try to avoid it as much as possible.
This brings me to my two-part question:
If a cut, regardless of the position of the rule in which it is present, does change behaviour, but not the solution (as in the examples above), is it still considered to be bad practice?
If I would like to do away with a typical redundant choice point as the one in the examples above in order to make a predicate fully deterministic, is there any other, recommended, way to accomplish this rather than using cuts?
Thanks in advance!
Always try hard to avoid !/0. Almost invariably, !/0 completely destroys the declarative semantics of your program.
Everything that can be expressed by pattern matching should be expressed by pattern matching. In your example:
every_second([]).
every_second([_|Ls]) :-
every_second_(Ls).
every_second_([]).
every_second_([1|Rest]) :- every_second(Rest).
Like in your impure version, no choice points whatsoever remain for the examples you posted:
?- every_second([1]).
true.
?- every_second([3,1]).
true.
?- every_second([3,1,3]).
true.
Notice also that in this version, all predicates are completely pure and usable in all directions. The relation also works for the most general query and generates answers, just as we expect from a logical relation:
?- every_second(Ls).
Ls = [] ;
Ls = [_G774] ;
Ls = [_G774, 1] ;
Ls = [_G774, 1, _G780] ;
Ls = [_G774, 1, _G780, 1] .
None of the versions you posted can do this, due to the impure or non-declarative predicates (!/0, (=:=)/2) you use!
When reasoning about lists, you can almost always use pattern matching alone to distinguish the cases. If that is not possible, use for example if_/3 for logical purity while still retaining acceptable performance.
The trick is "currying" over the number of unbound elements in the rule:
base([]).
base([_|Q]) :- base2(Q).
base2([]).
base2([H|Q]) :- H =:= 1, base(Q).
However, it is a bad rule to say cuts are bad. In fact, my favorite will be:
base([]) :- !.
base([_]) :- !.
base([_,H|Q]) :- !, H =:= 1, base(Q).
Think about this example of primes(++):
primes([5]).
primes([7]).
primes([11]).
vs
primes([5]) :- !.
primes([7]) :- !.
primes([11]) :- !.
I am new to Prolog (and fairly new to CS/programming in general), and I'm trying to assess and improve my programs' performance by using the time/1 predicate. However, I'm not sure I understand the output. For instance, the query time("MyProgram") yields the following result in addition to the solution to "MyProgram":
% 34,865,980 inferences, 4.479 CPU in 4.549 seconds (98% CPU, 7784905 Lips)
What does this mean? There is somewhat of an explanation here but I'm finding it's not quite enough.
Thanks in advance!
Firstly, see this answer for some general information about the difficulties of benchmarking in Prolog, or any programming language for that matter. The answer concerns the ECLiPSe language which uses Prolog internally so you'll be familiar with the syntax.
Now, let's look at a simple example:
foo(X) :- X =:= 1.
If we trace the execution (which by the way is a great way to better understand how Prolog works), we get:
?- trace, foo(1).
Call: (7) foo(1) ? creep
Call: (8) 1=:=1 ? creep
Exit: (8) 1=:=1 ? creep
Exit: (7) foo(1) ? creep
Notice the two calls and two exits occurring in the trace. In the first call, foo(1) is matched with defined facts/rules in the Prolog file and successfully finds foo/1, whereafter in the second call, the body is (successfully) executed. Subsequently the two exits simply represent exiting out of the statements that were true (both calls).
When we run our program with time/1, we see:
?- time(foo(1)).
% 2 inferences, 0.000 CPU in 0.000 seconds (86% CPU, 69691 Lips)
true.
?- time(foo(2)).
% 2 inferences, 0.000 CPU in 0.000 seconds (82% CPU, 77247 Lips)
false.
Both queries need 2 (logical) inferences to complete. These inferences represent the calls described above (i.e. the program 'tries to match' something twice, it doesn't matter whether the number is equal to one or not). It is because of this that inferences are a good indication of the performance of your program, being not based on any hardware-specific properties, but rather on the complexity of your algorithm(s).
Furthermore we see CPU and seconds, which respectively represent cpu-time and overall clock-time spent while executing the program (see referred SO answer for more information).
Finally, we see a different % CPU and LIPS for each execution. You shouldn't worry too much about these numbers as they represent the percentage CPU used and average Logical Inferences Per Second made and for obvious reasons these will always differ for each execution.
PS : a similar SO question can be found here
The meaning is as follows. The basic data is sampled via the following calls:
get_time(Wall)
statistics(cputime, Time)
statistics(inferences, Inferences)
What is then shown is:
'%1 inferences, %2 CPU in %3 seconds (%4% CPU, %5 Lips)'
%1: Inferences2-Inferences1
%2: Time2-Time1
%3: Wall2-Wall1
%4: round(100*%2/%3)
%5: integer(%1/%2)
In a single-threaded application with no other applications running, we still have %2 =< %3 if there is a separate GC thread, so %4 will be a percentage less than or equal to 100. If your application isn't doing I/O, and your percentage is very low, you might have a locking problem somewhere.
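To make the formulas concrete, here is a small Python sketch that recomputes %4 and %5 from the numbers printed in the question. The function name is made up, and the inputs are the rounded values shown on screen, not the raw samples the system actually used.

```python
def derived_stats(inferences, cpu_seconds, wall_seconds):
    """Recompute the bracketed part of a time/1 report.

    Returns (cpu_percent, lips) following %4 = round(100*%2/%3)
    and %5 = integer(%1/%2).
    """
    cpu_percent = round(100 * cpu_seconds / wall_seconds)
    lips = int(inferences / cpu_seconds)
    return cpu_percent, lips

# Values from: 34,865,980 inferences, 4.479 CPU in 4.549 seconds
cpu_percent, lips = derived_stats(34_865_980, 4.479, 4.549)
print(cpu_percent, lips)
```

The CPU percentage matches the printed 98%. The Lips value comes out close to, but not exactly, the printed 7784905, because the CPU time shown in the report is itself rounded to three decimals.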
I'm using time/1 to measure cpu time in YAP prolog and I'm getting for example
514.000 CPU in 0.022 seconds (2336363% CPU)
yes
What I'd like to ask is: what is the interpretation of these numbers? Does 514.000 represent CPU seconds? What is "0.022 seconds", and what is the CPU percentage that follows?
Thank you