I am new to Prolog (and fairly new to CS/programming in general), and I'm trying to assess and improve my programs' performance by using the time/1 predicate. However, I'm not sure I understand the output. For instance, the query time("MyProgram") yields the following result in addition to the solution to "MyProgram":
% 34,865,980 inferences, 4.479 CPU in 4.549 seconds (98% CPU, 7784905 Lips)
What does this mean? There is somewhat of an explanation here but I'm finding it's not quite enough.
Thanks in advance!
Firstly, see this answer for some general information about the difficulties of benchmarking in Prolog, or any programming language for that matter. The answer concerns the ECLiPSe language which uses Prolog internally so you'll be familiar with the syntax.
Now, let's look at a simple example:
equal_to_one(X) :- X =:= 1.
If we trace the execution (which by the way is a great way to better understand how Prolog works), we get:
?- trace, foo(1).
Call: (7) foo(1) ? creep
Call: (8) 1=:=1 ? creep
Exit: (8) 1=:=1 ? creep
Exit: (7) foo(1) ? creep
Notice the two calls and two exits occurring in the trace. In the first call, foo(1) is matched with defined facts/rules in the Prolog file and successfully finds foo/1, whereafter in the second call, the body is (successfully) executed. Subsequently the two exits simply represent exiting out of the statements that were true (both calls).
When we run our program with time/1, we see:
?- time(foo(1)).
% 2 inferences, 0.000 CPU in 0.000 seconds (86% CPU, 69691 Lips)
true.
?- time(foo(2)).
% 2 inferences, 0.000 CPU in 0.000 seconds (82% CPU, 77247 Lips)
false.
Both queries need 2 (logical) inferences to complete. These inferences represent the calls described above (i.e. the program 'tries to match' something twice, it doesn't matter whether the number is equal to one or not). It is because of this that inferences are a good indication of the performance of your program, being not based on any hardware-specific properties, but rather on the complexity of your algorithm(s).
Furthermore we see CPU and seconds, which respectively represent cpu-time and overall clock-time spent while executing the program (see referred SO answer for more information).
Finally, we see a different % CPU and LIPS for each execution. You shouldn't worry too much about these numbers as they represent the percentage CPU used and average Logical Inferences Per Second made and for obvious reasons these will always differ for each execution.
PS : a similar SO question can be found here
The meaning is as follows. The basic data is sampled via the following calls:
get_time(Wall)
statistics(cputime, Time)
statistics(inferences, Inferences)
What is then shown is:
'%1 inferences, %2 CPU in %3 seconds (%4% CPU, %5 Lips)'
%1: Inferences2-Inferences1
%2: Time2-Time1
%3: Wall2-Wall1
%4: round(100*%2/%3)
%5: integer(%1/%2)
In a single threaded application and no other applications, we have still %2 =< %3 if there is a separate GC thread, subsequently %4 will be a precentage below or equal 100. If your application isn't doing I/O, and your percentage is very low, you might have a locking problem somewhere.
Related
Context
I'm writing some high-performance code for ARM64 using NEON SIMD instructions, which I am trying to further optimize. I only use integer operations, no floating-point. This code is fully CPU- or memory-bound: it does not perform system calls or I/O of any kind (filesystem, networking, or anything else). The code is single-threaded by design -- any parallelism should be handled by calling the code from different CPUs with different arguments. The data working set should be small enough to fit in my CPU's L1 D-cache, and if it overflows a little, it will definitely fit in L2 with lots of space to spare.
My development environment is an Apple laptop with the M1 processor, running macOS; as such, the prime choice for a performance investigation tool is Apple's Instruments. I know VTune has some more advanced features such as top-down microarchitecture analysis, but evidently this isn't available for ARM.
The problem
I had an idea that, at a high level, works like this: a certain function f(x, y) can be broken down into two functions g() and h(). I can calculate x2 = g(x), y2 = g(y) and then h(x2, y2), obtaining the same result as f(x, y). However, it turns out that I compute f() many times with different combinations of the same input arguments. By applying all these inputs to g() and caching their outputs, I can directly call the output of h()with these cached values and save some time recomputing theg()-part of f()`.
Benchmarks
I confirmed the basic idea is sound by microbenchmarking with Google Benchmark. If f() takes 100 X (where X is some arbitrary unit of time), then each call to g() takes 14 X, and a call to h() takes 78 X. While it's longer to call g() twice then h() rather than f(), suppose I need to compute f(x, y) and f(x, z), which would ordinarily take 200 X. I can instead compute x2 = g(x), y2 = g(y) and z2 = g(z), taking 3*14 = 42 X, and then h(x2, y2) and h(x2, z2), taking 2*78 = 156 X. In total, I spend 156 + 42 = 198 X, which is already less than 200 X, and the savings would add up for larger examples, up to maximum of 22%, since this is how much less h() costs compared to f() (assuming I compute h() much more often than g()). This would represent a significant speedup for my application.
I proceeded to test this idea on a more realistic example: I have some code which does a bunch of things, plus 3 calls to f() which, among themselves, use combinations of the same 2 arguments. So, I replace 3 calls to f() by 2 calls to g() and 3 calls to h(). The benchmarks above indicate this should reduce execution time by 3*100 - 2*14 - 3*78 = 38 X. However, benchmarking the modified code shows that execution time increases by ~700 X!
I tried replacing each call to f() individually with 2 calls to g() for its arguments and a call to h(). This should increase execution time by 2*14 + 78 - 100 = 6 X, but instead, execution time increases by 230 X (not coincidentally, approximately 1/3 of 700 X).
Performance counter results using Apple Instruments
To bring some data to the discussion, I ran both codes under Apple Instruments using the CPU counters template, monitoring some performance counters I thought might be relevant.
For reference, the original code executes in 7.6 seconds (considering only number of iterations times execution time per iteration, i.e. disregarding Google Benchmark overhead), whereas the new code executes in 9.4 seconds; i.e. a difference of 1.8 seconds. Both versions use the exact same number of iterations and work on the same input, producing the same output. The code runs on the M1's performance core, which I assume is running at its maximum 3.2 GHz clock speed.
Parameter
Original code
New code
Total cycles
22,199,155,777
27,510,276,704
MAP_DISPATCH_BUBBLE
78,611,658
6,438,255,204
L1D_CACHE_MISS_LD
892,442
1,808,341
L1D_CACHE_MISS_ST
2,163,402
4,830,661
L1I_CACHE_MISS_DEMAND
2,620,793
7,698,674
INST_SIMD_ALU
79,448,291,331
78,253,076,740
INST_SIMD_LD
17,254,640,147
16,867,679,279
INST_SIMD_ST
14,169,912,790
14,029,275,120
INST_INT_ALU
4,512,600,211
4,379,585,445
INST_INT_LD
550,965,752
546,134,341
INST_INT_ST
455,541,070
455,298,056
INST_ALL
119,683,934,968
118,972,558,207
MAP_STALL_DISPATCH
6,307,551,337
5,470,291,508
SCHEDULE_UOP
116,252,941,232
113,882,670,763
MAP_REWIND
16,293,616
11,787,119
FLUSH_RESTART_OTHER_NONSPEC
58,616
90,955
FETCH_RESTART
27,417,457
28,119,690
BRANCH_MISPRED_NONSPEC
432,761
465,697
L1I_TLB_MISS_DEMAND
754,161
1,492,705
L2_TLB_MISS_INSTRUCTION
485,702
1,217,474
MMU_TABLE_WALK_INSTRUCTION
486,812
1,219,082
BRANCH_MISPRED_NONSPEC
377,750
440,382
INST_BRANCH
1,194,614,553
1,151,040,641
Instruments won't let me add all these counters to the same run, so some results are from different runs. However, since the code is fully deterministic and runs the same number of iterations, any differences between runs should be just random noise.
EDIT: playing around with Instruments, I found one performance counter that has wildly differing values between the original code and the new code, which is MAP_DISPATCH_BUBBLE. Still doing research on what it means, whether it might explain the issues I'm seeing, and how to work around this.
EDIT 2: I decided to test this code on other ARM processors I have access to (Cortex-X2 and Cortex-A72). On the Cortex-X2, both versions perform identically, and on the Cortex-A72, there was a small (~1.5%) increase in performance with the new code. So I'm more inclined than ever to believe that I hit an M1 front-end bottleneck.
Hypotheses and data analysis
Having faced previous performance problems with this code base before, some ideas sprung to mind:
Memory alignment: SIMD code is sometimes sensitive to memory alignment, particularly for memory-bound code, which I suspect my code may be. However, adding or removing __attribute__((aligned(64))) made no difference, so I don't think that's it.
D-cache misses: the new code allocates some new arrays to cache the output of g(), so it might lead to more cache misses. And indeed there are 3.6 million more L1 D-cache misses (load + store) than the original code. However, as I've mentioned at the beginning, the working set easily fits into L2. Assuming a 10-cycle L2 cache miss cost, that's only 36 million cycles. At 3.2 GHz, that's just 1.1 ms, i.e. < 0.1% of the observed performance difference.
I-cache misses: a similar situation: there's an extra 5.1 million L1 I-cache misses, but at a 10-cycle cost, we're looking at 1.6 ms, again < 0.1% of the observed performance difference.
Inlining/unrolling: I employ aggressive inlining and loop unrolling on my code, as well as LTO and unity builds, since performance is the #1 priority and code size is irrelevant (unless it affects performance via e.g. I-cache misses). I considered the possibility that the new code might be inlining/unrolling less aggressively due to the compiler hitting some kind of heuristic for maximum code size. This might result in more instructions being executed, such as compares/branches for loops, and CALL/RET and function prologues/epilogues for function call. However, the table shows that the new code executes a bit fewer instructions of each kind (as I would expect), and of course, in total (INST_ALL).
Somehow, the original code simply achieves a higher IPC, and I have no idea why. Also, to be clear: both codes perform the same operation using the same algorithm. What I did was to basically the code for f() (a bunch of function calls to other subroutines) between g() and h().
The question
This brings me to my question: what could possibly be making the new code run slower than the old code? What other performance counters could I look at in Instruments to give me insight into this issue?
Beyond answers to this specific question, I'm looking for general advice on how to approach similar problems like this in the future. I've found some books about debugging performance problems, but they generally fall into two camps. The first just describes the profiling process I'm familiar with: find out which functions take the longest to execute and optimize them. The second is represented by books like Systems Performance: Enterprise and the Cloud and The Every Computer Performance Book, and is closer to what I'm looking for. However, they look at system-level issues like I/O, kernel calls, etc.; the kind of code I write is CPU- and maybe memory-bound, with many opportunities to convert to SIMD, and no interaction with the outside world. Basically, I'd like to know how to design meaningful experiments using a profiler and CPU performance counters (cycle counters, cache misses, instructions executed by type such as ALU, memory, etc.) to solve these kinds of performance issues with my code when they arise.
Are there any up-to-date Prolog implementation benchmarks (with results)?
I found this on the mercury web site. Surprisingly, it shows a 20-fold gap between swi-prolog and Aquarius. I suspect that these results are pretty old. Does this gap still hold? Personally, I'd also like to see some comparisons with the occurs check turned on, since it has a major impact on performance, and some compilers might be better than others at optimizing it away.
Of more recent comparisons, I found this claim that gnu-prolog is 2x faster than SWI, and YAP is 4x faster than SWI on one specific code base.
Edit:
a specific case where the occurs check is needed for a real world problem
Sure: type inference in Haskell, OCaml, Swift or theorem provers such as this one. I also think the burden is on the programmer to prove that his code doesn't need the occurs check. Tests can only prove that you do need it, not that you don't need it.
I have some benchmark results published at:
https://logtalk.org/performance.html
Be sure to read and understand the notes at the end of that page, however.
Regarding running benchmarks with GNU Prolog, note that you cannot use the top-level interpreter as code loaded from it is interpreted, not compiled (see GNU Prolog documentation on gplc). In general, is not uncommon to see people running benchmarks from the top-level interpreter, forgetting what the word interpreter means, and publishing bogus stats where compilation/term-expansion/... steps mistakenly end up mixed with what's supposed to be benchmarked.
There's also a classical set of Prolog benchmarks that can be used for comparing Prolog implementations. Some Prolog systems include them (e.g. SWI-Prolog). They are also included in the Logtalk distribution, which allows running them with the supported backends:
https://github.com/LogtalkDotOrg/logtalk3/tree/master/examples/bench
In the current Logtalk git version, you can start it with the backend you want to benchmark and use the queries:
?- {bench(loader)}.
...
?- run.
These will run each benchmark 1000 times are reported the total time. Use run/1 for a different number of repetitions. For example, in my macOS system using SWI-Prolog 8.3.15 I get:
?- run.
boyer: 20.897818 seconds
chat_parser: 7.962188999999999 seconds
crypt: 0.14653999999999812 seconds
derive: 0.004462999999997663 seconds
divide10: 0.002300000000001745 seconds
log10: 0.0011489999999980682 seconds
meta_qsort: 0.2729539999999986 seconds
mu: 0.04534600000000211 seconds
nreverse: 0.016964000000001533 seconds
ops8: 0.0016230000000021505 seconds
poly_10: 1.9540520000000008 seconds
prover: 0.05286200000000463 seconds
qsort: 0.030829000000004214 seconds
queens_8: 2.2245050000000077 seconds
query: 0.11675499999999772 seconds
reducer: 0.00044199999999960937 seconds
sendmore: 3.048624999999994 seconds
serialise: 0.0003770000000073992 seconds
simple_analyzer: 0.8428750000000065 seconds
tak: 5.495768999999996 seconds
times10: 0.0019139999999993051 seconds
unify: 0.11229400000000567 seconds
zebra: 1.595203000000005 seconds
browse: 31.000829000000003 seconds
fast_mu: 0.04102400000000728 seconds
flatten: 0.028527999999994336 seconds
nand: 0.9632950000000022 seconds
perfect: 0.36678499999999303 seconds
true.
For SICStus Prolog 4.6.0 I get:
| ?- run.
boyer: 3.638 seconds
chat_parser: 0.7650000000000006 seconds
crypt: 0.029000000000000803 seconds
derive: 0.0009999999999994458 seconds
divide10: 0.001000000000000334 seconds
log10: 0.0009999999999994458 seconds
meta_qsort: 0.025000000000000355 seconds
mu: 0.004999999999999893 seconds
nreverse: 0.0019999999999997797 seconds
ops8: 0.001000000000000334 seconds
poly_10: 0.20500000000000007 seconds
prover: 0.005999999999999339 seconds
qsort: 0.0030000000000001137 seconds
queens_8: 0.2549999999999999 seconds
query: 0.024999999999999467 seconds
reducer: 0.001000000000000334 seconds
sendmore: 0.6079999999999997 seconds
serialise: 0.0019999999999997797 seconds
simple_analyzer: 0.09299999999999997 seconds
tak: 0.5869999999999997 seconds
times10: 0.001000000000000334 seconds
unify: 0.013000000000000789 seconds
zebra: 0.33999999999999986 seconds
browse: 4.137 seconds
fast_mu: 0.0070000000000014495 seconds
nand: 0.1280000000000001 seconds
perfect: 0.07199999999999918 seconds
yes
For GNU Prolog 1.4.5, I use the sample embedding script in logtalk3/scripts/embedding/gprolog to create an executable that includes the bench example fully compiled:
| ?- run.
boyer: 9.3459999999999983 seconds
chat_parser: 1.9610000000000003 seconds
crypt: 0.048000000000000043 seconds
derive: 0.0020000000000006679 seconds
divide10: 0.00099999999999944578 seconds
log10: 0.00099999999999944578 seconds
meta_qsort: 0.099000000000000199 seconds
mu: 0.012999999999999901 seconds
nreverse: 0.0060000000000002274 seconds
ops8: 0.00099999999999944578 seconds
poly_10: 0.72000000000000064 seconds
prover: 0.016000000000000014 seconds
qsort: 0.0080000000000008953 seconds
queens_8: 0.68599999999999994 seconds
query: 0.041999999999999815 seconds
reducer: 0.0 seconds
sendmore: 1.1070000000000011 seconds
serialise: 0.0060000000000002274 seconds
simple_analyzer: 0.25 seconds
tak: 1.3899999999999988 seconds
times10: 0.0010000000000012221 seconds
unify: 0.089999999999999858 seconds
zebra: 0.63499999999999979 seconds
browse: 10.923999999999999 seconds
fast_mu: 0.015000000000000568 seconds
(27352 ms) yes
I suggest you try these benchmarks, running them on your computer, with the Prolog systems that you want to compare. In doing that, remember that this is a limited set of benchmarks, not necessarily reflecting the actual relative performance in non-trivial applications.
Ratios:
SICStus/SWI GNU/SWI
boyer 17.4% 44.7%
browse 13.3% 35.2%
chat_parser 9.6% 24.6%
crypt 19.8% 32.8%
derive 22.4% 44.8%
divide10 43.5% 43.5%
fast_mu 17.1% 36.6%
flatten - -
log10 87.0% 87.0%
meta_qsort 9.2% 36.3%
mu 11.0% 28.7%
nand 13.3% -
nreverse 11.8% 35.4%
ops8 61.6% 61.6%
perfect 19.6% -
poly_10 10.5% 36.8%
prover 11.4% 30.3%
qsort 9.7% 25.9%
queens_8 11.5% 30.8%
query 21.4% 36.0%
reducer 226.2% 0.0%
sendmore 19.9% 36.3%
serialise 530.5% 1591.5%
simple_analyzer 11.0% 29.7%
tak 10.7% 25.3%
times10 52.2% 52.2%
unify 11.6% 80.1%
zebra 21.3% 39.8%
P.S. Be sure to use Logtalk 3.43.0 or later as it includes portability fixes for the bench example, including for GNU Prolog, and a set of basic unit tests.
I stumbled upon this comparison from 2008 in the Internet archive:
https://web.archive.org/web/20100227050426/http://www.probp.com/performance.htm
In my Prolog-inspired language Brachylog, there is the possibility to label CLP(FD)-equivalent variables that have potentially infinite domains. The code that does this labelization can be found here (thanks to Markus Triska #mat).
This predicate requires the existence of a predicate positive_integer/1, which must have the following behavior:
?- positive_integer(X).
X = 1 ;
X = 2 ;
X = 3 ;
X = 4 ;
…
This is implemented as such in our current solution:
positive_integer(N) :- length([_|_], N).
This has two problems that I can see:
This becomes slow fairly quickly:
?- time(positive_integer(100000)).
% 5 inferences, 0.000 CPU in 0.001 seconds (0% CPU, Infinite Lips)
?- time(positive_integer(1000000)).
% 5 inferences, 0.000 CPU in 0.008 seconds (0% CPU, Infinite Lips)
?- time(positive_integer(10000000)).
% 5 inferences, 0.062 CPU in 0.075 seconds (83% CPU, 80 Lips)
This ultimately returns an Out of global stack error for numbers that are a bit too big:
?- positive_integer(100000000).
ERROR: Out of global stack
This is obviously due to the fact that Prolog needs to instantiate the list, which is bad if its length is big.
How can we improve this predicate such that this works even for very big numbers, with the same behavior?
There are already many good ideas posted, and they work to various degrees.
Additional test case
#vmg has the right intuition: between/3 does not mix well with constraints. To see this, I would like to use the following query as an additional benchmark:
?- X #> 10^30, positive_integer(X).
Solution
With the test case in mind, I suggest the following solution:
positive_integer(I) :-
I #> 0,
( var(I) ->
fd_inf(I, Inf),
( I #= Inf
; I #\= Inf,
positive_integer(I)
)
; true
).
The key idea is to use the CLP(FD) reflection predicate fd_inf/2 to reason about the smallest element in the domain of a variable. This is the only predicate you will need to change when you port the solution to further Prolog systems. For example, in SICStus Prolog, the predicate is called fd_min/2.
Main features
portable to several Prolog systems with minimal changes
fast in the shown cases
works also in the test case above
advertises and uses the full power of CLP(FD) constraints.
It is of course very clear which of these points is most important.
Sample queries
creatio ex nihilo:
?- positive_integer(X).
X = 1 ;
X = 2 ;
X = 3 .
fixed integer:
?- X #= 12^42, time(positive_integer(X)).
% 4 inferences, 0.000 CPU in 0.000 seconds (68% CPU, 363636 Lips)
X = 2116471057875484488839167999221661362284396544.
constrained integer:
?- X #> 10^30, time(positive_integer(X)).
% 124 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 3647059 Lips)
X = 1000000000000000000000000000001 ;
% 206 inferences, 0.000 CPU in 0.000 seconds (93% CPU, 2367816 Lips)
X = 1000000000000000000000000000002 ;
% 204 inferences, 0.000 CPU in 0.000 seconds (92% CPU, 2428571 Lips)
X = 1000000000000000000000000000003 .
Other comments
First, make sure to check out Brachylog and the latest Brachylog solutions on Code Golf. Thanks to Julien's efforts, a language inspired by Prolog is now increasingly often hosting some of the most concise and elegant programs that are posted there. Awesome work Julien!
Please refrain from using implementation-specific anomalies of between/3: These destroy important semantic properties of the predicate and are not portable to other systems.
If you ignore (2), please use infinite instead of inf. In the context of CLP(FD), inf denotes the infimum of the set of integers, which is the exact opposite of positive infinity.
In the context of CLP(FD), I recommend to use CLP(FD) constraints instead of between/3 and other predicates that don't take constraints into account.
In fact, I recommend to use CLP(FD) constraints instead of all low-level predicates that reason over integers. This can at most make your programs more general, never more specific.
Many thanks for your interest in this question and the posted solutions! I hope you find the test case above useful for your variants, and find ways to take CLP(FD) constraints into account in your versions so that they run faster and we can all upvote them!
Since "Brachylog's interpreter is entirely written in Prolog" meaning SWI-Prolog, you can use between/3 with the second argument bound to inf.
Comparing your positive_integer with
positive_integer_b(X):- between(1,inf,X).
Tests on my machine:
?- time(positive_integer(10000000)).
% 5 inferences, 0.062 CPU in 0.072 seconds (87% CPU, 80 Lips)
true.
9 ?- time(positive_integer_b(10000000)).
% 2 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
true.
And showing "Out of global stack":
13 ?- time(positive_integer(100000000)).
% 5 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
ERROR: Out of global stack
14 ?- time(positive_integer_b(100000000)).
% 2 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
true.
I don't think between is pure prolog though.
In case you want your code to be runnable also on other systems, consider:
positive_integer(N) :-
( nonvar(N), % strictly not needed, but clearer
integer(N),
N > 0
-> true
; length([_|_], N)
).
This version produces exactly the same errors as your first try.
You have indeed spotted a bit towards a weakness in current length/2 implementations. Ideally, a goal like length([_|_], 1000000000000000) would take some time, but at least does not consume more than constant memory. On the other hand, I am not too sure if this is worth optimizing. After all, I do not see an easy way to solve the runtime problem for such cases.
Note that the version of between/3 in SWI-Prolog is highly specific to SWI. It makes termination arguments much more complex. In other systems like SICStus, you know for sure that between/3 is terminating, regardless of the arguments. In SWI you would have to prove, that the atom inf will not be encountered which raises the burden of proof obligation.
without between/3, and ISO compliant (I think)
positive_integer(1).
positive_integer(X) :-
var(X),
positive_integer(Y),
X is Y + 1.
positive_integer(X) :-
integer(X),
X > 0.
Firstly, I have read all other posts on SO regarding the usage of cuts in Prolog and definitely see the issues related to using them. However, there's still some unclarity for me and I'd like to settle this once and for all.
In the trivial example below, we recursively iterate through a list and check whether every 2nd element is equal to one. When doing so, the recursive process may end up in either one of following base cases: either an empty list or a list with a single element remains.
base([]).
base([_]).
base([_,H|T]) :- H =:= 1, base(T).
When executed:
?- time(base([1])).
% 0 inferences, 0.000 CPU in 0.000 seconds (74% CPU, 0 Lips)
true ;
% 2 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 99502 Lips)
false.
?- time(base([3,1,3])).
% 2 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 304044 Lips)
true ;
% 2 inferences, 0.000 CPU in 0.000 seconds (84% CPU, 122632 Lips)
false.
In such situations, I always used an explicit cut operator in the 2nd base case (i.e. the one representing one element left in the list) like below to do away with the redundant choice point.
base([]).
base([_]) :- !.
base([_,H|T]) :- H =:= 1, base(T).
Now we get:
?- time(base([1])).
% 1 inferences, 0.000 CPU in 0.000 seconds (81% CPU, 49419 Lips)
true.
?- time(base([3,1,3])).
% 3 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 388500 Lips)
true.
I understand that the behaviour of this cut is specific to the position of the rule and can be considered as bad practice.
Moving on however, one could reposition the cases as following:
base([_,H|T]) :- H =:= 1, base(T).
base([_]).
base([]).
which would also do away with the redundant choice point without using a cut, but of course, we would just shift the choice point to queries with lists with an even amount of digits like below:
?- time(base([3,1])).
% 2 inferences, 0.000 CPU in 0.000 seconds (82% CPU, 99157 Lips)
true ;
% 2 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 96632 Lips)
false.
So this is obviously no solution either. We could however adapt this order of rules with a cut as below:
base([_,H|T]) :- H =:= 1, base(T), !.
base([_]).
base([]).
as this would in fact leave no choice points. Looking at some queries:
?- time(base([3])).
% 1 inferences, 0.000 CPU in 0.000 seconds (81% CPU, 157679 Lips)
true.
?- time(base([3,1])).
% 3 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 138447 Lips)
true.
?- time(base([3,1,3])).
% 3 inferences, 0.000 CPU in 0.000 seconds (82% CPU, 393649 Lips)
true.
However, once again, this cut's behaviour only works correctly because of the ordering of the rules. If someone would reposition the base cases back to the original form as shown below:
base([]).
base([_]).
base([_,H|T]) :- H =:= 1, base(T), !.
we would still get the unwanted behaviour:
?- time(base([1])).
% 0 inferences, 0.000 CPU in 0.000 seconds (83% CPU, 0 Lips)
true ;
% 2 inferences, 0.000 CPU in 0.000 seconds (84% CPU, 119546 Lips)
false.
In these sort of scenarios, I always used the single cut in the second base case as I'm the only one ever going through my code and I got kind of used to it. However, I've been told in one of my answers on another SO post that this is not recommended usage of the cut operator and that I should try to avoid it as much as possible.
This brings me to my bipartite question:
If a cut, regardless of the position of the rule in which it is present, does change behaviour, but not the solution (as in the examples above), is it still considered to be bad practice?
If I would like to do away with a typical redundant choice point as the one in the examples above in order to make a predicate fully deterministic, is there any other, recommended, way to accomplish this rather than using cuts?
Thanks in advance!
Always try hard to avoid !/0. Almost invariably, !/0 completely destroys the declarative semantics of your program.
Everything that can be expressed by pattern matching should be expressed by pattern matching. In your example:
every_second([]).
every_second([_|Ls]) :-
every_second_(Ls).
every_second_([]).
every_second_([1|Rest]) :- every_second(Rest).
Like in your impure version, no choice points whatsoever remain for the examples you posted:
?- every_second([1]).
true.
?- every_second([3,1]).
true.
?- every_second([3,1,3]).
true.
Notice also that in this version, all predicates are completely pure and usable in all directions. The relation also works for the most general query and generates answers, just as we expect from a logical relation:
?- every_second(Ls).
Ls = [] ;
Ls = [_G774] ;
Ls = [_G774, 1] ;
Ls = [_G774, 1, _G780] ;
Ls = [_G774, 1, _G780, 1] .
None of the versions you posted can do this, due to the impure or non-declarative predicates (!/0, (=:=)/2) you use!
When reasoning about lists, you can almost always use pattern matching alone to distinguish the cases. If that is not possible, use for example if_/3 for logical purity while still retaining acceptable performance.
The trick is "currying" over number of unbounds in the rule:
base([]).
base([_|Q]) :- base2(Q).
base2([]).
base2([H|Q]) :- H =:= 1, base(Q).
However, it is a bad rule to say cuts are bad. In fact, my favorite will be:
base([]) :- !.
base([_]) :- !.
base([_,H|Q]) :- !, H =:= 1, base(Q).
Thing about this example of primes(++):
primes([5]).
primes([7]).
primes([11]).
vs
primes([5]) :- !.
primes([7]) :- !.
primes([11]) :- !.
I'm using time/1 to measure cpu time in YAP prolog and I'm getting for example
514.000 CPU in 0.022 seconds (2336363% CPU)
yes
What I'd like to ask is what is the interpretation of these numbers? Does 514.000 represents CPU secs? What is "0.022 seconds" and the CPU percentage that follows?
Thank you