I'm building a Definite Clause Grammar to parse 20,000 pieces of semi-natural text. As the size of my database of predicates grows (now up to 1,200 rules), parsing a string can take quite a long time -- particularly for strings that are not currently interpretable by the DCG, due to syntax I haven't yet encoded. The current worst-case is 3 minutes for a string containing 30 words. I'm trying to figure out how I can optimize this, or if I should just start researching cloud computing.
I'm using SWI-Prolog, and that provides a "profile" goal, which provides some statistics. I was surprised to find that the simplest rules in my database are taking up the majority of execution time. My corpus contains strings that represent numbers, and I want to capture these in a scalar/3 predicate. These are hogging ~50-60% of total execution time.
At the outset, I had 70 lines in my scalars.pl, representing the numeric and natural language representations of the numbers in my corpus. Like so:
scalar(scalar(3)) --> ["three"].
scalar(scalar(3)) --> ["3"].
scalar(scalar(4)) --> ["four"].
scalar(scalar(4)) --> ["4"].
...and so on.
Thinking that the length of the file was the problem, I put in a new rule that would automatically parse any numeric representations:
scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.
Thanks to that, I've gone from 70 rules to 31, which helped a bit -- but it wasn't a huge saving. Is there anything more that can be done? My feeling is maybe not, because what could be simpler than a single atom in a list?
These scalars are called in a lot of places throughout the grammar, and I assume that's the root of the issue. Though they're simple rules, they're everywhere, and unavoidably so. A highly general grammar just won't work for my application, and I wouldn't be surprised if I end up with 3,000 rules or more.
I've never built a DCG this large, so I'm not sure how much I can expect in terms of performance. Happy to take any kind of advice on this one: is there some other way of encoding these rules? Should I accept that some parses will take a long time, and figure out how to run parses in parallel?
Thank you in advance!
EDIT: I was asked to provide a reproducible example, but to do that I'd have to link SO to the entire project, since this is an issue of scale. Here's a toy version of what I'm doing for the sake of completeness. Just imagine there were large files describing hundreds of nouns, hundreds of verbs, and hundreds of syntactic structures.
sent(sent(VP, NP)) --> vp(VP), np(NP).
vp(vp(V)) --> v(V).
np(np(Qty, Noun)) --> qty(Qty), n(Noun).
scalar(scalar(3)) --> ["three"].
scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.
qty(qty(Scalar)) --> scalar(Scalar).
v(v(eat)) --> ["eat"].
n(n(pie)) --> ["pie"].

One aspect of your program that you might investigate is to make sure individual predicates succeed quickly and fail quickly. This is particularly useful to check for predicates that have many clauses.
For instance, when scalar(X) is evaluated on a token that is not a scalar, the program has to try all 31 clauses (by your last count) before it can determine that scalar//1 fails. If the structure of your program is such that scalar(X) is checked against every token, this could be very expensive.
Further, if scalar(X) does happen to find that a token matches but a subsequent goal fails then it appears that your program will retry the scalar(X) until all of the scalar//1 clauses have been attempted.
The judicious use of cut (!) or if-then-else (C1->G1;C2->G2;G3) can provide a tremendous performance improvement.
Or you can structure your predicates so that they rely on indexing to select the appropriate clause. E.g.:
scalar(scalar(N)) --> [Token], {scalar1(Token, scalar(N))}.
scalar1("3", scalar(3)) :- !.
scalar1(Y, scalar(X)) :- atom_number(Y, X).
This uses both cut and clause indexing (if the compiler provides it) with the scalar1/2 predicate.
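A minimal sketch of the same idea with a lookup table, assuming the tokens are atoms rather than double-quoted strings (unlike the grammar in the question). The names number_word/2 and token_number/2 are hypothetical and not part of the original code:

```prolog
% Hypothetical lookup table. With an atom as first argument, first-argument
% indexing can select the matching fact without trying the others.
number_word(three, 3).
number_word(four, 4).
number_word(five, 5).

% A single DCG clause consumes one token and classifies it.
scalar(scalar(N)) --> [Token], { token_number(Token, N) }.

% token_number(+Token, -N): table lookup first, digits second.
% The if-then-else commits to the first answer, so no choice point is
% left behind for later backtracking into scalar//1.
token_number(Token, N) :-
    (   number_word(Token, N0)
    ->  N = N0
    ;   atom(Token),
        atom_number(Token, N)      % handles tokens such as '3' or '42'
    ).
```

With this shape, a non-scalar token such as pie is rejected after one table lookup and one failed atom_number/2 call, instead of one attempt per clause.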
EDIT: You should read R. A. O'Keefe's The Craft of Prolog. It is an excellent guide to the practical aspects of Prolog.

Here's how I've tackled performance and optimization problems as a novice Prologer.
1.) Introduce timeouts to your application. I'm calling Prolog via the subprocess module in Python 3.6, and that allows you to set a timeout. As I've worked with my code base more, I've got a pretty good sense of how long a successful parse might take, and can assume anything taking longer is not going to work.
2.) Make use of the graphical profiler that's packaged with SWI-Prolog. This gives a lot more insight, as you can bounce around the call tree. I found it particularly helpful to sort predicates by the execution time of their children. Before, I was thinking about it like pollution in a river. "Man, there's a lot of junk floating in here," I thought, not considering that upstream some factories were contributing a lot of that junk.
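The timeout from point 1 can also be applied inside SWI-Prolog rather than from the calling process. A minimal sketch, assuming SWI-Prolog's library(time); with_timeout/3 is a hypothetical helper name:

```prolog
:- use_module(library(time)).      % call_with_time_limit/2 (SWI-Prolog)

:- meta_predicate with_timeout(+, 0, -).

% with_timeout(+Seconds, :Goal, -Result)
% Runs Goal once. Result is ok if Goal succeeded, failed if it failed,
% and timeout if the time limit was exceeded.
with_timeout(Seconds, Goal, Result) :-
    catch(
        (   call_with_time_limit(Seconds, Goal)
        ->  Result = ok
        ;   Result = failed
        ),
        time_limit_exceeded,       % exception thrown by call_with_time_limit/2
        Result = timeout
    ).
```

For the toy grammar in the question, a call could look like with_timeout(5, phrase(sent(Tree), ["eat", "three", "pie"]), Result).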
As for how to optimize a DCG without hurting the semantics & expressivity of one's grammar, I think that will have to be a question for another Stack Overflow. And as for my initial question, that's still an open one -- predicates that seem simple (to me) take quite a while.


examples of prolog meta-interpreter uses?

I'm reading several texts and online guides to understand the possibilities of prolog meta-interpreters.
The following seem like solid use cases:
proof explainers / tracers
changing proof search strategy, e.g., breadth first vs depth first
domain specific languages
Question - what other compelling use-cases are there?
Quoting from A Couple of Meta-interpreters in Prolog which is a part of the book "The Power of Prolog":
Further extensions
Other possible extensions are module systems, delayed goals, checking for various kinds of infinite loops, profiling, debugging, type systems, constraint solving etc. The overhead incurred by implementing these things using MIs can be compiled away using partial evaluation techniques. [...]
This quite extends your proposed uses, e.g., by
changing the search of p(X) :- p(s(X)). to detect loops (including "obvious" ones like this one),
hinting at where most compute time is spent ("profiling"),
or by reducing a program to a simpler fragment that is easier to analyse—but still has the property of interest: unexpected non-termination (explained via failure-slice), unexpected failure, or unexpected success.
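As a rough illustration of the "profiling" extension, here is a sketch of a vanilla meta-interpreter that counts clause resolutions. It only handles conjunctions and user-defined clauses (no built-ins, no cut), and nat/1 is a hypothetical example program:

```prolog
:- dynamic nat/1.          % dynamic so that clause/2 may inspect it portably

nat(z).
nat(s(X)) :- nat(X).

% mi(+Goal, +N0, -N): prove Goal; N - N0 is the number of clauses resolved.
mi(true, N, N) :- !.
mi((A, B), N0, N) :- !,
    mi(A, N0, N1),
    mi(B, N1, N).
mi(Head, N0, N) :-
    clause(Head, Body),    % pick a clause whose head unifies with the goal
    N1 is N0 + 1,
    mi(Body, N1, N).
```

A query such as mi(nat(s(s(z))), 0, N) then reports how many clauses were used in the proof, which is the kind of figure a profiler builds on.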

Looking for a more compact syntax for Prolog

Prolog is a nice language. I use it from time to time.
But each time I come back to it, I feel less comfortable with its syntax.
Modern programming languages are moving towards letting the programmer
repeat themselves less,
omit unnecessary pieces if they can be deduced, or if their names are just placeholders.
The DCG is a step in the right direction allowing one to write
sentence --> noun_phrase, verb_phrase.
instead of
sentence(A,Z) :- noun_phrase(A,B), verb_phrase(B,Z).
but its entanglement with difference lists makes it less useful.
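To make the comparison concrete, here is a small sketch with both forms side by side. The terminals and the name sentence2/2 are hypothetical, and calling noun_phrase/2 directly assumes the usual DCG translation that adds two list arguments:

```prolog
% DCG form: the two list arguments are threaded implicitly.
sentence --> noun_phrase, verb_phrase.
noun_phrase --> [the], [cat].
verb_phrase --> [sleeps].

% Hand-written form with the list-difference arguments spelled out.
sentence2(A, Z) :- noun_phrase(A, B), verb_phrase(B, Z).
```

Both phrase(sentence, [the, cat, sleeps]) and sentence2([the, cat, sleeps], []) accept the same token list.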
So what I am looking for are projects giving Prolog
a more compact syntactic representation, while preserving its semantic expressiveness.
Higher-order programming based on call/N is still a pretty much unexplored terrain. Major implementations like SICStus Prolog added call/N as late as 2006. So there is still a lot to explore. Consider library(lambda), library(reif) (both here) and other definitions using the meta-predicate declaration.
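A minimal sketch of what call/N allows, using a meta-predicate declaration. twice/3 and double/2 are hypothetical names chosen for illustration:

```prolog
:- meta_predicate twice(2, ?, ?).

% twice(:Goal_2, ?X0, ?X): apply the binary relation Goal_2 two times.
twice(Goal_2, X0, X) :-
    call(Goal_2, X0, X1),      % call/3 adds two arguments to Goal_2
    call(Goal_2, X1, X).

% Example relation to pass around.
double(X, Y) :- Y is 2 * X.
```

Because twice(double) is itself a partial goal, it composes with other higher-order predicates, for example maplist(twice(double), [1, 2], Ys).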
One thing you might want to look into, in the case of SWI-Prolog, is the set of language extensions introduced specifically by SWI-Prolog 7.
Another is the quasi-quotation library, which allows you to insert pieces of code in your own language (defined using a DCG) inside "regular" Prolog code.
The last thing I can recommend is the list of additional SWI-Prolog packages, some of which are specifically designed to extend the language, e.g. 'func' and 'lambda'.

Dealing with complicated prolog loops

I am using Prolog to encode some fairly complicated rules in a project of mine. There is a lot of recursion, including mutual recursion. Part of the rules look something like this:
pred1(X) :- ...
pred1(X) :- someguard(X), pred2(X).
pred2(X) :- ...
pred2(X) :- othercondition(X), pred1(X).
There is a fairly obvious infinite loop between pred1 and pred2. Unfortunately, the interaction between these predicates is very complicated and difficult to isolate. I was able to eliminate the infinite loop in this instance by passing around a list of objects that have been passed to pred1, but this is extremely unwieldy! In fact, it largely defeats the purpose of using Prolog in this application.
How can I make Prolog avoid infinite loops? For example, if in the course of proving pred1(foo) it tries to prove pred1(foo) as a sub-goal, fail and backtrack.
Is it possible to do this with meta-interpreters?
Yes, you can use meta-interpreters for this purpose, as mat suggests. But for the normal use case, that is going far beyond the regular effort.
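For reference, a meta-interpreter with an ancestor check can be sketched as follows. This is a simplification that assumes pure, user-defined clauses (no built-ins, no cut), and the pred1/pred2 clauses are hypothetical stand-ins for the rules in the question:

```prolog
:- dynamic pred1/1, pred2/1.   % dynamic so that clause/2 may inspect them

pred1(X) :- pred2(X).
pred2(X) :- pred1(X).
pred2(ok).

% solve(+Goal): proves Goal, but fails a branch as soon as a goal reappears
% (as a variant) among its own ancestors.
solve(Goal) :- solve(Goal, []).

solve(true, _) :- !.
solve((A, B), Ancestors) :- !,
    solve(A, Ancestors),
    solve(B, Ancestors).
solve(Goal, Ancestors) :-
    \+ ( member(A, Ancestors), A =@= Goal ),   % loop check
    clause(Goal, Body),
    solve(Body, [Goal|Ancestors]).
```

With these clauses, solve(pred1(ok)) succeeds and solve(pred1(foo)) fails finitely, whereas calling pred1(foo) directly would not terminate. The variant check is reliable for ground goals; for goals with unbound variables it can prune too much or too little.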
What you may consider instead is to separate the looping functionality from your actual logic using higher-order predicates. That is a very safe way to go: SWI even checks whether all the uses have a corresponding definition. This check is invoked by typing either make. or check.
As an example, consider closure0/3 and path/4 which both handle loop checks "once and forever".
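A sketch of closure0/3 along the lines of the definition that usually circulates under that name (treat the details as an assumption, not as the referenced code). It computes the reflexive-transitive closure of a binary relation and never revisits a node; edge/2 is a hypothetical example relation, and dif/2 is assumed to be available:

```prolog
:- meta_predicate closure0(2, ?, ?), closure0(2, ?, ?, +).

% closure0(:R_2, ?X0, ?X): X is reachable from X0 in zero or more R_2 steps.
closure0(R_2, X0, X) :-
    closure0(R_2, X0, X, [X0]).

closure0(_, X, X, _).
closure0(R_2, X0, X, Visited) :-
    call(R_2, X0, X1),
    non_member(X1, Visited),        % the loop check lives here, once
    closure0(R_2, X1, X, [X1|Visited]).

non_member(_, []).
non_member(E, [X|Xs]) :-
    dif(E, X),
    non_member(E, Xs).

% Hypothetical cyclic graph.
edge(a, b).
edge(b, a).
edge(b, c).
```

The query closure0(edge, a, X) enumerates a, b and c and then terminates despite the cycle between a and b.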
One feature that is available in some Prolog systems and that may help you to solve such issues is called tabling. See for example the related question and prolog-tabling.
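A minimal sketch of tabling applied to mutually recursive predicates shaped like those in the question, assuming a system with a table directive such as SWI-Prolog. base1/1 and base2/1 are hypothetical stand-ins for the elided non-recursive clauses:

```prolog
:- table pred1/1, pred2/1.

pred1(X) :- base1(X).
pred1(X) :- pred2(X).

pred2(X) :- base2(X).
pred2(X) :- pred1(X).

base1(a).
base2(b).
```

With the table directive, pred1(X) yields each answer once and then terminates, and pred1(foo) fails finitely. Without it, both queries would loop.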
If tabling is not available, then yes, meta-interpreters can definitely help a lot with this. For example, you can change the execution strategy etc. with a meta-interpreter.
In SWI-Prolog, also check out call_with_inference_limit/3 to robustly limit the execution, independent of CPU type and system load.
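A short sketch of how the inference limit can be used; loop/0 and bounded/3 are hypothetical names, and the limit of 10000 is arbitrary:

```prolog
% A deliberately non-terminating predicate, for illustration only.
loop :- loop.

:- meta_predicate bounded(0, +, -).

% bounded(:Goal, +Limit, -Outcome)
% Outcome is gave_up if Goal needed more than Limit inferences, and proved
% if it succeeded within the limit. If Goal fails, bounded/3 fails too.
bounded(Goal, Limit, Outcome) :-
    call_with_inference_limit(Goal, Limit, Result),
    (   Result == inference_limit_exceeded
    ->  Outcome = gave_up
    ;   Outcome = proved           % Result is ! or true
    ).
```

Here bounded(loop, 10000, Outcome) returns with Outcome = gave_up instead of running forever.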
Related and also useful are termination analyzers like cTI: They allow you to statically derive termination conditions.

O(1) term look up

I wish to be able to look up the existence of a term as fast as possible in my current prolog program, without the prolog engine traversing all the terms until it finally reaches the existing term.
I have not found any proof of it, but I assume that given
% thousands of other animals
the SWI-Prolog engine will have to go through thousands of animals trying to unify with tiger in order to confirm that animal(tiger) is in my Prolog database.
In other languages I believe a HashSet would solve this problem, enabling an O(1) lookup. However, I cannot seem to find any hash sets or hash tables in the SWI-Prolog documentation.
Is there an SWI-Prolog library for hash sets, or can I somehow build it myself using term_hash/2?
Bonus info: I will most likely have to do the lookup on some dynamically added data, either added to a hash set data structure or using assertz.
All serious Prolog systems perform this O(1) lookup via hashing automatically and implicitly for you, so you do not have to do it yourself.
It is called argument-indexing, and you find this explained in all good Prolog books. See also "JIT (just-in-time) indexing" in more recent versions of many Prolog systems, including SWI. Indexing is applied to dynamically added clauses too, and is one reason why assertz/1 is slowed down and therefore not a good choice for data that changes more often than it is read.
You can also easily test this yourself by creating databases with increasingly more facts and seeing that the lookup time remains roughly constant when argument indexing applies.
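One way to run that experiment, as a sketch: generate databases of different sizes and time a fixed number of lookups. fill/1 and lookup_time/2 are hypothetical helpers, and the generated names animal_1, animal_2, ... are made up:

```prolog
:- dynamic animal/1.

% fill(+N): replace the database by N generated animals, with tiger last.
fill(N) :-
    retractall(animal(_)),
    forall(between(1, N, I),
           ( atom_concat(animal_, I, Name),
             assertz(animal(Name)) )),
    assertz(animal(tiger)).

% lookup_time(+Reps, -Seconds): CPU time spent on Reps lookups of tiger.
lookup_time(Reps, Seconds) :-
    statistics(cputime, T0),
    forall(between(1, Reps, _), animal(tiger)),
    statistics(cputime, T1),
    Seconds is T1 - T0.
```

Running fill(N), lookup_time(100000, S) for N = 1000, 10000 and 100000 and comparing the values of S shows whether the lookup cost grows with the size of the database on your system.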
When the built-in first argument indexing is not enough (note that some Prolog systems also provide multi-argument indexing), depending on the system, you can construct your own indexing scheme using a built-in or library term hashing predicate. In the case of ECLiPSe, GNU Prolog, SICStus Prolog, SWI-Prolog, and YAP, look into the documentation of the term_hash/4 predicate.
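A sketch of such a hand-made index, using term_hash/2 (the two-argument variant mentioned in the question, as provided by SWI-Prolog) so that the integer hash key becomes the indexed first argument. stored/2, store/1 and present/1 are hypothetical names:

```prolog
:- dynamic stored/2.           % stored(HashKey, Term)

% store(+Term): add a ground term unless it is already present.
store(Term) :-
    ground(Term),              % term_hash/2 only yields a key for ground terms
    term_hash(Term, Key),
    (   stored(Key, Term)
    ->  true
    ;   assertz(stored(Key, Term))
    ).

% present(+Term): succeeds once if the ground term has been stored.
present(Term) :-
    ground(Term),
    term_hash(Term, Key),
    stored(Key, Term),         % Key narrows the search, Term confirms the match
    !.
```

Unifying the stored term again after the key lookup keeps the result correct when two different terms happen to share a hash key.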

Colossal memory usage/stack problems with ANTLR lexer/parser

I'm porting over a grammar from flex/bison, and mostly seem to have everything up and running (in particular, my token stream seems fine, and my parser grammar is compiling and running), but seem to be running into problems of runaway stack/memory usage even with very small/moderate sized inputs to my grammar. What is the preferred construct for chaining together an unbounded sequence of the same nonterminal? In my Bison grammar I had production rules of the form:
statements: statement | statement statements
words: | word words
In ANTLR, if I maintain the same rule setup, this seems to perform admirably on small inputs (on the order of 4kB), but leads to stack overflow on larger inputs (on the order of 100kB). In both cases the automated parse tree produced is also rather ungainly.
I experimented with changing these production rules to have an explicitly additive (rather than recursive) form:
statements: statement+
words: word*
However, this seems to have led to absolutely horrific blowup in memory usage (upwards of 1GB) on even very small inputs, and the parser has not yet managed to return a parse tree after 20 minutes of letting it run.
Any pointers would be appreciated.
Your rewritten statements are the optimal ANTLR 4 form of the two rules you described (highest performing and minimum memory usage). Here is some general feedback regarding the issues you describe.
I developed some very advanced diagnostic code for numerous potential performance problems. Much of this code is included in TestPerformance, but it is geared towards expert users and requires a rather deep understanding of ANTLR 4's new ALL(*) algorithm to interpret the results.
Terence and I are interested in turning the above into a tool that users can make use of. I may be able to help (run and interpret the test) if you provide a complete grammar and example inputs, so that I can use that grammar and input pair as part of evaluating the usability of a tool further down the road that automates the analysis.
Make sure you are using the two-stage parsing strategy from the book. In many cases, this will vastly improve the parsing performance for correct inputs (incorrect inputs would not be faster).
We don't like to use more memory than necessary, but you should be aware that we are working under a very different definition of "excessive" - e.g. we run our testing applications with -Xmx4g to -Xmx12g, depending on the test.
Okay, so I've gotten it working, in the following manner. My YACC grammar had the following constructions:
lines: line | line lines;
words: | word words;
However, this did not make the recursive parsing happy, so I rewrote it as:
lines: line+;
words: word*;
Which is in line with @280Z28's feedback (and my original guess). This hung the parser, which is why I posted the question in the first place, but the debugging procedure outlined in my comments to @280Z28's answer showed that in fact it was only the lines rule that was causing the problem (words was fine). On a whim, I tried the following rewrite:
lines : stmt (EOL stmt)+ EOL*;
where line had originally been defined as:
line : stmt (EOL | EOF);
This seems to be working quite well, even for large inputs. However it is entirely unclear to me WHY this is the Right Thing To Do(tm), or why it makes a difference compared to the revision which prompted this question. Any feedback on this matter would still be appreciated.
