When are Warren's Abstract Machine program instructions executed? - prolog

I'm reading Hassan Aït-Kaci's "Warren's Abstract Machine: A Tutorial Reconstruction".
In Chapter 2, the compilation of L0 programs is presented after the compilation of L0 queries. The program compilation section (2.3) starts with:
Compiling a program term p is just a bit trickier, although not by
much. Observe that it assumes that a query ?- q will have built a term
on the heap and set register X1 to contain its address. Thus,
unifying q to p can proceed by following the term structure already
present in X1 as long as it matches functor for functor the structure of p.
So the compilation of a program happens after the instructions obtained from compiling the query have been executed? Does that even make sense? I'm confused...
What makes sense to me: WAM code generated from a program's annotated syntax tree is stored by the interpreter. For each procedure defined in the program, a block of WAM code is stored. When a query is made, its instructions are generated and executed. If the query calls a defined procedure, its block of code is executed. Is it something like that?

Please note that what you quote is from the very beginning of a series of increasingly complex virtual machines that are introduced in this text:
We consider here ℒ0, a very simple language indeed. In this language, one can specify only two sorts of entities: a program term and a query term. Both program and query are first-order terms but not variables. The semantics of ℒ0 is simply
tantamount to computing the most general unifier of the program and the query.
This simple language is interpreted as you describe.
In later sections of the book, the design and execution of more complex machines becomes proportionally more sophisticated, and already a few pages later we find for example:
In ℳ1, compiled code is stored in a code area (CODE), an addressable array of data words, each containing a possibly labeled instruction over one or more memory words consisting of an opcode
followed by operands.
This is already the design you describe at the end of your post, which is of course how actual Prolog code is compiled in practice.
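For concreteness, here is roughly what compiled ℒ0 code looks like for a tiny program/query pair. This is only a sketch using the instruction names defined in the book; the exact sequences for any given term follow from the flattening order described there:

    % query ?- p(f(X)).  - compiled and run first; builds the term
    % on the heap and leaves its address in register X1
    put_structure f/1, X2    % X2 = f(...)
    set_variable X3          % ... the variable X
    put_structure p/1, X1    % X1 = p(...)
    set_value X2             % ... whose argument is the f/1 cell

    % program p(f(a)).  - runs next; follows the structure in X1
    get_structure p/1, X1
    unify_variable X2
    get_structure f/1, X2
    unify_variable X3
    get_structure a/0, X3

In ℳ0 there is essentially no control: the query code runs to completion and execution then falls through to the program code, which matches the term the query built, functor by functor. That is the precise sense in which the compilation of the program "assumes that a query ?- q will have built a term on the heap and set register X1 to contain its address".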

So the compilation of a program happens after the instructions obtained from compiling the query have been executed? Does that even make sense? I'm confused...
In the beginning, this is clarified (2, last paragraph):
The idea is quite simple: having defined a program term p, one can submit any query ?-q and execution either fails if p and q do not unify, or succeeds with a binding of the variables in q obtained by unifying it with p.
As @mat already states: this is a step-by-step approach, starting from very simple programs: just one ground fact and a query.

An algorithm for compiler design?

Recently I have been thinking about an algorithm I constructed myself. I call it Replacement Compiling.
It works as follows:
Define a language as well as its operators' precedence, such as
(1) store <value> as <id>, replace with: var <id> = <value>, precedence: 1
(2) add <num> to <num>, replace with: <num> + <num>, precedence: 2
Accept a line of input, such as store add 1 to 2 as a;
Tokenize it: <kw,store><kw,add><num,1><kw,to><num,2><kw,as><id,a><EOF>;
Then scan through all the tokens until reach the end-of-file, find the operation with highest precedence, and "pack" the operation:
<kw,store>(<kw,add><num,1><kw,to><num,2>)<kw,as><id,a><EOF>
Replace the "sub-statement", the expression in parenthesis, with the defined replacement:
<kw,store>(1 + 2)<kw,as><id,a><EOF>
Repeat until no statements are left:
(<kw,store>(1 + 2)<kw,as><id,a>)<EOF>
(var a = (1 + 2))
Then evaluate the code with the built-in function, eval().
eval("var a = (1 + 2)")
Then my question is: would this algorithm work, and what are the limitations? Does this algorithm work better on simple languages?
This won't work as-is, because there's no way of deciding the precedence of operations and keywords, but you have essentially defined parsing (and thrown in an interpretation step at the end). This looks pretty close to operator-precedence parsing, but I could be wrong in the details of your vision. The real keys to what makes a parsing algorithm are the direction/order in which it reads the code, whether the decisions are made top-down (figure out what kind of statement it is and apply the rules) or bottom-up (assemble small pieces into larger components until the types of statements are apparent), and whether the grammar is encoded as code or as data for a generic parser. (I'm probably overlooking something, but this should give you a starting point to make sense of further reading.)
More typically, code is parsed using an LR technique (LL if it's top-down) driven from a state machine with look-ahead and next-step information, but you'll also find the occasional recursive-descent parser. Since they're all doing very similar things (just implemented differently), your rough algorithm could probably be refined to look a lot like any of them.
For most people learning about parsing, recursive descent is the way to go, since everything is in the code instead of building what amounts to an interpreter for a state-machine definition. But most parser generators build an LL or LR parser.
And I'm obviously over-simplifying the field, since you can see at the bottom of the Wikipedia pages that there's a smattering of related systems that partly revolve around the kind of grammar you have available. But for most languages, those are the big-three algorithms.
What you've defined is a rewriting system: https://en.wikipedia.org/wiki/Rewriting
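To make the connection concrete, here is a minimal sketch of such a rewrite-until-fixpoint loop in Prolog. The token representation and the decl/2 output term are made up for the example; notice that the "precedence" works out here because the store pattern can only match once its sub-expression has been packed into a single term:

    step([add, X, to, Y | Rest], [X + Y | Rest]).           % precedence 2
    step([store, E, as, Id | Rest], [decl(Id, E) | Rest]).  % precedence 1
    step([T | Ts0], [T | Ts]) :- step(Ts0, Ts).             % try further right

    rewrite(Ts0, Ts) :-                 % repeat until no rule applies
        (   step(Ts0, Ts1)
        ->  rewrite(Ts1, Ts)
        ;   Ts = Ts0
        ).

    % ?- rewrite([store, add, 1, to, 2, as, a], T).
    % T = [decl(a, 1+2)].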
You can make a compiler like that, but it's hard work and runs slowly, and if you do a really good job of optimizing it then you'll end up with a conventional table-driven parser. It would be better in the end to learn about those first and just start there.
If you really don't want to use a parser generating tool, then the easiest way to write a parser for a simple language by hand is usually recursive descent: https://en.wikipedia.org/wiki/Recursive_descent_parser
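For a toy language like the one in the question, such a parser is tiny. Here is a sketch as a Prolog DCG (sticking with Prolog for consistency with the rest of this page; the token list and AST constructors are made up for the example). Each DCG rule plays the role of one "parse function", and the grammar is encoded directly as code, which is exactly the trade-off mentioned above:

    % tokens as a Prolog list: [store, add, 1, to, 2, as, a]
    stmt(assign(Id, E)) --> [store], expr(E), [as], ident(Id).
    expr(plus(X, Y))    --> [add], expr(X), [to], expr(Y).
    expr(num(N))        --> [N], { number(N) }.
    ident(Id)           --> [Id], { atom(Id) }.

    % ?- phrase(stmt(T), [store, add, 1, to, 2, as, a]).
    % T = assign(a, plus(num(1), num(2))).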

Boolean expression optimization in compiler and high end processor pipeline

I want to calculate a boolean expression. For ease of understanding, let's assume the expression is

    O = (A & B & C) | (D & E & F)        (eqn. 1)
Here A, B, C, D, E and F are random bits. Now, as my target platform is a high-end Intel i7 Haswell processor that supports 64-bit data types, I can make this much more efficient using bit-slicing.
So now, O, A, B, C, D, E and F are of a 64-bit data type:

    O_64 = (A_64 & B_64 & C_64) | (D_64 & E_64 & F_64)        (eqn. 2)

where & and | are bitwise operators as in the C language.
Now, I need the expression to take constant time to execute. That means the calculation of eqn. 2 should take exactly the same number of steps in the processor irrespective of the values in A_64, B_64, C_64, D_64, E_64, and F_64. The values are filled in at runtime using a random generator.
Now my questions are:
Considering I am using GCC or GCC 7 with -O3: how far can the compiler optimize the expression? For example, if A_64 becomes all zeroes (which can happen with probability 2^{-64}), then we don't need to calculate the first part of eqn. 2, and O_64 becomes equal to D_64 & E_64 & F_64. Is it possible for a C compiler to optimize in such a way? We have to remember that the values are filled in at runtime and the boolean expressions have around 120 variables.
Is it possible for a processor to do such an optimization (as in point 1) during runtime? As my boolean expression is very long, the execution will be heavily pipelined; is it possible for a processor to pull an operation out of the pipeline if such a situation arises?
Please, let me know if any part of the question is not understandable.
I appreciate your help.
Is it possible for a C compiler to optimize in such a way?
It's allowed to do it, but it probably won't. There is nothing to gain in general. If part of the expression were statically known to be zero, that would be used. But inserting branches inside bitwise calculations is almost always counterproductive, and I've never seen a compiler judge a sequence of ANDs to be "long enough to be worth inserting an early-out" (you can certainly do so manually, of course). I can't give you a hard guarantee; if you want to be sure, you should always check the assembly.
What it probably will do (for longer expressions at least) is reassociate the expression for more instruction-level parallelism. So code like that probably won't be just two long (but mutually parallel) chains of dependent ANDs, but will be split up into more chains. That still wouldn't make the time depend on the values.
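As a toy illustration of what reassociation buys (a sketch in Prolog, just to show the shape of the transformation, not what GCC literally emits): a chain A & B & C & D & E & F makes each AND depend on the previous one, while a balanced tree lets independent ANDs issue in the same cycle.

    % balance(+Vars, -Expr): build a balanced /\-tree over the variables,
    % cutting the dependency depth from N-1 to about log2(N)
    balance([X], X).
    balance(Xs, L /\ R) :-
        Xs = [_, _ | _],
        length(Xs, N),
        H is N // 2,
        length(Front, H),
        append(Front, Back, Xs),
        balance(Front, L),
        balance(Back, R).

    % ?- balance([a, b, c, d, e, f], E).
    % E = (a /\ (b /\ c)) /\ (d /\ (e /\ f))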
Is it possible for a processor to do such an optimization during runtime?
Extremely hypothetically yes. No processor architecture that I am aware of does that. It would be a slightly tricky mechanism, and as a general rule it would almost never help.
Hypothetically it could work like this: when the operands for an AND instruction are looked up and one (or both) of them is found to be renamed to the hard-wired zero-register, the renamer can immediately rename the destination to zero as well (rather than allocating a new register for the result), effectively giving that AND instruction 0-latency. The flags output would also be known so the µop would not even have to be executed. It would roughly be a cross between copy-elimination and a zeroing idiom.
That mechanism wouldn't even trigger unless one of the inputs is set to zero with a zeroing idiom; if an input is accidentally zero, that wouldn't be detected. It would also not completely remove the influence of the redundant AND instructions: they still have to go through (most of) the front-end of the processor, even if only to find out that they didn't need to be executed after all.

Prolog CLP order of operations?

Hi, hopefully someone can help me. I was just wondering whether my code below is sufficient to set up a 12 x 12 matrix and, assuming 'constrain(M)' posts all the correct constraints (which are defined in rules lower down), to label each of the rows? It's failing at the moment, and I've traced my constraints, so I know they all work, but I didn't know whether it fails because I'm calling them outside of the main predicate?
matrix(M) :-
        M = [R1,R2,R3,R4,R5,R6,R7,R8,R9,R10,R11,R12],
        R1  = [A,B,C,D,E,F,G,H,I,J,K,L],
        R2  = [A2,B2,C2,D2,E2,F2,G2,H2,I2,J2,K2,L2],
        R3  = [A3,B3,C3,D3,E3,F3,G3,H3,I3,J3,K3,L3],
        R4  = [A4,B4,C4,D4,E4,F4,G4,H4,I4,J4,K4,L4],
        R5  = [A5,B5,C5,D5,E5,F5,G5,H5,I5,J5,K5,L5],
        R6  = [A6,B6,C6,D6,E6,F6,G6,H6,I6,J6,K6,L6],
        R7  = [A7,B7,C7,D7,E7,F7,G7,H7,I7,J7,K7,L7],
        R8  = [A8,B8,C8,D8,E8,F8,G8,H8,I8,J8,K8,L8],
        R9  = [A9,B9,C9,D9,E9,F9,G9,H9,I9,J9,K9,L9],
        R10 = [A10,B10,C10,D10,E10,F10,G10,H10,I10,J10,K10,L10],
        R11 = [A11,B11,C11,D11,E11,F11,G11,H11,I11,J11,K11,L11],
        R12 = [A12,B12,C12,D12,E12,F12,G12,H12,I12,J12,K12,L12],
        constrain(M),
        labeling([], R1),
        labeling([], R2),
        labeling([], R3),
        labeling([], R4),
        labeling([], R5),
        labeling([], R6),
        labeling([], R7),
        labeling([], R8),
        labeling([], R9),
        labeling([], R10),
        labeling([], R11),
        labeling([], R12).
You should always separate the constraint posting from the actual search (labeling/2).
The reason is clear: It can often be extremely expensive to search for concrete solutions. Posting the constraints, on the other hand, is often very fast.
If, as in your case, the two parts are uncleanly mixed, you cannot tell easily which part is responsible if there are unexpected problems such as nontermination.
In your case, the only thing you should improve in the main predicate is enforcing said separation between constraint posting and search.
The mistake that causes unexpected failure is most likely contained in one of the rules you did not post here. You can find out which rules are involved in the failure by systematically replacing the goals in which they are called by true. Thus, there's no need for tracing: You can debug CLP(FD) programs declaratively in this way.
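For example (with hypothetical constraint names), to locate the culprit you would generalize the program step by step:

    constrain(M) :-
            rows_in_domain(M),
            true,  % was: rows_all_distinct(M) - if the query still fails
                   % with this goal removed, the goal is not needed for
                   % the failure, so look among the remaining goals
            columns_constrained(M).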
EDIT: Here is more information about the separation between posting constraints and the search for concrete solutions. As introduced in GUPU, we will use the notion of core relation, which has the following properties:
By convention, its name ends with an underscore _.
Also by convention, its last argument is the list of variables that need to be labeled.
It posts the CLP(FD) constraints. This is also called the (constraint) modeling part or (constraint) model.
It doesn't use labeling/2.
The search part is usually performed by label/1 or labeling/2.
Suppose you have a predicate where you intermingle these two aspects, such as in your current case:
matrix(M) :-
        constraints_hold(M),
        ... relate M to variables Vs ...
        labeling(Strategy, Vs).
Obviously, for the reasons explained above, the call of labeling/2 is the part we want to remove from this predicate. Of course, as you observe, we still want to somehow access the variables that are supposed to be labeled.
We do this as follows:
We introduce a new argument to the core relation to pass around the list of finite domain variables that need to be labeled.
By convention, we reflect the additional argument by appending an underscore (_) to the predicate name.
So, we obtain the following core relation:
matrix_(M, Vs) :-
        constraints_hold(M),
        ... relate M to variables Vs ...
The only missing part (which you haven't done yet, but which you should have done in any case) is stating the relation between the object of interest (in this case: the matrix) and the finite domain variables. This is the part I leave as a simple exercise for you. Hint: append/2.
Once you have done all this, you can solve the whole task by combining the core relation and labeling/2 in a single query or predicate:
?- matrix_(M, Vs), labeling(Strategy, Vs).
Note that this separation between core relation and search:
makes it extremely easy to try different labeling strategies without recompiling your program.
allows you to determine important procedural properties of the core relation without needing to search for concrete solutions.
Use the introduction and explanation of this important separation as an indicator when judging the quality of any text about CLP(FD) constraints.
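To make the pattern fully concrete, here is a minimal self-contained instance for a hypothetical 2x2 puzzle, assuming SWI-Prolog's library(clpfd); note how append/2 relates the matrix to the list of variables:

    :- use_module(library(clpfd)).

    % core relation: posts constraints only; the last argument is
    % the list of variables that need to be labeled
    matrix_(M, Vs) :-
            M = [[_,_],
                 [_,_]],
            append(M, Vs),      % Vs lists all four cells
            Vs ins 1..4,
            all_distinct(Vs).

The search is then kept separate, so you can freely swap strategies:

    ?- matrix_(M, Vs), labeling([ff], Vs).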

How to recognize variables that don't affect the output of a program?

Sometimes the value of a variable accessed within the control-flow of a program cannot possibly have any effect on its output. For example:
global var_1
global var_2

start program hello(var_3, var_4)
    if (var_2 < 0) then
        save-log-to-disk(var_1, var_3, var_4)
    end-if
    return ("Hello " + var_3 + ", my name is " + var_1)
end program
Here only var_1 and var_3 have any influence on the output, while var_2 and var_4 are only used for side effects.
Do variables such as var_1 and var_3 have a name in dataflow-theory/compiler-theory?
Which static dataflow analysis techniques can be used to discover them?
References to academic literature on the subject would be particularly appreciated.
The problem that you stated is undecidable in general, even for the following very narrow special case: given a single routine P(x), where x is a parameter of type integer, is the output of P(x) independent of the value of x, i.e., does

    P(0) = P(1) = P(2) = ...?
We can reduce the following still-undecidable version of the halting problem to the question above: given a Turing machine M(), does the program never stop on the empty input?
I assume that we use a (Turing-complete) language in which we can build a "Turing machine simulator":
Given the program M(), construct this routine:
P(x):
    if x == 0:
        return 0
    run M() for x steps
    if M() has terminated then:
        return 1
    else:
        return 0
Now:

    P(0) = P(1) = P(2) = ...  =>  M() does not terminate

    M() does terminate  =>  P(x) = 1 for a sufficiently large x
                        =>  P(x) != P(0) = 0
So, it is very difficult for a compiler to decide whether a variable actually does not influence the return value of a routine; in your example, the "side effect routine" might manipulate one of its values (or even loop infinitely, which would most definitely change the return value of the routine ;-)
Of course, overapproximations are still possible. For example, one might conclude that a variable does not influence the return value if it does not appear in the routine body at all. You can also see some classical compiler analyses (like expression simplification or constant propagation) as having the side effect of eliminating appearances of such redundant variables.
Pachelbel has discussed the fact that you cannot do this perfectly. OK, I'm an engineer, I'm willing to accept some dirt in my answer.
The classic way to answer your question is to do dataflow tracing from program outputs back to program inputs. A dataflow connects a program assignment (or side effect) that sets a variable's value to a place in the application that consumes that value.
If there is a (transitive) dataflow from a program output that you care about (in your example, the printed text stream) to an input you supplied (var_2), then that input "affects" the output. A variable whose value does not flow to your desired output is useless from your point of view.
If you focus your attention only on the computations involved in these dataflows, and display them, you get what is generally called a "program slice". There are (very few) commercial tools that can show this to you.
Grammatech has a good reputation here for C and C++.
There are standard compiler algorithms for constructing such dataflow graphs; see any competent compiler book.
They all suffer from some limitations due to Turing's impossibility proofs, as pointed out by Pachelbel. When you implement such a dataflow algorithm, there will be places where it cannot know the right answer; it simply has to pick one.
If your algorithm chooses to answer "there is no dataflow" in places where it is not sure, then it may miss a valid dataflow and incorrectly report that a variable does not affect the answer (this is called a "false negative"). This occasional error may be acceptable if the algorithm has some other nice properties, e.g., it runs really fast on millions of lines of code. (The trivial algorithm simply says "no dataflow" everywhere, and it is really fast :)
If your algorithm chooses to answer "yes, there is a dataflow", then it may claim that some variable affects the answer when it does not (this is called a "false positive").
You get to decide which is more important; many people prefer false positives when looking for a problem, because then you have to at least look at possibilities detected by the tool. A false negative means it didn't report something you might care about. YMMV.
Here's a starting reference: http://en.wikipedia.org/wiki/Data-flow_analysis
Any of the books on that page will be pretty good. I have Muchnick's book and like it a lot. See also this page: http://en.wikipedia.org/wiki/Program_slicing
You will discover that implementing this is a pretty big effort for any real language. You are probably better off finding a tool framework that does most or all of this for you already.
I use the following algorithm: a variable is used if it is a parameter or it occurs anywhere in an expression, excluding occurrences as the LHS of an assignment. First, count the number of uses of all variables. Delete unused variables and assignments to unused variables. Repeat until no variables are deleted.
This algorithm only implements a subset of the OP's requirement, and it is horribly inefficient because it requires multiple passes. A garbage-collection-style approach may be faster but is harder to write: my algorithm only requires a list of variables with usage counts. Each pass is linear in the size of the program. The algorithm effectively does a limited kind of dataflow analysis by eliminating the tail of a flow ending in an assignment.
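A minimal sketch of that counting-and-deleting loop in Prolog (assuming SWI-Prolog; the assign/2 statement representation and variables-as-atoms are made up for the example, parameters are ignored, and the usage counts are reduced to a used-variable set):

    % a variable is used if it occurs in the RHS of any remaining
    % assignment or in the returned expression
    uses(Stmts, Ret, Used) :-
            findall(V,
                    ( member(assign(_, E), Stmts), expr_var(E, V)
                    ; expr_var(Ret, V)
                    ),
                    Vs),
            sort(Vs, Used).

    expr_var(E, E) :- atom(E).
    expr_var(E, V) :-
            compound(E),
            E =.. [_ | Args],
            member(A, Args),
            expr_var(A, V).

    % delete assignments to unused variables; repeat until a fixpoint,
    % since each deletion can make further variables unused
    eliminate(Stmts0, Ret, Stmts) :-
            uses(Stmts0, Ret, Used),
            keep_live(Stmts0, Used, Stmts1),
            (   Stmts1 == Stmts0
            ->  Stmts = Stmts0
            ;   eliminate(Stmts1, Ret, Stmts)
            ).

    keep_live([], _, []).
    keep_live([assign(V, E) | Ss], Used, Out) :-
            (   memberchk(V, Used)
            ->  Out = [assign(V, E) | Rest]
            ;   Out = Rest
            ),
            keep_live(Ss, Used, Rest).

    % ?- eliminate([assign(x, 1), assign(y, x+1), assign(z, 5)], y, S).
    % S = [assign(x, 1), assign(y, x+1)]
    % (z is deleted; x stays because y needs it)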
For my language, the elimination of side effects in the RHS of an assignment to an unused variable is mandated by the language specification; it may not be suitable for other languages. Effectiveness is improved by running it before inlining, to reduce the cost of inlining unused function applications, and then running it again afterwards, which eliminates parameters of inlined functions.
Just as an example of the utility of the language specification, the library constructs a thread pool and assigns a pointer to it to a global variable. If the thread pool is not used, the assignment is deleted, and hence the construction of the thread pool elided.
IMHO compiler optimisations are almost invariably heuristics whose performance matters more than their effectiveness in achieving a theoretical goal (like removing all unused variables). Simple reductions are useful not only because they're fast and easy to write, but because a programmer who understands the basics of the compiler's operation can leverage this knowledge to help the compiler. The most well-known example of this is probably the refactoring of recursive functions to place the recursion in tail position: a pointless exercise unless the programmer knows the compiler can do tail-recursion optimisation.

Alternatives to the WAM

I remember once reading that there were at least two other alternatives invented roughly at the same time as the WAM. Any pointers?
Prior to the WAM, there was the ZIP by Clocksin. Its design is still very interesting. SWI-Prolog uses it. And also B-Prolog has slowly migrated from a WAM design towards the ZIP. Of course, on that way many new innovations were developed. Another alternative is the VAM.
A comparison as of 1993 is:
http://www.complang.tuwien.ac.at/ulrich/papers/PDF/binwam-nov93.pdf
In the meantime, the most interesting architectural developments are related to B-Prolog.
WAM vs. ZIP
The key difference between the WAM and the ZIP is the precise interface for a predicate's arguments. In the WAM, the arguments are all passed via registers, that is, either real registers or at least fixed locations in memory. The ZIP passes all arguments via the stack.
Let's consider a minimal example:
p(R1,R2,R3,L1,L2,L3) :-        %  WAM                  %  ZIP
                               %  store L1..L3         %  nothing
                               %  nothing              %  push R1..R3
                               %  init  X1..X3         %  push X1..X3
    q(R1,R2,R3,X1,X2,X3),
                               %  put unsafe X1..X3    %  push X1..X3
                               %  load  L1..L3         %  push L1..L3
    r(X1,X2,X3,L1,L2,L3).
Prior to calling q:
The WAM does not need to do any action for arguments that are passed on to the first goal at the very same positions (R1..R3). This is particularly interesting for binary clauses - that is, clauses with exactly one regular goal at the end. Here the WAM excels.
The other arguments L1..L3 need to be stored locally. So for these arguments, the register interface did not do anything good.
The ZIP on the other hand does not need to save arguments - they are already saved on the stack. This is not only good for clauses with more than one goal, but also for other interrupting goals like constraints or interrupts.
As a downside, the ZIP must push again R1..R3.
Both have to initialize X1..X3 and store them on the stack.
Calling q:
When calling q, the WAM has to allocate stack space for X1..X3 and L1..L3, thus 6 cells, whereas the ZIP needs R1..R3, L1..L3, X1..X3, thus 9. So here the WAM is more space-efficient. Also, the WAM permits environment trimming (in more complex situations), which is next to impossible for the ZIP.
Prior to calling r:
This r is the last call, and systems try to free the space for this clause, provided no choice point is present.
For the WAM, the existential variables X1..X3 have to be checked for being still-uninstantiated local variables (put_unsafe), and if so, they are moved onto the heap - that's expensive, but it occurs rarely. L1..L3 are just loaded. That's all; the WAM can now safely deallocate the local frame. So last-call optimization is dirt cheap.
For the ZIP, everything has to be pushed as usual. Only then can an extra scan examine all the values on the stack and move them accordingly. That's rather expensive. Some optimizations are possible, but it is still much more than what the WAM does. (A possible improvement would be to push arguments in reverse order; then the variables L1..L3 might be left in their location, so these variables would not need any handling. I have not seen such an implementation yet.)
In the technical note entitled An abstract Prolog instruction set, Warren also references another compiler by Bowen, Byrd, and Clocksin. However, he says that the two architectures have much in common, so I don't know whether that compiler can really be considered an alternative.
Not sure if this is what you mean, but the first two Prolog implementations were an interpreter written in Fortran by Colmerauer et al. and a DEC PDP-10 native compiler by Warren et al.
Warren mentions these in his foreword to Aït-Kaci's Tutorial Reconstruction of the WAM. If this is not what you mean, you may find it in that document or its references.
