In Warren's Abstract Machine, how does bind work if one of the arguments is a register?

I'm trying to create my own WAM implementation and I'm stuck on Exercise 2.4.
I can't understand how to execute instruction unify_value X4 in figure 2.4.
As far as I understand, this instruction should unify Y from the program with f(W) from the query.
unify_value X4 calls unify (X4,S) where S=2 (see Figure 2.1) and a corresponding heap cell is "REF 2", and X4 is "STR 5".
Unify (Figure 2.7) should bind those values, but I do not understand how to deref a register.
"REF 2" is in the heap, "STR 5" is in a register. How do you bind something to a register?

We are talking about Warren's "New" Engine, WAM and not the Old Engine, known as PLM.
In the WAM, variables are allocated in two places:
the local stack (environment stack)
the heap
Registers cannot hold variables. However, they may hold references to variables. Note that references from the heap only point into the heap.
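For the concrete case in the question, a minimal sketch may help. It assumes (as an illustration, not as the tutorial's exact code) a single store in which both heap cells and registers are addressable, so deref can be applied to a register address. A register holding a non-REF cell simply dereferences to itself, and bind then copies that cell into the unbound heap cell. The names cell, deref and bind and the ("H", i) / ("X", i) address encoding are hypothetical, and trailing is left out.

```python
# Hypothetical encoding: ("H", i) addresses heap cell i, ("X", i) register Xi.
# A cell is a (tag, value) pair; in this sketch a REF value is a heap index.

def cell(store, addr):
    area, index = addr
    return store[area][index]

def deref(store, addr):
    """Follow REF chains; stop at an unbound heap variable or a non-REF cell."""
    tag, value = cell(store, addr)
    while tag == "REF" and addr != ("H", value):
        addr = ("H", value)
        tag, value = cell(store, addr)
    return addr

def bind(store, a1, a2):
    """Bind two dereferenced addresses, at least one of which is an unbound REF."""
    tag1, _ = cell(store, a1)
    if tag1 == "REF":
        target, source = a1, a2
    else:
        target, source = a2, a1
    # Overwrite the unbound variable with a copy of the other cell.
    store[target[0]][target[1]] = cell(store, source)

# The situation from the question: heap cell 2 is unbound, X4 holds STR 5.
store = {"H": {2: ("REF", 2)}, "X": {4: ("STR", 5)}}
d1 = deref(store, ("X", 4))   # stays at the register: its cell is not a REF
d2 = deref(store, ("H", 2))   # unbound variable on the heap
bind(store, d1, d2)           # heap cell 2 now holds a copy of STR 5
```

In this sketch nothing is ever bound "to a register": the register is only read, and the heap variable is the cell that gets overwritten.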
Closely related to your question is the rather ingenious way in which the WAM maintains this order and at the same time has very cheap last-call optimization. At the time of a (determinate) last call, the local variables that are arguments of the last call must be moved somehow. In more traditional Prolog machines like the ZIP, this is an extremely laborious undertaking that essentially requires scanning the environment frame for variables still sitting in it.
The WAM however has a much better calling convention: Most variables are already in a safe place, which can be trivially analyzed during compilation. The very few remaining need an explicit PUT_UNSAFE instruction where the value is checked, and should it still be a local variable that variable is transferred onto the heap.
Consider what is a safe variable in the WAM:
All variables occurring in the head
All variables that appear as an argument of a structure
Thus only variables that appear first in a goal and in the last goal and that do not appear in some structure must have a PUT_UNSAFE. That is not that much. Further, the dynamic check may reduce the actual copying onto the heap to a minimum.
At first this PUT_UNSAFE looks like a lot of work, but never forget that the WAM permits removing many PUTs, while the ZIP has to execute at least one instruction for each argument.
Here is a tiny typical example using GNU Prolog:
a --> b, c.
expanded to:
a(S0,S) :- b(S0,S1), c(S1,S).
and compiled using the command pl2wam to:
predicate(a/2,1,static,private,monofile,global,[
allocate(2),
get_variable(y(0),1), % S
put_variable(y(1),1), % S1
call(b/2),
put_unsafe_value(y(1),0), % S1
put_value(y(0),1), % S
deallocate,
execute(c/2)]).
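The run-time check behind an instruction like put_unsafe_value can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not GNU Prolog's implementation: there is only one environment, addresses are hypothetical ("Y", n) and ("H", i) pairs, and trailing is ignored. A real WAM compares addresses against the current environment pointer instead of looking at an area tag.

```python
# Hypothetical encoding: ("Y", n) is slot n of the current environment,
# ("H", i) is heap cell i. A cell is (tag, value); a REF value is an address.

def deref(store, addr):
    tag, value = store[addr]
    while tag == "REF" and value != addr:
        addr = value
        tag, value = store[addr]
    return addr

def put_unsafe_value(store, y_addr, heap_top):
    """Return (cell to load into the argument register, new heap top)."""
    addr = deref(store, y_addr)
    tag, _ = store[addr]
    if tag == "REF" and addr[0] == "Y":
        # Still an unbound local variable: create a fresh heap variable
        # and bind the local variable to it before the frame is discarded.
        new_var = ("H", heap_top)
        store[new_var] = ("REF", new_var)
        store[addr] = ("REF", new_var)
        return store[new_var], heap_top + 1
    # Already bound, or already on the heap: just load the value.
    return store[addr], heap_top

store = {("Y", 1): ("REF", ("Y", 1))}              # unbound local variable
arg, top = put_unsafe_value(store, ("Y", 1), 0)    # variable moves to the heap
```

Only the first branch copies anything, which is why the dynamic check keeps the actual copying onto the heap to a minimum.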

Boolean expression optimization in compiler and high end processor pipeline

I want to calculate a boolean expression. For ease of understanding let's assume the expression is,
O = (A & B & C) | (D & E & F)   (eqn. 1)
Here A, B, C, D, E and F are random bits. My target platform is a high-end Intel i7 (Haswell) processor that supports 64-bit data types, so I can make this much more efficient using bit-slicing.
So now O, A, B, C, D, E and F are 64-bit values:
O_64 = (A_64 & B_64 & C_64) | (D_64 & E_64 & F_64)   (eqn. 2)
where & and | are bitwise operators, as in C.
I need the expression to take constant time to execute. That is, the calculation of eqn. 2 should take exactly the same number of steps in the processor irrespective of the values in A_64, B_64, C_64, D_64, E_64 and F_64. The values are filled in at runtime by a random generator.
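To make the bit-slicing idea concrete, here is a small sketch that checks eqn. 2 against eqn. 1 lane by lane. The function names eqn1 and eqn2 and the fixed random seed are assumptions made for illustration. Python is used only to show the lane semantics; it says nothing about timing on real hardware.

```python
import random

MASK64 = (1 << 64) - 1

def eqn1(a, b, c, d, e, f):
    """Single-bit version: every argument is 0 or 1."""
    return (a & b & c) | (d & e & f)

def eqn2(A, B, C, D, E, F):
    """Bit-sliced version: 64 independent instances of eqn1, one per bit lane."""
    return ((A & B & C) | (D & E & F)) & MASK64

rng = random.Random(0)
words = [rng.getrandbits(64) for _ in range(6)]
O_64 = eqn2(*words)

# Lane i of the result must equal eqn1 applied to lane i of every input.
lanes_agree = all(
    ((O_64 >> i) & 1) == eqn1(*(((w >> i) & 1) for w in words))
    for i in range(64)
)
```

The sliced version has no branch on the data: the same five bitwise operations are applied whatever the inputs are.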
Now my question is,
1. Considering I am using GCC or GCC-7 with -O3, how far can the compiler optimize the expression? For example, if A_64 becomes all zeroes (which can happen with probability 2^{-64}), then we don't need to calculate the first part of eqn. 2, and O_64 becomes equal to D_64 & E_64 & F_64. Is it possible for a C compiler to optimize in such a way? Keep in mind that the values are filled in at runtime and the boolean expressions have around 120 variables.
2. Is it possible for a processor to do such an optimization (as in item 1) at runtime? As my boolean expression is very long, the execution will be heavily pipelined. Is it possible for a processor to pull an operation out of the pipeline if such a situation arises?
Please, let me know if any part of the question is not understandable.
I appreciate your help.
Is it possible for a C compiler to optimize in such a way?
It's allowed to do it, but it probably won't. There is nothing to gain in general. If part of the expression were statically known to be zero, that would be used. But inserting branches inside bitwise calculations is almost always counterproductive, and I've never seen a compiler judge a sequence of ANDs to be "long enough to be worth inserting an early-out" (you can certainly do so manually, of course). I can't give you a hard guarantee, though; if you want to be sure, you should always check the assembly.
What it probably will do (for longer expressions at least) is reassociate the expression for more instruction-level parallelism. So code like that probably won't be just two long (but parallel with each other) chains of dependent ANDs, but be split up into more chains. That still wouldn't make the time depend on the values.
Is it possible for a processor to do such an optimization during runtime?
Extremely hypothetically yes. No processor architecture that I am aware of does that. It would be a slightly tricky mechanism, and as a general rule it would almost never help.
Hypothetically it could work like this: when the operands for an AND instruction are looked up and one (or both) of them is found to be renamed to the hard-wired zero-register, the renamer can immediately rename the destination to zero as well (rather than allocating a new register for the result), effectively giving that AND instruction 0-latency. The flags output would also be known so the µop would not even have to be executed. It would roughly be a cross between copy-elimination and a zeroing idiom.
That mechanism wouldn't even trigger unless one of the inputs is set to zero with a zeroing idiom; an input that merely happens to be zero wouldn't be detected. It would also not completely remove the influence of the redundant AND instructions: they still have to go through (most of) the front-end of the processor, even if only to find out that they didn't need to be executed after all.

When are Warren's Abstract Machine program instructions executed?

I'm reading Hassan Aït-Kaci's "Warren's Abstract Machine: A Tutorial Reconstruction".
In Chapter 2, the compilation of L0 programs is presented after the compilation of L0 queries. The program compilation section (2.3) starts with:
Compiling a program term p is just a bit trickier, although not by
much. Observe that it assumes that a query ?- q will have built a term
on the heap and set register X1 to contain its address. Thus,
unifying q to p can proceed by following the term structure already
present in X1 as long as it matches functor for functor the structure of p.
So the compilation of a program is made after instructions obtained from query compilation are executed? Does that even make sense? I'm confused...
What makes sense to me: WAM code generated from a program's annotated syntax tree is stored by the interpreter. For each procedure (defined in the program) a block of WAM code is stored. When a query is made, its instructions are generated and executed. If the query is calling a defined procedure, execute its block of code. Is it something like that?
Please note that what you quote is from the very beginning of a series of increasingly complex virtual machines that are introduced in this text:
We consider here ℒ0, a very simple language indeed. In this language, one can specify only two sorts of entities: a program term and a query term. Both program and query are first-order terms but not variables. The semantics of ℒ0 is simply tantamount to computing the most general unifier of the program and the query.
This simple language is interpreted as you describe.
In later sections of the book, the design and execution of more complex machines becomes proportionally more sophisticated, and already a few pages later we find for example:
In ℳ1, compiled code is stored in a code area (CODE), an addressable array of data words, each containing a possibly labeled instruction over one or more memory words consisting of an opcode followed by operands.
This is already the design you describe at the end of your post, which is of course how actual Prolog code is compiled in practice.
So the compilation of a program is made after instructions obtained from query compilation are executed? Does that even make sense? I'm confused...
In the beginning, this is clarified (2, last paragraph):
The idea is quite simple: having defined a program term p, one can submit any query ?-q and execution either fails if p and q do not unify, or succeeds with a binding of the variables in q obtained by unifying it with p.
As @mat already states, this is a step-by-step approach that starts from very simple programs: just one ground fact and a query.
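The quoted semantics (succeed with the bindings obtained by unifying q with p, or fail) can be sketched with a plain term unifier. This is a generic illustration of computing a most general unifier, not the compiled WAM code from the tutorial. The term encoding (tuples for structures, capitalised strings for variables) and the example terms are assumptions chosen for illustration, and there is no occurs check.

```python
def is_var(t):
    # Assumption: variables are strings starting with an uppercase letter.
    return isinstance(t, str) and t[:1].isupper()

def walk(t, subst):
    """Follow variable bindings until an unbound variable or a non-variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    """Return an extended substitution, or None if a and b do not unify."""
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        for x, y in zip(a[1:], b[1:]):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

def resolve(t, subst):
    """Apply the substitution fully to a term."""
    t = walk(t, subst)
    if isinstance(t, tuple):
        return (t[0],) + tuple(resolve(x, subst) for x in t[1:])
    return t

query = ("p", "Z", ("h", "Z", "W"), ("f", "W"))          # ?- p(Z, h(Z,W), f(W)).
program = ("p", ("f", "X"), ("h", "Y", ("f", "a")), "Y")  # p(f(X), h(Y,f(a)), Y).
mgu = unify(query, program, {})
answer = {v: resolve(v, mgu) for v in ("W", "Z")}
```

Here the query succeeds and answer holds the bindings of the query variables; a query with a different functor or arity makes unify return None, which corresponds to failure.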

How to see the local variable in DDC-I debugger?

I am trying to see the index value of a for loop in the DDC-I debugger, and it always shows me ERROR.
With the assembly of the same, it shows the following instruction:
cmp cr7,0,r20,r23
so it's comparing r20 and r23, but neither of these registers holds the index value. I am not sure what cr7 is.
In short, most embedded tool chains (including the ones you pay for) are horrible about reconstructing local/automatic variables in even lightly optimized code. A lot of them simply can't reconstruct variables that never have storage because they live in registers the whole time (loop index variables like the one you can't see are typical cases). Some even have issues with interim computation holders, and arguments (since they're almost always passed as registers).
Typical strategies might be:
Temporarily turning off optimizations around the code in question
Temporarily moving the variable in question to the global scope
Becoming proficient at reading disassembly.
This isn't a terribly practical answer, but it is surprising for a lot of people that are new to the embedded world or never had the luxury of a source level debugger on their embedded platform.
On PowerPC there are eight CR fields, cr0 to cr7. If you don't specify a CR field for a compare result the default is cr0, but in this case cr7 is specified and so the flags in field cr7 will indicate the result of the compare operation. There are 4 condition code bits in each CR field: lt, gt, eq and so. Typically the compare will be followed by a conditional branch, bc.
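As a rough model of what the compare does to the condition register, the sketch below treats CR as a 32-bit integer with cr0 in the most significant four bits and cr7 in the least significant four, and each field ordered lt, gt, eq, so. The helper names cmp_signed and cr_field are hypothetical, and the so bit is passed in by hand because this sketch does not model the XER register it is copied from.

```python
LT, GT, EQ, SO = 8, 4, 2, 1   # bit weights inside one 4-bit CR field

def cmp_signed(cr, field, a, b, so=0):
    """Return CR with the given field set as by a signed compare of a and b."""
    bits = (LT if a < b else GT if a > b else EQ) | (so & 1)
    shift = 4 * (7 - field)                      # cr0 is the top nibble
    cleared = cr & ~(0xF << shift) & 0xFFFFFFFF  # wipe the old field contents
    return cleared | (bits << shift)

def cr_field(cr, field):
    """Extract the 4-bit field crN from CR."""
    return (cr >> (4 * (7 - field))) & 0xF

cr = cmp_signed(0, 7, 3, 10)          # like a compare into cr7 with 3 < 10
is_less = bool(cr_field(cr, 7) & LT)  # a following branch would test this bit
```

A conditional branch after the compare then tests one of these bits in cr7, so the loop index is whichever register feeds the compare that controls the loop's back edge.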
There is some useful info in this IBM developerWorks article: Assembly language for Power Architecture, Part 3: Programming with the PowerPC branch processor.

How to get variable/function definitions set in Parallel (e.g. with ParallelMap)?

I have a function that I use to look up a value based on an index. The value takes some time to calculate, so I want to do it with ParallelMap. The function references another, similar function that returns a list of expressions, also based on an index.
However, when I set it all up in a seemingly reasonable fashion, I see some very bizarre behaviour. First, I see that the function appears to work, albeit very slowly. For large indexes, however, the processor activity in Taskmangler stays entirely at zero for an extended period of time (i.e. 2-4 minutes) where all instances of Mathematica are seemingly inert. Then, without the slightest blip of CPU use, a result appears. Is this another case of Mathematica spukhafte Fernwirkung?
That is, I want to create a variable/function that stores an expression, here a list of integers (ListOfInts), and then on the parallel workers I want to perform some function on that expression (here I apply a set of replacement rules and take the Min). I want the result of that function to also be indexed by the same index under another variable/function (IndexedFunk), whose result is then available back on the main instance of Mathematica:
(*some arbitrary rules that will convert some of the integers to negative values:*)
rulez=Dispatch[Thread[Rule[Range[222],-Range[222]]]];
maxIndex = 333;
Clear[ListOfInts]
Scan[(ListOfInts[#]=RandomInteger[{1,999},55])&,Range[maxIndex ]]
(*just for safety's sake:*)
DistributeDefinitions[rulez, ListOfInts]
Clear[IndexedFunk]
(*I believe I have to have at least one value of IndexedFunk defined before I Share the definition to the workers:*)
IndexedFunk[1]=Min[ListOfInts[1]]/.rulez
(*... and this should let me retrieve the values back on the primary instance of MMA:*)
SetSharedFunction[IndexedFunk]
(*Now, here is the mysterious part: this just sits there on my multiprocessor machine for many minutes until suddenly a result appears. If I up maxIndex to say 99999 (and of course re-execute the above code again) then the effect can more clearly be seen.*)
AbsoluteTiming[Short[ParallelMap[(IndexedFunk[#]=Min[ListOfInts[#]/.rulez])&, Range[maxIndex]]]]
I believe this is some bug, but then I am still trying to figure out Mathematica Parallel, so I can't be too confident in this conclusion. Despite its being depressingly slow, it is nonetheless impressive in its ability to perform calculations without actually requiring a CPU to do so.
I thought perhaps it was due to whatever communications protocol is being used between the master and slave processes: perhaps it is so slow that it just appears that the processors are doing nothing when in fact they are just waiting to send the next bit of some definition or other. In that case I thought ParallelMap[..., Method->"CoarsestGrained"] would be of some use. But no, that doesn't work either.
A question: "Am I doing something obviously wrong, or is this a bug?"
I am afraid you are. The problem is with the shared definition of a variable. Mathematica maintains a single coherent value in all copies of the variable across kernels, and therefore that variable becomes a single point of huge contention. The CPU is idle because the kernels line up in a queue waiting for the variable IndexedFunk, and most of the time is spent in interprocess or inter-machine communication. Go figure.
By the way, there is no function SetSharedDefinition in any Mathematica version I know of. You probably intended to write SetSharedVariable. But remove that evil call anyway! To avoid contention, return results from the parallelized computation as a list of pairs, and then assemble them into downvalues of your variable at the main kernel:
Clear[IndexedFunk]
Scan[(IndexedFunk[#[[1]]] = #[[2]]) &,
ParallelMap[{#, Min[ListOfInts[#] /. rulez]} &, Range[maxIndex]]
]
ParallelMap takes care of distributing definitions automagically, so the call to DistributeDefinitions is superfluous. (As a minor note, it is not correct as written, since it omits the maxIndex variable, but the omission is automatically taken care of by ParallelMap in this particular case.)
EDIT, NB!: The automatic distribution applies only to version 8 of Mathematica. Thanks @MikeHoneychurch for the correction.

Alternatives to the WAM

I remember once reading that there were at least two other alternatives invented roughly at the same time as the WAM. Any pointers?
Prior to the WAM, there was the ZIP by Clocksin. Its design is still very interesting. SWI-Prolog uses it, and B-Prolog has also slowly migrated from a WAM design towards the ZIP. Of course, many new innovations were developed along the way. Another alternative is the VAM.
A comparison as of 1993 is:
http://www.complang.tuwien.ac.at/ulrich/papers/PDF/binwam-nov93.pdf
In the meantime, the most interesting architectural developments are related to B-Prolog.
WAM vs. ZIP
The key difference between the WAM and the ZIP is the precise interface for a predicate's arguments. In the WAM, the arguments are all passed via registers, that is, either real registers or at least fixed locations in memory. The ZIP passes all arguments via the stack.
Let's consider a minimal example:
p(R1,R2,R3,L1,L2,L3) :-   % WAM                 % ZIP
                          % store L1..L3        % nothing
                          % nothing             % push R1..R3
                          % init X1..X3         % push X1..X3
   q(R1,R2,R3,X1,X2,X3),
                          % put unsafe X1..X3   % push X1..X3
                          % load L1..L3         % push L1..L3
   r(X1,X2,X3,L1,L2,L3).
Prior to calling q:
The WAM does not need to do any action for arguments that are passed on to the first goal at the very same positions (R1..R3). This is particularly interesting for binary clauses - that is, clauses with exactly one regular goal at the end. Here the WAM excels.
The other arguments L1..L3 need to be stored locally. So for these arguments, the register interface did not do anything good.
The ZIP on the other hand does not need to save arguments - they are already saved on the stack. This is not only good for clauses with more than one goal, but also for other interrupting goals like constraints or interrupts.
As a downside, the ZIP must push again R1..R3.
Both have to initialize X1..X3 and store them on the stack.
Calling q:
When calling q, the WAM has to allocate stack space for X1..X3 and L1..L3 thus 6 cells, whereas the ZIP needs R1..R3,L1..L3,X1..X3. So here, the WAM is more space efficient. Also, the WAM permits environment trimming (for more complex situations) which is next-to-impossible for the ZIP.
Prior to calling r:
This r is the last call, and systems try to free the space for this clause, provided no choice point is present.
For the WAM, the existential variables X1..X3 have to be checked for being still uninstantiated local variables (put_unsafe), and if so, they are moved onto the heap. That's expensive, but occurs rarely. L1..L3 are just loaded. That's all; the WAM can now safely deallocate the local frame. So last call optimization is dirt cheap.
For the ZIP, everything has to be pushed as usual. Only then does an extra scan examine all the values on the stack and move them accordingly. That's rather expensive. Some optimizations are possible, but it is still much more than what the WAM does. (A possible improvement would be to push arguments in reverse order. Then the variables L1..L3 might be left in their location, so these variables would not need any handling. I have not seen such an implementation yet.)
In the technical note entitled An abstract Prolog instruction set, Warren also references another compiler by Bowen, Byrd, and Clocksin. However, he says that the two architectures have much in common, so I don't know whether that compiler could be really considered as an alternative.
Not sure if this is what you mean, but the first two Prolog implementations were an interpreter written in Fortran by Colmerauer et al. and a DEC PDP-10 native compiler by Warren et al.
Warren mentions these in his foreword to Aït-Kaci's Tutorial Reconstruction of the WAM. If this is not what you mean, you may find it in that document or its references.
