Identify similar expressions? - data-structures

How can I validate two expressions that are logically equivalent ?
For e.g :
(a+b) <=> (b+a) or (a+(b+c)) <=> ((a+b)+c) or (a && b) <=> (b && a) ,etc
I want an optimize solution for it, that will identify the duplicate one from 1000s of expressions.

Contrary to a comment by #japreiss, there is no open research question here. In the context of propositional expressions like the examples given, logical equivalence is very well understood. (I'm assuming OP is using + to mean logical OR rather than numeric addition)
A question for the OP: are you interested in doing this by hand or automating it via programming code?
There are multiple approaches but the one taught in most intro to logic courses and text books would be to construct separate truth tables for the left-hand side and right-hand side, and see if the final values on each row are identical.
If you have a computer system that can already evaluate the truth of an expression like (a+(b+c)) and ((a+b)+c), then that same system can probably be given the whole expression (a+(b+c)) <=> ((a+b)+c) and can tell you whether or not it is a tautology.
With regard to comparing 1,000's of expressions to find duplicates, one technique is to convert each expression to conjunctive normal form and then just use string comparison to see which ones are identical.

Related

Retrieve internal/canonical representation of terms in Prolog

I am writing a program for which I need terms in their prefix notation.
The point is to being able to parse mathematical expressions to prefix notation, while preserving the correct order of Operations. I then want to save the result in the database for later use (using assert), which includes translating to another language, which uses prefix notation. Prolog Operators do all have a fixed priority which is a feature I want to use, as I will be using all sorts of operators (including clp operators).
As among others I need to include complete mathematical expressions, such as the equality operator. Thus I cannot recursively use the Univ operator (=..), because it won't accept equality operators etc. Or can I somehow use =.. ?
Essentially I want to work with the internal representation of
N is 3*4+5 % just a random example
which would be
is(N,+(*(3,4),5))
Now, I do know that I can use, write_canonical(N is 3*4+5) to get the internal representation as seen above.
So is there a way to somehow get the internal representation as a term or a list, or something.
Would it be possible to bind the output of write_canonical to a variable?
I hope my question is clear enough.
Prolog terms can be despicted as trees. But, when writing a term, the way a term is displayed depends on the defined operators and write options. Consider:
?- (N is 3*4+5) = is(N,+(*(3,4),5)).
true.
?- (N is 3*4+5) = is(Variable, Expression).
N = Variable,
Expression = 3*4+5.
?- 3*4+5 = +(*(3,4),5).
true.
I.e. operators are syntactic sugar. They don't change how terms are represented, only how terms are displayed.

most readable way in XPath to write "is value X a member of sequence S"?

XPath 2.0 has some new functions and syntax, relative to 1.0, that work with sequences. Some of theset don't really add to what the language could already do in 1.0 (with node sets), but they make it easier to express the desired logic in ways that are more readable. This increases the chances of the programmer getting the code correct -- and keeping it that way. For example,
empty(s) is equivalent to not(s), but its intent is much clearer when you want to test whether a sequence is empty.
Correction: the effective boolean value of a sequence is in general more complicated than that. E.g. empty((0)) != not((0)). This applies to exists(s) vs. s in a boolean context as well. However, there are domains of s where empty(s) is equivalent to not(s), so the two could be used interchangeably within those domains. But this goes to show that the use of empty() can make a non-trivial difference in making code easier to understand.
Similarly, exists(s) is equivalent to boolean(s) that already existed in XPath 1.0 (or just s in a boolean context), but again is much clearer about the intent.
Quantified expressions; e.g. "some $x in expression satisfies test($x)" would be equivalent to boolean(expression[test(.)]) (although the new syntax is more flexible, in that you don't need to worry about losing the context item because you have the variable to refer to it by).
Similarly, "every $x in expression satisfies test($x)" would be equivalent to not(expression[not(test(.))]) but is more readable.
These functions and syntax were evidently added at no small cost, solely to serve the goal of writing XPath that is easier to map to how humans think. This implies, as experienced developers know, that understandable code is significantly superior to code that is difficult to understand.
Given all that ... what would be a clear and readable way to write an XPath test expression that asks
Does value X occur in sequence S?
Some ways to do it: (Note: I used X and S notation here to indicate the value and the sequence, but I don't mean to imply that these subexpressions are element name tests, nor that they are simple expressions. They could be complicated.)
X = S: This would be one of the most unreadable, since it requires the reader to
think about which of X and S are sequences vs. single values
understand general comparisons, which are not obvious from the syntax
However, one advantage of this form is that it allows us to put the topic (X) before the comment ("is a member of S"), which, I think, helps in readability.
See also CMS's good point about readability, when the syntax or names make the "cardinality" of X and S obvious.
index-of(S, X): This one is clear about what's intended as a value and what as a sequence (if you remember the order of arguments to index-of()). But it expresses more than we need to: it asks for the index, when all we really want to know is whether X occurs in S. This is somewhat misleading to the reader. An experienced developer will figure out what's intended, with some effort and with understanding of the context. But the more we rely on context to understand the intent of each line, the more understanding the code becomes a circular (spiral) and potentially Sisyphean task! Also, since index-of() is designed to return a list of all the indexes of occurrences of X, it could be more expensive than necessary: a smart processor, in order to evaluate X = S, wouldn't necessarily have to find all the contents of S, nor enumerate them in order; but for index-of(S, X), correct order would have to be determined, and all contents of S must be compared to X. One other drawback of using index-of() is that it's limited to using eq for comparison; you can't, for example, use it to ask whether a node is identical to any node in a given sequence.
Correction: This form, used as a conditional test, can result in a runtime error: Effective boolean value is not defined for a sequence of two or more items starting with a numeric value. (But at least we won't get wrong boolean values, since index-of() can't return a zero.) If S can have multiple instances of X, this is another good reason to prefer form 3 or 6.
exists(index-of(X, S)): makes the intent clearer, and would help the processor eliminate the performance penalty if the processor is smart enough.
some $m in S satisfies $m eq X: This one is very clear, and matches our intent exactly. It seems long-winded compared to 1, and that in itself can reduce readability. But maybe that's an acceptable price for clarity. Keep in mind that X and S could potentially be complex expressions themselves -- they're not necessarily just variable references. An advantage is that since the eq operator is explicit, you can replace it with is or any other comparison operator.
S[. eq X]: clearer than 1, but shares the semantic drawbacks of 2: it computes all members of S that are equal to X. Actually, this could return a false negative (incorrect effective boolean value), if X is falsy. E.g. (0, 1)[. eq 0] returns 0 which is falsy, even though 0 occurs in (0, 1).
exists(S[. eq X]): Clearer than 1, 2, 3, and 5. Not as clear as 4, but shorter. Avoids the drawbacks of 5 (or at least most of them, depending on the processor smarts).
I'm kind of leaning toward the last one, at this point: exists(S[. eq X])
What about you... As a developer coming to a complex, unfamiliar XSLT or XQuery or other program that uses XPath 2.0, and wanting to figure out what that program is doing, which would you find easiest to read?
Apologies for the long question. Thanks for reading this far.
Edit: I changed = to eq wherever possible in the above discussion, to make it easier to see where a "value comparison" (as opposed to a general comparison) was intended.
For what it's worth, if names or context make clear that X is a singleton, I'm happy to use your first form, X = S -- for example when I want to check an attribute value against a set of possible values:
<xsl:when test="#type = ('A', 'A+', 'A-', 'B+')" />
or
<xsl:when test="#type = $magic-types"/>
If I think there is a risk of confusion, then I like your sixth formulation. The less frequently I have to remember the rules for calculating an effective boolean value, the less frequently I make a mistake with them.
I prefer this one:
count(distinct-values($seq)) eq count(distinct-values(($x, $seq)))
When $x is itself a sequence, this expression implements the (value-based) subset of relation between two sets of values, that are represented as sequences. This implementation of subset of has just linear time complexity -- vs many other ways of expressing this, that have O(N^2)) time complexity.
To summarize, the question whether a single value belongs to a set of values is a special case of the question whether one set of values is a subset of another. If we have a good implementation of the latter, we can simply use it for answering the former.
The functx library has a nice implementation of this function, so you can use
functx:is-node-in-sequence($X, $Y)
(this particular function can be found at http://www.xqueryfunctions.com/xq/functx_is-node-in-sequence.html)
The whole functx library is available for both XQuery (http://www.xqueryfunctions.com/) and XSLT (http://www.xsltfunctions.com/)
Marklogic ships the functx library with their core product; other vendors may also.
Another possibility, when you want to know whether node X occurs in sequence S, is
exists((X) intersect S)
I think that's pretty readable, and concise. But it only works when X and the values in S are nodes; if you try to ask
exists(('bob') intersect ('alice', 'bob'))
you'll get a runtime error.
In the program I'm working on now, I need to compare strings, so this isn't an option.
As Dimitri notes, the occurrence of a node in a sequence is a question of identity, not of value comparison.

Efficiently store and evaluate a large number of boolean expressions

I have a huge set (20000) of boolean expressions. They consist of AND, OR and NOT operators and a large number of boolean variables A1, A2, A3 ... (about 1000). Most expression contain only 5, maybe 20 of these variables.
Given an assignment of the variables (A1 = true, A2 = false, A3 = false ...) I have to find those expressions that evaluate to false.
The same set of expressions will be evaluated for multiple (10-100) assignments
For this purpose:
How should I store the expressions on disk so I can load and parse them fast (I currently have them either as some specialized DSL or as a more or less normalized (and dead slow) relational data structure, but I can change that)
Is there a fast algorithm / data structure for evaluating such expressions that I can use?
Do implementations on the JVM exist?
You may want to look at converting your expressions into Conjunctive Normal Form and combining like terms. You then can have a two-way mapping of an expression to a set of terms, any of which evaluating to false implies that the whole expression evaluates false. For each assignment of variables, start with a set of expressions, evaluate CNF terms until one evaluates to false. If that term is false, then all expressions involving that term will also be false, so those expressions can also be removed from the set.
Whether such an approach fits your case can't be said without looking at the expressions - with 1000 variables and 20000 expressions, it might not be that they have many CNF terms in common.
Outside of Java, and for much larger numbers of expressions, DNF is possibly more useful, since its implementation on the GPU is obvious.
The SOP answer to this is to store the expressions as strings in RPN (Reverse Polish Notation) and then write a simple Stack Machine parser to evaluate them.
Generally, an RPN string can be evaluated almost as fast as an already in-memory AST (Abstract Symbol Tree). And the stack machine parser is dead easy to write.
You seem attached to Java, but have you considered feeding these things to a language that has an eval() function? It would probably reduce the problem to saving an expression in a file and evaluating it. Note that if you don't trust the (source of the) expressions, this has security implications!
Jython comes to mind, but there are probably several that would make very short work of this.
If you're married to java, you could probably implement a recursive descent parser for boolean algebra. But that's quite a bit more involved.
UPDATE: The following site has code that might help.
Convert your list of expressions into source code for a function that when called with the value of the variables will evaluate all the functions and return an indication of which expressions evaluate to false. compile the function then call it for your different variable values.
I have done similar and used Python. The only parsing and interpretation I had to write was to translate the input boolean operators, '&', '|', '~' into their Python equivalents.
Your problem size seems quite OK for a Python solution.
You could build an index where for each variable you record two sets of expressions, those where the variable occurs positively and those where it occurs negatively. Depending on the values of the variables you collect those expressions which could become false due to this variable (positive occurrences if the variables is set to false and vice versa). Edit: These are just candidates, you still need to evaluate them to find out if they really become false.
Whether this helps compared to just evaluating all your expressions depends on the structure of your expressions and how many evaluate to false.
Try to convert them into CNF and use MiniSat to check whether the expression evaluates to true or false

Function to detect conflicting mathematical operators in VB6

I've developed a program which generates insurance quotes using different types of coverages based on state criteria. Now I want to add the ability to specify 'rules'. For example we may have 3 types of coverage (we'll call them UM, BI, and PD). Well some states don't allow PD to be greater than BI and other states don't allow UM to exist without BI. So I've added the ability for the user to create these rules so that when the quote is generated the rule will be followed and thus no state regulations will be violated when the program generates the quote.
The Problem
I don't want the user to be able to select conflicting rules. The user can select any of the VB mathematical operators (>, <, >=, <=, =, <>) and set a coverage on either side. They can do this multiple times (but only one at a time) so they might end up with a list of rules like this:
A > B
B > C
C > A
As you can see, the last rule conflicts with the previously set rules. My solution to this was to validate the list each time the user clicks 'Add rule to list'.
Pretend the 3rd list item is not yet in the list but the user has clicked 'add rule' to put it in the list. The validation process first checks to see if both incoming variables have already been used on the same line. If not, it just searches for the left side incoming variable (in this case 'C') in the already created list. if it finds it, it then sets tmp1 equal to the variable across from the match (tmp1 = 'B'). It then does the same for the incoming variable on the right side (in this case 'A'). Then tmp2 is set equal to the variable across from A (tmp2 = 'B'). If tmp1 and tmp2 are equal then the incoming rule is either conflicting OR is irrelevant regardless of the operators used. I'm pretty sure this is solid logic given 3 variables. However, I found that adding any additional variables could easily bypass my validation. There could be upwards of 10 coverage types in any given state so it is important to be able to validate more than just 3.
Is there any uniform way to do a sound validation given any number of variables? Any ideas or thoughts are appreciated. I hope my explanation makes sense.
Thanks
My best bet is some sort of hierarchical tree of rules. When the user adds the first rule (say A > B), the application could create a data structure like this (lowerValues is a Map which the key leads to a list of values):
lowerValues['A'] = ['B']
Now when the user adds the next rule (B > C), the application could check if B is already in a any lowerValues list (in this case, A). If that happens, C is added to lowerValues['A'], and lowerValues['B'] is also created:
lowerValues['A'] = ['B', 'C']
lowerValues['B'] = ['C']
Finally, when the last rule is provided by the user (C > A), the application checks if C is in any lowerValues list. Since it's in B and A, the rule is invalid.
Hope that helps. I don't remember if there's some sort of mapping in VB. I think you should try the Dictionary object.
In order to this idea works out, all the operations must be internally translated to a simple type. So, for example:
A > B
could be translated as
B <= A
Good luck
In general this is a pretty hard problem. What you in fact want to know is if a set of propositional equations over (apparantly) some set of arithmetic is true. To do this you need what amounts to constraint solvers that "know" arithmetic. Not likely to find that in VB6, but you might be able to invoke one as a subprocess.
If the rules are propositional equations only over inequalities (AA", write them only one way).
Second, try solving the propositions for tautology (see for Wang's algorithm which you can likely implment awkwardly in VB6).
If the propositions are not a tautology, now you want build chains of inequalities (e.g, A > B > C) as a graph and look for cycles. The place this fails is when your propositions have disjunctions, e.g., ("A>B or B>Q"); you'll have to generate an inequality chain for each combination of disjunctions, and discard the inconsistent ones. If you discard all of them, the set is inconsistent. Watch out for expressions like "A and B"; by DeMorgans theorem, they're equivalent to "not A or not B", e.g., "A>B and B>Q" is the same as "A<=B or B<=Q". You might want to reduce the conditions to disjunctive normal form to avoid getting suprised.
There are apparantly decision procedures for such inequalities. They're likely hard to implement.

Where might I find a method to convert an arbitrary boolean expression into conjunctive or disjunctive normal form?

I've written a little app that parses expressions into abstract syntax trees. Right now, I use a bunch of heuristics against the expression in order to decide how to best evaluate the query. Unfortunately, there are examples which make the query plan extremely bad.
I've found a way to provably make better guesses as to how queries should be evaluated, but I need to put my expression into CNF or DNF first in order to get provably correct answers. I know this could result in potentially exponential time and space, but for typical queries my users run this is not a problem.
Now, converting to CNF or DNF is something I do by hand all the time in order to simplify complicated expressions. (Well, maybe not all the time, but I do know how that's done using e.g. demorgan's laws, distributive laws, etc.) However, I'm not sure how to begin translating that into a method that is implementable as an algorithm. I've looked at papers on query optimization, and several start with "well first we put things into CNF" or "first we put things into DNF", and they never seem to explain their method for accomplishing that.
Where should I start?
Look at https://github.com/bastikr/boolean.py
Example:
def test(self):
expr = parse("a*(b+~c*d)")
print(expr)
dnf_expr = normalize(boolean.OR, expr)
print(list(map(str, dnf_expr)))
cnf_expr = normalize(boolean.AND, expr)
print(list(map(str, cnf_expr)))
Output is:
a*(b+(~c*d))
['a*b', 'a*~c*d']
['a', 'b+~c', 'b+d']
UPDATE: Now I prefer this sympy logic package:
>>> from sympy.logic.boolalg import to_dnf
>>> from sympy.abc import A, B, C
>>> to_dnf(B & (A | C))
Or(And(A, B), And(B, C))
>>> to_dnf((A & B) | (A & ~B) | (B & C) | (~B & C), True)
Or(A, C)
The naive vanilla algorithm, for quantifier-free formulae, is :
for CNF, convert to negation normal form with De Morgan laws then distribute OR over AND
for DNF, convert to negation normal form with De Morgan laws then distribute AND over OR
It's unclear to me if your formulae are quantified. But even if they aren't, it seems the end of the wikipedia articles on conjunctive normal form, and its - roughly equivalent in the automated theorm prover world - clausal normal form alter ego outline a usable algorithm (and point to references if you want to make this transformation a bit more clever). If you need more than that, please do tell us more about where you encounter a difficulty.
I've came across this page: How to Convert a Formula to CNF. It shows the algorithm to convert a Boolean expression to CNF in pseudo code. Helped me to get started into this topic.

Resources