How to get rid of unnecessary parentheses in a mathematical expression - algorithm

Hi, I was wondering if there is any known way to get rid of unnecessary parentheses in a mathematical formula. The reason I am asking is that I have to minimize the length of formulas like this one:
if((-if(([V].[6432])=0;0;(([V].[6432])-([V].[6445]))*(((([V].[6443]))/1000*([V].[6448])
+(([V].[6443]))*([V].[6449])+([V].[6450]))*(1-([V].[6446])))))=0;([V].[6428])*
((((([V].[6443]))/1000*([V].[6445])*([V].[6448])+(([V].[6443]))*([V].[6445])*
([V].[6449])+([V].[6445])*([V].[6450])))*(1-([V].[6446])));
It is basically part of an SQL select statement. It cannot exceed 255 characters and I cannot modify the code that produces this formula (basically a black box ;) )
As you can see, many parentheses are useless, not to mention the fact that:
((a) * (b)) + (c) = a * b + c
So I want to keep the order of operations: parentheses, multiply/divide, add/subtract.
I'm working in VB, but a solution in any language will be fine.
Edit
I found the opposite problem (adding parentheses to an expression) in this question.
I really thought that this could be accomplished without heavy parsing. But it seems that some parser that walks through the expression and stores it in an expression tree is inevitable.

If you are interested in removing the unnecessary parentheses from your expression, the generic solution is to parse your text and build the associated expression tree.
Then, from this tree, you can produce the corresponding text without unnecessary parentheses by applying a few rules (sketched in code just after this list):
if the node is a "+", no parentheses are required
if the node is a "*", then parentheses are required for the left (right) child only if the left (right) child is a "+"
the same applies for "/"; and because "-" and "/" are not associative, their right child also needs parentheses when it has the same precedence (e.g. a-(b+c), a/(b*c))
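To make this concrete, here is a minimal sketch of the parse-then-re-emit idea in Python (not VB). It only handles numbers, simple identifiers and the four binary operators, so the [V].[6432]-style references, the if(...) calls and unary minus from the real formula are not covered; the tokenizer, the tuple-based tree and the precedence table are all assumptions of mine, not part of the original format:

import re

TOKEN = re.compile(r"\d+\.?\d*|[A-Za-z_]\w*|[()+\-*/]")

def tokenize(text):
    text = "".join(text.split())           # drop whitespace up front
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(m.group(0))
        pos = m.end()
    return tokens

def parse(tokens):
    # Recursive descent: sum -> product (('+'|'-') product)*, product -> atom (('*'|'/') atom)*
    def parse_sum(i):
        node, i = parse_product(i)
        while i < len(tokens) and tokens[i] in "+-":
            op = tokens[i]
            right, i = parse_product(i + 1)
            node = (op, node, right)
        return node, i

    def parse_product(i):
        node, i = parse_atom(i)
        while i < len(tokens) and tokens[i] in "*/":
            op = tokens[i]
            right, i = parse_atom(i + 1)
            node = (op, node, right)
        return node, i

    def parse_atom(i):
        if tokens[i] == "(":
            node, i = parse_sum(i + 1)
            return node, i + 1             # skip the closing ")"
        return tokens[i], i + 1            # leaf: number or identifier

    return parse_sum(0)[0]

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def emit(node, parent_prec=0):
    if isinstance(node, str):
        return node
    op, left, right = node
    prec = PREC[op]
    # "-" and "/" are not associative, so an equal-precedence right child
    # still needs parentheses: a-(b+c), a/(b*c)
    text = emit(left, prec) + op + emit(right, prec + (1 if op in "-/" else 0))
    # Wrap only when this operator binds more loosely than the one above it.
    return "(" + text + ")" if prec < parent_prec else text

print(emit(parse(tokenize("((a) * (b)) + (c)"))))    # a*b+c
print(emit(parse(tokenize("(a+b)*(c-(d+e))"))))      # (a+b)*(c-(d+e))

The only place parentheses are ever emitted is the final comparison, which is exactly the rule list above expressed as precedence numbers.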
But if your problem is just to deal with these 255 characters, you can probably just use intermediate variables to store intermediate results
T1 = (([V].[6432])-([V].[6445]))*(((([V].[6443]))/1000*([V].[6448])+(([V].[6443]))*([V].[6449])+([V].[6450]))*(1-([V].[6446])))))
T2 = etc...

You could strip the simplest cases:
([V].[6432]) and (([V].[6443]))
Becomes
v.[6432]
You shouldn't need the [] around the table name or its alias.
You could shorten it further if you can alias the columns:
select v.[6432] as a, v.[6443] as b, ....
Or even put all the tables being queried into a single subquery - then you wouldn't need the table prefix:
if((-if(a=0;0;(a-b)*((c/1000*d
+c*e+f)*(1-g))))=0;h*
(((c/1000*b*d+c*b*
e+b*f))*(1-g));
select [V].[6432] as a, [V].[6445] as b, [V].[6443] as c, [V].[6448] as d,
[V].[6449] as e, [V].[6450] as f,[V].[6446] as g, [V].[6428] as h ...
Obviously this is all a bit pseudo-code, but it should help you simplify the full statement.

I know this thread is really old, but since it is searchable from Google, here is my approach.
I'm writing a TI-83 Plus calculator program that addresses similar issues. In my case, I'm trying to actually solve the equation for a specific variable, but it may still relate to your problem, although I'm using an array, so it might be easier for me to pick out specific values...
It's not quite done, but it does get rid of the vast majority of parentheses with (I think) a somewhat elegant solution.
What I do is scan through the equation/function/whatever, keeping track of each opening parenthesis "(" until I find a closing parenthesis ")", at which point I can be assured that I won't run into any more deeply nested parentheses.
y=((3x + (2))) would show the (2) first, then the (3x + (2)), and then the ((3x + (2))).
What it does then is check the values immediately before and after each pair of parentheses. In the case above, it would return + and ). Each of these is assigned a number value. Between the two of them, the higher is used. If no operators are found (*, /, +, ^, or -) I default to a value of 0.
Next I scan through the inside of the parentheses. I use a similar numbering system, although in this case I use the lowest value found, not the highest. I default to a value of 5 if nothing is found, as would be the case above.
The idea is that you can assign a number to the importance of the parentheses by subtracting the two values. If you have something like a ^ on the outside of the parentheses
(2+3)^5
those parentheses are potentially very important, and would be given a high value, (in my program I use 5 for ^).
It is possible however that the inside operators would render the parentheses very unimportant,
(2)^5
where nothing is found. In that case the inside would be assigned a value of 5. By subtracting the two values, you can then determine whether or not a set of parentheses is necessary simply by checking whether the resulting number is greater than 0. In the case of (2+3)^5, a ^ would give a value of 5, and a + would give a value of 1. The resulting number would be 4, which would indicate that the parentheses are in fact needed.
In the case of (2)^5 you would have an inner value of 5 and an outer value of 5, resulting
in a final value of 0, showing that the parentheses are unimportant, and can be removed.
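A rough Python translation of that scoring idea, in case it is easier to follow outside TI-BASIC. The weight table, the helper names and the repeated rescan are my guesses at the behaviour described above, and it ignores unary minus, implicit multiplication and function calls:

# Operator "importance" weights, as described above: ^ binds tightest, +/- loosest.
WEIGHT = {"^": 5, "*": 3, "/": 3, "+": 1, "-": 1}

def pairs(expr):
    """Yield (open_index, close_index) for each parenthesis pair, innermost first."""
    stack = []
    for i, ch in enumerate(expr):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            yield stack.pop(), i

def removable(expr, open_i, close_i):
    # Outer score: the stronger of the operators immediately outside the pair
    # (default 0 when there is no operator there at all).
    before = expr[open_i - 1] if open_i > 0 else ""
    after = expr[close_i + 1] if close_i + 1 < len(expr) else ""
    outer = max(WEIGHT.get(before, 0), WEIGHT.get(after, 0))
    # Inner score: the weakest operator at the top nesting level inside the pair
    # (default 5 when the content holds no operator, e.g. "(2)").
    depth, inner = 0, 5
    for ch in expr[open_i + 1:close_i]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and ch in WEIGHT:
            inner = min(inner, WEIGHT[ch])
    # The pair is needed only if the outside binds tighter than the inside.
    return outer - inner <= 0

def strip_parens(expr):
    changed = True
    while changed:
        changed = False
        for open_i, close_i in pairs(expr):
            if removable(expr, open_i, close_i):
                expr = expr[:open_i] + expr[open_i + 1:close_i] + expr[close_i + 1:]
                changed = True
                break                        # indices shifted, rescan from the start
    return expr

print(strip_parens("(2+3)^5"))       # (2+3)^5  (kept: ^ outside beats + inside)
print(strip_parens("(2)^5"))         # 2^5
print(strip_parens("y=((3x+(2)))"))  # y=3x+2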
The downside to this is that, (at least on the TI-83) scanning through the equation so many times is ridiculously slow. But if speed isn't an issue...
Don't know if that will help at all, I might be completely off topic. Hope you got everything up and working.

I'm pretty sure that in order to determine which parentheses are unnecessary, you have to evaluate the expressions within them. Because you can nest parentheses, this is the sort of recursive problem that a regular expression can only address in a shallow manner, and most likely with incorrect results. If you're already evaluating the expression, maybe you'd like to simplify the formula if possible. This also gets kind of tricky, and some approaches use techniques that are also seen in machine learning, such as you might see in the following paper: http://portal.acm.org/citation.cfm?id=1005298

If your variable names don't change significantly from one query to the next, you could try a series of replace() commands, e.g.
X=replace([QryString],"(([V].[6443]))","[V].[6443]")
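If the column references always look exactly like ([V].[6432]), possibly double-wrapped as (([V].[6443])), then the whole series of Replace() calls collapses into one pattern applied until nothing changes. A small Python sketch of that idea (the regex and that assumed reference format are mine, not Access/VB syntax):

import re

# Parentheses around a single bare column reference are always redundant.
REF_IN_PARENS = re.compile(r"\((\[V\]\.\[\d+\])\)")

def shorten(formula):
    prev = None
    while prev != formula:                  # repeat so (([V].[6443])) collapses in two passes
        prev = formula
        formula = REF_IN_PARENS.sub(r"\1", formula)
    return formula

expr = "((([V].[6443]))/1000*([V].[6448])+(([V].[6443]))*([V].[6449])+([V].[6450]))"
print(shorten(expr))
print(len(expr), "->", len(shorten(expr)))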
Also, why can't it surpass 255 characters? If you are storing this as a string field in an Access table, then you could try putting half the expression in 1 field and the second half in another.

You could also try parsing your expression using ANTLR, yacc or similar and creating a parse tree. These trees usually optimize parentheses away. Then you would just have to re-create the expression from the tree (without the unnecessary parentheses, obviously).
It might take you more than a few hours to get this working though. But expression parsing is usually the first example on generic parsing, so you might be able to take a sample and modify it to your needs.

Related

Parse expression with functions

This is my situation: the input is a string that contains a normal mathematical operation like 5+3*4. Functions are also possible, e.g. min(5,A*2). This string is already tokenized, and now I want to parse it using stacks (so no AST). I first used the Shunting Yard Algorithm, but here my main problem arises:
Suppose you have this (tokenized) string: min(1,2,3,+) which is obviously invalid syntax. However, SYA turns this into the output stack 1 2 3 + min(, and hopefully you see the problem coming. When parsing from left to right, it sees the + first, calculating 2+3=5, and then calculating min(1,5), which results in 1. Thus, my algorithm says this expression is completely fine, while it should throw a syntax error (or something similar).
What is the best way to prevent things like this? Add a special delimiter (such as the comma), use a different algorithm, or what?
In order to prevent this issue, you might have to keep track of the stack depth. The way I would do this (and I'm not sure it is the "best" way) is with another stack.
The new stack follows these rules:
When an open parentheses, (, or function is parsed, push a 0.
Do this in case of nested functions
When a closing parentheses, ), is parsed, pop the last item off and add it to the new last value on the stack.
The number that just got popped off is how many values were returned by the function. You probably want this to always be 1.
When a comma or similar delimiter is parsed, pop from the stack, add that number to the new last element, then push a 0.
Reset so that we can begin verifying the next argument of a function
The value that just got popped off is how many values were returned by the statement. You probably want this to always be 1.
When a number is pushed to the output, increment the top element of this stack.
This is how many values are available in the output. Numbers increase the number of values. Binary operators need to have at least 2.
When a binary operator is pushed to the output, decrement the top element.
A binary operator takes 2 values and outputs 1, thus reducing the overall number of values left on the output by 1.
In general, an n-ary operator that takes n values and returns m values should add (m-n) to the top element.
If this value ever becomes negative, throw an error!
This will find that the last argument in your example, which just contains a +, will decrement the top of the stack to -1, automatically throwing an error.
But then you might notice that a final argument in your example of, say, 3+ would return a zero, which is not negative. In this case, you would throw an error in one of the steps where "you probably want this to always be 1."
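Here is a small Python sketch of that counting scheme, applied directly to the infix token stream rather than inside the shunting-yard loop (the per-group counts come out the same). The token format, the fixed operator set and the assumption that every function and operator returns exactly one value are mine:

# Tokens are assumed to be pre-split, with a function token carrying its own "(",
# e.g. ["min(", "1", ",", "2", ")"]; unary minus is not handled.
BINARY_OPS = {"+", "-", "*", "/", "^"}

def validate(tokens):
    counts = [0]                        # values available at each nesting level
    for tok in tokens:
        if tok == "(" or tok.endswith("("):
            counts.append(0)            # new group / new argument list
        elif tok == ",":
            if counts[-1] != 1:         # the argument just finished must be exactly 1 value
                raise SyntaxError("argument yields %d values, expected 1" % counts[-1])
            counts[-1] = 0              # reset for the next argument
        elif tok == ")":
            if counts.pop() != 1:       # the last argument / group body must be exactly 1 value
                raise SyntaxError("group does not reduce to a single value")
            counts[-1] += 1             # the whole group counts as one value outside
        elif tok in BINARY_OPS:
            counts[-1] -= 1             # takes 2 values, returns 1: net -1
            if counts[-1] < 0:
                raise SyntaxError("operator %r is missing an operand" % tok)
        else:
            counts[-1] += 1             # number or variable: one more value
    if counts != [1]:
        raise SyntaxError("expression does not reduce to a single value")

validate(["min(", "1", ",", "2", "*", "3", ")"])            # passes silently
try:
    validate(["min(", "1", ",", "2", ",", "3", ",", "+", ")"])
except SyntaxError as err:
    print("rejected:", err)             # operator '+' is missing an operand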

Chained XPath axes with trailing `[1]` or `[last()]` predicates

This question is specifically about using XPath in XSLT 2.0 and Saxon.
XPaths ending with [1]
For XPaths like
following-sibling::foo[1]
descendant::bar[1]
I take it for granted that Saxon will not iterate over the entire axis but stop when it finds the first matching node - crucial in situations like:
following-sibling::foo[some:expensivePredicate(.)][1]
I assume that this is also the case for XPaths like this:
(following-sibling::foo/descendant::bar)[1]
I.e. Saxon will not compile the entire set of nodes matching following-sibling::foo/descendant::bar before picking the first one in the set. Rather, it will (even for chained axes) stop at the first matching node.
XPaths ending with [last()]
Now it gets interesting. When going "backwards" in the tree, I assume that XPaths like
preceding-sibling::foo[1]
work just as efficiently as their following-sibling equivalents. But what happens when chaining axes, e.g.
(preceding-sibling::foo/descendant::bar)[last()]
As we need to use [last()] here instead of [1],
will Saxon compile the entire set of nodes to count them to get a numeric value for last()?
Or will it be smart and stop iterating the preceding-sibling axis when it found a matching descendant?
Or will it be even more clever and iterate the descendant axis in reverse to more efficiently find the last descendant?
Saxon has a variety of strategies for evaluating last(). When used as a predicate, meaning [position()=last()], it is generally translated to an internal function [isLast()] which can be evaluated by a single-item lookahead. (So in your example of (preceding-sibling::foo/descendant::bar)[last()], it doesn't build the node-set in memory; rather, it reads the nodes one by one and when it hits the end, returns the last one it found).
In other cases, particularly when used in XSLT match patterns, Saxon will convert child::x[last()] to child::x[not(following-sibling::x)].
When none of these approaches work, for many years Saxon had two strategies for evaluating last() depending on the expression it was applied to: (a) sometimes it would evaluate the expression twice, counting nodes the first time and returning them the second time; (b) in other cases it would read all the nodes into memory. We've recently encountered cases where strategy (a) fails: see https://saxonica.plan.io/issues/3122, and so we're always doing (b).
The last() expression is potentially expensive and it should be avoided where possible. For example the classic "insert a separator between adjacent items", which is often written
xx
if (position() != last()) sep
is much better written as
if (position() != 1) sep
xx
i.e. instead of inserting the separator after every item except the last, insert it before every item except the first. Or use string-join, or xsl:value-of/@separator.

Exactly when does the reverse direction of an axis apply?

Those who are familiar with XPath know that some axes, such as preceding::, are reverse axes. And if you put a positional predicate on an expression built with a reverse axis, you may be counting backward instead of forward. E.g.
$foo/preceding-sibling::*[1]
returns the preceding sibling element just before $foo, not the first preceding sibling element (in document order).
But then you encounter variations where this rule seems to be broken, depending on how far removed the positional predicate is from the reverse axis. E.g.
($foo/preceding-sibling::*)[1]
counts forward from the beginning of the document, not backward from $foo.
Today I was writing some code where I had an expression like
$foo/preceding::bar[not(parent::baz)][1]
I wanted to be counting backwards from $foo. But was my positional predicate too far removed from the preceding:: axis? Had the expression lost its reverse direction before I added the [1]? I thought it probably wouldn't work, so I changed it to
$foo/preceding::bar[not(parent::baz)][last()]
but then I wasn't really sure of the direction, so I put in parentheses to make sure:
($foo/preceding::bar[not(parent::baz)])[last()]
However, the extra parentheses are a bit confusing, and I thought the expression might be less efficient, if it really has to count from the beginning of the (large) input document instead of backward from $foo. Was it really necessary to do it this way?
Finally I tested the original expression, and found to my surprise that it worked! So the intervening [not(parent::baz)] had not caused the expression to lose its reverse direction after all.
That problem was solved, but I've come to the point where I'd like to get a better handle on when I can expect the reverse direction of an axis to apply. My question is: At what point(s) does an XPath expression using a reverse axis lose its reverse direction?
I believe I've found the answer now, so I'll answer my own question. But I couldn't find the answer on SO, and it's something that has bothered me long enough that it was worth asking and answering here.
The best answer I found was in an old email by Evan Lenz.
It's worth reading in full, as an explanation of how XPath works in this regard, and how the XPath 1.0 spec shows us the answer. But the executive summary is in this rule:
Step ::= AxisSpecifier NodeTest Predicate*
| AbbreviatedStep
The Step production defines the syntax of a location step, and it's only within a location step that the reverse direction of an axis applies.
Any syntax that comes between the axis and a positional predicate, other than the nodetest and predicates, will break the chain and the direction will revert to forward.
This explains why, if you put parentheses around a preceding::foo and append a positional predicate outside the parentheses, the positional predicate ignores the direction of the preceding:: axis.
It also explains why my first attempt in my code today worked, despite my expectations: you can put as many predicates after a NodeTest as you want, and the direction of the axis will still apply to all of them.

most readable way in XPath to write "is value X a member of sequence S"?

XPath 2.0 has some new functions and syntax, relative to 1.0, that work with sequences. Some of these don't really add to what the language could already do in 1.0 (with node sets), but they make it easier to express the desired logic in ways that are more readable. This increases the chances of the programmer getting the code correct -- and keeping it that way. For example,
empty(s) is equivalent to not(s), but its intent is much clearer when you want to test whether a sequence is empty.
Correction: the effective boolean value of a sequence is in general more complicated than that. E.g. empty((0)) != not((0)). This applies to exists(s) vs. s in a boolean context as well. However, there are domains of s where empty(s) is equivalent to not(s), so the two could be used interchangeably within those domains. But this goes to show that the use of empty() can make a non-trivial difference in making code easier to understand.
Similarly, exists(s) is equivalent to boolean(s) that already existed in XPath 1.0 (or just s in a boolean context), but again is much clearer about the intent.
Quantified expressions; e.g. "some $x in expression satisfies test($x)" would be equivalent to boolean(expression[test(.)]) (although the new syntax is more flexible, in that you don't need to worry about losing the context item because you have the variable to refer to it by).
Similarly, "every $x in expression satisfies test($x)" would be equivalent to not(expression[not(test(.))]) but is more readable.
These functions and syntax were evidently added at no small cost, solely to serve the goal of writing XPath that is easier to map to how humans think. This implies, as experienced developers know, that understandable code is significantly superior to code that is difficult to understand.
Given all that ... what would be a clear and readable way to write an XPath test expression that asks
Does value X occur in sequence S?
Some ways to do it: (Note: I used X and S notation here to indicate the value and the sequence, but I don't mean to imply that these subexpressions are element name tests, nor that they are simple expressions. They could be complicated.)
X = S: This would be one of the most unreadable, since it requires the reader to
think about which of X and S are sequences vs. single values
understand general comparisons, which are not obvious from the syntax
However, one advantage of this form is that it allows us to put the topic (X) before the comment ("is a member of S"), which, I think, helps in readability.
See also CMS's good point about readability, when the syntax or names make the "cardinality" of X and S obvious.
index-of(S, X): This one is clear about what's intended as a value and what as a sequence (if you remember the order of arguments to index-of()). But it expresses more than we need to: it asks for the index, when all we really want to know is whether X occurs in S. This is somewhat misleading to the reader. An experienced developer will figure out what's intended, with some effort and with understanding of the context. But the more we rely on context to understand the intent of each line, the more understanding the code becomes a circular (spiral) and potentially Sisyphean task! Also, since index-of() is designed to return a list of all the indexes of occurrences of X, it could be more expensive than necessary: a smart processor, in order to evaluate X = S, wouldn't necessarily have to find all the contents of S, nor enumerate them in order; but for index-of(S, X), correct order would have to be determined, and all contents of S must be compared to X. One other drawback of using index-of() is that it's limited to using eq for comparison; you can't, for example, use it to ask whether a node is identical to any node in a given sequence.
Correction: This form, used as a conditional test, can result in a runtime error: Effective boolean value is not defined for a sequence of two or more items starting with a numeric value. (But at least we won't get wrong boolean values, since index-of() can't return a zero.) If S can have multiple instances of X, this is another good reason to prefer form 3 or 6.
exists(index-of(X, S)): makes the intent clearer, and would help the processor eliminate the performance penalty if the processor is smart enough.
some $m in S satisfies $m eq X: This one is very clear, and matches our intent exactly. It seems long-winded compared to 1, and that in itself can reduce readability. But maybe that's an acceptable price for clarity. Keep in mind that X and S could potentially be complex expressions themselves -- they're not necessarily just variable references. An advantage is that since the eq operator is explicit, you can replace it with is or any other comparison operator.
S[. eq X]: clearer than 1, but shares the semantic drawbacks of 2: it computes all members of S that are equal to X. Actually, this could return a false negative (incorrect effective boolean value), if X is falsy. E.g. (0, 1)[. eq 0] returns 0 which is falsy, even though 0 occurs in (0, 1).
exists(S[. eq X]): Clearer than 1, 2, 3, and 5. Not as clear as 4, but shorter. Avoids the drawbacks of 5 (or at least most of them, depending on the processor smarts).
I'm kind of leaning toward the last one, at this point: exists(S[. eq X])
What about you... As a developer coming to a complex, unfamiliar XSLT or XQuery or other program that uses XPath 2.0, and wanting to figure out what that program is doing, which would you find easiest to read?
Apologies for the long question. Thanks for reading this far.
Edit: I changed = to eq wherever possible in the above discussion, to make it easier to see where a "value comparison" (as opposed to a general comparison) was intended.
For what it's worth, if names or context make clear that X is a singleton, I'm happy to use your first form, X = S -- for example when I want to check an attribute value against a set of possible values:
<xsl:when test="#type = ('A', 'A+', 'A-', 'B+')" />
or
<xsl:when test="#type = $magic-types"/>
If I think there is a risk of confusion, then I like your sixth formulation. The less frequently I have to remember the rules for calculating an effective boolean value, the less frequently I make a mistake with them.
I prefer this one:
count(distinct-values($seq)) eq count(distinct-values(($x, $seq)))
When $x is itself a sequence, this expression implements the (value-based) subset-of relation between two sets of values that are represented as sequences. This implementation of subset-of has just linear time complexity -- vs. many other ways of expressing this that have O(N^2) time complexity.
To summarize, the question whether a single value belongs to a set of values is a special case of the question whether one set of values is a subset of another. If we have a good implementation of the latter, we can simply use it for answering the former.
The functx library has a nice implementation of this function, so you can use
functx:is-node-in-sequence($X, $Y)
(this particular function can be found at http://www.xqueryfunctions.com/xq/functx_is-node-in-sequence.html)
The whole functx library is available for both XQuery (http://www.xqueryfunctions.com/) and XSLT (http://www.xsltfunctions.com/)
MarkLogic ships the functx library with their core product; other vendors may also.
Another possibility, when you want to know whether node X occurs in sequence S, is
exists((X) intersect S)
I think that's pretty readable, and concise. But it only works when X and the values in S are nodes; if you try to ask
exists(('bob') intersect ('alice', 'bob'))
you'll get a runtime error.
In the program I'm working on now, I need to compare strings, so this isn't an option.
As Dimitri notes, the occurrence of a node in a sequence is a question of identity, not of value comparison.

Function to detect conflicting mathematical operators in VB6

I've developed a program which generates insurance quotes using different types of coverages based on state criteria. Now I want to add the ability to specify 'rules'. For example we may have 3 types of coverage (we'll call them UM, BI, and PD). Well some states don't allow PD to be greater than BI and other states don't allow UM to exist without BI. So I've added the ability for the user to create these rules so that when the quote is generated the rule will be followed and thus no state regulations will be violated when the program generates the quote.
The Problem
I don't want the user to be able to select conflicting rules. The user can select any of the VB mathematical operators (>, <, >=, <=, =, <>) and set a coverage on either side. They can do this multiple times (but only one at a time) so they might end up with a list of rules like this:
A > B
B > C
C > A
As you can see, the last rule conflicts with the previously set rules. My solution to this was to validate the list each time the user clicks 'Add rule to list'.
Pretend the 3rd list item is not yet in the list but the user has clicked 'add rule' to put it in the list. The validation process first checks to see if both incoming variables have already been used on the same line. If not, it just searches for the left side incoming variable (in this case 'C') in the already created list. If it finds it, it then sets tmp1 equal to the variable across from the match (tmp1 = 'B'). It then does the same for the incoming variable on the right side (in this case 'A'). Then tmp2 is set equal to the variable across from A (tmp2 = 'B'). If tmp1 and tmp2 are equal then the incoming rule is either conflicting OR is irrelevant regardless of the operators used. I'm pretty sure this is solid logic given 3 variables. However, I found that adding any additional variables could easily bypass my validation. There could be upwards of 10 coverage types in any given state so it is important to be able to validate more than just 3.
Is there any uniform way to do a sound validation given any number of variables? Any ideas or thoughts are appreciated. I hope my explanation makes sense.
Thanks
My best bet is some sort of hierarchical tree of rules. When the user adds the first rule (say A > B), the application could create a data structure like this (lowerValues is a Map where each key leads to a list of values):
lowerValues['A'] = ['B']
Now when the user adds the next rule (B > C), the application could check if B is already in a any lowerValues list (in this case, A). If that happens, C is added to lowerValues['A'], and lowerValues['B'] is also created:
lowerValues['A'] = ['B', 'C']
lowerValues['B'] = ['C']
Finally, when the last rule is provided by the user (C > A), the application checks if C is in any lowerValues list. Since it's in B and A, the rule is invalid.
Hope that helps. I don't remember if there's some sort of mapping in VB. I think you should try the Dictionary object.
In order for this idea to work out, all the operations must be internally translated to a single canonical form. So, for example:
A > B
could be translated as
B < A
Good luck
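If the rules really are just strict inequalities between plain coverage names, the lowerValues idea reduces to a directed graph plus a reachability check before each insert; a cycle means the new rule contradicts the existing ones. A rough Python sketch of that (it deliberately ignores >=, <=, = and <>, which need the normalization above plus more machinery, as the next answer discusses):

from collections import defaultdict

def add_rule(greater_than, left, op, right):
    # Normalize so that we always store the edge "hi must exceed lo".
    if op == ">":
        hi, lo = left, right
    elif op == "<":
        hi, lo = right, left
    else:
        raise ValueError("operator %r not handled in this sketch" % op)
    # Tentatively add the edge, then see whether lo can already reach hi;
    # if it can, the new rule closes a cycle and is rejected.
    greater_than[hi].add(lo)
    if reachable(greater_than, lo, hi):
        greater_than[hi].discard(lo)
        return False
    return True

def reachable(graph, start, target):
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

rules = defaultdict(set)
print(add_rule(rules, "A", ">", "B"))   # True
print(add_rule(rules, "B", ">", "C"))   # True
print(add_rule(rules, "C", ">", "A"))   # False: conflicts with the first two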
In general this is a pretty hard problem. What you in fact want to know is whether a set of propositional formulas over (apparently) some arithmetic is consistent. To do this you need what amounts to constraint solvers that "know" arithmetic. Not likely to find that in VB6, but you might be able to invoke one as a subprocess.
If the rules are propositional formulas only over inequalities (A > B and the like), first normalize them so that each relation is written only one way (e.g., where both "A>B" and "B<A" could appear, pick one form).
Second, try solving the propositions for tautology (see Wang's algorithm, which you can likely implement awkwardly in VB6).
If the propositions are not a tautology, now you want to build chains of inequalities (e.g., A > B > C) as a graph and look for cycles. The place this fails is when your propositions have disjunctions, e.g., ("A>B or B>Q"); you'll have to generate an inequality chain for each combination of disjunctions, and discard the inconsistent ones. If you discard all of them, the set is inconsistent. Watch out for negated conjunctions like "not (A and B)"; by DeMorgan's theorem, they're equivalent to "not A or not B", e.g., "not (A>B and B>Q)" is the same as "A<=B or B<=Q". You might want to reduce the conditions to disjunctive normal form to avoid getting surprised.
There are apparently decision procedures for such inequalities. They're likely hard to implement.
