How to build a binary expression tree from a prefix notation? - algorithm

Something like this: ( * (+ 1 2 3) 5)
Operators like * and + can have more than two operands.

To support prefix notation with an unbounded number of operands, you have to define additional rules for opening and closing brackets (which is not something prefix notation generally does).
A simple parser takes the operation and the first operand, then adds the remaining operands one by one. At each step it creates a new operation node whose left operand is the previous (current) result and whose right operand is the newly fetched operand.
Continue until the end of input or a closing bracket. Do not remove the closing bracket from the input: it should be handled by the bracket-parsing part, not by the operation-parsing part.
Taking operand is straightforward:
"(" -> go deeper and parse subexpression up to ")".
Different operation -> go deeper and parse the subexpression.
Same operation -> can simply be ignored, but it's up to you.
Constant (or variable if you have them) -> make operand subexpression.
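The steps above can be sketched in Python as follows. This is only a minimal illustration, not a definitive implementation: the names Node, tokenize, parse_operand, parse_operation and parse_prefix are made up for this example, and it assumes every operation is wrapped in brackets, as in the question's ( * (+ 1 2 3) 5).

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Node:
    op: str
    left: "Tree"
    right: "Tree"


Tree = Union[Node, str]  # a leaf is just the constant or variable text


def tokenize(text: str) -> List[str]:
    # Assumption: tokens are separated by spaces and/or brackets.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_operand(tokens: List[str], pos: int) -> Tuple[Tree, int]:
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        # Go deeper and parse the subexpression up to ")".
        tree, pos = parse_operation(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return tree, pos + 1  # the bracket is consumed here, not in parse_operation
    if tok == ")":
        raise SyntaxError("unexpected ')'")
    return tok, pos + 1  # constant or variable


def parse_operation(tokens: List[str], pos: int) -> Tuple[Tree, int]:
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    op = tokens[pos]
    result, pos = parse_operand(tokens, pos + 1)
    # Fold the remaining operands in one by one: the previous result becomes
    # the left child, the newly fetched operand the right child.
    while pos < len(tokens) and tokens[pos] != ")":
        right, pos = parse_operand(tokens, pos)
        result = Node(op, result, right)
    return result, pos  # the closing bracket is left in the input


def parse_prefix(text: str) -> Tree:
    tokens = tokenize(text)
    tree, pos = parse_operand(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return tree
```

With this sketch, (+ 1 2 3) becomes the binary tree for (1 + 2) + 3, which is then used as the left child of the * node.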

Related

What to do when converting infix to postfix expression using stack?

The problem I am facing is what to do when two operators of the same priority meet.
Example:
If ^ is on top of the stack and another ^ comes in, what should I do?
Should I push it onto the stack, pop one ^, or pop both ^ off the stack?
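If both operators have the same precedence and the operation is associative (as with + or *), it doesn't matter in which order you execute the calculation as long as no brackets are involved: you can push the new operator onto the stack and do the calculation later, or pop the existing one and do the calculation now. For an operator such as ^, where (a^b)^c and a^(b^c) differ, the order does matter.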
What to do in this case depends on the operator or its specific precedence level, and is referred to as the operator's associativity: https://en.wikipedia.org/wiki/Operator_associativity
Usually + and - have the same precedence and left associativity, for example, meaning a+b-c+d = ((a+b)-c)+d.
The assignment operators usually have right associativity, meaning a=b+=c=d is the same as a=(b+=(c=d)).
I haven't done a detailed survey, but I think that exponentiation operators usually have right associativity, because left associativity is redundant with multiplication, i.e., (a^b)^c = a^(b*c)
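As a rough illustration of how associativity decides between "push" and "pop" in the stack-based conversion, here is a small Python sketch. The tables PRECEDENCE and RIGHT_ASSOC and the function to_postfix are assumptions made for this example, and brackets are deliberately left out to keep it short.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOC = {"^"}  # assumption: only exponentiation is right-associative


def to_postfix(tokens):
    """Convert bracket-free infix tokens to postfix."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            while stack:
                top = stack[-1]
                higher = PRECEDENCE[top] > PRECEDENCE[tok]
                same = PRECEDENCE[top] == PRECEDENCE[tok]
                # Same precedence: pop only if the incoming operator is
                # left-associative; a right-associative one is just pushed.
                if higher or (same and tok not in RIGHT_ASSOC):
                    output.append(stack.pop())
                else:
                    break
            stack.append(tok)
        else:
            output.append(tok)  # operand
    while stack:
        output.append(stack.pop())
    return output
```

So for ^ on top of the stack and another ^ arriving, this sketch pushes the new one, giving a b c ^ ^ for a ^ b ^ c, while a - b + c pops the - first and gives a b - c +.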

Parse expression with functions

This is my situation: the input is a string that contains a normal mathematical expression like 5+3*4. Functions are also possible, e.g. min(5,A*2). This string is already tokenized, and now I want to parse it using stacks (so no AST). I first used the Shunting Yard Algorithm (SYA), but this is where my main problem arises:
Suppose you have this (tokenized) string: min(1,2,3,+) which is obviously invalid syntax. However, SYA turns this into the output stack 1 2 3 + min(, and hopefully you see the problem coming. When parsing from left to right, it sees the + first, calculating 2+3=5, and then calculating min(1,5), which results in 1. Thus, my algorithm says this expression is completely fine, while it should throw a syntax error (or something similar).
What is the best way to prevent things like this? Add a special delimiter (such as the comma), use a different algorithm, or what?
In order to prevent this issue, you might have to keep track of the stack depth. The way I would do this (and I'm not sure it is the "best" way) is with another stack.
The new stack follows these rules:
When an open parenthesis, (, or a function is parsed, push a 0.
    Do this to handle nested functions.
When a closing parenthesis, ), is parsed, pop the last item off and add it to the new last value on the stack.
    The number that just got popped off is how many values were returned by the function. You probably want this to always be 1.
When a comma or similar delimiter is parsed, pop from the stack, add that number to the new last element, then push a 0.
    This resets the count so that we can begin verifying the next argument of a function.
    The value that just got popped off is how many values were returned by the statement. You probably want this to always be 1.
When a number is pushed to the output, increment the top element of this stack.
    This is how many values are available in the output. Numbers increase the number of values. Binary operators need to have at least 2.
When a binary operator is pushed to the output, decrement the top element.
    A binary operator takes 2 values and outputs 1, thus reducing the overall number of values left on the output by 1.
    In general, an n-ary operator that takes n values and returns m values should add (m-n) to the top element.
If this value ever becomes negative, throw an error!
This will find that the last argument in your example, which just contains a +, will decrement the top of the stack to -1, automatically throwing an error.
But then you might notice that a final argument in your example of, say, 3+ would return a zero, which is not negative. In this case, you would throw an error in one of the steps where "you probably want this to always be 1."
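One possible reading of these rules, sketched in Python on top of a small shunting-yard loop. Everything named here (to_postfix_checked, counts, argc, the BINARY table) is an assumption for the sake of the example; it also assumes operands are numeric literals, any alphabetic token is a function name, and each opening parenthesis pushes a single 0 (the function token itself does not push another one).

```python
BINARY = {"+": 1, "-": 1, "*": 2, "/": 2}  # precedence; all left-associative


def to_postfix_checked(tokens):
    output, ops = [], []
    counts = [0]  # the extra stack: values available at each nesting level
    argc = []     # arguments seen so far inside each open parenthesis

    def emit(op):
        if op not in BINARY:
            raise SyntaxError(f"unexpected {op!r}")
        output.append(op)
        counts[-1] -= 1  # a binary operator takes 2 values and returns 1
        if counts[-1] < 0:
            raise SyntaxError(f"operator {op!r} is missing operands")

    def close_argument():
        # Flush operators back to the "(", then check the argument's count.
        while ops and ops[-1] != "(":
            emit(ops.pop())
        if not ops:
            raise SyntaxError("mismatched parentheses or misplaced comma")
        produced = counts.pop()
        if produced != 1:
            raise SyntaxError("each argument must yield exactly one value")
        counts[-1] += produced

    for tok in tokens:
        if tok == "(":
            ops.append(tok)
            counts.append(0)
            argc.append(1)
        elif tok == ",":
            close_argument()
            counts.append(0)  # start counting the next argument
            argc[-1] += 1
        elif tok == ")":
            close_argument()
            ops.pop()  # discard the "("
            nargs = argc.pop()
            if ops and ops[-1].isalpha():  # it was a function call
                output.append(ops.pop())
                counts[-1] += 1 - nargs    # takes nargs values, returns 1
            elif nargs != 1:
                raise SyntaxError("comma outside a function call")
        elif tok in BINARY:
            while ops and ops[-1] in BINARY and BINARY[ops[-1]] >= BINARY[tok]:
                emit(ops.pop())
            ops.append(tok)
        elif tok.isalpha():
            ops.append(tok)  # function name
        else:
            output.append(tok)  # number
            counts[-1] += 1
    while ops:
        emit(ops.pop())
    if counts != [1]:
        raise SyntaxError("expression must yield exactly one value")
    return output
```

With this sketch, min(1,2,3,+) fails when the + is flushed (the count drops to -1), and min(1,2,3+) fails at the closing parenthesis because the last argument yields 0 values instead of 1.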

Shunting Yard Algorithm in case "a b +" [duplicate]

We use the Shunting-Yard algorithm to evaluate expressions. We can validate the expression by simply applying the algorithm: it fails if there are missing operands, mismatched parentheses, and other errors. However, the Shunting-Yard algorithm supports a larger syntax than just human-readable infix. For example,
1 + 2
+ 1 2
1 2 +
are all acceptable ways to provide '1+2' as input to the Shunting-Yard algorithm. '+ 1 2' and '1 2 +' are not valid infix, but the standard Shunting-Yard algorithm can handle them. The algorithm does not really care about the order, it applies operators by order of precedence grabbing the 'nearest' operands.
We would like to restrict our input to valid human readable infix. I am looking for a way to either modify the Shunting-Yard algorithm to fail with non-valid infix or provide an infix validation prior to using Shunting-Yard.
Is anyone aware of any published techniques to do this? We must support basic operators, custom operators, brackets, and functions (with multiple arguments). I haven't seen anything online that works with more than the basic operators.
Thanks
The solution to my problem was to enhance the algorithm posted on Wikipedia with the state machine recommended by Rici. I am posting the pseudo code here because it may be of use to others.
Support two states, ExpectOperand and ExpectOperator.
Set the state to ExpectOperand.
While there are tokens to read:
    If the token is a constant (number):
        Error if the state is not ExpectOperand.
        Push the token to the output queue.
        Set the state to ExpectOperator.
    If the token is a variable:
        Error if the state is not ExpectOperand.
        Push the token to the output queue.
        Set the state to ExpectOperator.
    If the token is an argument separator (a comma):
        Error if the state is not ExpectOperator.
        Until the top of the operator stack is a left parenthesis (don't pop the left parenthesis):
            Pop the top token off the stack and push it to the output queue.
            If no left parenthesis is encountered then error. Either the separator was misplaced or the parentheses were mismatched.
        Set the state to ExpectOperand.
    If the token is a unary operator:
        Error if the state is not ExpectOperand.
        Push the token to the operator stack.
        Set the state to ExpectOperand.
    If the token is a binary operator:
        Error if the state is not ExpectOperator.
        While there is an operator token at the top of the operator stack and either the current token is left-associative and of lower than or equal precedence to the operator on the stack, or the current token is right-associative and of lower precedence than the operator on the stack:
            Pop the operator from the operator stack and push it onto the output queue.
        Push the current operator onto the operator stack.
        Set the state to ExpectOperand.
    If the token is a function:
        Error if the state is not ExpectOperand.
        Push the token onto the operator stack.
        Set the state to ExpectOperand.
    If the token is an open parenthesis:
        Error if the state is not ExpectOperand.
        Push the token onto the operator stack.
        Set the state to ExpectOperand.
    If the token is a close parenthesis:
        Error if the state is not ExpectOperator.
        Until the token at the top of the operator stack is a left parenthesis:
            Pop the token off of the operator stack and push it onto the output queue.
        Pop the left parenthesis off of the operator stack and discard it.
        If the token at the top of the operator stack is a function then pop it and push it onto the output queue.
        Set the state to ExpectOperator.
At this point you have processed all the input tokens.
While there are tokens on the operator stack:
    Pop the next token from the operator stack and push it onto the output queue.
    If a parenthesis is encountered then error. There are mismatched parentheses.
You can easily differentiate between unary and binary operators (I'm specifically speaking about the negative prefix and subtraction operator) by looking at the previous token. If there is no previous token, the previous token is an open parenthesis, or the previous token is an operator then you have encountered a unary prefix operator, else you have encountered the binary operator.
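The two-state idea can also be used as a standalone check before running Shunting-Yard. The following Python sketch only validates and does not produce output; validate_infix, BINARY and FUNCTIONS are hypothetical names, unary minus is the only prefix operator it knows, and it tracks only parenthesis depth, so it does not check that a comma really sits inside a function call.

```python
EXPECT_OPERAND, EXPECT_OPERATOR = "operand", "operator"
BINARY = {"+", "-", "*", "/", "^"}
FUNCTIONS = {"min", "max"}  # assumption: the set of known function names


def validate_infix(tokens):
    """Return True for well-formed infix, raise SyntaxError otherwise."""
    state = EXPECT_OPERAND
    depth = 0
    for i, tok in enumerate(tokens):
        if tok == "(":
            if state != EXPECT_OPERAND:
                raise SyntaxError("unexpected '('")
            depth += 1
        elif tok == ")":
            if state != EXPECT_OPERATOR or depth == 0:
                raise SyntaxError("unexpected ')'")
            depth -= 1
        elif tok == ",":
            if state != EXPECT_OPERATOR or depth == 0:
                raise SyntaxError("misplaced ','")
            state = EXPECT_OPERAND
        elif tok in FUNCTIONS:
            if state != EXPECT_OPERAND:
                raise SyntaxError(f"unexpected function {tok!r}")
            if i + 1 >= len(tokens) or tokens[i + 1] != "(":
                raise SyntaxError(f"function {tok!r} must be followed by '('")
        elif tok in BINARY:
            if state == EXPECT_OPERAND:
                # An operator where an operand is expected can only be a
                # prefix operator; here that means unary minus.
                if tok != "-":
                    raise SyntaxError(f"unexpected operator {tok!r}")
            else:
                state = EXPECT_OPERAND
        else:  # constant or variable
            if state != EXPECT_OPERAND:
                raise SyntaxError(f"unexpected operand {tok!r}")
            state = EXPECT_OPERATOR
    if state != EXPECT_OPERATOR or depth != 0:
        raise SyntaxError("unexpected end of input")
    return True
```

Under these assumptions 1 + 2 passes, while + 1 2 and 1 2 + are both rejected at the first token that arrives in the wrong state.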
A nice discussion on Shunting Yard algorithms is http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm
The algorithm presented there uses the key idea of the operator stack but adds some grammar to know what should be expected next. It has two main functions: E(), which expects an expression, and P(), which expects a prefix operator, a variable, a number, brackets, or a function. Prefix operators always bind tighter than binary operators, so you want to deal with any of them first.
If we say P stands for some prefix sequence and B is a binary operator then any expression will be of the form
P B P B P
i.e. you are either expecting a prefix sequence or a binary operator. Formally the grammar is
E -> P (B P)*
and P will be
P -> -P | variable | constant | etc.
This translates to pseudocode as
E() {
    P()
    while next token is a binary op:
        read next op
        push onto stack and do the shunting yard logic
        P()
    if any tokens remain report error
    pop remaining operators off the stack
}
P() {
    if next token is constant or variable:
        add to output
    else if next token is unary minus:
        push uminus onto operator stack
        P()
    else:
        report error
}
You can expand this to handle other unary operators, functions, brackets, suffix operators.
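A runnable Python version of the E()/P() pseudocode might look like the sketch below. The class name Parser, the "u-" marker for unary minus and the precedence numbers are choices made for this example, not part of the linked article; brackets and functions are left out, as in the pseudocode.

```python
BINARY_PREC = {"+": 1, "-": 1, "*": 2, "/": 2}  # all left-associative
UNARY_PREC = 3  # prefix operators bind tighter than any binary operator


def prec(op):
    return UNARY_PREC if op == "u-" else BINARY_PREC[op]


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.ops = []  # operator stack
        self.out = []  # postfix output

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def E(self):
        self.P()
        while self.peek() in BINARY_PREC:
            op = self.advance()
            # The usual shunting-yard step for a left-associative operator.
            while self.ops and prec(self.ops[-1]) >= prec(op):
                self.out.append(self.ops.pop())
            self.ops.append(op)
            self.P()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        while self.ops:
            self.out.append(self.ops.pop())
        return self.out

    def P(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "-":
            self.advance()
            self.ops.append("u-")  # unary minus gets its own symbol
            self.P()
        elif tok in BINARY_PREC:
            raise SyntaxError(f"unexpected operator {tok!r}")
        else:
            self.out.append(self.advance())  # constant or variable
```

For example, Parser("- a + b * c".split()).E() yields the postfix a u- b c * +, while a b + is rejected because E() finds a leftover token where it expects a binary operator.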

handling unary minus for shunting-yard algorithm

Is there a better way to handle unary "-" when converting an infix expression to a postfix one?
The obvious one would be to prefix every unary "-" with a 0. Does anyone know a better implementation? Thanks!
The way I did this years ago was invent a new operator for my postfix expression. So when I encountered a unary minus in the infix, I'd convert it to #. So my postfix for a + -b became ab#+.
And, of course, my evaluator had to know that # only popped one operand.
Kind of depends on how you're using the postfix expression once it's built. If you want to display it then your special # operator would probably confuse people. But if you're just using it internally (which I was), then it works great.
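To illustrate the evaluator side of this, here is a minimal Python sketch in which # pops a single operand and every other operator pops two. The function name eval_postfix and the BINARY_OPS table are assumptions for the example, and operands are assumed to be numeric literals.

```python
import operator

BINARY_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_postfix(tokens):
    stack = []
    for tok in tokens:
        if tok == "#":
            # Unary minus: pops only one operand.
            stack.append(-stack.pop())
        elif tok in BINARY_OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[tok](left, right))
        else:
            stack.append(float(tok))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]
```

Here the postfix 3 4 # + corresponds to the infix 3 + -4.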
Traverse through the string, replace every unary minus operator with 0-, and surround the result with parentheses. For example, given -20 + (-2 * 50), transform it to (0-20) + ((0-2) * 50).

How to get rid of unnecessary parentheses in mathematical expression

Hi, I was wondering if there is any known way to get rid of unnecessary parentheses in a mathematical formula. The reason I am asking is that I have to minimize the length of a formula such as this one:
if((-if(([V].[6432])=0;0;(([V].[6432])-([V].[6445]))*(((([V].[6443]))/1000*([V].[6448])
+(([V].[6443]))*([V].[6449])+([V].[6450]))*(1-([V].[6446])))))=0;([V].[6428])*
((((([V].[6443]))/1000*([V].[6445])*([V].[6448])+(([V].[6443]))*([V].[6445])*
([V].[6449])+([V].[6445])*([V].[6450])))*(1-([V].[6446])));
It is basically part of an SQL select statement. It cannot exceed 255 characters and I cannot modify the code that produces this formula (it is basically a black box ;) )
As you can see, many parentheses are useless. Not to mention the fact that:
((a) * (b)) + (c) = a * b + c
So I want to keep the order of operations: Parentheses, Multiply/Divide, Add/Subtract.
I'm working in VB, but a solution in any language will be fine.
Edit
I found a question about the opposite problem (adding parentheses to an expression).
I really thought that this could be accomplished without heavy parsing. But it seems that a parser that goes through the expression and saves it in an expression tree is inevitable.
If you are interested in removing the unnecessary parentheses from your expression, the generic solution consists of parsing your text and building the associated expression tree.
Then, from this tree, you can regenerate the corresponding text without unnecessary parentheses by applying some rules:
if the node is a "+", no parentheses are required
if the node is a "*", then parentheses are required for the left (right) child only if the left (right) child is a "+"
the same applies for "/", except that the right child of a "/" also needs parentheses when it is a "*" or "/" (a/(b*c) is not the same as a/b*c)
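The rules above can be expressed as a precedence comparison between a node and its children. The sketch below is one way to do it in Python, under some assumptions of its own: the tree is represented as nested tuples (op, left, right) with plain strings as leaves, and the names PREC, needs_parens and render are made up for the example. It also covers "-", which the rules above do not mention.

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}
NOT_ASSOCIATIVE = {"-", "/"}  # a-(b-c) and a/(b*c) must keep their parentheses


def needs_parens(child, parent_op, is_right):
    if isinstance(child, str):
        return False  # a leaf never needs parentheses
    child_op = child[0]
    if PREC[child_op] < PREC[parent_op]:
        return True   # e.g. a "+" below a "*"
    if is_right and PREC[child_op] == PREC[parent_op]:
        return parent_op in NOT_ASSOCIATIVE
    return False


def render(node):
    """Turn an expression tree back into text with minimal parentheses."""
    if isinstance(node, str):
        return node
    op, left, right = node
    left_text, right_text = render(left), render(right)
    if needs_parens(left, op, is_right=False):
        left_text = f"({left_text})"
    if needs_parens(right, op, is_right=True):
        right_text = f"({right_text})"
    return f"{left_text}{op}{right_text}"
```

For instance, the tree for ((a) * (b)) + (c) renders as a*b+c, while the tree for (a+b)*c keeps its parentheses.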
But if your problem is just to deal with the 255-character limit, you can probably just use intermediate variables to store intermediate results:
T1 = (([V].[6432])-([V].[6445]))*(((([V].[6443]))/1000*([V].[6448])+(([V].[6443]))*([V].[6449])+([V].[6450]))*(1-([V].[6446])))))
T2 = etc...
You could strip the simplest cases:
([V].[6432]) and (([V].[6443]))
Becomes
v.[6432]
You shouldn't need the [] around the table name or its alias.
You could shorten it further if you can alias the columns:
select v.[6432] as a, v.[6443] as b, ....
Or even put all the tables being queried into a single subquery - then you wouldn't need the table prefix:
if((-if(a=0;0;(a-b)*((c/1000*d
+c*e+f)*(1-g))))=0;h*
(((c/1000*b*d+c*b*
e+b*f))*(1-g));
select [V].[6432] as a, [V].[6445] as b, [V].[6443] as c, [V].[6448] as d,
[V].[6449] as e, [V].[6450] as f,[V].[6446] as g, [V].[6428] as h ...
Obviously this is all a bit pseudo-code, but it should help you simplify the full statement.
I know this thread is really old, but it is still searchable from Google.
I'm writing a TI-83 plus calculator program that addresses similar issues. In my case, I'm trying to actually solve the equation for a specific variable in number, but it may still relate to your problem, although I'm using an array, so it might be easier for me to pick out specific values...
It's not quite done, but it does get rid of the vast majority of parentheses with (I think), a somewhat elegant solution.
What I do is scan through the equation/function/whatever, keeping track of each opening parenthesis "(" until I find a closing parenthesis ")", at which point I can be assured that I won't run into any more deeply nested parentheses.
y=((3x + (2))) would show the (2) first, and then the (3x + (2)), and then the ((3x + (2))).
It then checks the characters immediately before and after each pair of parentheses. In the case above, it would return + and ). Each of these is assigned a number value, and the higher of the two is used. If no operators are found (*, /, +, ^, or -) I default to a value of 0.
Next I scan through the inside of the parentheses. I use a similar numbering system, although in this case I use the lowest value found, not the highest. I default to a value of 5 if nothing is found, as would be in the case above.
The idea is that you can assign a number to the importance of the parentheses by subtracting the two values. If you have something like a ^ on the outside of the parentheses
(2+3)^5
those parentheses are potentially very important, and would be given a high value, (in my program I use 5 for ^).
It is possible however that the inside operators would render the parentheses very unimportant,
(2)^5
where nothing is found. In that case the inside would be assigned a value of 5. By subtracting the two values, you can then determine whether or not a set of parentheses is necessary simply by checking whether the resulting number is greater than 0. In the case of (2+3)^5, a ^ would give a value of 5, and a + would give a value of 1. The resulting number would be 4, which would indicate that the parentheses are in fact needed.
In the case of (2)^5 you would have an inner value of 5 and an outer value of 5, resulting
in a final value of 0, showing that the parentheses are unimportant, and can be removed.
The downside to this is that, (at least on the TI-83) scanning through the equation so many times is ridiculously slow. But if speed isn't an issue...
Don't know if that will help at all, I might be completely off topic. Hope you got everything up and working.
I'm pretty sure that in order to determine which parentheses are unnecessary, you have to evaluate the expressions within them. Because you can nest parentheses, this is the sort of recursive problem that a regular expression can only address in a shallow manner, and most likely with incorrect results. If you're already evaluating the expression, maybe you'd like to simplify the formula if possible. This also gets kind of tricky, and some approaches use techniques that are also seen in machine learning, such as you might see in the following paper: http://portal.acm.org/citation.cfm?id=1005298
If your variable names don't change significantly from one query to the next, you could try a series of replace() commands, e.g.
X=replace([QryString],"(([V].[6443]))","[V].[6443]")
Also, why can't it surpass 255 characters? If you are storing this as a string field in an Access table, then you could try putting half the expression in 1 field and the second half in another.
You could also try parsing your expression using ANTLR, yacc or similar and create a parse tree. These trees usually optimize parentheses away. Then you would just have to create expression back from tree (without parentheses obviously).
It might take you more than a few hours to get this working though. But expression parsing is usually the first example on generic parsing, so you might be able to take a sample and modify it to your needs.
