Algorithm to build syntax tree from prefix order expression

What type of algorithm would be used to construct a syntax tree from an expression represented in prefix order notation?

A simple recursive algorithm can convert a prefix-order expression to a syntax tree.
GetNextPrefixExpression(tokenStream)
    nextToken = tokenStream.GetNextToken()
    if nextToken.IsValue()
        return new Value(nextToken)
    else if nextToken.IsUnaryOperator()
        return new UnaryOperator(nextToken,
                                 GetNextPrefixExpression(tokenStream))
    else if nextToken.IsBinaryOperator()
        return new BinaryOperator(nextToken,
                                  GetNextPrefixExpression(tokenStream),
                                  GetNextPrefixExpression(tokenStream))
    else if nextToken.IsTernaryOperator()
        return new TernaryOperator(nextToken,
                                   GetNextPrefixExpression(tokenStream),
                                   GetNextPrefixExpression(tokenStream),
                                   GetNextPrefixExpression(tokenStream))
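For concreteness, here is a minimal Python sketch of the same recursion, assuming tokens are pre-split into a list and that operator arity comes from a lookup table (the table, token format, and names are illustrative, not part of the original answer):

# arity table: anything not listed is treated as a plain value;
# 'neg' and 'if' are hypothetical operators, just to show arity 1 and 3
ARITY = {'+': 2, '-': 2, '*': 2, '/': 2, 'neg': 1, 'if': 3}

def parse_prefix(tokens):
    # consume tokens from the front, returning a nested-tuple syntax tree
    token = tokens.pop(0)
    if token not in ARITY:
        return token                          # a value: a leaf node
    # an operator: recursively parse one subtree per operand
    children = [parse_prefix(tokens) for _ in range(ARITY[token])]
    return (token, *children)

print(parse_prefix('+ 1 * 2 3'.split()))
# ('+', '1', ('*', '2', '3'))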

Related

How to keep track of nesting information on iterative (non-recursive) function

Say I have a recursive descent parser that defines a bunch of nested rules.
Expr ← Sum
Sum ← Product (('+' / '-') Product)*
Product ← Value (('*' / '/') Value)*
Value ← [0-9]+ / '(' Expr ')'
Say I am right ● here on the second Value in the process:
Expr ← Sum
Sum ← Product (('+' / '-') Product)*
Product ← Value (('*' / '/') ●)*
Value ← [0-9]+ / '(' Expr ')'
That would mean that I am somewhere in here, at some nesting level, let's say:
Expr
  Sum
  | Product
    +
    Product
    | Product
      -
      Product
      | Value
        *
        Value
        | Value
          *
          ●
When parsing with recursive descent, the recursion means that when Value returns, we get back to the "sequence" * parsing node, which then returns to the Product node, which returns to the Product sequence node, etc. So it's easy to build up the parse tree.
But let's say that you want to do this using an iterative stack. The question is, how to keep track of the nesting information so that you can say in your code (eventually):
function handleValue(state, string) {
    // ...
}

function handleValueSequence(state, string) {
    if (state.startedValueSequenceEarlier) {
        wrapItUp(new ValueSequence(state.values))
    }
}

function handleProduct(state, string) {
    // ...
}

function handleProductSequence(state, string) {
    if (state.startedProductSequenceEarlier) {
        wrapItUp(new ProductSequence(state.products))
    }
}
The tricky part is, this can be arbitrarily nested, so you might have:
Product
  Value
    Product
      Value
        Product
          ...
So if a function like handleProductSequence has no context other than its arguments, I can't tell how it should figure out how to "wrapItUp" and finally create that ProductSequence object. In the state object I added, I am trying to think of what to put there, perhaps a state.stack property, but I'm not sure what would go in it. Any help would be much appreciated.
Your stack has to contain "where you are" in the control flow. In a recursive descent parser, that will be effectively the same as where you are in the parse, so you could write a generalised LL parser in this fashion. Personally, I'd probably represent a production as an object with a list of tokens and a handler function. (Plus some extension for EBNF operators like *.) A state would then be a production, a position in the production, and a list of already matched values.
But it's hard to see a good reason to do that when LR parser generators already exist, using essentially this representation, and they can handle many more grammars.
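As a rough illustration of that representation, here is a hedged Python sketch: each production is a (symbols, handler) pair, and each stack frame records a production, a position in it, and the values matched so far. The toy grammar has no alternatives or EBNF repetition, so this shows the bookkeeping only, not a general LL parser:

GRAMMAR = {
    'Expr':    (['Sum'],                     lambda v: v[0]),
    'Sum':     (['Product', '+', 'Product'], lambda v: ('+', v[0], v[2])),
    'Product': (['NUM', '*', 'NUM'],         lambda v: ('*', v[0], v[2])),
}

def parse(start, tokens):
    i = 0                                  # position in the token stream
    symbols, handler = GRAMMAR[start]
    # each frame: [name, symbols, handler, position-in-production, matched values]
    stack = [[start, symbols, handler, 0, []]]
    result = None
    while stack:
        frame = stack[-1]
        name, symbols, handler, pos, values = frame
        if pos == len(symbols):
            # production complete: reduce and hand the value to the parent frame
            stack.pop()
            value = handler(values)
            if stack:
                stack[-1][4].append(value)
                stack[-1][3] += 1
            else:
                result = value
            continue
        sym = symbols[pos]
        if sym in GRAMMAR:
            # nonterminal: push a frame; we resume this one when it reduces
            sub_symbols, sub_handler = GRAMMAR[sym]
            stack.append([sym, sub_symbols, sub_handler, 0, []])
        else:
            # terminal: match it against the next input token
            tok = tokens[i]
            if (sym == 'NUM' and tok.isdigit()) or sym == tok:
                frame[4].append(tok)
                frame[3] += 1
                i += 1
            else:
                raise SyntaxError('expected %s, got %s' % (sym, tok))
    return result

print(parse('Expr', '1 * 2 + 3 * 4'.split()))
# ('+', ('*', '1', '2'), ('*', '3', '4'))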

Is it necessary to convert infix notation to postfix when creating an expression tree from it?

I want to create an expression tree given an expression in infix form. Is it necessary to convert the expression to postfix first and then create the tree? I understand that it somehow depends on the problem itself. But assume that it is a simple mathematical expression with unknowns and operators like / * ^ + -.
No. If you're going to build an expression tree, then it's not necessary to convert the expression to postfix first. It will be simpler just to build the expression tree as you parse.
I usually write recursive descent parsers for expressions. In that case each recursive call just returns the tree for the subexpression that it parses. If you want to use an iterative shunting-yard-like algorithm, then you can do that too.
Here's a simple recursive descent parser in Python that makes a tree with tuples for nodes:
import re

def toTree(infixStr):
    # divide string into tokens, and reverse so I can get them in order with pop()
    tokens = re.split(r' *([\+\-\*\^/]) *', infixStr)
    tokens = [t for t in reversed(tokens) if t != '']
    precs = {'+': 0, '-': 0, '/': 1, '*': 1, '^': 2}

    # convert infix expression tokens to a tree, processing only
    # operators above a given precedence
    def toTree2(tokens, minprec):
        node = tokens.pop()
        while len(tokens) > 0:
            prec = precs[tokens[-1]]
            if prec < minprec:
                break
            op = tokens.pop()
            # get the argument on the operator's right
            # this will go to the end, or stop at an operator
            # with precedence <= prec
            arg2 = toTree2(tokens, prec + 1)
            node = (op, node, arg2)
        return node

    return toTree2(tokens, 0)

print(toTree("5+3*4^2+1"))
This prints:
('+', ('+', '5', ('*', '3', ('^', '4', '2'))), '1')
Try it here:
https://ideone.com/RyusvI
Note that the above recursive descent style is the result of having written many parsers. Now I pretty much always parse expressions in this way (the recursive part, not the tokenization). It is just about as simple as an expression parser can be, and it makes it easy to handle parentheses, and operators that associate right-to-left like the assignment operator.
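To illustrate the right-to-left point: in the parser above, recursing with the operator's own precedence instead of prec + 1 makes that operator right-associative. A self-contained variant (the right_assoc set and function name are mine, not from the original answer):

import re

def to_tree_ra(infix_str):
    # same tokenization as above
    tokens = [t for t in reversed(re.split(r' *([+\-*^/]) *', infix_str)) if t != '']
    precs = {'+': 0, '-': 0, '/': 1, '*': 1, '^': 2}
    right_assoc = {'^'}

    def climb(minprec):
        node = tokens.pop()
        while tokens and precs[tokens[-1]] >= minprec:
            op = tokens.pop()
            # equal precedence binds to the right for right-associative ops
            next_min = precs[op] if op in right_assoc else precs[op] + 1
            node = (op, node, climb(next_min))
        return node

    return climb(0)

print(to_tree_ra("2^3^2"))
# ('^', '2', ('^', '3', '2'))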

Algorithm to find which hashes in a list match another hash the fastest? (this is complicated to explain in the title)

Explaining in words when two hashes match is complicated, so see the example:
Hash patterns are stored in a list like this (I'm using JavaScript for notation):
pattern: [
    0: {type: 'circle', radius: function(n){ return n > 10; }},
    1: {type: 'circle', radius: function(n){ return n == 2; }},
    2: {type: 'circle', color: 'blue', radius: 5},
    ... etc
]
var test = {type:'circle', radius:12};
test should match pattern 0 because pattern[0].type == test.type && pattern[0].radius(test.radius) == true.
So, trying with words: a hash matches a pattern if each of the pattern's values either equals the corresponding value of the hash or returns true when applied to it as a function.
My question is: is there an algorithm to find all patterns that match a certain hash without testing every one of them?
Consider a dynamic, recursive, decision tree structure like the following.
decision: [
    field: 'type',
    values: [
        'circle': <another decision structure>,
        'square': 0, // meaning it matched; return this value
        'triangle': <another decision structure>
    ],
    functions: [
        function(n){ return n < 12; }: <another decision structure>,
        function(n){ return n > 12; }: <another decision structure>
    ],
    missing: <another decision structure>
]
Algorithm on d (a decision structure):
if test has field d.field
    if test[d.field] in d.values
        if d.values[test[d.field]] is a decision structure
            recurse using the new decision structure
        else
            yield d.values[test[d.field]]
    foreach f => v in d.functions
        if f(test[d.field])
            if v is a decision structure
                recurse using the new decision structure
            else
                yield v
else if d.missing is present
    if d.missing is a decision structure
        recurse using the new decision structure
    else
        yield d.missing
else
    no match
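Here is a hedged Python sketch of this matching algorithm. Since functions can't serve as dictionary keys the way the notation above suggests, the predicates live in a list of (predicate, subtree) pairs, and leaves hold pattern indices; all names are illustrative:

def match(decision, test, out):
    # walk one decision node, collecting matched pattern indices in `out`
    if not isinstance(decision, dict):       # a leaf: a pattern index
        out.append(decision)
        return
    field = decision['field']
    if field in test:
        value = test[field]
        if value in decision.get('values', {}):
            match(decision['values'][value], test, out)
        for predicate, subtree in decision.get('functions', []):
            if predicate(value):
                match(subtree, test, out)
    elif 'missing' in decision:
        match(decision['missing'], test, out)

# patterns 0 and 1 from the question, keyed first on 'type', then 'radius'
tree = {
    'field': 'type',
    'values': {
        'circle': {
            'field': 'radius',
            'functions': [
                (lambda n: n > 10, 0),       # pattern 0: radius > 10
                (lambda n: n == 2, 1),       # pattern 1: radius == 2
            ],
        },
    },
}

found = []
match(tree, {'type': 'circle', 'radius': 12}, found)
print(found)  # [0]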

Breaking a string apart into a sequence of words

I recently came across the following interview question:
Given an input string and a dictionary of words, implement a method that breaks up the input string into a space-separated string of dictionary words that a search engine might use for "Did you mean?" For example, an input of "applepie" should yield an output of "apple pie".
I can't seem to get an optimal solution as far as complexity is concerned. Does anyone have any suggestions on how to do this efficiently?
Looks like the question is exactly my interview problem, down to the example I used in the post at The Noisy Channel. Glad you liked the solution. Am quite sure you can't beat the O(n^2) dynamic programming / memoization solution I describe for worst-case performance.
You can do better in practice if your dictionary and input aren't pathological. For example, if you can identify in linear time the substrings of the input string that are in the dictionary (e.g., with a trie) and if the number of such substrings is constant, then the overall time will be linear. Of course, that's a lot of assumptions, but real data is often much nicer than a pathological worst case.
There are also fun variations of the problem to make it harder, such as enumerating all valid segmentations, outputting a best segmentation based on some definition of best, handling a dictionary too large to fit in memory, and handling inexact segmentations (e.g., correcting spelling mistakes). Feel free to comment on my blog or otherwise contact me to follow up.
This link describes this problem as a perfect interview question and provides several methods to solve it. Essentially it involves recursive backtracking. At this level it would produce an O(2^n) complexity. An efficient solution using memoization could bring this problem down to O(n^2).
Using Python, we can write two functions: the first, segment, returns the first segmentation of a piece of contiguous text into words given a dictionary, or None if no such segmentation is found. Another function, segment_all, returns a list of all segmentations found. Worst-case complexity is O(n**2), where n is the input string length in chars.
The solution presented here can be extended to include spelling corrections and bigram analysis to determine the most likely segmentation.
def memo(func):
    '''
    Applies simple memoization to a function
    '''
    cache = {}
    def closure(*args):
        if args in cache:
            v = cache[args]
        else:
            v = func(*args)
            cache[args] = v
        return v
    return closure

def segment(text, words):
    '''
    Return the first match that is the segmentation of 'text' into words
    '''
    @memo
    def _segment(text):
        if text in words: return text
        for i in range(1, len(text)):
            prefix, suffix = text[:i], text[i:]
            segmented_suffix = _segment(suffix)
            if prefix in words and segmented_suffix:
                return '%s %s' % (prefix, segmented_suffix)
        return None
    return _segment(text)

def segment_all(text, words):
    '''
    Return a full list of matches that are the segmentation of 'text' into words
    '''
    @memo
    def _segment(text):
        matches = []
        if text in words:
            matches.append(text)
        for i in range(1, len(text)):
            prefix, suffix = text[:i], text[i:]
            segmented_suffix_matches = _segment(suffix)
            if prefix in words and len(segmented_suffix_matches):
                for match in segmented_suffix_matches:
                    matches.append('%s %s' % (prefix, match))
        return matches
    return _segment(text)

if __name__ == "__main__":
    string = 'cargocultscience'
    words = set('car cargo go cult science'.split())
    print(segment(string, words))
    # >>> car go cult science
    print(segment_all(string, words))
    # >>> ['car go cult science', 'cargo cult science']
One option would be to store all valid English words in a trie. Once you've done this, you could start walking the trie from the root downward, following the letters in the string. Whenever you find a node that's marked as a word, you have two options:
Break the input at this point, or
Continue extending the word.
You can claim that you've found a match once you have broken the input up into a set of words that are all legal and have no remaining characters left. Since at each letter you either have one forced option (either you are building a word that isn't valid and should stop -or- you can keep extending the word) or two options (split or keep going), you could implement this function using exhaustive recursion:
PartitionWords(lettersLeft, wordSoFar, wordBreaks, trieNode):
    // If you walked off the trie, this path fails.
    if trieNode is null, return.

    // If this trie node is a word, consider what happens if you split
    // the word here.
    if trieNode.isWord:
        // If there is no input left, you're done and have a partition.
        if lettersLeft is empty, output wordBreaks + wordSoFar and return

        // Otherwise, try splitting here.
        PartitionWords(lettersLeft, "", wordBreaks + wordSoFar, trie root)

    // Otherwise, consume the next letter and continue:
    PartitionWords(lettersLeft.substring(1), wordSoFar + lettersLeft[0],
                   wordBreaks, trieNode.child[lettersLeft[0]])
In the pathological worst case this will list all partitions of the string, and there can be exponentially many of them. However, this only occurs if you can partition the string in a huge number of ways that all start with valid English words, which is unlikely to occur in practice. If the string has many partitions, we might spend a lot of time finding them, though. For example, consider the string "dotheredo." We can split this many ways:
do the redo
do the red o
doth ere do
dot here do
dot he red o
dot he redo
To avoid this, you might want to institute a limit on the number of answers you report, perhaps two or three.
Since we cut off the recursion when we walk off the trie, if we ever try a split that doesn't leave the remainder of the string valid, we will detect this pretty quickly.
Hope this helps!
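For reference, here is a runnable Python rendering of the pseudocode above, using a plain dict-of-dicts trie with an end-of-word marker (the helper names and the '$' marker are my own choices):

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True                 # end-of-word marker
    return root

def partition_words(letters, word_so_far, breaks, node, root, results):
    if node is None:
        return                           # walked off the trie: dead end
    if node.get('$'):
        if not letters:
            # no input left: we have a complete partition
            results.append(breaks + [word_so_far])
            return
        # option 1: split here and start a fresh word from the trie root
        partition_words(letters, '', breaks + [word_so_far], root, root, results)
    if letters:
        # option 2: keep extending the current word with the next letter
        partition_words(letters[1:], word_so_far + letters[0], breaks,
                        node.get(letters[0]), root, results)

trie = build_trie(['do', 'dot', 'doth', 'the', 'he', 'here', 'ere',
                   'red', 'redo', 'o'])
results = []
partition_words('dotheredo', '', [], trie, trie, results)
for r in results:
    print(' '.join(r))    # prints the six splits listed above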
import java.util.*;

class Position {
    int indexTest, no;

    Position(int indexTest, int no) {
        this.indexTest = indexTest;
        this.no = no;
    }
}

class RandomWordCombo {
    static boolean isCombo(String[] dict, String test) {
        HashMap<String, ArrayList<String>> dic = new HashMap<String, ArrayList<String>>();
        Stack<Position> pos = new Stack<Position>();
        for (String each : dict) {
            if (dic.containsKey("" + each.charAt(0))) {
                //System.out.println("=========it is here");
                ArrayList<String> temp = dic.get("" + each.charAt(0));
                temp.add(each);
                dic.put("" + each.charAt(0), temp);
            } else {
                ArrayList<String> temp = new ArrayList<String>();
                temp.add(each);
                dic.put("" + each.charAt(0), temp);
            }
        }
        Iterator it = dic.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry pair = (Map.Entry) it.next();
            System.out.println("key: " + pair.getKey());
            for (String str : (ArrayList<String>) pair.getValue()) {
                System.out.print(str);
            }
        }
        pos.push(new Position(0, 0));
        while (!pos.isEmpty()) {
            Position position = pos.pop();
            System.out.println("position index: " + position.indexTest + " no: " + position.no);
            if (dic.containsKey("" + test.charAt(position.indexTest))) {
                ArrayList<String> strings = dic.get("" + test.charAt(position.indexTest));
                if (strings.size() > 1 && position.no < strings.size() - 1)
                    pos.push(new Position(position.indexTest, position.no + 1));
                String str = strings.get(position.no);
                // make sure the candidate word actually occurs here, not just its first char
                if (!test.startsWith(str, position.indexTest))
                    continue;
                if (position.indexTest + str.length() == test.length())
                    return true;
                pos.push(new Position(position.indexTest + str.length(), 0));
            }
        }
        return false;
    }

    public static void main(String[] st) {
        String[] dic = {"world", "hello", "super", "hell"};
        System.out.println("is 'superman' a combo: " + isCombo(dic, "superman"));
    }
}
I have done a similar problem. This solution returns true or false if the given string is a combination of dictionary words. It can easily be converted to produce a space-separated string. Its average complexity is O(n), where n is the number of dictionary words in the given string.

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
Update: For those who think this question is just out of curiosity: this algorithm could be used to camel-case domain names ("sportandfishing .com" -> "SportAndFishing .com"), and this algo is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether such a splitting is possible for a given word w. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate
S[i] = isWord(w[1..i]) or (S[j-1] and isWord(w[j..i]) for some j in {2..i}).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solutions by storing all such splits.
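For reference, a direct Python transcription of this table (0-indexed, with S[0] = true for the empty prefix, which folds the isWord(w[1..i]) case into the general recurrence; the dictionary is illustrative):

def can_segment(w, is_word):
    n = len(w)
    S = [False] * (n + 1)
    S[0] = True                       # the empty prefix splits trivially
    for i in range(1, n + 1):
        # w[:i] splits if some split point j leaves a word w[j:i]
        S[i] = any(S[j] and is_word(w[j:i]) for j in range(i))
    return S[n]

words = {'string', 'into', 'words'}
print(can_segment('stringintowords', lambda t: t in words))  # True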
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use it properly (that is, by testing for words incrementally).
(b) typing "segmentation dynamic programming" turns up scores of more detailed answers, from university-level lectures with pseudo-code algorithms, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach for dealing with words that won't be contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about Markov models and the Viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters: you only need to continue evaluating the most likely one.
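As a hedged sketch of that insight under a simple unigram word model (the probabilities below are invented for illustration, and the function names are mine): for each prefix length we keep only the best-scoring segmentation ending there, rather than all of them:

import math

def best_segmentation(s, logprob):
    n = len(s)
    # best[i] = (score of best segmentation of s[:i], start index of its last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(i):
            p = logprob(s[j:i])
            if p is not None and best[j][0] + p > best[i][0]:
                best[i] = (best[j][0] + p, j)
    if best[n][1] is None:
        return None                      # no segmentation exists
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(s[j:i])
        i = j
    return list(reversed(words))

# invented unigram probabilities, purely for illustration
probs = {'windows': 0.01, 'window': 0.008, 'team': 0.01, 'steam': 0.002, 'blog': 0.01}

def lp(w):
    return math.log(probs[w]) if w in probs else None

print(best_segmentation('windowsteamblog', lp))
# ['windows', 'team', 'blog']: scores higher here than 'window steam blog'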
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.
You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be, so there are Sum(i = 0 to n-1, (n-1 choose i)) possible splittings. By the binomial theorem, with x and y both being 1, this is equal to pow(2, n-1).
Granted, a lot of this computation rests on common subproblems, so dynamic programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such that M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j_k = i_(k+1).
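As a small, hedged sketch of that matrix (the string and dictionary are made up; M[i][j] is true when the substring from i to j inclusive is a word):

s = 'catsat'
words = {'cat', 'cats', 'a', 'at', 'sat'}
n = len(s)
# M[i][j] is True exactly when s[i:j+1] is a dictionary word
M = [[s[i:j + 1] in words for j in range(n)] for i in range(n)]
for row in M:
    print(''.join('X' if cell else '.' for cell in row))
# ..XX..
# .XX...
# ......
# .....X
# ....XX
# ......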
If your goal is to correctly camel-case URLs, I would sidestep the problem and go for something a little more direct: get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array NumSpaces that records how much whitespace occurs in the original string, like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + Needle.length()] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
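Here is a rough Python rendering of that bookkeeping (fetching the page is omitted, the boundary arithmetic is my own variant, and names are illustrative):

def find_original(needle, haystack):
    # build the space-stripped text and, per kept character, the number of
    # spaces that precede it in the original
    preprocessed, num_spaces = [], []
    spaces = 0
    for ch in haystack:
        if ch == ' ':
            spaces += 1
        else:
            preprocessed.append(ch.lower())
            num_spaces.append(spaces)
    location = ''.join(preprocessed).find(needle.lower())
    if location < 0:
        return None
    start = location + num_spaces[location]
    last = location + len(needle) - 1      # index of the needle's last char
    end = last + num_spaces[last] + 1
    return haystack[start:end]

print(find_original('isashort', 'This is a short phrase'))
# is a short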
This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.
With a fair-sized dictionary this is going to be an insanely resource-intensive and lengthy operation, and you cannot even be sure that the problem will be solved.
Create a list of possible words, sort it from long words to short words.
Check each entry in the list against the first part of the string. If it matches, remove it and append it to your sentence with a space. Repeat this.
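A minimal Python sketch of this greedy longest-first strategy, with a reminder that greedy matching can strand itself (the example words are illustrative):

def greedy_split(s, words):
    # longest words first, as suggested above
    words = sorted(words, key=len, reverse=True)
    out = []
    while s:
        for w in words:
            if s.startswith(w):
                out.append(w)
                s = s[len(w):]
                break
        else:
            return None               # no word matches here: greedy is stuck
    return ' '.join(out)

print(greedy_split('stringintowords', ['string', 'into', 'words']))
# string into words
print(greedy_split('artistic', ['artist', 'art', 'is', 'tic']))
# None: greedy grabs 'artist' and strands 'ic', though 'art is tic' works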
A simple Java solution which has O(n^2) running time.
import java.util.HashMap;
import java.util.HashSet;

public class Solution {
    // should contain the list of all words, or you can use any other data structure (e.g. a Trie)
    private HashSet<String> dictionary;

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    public String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);

/* this way, one does not check the dictionary for word validity
 * on every substring; it is only queried once and for all,
 * eliminating multiple trips to the data storage
 */
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();

validwords = validwords.sort(length, desc);

array segments = [];
while (mainword != "") {
    boolean matched = false;
    for (x = 0; x < validwords.length; x++) {
        if (mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            mainword = mainword.remove(validwords[x]);
            matched = true;
            x = 0;
        }
    }
    /* remove the first character if none of the valid words matched,
     * then start again; you may need to add that character to the
     * result if you want to keep it
     */
    if (!matched)
        mainword = mainword.substring(1);
}
string result = segments.join(" ");
