Write a code to generate the parse tree - algorithm

This question was asked to me in an interview question:
Write a code to generate the parse tree like compilers do internally for any given expression. For example:
a+(b+c*(e/f)+d)*g

Start by defining the language. No one can implement a parser or a compiler to a language that isn't very well defined. You give an example: 'a+(b+c*(e/f)+d)*g', which should trigger the following questions:
Is the language a single expression, or there may be multiple statements (separated by ';' maybe?
what are the 'a', 'b', ... 'g' tokens? Is it variable? What is the syntax of variables? Is it a C-like variable, or is it a single alphanumeric character as your example may imply.
There are 3 binary expression in your example. Is that all? Does the language also support '-'. Does your language support logical, and bit-wise operators?
Does the language support number literals? integer only? double? Does the language support string literals? Do you quote string literals?
Syntax for comments?
Which operator has precedence? Does '*' operator has precedence over '+' as in the example? Does operands evaluated right to left or left to right?
Any Pre-processing?
Once you are equipped with a good definition of the language syntax, start with implementing a tokenizer. A tokenizer gets a stream of characters and generates a list of tokens. In the example above, each character is a token, but in var*12 (var power 12) there are 3 tokens: 'var', '*' and '12'. If regular expression is permitted, it is possible you can do this part of the parsing with regular expressions.
Next, have a function that identify each token by type: is it an operator, is it a variable, a number literal, string literal, etc. Package all in a method called NextToken that returns a token and its type.
Finally, start parsing. In your sample above the root of the parsing tree will be a node with the '+' operator (which has precedence over the ''). The left child is a variable token 'a' and the right child is a tree with a root element the '' token. Work recursively.

Simple way out is to convert your expression into postfix notation (abcef/*++) & then refer to the answer to this question (http://stackoverflow.com/questions/423898/postfix-notation-to-expression-tree) for converting the postfix expression to a Tree.
This is what the interviewer expected :)

Whenever you intend to write a parser, the main question to ask is if you want to do it manually, or to use a parser generator framework.
In this case, I would say that it's a good exercise to write it all yourself.
Start with a good representation for the tree itself. This will be be output of your algorithm. For example, this could be a collection of objects, where one object kind could represent a "label" like a, b, and c in your example. Others could represent numbers. You could then defined a representation of operators, for example + is a binary operator, which would have two subobjects, representing the left and right subexpression.
Next step is the actual parser, I would suggest a classical recursive decent parser. One text describing this, and provides a standard pseudo-code implementation is this text by Theodore Norvell

I'd start with a simple grammar, something like those used by ANTLR and JavaCC.

Related

What do I do in ANTLR if I want to parse something which is extremely configurable?

I'm writing a grammar to recognise simple mathematical expressions. I have it working for English.
Now I want to expand the grammar to support i18n. Therefore, the digits, radix separator and so forth depend on the user's locale.
What is the best way to do this in ANTLR?
What I'm currently considering is something like this:
lexer grammar ExpressionLexer;
options {
superClass = AbstractLexer;
}
DIGIT: . {isDigit(getText())}?;
// ... and so on for other tokens ...
abstract class AbstractLexer(input: CharStream, symbols: Symbols) extends Lexer(input) {
fun isDigit(codePoint: Int): Boolean = symbols.isDigit(codePoint)
// ... and so on for other tokens ...
}
Alternative approaches I am considering:
(b) I gather every possible digit and every possible separator in every possible locale, and jam all of those into the one grammar, and then check isDigit after that.
(c) I make a different lexer for every single numbering system and somehow align them all to emit the same token types in the same order, so they can be swapped in and out (sounds like it might be the most pure and correct solution? but not the most enjoyable.)
(And on a side tangent, how do people in European countries which use comma for the decimal separator deal with writing function calls with more than one parameter?)
I recommend doing that in two steps:
Parse the main language structure (e.g. (digits+ separator)+), regardless of what a digit or a separator is.
Do a semantic check against the user's locale if the digits that were given actually match what's allowed. Same for the separator.
This way you don't need to do all kind of hacks, add platform code and so on.
For your side question: programming usually uses the english language, including the number format. In strings you can use any format you want, but that doesn't affect the surrounding code.
Note that since ANTLR v4.7 and up, there is more possible w.r.t. Unicode inside ANTLR's lexer grammar: https://github.com/antlr/antlr4/blob/master/doc/unicode.md
So you could define a lexer rule like this:
DIGIT
: [\p{Digit}]
;
which will match both ٣ and 3.

How to get first character that is causing reg expression not to match

We have one quite complex regular expression which checks for string structure.
I wonder if there is an easy way to find out which character in the string that is causing reg expression not to match.
For example,
string.match(reg_exp).get_position_which_fails
Basically, the idea is how to get "position" of state machine when it gave up.
Here is an example of regular expression:
%q^[^\p{Cc}\p{Z}]([^\p{Cc}\p{Zl}\p{Zp}]{0,253}[^\p{Cc}\p{Z}])?$
The short answer is: No.
The long answer is that a regular expression is a complicated finite state machine that may be in a state trying to match several different possible paths simultaneously. There's no way of getting a partial match out of a regular expression without constructing a regular expression that allows partial matches.
If you want to allow partial matches, either re-engineer your expression to support them, or write a parser that steps through the string using a more manual method.
You could try generating one of these automatically with Ragel if you have a particularly difficult expression to solve.

Extract function names from function calls in C files

Is it posible to extract function calls in C source files, e.g.,
...
myfunc(1);
...
or
...
myfunc(anotherfunc(1, 2));
....
by just using Ruby regular expression? If not, would a parser generator such as ANTLR be useful?
This is not a full-proof pattern for finding out method calls but should just serve the pattern that you are interested in.
[a-zA-Z\s]*\([a-zA-Z0-9]*(\([a-zA-Z0-9\s]*[\s,]*[\sa-zA-Z0-9]*\))?\);
This regex will match following method call patterns.
1. myfunc(another(one,two));
2. myfunc();
3. myfunc(another());
4. myfunc(oneArg);
You can also use the regular expressions already written from grammar that are used by emacs -- imenu , etags, ecb, c-mode etc.
In the purest sense you can't, because the possibility to nest function calls recursively makes it a non-regular language. That is, you cannot write a regular expression that matches an arbitrary function call and extracts all of the contained function names.
But of course you could search incrementally for sequences of characters allowed in function names (ie., must start with a letter or underscore, followed by letters, underscore, numbers, etc...) followed by an left parenthesis, or something along those lines.
Keep in mind, however, that any such approach is prone to errors: what if a function is referenced in a comment? What if it appears inside a string constant? Really, to catch all the special cases you would have to (almost) properly parse the full C file.
Most modern regular expression engines have features to parse more than regular languages e.g. by means of back-references to subexpressions. But you shouldn't go down that road. With a proper parser such as ANTLR that can parse context-free languages you'll make your own life a lot easier.

Tokenize (lex? parse?) a regular expression

Using Ruby I'd like to take a Regexp object (or a String representing a valid regex; your choice) and tokenize it so that I may manipulate certain parts.
Specifically, I'd like to take a regex/string like this:
regex = /var (\w+) = '([^']+)';/
parts = ["foo","bar"]
and create a replacement string that replaces each capture with a literal from the array:
"var foo = 'bar';"
A naïve regex-based approach to parsing the regex, such as:
i = -1
result = regex.source.gsub(/\([^)]+\)/){ parts[i+=1] }
…would fail for things like nested capture groups, or non-capturing groups, or a regex that had a parenthesis inside a character class. Hence my desire to properly break the regex into semantically-valid pieces.
Is there an existing Regex parser available for Ruby? Is there a (horror of horrors) known regex that cleanly matches regexes? Is there a gem I've not found?
The motivation for this question is a desire to find a clean and simple answer to this question.
I have a JavaScript project on GitHub called: Dynamic (?:Regex Highlighting)++ with Javascript! you may want to look at. It parses PCRE compatible regular expressions written in both free-spacing and non-free-spacing modes. Since the regexes are written in the less-feature-rich JavaScript syntax, these regexes could be easily converted to Ruby.
Note that regular expressions may contain arbitrarily nested parentheses structures and JavaScript has no recursive regex features, so the code must parse the tree of nested parens from the-inside-out. Its a bit tricky but works quite well. Be sure to try it out on the highlighter demo page, where you can input and dynamically highlight any regex. The JavaScript regular expressions used to parse regular expressions are documented here.

Is this XPath technique reliable in all situations?

I am developing an application that accepts user-defined XPath expressions and employs them as part of its runtime operation.
However, I would like to be able to infer some additional data by programmatically manipulating the expression, and I am curious to know whether there are any situations in which this approach might fail.
Given any user-defined XPath expression that returns a node set, is it safe to wrap it in the XPath count() function to determine the number of nodes in the set:
count(user_defined_expression)
Similarly, is it safe to append an array index to the expression to extract one of the nodes in the set:
user_defined_expression[1]
Well an XPath expression (in XPath 1.0) can yield a node-set or a string or a number or a boolean and doing count(expression) only makes sense on any expression yielding a node-set.
As for adding a positional predicate, I think you might want to use parentheses around your expression i.e. to change /root/foo/bar into (/root/foo/bar)[1] as that way you select the first bar element in the node-set selected by /root/foo/bar while without them you would get /root/foo/bar[1] which would select the first bar child element of any foo child element of the root element.
Are you checking that such user-defined expressions always evaluate to node-set?
If yes, first Expr is ok. Datatype will be correct for fn:count
Second one is a lot trickier, with a lot of situations there predicate will overweight axis, for example. Check this answer for a simple analysis. It will be difficult to say, what a user really meant.
A more robust approach would be to convert the XPath expression to XQueryX, which is an XML representation of the abstract syntax tree; you can then do XQuery or XSLT transformations on this XML representation, and then convert back to a modified XPath (or XQuery) for evaluation.
However, this will still only give you the syntactic structure of the expression; if you want semantic information, such as the inferred static type of the result, you will probably have to poke inside an XPath process that exposes this information.

Resources