I am making a simple calculator using flex and bison.
I developed a grammar that includes two types of expressions: integer expressions and real expressions.
The grammar is similar to this:
exp -> intExp | realExp
intExp -> INT | intExp '+' intExp
realExp -> REAL | realExp '+' realExp | intExp '+' realExp | realExp '+' intExp
This is not LALR(1).
For example, consider the string INT '+' REAL. At 'INT' the lookahead is '+', and based on just this it is impossible to tell whether the string is an intExp or a realExp.
I tried rewriting the grammar to resolve the ambiguity but nothing came of it.
I know I could defer computation during parsing and instead build a parse tree; the issue could then be resolved with type checking. But that seems like too much for such a simple problem.
Is there any way bison itself can be made to handle this ambiguity? Or can the grammar be rewritten in a better way?
No; if it's not LALR(1), then it's not. However, in your language you cannot have a type-mismatch error, so why have separate productions for int and real expressions? Just make the semantic value contain an integer, a real, and a type code.
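A minimal sketch of that idea in C (the struct, its field names, and the add_values helper are illustrative assumptions, not something bison provides; the struct would simply become the semantic value type of the grammar):

#include <stdio.h>

/* One semantic value that carries both representations plus a type code. */
enum num_type { T_INT, T_REAL };

struct value {
    enum num_type type;
    long   ival;    /* meaningful when type == T_INT  */
    double rval;    /* meaningful when type == T_REAL */
};

/* '+' stays integer for int + int and promotes to real otherwise. */
static struct value add_values(struct value a, struct value b)
{
    struct value r;
    if (a.type == T_INT && b.type == T_INT) {
        r.type = T_INT;
        r.ival = a.ival + b.ival;
    } else {
        double x = (a.type == T_INT) ? (double)a.ival : a.rval;
        double y = (b.type == T_INT) ? (double)b.ival : b.rval;
        r.type = T_REAL;
        r.rval = x + y;
    }
    return r;
}

int main(void)
{
    struct value i = { T_INT, 2, 0.0 };
    struct value d = { T_REAL, 0, 3.5 };
    struct value s = add_values(i, d);   /* INT '+' REAL promotes to real */
    printf("%s %g\n", s.type == T_REAL ? "real" : "int", s.rval);   /* prints: real 5.5 */
    return 0;
}

The grammar then collapses to a single nonterminal, roughly exp -> NUM | exp '+' exp, and the lexer fills in the type code when it scans the literal.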
Let's say I have the following expression alternation:
expr
: expr BitwiseAnd expr
| expr BitwiseXor expr
// ...
;
Just for argument's sake, let's say that the expr on the left-hand side turns out to be 1 MB of input. Will ANTLR be able to 'save' that expression so it doesn't have to start from zero on each alternative, or how far does it have to backtrack when it fails to match an alternative?
ANTLR will recognize the first expr and then, if it doesn't find a BitwiseAnd, it will look for a BitwiseXor to try to match the second alternative. It won't backtrack all the way to recognizing the first expr again. It's not exactly memoization, but you get the same benefit (arguably even better).
You may find it useful to have ANTLR generate the ATN for your grammar. Use the -atn option when running the antlr4 command; this will generate *.dot files for each of your rules (both lexer and parser). You can then use Graphviz to render them to SVG, PDF, etc. They may look a bit intimidating at first glance, but spend a moment with them and you'll get a LOT of insight into how ANTLR goes about parsing your input.
The second place to look is the generated parser code. It, too, is much more understandable than you might expect (especially if you read it with the ATN graph handy).
So I'm trying to make a lexical analyzer for Scheme, and when I run JFlex on the lexer.flex file I get an error similar to this one:
Reading "lexer.flex"
Macro definition contains a cycle.
1 error, 0 warnings.
The macro it's referring to is this one:
definition = {variable_definition}
| {syntax_definition}
| \(begin {definition}*\)
| \(let-syntax \({syntax_binding}*\){definition}*\)
| \(letrec-syntax \({syntax_binding}*\){definition}*\)
All of the macros used here have been defined, but for some reason I can't get rid of this error and I don't know why it's happening.
A lex/flex/JFlex style "definition" is a macro expansion, as that error message indicates. Recursive macro expansions are impossible, since macro expansion is not conditional; trying to expand
definition = ... \(begin {definition}*\) ...
will result in an infinitely long regular expression.
Do not mistake a lexical analyser for a general-purpose parser. A lexical analyser does no more than split an input into individual tokens (or "lexemes"), using regular expressions to identify each token. Tokens do not have structure (at least for the purposes of parsing); once a token is identified, it is a single indivisible object. If you find yourself writing lexical descriptions which match structured text, you have almost certainly pushed the lexical analysis beyond its limits.
Parsers use an algorithm which does allow recursive descriptions (but which has very limited forward lookahead) and which can create a recursive description of the input (such as a parse tree).
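To make the division of labour concrete, here is a rough sketch in C (the token names, the stand-in scanner, and parse_definition are all made up for illustration): the lexer only classifies flat tokens, and the recursion that the failing macro tried to express lives in a parser function that calls itself.

#include <stdio.h>

/* Flat token kinds; classifying these is all the lexer should do. */
enum token { TOK_LPAREN, TOK_RPAREN, TOK_BEGIN, TOK_SYMBOL, TOK_EOF };

/* Stand-in for the generated scanner: yields the tokens of "(begin x (begin y))". */
static enum token stream[] = { TOK_LPAREN, TOK_BEGIN, TOK_SYMBOL, TOK_LPAREN,
                               TOK_BEGIN, TOK_SYMBOL, TOK_RPAREN, TOK_RPAREN, TOK_EOF };
static int pos;
static enum token yylex(void) { return stream[pos++]; }

static enum token lookahead;
static void advance(void) { lookahead = yylex(); }

/* The recursion that the macro tried to express belongs here, in the parser. */
static void parse_definition(int depth)
{
    if (lookahead == TOK_LPAREN) {
        advance();
        if (lookahead == TOK_BEGIN) {            /* \(begin {definition}*\) */
            advance();
            printf("%*sbegin-definition\n", depth * 2, "");
            while (lookahead != TOK_RPAREN)
                parse_definition(depth + 1);     /* recursion is fine here  */
        }
        /* ... the let-syntax and letrec-syntax forms would go here ... */
        advance();                               /* consume the closing ')' */
    } else {
        printf("%*ssimple definition\n", depth * 2, "");
        advance();                               /* variable/syntax definition, elided */
    }
}

int main(void)
{
    advance();
    parse_definition(0);
    return 0;
}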
According to the documentation, one can define formats for printing notations:
https://coq.inria.fr/refman/Reference-Manual014.html#sec530
However, one can define a notation such as:
Notation " '[[' a ']]' b " := (* something *).
It is very unclear whether the two can interact. Trying:
format " '[hv' '[[' a ']]' ']' b "
for instance, trips Coq up, as it expects an opening square bracket to be followed by one of '', 'v', and 'hv'.
Any other sort of escaping I have tried so far made Coq refuse the format as not matching the notation.
I'm not sure this can be done...
Your friend here is the parse_format function in metasyntax.ml: https://github.com/coq/coq/blob/trunk/toplevel/metasyntax.ml#L102
As you can see in the code, your concrete scheme is not going to work. I don't know whether there is some specific hack; for now you'll have to give up on the double brackets.
I am certain, however, that a patch adding a case for '[[' in parse_quoted would be considered by Coq upstream.
Hopefully 8.7 will bring some improvements here; CEP#9 proposes replacing/evolving the unparsing machinery into a true box-based model.
Has anyone in this forum attempted to solve the ACM programming problem http://acm.mipt.ru/judge/problems.pl?browse=yes&problem=024? It is one of the simpler problems in ACM MIPT, and the goal is to evaluate an expression consisting of +, -, * and parentheses. Despite the apparent simplicity, I haven't been able to get my solution accepted, apparently because one of the test case expressions has an operator not stated in the problem. I even added support for division ('/'), but that didn't help either. Any idea what other operator needs to be supported? FYI, my program removes all whitespace from the input before processing, so spaces shouldn't be a problem. Is there anything not stated in the problem that needs to be taken care of?
You're being bitten by Ruby's handling of strings and characters.
curr_ch = @input[i]
gives you an integer: the ASCII code of the character at index i of the input, rather than a one-character string.
curr_ch == '('
for example, compares that integer to the string "(", which of course fails. The regex matches also fail because you pass them an integer where a string is expected.
Replacing all occurrences of some_var = @input[some_index] with some_var = @input[some_index...some_index+1] gives me a programme that seems to work (it works on a few test inputs I gave it). Probably someone who actually knows the quirks of Ruby can give you a better fix.
This question was asked to me in an interview:
Write a code to generate the parse tree like compilers do internally for any given expression. For example:
a+(b+c*(e/f)+d)*g
Start by defining the language. No one can implement a parser or a compiler for a language that isn't well defined. You give an example, 'a+(b+c*(e/f)+d)*g', which should trigger the following questions:
Is the language a single expression, or may there be multiple statements (separated by ';' maybe)?
What are the 'a', 'b', ..., 'g' tokens? Are they variables? What is the syntax of variables? Are they C-like identifiers, or single alphanumeric characters as your example may imply?
There are 3 binary operators in your example. Is that all? Does the language also support '-'? Does it support logical and bitwise operators?
Does the language support number literals? Integers only? Doubles? Does it support string literals? How are string literals quoted?
Syntax for comments?
What are the operator precedences? Does the '*' operator have precedence over '+' as in the example? Are operands evaluated right to left or left to right?
Any pre-processing?
Once you are equipped with a good definition of the language syntax, start by implementing a tokenizer. A tokenizer takes a stream of characters and produces a list of tokens. In the example above each character is a token, but in var**12 (var to the power of 12) there are 3 tokens: 'var', '**' and '12'. If regular expressions are permitted, you can probably do this part of the parsing with them.
Next, have a function that identifies each token by type: is it an operator, a variable, a number literal, a string literal, etc.? Package it all in a method called NextToken that returns a token and its type.
Finally, start parsing. In your sample above, the root of the parse tree will be a node for the first '+' operator (it binds least tightly, so it is applied last). Its left child is the variable token 'a', and its right child is a subtree whose root is the '*' token. Work recursively.
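As a rough illustration of those steps in C (the grammar, the Node type, and the parse_* functions are assumptions for this sketch, and error handling is omitted):

#include <stdio.h>
#include <stdlib.h>

/* Parse-tree node: a single-letter variable (leaf) or an operator with two children. */
typedef struct Node {
    char op;                        /* '+', '-', '*', '/' or the variable letter */
    struct Node *left, *right;      /* NULL for leaves */
} Node;

static const char *src;             /* current position in the input */

static Node *new_node(char op, Node *l, Node *r)
{
    Node *n = malloc(sizeof *n);
    n->op = op; n->left = l; n->right = r;
    return n;
}

static Node *parse_expr(void);      /* forward declaration, used by parse_factor */

/* factor := letter | '(' expr ')' */
static Node *parse_factor(void)
{
    if (*src == '(') {
        src++;                      /* skip '(' */
        Node *n = parse_expr();
        src++;                      /* skip ')' */
        return n;
    }
    return new_node(*src++, NULL, NULL);    /* single-letter variable */
}

/* term := factor (('*' | '/') factor)* */
static Node *parse_term(void)
{
    Node *n = parse_factor();
    while (*src == '*' || *src == '/') {
        char op = *src++;
        n = new_node(op, n, parse_factor());
    }
    return n;
}

/* expr := term (('+' | '-') term)* */
static Node *parse_expr(void)
{
    Node *n = parse_term();
    while (*src == '+' || *src == '-') {
        char op = *src++;
        n = new_node(op, n, parse_term());
    }
    return n;
}

/* Print the tree in prefix form to show its shape. */
static void dump(const Node *n)
{
    if (n->left) {
        printf("(%c ", n->op);
        dump(n->left);
        printf(" ");
        dump(n->right);
        printf(")");
    } else {
        printf("%c", n->op);
    }
}

int main(void)
{
    src = "a+(b+c*(e/f)+d)*g";
    dump(parse_expr());
    printf("\n");    /* prints (+ a (* (+ (+ b (* c (/ e f))) d) g)) */
    return 0;
}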
A simple way out is to convert your expression into postfix notation (abcef/*+d+g*+) and then refer to the answer to this question (http://stackoverflow.com/questions/423898/postfix-notation-to-expression-tree) for converting the postfix expression to a tree.
This is what the interviewer expected :)
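A rough sketch of that conversion step in C, assuming the same single-letter operands and the postfix form given above (the Node type and tree_from_postfix are made up for illustration):

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    char sym;                       /* operator or single-letter operand */
    struct Node *left, *right;
} Node;

/* Build a tree from postfix: push operand leaves; for each operator pop two
   subtrees and push a new node combining them. */
static Node *tree_from_postfix(const char *postfix)
{
    Node *stack[64];
    int top = 0;
    for (const char *p = postfix; *p; p++) {
        Node *n = malloc(sizeof *n);
        n->sym = *p;
        if (*p == '+' || *p == '-' || *p == '*' || *p == '/') {
            n->right = stack[--top];        /* right operand was pushed last */
            n->left  = stack[--top];
        } else {
            n->left = n->right = NULL;      /* operand leaf */
        }
        stack[top++] = n;
    }
    return stack[top - 1];                  /* the last remaining entry is the root */
}

int main(void)
{
    /* postfix form of a+(b+c*(e/f)+d)*g */
    Node *root = tree_from_postfix("abcef/*+d+g*+");
    printf("root operator: %c\n", root->sym);   /* prints: root operator: + */
    return 0;
}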
Whenever you intend to write a parser, the main question to ask is whether you want to do it by hand or use a parser generator framework.
In this case, I would say that it's a good exercise to write it all yourself.
Start with a good representation for the tree itself. This will be the output of your algorithm. For example, it could be a collection of objects, where one object kind represents a "label" like a, b, and c in your example. Others could represent numbers. You could then define a representation of operators, for example + as a binary operator, which would have two subobjects representing the left and right subexpressions.
The next step is the actual parser; I would suggest a classical recursive descent parser. One text that describes this and provides a standard pseudo-code implementation is the one by Theodore Norvell.
I'd start with a simple grammar, something like those used by ANTLR and JavaCC.