So I'm trying to make a lexical analyzer for Scheme, and when I run JFlex on the lexer.flex file I get an error similar to this one:
Reading "lexer.flex"
Macro definition contains a cycle.
1 error, 0 warnings.
The macro it's referring to is this one:
definition = {variable_definition}
| {syntax_definition}
| \(begin {definition}*\)
| \(let-syntax \({syntax_binding}*\){definition}*\)
| \(letrec-syntax \({syntax_binding}*\){definition}*\)
All of the macros used here have been defined, but for some reason I can't get rid of this error and I don't know why it's happening.
A lex/flex/JFlex style "definition" is a macro expansion, as that error message indicates. Recursive macro expansions are impossible, since macro expansion is not conditional; trying to expand
definition = ... \(begin {definition}*\) ...
will result in an infinitely long regular expression.
Do not mistake a lexical analyser for a general-purpose parser. A lexical analyser does no more than split an input into individual tokens (or "lexemes"), using regular expressions to identify each token. Tokens do not have structure (at least for the purposes of parsing); once a token is identified, it is a single indivisible object. If you find yourself writing lexical descriptions which match structured text, you have almost certainly pushed the lexical analysis beyond its limits.
Parsers use an algorithm which does allow recursive descriptions (but which has very limited forward lookahead) and which can create a recursive description of the input (such as a parse tree).
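The usual fix, sketched below, is to keep only flat token shapes in the lexer and move the recursive structure of definition into the grammar of a parser (for example a CUP grammar fed by the JFlex scanner). This is only an illustrative fragment; the symbol() helper and the token names are placeholders, not part of your spec:

%%
%class SchemeLexer
%unicode
%%
"("             { return symbol(LPAREN); }
")"             { return symbol(RPAREN); }
"begin"         { return symbol(BEGIN); }
"let-syntax"    { return symbol(LET_SYNTAX); }
"letrec-syntax" { return symbol(LETREC_SYNTAX); }
[^ \t\r\n()]+   { return symbol(ATOM, yytext()); }
[ \t\r\n]+      { /* skip whitespace */ }

A parser rule such as definition ::= LPAREN BEGIN definition_list RPAREN can then be recursive without any problem.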
I have these two rules in a JFlex specification:
Bool = true
Ident = [:letter:][:letterdigit:]*
If I try, for example, to analyse the word "trueStat", it gets recognized as an Ident token and not as a Bool.
How can I avoid this kind of ambiguity in JFlex?
In almost all languages, a keyword is only recognised as such if it is a complete word. Otherwise, you would end up banning identifiers like format, downtime and endurance (which would instead start with the keywords for, do and end, respectively). That's quite confusing for programmers, although it's not unheard-of. Lexical scanner generators like Flex and JFlex generally try to make the common case easy; hence the snippet you provide, which recognises trueStat as an identifier. But if you really want to recognise it as a keyword followed by an identifier, you can accomplish that by adding trailing context to all your keywords:
Bool = true/[:letterdigit:]*
Ident = [:letter:][:letterdigit:]*
With that pair of patterns, true will match the Bool rule, even if it occurs as trueStat. The pattern matches true and any alphanumeric string immediately following it, and then rewinds the input cursor so that the token matched is just true.
Note that like Lex and Flex, JFlex accepts the longest match at the current input position; if more than one rule accepts this match, the action corresponding to the first such rule is executed. (See the manual section "How the Input is Matched" for a slightly longer explanation of the matching algorithm.) Trailing context is considered part of the match for the purposes of this rule (but, as noted above, is then removed from the match).
The consequence of this rule is that you should always place more specific patterns before the general patterns they might override, whether or not the specific pattern uses trailing context. So the Bool rule must precede the Ident rule.
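Concretely, the rules section could look like the fragment below; the token names and the symbol() helper are just the usual JFlex/CUP conventions, not something your spec must already contain:

{Bool}   { return symbol(BOOL); }
{Ident}  { return symbol(IDENT, yytext()); }

With that ordering, trueStat is scanned as BOOL (matching just true, after the rewind) followed by IDENT (matching Stat).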
R7RS-small says that all identifiers must be terminated by a delimiter, but at the same time it defines pretty elaborate rules for what can be in an identifier. So, which one is it?
Is an identifier supposed to start with an initial character and then continue until a delimiter, or does it start with an initial character and continue following the syntax defined in section 7.1.1?
Here are a couple of obvious cases. Are these valid identifiers?
a#a
b,b
c'c
d[d]
If they are not supposed to be valid, what is the purpose of saying that an identifier must be terminated by a delimiter?
In R7RS, symbols can be written as |..ident..|, where the vertical bars act as delimiters; this allows any character that you cannot put in an old-style symbol.
However, in R6RS the "official" grammar was incorrect, as it did not allow symbols such as 1+, which led all implementations to define their own rules to work around this defect in the official grammar.
Unless you need to read the source code of a given implementation and see how it defines symbols, you should not care too much about these rules and should just use classical symbols.
In section 7.1.1 you will find the Backus-Naur form that defines the lexical structure of R7RS identifiers, but I doubt the implementations follow it.
I quote from here
As with identifiers, different implementations of Scheme use slightly different rules, but it is always the case that a sequence of characters that contains no special characters and begins with a character that cannot begin a number is taken to be a symbol
In other words, an implementation will read an atom with a function like read-atom and then classify it by backtracking: it tries read-number first, and if number? fails, the atom is a symbol.
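For illustration, here is a minimal Java sketch of that strategy; every name is made up, and Double.parseDouble stands in for a real Scheme numeric reader:

// Sketch of the read-atom strategy: scan up to a delimiter, then try to read
// the atom as a number; if that fails, the atom is a symbol.
class AtomReader {
    record Symbol(String name) {}

    static String readAtom(String input, int start) {
        int i = start;
        while (i < input.length() && " \t\n()\";".indexOf(input.charAt(i)) < 0) i++;
        return input.substring(start, i);
    }

    static Object classify(String atom) {
        try {
            return Double.parseDouble(atom);   // stand-in for a full numeric reader
        } catch (NumberFormatException e) {
            return new Symbol(atom);           // not a number, so it is a symbol
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(readAtom("1+ 2", 0)));   // Symbol[name=1+]
        System.out.println(classify(readAtom("42 x", 0)));   // 42.0
    }
}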
One more difference between the gcc preprocessor and that of MS VS cl. Consider the following snippet:
# define A(x) L ## x
# define B A("b")
# define C(x) x
C(A("a" B))
For 'gcc -E' we get the following:
L"a" A("b")
For 'cl /E' the output is different:
L"a" L"b"
The MS preprocessor somehow performs an additional macro expansion. Its algorithm is obviously different from gcc's, but it also seems to be a secret. Does anyone know how the observed difference can be explained, and what the preprocessing scheme of MS cl is?
GCC is correct. The standard specifies:
C99 6.10.3.4/2 (and also C++98/11 16.3.4/2): If the name of the macro being replaced is found during this scan of the replacement list (not including the rest of the source file's preprocessing tokens), it is not replaced.
So, when expanding A("a" B), we first replace B to give A("a" A("b")).
A("b") is not replaced, according to the quoted rule, so the final result is L"a" A("b").
Mike's answer is correct, but he actually elides the critical part of the standard that shows why this is so:
6.10.3.4/2 If the name of the macro being replaced is found during this scan of the replacement list (not including the rest of the source file's preprocessing tokens), it is not replaced. Furthermore, if any nested replacements encounter the name of the macro being replaced, it is not replaced. These nonreplaced macro name preprocessing tokens are no longer available for further replacement even if they are later (re)examined in contexts in which that macro name preprocessing token would otherwise have been replaced.
Note in particular the last sentence.
So both gcc and MSVC expand the macro A("a" B) to L"a" A("b"), but the interesting case (where MSVC screws up) is when the macro is wrapped by the C macro.
When expanding the C macro, its argument is first examined for macros to expand and A is expanded. This is then substituted into the body of C, and then that body is then scanned AGAIN for macros to replace. Now you might think that since this is the expansion of C, only the name C will be skipped, but this last clause means that the tokens from the expansion of A will also skip reexpansions of A.
There are basically two points at which one could think that the remaining occurrence of the A macro should be replaced:
The first would be the processing of macro arguments before they are inserted in place of the corresponding parameter in the macro's replacement list. Usually each argument is completely macro-replaced as if it formed the rest of the input file, as described in section 6.10.3.1 of the standard. However, this is not done if the parameter (here: x) occurs next to the ##; in this case the parameter is simply replaced with the argument according to 6.10.3.3, without any recursive macro replacement.
The second would be the "rescanning and further replacement" of section 6.10.3.4, but this is not done recursively for a macro that has already been replaced once.
So neither applies in this case, which means that gcc is correct in leaving that occurrence of A unreplaced.
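Traced step by step (one way to read the rules quoted above; both compilers agree up to the last step):
1. The argument of C is A("a" B). Since x is not adjacent to ## in C's body, this argument is fully macro-expanded before substitution.
2. Expanding A("a" B): x is adjacent to ## in A's body, so "a" B is substituted without pre-expansion, and L ## "a" pastes to L"a", giving L"a" B.
3. Rescanning that replacement turns B into A("b"). Because this A token appears while rescanning A's own replacement, it is marked as no longer available for replacement. The fully expanded argument is L"a" A("b").
4. That is substituted into C's body and rescanned. For gcc the A token is still marked, so the result stays L"a" A("b"). MSVC evidently loses the mark at this point and performs one more expansion, yielding L"a" L"b".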
Is it possible to extract function calls in C source files, e.g.,
...
myfunc(1);
...
or
...
myfunc(anotherfunc(1, 2));
...
by just using Ruby regular expression? If not, would a parser generator such as ANTLR be useful?
This is not a fool-proof pattern for finding method calls, but it should cover the patterns you are interested in.
[a-zA-Z\s]*\([a-zA-Z0-9]*(\([a-zA-Z0-9\s]*[\s,]*[\sa-zA-Z0-9]*\))?\);
This regex will match the following method-call patterns:
1. myfunc(another(one,two));
2. myfunc();
3. myfunc(another());
4. myfunc(oneArg);
You can also use the regular expressions, already derived from the grammar, that are used by emacs: imenu, etags, ecb, c-mode, etc.
In the purest sense you can't, because the possibility of nesting function calls recursively makes it a non-regular language. That is, you cannot write a regular expression that matches an arbitrary function call and extracts all of the contained function names.
But of course you could search incrementally for sequences of characters allowed in function names (i.e., starting with a letter or underscore, followed by letters, underscores, digits, etc.) followed by a left parenthesis, or something along those lines.
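For illustration, here is that incremental approach as a small Java program (the same regex works in Ruby); note that it happily reports the call inside the comment, which leads to the caveats below:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive call finder: an identifier-like sequence directly followed by '('.
public class CallFinder {
    public static void main(String[] args) {
        String src = "x = myfunc(anotherfunc(1, 2)); // myfunc(3) in a comment";
        Matcher m = Pattern.compile("\\b([A-Za-z_][A-Za-z0-9_]*)\\s*\\(").matcher(src);
        while (m.find()) {
            System.out.println(m.group(1));   // myfunc, anotherfunc, myfunc
        }
    }
}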
Keep in mind, however, that any such approach is prone to errors: what if a function is referenced in a comment? What if it appears inside a string constant? Really, to catch all the special cases you would have to (almost) properly parse the full C file.
Most modern regular expression engines have features that go beyond regular languages, e.g. back-references to subexpressions. But you shouldn't go down that road. With a proper parser generator such as ANTLR, which handles context-free languages, you'll make your own life a lot easier.
This question was asked to me in an interview:
Write a code to generate the parse tree like compilers do internally for any given expression. For example:
a+(b+c*(e/f)+d)*g
Start by defining the language. No one can implement a parser or a compiler for a language that isn't very well defined. You give an example, 'a+(b+c*(e/f)+d)*g', which should trigger the following questions:
Is the language a single expression, or may there be multiple statements (separated by ';', maybe)?
What are the 'a', 'b', ... 'g' tokens? Are they variables? What is the syntax of variables? Are they C-like variables, or single alphanumeric characters, as your example may imply?
There are 3 binary operators in your example. Is that all? Does the language also support '-'? Does it support logical and bit-wise operators?
Does the language support number literals? Integers only? Doubles? Does it support string literals? Are string literals quoted?
What is the syntax for comments?
What are the operator precedences? Does the '*' operator have precedence over '+', as in the example? Are operands evaluated right to left or left to right?
Any pre-processing?
Once you are equipped with a good definition of the language syntax, start by implementing a tokenizer. A tokenizer takes a stream of characters and produces a list of tokens. In the example above, each character is a token, but in var*12 (var times 12) there are 3 tokens: 'var', '*' and '12'. If regular expressions are permitted, you may be able to do this part of the parsing with them.
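As a sketch, such a tokenizer for this expression language could look like the following Java fragment (assuming one-character operators and purely alphanumeric names; all names are illustrative):

import java.util.ArrayList;
import java.util.List;

// Splits an expression such as "a+(b+c*(e/f)+d)*g" into names/numbers,
// operators and parentheses.
class Tokenizer {
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                    // skip blanks
            } else if (Character.isLetterOrDigit(c)) {  // name or number
                int j = i;
                while (j < s.length() && Character.isLetterOrDigit(s.charAt(j))) j++;
                tokens.add(s.substring(i, j));
                i = j;
            } else {
                tokens.add(String.valueOf(c));          // operator or parenthesis
                i++;
            }
        }
        return tokens;
    }
}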
Next, have a function that identifies each token by type: is it an operator, a variable, a number literal, a string literal, etc.? Package it all in a method called NextToken that returns a token and its type.
Finally, start parsing. In your sample above, the root of the parse tree will be a node with the '+' operator (the operator with the lowest precedence ends up nearest the root). The left child is the variable token 'a' and the right child is a subtree whose root is the '*' token. Work recursively.
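For example, a compact recursive descent parser in Java over the token list from the sketch above; Node is a minimal stand-in for a real tree type, and error handling is omitted:

import java.util.List;

// Grammar: expr -> term (('+'|'-') term)* ; term -> factor (('*'|'/') factor)* ;
// factor -> name | '(' expr ')'. Lower-precedence operators end up nearer the root.
class Parser {
    record Node(String value, Node left, Node right) {}

    private final List<String> toks;
    private int pos = 0;

    Parser(List<String> toks) { this.toks = toks; }

    private String peek() { return pos < toks.size() ? toks.get(pos) : null; }

    Node expr() {
        Node n = term();
        while ("+".equals(peek()) || "-".equals(peek()))
            n = new Node(toks.get(pos++), n, term());
        return n;
    }

    Node term() {
        Node n = factor();
        while ("*".equals(peek()) || "/".equals(peek()))
            n = new Node(toks.get(pos++), n, factor());
        return n;
    }

    Node factor() {
        if ("(".equals(peek())) {
            pos++;                  // consume '('
            Node n = expr();
            pos++;                  // consume ')'
            return n;
        }
        return new Node(toks.get(pos++), null, null);   // a name or number leaf
    }
}

Calling new Parser(Tokenizer.tokenize("a+(b+c*(e/f)+d)*g")).expr() yields exactly the tree described: '+' at the root, 'a' on the left, and a '*' subtree on the right.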
A simple way out is to convert your expression into postfix notation (abcef/*+d+g*+) and then refer to the answer to this question (http://stackoverflow.com/questions/423898/postfix-notation-to-expression-tree) for converting the postfix expression to a tree.
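In outline, that conversion is a single left-to-right pass with a stack; a hedged Java sketch for single-character operands:

import java.util.ArrayDeque;
import java.util.Deque;

// Builds a tree from postfix input such as "abcef/*+d+g*+": operands are pushed,
// and each operator pops two subtrees and pushes a new node combining them.
class PostfixBuilder {
    record Node(String value, Node left, Node right) {}

    static Node fromPostfix(String postfix) {
        Deque<Node> stack = new ArrayDeque<>();
        for (char c : postfix.toCharArray()) {
            if ("+-*/".indexOf(c) >= 0) {
                Node right = stack.pop();               // popped in reverse order
                Node left = stack.pop();
                stack.push(new Node(String.valueOf(c), left, right));
            } else {
                stack.push(new Node(String.valueOf(c), null, null));
            }
        }
        return stack.pop();                             // root of the tree
    }
}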
This is what the interviewer expected :)
Whenever you intend to write a parser, the main question to ask is whether you want to do it manually or use a parser generator framework.
In this case, I would say that it's a good exercise to write it all yourself.
Start with a good representation for the tree itself. This will be the output of your algorithm. For example, it could be a collection of objects, where one kind of object represents a "label" like a, b, and c in your example. Others could represent numbers. You could then define a representation of operators, for example + as a binary operator, which would have two subobjects representing the left and right subexpressions.
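For instance, such a representation could be shaped like this in Java (all names illustrative):

// A label such as a, b, c; a number literal; and a binary operator with
// left and right subexpressions.
sealed interface Expr permits Label, Num, BinOp {}
record Label(String name) implements Expr {}
record Num(double value) implements Expr {}
record BinOp(char op, Expr left, Expr right) implements Expr {}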
The next step is the actual parser; I would suggest a classical recursive descent parser. One text that describes this and provides a standard pseudo-code implementation is this one by Theodore Norvell.
I'd start with a simple grammar, something like those used by ANTLR and JavaCC.