Is it possible to extract function calls from C source files, e.g.,
...
myfunc(1);
...
or
...
myfunc(anotherfunc(1, 2));
....
using only Ruby regular expressions? If not, would a parser generator such as ANTLR be useful?
This is not a foolproof pattern for finding all function calls, but it should cover the patterns you are interested in.
[a-zA-Z\s]*\([a-zA-Z0-9]*(\([a-zA-Z0-9\s]*[\s,]*[\sa-zA-Z0-9]*\))?\);
This regex matches the following call patterns:
1. myfunc(another(one,two));
2. myfunc();
3. myfunc(another());
4. myfunc(oneArg);
You can also reuse the regular expressions, already derived from the C grammar, that Emacs tools such as imenu, etags, ecb, and c-mode use.
In the purest sense you can't, because the possibility to nest function calls recursively makes it a non-regular language. That is, you cannot write a regular expression that matches an arbitrary function call and extracts all of the contained function names.
But of course you could search incrementally for sequences of characters allowed in function names (i.e., starting with a letter or underscore, followed by letters, underscores, digits, etc.) followed by a left parenthesis, or something along those lines.
Keep in mind, however, that any such approach is prone to errors: what if a function is referenced in a comment? What if it appears inside a string constant? Really, to catch all the special cases you would have to (almost) properly parse the full C file.
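The incremental search described above can be sketched as follows. This is only an illustration in Python (the same regular expressions should carry over to Ruby with little change). The names find_calls, NOISE, CALL and KEYWORDS are made up for this example, and the comment and string handling is deliberately crude: it ignores the preprocessor, and function definitions and declarations are reported just like calls.

```python
import re

# C keywords that can be followed by "(" without being function calls.
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}

# Comments and string/character literals, blanked out before searching.
NOISE = re.compile(
    r"""
    /\*.*?\*/            # block comment
  | //[^\n]*             # line comment
  | "(?:\\.|[^"\\])*"    # string literal
  | '(?:\\.|[^'\\])*'    # character literal
    """,
    re.DOTALL | re.VERBOSE,
)

# An identifier followed, after optional whitespace, by a left parenthesis.
CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def find_calls(source):
    """Return candidate function names in order of appearance."""
    cleaned = NOISE.sub(" ", source)
    return [name for name in CALL.findall(cleaned) if name not in KEYWORDS]
```

Because each identifier is matched on its own, nested calls such as myfunc(anotherfunc(1, 2)) yield both names without the pattern having to balance parentheses.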
Most modern regular expression engines have features to parse more than regular languages e.g. by means of back-references to subexpressions. But you shouldn't go down that road. With a proper parser such as ANTLR that can parse context-free languages you'll make your own life a lot easier.
I'm writing a grammar to recognise simple mathematical expressions. I have it working for English.
Now I want to expand the grammar to support i18n. Therefore, the digits, radix separator and so forth depend on the user's locale.
What is the best way to do this in ANTLR?
What I'm currently considering is something like this:
lexer grammar ExpressionLexer;
options {
superClass = AbstractLexer;
}
DIGIT: . {isDigit(getText())}?;
// ... and so on for other tokens ...
abstract class AbstractLexer(input: CharStream, symbols: Symbols) extends Lexer(input) {
fun isDigit(codePoint: Int): Boolean = symbols.isDigit(codePoint)
// ... and so on for other tokens ...
}
Alternative approaches I am considering:
(b) I gather every possible digit and every possible separator in every possible locale, and jam all of those into the one grammar, and then check isDigit after that.
(c) I make a different lexer for every single numbering system and somehow align them all to emit the same token types in the same order, so they can be swapped in and out (sounds like it might be the most pure and correct solution? but not the most enjoyable.)
(And on a side tangent, how do people in European countries which use comma for the decimal separator deal with writing function calls with more than one parameter?)
I recommend doing that in two steps:
Parse the main language structure (e.g. (digits+ separator)+), regardless of what a digit or a separator is.
Do a semantic check against the user's locale to verify that the digits that were given actually match what is allowed. Do the same for the separator.
This way you don't need all kinds of hacks, extra platform code and so on.
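A rough sketch of the two steps, written in Python rather than as an ANTLR grammar. The Symbols class, the parse_number function and the list of candidate separators are assumptions made for this illustration; they are not part of ANTLR or of the question's code.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbols:
    """Hypothetical locale data: the zero digit and the decimal separator."""
    zero_digit: str
    decimal_separator: str


# Step 1: structure only. \d matches any Unicode decimal digit in Python 3;
# the separator class lists a few candidate characters (assumed set).
NUMBER = re.compile(r"(\d+)(?:([.,\u066B])(\d+))?")


def to_ascii(digits):
    """Map any Unicode decimal digits to '0'..'9'."""
    return "".join(str(unicodedata.decimal(ch)) for ch in digits)


def parse_number(text, symbols):
    match = NUMBER.fullmatch(text)
    if match is None:
        raise ValueError("not a number: %r" % text)
    whole, separator, fraction = match.groups()
    fraction = fraction or ""

    # Step 2: semantic checks against the user's locale.
    zero = ord(symbols.zero_digit)
    for ch in whole + fraction:
        # Decimal digit blocks are contiguous, so subtracting the digit's
        # value gives the code point of that block's zero.
        if ord(ch) - unicodedata.decimal(ch) != zero:
            raise ValueError("digit %r is not valid in this locale" % ch)
    if separator is not None and separator != symbols.decimal_separator:
        raise ValueError("separator %r is not valid in this locale" % separator)

    return float(to_ascii(whole) + "." + (to_ascii(fraction) or "0"))
```

With Symbols("0", ",") the input "3,14" is accepted and "3.14" is rejected, while the structural pattern itself never changes between locales.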
For your side question: programming usually uses the English language, including the number format. In strings you can use any format you want, but that doesn't affect the surrounding code.
Note that from ANTLR v4.7 onwards, more is possible with respect to Unicode inside ANTLR's lexer grammars: https://github.com/antlr/antlr4/blob/master/doc/unicode.md
So you could define a lexer rule like this:
DIGIT
: [\p{Digit}]
;
which will match both ٣ and 3.
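For readers who want to check which characters fall into that class without running ANTLR, here is a small Python sketch. It assumes that the property used in the rule above corresponds to the Unicode general category Nd (decimal number); the helper name is_decimal_digit is made up for this example.

```python
import unicodedata


def is_decimal_digit(ch):
    """True if ch is in Unicode general category Nd (decimal number)."""
    return unicodedata.category(ch) == "Nd"


# U+0663 is ARABIC-INDIC DIGIT THREE.
samples = ["3", "\u0663", "a", ","]
results = [is_decimal_digit(ch) for ch in samples]
```

Note that such a rule accepts digits from every script at once, so a check against the user's locale is still needed if mixed scripts should be rejected.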
I'm trying to write a lexical analyzer for Scheme, and when I run JFlex on the lexer.flex file I get an error like this one:
Reading "lexer.flex"
Macro definition contains a cycle.
1 error, 0 warnings.
The macro it's referring to is this one:
definition = {variable_definition}
| {syntax_definition}
| \(begin {definition}*\)
| \(let-syntax \({syntax_binding}*\){definition}*\)
| \(letrec-syntax \({syntax_binding}*\){definition}*\)
All of the macros used here have been defined, but for some reason I can't get rid of this error and I don't know why it's happening.
A lex/flex/JFlex style "definition" is a macro expansion, as that error message indicates. Recursive macro expansions are impossible, since macro expansion is not conditional; trying to expand
definition = ... \(begin {definition}*\) ...
will result in an infinitely long regular expression.
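To see why, here is a minimal Python sketch of textual macro expansion with cycle detection. It is not JFlex's actual implementation; the expand function and the dictionary-based macro table are assumptions for illustration only.

```python
import re

# A reference to another macro, written {name} as in lex-style definitions.
REF = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand(name, macros, active=()):
    """Replace every {macro} reference in macros[name] by its expansion."""
    if name in active:
        chain = " -> ".join(active + (name,))
        raise ValueError("Macro definition contains a cycle: " + chain)
    body = macros[name]
    # Expansion is unconditional: every reference is replaced, always.
    return REF.sub(
        lambda m: "(" + expand(m.group(1), macros, active + (name,)) + ")",
        body,
    )
```

A non-recursive table such as {"digit": "[0-9]", "number": "{digit}+"} expands to a finite regular expression, whereas a definition that mentions itself can never finish expanding, so the only sensible reaction is to report the cycle.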
Do not mistake a lexical analyser for a general-purpose parser. A lexical analyser does no more than split an input into individual tokens (or "lexemes"), using regular expressions to identify each token. Tokens do not have structure (at least for the purposes of parsing); once a token is identified, it is a single indivisible object. If you find yourself writing lexical descriptions which match structured text, you have almost certainly pushed the lexical analysis beyond its limits.
Parsers use an algorithm which does allow recursive descriptions (but which has very limited forward lookahead) and which can create a recursive description of the input (such as a parse tree).
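The division of labour can be sketched in a few lines of Python. This is a toy reader for parenthesised lists, not a Scheme implementation; the function names tokenize and parse are made up for this example, and strings, comments and quoting are ignored.

```python
import re

# Lexical level: flat tokens only, described by a regular expression.
TOKEN = re.compile(r"\s*([()]|[^\s()]+)")


def tokenize(text):
    """Split the input into parentheses and atoms."""
    return TOKEN.findall(text)


def parse(tokens):
    """Syntactic level: build nested lists with a recursive function."""
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())  # recursion handles the nesting
            if pos >= len(tokens):
                raise SyntaxError("missing ')'")
            pos += 1  # consume ")"
            return items
        if token == ")":
            raise SyntaxError("unexpected ')'")
        return token

    tree = read()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree
```

The tokenizer knows nothing about begin or nesting; the recursive structure of a definition lives entirely in the parser.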
R7RS-small says that all identifiers must be terminated by a delimiter, but at the same time it defines pretty elaborate rules for what can be in an identifier. So, which one is it?
Is an identifier supposed to start with an initial character and then continue until a delimiter, or does it start with an initial character and continue following the syntax defined in 7.1.1?
Here are a couple of obvious cases. Are these valid identifiers?
a#a
b,b
c'c
d[d]
If they are not supposed to be valid, what is the purpose of saying that an identifier must be terminated by a delimiter?
In R7RS, vertical bars delimit a symbol, as in |..ident..|, so that it can contain any character that cannot appear in an old-style symbol (| is the delimiter).
However, in R6RS the "official" grammar was incorrect, as it did not allow symbols such as 1+, which led implementations to define their own rules to work around this flaw in the official grammar.
Unless you need to read the source code of a given implementation and see how it defines the symbols, you should not care too much about these rules and use classical symbols.
In section 7.1.1 you will find the Backus-Naur form that defines the lexical structure of R7RS identifiers, but I doubt that implementations follow it.
I quote from here
As with identifiers, different implementations of Scheme use slightly
different rules, but it is always the case that a sequence of
characters that contains no special characters and begins with a
character that cannot begin a number is taken to be a symbol
In other words, an implementation will use a function like read-atom and after that it will classify an atom by backtracking with read-number and if number? fails it will be a symbol.
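A minimal sketch of that strategy in Python. The names read_atom and classify are illustrative, the delimiter set follows the list given for delimiters in R7RS section 7.1.1 as I understand it, and the number pattern covers only plain decimal numbers, not Scheme's full numeric syntax.

```python
import re

# Whitespace, parentheses, double quote, semicolon and vertical line.
DELIMITERS = set(" \t\r\n()\";|")

# Simplified: optional sign, then a plain decimal number.
NUMBER = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)", re.ASCII)


def read_atom(text, start=0):
    """Read characters up to the next delimiter; return (atom, end index)."""
    end = start
    while end < len(text) and text[end] not in DELIMITERS:
        end += 1
    return text[start:end], end


def classify(atom):
    """Try the atom as a number first; anything else is a symbol."""
    return "number" if NUMBER.fullmatch(atom) else "symbol"
```

Under this scheme 1+ is read as one atom and, since it is not a number, becomes a symbol. An input such as a#a is also read as a single atom, whether or not the formal grammar accepts it as an identifier.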
Using Ruby I'd like to take a Regexp object (or a String representing a valid regex; your choice) and tokenize it so that I may manipulate certain parts.
Specifically, I'd like to take a regex/string like this:
regex = /var (\w+) = '([^']+)';/
parts = ["foo","bar"]
and create a replacement string that replaces each capture with a literal from the array:
"var foo = 'bar';"
A naïve regex-based approach to parsing the regex, such as:
i = -1
result = regex.source.gsub(/\([^)]+\)/){ parts[i+=1] }
…would fail for things like nested capture groups, or non-capturing groups, or a regex that had a parenthesis inside a character class. Hence my desire to properly break the regex into semantically-valid pieces.
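To make the difficulty concrete, here is a rough Python sketch of a hand-written scanner over the regex source that handles those three cases. It is not an existing library; fill_captures is a made-up name, and the sketch treats every group starting with (? as non-capturing (so named groups are not handled), leaves escapes and other regex syntax outside the captures untouched, and does not handle a ] placed first inside a character class.

```python
def fill_captures(source, parts):
    """Replace each top-level capture group in a regex source by a literal."""
    replacements = iter(parts)
    out = []
    groups = []       # stack: True for capturing groups, False otherwise
    capturing = 0     # number of capturing groups currently open
    in_class = False  # inside [...]
    i = 0
    while i < len(source):
        ch = source[i]
        step, emit = 1, ch
        if ch == "\\" and i + 1 < len(source):
            step, emit = 2, source[i:i + 2]  # escaped character: keep both
        elif in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            is_capture = not source.startswith("(?", i)
            groups.append(is_capture)
            if is_capture:
                if capturing == 0:
                    out.append(next(replacements))
                capturing += 1
                emit = ""
        elif ch == ")":
            if groups.pop():
                capturing -= 1
                emit = ""
        if capturing == 0:
            out.append(emit)  # text outside captures is copied through
        i += step
    return "".join(out)
```

Applied to the source of the example regex with ["foo", "bar"], this produces var foo = 'bar'; and a nested capture consumes a single replacement for the outermost group.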
Is there an existing Regex parser available for Ruby? Is there a (horror of horrors) known regex that cleanly matches regexes? Is there a gem I've not found?
The motivation for this question is a desire to find a clean and simple answer to this question.
You may want to look at a JavaScript project I have on GitHub called "Dynamic (?:Regex Highlighting)++ with Javascript!". It parses PCRE-compatible regular expressions written in both free-spacing and non-free-spacing modes. Since its own regexes are written in the less feature-rich JavaScript syntax, they could easily be converted to Ruby.
Note that regular expressions may contain arbitrarily nested parenthesis structures, and JavaScript has no recursive regex features, so the code must parse the tree of nested parentheses from the inside out. It's a bit tricky but works quite well. Be sure to try it out on the highlighter demo page, where you can input and dynamically highlight any regex. The JavaScript regular expressions used to parse regular expressions are documented here.
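The inside-out idea can be illustrated with a short Python sketch. This is not the project's actual code; groups_inside_out is a made-up name, the placeholder format is arbitrary, and escaped parentheses and character classes are ignored for brevity.

```python
import re

# A parenthesised group that contains no other parentheses.
INNERMOST = re.compile(r"\([^()]*\)")


def groups_inside_out(text):
    """Repeatedly peel off an innermost group, replacing it by <n>."""
    found = []
    while True:
        match = INNERMOST.search(text)
        if match is None:
            return found, text
        found.append(match.group())
        placeholder = "<%d>" % (len(found) - 1)
        text = text[:match.start()] + placeholder + text[match.end():]
```

Each pass only needs a non-recursive pattern, because by the time an outer group is matched its inner groups have already been replaced by placeholders.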
I was asked this question in an interview:
Write code to generate the parse tree, like compilers do internally, for any given expression. For example:
a+(b+c*(e/f)+d)*g
Start by defining the language. No one can implement a parser or a compiler for a language that isn't well defined. You give an example: 'a+(b+c*(e/f)+d)*g', which should trigger the following questions:
Is the language a single expression, or may there be multiple statements (separated by ';', maybe)?
What are the 'a', 'b', ... 'g' tokens? Are they variables? What is the syntax of variables? Are they C-like identifiers, or single alphanumeric characters as your example may imply?
There are 3 binary operators in your example ('+', '*' and '/'). Is that all? Does the language also support '-'? Does it support logical and bitwise operators?
Does the language support number literals? Integers only? Doubles? Does the language support string literals? How are string literals quoted?
What is the syntax for comments?
Which operators have precedence? Does the '*' operator have precedence over '+', as in the example? Are operands evaluated right to left or left to right?
Any Pre-processing?
Once you are equipped with a good definition of the language syntax, start by implementing a tokenizer. A tokenizer takes a stream of characters and generates a list of tokens. In the example above, each character is a token, but in var**12 (var to the power 12) there are 3 tokens: 'var', '**' and '12'. If regular expressions are permitted, you can probably do this part of the parsing with them.
Next, write a function that identifies each token by type: is it an operator, a variable, a number literal, a string literal, etc.? Package it all in a method called NextToken that returns a token and its type.
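A possible shape for the tokenizer and the token classification, sketched in Python with regular expressions. The token type names and the set of operators are assumptions, since the language has not been defined; calling next() on the generator plays the role of NextToken.

```python
import re

# Token types and patterns; earlier entries win when several could match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"\*\*|[-+*/]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokens(text):
    """Yield (type, text) pairs for each token in the input."""
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError("unexpected character %r" % match.group())
        yield kind, match.group()
```

For the input var**12 this yields a NAME, an OP and a NUMBER, and for the interview expression every character becomes its own token.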
Finally, start parsing. In your sample above, the root of the parse tree will be a node with the '+' operator (which has lower precedence than '*' and is therefore applied last). The left child is the variable token 'a', and the right child is a subtree whose root element is the '*' token. Work recursively.
A simple way out is to convert your expression into postfix notation (abcef/*+d+g*+) and then refer to the answer to this question (http://stackoverflow.com/questions/423898/postfix-notation-to-expression-tree) for converting the postfix expression to a tree.
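Both steps can be sketched in Python. The conversion below uses the shunting-yard algorithm, which is one common way to produce postfix and not necessarily what the linked answer uses. It assumes single-character operands, the four left-associative operators shown, and the usual precedence; trees are plain (operator, left, right) tuples.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def to_postfix(expr):
    """Convert an infix expression to postfix (shunting-yard)."""
    output, stack = [], []
    for ch in expr.replace(" ", ""):
        if ch.isalnum():
            output.append(ch)
        elif ch in PRECEDENCE:
            # Pop operators that bind at least as tightly (left-associative).
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[ch]):
                output.append(stack.pop())
            stack.append(ch)
        elif ch == "(":
            stack.append(ch)
        elif ch == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise SyntaxError("unbalanced ')'")
            stack.pop()  # discard the "("
        else:
            raise SyntaxError("unexpected character %r" % ch)
    while stack:
        op = stack.pop()
        if op == "(":
            raise SyntaxError("unbalanced '('")
        output.append(op)
    return "".join(output)


def postfix_to_tree(postfix):
    """Build (op, left, right) tuples from a postfix string."""
    stack = []
    for ch in postfix:
        if ch in PRECEDENCE:
            right = stack.pop()
            left = stack.pop()
            stack.append((ch, left, right))
        else:
            stack.append(ch)
    if len(stack) != 1:
        raise SyntaxError("malformed postfix expression")
    return stack[0]
```

The last operator in the postfix string becomes the root of the tree, which for the interview expression is the outer '+'.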
This is what the interviewer expected :)
Whenever you intend to write a parser, the main question to ask is whether you want to write it by hand or use a parser generator framework.
In this case, I would say that it's a good exercise to write it all yourself.
Start with a good representation for the tree itself. This will be the output of your algorithm. For example, this could be a collection of objects, where one kind of object could represent a "label" like a, b, and c in your example. Others could represent numbers. You could then define a representation of operators; for example, + is a binary operator, which would have two subobjects representing the left and right subexpressions.
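One possible representation, sketched with Python dataclasses. The class names Label, Number and BinaryOp and the show helper are choices made for this illustration only.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Label:
    """A named operand such as a, b or c."""
    name: str


@dataclass(frozen=True)
class Number:
    """A numeric literal."""
    value: float


@dataclass(frozen=True)
class BinaryOp:
    """A binary operator with its left and right subexpressions."""
    op: str
    left: "Node"
    right: "Node"


Node = Union[Label, Number, BinaryOp]


def show(node):
    """Render a tree as a fully parenthesised string."""
    if isinstance(node, BinaryOp):
        return "(%s %s %s)" % (show(node.left), node.op, show(node.right))
    if isinstance(node, Label):
        return node.name
    return str(node.value)


# The subexpression c*(e/f) from the example, built by hand.
example = BinaryOp("*", Label("c"), BinaryOp("/", Label("e"), Label("f")))
```

Keeping the nodes immutable makes it easy to compare trees in tests and to share subtrees safely.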
The next step is the actual parser; I would suggest a classical recursive descent parser. One text that describes this and provides a standard pseudo-code implementation is this text by Theodore Norvell.
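A compact sketch of such a parser in Python, with one function per precedence level. It is not taken from the text mentioned above; the grammar, the tuple-based tree and the function names are assumptions for this example, and unrecognised characters are silently skipped by the simple tokenizer.

```python
import re


def parse(text):
    """Parse +, -, *, / and parentheses into (op, left, right) tuples."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expression():
        # expression := term (('+' | '-') term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():
        # term := factor (('*' | '/') factor)*
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():
        # factor := NAME | NUMBER | '(' expression ')'
        token = peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            advance()
            node = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if token in ("+", "-", "*", "/", ")"):
            raise SyntaxError("unexpected token %r" % token)
        return advance()

    tree = expression()
    if peek() is not None:
        raise SyntaxError("unexpected token %r" % peek())
    return tree
```

Precedence falls out of the call structure: expression calls term, which calls factor, so '*' and '/' group more tightly than '+' and '-', and the loops make all four operators left-associative.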
I'd start with a simple grammar, something like those used by ANTLR and JavaCC.