Combined grammar ANTLR option filter = true

I have a combined grammar (lexer and parser in the same file). How do I set
filter = true
for the lexer?
Thanks

To clarify what Bart said in his answer:
The options{filter=true;} only works in lexer grammars. You will need to define a parser and lexer grammar in two separate files:
You don't actually need to put the lexer and parser in two separate files. You just need to put the filter option in the lexer grammar. Both the lexer and the parser grammars can be in the same file.
Though Bart's advice of putting them in separate files is the approach I normally use.
So, to take an example, you can do this in a single file.
Single file with parser and lexer, Grammar.g:
parser grammar FooParser;

parse
  : Number+ EOF
  ;

lexer grammar FooLexer;

options{filter=true;}

Number
  : '0'..'9'+
  ;
Note how the filter option is in the lexer grammar section, but the lexer and parser are in the same file.

You don't. The options{filter=true;} only works in lexer grammars. You will need to define a parser and a lexer grammar in two separate files (EDIT: no, they can be put in one file, see chollida's answer):
FooParser.g
parser grammar FooParser;

parse
  : Number+ EOF
  ;
FooLexer.g
lexer grammar FooLexer;

options{filter=true;}

Number
  : '0'..'9'+
  ;
Combining this lexer and parser results in a parser that only parses the digit strings in a file (or string): because of filter=true, the lexer silently drops any input that doesn't match a lexer rule.
Note that ANTLR plugins or ANTLRWorks might complain because the lexer rule(s) is/are not visible to it, but generating a lexer and parser on the command line:
java -cp antlr-3.2.jar org.antlr.Tool FooLexer.g
java -cp antlr-3.2.jar org.antlr.Tool FooParser.g
works fine.
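Once both classes are generated, wiring them up is the usual ANTLR 3 boilerplate. A minimal sketch, assuming the FooLexer/FooParser generated from the grammars above (the input string is just an example):

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws RecognitionException {
        // The filter lexer silently drops anything that isn't a Number.
        FooLexer lexer = new FooLexer(new ANTLRStringStream("12 34 56"));
        FooParser parser = new FooParser(new CommonTokenStream(lexer));
        parser.parse();
    }
}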

Related

Recursive use of regex's in TextMate JSON grammars

I'm trying to write a JSON grammar file for TextMate, to be used for VSCode highlighting.
Let's say an int is:
([0-9]+|[0-9]+e[0-9]+)
And now I want to use ints to define this new expression:
int:int
example: 33:5 or 2e4:9
or any combination of some ints for that matter. Do I have to define it like:
([0-9]+|[0-9]+e[0-9]+):([0-9]+|[0-9]+e[0-9]+)
Or is there a better way?! Can I use stuff like "begin/end" or "begin/while" with "include" to achieve this? There must be a better way than just repeating my regexes thousands of times for different variations of items.
I'm looking for a way to name a certain pattern and then use that pattern in combination with other regex stuff to create a new pattern, kind of like in a BNF grammar.
I tried using "begin" with (?=\d) and "end" with (?!\d), and then "include" #int as the pattern. It did not work...

Does antlr4 memoize tokens?

Let's say I have the following expression alternation:
expr
  : expr BitwiseAnd expr
  | expr BitwiseXor expr
  // ...
  ;
Just for argument's sake, let's say that the expr on the left-hand side turns out to be 1MB. Will ANTLR be able to 'save' that expression so it doesn't have to start from zero on each alternation, or how far does it have to backtrack when it fails to match an alternation?
ANTLR will recognize the first expr, and then, if it doesn't find a BitwiseAnd, it will look for a BitwiseXor to try to match the second alternative. It won't backtrack all the way to trying to recognize the first expr again. It's not exactly memoization, but you get the same benefit (arguably even better).
You may find it useful to have ANTLR generate the ATN for your grammar. Use the -atn option when running the antlr4 command; this will generate *.dot files for each of your rules (both lexer and parser). You can then use Graphviz to render them to SVG, PDF, etc. They may look a bit intimidating at first glance, but just take a moment with them and you'll get a LOT of insight into how ANTLR goes about parsing your input.
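For example (the grammar name Expr.g4, the jar version, and the exact name of the generated .dot file are assumptions; ANTLR names the files after your rules):

java -cp antlr-4.13.1-complete.jar org.antlr.v4.Tool -atn Expr.g4
dot -Tsvg expr.dot -o expr.svg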
The second place to look is the generated parser code. It, too, is much more understandable than you might expect (especially if you read it with the ATN graph handy).

What do I do in ANTLR if I want to parse something which is extremely configurable?

I'm writing a grammar to recognise simple mathematical expressions. I have it working for English.
Now I want to expand the grammar to support i18n. Therefore, the digits, radix separator and so forth depend on the user's locale.
What is the best way to do this in ANTLR?
What I'm currently considering is something like this:
lexer grammar ExpressionLexer;
options {
  superClass = AbstractLexer;
}
DIGIT: . {isDigit(getText())}?;
// ... and so on for other tokens ...
abstract class AbstractLexer(input: CharStream, private val symbols: Symbols) : Lexer(input) {
  fun isDigit(codePoint: Int): Boolean = symbols.isDigit(codePoint)
  // ... and so on for other tokens ...
}
Alternative approaches I am considering:
(b) I gather every possible digit and every possible separator in every possible locale, and jam all of those into the one grammar, and then check isDigit after that.
(c) I make a different lexer for every single numbering system and somehow align them all to emit the same token types in the same order, so they can be swapped in and out (sounds like it might be the most pure and correct solution? but not the most enjoyable.)
(And on a side tangent, how do people in European countries which use comma for the decimal separator deal with writing function calls with more than one parameter?)
I recommend doing that in two steps:
Parse the main language structure (e.g. (digits+ separator)+), regardless of what a digit or a separator is.
Do a semantic check against the user's locale to see if the digits that were given actually match what's allowed. Same for the separator.
This way you don't need to do all kinds of hacks, add platform code and so on.
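A minimal sketch of that second step in Java (the method name is made up, and it assumes, for simplicity, that a locale's digits form a contiguous range starting at its zero digit; a production check would consult the locale's full numbering system):

import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class LocaleNumberCheck {
    // After parsing (digits+ separator)+ generically, verify that every
    // character really is a digit or the separator of this locale.
    static boolean matchesLocale(String tokenText, Locale locale) {
        DecimalFormatSymbols sym = DecimalFormatSymbols.getInstance(locale);
        char zero = sym.getZeroDigit();        // e.g. '0', or U+0660 for Arabic-Indic
        char sep = sym.getDecimalSeparator();  // e.g. '.' or ','
        for (char c : tokenText.toCharArray()) {
            boolean localeDigit = c >= zero && c <= zero + 9;
            if (!localeDigit && c != sep) {
                return false;
            }
        }
        return true;
    }
}

With this, matchesLocale("3,14", Locale.GERMANY) passes while matchesLocale("3.14", Locale.GERMANY) fails, which is exactly the semantic check described above.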
For your side question: programming usually uses the English language, including the number format. In strings you can use any format you want, but that doesn't affect the surrounding code.
Note that since ANTLR v4.7, more is possible with respect to Unicode inside ANTLR's lexer grammars: https://github.com/antlr/antlr4/blob/master/doc/unicode.md
So you could define a lexer rule like this:
DIGIT
  : [\p{Digit}]
  ;
which will match both ٣ and 3.

how to test my Grammar antlr4 successfully? [duplicate]

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly, if I change TITLE to TITLE: 'x' ; it still fails, this time giving an error message saying "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace the usage of TITLE in test with FILEPATH, the whole thing works (although FILEPATH will match more than I am looking to match, so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
Language processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser; the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every match that is matched by TITLE will also be matched by FILEPATH, and FILEPATH is defined before TITLE: so each token that you expect to be a TITLE will actually be a FILEPATH.
There are two hints for that:
keep your lexer rules disjoint (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them in the right order (in your case, this will be sufficient; see the sketch after this list).
if you need a parser-driven lexer, you have to change to another parser generator: PEG parsers or GLR parsers will do that (but of course this can cause other problems).
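A minimal sketch of that reordering, using the grammar from the question unchanged except for the rule order:

grammar output;

test: FILEPATH NEWLINE TITLE ;

TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;

Now c:\test.txt is still lexed as FILEPATH (it is the longer match), while the lone x is lexed as TITLE, because when two rules match the same maximum-length input, the first defined rule wins.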
This was not directly the OP's problem, but for those who get the same error message, here is something you could check.
I had the same vague mismatched input 'x' expecting 'x' error message when I introduced a new keyword. The reason was that I had placed the new keyword after my VARNAME lexer rule, which tokenized it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.
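A minimal sketch of that fix (the rule names and the keyword are hypothetical):

// Wrong: VARNAME is defined first and also matches 'loop', so the
// input 'loop' is never tokenized as LOOP_KW.
// VARNAME : [a-zA-Z_] [a-zA-Z0-9_]* ;
// LOOP_KW : 'loop' ;

// Right: keywords first, the identifier rule after them.
LOOP_KW : 'loop' ;
VARNAME : [a-zA-Z_] [a-zA-Z0-9_]* ;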

Extract function names from function calls in C files

Is it possible to extract function calls in C source files, e.g.,
...
myfunc(1);
...
or
...
myfunc(anotherfunc(1, 2));
....
by just using a Ruby regular expression? If not, would a parser generator such as ANTLR be useful?
This is not a fool-proof pattern for finding method calls, but it should serve the pattern that you are interested in.
[a-zA-Z\s]*\([a-zA-Z0-9]*(\([a-zA-Z0-9\s]*[\s,]*[\sa-zA-Z0-9]*\))?\);
This regex will match the following method call patterns.
1. myfunc(another(one,two));
2. myfunc();
3. myfunc(another());
4. myfunc(oneArg);
You can also reuse the regular expressions for C that are already used by Emacs: imenu, etags, ecb, c-mode, etc.
In the purest sense you can't, because the possibility of nesting function calls recursively makes this a non-regular language. That is, you cannot write a regular expression that matches an arbitrary function call and extracts all of the contained function names.
But of course you could search incrementally for sequences of characters allowed in function names (i.e., starting with a letter or underscore, followed by letters, digits, underscores, etc.) followed by a left parenthesis, or something along those lines; see the sketch below.
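A minimal sketch of that incremental search in Java (the regex is an assumption, and the caveats in the next paragraph apply in full):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CallFinder {
    // An identifier (letter or underscore, then letters/digits/underscores)
    // followed by optional whitespace and an opening parenthesis.
    private static final Pattern CALL =
        Pattern.compile("\\b([A-Za-z_][A-Za-z0-9_]*)\\s*\\(");

    public static void main(String[] args) {
        String source = "myfunc(anotherfunc(1, 2));";
        Matcher m = CALL.matcher(source);
        while (m.find()) {
            // Prints myfunc, then anotherfunc. Note that it would also
            // match keywords such as 'if (' or 'while (' in real code.
            System.out.println(m.group(1));
        }
    }
}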
Keep in mind, however, that any such approach is prone to errors: what if a function is referenced in a comment? What if it appears inside a string constant? Really, to catch all the special cases you would have to (almost) properly parse the full C file.
Most modern regular expression engines have features to parse more than regular languages, e.g. by means of back-references to subexpressions. But you shouldn't go down that road. With a proper parser generator such as ANTLR, which can handle context-free languages, you'll make your own life a lot easier.
