In my g4 file, I have defined an integer like so:
INT: '0'
| '-'? [1-9] [0-9_]*
;
// no leading zeros are allowed!
A parser rule uses this like so:
versionDecl: PACK_VERSION_DECL INT;
However, when ANTLR comes across one, it doesn't recognise it, and throws a NullPointerException if I run ctx.INT().getText():
@Override
public void exitVersionDecl(VersionDeclContext ctx) {
System.out.println(ctx.INT().getText());
}
Log:
line 1:13 mismatched input '6' expecting INT
[...]
java.lang.NullPointerException
at com.blockypenguin.mcfs.MCFSCustomListener.exitVersionDecl(MCFSCustomListener.java:16)
at main.antlr.MCFSParser$VersionDeclContext.exitRule(MCFSParser.java:604)
at org.antlr.v4.runtime.tree.ParseTreeWalker.exitRule(ParseTreeWalker.java:47)
at org.antlr.v4.runtime.tree.ParseTreeWalker.walk(ParseTreeWalker.java:30)
at org.antlr.v4.runtime.tree.ParseTreeWalker.walk(ParseTreeWalker.java:28)
at org.antlr.v4.runtime.tree.ParseTreeWalker.walk(ParseTreeWalker.java:28)
at com.blockypenguin.mcfs.Main.main(Main.java:40)
(Unrelated output omitted for brevity)
And finally, the input I am parsing:
pack_version 6
Why does ANTLR not recognise the integer? Any help appreciated, thank you :)
...
INT: '0'
| '-'? [1-9] [0-9_]*
;
// no leading zeros are allowed!
...
line 1:13 mismatched input '6' expecting INT
This error indicates that for the input 6, the lexer rule INT was not matched. This can happen if a lexer rule defined before the INT rule also matches 6. For example:
DIGIT
: [0-9]
;
...
INT
: '0'
| '-'? [1-9] [0-9_]*
;
Now the input "6" (or any single digit) will be matched as a DIGIT token. Even if you have this in the parser part of your grammar:
parse
: INT
;
the input "6" will still be tokenised as a DIGIT token: the lexer is not "driven" by the parser, it operates on its own, following 2 rules:
1. try to match as many characters as possible for a single lexer rule
2. in case 2 or more lexer rules match the same number of characters, let the rule defined first "win"
So, the input "12" will be tokenised as an INT token (rule 1 applies here), and input "0" is tokenised as a DIGIT token (rule 2).
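The two lexer rules described above (longest match first, then earliest definition on a tie) can be sketched outside ANTLR. The following is only a rough Python simulation, not ANTLR's actual implementation: the RULES table and the tokenize function are hypothetical names, and each lexer rule is approximated with a regular expression.

```python
import re

# Hypothetical rule table mirroring the example grammar; order matters.
RULES = [
    ("DIGIT", re.compile(r"[0-9]")),
    ("INT", re.compile(r"0|-?[1-9][0-9_]*")),
]

def tokenize(text, rules=RULES):
    """Tokenize text with longest-match, first-rule-wins semantics."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strictly greater: on a tie, the rule defined earlier keeps the win.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("6"))   # [('DIGIT', '6')]
print(tokenize("12"))  # [('INT', '12')]
```

Passing the rules in reverse order, for example tokenize("6", list(reversed(RULES))), makes INT win the tie, which corresponds to moving the INT rule above DIGIT in the grammar.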
I can't understand what these round brackets mean.
It's not necessary to write them, but sometimes they can produce a left-recursion error. Where should we use them in grammar rules?
It's not necessary to write them,
That is correct, it is not necessary. Just remove them.
but sometimes they can produce a left-recursion error.
If that really is the case, you can open an issue here: https://github.com/antlr/antlr4/issues
EDIT
Seeing kaby76's comment, just to make sure: you cannot just remove them from a grammar file regardless. They can be removed from your example rule.
When used like this:
rule
: ID '=' ( NUMBER | STRING ) // match either `ID '=' NUMBER`
// or `ID '=' STRING`
;
they cannot be removed because removing them would result in:
rule
: ID '=' NUMBER | STRING // match either `ID '=' NUMBER`
// or `STRING`
;
Or with repetition, where the parentheses change what is repeated:
rule
: ( ID STRING )+ // match: `ID STRING ID STRING ID STRING ...`
;
and this:
rule
: ID STRING+ // match: `ID STRING STRING STRING ...`
;
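As a loose analogy only, the same grouping effect shows up in regular expressions, which work on characters rather than on ANTLR tokens. In this sketch the literal strings ID, NUMBER, STRING, A and B merely stand in for token names, and accepts is a hypothetical helper.

```python
import re

# Alternation: parentheses limit how far the `|` reaches.
grouped = re.compile(r"ID=(?:NUMBER|STRING)")  # ID= followed by NUMBER or STRING
ungrouped = re.compile(r"ID=NUMBER|STRING")    # ID=NUMBER, or STRING on its own

# Repetition: parentheses decide what the `+` repeats.
repeat_pair = re.compile(r"(?:AB)+")  # AB AB AB ...
repeat_last = re.compile(r"AB+")      # A B B B ...

def accepts(pattern, text):
    """Return True when the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None

print(accepts(grouped, "ID=STRING"))    # True
print(accepts(ungrouped, "ID=STRING"))  # False
print(accepts(repeat_pair, "ABAB"))     # True
print(accepts(repeat_last, "ABAB"))     # False
```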
I want to parse template strings:
`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`
Here is my grammar:
varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)* ')' ;
WS : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;
When the input for the grammar is parsed, the template string has no whitespaces anymore because of the WS -> skip. When I put the TemplateStringLiteral before WS, I get the error:
extraneous input ' ' expecting {'`'}
How can I allow whitespaces to be parsed and not skipped only inside the template string?
What is currently happening
When testing your example against your current grammar displaying the generated tokens, the lexer gives this:
[@0,0:0='`',<'`'>,1:0]
[@1,1:4='Some',<VAR>,1:1]
[@2,6:9='text',<VAR>,1:6]
[@3,11:12='${',<'${'>,1:11]
[@4,13:20='variable',<VAR>,1:13]
[@5,21:21='.',<'.'>,1:21]
[@6,22:25='name',<VAR>,1:22]
[@7,26:26='}',<'}'>,1:26]
... shortened ...
[@26,85:84='<EOF>',<EOF>,2:0]
This tells you that Some, which you intended to be matched by TemplateStringLiteral*, was actually lexed as VAR. Why is this happening?
As mentioned in this answer, ANTLR uses the longest possible match to create a token. Since your TemplateStringLiteral rule only matches single characters, but your VAR rule matches arbitrarily many, the lexer uses the latter to match Some.
What you could try (Spoiler: won't work)
You could try to modify the rule like this:
TemplateStringLiteral: ('\\`' | ~'`')+ ;
so that it captures more than one character and is therefore preferred. There are two reasons why this does not work:
How would the lexer ever match anything with the VAR rule?
The TemplateStringLiteral rule now also matches ${, which prevents the start of a template chunk from being recognised correctly.
How to achieve what you actually want
There might be another solution, but this one works:
File MartinCup.g4:
parser grammar MartinCup;
options { tokenVocab=MartinCupLexer; }
templateString
: BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
;
template
: TemplateStart variable TemplateEnd
;
variable
: varname funParameter? (Dot variable)*
;
varname
: VAR
;
funParameter
: OpenPar variable? (Comma variable)* ClosedPar
;
File MartinCupLexer.g4:
lexer grammar MartinCupLexer;
BackTick : '`' ;
TemplateStart
: '${' -> pushMode(templateMode)
;
TemplateStringLiteral
: '\\`'
| ~'`'
;
mode templateMode;
VAR
: [$]?[a-zA-Z0-9_]+
| [$]
;
OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;
TemplateEnd
: '}' -> popMode;
This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and only stays active until } is read. It thereby does not catch non-template text like Some.
Notice that the use of lexer modes requires a split grammar (separate files for parser and lexer grammars). Since no lexer rules are allowed in a parser grammar, I had to introduce tokens for the parentheses, comma, dot and backticks.
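To illustrate how the mode stack changes which rules are active, here is a rough Python simulation of the lexer grammar above. It is a sketch under assumptions, not ANTLR's implementation: the MODES table, the "push:"/"pop" action strings and the tokenize function are hypothetical, and each lexer rule is approximated by a regular expression.

```python
import re

# Hypothetical per-mode rule tables, loosely mirroring MartinCupLexer.g4.
# Each entry is (token name, pattern, action); the action is
# "push:<mode>", "pop", or None.
MODES = {
    "default": [
        ("BackTick", re.compile(r"`"), None),
        ("TemplateStart", re.compile(r"\$\{"), "push:template"),
        ("TemplateStringLiteral", re.compile(r"\\`|[^`]"), None),
    ],
    "template": [
        ("VAR", re.compile(r"\$?[a-zA-Z0-9_]+|\$"), None),
        ("OpenPar", re.compile(r"\("), None),
        ("ClosedPar", re.compile(r"\)"), None),
        ("Comma", re.compile(r","), None),
        ("Dot", re.compile(r"\."), None),
        ("TemplateEnd", re.compile(r"\}"), "pop"),
    ],
}

def tokenize(text):
    """Longest-match lexing where only the rules of the current mode apply."""
    stack = ["default"]
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            # Longer match wins; on a tie the earlier rule is kept.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group(), action)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, lexeme, action = best
        tokens.append((name, lexeme))
        pos += len(lexeme)
        if action == "pop":
            stack.pop()
        elif action is not None:
            stack.append(action.split(":", 1)[1])
    return tokens

for token in tokenize("`Some ${name}`"):
    print(token)
```

In this simulation the characters of Some and the following space come out as single-character TemplateStringLiteral tokens, and VAR is only produced between ${ and }. Whitespace inside the braces raises an error here, since the sketch has no WS rule in the template mode.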
About the whitespaces
I assume you want to keep whitespaces inside the "normal text", but not allow whitespace inside the templates. Therefore I simply removed the WS rule. You can always re-add it if you like.
I tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:
line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}
The reason for this is the same as above: Some is lexed as VAR.
I have these lexer rules in my ANTLR3 grammar:
INTEGER: DIGITS;
FLOAT: DIGITS? DOT_SYMBOL DIGITS ('E' (MINUS_OPERATOR | PLUS_OPERATOR)? DIGITS)?;
HEXNUMBER: '0X' HEXDIGIT+;
HEXSTRING: 'X' '\'' HEXDIGIT+ '\'';
BITNUMBER: '0B' ('0' | '1')+;
BITSTRING: 'B' '\'' ('0' | '1')+ '\'';
NCHAR_TEXT: 'N' SINGLE_QUOTED_TEXT;
IDENTIFIER: LETTER_WHEN_UNQUOTED+;
fragment LETTER_WHEN_UNQUOTED:
'0'..'9'
| 'A'..'Z' // Only upper case, as we use a case insensitive parser (insensitive only for ASCII).
| '$'
| '_'
| '\u0080'..'\uffff'
;
and
qualified_identifier:
IDENTIFIER ( options { greedy = true; }: DOT_SYMBOL IDENTIFIER)?
;
This works mostly fine except for very specific situations like the input t1.1_d which is supposed to be parsed as 2 identifiers connected with a dot. What happens is that .1 is matched as float even though it's followed by underscore and letter(s).
It's clear where that comes from: LETTER_WHEN_UNQUOTED includes digits, so '1' can be both an integer and an identifier. But the rule order should resolve this to an integer, as intended (and usually does).
However, I'm perplexed the t1.1_d input causes the float rule to kick in and would appreciate some pointers to resolve this problem. As soon as I add a space after the dot all is fine, but that is obviously not a real solution.
When I move the IDENTIFIER rule before the others I get new trouble because several other rules can no longer be matched then. Moving the FLOAT rule after the IDENTIFIER rule doesn't fix the problem either (but at least doesn't produce new problems). In this case we see the actual problem: the dot is always matched by the FLOAT rule if directly followed by a digit. What can I do to make it not match in my case?
The problem is that the lexer operates independently of the parser. When faced with the input string t1.1_d, the lexer will first consume an IDENTIFIER, leaving .1_d. You now want it to match DOT_SYMBOL, followed by IDENTIFIER. However, the lexer will always match the longest possible token, resulting in FLOAT matching .1.
Moving IDENTIFIER before FLOAT doesn't help, because '.' isn't a valid IDENTIFIER symbol and so IDENTIFIER can't match the input at all when it starts with '.'.
Note that Java and co. don't allow identifiers to start with numbers, probably to avert these kinds of problems.
One possible solution would be to change the FLOAT rule to require digits before the dot: FLOAT: DIGITS '.' DIGITS ...
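The effect of that change can be checked with a small longest-match simulation in Python. This is only a sketch: make_rules and tokenize are hypothetical helpers, the rules are approximated with regular expressions, lower-case letters are allowed in identifiers to stand in for the case-insensitive handling, and the non-ASCII range is left out.

```python
import re

EXPONENT = r"(?:E[-+]?[0-9]+)?"

def make_rules(float_pattern):
    """Build the rule table in grammar order for a given FLOAT pattern."""
    return [
        ("INTEGER", re.compile(r"[0-9]+")),
        ("FLOAT", re.compile(float_pattern + EXPONENT)),
        ("DOT_SYMBOL", re.compile(r"\.")),
        ("IDENTIFIER", re.compile(r"[0-9A-Za-z$_]+")),
    ]

ORIGINAL = make_rules(r"[0-9]*\.[0-9]+")  # digits before the dot are optional
FIXED = make_rules(r"[0-9]+\.[0-9]+")     # digits before the dot are required

def tokenize(text, rules):
    """Longest match wins; on a tie the rule defined first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("t1.1_d", ORIGINAL))
# [('IDENTIFIER', 't1'), ('FLOAT', '.1'), ('IDENTIFIER', '_d')]
print(tokenize("t1.1_d", FIXED))
# [('IDENTIFIER', 't1'), ('DOT_SYMBOL', '.'), ('IDENTIFIER', '1_d')]
```

The trade-off in this sketch is that an input such as .5 no longer lexes as a single FLOAT but as DOT_SYMBOL followed by INTEGER, so the parser would have to handle that form if it needs to stay valid.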
Is there a way to parse words that start with a specific character?
I've been trying the following but I couldn't get any promising results:
//This one is working it accepts AD CD and such
example1
:
.'D'
;
//This one is not, it expects character D, then any ws character then any character
example2
:
'D'.
;
//These two are not working either
example3
:
'D'.*
;
//Doesn't accept input due to error: "line 1:3 missing 'D' at '<EOF>'"
example4
:
.*'D'
;
//just in case my WS rule:
/** WhiteSpace Characters (HIDDEN)*/
WS : ( ' '
| '\t'
)+ {$channel=HIDDEN;}
;
I am using ANTLR 3.4
Thanks in advance
//This one is not, it expects character D, then any ws character then any character
example2
:
'D'.
;
No, it does not match the token (not character!) 'D' followed by a space and then any character. Since example2 is a parser rule, it does not match characters but tokens (there's a big difference!). And since you put spaces on a separate channel, the spaces are not matched by this rule either. Finally, the . (DOT) matches any token (again: not any character!).
More info on meta chars (like the . (DOT)) whose meaning differs between lexer and parser rules: Negating inside lexer- and parser rules
//These two are not working either
example3
:
'D'.*
;
//Doesn't accept input due to error: "line 1:3 missing 'D' at '<EOF>'"
example4
:
.*'D'
;
Unless you know exactly what you're doing, don't use .*: it gobbles up too much in your case (especially when placed at the start or end of a rule).
It looks like you're trying to tokenize things inside the parser (all your example rules are parser rules). As far as I can see, these should be lexer rules instead. For more on the difference between parser and lexer rules, see: Practical difference between parser rules and lexer rules in ANTLR?
So far I have only seen things like '<', but never 'abc' or "abc" in a yacc file.
a:
b '<' c;
Are the latter two valid at all?
'abc' is syntactically accepted as a character constant: when you write a character constant like this, the compiler/preprocessor simply drops the extra characters. Sometimes you get a "character constants must be one or two characters long" compile-time error in ANSI C. If your compiler does not report it, you can assume it has dropped the last 'c' from 'abc'.
So
char ch='abc' ; // is effectively equivalent to ch = 'ab'
but when the value is stored, only a single character such as ch='a' is kept (which one is implementation-defined in C). That's why 'abc' is syntactically correct but semantically wrong as a character. (I wrote about C because we use the c89 tool, i.e. POSIX C, for compiling yacc and lex output.)
Also, yylex() works on characters as its basic unit, not on strings (anything inside double quotes). So "abc" is not a valid character, nor even a character that can be matched against yylex()'s input.
(yylex() accepts a string of tokens,
for example "10+20",
with the grammar [[:digit:]]+ [-+*/%] [[:digit:]]+
and the characters 1, 0, +, 2, 0.
The tokens lex can identify by default, without a grammar being specified, are
10 as a number,
+ as a char, and
20 as a number again,
so the input matches the grammar given above.)
You can also specify a string in the rules section to match against, like
^"I am" which matches any input line starting with "I am"
"I am" which matches only input consisting of the string "I am"; it won't match "I am Swapnil # vikas.ghode#gmail.com"