ANTLR grammar: allow whitespace matching only in template strings

I want to parse template strings:
`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`
Here is my grammar:
varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)* ')' ;
WS : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;
When the input is parsed, the template string no longer contains any whitespace because of the WS -> skip rule. When I put TemplateStringLiteral before WS, I get the error:
extraneous input ' ' expecting {'`'}
How can I have whitespace parsed rather than skipped, but only inside the template string?

What is currently happening
Running your example through your current grammar and printing the generated tokens gives this:
[#0,0:0='`',<'`'>,1:0]
[#1,1:4='Some',<VAR>,1:1]
[#2,6:9='text',<VAR>,1:6]
[#3,11:12='${',<'${'>,1:11]
[#4,13:20='variable',<VAR>,1:13]
[#5,21:21='.',<'.'>,1:21]
[#6,22:25='name',<VAR>,1:22]
[#7,26:26='}',<'}'>,1:26]
... shortened ...
[#26,85:84='<EOF>',<EOF>,2:0]
This tells you that Some, which you intended to be matched by TemplateStringLiteral*, was actually lexed as VAR. Why is this happening?
As mentioned in this answer, ANTLR uses the longest possible match to create a token. Since your TemplateStringLiteral rule matches only a single character while your VAR rule matches arbitrarily many, the lexer uses the latter to match Some.
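The longest-match behavior can be reproduced outside ANTLR. The following is a minimal Python sketch, not ANTLR's actual implementation. It assumes that the implicit literal tokens ('`', '${', '}', '.') are defined before the named rules, approximates each rule with a regular expression, and picks the longest match at each position, with ties going to the rule defined first. The RULES table and the tokenize helper are illustrative names only.

```python
import re

# Rules in definition order. The literal tokens come first, which is an
# assumption about how a combined grammar orders its implicit tokens.
RULES = [
    ("'`'", r"`"),
    ("'${'", r"\$\{"),
    ("'}'", r"\}"),
    ("'.'", r"\."),
    ("WS", r"[ \t\r\n\f]+"),
    ("TemplateStringLiteral", r"\\`|[^`]"),
    ("VAR", r"[$]?[a-zA-Z0-9_]+|[$]"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Longest match wins; on equal length the earlier rule wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # Strictly greater, so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # WS -> skip
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for token in tokenize("`Some text ${variable.name}`"):
        print(token)
```

On a shortened version of your input, Some and text come out as VAR, and the whitespace between them is dropped by WS before TemplateStringLiteral ever gets a chance.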
What you could try (Spoiler: won't work)
You could try to modify the rule like this:
TemplateStringLiteral: ('\\`' | ~'`')+ ;
so that it captures more than one character and is therefore preferred. This does not work, for two reasons:
1. How would the lexer ever match anything to the VAR rule?
2. The TemplateStringLiteral rule now also matches ${, which prevents the start of a template chunk from being recognized correctly.
How to achieve what you actually want
There might be another solution, but this one works:
File MartinCup.g4:
parser grammar MartinCup;
options { tokenVocab=MartinCupLexer; }
templateString
    : BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
    ;
template
    : TemplateStart variable TemplateEnd
    ;
variable
    : varname funParameter? (Dot variable)*
    ;
varname
    : VAR
    ;
funParameter
    : OpenPar variable? (Comma variable)* ClosedPar
    ;
File MartinCupLexer.g4:
lexer grammar MartinCupLexer;
BackTick : '`' ;
TemplateStart
    : '${' -> pushMode(templateMode)
    ;
TemplateStringLiteral
    : '\\`'
    | ~'`'
    ;
mode templateMode;
VAR
    : [$]?[a-zA-Z0-9_]+
    | [$]
    ;
OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;
TemplateEnd
    : '}' -> popMode
    ;
This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and only stays active until } is read. It thereby does not catch non-template text like Some.
Notice that the use of lexer modes requires a split grammar (separate files for parser and lexer grammars). Since no lexer rules are allowed in a parser grammar, I had to introduce tokens for the parentheses, comma, dot and backticks.
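To see what the mode stack does to the token stream, here is a small Python sketch that imitates the lexer grammar above. It is an illustration under assumptions, not generated ANTLR code: each rule is approximated by a regular expression, and the MODES table, its action column and the tokenize helper are made-up names.

```python
import re

# Each mode lists (token name, regex, action). The action is the name of a
# mode to push, the string "POP", or None. Token and mode names follow the
# lexer grammar; the table layout itself is only an illustration.
MODES = {
    "DEFAULT": [
        ("BackTick", r"`", None),
        ("TemplateStart", r"\$\{", "templateMode"),
        ("TemplateStringLiteral", r"\\`|[^`]", None),
    ],
    "templateMode": [
        ("VAR", r"[$]?[a-zA-Z0-9_]+|[$]", None),
        ("OpenPar", r"\(", None),
        ("ClosedPar", r"\)", None),
        ("Comma", r",", None),
        ("Dot", r"\.", None),
        ("TemplateEnd", r"\}", "POP"),
    ],
}
COMPILED = {
    mode: [(name, re.compile(pattern), action) for name, pattern, action in rules]
    for mode, rules in MODES.items()
}


def tokenize(text):
    """Only the rules of the mode on top of the stack are tried."""
    stack = ["DEFAULT"]
    pos, tokens = 0, []
    while pos < len(text):
        best = None  # (name, length, action)
        for name, rx, action in COMPILED[stack[-1]]:
            m = rx.match(text, pos)
            if m and (best is None or len(m.group()) > best[1]):
                best = (name, len(m.group()), action)
        if best is None:
            raise ValueError(f"no rule in mode {stack[-1]} matches at offset {pos}")
        name, length, action = best
        tokens.append((name, text[pos:pos + length]))
        pos += length
        if action == "POP":
            stack.pop()
        elif action is not None:
            stack.append(action)
    return tokens


if __name__ == "__main__":
    for token in tokenize("`a b ${x.y} c`"):
        print(token)
```

Outside the braces every character, spaces included, becomes a TemplateStringLiteral; inside the braces only the templateMode rules apply, so a space there is rejected.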
About the whitespaces
I assume you want to keep whitespace inside the "normal text", but not allow whitespace inside the templates. Therefore I simply removed the WS rule. You can always re-add it if you like.
I tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:
line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}
The reason is the same as above: Some is lexed as VAR.

Related

Antlr weird parentheses syntax

I can't understand the meaning of these round brackets.
It's not necessary to write them, but sometimes it can produce a left-recursion error. Where should we use them in grammar rules?
It's not necessary to write them,
That is correct, it is not necessary. Just remove them.
but sometimes it can produce left-recursion error.
If that really is the case, you can open an issue here: https://github.com/antlr/antlr4/issues
EDIT
Seeing kaby76's comment, just to make sure: you cannot just remove them from a grammar file regardless. They can be removed from your example rule.
When used like this:
rule
: ID '=' ( NUMBER | STRING ) // match either `ID '=' NUMBER`
// or `ID '=' STRING`
;
they cannot be removed because removing them would result in:
rule
: ID '=' NUMBER | STRING // match either `ID '=' NUMBER`
// or `STRING`
;
Or with repetition:
rule
: ( ID STRING )+ // match: `ID STRING ID STRING ID STRING ...`
;
and this:
rule
: ID STRING+ // match: `ID STRING STRING STRING ...`
;
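The same grouping effect exists in regular expressions, which makes it easy to try out. The sketch below is only an analogy, not ANTLR: it assumes one letter per token (I for ID, N for NUMBER, S for STRING), and the accepts helper is a made-up name.

```python
import re

# One letter per token: I = ID, N = NUMBER, S = STRING.
grouped_alt = re.compile(r"I=(?:N|S)")   # ID '=' ( NUMBER | STRING )
ungrouped_alt = re.compile(r"I=N|S")     # ID '=' NUMBER | STRING
grouped_rep = re.compile(r"(?:IS)+")     # ( ID STRING )+
ungrouped_rep = re.compile(r"IS+")       # ID STRING+


def accepts(rx, tokens):
    """True if the whole token string is matched by the pattern."""
    return rx.fullmatch(tokens) is not None


if __name__ == "__main__":
    for name, rx in [("grouped_alt", grouped_alt), ("ungrouped_alt", ungrouped_alt)]:
        print(name, accepts(rx, "I=S"), accepts(rx, "S"))
    for name, rx in [("grouped_rep", grouped_rep), ("ungrouped_rep", ungrouped_rep)]:
        print(name, accepts(rx, "ISIS"), accepts(rx, "ISSS"))
```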

How to remove ambiguity from this syntax (antlr4)

I am writing a tool to generate sequence diagrams from text. I need to support these two syntaxes:
anInstance:AClass.DoSomething() and
participant A -> participant B: Any character except for \r\n (<>{}?)etc..
Let's call the first one the strict syntax and the second one the free syntax. In anInstance:AClass.DoSomething(), I need anInstance:AClass to be matched by to ( ID ':' ID ), as in the strict syntax. However, :AClass.DoSomething() is matched by CONTENT first. I am thinking of some kind of lookahead that checks whether -> is present, but I have not been able to figure it out.
Strict syntax
message
: to '.' signature
;
signature
: methodName '()'
;
to
: ID ':' ID
;
methodName
: ID
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
Free syntax
asyncMessage
: source '->' target content
;
source
: ID+
;
target
: ID+
;
content
: CONTENT
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
CONTENT
: ':' ~[\r\n]+
;
SPACE
: [ \t\r\n] -> channel(HIDDEN)
;
You need to understand how the ANTLR lexer works:
It uses whichever rule matches the longest part of the input (starting at the current position).
If multiple rules match the same input (i.e. the same length), the one defined first is used.
With your current lexer rules, CONTENT takes precedence whenever a : is followed by at least one more character on the same line, because it is the longer match, so ':' ID will never be matched.
With ANTLR 4, you should probably use modes in this case - when you encounter the : in the free form, switch to a "free" mode and define a lexer rule CONTENT to be only available in the "free" mode.
See this question for an idea about how ANTLR 4 lexer modes work.
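One possible mode layout can be tried out in a few lines of Python before writing the grammar. This is a sketch under assumptions, not ANTLR code: it assumes that seeing -> is what identifies the free form, so the hypothetical ARROW token pushes a hypothetical FREE mode, and CONTENT exists only there and pops the mode again. COLON, DOT and PARENS stand in for the literals ':', '.' and '()' of the strict syntax.

```python
import re

ID = r"[a-zA-Z_][a-zA-Z_0-9]*"

# (token name, regex, action); the action is a mode to push, "POP", or None.
MODES = {
    "DEFAULT": [
        ("ID", ID, None),
        ("ARROW", r"->", "FREE"),
        ("COLON", r":", None),
        ("DOT", r"\.", None),
        ("PARENS", r"\(\)", None),
        ("SPACE", r"[ \t\r\n]", None),
    ],
    "FREE": [
        ("ID", ID, None),
        ("SPACE", r"[ \t]", None),
        ("CONTENT", r":[^\r\n]+", "POP"),
    ],
}
COMPILED = {
    mode: [(name, re.compile(pattern), action) for name, pattern, action in rules]
    for mode, rules in MODES.items()
}


def tokenize(text):
    stack, pos, tokens = ["DEFAULT"], 0, []
    while pos < len(text):
        best = None  # (name, length, action); longest match, first rule on ties
        for name, rx, action in COMPILED[stack[-1]]:
            m = rx.match(text, pos)
            if m and (best is None or len(m.group()) > best[1]):
                best = (name, len(m.group()), action)
        if best is None:
            raise ValueError(f"no rule in mode {stack[-1]} matches at offset {pos}")
        name, length, action = best
        if name != "SPACE":  # SPACE goes to the hidden channel
            tokens.append((name, text[pos:pos + length]))
        pos += length
        if action == "POP":
            stack.pop()
        elif action is not None:
            stack.append(action)
    return tokens


if __name__ == "__main__":
    print(tokenize("anInstance:AClass.DoSomething()"))
    print(tokenize("A -> B: Any character (<>{}?)"))
```

With this layout the strict line never sees CONTENT, because the lexer is still in the default mode when it reaches the colon. A free-form line without any content would leave the lexer in FREE mode, so a real grammar would need to handle that case as well.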

How to parse a word that starts with a specific letter with ANTLR3 java target

Is there a way to parse words that start with a specific character?
I've been trying the following, but I couldn't get any promising results:
//This one is working it accepts AD CD and such
example1
:
.'D'
;
//This one is not, it expects character D, then any ws character then any character
example2
:
'D'.
;
//These two are not working either
example3
:
'D'.*
;
//Doesn't accept input due to error: "line 1:3 missing 'D' at '<EOF>'"
example4
:
.*'D'
;
//just in case my WS rule:
/** WhiteSpace Characters (HIDDEN)*/
WS : ( ' '
| '\t'
)+ {$channel=HIDDEN;}
;
I am using ANTLR 3.4
Thanks in advance
//This one is not, it expects character D, then any ws character then any character
example2
:
'D'.
;
No, it does not. example2 does not match the character 'D' followed by whitespace and then any character. Since example2 is a parser rule, it matches tokens, not characters (there's a big difference!). It matches the token 'D' followed by any single token: the . (DOT) in a parser rule matches any token (again: not any character!). And since you put spaces on a separate channel, the parser never sees them, so they are not matched by this rule either.
More info on meta chars (like the . (DOT)) whose meaning differs inside lexer and parser rules: Negating inside lexer- and parser rules
//These two are not working either
example3
:
'D'.*
;
//Doesn't accept input due to error: "line 1:3 missing 'D' at '<EOF>'"
example4
:
.*'D'
;
Unless you know exactly what you're doing, don't use .*: it gobbles up too much in your case (especially when placed at the start or end of a rule).
It looks like you're trying to tokenize things inside the parser (all your example rules are parser rules). As far as I can see, these should be lexer rules instead. For more on the difference between parser and lexer rules, see: Practical difference between parser rules and lexer rules in ANTLR?
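What "doing it in the lexer" amounts to can be sketched with a Python regular expression. This is only an illustration, not ANTLR 3 code; the token names D_WORD and WORD and the d_words helper are hypothetical. Python alternation takes the first alternative that matches rather than the longest one, which gives the same result here because both word alternatives consume the same characters.

```python
import re

# D_WORD is listed before WORD so it wins when both match the same text.
TOKEN = re.compile(
    r"(?P<D_WORD>D[A-Za-z0-9_]*)"
    r"|(?P<WORD>[A-Za-z0-9_]+)"
    r"|(?P<WS>[ \t]+)"
)


def d_words(text):
    """Return the words that start with an upper-case D."""
    return [m.group() for m in TOKEN.finditer(text) if m.lastgroup == "D_WORD"]


if __name__ == "__main__":
    print(d_words("Dog cat Data d AD"))
```

The decision is made per word, at the character level, so AD is a plain WORD even though it contains a D.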

ANTLR 3 - how do I make unique tokens with NOT across special chars

I have a short question:
// Lexer
LOOP_NAME : (LETTER|DIGIT)+;
OTHERCHARS : ~('>' | '}')+;
LETTER : ('A'..'Z')|('a'..'z');
DIGIT : ('0'..'9');
A_ELEMENT
: (LETTER|'_')*(LETTER|DIGIT|'_'|'.');
// Parser configuration
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
My problem is that this is impossible due to:
As a result, alternative(s) 2 were disabled for that input [14:55:32]
error(208): ltxt2.g:61:1: The following token definitions can never be
matched because prior tokens match the same input:
LETTER,DIGIT,A_ELEMENT,WS
My issue is that I also need to catch UTF8 with OTHERCHARS... and I cannot put all special UTF8 chars into a Lexer rule since I cannot range like ("!".."?").
So I need the NOT (~). The OTHERCHARS here can be everything but ">" or "}". These two close a literal context and are forbidden within.
Such cases don't seem to be documented very well, so I'd be happy if someone knew a workaround. The NOT operator here creates the ambiguity I need to resolve.
Thanks in advance.
Best,
wishi
Move OTHERCHARS to the very end of the lexer and define it like this:
OTHERCHARS : . ;
In the Java target, this will match a single UTF-16 code unit that is not matched by a previous rule. I typically name the rule ANY_CHAR and treat it as a fall-back. By using . instead of .+, the lexer will only use this rule if no other rule matches:
1. If another rule matches more than one character, that rule will have priority over ANY_CHAR due to matching a larger number of characters from the input.
2. If another rule matches exactly one character, that rule will have priority over ANY_CHAR due to appearing earlier in the grammar.
Edit: To exclude } and > from the ANY_CHAR rule, you'll want to create rules for them so they are covered under point 2.
RBRACE : '}' ;
GT : '>' ;
ANY_CHAR : . ;
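The priority of a trailing catch-all rule can be checked with a short Python sketch. It is an approximation under assumptions, not ANTLR itself: rules are regular expressions tried in definition order, the longest match wins and ties go to the earlier rule. The LOOP_NAME pattern is simplified from the question, and Python's . matches one code point rather than one UTF-16 code unit.

```python
import re

# Definition order matters: ANY_CHAR comes last and acts as a fall-back.
RULES = [
    ("LOOP_NAME", r"[A-Za-z0-9]+"),
    ("RBRACE", r"\}"),
    ("GT", r">"),
    ("ANY_CHAR", r"(?s)."),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # Strictly greater, so ANY_CHAR never beats an earlier
            # rule that matches the same single character.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    print(tokenize("ab>é}"))
```

Because ANY_CHAR matches every single character, the loop always finds some rule; the closing characters } and > still get their own token types because their rules come first.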

ANTLR parse problem

I need to be able to match a certain string ('[', then any number of equals signs or none, then '['), then I need to match a matching close bracket (']', then the same number of equals signs, then ']') after some other match rules ((options{greedy=false;}:.)* if you must know). I have no clue how to do this in ANTLR. How can I do it?
An example: I need to match [===[whatever arbitrary text ]===] but not [===[whatever arbitrary text ]==].
I need to do it for an arbitrary number of equals signs as well, so therein lies the problem: how do I get it to match the same number of equals signs in the open as in the close? The parser rules supplied so far don't seem to help.
You can't easily write a lexer rule for it; you need parser rules. Two rules should be sufficient: one responsible for matching the brackets, one for matching the equals signs.
Something like this:
braces : '[' ']'
| '[' equals ']'
;
equals : '=' equals '='
| '=' braces '='
;
This should cover the use case you described. I'm not absolutely sure, but you may have to use a predicate in the first alternative of 'equals' to avoid ambiguous interpretations.
Edit:
It is hard to integrate your greedy=false rule and at the same time avoid a lexer context switch or something similar (hard in ANTLR). But if you are willing to put a little bit of Java in your grammar, you can write a lexer rule.
The following example grammar shows how:
grammar TestLexer;
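SPECIAL
    : '[' { int counter = 0; }
      ( '=' { counter++; } )*    // zero or more '=': the question allows none
      '['
      ( options { greedy=false; } : . )*
      ']'
      ( '=' { counter--; } )*
      // reject the token if the closing run of '=' has a different length
      { if (counter != 0) throw new RecognitionException(input); }
      ']'
    ;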
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
rule : ID
| SPECIAL
;
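The requirement "the same number of equals signs on both sides" can also be tested outside ANTLR. The sketch below is not the counter technique of the SPECIAL rule; it swaps in a Python regular expression with a backreference, which expresses the same constraint. LONG_BRACKET and match_long_bracket are illustrative names.

```python
import re

# \1 refers back to the run of '=' captured in the opening bracket, so the
# closing bracket must repeat exactly the same run (possibly empty).
LONG_BRACKET = re.compile(r"\[(=*)\[(.*?)\]\1\]", re.DOTALL)


def match_long_bracket(text):
    """Return the enclosed text if the whole input is a balanced bracket, else None."""
    m = LONG_BRACKET.fullmatch(text)
    return m.group(2) if m else None


if __name__ == "__main__":
    print(match_long_bracket("[===[whatever arbitrary text ]===]"))
    print(match_long_bracket("[===[whatever arbitrary text ]==]"))
```

Unlike a loop that stops at the first ], the backreference lets the body contain a closing bracket with a different number of equals signs.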
Your tags mention lexing, but your question itself doesn't. What you're trying to do is non-regular, so I don't think it can be done as part of lexing (though I don't remember if ANTLR's lexer is strictly regular -- it's been a couple of years since I last used ANTLR).
What you describe should be possible in parsing, however. Here's the grammar for what you described:
thingy : LBRACKET middle RBRACKET;
middle : EQUAL middle EQUAL
| LBRACKET RBRACKET;

Resources