ANTLR v3: parsing a domain-specific language

I have a scripting language of the form:
<keyword> = <text>,
Where <text> can contain keywords and sometimes the <text> can contain instructions depending on what <keyword> is used.
I am trying to handle the <text> based on what <keyword> is used.
/* lang.g */
grammar lang;
/* parser rules */
script : assignment+ ;
assignment : keyword VALUE ;
/* cannot do the following (but I would like to)
assignment : command | command_b | display ;
command : COMMAND '=' /* parser rules for command */ ',' ;
command_b : COMMAND_B '=' /* parser rules for command_b */ ',' ;
display : DISPLAY '=' ~(',')+ ',' ;
*/
/* lexer rules */
VALUE : '='! ~(',')+ ','! ;
COMMAND : 'command' ;
COMMAND_B : 'command_b' ;
DISPLAY : 'display' ;
WS : (' '|'\t'|'\r'|'\n')+ {$channel=HIDDEN;} ;
Example input file:
command = goto->step_b,
display = this is some plain text. command keyword used,
command_b = read_file:"readme.txt",
I want to handle the command, command_b and display rules differently, using ANTLR to parse everything without help from the target language. With the *.g file above, the first line yields command and goto->step_b as tokens. goto->step_b needs to be parsed further, and it would be nice to have ANTLR do all that work rather than the target language.
If there isn't a way to do this directly, I thought I would accomplish this in two stages:
1. Use the *.g file above to parse the input file.
2. Cull everything but the command and command_b nodes, then feed those nodes into another parser using a grammar defined for the command and command_b syntax only.
Is there a way to parse the script using a single grammar such that I can handle command/command_b rules differently than any other rule? Or will I have to process the script file in multiple stages?
Thanks for any help.
Josh

Have a look at my answer here:
antlr identifier name same as pre-defined function name cause MismatchedTokenException
You can use disambiguating semantic predicates to keep these rules out of your grammar:
COMMAND : 'command' ;
COMMAND_B : 'command_b' ;
DISPLAY : 'display' ;
And instead you will write rules like:
functions_stats
: {input.LT(1).getText().equals("command")}? '=' /* parser rules for command */ ','
;
The action in the semantic predicate is language specific, so may differ based on your target language. It works for Java and probably a number of others as well.

Related

antlr grammar: Allow whitespace matching only in template string

I want to parse template strings:
`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`
Here is my grammar:
varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)* ')' ;
WS : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;
When the input is parsed with this grammar, the template string no longer contains any whitespace because of WS -> skip. When I put TemplateStringLiteral before WS, I get the error:
extraneous input ' ' expecting {'`'}
How can I have whitespace parsed rather than skipped, but only inside the template string?
What is currently happening
When testing your example against your current grammar displaying the generated tokens, the lexer gives this:
[#0,0:0='`',<'`'>,1:0]
[#1,1:4='Some',<VAR>,1:1]
[#2,6:9='text',<VAR>,1:6]
[#3,11:12='${',<'${'>,1:11]
[#4,13:20='variable',<VAR>,1:13]
[#5,21:21='.',<'.'>,1:21]
[#6,22:25='name',<VAR>,1:22]
[#7,26:26='}',<'}'>,1:26]
... shortened ...
[#26,85:84='<EOF>',<EOF>,2:0]
This tells you that Some, which you intended to be matched by TemplateStringLiteral*, was actually lexed as VAR. Why is this happening?
As mentioned in this answer, ANTLR uses the longest possible match to create a token. Since your TemplateStringLiteral rule only matches a single character while your VAR rule matches arbitrarily many, the lexer uses the latter to match Some.
What you could try (Spoiler: won't work)
You could try to modify the rule like this:
TemplateStringLiteral: ('\\`' | ~'`')+ ;
so that it captures more than one character and is therefore preferred. This does not work, for two reasons:
How would the lexer match anything to the VAR rule, ever?
The TemplateStringLiteral rule now also matches ${ therefore prohibiting the correct recognition of the start of a template chunk.
How to achieve what you actually want
There might be another solution, but this one works:
File MartinCup.g4:
parser grammar MartinCup;
options { tokenVocab=MartinCupLexer; }
templateString
: BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
;
template
: TemplateStart variable TemplateEnd
;
variable
: varname funParameter? (Dot variable)*
;
varname
: VAR
;
funParameter
: OpenPar variable? (Comma variable)* ClosedPar
;
File MartinCupLexer.g4:
lexer grammar MartinCupLexer;
BackTick : '`' ;
TemplateStart
: '${' -> pushMode(templateMode)
;
TemplateStringLiteral
: '\\`'
| ~'`'
;
mode templateMode;
VAR
: [$]?[a-zA-Z0-9_]+
| [$]
;
OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;
TemplateEnd
: '}' -> popMode;
This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and only stays active until } is read. It thereby does not catch non-template text like Some.
Notice that the use of lexer modes requires a split grammar (separate files for parser and lexer grammars). Since no lexer rules are allowed in a parser grammar, I had to introduce tokens for the parentheses, comma, dot and backticks.
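To make the mode switching concrete, here is a small Python model of the lexer above. It is only a sketch under stated assumptions: the RULES table, the tokenize function and the regex translations of the rules are hand-written stand-ins (not ANTLR output), and the matching strategy simply follows the longest-match behaviour described earlier.

```python
import re

# Hand-written stand-in for MartinCupLexer.g4 (an approximation, not ANTLR output).
# Each mode maps to an ordered list of (token name, regex, mode action).
RULES = {
    "DEFAULT": [
        ("BackTick", r"`", None),
        ("TemplateStart", r"\$\{", "push"),
        ("TemplateStringLiteral", r"\\`|[^`]", None),
    ],
    "templateMode": [
        ("VAR", r"\$?[a-zA-Z0-9_]+|\$", None),
        ("OpenPar", r"\(", None),
        ("ClosedPar", r"\)", None),
        ("Comma", r",", None),
        ("Dot", r"\.", None),
        ("TemplateEnd", r"\}", "pop"),
    ],
}


def tokenize(text):
    """Return (token name, text) pairs, switching rule sets via a mode stack."""
    modes = ["DEFAULT"]
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in RULES[modes[-1]]:
            m = re.compile(pattern, re.DOTALL).match(text, pos)
            # Longest match wins; on equal length the earlier rule is kept.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group(), action)
        if best is None:
            raise ValueError(f"no rule matches {text[pos]!r} at offset {pos}")
        name, lexeme, action = best
        tokens.append((name, lexeme))
        pos += len(lexeme)
        if action == "push":
            modes.append("templateMode")  # like -> pushMode(templateMode)
        elif action == "pop":
            modes.pop()                   # like -> popMode
    return tokens


for token in tokenize("`Hi ${user.name}!`"):
    print(token)
```

In this model the characters of Hi (and the space after it) come out as single-character TemplateStringLiteral tokens, while user and name become VAR tokens, because the VAR rule is only consulted while templateMode is on top of the stack.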
About the whitespaces
I assume you want to keep whitespace inside the "normal text" but not allow it inside the templates. Therefore I simply removed the WS rule. You can always re-add it if you like.
I tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:
line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}
The reason for this is the same as above, Some is lexed to VAR.

How to remove ambiguity from this syntax (antlr4)

I am writing a tool to generate sequence diagrams from text. I need to support these two syntaxes:
anInstance:AClass.DoSomething() and
participant A -> participant B: Any character except for \r\n (<>{}?)etc..
Let's call the first one the strict syntax and the second one the free syntax. In anInstance:AClass.DoSomething(), I need anInstance:AClass to be matched by to (ID ':' ID), as in the strict syntax. However, :AClass.DoSomething() is matched by CONTENT first. I am thinking of some kind of lookahead that checks whether -> is present, but I have not been able to figure it out.
Strict syntax
message
: to '.' signature
;
signature
: methodName '()'
;
to
: ID ':' ID
;
methodName
: ID
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
Free syntax
asyncMessage
: source '->' target content
;
source
: ID+
;
target
: ID+
;
content
: CONTENT
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
CONTENT
: ':' ~[\r\n]+
;
SPACE
: [ \t\r\n] -> channel(HIDDEN)
;
You need to understand how the ANTLR lexer works:
1. It uses whichever rule matches the longest part of the input (starting at the current position).
2. If multiple rules can match the same input (i.e. the same length), the first one (in the order they are defined) is used.
With your current lexer rules, CONTENT takes precedence whenever you encounter a :, so ':' ID will never be matched.
With ANTLR 4, you should probably use modes in this case - when you encounter the : in the free form, switch to a "free" mode and define a lexer rule CONTENT to be only available in the "free" mode.
See this question for an idea about how ANTLR 4 lexer modes work.
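The two lexer rules described above can be modelled in a few lines of Python. This is only an illustration under stated assumptions: RULES, next_token and tokenize are hypothetical names, the regexes are hand translations of the lexer rules in the question (plus the literals used in its parser rules), and the real ANTLR lexer is implemented quite differently.

```python
import re

# Hand translation of the question's rules (an approximation, not ANTLR output).
# Literals from the parser rules are listed first; the order does not change
# the outcome of the first example below.
RULES = [
    ("'->'", r"->"),
    ("':'", r":"),
    ("'.'", r"\."),
    ("'()'", r"\(\)"),
    ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("CONTENT", r":[^\r\n]+"),
    ("SPACE", r"[ \t\r\n]"),
]


def next_token(rules, text, pos=0):
    """Pick the rule with the longest match at pos; earlier rules win ties."""
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


def tokenize(rules, text):
    pos, out = 0, []
    while pos < len(text):
        tok = next_token(rules, text, pos)
        if tok is None:
            raise ValueError(f"no rule matches at offset {pos}")
        out.append(tok)
        pos += len(tok[1])
    return out


# Longest match: CONTENT swallows everything after the colon,
# so the parser never sees ':' followed by ID.
print(tokenize(RULES, "anInstance:AClass.DoSomething()"))

# Tie-break: two hypothetical rules match the same two characters,
# and the one listed first is chosen.
print(next_token([("TO", r"to"), ("ID", r"[a-z]+")], "to"))
print(next_token([("ID", r"[a-z]+"), ("TO", r"to")], "to"))
```

The first call yields only an ID token followed by one CONTENT token, which is why the strict-syntax parser rule cannot succeed with these lexer rules.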

antlr v3 lexer predicts wrong

consider the following (combined) grammar
grammar CastModifier;
tokens{
E='=';
C='=()';
Lp='(';
Rp=')';
I = 'int';
S=';';
}
compilationUnit
: assign+ EOF
;
assign
: '=' Int ';'
| '=' '(' 'int' ')' Int ';'
| '=()' Int ';'
;
Int
: ('1'..'9') ('0'..'9')*
| '0'
;
Whitespace
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
Unfortunately the lexer does not always predict the next token correctly. For instance for the following code
=(int) 1;
The Lexer predicts it must be the '=()' token. It detects the correct token for the following code
= (int) 1;
I figured this problem should be solvable by ANTLR if I provide the following option for the rule "assign":
options{k=3;}
But apparently it does not help, and neither does defining this option for the whole grammar. How can I resolve this problem? My workaround at the moment is to build the '=()' token out of '=' '(' ')', but that allows the user to write
= ()
Which, well, is kind of ok but I am just wondering why ANTLR is not able to predict it correctly.
The options{k=3;} causes the parser to look ahead at most 3 tokens, not the lexer. ANTLR3's lexer is not that smart: once it matches a character, it won't give up on that match. So, in your case, from the input =(int) 1;, =( is matched but then there is an i in the char stream and no token that matches this.
My workaround at the moment is to build the '=()' token out of '=' '(' ')'
I'd call that a proper solution: '=' '(' and ')' are separate tokens and should be handled as such (to be glued together in the parser, not the lexer).
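The difference between "commit and never give up" and "fall back to the longest complete token" can be sketched in Python. This is a deliberately simplified model of the behaviour described above, not ANTLR 3's actual prediction algorithm; LITERALS, committed_match and longest_match are hypothetical names introduced for the illustration.

```python
# The literal tokens from the grammar's tokens{} block.
LITERALS = ["=", "=()", "(", ")", "int", ";"]


def committed_match(text, pos=0):
    """Consume characters while they can still grow into some literal.

    Never backs up: returns (consumed text, whether it is a complete token).
    """
    consumed = ""
    while pos + len(consumed) < len(text):
        candidate = consumed + text[pos + len(consumed)]
        if not any(lit.startswith(candidate) for lit in LITERALS):
            break
        consumed = candidate
    return consumed, consumed in LITERALS


def longest_match(text, pos=0):
    """Return the longest literal that fully matches at pos, or None."""
    matches = [lit for lit in LITERALS if text.startswith(lit, pos)]
    return max(matches, key=len, default=None)


print(committed_match("=(int) 1;"))   # gets stuck after '=('
print(committed_match("= (int) 1;"))  # the space stops it at '='
print(longest_match("=(int) 1;"))     # a backing-up lexer would emit '='
```

With separate '=' '(' ')' tokens there is no literal that shares a multi-character prefix with another, so the committed strategy cannot get stuck halfway through a token.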

How to parse a word that starts with a specific letter with ANTLR3 java target

Is there a way to parse words that start with a specific character?
I've been trying the following, but I couldn't get any promising results:
//This one is working it accepts AD CD and such
example1
:
.'D'
;
//This one is not, it expects character D, then any ws character then any character
example2
:
'D'.
;
//These two are not working either
example3
:
'D'.*
;
//Doesn't accept input due to error: "line 1:3 missing 'D' at '<EOF>'"
example4
:
.*'D'
;
//just in case my WS rule:
/** WhiteSpace Characters (HIDDEN)*/
WS : ( ' '
| '\t'
)+ {$channel=HIDDEN;}
;
I am using ANTLR 3.4
Thanks in advance
//This one is not, it expects character D, then any ws character then any character
example2
:
'D'.
;
No, it does not. It accepts the token (not character!) 'D' followed by any single token; it does not expect a 'D' followed by a whitespace character and then any character. Since example2 is a parser rule, it does not match characters but tokens (there's a big difference!). And since you put spaces on a separate channel, the spaces are not matched by this rule either. Finally, the . (DOT) matches any token (again: not any character!).
More info on meta chars (like the . (DOT)) whose meaning differs inside lexer and parser rules: Negating inside lexer- and parser rules
//These two are not working either
example3
:
'D'.*
;
//Doesn't accept input due to error: "line 1:3 missing 'D' at '<EOF>'"
example4
:
.*'D'
;
Unless you know exactly what you're doing, don't use .*: it gobbles up too much in your case (especially when placed at the start or end of a rule).
It looks like you're trying to tokenize things inside the parser (all your example rules are parser rules). As far as I can see, these should be lexer rules instead. For more on the difference between parser and lexer rules, see: Practical difference between parser rules and lexer rules in ANTLR?

ANTLR 3 - how do I make unique tokens with NOT across special chars

I have a short question:
// Lexer
LOOP_NAME : (LETTER|DIGIT)+;
OTHERCHARS : ~('>' | '}')+;
LETTER : ('A'..'Z')|('a'..'z');
DIGIT : ('0'..'9');
A_ELEMENT
: (LETTER|'_')*(LETTER|DIGIT|'_'|'.');
// Parser configuration
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
My problem is that this is impossible due to:
As a result, alternative(s) 2 were disabled for that input [14:55:32]
error(208): ltxt2.g:61:1: The following token definitions can never be
matched because prior tokens match the same input:
LETTER,DIGIT,A_ELEMENT,WS
My issue is that I also need to catch UTF-8 characters with OTHERCHARS, and I cannot put all special UTF-8 characters into a lexer rule, since I cannot write a range like ("!".."?").
So I need the NOT (~) operator. OTHERCHARS can be anything but ">" or "}". These two characters close a literal context and are forbidden within it.
Such cases don't seem to be documented very well, so I'd be happy if someone knew a workaround. The NOT operator here creates the ambiguity I need to resolve.
Thanks in advance.
Best,
wishi
Move OTHERCHARS to the very end of the lexer and define it like this:
OTHERCHARS : . ;
In the Java target, this will match a single UTF-16 code unit which is not matched by a previous rule. I typically name the rule ANY_CHAR and treat it as a fall-back. By using . instead of .+, the lexer will only use this rule if no other rule matches.
1. If another rule matches more than one character, that rule will have priority over ANY_CHAR due to matching a larger number of characters from the input.
2. If another rule matches exactly one character, that rule will have priority over ANY_CHAR due to appearing earlier in the grammar.
Edit: To exclude } and > from the ANY_CHAR rule, you'll want to create rules for them so they are covered under point 2.
RBRACE : '}' ;
GT : '>' ;
ANY_CHAR : . ;
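The two priority rules above can be checked with a small Python model. It is a sketch only: the regexes are hypothetical stand-ins for the lexer rules (LOOP_NAME from the question plus the three rules just shown), Python strings iterate over code points rather than UTF-16 code units, and tokenize is a made-up helper rather than anything ANTLR generates.

```python
import re

# Hypothetical regex stand-ins for the lexer rules, in grammar order.
RULES = [
    ("LOOP_NAME", r"[A-Za-z0-9]+"),
    ("RBRACE", r"\}"),
    ("GT", r">"),
    ("ANY_CHAR", r"."),  # fall-back, listed last
]


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = re.compile(pattern, re.DOTALL).match(text, pos)
            # A strictly longer match replaces the current best, so on
            # equal length the rule listed earlier stays selected.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        # ANY_CHAR always matches one character, so best is never None.
        tokens.append(best)
        pos += len(best[1])
    return tokens


# "ab" is longer than one character, "}" and ">" tie with ANY_CHAR but are
# listed earlier, and the accented letter is only matched by the fall-back.
print([name for name, _ in tokenize("ab}\u00e9>")])
```

Only the accented character falls through to ANY_CHAR; every other piece of the input is claimed by a more specific rule, either by length or by position in the rule list.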
