Lexer rule terminates without matching terminator symbol - antlr3

fragment
EXEC : ('E' 'X' 'E' 'C');
fragment
CMD : ('C' 'M' 'D');
fragment
BEGIN : ('B' 'E' 'G' 'I' 'N');
fragment
END : ('E' 'N' 'D');
fragment
SEMICOLON : ';';
ExecCommand :
EXEC Whitespace CMD Whitespace
BEGIN WhiteSpace? SEMICOLON
( options {greedy=false;} : . )*
END Whitespace EXEC;
Begin : BEGIN;
End : END;
Exec : EXEC;
The ExecCommand rule terminates the scan on the first 'E' following the Semicolon and then fails if the next characters are not 'END'. The scan should only terminate on 'END' and not for 'ELSE' or any other string beginning with 'E'.
The scan loop has a test for _LA(1) == 'E' instead of match('END').
Exec, Begin and End are also token rules. The ExecCommand rule is the first rule in the lexer grammar so it should have precedence.
How do I generate a rule that will accept any arbitrary text between the start and end symbols and not terminate until the end symbol is found?
I tried the following and it did not generate successfully:
ExecCommand :
EXEC Whitespace CMD Whitespace
BEGIN WhiteSpace? SEMICOLON
( options {greedy=false;} : ~(END Whitespace EXEC) )*
END Whitespace EXEC;

By default there's only one lookahead, so a non-greedy loop terminates as soon as the first input char after that loop is found (the 'E'). It then attempts to match 'N' and 'D' (as directed by the END rule), which fails in your case. Try to increase the lookahead (paramater k) and see if that helps. If not you have to find a different approach.
As a side note: Exec and EXEC are entirely the same (except for their names), so either remove EXEC (if you need that lexer token in your parser) or Exec (if that lexer token is exclusively used in your lexer. Similar for END and BEGIN.

Related

antlr grammar: Allow whitespace matching only in template string

I want to parse template strings:
`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`
Here is my grammar:
varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)* ')' ;
WS : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;
When the input for the grammar is parsed, the template string has no whitespaces anymore because of the WS -> skip. When I put the TemplateStringLiteral before WS, I get the error:
extraneous input ' ' expecting {'`'}
How can I allow whitespaces to be parsed and not skipped only inside the template string?
What is currently happening
When testing your example against your current grammar displaying the generated tokens, the lexer gives this:
[#0,0:0='`',<'`'>,1:0]
[#1,1:4='Some',<VAR>,1:1]
[#2,6:9='text',<VAR>,1:6]
[#3,11:12='${',<'${'>,1:11]
[#4,13:20='variable',<VAR>,1:13]
[#5,21:21='.',<'.'>,1:21]
[#6,22:25='name',<VAR>,1:22]
[#7,26:26='}',<'}'>,1:26]
... shortened ...
[#26,85:84='<EOF>',<EOF>,2:0]
This tells you, that Some which you intended to be TemplateStringLiteral* was actually lexed to be VAR. Why is this happening?
As mentioned in this answer, antlr uses the longest possible match to create a token. Since your TemplateStringLiteral rule only matches single characters, but your VAR rule matches infinitely many, the lexer obviously uses the latter to match Some.
What you could try (Spoiler: won't work)
You could try to modify the rule like this:
TemplateStringLiteral: ('\\`' | ~'`')+ ;
so that it captures more than one character and therefore will be preferred. This has two reasons why it does not work:
How would the lexer match anything to the VAR rule, ever?
The TemplateStringLiteral rule now also matches ${ therefore prohibiting the correct recognition of the start of a template chunk.
How to achieve what you actually want
There might be another solution, but this one works:
File MartinCup.g4:
parser grammar MartinCup;
options { tokenVocab=MartinCupLexer; }
templateString
: BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
;
template
: TemplateStart variable TemplateEnd
;
variable
: varname funParameter? (Dot variable)*
;
varname
: VAR
;
funParameter
: OpenPar variable? (Comma variable)* ClosedPar
;
File MartinCupLexer.g4:
lexer grammar MartinCupLexer;
BackTick : '`' ;
TemplateStart
: '${' -> pushMode(templateMode)
;
TemplateStringLiteral
: '\\`'
| ~'`'
;
mode templateMode;
VAR
: [$]?[a-zA-Z0-9_]+
| [$]
;
OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;
TemplateEnd
: '}' -> popMode;
This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and only stays active until } is read. It thereby does not catch non-template text like Some.
Notice that the use of lexer modes requires a split grammar (separate files for parser and lexer grammars). Since no lexer rules are allowed in a parser grammar, I had to introduce tokens for the parentheses, comma, dot and backticks.
About the whitespaces
I assume you want to keep whitespaces inside the "normal text", but not allow whitespace inside the templates. Therefore I simply removed the WS rule. You can always re-add it if you like.
I tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:
line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}
The reason for this is the same as above, Some is lexed to VAR.

How below REGEXP_REPLACE works?

I have query in my project and that is having REGEXP_REPLACE
i tried to find how it works by searching but i found it like
w+ Matches a word character (that is, an alphanumeric or underscore
(_) character).
but not able to find '"\w+\":' why these "" are used and what is mean by '{|}|"',''
UPDATE (SELECT data,data_value FROM TEMP) t
SET t.DATA_VALUE=REGEXP_REPLACE(REGEXP_REPLACE(t.data, '"\w+\":',''),'{|}|"','');
can you please tell me how it works?
This appear to be a regular expression for stripping keys and enclosing brackets from a JSON string - unfortunately, if this is the case then it does not work in all situations.
The regular expression
'"\w+\":'
will match:
A " double quotation mark;
\w+ one-or-more word (a-z or A-Z or 0-9 or _) characters;
\" another double quotation mark - note: the \ character is not necessary; then
A : colon.
So:
REGEXP_REPLACE(
'{"key":"value","key2":"value with \"quote"}',
'"\w+":', -- Pattern matched
'' -- Replacement string
)
Will output:
{"value","value with \"quote"}
The second pattern {|}|" will match either a {, or a } or a " character (and could have been equivalently written as [{}"]) so:
REGEXP_REPLACE(
'{"value","value with \"quote"}',
'{|}|"', -- Pattern matched
'' -- Replacement string
)
Will output:
value,value with \quote
Which is fine, until (like my example) you have an escaped double quote (or curly braces) in the value string; in which case those will also get stripped leaving the escape character.
(Note: you would not typically find this but it is possible to include escaped quotes in the key. So {"keywith\":quote":"value"} would get replaced to {quote":"value"} and then quote:value which is not the intended output.)
If parsing JSON is what you are trying to do (pre-Oracle 12) then you can use:
REGEXP_REPLACE(
'{"key":"value","key2":"value with \"quote","keywith\":quote":"value with \"{}"}',
'^{|"(\\"|[^"])+":(")?((\\"|[^"])+?)\2((,)|})',
'\3\6'
)
Which outputs:
value,value with \"quote,value with \"{}
Or in Oracle 12 you can do:
SELECT *
FROM JSON_TABLE(
'{"key":"value","key2":"value with \"quote","keywith\":quote":"value with \"{}"}',
'$.*' NULL ON ERROR
COLUMNS (
value VARCHAR2(4000) PATH '$'
)
)
Which outputs:
VALUE
-----------------
value
value with "quote
value with "{}
example:::REGEXP_REPLACE( string, pattern [, replacement_string [, start_position [, nth_appearance [, match_parameter ] ] ] ] )
| is or(CAN MEAN MORE THAN ONE ALTERNATIVE ) , is for at least as in {n,} at least n times
https://www.techonthenet.com/oracle/functions/regexp_replace.php
"where I got my info"
'"\w+\":' why these "" are used and what is mean by '{|}|"',''
Matches a word character(\w)One or more times(+) this has to be messed up it's missing the right quantity of close parentheses by putting \" w+ \"
they allow the " to be shown. This expression takes one expression changes it then uses that as the basis for the next change. Good luck figuring the rest out. Regular expressions aren't too bad, pretty intuitive once you get the basics down.

oracle regexp_replace delete last occurrence of special character

I have a pl sql string as follows :
String := 'ctx_ddl.add_stopword(''"SHARK_IDX19_SPL"'',''can'');
create index "SCOTT"."SHARK_IDX2"
on "SCOTT"."SHARK2"
("DOC")
indextype is ctxsys.context
parameters(''
datastore "SHARK_IDX2_DST"
filter "SHARK_IDX2_FIL"
section group "SHARK_IDX2_SGP"
lexer "SHARK_IDX2_LEX"
wordlist "SHARK_IDX2_WDL"
stoplist "SHARK_IDX2_SPL"
storage "SHARK_IDX2_STO"
sync (every "SYSDATE+(1/1)" memory 67108864)
'')
/
';
I have to get search the final occurrence of '/' and add ';' to it. Also I need to escape the quotes preset in parameters ('') to have extra quotes. I need output like
String := 'ctx_ddl.add_stopword(''"SHARK_IDX19_SPL"'',''can'');
create index "SCOTT"."SHARK_IDX2"
on "SCOTT"."SHARK2"
("DOC")
indextype is ctxsys.context
parameters(''''
datastore "SHARK_IDX2_DST"
filter "SHARK_IDX2_FIL"
section group "SHARK_IDX2_SGP"
lexer "SHARK_IDX2_LEX"
wordlist "SHARK_IDX2_WDL"
stoplist "SHARK_IDX2_SPL"
storage "SHARK_IDX2_STO"
sync (every "SYSDATE+(1/1)" memory 67108864)
'''')
/;
';
Any help.
There's an age-old saying: "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems"
Unless you are confronted by a problem that truly requires regular expressions, I'd recommend working with basic string manipulation functions.
Semi-colon:
Use INSTR to find last occurence of '/', call this P1.
Result = Substr from position 1 through P1||';'||substr from P1+1 through to end-of-string
Parameters substitution:
Use INSTR to find where parameter list starts (i.e. find "parameters(" in your string) and ends (presumably the last closing parenthesis ")" in your string). Call these P2 and P3.
Result = substr from 1 through P2 || REPLACE(substr from P2+1 through P3-1,'''','''''''') || substr from P3 to end-of-string

How to parse a word that starts with a specific letter with ANTLR3 java target

Is there a way to parse words that start with a specific character?
I've been trying the following but i couldn't get any promising results:
//This one is working it accepts AD CD and such
example1
:
.'D'
;
//This one is not, it expects character D, then any ws character then any character
example2
:
'D'.
;
//These two are not working either
example3
:
'D'.*
;
//Doesn't accept input due to error: "line 1:3 missing 'D' at '<EOF>'"
example4
:
.*'D'
;
//just in case my WS rule:
/** WhiteSpace Characters (HIDDEN)*/
WS : ( ' '
| '\t'
)+ {$channel=HIDDEN;}
;
I am using ANTLR 3.4
Thanks in advance
//This one is not, it expects character D, then any ws character then any character
example2
:
'D'.
;
No, it does not it accept the token (not character!) 'D' followed by a space and then any character. Since example2 is a parser rule, it does not match characters, but matches tokens (there's a big difference!). And since you put spaces on a separate channel, the spaces are not matched by this rule either. At the end, the . (DOT) matches any token (again: not any character!).
More info on meta chars (like the . (DOT)) whose meaning differ inside lexer- and parser rules: Negating inside lexer- and parser rules
//These two are not working either
example3
:
'D'.*
;
//Doesn't accept input due to error: "line 1:3 missing 'D' at '<EOF>'"
example4
:
.*'D'
;
Unless you know exactly what you're doing, don't use .*: they gobble up too much in your case (especially when placed at the start or end of a rule).
It looks like you're trying to tokenize things inside the parser (all your example rules are parser rules). As far as I can see, these should be lexer rules instead. More on the difference between parser- and lexer rules, see: Practical difference between parser rules and lexer rules in ANTLR?

ANTLR trying to match token within longer token

I'm new to ANTLR, and trying following grammar in ANTLRWorks1.4.3.
command
: 'go' SPACE+ 'to' SPACE+ destination
;
destination
: (UPPER | LOWER) (UPPER | LOWER | DIGIT)*
;
SPACE
: ' '
;
UPPER
: 'A'..'Z'
;
LOWER
: 'a'..'z'
;
DIGIT
: '0'..'9'
;
This seems to work OK, except when the 'destination' contains first two chars of keywords 'go' and 'to'.
For instance, if I give following command:
go to Glasgo
the node-tree is displayed as follows:
I was expecting it to match fill word as destination.
I even tried changing the keyword, for example 'travel' instead of 'go'. In that case, if there is 'tr' in the destination, ANTLR complains.
Any idea why this happens? and how to fix this?
Thanks in advance.
ANTLR lexer and parser are strictly separated. Your input is first tokenized, after which the parser rules operate on said tokens.
In you case, the input go to Glasgo is tokenized into the following X tokens:
'go'
' ' (SPACE)
'to'
'G' (UPPER)
'l' (LOWER)
'a' (LOWER)
's' (LOWER)
'go'
which leaves a "dangling" 'go' keyword. This is simply how ANTLR's lexer works: you cannot change this.
A possible solution in your case would be to make destination a lexer rule instead of a parser rule:
command
: 'go' 'to' DESTINATION
;
DESTINATION
: (UPPER | LOWER) (UPPER | LOWER | DIGIT)*
;
SPACE
: ' ' {skip();}
;
fragment UPPER
: 'A'..'Z'
;
fragment LOWER
: 'a'..'z'
;
fragment DIGIT
: '0'..'9'
;
resulting in:
If you're not entirely sure what the difference between the two is, see: Practical difference between parser rules and lexer rules in ANTLR?
More about fragments: What does "fragment" mean in ANTLR?
PS. Glasgow?

Resources