I tried to use ANTLR3 to build a simple Regexpression parser, but it throws the internal error
Here is the Sample.g
grammar Sample;
options {
memoize=true;
output=AST;
}
tokens {
RegExp;
}
RegExpression:
'/' (a=~('/' | NL))+ '/'
-> ^(RegExp[$RegExpression.start, $RegExpression.text] $a+ )
;
fragment NL: '\n' | '\r';
ANY : . ;
I run the command:
java -jar antlr-3.5.2-complete.jar -print Sample.g
and it gives this:
error(10): internal error: Sample.g : java.lang.NullPointerException
org.antlr.grammar.v3.DefineGrammarItemsWalker.rewrite_atom(DefineGrammarItemsWalker.java:3896)
...
...
Updated according to comments
grammar Sample{
memoize=true;
output=AST;
}
tokens {
RegExp;
}
regExpression:
'/' (a=~('/' | NL))+ '/'
-> ^(RegExp[$regExpression.start, $regExpression.text] $a+ )
;
NL: '\n' | '\r';
And here are the errors after running the java -jar antlr-3.5.2-complete.jar Sample.g
error(10): internal error: Sample.g : java.lang.NullPointerException
org.antlr.grammar.v3.CodeGenTreeWalker.getTokenElementST(CodeGenTreeWalker.java:311)
org.antlr.grammar.v3.CodeGenTreeWalker.notElement(CodeGenTreeWalker.java:2886)
org.antlr.grammar.v3.CodeGenTreeWalker.element(CodeGenTreeWalker.java:2431)
org.antlr.grammar.v3.CodeGenTreeWalker.element(CodeGenTreeWalker.java:2446)
org.antlr.grammar.v3.CodeGenTreeWalker.alternative(CodeGenTreeWalker.java:2250)
org.antlr.grammar.v3.CodeGenTreeWalker.block(CodeGenTreeWalker.java:1798)
org.antlr.grammar.v3.CodeGenTreeWalker.ebnf(CodeGenTreeWalker.java:3014)
org.antlr.grammar.v3.CodeGenTreeWalker.element(CodeGenTreeWalker.java:2495)
org.antlr.grammar.v3.CodeGenTreeWalker.alternative(CodeGenTreeWalker.java:2250)
org.antlr.grammar.v3.CodeGenTreeWalker.block(CodeGenTreeWalker.java:1798)
org.antlr.grammar.v3.CodeGenTreeWalker.rule(CodeGenTreeWalker.java:1321)
org.antlr.grammar.v3.CodeGenTreeWalker.rules(CodeGenTreeWalker.java:955)
org.antlr.grammar.v3.CodeGenTreeWalker.grammarSpec(CodeGenTreeWalker.java:877)
org.antlr.grammar.v3.CodeGenTreeWalker.grammar_(CodeGenTreeWalker.java:518)
org.antlr.codegen.CodeGenerator.genRecognizer(CodeGenerator.java:415)
org.antlr.Tool.generateRecognizer(Tool.java:674)
org.antlr.Tool.process(Tool.java:487)
org.antlr.Tool.main(Tool.java:98)
You're trying to use a rewrite rule (tree construction) on a lexer rule. That doesn't make sense.
In ANTLR, all rules with name starting with an uppercase letter are lexer rules. The tree construction is used on AST nodes, not on tokens themselves, so you have to use it on parser rules (starting with lowercase letter).
When you do that, keep in mind that your NL is a fragment now (you cannot use fragments in parser rules) and make sure your ANY token doesn't collide with anything else, i.e. define all needed tokens (/, NL etc.) and put them above the ANY token definition.
Related
I want to parse yaml with antlr4.
Target file contains image: xxx.com/node:8.14.
Then I wrote a grammar file like this:
grammar Drone;
yaml: obj+ ;
obj: ID ':' value;
value:
STRING;
ID
: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'-')+
;
STRING: ('a'..'z'|'A'..'Z'|'0'..'9'|'-'|'.'|'_'|'/'|':')+ ;
WS: [ \t]+ -> skip;
CRLF: [\r\n]+ ;
got result like this:
[antlr4] ➜ dronemigrate antlr4-parse Drone.g4 yaml -tree -trace drone.yml
line 1:0 mismatched input 'image:' expecting ID
enter yaml, LT(1)=image:
enter obj, LT(1)=image:
exit obj, LT(1)=<EOF>
exit yaml, LT(1)=<EOF>
(yaml:1 (obj:1 image: xxx.com/node:8.14))
When I remove char ':' in Grammar rules file, got reulst like this:
[antlr4] ➜ dronemigrate antlr4-parse Drone.g4 yaml -tree -trace drone.yml
enter yaml, LT(1)=image
enter obj, LT(1)=image
consume [#0,0:4='image',<2>,1:0] rule obj
consume [#1,5:5=':',<1>,1:5] rule obj
enter value, LT(1)=xxx.com/node
consume [#2,7:18='xxx.com/node',<3>,1:7] rule value
exit value, LT(1)=:
exit obj, LT(1)=:
exit yaml, LT(1)=:
(yaml:1 (obj:1 image : (value:1 xxx.com/node)))
how to deal the ':' in string?
YAML is not well suited for parser generators like ANTLR (where there is a strict separation between the lexer and parser). There are ways around it (using lexical modes), but fully implementing a YAML grammar is by no means a trivial thing. That's why there is still an open issue in ANTLR's issue tracker to implement such a grammar. You could have a look at how it is done here: https://github.com/umaranis/FastYaml
Following this question, I have ANTLR installed via HomeBrew:
brew install antlr
and it is installed on:
/usr/local/Cellar/antlr/<version>/
and installed the Python runtime via
pip3 install antlr4-python3-runtime
From here, I ran the
export CLASSPATH=".:/usr/local/Cellar/antlr/<version>/antlr-<version>-complete.jar:$CLASSPATH"
but when I run the command
grun <grammarName> <inputFile>
I get the infamous error message:
Can't load <grammarName> as lexer or parser
I would appreciate it if you could help me know the problem and how to solve it.
P.S. It shouldn't matter, but you may see the code I am working on here.
This error message is an indication that TestRig (that the grun alias uses), can’t find the Parser (or Lexer) class in your classpath. If you’ve placed your generated Parser in a package, you may need to take the package name into account, but the main thing is that the generated (and compiled) class is in your classpath.
Also.. TestRig takes the grammarName AND a startRule as parameters, and expects your input to come from stdin.
I cloned your repo to take a closer look at your issue.
The immediate issue for why grun is giving you this issue is that you specified your target language in the grammar file (looks like I need to retract that comment about it not being anything in your grammar). By specifying python as the target language in the grammar, you didn't generate the *.java classes that the TestRig class (used by the grun alias) needed to execute.
I removed the target language option from the grammar and was able to run the grun command against your sample input. To get it to parse correctly, I took the liberty of modifying several things in your grammar:
Removed the target language (It's generally better to specify the target language on the antlr command line so that the grammar remains language agnostic (it's also essential if you want to use the TestRig/grun utility to test things out, since you'll need the Java target)).
changed SectionName lexer rule to section parser rule (with labeled alternatives. Having lexer rules like 'Body ' Integer will give you a single token with both the body keyword and the integer, that you'd then have to pull apart later (it also forces there to be only a single space between 'Body' and the integer).
set the NewLine token to -> skip (This is a bit more presumptive on my part, but not skipping NewLine will require modifying more parse rule to specify where all NewLine is a valid token.)
removed the StatementEnd lexer rule since I skiped the NewLine tokens
reworked theInteger and Float stuff to be two different tokens so that I could use the Integer token in the section parser rule.
a couple more minor tweaks just to get this skeleton to handle your sample input.
The resulting grammar I used was:
grammar ElmerSolver;
// Parser Rules
// eostmt: ';' | CR;
statement: ';';
statement_list: statement*;
sections: section+ EOF;
// section: SectionName /* statement_list */ End;
// Lexer Rules
fragment DIGIT: [0-9];
Integer: DIGIT+;
Float:
[+-]? (DIGIT+ ([.]DIGIT*)? | [.]DIGIT+) ([Ee][+-]? DIGIT+)?;
section:
'Header' statement_list End # headerSection
| 'Simulation' statement_list End # simulatorSection
| 'Constants' statement_list End # constantsSection
| 'Body ' Integer statement_list End # bodySection
| 'Material ' Integer statement_list End # materialSection
| 'Body Force ' Integer statement_list End # bodyForceSection
| 'Equation ' Integer statement_list End # equationSection
| 'Solver ' Integer statement_list End # solverSection
| 'Boundary Condition ' Integer statement_list End # boundaryConditionSection
| 'Initial Condition ' Integer statement_list End # initialConditionSection
| 'Component' Integer statement_list End # componentSection;
End: 'End';
// statementEnd: ';' NewLine*;
NewLine: ('\r'? '\n' | '\n' | '\r') -> skip;
LineJoining:
'\\' WhiteSpace? ('\r'? '\n' | '\r' | '\f') -> skip;
WhiteSpace: [ \t\r\n]+ -> skip;
LineComment: '#' ~( '\r' | '\n')* -> skip;
With those changes, I ran
➜ antlr4 ElmerSolver.g4
javac *.java
grun ElmerSolver sections -tree < examples/ex001.sif
and got the output:
(sections (section Simulation statement_list End) (section Equation 1 statement_list End) <EOF>)
I tried to use ANTLR3 to build a simple Regexpression parser, but it throws the internal error
Here is the Sample.g
grammar Sample;
options {
memoize=true;
output=AST;
}
tokens {
RegExp;
}
RegExpression:
'/' (a=~('/' | NL))+ '/'
-> ^(RegExp[$RegExpression.start, $RegExpression.text] $a+ )
;
fragment NL: '\n' | '\r';
ANY : . ;
I run the command:
java -jar antlr-3.5.2-complete.jar -print Sample.g
and it gives this:
error(10): internal error: Sample.g : java.lang.NullPointerException
org.antlr.grammar.v3.DefineGrammarItemsWalker.rewrite_atom(DefineGrammarItemsWalker.java:3896)
...
...
Updated according to comments
grammar Sample{
memoize=true;
output=AST;
}
tokens {
RegExp;
}
regExpression:
'/' (a=~('/' | NL))+ '/'
-> ^(RegExp[$regExpression.start, $regExpression.text] $a+ )
;
NL: '\n' | '\r';
And here are the errors after running the java -jar antlr-3.5.2-complete.jar Sample.g
error(10): internal error: Sample.g : java.lang.NullPointerException
org.antlr.grammar.v3.CodeGenTreeWalker.getTokenElementST(CodeGenTreeWalker.java:311)
org.antlr.grammar.v3.CodeGenTreeWalker.notElement(CodeGenTreeWalker.java:2886)
org.antlr.grammar.v3.CodeGenTreeWalker.element(CodeGenTreeWalker.java:2431)
org.antlr.grammar.v3.CodeGenTreeWalker.element(CodeGenTreeWalker.java:2446)
org.antlr.grammar.v3.CodeGenTreeWalker.alternative(CodeGenTreeWalker.java:2250)
org.antlr.grammar.v3.CodeGenTreeWalker.block(CodeGenTreeWalker.java:1798)
org.antlr.grammar.v3.CodeGenTreeWalker.ebnf(CodeGenTreeWalker.java:3014)
org.antlr.grammar.v3.CodeGenTreeWalker.element(CodeGenTreeWalker.java:2495)
org.antlr.grammar.v3.CodeGenTreeWalker.alternative(CodeGenTreeWalker.java:2250)
org.antlr.grammar.v3.CodeGenTreeWalker.block(CodeGenTreeWalker.java:1798)
org.antlr.grammar.v3.CodeGenTreeWalker.rule(CodeGenTreeWalker.java:1321)
org.antlr.grammar.v3.CodeGenTreeWalker.rules(CodeGenTreeWalker.java:955)
org.antlr.grammar.v3.CodeGenTreeWalker.grammarSpec(CodeGenTreeWalker.java:877)
org.antlr.grammar.v3.CodeGenTreeWalker.grammar_(CodeGenTreeWalker.java:518)
org.antlr.codegen.CodeGenerator.genRecognizer(CodeGenerator.java:415)
org.antlr.Tool.generateRecognizer(Tool.java:674)
org.antlr.Tool.process(Tool.java:487)
org.antlr.Tool.main(Tool.java:98)
You're trying to use a rewrite rule (tree construction) on a lexer rule. That doesn't make sense.
In ANTLR, all rules with name starting with an uppercase letter are lexer rules. The tree construction is used on AST nodes, not on tokens themselves, so you have to use it on parser rules (starting with lowercase letter).
When you do that, keep in mind that your NL is a fragment now (you cannot use fragments in parser rules) and make sure your ANY token doesn't collide with anything else, i.e. define all needed tokens (/, NL etc.) and put them above the ANY token definition.
I have an application that uses NEST (Elasticsearch .NET client) to communicate with an Elasticsearch cluster. The integration allows the user to specify input for the "query_string" portion of a query.
The user may input an invalid query. Say "AND", which is invalid because the predicate is incomplete. But the error message that comes back from Elasticsearch is exceedingly verbose, and contains terminology that isn't very user-friendly, like "all shards failed".
Is there a way I can offer the user a more meaningful error message (say - "bad predicate"). Ideally, the users search string would be validated without an Elasticsearch round-trip, but I'll settle for a simpler error message however I can get it.
The error message returned by Elasticsearch is verbose but for parsing errors like these, Elasticsearch throws a QueryParsingException. If you examine the error message closely, you'll find the string QueryParsingException towards the end of the entire error message. This is the exception (and its message) you are interested in. For example, when I spelt must as mus2t in a search request, I get a huge error message by Elasticsearch and below is the last part of the error message.
QueryParsingException[[<index name>] bool query does not support [mus2t]]; }]
I got this when I spelt must as mus2t. You can parse and extract out this error message.
You can use validation api.
For following query
var validateResponse = client.Validate<Document>(descriptor => descriptor
.Explain()
.Query(query => query
.QueryString(qs => qs
.OnFields(f => f.Name)
.Query("AND"))));
you will get
org.elasticsearch.index.query.QueryParsingException: [indexname]
Failed to parse query [AND];
org.apache.lucene.queryparser.classic.ParseException: Cannot parse
'AND': Encountered " <AND> "AND "" at line 1, column 0. Was expecting
one of:
<NOT> ...
"+" ...
"-" ...
<BAREOPER> ...
"(" ...
"*" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
<REGEXPTERM> ...
"[" ...
"{" ...
<NUMBER> ...
<TERM> ...
"*" ...
Still not so perfect for end user and it requires round-trip to ES, but maybe it will be helpful.
How to find the line numbers(of source file) of instructions from AST.
example:
for the following code
24> void foo(){
25> System.out.println(" hi ");
26> }
the ast corresponding to print statement is
METHOD_CALL
.
.
System
out
println
ARGUMENT_LIST
EXPR
" hi "
I want to retrieve the line number of "System" from the generated Tree. The answer for "System" should be 25(line number in the source code).
If your Tree for the System token is in fact a CommonTree, then you can use the CommonTree.getToken() method to get the Token for Symbol. You can then call Token.getLine() to get the line number.