I am using a Clang plugin to develop a tool that inserts a 'printf' statement after every statement in order to trace a project. For example:
int main(){
int a=1000;
if(a==1000){
a=999;}
}
After inserting the 'printf' statements:
int main(){
int a=1000;
printf("int a=1000;");
if(a==1000){
a=999;
printf("a=999;");}
}
I based the implementation on the structure of the 'PrintFunctionNames' example that ships with Clang.
I need the complete source location range of each AST node so that I can insert the 'printf' statement in the right place. However, I found that Stmt->getSourceRange() does not return the complete range for some statement types, for example BinaryOperator and UnaryOperator.
a=1000;
| | | |-BinaryOperator 0x565356502960 <line:13:3, col:5> 'int' '='
| | | | |-DeclRefExpr 0x565356502918 <col:3> 'int' lvalue Var 0x565356502880 'a' 'int'
| | | | `-IntegerLiteral 0x565356502940 <col:5> 'int' 1000
In fact, these characters occupy 5 columns in the source file, but the reported range covers only 3 columns.
a++;
-UnaryOperator 0x5653565029b0 <line:14:9, col:10> 'int' postfix '++'
| | | | `-DeclRefExpr 0x565356502988 <col:9> 'int' lvalue Var 0x565356502880 'a' 'int'
This statement is three characters wide (3 columns), but Clang reports a range of only 2 columns.
Do you have any idea how to solve this problem? Could you also give me some advice on how to better determine the location of a complete statement ending with ';'? Thank you!
This is my code:
SourceRange range=declStmt->getSourceRange();
TheRewriter.ReplaceText(declStmt->getSourceRange(),str+";"+"\nprintf(\""+str+"\\n\")");
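A note on why the ranges look short: as far as I know, the end of a Clang SourceRange points at the start of the last token, not at its last character, so the range of a=1000 ends where 1000 begins. Covering the whole statement therefore means skipping past that last token and on to the ';'. In a plugin this would presumably be done with Lexer helpers such as Lexer::getLocForEndOfToken. The Python sketch below only simulates the idea on a plain string; statement_end and its crude token scan are hypothetical and not part of Clang.

```python
def statement_end(source: str, last_token_start: int) -> int:
    """Return the offset just past the ';' that ends the statement.

    last_token_start is the offset where the statement's last token begins,
    which is what the end of a token-based range gives you.
    """
    i = last_token_start
    # Skip the last token itself (crude: an identifier/number, otherwise a
    # run of operator characters such as "++").
    if source[i].isalnum() or source[i] == "_":
        while i < len(source) and (source[i].isalnum() or source[i] == "_"):
            i += 1
    else:
        while i < len(source) and source[i] in "+-*/=<>!&|":
            i += 1
    # Skip whitespace between the token and the terminating semicolon.
    while i < len(source) and source[i].isspace():
        i += 1
    if i < len(source) and source[i] == ";":
        return i + 1
    raise ValueError("no ';' directly after the last token")


line = "  a=1000;"
start = line.index("a")      # begin of the reported range
last = line.index("1000")    # end of the reported range: start of the last token
end = statement_end(line, last)
print(repr(line[start:end]))
```

The extended range now includes the literal and the semicolon, so text inserted after `end` lands behind the complete statement.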
I was upgrading my ANTLR3 grammar to ANTLR4 but found that rule rewiring is not supported in ANTLR4. I would appreciate any advice on how to make the grammar below work in ANTLR4.
fragment date
: DATE (MINUS DATE)* -> ^(TO DATE+)
;
fragment simpleExpression
: expr (OR expr)* -> expr+
;
fragment simpleExpressionWithLiteral
: exprWithLiteral (OR exprWithLiteral)* -> exprWithLiteral+
;
fragment conditionalExpression
: orExpression -> ^(COND orExpression)?
;
fragment orExpression
: andExpression (OR^ andExpression)*
;
fragment andExpression
: atom (AND^ atom)*
;
fragment atom
: exprWithLiteral
| NOT exprWithLiteral -> ^(NOT exprWithLiteral)
| NOT LPAREN orExpression RPAREN-> ^(NOT orExpression)
| LPAREN orExpression RPAREN -> orExpression
;
fragment exprWithLiteral
: expr
| StringLiteral
;
fragment expr
: WORD
| NUMBER
;
The part after -> is not rule rewiring but tree rewriting. ANTLR3 produced an AST which you could manually change using this tree rewriting syntax. ANTLR4 no longer produces ASTs but parse trees, which you cannot change (as they represent the path taken through the grammar).
So the simple solution is to remove everything from the -> to the end of the line (the inline ^ operators, as in OR^, are tree operators too and have to go as well). For example:
fragment date
: DATE (MINUS DATE)* -> ^(TO DATE+)
;
becomes
fragment date
: DATE (MINUS DATE)*
;
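If the grammar is large, the removal can be scripted. The following is a rough sketch, not a real converter: strip_tree_rewrites is a hypothetical helper that works line by line with regular expressions and assumes that neither -> nor ^ occurs inside quoted literals or actions.

```python
import re


def strip_tree_rewrites(grammar: str) -> str:
    """Drop ANTLR3 '-> ...' rewrites and inline '^' tree operators."""
    cleaned = []
    for line in grammar.splitlines():
        # Remove everything from '->' to the end of the line.
        line = re.sub(r"\s*->.*$", "", line)
        # Remove a '^' that directly follows a token or rule name.
        line = re.sub(r"(\w)\^", r"\1", line)
        cleaned.append(line)
    return "\n".join(cleaned)


antlr3 = """fragment date
    : DATE (MINUS DATE)* -> ^(TO DATE+)
    ;
fragment andExpression
    : atom (AND^ atom)*
    ;"""
print(strip_tree_rewrites(antlr3))
```

Only run something like this on the ANTLR3 source: ANTLR4 itself uses -> for lexer commands such as -> skip, which must of course stay.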
I need to build a lexical analyzer using Gocc. However, no option to ignore case is mentioned in the documentation, and I haven't been able to find anything related. Does anyone know how it can be done, or should I use another tool?
/* Lexical part */
_digit : '0'-'9' ;
int64 : '1'-'9' {_digit} ;
switch: 's''w''i''t''c''h';
while: 'w''h''i''l''e';
!whitespace : ' ' | '\t' | '\n' | '\r' ;
/* Syntax part */
<<
import(
"github.com/goccmack/gocc/example/calc/token"
"github.com/goccmack/gocc/example/calc/util"
)
>>
Calc : Expr;
Expr :
Expr "+" Term << $0.(int64) + $2.(int64), nil >>
| Term
;
Term :
Term "*" Factor << $0.(int64) * $2.(int64), nil >>
| Factor
;
Factor :
"(" Expr ")" << $1, nil >>
| int64 << util.IntValue($0.(*token.Token).Lit) >>
;
For example, I want "switch" to be recognized regardless of whether it is written in uppercase or lowercase, without having to type out all the combinations. Flex has %option caseless for this; is there an equivalent in Gocc?
Looking through the docs for that product, I don't see any option for making character literals case-insensitive, nor do I see any way to write a character class, as in pretty well every regex engine and scanner generator. But nothing other than tedium, readability and style stops you from writing
switch: ('s'|'S')('w'|'W')('i'|'I')('t'|'T')('c'|'C')('h'|'H');
while: ('w'|'W')('h'|'H')('i'|'I')('l'|'L')('e'|'E');
That's derived from the old way of doing it in lex without case-insensitivity, which uses character classes to make it quite a bit more readable:
[sS][wW][iI][tT][cC][hH] return T_SWITCH;
[wW][hH][iI][lL][eE] return T_WHILE;
You can come closer to the lex version by defining 26 patterns:
_a: 'a'|'A';
_b: 'b'|'B';
_c: 'c'|'C';
_d: 'd'|'D';
_e: 'e'|'E';
_f: 'f'|'F';
_g: 'g'|'G';
_h: 'h'|'H';
_i: 'i'|'I';
_j: 'j'|'J';
_k: 'k'|'K';
_l: 'l'|'L';
_m: 'm'|'M';
_n: 'n'|'N';
_o: 'o'|'O';
_p: 'p'|'P';
_q: 'q'|'Q';
_r: 'r'|'R';
_s: 's'|'S';
_t: 't'|'T';
_u: 'u'|'U';
_v: 'v'|'V';
_w: 'w'|'W';
_x: 'x'|'X';
_y: 'y'|'Y';
_z: 'z'|'Z';
and then explode the string literals:
switch: _s _w _i _t _c _h;
while: _w _h _i _l _e;
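If you would rather not type the exploded patterns by hand, you can generate them. This is a small sketch of a generator for the first form shown above; caseless is a hypothetical helper and it assumes keywords made of plain ASCII letters and digits (no characters that need escaping in a Gocc literal).

```python
def caseless(keyword: str) -> str:
    """Build a Gocc lexical pattern matching keyword in any letter case."""
    parts = []
    for ch in keyword:
        if ch.isalpha():
            # One alternation per letter: lowercase or uppercase.
            parts.append(f"('{ch.lower()}'|'{ch.upper()}')")
        else:
            # Digits and similar characters have no case.
            parts.append(f"'{ch}'")
    return "".join(parts)


for kw in ("switch", "while"):
    print(f"{kw}: {caseless(kw)};")
```

The printed lines can be pasted into the lexical part of the grammar file.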
I have a fairly large syntax to parse (in order to create a syntax analyzer). The problem is that I have a redundancy somewhere in my code, but I don't know where it is.
Part of the grammar:
Grammar_types
Type :: = Basic_Type
| "PP" "(" Type ")"
| Type "*" Type
| "struct" "(" (Ident ":" Type)+"," ")"
| "(" Type ")" .
Basic_Type :: = "ZZ"| "BOOL" | "STRING" | Ident .
I first tried to analyze this grammar without DCG, for example to parse Id :: = Id ((Id) * ",") *
Example_1
"id","id_0(id1,id2,..)"
Code_1
entete_(ID, Id, Ids) :- atom_concat(XY,')', ID),
atom_concat(XX,Ids, XY),check_ids(Ids),
atom_concat(Id,'(',XX),check_id(Id) ,!.
...
But after some searching, I found that DCG is one of the most effective ways to write a parser, so I came back and wrote the code below:
Code_2
type(Type) --> "struct(", idents_types(Type),")"
| "PP(",ident(Type),")"
| "(",type(Type),")"
| type(Type),"*",type(Type)
| basic_type(Type)
| "error1_type".
...
Example_Syntax
"ZZ" ; "PP(ZZ*STRING)" ; "struct(x:personne,struct(y:PP(PP))" ; "ZZ*ZZ" ...
Test
| ?- phrase(type(L),"struct(aa:struct())").
! Resource error: insufficient memory
% source_info
I think the problem is here (idents_types):
| ?- phrase(idents_types(L),"struct(aa:STRING)").
! Resource error: insufficient memory
Expected result
| ?- type('ZZ*struct(p1:STRING,p2:PP(STRING),p3:(BOOL*PP(STRING)),p4:PP(personne*BOOL))').
p1-STRING
STRING
p2-PP(STRING)
STRING
p3-(BOOL*PP(STRING))
STRING
BOOL
p4-PP(personne*BOOL)
BOOL
personne
ZZ
yes
So my question is: why am I receiving this error, and how can I fix it?
You have a left recursion on type//1.
type(Type) --> ... | type(Type),"*",type(Type) | ...
You can look into this question for further information.
Top-down parsers, from which DCGs borrow, need a means of looking ahead at a symbol that drives the analysis in the right direction. A left-recursive rule calls itself before consuming any input, so the parser keeps recursing until memory runs out.
The usual solution to this problem, as indicated in the link above, is to introduce a service nonterminal that left-associates the recursive applications of the culprit rule, or is epsilon (which terminates the recursion).
An epsilon rule is written like
rule --> [].
The transformation can require a fair bit of thinking... In this answer I suggest an alternative bottom-up implementation, which could be worthwhile if the grammar cannot be transformed for practical or theoretical reasons (LR grammars are more general than LL).
You may want to try this simple-minded transformation, though it certainly leaves several details to be resolved.
type([Type,T]) --> "struct(", idents_types(Type),")", type_1(T)
| "PP(",ident(Type),")", type_1(T)
| "(",type(Type),")", type_1(T)
| basic_type(Type), type_1(T)
| "error1_type", type_1(T).
type_1([Type1,Type2,T]) --> type(Type1),"*",type(Type2), type_1(T).
type_1([]) --> [].
edit
I fixed several problems, in both your code and mine. Now it parses the example...
type([Type,T]) --> "struct(", idents_types(Type), ")", type_1(T)
| "PP(", type(Type), ")", type_1(T)
| "(", type(Type), ")", type_1(T)
| basic_type(Type), type_1(T)
| "error1_type", type_1(T).
% my mistake here...
type_1([]) --> [].
type_1([Type,T]) --> "*",type(Type), type_1(T).
% the output Type was unbound on ZZ,etc
basic_type('ZZ') --> "ZZ".
basic_type('BOOL') --> "BOOL".
basic_type('STRING') --> "STRING".
basic_type(Type) --> ident(Type).
% here would be better to factorize ident(Ident),":",type(Type)
idents_types([Ident,Type|Ids]) --> ident(Ident),":",type(Type),",",
idents_types(Ids).
idents_types([Ident,Type]) --> ident(Ident),":",type(Type).
idents_types([]) --> [].
% ident//1 forgot to 'eat' a character
ident(Id) --> [C], { between(0'a,0'z,C),C\=0'_},ident_1(Cs),{ atom_codes(Id,[C|Cs]),last(Cs,L),L\=0'_}.
ident_1([C|Cs]) --> [C], { between(0'a,0'z,C);between(0'0,0'9,C);C=0'_ },
ident_1(Cs).
ident_1([]) --> [].
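For comparison, here is the same idea outside Prolog: a minimal recursive-descent sketch in Python for a subset of the grammar (basic types, PP(...), parentheses and *, but no struct). It is not a translation of the DCG above. Instead of the type_1 service nonterminal it uses a loop, which consumes input before every further call and so cannot recurse forever. The function names and the tuple-based result are made up for illustration.

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def parse_type(text, pos=0):
    """type ::= primary ('*' primary)*   (the loop replaces the left recursion)"""
    node, pos = parse_primary(text, pos)
    while text.startswith("*", pos):
        right, pos = parse_primary(text, pos + 1)
        node = ("*", node, right)  # build the product left-associatively
    return node, pos


def parse_primary(text, pos):
    """primary ::= 'PP(' type ')' | '(' type ')' | identifier"""
    if text.startswith("PP(", pos):
        inner, pos = parse_type(text, pos + 3)
        return ("PP", inner), expect(text, pos, ")")
    if text.startswith("(", pos):
        inner, pos = parse_type(text, pos + 1)
        return inner, expect(text, pos, ")")
    match = IDENT.match(text, pos)
    if not match:
        raise SyntaxError(f"type expected at offset {pos}")
    return match.group(), match.end()


def expect(text, pos, token):
    """Check that token is at pos and return the offset after it."""
    if not text.startswith(token, pos):
        raise SyntaxError(f"{token!r} expected at offset {pos}")
    return pos + len(token)


def parse(text):
    node, pos = parse_type(text)
    if pos != len(text):
        raise SyntaxError(f"unexpected input at offset {pos}")
    return node


print(parse("ZZ*(BOOL*PP(STRING))"))
```

Every call to parse_type first goes through parse_primary, which either consumes a character or fails, so termination is guaranteed.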
I'm trying to write a piece of grammar to express field access for a hierarchical structure, something like a.b.c where c is a field of a.b and b is a field of a.
To evaluate the value of a.b.c.d.e we need to evaluate the value of a.b.c.d and then get the value of e.
To evaluate the value of a.b.c.d we need to evaluate the value of a.b.c and then get the value of d, and so on...
If you have a tree like this (the arrow means "lhs is parent of rhs"):
Node(e) -> Node(d) -> Node(c) -> Node(b) -> Node(a)
the evaluation is quite simple. Using recursion, we just need to resolve the value of the child and then access the correct field.
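To make that concrete, here is a minimal sketch of such an evaluation in Python. The tuple representation (field, child) and the nested-dictionary environment are assumptions made only for this illustration; they are not what ANTLR produces.

```python
def evaluate(node, env):
    """node is either a variable name or a (field, child) pair."""
    if isinstance(node, str):
        return env[node]                  # innermost node: the root variable
    field, child = node
    return evaluate(child, env)[field]    # resolve the child, then access the field


env = {"a": {"b": {"c": {"d": {"e": 42}}}}}
# Node(e) -> Node(d) -> Node(c) -> Node(b) -> Node(a)
tree = ("e", ("d", ("c", ("b", "a"))))
print(evaluate(tree, env))  # prints 42
```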
The problem is: I have these 3 rules in my ANTLR grammar file:
tokens {
LBRACE = '{' ;
RBRACE = '}' ;
LBRACK = '[' ;
RBRACK = ']' ;
DOT = '.' ;
....
}
reference
: DOLLAR LBRACE selector RBRACE -> ^(NODE_VAR_REFERENCE selector)
;
selector
: IDENT access -> ^(IDENT access)
;
access
: DOT IDENT access? -> ^(IDENT<node=com.at.cson.ast.FieldAccessTree> access?)
| LBRACK IDENT RBRACK access? -> ^(IDENT<node=com.at.cson.ast.FieldAccessTree> access?)
| LBRACK INTEGER RBRACK access? -> ^(INTEGER<node=com.at.cson.ast.ArrayAccessTree> access?)
;
As expected, my tree has this form:
ReferenceTree
IdentTree[a]
FieldAccessTree[b]
FieldAccessTree[c]
FieldAccessTree[d]
FieldAccessTree[e]
The evaluation is not as easy as in the other case because I need to get the value of the current node and then give it to the child, and so on...
Is there any way to reverse the order of the tree using ANTLR, or do I need to do it manually?
You can only do this by using the inline tree operator¹, ^, instead of a rewrite rule.
A demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
LBRACK = '[' ;
RBRACK = ']' ;
DOT = '.' ;
}
parse
: selector+ EOF -> ^(ROOT selector+)
;
selector
: IDENT (access^)*
;
access
: DOT IDENT -> IDENT
| LBRACK IDENT RBRACK -> IDENT
| LBRACK INTEGER RBRACK -> INTEGER
;
IDENT : 'a'..'z'+;
INTEGER : '0'..'9'+;
SPACE : ' ' {skip();};
Parsing the input:
a.b.c a[1][2][3]
will produce the following AST:
¹ For more info about inline tree operators and rewrite rules, see: How to output the AST built using ANTLR?
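As a rough mental model of what IDENT (access^)* does: assuming each access^ makes the access node the new root above everything parsed so far, the result is a left fold over the accessors. The Python sketch below mimics that with tuples of the form (access, subtree); build is a hypothetical helper, not ANTLR API.

```python
from functools import reduce


def build(selector):
    """Fold ['a', 'b', 'c'] so that each access becomes the new root."""
    first, *accesses = selector
    return reduce(lambda tree, access: (access, tree), accesses, first)


print(build(["a", "b", "c"]))   # ('c', ('b', 'a'))
print(build(["a", 1, 2, 3]))    # (3, (2, (1, 'a')))
```

The last access ends up at the top, which is the reversed order asked for in the question.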