How to implement a case-insensitive lexical parser in Go using gocc?

I need to build a lexical analyzer using Gocc, but no option to ignore case is mentioned in the documentation and I haven't been able to find anything related. Does anyone have an idea how it can be done, or should I use another tool?
/* Lexical part */
_digit : '0'-'9' ;
int64 : '1'-'9' {_digit} ;
switch: 's''w''i''t''c''h';
while: 'w''h''i''l''e';
!whitespace : ' ' | '\t' | '\n' | '\r' ;
/* Syntax part */
<<
import(
"github.com/goccmack/gocc/example/calc/token"
"github.com/goccmack/gocc/example/calc/util"
)
>>
Calc : Expr;
Expr :
Expr "+" Term << $0.(int64) + $2.(int64), nil >>
| Term
;
Term :
Term "*" Factor << $0.(int64) * $2.(int64), nil >>
| Factor
;
Factor :
"(" Expr ")" << $1, nil >>
| int64 << util.IntValue($0.(*token.Token).Lit) >>
;
For example, I want to recognize "switch" no matter whether it is uppercase or lowercase, but without having to type out all the combinations. In flex there is %option caseless; is there an equivalent in gocc?

Looking through the docs for that product, I don't see any option for making character literals case-insensitive, nor do I see any way to write a character class, as in pretty much every regex engine and scanner generator. But nothing other than tedium, readability and style stops you from writing
switch: ('s'|'S')('w'|'W')('i'|'I')('t'|'T')('c'|'C')('h'|'H');
while: ('w'|'W')('h'|'H')('i'|'I')('l'|'L')('e'|'E');
That's derived from the old way of doing it in lex without %option caseless, which uses character classes to make it quite a bit more readable:
[sS][wW][iI][tT][cC][hH] return T_SWITCH;
[wW][hH][iI][lL][eE] return T_WHILE;
You can come closer to that readability by defining 26 patterns:
_a: 'a'|'A';
_b: 'b'|'B';
_c: 'c'|'C';
_d: 'd'|'D';
_e: 'e'|'E';
_f: 'f'|'F';
_g: 'g'|'G';
_h: 'h'|'H';
_i: 'i'|'I';
_j: 'j'|'J';
_k: 'k'|'K';
_l: 'l'|'L';
_m: 'm'|'M';
_n: 'n'|'N';
_o: 'o'|'O';
_p: 'p'|'P';
_q: 'q'|'Q';
_r: 'r'|'R';
_s: 's'|'S';
_t: 't'|'T';
_u: 'u'|'U';
_v: 'v'|'V';
_w: 'w'|'W';
_x: 'x'|'X';
_y: 'y'|'Y';
_z: 'z'|'Z';
and then explode the string literals:
switch: _s _w _i _t _c _h;
while: _w _h _i _l _e;
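Since gocc has no caseless option, those 26 helper patterns can at least be generated rather than typed by hand. A small Go sketch (the helper name `caselessPattern` is my own, not part of gocc):

```go
package main

import "fmt"

// caselessPattern returns one gocc token definition that matches a
// lowercase ASCII letter in either case, e.g. "_s: 's'|'S';".
func caselessPattern(c rune) string {
	return fmt.Sprintf("_%c: '%c'|'%c';", c, c, c-'a'+'A')
}

func main() {
	// Emit all 26 definitions, ready to paste into the grammar file.
	for c := 'a'; c <= 'z'; c++ {
		fmt.Println(caselessPattern(c))
	}
}
```

Running it prints the `_a: 'a'|'A';` ... `_z: 'z'|'Z';` block shown above.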

Related

Rcpp sample sugar function, how to use

I am trying to permute the order of elements in a CharacterVector. In R I would simply use:
sample(charvec)
I am trying the same thing in Rcpp using the sample sugar function, but it keeps throwing 'error: no matching function for call to 'sample(Rcpp::CharacterVector&)'. Other sugar functions I have tried, like intersect or sort_unique work fine with CharacterVector, but sample refuses to work. This is the minimal example I have been experimenting with:
cppFunction('CharacterVector samplefunc() {
CharacterVector v = {"Cat", "Dog", "Fox", "Fish", "Lion"} ;
CharacterVector v2 = sample(v) ;
return v2 ;
}')
What I am doing wrong when trying to use the sample sugar function?
You are just missing the size parameter, which is mandatory for Rcpp::sample:
set.seed(42)
Rcpp::cppFunction('CharacterVector samplefunc() {
CharacterVector v = {"Cat", "Dog", "Fox", "Fish", "Lion"} ;
CharacterVector v2 = sample(v, v.size()) ;
return v2 ;
}')
samplefunc()
#> [1] "Lion" "Fish" "Cat" "Dog" "Fox"
UPDATE (about debugging this kind of error): Admittedly, the error you see when you do not provide the size argument is kind of obscure (at least with gcc), but you can see:
file1294a34f4734f.cpp: In function ‘Rcpp::CharacterVector samplefunc()’:
file1294a34f4734f.cpp:8:30: error: no matching function for call to ‘sample(Rcpp::CharacterVector&)’
8 | CharacterVector v2 = sample(v) ;
| ~~~~~~^~~
This is the error: no matching function. And then,
In file included from /***/Rcpp/include/Rcpp/sugar/functions/functions.h:89,
from /***/Rcpp/include/Rcpp/sugar/sugar.h:31,
from /***/Rcpp/include/Rcpp.h:78,
from file1294a34f4734f.cpp:1:
/***/Rcpp/include/Rcpp/sugar/functions/sample.h:437:1: note: candidate: ‘template<int RTYPE> Rcpp::Vector<RTYPE, Rcpp::PreserveStorage> Rcpp::sample(const Rcpp::Vector<RTYPE, Rcpp::PreserveStorage>&, int, bool, Rcpp::sugar::probs_t)’
437 | sample(const Vector<RTYPE>& x, int size, bool replace = false, sugar::probs_t probs = R_NilValue)
| ^~~~~~
where gcc is showing you a candidate, and you can see that this function accepts a constant Vector of any RTYPE (numeric, character...), and that it then needs a size argument, because there is no default. The others (replace, probs) do have defaults. R functions may have missing arguments; C++ functions cannot.

OCaml guards syntax after a value

I can't quite understand the syntax used here:
let rec lex = parser
(* Skip any whitespace. *)
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
Firstly, I don't understand what it means to use a guard (vertical line) followed by parser.
And secondly, I can't seem to find the relevant syntax for the condition surrounded by [< and >]
Got the code from here. Thanks in advance!
|
means: "or" (does the stream match this char or this char or ...?)
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
means:
IF the stream (one char, in this clause, but it can be a sequence of
several chars) matches "space" or "new line" or "carriage return" or
"tabulation".
THEN consume the ("white") matching character and call lex with the
rest of the stream.
ELSE use the next clause (in your example: the one filtering 'A' to 'Z' and 'a' to 'z' chars for identifiers), since the matched (whitespace) character has already been consumed by this clause.
(btw, adding '\r\n', which is "carriage return + newline", would be better to address this historical case; you can do it as an exercise).
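The consume-and-recurse idea of that clause can be sketched outside camlp4/camlp5 as well; here is a rough Go analogue (purely illustrative, `skipWhitespace` is a made-up name):

```go
package main

import "fmt"

// skipWhitespace mirrors the OCaml clause
//   | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
// it consumes leading whitespace characters one at a time and returns
// the rest of the "stream"; any other character ends the clause.
func skipWhitespace(s string) string {
	for len(s) > 0 {
		switch s[0] {
		case ' ', '\n', '\r', '\t':
			s = s[1:] // consume the matching character, continue on the rest
		default:
			return s // next clause would handle this character
		}
	}
	return s
}

func main() {
	fmt.Printf("%q\n", skipWhitespace(" \t\r\n def fib"))
}
```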
To be able to parse streams in OCaml with this syntax, you need the modules from OCaml stdlib (at least Stream and Buffer) and you need the camlp4 or camlp5 syntax extension system that knows the meaning of the keywords parser, [<', etc.
In your toplevel, you can do as follows:
#use "topfind";; (* useless if already in your ~/.ocamlinit file *)
#camlp4o;; (* Topfind directive to load camlp4o in the Toplevel *)
# let st = Stream.of_string "OCaml"
val st : char Stream.t = <abstr>
# Stream.next st
- : char = 'O'
# Stream.next st
- : char = 'C'
(* btw, the exception Stdlib.Stream.Failure must be handled (empty stream) *)
# let rec lex = parser
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
| [< >] -> [< >]
(* just the beginning of the parser definition *)
# val lex : char Stream.t -> 'a = <fun>
Now you are up and running to deal with streams and LL(1) stream parsers.
The example you mentioned works well. If you play within the Toplevel, you can evaluate the token.ml and lexer.ml files with the #use directive to respect the module names (#use "token.ml"). Or you can directly evaluate the expressions of lexer.ml if you nest the type token in a module Token.
# let rec lex = parser (* complete definition *)
val lex : char Stream.t -> Token.token Stream.t = <fun>
val lex_number : Buffer.t -> char Stream.t -> Token.token Stream.t = <fun>
val lex_ident : Buffer.t -> char Stream.t -> Token.token Stream.t = <fun>
val lex_comment : char Stream.t -> Token.token Stream.t = <fun>
# let pgm =
"def fib(x) \
if x < 3 then \
1 \
else \
fib(x-1)+fib(x-2)";;
val pgm : string = "def fib(x) if x < 3 then 1 else fib(x-1)+fib(x-2)"
# let cs' = lex (Stream.of_string pgm);;
val cs' : Token.token Stream.t = <abstr>
# Stream.next cs';;
- : Token.token = Token.Def
# Stream.next cs';;
- : Token.token = Token.Ident "fib"
# Stream.next cs';;
- : Token.token = Token.Kwd '('
# Stream.next cs';;
- : Token.token = Token.Ident "x"
# Stream.next cs';;
- : Token.token = Token.Kwd ')'
You get the expected stream of type token.
Now a few technical words about camlp4 and camlp5.
It's indeed recommended not to use the so-called "camlp4", which is being deprecated, and instead use "camlp5", which is in fact the "genuine camlp4" (see hereafter), assuming you want an LL(1) parser.
For that, you can use the following camlp5 Toplevel directive instead of the camlp4 one:
#require "camlp5";; (* add the path + loads the module (topfind directive) *)
#load "camlp5o.cma";; (* patch: manually loads the camlp5o module,
because #require forgets to do it (why?);
"o" in "camlp5o" stands for "original syntax" *)
let rec lex = parser
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
| [< >] -> [< >]
# val lex : char Stream.t -> 'a = <fun>
More history about camlp4 and camlp5.
Disclaimer: while I try to be as neutral and factual as possible, this too-short explanation may also reflect my personal opinion. Of course, discussion is welcome.
As an OCaml beginner, I found camlp4 very attractive and powerful, but it was not easy to distinguish what exactly camlp4 was or to find its most recent documentation.
In very brief :
It's an old and confusing story, mainly because of the naming of "camlp4". camlp4 is a/the historical syntax extension system for OCaml. Someone decided to improve/retrofit camlp4 around 2006, but it seems that some design decisions turned it into something considered by some people as a "beast" (often, less is more). So, it works, but "there is a lot of stuff under the hood" (its signature became very large).
Its original author, Daniel de Rauglaudre, decided to keep on developing camlp4 his way and renamed it "camlp5" to differentiate it from what was the "new camlp4" (named camlp4). Even if camlp5 is not widely used, it's still maintained, operational and used, for example, by Coq, which has recently integrated a part of camlp5 instead of depending on the whole camlp5 library (which doesn't mean that "Coq doesn't use camlp5 anymore", as you could read).
ppx has become a mainstream syntax extension technology in the OCaml world (it seems that it's dedicated to making "limited and reliable" OCaml syntax extensions, mainly for small and very useful code generation (helper functions, etc.); that's a side discussion). It doesn't mean that camlp5 is "deprecated". camlp5 is certainly misunderstood. I had a hard time at the beginning, mainly because of its documentation. I wish I could have read this post at that time! Anyway, when programming in OCaml, I believe it's a good thing to explore all kinds of technology. It's up to you to form your own opinion.
So, the today so-called "camlp4" is in fact the "old camlp4" (or the "new camlp4 of the past"; I know, it's complicated).
LALR(1) parsers such as ocamlyacc or menhir are, or have been made, mainstream. They have a bottom-up approach (define .mll and .mly files, then compile them to OCaml code).
LL(1) parsers, such as camlp4/camlp5, have a top-down approach, very close to functional style.
The best thing is to compare them yourself. Implementing a lexer/parser for your language is perfect for that: with ocamllex/menhir and with ocamllex/camlp5, or even with camlp5 alone, because it's also a lexer (with pros/cons).
I hope you'll enjoy your LLVM tutorial.
All technical and historical complementary comments are very welcome.
As @glennsl says, this page uses the camlp4 preprocessor, which is considered obsolete by many in the OCaml community.
Here is a forum message from August 2019 that describes how to move from camlp4 to the more recent ppx:
The end of camlp4
Unfortunately that doesn't really help you learn what that LLVM page is trying to teach you, which has little to do with OCaml it seems.
This is one reason I find the use of syntax extensions to be problematic. They don't have the staying power of the base language.
(On the other hand, OCaml really is a fantastic language for writing compilers and other language tools.)

Parsing multiple instances of data

I am trying to parse multiple instances of data from a text file. I can grep and grab one line and the lat/lon associated with that find, but I am having issues parsing multiple instances:
... CATEGORICAL ...
SLGT 33618675 34608681 35658642 36668567 38218542 41018363
41588227 41918045 41377903 40177805 38927813 37817869
36678030 35068154 33368262 33078321 32888462 33618675
SLGT 30440169 31710202 33010185 33730148 34010037 33999962
33709892 32869871 30979883 29539912 29430025 30440169
SLGT 41788755 41698893 42069059 42639132 43889124 44438960
44438757 43988717 43278708 42398720 41788755
MRGL 42897922 41907743 40147624 38837627 37637700 35897915
35028021 34038079 33118130 31998226 31698419 32078601
32818733 33848809 34758764 36998623 38588677 39458701
40178757 40608870 41069099 43549479 44499512 44809478
45259379 44989263 45109100 45718986 46478920 46758853
46738752 46398664 44768565 44308457 43198218
MRGL 29720174 31900221 33650181 34160154 34430032 34649931
34159800 32539784 31359767 29739808 29299723 28969581
28959440 99999999 26769674 26579796 26139874
TSTM 45077438 43177245 40597113 99999999 30488085 30248563
29588926 28739072 28569092 99999999 27138160 27578139
27908100 27848061 27518032 26968006 26338005 25698017
25338025 25088048 25058071 25238109 25578128 25888157
26218171 26578170 26988163 27138160 99999999 29200399
31910374 33520340 35190229 35450147 36109944 36399709
35779395 36399167 38559059 40189373 41729594 43029985
42820283 42860489 43580863 44121062 44521135 45281179
46271166 47561286 48251548 48671765 49051814 99999999
38810245 37660271 37120322 36950398 37090559 37380662
38090741 39410791 39980777 40930695 41380598 41370510
41190353 40840299 40220263 38810245
From: https://www.spc.noaa.gov/products/outlook/archive/2019/KWNSPTSDY1_201906241300.txt
Here is my code and results:
#!/bin/sh
sed -n '/^MRGL/,/^TSTM/p;/^TSTM/q' day1_status | sed '$ d' | sed -e 's/MRGL//g' > MRGL
while read line
do
count=1
ncols=$(echo $line | wc -w)
while [ $count -le $ncols ]
do
echo $line | cut -d' ' -f$count
((count++))
done
done < MRGL > MRGL_output.txt
cat MRGL_output.txt | sed ':a;s/\B[0-9]\{2\}\>/.&/;ta'| sed 's/./, -/6' > MRGL_final
Results:
one instance of MRGL and the lat/lon associated with that polygon
more MRGL
32947889 34137855 35307825 36147735 36327622 35797468
27107968 25518232 99999999 27088303 28418215 30208125
30618064
Turn the line above into a single instance of lines
more MRGL_output.txt
32947889
34137855
35307825
36147735
36327622
35797468
27107968
25518232
99999999
27088303
28418215
30208125
30618064
Final format that I need it in
more MRGL_final
32.94, -78.89
34.13, -78.55
35.30, -78.25
36.14, -77.35
36.32, -76.22
35.79, -74.68
27.10, -79.68
25.51, -82.32
99.99, -99.99
27.08, -83.03
28.41, -82.15
30.20, -81.25
30.61, -80.64
Just need to parse multiple instances that show up.
UPDATE for better explanation.
... CATEGORICAL ...
ENH 38298326 40108202 40518094 40357974 39907953 39017948
38038052 36148202 35848297 35888367 36618371 38298326
SLGT 30440169 31710202 33010185 33730148 34010037 33999962
33709892 32869871 30979883 29539912 29430025 30440169
SLGT 33548672 34408661 35918543 36858496 38648520 41018363
41588227 41918045 41377903 40177805 38927813 37817869
36678030 35068154 33368262 33078321 32888462 33548672
SLGT 41788755 41698893 42069059 42639132 43889124 44438960
44438757 43988717 43278708 42398720 41788755
MRGL 29720174 31900221 33650181 34160154 34430032 34649931
34159800 32539784 31359767 30059748 29299723 28969581
28959440 99999999 26769674 26579796 26139874
MRGL 42897922 41907743 40147624 38837627 37637700 35897915
35028021 34038079 33118130 31938225 30758424 30678620
30988709 34128741 36208583 37738554 39508601 40628878
41069099 43549479 44499512 44809478 45259379 44989263
45109100 45718986 46478920 46758853 46738752 46398664
44768565 44308457 43198218
TSTM 30488085 29978211 29408316 29068379 99999999 27138160
27578139 27908100 27848061 27518032 26968006 26338005
25698017 25338025 25088048 25058071 25238109 25578128
25888157 26218171 26578170 26988163 27138160 99999999
45427410 43217292 40247181 99999999 28650405 31910374
33520340 35190229 35450147 36109944 36399709 35779395
36769245 38319148 40189373 41219571 41299753 39959979
38220054 37320091 36560136 36070290 36100295 35840394
36790544 37150626 37880709 39110774 40120876 41150895
41600769 41890540 43070599 43580863 43390914 43401262
44171458 45521497 46131301 47181242 47561286 48251548
48671765 49371856
Wanting to take this data set above and grab each available risk ENH, SLGT, MRGL, TSTM lat and long and place into this format:
"Enhanced Risk"
38.29, -83.26
40.10, -82.02
40.51, -80.94
40.35, -79.74
39.90, -79.53
39.01, -79.48
38.03, -80.52
36.14, -82.02
35.84, -82.97
35.88, -83.67
36.61, -83.71
38.29, -83.26
End:
"Slight Risk"
30.44, -101.69
31.71, -102.02
33.01, -101.85
33.73, -101.48
34.01, -100.37
33.99, -99.62
33.70, -98.92
32.86, -98.71
30.97, -98.83
29.53, -99.12
29.43, -100.25
30.44, -101.69
End:
"Slight Risk"
33.54, -86.72
34.40, -86.61
35.91, -85.43
36.85, -84.96
38.64, -85.20
41.01, -83.63
41.58, -82.27
41.91, -80.45
41.37, -79.03
40.17, -78.05
38.92, -78.13
37.81, -78.69
36.67, -80.30
35.06, -81.54
33.36, -82.62
33.07, -83.21
32.88, -84.62
33.54, -86.72
End:
"Slight Risk"
41.78, -87.55
41.69, -88.93
42.06, -90.59
42.63, -91.32
43.88, -91.24
44.43, -89.60
44.43, -87.57
43.98, -87.17
43.27, -87.08
42.39, -87.20
41.78, -87.55
End:
"Marginal Risk"
29.72, -101.74
31.90, -102.21
33.65, -101.81
34.16, -101.54
34.43, -100.32
34.64, -99.31
34.15, -98.00
32.53, -97.84
31.35, -97.67
30.05, -97.48
29.29, -97.23
28.96, -95.81
28.95, -94.40
26.76, -96.74
26.57, -97.96
26.13, -98.74
End:
Here's a little awk program which seems to work, although I'm not certain about some of the details. In particular, I don't know what the minimum value for longitude is; evidently, a value under the minimum has 100 added to it before the longitude is negated. So you'll have to change LON_THRESHOLD to what you consider the correct value.
I've tried to avoid the usual temptation to golf awk programs into a textual minimum, in the hopes that the way this program works is less obscure. But it's entirely possible that some awkisms snuck in anyway. I added a bit of explanation at the end.
BEGIN { risk["HIGH"] = "High Risk"
risk["ENH"] = "Enhanced Risk"
risk["SLGT"] = "Slight Risk"
risk["MRGL"] = "Marginal Risk"
LON_THRESHOLD = 30
END_STRING = "End:"
}
END { if (in_risk) print END_STRING }
in_risk && substr($0, 1, 1) != " " {
print END_STRING "\n" "\n"
in_risk = 0
}
$1 in risk { printf("\"%s\"\n", risk[$1])
in_risk = 2
}
in_risk { for (i = in_risk; i <= NF; ++i) {
lat = substr($i, 1, 4) / 100
lon = substr($i, 5, 4) / 100
if (lon < LON_THRESHOLD) lon += 100
printf "%5.2f, %.2f\n", lat, -lon
}
in_risk = 1
}
Save that program as, for example, noaa.awk, and then apply it with:
awk -f noaa.awk input.txt
By way of explanation:
Awk programs consist of a series of rules. Each rule has a predicate -- that is, an expression which evaluates to a true or false value -- and an action.
Awk processes each line from its input in turn, running through all of the rules and executing the actions of the ones whose predicates evaluate to a true value. Inside the action, you can use the $ operator to access individual fields in the input (by default, fields are separated with whitespace). $0 is the entire input line, and $n is field number n. Unlike bash/sh, $ is an operator and can be applied to an expression.
BEGIN and END rules are special, in that they are not real predicates. BEGIN rules are executed exactly once, before any other processing; END rules are executed exactly once, after all processing is finished. In this example, as is common, BEGIN is used to initialise reference data, while END is used for any necessary termination -- in this case, printing the final End: line.
In cases like this, where the desired action is really dependent on where we are in the file, it's necessary to build some kind of state machine, and I did that using the variable in_risk, which has three possible values:
0 or undefined: We're not currently in a block corresponding to a risk selector.
1: The current line, if it starts with a space, is part of a previously identified risk selector.
2: The current line has been detected as starting with a risk selector.
The reason for the difference between the last two values is that $1 in a line which starts with a risk selector is the risk selector, whereas in a line which starts with a space, $1 is actually the first number. So when we're iterating over the numbers in a line, we have to start with $2 for lines which start with a risk selector.
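The decoding step itself (split the 8 digits, divide each half by 100, add 100 to small longitudes, negate the longitude) can also be sketched in Go; `lonThreshold` below is the same guess as the awk program's LON_THRESHOLD, and the function name is mine:

```go
package main

import (
	"fmt"
	"strconv"
)

// lonThreshold mirrors the awk program's LON_THRESHOLD guess: longitudes
// below it are assumed to have dropped a leading 1 (e.g. 0169 -> 101.69).
const lonThreshold = 30.0

// decode turns one 8-digit record like "32947889" into "32.94, -78.89".
func decode(rec string) (string, error) {
	if len(rec) != 8 {
		return "", fmt.Errorf("want 8 digits, got %q", rec)
	}
	lat, err := strconv.Atoi(rec[:4])
	if err != nil {
		return "", err
	}
	lon, err := strconv.Atoi(rec[4:])
	if err != nil {
		return "", err
	}
	flat := float64(lat) / 100
	flon := float64(lon) / 100
	if flon < lonThreshold {
		flon += 100 // restore the dropped leading digit
	}
	return fmt.Sprintf("%.2f, -%.2f", flat, flon), nil
}

func main() {
	for _, rec := range []string{"32947889", "30440169"} {
		s, _ := decode(rec)
		fmt.Println(s) // 32.94, -78.89 then 30.44, -101.69
	}
}
```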
If you're just asking how to turn a file of lines like AABBCCDD into lines like AA.BB, -CC.DD:
perl -nE '/^(..)(..)(..)(..)$/ && say "$1.$2, -$3.$4"' MRGL_output.txt
(There's almost certainly better ways to get from your original input to those lines, but I'm not really clear on what your posted code is doing or why)
I think this will process your original input correctly, but can't be sure because the numbers in your sample output don't match up with your sample input so I can't verify:
perl -anE 'if (/^MRGL/ .. /^TSTM/) { exit if /^TSTM/; push @nums, @F }
END { for (@nums) {
if (/^(..)(..)(..)(..)$/) { say "$1.$2, -$3.$4" }
}}' day1_status
Got GNU Awk?
awk -v RS='\\s+' '
/[A-Z]/ {p = /^MRGL$/? 1: 0; next}
p {print gensub(/(..)(..)(..)(..)/, "\\1.\\2, -\\3.\\4", "G")}
' file
-v RS='\\s+' - Use any amount of whitespace as the Record Separator
/[A-Z]/ {...} - On records with uppercase alphabetics, do
p = /^MRGL$/? 1: 0; next - Set flag if record is MRGL, else unset, but always skip any other rules.
p {print gensub(...)} - Print result of gensub if flag is set
/(...)/, "\\1", "G" - Capturing groups, Backreferences, Global substitution.

antlr3: Java heap space when testing parser

I'm trying to build a simple config-file reader to read files of this format:
A .-
B -...
C -.-.
D -..
E .
This is the grammar I have so far:
grammar def;
@header {
package mypackage.parser;
}
@lexer::header { package mypackage.parser; }
file
: line+;
line : ID WS* CODE NEWLINE;
ID : ('A'..'Z')*
;
CODE : ('-'|'.')*;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
) {$channel=HIDDEN;}
;
NEWLINE:'\r'? '\n' ;
And this is my test rig (junit4)
@Test
public void BasicGrammarCheckGood() {
String CorrectlyFormedLine="A .-;\n";
ANTLRStringStream input;
defLexer lexer;
defParser parser;
input = new ANTLRStringStream(CorrectlyFormedLine);
lexer = new defLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
parser = new defParser(tokens);
try {
parser.line();
}
catch(RecognitionException re) { fail(re.getMessage()); }
}
If I run this test with a correctly formatted string, the code exits without any exception or output.
However, if I feed the parser an invalid string like "xA .-;\n", the code spins for a while and then exits with a "Java heap space" error.
(If I start my test with the top-level rule 'file', then I get the same result - with the additional (repeated) output of "line 1:0 mismatched input '' expecting CODE")
What's going wrong here? I never seem to get the RecognitionException for the invalid input.
EDIT: Here's my grammar file (Fragment), after being provided advice here - this avoids the 'Java heap space' issue.
file
: line+ EOF;
line : ID WS* CODE NEWLINE;
ID : ('A'..'Z')('A'..'Z')*
;
CODE : ('-'|'.')('-'|'.')*;
Some of your lexer rules match zero characters (an empty string):
ID : ('A'..'Z')*
;
CODE : ('-'|'.')*;
There are, of course, an infinite number of empty strings in your input, causing your lexer to keep producing tokens, resulting in a heap space error after a while.
Always let lexer rules match at least 1 character.
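The underlying rule -- every token match must consume at least one character, or the tokenizer never makes progress -- can be illustrated with a hand-written Go sketch (not ANTLR-generated code; the names are made up):

```go
package main

import "fmt"

// nextToken matches ('-'|'.')+ starting at pos -- at least one character,
// like the corrected CODE rule. A ('-'|'.')* rule could legally return a
// zero-length match, and the caller's loop below would then spin forever.
func nextToken(input string, pos int) (tok string, next int) {
	start := pos
	for pos < len(input) && (input[pos] == '-' || input[pos] == '.') {
		pos++
	}
	return input[start:pos], pos
}

func main() {
	input := "-..."
	pos := 0
	for pos < len(input) {
		tok, next := nextToken(input, pos)
		if next == pos { // no progress: a zero-length match -- bail out
			fmt.Println("stuck: zero-length match at", pos)
			break
		}
		fmt.Printf("token %q\n", tok)
		pos = next
	}
}
```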
EDIT
Two (small) remarks:
since you put the WS token on the hidden channel, you don't need to add them in your parser rules. So line becomes line : ID CODE NEWLINE;
something like ('A'..'Z')('A'..'Z')* can be written like this: ('A'..'Z')+

ANTLR: field access and evaluation

I'm trying to write a piece of grammar to express field access for a hierarchical structure, something like a.b.c, where c is a field of a.b and b is a field of a.
To evaluate the value of a.b.c.d.e, we need to evaluate the value of a.b.c.d and then get the value of e.
To evaluate the value of a.b.c.d, we need to evaluate the value of a.b.c and then get the value of d, and so on...
If you have a tree like this (the arrow means "lhs is parent of rhs"):
Node(e) -> Node(d) -> Node(c) -> Node(b) -> Node(a)
the evaluation is quite simple. Using recursion, we just need to resolve the value of the child and then access to the correct field.
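That recursion -- resolve the child's value first, then access the field on it -- can be sketched in Go (a toy model, not ANTLR output; the `access` type and `eval` function are invented for illustration):

```go
package main

import "fmt"

// access is one link in a chain like a.b.c: it names a field and points
// at the node it is applied to (its child in the tree described above).
type access struct {
	field string
	child *access // nil for the root identifier
}

// eval resolves the child's value first, then looks this field up in it --
// exactly the recursion from Node(e) down to Node(a).
func eval(a *access, env map[string]interface{}) interface{} {
	if a.child == nil {
		return env[a.field] // the root identifier, e.g. "a"
	}
	obj := eval(a.child, env).(map[string]interface{})
	return obj[a.field]
}

func main() {
	// Build the chain c -> b -> a for the expression a.b.c.
	chain := &access{"c", &access{"b", &access{"a", nil}}}
	env := map[string]interface{}{
		"a": map[string]interface{}{
			"b": map[string]interface{}{"c": 42},
		},
	}
	fmt.Println(eval(chain, env)) // 42
}
```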
The problem is: I have these 3 rules in my ANTLR grammar file:
tokens {
LBRACE = '{' ;
RBRACE = '}' ;
LBRACK = '[' ;
RBRACK = ']' ;
DOT = '.' ;
....
}
reference
: DOLLAR LBRACE selector RBRACE -> ^(NODE_VAR_REFERENCE selector)
;
selector
: IDENT access -> ^(IDENT access)
;
access
: DOT IDENT access? -> ^(IDENT<node=com.at.cson.ast.FieldAccessTree> access?)
| LBRACK IDENT RBRACK access? -> ^(IDENT<node=com.at.cson.ast.FieldAccessTree> access?)
| LBRACK INTEGER RBRACK access? -> ^(INTEGER<node=com.at.cson.ast.ArrayAccessTree> access?)
;
As expected, my tree has this form:
ReferenceTree
IdentTree[a]
FieldAccessTree[b]
FieldAccessTree[c]
FieldAccessTree[d]
FieldAccessTree[e]
The evaluation is not as easy as in the other case because I need to get the value of the current node and then give it to the child, and so on...
Is there any way to reverse the order of the tree using ANTLR, or do I need to do it manually?
You can only do this by using the inline tree operator [1], ^, instead of a rewrite rule.
A demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
LBRACK = '[' ;
RBRACK = ']' ;
DOT = '.' ;
}
parse
: selector+ EOF -> ^(ROOT selector+)
;
selector
: IDENT (access^)*
;
access
: DOT IDENT -> IDENT
| LBRACK IDENT RBRACK -> IDENT
| LBRACK INTEGER RBRACK -> INTEGER
;
IDENT : 'a'..'z'+;
INTEGER : '0'..'9'+;
SPACE : ' ' {skip();};
Parsing the input:
a.b.c a[1][2][3]
will produce the following AST:
[1] For more info about inline tree operators and rewrite rules, see: How to output the AST built using ANTLR?
