antlr4: perplexed about whitespace handling

I know this question has been asked in more or less the same terms before, but none of the answers are working for me:
grammar Problem;
top: (IDENT | INT)*;
IDENT: (ALPHA|'_') (ALPHA|DIGIT|'_')*;
INT: DEC_INT | HEX_INT;
DEC_INT: (ZERO | (NZERO_DIGIT DIGIT*));
HEX_INT: ZERO X HEX+;
ZERO: '0';
NZERO_DIGIT: '1'..'9';
DIGIT: '0'..'9';
ALPHA: [a-zA-Z];
HEX: [0-9a-fA-F];
X: [xX];
WS: [ \t\r\n]+ -> skip;
When I give this input to the parser:
0xFF ZZ123
followed by a newline and Ctrl-D, it gets parsed as:
(top 0xFF ZZ123)
Which is the intended behaviour.
However when I give this input to the parser:
0xFFZZ123
followed by a newline and Ctrl-D, it gets parsed as:
(top 0xFF ZZ123)
which is not at all intended. I would like this to trigger a lexer error, treating the input as a misspelled HEX_INT.
If I disable whitespace skipping, I still get the same lexer behaviour (a single group of characters split into two tokens); however, since WS tokens are now passed to the parser, I get the following error:
0XFFZZ123
line 1:9 extraneous input '\n' expecting {<EOF>, IDENT, INT}
(top 0XFF ZZ123 \n)
In addition, I can no longer use space-separated tokens (expected, since top does not mention WS):
0XFF ZZ123
line 1:4 extraneous input ' ' expecting {<EOF>, IDENT, INT}
(top 0XFF ZZ123)
I have tried to fix the grammar by disabling whitespace skipping and changing the top rule to:
top: WS* (IDENT | INT) (WS+ (IDENT|INT))* WS*;
However, if I feed the following stream to the parser:
0xFF ZZ123 0XFFZZ123
I get this error:
line 1:20 extraneous input 'ZZ123' expecting {<EOF>, WS}
(top 0xFF ZZ123 0xFF ZZ123 \n)
where you can still see that the last input token has been split into 0XFF and ZZ123, whereas I would really like a lexing error to be triggered here, instead of having to handle whitespace explicitly in the parser.
So what combination of tricks do I have to use to obtain the desired behaviour?

You can write a token rule that matches erroneous input like 0XFFZZ123 and place it just before WS. For example:
grammar SandBox;
top: (IDENT | INT)*;
IDENT: (ALPHA|'_') (ALPHA|DIGIT|'_')*;
INT: DEC_INT | HEX_INT;
DEC_INT: (ZERO | (NZERO_DIGIT DIGIT*));
HEX_INT: ZERO X HEX+;
ZERO: '0';
NZERO_DIGIT: '1'..'9';
DIGIT: '0'..'9';
ALPHA: [a-zA-Z];
HEX: [0-9a-fA-F];
X: [xX];
ERROR_TOKEN: (~[ \t\r\n])+;
WS: [ \t\r\n]+ -> skip;
What happens is the following: if you input 0xFF ZZ123, INT and IDENT win because of their position; if you input 0XFFZZ123, ERROR_TOKEN wins because of its length (the longest match has priority over rule order). Since ERROR_TOKEN is not used in top, an error is raised.
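The matching policy can be sketched outside ANTLR. The following hypothetical Python mini-lexer (just an illustration of longest-match-then-rule-order, not ANTLR's actual machinery) shows why ERROR_TOKEN swallows 0xFFZZ123 whole:

```python
import re

# Rules in the same order as the grammar above.
RULES = [
    ("IDENT", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("INT", r"0[xX][0-9a-fA-F]+|0|[1-9][0-9]*"),
    ("ERROR_TOKEN", r"[^ \t\r\n]+"),
    ("WS", r"[ \t\r\n]+"),
]

def lex(text):
    pos, tokens = 0, []
    while pos < len(text):
        # Try every rule at the current position; max() keeps the first
        # of equally long matches, mirroring ANTLR's rule-order tie-break.
        name, value = max(
            ((n, m.group()) for n, p in RULES if (m := re.match(p, text[pos:]))),
            key=lambda t: len(t[1]),
        )
        if name != "WS":  # like `-> skip`
            tokens.append((name, value))
        pos += len(value)
    return tokens

print(lex("0xFF ZZ123"))  # [('INT', '0xFF'), ('IDENT', 'ZZ123')]
print(lex("0xFFZZ123"))   # [('ERROR_TOKEN', '0xFFZZ123')]
```

With a space, INT and IDENT each win their own run of characters; without it, ERROR_TOKEN's 9-character match beats INT's 4-character match.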
I hope this solves the problem.

Related

How to implement case insensitive lexical parser in Golang using gocc?

I need to build a lexical analyzer using Gocc; however, no option to ignore case is mentioned in the documentation, and I haven't been able to find anything related. Does anyone have any idea how it can be done, or should I use another tool?
/* Lexical part */
_digit : '0'-'9' ;
int64 : '1'-'9' {_digit} ;
switch: 's''w''i''t''c''h';
while: 'w''h''i''l''e';
!whitespace : ' ' | '\t' | '\n' | '\r' ;
/* Syntax part */
<<
import(
"github.com/goccmack/gocc/example/calc/token"
"github.com/goccmack/gocc/example/calc/util"
)
>>
Calc : Expr;
Expr :
Expr "+" Term << $0.(int64) + $2.(int64), nil >>
| Term
;
Term :
Term "*" Factor << $0.(int64) * $2.(int64), nil >>
| Factor
;
Factor :
"(" Expr ")" << $1, nil >>
| int64 << util.IntValue($0.(*token.Token).Lit) >>
;
For example, I want "switch" to be recognized whether it is uppercase or lowercase, but without having to type out all the combinations. Flex has %option caseless; is there an equivalent in Gocc?
Looking through the docs for that product, I don't see any option for making character literals case-insensitive, nor do I see any way to write a character class, as found in pretty much every regex engine and scanner generator. But nothing other than tedium, readability, and style stops you from writing:
switch: ('s'|'S')('w'|'W')('i'|'I')('t'|'T')('c'|'C')('h'|'H');
while: ('w'|'W')('h'|'H')('i'|'I')('l'|'L')('e'|'E');
That's derived from the old way of doing it in lex without case-insensitivity, which uses character classes to make it quite a bit more readable:
[sS][wW][iI][tT][cC][hH] return T_SWITCH;
[wW][hH][iI][lL][eE] return T_WHILE;
You can come closer to that readability by defining 26 patterns:
_a: 'a'|'A';
_b: 'b'|'B';
_c: 'c'|'C';
_d: 'd'|'D';
_e: 'e'|'E';
_f: 'f'|'F';
_g: 'g'|'G';
_h: 'h'|'H';
_i: 'i'|'I';
_j: 'j'|'J';
_k: 'k'|'K';
_l: 'l'|'L';
_m: 'm'|'M';
_n: 'n'|'N';
_o: 'o'|'O';
_p: 'p'|'P';
_q: 'q'|'Q';
_r: 'r'|'R';
_s: 's'|'S';
_t: 't'|'T';
_u: 'u'|'U';
_v: 'v'|'V';
_w: 'w'|'W';
_x: 'x'|'X';
_y: 'y'|'Y';
_z: 'z'|'Z';
and then explode the string literals:
switch: _s _w _i _t _c _h;
while: _w _h _i _l _e;
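If writing the 26 alternation patterns by hand is too tedious, they can be generated. A small Python sketch (the explode helper is my own naming, not part of Gocc):

```python
# Emit the 26 case-insensitive character patterns for a Gocc grammar.
for c in "abcdefghijklmnopqrstuvwxyz":
    print(f"_{c}: '{c}'|'{c.upper()}';")

def explode(word):
    """Explode a keyword into the generated pattern names."""
    return f"{word}: " + " ".join("_" + ch for ch in word) + ";"

print(explode("switch"))  # switch: _s _w _i _t _c _h;
print(explode("while"))   # while: _w _h _i _l _e;
```

Paste the printed lines into the lexical part of the grammar.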

OCaml guards syntax after a value

I can't quite understand the syntax used here:
let rec lex = parser
(* Skip any whitespace. *)
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
Firstly, I don't understand what it means to use a guard (vertical line) followed by parser.
And secondly, I can't seem to find the relevant syntax for the condition surrounded by [< and >]
Got the code from here. Thanks in advance!
|
means "or" (does the stream match this char, or this char, or ...?)
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
means:
IF the stream (one char, in this clause, but it can be a sequence of
several chars) matches "space" or "new line" or "carriage return" or
"tabulation".
THEN consume the ("white") matching character and call lex with the
rest of the stream.
ELSE try the next clause (in your example: the one matching 'A' to 'Z' and
'a' to 'z' chars for identifiers).
(By the way, also handling the '\r\n' sequence, i.e. carriage return + newline, would better address this historical case; you can do it as an exercise.)
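The two clauses can be mimicked with an ordinary function. A rough Python sketch (an illustration of the control flow only, not of camlp4's actual stream machinery; the stream is simplified to a string):

```python
def lex(stream):
    # Clause [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream :
    # if the first char is white, consume it and recurse on the rest.
    if stream and stream[0] in " \n\r\t":
        return lex(stream[1:])
    # Clause [< >] -> [< >] : nothing (left) to skip.
    return stream

print(lex("   \t\nlex"))  # -> "lex"
```

Each recursive call corresponds to "consume the white char and call lex with the rest of the stream".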
To be able to parse streams in OCaml with this syntax, you need the OCaml stdlib modules (at least Stream and Buffer) and the camlp4 or camlp5 syntax extension system, which gives meaning to the keyword parser, the [< ... >] brackets, etc.
In your toplevel, you can do as follows:
#use "topfind";; (* useless if already in your ~/.ocamlinit file *)
#camlp4o;; (* Topfind directive to load camlp4o in the Toplevel *)
# let st = Stream.of_string "OCaml"
val st : char Stream.t = <abstr>
# Stream.next st
- : char = 'O'
# Stream.next st
- : char = 'C'
(* btw, Exception: Stdlib.Stream.Failure must be handled (empty stream) *)
# let rec lex = parser
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
| [< >] -> [< >]
(* just the beginning of the parser definition *)
# val lex : char Stream.t -> 'a = <fun>
Now you are up and running to deal with streams and LL(1) stream parsers.
The example you mentioned works well. If you play within the Toplevel, you can evaluate the token.ml and lexer.ml files with the #use directive to respect the module names (#use "token.ml"). Or you can directly evaluate the expressions of lexer.ml if you nest the type token in a module Token.
# let rec lex = parser (* complete definition *)
val lex : char Stream.t -> Token.token Stream.t = <fun>
val lex_number : Buffer.t -> char Stream.t -> Token.token Stream.t = <fun>
val lex_ident : Buffer.t -> char Stream.t -> Token.token Stream.t = <fun>
val lex_comment : char Stream.t -> Token.token Stream.t = <fun>
# let pgm =
"def fib(x) \
if x < 3 then \
1 \
else \
fib(x-1)+fib(x-2)";;
val pgm : string = "def fib(x) if x < 3 then 1 else fib(x-1)+fib(x-2)"
# let cs' = lex (Stream.of_string pgm);;
val cs' : Token.token Stream.t = <abstr>
# Stream.next cs';;
- : Token.token = Token.Def
# Stream.next cs';;
- : Token.token = Token.Ident "fib"
# Stream.next cs';;
- : Token.token = Token.Kwd '('
# Stream.next cs';;
- : Token.token = Token.Ident "x"
# Stream.next cs';;
- : Token.token = Token.Kwd ')'
You get the expected stream of Token.token values.
Now a few technical words about camlp4 and camlp5.
It's indeed recommended not to use the so-called "camlp4", which is being deprecated, and instead use "camlp5", which is in fact the "genuine camlp4" (see below), assuming you want an LL(1) parser.
For that, you can use the following camlp5 Toplevel directive instead of the camlp4 one:
#require "camlp5";; (* add the path + loads the module (topfind directive) *)
#load "camlp5o.cma";; (* patch: manually load the camlp5o module,
because #require forgets to do it (why?);
the "o" in "camlp5o" stands for "original syntax" *)
let rec lex = parser
| [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
| [< >] -> [< >]
# val lex : char Stream.t -> 'a = <fun>
More history about camlp4 and camlp5.
Disclaimer : while I try to be as neutral and factual as possible, this too short explanation may reflect also my personal opinion. Of course, discussion is welcome.
As an OCaml beginner, I found camlp4 very attractive and powerful, but it was not easy to distinguish what exactly camlp4 was and to find its most recent documentation.
In very brief :
It's an old and confusing story, mainly because of the naming of "camlp4". camlp4 is the historical syntax extension system for OCaml. Someone decided to improve/retrofit camlp4 around 2006, but some design decisions turned it into something considered by some people as a "beast" (often, less is more). So it works, but there is a lot of stuff under the hood (its signature became very large).
Its historical author, Daniel de Rauglaudre, decided to keep developing camlp4 his way and renamed it "camlp5" to differentiate it from the "new camlp4" (which kept the name camlp4). Even if camlp5 is not widely used, it's still maintained, operational, and used, for example, by Coq, which recently integrated a part of camlp5 instead of depending on the whole camlp5 library (which doesn't mean that "Coq doesn't use camlp5 anymore", as you may read).
ppx has become the mainstream syntax extension technology in the OCaml world (it seems to be dedicated to making "limited and reliable" OCaml syntax extensions, mainly for small and very useful code generation: helper functions, etc.; that's a side discussion). That doesn't mean camlp5 is "deprecated"; camlp5 is certainly misunderstood. I had a hard time at the beginning, mainly because of its documentation. I wish I could have read this post back then! Anyway, when programming in OCaml, I believe it's a good thing to explore all kinds of technology; it's up to you to form your own opinion.
So, the today so-called "camlp4" is in fact the "old camlp4" (or the "new camlp4 of the past"; I know, it's complicated).
LALR(1) parsers such as ocamlyacc or menhir are or have been made mainstream. They have a bottom-up approach (define .mll and .mly files, then compile them to OCaml code).
LL(1) parsers, such as camlp4/camlp5, have a top-down approach, very close to functional style.
The best thing is to compare them yourself. Implementing a lexer/parser for your language is perfect for that: with ocamllex/menhir and with ocamllex/camlp5, or even with camlp5 alone, because it's also a lexer (with its own pros/cons).
I hope you'll enjoy your LLVM tutorial.
All technical and historical complementary comments are very welcome.
As @glennsl says, this page uses the camlp4 preprocessor, which is considered obsolete by many in the OCaml community.
Here is a forum message from August 2019 that describes how to move from camlp4 to the more recent ppx:
The end of camlp4
Unfortunately that doesn't really help you learn what that LLVM page is trying to teach you, which has little to do with OCaml it seems.
This is one reason I find the use of syntax extensions to be problematic. They don't have the staying power of the base language.
(On the other hand, OCaml really is a fantastic language for writing compilers and other language tools.)

British Pound Sign £ causing PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding “UTF8”: 0xa3

When collecting information containing the British Pound Sign '£' from external sources such as my bank, via csv file, and posting to postgres using ActiveRecord, I get the error:
PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding “UTF8”: 0xa3
The 0xa3 is the hex code for a £ sign. The perceived wisdom is to explicitly specify UTF-8 on the string while replacing invalid byte sequences:
string.encode('UTF-8', {:invalid => :replace, :undef => :replace, :replace => '?'})
This stops the error, but is a lossy fix as the '£' is converted into a '?'
UTF-8 is able to handle the '£' sign, so what can be done to fix the invalid byte sequence and persist the '£' sign?
I'm answering my own question thanks to Michael Fuhr, who explained that the UTF-8 byte sequence for the pound sign is 0xc2 0xa3. So all you have to do is find each occurrence of 0xa3 (163) and place 0xc2 (194) in front of it:
array_bytes = string.bytes
search_from = 0
fixed = false
# Look for each occurrence of the £ sign (0xa3 == 163)
while (pound_ptr = array_bytes[search_from..-1].index(163))
  pound_ptr += search_from
  # Insert 0xc2 (194) before any 0xa3 that is not already preceded by it
  if pound_ptr == 0 || array_bytes[pound_ptr - 1] != 194
    array_bytes.insert(pound_ptr, 194)
    pound_ptr += 1
    fixed = true
  end
  # Search the remainder of the array for the next pound sign
  search_from = pound_ptr + 1
end
# Convert the bytes back to 8-bit chars and tag the result as UTF-8
string = array_bytes.pack('C*').force_encoding('UTF-8') if fixed
# Can now write the string to the model without the invalid-byte error
hash["description"] = string
Model.create!(hash)
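The byte-patching above works, but if the CSV is in fact Latin-1 encoded (where £ is the single byte 0xa3), a plain transcode is simpler and just as lossless. The idea, sketched here in Python for clarity (in Ruby the equivalent would be forcing ISO-8859-1 and then encoding to UTF-8):

```python
# A lone 0xa3 byte is invalid UTF-8: UTF-8 encodes the pound sign as two bytes.
raw = b"Price: \xa3100"  # bytes as they might arrive from the CSV

# Decoding as Latin-1 (where 0xa3 *is* the pound sign) never fails...
text = raw.decode("latin-1")
print(text)  # Price: £100

# ...and re-encoding as UTF-8 yields the proper two-byte sequence 0xc2 0xa3.
print(text.encode("utf-8"))  # b'Price: \xc2\xa3100'
```

This only applies when the source really is Latin-1; the byte-patching approach handles files that mix correctly and incorrectly encoded pound signs.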
I've had so much help on this stackoverflow forum, I hope I have helped somebody else.

Antlr4 Spaces within assignment

I'm trying to write a simple parser in ANTLR 4 that'll be able to handle stuff like this:
java.lang.String dataSourceName=FOO
java.lang.Long dataLoadTimeout=30000
This is what I put in my .g4 file:
cfg : (paramAssign NEWLINE)* ;
paramAssign : paramDecl '=' paramVal ;
paramDecl : javaType paramName ;
paramName : SIMPLEID ;
paramVal : PARAMVAL ;
javaType : JAVATYPE ;
SIMPLEID : [a-zA-Z_][a-zA-Z0-9_]* ;
PARAMVAL : [0-9a-zA-Z_]+ ;
JAVATYPE : SIMPLEID ('.' SIMPLEID)* ;
NEWLINE : '\n' ;
When I run it on the inputs above, I get:
line 1:16 token recognition error at: ' '
line 2:14 token recognition error at: ' '
line 1:32 mismatched input 'FOO' expecting PARAMVAL
I know that there are precedence rules that ANTLR's lexer & parser follow but it's not clear to me how I'm violating them. For some reason it doesn't like the string FOO although FOO clearly conforms to the PARAMVAL rule. Also, when I put spaces before & after equals signs I get:
token recognition error at: ' '
for each space I've added. Sorry, but I'm really baffled.
FOO is matched as a SIMPLEID token, not a PARAMVAL token. That is just how ANTLR works: whenever 2 (or more) lexer rules match the same number of characters, the rule defined first wins (SIMPLEID in your case).
So if you let paramVal also match a SIMPLEID, the error would go away:
paramVal : SIMPLEID | PARAMVAL ;
For the recognition error at: ' ' to disappear, you'd have to match space chars as well:
SPACE : [ \t]+ -> skip ;
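The tie-break can be checked outside ANTLR with a hypothetical two-rule matcher (plain Python regexes, just to illustrate the policy):

```python
import re

# Both patterns match all of "FOO" (3 chars); ANTLR breaks the tie by
# rule order, so the first-defined SIMPLEID produces the token.
RULES = [
    ("SIMPLEID", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("PARAMVAL", r"[0-9a-zA-Z_]+"),
]

def first_longest(text):
    matches = [(n, m.group()) for n, p in RULES if (m := re.match(p, text))]
    return max(matches, key=lambda t: len(t[1]))  # max keeps the first on ties

print(first_longest("FOO"))    # ('SIMPLEID', 'FOO')
print(first_longest("30000"))  # ('PARAMVAL', '30000')
```

A value starting with a digit cannot match SIMPLEID, which is why PARAMVAL still gets used for 30000.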

antlr3: Java heap space when testing parser

I'm trying to build a simple config-file reader to read files of this format:
A .-
B -...
C -.-.
D -..
E .
This is the grammar I have so far:
grammar def;
@header {
package mypackage.parser;
}
@lexer::header { package mypackage.parser; }
file
: line+;
line : ID WS* CODE NEWLINE;
ID : ('A'..'Z')*
;
CODE : ('-'|'.')*;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
) {$channel=HIDDEN;}
;
NEWLINE:'\r'? '\n' ;
And this is my test rig (junit4)
@Test
public void BasicGrammarCheckGood() {
String CorrectlyFormedLine="A .-;\n";
ANTLRStringStream input;
defLexer lexer;
defParser parser;
input = new ANTLRStringStream(CorrectlyFormedLine);
lexer = new defLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
parser = new defParser(tokens);
try {
parser.line();
}
catch(RecognitionException re) { fail(re.getMessage()); }
}
If I run this test with a correctly formatted string, the code exits without any exception or output.
However, if I feed the parser an invalid string like "xA .-;\n", the code spins for a while and then exits with a "Java heap space" error.
(If I start my test with the top-level rule 'file', I get the same result, with the additional (repeated) output of "line 1:0 mismatched input '' expecting CODE".)
What's going wrong here? I never seem to get the RecognitionException for the invalid input.
EDIT: Here's my grammar file (Fragment), after being provided advice here - this avoids the 'Java heap space' issue.
file
: line+ EOF;
line : ID WS* CODE NEWLINE;
ID : ('A'..'Z')('A'..'Z')*
;
CODE : ('-'|'.')('-'|'.')*;
Some of your lexer rules match zero characters (an empty string):
ID : ('A'..'Z')*
;
CODE : ('-'|'.')*;
There are, of course, an infinite number of empty strings in your input, causing your lexer to keep producing tokens, resulting in a heap space error after a while.
Always let lexer rules match at least 1 character.
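The runaway behaviour is easy to reproduce with any matcher that allows zero-width matches. A Python sketch of the failure mode (not ANTLR itself):

```python
import re

# Like the original `ID : ('A'..'Z')* ;` - this pattern also matches "".
ID = re.compile(r"[A-Z]*")

text = "xA .-"
pos, steps = 0, 0
while pos < len(text) and steps < 5:  # cap iterations so the demo terminates
    m = ID.match(text, pos)
    pos += len(m.group())  # zero-width match at 'x': pos never advances
    steps += 1
print(pos, steps)  # still at position 0 after 5 iterations
```

Change the pattern to [A-Z]+ (at least one character, like the EDIT above) and the match at 'x' fails outright instead of silently succeeding with nothing.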
EDIT
Two (small) remarks:
since you put WS tokens on the hidden channel, you don't need to mention them in your parser rules, so line becomes line : ID CODE NEWLINE;
something like ('A'..'Z')('A'..'Z')* can be written like this: ('A'..'Z')+
