This is a dummy example; my actual language is more complicated:
grammar wordasnumber;
WS: [ \t\n] -> skip;
AS: [Aa] [Ss];
ID: [A-Za-z]+;
NUMBER: [0-9]+;
wordAsNumber: (ID AS NUMBER)* EOF;
In this language, these two strings are legal:
seven as 7 eight as 8
seven as 7eight as8
That is exactly what I told it to do, but not what I want. Because ID and AS are both strings of letters, white space is required between them. I would like the second phrase to be a syntax error. I could add another rule to try to match these mashed-up things ...
fragment LETTER: [A-Za-z];
fragment DIGIT: [0-9];
BAD_THING: ( LETTER+ DIGIT (LETTER|DIGIT)* ) | ( DIGIT+ LETTER (LETTER|DIGIT)* );
ID: LETTER+;
NUMBER: DIGIT+;
... to make the lexer return a different token for these smashed-up things, but this feels like a weird band-aid. I only found the need for it by accident, and there may be more such cases that I would find if I stared at my lexer very carefully.
Is there a better way to do this? My actual grammar is much larger, so, for example, making WS NOT be skipped and placing it explicitly between the tokens where it is required is a non-starter.
There was an older question on this list, which I could not find, that I think asks the same thing: someone parsing white-space-separated numbers was surprised that 1.2.3 parsed as 1.2 and .3 rather than as a syntax error.
Add another rule for the wrong input, but don't use that in your parser. It will then cause a syntax error when matched:
INVALID: (ID | NUMBER)+;
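With this additional rule, the mashed-together parts of the second input in the question (7eight and as8) are each lexed as a single INVALID token, so the parser reports a syntax error at those positions.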
This trick works because ANTLR4's lexer tries to match the longest input in one go, and the INVALID rule matches more than ID or NUMBER alone. But you have to place it after those two rules to make use of another lexing rule: "If two lexer rules would match the same input, pick the first one." This way, you still get the correct tokens for single appearances of ID and NUMBER.
I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly, if I change TITLE to be TITLE: 'x' ; it still fails, this time with an error message saying "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace the usage of TITLE in test with FILEPATH, the whole thing works (although FILEPATH will match more than I am looking to match, so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser, and the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every string that is matched by TITLE is also matched by FILEPATH, and FILEPATH is defined before TITLE. So each token that you expect to be a TITLE will be a FILEPATH.
There are three hints for that:
keep your lexer rules disjoint (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).
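The effect of rule order can be tried out with a small Ruby sketch. This is a toy longest-match lexer, not ANTLR; the lex helper and the Ruby regular expressions are assumptions that only approximate the three lexer rules from the question.

```ruby
# Toy longest-match lexer (not ANTLR): the longest match wins, and on a
# tie the rule that comes first in the list is chosen.
FILEPATH = [:FILEPATH, /\A[A-Za-z0-9:\\\/ \-_.]+/].freeze
NEWLINE  = [:NEWLINE,  /\A\r?\n/].freeze
TITLE    = [:TITLE,    /\A[A-Za-z ]+/].freeze

def lex(input, rules)
  names = []
  pos = 0
  while pos < input.length
    best = nil
    rules.each do |name, pattern|
      m = pattern.match(input[pos..-1])
      next unless m
      # Strictly longer only, so an earlier rule is kept on equal length.
      best = [name, m[0]] if best.nil? || m[0].length > best[1].length
    end
    raise "no rule matches at offset #{pos}" if best.nil?
    names << best[0]
    pos += best[1].length
  end
  names
end

input = "c:\\test.txt\nx"

# Order from the question: the lone "x" becomes a FILEPATH.
p lex(input, [FILEPATH, NEWLINE, TITLE]) # => [:FILEPATH, :NEWLINE, :FILEPATH]

# TITLE first: "x" is now a TITLE, while the path is still a FILEPATH
# because FILEPATH matches more characters there.
p lex(input, [TITLE, NEWLINE, FILEPATH]) # => [:FILEPATH, :NEWLINE, :TITLE]
```

The first result is also why the parser complains: it expects a TITLE token but receives a FILEPATH token whose text happens to be x.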
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. In my case, the cause was that I had placed the new keyword after my VARNAME lexer rule, so it was lexed as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.
I have already seen explanations of the difference between syntax and semantics, such as "What is the difference between syntax and semantics?"
But is there any difference between "grammar" and "syntax" when discussing compilers?
A grammar is a series of productions that generate the valid "words" of a language. It is a way to specify the syntax of a language. Another way to specify the syntax would be using plain English, but that would end up being very verbose for non-trivial languages if you want it to be precise enough to serve as a specification.
As an example consider the following text:
A program is a series of zero or more statements.
A statement is either the keyword "var", followed by an identifier, followed by a semicolon; an identifier followed by "++" or "--", followed by a semicolon; or the keyword "while", followed by an identifier, followed by the keyword "do", followed by zero or more statements, followed by the keyword "end".
This describes the syntax of a very simple programming language, but it is not a grammar. Here is a grammar that describes the same language:
program ::= statement*
statement ::= "var" ID ";"
| ID "++" ";"
| ID "--" ";"
| "while" ID "do" statement* "end"
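One practical benefit of the grammar form is that it translates almost mechanically into code. Below is a rough Ruby sketch of a recursive-descent recognizer with one method per production. The ToyParser class, its method names, and the definition of ID as a run of letters that is not a keyword are all assumptions for this illustration; the grammar above does not define ID.

```ruby
# Hypothetical recognizer for the toy grammar; all names are illustrative.
class ToyParser
  class ParseError < StandardError; end

  KEYWORDS = %w[var while do end].freeze

  def initialize(source)
    # Tokens: "++", "--", ";", a run of letters, or any other single
    # non-space character (which no production accepts).
    @tokens = source.scan(/\+\+|--|;|[A-Za-z]+|\S/)
    @pos = 0
  end

  # program ::= statement*
  def valid?
    statement while peek
    true
  rescue ParseError
    false
  end

  private

  def peek
    @tokens[@pos]
  end

  def expect(token)
    raise ParseError, "expected #{token.inspect}, got #{peek.inspect}" unless peek == token
    @pos += 1
  end

  # Assumption: ID is a run of letters that is not a keyword.
  def expect_id
    tok = peek
    ok = !tok.nil? && tok.match?(/\A[A-Za-z]+\z/) && !KEYWORDS.include?(tok)
    raise ParseError, "expected identifier, got #{tok.inspect}" unless ok
    @pos += 1
  end

  # statement ::= "var" ID ";" | ID "++" ";" | ID "--" ";"
  #             | "while" ID "do" statement* "end"
  def statement
    case peek
    when "var"
      expect("var")
      expect_id
      expect(";")
    when "while"
      expect("while")
      expect_id
      expect("do")
      statement until peek == "end" || peek.nil?
      expect("end")
    else
      expect_id
      raise ParseError, "expected ++ or --, got #{peek.inspect}" unless ["++", "--"].include?(peek)
      @pos += 1
      expect(";")
    end
  end
end

p ToyParser.new("var x; while x do x--; end").valid? # => true
p ToyParser.new("while x do x++;").valid?            # => false (missing "end")
```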
I am trying to do basic ANTLR-based scanning. I have a problem with a lexer not matching wanted tokens.
lexer grammar DefaultLexer;
ALPHANUM : (LETTER | DIGIT)+;
ACRONYM : LETTER '.' (LETTER '.')+;
HOST : ALPHANUM (('.' | '-') ALPHANUM)+;
fragment
LETTER : UNICODE_CLASS_LL | UNICODE_CLASS_LM | UNICODE_CLASS_LO | UNICODE_CLASS_LT | UNICODE_CLASS_LU;
fragment
DIGIT : UNICODE_CLASS_ND | UNICODE_CLASS_NL;
For the grammar above, hello. world string given as an input results in world only. Whereas I would expect to get both hello and world. What am I missing? Thanks.
ADDED:
OK, I learned that the input hello. world matches more characters using rule HOST than ALPHANUM, so the lexer chooses HOST. Then, when it fails to match the input against the HOST rule, it does not "look back", because that is how the lexer works.
How do I get around it?
As a foreword, ANTLR 4 would not behave in a strange manner here. Both ANTLR 3 and ANTLR 4 should be matching ALPHANUM, then giving 2 syntax errors, then matching another ALPHANUM, and I can state with confidence that ANTLR 4 will behave that way.
It looks like your HOST rule might be better suited to be host, a parser rule.
You need to make sure and provide a lexer rule that can match . (either together or as two separate tokens).
I have this line as an example from a CSV file:
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
I want to split it into an array. The immediate thought is to just split on commas, but some of the strings have commas in them, e.g. "Life and Living Processes, Life Processes", and these should stay as single elements in the array. Note also that there are two places where consecutive commas have nothing in between; I want to get these as empty strings.
In other words, the array i want to get is
[2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes","","",1,0,"endofline"]
I can think of hacky ways involving eval but i'm hoping someone can come up with a clean regex to do it...
cheers, max
This is not a suitable task for regular expressions. You need a CSV parser, and Ruby has one built in:
http://ruby-doc.org/stdlib/libdoc/csv/rdoc/classes/CSV.html
And an arguably superior third-party library:
http://fastercsv.rubyforge.org/
str=<<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF
require 'csv' # built in
p CSV.parse(str)
# That's it! However, empty fields appear as nil.
# Makes sense to me, but if you insist on empty strings then do something like:
parser = CSV.new(str)
parser.convert{|field| field.nil? ? "" : field}
p parser.readlines
EDIT: I failed to read the Ruby tag. The good news is, the guide will explain the theory behind building this, even if the language specifics aren't right. Sorry.
Here is a fantastic guide to doing this:
http://knab.ws/blog/index.php?/archives/10-CSV-file-parser-and-writer-in-C-Part-2.html
and the csv writer is here:
http://knab.ws/blog/index.php?/archives/3-CSV-file-parser-and-writer-in-C-Part-1.html
These examples cover the case of having a quoted literal in a csv (which may or may not contain a comma).
text=<<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF
x=[]
text.chomp.split("\042").each_with_index do |y,i|
i%2==0 ? x<< y.split(",") : x<<y
end
print x.flatten
output
$ ruby test.rb
["2412", "21", "Which of the following is not found in all cells?", "Curriculum", "Life and Living Processes, Life Processes", "", "", "", "1", "0", "endofline"]
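Note that the output above contains three empty strings where the question asks for two: the piece ,,,1,0, starts with the separator that follows the previous quoted field. As a rough alternative that avoids this, here is a sketch of a character-by-character splitter. The split_csv_line helper is a made-up name; it deliberately ignores escaped quotes ("") and fields that span several lines, which a real CSV parser handles.

```ruby
# Hypothetical helper: splits one CSV line and honours double-quoted fields.
# It does NOT handle escaped quotes ("") or multi-line fields.
def split_csv_line(line)
  fields = []
  current = String.new
  in_quotes = false
  line.chomp.each_char do |ch|
    if ch == '"'
      in_quotes = !in_quotes   # quotes toggle the state and are dropped
    elsif ch == "," && !in_quotes
      fields << current        # a comma outside quotes ends the field
      current = String.new
    else
      current << ch
    end
  end
  fields << current            # the last field has no trailing comma
end

line = '2412,21,"Which of the following is not found in all cells?",' \
       '"Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'
p split_csv_line(line)
```

All fields come back as strings, including 2412 and 21, and the two empty fields come back as empty strings.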
This morning I stumbled across a CSV Table Importer project for Ruby on Rails. You may find the code helpful:
Github TableImporter
My preference is #steenstag's solution, but an alternative is to use String#scan with the following regular expression.
r = /(?<![^,])(?:(?!")[^,\n]*(?<!")|"[^"\n]*")(?![^,])/
If the variable str holds the string given in the example, we obtain:
puts str.scan r
displays
2412
21
"Which of the following is not found in all cells?"
"Curriculum"
"Life and Living Processes, Life Processes"
1
0
"endofline"
Start your engine!
See also regex101 which provides a detailed explanation of each token of the regex. (Move your cursor across the regex.)
Ruby's regex engine performs the following operations.
(?<![^,]) : negative lookbehind assert current location is not preceded
by a character other than a comma
(?: : begin non-capture group
(?!") : negative lookahead asserts next char is not a double-quote
[^,\n]* : match 0+ chars other than a comma and newline
(?<!") : negative lookbehind asserts preceding character is not a
double-quote
| : or
" : match double-quote
[^"\n]* : match 0+ chars other than double-quote and newline
" : match double-quote
) : end of non-capture group
(?![^,]) : negative lookahead asserts current location is not followed
by a character other than a comma
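Note that (?<![^,]) is the same as (?<=,|\A) and (?![^,]) is the same as (?=,|\z): the first requires a comma or the start of the string before the match, and the second requires a comma or the end of the string after it.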
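The two empty fields are easy to miss when the result is printed with puts, so here is a short sketch that inspects the array instead. It assumes str holds the example line without a trailing newline (for instance after chomp), because the final lookahead rejects a field that is followed by a newline. The unquoted variable is only added for illustration.

```ruby
str = '2412,21,"Which of the following is not found in all cells?",' \
      '"Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'

r = /(?<![^,])(?:(?!")[^,\n]*(?<!")|"[^"\n]*")(?![^,])/

fields = str.scan(r)
p fields.length # => 10
p fields[4..7]
# => ["\"Life and Living Processes, Life Processes\"", "", "", "1"]

# The regex keeps the surrounding quotes; strip them if they are not wanted.
unquoted = fields.map { |f| f.gsub(/\A"|"\z/, "") }
p unquoted[4]   # => "Life and Living Processes, Life Processes"
```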