Writing an ANTLR grammar where whitespace is sometimes significant

This is a dummy example, my actual language is more complicated:
grammar wordasnumber;
WS: [ \t\n] -> skip;
AS: [Aa] [Ss];
ID: [A-Za-z]+;
NUMBER: [0-9]+;
wordAsNumber: (ID AS NUMBER)* EOF;
In this language, these two strings are legal:
seven as 7 eight as 8
seven as 7eight as8
That is exactly what I told it to do, but not what I want. Because ID and AS are both strings of letters, whitespace is required between them. I would like the second phrase to be a syntax error. I could add another rule to try to match these mashed-up things ...
fragment LETTER: [A-Za-z];
fragment DIGIT: [0-9];
BAD_THING: ( LETTER+ DIGIT (LETTER|DIGIT)* ) | ( DIGIT+ LETTER (LETTER|DIGIT)* );
ID: LETTER+;
NUMBER: DIGIT+;
... to make the lexer return a different token for these smashed-up things, but this feels like a weird band-aid whose need I only found by accident, and maybe there are more such cases that I would find if I stared at my lexer very carefully.
Is there a better way to do this? My actual grammar is much larger, so, for example, making WS NOT be skipped and placing it explicitly between the tokens where it is required is a non-starter.
There was an older question on this list, which I could not find, that I think asks the same thing: someone who was parsing whitespace-separated numbers was surprised that 1.2.3 parsed as 1.2 and .3 and not as a syntax error.

Add another rule for the wrong input, but don't use that in your parser. It will then cause a syntax error when matched:
INVALID: (ID | NUMBER)+;
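With this additional rule, the second input in the question no longer parses: 7eight and as8 are each lexed as a single INVALID token, which the wordAsNumber rule does not accept, so the parser reports a syntax error.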
This trick works because ANTLR4's lexer tries to match the longest input in one go, and the INVALID rule matches more than ID or NUMBER alone. But you have to place it after these two rules, to make use of another lexing rule: "If two lexer rules would match the same input, pick the first one." This way, you still get the correct tokens for single occurrences of ID and NUMBER.
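The two lexing rules described above (longest match first, earliest rule on ties) can be sketched outside ANTLR. The following is a minimal Python simulation, not ANTLR's actual implementation; the names RULES and tokenize are made up for this illustration, and the lexer rules are approximated with Python regular expressions.

```python
import re

# Lexer rules in grammar order. INVALID is deliberately listed last so that
# it only wins when it matches MORE text than the rules above it.
RULES = [
    ("WS", r"[ \t\n]"),
    ("AS", r"[Aa][Ss]"),
    ("ID", r"[A-Za-z]+"),
    ("NUMBER", r"[0-9]+"),
    ("INVALID", r"(?:[A-Za-z]+|[0-9]+)+"),
]


def tokenize(text):
    """Return (rule, lexeme) pairs: longest match wins, earliest rule wins ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strict ">" keeps the earlier rule when two matches are equally long.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mirrors "-> skip" in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("seven as 7 eight as 8"))
print(tokenize("seven as 7eight as8"))
```

For the first input every token is an ID, AS or NUMBER. For the second, 7eight and as8 come out as INVALID tokens, which a parser rule such as (ID AS NUMBER)* would reject.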

Related

JFlex ambiguity

I have these two rules from a jflex code:
Bool = true
Ident = [:letter:][:letterdigit:]*
If I try, for example, to analyse the word "trueStat", it gets recognized as an Ident expression and not Bool.
How can I avoid this type of ambiguity in Jflex?
In almost all languages, a keyword is only recognised as such if it is a complete word. Otherwise, you would end up banning identifiers like format, downtime and endurance (which would instead start with the keywords for, do and end, respectively). That's quite confusing for programmers, although it's not unheard-of. Lexical scanner generators, like Flex and JFlex generally try to make the common case easy; thus, the snippet you provide, which recognises trueStat as an identifier. But if you really want to recognise it as a keyword followed by an identifier, you can accomplish that by adding trailing context to all your keywords:
Bool = true/[:letterdigit:]*
Ident = [:letter:][:letterdigit:]*
With that pair of patterns, true will match the Bool rule, even if it occurs as trueStat. The pattern matches true and any alphanumeric string immediately following it, and then rewinds the input cursor so that the token matched is just true.
Note that like Lex and Flex, JFlex accepts the longest match at the current input position; if more than one rule accepts this match, the action corresponding to the first such rule is executed. (See the manual section "How the Input is Matched" for a slightly longer explanation of the matching algorithm.) Trailing context is considered part of the match for the purposes of this rule (but, as noted above, is then removed from the match).
The consequence of this rule is that you should always place more specific patterns before the general patterns they might override, whether or not the specific pattern uses trailing context. So the Bool rule must precede the Ident rule.
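The effect of trailing context on the longest-match rule can be illustrated with a small Python sketch. This is a simplified model, not JFlex itself: the rule tables and the next_token helper are hypothetical, and [:letterdigit:] is approximated with ASCII letters and digits.

```python
import re

# Rule = (name, token pattern, trailing-context pattern or None), in priority order.
WITH_CONTEXT = [
    ("Bool", r"true", r"[A-Za-z0-9]*"),
    ("Ident", r"[A-Za-z][A-Za-z0-9]*", None),
]
WITHOUT_CONTEXT = [
    ("Bool", r"true", None),
    ("Ident", r"[A-Za-z][A-Za-z0-9]*", None),
]


def next_token(text, rules, pos=0):
    """Pick the rule whose match (token plus trailing context) is longest,
    preferring earlier rules on ties, then return only the token part."""
    best = None  # (match length, rule name, lexeme)
    for name, token_re, trailing_re in rules:
        token = re.compile(token_re).match(text, pos)
        if token is None:
            continue
        end = token.end()
        if trailing_re is not None:
            trailing = re.compile(trailing_re).match(text, end)
            if trailing is None:
                continue  # the trailing context must match as well
            end = trailing.end()  # counts toward the length, then is rewound
        length = end - pos
        if best is None or length > best[0]:
            best = (length, name, token.group())
    if best is None:
        raise ValueError(f"no rule matches at offset {pos}")
    return best[1], best[2]


print(next_token("trueStat", WITHOUT_CONTEXT))  # ('Ident', 'trueStat')
print(next_token("trueStat", WITH_CONTEXT))     # ('Bool', 'true')
print(next_token("trueStat", WITH_CONTEXT, 4))  # ('Ident', 'Stat')
```

With the trailing context both rules cover all eight characters, so the tie goes to Bool because it is listed first; without it, Ident simply matches more text and wins.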

How can I identify 6 consecutive numbers in a string?

AppleScript noob here. I'm trying to identify a date format in filenames and return the characters immediately preceding the date. The date is formatted in the filenames as just 6 consecutive digits. The data before it indicates the length of the file and is also numeric. These files will never have 6 or more consecutive digits except for the date, so I don't have to worry about false positives. What I need to do is find the 6 consecutive digits so I can use them to find the data before the date and group all those files together.
ex:
Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov
Test Recording Iceland 19 040407 low quality screener.mov
Initially it seemed like the numbers preceding the date had set values that I could have the code look out for with
if fileName contains "29" then
but now I'm stumped on how to approach this. My general idea was the following:
Looks like something’s eaten the last part of your question. At any rate, AppleScript is not the best language for text processing, but whatever language you use the standard technique is regular expression-based pattern matching.
For example, to match six digits you’d use the pattern \d{6}. The \d pattern matches any digit, the {6} matches the preceding pattern exactly six times.
If you want to extract the text from the start of a line up to the six digits, you’d use something like (?-s)^(.+?)\d{6}. The ^ matches the start of each line. The .+? matches one or more characters (.+) only up to the next pattern match (?); grouping it in parens extracts the matched text. By default, the . pattern matches any character including a line break, so add (?-s) to the start of the pattern to turn off the line break matching (-s).
It's a bit cryptic, but very powerful, and you’ll get the hang of it with a bit of practice. There are tons of online docs and examples too; just search for “PCRE regular expression”. (Tip: build it up one pattern at a time, testing at every step.)
AppleScript doesn’t have built-in support for regular expressions, but it can use Cocoa’s NSRegularExpression class via the AppleScript-ObjC bridge. The syntax isn’t very friendly so you may want to use a library that wraps it for you:
use script "Text"
set theText to "Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov
Test Recording Iceland 19 040407 low quality screener.mov"
search text theText for "^(.+?)\\d{6}" using pattern matching
returns:
{{class:matched text, startIndex:1, endIndex:39, foundText:"Barry_Waterson_Speech_1955_27.02_012219", foundGroups:{{class:matched group, startIndex:1, endIndex:33, foundText:"Barry_Waterson_Speech_1955_27.02_"}}},
{class:matched text, startIndex:67, endIndex:98, foundText:"Test Recording Iceland 19 040407", foundGroups:{{class:matched group, startIndex:67, endIndex:92, foundText:"Test Recording Iceland 19 "}}}}
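For comparison, the same pattern can be tried in a language with built-in regular expressions. The sketch below uses Python's re module and handles one filename at a time; the function name prefix_before_date is invented for this example.

```python
import re

# Lazily capture everything from the start of the name up to the
# first run of six consecutive digits.
PREFIX_BEFORE_DATE = re.compile(r"^(.+?)\d{6}")


def prefix_before_date(file_name):
    """Return the text preceding the six-digit date, or None if there is no date."""
    m = PREFIX_BEFORE_DATE.search(file_name)
    return m.group(1) if m else None


names = [
    "Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov",
    "Test Recording Iceland 19 040407 low quality screener.mov",
]
for name in names:
    print(repr(prefix_before_date(name)))
# 'Barry_Waterson_Speech_1955_27.02_'
# 'Test Recording Iceland 19 '
```

The returned prefix could then serve as a grouping key, for instance as a dictionary key that maps to a list of filenames.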

how to test my Grammar antlr4 successfully? [duplicate]

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly if I change TITLE to be TITLE: 'x' ; it still fails this time giving an error message saying "mismatched input 'x' expecting 'x'" which is highly confusing. Even more oddly if I replace the usage of TITLE in test with FILEPATH the whole thing works (although FILEPATH will match more than I am looking to match so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser, and the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every string that TITLE matches is also matched by FILEPATH, and FILEPATH is defined before TITLE, so each token that you expect to be a TITLE will be a FILEPATH.
There are a few hints for that:
keep your lexer rules disjoint (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).
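To see the effect of rule order on the grammar from the question, here is a rough Python model of the longest-match, first-rule-wins procedure described above. It is only an approximation of ANTLR's lexer; token_names and the two rule lists are names chosen for this sketch.

```python
import re

# Approximations of the three lexer rules from the question.
FILEPATH = r"[A-Za-z0-9:\\/ \-_.]+"
NEWLINE = r"\r?\n"
TITLE = r"[A-Za-z ]+"

ORIGINAL = [("FILEPATH", FILEPATH), ("NEWLINE", NEWLINE), ("TITLE", TITLE)]
REORDERED = [("TITLE", TITLE), ("FILEPATH", FILEPATH), ("NEWLINE", NEWLINE)]


def token_names(text, rules):
    """Lex text with longest-match semantics; the earliest rule wins ties."""
    names, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        names.append(best_name)
        pos += best_len
    return names


SOURCE = "c:\\test.txt\nx"
print(token_names(SOURCE, ORIGINAL))   # ['FILEPATH', 'NEWLINE', 'FILEPATH']
print(token_names(SOURCE, REORDERED))  # ['FILEPATH', 'NEWLINE', 'TITLE']
```

With the original order the x becomes a FILEPATH, so a parser rule expecting TITLE fails. With TITLE listed first, the tie on x goes to TITLE, while c:\test.txt is still a FILEPATH because that match is longer.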
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. The reason for me was that I had placed the new keyword after my VARNAME lexer rule, which lexed it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

Phone regex is not completely working

In my country, phone numbers follow a format like (XX)XXXX-XXXX. But entering phone numbers exactly according to that pattern is apparently too mainstream: some people follow it, and some don't. I'd like to write a regex that catches all possible cases. So far it looks like this:
/^[\(]?\d{2}?[\)]?\d{4}[. -]?\d{4}$/
And I prepared some test cases to prove the regex's functionality
# GOOD PHONES #
8432115262
843211 5262
843211.5262
843211-5262
32115262
3211.5262
3211 5262
3211-5262
(84)32115262
(84)3211.5262
(84)3211 5262
(84)3211-5262
# BAD PHONES #
!##$%*()
()32115262
()1231 3213
()1231.3213
()1231-3213
().3213
()-3213
()3213.
()3213-
3211-5a62
sakdiihbnmwlzi
Unfortunately, the bad case ()32115262 gets past the regex, although it is clear why: the part [\(]?\d{2}?[\)]? is responsible for the mistake. From left to right, you can enter zero or one (, then zero or two digits, then zero or one ).
I'd like that part to behave like this: if you enter (, you must also enter two digits and ); otherwise you can enter zero or two digits. Is something like this, or something with similar semantics, possible in the regex world?
Thanks in advance
Something like this perhaps:
/^(?:\(\d{2}\)|\d{2}?)\d{4}[. -]?\d{4}$/
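I used a non-capturing group (?: ... ) and alternation to provide two possible options for the first part of the phone number.
Either it is \(\d{2}\), which means brackets with exactly two digits, or it is \d{2}?, which is meant as two digits or the empty string. Be aware that this reading of \d{2}? depends on the regex engine: in PCRE, JavaScript and Python, a ? after {2} only makes the quantifier lazy, so exactly two digits are still required. Writing (?:\d{2})? avoids the ambiguity.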
Combine these two options together with | (which means OR) and you get the first part of the regex above: (?:\(\d{2}\)|\d{2}?)
It seemed to work for all your test cases!
Try this: ^(?:\(\d\d\)|\d\d)?\d{4}[. -]?\d{4}$
If the input has parentheses, they must contain exactly two digits; otherwise the two-digit prefix is optional.
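The second pattern can be checked against the test cases from the question. The sketch below does this with Python's re module, using fullmatch in place of the ^ and $ anchors; the names PHONE, is_valid_phone, GOOD and BAD are introduced here for illustration only.

```python
import re

# Optional area code: either two digits in parentheses or two bare digits,
# followed by four digits, an optional separator, and four more digits.
PHONE = re.compile(r"(?:\(\d\d\)|\d\d)?\d{4}[. -]?\d{4}")


def is_valid_phone(text):
    """True if the whole string is a phone number in one of the accepted forms."""
    return PHONE.fullmatch(text) is not None


GOOD = [
    "8432115262", "843211 5262", "843211.5262", "843211-5262",
    "32115262", "3211.5262", "3211 5262", "3211-5262",
    "(84)32115262", "(84)3211.5262", "(84)3211 5262", "(84)3211-5262",
]
BAD = [
    "!##$%*()", "()32115262", "()1231 3213", "()1231.3213", "()1231-3213",
    "().3213", "()-3213", "()3213.", "()3213-", "3211-5a62", "sakdiihbnmwlzi",
]

print(all(is_valid_phone(p) for p in GOOD))      # True
print(not any(is_valid_phone(p) for p in BAD))   # True
```

Making the whole alternation optional, rather than each piece, is what ties the parentheses to the two digits inside them.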

How to conflate consecutive gsubs in ruby

I have the following
address.gsub(/^\d*/, "").gsub(/\d*-?\d*$/, "").gsub(/\# ?\d*/,"")
Can this be done in one gsub? I would like to pass a list of patterns rather than just one pattern - they are all being replaced by the same thing.
You could combine them with an alternation operator (|):
address = '6 66-666 #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")
# " 66-666  "
address = 'pancakes 6 66-666 # pancakes #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/,"")
# "pancakes 6 66-666 pancakes  "
You might want to add a little more whitespace cleanup. And you might want to switch to one of:
/\A\d*|\d*-?\d*\z|\# ?\d*/
/\A\d*|\d*-?\d*\Z|\# ?\d*/
depending on what your data really looks like and how you need to handle newlines.
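The question also asks about passing a list of patterns. In Ruby, Regexp.union builds such an alternation from several regexes. As a language-neutral illustration, here is a Python sketch of the same idea; remove_all and PATTERNS are hypothetical names, and re.MULTILINE is set only to mimic Ruby's line-based ^ and $.

```python
import re

# The three patterns from the question, kept as a plain list.
PATTERNS = [r"^\d*", r"\d*-?\d*$", r"# ?\d*"]


def remove_all(text, patterns):
    """Delete every match of any pattern, using one combined alternation."""
    # Wrap each pattern in a non-capturing group so "|" cannot leak
    # into a neighbouring pattern.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns), re.MULTILINE)
    return combined.sub("", text)


print(repr(remove_all("6 66-666 #99 11-23", PATTERNS)))
# ' 66-666  '
print(repr(remove_all("pancakes 6 66-666 # pancakes #99 11-23", PATTERNS)))
# 'pancakes 6 66-666 pancakes  '
```

The leftover spaces in both results are the reason for the whitespace-cleanup suggestion above.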
Combining the regexes is a good idea--and relatively simple--but I'd like to recommend some additional changes. To wit:
address.gsub(/^\d+|\d+(?:-\d+)?$|\# *\d+/, "")
Of your original regexes, ^\d* and \d*-?\d*$ will always match, because they don't have to consume any characters. So you're guaranteed to perform two replacements on every line, even if that's just replacing empty strings with empty strings. Of my regexes, ^\d+ doesn't bother to match unless there's at least one digit at the beginning of the line, and \d+(?:-\d+)?$ matches what looks like an integer-or-range expression at the end of the line.
Your third regex, \# ?\d*, will match any # character, and if the # is followed by a space and some digits, it'll take those as well. Judging by your other regexes and my experience with other questions, I suspect you meant to match a # only if it's followed by one or more digits, with optional spaces intervening. That's what my third regex does.
If any of my guesses are wrong, please describe what you were trying to do, and I'll do my best to come up with the right regex. But I really don't think those first two regexes, at least, are what you want.
EDIT (in answer to the comment): When working with regexes, you should always be aware of the distinction between a regex that matches nothing and a regex that doesn't match. You say you're applying the regexes to street addresses. If an address doesn't happen to start with a house number, ^\d* will match nothing--that is, it will report a successful match, said match consisting of the empty string preceding the first character in the address.
That doesn't matter to you; you're just replacing it with another empty string anyway. But why bother doing the replacement at all? If you change the regex to ^\d+, it will report a failed match and no replacement will be performed. The result is the same either way, but the "matches nothing" scenario (^\d*) results in a lot of extra work that the "doesn't match" scenario avoids. In a high-throughput situation, that could be a life-saver.
The other two regexes bring additional complications: \d*-?\d*$ could match a hyphen at the end of the string (e.g. "123-", or even "-"); and \# ?\d* could match a hash symbol anywhere in the string, not just as part of an apartment/office number. You know your data, so you probably know neither of those problems will ever arise; I'm just making sure you're aware of them. My regex \d+(?:-\d+)?$ deals with the trailing-hyphen issue, and \# *\d+ at least makes sure there are digits after the hash symbol.
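The difference between "matches nothing" and "doesn't match" shows up in the replacement count. The following Python sketch uses re.subn, which returns the new string together with the number of substitutions made; the helper names and sample addresses are invented for this illustration.

```python
import re


def strip_leading_star(address):
    """Strip a leading number with \\d*; returns (result, replacements made)."""
    return re.subn(r"^\d*", "", address)


def strip_leading_plus(address):
    """Strip a leading number with \\d+; returns (result, replacements made)."""
    return re.subn(r"^\d+", "", address)


# No house number: the star version still "succeeds" on an empty string.
print(strip_leading_star("Main Street"))  # ('Main Street', 1)
print(strip_leading_plus("Main Street"))  # ('Main Street', 0)

# Trailing hyphen: the original pattern eats it, the stricter one does not.
print(re.sub(r"\d*-?\d*$", "", "Main Street-"))      # 'Main Street'
print(re.sub(r"\d+(?:-\d+)?$", "", "Main Street-"))  # 'Main Street-'
```

Both versions leave an address without a house number unchanged, but only the \d+ version skips the replacement work entirely.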
I think that if you combine them in a single gsub() regex, as an alternation, it changes the context of the starting search position.
For example, each of these substitutions starts at the beginning of the result of the previous one:
s/^\d*//g
s/\d*-?\d*$//g
s/\# ?\d*//g
whereas this
s/^\d*|\d*-?\d*$|\# ?\d*//g
resumes the search where the last match left off and could potentially produce a different overall output, especially since many of the subexpressions search for similar, if not the same, characters, distinguished only by line anchors.
I think your regexes are distinct enough in this case, though of course changing the order changes the result.
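A concrete case where chaining and combining disagree is easy to construct. The Python sketch below uses the stricter \d+ variants of the patterns, which never match the empty string, so the comparison does not depend on how an engine treats empty matches; chained, combined and the sample inputs are assumptions made for this example.

```python
import re

# Stricter variants of the three patterns; none can match an empty string.
PATTERNS = [r"^\d+", r"\d+(?:-\d+)?$", r"# *\d+"]


def chained(text):
    """Apply each substitution to the result of the previous one."""
    for pattern in PATTERNS:
        text = re.sub(pattern, "", text)
    return text


def combined(text):
    """Apply all patterns at once as a single alternation."""
    return re.sub("|".join(f"(?:{p})" for p in PATTERNS), "", text)


# Same result: the patterns hit separate parts of the string.
print(repr(chained("12 Main St #4 7-9")), repr(combined("12 Main St #4 7-9")))
# ' Main St  ' ' Main St  '

# Different result: chaining strips the trailing "12" first, so the
# "#" no longer has digits after it and survives.
print(repr(chained("Apt #12")), repr(combined("Apt #12")))
# 'Apt #' 'Apt '
```

In the chained version the second substitution changes what the third one gets to see, whereas the combined version decides at each position which alternative applies to the original text.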
