How does Ruby know when '/' is a divide symbol, vs. when it starts a regular expression?

I'm working on a lexer for Ruby. Such a lexer needs to clearly
distinguish divide '/' operators from regex /..../ operands.
Lexers are nicest to build when they are context-free (stateless)
with respect to lexing the next token.
Some program text that starts with "/" might be:
... / abc*(foo(def,bar[q-z]*)+sam) / ...
You can't tell whether the '/' symbol is a divide or the start of a regexp.
So clearly Ruby must be looking at the context, or it must have a rule
to decide when it is ambiguous. What's the rule?
[one possibility: it only allows a regexp where a divide cannot occur, e.g., after
when [ ( , #{ { if elseif != = !~ + , << and or not
(Edit 8/24/2015: extended the above list)
Does that cover everything? Or is it something entirely different?]

The Ruby lexer emits completely different tokens for a division operator and for the start of a regex (one is '/', the other tREGEXP_BEG). So the parser has no idea that the two actually use the same source text.
How does the lexer know which token to emit? See parse.y:8451 from the Ruby source.
The parser_params struct which is passed to the lexer has a member called lex.state. This is a bitfield, with each bit indicating something about the lexer state. The individual bits are called BEG, END, ENDARG, ENDFN, ARG, CMDARG, MID, FNAME, DOT, CLASS, LABEL, and LABELED.
When the lexer sees a '/' character, it emits tREGEXP_BEG if...
The lexer state is true for both ARG and LABELED, or
The lexer state is true for any one of BEG, MID, or CLASS.
Otherwise, it emits a division operator token.
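As a rough sketch of that decision, here is a simplified Python model (an illustration only, not the actual C code in parse.y; the bit constants are stand-ins for the EXPR_* flags described below):

# Simplified model of MRI's choice for '/' -- illustration only, not the real parse.y code.
EXPR_BEG, EXPR_MID, EXPR_CLASS, EXPR_ARG, EXPR_LABELED = (1 << n for n in range(5))

def token_for_slash(lex_state):
    """Decide which token a '/' becomes, given the current lexer-state bits."""
    if lex_state & (EXPR_BEG | EXPR_MID | EXPR_CLASS):
        return "tREGEXP_BEG"              # an expression is expected here
    if (lex_state & EXPR_ARG) and (lex_state & EXPR_LABELED):
        return "tREGEXP_BEG"
    return "'/'"                          # a value just ended, so this is division

print(token_for_slash(EXPR_BEG))  # tREGEXP_BEG (at the start of an expression)
print(token_for_slash(0))         # '/' (no BEG/MID/CLASS bits set, as after a complete value)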
So what do the states actually mean? The Ruby source contains the following comments on them:
EXPR_BEG_bit, /* ignore newline, +/- is a sign. */
EXPR_END_bit, /* newline significant, +/- is an operator. */
EXPR_ENDARG_bit, /* ditto, and unbound braces. */
EXPR_ENDFN_bit, /* ditto, and unbound braces. */
EXPR_ARG_bit, /* newline significant, +/- is an operator. */
EXPR_CMDARG_bit, /* newline significant, +/- is an operator. */
EXPR_MID_bit, /* newline significant, +/- is an operator. */
EXPR_FNAME_bit, /* ignore newline, no reserved words. */
EXPR_DOT_bit, /* right after `.' or `::', no reserved words. */
EXPR_CLASS_bit, /* immediate after `class', no here document. */
EXPR_LABEL_bit, /* flag bit, label is allowed. */
EXPR_LABELED_bit, /* flag bit, just after a label. */
Whenever the lexer emits a token, it may move to a new state, depending on the current state, the token it just lexed, and sometimes what it sees next in the source text (it does look ahead in a number of places).
Some of the states are only entered after lexing a reserved keyword. For example, EXPR_MID is entered after lexing break, next, rescue, or return; that is why in return /foo/ the slash starts a regex, since a division cannot follow return.

This is because of the way the parser is defined. Having a look at the BNF definition of Ruby, you can see that the division operation (in the ARGS section) is defined before the definition of a REGEXP. That's why the division operation has a higher precedence than a regexp.
Meaning, if the Ruby parser stumbles upon a section that resolves to
ARG / ARG
it will treat it as a division and move on.
Walking through a flex/bison tutorial will enlighten you! (Plus, it's fun.)

Related

how to document a single space character within a string in reST/Sphinx?

I've gotten lost in an edge case of sorts. I'm working on a conversion of some old plaintext documentation to reST/Sphinx format, with the intent of outputting to a few formats (including HTML and text) from there. Some of the documented functions are for dealing with bitstrings, and a common case within these is a sentence like the following: Starting character is the blank " " which has the value 0.
I tried writing this as an inline literal in the following ways: Starting character is the blank `` `` which has the value 0. or Starting character is the blank :literal:` ` which has the value 0. but there are a few problems with how these end up working:
1. reST syntax objects to whitespace immediately inside the literal, and it doesn't get recognized.
2. The above can be "fixed"--it looks correct in the HTML () and plaintext (" ") output--with a non-breaking space character inside the literal, but technically this is a lie in our case, and if a user copied this character, they wouldn't be copying what they expect.
3. The space can be wrapped in regular quotes, which allows the literal to be properly recognized, and while the output in HTML is probably fine (" "), in plaintext it ends up double-quoted as "" "".
In both 2 and 3 above, if the literal falls on the wrap boundary, the plaintext writer (which uses textwrap) will gladly wrap inside the literal and trim the space because it's at the start/end of the line.
I feel like I'm missing something; is there a good way to handle this?
Try using the unicode character codes. If I understand your question, this should work.
Here is a "|space|" and a non-breaking space (|nbspc|)
.. |space| unicode:: U+0020 .. space
.. |nbspc| unicode:: U+00A0 .. non-breaking space
You should see:
Here is a “ ” and a non-breaking space ( )
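One extra convenience, if the substitutions are needed in many files (an assumption about the setup, not part of the answer above): Sphinx's rst_prolog setting in conf.py can define them once for every document.

# conf.py -- prepend the substitution definitions to every reST source file
rst_prolog = """
.. |space| unicode:: U+0020 .. space
.. |nbspc| unicode:: U+00A0 .. non-breaking space
"""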
I was hoping to get out of this without needing custom code to handle it, but, alas, I haven't found a way to do so. I'll wait a few more days before I accept this answer in case someone has a better idea. The code below isn't complete, nor am I sure it's "done" (will sort out exactly what it should look like during our review process) but the basics are intact.
There are two main components to the approach:
introduce a char role which expects the unicode name of a character as its argument, and which produces an inline description of the character while wrapping the character itself in an inline literal node.
modify the text-wrapper Sphinx uses so that it won't break at the space.
Here's the code:
import re
import unicodedata

from docutils import nodes
from textwrap import TextWrapper  # assumed base class; the original post omits the imports


class TextWrapperDeux(TextWrapper):
    _wordsep_re = re.compile(
        r'((?<!`)\s+(?!`)|'                        # whitespace not between backticks
        r'(?<=\s)(?::[a-z-]+:)`\S+|'               # interpreted text start
        r'[^\s\w]*\w+[a-zA-Z]-(?=\w+[a-zA-Z])|'    # hyphenated words
        r'(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))')    # em-dash

    @property
    def wordsep_re(self):
        return self._wordsep_re


def char_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
    """Describe a character given by unicode name.

    e.g., :char:`SPACE` -> "char:` `(U+00020 SPACE)"
    """
    try:
        character = unicodedata.lookup(text)
    except KeyError:
        msg = inliner.reporter.error(
            ':char: argument %s must be valid unicode name at line %d' % (text, lineno))
        prb = inliner.problematic(rawtext, rawtext, msg)
        return [prb], [msg]
    app = inliner.document.settings.env.app  # not used here; kept for the missing "glue"
    describe_char = "(U+%05X %s)" % (ord(character), text)
    char = nodes.inline("char:", "char:", nodes.literal(character, character))
    char += nodes.inline(describe_char, describe_char)
    return [char], []


def setup(app):
    app.add_role('char', char_role)
The code above still lacks the glue to actually force the use of the new TextWrapper. When a full version settles out I may try to find a meaningful way to republish it; if so I'll link it here.
Markup: Starting character is the :char:`SPACE` which has the value 0.
It'll produce plaintext output like this: Starting character is the char:` `(U+00020 SPACE) which has the value 0.
And HTML output like: Starting character is the <span>char:<code class="docutils literal"> </code><span>(U+00020 SPACE)</span></span> which has the value 0.
When rendered, the HTML output ends up looking roughly like: Starting character is the char: (U+00020 SPACE) which has the value 0.

Treetop grammar line continuation

I'm trying to create a grammar for a language like the following
someVariable = This is a string, I know it doesn't have double quotes
anotherString = This string has a continuation _
this means I can write it on multiple line _
like this
anotherVariable = "This string is surrounded by quotes"
What are the correct Treetop grammar rules that parse the previous code correctly?
I should be able to extract the following values for the three variables
This is a string, I know it doesn't have double quotes
This string has a continuation this means I can write it on multiple line like this
This string is surrounded by quotes
Thank you
If you define the sequence "_\n" as if it was a single white-space character, and ensure that you test for that before you accept an end-of-line, your VB-style line continuation should just drop out. In VB, the newline "\n" is not white-space per se, but is a distinct statement termination character. You probably also need to deal with carriage returns, depending on your input character processing rules. I would write the white-space rule like this:
rule white
  ( [ \t] / "_" "\r"? "\n" )+
end
Then your statement rule looks like this:
rule variable_assignment
  white* var:([[:alpha:]]+) white* "=" white* value:((white / !"\n" .)*) "\n"
end
and your top rule:
rule top
  variable_assignment*
end
Your language doesn't seem to have any more apparent structure than that.
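If it helps to see the intended effect outside the grammar, here is a small Python sketch (purely illustrative, nothing to do with Treetop itself) that folds the "_" continuations into whitespace before splitting each logical line on '='; it only demonstrates the continuation handling, not the quote stripping:

import re

source = """\
someVariable = This is a string, I know it doesn't have double quotes
anotherString = This string has a continuation _
this means I can write it on multiple line _
like this
anotherVariable = "This string is surrounded by quotes"
"""

# Fold '_' followed by an end of line into a single space, mirroring the
# grammar's treatment of "_" + newline as ordinary whitespace.
folded = re.sub(r'[ \t]*_[ \t]*\r?\n[ \t]*', ' ', source)

for line in folded.splitlines():
    name, _, value = line.partition("=")
    print(name.strip(), "->", value.strip())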

Lexer rule conflict resolution needed

I have these lexer rules in my ANTLR3 grammar:
INTEGER: DIGITS;
FLOAT: DIGITS? DOT_SYMBOL DIGITS ('E' (MINUS_OPERATOR | PLUS_OPERATOR)? DIGITS)?;
HEXNUMBER: '0X' HEXDIGIT+;
HEXSTRING: 'X' '\'' HEXDIGIT+ '\'';
BITNUMBER: '0B' ('0' | '1')+;
BITSTRING: 'B' '\'' ('0' | '1')+ '\'';
NCHAR_TEXT: 'N' SINGLE_QUOTED_TEXT;
IDENTIFIER: LETTER_WHEN_UNQUOTED+;
fragment LETTER_WHEN_UNQUOTED:
'0'..'9'
| 'A'..'Z' // Only upper case, as we use a case insensitive parser (insensitive only for ASCII).
| '$'
| '_'
| '\u0080'..'\uffff'
;
and
qualified_identifier:
IDENTIFIER ( options { greedy = true; }: DOT_SYMBOL IDENTIFIER)?
;
This works mostly fine except for very specific situations like the input t1.1_d which is supposed to be parsed as 2 identifiers connected with a dot. What happens is that .1 is matched as float even though it's followed by underscore and letter(s).
It's clear where that comes from: LETTER_WHEN_UNQUOTED includes digits so '1' can be both an integer and an identifier. But the rule order should take care to resolve this to an integer, as intended (and usually does).
However, I'm perplexed that the t1.1_d input causes the FLOAT rule to kick in and would appreciate some pointers to resolve this problem. As soon as I add a space after the dot all is fine, but that is obviously not a real solution.
When I move the IDENTIFIER rule before the others I get new trouble because several other rules can no longer be matched then. Moving the FLOAT rule after the IDENTIFIER rule doesn't fix the problem either (but at least doesn't produce new problems). In this case we see the actual problem: the dot is always matched by the FLOAT rule if directly followed by a digit. What can I do to make it not match in my case?
The problem is that the lexer operates independently of the parser. When faced with the input string t1.1_d, the lexer will first consume an IDENTIFIER, leaving .1_d. You now want it to match DOT_SYMBOL, followed by IDENTIFIER. However, the lexer will always match the longest possible token, resulting in FLOAT matching .1.
Moving IDENTIFIER before FLOAT doesn't help, because '.' isn't a valid IDENTIFIER symbol, so IDENTIFIER can't match the input at all when it starts with '.'.
Note that Java and co. don't allow identifiers to start with numbers, probably to avert these kinds of problems.
One possible solution would be to change the FLOAT rule to require digits before the dot: FLOAT: DIGITS '.' DIGITS ...
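To see that it really is the longest-match behaviour, not rule order, deciding the outcome, here is a tiny Python sketch (an illustration of maximal munch, not ANTLR's actual implementation) that reproduces both the problem and the effect of requiring digits before the dot:

import re

def lex(text, rules):
    """Longest-match tokenizer: at every position try each rule and keep the
    longest match, which is roughly what an ANTLR lexer does."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if best_name is None:
            raise ValueError("no rule matches at position %d" % pos)
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens

# FLOAT as in the question (digits before the dot optional) vs. the proposed fix.
greedy = [("FLOAT", r"\d*\.\d+"), ("INT", r"\d+"), ("DOT", r"\."), ("ID", r"[A-Za-z0-9_$]+")]
fixed  = [("FLOAT", r"\d+\.\d+"), ("INT", r"\d+"), ("DOT", r"\."), ("ID", r"[A-Za-z0-9_$]+")]

print(lex("t1.1_d", greedy))  # [('ID', 't1'), ('FLOAT', '.1'), ('ID', '_d')]
print(lex("t1.1_d", fixed))   # [('ID', 't1'), ('DOT', '.'), ('ID', '1_d')]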

Validating User Input for Letters

I've recently been trying to validate a user input so that only letters from the alphabet are accepted, how would I do this? I know how to validate user input for most things, but this one line of code for letters is really troubling me.
You can check the contents of a field with this function:
function validate theString
  return matchText(theString,"^[a-zA-Z]+$")
end validate
^[a-zA-Z]+$ is a regular expression. ^ indicates the beginning of the string, the brackets match exactly one character from the set listed inside them, and the + means that the preceding element (here, the bracketed set) must occur one or more times. The $ indicates the end of the string. In other words, according to this expression, every character must be in the range a up to and including z or A up to and including Z.
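For comparison, the same check written in Python (only to show that the regular expression does all the work; this is not LiveCode):

import re

def is_letters_only(value):
    # True only when every character is an ASCII letter and the string is non-empty.
    return re.fullmatch(r"[a-zA-Z]+", value) is not None

print(is_letters_only("Hello"))     # True
print(is_letters_only("Hi there"))  # False -- the space is not a letter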
matchText() is a LiveCode function, which checks if the string in the first parameter matches the regular expression in the second parameter. Put the validate() function somewhere at card or stack level and call it from a field in a rawKeyUp handler:
on rawKeyUp
  if not validate(the text of me) then
    beep
    answer "Sorry, that's wrong"
  end if
end rawKeyUp
You could also check beforehand:
on keyDown theKey
  if validate(theKey) then
    pass keyDown
  end if
end keyDown
This method is slightly verbose. You could also put the matchText function in a keyDown handler of your field.

How to parse a word that starts with a specific letter with ANTLR3 java target

Is there a way to parse words that start with a specific character?
I've been trying the following but I couldn't get any promising results:
//This one is working it accepts AD CD and such
example1
:
.'D'
;
//This one is not, it expects character D, then any ws character then any character
example2
:
'D'.
;
//These two are not working either
example3
:
'D'.*
;
//Doesn't accept input due to error: "line 1:3 missing 'D' at '<EOF>'"
example4
:
.*'D'
;
//just in case my WS rule:
/** WhiteSpace Characters (HIDDEN)*/
WS : ( ' '
| '\t'
)+ {$channel=HIDDEN;}
;
I am using ANTLR 3.4
Thanks in advance
//This one is not, it expects character D, then any ws character then any character
example2
:
'D'.
;
No, it does not accept the token (not character!) 'D' followed by a space and then any character. Since example2 is a parser rule, it does not match characters, but tokens (there's a big difference!). And since you put spaces on a separate channel, the spaces are not matched by this rule either. At the end, the . (DOT) matches any token (again: not any character!).
More info on meta chars (like the . (DOT)) whose meaning differs inside lexer and parser rules: Negating inside lexer- and parser rules
//These two are not working either
example3
:
'D'.*
;
//Doesn't accept input due to error: "line 1:3 missing 'D' at '<EOF>'"
example4
:
.*'D'
;
Unless you know exactly what you're doing, don't use .*: it gobbles up too much in your case (especially when placed at the start or end of a rule).
It looks like you're trying to tokenize things inside the parser (all your example rules are parser rules). As far as I can see, these should be lexer rules instead. More on the difference between parser- and lexer rules, see: Practical difference between parser rules and lexer rules in ANTLR?
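If it is the tokenizing itself you are after, the underlying idea ("the letter D followed by further identifier characters") can be prototyped as an ordinary regular expression before turning it into a lexer rule; here is a small Python sketch, purely for illustration:

import re

# A "D-word": a word boundary, the letter D, then any further word characters.
D_WORD = re.compile(r"\bD\w*")

print(D_WORD.findall("Define DATA and debug the Door"))
# ['Define', 'DATA', 'Door']  (case sensitive: 'debug' is not matched)

In ANTLR itself the equivalent would be a lexer rule that starts with 'D' and then consumes the remaining identifier characters, e.g. something along the lines of DWORD : 'D' ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')* ; (written here only as a sketch, not taken from your grammar).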
