The precedence tables in much of the Ruby documentation out there list binary arithmetic operators as having higher precedence than their corresponding compound assignment operators. This leads me to believe that code like this shouldn't be valid Ruby, yet it is.
1 + age *= 2
If the precedence rules were correct, I'd expect that the above code would be parenthesized like this:
((1 + age) *= 2) #ERROR: Doesn't compile
But it doesn't.
So what gives?
Checking ruby -y output, you can see exactly what is happening. Given the source of 1 + age *= 2, the output suggests this happens (simplified):
tINTEGER found, recognised as simple_numeric, which is a numeric, which is a literal, which is a primary. Knowing that + comes next, primary is recognised as arg.
+ found. Can't deal yet.
tIDENTIFIER found. Knowing that next token is tOP_ASGN (operator-assignment), tIDENTIFIER is recognised as user_variable, and then as var_lhs.
tOP_ASGN found. Can't deal yet.
tINTEGER found. Same as last one, it is ultimately recognised as primary. Knowing that next token is \n, primary is recognised as arg.
At this moment we have arg + var_lhs tOP_ASGN arg on stack. In this context, we recognise the last arg as arg_rhs. We can now pop var_lhs tOP_ASGN arg_rhs from stack and recognise it as arg, with stack ending up as arg + arg, which can be reduced to arg.
arg is then recognised as expr, stmt, top_stmt, top_stmts. \n is recognised as term, then terms, then opt_terms. top_stmts opt_terms are recognised as top_compstmt, and ultimately program.
On the other hand, given the source 1 + age * 2, this happens:
tINTEGER found, recognised as simple_numeric, which is a numeric, which is a literal, which is a primary. Knowing that + comes next, primary is recognised as arg.
+ found. Can't deal yet.
tIDENTIFIER found. Knowing that next token is *, tIDENTIFIER is recognised as user_variable, then var_ref, then primary, and arg.
* found. Can't deal yet.
tINTEGER found. Same as last one, it is ultimately recognised as primary. Knowing that next token is \n, primary is recognised as arg.
The stack is now arg + arg * arg. arg * arg can be reduced to arg, and the resultant arg + arg can also be reduced to arg.
arg is then recognised as expr, stmt, top_stmt, top_stmts. \n is recognised as term, then terms, then opt_terms. top_stmts opt_terms are recognised as top_compstmt, and ultimately program.
What's the critical difference? In the first piece of code, age (a tIDENTIFIER) is recognised as var_lhs (left-hand side of an assignment), but in the second one, it's var_ref (a variable reference). Why? Because Bison generates an LALR(1) parser, meaning that it has one token of look-ahead. So age is var_lhs because Ruby saw tOP_ASGN coming up; and it was var_ref when it saw *. This comes about because Ruby knows (using the huge state transition table that Bison generates) that one specific production is impossible. Specifically, at this time, the stack is arg + tIDENTIFIER, and the next token is *=. If tIDENTIFIER were recognised as var_ref (which leads up to arg), and arg + arg reduced to arg, then there would be no rule that starts with arg tOP_ASGN; thus, tIDENTIFIER cannot be allowed to become var_ref, and we look at the next matching rule (the var_lhs one).
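To make the one-token look-ahead concrete, here is a small sketch that prints the token following age in both snippets, using the standard library's Ripper lexer. Ripper reports generic on_op events rather than the grammar's tOP_ASGN name, and the helper name token_after is made up for this illustration.

```ruby
require "ripper"

# Hypothetical helper: returns the text of the first non-space token that
# follows `name` in `source`, i.e. what a one-token look-ahead would see.
def token_after(source, name)
  tokens = Ripper.lex(source)
                 .reject { |_pos, event, _text, _state| event == :on_sp }
                 .map { |_pos, _event, text, _state| text }
  tokens[tokens.index(name) + 1]
end

lookahead_assign = token_after("1 + age *= 2", "age")
lookahead_times  = token_after("1 + age * 2", "age")

puts lookahead_assign
puts lookahead_times
```

The only difference the parser can observe at the moment it has to classify age is this single token.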
So Aleksei is partly right in that there is some truth to "when it sees a syntax error, it tries another way", but it is limited to one token into the future, and the "attempt" is just a lookup in the state table. Ruby is incapable of the complex repair strategies we humans use to understand sentences like "the horse raced past the barn fell", where we happily parse until the last word, then reevaluate the whole sentence when the first parse turns out to be impossible.
tl;dr: The precedence table is not exactly correct. It does not exist anywhere in the Ruby source; rather, it is the result of the interplay of various parsing rules. Many of the precedence rules break down when the left-hand side of an assignment is introduced.
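A quick way to check which groupings the grammar accepts, without running anything, is to ask Ripper from the standard library. This is only a sketch: parses? is a made-up helper name, and it relies on Ripper.sexp returning nil for source that has a syntax error.

```ruby
require "ripper"

# Hypothetical helper: Ripper.sexp returns nil when the source does not parse.
def parses?(source)
  !Ripper.sexp(source).nil?
end

implicit_grouping = parses?("1 + age *= 2")     # the snippet from the question
rhs_grouping      = parses?("1 + (age *= 2)")   # the grouping Ruby picks
lhs_grouping      = parses?("(1 + age) *= 2")   # what the table would predict

puts implicit_grouping
puts rhs_grouping
puts lhs_grouping
```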
The simplified answer is: you can only assign a value to a variable, not to an expression. Therefore the order is 1 + (age *= 2). Precedence only comes into play when multiple options are possible. For example, age *= 2 + 1 could be read as (age *= 2) + 1 or as age *= (2 + 1); since both are possible and + has a higher precedence than *=, age *= (2 + 1) is used.
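The two readings of age *= 2 + 1 give different results, which a short run makes visible. The variable names below are only for illustration.

```ruby
age = 2
age *= 2 + 1                       # read as age *= (2 + 1)
default_grouping = age

age = 2
forced_grouping = (age *= 2) + 1   # parentheses force the other reading
age_after_forced = age

puts default_grouping   # 6
puts forced_grouping    # 5
puts age_after_forced   # 4
```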
NB This answer should not be marked as solving the issue. See the answer by @Amadan for the correct explanation.
I am not sure which “many Ruby documentations” you mean; here is the official one.
Ruby parser does its best to understand and successfully parse the input; when it sees a syntax error, it tries another way. That said, syntax errors have greater precedence compared to all operator precedence rules.
Since the left-hand operand must be a variable, it starts with an assignment. Here is a case where the parsing can be done with the default precedence order, and + is done before *=:
age = 2
age *= age + 1
#⇒ 6
Ruby has 3 phases before your code is actually executed.
Tokenize -> Parse -> Compile
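As a rough sketch of these three phases on a tiny input, the following uses Ripper for the first two and RubyVM::InstructionSequence for the third. RubyVM is specific to CRuby, and the Ripper s-expression is a different rendering of the tree than the node dump in this answer.

```ruby
require "ripper"

source = "1 + 2"

# 1. Tokenize: the lexer splits the text into tokens.
tokens = Ripper.lex(source).map { |_pos, event, text, _state| [event, text] }

# 2. Parse: the tokens are arranged into a tree (here as an s-expression).
tree = Ripper.sexp(source)

# 3. Compile: CRuby turns the parsed program into YARV bytecode.
iseq = RubyVM::InstructionSequence.compile(source)

p tokens
p tree
puts iseq.disasm
p iseq.eval
```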
Let's look at the AST (Abstract Syntax Tree) that Ruby generates during the parse phase.
# # NODE_SCOPE (line: 1, location: (1,0)-(1,12))
# | # new scope
# | # format: [nd_tbl]: local table, [nd_args]: arguments, [nd_body]: body
# +- nd_tbl (local table): :age
# +- nd_args (arguments):
# | (null node)
# +- nd_body (body):
# # NODE_OPCALL (line: 1, location: (1,0)-(1,12))*
# | # method invocation
# | # format: [nd_recv] [nd_mid] [nd_args]
# | # example: foo + bar
# +- nd_mid (method id): :+
# +- nd_recv (receiver):
# | # NODE_LIT (line: 1, location: (1,0)-(1,1))
# | | # literal
# | | # format: [nd_lit]
# | | # example: 1, /foo/
# | +- nd_lit (literal): 1
# +- nd_args (arguments):
# # NODE_ARRAY (line: 1, location: (1,4)-(1,12))
# | # array constructor
# | # format: [ [nd_head], [nd_next].. ] (length: [nd_alen])
# | # example: [1, 2, 3]
# +- nd_alen (length): 1
# +- nd_head (element):
# | # NODE_DASGN_CURR (line: 1, location: (1,4)-(1,12))
# | | # dynamic variable assignment (in current scope)
# | | # format: [nd_vid](current dvar) = [nd_value]
# | | # example: 1.times { x = foo }
# | +- nd_vid (local variable): :age
# | +- nd_value (rvalue):
# | # NODE_CALL (line: 1, location: (1,4)-(1,12))
# | | # method invocation
# | | # format: [nd_recv].[nd_mid]([nd_args])
# | | # example: obj.foo(1)
# | +- nd_mid (method id): :*
# | +- nd_recv (receiver):
# | | # NODE_DVAR (line: 1, location: (1,4)-(1,7))
# | | | # dynamic variable reference
# | | | # format: [nd_vid](dvar)
# | | | # example: 1.times { x = 1; x }
# | | +- nd_vid (local variable): :age
# | +- nd_args (arguments):
# | # NODE_ARRAY (line: 1, location: (1,11)-(1,12))
# | | # array constructor
# | | # format: [ [nd_head], [nd_next].. ] (length: [nd_alen])
# | | # example: [1, 2, 3]
# | +- nd_alen (length): 1
# | +- nd_head (element):
# | | # NODE_LIT (line: 1, location: (1,11)-(1,12))
# | | | # literal
# | | | # format: [nd_lit]
# | | | # example: 1, /foo/
# | | +- nd_lit (literal): 2
# | +- nd_next (next element):
# | (null node)
# +- nd_next (next element):
# (null node)
As you can see from # +- nd_mid (method id): :+, 1 is treated as the receiver and everything to its right as the argument. Ruby then goes further and does its best to evaluate that argument.
To further support Aleksei's great answer: the # NODE_DASGN_CURR (line: 1, location: (1,4)-(1,12)) node is the assignment to age as a local variable. Ruby decodes it as age = age * 2, which is why +- nd_mid (method id): :* is treated as an operation with age as the receiver and 2 as its argument.
When this code then runs, it attempts age * 2 where age is nil, because the parser has already registered age as a local variable with no previously assigned value, and so it raises undefined method '*' for nil:NilClass (NoMethodError).
It works this way because any operation on the receiver needs its argument, the right-hand operand, to be evaluated first.
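The nil receiver can be reproduced in isolation. In this sketch, fresh is a made-up variable name that has never been assigned; the exact wording of the error message differs between Ruby versions, so only the exception class is relied on.

```ruby
begin
  fresh *= 2   # expands to fresh = fresh * 2, and fresh starts out as nil
rescue NoMethodError => e
  error = e
end

puts error.class
puts error.message
p fresh
```

The variable exists after the failed statement (the parser created it), but it still holds nil because the assignment never completed.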
In Ruby 2.7 and 3.1 this script does the same thing whether or not the % signs are there:
def count(str)
state = :start
tbr = []
str.each_char do
% %case state
when :start
tbr << 0
% %state = :symbol
% when :symbol
tbr << 1
% % state = :start
% end
end
tbr
end
p count("Foobar")
How is this parsed? You can add more % or remove some and it will still work, but not any combination. I found this example through trial and error.
I was teaching someone Ruby and noticed only after their script was working that they had a random % in the margin. I pushed it a little further to see how many it would accept.
Syntax
Percent String Literal
This is a Percent String Literal receiving the message %.
A Percent String Literal has the form:
% character
opening-delimiter
string content
closing-delimiter
If the opening-delimiter is one of <, [, (, or {, then the closing-delimiter must be the corresponding >, ], ), or }. Otherwise, the opening-delimiter can be any arbitrary non-alphanumeric character and the closing-delimiter must be the same character.
So,
%
(that is, % SPACE SPACE)
is a Percent String Literal with SPACE as the delimiter and no content. I.e. it is equivalent to "".
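A few delimiter choices side by side may help. For the space-delimited case the source text is built from explicit pieces so that the two spaces stay visible, and it is only parsed with Ripper rather than executed; this is a sketch, not an exhaustive list of allowed delimiters.

```ruby
require "ripper"

bracketed = [%(abc), %[abc], %<abc>, %{abc}]   # paired delimiters
repeated  = [%|abc|, %!abc!]                   # same character opens and closes
nested    = %(a(b)c)                           # paired delimiters may nest

# "%" followed by two spaces: SPACE opens the literal, SPACE closes it.
space_literal = "%" + " " * 2
space_sexp = Ripper.sexp(space_literal)

p bracketed
p repeated
p nested
p space_sexp
```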
Operator Message Send a % b
a % b
is equivalent to
a.%(b)
I.e. sending the message % to the result of evaluating the expression a, passing the result of evaluating the expression b as the single argument.
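The equivalence is easy to confirm with integers, where % is the modulo method. The values are arbitrary.

```ruby
a = 17
b = 5

operator_form = a % b                   # operator syntax
method_form   = a.%(b)                  # explicit method call
send_form     = a.public_send(:%, b)    # message send by name

p [operator_form, method_form, send_form]
```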
Which means
% % b
is (roughly) equivalent to
"".%(b)
Argument List
So, what's b then? Well, it's the expression following the % operator (not to be confused with the % sigil of the Percent String Literal).
The entire code is (roughly) equivalent to this:
def count(str)
state = :start
tbr = []
str.each_char do
"".%(case state
when :start
tbr << 0
"".%(state = :symbol)
""when :symbol
tbr << 1
"".%(state = :start)
""end)
end
tbr
end
p count("Foobar")
AST
You can figure this out yourself by just asking Ruby:
# ruby --dump=parsetree_with_comment test.rb
###########################################################
## Do NOT use this node dump for any purpose other than ##
## debug and research. Compatibility is not guaranteed. ##
###########################################################
# # NODE_SCOPE (id: 62, line: 1, location: (1,0)-(17,17))
# | # new scope
# | # format: [nd_tbl]: local table, [nd_args]: arguments, [nd_body]: body
# +- nd_tbl (local table): (empty)
# +- nd_args (arguments):
# | (null node)
[…]
# | | +- nd_body (body):
# | | # NODE_OPCALL (id: 48, line: 5, location: (5,0)-(12,7))*
# | | | # method invocation
# | | | # format: [nd_recv] [nd_mid] [nd_args]
# | | | # example: foo + bar
# | | +- nd_mid (method id): :%
# | | +- nd_recv (receiver):
# | | | # NODE_STR (id: 12, line: 5, location: (5,0)-(5,3))
# | | | | # string literal
# | | | | # format: [nd_lit]
# | | | | # example: 'foo'
# | | | +- nd_lit (literal): ""
# | | +- nd_args (arguments):
# | | # NODE_LIST (id: 47, line: 5, location: (5,4)-(12,7))
# | | | # list constructor
# | | | # format: [ [nd_head], [nd_next].. ] (length: [nd_alen])
# | | | # example: [1, 2, 3]
# | | +- nd_alen (length): 1
# | | +- nd_head (element):
# | | | # NODE_CASE (id: 46, line: 5, location: (5,4)-(12,7))
# | | | | # case statement
# | | | | # format: case [nd_head]; [nd_body]; end
# | | | | # example: case x; when 1; foo; when 2; bar; else baz; end
# | | | +- nd_head (case expr):
# | | | | # NODE_DVAR (id: 13, line: 5, location: (5,9)-(5,14))
# | | | | | # dynamic variable reference
# | | | | | # format: [nd_vid](dvar)
# | | | | | # example: 1.times { x = 1; x }
# | | | | +- nd_vid (local variable): :state
[…]
Some of the interesting places here are the node at (id: 12, line: 5, location: (5,0)-(5,3)) which is the first string literal, and (id: 48, line: 5, location: (5,0)-(12,7)) which is the first % message send:
# | | +- nd_body (body):
# | | # NODE_OPCALL (id: 48, line: 5, location: (5,0)-(12,7))*
# | | | # method invocation
# | | | # format: [nd_recv] [nd_mid] [nd_args]
# | | | # example: foo + bar
# | | +- nd_mid (method id): :%
# | | +- nd_recv (receiver):
# | | | # NODE_STR (id: 12, line: 5, location: (5,0)-(5,3))
# | | | | # string literal
# | | | | # format: [nd_lit]
# | | | | # example: 'foo'
# | | | +- nd_lit (literal): ""
Note: this is just the simplest possible method of obtaining a parse tree, which unfortunately contains a lot of internal minutiae that are not really relevant to figuring out what is going on. There are other methods such as the parser gem or its companion ast which produce far more readable results:
# ruby-parse count.rb
(begin
(def :count
(args
(arg :str))
(begin
(lvasgn :state
(sym :start))
(lvasgn :tbr
(array))
(block
(send
(lvar :str) :each_char)
(args)
(send
(dstr) :%
(case
(lvar :state)
(when
(sym :start)
(begin
(send
(lvar :tbr) :<<
(int 0))
(send
(dstr) :%
(lvasgn :state
(sym :symbol)))
(dstr)))
(when
(sym :symbol)
(begin
(send
(lvar :tbr) :<<
(int 1))
(send
(dstr) :%
(lvasgn :state
(sym :start)))
(dstr))) nil)))
(lvar :tbr)))
(send nil :p
(send nil :count
(str "Foobar"))))
Semantics
So far, all we have talked about is the Syntax, i.e. the grammatical structure of the code. But what does it mean?
The method String#% performs String Formatting a la C's printf family of functions. However, since the format string (the receiver of the % message) is the empty string, the result of the message send is the empty string as well, since there is nothing to format.
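For comparison, here is String#% with and without placeholders. Note one assumption: extra arguments are silently ignored only when Ruby is not running in debug mode ($DEBUG), where they raise an ArgumentError instead.

```ruby
formatted = "%d items at %.2f" % [3, 1.5]   # placeholders are filled in
no_slots  = "plain" % 42                    # no placeholders: text is unchanged
empty_fmt = "" % :anything                  # empty format string stays empty

p formatted
p no_slots
p empty_fmt
```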
If Ruby were a purely functional, lazy, non-strict language, the result would be equivalent to this:
def count(str)
state = :start
tbr = []
str.each_char do
"".%(case state
when :start
tbr << 0
""
""when :symbol
tbr << 1
""
""end)
end
tbr
end
p count("Foobar")
which in turn is equivalent to this
def count(str)
state = :start
tbr = []
str.each_char do
"".%(case state
when :start
tbr << 0
""
when :symbol
tbr << 1
""
end)
end
tbr
end
p count("Foobar")
which is equivalent to this
def count(str)
state = :start
tbr = []
str.each_char do
"".%(case state
when :start
""
when :symbol
""
end)
end
tbr
end
p count("Foobar")
which is equivalent to this
def count(str)
state = :start
tbr = []
str.each_char do
"".%(case state
when :start, :symbol
""
end)
end
tbr
end
p count("Foobar")
which is equivalent to this
def count(str)
state = :start
tbr = []
str.each_char do
""
end
tbr
end
p count("Foobar")
which is equivalent to this
def count(str)
state = :start
tbr = []
tbr
end
p count("Foobar")
which is equivalent to this
def count(str)
[]
end
p count("Foobar")
Clearly, that is not what is happening, and the reason is that Ruby isn't a purely functional, lazy, non-strict language. While the arguments which are passed to the % message sends are irrelevant to the result of the message send, they are nevertheless evaluated (because Ruby is strict and eager) and they have side-effects (because Ruby is not purely functional), i.e. their side-effects of re-assigning variables and mutating the tbr result array are still executed.
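A reduced version of the same effect, with made-up variable names: the results are empty strings, yet the arguments have visibly been evaluated.

```ruby
log = []
state = :start

first_result  = "".%(log << :appended)   # the argument is evaluated before the call
second_result = "".%(state = :symbol)    # so is this assignment

p first_result
p second_result
p log
p state
```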
If this code were written in a more Ruby-like style with less mutation and fewer side-effects and instead using functional transformations, then arbitrarily replacing results with empty strings would immediately break it. The only reason there is no effect here is the abundant use of side-effects and mutation.
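One possible rewrite in that style is sketched below. It assumes the intent of count is simply to alternate 0 and 1 per character, and count_functional is a hypothetical name. Here the value of the block is the result, so swapping it for an empty string would change the output.

```ruby
# Hypothetical rewrite of count without mutation: the state alternates with
# every character, so each position maps to 0 or 1 according to its index.
def count_functional(str)
  str.each_char.each_with_index.map { |_char, index| index.even? ? 0 : 1 }
end

p count_functional("Foobar")
```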
I am new to Ruby and a bit confused about how the ternary operator, ?:, works.
According to the book Engineering Software as a Service: An Agile Approach Using Cloud Computing:
every operation is a method call on some object and returns a value.
In this sense, if the ternary operator represents an operation, it is a method call on an object with two arguments. However, I can't find any method that the ternary operator represents in Ruby's documentation. Does a ternary operator represent an operation in Ruby? Is the above claim made by the book wrong? Is the ternary operator in Ruby really just a syntactic sugar for if ... then ... else ... end statements?
Please note:
My question is related to How do I use the conditional operator (? :) in Ruby? but not the same as that one. I know how to use the ternary operator in the way described in that post. My question is about where the ternary operator is defined in Ruby and whether it is defined as a method or methods.
Is the ternary operator in Ruby really just a syntactic sugar for if ... then ... else ... end statements?
Yes.
From doc/syntax/control_expressions.rdoc
You may also write a if-then-else expression using ? and :. This ternary if:
input_type = gets =~ /hello/i ? "greeting" : "other"
Is the same as this if expression:
input_type =
if gets =~ /hello/i
"greeting"
else
"other"
end
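To compare the two forms without reading from standard input, the sketch below wraps each in a method that takes the line as a parameter. The method names are made up for this illustration.

```ruby
# Hypothetical helpers mirroring the documentation example, with the input
# passed in as a parameter instead of being read with gets.
def classify_ternary(line)
  line =~ /hello/i ? "greeting" : "other"
end

def classify_if(line)
  if line =~ /hello/i
    "greeting"
  else
    "other"
  end
end

samples = ["Hello there", "goodbye"]
ternary_results = samples.map { |line| classify_ternary(line) }
if_results      = samples.map { |line| classify_if(line) }

p ternary_results
p if_results
```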
"According to this book, "every operation is a method call on some object and returns a value." In this sense, if the ternary operator represents an operation, it is a method call on an object with two arguments."
if, unless, while, and until are not operators, they are control structures. Their modifier versions appear in the operator precedence table because they need to have precedence in order to be parsed. They simply check if their condition is true or false. In Ruby this is simple: only false and nil are false. Everything else is true.
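The truthiness rule can be checked directly. The list of candidate values is just a sample chosen to include things that are falsy in some other languages.

```ruby
candidates = [0, "", [], {}, :sym, nil, false]
truthiness = candidates.map { |value| value ? :truthy : :falsy }

candidates.zip(truthiness) do |value, verdict|
  puts "#{value.inspect} is #{verdict}"
end
```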
Operators are things like !, +, *, and []. They are unary or binary. You can see a list of them by calling .methods.sort on various objects. For example...
2.4.3 :004 > 1.methods.sort
=> [:!, :!=, :!~, :%, :&, :*, :**, :+, :+@, :-, :-@, :/, :<, :<<, :<=, :<=>, :==, :===, :=~, :>, :>=, :>>, :[], :^, :__id__, :__send__, etc...
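Because these operators are ordinary methods, they can be invoked by name, whereas there is nothing to invoke for the ternary. A small sketch:

```ruby
binary_plus = 1.public_send(:+, 2)                # same as 1 + 2
unary_minus = 5.public_send(:-@)                  # same as -5
index_call  = [10, 20, 30].public_send(:[], 1)    # same as [10, 20, 30][1]

has_plus    = 1.respond_to?(:+)
has_ternary = 1.respond_to?(:"?:")                # no such method: ?: is pure syntax

p binary_plus, unary_minus, index_call
p has_plus, has_ternary
```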
Note that in Smalltalk, from which Ruby borrows heavily, everything really is a method call. Including the control structures.
Is the ternary operator in Ruby really just a syntactic sugar for if ... then ... else ... end statements?
(another) yes.
Here's the parse tree for a ? b : c:
$ ruby --dump=parsetree -e 'a ? b : c'
###########################################################
## Do NOT use this node dump for any purpose other than ##
## debug and research. Compatibility is not guaranteed. ##
###########################################################
# # NODE_SCOPE (line: 1)
# +- nd_tbl: (empty)
# +- nd_args:
# | (null node)
# +- nd_body:
# # NODE_PRELUDE (line: 1)
# +- nd_head:
# | (null node)
# +- nd_body:
# | # NODE_IF (line: 1)
# | +- nd_cond:
# | | # NODE_VCALL (line: 1)
# | | +- nd_mid: :a
# | +- nd_body:
# | | # NODE_VCALL (line: 1)
# | | +- nd_mid: :b
# | +- nd_else:
# | # NODE_VCALL (line: 1)
# | +- nd_mid: :c
# +- nd_compile_option:
# +- coverage_enabled: false
Here's the parse tree for if a then b else c end:
$ ruby --dump=parsetree -e 'if a then b else c end'
###########################################################
## Do NOT use this node dump for any purpose other than ##
## debug and research. Compatibility is not guaranteed. ##
###########################################################
# # NODE_SCOPE (line: 1)
# +- nd_tbl: (empty)
# +- nd_args:
# | (null node)
# +- nd_body:
# # NODE_PRELUDE (line: 1)
# +- nd_head:
# | (null node)
# +- nd_body:
# | # NODE_IF (line: 1)
# | +- nd_cond:
# | | # NODE_VCALL (line: 1)
# | | +- nd_mid: :a
# | +- nd_body:
# | | # NODE_VCALL (line: 1)
# | | +- nd_mid: :b
# | +- nd_else:
# | # NODE_VCALL (line: 1)
# | +- nd_mid: :c
# +- nd_compile_option:
# +- coverage_enabled: false
They are identical.
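The same comparison can be scripted instead of eyeballed. This sketch assumes CRuby 2.6 or later, where RubyVM::AbstractSyntaxTree is available; body_node is a made-up helper that picks the body out of the top-level scope node.

```ruby
# CRuby-specific: RubyVM::AbstractSyntaxTree exposes the parser's node types.
def body_node(source)
  RubyVM::AbstractSyntaxTree.parse(source).children.last
end

ternary_node = body_node("a ? b : c")
if_node      = body_node("if a then b else c end")

p ternary_node.type
p if_node.type
p ternary_node.children.map(&:type)
p if_node.children.map(&:type)
```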
In many languages ?: is an expression whereas if-then-else is a statement. In Ruby, both are expressions.
Objects that are not assigned to any variable/constant disappear immediately (under normal circumstances). In the following, the string "foo" is not captured by ObjectSpace.each_object(String) in the third line:
strings = ObjectSpace.each_object(String).to_a
"foo"
ObjectSpace.each_object(String).to_a - strings # => []
Is it possible to capture objects that are not necessarily assigned to any variables/constants or part of any variables/constants? I am particularly interested in capturing strings. The relevant domain can be a file, or a block. I expect something like the following:
capture_all_strings do
...
"a"
s = "b"
@s = "c"
@@s = "d"
S = "e"
%q{f}
...
end
# => ["a", "b", "c", "d", "e", "f"]
Ruby creates the string instances when parsing your file. Here's an example: the string
"aaa #{123} zzz"
is parsed as:
$ ruby --dump=parsetree -e '"aaa #{123} zzz"'
###########################################################
## Do NOT use this node dump for any purpose other than ##
## debug and research. Compatibility is not guaranteed. ##
###########################################################
# # NODE_SCOPE (line: 1)
# +- nd_tbl: (empty)
# +- nd_args:
# | (null node)
# +- nd_body:
# # NODE_DSTR (line: 1)
# +- nd_lit: "aaa "
# +- nd_next->nd_head:
# | # NODE_EVSTR (line: 1)
# | +- nd_body:
# | # NODE_LIT (line: 1)
# | +- nd_lit: 123
# +- nd_next->nd_next:
# # NODE_ARRAY (line: 1)
# +- nd_alen: 1
# +- nd_head:
# | # NODE_STR (line: 1)
# | +- nd_lit: " zzz"
# +- nd_next:
# (null node)
There are two string literals at the parser stage, "aaa " and " zzz":
# +- nd_lit: "aaa "
# ...
# | +- nd_lit: " zzz"
Inspecting ObjectSpace confirms that these strings have been instantiated:
$ ruby -e '"aaa #{123} zzz"; ObjectSpace.each_object(String) { |s| p s }' | egrep "aaa|zzz"
"\"aaa \#{123} zzz\"; ObjectSpace.each_object(String) { |s| p s }\n"
"aaa 123 zzz"
" zzz"
"aaa "
So unless you are creating a new string instance (e.g. by assigning the string literal to a variable) you can't detect the string creation. It's already there when the code is being executed.
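Since the literals already exist at parse time, one alternative is to collect them from the source text rather than from the running program. The sketch below is not the approach of this answer: string_literals is a hypothetical helper built on Ripper's lexer, it only sees literal text written in the source (interpolated strings come out as separate segments, and strings built at runtime are not seen at all).

```ruby
require "ripper"

# Hypothetical helper: lists the text of every string literal segment in a
# piece of source code, whether or not it is assigned to anything.
def string_literals(source)
  Ripper.lex(source)
        .select { |_pos, event, _text, _state| event == :on_tstring_content }
        .map { |_pos, _event, text, _state| text }
end

sample = <<~'RUBY'
  "a"
  s = "b"
  @s = "c"
  @@s = "d"
  S = "e"
  %q{f}
RUBY

p string_literals(sample)
```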
How can I predict how Ruby will parse things?
I came across a really surprising parsing error in Ruby while trying to concatenate strings.
> "every".capitalize +"thing"
=> NoMethodError: undefined method `+@' for "thing":String
Of course, if you put the extra space in there, it works as intended;
> "every".capitalize + "thing"
=> "Everything"
This error will occur if I have anything.any_method +"any string". What Ruby does is assume that we have elided parentheses, and are trying to give an argument to the method;
"every".capitalize( +"thing" )
It notices that we haven't defined the unary operator +@ on strings, and throws that error.
My question is, what principles should I use to predict the behavior of the Ruby parser? I only figured this error out after a lot of googling. It's notable that .capitalize takes no parameters ever (not even in the C source code). If you use a method that doesn't apply to the previous object, it still throws the +@ error instead of an undefined method 'capitalize' for "every":String error. So this parsing is obviously high-level. I'm not knowledgeable enough to read through Matz's parser.y. I've come across other similarly surprising errors. Can anyone tell me Ruby's parsing priority?
If you want to see how Ruby is parsing your code, you can dump the parse tree, e.g.
ruby -e '"every".capitalize +"thing"' --dump parsetree
# # NODE_SCOPE (line: 1)
# +- nd_tbl: (empty)
# +- nd_args:
# | (null node)
# +- nd_body:
# # NODE_CALL (line: 1)
# +- nd_mid: :capitalize
# +- nd_recv:
# | # NODE_STR (line: 1)
# | +- nd_lit: "every"
# +- nd_args:
# # NODE_ARRAY (line: 1)
# +- nd_alen: 1
# +- nd_head:
# | # NODE_CALL (line: 1)
# | +- nd_mid: :+@
# | +- nd_recv:
# | | # NODE_STR (line: 1)
# | | +- nd_lit: "thing"
# | +- nd_args:
# | (null node)
# +- nd_next:
# (null node)
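The same distinction can be checked from a script with Ripper, which is easier to automate than reading the dump. This is a sketch: it only looks for the presence of Ripper's :binary and :unary node names in the flattened tree.

```ruby
require "ripper"

spaced_both_sides = Ripper.sexp('"every".capitalize + "thing"').flatten
space_before_only = Ripper.sexp('"every".capitalize +"thing"').flatten

binary_call = spaced_both_sides.include?(:binary)   # receiver + argument
unary_call  = space_before_only.include?(:unary)    # +"thing" passed as an argument

p binary_call
p unary_call
```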
I like to use explainruby sometimes too, because it's much easier on my eyes :)
I've just read here (http://ruby.runpaint.org/programs#lexical) that comments are tokens. I've never thought of comments as tokens as they're either annotations or for a post-processor.
Are comments really tokens or is this source wrong?
Yes, they should be tokens, but ignored by the parser later on. If you do ruby --dump parsetree foo.rb with a file that looks like this
# this is a comment
1+1
# another comment
this is what you'll get:
# # NODE_SCOPE (line: 3)
# +- nd_tbl: (empty)
# +- nd_args:
# | (null node)
# +- nd_body:
# # NODE_CALL (line: 2)
# +- nd_mid: :+
# +- nd_recv:
# | # NODE_LIT (line: 2)
# | +- nd_lit: 1
# +- nd_args:
# # NODE_ARRAY (line: 2)
# +- nd_alen: 1
# +- nd_head:
# | # NODE_LIT (line: 2)
# | +- nd_lit: 1
# +- nd_next:
# (null node)
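Both halves of this can be observed with Ripper from the standard library: the lexer emits comment tokens, while the parsed tree has no trace of them. A minimal sketch using the same three-line file:

```ruby
require "ripper"

source = <<~RUBY
  # this is a comment
  1+1
  # another comment
RUBY

# The lexer reports each comment as an :on_comment token.
comment_tokens = Ripper.lex(source)
                       .select { |_pos, event, _text, _state| event == :on_comment }
                       .map { |_pos, _event, text, _state| text.chomp }

# The parsed tree contains nothing that mentions a comment.
tree_mentions_comment = Ripper.sexp(source).flatten.any? do |item|
  item.to_s.include?("comment")
end

p comment_tokens
p tree_mentions_comment
```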
Yeah they're tokens to the parser. Usually, if you use a parser generator this is the definition of a comment
{code} short_comment = '//' not_cr_lf* eol | '#' not_cr_lf* eol;
{code} long_comment = '/*' not_star* '*'+ (not_star_slash not_star* '*'+)* '/'; /* '4vim */
Ignored Tokens
short_comment,
long_comment;
This is a SableCC grammar. They're usually ignored tokens.
Remember that everything you write in source code is first turned into tokens; that's always the first step. The parser needs to start building the abstract syntax tree from tokens.