What is the formal term for the "#{}" token in Ruby syntax? - ruby

The Background
I recently posted an answer where I variously referred to #{} as a literal, an operator, and (in one draft) a "literal constructor." The squishiness of this definition didn't really affect the quality of the answer, since the question was more about what it does and how to find language references for it, but I'm unhappy with being unable to point to a canonical definition of exactly what to call this element of Ruby syntax.
The Ruby manual mentions this syntax element in the section on expression substitution, but doesn't really define the term for the syntax itself. Almost every reference to this language element says it's used for string interpolation, but doesn't define what it is.
Wikipedia Definitions
Here are some Wikipedia definitions that imply this construct is (strictly speaking) neither a literal nor an operator.
Literal (computer programming)
Operator (programming)
The Questions
Does anyone know what the proper term is for this language element? If so, can you please point me to a formal definition?

Ruby's parser calls #{} the "embexpr" operator. That's EMBedded EXPRession, naturally.
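You can watch the scanner emit that token yourself with the bundled Ripper library (a quick check; output abbreviated, and the exact array shape varies a little between Ruby versions):
require 'ripper'
# Dump each scanner event for a string containing an embedded expression.
Ripper.lex('"a #{1 + 1} b"').each { |_pos, type, token| puts "#{type} #{token.inspect}" }
# on_tstring_beg "\""
# on_tstring_content "a "
# on_embexpr_beg "\#{"
# ...              (the tokens for 1 + 1)
# on_embexpr_end "}"
# on_tstring_content " b"
# on_tstring_end "\""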

I would definitely call it neither a literal (that term is for, e.g., string literals or number literals themselves, not parts thereof) nor an operator; that term is reserved for things like binary infix or unary prefix operators.
I would either just refer to it without a noun (i.e. for string interpolation), or perhaps call those characters the string interpolation sequence or escape.

TL;DR
Originally, I'd hypothesized:
Embedded expression seems the most likely definition for this token, based on hints in the source code.
This turned out to be true, and has been officially validated by the Ruby 2.x documentation. Based on the updates to the Ripper documentation since this answer was originally written, it seems the parser token is formally defined as string_embexpr and the symbol itself is called an "embedded expression." See the Update for Ruby 2.x section at the bottom of this answer for detailed corroboration.
The remainder of the answer is still relevant, especially for older Rubies such as Ruby 1.9.3, and the methodology used to develop the original answer remains interesting. I am therefore updating the answer, but leaving the bulk of the original post as-is for historical purposes, even though the current answer could now be shorter.
Pre-2.x Answer Based on Ruby 1.9.3 Source Code
Related Answer
This answer calls attention to the Ruby source, which makes numerous references to embexpr throughout the code base. @Phlip suggests that this variable is an abbreviation for "EMBedded EXPRession." This seems like a reasonable interpretation, but neither the ruby-1.9.3-p194 source nor Google (as of this writing) explicitly references the term embedded expression in association with embexpr in any context, Ruby-related or not.
Additional Research
A scan of the Ruby 1.9.3-p194 source code with:
ack-grep -cil --type-add=YACC=.y embexpr .rvm/src/ruby-1.9.3-p194 |
sort -rnk2 -t: |
sed 's!^.*/!!'
reveals 9 files and 33 lines with the term embexpr:
test_scanner_events.rb:12
test_parser_events.rb:7
eventids2.c:5
eventids1.c:3
eventids2table.c:2
parse.y:1
parse.c:1
ripper.y:1
ripper.c:1
Of particular interest is the inclusion of string_embexpr on line 4,176 of the parse.y and ripper.y bison files. Likewise, TestRipper::ParserEvents#test_string_embexpr contains two references to parsing #{} on lines 899 and 902 of test_parser_events.rb.
The scanner, exercised in test_scanner_events.rb, is also noteworthy. This file defines tests in #test_embexpr_beg and #test_embexpr_end that scan for the token #{expr} inside various string expressions. The tests reference both embexpr and expr, raising the likelihood that "embedded expression" is indeed a sensible name for the thing.
Update for Ruby 2.x
Since this post was originally written, the documentation for the standard library's Ripper class has been updated to formally identify the token. The usage section provides "Hello, #{world}!" as an example, and says in part:
Within our :string_literal you’ll notice two #tstring_content, this is the literal part for Hello, and !. Between the two #tstring_content statements is a :string_embexpr, where embexpr is an embedded expression.
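You can reproduce that tree yourself with Ripper.sexp (output from a recent MRI, lightly abbreviated; node details such as :vcall and the position arrays may differ between versions):
require 'ripper'
require 'pp'
pp Ripper.sexp('"Hello, #{world}!"')
# [:program,
#  [[:string_literal,
#    [:string_content,
#     [:@tstring_content, "Hello, ", [1, 1]],
#     [:string_embexpr, [[:vcall, [:@ident, "world", [1, 10]]]]],
#     [:@tstring_content, "!", [1, 16]]]]]]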

This blog post suggests it is called an 'idiom':
http://kconrails.com/2010/12/08/ruby-string-interpolation/
The Wikipedia Article doesn't seem to contradict that:
http://en.wikipedia.org/wiki/Programming_idiom

#{} is called a placeholder and is used to reference variables within a string.
puts "My name is #{my_name}"

Related

Why do Julia programmers need to prefix macros with the at-sign?

Whenever I see a Julia macro in use like @assert or @time I'm always wondering about the need to distinguish a macro syntactically with the @ prefix. What should I be thinking of when using @ for a macro? For me it adds noise and distraction to an otherwise very nice language (syntactically speaking).
I mean, for me '@' has a meaning of reference, i.e. a location like a domain or address. In the location sense @ does not have a meaning for macros other than that it is a different compilation step.
The @ should be seen as a warning sign which indicates that the normal rules of the language might not apply. E.g., a function call
f(x)
will never modify the value of the variable x in the calling context, but a macro invocation
@mymacro x
(or @mymacro f(x) for that matter) very well might.
Another reason is that macros in Julia are not based on textual substitution as in C, but substitution in the abstract syntax tree (which is much more powerful and avoids the unexpected consequences that textual substitution macros are notorious for).
Macros have special syntax in Julia, and since they are expanded after parse time, the parser also needs an unambiguous way to recognise them
(without knowing which macros have been defined in the current scope).
ASCII characters are a precious resource in the design of most programming languages, Julia very much included. I would guess that the choice of @ mostly comes down to the fact that it was not needed for something more important, and that it stands out pretty well.
Symbols always need to be interpreted within the context they are used. Having multiple meanings for symbols, across contexts, is not new and will probably never go away. For example, no one should expect #include in a C program to go viral on Twitter.
Julia's Documentation entry Hold up: why macros? explains pretty well some of the things you might keep in mind while writing and/or using macros.
Here are a few snippets:
Macros are necessary because they execute when code is parsed,
therefore, macros allow the programmer to generate and include
fragments of customized code before the full program is run.
...
It is important to emphasize that macros receive their arguments as
expressions, literals, or symbols.
So, if a macro is called with an expression, it gets the whole expression, not just the result.
...
In place of the written syntax, the macro call is expanded at parse
time to its returned result.
It actually fits quite nicely with the semantics of the @ symbol on its own.
If we look up the Wikipedia entry for 'At symbol' we find that it is often used as a replacement for the preposition 'at' (yes it even reads 'at'). And the preposition 'at' is used to express a spatial or temporal relation.
Because of that we can use the @-symbol as an abbreviation for the preposition at to refer to a spatial relation, i.e. a location like @tony's bar, @france, etc., to some memory location @0x50FA2C (e.g. for pointers/addresses), to the receiver of a message (@user0851 which twitter and other forums use, etc.) but as well for a temporal relation, i.e. @05:00 am, @midnight, @compile_time or @parse_time.
And since macros are processed at parse time (there you have it), which is totally distinct from the rest of the code that is evaluated at run time (yes, there are many different phases in between, but that's not the point here), we use @ to explicitly direct the programmer's attention to the fact that the following code fragment is processed at parse time, as opposed to run time.
For me this explanation fits nicely in the language.
thanks@all ;)

How does the Ruby interpreter parse double-quoted strings

Background:
I am implementing a language similar to Ruby, called Sapphire, as a way to try out some ideas I have on concurrency in programming languages. I am trying to copy Ruby's double-quoted strings with embedded code, which I find very useful as a programmer.
Question:
How do any of the Ruby interpreters turn a double-quoted string with embedded code into an AST?
eg:
puts "The value of foo is #{#foo}."
puts "this is an example of unmatched braces in code: #{ foo.go('}') }"
Details:
The problem I have is how to decide which } closes the code block. Code blocks can have other braces within them and with a little effort they can be unmatched. The lexer can find the beginning of a code block in a string, but without the aid of the parser, it cannot know for sure which character is the end of that block.
It looks like Ruby's parse.y file does both the lexing and parsing steps, but reading that thing is a nightmare: it is 11,628 lines long with no comments and lots of abbreviations.
True, Yacc files can be a bit daunting to read at first and parse.y is not the best file to start with. Have you looked at the various string production rules? Do you have any specific questions?
As for the actual parsing, it's indeed not uncommon for lexers to also parse numeric literals and strings; see e.g. the accepted answer to a similar question here on SO. If you approach things this way, it's not too hard to see how to go about it. Hitting #{ inside a string basically starts a new parsing context that gets parsed as an expression again. This means that the first } in your example can't be the terminating one for the interpolation, since it's part of a literal string within the expression. Once you reach the end of the expression (keep in mind expression separators like ;), the next } is the one you need.
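You don't have to read parse.y to watch this decision being made; MRI's own scanner is exposed through the bundled Ripper library. Lexing the OP's tricky example shows the first } being consumed as part of the inner single-quoted literal, with only the last one ending the embedded expression (output abbreviated; the format varies slightly between Ruby versions):
require 'ripper'
src = %q["this is an example of unmatched braces in code: #{ foo.go('}') }"]
Ripper.lex(src).each { |_pos, type, token| puts "#{type} #{token.inspect}" }
# ...
# on_embexpr_beg "\#{"
# ...
# on_tstring_beg "'"
# on_tstring_content "}"     <- the brace inside foo.go('}')
# on_tstring_end "'"
# on_rparen ")"
# on_sp " "
# on_embexpr_end "}"         <- the brace that actually closes the embedded expression
# on_tstring_end "\""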
This is not a complete answer, but I leave it in hopes that it might be useful either to me or one who follows me.
Matz gives a pretty detailed rundown of the yylex() function of parse.y in chapter 11 of his book. It does not directly mention strings, but it does describe how the lexer uses lex_state to resolve several locally ambiguous constructs in Ruby.
A reproduction of an English translation of this chapter can be found here.
Please bear in mind that they don't have to (create an AST at compile time).
Ruby strings can be assembled at runtime and will interpolate correctly. Therefore all the parsing and evaluation machinery has to be available at runtime. Any work done at compile time in that sense could be considered an optimisation.
So why does this matter? Because there are very effective stack-based techniques for parsing and evaluating expressions that do not create or decorate an AST. The string is read (parsed) from left to right, and as embedded tokens are encountered they are either evaluated or pushed on a stack, or cause stack contents to be popped and evaluated.
This is a simple technique to implement provided the expressions are relatively simple. If you really want the full power of the language inside every string, then you need the full compiler at runtime. Not everyone does.
Disclosure: I wrote a commercial language product that does exactly this.
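As a concrete illustration of that left-to-right approach, here is a deliberately simplistic sketch (my own, not the product mentioned above): it scans for #{...} at runtime with a brace-depth counter and hands each embedded expression to eval. Note that it would be fooled by a brace inside a string literal within the expression, which is exactly the case the OP asks about, so a real implementation needs the fuller lexing described in the other answers.
# Simplistic runtime interpolation: scan left to right, no AST.
def interpolate(template, b)
  out = +""
  i = 0
  while i < template.length
    unless template[i, 2] == '#{'
      out << template[i]
      i += 1
      next
    end
    # Found an embedded expression: locate its closing brace by depth counting.
    depth = 1
    j = i + 2
    while depth > 0
      depth += 1 if template[j] == "{"
      depth -= 1 if template[j] == "}"
      j += 1
    end
    out << eval(template[(i + 2)...(j - 1)], b).to_s
    i = j
  end
  out
end

name = "world"
puts interpolate('2 + 2 = #{2 + 2}, hello #{name.upcase}!', binding)
# prints: 2 + 2 = 4, hello WORLD!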
Dart also supports expressions interpolated into strings like Ruby, and I've skimmed a few parsers for it. I believe what they do is define separate tokens for a string literal preceding interpolation and a string literal at the end. So if you tokenize:
"before ${the + expression} after"
You would get tokens like:
STRING_START "before "
IDENTIFIER the
PLUS
IDENTIFIER expression
STRING " after"
Then in your parser, it's a pretty straightforward process of handling STRING_START to parse the interpolated expression(s) following it.
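For comparison, here is a hypothetical Ruby sketch of that token-splitting idea (nothing to do with Dart's or Ruby's real lexers), using StringScanner and Ruby's #{ marker; it lumps each embedded expression into a single token rather than lexing it further, purely to keep the sketch short:
require 'strscan'
# Split an interpolated-string body into Dart-style STRING_START / STRING
# tokens around each embedded expression. Assumes the expression itself
# contains no braces or nested strings.
def tokenize_interpolated(body)
  scanner = StringScanner.new(body)
  tokens  = []
  literal = +""
  until scanner.eos?
    if scanner.scan(/\#\{/)
      tokens << [:STRING_START, literal]
      literal = +""
      tokens << [:EXPR, scanner.scan_until(/\}/).chomp("}")]
    else
      literal << scanner.getch
    end
  end
  tokens << [:STRING, literal]
end

p tokenize_interpolated('before #{the + expression} after')
# [[:STRING_START, "before "], [:EXPR, "the + expression"], [:STRING, " after"]]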
Our Ruby parser (see my bio) treats Ruby "strings" as complex objects having lots of substructures, including string start and end tokens, bare string literal fragments, lots of funny punctuation sequences representing the various regexp operators, and of course, recursively, most of Ruby itself for expressions nested inside such strings.
This is accomplished by allowing the lexer to detect and generate such string fragments in (for Ruby, many) special lexing modes. The parser has a (sub)grammar that defines valid sequences of tokens. And that kind of parsing solves the OP's original problem; the parser knows whether a curly brace matches other curly braces from the regexp content, and/or if the regexp has been completely assembled and the curly brace is a matching block end.
Yes, it builds an AST of the Ruby code, and of the regexps.
The purpose of all this is to allow us to build analyzers and transformers of Ruby code. See https://softwarerecs.stackexchange.com/q/11779/101

Some Macro terms in Racket

I am confused by the terms for a long time, thinking it is good to ask out what exactly do they mean:
A. syntax. B. syntax value. C. syntax object. D.s-expression E.datum (in syntax->datum)
What's the difference between s-expression and symbol?
What's the difference between s-expression and datum?
What's the difference between (syntax, syntax values and syntax object) from s-expression?
Code examples for explanation will be appreciated.
"Syntax" is a type for representing source code in Racket, which is a wrapper around S-expression (see a recent blog post for details). "Syntax value" and "syntax object" are all synonyms of this, and ni the ancient days of the mzscheme language functions that deal with syntax used syntax-value in the name. These days we use just "syntax" more often, and for a plural form we use "syntaxes".
An "S-expression" is either a primitive piece of data that can be typed in code (symbols, numbers, strings, booleans, etc -- in Racket you could also include other types), or a list of these things. An S-expression is therefore any nested structure of lists made of these primitive types at the fringe. Sometimes this includes vectors too (since they can be typed in using the #(...) syntax) but more usually they're left out.
Finally, "datum" is another name for an S-expression, sometimes when you want to refer to the fact that it's a piece of data that has an input representation. You can see how R5RS introduces it: <Datum> may be any external representation of a Scheme object [...]. This notation is used to include literal constants in Scheme code.
As for your questions:
What's the difference between s-expression and symbol?
A symbol is an S-expression; an S-expression may contain symbols.
What's the difference between s-expression and datum?
Nothing really. (Although there may be some subtle differences in intent.)
What's the difference between (syntax, syntax values and syntax object) from s-expression?
They are the representation of program syntax used by macros in Racket -- they contain the S-expressions, but they add source location information, lexical context, syntax properties, and certificates. See that blog post for a quick introduction.

What are the pros and cons of Ruby's general delimited input? (percent syntax)

I don't understand why some people use the percentage syntax a lot in ruby.
For instance, I'm reading through the ruby plugin guide and it uses code such as:
%w{ models controllers }.each do |dir|
  path = File.join(File.dirname(__FILE__), 'app', dir)
  $LOAD_PATH << path
  ActiveSupport::Dependencies.load_paths << path
  ActiveSupport::Dependencies.load_once_paths.delete(path)
end
Every time I see something like this, I have to go and look up the percentage syntax reference because I don't remember what %w means.
Is that syntax really preferable to ["models", "controllers"].each ...?
I think in this latter case it's more clear that I've defined an array of strings, but in the former - especially to someone learning ruby - it doesn't seem as clear, at least for me.
If someone can tell me that I'm missing some key point here then please do, as I'm having a hard time understanding why the percent syntax appears to be preferred by the vast majority of ruby programmers.
One good use for general delimited input (as %w, %r, etc. are called) is to avoid having to escape delimiters. This makes it especially good for literals with embedded delimiters. Contrast the regular expression
/^\/home\/[^\/]+\/.myprogram\/config$/
with
%r|^/home/[^/]+/.myprogram/config$|
or the string
"I thought John's dog was called \"Spot,\" not \"Fido.\""
with
%Q{I thought John's dog was called "Spot," not "Fido."}
As you read more Ruby, the meaning of general delimited input (%w, %r, &c.), as well as Ruby's other peculiarities and idioms, will become plain.
I believe it is no accident that Ruby often has several ways to do the same thing. Ruby, like Perl, appears to be a postmodern language: minimalism is not a core value, but merely one of many competing design forces.
The %w syntax shaves 3 characters off each item in the list... can't beat that!
It's easy to remember: %w{} is for "words", %r{} for regexps, %q{} for "quotes", and so on... It's pretty easy once you build such memory aids.
As the size of the array grows, the %w syntax saves more and more keystrokes by not making you type in all the quotes and commas. At least that's the reason given in Learning Ruby.
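Since remembering what each form expands to is the hard part, a small cheat sheet may help (these equivalences hold in current MRI; the %i form needs Ruby 2.0 or later):
%w{models controllers} == ["models", "controllers"]  # => true
%i{models controllers} == [:models, :controllers]    # => true (array of symbols)
%q{it's here}          == "it's here"                 # => true (single-quote rules)
%Q{two = #{1 + 1}}     == "two = 2"                   # => true (double-quote rules, interpolation)
%r{^/home/} =~ "/home/alice"                          # => 0   (a regexp, no escaping of /)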

What are some example use cases for symbol literals in Scala?

The use of symbol literals is not immediately clear from what I've read up on Scala. Would anyone care to share some real world uses?
Is there a particular Java idiom being covered by symbol literals? What languages have similar constructs? I'm coming from a Python background and not sure there's anything analogous in that language.
What would motivate me to use 'HelloWorld vs "HelloWorld"?
Thanks
In Java terms, symbols are interned strings. This means, for example, that reference equality comparison (eq in Scala and == in Java) gives the same result as normal equality comparison (== in Scala and equals in Java): 'abcd eq 'abcd will return true, while "abcd" eq "abcd" might not, depending on the JVM's whims (well, it should for literals, but not for strings created dynamically in general).
Other languages which use symbols are Lisp (which uses 'abcd like Scala), Ruby (:abcd), Erlang and Prolog (abcd; they are called atoms instead of symbols).
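The same distinction shows up in Ruby (mentioned above): symbols are guaranteed to be interned, so reference equality and value equality coincide, while two string literals are normally distinct objects (unless frozen-string-literal optimisations kick in).
:abcd.equal?(:abcd)    # => true   (reference equality: one interned object)
"abcd".equal?("abcd")  # => false  (two separate String objects)
"abcd" == "abcd"       # => true   (value equality still holds)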
I would use a symbol when I don't care about the structure of a string and use it purely as a name for something. For example, if I have a database table representing CDs, which includes a column named "price", I don't care that the second character in "price" is "r", or about concatenating column names; so a database library in Scala could reasonably use symbols for table and column names.
If you have plain strings representing, say, method names in code that perhaps get passed around, you're not quite conveying things appropriately. This is sort of the Data/Code boundary issue; it's not always easy to draw the line, but if we were to say that in that example those method names are more code than they are data, then we want something to clearly identify that.
A symbol literal comes into play where it clearly differentiates plain old string data from a construct being used in the code. It's really there for when you want to indicate that this isn't just some string data, but in fact in some way part of the code. The idea is that things like your IDE would highlight it differently, and, given the tooling, you could refactor on those rather than doing text search/replace.
This link discusses it fairly well.
Note: Symbols will be deprecated and then removed in Scala 3 (dotty).
Reference: http://dotty.epfl.ch/docs/reference/dropped-features/symlits.html
Because of this, I personally recommend not using Symbols anymore (at least in new scala code). As the dotty documentation states:
Symbol literals are no longer supported
it is recommended to use a plain string literal [...] instead
Python maintains an internal global table of "interned strings" with the names of all variables, functions, modules, etc. With this table, the interpreter can make faster searches and optimizations. You can force this process with the intern function (sys.intern in Python 3).
Also, Java and Scala automatically use "interned strings" for faster searches. With Scala, you can use the intern method to force the interning of a string, but this process doesn't work with all strings. Symbols benefit from being guaranteed to be interned, so a single reference equality check is sufficient to prove equality or inequality.
